The Linear Algebra Behind Search Engines

Author(s):

Amy Langville

The Information Age has flooded readers with information. However, these gains in the amount of information available are useless without parallel gains in techniques for effectively storing and searching through such information. Consider the following facts:

In 2001, there were about 250,000 periodicals published in print worldwide, with roughly 12,000 added each year (Ulrich's Directory, 2001).
In 2000, there were nearly 3.5 million books in print (Books in Print, 2001).
The number of books in top libraries doubles every 14 years (Wurman, 1989).
In 2000, there were about 2.5 billion web pages on the Internet, with a growth rate of 7.3 million per day (Lyman & Varian, 2000).

Amy Langville is an Assistant Professor of Mathematics at the College of Charleston. She developed this module as a postdoctoral researcher at North Carolina State University.

This sample deluge of facts shows how easy it is to feel overwhelmed by the amount of available information, an amount that is growing at an exponential rate.

The pressing need for organizing and sorting through this mushrooming amount of information presents an enormous challenge, one whose solution is of paramount importance. The entire field of information retrieval is devoted to meeting this challenge. Traditional information retrieval studies the science of search in traditional, static, well-controlled document collections. Web information retrieval deals with the world's largest document collection, the World Wide Web, a nontraditional, dynamic, uncontrolled collection.

In this module, you will learn about the field of information retrieval and the basic components of search engines. We will focus on one particular search engine model, the Vector Space Model, which incorporates basic concepts from linear algebra. We will also discuss a more advanced vector space model, Latent Semantic Indexing (LSI), which uses even more linear algebra to improve search results.

There are many search engine techniques. As of June 2000, there were at least 3,500 different search engines on the Web (Bradley, 2002). Due to the proprietary nature of the field of information retrieval, it is difficult to say which techniques are most prevalent in industrial search engines such as Yahoo, Excite, or Overture. Search engine creators loathe sharing their innovative ideas with other vendors, as this could jeopardize millions of dollars in revenue. However, without a doubt, some search engines are more successful than others, an example being Google.