The Linear Algebra Behind Search Engines

Author(s):

Amy Langville

The concept of angles in 3-D can be extended to general n-dimensional (n-D) spaces. In n-D spaces, each vector has n components. Again, we want to summarize the information in an n-D vector into one number, so that we can quickly compare the information contained in any set of vectors. Even though n-D vectors are impossible to visualize for n > 3, analogous concepts of length and angle exist in n-D space just as they do in 3-D space. This mathematical extension enables us to talk about a real search engine, such as Inktomi.

There are about 300,000 words in the English language (Berry, Drmac & Jessup, 1999). Thus, the query and document vectors are potentially 300,000-D vectors. For example, a search engine query on beach volleyball would create the 300,000-D query vector

where the first "1" corresponds to the English word beach and the second "1" corresponds to the word volleyball.

Each component in the 300,000-D vector corresponds to a word in the English language. Thus, the first position in the vector corresponds to the first word in the English dictionary, aardvark, and the last position corresponds to zymurgy, the last word in the dictionary. If the fourth position contains a 1, then a search should be run for the fourth word in the English language. Similarly, a 0 in the fourth position means the fourth word is irrelevant to the search. In general, if the i^th position in the vector contains a 1, then the i^th word in the English language should be searched for. On the other hand, if the i^th position contains a 0, the i^th word in the English language is irrelevant to the search. This is the procedure for translating a group of query words into an equivalent mathematical representation using the query vector.

The same is done for the documents on the Web, translating them into mathematical representations. Each document on the Web has a list of keywords associated with it. These keywords are used to create a document vector for that document. For example, a document entitled "Will Shaq's Free Throws Stop the Lakers" might designate basketball, playoffs, and NBA as its keywords. This document's vector contains three 1's in the positions corresponding to its three keywords and 0's elsewhere.

Search engines often apply weighting schemes to keywords. The scheme just described is unweighted, since each keyword gets the value 1 in the document vector. However, in a weighted scheme, the NBA playoff document might give more weight to playoff than basketball, and thus the entry corresponding to playoff might be 5 and the entry for basketball might be 2.

You have already encountered a keyword weighting scheme in the Food! collection example, where the weights were percentages of the article devoted to a particular keyword. In practice, numerous clever weighting and frequency schemes are implemented in an effort to make the search more accurate (Berry & O'Brien, 1998, Dumais, 1991, Jones, 1972). In addition, various other 1-D measures besides the cosine angle measure for capturing the distance between two vectors have been proposed (Jones & Furnas, 1987).

This n-D VSM must proceed through the same steps as the 3-D VSM example (page 5) in order to process a query. First, the angle measure between q and each document vector d_i in the collection must be computed. With three billion documents of the Web, this means three billion angle measures must be calculated for each query. Then those angle measures exceeding the cutoff value are ranked, and this ranked list is presented to the user. And all this must be done in a fraction of a second, which leads us to the next page.

Amy Langville, "The Linear Algebra Behind Search Engines - Extending the VSM to [i]n[/i] Dimensions," Convergence (December 2005)

JOMA