The Linear Algebra Behind Search Engines

Author(s): 
Amy Langville

 

Published December, 2005
© 2005, Amy N. Langville

 

The Linear Algebra Behind Search Engines - Introduction

Author(s): 
Amy Langville

 

Instructor's Note

The Information Age has flooded readers with information. However, these gains in the amount of information available are useless without parallel gains in techniques for effectively storing and searching through such information. Consider the following facts:

  • In 2001, there were about 250,000 periodicals published in print worldwide, with roughly 12,000 added each year (Ulrich's Directory, 2001).

  • In 2000, there were nearly 3.5 million books in print (Books in Print, 2001).

  • The number of books in top libraries doubles every 14 years (Wurman, 1989).

  • In 2000, there were about 2.5 billion web pages on the Internet, with a growth rate of 7.3 million per day (Lyman & Varian, 2000).

This small sample of facts illustrates how easy it is to feel overwhelmed by the amount of available information, an amount that is growing at an exponential rate.

The pressing need for organizing and sorting through this mushrooming amount of information presents an enormous challenge, one whose solution is of paramount importance. The entire field of information retrieval is devoted to meeting this challenge. Traditional information retrieval studies the science of search in traditional, static, well-controlled document collections. Web information retrieval deals with the world's largest document collection, the World Wide Web, a nontraditional, dynamic, uncontrolled collection.

In this module, you will learn about the field of information retrieval and the basic components of search engines. We will focus on one particular search engine model, the Vector Space Model, which incorporates basic concepts from linear algebra. We will also discuss a more advanced vector space model, Latent Semantic Indexing (LSI), which uses even more linear algebra to improve search results.

There are many search engine techniques. As of June 2000, there were at least 3,500 different search engines on the Web (Bradley, 2002). Due to the proprietary nature of the field of information retrieval, it is difficult to say which techniques are most prevalent in industrial search engines such as Yahoo!, Excite, or Overture. Search engine creators loathe sharing their innovative ideas with other vendors, as this could jeopardize millions of dollars in revenue. However, without a doubt, some search engines are more successful than others, an example being Google.

Amy Langville is an Assistant Professor of Mathematics at the College of Charleston. She developed this module as a postdoctoral researcher at North Carolina State University.

 

The Linear Algebra Behind Search Engines - Summary of Search Engine Models

Author(s): 
Amy Langville

I outline here just a few of the most basic information retrieval techniques, in order to provide a context for the techniques we will study in more detail. Specifics of some of these techniques can easily become very complicated and are, in general, hard to come by, since many vendors refuse to share in this competitive environment.

Boolean Search Engines

The Boolean model of information retrieval, one of the earliest and simplest retrieval methods, uses exact matching to match documents to a user "query," or information request: a document is judged relevant only if it contains the words of the query. More refined descendants of this model are still used by most libraries.

The adjective "Boolean" refers to the use of Boolean algebra, whereby words are logically combined with the Boolean operators AND, OR, and NOT. For example, the Boolean AND of two logical statements x and y means that both x AND y must be satisfied, while the Boolean OR of these same two statements means that at least one of these statements must be satisfied. Any number of logical statements can be combined using the three Boolean operators.

The Boolean information retrieval model considers which keywords are present or absent in a document or title. Thus, a document is judged as relevant or irrelevant -- there is no concept of a "partial match" between documents and queries. The inability to identify partial matches can lead to poor performance (Baeza-Yates & Ribeiro-Neto, 1999). Other more advanced set theoretic techniques, such as the so-called "fuzzy sets," try to remedy this black-and-white Boolean logic by introducing shades of gray. For example, a title search for car AND maintenance on a Boolean engine returns all documents that have both words in the title. As a result, an apparently relevant document entitled "Automobile Maintenance" will not be returned. Fuzzy Boolean engines use fuzzy logic to categorize this document as somewhat relevant and return it to the user.
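To make the exact-match behavior concrete, here is a minimal sketch (in Python; not part of the original module) of a Boolean AND title search over a toy collection. The titles and keyword sets are hypothetical.

    # Minimal sketch of Boolean (exact-match) title search over a toy collection.
    docs = {
        "Car and Truck Maintenance": {"car", "truck", "maintenance"},
        "Automobile Maintenance":    {"automobile", "maintenance"},
        "Bread Baking Basics":       {"bread", "baking"},
    }

    def boolean_and(query_terms, collection):
        """Return the titles containing every query term -- exact match, no partial credit."""
        q = set(query_terms)
        return [title for title, words in collection.items() if q <= words]

    print(boolean_and(["car", "maintenance"], docs))
    # ['Car and Truck Maintenance'] -- "Automobile Maintenance" is missed,
    # illustrating the synonymy problem discussed below.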

The car maintenance query example illustrates the main drawbacks of Boolean search engines -- they fall prey to two of the most common information retrieval problems, synonymy and polysemy.

  • Synonymy occurs when multiple words have the same meaning, such as "car" and "automobile". A standard Boolean engine cannot return semantically related documents whose keywords were not included in the original query.
  • Polysemy occurs when a word has multiple meanings. For example, when a user types bank as a query, does this mean a financial center, a slope on a hill, a shot in pool, or a collection of objects (Berry, Drmac & Jessup, 1999)? The problem of polysemy can cause retrieval of many documents that are irrelevant to the user's intended meaning.

Many Boolean search engines also require the user to be familiar with Boolean operators and the specialized syntax of the engine. For example, to find information about the phrase iron curtain, many engines require quotation marks around the phrase, which tell the search engine that the entire phrase should be searched as if it were just one keyword. A user who forgets this syntax requirement would be surprised to find retrieved documents about interior decorating and mining for iron ore.

Nevertheless, variants of the Boolean model do form the basis for many search engines. There are several reasons for their prevalence:

  1. Creating and programming a Boolean engine is straightforward.
  2. Queries can be processed quickly -- a quick scan through the keyword files for the documents can be executed in parallel.
  3. Boolean models scale well to very large document collections. Accommodating a growing collection is easy -- the programming remains simple, and only the storage and parallel processing capabilities need to grow.

Baeza-Yates & Ribeiro-Neto (1999), Frakes & Baeza-Yates (1992), and Korfhage (1997) all contain chapters with excellent introductions to the Boolean model and its extensions.

Vector Space Model Search Engines

Another information retrieval technique uses the vector space model (Salton, 1971), developed by Gerard Salton in the early 1960's, to sidestep some of the information retrieval problems of the Boolean model. Vector space models transform textual data into numeric vectors and matrices, then employ matrix analysis techniques to discern key features and connections in the document collection.

Some advanced vector space models address the common text analysis problems of synonymy and polysemy. Models such as Latent Semantic Indexing (LSI) (Dumais, 1991) can access the hidden semantic structure in a document collection. For example, an LSI engine processing the query car will return documents whose keywords are related semantically (in meaning), e.g., automobile. This ability to reveal hidden semantic meanings makes vector space models such as LSI very powerful information retrieval tools.

Two additional advantages of the vector space model are relevance scoring and relevance feedback. The model allows documents to match a query partially by assigning each document a number between 0 and 1, which can be interpreted as the likelihood of relevance to the query. The group of retrieved documents can then be sorted by degree of relevancy, a luxury not possible with the Boolean model. Thus, vector space models return documents in an ordered list, with the first document judged to be most relevant to the user's query. Some vector space search engines report the relevance score as a relevancy percentage. For example, a 97% next to a document means that document is judged as 97% relevant to the user's query.

Example. The Federal Communications Commission's search engine, which is powered by Inktomi, was known at one time to use a vector space model. Enter a query such as taxes and notice the relevancy score reported on the right side.

Relevance feedback is an information retrieval tuning technique that is a natural addition to the vector space model. It allows the user to select a subset of the retrieved documents that are useful and then resubmit with this additional relevance feedback information to obtain a revised set of generally more useful retrieved documents.

The drawbacks of vector space models are their computational expense and poor scalability. At query time, distance measures (also known as similarity measures) must be computed between each document and the query, and advanced models such as LSI require an expensive singular value decomposition of a large matrix that numerically represents the entire document collection. As the collection grows, the expense of this matrix decomposition becomes prohibitive.

The informative little book by Berry & Browne (1999) provides an excellent explanation of vector space models, especially LSI, and contains several examples and sample code. I provide a much more condensed explanation on the Example Search Engine Method page.

Probabilistic Model Search Engines

Probabilistic models attempt to estimate the probability that the user will find a particular document relevant. Retrieved documents are ranked by their odds of relevance -- the ratio of the probability that the document is relevant to the probability that the document is not relevant to the query. This model operates recursively and requires that the underlying algorithm guess at initial parameters, then iteratively try to improve this initial guess to obtain a final ranking of relevancy probabilities. Improvements to the basic probabilistic model of information retrieval are made with Bayesian inference techniques (Baeza-Yates & Ribeiro-Neto, 1999).
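As a toy illustration of ranking by odds of relevance (a sketch only; the probabilities below are hypothetical placeholders, not the output of any particular probabilistic model):

    # Rank documents by odds of relevance = P(relevant) / P(not relevant).
    # The estimated probabilities here are hypothetical.
    estimated_p_relevant = {"doc1": 0.80, "doc2": 0.35, "doc3": 0.60}

    def odds(p):
        return p / (1.0 - p)

    ranking = sorted(estimated_p_relevant,
                     key=lambda d: odds(estimated_p_relevant[d]),
                     reverse=True)
    print(ranking)   # ['doc1', 'doc3', 'doc2']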

Unfortunately, probabilistic models can be very hard to build and program. Their complexity grows quickly, deterring many researchers and limiting the scalability of the engine based on the model. Probabilistic models also require several unrealistic simplifying assumptions, such as independence between terms as well as between documents. For instance, in this document the most likely word to follow information is the word retrieval, so it is not reasonable to assume that these terms are independent.

On the other hand, the probabilistic framework can naturally accommodate a priori preferences, and thus these models do offer promise of tailoring search results to the preferences of individual users. For example, a user's query history can be incorporated into the probabilistic model's initial guess, which generates better query results than a democratic guess.

Meta-Search Engines

Meta-search engines are based on the principle that while one search engine is good, two (or more) are better. One search engine may be great at a certain task, while a second search engine is better at another task. Thus, meta-search engines such as Copernic and SurfWax were created to simultaneously exploit the best features of many individual search engines. Meta-search engines send the user's query to several search engines at once and return the results from all of the engines in one long unified list. Some meta-search engines also include subject-specific search engines, which can be helpful when searching within one particular discipline. For example, Monster is an employment search engine.

The Linear Algebra Behind Search Engines - Comparing Search Engines

Author(s): 
Amy Langville
At annual information retrieval conferences, such as TREC (Harman & Voorhees, 1996) and CIR (Berry, 2001) for traditional information retrieval and WWW for web information retrieval, researchers present results that are used to compare the various information retrieval models underlying search engines and help the field progress toward better, more efficient search engines. The two most common ratings used to differentiate the various search techniques are precision and recall. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the collection. The higher the precision and recall, the better the search engine.

Of course, search engines are tested on document collections with known parameters. For example, the commonly used test collection Medlars (2002), containing 5,831 keywords and 1,033 documents, has been examined so often that its properties are well known. For instance, there are exactly 24 documents relevant to the phrase neoplasm immunology. Thus, the denominator of the recall ratio for a user query on neoplasm immunology is 24. If a search engine retrieved only 10 of these relevant documents for this query, then a recall of 10/24 ≈ 0.417 is reported.
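Both ratios are simple to compute once the retrieved set and the set of truly relevant documents are known. Here is a small sketch using the neoplasm immunology numbers above; the document identifiers are invented for illustration.

    # Precision and recall for a single query.
    def precision_recall(retrieved, relevant):
        hits = len(retrieved & relevant)
        return hits / len(retrieved), hits / len(relevant)

    relevant  = {f"med{i}" for i in range(24)}   # hypothetical IDs for the 24 relevant documents
    retrieved = {f"med{i}" for i in range(10)}   # suppose the engine returns 10 of them
    print(precision_recall(retrieved, relevant)) # (1.0, 0.41666...)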

Recall and precision are information retrieval-specific performance measures, but when evaluating any computer system, time and space are always performance issues. All else held constant, quick, memory-efficient search engines are preferred to slower, memory-inefficient engines. A search engine with fabulous recall and precision is useless if it requires 30 minutes to perform one query or stores the data on 75 supercomputers. Some other performance measures take a user-centered viewpoint and are aimed at assessing user satisfaction and frustration with the information system. Korfhage (1997) discusses these and several other measures for comparing search engines.

The Linear Algebra Behind Search Engines - The Challenges of Web Information Retrieval

Author(s): 
Amy Langville
New information retrieval methods, in addition to those described above, are often used to search the Web. The Web's hyperlink structure, a structure not present in a standard information retrieval repository of documents, provides additional information that can be exploited in the retrieval process. Google was one of the first commercial search engines to recognize the importance of the Web's hyperlink structure. The underlying model for the successful Google engine is a Markov chain. (See Langville & Meyer, 2005 for an introduction to web retrieval methods.)

There are other information retrieval challenges in addition to those briefly mentioned already, such as polysemy, synonymy, query language, and speed. The Web presents its own unique challenges -- it can be viewed as one huge database with these unique properties:

  • large amounts of volatile data (rapid updates, broken links, file disappearances),
  • an exponentially growing number of pages,
  • heterogeneous data (multiple formats, languages, alphabets),
  • lack of structure,
  • redundant data,
  • the lack of an editorial review process, which leads to numerous information errors, falsehoods, and invalid statements.

Furthermore, some advertisers try to increase traffic to their pages by taking elaborate measures to fool automated search engines. For example, an advertiser might label sports, beer, and swimsuits as keywords in its subject tag, thinking these topics are likely to arouse Web surfers' interests, while the true content of the page is classical clocks, a much less alluring topic. In fact, there are even companies whose sole purpose and means of profit is the manipulation of search engines. There are also search engines whose owners sell companies the right to be listed at the top of the retrieved documents list for particular queries (Marchiori, 1997).

There is an additional Web information retrieval challenge related to precision. Although the amount of accessible information continues to grow, user ability to look at documents has not. Users rarely look beyond the first 10 or 20 documents retrieved, and this user impatience means that search engine precision must increase just as rapidly as the number of documents is increasing.

Broken links also pose a special problem. Often the most appealing retrieved document turns out to be a broken link, which quickly causes user frustration. Most search engines use a "web crawling" technique to gather and periodically refresh the information they store about individual Web pages and documents, thus, in effect, removing broken links from their databases of retrievable documents.

Quiz: Test your reading comprehension

Quiz: Test your Web search engine skills

 

The Linear Algebra Behind Search Engines - Focus on the Vector Space Model

Author(s): 
Amy Langville

Of the basic models of information retrieval, we focus in this project on the Vector Space Model (VSM) because it has the strongest connection to linear algebra. This model and its more advanced version, Latent Semantic Indexing (LSI), are beautiful examples of linear algebra in practice.

Let's begin with a trivial example. Consider the database of articles for a hypothetical periodical called Food! magazine. The articles are distinguishable by only three keywords: desserts, breads, vegetables. One article (document 1) talks only about desserts. Another article (document 2) discusses only breads. A third article (document 3) is 30% about vegetables and 70% about breads. All divisions of article space among the three keywords are possible. Vectors and matrices are used to describe how a vector space search engine processes a query in the Food! database.

To store data about the division of topics discussed in Food! document number 1, we use the column vector d1 = (1, 0, 0)^T, where the first entry of the vector represents the proportion of the article devoted to desserts, the second corresponds to breads, and the third to vegetables. Similarly, document number 3 has the associated vector d3 = (0, 0.7, 0.3)^T. All of the document vector information can be stored together in a matrix. Suppose there are only 7 documents, labeled d1, d2, d3, d4, d5, d6, d7, in the Food! database and only the three keywords (desserts, breads, vegetables) to distinguish article topics. Then the data for the Food! documents are captured in a matrix A whose jth column is the vector for document j.

Note: There are numerous matrix element weighting schemes for VSM methods. Here we are using the proportion of the document dealing with each corresponding term, but various other weights, such as the frequency of a term's usage in a document, could be used (Jones & Furnas, 1987).

A is a 3-by-7 matrix. If there were, say, 20 keywords and 1000 documents, then A would be a 20-by-1000 matrix. A is called a term-by-document matrix because the rows correspond to terms (keywords) in the database and the columns correspond to documents. Since there are only three terms in our Food! example, we can plot each column (document) vector in the matrix A in three-dimensional space. A static plot of the seven document vectors is shown in Figure 1 -- to see an animated version, click on the figure. When you are ready to return here, close the animation window.


Figure 1. 3-D plot of 7 document vectors


Let's analyze this plot to see if it gives us any insight into the information retrieval problem of searching the Food! database to find the document closest to a given query. Suppose a user of this Food! search engine wants to find all articles related to a query of vegetables. Then our query vector is q = (0, 0, 1)^T. Now the problem is mathematical: find which of the documents represented by the vectors d1, d2, d3, d4, d5, d6, d7 are "closest" to this query vector q. You might try to inspect the 3-D plot of the document vectors and the query vector visually (Figure 2) to see which document vectors are "closest." This visual inspection can be difficult, since the display of a 3-D plot must be done on a 2-D screen -- but, as with Figure 1, you can click on the figure to see an animated version. Can you guess which document vectors are closest to the query vector? Why did you make that choice?


Figure 2. 3-D plot of 7 document vectors (red) and query vector (green)


The information contained in a multi-dimensional vector can be summarized in two one-dimensional measures, length and angle with respect to a fixed direction. The length of a vector is the distance from the tail to the head of the vector. The angle between two vectors is the measure (in degrees or radians) of the angle between those two vectors in the plane that they determine. Keeping the length and angle (relative to the query vector) of the Food! vectors in mind, try to answer this question:

Visually, what is it about document 4's vector that makes it most similar to the query vector?

Clearly, it is not the length of the vector -- document 1's vector has the same length as the query vector, but it does not appear to be physically "closest" to the query vector. Rather, the angle of the vector with the query vector tells us that document 4's vector is closest to the query vector. For each document in the Food! database, we can use one number, the angle between the document vector and the query vector, to capture the physical "distance" of that document from the query. The document vector whose direction is closest to the query vector's direction (i.e., for which the angle is smallest) is the best choice, yielding the document most closely related to the query.

We can compute the cosine of the angle θ between the nonzero vectors x and y by

cos θ = x^T y / (||x||2 ||y||2),

where ||x||2 and ||y||2 represent the Euclidean norms (lengths) of the vectors x and y, respectively. Thus, the cosine of the angle between document vector d3 = (0, 0.7, 0.3)^T and query vector q = (0, 0, 1)^T is

cos θ3 = 0.3 / (0.7616 × 1) ≈ 0.394.
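As a quick check of this formula, the cosine for d3 and query 1 can be computed in a few lines of Python (a sketch; the module itself uses Matlab for its exercises):

    import numpy as np

    def cosine(x, y):
        # cos(theta) = x . y / (||x||_2 ||y||_2) for nonzero x and y
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

    d3 = np.array([0.0, 0.7, 0.3])   # (desserts, breads, vegetables) for document 3
    q1 = np.array([0.0, 0.0, 1.0])   # query 1: vegetables
    print(round(cosine(d3, q1), 3))  # 0.394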

We record cosines of the angles for each Food! document vector with respect to query 1 in Table 1.

Table 1. Cosines of angles between Food! document vectors and query vector 1

If the cosine of an angle is 1, then the angle between the document vector and the query vector measures 0°, meaning the document vector and the query vector point in the same direction. A cosine measure of 0 means the document is unrelated to the query, i.e., the vectors are perpendicular. Thus, a cosine measure close to 1 means that the document is closely related to the query.

Note: Since all entries in the vectors are nonnegative, the dot products are nonnegative, which means there is no need to take the absolute value of the cosine measure.

We now consider a second query, this time desiring articles with an equal mix of information about desserts and breads, so the new query vector is q = (0.5, 0.5, 0)^T. The cosines of the angles between the document vectors and this query vector are shown in Table 2.

Table 2: Cosines of angles between Food! document vectors and query vector 2

We see that document 6 is now most relevant to the query -- as expected, since d6 is identical to query vector 2 in the divisions of topics discussed. Document 5 is the second most relevant document to query 2, followed by document 7.

In practice, a search engine system must designate a cutoff value. Only those documents whose cosine measures with the query vector are above this cutoff value are reported in the list of retrieved documents viewed by the user. In Figure 3 we show (in blue) the cone of all vectors that make a 45° angle with the query vector. The cosine of this angle is 0.7071, and thus, if we want to return only documents whose vectors lie inside (or on) the cone, the cutoff value is 0.7071. (As before, click on the figure to see an animated version.)


  

Figure 3. Two views of cone around query vector created by cutoff value


If, for example, the cutoff value for query 2 is 0.9, then only d5 and d6 are returned to the user as relevant (see Table 2 again). If, however, a cutoff value of 0.7 is selected, all but d3 will be returned as relevant. Thus, choosing an appropriate cutoff value is not a trivial task and clearly affects both the precision and recall of the search engine.

A quick look at the plot in Figure 4 (or its animated version) of these same document vectors with the query 2 vector shows that d6, d5, and d7 are indeed closest to q in angle measure. Of course, this visual inspection method can be deceptive at times, which is why information retrieval researchers rely on the mathematical formulas above to compute the cosines of these angles rather than on visual judgment.

Figure 4. 3-D plot of 7 document vectors and query vector 2


Now let's review how the VSM processes query q. First, the cosine of the angle between each di and q must be computed. Those cosine measures exceeding some predetermined cutoff value are ranked, and the ranked list is presented to the user. For practical collections, much larger than this trivial 3-term, 7-document collection, computing these distance measures is the main computational expense.
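Putting these steps together, here is a minimal sketch of the whole query-processing loop for a Food!-style collection. Only documents 1, 2, 3, and 6 are pinned down by the text above; the columns used for documents 4, 5, and 7 are hypothetical stand-ins chosen to be consistent with the discussion, and the cutoff is the 45° value used earlier.

    import numpy as np

    # Term-by-document matrix: rows = (desserts, breads, vegetables), columns = d1..d7.
    # Columns d1, d2, d3, d6 follow the text; d4, d5, d7 are hypothetical.
    A = np.array([
        [1.0, 0.0, 0.0, 0.25, 0.6, 0.5, 0.2],   # desserts
        [0.0, 1.0, 0.7, 0.35, 0.4, 0.5, 0.6],   # breads
        [0.0, 0.0, 0.3, 0.40, 0.0, 0.0, 0.2],   # vegetables
    ])

    def rank_documents(A, q, cutoff=0.7071):
        """Cosine between q and every column of A; keep and rank those at or above the cutoff."""
        cosines = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
        keep = [("d%d" % (j + 1), round(float(c), 3)) for j, c in enumerate(cosines) if c >= cutoff]
        return sorted(keep, key=lambda pair: pair[1], reverse=True)

    q2 = np.array([0.5, 0.5, 0.0])   # query 2: equal mix of desserts and breads
    print(rank_documents(A, q2))     # d6 ranks first, then d5 and d7, as in the discussion of Table 2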

Exercise: Create Your Own 3-D Physical Search Engine

 

The Linear Algebra Behind Search Engines - Extending the VSM to n Dimensions

Author(s): 
Amy Langville

The concept of angles in 3-D can be extended to general n-dimensional (n-D) spaces. In n-D spaces, each vector has n components. Again, we want to summarize the information in an n-D vector into one number, so that we can quickly compare the information contained in any set of vectors. Even though n-D vectors are impossible to visualize for n  > 3, analogous concepts of length and angle exist in n-D space just as they do in 3-D space. This mathematical extension enables us to talk about a real search engine, such as Inktomi.

There are about 300,000 words in the English language (Berry, Drmac & Jessup, 1999). Thus, the query and document vectors are potentially 300,000-D vectors. For example, a search engine query on beach volleyball would create the 300,000-D query vector

q = (0, …, 0, 1, 0, …, 0, 1, 0, …, 0)^T,

where the first "1" corresponds to the English word beach and the second "1" corresponds to the word volleyball.

Each component in the 300,000-D vector corresponds to a word in the English language. Thus, the first position in the vector corresponds to the first word in the English dictionary, aardvark, and the last position corresponds to zymurgy, the last word in the dictionary. If the fourth position contains a 1, then a search should be run for the fourth word in the English language. Similarly, a 0 in the fourth position means the fourth word is irrelevant to the search. In general, if the ith position in the vector contains a 1, then the ith word in the English language should be searched for. On the other hand, if the ith position contains a 0, the ith word in the English language is irrelevant to the search. This is the procedure for translating a group of query words into an equivalent mathematical representation using the query vector.

The same is done for the documents on the Web, translating them into mathematical representations. Each document on the Web has a list of keywords associated with it. These keywords are used to create a document vector for that document. For example, a document entitled "Will Shaq's Free Throws Stop the Lakers" might designate basketball, playoffs, and NBA as its keywords. This document's vector contains three 1's in the positions corresponding to its three keywords and 0's elsewhere.

Search engines often apply weighting schemes to keywords. The scheme just described is unweighted, since each keyword gets the value 1 in the document vector. However, in a weighted scheme, the NBA playoff document might give more weight to playoff than basketball, and thus the entry corresponding to playoff might be 5 and the entry for basketball might be 2.
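A minimal sketch of this translation step follows; the miniature vocabulary and the weights are hypothetical (a real engine would index a vocabulary of roughly 300,000 words).

    import numpy as np

    # Hypothetical miniature vocabulary standing in for the full, alphabetized English word list.
    vocabulary = ["aardvark", "basketball", "beach", "nba", "playoff", "volleyball", "zymurgy"]
    index = {word: i for i, word in enumerate(vocabulary)}

    def query_vector(words):
        """Unweighted (0/1) query vector: a 1 in each position whose word appears in the query."""
        q = np.zeros(len(vocabulary))
        for w in words:
            q[index[w]] = 1.0
        return q

    def document_vector(keyword_weights):
        """Weighted document vector: each keyword gets its assigned weight; all other entries are 0."""
        d = np.zeros(len(vocabulary))
        for w, weight in keyword_weights.items():
            d[index[w]] = weight
        return d

    print(query_vector(["beach", "volleyball"]))                       # 1's for beach and volleyball
    print(document_vector({"playoff": 5, "basketball": 2, "nba": 1}))  # weights as in the example above (the nba weight is hypothetical)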

You have already encountered a keyword weighting scheme in the Food! collection example, where the weights were percentages of the article devoted to a particular keyword. In practice, numerous clever weighting and frequency schemes are implemented in an effort to make the search more accurate (Berry & O'Brien, 1998, Dumais, 1991, Jones, 1972). In addition, various other 1-D measures besides the cosine angle measure for capturing the distance between two vectors have been proposed (Jones & Furnas, 1987).

This n-D VSM must proceed through the same steps as the 3-D VSM example for the Food! collection in order to process a query. First, the angle measure between q and each document vector di in the collection must be computed. With three billion documents on the Web, this means three billion angle measures must be calculated for each query. Then those cosine measures exceeding the cutoff value are ranked, and this ranked list is presented to the user. And all this must be done in a fraction of a second, which leads us to the next page.

The Linear Algebra Behind Search Engines - An Advanced Vector Space Model: LSI

Author(s): 
Amy Langville

The vector space model provides the framework for most information retrieval algorithms used today. However, this most basic vector space model alone is not efficient enough. Many modifications and heuristics have been invented to speed up the basic model, giving rise to a popular model called the Latent Semantic Indexing (LSI) model (Berry, Drmac & Jessup, 1999).

The major mathematical entity in the LSI model is also a matrix. We have already discussed how the vector space model described in preceding pages can be viewed as a matrix. Now let's adjoin the 3 billion document vectors associated with the Web (each of which is a 300,000-D vector) to one another to form a 300,000-by-3,000,000,000 matrix, called the term-by-document matrix. The (i,j)-element of this matrix contains a 1 if word i is a keyword of document j and a 0 otherwise. Column 1 in this matrix is document 1's document vector, column 2 is document 2's vector, and so forth.

This huge term-by-document matrix may contain redundant information. Berry, Drmac & Jessup (1999) provide the following example:

There may be a document on the Web with keywords applied linear algebra and another document on the Web with keywords computer graphics. There might also be a document with keywords linear algebra applied [to] computer graphics. Clearly, the vectors for the first two documents sum to the vector for the third document.

In linear algebra, this is called dependence among columns of the matrix. This dependence among columns may be frequent, since there are so many documents on the Web. Fortunately, this dependence in the term-by-document matrix can be used to reduce the work required in information retrieval.

Here is where the LSI model departs from the vector space model. In most general terms, the LSI method manipulates the matrix to eradicate dependencies and thus consider only the independent, smaller part of this large term-by-document matrix. In particular, the mathematical tool used to achieve the reduction is the truncated singular value decomposition (SVD) of the matrix.

Note: We introduce the SVD briefly in what follows; Berry, Drmac & Jessup (1999) and Meyer (2000) contain detailed explanations of the SVD of a matrix.

Basically, a truncated SVD of the term-by-document matrix reduces the m-by-n matrix A to an approximation Ak that is stored compactly with a smaller number of vectors. The subscript k represents how much of the original matrix A we wish to preserve. A large value of k means that Ak is very close to A. However, the value of this good approximation must be weighed against the desire for data compression. In fact, the truncated representation Ak requires the storage of exactly k scalars, k m-dimensional vectors and k n-dimensional vectors. Clearly, as k increases, so do the storage requirements.

Let's pause now for a formal definition of the SVD, followed by a numerical example. Before defining the SVD, we also review some other matrix algebra terms. The following definitions and theorems are taken from Meyer (2000).

Definition   The rank of an m-by-n matrix A is the number of linearly independent rows or columns of A.

Definition   An orthogonal matrix is a real m-by-n matrix P of rank r whose columns (or rows) constitute an orthonormal basis for R^r. That is, the set of column (or row) vectors {p1, p2, …, pr}, where r = rank(P), has the property that pi^T pj = 1 if i = j and pi^T pj = 0 if i ≠ j.

Theorem   For each m-by-n matrix A of rank r, there are orthogonal matrices U (m-by-m) and V (n-by-n) and a diagonal matrix

D (r-by-r) = diag(σ1, σ2, …, σr)

such that

A = U S V^T, where S is the m-by-n matrix whose upper-left r-by-r block is D and whose remaining entries are zero,

with σ1 ≥ σ2 ≥ … ≥ σr > 0.

Definition   The σi's in the preceding theorem are called the nonzero singular values of A. When r < p = min(m,n), A is said to have p - r additional zero singular values. The factorization in the theorem is called the singular value decomposition (SVD) of A, and the columns of U (denoted ui) and the columns of V (denoted vi) are called the left-hand and right-hand singular vectors of A, respectively. Written another way, the SVD expresses A as a sum of rank-1 matrices:

A = σ1 u1 v1^T + σ2 u2 v2^T + … + σr ur vr^T.


Definition   The truncated SVD of A is

Ak = σ1 u1 v1^T + σ2 u2 v2^T + … + σk uk vk^T = Uk Dk Vk^T,

where Uk (m-by-k) and Vk (n-by-k) contain the first k left-hand and right-hand singular vectors of A, respectively, and Dk = diag(σ1, σ2, …, σk).

Theorem (Meyer, 2000, p. 417)   If σ1 ≥ σ2 ≥ … ≥ σr are the nonzero singular values of the m-by-n matrix A, then for each k < r, the distance from A to the closest rank-k matrix is

min over all rank-k matrices B of ||A − B||2 = σk+1.
Thus, σk+1 is a measure of how well Ak approximates A. In fact, for an m-by-n matrix A of rank r, Ak is the best rank-k matrix approximation to A (Eckart & Young, 1936). The goal in using Ak to approximate A is to choose k large enough to give a good approximation to A, yet small enough that k << r and Ak requires much less storage than A.
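The theorem is easy to verify numerically: the 2-norm distance from A to its rank-k truncation Ak equals σk+1. A small sketch with a randomly generated matrix:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((9, 7))                       # any small matrix will do

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 3
    Ak = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k truncated SVD

    # ||A - Ak||_2 should equal sigma_(k+1), which is s[k] since Python indexes from 0.
    print(np.linalg.norm(A - Ak, 2), s[k])       # the two numbers agree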

Figure 6 shows the storage savings obtained with the truncated SVD in general.

Figure 6: Data compression using truncated SVD


A numerical example may make these concepts clearer. Suppose the original term-by-document matrix A is 100-by-300. After a truncated SVD, suppose A20 approximates A well, i.e., we take k = 20. To calculate the storage savings, we count the number of elements that must be stored in both representations -- the full 100-by-300 matrix A (30,000 elements) versus the truncated SVD representation, which requires storage of a 20-by-1 vector, a 100-by-20 matrix, and a 20-by-300 matrix (20 × 1 + 100 × 20 + 20 × 300 = 8,020 elements). The number of elements in the full matrix A representation is almost 4 times greater than the number of elements required by the truncated SVD.

Larger data collections often have much greater storage savings. The Medlars collection uses a 5831 × 1033 term-by-document matrix, which can be very well approximated by the truncated SVD with k = 70, requiring about 13 times less storage than the original matrix.

We close this section with a small example, taken from Berry & Browne (1999), showing how well the truncated SVD can approximate a matrix. Suppose A is a 9 × 7 term-by-document matrix.

The truncation value k = 3 gives the truncated SVD representation A3 = U3 D3 V3^T.

The elements of A3 only slightly resemble the elements of A -- some are even negative -- but this is not a problem. We are interested only in A3's ability to model the relationships between the vectors of A. Suppose we are given a query vector q.

Compare the ranking of document vectors using A with the ranking obtained using A3:

Table 3: Cosines of angles between document vectors and query using A and A3


As k increases, the truncated SVD better captures the essence of A, yet the storage savings decrease. For example, consider the rank-6 approximation A6.

Notice that A6 does a much better job of capturing the relationship between the vectors of A than A3 does. Now compare the ranking of document vectors using A with the ranking obtained using A6. Also compare the ranking from A3 with that from A6.

Table 4: Absolute cosines of angles between document
vectors and query using A, A3, and A6

Notice that A6's ranking is much closer to the true ranking obtained from the full matrix A than A3's ranking. But the storage required by A6 in its more compact form (U6, D6, V6) is much greater than the storage required by the less accurate A3 (U3, D3, V3). Choosing an appropriate and judicious truncation value k is an area of active research for VSMs (Berry, Drmac & Jessup, 1999, Berry & O'Brien, 1998, Letsche & Berry, 1997, Ding, 1999, Zha, Marques & Simon, 1998).

Having demonstrated the truncated SVD's potential to save storage, let's now consider how the angle measures are computed using Uk, Dk, and Vk. Recall the formula for computing the cosine of the angle θi between document vector di and query vector q:

cos θi = di^T q / (||di||2 ||q||2).

Since di is the ith column of A, let's replace di with A ei, where ei is the column vector of all zeros except for a 1 in the ith position. Thus,

cos θi = (A ei)^T q / (||A ei||2 ||q||2) = ei^T A^T q / (||A ei||2 ||q||2).

Using the truncated SVD, let's replace A with its approximation Ak and obtain

cos θi ≈ ei^T Ak^T q / (||Ak ei||2 ||q||2).

Since Ak can be decomposed into three small matrices, we replace Ak with Uk Dk Vk^T:

cos θi ≈ ei^T Vk Dk Uk^T q / (||Uk Dk Vk^T ei||2 ||q||2).

To further reduce computations at runtime, we let si = ei^T Vk Dk, the ith row of Vk Dk. Then

cos θi ≈ si Uk^T q / (||si||2 ||q||2),

since ||Uk si^T||2 = ||si^T||2 (the columns of Uk are orthonormal) and ||si^T||2 = ||si||2.

It's helpful to review the above equation, emphasizing the advantages of the truncated SVD. First, Uk, si, and ||si||2 only need to be computed once and can be reused for each query q. At runtime, only Uk^T q and ||q||2 must be computed. The vectors in Uk and Vk are computed once, and these can be determined without computing the entire SVD of A, since there is an algorithm for computing the first k singular values and vectors iteratively (Golub & Van Loan, 1996).

In practice, k can be much smaller than r and yet Ak still approximates A well. One reason for this concerns the noise in A due to polysemous and synonymous words in the English language. Truncating the SVD reduces this noise. Another reason why k << r works well in information retrieval applications is that numerical accuracy is of little concern. For example, three digits of precision in the cosine measure may be sufficient. That is, a cosine measure of .231 is just as meaningful as the more numerically precise result of .23157824. The truncated SVD enables storage benefits and still maintains adequate accuracy in the angle measure calculations.
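The pieces above fit into a short LSI retrieval sketch: compute Uk, the rows si = ei^T Vk Dk, and their norms once, then score each incoming query using only Uk^T q. This is an illustrative Python sketch, not the module's Matlab code, and the small 0/1 term-by-document matrix is hypothetical.

    import numpy as np

    # Hypothetical 0/1 term-by-document matrix (rows = terms, columns = documents).
    A = np.array([
        [1, 0, 0, 1, 0],
        [1, 1, 0, 0, 0],
        [0, 1, 1, 0, 0],
        [0, 0, 1, 1, 1],
        [0, 0, 0, 1, 1],
    ], dtype=float)

    k = 2                                            # truncation value
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Dk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

    # Precomputed once and reused for every query:
    S = Vk @ Dk                                      # row i is s_i = e_i^T Vk Dk
    row_norms = np.linalg.norm(S, axis=1)

    def lsi_scores(q):
        """cos(theta_i) ~ s_i (Uk^T q) / (||s_i|| ||q||) for each document i."""
        projected = Uk.T @ q                         # the only quantity that depends on the query
        return (S @ projected) / (row_norms * np.linalg.norm(q))

    q = np.array([1.0, 1.0, 0.0, 0.0, 0.0])          # hypothetical query on the first two terms
    print(np.round(lsi_scores(q), 3))                # scores to be ranked and cut off as before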

Exercise: Experiment with an LSI Search Engine built in Matlab

Quiz: Linear Algebra

 

The Linear Algebra Behind Search Engines - The Future of Information Retrieval

Author(s): 
Amy Langville

Many search engines are beginning to incorporate structured text retrieval, meaning that users can search for text in italics, capitals, or boldface print. Google currently incorporates some structured retrieval. Users can choose to search for keywords located in the title, body, or hyperlinks of documents. Google's advanced features include searching for HTML documents, PostScript documents, or Excel documents, among other choices. Hotbot allows users to search for images, video, or MP3-formatted data, in addition to ordinary text. Its advanced features also include date information. For example, users can choose to retrieve only recent documents, those updated in the last two weeks, two months, or two years.

Another recent trend in search engine research is to use syntactic clues and query structure to improve retrieval results. These methods incorporate the location of the keywords in the query, the use of prepositions, verbs, and adjectives, hyperlink structure on document pages, and citation links to enhance retrieval. Other research efforts include accommodations for errors and typos in the query, using the concepts of pattern matching and edit distance.

Most keyword search engines, such as the library search engine for N. C. State University, allow for wildcards. For example, you can enter computer * to retrieve books about computer-aided analysis, the computer age, computer circuits, and many other such topics.

One of the most exciting applications of information retrieval is image retrieval. True image retrieval uses mathematical models to capture how similar images are to an original image without using text hints, such as HTML image tags. Soon users will be able to retrieve images in addition to text documents. One application of image retrieval is locating possible suspects for a crime by searching through a criminal photo ID database electronically to find those criminals whose photo profiles most completely match the sketch rendered by the crime scene artist. This can be further extended to sound retrieval. Searching through audio files is as challenging as image retrieval, but may be possible in the near future.

To learn more, explore the online resources and print references listed on the next page.

The Linear Algebra Behind Search Engines - References and Resources

Author(s): 
Amy Langville

Print and Online References

  • Baeza-Yates, R., and B. Ribeiro-Neto (1999). Modern Information Retrieval. New York: ACM Press.
  • Berry, M. W. (Ed.) (2001). Computational Information Retrieval, Proceedings of CIR'00, Philadelphia: SIAM.
  • Berry, M. W., and M. Browne (1999). Understanding Search Engines: Mathematical Modeling and Text Retrieval. Philadelphia, PA: SIAM.
  • Berry, M. W., Z. Drmac, and E. R. Jessup (1999). Matrices, Vector Spaces and Information Retrieval. SIAM Review 41:335-362.
  • Berry, M. W., and G. W. O'Brien (1998). Using linear algebra for intelligent information retrieval. SIAM Review 37:573-595.
  • Books in Print (2001). New York: R.R. Bowker.
  • Bradley, P. (2003). Multi-search engines - a comparison. Web page: http://www.philb.com/msengine.htm. Accessed 10/28/05.
  • Ding, C. H. Q. (1999). A similarity-based probability model for LSI. Proceedings of the 22nd ACM SIGIR `99 Conference, pp. 59-65.
  • Dumais, S. T. (1991). Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers 23:229-236.
  • Eckart, C., and G. Young (1936). The approximation of one matrix by another of lower rank. Psychometrika 1:211-218.
  • Frakes, W. B., and R. Baeza-Yates (1992). Information Retrieval: Data Structures and Algorithms. Englewood Cliffs, NJ: Prentice Hall.
  • Glossbrenner, A. and E. (2001). Search Engines for the World Wide Web. Berkeley, CA: Peachpit Press.
  • Golub, G. H., and C. F. Van Loan (1996). Matrix Computations. Baltimore: Johns Hopkins University Press.
  • Harman, D., and E. Voorhees (Eds.) (1996). Overview of the fifth Text REtrieval Conference (TREC-5). In Information Technology: The Fifth Text REtrieval Conference (TREC-5), Gaithersburg, MD: NIST, 500-238 (Nov.): 1-28.
  • Jones, S. K. (1972). A statistical interpretation of term specificity and its applications in retrieval. J. Documentation 28:11-21.
  • Jones, W., and G. Furnas (1987). Pictures of relevance: A geometric analysis of similarity measures. Journal of the American Society for Information Science 38:420-442.
  • Korfhage, R. R. (1997). Information Storage and Retrieval. New York: Wiley Computer Publishing.
  • Langville, A. N., and C. D. Meyer (2005). A survey of eigenvector methods for Web information retrieval. SIAM Review, 47(1):135-161.
  • Langville, A. N., and C. D. Meyer (2006). Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton, NJ: Princeton University Press.
  • Letsche, T. A., and M. W. Berry (1997). Large-scale information retrieval with LSI. Information Sciences, 100:105-137.
  • Lyman, P., and H. R. Varian (2000). How Much Information. Web page: http://www.sims.berkeley.edu/how-much-info. Accessed 10/28/05.
  • Marchiori, M. (1997). The quest for correct information of the Web: hyper search engines. The Sixth International WWW Conference (WWW97). Santa Clara, USA, April 7-11.
  • Medlars Collection (2002). Available at http://www.cs.utk.edu/~lsi/corpa.html. Accessed 10/28/05.
  • Meyer, C.D. (2000). Matrix Analysis, Philadelphia: SIAM.
  • Salton, G. (1971). The SMART Retrieval System: Experiments in Automatic Document Processing, New Jersey: Prentice Hall.
  • Ulrich's International Periodicals Directory (2001). New York: R.R. Bowker.
  • Wurman, R. (1989). Information Anxiety, New York: Doubleday.
  • Zha, H., O. Marques, and H. Simon (1998). A subspace-based model for information retrieval with applications in latent semantic indexing. Proceedings of Irregular '98, Lecture Notes in Computer Science, 1457:29-42.
  • Zha, H., and H. D. Simon (1999). On updating problems in latent semantic indexing. SIAM Journal on Scientific Computing, 21(2):782-791.

Online Resources

The Linear Algebra Behind Search Engines - Appendix: Components of a VSM Search Engine

Author(s): 
Amy Langville
In this appendix I provide the steps necessary for implementation of the vector space model search engine. The accompanying flow chart contains a pictorial representation of the VSM search engine.
  • Data processing: The data must be stored efficiently and in a uniform format. This means that the data must be preprocessed or cleaned before they are stored in a data file. For example, keywords must be selected for a document, either automatically or manually. Once the appropriate keywords are chosen for the document, they must undergo a stemming procedure to be stored only as stems. For example, the single word stem "comput" might be used to capture the keywords "compute", "computer", "computational", "computing", etc. This saves storage in the list of keywords and can be considered a data compression step as well. Once all the documents have been processed and their keywords chosen and stemmed, the term list is further compressed by the removal of stoplisted words, common words that add little to the meaning of the document. Stoplisted word lists have been created (Frakes & Baeza-Yates, 1992) and are consulted by most search engines. (A small sketch of this preprocessing, together with the query-processing steps below, follows this list.)
  • Updating the system that stores document vector information: In any document collection, documents are continually being added, deleted, and updated. Every so often, the current storage of the documents must be modified to mirror these changes. For example, some search engine systems update their database every week, adding new document vectors, removing outdated document vectors, and modifying current document vectors to reflect addenda and modifications to the documents. One can imagine that this updating process becomes an arduous but essential procedure for a document collection as volatile as the Web.
  • Data compressing: To operate efficiently, very large document collections require some type of data compression beyond the simple compression techniques of stemming and stoplisting. There are numerous clever ideas for smartly compressing the document collection -- the one I have emphasized in the context of the vector space model is the truncated singular value decomposition (SVD). The truncated SVD compresses an m × n term-by-document matrix into a k ×  1 vector, an m × k term-by-document cluster matrix and a k × n term cluster-by-document matrix, where k is much smaller than n.
  • Query processing: The user's entered keywords must be translated into a query vector that is appropriate for the search engine model being used. In our example vector space model, this means the original query of keywords has to be transformed into a vector as demonstrated in the Food! example above. Also, in the query translation step, the stemming and stoplisting operations must be performed on the query as well. A query of birth customs of ancient Mayans run by Google would display in small font above the list of retrieved documents the following phrase: "'of' is a very common word and was not included in your search." The word "of" is a stoplisted word, which Google automatically removes from consideration. Queries must also undergo a stemming procedure before being processed. A query of fishing lures might, after the stemming procedure, be read by the search engine as fish lure.
  • Computing distance measures: The measure of how close each document vector is to the query vector must be done as efficiently as possible. Sometimes the data compression step can help to make these distance computations faster. Other methods exploit the notion of clustering to reduce the number of distance computations required. If a document can immediately be labeled as "very far from" the query vector, then most likely all other documents in the cluster of this poor document will receive this same poor classification. Thus, the distance measures of documents in the poor document's cluster need not be computed. Instead, only distance measures among those documents most likely to be relevant need to be computed. Finding appropriate clusters and then relevant clusters is a research problem in itself.
  • Ranking distance measures: Once the distance measures have been computed for either all document vectors or some "most likely to be relevant" document vectors, the results need to be ranked according to some measure. For our example vector space model, the distance measures are cosines of angles and are between 0 and 1. Then the top, say, 20 documents with the highest cosine angle measures are reported to the user.
  • Query reformulating/relevance feedback: The goal of query reformulation is to create a revised query that can be sent through the search engine and that will hopefully give better, more relevant results. The most popular type of query reformulation is relevance feedback, by which the user examines the list of retrieved documents and marks those that are relevant. Query reformulation can also be attempted without involving the user. However, such automatic query reformulation techniques, using synonyms, thesauri, and word clusters to expand the query, are currently most appropriate for small data collections.
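As a minimal illustration of the data-processing and query-processing steps above, here is a toy end-to-end sketch. The stemming rule, stoplist, documents, and query are hypothetical; a real engine would use a full stemmer (such as Porter's) and a published stoplist.

    import numpy as np

    STOPLIST = {"of", "the", "a", "and"}                 # tiny hypothetical stoplist

    def stem(word):
        """Crude illustrative stemmer: strip a few common suffixes."""
        for suffix in ("ing", "ers", "er", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def preprocess(text):
        """Lowercase, drop stoplisted words, and stem what remains."""
        return [stem(w) for w in text.lower().split() if w not in STOPLIST]

    # Hypothetical documents and query.
    docs = ["maintenance of fishing boats", "baking breads and desserts", "fishing lures and reels"]
    query = "fishing lures"

    terms = sorted({t for d in docs for t in preprocess(d)})
    index = {t: i for i, t in enumerate(terms)}

    def vector(text):
        v = np.zeros(len(terms))
        for t in preprocess(text):
            if t in index:
                v[index[t]] += 1.0                       # simple term-frequency weighting
        return v

    A = np.column_stack([vector(d) for d in docs])       # term-by-document matrix
    q = vector(query)
    cosines = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
    for rank, j in enumerate(np.argsort(-cosines), start=1):
        print(rank, docs[j], round(float(cosines[j]), 3))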