The Linear Algebra Behind Search Engines

Author(s):

Amy Langville

The vector space model provides the framework for most information retrieval algorithms used today. However, this most basic vector space model alone is not efficient enough. Many modifications and heuristics have been invented to speed up the basic model, giving rise to a popular model called the Latent Semantic Indexing (LSI) model (Berry, Drmac & Jessup, 1999).

The major mathematical entity in the LSI model is also a matrix. We have already discussed how the vector space model described in preceding pages can be viewed as a matrix. Now let's adjoin the 3 billion document vectors associated with the Web (each of which is a 300,000-D vector) to one another to form a 300,000-by-3,000,000,000 matrix, which is called the term-by-document matrix. This matrix might look like

where the (i,j)-element of the matrix contains a 1 if the word for position i in document j is a keyword and a 0 otherwise. Column 1 in this matrix is document 1's document vector, column 2 is document 2's vector, and so forth.

This huge term-by-document matrix may contain redundant information. Berry, Drmac & Jessup (1999) provide the following example:

There may be a document on the Web with keywords applied linear algebra and another document on the Web with keywords computer graphics. There might also be a document with keywords linear algebra applied [to] computer graphics. Clearly, the vectors for the first two documents sum to the vector for the third document.

In linear algebra, this is called dependence among columns of the matrix. This dependence among columns may be frequent, since there are so many documents on the Web. Fortunately, this dependence in the term-by-document matrix can be used to reduce the work required in information retrieval.

Here is where the LSI model departs from the vector space model. In most general terms, the LSI method manipulates the matrix to eradicate dependencies and thus consider only the independent, smaller part of this large term-by-document matrix. In particular, the mathematical tool used to achieve the reduction is the truncated singular value decomposition (SVD) of the matrix.

Note: We introduce the SVD briefly in what follows; Berry, Drmac & Jessup (1999) and Meyer (2000) contain detailed explanations of the SVD of a matrix.

Basically, a truncated SVD of the term-by-document matrix reduces the m-by-n matrix A to an approximation A_k that is stored compactly with a smaller number of vectors. The subscript k represents how much of the original matrix A we wish to preserve. A large value of k means that A_k is very close to A. However, the value of this good approximation must be weighed against the desire for data compression. In fact, the truncated representation A_k requires the storage of exactly k scalars, k m-dimensional vectors and k n-dimensional vectors. Clearly, as k increases, so do the storage requirements.

Let's pause now for a formal definition of the SVD, followed by a numerical example. Before defining the SVD, we also review some other matrix algebra terms. The following definitions and theorems are taken from Meyer (2000).

Definition   The rank of an m-by-n matrix A is the number of linearly independent rows or columns of A.

Definition   An orthogonal matrix is a real m-by-n matrix P of rank r whose columns (or rows) constitute an orthonormal basis for R^r. That is, the set of column (or row) vectors {p₁, p₂, …, p_r}, where r = rank(P), has the property that p_i^Tp_j = 1 for i = j, 0 otherwise.

Theorem   For each m-by-n matrix A of rank r, there are orthogonal matrices

U_m-by-m and V_n-by-n

and a diagonal matrix

D_r-by-r = diag(σ₁, σ₂, …, σ_r)

such that

with σ₁ ≥ σ₂ ≥ … ≥ σ_r > 0.

Definition   The σ_i's in the preceding theorem are called the nonzero singular values of A. When r < p = min(m,n), A is said to have p - r additional zero singular values. The factorization in the theorem is called the singular value decomposition (SVD) of A, and the columns of U (denoted u_i) and the columns of V (denoted v_i) are called the left-hand and right-hand singular vectors of A, respectively. Written another way, the SVD expresses A as a sum of rank-1 matrices:

.

Definition   The truncated SVD is

Theorem (Meyer, 2000, p. 417)   If σ₁ ≥ σ₂ ≥ … ≥ σ_r are the nonzero singular values of A_m-by-n, then for each k, the distance from A to the closest rank-k matrix is

Thus, σ_k+1 is a measure of how well A_k approximates A. In fact, for A_m-by-n of rank r, A_k is the best rank-k matrix approximation to A (Eckart & Young, 1936). The goal in using A_k to approximate A is to choose k large enough to give a good approximation to A, yet small enough so that k << r requires much less storage.

Figure 6 shows the storage savings obtained with the truncated SVD in general.

Figure 6: Data compression using truncated SVD

A numerical example may make these concepts clearer. Suppose the original term-by-document matrix A is 100-by-300. After a truncated SVD, suppose A₂₀ approximates A well, i.e., we take k = 20. To calculate the storage savings, we count the number of elements that must be stored in both representations -- the full 100-by-300 matrix A (30,000 elements) versus the truncated SVD representation, which requires storage of a 20-by-1 vector, a 100-by-20 matrix and a 20-by-300 matrix (20 × 1 + 100 × 20 + 20 × 300 = 8,020 elements). The number of elements in the full matrix A representation is almost 4 times greater than the number of elements required by the truncated SVD.

Larger data collections often have much greater storage savings. The Medlars collection uses a 5831 × 1033 term-by-document matrix, which can be very well approximated by the truncated SVD with k = 70, creating storage savings about 13 times less than the original matrix.

We close this section with a small example, taken from Berry & Browne (1999), showing that the truncated SVD can well approximate a matrix. Suppose A is the 9 × 7 matrix

The truncation value k = 3 gives the truncated SVD representation

The elements of A₃ only slightly resemble the elements of A -- some are even negative -- but this is not a problem. We are interested only in A₃'s ability to model the relationships between the vectors of A. Suppose the query vector is

Compare the ranking of document vectors using A with the ranking obtained using A₃:

Table 3: Cosines of angles between document vectors and query using A and A3

As k increases, the truncated SVD better captures the essence of A, yet the storage savings decrease. For example, here is A₆:

Notice that A₆ does a much better job of capturing the relationship between the vectors of A than A₃ does. Now compare the ranking of document vectors using A with the ranking obtained using A₆. Also compare the ranking from A₃ with that from A₆.

Table 4: Absolute cosines of angles between document
vectors and query using A, A3, and A6

Notice that A₆'s ranking is much closer to the true ranking obtained from the full matrix A than A₃'s ranking. But the storage required by A₆ in its more compact form (U₆, V₆, σ₆) is much greater than the storage required by the less accurate A₃ (U₃, V₃, σ₃). Choosing an appropriate and judicious truncation value k is an area of active research for VSMs (Berry, Drmac & Jessup, 1999, Berry & O'Brien, 1998, Letsche & Berry, 1997, Ding, 1999, Zha, Marques & Simon, 1998).

Having demonstrated the truncated SVD's potential to save storage (note), let's now consider how the angle measures are computed using U_k, D_k, V_k. Recall the formula for computing the cosine of the angle &theta_i between document vector d_i and query vector q:

Since d_i is the i^th column of A, let's replace d_i with Ae_i, where e_i is the column vector of all zeros except for a 1 in the i^th position. Thus,

Using the truncated SVD, let's replace A with its approximation A_k and obtain

Since A_k can be decomposed into three small matrices, we replace A_k with U_kD_kV_k^T:

To further reduce computations at runtime, we let s_i=e_i^TV_kD_k. Then,

since ||U_ks_i^T||₂= ||s_i^T||₂ for the orthogonal matrix U_k, and ||s_i^T||₂=||s_i||₂.

It's helpful to review the above equation, emphasizing the advantages of the truncated SVD. First, U_k, s_i and ||s_i||₂ only need to be computed once and can be reused for each query q. At runtime, only U_k^Tq and ||q||₂ must be computed. The vectors in U_k and V_k are computed once, and these can be determined without computing the entire SVD of A, since there is an algorithm for computing the first k singular values and vectors iteratively (Golub & Van Loan, 1996).

In practice, k can be much smaller than r and yet A_k still approximates A well. One reason for this concerns the noise in A due to polysemous and synonymous words in the English language. Truncating the SVD reduces this noise. Another reason why k << r works well in information retrieval applications is that numerical accuracy is of little concern. For example, three digits of precision in the cosine measure may be sufficient. That is, a cosine measure of .231 is just as meaningful as the more numerically precise result of .23157824. The truncated SVD enables storage benefits and still maintains adequate accuracy in the angle measure calculations.

Exercise: Experiment with an LSI Search Engine built in Matlab

Quiz: Linear Algebra

Amy Langville, "The Linear Algebra Behind Search Engines - An Advanced Vector Space Model: LSI," Convergence (December 2005)

JOMA