The Linear Algebra Behind Search Engines - Appendix: Components of a VSM Search Engine

Author(s):

Amy Langville

In this appendix I provide the steps necessary for implementation of the vector space model search engine. The accompanying flow chart contains a pictorial representation of the VSM search engine.

Data processing: The data must be stored efficiently and in a uniform format. This means that the data must be preprocessed or cleaned before they are stored in a data file. For example, keywords must be selected for a document, either automatically or manually. Once the appropriate keywords are chosen for the document, they must undergo a stemming procedure to be stored only as stems. For example, the single word stem "comput" might be used to capture the keywords "compute", "computer", "computational", "computing", etc. This saves storage in the list of keywords and can be considered a data compression step as well. Once all the documents have been processed and their keywords chosen and stemmed, the term list is further compressed by the removal of stoplisted words, common words that add little to the meaning of the document. Stoplisted word lists have been created (Frakes & Baeza-Yates, 1992) and are consulted by most search engines.
Updating the system that stores document vector information: In any document collection, documents are continually being added, deleted, and updated. Every so often, the current storage of the documents must be modified to mirror these changes. For example, some search engine systems update their database every week, adding new document vectors, removing outdated document vectors, and modifying current document vectors to reflect addenda and modifications to the documents. One can imagine that this updating process becomes an arduous but essential procedure for a document collection as volatile as the Web.
Data compressing: To operate efficiently, very large document collections require some type of data compression beyond the simple compression techniques of stemming and stoplisting. There are numerous clever ideas for smartly compressing the document collection -- the one I have emphasized in the context of the vector space model is the truncated singular value decomposition (SVD). The truncated SVD compresses an m × n term-by-document matrix into a k × 1 vector, an m × k term-by-document cluster matrix and a k × n term cluster-by-document matrix, where k is much smaller than n.
Query processing: The user's entered keywords must be translated into a query vector that is appropriate for the search engine model being used. In our example vector space model, this means the original query of keywords has to be transformed into a vector as demonstrated on page 5. Also, in the query translation step, the stemming and stoplisting operations must be performed on the query as well. A query of birth customs of ancient Mayans run by Google would display in small font above the list of retrieved documents the following phrase: "'of' is a very common word and was not included in your search." The word "of" is a stoplisted word, which Google automatically removes from consideration. Queries must also undergo a stemming procedure before being processed. A query of fishing lures might, after the stemming procedure, be read by the search engine as fish lure.
Computing distance measures: The measure of how close each document vector is to the query vector must be done as efficiently as possible. Sometimes the data compression step can help to make these distance computations faster. Other methods exploit the notion of clustering to reduce the number of distance computations required. If a document can immediately be labeled as "very far from" the query vector, then most likely all other documents in the cluster of this poor document will receive this same poor classification. Thus, the distance measures of documents in the poor document's cluster need not be computed. Instead, only distance measures among those documents most likely to be relevant need to be computed. Finding appropriate clusters and then relevant clusters is a research problem in itself.
Ranking distance measures: Once the distance measures have been computed for either all document vectors or some "most likely to be relevant" document vectors, the results need to be ranked according to some measure. For our example vector space model, the distance measures are cosines of angles and are between 0 and 1. Then the top, say, 20 documents with the highest cosine angle measures are reported to the user.
Query reformulating/relevance feedback: The goal of query reformulation is to create a revised query that can be sent through the search engine and that will hopefully give better, more relevant results. The most popular type of query reformulation is relevance feedback, by which the user examines the list of retrieved documents and marks those that are relevant. Query reformulation can also be attempted without involving the user. However, such automatic query reformulation techniques, using synonyms, thesauri, and word clusters to expand the query, are currently most appropriate for small data collections.

JOMA