The Linear Algebra Behind Search Engines - Summary of Search Engine Models

Author: Amy Langville

I outline here just a few of the most basic information retrieval techniques, to provide context for the techniques we will study in more detail. The specifics of these techniques can become very complicated and are, in general, hard to come by, since many vendors refuse to share details in this competitive environment.

Boolean Search Engines

The Boolean model of information retrieval, one of the earliest and simplest retrieval methods, uses exact matching: it judges a document relevant to a user "query," or information request, only if the document contains the query words exactly. More refined descendants of this model are still used by most libraries.

The adjective "Boolean" refers to the use of Boolean algebra, whereby words are logically combined with the Boolean operators AND, OR, and NOT. For example, the Boolean AND of two logical statements x and y means that both x AND y must be satisfied, while the Boolean OR of these same two statements means that at least one of these statements must be satisfied. Any number of logical statements can be combined using the three Boolean operators.

The Boolean information retrieval model considers only which keywords are present or absent in a document or title. Thus, a document is judged as relevant or irrelevant -- there is no concept of a "partial match" between documents and queries. The inability to identify partial matches can lead to poor performance (Baeza-Yates & Ribeiro-Neto, 1999). Other, more advanced set-theoretic techniques, such as the so-called "fuzzy sets", try to remedy this black-and-white Boolean logic by introducing shades of gray. For example, a title search for car AND maintenance on a Boolean engine causes the engine to return all documents that have both words in the title. As a result, an apparently relevant document entitled "Automobile Maintenance" will not be returned. Fuzzy Boolean engines use fuzzy logic to categorize this document as somewhat relevant and return it to the user.
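To make the exact-match behavior concrete, here is a minimal Python sketch of Boolean retrieval over a toy collection; the document names, titles, and index structure are invented for illustration and not taken from any real engine.

```python
# A minimal sketch of Boolean retrieval over a toy collection of titles.
docs = {
    "d1": "car maintenance tips",
    "d2": "automobile maintenance",
    "d3": "car insurance rates",
}

# Index each document as the set of keywords appearing in its title.
index = {name: set(title.split()) for name, title in docs.items()}

def boolean_and(*terms):
    """Return documents whose keyword sets contain ALL of the terms."""
    return {name for name, words in index.items()
            if all(t in words for t in terms)}

def boolean_or(*terms):
    """Return documents whose keyword sets contain AT LEAST ONE term."""
    return {name for name, words in index.items()
            if any(t in words for t in terms)}

# The query car AND maintenance matches only d1; the relevant title
# "automobile maintenance" (d2) is missed because matching is exact.
print(boolean_and("car", "maintenance"))   # {'d1'}
print(boolean_or("car", "maintenance"))    # {'d1', 'd2', 'd3'}
```

The AND query misses "Automobile Maintenance" precisely because matching is exact; a fuzzy engine would instead assign that document a partial degree of relevance rather than rejecting it outright.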

The car maintenance query example illustrates the main drawbacks of Boolean search engines -- they fall prey to two of the most common information retrieval problems, synonymy and polysemy.

  • Synonymy occurs when multiple words have the same meaning, such as "car" and "automobile". A standard Boolean engine cannot return semantically related documents whose keywords were not included in the original query.
  • Polysemy occurs when a word has multiple meanings. For example, when a user types bank as a query, does this mean a financial center, a slope on a hill, a shot in pool, or a collection of objects (Berry, Drmac & Jessup, 1999)? The problem of polysemy can cause retrieval of many documents that are irrelevant to the user's intended meaning.

Many Boolean search engines also require the user to be familiar with Boolean operators and the specialized syntax of the engine. For example, to find information about the phrase iron curtain, many engines require quotation marks around the phrase, which tell the search engine that the entire phrase should be searched as if it were just one keyword. A user who forgets this syntax requirement would be surprised to find retrieved documents about interior decorating and mining for iron ore.

Nevertheless, variants of the Boolean model do form the basis for many search engines. There are several reasons for their prevalence:

  1. Creating and programming a Boolean engine is straightforward.
  2. Queries can be processed quickly -- a quick scan through the keyword files for the documents can be executed in parallel.
  3. Boolean models scale well to very large document collections. Accommodating a growing collection is easy -- the programming remains simple, and only the storage and parallel processing capabilities need to grow.

Baeza-Yates & Ribeiro-Neto (1999), Frakes & Baeza-Yates (1992), and Korfhage (1997) all contain chapters with excellent introductions to the Boolean model and its extensions.

Vector Space Model Search Engines

Another information retrieval technique uses the vector space model (Salton, 1971), developed by Gerard Salton in the early 1960s, to sidestep some of the information retrieval problems of the Boolean model. Vector space models transform textual data into numeric vectors and matrices, then employ matrix analysis techniques to discern key features and connections in the document collection.
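As a concrete illustration, the following Python sketch builds a small term-by-document matrix and ranks documents against a query by the cosine of the angle between vectors. The vocabulary, term counts, and query are made up for the example.

```python
import numpy as np

# Toy vector space model: columns of A are documents, rows are terms.
vocab = ["car", "maintenance", "insurance", "automobile"]

# Term-by-document matrix A: A[i, j] counts occurrences of term i in doc j.
A = np.array([
    [2, 0, 1],   # car
    [1, 1, 0],   # maintenance
    [0, 0, 2],   # insurance
    [0, 2, 0],   # automobile
], dtype=float)

q = np.array([1, 1, 0, 0], dtype=float)  # query: "car maintenance"

# Cosine similarity of the query with each document column; for
# nonnegative vectors this is a number between 0 and 1.
scores = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Return documents as an ordered list, most relevant first.
for j in np.argsort(-scores):
    print(f"doc {j}: score {scores[j]:.2f}")
```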

Some advanced vector space models address the common text analysis problems of synonymy and polysemy. Models such as Latent Semantic Indexing (LSI) (Dumais, 1991) can access the hidden semantic structure in a document collection. For example, an LSI engine processing the query car will return documents whose keywords are related semantically (in meaning), e.g., automobile. This ability to reveal hidden semantic meanings makes vector space models such as LSI very powerful information retrieval tools.
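The sketch below illustrates the idea behind LSI, again with an invented matrix: a truncated singular value decomposition of the term-by-document matrix projects terms and documents into a low-dimensional latent space, where a query for car can match a document containing only automobile.

```python
import numpy as np

# Toy LSI: truncated SVD of a term-by-document matrix.  The counts are
# invented so that "car" and "automobile" co-occur with similar words.
vocab = ["car", "automobile", "maintenance", "insurance"]
A = np.array([
    [1, 0, 1, 0],   # car
    [0, 1, 1, 0],   # automobile
    [1, 1, 0, 0],   # maintenance
    [0, 0, 0, 2],   # insurance
], dtype=float)

k = 2                                   # number of latent dimensions kept
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

q = np.array([1, 0, 0, 0], dtype=float)    # query: "car"
q_hat = (q @ Uk) / sk                      # fold the query into latent space

doc_hat = Vk                               # rows = documents in latent space
scores = (doc_hat @ q_hat) / (
    np.linalg.norm(doc_hat, axis=1) * np.linalg.norm(q_hat))
print(np.round(scores, 2))
```

In this toy example, the document containing automobile and maintenance but not car receives a perfect score, because the two terms occupy the same latent dimension.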

Two additional advantages of the vector space model are relevance scoring and relevance feedback. The model allows documents to match a query partially by assigning each document a number between 0 and 1, which can be interpreted as the likelihood of relevance to the query. The group of retrieved documents can then be sorted by degree of relevance, a luxury not possible with the Boolean model. Thus, vector space models return documents in an ordered list, with the first document judged to be the most relevant to the user's query. Some vector space search engines report the relevance score as a relevancy percentage; for example, a score of 97% next to a document means that the document is judged to be 97% relevant to the user's query.

Example. The Federal Communications Commission's search engine, which is powered by Inktomi, was known at one time to use a vector space model. Enter a query such as taxes and notice the relevance score reported on the right side.

Relevance feedback is an information retrieval tuning technique that fits naturally into the vector space model. The user marks a subset of the retrieved documents as useful, and the query is then resubmitted with this additional relevance feedback information to obtain a revised set of retrieved documents that is generally more useful.
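One classical way to implement relevance feedback in the vector space model is the Rocchio update, sketched below; the weight values alpha, beta, and gamma are conventional choices, not specified in this article.

```python
import numpy as np

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance-feedback update: move the query vector toward
    the centroid of documents the user marked relevant and away from the
    centroid of those marked nonrelevant.  `relevant` and `nonrelevant`
    are lists of document vectors (NumPy arrays)."""
    q_new = alpha * q
    if relevant:
        q_new += beta * np.mean(relevant, axis=0)
    if nonrelevant:
        q_new -= gamma * np.mean(nonrelevant, axis=0)
    # Negative term weights are usually clipped to zero before re-querying.
    return np.maximum(q_new, 0.0)
```

The revised query vector is then scored against the collection exactly as before, so the feedback loop reuses the engine's existing ranking machinery.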

The drawbacks of vector space models are their computational expense and poor scalability. At query time, distance measures (also known as similarity measures) must be computed between each document and the query, and advanced models such as LSI require an expensive singular value decomposition of a large matrix that numerically represents the entire document collection. As the collection grows, the expense of this matrix decomposition becomes prohibitive.

The informative little book by Berry & Browne (1999) provides an excellent explanation of vector space models, especially LSI, and contains several examples and sample code. I provide a much more condensed explanation on the Example Search Engine Method page.

Probabilistic Model Search Engines

Probabilistic models attempt to estimate the probability that the user will find a particular document relevant. Retrieved documents are ranked by their odds of relevance -- the ratio of the probability that the document is relevant to the query to the probability that it is not. The model operates iteratively: the underlying algorithm makes an initial guess at its parameters, then repeatedly refines that guess to obtain a final ranking of relevance probabilities. Improvements to the basic probabilistic model of information retrieval are made with Bayesian inference techniques (Baeza-Yates & Ribeiro-Neto, 1999).
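The following Python sketch ranks documents in the spirit of the binary independence model, one standard instance of this approach; the probability estimates are invented initial guesses of the kind such an algorithm would iteratively refine.

```python
import math

# For each query term t we need estimates of
#   p_t = P(t appears in a document | document is relevant)
#   s_t = P(t appears in a document | document is not relevant).
# These numbers are invented initial guesses for illustration.
estimates = {               # term: (p_t, s_t)
    "car":         (0.8, 0.3),
    "maintenance": (0.6, 0.1),
}

def log_odds_of_relevance(doc_terms):
    """Sum of log odds ratios over query terms present in the document,
    assuming independence between terms.  Higher scores mean better
    odds that the document is relevant."""
    score = 0.0
    for t, (p, s) in estimates.items():
        if t in doc_terms:
            score += math.log((p * (1 - s)) / (s * (1 - p)))
    return score

print(log_odds_of_relevance({"car", "maintenance", "tips"}))
print(log_odds_of_relevance({"car", "insurance"}))
```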

Unfortunately, probabilistic models can be very hard to build and program. Their complexity grows quickly, deterring many researchers and limiting the scalability of the engine based on the model. Probabilistic models also require several unrealistic simplifying assumptions, such as independence between terms as well as between documents. For instance, in this document the most likely word to follow information is the word retrieval, so it is not reasonable to assume that these terms are independent.

On the other hand, the probabilistic framework can naturally accommodate a priori preferences, so these models do offer the promise of tailoring search results to the preferences of individual users. For example, a user's query history can be incorporated into the probabilistic model's initial guess, which generally produces better query results than a democratic (uniform) guess.

Meta-Search Engines

Meta-search engines are based on the principle that while one search engine is good, two (or more) are better. One search engine may be great at a certain task, while a second search engine is better at another task. Thus, meta-search engines such as Copernic and SurfWax were created to simultaneously exploit the best features of many individual search engines. Meta-search engines send the user's query to several search engines at once and return the results from all of the engines in one long unified list. Some meta-search engines also include subject-specific search engines, which can be helpful when searching within one particular discipline. For example, Monster is an employment search engine.
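A toy sketch of the merging step appears below; the two "engines" are hypothetical stand-ins that return fixed ranked lists of URLs, whereas a real meta-search engine would query live services and handle timeouts, deduplication, and score normalization.

```python
# Hypothetical stand-ins for two search engines' ranked result lists.
def engine_a(query):
    return ["http://example.com/a1", "http://example.com/shared"]

def engine_b(query):
    return ["http://example.com/shared", "http://example.com/b2"]

def metasearch(query, engines):
    """Interleave results from several engines into one unified list,
    dropping duplicate URLs while preserving each engine's ranking."""
    results, seen = [], set()
    lists = [engine(query) for engine in engines]
    for rank in range(max(len(lst) for lst in lists)):
        for lst in lists:
            if rank < len(lst) and lst[rank] not in seen:
                seen.add(lst[rank])
                results.append(lst[rank])
    return results

print(metasearch("car maintenance", [engine_a, engine_b]))
```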

Amy Langville, "The Linear Algebra Behind Search Engines - Summary of Search Engine Models," Convergence (December 2005)