At annual information retrieval conferences, such as TREC (Harman & Voorhees, 1996) and CIR (Berry, 2001) for traditional information retrieval and WWW for web information retrieval, researchers present results that are used to compare the various information retrieval models underlying search engines and help the field progress toward better, more efficient search engines. The two most common measures used to differentiate the various search techniques are precision and recall. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the collection. The higher the precision and recall, the better the search engine.
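These two ratios can be computed directly from sets of document identifiers. A minimal sketch follows; the function names and the sample document IDs are illustrative, not drawn from any real test collection.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of the collection's relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)

# Hypothetical document-ID sets for one query.
retrieved = {1, 2, 3, 4, 5}   # documents returned by the engine
relevant = {2, 4, 6, 8}       # documents judged relevant to the query

print(precision(retrieved, relevant))  # 2 relevant of 5 retrieved -> 0.4
print(recall(retrieved, relevant))     # 2 of 4 relevant retrieved -> 0.5
```

Note that the two measures trade off against each other: retrieving every document in the collection drives recall to 1.0 while precision collapses, which is why both are reported together.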
Of course, search engines are tested on document collections with known parameters. For example, the commonly used test collection Medlars (2002), containing 5,831 keywords and 1,033 documents, has been examined so often that its properties are well-known. For instance, there are exactly 24 documents relevant to the phrase neoplasm immunology. Thus, the denominator of the recall ratio for a user query on neoplasm immunology is 24. If a search engine retrieved only 10 of these relevant documents for such a query, then a recall of 10/24 ≈ 0.4167 is reported.
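The arithmetic of this worked example can be checked in a couple of lines; the figures below are the ones stated in the text (24 relevant Medlars documents, 10 of them retrieved).

```python
# Recall for the neoplasm immunology query on the Medlars collection:
# 10 of the 24 known relevant documents were retrieved.
relevant_in_collection = 24
relevant_retrieved = 10

recall = relevant_retrieved / relevant_in_collection
print(f"{recall:.4f}")  # 0.4167
```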
Recall and precision are information retrieval-specific performance measures, but when evaluating any computer system, time and space are always performance issues. All else held constant, quick, memory-efficient search engines are preferred to slower, memory-inefficient engines. A search engine with fabulous recall and precision is useless if it requires 30 minutes to perform one query or stores the data on 75 supercomputers. Some other performance measures take a user-centered viewpoint and are aimed at assessing user satisfaction and frustration with the information system. Korfhage (1997) discusses these and several other measures for comparing search engines.