JOMA Information Retrieval Project--Experiment with an LSI Search Engine built in Matlab

Experiment with an LSI Search Engine Built in Matlab

Introduction

Different search engines have different features and capabilities. Here you will consider a very simple search engine that uses the mathematical software tool MATLAB. This primitive search engine works on a limited vocabulary -- that is, only particular words are in its term list. If a user enters a natural language query and includes words not in the search engine's limited vocabulary, these words are ignored.

A limited vocabulary can affect relevance and introduces another performance issue, pertinence. A limited vocabulary search engine may force a user to modify the initial query to fit into the required query format. The retrieved documents, which may be relevant to the modified query, may not be pertinent to initial intended query. The user thus may judge these "relevant" documents to be irrelevant. Here the failure of the information retrieval (IR) system is in the query formation process, not in the actual retrieval process.

A related performance issue is usefulness. A user may judge a relevant document to be not useful if its content is too elementary or is common knowledge for the user. The retrieval process works correctly, but the user's previous knowledge and preferences are not considered. Some recent research efforts incorporate user preferences by creating user profiles, which then guide the system in meeting a particular user's needs.

This small MATLAB search engine introduces another type of search engine, the limited vocabulary search engine, which, in turn, introduces the two new relevancy issues of pertinence and usefulness.

The main purpose of our MATLAB search engine is to demonstrate how searching for words translates into a mathematical problem with numbers and vectors. Thus, this search engine does not contain full documents, only document ID numbers and numerical information about each document. The point of this part of the module is not to create a full-fledged search engine, similar to commercial search engines on the Web, rather it is an educational tool. Please keep this limitation of the MATLAB search engine in mind as you proceed.

Preliminary Steps

Download the Data files and Matlab m-files. (Be patient -- this is a 38MB file.) Put these archived and compressed files in a folder/directory of your computer. (You may name this folder anything you wish, e.g., SearchEngine.) However, to run the MATLAB search engine you must open Matlab and change the current directory to the same directory as the SearchEngine folder containing the downloaded files. You can use the pwd command in Matlab to find what directory you are currently in, then use cd to change directories to the one named SearchEngine. If Matlab is not in the same directory as the SearchEngine folder, it will return error messages, because it can't find the data and m-files it needs to proceed.

Unzip the archived and zipped tar collections. Use the compression routine on your PC or Unix machine. Make sure all .Z files have been unzipped and placed in the SearchEngine folder.

Run the MATLAB search engine GUI by typing SEgui2_export at the matlab prompt. A gray GUI box should pop up.

Lab Steps

Step 1 of the GUI asks you to select a document collection to analyze. There are three choices: MED, CISI, and CRAN. MED is a collection of 1033 medical abstracts from the Medlars collection. CISI is a collection of 1460 information science abstracts. CRAN is a collection of 1398 aerodynamics abstracts from the Cranfield collection. Once you select a collection to analyze, the data for the entire collection are loaded, which may take a few seconds. Unless you want to change data collections, you need to load the data only once. The Xterm window displays some facts about your selected document collection.

Step 2 invites you to examine the term-by-document matrix A of your selected collection without viewing the entire matrix at once.

The spy button allows you to open a window with a two-dimensional plot of the nonzero elements of the upper left corner of the term-by-document matrix A. Very large term-by-document matrices have many, many zero elements. Such matrices are called sparse. Some sparse matrices have nonzeros located in a particular pattern. Sometimes discovering such a pattern makes matrix manipulations easier. The nonzero entries of these three term-by-document collections do not seem to follow any nice pattern, such as a triangular or diagonal pattern.

The view feature enables you to examine document vectors of your choice. You can view the elements of desired document vectors by first entering the chosen document vector ID numbers in the white text edit box, then clicking on the view button. The nonzero elements of each vector are listed in the Xterm window.

The plot feature allows you to plot these same chosen document vectors with a click of the plot button. The x-axis represents the index of the document vector while the y-axis represents the value of the vector at that index. For example, a blue spike at x = 1000 with y = 1 means that the 1000^th element of the document vector is 1.

The lengths of the chosen document vectors are computed as soon as the calculate length button is pressed. The results of these computations are displayed in the Xterm window.

Query Formulation

The MATLAB search engine allows two types of query formulation. You may choose to enter your own query or choose from a small set of preselected queries. For example, for the MED document collection, one preselected query option is infantile autism. If you decide to search for all documents related to infantile autism, the appropriate query is automatically created for you. The other option is to choose your own query. With this choice a small window, entitled "Inputs to create query vector", pops up and awaits your input. At the same time, a Netscape Navigator window opens. You may now enter any query you like by choosing from the restricted vocabulary list displayed on the Netscape Navigator page.

For example, with the MED collection, to search for abnormal absorption, you must enter the term ID associated with each term, 29 for abnormal and 44 for absorption. In the "Inputs to create query vector" window, line 1, enter the numbers 29 and 44 separated by spaces. In the second white text line, you can enter the weights associated with each term. You may choose, for example, to weight the term abnormal twice as much as the term absorption. One way to do this is to enter "2 1" in the second text line.

NOTE: Some versions of MATLAB may have trouble opening this Netscape Navigator page. In this case, you can manually open the webpage by entering the appropriate URL in your favorite web browser. The URL for the MED data is http://www.cs.utk.edu/~lsi/matrices/MED.terms.gz. For the CRAN and CISI data sets, replace MED in this URL with CRAN or CISI.

The View elements of q feature allows you to display the nonzero elements of the query vector q in the Xterm window.

Similar to the plot feature for the document vectors, you may also plot the query vector.

The final option in the query formulation part of the search engine is calculate length of q, which prints the length of q in the Xterm window.

Angle Calculation

The truncated SVD is used to approximate the term-by-document matrix A. Using A_k in place of A saves storage and reveals hidden semantic associations. Choose a value for k: 10, 25, 50, 100, 200.

The VSM sorts all documents according to their angle measure with the query. Only the top p most relevant documents are returned. Choose a value for p and enter it in the small white text edit box. Once you decide how many of the most relevant documents should be retrieved, calculate angles. Pressing this button computes the angle between each document vector and the query vector. The amount of CPU time required by this step is reported in the white text box labelled CPUtime. Notice how larger values of k require more calculations and thus more CPU time. The top p most relevant document ID numbers and their angle measure are displayed in the Xterm window.

You may choose to plot the three most relevant document vectors along with q to see if you can visually recognize any similarity. In all but the most simple cases, you can not discern any similarities visually, which emphasizes the power of mathematics and its ability to extend beyond 2 and 3 dimensions.

Finally, the two most common measures for assessing the quality of an IR system are presented. If you choose a preselected query, recall and precision are available and are reported in the white text boxes of section 4(d) of the GUI. However, if you enter your own query, recall and precision are not available. Remember that recall and precision are generally only available for well-studied document collections and queries. When you enter your own query, there is no way to tell exactly how many relevant documents pertain to your query without reading each and every document. Reread the section on recall and precision (page 3) for a review of this point. Practice using a preselected query with different values for k to see how precision and recall are affected. As k increases, what happens to the precision and recall? Does this agree with theory? Why or why not?