Experiment with an LSI Search Engine Built in Matlab
Different search engines have different features and capabilities. Here
you will consider a very simple search engine that uses the mathematical software tool
MATLAB. This primitive search engine works on a limited vocabulary -- that is, only
particular words are in its term list. If a user enters a natural language
query and includes words not in the search engine's limited vocabulary, these
words are ignored.
A limited vocabulary can affect relevance and introduces
performance issue, pertinence. A limited vocabulary search engine may force a
modify the initial query to fit into the required query format. The
retrieved documents, which may be relevant to the modified query, may not be
pertinent to initial intended query. The user thus may judge these "relevant"
documents to be irrelevant. Here the failure of the information retrieval (IR) system is in the query
formation process, not in the actual retrieval process.
performance issue is usefulness. A user may judge a relevant document to be not
useful if its content is too elementary or is common knowledge for the user.
The retrieval process works correctly, but the user's previous knowledge and
preferences are not considered. Some recent research efforts incorporate user
preferences by creating user profiles, which then guide the system in meeting a
particular user's needs.
This small MATLAB search engine introduces another type of search engine, the limited vocabulary search engine, which, in
turn, introduces the two new relevancy issues of pertinence and usefulness.
The main purpose of our MATLAB search engine is to demonstrate how
searching for words translates into a mathematical problem with numbers and
vectors. Thus, this search engine does not contain full documents,
only document ID numbers and numerical information about each document. The
point of this part of the module is not to create a full-fledged search engine,
similar to commercial search engines on the Web, rather it is an educational
tool. Please keep this limitation of the MATLAB search engine in mind as you
- Download the Data files and Matlab m-files.
(Be patient -- this is a 38MB file.) Put
these archived and compressed files in a folder/directory of your computer.
(You may name this folder anything you wish, e.g., SearchEngine.) However, to run the MATLAB search engine you must open Matlab and change the current directory to the same directory as the SearchEngine folder containing the downloaded files. You can use the pwd command in Matlab to find what directory you are currently in, then use cd to change directories to the one named SearchEngine. If Matlab is not in the same directory as the SearchEngine folder, it will return error messages, because it can't find the data and m-files it needs to proceed.
- Unzip the archived and zipped tar collections. Use the
compression routine on your PC or Unix machine. Make sure all .Z files have
been unzipped and placed in the SearchEngine folder.
- Run the MATLAB search engine GUI by typing SEgui2_export at the matlab prompt. A gray GUI box should pop up.
- Step 1 of the GUI asks you to select a document collection to analyze.
There are three choices: MED, CISI, and CRAN. MED is a collection of 1033
medical abstracts from the Medlars collection. CISI is a collection of 1460
information science abstracts. CRAN is a collection of 1398 aerodynamics
abstracts from the Cranfield collection. Once you select a collection to
analyze, the data for the entire collection are loaded, which may take a few
seconds. Unless you want to change data collections, you need to load the
data only once. The Xterm window displays some facts about your selected document
- Step 2 invites you to examine the term-by-document matrix A of your
selected collection without viewing the entire matrix at once.
- The spy button allows you to open a window with a two-dimensional
plot of the nonzero elements of the upper left corner of the term-by-document
matrix A. Very large term-by-document matrices have many, many zero elements.
Such matrices are called sparse. Some sparse matrices have
nonzeros located in a particular pattern. Sometimes discovering such a pattern
makes matrix manipulations easier. The nonzero entries of these three term-by-document
collections do not seem to follow any nice pattern, such as a triangular
or diagonal pattern.
- The view feature enables you to examine document vectors of your choice.
You can view the elements of desired document vectors by first entering the
chosen document vector ID numbers in the white text edit box, then clicking on
the view button. The nonzero elements of each vector are listed in
the Xterm window.
- The plot feature allows you to plot these same chosen document vectors with
a click of the plot button. The x-axis represents the index of the
document vector while the y-axis represents the value of the vector at that
index. For example, a blue spike at x = 1000 with y = 1 means that the
1000th element of the document vector is 1.
- The lengths of the chosen document vectors are computed as soon as the
calculate length button is pressed. The results of these
computations are displayed in the Xterm window.
- The MATLAB search engine allows two types of query formulation. You may
choose to enter your own query or choose from a small set of preselected
queries. For example, for the MED document collection, one preselected query
option is infantile autism. If you decide to search for all documents
related to infantile autism, the appropriate query is automatically created for you. The
other option is to choose your own query. With this choice a small
window, entitled "Inputs to create query vector", pops up and
awaits your input. At the same time, a Netscape Navigator window opens. You may now enter any query you like by choosing
from the restricted vocabulary list displayed on the Netscape Navigator page.
example, with the MED collection, to search for abnormal absorption,
you must enter the term ID associated with each term, 29 for
abnormal and 44 for absorption. In
the "Inputs to create query vector" window, line 1, enter the numbers 29 and
44 separated by spaces. In the second white text line, you can enter the
weights associated with each term. You may choose, for example, to weight the term
abnormal twice as much as the term absorption. One way to do
this is to enter "2 1" in the second text line.
NOTE: Some versions of
MATLAB may have trouble opening this Netscape Navigator page. In this case,
you can manually open the webpage by entering the appropriate URL in your
favorite web browser. The URL for the MED data is http://www.cs.utk.edu/~lsi/matrices/MED.terms.gz.
For the CRAN and CISI data sets, replace MED in this URL with CRAN
- The View elements of q feature allows you to display the nonzero elements of the query vector
q in the Xterm window.
- Similar to the plot feature for the document vectors, you may also
- The final option in the query formulation part of the search engine
is calculate length of q, which prints the length of q in the Xterm window.
- The truncated SVD is used to approximate the term-by-document matrix A.
Using Ak in place of A saves storage and reveals hidden semantic
associations. Choose a value for k: 10, 25, 50, 100, 200.
- The VSM sorts all documents according to their angle measure with the
query. Only the top p most relevant documents are returned. Choose a value
for p and enter it in the small white text edit box. Once you decide how many
of the most relevant documents should be retrieved, calculate angles.
Pressing this button computes the angle between each document vector and the
query vector. The amount of CPU time required by this step is reported in the
white text box labelled CPUtime. Notice how larger values of k
require more calculations and thus more CPU time. The top p most relevant
document ID numbers and their angle measure are displayed in the Xterm window.
- You may choose to plot the three most relevant document vectors along with
q to see if you can visually recognize any similarity. In all but the most
simple cases, you can not discern any similarities visually, which
emphasizes the power of mathematics and its ability to extend beyond 2 and 3
- Finally, the two most common measures for assessing the quality of an IR
system are presented. If you choose a preselected query, recall and
precision are available and are reported in the white text boxes of
section 4(d) of the GUI. However, if you enter your own query, recall and precision
are not available. Remember that recall and precision are generally only
available for well-studied document collections and queries. When you enter
your own query, there is no way to tell exactly how many relevant documents
pertain to your query without reading each and every document.
Reread the section on recall and precision (page 3) for a review of this point. Practice using a preselected
query with different values for k to see how precision and recall are
affected. As k increases, what happens to the precision and recall?
Does this agree with theory? Why or why not?