Experiment with an LSI Search Engine Built in MATLAB
Introduction
Different search engines have different features and capabilities. Here you will consider a very simple search engine that uses the mathematical software tool MATLAB. This primitive search engine works on a limited vocabulary -- that is, only particular words are in its term list. If a user enters a natural language query and includes words not in the search engine's limited vocabulary, these words are ignored.
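The vocabulary filtering described above can be sketched in a few lines. This is a minimal illustration in Python, not the module's MATLAB code. The term list, the function name filter_query, and the rule for splitting a query into words are all assumptions made for the example.

```python
import re

# Hypothetical term list; the real engine's vocabulary is not shown here.
TERM_LIST = ["blood", "cell", "heart", "pressure", "oxygen"]


def filter_query(query, term_list):
    """Return the query words that appear in the term list, in query order.

    Words outside the limited vocabulary are silently dropped, which is
    the behavior described for the limited vocabulary search engine.
    """
    vocabulary = set(term_list)
    # Assumed tokenization: lowercase the query and keep runs of letters.
    words = re.findall(r"[a-z]+", query.lower())
    return [word for word in words if word in vocabulary]


kept = filter_query("What affects blood pressure in the heart?", TERM_LIST)
print(kept)  # ['blood', 'pressure', 'heart']
```

Note that "what", "affects", "in", and "the" vanish without any warning to the user, so the query the engine actually processes can differ from the one the user had in mind.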
A limited vocabulary can affect relevance and introduces another performance issue, pertinence. A limited vocabulary search engine may force a user to modify the initial query to fit the required query format. The retrieved documents, which may be relevant to the modified query, may not be pertinent to the query the user originally intended. The user may thus judge these "relevant" documents to be irrelevant. Here the failure of the information retrieval (IR) system lies in the query formation process, not in the actual retrieval process.
A related performance issue is usefulness. A user may judge a relevant document to be not useful if its content is too elementary or is common knowledge for the user. The retrieval process works correctly, but the user's previous knowledge and preferences are not considered. Some recent research efforts incorporate user preferences by creating user profiles, which then guide the system in meeting a particular user's needs.
This small MATLAB search engine is an example of another type of search engine, the limited vocabulary search engine, which in
turn raises two new relevance issues: pertinence and usefulness.
The main purpose of our MATLAB search engine is to demonstrate how
searching for words translates into a mathematical problem with numbers and
vectors. Thus, this search engine does not contain full documents,
only document ID numbers and numerical information about each document. The
point of this part of the module is not to create a full-fledged search engine
similar to commercial search engines on the Web; rather, the engine is an
educational tool. Please keep this limitation of the MATLAB search engine in
mind as you proceed.
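To make the idea of turning words into numbers and vectors concrete, here is a minimal sketch in Python rather than MATLAB. It uses a plain vector space model with cosine similarity and leaves out the dimension reduction step that LSI adds. The term list, document IDs, term counts, and function names are all made up for illustration and are not taken from the module's data sets.

```python
import math

# Hypothetical vocabulary and per-document term counts (in TERMS order).
# Only document IDs and numbers are stored, never the document text.
TERMS = ["blood", "cell", "heart", "pressure"]
DOC_VECTORS = {
    1: [2, 0, 1, 1],
    2: [0, 3, 0, 0],
    3: [1, 1, 0, 1],
}


def query_vector(words, terms):
    """Encode a query as a 0/1 vector over the term list."""
    return [1 if term in words else 0 for term in terms]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(words, terms, doc_vectors):
    """Return (doc_id, score) pairs sorted from best to worst match."""
    q = query_vector(words, terms)
    scores = [(doc_id, cosine(q, vec)) for doc_id, vec in doc_vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = rank(["blood", "pressure"], TERMS, DOC_VECTORS)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

With these made-up counts, document 1 ranks first, document 3 second, and document 2 last with a score of zero because it shares no terms with the query. The search has become a comparison of angles between vectors, and the user sees only document ID numbers and scores.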
NOTE: Some versions of MATLAB may have trouble opening this Netscape Navigator page. In this case, you can manually open the webpage by entering the appropriate URL in your favorite web browser. The URL for the MED data is http://www.cs.utk.edu/~lsi/matrices/MED.terms.gz. For the CRAN and CISI data sets, replace MED in this URL with CRAN or CISI.