Experiment with an LSI Search Engine Built in MATLAB

Introduction

Different search engines have different features and capabilities. Here you will consider a very simple search engine that uses the mathematical software tool MATLAB. This primitive search engine works on a limited vocabulary -- that is, only particular words are in its term list. If a user enters a natural language query and includes words not in the search engine's limited vocabulary, these words are ignored.
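
The way a limited vocabulary discards query words can be sketched in a few lines. The example below is an illustration in Python, not part of the MATLAB search engine itself; the term list, the sample query, and the function name filter_query are all hypothetical, and the sketch assumes simple lowercase matching with no stemming.

```python
import re

# Hypothetical term list; the real search engine's vocabulary is not shown here.
TERM_LIST = ["blood", "cell", "disease", "growth", "patient"]

def filter_query(query, term_list):
    """Split a query into lowercase words and separate those in the term list
    from those that a limited-vocabulary engine would ignore."""
    vocabulary = set(term_list)
    words = re.findall(r"[a-z]+", query.lower())
    kept = [w for w in words if w in vocabulary]
    ignored = [w for w in words if w not in vocabulary]
    return kept, ignored

kept, ignored = filter_query("How does the disease affect blood cell growth?", TERM_LIST)
print("kept:   ", kept)
print("ignored:", ignored)
```

Only the words found in the term list survive, so the query the engine actually processes can differ noticeably from the one the user typed.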

A limited vocabulary can affect relevance and introduces another performance issue, pertinence. A limited-vocabulary search engine may force a user to modify the initial query to fit the required query format. The retrieved documents, which may be relevant to the modified query, may not be pertinent to the initial intended query. The user thus may judge these "relevant" documents to be irrelevant. Here the failure of the information retrieval (IR) system is in the query formation process, not in the actual retrieval process.

A related performance issue is usefulness. A user may judge a relevant document to be not useful if its content is too elementary or is common knowledge for the user. The retrieval process works correctly, but the user's previous knowledge and preferences are not considered. Some recent research efforts incorporate user preferences by creating user profiles, which then guide the system in meeting a particular user's needs.

This small MATLAB search engine thus illustrates another type of search engine, the limited-vocabulary search engine, which in turn raises two new relevancy issues: pertinence and usefulness.

The main purpose of our MATLAB search engine is to demonstrate how searching for words translates into a mathematical problem with numbers and vectors. Thus, this search engine does not contain full documents, only document ID numbers and numerical information about each document. The point of this part of the module is not to create a full-fledged search engine similar to commercial search engines on the Web; rather, the search engine is an educational tool. Please keep this limitation in mind as you proceed.
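
As a rough picture of how words become numbers and vectors, the following Python sketch stores each document as an ID paired with a vector of term counts. It is an illustration only, not the MATLAB search engine's code; the term list, the two tiny documents, and the helper to_vector are invented for the example, and raw counts are just one plausible choice of numerical information.

```python
# Hypothetical term list and documents, invented for illustration only.
TERM_LIST = ["blood", "cell", "disease", "growth", "patient"]

def to_vector(words, term_list):
    """Count how often each term-list word occurs; other words contribute nothing."""
    return [words.count(term) for term in term_list]

documents = {
    1: "blood cell growth in patient blood".split(),
    2: "disease growth".split(),
}

# Each document is reduced to an ID and a vector of numbers.
doc_vectors = {doc_id: to_vector(words, TERM_LIST) for doc_id, words in documents.items()}

# A query is turned into a vector over the same term list.
query_vector = to_vector("blood disease".split(), TERM_LIST)

for doc_id, vector in doc_vectors.items():
    print(doc_id, vector)
print("query", query_vector)
```

Once documents and queries live in the same vector space, comparing a query with a document becomes a computation on vectors rather than a search through text.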


Preliminary Steps

Lab Steps
  1. Step 1 of the GUI asks you to select a document collection to analyze. There are three choices: MED, CISI, and CRAN. MED is a collection of 1033 medical abstracts from the Medlars collection. CISI is a collection of 1460 information science abstracts. CRAN is a collection of 1398 aerodynamics abstracts from the Cranfield collection. Once you select a collection to analyze, the data for the entire collection are loaded, which may take a few seconds. Unless you want to change data collections, you need to load the data only once. The Xterm window displays some facts about your selected document collection.

  2. Step 2 invites you to examine the term-by-document matrix A of your selected collection without viewing the entire matrix at once.

  3. Query Formulation

  4. Angle Calculation
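
To make the roles of the term-by-document matrix A and the angle calculation concrete, here is a minimal Python sketch of the standard vector space comparison: each column of A is a document, and the angle between a query vector and a column measures how closely they match. The small matrix, the query, and the helper names are hypothetical, and the sketch deliberately omits any rank reduction that an LSI engine would apply to A; it is not the code behind the MATLAB GUI.

```python
import math

# Hypothetical 4-term by 3-document matrix A: rows are terms, columns are documents.
A = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]

# Hypothetical query vector: the query uses the first two terms only.
q = [1, 1, 0, 0]

def column(matrix, j):
    """Return column j of the matrix as a list (one document vector)."""
    return [row[j] for row in matrix]

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def angles_in_degrees(matrix, query):
    """Angle between the query and each document column, in degrees."""
    angles = []
    for j in range(len(matrix[0])):
        c = cosine(query, column(matrix, j))
        c = max(-1.0, min(1.0, c))  # guard against rounding slightly outside [-1, 1]
        angles.append(math.degrees(math.acos(c)))
    return angles

angles = angles_in_degrees(A, q)

# Rank document numbers (1-based) from smallest angle to largest.
ranking = sorted(range(1, len(angles) + 1), key=lambda d: angles[d - 1])
print([round(a, 3) for a in angles])
print(ranking)
```

A smaller angle means the document vector points in nearly the same direction as the query vector, so documents are typically returned in order of increasing angle.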