Modeling Similarity in the Age of Data

MAA Distinguished Lecture: Kevin McCurley

The Internet is such an integral part of our society that online search now seems a natural extension of human thought. With Google’s search engine at the forefront, the verb "google" has become synonymous with any online search. At a recent MAA Distinguished Lecture, Google research scientist Kevin McCurley described some of the mathematics that goes into generating good search results and gave insights into the more difficult task of coming up with "similar" results.

With more than one trillion URLs online, building a searchable index that generates useful results requires a multi-pronged attack to counter a variety of obstacles and opponents, including those bent on skewing search results. When he started at Google in 2005, McCurley quickly realized that his work countering such opponents could be described as "adversarial computing."

To illustrate the complexities of search, McCurley showed the results of a Google search on "mathematics." He noted the two success criteria for information retrieval: precision, returning only documents relevant to the original query, and recall, returning all documents relevant to the query. Of the two, precision is the more important criterion, he said. With the glut of information available online, providing a user with hundreds of thousands of documents to search through is not helpful.
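The two criteria can be made concrete with a small calculation. The sketch below uses the standard textbook definitions (precision is the fraction of returned documents that are relevant; recall is the fraction of relevant documents that are returned). The document ids and the function name precision_recall are hypothetical, chosen only for illustration, and nothing here reflects how Google measures its own results.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: documents the search system returned
    relevant:  documents a human judge considers relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned

    # Guard against empty sets to avoid division by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = ["d1", "d2", "d9"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case two of the three returned documents are relevant, while only two of the four relevant documents were found, so precision is higher than recall. Returning every document on the web would push recall to 1 while driving precision toward 0, which is the situation McCurley described as unhelpful.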

Google and other search engines are successful if researchers can develop and continually improve algorithms that quickly pinpoint relevant material and eliminate irrelevant skewing factors. Indeed, a variety of mathematical procedures, going well beyond Google’s original PageRank algorithm, go into “Google’s secret sauce,” McCurley said.

Taking the results of a basic Google search, McCurley then examined the "similar" link that can be found under every result.

"We had a very long period where we trained users not to click on [the “similar” link] by giving them very bad results," McCurley said. "That was very effective, but I have worked on this recently, and I'd say it's better than it used to be."

McCurley worked on the team that developed a new algorithm to generate about thirty documents similar, but not too similar, to a result that might come up in a user's original search. He recounted a dizzying history of development, which wandered into the literature in economics and mathematical psychology to answer fundamental questions such as "What is similarity?"

Providing users with something of value on specific topics means, for example, eliminating exact or nearly exact duplicates. The task also requires coping with language that is inherently inaccurate and with diverse vocabularies that describe the same thing.
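One common way to spot exact or nearly exact duplicates is to compare overlapping word sequences ("shingles") with the Jaccard similarity. The lecture summary does not say which technique McCurley's team used, so the sketch below is an assumption chosen for illustration; the function names, the shingle length k, and the 0.7 threshold are all hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences that occur in text."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.7, k=3):
    """Keep each document only if it is not too similar to one already kept."""
    kept = []
    kept_shingles = []
    for doc in docs:
        s = shingles(doc, k)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


# Hypothetical snippets, for illustration only.
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",
    "an introduction to number theory and cryptography",
]
unique = drop_near_duplicates(docs)
print(unique)
```

The second snippet differs from the first by one word and shares six of eight distinct shingles, so it is dropped; the third shares none and is kept. This quadratic comparison is fine for a handful of documents but would not scale to a web-sized index without further techniques such as hashing.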

After several years of trial and error, the research team developed a working model using a combination of co-citation, rank ids, rank aggregation, and responses from individual users. You can test their model anytime by clicking through “similar” links on Google.
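Two of the ingredients named above, co-citation and rank aggregation, can be illustrated in a few lines. In the sketch below, two pages are counted as co-cited each time some third page links to both, and several ranked lists are merged with a Borda count. Both are generic textbook techniques used here as assumptions; the article does not describe the team's actual formulas, and the page names, link data, and function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cocitation_counts(outlinks):
    """Count, for each pair of pages, how many pages link to both.

    outlinks maps a citing page to the list of pages it links to.
    """
    counts = Counter()
    for targets in outlinks.values():
        # Sort so each pair is always stored in the same order.
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts


def borda_aggregate(rankings):
    """Merge ranked lists: an item at position i of an n-item list earns n - i points."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical link data: p1, p2, p3 are citing pages; a, b, c are cited pages.
outlinks = {"p1": ["a", "b", "c"], "p2": ["a", "b"], "p3": ["b", "c"]}
counts = cocitation_counts(outlinks)

# Hypothetical rankings of the same candidates from three different signals.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["b", "c", "a"]]
merged = borda_aggregate(rankings)
print(counts, merged)
```

Pages that are frequently cited together are plausible candidates for "similar" results, and aggregation gives a way to reconcile signals that disagree about ordering. How such signals are weighted against user responses in practice is not stated in the lecture summary.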

"We're just at the beginning of understanding this," McCurley concluded, "but at least the results are a bit better now through some applied mathematics."

McCurley's presentation was a part of the MAA's Distinguished Lecture Series, sponsored by the National Security Agency. —L. McHugh

Listen to the full lecture here.