If you have ever thought that Google Inc. might be taking over the world, you should know that PageRank might be responsible. According to Google’s own website, the heart of their software is PageRank and it continues to provide the basis for all their web search tools.
Google’s PageRank and Beyond (subtitled The Science of Search Engine Rankings) describes the link analysis tool called PageRank, puts it in the context of web search engines and information retrieval, and describes competing methods for ranking webpages. It is an utterly engaging book, especially for one that depends so heavily on linear algebra to drive the plot. Indeed, it provides excellent motivation for learning more than just the basics of linear algebra.
The book is directed toward at least two different audiences. The first five chapters are accessible to a curious general reader. Succeeding chapters require increasing amounts of mathematical sophistication. There are (rigorous) proofs beginning at the end of Chapter 4 and an extensive final chapter called “The Mathematics Guide” with details and background from linear algebra, Markov chains and graph theory. Scattered throughout are a fascinating Asides and Tips. Have you ever wondered how search engines make money? Do you know how to find all the pages that link to yours? What is link spamming? How does Google deal with spammers that attempt to inflate their search rankings? It’s all here.
How does PageRank operate? It is pretty simple, perhaps deceptively so. The whole algorithm is based on the notion that a webpage is important if it is pointed to by other important pages. So one builds a hyperlink matrix whose entries encode — in a normalized way — the number of outlinks from each webpage. This leads to a large, sparse matrix that looks a lot like a stochastic transition matrix for a Markov chain. However, because of “dangling nodes”, webpages that have inlinks but not outlinks (e.g., pdf files, image files, etc.), this matrix has some rows that are all zeroes. To deal with this, the architects of PageRank invented the random surfer. When he gets stuck in a dangling node, the random surfer randomly jumps to a new webpage. With this addition, the hyperlink matrix becomes stochastic, but this is not quite enough. The PageRank is ultimately computed by iteration, beginning with the assumption that all pages have the same rank. However, the iteration need not converge to a unique solution without one more adjustment. This piece is called teleportation and permits the random surfer to jump — on occasion, perhaps when bored — to a new webpage that is not linked to his current location. Now we have a matrix — the Google matrix — that’s the sum of a sparse matrix and a rank one matrix; it is stochastic, irreducible and aperiodic. These properties guarantee that, for any starting vector, the power method applied to the Google matrix will converge to a unique stationary vector.
Ordinarily, the power method converges very slowly, but the special structure of the Google matrix dramatically speeds the computation. In addition, the power method does not require manipulation and re-storage of matrix entries. This is important because the Google matrix is huge. Google page rankings are said to be recomputed once a month in a process called the Google Dance. Page rankings dance up and down during the three days of the update calculation.
It is doubtful that PageRank’s inventors thought in terms of Markov chains. The dominant metaphor for them seems to have been the random surfer, and the pressing need was for a convergent and computationally manageable algorithm. However, it would be a mistake to think the approach was ad hoc since it appears to have been strongly grounded in bibliometrics, the analysis of the citation structure among academic papers.
Although Google’s success has led to a dominant market position, it’s important to realize that there are other very capable search engines. The authors describe a few of them. One of the alternatives is HITS (Hypertext Induced Topic Search), which is the basis for the Teoma web search engine’s popularity ranking. Whereas Google’s search is query-independent (so scores are computed offline and remain the same until the next monthly page ranking), HITS creates a neighborhood graph for each query. HITS uses the metaphor of authorities and hubs. Authorities are pages with many inlinks and hubs are pages with many outlinks. HITS is based on the principle that good authorities are pointed to by good hubs, and good hubs point to good authorities. Accordingly, HITS uses both an authority score and a hub score for each page.
One of the strengths that the authors brought to this book was the ability to capture both the important details of web search engines as well as the broader context of information retrieval. A chapter toward the end of the book — one quite accessible to a general reader — discusses the future of web information retrieval in the light of Google search and its competitors. Among other topics, it describes personalization of searches, intelligent agents, time sensitive search, privacy, security and the challenges of link spamming.
This is a book with broad appeal to those interested in web search engines, in Google itself, or in a fascinating application of linear algebra.
Bill Satzer (email@example.com) is a senior intellectual property scientist at 3M Company, having previously been a lab manager at 3M for composites and electromagnetic materials. His training is in dynamical systems and particularly celestial mechanics; his current interests are broadly in applied mathematics and the teaching of mathematics.