Imagine a library containing 25 billion documents but with no centralized organization and no librarians. In addition, anyone may add a document at any time without telling anyone. You may feel sure that one of the documents contained in the collection has a piece of information that is vitally important to you, and, being impatient like most of us, you'd like to find it in a matter of seconds. How would you go about doing it? Posed in this way, the problem seems impossible. Yet this description is not too different from the World Wide Web, a huge, highly disorganized collection of documents in many different formats. Of course, we're all familiar with search engines (perhaps you found this article using one), so we know that there is a solution. This article will describe Google's PageRank algorithm and the mathematics it uses to pick, out of 25 billion documents, the pages most likely to answer your query.
How to tell who's important
If you've ever created a web page, you've probably included links to other pages that contain valuable, reliable information. By doing so, you are affirming the importance of the pages you link to. Google's PageRank algorithm stages a monthly popularity contest among all pages on the web to decide which pages are most important. The fundamental idea put forth by PageRank's creators, Sergey Brin and Lawrence Page, is this: the importance of a page is judged by the number of pages linking to it as well as their importance. We will assign to each web page $P$ a measure of its importance $I(P)$, called the page's PageRank. At various sites, you may find an approximation of a page's PageRank. (For instance, the home page of The American Mathematical Society currently has a PageRank of 8 on a scale of 10. Can you find any pages with a PageRank of 10?) This reported value is only an approximation, since Google declines to publish actual PageRanks in an effort to frustrate those who would manipulate the rankings. Here's how the PageRank is determined. Suppose that page $P_j$ has $l_j$ links. If one of those links is to page $P_i$, then $P_j$ will pass on $1/l_j$ of its importance to $P_i$. The importance ranking of $P_i$ is then the sum of all the contributions made by pages linking to it. That is, if we denote the set of pages linking to $P_i$ by $B_i$, then

$$I(P_i) = \sum_{P_j \in B_i} \frac{I(P_j)}{l_j}.$$

This may look like a circular definition, since each ranking is defined in terms of other, equally unknown rankings, but we will see that it can be computed all the same.
It helps to recast this in terms of matrices. Form the hyperlink matrix $H = [H_{ij}]$ in which $H_{ij} = 1/l_j$ if $P_j \in B_i$ and $H_{ij} = 0$ otherwise. The defining equation above then says precisely that $I = HI$; in other words, the PageRank vector $I$ is an eigenvector of $H$ with eigenvalue 1, which we call the stationary vector of $H$. [Figure: an example web of eight pages, with its hyperlink matrix $H$ and stationary vector $I$.]
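To make the definition concrete, here is a minimal sketch in Python of building $H$ for a tiny web and checking that the PageRank vector is its stationary vector. The three-page web and all names here are illustrative assumptions of mine, not anything from the article itself:

```python
import numpy as np

# A hypothetical three-page web: page 0 links to pages 1 and 2,
# page 1 links to page 2, and page 2 links back to page 0.
links = {0: [1, 2], 1: [2], 2: [0]}
n = 3

# Hyperlink matrix: H[i][j] = 1/l_j if page j links to page i, else 0.
H = np.zeros((n, n))
for j, targets in links.items():
    for i in targets:
        H[i, j] = 1.0 / len(targets)

# The PageRank vector I satisfies I = HI: it is the eigenvector of H
# for the eigenvalue 1, normalized so that its entries sum to one.
eigenvalues, eigenvectors = np.linalg.eig(H)
k = np.argmin(np.abs(eigenvalues - 1.0))
I = np.real(eigenvectors[:, k])
I = I / I.sum()
print(I)  # approximately [0.4, 0.2, 0.4]
```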
Computing I
There are many ways to find the eigenvectors of a square matrix. However, we are in for a special challenge, since the matrix $H$ is a square matrix with one column for each web page indexed by Google. This means that $H$ has about $n = 25$ billion columns and rows. However, most of the entries in $H$ are zero; in fact, studies show that web pages have an average of about 10 links, meaning that, on average, all but 10 entries in each column are zero. We will choose a method known as the power method for finding the stationary vector $I$ of the matrix $H$. How does the power method work? We begin by choosing a vector $I^0$ as a candidate for $I$ and then producing a sequence of vectors $I^k$ by

$$I^{k+1} = HI^k.$$

The method is founded on the following general principle, which we will soon investigate.
General principle: The sequence $I^k$ will converge to the stationary vector $I$.
For the eight-page example web, the iteration produces the following (each column is one iterate $I^k$; each row is one page):

| $I^0$ | $I^1$ | $I^2$ | $I^3$ | $I^4$ | ... | $I^{60}$ | $I^{61}$ |
|-------|-------|-------|-------|--------|-----|----------|----------|
| 1 | 0 | 0 | 0 | 0.0278 | ... | 0.06 | 0.06 |
| 0 | 0.5 | 0.25 | 0.1667 | 0.0833 | ... | 0.0675 | 0.0675 |
| 0 | 0.5 | 0 | 0 | 0 | ... | 0.03 | 0.03 |
| 0 | 0 | 0.5 | 0.25 | 0.1667 | ... | 0.0675 | 0.0675 |
| 0 | 0 | 0.25 | 0.1667 | 0.1111 | ... | 0.0975 | 0.0975 |
| 0 | 0 | 0 | 0.25 | 0.1806 | ... | 0.2025 | 0.2025 |
| 0 | 0 | 0 | 0.0833 | 0.0972 | ... | 0.18 | 0.18 |
| 0 | 0 | 0 | 0.0833 | 0.3333 | ... | 0.295 | 0.295 |
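The link structure of the eight-page example lived in a figure that is not reproduced here, so the following hedged sketch instead runs the same iteration $I^{k+1} = HI^k$ on the hypothetical three-page web from earlier:

```python
import numpy as np

def power_method(H, steps):
    """Iterate I^{k+1} = H I^k starting from I^0 = (1, 0, ..., 0)."""
    I = np.zeros(H.shape[0])
    I[0] = 1.0
    for _ in range(steps):
        I = H @ I
    return I

# Hyperlink matrix of the hypothetical three-page web used above.
H = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])

print(power_method(H, 100))  # converges to [0.4, 0.2, 0.4]
```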
Three important questions
Three questions naturally come to mind:
- Does the sequence $I^k$ always converge?
- Is the vector to which it converges independent of the initial vector $I^0$?
- Do the importance rankings contain the information that we want?
Given the current method, the answer to all three questions is "No!" However, we'll see how to modify our method so that we can answer "yes" to all three. Let's first look at a very simple example. Consider the following small web consisting of two web pages, one of which links to the other:
[Figure: two web pages, the first linking to the second,] with matrix

$$H = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}.$$
| $I^0$ | $I^1$ | $I^2$ | $I^3 = I$ |
|-------|-------|-------|-----------|
| 1 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 |

Here the importance drains out of the web entirely: the iteration stabilizes at the zero vector, which tells us nothing about the relative importance of the two pages.
A probabilistic interpretation of H
Imagine that we surf the web at random; that is, when we find ourselves on a web page, we randomly follow one of its links to another page after one second. For instance, if we are on page $P_j$ with $l_j$ links, one of which takes us to page $P_i$, the probability that we next end up on page $P_i$ is then $1/l_j$. As we surf randomly, we will denote by $T_j$ the fraction of time that we spend on page $P_j$. Then the fraction of the time that we end up on page $P_i$ coming from $P_j$ is $T_j/l_j$. If we end up on $P_i$, we must have come from a page linking to it. This means that

$$T_i = \sum_{P_j \in B_i} \frac{T_j}{l_j},$$

where the sum is over all the pages $P_j$ linking to $P_i$. Notice that this is the same equation defining the PageRank rankings, and so $I(P_i) = T_i$. This allows us to interpret a web page's PageRank as the fraction of time that a random surfer spends on that web page. This may make sense if you have ever surfed around for information about a topic you were unfamiliar with: if you follow links for a while, you find yourself coming back to some pages more often than others. Just as "All roads lead to Rome," these are typically more important pages. Notice that, given this interpretation, it is natural to require that the sum of the entries in the PageRank vector $I$ be one. Of course, there is a complication in this description: if we surf randomly, at some point we will surely get stuck at a dangling node, a page with no outgoing links. To keep the surfer moving, we modify $H$ by replacing each column of zeros (one for each dangling node) with a column in which every entry is $1/n$; when the surfer reaches a dangling node, she simply jumps to any page on the web with equal probability. Call this new matrix $S$.
For the two-page example, this gives

$$S = \begin{pmatrix} 0 & \tfrac{1}{2} \\ 1 & \tfrac{1}{2} \end{pmatrix} \quad\text{with eigenvector}\quad I = \begin{pmatrix} \tfrac{1}{3} \\ \tfrac{2}{3} \end{pmatrix},$$

so the second page now carries twice the importance of the first, which feels about right for this small web.
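We can check this interpretation by simulation. The sketch below (my own illustration; the surfing loop and names are assumptions) follows a random surfer on the repaired two-page web and reports the fraction of time spent on each page:

```python
import random

# Two-page web repaired as in S: page 0 links to page 1, and from the
# dangling page 1 the surfer jumps to either page with equal probability.
def surf(steps=1_000_000):
    time_on = [0, 0]
    page = 0
    for _ in range(steps):
        time_on[page] += 1
        page = 1 if page == 0 else random.randrange(2)
    return [t / steps for t in time_on]

print(surf())  # approximately [0.333, 0.667], the eigenvector of S
```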
How does the power method work?
In general, the power method is a technique for finding an eigenvector of a square matrix corresponding to the eigenvalue with the largest magnitude. In our case, we are looking for an eigenvector of $S$ corresponding to the eigenvalue 1. Under the best of circumstances, to be described soon, the other eigenvalues of $S$ will have a magnitude smaller than one; that is, if the eigenvalues are $\lambda_1, \lambda_2, \ldots, \lambda_n$, then

$$\lambda_1 = 1 \quad\text{and}\quad |\lambda_j| < 1 \text{ for } j \geq 2.$$

We will also assume that there is a basis $v_1, v_2, \ldots, v_n$ of eigenvectors for $S$ with corresponding eigenvalues $\lambda_j$. This assumption is not necessarily true, but with it we may more easily illustrate how the power method works. We may write our initial vector $I^0$ as

$$I^0 = c_1 v_1 + c_2 v_2 + \cdots + c_n v_n.$$

Then

$$I^1 = SI^0 = c_1 v_1 + c_2 \lambda_2 v_2 + \cdots + c_n \lambda_n v_n,$$

and, continuing in this way,

$$I^k = S^k I^0 = c_1 v_1 + c_2 \lambda_2^k v_2 + \cdots + c_n \lambda_n^k v_n.$$

Since the eigenvalues $\lambda_j$ with $j \geq 2$ have magnitude smaller than one, it follows that $\lambda_j^k \to 0$ for $j \geq 2$, and therefore $I^k \to c_1 v_1$, an eigenvector corresponding to the eigenvalue 1. It is important to note here that the rate at which $I^k \to I$ is determined by $|\lambda_2|$: when $|\lambda_2|$ is relatively close to 0, then $I^k \to I$ relatively quickly. For instance, consider a matrix whose second eigenvalue $\lambda_2$ is small in magnitude. [Figure: the vectors $I^k$, shown in red, converging quickly to the stationary vector $I$, shown in green.] Now consider a matrix whose second eigenvalue has a larger magnitude. Notice how the vectors $I^k$ converge more slowly to the stationary vector $I$ in this example. [Figure: the slower convergence of $I^k$ to $I$ when $|\lambda_2|$ is larger.]
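Since the two illustrative matrices and their plots are not reproduced here, the following sketch makes the same point numerically with two stand-in stochastic matrices of my own choosing, one with $\lambda_2 = 0.3$ and one with $\lambda_2 = 0.7$:

```python
import numpy as np

# Two stand-in column-stochastic matrices, both with stationary
# vector (0.5, 0.5), but with different second eigenvalues.
S_fast = np.array([[0.65, 0.35],
                   [0.35, 0.65]])  # eigenvalues 1 and 0.3
S_slow = np.array([[0.85, 0.15],
                   [0.15, 0.85]])  # eigenvalues 1 and 0.7

target = np.array([0.5, 0.5])
for name, S in [("lambda_2 = 0.3", S_fast), ("lambda_2 = 0.7", S_slow)]:
    I = np.array([1.0, 0.0])
    for _ in range(10):
        I = S @ I
    # After k steps the error is proportional to |lambda_2|^k.
    print(name, "error after 10 steps:", np.abs(I - target).max())
```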
When things go wrong
In our discussion above, we assumed that the matrix $S$ had the property that $\lambda_1 = 1$ and $|\lambda_j| < 1$ for $j \geq 2$. This need not hold for every web. Consider, for instance, a small web of five pages whose links form a single loop: page 1 links to page 2, page 2 to page 3, and so on, with page 5 linking back to page 1. [Figure: five pages linked in a cycle.] Then we see
| $I^0$ | $I^1$ | $I^2$ | $I^3$ | $I^4$ | $I^5$ |
|-------|-------|-------|-------|-------|-------|
| 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 | 0 |
In this case, the importance simply circulates around the loop: $I^5 = I^0$, and the sequence of vectors $I^k$ never converges. Here $S$ has several eigenvalues of magnitude one, so our assumption fails. The fix, due to Brin and Page, is to replace $S$ with the Google matrix

$$G = \alpha S + (1-\alpha)\frac{1}{n}\mathbf{1},$$

where $\mathbf{1}$ is the matrix whose entries are all one and $\alpha$ is a parameter between 0 and 1. In the random-surfer picture, the surfer follows a link with probability $\alpha$ and teleports to a completely random page with probability $1-\alpha$. Since every entry of $G$ is positive, $G$ has a unique stationary vector, and the power method applied to $G$ converges to it regardless of the initial vector $I^0$. For the looping web above, $G$ has stationary vector $I = (0.2, 0.2, 0.2, 0.2, 0.2)$: all five pages are ranked equally, as symmetry demands.
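As a quick sanity check (my own sketch, not from the article), we can build $G$ for the looping five-page web and watch the power method converge to the uniform vector:

```python
import numpy as np

n, alpha = 5, 0.85

# S for the five-page loop: page j links only to page (j+1) mod 5.
S = np.zeros((n, n))
for j in range(n):
    S[(j + 1) % n, j] = 1.0

# Google matrix: follow a link with probability alpha, teleport otherwise.
G = alpha * S + (1 - alpha) / n * np.ones((n, n))

I = np.zeros(n)
I[0] = 1.0
for _ in range(100):
    I = G @ I
print(I)  # converges to [0.2, 0.2, 0.2, 0.2, 0.2]
```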
Computing I
What we've described so far looks like a good theory, but remember that we need to apply it to matrices where $n$ is about 25 billion! In fact, the power method is especially well suited to this situation. Remember that the stochastic matrix $S$ may be written as

$$S = H + A,$$

where $A$ is the matrix whose entries are all zero except for the columns corresponding to dangling nodes, in which every entry is $1/n$. Therefore the Google matrix has the form

$$G = \alpha H + \alpha A + \frac{1-\alpha}{n}\mathbf{1}.$$

Therefore,

$$GI^k = \alpha HI^k + \alpha AI^k + \frac{1-\alpha}{n}\mathbf{1}I^k.$$

Now recall that most of the entries in $H$ are zero; on average, only ten entries per column are nonzero. Therefore, evaluating $HI^k$ requires only about ten nonzero terms for each entry in the resulting vector. Also, the rows of $A$ are all identical, as are the rows of $\mathbf{1}$. Therefore, evaluating $AI^k$ and $\mathbf{1}I^k$ amounts to adding the current importance rankings of the dangling nodes, or of all web pages, a sum that only needs to be computed once per iteration. With the value of $\alpha$ chosen to be near 0.85, Brin and Page report that 50 to 100 iterations are required to obtain a sufficiently good approximation to $I$, a calculation reported to take a few days to complete. Of course, the web is continually changing. First, the content of web pages, especially for news organizations, may change frequently. In addition, the underlying hyperlink structure of the web changes as pages and links are added or removed. It is rumored that Google recomputes the PageRank vector $I$ roughly every month. Since the PageRank of pages can be observed to fluctuate considerably during this time, it is known to some as the Google Dance. (In 2002, Google held a Google Dance!)
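To see how these observations translate into an algorithm, here is a hedged sketch of one power-method iteration that never forms $G$ explicitly; the sparse representation, the four-page example, and all names are my assumptions:

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank(H, dangling, alpha=0.85, iters=100):
    """Power method for GI = alpha*H I + alpha*A I + (1-alpha)/n * 1 I,
    using only sparse products with H plus two scalar corrections."""
    n = H.shape[0]
    I = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = alpha * (H @ I)                 # sparse: ~10 terms per entry
        new += alpha * I[dangling].sum() / n  # A's identical rows: one sum
        new += (1 - alpha) / n * I.sum()      # all-ones matrix: one sum
        I = new
    return I

# Hypothetical four-page web; page 3 is a dangling node.
links = {0: [1, 2], 1: [2], 2: [0, 3]}
n = 4
rows, cols, vals = [], [], []
for j, targets in links.items():
    for i in targets:
        rows.append(i)
        cols.append(j)
        vals.append(1.0 / len(targets))
H = csr_matrix((vals, (rows, cols)), shape=(n, n))
print(pagerank(H, dangling=[3]))
```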
Summary
Brin and Page introduced Google in 1998, a time when the pace at which the web was growing began to outstrip the ability of existing search engines to yield usable results. At that time, most search engines had been developed by businesses that were not interested in publishing the details of how their products worked. In developing Google, Brin and Page wanted to "push more development and understanding into the academic realm." That is, they hoped, first of all, to improve the design of search engines by moving it into a more open, academic environment. In addition, they felt that the usage statistics for their search engine would provide an interesting data set for research. It appears that the federal government, which recently tried to obtain some of Google's statistics, feels the same way. There are other algorithms that use the hyperlink structure of the web to rank the importance of web pages. One notable example is the HITS algorithm, developed by Jon Kleinberg, which forms the basis of the Teoma search engine. In fact, it is interesting to compare the results of searches sent to different search engines as a way to understand why some complain of a Googleopoly.