Web Science:
PageRank

Parke Godfrey
29 November 2012
CSE-2041

Credits

These slides are based in large part on the book

  1. Amy N. Langville & Carl D. Meyer
    Google's PageRank and Beyond: The Science of Search Engine Ranking
    Princeton University Press, 2006.
    ISBN-13: 978-0-691-12202-1
    ISBN-10: 0-691-12202-4

and from

  1. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd
    The PageRank citation ranking: bringing order to the web.
    TR, Stanford University, 1999.

  2. Private correspondence.

Web Pages
How to assign importance?


The Random Web Surfer

Exploiting The Web Graph
the links are what is important


A page is important if important pages link to it.


The random web surfer captures this idea:

  • if he visits important pages more often, then he is likely to visit pages they link to more often.

Problems
the random surfer might encounter?


Modeling the Random Surfer
linear algebra to the rescue!



We can cast this approach in terms of Markov chain theory, a part of linear algebra.

The PageRank Equation
the billion dollar formula


\[ \displaystyle{ \trans{\prv} = \trans{\prv} (\alpha\S + (1 - \alpha)\E) } \]

The Idea
in math


\[ \displaystyle{ r(P_{i}) = \sum_{P_{j}\in B_{P_{i}}} {{r(P_{j})} \over {|P_{j}|}} } \]

The Idea (2)
in math


Of course, this definition for \(r\) is recursive: the values of \(r\) over the pages depend on the values of \(r\)!

So, we could recast this as an iterative process.

\[ \displaystyle{ r_{k+1}(P_{i}) = \sum_{P_{j}\in B_{P_{i}}} {{r_{k}(P_{j})} \over {|P_{j}|}} } \]
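The iteration above can be sketched in a few lines of Python. This is a minimal illustration only: the three-page web in `out_links` and the helper name `iterate_once` are hypothetical, chosen so that the process happens to converge.

```python
# Hypothetical three-page web (not from the slides):
# each key links to the pages in its list.
out_links = {"P1": ["P2", "P3"], "P2": ["P3"], "P3": ["P1"]}


def iterate_once(r, out_links):
    """One step of r_{k+1}(P_i) = sum over P_j in B_{P_i} of r_k(P_j) / |P_j|."""
    nxt = {page: 0.0 for page in out_links}
    for pj, targets in out_links.items():
        share = r[pj] / len(targets)  # P_j splits its rank evenly over its out-links
        for pi in targets:
            nxt[pi] += share          # P_j is a back-link of P_i
    return nxt


# Start from the uniform vector r_0 and turn the crank.
r = {page: 1.0 / len(out_links) for page in out_links}
for _ in range(100):
    r = iterate_once(r, out_links)
```

For this particular graph the values settle near 0.4, 0.2, and 0.4 for P1, P2, and P3. Nothing here guarantees convergence on an arbitrary graph.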

With Matrices
this time, please!


\[ \displaystyle{ \trans{\prv^{(k+1)}} = \trans{\prv^{(k)}}\H } \]


\(\H\) is very sparse!
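As a small illustration (a hypothetical three-page web, not from the slides), suppose \(P_{1}\) links to \(P_{2}\) and \(P_{3}\), \(P_{2}\) links to \(P_{3}\), and \(P_{3}\) links to \(P_{1}\). Row \(j\) of \(\H\) holds \(1/|P_{j}|\) in the columns of the pages that \(P_{j}\) links to, and \(0\) elsewhere:

\[ \displaystyle{ \H = \begin{pmatrix} 0 & 1/2 & 1/2 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} } \]

Starting from the uniform vector, one step of the iteration gives

\[ \displaystyle{ \begin{pmatrix} 1/3 & 1/3 & 1/3 \end{pmatrix} \H = \begin{pmatrix} 1/3 & 1/6 & 1/2 \end{pmatrix} } \]

Even in this toy case, 5 of the 9 entries are \(0\). On the real Web a page links to only a handful of the other pages, so nearly every entry in its row is \(0\).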

Modeled? Does this capture the idea?


Great! Turn the crank until we converge; that is, until \(\prv^{(k+1)} = \prv^{(k)}\).


Well, there are potential problems.

Will this iterative process continue indefinitely, or converge?


  • Under what circumstances or properties of \(\H\) is it guaranteed to converge?

  • Will it converge to just one vector or multiple vectors?

    • Will it converge to something that makes sense in the context of PageRank?

    • Does the convergence depend on the starting vector \(\trans{\prv^{(0)}}\)?

  • If it will converge, how long is “eventually”?

    That is, how many iterations can we expect until convergence?

Problem with sinks: dangling node vectors
the stochasticity adjustment


Many nodes (pages) have no outgoing links; e.g., images and PDFs.

\[ \displaystyle{ \S = \H + \vector{a}((1/n)\trans{\e}) } \]

\(\S\) is stochastic: each row sums to \(1\).
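The stochasticity adjustment can be sketched as follows. The four-page web is hypothetical, and the code assumes that \(\vector{a}\) is the dangling-node indicator vector (1 for a page with no out-links, 0 otherwise) and that \(\e\) is the all-ones vector.

```python
# Hypothetical 4-page web; page 3 (say, a PDF) has no out-links, so it is dangling.
out_links = [[1, 2], [2], [0, 3], []]  # out_links[j] = pages that page j links to
n = len(out_links)

# H: row j spreads probability 1/|P_j| over the pages that P_j links to.
H = [[0.0] * n for _ in range(n)]
for j, targets in enumerate(out_links):
    for i in targets:
        H[j][i] = 1.0 / len(targets)

# a: assumed to be the dangling-node indicator (1 where a row of H is all zeros).
a = [1.0 if not targets else 0.0 for targets in out_links]

# S = H + a (1/n) e^T: each dangling row is replaced by the uniform row 1/n.
S = [[H[j][i] + a[j] * (1.0 / n) for i in range(n)] for j in range(n)]

row_sums = [sum(row) for row in S]  # every row of S should now sum to 1
```

In the story of the random surfer, this amounts to saying that a surfer stuck on a page with no links jumps to a page chosen uniformly at random.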

Problem: no unique solution guaranteed!
the primitivity adjustment


A primitive matrix is both irreducible and aperiodic.

A matrix with no \(0\) entries is guaranteed to be primitive.

\[ \displaystyle{ \G = \alpha\S + (1 - \alpha)(1/n)\e\trans{\e} } \]
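A minimal sketch of the primitivity adjustment, applied entry by entry. The function name `google_matrix`, the example matrix, and the value 0.85 for \(\alpha\) are assumptions made for illustration.

```python
def google_matrix(S, alpha=0.85):
    """Return G = alpha * S + (1 - alpha) * (1/n) * e e^T for a row-stochastic S.

    alpha=0.85 is an assumed illustrative value, not one taken from the slides.
    """
    n = len(S)
    return [[alpha * S[j][i] + (1 - alpha) / n for i in range(n)]
            for j in range(n)]


# Hypothetical stochastic matrix for a three-page web.
S = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
G = google_matrix(S)

smallest = min(min(row) for row in G)  # (1 - alpha) / n, so strictly positive
row_sums = [sum(row) for row in G]     # G stays stochastic
```

Every entry of the result is at least \((1 - \alpha)/n\), so the matrix has no \(0\) entries, and each row still sums to \(1\).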

Teleportation Matrix?!
does this fit the story?


Well... the random surfer occasionally gets bored (with probability \(1 - \alpha\)) and jumps from the current page to a page chosen at random (each with probability \(1/n\)).


Really, however, this \(\alpha\)-adjustment is artificial.

But it is a mathematical necessity for this approach to work.
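The bored-surfer story can also be simulated directly. The sketch below is only an illustration: the function name `simulate_surfer`, the three-page web, the step count, and the fixed seed are all hypothetical, and a surfer stuck on a page with no links is assumed to teleport as well.

```python
import random


def simulate_surfer(out_links, alpha=0.85, steps=20_000, seed=0):
    """Estimate visit frequencies of a random surfer who teleports when bored."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pages = sorted(out_links)
    visits = {p: 0 for p in pages}
    current = pages[0]
    for _ in range(steps):
        visits[current] += 1
        targets = out_links[current]
        if targets and rng.random() < alpha:
            current = rng.choice(targets)  # follow a link on the current page
        else:
            current = rng.choice(pages)    # bored (or stuck): jump anywhere
    return {p: visits[p] / steps for p in pages}


# Hypothetical three-page web.
freq = simulate_surfer({"P1": ["P2", "P3"], "P2": ["P3"], "P3": ["P1"]})
```

With enough steps, the visit frequencies approximate the importance scores that the matrix formulation computes exactly.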

The Google Matrix


\(\G\)

A Unique Solution
that is computable


PageRank can be solved in various ways
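One of those ways is the power method: repeat \(\trans{\prv^{(k+1)}} = \trans{\prv^{(k)}}\G\) until the vector stops changing. The sketch below is a minimal illustration under stated assumptions, not a production implementation. It never forms the dense \(\G\); it works from the sparse link lists and adds the dangling-node and teleportation terms as a single uniform correction. The function name `pagerank`, the tolerance, and the four-page web are hypothetical.

```python
def pagerank(out_links, alpha=0.85, tol=1e-10, max_iter=1000):
    """Power method for pi^T = pi^T G, using only the sparse link lists.

    out_links[j] lists the pages that page j links to (empty if dangling).
    alpha and tol are assumed illustrative values.
    """
    n = len(out_links)
    pi = [1.0 / n] * n  # uniform starting vector
    for k in range(1, max_iter + 1):
        nxt = [0.0] * n
        dangling_mass = 0.0
        for j, targets in enumerate(out_links):
            if targets:
                share = alpha * pi[j] / len(targets)  # the alpha * pi^T H part
                for i in targets:
                    nxt[i] += share
            else:
                dangling_mass += pi[j]                # the pi^T a part
        # Dangling rows and teleportation both spread mass uniformly.
        uniform = (alpha * dangling_mass + (1 - alpha)) / n
        nxt = [x + uniform for x in nxt]
        change = sum(abs(x - y) for x, y in zip(nxt, pi))
        pi = nxt
        if change < tol:
            return pi, k
    return pi, max_iter


# Hypothetical 4-page web; page 3 is a dangling node.
ranks, iterations = pagerank([[1, 2], [2], [0, 3], []])
```

Working from the link lists keeps each iteration proportional to the number of links rather than to \(n^{2}\), which matters because \(\G\) itself is completely dense.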

What about \(\alpha\)?
the great fudge factor



\(\alpha\) is set as a constant (\(\alpha \in (0, 1)\)) for the duration of the computation of \(\prv\).

What should it be set to?
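A rough worked estimate shows what is at stake. It assumes the standard result, not stated on these slides, that the error of the power method on \(\G\) shrinks roughly like \(\alpha^{k}\) after \(k\) iterations, and it uses an assumed tolerance \(\tau\):

\[ \displaystyle{ \alpha^{k} \le \tau \quad\Longrightarrow\quad k \ge {{\log \tau} \over {\log \alpha}} } \]

With \(\tau = 10^{-8}\), this gives about \(27\) iterations for \(\alpha = 0.5\), about \(114\) for \(\alpha = 0.85\), and about \(1833\) for \(\alpha = 0.99\). So an \(\alpha\) close to \(1\) stays faithful to the link structure but converges slowly, while an \(\alpha\) close to \(0\) converges quickly but lets the uniform teleportation term dominate the ranking.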

Setting \(\alpha\) close to \(1\)


Setting \(\alpha\) close to \(0\)


The Goog's setting for \(\alpha\)
around, say, 2005


Still, in the early 2000s, the \(\prv\) computation took around 3 days.

There was a monthly “day of reckoning” (the 28th?) when Google would roll over to the new \(\prv\).

SEO'ers would take note!

Is PageRank the be-all
of Google's search algorithm? Of search algorithms?


No. There are many weaknesses.

Was PageRank the Pillar
of Google's launch to fame?


Half of it.

AdSense
the other pillar



The other pillar — and foundation of Google's business model — is AdSense.

The Other Half of Search
information retrieval


Tweaking the Markovian Approach


Web Science
complex systems


The science of who links to whom has extended beyond the Web to a variety of other networks that collectively go by the name of complex systems. Graph techniques have successfully been applied to learn valuable information about networks ranging from the AIDS transmission and power grid networks to terrorist and email networks.

— Langville & Meyer 2006 [p.30]