Lecture 16 – PageRank

Author: Jiyoung Choi

PageRank is a search ranking algorithm based on the link structure of the web, introduced by Larry Page and Sergey Brin. They built the search engine Google to test the utility of PageRank for search. (Origin of the name “Google”)

The goal of this lecture is to construct the Google matrix and explore how to find the PageRank vector. Let’s start by building an adjacency matrix from the web’s link structure.

Let u be a webpage, let F_u be the set of pages u points to, and let B_u be the set of pages that point to u. Treat each webpage as a vertex and each link as a directed edge. We can then construct an adjacency matrix with A_{uv} = 1 if there is a link from v to u, and A_{uv} = 0 otherwise, where u and v are webpages. For example, if the webpages are linked as in Figure 1, A is defined as follows.

Figure 1. A simple example consisting of 5 webpages


A= \begin{bmatrix} 0 & 1 & 1 & 1 & 0\\ 1 & 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1 & 0\\ 1 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 0 & 0\end{bmatrix}

We can observe that the last column of A is a zero vector because v_5 does not have forward links. We call such a node a dangling node.
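As a sketch, the adjacency matrix above can be built from an edge list with numpy; the edge list below is read off Figure 1 (pages are numbered 1..5 in the text and converted to 0-based indices in code):

```python
import numpy as np

# Edges (v, u) meaning "page v links to page u", read off Figure 1.
edges = [(2, 1), (3, 1), (4, 1), (1, 2), (4, 2), (4, 3), (1, 4), (3, 5)]

n = 5
A = np.zeros((n, n), dtype=int)
for v, u in edges:
    A[u - 1, v - 1] = 1  # A_{uv} = 1 when there is a link from v to u

print(A)
# The last column is all zeros: v_5 has no forward links (a dangling node).
```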

Let N_u = |F_u| be the number of forward links of page u. The column-normalized adjacency matrix R of A is then defined columnwise by R(u) = {A(u) \over N_u} for N_u > 0, where R(u) denotes the column of R corresponding to page u (columns of dangling nodes remain zero). In our example,


R= \begin{bmatrix} 0 & 1 & \frac{1}{2} & \frac{1}{3} & 0\\ \frac{1}{2} & 0 & 0 & \frac{1}{3} & 0\\ 0 & 0 & 0 & \frac{1}{3} & 0\\ \frac{1}{2} & 0 & 0 & 0 & 0\\ 0 & 0 & \frac{1}{2} & 0 & 0 \end{bmatrix}.
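A minimal numpy sketch of this column normalization, in which dangling columns are simply left as zero:

```python
import numpy as np

A = np.array([[0, 1, 1, 1, 0],
              [1, 0, 0, 1, 0],
              [0, 0, 0, 1, 0],
              [1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0]], dtype=float)

N = A.sum(axis=0)  # N_u = |F_u|, the number of forward links of each page
# Divide each column by N_u; columns with N_u = 0 (dangling nodes) stay zero.
R = np.divide(A, N, out=np.zeros_like(A), where=N > 0)
```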


This matrix R is not used directly in PageRank. To compute the PageRank vector, we will use the power method, an iterative procedure, which in general has convergence problems: the iteration can cycle, or its limit may depend on the starting vector. To avoid these problems, we first need a left stochastic matrix.

As our example shows, R may fail to be stochastic because of dangling nodes. We therefore define the stochastic matrix R' = R + {\mathbf{e}\mathbf{a}^T \over n}, where \mathbf{e} is the vector of all ones, n is the number of webpages, and \mathbf{a}(u)=1 if R(u) =\mathbf{0}, otherwise \mathbf{a}(u)=0.

When surfing the web, even from a page with no forward links (a dangling node) we may still visit new webpages. The matrix {\mathbf{e}\mathbf{a}^T \over n} models this situation by assigning uniform probability to all webpages.

Problem 1: Find R' for our example.

Then, by the Gershgorin circle theorem, every eigenvalue of R' has modulus at most 1. Since each column of R' sums to 1, we have \mathbf{e}^T R' = \mathbf{e}^T, so \det(R'-I)=0 and \lambda=1 is an eigenvalue; hence the largest eigenvalue is \lambda_n=1. We want to find the dominant eigenvector of R', which is the PageRank vector. To find this eigenvector, PageRank uses the power method.
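As a numerical sanity check (which also lets you verify your answer to Problem 1), one can build R' from R and confirm that its largest eigenvalue in modulus is 1:

```python
import numpy as np

R = np.array([[0,   1, 1/2, 1/3, 0],
              [1/2, 0, 0,   1/3, 0],
              [0,   0, 0,   1/3, 0],
              [1/2, 0, 0,   0,   0],
              [0,   0, 1/2, 0,   0]])

n = R.shape[0]
e = np.ones(n)
a = (R.sum(axis=0) == 0).astype(float)  # a(u) = 1 exactly when column u of R is zero
R_prime = R + np.outer(e, a) / n        # R' = R + e a^T / n

# Every column of R' sums to 1, so e^T R' = e^T and 1 is an eigenvalue;
# by Gershgorin no eigenvalue exceeds 1 in modulus.
assert np.allclose(R_prime.sum(axis=0), 1)
print(np.max(np.abs(np.linalg.eigvals(R_prime))))  # ≈ 1.0
```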

As some of you may have realized, finding the PageRank vector amounts to finding the stationary distribution (see Lecture 5, Lecture 6) of a random walk on the web graph. However, if the stationary distribution is not unique, we cannot get a well-defined PageRank vector. This problem is resolved by the Perron–Frobenius theorem: if R' is irreducible, the eigenvector corresponding to the eigenvalue 1 is unique.

Definition 1: A matrix A is primitive if it is non-negative and its k-th power is positive for some natural number k.

Definition 2: An n \times n matrix A is reducible if there exists a permutation matrix P such that P^TAP is block upper triangular. If a square matrix is not reducible, it is said to be irreducible.

For more background on primitivity and irreducibility, see [6].

Problem 2: Show that every primitive matrix is irreducible.

We can construct a primitive matrix G = mR' + (1-m)E, where 0 \le m \le 1 and E = {\mathbf{e}\mathbf{e}^T \over n}. In practice, Google uses m=0.85 ([1], [7]). Since G is primitive, and hence irreducible, the power method yields the unique PageRank vector. We call this matrix G the Google matrix. The constant m is called the damping constant; it reflects the fact that a person randomly clicking on links will eventually stop clicking. In our example, G is


G= \begin{bmatrix} \frac{3}{100} & \frac{22}{25} & \frac{91}{200} & \frac{47}{150} & \frac{1}{5}\\ \frac{91}{200} & \frac{3}{100} & \frac{3}{100} & \frac{47}{150} & \frac{1}{5}\\ \frac{3}{100} & \frac{3}{100} & \frac{3}{100} & \frac{47}{150} & \frac{1}{5}\\ \frac{91}{200} & \frac{3}{100} & \frac{3}{100} & \frac{3}{100} & \frac{1}{5}\\ \frac{3}{100} & \frac{3}{100} & \frac{91}{200} & \frac{3}{100} & \frac{1}{5} \end{bmatrix}.


when m=0.85.
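A sketch of this construction with numpy; assuming the R' obtained from the dangling-node fix, the entries reproduce the matrix above (e.g. 0.85 \cdot \frac{1}{2} + 0.03 = \frac{91}{200}):

```python
import numpy as np

# R' for the running example (R with the dangling column replaced by 1/5's).
R_prime = np.array([[0,   1, 1/2, 1/3, 1/5],
                    [1/2, 0, 0,   1/3, 1/5],
                    [0,   0, 0,   1/3, 1/5],
                    [1/2, 0, 0,   0,   1/5],
                    [0,   0, 1/2, 0,   1/5]])

n = R_prime.shape[0]
m = 0.85                   # damping constant used by Google
E = np.ones((n, n)) / n    # E = e e^T / n
G = m * R_prime + (1 - m) * E

# G is positive, hence primitive (k = 1 in Definition 1), hence irreducible.
assert (G > 0).all()
```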

Here, we call the unique largest eigenvalue \lambda of G the Perron root (for the Google matrix, \lambda = 1), and \mathbf{x} the Perron vector if G\mathbf{x}=\lambda \mathbf{x} and ||\mathbf{x}||_1 = 1. Since G has a unique dominant eigenvalue, there is a unique PageRank vector. (If you want to check convergence, see [4], p. 179.)

Finally, the power method on G ([2]) iteratively computes \mathbf{x}_k = G\mathbf{x}_{k-1} and finds the limit of \mathbf{x}_k, say \mathbf{x}. This \mathbf{x} is the PageRank vector. That is, the PageRank of page u is \mathbf{x}(u), where \mathbf{x} is the Perron vector with ||\mathbf{x}||_1=1. Note that \mathbf{x} is a probability vector reflecting the probability of visiting each webpage.
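The power method itself takes only a few lines. The sketch below (an assumed illustrative implementation, not Google's production code) iterates \mathbf{x}_k = G\mathbf{x}_{k-1} from the uniform vector until the change is small in the 1-norm:

```python
import numpy as np

def pagerank(G, tol=1e-12, max_iter=1000):
    """Power method: iterate x_k = G x_{k-1} starting from the uniform vector."""
    n = G.shape[0]
    x = np.ones(n) / n
    for _ in range(max_iter):
        x_new = G @ x
        if np.abs(x_new - x).sum() < tol:  # converged in the 1-norm
            break
        x = x_new
    return x_new

# Google matrix for the running example (m = 0.85).
R_prime = np.array([[0,   1, 1/2, 1/3, 1/5],
                    [1/2, 0, 0,   1/3, 1/5],
                    [0,   0, 0,   1/3, 1/5],
                    [1/2, 0, 0,   0,   1/5],
                    [0,   0, 1/2, 0,   1/5]])
G = 0.85 * R_prime + 0.15 * np.ones((5, 5)) / 5

x = pagerank(G)
print(x)  # a probability vector: G is column stochastic, so the entries sum to 1
```

Because every column of G sums to 1, the iteration preserves the 1-norm of the starting vector, so no renormalization step is needed.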

References

[1] Amy Langville and Carl Meyer (2003). Fiddling with PageRank.

[2] Amy Langville and Carl Meyer (2004). The Use of the Linear Algebra by Web Search Engines. IMAGE Newsletter. 33.

[3] Amy Langville and Carl Meyer (2004). Deeper Inside PageRank. Internet Mathematics. 1. 10.1080/15427951.2004.10129091.

[4] Amy Langville and Carl Meyer (2006). Google’s PageRank and Beyond: The Science of Search Engine Rankings.

[5] Larry Page, Sergey Brin, R. Motwani, and T. Winograd (1998). The PageRank Citation Ranking: Bringing Order to the Web.

[6] Roger Horn and Charles Johnson (1985). Matrix Analysis.

[7] Taher Haveliwala and Sepandar Kamvar (2003). The Second Eigenvalue of the Google Matrix.