Page Rank

1. Matrix Formulation

$\tilde{A}=D^{-1}A$

$A$ : adjacency matrix
$D$ : out-degree matrix (diagonal)
$D^{-1}$ : diagonal matrix of $\frac{1}{\text{out-degree}}$

example

calculate matrix
- $A = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}, D = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{bmatrix}, D^{-1} = \begin{bmatrix} 0.5 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 1 \end{bmatrix}$
- $\tilde{A} = \begin{bmatrix} 0.5 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 1 \end{bmatrix} @ \begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 0.5 & 0.5 & 0 \\ 0.5 & 0 & 0.5 \\ 0 & 1 & 0 \end{bmatrix}$
- $\tilde{A}^T = \begin{bmatrix} 0.5 & 0.5 & 0 \\ 0.5 & 0 & 1 \\ 0 & 0.5 & 0 \end{bmatrix}$
power iteration
1. init $r^{(0)} = \begin{bmatrix} \frac{1}{3} \\ \frac{1}{3} \\ \frac{1}{3} \end{bmatrix}$
2. $r^{(1)}=\tilde{A}^Tr^{(0)} = \begin{bmatrix} 0.5 & 0.5 & 0 \\ 0.5 & 0 & 1 \\ 0 & 0.5 & 0 \end{bmatrix} @ \begin{bmatrix} \frac{1}{3} \\ \frac{1}{3} \\ \frac{1}{3} \end{bmatrix} = \begin{bmatrix} \frac{1}{3} \\ \frac{1}{2} \\ \frac{1}{6} \end{bmatrix}$
3. Repeat until convergence
4. $r=\begin{bmatrix} \frac{2}{5} \\ \frac{2}{5} \\ \frac{1}{5} \end{bmatrix}$
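
The worked example above can be reproduced with a short script. This is a minimal sketch in plain Python; the helper names (`transition_transpose`, `matvec`, `power_iteration`), the tolerance, and the iteration cap are illustrative choices rather than part of the notes, and the sketch assumes every node has at least one out-link.

```python
def transition_transpose(A):
    """Return A~^T, where A~ = D^{-1} A (each row of A divided by its out-degree)."""
    n = len(A)
    out_deg = [sum(row) for row in A]  # assumes no zero out-degree (no dead ends)
    # A~[i][j] = A[i][j] / out_deg[i]; entry [j][i] of the result is A~[i][j].
    return [[A[i][j] / out_deg[i] for i in range(n)] for j in range(n)]


def matvec(M, v):
    """Multiply matrix M (list of rows) by column vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def power_iteration(M, tol=1e-12, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change drops below tol."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = matvec(M, r)
        if sum(abs(a - b) for a, b in zip(r_next, r)) < tol:
            return r_next
        r = r_next
    return r


A = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
At = transition_transpose(A)
r1 = matvec(At, [1 / 3] * 3)  # first step of the example: about [1/3, 1/2, 1/6]
r = power_iteration(At)       # converges to about [2/5, 2/5, 1/5]
```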

2. Random Walk Interpretation

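A random walk moves from the current node to one of its out-neighbors chosen at random.
The probability of being at node $i$ at time $t$ is written $p^{(t)}_i$; a larger value means the node is visited more often.
This can be written as $p^{(t+1)}=\tilde{A}^Tp^{(t)}$. As $t \rightarrow \infty$, $p^{(t+1)} \approx p^{(t)}$, so $p=\tilde{A}^Tp$, which has the same form as the PageRank equation $r=\tilde{A}^Tr$.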
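
The random-walk reading can be checked empirically: simulate a long walk on the 3-node example graph from section 1 and count how often each node is visited. This is a rough sketch; the function name `visit_frequencies`, the step count, and the fixed seed are assumptions made for illustration, and the result is only a Monte Carlo estimate.

```python
import random

# Out-links of the 3-node example graph (0-indexed); node 0 has a self-loop.
out_links = {0: [0, 1], 1: [0, 2], 2: [1]}


def visit_frequencies(out_links, steps, seed=0):
    """Walk `steps` random steps and return the fraction of visits to each node."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    counts = {node: 0 for node in out_links}
    node = rng.choice(list(out_links))   # arbitrary start node
    for _ in range(steps):
        node = rng.choice(out_links[node])  # follow a uniformly random out-link
        counts[node] += 1
    return [counts[node] / steps for node in sorted(counts)]


freq = visit_frequencies(out_links, steps=200_000)
```

With enough steps, `freq` should land close to the PageRank vector of the example, $[\frac{2}{5}, \frac{2}{5}, \frac{1}{5}]$, up to sampling noise.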

conditions

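For the walk to converge to the same distribution $p$ regardless of the initial probability distribution at $t=0$:

From any node it must be possible to reach every other node (irreducibility)
The movement pattern must not repeat periodically (aperiodicity)
The stationary distribution must be unique ( $p^{(t)} \approx p^{(t+1)} \approx p$ )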
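
A small counterexample shows why the periodicity condition matters. The two graphs below are hypothetical and chosen only for illustration: in the first, two nodes link only to each other, so the walk has period 2 and never settles; adding a self-loop breaks the period.

```python
def step(M, p):
    """One random-walk step: p <- M p, with M column-stochastic."""
    return [sum(m * x for m, x in zip(row, p)) for row in M]


# Two nodes that only link to each other: 1 -> 2 -> 1 (period 2).
M_periodic = [[0.0, 1.0],
              [1.0, 0.0]]

p = [1.0, 0.0]
history = [p]
for _ in range(4):
    p = step(M_periodic, p)
    history.append(p)
# history alternates between [1, 0] and [0, 1], so p^(t) never converges.

# Adding a self-loop on node 1 breaks the period; the walk now converges.
M_aperiodic = [[0.5, 1.0],
               [0.5, 0.0]]
q = [1.0, 0.0]
for _ in range(200):
    q = step(M_aperiodic, q)
# q approaches the stationary distribution [2/3, 1/3].
```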

3. Google Formulation

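The basic formulation runs into trouble when the graph contains dead ends (nodes with no outgoing edges, so $\tilde{A}^T$ is not column-stochastic) or spider traps (groups of nodes the walker cannot leave).
solution:
- dead end: set every entry of the dead-end node's column to $\frac{1}{n}$
- spider trap: teleport to a random node to escape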

$r=Gr$

$G=\beta\tilde{B}^T+(1-\beta)[\frac{1}{n}]_{n\times n}$
- $\tilde{B}^T$ : $\tilde{A}^T$ with every column that is not column-stochastic (column sum = 1) replaced by entries of $\frac{1}{n}$
- $\beta \in [0.8, 0.9]$, default $0.85$
$r_j=\beta\sum_{i\in N_j}\frac{r_i}{d_i}+(1-\beta)\frac{1}{n}$
- random walk: $\beta\sum_{i\in N_j}\frac{r_i}{d_i}$
- teleport: $(1-\beta)\frac{1}{n}$
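
The construction of $G$ can be sketched as follows. This is a minimal dense version in plain Python for small graphs; the function names `google_matrix` and `pagerank` and the 4-node graph are hypothetical, picked so that the graph contains one dead end and one spider trap.

```python
def google_matrix(A, beta=0.85):
    """Build G = beta * B~^T + (1 - beta) * [1/n] from a 0/1 adjacency matrix A."""
    n = len(A)
    out_deg = [sum(row) for row in A]
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):          # i: source node (a column of B~^T)
        for j in range(n):      # j: destination node (a row of B~^T)
            if out_deg[i] == 0:
                b = 1.0 / n     # dead end: whole column replaced by 1/n
            else:
                b = A[i][j] / out_deg[i]
            G[j][i] = beta * b + (1 - beta) / n
    return G


def pagerank(G, tol=1e-12, max_iter=1000):
    """Power iteration r <- G r, starting from the uniform vector."""
    n = len(G)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(g * x for g, x in zip(row, r)) for row in G]
        if sum(abs(a - b) for a, b in zip(r_next, r)) < tol:
            return r_next
        r = r_next
    return r


# Hypothetical graph: node 2 is a dead end, node 3 is a spider trap (self-loop only).
A = [[0, 1, 1, 0],
     [1, 0, 0, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 1]]
G = google_matrix(A)
r = pagerank(G)
```

Every column of `G` sums to 1, so the iteration keeps `r` a probability vector. The spider trap still collects the largest share, but teleporting keeps the other nodes from dropping to zero.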

4. Topic Specific Page Rank

When the random walk teleports, it jumps only to pages in $S$, the set of pages related to the topic of the query.

$r_s=\beta\tilde{B}^Tr_s+(1-\beta)q_s$

random walk: $\beta\tilde{B}^Tr_s$
topic specific teleport: $(1-\beta)q_s$
$q_s$ : $\frac{1}{|S|}$ for nodes in $S$ , $0$ otherwise

example

$S=\{1,2\}, \beta=0.8$

calculate matrix
- $A = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}, D = \begin{bmatrix} 2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, D^{-1} = \begin{bmatrix} 0.5 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$
- $\tilde{A}=\begin{bmatrix} 0.5 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} @ \begin{bmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0.5 & 0.5 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}$
- $\tilde{A}^T=\begin{bmatrix} 0 & 1 & 0 & 0 \\ 0.5 & 0 & 0 & 0 \\ 0.5 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix} = \text{column stochastic} = \tilde{B}^T$
- $q_s=\begin{bmatrix} 0.5 \\ 0.5 \\ 0 \\ 0 \end{bmatrix}$
power iteration
1. init $r^{(0)}_s = \begin{bmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{bmatrix}$
2. $r^{(1)}_s = 0.8 \times \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0.5 & 0 & 0 & 0 \\ 0.5 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix} @ \begin{bmatrix} 0.25 \\ 0.25 \\ 0.25 \\ 0.25 \end{bmatrix}+0.2 \times \begin{bmatrix} 0.5 \\ 0.5 \\ 0 \\ 0 \end{bmatrix}=\begin{bmatrix} 0.3 \\ 0.2 \\ 0.3 \\ 0.2 \end{bmatrix}$
3. Repeat until convergence
4. $r_s \approx \begin{bmatrix} 0.265 \\ 0.206 \\ 0.294 \\ 0.235 \end{bmatrix}$
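
The topic-specific iteration in this example can be run directly. The sketch below hard-codes $\tilde{B}^T$, $S$, and $\beta$ from the example; the function name `topic_pagerank`, the tolerance, and the returned list of iterates are illustrative assumptions.

```python
def topic_pagerank(Bt, q, beta, tol=1e-12, max_iter=1000):
    """Iterate r <- beta * B~^T r + (1 - beta) * q from the uniform vector.

    Returns the final vector and the list of all iterates, for inspection.
    """
    n = len(Bt)
    r = [1.0 / n] * n
    iterates = [r]
    for _ in range(max_iter):
        walk = [sum(b * x for b, x in zip(row, r)) for row in Bt]
        r_next = [beta * w + (1 - beta) * q_j for w, q_j in zip(walk, q)]
        iterates.append(r_next)
        if sum(abs(a - b) for a, b in zip(r_next, r)) < tol:
            break
        r = r_next
    return iterates[-1], iterates


# B~^T and S = {1, 2} from the example (nodes are 1-indexed in the text).
Bt = [[0.0, 1.0, 0.0, 0.0],
      [0.5, 0.0, 0.0, 0.0],
      [0.5, 0.0, 0.0, 1.0],
      [0.0, 0.0, 1.0, 0.0]]
S = {1, 2}
q_s = [1 / len(S) if node in S else 0.0 for node in range(1, 5)]

r_s, iterates = topic_pagerank(Bt, q_s, beta=0.8)
# iterates[1] is the first update r_s^(1), equal to [0.3, 0.2, 0.3, 0.2] up to rounding.
```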

Algorithm

In the Page Rank formulations so far, $G$ is fully dense. The two approaches below address this problem.

1. Sparse Matrix Formulation

$\tilde{D}^T$ : the entries set to $\frac{1}{n}$ to handle dead ends, defined as a separate matrix
$\tilde{B}^T = \tilde{A}^T + \tilde{D}^T$
- $\tilde{A}^T = \begin{bmatrix} 0.5 & 0.5 & 0 \\ 0.5 & 0 & 0 \\ 0 & 0.5 & 0 \end{bmatrix}$
- $\tilde{D}^T = \begin{bmatrix} 0 & 0 & \frac{1}{3} \\ 0 & 0 & \frac{1}{3} \\ 0 & 0 & \frac{1}{3} \end{bmatrix}$
$G=\beta\tilde{B}^T+(1-\beta)[\frac{1}{n}]_{n\times n} \rightarrow \beta\tilde{A}^T+\beta\tilde{D}^T+(1-\beta)[\frac{1}{n}]_{n\times n}$
$\tilde{A}^T$ itself is still sparse (90~99% sparse in real-world graphs) → store it in a sparse matrix format such as CSR (Compressed Sparse Row)
When there are many dead ends, $\tilde{D}^T$ is no longer sparse, because every dead-end column is filled with $\frac{1}{n}$ → instead of storing it, re-inject the leaked score
- $r^{(t+1)}=\beta\tilde{A}^Tr^{(t)}+(1-S)[\frac{1}{n}]_{n\times 1}$
- $S=sum(\beta\tilde{A}^Tr^{(t)})$ , $1-S$ : total leaked score
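
The two ideas can be combined in a short sketch: store only $\tilde{A}^T$ in CSR form and re-inject the leaked score, then compare against the dense $G$ on the 3-node example above. The CSR arrays are written out by hand, and the function names (`csr_matvec`, `leaked_score_step`, `dense_step`) and $\beta=0.85$ are illustrative assumptions.

```python
# CSR (Compressed Sparse Row) storage of the example A~^T above:
# row i holds data[indptr[i]:indptr[i+1]] at columns indices[indptr[i]:indptr[i+1]].
indptr = [0, 2, 3, 4]
indices = [0, 1, 0, 1]
data = [0.5, 0.5, 0.5, 0.5]
n = 3


def csr_matvec(indptr, indices, data, v):
    """Multiply a CSR matrix by a dense vector, touching only stored entries."""
    out = []
    for i in range(len(indptr) - 1):
        lo, hi = indptr[i], indptr[i + 1]
        out.append(sum(data[k] * v[indices[k]] for k in range(lo, hi)))
    return out


def leaked_score_step(r, beta):
    """r' = beta * A~^T r + (1 - S) / n, with S = sum(beta * A~^T r)."""
    walk = [beta * x for x in csr_matvec(indptr, indices, data, r)]
    S = sum(walk)
    return [w + (1 - S) / n for w in walk]


def dense_step(r, beta):
    """Reference: r' = G r with G = beta * (A~^T + D~^T) + (1 - beta) * [1/n]."""
    At = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
    Dt = [[0.0, 0.0, 1 / 3]] * 3
    G = [[beta * (At[i][j] + Dt[i][j]) + (1 - beta) / n for j in range(n)]
         for i in range(n)]
    return [sum(g * x for g, x in zip(row, r)) for row in G]


r_sparse = r_dense = [1 / 3] * 3
for _ in range(50):
    r_sparse = leaked_score_step(r_sparse, beta=0.85)
    r_dense = dense_step(r_dense, beta=0.85)
# Both updates agree as long as r sums to 1, which each step preserves.
```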

2. Block Based Update Algorithm

When the graph has too many edges to fit in memory, the adjacency matrix $\tilde{A}$ is stored as a list of records, each holding a source node, its degree, and its destination nodes.

| source node | degree | destination nodes |
| --- | --- | --- |
| | | 0, 1, 3, 5 |
| | | 0, 5 |
| | | 3, 4 |
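
A record list of this shape can be built from a plain edge list. The sketch below is only an illustration of the layout; the function name `to_records` and the edge list are hypothetical and are not the data from the table.

```python
from collections import defaultdict


def to_records(edges):
    """Group (source, destination) edges into (source, degree, destinations) records."""
    dests = defaultdict(list)
    for src, dst in edges:
        dests[src].append(dst)
    # degree is the out-degree, i.e. the number of destinations of the source node
    return [(src, len(d), sorted(d)) for src, d in sorted(dests.items())]


# Hypothetical edge list, used only to show the layout.
edges = [(0, 1), (0, 3), (1, 0), (1, 5), (2, 3), (2, 4), (0, 5)]
records = to_records(edges)
```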

o Basic

Update the PageRank values while reading one source node at a time.

Read $r^{old}$ (the previous PageRank values) and $\tilde{A}$ (the adjacency matrix) from disk
Compute the new PageRank values $r^{new}$ and write them to disk

$cost=2|r|+|\tilde{A}|$
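
One pass of the basic update might look like the sketch below, with an in-memory list standing in for the records on disk. The function name `basic_update` and the 4-node graph are hypothetical, and spreading the leaked score uniformly at the end of the pass is an assumption carried over from the sparse formulation.

```python
def basic_update(records, r_old, beta):
    """One PageRank pass that streams over (source, degree, destinations) records.

    `records` stands in for A~ on disk and is read once; r_old is read once
    and r_new is written once, matching cost = 2|r| + |A~|.
    """
    n = len(r_old)
    r_new = [0.0] * n
    for src, degree, dests in records:      # one source node at a time
        share = beta * r_old[src] / degree  # score sent along each out-link
        for dst in dests:
            r_new[dst] += share
    leaked = 1.0 - sum(r_new)               # teleport and dead-end mass
    return [x + leaked / n for x in r_new]


# Hypothetical 4-node graph; node 3 has no record because it is a dead end.
records = [(0, 2, [1, 2]), (1, 1, [0]), (2, 2, [0, 3])]
r = [0.25] * 4
for _ in range(100):
    r = basic_update(records, r, beta=0.85)
```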

o Block Based

Split $r^{new}$ into $K$ blocks and process one block at a time.

Scan $r^{old}$ and $\tilde{A}$ $K$ times, once per block
Compute each block of $r^{new}$ and write it to disk

$cost=K(|\tilde{A}|+|r|)+|r|$
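
The block-based variant can be sketched the same way: only one block of $r^{new}$ is held at a time, and the records are rescanned for every block. The function name `block_update`, the scan counter, and the example graph are hypothetical; the leaked score is again spread uniformly as an assumption.

```python
def block_update(records, r_old, beta, K):
    """One pass where r_new is built block by block; records are rescanned per block."""
    n = len(r_old)
    block_size = -(-n // K)                 # ceil(n / K)
    r_new = []
    scans = 0
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        block = [0.0] * (end - start)       # only this block is held in memory
        scans += 1
        for src, degree, dests in records:  # full scan of A~ (and r_old)
            for dst in dests:
                if start <= dst < end:      # skip destinations outside the block
                    block[dst - start] += beta * r_old[src] / degree
        r_new.extend(block)                 # "write" the finished block
    leaked = 1.0 - sum(r_new)
    return [x + leaked / n for x in r_new], scans


# Hypothetical 4-node graph; node 3 is a dead end.
records = [(0, 2, [1, 2]), (1, 1, [0]), (2, 2, [0, 3])]
r_new, scans = block_update(records, [0.25] * 4, beta=0.85, K=2)
```

The result matches the single-pass update; the price is `scans` full reads of the records, which is where the factor $K$ in the cost comes from.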

o Block Stripe Based

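Split $\tilde{A}$ into stripes, where each stripe contains only the destination nodes that fall in the corresponding block of $r^{new}$; data that has been read once does not need to be read again.

Write each block of $r^{new}$ to disk, and scan $r^{old}$ $K$ times
Read the stripes of $\tilde{A}$ to update $r^{new}$; the stripes total $(1+\epsilon)|\tilde{A}|$ in size, where $\epsilon$ is smaller than $K$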

$cost=(1+\epsilon)|\tilde{A}|+(1+K)|r|$
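
A sketch of the stripe layout follows. Each stripe repeats the source node and its original degree, which is where the overhead $\epsilon$ comes from. The function names (`make_stripes`, `stripe_update`, `size`), the size measure (count of stored numbers), and the example graph are assumptions for illustration.

```python
def make_stripes(records, n, K):
    """Split records into K stripes; stripe b keeps only destinations in block b."""
    block_size = -(-n // K)                 # ceil(n / K)
    stripes = [[] for _ in range(K)]
    for src, degree, dests in records:
        for b in range(K):
            lo, hi = b * block_size, (b + 1) * block_size
            kept = [d for d in dests if lo <= d < hi]
            if kept:
                # The original degree is stored, not len(kept).
                stripes[b].append((src, degree, kept))
    return stripes


def stripe_update(stripes, r_old, beta):
    """One pass: each stripe is read once to fill its own block of r_new."""
    n = len(r_old)
    block_size = -(-n // len(stripes))
    r_new = []
    for b, stripe in enumerate(stripes):
        start = b * block_size
        end = min(start + block_size, n)
        block = [0.0] * (end - start)
        for src, degree, dests in stripe:
            for dst in dests:
                block[dst - start] += beta * r_old[src] / degree
        r_new.extend(block)
    leaked = 1.0 - sum(r_new)               # assumption: leaked score spread uniformly
    return [x + leaked / n for x in r_new]


def size(recs):
    """Count stored numbers: source + degree + each destination."""
    return sum(2 + len(dests) for _, _, dests in recs)


# Hypothetical 4-node graph; node 3 is a dead end.
records = [(0, 2, [1, 2]), (1, 1, [0]), (2, 2, [0, 3])]
stripes = make_stripes(records, n=4, K=2)
r_new = stripe_update(stripes, [0.25] * 4, beta=0.85)
epsilon = sum(size(s) for s in stripes) / size(records) - 1
```

On a graph this small the repeated headers make `epsilon` relatively large; the overhead shrinks when sources have many destinations per block.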


Last updated 1 year ago
