This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Learning Based Proximity Matrix Factorization for Node Embedding

Xingyi Zhang Kun Xie Sibo Wang The Chinese University of Hong KongHong Kong SARChina xyzhang, xiekun, [email protected]  and  Zengfeng Huang Fudan UniversityShanghaiChina [email protected]
(2021)
Abstract.

Node embedding learns a low-dimensional representation for each node in the graph. Recent progress on node embedding shows that proximity matrix factorization methods gain superb performance and scale to large graphs with millions of nodes. Existing approaches first define a proximity matrix and then learn the embeddings that fit the proximity by matrix factorization. Most existing matrix factorization methods adopt the same proximity for different tasks, while it is observed that different tasks and datasets may require different proximity, limiting their representation power.

Motivated by this, we propose Lemane, a framework with trainable proximity measures, which can be learned to best suit the datasets and tasks at hand automatically. Our method is end-to-end, which incorporates differentiable SVD in the pipeline so that the parameters can be trained via backpropagation. However, this learning process is still expensive on large graphs. To improve the scalability, we train proximity measures only on carefully subsampled graphs, and then apply standard proximity matrix factorization on the original graph using the learned proximity. Note that, computing the learned proximities for each pair is still expensive for large graphs, and existing techniques for computing proximities are not applicable to the learned proximities. Thus, we present generalized push techniques to make our solution scalable to large graphs with millions of nodes. Extensive experiments show that our proposed solution outperforms existing solutions on both link prediction and node classification tasks on almost all datasets.

Node Embedding; Trainable Proximity; Matrix Factorization
journalyear: 2021copyright: acmlicensedconference: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining; August 14–18, 2021; Virtual Event, Singaporebooktitle: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’21), August 14–18, 2021, Virtual Event, Singaporeprice: 15.00doi: 10.1145/3447548.3467296isbn: 978-1-4503-8332-5/21/08ccs: Information systems Data miningccs: Computing methodologies Dimensionality reduction and manifold learning

1. Introduction

Node embedding is the task to map nodes in the original graph into low-dimensional representations. For each node, it outputs an embedding vector and the embedding vectors play an important role in preserving not only the structural information but also other underlying properties in the graph. These vectors can be fed into machine learning models, facilitating widespread machine learning tasks, such as node classification (Perozzi et al., 2014; Grover and Leskovec, 2016; Tang et al., 2015), link prediction (Wang et al., 2018; Tsitsulin et al., 2018), graph reconstruction (Yin and Wei, 2019; Yang et al., 2020), and recommendation (Fouss et al., 2007; Yu et al., 2014).

An important category of node embedding methods with superb performance is the ones using proximity matrix factorization. For such solutions, they first define the proximity matrix 𝑺\boldsymbol{S} for the nodes in the input graph where 𝑺(i,j)\boldsymbol{S}(i,j) is the proximity measure of node jj with respect to node ii. Different methods may adopt different proximity measures and personalized PageRank (PPR) is a popular choice of proximity measure in node embedding. For example, PPR is adopted in NRP (Yang et al., 2020) as the proximity. In STRAP (Yin and Wei, 2019), the authors further propose to adopt 𝝅u(v)+𝝅vT(u)\boldsymbol{\pi}_{u}(v)+\boldsymbol{\pi}^{T}_{v}(u) as the proximity where 𝝅u(v)\boldsymbol{\pi}_{u}(v) is the PPR of vv with respect to uu and 𝝅vT(u)\boldsymbol{\pi}^{T}_{v}(u) is the PPR of uu with respect to vv on the transpose graph GTG^{T} by reversing the direction of each edge in GG. Given the proximity matrix 𝑺\boldsymbol{S}, two embedding vectors 𝒙u\boldsymbol{x}_{u} and 𝒚u\boldsymbol{y}_{u} are derived such that 𝒙u𝒚v𝑺(u,v)\boldsymbol{x}_{u}\cdot\boldsymbol{y}_{v}\sim\boldsymbol{S}(u,v). For existing solutions in this category, the embedding vectors are typically obtained by singular value decomposition (SVD) or eigen-decomposition on 𝑺\boldsymbol{S} or on a sparse matrix closely related to 𝑺\boldsymbol{S}.

Despite their success, all existing matrix factorization approaches aim to learn an embedding that preserves the chosen proximity without considering if the proximity is suitable for the task on the dataset or not. However, it is observed that different tasks and datasets may require different proximities to achieve high performance, limiting the representation power of such solutions. In addition, it is shown in existing node embedding methods, e.g., STRAP (Yin and Wei, 2019) and NetMF (Qiu et al., 2018), that non-linear operations (such as taking logarithm or softmax) on the proximity matrix can help improve the representation power of the embedding. Nevertheless, most latest matrix factorization methods, like HOPE (Ou et al., 2016), AROPE (Zhang et al., 2018b), and NRP (Yang et al., 2020), do not explicitly derive the proximity matrix, which limits their representation powers. For those methods that explicitly derive the proximity matrix, it indicates that the matrix factorization needs to be taken on the final proximity matrix. Thus, to derive trainable proximity measures, the model needs to be end-to-end and includes the proximity computation as well as the SVD decomposition into the training process. This further imposes challenges especially when the input graph has millions of nodes.

Contribution. Motivated by the limitation of existing solutions, we present an effective framework Lemane111Learning based Proximity Matrix Factorization for Node Embedding with trainable proximity measures, which can be learned to best suit the datasets and tasks at hand automatically. Our trainable proximity measure is inspired by personalized PageRank (PPR) (Page et al., 1999). The PPR 𝝅u(v)\boldsymbol{\pi}_{u}(v) can be defined as the probability that an α\alpha-discounted random walk from uu stops at node vv, where an α\alpha-discounted random walk from a source node uu has α\alpha probability to stop at the current node and has (1α)(1-\alpha) probability to randomly jump to an out-neighbor of the current node. For α\alpha-discounted random walks, the majority will always be the one-hop random walks, which may not be the most representative one for the task at hand. This motivates us to learn a more representative random walk for our trainable proximity measure. Instead of fixing the stopping probability as α\alpha at each step, our trainable proximity is defined on a supervised random walk where the stopping probability at the ll-th hop is learned by our defined loss function. Then, our trainable proximity of node vv with respect to uu, dubbed as the supervised PPR 𝑺(u,v)\boldsymbol{S}(u,v), is the probability that the supervised random walk from uu stops at vv.

To learn the stopping probabilities at each hop for the supervised random walk, we design different loss functions for different tasks in order to learn a more representative proximity for the task at hand. In this paper, we focus on two popular tasks of node embedding: link prediction and node classification. Given the loss functions and trainable parameters, we then design an end-to-end method which incorporates a differentiable SVD in the pipeline so that the parameters can be trained via backpropagation. Our solution is mainly inspired by previous work on learning-based low-rank approximations (Indyk et al., 2019), which includes a differentiable SVD that allows the gradients to flow easily in the framework to solve the low-rank approximation problem. In our framework, with the differentiable SVD, the gradients can easily flow from the loss function and differentiable SVD to our proximity matrix computation, which is determined by the training parameters, i.e., the stopping probabilities at each hop of the supervised random walk.

However, the above training process is too expensive and does not scale to large graphs. To improve the scalability, we train the stopping probabilities for the supervised random walk only on carefully subsampled graphs, and then apply standard proximity matrix factorization on the original graph using the learned proximity. Our main observation is that the stopping probabilities at each hop of the supervised random walk are node-independent. Thus, with a carefully subsampled graph, the learned stopping probability at each hop for the task should still be similar to that on the input graph. Motivated by this, we present an effective subgraph sampling based method to train the parameters on multiple sampled subgraphs, which improves the scalability of our Lemane.

Finally, given learned probabilities, computing learned proximities for each node-pair is still expensive for large graphs, and existing efficient algorithms for computing proximities, like PPR, are not applicable to the learned proximities. Thus, we present generalized push techniques to make our solution scalable to large graphs with millions of nodes. Given a source node ss, our generalized push algorithm computes an approximate supervised PPR score for each node with respect to ss with O(1δ)O(\frac{1}{\delta}) cost where δ\delta is a parameter to control the quality of approximate supervised PPR scores. Then, the proximity matrix can be computed with O(nδ)O(\frac{n}{\delta}) cost and takes O(nδ)O(\frac{n}{\delta}) space. A sparse SVD algorithm is applied on the proximity matrix 𝑺\boldsymbol{S} to derive the final embedding with O(nδ+nd2)O(\frac{n}{\delta}+n\cdot d^{2}) running cost, where dd is the embedding dimension.

In our experiment, we compare our Lemane against 15 existing node embedding methods on the link prediction and node classification tasks using 6 real datasets with up to 3 million nodes and 117 million edges. Extensive experiments show that our Lemane outperforms existing methods on both link prediction and node classification tasks on almost all datasets.

2. Related Work

There are three basic categories of node embedding methods: skip-gram methods, matrix factorization methods, and neural network methods. Next, We briefly review existing works for each category.

Skip-gram methods. The methods in this category are inspired by the great success of the word2vec model (Mikolov et al., 2013) for natural language processing. DeepWalk (Perozzi et al., 2014) first proposes to train embedding vectors by feeding truncated random walks to the Skip-gram model. The nodes sampled from the random walks are then treated as the positive samples. Subsequent methods try to explore more representative random walks to feed into the Skip-gram model. LINE (Tang et al., 2015) adopts one-hop and two-hop random walks while Node2vec (Grover and Leskovec, 2016) proposes to explore higher-order random walks that exploits both DFS and BFS nature of the graph. VERSE (Tsitsulin et al., 2018) and APP (Zhou et al., 2017) adopt α\alpha-discounted random walks to obtain positive samples. Recently, InfiniteWalk (Chanpuriya and Musco, 2020) studies DeepWalk in the limit as the window size goes to infinite, linking DeepWalk to graph Laplacian matrix.

Matrix factorization methods. Another idea in node embedding is to do matrix factorization on a chosen proximity matrix. To explicitly derive the proximity matrix, e.g., the case in NetMF (Qiu et al., 2018), it typically takes Θ(n2)\Theta(n^{2}) cost and is too expensive for large graphs. To avoid the Θ(n2)\Theta(n^{2}) running cost, HOPE (Ou et al., 2016), AROPE (Zhang et al., 2018b), and NRP (Yang et al., 2020) are proposed to derive the embedding without explicitly computing the proximity matrix. For instance, instead of computing all-pair proximity scores and then decomposing the proximity matrix 𝑺\boldsymbol{S}, NRP turns to do SVD on the adjacency matrix, which is sparse for most real-life graphs, reducing the embedding computational cost. Another solution to avoid the Θ(n2)\Theta(n^{2}) cost is to calculate a sparsified proximity matrix 𝑺\boldsymbol{S}. The representative is STRAP (Yin and Wei, 2019), which imposes a threshold δ\delta and returns at most O(1δ)O(\frac{1}{\delta}) proximity scores no smaller than δ\delta for each node, making the proximity matrix of O(nδ)O(\frac{n}{\delta}) size. An SVD is then applied to the sparsified proximity matrix. Since the second solution explicitly derives the proximity matrix, it allows to take non-linear operations on the proximity matrix, improving the representation powers.

Neural network methods. Deep learning provides an alternative solution to generate node embeddings. SDNE (Wang et al., 2016) and DNGR (Cao et al., 2016) employ multi-layer auto-encoders with a target matrix to generate embeddings. DRNE (Tu et al., 2018) utilizes the layer normalized LSTM (Hochreiter and Schmidhuber, 1997) to generate node embeddings by aggregating the representations of neighbors of each node recursively. GraphGAN (Wang et al., 2018) adopts the well-known generative adversarial networks (Goodfellow et al., 2014) into graph representation learning via an adversarial minimax game. AW (Abu-El-Haija et al., 2018) proposes a novel attention model on the power series of the transition matrix, which guides the random walk to pay attention to important positions within the random walks by optimizing an upstream objective. The bottleneck of these solutions is the high computational cost, which restricts these methods to small graphs.

Other methods. There are also several methods that do not belong to the above three categories. For example, GraphWave (Donnat et al., 2018) learns node representations by leveraging heat wavelet diffusion patterns in an unsupervised way. NetHiex (Ma et al., 2018) captures the underlying hierarchical taxonomy of the graph to learn node representations with multiple components. RaRE (Gu et al., 2018) proposes a node embedding method that considers both social rank and proximity of nodes, and separately learns two representations for a node. AutoNE (Tu et al., 2019) incorporates AutoML into node embedding, which can automatically optimize the hyperparameters from the subgraphs of the original graph. PBG (Lerer et al., 2019) presents a distributed embedding system that uses the block decomposition of the adjacency matrix as a partition method to scale to arbitrary large graphs. A recent work, GraphZOOM (Deng et al., 2020), first generates subgraphs based on a fused graph and then applies existing approaches to generate node embeddings. Since these methods do not preserve any node pair proximity, the main concern is their insufficient effectiveness for downstream tasks as we will show in our experiment.

There are also various graph embedding methods designed for specific graphs, like dynamic graphs (Goyal et al., 2018; Trivedi et al., 2019) and heterogeneous networks (Dong et al., 2017; Fu et al., 2017). In this paper, we focus on the most fundamental case when the network is static and no feature vector is given.

3. Lemane Framework

1 𝑺𝟎,𝑹𝑰n\boldsymbol{S}\leftarrow\boldsymbol{0},\boldsymbol{R}\leftarrow\boldsymbol{I}_{n}
2 for k=0k=0 to LL do
3       𝑺𝑺+αk𝑹\boldsymbol{S}\leftarrow\boldsymbol{S}+\alpha_{k}\cdot\boldsymbol{R}
4       𝑹(1αk)𝑷𝑹\boldsymbol{R}\leftarrow(1-\alpha_{k})\cdot\boldsymbol{P}\cdot\boldsymbol{R}
5      
return 𝑺\boldsymbol{S}
Algorithm 1 Compute-Supervised-PPR(𝑷\boldsymbol{P}, LL)

In this section, we present our Lemane framework. Section 3.1 introduces the trainable proximity matrix and the training parameters. Section 3.2 elaborates on the training process of Lemane and introduces the loss functions used for link prediction and node classification, respectively. Section 3.3 presents how to carefully obtain the subsampled graphs to do training on large graphs.

3.1. Trainable Proximity Measure

Recap from Section 1 that, the personalized PageRank (PPR) 𝝅u(v)\boldsymbol{\pi}_{u}(v) of node vv with respect to uu is the probability that an α\alpha-discounted random walk from uu stops at node vv, where an α\alpha-discounted random walk has α\alpha probability to stop at the current node and (1α)(1-\alpha) probability to randomly jump to one of its out-neighbors. Define the transition matrix 𝑷=𝑫1𝑨\boldsymbol{P}=\boldsymbol{D}^{-1}\boldsymbol{A} where 𝑫\boldsymbol{D} is the diagonal matrix such that 𝑫(i,i)\boldsymbol{D}(i,i) is the out-degree (resp. degree) of node ii if GG is directed (resp. undirected), and 𝑨\boldsymbol{A} is the adjacency matrix. Then, the PPR proximity matrix can be expressed as:

𝑺=l=0α(1α)l𝑷l.\boldsymbol{S}=\sum_{l=0}^{\infty}\alpha\cdot(1-\alpha)^{l}\cdot\boldsymbol{P}^{l}.

In our trainable proximity measure, instead of fixing the stopping probability at each step to be α\alpha, we allow the stopping probability αl\alpha_{l} of the random walk at the ll-th hop to be trainable. Currently, we assume that the stopping probability αl\alpha_{l} at the ll-th step is given and will show how to train αl\alpha_{l} in Section 3.2. A random walk that follows such learned stopping probability at each step is denoted as a supervised random walk and the proximity derived from the supervised random walk is denoted as supervised PPR. The supervised PPR proximity matrix can be defined as:

(1) 𝑺=α0𝑰n+l=1αlk=0l1(1αk)𝑷l.\boldsymbol{S}=\alpha_{0}\boldsymbol{I}_{n}+\sum_{l=1}^{\infty}\alpha_{l}\cdot\prod_{k=0}^{l-1}(1-\alpha_{k})\cdot\boldsymbol{P}^{l}.

To explain, the probability that a supervised random walk stops at exactly the ll-th hop is αlk=0l1(1αk)\alpha_{l}\cdot\prod_{k=0}^{l-1}(1-\alpha_{k}). Thus, by summing up the probability to stop at each hop, we derive the supervised PPR score.

The exact supervised PPR score then can be computed with an iterative manner as shown in Algorithm 1 when LL\rightarrow\infty. However, we observe that when LL is sufficiently large, the probability that the supervised random walk stops with a hop number larger than LL is close to zero. Hence, we discard all supervised random walks that stop with a hop larger than LL. The following theorem shows the quality of calculated supervised PPR scores with Algorithm 1.

Theorem 1.

Let 𝐒L\boldsymbol{S}_{L} be the supervised PPR proximity matrix derived by Algorithm 1. Let 𝐌||\boldsymbol{M}||_{\infty} be the infinity-norm of matrix 𝐌\boldsymbol{M}, we have:

𝑺L𝑺k=0L(1αk)𝑺.||\boldsymbol{S}_{L}-\boldsymbol{S}||_{\infty}\leq\prod_{k=0}^{L}(1-\alpha_{k})||\boldsymbol{S}||_{\infty}.

All the proofs of our theorems can be found in the appendix. According to our observation, the probability that the supervised random walk stops at a hop greater than LL is almost zero when we set L=15L=15. Hence, LL is set to 1515 in our experiment.

Algorithm 1 makes a tight connection between the supervised PPR proximity matrix and our training parameters αl\alpha_{l} (0lL0\leq l\leq L), which allows the gradients to flow from the derived supervised PPR proximity matrix 𝑺L\boldsymbol{S}_{L} to the training parameters via backpropagation. Let nn and mm denote the number of nodes and the number of edges, respectively. The time complexity of the algorithm can be bounded by O(mnL)O(m\cdot n\cdot L) since 𝑷\boldsymbol{P} is a sparse matrix with mm entries.

Remark. Note that the idea to learn the stopping probability for each hop is not a new idea in graph neural networks, e.g., (Klicpera et al., 2019; Chien et al., 2021). It is much easier for neural networks as the stopping probabilities can be treated as additional weights to learn. However, it is more challenging to integrate this idea to node embedding, especially for matrix factorization methods. We show an efficient and effective solution that works for large graphs, which is non-trivial.

3.2. Training process of Lemane

Algorithm 2 shows the pseudo-code of the training process of Lemane. Firstly, it initializes the stopping probability αk\alpha_{k} for 0kL0\leq k\leq L such that the probability a random walk stops at the ll-th hop follows some standard distribution, e.g., uniform, geometric, or Poisson (Line 1). Next, it uses the initial settings of these stopping probabilities to derive the supervised PPR score by invoking Algorithm 1 (Line 3). Given the supervised PPR scores, it eliminates all the proximity scores that are too small. A threshold δ\delta is included and all scores smaller than δ\delta are set to zero (Lines 4-5). Then, a non-linear operation, log(𝑺δ)\log({\frac{\boldsymbol{S}}{\delta}}), is applied to the proximity matrix 𝑺\boldsymbol{S}. The proximity scores are divided by δ\delta to guarantee that each entry after taking the log\log will be non-negative (Line 6). Thereafter, a differentiable SVD, e.g., (Indyk et al., 2019), is applied on the new matrix 𝑴\boldsymbol{M} with input parameter dd (Line 7). The SVD obtains two n×dn\times d matrices 𝑼\boldsymbol{U} and 𝑽\boldsymbol{V}, and a d×dd\times d diagonal matrix 𝚺\boldsymbol{\Sigma} such that 𝑼𝚺𝑽𝑻𝑴\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V^{T}}\approx\boldsymbol{M}. Given the three matrices, 𝑼Σ\boldsymbol{U}\sqrt{\Sigma} is returned as the first embedding matrix 𝑿\boldsymbol{X} and 𝑽Σ\boldsymbol{V}\sqrt{\Sigma} is returned as the second embedding matrix 𝒀\boldsymbol{Y}.

Subsequently, the two embedding matrices are fed into the loss function for the link prediction or node classification (Line 9). The loss functions will be discussed shortly. Then, the stopping probabilities are updated according to the loss function by backpropagation. The training process terminates until the loss function converges (Line 2). Notice that, due to the extremely long computational chain, it is infeasible to write down the explicit form of the gradients. However, like modern deep neural networks, we can use the autograd feature in PyTorch to numerically compute the gradients with respect to the training parameters. Also, the builtin SVD in PyTorch supports to compute the gradient and hence we directly adopt the PyTorch implementation. In what follows, we elaborate on the details of the loss functions.

Loss function for link prediction. For link prediction, there are two components in our loss function. The first component of the loss function aims to ensure that the total information in the supervised random walks started from node uu is closed to its out-degree. Let 𝜶=(α0,α1,,αL)\boldsymbol{\alpha}=(\alpha_{0},\alpha_{1},\cdots,\alpha_{L}). The first part of the loss function is as follows:

(2) 1(𝜶;𝑨)=1n2uvu𝒙u𝒚vdout(u)22,\mathcal{L}_{1}(\boldsymbol{\alpha};\boldsymbol{A})=\frac{1}{n^{2}}\sum_{u}\parallel\sum_{v\neq u}\boldsymbol{x}_{u}\cdot\boldsymbol{y}_{v}-d_{out}(u)\parallel_{2}^{2},

where 𝒙u\boldsymbol{x}_{u} is the uu-th row of 𝑿\boldsymbol{X}, 𝒚v\boldsymbol{y}_{v} is the vv-th row of 𝒀\boldsymbol{Y}, and dout(u)d_{out}(u) is the out-degree of uu. Equation 2 indicates that our learned embedding by the supervised random walk should preserve the out-degree information as much as possible.

The second part of the loss function for link prediction is the average cross-entropy over all existing edges:

(3) 2(𝜶;𝑨)=1muv𝑨u,vlog(σ(𝒙u𝒚v)),\mathcal{L}_{2}(\boldsymbol{\alpha};\boldsymbol{A})=-\frac{1}{m}\sum_{u}\sum_{v}\boldsymbol{A}_{u,v}\log(\sigma(\boldsymbol{x}_{u}\cdot\boldsymbol{y}_{v})),

where σ(x)=1/(1+exp(x))\sigma(x)=1/(1+\exp(-x)) is the sigmoid function and 𝑨u,v=1\boldsymbol{A}_{u,v}=1 if edge (u,v)(u,v) exists in the input graph and 𝑨u,v=0\boldsymbol{A}_{u,v}=0 otherwise. The final loss function for link prediction is:

(4) p=β1+γ2,\mathcal{L}_{\textrm{p}}=\beta\mathcal{L}_{1}+\gamma\mathcal{L}_{2},

where β\beta and γ\gamma are two balancing hyperparameters.

Loss function for node classification. There are also two parts in the loss function for node classification. We first concatenate two output embedding matrices 𝑿\boldsymbol{X} and 𝒀\boldsymbol{Y} together to get the unique embedding matrix 𝒁=concat(𝑿,𝒀)\boldsymbol{Z}=\textrm{concat}(\boldsymbol{X},\boldsymbol{Y}). Then following the fact that two randomly selected nodes have different labels with high probability, we randomly sample a small set of negative node-pairs 𝒩V×V\mathcal{N}\subset V\times V for each iteration. Let H=(V,𝒩)H=(V,\mathcal{N}) denotes the graph with edge set 𝒩\mathcal{N} with unnormalized Laplacian matrix 𝑳H\boldsymbol{L}_{H} and Gk=(Vk,Ek)G_{k}=(V_{k},E_{k}) be a complete graph formed by nodes in GG with the same label kk. Inspired by the relaxation proposed in (Zhang et al., 2020), the first loss function for node classification is defined as follows:

Input: Matrix 𝑷\boldsymbol{P}, maximum length LL, threshold δ\delta, embedding dimension dd, learning rate η\eta
Output: stopping probabilities α0,,αL\alpha_{0},...,\alpha_{L}
1 Initialize αk\alpha_{k} for k=0,1,,Lk=0,1,...,L
2 while not convergence do
3       𝑺\boldsymbol{S}\leftarrow Compute-Supervised-PPR(𝑷,L\boldsymbol{P},L)
4       if S(i,j)<δS(i,j)<\delta, for S(i,j)S\forall S(i,j)\in S then
5             S(i,j)0S(i,j)\leftarrow 0
6            
7      Get matrix 𝑴\boldsymbol{M}\leftarrow log (𝑺δ)(\frac{\boldsymbol{S}}{\delta}) for non-zero entries
8       [𝑼,𝚺,𝑽][\boldsymbol{U},\boldsymbol{\Sigma},\boldsymbol{V}]\leftarrow Differentiable-SVD(𝑴,d)(\boldsymbol{M},d)
9       𝑿𝑼𝚺\boldsymbol{X}\leftarrow\boldsymbol{U}\sqrt{\boldsymbol{\Sigma}}, 𝒀𝑽𝚺\boldsymbol{Y}\leftarrow\boldsymbol{V}\sqrt{\boldsymbol{\Sigma}}
10       Compute link prediction loss p\mathcal{L}_{p} via Eq.4 or node classification loss c\mathcal{L}_{c} via Eq.7
11       for k=0,,Lk=0,...,L do
12             αkαkηαk\alpha_{k}\leftarrow\alpha_{k}-\eta\nabla_{\alpha_{k}}\mathcal{L}
13            
14      
15return α0,,αL\alpha_{0},...,\alpha_{L}
Algorithm 2 Lemane-Trainining
(5) 1(𝜶;𝑨)=k=1ncu,vVk𝒛u𝒛v22nc(u,v)𝒩𝒛u𝒛v22=k=1ncTr(𝒁𝑳k𝒁)ncTr(𝒁𝑳H𝒁),\mathcal{L}^{\prime}_{1}(\boldsymbol{\alpha};\boldsymbol{A})=\frac{\sum_{k=1}^{n_{c}}\sum_{u,v\in V_{k}}\left\|\boldsymbol{z}_{u}-\boldsymbol{z}_{v}\right\|_{2}^{2}}{n_{c}\cdot\sum_{(u,v)\in\mathcal{N}}\left\|\boldsymbol{z}_{u}-\boldsymbol{z}_{v}\right\|_{2}^{2}}=\frac{\sum_{k=1}^{n_{c}}\text{Tr}(\boldsymbol{Z}^{\top}\boldsymbol{L}_{k}\boldsymbol{Z})}{n_{c}\text{Tr}(\boldsymbol{Z}^{\top}\boldsymbol{L}_{H}\boldsymbol{Z})},

where 𝒛u\boldsymbol{z}_{u} is the concatenated representation of node uu, ncn_{c} is the total number of class labels in the graph, and LkL_{k} is the unnormalized Laplacian matrix of GkG_{k}. The goal of 1\mathcal{L^{\prime}}_{1} is to minimize pair-wise distances between node pairs with the same class label and maximize pair-wise distances between negative samples.

Next, we employ an activation function to normalize the output embedding to a probability distribution over predicted class labels: puk=softmax(𝒁𝑾+𝒃)ukp_{uk}={\textrm{softmax}}(\boldsymbol{Z}\cdot\boldsymbol{W}+\boldsymbol{b})_{uk}, where 𝑾\boldsymbol{W} is a 2d×nc2d\times n_{c} fixed mapping matrix generated from uniform distribution, 𝒃\boldsymbol{b} is the bias term, pukp_{uk} denotes the probability that node uu has class label kk, and softmax(𝒙)uk=exp(𝒙uk)/(c=1ncexp(𝒙uc)){\textrm{softmax}}(\boldsymbol{x})_{uk}=\exp(\boldsymbol{x}_{uk})/(\sum_{c=1}^{n_{c}}\exp(\boldsymbol{x}_{uc})). Let 𝒴\mathcal{Y} denote the n×ncn\times n_{c} label matrix, where 𝒴u,k=1\mathcal{Y}_{u,k}=1 if node uu has class label kk and 𝒴u,k=0\mathcal{Y}_{u,k}=0 otherwise. Then, the second part of the loss function for node classification is the average cross-entropy over all nodes:

(6) 2(𝜶;𝑨)=1nuk=1nc𝒴u,klogpuk.\mathcal{L}^{\prime}_{2}(\boldsymbol{\alpha};\boldsymbol{A})=-\frac{1}{n}\sum_{u}\sum_{k=1}^{n_{c}}\mathcal{Y}_{u,k}\log p_{uk}.

The final loss function for node classification is defined according to 1\mathcal{L}^{\prime}_{1} and 2\mathcal{L}^{\prime}_{2} as follows:

(7) c=β1+γ2,\mathcal{L}_{\textrm{c}}=\beta^{\prime}\mathcal{L}^{\prime}_{1}+\gamma^{\prime}\mathcal{L}^{\prime}_{2},

where β\beta^{\prime} and γ\gamma^{\prime} are two balancing hyperparameters.

Remark. Notice that the training algorithm for Lemane on node classification needs additional label information and we randomly sample 5%5\% of the labeled nodes for training. To make a fair comparison with our competitors, to train the classifiers, we will only include (x5)%(x-5)\% new training data if we split x%x\% of the data for training and the remaining (100x)%(100-x)\% for testing. This guarantees that our Lemane only accesses the same amount of labeled data compared to other competitors.

3.3. Training Lemane with sub-sampling

The above-mentioned training process requires calculations on a dense matrix of O(n2)O(n^{2}) size and requires O(Lnm)O(L\cdot n\cdot m) running cost, which makes it non-scalable to large graphs. However, such cost seems unavoidable at first glance since we need to obtain the proximity matrix to do backpropagation. After a careful analysis, we make the following two observations to help us avoid the high running costs. Our first key observation is that the parameters that we need to train are only the stopping probabilities at each hop, which are node-independent. That is to say, if we can find a subgraph of the input graph GG such that the learned stopping probabilities on the subgraph are identical to that on GG for the same task, we can simply learn the parameters on the subgraph with smaller size and apply the learned parameters to the original graph directly, reducing the computational costs. Another observation is that a subgraph of GG with similar connectivity should share similar learning stopping probabilities as the input graph on the same task.

Motivated by this, we present our sub-sampling based training method for Lemane on large graphs. Obviously, a straightforward solution is to sample a number nsn_{s} of nodes and then consider the subgraph containing these nsn_{s} nodes. However, such a solution severely degrades the connectivity among the nodes. Simple edge sampling strategies will face a similar dilemma which hampers the connectivity among the sampled nodes.

Input: Graph G, source node ss, threshold δ\delta, stopping probabilities α0,,αL\alpha_{0},...,\alpha_{L}
Output: Approximate proximity vector 𝝅^s\hat{\boldsymbol{\pi}}_{s}
1 Initialize 𝝅^s0,𝒓s(k)0\hat{\boldsymbol{\pi}}_{s}\leftarrow 0,\boldsymbol{r}^{(k)}_{s}\leftarrow 0 for k=0,1,,Lk=0,1,...,L
2 Initialize 𝒓s(0)(s)1\boldsymbol{r}^{(0)}_{s}(s)\leftarrow 1
3 while vV,0kL\exists v\in V,0\leq k\leq L such that 𝐫s(k)(v)>δdout(v)\boldsymbol{r}^{(k)}_{s}(v)>\delta\cdot d_{out}(v) do
4       𝝅^s(v)𝝅^s(v)+αk𝒓s(k)(v)\hat{\boldsymbol{\pi}}_{s}(v)\leftarrow\hat{\boldsymbol{\pi}}_{s}(v)+\alpha_{k}\cdot\boldsymbol{r}^{(k)}_{s}(v)
5       for uN(v)u\in N(v) do
6             𝒓s(k+1)(u)𝒓s(k+1)(u)+(1αk)𝒓s(k)(v)dout(v)\boldsymbol{r}^{(k+1)}_{s}(u)\leftarrow\boldsymbol{r}^{(k+1)}_{s}(u)+(1-\alpha_{k})\cdot\frac{\boldsymbol{r}^{(k)}_{s}(v)}{d_{out}(v)}
7            
8      𝒓s(k)(v)0\boldsymbol{r}^{(k)}_{s}(v)\leftarrow 0
9      
10return 𝝅^s\hat{\boldsymbol{\pi}}_{s}
Algorithm 3 Lemane-Generalized-Push

To keep the connectivity among the sampled nodes, we apply a BFS style traversal for subgraph sampling. Our goal is still to sample a subgraph with a constant number nsn_{s} of nodes. To sample such a subgraph, firstly a source node uu is randomly sampled from the input graph GG. Thereafter, a BFS traversal is applied from the source uu to explore the local community of node uu. If the number of visited nodes by the BFS from uu is smaller than nsn_{s}, another node vv is randomly sampled as the source to do BFS. The BFS sampling stops as soon as in total nsn_{s} nodes are visited.

However, the weights trained on a single subgraph might be biased and makes the learned stopping probabilities non-generalizable to the original input graph GG. To make the learned parameters generalizable to the input graph, we sample multiple subgraphs by the above strategy to learn the parameters. In particular, in each iteration, we sample a subgraph by the BFS strategy and then update the training parameters, i.e., the stopping probabilities, according to the loss functions defined on this subgraph. The loss functions on the subgraph are modified accordingly where the total number nn of nodes in Equations 2 and 6 are replaced by sample size nsn_{s}; the total number mm of edges in Equation 3 is also replaced by sample size nsn_{s}; the set of nodes with label kk in Equation 5 is changed to VSk=VkVSV_{Sk}=V_{k}\cap V_{S}, where VSV_{S} is the node set of the subgraph, VkV_{k} is the set of nodes with label kk in GG; and the negative sample set in Equation 5 is changed to 𝒩𝒮VS×VS\mathcal{N_{S}}\subset V_{S}\times V_{S}.

With such a sampling technique, the time complexity of Algorithm 2 can be bounded by O(hLns3)O(h\cdot L\cdot n_{s}^{3}) where hh is the number of training iterations by Algorithm 2 until it converges. Since nsn_{s} is a controllable constant, we set nsn_{s} such that the proximity matrix can be fed into the GPU memory for more efficient training.

4. Generalized Push

Given the learned stopping probabilities, a straightforward solution is to invoke Algorithm 1 to derive the supervised PPR scores. However, this incurs O(Lnm)O(L\cdot n\cdot m) running cost, which is prohibitive for large graphs. To tackle this issue, we present a generalized push algorithm to efficiently compute the supervised PPR proximity matrix with O(nδ)O(\frac{n}{\delta}) cost, where δ\delta is a parameter to control the computational cost as well as the sparsity of the proximity matrix.

4.1. Generalized Push Algorithm

Our main idea to compute the supervised PPR proximity matrix is to derive a sparsified proximity matrix such that the supervise PPR scores no larger than δ\delta can be safely discarded. But still, how much cost should we take to derive a sufficiently accurate approximation proximity score? In STRAP (Yin and Wei, 2019), the authors propose a solution to derive the PPR estimations such that |𝝅u(v)𝝅^u(v)|<δ|\boldsymbol{\pi}_{u}(v)-\hat{\boldsymbol{\pi}}_{u}(v)|<\delta with a cost of O(mδ)O(\frac{m}{\delta}). But, can we further reduce the running cost without sacrificing the embedding quality? Here, we give an affirmative answer. Our solution is inspired by the local graph clustering algorithm Local-Push (Andersen et al., 2006) which returns approximate PPRs with respect to a source ss in O(1δ)O(\frac{1}{\delta}) running time and guarantees that

|𝝅u(v)𝝅^u(v)|/dout(v)<δ, for any vV,|\boldsymbol{\pi}_{u}(v)-\hat{\boldsymbol{\pi}}_{u}(v)|/d_{out}(v)<\delta\textrm{, for any }v\in V,

on undirected graphs where dout(v)d_{out}(v) is the degree of node vv. The Local-Push algorithm suggests that if we only want to compute the approximate PPR scores around the local graph cluster with respect to a source, the running time can be reduced. In our case, the node embedding aims to find nodes that are their representatives and the nodes in their local graph cluster stand as perfect representatives. However, the Local-Push algorithm only works for PPR, not for our supervised PPR. Thus, we present a generalized push algorithm that works for arbitrary stopping probabilities.

Algorithm 3 shows the pseudo-code for our generalized push algorithm. Given a source node ss, a vector 𝝅^s\boldsymbol{\hat{\pi}}_{s} is maintained to store the portion of supervised random walks that has stopped at each node, and is the estimated supervised PPR scores with respect to source ss. Besides, for each hop 0kL0\leq k\leq L, where LL is the maximum length of a truncated supervised random walk introduced in Section 3.1, an additional residue vector 𝒓s(k)\boldsymbol{r}_{s}^{(k)} is maintained. The vector 𝒓s(k)\boldsymbol{r}_{s}^{(k)} indicates the portion of supervised random walks from ss that currently stay at the kk-th hop but have not stopped yet. Thus, if the residue vectors are all zero, it returns the exact supervised PPR scores. Initially, the residue vectors are all zero except for 𝒓s(0)(s)=1\boldsymbol{r}_{s}^{(0)}(s)=1 (Lines 1-2), indicating that the supervised random walks initially all stay at ss and has not stopped yet. Then, if any entry 𝒓s(k)(v)\boldsymbol{r}_{s}^{(k)}(v) in the LL residue vectors is above δdout(v)\delta\cdot d_{out}(v) (Line 3), a push operation (Lines 4-7) is invoked. In particular, it first converts αk𝒓s(k)(v)\alpha_{k}\cdot\boldsymbol{r}_{s}^{(k)}(v) to 𝝅^s(v)\boldsymbol{\hat{\pi}}_{s}(v) (Line 4). To explain, αk\alpha_{k} portion of the 𝒓s(k)(v)\boldsymbol{r}_{s}^{(k)}(v) random walks stop at the kk-th hop. Next, the remaining (1αk)(1-\alpha_{k}) portion of the 𝒓s(k)(v)\boldsymbol{r}_{s}^{(k)}(v) random walks randomly jump to the out-neighbors of vv (Lines 5-6). Thus, for each uu that is an out-neighbor of vv, the residue 𝒓s(k+1)(u)\boldsymbol{r}_{s}^{(k+1)}(u) is incremented by (1αk)𝒓s(k)(v)/dout(v)(1-\alpha_{k})\cdot\boldsymbol{r}_{s}^{(k)}(v)/d_{out}(v). After the push operation, the residue 𝒓s(k)(v)\boldsymbol{r}_{s}^{(k)}(v) is set to zero (Line 7). The algorithm terminates when there exists no residue 𝒓s(k)(v)\boldsymbol{r}_{s}^{(k)}(v) for any kk such that it is larger than dout(v)δd_{out}(v)\cdot\delta.

Input: Graph G, dimension dd, threshold δ\delta, stopping probability vector 𝜶\boldsymbol{\alpha}
Output: Embedding matrices 𝑿\boldsymbol{X} and 𝒀\boldsymbol{Y}
1 Initialize the proximity matrix 𝑺𝟎\boldsymbol{S}\leftarrow\boldsymbol{0}
2 for each node uVu\in V do
3       𝝅^u\boldsymbol{\hat{\pi}}_{u}\leftarrow Lemane-Generalized-Push(G,u,δ,𝜶)(G,u,\delta,\boldsymbol{\alpha})
4       𝝅^uT\boldsymbol{\hat{\pi}}_{u}^{T}\leftarrow Lemane-Generalized-Push(GT,u,δ,𝜶)(G^{T},u,\delta,\boldsymbol{\alpha})
5       for each node vv in VV do
6             if 𝛑^u(v)>δ\boldsymbol{\hat{\pi}}_{u}(v)>\delta then
7                   𝑺(u,v)+=𝝅^u(v)\boldsymbol{S}(u,v)+=\boldsymbol{\hat{\pi}}_{u}(v)
8                  
9            if 𝛑^uT(v)>δ\boldsymbol{\hat{\pi}}^{T}_{u}(v)>\delta then
10                   𝑺(v,u)+=𝝅^uT(v)\boldsymbol{S}(v,u)+=\boldsymbol{\hat{\pi}}^{T}_{u}(v)
11            
12      
13𝑴log(𝑺δ)\boldsymbol{M}\leftarrow\log(\frac{\boldsymbol{S}}{\delta}) for non-zero entries
14 [𝑼,𝚺,𝑽][\boldsymbol{U},\boldsymbol{\Sigma},\boldsymbol{V}]\leftarrow SparseSVD(𝑴,d)(\boldsymbol{M},d)
15 𝑿𝑼𝚺\boldsymbol{X}\leftarrow\boldsymbol{U}\sqrt{\boldsymbol{\Sigma}}, 𝒀𝑽𝚺\boldsymbol{Y}\leftarrow\boldsymbol{V}\sqrt{\boldsymbol{\Sigma}}
16 return 𝑿,𝒀\boldsymbol{X},\boldsymbol{Y}
Algorithm 4 Lemane-Embedding
Theorem 1.

Algorithm 3 runs in O(1δ)O(\frac{1}{\delta}) time.

Theorem 2.

Let 𝛑sL(u)\boldsymbol{\pi}_{s}^{L}(u) be the supervised PPR considering random walks within LL hops. Then for undirected graphs, Algorithm 3 returns an estimation 𝛑^s(u)\boldsymbol{\hat{\pi}}_{s}(u) of 𝛑sL(u)\boldsymbol{\pi}_{s}^{L}(u) for each node uu such that:

|𝝅^s(u)𝝅sL(u)|/dout(u)δL.|\boldsymbol{\hat{\pi}}_{s}(u)-\boldsymbol{\pi}^{L}_{s}(u)|/d_{out}(u)\leq\delta\cdot L.

By setting δ=δL\delta=\frac{\delta^{\prime}}{L}, Algorithm 3 runs in O(Lδ)O(\frac{L}{\delta^{\prime}}) time. At the same time, the error bound in Algorithm 2 can be bounded by LδL=δL\cdot\frac{\delta^{\prime}}{L}=\delta^{\prime}. Since LL can be treated as a constant, the running time with δ=δL\delta=\frac{\delta^{\prime}}{L} is still O(1δ)O(\frac{1}{\delta^{\prime}}) and for any node uu, we have that:

|𝝅^s(u)𝝅sL(u)|/dout(u)δ.|\boldsymbol{\hat{\pi}}_{s}(u)-\boldsymbol{\pi}^{L}_{s}(u)|/d_{out}(u)\leq\delta^{\prime}.

The above analysis shows that our generalized push algorithm can provide identical result quality as the Local-Push algorithm with the identical asymptotic running cost, thus returning high-quality results for the representative nodes of the source node.

4.2. Final Embedding

Given the generalized push algorithm, we finally show how to output the embedding for the graph. Following STRAP (Yin and Wei, 2019), we compute the supervised PPR on both the input graph GG and the transpose graph GTG^{T} by reversing the direction of each edge of GG and set 𝑺(u,v)\boldsymbol{S}(u,v) as 𝝅u(v)+𝝅vT(u)\boldsymbol{\pi}_{u}(v)+\boldsymbol{\pi}_{v}^{T}(u), where 𝝅u(v)\boldsymbol{\pi}_{u}(v) is the supervised PPR of vv with respect to uu and 𝝅vT(u)\boldsymbol{\pi}_{v}^{T}(u) is the supervised PPR of uu with respect to vv on the transpose graph GTG^{T}. Note that we do not bring this part into our training phase to reduce the computational costs since we use the same stopping probabilities for both the input graph GG and the transpose graph GTG^{T}. Algorithm 4 shows how to compute approximate proximity scores. For each node uu, we compute the approximate supervised PPR on the input graph GG and the transpose graph GTG^{T} (Lines 3-4). For any approximate supervised PPR score 𝝅^u(v)\boldsymbol{\hat{\pi}}_{u}(v), it is added to 𝑺(u,v)\boldsymbol{S}(u,v) only if it is larger than the threshold δ\delta. Similarly, each approximate supervised PPR score 𝝅^uT(v)\boldsymbol{\hat{\pi}}^{T}_{u}(v) on the transpose graph GTG^{T} is added to 𝑺(v,u)\boldsymbol{S}(v,u) only if 𝝅^uT(v)\boldsymbol{\hat{\pi}}^{T}_{u}(v) is larger than δ\delta (Lines 5-9). With such a pruning strategy, the proximity matrix 𝑺\boldsymbol{S} is sparsified to include O(nδ)O(\frac{n}{\delta}) non-zero entries. Then, a non-linear operation log(𝑺δ)\log(\frac{\boldsymbol{S}}{\delta}), is applied to 𝑺\boldsymbol{S}. Notice that all the entries are divided by δ\delta before taking the logarithm to guarantee that the values will be non-negative (Line 10). The resulting matrix 𝑴\boldsymbol{M} is then fed to a sparse SVD to derive final embedding matrices 𝑿\boldsymbol{X} and 𝒀\boldsymbol{Y} (Lines 11-12). We have the following theorem for the running time and decomposition quality with respect to 𝑴\boldsymbol{M}.

Theorem 3.

Algorithm 4 runs in O(nδ+nd2ϵ4)O(\frac{n}{\delta}+\frac{n\cdot d^{2}}{\epsilon^{4}}) time to guarantee that the embedding preserves the feeding matrix 𝐌\boldsymbol{M} with (1+ϵ)(1+\epsilon)-approximation to the best rank-dd matrix in terms of Frobenius-norm.

𝑴𝑿𝒀TF(1+ϵ)minrank(𝑩)d𝑴𝑩F.||\boldsymbol{M}-\boldsymbol{X}\boldsymbol{Y}^{T}||_{F}\leq(1+\epsilon)\min_{rank(\boldsymbol{B})\leq d}||\boldsymbol{M}-\boldsymbol{B}||_{F}.

Following previous work (Yin and Wei, 2019), we treat ϵ\epsilon as a constant and use the default setting of builtin SVD implementations. The running time of Algorithm 4 is O(nδ+nd2)O(\frac{n}{\delta}+n\cdot d^{2}).

5. Experiments

Table 1. Dataset statistics.
Name Type nn mm labels
Wikipedia directed 4.78K 184.81K 40
WikiVote directed 7.12K 103.69K -
BlogCatalog undirected 10.31K 333.98K 39
Slashdot directed 82.17K 870.16K -
TWeibo directed 1.94M 50.66M 100
Orkut undirected 3.07M 117.19M 100
Table 2. Link prediction precision (%) on small datasets.
Method Wikipedia Wikivote BlogCatalog
DeepWalk 88.33 68.32 85.35
Node2vec 82.86 78.50 82.07
VERSE 88.09 82.82 88.49
InfiniteWalk 66.86 81.07 84.05
AROPE 84.27 62.08 88.30
RandNE 83.15 77.62 87.09
ProNE 52.06 66.22 57.87
NetSMF 72.01 72.64 47.92
STRAP 86.53 92.58 89.58
NRP 83.56 91.07 90.10
GraphGAN 70.33 71.76 71.83
AW 50.42 56.62 62.56
NetHiex 45.03 73.01 65.58
GraphZoom 84.73 82.10 86.82
Louvain 56.22 58.33 59.97
Lemane-F 87.79 92.78 89.92

We compare our Lemane against alternative solutions on link prediction and node classification tasks. All experiments are conducted on a Linux machine with an Intel Xeon(R) CPU clocked at 2.70GHz, an NVIDIA GeForce RTX 2080 Super 8GB GPU, and 384GB memory.

5.1. Experimental settings

Datasets. We test on six real datasets that are used in recent node embedding studies (Perozzi et al., 2014; Grover and Leskovec, 2016; Tsitsulin et al., 2018; Lerer et al., 2019; Yin and Wei, 2019; Yang et al., 2019, 2020). The statistics of these datasets are shown in Table 1. BlogCatalog (Tang and Liu, 2009), Slashdot (Leskovec et al., 2009), TWeibo (twe, ggle) and Orkut (Mislove et al., 2007) are four social networks in which links represent a friendship/following relationship between users. Wikipedia (wik, mark) is a co-occurrence network of words appearing in the Wikipedia dump. Wikivote (Leskovec et al., 2010) is a who-votes-on-whom network on Wikipedia. All datasets and the node labels (if any) can be downloaded from public sources (blo, orks; wik, mark; sna, SNAP; twe, ggle).

Competitors. We evaluate Lemane against 15 node embedding methods, including some classic methods and several state-of-the-art methods. We divide these methods into four groups as follows.

  • Skip-gram methods: DeepWalk (Perozzi et al., 2014), Node2vec (Grover and Leskovec, 2016), VERSE (Tsitsulin et al., 2018), and InfiniteWalk (Chanpuriya and Musco, 2020);

  • Matrix factorization methods: AROPE (Zhang et al., 2018b), RandNE (Zhang et al., 2018a), ProNE (Zhang et al., 2019), NetSMF (Qiu et al., 2019), STRAP (Yin and Wei, 2019), and NRP222There were some implementation issues in the released code of NRP on undirected graphs, which was fixed recently by the inventors. (Yang et al., 2020);

  • Neural network methods: GraphGAN (Wang et al., 2018) and AW (Abu-El-Haija et al., 2018);

  • Other methods: NetHiex (Ma et al., 2018), GraphZOOM333We use the default embedding method DeepWalk for GraphZoom. (Deng et al., 2020), Louvain (Bhowmick et al., 2020).

For our methods, we use Lemane-F to indicate the algorithm trained on the entire graph and Lemane-S to indicate the algorithm trained on subsampled graphs. Notice that Lemane-F is adopted for small datasets Wikipedia, Wikivote, and BlogCatalog. Lemane-S is adopted for large datasets Slashdot, TWeibo, and Orkut.

Table 3. Link prediction precision (%) on large datasets.
Method Slashdot Tweibo Orkut
AROPE 82.83 69.46 82.03
RandNE 81.03 70.74 79.45
ProNE 72.80 45.47 80.88
STRAP 83.07 94.58 85.73
NRP 80.98 93.87 86.34
Louvain 55.56 64.25 80.85
Lemane-S 84.13 94.89 89.15

Parameter settings. We obtain the source code of all competitors from GitHub and perform these methods with default parameter settings suggested by their authors. Following previous studies (Perozzi et al., 2014; Tsitsulin et al., 2018; Yin and Wei, 2019), we set the embedding dimensionality d=128d=128. For our Lemane-S, we set the sample size ns=5000n_{s}=5000. For Lemane-F and Lemane-S, we use grid search to set β,β,γ,γ\beta,\beta^{\prime},\gamma,\gamma^{\prime} from {0.01, 0.1, 0.5, 1, 2, 3}, learning rate from {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}, and threshold δ\delta from {10710^{-7},10610^{-6},10510^{-5},10410^{-4}}; we use JacobiSVD for Lemane-F and frPCA (Feng et al., 2018) for Lemane-S, to generate final embeddings.

Since the backpropagation in Lemane is complicated, our objective function may easily fall into a local minimum. To tackle this issue, the stopping probabilities are initialized with different distributions. Specifically, we set α0,α1,,αL\alpha_{0},\alpha_{1},\cdots,\alpha_{L} such that the probability that the supervised random walk stops at the ll-th hop follows a geometric distribution or Poisson distribution with respect to ll and we report the best results of Lemane. The parameters of Lemane are optimized by Stochastic Gradient Descent (SGD) optimizer.

5.2. Link Prediction

Link prediction aims to predict which pairs of nodes are likely to form edges. Following previous work (Yang et al., 2020; Zhang et al., 2018b), we randomly hide 30%30\% of the edges for testing and train the embedding vectors on the rest of the graph. Then, the testing set is generated by including (i) the node pairs corresponding to the 30%30\% removed edges, and (ii) an equal number of node pairs that are not connected by any edge in the original graph GG. Given a node pair (u,v)(u,v) in EtestE_{test}, we compute a score for (u,v)(u,v) based on the embedding vectors, and evaluate model performance using precision score.

Following previous work (Yin and Wei, 2019; Yang et al., 2020), for DeepWalk, Node2vec, VERSE, InfiniteWalk, GraphGAN, and Louvain, we use the edge-feature approach introduced in (Ma et al., 2018): (i) randomly select 30%30\% existing edges which are not in EtestE_{test} and the same number of non-existing edges as training set EtrainE_{train} on each dataset; (ii) for each node pair (u,v)EtrainEtest(u,v)\in E_{train}\cup E_{test}, we concatenate dd-dimension embedding vectors of node uu and that of node vv; (iii) we consider the 2d2d-length vectors as the features of node pairs in EtrainE_{train} and feed them into a binary logistic regression classifier; (iv) then the trained classifier is used to perform link prediction on EtestE_{test}. For AROPE, RandNE, ProNE, AW, NetHiex, and GraphZOOM, the score of a node pair (u,v)(u,v) is the inner product of the embedding vector of node uu and that of node vv; for NetSMF, STRAP, NRP, and our Lemane, the score of a node pair (u,v)(u,v) is the inner product of embedding 𝒙u\boldsymbol{x}_{u} from 𝑿\boldsymbol{X} and 𝒚v\boldsymbol{y}_{v} from 𝒀\boldsymbol{Y} (Ref. to Section 3 for the definitions of 𝑿\boldsymbol{X} and 𝒀\boldsymbol{Y}).

Table 2 reports the performance of Lemane-F against the 15 competitors on three small datasets. Table 3 further reports the performance of Lemane-S against 6 methods which scale to large graphs. For the other 9 methods, they either cannot finish training in 24 hours or run out of memory on the large graphs. As we can observe, our Lemane shows the best performance on 4 social networks where link prediction finds extensive applications. On the Wikipedia dataset, a co-occurrence network of words appearing in the Wikipedia, our Lemane still achieves high performance and is the best method among all matrix factorization methods. Compared to two state-of-the-art matrix factorization methods STRAP and NRP, our Lemane takes the lead by more than 1%1\% on Wikipedia and Slashdot, and takes the lead by almost 3%3\% on Orkut. This demonstrates the effectiveness of our learning based method.

Table 4. Node Classification Micro-F1 (%) on Wikipedia.
Method 0.1 0.3 0.5 0.7 0.9
DeepWalk 42.02 46.12 48.46 49.39 49.35
Node2vec 44.75 48.08 49.82 50.69 50.41
VERSE 38.76 41.92 43.84 44.92 44.31
InfiniteWalk 38.96 42.64 45.94 47.73 48.24
AROPE 45.82 50.59 52.47 53.36 52.16
RandNE 34.51 32.83 43.25 45.55 45.93
ProNE 44.49 50.45 53.15 54.38 54.43
NetSMF 40.29 42.56 43.68 44.08 44.12
STRAP 46.51 50.77 52.44 52.64 52.37
NRP 48.04 52.71 54.39 55.20 54.30
GraphGAN 32.87 35.43 36.47 37.68 37.50
AW 40.70 40.70 40.44 40.30 39.47
NetHiex 45.58 47.95 49.39 49.77 49.20
GraphZoom 40.76 41.03 41.18 41.07 40.54
Louvain 40.86 40.99 40.72 41.25 40.69
Lemane-F 47.79 52.39 54.34 54.67 54.25
Table 5. Node Classification Micro-F1 (%) on BlogCatalog.
Method 0.1 0.3 0.5 0.7 0.9
DeepWalk 33.01 36.97 38.70 39.87 41.31
Node2vec 35.01 37.16 37.97 38.51 39.13
VERSE 32.76 36.32 38.18 39.09 40.71
InfiniteWalk 34.30 38.00 40.21 41.87 43.14
AROPE 29.12 32.78 34.12 34.95 35.77
RandNE 26.75 31.56 36.20 38.34 39.93
ProNE 36.38 40.33 41.56 42.32 42.28
NetSMF 34.95 37.99 39.30 40.19 40.72
STRAP 38.62 41.80 42.96 43.39 43.97
NRP 38.73 41.65 42.36 43.15 43.34
GraphGAN 14.97 17.23 18.81 20.07 21.16
AW 16.52 16.91 16.98 17.25 17.39
NetHiex 37.46 40.06 40.63 41.43 42.33
GraphZoom 22.02 25.59 27.75 29.37 30.70
Louvain 19.16 19.88 21.12 21.30 21.74
Lemane-F 39.64 42.38 43.34 44.03 44.37

5.3. Node Classification

Node classification task aims to predict the label(s) of each node based on the embeddings. As we mentioned in Section 3, we first randomly sample 5%5\% labeled nodes for parameter training. Then the classification task is performed with the following steps: (i) following (Yang et al., 2020), for Lemane and other matrix factorization methods 444For matrix factorization based methods, there was some implementation issues in the evaluation code for node classification on directed graphs. We have rerun the experiment for matrix factorization based methods on directed graphs., we first normalize the embedding vector555We observe some significant improvement on Orkut dataset when normalization is applied. Thus, we take normalization for the embedding vectors on all datasets. 𝒙v\boldsymbol{x}_{v} from 𝑿\boldsymbol{X} and embedding vector 𝒚v\boldsymbol{y}_{v} from 𝒀\boldsymbol{Y} of each node vv, and then concatenate them to get the representation of vv; (ii) for methods without factorization operation, we use the embedding vector of node vv as its representation. Specifically, we randomly split the node and label sets into the training set and testing set and the training ratio varies from 10%10\% to 90%90\%. To make a fair comparison, note that if the training ratio is x%x\%, then for Lemane, it will sample an additional (x5)%(x-5)\% and include the 5%5\% labeled nodes to train the classifiers. Following previous work (Perozzi et al., 2014; Grover and Leskovec, 2016), we employ a one-vs-rest logistic regression classifier implemented by LIBLINEAR (Fan et al., 2008) with default parameters for all methods. Micro-F1 score is used as the evaluation metric for the classification task.

Table 6. Node classification Micro-F1 (%) on TWeibo.
Method 0.1 0.3 0.5 0.7 0.9
AROPE 33.96 34.11 34.18 34.24 34.29
RandNE 34.64 34.67 34.80 34.92 34.96
ProNE 35.27 35.37 35.43 35.48 35.52
STRAP 35.75 35.97 36.03 36.04 36.04
NRP 35.73 35.97 36.03 36.04 36.04
Louvain 34.14 34.29 34.33 34.38 34.42
Lemane-S 35.77 36.03 36.04 36.05 36.05
Table 7. Node classification Micro-F1 (%) on Orkut.
Method 0.1 0.3 0.5 0.7 0.9
AROPE 49.11 50.86 51.55 51.89 52.22
RandNE 44.94 49.35 50.47 51.11 51.53
ProNE 36.09 37.13 38.32 38.81 39.44
STRAP 70.37 73.09 74.04 74.60 75.08
NRP 72.47 75.58 76.69 77.36 77.98
Louvain 29.16 36.00 36.51 36.85 36.63
Lemane-S 73.32 76.27 77.26 77.89 78.14

For node classification, we test on four datasets Wikipedia, BlogCatalog, TWeibo, and Orkut, which include label information. Table 4 and Table 5 show the performance of our Lemane-F against all 15 methods on the two small datasets: Wikipedia and BlogCatalog, respectively. Table 6 and Table 7 show the performance of our Lemane-S against 6 methods that scale to TWeibo and Orkut.

We make the following observations. Firstly, our Lemane achieves the best Micro-F1 scores on three datasets BlogCatalog, Tweibo, and Orkut in all of the tested training ratios. Besides, compared to STRAP, which takes PPR without training the stopping probabilities, our Lemane achieves more than 1%1\% lead on Wikipedia datasets and up to 3%3\% on the Orkut dataset. Compared to the second-best matrix factorization method NRP, our Lemane further achieves about 1%1\% lead on the BlogCatalog dataset in all of the tested training ratios.

In summary, the experimental study reveals that our learning based Lemane can learn proximity measures that most suit the task in most scenarios. Compared to other matrix factorization methods, e.g., STRAP (Yin and Wei, 2019) and NRP (Yang et al., 2020), that take personalized PageRank as the proximity measure without learning, our Lemane can train the proximity measure, i.e., the supervised PPR, to gain better performance.

6. Conclusion

In this paper, we present Lemane that learns trainable proximity measures to best suit the datasets and tasks at hand automatically. Experimental results reveal that Lemane can learn more representative embeddings compared with state-of-the-art approaches.

Acknowledgements.
Sibo Wang is supported by Hong Kong RGC ECS (No. 24203419), RGC CRF (No. C4158-20G), CUHK Direct Grant (No. 4055114), and NSFC (No. U1936205). Zengfeng Huang is supported by Shanghai Sailing Program Grant No. 18YF1401200, Shanghai Science and Technology Commission Grant No. 17JC1420200.

References

  • (1)
  • twe (ggle) Kaggle. https://www.kaggle.com/c/kddcup2012-track1/data.
  • wik (mark) Large text compression benchmark. http://mattmahoney.net/dc/textdata.
  • sna (SNAP) SNAP. http://snap.stanford.edu/data/.
  • blo (orks) Social-Dimension Approach to Classification in Large-Scale Networks. http://leitang.net/social_dimension.html.
  • Abu-El-Haija et al. (2018) Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alex Alemi. 2018. Watch Your Step: Learning Node Embeddings via Graph Attention. In NeurIPS. 9198–9208.
  • Andersen et al. (2006) Reid Andersen, Fan Chung, and Kevin Lang. 2006. Local Graph Partitioning using PageRank Vectors. In FOCS. 475–486.
  • Bhowmick et al. (2020) Ayan Kumar Bhowmick, Koushik Meneni, Maximilien Danisch, Jean-Loup Guillaume, and Bivas Mitra. 2020. LouvainNE: Hierarchical Louvain Method for High Quality and Scalable Network Embedding. In WSDM. 43–51.
  • Cao et al. (2016) Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2016. Deep Neural Networks for Learning Graph Representations. In AAAI. 1145–1152.
  • Chanpuriya and Musco (2020) Sudhanshu Chanpuriya and Cameron Musco. 2020. InfiniteWalk: Deep Network Embeddings as Laplacian Embeddings with a Nonlinearity. In SIGKDD. 1325–1333.
  • Chien et al. (2021) Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. 2021. Adaptive Universal Generalized PageRank Graph Neural Network. In ICLR.
  • Clarkson and Woodruff (2017) Kenneth L. Clarkson and David P. Woodruff. 2017. Low-Rank Approximation and Regression in Input Sparsity Time. J. ACM 63, 6 (2017).
  • Deng et al. (2020) Chenhui Deng, Zhiqiang Zhao, Yongyu Wang, Zhiru Zhang, and Zhuo Feng. 2020. GraphZoom: A Multi-level Spectral Approach for Accurate and Scalable Graph Embedding. In ICLR.
  • Dong et al. (2017) Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. 2017. Metapath2vec: Scalable Representation Learning for Heterogeneous Networks. In SIGKDD. 135–144.
  • Donnat et al. (2018) Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. 2018. Learning Structural Node Embeddings via Diffusion Wavelets. In SIGKDD. 1320–1329.
  • Fan et al. (2008) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res. 9 (2008), 1871–1874.
  • Feng et al. (2018) Xu Feng, Yuyang Xie, Mingye Song, Wenjian Yu, and Jie Tang. 2018. Fast Randomized PCA for Sparse Data. In ACML. 710–725.
  • Fouss et al. (2007) Francois Fouss, Alain Pirotte, Jean-michel Renders, and Marco Saerens. 2007. Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation. TKDE 19 (2007), 355–369.
  • Fu et al. (2017) Tao-yang Fu, Wang-Chien Lee, and Zhen Lei. 2017. HIN2Vec: Explore Meta-Paths in Heterogeneous Information Networks for Representation Learning. In CIKM. 1797–1806.
  • Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In NeurIPS. 2672–2680.
  • Goyal et al. (2018) Palash Goyal, Nitin Kamra, Xinran He, and Yan Liu. 2018. DynGEM: Deep Embedding Method for Dynamic Graphs. CoRR abs/1805.11273 (2018).
  • Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. Node2vec: Scalable Feature Learning for Networks. In SIGKDD. 855–864.
  • Gu et al. (2018) Yupeng Gu, Yizhou Sun, Yanen Li, and Yang Yang. 2018. RaRE: Social Rank Regulated Large-Scale Network Embedding. In WWW. 359–368.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
  • Indyk et al. (2019) Piotr Indyk, Ali Vakilian, and Yang Yuan. 2019. Learning-Based Low-Rank Approximations. In NeurIPS. 7402–7412.
  • Klicpera et al. (2019) Johannes Klicpera, Stefan Weißenberger, and Stephan Günnemann. 2019. Diffusion Improves Graph Learning. In (NeurIPS). 13333–13345.
  • Lerer et al. (2019) Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A Large-scale Graph Embedding System. In SysML.
  • Leskovec et al. (2010) Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. 2010. Predicting Positive and Negative Links in Online Social Networks. In WWW. 641–650.
  • Leskovec et al. (2009) Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. 2009. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics 6, 1 (2009), 29–123.
  • Ma et al. (2018) Jianxin Ma, Peng Cui, Xiao Wang, and Wenwu Zhu. 2018. Hierarchical Taxonomy Aware Network Embedding. In SIGKDD. 1920–1929.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NeurIPS. 3111–3119.
  • Mislove et al. (2007) Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2007. Measurement and Analysis of Online Social Networks. In IMC. 29–42.
  • Ou et al. (2016) Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. 2016. Asymmetric Transitivity Preserving Graph Embedding. In SIGKDD. 1105–1114.
  • Page et al. (1999) Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: bringing order to the web. (1999).
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. In SIGKDD. 701–710.
  • Pons and Latapy (2005) Pascal Pons and Matthieu Latapy. 2005. Computing communities in large networks using random walks. In ISCIS. 284–293.
  • Qiu et al. (2019) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Chi Wang, Kuansan Wang, and Jie Tang. 2019. NetSMF: Large-Scale Network Embedding As Sparse Matrix Factorization. In WWW. 1509–1520.
  • Qiu et al. (2018) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and Node2vec. In WSDM. 459–467.
  • Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-Scale Information Network Embedding. In WWW. 1067–1077.
  • Tang and Liu (2009) Lei Tang and Huan Liu. 2009. Relational Learning via Latent Social Dimensions. In SIGKDD. 817–826.
  • Trivedi et al. (2019) Rakshit Trivedi, Mehrdad Farajtabar, Prasenjeet Biswal, and Hongyuan Zha. 2019. DyRep: Learning Representations over Dynamic Graphs. In ICLR.
  • Tsitsulin et al. (2018) Anton Tsitsulin, Davide Mottin, Panagiotis Karras, and Emmanuel Müller. 2018. VERSE: Versatile Graph Embeddings from Similarity Measures. In WWW. 539–548.
  • Tu et al. (2018) Ke Tu, Peng Cui, Xiao Wang, Philip S. Yu, and Wenwu Zhu. 2018. Deep Recursive Network Embedding with Regular Equivalence. In SIGKDD. 2357–2366.
  • Tu et al. (2019) Ke Tu, Jianxin Ma, Peng Cui, Jian Pei, and Wenwu Zhu. 2019. AutoNE: Hyperparameter Optimization for Massive Network Embedding. In SIGKDD. 216–225.
  • Wang et al. (2016) Daixin Wang, Peng Cui, and Wenwu Zhu. 2016. Structural Deep Network Embedding. In SIGKDD. 1225–1234.
  • Wang et al. (2018) Hongwei Wang, Jia Wang, Jialin Wang, Miao Zhao, Weinan Zhang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018. GraphGAN: Graph Representation Learning With Generative Adversarial Nets. In AAAI. 2508–2515.
  • Yang et al. (2020) Renchi Yang, Jieming Shi, Xiaokui Xiao, Yin Yang, and Sourav S. Bhowmick. 2020. Homogeneous Network Embedding for Massive Graphs via Reweighted Personalized PageRank. PVLDB 13, 5 (2020), 670–683.
  • Yang et al. (2019) Renchi Yang, Xiaokui Xiao, Zhewei Wei, Sourav S. Bhowmick, Jun Zhao, and Rong-Hua Li. 2019. Efficient Estimation of Heat Kernel PageRank for Local Clustering. In SIGMOD. 1339–1356.
  • Yin and Wei (2019) Yuan Yin and Zhewei Wei. 2019. Scalable Graph Embeddings via Sparse Transpose Proximities. In SIGKDD. 1429–1437.
  • Yu et al. (2014) Xiao Yu, Xiang Ren, Yizhou Sun, Quanquan Gu, Bradley Sturt, Urvashi Khandelwal, Brandon Norick, and Jiawei Han. 2014. Personalized Entity Recommendation: A Heterogeneous Information Network Approach. In WSDM. 283–292.
  • Zhang et al. (2019) Jie Zhang, Yuxiao Dong, Yan Wang, Jie Tang, and Ming Ding. 2019. ProNE: Fast and Scalable Network Representation Learning. In IJCAI. 4278–4284.
  • Zhang et al. (2020) Shengzhong Zhang, Zengfeng Huang, Haicang Zhou, and Ziang Zhou. 2020. SCE: Scalable Network Embedding from Sparsest Cut. In SIGKDD. 257–265.
  • Zhang et al. (2018a) Ziwei Zhang, Peng Cui, Haoyang Li, Xiao Wang, and Wenwu Zhu. 2018a. Billion-Scale Network Embedding with Iterative Random Projection. In ICDM. 787–796.
  • Zhang et al. (2018b) Ziwei Zhang, Peng Cui, Xiao Wang, Jian Pei, Xuanrong Yao, and Wenwu Zhu. 2018b. Arbitrary-Order Proximity Preserved Network Embedding. In SIGKDD. 2778–2786.
  • Zhou et al. (2017) Chang Zhou, Yuqiong Liu, Xiaofei Liu, Zhongyi Liu, and Jun Gao. 2017. Scalable Graph Embedding for Asymmetric Proximity. In AAAI. 2942–2948.

Appendix A Proofs

Proof of Theorem 1. We first prove the following lemma.

Lemma 0.

The probability that a supervised random walk stops at exactly the kk-th hop is αkl=0k1(1αl)\alpha_{k}\prod_{l=0}^{k-1}(1-\alpha_{l}) and for any k, we have

αk+l=1αk+lj=0k+l1(1αj)=1.\alpha_{k}+\sum_{l=1}^{\infty}\alpha_{k+l}\prod_{j=0}^{k+l-1}(1-\alpha_{j})=1.
Proof.

Define Probk(L)=αk+l=1Lαk+lj=0k+l1(1αj)Prob_{k}(L)=\alpha_{k}+\sum_{l=1}^{L}\alpha_{k+l}\prod_{j=0}^{k+l-1}(1-\alpha_{j}). We claim that 1Probk(L)=j=0L(1αk+j)1-Prob_{k}(L)=\prod_{j=0}^{L}(1-\alpha_{k+j}), implying that (1Probk(L))0(1-Prob_{k}(L))\to 0 when LL\to\infty and thus, Probk(L)1Prob_{k}(L)\to 1 when LL\to\infty. Now we justify 1Probk(L)=j=0L(1αk+j)1-Prob_{k}(L)=\prod_{j=0}^{L}(1-\alpha_{k+j}) by induction. When L=0L=0, 1Probk(0)=1αk1-Prob_{k}(0)=1-\alpha_{k} holds obviously. Suppose 1Probk(t)=j=0t(1αk+j)1-Prob_{k}(t)=\prod_{j=0}^{t}(1-\alpha_{k+j}) holds. Then for Probk(t+1)=Probk(t)+αk+t+1(1Probk(t))Prob_{k}(t+1)=Prob_{k}(t)+\alpha_{k+t+1}(1-Prob_{k}(t)), we can derive that:

1Probk(t+1)\displaystyle 1-Prob_{k}(t+1) =1Probk(t)αk+t+1(1Probk(t))\displaystyle=1-Prob_{k}(t)-\alpha_{k+t+1}(1-Prob_{k}(t))
=(1αk+t+1)(1Probk(t))=j=0t+1(1αk+j).\displaystyle=(1-\alpha_{k+t+1})(1-Prob_{k}(t))=\prod_{j=0}^{t+1}(1-\alpha_{k+j}).

Thus 1Probk(L)=j=0L(1αk+j)1-Prob_{k}(L)=\prod_{j=0}^{L}(1-\alpha_{k+j}) holds for any LL. Proof done. ∎

Note that the transition matrix 𝑷\boldsymbol{P} is a row-stochastic matrix, i.e. 𝑷(i,j)0\boldsymbol{P}(i,j)\geq 0 for any i,j=1,2,,ni,j=1,2,\cdots,n, and j=1n𝑷(i,j)=1\sum_{j=1}^{n}\boldsymbol{P}(i,j)=1 for any i=1,2,,ni=1,2,\cdots,n. Following these properties, we can easily derive that 𝑷l\boldsymbol{P}^{l} is also row-stochastic, where ll is the power of the matrix. Now we first look at the r.h.s. of the inequality in Theorem 1. Recall that the infinity-norm of matrix is the maximum absolute row sum, i.e., 𝑴=max1inj=1n|𝑴(i,j)|||\boldsymbol{M}||_{\infty}=\max_{1\leq i\leq n}\sum_{j=1}^{n}|\boldsymbol{M}(i,j)|. Then, by Lemma 1, for any i=1,2,,ni=1,2,\cdots,n, we have

j=1n𝑺(i,j)\displaystyle\sum_{j=1}^{n}\boldsymbol{S}(i,j) =α0j=1n𝑷0(i,j)+α1(1α0)j=1n𝑷1(i,j)+\displaystyle=\alpha_{0}\sum_{j=1}^{n}\boldsymbol{P}^{0}(i,j)+\alpha_{1}(1-\alpha_{0})\sum_{j=1}^{n}\boldsymbol{P}^{1}(i,j)+\cdots
=α0+l=1αlk=0l1(1αk)=1.\displaystyle=\alpha_{0}+\sum_{l=1}^{\infty}\alpha_{l}\prod_{k=0}^{l-1}(1-\alpha_{k})=1.

In terms of the l.h.s. of the inequality in Theorem 1, for any ii-th row of matrix 𝑺𝑺L\boldsymbol{S}-\boldsymbol{S}_{L}, we have that:

j=1n(𝑺𝑺L)(i,j)\displaystyle\sum_{j=1}^{n}(\boldsymbol{S}-\boldsymbol{S}_{L})(i,j)
=\displaystyle= l=L+1αlk=0l1(1αk)j=1n𝑷L+1(i,j)=l=L+1αlk=0l1(1αk)\displaystyle\sum_{l=L+1}^{\infty}\alpha_{l}\prod_{k=0}^{l-1}(1-\alpha_{k})\sum_{j=1}^{n}\boldsymbol{P}^{L+1}(i,j)=\sum_{l=L+1}^{\infty}\alpha_{l}\cdot\prod_{k=0}^{l-1}(1-\alpha_{k})
=\displaystyle= k=0L(1αk)(αL+1+αL+2(1αL+1)+)k=0L(1αk).\displaystyle\prod_{k=0}^{L}(1-\alpha_{k})(\alpha_{L+1}+\alpha_{L+2}(1-\alpha_{L+1})+\cdots)\leq\prod_{k=0}^{L}(1-\alpha_{k}).

By combining the above derivations,

j=1n(𝑺𝑺L)(i,j)k=0L(1αk)j=1n𝑺(i,j)\sum_{j=1}^{n}(\boldsymbol{S}-\boldsymbol{S}_{L})(i,j)\leq\prod_{k=0}^{L}(1-\alpha_{k})\cdot\sum_{j=1}^{n}\boldsymbol{S}(i,j)

holds for any i=1,2,,ni=1,2,\cdots,n. Thus, it clearly holds that

𝑺𝑺Lk=0L(1αk)𝑺.||\boldsymbol{S}-\boldsymbol{S}_{L}||_{\infty}\leq\prod_{k=0}^{L}(1-\alpha_{k})||\boldsymbol{S}||_{\infty}.

The theorem is proved.

Proof of Theorem 1. The cost of Algorithm 3 is dominated by the number of push operations. Recall that the push operation is invoked only if there exists an entry 𝒓s(k)(v)\boldsymbol{r}^{(k)}_{s}(v) in residue vectors such that 𝒓s(k)(v)δdout(v)\boldsymbol{r}^{(k)}_{s}(v)\geq\delta\cdot d_{out}(v). Let 𝝅s(l)(v)\boldsymbol{\pi}_{s}^{(l)}(v) denote the ll-hop supervised PPR whose value is the probability that the ll-hop supervised random walk from ss stops at vv. Then the total number of push operations caused by the residue value 𝒓s(l)(v)\boldsymbol{r}^{(l)}_{s}(v) can be bounded by 𝝅s(l)(v)αlδdout(v)\frac{\boldsymbol{\pi}_{s}^{(l)}(v)}{\alpha_{l}\delta\cdot d_{out}(v)}. In addition, since in each push operation, the node vv will pass its residue to its out-neighbors, the cost for a push operation at vv is bounded by O(dout(v))O(d_{out}(v)). Thus, the total cost of the push operations on entry 𝒓s(l)(v)\boldsymbol{r}^{(l)}_{s}(v) is bounded by 𝝅s(l)(v)αlδdout(v)dout(v)\frac{\boldsymbol{\pi}_{s}^{(l)}(v)}{\alpha_{l}\delta\cdot d_{out}(v)}\cdot d_{out}(v). Based on the above analysis, the time cost of Algorithm 3 is:

l=0LvV𝝅s(l)(v)αlδdout(v)dout(v)1αminδl=0LvV𝝅s(l)(v)1αminδ,\sum_{l=0}^{L}\sum_{v\in V}\frac{\boldsymbol{\pi}_{s}^{(l)}(v)}{\alpha_{l}\delta\cdot d_{out}(v)}\cdot d_{out}(v)\leq\frac{1}{\alpha_{min}\delta}\sum_{l=0}^{L}\sum_{v\in V}\boldsymbol{\pi}_{s}^{(l)}(v)\leq\frac{1}{\alpha_{min}\delta},

where αmin=min{α1,,αL}\alpha_{min}=\min\{\alpha_{1},\cdots,\alpha_{L}\} can be treated as a constant. Thus, Algorithm 3 runs in O(1αminδ)=O(1δ)O(\frac{1}{\alpha_{min}\delta})=O(\frac{1}{\delta}) time and the proof is done.

Proof of Theorem 2. The following lemmas are used in the proof.

Lemma 0.

(Pons and Latapy, 2005) For undirected graph GG and any two nodes uu and vv in GG, let 𝐏k(u,v)\boldsymbol{P}^{k}(u,v) (resp. 𝐏k(v,u)\boldsymbol{P}^{k}(v,u)) be the probability of going from uu to vv (resp. from vv to uu) through a random walk of fixed length kk. Then,

𝑷k(u,v)dout(v)=𝑷k(v,u)dout(u).\frac{\boldsymbol{P}^{k}(u,v)}{d_{out}(v)}=\frac{\boldsymbol{P}^{k}(v,u)}{d_{out}(u)}.
Lemma 0.

Let 𝛑sL(u)\boldsymbol{\pi}^{L}_{s}(u) denote the supervised PPR within LL hops. Then, after the end of every iteration in Algorithm 3, for 𝛑^s(u)\boldsymbol{\hat{\pi}}_{s}(u) and the residue vectors 𝐫s(0),𝐫s(L)\boldsymbol{r}_{s}^{(0)},\cdots\boldsymbol{r}_{s}^{(L)}, we have

(8) 𝝅sL(u)𝝅^s(u)+k=0LvV𝒓s(k)(v)huk(v),\boldsymbol{\pi}_{s}^{L}(u)\leq\boldsymbol{\hat{\pi}}_{s}(u)+\sum_{k=0}^{L}\sum_{v\in V}\boldsymbol{r}_{s}^{(k)}(v)\cdot h_{u}^{k}(v),

where huk(v)=αk𝐞u(v)+l=1αk+lj=kk+l1(1αj)𝐏l(v,u)h_{u}^{k}(v)=\alpha_{k}\boldsymbol{e}_{u}(v)+\sum_{l=1}^{\infty}\alpha_{k+l}\prod_{j=k}^{k+l-1}(1-\alpha_{j})\boldsymbol{P}^{l}(v,u), and 𝐞u\boldsymbol{e}_{u} is a unit vector in which the uu-th entry of 𝐞u\boldsymbol{e}_{u} is 11 and others are all 0.

Proof.

We prove Lemma 3 by induction. First, at the begin of Algorithm 3, vector 𝝅^s\boldsymbol{\hat{\pi}}_{s} and all residue vectors are set to 𝟎\boldsymbol{0} except for 𝒓s(0)(s)=1\boldsymbol{r}_{s}^{(0)}(s)=1. Then we have

𝝅^s(u)+k=0KvV𝒓s(k)(v)huk(v)\displaystyle\boldsymbol{\hat{\pi}}_{s}(u)+\sum_{k=0}^{K}\sum_{v\in V}\boldsymbol{r}_{s}^{(k)}(v)\cdot h_{u}^{k}(v)
=\displaystyle= α0+l=1αlj=0l1(1αj)𝑷l(s,u)=𝝅s(u),\displaystyle\,\alpha_{0}+\sum_{l=1}^{\infty}\alpha_{l}\prod_{j=0}^{l-1}(1-\alpha_{j})\boldsymbol{P}^{l}(s,u)=\boldsymbol{\pi}_{s}(u),

which implies that Equation 8 holds under the initial condition since 𝝅sL(u)𝝅s(u)\boldsymbol{\pi}_{s}^{L}(u)\leq\boldsymbol{\pi}_{s}(u). Now suppose that Equation 8 holds at the end of ii-th iteration. Suppose further that during (i+1)(i+1)-th iteration, entry 𝒓s(w)(q)δdout(q)\boldsymbol{r}_{s}^{(w)}(q)\geq\delta\cdot d_{out}(q) is detected and push operation is invoked here. If the change of l.h.s of Equation 8 after the push operation is zero, then Equation 8 holds at the end of (l+1)(l+1)-th iteration.

For the convenience of analysis of the change, let 𝝅^s,𝒓˙s(w)\boldsymbol{\hat{\pi}}^{\prime}_{s},\boldsymbol{\dot{r}}_{s}^{(w)} (resp. 𝝅^s′′,𝒓¨s(w)\boldsymbol{\hat{\pi}}^{\prime\prime}_{s},\boldsymbol{\ddot{r}}_{s}^{(w)}) be the corresponding vectors at the end of ii-th iteration(resp. (i+1)(i+1)-th iteration). Then after performing the push operation in the (i+1)(i+1)-th ieration, we have

𝝅^s′′(q)𝝅^s(q)=αw𝒓˙s(w)(q),\boldsymbol{\hat{\pi}}^{\prime\prime}_{s}(q)-\boldsymbol{\hat{\pi}}^{\prime}_{s}(q)=\alpha_{w}\cdot\boldsymbol{\dot{r}}_{s}^{(w)}(q),
𝒓¨s(w)(q)𝒓˙s(w)(q)=𝒓˙s(w)(q).\boldsymbol{\ddot{r}}_{s}^{(w)}(q)-\boldsymbol{\dot{r}}_{s}^{(w)}(q)=-\boldsymbol{\dot{r}}_{s}^{(w)}(q).

Recap N(q)N(q) is the set of qq’s out-neighbors, for oN(q)o\in N(q), we have

𝒓¨s(w+1)(o)𝒓˙s(w+1)(o)=(1αw)𝒓˙s(w)(q)dout(q).\boldsymbol{\ddot{r}}_{s}^{(w+1)}(o)-\boldsymbol{\dot{r}}_{s}^{(w+1)}(o)=(1-\alpha_{w})\frac{\boldsymbol{\dot{r}}_{s}^{(w)}(q)}{d_{out}(q)}.

Then the change between ii-th iteration and (i+1)(i+1)-th iteration is

𝝅^s′′(u)𝝅^s(u)+𝒓¨s(w)(q)huw(q)𝒓˙s(w)(q)huw(q)+\displaystyle\boldsymbol{\hat{\pi}}^{\prime\prime}_{s}(u)-\boldsymbol{\hat{\pi}}^{\prime}_{s}(u)+\boldsymbol{\ddot{r}}_{s}^{(w)}(q)\cdot h_{u}^{w}(q)-\boldsymbol{\dot{r}}_{s}^{(w)}(q)\cdot h_{u}^{w}(q)+
oN(q)𝒓¨s(w+1)(o)huw+1(o)oN(q)𝒓˙s(w+1)(o)huw+1(o)\displaystyle\sum_{o\in N(q)}\boldsymbol{\ddot{r}}_{s}^{(w+1)}(o)\cdot h_{u}^{w+1}(o)-\sum_{o\in N(q)}\boldsymbol{\dot{r}}_{s}^{(w+1)}(o)\cdot h_{u}^{w+1}(o)
=\displaystyle= 𝝅^s′′(u)𝝅^s(u)𝒓˙s(w)(q)huw(q)+\displaystyle\,\boldsymbol{\hat{\pi}}^{\prime\prime}_{s}(u)-\boldsymbol{\hat{\pi}}^{\prime}_{s}(u)-\boldsymbol{\dot{r}}_{s}^{(w)}(q)\cdot h_{u}^{w}(q)+
(1αw)𝒓˙s(w)(q)dout(q)oN(q)huw+1(o)\displaystyle\,(1-\alpha_{w})\frac{\boldsymbol{\dot{r}}_{s}^{(w)}(q)}{d_{out}(q)}\sum_{o\in N(q)}h_{u}^{w+1}(o)
=\displaystyle= 𝝅^s′′(u)𝝅^s(u)𝒓˙s(w)(q)huw(q)+𝒓˙s(w)(q)(huw(q)αw𝒆u(q))\displaystyle\,\boldsymbol{\hat{\pi}}^{\prime\prime}_{s}(u)-\boldsymbol{\hat{\pi}}^{\prime}_{s}(u)-\boldsymbol{\dot{r}}_{s}^{(w)}(q)\cdot h_{u}^{w}(q)+\boldsymbol{\dot{r}}_{s}^{(w)}(q)(h_{u}^{w}(q)-\alpha_{w}\boldsymbol{e}_{u}(q))
=\displaystyle= 𝝅^s′′(u)𝝅^s(u)αw𝒆u(q)𝒓˙s(w)(q)=0.\displaystyle\,\boldsymbol{\hat{\pi}}^{\prime\prime}_{s}(u)-\boldsymbol{\hat{\pi}}^{\prime}_{s}(u)-\alpha_{w}\boldsymbol{e}_{u}(q)\boldsymbol{\dot{r}}_{s}^{(w)}(q)=0.

Therefore, the change of l.h.s of Equation 8 after (i+1)(i+1)-th iteration is zero, which implies that Equation 8 also holds at the end of (i+1)(i+1)-th iteration. The lemma is proved. ∎

Next, we prove Theorem 2. By Equation 8 in Lemma 3, we have:

|𝝅sL(u)𝝅^s(u)|/dout(u)k=0LvV𝒓s(k)(v)dout(u)huk(v)\displaystyle|\boldsymbol{\pi}_{s}^{L}(u)-\boldsymbol{\hat{\pi}}_{s}(u)|/d_{out}(u)\leq\sum_{k=0}^{L}\sum_{v\in V}\frac{\boldsymbol{r}_{s}^{(k)}(v)}{d_{out}(u)}\cdot h_{u}^{k}(v)
\displaystyle\leq δk=0LvV(αk𝒆u(v)+l=1αk+lj=kk+l1(1αj)𝑷l(v,u))\displaystyle\,\delta\sum_{k=0}^{L}\sum_{v\in V}(\alpha_{k}\boldsymbol{e}_{u}(v)+\sum_{l=1}^{\infty}\alpha_{k+l}\prod_{j=k}^{k+l-1}(1-\alpha_{j})\boldsymbol{P}^{l}(v,u))
=\displaystyle= δk=0L(αk+l=1αk+lj=kk+l1(1αj))=Lδ,\displaystyle\,\delta\sum_{k=0}^{L}(\alpha_{k}+\sum_{l=1}^{\infty}\alpha_{k+l}\prod_{j=k}^{k+l-1}(1-\alpha_{j}))=L\cdot\delta,

where the last equation holds by Lemma 1. The theorem is proved.

Proof of Theorem 3. The time complexities of Algorithm 4 depend on two parts: Generalized Push and SparseSVD. From the results in Theorem 1, for each source sVs\in V, the cost of the push operations is O(1δ)O(\frac{1}{\delta}). Thus, the total cost of push operations on nn nodes is bounded by O(nδ)O(\frac{n}{\delta}). Then, we need the following theorem (Clarkson and Woodruff, 2017) to analyze the cost of SparseSVD.

Theorem 4.

Let 𝐀\boldsymbol{A} denote an n×nn\times n matrix, there is an algorithm that, with failure probability 1/101/10, finds two n×dn\times d matrices 𝐔,𝐕\boldsymbol{U},\boldsymbol{V} with orthonormal columns, and a d×dd\times d diagonal matrix 𝚺\boldsymbol{\Sigma}, so that 𝐀𝐔𝚺𝐕TF(1+ϵ)𝐀[𝐀]dF\left||\boldsymbol{A}-\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{T}\right||_{F}\leq(1+\epsilon)\left||\boldsymbol{A}-\left[\boldsymbol{A}\right]_{d}\right||_{F}, where [𝐀]d[\boldsymbol{A}]_{d} denotes the best rank-dd approximation to 𝐀\boldsymbol{A}. The algorithm runs in time

O(nnz(𝑨)+O~(nd2ϵ4+d3ϵ5)).O\left(\textrm{nnz}(\boldsymbol{A})+\tilde{O}\left(nd^{2}\epsilon^{-4}+d^{3}\epsilon^{-5}\right)\right).

Racall that for any approximate supervised PPR score 𝝅^u(v)\boldsymbol{\hat{\pi}}_{u}(v), it is add to 𝑺\boldsymbol{S} only if it is larger than the error parameter δ\delta. Thus the total number of non-zero entries in 𝑺\boldsymbol{S} is nnz(𝑺)=O(nδ)\textrm{nnz}(\boldsymbol{S})=O(\frac{n}{\delta}). The cost of SparseSVD is bounded by O(nδ+nd2ϵ4)O(\frac{n}{\delta}+\frac{nd^{2}}{\epsilon^{4}}). Finally, combining these two parts, the running cost of Algorithm 4 is bounded by O(nδ+nd2ϵ4)O(\frac{n}{\delta}+\frac{nd^{2}}{\epsilon^{4}}), which completes our proof.

Appendix B Hyper-parameters settings

Table 8. Hyperparameters of Lemane for link prediction.
Dataset Hyperparameters
Wikipedia initialization: ϕhk(l)\phi_{hk}(l) with t=5t=5,
learning rate: 0.0010.001, δ\delta: 10510^{-5}, β\beta: 0.010.01, γ\gamma: 11,
SVD used for push: JacobiSVD
Wikivote initialization: ϕhk(l)\phi_{hk}(l) with t=1t=1,
learning rate: 0.50.5, δ\delta: 10610^{-6}, β\beta: 0.50.5, γ\gamma: 11,
SVD used for push: frPCA
BlogCatalog initialization: ϕge(l)\phi_{ge}(l) with α=0.5\alpha=0.5,
learning rate: 0.10.1, δ\delta: 10710^{-7}, β\beta: 0.010.01, γ\gamma: 11,
SVD used for push: frPCA
Slashdot initialization: ϕhk(l)\phi_{hk}(l) with t=5t=5,
learning rate: 0.0010.001, δ\delta: 10510^{-5}, β\beta: 0.10.1, γ\gamma: 11,
SVD used for push: frPCA
Tweibo initialization: ϕge(l)\phi_{ge}(l) with α=0.5\alpha=0.5,
learning rate: 0.010.01, δ\delta: 10510^{-5}, β\beta: 0.10.1, γ\gamma: 11,
SVD used for push: frPCA
Orkut initialization: ϕhk(l)\phi_{hk}(l) with t=1t=1,
learning rate: 0.010.01, δ\delta: 10410^{-4}, β\beta: 11, γ\gamma: 11,
SVD used for push: frPCA
Table 9. Hyperparameters of Lemane for node classification.
Dataset Hyperparameters
Wikipedia initialization: ϕhk(l)\phi_{hk}(l) with t=5t=5,
learning rate: 0.050.05, δ\delta: 10510^{-5}, β\beta^{\prime}: 11, γ\gamma^{\prime}: 0.50.5,
SVD used for push: JacobiSVD
BlogCatalog initialization: ϕhk(l)\phi_{hk}(l) with t=5t=5,
learning rate: 0.010.01, δ\delta: 10510^{-5}, β\beta^{\prime}: 11, γ\gamma^{\prime}: 0.50.5,
SVD used for push: JacobiSVD
TWeibo initialization: ϕhk(l)\phi_{hk}(l) with t=5t=5,
learning rate: 0.050.05, δ\delta: 10510^{-5}, β\beta^{\prime}: 11, γ\gamma^{\prime}: 33,
SVD used for push: frPCA, XX and YY are
normalized before concatenation in
loss function 1\mathcal{L}^{\prime}_{1}
Orkut initialization: ϕhk(l)\phi_{hk}(l) with t=1t=1,
learning rate: 0.50.5 , δ\delta: 10510^{-5}, β\beta^{\prime}: 11, γ\gamma^{\prime}: 22,
SVD used for push: frPCA, XX and YY are
normalized before concatenation in
loss function 1\mathcal{L}^{\prime}_{1}

Table 8 and Table 9 summarize the hyperparameter settings of Lemane on each dataset. The searching hyperparameters include initialized distribution ϕ(l)\phi(l) to indicate the probability that the supervised random walk stops at the ll-th hop, the loss function coefficients β,γ,β,γ\beta,\gamma,\beta^{\prime},\gamma^{\prime}, error parameter δ\delta, learning rate, and buildin SVDs used in generalized push. For the initialization of distribution ϕ(l)\phi(l), two standard distribution functions are used, namely the geometric distribution ϕge(l)=α(1α)l\phi_{ge}(l)=\alpha(1-\alpha)^{l} and the Poisson distribution ϕhk(l)=ettll!\phi_{hk}(l)=\frac{e^{-t}t^{l}}{l!}.