Learning Based Proximity Matrix Factorization for Node Embedding
Abstract.
Node embedding learns a low-dimensional representation for each node in the graph. Recent progress on node embedding shows that proximity matrix factorization methods gain superb performance and scale to large graphs with millions of nodes. Existing approaches first define a proximity matrix and then learn the embeddings that fit the proximity by matrix factorization. Most existing matrix factorization methods adopt the same proximity for different tasks, while it is observed that different tasks and datasets may require different proximity, limiting their representation power.
Motivated by this, we propose Lemane, a framework with trainable proximity measures, which can be learned to best suit the datasets and tasks at hand automatically. Our method is end-to-end and incorporates a differentiable SVD in the pipeline so that the parameters can be trained via backpropagation. However, this learning process is still expensive on large graphs. To improve the scalability, we train proximity measures only on carefully subsampled graphs, and then apply standard proximity matrix factorization on the original graph using the learned proximity. Note that computing the learned proximities for each pair is still expensive for large graphs, and existing techniques for computing proximities are not applicable to the learned proximities. Thus, we present generalized push techniques to make our solution scalable to large graphs with millions of nodes. Extensive experiments show that our proposed solution outperforms existing solutions on both link prediction and node classification tasks on almost all datasets.
1. Introduction
Node embedding is the task of mapping nodes in the original graph into low-dimensional representations. For each node, it outputs an embedding vector, and the embedding vectors play an important role in preserving not only the structural information but also other underlying properties of the graph. These vectors can be fed into machine learning models, facilitating a wide range of machine learning tasks, such as node classification (Perozzi et al., 2014; Grover and Leskovec, 2016; Tang et al., 2015), link prediction (Wang et al., 2018; Tsitsulin et al., 2018), graph reconstruction (Yin and Wei, 2019; Yang et al., 2020), and recommendation (Fouss et al., 2007; Yu et al., 2014).
An important category of node embedding methods with superb performance is the one based on proximity matrix factorization. Such solutions first define a proximity matrix S for the nodes in the input graph G, where S[s,t] is the proximity of node t with respect to node s. Different methods may adopt different proximity measures, and personalized PageRank (PPR) is a popular choice of proximity measure in node embedding. For example, PPR is adopted in NRP (Yang et al., 2020) as the proximity. In STRAP (Yin and Wei, 2019), the authors further propose to adopt π(s,t) + π^T(t,s) as the proximity, where π(s,t) is the PPR of t with respect to s and π^T(t,s) is the PPR of s with respect to t on the transpose graph obtained by reversing the direction of each edge in G. Given the proximity matrix S, two embedding vectors X_s and Y_t are derived for each node pair such that X_s·Y_t ≈ S[s,t]. For existing solutions in this category, the embedding vectors are typically obtained by singular value decomposition (SVD) or eigen-decomposition on S or on a sparse matrix closely related to S.
Despite their success, all existing matrix factorization approaches aim to learn an embedding that preserves the chosen proximity, without considering whether the proximity is suitable for the task on the dataset at hand. However, it is observed that different tasks and datasets may require different proximities to achieve high performance, which limits the representation power of such solutions. In addition, it is shown in existing node embedding methods, e.g., STRAP (Yin and Wei, 2019) and NetMF (Qiu et al., 2018), that non-linear operations (such as taking the logarithm or softmax) on the proximity matrix can help improve the representation power of the embedding. Nevertheless, most recent matrix factorization methods, like HOPE (Ou et al., 2016), AROPE (Zhang et al., 2018b), and NRP (Yang et al., 2020), do not explicitly derive the proximity matrix, which limits their representation power. For methods that do explicitly derive the proximity matrix, the matrix factorization needs to be applied to the final proximity matrix. Thus, to derive trainable proximity measures, the model needs to be end-to-end and include both the proximity computation and the SVD decomposition in the training process. This further imposes challenges, especially when the input graph has millions of nodes.
Contribution. Motivated by the limitations of existing solutions, we present an effective framework Lemane (Learning based Proximity Matrix Factorization for Node Embedding) with trainable proximity measures, which can be learned to best suit the datasets and tasks at hand automatically. Our trainable proximity measure is inspired by personalized PageRank (PPR) (Page et al., 1999). The PPR π(s,t) can be defined as the probability that an α-discounted random walk from s stops at node t, where an α-discounted random walk from a source node has probability α to stop at the current node and probability 1−α to randomly jump to an out-neighbor of the current node. For α-discounted random walks, the majority will always be the one-hop random walks, which may not be the most representative ones for the task at hand. This motivates us to learn a more representative random walk for our trainable proximity measure. Instead of fixing the stopping probability to α at each step, our trainable proximity is defined on a supervised random walk whose stopping probability α_k at the k-th hop is learned via our defined loss function. Then, our trainable proximity of node t with respect to s, dubbed the supervised PPR π(s,t), is the probability that the supervised random walk from s stops at t.
To learn the stopping probabilities at each hop for the supervised random walk, we design different loss functions for different tasks in order to learn a more representative proximity for the task at hand. In this paper, we focus on two popular tasks of node embedding: link prediction and node classification. Given the loss functions and trainable parameters, we then design an end-to-end method which incorporates a differentiable SVD in the pipeline so that the parameters can be trained via backpropagation. Our solution is mainly inspired by previous work on learning-based low-rank approximations (Indyk et al., 2019), which includes a differentiable SVD that allows the gradients to flow easily in the framework to solve the low-rank approximation problem. In our framework, with the differentiable SVD, the gradients can easily flow from the loss function and differentiable SVD to our proximity matrix computation, which is determined by the training parameters, i.e., the stopping probabilities at each hop of the supervised random walk.
However, the above training process is too expensive and does not scale to large graphs. To improve the scalability, we train the stopping probabilities for the supervised random walk only on carefully subsampled graphs, and then apply standard proximity matrix factorization on the original graph using the learned proximity. Our main observation is that the stopping probabilities at each hop of the supervised random walk are node-independent. Thus, with a carefully subsampled graph, the learned stopping probability at each hop for the task should still be similar to that on the input graph. Motivated by this, we present an effective subgraph sampling based method to train the parameters on multiple sampled subgraphs, which improves the scalability of our Lemane.
Finally, given the learned stopping probabilities, computing the learned proximities for each node pair is still expensive for large graphs, and existing efficient algorithms for computing proximities, like PPR, are not applicable to the learned proximities. Thus, we present generalized push techniques to make our solution scalable to large graphs with millions of nodes. Given a source node s, our generalized push algorithm computes an approximate supervised PPR score for each node with respect to s with cost O(K/ε), where K is the truncation hop of the supervised random walks and ε is a parameter that controls the quality of the approximate supervised PPR scores. Then, the proximity matrix can be computed with O(n·K/ε) cost and stored in space linear in n. A sparse SVD algorithm is applied on the proximity matrix to derive the final embedding, whose running cost further depends on the embedding dimension d.
In our experiment, we compare our Lemane against 15 existing node embedding methods on the link prediction and node classification tasks using 6 real datasets with up to 3 million nodes and 117 million edges. Extensive experiments show that our Lemane outperforms existing methods on both link prediction and node classification tasks on almost all datasets.
2. Related Work
There are three basic categories of node embedding methods: skip-gram methods, matrix factorization methods, and neural network methods. Next, we briefly review existing works in each category.
Skip-gram methods. The methods in this category are inspired by the great success of the word2vec model (Mikolov et al., 2013) for natural language processing. DeepWalk (Perozzi et al., 2014) first proposes to train embedding vectors by feeding truncated random walks to the Skip-gram model. The nodes sampled from the random walks are then treated as positive samples. Subsequent methods try to explore more representative random walks to feed into the Skip-gram model. LINE (Tang et al., 2015) adopts one-hop and two-hop random walks, while Node2vec (Grover and Leskovec, 2016) proposes to explore higher-order random walks that exploit both the DFS and BFS nature of the graph. VERSE (Tsitsulin et al., 2018) and APP (Zhou et al., 2017) adopt α-discounted random walks to obtain positive samples. Recently, InfiniteWalk (Chanpuriya and Musco, 2020) studies DeepWalk in the limit as the window size goes to infinity, linking DeepWalk to the graph Laplacian matrix.
Matrix factorization methods. Another idea in node embedding is to factorize a chosen proximity matrix. Explicitly deriving the proximity matrix, e.g., as in NetMF (Qiu et al., 2018), typically takes at least O(n²) cost and is too expensive for large graphs. To avoid this cost, HOPE (Ou et al., 2016), AROPE (Zhang et al., 2018b), and NRP (Yang et al., 2020) derive the embedding without explicitly computing the proximity matrix. For instance, instead of computing all-pair proximity scores and then decomposing the proximity matrix, NRP applies SVD to the adjacency matrix, which is sparse for most real-life graphs, reducing the embedding computational cost. Another way to avoid the O(n²) cost is to calculate a sparsified proximity matrix. The representative is STRAP (Yin and Wei, 2019), which imposes a threshold δ and keeps, for each node, only the proximity scores no smaller than δ, making the proximity matrix of size O(n/δ). An SVD is then applied to the sparsified proximity matrix. Since the second solution explicitly derives the proximity matrix, it allows non-linear operations on the proximity matrix, improving the representation power.
Neural network methods. Deep learning provides an alternative solution to generate node embeddings. SDNE (Wang et al., 2016) and DNGR (Cao et al., 2016) employ multi-layer auto-encoders with a target matrix to generate embeddings. DRNE (Tu et al., 2018) utilizes the layer normalized LSTM (Hochreiter and Schmidhuber, 1997) to generate node embeddings by aggregating the representations of neighbors of each node recursively. GraphGAN (Wang et al., 2018) adopts the well-known generative adversarial networks (Goodfellow et al., 2014) into graph representation learning via an adversarial minimax game. AW (Abu-El-Haija et al., 2018) proposes a novel attention model on the power series of the transition matrix, which guides the random walk to pay attention to important positions within the random walks by optimizing an upstream objective. The bottleneck of these solutions is the high computational cost, which restricts these methods to small graphs.
Other methods. There are also several methods that do not belong to the above three categories. For example, GraphWave (Donnat et al., 2018) learns node representations by leveraging heat wavelet diffusion patterns in an unsupervised way. NetHiex (Ma et al., 2018) captures the underlying hierarchical taxonomy of the graph to learn node representations with multiple components. RaRE (Gu et al., 2018) proposes a node embedding method that considers both social rank and proximity of nodes, and separately learns two representations for a node. AutoNE (Tu et al., 2019) incorporates AutoML into node embedding, which can automatically optimize the hyperparameters from the subgraphs of the original graph. PBG (Lerer et al., 2019) presents a distributed embedding system that uses the block decomposition of the adjacency matrix as a partition method to scale to arbitrary large graphs. A recent work, GraphZOOM (Deng et al., 2020), first generates subgraphs based on a fused graph and then applies existing approaches to generate node embeddings. Since these methods do not preserve any node pair proximity, the main concern is their insufficient effectiveness for downstream tasks as we will show in our experiment.
There are also various graph embedding methods designed for specific graphs, like dynamic graphs (Goyal et al., 2018; Trivedi et al., 2019) and heterogeneous networks (Dong et al., 2017; Fu et al., 2017). In this paper, we focus on the most fundamental case when the network is static and no feature vector is given.
3. Lemane Framework
In this section, we present our Lemane framework. Section 3.1 introduces the trainable proximity matrix and the training parameters. Section 3.2 elaborates on the training process of Lemane and introduces the loss functions used for link prediction and node classification, respectively. Section 3.3 presents how to carefully obtain the subsampled graphs to do training on large graphs.
3.1. Trainable Proximity Measure
Recall from Section 1 that the personalized PageRank (PPR) π(s,t) of node t with respect to s is the probability that an α-discounted random walk from s stops at node t, where an α-discounted random walk has probability α to stop at the current node and probability 1−α to randomly jump to one of its out-neighbors. Define the transition matrix P = D⁻¹·A, where D is the diagonal matrix such that D[i,i] is the out-degree (resp. degree) of node v_i if the graph G is directed (resp. undirected), and A is the adjacency matrix. Then, the PPR proximity matrix Π can be expressed as:

Π = Σ_{k=0}^{∞} α·(1−α)^k·P^k.
In our trainable proximity measure, instead of fixing the stopping probability at each step to α, we allow the stopping probability α_k of the random walk at the k-th hop to be trainable. For now, we assume that the stopping probability α_k at the k-th step is given, and we will show how to train the α_k's in Section 3.2. A random walk that follows such learned stopping probabilities at each step is denoted as a supervised random walk, and the proximity derived from the supervised random walk is denoted as the supervised PPR. The supervised PPR proximity matrix S can be defined as:
S = Σ_{k=0}^{∞} α_k·( ∏_{i=0}^{k−1} (1−α_i) )·P^k.    (1)
To explain, the probability that a supervised random walk stops at exactly the k-th hop is α_k·∏_{i=0}^{k−1}(1−α_i). Thus, by summing up the probability of stopping at each hop, we derive the supervised PPR score.
The exact supervised PPR scores can then be computed in an iterative manner, as shown in Algorithm 1, when the maximum hop K is set to infinity. However, we observe that when K is sufficiently large, the probability that the supervised random walk stops at a hop larger than K is close to zero. Hence, we discard all supervised random walks that stop at a hop larger than K. The following theorem shows the quality of the supervised PPR scores calculated by Algorithm 1 with a finite K.
Theorem 1.
Let S̃ be the supervised PPR proximity matrix derived by Algorithm 1 with maximum hop K, and let ‖M‖_∞ denote the infinity-norm of a matrix M. Then we have:

‖S − S̃‖_∞ ≤ ∏_{i=0}^{K} (1−α_i).
All the proofs of our theorems can be found in the appendix. According to our observation, the probability that the supervised random walk stops at a hop greater than K is almost zero once K is moderately large. Hence, K is set to a small constant in our experiments.
Algorithm 1 makes a tight connection between the supervised PPR proximity matrix S and our training parameters (α_0, α_1, …, α_K), which allows the gradients to flow from the derived supervised PPR proximity matrix to the training parameters via backpropagation. Let n and m denote the number of nodes and the number of edges, respectively. The time complexity of the algorithm can be bounded by O(K·n·m), since P is a sparse matrix with O(m) non-zero entries.
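To make the truncated computation concrete, the following is a minimal NumPy sketch of the iteration behind Algorithm 1; it assumes a dense adjacency matrix and illustrative variable names, and is not the authors' implementation.

```python
import numpy as np

def supervised_ppr_matrix(A, alphas):
    """Return S = sum_{k=0}^{K} alpha_k * prod_{i<k}(1 - alpha_i) * P^k,
    where P = D^{-1} A and K = len(alphas) - 1 (dense sketch for clarity)."""
    deg = A.sum(axis=1)
    P = A / np.maximum(deg, 1)[:, None]      # row-stochastic transition matrix
    n = A.shape[0]
    S = np.zeros((n, n))
    walk = np.eye(n)                         # distribution of walks that have not stopped
    surviving = 1.0                          # prod_{i<k} (1 - alpha_i)
    for alpha_k in alphas:
        S += alpha_k * surviving * walk      # walks stopping exactly at hop k
        surviving *= (1.0 - alpha_k)
        walk = walk @ P                      # advance the surviving walks one hop
    return S
```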
Remark. Note that learning the stopping probability of each hop is not a new idea in graph neural networks, e.g., (Klicpera et al., 2019; Chien et al., 2021). It is much easier for neural networks, as the stopping probabilities can be treated as additional weights to learn. However, it is more challenging to integrate this idea into node embedding, especially for matrix factorization methods. We show an efficient and effective solution that works for large graphs, which is non-trivial.
3.2. Training process of Lemane
Algorithm 2 shows the pseudo-code of the training process of Lemane. Firstly, it initializes the stopping probabilities α_k for 0 ≤ k ≤ K such that the probability that a random walk stops at the k-th hop follows some standard distribution, e.g., uniform, geometric, or Poisson (Line 1). Next, it uses the initial settings of these stopping probabilities to derive the supervised PPR scores by invoking Algorithm 1 (Line 3). Given the supervised PPR scores, it eliminates all the proximity scores that are too small: a threshold δ is imposed and all scores smaller than δ are set to zero (Lines 4-5). Then, a non-linear operation, the element-wise logarithm, is applied to the proximity matrix S. The proximity scores are divided by δ before taking the logarithm to guarantee that each non-zero entry is non-negative (Line 6). Thereafter, a differentiable SVD, e.g., (Indyk et al., 2019), is applied on the new matrix with input dimension d (Line 7). The SVD produces two orthonormal matrices U and V and a diagonal matrix Σ such that U·Σ·V^⊤ approximates the input matrix. Given the three matrices, U·Σ^{1/2} is returned as the first embedding matrix X and V·Σ^{1/2} is returned as the second embedding matrix Y.
Subsequently, the two embedding matrices are fed into the loss function for link prediction or node classification (Line 9). The loss functions will be discussed shortly. Then, the stopping probabilities are updated according to the loss function by backpropagation. The training process terminates when the loss function converges (Line 2). Notice that, due to the extremely long computational chain, it is infeasible to write down the explicit form of the gradients. However, as in modern deep neural networks, we can use the autograd feature of PyTorch to numerically compute the gradients with respect to the training parameters. Also, the built-in SVD in PyTorch supports gradient computation, and hence we directly adopt the PyTorch implementation. In what follows, we elaborate on the details of the loss functions.
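Before detailing the losses, the following condensed PyTorch sketch illustrates how gradients flow from a loss through the differentiable SVD down to the stopping probabilities, mirroring the pipeline of Algorithm 2. The sizes, the sigmoid re-parameterization of the stopping probabilities, the threshold value, and the placeholder task loss are illustrative assumptions rather than the authors' code.

```python
import torch

K, n, d = 15, 500, 32                                    # hypothetical sizes
A = (torch.rand(n, n) < 0.01).float()                    # toy adjacency matrix
P = A / A.sum(dim=1, keepdim=True).clamp(min=1)          # row-stochastic transition matrix

theta = torch.zeros(K + 1, requires_grad=True)           # unconstrained trainable parameters
opt = torch.optim.SGD([theta], lr=0.05)

for step in range(100):
    alphas = torch.sigmoid(theta)                        # stopping probabilities in (0, 1)
    S, walk, surviving = torch.zeros(n, n), torch.eye(n), torch.tensor(1.0)
    for k in range(K + 1):                               # truncated supervised PPR (Algorithm 1)
        S = S + alphas[k] * surviving * walk
        surviving = surviving * (1 - alphas[k])
        walk = walk @ P
    M = torch.log(torch.clamp(S, min=1e-5) / 1e-5)       # threshold + non-negative log transform
    U, Sigma, Vh = torch.linalg.svd(M)                   # differentiable SVD
    X = U[:, :d] * Sigma[:d].sqrt()                      # first embedding matrix
    Y = Vh[:d].T * Sigma[:d].sqrt()                      # second embedding matrix
    loss = ((X @ Y.T - A).pow(2)).mean()                 # placeholder task loss
    opt.zero_grad(); loss.backward(); opt.step()
```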
Loss function for link prediction. For link prediction, there are two components in our loss function. The first component aims to ensure that the total information carried by the supervised random walks starting from node s is close to its out-degree. The first part of the loss function is as follows:
ℒ₁ = (1/n)·Σ_{s∈V} ( Σ_{t∈V} X_s·Y_t − d_out(s) )²,    (2)
where X_s is the s-th row of X, Y_t is the t-th row of Y, and d_out(s) is the out-degree of s. Equation 2 indicates that the embedding learned from the supervised random walks should preserve the out-degree information as much as possible.
The second part of the loss function for link prediction is the average cross-entropy over all existing edges:
ℒ₂ = −(1/m)·Σ_{(s,t)} [ ℓ_{s,t}·log σ(X_s·Y_t) + (1 − ℓ_{s,t})·log(1 − σ(X_s·Y_t)) ],    (3)
where σ(·) is the sigmoid function and ℓ_{s,t} = 1 if edge (s,t) exists in the input graph and ℓ_{s,t} = 0 otherwise. The final loss function for link prediction is:
ℒ_lp = γ₁·ℒ₁ + γ₂·ℒ₂,    (4)
where γ₁ and γ₂ are two balancing hyperparameters.
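Under the reconstruction of Equations (2)-(4) given above, a hedged PyTorch reading of the link prediction loss is sketched below; the exact index sets and normalizations belong to the authors, so treat this only as an illustration of the two components.

```python
import torch
import torch.nn.functional as F

def link_prediction_loss(X, Y, A, gamma1=1.0, gamma2=1.0):
    d_out = A.sum(dim=1)
    # Eq. (2): the mass of supervised walks leaving s should match its out-degree.
    l1 = ((X @ Y.T).sum(dim=1) - d_out).pow(2).mean()
    # Eq. (3): binary cross-entropy with labels 1 for existing edges, 0 otherwise.
    logits = X @ Y.T
    l2 = F.binary_cross_entropy_with_logits(logits, (A > 0).float())
    return gamma1 * l1 + gamma2 * l2          # Eq. (4)
```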
Loss function for node classification. There are also two parts in the loss function for node classification. We first concatenate the two output embedding matrices X and Y to obtain a single embedding matrix Z. Then, exploiting the fact that two randomly selected nodes have different labels with high probability, we randomly sample a small set N of negative node pairs in each iteration. Let G_P denote the graph with edge set E_P = ∪_c E(G_c) and unnormalized Laplacian matrix L_P, where G_c is the complete graph formed by the nodes in V_c that share the same label c. Inspired by the relaxation proposed in (Zhang et al., 2020), the first loss function for node classification is defined as follows:
ℒ₁ = ( Σ_{(u,v)∈E_P} ‖z_u − z_v‖² ) / ( Σ_{(u,v)∈N} ‖z_u − z_v‖² ) = tr(Z^⊤·L_P·Z) / ( Σ_{(u,v)∈N} ‖z_u − z_v‖² ),    (5)
where z_u is the concatenated representation of node u, C is the total number of class labels in the graph (so c ranges over {1, …, C}), and L_P is the unnormalized Laplacian matrix of G_P. The goal of ℒ₁ is to minimize the pairwise distances between node pairs with the same class label and to maximize the pairwise distances between the negative samples.
Next, we employ a softmax activation function to normalize the output embedding into a probability distribution over the predicted class labels: ŷ_u = softmax(z_u·W + b), where W is a fixed mapping matrix generated from a uniform distribution, b is the bias term, ŷ_u[c] denotes the probability that node u has class label c, and Σ_c ŷ_u[c] = 1. Let B denote the label matrix, where B[u,c] = 1 if node u has class label c and B[u,c] = 0 otherwise. Then, the second part of the loss function for node classification is the average cross-entropy over all nodes:
ℒ₂ = −(1/n)·Σ_{u∈V} Σ_{c=1}^{C} B[u,c]·log ŷ_u[c].    (6)
The final loss function for node classification is defined according to ℒ₁ and ℒ₂ as follows:
ℒ_nc = γ₁·ℒ₁ + γ₂·ℒ₂,    (7)
where γ₁ and γ₂ are two balancing hyperparameters.
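The following rough PyTorch sketch mirrors the textual description of Equations (5)-(7): a Laplacian-style term that contrasts same-label pairs with negative samples, plus a cross-entropy term through a fixed random projection W and bias b. The exact form of the first term follows the reconstruction above and may differ from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def node_classification_loss(Z, labels, neg_pairs, W, b, gamma1=1.0, gamma2=1.0):
    # Eq. (5): pull nodes with the same label together, push negative samples apart.
    pos = torch.stack([((Z[labels == c].unsqueeze(1) - Z[labels == c].unsqueeze(0)) ** 2).sum()
                       for c in labels.unique()]).sum()
    u, v = neg_pairs[:, 0], neg_pairs[:, 1]               # sampled negative node pairs
    neg = ((Z[u] - Z[v]) ** 2).sum()
    l1 = pos / neg.clamp(min=1e-9)
    # Eq. (6): softmax over a fixed random projection, cross-entropy against true labels.
    l2 = F.cross_entropy(Z @ W + b, labels)
    return gamma1 * l1 + gamma2 * l2                      # Eq. (7)
```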
Remark. Notice that the training algorithm for Lemane on node classification needs additional label information, and we randomly sample a small fraction of the labeled nodes to train the stopping probabilities. To make a fair comparison with our competitors, when training the classifiers we only add enough new labeled nodes to reach the given training ratio. This guarantees that our Lemane accesses the same total amount of labeled data as the other competitors.
3.3. Training Lemane with sub-sampling
The above-mentioned training process requires calculations on a dense n×n matrix in each iteration, which makes it non-scalable to large graphs. However, such a cost seems unavoidable at first glance, since we need to materialize the proximity matrix to do backpropagation. After a careful analysis, we make the following two observations that help us avoid the high running costs. Our first key observation is that the parameters we need to train are only the stopping probabilities at each hop, which are node-independent. That is to say, if we can find a subgraph of the input graph G such that the learned stopping probabilities on the subgraph are identical to those on G for the same task, we can simply learn the parameters on the smaller subgraph and apply the learned parameters to the original graph directly, reducing the computational cost. Another observation is that a subgraph of G with similar connectivity should share similar learned stopping probabilities with the input graph on the same task.
Motivated by this, we present our sub-sampling based training method for Lemane on large graphs. Obviously, a straightforward solution is to sample a number of nodes and then consider the subgraph induced by these nodes. However, such a solution severely degrades the connectivity among the sampled nodes. Simple edge sampling strategies face a similar dilemma, which hampers the connectivity among the sampled nodes.
To keep the connectivity among the sampled nodes, we apply a BFS-style traversal for subgraph sampling. Our goal is still to sample a subgraph with a constant number n_s of nodes. To sample such a subgraph, a source node is first randomly sampled from the input graph G. Thereafter, a BFS traversal is applied from the source to explore its local community. If the number of nodes visited by the BFS from this source is smaller than n_s, another node is randomly sampled as the source to continue the BFS. The BFS sampling stops as soon as n_s nodes in total are visited.
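A short sketch of this BFS-style subsampling is given below; the networkx representation, the `sample_size` argument, and the assumption that the graph has at least `sample_size` nodes are illustrative choices rather than the paper's implementation.

```python
import random
from collections import deque
import networkx as nx

def bfs_subsample(G: nx.Graph, sample_size: int) -> nx.Graph:
    """Sample a connected-as-possible subgraph with `sample_size` nodes via BFS."""
    visited = set()
    nodes = list(G.nodes())
    while len(visited) < sample_size:
        source = random.choice(nodes)            # restart from a fresh random source
        queue = deque([source])
        while queue and len(visited) < sample_size:
            u = queue.popleft()
            if u in visited:
                continue
            visited.add(u)
            queue.extend(v for v in G.neighbors(u) if v not in visited)
    return G.subgraph(visited).copy()
```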
However, the parameters trained on a single subgraph might be biased, making the learned stopping probabilities non-generalizable to the original input graph G. To make the learned parameters generalizable to the input graph, we sample multiple subgraphs by the above strategy to learn the parameters. In particular, in each iteration, we sample a subgraph by the BFS strategy and then update the training parameters, i.e., the stopping probabilities, according to the loss functions defined on this subgraph. The loss functions on the subgraph are modified accordingly: the total number of nodes n in Equations 2 and 6 is replaced by the sample size n_s; the total number of edges m in Equation 3 is replaced by the number of edges of the sampled subgraph; the set of nodes with label c in Equation 5 is changed to V_c ∩ V_s, where V_s is the node set of the subgraph and V_c is the set of nodes with label c in G; and the negative sample set N in Equation 5 is sampled from the nodes in V_s.
4. Generalized Push
Given the learned stopping probabilities, a straightforward solution is to invoke Algorithm 1 to derive the supervised PPR scores. However, this incurs O(K·n·m) running cost, which is prohibitive for large graphs. To tackle this issue, we present a generalized push algorithm that efficiently computes the supervised PPR proximity matrix with O(n·K/ε) cost, where ε is a parameter that controls the computational cost as well as the sparsity of the proximity matrix.
4.1. Generalized Push Algorithm
Our main idea for computing the supervised PPR proximity matrix is to derive a sparsified proximity matrix such that supervised PPR scores no larger than a threshold δ can be safely discarded. But still, how much cost should we pay to derive a sufficiently accurate approximate proximity score? In STRAP (Yin and Wei, 2019), the authors propose a solution that derives PPR estimations with an absolute error of at most δ, at a cost inversely proportional to δ. But can we further reduce the running cost without sacrificing the embedding quality? Here, we give an affirmative answer. Our solution is inspired by the local graph clustering algorithm Local-Push (Andersen et al., 2006), which returns approximate PPRs π̂(s,t) with respect to a source s in O(1/ε) running time and guarantees that

|π(s,t) − π̂(s,t)| ≤ ε·d(t)

on undirected graphs, where d(t) is the degree of node t. The Local-Push algorithm suggests that if we only want to compute the approximate PPR scores around the local graph cluster of a source, the running time can be reduced. In our case, node embedding aims to find the representative nodes of each node, and the nodes in its local graph cluster stand as perfect representatives. However, the Local-Push algorithm only works for PPR, not for our supervised PPR. Thus, we present a generalized push algorithm that works for arbitrary stopping probabilities.
Algorithm 3 shows the pseudo-code of our generalized push algorithm. Given a source node s, a vector π̂_s is maintained to store the portion of supervised random walks that have stopped at each node; π̂_s(t) is the estimated supervised PPR score of t with respect to s. Besides, for each hop k ≤ K, where K is the maximum length of a truncated supervised random walk introduced in Section 3.1, an additional residue vector r_k is maintained. The entry r_k(v) indicates the portion of supervised random walks from s that currently stay at node v at the k-th hop but have not stopped yet. Thus, if the residue vectors are all zero, π̂_s gives the exact (truncated) supervised PPR scores. Initially, the residue vectors are all zero except that r_0(s) = 1 (Lines 1-2), indicating that the supervised random walks all start at s and have not stopped yet. Then, whenever some entry r_k(v) in the residue vectors is above the push threshold ε·d_out(v) (Line 3), a push operation (Lines 4-7) is invoked. In particular, it first converts an α_k fraction of r_k(v) into π̂_s(v) (Line 4), since an α_k portion of these random walks stop at the k-th hop. Next, the remaining (1−α_k) portion of the random walks randomly jump to the out-neighbors of v (Lines 5-6): for each out-neighbor u of v, the residue r_{k+1}(u) is incremented by (1−α_k)·r_k(v)/d_out(v). After the push operation, the residue r_k(v) is set to zero (Line 7). The algorithm terminates when no residue entry r_k(v) exceeds the threshold.
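The following compact Python sketch illustrates the push procedure of Algorithm 3 on an adjacency-list graph. The residue threshold scaled by the out-degree follows the Local-Push convention assumed above; the paper's exact threshold and data structures may differ.

```python
from collections import defaultdict

def generalized_push(adj, s, alphas, eps):
    """adj: dict node -> list of out-neighbors; alphas: stopping probabilities per hop."""
    K = len(alphas) - 1
    pi = defaultdict(float)                        # estimated supervised PPR w.r.t. source s
    residue = [defaultdict(float) for _ in range(K + 1)]
    residue[0][s] = 1.0                            # all walks start at s, not yet stopped
    frontier = [(0, s)]
    while frontier:
        k, v = frontier.pop()
        r = residue[k][v]
        if r <= eps * max(len(adj[v]), 1):         # only push sufficiently large residues
            continue
        residue[k][v] = 0.0
        pi[v] += alphas[k] * r                     # fraction of walks stopping at hop k
        if k < K and adj[v]:
            share = (1 - alphas[k]) * r / len(adj[v])
            for w in adj[v]:                       # remaining walks move one hop further
                residue[k + 1][w] += share
                frontier.append((k + 1, w))
    return pi
```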
Theorem 1.
Algorithm 3 runs in O(K/ε) time.
Theorem 2.
Let π^K(s,t) be the supervised PPR considering random walks within K hops. Then, for undirected graphs, Algorithm 3 returns an estimation π̂_s(t) of π^K(s,t) for each node t such that:

|π^K(s,t) − π̂_s(t)| ≤ (K+1)·ε·d(t).
By replacing ε with ε/(K+1), Algorithm 3 runs in O(K²/ε) time, and the error bound in Theorem 2 becomes ε·d(t). Since K can be treated as a constant, the running time with this setting is still O(1/ε), and for any node t we have that:

|π^K(s,t) − π̂_s(t)| ≤ ε·d(t).
The above analysis shows that our generalized push algorithm provides the same result quality as the Local-Push algorithm with the same asymptotic running cost, and thus returns high-quality results for the representative nodes of the source node.
4.2. Final Embedding
Given the generalized push algorithm, we finally show how to output the embedding for the graph. Following STRAP (Yin and Wei, 2019), we compute the supervised PPR on both the input graph G and the transpose graph G^T obtained by reversing the direction of each edge of G, and set S[s,t] = π(s,t) + π^T(t,s), where π(s,t) is the supervised PPR of t with respect to s on G and π^T(t,s) is the supervised PPR of s with respect to t on the transpose graph G^T. Note that, to reduce the computational costs, we do not bring this part into our training phase, since we use the same stopping probabilities for both the input graph G and the transpose graph G^T. Algorithm 4 shows how to compute the approximate proximity scores. For each node s, we compute the approximate supervised PPR scores on the input graph and on the transpose graph (Lines 3-4). An approximate supervised PPR score π̂_s(t) on G is added to S[s,t] only if it is larger than the threshold δ. Similarly, an approximate supervised PPR score π̂^T_s(t) on the transpose graph is added to S[t,s] only if it is larger than δ (Lines 5-9). With such a pruning strategy, the proximity matrix is sparsified to include only the non-zero entries above the threshold. Then, a non-linear operation, the element-wise logarithm, is applied to S; all the entries are divided by δ before taking the logarithm to guarantee that the values are non-negative (Line 10). The resulting matrix is then fed to a sparse SVD to derive the final embedding matrices X and Y (Lines 11-12). We have the following theorem for the running time and decomposition quality of Algorithm 4.
Theorem 3.
Algorithm 4 runs in O(n·K/ε + n/δ) time, plus a low-order term for the sparse SVD, and guarantees that the embedding preserves the feeding matrix with a (1+ϵ)-approximation to the best rank-d matrix in terms of the Frobenius norm.
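Putting the pieces together, the sketch below assembles the sparsified proximity matrix and the final embeddings in the spirit of Algorithm 4, reusing the `generalized_push` sketch above and SciPy's sparse SVD as a stand-in for the authors' JacobiSVD/frPCA routines; it assumes nodes labeled 0..n-1 and illustrative parameter names.

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import svds

def build_embedding(adj, adj_T, alphas, eps, delta, d):
    """adj / adj_T: adjacency lists of G and its transpose; d: embedding dimension."""
    n = len(adj)
    S = lil_matrix((n, n))
    for s in range(n):
        for t, val in generalized_push(adj, s, alphas, eps).items():
            if val > delta:
                S[s, t] += val                      # supervised PPR on G goes to S[s, t]
        for t, val in generalized_push(adj_T, s, alphas, eps).items():
            if val > delta:
                S[t, s] += val                      # supervised PPR on G^T goes to S[t, s]
    S = S.tocsr()
    S.data = np.log(S.data / delta)                 # non-linear operation on non-zero entries
    U, Sigma, Vt = svds(S, k=d)                     # sparse SVD
    X = U * np.sqrt(Sigma)                          # first embedding matrix
    Y = Vt.T * np.sqrt(Sigma)                       # second embedding matrix
    return X, Y
```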
5. Experiments
Name | Type | n | m | labels
---|---|---|---|---|
Wikipedia | directed | 4.78K | 184.81K | 40 |
WikiVote | directed | 7.12K | 103.69K | - |
BlogCatalog | undirected | 10.31K | 333.98K | 39 |
Slashdot | directed | 82.17K | 870.16K | - |
TWeibo | directed | 1.94M | 50.66M | 100 |
Orkut | undirected | 3.07M | 117.19M | 100 |
Method | Wikipedia | Wikivote | BlogCatalog |
---|---|---|---|
DeepWalk | 88.33 | 68.32 | 85.35 |
Node2vec | 82.86 | 78.50 | 82.07 |
VERSE | 88.09 | 82.82 | 88.49 |
InfiniteWalk | 66.86 | 81.07 | 84.05 |
AROPE | 84.27 | 62.08 | 88.30 |
RandNE | 83.15 | 77.62 | 87.09 |
ProNE | 52.06 | 66.22 | 57.87 |
NetSMF | 72.01 | 72.64 | 47.92 |
STRAP | 86.53 | 92.58 | 89.58 |
NRP | 83.56 | 91.07 | 90.10 |
GraphGAN | 70.33 | 71.76 | 71.83 |
AW | 50.42 | 56.62 | 62.56 |
NetHiex | 45.03 | 73.01 | 65.58 |
GraphZoom | 84.73 | 82.10 | 86.82 |
Louvain | 56.22 | 58.33 | 59.97 |
Lemane-F | 87.79 | 92.78 | 89.92 |
We compare our Lemane against alternative solutions on link prediction and node classification tasks. All experiments are conducted on a Linux machine with an Intel Xeon(R) CPU clocked at 2.70GHz, an NVIDIA GeForce RTX 2080 Super 8GB GPU, and 384GB memory.
5.1. Experimental settings
Datasets. We test on six real datasets that are used in recent node embedding studies (Perozzi et al., 2014; Grover and Leskovec, 2016; Tsitsulin et al., 2018; Lerer et al., 2019; Yin and Wei, 2019; Yang et al., 2019, 2020). The statistics of these datasets are shown in Table 1. BlogCatalog (Tang and Liu, 2009), Slashdot (Leskovec et al., 2009), TWeibo (twe, ggle) and Orkut (Mislove et al., 2007) are four social networks in which links represent a friendship/following relationship between users. Wikipedia (wik, mark) is a co-occurrence network of words appearing in the Wikipedia dump. Wikivote (Leskovec et al., 2010) is a who-votes-on-whom network on Wikipedia. All datasets and the node labels (if any) can be downloaded from public sources (blo, orks; wik, mark; sna, SNAP; twe, ggle).
Competitors. We evaluate Lemane against 15 node embedding methods, including some classic methods and several state-of-the-art methods. We divide these methods into four groups as follows.
- Skip-gram methods: DeepWalk (Perozzi et al., 2014), Node2vec (Grover and Leskovec, 2016), VERSE (Tsitsulin et al., 2018), and InfiniteWalk (Chanpuriya and Musco, 2020);
- Matrix factorization methods: AROPE (Zhang et al., 2018b), RandNE (Zhang et al., 2018a), ProNE (Zhang et al., 2019), NetSMF (Qiu et al., 2019), STRAP (Yin and Wei, 2019), and NRP (Yang et al., 2020) (there were some implementation issues in the released code of NRP on undirected graphs, which were recently fixed by the inventors);
- Neural network methods: GraphGAN (Wang et al., 2018) and AW (Abu-El-Haija et al., 2018);
- Other methods: NetHiex (Ma et al., 2018), GraphZoom (Deng et al., 2020), and LouvainNE (Bhowmick et al., 2020), referred to as Louvain in the tables.
For our methods, we use Lemane-F to indicate the algorithm trained on the entire graph and Lemane-S to indicate the algorithm trained on subsampled graphs. Notice that Lemane-F is adopted for small datasets Wikipedia, Wikivote, and BlogCatalog. Lemane-S is adopted for large datasets Slashdot, TWeibo, and Orkut.
Method | Slashdot | Tweibo | Orkut |
---|---|---|---|
AROPE | 82.83 | 69.46 | 82.03 |
RandNE | 81.03 | 70.74 | 79.45 |
ProNE | 72.80 | 45.47 | 80.88 |
STRAP | 83.07 | 94.58 | 85.73 |
NRP | 80.98 | 93.87 | 86.34 |
Louvain | 55.56 | 64.25 | 80.85 |
Lemane-S | 84.13 | 94.89 | 89.15 |
Parameter settings. We obtain the source code of all competitors from GitHub and run these methods with the default parameter settings suggested by their authors. Following previous studies (Perozzi et al., 2014; Tsitsulin et al., 2018; Yin and Wei, 2019), we set the embedding dimensionality d = 128. For our Lemane-S, we fix the sample size n_s of each subsampled graph. For Lemane-F and Lemane-S, we use grid search to set γ₁ and γ₂ from {0.01, 0.1, 0.5, 1, 2, 3}, the learning rate from {0.001, 0.005, 0.01, 0.05, 0.1, 0.5}, and the threshold δ from a small set of candidate values; we use JacobiSVD for Lemane-F and frPCA (Feng et al., 2018) for Lemane-S to generate the final embeddings.
Since the backpropagation in Lemane is complicated, our objective function may easily fall into a local minimum. To tackle this issue, the stopping probabilities are initialized with different distributions. Specifically, we set the α_k's such that the probability that the supervised random walk stops at the k-th hop follows a geometric distribution or a Poisson distribution, and we report the best results of Lemane. The parameters of Lemane are optimized by the Stochastic Gradient Descent (SGD) optimizer.
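As a small illustration of these initializations, the sketch below derives per-hop stopping probabilities α_k whose induced stop-at-hop-k distribution follows a truncated geometric or Poisson distribution; the parameter names (`p`, `lam`) are assumptions for illustration.

```python
import numpy as np
from scipy.stats import poisson

def init_alphas(K, mode="geometric", p=0.15, lam=5.0):
    if mode == "geometric":
        stop = p * (1 - p) ** np.arange(K + 1)     # target P(stop at hop k)
    else:
        stop = poisson.pmf(np.arange(K + 1), lam)
    stop = stop / stop.sum()                       # truncate and renormalize over K+1 hops
    alphas, remaining = np.zeros(K + 1), 1.0
    for k in range(K + 1):
        alphas[k] = stop[k] / remaining            # conditional stopping probability at hop k
        remaining -= stop[k]
    return np.clip(alphas, 0.0, 1.0)
```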
5.2. Link Prediction
Link prediction aims to predict which pairs of nodes are likely to form edges. Following previous work (Yang et al., 2020; Zhang et al., 2018b), we randomly hide a portion of the edges for testing and train the embedding vectors on the rest of the graph. Then, the testing set E_test is generated by including (i) the node pairs corresponding to the removed edges, and (ii) an equal number of node pairs that are not connected by any edge in the original graph G. Given a node pair (u,v) in E_test, we compute a score for (u,v) based on the embedding vectors and evaluate the model performance using the precision score.
Following previous work (Yin and Wei, 2019; Yang et al., 2020), for DeepWalk, Node2vec, VERSE, InfiniteWalk, GraphGAN, and Louvain, we use the edge-feature approach introduced in (Ma et al., 2018): (i) randomly select existing edges that are not in E_test and the same number of non-existing edges as the training set on each dataset; (ii) for each node pair (u,v), concatenate the d-dimensional embedding vectors of node u and node v; (iii) treat the resulting 2d-dimensional vectors as the features of the node pairs and feed them into a binary logistic regression classifier; (iv) use the trained classifier to perform link prediction on E_test. For AROPE, RandNE, ProNE, AW, NetHiex, and GraphZOOM, the score of a node pair (u,v) is the inner product of the embedding vector of node u and that of node v; for NetSMF, STRAP, NRP, and our Lemane, the score of a node pair (u,v) is the inner product of the embedding X_u from X and Y_v from Y (refer to Section 3 for the definitions of X and Y).
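The two scoring schemes can be summarized by the following scikit-learn sketch, used here only as a stand-in for the actual evaluation scripts: inner products of the two embedding matrices for the factorization methods, and the edge-feature logistic regression for the single-embedding methods.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def score_pairs_factorized(X, Y, pairs):
    # Score of (u, v) is the inner product of X_u and Y_v.
    return np.array([X[u] @ Y[v] for u, v in pairs])

def score_pairs_edge_feature(Z, train_pairs, train_labels, test_pairs):
    # Concatenated node embeddings fed into a binary logistic regression classifier.
    clf = LogisticRegression(max_iter=1000)
    clf.fit([np.concatenate([Z[u], Z[v]]) for u, v in train_pairs], train_labels)
    return clf.predict_proba([np.concatenate([Z[u], Z[v]]) for u, v in test_pairs])[:, 1]
```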
Table 2 reports the performance of Lemane-F against the 15 competitors on the three small datasets. Table 3 further reports the performance of Lemane-S against the 6 methods that scale to large graphs; the other 9 methods either cannot finish training within 24 hours or run out of memory on the large graphs. As we can observe, our Lemane shows the best performance on the 4 social networks, where link prediction finds extensive applications. On the Wikipedia dataset, a co-occurrence network of words, our Lemane still achieves high performance and is the best method among all matrix factorization methods. Compared to the two state-of-the-art matrix factorization methods STRAP and NRP, our Lemane takes the lead by more than 1% on Wikipedia and Slashdot, and by almost 3% on Orkut. This demonstrates the effectiveness of our learning based method.
Method | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
---|---|---|---|---|---|
DeepWalk | 42.02 | 46.12 | 48.46 | 49.39 | 49.35 |
Node2vec | 44.75 | 48.08 | 49.82 | 50.69 | 50.41 |
VERSE | 38.76 | 41.92 | 43.84 | 44.92 | 44.31 |
InfiniteWalk | 38.96 | 42.64 | 45.94 | 47.73 | 48.24 |
AROPE | 45.82 | 50.59 | 52.47 | 53.36 | 52.16 |
RandNE | 34.51 | 32.83 | 43.25 | 45.55 | 45.93 |
ProNE | 44.49 | 50.45 | 53.15 | 54.38 | 54.43 |
NetSMF | 40.29 | 42.56 | 43.68 | 44.08 | 44.12 |
STRAP | 46.51 | 50.77 | 52.44 | 52.64 | 52.37 |
NRP | 48.04 | 52.71 | 54.39 | 55.20 | 54.30 |
GraphGAN | 32.87 | 35.43 | 36.47 | 37.68 | 37.50 |
AW | 40.70 | 40.70 | 40.44 | 40.30 | 39.47 |
NetHiex | 45.58 | 47.95 | 49.39 | 49.77 | 49.20 |
GraphZoom | 40.76 | 41.03 | 41.18 | 41.07 | 40.54 |
Louvain | 40.86 | 40.99 | 40.72 | 41.25 | 40.69 |
Lemane-F | 47.79 | 52.39 | 54.34 | 54.67 | 54.25 |
Method | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
---|---|---|---|---|---|
DeepWalk | 33.01 | 36.97 | 38.70 | 39.87 | 41.31 |
Node2vec | 35.01 | 37.16 | 37.97 | 38.51 | 39.13 |
VERSE | 32.76 | 36.32 | 38.18 | 39.09 | 40.71 |
InfiniteWalk | 34.30 | 38.00 | 40.21 | 41.87 | 43.14 |
AROPE | 29.12 | 32.78 | 34.12 | 34.95 | 35.77 |
RandNE | 26.75 | 31.56 | 36.20 | 38.34 | 39.93 |
ProNE | 36.38 | 40.33 | 41.56 | 42.32 | 42.28 |
NetSMF | 34.95 | 37.99 | 39.30 | 40.19 | 40.72 |
STRAP | 38.62 | 41.80 | 42.96 | 43.39 | 43.97 |
NRP | 38.73 | 41.65 | 42.36 | 43.15 | 43.34 |
GraphGAN | 14.97 | 17.23 | 18.81 | 20.07 | 21.16 |
AW | 16.52 | 16.91 | 16.98 | 17.25 | 17.39 |
NetHiex | 37.46 | 40.06 | 40.63 | 41.43 | 42.33 |
GraphZoom | 22.02 | 25.59 | 27.75 | 29.37 | 30.70 |
Louvain | 19.16 | 19.88 | 21.12 | 21.30 | 21.74 |
Lemane-F | 39.64 | 42.38 | 43.34 | 44.03 | 44.37 |
5.3. Node Classification
The node classification task aims to predict the label(s) of each node based on the embeddings. As mentioned in Section 3, we first randomly sample a portion of the labeled nodes for parameter training. Then the classification task is performed with the following steps: (i) following (Yang et al., 2020), for Lemane and the other matrix factorization methods (there were some implementation issues in the evaluation code for node classification on directed graphs; we have rerun the experiments for the matrix factorization based methods on directed graphs), we first normalize the embedding vector X_u from X and the embedding vector Y_u from Y of each node u (we observe some significant improvement on the Orkut dataset when normalization is applied; thus, we normalize the embedding vectors on all datasets), and then concatenate them to get the representation of u; (ii) for methods without a factorization operation, we use the embedding vector of node u as its representation. Specifically, we randomly split the node and label sets into a training set and a testing set, with the training ratio varying from 0.1 to 0.9. To make a fair comparison, if the training ratio is x, Lemane samples only enough additional labeled nodes such that, together with the labeled nodes already used to train the stopping probabilities, the classifier is trained with a ratio of x of the labeled data. Following previous work (Perozzi et al., 2014; Grover and Leskovec, 2016), we employ a one-vs-rest logistic regression classifier implemented by LIBLINEAR (Fan et al., 2008) with default parameters for all methods. The Micro-F1 score is used as the evaluation metric for the classification task.
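The classification protocol for the factorization methods can be sketched as follows with scikit-learn, whose liblinear solver stands in for LIBLINEAR; the multi-label case would additionally require label binarization.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score
from sklearn.preprocessing import normalize

def evaluate(X, Y, labels, train_idx, test_idx):
    # Row-normalize the two embedding matrices and concatenate them per node.
    Z = np.concatenate([normalize(X), normalize(Y)], axis=1)
    clf = OneVsRestClassifier(LogisticRegression(solver="liblinear"))
    clf.fit(Z[train_idx], labels[train_idx])
    return f1_score(labels[test_idx], clf.predict(Z[test_idx]), average="micro")
```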
Method | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
---|---|---|---|---|---|
AROPE | 33.96 | 34.11 | 34.18 | 34.24 | 34.29 |
RandNE | 34.64 | 34.67 | 34.80 | 34.92 | 34.96 |
ProNE | 35.27 | 35.37 | 35.43 | 35.48 | 35.52 |
STRAP | 35.75 | 35.97 | 36.03 | 36.04 | 36.04 |
NRP | 35.73 | 35.97 | 36.03 | 36.04 | 36.04 |
Louvain | 34.14 | 34.29 | 34.33 | 34.38 | 34.42 |
Lemane-S | 35.77 | 36.03 | 36.04 | 36.05 | 36.05 |
Method | 0.1 | 0.3 | 0.5 | 0.7 | 0.9 |
---|---|---|---|---|---|
AROPE | 49.11 | 50.86 | 51.55 | 51.89 | 52.22 |
RandNE | 44.94 | 49.35 | 50.47 | 51.11 | 51.53 |
ProNE | 36.09 | 37.13 | 38.32 | 38.81 | 39.44 |
STRAP | 70.37 | 73.09 | 74.04 | 74.60 | 75.08 |
NRP | 72.47 | 75.58 | 76.69 | 77.36 | 77.98 |
Louvain | 29.16 | 36.00 | 36.51 | 36.85 | 36.63 |
Lemane-S | 73.32 | 76.27 | 77.26 | 77.89 | 78.14 |
For node classification, we test on four datasets Wikipedia, BlogCatalog, TWeibo, and Orkut, which include label information. Table 4 and Table 5 show the performance of our Lemane-F against all 15 methods on the two small datasets: Wikipedia and BlogCatalog, respectively. Table 6 and Table 7 show the performance of our Lemane-S against 6 methods that scale to TWeibo and Orkut.
We make the following observations. Firstly, our Lemane achieves the best Micro-F1 scores on the three datasets BlogCatalog, TWeibo, and Orkut under all of the tested training ratios. Besides, compared to STRAP, which uses PPR without training the stopping probabilities, our Lemane achieves more than a 1% lead on the Wikipedia dataset and a lead of around 3% on the Orkut dataset. Compared to the second-best matrix factorization method NRP, our Lemane further achieves about a 1% lead on the BlogCatalog dataset under all of the tested training ratios.
In summary, the experimental study reveals that our learning based Lemane can learn proximity measures that best suit the task at hand in most scenarios. Compared to other matrix factorization methods, e.g., STRAP (Yin and Wei, 2019) and NRP (Yang et al., 2020), that take personalized PageRank as the proximity measure without learning, our Lemane trains the proximity measure, i.e., the supervised PPR, to gain better performance.
6. Conclusion
In this paper, we present Lemane that learns trainable proximity measures to best suit the datasets and tasks at hand automatically. Experimental results reveal that Lemane can learn more representative embeddings compared with state-of-the-art approaches.
Acknowledgements.
Sibo Wang is supported by Hong Kong RGC ECS (No. 24203419), RGC CRF (No. C4158-20G), CUHK Direct Grant (No. 4055114), and NSFC (No. U1936205). Zengfeng Huang is supported by Shanghai Sailing Program Grant No. 18YF1401200 and Shanghai Science and Technology Commission Grant No. 17JC1420200.
References
- twe (ggle) Kaggle. https://www.kaggle.com/c/kddcup2012-track1/data.
- wik (mark) Large text compression benchmark. http://mattmahoney.net/dc/textdata.
- sna (SNAP) SNAP. http://snap.stanford.edu/data/.
- blo (orks) Social-Dimension Approach to Classification in Large-Scale Networks. http://leitang.net/social_dimension.html.
- Abu-El-Haija et al. (2018) Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alex Alemi. 2018. Watch Your Step: Learning Node Embeddings via Graph Attention. In NeurIPS. 9198–9208.
- Andersen et al. (2006) Reid Andersen, Fan Chung, and Kevin Lang. 2006. Local Graph Partitioning using PageRank Vectors. In FOCS. 475–486.
- Bhowmick et al. (2020) Ayan Kumar Bhowmick, Koushik Meneni, Maximilien Danisch, Jean-Loup Guillaume, and Bivas Mitra. 2020. LouvainNE: Hierarchical Louvain Method for High Quality and Scalable Network Embedding. In WSDM. 43–51.
- Cao et al. (2016) Shaosheng Cao, Wei Lu, and Qiongkai Xu. 2016. Deep Neural Networks for Learning Graph Representations. In AAAI. 1145–1152.
- Chanpuriya and Musco (2020) Sudhanshu Chanpuriya and Cameron Musco. 2020. InfiniteWalk: Deep Network Embeddings as Laplacian Embeddings with a Nonlinearity. In SIGKDD. 1325–1333.
- Chien et al. (2021) Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. 2021. Adaptive Universal Generalized PageRank Graph Neural Network. In ICLR.
- Clarkson and Woodruff (2017) Kenneth L. Clarkson and David P. Woodruff. 2017. Low-Rank Approximation and Regression in Input Sparsity Time. J. ACM 63, 6 (2017).
- Deng et al. (2020) Chenhui Deng, Zhiqiang Zhao, Yongyu Wang, Zhiru Zhang, and Zhuo Feng. 2020. GraphZoom: A Multi-level Spectral Approach for Accurate and Scalable Graph Embedding. In ICLR.
- Dong et al. (2017) Yuxiao Dong, Nitesh V. Chawla, and Ananthram Swami. 2017. Metapath2vec: Scalable Representation Learning for Heterogeneous Networks. In SIGKDD. 135–144.
- Donnat et al. (2018) Claire Donnat, Marinka Zitnik, David Hallac, and Jure Leskovec. 2018. Learning Structural Node Embeddings via Diffusion Wavelets. In SIGKDD. 1320–1329.
- Fan et al. (2008) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. J. Mach. Learn. Res. 9 (2008), 1871–1874.
- Feng et al. (2018) Xu Feng, Yuyang Xie, Mingye Song, Wenjian Yu, and Jie Tang. 2018. Fast Randomized PCA for Sparse Data. In ACML. 710–725.
- Fouss et al. (2007) Francois Fouss, Alain Pirotte, Jean-michel Renders, and Marco Saerens. 2007. Random-Walk Computation of Similarities between Nodes of a Graph with Application to Collaborative Recommendation. TKDE 19 (2007), 355–369.
- Fu et al. (2017) Tao-yang Fu, Wang-Chien Lee, and Zhen Lei. 2017. HIN2Vec: Explore Meta-Paths in Heterogeneous Information Networks for Representation Learning. In CIKM. 1797–1806.
- Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In NeurIPS. 2672–2680.
- Goyal et al. (2018) Palash Goyal, Nitin Kamra, Xinran He, and Yan Liu. 2018. DynGEM: Deep Embedding Method for Dynamic Graphs. CoRR abs/1805.11273 (2018).
- Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. Node2vec: Scalable Feature Learning for Networks. In SIGKDD. 855–864.
- Gu et al. (2018) Yupeng Gu, Yizhou Sun, Yanen Li, and Yang Yang. 2018. RaRE: Social Rank Regulated Large-Scale Network Embedding. In WWW. 359–368.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
- Indyk et al. (2019) Piotr Indyk, Ali Vakilian, and Yang Yuan. 2019. Learning-Based Low-Rank Approximations. In NeurIPS. 7402–7412.
- Klicpera et al. (2019) Johannes Klicpera, Stefan Weißenberger, and Stephan Günnemann. 2019. Diffusion Improves Graph Learning. In (NeurIPS). 13333–13345.
- Lerer et al. (2019) Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A Large-scale Graph Embedding System. In SysML.
- Leskovec et al. (2010) Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. 2010. Predicting Positive and Negative Links in Online Social Networks. In WWW. 641–650.
- Leskovec et al. (2009) Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. 2009. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics 6, 1 (2009), 29–123.
- Ma et al. (2018) Jianxin Ma, Peng Cui, Xiao Wang, and Wenwu Zhu. 2018. Hierarchical Taxonomy Aware Network Embedding. In SIGKDD. 1920–1929.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NeurIPS. 3111–3119.
- Mislove et al. (2007) Alan Mislove, Massimiliano Marcon, Krishna P. Gummadi, Peter Druschel, and Bobby Bhattacharjee. 2007. Measurement and Analysis of Online Social Networks. In IMC. 29–42.
- Ou et al. (2016) Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. 2016. Asymmetric Transitivity Preserving Graph Embedding. In SIGKDD. 1105–1114.
- Page et al. (1999) Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank citation ranking: bringing order to the web. (1999).
- Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online Learning of Social Representations. In SIGKDD. 701–710.
- Pons and Latapy (2005) Pascal Pons and Matthieu Latapy. 2005. Computing communities in large networks using random walks. In ISCIS. 284–293.
- Qiu et al. (2019) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Chi Wang, Kuansan Wang, and Jie Tang. 2019. NetSMF: Large-Scale Network Embedding As Sparse Matrix Factorization. In WWW. 1509–1520.
- Qiu et al. (2018) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. 2018. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and Node2vec. In WSDM. 459–467.
- Tang et al. (2015) Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-Scale Information Network Embedding. In WWW. 1067–1077.
- Tang and Liu (2009) Lei Tang and Huan Liu. 2009. Relational Learning via Latent Social Dimensions. In SIGKDD. 817–826.
- Trivedi et al. (2019) Rakshit Trivedi, Mehrdad Farajtabar, Prasenjeet Biswal, and Hongyuan Zha. 2019. DyRep: Learning Representations over Dynamic Graphs. In ICLR.
- Tsitsulin et al. (2018) Anton Tsitsulin, Davide Mottin, Panagiotis Karras, and Emmanuel Müller. 2018. VERSE: Versatile Graph Embeddings from Similarity Measures. In WWW. 539–548.
- Tu et al. (2018) Ke Tu, Peng Cui, Xiao Wang, Philip S. Yu, and Wenwu Zhu. 2018. Deep Recursive Network Embedding with Regular Equivalence. In SIGKDD. 2357–2366.
- Tu et al. (2019) Ke Tu, Jianxin Ma, Peng Cui, Jian Pei, and Wenwu Zhu. 2019. AutoNE: Hyperparameter Optimization for Massive Network Embedding. In SIGKDD. 216–225.
- Wang et al. (2016) Daixin Wang, Peng Cui, and Wenwu Zhu. 2016. Structural Deep Network Embedding. In SIGKDD. 1225–1234.
- Wang et al. (2018) Hongwei Wang, Jia Wang, Jialin Wang, Miao Zhao, Weinan Zhang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018. GraphGAN: Graph Representation Learning With Generative Adversarial Nets. In AAAI. 2508–2515.
- Yang et al. (2020) Renchi Yang, Jieming Shi, Xiaokui Xiao, Yin Yang, and Sourav S. Bhowmick. 2020. Homogeneous Network Embedding for Massive Graphs via Reweighted Personalized PageRank. PVLDB 13, 5 (2020), 670–683.
- Yang et al. (2019) Renchi Yang, Xiaokui Xiao, Zhewei Wei, Sourav S. Bhowmick, Jun Zhao, and Rong-Hua Li. 2019. Efficient Estimation of Heat Kernel PageRank for Local Clustering. In SIGMOD. 1339–1356.
- Yin and Wei (2019) Yuan Yin and Zhewei Wei. 2019. Scalable Graph Embeddings via Sparse Transpose Proximities. In SIGKDD. 1429–1437.
- Yu et al. (2014) Xiao Yu, Xiang Ren, Yizhou Sun, Quanquan Gu, Bradley Sturt, Urvashi Khandelwal, Brandon Norick, and Jiawei Han. 2014. Personalized Entity Recommendation: A Heterogeneous Information Network Approach. In WSDM. 283–292.
- Zhang et al. (2019) Jie Zhang, Yuxiao Dong, Yan Wang, Jie Tang, and Ming Ding. 2019. ProNE: Fast and Scalable Network Representation Learning. In IJCAI. 4278–4284.
- Zhang et al. (2020) Shengzhong Zhang, Zengfeng Huang, Haicang Zhou, and Ziang Zhou. 2020. SCE: Scalable Network Embedding from Sparsest Cut. In SIGKDD. 257–265.
- Zhang et al. (2018a) Ziwei Zhang, Peng Cui, Haoyang Li, Xiao Wang, and Wenwu Zhu. 2018a. Billion-Scale Network Embedding with Iterative Random Projection. In ICDM. 787–796.
- Zhang et al. (2018b) Ziwei Zhang, Peng Cui, Xiao Wang, Jian Pei, Xuanrong Yao, and Wenwu Zhu. 2018b. Arbitrary-Order Proximity Preserved Network Embedding. In SIGKDD. 2778–2786.
- Zhou et al. (2017) Chang Zhou, Yuqiong Liu, Xiaofei Liu, Zhongyi Liu, and Jun Gao. 2017. Scalable Graph Embedding for Asymmetric Proximity. In AAAI. 2942–2948.
Appendix A Proofs
Proof of Theorem 1. We first prove the following lemma.
Lemma 1.
The probability that a supervised random walk stops at exactly the k-th hop is α_k·∏_{i=0}^{k−1}(1−α_i), and for any k we have Σ_{i=0}^{k} α_i·∏_{j=0}^{i−1}(1−α_j) = 1 − ∏_{i=0}^{k}(1−α_i) ≤ 1.
Proof.
Define f(k) = Σ_{i=0}^{k} α_i·∏_{j=0}^{i−1}(1−α_j). We claim that f(k) = 1 − ∏_{i=0}^{k}(1−α_i), which implies that f(k) ≤ 1 for any k. We justify the claim by induction. When k = 0, f(0) = α_0 = 1 − (1−α_0) holds obviously. Suppose that f(k) = 1 − ∏_{i=0}^{k}(1−α_i) holds. Then for k+1, we can derive that:

f(k+1) = f(k) + α_{k+1}·∏_{i=0}^{k}(1−α_i) = 1 − ∏_{i=0}^{k}(1−α_i) + α_{k+1}·∏_{i=0}^{k}(1−α_i) = 1 − ∏_{i=0}^{k+1}(1−α_i).

Thus the claim holds for any k. Proof done. ∎
Note that the transition matrix P is row-stochastic, i.e., Σ_t P[v,t] = 1 and P[v,t] ≥ 0 for any v and t. From these properties, we can easily derive that P^k, the k-th power of P, is also row-stochastic. Recall that the infinity-norm of a matrix M is its maximum absolute row sum, i.e., ‖M‖_∞ = max_v Σ_t |M[v,t]|. Since Algorithm 1 truncates the supervised random walks at hop K, for any v-th row of the matrix S − S̃ we have, by Lemma 1,

Σ_t |S[v,t] − S̃[v,t]| = Σ_{k=K+1}^{∞} α_k·∏_{i=0}^{k−1}(1−α_i)·Σ_t P^k[v,t] = Σ_{k=K+1}^{∞} α_k·∏_{i=0}^{k−1}(1−α_i) = 1 − Σ_{k=0}^{K} α_k·∏_{i=0}^{k−1}(1−α_i) = ∏_{i=0}^{K}(1−α_i).

Since this bound holds for every row v, it clearly holds that ‖S − S̃‖_∞ ≤ ∏_{i=0}^{K}(1−α_i). The theorem is proved.
Proof of Theorem 1 in Section 4.1 (running time of Algorithm 3). The cost of Algorithm 3 is dominated by the number of push operations. Recall that a push operation on entry r_k(v) is invoked only if r_k(v) > ε·d_out(v), and each such push costs O(d_out(v)) since node v passes its residue to its out-neighbors. The total residue mass that ever arrives at hop k of node v over the whole execution is ∏_{i=0}^{k−1}(1−α_i)·P^k[s,v], so the number of push operations on entry r_k(v) is at most ∏_{i=0}^{k−1}(1−α_i)·P^k[s,v] / (ε·d_out(v)), and the total cost of the push operations on this entry is bounded by ∏_{i=0}^{k−1}(1−α_i)·P^k[s,v] / ε. Summing over all nodes v and all hops k ≤ K, and using the fact that P^k is row-stochastic, the time cost of Algorithm 3 is at most:

Σ_{k=0}^{K} ∏_{i=0}^{k−1}(1−α_i) / ε ≤ (K+1)/ε,

where K can be treated as a constant. Thus, Algorithm 3 runs in O(K/ε) time and the proof is done.
Proof of Theorem 2. The following lemmas are used in the proof.
Lemma 2.
(Pons and Latapy, 2005) For an undirected graph G and any two nodes u and v in G, let P^k[u,v] (resp. P^k[v,u]) be the probability of going from u to v (resp. from v to u) through a random walk of fixed length k. Then, d(u)·P^k[u,v] = d(v)·P^k[v,u].
Lemma 3.
Let π^K(s,t) denote the supervised PPR within K hops, and let π^{k→K}(v,t) denote the probability that a supervised random walk starting from v at hop k (i.e., using the stopping probabilities α_k, α_{k+1}, …, α_K) stops at t. Then, at the end of every iteration of Algorithm 3, the estimate vector π̂_s and the residue vectors r_0, …, r_K satisfy, for every node t,

π^K(s,t) = π̂_s(t) + Σ_{k=0}^{K} Σ_{v∈V} r_k(v)·π^{k→K}(v,t),    (8)

where, initially, r_0 = e_s, the unit vector in which the s-th entry is 1 and all other entries are 0.
Proof.
We prove Lemma 3 by induction. First, at the beginning of Algorithm 3, the vector π̂_s and all residue vectors are set to zero except that r_0(s) = 1. The r.h.s. of Equation 8 then equals π^{0→K}(s,t) = π^K(s,t), which implies that Equation 8 holds under the initial condition. Now suppose that Equation 8 holds at the end of the i-th iteration, and that during the (i+1)-th iteration a push operation is invoked on entry r_k(v). If the change of the r.h.s. of Equation 8 caused by the push operation is zero, then Equation 8 still holds at the end of the (i+1)-th iteration. After performing the push operation in the (i+1)-th iteration, an α_k·r_k(v) portion is moved to π̂_s(v), and, recalling that N_out(v) is the set of v's out-neighbors, each u ∈ N_out(v) receives (1−α_k)·r_k(v)/d_out(v) in r_{k+1}(u). Since a supervised random walk located at v at hop k either stops at v with probability α_k or moves to a uniformly random out-neighbor and continues from hop k+1, i.e., π^{k→K}(v,t) = α_k·𝟙[t=v] + (1−α_k)·Σ_{u∈N_out(v)} π^{(k+1)→K}(u,t)/d_out(v), the change of the r.h.s. between the i-th iteration and the (i+1)-th iteration is zero. ∎
Next, we prove Theorem 2. When Algorithm 3 terminates, every residue satisfies r_k(v) ≤ ε·d(v). By Equation 8 in Lemma 3, we have:

|π^K(s,t) − π̂_s(t)| = Σ_{k=0}^{K} Σ_{v∈V} r_k(v)·π^{k→K}(v,t) ≤ ε·Σ_{k=0}^{K} Σ_{v∈V} d(v)·π^{k→K}(v,t) = ε·d(t)·Σ_{k=0}^{K} Σ_{v∈V} π^{k→K}(t,v) ≤ (K+1)·ε·d(t),

where the second-to-last step holds by Lemma 2 and the last step holds by Lemma 1. The theorem is proved.
Proof of Theorem 3. The time complexity of Algorithm 4 depends on two parts: the generalized push and the SparseSVD. From the running time result in Theorem 1 of Section 4.1, for each source s, the cost of the push operations is O(K/ε). Thus, the total cost of the push operations over the n source nodes is bounded by O(n·K/ε). Then, we need the following theorem of (Clarkson and Woodruff, 2017) to analyze the cost of SparseSVD.
Theorem 4.
Let A denote an n×n matrix. There is an algorithm that, with constant failure probability, finds two matrices U and V with orthonormal columns and a diagonal matrix Σ, so that ‖A − U·Σ·V^⊤‖_F ≤ (1+ϵ)·‖A − A_d‖_F, where A_d denotes the best rank-d approximation to A. The algorithm runs in O(nnz(A)) time plus a low-order term that is polynomial in n, d, and 1/ϵ, where nnz(A) is the number of non-zero entries of A.
Recall that an approximate supervised PPR score is added to S only if it is larger than the threshold δ. Thus, the total number of non-zero entries in S is O(n/δ), and the cost of SparseSVD is bounded by O(n/δ) plus the low-order term above. Finally, combining the two parts, the running cost of Algorithm 4 is bounded by O(n·K/ε + n/δ) plus the low-order term of the sparse SVD, which completes our proof.
Appendix B Hyper-parameters settings
Dataset | Hyperparameters |
---|---|
Wikipedia | initialization: with , |
learning rate: , : , : , : , | |
SVD used for push: JacobiSVD | |
Wikivote | initialization: with , |
learning rate: , : , : , : , | |
SVD used for push: frPCA | |
BlogCatalog | initialization: with , |
learning rate: , : , : , : , | |
SVD used for push: frPCA | |
Slashdot | initialization: with , |
learning rate: , : , : , : , | |
SVD used for push: frPCA | |
Tweibo | initialization: with , |
learning rate: , : , : , : , | |
SVD used for push: frPCA | |
Orkut | initialization: with , |
learning rate: , : , : , : , | |
SVD used for push: frPCA |
Dataset | Hyperparameters |
---|---|
Wikipedia | initialization: with , |
learning rate: , : , : , : , | |
SVD used for push: JacobiSVD | |
BlogCatalog | initialization: with , |
learning rate: , : , : , : , | |
SVD used for push: JacobiSVD | |
TWeibo | initialization: with , |
learning rate: , : , : , : , | |
SVD used for push: frPCA, and are | |
normalized before concatenation in | |
loss function | |
Orkut | initialization: with , |
learning rate: , : , : , : , | |
SVD used for push: frPCA, and are | |
normalized before concatenation in | |
loss function |
Table 8 and Table 9 summarize the hyperparameter settings of Lemane on each dataset. The searched hyperparameters include the initialization distribution of the probability that the supervised random walk stops at the k-th hop, the loss function coefficients γ₁ and γ₂, the error parameter, the learning rate, and the built-in SVDs used in the generalized push. For the initialization of the stopping distribution, two standard distributions are used, namely the geometric distribution and the Poisson distribution.