Fast Graph Learning
with Unique Optimal Solutions
Abstract
We consider two popular Graph Representation Learning (GRL) methods: message passing for node classification and network embedding for link prediction. For each, we pick a popular model that we (i) linearize and (ii) switch its training objective to Frobenius-norm error minimization. These simplifications cast training as finding the optimal model parameters in closed-form. We program in TensorFlow a functional form of Truncated Singular Value Decomposition (SVD), such that we can decompose a dense matrix $\mathbf{M}$ without explicitly computing $\mathbf{M}$. We achieve competitive performance on popular GRL tasks while providing orders-of-magnitude speedup. We open-source our code at http://github.com/samihaija/tf-fsvd
1 Introduction
Many recent graph representation learning (GRL) models are creative and theoretically-justified (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018; Qiu et al., 2018; Xu et al., 2019; Abu-El-Haija et al., 2019; Chen et al., 2020). Unfortunately, they contain hyperparameters that need to be tuned (such as the learning rate, regularization coefficient, and depth and width of the network), and training takes a long time (e.g. minutes) even on smaller datasets.
We circumvent these weaknesses: we (i) quickly train (ii) competitive GRL models by posing convex objectives and estimating their optimal solutions in closed-form, hence (iii) relieving practitioners from hyperparameter tuning and convergence checks. Our goals evoke a classical learning technique that has been used for decades: Singular Value Decomposition (SVD).
SVD periodically appears within powerful yet simple methods that compete with the state of the art. The common practice is to design a matrix $\mathbf{M}$ such that its decomposition (via SVD) provides an estimate for learning a model under a given objective. For instance, Levy & Goldberg (2014) show that the learning of NLP embedding models such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) can be approximated by the SVD of a Shifted Positive Pointwise Mutual Information (SPPMI) matrix.
In GRL, Chen et al. (2017); Qiu et al. (2018); Abu-El-Haija et al. (2018) have approximated the methods of DeepWalk (Perozzi et al., 2014) and Node2Vec (Grover & Leskovec, 2016) via the decomposition of some matrix $\mathbf{M}$. However, their decomposition requires $\mathbf{M}$ to be either (a) exactly calculated or (b) sampled entry-wise; yet (a) is unnecessarily expensive for large real-world networks, since $\mathbf{M}$ becomes dense due to the small-world phenomenon (Travers & Milgram, 1969), and (b) incurs unnecessary estimation errors. On the other hand, known algorithms in matrix theory can decompose any matrix $\mathbf{M}$ without explicitly knowing its entries: it is sufficient to provide a function $f(\mathbf{v}) = \mathbf{M}\mathbf{v}$ that can multiply $\mathbf{M}$ with arbitrary vectors (§4). We argue that if the popular frameworks (e.g., TensorFlow) implemented a functional SVD that accepts $f$ rather than $\mathbf{M}$, then modern practitioners may find it useful.
2 Preliminaries
2.1 Singular Value Decomposition (SVD)
Truncated (top-$k$) Singular Value Decomposition (SVD) approximates an input matrix $\mathbf{M}$ with a low-rank estimate $\widehat{\mathbf{M}}$ that minimizes the Frobenius norm of the error:
$\min_{\widehat{\mathbf{M}}:\ \operatorname{rank}(\widehat{\mathbf{M}}) \le k} \; \big\lVert \mathbf{M} - \widehat{\mathbf{M}} \big\rVert_F^2$  (1)
while parameterizing $\widehat{\mathbf{M}} = \mathbf{U}\mathbf{S}\mathbf{V}^\top$, subject to the columns of $\mathbf{U}$ and $\mathbf{V}$ being orthonormal. It turns out that the minimizer of the Frobenius-norm error recovers the top-$k$ singular values (stored along the diagonal matrix $\mathbf{S}$), with their corresponding left- and right-singular vectors stored, respectively, as the columns of $\mathbf{U}$ and $\mathbf{V}$ (a.k.a. the singular bases).
SVD has many applications, and we utilize two: (1) it is used for embedding and matrix completion; and (2) it can estimate the pseudoinverse of a matrix $\mathbf{M}$ (a.k.a. the Moore-Penrose inverse) as:
$\mathbf{M}^\dagger \approx \mathbf{V}\,\mathbf{S}^{-1}\,\mathbf{U}^\top$  (2)
where one calculates $\mathbf{S}^{-1}$ by reciprocating the (nonzero) entries of the diagonal matrix $\mathbf{S}$. The $\approx$ of Eq. 2 becomes an $=$ when $k \ge \operatorname{rank}(\mathbf{M})$, due to (Eckart & Young, 1936; Golub & Loan, 1996).
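To make Eqs. 1-2 concrete, the following is a minimal TensorFlow sketch of a rank-$k$ truncated SVD and the resulting pseudoinverse. The function names are ours, and this sketch operates on an explicit dense matrix, unlike the functional variant of §4.

```python
import tensorflow as tf

def truncated_svd(M, k):
    """Rank-k truncated SVD of an explicit dense matrix M (cf. Eq. 1)."""
    s, U, V = tf.linalg.svd(M)            # M = U @ diag(s) @ V^T
    return U[:, :k], s[:k], V[:, :k]

def pseudoinverse(M, k):
    """Moore-Penrose pseudoinverse from the rank-k factors (cf. Eq. 2).
    Assumes the kept singular values are nonzero."""
    U, s, V = truncated_svd(M, k)
    return V @ tf.linalg.diag(1.0 / s) @ tf.transpose(U)

# Example: the rank-2 estimate of a random 5x3 matrix.
M = tf.random.normal([5, 3])
U, s, V = truncated_svd(M, 2)
M_hat = U @ tf.linalg.diag(s) @ tf.transpose(V)
```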
2.2 Graph Representation Learning (GRL)
(i) Many message passing models can be written as:
$\mathbf{H} = \sigma_L\big(\widehat{\mathbf{A}} \cdots \sigma_2\big(\widehat{\mathbf{A}}\,\sigma_1(\widehat{\mathbf{A}}\,\mathbf{X}\,\mathbf{W}_1)\,\mathbf{W}_2\big) \cdots \mathbf{W}_L\big)$  (3)

where $L$ is the number of layers, matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ contains $d$ features per node, the $\mathbf{W}_j$'s are trainable parameters, the $\sigma_j$'s denote activations (e.g. ReLU), and $\widehat{\mathbf{A}}$ is some (possibly trainable) transformation of the adjacency matrix $\mathbf{A}$. GCN (Kipf & Welling, 2017) sets $\widehat{\mathbf{A}}$ to the symmetrically-normalized adjacency with self-connections per the renormalization trick, GAT (Veličković et al., 2018) sets $\widehat{\mathbf{A}}$ to an attention-weighted masking of $\mathbf{A}$, and GIN (Xu et al., 2019) sets $\widehat{\mathbf{A}} = \mathbf{A} + (1 + \epsilon)\mathbf{I}$ with identity $\mathbf{I}$ and scalar $\epsilon$. For node classification, it is common to set $\sigma_L = \operatorname{softmax}$ (applied row-wise), specify the size of $\mathbf{W}_L$ s.t. $\mathbf{H} \in \mathbb{R}^{n \times y}$ where $y$ is the number of classes, and optimize the cross-entropy objective:
$\min_{\mathbf{W}_1, \dots, \mathbf{W}_L} \; -\textstyle\sum_{i,c} \big[\mathbf{Y} \circ \log \mathbf{H}\big]_{ic}$  (4)

where $\mathbf{Y} \in \{0, 1\}^{n \times y}$ is a binary matrix with one-hot rows indicating node labels and $\circ$ is the Hadamard product. In semi-supervised node classification settings where not all nodes are labeled, before measuring the objective, only the subset of rows of $\mathbf{H}$ and $\mathbf{Y}$ that correspond to labeled nodes is kept.
(ii) Network embedding methods map nodes onto a low-dimensional vector space. Modern approaches train skipgram models (e.g. word2vec (Mikolov et al., 2013)) on sampled random walks. It has been shown that these skipgram network embedding methods, including DeepWalk (Perozzi et al., 2014) and node2vec (Grover & Leskovec, 2016), with a learning process of walk sampling followed by positional embedding, can be approximated as a matrix decomposition (Chen et al., 2017; Abu-El-Haija et al., 2018; Qiu et al., 2018). We point the curious reader to the listed papers for how the decomposition is derived, but show here the objective of Abu-El-Haija et al. (WYS, 2018), as it performs well in our experiments:
$\min_{\mathbf{L}, \mathbf{R}} \; -\textstyle\sum \Big[ \mathbb{E}[\mathbf{D}] \circ \log \sigma(\mathbf{L}\mathbf{R}^\top) + \lambda\,(\mathbb{1}\mathbb{1}^\top - \mathbf{A}) \circ \log\big(1 - \sigma(\mathbf{L}\mathbf{R}^\top)\big) \Big]$  (5)

where the node embedding concatenates the two trainable halves, i.e. $[\mathbf{L} \mid \mathbf{R}]$, and $\sigma(\mathbf{L}\mathbf{R}^\top)$ is the logistic (sigmoid) of their cross-correlation (pairwise dot-products). $\mathbb{E}[\mathbf{D}]$ is the expected co-visit count matrix of random walks, computable (up to scaling) as $\sum_{i=1}^{C} c_i\,\mathcal{T}^i$, where $\mathcal{T} = \operatorname{diag}(\mathbf{A}\mathbb{1})^{-1}\mathbf{A}$ is the transition matrix obtained from the diagonal degree matrix, and we fix the context vector $\mathbf{c}$ to a staircase: $\mathbf{c} = [C, C-1, \dots, 1]$. For instance, $\mathbf{c} = [3, 2, 1]$ for context size $C = 3$.
3 Our Proposed Convex Objectives
3.1 Network Embedding Model
The objective in Eq. 5 learns node embeddings using cross-entropy. Its terms, the model output (outer product $\sigma(\mathbf{L}\mathbf{R}^\top)$), the negatives (non-edges, $\mathbb{1}\mathbb{1}^\top - \mathbf{A}$), and the positives (expected node-pair co-visit counts, $\mathbb{E}[\mathbf{D}]$), are all dense matrices with $\mathcal{O}(n^2)$ nonzero entries. For instance, even a relatively small social network with $n = 100{,}000$ nodes and an average degree of 100 (i.e. $10^7$ edges) would produce matrices occupying 40GB of memory (as float32), whereas one can do the entire learning with about 40MB using functional SVD (§4). We start by designing a matrix $\mathbf{M}$ incorporating the positive and negative information ($\mathbb{E}[\mathbf{D}]$ and $\mathbb{1}\mathbb{1}^\top - \mathbf{A}$) as:
$\mathbf{M} \triangleq \mathbb{E}[\mathbf{D}] - \lambda\,\big(\mathbb{1}\mathbb{1}^\top - \mathbf{A}\big)$  (6)
with coefficient $\lambda > 0$ weighing the negative samples. Decomposing this $\mathbf{M}$ (rather than optimizing Eq. 5) is what we refer to as our convexification.
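For illustration only, the sketch below materializes $\mathbb{E}[\mathbf{D}]$ and $\mathbf{M}$ densely on a tiny graph; the staircase weights, the treatment of self-pairs, and the exact normalization of $\mathbb{E}[\mathbf{D}]$ are our simplifications. On real graphs these matrices are dense, which is exactly what the functional treatment of §4 avoids.

```python
import numpy as np

# Tiny 4-node graph; adjacency A, transition matrix T = D^{-1} A.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
deg = A.sum(axis=1)
T = A / deg[:, None]

# Positive term: weighted sum of transition-matrix powers with staircase
# context weights c = [C, C-1, ..., 1] (context size C = 3 here).
C = 3
E_D = sum((C - i + 1) * np.linalg.matrix_power(T, i) for i in range(1, C + 1))

# Eq. 6: combine positives and (non-edge) negatives; lambda chosen arbitrarily.
lam = 0.5
M = E_D - lam * (np.ones_like(A) - A)
```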
Learning: We can directly set the embeddings to the SVD basis of $\mathbf{M}$. Matrix $\mathbf{M}$ has a large entry $\mathbf{M}_{uv}$ when nodes $(u, v)$ are well-connected (co-visited many times during random walks) and a small (negative) entry if they are non-edges. SVD provides a rank-$k$ estimator of $\mathbf{M}$ as $\mathbf{U}\mathbf{S}\mathbf{V}^\top$, i.e. the one with minimum Frobenius norm of error. We can set the network embedding model parameters as:
$\mathbf{L} = \mathbf{U}\,\mathbf{S}^{1/2}$  (7)

$\mathbf{R} = \mathbf{V}\,\mathbf{S}^{1/2}$  (8)

$\widehat{\mathbf{M}}_{\mathcal{U} \times \mathcal{V}} = \mathbf{L}_{\mathcal{U}}\,\mathbf{R}_{\mathcal{V}}^\top$  (9)

where the (RHS) set-subscript denotes gathering rows (a.k.a. advanced indexing), so Eq. 9 scores all node pairs in $\mathcal{U} \times \mathcal{V}$ by the dot-products of their embeddings.
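A small sketch of the learning and inference steps above, assuming the symmetric split of singular values in Eqs. 7-8 (function names are ours; on large graphs the SVD itself would come from the functional routine of §4):

```python
import tensorflow as tf

def embeddings_from_svd(U, s, V):
    """Eqs. 7-8: L = U S^{1/2}, R = V S^{1/2} from the rank-k SVD of M."""
    sqrt_s = tf.sqrt(s)
    return U * sqrt_s, V * sqrt_s        # broadcasts sqrt(s_j) over column j

def score_pairs(L, R, src, dst):
    """Eq. 9: dot-product scores for candidate node pairs (src[i], dst[i]);
    tf.gather performs the row-gathering ('advanced indexing') noted above."""
    return tf.reduce_sum(tf.gather(L, src) * tf.gather(R, dst), axis=-1)
```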
3.2 Message Passing Models
We can linearize message passing models (Eq. 3) by assuming all $\sigma$'s are the identity function ($\sigma(x) = x$). Let $\widehat{\mathbf{A}}$ be fixed (not trainable); specifically, let $\widehat{\mathbf{A}} = \widetilde{\mathbf{D}}^{-1/2}(\mathbf{A} + \mathbf{I})\widetilde{\mathbf{D}}^{-1/2}$ per the renormalization trick of Kipf & Welling (2017), with $\widetilde{\mathbf{D}} = \operatorname{diag}((\mathbf{A} + \mathbf{I})\mathbb{1})$. A linear $L$-layer message passing network can then be written as:
$\widehat{\mathbf{H}}\,\mathbf{W}, \quad \text{with} \quad \widehat{\mathbf{H}} \triangleq \big[\,\mathbf{X} \;\big|\; \widehat{\mathbf{A}}\mathbf{X} \;\big|\; \widehat{\mathbf{A}}^2\mathbf{X} \;\big|\; \cdots \;\big|\; \widehat{\mathbf{A}}^L\mathbf{X}\,\big]$  (10)
The concatenation of all layers was proposed in Jumping Knowledge Networks (JKN, Xu et al., 2018). We train the linear model with a Frobenius-norm objective:

$\min_{\mathbf{W}} \; \big\lVert \widehat{\mathbf{H}}\,\mathbf{W} - \mathbf{Y} \big\rVert_F^2$  (11)
The loss in Equation 11 can perform well on classification tasks and, according to Hui & Belkin (2021), performs about as well as the cross-entropy loss defined in Equation 4. Taking the derivative of Eq. 11 with respect to $\mathbf{W}$ and setting it to zero gives:
$\widehat{\mathbf{H}}^\top\widehat{\mathbf{H}}\,\mathbf{W}^* = \widehat{\mathbf{H}}^\top\mathbf{Y} \;\;\Longrightarrow\;\; \mathbf{W}^* = \widehat{\mathbf{H}}^\dagger\,\mathbf{Y}$  (12)
Rank-$k$ SVD of $\widehat{\mathbf{H}}$ can estimate $\widehat{\mathbf{H}}^\dagger$ (per Eq. 2) and hence $\mathbf{W}^*$ as:
$\widehat{\mathbf{W}} = \mathbf{V}\,\Big(\mathbf{S}^{-1}\big(\mathbf{U}^\top\,\mathbf{Y}\big)\Big)$  (13)
The order of multiplications in Eq. 13 (right-to-left, as parenthesized) is chosen for efficiency. Further, when only a subset $\mathcal{L}$ of nodes have labels, the right-most multiplication of Eq. 13 can be restricted to the labeled nodes. Let $\mathbf{Y}_{[\mathcal{L}]}$ be the matrix of rows selected from $\mathbf{Y}$ according to the elements of $\mathcal{L}$. The right-most multiplication of Eq. 13 is then modified to $\mathbf{U}_{[\mathcal{L}]}^\top\,\mathbf{Y}_{[\mathcal{L}]}$.
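A minimal sketch of this closed-form training, assuming an explicit (small) $\widehat{\mathbf{H}}$ and a dense SVD call; names are ours, and §4 describes how to avoid materializing these matrices on large graphs.

```python
import tensorflow as tf

def build_H(A_hat, X, L):
    """Eq. 10: concatenate [X | A_hat X | ... | A_hat^L X] along features.
    A_hat is a tf.SparseTensor (e.g., the renormalized adjacency)."""
    blocks, cur = [X], X
    for _ in range(L):
        cur = tf.sparse.sparse_dense_matmul(A_hat, cur)
        blocks.append(cur)
    return tf.concat(blocks, axis=1)

def solve_W(H, Y, k, labeled=None):
    """Eq. 13: W = V S^{-1} U^T Y from the rank-k SVD of H. If `labeled` is
    given, only those rows enter the right-most product, as described above
    (the SVD itself is of the full H)."""
    s, U, V = tf.linalg.svd(H)
    U, s, V = U[:, :k], s[:k], V[:, :k]
    if labeled is not None:
        UtY = tf.transpose(tf.gather(U, labeled)) @ tf.gather(Y, labeled)
    else:
        UtY = tf.transpose(U) @ Y
    return V @ (UtY / s[:, tf.newaxis])   # right-most products first, for efficiency

# Inference: class scores for every node are H @ W; predict with argmax per row.
```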
4 Functional Singular Value Decomposition
We do not have to explicitly calculate the matrices $\mathbf{M}$ (Eq. 6) and $\widehat{\mathbf{H}}$ (Eq. 10). Rather, we only need to implement product functions that can multiply them with arbitrary (appropriately-sized) vectors $\mathbf{v}$. We implement a (TensorFlow) functional version of the randomized SVD algorithm of Halko et al. (2009) that accepts the product function rather than the matrix itself. We show that it can train our models quickly and with arbitrarily small approximation error (in time linear in the graph size; in practice, with fewer than 10 passes over the data), and that it yields l2-regularized solutions for classification (see Appendix). We now need the (straightforward) product functions $f_{\mathbf{M}}(\mathbf{v}) = \mathbf{M}\mathbf{v}$ and $f_{\widehat{\mathbf{H}}}(\mathbf{v}) = \widehat{\mathbf{H}}\mathbf{v}$. We leave the second outside this writing. For the first, the non-edges term $(\mathbb{1}\mathbb{1}^\top - \mathbf{A})\mathbf{v}$ can be re-written by explicit broadcasting as $\mathbb{1}(\mathbb{1}^\top\mathbf{v}) - \mathbf{A}\mathbf{v}$, giving
$f_{\mathbf{M}}(\mathbf{v}) = \mathbb{E}[\mathbf{D}]\,\mathbf{v} - \lambda\,\big(\mathbb{1}\,(\mathbb{1}^\top\mathbf{v}) - \mathbf{A}\,\mathbf{v}\big)$  (14)
All matrix-vector products on the right-hand side can be computed efficiently when $\mathbf{A}$ is sparse; in particular, $\mathbb{E}[\mathbf{D}]\,\mathbf{v}$ is computed as $C$ successive sparse multiplications with $\mathcal{T}$.
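A sketch of the product function $f_{\mathbf{M}}$ of Eq. 14, written so that the dense non-edge term is never materialized. The staircase weighting of $\mathbb{E}[\mathbf{D}]$ mirrors the illustrative version in §3.1, and `A_sp`, `T_sp` are assumed to be `tf.SparseTensor` forms of $\mathbf{A}$ and $\mathcal{T}$.

```python
import tensorflow as tf

def make_product_fn(A_sp, T_sp, C, lam):
    """Returns f(v) = M v for the matrix M of Eq. 6, using the broadcast
    rewrite of Eq. 14; v is a dense [n, r] block of vectors, as used by
    the randomized SVD."""
    def product(v):
        # Positive term: E[D] v as C successive sparse multiplications with T.
        pos = tf.zeros_like(v)
        Tv = v
        for i in range(1, C + 1):
            Tv = tf.sparse.sparse_dense_matmul(T_sp, Tv)
            pos += float(C - i + 1) * Tv
        # Negative term: (1 1^T - A) v = 1 (1^T v) - A v, without densifying.
        neg = tf.reduce_sum(v, axis=0, keepdims=True) \
              - tf.sparse.sparse_dense_matmul(A_sp, v)
        return pos - lam * neg
    return product
```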
5 Experimental Results (details in Appendix)
Table 1: Left: node classification test accuracy (training time in parentheses). Right: link prediction test ROC-AUC.
|  | Cora | Citeseer | Pubmed |
|---|---|---|---|
| Planetoid | 75.7 (13s) | 64.7 (26s) | 77.2 (25s) |
| GCN | 81.5 (4s) | 70.3 (7s) | 79.0 (83s) |
| GAT | 83.2 (1m23s) | 72.4 (3m27s) | 77.7 (5m33s) |
| MixHop | 81.9 (26s) | 71.4 (31s) | 80.8 (1m16s) |
| GCNII | 85.5 (2m29s) | 73.4 (2m55s) | 80.3 (1m42s) |
| Ours | 82.4 (0.28s) | 72.2 (0.13s) | 79.7 (0.14s) |
|  | FB | AstroPh | HepTh | PPI |
|---|---|---|---|---|
| WYS | 99.4 | 97.9 | 93.6 | 89.8 |
| n2v | 99.0 | 97.8 | 92.3 | 83.1 |
| NetMF | 97.6 | 96.8 | 90.5 | 73.6 |
| NetMF (sampled) | 97.0 | 81.9 | 85.0 | 63.6 |
| Ours | 98.7 | 92.1 | 89.2 | 87.9 |
| Ours (k = 100) | 98.7 | 96.0 | 90.5 | 86.2 |
Acknowledgments
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and the Army Contracting Command-Aberdeen Proving Grounds (ACC-APG) under Contract Number W911NF-18-C-0020.
References
- Abu-El-Haija et al. (2018) Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alexander A Alemi. Watch your step: Learning node embeddings via graph attention. In Advances in Neural Information Processing Systems, NeurIPS, 2018.
- Abu-El-Haija et al. (2019) Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Hrayr Harutyunyan, Nazanin Alipourfard, Kristina Lerman, Greg Ver Steeg, and Aram Galstyan. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In International Conference on Machine Learning, ICML, 2019.
- Arnoldi (1951) W. E. Arnoldi. The principle of minimized iterations in the solution of the matrix eigenvalue problem. In Quarterly of Applied Mathematics, pp. 17–29, 1951.
- Chen et al. (2020) Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In International Conference on Machine Learning, ICML, 2020.
- Chen et al. (2017) Siheng Chen, Sufeng Niu, Leman Akoglu, Jelena Kovacevic, and Christos Faloutsos. Fast, warped graph embedding: Unifying framework and one-click algorithm. arxiv:1702.05764, 2017.
- Eckart & Young (1936) C. Eckart and G. Young. The approximation of one matrix by another of lower rank. In Psychometrika, pp. 211–218, 1936.
- Fey & Lenssen (2019) Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
- Golub & Loan (1996) Gene H. Golub and Charles F. Van Loan. Matrix Computations, pp. 257–258. Johns Hopkins University Press, Baltimore, 3rd edition, 1996.
- Grover & Leskovec (2016) Aditya Grover and Jure Leskovec. Node2vec: Scalable feature learning for networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, pp. 855–864, 2016.
- Halko et al. (2009) N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. ACM Technical Reports, 2009.
- Hamilton et al. (2017) William Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, NeurIPS, 2017.
- Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In arXiv, 2020.
- Hui & Belkin (2021) Like Hui and Mikhail Belkin. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. In International Conference on Learning Representations, 2021.
- Kipf & Welling (2017) T. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, ICLR, 2017.
- Lanczos (1950) C Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. In Journal of Research of the National Bureau of Standards, pp. 255–282, 1950.
- Levy & Goldberg (2014) Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, NeurIPS, pp. 2177–2185, 2014.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, NeurIPS, pp. 3111–3119, 2013.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing, EMNLP, pp. 1532–1543, 2014.
- Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In ACM SIGKDD international conference on Knowledge discovery and data mining, KDD, pp. 701–710, 2014.
- Qiu et al. (2018) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In International Conference on Web Search and Data Mining, WSDM, pp. 459–467, 2018.
- Travers & Milgram (1969) Jeffrey Travers and Stanley Milgram. An experimental study of the small world problem. In Sociometry, pp. 425–443, 1969.
- Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, ICLR, 2018.
- Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, ICML, pp. 5453–5462, 2018.
- Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, ICLR, 2019.
- Yang et al. (2016) Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning, 2016.
Appendix A: Implementation of Functional SVD
A.1 Calculating the SVD
SVD of $\mathbf{M}$ yields its left and right singular orthonormal (basis) vectors, respectively, in the columns of $\mathbf{U}$ and $\mathbf{V}$. Since the columns of $\mathbf{U}$ and $\mathbf{V}$, respectively, are the eigenvectors of $\mathbf{M}\mathbf{M}^\top$ and $\mathbf{M}^\top\mathbf{M}$, perhaps the most intuitive algorithms for SVD are variants of the power iteration, including the Arnoldi iteration (Arnoldi, 1951) and the Lanczos algorithm (Lanczos, 1950). In practice, randomized algorithms for estimating the SVD run faster than these variants, including the algorithm of Halko et al. (2009), which is implemented in scikit-learn. None of these methods require individual access to $\mathbf{M}$'s entries; rather, they require two operations: the ability to multiply any vector with $\mathbf{M}$ and with $\mathbf{M}^\top$. Therefore, it is only a practical gap that we fill in this section: we open-source a TensorFlow implementation that accepts product and transpose operators.
A.2 TensorFlow Implementation
Since we do not explicitly calculate the matrices displayed in Equations 6 and 10, as doing so consumes quadratic memory $\mathcal{O}(n^2)$, we implement a functional form of the celebrated randomized SVD algorithm of Halko et al. (2009). To run our Algorithm 1, one must specify $k$ (the rank of the decomposition), as well as functions that the caller promises operate as follows (a minimal sketch is given after the list):
1. Product function that exactly computes $\mathbf{M}\mathbf{v}$ for any (appropriately-sized) $\mathbf{v}$ (recall, e.g., Eq. 14), without materializing $\mathbf{M}$.
2. Transpose function that returns the analogous operator for $\mathbf{M}^\top$.¹
3. Shape (constant) function that knows and returns the shape $(n, m)$ of $\mathbf{M}$. Once $\mathbf{M}$ is transposed, it should return $(m, n)$.

¹ An alternative to the transpose operator can be a left-multiply function $\mathbf{u} \mapsto \mathbf{u}^\top\mathbf{M}$; however, in practice, TensorFlow is optimized for CSR matrices and computationally favors the sparse-times-dense product.
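The following is a compact sketch of such a functional randomized SVD, using only the product and transpose operators just listed; argument names and defaults are ours, not the released API.

```python
import tensorflow as tf

def functional_svd(product_fn, tproduct_fn, shape, k, n_iter=7, oversample=10, seed=0):
    """Randomized rank-k SVD (Halko et al., 2009) that touches the (n x m)
    matrix M only through product_fn(V) = M @ V and tproduct_fn(U) = M^T @ U."""
    n, m = shape
    r = min(k + oversample, n, m)
    Omega = tf.random.normal([m, r], seed=seed)
    Q, _ = tf.linalg.qr(product_fn(Omega))        # orthonormal basis for range(M)
    for _ in range(n_iter):                       # subspace (power) iterations
        Q, _ = tf.linalg.qr(tproduct_fn(Q))       # basis in the row space of M
        Q, _ = tf.linalg.qr(product_fn(Q))        # back to the column space of M
    B = tf.transpose(tproduct_fn(Q))              # B = Q^T M, shape [r, m]
    s, Ub, V = tf.linalg.svd(B)                   # small dense SVD
    U = Q @ Ub                                    # lift back: M ~= U diag(s) V^T
    return U[:, :k], s[:k], V[:, :k]
```

Its output can be fed directly into the sketches of §3 (e.g., the `embeddings_from_svd` and `solve_W` functions above) to train the convexified models without ever materializing $\mathbf{M}$.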
A.3 Analysis
A.3.1 Norm Regularization of Wide Models
If $\widehat{\mathbf{H}}$ is wide, then we need not worry much about overfitting, due to the following theorem.
Theorem 1
(Min. Norm) If the system $\widehat{\mathbf{H}}\mathbf{W} = \mathbf{Y}$ is underdetermined², with the rows of $\widehat{\mathbf{H}}$ linearly independent, then there are infinitely many solutions. Denote the solution space $\mathcal{W} = \{\mathbf{W} : \widehat{\mathbf{H}}\mathbf{W} = \mathbf{Y}\}$. Then, for $k \ge \operatorname{rank}(\widehat{\mathbf{H}})$, the matrix $\widehat{\mathbf{W}}$ defined in Eq. 13 satisfies: $\widehat{\mathbf{W}} \in \mathcal{W}$ and $\lVert\widehat{\mathbf{W}}\rVert_F \le \lVert\mathbf{W}\rVert_F$ for all $\mathbf{W} \in \mathcal{W}$.

² E.g., if the number of labeled examples, i.e. the height of $\widehat{\mathbf{H}}$ and $\mathbf{Y}$, is smaller than the width of $\widehat{\mathbf{H}}$.
Proof. Assume $\mathbf{Y} = \mathbf{y}$ is a column vector (the proof generalizes to a matrix $\mathbf{Y}$ by repeated column-wise application³). The SVD $\widehat{\mathbf{H}} = \mathbf{U}\mathbf{S}\mathbf{V}^\top$ recovers the solution:

$\widehat{\mathbf{w}} = \mathbf{V}\,\mathbf{S}^{-1}\,\mathbf{U}^\top\mathbf{y} = \widehat{\mathbf{H}}^\top\big(\widehat{\mathbf{H}}\widehat{\mathbf{H}}^\top\big)^{-1}\mathbf{y}.$  (15)

³ The minimizer of the Frobenius norm is composed, column-wise, of the minimizers for all columns of $\mathbf{Y}$.

The Gram matrix $\widehat{\mathbf{H}}\widehat{\mathbf{H}}^\top$ is nonsingular since the rows of $\widehat{\mathbf{H}}$ are linearly independent. To prove the claim, let us first verify that $\widehat{\mathbf{w}} \in \mathcal{W}$: indeed, $\widehat{\mathbf{H}}\widehat{\mathbf{w}} = \widehat{\mathbf{H}}\widehat{\mathbf{H}}^\top\big(\widehat{\mathbf{H}}\widehat{\mathbf{H}}^\top\big)^{-1}\mathbf{y} = \mathbf{y}$.

Let $\mathbf{w} \in \mathcal{W}$ be any solution. We must show that $\lVert\widehat{\mathbf{w}}\rVert_2 \le \lVert\mathbf{w}\rVert_2$. Since $\widehat{\mathbf{H}}\mathbf{w} = \mathbf{y}$ and $\widehat{\mathbf{H}}\widehat{\mathbf{w}} = \mathbf{y}$, their subtraction gives:

$\widehat{\mathbf{H}}\,(\mathbf{w} - \widehat{\mathbf{w}}) = \mathbf{0}.$  (16)

It follows that $(\mathbf{w} - \widehat{\mathbf{w}}) \perp \widehat{\mathbf{w}}$:
$\langle \mathbf{w} - \widehat{\mathbf{w}},\, \widehat{\mathbf{w}} \rangle = (\mathbf{w} - \widehat{\mathbf{w}})^\top\,\widehat{\mathbf{H}}^\top\big(\widehat{\mathbf{H}}\widehat{\mathbf{H}}^\top\big)^{-1}\mathbf{y} = \big(\widehat{\mathbf{H}}(\mathbf{w} - \widehat{\mathbf{w}})\big)^\top\big(\widehat{\mathbf{H}}\widehat{\mathbf{H}}^\top\big)^{-1}\mathbf{y} = 0.$

Finally, using the Pythagorean theorem (due to the orthogonality above):
$\lVert\mathbf{w}\rVert_2^2 = \lVert\widehat{\mathbf{w}}\rVert_2^2 + \lVert\mathbf{w} - \widehat{\mathbf{w}}\rVert_2^2 \ge \lVert\widehat{\mathbf{w}}\rVert_2^2.$ □
As a consequence, the solutions for classification models recovered by SVD follow a standard Gaussian (minimum-norm) prior, which may be regarded as a form of l2 regularization.
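As a sanity check of Theorem 1, the following small numerical experiment (ours, using NumPy's `pinv` in place of the truncated SVD) verifies that the pseudoinverse solution is feasible and has the smallest Frobenius norm among solutions of an underdetermined system:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 6))            # 3 labeled rows, width 6: underdetermined
Y = rng.normal(size=(3, 2))

W_svd = np.linalg.pinv(H) @ Y          # V S^{-1} U^T Y  (Eq. 13)

# Any other solution differs from W_svd by a null-space component of H.
N = rng.normal(size=(6, 2))
W_other = W_svd + (np.eye(6) - np.linalg.pinv(H) @ H) @ N

assert np.allclose(H @ W_svd, Y) and np.allclose(H @ W_other, Y)
assert np.linalg.norm(W_svd) <= np.linalg.norm(W_other) + 1e-9
```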
A.3.2 Computational Complexity and Approximation Error
Theorem 2
(Linear Time) Functional SVD (Alg. 1) trains our convexified GRL models in time linear in the graph size.
Proof of Theorem 2 for our two model families:
1. For rank-$k$ SVD over $\mathbf{M}$ (Eq. 6): let $T_f$ denote the cost of running the product function $f_{\mathbf{M}}$ once. The run-time to compute the SVD, as derived in Section 1.4.2 of (Halko et al., 2009), is:

$\mathcal{O}\big(k\,T_f + (n + m)\,k^2\big)$  (17)

Since $f_{\mathbf{M}}$ can be computed with $C$ (the context window size) multiplications with a sparse matrix containing $\mathcal{O}(|\mathcal{E}|)$ non-zero entries, running fSVD($f_{\mathbf{M}}$, $k$) costs:

$\mathcal{O}\big(C\,k\,|\mathcal{E}| + n\,k^2\big)$  (18)
2. For rank-$k$ SVD over $\widehat{\mathbf{H}}$ (Eq. 10): suppose the feature matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ contains $d$-dimensional rows. One can calculate $\widehat{\mathbf{A}}\mathbf{X}, \dots, \widehat{\mathbf{A}}^L\mathbf{X}$ with sparse multiplies in $\mathcal{O}(L\,d\,|\mathcal{E}|)$. Calculating $\widehat{\mathbf{H}}$ and running SVD (see Section 1.4.1 of Halko et al., 2009) on it costs a total of:

$\mathcal{O}\big(L\,d\,|\mathcal{E}| + n\,L\,d\,k\big)$  (19)
Therefore, the training time is linear in $n$ and $|\mathcal{E}|$.
Contrast this with the methods of WYS (Abu-El-Haija et al., 2018) and NetMF (Qiu et al., 2018), which require assembling a dense $n \times n$ matrix that needs at least $\mathcal{O}(n^2)$ time (and memory) to assemble and decompose. One wonders: how far are we from the optimal SVD with a linear-time algorithm? The following bounds the error.
Theorem 3
(Exponentially-decaying Approx. Error) The rank-$k$ randomized SVD algorithm of Halko et al. (2009) gives an approximation error that can be brought down, exponentially fast in the number of iterations, to no more than twice the approximation error of the optimal (true) rank-$k$ SVD.
Appendix B: Experimental Details
B.1 Datasets
We apply our functional SVD on popular datasets that can be trained using our simplified (i.e., convexified) models. Specifically either (1) semi-supervised node-classification datasets, where features are present, or (2) link-prediction datasets where features are absent. It is possible to convexify other setups, e.g., link prediction when node features are present, but we leave this as future work. We run experiments on seven graph datasets:
• Protein-Protein Interactions (PPI): a large graph where every node is a protein and an edge between two nodes indicates that the two proteins interact. The processed version of PPI was downloaded from Grover & Leskovec (2016).
• Three extremely popular citation networks: Cora, Citeseer, and Pubmed. Each node is an article and each (directed) edge implies that an article cites another. Additionally, each article is accompanied by a feature vector (containing NLP-extracted features of the article's abstract), as well as a label (article type).
• Two collaboration datasets: ca-AstroPh and ca-HepTh, where nodes are researchers and an edge between two nodes indicates that the researchers co-authored at least one article, in the areas of Astrophysics and High-Energy Physics, respectively.
• ego-Facebook: an ego-centered social network.
For the citation networks, we use the node features and labels. For all other datasets, no features are used during training or inference. For train/validation/test partitions: we used the splits of Yang et al. (2016) for Citeseer, Cora, and Pubmed; we used the splits of Abu-El-Haija et al. (2018) for PPI, Facebook, ca-AstroPh, and ca-HepTh; we used the splits of OGB (Hu et al., 2020) for ogbl-ddi. All datasets and statistics are summarized in Table 2. In §B.2 and §B.3, unless otherwise noted, we download the authors' source code from GitHub, modify⁴ it to record wall-clock run-time, and run it on an NVidia Tesla K80 GPU. Thankfully, the downloaded code has one script to run each dataset, or the hyperparameters are clearly stated in the source paper.

⁴ Modified files are in our code repo.
Table 2: Dataset statistics.

| Dataset | # Nodes | Node type | # Edges | Edge type | Source |
|---|---|---|---|---|---|
| PPI | 3,852 | proteins | 20,881 | chemical interactions | http://snap.stanford.edu/node2vec |
| ego-Facebook | 4,039 | users | 88,234 | friendships | http://snap.stanford.edu/data |
| ca-AstroPh | 17,903 | researchers | 197,031 | co-authorships | http://snap.stanford.edu/data |
| ca-HepTh | 8,638 | researchers | 24,827 | co-authorships | http://snap.stanford.edu/data |
| Cora | 2,708 | articles | 5,429 | citations | Planetoid (Yang et al., 2016) |
| Citeseer | 3,327 | articles | 4,732 | citations | Planetoid (Yang et al., 2016) |
| Pubmed | 19,717 | articles | 44,338 | citations | Planetoid (Yang et al., 2016) |
B.2 Semi-supervised Node Classification
We consider a transductive setting where a graph is entirely visible (all nodes and edges). Additionally, some of the nodes are labeled. The goal is to recover the labels of unlabeled nodes. All nodes have feature vectors.
Baselines: We download the code of GAT (Veličković et al., 2018), MixHop (Abu-El-Haija et al., 2019), and GCNII (Chen et al., 2020) and re-run them with slight modifications to record training time. For the baselines Planetoid (Yang et al., 2016) and GCN (Kipf & Welling, 2017), however, we copy the reported results from the GCN paper (Kipf & Welling, 2017).
In these experiments, to train our method, we run our functional SVD twice per graph. We take the feature matrix $\mathbf{X}$ bundled with the dataset and concatenate to it the two matrices $\mathbf{L}$ and $\mathbf{R}$, calculated per Equations 7-8: the calculation itself invokes our functional SVD (the first time) on $\mathbf{M}$ (Eq. 6) with rank $k$. Hyperparameters of $\mathbf{M}$ are $\lambda$ (the negative coefficient) and $C$ (the context window size). After concatenating $\mathbf{L}$ and $\mathbf{R}$ into $\mathbf{X}$, we PCA the resulting matrix to 1000 dimensions, which forms our new $\mathbf{X}$. Finally, we express our model as the linear $L$-layer message passing network (Eq. 10) and learn its parameters via rank-$k$ SVD on $\widehat{\mathbf{H}}$ (the second time), as explained in §3.2. We use the validation partition to tune $\lambda$, $C$, $L$, and $k$.
Table 1 (left) summarizes the performance of our approach (last row) against the aforementioned baselines, showing both test accuracy and training time. While our method is competitive with the state-of-the-art, it trains much faster.
B.3 ROC-AUC Link Prediction
Given a partial graph, only a subset of the edges is observed, and the goal is to recover the unobserved edges. This has applications in recommender systems: when a user expresses interest in products, the system wants to predict other products the user is interested in. The task is usually set up by partitioning the edges of the input graph into train and test edges. Further, it is common to sample negative test edges, e.g. uniformly from the graph complement. Lastly, a GRL method for link prediction is trained on the train-edge partition, then asked to score the test edges against the negative test edges. The quality of the scoring can be quantified by a ranking metric, e.g., ROC-AUC.
Baselines: We download the code of WYS (Abu-El-Haija et al., 2018) and update it to run on TensorFlow 2.0. We download the code of Qiu et al. (2018) and denote their methods as NetMF and NetMF (sampled), where the first computes the complete matrix before the SVD decomposition and the second samples it entry-wise; the latter is faster for larger graphs but sacrifices estimation error and performance. For node2vec (n2v), we use its PyG implementation (Fey & Lenssen, 2019).
Table 1 (right) summarizes the test ROC-AUC results. For our method (the last two rows), we call our functional SVD (Alg. 1) and pass it $f_{\mathbf{M}}$ as defined in Eq. 14. Embeddings are set to the SVD basis (as in §3.1), and the edge score of a node pair is the dot-product of their embeddings. The last row of the table shows results with SVD rank $k = 100$. Lastly, we set the context window hyperparameter (a.k.a. the length of walk) as follows. For WYS, we trained with their default context (as WYS learns the context), but for all others (NetMF, n2v, ours) we used a context window of length $C = 5$ for the datasets Facebook and PPI (for us, this sets the staircase vector $\mathbf{c}$ accordingly) and $C = 20$ for AstroPh and HepTh.
B.4 Sensitivity Analysis
While in §B.2 we tune the number of layers ($L$) using performance on the validation partition, in this section we show the impact of varying $L$ on test accuracy. According to the summary in Figure 2 (left), the accuracy of classifying a node improves when incorporating information from nodes further away. We see little gain beyond a small $L$. Note that $L = 0$ corresponds to ignoring the adjacency matrix altogether. Here, we fixed the remaining hyperparameters and averaged 5 runs. The (tiny) error bars show the standard deviation.
Further, while in §B.3 we run SVD on $\mathbf{M}$ with a fixed rank (e.g. $k = 100$), Figure 2 (right) shows test accuracy while sweeping $k$. In general, increasing the rank improves estimation accuracy and test performance. However, if $k$ is larger than the inherent dimensionality of the data, this could cause overfitting (through perfect memorization of the training edges). The norm-regularization note (§A.3.1) applies only to pseudoinversion, i.e. our classification models.