
Fast Graph Learning
with Unique Optimal Solutions

Sami Abu-El-Haija, Valentino Crespi, Greg Ver Steeg, Aram Galstyan
USC Information Sciences Institute
{haija, vcrespi, gregv, galstyan}@isi.edu
Abstract

We consider two popular Graph Representation Learning (GRL) methods: message passing for node classification and network embedding for link prediction. For each, we pick a popular model that we (i) linearize and (ii) switch its training objective to Frobenius-norm error minimization. These simplifications cast training as finding the optimal parameters in closed form. We program in TensorFlow a functional form of Truncated Singular Value Decomposition (SVD), such that we can decompose a dense matrix $\mathbf{M}$ without explicitly computing $\mathbf{M}$. We achieve competitive performance on popular GRL tasks while providing orders-of-magnitude speedup. We open-source our code at http://github.com/samihaija/tf-fsvd

1 Introduction

Many recent graph representation learning (GRL) models are creative and theoretically justified (Kipf & Welling, 2017; Hamilton et al., 2017; Veličković et al., 2018; Qiu et al., 2018; Xu et al., 2019; Abu-El-Haija et al., 2019; Chen et al., 2020). Unfortunately, they contain hyperparameters that must be tuned (such as the learning rate, regularization coefficient, and the depth and width of the network), and training takes a long time (e.g., minutes) even on smaller datasets.

We circumvent these weaknesses. We (i) quickly train (ii) competitive GRL models by posing convex objectives and estimating optimal solutions in closed form, hence (iii) relieving practitioners from hyperparameter tuning and convergence checks. Our goals evoke a classical learning technique that has been used for decades: Singular Value Decomposition (SVD).

SVD periodically appears within powerful yet simple methods that compete with the state of the art. The common practice is to design a matrix $\mathbf{M}$ such that its decomposition (via SVD) provides an estimate of a model learned under a given objective. For instance, Levy & Goldberg (2014) show that learning NLP skipgram models such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) can be approximated by the SVD of a shifted positive pointwise mutual information matrix.

In GRL, Chen et al. (2017); Qiu et al. (2018); Abu-El-Haija et al. (2018) have approximated the methods of DeepWalk (Perozzi et al., 2014) and node2vec (Grover & Leskovec, 2016) via decomposition of some matrix $\mathbf{M}$. However, their decompositions require $\mathbf{M}$ to be either (a) exactly calculated or (b) sampled entry-wise; (a) is unnecessarily expensive for real-world large networks (due to the Small World Phenomenon (Travers & Milgram, 1969)) and (b) incurs unnecessary estimation errors. On the other hand, known algorithms in matrix theory can decompose any matrix $\mathbf{M}$ without explicitly knowing $\mathbf{M}$: it is sufficient to provide a function $f_{\mathbf{M}}(\cdot)=\langle\mathbf{M},\cdot\rangle$ that can multiply $\mathbf{M}$ with arbitrary vectors (§4). We argue that if popular frameworks (e.g., TensorFlow) implemented a functional SVD that accepts $f_{\mathbf{M}}(\cdot)$ rather than $\mathbf{M}$, modern practitioners may find it useful.

We review powerful GRL methods (§2.2) that we convexify (§3), allowing us to use (randomized) SVD to obtain (approximate) optimum solutions. Our contributions are:

1. We implement a functional SVD (§4) of the randomized algorithm of Halko et al. (2009).

2. We approximate embedding and message passing methods via SVD (§3), showing performance competitive with the state of the art, yet much faster to train (§5).

3. We show that learning is fast and that the approximation error can be made arbitrarily small (§A.3.2).

2 Preliminaries

2.1 Singular Value Decomposition (SVD)

Truncated (top-$k$) Singular Value Decomposition (SVD) approximates an input matrix $\mathbf{M}\in\mathbb{R}^{r\times c}$ with a low-rank estimate $\widetilde{\mathbf{M}}$ that minimizes the Frobenius norm of the error:

$$\min_{\widetilde{\mathbf{M}}}\|\mathbf{M}-\widetilde{\mathbf{M}}\|_{F}^{2}, \quad \textrm{subject to: } \operatorname{rank}(\widetilde{\mathbf{M}})\leq k, \tag{1}$$

while parameterizing $\widetilde{\mathbf{M}}$ as $\widetilde{\mathbf{M}}=\mathbf{U}_{k}\mathbf{S}_{k}\mathbf{V}_{k}^{\top}$, subject to the columns of $\mathbf{U}_{k},\mathbf{V}_{k}$ being orthonormal. The minimizer of the Frobenius norm $\|\cdot\|_{F}$ recovers the top-$k$ singular values (stored along the diagonal matrix $\mathbf{S}_{k}\in\mathbb{R}^{k\times k}$) together with their corresponding left- and right-singular vectors, stored respectively as columns of the matrices $\mathbf{U}_{k}\in\mathbb{R}^{r\times k}$ and $\mathbf{V}_{k}\in\mathbb{R}^{c\times k}$ (a.k.a. the singular bases).

SVD has many applications; we utilize two: (1) it is used for embedding and matrix completion; and (2) it can estimate the pseudoinverse of a matrix $\mathbf{M}$ (a.k.a. the Moore-Penrose inverse), as:

$$\mathbf{M}^{\dagger}\triangleq\mathbf{M}^{\top}\left(\mathbf{M}\mathbf{M}^{\top}\right)^{-1}\approx\mathbf{V}_{k}\mathbf{S}_{k}^{-1}\mathbf{U}_{k}^{\top}, \tag{2}$$

where the inverse $\mathbf{S}_{k}^{-1}$ is calculated by reciprocating the entries of the diagonal matrix $\mathbf{S}_{k}$. The $\approx$ becomes $=$ when $k\geq\operatorname{rank}(\mathbf{M})$ (Eckart & Young, 1936; Golub & Loan, 1996).
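As a concrete illustration (not from the paper's released code), the rank-$k$ pseudoinverse of Eq. 2 can be computed in a few lines of TensorFlow; the names below are ours:

    import tensorflow as tf

    def pinv_via_truncated_svd(m, k):
      """Rank-k Moore-Penrose pseudoinverse estimate: V_k S_k^{-1} U_k^T (Eq. 2)."""
      s, u, v = tf.linalg.svd(m)                 # tf.linalg.svd returns (s, u, v).
      s_k, u_k, v_k = s[:k], u[:, :k], v[:, :k]  # Keep the top-k singular triplets.
      return v_k @ tf.linalg.diag(1.0 / s_k) @ tf.transpose(u_k)

    m = tf.random.normal((4, 6))                 # Wide matrix with full row rank (a.s.).
    m_pinv = pinv_via_truncated_svd(m, k=4)      # k >= rank(M) makes Eq. 2 exact.
    print(tf.reduce_max(tf.abs(m @ m_pinv - tf.eye(4))))  # ~0: M M^dagger = I.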

2.2 Graph Representation Learning (GRL)

(i) Many message passing models can be written as:

$$\mathbf{H}=\sigma_{L}\Bigg(g_{L}(\mathbf{A})\dots\overbrace{\sigma_{2}\bigg(g_{2}(\mathbf{A})\underbrace{\sigma_{1}\left(g_{1}(\mathbf{A})\mathbf{X}\mathbf{W}_{1}\right)}_{\textrm{output of layer 1}}\mathbf{W}_{2}\bigg)}^{\textrm{output of layer 2}}\dots\mathbf{W}_{L}\Bigg) \tag{3}$$

where $L$ is the number of layers, matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$ contains $d$ features per node, the $\mathbf{W}$'s are trainable parameters, the $\sigma$'s denote activations (e.g., ReLU), and $g$ is some (possibly trainable) transformation of the adjacency matrix. GCN (Kipf & Welling, 2017) sets $g$ to symmetric normalization per the renormalization trick, GAT (Veličković et al., 2018) sets $g(\mathbf{A})=\mathbf{A}\circ\textrm{MultiHeadedAttention}$, and GIN (Xu et al., 2019) sets $g(\mathbf{A})=\mathbf{A}+(1+\epsilon)\mathbf{I}$ with identity $\mathbf{I}$ and $\epsilon>0$. For node classification, it is common to set $\sigma_{L}=\textrm{softmax}$ (applied row-wise), specify the size of $\mathbf{W}_{L}$ such that $\mathbf{H}\in\mathbb{R}^{n\times y}$ where $y$ is the number of classes, and optimize the cross-entropy objective:

$$\min_{\{\mathbf{W}_{j}\}_{j=1}^{L}}-\mathbf{Y}\circ\log\mathbf{H}-(1-\mathbf{Y})\circ\log(1-\mathbf{H}), \tag{4}$$

where $\mathbf{Y}$ is a binary matrix with one-hot rows indicating node labels and $\circ$ is the Hadamard product. In semi-supervised node classification settings where not all nodes are labeled, before measuring the objective, only the rows of $\mathbf{Y}$ and $\mathbf{H}$ that correspond to labeled nodes are kept.

(ii) Network embedding methods map nodes onto a $z$-dimensional vector space $\mathbf{Z}\in\mathbb{R}^{n\times z}$. Modern approaches train skipgram models (e.g., word2vec (Mikolov et al., 2013)) on sampled random walks. It has been shown that these skipgram network embedding methods, including DeepWalk (Perozzi et al., 2014) and node2vec (Grover & Leskovec, 2016), whose learning process consists of walk sampling followed by positional embedding, can be approximated as a matrix decomposition (Chen et al., 2017; Abu-El-Haija et al., 2018; Qiu et al., 2018). We point the curious reader to the listed papers for how the decompositions were derived, but show here the derivation of Abu-El-Haija et al. (WYS, 2018), as it performs well in our experiments:

$$\min_{\mathbf{Z}}-\mathbf{M}^{\textrm{(WYS)}}\circ\log h(\mathbf{Z})-(1-\mathbf{A})\circ\log(1-h(\mathbf{Z})), \tag{5}$$

where $h(\mathbf{Z})=h([\,\mathbf{L}\mid\mathbf{R}\,])=(1+\exp(-\mathbf{L}\mathbf{R}^{\top}))^{-1}$, i.e., $\mathbf{Z}$ concatenates $\mathbf{L},\mathbf{R}\in\mathbb{R}^{n\times\frac{z}{2}}$ and $h$ is the logistic of their cross-correlation (pairwise dot-products). $\mathbf{M}^{\textrm{(WYS)}}=\sum_{i=1}^{C}(\mathbf{D}^{-1}\mathbf{A})^{i}\mathbf{c}_{i}=\sum_{i=1}^{C}\mathcal{T}^{i}\mathbf{c}_{i}$, where $\mathcal{T}$ is the transition matrix, $\mathbf{D}=\operatorname{diag}(\mathbf{1}^{\top}\mathbf{A})$ is the diagonal degree matrix, and we fix vector $\mathbf{c}$ to a staircase: $\mathbf{c}_{i}=C-i+1$. For instance, $\mathbf{c}=[4,3,2,1]$ for context size $C=4$.

3 Our Proposed Convex Objectives

3.1 Network Embedding Model

The objective in Eq. 5 learns node embeddings $\mathbf{Z}=[\,\mathbf{L}\mid\mathbf{R}\,]\in\mathbb{R}^{n\times z}$ using cross-entropy. The terms, namely the model output (outer product, $\sigma(\mathbf{L}\mathbf{R}^{\top})$), the negatives (non-edges, $1-\mathbf{A}$), and the positives (expected number of node-pair co-visits, $\mathbf{M}$), are all dense matrices with $\mathcal{O}(n^{2})\gg m$ nonzero entries. For instance, even a relatively small social network with $n$=100,000 and average degree 100 (i.e., $m=100n$) would produce an $\mathbf{M}$ occupying $\approx$40GB of memory, whereas the entire learning can be done with $\approx$40MB of memory using functional SVD (§4). We start by designing a matrix $\widehat{\mathbf{M}}^{\textrm{(WYS)}}$ incorporating the positive and negative information ($\mathbf{M}^{\textrm{(WYS)}}$ and $1-\mathbf{A}$) as:

$$\widehat{\mathbf{M}}^{\textrm{(WYS)}}=\mathbf{M}^{\textrm{(WYS)}}-\lambda(1-\mathbf{A})=\sum_{i}\mathcal{T}^{i}\mathbf{c}_{i}-\lambda(1-\mathbf{A}), \tag{6}$$

with coefficient $\lambda\geq 0$ weighing the negative samples. We use $\widehat{\ \cdot\ }$ to denote our convexification.

Learning: We can directly set $\mathbf{Z}$ from the SVD basis $(\mathbf{U},\mathbf{S},\mathbf{V})$. Matrix $\widehat{\mathbf{M}}^{\textrm{(WYS)}}$ has a large entry $\widehat{\mathbf{M}}_{uv}$ when nodes $(u,v)$ are well-connected (co-visited many times during random walks) and a small entry if they are non-edges. SVD provides a rank-$k$ estimator of $\widehat{\mathbf{M}}$ as $\widehat{\mathbf{L}}\widehat{\mathbf{R}}^{\top}\approx\widehat{\mathbf{M}}$, i.e., with minimum Frobenius norm of the error. We can set the network embedding model parameters $\mathbf{Z}=[\,\widehat{\mathbf{L}}\mid\widehat{\mathbf{R}}\,]$ as:

$$\textrm{SVD}(\widehat{\mathbf{M}},k)\leftarrow\operatorname*{arg\,min}_{\widehat{\mathbf{U}}_{k},\widehat{\mathbf{S}}_{k},\widehat{\mathbf{V}}_{k}}\left\|\widehat{\mathbf{M}}-\widehat{\mathbf{U}}_{k}\widehat{\mathbf{S}}_{k}\widehat{\mathbf{V}}_{k}^{\top}\right\|_{F}^{2}=\operatorname*{arg\,min}_{\widehat{\mathbf{U}}_{k},\widehat{\mathbf{S}}_{k},\widehat{\mathbf{V}}_{k}}\Bigg\|\widehat{\mathbf{M}}-\underbrace{\left(\widehat{\mathbf{U}}_{k}\widehat{\mathbf{S}}_{k}^{\frac{1}{2}}\right)}_{\triangleq\widehat{\mathbf{L}}}\underbrace{\left(\widehat{\mathbf{S}}_{k}^{\frac{1}{2}}\widehat{\mathbf{V}}_{k}^{\top}\right)}_{\triangleq\widehat{\mathbf{R}}^{\top}}\Bigg\|_{F}^{2} \tag{7}$$

Learning follows as:
$$\widehat{\mathbf{U}}_{k},\widehat{\mathbf{S}}_{k},\widehat{\mathbf{V}}_{k}\leftarrow\textrm{SVD}(\widehat{\mathbf{M}}^{\textrm{(WYS)}},k);\qquad\widehat{\mathbf{L}}\leftarrow\widehat{\mathbf{U}}_{k}\widehat{\mathbf{S}}_{k}^{\frac{1}{2}};\qquad\widehat{\mathbf{R}}\leftarrow\widehat{\mathbf{V}}_{k}\widehat{\mathbf{S}}_{k}^{\frac{1}{2}} \tag{8}$$

Inference: Given query edges $Q=\{(u_{i},v_{i})\}_{i}$, one can compute the model output:
$$\mathbf{H}_{Q}=\widehat{\mathbf{R}}_{\{v\}_{i}}^{\top}\widehat{\mathbf{L}}_{\{u\}_{i}}, \tag{9}$$

where the (RHS) set-subscript denotes gathering rows (a.k.a. advanced indexing), and $\mathbf{H}_{Q}\in\mathbb{R}^{|Q|}$.
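As a minimal sketch (ours, not the released implementation), given the rank-$k$ factors $(\mathbf{U}_{k},\mathbf{S}_{k},\mathbf{V}_{k})$ returned by an SVD routine (e.g., the functional SVD of §4), Eqs. 8-9 reduce to a few lines:

    import tensorflow as tf

    def embeddings_from_svd(u, sigma, v):
      """Eq. 8: L = U_k S_k^{1/2} and R = V_k S_k^{1/2}, from an SVD of M^(WYS)."""
      sqrt_s = tf.sqrt(sigma)          # Shape (k,).
      return u * sqrt_s, v * sqrt_s    # Broadcasting scales column j by sqrt(s_j).

    def score_edges(l, r, query_u, query_v):
      """Eq. 9: score each query pair (u_i, v_i) by a row-wise dot product."""
      return tf.reduce_sum(tf.gather(l, query_u) * tf.gather(r, query_v), axis=-1)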

3.2 Messaging Passing Models

We can linearize message passing models (Eq. 3) by assuming all $\sigma$'s are the identity ($\sigma(\cdot)=\cdot$). Let $g=g_{1}=g_{2}=\dots$; specifically, let $g(\mathbf{A})=(\mathbf{D}+\mathbf{I})^{-1/2}(\mathbf{A}+\mathbf{I})(\mathbf{D}+\mathbf{I})^{-1/2}$ per the renormalization trick of Kipf & Welling (2017). A linear $L$-layer message passing network can then be written as:

$$\widehat{\mathbf{H}}=\underbrace{\left[\,\mathbf{X}\;\big|\;g(\mathbf{A})\mathbf{X}\;\big|\;g(\mathbf{A})^{2}\mathbf{X}\;\big|\;\dots\;\big|\;g(\mathbf{A})^{L}\mathbf{X}\,\right]}_{\triangleq\ \widehat{\mathbf{M}}^{\textrm{(JKN)}}}\widehat{\mathbf{W}}. \tag{10}$$

Concatenation of all layers was proposed in Jumping Knowledge Networks (JKN, Xu et al., 2018).

Learning: We optimize $\widehat{\mathbf{W}}$ with:
$$\min_{\widehat{\mathbf{W}}}\|\widehat{\mathbf{H}}-\mathbf{Y}\|_{F}^{2}=\min_{\widehat{\mathbf{W}}}\|\widehat{\mathbf{M}}^{\textrm{(JKN)}}\widehat{\mathbf{W}}-\mathbf{Y}\|_{F}^{2} \tag{11}$$

The loss in Equation 11 can perform well on classification tasks; according to Hui & Belkin (2021), it performs as well as the cross-entropy loss defined in Equation 4. Taking $\nabla_{\widehat{\mathbf{W}}}$ of Eq. 11 and setting it to zero

yields the minimizer:
$$\widehat{\mathbf{W}}^{*}\triangleq\operatorname*{arg\,min}_{\widehat{\mathbf{W}}}\|\widehat{\mathbf{M}}^{\textrm{(JKN)}}\widehat{\mathbf{W}}-\mathbf{Y}\|_{F}^{2}=\left(\widehat{\mathbf{M}}^{\textrm{(JKN)}}\right)^{\dagger}\mathbf{Y}. \tag{12}$$

Rank-$k$ SVD can estimate $\left(\widehat{\mathbf{M}}^{\textrm{(JKN)}}\right)^{\dagger}$ and hence $\widehat{\mathbf{W}}^{*}$ as:

$$\widehat{\mathbf{U}}_{k},\widehat{\mathbf{S}}_{k},\widehat{\mathbf{V}}_{k}\leftarrow\textrm{SVD}(\widehat{\mathbf{M}}^{\textrm{(JKN)}},k);\qquad\widehat{\mathbf{W}}^{*}\leftarrow\widehat{\mathbf{V}}_{k}\left(\widehat{\mathbf{S}}_{k}^{-1}\left(\widehat{\mathbf{U}}_{k}^{\top}\mathbf{Y}\right)\right). \tag{13}$$

The order of multiplications in Eq. 13 is chosen for efficiency. Further, when only a subset of nodes $\mathcal{V}=\{v\}_{v}$ have labels, the right-most multiplication of Eq. 13 can be restricted to the labeled nodes. Let $\mathbf{Y}_{\mathcal{V}}$ be the matrix of $|\mathcal{V}|$ rows selected from $\mathbf{Y}$ according to the elements of $\mathcal{V}$. The right-most multiplication of Eq. 13 can then be modified to $\widehat{\mathbf{U}}_{k_{\mathcal{V}}}^{\top}\mathbf{Y}_{\mathcal{V}}$, as sketched below.
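For a small graph, the closed-form learning of Eqs. 10-13 can be sketched by materializing $\widehat{\mathbf{M}}^{\textrm{(JKN)}}$ explicitly (which the functional SVD of §4 avoids); the function below is illustrative, not the paper's released implementation:

    import tensorflow as tf

    def fit_linear_jkn(a_norm, x, y_labeled, labeled_idx, depth, k):
      """a_norm: tf.sparse.SparseTensor for g(A); x: (n, d) features;
      y_labeled: (|V|, y) one-hot labels for rows labeled_idx; k: SVD rank."""
      blocks, h = [x], x
      for _ in range(depth):                       # Build [X, gX, ..., g^L X].
        h = tf.sparse.sparse_dense_matmul(a_norm, h)
        blocks.append(h)
      m_jkn = tf.concat(blocks, axis=1)            # Eq. 10: M^(JKN).
      s, u, v = tf.linalg.svd(m_jkn)
      s_k, u_k, v_k = s[:k], u[:, :k], v[:, :k]    # Rank-k truncation.
      u_k_labeled = tf.gather(u_k, labeled_idx)    # Restrict to labeled rows.
      w_star = v_k @ (tf.linalg.diag(1.0 / s_k) @ (tf.transpose(u_k_labeled) @ y_labeled))
      return m_jkn @ w_star                        # Predictions H-hat for all nodes.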

4 Functional Singular Value Decomposition

We do not have to explicitly calculate the matrices $\widehat{\mathbf{M}}$. Rather, we only need to implement product functions $f_{\widehat{\mathbf{M}}}(\mathbf{v})=\langle\widehat{\mathbf{M}},\mathbf{v}\rangle$ that can multiply $\widehat{\mathbf{M}}$ with an arbitrary (appropriately-sized) vector $\mathbf{v}$. We implement a (TensorFlow) functional version of the randomized SVD algorithm of Halko et al. (2009) that accepts $f_{\widehat{\mathbf{M}}}$ rather than $\widehat{\mathbf{M}}$. We show that it can train our models quickly and with arbitrarily small approximation error (in time linear in the graph size; in practice, with fewer than 10 passes over the data), and that it can yield l2-regularized solutions for classification (see Appendix). We now need the (straightforward) product functions $f_{\widehat{\mathbf{M}}^{\textrm{(WYS)}}}$ and $f_{\widehat{\mathbf{M}}^{\textrm{(JKN)}}}$; we leave the second outside this writing. For the first, the non-edges term $(1-\mathbf{A})$ can be re-written by explicit broadcasting as $(\mathbf{1}\mathbf{1}^{\top}-\mathbf{A})$, giving

$$f_{\widehat{\mathbf{M}}^{\textrm{(WYS)}}}(\mathbf{v})=\sum_{i}\underbrace{\mathcal{T}^{i}\mathbf{v}}_{\mathcal{O}(im)}\mathbf{c}_{i}-\lambda\mathbf{1}\underbrace{(\mathbf{1}^{\top}\mathbf{v})}_{\mathcal{O}(n)}+\lambda\underbrace{\mathbf{A}\mathbf{v}}_{\mathcal{O}(m)}. \tag{14}$$

All matrix-vector products can be efficiently computed when 𝐀\mathbf{A} is sparse.
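As an illustration (ours; the released tf-fsvd code may differ), Eq. 14 can be coded as a TensorFlow closure over sparse tensors:

    import tensorflow as tf

    def make_f_wys(adj, transition, context_weights, neg_coef):
      """Sketch of Eq. 14: multiply M-hat^(WYS) with v without forming it.

      adj, transition: tf.sparse.SparseTensor versions of A and T = D^{-1} A.
      context_weights: the staircase vector c (length C); neg_coef: lambda.
      """
      def f(v):                                    # v has shape (n, b).
        out = tf.zeros_like(v)
        t_i_v = v
        for c_i in context_weights:                # sum_i c_i * T^i v, O(i m) each.
          t_i_v = tf.sparse.sparse_dense_matmul(transition, t_i_v)
          out += c_i * t_i_v
        ones_v = tf.reduce_sum(v, axis=0, keepdims=True)              # 1^T v, O(n).
        out -= neg_coef * tf.ones([tf.shape(v)[0], 1]) * ones_v       # -lambda 1 (1^T v).
        out += neg_coef * tf.sparse.sparse_dense_matmul(adj, v)       # +lambda A v, O(m).
        return out
      return f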

5 Experimental Results (details in Appendix)

Figure 1: ROC-AUC versus train time of methods on datasets. Each dataset has a distinct shape, with shape size proportional to graph size. Each method uses a different color. Our methods are in blue (dark uses SVD rank $k=32$, light uses $k=100$, trading estimation accuracy for train time). Ideal methods should be placed in the top-left corner (i.e., higher test ROC-AUC and faster training).
Figure 2: Sensitivity Analysis. Left: Test Accuracy vs. depth of the model defined by $\widehat{\mathbf{M}}^{\textrm{(JKN)}}$. Right: Test ROC-AUC vs. rank of the SVD on $\widehat{\mathbf{M}}^{\textrm{(WYS)}}$.
Table 1: Test Performance. Left: accuracy (training time) for semi-supervised node classification, over citation datasets. Right: ROC-AUC for link prediction when embedding with $z$=64=$2k$.
Method                                        Cora           Citeseer       Pubmed
Planetoid                                     75.7 (13s)     64.7 (26s)     77.2 (25s)
GCN                                           81.5 (4s)      70.3 (7s)      79.0 (83s)
GAT                                           83.2 (1m23s)   72.4 (3m27s)   77.7 (5m33s)
MixHop                                        81.9 (26s)     71.4 (31s)     80.8 (1m16s)
GCNII                                         85.5 (2m29s)   73.4 (2m55s)   80.3 (1m42s)
$f_{\widehat{\mathbf{M}}^{\textrm{(JKN)}}}$   82.4 (0.28s)   72.2 (0.13s)   79.7 (0.14s)

Method                                        FB      AstroPh   HepTh   PPI
WYS                                           99.4    97.9      93.6    89.8
n2v                                           99.0    97.8      92.3    83.1
NetMF                                         97.6    96.8      90.5    73.6
$\widetilde{\textrm{NetMF}}$                  97.0    81.9      85.0    63.6
$f_{\widehat{\mathbf{M}}^{\textrm{(WYS)}}}$   98.7    92.1      89.2    87.9
($k$=100)                                     98.7    96.0      90.5    86.2

Acknowledgments

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) and the Army Contracting Command-Aberdeen Proving Grounds (ACC-APG) under Contract Number W911NF-18-C-0020.

References

  • Abu-El-Haija et al. (2018) Sami Abu-El-Haija, Bryan Perozzi, Rami Al-Rfou, and Alexander A Alemi. Watch your step: Learning node embeddings via graph attention. In Advances in Neural Information Processing Systems, NeurIPS, 2018.
  • Abu-El-Haija et al. (2019) Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Hrayr Harutyunyan, Nazanin Alipourfard, Kristina Lerman, Greg Ver Steeg, and Aram Galstyan. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In International Conference on Machine Learning, ICML, 2019.
  • Arnoldi (1951) W. E. Arnoldi. The principle of minimized iterations in the solution of the matrix eigenvalue problem. In Quarterly of Applied Mathematics, pp.  17–29, 1951.
  • Chen et al. (2020) Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In International Conference on Machine Learning, ICML, 2020.
  • Chen et al. (2017) Siheng Chen, Sufeng Niu, Leman Akoglu, Jelena Kovacevic, and Christos Faloutsos. Fast, warped graph embedding: Unifying framework and one-click algorithm. arxiv:1702.05764, 2017.
  • Eckart & Young (1936) C. Eckart and G. Young. The approximation of one matrix by another of lower rank. In Psychometrika, pp.  211–218, 1936.
  • Fey & Lenssen (2019) Matthias Fey and Jan E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
  • Golub & Loan (1996) Gene H. Golub and Charles F. Van Loan. Matrix Computations, pp. 257–258. Johns Hopkins University Press, Baltimore, 3rd edition, 1996.
  • Grover & Leskovec (2016) Aditya Grover and Jure Leskovec. Node2vec: Scalable feature learning for networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, pp.  855–864, 2016.
  • Halko et al. (2009) N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. In ACM Technical Reports, 2009.
  • Hamilton et al. (2017) William Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, NeurIPS, 2017.
  • Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In arXiv, 2020.
  • Hui & Belkin (2021) Like Hui and Mikhail Belkin. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. In International Conference on Learning Representations, 2021.
  • Kipf & Welling (2017) T. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, ICLR, 2017.
  • Lanczos (1950) C Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. In Journal of Research of the National Bureau of Standards, pp.  255–282, 1950.
  • Levy & Goldberg (2014) Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, NeurIPS, pp.  2177–2185, 2014.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, NeurIPS, pp.  3111–3119, 2013.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing, EMNLP, pp.  1532–1543, 2014.
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In ACM SIGKDD international conference on Knowledge discovery and data mining, KDD, pp.  701–710, 2014.
  • Qiu et al. (2018) Jiezhong Qiu, Yuxiao Dong, Hao Ma, Jian Li, Kuansan Wang, and Jie Tang. Network embedding as matrix factorization: Unifying deepwalk, line, pte, and node2vec. In International Conference on Web Search and Data Mining, WSDM, pp.  459–467, 2018.
  • Travers & Milgram (1969) Jeffrey Travers and Stanley Milgram. An experimental study of the small world problem. In Sociometry, pp.  425–443, 1969.
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, ICLR, 2018.
  • Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. Representation learning on graphs with jumping knowledge networks. In International Conference on Machine Learning, ICML, pp. 5453–5462, 2018.
  • Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, ICLR, 2019.
  • Yang et al. (2016) Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In International Conference on Machine Learning, 2016.

Appendix A Appendix: Implementation of Functional SVD

A.1 Calculating the SVD

SVD of $\mathbf{M}$ yields its left and right singular orthonormal (basis) vectors, respectively, in the columns of $\mathbf{U}$ and $\mathbf{V}$. Since $\mathbf{U}$ and $\mathbf{V}$ are, respectively, the eigenvectors of $\mathbf{M}\mathbf{M}^{\top}$ and $\mathbf{M}^{\top}\mathbf{M}$, perhaps the most intuitive algorithms for SVD are variants of the power iteration, including the Arnoldi iteration (Arnoldi, 1951) and the Lanczos algorithm (Lanczos, 1950). In practice, randomized algorithms for estimating the SVD run faster than these variants, including the algorithm of Halko et al. (2009), which is implemented in scikit-learn. None of these methods require individual access to $\mathbf{M}$'s entries; rather, they require two operations: the ability to multiply any vector with $\mathbf{M}$ and with $\mathbf{M}^{\top}$. Therefore, it is only a practical gap that we fill in this section: we open-source a TensorFlow implementation that accepts product and transpose operators.

A.2 TensorFlow Implementation

Since we do not explicitly calculate the $\widehat{\mathbf{M}}$ matrices displayed in Equations 6 and 10, as doing so consumes quadratic memory $\mathcal{O}(n^{2})$, we implement a functional form of the celebrated randomized SVD algorithm of Halko et al. (2009). To run our Algorithm 1, one must specify $k\in\mathbb{N}_{+}$ (rank of decomposition), as well as functions $f$, $t$, $s$ that the caller promises operate as follows:

1. Product function $f$ that exactly computes $f(\mathbf{v})=\langle\mathbf{M},\mathbf{v}\rangle$ for any $\mathbf{v}\in\mathbb{R}^{c}$ (recall: $\mathbf{M}\in\mathbb{R}^{r\times c}$).

2. Transpose function $t$ such that, for all $\mathbf{v}\in\mathbb{R}^{r}$, $(t\circ f)(\mathbf{v})=\langle\mathbf{M}^{\top},\mathbf{v}\rangle$.¹

3. Shape (constant) function $s$ that knows and returns $(r,c)$. Once transposed, it should return $(c,r)$.

¹ An alternative to $t$ can be a left-multiply function $l(\mathbf{u})=\langle\mathbf{u},\mathbf{M}\rangle$; however, in practice, TensorFlow is optimized for CSR matrices and computationally favors sparse-times-dense.

Algorithm 1 Functional Randomized SVD, following the prototype of Halko et al. (2009)
1: input: rank $k\in\mathbb{N}_{+}$, product fn $f:\mathbb{R}^{c}\rightarrow\mathbb{R}^{r}$, shape fn $s$, transpose fn $t:(\mathbb{R}^{c}\rightarrow\mathbb{R}^{r})\rightarrow(\mathbb{R}^{r}\rightarrow\mathbb{R}^{c})$
2: procedure fSVD($f,t,s,k$)
3:   $(r,c)\leftarrow s()$
4:   $Q\sim\mathcal{N}(0,1)^{c\times 2k}$   ▷ IID Gaussian. Shape: $(c\times 2k)$
5:   for $i\leftarrow 1$ to iterations do
6:     $Q,\_\leftarrow$ tf.linalg.qr($f(Q)$)   ▷ $(r\times 2k)$
7:     $Q,\_\leftarrow$ tf.linalg.qr($(t\circ f)(Q)$)   ▷ $(c\times 2k)$
8:   $Q,\_\leftarrow$ tf.linalg.qr($f(Q)$)   ▷ $(r\times 2k)$
9:   $B\leftarrow((t\circ f)(Q))^{\top}$   ▷ $(2k\times c)$
10:  $U,s,V^{\top}\leftarrow$ tf.linalg.svd($B$)
11:  $U\leftarrow Q\times U$   ▷ $(r\times 2k)$
12:  return $U[:,:k],\ s[:k],\ V[:,:k]^{\top}$
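For reference, a compact eager-mode TensorFlow sketch of Algorithm 1 (faithful in spirit to the pseudocode above, though not necessarily identical to the released tf-fsvd code):

    import tensorflow as tf

    def fsvd(f, t, s, k, iterations=10):
      """Functional randomized SVD (sketch of Alg. 1, after Halko et al., 2009).

      f: computes M @ Q for dense Q; t(f): computes M^T @ Q; s(): returns (r, c).
      Returns rank-k factors (U, s, V) such that M ~= U diag(s) V^T.
      """
      r, c = s()
      q = tf.random.normal((c, 2 * k))            # Gaussian test matrix, oversampled.
      for _ in range(iterations):                 # Subspace iterations (lines 5-7).
        q, _ = tf.linalg.qr(f(q))                 # (r, 2k)
        q, _ = tf.linalg.qr(t(f)(q))              # (c, 2k)
      q, _ = tf.linalg.qr(f(q))                   # Orthonormal basis of range(M): (r, 2k).
      b = tf.transpose(t(f)(q))                   # B = Q^T M: (2k, c).
      sigma, u_b, v = tf.linalg.svd(b)            # Note: tf.linalg.svd returns (s, u, v).
      u = q @ u_b                                 # Rotate back into R^r: (r, 2k).
      return u[:, :k], sigma[:k], v[:, :k]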

A.3 Analysis

A.3.1 Norm Regularization of Wide Models

If $\widehat{\mathbf{M}}$ is too wide, we need not worry much about overfitting, due to the following theorem.

Theorem 1

(Min. Norm) If the system $\widehat{\mathbf{M}}\widehat{\mathbf{W}}=\mathbf{Y}$ is underdetermined² with the rows of $\widehat{\mathbf{M}}$ being linearly independent, then there are infinitely many solutions. Denote the solution space $\widehat{\mathcal{W}}^{*}=\big\{\widehat{\mathbf{W}}\ \big|\ \widehat{\mathbf{M}}\widehat{\mathbf{W}}=\mathbf{Y}\big\}$. Then, for $k\geq\operatorname{rank}(\widehat{\mathbf{M}})$, the matrix $\widehat{\mathbf{W}}^{*}$ defined in Eq. 13 satisfies $\widehat{\mathbf{W}}^{*}=\operatorname{argmin}_{\widehat{\mathbf{W}}\in\widehat{\mathcal{W}}^{*}}\|\widehat{\mathbf{W}}\|_{F}^{2}$.

² E.g., if the number of labeled examples, i.e., the height of $\mathbf{M}$ and $\mathbf{Y}$, is smaller than the width of $\mathbf{M}$.

Proof: Assume $\mathbf{Y}=\mathbf{y}$ is a column vector (the proof can be generalized to a matrix $\mathbf{Y}$ by repeated column-wise application³). SVD$(\widehat{\mathbf{M}},k)$ with $k\geq\operatorname{rank}(\widehat{\mathbf{M}})$ recovers the solution:

³ The minimizer of the Frobenius norm is composed, column-wise, of the minimizers $\operatorname{argmin}_{\widehat{\mathbf{M}}\mathbf{W}_{:,j}=\mathbf{Y}_{:,j}}\|\mathbf{W}_{:,j}\|_{2}^{2}$ for all $j$.

$$\widehat{\mathbf{W}}^{*}=\widehat{\mathbf{M}}^{\dagger}\mathbf{y}=\widehat{\mathbf{M}}^{\top}\left(\widehat{\mathbf{M}}\widehat{\mathbf{M}}^{\top}\right)^{-1}\mathbf{y}. \tag{15}$$

The Gram matrix $\widehat{\mathbf{M}}\widehat{\mathbf{M}}^{\top}$ is nonsingular as the rows of $\widehat{\mathbf{M}}$ are linearly independent. To prove the claim, let us first verify that $\widehat{\mathbf{W}}^{*}\in\widehat{\mathcal{W}}^{*}$:

$$\widehat{\mathbf{M}}\widehat{\mathbf{W}}^{*}=\widehat{\mathbf{M}}\widehat{\mathbf{M}}^{\top}\left(\widehat{\mathbf{M}}\widehat{\mathbf{M}}^{\top}\right)^{-1}\mathbf{y}=\mathbf{y}.$$

Let $\widehat{\mathbf{W}}_{p}\in\widehat{\mathcal{W}}^{*}$. We must show that $\|\widehat{\mathbf{W}}^{*}\|_{2}\leq\|\widehat{\mathbf{W}}_{p}\|_{2}$. Since $\widehat{\mathbf{M}}\widehat{\mathbf{W}}_{p}=\mathbf{y}$ and $\widehat{\mathbf{M}}\widehat{\mathbf{W}}^{*}=\mathbf{y}$, their subtraction gives:

$$\widehat{\mathbf{M}}(\widehat{\mathbf{W}}_{p}-\widehat{\mathbf{W}}^{*})=0. \tag{16}$$

It follows that $(\widehat{\mathbf{W}}_{p}-\widehat{\mathbf{W}}^{*})\perp\widehat{\mathbf{W}}^{*}$:

$$(\widehat{\mathbf{W}}_{p}-\widehat{\mathbf{W}}^{*})^{\top}\widehat{\mathbf{W}}^{*}=(\widehat{\mathbf{W}}_{p}-\widehat{\mathbf{W}}^{*})^{\top}\widehat{\mathbf{M}}^{\top}\left(\widehat{\mathbf{M}}\widehat{\mathbf{M}}^{\top}\right)^{-1}\mathbf{y}=\underbrace{(\widehat{\mathbf{M}}(\widehat{\mathbf{W}}_{p}-\widehat{\mathbf{W}}^{*}))^{\top}}_{=0\textrm{ due to Eq. 16}}\left(\widehat{\mathbf{M}}\widehat{\mathbf{M}}^{\top}\right)^{-1}\mathbf{y}=0$$

Finally, using Pythagoras Theorem (due to \perp):

$$\|\widehat{\mathbf{W}}_{p}\|_{2}^{2}=\|\widehat{\mathbf{W}}^{*}+\widehat{\mathbf{W}}_{p}-\widehat{\mathbf{W}}^{*}\|_{2}^{2}=\|\widehat{\mathbf{W}}^{*}\|_{2}^{2}+\|\widehat{\mathbf{W}}_{p}-\widehat{\mathbf{W}}^{*}\|_{2}^{2}\geq\|\widehat{\mathbf{W}}^{*}\|_{2}^{2}. \qquad\blacksquare$$

As a consequence, the classification-model solutions recovered by SVD follow a strong standard Gaussian prior, which may be regarded as a form of regularization.
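As a quick numerical sanity check of Theorem 1 (illustrative, not part of the paper), one can verify that the pseudoinverse solution of Eq. 15 has the smallest norm among exact solutions of a random underdetermined system:

    import tensorflow as tf

    tf.random.set_seed(0)
    m = tf.random.normal((5, 20))                 # Underdetermined: 5 equations, 20 unknowns.
    y = tf.random.normal((5, 1))
    gram_inv = tf.linalg.inv(m @ tf.transpose(m))
    w_star = tf.transpose(m) @ gram_inv @ y       # Eq. 15: the pseudoinverse solution.
    null_dir = tf.random.normal((20, 1))
    null_dir -= tf.transpose(m) @ gram_inv @ (m @ null_dir)   # Project onto null space of M.
    w_other = w_star + null_dir                   # Another exact solution (M w_other = y).
    print(tf.norm(m @ w_other - y))               # ~0: still satisfies the system.
    print(tf.norm(w_star) <= tf.norm(w_other))    # True: w_star is the min-norm solution.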

A.3.2 Computational Complexity and Approximation Error

Theorem 2

(Linear Time) Functional SVD (Alg. 1) trains our convexified GRL models in time linear in the graph size.

Proof of Theorem 2 for our two model families:

1. For rank-$k$ SVD over $f_{\widehat{\mathbf{M}}^{\textrm{(WYS)}}}$: Let the cost of running $f_{\widehat{\mathbf{M}}^{\textrm{(WYS)}}}$ be $T_{\textrm{mult}}$. The run-time to compute the SVD, as derived in Section 1.4.2 of Halko et al. (2009), is:

$$\mathcal{O}(kT_{\textrm{mult}}+(r+c)k^{2}). \tag{17}$$

Since $f_{\widehat{\mathbf{M}}^{\textrm{(WYS)}}}$ can be defined as $C$ (context window size) multiplications with the sparse $n\times n$ matrix $\mathcal{T}$ with $m$ non-zero entries, running fSVD($f_{\widehat{\mathbf{M}}^{\textrm{(WYS)}}},k$) costs:

$$\mathcal{O}(kmC+nk^{2}) \tag{18}$$

2. For rank-$k$ SVD over $f_{\widehat{\mathbf{M}}^{\textrm{(JKN)}}}$: Suppose the feature matrix contains $d$-dimensional rows. One can calculate $\widehat{\mathbf{M}}^{\textrm{(JKN)}}\in\mathbb{R}^{n\times Ld}$ with $L$ sparse multiplies in $\mathcal{O}(Lmd)$. Calculating and running SVD (see Section 1.4.1 of Halko et al., 2009) on $\widehat{\mathbf{M}}^{\textrm{(JKN)}}$ costs a total of:

$$\mathcal{O}(ndL\log(k)+(n+dL)k^{2}+Lmd). \tag{19}$$

Therefore, the training time is linear in $n$ and $m$. $\blacksquare$

Contrast this with the methods of WYS (Abu-El-Haija et al., 2018) and NetMF (Qiu et al., 2018), which require assembling a dense $n\times n$ matrix that takes $\mathcal{O}(n^{2})$ time to decompose. One wonders: how far are we from the optimal SVD with a linear-time algorithm? The following bounds the error.

Theorem 3

(Exponentially-decaying Approx. Error) The rank-$k$ randomized SVD algorithm of Halko et al. (2009) gives an approximation error that can be brought down, exponentially fast, to no more than twice the approximation error of the optimal (true) SVD.

Proof is in Theorem 1.2 of Halko et al. (2009). $\blacksquare$
Consequently, compared to $\widetilde{\textrm{NetMF}}$ (Qiu et al., 2018), which incurs unnecessary estimation error, our estimation error can be brought down exponentially by increasing the iterations parameter of Alg. 1.

Appendix B Appendix: Experimental Details

B.1 Datasets

We apply our functional SVD on popular datasets that can be trained using our simplified (i.e., convexified) models. Specifically either (1) semi-supervised node-classification datasets, where features are present, or (2) link-prediction datasets where features are absent. It is possible to convexify other setups, e.g., link prediction when node features are present, but we leave this as future work. We run experiments on seven graph datasets:

  • Protein-Protein Interactions (PPI): a large graph where every node is a protein and an edge between two nodes indicates that the two proteins interact. The processed version of PPI was downloaded from Grover & Leskovec (2016).

  • Three citation networks that are extremely popular: Cora, Citeseer, Pubmed. Each node is an article and each (directed) edge implies that an article cites another. Additionally, each article is accompanied by a feature vector (containing NLP-extracted features of the article's abstract), as well as a label (article type).

  • Two collaboration datasets: ca-AstroPh and ca-HepTh, where nodes are researchers and an edge between two nodes indicates that the researchers have co-published at least one article, in the areas of Astrophysics and High-Energy Physics, respectively.

  • ego-Facebook: an ego-centered social network.

For citation networks, we processed node features and labels. For all other datasets, we did not process features during training or inference. For train/validation/test partitions: we used the splits of Yang et al. (2016) for Citeseer, Cora, Pubmed; we used the splits of Abu-El-Haija et al. (2018) for PPI, Facebook, ca-AstroPh and ca-HepTh; we used the splits of OGB (Hu et al., 2020) for ogbl-ddi. All datasets and statistics are summarized in Table 2. In §B.2 and §B.3, unless otherwise noted, we download the authors' source code from GitHub, modify⁴ it to record wall-clock run-time, and run on an NVidia Tesla K80 GPU. Thankfully, the downloaded code has one script to run each dataset, or the hyperparameters are clearly stated in the source paper.

⁴ Modified files are in our code repo.

Table 2: Dataset Statistics
Dataset Nodes Edges Source
PPI 3,852 proteins 20,881 chemical interactions http://snap.stanford.edu/node2vec
ego-Facebook 4,039 users 88,234 friendships http://snap.stanford.edu/data
ca-AstroPh 17,903 researchers 197,031 co-authorships http://snap.stanford.edu/data
ca-HepTh 8,638 researchers 24,827 co-authorships http://snap.stanford.edu/data
Cora 2,708 articles 5,429 citations Planetoid (Yang et al., 2016)
Citeseer 3,327 articles 4,732 citations Planetoid (Yang et al., 2016)
Pubmed 19,717 articles 44,338 citations Planetoid (Yang et al., 2016)

B.2 Semi-supervised Node Classification

We consider a transductive setting where a graph is entirely visible (all nodes and edges). Additionally, some of the nodes are labeled. The goal is to recover the labels of unlabeled nodes. All nodes have feature vectors.

Baselines: We download the code of GAT (Veličković et al., 2018), MixHop (Abu-El-Haija et al., 2019), and GCNII (Chen et al., 2020) and re-run them, with slight modifications to record training time. However, for the baselines Planetoid (Yang et al., 2016) and GCN (Kipf & Welling, 2017), we copied the reported results from the GCN paper (Kipf & Welling, 2017).

In these experiments, to train our method, we run our functional SVD twice per graph. We take the feature matrix $\mathbf{X}$ bundled with each dataset and concatenate to it two matrices, $\widehat{\mathbf{L}}$ and $\widehat{\mathbf{R}}$, calculated per Equation 8: the calculation itself invokes our functional SVD (the first time) on $f_{\widehat{\mathbf{M}}^{\textrm{(WYS)}}}$ with rank 32. The hyperparameters of $f_{\widehat{\mathbf{M}}^{\textrm{(WYS)}}}$ are $\lambda$ (negative coefficient) and $C$ (context window size). After concatenating $\widehat{\mathbf{L}}$ and $\widehat{\mathbf{R}}$ into $\mathbf{X}$, we apply PCA to reduce the resulting matrix to 1000 dimensions, which forms our new $\mathbf{X}$. Finally, we express our model as the linear $L$-layer message passing network (Eq. 10) and learn its parameters via rank-$k$ SVD on $f_{\widehat{\mathbf{M}}^{\textrm{(JKN)}}}$ (the second time), as explained in §3.2. We use the validation partition to tune $L$, $k$, $\lambda$, and $C$.

Table 1 (left) summarizes the performance of our approach ($f_{\widehat{\mathbf{M}}^{\textrm{(JKN)}}}$) against the aforementioned baselines, showing both test accuracy and training time. While our method is competitive with the state of the art, it trains much faster.

B.3 ROC-AUC Link Prediction

Given a partial graph where only a subset of the edges are observed, the goal is to recover the unobserved edges. This has applications in recommender systems: when a user expresses interest in products, the system wants to predict other products the user is interested in. The task is usually set up by partitioning the edges of the input graph into train and test edges. Further, it is common to sample negative test edges, e.g., uniformly from the graph complement. Lastly, a GRL method for link prediction can be trained on the train-edge partition, then asked to score the test-partition edges versus the negative test edges. The quality of the scoring can be quantified by a ranking metric, e.g., ROC-AUC.

Baselines: We download the code of WYS (Abu-El-Haija et al., 2018) and update it for TensorFlow 2.0. We download the code of Qiu et al. (2018) and denote their methods as NetMF and $\widetilde{\textrm{NetMF}}$, where the first computes the complete matrix $\mathbf{M}$ before SVD decomposition and the second samples $\mathbf{M}$ entry-wise; the second is faster for larger graphs but sacrifices estimation error and performance. For node2vec (n2v), we use its PyG implementation (Fey & Lenssen, 2019).

Table 1 (right) summarizes the test ROC-AUC results. For our method (denoted $f_{\widehat{\mathbf{M}}^{\textrm{(WYS)}}}$), we call our functional SVD (Alg. 1) and pass it $f_{\widehat{\mathbf{M}}^{\textrm{(WYS)}}}$ as defined in Eq. 14. Embeddings are set to the SVD basis (as in §3.1) and the edge score of nodes $(u,v)$ is proportional to the dot-product of their embeddings. The last row of the table shows results when the SVD rank is 100. Lastly, we set the context window hyperparameter (a.k.a. length of walk) as follows. For WYS, we trained with their default context (as WYS learns the context), but for all others (NetMF, n2v, ours) we used a context window of length $C$=5 for the Facebook and PPI datasets (for us, this sets $\mathbf{c}=[5,4,3,2,1]$) and $C$=20 for AstroPh and HepTh.

B.4 Sensitivity Analysis

While in §B.2 we tune the number of layers ($L$) using performance on the validation partition, in this section we show the impact of varying $L$ on test accuracy. According to the summary in Figure 2 (left), the accuracy of classifying a node improves when incorporating information from further nodes. We see little gain beyond $L>6$. Note that $L=0$ corresponds to ignoring the adjacency matrix altogether when running $f_{\widehat{\mathbf{M}}^{\textrm{(JKN)}}}$. Here, we fixed $\lambda=C=1$ and averaged over 5 runs. The (tiny) error bars show the standard deviation.

Further, while in §B.3 we do SVD on $f_{\widehat{\mathbf{M}}^{\textrm{(WYS)}}}$ with rank $k=32$ or $k=100$, Figure 2 (right) shows test accuracy while sweeping $k\leq 32$. In general, increasing the rank improves estimation accuracy and test performance. However, if $k$ is larger than the inherent dimensionality of the data, this could cause overfitting (through perfect memorization of training edges). The norm-regularization note (§A.3.1) applies only to pseudoinversion, i.e., to our classification models.