
Efficient Link Prediction via GNN Layers Induced by Negative Sampling

Yuxin Wang, Xiannian Hu, Quan Gan, Xuanjing Huang, Xipeng Qiu, David Wipf The authors are with the School of Computer Science, Fudan University (Yuxin Wang, Xiannian Hu, Xuanjing Huang, Xipeng Qiu) and Amazon (Quan Gan, David Wipf).
E-mail: {wangyuxin21,21210240194}@m.fudan.edu.cn, [email protected], {xjhuang,xpqiu}@fudan.edu.cn, [email protected]
Abstract
This article has been accepted for publication in IEEE Transactions on Knowledge and Data Engineering. Citation information: DOI 10.1109/TKDE.2024.3481015

Graph neural networks (GNNs) for link prediction can loosely be divided into two broad categories. First, node-wise architectures pre-compute individual embeddings for each node that are later combined by a simple decoder to make predictions. While extremely efficient at inference time, model expressiveness is limited such that isomorphic nodes contributing to candidate edges may not be distinguishable, compromising accuracy. In contrast, edge-wise methods rely on the formation of edge-specific subgraph embeddings to enrich the representation of pair-wise relationships, disambiguating isomorphic nodes to improve accuracy, but with increased model complexity. To better navigate this trade-off, we propose a novel GNN architecture whereby the forward pass explicitly depends on both positive (as is typical) and negative (unique to our approach) edges to inform more flexible, yet still cheap node-wise embeddings. This is achieved by recasting the embeddings themselves as minimizers of a forward-pass-specific energy function that favors separation of positive and negative samples. Notably, this energy is distinct from the actual training loss shared by most existing link prediction models, where contrastive pairs only influence the backward pass. As demonstrated by extensive empirical evaluations, the resulting architecture retains the inference speed of node-wise models, while producing competitive accuracy with edge-wise alternatives. We released our code at https://github.com/yxzwang/SubmissionverOfYinYanGNN.

Index Terms:
Artificial Intelligence, Machine Learning, Graph Neural Networks, Link Prediction

I Introduction

Link prediction is a fundamental graph learning challenge that involves determining whether or not there should exist an edge connecting two nodes. Given the prevalence of graph-structured data, this task has widespread practical significance spanning domains such as social networks, knowledge graphs, e-commerce, and recommendation systems [1, 2, 3]. As one representative example of the latter, the goal could be to predict whether a user node should be linked with an item node in a product graph, where edges are indicative of some form of user/item engagement, e.g., clicks, purchases, etc.

Beyond heuristic techniques such as Common Neighbors (CN) [4], Adamic-Adar (AA) [5], and Resource Allocation (RA) [6], graph neural networks (GNNs) have recently shown tremendous promise in addressing link prediction with trainable deep architectures [7, 8, 9, 10, 11, 12]. Broadly speaking these GNN models fall into two categories, based on whether they rely on node-wise or edge-wise embeddings. The former involves using a GNN to pre-compute individual embeddings for each node that are later combined by a simple decoder to predict the presence of edges. This strategy is preferable when inference speed is paramount (as is often the case in real-world applications requiring low-latency predictions), since once node-wise embeddings are available, combining them to make predictions is cheap. Moreover, accelerated decoding techniques such as Maximum Inner Product Search (MIPS) [13, 14, 15] or Flashlight [16] exist to further economize inference. The downside though of node-wise embedding methods is that they may fail to disambiguate isomorphic nodes that combine to form a candidate edge [9].

To this end, edge-wise embeddings with greater expressive power have been proposed for more robust link prediction [10, 17, 12, 18]. These models base their predictions on edge-specific subgraphs capable of breaking isomorphic node relationships via structural information (e.g., overlapping neighbors, shortest paths, positional encodings, or subgraph sketches) that might otherwise undermine the performance of node-wise embeddings. This flexibility comes with a substantial cost though, as inference complexity can be orders of magnitude larger given that a unique subgraph must be extracted and processed by the GNN for every test edge. Although this expense can be alleviated in some cases by pre-processing [12], for inference over very large sets of candidate links, even the pre-processing time can be overwhelming relative to that required by node-wise predictors.

In this work, we address the trade-off between expressiveness and inference efficiency via the following strategy. To maintain minimal inference speeds, we restrict ourselves to a node-wise embedding approach and then try to increase the expressiveness on multiple fronts. Most importantly, we allow each node-level embedding computed during the forward pass to depend on not only its ego network (i.e., subgraph containing a target node), but also on the embeddings of negatively sampled nodes, meaning nodes that were not originally sharing an edge with the target node. This can be viewed as forming a complementary negative ego network for each node. Moreover, rather than heuristically incorporating the resulting positive and negative ego networks within a traditional GNN-based embedding model, we instead combine them so as to infuse their integration with an inductive bias specifically tailored for link prediction. Specifically, we introduce a parameterized graph-regularized energy, in the spirit of triplet ranking loss functions used for capturing both relative similarities and differences between pair-wise items. By design, the parameter-dependent minimizer of this function can then serve the role of end-to-end trainable node-wise embeddings, explicitly dependent on the node features of both positive and negative samples even during the forward pass (not just the backward training pass as is typical). For these reasons, we refer to our model as a Yin (negative) Yang (positive) GNN, or YinYanGNN for short.

In this way, we increase the flexibility of node-wise embedding approaches, without significantly increasing the computational complexity, as no edge-wise embeddings or edge-specific subgraph extraction is necessary. Additionally, by unifying the positive and negative samples within a single energy function minimization process, the implicit receptive field of the embeddings can be arbitrarily large without oversmoothing, a property we inherit from prior related work on optimization-based GNN models applied to much simpler node classification tasks [19]. These observations lead to a statement of our primary contributions:

  1. We design node-wise embeddings for link prediction that are explicitly imbued with an inductive bias informed by the node features of both positive and negative samples during the model forward pass, not merely the backward pass as is customary with contrastive learning-based methods. This is accomplished by recasting the embeddings themselves as minimizers of an energy function that explicitly balances the impact of positive (Yang) and negative (Yin) samples, leading to a model we refer to as YinYanGNN.

  2. We analyze the convergence properties and computational complexity of the optimization process which produces YinYanGNN embeddings, as well as their expressiveness relative to traditional node-wise models. These results suggest that our approach can potentially serve as a reliable compromise between node- and edge-wise alternatives.

  3. Experiments on real-world link prediction benchmarks reveal that YinYanGNN can outperform SOTA node-wise models in terms of accuracy while matching their efficiency. Analogously, YinYanGNN can exceed the efficiency of edge-wise approaches while maintaining similar (and in some cases better) prediction accuracy.

We summarize the differentiating factors of YinYanGNN compared to existing node-wise and edge-wise models in Table I, while the basic architecture is displayed in Figure 1, with more detailed descriptions to follow in subsequent sections.

TABLE I: Comparisons with existing node-wise and edge-wise models. Overall, YinYanGNN can achieve accuracy comparable to edge-wise models while maintaining the efficiency of node-wise models. YinYanGNN also possesses multiple desirable properties linked to its unique architectural design.
                                              Node-wise Models   Edge-wise Models   YinYanGNN
Can distinguish some isomorphic node pairs           ×                  ✓               ✓
Competitive accuracy                                 ×                  ✓               ✓
Fast inference run-time                              ✓                  ×               ✓
Sublinear acceleration                               ✓                  ×               ✓
Energy descent convergence guarantees                ×                  ×               ✓
Negative samples in forward pass                     ×                  ×               ✓

II Related Work

GNNs for Link Prediction. As mentioned in Section I, GNN models for link prediction can be roughly divided into two categories, those based on node-wise embeddings [7, 8, 20] and those based on edge-wise embeddings [9, 12, 17, 11, 21, 22, 18]. The former is generally far more efficient at inference time given that the embeddings need only be computed once for each node and then repeatedly combined to make predictions for each candidate edge. However, the latter is more expressive by facilitating edge-specific structural features at the cost of much slower inference.

GNN Layers formed from unfolded optimization steps. A plethora of recent research has showcased the potential of constructing resilient GNN architectures for node classification using graph propagation layers that emulate the iterative descent steps of a graph-regularized energy function [23, 24, 25, 26, 27, 19, 28, 29]. These approaches allow the node embeddings at each layer to be regarded as progressively refined approximations of an interpretable energy minimizer. A key advantage is that embeddings obtained in this way can be purposefully designed to address challenges such as GNN oversmoothing or the introduction of robustness against spurious edges. Moreover, these adaptable embeddings can be seamlessly integrated into a bilevel optimization framework [30] for supervised training. Even so, prior work in this domain thus far has been primarily limited to much simpler node classification tasks, where nuanced relationships between pairs of nodes need not be explicitly accounted for. In contrast, we are particularly interested in the latter, and the potential to design new energy functions that introduce inductive biases suitable for link prediction.

GNN Expressiveness. Most work on GNN expressiveness focuses on the representational power of entire graphs [31, 32, 33, 34, 35, 36, 37]. More narrowly in terms of link prediction, structural representations of node sets have also been studied [38], as well as labeling tricks [10] first proposed by [9] to improve expressiveness. Overall, there now exists a variety of edge-wise methods [12, 18, 22] specifically designed to focus on more expressive edge-specific structural features, as we discuss elsewhere herein.

Contrastive Learning. Although negative pairs and contrastive learning are commonly used by link prediction frameworks and self-supervised learning more generally [39, 40, 41], their role in prior work is largely restricted to defining the training loss, and therefore only influencing the backward pass. This is unlike YinYanGNN, whereby the negative samples also explicitly impact the forward pass and core model/encoder architecture itself.

Figure 1: YinYanGNN model illustration. On the left side we show the YinYanGNN forward pass explicitly depending on negative samples, with layers computing embeddings that descend a lower-level energy $\ell_{node}$ defined by (11). On the right side we show the more traditional backward pass for optimizing the meta-loss (1) and a one-step optimization of it over parameters $W$ (which define the lower-level energy) and $\theta$ (specific to the meta-loss). Supporting details and derivations are presented in Section IV.

III Preliminaries

In this section we briefly introduce notation before providing concrete details of the link prediction problem that will be useful later.

III-A Notation

Let $\mathcal{G}=(\mathcal{V},\mathcal{E},X)$ be a graph with node set $\mathcal{V}$, corresponding $d_{x}$-dimensional node features $X\in\mathbb{R}^{n\times d_{x}}$, and edge set $\mathcal{E}$, where $|\mathcal{V}|=n$. We use $A$ to denote the adjacency matrix and $D$ for the degree matrix. The associated Laplacian matrix is defined by $L\triangleq D-A$. Furthermore, $Y\in\mathbb{R}^{n\times d}$ refers to the node embeddings of size $d$ we seek to learn via a node-wise link prediction procedure. Specifically, the node embedding for node $i$ is $y_{i}$, which is equivalent to the $i$-th row of $Y$.

III-B Link Prediction

We begin by introducing a commonly-used loss for link prediction, which is defined over the training set $\mathcal{E}_{train}\subset\mathcal{E}$. For both node-wise and edge-wise methods, the shared goal is to obtain an edge probability score $p(v_{i},v_{j})=\sigma(e_{ij})$ for all edges $(v_{i},v_{j})\in\mathcal{E}_{train}$ (as well as negatively-sampled counterparts to be determined shortly), where $\sigma$ is a sigmoid function and $e_{ij}$ is a discriminative representation for edge $(v_{i},v_{j})$. Proceeding further, for every true positive edge $(v_{i},v_{j})$ in the training set, $N\geq 1$ negative edges $(v_{i^{a}},v_{j^{a}})_{a=1,...,N}$ are randomly sampled from the graph for supervision purposes. We are then positioned to express the overall link prediction loss as

\mathcal{L}_{link}\triangleq\sum_{(i,j)\in\mathcal{E}_{train}}\left[-\log(p(v_{i},v_{j}))-\sum_{a=1}^{N}\frac{1}{N}\log(1-p(v_{i^{a}},v_{j^{a}}))\right],    (1)

where each edge probability is computed with the corresponding edge representation. The lingering difference between node- and edge-wise methods then lies in how each edge representation eije_{ij} is actually computed.

For node-wise methods, $e_{ij}=h(y_{i},y_{j})$, where $y_{i}$ and $y_{j}$ are node-wise embeddings and $h$ is a decoder function ranging in complexity from a parameter-free inner product to a multi-layer MLP. While decoder structure varies [42, 43, 44, 45, 16], of particular note for its practical effectiveness is the HadamardMLP approach, which amounts to simply computing the Hadamard product between $y_{i}$ and $y_{j}$ and then passing the result through an MLP with trainable parameters $\theta$. Fast, sublinear inference times are possible with HadamardMLP using an algorithm from [16]. In contrast, the constituent node embeddings themselves are typically computed with some form of trainable GNN encoder model $g$ of the form $y_{i}=g(x_{i},\mathcal{G}_{i})$ and $y_{j}=g(x_{j},\mathcal{G}_{j})$, where $\mathcal{G}_{i}$ and $\mathcal{G}_{j}$ are the subgraphs containing nodes $v_{i}$ and $v_{j}$, respectively.
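To make this concrete, below is a minimal PyTorch-style sketch of a HadamardMLP decoder together with the training loss in (1). The class and function names, hidden size, and edge-tensor layout (a [num_edges, 2] index tensor) are illustrative assumptions rather than the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HadamardMLPDecoder(nn.Module):
    """Scores a candidate edge from the Hadamard product of its two node embeddings."""
    def __init__(self, d, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, y_i, y_j):
        return self.mlp(y_i * y_j).squeeze(-1)   # one logit e_ij per edge

def link_prediction_loss(decoder, Y, pos_edges, neg_edges, N):
    # Loss of Eq. (1): cross-entropy over positive edges and N negatives per positive
    # (neg_edges holds N times as many rows as pos_edges).
    pos_logits = decoder(Y[pos_edges[:, 0]], Y[pos_edges[:, 1]])
    neg_logits = decoder(Y[neg_edges[:, 0]], Y[neg_edges[:, 1]])
    pos_loss = F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits))
    neg_loss = F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits))
    # the 1/N weighting in (1) is absorbed by the mean reduction, up to a constant factor
    return pos_loss + neg_loss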

Turning to edge-wise methods, the edge representation $e_{ij}$ relies on the subgraph $\mathcal{G}_{ij}$ defined by both $v_{i}$ and $v_{j}$. In this case $e_{ij}=h_{e}(v_{i},v_{j},\mathcal{G}_{ij})$, where $h_{e}$ is an edge encoder GNN whose predictions can generally not be decomposed into a function of individual node embeddings as before. Note also that while the embeddings from node-wise subgraphs for all nodes in the graph can be produced by a single GNN forward pass, a unique/separate edge-wise subgraph and corresponding forward pass are needed to make predictions for each candidate edge. This explains why edge-wise models endure far slower inference speeds in practice.

IV Incorporating Negative Sampling into Node-wise Model Design

Previously we described how computationally-efficient node-wise embedding methods for link prediction rely on edge representations that decompose as $e_{ij}=h[g(x_{i},\mathcal{G}_{i}),g(x_{j},\mathcal{G}_{j})]$ for node pair $(v_{i},v_{j})$, a decomposition that is decidedly less expressive than the more general form $e_{ij}=h_{e}(v_{i},v_{j},\mathcal{G}_{ij})$ adopted by edge-wise embedding methods. Although we can never match the flexibility of the edge-wise models with a node-wise approach, we can nonetheless increase the expressiveness of node-wise models while still retaining their attractive computational footprint.

At a high level, our strategy for accomplishing this goal is to learn node-wise embeddings of the revised form $y_{i}=g(v_{i},\mathcal{G}_{i},\mathcal{G}_{i}^{-})$, where $\mathcal{G}_{i}^{-}$ is a subgraph of $\mathcal{G}^{-}$ centered at node $v_{i}$, $\mathcal{G}^{-}=(\mathcal{V},\mathcal{E}^{-},X)$, and $\mathcal{E}^{-}$ is a set of negatively-sampled edges between nodes in the original graph $\mathcal{G}$. In this way each node-wise embedding has access to node features from both positive and negative neighboring nodes.

To operationalize this conceptual design, rather than heuristically embedding negative samples within an existing GNN architecture (see Section VII-D for experiments using this simple strategy), we instead choose node-wise embeddings that minimize an energy function regularized by both positive and negative edges, i.e., $\mathcal{E}$ and $\mathcal{E}^{-}$. More formally, we seek a node embedding matrix $Y=\arg\min_{Y}\ell_{node}(\mathcal{G},\mathcal{G}^{-})$ such that the optimal solution decomposes as $y_{i}=g(v_{i},\mathcal{G}_{i},\mathcal{G}_{i}^{-})$ for some differentiable function $g$ across all nodes $v_{i}$. This allows us to anchor the influence of positive and negative edges within a unified energy surface, with trainable minimizers that can be embedded within the link prediction loss from (1). In the remainder of this section we motivate our selection for $\ell_{node}$, as well as the optimization steps which form the structure of the corresponding function $g$.

IV-A An Initial Energy Function

Prior work on optimization-based node embeddings [46, 26, 27, 19, 28, 29] largely draws on energy functions related to [47], which facilitate the balancing of local consistency relative to labels or a base predictor, with global constraints from graph structure. However, these desiderata alone are inadequate for the link prediction task, where we would also like to drive individual nodes towards regions of the embedding space where they are maximally discriminative with respect to their contributions to positive and negative edges. To this end we take additional inspiration from triplet ranking loss functions [48] that are explicitly designed for learning representations that can capture relative similarities or differences between items.

With these considerations in mind, we initially posit the energy function

\ell_{node}\triangleq\|Y-f(X;W)\|^{2}_{F}+\lambda\sum_{(i,j)\in\mathcal{E}}\Big[dist(y_{i},y_{j})-\frac{\lambda_{K}}{\lambda K}\sum_{j^{\prime}\in\mathcal{V}_{(i,j)}^{K}}dist(y_{i},y_{j^{\prime}})\Big],    (2)

where $f(X;W)$ (assumed to apply row-wise to each individual node feature $x_{i}$) represents a base model that processes the input features using trainable weights $W$, $dist(y_{i},y_{j})$ is a distance metric, while $\lambda$ and $\lambda_{K}$ are hyperparameters that control the impact of positive and negative edges. Moreover, $\mathcal{V}_{(i,j)}^{K}$ is the set of negative destination nodes sampled for edge $(v_{i},v_{j})$, with $|\mathcal{V}_{(i,j)}^{K}|=K$. Overall, the first term pushes the embeddings towards the processed input features, while the second and third terms apply penalties to positive and negative edges in a way that is loosely related to the aforementioned triplet ranking loss (more on this below).

If we choose $dist(y_{i},y_{j})=\|y_{i}-y_{j}\|^{2}$ and define the edges of the negative graph $\mathcal{G}^{-}$ as $\mathcal{E}^{-}\triangleq\{(i,j^{\prime})\,|\,j^{\prime}\in\mathcal{V}_{(i,j)}^{K}\text{ for }(i,j)\in\mathcal{E}\}$, we can rewrite (2) as

\ell_{node}=\|Y-f(X;W)\|^{2}_{F}+\lambda\,\text{tr}[Y^{\top}LY]-\frac{\lambda_{K}}{K}\text{tr}[Y^{\top}L^{-}Y],    (3)

where $L^{-}$ is the Laplacian matrix of $\mathcal{G}^{-}$. To find the minimum, we compute the gradients

\frac{\partial\ell_{node}}{\partial Y}=2(Y-f(X;W))+2\lambda LY-\frac{2\lambda_{K}}{K}L^{-}Y,    (4)

where $Y^{(0)}=f(X;W)$. The corresponding gradient descent updates then become

Y^{(t+1)}=Y^{(t)}-\alpha\Big((Y^{(t)}-f(X;W))+\lambda LY^{(t)}-\frac{\lambda_{K}}{K}L^{-}Y^{(t)}\Big),    (5)

where the step size is $\frac{\alpha}{2}$. We note however that (3) need not generally be convex or even lower bounded. Moreover, the gradients may be poorly conditioned for fast convergence depending on the Laplacian matrices involved. Hence we consider several refinements next to stabilize the learning process.
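As a concrete illustration, the following sketch unfolds the gradient iterations of (5) on the initial energy (3) using dense matrices; the function name and the dense formulation (rather than sparse message passing) are simplifying assumptions for readability.

import torch

def unfold_initial_energy(f_X, L, L_neg, lam, lam_K, K, alpha, T):
    """Gradient-descent unfolding of Eq. (5) on the initial energy (3).

    f_X:   [n, d] output of the base model f(X; W)
    L:     [n, n] Laplacian of the observed graph G
    L_neg: [n, n] Laplacian of the negatively sampled graph G^-
    """
    Y = f_X.clone()                      # Y^(0) = f(X; W)
    for _ in range(T):
        grad = (Y - f_X) + lam * (L @ Y) - (lam_K / K) * (L_neg @ Y)
        Y = Y - alpha * grad             # the factor of 2 in (4) is absorbed into the step size alpha/2
    return Y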

IV-B Energy Function Refinements

Lower-Bounding the Negative Graph. Since the regularization of negative edges brings the possibility of an ill-posed loss surface (a non-convex loss surface that may be unbounded from below), we introduce a convenient graph-aware lower bound analogous to the max operator used by the triplet loss. Specifically, we update (3) to the form

\ell_{node}=\|Y-f(X;W)\|^{2}_{F}+\lambda\,\text{tr}[Y^{\top}LY]+\frac{\lambda_{K}}{K}\text{Softplus}\big(|\mathcal{E}|\gamma-\text{tr}[Y^{\top}L^{-}Y]\big),    (6)

noting that we use $\text{Softplus}(x)=\log(1+e^{x})$ instead of $\max(\cdot,0)$ to make the energy differentiable. Unlike the triplet loss, which includes the positive term inside the $\max(\cdot,0)$ function, we only restrict the lower bound for the negative term, because we still want the positive part to impact our model when the negative part hits the bound.

Normalization. We instead use the normalized Laplacian matrices of the original graph $\mathcal{G}$ and negative graph $\mathcal{G}^{-}$ to make the gradients smaller. Also, for the gradients of the first term in (3), we apply a re-scaling by the summation of the degrees of both graphs. The modified gradients are

\frac{\partial\ell_{node}}{\partial Y}=2(D+D^{-})^{-1}(Y-f(X;W))+2\lambda\tilde{L}Y-\frac{2\lambda_{K}}{K}\tilde{L}^{-}Y,    (7)

where $D$ and $D^{-}$ are the diagonal degree matrices of $\mathcal{G}$ and $\mathcal{G}^{-}$. The normalized Laplacians are $\tilde{L}=D^{-\frac{1}{2}}LD^{-\frac{1}{2}}$ and $\tilde{L}^{-}={D^{-}}^{-\frac{1}{2}}L^{-}{D^{-}}^{-\frac{1}{2}}$, leading to the corresponding energy

\ell_{node}=\left\|(D+D^{-})^{-1}(Y-f(X;W))\right\|^{2}_{F}+\lambda\,\text{tr}[Y^{\top}\tilde{L}Y]-\frac{\lambda_{K}}{K}\text{tr}[Y^{\top}\tilde{L}^{-}Y].    (8)
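For reference, a small sketch of how the normalized Laplacians $\tilde{L}$ and $\tilde{L}^{-}$ appearing in (7) and (8) might be assembled from dense positive and negative adjacency matrices; this construction is an illustrative assumption and not how a large sparse graph would be handled in practice.

import torch

def normalized_laplacian(A):
    """L~ = D^{-1/2} (D - A) D^{-1/2} for a dense adjacency matrix A."""
    deg = A.sum(dim=1)
    d_inv_sqrt = deg.clamp(min=1e-12).pow(-0.5)   # clamp guards isolated nodes
    L = torch.diag(deg) - A
    return d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]

# The positive graph G and a sampled negative graph G^- share the same node set,
# so both Laplacians are n x n and can be combined as in Eqs. (7)-(8):
# L_tilde, L_tilde_neg = normalized_laplacian(A_pos), normalized_laplacian(A_neg)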

Learning to Combine Negative Graphs. We now consider a more flexible implementation of negative graphs. More concretely, we sample $K$ negative graphs $\{\mathcal{G}^{-}_{(k)}\}_{k=1,...,K}$, in which every negative graph consists of one negative edge per positive edge ($\mathcal{G}^{-}_{(k)}\triangleq\{(i,j^{\prime})\,|\,j^{\prime}\in\mathcal{V}_{(i,j)}^{1}\text{ for }(i,j)\in\mathcal{E}\}$; note that the superscript on $\mathcal{V}$ implies a single negative sample per positive edge). We then set learnable weights $\lambda_{K}^{k}$ for the structure term of each negative graph, which converts the energy function to

\ell_{node}=\|Y-f(X;W)\|^{2}_{F}+\lambda\,\text{tr}[Y^{\top}LY]-\frac{1}{K}\sum_{k=1}^{K}\lambda_{K}^{k}\,\text{tr}[Y^{\top}L^{-}_{k}Y].    (9)

In practice we also normalize this energy function as

\ell_{node}=\left\|(D+D^{-}_{K})^{-1}(Y-f(X;W))\right\|^{2}_{F}+\lambda\,\text{tr}[Y^{\top}\tilde{L}Y]-\frac{1}{K}\sum_{k=1}^{K}\lambda_{K}^{k}\,\text{tr}[Y^{\top}\tilde{L^{-}_{k}}Y],    (10)

where $D^{-}_{K}=\sum_{k=1}^{K}D^{-}_{k}$, $D^{-}_{k}$ is the degree matrix of $L^{-}_{k}$, and $\tilde{L^{-}_{k}}={D^{-}_{K}}^{-\frac{1}{2}}L^{-}_{k}{D^{-}_{K}}^{-\frac{1}{2}}$. The lower bound is also added as before. Overall, the motivation here is to inject trainable flexibility into the negative sample graph, which is useful for increasing model expressiveness.

IV-C The Overall Algorithm

Combining the modifications discussed in the last section (and assuming a single, fixed $\lambda_{K}$ here for simplicity; the more general, learnable case with multiple $\lambda_{K}^{k}$ from Section IV-B naturally follows), we obtain the final energy function

\ell_{node}=\left\|(D+D^{-})^{-1}(Y-f(X;W))\right\|^{2}_{F}+\lambda\,\text{tr}[Y^{\top}\tilde{L}Y]+\frac{\lambda_{K}}{K}\text{Softplus}\big(\gamma|\mathcal{E}|-\text{tr}[Y^{\top}\tilde{L}^{-}Y]\big),    (11)

with the associated gradients

\frac{\partial\ell_{node}}{\partial Y}=2(D+D^{-})^{-1}(Y-f(X;W))+2\lambda\tilde{L}Y-\frac{2\lambda_{K}}{K}\tilde{L}^{-}Y\sigma(Q),    (12)

where $Q=\gamma|\mathcal{E}|-\text{tr}[Y^{\top}\tilde{L}^{-}Y]$ and $\sigma(x)$ is the sigmoid function. The final updates for our model then become

Y^{(t+1)}=Y^{(t)}-\alpha\Big((D+D^{-})^{-1}(Y^{(t)}-f(X;W))+\lambda\tilde{L}Y^{(t)}-\frac{\lambda_{K}}{K}\tilde{L}^{-}Y^{(t)}\sigma(Q^{(t)})\Big)=C_{1}Y^{(t)}+C_{2}f(X;W)+c_{3}\tilde{A}Y^{(t)}-c_{4}\tilde{A}^{-}Y^{(t)},    (13)

where the diagonal scaling matrices $(C_{1},C_{2})$ and scalar coefficients $(c_{3},c_{4})$ are given by

C_{1}=\left(1-\alpha\lambda+\tfrac{1}{K}\alpha\lambda_{K}\sigma(Q^{(t)})\right)I-\alpha(D+D^{-})^{-1},\quad C_{2}=\alpha(D+D^{-})^{-1},\quad c_{3}=\alpha\lambda,\quad c_{4}=\tfrac{1}{K}\alpha\lambda_{K}\sigma(Q^{(t)}),    (14)

with $Q^{(t)}=\gamma|\mathcal{E}|-\text{tr}[(Y^{(t)})^{\top}\tilde{L}^{-}Y^{(t)}]$, $I$ an $n\times n$ identity matrix, and $\frac{\alpha}{2}$ as the step size.

From the above expressions, we observe that the first and second terms of (13) can be viewed as rescaled skip connections from the previous layer and the input layer/base model, respectively. As we will later show, these scale factors are designed to facilitate guaranteed descent of the objective from (11). Meanwhile, the third term of (13) represents a typical GNN graph propagation layer, while the fourth term is the analogous negative sampling propagation unique to our model. In the context of Chinese philosophy, the latter can be viewed as the Yin to the Yang of the third term, and with a trade-off parameter that can be learned when training the higher-level objective from (1), the Yin/Yang balance can in a loose sense be estimated from the data; hence the name YinYanGNN for our proposed approach.
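Below is a minimal sketch of one YinYanGNN propagation layer implementing the update (13) with dense matrices; the variable names and the dense formulation are illustrative assumptions (an actual implementation would use sparse message passing).

import torch

def yinyang_layer(Y, f_X, A_pos, A_neg, D_sum_inv, lam, lam_K, K, gamma, alpha, num_edges):
    """One unfolded descent step of Eq. (13).

    Y:         [n, d] current embeddings Y^(t)
    f_X:       [n, d] base model output f(X; W)
    A_pos:     [n, n] symmetrically normalized adjacency A~ of G
    A_neg:     [n, n] symmetrically normalized adjacency A~^- of G^-
    D_sum_inv: [n]    diagonal of (D + D^-)^{-1}
    """
    L_neg_Y = Y - A_neg @ Y                              # L~^- Y, since L~^- = I - A~^-
    Q = gamma * num_edges - (Y * L_neg_Y).sum()          # Q^(t) = gamma|E| - tr(Y^T L~^- Y)
    gate = torch.sigmoid(Q)                              # sigma(Q^(t)) from the Softplus bound
    grad = D_sum_inv[:, None] * (Y - f_X) \
         + lam * (Y - A_pos @ Y) \
         - (lam_K / K) * L_neg_Y * gate
    return Y - alpha * grad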

We illustrate key aspects of YinYanGNN in Figure 1, and the explicit computational steps are shown in Algorithm 1. We emphasize that although the derivations of YinYanGNN may be somewhat complex, the resulting algorithm is quite simple and efficient.

YinYanGNN Training. Upon inspection of Algorithm 1, we observe that lines 3 to 9 depict the training process for each epoch. Here we first sample the negative graph on line 3, which is used (unique to our model) on lines 4 to 6 to produce the YinYanGNN encoder node embeddings $Y^{(T)}$; this completes one model forward pass. Next, on line 7 we again sample negative edges as needed by the supervised training loss. Given these negative samples, along with predictive probabilities obtained via node embeddings from $Y^{(T)}$ passed through the HadamardMLP decoder, we compute the link prediction training loss defined by (1) on line 8. Finally, we update model parameters $\{W,\theta\}$ (and optionally $\lambda_{K}$) by standard backpropagation on line 9; this executes one model backward pass (in our experiments we use the Adam optimizer).

YinYanGNN Inference. For inference, we essentially follow one forward pass of the training process from above to obtain encoder node embeddings $Y^{(T)}$. These are then passed to the HadamardMLP decoder to produce link prediction scores or edge probability estimates analogous to line 8, but without the loss computation.

Input: Graph $\mathcal{G}$, node features $X$, number of layers $T$, number of epochs $M$, trainable base model $f(\cdot;W)$, HadamardMLP predictor $h(\cdot;\theta)$
Output: Trained base model weights $W$ and HadamardMLP predictor weights $\theta$.
1  Set initial prediction $Y^{(0)}=f(X;W)$, where $f$ is the trainable base model.
2  for Epoch = 1 to $M$ do
3      Sample negative graph $\mathcal{G}^{-}$.
4      for $t=0$ to $T-1$ do
5          $Y^{(t+1)}=\text{Update}(Y^{(t)})$, computed via (13).
6      end for
7      Sample negative edges for supervision based on $\mathcal{E}_{train}$.
8      Compute the link prediction loss (1) with node embeddings $Y^{(T)}$ and HadamardMLP $h(\cdot;\theta)$.
9      Backpropagate over parameters $\{W,\theta\}$ (and optionally $\lambda_{K}$) using an optimizer (Adam, etc.).
10 end for
Algorithm 1: YinYanGNN Bilevel Optimization Algorithm for Link Prediction
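For completeness, a condensed PyTorch-style sketch of the training loop in Algorithm 1, reusing the hypothetical helpers from the earlier sketches (yinyang_layer, HadamardMLPDecoder, link_prediction_loss); the negative-sampling routines and hyperparameter defaults are likewise illustrative placeholders rather than the released code.

import torch

def train_yinyangnn(X, A_pos_norm, pos_edges, base_model, decoder,
                    sample_negative_graph, sample_negative_edges,
                    lam=1.0, lam_K=1.0, K=1, gamma=0.0, alpha=0.5, T=8, M=100, N=1):
    params = list(base_model.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    num_edges = pos_edges.shape[0]
    for epoch in range(M):
        A_neg_norm, D_sum_inv = sample_negative_graph()  # line 3: G^- plus diag of (D + D^-)^{-1}
        Y = f_X = base_model(X)                          # line 1: Y^(0) = f(X; W)
        for _ in range(T):                               # lines 4-6: unfolded updates via (13)
            Y = yinyang_layer(Y, f_X, A_pos_norm, A_neg_norm, D_sum_inv,
                              lam, lam_K, K, gamma, alpha, num_edges)
        neg_edges = sample_negative_edges(N)             # line 7: negatives for the loss (1)
        loss = link_prediction_loss(decoder, Y, pos_edges, neg_edges, N)   # line 8
        opt.zero_grad(); loss.backward(); opt.step()     # line 9: backward pass over {W, theta}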

V Analysis

In this section we first address the computational complexity and convergence issues of our model before turning to further insights into the role of negative sampling in our proposed energy function.

V-A Time Complexity

YinYanGNN has a time complexity of $O(T|\mathcal{E}|Kd+nPd^{2})$ for one forward pass, where as before $n$ is the number of nodes, $T$ is the number of propagation layers/iterations, $P$ is the number of MLP layers in the base model $f(X;W)$, and $d$ is the hidden embedding size. Notably, this complexity is of the same order as a vanilla GCN model, one of the most common GNN architectures [7], which has a forward-pass complexity of $O(T(|\mathcal{E}|d+nd^{2}))$.

TABLE II: Time complexity comparisons. For BUDDY, $b$ is the hop number for propagation and $h$ is the complexity of hash operations. * indicates that the complexity can be sublinear in $n$ via [16], an option that is only available to node-wise embedding models such as YinYanGNN.

SEAL BUDDY YinYanGNN GCN
Preprocess    $O(1)$    $O(b|\mathcal{E}|(d+h))$    $O(1)$    $O(1)$
Train    $O(|\mathcal{E}|d^{2}|\mathcal{E}_{tr}|)$    $O((b^{2}h+bd^{2})|\mathcal{E}_{tr}|)$    $O(T|\mathcal{E}|Kd+nPd^{2}+|\mathcal{E}_{tr}|d^{2})$    $O(T(|\mathcal{E}|d+nd^{2})+|\mathcal{E}_{tr}|d^{2})$
Encode    $O(|\mathcal{E}|d^{2}|\mathcal{V}_{src}|n)$    $O((b^{2}h+bd^{2})|\mathcal{V}_{src}|n)$    $O(T|\mathcal{E}|Kd+nPd^{2})$    $O(T(|\mathcal{E}|d+nd^{2}))$
Decode    -    -    $O(d^{2}|\mathcal{V}_{src}|n)^{*}$    $O(d^{2}|\mathcal{V}_{src}|n)^{*}$

We now drill down further into the details of overall inference speed. We denote the set of source nodes for test-time link prediction as $\mathcal{V}_{src}$, and for each source node we examine all the other nodes in the graph, which means we have to score roughly $|\mathcal{V}_{src}|n$ edges. We compare YinYanGNN's time with SEAL [9] and a recently proposed fast baseline BUDDY [12] in Table II. Like other node-wise methods, we split the inference time of our model into two parts: computing embeddings and decoding (for SEAL and BUDDY they are implicitly combined). The node embedding computation can be done only once, so it does not depend on $\mathcal{V}_{src}$. Our decoding process uses HadamardMLP to compute scores for each destination node (which can also be viewed as being for each edge) and retrieve the top nodes (edges). From this it is straightforward to see that the decoding time dominates the overall computation of embeddings. So for the combined inference time, SEAL is the slowest because of the factor $|\mathcal{E}|n$, while BUDDY and our model are both linear in the graph node number $n$ independently of $|\mathcal{E}|$. However, BUDDY has a larger factor including the computation of subgraph structure $b^{2}h$, which makes it much slower, as our experiments will show. Moreover, unlike SEAL or BUDDY, we can apply Flashlight [16] to node-wise methods like YinYanGNN, an accelerated decoding method based on maximum inner product search (MIPS) that allows HadamardMLP to achieve sublinear decoding complexity in $n$. See Section VI-C for related experiments and discussion.

V-B Convergence of YinYanGNN Layers/Iterations

Convergence criteria for the energies from (3) and (11) are as follows (see Appendix X-B for proofs):

Proposition V.1.

If $\lambda_{K}<K\cdot\delta_{max}$, where $\delta_{max}$ is the largest eigenvalue of $L^{-}$, then (3) has a unique global minimum. Moreover, if the step-size parameter satisfies $\alpha<\left\|I+\lambda\tilde{L}-\frac{\lambda_{K}}{K}\tilde{L}^{-}\right\|_{F}^{-1}$, then the gradient descent iterations of (5) are guaranteed to converge to this minimum.

Proposition V.2.

There exists an $\alpha^{\prime}>0$ such that for any $\alpha\in(0,\alpha^{\prime}]$, the iterations (13) will converge to a stationary point of (11).

V-C Role of Negative Sampling in Proposed Energy

Previous work [49, 50] has demonstrated how randomly dropping edges of two isomorphic graphs can produce non-isomorphic, distinguishable subgraphs, and in so doing, enhance GNN expressiveness. Our incorporation of negative samples within the YinYanGNN encoder/forward pass can loosely be viewed in similar terms, but with a novel twist: Instead of randomly dropping edges in the original graph, we randomly add typed negative edges that are likewise capable of breaking otherwise indistinguishable symmetries.

Figure 2: Modified from [10]. Solid lines represent the original edges and dashed lines represent negative edges sampled in our model architecture (for simplicity we do not draw all negative edges).

Figure 2 serves to illustrate how the inclusion of negative samples within the forward pass of our model can potentially increase the expressiveness beyond traditional node-wise embedding approaches. As observed in the figure, $v_{2}$ and $v_{3}$ are isomorphic nodes in the original graph (solid lines). However, when negative samples/edges are included, the isomorphism no longer holds, meaning that link $(v_{1},v_{2})$ and link $(v_{1},v_{3})$ can be distinguished by a node-wise embedding method even without unique discriminating input features. Moreover, when combined with the flexibility of learning to balance multiple negative sampling graphs as in (10) through the trainable weights $\{\lambda_{K}^{k}\}$, the expressiveness of YinYanGNN becomes strictly greater than or equal to a vanilla node-wise embedding method (with equivalent capacity) that has no explicit access to the potential symmetry-breaking influence of negative samples.

Critically though, these negative samples are not arbitrarily inserted into our modeling framework. Rather, they emerge by taking gradient steps (12) over a principled regularization factor (within (10)) designed to push the embeddings of nodes sharing a negative edge apart during the forward pass. In a Section VII-E ablation we compare this unique YinYanGNN aspect with the alternative strategy of using random node features for breaking isomorphisms.

We conclude here by remarking that the goal of YinYanGNN, and the supporting analysis of this section, is not to explicitly match the expressiveness of edge-wise link prediction methods; even in principle this is not feasible given the additional information edge-wise methods have access to. Instead, the guiding principle of YinYanGNN is more aptly distilled as follows: conditioned on the limitations of a purely node-wise link prediction architecture, to what extent can we improve model expressiveness (and ultimately accuracy) without compromising the scalability which makes node-wise methods attractive in the first place.

VI Experiments

TABLE III: Results on link prediction benchmarks. Baseline results are cited from prior work [12, 18] and the OGB leaderboard (comparisons with additional baselines can be found in the Appendix X-A). "-" means not reported. The format is average score ± standard deviation. The best results are bold-faced and underlined. OOM means out of GPU memory.

Cora Citeseer Pubmed Collab PPA Citation2 DDI
HR@100 HR@100 HR@100 HR@50 HR@100 MRR HR@20
Edge-wise GNNs
SEAL        81.71±1.30    83.89±2.15    75.54±1.32    64.74±0.43    48.80±3.16    87.67±0.32    30.56±3.86
NBFnet      71.65±2.27    74.07±1.75    58.73±1.99    OOM           OOM           OOM           4.00±0.58
Neo-GNN     80.42±1.31    84.67±2.16    73.93±1.19    57.52±0.37    49.13±0.60    87.26±0.84    63.57±3.52
GDGNN       -             -             -             54.74±0.48    45.92±2.14    86.96±0.28    -
SUREL       -             -             -             63.34±0.52    53.23±1.03    \underline{\mathbf{89.74±0.18}}    -
SUREL+      -             -             -             64.10±1.06    54.32±0.44    88.90±0.06    -
BUDDY       88.00±0.44    92.93±0.27    74.10±0.78    65.94±0.58    49.85±0.20    87.56±0.11    78.51±1.36
Node-wise GNNs
GCN         66.79±1.65    67.08±2.94    53.02±1.39    44.75±1.07    18.67±1.32    84.74±0.21    37.07±5.07
SAGE        55.02±4.03    57.01±3.74    39.66±0.72    48.10±0.81    16.55±2.40    82.60±0.36    53.90±4.74
YinYanGNN   \underline{\mathbf{93.83±0.78}}    \underline{\mathbf{94.45±0.53}}    \underline{\mathbf{90.73±0.40}}    \underline{\mathbf{66.10±0.20}}    \underline{\mathbf{54.64±0.49}}    86.21±0.09    \underline{\mathbf{80.92±3.35}}

VI-A Experimental Setup

Datasets and Evaluation Metrics.

We evaluate YinYanGNN for link prediction on the Planetoid datasets Cora [51], Citeseer [52], and Pubmed [53], and on the Open Graph Benchmark (OGB) link prediction datasets [45]: ogbl-collab, ogbl-PPA, ogbl-Citation2, and ogbl-DDI. Planetoid represents classic citation network data, whereas OGB involves challenging, multi-domain, diverse benchmarks involving large graphs. Detailed statistics are summarized in Appendix X-C. We adopt the hits ratio @$k$ (HR@$k$) as the main evaluation metric for the Planetoid datasets, as in [12]. This metric computes the ratio of positive edges ranked at or above the $k$-th place among the candidate negative edges at test time. We set $k$ to 100 for these three datasets. For the OGB datasets, we follow the official settings. Note that the metric for ogbl-Citation2 is Mean Reciprocal Rank (MRR), meaning the reciprocal rank of positive edges among all the negative edges, averaged over all source nodes. Finally, we choose the test results based on the best validation results. We also randomly select 10 different seeds and report average results and standard deviations for all datasets. For implementation details and hyperparameters, please refer to Appendix X-C.
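For clarity, a small sketch of how the two evaluation metrics described above can be computed from model scores; this is a generic illustration of HR@k and MRR rather than the exact OGB evaluator.

import torch

def hits_at_k(pos_scores, neg_scores, k=100):
    """Fraction of positive edges scored at or above the k-th strongest negative."""
    if neg_scores.numel() < k:
        return 1.0
    threshold = torch.topk(neg_scores, k).values[-1]   # score of the k-th best negative
    return (pos_scores >= threshold).float().mean().item()

def mrr(pos_scores, neg_scores_per_pos):
    """Mean reciprocal rank of each positive edge against its own negative candidates."""
    # neg_scores_per_pos: [num_pos, num_neg] candidate negative scores per source node
    ranks = 1 + (neg_scores_per_pos >= pos_scores.unsqueeze(1)).sum(dim=1)
    return (1.0 / ranks.float()).mean().item()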

Baseline Models.

To calibrate the effectiveness of our model, in this section we conduct comprehensive comparisons with node-wise GNNs: GCN [7] and GraphSage [8], and edge-wise GNNs: SEAL [9], NeoGNN [17], NBFnet [11], BUDDY [12], GDGNN [21], SUREL [22], and SUREL+ [18]. We defer to Appendix X-A additional experiments spanning traditional link prediction heuristics: Common Neighbors (CN) [4], Adamic-Adar (AA) [5], and Resource Allocation (RA) [6]; non-GNN methods: MLP, Node2vec [54], and Matrix Factorization (MF) [1]; knowledge graph (KG) methods: TransE [55], ComplEx [56], and DistMult [57]; additional node-wise GNNs: GAT [20], GIN [33], JKNet [58], and GCNII [59]; and finally distillation methods.

Overall, for more standardized comparisons, we have chosen baselines based on published papers with open-source code and exclude those methods relying on heuristic augmentation strategies like anchor distances, or non-standard losses for optimization that can be applied more generally. Section VI-D also addresses additional link prediction baseline methods from the OGB leaderboard, including complementary methods that can be applied to YinYanGNN to further improve performance with additional complexity.

Figure 3: Log-scale inference time. Citation2 and PPA are the two largest OGB link prediction graphs.

VI-B Link Prediction Results

Accuracy results are displayed in Table III, where we observe that YinYanGNN achieves the best performance on 6 out of 7 datasets (while remaining competitive across all 7) even when compared against more time-consuming or inference-inefficient edge-wise methods. And we outperform node-wise methods by a large margin (including several others shown in Appendix X-A), demonstrating that YinYanGNN can achieve outstanding predictive accuracy without sacrificing efficiency. Similarly, as included in the Appendix X-A, YinYanGNN also outperforms a variety of non-GNN link prediction baselines.

VI-C Efficiency Analysis

Inference Time. We next present comparisons in terms of inference speed, which is often the key factor determining whether or not a model can be deployed in real-world scenarios. For example, in an online system, providing real-time recommendations may require quickly evaluating a large number of candidate links. Figure 3 reports the results, again relative to both edge- and node-wise baseline models. Noting the log-scale time axis, from these results we observe that YinYanGNN is significantly faster than all the edge-wise models, and nearly identical to the fellow node-wise approaches as expected. Relative to the fastest edge-wise model, Neo-GNN, YinYanGNN is still simultaneously more efficient (Figure 3) and also much more accurate (Table III). Additional related details and inference-time results are deferred to Appendix X-A. We also remark that, as a node-wise model, the efficiency of YinYanGNN can be further improved to sublinear complexity using Flashlight [16]; however, at the time of submission public Flashlight code was not available, so we defer consideration to future work.

Decoding Time. Besides full inference times, in practice, we can save the embeddings output from the encoder and only compare the decoding step, which is a key differentiator. To this end, we compare the decoding speed of YinYanGNN with SEAL, a highly-influential edge-wise model, and BUDDY, which represents a strong baseline with a good balance between accuracy and speed for edge-wise GNNs. We conduct the experiments according to [16]. Table IV displays the results, where our model achieves the fastest performance compared with these two baselines. We also note that although the time complexity of BUDDY is linear in nn (which is the same as our model without Flashlight), the different scaling coefficients caused by the required subgraph information calculation still make BUDDY much slower than YinYanGNN.

TABLE IV: Decoding speed at inference time on Citation2 as measured by nodes/sec. (higher is better).

SEAL BUDDY YinYanGNN YinYanGNN(dot) YinYanGNN(dot MIPS)
nodes/second 0.00024 0.005 0.6 1.3 25

Finally, although the code for Flashlight is not yet publicly available, we can mimic the acceleration it would produce as follows. As Flashlight is a repeated version of MIPS, which can be applied to dot-product prediction, we also run MIPS on our model with a dot-product decoder to see how much MIPS can accelerate decoding. We run the experiments on CPU and achieve a 20x speedup (compare the last two columns of Table IV, with and without MIPS).
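As an illustration of the dot-product retrieval being accelerated here, the sketch below performs exact top-k inner-product search with NumPy as a brute-force stand-in for MIPS; an approximate MIPS library would replace the scoring step, and the function name and array shapes are our own assumptions.

import numpy as np

def top_k_destinations(src_embeddings, all_embeddings, k=20):
    """For each source node, return the k destination nodes with the largest
    inner-product (dot-product decoder) scores -- a brute-force stand-in for MIPS."""
    scores = src_embeddings @ all_embeddings.T                   # [num_src, n]
    top_idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]     # unordered top-k per row
    row = np.arange(scores.shape[0])[:, None]
    order = np.argsort(-scores[row, top_idx], axis=1)            # sort the k candidates
    return np.take_along_axis(top_idx, order, axis=1)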

Training Time. We note that as is often the case for link prediction, our focus is on efficient inference, since the number of candidate node pairs grows quadratically in the graph size. That being said, we nonetheless provide a representative comparison here of training times. Specifically, we measure the training times on OGBL-PPA using identical training sets and platforms. BUDDY requires approximately 3 hours, whereas our YinYanGNN model accomplishes the same training task in approximately 12.5 minutes, underscoring a substantial discrepancy in training efficiency.

VI-D Comparisons with Recent OGB Leaderboard Models

General Comments. After the edge-wise SEAL algorithm was originally published, not many new scalable node-wise alternatives have been proposed, likely because it has been difficult to achieve competitive performance. For example, there are actually very few node-wise entries atop the OGB link prediction leaderboard; almost all are less efficient edge-wise methods or related.

In fact, over the course of preparing this submission, there was no model of any kind with OGB leaderboard performance above ours on all datasets, and only a single published node-wise model on the leaderboard above ours on any dataset: this is model CFLP w/ JKNet [60], and only when applied to DDI. But when we examine the CFLP paper, we find that their performance is much lower than YinYanGNN on other non-OGB datasets as shown in Table V.

TABLE V: Comparison of YinYanGNN with CFLP.

Cora Citeseer Pubmed
Model HR@20 HR@20 HR@20
CFLP w/ JKNet    65.57±1.05    68.09±1.49    44.90±2.00
YinYanGNN        \mathbf{84.10±3.23}    \mathbf{90.52±0.72}    \mathbf{72.38±5.31}

Neural Common Neighbors (NCN). Another recent approach with strong leaderboard performance worth mentioning here is NCN [61], which relies on a node-wise encoder as with YinYanGNN. However, the key feature of NCN is a decoder based on collecting common neighbors of the nodes spanning each candidate link, which leads to a significantly higher inference cost. Even so, the NCN decoder itself can easily be integrated with any node-wise encoder, including YinYanGNN, to potentially increase accuracy, albeit with a much higher inference time. Results on OGB-Citation2 (the largest OGB link prediction task) are listed in Table VI. These results demonstrate that YinYanGNN with an NCN decoder can actually outperform both the original NCN MRR from [61] as well as all of the edge-wise models in Table III. The main downside is that inference time increases by $\sim 40\times$ (see Table VI) over YinYanGNN in its original form. This further highlights the challenge of achieving the highest accuracy while preserving the full efficiency of purely node-wise models.

TABLE VI: Better Performances on Citation2 with NCN

Inference Time (s) Citation2 (MRR)
NCN [61]            116    88.64±0.14
YinYanGNN           3      86.21±0.09
YinYanGNN w/ NCN    117    \mathbf{90.03±0.02}

PLNLP and GIDN. Beyond the above, we conclude by commenting on two additional leaderboard methods. First, there is an additional, unpublished node-wise model on the leaderboard called PLNLP [42] that does lead to strong accuracy on both Collab and DDI (two relatively small datasets), but the main contribution is a new training loss that is orthogonal to our method. In fact any node-wise model, including ours, could be trained with their loss to improve performance. The only downside is that the PLNLP approach seems to be dataset dependent, with lesser performance on other benchmarks, e.g., PLNLP only achieves 32.38 Hits@100 on PPA and 84.92 MRR on Citation2. Additionally, the GIDN model [62] follows the same loss as PLNLP and again achieves good results restricted to the smaller-scale Collab and DDI benchmarks (no other results are shown). However, the companion GIDN paper [62] is unpublished and extremely brief, with just a couple short paragraphs explaining the method and no assessment of inference time complexity. Hence we are unsure of how this model compares against node-wise models when applied to large-scale datasets as is our focus.

VII Ablations

VII-A Ablation over Negative Sampling.

As the integration of both positive and negative samples within a unified node-wise embedding framework is a critical component of our model, in Table VII we report results both with and without the inclusion of the negative sampling penalty in our lower-level embedding model from (11). Clearly the former displays notably superior performance, as expected.

TABLE VII: Performance of YinYanGNN with or without negative sampling in model architecture.

Pubmed Collab PPA Citation2 DDI
YinYanGNN HR@100 HR@50 HR@100 MRR HR@20
W/O Negative    86.70±3.23    63.02±0.44    49.36±2.91    83.45±0.21    57.80±5.39
W/ Negative     \mathbf{90.73±0.40}    \mathbf{66.10±0.20}    \mathbf{54.64±0.49}    \mathbf{86.21±0.09}    \mathbf{80.92±3.35}

VII-B Learning Negative Graphs

One benefit of learning $\{\lambda_{K}^{k}\}$ is that it allows us to better match the performance of a larger $K$ with fixed $\lambda_{K}$ using a smaller $K$ but with learnable $\{\lambda_{K}^{k}\}$. The latter is more practical given the greater efficiency of a smaller $K$. Figures 4 and 5 demonstrate this phenomenon, whereby a learnable $\{\lambda_{K}^{k}\}$ achieves better accuracy for smaller values of $K$ on the Cora and Pubmed datasets (for larger $K$ performance is similar).

For bigger datasets where using a larger $K$ can be prohibitively expensive, this capability can be exploited to achieve higher accuracy. For example, on PPA and Citation2 a larger $K$ may well improve accuracy, but this is not computationally cheap without sampling or other approximations. In contrast, we can improve performance with small $K$ just by learning $\{\lambda_{K}^{k}\}$, as shown in Table VIII. (Conversely, for dense graphs a larger $K$ is less likely to be helpful, in part because the extra edges make the negative graph much more dense and more similar to a complete graph, which results in the loss of structural information. Not surprisingly then, we found that learning $\{\lambda_{K}^{k}\}$ did not improve performance on the dense DDI dataset, where $K=1$ was optimal.)

Figure 4: Learnable or fixed $\{\lambda_{K}^{k}\}$ performance on Cora; standard deviation shown by shaded regions.
Figure 5: Learnable or fixed $\{\lambda_{K}^{k}\}$ performance on Pubmed; standard deviation shown by shaded regions.
TABLE VIII: Better performance on PPA and Citation2 with learnable $\{\lambda_{K}^{k}\}$.

PPA Citation2
HR@100 MRR
fixed $\{\lambda_{K}^{k}\}$        52.32±0.24    85.80±0.08
learnable $\{\lambda_{K}^{k}\}$    \mathbf{54.64±0.49}    \mathbf{86.21±0.09}

VII-C Different Negative Samplers

As shown in Table IX, the YinYanGNN architecture is tolerant to the use of different negative samplers, from sampling negative edges uniformly over the whole graph (the global uniform sampler) to fixing the source node and sampling only negative destination nodes (the source uniform sampler).
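A brief sketch contrasting the two samplers; the function names and the [num_edges, 2] layout of the positive edge list are illustrative assumptions.

import torch

def global_uniform_negatives(num_nodes, num_samples):
    """Sample negative edges uniformly over all node pairs in the graph."""
    src = torch.randint(0, num_nodes, (num_samples,))
    dst = torch.randint(0, num_nodes, (num_samples,))
    return torch.stack([src, dst], dim=1)

def source_uniform_negatives(pos_edges, num_nodes):
    """Keep each positive edge's source node and resample only the destination."""
    dst = torch.randint(0, num_nodes, (pos_edges.shape[0],))
    return torch.stack([pos_edges[:, 0], dst], dim=1)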

TABLE IX: Performance of YinYanGNN with different samplers. Global means the global uniform sampler; Source means the source uniform sampler.

Cora Citeseer Pubmed
Sampler HR@100 HR@100 HR@100
Global    92.63±1.17    93.43±0.49    90.54±0.49
Source    92.69±0.21    94.45±0.53    90.40±0.28

VII-D Negative sampling in the forward pass of common GNNs

As we discussed in Section IV, heuristically embedding negative samples within an existing GNN architecture is another possible strategy, but in this case the negative sampling terms are not integrated via a principled energy function. Following the role of negative sampling in our energy function, we modify the GCN architecture to incorporate negative sampling in the forward model as follows:

Y^{(t+1)}=\text{ReLU}\left[\left(\tilde{A}-\frac{\lambda_{K}}{K}\tilde{A}^{-}\right)Y^{(t)}W^{(t)}\right],    (15)

where $W^{(t)}$ are the layer-wise weights and $K$ is the number of negative edges per positive edge. As before, $\lambda_{K}$ is the hyperparameter that controls the impact of negative edges. As shown in Table X, negative sampling is also useful for GCNs on some datasets, but the performance is still not as good as YinYanGNN, where the negative samples are anchored to the proposed energy function.
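A minimal sketch of the modified GCN layer in (15); the dense normalized adjacency matrices and the layer class are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NegSampleGCNLayer(nn.Module):
    """GCN layer of Eq. (15): propagate over A~ minus a scaled negative-graph A~^-."""
    def __init__(self, in_dim, out_dim, lam_K=1.0, K=1):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)   # layer weights W^(t)
        self.scale = lam_K / K

    def forward(self, Y, A_pos_norm, A_neg_norm):
        prop = A_pos_norm @ Y - self.scale * (A_neg_norm @ Y)
        return F.relu(self.lin(prop))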

TABLE X: Performance of GCN with negative edge sampling, compared with vanilla GCN and YinYanGNN.

Cora Citeseer Pubmed
model HR@100 HR@100 HR@100
GCN                66.79±1.65    67.08±2.94    53.02±1.39
GCN W/ negative    76.15±0.69    67.80±3.74    74.96±1.15
YinYanGNN          \mathbf{93.83±0.78}    \mathbf{94.45±0.53}    \mathbf{90.73±0.40}

VII-E Random Features versus Negative edges

In Section V-C, we showed that random edges can in principle break the stated isomorphism. However, another potential way to accomplish this effect is to use random features that disambiguate each node. Unfortunately, this can introduce undesirable auxiliary effects, which become apparent when we analyze the resulting energy function when random features are applied instead of negative sampling.

When we add random features to (3) we obtain the modified form

\ell_{node_{r}}(Y)=\|Y-f(X+X_{R};W)\|^{2}_{F}+\lambda\,\text{tr}[Y^{\top}LY].    (16)

For simplicity of illustration, we consider an $f$ that obeys the distributive property, i.e., $f(X+X_{R};W)=f(X;W)+f(X_{R};W)$. We then compute the gradient

\frac{\partial\ell_{node_{r}}}{\partial Y}=2(Y-f(X+X_{R};W))+2\lambda LY=2(Y-f(X;W))+2\lambda LY-2f(X_{R};W).    (17)

From these expressions we observe that random features merely introduce random blurriness to the gradient updates; they do not emerge from an otherwise useful regularization factor. In contrast, with YinYanGNN, the negative samples create gradients that push the node embeddings of nodes that do not share a true edge apart, a useful form of structural regularization that is independent of any mechanism to break isomorphisms per se.

We support these points with additional experiments in Table XI, where random features are introduced and compared against our original YinYanGNN model. From these results we observe that random features can actually degrade performance and introduce instabilities as evidenced by the reduced accuracy and larger standard deviations.

TABLE XI: Performance of YinYanGNN variants.

Cora Citeseer Pubmed
YinYanGNN HR@100 HR@100 HR@100
w/o negative edges 91.72±0.33 92.61±0.31 86.70±3.23
w/ random features 91.25±1.41 89.69±0.78 79.82±5.15
w/ negative edges 93.83±0.78 94.45±0.53 90.73±0.40

VII-F Proportion of Negative Graphs

Finally, in Table XII, we examine how YinYanGNN performance is impacted as we vary the relative proportion assigned to negative graphs, which amounts to the relative weighting of \lambda and \lambda_{K}. Specifically, we fix \lambda=1 and vary \lambda_{K} from 0.1 to 10 (in each case \lambda_{K} is fixed, not learnable), training YinYanGNN on the small datasets Cora, Citeseer, and Pubmed as well as the large dataset Citation2. Performance is generally best with \lambda_{K}\in[0.1,1] across all datasets, with the smaller datasets favoring \lambda_{K}=1 and the largest favoring \lambda_{K}=0.1.

TABLE XII: Performance of YinYanGNN with different negative graph proportions.

Cora Citeseer Pubmed Citation2
\lambda:\lambda_{K} HR@100 HR@100 HR@100 MRR
1:0.1 93.26±0.95 92.44±0.59 87.08±4.45 85.80±0.08
1:1 93.83±0.78 94.45±0.53 90.73±0.40 81.76±0.08
1:3 92.23±1.08 92.42±1.10 63.22±2.63 67.65±0.08
1:5 91.54±1.01 92.45±0.63 63.20±2.20 58.75±0.03
1:10 87.31±1.81 92.50±0.34 57.00±1.24 49.79±0.12

VIII Conclusion

In conclusion, we have proposed YinYanGNN, a link prediction model that achieves accuracy on par with far more expensive edge-wise models while retaining the efficiency of relatively cheap node-wise alternatives. This balance is accomplished using a novel node-wise architecture that incorporates informative negative samples/edges into the design of the model architecture itself to increase expressiveness, as opposed to merely using negative samples to compute a training signal as in prior work. Given the critical importance of inference speed in link prediction applications, YinYanGNN represents a promising candidate for practical usage. In terms of limitations, we have not yet integrated our implementation with Flashlight for maximally accelerated inference, as public code is not yet available.

IX Acknowledgments

This work was supported by the National Key Research and Development Program of China (No. 2022ZD0160102), and was conducted while the first author was an intern at Amazon.

References

  • [1] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” Computer, vol. 42, no. 8, pp. 30–37, 2009.
  • [2] B. P. Chamberlain, E. Rossi, D. Shiebler, S. Sedhain, and M. M. Bronstein, “Tuning word2vec for large scale recommendation systems,” in RecSys, 2020.
  • [3] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling, “Modeling relational data with graph convolutional networks,” in The semantic web: 15th international conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, proceedings 15.   Springer, 2018, pp. 593–607.
  • [4] A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.
  • [5] L. A. Adamic and E. Adar, “Friends and neighbors on the web,” Social networks, vol. 25, no. 3, pp. 211–230, 2003.
  • [6] T. Zhou, L. Lü, and Y.-C. Zhang, “Predicting missing links via local information,” The European Physical Journal B, vol. 71, no. 4, pp. 623–630, oct 2009.
  • [7] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in ICLR, 2017.
  • [8] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” in NeurIPS, 2017.
  • [9] M. Zhang and Y. Chen, “Link prediction based on graph neural networks,” in NeurIPS, 2018.
  • [10] M. Zhang, P. Li, Y. Xia, K. Wang, and L. Jin, “Labeling trick: A theory of using graph neural networks for multi-node representation learning,” NeurIPS, vol. 34, pp. 9061–9073, 2021.
  • [11] Z. Zhu, Z. Zhang, L.-P. Xhonneux, and J. Tang, “Neural bellman-ford networks: A general graph neural network framework for link prediction,” in NeurIPS, 2021.
  • [12] B. P. Chamberlain, S. Shirobokov, E. Rossi, F. Frasca, T. Markovich, N. Y. Hammerla, M. M. Bronstein, and M. Hansmire, “Graph neural networks for link prediction with subgraph sketching,” in ICLR, 2022.
  • [13] A. Shrivastava and P. Li, “Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search (MIPS),” in Advances in Neural Information Processing Systems, vol. 27, 2014.
  • [14] B. Neyshabur and N. Srebro, “On Symmetric and Asymmetric LSHs for Inner Product Search,” in Proceedings of the 32nd International Conference on Machine Learning.   PMLR, Jun. 2015, pp. 1926–1934.
  • [15] H.-F. Yu, C.-J. Hsieh, Q. Lei, and I. S. Dhillon, “A Greedy Approach for Budgeted Maximum Inner Product Search,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [16] Y. Wang, B. Hooi, Y. Liu, T. Zhao, Z. Guo, and N. Shah, “Flashlight: Scalable link prediction with effective decoders,” in Learning on Graphs Conference.   PMLR, 2022, pp. 14–1.
  • [17] S. Yun, S. Kim, J. Lee, J. Kang, and H. J. Kim, “Neo-gnns: Neighborhood overlap-aware graph neural networks for link prediction,” in NeurIPS, 2021.
  • [18] H. Yin, M. Zhang, J. Wang, and P. Li, “Surel+: Moving from walks to sets for scalable subgraph-based graph representation learning,” Proceedings of the VLDB Endowment, vol. 16, no. 11, pp. 2939–2948, 2023.
  • [19] Y. Yang, T. Liu, Y. Wang, J. Zhou, Q. Gan, Z. Wei, Z. Zhang, Z. Huang, and D. Wipf, “Graph neural networks inspired by classical iterative algorithms,” in International Conference on Machine Learning, 2021.
  • [20] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv:1710.10903, 2017.
  • [21] L. Kong, Y. Chen, and M. Zhang, “Geodesic graph neural network for efficient graph representation learning,” in NeurIPS, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35, 2022, pp. 5896–5909.
  • [22] H. Yin, M. Zhang, Y. Wang, J. Wang, and P. Li, “Algorithm and system co-design for efficient subgraph-based graph representation learning,” Proceedings of the VLDB Endowment, vol. 15, no. 11, pp. 2788–2796, 2022.
  • [23] H. Ahn, Y. Yang, Q. Gan, T. Moon, and D. Wipf, “Descent steps of a relation-aware energy produce heterogeneous graph neural networks,” arXiv:2206.11081, 2022.
  • [24] S. Chen and Y. C. Eldar, “Graph signal denoising via unrolling networks,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 5290–5294.
  • [25] X. Liu, W. Jin, Y. Ma, Y. Li, H. Liu, Y. Wang, M. Yan, and J. Tang, “Elastic graph neural networks,” in International Conference on Machine Learning, 2021.
  • [26] Y. Ma, X. Liu, T. Zhao, Y. Liu, J. Tang, and N. Shah, “A unified view on graph neural networks as graph signal denoising,” arXiv:2010.01777, 2020.
  • [27] X. Pan, S. Song, and G. Huang, “A unified framework for convolution-based graph neural networks,” 2021. [Online]. Available: https://openreview.net/forum?id=zUMD--Fb9Bt
  • [28] H. Zhang, T. Yan, Z. Xie, Y. Xia, and Y. Zhang, “Revisiting graph convolutional network on semi-supervised node classification from an optimization perspective,” CoRR, 2020.
  • [29] M. Zhu, X. Wang, C. Shi, H. Ji, and P. Cui, “Interpreting and unifying graph neural networks with an optimization framework,” arXiv:2101.11859, 2021.
  • [30] Z. Wang, Q. Ling, and T. Huang, “Learning deep 0\ell_{0} encoders,” in AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
  • [31] J. Feng, Y. Chen, F. Li, A. Sarkar, and M. Zhang, “How powerful are k-hop message passing graph neural networks,” Advances in Neural Information Processing Systems, vol. 35, pp. 4776–4790, 2022.
  • [32] F. Frasca, B. Bevilacqua, M. Bronstein, and H. Maron, “Understanding and extending subgraph gnns by rethinking their symmetries,” Advances in Neural Information Processing Systems, vol. 35, pp. 31 376–31 390, 2022.
  • [33] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv:1810.00826, 2018.
  • [34] B. Zhang, S. Luo, L. Wang, and D. He, “Rethinking the expressive power of gnns via graph biconnectivity,” in The Eleventh International Conference on Learning Representations, 2023.
  • [35] B. Zhang, G. Feng, Y. Du, D. He, and L. Wang, “A complete expressiveness hierarchy for subgraph GNNs via subgraph weisfeiler-lehman tests,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202.   PMLR, 23–29 Jul 2023, pp. 41 019–41 077. [Online]. Available: https://proceedings.mlr.press/v202/zhang23k.html
  • [36] C. Morris, F. Geerts, J. Tönshoff, and M. Grohe, “WL meet VC,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202.   PMLR, 23–29 Jul 2023, pp. 25 275–25 302. [Online]. Available: https://proceedings.mlr.press/v202/morris23a.html
  • [37] J. Zhou, J. Feng, X. Wang, and M. Zhang, “Distance-restricted folklore weisfeiler-leman gnns with provable cycle counting power,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [38] B. Srinivasan and B. Ribeiro, “On the equivalence between positional node embeddings and structural graph representations,” in International Conference on Learning Representations, 2020.
  • [39] Y. You, T. Chen, Y. Sui, T. Chen, Z. Wang, and Y. Shen, “Graph contrastive learning with augmentations,” in NeurIPS, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33, 2020, pp. 5812–5823.
  • [40] J. Qiu, Q. Chen, Y. Dong, J. Zhang, H. Yang, M. Ding, K. Wang, and J. Tang, “Gcc: Graph contrastive coding for graph neural network pre-training,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD ’20.   New York, NY, USA: Association for Computing Machinery, 2020, p. 1150–1160.
  • [41] W. Shiao, Z. Guo, T. Zhao, E. E. Papalexakis, Y. Liu, and N. Shah, “Link prediction with non-contrastive learning,” arXiv:2211.14394, 2022.
  • [42] Z. Wang, Y. Zhou, L. Hong, Y. Zou, and H. Su, “Pairwise learning for neural link prediction,” arXiv:2112.02936, 2021.
  • [43] S. Rendle, W. Krichene, L. Zhang, and J. Anderson, “Neural collaborative filtering vs. matrix factorization revisited,” in Fourteenth ACM conference on recommender systems, 2020, pp. 240–248.
  • [44] C. Sun and G. Wu, “Adaptive graph diffusion networks with hop-wise attention,” arXiv:2012.15024, 2020.
  • [45] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec, “Open graph benchmark: Datasets for machine learning on graphs,” in NeurIPS, 2020.
  • [46] J. Chen, J. Mueller, V. N. Ioannidis, S. Adeshina, Y. Wang, T. Goldstein, and D. Wipf, “Does your graph need a confidence boost? Convergent boosted smoothing on graphs with tabular node features,” in International Conference on Learning Representations, 2021.
  • [47] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” Advances in Neural Information Processing Systems, 2004.
  • [48] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “Bpr: Bayesian personalized ranking from implicit feedback,” arXiv:1205.2618, 2012.
  • [49] P. A. Papp, K. Martinkus, L. Faber, and R. Wattenhofer, “Dropgnn: Random dropouts increase the expressiveness of graph neural networks,” in NeurIPS, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., 2021, pp. 21 997–22 009.
  • [50] F. Frasca, B. Bevilacqua, M. Bronstein, and H. Maron, “Understanding and extending subgraph gnns by rethinking their symmetries,” in NeurIPS, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35, 2022, pp. 31 376–31 390.
  • [51] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore, “Automating the construction of internet portals with machine learning,” Information Retrieval, vol. 3, no. 2, pp. 127–163, 2000.
  • [52] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, “Collective classification in network data,” AI Magazine, vol. 29, no. 3, pp. 93–93, 2008.
  • [53] G. Namata, B. London, L. Getoor, B. Huang, and U. EDU, “Query-driven active surveying for collective classification,” in Proceedings Mining and Learning with Graphs, 2012.
  • [54] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 2016, pp. 855–864.
  • [55] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” NeurIPS, vol. 26, 2013.
  • [56] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard, “Complex embeddings for simple link prediction,” in ICML.   PMLR, 2016, pp. 2071–2080.
  • [57] B. Yang, W.-t. Yih, X. He, J. Gao, and L. Deng, “Embedding entities and relations for learning and inference in knowledge bases,” arXiv:1412.6575, 2014.
  • [58] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, and S. Jegelka, “Representation learning on graphs with jumping knowledge networks,” in ICML.   PMLR, 2018, pp. 5453–5462.
  • [59] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li, “Simple and deep graph convolutional networks,” in ICML.   PMLR, 2020, pp. 1725–1735.
  • [60] T. Zhao, G. Liu, D. Wang, W. Yu, and M. Jiang, “Learning from counterfactual links for link prediction,” in ICML.   PMLR, 2022, pp. 26 911–26 926.
  • [61] X. Wang, H. Yang, and M. Zhang, “Neural common neighbor with completion for link prediction,” in ICLR, 2023.
  • [62] Z. Wang, Y. Guo, J. Zhao, Y. Zhang, H. Yu, X. Liao, H. Jin, B. Wang, and T. Yu, “Gidn: A lightweight graph inception diffusion network for high-efficient link prediction,” arXiv:2210.01301, 2022.
  • [63] Z. Guo, W. Shiao, S. Zhang, Y. Liu, N. V. Chawla, N. Shah, and T. Zhao, “Linkless link prediction via relational distillation,” in ICML.   PMLR, 2023, pp. 12 012–12 033.
  • [64] S. Zhang, Y. Liu, Y. Sun, and N. Shah, “Graph-less neural networks: Teaching old mlps new tricks via distillation,” in ICLR, 2021.
  • [65] S. Bubeck et al., “Convex optimization: Algorithms and complexity,” Foundations and Trends® in Machine Learning, vol. 8, no. 3-4, pp. 231–357, 2015.
  • [66] D. Bertsekas, Nonlinear Programming.   Athena Scientific, 1999.
[Uncaptioned image] Yuxin Wang received the BS degree from Fudan University and is currently a PhD student in computer science at Fudan University. His main research interests are efficiency in graph neural networks and natural language processing.
[Uncaptioned image] Xiannian Hu received the MS degree in computer science from Fudan University in 2024. He now works at TAOBAO & TMALL GROUP on pricing strategy and AI marketing, with interests in machine learning, causal inference, and operations optimization.
[Uncaptioned image] Quan Gan is a Senior Applied Scientist at Amazon. His research interests include heterogeneous graph neural networks, algorithm unrolling, and the application of graph neural networks to domains such as tabular data and hypergraph networks. He obtained his Master’s degree at New York University and Bachelor’s degree at Fudan University.
[Uncaptioned image] Xuanjing Huang is currently a Professor at the School of Computer Science, Fudan University, Shanghai, China. She obtained the PhD degree in Computer Science from Fudan University in 1998. Her research interests focus on natural language processing and information retrieval, with a particular emphasis on sentiment analysis, information extraction, pre-trained language models, and the robustness and interpretability of NLP.
[Uncaptioned image] Xipeng Qiu is a professor at the School of Computer Science, Fudan University. He received his B.S. and Ph.D. degrees from Fudan University. His research interests include natural language processing and deep learning.
[Uncaptioned image] David Wipf is a Principal Research Scientist at Amazon Web Services and an IEEE Fellow. Prior to industry positions at Amazon and Microsoft Research, he received a B.S. degree with highest honors from the University of Virginia, and M.S. and Ph.D. degrees from the University of California, San Diego as an NSF Fellow in Vision and Learning in Humans and Machines. He was later an NIH Postdoctoral Fellow at the University of California, San Francisco. His current research interests include deep generative models and graph neural networks. He is the recipient of numerous fellowships and awards including the 2012 Signal Processing Society Best Paper Award.

X Appendix

X-A Additional Experimental Results

In this section we provide further details on inference time and compare YinYanGNN against a more diverse set of classical baselines, additional expressive node-wise GNNs, and distillation methods.

X-A1 Inference Time

We show the inference time of different models on the two largest OGB datasets in Table XIII.

TABLE XIII: Inference time on test set.

Dataset SEAL Neo-GNN GDGNN SUREL SUREL+ BUDDY YinYanGNN GCN SAGE
PPA \approx 4500 s 19 s 210 s 972 s 299 s 178 s 2 s 1.7 s 1.7 s
Citation2 \approx 48000 s 63 s 1680 s 11367 s 1516 s 385 s 3 s 2.5 s 2.4 s

X-A2 Impact of the Number of Negative Samples on Performance and Inference Time

Although detailed inference time comparisons using PPA and Citation2 were shown in Figure 3 and Table XIII, we further explore how the number of negative samples affects both inference time and the corresponding performance on larger graphs. From the results shown in Table XIV we observe that YinYanGNN is robust to the number of negative samples on large datasets; hence, to prioritize efficiency, K=1 is a reasonable choice.

TABLE XIV: YinYanGNN performance and inference times versus KK on PPA and Citation2.

PPA PPA Citation2 Citation2
KK HR@100 Inference Time (s) MRR Inference Time (s)
1 54.64±0.49 0.21 86.21±0.09 1.17
2 56.60±0.38 0.29 85.95±0.06 1.35
3 55.80±0.42 0.35 85.85±0.05 1.52

X-A3 Non-GNN Approaches

Herein we show results for classical approaches as reported in [12] and the OGB leaderboards, with the exception that results for MLP and Node2vec on Cora, Citeseer and Pubmed are obtained with our own implementation described in Appendix X-C. These baselines include traditional heuristic methods (Common Neighbors (CN) [4], Adamic-Adar (AA) [5], and Resource Allocation (RA) [6]), other non-GNN methods (MLP, Node2vec [54], Matrix Factorization (MF) [1]), and knowledge graph (KG) methods (TransE [55], ComplEx [56], DistMult [57]). YinYanGNN performance exceeds that of all of these methods.
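For completeness, the three heuristics can be computed directly from a binary adjacency matrix; the NumPy sketch below is our own minimal illustration rather than the baseline implementations cited above.

```python
import numpy as np

def heuristic_scores(A, u, v):
    """Return (CN, AA, RA) scores for candidate edge (u, v) given a 0/1 adjacency matrix A."""
    common = np.flatnonzero(A[u] * A[v])   # shared neighbors of u and v
    deg = A.sum(axis=1)
    cn = float(len(common))                                               # Common Neighbors
    aa = float(sum(1.0 / np.log(deg[w]) for w in common if deg[w] > 1))   # Adamic-Adar
    ra = float(sum(1.0 / deg[w] for w in common if deg[w] > 0))           # Resource Allocation
    return cn, aa, ra
```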

TABLE XV: Results on link prediction benchmarks. Baseline results are cited from previous works [12, 18] and the OGB leaderboard, with a few exceptions mentioned in the text.

Cora Citeseer Pubmed Collab PPA Citation2 DDI
HR@100 HR@100 HR@100 HR@50 HR@100 MRR HR@20
Heuristics CN 33.92±0.46 29.79±0.90 23.13±0.15 56.44±0.00 27.65±0.00 51.47±0.00 17.73±0.00
AA 39.85±1.34 35.19±1.33 27.38±0.11 64.35±0.00 32.45±0.00 51.89±0.00 18.61±0.00
RA 41.07±0.48 33.56±0.17 27.03±0.35 64.00±0.00 49.33±0.00 51.98±0.00 27.60±0.00
Non-GNN MLP 16.66±0.24 21.49±0.30 24.85±1.14 19.27±1.29 0.46±0.00 29.06±0.16 NA
Node2vec 46.87±0.97 50.76±2.32 54.40±11.10 48.88±0.54 22.26±0.88 61.41±0.11 23.26±2.09
MF 37.37±0.00 48.96±0.02 13.79±11.97 38.86±0.29 32.29±0.94 51.86±4.43 13.68±4.75
KG TransE 67.40±1.60 60.19±1.15 36.67±0.99 29.40±1.15 22.69±0.49 76.44±0.18 6.65±0.20
ComplEx 37.16±2.76 42.72±1.68 37.80±1.39 53.91±0.50 27.42±0.49 72.83±0.38 8.68±0.36
DistMult 41.38±2.49 47.65±1.68 40.32±0.89 51.00±0.54 28.61±1.47 66.95±0.40 11.01±0.49
Node-wise GNN YinYanGNN 93.83±0.78 94.45±0.53 90.73±0.40 66.10±0.20 54.64±0.49 86.21±0.09 80.92±3.35

X-A4 More node-wise GNNs

We also include experiments with more expressive GNN architectures, namely GAT [20], GIN [33], JKNet [58], and GCNII [59], all of which can be applied to arbitrary graph structures and have readily available public implementations. We tune these baselines following our implementation details in Appendix X-C. Results are shown in Table XVI, where we observe that YinYanGNN maintains a strong advantage.

TABLE XVI: Comparison of YinYanGNN with additional node-wise GNNs.

Cora Citeseer Pubmed
Model HR@100 HR@100 HR@100
GCNII 48.11±1.86 44.23±1.55 49.72±2.39
GIN 57.23±2.87 69.19±0.41 50.97±1.61
JK-Net 67.86±2.32 54.04±3.73 62.73±4.53
GAT 79.10±0.41 64.56±2.88 41.81±2.01
YinYanGNN 93.83±0.78 94.45±0.53 90.73±0.40

X-A5 Distillation Methods

Another promising route to faster inference is to distill GNN models. Such distillation methods are complementary to our work (which mainly focuses on GNN architecture design) and could in principle be applied in parallel using YinYanGNN as a teacher model. While we reserve such an effort for future work, here we compare against existing distillation pipelines, including LLP [63] and two baseline distillation models, logit matching [64] and representation matching. Results are in Table XVII, where the splits for Cora, Citeseer and Pubmed follow [63], while for Collab we use the official splits. From the table we observe that YinYanGNN performance is considerably higher.

TABLE XVII: Results compared with distillation methods.
Cora Citeseer Pubmed Collab
Model HR@20 HR@20 HR@20 HR@50
Logit Matching 74.72±4.27 72.44±1.52 42.78±3.15 35.97±0.96
Representation Matching 75.75±1.51 65.19±5.54 44.44±2.40 36.86±0.45
LLP 78.82±1.74 77.32±2.42 57.33±2.42 49.10±0.57
YinYanGNN 85.39±2.38 84.96±2.51 61.02±6.87 66.10±0.20

X-B Proofs

X-B1 Proof for Proposition V.1

We first restate Proposition V.1 here.

Proposition X.1.

If \lambda_{K}<K/\delta_{max}, where \delta_{max} is the largest eigenvalue of L^{-}, then (3) has a unique global minimum. Moreover, if the step-size parameter satisfies \alpha<\left\|I+\lambda\tilde{L}-\frac{\lambda_{K}}{K}\tilde{L}^{-}\right\|_{F}^{-1}, then the gradient descent iterations of (5) are guaranteed to converge to this minimum.

Proof.

We first establish conditions under which (3) is strongly convex, in which case it will have a unique global minimum. We can accomplish this by showing that the smallest eigenvalue of the corresponding Hessian matrix is positive [65]. In our case, the smallest eigenvalue can be computed as 1+\lambda\delta^{+}_{min}-\frac{\lambda_{K}}{K}\delta_{max}, where \delta^{+}_{min}=0 is the smallest eigenvalue of L and \delta_{max} is the largest eigenvalue of L^{-}, which is non-negative. So to guarantee strong convexity, we require that 1-\frac{\lambda_{K}}{K}\delta_{max}>0, or equivalently that \lambda_{K}<K/\delta_{max}.

Proceeding further, if the gradient of a strongly convex function is Lipschitz continuous with Lipschitz constant C, it follows that gradient descent with step size less than C^{-1} will converge to the unique global minimum. In our setting, (3) has Lipschitz continuous gradients, with a Lipschitz constant that can be inferred via the bound

\left\|\frac{\partial\ell_{node}}{\partial Y_{1}}-\frac{\partial\ell_{node}}{\partial Y_{2}}\right\|^{2}_{F} = \left\|2\left[Y_{1}-Y_{2}+\lambda\tilde{L}(Y_{1}-Y_{2})-\frac{\lambda_{K}}{K}\tilde{L}^{-}(Y_{1}-Y_{2})\right]\right\|^{2}_{F} \leq \left\|2\left[I+\lambda\tilde{L}-\frac{\lambda_{K}}{K}\tilde{L}^{-}\right]\right\|_{F}^{2}\left\|Y_{1}-Y_{2}\right\|_{F}^{2}, \qquad (18)

which holds for any pair of points \{Y_{1},Y_{2}\}. From this it follows that C=\left\|2\left[I+\lambda\tilde{L}-\frac{\lambda_{K}}{K}\tilde{L}^{-}\right]\right\|_{F}, and hence, if our assumed step size \frac{\alpha}{2} is chosen such that \alpha<\left\|I+\lambda\tilde{L}-\frac{\lambda_{K}}{K}\tilde{L}^{-}\right\|_{F}^{-1}, the result is established.
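For intuition, the two conditions of Proposition V.1 can be checked numerically on a toy problem. The NumPy sketch below (unnormalized Laplacians and random graphs purely for illustration; variable names are our own) computes the largest eigenvalue of the negative-graph Laplacian and the Frobenius-norm step-size bound.

```python
import numpy as np

def laplacian(A):
    return np.diag(A.sum(axis=1)) - A

rng = np.random.default_rng(0)
n, K, lam, lam_K = 8, 1, 1.0, 0.5

def random_adj(p):
    A = np.triu((rng.random((n, n)) < p).astype(float), 1)
    return A + A.T

L_pos = laplacian(random_adj(0.3))   # Laplacian of positive (observed) edges
L_neg = laplacian(random_adj(0.3))   # Laplacian of sampled negative edges

delta_max = np.linalg.eigvalsh(L_neg).max()
strongly_convex = lam_K * delta_max < K        # condition lambda_K < K / delta_max

M = np.eye(n) + lam * L_pos - (lam_K / K) * L_neg
alpha_max = 1.0 / np.linalg.norm(M, "fro")     # step size must satisfy alpha < alpha_max
print(strongly_convex, alpha_max)
```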

X-B2 Proof for Proposition V.2

We first restate Proposition V.2 here before giving the proof.

Proposition X.2.

There exists an \alpha^{\prime}>0 such that for any \alpha\in(0,\alpha^{\prime}], the iterations (IV-C) will converge to a stationary point of (IV-C).

Proof.

We first show that the energy from (IV-C) has a Lipschitz continuous gradient within the compact set \|Y\|_{F}^{2}\leq b^{2}, where b>0 is some constant and the associated (local) Lipschitz constant, denoted C_{b}, will be a function of this b. To show this, we begin by calculating the second derivative of (IV-C). For this purpose we first write the gradient in the element-wise form

\frac{\partial\ell_{node}}{\partial Y_{ij}}=2(D+D^{-})^{-1}(Y_{ij}-f(X;W)_{ij})+2\lambda\sum_{k}\tilde{L}_{ik}Y_{kj}-2\frac{\lambda_{K}}{K}\sum_{k}\tilde{L}^{-}_{ik}Y_{kj}\sigma(Q). \qquad (19)

The second derivative is then

\frac{\partial^{2}\ell_{node}}{\partial Y_{ij}^{2}}=2(D+D^{-})^{-1}_{ij}Y_{ij}+2\lambda\tilde{L}_{ii}-2\frac{\lambda_{K}}{K}\tilde{L}^{-}_{ii}\sigma(Q)-2\frac{\lambda_{K}}{K}\tilde{L}^{-}_{ii}Y_{ij}\sigma(Q)(1-\sigma(Q))\frac{\partial Q}{\partial Y_{ij}}. \qquad (20)

Recall that Q=\gamma|\mathcal{E}|-\text{tr}[Y^{\top}\tilde{L}^{-}Y], so

\frac{\partial Q}{\partial Y_{ij}}=-2\sum_{k}\tilde{L}^{-}_{ik}Y_{kj}. \qquad (21)

Consequently, the second derivative becomes

\frac{\partial^{2}\ell_{node}}{\partial Y_{ij}^{2}} = 2(D+D^{-})^{-1}_{ij}Y_{ij}+2\lambda\tilde{L}_{ii}-2\frac{\lambda_{K}}{K}\tilde{L}^{-}_{ii}\sigma(Q)-2\frac{\lambda_{K}}{K}\tilde{L}^{-}_{ii}\sigma(Q)(1-\sigma(Q))\left(-2\sum_{k}\tilde{L}^{-}_{ik}Y_{kj}\right)Y_{ij} \leq 2|(D+D^{-})^{-1}_{ij}|b+2\lambda|\tilde{L}_{ii}|+\left|2\frac{\lambda_{K}}{K}\tilde{L}^{-}_{ii}\right|+4\frac{\lambda_{K}}{K}|\tilde{L}^{-}_{ii}|\left(\sum_{k}|\tilde{L}^{-}_{ik}|b^{2}\right), \qquad (22)

where |\cdot| denotes the absolute value of a scalar quantity. The above inequality follows from the fact that, for any \|Y\|_{F}^{2}\leq b^{2}, we have |Y_{ij}|\leq b and |\sum_{k}\tilde{L}^{-}_{ik}Y_{kj}Y_{ij}|\leq\sum_{k}|\tilde{L}^{-}_{ik}Y_{kj}Y_{ij}|\leq\sum_{k}|\tilde{L}^{-}_{ik}|b^{2}. In addition, \sigma(Q)<1 and 1-\sigma(Q)<1. Since the elements of the Hessian exist and are bounded, the corresponding gradients must be Lipschitz continuous within the stated constraint set.

Next, we note that for any initialization Y^{(0)} we can choose a b sufficiently large such that \ell_{node}(Y)>\ell_{node}(Y^{(0)}) for all Y\in\mathcal{S}_{b}\triangleq\{Y^{\prime}:Y^{\prime}\in\mathbb{R}^{n\times d},\ \|Y^{\prime}\|_{F}^{2}>b^{2}\}. To see this, note that for any Y with \|Y\|_{F}^{2} sufficiently large, \ell_{node}(Y) is unbounded from above, given that the first two terms are collectively strongly convex while the last term is bounded from below.

For convenience, we then define \bar{\mathcal{S}}_{b}=\mathbb{R}^{n\times d}\setminus\mathcal{S}_{b} (i.e., the complement of \mathcal{S}_{b}). Therefore, given any initialization Y^{(0)}, there will exist a Lipschitz constant C_{b} such that any gradient descent iteration computed at a point Y\in\bar{\mathcal{S}}_{b} using a step size \frac{\alpha}{2} less than 1/C_{b} (so \alpha^{\prime}\triangleq 2/C_{b}) is guaranteed to reduce \ell_{node}(Y), and in doing so, the iterates will remain within \bar{\mathcal{S}}_{b}. Hence we can apply standard results [66] for the convergence of gradient descent on non-convex functions to a stationary point, provided that the function is lower-bounded (as is (IV-C)), has Lipschitz-continuous gradients, and the step-size parameter is chosen to be less than the inverse of the corresponding Lipschitz constant.

X-C Further Experimental Details

X-C1 Datasets Statistics

Statistics of the datasets used in the main paper are presented in Table XVIII. The splits for Cora, Citeseer and Pubmed are 7:1:2 for training:validation:test, following [12]. Splits for the OGB datasets are the official ones from [45].

TABLE XVIII: Datasets Statistics.
Cora Citeseer Pubmed Collab PPA DDI Citation2
Node number 2,708 3,327 18,717 235,868 576,289 4,267 2,927,963
Edge number 5,278 4,676 44,327 1,285,465 30,326,273 1,334,889 30,561,187
splits rand rand rand time throughput time protein
avg degree 3.9 2.74 4.5 5.45 52.62 312.84 10.44

X-C2 Implementation Hardware and Hyperparameters

The experiments were generally conducted on RTX 4090 (24GB) and A100 (40GB) GPUs (the A100 only for ogbl-Citation2). The inference time comparisons in Table XIII were conducted on an RTX A10G, while the experiments of Table XIV were run on an A100 (40GB). Hyperparameters were selected using a grid search on Cora, Citeseer and Pubmed; on the OGB datasets, we empirically tried different hyperparameters guided by experience from the smaller datasets. Where applicable, the search grid is the same for our model and the baselines. For YinYanGNN we follow ranges consistent with the code from prior work, i.e., learning rate (0.0001-0.01), hidden dimension (32-1024), and dropout (0-0.8). For hyperparameters specific to YinYanGNN, the search grid is K (1-4), propagation steps (2-16), \alpha (0.01-1), \lambda_{K} (100-1 or learnable), and \lambda (100-1).