
Algorithm and System Co-design for Efficient Subgraph-based Graph Representation Learning

Haoteng Yin, Muhan Zhang, Yanbang Wang§, Jianguo Wang, Pan Li
Department of Computer Science, Purdue University; Institute for Artificial Intelligence, Peking University
§Department of Computer Science, Cornell University
{yinht, csjgwang, panli}@purdue.edu [email protected] § [email protected]
Abstract.

Subgraph-based graph representation learning (SGRL) has been recently proposed to deal with some fundamental challenges encountered by canonical graph neural networks (GNNs), and has demonstrated advantages in many important data science applications such as link, relation and motif prediction. However, current SGRL approaches suffer from scalability issues since they require extracting subgraphs for each training or test query. Recent solutions that scale up canonical GNNs may not apply to SGRL. Here, we propose a novel framework SUREL for scalable SGRL by co-designing the learning algorithm and its system support. SUREL adopts walk-based decomposition of subgraphs and reuses the walks to form subgraphs, which substantially reduces the redundancy of subgraph extraction and supports parallel computation. Experiments over six homogeneous, heterogeneous and higher-order graphs with millions of nodes and edges demonstrate the effectiveness and scalability of SUREL. In particular, compared to SGRL baselines, SUREL achieves a 10× speed-up with comparable or even better prediction performance; compared to canonical GNNs, SUREL achieves a 50% improvement in prediction accuracy.

PVLDB Reference Format:
Haoteng Yin, Muhan Zhang, Yanbang Wang, Jianguo Wang, Pan Li. PVLDB, 15(11): XXX-XXX, 2022.
doi:XX.XX/XXX.XX
This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 15, No. 11 ISSN 2150-8097.
doi:XX.XX/XXX.XX

PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://github.com/Graph-COM/SUREL.git.

1. Introduction

Graph-structured data is prevalent for modeling relations and interactions between elements in real-world applications (Koller et al., 2007). Graph representation learning (GRL) aims to learn representations of graph-structured data and has recently become a hot research topic (Hamilton, 2020). Previous works on GRL focus on either model design or system design, while very few jointly consider both. Works on model design tend to propose more expressive, generalizable and robust GRL models while paying less attention to their deployment (Veličković et al., 2018; Morris et al., 2019). Hence, many theoretically powerful models can hardly be applied to large real-world graphs. On the other hand, research on system design focuses on system-level techniques for better model development, such as graph partitioning (Chiang et al., 2019), sub-sampling (Hamilton et al., 2017; Zeng et al., 2020) and pipelining (Zhang et al., 2020; Jia et al., 2020; Thorpe et al., 2021; Zhang et al., 2021b). However, these works only consider basic GRL models, in particular graph neural network (GNN) models, and often overlook their modeling limitations in solving practical GRL tasks.

Canonical GNNs (Kipf and Welling, 2017; Hamilton et al., 2017) share a common framework: each node is associated with a vector representation that gets iteratively updated by aggregating the representations from its neighboring nodes via graph convolution layers. The final prediction is made by combining the representations of the nodes of interest. Although recent successes in system research have greatly improved efficiency (Fey and Lenssen, 2019; Wang et al., 2019), the GNN framework intrinsically suffers from three modeling limitations. First, information may be over-squashed into a single node representation, which results in subpar performance when a node is involved in multiple tasks, e.g., predicting multiple relations or links attached to the same node (Epasto and Perozzi, 2019; Alon and Yahav, 2020). Second, canonical GNNs cannot capture inter-node distance information due to limited expressive power (Li et al., 2020; Wang et al., 2022), and thus fail to make predictions over a set of nodes (see Fig. 1a), such as substructure counting (Bouritsas et al., 2020; Chen et al., 2020) and higher-order pattern prediction (Srinivasan and Ribeiro, 2020; Zhang et al., 2021a). Third, the depth of GNNs is entangled with the range of the receptive field: to gain more non-linearity, one must use deeper GNNs, which comes with a larger but possibly unnecessary receptive field and poses the risk of contaminating the representations with irrelevant information (Huang and Zitnik, 2020; Zeng et al., 2021).

Recently, subgraph-based GRL (SGRL) has emerged as a new trend and has shown superior performance in tasks such as link prediction (Zhang and Chen, 2018; Zhang et al., 2021a), relation prediction (Teru et al., 2020), higher-order pattern prediction (Li et al., 2020; Liu et al., 2022), temporal network modeling (Wang et al., 2021), recommender systems (Zhang and Chen, 2020), graph meta-learning (Huang and Zitnik, 2020), and subgraph matching (Liu et al., 2020b; Lou et al., 2020) and prediction (Wang and Zhang, 2022). Different from canonical GNNs, SGRL extracts a subgraph patch for each training and test query and learns the representation of the extracted patch for final prediction (See Fig. 1b). For example, SEAL (Zhang and Chen, 2018; Zhang et al., 2021a) learns the representation of a subgraph around a given node pair to predict the link between them. This framework fundamentally overcomes the above three limitations. First, subgraph extraction allows decoupling the contributions made by a node to different queries, which prevents information over-squashing. Second, subgraph patches can be paired with distance-related features that favor prediction over a set of nodes (Li et al., 2020; Zhang et al., 2021a). Third, subgraph extraction disentangles model depth and range of receptive field, which allows learning a rather non-linear model with only relevant local subgraphs as input.

Figure 1. A Toy Example of SGRL: the task is to predict whether uw or uv is more likely to form a link. Ideally, if this comes from a social network, uv is more likely to be linked because the two nodes share a common neighbor. However, canonical GNNs cannot tell the difference since w and v share the same subtree structures and thus get the same representation (Xu et al., 2019). SGRL solves this problem by extracting a subgraph patch around each queried node pair. Prediction based on the subgraph representation provides much better performance than canonical GNNs (Zhang and Chen, 2018; Zhang et al., 2021a).

Despite its importance, the SGRL framework has not received as much attention as the canonical GNN framework in the system research community. The underlying challenge comes from the subgraph extraction step in SGRL, which can be rather irregular and time-consuming. Specifically, SGRL requires materializing a subgraph patch for each query during training and inference. Previous SGRL works typically extract subgraphs offline for all such queries (Zhang and Chen, 2018; Liu et al., 2022), but this does not scale to large graphs due to its extensive memory needs. Meanwhile, online extraction (Zeng et al., 2021) is not an option either, as it requires considerable processing time. The irregularity of subgraphs further makes it difficult to handle the extraction process efficiently in both cases.

Here, we aim to fill this gap by designing SUREL, a novel computational framework that supports SGRL over large graphs. SUREL consists of a new system-friendly learning algorithm for SGRL and a scalable system to support this algorithm. The crucial design goal of SUREL is to reduce the overhead caused by online subgraph extraction, from which all current SGRL approaches suffer.

The key idea behind SUREL is to break (and down-sample) subgraphs into random walks of regular size that can be easily sampled and, more importantly, reused among different queries. To compensate for the structural information lost by subgraph decomposition, we introduce relative position encoding (RPE), an inter-node distance feature that records the position of each node in the sampled subgraph. Specifically, for each node $u$ in the network, SUREL collects a certain number of random walks starting from $u$. Each node appearing in these walks uses its landing counts at each step as its RPE vector. Overall, the set of collected walks paired with RPEs can be viewed as a subgraph patch centered at $u$. The complexity of the above process is linear in the number of nodes, and it can be done offline and in parallel. For training and inference, given a queried node set $Q$, SUREL first groups the sampled walks originating from all nodes in $Q$. Then, it implicitly joins the subgraph patches centered at each node in $Q$ by combining their node-level RPEs into a query-level RPE for each node appearing in the grouped walks, which can also be executed fully in parallel. Finally, SUREL uses neural networks to learn the representation of the joined set of walks attached with query-level RPEs for the final prediction. Since these walks are regular, the training process can be done quickly on GPU. The system architecture of SUREL is illustrated in Fig. 2.

Our contributions can be summarized as follows: (1) A Novel System-Friendly Algorithm. We propose the first scalable algorithm for SGRL tasks by adopting a novel walk-based computation framework. This framework uses regular data structures and allows extreme system acceleration. (2) Dedicated System Support (Open-source). We design SUREL to support the proposed algorithm. It can rapidly sample walks, encode positional features, and join them to represent multiple subgraphs in parallel. SUREL adopts many system optimization techniques including parallelization, memory management, load balancing, etc. (3) High Performance and Efficiency. We evaluate SUREL on three prediction tasks (link, relation, and motif prediction) over six real-world graphs with millions of nodes/edges. SUREL significantly outperforms current SGRL approaches and runs 10× faster in training and testing. Meanwhile, benefiting from its SGRL nature, SUREL outperforms canonical GNNs by a great margin in prediction performance (by almost 50% in all tasks).

2. Preliminaries and Related Works

In this section, we set up notations, formulate the SGRL problem and review some related works.

2.1. Notations

Definition 2.1 (Graph-structured data).

Let $\mathcal{G}=(\mathcal{V},\mathcal{E},X)$ denote an attributed graph, where $\mathcal{V}=[n]$ and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ are the node set and the edge set, respectively. $X\in\mathbb{R}^{n\times d}$ denotes the $d$-dimensional node attributes. Further, we use $\mathcal{N}_v$ to represent the set of nodes in the direct neighborhood of node $v$, i.e., $\mathcal{N}_v=\{u:(u,v)\in\mathcal{E}\}$.

Definition 2.2 ($m$-hop Subgraph).

Given a graph $\mathcal{G}$ and a node set of interest $Q$, let $\mathcal{G}_Q^m$ denote the $m$-hop neighboring subgraph w.r.t. the set $Q$. $\mathcal{G}_Q^m$ is the induced subgraph of $\mathcal{G}$ whose node set $\mathcal{V}_Q^m$ includes the set $Q$ and all the nodes in $\mathcal{G}$ whose shortest-path distance to $Q$ is less than or equal to $m$. Its edge set is the subset of $\mathcal{E}$ consisting of the edges with both endpoints in $\mathcal{V}_Q^m$. The nodes in $\mathcal{V}_Q^m$ still carry the original node attributes if $\mathcal{G}$ is attributed.
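To make Definition 2.2 concrete, the following is a minimal Python sketch of $m$-hop subgraph extraction (the operation that prior SGRL methods perform per query). It assumes a simple dictionary-based adjacency list and is written by us for illustration; it is not taken from the SUREL codebase.

```python
from collections import deque

def m_hop_subgraph(adj, query, m):
    """Return the node and edge sets of the m-hop subgraph induced by `query`.

    `adj` maps each node to an iterable of its neighbors (undirected graph),
    `query` is the queried node set Q, and `m` is the hop limit.
    """
    # Multi-source BFS from Q to collect all nodes within distance m of Q.
    dist = {q: 0 for q in query}
    frontier = deque(query)
    while frontier:
        u = frontier.popleft()
        if dist[u] == m:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    nodes = set(dist)
    # Keep every original edge whose endpoints both fall inside the node set.
    edges = {(u, v) for u in nodes for v in adj[u] if v in nodes}
    return nodes, edges

# Toy usage: a 5-node path graph, Q = {2}, m = 1  ->  nodes {1, 2, 3}.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(m_hop_subgraph(adj, {2}, 1))
```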

2.2. Graph Learning Problems and Background

Now, we formally formulate the GRL and SGRL problems.

Definition 2.3 (Graph Representation Learning (GRL)).

Given a graph $\mathcal{G}$ and a queried set of nodes $Q$, graph representation learning aims to learn a mapping from the graph-structured data to some predicted labels, $f(\mathcal{G},Q)\rightarrow y$, where the mapping $f(\mathcal{G},Q)$ may reflect the structure and node attributes of $\mathcal{G}$ and their relation to $Q$.

Next, we define SGRL, where for a particular query $Q$ the predictions are made based on the local subgraph around $Q$.

Definition 2.4 (Subgraph-based GRL (SGRL)).

Given a node set $Q$ over an ambient graph $\mathcal{G}$ and a positive integer $m$, SGRL is to learn a mapping to some labels that takes the $m$-hop neighboring subgraph of $Q$ in $\mathcal{G}$ as input, $f(\mathcal{G}_Q^m,Q)\rightarrow y$. An SGRL task is typically given some labeled node-set queries $\{(Q_i,y_i)\}_{i=1}^{L}$ for training and other unlabeled node-set queries $\{Q_i\}_{i=L+1}^{L+N}$ for testing.

We list a few important examples of SGRL tasks. Link prediction seeks to estimate the likelihood of a link between two endpoints in a given graph. It can be further generalized to predicting the type of links, i.e., relation prediction for heterogeneous graphs. In both cases, the set $Q$ corresponds to a pair of nodes. The network science community has long identified the importance of leveraging local induced subgraphs for link prediction (Liben-Nowell and Kleinberg, 2007). For example, the number of common friends (shown as common neighbors in a social network) implies how likely two individuals are to become friends in the future. Another generalized form of link prediction is higher-order pattern prediction, where the set $Q$ consists of three or more nodes. The goal is to predict whether the nodes in $Q$ will be jointly covered by a single edge (termed a hyperedge).

Graph neural networks (GNNs). Canonical GNNs associate each node $v$ with a vector representation $\boldsymbol{\mathrm{h}}_v$, which is learned and updated by aggregating messages from $v$'s neighbors, as

$$\boldsymbol{\mathrm{h}}_{v}^{k}=\text{UPDATE}\left(\boldsymbol{\mathrm{h}}_{v}^{k-1},\text{AGGREGATE}\left(\{\boldsymbol{\mathrm{h}}_{u}^{k-1}\,|\,u\in\mathcal{N}_{v}\}\right)\right).$$

Here, UPDATE is implemented by neural networks while AGGREGATE is a pooling operation invariant to the order of the neighbors. By unfolding the neighborhood around each node, the computation graph used to obtain each node representation forms a tree structure. According to Def. 2.4, canonical GNNs seem able to perform SGRL as well, by encoding the local subtree rooted at each node into a node representation (see Fig. 1a). Nevertheless, in this way, each node representation only reflects the subgraph around that single node and cannot jointly represent the subgraph around multiple nodes, which yields the problem in Fig. 1. In contrast, the SGRL framework considered in this work learns the representation of the joint subgraph around a queried node set.
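For readers less familiar with this framework, below is a minimal PyTorch sketch of one such message-passing layer, with mean pooling as AGGREGATE and a linear-plus-ReLU UPDATE; the class and argument names are ours, and the sketch is only meant to mirror the equation above, not any specific baseline used in this paper.

```python
import torch
import torch.nn as nn

class MeanAggrGNNLayer(nn.Module):
    """One canonical GNN layer: h_v^k = UPDATE(h_v^{k-1}, AGGREGATE({h_u^{k-1} : u in N_v}))."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        # UPDATE acts on the concatenation [own representation || aggregated neighbors].
        self.update = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, neighbors):
        # h: [n, d] node representations; neighbors[v]: list of neighbor ids of node v.
        aggregated = torch.stack([
            h[nbrs].mean(dim=0) if len(nbrs) > 0 else torch.zeros_like(h[0])
            for nbrs in neighbors
        ])  # AGGREGATE: order-invariant mean pooling
        return torch.relu(self.update(torch.cat([h, aggregated], dim=-1)))

# Stacking several such layers and reading out the queried nodes' representations
# yields the canonical-GNN predictions discussed in Fig. 1.
```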

2.3. Other Related Works

Previous works, without exception, focus on improving the scalability of canonical GNNs and their system support; nevertheless, some of their techniques inspire the design of SUREL.

To overcome the memory bottleneck of GPUs when processing large-scale graphs, sub-sampling the graph structure is a widely adopted strategy. GraphSAGE (Hamilton et al., 2017) and VR-GCN (Chen et al., 2018b) use a uniform sampling scheme and a variance-reduction technique, respectively, to restrict the size of node neighborhoods; PinSage (Ying et al., 2018) exploits Personalized PageRank (PPR) scores to sample neighbors. FastGCN (Chen et al., 2018a) and ASGCN (Huang et al., 2018) perform independent layer-wise node sampling to allow neighborhood sharing. Cluster-GCN (Chiang et al., 2019) and GraphSAINT (Zeng et al., 2020) study subgraph-based mini-batching approaches to reduce the size of training graphs. Note that the subgraphs in our setting are substantially different from theirs, since our subgraphs work as features for queries while their subgraphs are merely a compromise for better scalability.

Many works improve the system support for GNNs. DGL (Wang et al., 2019) and PyG (Fey and Lenssen, 2019) are designed for scalable single-machine GNN training. Marius (Mohoney et al., 2021) is proposed to efficiently learn large-scale graph embeddings on a single machine. There are several distributed systems dedicated to GNNs: AliGraph (Yang, 2019) addresses the storage issue of applying GNNs on massive industrial graphs; AGL (Zhang et al., 2020) employs a subgraph-based system for GRL; ROC (Jia et al., 2020) builds a multi-GPU framework for deeper and larger GNN models; Dorylus (Thorpe et al., 2021) designs a CPU-based distributed system for GNN training. $\text{G}^3$ (Liu et al., 2020a) speeds up GNN training by supporting parallel graph-structured operations. Zhou et al. (2021) use feature-dimension pruning to accelerate large-scale GNN inference. However, all these systems only support canonical GNNs, so they all suffer from the intrinsic modeling limitations of GNNs.

3. The Architecture of SUREL

Figure 2. The System Architecture of Subgraph-based Graph Representation Learning Framework (SUREL).

In this section, we first give an overview of the SUREL framework as shown in Fig. 2. Then, we focus on the design and implementation of three modules: Walk Sampler & Relative Position Encoder (Preprocessing), Walk-based Subgraph Storage, and Query-based Subgraph Joining & Neural Encoding. Finally, we elaborate on an efficient training pipeline with Subgraph Query Mini-batching.

3.1. Overview

Existing SGRL frameworks that extract a subgraph per query do not support efficient training and inference. $m$-hop subgraph extraction faces a size "explosion" issue, as many nodes have very large degrees in real-world networks. Moreover, subgraphs of different sizes cause workload fluctuation, hindering load balancing and memory management.

Subgraph extraction can be replaced with efficient walk-based sampling, which sidesteps all the above issues by regulating the number and the length of sampled walks. Both are small constants, so the space and time complexity is only linear w.r.t. the number of nodes. Specifically, during preprocessing, SUREL reduces the subgraph around each node in a given graph to a set of random walks originating from it. To compensate for the loss of structural information after breaking subgraphs into walks, an inter-node distance feature termed relative positional encoding (RPE) is proposed, which enables locating each node in the sampled subgraph. The collected set of walks paired with its RPEs is hosted in the walk-based subgraph storage, with a dedicated data structure designed to support rapid and intensive access. The preprocessing flow is presented in the upper part of Fig. 2.

For training and testing, given a query (a set of nodes), SUREL employs subgraph joining to implicitly construct a subgraph around the entire query, fully in parallel. First, all the walks originating from the queried node set are grouped. Then, the precomputed node-level RPEs are joined into query-level RPEs. SUREL further adopts neural networks to encode the grouped walks paired with query-level RPEs, and makes final predictions based on the obtained subgraph representation. A mini-batching strategy is designed to maximize data reuse during training by exploiting query overlaps.

Figure 3. An Illustration of Joining RPE into Query-level RPEs with the Support of Walk-based Subgraph Storage.

3.2. Preprocessing - Walk Sampling & Encoding

The bottleneck of current SGRL frameworks is how to cheaply acquire the $m$-hop neighbors for each queried set of nodes. SUREL proposes to decompose the $m$-hop subgraph into a set of $m$-step walks that start from the queried set of nodes. As the walks are regular, their storage and access are extremely efficient. This also resolves the computational problem caused by the long-tailed distribution of node degrees. More importantly, the collected walks, grouped by their starting nodes, can be shared and reused among different queries. Our design decouples SGRL from redundant subgraph extraction and enables the reuse of preprocessed data. We summarize the preprocessing routines with the support of hash-indexed storage in Algorithm 1 and introduce the specifics next.

Walk Sampling. During preprocessing, SUREL samples $M$ many $m$-step walks for every node in a given graph. As Fig. 3 (upper left) shows, the sampled walks are grouped in a set $\mathcal{W}_u$, where $u$ denotes the starting node of these walks. Walk sampling can be easily divided into parallelizable pieces; the parallelization is implemented in C based on NumPy and the OpenMP framework. Moreover, to further accelerate walk sampling, we use the compressed sparse row (CSR) format to represent the graph. CSR consists of two arrays: idxptr of length $|\mathcal{V}|+1$, which records the cumulative degree offsets of nodes, and indices of size $|\mathcal{E}|$, which stores the concatenated neighbor lists of all nodes. CSR allows intensive, fast access to the neighbors of a node while keeping the memory cost low, which is vital for walk sampling in large-scale graphs.
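As a concrete illustration, the following single-threaded NumPy sketch samples walks over the CSR arrays described above; the real sampler is a parallel C/OpenMP implementation, and this version assumes each walk stores its $m+1$ visited nodes (the start node plus $m$ steps), matching the walk notation $W=(w_0,\dots,w_m)$ used in Sec. 3.5.

```python
import numpy as np

def sample_walks(idxptr, indices, u, M, m, rng=None):
    """Sample M random walks of m steps starting from node u on a CSR graph.

    idxptr (length |V|+1) and indices (length |E|) follow the CSR layout above:
    the neighbors of node v are indices[idxptr[v]:idxptr[v+1]].
    Returns an (M, m+1) integer array; row i is the i-th walk, including u itself.
    """
    rng = rng or np.random.default_rng()
    walks = np.empty((M, m + 1), dtype=np.int64)
    walks[:, 0] = u
    for i in range(M):
        cur = u
        for step in range(1, m + 1):
            nbrs = indices[idxptr[cur]:idxptr[cur + 1]]
            cur = rng.choice(nbrs) if len(nbrs) > 0 else cur  # stay put on an isolated node
            walks[i, step] = cur
    return walks

# Toy usage: a triangle graph 0-1-2 in CSR form.
idxptr = np.array([0, 2, 4, 6])
indices = np.array([1, 2, 0, 2, 0, 1])
print(sample_walks(idxptr, indices, u=0, M=4, m=2, rng=np.random.default_rng(0)))
```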

Input: Graph $\mathcal{G}$; number of walks $M$; step of walks $m$
Output: Associative array $\mathcal{A}$, RPE array $\mathcal{T}$
1  Initialize the arrays $\mathcal{A}$ and $\mathcal{T}$, and the dictionary $\mathcal{H}$
2  for each node $u\in\mathcal{G}$ do
3      Run $M$ times $m$-step random walks on $\mathcal{G}$ as a set of walks $\mathcal{W}_u\in\mathbb{Z}^{M\times m}$;
4      Add the keys $\mathcal{V}_u=\text{set}(\mathcal{W}_u)$ to $\mathcal{H}_u$;
5      Calculate the RPE for $\forall x\in\mathcal{V}_u$, save the value $\mathcal{X}_{u,x}$ to $\mathcal{T}$, and write its index in $\mathcal{T}$ as $\text{RPE-ID}_{u,x}$ back to $\mathcal{H}_u(x)$;
6      Insert $\{u:(\mathcal{W}_u,\mathcal{H}_u)\}$ into $\mathcal{A}$
7  end for
8  Prune $\mathcal{T}$ and update the values of $\mathcal{H}$ by re-indexing.
Algorithm 1: Data Preprocessing in SUREL

Relative Positional Encoding (RPE). Structural information gets lost after breaking subgraphs into walks. SUREL compensates for this loss via RPEs, which locate the relative position of each node in the sampled subgraph and characterize the structural contribution of the node to its corresponding subgraph.

For each set of walks $\mathcal{W}_u$, we first establish a set $\mathcal{V}_u$ that contains the distinct nodes appearing in $\mathcal{W}_u$. Define the node-level RPE $\mathcal{X}_u:\mathcal{V}_u\rightarrow\mathbb{R}^{m+1}$ as follows: each node $x\in\mathcal{V}_u$ is assigned a vector $\mathcal{X}_{u,x}\in\mathbb{R}^{m+1}$, where $\mathcal{X}_{u,x}[i]$ is the landing count of node $x$ at position $i$ over all walks in $\mathcal{W}_u$. In SUREL, RPEs can be computed on the fly as walks get sampled, thus incurring nearly zero extra computational cost. The set of walks $\mathcal{W}_u$ paired with the RPEs $\mathcal{X}_u$ essentially characterizes a sub-sampled subgraph around node $u$. Next, we present a dedicated data structure to host $\mathcal{W}_u$ and $\mathcal{X}_u$ altogether.
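The landing-count computation can be written compactly as follows; this is our own illustrative Python version (e.g., it builds $\mathcal{X}_u$ from already-sampled walks rather than on the fly as the C implementation does).

```python
import numpy as np

def node_level_rpe(walks, m):
    """Compute node-level RPEs X_u from the walks W_u (an array of shape [M, m+1]).

    Returns a dict mapping each distinct node x in the walks to a length-(m+1)
    vector whose i-th entry counts how often x lands at step i across all walks.
    """
    rpe = {}
    for walk in walks:
        for i, x in enumerate(walk):
            vec = rpe.setdefault(int(x), np.zeros(m + 1, dtype=np.int64))
            vec[i] += 1
    return rpe

# Combined with the sampler sketched above:
#   walks = sample_walks(idxptr, indices, u, M=4, m=2)
#   X_u = node_level_rpe(walks, m=2)   # X_u[u][0] == 4, since every walk starts at u
```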

3.3. Walk-based Subgraph Storage

The collected sets of walks are easy to manage due to their regularity: an $m*M$-sized chunk is allocated to each set of walks, which helps speed up data fetching. Organizing the node-level RPEs, however, presents a real challenge, because the cardinality $|\mathcal{V}_u|$ varies from node to node. One naïve way to avoid such irregularity is to directly scatter these RPEs back onto the nodes in the previously collected walks. But this yields an $m*M*(m+1)$ tensor per node, resulting in an unrealistic memory requirement. Moreover, it loses track of the node IDs in walks, which are needed for joining subgraphs later.

We use an associative array $\mathcal{A}$ to organize all walk-based subgraphs, as shown in the upper part of Fig. 3. For each node $u\in\mathcal{V}$, its corresponding entry in $\mathcal{A}$ is a node-level subgraph formed as a tuple $(\mathcal{W}_u,\mathcal{H}_u)$. Here, $\mathcal{W}_u$ is the set of walks starting from $u$, and $\mathcal{H}_u$ is a dictionary that maps the unique node set $\mathcal{V}_u$ of $\mathcal{W}_u$ to the corresponding node-level RPEs $\mathcal{X}_u$. The use of a dictionary resolves the irregularity of $\mathcal{V}_u$ mentioned above, while maintaining the connection between node IDs and their RPEs. In addition, an array $\mathcal{T}$ is introduced to store RPE values centrally, rather than scattered across dictionaries. As Fig. 3 (upper right) shows, the value of $\mathcal{H}_u(x)$ is now replaced with the index of the RPE value $\mathcal{X}_{u,x}$ stored in $\mathcal{T}$, denoted as $\text{RPE-ID}_{u,x}$. This design guarantees access to any RPE in $O(1)$ time.

The above $\mathcal{A}$ and $\mathcal{H}_u$ are built on top of uthash's macros (https://troydhanson.github.io/uthash/), with extended support for arbitrary insertions and deletions of key-value pairs. They offer data access and search in $O(1)$ time on average, which is about as good as a direct-address table but greatly reduces space wastage. In particular, there is no dependency or need for communication between hash queries, so they can be pleasingly executed in parallel. Both $\mathcal{A}$ and $\mathcal{H}_u$ are stored in RAM on the CPU side. As observed in Fig. 3, there are many repeated RPE values. Once all nodes are sampled, the array $\mathcal{T}$ can be pruned to remove duplicates; RPE-IDs are updated synchronously when $\mathcal{T}$ is re-indexed. For example, both nodes $a$ and $v$ have the RPE value $[0,0,1]$, whose index in $\mathcal{T}$ is (1) after pruning; thus, both $\mathcal{H}_u(a)$ and $\mathcal{H}_u(v)$ are assigned the new RPE-ID (1). The shape of $\mathcal{T}$ is regular and its size is usually small after pruning, so it can be fully loaded into GPU memory. In practice, we found that pinning RPEs in GPU memory is critical, as it significantly reduces the communication cost of moving data back and forth between RAM and SDRAM.
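To make the storage layout concrete, here is a simplified Python analogue of the walk-based subgraph storage; the actual implementation is a C/uthash hash table, and unlike Algorithm 1, this sketch deduplicates RPE values at insertion time instead of pruning $\mathcal{T}$ afterwards.

```python
import numpy as np

class WalkStorage:
    """A simplified, in-Python analogue of SUREL's walk-based subgraph storage.

    A[u] holds (W_u, H_u): the walks sampled from u and a dict mapping each node
    in those walks to an RPE-ID, i.e., a row index into the shared RPE table T.
    Identical RPE vectors share one row of T.
    """

    def __init__(self):
        self.A = {}           # node u -> (walks, {node x -> RPE-ID})
        self.T = []           # deduplicated RPE rows
        self._row_index = {}  # RPE value (as a tuple) -> row index in T

    def insert(self, u, walks, rpe):
        """Store (W_u, H_u) given the walks and the node-level RPE dict for u."""
        H_u = {}
        for x, vec in rpe.items():
            key = tuple(int(c) for c in vec)
            if key not in self._row_index:        # unseen RPE value: append to T
                self._row_index[key] = len(self.T)
                self.T.append(np.asarray(vec))
            H_u[x] = self._row_index[key]         # keep only the RPE-ID
        self.A[u] = (walks, H_u)

    def rpe_table(self):
        """Return T as a regular [num_unique_rpes, m+1] array (to be pinned on GPU)."""
        return np.stack(self.T)
```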

3.4. Query-based Subgraph Joining

The storage designed above records the down-sampled subgraph around each node. As SGRL is mostly used for making predictions over a set of nodes $Q$, here we further illustrate how to obtain the joined subgraph around all the nodes $u\in Q$.

The idea is to concatenate all sets of walks $[\dots,\mathcal{W}_u,\dots]$ for $u\in Q$, since each set of walks $\mathcal{W}_u$ can be viewed as a subgraph around $u$. Besides, each node $x$ in the walks is paired with a query-level RPE $\mathcal{X}_{Q,x}$ that characterizes the relative position of node $x$ in the joint subgraph around the queried set $Q$. Specifically, $\mathcal{X}_{Q,x}$ is defined by joining all RPEs $\mathcal{X}_{u,x}$ for $u\in Q$, i.e., $\mathcal{X}_{Q,x}=\uplus_{u\in Q}\mathcal{X}_{u,x}\,(\triangleq[\dots,\mathcal{X}_{u,x},\dots])\in\mathbb{R}^{(m+1)\times|Q|}$. For those $u\in Q$ such that $x\notin\mathcal{V}_u$, $\mathcal{X}_{u,x}$ is set to all zeros. Through this procedure, the joined subgraph with query-level RPEs is sent to GPU for representation learning and then model inference.

The data structure described in Sec. 3.3 enables a highly parallel implementation of subgraph joining along with optimized memory management. On the CPU side, $\mathcal{X}_{Q,x}$ is not directly used to assemble walks. Instead, we use a query-level RPE-ID that joins the node-level RPE indices in $\mathcal{T}$, i.e., $\text{RPE-ID}_{Q,x}=[\dots,\text{RPE-ID}_{u,x},\dots]$ for $u\in Q$, which reduces the memory cost from $(m+1)*|Q|$ to $|Q|$ per node. For instance, in Fig. 3 (bottom right), $\mathcal{X}_{Q,u}=([2,0,0],[0,0,1])$ can be substituted by $\text{RPE-ID}_{Q,u}=(3,1)$, as these RPE values are located at entries (3) and (1) of $\mathcal{T}$. SUREL then pre-allocates an array of fixed size $[m*M*|Q|,|Q|]$, where $m*M*|Q|$ is the total size of the walks around $Q$, and fills this index array with $\text{RPE-ID}_{u,x}$ using multiple threads. Note that $\text{RPE-ID}_{u,x}$ can be rapidly retrieved via the dictionary operation $\mathcal{H}_u(x)$. Lastly, assembling RPE values onto walks is performed on GPU via the indexing operation $\mathcal{X}_{u,x}=\mathcal{T}(\text{RPE-ID}_{u,x})$, where $\mathcal{T}$ has been pinned in GPU memory earlier. SUREL provides a Python/C hybrid API for subgraph joining, built on top of NumPy, PyTorch, OpenMP, and uthash.
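A compact Python sketch of this joining step, using the hypothetical WalkStorage class above, is given below; for brevity it iterates over the unique nodes of the grouped walks, whereas the actual system fills the per-walk-position index array in parallel and tolerates the redundancy discussed in the remarks that follow.

```python
import numpy as np

def join_query(storage, Q):
    """Join the node-level subgraphs of all u in Q into one query-level view.

    Returns the concatenated walks and, for every node x appearing in them, a
    query-level RPE-ID vector of length |Q|; index -1 marks "x never reached
    from u" and should map to an all-zero RPE row when values are looked up.
    """
    walks = np.concatenate([storage.A[u][0] for u in Q], axis=0)
    query_rpe_id = {}
    for x in np.unique(walks):
        query_rpe_id[int(x)] = [storage.A[u][1].get(int(x), -1) for u in Q]
    return walks, query_rpe_id

# On GPU, the actual RPE values would then be gathered by indexing the pinned
# table, e.g. T_padded[rpe_ids] with an extra all-zero row appended for -1.
```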

Some remarks can be made here. First, the above algorithm contains some redundancy: the query-level RPE-ID is computed repeatedly for nodes that appear multiple times in the walks. In practice, we find that about half of the nodes appear only once, so this at most doubles the computation time. To avoid such redundancy, one could first compute the set union $\mathcal{V}_Q=\cup_{u\in Q}\mathcal{V}_u$, and then compute the query-level RPE-IDs by traversing all nodes in $\mathcal{V}_Q$. However, a parallel set union is difficult to implement efficiently; with multithreading enabled, we observe significantly higher efficiency for SUREL than for the union-based variant. Also, by dynamically adjusting the number of threads, the workload between CPU and GPU can be well balanced. Second, we have empirically found that using RPE-IDs instead of RPEs to assemble walks provides an observable performance boost (a speedup of 2× or more); otherwise, data communication between CPU and GPU would become the main bottleneck.

3.5. Neural Encoding

After subgraph joining for each query, the obtained subgraph is represented by a concatenated set of walks whose nodes are paired with query-level RPEs (see Fig. 3). Next, we introduce neural networks to encode these walks into a subgraph representation $h_Q$.

Due to their regularity, any sequential model, e.g., MLP, CNN, RNN, or transformer, can be adopted to encode the sampled walks. We test RNNs and MLPs for neural encoding, both of which achieve similar results; we take the RNN as an example below. We encode each walk $W=(w_0,w_1,\dots,w_m)\in\mathcal{W}$ as $\text{enc}(W)=\text{RNN}(\{f(\mathcal{X}_{Q,w_i})\}_{i=0,1,\dots,m})$, where $w_i$ denotes the node at step $i$ of the walk and $f$ encodes the query-level RPE. Node or edge attributes at each step $w_k\in W$ can be supported by appending those attributes to its RPE. To obtain the final subgraph representation of $Q$, we aggregate the encodings of all the associated walks through mean pooling, i.e., $h_Q=\text{mean}(\{\text{enc}(W)\,|\,W\ \text{starts from some}\ u\in Q\})$. In the end, a two-layer classifier makes the prediction by taking $h_Q$ as input. In our experiments, all tasks can be formulated as binary classification, and thus we adopt binary cross-entropy as the loss function.
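As an illustration of this encoder, a minimal PyTorch sketch is given below; it assumes the query-level RPEs have already been assembled into a dense tensor, uses a GRU as the RNN and an MLP as $f$, and its class and dimension names are our own rather than SUREL's.

```python
import torch
import torch.nn as nn

class WalkEncoder(nn.Module):
    """Encode one query's joined walks into a subgraph representation h_Q and a logit.

    rpe: tensor of shape [num_walks, m+1, rpe_dim], the query-level RPE of each
    node at each step of each walk, where rpe_dim = (m+1) * |Q|.
    """

    def __init__(self, rpe_dim, hidden_dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(rpe_dim, hidden_dim), nn.ReLU())   # f(X_{Q,w_i})
        self.rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)         # enc(W)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, rpe):
        steps = self.f(rpe)              # encode each step's query-level RPE
        _, h_last = self.rnn(steps)      # last hidden state of each walk
        h_Q = h_last[-1].mean(dim=0)     # mean-pool over all walks of the query
        return self.classifier(h_Q)      # logit fed to binary cross entropy

# Example: a query with |Q| = 2, M = 4 walks per node, m = 2 steps.
enc = WalkEncoder(rpe_dim=(2 + 1) * 2, hidden_dim=32)
print(enc(torch.rand(8, 3, 6)).shape)    # torch.Size([1])
```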

Input: A graph $\mathcal{G}$, a set of training queries $\{(Q_i,y_i)\}$, batch capacity $B_1$, batch size $B_2$
Output: A neural network for neural encoding $\text{NN}(\cdot)$
1  Prepare the collection of sets of walks $\mathcal{W}$ and RPEs $\mathcal{X}$
2  for $iter=1,\dots,max\_iter$ do
3      Initialize the set $\mathcal{Q}=\emptyset$ to track reached queries;
4      Randomly choose a seed-set of nodes $\bar{\mathcal{V}}$ from $\cup Q$;
5      Run breadth-first search to expand $\bar{\mathcal{V}}$ and $\mathcal{Q}$ until $|\bar{\mathcal{V}}|=B_1$ or $|\mathcal{Q}|=B_2$;
6      Generate negative training queries (if not given) for the mini-batch and put them into $\mathcal{Q}$;
7      Perform subgraph joining for the queries in $\mathcal{Q}$;
8      Encode the concatenated walks with $\text{NN}(\cdot)$ to get the subgraph representation $h_Q$ for each query;
9      Use backpropagation (Rumelhart et al., 1986) to optimize the model parameters.
10 end for
Algorithm 2: The Training Pipeline of SUREL

3.6. The Training and Serving Pipelines

SUREL organically incorporates the storage designed in Sec. 3.3 and the subgraph-joining operation described in Sec. 3.4 to achieve efficient training and model serving.

Subgraph queries $Q$ are sets of nodes that come from a common ambient graph, which is often large. There may be many overlaps between different queries and between their $m$-hop induced subgraphs. If the queried subgraphs are known a priori, we can put queries with high node overlap into the same batch to improve data reuse. Here, queries of a given task are assumed to have the same size, e.g., $|Q|=2$ for link prediction. In practice, test queries usually arrive online, while training queries can be prepared in advance. Hence, we propose to accelerate the training pipeline by mini-batching the overlapping queries. Practitioners can choose the appropriate pipeline according to their specific situation. Algorithm 2 summarizes the overall training procedure of SUREL.

Mini-batching for Training. We first randomly sample a seed-set of nodes $\bar{\mathcal{V}}$ from the union of queried node sets $\cup Q$. Then, we run breadth-first search (BFS) to expand the seed-set $\bar{\mathcal{V}}$. Neighbor fetching in this BFS is based on the grouped queries instead of the original graph: a neighbor of node $u$ is defined as any node that shares at least one query with it. During BFS, the reached queries are added to a set $\mathcal{Q}$. The expansion stops once the size of either the seed-set $\bar{\mathcal{V}}$ or the mini-batch $\mathcal{Q}$ reaches a pre-defined limit. Since the data structure of each query in SUREL after subgraph joining is regular, it is easy to decide the size limits of the seed-set and the mini-batch based on resource availability (i.e., GPU memory). In practice, this BFS procedure improves the reusability of data within each mini-batch and may significantly decrease the communication cost between CPU and GPU. If the training set contains only positive queries (as is often the case in link/motif prediction tasks), we design an efficient negative-query sampling strategy based on the same principle, which randomly pairs nodes within the same batch.
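The batch-growing step (lines 3-5 of Algorithm 2) can be sketched as follows; this Python version is illustrative only, with data structures (a query list plus a node-to-query index) and limit checks chosen by us under the assumptions stated in the comments.

```python
import random
from collections import deque

def sample_mini_batch(queries, node_to_queries, num_seeds, B1, B2, rng=random):
    """Grow one mini-batch of overlapping queries by BFS over the query graph.

    queries: list of training node sets; node_to_queries: dict mapping a node to
    the indices of queries containing it. Expansion stops once the reached node
    set hits capacity B1 or the batch of queries hits size B2 (checked per step).
    """
    reached_nodes, batch = set(), set()
    seeds = rng.sample(sorted(node_to_queries), num_seeds)   # seed-set from the union of queries
    frontier = deque(seeds)
    reached_nodes.update(seeds)
    while frontier and len(reached_nodes) < B1 and len(batch) < B2:
        u = frontier.popleft()
        for qi in node_to_queries[u]:          # queries touching u
            if qi in batch:
                continue
            batch.add(qi)
            for v in queries[qi]:              # "neighbors" of u share a query with it
                if v not in reached_nodes:
                    reached_nodes.add(v)
                    frontier.append(v)
    return batch

# Toy usage with three link-prediction queries sharing node 1.
queries = [{0, 1}, {1, 2}, {3, 4}]
node_to_queries = {0: [0], 1: [0, 1], 2: [1], 3: [2], 4: [2]}
print(sample_mini_batch(queries, node_to_queries, num_seeds=1, B1=10, B2=2,
                        rng=random.Random(0)))
```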

4. Evaluation

In this section, we aim to evaluate the following points:

  • Regarding prediction performance, can SUREL outperform state-of-the-art SGRL models? Can SUREL significantly outperform canonical GNNs and transductive graph embedding methods due to the claimed benefit of SGRL?

  • Regarding runtime, can SUREL significantly outperform state-of-the-art SGRL models? Can SUREL achieve runtime performance comparable to canonical GNNs? Previous SGRL models are typically much slower than canonical GNNs.

  • How sensitive is SUREL to its hyperparameters? How do the parameters $m$ and $M$ impact the overall performance?

  • How well does the parallel design of SUREL perform and scale?

Table 1. Summary Statistics for Evaluation Datasets.
Dataset         Type      #Nodes                                       #Edges
citation2       Homo.     2,927,963                                    30,561,187
collab          Homo.     235,868                                      1,285,465
ppa             Homo.     576,289                                      30,326,273
ogb-mag         Hetero.   Paper (P): 736,389; Author (A): 1,134,649    P-A: 7,145,660; P-P: 5,416,271
tags-math       Higher.   1,629                                        91,685 (projected); 822,059 (hyperedges)
DBLP-coauthor   Higher.   1,924,991                                    7,904,336 (projected); 3,700,067 (hyperedges)

4.1. Evaluation Setup

We conduct extensive experiments to evaluate the proposed framework on three kinds of graphs (homogeneous, heterogeneous, and higher-order homogeneous) with three corresponding types of tasks, namely link prediction, relation prediction, and higher-order pattern prediction. Homogeneous graphs are graphs without node/link types. Heterogeneous graphs include node/link types. Higher-order graphs contain higher-order links that may connect more than two nodes. The dataset statistics are summarized in Table 1; most of these datasets are larger than those used in (Zhang et al., 2021b; Zhou et al., 2021), not to mention that our node-set prediction tasks are much more complex than the node classification task considered in those works.

Open Graph Benchmark (OGB). We use three link prediction datasets (Hu et al., 2020): ppa, a protein interaction network; collab, a collaboration network; and citation2, a citation network; and one relation prediction dataset, the heterogeneous network ogb-mag, which contains four types of nodes (paper, author, institution and field) and their relations extracted from MAG (Wang et al., 2020).

Higher-order Graph Datasets. DBLP-coauthor is a temporal higher-order network that records the co-authorship of papers as timestamped higher-order links. tags-math contains sets of tags applied to questions on the website math.stackexchange.com as higher-order links. For these two higher-order graphs, SUREL and all the baselines treat them as standard graphs by projecting higher-order links into cliques. However, the training and test queries are generated based on higher-order links, as detailed next.

Settings. For Link Prediction, we follow the data splits required by OGB, which isolate the validation and test links (queries) from the graphs. For Relation Prediction, the paper-author (P-A) and paper-citation (P-P) relations are selected, and the dataset is split based on timestamps: 0.5% of the existing edges of each target relation type are selected from ogb-mag. For each paper, two authors/citations are picked from its P-A/P-P relations, one for validation and the other for testing; the remaining links are used for training. For Higher-order Pattern Prediction, we focus on predicting whether two nodes will be connected to a third node concurrently via a higher-order link in the future. Specifically, positive queries are node triplets, where two nodes are linked before the timestamp $t$ and the third node establishes a connection to the pair via a higher-order link after $t$. The split ratio of positive node triplets is 60/20/20 for training/validation/testing. For Relation Prediction and Higher-order Pattern Prediction, each positive query is paired with 1000 randomly sampled negative queries (100 for tags-math) in testing. For fair comparison, all baselines are tested with the same set of negative queries sampled individually for each dataset. All experiments are run 10 times independently, and we report the mean performance and standard deviations.

Table 2. Comparison of SGRL Methods for Subgraph Sampling. Suppose there are $O(|\mathcal{E}|)$ queries, and let $S$ denote the average size of sampled subgraphs. The wall-clock time is measured on the citation2 test set with $p=16$ threads.
Methods            SEAL (1-hop) (Zhang and Chen, 2018; Zhang et al., 2021a)   DE-GNN (Li et al., 2020)   SUREL
Time Complexity    $O(S|\mathcal{E}|)$                                        $O(S|\mathcal{E}|)$        $O(\frac{mM}{p}\cdot|\mathcal{V}|)$
Wall Time          36h                                                        > 1 month                  26s

Baselines. We consider three classes of baselines. Graph embedding methods for transductive learning: Node2vec (Grover and Leskovec, 2016) and DeepWalk (Perozzi et al., 2014), which learn a single embedding for each node and may suffer from the information over-squashing issue. Canonical GNNs: GCN (Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017), GraphSAINT (Zeng et al., 2020), Cluster-GCN (Chiang et al., 2019), Relational GCN (R-GCN) (Schlichtkrull et al., 2018), and Relation-aware Heterogeneous Graph Neural Network (R-HGNN) (Yu et al., 2022). SGRL models: SEAL (Zhang and Chen, 2018; Zhang et al., 2021a) and DE-GNN (Li et al., 2020). SEAL supports both offline and online subgraph extraction per query; however, it takes SEAL 2+ hours and 102GB of RAM to extract subgraphs offline for just 2% of the training queries on citation2, so we only keep the online setting for SEAL. DE-GNN only supports offline subgraph extraction. Table 2 compares subgraph sampling across the different SGRL methods. We adopt the official implementations of the above baselines, with parameters tuned to match their reported results. SUREL uses a 2-layer MLP to embed RPEs and a 2-layer RNN to encode the query-level joined walks; the obtained subgraph embeddings are fed into an MLP classifier for the final prediction. The default training parameters are: learning rate lr=1e-3 with early stopping of 5-epoch patience, dropout p=0.1, Adam (Kingma and Ba, 2015) as the optimizer, batch capacity $B_1=1500$, and batch size $B_2=32$. The hidden dimension $d$ and the walk parameters $M, m$ are investigated in Sec. 4.4.

Metrics. The evaluation metrics include Hits@K and Mean Reciprocal Rank (MRR). Hits@K counts the percentage of positive samples that are ranked within the top K against all the negative ones. MRR first calculates the reciprocal rank of each positive sample against its paired negative samples, and then averages over all queries.
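For reference, one common way to compute these two metrics is sketched below; this is our own NumPy version and may differ in minor details (e.g., tie handling) from the official OGB evaluators used in the experiments.

```python
import numpy as np

def hits_at_k(pos_scores, neg_scores, k):
    """Fraction of positives scored above the k-th highest shared negative score."""
    threshold = np.sort(neg_scores)[-k]
    return float(np.mean(pos_scores > threshold))

def mrr(pos_scores, neg_scores_per_query):
    """Mean reciprocal rank of each positive against its own paired negatives."""
    ranks = [1 + int(np.sum(negs >= p))
             for p, negs in zip(pos_scores, neg_scores_per_query)]
    return float(np.mean([1.0 / r for r in ranks]))

# Toy usage: two positive queries, each paired with three negatives for MRR.
print(hits_at_k(np.array([0.9, 0.4]), np.array([0.1, 0.5, 0.3]), k=1))  # 0.5
print(mrr(np.array([0.9, 0.4]),
          [np.array([0.1, 0.5, 0.3]), np.array([0.6, 0.2, 0.7])]))      # (1/1 + 1/3) / 2
```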

Environment. We use a server with four Intel Xeon Gold 6248R CPUs, 1TB DRAM, and eight NVIDIA RTX 6000 (24GB) GPUs.

Table 3. Results for Link Prediction, Relation Prediction, and Higher-order Pattern Prediction.
Models        citation2           collab              ppa
              MRR (%)             Hits@50 (%)         Hits@100 (%)
Node2vec      61.28±0.15          47.54±0.78          18.05±0.52
DeepWalk      84.47±0.04          49.08±0.93          27.80±1.71
GCN           84.74±0.21          44.75±1.07          18.67±1.32
SAGE          82.60±0.36          54.63±1.12          16.55±2.40
Cluster-GCN   80.04±0.25          44.02±1.37          3.56±0.40
GraphSAINT    79.85±0.40          53.12±0.52          3.83±1.33
SEAL          87.67±0.32          63.64±0.71          48.80±3.16
SUREL         89.74±0.18          63.34±0.52          53.23±1.03

Models        MAG(P-A)       MAG(P-P)       tags-math      DBLP-coauthor
              MRR (%)        MRR (%)        MRR (%)        MRR (%)
GCN           39.43±0.29     57.43±0.30     51.64±0.27     37.95±2.59
SAGE          25.35±1.49     60.54±1.60     54.68±2.03     22.91±0.94
R-GCN         37.10±1.05     56.82±4.71     -              -
R-HGNN        33.41±2.47     45.91±3.28     -              -
DE-GNN        -              -              36.67±1.59     Timeout
SUREL         45.33±2.94     82.47±0.26     71.86±2.15     97.66±2.89

4.2. Prediction Performance Analysis

Table 3 shows the results for the three prediction tasks. Clearly, on the three link prediction benchmarks, the performance of SGRL models is significantly better than that of transductive graph embedding models and canonical GNNs, particularly on the challenging ppa and collab tasks. Among SGRL models, SUREL sets two SOTA results on ppa and citation2, and achieves performance comparable to SEAL on collab, which validates the modeling effectiveness of our proposed walk-based framework. For relation prediction and higher-order pattern prediction, we observe a large gap (up to 60%) between canonical GNNs and SUREL, especially in the higher-order cases. This again demonstrates the inherent modeling limitation of canonical GNNs for prediction over a set of nodes. DE-GNN suffers from serious scalability issues when employing subgraph extraction for higher-order pattern prediction: our best attempt is to deploy DE-GNN on tags-math using 10% of the training samples, while it fails on the other three datasets. For example, DE-GNN spends more than 300 hours preprocessing just 5% of the training queries of DBLP-coauthor.

Table 4. Breakdown of Runtime and Memory Consumption for Different Models on citation2, collab, and DBLP-coauthor. Training stops once no better validation result is observed for 3 consecutive epochs, at which point the model is assumed to have converged. Full-batch training models need NVIDIA A100 (48GB) GPUs; their results are marked with *. Other models take less time on A100 than on RTX 6000.
Dataset     Models          Prep.   Train    Inf.      Total     RAM    SDRAM
                            (s)     (s)      (s)       (s)       (GB)   (GB)
citation2   GCN *           17      16,835   32        16,884    9.5    37.55
            Cluster-GCN     197     2,663    82        2,942     18.3   14.07
            GraphSAINT      140     3,845    86        4,071     16.9   14.77
            SEAL (1-hop)    46      22,296   130,312   152,654   36.5   3.35
            SUREL           31      2,096    7,959     10,086    15.2   4.50
collab      GCN             6       840      0.1       846       3.2    5.17
            Cluster-GCN     8       649      0.2       666       3.4    5.29
            GraphSAINT      <1      6,746    0.2       6,747     3.2    6.58
            SEAL (1-hop)    10      7,675    37        7,722     15.4   6.97
            SUREL           <1      1,720    8         1,728     3.6    5.57
DBLP        GCN *           -       153      95        248       8.0    25.80
            SAGE *          -       86       77        161       7.5    24.70
            SUREL           10      430      1,667     2,107     8.6    8.61
Figure 4. Performance Profiling of Training & Inference (Up: Time-to-accuracy; Down: Inference Throughput).
(a) Prediction Performance (m/M)
(b) Training Time per Batch (2K queries)
(c) Inference Time
(d) Prediction Performance (dim)
Figure 5. Hyper-parameter Analysis: the number of walks MM, the step of walks mm, and the hidden dimension dd.

4.3. Runtime and Memory Complexity Analysis

Table 4 reports the runtime and memory consumption on a single machine (using one GPU) for canonical GNNs and SGRL models. SUREL offers a reasonable total runtime on these benchmarks compared with canonical GNNs. Meanwhile, its preprocessing overhead is negligible, as shown under 'Prep.' in Table 4, and the higher-order case can be handled efficiently as well. SEAL adopts online extraction, so this cost is not counted as preprocessing; instead, its training suffers from this computational bottleneck. DE-GNN uses offline extraction; it takes 15+ hours and 98GB of RAM to process the training queries of tags-math, and is clearly incapable of scaling to DBLP-coauthor (hence it is not present in Table 4). Overall, SUREL substantially accelerates subgraph extraction and makes SGRL feasible on large-scale graphs.

In terms of memory management, SUREL achieves RAM usage comparable to canonical GNNs, because the number of walks $M$ and the number of steps $m$ are small constants in practice. The extra memory cost is linear in $|\mathcal{V}|$, so the total memory cost is still dominated by the original graph. In contrast, SEAL induces much higher RAM usage as it extracts subgraphs of long-tailed sizes, and its total memory cost is often super-linear in $|\mathcal{V}|$. Both SEAL and SUREL consume much less SDRAM because they do not need the GPU to load large adjacency matrices or host node representations.

We further profile the training and inference performance and present the results in Fig. 4. The upper half plots the time-to-accuracy comparison between canonical GNNs and SGRL models. Each dot indicates one training epoch for full-batch GCN, SEAL, and SUREL, and 10 training epochs for Cluster-GCN and GraphSAINT. As the figure shows, both SEAL and SUREL need only 1-3 epochs to reach good performance, and each epoch of SUREL takes around 1/10 of SEAL's time on citation2. The time per epoch of full-batch GCN is comparable with SUREL, while Cluster-GCN and GraphSAINT are faster; however, these models generally take longer to converge and still reach only subpar performance. On ppa, SEAL's curve oscillates considerably, leading to slower convergence. SUREL uses larger $M$ and $m$ to achieve better and more stable performance on ppa, so its training time per epoch is comparable with SEAL's. The training curves of the canonical GNN baselines are not plotted for ppa because of their poor performance (see Table 3).

The bottom half of Fig. 4 compares the end-to-end inference throughput of the two classes of models. Canonical GNNs offer rapid inference, since they generate node representations as intermediate computation results that are shared across all queries. But as mentioned above, sharing node representations may over-squash useful information and degrade performance, as shown in Table 3. SEAL, as an SGRL model, achieves good prediction performance, but its inference is extremely slow because of per-query subgraph extraction. SUREL fundamentally removes this bottleneck by replacing the extraction with walk-based subgraph joining. It is 4-16× faster than SEAL at inference for link prediction, and achieves even greater speedups over DE-GNN in higher-order settings.

(a) Throughput
(b) Runtime
Figure 6. Performance Scaling of SUREL (Walk Sampler and Query-level RPE Joining) vs Different Number of Threads.

4.4. Significant Hyperparameter Analysis

The number of walks $M$, the number of steps $m$, and the hidden dimension $d$ affect the scalability and accuracy of SUREL. To examine their impact, we evaluate SUREL on citation2, a large sparse graph, and collab, a medium-sized dense graph, for different values of $m$, $M$, and $d$.

Prediction Performance. Figs. 5(a) and 5(d) show the prediction results. As expected, the performance consistently increases with a larger number of walks $M$. For the number of steps $m$, however, longer walks do not always guarantee better results; this depends on the specifics of the dataset. For instance, on citation2, accurately predicting the link between two papers needs more steps, as they capture a larger group of papers that share similar semantics. For collab, the case is different: deeper walks include more noise for predicting collaborations between two authors. In general, small $m$ (2~5) and $M$ (50~400) ensure adequate performance. By adjusting $m$ and $M$, we can trade off accuracy against scalability, which is not achievable with other SGRL models. Moreover, SUREL is insensitive to the hidden dimension, as shown in Fig. 5(d).

Training and Inference Time Cost. As Figs. 5(b) and 5(c) demonstrate, the time for walk sampling and subgraph joining is nearly linear w.r.t. the total number of walks ($m*M$) under the same number of threads (16 by default). Here, we do not adjust $M$ based on the degree of each node in a query, which may induce some duplication among the walks sampled from nodes with small degrees. Using a degree-adaptive $M$ is a promising way to further improve the scalability of SUREL while keeping good prediction performance; we leave this investigation for future study.

4.5. Performance Scaling

To investigate the scaling behavior of the parallel implementation, we examine the runtime of the heavy operations in SUREL with different numbers of threads. Fig. 6(a) shows the throughput of the walk sampler and query-level RPE joining on citation2. The runtime is also compared with the runtime estimated by Amdahl's law (Grama et al., 2003) in Fig. 6(b): walk sampling and RPE joining agree well with the expected speedup, implying a well-parallelized implementation.
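As a reminder of the baseline used for this comparison, Amdahl's law bounds the speedup by the serial fraction of the work; the snippet below illustrates the formula with a made-up parallel fraction chosen by us for illustration, not a value measured in the paper.

```python
def amdahl_speedup(parallel_fraction, num_threads):
    """Expected speedup when only `parallel_fraction` of the work scales with threads."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / num_threads)

# Hypothetical example: if 95% of walk sampling parallelized, 16 threads would give
# roughly a 9.1x speedup rather than the ideal 16x.
print(round(amdahl_speedup(0.95, 16), 1))
```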

5. Conclusion

We propose SUREL, a novel computational paradigm for subgraph-based representation learning on large-scale graphs. SUREL targets predicting relations over sets of nodes. It decomposes graph structures into sets of walks to avoid the irregularity of subgraphs and to enable the reuse of intermediate results, and then applies walk-based subgraph joining paired with relative positional encoding to learn representations of queried node sets. This design allows full parallelization and significantly improves model scalability. SUREL embodies the principle of algorithm and system co-design, which unlocks the full potential of learning on large-scale data with limited resources. To the best of our knowledge, this is the first work to study subgraph-based representation learning from the perspective of system scalability. Experiments show that SUREL achieves superior performance in both prediction and scalability on three different SGRL tasks over six large, real-world graph benchmarks.

Acknowledgements.
We sincerely thank all the reviewers for their valuable feedback and actionable suggestions. H. Yin and P. Li are supported by the 2021 JPMorgan Faculty Award and the National Science Foundation (NSF) award HDR-2117997.

References

  • Alon and Yahav (2020) Uri Alon and Eran Yahav. 2020. On the bottleneck of graph neural networks and its practical implications. In International Conference on Learning Representations.
  • Bouritsas et al. (2020) Giorgos Bouritsas, Fabrizio Frasca, Stefanos Zafeiriou, and Michael M Bronstein. 2020. Improving graph neural network expressivity via subgraph isomorphism counting. In ICML 2020 Workshop on Graph Representation Learning and Beyond.
  • Chen et al. (2018a) Jie Chen, Tengfei Ma, and Cao Xiao. 2018a. Fastgcn: fast learning with graph convolutional networks via importance sampling. In International Conference on Learning Representations.
  • Chen et al. (2018b) Jianfei Chen, Jun Zhu, and Le Song. 2018b. Stochastic training of graph convolutional networks with variance reduction. In International Conference on Machine Learning. PMLR, 942–950.
  • Chen et al. (2020) Zhengdao Chen, Lei Chen, Villar Soledad, and Joan Bruna. 2020. Can Graph Neural Networks Count Substructures?. In Advances in Neural Information Processing Systems, Vol. 33.
  • Chiang et al. (2019) Wei-Lin Chiang, Xuanqing Liu, Si Si, Yang Li, Samy Bengio, and Cho-Jui Hsieh. 2019. Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 257–266.
  • Epasto and Perozzi (2019) Alessandro Epasto and Bryan Perozzi. 2019. Is a single embedding enough? Learning node representations that capture multiple social contexts. In The world wide web conference. 394–404.
  • Fey and Lenssen (2019) Matthias Fey and Jan Eric Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. In ICLR 2019 Workshop on Representation Learning on Graphs and Manifolds.
  • Grama et al. (2003) Ananth Grama, Vipin Kumar, Anshul Gupta, and George Karypis. 2003. Introduction to parallel computing. Pearson Education.
  • Grover and Leskovec (2016) Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 855–864.
  • Hamilton (2020) William L Hamilton. 2020. Graph representation learning. Synthesis Lectures on Artifical Intelligence and Machine Learning 14, 3 (2020), 1–159.
  • Hamilton et al. (2017) William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 1025–1035.
  • Hu et al. (2020) Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. 2020. Open Graph Benchmark: Datasets for Machine Learning on Graphs. arXiv preprint arXiv:2005.00687 (2020).
  • Huang and Zitnik (2020) Kexin Huang and Marinka Zitnik. 2020. Graph meta learning via local subgraphs. In Advances in Neural Information Processing Systems, Vol. 33.
  • Huang et al. (2018) Wenbing Huang, Tong Zhang, Yu Rong, and Junzhou Huang. 2018. Adaptive Sampling Towards Fast Graph Representation Learning. In Advances in Neural Information Processing Systems, Vol. 31.
  • Jia et al. (2020) Zhihao Jia, Sina Lin, Mingyu Gao, Matei Zaharia, and Alex Aiken. 2020. Improving the accuracy, scalability, and performance of graph neural networks with roc. Proceedings of Machine Learning and Systems 2 (2020), 187–198.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
  • Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
  • Koller et al. (2007) Daphne Koller, Nir Friedman, Sašo Džeroski, Charles Sutton, Andrew McCallum, Avi Pfeffer, Pieter Abbeel, Ming-Fai Wong, Chris Meek, Jennifer Neville, et al. 2007. Introduction to statistical relational learning. MIT press.
  • Li et al. (2020) Pan Li, Yanbang Wang, Hongwei Wang, and Jure Leskovec. 2020. Distance Encoding: Design Provably More Powerful Neural Networks for Graph Representation Learning. In Advances in Neural Information Processing Systems, Vol. 33.
  • Liben-Nowell and Kleinberg (2007) David Liben-Nowell and Jon Kleinberg. 2007. The link-prediction problem for social networks. Journal of the American society for information science and technology 58, 7 (2007), 1019–1031.
  • Liu et al. (2020a) Husong Liu, Shengliang Lu, Xinyu Chen, and Bingsheng He. 2020a. G3: when graph neural networks meet parallel graph processing systems on GPUs. Proceedings of the VLDB Endowment 13, 12 (2020), 2813–2816.
  • Liu et al. (2020b) Xin Liu, Haojie Pan, Mutian He, Yangqiu Song, Xin Jiang, and Lifeng Shang. 2020b. Neural subgraph isomorphism counting. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1959–1969.
  • Liu et al. (2022) Yunyu Liu, Jianzhu Ma, and Pan Li. 2022. Neural Predicting Higher-Order Patterns in Temporal Networks. In Proceedings of the Web Conference 2022. Association for Computing Machinery, 1340–1351.
  • Lou et al. (2020) Zhaoyu Lou, Jiaxuan You, Chengtao Wen, Arquimedes Canedo, Jure Leskovec, et al. 2020. Neural Subgraph Matching. arXiv preprint arXiv:2007.03092 (2020).
  • Mohoney et al. (2021) Jason Mohoney, Roger Waleffe, Henry Xu, Theodoros Rekatsinas, and Shivaram Venkataraman. 2021. Marius: Learning Massive Graph Embeddings on a Single Machine. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). 533–549.
  • Morris et al. (2019) Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. 2019. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4602–4609.
  • Oono and Suzuki (2019) Kenta Oono and Taiji Suzuki. 2019. Graph Neural Networks Exponentially Lose Expressive Power for Node Classification. In International Conference on Learning Representations.
  • Perozzi et al. (2014) Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 701–710.
  • Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. Nature 323, 6088 (1986), 533–536.
  • Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In European semantic web conference. Springer, 593–607.
  • Srinivasan and Ribeiro (2020) Balasubramaniam Srinivasan and Bruno Ribeiro. 2020. On the Equivalence between Node Embeddings and Structural Graph Representations. In International Conference on Learning Representations.
  • Stenseth et al. (1997) Nils Chr Stenseth, Wilhelm Falck, Ottar N Bjørnstad, and Charles J Krebs. 1997. Population regulation in snowshoe hare and Canadian lynx: asymmetric food web configurations between hare and lynx. Proceedings of the National Academy of Sciences 94, 10 (1997), 5147–5152.
  • Teru et al. (2020) Komal Teru, Etienne Denis, and Will Hamilton. 2020. Inductive relation prediction by subgraph reasoning. In International Conference on Machine Learning. PMLR, 9448–9457.
  • Thorpe et al. (2021) John Thorpe, Yifan Qiao, Jonathan Eyolfson, Shen Teng, Guanzhou Hu, Zhihao Jia, Jinliang Wei, Keval Vora, Ravi Netravali, Miryung Kim, and Guoqing Harry Xu. 2021. Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads. In 15th USENIX Symposium on Operating Systems Design and Implementation (OSDI 21). 495–514.
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. In International Conference on Learning Representations.
  • Wang et al. (2022) Haorui Wang, Haoteng Yin, Muhan Zhang, and Pan Li. 2022. Equivariant and Stable Positional Encoding for More Powerful Graph Neural Networks. In International Conference on Learning Representations.
  • Wang et al. (2020) Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. 2020. Microsoft academic graph: When experts are not enough. Quantitative Science Studies 1, 1 (2020), 396–413.
  • Wang et al. (2019) Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, et al. 2019. Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs. In ICLR 2019 Workshop on Representation Learning on Graphs and Manifolds.
  • Wang and Zhang (2022) Xiyuan Wang and Muhan Zhang. 2022. GLASS: GNN with Labeling Tricks for Subgraph Representation Learning. In International Conference on Learning Representations.
  • Wang et al. (2021) Yanbang Wang, Yen-Yu Chang, Yunyu Liu, Jure Leskovec, and Pan Li. 2021. Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks. In International Conference on Learning Representations.
  • Xie et al. (2021) Anze Xie, Anders Carlsson, Jason Mohoney, Roger Waleffe, Shanan Peters, Theodoros Rekatsinas, and Shivaram Venkataraman. 2021. Demo of marius: a system for large-scale graph embeddings. Proceedings of the VLDB Endowment 14, 12 (2021), 2759–2762.
  • Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. 2019. How Powerful are Graph Neural Networks?. In International Conference on Learning Representations.
  • Yang (2019) Hongxia Yang. 2019. Aligraph: A comprehensive graph neural network platform. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3165–3166.
  • Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 974–983.
  • Yu et al. (2022) Le Yu, Leilei Sun, Bowen Du, Chuanren Liu, Weifeng Lv, and Hui Xiong. 2022. Heterogeneous graph representation learning with relation awareness. IEEE Transactions on Knowledge and Data Engineering (2022).
  • Zeng et al. (2021) Hanqing Zeng, Muhan Zhang, Yinglong Xia, Ajitesh Srivastava, Andrey Malevich, Rajgopal Kannan, Viktor Prasanna, Long Jin, and Ren Chen. 2021. Decoupling the Depth and Scope of Graph Neural Networks. In Advances in Neural Information Processing Systems, Vol. 34.
  • Zeng et al. (2020) Hanqing Zeng, Hongkuan Zhou, Ajitesh Srivastava, Rajgopal Kannan, and Viktor Prasanna. 2020. Graphsaint: Graph sampling based inductive learning method. In International Conference on Learning Representations.
  • Zhang et al. (2020) Dalong Zhang, Xin Huang, Ziqi Liu, Jun Zhou, Zhiyang Hu, Xianzheng Song, Zhibang Ge, Lin Wang, Zhiqiang Zhang, and Yuan Qi. 2020. AGL: a scalable system for industrial-purpose graph machine learning. Proceedings of the VLDB Endowment 13, 12 (2020), 3125–3137.
  • Zhang and Chen (2018) Muhan Zhang and Yixin Chen. 2018. Link prediction based on graph neural networks. In Advances in Neural Information Processing Systems, Vol. 31.
  • Zhang and Chen (2020) Muhan Zhang and Yixin Chen. 2020. Inductive Matrix Completion Based on Graph Neural Networks. In International Conference on Learning Representations.
  • Zhang et al. (2021a) Muhan Zhang, Pan Li, Yinglong Xia, Kai Wang, and Long Jin. 2021a. Labeling Trick: A Theory of Using Graph Neural Networks for Multi-Node Representation Learning. In Advances in Neural Information Processing Systems, Vol. 34.
  • Zhang et al. (2021b) Wentao Zhang, Zhi Yang, Yexin Wang, Yu Shen, Yang Li, Liang Wang, and Bin Cui. 2021b. GRAIN: Improving Data Efficiency of Graph Neural Networks via Diversified Influence Maximization. Proceedings of the VLDB Endowment 14, 11 (2021), 2473–2482.
  • Zhou et al. (2021) Hongkuan Zhou, Ajitesh Srivastava, Hanqing Zeng, Rajgopal Kannan, and Viktor Prasanna. 2021. Accelerating Large Scale Real-Time GNN Inference Using Channel Pruning. Proceedings of the VLDB Endowment 14, 9 (2021), 1597–1605.

Appendix A Notations

Frequently used symbols are summarized in Table 5.

Table 5. Summary of Frequently Used Notations.
Symbol | Meaning
$Q$ | a query (a set of nodes), e.g., $Q=\{u,v,w\}$
$\mathcal{Q}$ | a collection of queries, i.e., $Q\in\mathcal{Q}$
$\mathcal{W}_u$ | the set of walks starting from node $u$
$\mathcal{W}$ | a collection of walks, i.e., $W\in\mathcal{W}$
$\mathcal{V}_u$ | the set of unique nodes appearing in $\mathcal{W}_u$
$\mathcal{X}_u$ | the relative positional encoding (RPE) of nodes in $\mathcal{V}_u$, recording their per-step occurrences in $\mathcal{W}_u$
$\mathcal{X}_{u,v}$ | the RPE vector of node $v$ regarding node $u$ (all zeros if $v\notin\mathcal{V}_u$)
$\mathcal{T}$ | the RPE array with RPE-IDs as indices and RPE vectors as entries
$\mathcal{H}_u$ | the dictionary with nodes in $\mathcal{V}_u$ as keys and RPEs $\mathcal{X}_u$ (or RPE-IDs) as values
$\mathcal{A}$ | the associative array with nodes $u\in\mathcal{V}$ as keys and the tuples $(\mathcal{W}_u,\mathcal{X}_u)$ as entries
$\uplus$ | the concatenation operation that joins node-level RPEs, i.e., joins $\mathcal{X}_{u,x}$ for $u\in Q$ as $[\dots,\mathcal{X}_{u,x},\dots]$
$\mathcal{X}_{Q,x}$ | the query-level RPE of node $x$, $\mathcal{X}_{Q,x}=\uplus_{u\in Q}\mathcal{X}_{u,x}$
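To make the notation concrete, the following minimal Python sketch (an illustration only, not the SUREL C implementation) builds $\mathcal{W}_u$ and the node-level RPE dictionary $\mathcal{H}_u$ for a toy adjacency list; the graph, $M$, and $m$ are hypothetical.

import random
from collections import defaultdict

def sample_walks(adj, u, M, m, seed=0):
    """Sample M random walks of m steps starting from node u (the set W_u)."""
    random.seed(seed)
    walks = []
    for _ in range(M):
        walk, cur = [u], u
        for _ in range(m):
            cur = random.choice(adj[cur])
            walk.append(cur)
        walks.append(walk)
    return walks

def node_rpe(walks, m):
    """H_u: map each node v in V_u to its RPE X_{u,v}, i.e., the number of
    times v occurs at each step 0..m over the walks in W_u."""
    rpe = defaultdict(lambda: [0] * (m + 1))
    for walk in walks:
        for step, v in enumerate(walk):
            rpe[v][step] += 1
    return dict(rpe)

# Toy graph as an adjacency list (assumed example, not a dataset from the paper).
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
W_u = sample_walks(adj, u=0, M=4, m=3)
H_u = node_rpe(W_u, m=3)   # keys: V_u, values: RPE vectors X_{u,v}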

Appendix B Further Analysis

Fig. 7 shows the degree distribution and the subgraph node size with respect to the number of hops in the real-world networks collab and citation2. The node size of subgraphs increases dramatically once the number of hops $m\geq 2$, because many nodes in real-world networks have very large degrees, as illustrated in Fig. 7 (left). This leads to the size "explosion" issue for current SGRL models. Accordingly, most SGRL models, including SEAL (Zhang and Chen, 2018; Zhang et al., 2021a) and DE-GNN (Li et al., 2020), can only accommodate at most 1-hop neighbors over large networks to avoid this scalability crisis, which in turn compromises their performance. SUREL sidesteps this issue by breaking the subgraph into regular walks, which enables it to reach up to $m$-hop neighbors via $m$-step random walks; a rough back-of-the-envelope comparison is given below. These longer-range neighbors provide extra information that benefits prediction performance.
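The snippet below contrasts the expected size of an $m$-hop enclosing subgraph with SUREL's walk budget. The numbers are illustrative assumptions (the $d^m$ growth model is a simplification, and the hub degree is hypothetical); only the average degree and walk budget are taken from Tables 6 and 8.

# Back-of-the-envelope comparison (illustrative assumptions, not measured values):
# an m-hop enclosing subgraph grows roughly as d**m for a branching factor d,
# while SUREL visits at most M*m nodes per query node regardless of degree.
avg_deg, hub_deg = 20.7, 5000       # citation2 average degree; a hypothetical hub
M, m = 200, 4                       # SUREL's walk budget for citation2 (Table 8)
print("2-hop subgraph (avg):", int(avg_deg ** 2))        # ~428 nodes
print("2-hop subgraph (hub):", int(hub_deg * avg_deg))   # ~103,500 nodes around one hub
print("walk budget per node:", M * m)                    # 800 node visits, fixed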

Appendix C Limitations of Canonical GNNs and More Illustration of the Algorithmic Insights of SUREL

Canonical GNNs are known to have several limitations, such as limited expressive power (Xu et al., 2019; Morris et al., 2019), feature oversmoothing (Oono and Suzuki, 2019), information over-squashing (Epasto and Perozzi, 2019; Alon and Yahav, 2020), and noise contamination caused by large receptive fields (Huang and Zitnik, 2020; Zeng et al., 2021).

One of the biggest limitations is that canonical GNNs cannot distinguish nodes that can be mapped to each other under some graph automorphism. For example, the nodes $w$ and $v$ in Fig. 1 satisfy this property, and canonical GNNs will associate them with the same node representation. Another, more practical example from a food web is shown in Fig. 8. Admittedly, most large networks do not have non-trivial automorphisms, so such nodes are not common. However, this issue of GNNs points to a more severe concern: the node representations learnt by GNNs cannot capture intra-node distance information well, which is crucial for prediction over a set of nodes.

Another issue of canonical GNNs is that they over-squash information into a single node representation. Node representations can be viewed as intermediate computation results that are often reused across several downstream tasks. However, a node representation that carries suitable information for one task may carry subpar information for another.

The third issue of canonical GNNs is the entanglement between the number of GNN layers and the range of the receptive field. In practice, when more complex and non-linear functions are to be approximated, one may want to add more layers to the neural network. However, adding more layers to a GNN also enlarges its receptive field, which may introduce noise.

The last two issues of GNNs are demonstrated in the left column of Fig. 9. SGRL methods can handle these two issues of GNNs in a simple way as illustrated in the right column of Fig. 9.

Figure 7. Characteristics of Real-World Networks
Figure 8. A food web example showing two disconnected components: the boreal forest (Stenseth et al., 1997) and the Antarctic fauna (Bates et al., 2015). The query is: which one is more likely the predator of pelagic fish, Lynx or Orca? Canonical GNNs cannot answer this query. Srinivasan and Ribeiro (2020) explain this as the failure of GNNs to establish the correlation between node representations.
Figure 9. LEFT: Two Limitations of Canonical-GNN-based GRL; RIGHT: How to Solve Them by Using Subgraph-based GRL.

Furthermore, to handle the challenge encountered in Fig. 1a, Fig. 1b indicates that SGRL methods can be adopted, which consider the joint subgraph around the target nodes ($\{u,w\}$ or $\{u,v\}$) as a whole. Subgraph extraction can be mathematically viewed as adding an intra-node distance feature to each node $z$: suppose $Q=\{u,w\}$ is the queried node set; if the distances from $z$ to $u$ and to $w$ are both at most 1, then $z$ is selected. Such distance features can also be attached directly to raw node features as extra node features. Li et al. (2020) have proved the effectiveness of using intra-node distances to improve GNN expressive power. SUREL captures such intra-node distances by adopting relative positional encoding (RPE); a minimal sketch of how node-level RPEs are joined into a query-level RPE follows below. Lastly, we illustrate how SUREL is related to previous SGRL approaches in Fig. 10.
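The following self-contained sketch (with hypothetical node-level RPEs) illustrates the join operation $\uplus$ that produces the query-level RPE $\mathcal{X}_{Q,x}$; in SUREL this join is performed by the C-level subgraph-joining operator rather than in Python.

def query_rpe(H, Q, x, m):
    """X_{Q,x}: concatenate X_{u,x} for each u in Q (all zeros if x not in V_u)."""
    zeros = [0] * (m + 1)
    return [c for u in Q for c in H.get(u, {}).get(x, zeros)]

# Hypothetical node-level RPEs H_u, H_w for a query Q = {u, w} with m = 2 steps.
H = {
    'u': {'u': [4, 0, 1], 'x': [0, 3, 2]},   # X_{u,u}, X_{u,x}
    'w': {'w': [4, 1, 0], 'x': [0, 2, 2]},   # X_{w,w}, X_{w,x}
}
print(query_rpe(H, Q=['u', 'w'], x='x', m=2))  # -> [0, 3, 2, 0, 2, 2]
print(query_rpe(H, Q=['u', 'w'], x='u', m=2))  # -> [4, 0, 1, 0, 0, 0]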

Figure 10. How is SUREL related to previous SGRL models? Previous SGRL models, such as SEAL (Zhang and Chen, 2018) and DE-GNN (Li et al., 2020), first extract the neighboring subgraphs according to the queries, and then attach distance features to the nodes. In SUREL, every subgraph is represented by a set of walks sampled from it. Given a query, the joint subgraph is represented by the concatenation of these sets of walks, and the query-level relative positional encoding (RPE) is obtained by joining the node-level RPEs that represent the intra-node distance features.
Table 6. Summary Statistics and Experimental Setup for Evaluation Datasets.
Dataset | Type | #Nodes | #Edges | Avg. Node Deg. | Density | Split Ratio | Split Type | Metric
citation2 | Homo. | 2,927,963 | 30,561,187 | 20.7 | 0.00036% | 98/1/1 | Time | MRR
collab | Homo. | 235,868 | 1,285,465 | 8.2 | 0.0046% | 92/4/4 | Time | Hits@50
ppa | Homo. | 576,289 | 30,326,273 | 73.7 | 0.018% | 70/20/10 | Throughput | Hits@100
ogb-mag | Hetero. | Paper (P): 736,389; Author (A): 1,134,649 | P-A: 7,145,660; P-P: 5,416,271 | 21.7 | N/A | 99/0.5/0.5 | Time | MRR
tags-math | Higher. | 1,629 | 91,685 (projected); 822,059 (hyperedges) | N/A | N/A | 60/20/20 | Time | MRR
DBLP-coauthor | Higher. | 1,924,991 | 7,904,336 (projected); 3,700,067 (hyperedges) | N/A | N/A | 60/20/20 | Time | MRR

Appendix D Additional Experimental Settings

D.1. More Details about the Datasets

The dataset statistics and experimental setup for evaluation are provided in Table 6. We choose OGB datasets (https://ogb.stanford.edu/docs/dataset_overview/) to evaluate our framework and the baselines, since OGB provides large, real-world graphs (millions of nodes/edges) from realistic applications (e.g., academic networks, protein interactions). Moreover, it offers standard, open-sourced evaluation metrics and tools for benchmarking.

Following the format of OGB, we design four prediction tasks for relations and higher-order patterns, and construct the corresponding datasets on heterogeneous graphs and hypergraphs (https://www.cs.cornell.edu/~arb/data/). The original ogb-mag only contains features for 'paper'-type nodes; we add the node embeddings provided by (Yu et al., 2022) as raw features for the remaining node types in MAG(P-A)/(P-P). For these four tasks, the model is evaluated by ranking one positive query against a certain number of randomly sampled negative queries, as listed in Table 7 along with other dataset statistics; a minimal sketch of this ranking-based evaluation is given below. The customized datasets for relation and higher-order pattern prediction are accessible via Box at https://app.box.com/s/v9nszkai2gig13lm1o6q2ya82ap07gyb.
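The sketch below mirrors, but does not reproduce, the OGB-style MRR computation: the reciprocal rank of one positive query against its sampled negatives. The scores and the negative-sample count are hypothetical.

import numpy as np

def mrr(pos_score, neg_scores):
    """Reciprocal rank of the positive query against its sampled negatives.
    Rank 1 means the positive scores above all negatives (ties counted optimistically)."""
    rank = 1 + np.sum(np.asarray(neg_scores) > pos_score)
    return 1.0 / rank

# One positive query paired with 1000 random negatives, as in Table 7.
rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, size=1000)   # hypothetical model scores for negatives
pos = 2.5                               # hypothetical score for the positive query
print(mrr(pos, neg))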

Table 7. Dataset Statistics for Relation Prediction and Higher-order Pattern Prediction.
Dataset | Query Type | #Train (Pos.) | #Val./Test (Pos.) | Pos./Neg.
MAG(P-A) | relation | 6,519,308 | 16,180 | 1:1000
MAG(P-P) | relation | 5,199,201 | 22,639 | 1:1000
tags-math | higher-order | 74,955 | 24,985 | 1:100
DBLP-coauthor | higher-order | 79,566 | 26,522 | 1:1000

D.2. More Details about the Baselines

For link prediction and relation prediction, we select baselines from the current OGB leaderboard (https://ogb.stanford.edu/docs/leader_linkprop/) based on two main factors: scalability and prediction performance. All selected public models have released code along with a technical report. After additional verification, we adopt the published numbers when they are available on the leaderboard; for the rest, we benchmark the models using their official implementations and the tuned parameters listed below.

  • Graph embedding: graph embedding models for transductive learning, such as node2vec (Grover and Leskovec, 2016) and DeepWalk (Perozzi et al., 2014) (https://github.com/dmlc/dgl/tree/master/examples/pytorch/ogb/deepwalk). As implicit matrix factorization, these models can be extensively optimized (Mohoney et al., 2021; Xie et al., 2021) for large-scale graph mining tasks. The obtained node representations embed the global position of the target nodes in the given graph, which can potentially be exploited for link prediction.

  • GCN family: graph auto-encoder models that use graph convolution layers to learn node representations, including GCN (Kipf and Welling, 2017), GraphSAGE (Hamilton et al., 2017), and derived models such as Cluster-GCN (Chiang et al., 2019) and GraphSAINT (Zeng et al., 2020).

  • R-GCN (Schlichtkrull et al., 2018) (https://github.com/pyg-team/pytorch_geometric/blob/master/examples): a relational GCN that models heterogeneous graphs with node/link types.

  • R-HGNN (Yu et al., 2022) (https://github.com/yule-BUAA/R-HGNN): a heterogeneous GNN that focuses on learning relation-aware node representations with an attention mechanism.

  • SEAL (Zhang and Chen, 2018) (https://github.com/facebookresearch/SEAL_OGB): applies GCNs with the double-radius node labeling trick to obtain subgraph-level readouts for link prediction. SEAL holds top spots on the OGB leaderboard for multiple tasks, thanks to the expressiveness inherited from SGRL. The implementation we tested is the one specially optimized for OGB datasets provided in (Zhang et al., 2021a).

  • DE-GNN (Li et al., 2020) (https://github.com/snap-stanford/distance-encoding): a provably more powerful SGRL model that utilizes distance features (e.g., shortest path distance, landing probability) to assist GNNs in representing arbitrary node sets. DE-GNN can be applied to tasks such as node classification, link prediction, and higher-order pattern prediction with strong performance.

For graph embedding approaches, we first use these models to generate node embeddings, and then train an MLP as the link predictor, whose input is the Hadamard product of the hidden representations of the two queried nodes (see the sketch below). Then, as the OGB guidelines require, we perform the standard data splitting, tune the MLP over the validation set, and test it through the benchmark evaluator.
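A minimal PyTorch sketch of such a predictor is given below; the embedding dimension, hidden size, and data are placeholders, and the actual baselines follow the OGB example code rather than this snippet.

import torch
import torch.nn as nn

class HadamardLinkPredictor(nn.Module):
    """Score a (u, v) pair from pretrained embeddings via their Hadamard product."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, emb_u, emb_v):
        return self.mlp(emb_u * emb_v).squeeze(-1)   # element-wise product, then MLP

# emb holds node embeddings produced by, e.g., node2vec (hypothetical shapes).
emb = torch.randn(1000, 128)
pred = HadamardLinkPredictor(dim=128)
scores = pred(emb[[0, 1]], emb[[2, 3]])   # scores for candidate links (0,2) and (1,3)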

All canonical GNN baselines (https://github.com/snap-stanford/ogb/tree/master/examples/linkproppred) use three message-passing layers with 256 hidden dimensions and a dropout ratio tuned in $\{0, 0.5\}$ for full-batch training. Canonical GNN models combine the node embeddings of the queried set as the representation of a link/hyperedge, which is then fed into an MLP classifier for the final prediction. Besides, they need the full training data to generate robust node representations. The hypergraph datasets do not come with raw node features; thus, canonical GNNs use random features as input, trained along with the other model parameters. R-GCN and R-HGNN use relational GCNConv layers that support message passing with different relation types between nodes; the relation type of each edge is used as input besides node features.

SGRL-based models only use partial edges from the training set. Both SEAL and DE-GNN extract 1-hop enclosing subgraphs for training. SEAL applies three GCN layers with 32 hidden dimensions, followed by SortPooling and several 1D convolution layers, to generate the readout of the target subgraph for prediction. DE-GNN adopts the shortest path distance computed from each extracted subgraph as the input feature, and then employs two TAGConv layers with 100 hidden dimensions to generate the readout of the queried node set.

D.3. Architecture and Hyperparameter

SUREL consists of a 2-layer MLP with ReLU activation for embedding node RPEs and a 2-layer RNN to encode the joint walks obtained from joining queried subgraphs. The hidden dimension of both networks is set to 64. Lastly, the concatenated hidden representations of the queried node set are fed into a 2-layer MLP classifier to make the final prediction; a minimal sketch of this pipeline is given below.
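The following PyTorch sketch outlines this pipeline under simplifying assumptions (a GRU stands in for the RNN, walks are pooled by mean, and the tensor shapes are illustrative); it is not the exact SUREL implementation.

import torch
import torch.nn as nn

class SurelLikeEncoder(nn.Module):
    """Sketch: embed per-step RPEs with a 2-layer MLP, encode each joint walk
    with a 2-layer RNN, then classify the pooled subgraph representation."""
    def __init__(self, rpe_dim, hidden=64):
        super().__init__()
        self.rpe_mlp = nn.Sequential(
            nn.Linear(rpe_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.rnn = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, walk_rpe):
        # walk_rpe: [num_walks, num_steps, rpe_dim], query-level RPEs along each walk
        h = self.rpe_mlp(walk_rpe)
        _, h_last = self.rnn(h)             # encode each walk sequence
        pooled = h_last[-1].mean(dim=0)     # pool all walks of the queried subgraph
        return self.classifier(pooled)

# Illustrative shapes: 2 query nodes, 200 walks of 4 steps each, RPE dim (m+1)*|Q| = 10.
model = SurelLikeEncoder(rpe_dim=10)
logit = model(torch.randn(400, 5, 10))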

For link and relation prediction, we follow the inductive setting where only partial samples are used for training. Over the training graph, we randomly select 5% of the links as positive training queries, each paired with $k$ negative samples ($k=50$ by default). We remove these links and use the remaining 95% of the links to collect random walks and compute node RPEs via Algorithm 1. For higher-order pattern prediction, we use the given graph before timestamp $t$ to obtain walks and RPEs, and then optimize the model parameters using the higher-order triplets provided in the training set.

The results reported in Table 3 are obtained with the combinations of hyperparameters listed in Table 8. For the profiling of SUREL in Table 4 and Fig. 4, we use the following combinations: citation2 with $m=4$, $M=50$, $k=20$; collab with $m=2$, $M=200$, $k=20$, as relatively small values of $m$, $M$, and $k$ already guarantee sufficiently good performance. The remaining hyperparameters are the same as reported earlier.

Table 8. Hyperparameters Used for Benchmark SUREL.
Dataset | #Steps $m$ | #Walks $M$ | #Neg. Samples $k$
citation2 | 4 | 200 | 50
collab | 2 | 400 | 50
ppa | 4 | 200 | 50
MAG (P-A) | 3 | 200 | 10
MAG (P-P) | 4 | 100 | 50
tags-math | 3 | 100 | 10
DBLP-coauthor | 3 | 100 | 10

D.4. Computation Complexity Analysis

We analyze the computation cost of the proposed framework SUREL and identify three major parts:

Random Walks: the space complexity is $O(mM|\mathcal{V}|)$, where $|\mathcal{V}|$ denotes the number of nodes in the input graph; the time complexity is $O(mM|\mathcal{V}|/p)$, where $p$ is the number of threads. In practice, the number of steps $m$ generally lies in the range of $2\sim 5$.

Relative Positional Encoding: the space complexity for RPEs is $O(m^{2}M|\mathcal{V}|)$, since there are at most $mM$ distinct nodes in the set of walks originating from a single node and each RPE vector has $m+1$ entries. In practice, the space requirement can be further reduced to about 1/10 by pruning and remapping RPEs via RPE-IDs. RPEs can be computed along with walk sampling (see the sketch after this list); thus, the time complexity of RPE computation remains $O(mM|\mathcal{V}|/p)$ when combined with random walks.

Subgraph Joining: for a query $Q$, the time complexity is $O(c\cdot mM|Q|/p)$ for joining all associated sets of walks in the queried subgraph, where $c$ is a scalar related to the size of $Q$. In practice, $|Q|=2$ for link prediction and $|Q|\geq 3$ for higher-order pattern prediction.
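The sketch below (pure Python with multiprocessing, whereas SUREL uses C/OpenMP threads) illustrates how walks can be sampled and RPEs accumulated in a single pass per source node, matching the $O(mM|\mathcal{V}|/p)$ bound; the toy graph and parameters are assumptions for illustration only.

import random
from collections import defaultdict
from multiprocessing import Pool

def sample_and_encode(args):
    """One unit of work: M walks of m steps from u, with the RPEs of W_u built
    on the fly, so no separate pass over the sampled walks is needed."""
    adj, u, M, m = args
    walks, rpe = [], defaultdict(lambda: [0] * (m + 1))
    for _ in range(M):
        walk, cur = [u], u
        rpe[u][0] += 1
        for step in range(1, m + 1):
            cur = random.choice(adj[cur])
            walk.append(cur)
            rpe[cur][step] += 1      # update X_{u,cur} while walking
        walks.append(walk)
    return u, walks, dict(rpe)

if __name__ == "__main__":
    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}    # toy graph (assumed example)
    jobs = [(adj, u, 4, 3) for u in adj]       # M = 4 walks, m = 3 steps
    with Pool(processes=2) as pool:            # p worker processes
        A = {u: (w, r) for u, w, r in pool.map(sample_and_encode, jobs)}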

D.5. Implementation Details

We implement our framework on top of PyTorch, NumPy, and OpenMP. uhash is adopted to provide lightweight and highly efficient indexing for the RPEs associated with sampled walks. For better parallelization, the computationally intensive parts are written in C with bindings that expose Python APIs, including the walk sampler, the RPE encoder, and the subgraph joining operation. To reduce the overhead of hybrid programming, we use RPE-IDs and native C/NumPy arrays instead of Python objects to exchange results between API calls and the underlying C functions. We also provide Python APIs to support customizing the above-mentioned parallel operations. The SUREL framework is open-source at https://github.com/Graph-COM/SUREL and free for academic use under the BSD-2-Clause license.
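As an illustration of this zero-copy exchange (generic ctypes/NumPy, not SUREL's actual bindings), the snippet below hands a NumPy buffer of hypothetical RPE-IDs directly to a C function without creating per-element Python objects.

import ctypes
import numpy as np

# Minimal ctypes illustration (not SUREL's actual bindings) of passing a NumPy
# buffer to C code so both sides operate on the same memory.
libc = ctypes.CDLL(None)                        # resolves libc symbols on Linux/macOS
libc.memset.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_size_t]
libc.memset.restype = ctypes.c_void_p

rpe_ids = np.arange(12, dtype=np.int32)         # e.g., an array of RPE-IDs
buf = rpe_ids.ctypes.data_as(ctypes.c_void_p)   # zero-copy pointer to the array buffer
libc.memset(buf, 0, rpe_ids.nbytes)             # the C side mutates it in place
print(rpe_ids)                                  # all zeros: no copies were exchanged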