Towards Better Dynamic Graph Learning:
New Architecture and Unified Library
Abstract
We propose DyGFormer, a new Transformer-based architecture for dynamic graph learning. DyGFormer is conceptually simple and only needs to learn from nodes’ historical first-hop interactions by: (i) a neighbor co-occurrence encoding scheme that explores the correlations of the source node and destination node based on their historical sequences; (ii) a patching technique that divides each sequence into multiple patches and feeds them to Transformer, allowing the model to effectively and efficiently benefit from longer histories. We also introduce DyGLib, a unified library with standard training pipelines, extensible coding interfaces, and comprehensive evaluating protocols to promote reproducible, scalable, and credible dynamic graph learning research. By performing exhaustive experiments on thirteen datasets for dynamic link prediction and dynamic node classification tasks, we find that DyGFormer achieves state-of-the-art performance on most of the datasets, demonstrating its effectiveness in capturing nodes’ correlations and long-term temporal dependencies. Moreover, some results of baselines are inconsistent with previous reports, which may be caused by their diverse but less rigorous implementations, showing the importance of DyGLib. All the used resources are publicly available at https://github.com/yule-BUAA/DyGLib.
1 Introduction
Dynamic graphs denote entities as nodes and represent their interactions as links with timestamps [26], which can model many real-world scenarios such as social networks [28, 50, 2], user-item interaction systems [31, 15, 71, 72, 70], traffic networks [67, 63, 19, 4, 69], and physical systems [24, 47, 43]. In recent years, representation learning on dynamic graphs has become a trending research topic [26, 49, 65]. There are two main categories of existing methods: discrete-time [39, 17, 48, 11, 66] and continuous-time [64, 45, 60, 51, 12]. In this paper, we focus on the latter approaches because they are more flexible and effective than the former and are being increasingly investigated.
Despite the rapid development of dynamic graph learning methods, they still suffer from two limitations. Firstly, most of them independently compute the temporal representations of nodes in an interaction without exploiting nodes' correlations, which are often indicative of future interactions. Moreover, existing methods learn at the interaction level and thus only work well for nodes with fewer interactions. When nodes have longer histories, these methods require sampling strategies to truncate the interactions for feasible calculation of computationally expensive modules like graph convolutions [55, 64, 45, 36, 9, 59], temporal random walks [60, 25], and sequential models [57, 12]. Though some approaches use memory networks [61, 53] to sequentially process interactions with affordable computational costs [28, 55, 45, 36, 59, 35], they suffer from vanishing/exploding gradients due to the usage of recurrent neural networks [40, 57]. Therefore, we conclude that previous methods lack the ability to capture either nodes' correlations or long-term temporal dependencies.
Secondly, the training pipelines of different methods are inconsistent and often lead to poor reproducibility. Also, existing methods are implemented by diverse frameworks (e.g., Pytorch [41], Tensorflow [1], DGL [58], PyG [16], C++), making it time-consuming and difficult for researchers to quickly understand the algorithms and further dive into the core of dynamic graph learning. Although there exist some libraries for dynamic graph learning [18, 46, 74], they mainly focus on dynamic network embedding methods [18], discrete-time graph learning methods [46], or engineering techniques for training on large-scale dynamic graphs [74] (elaborated in Section 2). Currently, we find that there are still no standard tools for continuous-time dynamic graph learning.
In this paper, we aim to address the above drawbacks with two key technical contributions.
We propose a new Transformer-based dynamic graph learning architecture (DyGFormer). DyGFormer is conceptually simple by solely learning from the sequences of nodes' historical first-hop interactions. To be specific, DyGFormer is designed with a neighbor co-occurrence encoding scheme, which encodes the appearing frequencies of each neighbor in the sequences of the source and destination nodes to explicitly explore their correlations. In order to capture long-term temporal dependencies, DyGFormer splits each node's sequence into multiple patches and feeds them to Transformer [56]. This patching technique not only makes the model effectively benefit from longer histories via preserving local temporal proximities, but also efficiently reduces the computational complexity to a constant level that is independent of the input sequence length.
We present a unified continuous-time dynamic graph learning library (DyGLib). DyGLib is an open-source toolkit with standard training pipelines, extensible coding interfaces, and comprehensive evaluating strategies, aiming to foster standard, scalable, and reproducible dynamic graph learning research. DyGLib has integrated a variety of continuous-time dynamic graph learning methods as well as benchmark datasets from various domains. It trains all the methods via the same pipeline to eliminate the influence of different implementations and adopts a modularized design for developers to conveniently incorporate new datasets and algorithms based on their specific requirements. Moreover, DyGLib supports both dynamic link prediction and dynamic node classification tasks with exhaustive evaluating strategies to provide comprehensive comparisons of existing methods.
To evaluate the model performance, we conduct extensive experiments based on DyGLib, including dynamic link prediction under transductive and inductive settings with three negative sampling strategies as well as dynamic node classification. From the results, we observe that: (i) DyGFormer outperforms existing methods on most datasets, demonstrating its superiority in capturing nodes’ correlations and long-term temporal dependencies; (ii) some findings of baselines are not in line with previous reports because of their varied pipelines and problematic implementations, which illustrates the necessity of introducing DyGLib. We also provide an in-depth analysis of the neighbor co-occurrence encoding and patching technique for a better understanding of DyGFormer.
2 Related Work
Dynamic Graph Learning. Representation learning on dynamic graphs has been widely studied in recent years [26, 49, 65]. Discrete-time methods manually divide the dynamic graph into a sequence of snapshots and apply static graph learning methods on each snapshot, which ignore the temporal order of nodes in each snapshot [39, 17, 48, 11, 66]. In contrast, continuous-time methods directly learn on the whole dynamic graph with temporal graph neural networks [55, 64, 45, 36, 9, 59], memory networks [28, 55, 45, 36, 59, 35], temporal random walks [60, 25] or sequential models [57, 12]. Although insightful, most existing dynamic graph learning methods neglect the correlations between two nodes in an interaction. They also fail to handle nodes with longer interactions due to unaffordable computational costs of complex modules or issues in optimizing models (e.g., the vanishing/exploding gradients). In this paper, we propose a new Transformer-based architecture to show the necessity of capturing nodes’ correlations and long-term temporal dependencies, which is achieved by two designs: a neighbor co-occurrence encoding scheme and a patching technique.
Transformer-based Applications in Various Fields. Transformer [56] is an innovative model that employs the self-attention mechanism to handle sequential data, which has been successfully applied in a variety of domains, such as natural language processing [13, 33, 6], computer vision [7, 14, 34], and time series forecasting [30, 73, 62]. The idea of dividing the original data into patches as inputs of the Transformer has been attempted in some studies. ViT [14] splits an image into multiple patches and feeds the sequence of patches' linear embeddings into a Transformer, which achieves surprisingly good performance on image classification. PatchTST [38] divides a time series into subseries-level patches and processes the patches with a channel-independent Transformer for long-term multivariate time series forecasting. In this work, we propose a patching technique to learn on dynamic graphs, which provides our approach with the ability to handle nodes with longer histories.
Graph Learning Library. Currently, there exist many libraries for static graphs [5, 58, 16, 22, 8, 29, 32], but few for dynamic graph learning [18, 46, 74]. DynamicGEM [18] focuses on dynamic graph embedding methods, which just consider the graph topology and cannot leverage node features. PyTorch Geometric Temporal [46] implements discrete-time algorithms for spatiotemporal signal processing and is mainly applicable for nodes with aligned historical observations. TGL [74] trains on large-scale dynamic graphs with some engineering tricks. Though TGL has integrated some continuous-time methods, they are somewhat out-of-date, resulting in the lack of state-of-the-art models. Moreover, TGL is implemented by both PyTorch and C++, which needs additional compilation and increases the usage difficulty. In this paper, we present a unified continuous-time dynamic graph learning library with thorough baselines, diverse datasets, extensible implementations, and comprehensive evaluations to facilitate dynamic graph learning research.
3 Preliminaries
Definition 1.
Dynamic Graph. We represent a dynamic graph as a sequence of non-decreasing chronological interactions $\mathcal{G}=\{(u_1,v_1,t_1),(u_2,v_2,t_2),\cdots\}$ with $0 \leq t_1 \leq t_2 \leq \cdots$, where $u_i, v_i \in \mathcal{N}$ denote the source node and destination node of the $i$-th link at timestamp $t_i$. $\mathcal{N}$ is the set of all the nodes. Each node $u \in \mathcal{N}$ can be associated with node feature $\boldsymbol{x}_u \in \mathbb{R}^{d_N}$, and each interaction $(u,v,t)$ has link feature $\boldsymbol{e}_{u,v}^{t} \in \mathbb{R}^{d_E}$. $d_N$ and $d_E$ denote the dimensions of the node feature and link feature. If the graph is non-attributed, we simply set the node feature and link feature to zero vectors, i.e., $\boldsymbol{x}_u = \boldsymbol{0}$ and $\boldsymbol{e}_{u,v}^{t} = \boldsymbol{0}$.
Definition 2.
Problem Formalization. Given the source node $u$, destination node $v$, timestamp $t$, and all historical interactions before $t$, i.e., $\{(u',v',t')\,|\,t'<t\}$, representation learning on dynamic graph aims to design a model to learn time-aware representations $\boldsymbol{h}_u^t \in \mathbb{R}^{d}$ and $\boldsymbol{h}_v^t \in \mathbb{R}^{d}$ for $u$ and $v$ with $d$ as the dimension. We validate the effectiveness of the learned representations via two common tasks in dynamic graph learning: (i) dynamic link prediction, which predicts whether $u$ and $v$ are connected at $t$; (ii) dynamic node classification, which infers the state of $u$ or $v$ at $t$.
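To make the notation concrete, the following minimal sketch shows one way to represent the chronological interaction list and retrieve the history before a query timestamp; the variable names are purely illustrative and are not DyGLib's actual data interface.

```python
# A minimal sketch of the dynamic-graph representation in Definitions 1-2.
from collections import namedtuple

Interaction = namedtuple("Interaction", ["src", "dst", "t"])

# Chronologically ordered interactions (u_i, v_i, t_i) with non-decreasing timestamps.
graph = [
    Interaction(src=0, dst=1, t=1.0),
    Interaction(src=0, dst=2, t=2.5),
    Interaction(src=1, dst=2, t=3.0),
]

def history_before(graph, t):
    """Return all interactions that happened strictly before timestamp t."""
    return [it for it in graph if it.t < t]

print(history_before(graph, 3.0))  # the first two interactions
```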
4 New Architecture and Unified Library
4.1 DyGFormer: Transformer-based Architecture for Dynamic Graph Learning
The framework of our DyGFormer is shown in Figure 1, which employs Transformer [56] as the backbone. Given an interaction $(u, v, t)$, we first extract the historical first-hop interactions of source node $u$ and destination node $v$ before timestamp $t$ and obtain two interaction sequences $S_u^t$ and $S_v^t$. Next, in addition to computing the encodings of neighbors, links, and time intervals for each sequence, we also encode the frequencies of every neighbor's appearances in both $S_u^t$ and $S_v^t$ to exploit the correlations between $u$ and $v$, resulting in four encoding sequences for $u$/$v$ in total. Then, we divide each encoding sequence into multiple patches and feed all the patches into a Transformer for capturing long-term temporal dependencies. Finally, the outputs of the Transformer are averaged to derive time-aware representations of $u$ and $v$ at timestamp $t$ (i.e., $\boldsymbol{h}_u^t$ and $\boldsymbol{h}_v^t$), which can be applied in various downstream tasks like dynamic link prediction and dynamic node classification.

Learning from Historical First-hop Interactions. Unlike most previous methods that require nodes' historical interactions from multiple hops (e.g., DyRep [55], TGAT [64], TGN [45], CAWN [60]), we only learn from the sequences of nodes' historical first-hop interactions, turning the dynamic graph learning task into a simpler sequence learning problem. Mathematically, given an interaction $(u, v, t)$, for source node $u$ and destination node $v$, we obtain the sequences that involve the first-hop interactions of $u$ and $v$ before timestamp $t$, which are denoted by $S_u^t$ and $S_v^t$, respectively.
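As an illustration, the snippet below sketches how such a first-hop sequence could be gathered from the chronologically ordered interaction list introduced above; DyGLib's actual neighbor sampler is more efficient, and feature lookup is omitted for brevity.

```python
def first_hop_sequence(graph, node, t):
    """Collect (neighbor, timestamp) pairs of interactions involving `node` before
    timestamp t, i.e., the first-hop sequence used by DyGFormer (features omitted;
    `graph` is a chronologically ordered list of (src, dst, t) tuples)."""
    seq = []
    for src, dst, ts in graph:
        if ts >= t:
            break  # later interactions cannot appear in the history
        if src == node:
            seq.append((dst, ts))
        elif dst == node:
            seq.append((src, ts))
    return seq
```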
Encoding Neighbors, Links, and Time Intervals. For source node $u$, we retrieve the features of the involved neighbors and links in sequence $S_u^t$ based on the given features to represent their encodings, which are denoted by $\boldsymbol{X}_{u,N}^{t} \in \mathbb{R}^{|S_u^t| \times d_N}$ and $\boldsymbol{X}_{u,E}^{t} \in \mathbb{R}^{|S_u^t| \times d_E}$. Following [64], we learn the periodic temporal patterns by encoding the time interval $\Delta t' = t - t'$ via $\sqrt{1/d_T}\left[\cos\left(w_1\Delta t'\right), \sin\left(w_1\Delta t'\right), \cdots, \cos\left(w_{d_T}\Delta t'\right), \sin\left(w_{d_T}\Delta t'\right)\right]$, where $w_1, \cdots, w_{d_T}$ are trainable parameters and $d_T$ is the encoding dimension. The time interval encodings of interactions in $S_u^t$ are denoted by $\boldsymbol{X}_{u,T}^{t} \in \mathbb{R}^{|S_u^t| \times d_T}$. We use the same process to get the corresponding encodings for destination node $v$, i.e., $\boldsymbol{X}_{v,N}^{t}$, $\boldsymbol{X}_{v,E}^{t}$, and $\boldsymbol{X}_{v,T}^{t}$.
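For reference, a PyTorch sketch of this periodic time-interval encoding could look as follows; the module name is ours, and we use $d_T/2$ trainable frequencies so that the concatenated cosine/sine output has dimension exactly $d_T$.

```python
import torch
import torch.nn as nn

class TimeIntervalEncoder(nn.Module):
    """Periodic encoding of a scalar time interval with trainable frequencies,
    following the cosine/sine form described above (illustrative sketch)."""
    def __init__(self, d_T: int):
        super().__init__()
        assert d_T % 2 == 0
        self.d_T = d_T
        self.w = nn.Parameter(torch.randn(d_T // 2))  # trainable frequencies

    def forward(self, delta_t: torch.Tensor) -> torch.Tensor:
        # delta_t: [seq_len] of time intervals t - t'; output: [seq_len, d_T]
        phase = delta_t.unsqueeze(-1) * self.w                       # [seq_len, d_T / 2]
        enc = torch.cat([torch.cos(phase), torch.sin(phase)], dim=-1)
        return enc / (self.d_T ** 0.5)
```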
Neighbor Co-occurrence Encoding Scheme. Existing methods separately compute the representations of node $u$ and node $v$ without modeling their correlations. We present a neighbor co-occurrence encoding scheme to tackle this issue, which assumes that the appearing frequency of a neighbor in a sequence indicates its importance, and the occurrences of a neighbor in the sequences of both $u$ and $v$ (i.e., co-occurrence) could reflect the correlations between $u$ and $v$. That is to say, if $u$ and $v$ have more common historical neighbors in their sequences, they are more likely to interact in the future.

Formally, for each neighbor in the interaction sequences $S_u^t$ and $S_v^t$, we count its occurrences in both $S_u^t$ and $S_v^t$, and derive a two-dimensional vector. By packing the vectors of all the neighbors together, we can get the neighbor co-occurrence features for $u$ and $v$, which are represented by $\boldsymbol{C}_u^t \in \mathbb{R}^{|S_u^t| \times 2}$ and $\boldsymbol{C}_v^t \in \mathbb{R}^{|S_v^t| \times 2}$. For example, suppose the historical neighbors of $u$ and $v$ are $\{a, b, a\}$ and $\{b, c, b, a\}$. The appearing frequencies of $a$, $b$, and $c$ in $u$'s/$v$'s historical interactions are 2/1, 1/2, and 0/1, respectively. Therefore, the neighbor co-occurrence features of $u$ and $v$ are denoted by $\boldsymbol{C}_u^t = [[2,1],[1,2],[2,1]]$ and $\boldsymbol{C}_v^t = [[1,2],[0,1],[1,2],[2,1]]$. Then, we apply a function $f(\cdot)$ to encode the neighbor co-occurrence features by

$\boldsymbol{X}_{*,C}^{t}=f\left(\boldsymbol{C}_{*}^{t}[:,0]\right)+f\left(\boldsymbol{C}_{*}^{t}[:,1]\right) \in \mathbb{R}^{|S_{*}^{t}| \times d_{C}},$    (1)

where $*$ could be $u$ or $v$. The input and output dimensions of $f(\cdot)$ are 1 and $d_C$, respectively. In this paper, we implement $f(\cdot)$ by a two-layer perceptron with ReLU activation [37]. It is important to note that the neighbor co-occurrence encoding scheme is general and can be easily integrated into some dynamic graph learning methods for better results. We will demonstrate its generalizability in Section 5.3.
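The counting step itself is simple; the sketch below reproduces the example above, with an illustrative two-layer perceptron standing in for $f(\cdot)$ and $d_C=64$ chosen arbitrarily.

```python
import torch
import torch.nn as nn
from collections import Counter

def cooccurrence_features(src_neighbors, dst_neighbors):
    """For every neighbor in a sequence, build the two-dimensional vector
    [count in source sequence, count in destination sequence]."""
    src_cnt, dst_cnt = Counter(src_neighbors), Counter(dst_neighbors)
    pack = lambda seq: torch.tensor([[src_cnt[n], dst_cnt[n]] for n in seq], dtype=torch.float)
    return pack(src_neighbors), pack(dst_neighbors)

# Example from the text: neighbors of u are {a, b, a} and neighbors of v are {b, c, b, a}.
C_u, C_v = cooccurrence_features(["a", "b", "a"], ["b", "c", "b", "a"])
# C_u = [[2, 1], [1, 2], [2, 1]], C_v = [[1, 2], [0, 1], [1, 2], [2, 1]]

# f(.) is applied to each column of counts and the results are summed, as in Eq. (1).
f = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 64))
X_u_C = f(C_u[:, 0:1]) + f(C_u[:, 1:2])  # neighbor co-occurrence encodings of u, [|S_u|, d_C]
```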
Patching Technique. Instead of focusing on the interaction level, we divide the encoding sequence into multiple non-overlapping patches to break through the bottleneck of existing methods in capturing long-term temporal dependencies. Let $P$ denote the patch size. Each patch is composed of $P$ temporally adjacent interactions with flattened encodings and can preserve local temporal proximities. Take the patching of $\boldsymbol{X}_{u,N}^{t}$ as an example. $\boldsymbol{X}_{u,N}^{t}$ will be divided into $l_u = \lceil |S_u^t| / P \rceil$ patches in total (note that we will pad $\boldsymbol{X}_{u,N}^{t}$ if its length cannot be divided by $P$), and the patched encoding is represented by $\boldsymbol{M}_{u,N}^{t} \in \mathbb{R}^{l_u \times d_N \cdot P}$. Similarly, we can also get the patched encodings $\boldsymbol{M}_{u,E}^{t}$, $\boldsymbol{M}_{u,T}^{t}$, $\boldsymbol{M}_{u,C}^{t}$, $\boldsymbol{M}_{v,N}^{t}$, $\boldsymbol{M}_{v,E}^{t}$, $\boldsymbol{M}_{v,T}^{t}$, and $\boldsymbol{M}_{v,C}^{t}$. Note that when $|S_u^t|$ or $|S_v^t|$ becomes longer, we will correspondingly increase $P$, keeping the number of patches (i.e., $l_u$ and $l_v$) at a constant level to reduce the computational cost.
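A minimal sketch of this patching step (padding with zeros when the sequence length is not divisible by the patch size) is given below; it operates on one encoding sequence at a time.

```python
import torch

def patch_sequence(encodings: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split a [seq_len, dim] encoding sequence into non-overlapping patches of
    `patch_size` temporally adjacent interactions with flattened encodings.
    Returns a [num_patches, patch_size * dim] tensor."""
    seq_len, dim = encodings.shape
    num_patches = (seq_len + patch_size - 1) // patch_size  # ceil(seq_len / patch_size)
    padded = torch.zeros(num_patches * patch_size, dim)
    padded[:seq_len] = encodings
    return padded.view(num_patches, patch_size * dim)
```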
Transformer Encoder. We first align the patched encodings to the same dimension $d$ with trainable weights $\boldsymbol{W}_{*} \in \mathbb{R}^{d_{*} \cdot P \times d}$ and $\boldsymbol{b}_{*} \in \mathbb{R}^{d}$ to obtain $\boldsymbol{Z}_{u,*}^{t} \in \mathbb{R}^{l_u \times d}$ and $\boldsymbol{Z}_{v,*}^{t} \in \mathbb{R}^{l_v \times d}$, where $*$ could be $N$, $E$, $T$, or $C$. To be specific, the alignments are realized by

$\boldsymbol{Z}_{u,*}^{t}=\boldsymbol{M}_{u,*}^{t}\boldsymbol{W}_{*}+\boldsymbol{b}_{*}, \quad \boldsymbol{Z}_{v,*}^{t}=\boldsymbol{M}_{v,*}^{t}\boldsymbol{W}_{*}+\boldsymbol{b}_{*}.$    (2)

Then, we concatenate the aligned encodings of $u$ and $v$, and get $\boldsymbol{Z}_{u}^{t} = \boldsymbol{Z}_{u,N}^{t} \,\|\, \boldsymbol{Z}_{u,E}^{t} \,\|\, \boldsymbol{Z}_{u,T}^{t} \,\|\, \boldsymbol{Z}_{u,C}^{t} \in \mathbb{R}^{l_u \times 4d}$ and $\boldsymbol{Z}_{v}^{t} = \boldsymbol{Z}_{v,N}^{t} \,\|\, \boldsymbol{Z}_{v,E}^{t} \,\|\, \boldsymbol{Z}_{v,T}^{t} \,\|\, \boldsymbol{Z}_{v,C}^{t} \in \mathbb{R}^{l_v \times 4d}$.
Next, we employ a Transformer encoder to capture the temporal dependencies, which is built by stacking $L$ Multi-head Self-Attention (MSA) and Feed-Forward Network (FFN) blocks. The residual connection [20] is employed after every block. We follow [14] by using GELU [21] instead of ReLU [37] between the two linear layers in each FFN block and applying Layer Normalization (LN) [3] before each block rather than after. Instead of individually processing $\boldsymbol{Z}_{u}^{t}$ and $\boldsymbol{Z}_{v}^{t}$, our Transformer encoder takes the stacked $\boldsymbol{Z}^{t} = [\boldsymbol{Z}_{u}^{t}; \boldsymbol{Z}_{v}^{t}] \in \mathbb{R}^{(l_u+l_v) \times 4d}$ as inputs, aiming to learn the temporal dependencies within and across the sequences of $u$ and $v$. The calculation process of the $l$-th layer is

$\boldsymbol{Q}_{i}^{(l)}=\text{LN}(\boldsymbol{Z}^{(l-1)})\boldsymbol{W}_{Q,i}^{(l)},\ \boldsymbol{K}_{i}^{(l)}=\text{LN}(\boldsymbol{Z}^{(l-1)})\boldsymbol{W}_{K,i}^{(l)},\ \boldsymbol{V}_{i}^{(l)}=\text{LN}(\boldsymbol{Z}^{(l-1)})\boldsymbol{W}_{V,i}^{(l)},$    (3)

$\boldsymbol{A}_{i}^{(l)}=\text{Softmax}\left(\frac{\boldsymbol{Q}_{i}^{(l)}{\boldsymbol{K}_{i}^{(l)}}^{\top}}{\sqrt{d_{k}}}\right)\boldsymbol{V}_{i}^{(l)},$    (4)

$\text{MSA}(\boldsymbol{Z}^{(l-1)})=\left[\boldsymbol{A}_{1}^{(l)}\,\|\,\cdots\,\|\,\boldsymbol{A}_{I}^{(l)}\right]\boldsymbol{W}_{O}^{(l)},$    (5)

$\boldsymbol{S}^{(l)}=\text{MSA}(\boldsymbol{Z}^{(l-1)})+\boldsymbol{Z}^{(l-1)},$    (6)

$\boldsymbol{Z}^{(l)}=\text{FFN}(\text{LN}(\boldsymbol{S}^{(l)}))+\boldsymbol{S}^{(l)}.$    (7)

$\boldsymbol{W}_{Q,i}^{(l)}$, $\boldsymbol{W}_{K,i}^{(l)}$, $\boldsymbol{W}_{V,i}^{(l)} \in \mathbb{R}^{4d \times d_k}$, $\boldsymbol{W}_{O}^{(l)} \in \mathbb{R}^{4d \times 4d}$, and the weights in $\text{FFN}(\cdot)$ are trainable parameters at the $l$-th layer. We set $d_k = 4d / I$ with $I$ as the number of attention heads. The input of the first layer is $\boldsymbol{Z}^{(0)} = \boldsymbol{Z}^{t}$, and the output of the $L$-th layer is denoted by $\boldsymbol{H}^{t} \in \mathbb{R}^{(l_u+l_v) \times 4d}$.
Time-aware Node Representation. The time-aware representations of node $u$ and $v$ at timestamp $t$ are derived by averaging their related representations in $\boldsymbol{H}^{t}$ with an output layer,

$\boldsymbol{h}_{u}^{t}=\text{MEAN}\left(\boldsymbol{H}^{t}[:l_u,:]\right)\boldsymbol{W}_{out}+\boldsymbol{b}_{out},\quad \boldsymbol{h}_{v}^{t}=\text{MEAN}\left(\boldsymbol{H}^{t}[l_u:,:]\right)\boldsymbol{W}_{out}+\boldsymbol{b}_{out},$    (8)

where $\boldsymbol{W}_{out} \in \mathbb{R}^{4d \times d_{out}}$ and $\boldsymbol{b}_{out} \in \mathbb{R}^{d_{out}}$ are trainable weights with $d_{out}$ as the output dimension.
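Putting Eqs. (2)-(8) together, the sketch below outlines the alignment, Transformer encoder, and output layer using PyTorch's built-in pre-LN encoder layers; the class name, hyperparameter defaults, and the reliance on `nn.TransformerEncoder` are our simplifications rather than the official DyGLib implementation.

```python
import torch
import torch.nn as nn

class DyGFormerSketch(nn.Module):
    """Illustrative sketch of the alignment (Eq. 2), Transformer encoder (Eqs. 3-7),
    and output layer (Eq. 8)."""
    def __init__(self, patch_dims: dict, d: int, d_out: int, num_layers: int = 2, num_heads: int = 2):
        super().__init__()
        # One alignment projection per encoding type (* in {N, E, T, C}).
        self.align = nn.ModuleDict({k: nn.Linear(dim, d) for k, dim in patch_dims.items()})
        # Pre-LN Transformer with GELU, operating on the concatenated 4d-dimensional patches.
        layer = nn.TransformerEncoderLayer(d_model=4 * d, nhead=num_heads, activation="gelu",
                                           norm_first=True, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.output = nn.Linear(4 * d, d_out)

    def forward(self, patches_u: dict, patches_v: dict):
        # patches_*: dict mapping 'N'/'E'/'T'/'C' to a [num_patches, patch_size * d_*] tensor.
        z_u = torch.cat([self.align[k](patches_u[k]) for k in ("N", "E", "T", "C")], dim=-1)
        z_v = torch.cat([self.align[k](patches_v[k]) for k in ("N", "E", "T", "C")], dim=-1)
        l_u = z_u.shape[0]
        # Stack both patch sequences so attention runs within and across u's and v's histories.
        h = self.encoder(torch.cat([z_u, z_v], dim=0).unsqueeze(0)).squeeze(0)
        h_u = self.output(h[:l_u].mean(dim=0))  # time-aware representation of u at time t
        h_v = self.output(h[l_u:].mean(dim=0))  # time-aware representation of v at time t
        return h_u, h_v
```

Here `patch_dims` maps each encoding type to its flattened patch dimension $d_* \cdot P$, so the stacked patches of $u$ and $v$ attend to each other within a single Transformer, as described above.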
4.2 DyGLib: Unified Library for Continuous-Time Dynamic Graph Learning
We introduce a unified library with standard training pipelines, extensible coding interfaces, and comprehensive evaluating strategies for reproducible, scalable, and credible continuous-time dynamic graph learning research. The overall procedure of DyGLib is shown in Figure 3 in Section A.3.
Standard Training Pipelines. To eliminate the influence of different training pipelines in previous studies, we unify the data format, create a customized data loader, and train all the methods with the same model trainers. Our standard training pipelines guarantee reproducible performance and enable users to quickly identify the key components of different models. Researchers only need to focus on designing the model architecture without considering other irrelevant implementation details.
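At a high level, every integrated method is optimized by the same loop; the following pseudocode-style sketch conveys the idea, with illustrative names that are not DyGLib's actual API.

```python
import torch

def train(model, train_loader, optimizer, criterion, neg_sampler, num_epochs=100):
    """Sketch of a unified training loop shared by all integrated methods."""
    for epoch in range(num_epochs):
        for src, dst, t in train_loader:              # shared data format / loader
            neg_dst = neg_sampler.sample(len(src))    # shared negative sampling
            pos_prob = model(src, dst, t)             # any integrated backbone
            neg_prob = model(src, neg_dst, t)
            loss = criterion(pos_prob, torch.ones_like(pos_prob)) + \
                   criterion(neg_prob, torch.zeros_like(neg_prob))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```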
Extensible Coding Interfaces. We provide extensible coding interfaces for the datasets and algorithms, which are all implemented by PyTorch. These scalable designs enable users to incorporate new datasets and popular models based on their specific requirements, which can significantly reduce the usage difficulty for beginners and allow experts to conveniently validate new ideas. Currently, DyGLib has integrated thirteen datasets from various domains and nine continuous-time dynamic graph learning methods. It is worth noting that we also found some issues in previous implementations and have fixed them in DyGLib (see details in Section B.3).
Comprehensive Evaluating Protocols. DyGLib supports both transductive/inductive dynamic link prediction and dynamic node classification tasks. Most previous works evaluate their methods on the dynamic link prediction task with the random negative sampling strategy, but a few models already reach nearly saturated performance under this strategy, making it hard to distinguish more advanced designs. For more reliable comparisons, we adopt three strategies (i.e., random, historical, and inductive negative sampling strategies) from [44] to comprehensively evaluate the model performance.
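The snippet below sketches the intuition behind the three strategies; it is a simplification of [44], whose implementations additionally ensure that sampled negatives do not collide with the positive links at the current timestamp.

```python
import numpy as np

def sample_negative_edges(strategy, num, all_nodes, observed_edges, eval_only_edges, seed=0):
    """Simplified sketch of the three negative sampling strategies of [44]:
    'rnd' pairs random source/destination nodes, 'hist' re-samples edges observed
    at earlier timestamps, and 'ind' re-samples edges that are only seen during
    evaluation."""
    rng = np.random.default_rng(seed)
    if strategy == "rnd":
        src = rng.choice(all_nodes, size=num)
        dst = rng.choice(all_nodes, size=num)
        return list(zip(src, dst))
    pool = observed_edges if strategy == "hist" else eval_only_edges
    idx = rng.choice(len(pool), size=num)
    return [pool[i] for i in idx]
```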
5 Experiments
In this section, we report the results of various approaches by using DyGLib. We show the superiority of DyGFormer over existing methods and also give an in-depth analysis of DyGFormer.
5.1 Experimental Settings
Datasets and Baselines. We experiment with thirteen datasets (Wikipedia, Reddit, MOOC, LastFM, Enron, Social Evo., UCI, Flights, Can. Parl., US Legis., UN Trade, UN Vote, and Contact), which are collected by [44] and cover diverse domains. Details of the datasets are shown in Section B.1. We compare DyGFormer with eight popular continuous-time dynamic graph learning baselines that are based on graph convolutions, memory networks, random walks, and sequential models, including JODIE [28], DyRep [55], TGAT [64], TGN [45], CAWN [60], EdgeBank [44], TCL [57], and GraphMixer [12]. We give the descriptions of baselines in Section B.2.
Evaluation Tasks and Metrics. We follow [64, 45, 60, 44] to evaluate models for dynamic link prediction, which predicts the probability of a link occurring between two given nodes at a specific time. This task has two settings: the transductive setting aims to predict future links between nodes that are observed during training, and the inductive setting predicts future links between unseen nodes. We use a multi-layer perceptron to take the concatenated representations of two nodes as inputs and return the probability of a link as the output. Average Precision (AP) and Area Under the Receiver Operating Characteristic Curve (AUC-ROC) are adopted as the evaluation metrics. We adopt random (rnd), historical (hist), and inductive (ind) negative sampling strategies in [44] for evaluation, where the latter two strategies are more challenging. Please refer to [44] for more details. We also follow [64, 45] to conduct dynamic node classification, which estimates the state of a node in a given interaction at a specific time. A multi-layer perceptron is employed to map the node representations to the labels. We use AUC-ROC as the evaluation metric due to the label imbalance. For both tasks, we chronologically split each dataset with the ratio of 70%/15%/15% for training/validation/testing.
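A sketch of the link prediction head and metric computation is shown below; the hidden dimension (172) is an arbitrary illustrative choice.

```python
import torch
import torch.nn as nn
from sklearn.metrics import average_precision_score, roc_auc_score

class LinkPredictor(nn.Module):
    """Small MLP head mapping the concatenated time-aware representations of two
    nodes to a link probability (a sketch of the setup described above)."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, h_src: torch.Tensor, h_dst: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(torch.cat([h_src, h_dst], dim=-1))).squeeze(-1)

# Toy evaluation: positive and negative links share the same head; AP and AUC-ROC
# are then computed over the predicted probabilities.
predictor = LinkPredictor(dim=172)
h_src, h_dst = torch.randn(8, 172), torch.randn(8, 172)
probs = predictor(h_src, h_dst).detach().numpy()
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(average_precision_score(labels, probs), roc_auc_score(labels, probs))
```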
Model Configurations. For baselines, in addition to following their official settings, we also perform an exhaustive grid search to find the optimal configurations of some critical hyperparameters for more reliable comparisons. As DyGFormer can access longer histories, we vary each node’s input sequence length from 32 to 4096 by a factor of 2. To keep the computational complexity at a constant level that is irrelevant to the input length, we correspondingly increase the patch size from 1 to 128. Please see Section B.5 for the detailed configurations of different models.
Implementation Details. For both tasks, we optimize all models (excluding EdgeBank, which has no trainable parameters) with Adam [27] and use the supervised binary cross-entropy loss as the objective function. We train the models for 100 epochs and use the early stopping strategy with a patience of 20. We select the model that achieves the best performance on the validation set for testing. We set the learning rate and batch size to 0.0001 and 200 for all the methods on all the datasets. We run each method five times with seeds from 0 to 4 and report the average performance to eliminate deviations. Experiments are conducted on an Ubuntu machine equipped with one Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz with 16 physical cores. The GPU device is NVIDIA Tesla T4 with 15 GB memory.
5.2 Performance Comparisons and Discussions
NSS | Datasets | JODIE | DyRep | TGAT | TGN | CAWN | EdgeBank | TCL | GraphMixer | DyGFormer |
---|---|---|---|---|---|---|---|---|---|---|
rnd | Wikipedia | 96.50 0.14 | 94.86 0.06 | 96.94 0.06 | 98.45 0.06 | 98.76 0.03 | 90.37 0.00 | 96.47 0.16 | 97.25 0.03 | 99.03 0.02 |
Reddit | 98.31 0.14 | 98.22 0.04 | 98.52 0.02 | 98.63 0.06 | 99.11 0.01 | 94.86 0.00 | 97.53 0.02 | 97.31 0.01 | 99.22 0.01 | |
MOOC | 80.23 2.44 | 81.97 0.49 | 85.84 0.15 | 89.15 1.60 | 80.15 0.25 | 57.97 0.00 | 82.38 0.24 | 82.78 0.15 | 87.52 0.49 | |
LastFM | 70.85 2.13 | 71.92 2.21 | 73.42 0.21 | 77.07 3.97 | 86.99 0.06 | 79.29 0.00 | 67.27 2.16 | 75.61 0.24 | 93.00 0.12 | |
Enron | 84.77 0.30 | 82.38 3.36 | 71.12 0.97 | 86.53 1.11 | 89.56 0.09 | 83.53 0.00 | 79.70 0.71 | 82.25 0.16 | 92.47 0.12 | |
Social Evo. | 89.89 0.55 | 88.87 0.30 | 93.16 0.17 | 93.57 0.17 | 84.96 0.09 | 74.95 0.00 | 93.13 0.16 | 93.37 0.07 | 94.73 0.01 | |
UCI | 89.43 1.09 | 65.14 2.30 | 79.63 0.70 | 92.34 1.04 | 95.18 0.06 | 76.20 0.00 | 89.57 1.63 | 93.25 0.57 | 95.79 0.17 | |
Flights | 95.60 1.73 | 95.29 0.72 | 94.03 0.18 | 97.95 0.14 | 98.51 0.01 | 89.35 0.00 | 91.23 0.02 | 90.99 0.05 | 98.91 0.01 | |
Can. Parl. | 69.26 0.31 | 66.54 2.76 | 70.73 0.72 | 70.88 2.34 | 69.82 2.34 | 64.55 0.00 | 68.67 2.67 | 77.04 0.46 | 97.36 0.45 | |
US Legis. | 75.05 1.52 | 75.34 0.39 | 68.52 3.16 | 75.99 0.58 | 70.58 0.48 | 58.39 0.00 | 69.59 0.48 | 70.74 1.02 | 71.11 0.59 | |
UN Trade | 64.94 0.31 | 63.21 0.93 | 61.47 0.18 | 65.03 1.37 | 65.39 0.12 | 60.41 0.00 | 62.21 0.03 | 62.61 0.27 | 66.46 1.29 | |
UN Vote | 63.91 0.81 | 62.81 0.80 | 52.21 0.98 | 65.72 2.17 | 52.84 0.10 | 58.49 0.00 | 51.90 0.30 | 52.11 0.16 | 55.55 0.42 | |
Contact | 95.31 1.33 | 95.98 0.15 | 96.28 0.09 | 96.89 0.56 | 90.26 0.28 | 92.58 0.00 | 92.44 0.12 | 91.92 0.03 | 98.29 0.01 | |
Avg. Rank | 5.08 | 5.85 | 5.69 | 2.54 | 4.31 | 7.54 | 6.92 | 5.46 | 1.62 | |
hist | Wikipedia | 83.01 0.66 | 79.93 0.56 | 87.38 0.22 | 86.86 0.33 | 71.21 1.67 | 73.35 0.00 | 89.05 0.39 | 90.90 0.10 | 82.23 2.54 |
Reddit | 80.03 0.36 | 79.83 0.31 | 79.55 0.20 | 81.22 0.61 | 80.82 0.45 | 73.59 0.00 | 77.14 0.16 | 78.44 0.18 | 81.57 0.67 | |
MOOC | 78.94 1.25 | 75.60 1.12 | 82.19 0.62 | 87.06 1.93 | 74.05 0.95 | 60.71 0.00 | 77.06 0.41 | 77.77 0.92 | 85.85 0.66 | |
LastFM | 74.35 3.81 | 74.92 2.46 | 71.59 0.24 | 76.87 4.64 | 69.86 0.43 | 73.03 0.00 | 59.30 2.31 | 72.47 0.49 | 81.57 0.48 | |
Enron | 69.85 2.70 | 71.19 2.76 | 64.07 1.05 | 73.91 1.76 | 64.73 0.36 | 76.53 0.00 | 70.66 0.39 | 77.98 0.92 | 75.63 0.73 | |
Social Evo. | 87.44 6.78 | 93.29 0.43 | 95.01 0.44 | 94.45 0.56 | 85.53 0.38 | 80.57 0.00 | 94.74 0.31 | 94.93 0.31 | 97.38 0.14 | |
UCI | 75.24 5.80 | 55.10 3.14 | 68.27 1.37 | 80.43 2.12 | 65.30 0.43 | 65.50 0.00 | 80.25 2.74 | 84.11 1.35 | 82.17 0.82 | |
Flights | 66.48 2.59 | 67.61 0.99 | 72.38 0.18 | 66.70 1.64 | 64.72 0.97 | 70.53 0.00 | 70.68 0.24 | 71.47 0.26 | 66.59 0.49 | |
Can. Parl. | 51.79 0.63 | 63.31 1.23 | 67.13 0.84 | 68.42 3.07 | 66.53 2.77 | 63.84 0.00 | 65.93 3.00 | 74.34 0.87 | 97.00 0.31 | |
US Legis. | 51.71 5.76 | 86.88 2.25 | 62.14 6.60 | 74.00 7.57 | 68.82 8.23 | 63.22 0.00 | 80.53 3.95 | 81.65 1.02 | 85.30 3.88 | |
UN Trade | 61.39 1.83 | 59.19 1.07 | 55.74 0.91 | 58.44 5.51 | 55.71 0.38 | 81.32 0.00 | 55.90 1.17 | 57.05 1.22 | 64.41 1.40 | |
UN Vote | 70.02 0.81 | 69.30 1.12 | 52.96 2.14 | 69.37 3.93 | 51.26 0.04 | 84.89 0.00 | 52.30 2.35 | 51.20 1.60 | 60.84 1.58 | |
Contact | 95.31 2.13 | 96.39 0.20 | 96.05 0.52 | 93.05 2.35 | 84.16 0.49 | 88.81 0.00 | 93.86 0.21 | 93.36 0.41 | 97.57 0.06 | |
Avg. Rank | 5.46 | 5.08 | 5.08 | 3.85 | 7.54 | 5.92 | 5.46 | 4.00 | 2.62 | |
ind | Wikipedia | 75.65 0.79 | 70.21 1.58 | 87.00 0.16 | 85.62 0.44 | 74.06 2.62 | 80.63 0.00 | 86.76 0.72 | 88.59 0.17 | 78.29 5.38 |
Reddit | 86.98 0.16 | 86.30 0.26 | 89.59 0.24 | 88.10 0.24 | 91.67 0.24 | 85.48 0.00 | 87.45 0.29 | 85.26 0.11 | 91.11 0.40 | |
MOOC | 65.23 2.19 | 61.66 0.95 | 75.95 0.64 | 77.50 2.91 | 73.51 0.94 | 49.43 0.00 | 74.65 0.54 | 74.27 0.92 | 81.24 0.69 | |
LastFM | 62.67 4.49 | 64.41 2.70 | 71.13 0.17 | 65.95 5.98 | 67.48 0.77 | 75.49 0.00 | 58.21 0.89 | 68.12 0.33 | 73.97 0.50 | |
Enron | 68.96 0.98 | 67.79 1.53 | 63.94 1.36 | 70.89 2.72 | 75.15 0.58 | 73.89 0.00 | 71.29 0.32 | 75.01 0.79 | 77.41 0.89 | |
Social Evo. | 89.82 4.11 | 93.28 0.48 | 94.84 0.44 | 95.13 0.56 | 88.32 0.27 | 83.69 0.00 | 94.90 0.36 | 94.72 0.33 | 97.68 0.10 | |
UCI | 65.99 1.40 | 54.79 1.76 | 68.67 0.84 | 70.94 0.71 | 64.61 0.48 | 57.43 0.00 | 76.01 1.11 | 80.10 0.51 | 72.25 1.71 | |
Flights | 69.07 4.02 | 70.57 1.82 | 75.48 0.26 | 71.09 2.72 | 69.18 1.52 | 81.08 0.00 | 74.62 0.18 | 74.87 0.21 | 70.92 1.78 | |
Can. Parl. | 48.42 0.66 | 58.61 0.86 | 68.82 1.21 | 65.34 2.87 | 67.75 1.00 | 62.16 0.00 | 65.85 1.75 | 69.48 0.63 | 95.44 0.57 | |
US Legis. | 50.27 5.13 | 83.44 1.16 | 61.91 5.82 | 67.57 6.47 | 65.81 8.52 | 64.74 0.00 | 78.15 3.34 | 79.63 0.84 | 81.25 3.62 | |
UN Trade | 60.42 1.48 | 60.19 1.24 | 60.61 1.24 | 61.04 6.01 | 62.54 0.67 | 72.97 0.00 | 61.06 1.74 | 60.15 1.29 | 55.79 1.02 | |
UN Vote | 67.79 1.46 | 67.53 1.98 | 52.89 1.61 | 67.63 2.67 | 52.19 0.34 | 66.30 0.00 | 50.62 0.82 | 51.60 0.73 | 51.91 0.84 | |
Contact | 93.43 1.78 | 94.18 0.10 | 94.35 0.48 | 90.18 3.28 | 89.31 0.27 | 85.20 0.00 | 91.35 0.21 | 90.87 0.35 | 94.75 0.28 | |
Avg. Rank | 6.62 | 6.38 | 4.15 | 4.38 | 5.46 | 5.62 | 4.69 | 4.46 | 3.23 |
We report the performance of different methods on the AP metric for transductive dynamic link prediction with three negative sampling strategies in Table 1. The best and second-best results are emphasized by bold and underlined fonts. Note that the results are multiplied by 100 for a better display layout. Please refer to Section C.1 and Section C.2 for the results of AP for inductive dynamic link prediction as well as AUC-ROC for transductive and inductive dynamic link prediction tasks. Since EdgeBank can be only evaluated for transductive dynamic link prediction, we do not show its performance under the inductive setting. From the results, we have two main observations.
Firstly, DyGFormer usually outperforms baselines and achieves an average rank of 2.49/2.69 on AP/AUC-ROC for transductive and 2.69/2.56 for inductive dynamic link prediction across three negative sampling strategies. This is because: (i) The neighbor co-occurrence encoding scheme helps DyGFormer exploit correlations between the source node and destination node, which are often predictive for future links (see Section 5.5). (ii) The patching technique allows DyGFormer to access longer histories and capture long-term temporal dependencies (see Section 5.4). In Table 9 in Section B.5, the input sequence lengths of DyGFormer are much longer than those of baselines on several datasets, indicating that it can utilize longer sequences better. We also observe the varying results of DyGFormer across different negative sampling strategies and give an analysis in Section 5.7.
Secondly, some of our findings on baselines differ from previous reports. For instance, the performance of some baselines can be significantly improved by properly setting some hyperparameters. Additionally, some methods obtain worse results after we fix the problems or make adaptations in their implementations. More explanations can be found in Section B.4. These observations highlight the importance of rigorously evaluating different methods with a unified library and verify the necessity of introducing DyGLib to facilitate the development of dynamic graph learning.
We also report the results of dynamic node classification in Table 15 in Section C.3. We observe that DyGFormer obtains better performance than most baselines and achieves an impressive average rank of 2.50 among them, demonstrating the superiority of DyGFormer once again.
Datasets | TCL | w/ NCoE | Improv. |
---|---|---|---|
Wikipedia | 96.47 | 99.09 | 2.72% |
Reddit | 97.53 | 99.04 | 1.55% |
MOOC | 82.38 | 86.92 | 5.51% |
LastFM | 67.27 | 84.02 | 24.90% |
Enron | 79.70 | 90.18 | 13.15% |
Social Evo. | 93.13 | 94.06 | 1.00% |
UCI | 89.57 | 94.69 | 5.72% |
Flights | 91.23 | 97.71 | 7.10% |
Can. Parl. | 68.67 | 69.34 | 0.98% |
US Legis. | 69.59 | 69.47 | -0.17% |
UN Trade | 62.21 | 63.46 | 2.01% |
UN Vote | 51.90 | 51.52 | -0.73% |
Contact | 92.44 | 97.98 | 5.99% |

5.3 Generalizability of Neighbor Co-occurrence Encoding Scheme
Our Neighbor Co-occurrence Encoding scheme (NCoE) is versatile and can be easily integrated with dynamic graph learning methods based on sequential models. Hence, we incorporate NCoE with TCL and GraphMixer and show their performance in Table 2 and Table 16 in Section C.4. We find that TCL and GraphMixer usually yield better results with NCoE, achieving average improvements of 5.36% and 1.86%, respectively, over all datasets. This verifies the effectiveness and versatility of the neighbor co-occurrence encoding, and highlights the importance of capturing correlations between nodes. Also, as TCL and DyGFormer are both built upon Transformer, TCL w/ NCoE can achieve results similar to DyGFormer on datasets that favor shorter input sequences (in which cases the patching technique in DyGFormer contributes little). However, when datasets exhibit more obvious long-term temporal dependencies (e.g., LastFM, Can. Parl.), the performance gaps become more significant.
5.4 Advantages of Patching Technique
We validate the advantages of our patching technique in preserving the local temporal proximities and reducing the computational complexity, which helps DyGFormer effectively and efficiently utilize longer histories. We conduct experiments on LastFM and Can. Parl. since they can benefit from longer historical records. For baselines, we sample more neighbors or perform more causal anonymous walks (starting from 32) to make them access longer histories. The results are depicted in Figure 2, where the x-axis is represented by a logarithmic scale with base 2. We also plot the performance of baselines with the optimal length by unconnected points based on Table 9 in Section B.5. Note that the results of some baselines are incomplete since they raise the out-of-memory error when the lengths are longer. For example, TGAT is only computationally feasible when extending the input length to 32, resulting in two discrete points with lengths 20 (the optimal length) and 32.
From Figure 2, we conclude that: (i) most of the baselines perform worse when the input lengths become longer, indicating they lack the ability to capture long-term temporal dependencies; (ii) the baselines usually encounter expensive computational costs when computing on longer histories. Although memory network-based methods (i.e., DyRep and TGN) can handle longer histories with affordable computational costs, they cannot benefit from longer histories due to the potential issues of vanishing or exploding gradients; (iii) DyGFormer consistently achieves gains from longer sequences, demonstrating the advantages of the patching technique in leveraging longer histories.
We also compare the running time and memory usage of DyGFormer with and without the patching technique during the training process. The results are shown in Table 17 in Section C.5. We observe that the patching technique efficiently reduces model training costs in both time and space, allowing DyGFormer to access longer histories. As the input sequence length increases, the reductions become more significant. Moreover, with the patching technique, DyGFormer achieves average improvements of 0.31% and 0.74% in performance on LastFM and Can. Parl. over DyGFormer without patching. This observation further demonstrates the advantage of our patching technique in leveraging local temporal proximities for better results.
5.5 Verification of the Motivation of Neighbor Co-occurrence Encoding Scheme
To verify the motivation of NCoE (i.e., nodes with more common historical neighbors tend to interact in the future), we compare the performance of DyGFormer and DyGFormer without NCoE. We choose 0.5 as the threshold and use TP, TN, FN, and FP to denote True/False Positive/Negative. Common Neighbor Ratio (CNR) is defined as the ratio of common neighbors in source node $u$'s sequence $S_u^t$ and destination node $v$'s sequence $S_v^t$. We focus on links whose predictions of DyGFormer w/o NCoE are changed by DyGFormer (i.e., FN→TP, FP→TN, TP→FN, and TN→FP). We define Changed Link Ratio (CLR) as the ratio of the changed links to their original set, which is respectively computed by |FN→TP|/|FN|, |FP→TN|/|FP|, |TP→FN|/|TP|, and |TN→FP|/|TN|. If NCoE is helpful, DyGFormer will revise more wrong predictions (more FN→TP and FP→TN) and make fewer incorrect changes (fewer TP→FN and TN→FP). We report CLR and the average CNR of links in the above sets on five typical datasets in Table 3.
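For clarity, the sketch below computes these two quantities; the exact normalization of CNR (here, shared neighbors over the union of neighbors) is our assumption for illustration.

```python
def common_neighbor_ratio(src_neighbors, dst_neighbors):
    """Fraction of historical neighbors shared by the source and destination
    sequences (one reasonable reading of CNR; the normalization is an assumption)."""
    common = set(src_neighbors) & set(dst_neighbors)
    union = set(src_neighbors) | set(dst_neighbors)
    return len(common) / max(len(union), 1)

def changed_link_ratio(preds_without, preds_with, labels, source, target):
    """CLR: among links predicted as `source` (e.g., 'FN') by the model without
    NCoE, the fraction that the full model turns into `target` (e.g., 'TP')."""
    def outcome(pred, label):
        return ("T" if pred == label else "F") + ("P" if pred == 1 else "N")
    in_source = [i for i, (p, y) in enumerate(zip(preds_without, labels))
                 if outcome(p, y) == source]
    changed = [i for i in in_source if outcome(preds_with[i], labels[i]) == target]
    return len(changed) / max(len(in_source), 1)
```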
Datasets | CLR (%) | | | | CNR (%) | | |
---|---|---|---|---|---|---|---|---
 | FN→TP | FP→TN | TP→FN | TN→FP | FN→TP | FP→TN | TP→FN | TN→FP
Wikipedia | 68.36 | 72.73 | 1.68 | 1.69 | 18.16 | 0.01 | 0.10 | 2.49 |
UCI | 71.45 | 94.11 | 7.29 | 1.82 | 19.08 | 2.49 | 3.35 | 13.02 |
Flights | 83.66 | 83.83 | 1.73 | 2.11 | 37.09 | 2.28 | 7.06 | 20.28 |
US Legis. | 31.63 | 23.67 | 6.63 | 1.59 | 69.92 | 62.13 | 61.14 | 63.80 |
UN Vote | 44.02 | 36.46 | 28.95 | 30.53 | 78.57 | 81.39 | 80.86 | 77.02 |
We find that NCoE effectively helps DyGFormer rectify the wrong predictions of DyGFormer w/o NCoE on datasets where the CNR of positive links is significantly higher than that of negative ones, which holds for most datasets. Concretely, for Wikipedia, UCI, and Flights, the CNRs of FN→TP are much higher than those of FP→TN (e.g., 37.09% vs. 2.28% on Flights), and DyGFormer revises most wrong predictions of DyGFormer w/o NCoE (e.g., 83.66% for positive links in FN and 83.83% for negative links in FP on Flights). Corrections made by our encoding scheme are less obvious on datasets whose CNRs of positive and negative links are similar, which occurs in only 2 of the 13 datasets. For US Legis. and UN Vote, the CNRs of FN and FP are analogous (e.g., 69.92% vs. 62.13% on US Legis.), weakening the advantage of our neighbor co-occurrence encoding scheme (e.g., only 31.63%/23.67% of positive/negative links are corrected in FN/FP on US Legis.). Therefore, we conclude that the neighbor co-occurrence encoding scheme helps DyGFormer capture common historical neighbors in $S_u^t$ and $S_v^t$, and brings better results in most cases.
5.6 When Will DyGFormer Be a Good Choice?
Note that DyGFormer is superior to baselines by 1) exploring the correlations of the source and destination nodes from their historical sequences via the neighbor co-occurrence encoding scheme; and 2) using the patching technique to attend to longer histories. Thus, DyGFormer tends to perform better on datasets that favor these two designs. We define Link Ratio (LR) as the ratio of links in their corresponding positive or negative set, which can be computed by |TP|/(|TP|+|FN|), |TN|/(|TN|+|FP|), |FN|/(|TP|+|FN|), and |FP|/(|TN|+|FP|). As a method with more TP and TN (i.e., fewer FN and FP) is better, we report the results of LR and the average CNR of links in TP and TN on five typical datasets in Table 4.
Datasets | LR (%) | | CNR (%) |
---|---|---|---|---
 | TP | TN | TP | TN
Wikipedia | 92.74 | 97.19 | 59.09 | 0.01 |
UCI | 82.70 | 96.77 | 28.03 | 1.45 |
Flights | 96.13 | 95.33 | 47.58 | 1.40 |
US Legis. | 78.95 | 56.83 | 75.18 | 53.98 |
UN Vote | 65.18 | 45.43 | 56.24 | 76.02 |
Datasets | LR (%) | | | CNR (%) | |
---|---|---|---|---|---|---
 | rnd | hist | ind | rnd | hist | ind
Wikipedia | 2.81 | 89.28 | 94.53 | 0.02 | 14.00 | 11.66 |
UCI | 3.23 | 64.93 | 76.42 | 9.98 | 12.22 | 13.81 |
Flights | 4.67 | 94.52 | 92.94 | 0.01 | 35.62 | 30.29 |
US Legis. | 43.17 | 17.31 | 21.21 | 79.40 | 87.51 | 75.51 |
UN Vote | 54.57 | 39.90 | 52.92 | 79.60 | 75.53 | 79.15 |
We observe that when the CNR of TP is significantly higher than the CNR of TN on a dataset, DyGFormer often outperforms the baselines (most datasets satisfy this property). For Wikipedia, UCI, and Flights, the CNRs of TP are much higher than those of TN (e.g., 59.09% vs. 0.01% on Wikipedia). Such a characteristic matches the motivation of our neighbor co-occurrence encoding scheme, enabling DyGFormer to correctly predict most links (e.g., 92.74% of positive links and 97.19% of negative links are properly predicted on Wikipedia). Moreover, as LastFM and Can. Parl. can gain from longer histories (see Figure 2 and Table 9 in Section B.5), DyGFormer is significantly better than baselines on these two datasets. When the CNRs of TP and TN are less distinguishable, DyGFormer may perform worse (only 2 out of 13 datasets show this property). For US Legis., the CNRs of TP and TN are close (i.e., 75.18% vs. 53.98%), making DyGFormer worse than memory-based baselines (i.e., JODIE, DyRep, and TGN). For UN Vote, its CNR of TP is even lower than that of TN (i.e., 56.24% vs. 76.02%), which is opposite to our motivation, leading to poorer results of DyGFormer than a few baselines. Since these two datasets cannot obviously gain from longer sequences either (see Table 9 in Section B.5), DyGFormer obtains worse results on them. Thus, we conclude that for datasets with a much higher CNR of TP than CNR of TN, or datasets that can benefit from longer histories, DyGFormer is a good choice. Otherwise, we may need to try other methods.
5.7 Why Does DyGFormer's Performance Vary across Different Negative Sampling Strategies?
Compared with the random (rnd) strategy, the historical (hist) and inductive (ind) strategies sample previously observed links as negative ones. This turns previously positive links into negatives, which may hurt the performance of DyGFormer since the motivation of our neighbor co-occurrence encoding scheme is violated. As positive links are identical among rnd, hist, and ind, we compute LR and the average CNR of links in FP and show the results in Table 5.
We find that when hist or ind yields a CNR of FP that is several orders of magnitude higher than under rnd, DyGFormer drops sharply. For Wikipedia, UCI, and Flights, the CNRs of FP with hist/ind are much higher than with rnd (e.g., 14.00%/11.66% vs. 0.02% on Wikipedia). This misleads DyGFormer to predict negative links as positive and causes drops (e.g., 89.28%/94.53% of negative links are incorrectly predicted with hist/ind on Wikipedia, while only 2.81% are wrong with rnd). We also note that the drops on UCI are milder since the changes in CNR caused by hist or ind vs. rnd are less obvious than the changes on Wikipedia and Flights. When the changes in the CNR of FP caused by hist or ind are not obvious, DyGFormer is less affected. Since hist/ind makes little change in the CNRs of FP on US Legis., DyGFormer ranks second with hist/ind, which may indicate that it is less influenced by the neighbor co-occurrence encoding scheme and generalizes well to various negative sampling strategies. For UN Vote, although its CNRs of FP are not affected by hist and ind either, DyGFormer still performs worse due to its inferior performance with rnd. Hence, we deduce that our neighbor co-occurrence encoding may sometimes be fragile to various negative sampling strategies if its motivation is violated, leading to the varying performance of DyGFormer.
5.8 Ablation Study
Finally, we validate the effectiveness of the neighbor co-occurrence encoding, the time encoding, and the mixing of the source node's and destination node's sequences in DyGFormer. From Figure 4 in Section C.6, we observe that DyGFormer obtains the best performance when using all the components, and the results become worse when any component is removed. This illustrates the necessity of each design in DyGFormer. Please refer to Section C.6 for detailed implementations and discussions.
6 Conclusion
In this paper, we proposed a new Transformer-based architecture (DyGFormer) and a unified library (DyGLib) to foster the development of dynamic graph learning. DyGFormer differs from previous methods in (i) a neighbor co-occurrence encoding scheme to exploit the correlations of nodes in each interaction; and (ii) a patching technique to help the model capture long-term temporal dependencies. DyGLib serves as a toolkit for reproducible, scalable, and credible continuous-time dynamic graph learning with standard training pipelines, extensible coding interfaces, and comprehensive evaluating protocols. We hope our work can provide new perspectives on designing dynamic graph learning frameworks and encourage more researchers to dive into this field. In the future, we will continue to enrich DyGLib by incorporating recently released datasets and state-of-the-art models.
Acknowledgments and Disclosure of Funding
This work was supported by the National Natural Science Foundation of China (No. 62272023 and 51991395) and the Fundamental Research Funds for the Central Universities (No. YWF-23-L-1203).
References
- [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, pages 265–283. USENIX Association, 2016.
- [2] Unai Alvarez-Rodriguez, Federico Battiston, Guilherme Ferraz de Arruda, Yamir Moreno, Matjaž Perc, and Vito Latora. Evolutionary dynamics of higher-order interactions in social networks. Nature Human Behaviour, 5(5):586–595, 2021.
- [3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [4] Lei Bai, Lina Yao, Can Li, Xianzhi Wang, and Can Wang. Adaptive graph convolutional recurrent network for traffic forecasting. In Advances in Neural Information Processing Systems 33, 2020.
- [5] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew M. Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018.
- [6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33, 2020.
- [7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Computer Vision - ECCV 2020 - 16th European Conference, volume 12346 of Lecture Notes in Computer Science, pages 213–229. Springer, 2020.
- [8] Yukuo Cen, Zhenyu Hou, Yan Wang, Qibin Chen, Yizhen Luo, Xingcheng Yao, Aohan Zeng, Shiguang Guo, Peng Zhang, Guohao Dai, Yu Wang, Chang Zhou, Hongxia Yang, and Jie Tang. Cogdl: An extensive toolkit for deep learning on graphs. CoRR, abs/2103.00959, 2021.
- [9] Xiaofu Chang, Xuqin Liu, Jianfeng Wen, Shuang Li, Yanming Fang, Le Song, and Yuan Qi. Continuous-time dynamic graph learning via neural interaction processes. In The 29th ACM International Conference on Information and Knowledge Management, pages 145–154. ACM, 2020.
- [10] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST@EMNLP 2014, pages 103–111. Association for Computational Linguistics, 2014.
- [11] Weilin Cong, Yanhong Wu, Yuandong Tian, Mengting Gu, Yinglong Xia, Mehrdad Mahdavi, and Chun-cheng Jason Chen. Dynamic graph representation learning via graph transformer networks. CoRR, abs/2111.10447, 2021.
- [12] Weilin Cong, Si Zhang, Jian Kang, Baichuan Yuan, Hao Wu, Xin Zhou, Hanghang Tong, and Mehrdad Mahdavi. Do we really need complicated model architectures for temporal networks? In International Conference on Learning Representations, 2023.
- [13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186. Association for Computational Linguistics, 2019.
- [14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations. OpenReview.net, 2021.
- [15] Ziwei Fan, Zhiwei Liu, Jiawei Zhang, Yun Xiong, Lei Zheng, and Philip S. Yu. Continuous-time sequential recommendation with temporal graph collaborative transformer. In The 30th ACM International Conference on Information and Knowledge Management, pages 433–442. ACM, 2021.
- [16] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. CoRR, abs/1903.02428, 2019.
- [17] Palash Goyal, Sujit Rokka Chhetri, and Arquimedes Canedo. dyngraph2vec: Capturing network dynamics using dynamic graph representation learning. Knowl. Based Syst., 187, 2020.
- [18] Palash Goyal, Sujit Rokka Chhetri, Ninareh Mehrabi, Emilio Ferrara, and Arquimedes Canedo. Dynamicgem: A library for dynamic graph embedding methods. CoRR, abs/1811.10734, 2018.
- [19] Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 922–929. AAAI Press, 2019.
- [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778. IEEE Computer Society, 2016.
- [21] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
- [22] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems 33, 2020.
- [23] Shenyang Huang, Farimah Poursafaei, Jacob Danovitch, Matthias Fey, Weihua Hu, Emanuele Rossi, Jure Leskovec, Michael Bronstein, Guillaume Rabusseau, and Reihaneh Rabbany. Temporal graph benchmark for machine learning on temporal graphs. arXiv preprint arXiv:2307.01026, 2023.
- [24] Zijie Huang, Yizhou Sun, and Wei Wang. Learning continuous system dynamics from irregularly-sampled partial observations. In Advances in Neural Information Processing Systems 33, 2020.
- [25] Ming Jin, Yuan-Fang Li, and Shirui Pan. Neural temporal walks: Motif-aware representation learning on continuous-time dynamic graphs. In NeurIPS, 2022.
- [26] Seyed Mehran Kazemi, Rishab Goel, Kshitij Jain, Ivan Kobyzev, Akshay Sethi, Peter Forsyth, and Pascal Poupart. Representation learning for dynamic graphs: A survey. J. Mach. Learn. Res., 21:70:1–70:73, 2020.
- [27] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, 2015.
- [28] Srijan Kumar, Xikun Zhang, and Jure Leskovec. Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1269–1278. ACM, 2019.
- [29] Jintang Li, Kun Xu, Liang Chen, Zibin Zheng, and Xiao Liu. Graphgallery: A platform for fast benchmarking and easy development of graph neural networks based intelligent software. In 43rd IEEE/ACM International Conference on Software Engineering: Companion Proceedings, pages 13–16. IEEE, 2021.
- [30] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Advances in Neural Information Processing Systems 32, pages 5244–5254, 2019.
- [31] Xiaohan Li, Mengqi Zhang, Shu Wu, Zheng Liu, Liang Wang, and Philip S. Yu. Dynamic graph collaborative filtering. In 20th IEEE International Conference on Data Mining, pages 322–331. IEEE, 2020.
- [32] Meng Liu, Youzhi Luo, Limei Wang, Yaochen Xie, Hao Yuan, Shurui Gui, Haiyang Yu, Zhao Xu, Jingtun Zhang, Yi Liu, Keqiang Yan, Haoran Liu, Cong Fu, Bora Oztekin, Xuan Zhang, and Shuiwang Ji. DIG: A turnkey library for diving into graph deep learning research. J. Mach. Learn. Res., 22:240:1–240:9, 2021.
- [33] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019.
- [34] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision, pages 9992–10002. IEEE, 2021.
- [35] Yuhong Luo and Pan Li. Neighborhood-aware scalable temporal network representation learning. In The First Learning on Graphs Conference, 2022.
- [36] Yao Ma, Ziyi Guo, Zhaochun Ren, Jiliang Tang, and Dawei Yin. Streaming graph neural networks. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 719–728. ACM, 2020.
- [37] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814. Omnipress, 2010.
- [38] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In International Conference on Learning Representations, 2023.
- [39] Aldo Pareja, Giacomo Domeniconi, Jie Chen, Tengfei Ma, Toyotaro Suzumura, Hiroki Kanezashi, Tim Kaler, Tao B. Schardl, and Charles E. Leiserson. Evolvegcn: Evolving graph convolutional networks for dynamic graphs. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 5363–5370. AAAI Press, 2020.
- [40] Razvan Pascanu, Tomás Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of JMLR Workshop and Conference Proceedings, pages 1310–1318. JMLR.org, 2013.
- [41] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.
- [42] James W Pennebaker, Martha E Francis, and Roger J Booth. Linguistic inquiry and word count: Liwc 2001. Mahway: Lawrence Erlbaum Associates, 71(2001):2001, 2001.
- [43] Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter W. Battaglia. Learning mesh-based simulation with graph networks. In 9th International Conference on Learning Representations. OpenReview.net, 2021.
- [44] Farimah Poursafaei, Andy Huang, Kellin Pelrine, and Reihaneh Rabbany. Towards better evaluation for dynamic link prediction. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
- [45] Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and Michael Bronstein. Temporal graph networks for deep learning on dynamic graphs. In ICML 2020 Workshop on Graph Representation Learning, 2020.
- [46] Benedek Rozemberczki, Paul Scherer, Yixuan He, George Panagopoulos, Alexander Riedel, Maria Sinziana Astefanoaei, Oliver Kiss, Ferenc Béres, Guzmán López, Nicolas Collignon, and Rik Sarkar. Pytorch geometric temporal: Spatiotemporal signal processing with neural machine learning models. In The 30th ACM International Conference on Information and Knowledge Management, pages 4564–4573. ACM, 2021.
- [47] Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter W. Battaglia. Learning to simulate complex physics with graph networks. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8459–8468. PMLR, 2020.
- [48] Aravind Sankar, Yanhong Wu, Liang Gou, Wei Zhang, and Hao Yang. Dysat: Deep neural representation learning on dynamic graphs via self-attention networks. In The Thirteenth ACM International Conference on Web Search and Data Mining, pages 519–527. ACM, 2020.
- [49] Joakim Skarding, Bogdan Gabrys, and Katarzyna Musial. Foundations and modeling of dynamic networks using dynamic graph neural networks: A survey. IEEE Access, 9:79143–79168, 2021.
- [50] Weiping Song, Zhiping Xiao, Yifan Wang, Laurent Charlin, Ming Zhang, and Jian Tang. Session-based social recommendation via dynamic graph attention networks. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 555–563. ACM, 2019.
- [51] Amauri H. Souza, Diego Mesquita, Samuel Kaski, and Vikas Garg. Provably expressive temporal graph networks. In NeurIPS, 2022.
- [52] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, 2014.
- [53] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In Advances in Neural Information Processing Systems 28, pages 2440–2448, 2015.
- [54] Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. Mlp-mixer: An all-mlp architecture for vision. In Advances in Neural Information Processing Systems 34, pages 24261–24272, 2021.
- [55] Rakshit Trivedi, Mehrdad Farajtabar, Prasenjeet Biswal, and Hongyuan Zha. Dyrep: Learning representations over dynamic graphs. In 7th International Conference on Learning Representations. OpenReview.net, 2019.
- [56] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
- [57] Lu Wang, Xiaofu Chang, Shuang Li, Yunfei Chu, Hui Li, Wei Zhang, Xiaofeng He, Le Song, Jingren Zhou, and Hongxia Yang. TCL: transformer-based dynamic graph modelling via contrastive learning. CoRR, abs/2105.07944, 2021.
- [58] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, Ziyue Huang, Qipeng Guo, Hao Zhang, Haibin Lin, Junbo Zhao, Jinyang Li, Alexander J. Smola, and Zheng Zhang. Deep graph library: Towards efficient and scalable deep learning on graphs. CoRR, abs/1909.01315, 2019.
- [59] Xuhong Wang, Ding Lyu, Mengjian Li, Yang Xia, Qi Yang, Xinwen Wang, Xinguang Wang, Ping Cui, Yupu Yang, Bowen Sun, and Zhenyu Guo. APAN: asynchronous propagation attention network for real-time temporal graph embedding. In International Conference on Management of Data, pages 2628–2638. ACM, 2021.
- [60] Yanbang Wang, Yen-Yu Chang, Yunyu Liu, Jure Leskovec, and Pan Li. Inductive representation learning in temporal networks via causal anonymous walks. In 9th International Conference on Learning Representations. OpenReview.net, 2021.
- [61] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In 3rd International Conference on Learning Representations, 2015.
- [62] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Advances in Neural Information Processing Systems 34, pages 22419–22430, 2021.
- [63] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. Graph wavenet for deep spatial-temporal graph modeling. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pages 1907–1913. ijcai.org, 2019.
- [64] Da Xu, Chuanwei Ruan, Evren Körpeoglu, Sushant Kumar, and Kannan Achan. Inductive representation learning on temporal graphs. In 8th International Conference on Learning Representations. OpenReview.net, 2020.
- [65] Guotong Xue, Ming Zhong, Jianxin Li, Jia Chen, Chengshuai Zhai, and Ruochen Kong. Dynamic network embedding survey. Neurocomputing, 472:212–223, 2022.
- [66] Jiaxuan You, Tianyu Du, and Jure Leskovec. ROLAND: graph learning framework for dynamic graphs. In The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2358–2366. ACM, 2022.
- [67] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 3634–3640. ijcai.org, 2018.
- [68] Le Yu. An empirical evaluation of temporal graph benchmark. arXiv preprint arXiv:2307.12510, 2023.
- [69] Le Yu, Bowen Du, Xiao Hu, Leilei Sun, Liangzhe Han, and Weifeng Lv. Deep spatio-temporal graph convolutional network for traffic accident prediction. Neurocomputing, 423:135–147, 2021.
- [70] Le Yu, Zihang Liu, Leilei Sun, Bowen Du, Chuanren Liu, and Weifeng Lv. Continuous-time user preference modelling for temporal sets prediction. IEEE Transactions on Knowledge and Data Engineering, 2023.
- [71] Le Yu, Guanghui Wu, Leilei Sun, Bowen Du, and Weifeng Lv. Element-guided temporal graph representation learning for temporal sets prediction. In The ACM Web Conference 2022, pages 1902–1913. ACM, 2022.
- [72] Mengqi Zhang, Shu Wu, Xueli Yu, Qiang Liu, and Liang Wang. Dynamic graph neural networks for sequential recommendation. IEEE Transactions on Knowledge and Data Engineering, 2022.
- [73] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Thirty-Fifth AAAI Conference on Artificial Intelligence, pages 11106–11115. AAAI Press, 2021.
- [74] Hongkuan Zhou, Da Zheng, Israt Nisa, Vasileios Ioannidis, Xiang Song, and George Karypis. Tgl: a general framework for temporal gnn training on billion-scale graphs. Proceedings of the VLDB Endowment, 15(8):1572–1580, 2022.
Appendix A Additional Discussions and Details
A.1 Limitations
One potential limitation of DyGFormer is that it ignores high-order relationships between nodes, since it only learns from nodes' first-hop interactions. In scenarios where high-order relationships are essential, DyGFormer may be suboptimal compared with baselines that model higher-order interactions. However, naively feeding the multi-hop neighbors of nodes into DyGFormer would incur expensive computational costs. Designing more efficient and effective frameworks to model nodes' high-order relationships for dynamic graph learning is a promising direction.
Another potential limitation is the sensitivity of the neighbor co-occurrence encoding scheme to different negative sampling strategies (discussed in Section 5.7). When the assumption behind the neighbor co-occurrence encoding scheme is violated, its performance may drop drastically in some cases. Designing more robust encoding schemes to tackle this issue is an insightful direction.
A.2 Licenses
All the codes and datasets we use are publicly available and permit usage for research purposes under either the MIT License or the Apache License 2.0.
A.3 Overall Procedure of DyGLib
Figure 3 shows the overall procedure of DyGLib.

Appendix B Detailed Experimental Settings
B.1 Descriptions of Datasets
We use thirteen datasets collected by [44] in the experiments, which are publicly available at https://zenodo.org/record/7213796#.Y1cO6y8r30o:
- Wikipedia is a bipartite interaction graph that contains the edits on Wikipedia pages over a month. Nodes represent users and pages, and links denote the editing behaviors with timestamps. Each link is associated with a 172-dimensional Linguistic Inquiry and Word Count (LIWC) feature [42]. This dataset additionally contains dynamic labels that indicate whether users are temporarily banned from editing.
- Reddit is bipartite and records the posts of users under subreddits during one month. Users and subreddits are nodes, and links are the timestamped posting requests. Each link has a 172-dimensional LIWC feature. This dataset also includes dynamic labels representing whether users are banned from posting.
- MOOC is a bipartite interaction network of online sources, where nodes are students and course content units (e.g., videos and problem sets). Each link denotes a student's access behavior to a specific content unit and is assigned a 4-dimensional feature.
- LastFM is bipartite and consists of the information about which songs were listened to by which users over one month. Users and songs are nodes, and links denote the listening behaviors of users.
- Enron records the email communications between employees of the ENRON energy corporation over three years.
- Social Evo. is a mobile phone proximity network that monitors the daily activities of an entire undergraduate dormitory over a period of eight months, where each link has a 2-dimensional feature.
- UCI is an online communication network, where nodes are university students and links are messages posted by students.
- Flights is a dynamic flight network that displays the development of air traffic during the COVID-19 pandemic. Airports are represented by nodes and the tracked flights are denoted as links. Each link is associated with a weight, indicating the number of flights between two airports in a day.
- Can. Parl. is a dynamic political network that records the interactions between Canadian Members of Parliament (MPs) from 2006 to 2019. Each node represents an MP from an electoral district, and a link is created when two MPs both vote "yes" on a bill. The weight of each link refers to the number of times that one MP voted "yes" for another MP in a year.
- US Legis. is a senate co-sponsorship network that tracks social interactions between legislators in the US Senate. The weight of each link specifies the number of times two congresspersons have co-sponsored a bill in a given congress.
- UN Trade contains the food and agriculture trade between 181 nations for more than 30 years. The weight of each link indicates the total sum of normalized agriculture import or export values between two particular countries.
- UN Vote records roll-call votes in the United Nations General Assembly. If two nations both voted "yes" to an item, the weight of the link between them is increased by one.
- Contact describes how the physical proximity among about 700 university students evolves over a month. Each student has a unique identifier, and links denote that two students are within close proximity to each other. Each link is associated with a weight, revealing the physical proximity between students.
We show the statistics of the datasets in Table 6, where #N&L Feat stands for the dimensions of node and link features. We notice a slight difference between the statistics of the Contact dataset reported in [44] (which has 694 nodes and 2,426,280 links) and our own calculations, although both of them are computed based on the released dataset by [44]. We ultimately report our statistics of the Contact dataset in this paper.
Datasets | Domains | #Nodes | #Links | #N&L Feat | Bipartite | Duration | Unique Steps | Time Granularity |
---|---|---|---|---|---|---|---|---|
Wikipedia | Social | 9,227 | 157,474 | – & 172 | True | 1 month | 152,757 | Unix timestamps |
Reddit | Social | 10,984 | 672,447 | – & 172 | True | 1 month | 669,065 | Unix timestamps |
MOOC | Interaction | 7,144 | 411,749 | – & 4 | True | 17 months | 345,600 | Unix timestamps |
LastFM | Interaction | 1,980 | 1,293,103 | – & – | True | 1 month | 1,283,614 | Unix timestamps |
Enron | Social | 184 | 125,235 | – & – | False | 3 years | 22,632 | Unix timestamps |
Social Evo. | Proximity | 74 | 2,099,519 | – & 2 | False | 8 months | 565,932 | Unix timestamps |
UCI | Social | 1,899 | 59,835 | – & – | False | 196 days | 58,911 | Unix timestamps |
Flights | Transport | 13,169 | 1,927,145 | – & 1 | False | 4 months | 122 | days |
Can. Parl. | Politics | 734 | 74,478 | – & 1 | False | 14 years | 14 | years |
US Legis. | Politics | 225 | 60,396 | – & 1 | False | 12 congresses | 12 | congresses |
UN Trade | Economics | 255 | 507,497 | – & 1 | False | 32 years | 32 | years |
UN Vote | Politics | 201 | 1,035,742 | – & 1 | False | 72 years | 72 | years |
Contact | Proximity | 692 | 2,426,279 | – & 1 | False | 1 month | 8,064 | 5 minutes |
B.2 Descriptions of Baselines
We select the following eight baselines:
- JODIE is designed for temporal bipartite networks of user-item interactions. It employs two coupled recurrent neural networks to update the states of users and items. A projection operation is introduced to learn the future representation trajectory of each user/item [28].
- DyRep proposes a recurrent architecture to update node states upon each interaction. It also includes a temporal-attentive aggregation module to consider the temporally evolving structural information in dynamic graphs [55].
- TGAT computes the node representation by aggregating features from each node's temporal-topological neighbors based on the self-attention mechanism. It is also equipped with a time encoding function for capturing temporal patterns [64].
- TGN maintains an evolving memory for each node and updates this memory when the node is observed in an interaction, which is achieved by the message function, message aggregator, and memory updater. An embedding module is leveraged to generate the temporal representations of nodes [45].
- CAWN first extracts multiple causal anonymous walks for each node, which can explore the causality of network dynamics and generate relative node identities. Then, it utilizes recurrent neural networks to encode each walk and aggregates these walks to obtain the final node representation [60].
- EdgeBank is a pure memory-based approach without trainable parameters for transductive dynamic link prediction. It stores the observed interactions in a memory unit and updates the memory through various strategies. An interaction is predicted as positive if it is retained in the memory, and negative otherwise [44]. The publication presents two updating strategies, but the official implementation includes two more (https://github.com/fpour/DGB/blob/main/EdgeBank/link_pred/edge_bank_baseline.py). To be specific, the four variants of EdgeBank are: EdgeBank∞, which has unlimited memory and stores all the observed edges; EdgeBank_tw-ts and EdgeBank_tw-re, which only remember edges within a fixed-size time window from the immediate past, where the window size of EdgeBank_tw-ts is set to the duration of the test split while EdgeBank_tw-re relates it to the time intervals of repeated edges; and EdgeBank_th, which retains edges whose appearance counts exceed a threshold. We experiment with all four variants and report the best performance among them (a minimal sketch of this memory-based idea follows the list).
- TCL first generates each node's interaction sequence by performing a breadth-first search algorithm on the temporal dependency interaction sub-graph. Then, it presents a graph transformer that considers both graph topology and temporal information to learn node representations. It also incorporates a cross-attention operation for modeling the inter-dependencies between two interaction nodes [57].
- GraphMixer is a conceptually simple method built on MLP-Mixer [54]. It uses a fixed (non-trainable) time encoding function, feeds the time encodings together with link features into an MLP-Mixer-based link encoder, and summarizes node features with a neighbor mean-pooling node encoder to compute node representations [12].
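To make the memory-based idea behind EdgeBank concrete, the following is a minimal sketch of an EdgeBank∞-style predictor as described above (unlimited memory, no trainable parameters); the class and function names are our own illustration, not the official DGB implementation.

```python
# Minimal sketch of an EdgeBank-infinity-style predictor: it has no trainable
# parameters, remembers every edge it has seen, and predicts a link as positive
# iff the (source, destination) pair is already in memory.
class EdgeBankInfinity:
    def __init__(self):
        self.memory = set()  # stores observed (src, dst) pairs

    def update(self, src: int, dst: int) -> None:
        self.memory.add((src, dst))

    def predict(self, src: int, dst: int) -> float:
        return 1.0 if (src, dst) in self.memory else 0.0


# Usage: stream interactions in chronological order, predicting before updating.
bank = EdgeBankInfinity()
for src, dst, _t in [(0, 1, 1.0), (0, 2, 2.0), (0, 1, 3.0)]:
    score = bank.predict(src, dst)  # 0.0 for unseen edges, 1.0 for repeated ones
    bank.update(src, dst)
```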
B.3 Some Problematic Implementations in Baselines
JODIE, DyRep, and TGN are based on memory networks, and their implementations design raw messages to avoid information leakage. However, they fail to store the raw messages when saving models because the raw messages are maintained in a dictionary and thus cannot be saved as model parameters (https://github.com/twitter-research/tgn/blob/2aa295a1f182137a6ad56328b43cb3d8223cae77/train_self_supervised.py#L302). Our DyGLib addresses this issue by additionally saving the correct raw messages when saving models. In the official implementation of CAWN, the encoding of each causal anonymous walk is taken from the embedding at the last position of the walk (https://github.com/snap-stanford/CAW/blob/f994ff2b2c29778e6250b6a9928fd9943e0163f7/module.py#L1069). However, some walks are padded to support mini-batch training in practice, making the last position of these padded walks meaningless. To get the correct encoding, it is necessary to use the actual length of each walk. Moreover, there are issues with the nodeedge2idx dictionary computed by the get_ts2idx function (https://github.com/snap-stanford/CAW/blob/f994ff2b2c29778e6250b6a9928fd9943e0163f7/graph.py#L79) when multiple interactions occur simultaneously at the last timestamp. This leads to information leakage since the model can potentially access additional interactions for a given interaction even if they happen at the same time. Our DyGLib fixes these problems as well.
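As a hedged illustration of the first fix described above, one way to keep such non-parameter state is to store it next to the model's state_dict in the same checkpoint; the function names and checkpoint layout below are our own sketch, not DyGLib's actual interface.

```python
import torch

# Hypothetical illustration: memory-based models keep "raw messages" in a plain
# Python dict (node id -> messages), which is NOT captured by state_dict().
# Saving and restoring that dict alongside the parameters avoids losing it.
def save_checkpoint(model: torch.nn.Module, raw_messages: dict, path: str) -> None:
    torch.save({"model_state_dict": model.state_dict(),
                "raw_messages": raw_messages}, path)


def load_checkpoint(model: torch.nn.Module, path: str) -> dict:
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    return checkpoint["raw_messages"]
```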
B.4 Some Inconsistent Observations with Previous Reports
In the experiments, we find the behaviors of baselines are inconsistent with their previous reports in some cases. We provide detailed illustrations and attribute these phenomena to the following reasons.
Suboptimal Settings of Hyperparameters. Some important hyperparameters in the baselines are not sufficiently fine-tuned, such as the dropout rate, the number of sampled neighbors, the number of random walks, the length of input sequences, and the neighbor sampling strategy. In this paper, we perform a grid search to find the best settings of these hyperparameters and observe that the performance of many baselines improves significantly when certain hyperparameters are set properly. Take TGAT for transductive dynamic link prediction with the random negative sampling strategy as an example (see Table 1). Compared with the performance in [44], the AP results are significantly improved on datasets like MOOC (from 0.61 to 0.86), LastFM (from 0.50 to 0.73), Enron (from 0.59 to 0.71), Social Evo. (from 0.76 to 0.93), Flights (from 0.89 to 0.94), and Contact (from 0.58 to 0.96). Similar improvements can also be found for JODIE, DyRep, and TGN on several datasets.
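For concreteness, the sketch below shows the kind of grid search we mean: enumerate candidate settings and keep the one with the best validation score. The search-space values and the train_and_validate callback are illustrative placeholders, not the exact ranges of Table 7.

```python
from itertools import product

# A hedged sketch of hyperparameter grid search: try every combination in the
# search space and keep the configuration with the best validation score.
def grid_search(train_and_validate, search_space: dict):
    best_score, best_config = float("-inf"), None
    keys = list(search_space)
    for values in product(*(search_space[k] for k in keys)):
        config = dict(zip(keys, values))
        score = train_and_validate(config)  # e.g., validation AP
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score


# Example search space in the spirit of Table 7 (values illustrative only).
space = {"dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
         "num_neighbors": [10, 20, 30],
         "sampling": ["uniform", "recent"]}
best_config, best_score = grid_search(lambda cfg: -cfg["dropout"], space)  # dummy scorer
```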
Usage of Problematic Implementations. Some previous studies utilize problematic implementations in their experiments (illustrated in Section B.3), and the reported results may not be rigorous. Therefore, some methods would obtain worse results after we fix the problems in their implementations. Take CAWN for transductive dynamic link prediction with the random negative sampling strategy as an example (see Table 1). Compared with the results in [44], the performance on the AP metric of CAWN drops sharply on datasets like LastFM (from 0.98 to 0.87), Can. Parl. (from 0.94 to 0.70), US Legis. (from 0.97 to 0.71), UN Trade (from 0.97 to 0.65), and UN Vote (from 0.82 to 0.53). This is because [44] uses the problematic implementations of CAWN to conduct experiments and the results will sometimes become worse after fixing the issues.
Adaptations for Evaluation. There also exist some differences between GraphMixer's results in the original paper and in our work because we modify its implementation to fit our evaluations. The original GraphMixer can only be evaluated for transductive dynamic link prediction. In this work, we remove the one-hot encoding of nodes in GraphMixer to adapt it to the inductive setting, which may lead to decreased performance in certain situations. Take the performance of GraphMixer for transductive dynamic link prediction with the random negative sampling strategy as an example (see Table 1). Compared with the results in [12], the performance of GraphMixer slightly drops on datasets like Wikipedia (from 0.9985 to 0.9725) and Reddit (from 0.9993 to 0.9731).
B.5 Configurations of Different Methods
We first present the official settings of baselines as well as the configurations of DyGFormer. These configurations remain unchanged across all the datasets.
- JODIE:
  - Dimension of time encoding: 100
  - Dimension of node memory: 172
  - Dimension of output representation: 172
  - Memory updater: vanilla recurrent neural network
- DyRep:
  - Dimension of time encoding: 100
  - Dimension of node memory: 172
  - Dimension of output representation: 172
  - Number of graph attention heads: 2
  - Number of graph convolution layers: 1
  - Memory updater: vanilla recurrent neural network
- TGAT:
  - Dimension of time encoding: 100
  - Dimension of output representation: 172
  - Number of graph attention heads: 2
  - Number of graph convolution layers: 2
- TGN:
  - Dimension of time encoding: 100
  - Dimension of node memory: 172
  - Dimension of output representation: 172
  - Number of graph attention heads: 2
  - Number of graph convolution layers: 1
  - Memory updater: gated recurrent unit [10]
- CAWN:
  - Dimension of time encoding: 100
  - Dimension of position encoding: 172
  - Dimension of output representation: 172
  - Number of attention heads for encoding walks: 8
  - Length of each walk (including the target node): 2
  - Time scaling factor: 1e-6
- TCL:
  - Dimension of time encoding: 100
  - Dimension of depth encoding: 172
  - Dimension of output representation: 172
  - Number of attention heads: 2
  - Number of Transformer layers: 2
- GraphMixer:
  - Dimension of time encoding: 100
  - Dimension of output representation: 172
  - Number of MLP-Mixer layers: 2
  - Time gap: 2000
- DyGFormer:
  - Dimension of time encoding: 100
  - Dimension of neighbor co-occurrence encoding: 50
  - Dimension of aligned encoding: 50
  - Dimension of output representation: 172
  - Number of attention heads: 2
  - Number of Transformer layers: 2
Then, we perform a grid search to find the best settings of some critical hyperparameters; the searched ranges and related methods are shown in Table 7. It is worth noting that DyGFormer directly handles nodes whose sequences are shorter than the specified length. When a sequence exceeds the specified length, we keep only the most recent interactions up to that length, as sketched below.
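The sequence-length handling just described can be illustrated as follows; this is a minimal sketch of our own, assuming each node's interactions are stored in chronological order, not DyGFormer's exact code.

```python
# Keep a sequence as-is if it is short enough; otherwise retain only the
# max_length most recent interactions (interactions assumed time-ordered).
def truncate_to_recent(interactions: list, max_length: int) -> list:
    return interactions if len(interactions) <= max_length else interactions[-max_length:]


# Usage: with max_length = 3, the three most recent interactions are retained.
history = [(1, 10.0), (2, 11.5), (3, 12.0), (4, 15.2)]  # (neighbor id, timestamp)
assert truncate_to_recent(history, 3) == [(2, 11.5), (3, 12.0), (4, 15.2)]
```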
Hyperparameters | Searched Ranges | Related Methods
---|---|---
Dropout Rate [52] | – | JODIE, DyRep, TGAT, TGN, CAWN, TCL, GraphMixer, DyGFormer
Number of Sampled Neighbors | [10, 20, 30] | DyRep, TGAT, TGN, TCL, GraphMixer
Neighbor Sampling Strategies | [uniform, recent] | DyRep, TGAT, TGN, TCL, GraphMixer
Number of Causal Anonymous Walks | [16, 32, 64, 128] | CAWN
EdgeBank Variants | – | EdgeBank
Sequence Lengths & Patch Sizes | – | DyGFormer
Finally, we show the hyperparameter settings of various methods determined by the grid search in Table 8 (dropout rates), Table 9 (numbers of sampled neighbors or causal anonymous walks, and sequence lengths & patch sizes for DyGFormer), Table 10 (neighbor sampling strategies), and Table 11 (best-performing EdgeBank variants under each negative sampling strategy).
Datasets | JODIE | DyRep | TGAT | TGN | CAWN | TCL | GraphMixer | DyGFormer |
---|---|---|---|---|---|---|---|---|
Wikipedia | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.5 | 0.1 |
Reddit | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.5 | 0.2 |
MOOC | 0.2 | 0.0 | 0.1 | 0.2 | 0.1 | 0.1 | 0.4 | 0.1 |
LastFM | 0.3 | 0.0 | 0.1 | 0.3 | 0.1 | 0.1 | 0.0 | 0.1 |
Enron | 0.1 | 0.0 | 0.2 | 0.0 | 0.1 | 0.1 | 0.5 | 0.0 |
Social Evo. | 0.1 | 0.1 | 0.1 | 0.0 | 0.1 | 0.0 | 0.3 | 0.1 |
UCI | 0.4 | 0.0 | 0.1 | 0.1 | 0.1 | 0.0 | 0.4 | 0.1 |
Flights | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 |
Can. Parl. | 0.0 | 0.0 | 0.2 | 0.3 | 0.0 | 0.2 | 0.2 | 0.1 |
US Legis. | 0.2 | 0.0 | 0.1 | 0.1 | 0.1 | 0.3 | 0.4 | 0.0 |
UN Trade | 0.4 | 0.1 | 0.1 | 0.2 | 0.1 | 0.0 | 0.1 | 0.0 |
UN Vote | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.0 | 0.0 | 0.2 |
Contact | 0.1 | 0.0 | 0.1 | 0.1 | 0.1 | 0.0 | 0.1 | 0.0 |
Datasets | DyRep | TGAT | TGN | CAWN | TCL | GraphMixer | DyGFormer |
---|---|---|---|---|---|---|---|
Wikipedia | 10 | 20 | 10 | 32 | 20 | 30 | 32 & 1 |
Reddit | 10 | 20 | 10 | 32 | 20 | 10 | 64 & 2 |
MOOC | 10 | 20 | 10 | 64 | 20 | 20 | 256 & 8 |
LastFM | 10 | 20 | 10 | 128 | 20 | 10 | 512 & 16 |
Enron | 10 | 20 | 10 | 32 | 20 | 20 | 256 & 8 |
Social Evo. | 10 | 20 | 10 | 64 | 20 | 20 | 32 & 1 |
UCI | 10 | 20 | 10 | 64 | 20 | 20 | 32 & 1 |
Flights | 10 | 20 | 10 | 64 | 20 | 20 | 256 & 8 |
Can. Parl. | 10 | 20 | 10 | 128 | 20 | 20 | 2048 & 64 |
US Legis. | 10 | 20 | 10 | 32 | 20 | 20 | 256 & 8 |
UN Trade | 10 | 20 | 10 | 64 | 20 | 20 | 256 & 8 |
UN Vote | 10 | 20 | 10 | 64 | 20 | 20 | 128 & 4 |
Contact | 10 | 20 | 10 | 64 | 20 | 20 | 32 & 1 |
Datasets | DyRep | TGAT | TGN | TCL | GraphMixer |
---|---|---|---|---|---|
Wikipedia | recent | recent | recent | recent | recent |
Reddit | recent | uniform | recent | uniform | recent |
MOOC | recent | recent | recent | recent | recent |
LastFM | recent | recent | recent | recent | recent |
Enron | recent | recent | recent | recent | recent |
Social Evo. | recent | recent | recent | recent | recent |
UCI | recent | recent | recent | recent | recent |
Flights | recent | recent | recent | recent | recent |
Can. Parl. | uniform | uniform | uniform | uniform | uniform |
US Legis. | recent | recent | recent | uniform | recent |
UN Trade | recent | uniform | recent | uniform | uniform |
UN Vote | recent | recent | uniform | uniform | uniform |
Contact | recent | recent | recent | recent | recent |
Datasets | Random | Historical | Inductive |
---|---|---|---|
Wikipedia | EdgeBank∞ | EdgeBank | EdgeBank |
Reddit | EdgeBank∞ | EdgeBank | EdgeBank |
MOOC | EdgeBank | EdgeBank | EdgeBank |
LastFM | EdgeBank | EdgeBank | EdgeBank |
Enron | EdgeBank | EdgeBank | EdgeBank |
Social Evo. | EdgeBank | EdgeBank | EdgeBank |
UCI | EdgeBank∞ | EdgeBank | EdgeBank |
Flights | EdgeBank∞ | EdgeBank | EdgeBank |
Can. Parl. | EdgeBank | EdgeBank | EdgeBank |
US Legis. | EdgeBank | EdgeBank | EdgeBank |
UN Trade | EdgeBank | EdgeBank | EdgeBank |
UN Vote | EdgeBank | EdgeBank | EdgeBank |
Contact | EdgeBank | EdgeBank | EdgeBank |
Appendix C Detailed Experimental Results
C.1 Additional Results for Transductive Dynamic Link Prediction
We show the AUC-ROC for transductive dynamic link prediction with three negative sampling strategies in Table 12.
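As a reminder of how AP and AUC-ROC are computed from link scores, the following is a minimal sketch using scikit-learn with dummy scores; DyGLib's internal evaluation code may differ in details.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Each positive (observed) link and each sampled negative link receives a predicted
# probability; AP and AUC-ROC score how well positives rank above negatives.
positive_scores = np.array([0.9, 0.8, 0.4])   # model scores for true links (dummy values)
negative_scores = np.array([0.3, 0.7, 0.1])   # model scores for sampled negatives (dummy values)

labels = np.concatenate([np.ones_like(positive_scores), np.zeros_like(negative_scores)])
scores = np.concatenate([positive_scores, negative_scores])

ap = average_precision_score(labels, scores)
auc = roc_auc_score(labels, scores)
print(f"AP = {ap:.4f}, AUC-ROC = {auc:.4f}")
```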
NSS | Datasets | JODIE | DyRep | TGAT | TGN | CAWN | EdgeBank | TCL | GraphMixer | DyGFormer |
---|---|---|---|---|---|---|---|---|---|---|
rnd | Wikipedia | 96.33 0.07 | 94.37 0.09 | 96.67 0.07 | 98.37 0.07 | 98.54 0.04 | 90.78 0.00 | 95.84 0.18 | 96.92 0.03 | 98.91 0.02 |
Reddit | 98.31 0.05 | 98.17 0.05 | 98.47 0.02 | 98.60 0.06 | 99.01 0.01 | 95.37 0.00 | 97.42 0.02 | 97.17 0.02 | 99.15 0.01 |
MOOC | 83.81 2.09 | 85.03 0.58 | 87.11 0.19 | 91.21 1.15 | 80.38 0.26 | 60.86 0.00 | 83.12 0.18 | 84.01 0.17 | 87.91 0.58 | |
LastFM | 70.49 1.66 | 71.16 1.89 | 71.59 0.18 | 78.47 2.94 | 85.92 0.10 | 83.77 0.00 | 64.06 1.16 | 73.53 0.12 | 93.05 0.10 | |
Enron | 87.96 0.52 | 84.89 3.00 | 68.89 1.10 | 88.32 0.99 | 90.45 0.14 | 87.05 0.00 | 75.74 0.72 | 84.38 0.21 | 93.33 0.13 | |
Social Evo. | 92.05 0.46 | 90.76 0.21 | 94.76 0.16 | 95.39 0.17 | 87.34 0.08 | 81.60 0.00 | 94.84 0.17 | 95.23 0.07 | 96.30 0.01 | |
UCI | 90.44 0.49 | 68.77 2.34 | 78.53 0.74 | 92.03 1.13 | 93.87 0.08 | 77.30 0.00 | 87.82 1.36 | 91.81 0.67 | 94.49 0.26 | |
Flights | 96.21 1.42 | 95.95 0.62 | 94.13 0.17 | 98.22 0.13 | 98.45 0.01 | 90.23 0.00 | 91.21 0.02 | 91.13 0.01 | 98.93 0.01 | |
Can. Parl. | 78.21 0.23 | 73.35 3.67 | 75.69 0.78 | 76.99 1.80 | 75.70 3.27 | 64.14 0.00 | 72.46 3.23 | 83.17 0.53 | 97.76 0.41 | |
US Legis. | 82.85 1.07 | 82.28 0.32 | 75.84 1.99 | 83.34 0.43 | 77.16 0.39 | 62.57 0.00 | 76.27 0.63 | 76.96 0.79 | 77.90 0.58 | |
UN Trade | 69.62 0.44 | 67.44 0.83 | 64.01 0.12 | 69.10 1.67 | 68.54 0.18 | 66.75 0.00 | 64.72 0.05 | 65.52 0.51 | 70.20 1.44 | |
UN Vote | 68.53 0.95 | 67.18 1.04 | 52.83 1.12 | 69.71 2.65 | 53.09 0.22 | 62.97 0.00 | 51.88 0.36 | 52.46 0.27 | 57.12 0.62 | |
Contact | 96.66 0.89 | 96.48 0.14 | 96.95 0.08 | 97.54 0.35 | 89.99 0.34 | 94.34 0.00 | 94.15 0.09 | 93.94 0.02 | 98.53 0.01 | |
Avg. Rank | 4.38 | 5.77 | 6.00 | 2.54 | 4.38 | 7.31 | 7.23 | 5.77 | 1.62 | |
hist | Wikipedia | 80.77 0.73 | 77.74 0.33 | 82.87 0.22 | 82.74 0.32 | 67.84 0.64 | 77.27 0.00 | 85.76 0.46 | 87.68 0.17 | 78.80 1.95 |
Reddit | 80.52 0.32 | 80.15 0.18 | 79.33 0.16 | 81.11 0.19 | 80.27 0.30 | 78.58 0.00 | 76.49 0.16 | 77.80 0.12 | 80.54 0.29 |
MOOC | 82.75 0.83 | 81.06 0.94 | 80.81 0.67 | 88.00 1.80 | 71.57 1.07 | 61.90 0.00 | 72.09 0.56 | 76.68 1.40 | 87.04 0.35 | |
LastFM | 75.22 2.36 | 74.65 1.98 | 64.27 0.26 | 77.97 3.04 | 67.88 0.24 | 78.09 0.00 | 47.24 3.13 | 64.21 0.73 | 78.78 0.35 | |
Enron | 75.39 2.37 | 74.69 3.55 | 61.85 1.43 | 77.09 2.22 | 65.10 0.34 | 79.59 0.00 | 67.95 0.88 | 75.27 1.14 | 76.55 0.52 | |
Social Evo. | 90.06 3.15 | 93.12 0.34 | 93.08 0.59 | 94.71 0.53 | 87.43 0.15 | 85.81 0.00 | 93.44 0.68 | 94.39 0.31 | 97.28 0.07 | |
UCI | 78.64 3.50 | 57.91 3.12 | 58.89 1.57 | 77.25 2.68 | 57.86 0.15 | 69.56 0.00 | 72.25 3.46 | 77.54 2.02 | 76.97 0.24 | |
Flights | 68.97 1.87 | 69.43 0.90 | 72.20 0.16 | 68.39 0.95 | 66.11 0.71 | 74.64 0.00 | 70.57 0.18 | 70.37 0.23 | 68.09 0.43 | |
Can. Parl. | 62.44 1.11 | 70.16 1.70 | 70.86 0.94 | 73.23 3.08 | 72.06 3.94 | 63.04 0.00 | 69.95 3.70 | 79.03 1.01 | 97.61 0.40 | |
US Legis. | 67.47 6.40 | 91.44 1.18 | 73.47 5.25 | 83.53 4.53 | 78.62 7.46 | 67.41 0.00 | 83.97 3.71 | 85.17 0.70 | 90.77 1.96 | |
UN Trade | 68.92 1.40 | 64.36 1.40 | 60.37 0.68 | 63.93 5.41 | 63.09 0.74 | 86.61 0.00 | 61.43 1.04 | 63.20 1.54 | 73.86 1.13 | |
UN Vote | 76.84 1.01 | 74.72 1.43 | 53.95 3.15 | 73.40 5.20 | 51.27 0.33 | 89.62 0.00 | 52.29 2.39 | 52.61 1.44 | 64.27 1.78 | |
Contact | 96.35 0.92 | 96.00 0.23 | 95.39 0.43 | 93.76 1.29 | 83.06 0.32 | 92.17 0.00 | 93.34 0.19 | 93.14 0.34 | 97.17 0.05 | |
Avg. Rank | 4.38 | 4.77 | 5.85 | 3.46 | 7.38 | 5.38 | 6.08 | 4.77 | 2.92 | |
ind | Wikipedia | 70.96 0.78 | 67.36 0.96 | 81.93 0.22 | 80.97 0.31 | 70.95 0.95 | 81.73 0.00 | 82.19 0.48 | 84.28 0.30 | 75.09 3.70 |
Reddit | 83.51 0.15 | 82.90 0.31 | 87.13 0.20 | 84.56 0.24 | 88.04 0.29 | 85.93 0.00 | 84.67 0.29 | 82.21 0.13 | 86.23 0.51 |
MOOC | 66.63 2.30 | 63.26 1.01 | 73.18 0.33 | 77.44 2.86 | 70.32 1.43 | 48.18 0.00 | 70.36 0.37 | 72.45 0.72 | 80.76 0.76 | |
LastFM | 61.32 3.49 | 62.15 2.12 | 63.99 0.21 | 65.46 4.27 | 67.92 0.44 | 77.37 0.00 | 46.93 2.59 | 60.22 0.32 | 69.25 0.36 | |
Enron | 70.92 1.05 | 68.73 1.34 | 60.45 2.12 | 71.34 2.46 | 75.17 0.50 | 75.00 0.00 | 67.64 0.86 | 71.53 0.85 | 74.07 0.64 | |
Social Evo. | 90.01 3.19 | 93.07 0.38 | 92.94 0.61 | 95.24 0.56 | 89.93 0.15 | 87.88 0.00 | 93.44 0.72 | 94.22 0.32 | 97.51 0.06 | |
UCI | 64.14 1.26 | 54.25 2.01 | 60.80 1.01 | 64.11 1.04 | 58.06 0.26 | 58.03 0.00 | 70.05 1.86 | 74.59 0.74 | 65.96 1.18 | |
Flights | 69.99 3.10 | 71.13 1.55 | 73.47 0.18 | 71.63 1.72 | 69.70 0.75 | 81.10 0.00 | 72.54 0.19 | 72.21 0.21 | 69.53 1.17 | |
Can. Parl. | 52.88 0.80 | 63.53 0.65 | 72.47 1.18 | 69.57 2.81 | 72.93 1.78 | 61.41 0.00 | 69.47 2.12 | 70.52 0.94 | 96.70 0.59 | |
US Legis. | 59.05 5.52 | 89.44 0.71 | 71.62 5.42 | 78.12 4.46 | 76.45 7.02 | 68.66 0.00 | 82.54 3.91 | 84.22 0.91 | 87.96 1.80 | |
UN Trade | 66.82 1.27 | 65.60 1.28 | 66.13 0.78 | 66.37 5.39 | 71.73 0.74 | 74.20 0.00 | 67.80 1.21 | 66.53 1.22 | 62.56 1.51 | |
UN Vote | 73.73 1.61 | 72.80 2.16 | 53.04 2.58 | 72.69 3.72 | 52.75 0.90 | 72.85 0.00 | 52.02 1.64 | 51.89 0.74 | 53.37 1.26 | |
Contact | 94.47 1.08 | 94.23 0.18 | 94.10 0.41 | 91.64 1.72 | 87.68 0.24 | 85.87 0.00 | 91.23 0.19 | 90.96 0.27 | 95.01 0.15 | |
Avg. Rank | 5.92 | 6.15 | 4.85 | 4.54 | 5.15 | 5.08 | 5.00 | 4.77 | 3.54 |
C.2 Additional Results for Inductive Dynamic Link Prediction
We present the AP and AUC-ROC for inductive dynamic link prediction with three negative sampling strategies in Table 13 and Table 14.
NSS | Datasets | JODIE | DyRep | TGAT | TGN | CAWN | TCL | GraphMixer | DyGFormer |
---|---|---|---|---|---|---|---|---|---|
rnd | Wikipedia | 94.82 0.20 | 92.43 0.37 | 96.22 0.07 | 97.83 0.04 | 98.24 0.03 | 96.22 0.17 | 96.65 0.02 | 98.59 0.03 |
Reddit | 96.50 0.13 | 96.09 0.11 | 97.09 0.04 | 97.50 0.07 | 98.62 0.01 | 94.09 0.07 | 95.26 0.02 | 98.84 0.02 |
MOOC | 79.63 1.92 | 81.07 0.44 | 85.50 0.19 | 89.04 1.17 | 81.42 0.24 | 80.60 0.22 | 81.41 0.21 | 86.96 0.43 | |
LastFM | 81.61 3.82 | 83.02 1.48 | 78.63 0.31 | 81.45 4.29 | 89.42 0.07 | 73.53 1.66 | 82.11 0.42 | 94.23 0.09 | |
Enron | 80.72 1.39 | 74.55 3.95 | 67.05 1.51 | 77.94 1.02 | 86.35 0.51 | 76.14 0.79 | 75.88 0.48 | 89.76 0.34 | |
Social Evo. | 91.96 0.48 | 90.04 0.47 | 91.41 0.16 | 90.77 0.86 | 79.94 0.18 | 91.55 0.09 | 91.86 0.06 | 93.14 0.04 | |
UCI | 79.86 1.48 | 57.48 1.87 | 79.54 0.48 | 88.12 2.05 | 92.73 0.06 | 87.36 2.03 | 91.19 0.42 | 94.54 0.12 | |
Flights | 94.74 0.37 | 92.88 0.73 | 88.73 0.33 | 95.03 0.60 | 97.06 0.02 | 83.41 0.07 | 83.03 0.05 | 97.79 0.02 | |
Can. Parl. | 53.92 0.94 | 54.02 0.76 | 55.18 0.79 | 54.10 0.93 | 55.80 0.69 | 54.30 0.66 | 55.91 0.82 | 87.74 0.71 | |
US Legis. | 54.93 2.29 | 57.28 0.71 | 51.00 3.11 | 58.63 0.37 | 53.17 1.20 | 52.59 0.97 | 50.71 0.76 | 54.28 2.87 | |
UN Trade | 59.65 0.77 | 57.02 0.69 | 61.03 0.18 | 58.31 3.15 | 65.24 0.21 | 62.21 0.12 | 62.17 0.31 | 64.55 0.62 | |
UN Vote | 56.64 0.96 | 54.62 2.22 | 52.24 1.46 | 58.85 2.51 | 49.94 0.45 | 51.60 0.97 | 50.68 0.44 | 55.93 0.39 | |
Contact | 94.34 1.45 | 92.18 0.41 | 95.87 0.11 | 93.82 0.99 | 89.55 0.30 | 91.11 0.12 | 90.59 0.05 | 98.03 0.02 | |
Avg. Rank | 4.69 | 5.77 | 5.23 | 3.77 | 3.77 | 5.77 | 5.23 | 1.54 | |
hist | Wikipedia | 68.69 0.39 | 62.18 1.27 | 84.17 0.22 | 81.76 0.32 | 67.27 1.63 | 82.20 2.18 | 87.60 0.30 | 71.42 4.43 |
Reddit | 62.34 0.54 | 61.60 0.72 | 63.47 0.36 | 64.85 0.85 | 63.67 0.41 | 60.83 0.25 | 64.50 0.26 | 65.37 0.60 |
MOOC | 63.22 1.55 | 62.93 1.24 | 76.73 0.29 | 77.07 3.41 | 74.68 0.68 | 74.27 0.53 | 74.00 0.97 | 80.82 0.30 | |
LastFM | 70.39 4.31 | 71.45 1.76 | 76.27 0.25 | 66.65 6.11 | 71.33 0.47 | 65.78 0.65 | 76.42 0.22 | 76.35 0.52 | |
Enron | 65.86 3.71 | 62.08 2.27 | 61.40 1.31 | 62.91 1.16 | 60.70 0.36 | 67.11 0.62 | 72.37 1.37 | 67.07 0.62 | |
Social Evo. | 88.51 0.87 | 88.72 1.10 | 93.97 0.54 | 90.66 1.62 | 79.83 0.38 | 94.10 0.31 | 94.01 0.47 | 96.82 0.16 | |
UCI | 63.11 2.27 | 52.47 2.06 | 70.52 0.93 | 70.78 0.78 | 64.54 0.47 | 76.71 1.00 | 81.66 0.49 | 72.13 1.87 | |
Flights | 61.01 1.65 | 62.83 1.31 | 64.72 0.36 | 59.31 1.43 | 56.82 0.57 | 64.50 0.25 | 65.28 0.24 | 57.11 0.21 | |
Can. Parl. | 52.60 0.88 | 52.28 0.31 | 56.72 0.47 | 54.42 0.77 | 57.14 0.07 | 55.71 0.74 | 55.84 0.73 | 87.40 0.85 | |
US Legis. | 52.94 2.11 | 62.10 1.41 | 51.83 3.95 | 61.18 1.10 | 55.56 1.71 | 53.87 1.41 | 52.03 1.02 | 56.31 3.46 | |
UN Trade | 55.46 1.19 | 55.49 0.84 | 55.28 0.71 | 52.80 3.19 | 55.00 0.38 | 55.76 1.03 | 54.94 0.97 | 53.20 1.07 | |
UN Vote | 61.04 1.30 | 60.22 1.78 | 53.05 3.10 | 63.74 3.00 | 47.98 0.84 | 54.19 2.17 | 48.09 0.43 | 52.63 1.26 | |
Contact | 90.42 2.34 | 89.22 0.66 | 94.15 0.45 | 88.13 1.50 | 74.20 0.80 | 90.44 0.17 | 89.91 0.36 | 93.56 0.52 | |
Avg. Rank | 5.38 | 5.46 | 4.00 | 4.54 | 5.92 | 3.92 | 3.54 | 3.23 | |
ind | Wikipedia | 68.70 0.39 | 62.19 1.28 | 84.17 0.22 | 81.77 0.32 | 67.24 1.63 | 82.20 2.18 | 87.60 0.29 | 71.42 4.43 |
Reddit | 62.32 0.54 | 61.58 0.72 | 63.40 0.36 | 64.84 0.84 | 63.65 0.41 | 60.81 0.26 | 64.49 0.25 | 65.35 0.60 |
MOOC | 63.22 1.55 | 62.92 1.24 | 76.72 0.30 | 77.07 3.40 | 74.69 0.68 | 74.28 0.53 | 73.99 0.97 | 80.82 0.30 | |
LastFM | 70.39 4.31 | 71.45 1.75 | 76.28 0.25 | 69.46 4.65 | 71.33 0.47 | 65.78 0.65 | 76.42 0.22 | 76.35 0.52 | |
Enron | 65.86 3.71 | 62.08 2.27 | 61.40 1.30 | 62.90 1.16 | 60.72 0.36 | 67.11 0.62 | 72.37 1.38 | 67.07 0.62 | |
Social Evo. | 88.51 0.87 | 88.72 1.10 | 93.97 0.54 | 90.65 1.62 | 79.83 0.39 | 94.10 0.32 | 94.01 0.47 | 96.82 0.17 | |
UCI | 63.16 2.27 | 52.47 2.09 | 70.49 0.93 | 70.73 0.79 | 64.54 0.47 | 76.65 0.99 | 81.64 0.49 | 72.13 1.86 | |
Flights | 61.01 1.66 | 62.83 1.31 | 64.72 0.37 | 59.32 1.45 | 56.82 0.56 | 64.50 0.25 | 65.29 0.24 | 57.11 0.20 | |
Can. Parl. | 52.58 0.86 | 52.24 0.28 | 56.46 0.50 | 54.18 0.73 | 57.06 0.08 | 55.46 0.69 | 55.76 0.65 | 87.22 0.82 | |
US Legis. | 52.94 2.11 | 62.10 1.41 | 51.83 3.95 | 61.18 1.10 | 55.56 1.71 | 53.87 1.41 | 52.03 1.02 | 56.31 3.46 | |
UN Trade | 55.43 1.20 | 55.42 0.87 | 55.58 0.68 | 52.80 3.24 | 54.97 0.38 | 55.66 0.98 | 54.88 1.01 | 52.56 1.70 | |
UN Vote | 61.17 1.33 | 60.29 1.79 | 53.08 3.10 | 63.71 2.97 | 48.01 0.82 | 54.13 2.16 | 48.10 0.40 | 52.61 1.25 | |
Contact | 90.43 2.33 | 89.22 0.65 | 94.14 0.45 | 88.12 1.50 | 74.19 0.81 | 90.43 0.17 | 89.91 0.36 | 93.55 0.52 | |
Avg. Rank | 5.31 | 5.54 | 3.85 | 4.38 | 5.85 | 3.92 | 3.46 | 3.31 |
NSS | Datasets | JODIE | DyRep | TGAT | TGN | CAWN | TCL | GraphMixer | DyGFormer |
---|---|---|---|---|---|---|---|---|---|
rnd | Wikipedia | 94.33 0.27 | 91.49 0.45 | 95.90 0.09 | 97.72 0.03 | 98.03 0.04 | 95.57 0.20 | 96.30 0.04 | 98.48 0.03 |
Reddit | 96.52 0.13 | 96.05 0.12 | 96.98 0.04 | 97.39 0.07 | 98.42 0.02 | 93.80 0.07 | 94.97 0.05 | 98.71 0.01 |
MOOC | 83.16 1.30 | 84.03 0.49 | 86.84 0.17 | 91.24 0.99 | 81.86 0.25 | 81.43 0.19 | 82.77 0.24 | 87.62 0.51 | |
LastFM | 81.13 3.39 | 82.24 1.51 | 76.99 0.29 | 82.61 3.15 | 87.82 0.12 | 70.84 0.85 | 80.37 0.18 | 94.08 0.08 | |
Enron | 81.96 1.34 | 76.34 4.20 | 64.63 1.74 | 78.83 1.11 | 87.02 0.50 | 72.33 0.99 | 76.51 0.71 | 90.69 0.26 | |
Social Evo. | 93.70 0.29 | 91.18 0.49 | 93.41 0.19 | 93.43 0.59 | 84.73 0.27 | 93.71 0.18 | 94.09 0.07 | 95.29 0.03 | |
UCI | 78.80 0.94 | 58.08 1.81 | 77.64 0.38 | 86.68 2.29 | 90.40 0.11 | 84.49 1.82 | 89.30 0.57 | 92.63 0.13 | |
Flights | 95.21 0.32 | 93.56 0.70 | 88.64 0.35 | 95.92 0.43 | 96.86 0.02 | 82.48 0.01 | 82.27 0.06 | 97.80 0.02 | |
Can. Parl. | 53.81 1.14 | 55.27 0.49 | 56.51 0.75 | 55.86 0.75 | 58.83 1.13 | 55.83 1.07 | 58.32 1.08 | 89.33 0.48 | |
US Legis. | 58.12 2.35 | 61.07 0.56 | 48.27 3.50 | 62.38 0.48 | 51.49 1.13 | 50.43 1.48 | 47.20 0.89 | 53.21 3.04 | |
UN Trade | 62.28 0.50 | 58.82 0.98 | 62.72 0.12 | 59.99 3.50 | 67.05 0.21 | 63.76 0.07 | 63.48 0.37 | 67.25 1.05 | |
UN Vote | 58.13 1.43 | 55.13 3.46 | 51.83 1.35 | 61.23 2.71 | 48.34 0.76 | 50.51 1.05 | 50.04 0.86 | 56.73 0.69 | |
Contact | 95.37 0.92 | 91.89 0.38 | 96.53 0.10 | 94.84 0.75 | 89.07 0.34 | 93.05 0.09 | 92.83 0.05 | 98.30 0.02 | |
Avg. Rank | 4.69 | 5.85 | 5.31 | 3.38 | 4.00 | 6.00 | 5.31 | 1.46 | |
hist | Wikipedia | 61.86 0.53 | 57.54 1.09 | 78.38 0.20 | 75.75 0.29 | 62.04 0.65 | 79.79 0.96 | 82.87 0.21 | 68.33 2.82 |
Reddit | 61.69 0.39 | 60.45 0.37 | 64.43 0.27 | 64.55 0.50 | 64.94 0.21 | 61.43 0.26 | 64.27 0.13 | 64.81 0.25 |
MOOC | 64.48 1.64 | 64.23 1.29 | 74.08 0.27 | 77.69 3.55 | 71.68 0.94 | 69.82 0.32 | 72.53 0.84 | 80.77 0.63 | |
LastFM | 68.44 3.26 | 68.79 1.08 | 69.89 0.28 | 66.99 5.62 | 67.69 0.24 | 55.88 1.85 | 70.07 0.20 | 70.73 0.37 | |
Enron | 65.32 3.57 | 61.50 2.50 | 57.84 2.18 | 62.68 1.09 | 62.25 0.40 | 64.06 1.02 | 68.20 1.62 | 65.78 0.42 | |
Social Evo. | 88.53 0.55 | 87.93 1.05 | 91.87 0.72 | 92.10 1.22 | 83.54 0.24 | 93.28 0.60 | 93.62 0.35 | 96.91 0.09 | |
UCI | 60.24 1.94 | 51.25 2.37 | 62.32 1.18 | 62.69 0.90 | 56.39 0.10 | 70.46 1.94 | 75.98 0.84 | 65.55 1.01 | |
Flights | 60.72 1.29 | 61.99 1.39 | 63.38 0.26 | 59.66 1.04 | 56.58 0.44 | 63.48 0.23 | 63.30 0.19 | 56.05 0.21 | |
Can. Parl. | 51.62 1.00 | 52.38 0.46 | 58.30 0.61 | 55.64 0.54 | 60.11 0.48 | 57.30 1.03 | 56.68 1.20 | 88.68 0.74 | |
US Legis. | 58.12 2.94 | 67.94 0.98 | 49.99 4.88 | 64.87 1.65 | 54.41 1.31 | 52.12 2.13 | 49.28 0.86 | 56.57 3.22 | |
UN Trade | 58.73 1.19 | 57.90 1.33 | 59.74 0.59 | 55.61 3.54 | 60.95 0.80 | 61.12 0.97 | 59.88 1.17 | 58.46 1.65 | |
UN Vote | 65.16 1.28 | 63.98 2.12 | 51.73 4.12 | 68.59 3.11 | 48.01 1.77 | 54.66 2.11 | 45.49 0.42 | 53.85 2.02 | |
Contact | 90.80 1.18 | 88.88 0.68 | 93.76 0.41 | 88.84 1.39 | 74.79 0.37 | 90.37 0.16 | 90.04 0.29 | 94.14 0.26 | |
Avg. Rank | 5.08 | 6.00 | 4.23 | 4.54 | 5.38 | 4.00 | 3.69 | 3.08 | |
ind | Wikipedia | 61.87 0.53 | 57.54 1.09 | 78.38 0.20 | 75.76 0.29 | 62.02 0.65 | 79.79 0.96 | 82.88 0.21 | 68.33 2.82 |
Reddit | 61.69 0.39 | 60.44 0.37 | 64.39 0.27 | 64.55 0.50 | 64.91 0.21 | 61.36 0.26 | 64.27 0.13 | 64.80 0.25 |
MOOC | 64.48 1.64 | 64.22 1.29 | 74.07 0.27 | 77.68 3.55 | 71.69 0.94 | 69.83 0.32 | 72.52 0.84 | 80.77 0.63 | |
LastFM | 68.44 3.26 | 68.79 1.08 | 69.89 0.28 | 66.99 5.61 | 67.68 0.24 | 55.88 1.85 | 70.07 0.20 | 70.73 0.37 | |
Enron | 65.32 3.57 | 61.50 2.50 | 57.83 2.18 | 62.68 1.09 | 62.27 0.40 | 64.05 1.02 | 68.19 1.63 | 65.79 0.42 | |
Social Evo. | 88.53 0.55 | 87.93 1.05 | 91.88 0.72 | 92.10 1.22 | 83.54 0.24 | 93.28 0.60 | 93.62 0.35 | 96.91 0.09 | |
UCI | 60.27 1.94 | 51.26 2.40 | 62.29 1.17 | 62.66 0.91 | 56.39 0.11 | 70.42 1.93 | 75.97 0.85 | 65.58 1.00 | |
Flights | 60.72 1.29 | 61.99 1.39 | 63.40 0.26 | 59.66 1.05 | 56.58 0.44 | 63.49 0.23 | 63.32 0.19 | 56.05 0.22 | |
Can. Parl. | 51.61 0.98 | 52.35 0.52 | 58.15 0.62 | 55.43 0.42 | 60.01 0.47 | 56.88 0.93 | 56.63 1.09 | 88.51 0.73 | |
US Legis. | 58.12 2.94 | 67.94 0.98 | 49.99 4.88 | 64.87 1.65 | 54.41 1.31 | 52.12 2.13 | 49.28 0.86 | 56.57 3.22 | |
UN Trade | 58.71 1.20 | 57.87 1.36 | 59.98 0.59 | 55.62 3.59 | 60.88 0.79 | 61.01 0.93 | 59.71 1.17 | 57.28 3.06 | |
UN Vote | 65.29 1.30 | 64.10 2.10 | 51.78 4.14 | 68.58 3.08 | 48.04 1.76 | 54.65 2.20 | 45.57 0.41 | 53.87 2.01 | |
Contact | 90.80 1.18 | 88.87 0.67 | 93.76 0.40 | 88.85 1.39 | 74.79 0.38 | 90.37 0.16 | 90.04 0.29 | 94.14 0.26 | |
Avg. Rank | 5.08 | 5.92 | 4.15 | 4.54 | 5.38 | 4.00 | 3.77 | 3.15 |
C.3 Results on Dynamic Node Classification
Table 15 shows the results of various methods for dynamic node classification on Wikipedia and Reddit (the only two datasets with dynamic labels).
Methods | Wikipedia | Reddit | Avg. Rank |
---|---|---|---|
JODIE | 88.99 1.05 | 60.37 2.58 | 4.50 |
DyRep | 86.39 0.98 | 63.72 1.32 | 5.00 |
TGAT | 84.09 1.27 | 70.04 1.09 | 4.00 |
TGN | 86.38 2.34 | 63.27 0.90 | 6.00 |
CAWN | 84.88 1.33 | 66.34 1.78 | 5.00 |
TCL | 77.83 2.13 | 68.87 2.15 | 5.00 |
GraphMixer | 86.80 0.79 | 64.22 3.32 | 4.00 |
DyGFormer | 87.44 1.08 | 68.00 1.74 | 2.50 |
C.4 Complete Results of TCL and GraphMixer with Neighbor Co-occurrence Encoding
We show the complete results of TCL and GraphMixer with our neighbor co-occurrence encoding scheme in Table 16.
Datasets | TCL Original | TCL w/ NCoE | TCL Improv. | GraphMixer Original | GraphMixer w/ NCoE | GraphMixer Improv. | DyGFormer
---|---|---|---|---|---|---|---
Wikipedia | 96.47 | 99.09 | 2.72% | 97.25 | 97.90 | 0.67% | 99.03 |
Reddit | 97.53 | 99.04 | 1.55% | 97.31 | 97.63 | 0.33% | 99.22 |
MOOC | 82.38 | 86.92 | 5.51% | 82.78 | 83.58 | 0.97% | 87.52 |
LastFM | 67.27 | 84.02 | 24.90% | 75.61 | 76.48 | 1.15% | 93.00 |
Enron | 79.70 | 90.18 | 13.15% | 82.25 | 88.83 | 8.00% | 92.47 |
Social Evo. | 93.13 | 94.06 | 1.00% | 93.37 | 94.37 | 1.07% | 94.73 |
UCI | 89.57 | 94.69 | 5.72% | 93.25 | 93.48 | 0.25% | 95.79 |
Flights | 91.23 | 97.71 | 7.10% | 90.99 | 96.90 | 6.50% | 98.91 |
Can. Parl. | 68.67 | 69.34 | 0.98% | 77.04 | 76.38 | -0.86% | 97.36 |
US Legis. | 69.59 | 69.47 | -0.17% | 70.74 | 70.26 | -0.68% | 71.11 |
UN Trade | 62.21 | 63.46 | 2.01% | 62.61 | 62.77 | 0.26% | 66.46 |
UN Vote | 51.90 | 51.52 | -0.73% | 52.11 | 52.13 | 0.04% | 55.55 |
Contact | 92.44 | 97.98 | 5.99% | 91.92 | 97.94 | 6.55% | 98.29 |
Avg. Rank | 4.62 | 2.62 | — | 3.85 | 2.85 | — | 1.08 |
C.5 Results of Training Time and Memory Usage Comparisons
Table 17 shows the comparisons of running time and memory usage of DyGFormer with and without the patching technique during the training process.
Datasets | Sequence Length | Metrics | DyGFormer w/ Patching | DyGFormer w/o Patching | Ratio (w/o ÷ w/)
---|---|---|---|---|---
LastFM | 64 | running time | 13min 58s | 21min 01s | 1.50
| | memory usage | 3,945 MB | 7,953 MB | 2.02
| 128 | running time | 16min 54s | 45min 29s | 2.69
| | memory usage | 4,677 MB | 7,585 MB | 1.62
| 256 | running time | 24min 41s | 2h 5min 50s | 5.10
| | memory usage | 4,635 MB | 14,583 MB | 3.15
| 512 | running time | 37min 04s | — | —
| | memory usage | 7,547 MB | OOM | —
Can. Parl. | 64 | running time | 58s | 1min 14s | 1.28
| | memory usage | 2,121 MB | 3,263 MB | 1.54
| 128 | running time | 1min 02s | 2min 39s | 2.56
| | memory usage | 2,369 MB | 6,417 MB | 2.71
| 256 | running time | 1min 26s | 6min 37s | 4.62
| | memory usage | 2,855 MB | 14,923 MB | 5.23
| 512 | running time | 1min 57s | — | —
| | memory usage | 4,511 MB | OOM | —
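The patching technique compared in Table 17 can be sketched as follows: assuming a time-ordered sequence of per-interaction encodings, every group of consecutive interactions is flattened into one Transformer token, so the attention cost depends on the number of patches rather than the full sequence length. This is our own minimal illustration, not DyGFormer's exact implementation.

```python
import torch

# Split a length-L sequence of d-dimensional encodings into L / P patches of P
# consecutive interactions, flattening each patch into a single token.
def patch_sequence(sequence: torch.Tensor, patch_size: int) -> torch.Tensor:
    seq_len, dim = sequence.shape
    assert seq_len % patch_size == 0, "sequence length must be divisible by patch size"
    num_patches = seq_len // patch_size
    # (L, d) -> (L / P, P * d): each row is one flattened patch / Transformer token
    return sequence.reshape(num_patches, patch_size * dim)


# Usage: a length-256 sequence with patch size 8 becomes 32 tokens of dimension 8 * d.
encodings = torch.randn(256, 172)
tokens = patch_sequence(encodings, patch_size=8)
assert tokens.shape == (32, 8 * 172)
```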
C.6 Results and Discussions of Ablation Study
We validate the effectiveness of several designs in DyGFormer via an ablation study, including the usage of Neighbor Co-occurrence Encoding (NCoE), the usage of Time Encoding (TE), and the Mixing of the sequence of Source node and Destination node (MixSD). We respectively remove these modules and denote the remaining parts as w/o NCoE, w/o TE, and w/o MixSD. We also Separately encode the Neighbor Occurrence in the source node’s or destination node’s sequence and denote this variant as w/ SepNO. We report the performance of different variants on MOOC, Social Evo., UCI, and UN Trade datasets from four domains in Figure 4.

Our findings from Figure 4 indicate that DyGFormer achieves the best performance when all components are utilized, and removing any of them leads to worse results. In particular, the neighbor co-occurrence encoding scheme has the most significant impact on performance as it effectively captures correlations between nodes. Encoding neighbor occurrences (even when done separately for the two sequences) and encoding temporal information also improve performance. Mixing the sequences of the source node and destination node brings relatively minor improvements because the neighbor co-occurrence encoding scheme already explores node correlations, which is also the goal of the mixing operation. These results demonstrate the necessity of each designed component.
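To make the neighbor co-occurrence idea concrete, the sketch below computes, for every neighbor in the source or destination sequence, its appearance counts in both sequences, so neighbors shared by the two nodes stand out. This is one plausible form of the encoding for illustration only; DyGFormer's actual feature construction may differ.

```python
from collections import Counter

# For each neighbor in a node's historical sequence, record how often it appears in
# the source sequence and in the destination sequence (appearance-count features).
def cooccurrence_features(src_neighbors: list, dst_neighbors: list):
    src_counts, dst_counts = Counter(src_neighbors), Counter(dst_neighbors)
    src_feats = [(src_counts[n], dst_counts[n]) for n in src_neighbors]
    dst_feats = [(src_counts[n], dst_counts[n]) for n in dst_neighbors]
    return src_feats, dst_feats


# Usage: neighbor 2 appears in both histories, so it gets non-zero counts on both sides.
src_hist, dst_hist = [1, 2, 2, 3], [2, 4]
print(cooccurrence_features(src_hist, dst_hist))
# ([(1, 0), (2, 1), (2, 1), (1, 0)], [(2, 1), (0, 1)])
```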
C.7 Results on Temporal Graph Benchmark
We also evaluate DyGFormer and baselines on the recently proposed Temporal Graph Benchmark (TGB) [23], which contains a collection of challenging and diverse benchmark datasets. We aim to verify the superiority and scalability of DyGFormer on TGB since it covers small, medium, and large datasets. We only present the analysis here. Please refer to [23] and our other work [68] for details of the datasets, tasks, evaluation metrics, experimental settings, and complete quantitative results.
Superiority of DyGFormer. For the dynamic link property prediction task, DyGFormer ranks first on the TGB leaderboard (https://tgb.complexdatalab.com/docs/leader_linkprop/) on tgbl-wiki-v2. It ranks second/third on tgbl-coin-v2/tgbl-comment and performs moderately on tgbl-review-v2. We notice that the surprise indices (defined as the ratio of test links that are not seen during training) on tgbl-wiki-v2 and tgbl-coin-v2 are 0.108 and 0.120, which indicates that nodes tend to interact repeatedly. Such a property can be well handled by the neighbor co-occurrence encoding scheme in DyGFormer, leading to excellent performance. Correspondingly, the surprise indices on tgbl-review-v2 and tgbl-comment are 0.987 and 0.823, showing that these two datasets contain many new links. This characteristic may violate the motivation of our neighbor co-occurrence encoding scheme and bring suboptimal performance. For the dynamic node property prediction task, DyGFormer also performs better than the other methods with trainable parameters, but it is still worse than the simple non-trainable Persistent Forecast and Moving Average methods on the TGB leaderboard (https://tgb.complexdatalab.com/docs/leader_nodeprop/). This observation demonstrates the necessity of designing customized models for the dynamic node property prediction task.
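The surprise index used above can be computed as in the following minimal sketch (links treated as source-destination pairs; the function name is our own).

```python
# Surprise index: the fraction of test links whose (source, destination) pair
# never appears in the training split.
def surprise_index(train_links, test_links) -> float:
    seen = {(src, dst) for src, dst in train_links}
    unseen = sum(1 for link in test_links if link not in seen)
    return unseen / len(test_links)


# Usage: two of the three test links are new, so the surprise index is about 0.67.
train = [(0, 1), (1, 2)]
test = [(0, 1), (2, 3), (4, 5)]
print(round(surprise_index(train, test), 2))  # 0.67
```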
Scalability of DyGFormer. DyGFormer can be scaled up to larger datasets that contain hundreds of thousands of nodes and tens of millions of links (e.g., tgbl-coin-v2, tgbl-comment, tgbn-genre, and tgbn-reddit) while many baselines (e.g., JODIE, DyRep, TGN, and CAWN) fail. This demonstrates the good scalability of DyGFormer.