Towards Better Dynamic Graph Learning:
New Architecture and Unified Library
Abstract
We propose DyGFormer, a new Transformer-based architecture for dynamic graph learning. DyGFormer is conceptually simple and only needs to learn from nodes’ historical first-hop interactions by: (i) a neighbor co-occurrence encoding scheme that explores the correlations of the source node and destination node based on their historical sequences; (ii) a patching technique that divides each sequence into multiple patches and feeds them to Transformer, allowing the model to effectively and efficiently benefit from longer histories. We also introduce DyGLib, a unified library with standard training pipelines, extensible coding interfaces, and comprehensive evaluating protocols to promote reproducible, scalable, and credible dynamic graph learning research. By performing exhaustive experiments on thirteen datasets for dynamic link prediction and dynamic node classification tasks, we find that DyGFormer achieves state-of-the-art performance on most of the datasets, demonstrating its effectiveness in capturing nodes’ correlations and long-term temporal dependencies. Moreover, some results of baselines are inconsistent with previous reports, which may be caused by their diverse but less rigorous implementations, showing the importance of DyGLib. All the used resources are publicly available at https://github.com/yule-BUAA/DyGLib.
1 Introduction
Dynamic graphs denote entities as nodes and represent their interactions as links with timestamps [26], which can model many real-world scenarios such as social networks [28, 50, 2], user-item interaction systems [31, 15, 71, 72, 70], traffic networks [67, 63, 19, 4, 69], and physical systems [24, 47, 43]. In recent years, representation learning on dynamic graphs has become a trending research topic [26, 49, 65]. There are two main categories of existing methods: discrete-time [39, 17, 48, 11, 66] and continuous-time [64, 45, 60, 51, 12]. In this paper, we focus on the latter approaches because they are more flexible and effective than the former and are being increasingly investigated.
Despite the rapid development of dynamic graph learning methods, they still suffer from two limitations. Firstly, most of them independently compute the temporal representations of nodes in an interaction without exploiting nodes' correlations, which are often indicative of future interactions. Moreover, existing methods learn at the interaction level and thus only work well for nodes with fewer interactions. When nodes have longer histories, these methods require sampling strategies to truncate the interactions for feasible calculation of computationally expensive modules like graph convolutions [55, 64, 45, 36, 9, 59], temporal random walks [60, 25], and sequential models [57, 12]. Though some approaches use memory networks [61, 53] to sequentially process interactions with affordable computational costs [28, 55, 45, 36, 59, 35], they suffer from vanishing/exploding gradients due to the usage of recurrent neural networks [40, 57]. Therefore, we conclude that previous methods lack the ability to capture either nodes' correlations or long-term temporal dependencies.
Secondly, the training pipelines of different methods are inconsistent and often lead to poor reproducibility. Also, existing methods are implemented by diverse frameworks (e.g., Pytorch [41], Tensorflow [1], DGL [58], PyG [16], C++), making it time-consuming and difficult for researchers to quickly understand the algorithms and further dive into the core of dynamic graph learning. Although there exist some libraries for dynamic graph learning [18, 46, 74], they mainly focus on dynamic network embedding methods [18], discrete-time graph learning methods [46], or engineering techniques for training on large-scale dynamic graphs [74] (elaborated in Section 2). Currently, we find that there are still no standard tools for continuous-time dynamic graph learning.
In this paper, we aim to address the above drawbacks with two key technical contributions.
We propose a new Transformer-based dynamic graph learning architecture (DyGFormer). DyGFormer is conceptually simple by solely learning from the sequences of nodes' historical first-hop interactions. To be specific, DyGFormer is designed with a neighbor co-occurrence encoding scheme, which encodes the appearing frequencies of each neighbor in the sequences of the source and destination nodes to explicitly explore their correlations. In order to capture long-term temporal dependencies, DyGFormer splits each node's sequence into multiple patches and feeds them to Transformer [56]. This patching technique not only makes the model effectively benefit from longer histories via preserving local temporal proximities, but also efficiently reduces the computational complexity to a constant level that is independent of the input sequence length.
We present a unified continuous-time dynamic graph learning library (DyGLib). DyGLib is an open-source toolkit with standard training pipelines, extensible coding interfaces, and comprehensive evaluating strategies, aiming to foster standard, scalable, and reproducible dynamic graph learning research. DyGLib has integrated a variety of continuous-time dynamic graph learning methods as well as benchmark datasets from various domains. It trains all the methods via the same pipeline to eliminate the influence of different implementations and adopts a modularized design for developers to conveniently incorporate new datasets and algorithms based on their specific requirements. Moreover, DyGLib supports both dynamic link prediction and dynamic node classification tasks with exhaustive evaluating strategies to provide comprehensive comparisons of existing methods.
To evaluate the model performance, we conduct extensive experiments based on DyGLib, including dynamic link prediction under transductive and inductive settings with three negative sampling strategies as well as dynamic node classification. From the results, we observe that: (i) DyGFormer outperforms existing methods on most datasets, demonstrating its superiority in capturing nodes’ correlations and long-term temporal dependencies; (ii) some findings of baselines are not in line with previous reports because of their varied pipelines and problematic implementations, which illustrates the necessity of introducing DyGLib. We also provide an in-depth analysis of the neighbor co-occurrence encoding and patching technique for a better understanding of DyGFormer.
2 Related Work
Dynamic Graph Learning. Representation learning on dynamic graphs has been widely studied in recent years [26, 49, 65]. Discrete-time methods manually divide the dynamic graph into a sequence of snapshots and apply static graph learning methods on each snapshot, which ignore the temporal order of nodes in each snapshot [39, 17, 48, 11, 66]. In contrast, continuous-time methods directly learn on the whole dynamic graph with temporal graph neural networks [55, 64, 45, 36, 9, 59], memory networks [28, 55, 45, 36, 59, 35], temporal random walks [60, 25] or sequential models [57, 12]. Although insightful, most existing dynamic graph learning methods neglect the correlations between two nodes in an interaction. They also fail to handle nodes with longer interactions due to unaffordable computational costs of complex modules or issues in optimizing models (e.g., the vanishing/exploding gradients). In this paper, we propose a new Transformer-based architecture to show the necessity of capturing nodes’ correlations and long-term temporal dependencies, which is achieved by two designs: a neighbor co-occurrence encoding scheme and a patching technique.
Transformer-based Applications in Various Fields. Transformer [56] is an innovative model that employs the self-attention mechanism to handle sequential data, which has been successfully applied in a variety of domains, such as natural language processing [13, 33, 6], computer vision [7, 14, 34], and time series forecasting [30, 73, 62]. The idea of dividing the original data into patches as inputs of the Transformer has been attempted in some studies. ViT [14] splits an image into multiple patches and feeds the sequence of patches' linear embeddings into a Transformer, which achieves surprisingly good performance on image classification. PatchTST [38] divides a time series into subseries-level patches and processes the patches with a channel-independent Transformer for long-term multivariate time series forecasting. In this work, we propose a patching technique to learn on dynamic graphs, which provides our approach with the ability to handle nodes with longer histories.
Graph Learning Library. Currently, there exist many libraries for static graphs [5, 58, 16, 22, 8, 29, 32], but few for dynamic graph learning [18, 46, 74]. DynamicGEM [18] focuses on dynamic graph embedding methods, which just consider the graph topology and cannot leverage node features. PyTorch Geometric Temporal [46] implements discrete-time algorithms for spatiotemporal signal processing and is mainly applicable for nodes with aligned historical observations. TGL [74] trains on large-scale dynamic graphs with some engineering tricks. Though TGL has integrated some continuous-time methods, they are somewhat out-of-date, resulting in the lack of state-of-the-art models. Moreover, TGL is implemented by both PyTorch and C++, which needs additional compilation and increases the usage difficulty. In this paper, we present a unified continuous-time dynamic graph learning library with thorough baselines, diverse datasets, extensible implementations, and comprehensive evaluations to facilitate dynamic graph learning research.
3 Preliminaries
Definition 1.
Dynamic Graph. We represent a dynamic graph as a sequence of non-decreasing chronological interactions $\mathcal{G}=\{(u_1,v_1,t_1),(u_2,v_2,t_2),\cdots\}$ with $0 \leq t_1 \leq t_2 \leq \cdots$, where $u_i, v_i \in \mathcal{N}$ denote the source node and destination node of the $i$-th link at timestamp $t_i$. $\mathcal{N}$ is the set of all the nodes. Each node $u \in \mathcal{N}$ can be associated with node feature $\boldsymbol{x}_u \in \mathbb{R}^{d_N}$, and each interaction $(u,v,t)$ has link feature $\boldsymbol{e}_{u,v}^{t} \in \mathbb{R}^{d_E}$. $d_N$ and $d_E$ denote the dimensions of the node feature and link feature. If the graph is non-attributed, we simply set the node feature and link feature to zero vectors, i.e., $\boldsymbol{x}_u = \boldsymbol{0}$ and $\boldsymbol{e}_{u,v}^{t} = \boldsymbol{0}$.
Definition 2.
Problem Formalization. Given the source node $u$, destination node $v$, timestamp $t$, and all historical interactions before $t$, i.e., $\{(u',v',t')\,|\,t'<t\}$, representation learning on dynamic graph aims to design a model to learn time-aware representations $\boldsymbol{h}_u^t \in \mathbb{R}^{d}$ and $\boldsymbol{h}_v^t \in \mathbb{R}^{d}$ for $u$ and $v$ with $d$ as the dimension. We validate the effectiveness of the learned representations via two common tasks in dynamic graph learning: (i) dynamic link prediction, which predicts whether $u$ and $v$ are connected at $t$; (ii) dynamic node classification, which infers the state of $u$ or $v$ at $t$.
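To make the notation concrete, the following minimal sketch shows one way to represent the chronological interaction list and retrieve the history before a query timestamp; the variable names are purely illustrative and are not DyGLib's actual data interface.

```python
# A minimal sketch of the dynamic-graph representation in Definitions 1-2.
from collections import namedtuple

Interaction = namedtuple("Interaction", ["src", "dst", "t"])

# Chronologically ordered interactions (u_i, v_i, t_i) with non-decreasing timestamps.
graph = [
    Interaction(src=0, dst=1, t=1.0),
    Interaction(src=0, dst=2, t=2.5),
    Interaction(src=1, dst=2, t=3.0),
]

def history_before(graph, t):
    """Return all interactions that happened strictly before timestamp t."""
    return [it for it in graph if it.t < t]

print(history_before(graph, 3.0))  # the first two interactions
```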
4 New Architecture and Unified Library
4.1 DyGFormer: Transformer-based Architecture for Dynamic Graph Learning
The framework of our DyGFormer is shown in Figure 1, which employs Transformer [56] as the backbone. Given an interaction $(u, v, t)$, we first extract the historical first-hop interactions of source node $u$ and destination node $v$ before timestamp $t$ and obtain two interaction sequences $S_u^t$ and $S_v^t$. Next, in addition to computing the encodings of neighbors, links, and time intervals for each sequence, we also encode the frequencies of every neighbor's appearances in both $S_u^t$ and $S_v^t$ to exploit the correlations between $u$ and $v$, resulting in four encoding sequences for $u$/$v$ in total. Then, we divide each encoding sequence into multiple patches and feed all the patches into a Transformer for capturing long-term temporal dependencies. Finally, the outputs of the Transformer are averaged to derive time-aware representations of $u$ and $v$ at timestamp $t$ (i.e., $\boldsymbol{h}_u^t$ and $\boldsymbol{h}_v^t$), which can be applied in various downstream tasks like dynamic link prediction and dynamic node classification.

Learning from Historical First-hop Interactions. Unlike most previous methods that require nodes' historical interactions from multiple hops (e.g., DyRep [55], TGAT [64], TGN [45], CAWN [60]), we only learn from the sequences of nodes' historical first-hop interactions, turning the dynamic graph learning task into a simpler sequence learning problem. Mathematically, given an interaction $(u, v, t)$, for source node $u$ and destination node $v$, we obtain the sequences that involve the first-hop interactions of $u$ and $v$ before timestamp $t$, which are denoted by $S_u^t$ and $S_v^t$, respectively.
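As an illustration, the snippet below sketches how such a first-hop sequence could be gathered from the chronologically ordered interaction list introduced above; DyGLib's actual neighbor sampler is more efficient, and feature lookup is omitted for brevity.

```python
def first_hop_sequence(graph, node, t):
    """Collect (neighbor, timestamp) pairs of interactions involving `node` before
    timestamp t, i.e., the first-hop sequence used by DyGFormer (features omitted;
    `graph` is a chronologically ordered list of (src, dst, t) tuples)."""
    seq = []
    for src, dst, ts in graph:
        if ts >= t:
            break  # later interactions cannot appear in the history
        if src == node:
            seq.append((dst, ts))
        elif dst == node:
            seq.append((src, ts))
    return seq
```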
Encoding Neighbors, Links, and Time Intervals. For source node $u$, we retrieve the features of the involved neighbors and links in sequence $S_u^t$ based on the given features to represent their encodings, which are denoted by $\boldsymbol{X}_{u,N}^{t} \in \mathbb{R}^{|S_u^t| \times d_N}$ and $\boldsymbol{X}_{u,E}^{t} \in \mathbb{R}^{|S_u^t| \times d_E}$. Following [64], we learn the periodic temporal patterns by encoding the time interval $\Delta t' = t - t'$ via $\sqrt{1/d_T}\left[\cos\left(w_1\Delta t'\right), \sin\left(w_1\Delta t'\right), \cdots, \cos\left(w_{d_T}\Delta t'\right), \sin\left(w_{d_T}\Delta t'\right)\right]$, where $w_1, \cdots, w_{d_T}$ are trainable parameters and $d_T$ is the encoding dimension. The time interval encodings of interactions in $S_u^t$ are denoted by $\boldsymbol{X}_{u,T}^{t} \in \mathbb{R}^{|S_u^t| \times d_T}$. We use the same process to get the corresponding encodings for destination node $v$, i.e., $\boldsymbol{X}_{v,N}^{t}$, $\boldsymbol{X}_{v,E}^{t}$, and $\boldsymbol{X}_{v,T}^{t}$.
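For reference, a PyTorch sketch of this periodic time-interval encoding could look as follows; the module name is ours, and we use $d_T/2$ trainable frequencies so that the concatenated cosine/sine output has dimension exactly $d_T$.

```python
import torch
import torch.nn as nn

class TimeIntervalEncoder(nn.Module):
    """Periodic encoding of a scalar time interval with trainable frequencies,
    following the cosine/sine form described above (illustrative sketch)."""
    def __init__(self, d_T: int):
        super().__init__()
        assert d_T % 2 == 0
        self.d_T = d_T
        self.w = nn.Parameter(torch.randn(d_T // 2))  # trainable frequencies

    def forward(self, delta_t: torch.Tensor) -> torch.Tensor:
        # delta_t: [seq_len] of time intervals t - t'; output: [seq_len, d_T]
        phase = delta_t.unsqueeze(-1) * self.w                       # [seq_len, d_T / 2]
        enc = torch.cat([torch.cos(phase), torch.sin(phase)], dim=-1)
        return enc / (self.d_T ** 0.5)
```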
Neighbor Co-occurrence Encoding Scheme. Existing methods separately compute the representations of node $u$ and node $v$ without modeling their correlations. We present a neighbor co-occurrence encoding scheme to tackle this issue, which assumes that the appearing frequency of a neighbor in a sequence indicates its importance, and the occurrences of a neighbor in the sequences of both $u$ and $v$ (i.e., co-occurrence) could reflect the correlations between $u$ and $v$. That is to say, if $u$ and $v$ have more common historical neighbors in their sequences, they are more likely to interact in the future.

Formally, for each neighbor in the interaction sequences $S_u^t$ and $S_v^t$, we count its occurrences in both $S_u^t$ and $S_v^t$, and derive a two-dimensional vector. By packing the vectors of all the neighbors together, we can get the neighbor co-occurrence features for $u$ and $v$, which are represented by $\boldsymbol{C}_u^t \in \mathbb{R}^{|S_u^t| \times 2}$ and $\boldsymbol{C}_v^t \in \mathbb{R}^{|S_v^t| \times 2}$. For example, suppose the historical neighbors of $u$ and $v$ are $\{a, b, a\}$ and $\{b, c, b, a\}$. The appearing frequencies of $a$, $b$, and $c$ in $u$'s/$v$'s historical interactions are 2/1, 1/2, and 0/1, respectively. Therefore, the neighbor co-occurrence features of $u$ and $v$ are denoted by $\boldsymbol{C}_u^t = [[2,1],[1,2],[2,1]]$ and $\boldsymbol{C}_v^t = [[1,2],[0,1],[1,2],[2,1]]$. Then, we apply a function $f(\cdot)$ to encode the neighbor co-occurrence features by

$\boldsymbol{X}_{*,C}^{t}=f\left(\boldsymbol{C}_{*}^{t}[:,0]\right)+f\left(\boldsymbol{C}_{*}^{t}[:,1]\right) \in \mathbb{R}^{|S_{*}^{t}| \times d_{C}},$    (1)

where $*$ could be $u$ or $v$. The input and output dimensions of $f(\cdot)$ are 1 and $d_C$, respectively. In this paper, we implement $f(\cdot)$ by a two-layer perceptron with ReLU activation [37]. It is important to note that the neighbor co-occurrence encoding scheme is general and can be easily integrated into some dynamic graph learning methods for better results. We will demonstrate its generalizability in Section 5.3.
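The counting step itself is simple; the sketch below reproduces the example above, with an illustrative two-layer perceptron standing in for $f(\cdot)$ and $d_C=64$ chosen arbitrarily.

```python
import torch
import torch.nn as nn
from collections import Counter

def cooccurrence_features(src_neighbors, dst_neighbors):
    """For every neighbor in a sequence, build the two-dimensional vector
    [count in source sequence, count in destination sequence]."""
    src_cnt, dst_cnt = Counter(src_neighbors), Counter(dst_neighbors)
    pack = lambda seq: torch.tensor([[src_cnt[n], dst_cnt[n]] for n in seq], dtype=torch.float)
    return pack(src_neighbors), pack(dst_neighbors)

# Example from the text: neighbors of u are {a, b, a} and neighbors of v are {b, c, b, a}.
C_u, C_v = cooccurrence_features(["a", "b", "a"], ["b", "c", "b", "a"])
# C_u = [[2, 1], [1, 2], [2, 1]], C_v = [[1, 2], [0, 1], [1, 2], [2, 1]]

# f(.) is applied to each column of counts and the results are summed, as in Eq. (1).
f = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 64))
X_u_C = f(C_u[:, 0:1]) + f(C_u[:, 1:2])  # neighbor co-occurrence encodings of u, [|S_u|, d_C]
```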
Patching Technique. Instead of focusing on the interaction level, we divide the encoding sequence into multiple non-overlapping patches to break through the bottleneck of existing methods in capturing long-term temporal dependencies. Let $P$ denote the patch size. Each patch is composed of $P$ temporally adjacent interactions with flattened encodings and can preserve local temporal proximities. Take the patching of $\boldsymbol{X}_{u,N}^{t}$ as an example. $\boldsymbol{X}_{u,N}^{t}$ will be divided into $l_u = \lceil |S_u^t| / P \rceil$ patches in total (note that we will pad $\boldsymbol{X}_{u,N}^{t}$ if its length cannot be divided by $P$), and the patched encoding is represented by $\boldsymbol{M}_{u,N}^{t} \in \mathbb{R}^{l_u \times d_N \cdot P}$. Similarly, we can also get the patched encodings $\boldsymbol{M}_{u,E}^{t}$, $\boldsymbol{M}_{u,T}^{t}$, $\boldsymbol{M}_{u,C}^{t}$, $\boldsymbol{M}_{v,N}^{t}$, $\boldsymbol{M}_{v,E}^{t}$, $\boldsymbol{M}_{v,T}^{t}$, and $\boldsymbol{M}_{v,C}^{t}$. Note that when $|S_u^t|$ or $|S_v^t|$ becomes longer, we will correspondingly increase $P$, keeping the number of patches (i.e., $l_u$ and $l_v$) at a constant level to reduce the computational cost.
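A minimal sketch of this patching step (padding with zeros when the sequence length is not divisible by the patch size) is given below; it operates on one encoding sequence at a time.

```python
import torch

def patch_sequence(encodings: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split a [seq_len, dim] encoding sequence into non-overlapping patches of
    `patch_size` temporally adjacent interactions with flattened encodings.
    Returns a [num_patches, patch_size * dim] tensor."""
    seq_len, dim = encodings.shape
    num_patches = (seq_len + patch_size - 1) // patch_size  # ceil(seq_len / patch_size)
    padded = torch.zeros(num_patches * patch_size, dim)
    padded[:seq_len] = encodings
    return padded.view(num_patches, patch_size * dim)
```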
Transformer Encoder. We first align the patched encodings to the same dimension $d$ with trainable weights $\boldsymbol{W}_{*} \in \mathbb{R}^{d_{*} \cdot P \times d}$ and $\boldsymbol{b}_{*} \in \mathbb{R}^{d}$ to obtain $\boldsymbol{Z}_{u,*}^{t} \in \mathbb{R}^{l_u \times d}$ and $\boldsymbol{Z}_{v,*}^{t} \in \mathbb{R}^{l_v \times d}$, where $*$ could be $N$, $E$, $T$, or $C$. To be specific, the alignments are realized by

$\boldsymbol{Z}_{u,*}^{t}=\boldsymbol{M}_{u,*}^{t}\boldsymbol{W}_{*}+\boldsymbol{b}_{*}, \quad \boldsymbol{Z}_{v,*}^{t}=\boldsymbol{M}_{v,*}^{t}\boldsymbol{W}_{*}+\boldsymbol{b}_{*}.$    (2)

Then, we concatenate the aligned encodings of $u$ and $v$, and get $\boldsymbol{Z}_{u}^{t} = \boldsymbol{Z}_{u,N}^{t} \,\|\, \boldsymbol{Z}_{u,E}^{t} \,\|\, \boldsymbol{Z}_{u,T}^{t} \,\|\, \boldsymbol{Z}_{u,C}^{t} \in \mathbb{R}^{l_u \times 4d}$ and $\boldsymbol{Z}_{v}^{t} = \boldsymbol{Z}_{v,N}^{t} \,\|\, \boldsymbol{Z}_{v,E}^{t} \,\|\, \boldsymbol{Z}_{v,T}^{t} \,\|\, \boldsymbol{Z}_{v,C}^{t} \in \mathbb{R}^{l_v \times 4d}$.
Next, we employ a Transformer encoder to capture the temporal dependencies, which is built by stacking $L$ Multi-head Self-Attention (MSA) and Feed-Forward Network (FFN) blocks. The residual connection [20] is employed after every block. We follow [14] by using GELU [21] instead of ReLU [37] between the two linear layers in each FFN block and applying Layer Normalization (LN) [3] before each block rather than after. Instead of individually processing $\boldsymbol{Z}_{u}^{t}$ and $\boldsymbol{Z}_{v}^{t}$, our Transformer encoder takes the stacked $\boldsymbol{Z}^{t} = [\boldsymbol{Z}_{u}^{t}; \boldsymbol{Z}_{v}^{t}] \in \mathbb{R}^{(l_u+l_v) \times 4d}$ as inputs, aiming to learn the temporal dependencies within and across the sequences of $u$ and $v$. The calculation process of the $l$-th layer is

$\boldsymbol{Q}_{i}^{(l)}=\text{LN}(\boldsymbol{Z}^{(l-1)})\boldsymbol{W}_{Q,i}^{(l)},\ \boldsymbol{K}_{i}^{(l)}=\text{LN}(\boldsymbol{Z}^{(l-1)})\boldsymbol{W}_{K,i}^{(l)},\ \boldsymbol{V}_{i}^{(l)}=\text{LN}(\boldsymbol{Z}^{(l-1)})\boldsymbol{W}_{V,i}^{(l)},$    (3)

$\boldsymbol{A}_{i}^{(l)}=\text{Softmax}\left(\frac{\boldsymbol{Q}_{i}^{(l)}{\boldsymbol{K}_{i}^{(l)}}^{\top}}{\sqrt{d_{k}}}\right)\boldsymbol{V}_{i}^{(l)},$    (4)

$\text{MSA}(\boldsymbol{Z}^{(l-1)})=\left[\boldsymbol{A}_{1}^{(l)}\,\|\,\cdots\,\|\,\boldsymbol{A}_{I}^{(l)}\right]\boldsymbol{W}_{O}^{(l)},$    (5)

$\boldsymbol{S}^{(l)}=\text{MSA}(\boldsymbol{Z}^{(l-1)})+\boldsymbol{Z}^{(l-1)},$    (6)

$\boldsymbol{Z}^{(l)}=\text{FFN}(\text{LN}(\boldsymbol{S}^{(l)}))+\boldsymbol{S}^{(l)}.$    (7)

$\boldsymbol{W}_{Q,i}^{(l)}$, $\boldsymbol{W}_{K,i}^{(l)}$, $\boldsymbol{W}_{V,i}^{(l)} \in \mathbb{R}^{4d \times d_k}$, $\boldsymbol{W}_{O}^{(l)} \in \mathbb{R}^{4d \times 4d}$, and the weights in $\text{FFN}(\cdot)$ are trainable parameters at the $l$-th layer. We set $d_k = 4d / I$ with $I$ as the number of attention heads. The input of the first layer is $\boldsymbol{Z}^{(0)} = \boldsymbol{Z}^{t}$, and the output of the $L$-th layer is denoted by $\boldsymbol{H}^{t} \in \mathbb{R}^{(l_u+l_v) \times 4d}$.
Time-aware Node Representation. The time-aware representations of node $u$ and $v$ at timestamp $t$ are derived by averaging their related representations in $\boldsymbol{H}^{t}$ with an output layer,

$\boldsymbol{h}_{u}^{t}=\text{MEAN}\left(\boldsymbol{H}^{t}[:l_u,:]\right)\boldsymbol{W}_{out}+\boldsymbol{b}_{out},\quad \boldsymbol{h}_{v}^{t}=\text{MEAN}\left(\boldsymbol{H}^{t}[l_u:,:]\right)\boldsymbol{W}_{out}+\boldsymbol{b}_{out},$    (8)

where $\boldsymbol{W}_{out} \in \mathbb{R}^{4d \times d_{out}}$ and $\boldsymbol{b}_{out} \in \mathbb{R}^{d_{out}}$ are trainable weights with $d_{out}$ as the output dimension.
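Putting Eqs. (2)-(8) together, the sketch below outlines the alignment, Transformer encoder, and output layer using PyTorch's built-in pre-LN encoder layers; the class name, hyperparameter defaults, and the reliance on `nn.TransformerEncoder` are our simplifications rather than the official DyGLib implementation.

```python
import torch
import torch.nn as nn

class DyGFormerSketch(nn.Module):
    """Illustrative sketch of the alignment (Eq. 2), Transformer encoder (Eqs. 3-7),
    and output layer (Eq. 8)."""
    def __init__(self, patch_dims: dict, d: int, d_out: int, num_layers: int = 2, num_heads: int = 2):
        super().__init__()
        # One alignment projection per encoding type (* in {N, E, T, C}).
        self.align = nn.ModuleDict({k: nn.Linear(dim, d) for k, dim in patch_dims.items()})
        # Pre-LN Transformer with GELU, operating on the concatenated 4d-dimensional patches.
        layer = nn.TransformerEncoderLayer(d_model=4 * d, nhead=num_heads, activation="gelu",
                                           norm_first=True, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.output = nn.Linear(4 * d, d_out)

    def forward(self, patches_u: dict, patches_v: dict):
        # patches_*: dict mapping 'N'/'E'/'T'/'C' to a [num_patches, patch_size * d_*] tensor.
        z_u = torch.cat([self.align[k](patches_u[k]) for k in ("N", "E", "T", "C")], dim=-1)
        z_v = torch.cat([self.align[k](patches_v[k]) for k in ("N", "E", "T", "C")], dim=-1)
        l_u = z_u.shape[0]
        # Stack both patch sequences so attention runs within and across u's and v's histories.
        h = self.encoder(torch.cat([z_u, z_v], dim=0).unsqueeze(0)).squeeze(0)
        h_u = self.output(h[:l_u].mean(dim=0))  # time-aware representation of u at time t
        h_v = self.output(h[l_u:].mean(dim=0))  # time-aware representation of v at time t
        return h_u, h_v
```

Here `patch_dims` maps each encoding type to its flattened patch dimension $d_* \cdot P$, so the stacked patches of $u$ and $v$ attend to each other within a single Transformer, as described above.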
4.2 DyGLib: Unified Library for Continuous-Time Dynamic Graph Learning
We introduce a unified library with standard training pipelines, extensible coding interfaces, and comprehensive evaluating strategies for reproducible, scalable, and credible continuous-time dynamic graph learning research. The overall procedure of DyGLib is shown in Figure 3 in Section A.3.
Standard Training Pipelines. To eliminate the influence of different training pipelines in previous studies, we unify the data format, create a customized data loader, and train all the methods with the same model trainers. Our standard training pipelines guarantee reproducible performance and enable users to quickly identify the key components of different models. Researchers only need to focus on designing the model architecture without considering other irrelevant implementation details.
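At a high level, every integrated method is optimized by the same loop; the following pseudocode-style sketch conveys the idea, with illustrative names that are not DyGLib's actual API.

```python
import torch

def train(model, train_loader, optimizer, criterion, neg_sampler, num_epochs=100):
    """Sketch of a unified training loop shared by all integrated methods."""
    for epoch in range(num_epochs):
        for src, dst, t in train_loader:              # shared data format / loader
            neg_dst = neg_sampler.sample(len(src))    # shared negative sampling
            pos_prob = model(src, dst, t)             # any integrated backbone
            neg_prob = model(src, neg_dst, t)
            loss = criterion(pos_prob, torch.ones_like(pos_prob)) + \
                   criterion(neg_prob, torch.zeros_like(neg_prob))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```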
Extensible Coding Interfaces. We provide extensible coding interfaces for the datasets and algorithms, which are all implemented by PyTorch. These scalable designs enable users to incorporate new datasets and popular models based on their specific requirements, which can significantly reduce the usage difficulty for beginners and allow experts to conveniently validate new ideas. Currently, DyGLib has integrated thirteen datasets from various domains and nine continuous-time dynamic graph learning methods. It is worth noting that we also found some issues in previous implementations and have fixed them in DyGLib (see details in Section B.3).
Comprehensive Evaluating Protocols. DyGLib supports both transductive/inductive dynamic link prediction and dynamic node classification tasks. Most previous works evaluate their methods on the dynamic link prediction task with the random negative sampling strategy, but a few models already reach nearly saturated performance under this strategy, making it hard to distinguish more advanced designs. For more reliable comparisons, we adopt three strategies (i.e., random, historical, and inductive negative sampling strategies) from [44] to comprehensively evaluate the model performance.
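The snippet below sketches the intuition behind the three strategies; it is a simplification of [44], whose implementations additionally ensure that sampled negatives do not collide with the positive links at the current timestamp.

```python
import numpy as np

def sample_negative_edges(strategy, num, all_nodes, observed_edges, eval_only_edges, seed=0):
    """Simplified sketch of the three negative sampling strategies of [44]:
    'rnd' pairs random source/destination nodes, 'hist' re-samples edges observed
    at earlier timestamps, and 'ind' re-samples edges that are only seen during
    evaluation."""
    rng = np.random.default_rng(seed)
    if strategy == "rnd":
        src = rng.choice(all_nodes, size=num)
        dst = rng.choice(all_nodes, size=num)
        return list(zip(src, dst))
    pool = observed_edges if strategy == "hist" else eval_only_edges
    idx = rng.choice(len(pool), size=num)
    return [pool[i] for i in idx]
```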
5 Experiments
In this section, we report the results of various approaches by using DyGLib. We show the superiority of DyGFormer over existing methods and also give an in-depth analysis of DyGFormer.
5.1 Experimental Settings
Datasets and Baselines. We experiment with thirteen datasets (Wikipedia, Reddit, MOOC, LastFM, Enron, Social Evo., UCI, Flights, Can. Parl., US Legis., UN Trade, UN Vote, and Contact), which are collected by [44] and cover diverse domains. Details of the datasets are shown in Section B.1. We compare DyGFormer with eight popular continuous-time dynamic graph learning baselines that are based on graph convolutions, memory networks, random walks, and sequential models, including JODIE [28], DyRep [55], TGAT [64], TGN [45], CAWN [60], EdgeBank [44], TCL [57], and GraphMixer [12]. We give the descriptions of baselines in Section B.2.
Evaluation Tasks and Metrics. We follow [64, 45, 60, 44] to evaluate models for dynamic link prediction, which predicts the probability of a link occurring between two given nodes at a specific time. This task has two settings: the transductive setting aims to predict future links between nodes that are observed during training, and the inductive setting predicts future links between unseen nodes. We use a multi-layer perceptron to take the concatenated representations of two nodes as inputs and return the probability of a link as the output. Average Precision (AP) and Area Under the Receiver Operating Characteristic Curve (AUC-ROC) are adopted as the evaluation metrics. We adopt random (rnd), historical (hist), and inductive (ind) negative sampling strategies in [44] for evaluation, where the latter two strategies are more challenging. Please refer to [44] for more details. We also follow [64, 45] to conduct dynamic node classification, which estimates the state of a node in a given interaction at a specific time. A multi-layer perceptron is employed to map the node representations to the labels. We use AUC-ROC as the evaluation metric due to the label imbalance. For both tasks, we chronologically split each dataset with the ratio of 70%/15%/15% for training/validation/testing.
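A sketch of the link prediction head and metric computation is shown below; the hidden dimension (172) is an arbitrary illustrative choice.

```python
import torch
import torch.nn as nn
from sklearn.metrics import average_precision_score, roc_auc_score

class LinkPredictor(nn.Module):
    """Small MLP head mapping the concatenated time-aware representations of two
    nodes to a link probability (a sketch of the setup described above)."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, h_src: torch.Tensor, h_dst: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.mlp(torch.cat([h_src, h_dst], dim=-1))).squeeze(-1)

# Toy evaluation: positive and negative links share the same head; AP and AUC-ROC
# are then computed over the predicted probabilities.
predictor = LinkPredictor(dim=172)
h_src, h_dst = torch.randn(8, 172), torch.randn(8, 172)
probs = predictor(h_src, h_dst).detach().numpy()
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(average_precision_score(labels, probs), roc_auc_score(labels, probs))
```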
Model Configurations. For baselines, in addition to following their official settings, we also perform an exhaustive grid search to find the optimal configurations of some critical hyperparameters for more reliable comparisons. As DyGFormer can access longer histories, we vary each node’s input sequence length from 32 to 4096 by a factor of 2. To keep the computational complexity at a constant level that is irrelevant to the input length, we correspondingly increase the patch size from 1 to 128. Please see Section B.5 for the detailed configurations of different models.
Implementation Details. For both tasks, we optimize all models (excluding EdgeBank, which has no trainable parameters) with Adam [27] and use the supervised binary cross-entropy loss as the objective function. We train the models for 100 epochs and use the early stopping strategy with a patience of 20. We select the model that achieves the best performance on the validation set for testing. We set the learning rate and batch size to 0.0001 and 200 for all the methods on all the datasets. We run each method five times with seeds from 0 to 4 and report the average performance to eliminate deviations. Experiments are conducted on an Ubuntu machine equipped with one Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz with 16 physical cores. The GPU device is NVIDIA Tesla T4 with 15 GB memory.
5.2 Performance Comparisons and Discussions
NSS | Datasets | JODIE | DyRep | TGAT | TGN | CAWN | EdgeBank | TCL | GraphMixer | DyGFormer |
---|---|---|---|---|---|---|---|---|---|---|
rnd | Wikipedia | 96.50 0.14 | 94.86 0.06 | 96.94 0.06 | 98.45 0.06 | 98.76 0.03 | 90.37 0.00 | 96.47 0.16 | 97.25 0.03 | 99.03 0.02 |
Reddit | 98.31 0.14 | 98.22 0.04 | 98.52 0.02 | 98.63 0.06 | 99.11 0.01 | 94.86 0.00 | 97.53 0.02 | 97.31 0.01 | 99.22 0.01 | |
MOOC | 80.23 2.44 | 81.97 0.49 | 85.84 0.15 | 89.15 1.60 | 80.15 0.25 | 57.97 0.00 | 82.38 0.24 | 82.78 0.15 | 87.52 0.49 | |
LastFM | 70.85 2.13 | 71.92 2.21 | 73.42 0.21 | 77.07 3.97 | 86.99 0.06 | 79.29 0.00 | 67.27 2.16 | 75.61 0.24 | 93.00 0.12 | |
Enron | 84.77 0.30 | 82.38 3.36 | 71.12 0.97 | 86.53 1.11 | 89.56 0.09 | 83.53 0.00 | 79.70 0.71 | 82.25 0.16 | 92.47 0.12 | |
Social Evo. | 89.89 0.55 | 88.87 0.30 | 93.16 0.17 | 93.57 0.17 | 84.96 0.09 | 74.95 0.00 | 93.13 0.16 | 93.37 0.07 | 94.73 0.01 | |
UCI | 89.43 1.09 | 65.14 2.30 | 79.63 0.70 | 92.34 1.04 | 95.18 0.06 | 76.20 0.00 | 89.57 1.63 | 93.25 0.57 | 95.79 0.17 | |
Flights | 95.60 1.73 | 95.29 0.72 | 94.03 0.18 | 97.95 0.14 | 98.51 0.01 | 89.35 0.00 | 91.23 0.02 | 90.99 0.05 | 98.91 0.01 | |
Can. Parl. | 69.26 0.31 | 66.54 2.76 | 70.73 0.72 | 70.88 2.34 | 69.82 2.34 | 64.55 0.00 | 68.67 2.67 | 77.04 0.46 | 97.36 0.45 | |
US Legis. | 75.05 1.52 | 75.34 0.39 | 68.52 3.16 | 75.99 0.58 | 70.58 0.48 | 58.39 0.00 | 69.59 0.48 | 70.74 1.02 | 71.11 0.59 | |
UN Trade | 64.94 0.31 | 63.21 0.93 | 61.47 0.18 | 65.03 1.37 | 65.39 0.12 | 60.41 0.00 | 62.21 0.03 | 62.61 0.27 | 66.46 1.29 | |
UN Vote | 63.91 0.81 | 62.81 0.80 | 52.21 0.98 | 65.72 2.17 | 52.84 0.10 | 58.49 0.00 | 51.90 0.30 | 52.11 0.16 | 55.55 0.42 | |
Contact | 95.31 1.33 | 95.98 0.15 | 96.28 0.09 | 96.89 0.56 | 90.26 0.28 | 92.58 0.00 | 92.44 0.12 | 91.92 0.03 | 98.29 0.01 | |
Avg. Rank | 5.08 | 5.85 | 5.69 | 2.54 | 4.31 | 7.54 | 6.92 | 5.46 | 1.62 | |
hist | Wikipedia | 83.01 0.66 | 79.93 0.56 | 87.38 0.22 | 86.86 0.33 | 71.21 1.67 | 73.35 0.00 | 89.05 0.39 | 90.90 0.10 | 82.23 2.54 |
Reddit | 80.03 0.36 | 79.83 0.31 | 79.55 0.20 | 81.22 0.61 | 80.82 0.45 | 73.59 0.00 | 77.14 0.16 | 78.44 0.18 | 81.57 0.67 | |
MOOC | 78.94 1.25 | 75.60 1.12 | 82.19 0.62 | 87.06 1.93 | 74.05 0.95 | 60.71 0.00 | 77.06 0.41 | 77.77 0.92 | 85.85 0.66 | |
LastFM | 74.35 3.81 | 74.92 2.46 | 71.59 0.24 | 76.87 4.64 | 69.86 0.43 | 73.03 0.00 | 59.30 2.31 | 72.47 0.49 | 81.57 0.48 | |
Enron | 69.85 2.70 | 71.19 2.76 | 64.07 1.05 | 73.91 1.76 | 64.73 0.36 | 76.53 0.00 | 70.66 0.39 | 77.98 0.92 | 75.63 0.73 | |
Social Evo. | 87.44 6.78 | 93.29 0.43 | 95.01 0.44 | 94.45 0.56 | 85.53 0.38 | 80.57 0.00 | 94.74 0.31 | 94.93 0.31 | 97.38 0.14 | |
UCI | 75.24 5.80 | 55.10 3.14 | 68.27 1.37 | 80.43 2.12 | 65.30 0.43 | 65.50 0.00 | 80.25 2.74 | 84.11 1.35 | 82.17 0.82 | |
Flights | 66.48 2.59 | 67.61 0.99 | 72.38 0.18 | 66.70 1.64 | 64.72 0.97 | 70.53 0.00 | 70.68 0.24 | 71.47 0.26 | 66.59 0.49 | |
Can. Parl. | 51.79 0.63 | 63.31 1.23 | 67.13 0.84 | 68.42 3.07 | 66.53 2.77 | 63.84 0.00 | 65.93 3.00 | 74.34 0.87 | 97.00 0.31 | |
US Legis. | 51.71 5.76 | 86.88 2.25 | 62.14 6.60 | 74.00 7.57 | 68.82 8.23 | 63.22 0.00 | 80.53 3.95 | 81.65 1.02 | 85.30 3.88 | |
UN Trade | 61.39 1.83 | 59.19 1.07 | 55.74 0.91 | 58.44 5.51 | 55.71 0.38 | 81.32 0.00 | 55.90 1.17 | 57.05 1.22 | 64.41 1.40 | |
UN Vote | 70.02 0.81 | 69.30 1.12 | 52.96 2.14 | 69.37 3.93 | 51.26 0.04 | 84.89 0.00 | 52.30 2.35 | 51.20 1.60 | 60.84 1.58 | |
Contact | 95.31 2.13 | 96.39 0.20 | 96.05 0.52 | 93.05 2.35 | 84.16 0.49 | 88.81 0.00 | 93.86 0.21 | 93.36 0.41 | 97.57 0.06 | |
Avg. Rank | 5.46 | 5.08 | 5.08 | 3.85 | 7.54 | 5.92 | 5.46 | 4.00 | 2.62 | |
ind | Wikipedia | 75.65 0.79 | 70.21 1.58 | 87.00 0.16 | 85.62 0.44 | 74.06 2.62 | 80.63 0.00 | 86.76 0.72 | 88.59 0.17 | 78.29 5.38 |
Reddit | 86.98 0.16 | 86.30 0.26 | 89.59 0.24 | 88.10 0.24 | 91.67 0.24 | 85.48 0.00 | 87.45 0.29 | 85.26 0.11 | 91.11 0.40 | |
MOOC | 65.23 2.19 | 61.66 0.95 | 75.95 0.64 | 77.50 2.91 | 73.51 0.94 | 49.43 0.00 | 74.65 0.54 | 74.27 0.92 | 81.24 0.69 | |
LastFM | 62.67 4.49 | 64.41 2.70 | 71.13 0.17 | 65.95 5.98 | 67.48 0.77 | 75.49 0.00 | 58.21 0.89 | 68.12 0.33 | 73.97 0.50 | |
Enron | 68.96 0.98 | 67.79 1.53 | 63.94 1.36 | 70.89 2.72 | 75.15 0.58 | 73.89 0.00 | 71.29 0.32 | 75.01 0.79 | 77.41 0.89 | |
Social Evo. | 89.82 4.11 | 93.28 0.48 | 94.84 0.44 | 95.13 0.56 | 88.32 0.27 | 83.69 0.00 | 94.90 0.36 | 94.72 0.33 | 97.68 0.10 | |
UCI | 65.99 1.40 | 54.79 1.76 | 68.67 0.84 | 70.94 0.71 | 64.61 0.48 | 57.43 0.00 | 76.01 1.11 | 80.10 0.51 | 72.25 1.71 | |
Flights | 69.07 4.02 | 70.57 1.82 | 75.48 0.26 | 71.09 2.72 | 69.18 1.52 | 81.08 0.00 | 74.62 0.18 | 74.87 0.21 | 70.92 1.78 | |
Can. Parl. | 48.42 0.66 | 58.61 0.86 | 68.82 1.21 | 65.34 2.87 | 67.75 1.00 | 62.16 0.00 | 65.85 1.75 | 69.48 0.63 | 95.44 0.57 | |
US Legis. | 50.27 5.13 | 83.44 1.16 | 61.91 5.82 | 67.57 6.47 | 65.81 8.52 | 64.74 0.00 | 78.15 3.34 | 79.63 0.84 | 81.25 3.62 | |
UN Trade | 60.42 1.48 | 60.19 1.24 | 60.61 1.24 | 61.04 6.01 | 62.54 0.67 | 72.97 0.00 | 61.06 1.74 | 60.15 1.29 | 55.79 1.02 | |
UN Vote | 67.79 1.46 | 67.53 1.98 | 52.89 1.61 | 67.63 2.67 | 52.19 0.34 | 66.30 0.00 | 50.62 0.82 | 51.60 0.73 | 51.91 0.84 | |
Contact | 93.43 1.78 | 94.18 0.10 | 94.35 0.48 | 90.18 3.28 | 89.31 0.27 | 85.20 0.00 | 91.35 0.21 | 90.87 0.35 | 94.75 0.28 | |
Avg. Rank | 6.62 | 6.38 | 4.15 | 4.38 | 5.46 | 5.62 | 4.69 | 4.46 | 3.23 |
We report the performance of different methods on the AP metric for transductive dynamic link prediction with three negative sampling strategies in Table 1. The best and second-best results are emphasized by bold and underlined fonts. Note that the results are multiplied by 100 for a better display layout. Please refer to Section C.1 and Section C.2 for the results of AP for inductive dynamic link prediction as well as AUC-ROC for transductive and inductive dynamic link prediction tasks. Since EdgeBank can be only evaluated for transductive dynamic link prediction, we do not show its performance under the inductive setting. From the results, we have two main observations.
Firstly, DyGFormer usually outperforms baselines and achieves an average rank of 2.49/2.69 on AP/AUC-ROC for transductive and 2.69/2.56 for inductive dynamic link prediction across three negative sampling strategies. This is because: (i) The neighbor co-occurrence encoding scheme helps DyGFormer exploit correlations between the source node and destination node, which are often predictive for future links (see Section 5.5). (ii) The patching technique allows DyGFormer to access longer histories and capture long-term temporal dependencies (see Section 5.4). In Table 9 in Section B.5, the input sequence lengths of DyGFormer are much longer than those of baselines on several datasets, indicating that it can utilize longer sequences better. We also observe the varying results of DyGFormer across different negative sampling strategies and give an analysis in Section 5.7.
Secondly, some of our findings on baselines differ from previous reports. For instance, the performance of some baselines can be significantly improved by properly setting some hyperparameters. Additionally, some methods obtain worse results after we fix the problems or make adaptations in their implementations. More explanations can be found in Section B.4. These observations highlight the importance of rigorously evaluating different methods with a unified library and verify the necessity of introducing DyGLib to facilitate the development of dynamic graph learning.
We also report the results of dynamic node classification in Table 15 in Section C.3. We observe that DyGFormer obtains better performance than most baselines and achieves an impressive average rank of 2.50 among them, demonstrating the superiority of DyGFormer once again.
Datasets | TCL | w/ NCoE | Improv. |
---|---|---|---|
Wikipedia | 96.47 | 99.09 | 2.72% |
Reddit | 97.53 | 99.04 | 1.55% |
MOOC | 82.38 | 86.92 | 5.51% |
LastFM | 67.27 | 84.02 | 24.90% |
Enron | 79.70 | 90.18 | 13.15% |
Social Evo. | 93.13 | 94.06 | 1.00% |
UCI | 89.57 | 94.69 | 5.72% |
Flights | 91.23 | 97.71 | 7.10% |
Can. Parl. | 68.67 | 69.34 | 0.98% |
US Legis. | 69.59 | 69.47 | -0.17% |
UN Trade | 62.21 | 63.46 | 2.01% |
UN Vote | 51.90 | 51.52 | -0.73% |
Contact | 92.44 | 97.98 | 5.99% |

5.3 Generalizability of Neighbor Co-occurrence Encoding Scheme
Our Neighbor Co-occurrence Encoding scheme (NCoE) is versatile and can be easily integrated with dynamic graph learning methods based on sequential models. Hence, we incorporate NCoE with TCL and GraphMixer and show their performance in Table 2 and Table 16 in Section C.4. We find that TCL and GraphMixer usually yield better results with NCoE, achieving average improvements of 5.36% and 1.86%, respectively, over all datasets. This verifies the effectiveness and versatility of the neighbor co-occurrence encoding, and highlights the importance of capturing correlations between nodes. Also, as TCL and DyGFormer are both built upon Transformer, TCL w/ NCoE can achieve results similar to DyGFormer on datasets that favor shorter input sequences (in which cases the patching technique in DyGFormer contributes little). However, when datasets exhibit more obvious long-term temporal dependencies (e.g., LastFM, Can. Parl.), the performance gaps become more significant.
5.4 Advantages of Patching Technique
We validate the advantages of our patching technique in preserving the local temporal proximities and reducing the computational complexity, which helps DyGFormer effectively and efficiently utilize longer histories. We conduct experiments on LastFM and Can. Parl. since they can benefit from longer historical records. For baselines, we sample more neighbors or perform more causal anonymous walks (starting from 32) to make them access longer histories. The results are depicted in Figure 2, where the x-axis is represented by a logarithmic scale with base 2. We also plot the performance of baselines with the optimal length by unconnected points based on Table 9 in Section B.5. Note that the results of some baselines are incomplete since they raise the out-of-memory error when the lengths are longer. For example, TGAT is only computationally feasible when extending the input length to 32, resulting in two discrete points with lengths 20 (the optimal length) and 32.
From Figure 2, we conclude that: (i) most of the baselines perform worse when the input lengths become longer, indicating they lack the ability to capture long-term temporal dependencies; (ii) the baselines usually encounter expensive computational costs when computing on longer histories. Although memory network-based methods (i.e., DyRep and TGN) can handle longer histories with affordable computational costs, they cannot benefit from longer histories due to the potential issues of vanishing or exploding gradients; (iii) DyGFormer consistently achieves gains from longer sequences, demonstrating the advantages of the patching technique in leveraging longer histories.
We also compare the running time and memory usage of DyGFormer with and without the patching technique during the training process. The results are shown in Table 17 in Section C.5. We observe that the patching technique efficiently reduces model training costs in both time and space, allowing DyGFormer to access longer histories. As the input sequence length increases, the reductions become more significant. Moreover, with the patching technique, DyGFormer achieves average improvements of 0.31% and 0.74% in performance on LastFM and Can. Parl. over DyGFormer without patching. This observation further demonstrates the advantage of our patching technique in leveraging local temporal proximities for better results.
5.5 Verification of the Motivation of Neighbor Co-occurrence Encoding Scheme
To verify the motivation of NCoE (i.e., nodes with more common historical neighbors tend to interact in the future), we compare the performance of DyGFormer and DyGFormer without NCoE. We choose 0.5 as the threshold and use TP, TN, FN, and FP to denote True/False Positive/Negative. Common Neighbor Ratio (CNR) is defined as the ratio of common neighbors in source node $u$'s sequence $S_u^t$ and destination node $v$'s sequence $S_v^t$. We focus on links whose predictions of DyGFormer w/o NCoE are changed by DyGFormer (i.e., FN→TP, FP→TN, TP→FN, and TN→FP). We define Changed Link Ratio (CLR) as the ratio of the changed links to their original set, which is respectively computed by |FN→TP|/|FN|, |FP→TN|/|FP|, |TP→FN|/|TP|, and |TN→FP|/|TN|. If NCoE is helpful, DyGFormer will revise more wrong predictions (more FN→TP and FP→TN) and make fewer incorrect changes (fewer TP→FN and TN→FP). We report CLR and the average CNR of links in the above sets on five typical datasets in Table 3.
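For clarity, the sketch below computes these two quantities; the exact normalization of CNR (here, shared neighbors over the union of neighbors) is our assumption for illustration.

```python
def common_neighbor_ratio(src_neighbors, dst_neighbors):
    """Fraction of historical neighbors shared by the source and destination
    sequences (one reasonable reading of CNR; the normalization is an assumption)."""
    common = set(src_neighbors) & set(dst_neighbors)
    union = set(src_neighbors) | set(dst_neighbors)
    return len(common) / max(len(union), 1)

def changed_link_ratio(preds_without, preds_with, labels, source, target):
    """CLR: among links predicted as `source` (e.g., 'FN') by the model without
    NCoE, the fraction that the full model turns into `target` (e.g., 'TP')."""
    def outcome(pred, label):
        return ("T" if pred == label else "F") + ("P" if pred == 1 else "N")
    in_source = [i for i, (p, y) in enumerate(zip(preds_without, labels))
                 if outcome(p, y) == source]
    changed = [i for i in in_source if outcome(preds_with[i], labels[i]) == target]
    return len(changed) / max(len(in_source), 1)
```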
Datasets | CLR (%) | | | | CNR (%) | | |
---|---|---|---|---|---|---|---|---
 | FN→TP | FP→TN | TP→FN | TN→FP | FN→TP | FP→TN | TP→FN | TN→FP
Wikipedia | 68.36 | 72.73 | 1.68 | 1.69 | 18.16 | 0.01 | 0.10 | 2.49 |
UCI | 71.45 | 94.11 | 7.29 | 1.82 | 19.08 | 2.49 | 3.35 | 13.02 |
Flights | 83.66 | 83.83 | 1.73 | 2.11 | 37.09 | 2.28 | 7.06 | 20.28 |
US Legis. | 31.63 | 23.67 | 6.63 | 1.59 | 69.92 | 62.13 | 61.14 | 63.80 |
UN Vote | 44.02 | 36.46 | 28.95 | 30.53 | 78.57 | 81.39 | 80.86 | 77.02 |
We find that NCoE effectively helps DyGFormer rectify the wrong predictions of DyGFormer w/o NCoE on datasets where the CNR of positive links is significantly higher than that of negative ones, which holds for most datasets. Concretely, for Wikipedia, UCI, and Flights, the CNRs of FN→TP are much higher than those of FP→TN (e.g., 37.09% vs. 2.28% on Flights), and DyGFormer revises most wrong predictions of DyGFormer w/o NCoE (e.g., 83.66% for positive links in FN and 83.83% for negative links in FP on Flights). Corrections made by our encoding scheme are less obvious on datasets whose CNRs of positive and negative links are similar, which occurs in only 2 of the 13 datasets. For US Legis. and UN Vote, the CNRs of FN and FP are analogous (e.g., 69.92% vs. 62.13% on US Legis.), weakening the advantage of our neighbor co-occurrence encoding scheme (e.g., only 31.63%/23.67% of positive/negative links are corrected in FN/FP on US Legis.). Therefore, we conclude that the neighbor co-occurrence encoding scheme helps DyGFormer capture common historical neighbors in $S_u^t$ and $S_v^t$, and brings better results in most cases.
5.6 When Will DyGFormer Be a Good Choice?
Note that DyGFormer is superior to baselines by 1) exploring the correlations of the source and destination nodes from their historical sequences via the neighbor co-occurrence encoding scheme; and 2) using the patching technique to attend to longer histories. Thus, DyGFormer tends to perform better on datasets that favor these two designs. We define Link Ratio (LR) as the ratio of links in their corresponding positive or negative set, which can be computed by |TP|/(|TP|+|FN|), |TN|/(|TN|+|FP|), |FN|/(|TP|+|FN|), and |FP|/(|TN|+|FP|). As a method with more TP and TN (i.e., fewer FN and FP) is better, we report the results of LR and the average CNR of links in TP and TN on five typical datasets in Table 4.
Datasets | LR (%) | | CNR (%) |
---|---|---|---|---
 | TP | TN | TP | TN
Wikipedia | 92.74 | 97.19 | 59.09 | 0.01 |
UCI | 82.70 | 96.77 | 28.03 | 1.45 |
Flights | 96.13 | 95.33 | 47.58 | 1.40 |
US Legis. | 78.95 | 56.83 | 75.18 | 53.98 |
UN Vote | 65.18 | 45.43 | 56.24 | 76.02 |
Datasets | LR (%) | | | CNR (%) | |
---|---|---|---|---|---|---
 | rnd | hist | ind | rnd | hist | ind
Wikipedia | 2.81 | 89.28 | 94.53 | 0.02 | 14.00 | 11.66 |
UCI | 3.23 | 64.93 | 76.42 | 9.98 | 12.22 | 13.81 |
Flights | 4.67 | 94.52 | 92.94 | 0.01 | 35.62 | 30.29 |
US Legis. | 43.17 | 17.31 | 21.21 | 79.40 | 87.51 | 75.51 |
UN Vote | 54.57 | 39.90 | 52.92 | 79.60 | 75.53 | 79.15 |
We observe that when the CNR of TP is significantly higher than the CNR of TN on a dataset, DyGFormer often outperforms the baselines (most datasets satisfy this property). For Wikipedia, UCI, and Flights, the CNRs of TP are much higher than those of TN (e.g., 59.09% vs. 0.01% on Wikipedia). Such a characteristic matches the motivation of our neighbor co-occurrence encoding scheme, enabling DyGFormer to correctly predict most links (e.g., 92.74% of positive links and 97.19% of negative links are properly predicted on Wikipedia). Moreover, as LastFM and Can. Parl. can gain from longer histories (see Figure 2 and Table 9 in Section B.5), DyGFormer is significantly better than baselines on these two datasets. When the CNRs of TP and TN are less distinguishable, DyGFormer may perform worse (only 2 out of 13 datasets show this property). For US Legis., the CNRs of TP and TN are close (i.e., 75.18% vs. 53.98%), making DyGFormer worse than memory-based baselines (i.e., JODIE, DyRep, and TGN). For UN Vote, its CNR of TP is even lower than that of TN (i.e., 56.24% vs. 76.02%), which is opposite to our motivation, leading to poorer results of DyGFormer than a few baselines. Since these two datasets cannot obviously gain from longer sequences either (see Table 9 in Section B.5), DyGFormer obtains worse results on them. Thus, we conclude that for datasets with a much higher CNR of TP than CNR of TN, or datasets that can benefit from longer histories, DyGFormer is a good choice. Otherwise, we may need to try other methods.
5.7 Why Does DyGFormer's Performance Vary across Different Negative Sampling Strategies?
Compared with the random (rnd) strategy, the historical (hist) and inductive (ind) strategies sample previously observed links as negative ones. This turns previously positive links into negatives, which may hurt the performance of DyGFormer since the motivation of our neighbor co-occurrence encoding scheme is violated. As positive links are identical among rnd, hist, and ind, we compute LR and the average CNR of links in FP and show the results in Table 5.
We find that when hist or ind yields a CNR of FP that is several orders of magnitude higher than under rnd, DyGFormer drops sharply. For Wikipedia, UCI, and Flights, the CNRs of FP with hist/ind are much higher than with rnd (e.g., 14.00%/11.66% vs. 0.02% on Wikipedia). This misleads DyGFormer to predict negative links as positive and causes drops (e.g., 89.28%/94.53% of negative links are incorrectly predicted with hist/ind on Wikipedia, while only 2.81% are wrong with rnd). We also note that the drops on UCI are milder since the changes in CNR caused by hist or ind vs. rnd are less obvious than the changes on Wikipedia and Flights. When the changes in the CNR of FP caused by hist or ind are not obvious, DyGFormer is less affected. Since hist/ind makes little change in the CNRs of FP on US Legis., DyGFormer ranks second with hist/ind, which may indicate that it is less influenced by the neighbor co-occurrence encoding scheme and generalizes well to various negative sampling strategies. For UN Vote, although its CNRs of FP are not affected by hist and ind either, DyGFormer still performs worse due to its inferior performance with rnd. Hence, we deduce that our neighbor co-occurrence encoding may sometimes be fragile to various negative sampling strategies if its motivation is violated, leading to the varying performance of DyGFormer.
5.8 Ablation Study
Finally, we validate the effectiveness of the neighbor co-occurrence encoding, the time encoding, and the mixing of the source node's and destination node's sequences in DyGFormer. From Figure 4 in Section C.6, we observe that DyGFormer obtains the best performance when using all the components, and the results become worse when any component is removed. This illustrates the necessity of each design in DyGFormer. Please refer to Section C.6 for detailed implementations and discussions.
6 Conclusion
In this paper, we proposed a new Transformer-based architecture (DyGFormer) and a unified library (DyGLib) to foster the development of dynamic graph learning. DyGFormer differs from previous methods in (i) a neighbor co-occurrence encoding scheme to exploit the correlations of nodes in each interaction; and (ii) a patching technique to help the model capture long-term temporal dependencies. DyGLib serves as a toolkit for reproducible, scalable, and credible continuous-time dynamic graph learning with standard training pipelines, extensible coding interfaces, and comprehensive evaluating protocols. We hope our work can provide new perspectives on designing dynamic graph learning frameworks and encourage more researchers to dive into this field. In the future, we will continue to enrich DyGLib by incorporating recently released datasets and state-of-the-art models.
Acknowledgments and Disclosure of Funding
This work was supported by the National Natural Science Foundation of China (No. 62272023 and 51991395) and the Fundamental Research Funds for the Central Universities (No. YWF-23-L-1203).
References
- [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, pages 265–283. USENIX Association, 2016.
- [2] Unai Alvarez-Rodriguez, Federico Battiston, Guilherme Ferraz de Arruda, Yamir Moreno, Matjaž Perc, and Vito Latora. Evolutionary dynamics of higher-order interactions in social networks. Nature Human Behaviour, 5(5):586–595, 2021.
- [3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [4] Lei Bai, Lina Yao, Can Li, Xianzhi Wang, and Can Wang. Adaptive graph convolutional recurrent network for traffic forecasting. In Advances in Neural Information Processing Systems 33, 2020.
- [5] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew M. Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018.
- [6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33, 2020.
- [7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In Computer Vision - ECCV 2020 - 16th European Conference, volume 12346 of Lecture Notes in Computer Science, pages 213–229. Springer, 2020.
- [8] Yukuo Cen, Zhenyu Hou, Yan Wang, Qibin Chen, Yizhen Luo, Xingcheng Yao, Aohan Zeng, Shiguang Guo, Peng Zhang, Guohao Dai, Yu Wang, Chang Zhou, Hongxia Yang, and Jie Tang. Cogdl: An extensive toolkit for deep learning on graphs. CoRR, abs/2103.00959, 2021.
- [9] Xiaofu Chang, Xuqin Liu, Jianfeng Wen, Shuang Li, Yanming Fang, Le Song, and Yuan Qi. Continuous-time dynamic graph learning via neural interaction processes. In The 29th ACM International Conference on Information and Knowledge Management, pages 145–154. ACM, 2020.
- [10] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. In Proceedings of SSST@EMNLP 2014, pages 103–111. Association for Computational Linguistics, 2014.
- [11] Weilin Cong, Yanhong Wu, Yuandong Tian, Mengting Gu, Yinglong Xia, Mehrdad Mahdavi, and Chun-cheng Jason Chen. Dynamic graph representation learning via graph transformer networks. CoRR, abs/2111.10447, 2021.
- [12] Weilin Cong, Si Zhang, Jian Kang, Baichuan Yuan, Hao Wu, Xin Zhou, Hanghang Tong, and Mehrdad Mahdavi. Do we really need complicated model architectures for temporal networks? In International Conference on Learning Representations, 2023.
- [13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186. Association for Computational Linguistics, 2019.
- [14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations. OpenReview.net, 2021.
- [15] Ziwei Fan, Zhiwei Liu, Jiawei Zhang, Yun Xiong, Lei Zheng, and Philip S. Yu. Continuous-time sequential recommendation with temporal graph collaborative transformer. In The 30th ACM International Conference on Information and Knowledge Management, pages 433–442. ACM, 2021.
- [16] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. CoRR, abs/1903.02428, 2019.
- [17] Palash Goyal, Sujit Rokka Chhetri, and Arquimedes Canedo. dyngraph2vec: Capturing network dynamics using dynamic graph representation learning. Knowl. Based Syst., 187, 2020.
- [18] Palash Goyal, Sujit Rokka Chhetri, Ninareh Mehrabi, Emilio Ferrara, and Arquimedes Canedo. Dynamicgem: A library for dynamic graph embedding methods. CoRR, abs/1811.10734, 2018.
- [19] Shengnan Guo, Youfang Lin, Ning Feng, Chao Song, and Huaiyu Wan. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In The Thirty-Third AAAI Conference on Artificial Intelligence, pages 922–929. AAAI Press, 2019.
- [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778. IEEE Computer Society, 2016.
- [21] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
- [22] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. In Advances in Neural Information Processing Systems 33, 2020.
- [23] Shenyang Huang, Farimah Poursafaei, Jacob Danovitch, Matthias Fey, Weihua Hu, Emanuele Rossi, Jure Leskovec, Michael Bronstein, Guillaume Rabusseau, and Reihaneh Rabbany. Temporal graph benchmark for machine learning on temporal graphs. arXiv preprint arXiv:2307.01026, 2023.
- [24] Zijie Huang, Yizhou Sun, and Wei Wang. Learning continuous system dynamics from irregularly-sampled partial observations. In Advances in Neural Information Processing Systems 33, 2020.
- [25] Ming Jin, Yuan-Fang Li, and Shirui Pan. Neural temporal walks: Motif-aware representation learning on continuous-time dynamic graphs. In NeurIPS, 2022.
- [26] Seyed Mehran Kazemi, Rishab Goel, Kshitij Jain, Ivan Kobyzev, Akshay Sethi, Peter Forsyth, and Pascal Poupart. Representation learning for dynamic graphs: A survey. J. Mach. Learn. Res., 21:70:1–70:73, 2020.
- [27] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, 2015.
- [28] Srijan Kumar, Xikun Zhang, and Jure Leskovec. Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1269–1278. ACM, 2019.
- [29] Jintang Li, Kun Xu, Liang Chen, Zibin Zheng, and Xiao Liu. Graphgallery: A platform for fast benchmarking and easy development of graph neural networks based intelligent software. In 43rd IEEE/ACM International Conference on Software Engineering: Companion Proceedings, pages 13–16. IEEE, 2021.
- [30] Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting. In Advances in Neural Information Processing Systems 32, pages 5244–5254, 2019.
- [31] Xiaohan Li, Mengqi Zhang, Shu Wu, Zheng Liu, Liang Wang, and Philip S. Yu. Dynamic graph collaborative filtering. In 20th IEEE International Conference on Data Mining, pages 322–331. IEEE, 2020.
- [32] Meng Liu, Youzhi Luo, Limei Wang, Yaochen Xie, Hao Yuan, Shurui Gui, Haiyang Yu, Zhao Xu, Jingtun Zhang, Yi Liu, Keqiang Yan, Haoran Liu, Cong Fu, Bora Oztekin, Xuan Zhang, and Shuiwang Ji. DIG: A turnkey library for diving into graph deep learning research. J. Mach. Learn. Res., 22:240:1–240:9, 2021.
- [33] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019.
- [34] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision, pages 9992–10002. IEEE, 2021.
- [35] Yuhong Luo and Pan Li. Neighborhood-aware scalable temporal network representation learning. In The First Learning on Graphs Conference, 2022.
- [36] Yao Ma, Ziyi Guo, Zhaochun Ren, Jiliang Tang, and Dawei Yin. Streaming graph neural networks. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 719–728. ACM, 2020.
- [37] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814. Omnipress, 2010.
- [38] Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. In International Conference on Learning Representations, 2023.
- [39] Aldo Pareja, Giacomo Domeniconi, Jie Chen, Tengfei Ma, Toyotaro Suzumura, Hiroki Kanezashi, Tim Kaler, Tao B. Schardl, and Charles E. Leiserson. Evolvegcn: Evolving graph convolutional networks for dynamic graphs. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, pages 5363–5370. AAAI Press, 2020.
- [40] Razvan Pascanu, Tomás Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of JMLR Workshop and Conference Proceedings, pages 1310–1318. JMLR.org, 2013.
- [41] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Z. Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.
- [42] James W Pennebaker, Martha E Francis, and Roger J Booth. Linguistic inquiry and word count: Liwc 2001. Mahway: Lawrence Erlbaum Associates, 71(2001):2001, 2001.
- [43] Tobias Pfaff, Meire Fortunato, Alvaro Sanchez-Gonzalez, and Peter W. Battaglia. Learning mesh-based simulation with graph networks. In 9th International Conference on Learning Representations. OpenReview.net, 2021.
- [44] Farimah Poursafaei, Andy Huang, Kellin Pelrine, and Reihaneh Rabbany. Towards better evaluation for dynamic link prediction. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
- [45] Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and Michael Bronstein. Temporal graph networks for deep learning on dynamic graphs. In ICML 2020 Workshop on Graph Representation Learning, 2020.
- [46] Benedek Rozemberczki, Paul Scherer, Yixuan He, George Panagopoulos, Alexander Riedel, Maria Sinziana Astefanoaei, Oliver Kiss, Ferenc Béres, Guzmán López, Nicolas Collignon, and Rik Sarkar. Pytorch geometric temporal: Spatiotemporal signal processing with neural machine learning models. In The 30th ACM International Conference on Information and Knowledge Management, pages 4564–4573. ACM, 2021.
- [47] Alvaro Sanchez-Gonzalez, Jonathan Godwin, Tobias Pfaff, Rex Ying, Jure Leskovec, and Peter W. Battaglia. Learning to simulate complex physics with graph networks. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8459–8468. PMLR, 2020.
- [48] Aravind Sankar, Yanhong Wu, Liang Gou, Wei Zhang, and Hao Yang. Dysat: Deep neural representation learning on dynamic graphs via self-attention networks. In The Thirteenth ACM International Conference on Web Search and Data Mining, pages 519–527. ACM, 2020.
- [49] Joakim Skarding, Bogdan Gabrys, and Katarzyna Musial. Foundations and modeling of dynamic networks using dynamic graph neural networks: A survey. IEEE Access, 9:79143–79168, 2021.
- [50] Weiping Song, Zhiping Xiao, Yifan Wang, Laurent Charlin, Ming Zhang, and Jian Tang. Session-based social recommendation via dynamic graph attention networks. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pages 555–563. ACM, 2019.
- [51] Amauri H. Souza, Diego Mesquita, Samuel Kaski, and Vikas Garg. Provably expressive temporal graph networks. In NeurIPS, 2022.
- [52] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, 2014.
- [53] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In Advances in Neural Information Processing Systems 28, pages 2440–2448, 2015.
- [54] Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. Mlp-mixer: An all-mlp architecture for vision. In Advances in Neural Information Processing Systems 34, pages 24261–24272, 2021.
- [55] Rakshit Trivedi, Mehrdad Farajtabar, Prasenjeet Biswal, and Hongyuan Zha. Dyrep: Learning representations over dynamic graphs. In 7th International Conference on Learning Representations. OpenReview.net, 2019.
- [56] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
- [57] Lu Wang, Xiaofu Chang, Shuang Li, Yunfei Chu, Hui Li, Wei Zhang, Xiaofeng He, Le Song, Jingren Zhou, and Hongxia Yang. TCL: transformer-based dynamic graph modelling via contrastive learning. CoRR, abs/2105.07944, 2021.
- [58] Minjie Wang, Lingfan Yu, Da Zheng, Quan Gan, Yu Gai, Zihao Ye, Mufei Li, Jinjing Zhou, Qi Huang, Chao Ma, Ziyue Huang, Qipeng Guo, Hao Zhang, Haibin Lin, Junbo Zhao, Jinyang Li, Alexander J. Smola, and Zheng Zhang. Deep graph library: Towards efficient and scalable deep learning on graphs. CoRR, abs/1909.01315, 2019.
- [59] Xuhong Wang, Ding Lyu, Mengjian Li, Yang Xia, Qi Yang, Xinwen Wang, Xinguang Wang, Ping Cui, Yupu Yang, Bowen Sun, and Zhenyu Guo. APAN: asynchronous propagation attention network for real-time temporal graph embedding. In International Conference on Management of Data, pages 2628–2638. ACM, 2021.
- [60] Yanbang Wang, Yen-Yu Chang, Yunyu Liu, Jure Leskovec, and Pan Li. Inductive representation learning in temporal networks via causal anonymous walks. In 9th International Conference on Learning Representations. OpenReview.net, 2021.
- [61] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In 3rd International Conference on Learning Representations, 2015.
- [62] Haixu Wu, Jiehui Xu, Jianmin Wang, and Mingsheng Long. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Advances in Neural Information Processing Systems 34, pages 22419–22430, 2021.
- [63] Zonghan Wu, Shirui Pan, Guodong Long, Jing Jiang, and Chengqi Zhang. Graph wavenet for deep spatial-temporal graph modeling. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pages 1907–1913. ijcai.org, 2019.
- [64] Da Xu, Chuanwei Ruan, Evren Körpeoglu, Sushant Kumar, and Kannan Achan. Inductive representation learning on temporal graphs. In 8th International Conference on Learning Representations. OpenReview.net, 2020.
- [65] Guotong Xue, Ming Zhong, Jianxin Li, Jia Chen, Chengshuai Zhai, and Ruochen Kong. Dynamic network embedding survey. Neurocomputing, 472:212–223, 2022.
- [66] Jiaxuan You, Tianyu Du, and Jure Leskovec. ROLAND: graph learning framework for dynamic graphs. In The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 2358–2366. ACM, 2022.
- [67] Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 3634–3640. ijcai.org, 2018.
- [68] Le Yu. An empirical evaluation of temporal graph benchmark. arXiv preprint arXiv:2307.12510, 2023.
- [69] Le Yu, Bowen Du, Xiao Hu, Leilei Sun, Liangzhe Han, and Weifeng Lv. Deep spatio-temporal graph convolutional network for traffic accident prediction. Neurocomputing, 423:135–147, 2021.
- [70] Le Yu, Zihang Liu, Leilei Sun, Bowen Du, Chuanren Liu, and Weifeng Lv. Continuous-time user preference modelling for temporal sets prediction. IEEE Transactions on Knowledge and Data Engineering, 2023.
- [71] Le Yu, Guanghui Wu, Leilei Sun, Bowen Du, and Weifeng Lv. Element-guided temporal graph representation learning for temporal sets prediction. In The ACM Web Conference 2022, pages 1902–1913. ACM, 2022.
- [72] Mengqi Zhang, Shu Wu, Xueli Yu, Qiang Liu, and Liang Wang. Dynamic graph neural networks for sequential recommendation. IEEE Transactions on Knowledge and Data Engineering, 2022.
- [73] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Thirty-Fifth AAAI Conference on Artificial Intelligence, pages 11106–11115. AAAI Press, 2021.
- [74] Hongkuan Zhou, Da Zheng, Israt Nisa, Vasileios Ioannidis, Xiang Song, and George Karypis. Tgl: a general framework for temporal gnn training on billion-scale graphs. Proceedings of the VLDB Endowment, 15(8):1572–1580, 2022.
Appendix A Additional Discussions and Details
A.1 Limitations
One potential limitation of DyGFormer is that it ignores high-order relationships between nodes, since it only learns from nodes' first-hop interactions. In scenarios where high-order relationships are essential, DyGFormer may be suboptimal compared with baselines that model higher-order interactions. However, naively feeding the multi-hop neighbors of nodes into DyGFormer would incur expensive computational costs. Designing more efficient and effective frameworks to model nodes' high-order relationships for dynamic graph learning is a promising direction.
Another potential limitation is the sensitivity of the neighbor co-occurrence encoding scheme to different negative sampling strategies (discussed in Section 5.7). When the assumption behind the neighbor co-occurrence encoding scheme is violated, its performance may drop drastically in some cases. Designing more robust encoding schemes to tackle this issue is an insightful direction.
A.2 Licenses
All the codes and datasets we use are publicly available and permit usage for research purposes under either the MIT License or the Apache License 2.0.
A.3 Overall Procedure of DyGLib
Figure 3 shows the overall procedure of DyGLib.

Appendix B Detailed Experimental Settings
B.1 Descriptions of Datasets
We use thirteen datasets collected by [44] in the experiments, which are publicly available at https://zenodo.org/record/7213796#.Y1cO6y8r30o:
- Wikipedia is a bipartite interaction graph that contains the edits on Wikipedia pages over a month. Nodes represent users and pages, and links denote the editing behaviors with timestamps. Each link is associated with a 172-dimensional Linguistic Inquiry and Word Count (LIWC) feature [42]. This dataset additionally contains dynamic labels that indicate whether users are temporarily banned from editing.
- Reddit is bipartite and records the posts of users under subreddits during one month. Users and subreddits are nodes, and links are the timestamped posting requests. Each link has a 172-dimensional LIWC feature. This dataset also includes dynamic labels representing whether users are banned from posting.
- MOOC is a bipartite interaction network of online sources, where nodes are students and course content units (e.g., videos and problem sets). Each link denotes a student's access behavior to a specific content unit and is assigned a 4-dimensional feature.
- LastFM is bipartite and consists of the information about which songs were listened to by which users over one month. Users and songs are nodes, and links denote the listening behaviors of users.
- Enron records the email communications between employees of the ENRON energy corporation over three years.
- Social Evo. is a mobile phone proximity network that monitors the daily activities of an entire undergraduate dormitory over a period of eight months, where each link has a 2-dimensional feature.
- UCI is an online communication network, where nodes are university students and links are messages posted by students.
- Flights is a dynamic flight network that displays the development of air traffic during the COVID-19 pandemic. Airports are represented by nodes and the tracked flights are denoted as links. Each link is associated with a weight, indicating the number of flights between two airports in a day.
- Can. Parl. is a dynamic political network that records the interactions between Canadian Members of Parliament (MPs) from 2006 to 2019. Each node represents an MP from an electoral district, and a link is created when two MPs both vote "yes" on a bill. The weight of each link refers to the number of times that one MP voted "yes" for another MP in a year.
- US Legis. is a senate co-sponsorship network that tracks social interactions between legislators in the US Senate. The weight of each link specifies the number of times two congresspersons have co-sponsored a bill in a given congress.
- UN Trade contains the food and agriculture trade between 181 nations for more than 30 years. The weight of each link indicates the total sum of normalized agriculture import or export values between two particular countries.
- UN Vote records roll-call votes in the United Nations General Assembly. If two nations both voted "yes" to an item, the weight of the link between them is increased by one.
- Contact describes how the physical proximity among about 700 university students evolves over a month. Each student has a unique identifier, and links denote that two students are within close proximity to each other. Each link is associated with a weight, revealing the physical proximity between students.
We show the statistics of the datasets in Table 6, where #N&L Feat stands for the dimensions of node and link features. We notice a slight difference between the statistics of the Contact dataset reported in [44] (which has 694 nodes and 2,426,280 links) and our own calculations, although both of them are computed based on the released dataset by [44]. We ultimately report our statistics of the Contact dataset in this paper.
Datasets | Domains | #Nodes | #Links | #N&L Feat | Bipartite | Duration | Unique Steps | Time Granularity |
---|---|---|---|---|---|---|---|---|
Wikipedia | Social | 9,227 | 157,474 | – & 172 | True | 1 month | 152,757 | Unix timestamps |
Reddit | Social | 10,984 | 672,447 | – & 172 | True | 1 month | 669,065 | Unix timestamps |
MOOC | Interaction | 7,144 | 411,749 | – & 4 | True | 17 months | 345,600 | Unix timestamps |
LastFM | Interaction | 1,980 | 1,293,103 | – & – | True | 1 month | 1,283,614 | Unix timestamps |
Enron | Social | 184 | 125,235 | – & – | False | 3 years | 22,632 | Unix timestamps |
Social Evo. | Proximity | 74 | 2,099,519 | – & 2 | False | 8 months | 565,932 | Unix timestamps |
UCI | Social | 1,899 | 59,835 | – & – | False | 196 days | 58,911 | Unix timestamps |
Flights | Transport | 13,169 | 1,927,145 | – & 1 | False | 4 months | 122 | days |
Can. Parl. | Politics | 734 | 74,478 | – & 1 | False | 14 years | 14 | years |
US Legis. | Politics | 225 | 60,396 | – & 1 | False | 12 congresses | 12 | congresses |
UN Trade | Economics | 255 | 507,497 | – & 1 | False | 32 years | 32 | years |
UN Vote | Politics | 201 | 1,035,742 | – & 1 | False | 72 years | 72 | years |
Contact | Proximity | 692 | 2,426,279 | – & 1 | False | 1 month | 8,064 | 5 minutes |
B.2 Descriptions of Baselines
We select the following eight baselines:
- JODIE is designed for temporal bipartite networks of user-item interactions. It employs two coupled recurrent neural networks to update the states of users and items. A projection operation is introduced to learn the future representation trajectory of each user/item [28].
- DyRep proposes a recurrent architecture to update node states upon each interaction. It also includes a temporal-attentive aggregation module to consider the temporally evolving structural information in dynamic graphs [55].
- TGAT computes the node representation by aggregating features from each node's temporal-topological neighbors based on the self-attention mechanism. It is also equipped with a time encoding function for capturing temporal patterns [64].
- TGN maintains an evolving memory for each node and updates this memory when the node is observed in an interaction, which is achieved by the message function, message aggregator, and memory updater. An embedding module is leveraged to generate the temporal representations of nodes [45].
- CAWN first extracts multiple causal anonymous walks for each node, which can explore the causality of network dynamics and generate relative node identities. Then, it utilizes recurrent neural networks to encode each walk and aggregates these walks to obtain the final node representation [60].
- EdgeBank is a pure memory-based approach without trainable parameters for transductive dynamic link prediction. It stores the observed interactions in a memory unit and updates the memory through various strategies. An interaction is predicted as positive if it is retained in the memory, and negative otherwise [44]. The publication presents two updating strategies, but the official implementation includes two more (https://github.com/fpour/DGB/blob/main/EdgeBank/link_pred/edge_bank_baseline.py). To be specific, the four variants of EdgeBank are: EdgeBank∞, which has unlimited memory and stores all the observed edges; EdgeBank_tw-ts and EdgeBank_tw-re, which only remember edges within a fixed-size time window from the immediate past, where the window size of EdgeBank_tw-ts is set to the duration of the test split while EdgeBank_tw-re relates it to the time intervals of repeated edges; and EdgeBank_th, which retains edges whose appearance counts exceed a threshold. We experiment with all four variants and report the best performance among them (a minimal sketch of this memory-based idea follows the list).
- TCL first generates each node's interaction sequence by performing a breadth-first search algorithm on the temporal dependency interaction sub-graph. Then, it presents a graph transformer that considers both graph topology and temporal information to learn node representations. It also incorporates a cross-attention operation for modeling the inter-dependencies between two interaction nodes [57].
- GraphMixer is a conceptually simple method built on MLP-Mixer [54]. It uses a fixed (non-trainable) time encoding function, feeds the time encodings together with link features into an MLP-Mixer-based link encoder, and summarizes node features with a neighbor mean-pooling node encoder to compute node representations [12].
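To make the memory-based idea behind EdgeBank concrete, the following is a minimal sketch of an EdgeBank∞-style predictor as described above (unlimited memory, no trainable parameters); the class and function names are our own illustration, not the official DGB implementation.

```python
# Minimal sketch of an EdgeBank-infinity-style predictor: it has no trainable
# parameters, remembers every edge it has seen, and predicts a link as positive
# iff the (source, destination) pair is already in memory.
class EdgeBankInfinity:
    def __init__(self):
        self.memory = set()  # stores observed (src, dst) pairs

    def update(self, src: int, dst: int) -> None:
        self.memory.add((src, dst))

    def predict(self, src: int, dst: int) -> float:
        return 1.0 if (src, dst) in self.memory else 0.0


# Usage: stream interactions in chronological order, predicting before updating.
bank = EdgeBankInfinity()
for src, dst, _t in [(0, 1, 1.0), (0, 2, 2.0), (0, 1, 3.0)]:
    score = bank.predict(src, dst)  # 0.0 for unseen edges, 1.0 for repeated ones
    bank.update(src, dst)
```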
B.3 Some Problematic Implementations in Baselines
JODIE, DyRep, and TGN are based on memory networks, and their implementations design raw messages to avoid information leakage. However, they fail to store the raw messages when saving models because the raw messages are maintained in a dictionary and thus cannot be saved as model parameters (https://github.com/twitter-research/tgn/blob/2aa295a1f182137a6ad56328b43cb3d8223cae77/train_self_supervised.py#L302). Our DyGLib addresses this issue by additionally saving the correct raw messages when saving models. In the official implementation of CAWN, the encoding of each causal anonymous walk is taken from the embedding at the last position of the walk (https://github.com/snap-stanford/CAW/blob/f994ff2b2c29778e6250b6a9928fd9943e0163f7/module.py#L1069). However, some walks are padded to support mini-batch training in practice, making the last position of these padded walks meaningless. To get the correct encoding, it is necessary to use the actual length of each walk. Moreover, there are issues with the nodeedge2idx dictionary computed by the get_ts2idx function (https://github.com/snap-stanford/CAW/blob/f994ff2b2c29778e6250b6a9928fd9943e0163f7/graph.py#L79) when multiple interactions occur simultaneously at the last timestamp. This leads to information leakage since the model can potentially access additional interactions for a given interaction even if they happen at the same time. Our DyGLib fixes these problems as well.
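As a hedged illustration of the first fix described above, one way to keep such non-parameter state is to store it next to the model's state_dict in the same checkpoint; the function names and checkpoint layout below are our own sketch, not DyGLib's actual interface.

```python
import torch

# Hypothetical illustration: memory-based models keep "raw messages" in a plain
# Python dict (node id -> messages), which is NOT captured by state_dict().
# Saving and restoring that dict alongside the parameters avoids losing it.
def save_checkpoint(model: torch.nn.Module, raw_messages: dict, path: str) -> None:
    torch.save({"model_state_dict": model.state_dict(),
                "raw_messages": raw_messages}, path)


def load_checkpoint(model: torch.nn.Module, path: str) -> dict:
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model_state_dict"])
    return checkpoint["raw_messages"]
```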
B.4 Some Inconsistent Observations with Previous Reports
In the experiments, we find the behaviors of baselines are inconsistent with their previous reports in some cases. We provide detailed illustrations and attribute these phenomena to the following reasons.
Suboptimal Settings of Hyperparameters. Some important hyperparameters in the baselines are not sufficiently fine-tuned, such as the dropout rate, the number of sampled neighbors, the number of random walks, the length of input sequences, and the neighbor sampling strategy. In this paper, we perform a grid search to find the best settings of these hyperparameters and observe that the performance of many baselines improves significantly when certain hyperparameters are set properly. Take TGAT for transductive dynamic link prediction with the random negative sampling strategy as an example (see Table 1). Compared with the performance in [44], the AP results are significantly improved on datasets like MOOC (from 0.61 to 0.86), LastFM (from 0.50 to 0.73), Enron (from 0.59 to 0.71), Social Evo. (from 0.76 to 0.93), Flights (from 0.89 to 0.94), and Contact (from 0.58 to 0.96). Similar improvements can also be found for JODIE, DyRep, and TGN on several datasets.
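For concreteness, the sketch below shows the kind of grid search we mean: enumerate candidate settings and keep the one with the best validation score. The search-space values and the train_and_validate callback are illustrative placeholders, not the exact ranges of Table 7.

```python
from itertools import product

# A hedged sketch of hyperparameter grid search: try every combination in the
# search space and keep the configuration with the best validation score.
def grid_search(train_and_validate, search_space: dict):
    best_score, best_config = float("-inf"), None
    keys = list(search_space)
    for values in product(*(search_space[k] for k in keys)):
        config = dict(zip(keys, values))
        score = train_and_validate(config)  # e.g., validation AP
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score


# Example search space in the spirit of Table 7 (values illustrative only).
space = {"dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
         "num_neighbors": [10, 20, 30],
         "sampling": ["uniform", "recent"]}
best_config, best_score = grid_search(lambda cfg: -cfg["dropout"], space)  # dummy scorer
```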
Usage of Problematic Implementations. Some previous studies utilize problematic implementations in their experiments (illustrated in Section B.3), and the reported results may not be rigorous. Therefore, some methods would obtain worse results after we fix the problems in their implementations. Take CAWN for transductive dynamic link prediction with the random negative sampling strategy as an example (see Table 1). Compared with the results in [44], the performance on the AP metric of CAWN drops sharply on datasets like LastFM (from 0.98 to 0.87), Can. Parl. (from 0.94 to 0.70), US Legis. (from 0.97 to 0.71), UN Trade (from 0.97 to 0.65), and UN Vote (from 0.82 to 0.53). This is because [44] uses the problematic implementations of CAWN to conduct experiments and the results will sometimes become worse after fixing the issues.
Adaptations for Evaluation. There also exist some differences between GraphMixer's results in the original paper and in our work because we modify its implementation to fit our evaluations. The original GraphMixer can only be evaluated for transductive dynamic link prediction. In this work, we remove the one-hot encoding of nodes in GraphMixer to adapt it to the inductive setting, which may lead to decreased performance in certain situations. Take the performance of GraphMixer for transductive dynamic link prediction with the random negative sampling strategy as an example (see Table 1). Compared with the results in [12], the performance of GraphMixer slightly drops on datasets like Wikipedia (from 0.9985 to 0.9725) and Reddit (from 0.9993 to 0.9731).
B.5 Configurations of Different Methods
We first present the official settings of baselines as well as the configurations of DyGFormer. These configurations remain unchanged across all the datasets.
- JODIE:
  - Dimension of time encoding: 100
  - Dimension of node memory: 172
  - Dimension of output representation: 172
  - Memory updater: vanilla recurrent neural network
- DyRep:
  - Dimension of time encoding: 100
  - Dimension of node memory: 172
  - Dimension of output representation: 172
  - Number of graph attention heads: 2
  - Number of graph convolution layers: 1
  - Memory updater: vanilla recurrent neural network
- TGAT:
  - Dimension of time encoding: 100
  - Dimension of output representation: 172
  - Number of graph attention heads: 2
  - Number of graph convolution layers: 2
- TGN:
  - Dimension of time encoding: 100
  - Dimension of node memory: 172
  - Dimension of output representation: 172
  - Number of graph attention heads: 2
  - Number of graph convolution layers: 1
  - Memory updater: gated recurrent unit [10]
- CAWN:
  - Dimension of time encoding: 100
  - Dimension of position encoding: 172
  - Dimension of output representation: 172
  - Number of attention heads for encoding walks: 8
  - Length of each walk (including the target node): 2
  - Time scaling factor: 1e-6
- TCL:
  - Dimension of time encoding: 100
  - Dimension of depth encoding: 172
  - Dimension of output representation: 172
  - Number of attention heads: 2
  - Number of Transformer layers: 2
- GraphMixer:
  - Dimension of time encoding: 100
  - Dimension of output representation: 172
  - Number of MLP-Mixer layers: 2
  - Time gap: 2000
- DyGFormer:
  - Dimension of time encoding: 100
  - Dimension of neighbor co-occurrence encoding: 50
  - Dimension of aligned encoding: 50
  - Dimension of output representation: 172
  - Number of attention heads: 2
  - Number of Transformer layers: 2
Then, we perform a grid search to find the best settings of some critical hyperparameters; the searched ranges and related methods are shown in Table 7. It is worth noting that DyGFormer directly handles nodes whose sequences are shorter than the specified length. When a sequence exceeds the specified length, we keep only the most recent interactions up to that length, as sketched below.
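The sequence-length handling just described can be illustrated as follows; this is a minimal sketch of our own, assuming each node's interactions are stored in chronological order, not DyGFormer's exact code.

```python
# Keep a sequence as-is if it is short enough; otherwise retain only the
# max_length most recent interactions (interactions assumed time-ordered).
def truncate_to_recent(interactions: list, max_length: int) -> list:
    return interactions if len(interactions) <= max_length else interactions[-max_length:]


# Usage: with max_length = 3, the three most recent interactions are retained.
history = [(1, 10.0), (2, 11.5), (3, 12.0), (4, 15.2)]  # (neighbor id, timestamp)
assert truncate_to_recent(history, 3) == [(2, 11.5), (3, 12.0), (4, 15.2)]
```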
Hyperparameters | Searched Ranges | Related Methods
---|---|---
Dropout Rate [52] | – | JODIE, DyRep, TGAT, TGN, CAWN, TCL, GraphMixer, DyGFormer
Number of Sampled Neighbors | [10, 20, 30] | DyRep, TGAT, TGN, TCL, GraphMixer
Neighbor Sampling Strategies | [uniform, recent] | DyRep, TGAT, TGN, TCL, GraphMixer
Number of Causal Anonymous Walks | [16, 32, 64, 128] | CAWN
EdgeBank Variants | – | EdgeBank
Sequence Lengths & Patch Sizes | – | DyGFormer
Finally, we show the hyperparameter settings of various methods determined by the grid search in Table 8 (dropout rates), Table 9 (numbers of sampled neighbors or causal anonymous walks, and sequence lengths & patch sizes for DyGFormer), Table 10 (neighbor sampling strategies), and Table 11 (best-performing EdgeBank variants under each negative sampling strategy).
Datasets | JODIE | DyRep | TGAT | TGN | CAWN | TCL | GraphMixer | DyGFormer |
---|---|---|---|---|---|---|---|---|
Wikipedia | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.5 | 0.1 |
Reddit | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.5 | 0.2 |
MOOC | 0.2 | 0.0 | 0.1 | 0.2 | 0.1 | 0.1 | 0.4 | 0.1 |
LastFM | 0.3 | 0.0 | 0.1 | 0.3 | 0.1 | 0.1 | 0.0 | 0.1 |
Enron | 0.1 | 0.0 | 0.2 | 0.0 | 0.1 | 0.1 | 0.5 | 0.0 |
Social Evo. | 0.1 | 0.1 | 0.1 | 0.0 | 0.1 | 0.0 | 0.3 | 0.1 |
UCI | 0.4 | 0.0 | 0.1 | 0.1 | 0.1 | 0.0 | 0.4 | 0.1 |
Flights | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 |
Can. Parl. | 0.0 | 0.0 | 0.2 | 0.3 | 0.0 | 0.2 | 0.2 | 0.1 |
US Legis. | 0.2 | 0.0 | 0.1 | 0.1 | 0.1 | 0.3 | 0.4 | 0.0 |
UN Trade | 0.4 | 0.1 | 0.1 | 0.2 | 0.1 | 0.0 | 0.1 | 0.0 |
UN Vote | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.0 | 0.0 | 0.2 |
Contact | 0.1 | 0.0 | 0.1 | 0.1 | 0.1 | 0.0 | 0.1 | 0.0 |
Datasets | DyRep | TGAT | TGN | CAWN | TCL | GraphMixer | DyGFormer |
---|---|---|---|---|---|---|---|
Wikipedia | 10 | 20 | 10 | 32 | 20 | 30 | 32 & 1 |
Reddit | 10 | 20 | 10 | 32 | 20 | 10 | 64 & 2 |
MOOC | 10 | 20 | 10 | 64 | 20 | 20 | 256 & 8 |
LastFM | 10 | 20 | 10 | 128 | 20 | 10 | 512 & 16 |
Enron | 10 | 20 | 10 | 32 | 20 | 20 | 256 & 8 |
Social Evo. | 10 | 20 | 10 | 64 | 20 | 20 | 32 & 1 |
UCI | 10 | 20 | 10 | 64 | 20 | 20 | 32 & 1 |
Flights | 10 | 20 | 10 | 64 | 20 | 20 | 256 & 8 |
Can. Parl. | 10 | 20 | 10 | 128 | 20 | 20 | 2048 & 64 |
US Legis. | 10 | 20 | 10 | 32 | 20 | 20 | 256 & 8 |
UN Trade | 10 | 20 | 10 | 64 | 20 | 20 | 256 & 8 |
UN Vote | 10 | 20 | 10 | 64 | 20 | 20 | 128 & 4 |
Contact | 10 | 20 | 10 | 64 | 20 | 20 | 32 & 1 |
Datasets | DyRep | TGAT | TGN | TCL | GraphMixer |
---|---|---|---|---|---|
Wikipedia | recent | recent | recent | recent | recent |
Reddit | recent | uniform | recent | uniform | recent |
MOOC | recent | recent | recent | recent | recent |
LastFM | recent | recent | recent | recent | recent |
Enron | recent | recent | recent | recent | recent |
Social Evo. | recent | recent | recent | recent | recent |
UCI | recent | recent | recent | recent | recent |
Flights | recent | recent | recent | recent | recent |
Can. Parl. | uniform | uniform | uniform | uniform | uniform |
US Legis. | recent | recent | recent | uniform | recent |
UN Trade | recent | uniform | recent | uniform | uniform |
UN Vote | recent | recent | uniform | uniform | uniform |
Contact | recent | recent | recent | recent | recent |
Datasets | Random | Historical | Inductive |
---|---|---|---|
Wikipedia | EdgeBank∞ | EdgeBank | EdgeBank |
Reddit | EdgeBank∞ | EdgeBank | EdgeBank |
MOOC | EdgeBank | EdgeBank | EdgeBank |
LastFM | EdgeBank | EdgeBank | EdgeBank |
Enron | EdgeBank | EdgeBank | EdgeBank |
Social Evo. | EdgeBank | EdgeBank | EdgeBank |
UCI | EdgeBank∞ | EdgeBank | EdgeBank |
Flights | EdgeBank∞ | EdgeBank | EdgeBank |
Can. Parl. | EdgeBank | EdgeBank | EdgeBank |
US Legis. | EdgeBank | EdgeBank | EdgeBank |
UN Trade | EdgeBank | EdgeBank | EdgeBank |
UN Vote | EdgeBank | EdgeBank | EdgeBank |
Contact | EdgeBank | EdgeBank | EdgeBank |
Appendix C Detailed Experimental Results
C.1 Additional Results for Transductive Dynamic Link Prediction
We show the AUC-ROC for transductive dynamic link prediction with three negative sampling strategies in Table 12.
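As a reminder of how AP and AUC-ROC are computed from link scores, the following is a minimal sketch using scikit-learn with dummy scores; DyGLib's internal evaluation code may differ in details.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Each positive (observed) link and each sampled negative link receives a predicted
# probability; AP and AUC-ROC score how well positives rank above negatives.
positive_scores = np.array([0.9, 0.8, 0.4])   # model scores for true links (dummy values)
negative_scores = np.array([0.3, 0.7, 0.1])   # model scores for sampled negatives (dummy values)

labels = np.concatenate([np.ones_like(positive_scores), np.zeros_like(negative_scores)])
scores = np.concatenate([positive_scores, negative_scores])

ap = average_precision_score(labels, scores)
auc = roc_auc_score(labels, scores)
print(f"AP = {ap:.4f}, AUC-ROC = {auc:.4f}")
```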
NSS | Datasets | JODIE | DyRep | TGAT | TGN | CAWN | EdgeBank | TCL | GraphMixer | DyGFormer |
---|---|---|---|---|---|---|---|---|---|---|
rnd | Wikipedia | 96.33 0.07 | 94.37 0.09 | 96.67 0.07 | 98.37 0.07 | 98.54 0.04 | 90.78 0.00 | 95.84 0.18 | 96.92 0.03 | 98.91 0.02 |
Reddit | 98.31 0.05 | 98.17 0.05 | 98.47 0.02 | 98.60 0.06 | 99.01 0.01 | 95.37 0.00 | 97.42 0.02 | 97.17 0.02 | 99.15 0.01 |
MOOC | 83.81 2.09 | 85.03 0.58 | 87.11 0.19 | 91.21 1.15 | 80.38 0.26 | 60.86 0.00 | 83.12 0.18 | 84.01 0.17 | 87.91 0.58 | |
LastFM | 70.49 1.66 | 71.16 1.89 | 71.59 0.18 | 78.47 2.94 | 85.92 0.10 | 83.77 0.00 | 64.06 1.16 | 73.53 0.12 | 93.05 0.10 | |
Enron | 87.96 0.52 | 84.89 3.00 | 68.89 1.10 | 88.32 0.99 | 90.45 0.14 | 87.05 0.00 | 75.74 0.72 | 84.38 0.21 | 93.33 0.13 | |
Social Evo. | 92.05 0.46 | 90.76 0.21 | 94.76 0.16 | 95.39 0.17 | 87.34 0.08 | 81.60 0.00 | 94.84 0.17 | 95.23 0.07 | 96.30 0.01 | |
UCI | 90.44 0.49 | 68.77 2.34 | 78.53 0.74 | 92.03 1.13 | 93.87 0.08 | 77.30 0.00 | 87.82 1.36 | 91.81 0.67 | 94.49 0.26 | |
Flights | 96.21 1.42 | 95.95 0.62 | 94.13 0.17 | 98.22 0.13 | 98.45 0.01 | 90.23 0.00 | 91.21 0.02 | 91.13 0.01 | 98.93 0.01 | |
Can. Parl. | 78.21 0.23 | 73.35 3.67 | 75.69 0.78 | 76.99 1.80 | 75.70 3.27 | 64.14 0.00 | 72.46 3.23 | 83.17 0.53 | 97.76 0.41 | |
US Legis. | 82.85 1.07 | 82.28 0.32 | 75.84 1.99 | 83.34 0.43 | 77.16 0.39 | 62.57 0.00 | 76.27 0.63 | 76.96 0.79 | 77.90 0.58 | |
UN Trade | 69.62 0.44 | 67.44 0.83 | 64.01 0.12 | 69.10 1.67 | 68.54 0.18 | 66.75 0.00 | 64.72 0.05 | 65.52 0.51 | 70.20 1.44 | |
UN Vote | 68.53 0.95 | 67.18 1.04 | 52.83 1.12 | 69.71 2.65 | 53.09 0.22 | 62.97 0.00 | 51.88 0.36 | 52.46 0.27 | 57.12 0.62 | |
Contact | 96.66 0.89 | 96.48 0.14 | 96.95 0.08 | 97.54 0.35 | 89.99 0.34 | 94.34 0.00 | 94.15 0.09 | 93.94 0.02 | 98.53 0.01 | |
Avg. Rank | 4.38 | 5.77 | 6.00 | 2.54 | 4.38 | 7.31 | 7.23 | 5.77 | 1.62 | |
hist | Wikipedia | 80.77 0.73 | 77.74 0.33 | 82.87 0.22 | 82.74 0.32 | 67.84 0.64 | 77.27 0.00 | 85.76 0.46 | 87.68 0.17 | 78.80 1.95 |
Reddit | 80.52 0.32 | 80.15 0.18 | 79.33 0.16 | 81.11 0.19 | 80.27 0.30 | 78.58 0.00 | 76.49 0.16 | 77.80 0.12 | 80.54 0.29 |
MOOC | 82.75 0.83 | 81.06 0.94 | 80.81 0.67 | 88.00 1.80 | 71.57 1.07 | 61.90 0.00 | 72.09 0.56 | 76.68 1.40 | 87.04 0.35 | |
LastFM | 75.22 2.36 | 74.65 1.98 | 64.27 0.26 | 77.97 3.04 | 67.88 0.24 | 78.09 0.00 | 47.24 3.13 | 64.21 0.73 | 78.78 0.35 | |
Enron | 75.39 2.37 | 74.69 3.55 | 61.85 1.43 | 77.09 2.22 | 65.10 0.34 | 79.59 0.00 | 67.95 0.88 | 75.27 1.14 | 76.55 0.52 | |
Social Evo. | 90.06 3.15 | 93.12 0.34 | 93.08 0.59 | 94.71 0.53 | 87.43 0.15 | 85.81 0.00 | 93.44 0.68 | 94.39 0.31 | 97.28 0.07 | |
UCI | 78.64 3.50 | 57.91 3.12 | 58.89 1.57 | 77.25 2.68 | 57.86 0.15 | 69.56 0.00 | 72.25 3.46 | 77.54 2.02 | 76.97 0.24 | |
Flights | 68.97 1.87 | 69.43 0.90 | 72.20 0.16 | 68.39 0.95 | 66.11 0.71 | 74.64 0.00 | 70.57 0.18 | 70.37 0.23 | 68.09 0.43 | |
Can. Parl. | 62.44 1.11 | 70.16 1.70 | 70.86 0.94 | 73.23 3.08 | 72.06 3.94 | 63.04 0.00 | 69.95 3.70 | 79.03 1.01 | 97.61 0.40 | |
US Legis. | 67.47 6.40 | 91.44 1.18 | 73.47 5.25 | 83.53 4.53 | 78.62 7.46 | 67.41 0.00 | 83.97 3.71 | 85.17 0.70 | 90.77 1.96 | |
UN Trade | 68.92 1.40 | 64.36 1.40 | 60.37 0.68 | 63.93 5.41 | 63.09 0.74 | 86.61 0.00 | 61.43 1.04 | 63.20 1.54 | 73.86 1.13 | |
UN Vote | 76.84 1.01 | 74.72 1.43 | 53.95 3.15 | 73.40 5.20 | 51.27 0.33 | 89.62 0.00 | 52.29 2.39 | 52.61 1.44 | 64.27 1.78 | |
Contact | 96.35 0.92 | 96.00 0.23 | 95.39 0.43 | 93.76 1.29 | 83.06 0.32 | 92.17 0.00 | 93.34 0.19 | 93.14 0.34 | 97.17 0.05 | |
Avg. Rank | 4.38 | 4.77 | 5.85 | 3.46 | 7.38 | 5.38 | 6.08 | 4.77 | 2.92 | |
ind | Wikipedia | 70.96 0.78 | 67.36 0.96 | 81.93 0.22 | 80.97 0.31 | 70.95 0.95 | 81.73 0.00 | 82.19 0.48 | 84.28 0.30 | 75.09 3.70 |
Reddit | 83.51 0.15 | 82.90 0.31 | 87.13 0.20 | 84.56 0.24 | 88.04 0.29 | 85.93 0.00 | 84.67 0.29 | 82.21 0.13 | 86.23 0.51 |
MOOC | 66.63 2.30 | 63.26 1.01 | 73.18 0.33 | 77.44 2.86 | 70.32 1.43 | 48.18 0.00 | 70.36 0.37 | 72.45 0.72 | 80.76 0.76 | |
LastFM | 61.32 3.49 | 62.15 2.12 | 63.99 0.21 | 65.46 4.27 | 67.92 0.44 | 77.37 0.00 | 46.93 2.59 | 60.22 0.32 | 69.25 0.36 | |
Enron | 70.92 1.05 | 68.73 1.34 | 60.45 2.12 | 71.34 2.46 | 75.17 0.50 | 75.00 0.00 | 67.64 0.86 | 71.53 0.85 | 74.07 0.64 | |
Social Evo. | 90.01 3.19 | 93.07 0.38 | 92.94 0.61 | 95.24 0.56 | 89.93 0.15 | 87.88 0.00 | 93.44 0.72 | 94.22 0.32 | 97.51 0.06 | |
UCI | 64.14 1.26 | 54.25 2.01 | 60.80 1.01 | 64.11 1.04 | 58.06 0.26 | 58.03 0.00 | 70.05 1.86 | 74.59 0.74 | 65.96 1.18 | |
Flights | 69.99 3.10 | 71.13 1.55 | 73.47 0.18 | 71.63 1.72 | 69.70 0.75 | 81.10 0.00 | 72.54 0.19 | 72.21 0.21 | 69.53 1.17 | |
Can. Parl. | 52.88 0.80 | 63.53 0.65 | 72.47 1.18 | 69.57 2.81 | 72.93 1.78 | 61.41 0.00 | 69.47 2.12 | 70.52 0.94 | 96.70 0.59 | |
US Legis. | 59.05 5.52 | 89.44 0.71 | 71.62 5.42 | 78.12 4.46 | 76.45 7.02 | 68.66 0.00 | 82.54 3.91 | 84.22 0.91 | 87.96 1.80 | |
UN Trade | 66.82 1.27 | 65.60 1.28 | 66.13 0.78 | 66.37 5.39 | 71.73 0.74 | 74.20 0.00 | 67.80 1.21 | 66.53 1.22 | 62.56 1.51 | |
UN Vote | 73.73 1.61 | 72.80 2.16 | 53.04 2.58 | 72.69 3.72 | 52.75 0.90 | 72.85 0.00 | 52.02 1.64 | 51.89 0.74 | 53.37 1.26 | |
Contact | 94.47 1.08 | 94.23 0.18 | 94.10 0.41 | 91.64 1.72 | 87.68 0.24 | 85.87 0.00 | 91.23 0.19 | 90.96 0.27 | 95.01 0.15 | |
Avg. Rank | 5.92 | 6.15 | 4.85 | 4.54 | 5.15 | 5.08 | 5.00 | 4.77 | 3.54 |
C.2 Additional Results for Inductive Dynamic Link Prediction
We present the AP and AUC-ROC for inductive dynamic link prediction with three negative sampling strategies in Table 13 and Table 14.
NSS | Datasets | JODIE | DyRep | TGAT | TGN | CAWN | TCL | GraphMixer | DyGFormer |
---|---|---|---|---|---|---|---|---|---|
rnd | Wikipedia | 94.82 0.20 | 92.43 0.37 | 96.22 0.07 | 97.83 0.04 | 98.24 0.03 | 96.22 0.17 | 96.65 0.02 | 98.59 0.03 |
Reddit | 96.50 0.13 | 96.09 0.11 | 97.09 0.04 | 97.50 0.07 | 98.62 0.01 | 94.09 0.07 | 95.26 0.02 | 98.84 0.02 |
MOOC | 79.63 1.92 | 81.07 0.44 | 85.50 0.19 | 89.04 1.17 | 81.42 0.24 | 80.60 0.22 | 81.41 0.21 | 86.96 0.43 | |
LastFM | 81.61 3.82 | 83.02 1.48 | 78.63 0.31 | 81.45 4.29 | 89.42 0.07 | 73.53 1.66 | 82.11 0.42 | 94.23 0.09 | |
Enron | 80.72 1.39 | 74.55 3.95 | 67.05 1.51 | 77.94 1.02 | 86.35 0.51 | 76.14 0.79 | 75.88 0.48 | 89.76 0.34 | |
Social Evo. | 91.96 0.48 | 90.04 0.47 | 91.41 0.16 | 90.77 0.86 | 79.94 0.18 | 91.55 0.09 | 91.86 0.06 | 93.14 0.04 | |
UCI | 79.86 1.48 | 57.48 1.87 | 79.54 0.48 | 88.12 2.05 | 92.73 0.06 | 87.36 2.03 | 91.19 0.42 | 94.54 0.12 | |
Flights | 94.74 0.37 | 92.88 0.73 | 88.73 0.33 | 95.03 0.60 | 97.06 0.02 | 83.41 0.07 | 83.03 0.05 | 97.79 0.02 | |
Can. Parl. | 53.92 0.94 | 54.02 0.76 | 55.18 0.79 | 54.10 0.93 | 55.80 0.69 | 54.30 0.66 | 55.91 0.82 | 87.74 0.71 | |
US Legis. | 54.93 2.29 | 57.28 0.71 | 51.00 3.11 | 58.63 0.37 | 53.17 1.20 | 52.59 0.97 | 50.71 0.76 | 54.28 2.87 | |
UN Trade | 59.65 0.77 | 57.02 0.69 | 61.03 0.18 | 58.31 3.15 | 65.24 0.21 | 62.21 0.12 | 62.17 0.31 | 64.55 0.62 | |
UN Vote | 56.64 0.96 | 54.62 2.22 | 52.24 1.46 | 58.85 2.51 | 49.94 0.45 | 51.60 0.97 | 50.68 0.44 | 55.93 0.39 | |
Contact | 94.34 1.45 | 92.18 0.41 | 95.87 0.11 | 93.82 0.99 | 89.55 0.30 | 91.11 0.12 | 90.59 0.05 | 98.03 0.02 | |
Avg. Rank | 4.69 | 5.77 | 5.23 | 3.77 | 3.77 | 5.77 | 5.23 | 1.54 | |
hist | Wikipedia | 68.69 0.39 | 62.18 1.27 | 84.17 0.22 | 81.76 0.32 | 67.27 1.63 | 82.20 2.18 | 87.60 0.30 | 71.42 4.43 |
Reddit | 62.34 0.54 | 61.60 0.72 | 63.47 0.36 | 64.85 0.85 | 63.67 0.41 | 60.83 0.25 | 64.50 0.26 | 65.37 0.60 |
MOOC | 63.22 1.55 | 62.93 1.24 | 76.73 0.29 | 77.07 3.41 | 74.68 0.68 | 74.27 0.53 | 74.00 0.97 | 80.82 0.30 | |
LastFM | 70.39 4.31 | 71.45 1.76 | 76.27 0.25 | 66.65 6.11 | 71.33 0.47 | 65.78 0.65 | 76.42 0.22 | 76.35 0.52 | |
Enron | 65.86 3.71 | 62.08 2.27 | 61.40 1.31 | 62.91 1.16 | 60.70 0.36 | 67.11 0.62 | 72.37 1.37 | 67.07 0.62 | |
Social Evo. | 88.51 0.87 | 88.72 1.10 | 93.97 0.54 | 90.66 1.62 | 79.83 0.38 | 94.10 0.31 | 94.01 0.47 | 96.82 0.16 | |
UCI | 63.11 2.27 | 52.47 2.06 | 70.52 0.93 | 70.78 0.78 | 64.54 0.47 | 76.71 1.00 | 81.66 0.49 | 72.13 1.87 | |
Flights | 61.01 1.65 | 62.83 1.31 | 64.72 0.36 | 59.31 1.43 | 56.82 0.57 | 64.50 0.25 | 65.28 0.24 | 57.11 0.21 | |
Can. Parl. | 52.60 0.88 | 52.28 0.31 | 56.72 0.47 | 54.42 0.77 | 57.14 0.07 | 55.71 0.74 | 55.84 0.73 | 87.40 0.85 | |
US Legis. | 52.94 2.11 | 62.10 1.41 | 51.83 3.95 | 61.18 1.10 | 55.56 1.71 | 53.87 1.41 | 52.03 1.02 | 56.31 3.46 | |
UN Trade | 55.46 1.19 | 55.49 0.84 | 55.28 0.71 | 52.80 3.19 | 55.00 0.38 | 55.76 1.03 | 54.94 0.97 | 53.20 1.07 | |
UN Vote | 61.04 1.30 | 60.22 1.78 | 53.05 3.10 | 63.74 3.00 | 47.98 0.84 | 54.19 2.17 | 48.09 0.43 | 52.63 1.26 | |
Contact | 90.42 2.34 | 89.22 0.66 | 94.15 0.45 | 88.13 1.50 | 74.20 0.80 | 90.44 0.17 | 89.91 0.36 | 93.56 0.52 | |
Avg. Rank | 5.38 | 5.46 | 4.00 | 4.54 | 5.92 | 3.92 | 3.54 | 3.23 | |
ind | Wikipedia | 68.70 0.39 | 62.19 1.28 | 84.17 0.22 | 81.77 0.32 | 67.24 1.63 | 82.20 2.18 | 87.60 0.29 | 71.42 4.43 |
Reddit | 62.32 0.54 | 61.58 0.72 | 63.40 0.36 | 64.84 0.84 | 63.65 0.41 | 60.81 0.26 | 64.49 0.25 | 65.35 0.60 |
MOOC | 63.22 1.55 | 62.92 1.24 | 76.72 0.30 | 77.07 3.40 | 74.69 0.68 | 74.28 0.53 | 73.99 0.97 | 80.82 0.30 | |
LastFM | 70.39 4.31 | 71.45 1.75 | 76.28 0.25 | 69.46 4.65 | 71.33 0.47 | 65.78 0.65 | 76.42 0.22 | 76.35 0.52 | |
Enron | 65.86 3.71 | 62.08 2.27 | 61.40 1.30 | 62.90 1.16 | 60.72 0.36 | 67.11 0.62 | 72.37 1.38 | 67.07 0.62 | |
Social Evo. | 88.51 0.87 | 88.72 1.10 | 93.97 0.54 | 90.65 1.62 | 79.83 0.39 | 94.10 0.32 | 94.01 0.47 | 96.82 0.17 | |
UCI | 63.16 2.27 | 52.47 2.09 | 70.49 0.93 | 70.73 0.79 | 64.54 0.47 | 76.65 0.99 | 81.64 0.49 | 72.13 1.86 | |
Flights | 61.01 1.66 | 62.83 1.31 | 64.72 0.37 | 59.32 1.45 | 56.82 0.56 | 64.50 0.25 | 65.29 0.24 | 57.11 0.20 | |
Can. Parl. | 52.58 0.86 | 52.24 0.28 | 56.46 0.50 | 54.18 0.73 | 57.06 0.08 | 55.46 0.69 | 55.76 0.65 | 87.22 0.82 | |
US Legis. | 52.94 2.11 | 62.10 1.41 | 51.83 3.95 | 61.18 1.10 | 55.56 1.71 | 53.87 1.41 | 52.03 1.02 | 56.31 3.46 | |
UN Trade | 55.43 1.20 | 55.42 0.87 | 55.58 0.68 | 52.80 3.24 | 54.97 0.38 | 55.66 0.98 | 54.88 1.01 | 52.56 1.70 | |
UN Vote | 61.17 1.33 | 60.29 1.79 | 53.08 3.10 | 63.71 2.97 | 48.01 0.82 | 54.13 2.16 | 48.10 0.40 | 52.61 1.25 | |
Contact | 90.43 2.33 | 89.22 0.65 | 94.14 0.45 | 88.12 1.50 | 74.19 0.81 | 90.43 0.17 | 89.91 0.36 | 93.55 0.52 | |
Avg. Rank | 5.31 | 5.54 | 3.85 | 4.38 | 5.85 | 3.92 | 3.46 | 3.31 |
NSS | Datasets | JODIE | DyRep | TGAT | TGN | CAWN | TCL | GraphMixer | DyGFormer |
---|---|---|---|---|---|---|---|---|---|
rnd | Wikipedia | 94.33 0.27 | 91.49 0.45 | 95.90 0.09 | 97.72 0.03 | 98.03 0.04 | 95.57 0.20 | 96.30 0.04 | 98.48 0.03 |
Reddit | 96.52 0.13 | 96.05 0.12 | 96.98 0.04 | 97.39 0.07 | 98.42 0.02 | 93.80 0.07 | 94.97 0.05 | 98.71 0.01 |
MOOC | 83.16 1.30 | 84.03 0.49 | 86.84 0.17 | 91.24 0.99 | 81.86 0.25 | 81.43 0.19 | 82.77 0.24 | 87.62 0.51 | |
LastFM | 81.13 3.39 | 82.24 1.51 | 76.99 0.29 | 82.61 3.15 | 87.82 0.12 | 70.84 0.85 | 80.37 0.18 | 94.08 0.08 | |
Enron | 81.96 1.34 | 76.34 4.20 | 64.63 1.74 | 78.83 1.11 | 87.02 0.50 | 72.33 0.99 | 76.51 0.71 | 90.69 0.26 | |
Social Evo. | 93.70 0.29 | 91.18 0.49 | 93.41 0.19 | 93.43 0.59 | 84.73 0.27 | 93.71 0.18 | 94.09 0.07 | 95.29 0.03 | |
UCI | 78.80 0.94 | 58.08 1.81 | 77.64 0.38 | 86.68 2.29 | 90.40 0.11 | 84.49 1.82 | 89.30 0.57 | 92.63 0.13 | |
Flights | 95.21 0.32 | 93.56 0.70 | 88.64 0.35 | 95.92 0.43 | 96.86 0.02 | 82.48 0.01 | 82.27 0.06 | 97.80 0.02 | |
Can. Parl. | 53.81 1.14 | 55.27 0.49 | 56.51 0.75 | 55.86 0.75 | 58.83 1.13 | 55.83 1.07 | 58.32 1.08 | 89.33 0.48 | |
US Legis. | 58.12 2.35 | 61.07 0.56 | 48.27 3.50 | 62.38 0.48 | 51.49 1.13 | 50.43 1.48 | 47.20 0.89 | 53.21 3.04 | |
UN Trade | 62.28 0.50 | 58.82 0.98 | 62.72 0.12 | 59.99 3.50 | 67.05 0.21 | 63.76 0.07 | 63.48 0.37 | 67.25 1.05 | |
UN Vote | 58.13 1.43 | 55.13 3.46 | 51.83 1.35 | 61.23 2.71 | 48.34 0.76 | 50.51 1.05 | 50.04 0.86 | 56.73 0.69 | |
Contact | 95.37 0.92 | 91.89 0.38 | 96.53 0.10 | 94.84 0.75 | 89.07 0.34 | 93.05 0.09 | 92.83 0.05 | 98.30 0.02 | |
Avg. Rank | 4.69 | 5.85 | 5.31 | 3.38 | 4.00 | 6.00 | 5.31 | 1.46 | |
hist | Wikipedia | 61.86 0.53 | 57.54 1.09 | 78.38 0.20 | 75.75 0.29 | 62.04 0.65 | 79.79 0.96 | 82.87 0.21 | 68.33 2.82 |
Reddit | 61.69 0.39 | 60.45 0.37 | 64.43 0.27 | 64.55 0.50 | 64.94 0.21 | 61.43 0.26 | 64.27 0.13 | 64.81 0.25 |
MOOC | 64.48 1.64 | 64.23 1.29 | 74.08 0.27 | 77.69 3.55 | 71.68 0.94 | 69.82 0.32 | 72.53 0.84 | 80.77 0.63 | |
LastFM | 68.44 3.26 | 68.79 1.08 | 69.89 0.28 | 66.99 5.62 | 67.69 0.24 | 55.88 1.85 | 70.07 0.20 | 70.73 0.37 | |
Enron | 65.32 3.57 | 61.50 2.50 | 57.84 2.18 | 62.68 1.09 | 62.25 0.40 | 64.06 1.02 | 68.20 1.62 | 65.78 0.42 | |
Social Evo. | 88.53 0.55 | 87.93 1.05 | 91.87 0.72 | 92.10 1.22 | 83.54 0.24 | 93.28 0.60 | 93.62 0.35 | 96.91 0.09 | |
UCI | 60.24 1.94 | 51.25 2.37 | 62.32 1.18 | 62.69 0.90 | 56.39 0.10 | 70.46 1.94 | 75.98 0.84 | 65.55 1.01 | |
Flights | 60.72 1.29 | 61.99 1.39 | 63.38 0.26 | 59.66 1.04 | 56.58 0.44 | 63.48 0.23 | 63.30 0.19 | 56.05 0.21 | |
Can. Parl. | 51.62 1.00 | 52.38 0.46 | 58.30 0.61 | 55.64 0.54 | 60.11 0.48 | 57.30 1.03 | 56.68 1.20 | 88.68 0.74 | |
US Legis. | 58.12 2.94 | 67.94 0.98 | 49.99 4.88 | 64.87 1.65 | 54.41 1.31 | 52.12 2.13 | 49.28 0.86 | 56.57 3.22 | |
UN Trade | 58.73 1.19 | 57.90 1.33 | 59.74 0.59 | 55.61 3.54 | 60.95 0.80 | 61.12 0.97 | 59.88 1.17 | 58.46 1.65 | |
UN Vote | 65.16 1.28 | 63.98 2.12 | 51.73 4.12 | 68.59 3.11 | 48.01 1.77 | 54.66 2.11 | 45.49 0.42 | 53.85 2.02 | |
Contact | 90.80 1.18 | 88.88 0.68 | 93.76 0.41 | 88.84 1.39 | 74.79 0.37 | 90.37 0.16 | 90.04 0.29 | 94.14 0.26 | |
Avg. Rank | 5.08 | 6.00 | 4.23 | 4.54 | 5.38 | 4.00 | 3.69 | 3.08 | |
ind | Wikipedia | 61.87 0.53 | 57.54 1.09 | 78.38 0.20 | 75.76 0.29 | 62.02 0.65 | 79.79 0.96 | 82.88 0.21 | 68.33 2.82 |
Reddit | 61.69 0.39 | 60.44 0.37 | 64.39 0.27 | 64.55 0.50 | 64.91 0.21 | 61.36 0.26 | 64.27 0.13 | 64.80 0.25 |
MOOC | 64.48 1.64 | 64.22 1.29 | 74.07 0.27 | 77.68 3.55 | 71.69 0.94 | 69.83 0.32 | 72.52 0.84 | 80.77 0.63 | |
LastFM | 68.44 3.26 | 68.79 1.08 | 69.89 0.28 | 66.99 5.61 | 67.68 0.24 | 55.88 1.85 | 70.07 0.20 | 70.73 0.37 | |
Enron | 65.32 3.57 | 61.50 2.50 | 57.83 2.18 | 62.68 1.09 | 62.27 0.40 | 64.05 1.02 | 68.19 1.63 | 65.79 0.42 | |
Social Evo. | 88.53 0.55 | 87.93 1.05 | 91.88 0.72 | 92.10 1.22 | 83.54 0.24 | 93.28 0.60 | 93.62 0.35 | 96.91 0.09 | |
UCI | 60.27 1.94 | 51.26 2.40 | 62.29 1.17 | 62.66 0.91 | 56.39 0.11 | 70.42 1.93 | 75.97 0.85 | 65.58 1.00 | |
Flights | 60.72 1.29 | 61.99 1.39 | 63.40 0.26 | 59.66 1.05 | 56.58 0.44 | 63.49 0.23 | 63.32 0.19 | 56.05 0.22 | |
Can. Parl. | 51.61 0.98 | 52.35 0.52 | 58.15 0.62 | 55.43 0.42 | 60.01 0.47 | 56.88 0.93 | 56.63 1.09 | 88.51 0.73 | |
US Legis. | 58.12 2.94 | 67.94 0.98 | 49.99 4.88 | 64.87 1.65 | 54.41 1.31 | 52.12 2.13 | 49.28 0.86 | 56.57 3.22 | |
UN Trade | 58.71 1.20 | 57.87 1.36 | 59.98 0.59 | 55.62 3.59 | 60.88 0.79 | 61.01 0.93 | 59.71 1.17 | 57.28 3.06 | |
UN Vote | 65.29 1.30 | 64.10 2.10 | 51.78 4.14 | 68.58 3.08 | 48.04 1.76 | 54.65 2.20 | 45.57 0.41 | 53.87 2.01 | |
Contact | 90.80 1.18 | 88.87 0.67 | 93.76 0.40 | 88.85 1.39 | 74.79 0.38 | 90.37 0.16 | 90.04 0.29 | 94.14 0.26 | |
Avg. Rank | 5.08 | 5.92 | 4.15 | 4.54 | 5.38 | 4.00 | 3.77 | 3.15 |
C.3 Results on Dynamic Node Classification
Table 15 shows the results of various methods for dynamic node classification on Wikipedia and Reddit (the only two datasets with dynamic labels).
Methods | Wikipedia | Reddit | Avg. Rank |
---|---|---|---|
JODIE | 88.99 1.05 | 60.37 2.58 | 4.50 |
DyRep | 86.39 0.98 | 63.72 1.32 | 5.00 |
TGAT | 84.09 1.27 | 70.04 1.09 | 4.00 |
TGN | 86.38 2.34 | 63.27 0.90 | 6.00 |
CAWN | 84.88 1.33 | 66.34 1.78 | 5.00 |
TCL | 77.83 2.13 | 68.87 2.15 | 5.00 |
GraphMixer | 86.80 0.79 | 64.22 3.32 | 4.00 |
DyGFormer | 87.44 1.08 | 68.00 1.74 | 2.50 |
C.4 Complete Results of TCL and GraphMixer with Neighbor Co-occurrence Encoding
We show the complete results of TCL and GraphMixer with our neighbor co-occurrence encoding scheme in Table 16.
Datasets | TCL Original | TCL w/ NCoE | TCL Improv. | GraphMixer Original | GraphMixer w/ NCoE | GraphMixer Improv. | DyGFormer
---|---|---|---|---|---|---|---
Wikipedia | 96.47 | 99.09 | 2.72% | 97.25 | 97.90 | 0.67% | 99.03 |
Reddit | 97.53 | 99.04 | 1.55% | 97.31 | 97.63 | 0.33% | 99.22 |
MOOC | 82.38 | 86.92 | 5.51% | 82.78 | 83.58 | 0.97% | 87.52 |
LastFM | 67.27 | 84.02 | 24.90% | 75.61 | 76.48 | 1.15% | 93.00 |
Enron | 79.70 | 90.18 | 13.15% | 82.25 | 88.83 | 8.00% | 92.47 |
Social Evo. | 93.13 | 94.06 | 1.00% | 93.37 | 94.37 | 1.07% | 94.73 |
UCI | 89.57 | 94.69 | 5.72% | 93.25 | 93.48 | 0.25% | 95.79 |
Flights | 91.23 | 97.71 | 7.10% | 90.99 | 96.90 | 6.50% | 98.91 |
Can. Parl. | 68.67 | 69.34 | 0.98% | 77.04 | 76.38 | -0.86% | 97.36 |
US Legis. | 69.59 | 69.47 | -0.17% | 70.74 | 70.26 | -0.68% | 71.11 |
UN Trade | 62.21 | 63.46 | 2.01% | 62.61 | 62.77 | 0.26% | 66.46 |
UN Vote | 51.90 | 51.52 | -0.73% | 52.11 | 52.13 | 0.04% | 55.55 |
Contact | 92.44 | 97.98 | 5.99% | 91.92 | 97.94 | 6.55% | 98.29 |
Avg. Rank | 4.62 | 2.62 | — | 3.85 | 2.85 | — | 1.08 |
C.5 Results of Training Time and Memory Usage Comparisons
Table 17 shows the comparisons of running time and memory usage of DyGFormer with and without the patching technique during the training process.
Datasets | Sequence Length | Metrics | DyGFormer w/ Patching | DyGFormer w/o Patching | Ratio (w/o ÷ w/)
---|---|---|---|---|---
LastFM | 64 | running time | 13min 58s | 21min 01s | 1.50
| | memory usage | 3,945 MB | 7,953 MB | 2.02
| 128 | running time | 16min 54s | 45min 29s | 2.69
| | memory usage | 4,677 MB | 7,585 MB | 1.62
| 256 | running time | 24min 41s | 2h 5min 50s | 5.10
| | memory usage | 4,635 MB | 14,583 MB | 3.15
| 512 | running time | 37min 04s | — | —
| | memory usage | 7,547 MB | OOM | —
Can. Parl. | 64 | running time | 58s | 1min 14s | 1.28
| | memory usage | 2,121 MB | 3,263 MB | 1.54
| 128 | running time | 1min 02s | 2min 39s | 2.56
| | memory usage | 2,369 MB | 6,417 MB | 2.71
| 256 | running time | 1min 26s | 6min 37s | 4.62
| | memory usage | 2,855 MB | 14,923 MB | 5.23
| 512 | running time | 1min 57s | — | —
| | memory usage | 4,511 MB | OOM | —
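The patching technique compared in Table 17 can be sketched as follows: assuming a time-ordered sequence of per-interaction encodings, every group of consecutive interactions is flattened into one Transformer token, so the attention cost depends on the number of patches rather than the full sequence length. This is our own minimal illustration, not DyGFormer's exact implementation.

```python
import torch

# Split a length-L sequence of d-dimensional encodings into L / P patches of P
# consecutive interactions, flattening each patch into a single token.
def patch_sequence(sequence: torch.Tensor, patch_size: int) -> torch.Tensor:
    seq_len, dim = sequence.shape
    assert seq_len % patch_size == 0, "sequence length must be divisible by patch size"
    num_patches = seq_len // patch_size
    # (L, d) -> (L / P, P * d): each row is one flattened patch / Transformer token
    return sequence.reshape(num_patches, patch_size * dim)


# Usage: a length-256 sequence with patch size 8 becomes 32 tokens of dimension 8 * d.
encodings = torch.randn(256, 172)
tokens = patch_sequence(encodings, patch_size=8)
assert tokens.shape == (32, 8 * 172)
```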
C.6 Results and Discussions of Ablation Study
We validate the effectiveness of several designs in DyGFormer via an ablation study, including the usage of Neighbor Co-occurrence Encoding (NCoE), the usage of Time Encoding (TE), and the Mixing of the sequence of Source node and Destination node (MixSD). We respectively remove these modules and denote the remaining parts as w/o NCoE, w/o TE, and w/o MixSD. We also Separately encode the Neighbor Occurrence in the source node’s or destination node’s sequence and denote this variant as w/ SepNO. We report the performance of different variants on MOOC, Social Evo., UCI, and UN Trade datasets from four domains in Figure 4.

Our findings from Figure 4 indicate that DyGFormer achieves the best performance when all components are utilized, and removing any of them leads to worse results. In particular, the neighbor co-occurrence encoding scheme has the most significant impact on performance as it effectively captures correlations between nodes. Encoding neighbor occurrences (even when done separately for the two sequences) and encoding temporal information also improve performance. Mixing the sequences of the source node and destination node brings relatively minor improvements because the neighbor co-occurrence encoding scheme already explores node correlations, which is also the goal of the mixing operation. These results demonstrate the necessity of each designed component.
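To make the neighbor co-occurrence idea concrete, the sketch below computes, for every neighbor in the source or destination sequence, its appearance counts in both sequences, so neighbors shared by the two nodes stand out. This is one plausible form of the encoding for illustration only; DyGFormer's actual feature construction may differ.

```python
from collections import Counter

# For each neighbor in a node's historical sequence, record how often it appears in
# the source sequence and in the destination sequence (appearance-count features).
def cooccurrence_features(src_neighbors: list, dst_neighbors: list):
    src_counts, dst_counts = Counter(src_neighbors), Counter(dst_neighbors)
    src_feats = [(src_counts[n], dst_counts[n]) for n in src_neighbors]
    dst_feats = [(src_counts[n], dst_counts[n]) for n in dst_neighbors]
    return src_feats, dst_feats


# Usage: neighbor 2 appears in both histories, so it gets non-zero counts on both sides.
src_hist, dst_hist = [1, 2, 2, 3], [2, 4]
print(cooccurrence_features(src_hist, dst_hist))
# ([(1, 0), (2, 1), (2, 1), (1, 0)], [(2, 1), (0, 1)])
```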
C.7 Results on Temporal Graph Benchmark
We also evaluate DyGFormer and baselines on the recently proposed Temporal Graph Benchmark (TGB) [23], which contains a collection of challenging and diverse benchmark datasets. We aim to verify the superiority and scalability of DyGFormer on TGB since it covers small, medium, and large datasets. We only present the analysis here. Please refer to [23] and our other work [68] for details of the datasets, tasks, evaluation metrics, experimental settings, and complete quantitative results.
Superiority of DyGFormer. For the dynamic link property prediction task, DyGFormer ranks first on the TGB leaderboard (https://tgb.complexdatalab.com/docs/leader_linkprop/) on tgbl-wiki-v2. It ranks second/third on tgbl-coin-v2/tgbl-comment and performs moderately on tgbl-review-v2. We notice that the surprise indices (defined as the ratio of test links that are not seen during training) on tgbl-wiki-v2 and tgbl-coin-v2 are 0.108 and 0.120, which indicates that nodes tend to interact repeatedly. Such a property can be well handled by the neighbor co-occurrence encoding scheme in DyGFormer, leading to excellent performance. Correspondingly, the surprise indices on tgbl-review-v2 and tgbl-comment are 0.987 and 0.823, showing that these two datasets contain many new links. This characteristic may violate the motivation of our neighbor co-occurrence encoding scheme and bring suboptimal performance. For the dynamic node property prediction task, DyGFormer also performs better than the other methods with trainable parameters, but it is still worse than the simple non-trainable Persistent Forecast and Moving Average methods on the TGB leaderboard (https://tgb.complexdatalab.com/docs/leader_nodeprop/). This observation demonstrates the necessity of designing customized models for the dynamic node property prediction task.
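The surprise index used above can be computed as in the following minimal sketch (links treated as source-destination pairs; the function name is our own).

```python
# Surprise index: the fraction of test links whose (source, destination) pair
# never appears in the training split.
def surprise_index(train_links, test_links) -> float:
    seen = {(src, dst) for src, dst in train_links}
    unseen = sum(1 for link in test_links if link not in seen)
    return unseen / len(test_links)


# Usage: two of the three test links are new, so the surprise index is about 0.67.
train = [(0, 1), (1, 2)]
test = [(0, 1), (2, 3), (4, 5)]
print(round(surprise_index(train, test), 2))  # 0.67
```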
Scalability of DyGFormer. DyGFormer can be scaled up to larger datasets that contain hundreds of thousands of nodes and tens of millions of links (e.g., tgbl-coin-v2, tgbl-comment, tgbn-genre, and tgbn-reddit) while many baselines (e.g., JODIE, DyRep, TGN, and CAWN) fail. This demonstrates the good scalability of DyGFormer.