
BenchTemp: A General Benchmark for Evaluating Temporal Graph Neural Networks

Qiang Huang1   Jiawei Jiang1   Susie Xi Rao2   Ce Zhang2   Zhichao Han3   Zitao Zhang3
Xin Wang1   Yongjun He2  Quanqing Xu4   Yang Zhao3   Chuang Hu1   Shuo Shang5   Bo Du1
1School of Computer Science, Wuhan University  2ETH Zürich  3eBay
4OceanBase, Ant Group  5University of Electronic Science and Technology of China
1[email protected], 1{jiawei.jiang,tbbdfl,handc,dubo}@whu.edu.cn
2{raox,ce.zhang,yongjun.he}@inf.ethz.ch, 3{zhihan,zitzhang,yzhao5}@ebay.com
4[email protected], 5[email protected]
Abstract

To handle graphs in which features or connectivity evolve over time, a series of temporal graph neural networks (TGNNs) have been proposed. Despite the success of these TGNNs, previous TGNN evaluations reveal several limitations regarding four critical issues: 1) inconsistent datasets, 2) inconsistent evaluation pipelines, 3) lacking workload diversity, and 4) lacking efficient comparison. Overall, there is a lack of empirical studies that put TGNN models on the same ground and compare them comprehensively. To this end, we propose BenchTemp, a general benchmark for evaluating TGNN models on various workloads. BenchTemp provides a set of benchmark datasets so that different TGNN models can be fairly compared. Further, BenchTemp engineers a standard pipeline that unifies the TGNN evaluation. With BenchTemp, we extensively compare the representative TGNN models on different tasks (e.g., link prediction and node classification) and settings (transductive and inductive), w.r.t. both effectiveness and efficiency metrics. We have made BenchTemp publicly available at https://github.com/qianghuangwhu/benchtemp.

1 Introduction

Temporal graphs provide a natural abstraction for many real-world systems in which the topology and node/edge attributes change over time. Often, a temporal graph can be modeled as an interaction stream, and capturing the latent evolution pattern is critical in many domains, such as social networks, transportation systems, and biology. Recently, researchers have proposed a range of temporal graph neural network (TGNN) models, e.g., JODIE [1], DyRep [2], TGN [3], TGAT [4], CAWN [5], NeurTW [6], and NAT [7]. As shown in Table 1, these models introduce different techniques, such as memory, attention, RNNs, and temporal walks, to detect temporal changes and update node representations.

Limitations of Prior Works. In the presence of different TGNN paradigms, it is therefore meaningful to compare their performance on various tasks. Although each of the above methods provides an evaluation study, none of them covers a wide spectrum of the literature. A few pioneering works have tried to compare these models [8]; however, if we look at those TGNN benchmarks, we find that:

The scope of current benchmarks is limited to specific tasks. Worse, their evaluations of the existing TGNN models show contradictory and inconsistent results, mainly due to the lack of standard datasets and a standard processing pipeline.

Table 1: Anatomy of state-of-the-art TGNN models. Memory, Attention, RNN, and TempWalk indicate whether the model adopts memory modules, attention mechanism, RNN, and temporal walk strategy, respectively. Scalability means whether the model is claimed to be scalable. LP and NC indicate whether the corresponding work implements link prediction and node classification tasks, respectively. # Datasets and # Features represent the number of evaluated datasets and the number of initial node features. Supervised indicates whether the method is supervised, unsupervised, or self(semi)-supervised.
Model # Datasets # Features Supervised
JODIE 4 128 self (semi)-supervised
DyRep 2 172 unsupervised
TGN 4 172 self (semi)-supervised
TGAT 3 172 self (semi)-supervised
NAT 8 172, 32, 2 self-supervised
CAWN 6 172, 32, 2 self-supervised
NeurTW 6 172, 32, 4 self (semi)-supervised

For example, EdgeBank [8] compares a set of TGNNs only on the link prediction task and does not unify the data loaders (e.g., train/validation/test split and inductive masking); some works choose different dimensions of initial node features for the evaluated temporal graphs [5, 6, 7]; some works use different seeds for edge samplers [3, 4, 5]. Without unifying the datasets and the evaluation pipeline, previous works unsurprisingly report different relative performance for the same set of TGNNs [3, 4, 5, 6, 7, 8].

Our Goals. Despite the efforts of previous works, the landscape of TGNNs has not been fully explored. We aim to achieve the following goals, which are vital, yet unsolved, for a mature TGNN benchmark.

  • Benchmark datasets. A vital reason for the inconsistent results in previous work is that the temporal graph models are not compared using the same datasets. The original graph is preprocessed in different manners, and the temporal order is handled from different perspectives. Without standard datasets and their preprocessing steps, it is not possible to compare different TGNN models fairly.

  • A unified pipeline. The execution of a TGNN model naturally contains multiple phases, including but not limited to data loader, graph sampler, trainer, etc. Unfortunately, there is no benchmark library that establishes a general processing pipeline for temporal graphs.

  • Diverse workloads. Temporal graphs evolve over time, and therefore there exist diverse workloads. For example, there are transductive and inductive tasks, and the inductive task has New-Old and New-New scenarios (see Section 3.2.1). A solid benchmark should include diverse workloads.

  • Efficiency comparison. In addition to model quality, it is also indispensable to compare the efficiency performance of models to help us understand their performance trade-offs. However, this issue has not been well studied in the literature.

In this work, we conduct a comprehensive benchmark of state-of-the-art TGNN models. To avoid apples-to-oranges comparisons, we first study how to construct standard datasets, and then study how to build a framework that evaluates various TGNN models in a unified pipeline. The major contributions of this work are summarized below.

  1. We present BenchTemp, the first general benchmark for evaluating temporal graph neural network (TGNN) models over a wide range of tasks and settings.

  2. We construct a set of benchmark temporal graph datasets and a unified benchmark pipeline. In this way, we standardize the entire lifecycle of benchmarking TGNNs.

  3. On top of BenchTemp, we extensively compare representative TGNN models on the benchmark datasets, regarding different tasks, settings, metrics, and efficiency.

  4. We thoroughly discuss the empirical results and draw insights for TGNN research. Further, we have released the datasets, a PyPI package, and a leaderboard. Please refer to our GitHub project and appendix for details.

Figure 1: Overview of BenchTemp.

2 Anatomy of Current TGNN Models

Before presenting our benchmark, we revisit the current TGNN models and investigate the following questions: How do the existing models handle temporal graphs? Are there problems and inconsistencies in their processing procedures?

We consider seven state-of-the-art TGNN models — JODIE, DyRep, TGN, TGAT, NeurTW, CAWN, and NAT [1, 2, 3, 4, 5, 6, 7]. We summarize the anatomy results of these models in Table 1. JODIE [1] models the dynamic evolution of user-item interactions through joint user and item embeddings, which are updated by two RNNs. Furthermore, JODIE trains a temporal attention layer to predict trajectories. DyRep [2] implements a temporal attention mechanism to encode the evolution of temporal structure information into interleaved and nonlinear dynamics over time. TGN [3] establishes a memory module to memorize node representations, and uses RNN and graph operators to update the memory. TGAT [4] utilizes self-attention to aggregate temporal-topological structure, and learns the temporal evolution of node embeddings through a continuous-time encoding. CAWN [5] samples a set of causal anonymous walks (CAWs) by temporal random walks, retrieves motifs to represent network dynamics, and encodes and aggregates these CAWs via RNNs. NeurTW [6] introduces a spatio-temporal random walk to retrieve representative motifs and irregularly-sampled temporal nodes, and then learns time-aware node representations. NAT [7] constructs structural features using joint neighborhoods of different nodes, and designs a data structure to support parallel access and update of node representations.

The Landscape of TGNNs. As seen in Table 1, the aforementioned models choose different techniques and perform different tasks (some do not implement the node classification task [2, 5, 7, 8]). Regarding the datasets, they choose different dimensions (e.g., 2, 4, 32, 128, and 172) for the initial node features, causing potentially inconsistent results. Regarding the negative edge sampler, both fixed [1, 2, 3] and random [4, 5, 6, 7] seeds are used. Regarding node visibility, JODIE, DyRep, TGN, TGAT, and NAT choose traditional transductive and inductive settings, while CAWN and NeurTW further consider inductive New-New and inductive New-Old settings. Besides, most models only report model performance metrics and neglect efficiency metrics.

3 Temporal Graph Benchmark

In this section, we describe our proposed BenchTemp, including the construction of benchmark datasets and the general benchmark pipeline, as shown in Figure 1.

3.1 Benchmark Datasets

Formalization. A temporal graph can be represented as an ordered sequence of temporal interactions. The $r$-th interaction $I_r = (u_r, i_r, t_r, e_r)$ happens at time $t_r$ between the source node $u_r$ and the destination node $i_r$ with edge feature $e_r$.
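Concretely, this abstraction is just a time-ordered list of interactions. The short Python sketch below illustrates the representation; the type and function names are ours for illustration and are not part of BenchTemp.

```python
from typing import List, NamedTuple

import numpy as np


class Interaction(NamedTuple):
    """One temporal interaction I_r = (u_r, i_r, t_r, e_r)."""
    u: int            # source node
    i: int            # destination node
    t: float          # timestamp
    e: np.ndarray     # edge feature vector


def as_interaction_stream(interactions: List[Interaction]) -> List[Interaction]:
    """A temporal graph is the sequence of interactions ordered by timestamp."""
    return sorted(interactions, key=lambda x: x.t)


# Toy example: three interactions on a small bipartite graph.
stream = as_interaction_stream([
    Interaction(u=1, i=3, t=10.0, e=np.zeros(172)),
    Interaction(u=2, i=3, t=5.0, e=np.zeros(172)),
    Interaction(u=1, i=4, t=7.5, e=np.zeros(172)),
])
```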

Table 2: Dataset statistics.
Dataset Domain # Nodes # Edges Avg. Degree (#Edges / #Nodes) Edge Density (#Edges / (#Users × #Items))
Reddit [1, 9] Social 10,984 672,447 61.22 0.06
Wikipedia [1] Social 9,227 157,474 17.07 0.01
MOOC [1, 10] Interaction 7,144 411,749 57.64 0.60
LastFM [1, 11] Interaction 1,980 1,293,103 653.08 1.32
Taobao [6, 12] E-commerce 82,566 77,436 0.94 5.55
Enron [13, 14] Social 184 125,235 680.63 3.76
SocialEvo [15] Proximity 74 2,099,519 28,371.88 405.31
UCI [16] Social 1,899 59,835 31.51 0.02
CollegeMsg [17] Social 1,899 59,834 31.51 0.02
CanParl [8, 18] Politics 734 74,478 101.47 0.42
Contact [8, 19] Proximity 692 2,426,279 3,506.18 5.31
Flights [8, 20] Transport 13,169 1,927,145 146.34 0.01
UNTrade [8, 21] Economics 255 507,497 1,990.18 7.84
USLegis [8, 18] Politics 225 60,396 268.43 1.19
UNVote [8, 22] Politics 201 1,035,742 5,152.95 25.6

Datasets. We choose a large set of popular temporal graph datasets from various domains [1, 6, 8], on which we build the benchmark datasets released under the CC BY-NC license. The statistics of these datasets are summarized in Table 2. Please refer to Appendix A for dataset details.

Benchmark Dataset Construction. BenchTemp standardizes the construction of temporal graph benchmark datasets with two steps: node feature initialization and node reindexing.

Figure 2: Link prediction results on the MOOC dataset with different dimensions for initial node features. (a) Transductive; (b) Inductive.

Node Feature Initialization. Prior works choose different dimensions for the initial node features. As shown in Figure 2, most TGNN models achieve higher ROC AUC as the node feature dimension increases from 4 to 172. Consequently, to address the discrepancy w.r.t. node feature dimension, we set it to 172 for all datasets, since it is the most commonly chosen value in prior works.

Figure 3: Node reindexing of temporal graphs. Node colors represent node types. (a) Heterogeneous graph; (b) Homogeneous graph.

Node Reindexing. In some of the original graphs, the node indices are not continuous, and the maximal node index is larger than the number of nodes.

As shown in Figure 3(a), for a heterogeneous temporal graph, we reindex the user and item nodes separately. The user indices are mapped to a continuous range starting from 1. Then, the item indices are reindexed in the same manner, starting from the maximal user index. As for a homogeneous temporal graph, the user and item nodes belong to the same type. Thus, BenchTemp concatenates the original user and item indices and reindexes them together, as shown in Figure 3(b). For the Taobao dataset, the size of the node feature matrix can be reduced from 5,162,993 × 172 to 82,566 × 172, a reduction of 62.53×.
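To make the two construction steps concrete, here is a minimal Python sketch of node reindexing and feature initialization. It is our own illustration under stated assumptions: the column names 'u' and 'i', the choice to start indices at 1, and the zero-initialization of the 172-dimensional node features are assumptions, not a description of the exact BenchTemp implementation.

```python
import numpy as np
import pandas as pd


def reindex_heterogeneous(df: pd.DataFrame) -> pd.DataFrame:
    """Map user ids to 1..#users and item ids to #users+1..#users+#items (Figure 3(a))."""
    user_map = {old: new for new, old in enumerate(sorted(df["u"].unique()), start=1)}
    item_map = {old: new for new, old in
                enumerate(sorted(df["i"].unique()), start=len(user_map) + 1)}
    out = df.copy()
    out["u"] = out["u"].map(user_map)
    out["i"] = out["i"].map(item_map)
    return out


def reindex_homogeneous(df: pd.DataFrame) -> pd.DataFrame:
    """Users and items share one node type, so reindex their union together (Figure 3(b))."""
    node_map = {old: new for new, old in
                enumerate(sorted(set(df["u"]).union(df["i"])), start=1)}
    out = df.copy()
    out["u"] = out["u"].map(node_map)
    out["i"] = out["i"].map(node_map)
    return out


def init_node_features(num_nodes: int, dim: int = 172) -> np.ndarray:
    """Zero-initialized node features (assumed scheme); row 0 is kept as padding."""
    return np.zeros((num_nodes + 1, dim), dtype=np.float32)
```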

3.2 BenchTemp Pipeline

BenchTemp presents a general pipeline for TGNNs and standardizes the entire lifecycle. There are seven major modules in BenchTemp: Dataset, DataLoader, EdgeSampler, Evaluator, Model, EarlyStopMonitor, and Leaderboard. Users can also implement their own modules using our library.

3.2.1 Link Prediction

Figure 4: Pipelines for link prediction and node classification tasks.

Figure 4 showcases a pipeline for the link prediction task, which consists of all seven modules.

  • Dataset. Referring to Section 3.1, BenchTemp provides fifteen benchmark datasets. The Dataset module is also compatible with user-generated benchmark datasets.

  • DataLoader. The DataLoader module is responsible for data ingestion, preprocessing, and splitting. DataLoader standardizes the data splitting under both transductive and inductive settings. In transductive link prediction, DataLoader splits the temporal graph chronologically into 70%-15%-15% train, validation, and test sets according to edge timestamps. In inductive link prediction, DataLoader performs the same split as the transductive setting and randomly masks 10% of the nodes as unseen nodes. Any edges associated with these unseen nodes are removed from the training set. To reflect different inductive scenarios, DataLoader further generates three inductive test sets from the transductive test set by filtering edges in different manners: Inductive selects edges with at least one unseen node; Inductive New-Old selects edges between a seen node and an unseen node; Inductive New-New selects edges between two unseen nodes. In fact, Inductive New-Old and Inductive New-New together form Inductive; that is, Inductive = New-Old ∪ New-New (see the code sketch after this list).

  • EdgeSampler. Since link prediction is a self-supervised task, EdgeSampler with a seed is used to randomly sample a set of negative edges.

  • Model. A TGNN model is created to perform training. BenchTemp has implemented seven TGNN models (see Section 2). The users can also implement their own models using our library.

  • EarlyStopMonitor. BenchTemp provides a unified EarlyStopMonitor to improve training efficiency and save resources. We invoke early stopping with a patience of 3 and a tolerance of $10^{-3}$.

  • Evaluator. Different evaluation metrics are available, including Area Under the Receiver Operating Characteristic Curve (ROC AUC) and Average Precision (AP).

  • Leaderboard. The results of the pipeline are pushed to the leaderboard, where the users can see the overall performance in the literature.
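To make the module flow above concrete, the sketch below implements the DataLoader's chronological 70%-15%-15% split and the inductive masking and filtering described earlier. It is a simplified illustration rather than the exact BenchTemp API; the column names 'u', 'i', and 'ts' and the way the 10% unseen nodes are drawn are assumptions.

```python
import numpy as np
import pandas as pd


def split_link_prediction(df: pd.DataFrame, seed: int = 0):
    """Chronological 70%-15%-15% split plus inductive masking and filtering (sketch)."""
    val_t, test_t = np.quantile(df["ts"], [0.70, 0.85])
    train = df[df["ts"] <= val_t]
    val = df[(df["ts"] > val_t) & (df["ts"] <= test_t)]
    test = df[df["ts"] > test_t]

    # Randomly mask 10% of the nodes as unseen and drop their edges from training.
    rng = np.random.default_rng(seed)
    all_nodes = np.union1d(df["u"].unique(), df["i"].unique())
    unseen = set(rng.choice(all_nodes, size=int(0.1 * len(all_nodes)), replace=False))
    train = train[~(train["u"].isin(unseen) | train["i"].isin(unseen))]

    # Inductive = New-Old ∪ New-New.
    has_unseen = test["u"].isin(unseen) | test["i"].isin(unseen)
    both_unseen = test["u"].isin(unseen) & test["i"].isin(unseen)
    inductive = test[has_unseen]               # at least one unseen endpoint
    new_new = test[both_unseen]                # both endpoints unseen
    new_old = test[has_unseen & ~both_unseen]  # exactly one unseen endpoint
    return train, val, test, inductive, new_old, new_new
```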

3.2.2 Node Classification

The pipeline for node classification consists of six modules, excluding the EdgeSampler.

  • Dataset. Due to the presence of node labels, only three datasets (Reddit, Wikipedia, and MOOC) are available for the node classification task.

  • DataLoader. The DataLoader module sorts edges and splits the input dataset (70%-15%-15%) according to edge timestamps.

  • Model. Some baseline models, such as CAWN, NeurTW, and NAT, have not implemented the node classification task. In BenchTemp, we have made substantial efforts to implement the node classification task for all seven TGNN models.

  • EarlyStopMonitor, Evaluator & Leaderboard. These modules are the same as in the link prediction pipeline.

4 Experiments

4.1 Experimental setup

Environment. All experiments were run on an Ubuntu 20.04 server with two 32-core 2.90 GHz Xeon Platinum 8375C CPUs, 256 GB of RAM, and four Nvidia 4090 24 GB GPUs.

Datasets. We evaluate fifteen datasets. Please refer to Section 3.1, Appendix A and B for details.

Model Implementations. It is nontrivial to implement the chosen TGNN models since there are inconsistencies between their original implementations. Please refer to Appendix C for details.

Tasks. We run both the link prediction and node classification tasks. For link prediction, we run the transductive, inductive, inductive New-Old, and inductive New-New settings (see Section 3.2.1). For node classification, we run the traditional transductive setting introduced in Section 3.2.2.

Evaluation Metrics. We use the Evaluator module, choosing AUC and AP for the link prediction task, and AUC for the node classification task, following the prior works [1, 2, 3, 4, 5, 6, 7]. In addition, we report efficiency metrics, as shown in Table 4.

Protocol. We run each job three times (unless timed out) and report the mean and standard deviation. We use an EarlyStopMonitor with a patience of 3 and a tolerance of $10^{-3}$, and set a timeout of 48 hours. We choose binary cross entropy loss and the Adam optimizer [23] to train the models. For Adam, we set the learning rate to $10^{-4}$ and use default values for the other hyperparameters.
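The protocol translates directly into a few lines of PyTorch. The sketch below shows the loss, optimizer, and early-stopping settings stated above; the EarlyStopMonitor here is a simplified stand-in for BenchTemp's module, and the validation metric is assumed to be "higher is better" (e.g., AUC or AP).

```python
import torch


class EarlyStopMonitor:
    """Simplified early stopping: patience 3, tolerance 1e-3, higher metric is better."""

    def __init__(self, patience: int = 3, tolerance: float = 1e-3):
        self.patience, self.tolerance = patience, tolerance
        self.best, self.num_bad_epochs = None, 0

    def should_stop(self, val_metric: float) -> bool:
        if self.best is None or val_metric > self.best + self.tolerance:
            self.best, self.num_bad_epochs = val_metric, 0
        else:
            self.num_bad_epochs += 1
        return self.num_bad_epochs >= self.patience


def make_training_objects(model: torch.nn.Module):
    """Binary cross entropy loss and Adam with lr = 1e-4; other hyperparameters default."""
    criterion = torch.nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    return criterion, optimizer, EarlyStopMonitor(patience=3, tolerance=1e-3)
```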

Table 3: ROC AUC results on the link prediction task. "*" denotes that the model encounters a runtime error; "—" denotes timeout after 48 hours. The best and second-best results are highlighted as bold red and underlined blue. Some standard deviations are zero because we terminate those models that can only run one epoch within 2 days. We do not highlight the second-best if the gap is $>0.05$ compared with the best result.
Transductive
Dataset Model JODIE DyRep TGN TGAT CAWN NeurTW NAT
Reddit 0.9760 ± 0.0006 0.9803 ± 0.0005 0.9871 ± 0.0001 0.981 ± 0.0002 0.9889 ± 0.0002 0.9841 ± 0.0016 0.9854 ± 0.002
Wikipedia 0.9505 ± 0.0032 0.9426 ± 0.0007 0.9846 ± 0.0003 0.9509 ± 0.0017 0.9889 ± 0.0 0.9912 ± 0.0001 0.9786 ± 0.0035
MOOC 0.7899 ± 0.0208 0.8243 ± 0.0323 0.8999 ± 0.0213 0.7391 ± 0.0056 0.9459 ± 0.0008 0.8071 ± 0.0193 0.7568 ± 0.0305
LastFM 0.6766 ± 0.0590 0.6793 ± 0.0553 0.7743 ± 0.0256 0.5094 ± 0.0071 0.8746 ± 0.0013 0.839 ± 0.0 0.8536 ± 0.0027
Enron 0.8293 ± 0.0148 0.7986 ± 0.0358 0.8621 ± 0.0173 0.6161 ± 0.0214 0.9159 ± 0.0032 0.8956 ± 0.0045 0.9212 ± 0.0029
SocialEvo 0.8666 ± 0.0233 0.9020 ± 0.0026 0.952 ± 0.0003 0.7851 ± 0.0047 0.9337 ± 0.0003 0.9202 ± 0.0065
UCI 0.8786 ± 0.0017 0.5086 ± 0.0651 0.8875 ± 0.0161 0.7998 ± 0.0052 0.9189 ± 0.0017 0.9670 ± 0.0031 0.9076 ± 0.0116
CollegeMsg 0.5730 ± 0.0690 0.5382 ± 0.0058 0.8419 ± 0.084 0.8084 ± 0.0032 0.9156 ± 0.004 0.9698 ± 0.0 0.9059 ± 0.0122
CanParl 0.7939 ± 0.0063 0.7737 ± 0.0255 0.7575 ± 0.0694 0.7077 ± 0.0218 0.7197 ± 0.0905 0.8920 ± 0.0173 0.6917 ± 0.0722
Contact 0.9379 ± 0.0073 0.9276 ± 0.0206 0.9769 ± 0.0032 0.5582 ± 0.009 0.9685 ± 0.0028 0.984 ± 0.0 0.9463 ± 0.021
Flights 0.9449 ± 0.0073 0.8981 ± 0.0056 0.9787 ± 0.0025 0.9016 ± 0.0027 0.9861 ± 0.0002 0.9302 ± 0.0 0.9747 ± 0.0061
UNTrade 0.6786 ± 0.0103 0.6377 ± 0.0032 0.6543 ± 0.01 * 0.7511 ± 0.0012 0.5924 ± 0.0368 0.783 ± 0.0472
USLegis 0.8278 ± 0.0024 0.7425 ± 0.0374 0.8137 ± 0.002 0.7738 ± 0.0062 0.9643 ± 0.0043 0.9715 ± 0.0009 0.782 ± 0.0261
UNVote 0.6523 ± 0.0082 0.6236 ± 0.0305 0.7176 ± 0.0109 0.5134 ± 0.0026 0.6037 ± 0.0019 0.5871 ± 0.0 0.6776 ± 0.0411
Taobao 0.8405 ± 0.0006 0.8409 ± 0.0012 0.8654 ± 0.0005 0.5396 ± 0.009 0.7708 ± 0.0026 0.8759 ± 0.0009 0.8937 ± 0.0015
Inductive
Reddit 0.9514 ± 0.0045 0.9583 ± 0.0004 0.976 ± 0.0002 0.9651 ± 0.0002 0.9868 ± 0.0002 0.9802 ± 0.0011 0.9906 ± 0.0034
Wikipedia 0.9305 ± 0.0020 0.9099 ± 0.0031 0.9781 ± 0.0005 0.9343 ± 0.0031 0.989 ± 0.0003 0.9904 ± 0.0002 0.9962 ± 0.0026
MOOC 0.7779 ± 0.0575 0.8269 ± 0.0182 0.8869 ± 0.0249 0.737 ± 0.006 0.9481 ± 0.0006 0.8045 ± 0.0224 0.7325 ± 0.0433
LastFM 0.8011 ± 0.0344 0.7990 ± 0.0444 0.8284 ± 0.0142 0.5196 ± 0.0144 0.9082 ± 0.0015 0.884 ± 0.0 0.9139 ± 0.0038
Enron 0.8038 ± 0.0223 0.7120 ± 0.0600 0.8159 ± 0.0231 0.5529 ± 0.015 0.9162 ± 0.0016 0.9051 ± 0.0027 0.952 ± 0.0057
SocialEvo 0.8963 ± 0.0228 0.9158 ± 0.0039 0.9244 ± 0.0084 0.6748 ± 0.0021 0.9298 ± 0.0002 0.896 ± 0.0164
UCI 0.7517 ± 0.0059 0.4297 ± 0.0428 0.8083 ± 0.0237 0.7024 ± 0.0048 0.9177 ± 0.0015 0.9686 ± 0.0031 0.9622 ± 0.0167
CollegeMsg 0.5097 ± 0.0306 0.4838 ± 0.0116 0.777 ± 0.0522 0.715 ± 0.0007 0.9163 ± 0.0038 0.9726 ± 0.0002 0.9603 ± 0.0173
CanParl 0.5012 ± 0.0155 0.5532 ± 0.0088 0.5727 ± 0.0268 0.5802 ± 0.0069 0.7154 ± 0.0967 0.8871 ± 0.0139 0.6214 ± 0.0734
Contact 0.9358 ± 0.0025 0.8650 ± 0.0431 0.952 ± 0.0056 0.5571 ± 0.0047 0.9691 ± 0.0031 0.9842 ± 0.0 0.9466 ± 0.0127
Flights 0.9218 ± 0.0094 0.8689 ± 0.0128 0.9519 ± 0.0043 0.8321 ± 0.0041 0.9834 ± 0.0001 0.9158 ± 0.0 0.9827 ± 0.003
UNTrade 0.6727 ± 0.0132 0.6467 ± 0.0112 0.5977 ± 0.014 * 0.7398 ± 0.0007 0.5915 ± 0.0328 0.6475 ± 0.0664
USLegis 0.5840 ± 0.0129 0.5980 ± 0.0097 0.6128 ± 0.0046 0.5568 ± 0.0078 0.9665 ± 0.0032 0.9708 ± 0.0009 0.7453 ± 0.0286
UNVote 0.5121 ± 0.0005 0.4993 ± 0.0103 0.5881 ± 0.0118 0.477 ± 0.0047 0.5911 ± 0.0006 0.586 ± 0.0 0.779 ± 0.0082
Taobao 0.701 ± 0.0013 0.7026 ± 0.0006 0.7017 ± 0.0026 0.5261 ± 0.0119 0.7737 ± 0.0027 0.8843 ± 0.0016 0.9992 ± 0.0002
Inductive New-Old
Reddit 0.9488 ± 0.0043 0.9549 ± 0.0029 0.9742 ± 0.0004 0.9639 ± 0.0004 0.9848 ± 0.0002 0.9789 ± 0.0017 0.9949 ± 0.0017
Wikipedia 0.9084 ± 0.0043 0.8821 ± 0.0031 0.9703 ± 0.0008 0.9178 ± 0.0023 0.9886 ± 0.0002 0.9878 ± 0.0002 0.9963 ± 0.0021
MOOC 0.7910 ± 0.0475 0.8274 ± 0.0132 0.8808 ± 0.0326 0.7438 ± 0.0063 0.949 ± 0.0016 0.8052 ± 0.0244 0.7487 ± 0.0459
LastFM 0.7305 ± 0.0051 0.6980 ± 0.0364 0.763 ± 0.0231 0.5189 ± 0.003 0.8678 ± 0.0030 0.8311 ± 0.0 0.9144 ± 0.0013
Enron 0.7859 ± 0.0134 0.6915 ± 0.0650 0.8100 ± 0.0204 0.5589 ± 0.0235 0.9185 ± 0.003 0.9007 ± 0.0039 0.9491 ± 0.0079
SocialEvo 0.8953 ± 0.0303 0.9182 ± 0.0050 0.9257 ± 0.0086 0.684 ± 0.0034 0.9155 ± 0.0002 0.8793 ± 0.0318
UCI 0.7139 ± 0.0112 0.4259 ± 0.0397 0.8015 ± 0.0269 0.6842 ± 0.0078 0.9176 ± 0.0028 0.9696 ± 0.0039 0.9748 ± 0.0163
CollegeMsg 0.5168 ± 0.0360 0.4808 ± 0.0279 0.7725 ± 0.0365 0.7012 ± 0.005 0.9166 ± 0.0032 0.968 ± 0.0018 0.9725 ± 0.0189
CanParl 0.5078 ± 0.0005 0.5393 ± 0.0204 0.5691 ± 0.0223 0.5724 ± 0.0063 0.7231 ± 0.085 0.8847 ± 0.0102 0.6277 ± 0.0811
Contact 0.9345 ± 0.0027 0.8574 ± 0.0454 0.9527 ± 0.0052 0.556 ± 0.0039 0.9691 ± 0.0028 0.9841 ± 0.0 0.9351 ± 0.0202
Flights 0.9172 ± 0.0114 0.8650 ± 0.0126 0.9503 ± 0.0043 0.8285 ± 0.0038 0.9828 ± 0.0002 0.9127 ± 0.0 0.986 ± 0.0034
UNTrade 0.6650 ± 0.0106 0.6306 ± 0.0139 0.5959 ± 0.0171 * 0.7413 ± 0.001 0.5965 ± 0.0371 0.5812 ± 0.0957
USLegis 0.5801 ± 0.0213 0.5673 ± 0.0098 0.5741 ± 0.0148 0.5596 ± 0.0092 0.9672 ± 0.0029 0.9682 ± 0.0018 0.531 ± 0.1
UNVote 0.5208 ± 0.0075 0.5023 ± 0.0204 0.5889 ± 0.0106 0.4787 ± 0.0033 0.5933 ± 0.0007 0.5878 ± 0.0 0.7789 ± 0.0192
Taobao 0.6988 ± 0.0024 0.6992 ± 0.0001 0.7025 ± 0.0038 0.5266 ± 0.0239 0.7574 ± 0.0032 0.8617 ± 0.0032 0.9997 ± 0.0001
Inductive New-New
Reddit 0.9381 ± 0.0090 0.9525 ± 0.0050 0.9811 ± 0.0004 0.9597 ± 0.0043 0.9952 ± 0.0016 0.9875 ± 0.0004 0.9954 ± 0.0011
Wikipedia 0.9349 ± 0.0051 0.9261 ± 0.0030 0.9858 ± 0.0007 0.958 ± 0.0039 0.9934 ± 0.0005 0.996 ± 0.0001 0.9984 ± 0.0008
MOOC 0.7065 ± 0.0165 0.7217 ± 0.0178 0.8762 ± 0.0038 0.7403 ± 0.0057 0.9422 ± 0.0003 0.8048 ± 0.0087 0.6562 ± 0.0287
LastFM 0.8852 ± 0.0090 0.8683 ± 0.0160 0.8754 ± 0.0011 0.5092 ± 0.0333 0.9697 ± 0.0002 0.9628 ± 0.0 0.9743 ± 0.0015
Enron 0.6800 ± 0.0017 0.6571 ± 0.0521 0.7644 ± 0.0179 0.531 ± 0.0179 0.9609 ± 0.0051 0.9387 ± 0.0001 0.9687 ± 0.0051
SocialEvo 0.6484 ± 0.0491 0.7740 ± 0.0215 0.8791 ± 0.0045 0.4659 ± 0.0068 0.9318 ± 0.0003 0.9275 ± 0.0471
UCI 0.6393 ± 0.0158 0.4771 ± 0.0100 0.8051 ± 0.021 0.768 ± 0.0041 0.9245 ± 0.0027 0.9716 ± 0.0016 0.9472 ± 0.0262
CollegeMsg 0.5320 ± 0.0269 0.5269 ± 0.0049 0.7969 ± 0.0111 0.7832 ± 0.0026 0.9304 ± 0.0024 0.9762 ± 0.0008 0.9404 ± 0.0371
CanParl 0.4347 ± 0.0090 0.4430 ± 0.0068 0.5625 ± 0.0396 0.5955 ± 0.0074 0.7005 ± 0.1241 0.8882 ± 0.0045 0.5685 ± 0.0326
Contact 0.7531 ± 0.0059 0.6602 ± 0.0395 0.9118 ± 0.0053 0.5449 ± 0.0056 0.9652 ± 0.0012 0.982 ± 0.0 0.9495 ± 0.0034
Flights 0.9303 ± 0.0083 0.8900 ± 0.0266 0.9652 ± 0.0022 0.857 ± 0.0056 0.9873 ± 0.0009 0.9411 ± 0.0 0.9905 ± 0.0014
UNTrade 0.5922 ± 0.0085 0.5362 ± 0.0147 0.5068 ± 0.0061 * 0.7458 ± 0.0081 0.5938 ± 0.0600 0.6876 ± 0.0177
USLegis 0.5390 ± 0.0075 0.5640 ± 0.0192 0.5626 ± 0.0195 0.5324 ± 0.0294 0.9738 ± 0.0058 0.9787 ± 0.0004 0.8897 ± 0.0225
UNVote 0.4913 ± 0.0203 0.4728 ± 0.0033 0.5663 ± 0.0093 0.5 ± 0.006 0.5775 ± 0.0022 0.5669 ± 0.0 0.7198 ± 0.0748
Taobao 0.7169 ± 0.0013 0.717 ± 0.0011 0.7082 ± 0.001 0.5226 ± 0.0054 0.7847 ± 0.0148 0.9083 ± 0.0013 0.9996 ± 0.0001

4.2 Link Prediction Task

Model Performance. We run the link prediction task on 7 TGNN models and 15 datasets under different settings. In Table 3, we show the AUC results and highlight the best and second-best numbers for each job. Please refer to Appendix D.1 for the results of average precision (AP); the overall performance is similar to that of AUC.

For the transductive setting, we find that no model performs the best consistently across datasets. This is not surprising considering the dynamic nature and diverse structures of temporal graphs. In a nutshell, CAWN obtains the best or second-best results on 10 out of 15 datasets, followed by NeurTW (7 out of 15), TGN (6 out of 15), and NAT (5 out of 15). The outstanding performance of CAWN and NeurTW comes from their temporal walk mechanism, which perceives structural information. This also shows that introducing graph topology into TGNN models can effectively improve model quality. Notably, NeurTW significantly outperforms the other baselines on the CanParl dataset. NeurTW introduces a continuous-time operation that can depict evolution trajectories, which is potentially suitable for CanParl with its large time granularity (per year) [6, 8]. The source code of TGAT reports an error on the UNTrade dataset because it may not find suitable neighbors within some given time intervals. This error also occurs with other implementations [8].

When it comes to the inductive setting, we observe different results. CAWN and NAT each rank in the top-2 on 9 datasets, followed by NeurTW on 6. These methods reveal better generalization capability under the inductive setting since they consider local structures, either via temporal walks or joint neighborhoods. In contrast, methods adopting attention and RNNs (e.g., TGAT and TGN) suffer performance degradation when the features of test data are not used during training. NeurTW exceeds other models on the CanParl, UNTrade, and UNVote datasets. As discussed above, NeurTW works well on these large-granularity datasets due to its continuous-time encoder. For the New-Old and New-New scenarios, we observe similar results. Specifically, NAT achieves top-2 performance on 10 datasets under the New-New scenario, verifying that learning joint neighborhoods benefits node representations. CAWN, NeurTW, and NAT still perform well under the inductive New-New setting due to their structure-aware techniques. CAWN and NeurTW are both based on motifs and an index anonymization operation [5, 6]. NeurTW additionally constructs neural ordinary differential equations (NODEs). NAT relies on joint neighborhood features based on a dedicated data structure termed N-caches [7].

Table 4: Model efficiency on the link prediction task. We report seconds per epoch as Runtime, the average number of epochs until convergence before early stopping as Epoch, the maximum RAM usage as RAM, and the maximum GPU memory usage as GPU Memory. Refer to Table 3 for the meaning of "*" and "—". "x" indicates that the model cannot converge within 48 hours.
Runtime (second) Epoch
Dataset Model JODIE DyRep TGN TGAT CAWN NeurTW NAT JODIE DyRep TGN TGAT CAWN NeurTW NAT
Reddit 140.20 134.44 61.63 515.64 1,701.72 27,108.06 27.05 19 20 21 9 x x 12
Wikipedia 9.37 10.02 12.17 109.24 270.83 4,388.59 6.15 30 32 34 13 8 12 6
MOOC 30.19 33.54 41.48 256.26 1,913.38 13,497.27 16.48 9 18 17 5 10 x 4
LastFM 29.42 39.61 45.98 882.36 5,527.12 51,007.30 51.95 9 12 14 6 x x 4
Enron 2.41 3.45 4.13 92.24 398.38 10,896.81 5.70 15 15 19 9 8 x 6
SocialEvo 27.77 42.57 51.35 1,544.52 12,292.31 84.72 23 37 14 24 10 x 7
UCI 1.89 2.49 2.72 57.49 121.41 3,801.79 2.90 15 6 20 7 11 18 10
CollegeMsg 1.87 2.42 2.85 45.80 111.96 2,097.39 2.83 7 6 22 7 12 24 10
CanParl 1.96 2.44 3.03 46.28 234.65 4,566.24 3.79 6 7 7 7 13 x 6
Contact 41.99 58.20 69.60 1,645.10 12,100.94 114,274.38 96.35 12 11 14 8 x x 4
Flights 180.80 197.61 262.51 1,195.30 12,105.70 143,731.99 76.91 12 8 12 7 x x 5
UNTrade 7.89 11.75 14.17 * 1,860.84 39,402.43 21.57 15 9 6 * 3 x 18
USLegis 2.15 2.73 2.93 36.69 220.41 3,208.23 3.16 13 12 15 8 17 5 12
UNVote 22.31 25.53 28.97 686.15 6,414.90 88,939.57 40.51 26 9 18 4 x x 9
Taobao 32.17 38.03 34.51 29.15 135.04 1,156.96 4.12 10 9 9 7 9 6 4
RAM (GB) GPU Memory (GB)
Reddit 3.4 3.4 3.4 3.6 30.8 10.1 3.3 2.8 2.8 3.0 4.4 3.5 2.8 2.2
Wikipedia 2.7 2.7 2.6 2.7 14.4 9.8 2.6 1.8 1.8 1.9 3.6 3 2.1 1.3
MOOC 2.6 2.6 2.6 2.6 11.5 9.1 2.4 2.1 2.1 2.1 2.8 1.5 1.6 1.3
LastFM 2.6 2.6 2.6 3.1 20.5 9.8 2.5 1.3 1.4 1.4 2.6 1.5 1.6 1.1
Enron 2.4 2.4 2.4 2.5 15.8 8.1 2.4 1.3 1.4 1.4 3.0 2.3 1.7 1.3
SocialEvo 2.7 2.7 2.7 3.6 13.4 6.4 2.6 1.1 1.2 1.2 2.6 2.0 1.5 1.2
UCI 2.5 2.4 2.4 2.5 11.2 15.2 2.4 1.5 1.6 1.6 3.2 2 1.9 1.4
CollegeMsg 2.5 2.5 2.5 2.5 11 14.8 2.5 1.4 1.7 1.5 3.5 2.5 1.8 1.3
CanParl 2.4 2.4 2.4 2.4 8.5 8.8 2.4 1.3 1.3 1.4 2.8 1.7 1.6 1.3
Contact 2.8 2.8 2.8 3.7 13.2 10.1 2.6 1.3 1.3 1.4 2.7 2.1 1.6 1.1
Flights 2.8 2.8 2.8 3.5 9.9 10.2 2.6 2.2 2.6 2.9 2.8 2.3 1.7 1.3
UNTrade 2.5 2.5 2.5 * 5.4 8.7 2.4 1.1 1.2 1.2 * 2.1 1.6 1.1
USLegis 2.4 2.4 2.4 2.4 8.8 5.9 2.4 1.4 1.4 1.5 2.8 2.4 1.6 1.3
UNVote 2.5 2.5 2.5 3 17.9 6.4 2.5 1.1 1.2 1.2 2.6 2.0 1.5 1.1
Taobao 3 2.9 3 2.6 3.5 2.8 2.6 7.5 7.7 8.4 2.9 2.7 1.6 1.4

Model Efficiency. Since many real-world graphs are extremely large, we believe efficiency is a vital issue for TGNNs in practice. We thereby compare the efficiency of the evaluated models and present the results in Table 4. In terms of runtime per epoch, JODIE, DyRep, TGN, and TGAT are faster, as their message passing and memory updater operators are efficient. In contrast, CAWN and NeurTW can obtain more accurate models but are much slower (sometimes encountering timeouts) due to their inefficient temporal walk operations. NAT is relatively faster than temporal walk-based methods thanks to caching and parallelism optimizations, achieving a good trade-off between model quality and efficiency. Regarding the number of epochs until convergence, TGAT and NAT converge relatively faster, while CAWN and NeurTW often cannot converge in a reasonable time. The RAM results illustrate that CAWN and NeurTW consume more memory than other models due to their complex samplers. As for GPU memory, most models consume 1-3 GB. TGAT requires the highest GPU memory on almost all datasets, since it stacks multiple attention layers. The memory usage of JODIE, DyRep, and TGN on Taobao is much higher than that of other models: Taobao has the largest number of nodes, so the memory module consumes much more GPU memory. We also evaluate GPU utilization (please refer to Appendix D.2), and the results are similar.
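For reference, efficiency metrics like those in Table 4 can be collected with standard tooling. The helper below is our own sketch (not part of BenchTemp) for recording per-epoch runtime, RAM, and peak GPU memory; it assumes a CUDA-capable PyTorch build and reports the resident set size after the epoch rather than a true peak.

```python
import time

import psutil
import torch


def profile_epoch(train_one_epoch):
    """Run one training epoch and report runtime, RAM, and peak GPU memory."""
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    train_one_epoch()
    return {
        "runtime_s": time.time() - start,
        # Resident set size after the epoch; a background monitor is needed for a true peak.
        "ram_gb": psutil.Process().memory_info().rss / 1024 ** 3,
        "gpu_gb": torch.cuda.max_memory_allocated() / 1024 ** 3,
    }
```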

4.3 Node Classification Task

Model Performance. We evaluate the Reddit, Wikipedia, and MOOC datasets since they have binary node labels. The test AUC results for the node classification task are shown in Table 5. In these three datasets, the node labels are highly imbalanced and may change over time, yielding relatively lower AUC than the link prediction task. TGN achieves the best result on the Wikipedia dataset, verifying the effectiveness of temporal memory for node classification. TGAT obtains the best or second-best results on two datasets. TGAT adopts temporal attention to explicitly capture the evolution of node embeddings, and therefore can potentially learn better knowledge of nodes. CAWN and NeurTW perform better on MOOC, since MOOC is relatively denser and the temporal walk mechanism can effectively perceive local structures. NAT performs poorly on the node classification task. The node classification task does not rely on structural features as much as link prediction, so the joint neighborhood mechanism may be less effective.

Table 5: ROC AUC results for the node classification task. The top-2 results are highlighted as bold red and underlined blue.
Dataset Model JODIE DyRep TGN TGAT CAWN NeurTW NAT
Reddit 0.6033 ± 0.0173 0.4988 ± 0.0066 0.6216 ± 0.007 0.6252 ± 0.0057 0.6502 ± 0.0252 0.6054 ± 0.0352 0.4746 ± 0.0207
Wikipedia 0.8527 ± 0.002 0.8276 ± 0.008 0.8831 ± 0.0009 0.8603 ± 0.0051 0.8586 ± 0.0030 0.8470 ± 0.0233 0.5417 ± 0.0474
MOOC 0.6774 ± 0.0066 0.6604 ± 0.0037 0.626 ± 0.0062 0.6705 ± 0.005 0.7271 ± 0.0016 0.7719 ± 0.0073 0.5175 ± 0.0143

Model Efficiency. Due to the space constraint, we report the efficiency performance of node classification task in Appendix D.3. Overall, we observe similar results as the link prediction task.

4.4 Discussion and Visions

Link Prediction. Temporal walk-based models retrieve a set of motifs to learn structural information, and therefore benefit future edge prediction. NAT [7] achieves a trade-off between statistical performance and efficiency, owing to its joint-neighborhood learning paradigm. CAWN [5] and NeurTW [6] perform well on the link prediction task and are both based on motifs and index anonymization operation. However, NeurTW [6] additionally constructs neural ordinary differential equations (NODEs) for embedding extracted motifs over time intervals to capture dynamics.

Node Classification. TGN, TGAT, CAWN, and NeurTW obtain top-2 results on at least one dataset, verifying that node memory, graph attention, and temporal walk strategies are effective for node classification. These three techniques perform message passing for each node using all of its neighbors, and can therefore learn better node representations.

Limitation and Future Direction. It is hard for traditional GNNs to capture the dynamic evolution of temporal graphs. Motif-based models suffer from complex random walk sampling. Joint-neighborhood-based models focus heavily on joint neighborhoods and perform poorly on the node classification task. Inspired by the above analysis, we conclude that future TGNN models should focus more on mining the temporal structure of temporal graphs and increasing the model's structure-aware ability by combining motifs [5, 6] and joint neighborhoods [7].

Our Improvement. In the appendix, we introduce TeMP, a novel approach that incorporates GNN aggregation and temporal structure. According to our empirical results, TeMP effectively balances model performance and efficiency, and performs well on both link prediction and node classification tasks. Please refer to Appendix E for details.

5 Related work

Early works [24, 25, 26] treat the temporal graph as a sequence of snapshots, then encode the snapshots utilizing static GNNs, and capture temporal evolution patterns through sequence analysis methods.

Recent works [1, 2, 3, 4, 5, 6, 7, 27, 28, 29, 30, 31] tend to choose a continuous-time abstraction and treat the input as interaction streams, thus preserving the dynamic evolution of temporal graphs. Some models update the representation using the node that is currently interacting [3], while others update the representation using the historical neighbor nodes [2, 4] and encode the temporal structural pattern. In addition, a series of motif-based models [5, 6] aim at capturing high-order connectivity. These models first sample temporal random walks starting from the target node and obtain motifs that reflect the spatio-temporal information around the target node.

However, to the best of our knowledge, there is no benchmark that conducts a standardized comparison of TGNN models. A few works [8] have made preliminary attempts, but they have not provided standard datasets or unified evaluation pipelines. In this work, we build a comprehensive benchmark framework for TGNNs and standardize the entire lifecycle.

6 Conclusion

In this work, we propose BenchTemp, a general benchmark for temporal graph neural networks. In BenchTemp, we construct a series of benchmark datasets and design a unified pipeline. We compare the effectiveness and efficiency of different TGNN models fairly across different tasks, settings, and metrics. Through an in-depth analysis of the empirical results, we draw valuable insights for future research. In future work, we will continuously update our benchmark to cover newly emerged TGNN models. Contributions and issues from the community are eagerly welcomed, with which we can together push forward TGNN research.

References

  • Kumar et al. [2019] Srijan Kumar, Xikun Zhang, and Jure Leskovec. Predicting dynamic embedding trajectory in temporal interaction networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1269–1278, 2019.
  • Trivedi et al. [2019] Rakshit Trivedi, Mehrdad Farajtabar, Prasenjeet Biswal, and Hongyuan Zha. Dyrep: Learning representations over dynamic graphs. In International Conference on Learning Representations, 2019.
  • Rossi et al. [2020] Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and Michael Bronstein. Temporal graph networks for deep learning on dynamic graphs. arXiv preprint arXiv:2006.10637, 2020.
  • Xu et al. [2020] Da Xu, Chuanwei Ruan, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. Inductive representation learning on temporal graphs. In International Conference on Learning Representations, 2020.
  • Wang et al. [2021] Yanbang Wang, Yen-Yu Chang, Yunyu Liu, Jure Leskovec, and Pan Li. Inductive representation learning in temporal networks via causal anonymous walks. In International Conference on Learning Representations, 2021.
  • Jin et al. [2022] Ming Jin, Yuan-Fang Li, and Shirui Pan. Neural temporal walks: Motif-aware representation learning on continuous-time dynamic graphs. In Advances in Neural Information Processing Systems, 2022.
  • Luo and Li [2022] Yuhong Luo and Pan Li. Neighborhood-aware scalable temporal network representation learning. In The First Learning on Graphs Conference, 2022.
  • Poursafaei et al. [2022] Farimah Poursafaei, Andy Huang, Kellin Pelrine, and Reihaneh Rabbany. Towards better evaluation for dynamic link prediction. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • [9] Reddit data dump. http://files.pushshift.io/reddit/.
  • KDD Cup [2015] KDD Cup 2015. https://biendata.com/competition/kddcup2015/data/.
  • Hidasi and Tikk [2012] Balázs Hidasi and Domonkos Tikk. Fast als-based tensor factorization for context-aware recommendation from implicit feedback. In The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), pages 67–82. Springer, 2012.
  • Zhu et al. [2018] Han Zhu, Xiang Li, Pengye Zhang, Guozheng Li, Jie He, Han Li, and Kun Gai. Learning tree-based deep model for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1079–1088, 2018.
  • [13] Enron email dataset. http://www.cs.cmu.edu/~enron/.
  • Leskovec and Krevl [2014] Jure Leskovec and Andrej Krevl. Snap datasets: Stanford large network dataset collection, 2014.
  • Madan et al. [2011] Anmol Madan, Manuel Cebrian, Sai Moturu, Katayoun Farrahi, et al. Sensing the "health state" of a community. IEEE Pervasive Computing, 11(4):36–45, 2011.
  • Opsahl and Panzarasa [2009] Tore Opsahl and Pietro Panzarasa. Clustering in weighted networks. Social networks, 31(2):155–163, 2009.
  • Panzarasa et al. [2009] Pietro Panzarasa, Tore Opsahl, and Kathleen M Carley. Patterns and dynamics of users’ behavior and interaction: Network analysis of an online community. Journal of the American Society for Information Science and Technology, 60(5):911–932, 2009.
  • Huang et al. [2020] Shenyang Huang, Yasmeen Hitti, Guillaume Rabusseau, and Reihaneh Rabbany. Laplacian change point detection for dynamic graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 349–358, 2020.
  • Sapiezynski et al. [2019] Piotr Sapiezynski, Arkadiusz Stopczynski, David Dreyer Lassen, and Sune Lehmann. Interaction data from the copenhagen networks study. Scientific Data, 6(1):315, 2019.
  • Schäfer et al. [2014] Matthias Schäfer, Martin Strohmeier, Vincent Lenders, Ivan Martinovic, and Matthias Wilhelm. Bringing up opensky: A large-scale ads-b sensor network for research. In IPSN-14 Proceedings of the 13th International Symposium on Information Processing in Sensor Networks, pages 83–94. IEEE, 2014.
  • MacDonald et al. [2015] Graham K MacDonald, Kate A Brauman, Shipeng Sun, Kimberly M Carlson, Emily S Cassidy, James S Gerber, and Paul C West. Rethinking agricultural trade relationships in an era of globalization. BioScience, 65(3):275–289, 2015.
  • Voeten et al. [2009] Erik Voeten, Anton Strezhnev, and Michael Bailey. United Nations General Assembly Voting Data, 2009. URL https://doi.org/10.7910/DVN/LEJUQZ.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Xu et al. [2019] Da Xu, Chuanwei Ruan, Kamiya Motwani, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. Generative graph convolutional network for growing graphs. In ICASSP 2019 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 3167–3171. IEEE, 2019.
  • Pareja et al. [2020] Aldo Pareja, Giacomo Domeniconi, Jie Chen, Tengfei Ma, Toyotaro Suzumura, Hiroki Kanezashi, Tim Kaler, Tao Schardl, and Charles Leiserson. Evolvegcn: Evolving graph convolutional networks for dynamic graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5363–5370, 2020.
  • Hajiramezanali et al. [2019] Ehsan Hajiramezanali, Arman Hasanzadeh, Krishna Narayanan, Nick Duffield, Mingyuan Zhou, and Xiaoning Qian. Variational graph recurrent neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  • Nguyen et al. [2018] Giang Hoang Nguyen, John Boaz Lee, Ryan A Rossi, Nesreen K Ahmed, Eunyee Koh, and Sungchul Kim. Continuous-time dynamic network embeddings. In Companion Proceedings of the Web conference, pages 969–976, 2018.
  • Tian et al. [2021] Sheng Tian, Tao Xiong, and Leilei Shi. Streaming dynamic graph neural networks for continuous-time temporal graph modeling. In 2021 IEEE International Conference on Data Mining (ICDM), pages 1361–1366, 2021.
  • Zhou et al. [2022] Hongkuan Zhou, Da Zheng, Israt Nisa, Vasileios Ioannidis, Xiang Song, and George Karypis. Tgl: a general framework for temporal gnn training on billion-scale graphs. Proceedings of the VLDB Endowment, 15(8):1572–1580, 2022.
  • Souza et al. [2022] Amauri Souza, Diego Mesquita, Samuel Kaski, and Vikas Garg. Provably expressive temporal graph networks. Advances in Neural Information Processing Systems, 35:32257–32269, 2022.
  • Liu et al. [2022] Yunyu Liu, Jianzhu Ma, and Pan Li. Neural predicting higher-order patterns in temporal networks. In Proceedings of the ACM Web Conference 2022, pages 1340–1351, 2022.
  • Pennebaker et al. [2001] James W Pennebaker, Martha E Francis, and Roger J Booth. Linguistic inquiry and word count: Liwc 2001. Mahway: Lawrence Erlbaum Associates, 71(2001):2001, 2001.
  • Jin et al. [2023] Yiqiao Jin, Yeon-Chang Lee, Kartik Sharma, Meng Ye, Karan Sikka, Ajay Divakaran, and Srijan Kumar. Predicting information pathways across online communities. arXiv preprint arXiv:2306.02259, 2023.
  • Huang et al. [2022] Xuanwen Huang, Yang Yang, Yang Wang, Chunping Wang, Zhisheng Zhang, Jiarong Xu, Lei Chen, and Michalis Vazirgiannis. Dgraph: A large-scale financial dataset for graph anomaly detection. Advances in Neural Information Processing Systems, 35:22765–22777, 2022.

Checklist

  1. For all authors…

    (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]

    (b) Did you describe the limitations of your work? [Yes] See Section 4.2; we fail to implement TGAT on the UNTrade dataset.

    (c) Did you discuss any potential negative societal impacts of your work? [N/A]

    (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

  2. If you are including theoretical results…

    (a) Did you state the full set of assumptions of all theoretical results? [N/A]

    (b) Did you include complete proofs of all theoretical results? [N/A]

  3. If you ran experiments (e.g. for benchmarks)…

    (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Section 1: datasets, PyPI, leaderboard, and GitHub project.

    (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 3.2, Section 4.1, Appendix B, and Appendix C.

    (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Section F.1, Table 17, and Table 19.

    (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section 4.1.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    (a) If your work uses existing assets, did you cite the creators? [Yes] See Section 3.1.

    (b) Did you mention the license of the assets? [Yes] See Section 3.1, CC BY-NC license.

    (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] See Section 1.

    (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes] See Section 3.1; the datasets are open-sourced with public licenses.

    (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

  5. If you used crowdsourcing or conducted research with human subjects…

    (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]

    (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

    (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix A Temporal Graph Datasets

As shown in Table 2 of the main paper, we select fifteen benchmark datasets from diverse domains. All datasets are publicly available under the CC BY-NC license and can be accessed at https://drive.google.com/drive/folders/1HKSFGEfxHDlHuQZ6nK4SLCEMFQIOtzpz?usp=sharing. Figure 5 shows the temporal distribution of edges in the evaluated temporal graphs.

  • Reddit is a bipartite interaction graph consisting of one month of posts made by users on subreddits [9]. Users and subreddits are nodes, and edges are interactions of users writing posts to subreddits. The text of each post is converted into a LIWC feature vector [32] as an edge feature of length 172. This public dataset provides 366 true labels among 672,447 interactions; these true labels are ground-truth labels of users banned from Reddit [1].

  • Wikipedia is a bipartite interaction graph, and contains one month of edits made by editors. This public dataset selects the 1,000 most edited pages as items and editors who made at least 5 edits as users over a month [1]. Editors and pages are nodes, and edges are interactions of editors editing on pages. Edge features of length 172 are interaction edits converted into LIWC-feature vectors [32]. Wikipedia dataset treats 217 public ground-truth labels of banned users from 157,474 interactions as positive labels.

  • MOOC is a bipartite online network of students and online course content units [10]. Students and courses are nodes, and edges with features of length 4 are interactions such as a user viewing a video or submitting an answer. This public dataset treats 4,066 dropout events out of 411,749 interactions as positive labels [1].

  • LastFM is a user-song bipartite network [11]. Users and songs are nodes, and edges are user-listens-song interactions. This public dataset includes 1,293,103 interactions between all 1000 users and the 1000 most listened songs [1].

  • Enron is an email communication network that collects about half a million emails over several years [13]. Nodes of the network are email addresses, and edges are email communication between accounts [14].

  • SocialEvo is a network from an experiment that closely tracked the everyday life of a whole undergraduate dormitory with mobile phones. This public dataset was collected by a cell phone application every six minutes and contains physical proximity and location information between students living in halls of residence [15].

  • UCI is a Facebook-like social network that contains user posts to forums. Nodes are 1,899 students at the University of California, Irvine, and edges are 59,835 online messages among these users [16]. Each edge has 100 features.

  • CollegeMsg is provided by the SNAP team at Stanford [17]. This dataset is derived from the Facebook-like social network introduced for the UCI dataset, which the SNAP team has parsed into a temporal network. Each edge has 172 features.

    Figure 5: The temporal distribution of edges of the evaluated temporal graphs: (a) Reddit, (b) Wikipedia, (c) MOOC, (d) LastFM, (e) Enron, (f) SocialEvo, (g) UCI, (h) CollegeMsg, (i) CanParl, (j) Contact, (k) Flights, (l) UNTrade, (m) USLegis, (n) UNVote, (o) Taobao.
  • CanParl is a Canadian Parliament bill-voting network extracted from an open website [18]. Nodes are members of parliament (MPs), and edges are interactions between MPs from 2006 to 2019.

  • Contact is a temporal and weighted network of physical proximity among study participants [19]. Nodes are participants, and edges are proximity events between the study participants. Edge features indicate the physical proximity between participants [8].

  • Flights is a weighted flight network. Nodes are airports, and edges are tracked flights [20]. The weights of edges indicate the number of flights between two given airports within a day [8].

  • UNTrade is a weighted food and agriculture trading network among 181 nations over 30 years [21]. Nodes are countries, and edges are trades between two countries. The edge weights are the total sum of normalized agriculture import or export values between two given countries [8].

  • USLegis is a senate co-sponsorship network that examines the social relations of legislators in their co-sponsorship relationships on bills [18]. Nodes are congress members, and edge weights are the number of times that two members of congress co-sponsor a bill in a given congress [8].

  • UNVote is a weighted network of roll-call votes in the UN General Assembly 1946-2021 [22]. Nodes are nations, and edge weights are the number of times both nations have voted "yes" to an item.

  • Taobao is a subset of the Taobao user behavior dataset, extracted from the period 8:00 to 18:00 on 26 November 2017 [12]. This public dataset is a user-item bipartite network. Nodes are users and items, and edges are behaviors between users and items, such as favoriting, clicking, purchasing, and adding an item to the shopping cart. Each edge has 4 features, corresponding to the 4 different types of behaviors [6].

Appendix B Experiment Details

The DataLoader of the link prediction pipeline introduced in Section 3.2.1 splits and generates the training set, validation set, transductive test set, and inductive test sets (including the New-Old and New-New settings) for the link prediction task. The detailed statistics of these sets are shown in Table 6. The DataLoader of the node classification pipeline introduced in Section 3.2.2 follows the traditional transductive setting. The detailed statistics of the training, validation, and test sets on the three available datasets (Reddit, Wikipedia, and MOOC) are given in Table 7.

Table 6: Statistics of datasets for link prediction task. "New-Old Validation" indicates the validation set under Inductive New-Old setting, and so on.
Training Validation Transductive Test Inductive Validation Inductive Test
# nodes # edges # nodes # edges # nodes # edges # nodes # edges # nodes # edges
Reddit 9,574 389,989 9,839 100,867 9,615 100,867 3,491 19,446 3,515 21,470
Wikipedia 6,141 81,029 3,256 23,621 3,564 23,621 2,120 12,016 2,437 11,715
MOOC 6,015 227,485 2,599 61,762 2,412 61,763 2,333 25,592 2,181 29,179
LastFM 1,612 722,758 1,714 193,965 1,753 193,966 1,643 57,651 1,674 98,442
Enron 157 79,064 155 18,786 141 18,785 112 5,637 110 4,859
SocialEvo 67 1,222,980 64 314,930 62 314,924 62 62,811 60 70,038
UCI 1,338 34,386 1,036 8,975 847 8,976 816 4,761 678 5,707
CollegeMsg 1,337 34,544 1,036 8,975 847 8,975 818 4,914 680 5,885
CanParl 618 47,435 344 11,809 342 10,113 344 5,481 341 5,591
Contact 617 1,372,030 632 364,005 629 363,780 582 68,261 590 69,617
Flights 11,230 1,107,798 10,844 279,399 10,906 287,824 6,784 54,861 6,820 58,102
UNTrade 230 291,287 230 78,721 228 61,595 227 17,528 226 14,001
USLegis 176 38,579 113 10,005 100 4,950 113 5,010 100 3,297
UNVote 178 600,511 194 135,298 194 155,119 194 28,136 194 33,083
Taobao 54,462 45,630 17,964 11,621 18,143 11,550 16,476 10,338 16,896 10,516
New-Old Validation New-Old Test New-New Validation New-New Test Unseen Nodes
# nodes # edges # nodes # edges # nodes # edges # nodes # edges
Reddit 3,301 16,760 3,325 18,703 488 2,686 486 2,767 1,098
Wikipedia 1,809 8,884 1,996 8,148 468 3,132 629 3,567 922
MOOC 2,316 23,109 2,164 25,730 553 2,483 592 3,449 714
LastFM 1,642 52,379 1,674 63,505 272 5,272 331 34,937 198
Enron 111 4,965 109 4,262 19 672 20 597 18
SocialEvo 62 58,959 60 65,466 7 3,852 7 4,572 7
UCI 757 3,686 606 4,193 247 1,075 213 1,514 189
CollegeMsg 759 3,839 608 4,328 247 1,075 214 1,557 189
CanParl 344 4,543 341 4,469 106 938 111 1,122 73
Contact 582 64,887 590 65,883 62 3,374 59 3,734 69
Flights 6,711 49,796 6,739 52,504 874 5,065 937 5,598 1,316
UNTrade 227 16,420 226 13,112 25 1,108 25 889 25
USLegis 112 4,154 100 2,436 37 856 42 861 22
UNVote 194 26,545 194 31,166 23 1,591 23 1,917 20
Taobao 9,247 5,678 8,136 4,860 7,706 4,660 9,298 5,656 8,256
Table 7: Statistics of datasets for node classification task.
Training Validation Test
# nodes # edges # nodes # edges # nodes # edges
Reddit 10,844 470,713 9,839 100,867 9,615 100,867
Wikipedia 7,475 110,232 3,256 23,621 3,564 23,621
MOOC 6,625 288,224 2,599 61,762 2,412 61,763

The EdgeSampler of the link prediction pipeline introduced in Section 3.2.1 uses fixed seeds for the validation and test sets to ensure that the test results are reproducible across different runs.
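A seeded negative edge sampler of this kind can be written in a few lines. The class below is a sketch modeled on the random samplers used in prior TGNN implementations, not the exact BenchTemp EdgeSampler; passing a fixed seed for validation and test makes the sampled negatives reproducible.

```python
import numpy as np


class RandEdgeSampler:
    """Uniform negative edge sampler; a fixed seed makes evaluation negatives reproducible."""

    def __init__(self, src_nodes, dst_nodes, seed=None):
        self.src = np.unique(src_nodes)
        self.dst = np.unique(dst_nodes)
        self.rng = np.random.default_rng(seed)

    def sample(self, size: int):
        src_idx = self.rng.integers(0, len(self.src), size)
        dst_idx = self.rng.integers(0, len(self.dst), size)
        return self.src[src_idx], self.dst[dst_idx]


# Fixed seed for the validation/test phases, random seed (None) for training negatives.
val_sampler = RandEdgeSampler(src_nodes=[1, 2, 3], dst_nodes=[4, 5, 6], seed=0)
neg_src, neg_dst = val_sampler.sample(200)
```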

Table 8: Experimental parameters of TGAT.
$d_n$  $d_e$  $d_{time}$  $d_{pos}$  $n_{head}$
Reddit 172 172 172 172 2
Wikipedia 172 172 172 172 2
MOOC 172 4 172 172 2
LastFM 172 2 172 172 2
Enron 172 32 172 172 2
SocialEvo 172 2 172 172 2
UCI 172 100 172 172 2
CollegeMsg 172 172 172 172 2
CanParl 172 1 172 172 1
Contact 172 1 172 172 1
Flights 172 1 172 172 1
UNTrade 172 1 172 172 1
USLegis 172 1 172 172 1
UNVote 172 1 172 172 1
Taobao 172 4 172 172 2
Table 9: Experimental parameters of CAWN.
$d_n$  $d_e$  $d_{time}$  $d_{pos}$
Reddit 172 172 172 108
Wikipedia 172 172 172 108
MOOC 172 4 172 100
LastFM 172 2 172 102
Enron 172 32 172 104
SocialEvo 172 2 172 102
UCI 172 100 172 100
CollegeMsg 172 172 172 108
CanParl 172 1 172 103
Contact 172 1 172 103
Flights 172 1 172 103
UNTrade 172 1 172 103
USLegis 172 1 172 103
UNVote 172 1 172 103
Taobao 172 4 172 100

Appendix C Model Implementation Details

We implement JODIE, DyRep, and TGN based on the TGN framework. Furthermore, we fix the implementation inconsistencies between the link prediction and node classification tasks.

TGAT concatenates node features, edge features, time features, and positional features to perform the multi-head self-attention mechanism, in which a positional encoding captures sequential information. Let $d_{n}$, $d_{e}$, $d_{time}$, and $d_{pos}$ denote the dimensions of the node features, edge features, time features, and positional encoding, respectively, and let $n_{head}$ denote the number of attention heads. These parameters must satisfy:

(d_{n}+d_{e}+d_{time}+d_{pos})\ \%\ n_{head}=0 \qquad (1)

The experimental parameters of TGAT are summarized in Table 8.

Similar to the setup of TGAT, CAWN adopts a multi-head self-attention mechanism to capture the subtle relevance of node features, edge features, time features, and positional features. These parameters satisfy Formula (1) as well, with $d_{n}=d_{time}$. However, CAWN fixes the number of attention heads to 2, so we adjust the dimension $d_{pos}$ in our experiments. The experimental parameters of CAWN are shown in Table 9.
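
As a small illustration, the constraint of Formula (1) can be checked when configuring the attention modules (a hypothetical helper, not part of the released code):

def check_attention_dims(d_n, d_e, d_time, d_pos, n_head):
    # Formula (1): the concatenated feature dimension must be divisible
    # by the number of attention heads.
    total = d_n + d_e + d_time + d_pos
    if total % n_head != 0:
        raise ValueError(f"total dimension {total} is not divisible by n_head={n_head}; "
                         f"adjust d_pos (CAWN) or n_head (TGAT)")
    return total // n_head  # dimension handled by each head

# e.g., the Reddit setting of TGAT in Table 8: (172 + 172 + 172 + 172) % 2 == 0
check_attention_dims(172, 172, 172, 172, 2)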

NeurTW concatenates node features, edge features, and positional features (without time features) during the temporal random walk encoding. Regarding the temporal walk sampling strategy, given a node $u$ at time $t$, the sampling probability weight of its neighbor $v$ (with $(\{v,u\},t^{\prime})\in\mathcal{G}_{u,t}$) is proportional to $\exp(\alpha(t^{\prime}-t))$, where $\alpha$ is a temporal bias; this is a temporal-biased sampling method. However, the time intervals in some benchmark datasets (Enron, CanParl, UNTrade, USLegis, and UNVote) are relatively large, so the exponential sampling probability weights may overflow. Therefore, we propose the following strategy to calculate the sampling probability weights for these datasets:

W(v,t^{\prime})=\begin{cases}t^{\prime}-t,&t^{\prime}-t>0,\\ 1,&t^{\prime}-t=0,\\ -1/(t^{\prime}-t),&t^{\prime}-t<0,\end{cases} \qquad (2)

where $W(v,t^{\prime})>0$. This strategy can avoid overflow and is also a temporal-biased sampling method. Finally, the sampling probability of each neighbor is obtained after normalization:

Pr_{t}(v)=\frac{\alpha W(v,t_{v})}{\sum_{v^{\prime}\in\mathcal{G}_{u,t}}\alpha W(v^{\prime},t_{v^{\prime}})}, \qquad (3)

where $\alpha$ is the temporal bias. For other hyperparameters that we have not mentioned, we use the default values from the original experiments in the corresponding papers. All the experimental code is publicly available under the MIT license at https://github.com/qianghuangwhu/benchtemp.
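
To make Equations (2) and (3) concrete, the following minimal sketch illustrates our overflow-safe, temporal-biased sampling strategy (illustration only, not NeurTW's original sampler):

import numpy as np

def sampling_probabilities(neighbor_times, t, alpha=1.0):
    # Eq. (2): overflow-safe temporal-biased weights replacing exp(alpha * (t' - t)).
    dt = np.asarray(neighbor_times, dtype=np.float64) - t
    w = np.empty_like(dt)
    w[dt > 0] = dt[dt > 0]
    w[dt == 0] = 1.0
    w[dt < 0] = -1.0 / dt[dt < 0]
    # Eq. (3): normalize to obtain sampling probabilities
    # (note that the temporal bias alpha cancels out in the normalization).
    p = alpha * w
    return p / p.sum()

# Example: three neighbors observed at times 5, 10, and 12 relative to t = 10.
print(sampling_probabilities([5.0, 10.0, 12.0], t=10.0))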

Appendix D Experiment Results

Table 10: Average Precision (AP) results on the link prediction task. "*" denotes that the model encounters a runtime error; "—" denotes a timeout after 48 hours. The best and second-best results are highlighted as bold red and underlined blue. Some standard deviations are zero because we terminate those models that can only run one epoch within 2 days. We do not highlight the second-best result if the gap to the best result is >0.05.
Transductive
Dataset Model JODIE DyRep TGN TGAT CAWN NeurTW NAT
Reddit 0.9718 ± 0.0022 0.9808 ± 0.0006 0.9874 ± 0.0002 0.9822 ± 0.0003 0.9904 ± 0.0001 0.9855 ± 0.0013 0.9868 ± 0.0017
Wikipedia 0.9471 ± 0.0056 0.9464 ± 0.0010 0.9852 ± 0.0003 0.9536 ± 0.0022 0.9906 ± 0.0001 0.9918 ± 0.0001 0.9819 ± 0.0026
MOOC 0.7364 ± 0.0370 0.7933 ± 0.0348 0.883 ± 0.0242 0.7185 ± 0.0051 0.9369 ± 0.0009 0.7943 ± 0.0248 0.7537 ± 0.0191
LastFM 0.6762 ± 0.0678 0.6736 ± 0.0768 0.7694 ± 0.0276 0.5375 ± 0.0044 0.8946 ± 0.0006 0.8405 ± 0.0 0.8729 ± 0.0022
Enron 0.7841 ± 0.0254 0.7648 ± 0.0418 0.8472 ± 0.0173 0.6063 ± 0.0194 0.9142 ± 0.0052 0.8847 ± 0.0079 0.9044 ± 0.0036
SocialEvo 0.7982 ± 0.0476 0.8816 ± 0.0042 0.9325 ± 0.0006 0.7724 ± 0.0052 0.9188 ± 0.0011 0.8989 ± 0.0096
UCI 0.8436 ± 0.0110 0.4913 ± 0.0367 0.8914 ± 0.0138 0.779 ± 0.0052 0.9425 ± 0.001 0.9702 ± 0.0021 0.9253 ± 0.0083
CollegeMsg 0.5276 ± 0.0493 0.5070 ± 0.0049 0.8418 ± 0.0847 0.7902 ± 0.0033 0.9401 ± 0.0025 0.9727 ± 0.0001 0.9241 ± 0.0086
CanParl 0.7030 ± 0.0077 0.6860 ± 0.0256 0.6765 ± 0.0615 0.6811 ± 0.0157 0.6916 ± 0.0546 0.8528 ± 0.0213 0.6593 ± 0.0764
Contact 0.9087 ± 0.0114 0.9016 ± 0.0319 0.9699 ± 0.0045 0.5888 ± 0.0065 0.9677 ± 0.0024 0.9756 ± 0.0 0.945 ± 0.0168
Flights 0.9389 ± 0.0075 0.8836 ± 0.0078 0.9764 ± 0.0025 0.899 ± 0.0025 0.9860 ± 0.0002 0.9321 ± 0.0 0.9749 ± 0.0048
UNTrade 0.6329 ± 0.0102 0.6099 ± 0.0057 0.6059 ± 0.0086 * 0.7488 ± 0.0005 0.5648 ± 0.0167 0.7514 ± 0.0615
USLegis 0.7585 ± 0.0032 0.6808 ± 0.0368 0.7398 ± 0.0027 0.7206 ± 0.0071 0.9682 ± 0.0048 0.9713 ± 0.0013 0.7425 ± 0.016
UNVote 0.6090 ± 0.0076 0.5855 ± 0.0225 0.6694 ± 0.0095 0.5388 ± 0.002 0.6175 ± 0.0013 0.6008 ± 0.0 0.6449 ± 0.033
Taobao 0.808 ± 0.0015 0.8074 ± 0.0014 0.8618 ± 0.0004 0.5508 ± 0.0093 0.7464 ± 0.0027 0.8808 ± 0.0012 0.8933 ± 0.0007
Inductive
Dataset Model JODIE DyRep TGN TGAT CAWN NeurTW NAT
Reddit 0.9427 ± 0.0118 0.9582 ± 0.0003 0.9767 ± 0.0003 0.9667 ± 0.0003 0.9889 ± 0.0001 0.9821 ± 0.0006 0.9912 ± 0.0027
Wikipedia 0.9316 ± 0.0049 0.9181 ± 0.0037 0.9791 ± 0.0004 0.9389 ± 0.0035 0.9903 ± 0.0002 0.9912 ± 0.0004 0.9962 ± 0.0021
MOOC 0.7282 ± 0.0686 0.7985 ± 0.0153 0.8726 ± 0.0267 0.7204 ± 0.0055 0.9394 ± 0.0005 0.7903 ± 0.0307 0.7474 ± 0.0214
LastFM 0.8057 ± 0.0424 0.7956 ± 0.0631 0.8261 ± 0.0145 0.5454 ± 0.0094 0.9225 ± 0.0009 0.8842 ± 0.0 0.9235 ± 0.0028
Enron 0.7640 ± 0.0310 0.6883 ± 0.0635 0.7982 ± 0.0237 0.5661 ± 0.0134 0.916 ± 0.001 0.8940 ± 0.0025 0.9308 ± 0.0085
SocialEvo 0.8527 ± 0.0303 0.8954 ± 0.0034 0.8944 ± 0.0102 0.6497 ± 0.004 0.9118 ± 0.0003 0.8682 ± 0.0324
UCI 0.7298 ± 0.0152 0.4606 ± 0.0209 0.8306 ± 0.0177 0.704 ± 0.0046 0.9421 ± 0.0012 0.9720 ± 0.0024 0.9658 ± 0.0125
CollegeMsg 0.4960 ± 0.0193 0.4858 ± 0.0051 0.7983 ± 0.049 0.7184 ± 0.0014 0.941 ± 0.0026 0.9762 ± 0.0 0.9642 ± 0.0124
CanParl 0.5148 ± 0.0119 0.5365 ± 0.0064 0.5596 ± 0.0141 0.5814 ± 0.0041 0.6915 ± 0.0578 0.8469 ± 0.0161 0.6058 ± 0.0812
Contact 0.9162 ± 0.0051 0.8334 ± 0.0620 0.9411 ± 0.0071 0.5922 ± 0.0056 0.9688 ± 0.0023 0.9762 ± 0.0 0.9489 ± 0.0091
Flights 0.9190 ± 0.0081 0.8707 ± 0.0121 0.9439 ± 0.0043 0.8361 ± 0.0039 0.9834 ± 0.0002 0.9201 ± 0.0 0.9817 ± 0.0026
UNTrade 0.6392 ± 0.0132 0.6232 ± 0.0188 0.5603 ± 0.0106 * 0.7361 ± 0.0009 0.5640 ± 0.0137 0.6586 ± 0.0543
USLegis 0.5557 ± 0.0107 0.5687 ± 0.0008 0.6048 ± 0.0047 0.5637 ± 0.0048 0.9694 ± 0.0028 0.971 ± 0.0009 0.6946 ± 0.0198
UNVote 0.5242 ± 0.0050 0.5118 ± 0.0037 0.5702 ± 0.0099 0.5204 ± 0.004 0.6014 ± 0.0013 0.6025 ± 0.0 0.7637 ± 0.0023
Taobao 0.6696 ± 0.0025 0.6717 ± 0.0006 0.6761 ± 0.0015 0.5293 ± 0.0096 0.7389 ± 0.0026 0.8815 ± 0.0045 0.9992 ± 0.0001
Inductive New-Old
Dataset Model JODIE DyRep TGN TGAT CAWN NeurTW NAT
Reddit 0.9399 ± 0.0112 0.9552 ± 0.0027 0.9749 ± 0.0006 0.9659 ± 0.0004 0.9871 ± 0.0003 0.9810 ± 0.0015 0.9947 ± 0.0014
Wikipedia 0.9127 ± 0.0078 0.8947 ± 0.0040 0.9724 ± 0.0008 0.9223 ± 0.0021 0.9901 ± 0.0002 0.9884 ± 0.0007 0.9959 ± 0.0018
MOOC 0.7366 ± 0.5977 0.8011 ± 0.0092 0.8669 ± 0.0328 0.7263 ± 0.0059 0.9408 ± 0.0022 0.7907 ± 0.0336 0.7677 ± 0.0175
LastFM 0.7448 ± 0.0034 0.7024 ± 0.0532 0.7661 ± 0.0232 0.5447 ± 0.0023 0.8906 ± 0.0021 0.835 ± 0.0 0.9235 ± 0.001
Enron 0.7526 ± 0.0158 0.6742 ± 0.0632 0.7918 ± 0.0209 0.5729 ± 0.0189 0.9168 ± 0.0061 0.8925 ± 0.0090 0.9319 ± 0.0092
SocialEvo 0.8521 ± 0.0403 0.8999 ± 0.0046 0.8972 ± 0.0107 0.6578 ± 0.0041 0.8830 ± 0.0008 0.8437 ± 0.0569
UCI 0.6891 ± 0.0166 0.4574 ± 0.0207 0.8243 ± 0.0205 0.6826 ± 0.0084 0.9414 ± 0.002 0.9732 ± 0.0040 0.9768 ± 0.0127
CollegeMsg 0.5000 ± 0.0227 0.4834 ± 0.0177 0.7954 ± 0.0349 0.701 ± 0.0058 0.9407 ± 0.0017 0.9719 ± 0.0014 0.9763 ± 0.0133
CanParl 0.5143 ± 0.0043 0.5168 ± 0.0170 0.552 ± 0.0135 0.574 ± 0.0054 0.6952 ± 0.0518 0.8417 ± 0.0132 0.6027 ± 0.0787
Contact 0.9150 ± 0.0058 0.8253 ± 0.0637 0.9421 ± 0.0055 0.5915 ± 0.0049 0.9689 ± 0.0029 0.9757 ± 0.0 0.9384 ± 0.0175
Flights 0.9128 ± 0.0095 0.8657 ± 0.0117 0.9412 ± 0.0039 0.833 ± 0.0031 0.9827 ± 0.0002 0.9161 ± 0.0 0.9845 ± 0.0033
UNTrade 0.6333 ± 0.0102 0.6101 ± 0.0196 0.5622 ± 0.014 * 0.7375 ± 0.001 0.5692 ± 0.0185 0.5844 ± 0.053
USLegis 0.5567 ± 0.0106 0.5490 ± 0.0143 0.5651 ± 0.0131 0.5695 ± 0.0099 0.9703 ± 0.0027 0.9671 ± 0.0027 0.5024 ± 0.0511
UNVote 0.5348 ± 0.0072 0.5126 ± 0.0103 0.5724 ± 0.0107 0.5196 ± 0.0022 0.6050 ± 0.0019 0.6036 ± 0.0 0.7598 ± 0.0167
Taobao 0.6838 ± 0.0045 0.6884 ± 0.0013 0.6944 ± 0.0038 0.5309 ± 0.0189 0.7374 ± 0.0032 0.8687 ± 0.0010 0.9997 ± 0.0001
Inductive New-New
Dataset Model JODIE DyRep TGN TGAT CAWN NeurTW NAT
Reddit 0.9199 ± 0.0167 0.9384 ± 0.0064 0.9727 ± 0.0004 0.9523 ± 0.0056 0.9958 ± 0.0017 0.9890 ± 0.0003 0.9951 ± 0.0005
Wikipedia 0.9307 ± 0.0060 0.9329 ± 0.0028 0.9822 ± 0.0009 0.9592 ± 0.0039 0.9941 ± 0.0004 0.9963 ± 0.0003 0.9979 ± 0.0009
MOOC 0.6623 ± 0.0189 0.7135 ± 0.0148 0.8651 ± 0.0059 0.7239 ± 0.0052 0.935 ± 0.0009 0.7871 ± 0.0221 0.6654 ± 0.0155
LastFM 0.8558 ± 0.0110 0.8388 ± 0.0209 0.8121 ± 0.0046 0.536 ± 0.0217 0.9716 ± 0.0008 0.9585 ± 0.0 0.9722 ± 0.0013
Enron 0.6525 ± 0.0146 0.6312 ± 0.0449 0.7391 ± 0.0196 0.538 ± 0.0093 0.9556 ± 0.0055 0.9358 ± 0.0008 0.9503 ± 0.0079
SocialEvo 0.5958 ± 0.0391 0.7312 ± 0.0103 0.8268 ± 0.003 0.5096 ± 0.0097 0.9150 ± 0.0013 0.9112 ± 0.0563
UCI 0.6249 ± 0.0198 0.5062 ± 0.0032 0.8393 ± 0.0155 0.7758 ± 0.0033 0.9488 ± 0.0012 0.9736 ± 0.0008 0.9518 ± 0.0211
CollegeMsg 0.5212 ± 0.0244 0.5328 ± 0.0117 0.8244 ± 0.0098 0.7929 ± 0.0029 0.9484 ± 0.0039 0.9797 ± 0.0008 0.95 ± 0.0257
CanParl 0.4697 ± 0.0043 0.4794 ± 0.0057 0.5553 ± 0.0258 0.6004 ± 0.0087 0.6671 ± 0.0795 0.8511 ± 0.0079 0.5989 ± 0.0571
Contact 0.7381 ± 0.0145 0.6601 ± 0.0432 0.8916 ± 0.0075 0.5779 ± 0.0044 0.9670 ± 0.0031 0.9704 ± 0.0 0.9535 ± 0.0044
Flights 0.9250 ± 0.0065 0.6312 ± 0.0449 0.9644 ± 0.0015 0.8608 ± 0.0049 0.9882 ± 0.0009 0.9496 ± 0.0 0.9906 ± 0.0009
UNTrade 0.5801 ± 0.0112 0.5344 ± 0.0130 0.5164 ± 0.0056 * 0.7404 ± 0.0023 0.5685 ± 0.0298 0.6785 ± 0.0289
USLegis 0.5250 ± 0.0045 0.5523 ± 0.0127 0.5582 ± 0.02 0.5434 ± 0.0203 0.9767 ± 0.0055 0.9803 ± 0.0005 0.8627 ± 0.0196
UNVote 0.4973 ± 0.0145 0.4856 ± 0.0078 0.5502 ± 0.0096 0.5337 ± 0.0046 0.5830 ± 0.0076 0.5964 ± 0.0 0.7549 ± 0.035
Taobao 0.6764 ± 0.0013 0.676 ± 0.0011 0.6739 ± 0.0016 0.5222 ± 0.0033 0.7390 ± 0.0147 0.9025 ± 0.0035 0.9997 ± 0.0001

D.1 AP Results for Link Prediction

We show the average precision (AP) results on the link prediction task and highlight the best and second-best numbers for each dataset in Table 10. The overall performance is similar to that of AUC. For the transductive setting, CAWN gives impressive results and achieves the best or second-best results on 12 out of 15 datasets, followed by NeurTW (7 out of 15), TGN (5 out of 15), and NAT (5 out of 15), verifying the effectiveness of the temporal walk, temporal memory, and joint neighborhood on the transductive link prediction task. For the inductive setting, CAWN and NAT both rank top-2 on 9 datasets, followed by NeurTW on 6. The results reveal that models based on temporal walks and joint neighborhood can better capture structural patterns on edges that have never been seen. TGN performs relatively poorly on the inductive link prediction task on almost all datasets. We can draw similar conclusions from the inductive New-Old and inductive New-New experimental results. We note that DyRep achieves the best AP result under the inductive New-Old setting and the second-best AP result under the inductive setting on the SocialEvo dataset. SocialEvo has the highest average degree and edge density, as shown in Table 2 in the main paper, suggesting that DyRep performs better for inductive link prediction on dense temporal graphs.

D.2 GPU Utilization Comparison for Link Prediction

In Table 11, we report the GPU utilization results on the link prediction task. NAT obtains the best or second-best results on 13 out of 15 datasets, followed by TGN (9 out of 15). The data structure designed in NAT, called N-cache, supports parallel access and updates of dictionary-type neighborhood representations on GPUs; therefore, NAT achieves the best GPU utilization. TGN proposes a highly efficient parallel processing strategy for temporal graphs and thus has the second-best GPU utilization.
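
For reference, the GPU utilization reported here can be sampled during training with NVIDIA's NVML bindings; the sketch below (assuming the pynvml package, not our exact monitoring script) records the peak utilization of GPU 0:

import time
import threading
import pynvml

def track_peak_gpu_utilization(stop_event, result, device_index=0, interval=1.0):
    # Poll NVML once per `interval` seconds and store the peak GPU utilization (%).
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    peak = 0
    while not stop_event.is_set():
        peak = max(peak, pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        time.sleep(interval)
    pynvml.nvmlShutdown()
    result['peak_gpu_util'] = peak

# Usage: start the monitor in a background thread, train the model, then stop it.
# stop_event, result = threading.Event(), {}
# threading.Thread(target=track_peak_gpu_utilization, args=(stop_event, result)).start()
# ...training loop...
# stop_event.set()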

Table 11: GPU utilization of models on the link prediction task. "*" denotes that the TGAT layer cannot find suitable neighbors within the given time interval and encounters an error. The best and second-best results are highlighted as bold red and underlined blue.
GPU Utilization (%)
Dataset Model JODIE DyRep TGN TGAT CAWN NeurTW NAT
Reddit 21 22 41 39 26 31 40
Wikipedia 34 46 36 35 28 17 38
MOOC 14 29 35 35 18 14 45
LastFM 22 28 38 22 23 22 48
Enron 18 24 41 24 25 22 51
SocialEvo 24 25 42 25 17 22 46
UCI 30 25 35 33 27 58 44
CollegeMsg 21 32 46 47 25 34 48
CanParl 26 27 54 47 24 22 51
Contact 25 22 40 29 19 22 50
Flights 20 20 26 35 18 24 43
UNTrade 19 23 50 * 15 23 53
USLegis 22 28 54 38 26 55 53
UNVote 12 22 37 35 16 46 54
Taobao 29 56 31 55 22 53 38

D.3 Efficiency Comparison of Node Classification Task

Table 12: Model efficiency on the node classification task. We report seconds per epoch as Runtime, the average number of epochs until convergence before early stopping as Epoch, the maximum RAM usage as RAM, the maximum GPU memory usage as GPU Memory, and the maximum GPU utilization as GPU Utilization, respectively. "x" indicates that the model cannot converge within 48 hours. The best and second-best results are highlighted as bold red and underlined blue.
Runtime (second) Epoch
Dataset Model JODIE DyRep TGN TGAT CAWN NeurTW NAT JODIE DyRep TGN TGAT CAWN NeurTW NAT
Reddit 65.98 75.07 73.99 16.99 1,913.83 24,016.13 28.24 6 7 8 4 x x 6
Wikipedia 15.75 14.55 14.58 4.95 351.54 2,723.37 6.39 4 14 3 7 6 4 3
MOOC 28.57 32.91 27.95 8.51 1,146.76 7,466.04 17.55 8 3 5 7 10 x 3
RAM (GB) GPU Memory (GB)
Reddit 3.8 3.4 3.3 4.1 45.7 16.7 3.3 2.9 2.8 3.1 3.0 3.7 3.0 3.0
Wikipedia 2.6 2.6 2.6 3.1 8.2 8.2 2.6 2.6 2.4 2.8 2.5 3.1 2.5 1.7
MOOC 2.9 2.9 2.9 3.1 45.1 32.6 2.5 1.9 1.8 1.9 2.1 2.3 1.7 1.3
GPU Utilization (%)
Reddit 41 40 42 38 18 36 52
Wikipedia 37 30 41 31 25 80 48
MOOC 35 33 40 30 13 56 62

We compare the efficiency results on the node classification task and show the results in Table 12. Regarding runtime per epoch, TGAT is the fastest on all three datasets, followed by NAT. Similar to the runtime results on the link prediction task, the training of CAWN and NeurTW is much slower due to the inefficient temporal walk. As for the average number of epochs until convergence, NAT ranks top-2 on all three datasets, followed by JODIE (2 out of 3) and TGN (2 out of 3). The RAM results reveal that CAWN and NeurTW consume much more memory due to the temporal walk and their complex sampling strategies. Most models need 1-3 GB of GPU memory, similar to the link prediction task. Due to the parallel access and update of representations on the GPU, NAT achieves the highest GPU utilization.

Figure 6: Workflow of TeMP. $U$ denotes user, and $I$ denotes item.

Appendix E Results of TeMP

Based on the anatomy of the existing methods, we propose a novel temporal graph neural network called TeMP. As shown in Figure 6, given a dynamic graph (a), TeMP proceeds as follows:

  • Subgraph Construction (b). We construct a subgraph with a temporal neighbor sampler whose intervals adapt to the data. We select a reference timestamp and sample a subgraph before this timestamp. We have conducted experiments at various quantiles and choose the mean timestamp since it obtains the overall best performance; a minimal sketch of this step is given after this list.

  • Embedding Generation (c). Upon subgraph construction, TeMP generates temporal embeddings for nodes and edges. The model architecture consists of three main components: temporal label propagation (LPA), message-passing operators, and a sequence updater. The temporal LPA captures the motif pattern, while the message-passing operators aggregate the original edge features. The sequence updater uses an RNN with a memory module to update the embeddings. Furthermore, TeMP uses a pre-initialization strategy to generate initial temporal node embeddings.
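
The sketch below illustrates the subgraph construction step under the description above (an illustration only, not TeMP's released implementation; src, dst, and ts are assumed to be NumPy arrays of source nodes, destination nodes, and timestamps, and num_neighbors is a hypothetical neighbor budget):

import numpy as np
from collections import defaultdict

def construct_subgraph(src, dst, ts, num_neighbors=20):
    # Reference timestamp: the mean of all edge timestamps, the quantile that
    # obtained the overall best performance in our experiments.
    t_ref = float(np.mean(ts))
    # Keep only interactions that happened before the reference timestamp.
    mask = ts < t_ref
    neighbors = defaultdict(list)
    for u, i, t in zip(src[mask], dst[mask], ts[mask]):
        neighbors[u].append((i, t))
        neighbors[i].append((u, t))
    # For every node, retain its most recent temporal neighbors.
    subgraph = {node: sorted(nbrs, key=lambda x: x[1])[-num_neighbors:]
                for node, nbrs in neighbors.items()}
    return t_ref, subgraph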

Table 13: AUC and AP results of TeMP on link prediction task. We highlight the numbers as bold red and underlined blue if TeMP achieves the best and second-best results compared to the TGNN models in the main paper.
AUC
Transductive Inductive Inductive New-Old Inductive New-New
Reddit 0.99 ± 0.0001 0.9843 ± 0.0002 0.9818 ± 0.0001 0.9843 ± 0.0003
Wikipedia 0.9801 ± 0.0005 0.9669 ± 0.0008 0.9504 ± 0.0005 0.97 ± 0.0013
MOOC 0.8249 ± 0.0027 0.8277 ± 0.0036 0.832 ± 0.0031 0.7621 ± 0.0086
LastFM 0.865 ± 0.0012 0.8879 ± 0.0015 0.8333 ± 0.0014 0.9228 ± 0.0007
Enron 0.8717 ± 0.0122 0.8485 ± 0.0153 0.8491 ± 0.009 0.7798 ± 0.0329
SocialEvo 0.9303 ± 0.0005 0.9039 ± 0.0005 0.9017 ± 0.0013 0.8485 ± 0.0041
UCI 0.8925 ± 0.001 0.7955 ± 0.0054 0.7787 ± 0.0032 0.7318 ± 0.0017
CollegeMsg 0.8873 ± 0.0018 0.8027 ± 0.0029 0.7848 ± 0.0022 0.7342 ± 0.0047
CanParl 0.7801 ± 0.0139 0.5414 ± 0.0176 0.5313 ± 0.0253 0.5683 ± 0.0293
Contact 0.958 ± 0.0013 0.9474 ± 0.0015 0.9474 ± 0.0011 0.7835 ± 0.0041
Flights 0.987 ± 0.0003 0.9708 ± 0.0006 0.9692 ± 0.0006 0.973 ± 0.0009
UNTrade 0.6011 ± 0.0009 0.5732 ± 0.0006 0.5703 ± 0.0006 0.5539 ± 0.0008
USLegis 0.7073 ± 0.0114 0.5493 ± 0.0183 0.5609 ± 0.0123 0.5425 ± 0.0234
UNVote 0.5571 ± 0.0053 0.5419 ± 0.006 0.5409 ± 0.0066 0.5696 ± 0.0038
Taobao 0.939 ± 0.0005 0.8513 ± 0.0018 0.8249 ± 0.003 0.8502 ± 0.0025
AP
Transductive Inductive Inductive New-Old Inductive New-New
Reddit 0.9904 ± 0.0001 0.9849 ± 0.0002 0.9828 ± 0.0002 0.9757 ± 0.0007
Wikipedia 0.9817 ± 0.0004 0.9696 ± 0.0004 0.9562 ± 0.0005 0.9666 ± 0.0016
MOOC 0.7935 ± 0.0033 0.7918 ± 0.0039 0.7989 ± 0.0035 0.7328 ± 0.0089
LastFM 0.8717 ± 0.0014 0.8952 ± 0.0015 0.8468 ± 0.0015 0.8984 ± 0.0014
Enron 0.85 ± 0.0141 0.8274 ± 0.0167 0.8324 ± 0.0104 0.757 ± 0.0307
SocialEvo 0.9098 ± 0.0005 0.8767 ± 0.0004 0.8752 ± 0.0021 0.7907 ± 0.0056
UCI 0.8968 ± 0.0007 0.8104 ± 0.0054 0.7969 ± 0.0047 0.7594 ± 0.0047
CollegeMsg 0.8928 ± 0.0026 0.8206 ± 0.0041 0.8018 ± 0.003 0.7505 ± 0.0045
CanParl 0.6871 ± 0.015 0.5388 ± 0.0074 0.5341 ± 0.0037 0.5554 ± 0.015
Contact 0.9525 ± 0.0016 0.9429 ± 0.0019 0.944 ± 0.0012 0.7992 ± 0.0031
Flights 0.9857 ± 0.0003 0.9687 ± 0.0005 0.9661 ± 0.0006 0.9731 ± 0.001
UNTrade 0.5855 ± 0.0007 0.564 ± 0.0007 0.5593 ± 0.0011 0.5634 ± 0.0008
USLegis 0.6493 ± 0.0078 0.529 ± 0.01 0.5519 ± 0.0088 0.5375 ± 0.0218
UNVote 0.5402 ± 0.0038 0.536 ± 0.005 0.5354 ± 0.0041 0.5481 ± 0.0052
Taobao 0.9385 ± 0.0004 0.8493 ± 0.0033 0.8243 ± 0.0049 0.8448 ± 0.0056
Table 14: Efficiency of TeMP on link prediction task.
Efficiency
Runtime (second) Epoch RAM (GB) GPU Memory (GB) GPU Utilization (%)
Reddit 304.84 27 5.4 2.3 86
Wikipedia 51.00 22 3.4 1.4 71
MOOC 100.64 23 2.9 1.3 57
LastFM 471.25 48 3.6 1.4 57
Enron 38.68 20 2.8 1.1 31
SocialEvo 670.08 30 3.77 1.1 30
UCI 43.98 26 2.9 1.4 50
CollegeMsg 11.55 30 3 1.2 49
CanParl 13.80 7 2.8 1.1 37
Contact 421.15 54 4.1 1.3 13
Flights 859.04 38 3.9 1.4 76
UNTrade 80.96 10 3 1.3 37
USLegis 10.31 14 2.6 1.1 57
UNVote 165.32 12 3.3 1.3 39
Taobao 17.78 16 3.1 1.5 42
Table 15: AUC and efficiency results of TeMP on node classification task.
AUC Efficiency
Runtime (second) Epoch RAM (GB) GPU Memory (GB) GPU Utilization (%)
Reddit 0.6357 ± 0.0265 170.66 7 4.4 3.0 66
Wikipedia 0.8873 ± 0.0078 24.18 13 3.5 1.4 47
MOOC 0.6958 ± 0.0017 49.90 7 3.3 1.2 42

E.1 Experimental Results

Link Prediction. The AUC and AP results of TeMP on the link prediction task are presented in Table 13, and the efficiency results are shown in Table 14. TeMP performs relatively well in the transductive setting, while it lags behind CAWN, NeurTW, and NAT. Due to its efficient subgraph sampling and parallel data loader, TeMP outperforms the other baselines in GPU memory usage and GPU utilization.

Node Classification. The experimental results of TeMP on the node classification task are presented in Table 15. TeMP achieves the best AUC on the Wikipedia dataset and the second-best AUC on the Reddit dataset, demonstrating that TeMP can effectively capture the temporal evolution of nodes. Similarly, TeMP consumes relatively little GPU memory and better utilizes the computational power of the GPU.

Appendix F Newly Added Datasets

We have added six datasets (eBay-Small, eBay-Large, Taobao-Large, DGraphFin, YouTubeReddit-Small, YouTubeReddit-Large), including four large-scale datasets (eBay-Large, Taobao-Large, DGraphFin, YouTubeReddit-Large). The statistics of the new datasets are shown in Table 16. For easy access, all datasets have been hosted on the open-source platform Zenodo (https://zenodo.org/) with the Digital Object Identifier (DOI) 10.5281/zenodo.8267846 (https://zenodo.org/record/8267846).

After a discussion with our industrial partner eBay, we are working on sharing the eBay-Small and eBay-Large datasets in a way that ensures availability while verifying the research purpose. eBay provides a Google form for applicants: https://forms.gle/bP1RmyVJ1C6pgyS66 (applicants can remain anonymous).

Table 16: Dataset statistics of the new datasets.
Domain # Nodes # Edges
eBay-Small E-commerce 38,427 384,677
YouTubeReddit-Small [33] Social 264,443 297,732
eBay-Large E-commerce 1,333,594 1,119,454
DGraphFin [34] Finance 3,700,550 4,300,999
Youtube-Reddit-Large [33] Social 5,724,111 4,228,523
Taobao-Large [6, 12] E-commerce 1,630,453 5,008,745
  • eBay-Small is a subset of the eBay-Large dataset. We sample 38,427 nodes and 384,677 edges from the eBay-Large graph according to edge timestamps.

  • YouTubeReddit-Small is a collection of massive visual content on YouTube and long-term community activity on Reddit. This dataset covers a 3-month period from January to March 2020. Each row in the dataset represents a YouTube video $v_{i}$ being shared in a subreddit $s_{j}$ by some user $u_{k}$ at time $t$ [33]. Nodes are YouTube videos and subreddits, and edges are users' interactions between videos and subreddits. This dynamic graph has 264,443 nodes and 297,732 edges.

  • eBay-Large is a million-scale dataset consisting of 1.3 million nodes and 1.1 million edges, which comprises selected transaction records from the eBay e-commerce platform over a two-month period. eBay-Large is modeled as a user-item graph, where items are heterogeneous entities that include information such as phone numbers, addresses, and email addresses associated with a transaction. We select one month of transactions as seed nodes and then expand each seed node two hops back in time to enrich the topology while maintaining consistency in the distribution of seed nodes.

  • DGraphFin is a large-scale dynamic graph dataset consisting of interactive objects, events, and labels that evolve with time. It is a directed, unweighted dynamic graph with millions of nodes and edges, representing a realistic user-to-user social network in the financial industry. Nodes are users, and an edge from one user to another means that the user regards the other user as an emergency contact person [34].

  • Youtube-Reddit-Large dataset covers 54 months of YouTube video propagation history from January 2018 to June 2022 [33]. This dataset has 5,724,111 nodes and 4,228,523 edges.

  • Taobao-Large is a subset of the Taobao user behavior dataset covering the period from 8:00 to 18:00 on 26 November 2017 [12]. Nodes are users and items, and edges are behaviors between users and items, such as favoring, clicking, purchasing, and adding an item to the shopping cart. This public dataset has 1,630,453 nodes and 5,008,745 user-item interaction edges.

F.1 Experiments

We have conducted extensive experiments on the six newly added temporal graph datasets, covering the dynamic link prediction and dynamic node classification tasks, diverse workloads (transductive, inductive, inductive New-Old, and inductive New-New scenarios), and an efficiency comparison (inference time). We also report an Average Rank metric in Table 17 and Table 19 to rank model performance on the newly added large-scale datasets.
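
The Average Rank is computed per setting by ranking the models on each large-scale dataset (rank 1 = best) and averaging the ranks over the datasets; a minimal sketch with illustrative numbers follows:

import numpy as np
from scipy.stats import rankdata

def average_rank(scores):
    # `scores`: array of shape (num_datasets, num_models) holding AUC/AP values.
    # Rank the models on each dataset (1 = best, higher score = better), then average.
    ranks = np.vstack([rankdata(-row, method='average') for row in scores])
    return ranks.mean(axis=0)

# Illustrative example with 2 datasets and 3 models (not real results).
print(average_rank(np.array([[0.96, 0.94, 0.97],
                             [0.80, 0.85, 0.82]])))  # -> [2.5, 2.0, 1.5]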

F.1.1 Link Prediction Task

We run the link prediction task with the 7 TGNN models on the new datasets under different settings (Transductive, Inductive, Inductive New-Old, and Inductive New-New). The AUC and AP results for each new dataset are shown in Table 17 and Table 18, respectively. For the four large-scale datasets (eBay-Large, Taobao-Large, DGraphFin, YouTubeReddit-Large), we observe results similar to those in the main paper. Specifically, NAT and NeurTW achieve the top-2 performance on almost all datasets under the transductive and inductive settings.

Table 17: ROC AUC results of the new datasets on the dynamic link prediction task. The best and second-best results are highlighted as bold red and underlined blue. The Average Rank is computed from the experimental results of the models on the four large-scale datasets (eBay-Large, Taobao-Large, DGraphFin, YouTubeReddit-Large). We do not highlight the second-best result if the gap to the best result is >0.05.
Transductive
Dataset Model JODIE DyRep TGN TGAT CAWN NeurTW NAT
eBay-Small 0.9946 ± 0.0002 0.9941 ± 0.0006 0.9984 ± 0.0003 0.9838 ± 0.0006 0.9985 ± 0.0 0.9991 ± 0.0 0.9978 ± 0.0003
YouTubeReddit-Small 0.8519 ± 0.0007 0.8499 ± 0.0012 0.8432 ± 0.0032 0.8441 ± 0.0014 0.7586 ± 0.0031 0.9003 ± 0.0031 0.8259 ± 0.005
eBay-Large 0.9614 ± 0.0 0.9619 ± 0.0001 0.9642 ± 0.0003 0.5311 ± 0.0003 0.9442 ± 0.0003 0.9608 ± 0.0 0.9658 ± 0.0002
DGraphFin 0.8165 ± 0.0024 0.8171 ± 0.0016 0.8683 ± 0.0023 0.6112 ± 0.0165 0.5466 ± 0.0103 0.8611 ± 0.0035 0.8258 ± 0.0001
Youtube-Reddit-Large 0.8532 ± 0.0003 0.8529 ± 0.0006 0.8458 ± 0.0025 0.8536 ± 0.0026 0.7466 ± 0.0012 0.916 ± 0.0025 0.8605 ± 0.0009
Taobao-Large 0.7726 ± 0.0005 0.7724 ± 0.001 0.8464 ± 0.0008 0.5567 ± 0.0047 0.7771 ± 0.0068 0.859 ± 0.0091 0.8188 ± 0.001
Average Rank 4.5 4.5 2.75 5.75 6 2.25 2.25
Inductive
eBay-Small 0.9696 ± 0.0007 0.9674 ± 0.0018 0.9913 ± 0.0004 0.9698 ± 0.0006 0.9964 ± 0.0001 0.9982 ± 0.0 0.9998 ± 0.0001
YouTubeReddit-Small 0.7582 ± 0.0003 0.7545 ± 0.0009 0.7276 ± 0.0033 0.7436 ± 0.0006 0.7533 ± 0.0016 0.8978 ± 0.0032 0.9876 ± 0.0049
eBay-Large 0.7536 ± 0.0014 0.7515 ± 0.0006 0.7657 ± 0.0026 0.5224 ± 0.0003 0.9459 ± 0.0001 0.9608 ± 0.0 0.9999 ± 0.0001
DGraphFin 0.6884 ± 0.0051 0.6876 ± 0.001 0.6439 ± 0.0089 0.5677 ± 0.0184 0.5479 ± 0.009 0.8635 ± 0.0021 0.7955 ± 0.0201
Youtube-Reddit-Large 0.7539 ± 0.0005 0.7554 ± 0.0003 0.7243 ± 0.0016 0.7501 ± 0.0019 0.7327 ± 0.0016 0.9128 ± 0.0031 0.9863 ± 0.006
Taobao-Large 0.7075 ± 0.0009 0.7042 ± 0.0006 0.6812 ± 0.0032 0.5222 ± 0.0041 0.7787 ± 0.0103 0.869 ± 0.010 0.9933 ± 0.0008
Average Rank 4 4.5 5.5 6.25 4.75 1.75 1.25
Inductive New-Old
eBay-Small 0.9862 ± 0.0003 0.9836 ± 0.0016 0.9947 ± 0.0009 0.9712 ± 0.002 0.9985 ± 0.0 0.9988 ± 0.0 0.9999 ± 0.0
YouTubeReddit-Small 0.7695 ± 0.001 0.7655 ± 0.0018 0.7396 ± 0.0034 0.7242 ± 0.0004 0.7573 ± 0.0022 0.922 ± 0.0002 0.9967 ± 0.0014
eBay-Large 0.6109 ± 0.0244 0.5906 ± 0.0087 0.8134 ± 0.0105 0.6363 ± 0.0605 0.9569 ± 0.0007 0.8973 ± 0.0 1.0 ± 0.0
DGraphFin 0.5768 ± 0.0071 0.5735 ± 0.0007 0.5564 ± 0.0021 0.5742 ± 0.013 0.5646 ± 0.0244 0.7702 ± 0.0043 0.8693 ± 0.0066
Youtube-Reddit-Large 0.7844 ± 0.0015 0.7894 ± 0.0017 0.7623 ± 0.0031 0.7457 ± 0.0062 0.7511 ± 0.0022 0.9356 ± 0.0004 0.9958 ± 0.0025
Taobao-Large 0.7023 ± 0.0015 0.6953 ± 0.0022 0.6771 ± 0.0055 0.5104 ± 0.0106 0.7674 ± 0.005 0.8458 ± 0.0043 0.9965 ± 0.0005
Average Rank 4.25 5 5.5 5.75 4.25 2.25 1
Inductive New-New
eBay-Small 0.9388 ± 0.0009 0.9366 ± 0.0037 0.9838 ± 0.0007 0.9556 ± 0.0007 0.9937 ± 0.0 0.9975 ± 0.0 0.9997 ± 0.0004
YouTubeReddit-Small 0.7436 ± 0.0015 0.7436 ± 0.0018 0.7265 ± 0.0055 0.749 ± 0.0011 0.7479 ± 0.004 0.864 ± 0.0071 0.9868 ± 0.0049
eBay-Large 0.7526 ± 0.0013 0.7500 ± 0.0005 0.7639 ± 0.0027 0.5196 ± 0.0002 0.9542 ± 0.0003 0.9615 ± 0.0 0.9999 ± 0.0001
DGraphFin 0.7307 ± 0.0007 0.7323 ± 0.0002 0.6843 ± 0.0131 0.5649 ± 0.0248 0.5417 ± 0.0099 0.9051 ± 0.0028 0.7584 ± 0.0323
Youtube-Reddit-Large 0.6932 ± 0.0026 0.7022 ± 0.0007 0.6703 ± 0.0024 0.7269 ± 0.0 0.6942 ± 0.0028 0.8716 ± 0.0077 0.9796 ± 0.0103
Taobao-Large 0.7243 ± 0.0001 0.7247 ± 0.0001 0.6885 ± 0.0024 0.5256 ± 0.0054 0.7922 ± 0.0118 0.8906 ± 0.0088 0.9969 ± 0.0002
Average Rank 5 4.25 5.5 5.75 4.5 1.75 1.25
JODIE DyRep TGN TGAT CAWN NeurTW NAT
Total Rank 4.44 4.56 4.81 5.88 4.88 2.00 1.44
Table 18: AP results of the new datasets on the dynamic link prediction task. The best and second-best results are highlighted as bold red and underlined blue. We do not highlight the second-best result if the gap to the best result is >0.05.
Transductive
Dataset Model JODIE DyRep TGN TGAT CAWN NeurTW NAT
eBay-Small 0.9938 ± 0.0004 0.9936 ± 0.0006 0.9983 ± 0.0003 0.9819 ± 0.0009 0.9981 ± 0.0 0.9991 ± 0.0 0.9975 ± 0.0002
YouTubeReddit-Small 0.8612 ± 0.0009 0.8594 ± 0.0012 0.8421 ± 0.0041 0.8515 ± 0.0012 0.7625 ± 0.0042 0.9112 ± 0.0021 0.8325 ± 0.0068
eBay-Large 0.9318 ± 0.0002 0.9322 ± 0.0002 0.9357 ± 0.0006 0.5239 ± 0.0002 0.9144 ± 0.0004 0.9307 ± 0.0 0.9398 ± 0.0004
DGraphFin 0.7705 ± 0.0009 0.7705 ± 0.0024 0.8571 ± 0.0009 0.6441 ± 0.0123 0.5431 ± 0.0095 0.8637 ± 0.0014 0.7956 ± 0.0012
Youtube-Reddit-Large 0.8622 ± 0.0007 0.8632 ± 0.0004 0.8476 ± 0.0022 0.8591 ± 0.0026 0.7475 ± 0.0017 0.9222 ± 0.0013 0.8628 ± 0.0015
Taobao-Large 0.7164 ± 0.0003 0.7142 ± 0.0008 0.844 ± 0.0011 0.5761 ± 0.0023 0.7616 ± 0.0069 0.8568 ± 0.016 0.7904 ± 0.0008
Inductive
eBay-Small 0.9638 ± 0.0007 0.9619 ± 0.0017 0.9898 ± 0.0005 0.9675 ± 0.0007 0.9953 ± 0.0002 0.9982 ± 0.0 0.9998 ± 0.0001
YouTubeReddit-Small 0.7866 ± 0.0007 0.7833 ± 0.0009 0.7387 ± 0.0069 0.7551 ± 0.0002 0.7568 ± 0.0031 0.9086 ± 0.0022 0.9872 ± 0.0056
eBay-Large 0.6989 ± 0.0018 0.6973 ± 0.0007 0.7096 ± 0.0030 0.518 ± 0.0002 0.9174 ± 0.0001 0.9308 ± 0.0 0.9999 ± 0.0001
DGraphFin 0.6563 ± 0.002 0.6567 ± 0.0009 0.624 ± 0.006 0.5866 ± 0.0123 0.5428 ± 0.0082 0.8626 ± 0.0012 0.7053 ± 0.0185
Youtube-Reddit-Large 0.7796 ± 0.0009 0.7818 ± 0.0009 0.73 ± 0.0029 0.7587 ± 0.0025 0.7353 ± 0.0022 0.9192 ± 0.0022 0.9849 ± 0.0071
Taobao-Large 0.6763 ± 0.0011 0.6746 ± 0.0011 0.6664 ± 0.0012 0.5315 ± 0.0027 0.7533 ± 0.011 0.8596 ± 0.0205 0.9941 ± 0.0007
Inductive New-Old
eBay-Small 0.9849 ± 0.0007 0.9836 ± 0.0013 0.9931 ± 0.0008 0.9682 ± 0.0028 0.9985 ± 0.0001 0.999 ± 0.0 0.9999 ± 0.0
YouTubeReddit-Small 0.7963 ± 0.0013 0.7937 ± 0.0014 0.729 ± 0.0086 0.7296 ± 0.0013 0.762 ± 0.0041 0.9244 ± 0.0015 0.9966 ± 0.0016
eBay-Large 0.5670 ± 0.0186 0.5870 ± 0.0074 0.8024 ± 0.0060 0.6504 ± 0.0385 0.9592 ± 0.0008 0.8458 ± 0.0 1.0 ± 0.0
DGraphFin 0.6005 ± 0.0048 0.5872 ± 0.0059 0.5753 ± 0.0062 0.5927 ± 0.0058 0.5669 ± 0.0269 0.7572 ± 0.0025 0.8184 ± 0.0088
Youtube-Reddit-Large 0.808 ± 0.0014 0.8142 ± 0.0019 0.7472 ± 0.0043 0.7526 ± 0.0097 0.7553 ± 0.0025 0.9368 ± 0.0009 0.9953 ± 0.0028
Taobao-Large 0.7009 ± 0.0013 0.698 ± 0.0014 0.6879 ± 0.0008 0.5254 ± 0.0074 0.7597 ± 0.0053 0.8459 ± 0.0103 0.9969 ± 0.0004
Inductive New-New
eBay-Small 0.923 ± 0.001 0.9226 ± 0.0024 0.98 ± 0.0007 0.9505 ± 0.0009 0.991 ± 0.0001 0.9973 ± 0.0 0.9997 ± 0.0004
YouTubeReddit-Small 0.7578 ± 0.0015 0.7582 ± 0.0021 0.7564 ± 0.0043 0.7718 ± 0.0023 0.7498 ± 0.004 0.8868 ± 0.0034 0.9861 ± 0.0063
eBay-Large 0.6976 ± 0.0016 0.6957 ± 0.0007 0.7078 ± 0.0031 0.5154 ± 0.0001 0.93 ± 0.0003 0.9318 ± 0.0 0.9999 ± 0.0001
DGraphFin 0.6802 ± 0.0005 0.6811 ± 0.0002 0.6526 ± 0.0098 0.5831 ± 0.0184 0.5379 ± 0.0071 0.8977 ± 0.0014 0.6529 ± 0.0249
Youtube-Reddit-Large 0.7038 ± 0.0024 0.7115 ± 0.0007 0.6979 ± 0.002 0.7414 ± 0.0012 0.6965 ± 0.004 0.8848 ± 0.0023 0.9761 ± 0.0134
Taobao-Large 0.6738 ± 0.0005 0.6742 ± 0.0005 0.6611 ± 0.0011 0.53 ± 0.0023 0.7521 ± 0.0127 0.8738 ± 0.0145 0.9973 ± 0.0001

F.1.2 Node Classification Task

The eBay-Small and eBay-Large datasets have node labels, so we conduct dynamic node classification experiments on both. The AUC results are shown in Table 19. We observe results similar to those in the main paper: NeurTW achieves the best performance on both eBay-Small and eBay-Large, while NAT performs poorly on the node classification task.

Table 19: ROC AUC results for the dynamic node classification task on the eBay datasets. The top-2 results are highlighted as bold red and underlined blue.
Dataset Model JODIE DyRep TGN TGAT CAWN NeurTW NAT
eBay-Small 0.9274 ± 0.0017 0.8677 ± 0.0356 0.913 ± 0.0025 0.9342 ± 0.0002 0.9305 ± 0.0001 0.9529 ± 0.0002 0.6797 ± 0.0115
eBay-Large 0.7244 ± 0.0002 0.7246 ± 0.0 0.6586 ± 0.0129 0.672 ± 0.0016 0.7710 ± 0.0002 0.7859 ± 0.0 0.5304 ± 0.0011
Average Rank 4 4.5 5.5 3.5 2.5 1 7

F.1.3 Efficiency - Inference Time

Considering many real-world applications, we add an inference time metric to evaluate the efficiency of the methods. The inference time comparison per 100,000 edges is shown in Figure 7. According to the figure, we observe model efficiency results similar to those in the main paper. In terms of inference time, JODIE, DyRep, TGN, and TGAT are faster, while CAWN and NeurTW are much slower. NAT is considerably faster than the temporal walk-based methods thanks to its caching and parallelism optimizations, achieving a good trade-off between model quality and efficiency.

Figure 7: Inference time comparison per 100,000 edges.
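
A minimal sketch of how the inference time per 100,000 edges can be measured is given below (an illustrative timing harness; model.predict_links is a hypothetical scoring interface, not a BenchTemp API):

import time
import torch

def inference_time_per_100k(model, src, dst, ts, batch_size=200):
    # Time forward passes over all test edges and normalize to 100,000 edges.
    model.eval()
    num_edges = len(src)
    start = time.perf_counter()
    with torch.no_grad():
        for i in range(0, num_edges, batch_size):
            # Hypothetical link-scoring call; replace with the model's own interface.
            model.predict_links(src[i:i + batch_size],
                                dst[i:i + batch_size],
                                ts[i:i + batch_size])
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure all GPU work finishes before stopping the clock
    elapsed = time.perf_counter() - start
    return elapsed * 100_000 / num_edges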

F.1.4 Efficiency - Runtime, RAM, GPU

We report model efficiency results for the newly added datasets as follows.

Since many real-world graphs are extremely large, we believe efficiency is a vital issue for TGNNs in practice. We thereby compare the efficiency of the evaluated models on the newly added datasets (eBay-Small, eBay-Large, Taobao-Large, DGraphFin, YouTubeReddit-Small, YouTubeReddit-Large), and present the results for the dynamic link prediction task in Table 20 and for the dynamic node classification task in Table 21.

The Runtime results in Table 20 and Table 21 show that NAT is trained much faster than the other models on almost all of the newly added datasets and requires low RAM and GPU memory. TGAT obtains the second-best efficiency on the newly added datasets. JODIE, DyRep, and TGN achieve similar efficiency. Consistent with the main paper, NeurTW performs poorly in terms of model efficiency.

Table 20: Model efficiency for the newly added datasets on the link prediction task. We report seconds per epoch as Runtime, the maximum RAM usage as RAM, and the maximum GPU memory usage as GPU Memory, respectively. The best and second-best results are highlighted as bold red and underlined blue.
Runtime (second)
Dataset Model JODIE DyRep TGN TGAT CAWN NeurTW NAT
eBay-Small 749.80 801.58 905.19 61.05 1,385.54 1,556.32 25.12
YouTubeReddit-Small 213.92 227.99 214.17 85.59 378.94 7,459.92 29.51
eBay-Large 28,203.53 30,151.18 30,286.88 791.86 52,116.62 58,540.48 117.38
DGraphFin 4,579.52 4,210.48 4,397.32 1,708.71 30,144.25 81,653.89 904.38
Youtube-Reddit-Large 4,630.49 4,935.05 4,635.91 1,852.67 8,202.50 161,476.80 638.77
Taobao-Large 3,108.45 2,931.87 2,860.83 2,658.34 12,143.02 148,922.55 6654.56
RAM (GB)
eBay-Small 7.8 6.2 6.8 4.3 9.1 7.8 4.3
YouTubeReddit-Small 6.8 7.2 6.6 5.3 13.1 8.1 4.5
eBay-Large 20.2 18.3 19.1 5.2 17.1 10.1 5.5
DGraphFin 17.5 15.3 17.5 8.3 23.2 24.3 6.9
Youtube-Reddit-Large 26.3 16.6 18.9 7.9 18.5 21.3 6.3
Taobao-Large 14.3 12.1 13.4 7.5 18.1 20.7 6.2
GPU Memory (GB)
eBay-Small 2.0 1.9 2.0 1.9 1.8 1.6 2.2
YouTubeReddit-Small 1.3 1.4 2.1 1.3 1.8 1.1 1.1
eBay-Large 29.7 24.6 30.9 5.8 5.7 3.0 5.9
DGraphFin 19.3 18.5 16.1 6.3 6.9 6.1 6.0
Youtube-Reddit-Large 22.1 23.0 23.4 7.8 6.3 7.2 7.1
Taobao-Large 20.3 21.8 19.6 7.7 7.3 6.8 5.6
Table 21: Model efficiency for the newly added datasets on the node classification task. We report seconds per epoch as Runtime, the maximum RAM usage as RAM, and the maximum GPU memory usage as GPU Memory, respectively. The best and second-best results are highlighted as bold red and underlined blue.
Runtime (second)
Dataset Model JODIE DyRep TGN TGAT CAWN NeurTW NAT
eBay-Small 765.05 794.03 718.56 55.05 226.56 583.08 13.05
eBay-Large 29,153.28 29,867.17 27,028.53 629.52 8,522.04 25,693.71 97.54
RAM (GB)
eBay-Small 6.5 6.8 6.7 4.2 6.9 7.2 4.1
eBay-Large 41.8 39.2 20.5 5.2 15.1 7.4 5.8
GPU Memory (GB)
eBay-Small 1.8 1.2 1.5 1.8 1.9 1.8 2.3
eBay-Large 31.7 31 31.4 5.8 5.8 2.9 5.9

Appendix G Node Classification Task with Multiple Node Labels

We have added experiments on the node classification task with multiple node labels.

In previous works, only the Reddit, Wikipedia, and MOOC datasets have node labels (two labels: 0 and 1) and are used for the binary node classification task. Through an extensive search, we have found a large-scale temporal dataset with multiple node labels, DGraphFin [34], which consists of 3,700,550 nodes and 4,300,999 edges.


There are four node classes. The node counts of each class are listed below.

  • 0: 1,210,092

  • 1: 15,509

  • 2: 1,620,851

  • 3: 854,098

Nodes of Class 1 are fraud users and nodes of Class 0 are normal users; these are the two classes to be predicted. Nodes of Class 2 and Class 3 are background users.

We preprocess DGraphFin into the temporal graph format. The preprocessing code for the DGraphFin dataset is open-sourced at https://github.com/qianghuangwhu/benchtemp/blob/master/DGraphFin/DGraphFin.py.

The DGraphFin dataset has been hosted on the open-source platform Zenodo (https://zenodo.org/) with the Digital Object Identifier (DOI) 10.5281/zenodo.8267771 (https://zenodo.org/record/8267846).

The experimental results for the dynamic node classification task on the DGraphFin dataset are shown in Table 22. We report several evaluation metrics, including Accuracy and the weighted Precision, Recall, and F1:

\text{Precision}_{\text{weighted}}=\frac{\sum_{i=1}^{N}\text{Support}_{i}\times\text{Precision}_{i}}{\sum_{i=1}^{N}\text{Support}_{i}},\quad \text{Recall}_{\text{weighted}}=\frac{\sum_{i=1}^{N}\text{Support}_{i}\times\text{Recall}_{i}}{\sum_{i=1}^{N}\text{Support}_{i}},\quad \text{F1}_{\text{weighted}}=\frac{2\times\text{Precision}_{\text{weighted}}\times\text{Recall}_{\text{weighted}}}{\text{Precision}_{\text{weighted}}+\text{Recall}_{\text{weighted}}} \qquad (4)

where $\text{Support}_{i}$ is the support (number of samples) of the $i$-th class.
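
Note that for single-label multi-class prediction the weighted recall equals the accuracy, which is why the Recall and Accuracy rows coincide in Table 22. These weighted metrics can be computed, for instance, with scikit-learn (a sketch of the metric computation, not our full evaluation script):

from sklearn.metrics import accuracy_score, precision_score, recall_score

def multiclass_metrics(y_true, y_pred):
    # Support-weighted precision and recall, then F1 as their harmonic mean, as in Eq. (4).
    precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return {'Accuracy': accuracy_score(y_true, y_pred),
            'Precision': precision, 'Recall': recall, 'F1': f1}

# Toy example with the four DGraphFin classes (labels 0-3); not real predictions.
print(multiclass_metrics([0, 1, 2, 3, 0], [0, 2, 2, 3, 1]))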

Table 22: The experimental results for dynamic node classification task with multiple labels on DGraphFin dataset.
JODIE DyRep TGN TGAT CAWN NeurTW NAT
Accuracy 0.4396 ± 0.0070 0.4400 ± 0.0117 0.5696 ± 0.0056 0.5529 ± 0.0036 0.4366 ± 0.0068 0.4933 ± 0.0870 0.4131 ± 0.0038
Precision 0.1933 ± 0.0062 0.1937 ± 0.0103 0.4806 ± 0.0053 0.4644 ± 0.0118 0.1906 ± 0.0060 0.3879 ± 0.2849 0.3068 ± 0.0044
Recall 0.4396 ± 0.0070 0.4400 ± 0.0117 0.5696 ± 0.0056 0.5529 ± 0.0036 0.4366 ± 0.0068 0.4933 ± 0.0870 0.4131 ± 0.0038
F1 0.2685 ± 0.0073 0.2690 ± 0.0121 0.4905 ± 0.0063 0.4744 ± 0.0020 0.2654 ± 0.0071 0.3727 ± 0.1588 0.3402 ± 0.0047

As shown in Table 22, TGN achieves the best performance on the dynamic node classification task with multiple labels, followed by TGAT. JODIE, DyRep, and CAWN perform poorly.

Appendix H Ablation Studies on NODEs of NeurTW

CAWN [5] and NeurTW [6] perform well on the link prediction task and are both based on motifs and the index anonymization operation. However, NeurTW [6] additionally constructs neural ordinary differential equations (NODEs). With this NODE-based component, the extracted motifs allow irregularly-sampled temporal nodes to be embedded explicitly over multiple different interaction time intervals, enabling the effective capture of the underlying spatiotemporal dynamics.

In NeurTW [6], the continuous evolution operation is given by Equation 8 of the original paper (https://openreview.net/pdf?id=NqbktPUkZf7):

h_{i}^{\prime}=h_{i-1}+\int_{t_{i-1}}^{t_{i}}f\left(h_{t},\theta\right)dt, \qquad (5)

where $f(h_{t},\theta)$ is the ODE function, implemented by an autoregressive gated recurrent unit with parameter $\sigma$. The corresponding ODE function $\tilde{f}(\tilde{h}_{s},s,\theta)$ follows:

\tilde{f}\left(\tilde{h}_{s},s,\theta\right):=\frac{d\tilde{h}_{s}}{ds}=\left.\frac{dh_{t}}{dt}\right|_{t=s\left(t_{\text{end}}^{c}-t_{\text{start}}^{c}\right)+t_{\text{start}}^{c}}\frac{dt}{ds} \qquad (6)
=\left.f\left(h_{t},t,\theta\right)\right|_{t=s\left(t_{\text{end}}^{c}-t_{\text{start}}^{c}\right)+t_{\text{start}}^{c}}\left(t_{\text{end}}^{c}-t_{\text{start}}^{c}\right)
=f\left(\tilde{h}_{s},s\left(t_{\text{end}}^{c}-t_{\text{start}}^{c}\right)+t_{\text{start}}^{c},\theta\right)\left(t_{\text{end}}^{c}-t_{\text{start}}^{c}\right)

Thus, due to the neural ordinary differential equations (NODEs), NeurTW [6] performs better on datasets with a large time granularity.
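
To make the continuous evolution in Equation (5) concrete, the following minimal sketch evolves a hidden state over an interaction interval with a learned ODE function (an illustration only, not NeurTW's implementation; it assumes the optional torchdiffeq package, and the gated ODE function is a hypothetical stand-in for the autoregressive gated recurrent unit):

import torch
import torch.nn as nn
from torchdiffeq import odeint

class GatedODEFunc(nn.Module):
    # dh/dt parameterized by a gated update, mimicking an autoregressive GRU-style cell.
    def __init__(self, dim):
        super().__init__()
        self.candidate = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, t, h):
        # Gated derivative pulling h toward a candidate state.
        return self.gate(h) * (self.candidate(h) - h)

def continuous_evolution(h_prev, t_start, t_end, func):
    # Evolve h_{i-1} from t_{i-1} to t_i, returning h'_i as in Eq. (5).
    t_span = torch.tensor([t_start, t_end], dtype=torch.float32)
    trajectory = odeint(func, h_prev, t_span, method='rk4')
    return trajectory[-1]  # hidden state at t_i

# Example: evolve an 8-dimensional state over the interval [0, 5].
h_next = continuous_evolution(torch.randn(1, 8), 0.0, 5.0, GatedODEFunc(8))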

CanParl is a Canadian Parliament bill voting network extracted from an open website [18]. Nodes are members of parliament (MPs), and edges are the interactions between MPs from 2006 to 2019.

Figure 8: The distribution of temporal edge count for the CanParl dataset, and an illustration of the train-validation-test splitting.

We illustrate the distribution of temporal edge count for the CanParl dataset in Figure 8. As shown in the figure, the CanParl dataset has a large time granularity, and NeurTW [6] achieves the best performance on it. Based on this analysis, we infer that NeurTW is potentially suitable for datasets with a large time granularity and long time intervals, such as CanParl.

We further conduct ablation studies to verify the effectiveness of neural ordinary differential equations (NODEs) of NeurTW on datasets with a large time granularity and time intervals. The experimental results are detailed in Table 23.

Table 23: Ablation studies on the neural ordinary differential equations (NODEs) of NeurTW. "w/o NODEs" means that the NODEs module is removed.
Ablation Datasets AUC AP
Transductive Inductive New-Old New-New Transductive Inductive New-Old New-New
original CanParl 0.8920 ± 0.0173 0.8871 ± 0.0139 0.8847 ± 0.0102 0.8882 ± 0.0045 0.8528 ± 0.0213 0.8469 ± 0.0161 0.8417 ± 0.0132 0.8511 ± 0.0079
USLegis 0.9715 ± 0.0009 0.9708 ± 0.0009 0.9682 ± 0.0018 0.9787 ± 0.0004 0.9713 ± 0.0013 0.971 ± 0.0009 0.9671 ± 0.0027 0.9803 ± 0.0005
w/o NODEs CanParl 0.5 ± 0.0 0.5001 ± 0.0 0.5001 ± 0.0 0.5 ± 0.0 0.5 ± 0.0 0.5001 ± 0.0 0.5001 ± 0.0 0.5 ± 0.0
USLegis 0.898 ± 0.004 0.9186 ± 0.0018 0.9026 ± 0.0025 0.9474 ± 0.0 0.8721 ± 0.0047 0.9037 ± 0.0034 0.8651 ± 0.0029 0.9458 ± 0.0004

NeurTW without the NODEs module performs much worse on datasets with a large time granularity and long time intervals (such as CanParl). In contrast, on datasets with a tiny time granularity and short time intervals (such as USLegis, whose timestamps range only from 0 to 11), removing the NODEs module has relatively little negative impact on the performance of NeurTW.

Ablation studies on neural ordinary differential equations (NODEs) of NeurTW verify that "NeurTW introduces a continuous-time operation that can depict evolution trajectory, which is potentially suitable for CanParl with a large time granularity and time intervals".

Appendix I Graph Density for CAWN with Random Subgraph Sampling

Figure 9: The distribution of temporal edge count for the MOOC dataset, and an illustration of the train-validation-test splitting.

We have added experiments to demonstrate that the effectiveness of the temporal walk mechanism changes in response to changes in graph density.

As shown in Figure 9, MOOC is relatively dense, and CAWN, which is based on the temporal walk mechanism (motifs), achieves the best performance on this dataset.

To demonstrate that the effectiveness of the temporal walk mechanism changes with graph density, we adopt an edge sampling strategy. We randomly sample a constant number $N_{e}$ of temporal edges $\{(u_{1},i_{1}),\dots,(u_{N_{e}},i_{N_{e}})\}$ each time as a subgraph $G_{S}$ of the original temporal graph. The number $N_{u}$ of source nodes is the number of distinct elements in $\{u_{1},\dots,u_{N_{e}}\}$, and the number $N_{i}$ of destination nodes is the number of distinct elements in $\{i_{1},\dots,i_{N_{e}}\}$.

The graph density $\sigma_{D_{S}}$ in our paper is the temporal density of the temporal edges $(u_{t},i_{t})$, computed as follows:

\sigma_{D_{S}}=\frac{N_{e}}{N_{u}\times N_{i}}. \qquad (7)

The number of sampled edges is a constant $N_{e}$. We sampled two subgraphs, $G_{S_{1}}$ and $G_{S_{2}}$; their statistics are shown in Table 24. The temporal graph density of $G_{S_{1}}$ is 0.6320, while that of $G_{S_{2}}$ is 0.3127.

Table 24: The parameters of the sampled subgraphs $G_{S_{1}}$ and $G_{S_{2}}$.
$N_{e}$ $N_{u}$ $N_{i}$ $\sigma_{D_{S}}$
$G_{S_{1}}$ 100,000 4,395 36 0.6320
$G_{S_{2}}$ 100,000 4,264 75 0.3127
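
A minimal sketch of the edge sampling and the density computation of Equation (7) is given below (illustrative code, not our exact sampling script; src and dst are assumed to be NumPy arrays of edge endpoints):

import numpy as np

def sample_subgraph_density(src, dst, num_edges=100_000, seed=0):
    # Randomly sample a constant number N_e of temporal edges as a subgraph G_S
    # and compute its temporal density according to Eq. (7).
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(src), size=num_edges, replace=False)
    n_u = len(np.unique(src[idx]))   # number of distinct source nodes, N_u
    n_i = len(np.unique(dst[idx]))   # number of distinct destination nodes, N_i
    return num_edges / (n_u * n_i)

# With the statistics of Table 24: N_e = 100,000, N_u = 4,395, N_i = 36
# gives 100000 / (4395 * 36) ≈ 0.6320, i.e., the density of G_{S_1}.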

The experimental results of CAWN based on the temporal walk mechanism (motifs) on $G_{S_{1}}$ and $G_{S_{2}}$ are shown in Table 25.

Table 25: The experimental results of CAWN on $G_{S_{1}}$ and $G_{S_{2}}$.
AUC AP
Transductive Inductive Inductive New-Old Inductive New-New Transductive Inductive Inductive New-Old Inductive New-New
$G_{S_{1}}$ 0.886 ± 0.0164 0.883 ± 0.0158 0.8847 ± 0.0166 0.8701 ± 0.0134 0.8651 ± 0.021 0.8601 ± 0.0219 0.8616 ± 0.0225 0.8511 ± 0.0174
$G_{S_{2}}$ 0.8357 ± 0.0073 0.8353 ± 0.0105 0.8535 ± 0.0087 0.7752 ± 0.0157 0.8172 ± 0.0088 0.8154 ± 0.0128 0.8337 ± 0.0103 0.7535 ± 0.0197

As shown in Table 25, CAWN performs much better on $G_{S_{1}}$, which has the larger graph density ($\sigma_{D_{S_{1}}}=0.6320 > \sigma_{D_{S_{2}}}=0.3127$). The experimental results demonstrate that the effectiveness of CAWN's temporal walk mechanism changes in response to changes in graph density.

Appendix J BenchTemp with Historical Negative Sampling and Inductive Negative Sampling

Since NAT and NeurTW achieve 90%+ or even 95%+ AUC and AP on many of these datasets, we have found that equipping BenchTemp with Historical Negative Sampling and Inductive Negative Sampling [8] can address this issue.

Figure 10: Historical Negative Sampling and Inductive Negative Sampling.

Let $E_{all}$, $E_{train}$, and $S$ be the set of all edges in the dataset, the set of edges in the training set, and the set of sampled negative edges, respectively. Historical Negative Sampling and Inductive Negative Sampling are illustrated in Figure 10 and defined as follows (a minimal sketch of the two samplers is given after the list):

  • Historical Negative Sampling. Sample negative edges from the set of edges that were observed at previous timestamps but are absent from the current step, i.e., sample negative edges from $E_{train}$.

  • Inductive Negative Sampling. Sample negative edges from $E_{all}$ that were not observed during training, i.e., from $E_{all}\setminus E_{train}$.
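
A minimal sketch of the two samplers is given below (illustrative code operating on plain edge sets rather than BenchTemp's internal structures; edges are (src, dst) pairs):

import random

def historical_negative_sampling(e_train, current_edges, size, seed=0):
    # Negative edges observed at previous timestamps (i.e., contained in E_train)
    # but absent from the current step.
    candidates = sorted(e_train - current_edges)
    return random.Random(seed).sample(candidates, min(size, len(candidates)))

def inductive_negative_sampling(e_all, e_train, current_edges, size, seed=0):
    # Negative edges from E_all that were never observed during training.
    candidates = sorted((e_all - e_train) - current_edges)
    return random.Random(seed).sample(candidates, min(size, len(candidates)))

# Example: e_all = {(1, 2), (1, 3), (2, 3)}, e_train = {(1, 2)}, current step = {(2, 3)}.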

We have conducted experiments with NAT on the datasets where it over-performs (Reddit, Wikipedia, Flights) using Historical Negative Sampling and Inductive Negative Sampling. The experimental results are shown in Table 26 and Table 27.

Table 26: ROC AUC results of NAT with Historical Negative Sampling and Inductive Negative Sampling for the dynamic link prediction task.
Sampling Datasets Transductive Inductive Inductive New-Old Inductive New-New
Historical Reddit 0.7759 ± 0.0065 0.8272 ± 0.0036 0.8532 ± 0.0019 0.9097 ± 0.0014
Wikipedia 0.6992 ± 0.0027 0.7924 ± 0.0022 0.8118 ± 0.0034 0.8448 ± 0.0015
Flights 0.6145 ± 0.0034 0.6443 ± 0.0276 0.6418 ± 0.0566 0.8019 ± 0.0079
Inductive Reddit 0.8058 ± 0.0049 0.858 ± 0.0045 0.8746 ± 0.0035 0.9515 ± 0.001
Wikipedia 0.731 ± 0.0022 0.7609 ± 0.0009 0.7593 ± 0.001 0.8323 ± 0.0045
Flights 0.6145 ± 0.0034 0.6443 ± 0.0276 0.6418 ± 0.0566 0.8019 ± 0.0079
Table 27: AP results of NAT with Historical Negative Sampling and Inductive Negative Sampling for the dynamic link prediction task.
Sampling Datasets Transductive Inductive Inductive New-Old Inductive New-New
Historical Reddit 0.7958 ± 0.0097 0.8406 ± 0.0008 0.8558 ± 0.001 0.9063 ± 0.0033
Wikipedia 0.7128 ± 0.0019 0.7859 ± 0.0022 0.8017 ± 0.0031 0.8267 ± 0.0025
Flights 0.6287 ± 0.0094 0.6596 ± 0.0248 0.6507 ± 0.0562 0.8428 ± 0.0042
Inductive Reddit 0.8523 ± 0.0013 0.8902 ± 0.8979 0.8979 ± 0.0035 0.9648 ± 0.0009
Wikipedia 0.733 ± 0.0025 0.7525 ± 0.0043 0.747 ± 0.0 0.8238 ± 0.0088
Flights 0.6287 ± 0.0094 0.6596 ± 0.0248 0.6507 ± 0.0562 0.8428 ± 0.0042

The experimental results demonstrate the effectiveness of Historical Negative Sampling and Inductive Negative Sampling.

We leave a detailed exploration of BenchTemp with Historical Negative Sampling and Inductive Negative Sampling for future work.