Graph Continual Learning with Debiased Lossless Memory Replay
Abstract
Real-life graph data often expands continually, rendering the learning of graph neural networks (GNNs) on static graph data impractical. Graph continual learning (GCL) tackles this problem by continually adapting GNNs to the expanded graph of the current task while maintaining the performance over the graph of previous tasks. Memory replay-based methods, which aim to replay data of previous tasks when learning new tasks, have been explored as one principled approach to mitigate the forgetting of the knowledge learned from the previous tasks. In this paper we extend this methodology with a novel framework, called Debiased Lossless Memory replay (DeLoMe). Unlike existing methods that sample nodes/edges of previous graphs to construct the memory, DeLoMe learns small lossless synthetic node representations as the memory. The learned memory can not only preserve the graph data privacy but also capture the holistic graph information, for which the sampling-based methods are not viable. Further, prior methods suffer from bias toward the current task due to the data imbalance between the classes in the memory data and the current data. A debiased GCL loss function is devised in DeLoMe to effectively alleviate this bias. Extensive experiments on four graph datasets show the effectiveness of DeLoMe under both class- and task-incremental learning settings.
1 Introduction
Due to their superior capacity to represent complex relations between samples, graphs are widely used in various real-world applications Xia et al. (2021); Wu et al. (2020), such as social networks, citation networks, and online shopping. For example, in the context of online shopping, consumers can be represented as nodes, and an edge between two consumers can indicate that they have purchased or rated the same product. In real-world applications, graph data often expand continually, e.g., new consumers and their associated connections are constantly added to the online shopping network. Following the message propagation paradigm, graph neural networks (GNNs) Wu et al. (2019); Kipf and Welling (2016) have achieved remarkable success on various graph-related tasks. Despite this success, most GNNs operate on static graph data. Directly applying them to newly emerging graphs would cause the GNNs to easily forget the knowledge learned from previous data, due to the large distribution difference between the historical data and the newly added data. This forgetting of learned knowledge when learning new graphs, a.k.a. catastrophic forgetting, results in deteriorated performance on historical graphs. A simple solution is to store all historical data and repeatedly retrain the GNNs whenever the graph is updated, but this is prohibitively expensive in terms of computation time and resources given the large scale of a continually expanding graph. Moreover, the previous graph data may also be inaccessible in privacy-critical application scenarios when learning on the newly added data.
To tackle catastrophic forgetting, many methods have been proposed and demonstrated impressive performance on Euclidean data such as images and texts Hayes and Kanan (2020); Wu et al. (2021); Wang et al. (2023). However, it is ineffective to directly adopt them for graph continual learning (GCL) as graphs are non-Euclidean data that contain complex relations among a large number of nodes.
Recently, to address the unique challenges in graph data, several GCL works He et al. (2023); Zhou and Cao (2021); Liu et al. (2021); Sun et al. (2023); Wang et al. (2022); Su et al. (2023); Rakaraddi et al. (2022); Zhang et al. (2023a); Perini et al. (2022); Zhang et al. (2022c, 2023b) have been proposed to continually adapt GNNs to the expanded graph data of the current task while maintaining the performance over the graph data of previous tasks. Generally, these methods can be roughly divided into three categories: regularization-based methods, parameter isolation-based methods, and replay-based methods. Owing to their intuitive design and impressive effectiveness, replay-based methods Zhang et al. (2022c, 2023b); Zhou and Cao (2021) have been widely explored and have shown a remarkable ability to mitigate catastrophic forgetting in GCL. Typically, they sample and store representative data of previous tasks in a memory buffer and replay them when learning a new task. However, since only selected graph information is stored, their constructed memory for each previous task often struggles to capture the holistic information of the full graph, limiting the power of memory replay against catastrophic forgetting. For example, ERGNN Zhou and Cao (2021) stores only individual nodes selected via sampling methods for replaying, which ignores the rich relations among the nodes and severely degrades the effectiveness of memory replay in GCL. A straightforward solution to this issue is to store the complete neighborhoods and the associated edges of the selected nodes for the graph data of previous tasks, but this would pose great challenges to memory storage and is infeasible under tight memory budgets. To preserve the topological information and alleviate the storage requirement simultaneously, recent studies propose approaches like SSM Zhang et al. (2022c) to sparsify the neighborhood of the selected nodes by filtering out unimportant neighbors. Nevertheless, all these methods construct the memory using partial graph data, as shown in Figure 1(Left), failing to preserve the holistic graph semantics. Also, these memory construction methods become inapplicable in privacy-critical applications where the replay of previous graph data can lead to privacy leakage, e.g., storing the original data of previous online shopping networks would divulge the purchase/rating information of consumers.

Further, since the amount of memory data is often limited, the current graph data often dominate the training data, leading to a data imbalance between the classes in the memory data and those in the current graph data. When GCL models are updated with such imbalanced data, they become biased toward the current task, which further degrades performance on previous tasks as the graph continually expands under a fixed memory budget (i.e., the imbalance rate increases), e.g., the performance of SSM in Figure 1(Right).
To tackle these three issues, in this paper, we introduce a novel memory replay-based GCL approach, called Debiased Lossless Memory replay (DeLoMe). Rather than storing the original graph data, it learns a small graph consisting of synthetic node representations as the memory, such that the gradients of a randomly initialized GNN on the learned small graph match those on the original large graph, making the compression effectively lossless. In doing so, the learned representations are enforced to better capture the holistic graph structure and attribute information, resulting in better performance on previous graph data, as illustrated in Figure 1(Right). This node representation-based memory also better preserves the privacy of the graph data compared to the original node/edge-based memory.
To handle the aforementioned bias issue, a debiased loss function is devised in our GCL objective. This is done by calibrating the prediction logits of the classes in the memory data and the current graph data in the continual updating of our DeLoMe model, which helps largely enhance its robustness w.r.t. the class imbalance.
Our main contributions can be summarized as follows:
- We propose a novel learnable memory replay-based approach, DeLoMe. Compared to the concurrent work CaT Liu et al. (2023), which introduces a seminal memory learning-based GCL method, DeLoMe introduces an enhanced graph memory learning method and augments it with a debiased GCL objective, resulting in the first GCL framework for debiased learnable memory-based replay.
- To obtain such memory, we introduce a lossless GCL memory learning method that utilizes gradient matching to enforce a lossless compression of the large graphs of previous tasks into small synthetic graphs as the memory data. It enables DeLoMe to achieve not only a small memory budget requirement but also good privacy preservation of historical data.
- To mitigate the bias toward the current graph, we devise a debiased GCL objective that effectively calibrates the GNN predictions when adapting the GCL models.
- Extensive experiments on four real-world datasets show that DeLoMe learns substantially more cost-effective memory and outperforms both state-of-the-art sampling- and learnable memory-based methods under both class- and task-incremental settings of GCL.
2 Related Work
2.1 Graph Continual Learning
Graph continual learning (GCL) has gained growing popularity in deep learning, and various methods have been proposed He et al. (2023); Zhou and Cao (2021); Liu et al. (2021); Sun et al. (2023); Wang et al. (2022); Su et al. (2023); Rakaraddi et al. (2022); Zhang et al. (2023a); Perini et al. (2022); Zhang et al. (2022c, 2023b), which can be divided into three categories, i.e., regularization-based, parameter isolation-based, and data replay-based methods. For example, as a regularization-based method, TWP Liu et al. (2021) preserved the parameters important to the topological aggregation and loss minimization of previous tasks via regularization terms. Among the parameter isolation methods, HPNs Zhang et al. (2022b) extracted different levels of abstract knowledge in the form of prototypes and selected different combinations of parameters for different tasks, and Zhang et al. (2023a) proposed to continually expand model parameters to learn newly emerging graph patterns. Differently, ERGNN Zhou and Cao (2021) proposed to sample and store representative nodes of previous tasks in a memory buffer and replayed them when learning new tasks. However, the graph structure, which plays a vital role in graph representation learning, is not considered in ERGNN. To incorporate the topological information, Zhang et al. (2022c) and Zhang et al. (2023b) proposed sparsification techniques to find the important neighbors of the selected nodes; the important neighbors, together with the edges between the neighbors and the representative nodes, are then stored in the memory buffer for replaying during the learning of new tasks. Nevertheless, all these methods construct the memory using partial graph data of previous graphs, failing to preserve the holistic graph semantics and also raising privacy concerns in privacy-critical applications. By contrast, our learned memory data help capture the holistic graph information while effectively preserving the privacy of the previous graph data.
2.2 Graph Condensation
Graph condensation aims to compress a graph into a significantly smaller graph while preserving the holistic information of the original graph. Various graph condensation methods have been proposed Jin et al. (2022a, b); Gao et al. (2023); Liu et al. (2022), based on gradient matching or distribution matching between the condensed graph and the original graph. In this work, we utilize gradient matching to compress the graphs of previous tasks, with their node features and structure, into comprehensive synthetic node representations, which serve as the memory for subsequent replaying. Note that the concurrent work CaT Liu et al. (2023) shares a common motivation with our method. However, unlike CaT, which adopts a distribution matching-based method to learn the memory, we use a gradient matching method that can measure the loss of condensation in a more fine-grained manner. Furthermore, CaT trains the GNN only within the learned memory bank to alleviate the class imbalance problem, but this can lead to less effective utilization of the graph data of the current task. We instead devise a debiased GCL objective that calibrates the GCL predictions while avoiding inefficient exploitation of the current graph data.
3 Preliminaries
3.1 The GCL Problem
In this paper, we focus on the node-level GCL problem. Formally, this problem can be formulated as learning a model on a sequence of graphs (tasks) $\{\mathcal{G}_1, \mathcal{G}_2, \ldots, \mathcal{G}_T\}$, where $T$ is the number of continual learning tasks. Each $\mathcal{G}_t = \{\mathbf{A}_t, \mathbf{X}_t\}$ is a newly emerging graph at task $t$, where the adjacency matrix $\mathbf{A}_t$ denotes the relations between nodes, $\mathbf{X}_t$ represents the node features, and the labels of the nodes are denoted as $\mathbf{Y}_t$. Generally, each task contains a unique set of classes $\mathcal{C}_t$, i.e., $\mathcal{C}_i \cap \mathcal{C}_j = \emptyset$ for $i \neq j$. When learning task $t$, the model trained on the previous tasks only has access to the current data $\mathcal{G}_t$. The goal is to adapt the model to the current graph $\mathcal{G}_t$ while maintaining the classification performance on the previous graphs $\{\mathcal{G}_1, \ldots, \mathcal{G}_{t-1}\}$.
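To make the setup concrete, below is a minimal Python sketch of the task-sequence abstraction described above; the container and function names (GraphTask, run_gcl, train_on_task, evaluate) are illustrative and not part of the paper.

```python
from dataclasses import dataclass
import torch

@dataclass
class GraphTask:
    adj: torch.Tensor      # A_t: (N_t x N_t) adjacency of the newly emerging graph
    feat: torch.Tensor     # X_t: (N_t x d) node feature matrix
    labels: torch.Tensor   # Y_t: (N_t,) node labels
    classes: tuple         # the unique class set C_t introduced by this task

def run_gcl(task_sequence, model, train_on_task, evaluate):
    """Sequential GCL: at task t only G_t (plus any small memory) is accessible."""
    history = []
    for t, task in enumerate(task_sequence):
        train_on_task(model, task)   # adapt the model to G_t
        # accuracy on all tasks seen so far, later used for the AA/AF metrics
        history.append([evaluate(model, task_sequence[j]) for j in range(t + 1)])
    return history
```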
3.2 Class & Task Incremental Settings of GCL
Depending on whether the task indicator is provided at the testing stage, GCL can be further divided into two settings: Class-Incremental Learning (CIL) and Task-Incremental Learning (TIL). Assume that each task in $\{\mathcal{G}_1, \ldots, \mathcal{G}_T\}$ has the same number of classes, and we use $\mathcal{C}_t$ to represent the unique set of classes in task $t$. In CIL, after learning all the tasks, the model is required to classify each test instance into one of all the learned classes. Under the TIL setting, the task indicator $t$ is also provided with the test instance; thus, the model is only required to assign the test instance a class from the class set $\mathcal{C}_t$ of task $t$. Compared to TIL, CIL is more practical yet more challenging, as the label prediction space contains all the classes learned so far. In this paper, we evaluate the proposed method under both settings to demonstrate its effectiveness.
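The difference between the two settings can be illustrated with a small prediction routine. This is a sketch under the assumption that the model outputs one logit per class learned so far; the function name and arguments are illustrative.

```python
import torch

def predict(logits, setting, task_classes=None):
    """CIL: choose among all classes learned so far.
    TIL: the task indicator restricts prediction to that task's class set."""
    if setting == "CIL":
        return logits.argmax(dim=1)
    assert task_classes is not None, "TIL requires the indicated task's class set"
    mask = torch.full_like(logits, float("-inf"))
    mask[:, list(task_classes)] = 0.0   # keep only the indicated task's classes
    return (logits + mask).argmax(dim=1)
```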
3.3 Memory Replay in Graph Continual Learning
In GCL, a GNN model $f_{\theta}$ is sequentially trained across the task sequence $\{\mathcal{G}_1, \ldots, \mathcal{G}_T\}$. Typically, the model only has access to $\mathcal{G}_t$ at task $t$. To continually accommodate the new graph and alleviate catastrophic forgetting, memory replay-based methods store representative data of previous tasks in a memory buffer $\mathcal{B}_{t-1}$. Then, the memory data are replayed together with the data of task $t$ to fit the new graph data and maintain the knowledge of previous tasks simultaneously. Therefore, the training objective of memory replay at task $t$ can be formulated as follows:

$$\min_{\theta} \; \ell\big(f_{\theta}(\mathcal{G}_t), \mathbf{Y}_t\big) + \lambda\, \ell\big(f_{\theta}(\mathcal{B}_{t-1}), \mathbf{Y}_{\mathcal{B}_{t-1}}\big), \tag{1}$$

where $f_{\theta}$ is the GNN model parameterized by $\theta$, $\mathbf{Y}_{\mathcal{B}_{t-1}}$ denotes the labels of the nodes in the memory buffer (i.e., all class labels encountered in the previous tasks), $\ell$ denotes a loss function, i.e., the cross-entropy loss, and $\lambda$ is a hyperparameter controlling the importance of the memory loss.
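A minimal sketch of the replay objective in Eq. (1), assuming the model maps an (adjacency, features) pair to per-node logits and the memory buffer is stored as a small labeled graph; the helper names follow the GraphTask sketch above and are not from the paper.

```python
import torch.nn.functional as F

def replay_loss(model, current, memory, lam=1.0):
    """Eq. (1): cross-entropy on the current task plus a weighted
    cross-entropy on the replayed memory of previous tasks."""
    loss_cur = F.cross_entropy(model(current.adj, current.feat), current.labels)
    loss_mem = F.cross_entropy(model(memory.adj, memory.feat), memory.labels)
    return loss_cur + lam * loss_mem
```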
The memory buffer plays an important role in maintaining previously learned knowledge, and different memory construction methods have been proposed. For example, ERGNN Zhou and Cao (2021) samples representative nodes of the previous tasks as the memory; however, the rich topological information is neglected. Other approaches like SSM Zhang et al. (2022c) aim to utilize the structure information by sparsifying the neighbors of the sampled nodes based on their topological importance; the important neighborhood structures, together with the sampled nodes, are then used to construct the memory. Despite the success achieved by these methods, a memory buffer constructed with partial graph data fails to preserve the holistic semantics of the original graph, and it may lead to privacy leakage issues when the sampled nodes/edges are sensitive data. Further, since the memory budget is typically kept significantly smaller than the size of the current graph data, class imbalance arises between the memory data and the current graph data. Our method DeLoMe is proposed to tackle these three issues.

4 Debiased Lossless Memory Replay
4.1 Lossless Memory Learning
4.1.1 Key intuition
Instead of using partial graph data as the replay data, we propose to learn a small set of synthetic node representations as the memory data, which can holistically represent the original graph structure and attributes. Taking task $t$ as an example, given $\mathcal{G}_t = \{\mathbf{A}_t, \mathbf{X}_t\}$ with the label set $\mathbf{Y}_t$, we aim to learn a compressed synthetic graph $\mathcal{M}_t = \{\mathbf{I}, \mathbf{X}'_t\}$ associated with a label set $\mathbf{Y}'_t$ as our memory data, where the fixed identity matrix $\mathbf{I}$ represents the structure of $\mathcal{M}_t$. Note that the size of $\mathcal{M}_t$ is constrained by the memory budget and is significantly smaller than that of $\mathcal{G}_t$.
Since the learned $\mathcal{M}_t$ captures the holistic semantics of the original graph $\mathcal{G}_t$ at task $t$, we can expect a GNN model trained on $\mathcal{M}_t$ to achieve comparable performance to one trained on $\mathcal{G}_t$. Following this idea, the learning objective of $\mathcal{M}_t$ can be formulated as follows:

$$\min_{\mathcal{M}_t} \; \ell\big(f_{\theta^{\mathcal{M}_t}}(\mathcal{G}_t), \mathbf{Y}_t\big) \quad \text{s.t.} \quad \theta^{\mathcal{M}_t} = \arg\min_{\theta}\, \ell\big(f_{\theta}(\mathcal{M}_t), \mathbf{Y}'_t\big), \tag{2}$$

where $\theta^{\mathcal{M}_t}$ denotes the parameters of the graph neural network trained on $\mathcal{M}_t$. Due to the nested-loop optimization of Eq. (2), directly solving the above objective would be prohibitively expensive. To address this challenge, one-step gradient matching Jin et al. (2022a) has been proposed, which matches the gradient of the same model with regard to the real data and the synthetic data at the first training epoch. Inspired by this, the optimization objective of Eq. (2) can be transformed into a lossless compression as follows:

$$\min_{\mathcal{M}_t} \; D\Big(\nabla_{\theta}\, \ell\big(f_{\theta}(\mathcal{G}_t), \mathbf{Y}_t\big),\; \nabla_{\theta}\, \ell\big(f_{\theta}(\mathcal{M}_t), \mathbf{Y}'_t\big)\Big), \tag{3}$$

where $D(\cdot, \cdot)$ is a distance function measuring the difference between the two gradients. Moreover, to obtain a more generalized $\mathcal{M}_t$ that does not fit a specific model initialization, we minimize the objective of Eq. (3) under different random initializations of $\theta$. Thus, the final objective to learn $\mathcal{M}_t$ is defined as follows:

$$\min_{\mathcal{M}_t} \; \mathbb{E}_{\theta_0 \sim \Theta}\Big[ D\Big(\nabla_{\theta_0}\, \ell\big(f_{\theta_0}(\mathcal{G}_t), \mathbf{Y}_t\big),\; \nabla_{\theta_0}\, \ell\big(f_{\theta_0}(\mathcal{M}_t), \mathbf{Y}'_t\big)\Big) \Big], \tag{4}$$

where $\theta_0$ is a random instantiation drawn from the parameter space $\Theta$.
4.1.2 The learning algorithm
To obtain the memory $\mathcal{M}_t = \{\mathbf{I}, \mathbf{X}'_t\}$ of task $t$, we need to learn the node features $\mathbf{X}'_t$ and the label set $\mathbf{Y}'_t$. Since $\mathbf{Y}'_t$ represents the node labels and is discrete, $\mathbf{Y}'_t$ is fixed to the same classes as the original label set $\mathbf{Y}_t$. Therefore, we only need to learn $\mathbf{X}'_t$ for task $t$. To accelerate the learning process, let $n_b$ be the memory budget for each class; $\mathbf{X}'_t$ is initialized with the features of $n_b$ randomly selected nodes from each class in $\mathcal{G}_t$. To further reduce the computation cost, we sample a fixed number of neighbors for each node in $\mathcal{G}_t$ at each hop and adopt a mini-batch training strategy. In addition, the gradient matching and learning of $\mathbf{X}'_t$ are performed for each class separately. Specifically, for a given class $c$ in $\mathcal{G}_t$, a batch of nodes belonging to class $c$ is randomly sampled from $\mathcal{G}_t$ together with the associated neighborhoods, denoted as $\mathcal{G}_t^c$. Meanwhile, we take the corresponding nodes of class $c$ from $\mathcal{M}_t$ and denote them as $\mathcal{M}_t^c$. Then, the sampled $\mathcal{G}_t^c$ and $\mathcal{M}_t^c$ are fed into the same GNN model to calculate the gradient matching loss. Finally, $\mathbf{X}'_t$ is optimized via gradient descent to minimize the gradient matching loss. Note that the graph neural network is not updated during the learning process, and we adopt different initializations of $\theta$ to learn a generalized $\mathcal{M}_t$. By imitating the training trajectory of the original graph $\mathcal{G}_t$, $\mathcal{M}_t$ is enforced to capture the holistic semantics of $\mathcal{G}_t$, and the global topological information is implicitly incorporated into $\mathbf{X}'_t$ (see Algorithm 1 in the Appendix for the detailed procedure).
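The per-class gradient-matching loop described above can be sketched as follows. This is a simplified rendering of Algorithm 1 under several assumptions: `gnn_factory` returns a freshly initialized (and never updated) GNN, `real_batch_fn(c)` returns a sampled class-c subgraph with its neighborhoods, and the mean-squared distance is used for the gradient distance D; the names and hyperparameter defaults are illustrative.

```python
import torch
import torch.nn.functional as F

def grad_match_distance(grads_real, grads_syn):
    # distance D in Eq. (3): mean squared difference between the two gradient sets
    return sum(F.mse_loss(gs, gr) for gr, gs in zip(grads_real, grads_syn))

def learn_memory(gnn_factory, real_batch_fn, syn_feat, syn_labels,
                 classes, epochs=800, lr=1e-4):
    """Learn the synthetic node features X'_t (memory) for one task."""
    syn_feat = syn_feat.clone().requires_grad_(True)   # initialised from real node features
    opt = torch.optim.Adam([syn_feat], lr=lr)
    for _ in range(epochs):
        gnn = gnn_factory()                            # fresh random initialisation (Eq. (4))
        for c in classes:
            adj_r, x_r, y_r = real_batch_fn(c)         # sampled real nodes of class c + neighbours
            loss_r = F.cross_entropy(gnn(adj_r, x_r), y_r)
            grads_r = torch.autograd.grad(loss_r, gnn.parameters())
            mask = syn_labels == c                     # synthetic nodes of class c
            eye = torch.eye(int(mask.sum()), device=syn_feat.device)  # fixed identity structure
            loss_s = F.cross_entropy(gnn(eye, syn_feat[mask]), syn_labels[mask])
            grads_s = torch.autograd.grad(loss_s, gnn.parameters(), create_graph=True)
            opt.zero_grad()
            grad_match_distance([g.detach() for g in grads_r], grads_s).backward()
            opt.step()
    return syn_feat.detach()                           # X'_t: the learned lossless memory
```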
As illustrated in Figure 3 where we visualize the node embeddings of the memories constructed by two sampling methods (random node sampling and SSM) and DeLoMe, the synthetic node representations in the memory learned by DeLoMe can better preserve the distribution of the nodes of different classes in the original graph, compared to the other two methods. This indicates the better capability of DeLoMe in capturing more holistic graph semantics.

4.1.3 Time complexity analysis
We analyze the time complexity of obtaining $\mathcal{M}_t$ at task $t$. As mentioned above, the learning process is conducted for each class separately. We take a two-layer SGC Wu et al. (2019) as the GNN model for gradient matching and denote the numbers of nodes and edges in the sampled subgraph $\mathcal{G}_t^c$ as $N_c$ and $E_c$, respectively. The dimension of the node features is denoted as $d$. For the two-layer SGC, the time complexities of forward and backward propagation with regard to $\mathcal{G}_t^c$ and $\mathcal{M}_t^c$ are $O(E_c d + N_c d^2)$ and $O(n_b d^2)$, respectively. Given $K$ training epochs, the time complexity for class $c$ is $O\big(K(E_c d + N_c d^2 + n_b d^2)\big)$. Then, the overall time complexity of learning $\mathcal{M}_t$ is $O\big(K |\mathcal{C}_t| (E_c d + N_c d^2 + n_b d^2)\big)$. Since we adopt the mini-batch training strategy and sample a fixed number of neighbors for each node at each hop, $N_c$ and $E_c$ are typically small and do not induce high computation costs. Besides, the learning process can be implemented in parallel in practice to further reduce the learning time.
4.2 Debiased Memory Replay
Although the learned memory data are lossless with respect to the previous graphs, there is a data imbalance between the classes in the new graph $\mathcal{G}_t$ and those in the memory buffer $\mathcal{B}_{t-1}$. This is because the memory budget $n_b$ of each class is typically much smaller than the number of nodes belonging to the new classes in $\mathcal{G}_t$. This class imbalance increases when the graph evolves with more current graph data or when a tighter memory budget is given, amplifying the performance degradation in GCL. To tackle this problem, we propose a debiased memory replay method that adjusts the prediction logits of the classes in the memory data and the current graph data based on the class label frequencies during memory replay. Specifically, at task $t$, we have the memory buffer $\mathcal{B}_{t-1} = \{\mathcal{M}_1, \ldots, \mathcal{M}_{t-1}\}$, and the vanilla objective of memory replay in Eq. (1) can be explicitly formulated as:

$$\min_{\theta} \; \ell\big(\mathbf{z}_t, \mathbf{Y}_t\big) + \lambda\, \ell\big(\mathbf{z}_{\mathcal{B}}, \mathbf{Y}_{\mathcal{B}_{t-1}}\big), \quad \text{with } \mathbf{z}_t = f_{\theta}(\mathcal{G}_t),\; \mathbf{z}_{\mathcal{B}} = f_{\theta}(\mathcal{B}_{t-1}), \tag{5}$$

where $\mathbf{z}_t$ and $\mathbf{z}_{\mathcal{B}}$ are the prediction logits of $f_{\theta}$ for the nodes in $\mathcal{G}_t$ and $\mathcal{B}_{t-1}$, respectively. Directly minimizing Eq. (5) would make the model biased toward the current task, as the current graph data dominate the training data. We address this problem by calibrating the logits $\mathbf{z}_t$ and $\mathbf{z}_{\mathcal{B}}$ based on their label frequencies. Given the memory budget $n_b$ per class, the calibration magnitude for each class in the memory buffer is equal and can be defined as:
$$\pi_c = \tau \log \frac{n_b}{|\mathcal{B}_{t-1}| + |\mathcal{G}_t|}, \quad \forall c \in \mathcal{C}_{1:t-1}, \tag{6}$$

where we assume each task contains the same number of classes, $\tau$ is a scaling hyperparameter, and $|\cdot|$ returns the number of training samples in $\mathcal{B}_{t-1}$ or $\mathcal{G}_t$. For a class $c$ in the current graph $\mathcal{G}_t$, the calibration magnitude is defined as:

$$\pi_c = \tau \log \frac{N_c}{|\mathcal{B}_{t-1}| + |\mathcal{G}_t|}, \quad \forall c \in \mathcal{C}_t, \tag{7}$$

where $N_c$ denotes the number of training samples of class $c$ in $\mathcal{G}_t$. By incorporating these two calibrations into Eq. (5), our debiased GCL training loss is as follows:

$$\min_{\theta} \; \ell\big(\mathbf{z}_t + \boldsymbol{\pi}, \mathbf{Y}_t\big) + \ell\big(\mathbf{z}_{\mathcal{B}} + \boldsymbol{\pi}, \mathbf{Y}_{\mathcal{B}_{t-1}}\big), \tag{8}$$
where $\boldsymbol{\pi}$ collects the per-class calibration magnitudes defined in Eqs. (6) and (7), and we discard the weighting parameter $\lambda$ to simplify parameter selection. Compared to Eq. (5), Eq. (8) augments the softmax cross-entropy with a pairwise margin based on label frequencies Menon et al. (2021). In this way, the predictions for the dominant classes in the current graph do not overwhelm those for the tail classes in the memory data, thus reducing the bias toward the dominant classes in $\mathcal{G}_t$. The detailed training steps of DeLoMe are presented in Algorithm 2 in the Appendix.
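The debiased loss can be sketched as a logit-adjusted cross-entropy in the spirit of Menon et al. (2021): here `class_counts[c]` is assumed to hold the per-class memory budget for previously seen classes and the number of labeled current-task nodes for new classes, and `tau` is the scaling hyperparameter. The function name and tensor layout are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def debiased_replay_loss(logits_cur, y_cur, logits_mem, y_mem, class_counts, tau=1.0):
    """Eq. (8) sketch: shift each class logit by tau * log(class frequency) so that
    dominant current-task classes do not overwhelm the replayed memory classes."""
    prior = class_counts.float() / class_counts.sum()   # label frequencies over memory + current data
    adjustment = tau * torch.log(prior)                 # calibration magnitudes of Eqs. (6)-(7)
    loss_cur = F.cross_entropy(logits_cur + adjustment, y_cur)
    loss_mem = F.cross_entropy(logits_mem + adjustment, y_mem)
    return loss_cur + loss_mem
```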
5 Experiments
5.1 Datasets
Following the GCL benchmark Zhang et al. (2022a), four public graph datasets are employed, i.e., CoraFull McCallum et al. (2000), Arxiv Hu et al. (2020), Reddit Hamilton et al. (2017) and Products Hu et al. (2020). Specifically, CoraFull and Arxiv are citation networks, Reddit is constructed from Reddit posts, and Products is a co-purchasing network from Amazon. For all datasets, each task is set to contain only two classes Zhang et al. (2022a). Besides, for each class, the proportions of training, validation and testing are set to be 0.6, 0.2 and 0.2 respectively.
5.2 Competing Models
Two categories of state-of-the-art (SOTA) continual learning methods are employed for comparison. The first category contains traditional continual learning methods, i.e., EWC Kirkpatrick et al. (2017), LwF Li and Hoiem (2017), GEM Lopez-Paz and Ranzato (2017) and MAS Aljundi et al. (2018). The second category includes six SOTA GCL methods: ERGNN Zhou and Cao (2021), TWP Liu et al. (2021), HPNs Zhang et al. (2022b), SSM Zhang et al. (2022c), SEM Zhang et al. (2023b) and CaT Liu et al. (2023). In addition, we include two other methods: Joint and Fine-tune. The Joint method is an oracle model that can see all graphs at all times and performs GCL on the full graphs of all tasks, while Fine-tune is a baseline that simply fine-tunes the model learned from previous tasks without any continual learning technique.
5.3 Evaluation Metrics
Average accuracy (AA) and average forgetting (AF) are adopted to evaluate model performance. Specifically, AA and AF are calculated from the accuracy matrix $M \in \mathbb{R}^{T \times T}$, where $T$ is the number of tasks. The entry $M_{i,j}$ denotes the classification accuracy on task $j$ after the model is optimized on task $i$. After learning all the tasks, the overall AA and AF can be calculated as follows:

$$\mathrm{AA} = \frac{1}{T}\sum_{j=1}^{T} M_{T,j}, \qquad \mathrm{AF} = \frac{1}{T-1}\sum_{j=1}^{T-1} \big(M_{T,j} - M_{j,j}\big). \tag{9}$$
To sum up, AA evaluates the average performance of the model on all the learned tasks after learning the current task, and AF describes how the performance on previous tasks is affected by learning the current task. For both AA and AF, higher values indicate better GCL performance.
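A small sketch of how AA and AF in Eq. (9) can be computed from the accuracy matrix; the matrix layout (row i holds the accuracies measured after training on task i) follows the definition above.

```python
import numpy as np

def aa_af(acc_matrix):
    """acc_matrix[i, j]: accuracy on task j after the model is optimized on task i."""
    M = np.asarray(acc_matrix, dtype=float)
    T = M.shape[0]
    aa = M[T - 1, :].mean()                                 # average accuracy after the final task
    af = (M[T - 1, :T - 1] - np.diag(M)[:T - 1]).mean()     # change vs. just-learned accuracy
    return aa, af
```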
5.4 Implementation Details
For a fair comparison, we implement the proposed method on the GCL benchmark Zhang et al. (2022a). More specifically, we adopt the two-layer SGC Wu et al. (2019) as the backbone model with the same hyperparameters as Zhang et al. (2023b). The memory budget is also set the same as in Zhang et al. (2023b), i.e., 60 per class for the CoraFull dataset and 400 per class for the other datasets. For the lossless memory learning module, we also use a two-layer SGC as the GNN model to match the gradients of the original graph and the synthetic graph, and the gradient divergence is calculated with the mean squared distance. For each dataset, we report the average performance with standard deviations over 5 runs with different seeds under both the task-incremental and class-incremental settings.
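For reference, a plain-PyTorch sketch of a two-layer SGC-style backbone consistent with the setup above (two propagation steps followed by a classifier with a 256-dimensional hidden layer, as stated in the appendix); the dense normalization and the exact head structure are assumptions, and the benchmark's own implementation may differ.

```python
import torch
import torch.nn as nn

class SGCBackbone(nn.Module):
    """SGC (Wu et al., 2019): parameter-free k-step propagation with a
    symmetrically normalized adjacency, followed by a classifier head."""
    def __init__(self, in_dim, num_classes, hidden_dim=256, k=2):
        super().__init__()
        self.k = k
        self.head = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                  nn.Linear(hidden_dim, num_classes))

    def forward(self, adj, x):
        # add self-loops and normalize: D^{-1/2} (A + I) D^{-1/2}
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)
        d_inv_sqrt = a_hat.sum(dim=1).clamp(min=1.0).pow(-0.5)
        a_norm = d_inv_sqrt.unsqueeze(1) * a_hat * d_inv_sqrt.unsqueeze(0)
        for _ in range(self.k):
            x = a_norm @ x          # feature propagation without nonlinearity (SGC)
        return self.head(x)         # per-node class logits
```

In the memory-learning sketch above, `gnn_factory` could simply be `lambda: SGCBackbone(in_dim, num_classes)`.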
Table 1: Performance (AA/% and AF/%, mean±std) under the class-incremental learning (CIL) setting.

| Methods | CoraFull AA/% | CoraFull AF/% | Arxiv AA/% | Arxiv AF/% | Reddit AA/% | Reddit AF/% | Products AA/% | Products AF/% |
|---|---|---|---|---|---|---|---|---|
| Fine-tune | 3.5±0.5 | -95.2±0.5 | 4.9±0.0 | -89.7±0.4 | 5.9±1.2 | -97.9±3.3 | 7.6±0.7 | -88.7±0.8 |
| Joint | 81.2±0.4 | - | 51.3±0.5 | - | 97.1±0.1 | - | 71.5±0.1 | - |
| EWC | 52.6±8.2 | -38.5±12.1 | 8.5±1.0 | -69.5±8.0 | 10.3±11.6 | -33.2±26.1 | 23.8±3.8 | -21.7±7.5 |
| MAS | 6.5±1.5 | -92.3±1.5 | 4.8±0.4 | -72.2±4.1 | 9.2±14.5 | -23.1±28.2 | 16.7±4.8 | -57.0±31.9 |
| GEM | 8.4±1.1 | -88.4±1.4 | 4.9±0.0 | -89.8±0.3 | 11.5±5.5 | -92.4±5.9 | 4.5±1.3 | -94.7±0.4 |
| LwF | 33.4±1.6 | -59.6±2.2 | 9.9±12.1 | -43.6±11.9 | 86.6±1.1 | -9.2±1.1 | 48.2±1.6 | -18.6±1.6 |
| TWP | 62.6±2.2 | -30.6±4.3 | 6.7±1.5 | -50.6±13.2 | 8.0±5.2 | -18.8±9.0 | 14.1±4.0 | -11.4±2.0 |
| ERGNN | 34.5±4.4 | -61.6±4.3 | 21.5±5.4 | -70.0±5.5 | 82.7±0.4 | -17.3±0.4 | 48.3±1.2 | -45.7±1.3 |
| SSM-uniform | 73.0±0.3 | -14.8±0.5 | 47.1±0.5 | -11.7±1.5 | 94.3±0.1 | -1.4±0.1 | 62.0±1.6 | -9.9±1.3 |
| SSM-degree | 75.4±0.1 | -9.7±0.0 | 48.3±0.5 | -10.7±0.3 | 94.4±0.0 | -1.3±0.0 | 63.3±0.1 | -9.6±0.3 |
| SEM-curvature | 77.7±0.8 | -10.0±1.2 | 49.9±0.6 | -8.4±1.3 | 96.3±0.1 | -0.6±0.1 | 65.1±1.0 | -9.5±0.8 |
| CaT | 80.4±0.5 | -5.3±0.4 | 48.2±0.4 | -12.6±0.7 | 97.3±0.1 | -0.4±0.0 | 70.3±0.9 | -4.5±0.8 |
| DeLoMe (Ours) | 81.0±0.2 | -3.3±0.3 | 50.6±0.3 | 5.1±0.4 | 97.4±0.1 | -0.1±0.1 | 67.5±0.7 | -17.3±0.3 |
Table 2: Performance (AA/% and AF/%, mean±std) under the task-incremental learning (TIL) setting.

| Methods | CoraFull AA/% | CoraFull AF/% | Arxiv AA/% | Arxiv AF/% | Reddit AA/% | Reddit AF/% | Products AA/% | Products AF/% |
|---|---|---|---|---|---|---|---|---|
| Fine-tune | 56.0±4.2 | -41.0±4.5 | 56.2±2.6 | -36.2±2.6 | 79.5±24.2 | -11.7±4.8 | 64.4±3.8 | -31.1±4.4 |
| Joint | 95.5±0.2 | - | 90.3±0.4 | - | 99.5±0.0 | - | 95.3±0.8 | - |
| EWC | 89.8±1.0 | -5.1±0.5 | 71.5±0.6 | -0.9±0.6 | 83.9±15.1 | -2.0±1.5 | 87.0±1.4 | -1.7±1.2 |
| MAS | 92.2±0.9 | -3.7±1.3 | 72.7±2.6 | -18.5±2.5 | 61.1±7.1 | -0.5±1.0 | 80.6±4.3 | -13.7±3.7 |
| GEM | 91.5±0.5 | -1.9±0.9 | 81.1±1.7 | -4.0±1.8 | 98.9±0.1 | -0.5±0.1 | 87.7±1.8 | -7.0±2.0 |
| LwF | 93.8±0.1 | -0.4±0.1 | 71.1±3.2 | -1.5±0.8 | 98.6±0.1 | -0.0±0.0 | 86.3±0.2 | -0.5±0.1 |
| TWP | 94.3±0.9 | -1.6±0.4 | 89.4±0.4 | 0.0±0.3 | 78.0±18.5 | -0.2±0.4 | 81.8±3.3 | -0.3±0.8 |
| HPNs | - | - | 85.8±0.7 | 0.6±0.9 | - | - | 80.1±0.8 | 2.9±1.0 |
| ERGNN | 86.3±1.0 | -9.2±0.9 | 86.4±0.3 | 0.5±0.6 | 97.4±0.2 | 4.7±0.1 | 86.4±0.0 | 11.7±0.0 |
| SSM-uniform | 95.3±0.5 | 0.2±0.5 | 88.5±0.6 | -1.3±0.5 | 99.2±0.0 | -0.2±0.0 | 93.1±0.8 | -1.8±0.3 |
| SSM-degree | 95.8±0.3 | 0.6±0.2 | 88.4±0.3 | -1.1±0.1 | 99.3±0.0 | -0.2±0.0 | 93.2±0.7 | -1.9±0.0 |
| SEM-curvature | 95.9±0.5 | 0.7±0.4 | 89.9±0.3 | -0.1±0.5 | 99.3±0.0 | -0.2±0.0 | 93.2±0.7 | -1.8±0.4 |
| CaT | 95.0±0.2 | 1.6±0.7 | 90.3±0.3 | 0.3±0.4 | 99.2±0.0 | 0.0±0.0 | 94.7±0.1 | -0.0±0.1 |
| DeLoMe (Ours) | 95.4±0.1 | 2.0±0.6 | 90.4±0.3 | -1.1±0.2 | 99.4±0.0 | -0.1±0.0 | 94.8±0.1 | -2.2±0.2 |
5.5 Main Results
5.5.1 Results under class-incremental learning (CIL)
The results of all methods under the CIL setting are shown in Table 1. Note that the results of the baselines are taken from Zhang et al. (2023b), since we adopt the same GCL benchmark Zhang et al. (2022a) (https://github.com/QueuQ/CGLB/tree/master). From the table, we can draw the following observations: (1) Directly fine-tuning the model learned from previous tasks on the current task data leads to serious performance degradation, because the knowledge of previous tasks is easily overwritten by the new tasks. (2) Continual learning methods proposed for Euclidean data generally do not achieve satisfactory performance for GCL, which verifies that the unique properties of graphs should be taken into consideration in GCL. (3) Replay-based graph continual learning methods achieve much better performance than the other baselines. Among them, SSM and SEM outperform ERGNN on all datasets, which can be attributed to the fact that SSM and SEM preserve topological information of the historical graph data in the memory, whereas ERGNN only stores individual nodes. (4) In contrast, CaT and DeLoMe learn the memory so as to capture the semantics of the original graph. The performance gain of CaT and DeLoMe over SSM and SEM demonstrates that the learned memory buffer is more informative (e.g., in capturing holistic graph information) and enhances the power of memory replay. (5) Our method DeLoMe achieves new SOTA performance in nearly all cases. Its performance can even match or outperform the oracle method Joint on the CoraFull and Reddit datasets. Note that on Products, DeLoMe outperforms all methods except CaT. This may be because the node class frequency information in Products does not accurately capture the imbalance bias, possibly due to the presence of less informative nodes in this largest dataset, rendering our debiased component less effective.
5.5.2 Results under task-incremental learning (TIL)
In Table 2, we report the results under the TIL setting, in which the SOTA TIL method HPNs is also included for comparison. From the table, we can first see that all methods achieve much better performance under the TIL setting. This is because the availability of the task indicator during inference makes TIL much easier than CIL. Although the performance of DeLoMe is slightly inferior to SEM Zhang et al. (2023b) on the CoraFull dataset, DeLoMe achieves performance comparable to the oracle model, Joint. For the other three datasets, DeLoMe achieves the best performance and even outperforms Joint on the Arxiv dataset, which verifies the advantages of our two components in overcoming the forgetting and class imbalance problems.

5.6 Cost-effective Learning of Expressive Memory
5.6.1 Expressiveness of sampling- vs. learning-based memory
In this subsection, we evaluate the expressiveness of the memory constructed by our learning-based method against two SOTA sampling-based methods, ERGNN and SSM. To evaluate memory expressiveness, for each task, we train a GNN model with the memory data constructed by each method and calculate the node classification accuracy on the test set of the original graph. We follow the same experimental setting as in the above GCL experiments and report the average classification accuracy over all tasks in Figure 4. It is clear that our learning-based method achieves much better performance than both sampling-based methods, with improvements of up to 3% in AA. This demonstrates the superiority of capturing the semantics of the original graph when constructing the memory and explains the performance gain over the sampling-based methods in the GCL experiments.
5.6.2 Memory budget efficiency
The memory budget requirement is crucial for replay-based methods. Here we evaluate the performance of the proposed method under different memory budgets, with the two best competing methods, SSM and CaT, and the oracle model Joint as baselines. Due to page limits, we report the results on CoraFull and Arxiv under the class-incremental setting, varying the per-class memory budget over a range of values for each dataset. The results are shown in Figure 5. The figure demonstrates that the learning-based methods (DeLoMe and CaT) are more effective than the sampling-based method under tight memory budgets. Compared to SSM and CaT, our DeLoMe performs much more stably with varying budgets and achieves consistently better performance across different memory budgets, especially on the Arxiv dataset. These results reinforce our empirical evidence on the effectiveness of DeLoMe in constructing memory and handling the class imbalance problem.


5.6.3 Computational efficiency
We further investigate the memory construction time and inference time of DeLoMe, with SSM and CaT for comparison. Specifically, we report the average memory construction time per task and the overall inference time of each method on the largest dataset, Products. From the results in Table 3, we can see that CaT and DeLoMe require almost the same time for memory construction, while SSM is much faster than the two learning-based methods. This is because CaT and DeLoMe involve model optimization so that the memory captures the holistic semantics of the original graph, whereas SSM employs a parameter-free sampling strategy. This also explains the superiority of CaT and DeLoMe over SSM in AA and AF in Tables 1 and 2. In terms of inference time, the three methods are nearly identical, since memory construction only happens in the training stage and the test setting remains the same for all methods.
Table 3: Average memory construction time per task and overall inference time on the Products dataset.

| Stage | SSM | CaT | DeLoMe |
|---|---|---|---|
| Memory Construction | 0.53 | 28.94 | 28.37 |
| Inference | 1.73 | 1.75 | 1.71 |
5.7 Ablation Study
We evaluate the contribution of the two key components in DeLoMe, i.e., lossless memory learning and debiased GCL learning. Specifically, we derive four variants of DeLoMe. When lossless memory learning is not used, we employ the sampling strategy of ERGNN to construct the memory. Without loss of generality, we conduct the experiments on the CoraFull and Arxiv datasets under the CIL setting. The results of the different variants are shown in Table 4. From the table, we can see that adding either of the two components contributes a significant improvement compared to the variant that uses neither component. This showcases the importance of both components, as well as their effectiveness in respectively addressing the memory expressiveness and data imbalance problems. In general, having a more expressive memory yields a larger improvement than tackling the data imbalance problem alone. Nevertheless, with our debiased GCL objective, DeLoMe achieves a further large improvement over using the expressive memory learning component alone.
Table 4: Ablation study of the two key components of DeLoMe (AA/% and AF/%) under the CIL setting.

| Lossless Memory | Debiased Learning | CoraFull AA | CoraFull AF | Arxiv AA | Arxiv AF |
|---|---|---|---|---|---|
|  |  | 36.9 | -59.0 | 24.6 | -66.1 |
|  | ✓ | 50.2 | -40.6 | 33.9 | -31.1 |
| ✓ |  | 78.5 | -9.3 | 47.9 | -15.8 |
| ✓ | ✓ | 81.0 | -3.3 | 50.6 | 5.1 |
6 Conclusion
In this paper, we propose DeLoMe, a novel memory replay-based GCL method. Traditional replay-based graph continual learning methods typically construct the memory of previous tasks using partial graph data, failing to preserve the holistic semantics of the original graph at each task. To tackle this issue, we learn compressed synthetic node representations as the memory via a gradient matching approach. In this way, the learned representations capture the holistic graph structure and attribute information. Besides, the learned representations help preserve the privacy of the graph data during replay. To overcome the class imbalance between the learned memory and the newly arriving graph, we further propose a debiased memory replay objective that calibrates the prediction logits of the classes in both the memory data and the current task based on the label frequencies. Extensive experiments on four datasets demonstrate the effectiveness of the proposed method under both class- and task-incremental learning settings of GCL.
References
- Aljundi et al. [2018] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European conference on computer vision (ECCV), pages 139–154, 2018.
- Gao et al. [2023] Xinyi Gao, Tong Chen, Yilong Zang, Wentao Zhang, Quoc Viet Hung Nguyen, Kai Zheng, and Hongzhi Yin. Graph condensation for inductive node representation learning. arXiv preprint arXiv:2307.15967, 2023.
- Hamilton et al. [2017] William L Hamilton, Rex Ying, and Jure Leskovec. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584, 2017.
- Hayes and Kanan [2020] Tyler L Hayes and Christopher Kanan. Lifelong machine learning with deep streaming linear discriminant analysis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 220–221, 2020.
- He et al. [2023] Bowei He, Xu He, Yingxue Zhang, Ruiming Tang, and Chen Ma. Dynamically expandable graph convolution for streaming recommendation. In Proceedings of the ACM Web Conference 2023, pages 1457–1467, 2023.
- Hu et al. [2020] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.
- Jin et al. [2022a] Wei Jin, Xianfeng Tang, Haoming Jiang, Zheng Li, Danqing Zhang, Jiliang Tang, and Bing Yin. Condensing graphs via one-step gradient matching. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 720–730, 2022.
- Jin et al. [2022b] Wei Jin, Lingxiao Zhao, Shichang Zhang, Yozen Liu, Jiliang Tang, and Neil Shah. Graph condensation for graph neural networks. In International Conference on Learning Representations, 2022.
- Kipf and Welling [2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2016.
- Kirkpatrick et al. [2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
- Li and Hoiem [2017] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935–2947, 2017.
- Liu et al. [2021] Huihui Liu, Yiding Yang, and Xinchao Wang. Overcoming catastrophic forgetting in graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 8653–8661, 2021.
- Liu et al. [2022] Mengyang Liu, Shanchuan Li, Xinshi Chen, and Le Song. Graph condensation via receptive field distribution matching. arXiv preprint arXiv:2206.13697, 2022.
- Liu et al. [2023] Yilun Liu, Ruihong Qiu, and Zi Huang. Cat: Balanced continual graph learning with graph condensation. arXiv preprint arXiv:2309.09455, 2023.
- Lopez-Paz and Ranzato [2017] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30, 2017.
- McCallum et al. [2000] Andrew Kachites McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3:127–163, 2000.
- Menon et al. [2021] Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment. In International Conference on Learning Representations, 2021.
- Perini et al. [2022] Massimo Perini, Giorgia Ramponi, Paris Carbone, and Vasiliki Kalavri. Learning on streaming graphs with experience replay. In Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, pages 470–478, 2022.
- Rakaraddi et al. [2022] Appan Rakaraddi, Lam Siew Kei, Mahardhika Pratama, and Marcus De Carvalho. Reinforced continual learning for graphs. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 1666–1674, 2022.
- Sinha et al. [2015] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June Hsu, and Kuansan Wang. An overview of microsoft academic service (mas) and applications. In Proceedings of the 24th international conference on world wide web, pages 243–246, 2015.
- Su et al. [2023] Junwei Su, Difan Zou, Zijun Zhang, and Chuan Wu. Towards robust graph incremental learning on evolving graphs. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 32728–32748. PMLR, 23–29 Jul 2023.
- Sun et al. [2023] Li Sun, Junda Ye, Hao Peng, Feiyang Wang, and S Yu Philip. Self-supervised continual graph learning in adaptive riemannian spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 4633–4642, 2023.
- Wang et al. [2022] Chen Wang, Yuheng Qiu, Dasong Gao, and Sebastian Scherer. Lifelong graph learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13719–13728, 2022.
- Wang et al. [2023] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual learning: Theory, method and application. arXiv preprint arXiv:2302.00487, 2023.
- Wu et al. [2019] Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. Simplifying graph convolutional networks. In International conference on machine learning, pages 6861–6871. PMLR, 2019.
- Wu et al. [2020] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24, 2020.
- Wu et al. [2021] Guile Wu, Shaogang Gong, and Pan Li. Striking a balance between stability and plasticity for class-incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1124–1133, 2021.
- Xia et al. [2021] Feng Xia, Ke Sun, Shuo Yu, Abdul Aziz, Liangtian Wan, Shirui Pan, and Huan Liu. Graph learning: A survey. IEEE Transactions on Artificial Intelligence, 2(2):109–127, 2021.
- Zhang et al. [2022a] Xikun Zhang, Dongjin Song, and Dacheng Tao. Cglb: Benchmark tasks for continual graph learning. Advances in Neural Information Processing Systems, 35:13006–13021, 2022.
- Zhang et al. [2022b] Xikun Zhang, Dongjin Song, and Dacheng Tao. Hierarchical prototype networks for continual graph representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):4622–4636, 2022.
- Zhang et al. [2022c] Xikun Zhang, Dongjin Song, and Dacheng Tao. Sparsified subgraph memory for continual graph representation learning. In 2022 IEEE International Conference on Data Mining (ICDM), pages 1335–1340. IEEE, 2022.
- Zhang et al. [2023a] Peiyan Zhang, Yuchen Yan, Chaozhuo Li, Senzhang Wang, Xing Xie, Guojie Song, and Sunghun Kim. Continual learning on dynamic graphs via parameter isolation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’23, page 601–611, New York, NY, USA, 2023. Association for Computing Machinery.
- Zhang et al. [2023b] Xikun Zhang, Dongjin Song, and Dacheng Tao. Ricci curvature-based graph sparsification for continual graph representation learning. IEEE Transactions on Neural Networks and Learning Systems, 2023.
- Zhou and Cao [2021] Fan Zhou and Chengtai Cao. Overcoming catastrophic forgetting in graph neural networks with experience replay. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 4714–4722, 2021.
Appendix A Appendix
A.1 Algorithms
We present the algorithm for lossless memory learning at task $t$ in Algorithm 1, and the overall algorithm of the proposed method DeLoMe is shown in Algorithm 2.
A.2 Details on Datasets
- CoraFull (https://docs.dgl.ai/en/1.1.x/generated/dgl.data.CoraFullDataset.html): a citation network containing 70 classes, where nodes represent papers and edges represent citation links between papers.
- Arxiv (https://ogb.stanford.edu/docs/nodeprop/#ogbn-arxiv): also a citation network, covering all Computer Science (CS) arXiv papers indexed by MAG Sinha et al. [2015]. Each node denotes a CS paper, and an edge between two nodes represents a citation between them. The nodes are classified into 40 subject areas, and the node features are computed as the average word embedding of all words in the title and abstract.
- Reddit (https://docs.dgl.ai/en/1.1.x/generated/dgl.data.RedditDataset.html#dgl.data.RedditDataset): encompasses Reddit posts generated in September 2014, with each post classified into a distinct community (subreddit). Nodes represent individual posts, and an edge between two posts exists if a user has commented on both. Node features are derived from various attributes, including the post title, content, comments, post score, and the number of comments.
- Products (https://ogb.stanford.edu/docs/nodeprop/#ogbn-products): an Amazon product co-purchasing network, where nodes represent products sold on Amazon and edges indicate that two products are purchased together. The node features are constructed from the dimensionality-reduced bag-of-words of the product descriptions.
The statistics of these datasets are summarized in Table 5.

Table 5: Statistics of the datasets.

| Datasets | CoraFull | Arxiv | Reddit | Products |
|---|---|---|---|---|
| # nodes | 19,793 | 169,343 | 227,853 | 2,449,028 |
| # edges | 130,622 | 1,166,243 | 114,615,892 | 61,859,036 |
| # classes | 70 | 40 | 40 | 46 |
| # tasks | 35 | 20 | 20 | 23 |
| # Avg. nodes per task | 660 | 8,467 | 11,393 | 122,451 |
| # Avg. edges per task | 4,354 | 58,312 | 5,730,794 | 2,689,523 |
A.3 More Implementation Details
All the continual learning methods, including the proposed method, are implemented based on the graph continual learning benchmark Zhang et al. [2022a]. For ERGNN Zhou and Cao [2021], the memory budget is set to up to 800 per class to demonstrate the advantage of the proposed method. For SSM Zhang et al. [2022c] and SEM Zhang et al. [2023b], we conduct experiments with the same memory budgets. However, to preserve the topological information, SSM and SEM need to store the neighbors of the selected nodes; the number of neighbors stored for each node is set to 5 at each hop for both methods.
Different GNN backbones, such as GCN Kipf and Welling [2016] and SGC Wu et al. [2019], can be applied to these continual learning methods. In the main paper, to have a fair comparison to the baselines Zhang et al. [2023b], we employ a two-layer SGC model as the backbone. Specifically, the hidden dimension is set to 256 for all methods. The number of training epochs of each graph learning task is 200 with Adam as the optimizer and the learning rate is set to 0.005.
For the lossless memory learning in the proposed method, we employ the same two-layer SGC model as the GNN model to calculate the gradient-matching loss. The synthetic node representations are set as learnable parameters and are initialized with the original node attributes. The labels of the synthetic nodes are fixed during the learning process. Adam is used as the optimizer to learn the synthetic representations, with the learning epochs and learning rate set to 800 and 0.0001, respectively, for all tasks and datasets. The scaling parameter $\tau$ in the debiased loss function is set to 1 for all experiments for simplicity.
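For convenience, the hyperparameters stated in this subsection and in Section 5.4 can be gathered into a single configuration sketch; the key names are illustrative and not part of the released code.

```python
# Hyperparameters as stated in the paper text; key names are illustrative.
CONFIG = {
    "backbone": {"type": "SGC", "layers": 2, "hidden_dim": 256},
    "task_training": {"epochs": 200, "optimizer": "Adam", "lr": 0.005},
    "memory_learning": {"epochs": 800, "optimizer": "Adam", "lr": 1e-4,
                        "init": "original node attributes", "labels": "fixed"},
    "debiased_loss": {"tau": 1.0},
    "memory_budget_per_class": {"CoraFull": 60, "others": 400},
}
```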
The code is implemented with Pytorch (version: 1.10.0), DGL (version: 0.9.1), OGB (version: 1.3.6), and Python 3.8.5. Moreover, all experiments are conducted on a Linux server with an Intel CPU (Intel Xeon E-2288G 3.7GHz) and a Nvidia GPU (Quadro RTX 6000).
A.4 Full Results with Standard Deviation
In the main paper, we only report the average performance of different methods. The full results with standard deviation under both class- and task-incremental settings are shown in Table 1 and Table 2 respectively.
Table 6: Performance (AA/% and AF/%, mean±std) under the CIL setting with GCN as the backbone.

| Methods | CoraFull AA/% | CoraFull AF/% | Arxiv AA/% | Arxiv AF/% | Reddit AA/% | Reddit AF/% | Products AA/% | Products AF/% |
|---|---|---|---|---|---|---|---|---|
| Fine-tune | 2.9±0.0 | -94.7±0.1 | 4.9±0.0 | -87.0±1.5 | 5.1±0.3 | -94.5±2.5 | 3.4±0.8 | -82.5±0.8 |
| Joint | 80.6±0.3 | - | 46.4±1.4 | - | 99.3±0.2 | - | 71.5±0.7 | - |
| EWC | 15.2±0.7 | -81.1±1.0 | 4.9±0.0 | -88.9±0.3 | 10.6±1.5 | -92.9±1.6 | 3.3±1.2 | -89.6±2.0 |
| MAS | 12.3±3.8 | -83.7±4.1 | 4.9±0.0 | -86.8±0.6 | 13.1±2.6 | -35.2±3.5 | 15.0±2.1 | -66.3±1.5 |
| GEM | 7.9±2.7 | -84.8±2.7 | 4.8±0.5 | -87.8±0.2 | 28.4±3.5 | -71.9±4.2 | 5.5±0.7 | -84.3±0.9 |
| LwF | 2.0±0.2 | -95.0±0.2 | 4.9±0.0 | -87.9±1.0 | 4.5±0.5 | -82.1±1.0 | 3.1±0.8 | -85.9±1.4 |
| TWP | 20.9±3.8 | -73.3±4.1 | 4.9±0.0 | -89.0±0.4 | 13.5±2.6 | -89.7±2.7 | 3.0±0.7 | -89.7±1.0 |
| ERGNN | 3.0±0.1 | -93.8±0.5 | 30.3±1.5 | -54.0±1.3 | 88.5±2.3 | -10.8±2.4 | 24.5±1.9 | -67.4±1.9 |
| SSM-uniform | 72.3±0.8 | -15.5±1.5 | 45.1±1.1 | -12.2±1.4 | 93.8±0.4 | -2.0±0.3 | 61.8±1.2 | -10.7±1.1 |
| SSM-degree | 74.4±0.2 | -9.9±0.1 | 46.0±0.4 | -11.3±0.8 | 94.0±0.2 | -1.9±0.2 | 62.9±0.4 | -10.3±0.5 |
| SEM-curvature | 79.6±0.3 | -2.7±0.1 | 51.0±0.3 | -6.7±0.6 | 95.1±0.6 | -1.5±0.7 | 64.0±1.6 | -11.2±1.8 |
| CaT | 79.5±0.4 | -5.5±0.2 | 47.1±0.4 | -13.7±0.2 | 98.0±0.1 | -0.1±0.1 | 70.6±1.0 | -4.0±0.9 |
| DeLoMe (Ours) | 81.8±0.3 | 1.9±0.4 | 52.8±0.3 | 0.2±0.6 | 98.1±0.1 | 0.6±0.2 | 67.0±0.8 | -18.3±0.4 |
Table 7: Performance (AA/% and AF/%, mean±std) under the TIL setting with GCN as the backbone.

| Methods | CoraFull AA/% | CoraFull AF/% | Arxiv AA/% | Arxiv AF/% | Reddit AA/% | Reddit AF/% | Products AA/% | Products AF/% |
|---|---|---|---|---|---|---|---|---|
| Fine-tune | 58.0±1.7 | -38.4±1.8 | 61.7±3.8 | -28.2±3.3 | 73.6±3.5 | -26.9±3.5 | 67.6±1.6 | -25.4±1.6 |
| Joint | 95.2±0.2 | - | 90.3±0.2 | - | 99.4±0.1 | - | 91.8±0.2 | - |
| EWC | 78.9±2.4 | -15.5±2.3 | 78.8±2.7 | -5.0±3.1 | 91.5±4.2 | -8.1±4.6 | 90.1±0.3 | -0.7±0.3 |
| MAS | 93.0±0.5 | -0.6±0.2 | 88.4±0.2 | -0.0±0.0 | 98.6±0.5 | -0.1±0.1 | 91.2±0.6 | -0.5±0.2 |
| GEM | 91.6±0.6 | -1.8±0.6 | 87.3±0.6 | 2.8±0.3 | 91.6±5.6 | -8.1±5.8 | 87.8±0.5 | -2.9±0.5 |
| LwF | 56.1±2.0 | -37.5±1.8 | 84.2±0.5 | -3.7±0.6 | 80.9±4.3 | -19.1±4.6 | 66.5±2.2 | -26.3±2.3 |
| TWP | 92.2±0.5 | -0.9±0.3 | 86.0±0.8 | -2.8±0.8 | 87.4±3.8 | -12.6±4.0 | 90.3±0.1 | -0.5±0.1 |
| ERGNN | 90.6±0.1 | -3.7±0.1 | 86.7±0.1 | 11.4±0.9 | 98.9±0.0 | -0.1±0.1 | 89.0±0.4 | -2.5±0.3 |
| SSM-uniform | 94.5±0.8 | -0.1±0.9 | 88.8±1.2 | -2.1±0.7 | 98.8±0.3 | -0.9±0.6 | 90.9±2.8 | -2.2±1.0 |
| SSM-degree | 94.3±1.1 | 0.9±0.5 | 88.1±1.3 | -1.0±0.4 | 99.0±0.2 | -0.4±0.2 | 91.1±0.9 | -1.8±0.8 |
| SEM-curvature | 95.0±0.5 | -0.2±0.8 | 88.8±0.1 | -1.0±0.2 | 99.2±0.1 | -0.1±0.3 | 91.6±0.5 | -1.5±0.8 |
| CaT | 94.3±0.2 | 0.5±0.5 | 90.2±0.2 | 0.2±0.1 | 99.2±0.0 | 0.1±0.1 | 94.3±0.6 | 0.1±0.2 |
| DeLoMe (Ours) | 95.2±0.3 | 1.3±0.1 | 90.6±0.3 | 2.3±1.1 | 99.3±0.1 | -0.2±0.1 | 94.2±0.1 | -1.4±0.1 |
A.5 Memory budget efficiency
In the main paper, we report the results of the proposed method with different memory budgets under class-incremental learning. Here, we present the results under task-incremental learning in Figure 6. From the figure, we can see that all methods exhibit significantly improved performance under task-incremental learning compared to class-incremental learning. However, the learning-based methods (DeLoMe and CaT) are still more effective than the sampling-based method under tight memory budgets, indicating the importance of capturing the holistic semantics of the original graph in the memory buffer. Note that the performance gain of DeLoMe over CaT is not as significant as that observed in class-incremental learning, which can be attributed to the fact that task-incremental learning is less challenging than class-incremental learning.


A.6 Results with GCN as Backbone
As stated above, different GNNs can be used as the backbone of DeLoMe. In the main paper, we report the results with SGC Wu et al. [2019] as the backbone. In this subsection, we further present the results with GCN Kipf and Welling [2016] as the backbone. Specifically, we employ the same baselines as in the main paper for comparison, and the results under the class- and task-incremental learning are shown in Table 6 and Table 7 respectively. From the results, we can draw similar observations as in the main paper, i.e., the proposed DeLoMe can effectively overcome the catastrophic forgetting problem in graph continual learning by utilizing lossless memory and debiased memory replay. The results also verify the effectiveness of DeLoMe with different backbones.