
Supplementary Materials: FTF-ER: Feature-Topology Fusion-Based Experience Replay Method for Continual Graph Learning


1. Algorithm

Algorithm 1 Framework of our FTF-ER.
0:  Input: Task sequence \mathcal{T}=\{\mathcal{T}_{1},\mathcal{T}_{2},\dots,\mathcal{T}_{K}\}; experience buffer \mathcal{B}; number of sampled nodes per class b.
0:  Output: A model f(\cdot;\boldsymbol{\theta}) that performs well on all tasks.
1:  \mathcal{B}\leftarrow(\emptyset,\emptyset)
2:  Initialize \boldsymbol{\theta} at random
3:  for \mathcal{T}_{i} in \mathcal{T} do
4:     Obtain training dataset \mathcal{D}^{tr}_{i}=(\mathcal{V}^{tr}_{i},\mathcal{G}_{i}) from \mathcal{T}_{i}
5:     Extract experience nodes V(\mathcal{B}) and subgraph \mathcal{G}_{\mathcal{B}} from \mathcal{B}
6:     Compute \mathcal{L}(f(\cdot;\boldsymbol{\theta}),\mathcal{D}^{tr}_{i},V(\mathcal{B}),\mathcal{G}_{\mathcal{B}})
7:     \boldsymbol{\theta}\leftarrow\arg\min_{\boldsymbol{\theta}}(\mathcal{L})
8:     Compute \mathbf{S}^{mix}(f(\cdot;\boldsymbol{\theta}),\mathcal{D}^{tr}_{i})
9:     \mathcal{V}^{buf}_{i}\leftarrow Select(\mathcal{V}^{tr}_{i},\mathbf{S}^{mix},b)
10:    \mathcal{B}\leftarrow(V(\mathcal{B})\cup\mathcal{V}^{buf}_{i},\;\mathcal{G}_{\mathcal{B}}\cup\mathcal{G}_{i}(\mathcal{V}^{buf}_{i}))
11:  end for
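To make the control flow of Algorithm 1 concrete, the following is a minimal PyTorch-style sketch of the outer training loop. The helpers compute_loss, compute_mixed_score, select_nodes, and merge_subgraphs, as well as task.train_data(), are hypothetical placeholders standing in for lines 4, 6, 8, 9, and 10 of the algorithm; they are not the authors' actual implementation.

```python
import torch

def ftf_er(tasks, model, b, epochs=200, lr=0.005):
    """Sketch of Algorithm 1; `tasks` yields the task sequence T_1, ..., T_K."""
    buffer_nodes, buffer_graph = [], None              # B <- (empty, empty)      (line 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for task in tasks:                                 # for T_i in T             (line 3)
        train_nodes, graph = task.train_data()         # D_i^tr = (V_i^tr, G_i)   (line 4)

        for _ in range(epochs):                        # theta <- argmin_theta L  (lines 6-7)
            loss = compute_loss(model, train_nodes, graph,
                                buffer_nodes, buffer_graph)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        scores = compute_mixed_score(model, train_nodes, graph)   # S^mix         (line 8)
        new_nodes = select_nodes(train_nodes, scores, b)          # V_i^buf       (line 9)
        buffer_nodes = buffer_nodes + new_nodes                   # update buffer (line 10)
        buffer_graph = merge_subgraphs(buffer_graph, graph, new_nodes)
    return model
```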

2. Graph Neural Networks

Graph neural networks (GNNs) are deep learning models defined on a graph \mathcal{G}. At each layer of a GNN, the nodes in \mathcal{V} update their hidden representations by aggregating and transforming information from their neighborhoods. This aggregation and transformation is accomplished by an update function that takes into account the hidden representations of the node and of its adjacent nodes.

Formally, for each node v in \mathcal{V}, its hidden representation at the l-th layer of the GNN, denoted as h_{v}^{(l)}, is computed as follows:

(1) h_{v}^{(l)}=U(h_{v}^{(l-1)},h_{u}^{(l-1)}),\quad\forall u\in\mathcal{N}(v),

where U(\cdot) is a differentiable function and \mathcal{N}(v) represents the set of neighbors of node v in the graph. U(\cdot) takes the hidden representation of the current node, h_{v}^{(l-1)}, and those of its neighboring nodes, h_{u}^{(l-1)}, as inputs.

The graph convolutional network (GCN) (GCN) designs U(\cdot) based on a first-order approximation of spectral graph convolutions, using the fixed adjacency matrix \mathbf{A}. In attention-based GNNs such as the graph attention network (GAT) (GAT), U(\cdot) is designed based on pairwise attention. Furthermore, in (GIN), the authors analyze the limitations of GNNs in expressive power and propose the graph isomorphism network (GIN). In our paper, these three types of neural networks serve as the backbones of FTF-ER.
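As an illustration of Eq. (1), the following is a minimal sketch of one possible choice of update function U(\cdot) using mean aggregation over neighbors. It is a generic message-passing layer, not the specific GCN, GAT, or GIN layers used as backbones in our experiments.

```python
import torch
import torch.nn as nn

class MeanAggregationLayer(nn.Module):
    """One GNN layer implementing a simple choice of U(.) from Eq. (1):
    each node combines its own state with the mean of its neighbors' states."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, adj):
        # h:   (N, in_dim) node representations h^(l-1)
        # adj: (N, N) dense adjacency matrix A
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)   # avoid dividing by zero
        neighbor_mean = adj @ h / deg                     # mean over N(v)
        return torch.relu(self.linear(torch.cat([h, neighbor_mean], dim=1)))
```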

3. Theorem and Proof

3.1. Hodge Decomposition Theorem

The Hodge decomposition theorem is a fundamental result in the theory of differential forms and Riemannian geometry. On a compact and oriented Riemannian manifold, it states that any differential form can be uniquely decomposed into the sum of three components:

  • An exact form: A differential form that is the exterior derivative of another form.

  • A co-exact form: A differential form that is the codifferential (the adjoint of the exterior derivative) of another form.

  • A harmonic form: A differential form that is both closed (its exterior derivative is zero) and co-closed (its codifferential is zero).

Let \Omega^{k}(M) denote the space of k-forms on an n-dimensional smooth manifold M, d the exterior derivative operator, and \delta the adjoint map of d. Then we can define the Hodge-Laplace operator:

Definition 3.1 (Hodge-Laplace operator).
(2) \Delta\triangleq d\delta+\delta d:\Omega^{k}(M)\mapsto\Omega^{k}(M).

Given Definition 3.1, we state the Hodge decomposition theorem as follows:

Theorem 3.2 (Hodge Decomposition Theorem).

For any \omega\in\Omega^{k}(M), there exist \alpha\in\Omega^{k-1}(M), \beta\in\Omega^{k+1}(M) and \gamma with \Delta\gamma=0 such that

(3) \omega=d\alpha+\delta\beta+\gamma,

where d\alpha is an exact k-form, \delta\beta is a co-exact k-form, and \gamma, which satisfies \Delta\gamma=0, is referred to as a harmonic form.

This decomposition is unique and orthogonal with respect to the L^{2} inner product on the space of differential forms. The Hodge decomposition theorem has significant implications in various areas of mathematics and physics, including the study of cohomology, the theory of partial differential equations, and the formulation of Maxwell's equations in the language of differential forms. Moreover, this theorem allows us to categorize differential forms into three distinct types, each with its own physical interpretation:

  • An exact form d\alpha: This differential form can be expressed as the gradient of a scalar field, making it particularly useful for describing potential energies associated with various fields. In physics, an exact form is commonly employed to represent electric potential energy or magnetic potential energy, providing valuable insights into the behavior of these fields.

  • A co-exact form \delta\beta: A differential form that can be written as the curl of a vector field is classified as a co-exact form. It plays a crucial role in describing circulation phenomena related to fields, such as magnetic flux in the context of magnetic fields. By utilizing a co-exact form, physicists can effectively model and analyze the circulatory aspects of these fields.

  • A harmonic form \gamma: Unlike an exact or a co-exact form, a harmonic form is characterized by the absence of potential energies or circulations. It represents a state of equilibrium or scenarios devoid of sources or sinks. In physics, a harmonic form is frequently associated with fields that are free of sources, such as source-free electric fields or source-free magnetic fields. This form provides a framework for studying the behavior of fields in the absence of external influences.
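For intuition, these three interpretations mirror the classical Helmholtz(-Hodge) decomposition of a smooth vector field on a suitable domain, which can be written informally (omitting the boundary conditions needed for uniqueness) as

\mathbf{F}=\underbrace{\nabla\varphi}_{\text{potential part}}+\underbrace{\nabla\times\mathbf{A}}_{\text{circulating part}}+\underbrace{\mathbf{H}}_{\text{harmonic part}},\qquad\nabla\times(\nabla\varphi)=\mathbf{0},\quad\nabla\cdot(\nabla\times\mathbf{A})=0,

where \mathbf{H} is both curl-free and divergence-free.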

3.2. Hodge Decomposition Theorem on Graphs

By defining several concepts on graphs, including d, \Delta, and \delta for the case k=0, we can generalize the Hodge decomposition theorem from its manifold version to graphs.

Definition 3.3 (Hodge Potential Score).
(4) \Omega^{0}(\mathcal{G})\triangleq\{s:\mathcal{V}\mapsto\mathbb{R}\}.
Definition 3.4 (Edge Flows).
(5) \Omega^{1}(\mathcal{G})\triangleq\{\mathcal{X}:\mathcal{V}\times\mathcal{V}\mapsto\mathbb{R}\,|\,\mathcal{X}(i,j)=-\mathcal{X}(j,i),\;(i,j)\in\mathcal{E}\}.
Definition 3.5 (Gradient Operator).

Let \mathbf{grad} be the gradient operator and s_{i},s_{j}\in\Omega^{0}(\mathcal{G}); we have

(6) (d_{0}s)(i,j)\triangleq(\mathbf{grad}\,s)(i,j)\triangleq s_{j}-s_{i},\quad(i,j)\in\mathcal{E}.
Definition 3.6 (Negative Divergence Operator).

Let w denote the weight of an element of \Omega^{k}(\mathcal{G}), with w_{i}\in\Omega^{0}(\mathcal{G}) and w_{ij}\in\Omega^{1}(\mathcal{G}), and let \mathbf{div} be the divergence operator; we have

(7) (\delta_{0}\mathcal{X})(i)\triangleq(-\mathbf{div}\,\mathcal{X})(i)=-\sum_{j}\frac{w_{ij}}{w_{i}}\mathcal{X}(i,j).
Definition 3.7 (Graph Laplacian Operator).
(8) \Delta_{0}\triangleq\delta_{0}d_{0}\triangleq-\mathbf{div}(\mathbf{grad}).

We refer to \Delta_{0} as the graph Laplacian operator; the proof that \Delta_{0} corresponds to the usual graph Laplacian is given in Section 3.3.

Given the definitions above, we first state the Hodge decomposition theorem on graphs for the case k=0:

Theorem 3.8 (Hodge Decomposition Theorem on Graphs for k=0).

Let \mathbf{Im} denote the image and \mathbf{ker} the kernel; we have

(9) \Omega^{0}(\mathcal{G})=\mathbf{Im}\;d_{0}\oplus\mathbf{Im}\;\delta_{0}\oplus\mathbf{ker}\;\Delta_{0}.
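To illustrate how the k=0 decomposition yields node-level topological importance, the following is a minimal NumPy sketch, assuming an unweighted graph (w_{ij}=a_{ij}, w_{i}=1): an edge flow is split into a gradient component \mathbf{grad}\,s, whose least-squares potential s plays the role of the Hodge potential score, and a divergence-free remainder. The example graph, the edge flow, and all variable names are illustrative only.

```python
import numpy as np

# Small unweighted example graph (4 nodes): w_ij = a_ij, w_i = 1.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
n = A.shape[0]
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian Delta_0 = D - A (Section 3.3)

# Gradient operator d_0 as an (edges x nodes) incidence matrix: (grad s)(i, j) = s_j - s_i.
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i, j] > 0]
grad = np.zeros((len(edges), n))
for e, (i, j) in enumerate(edges):
    grad[e, i], grad[e, j] = -1.0, 1.0

# An arbitrary edge flow X in Omega^1(G), one value per edge in `edges`.
X = np.array([1.0, 2.0, -0.5, 0.3, 1.2])

# Least-squares potential s (equivalently, a solution of the normal equations
# L s = grad^T X, since grad^T grad = L for an unweighted graph).
s, *_ = np.linalg.lstsq(grad, X, rcond=None)
gradient_part = grad @ s                 # component in Im d_0
remainder = X - gradient_part            # divergence-free (curl + harmonic) part

print("Hodge potential scores:", np.round(s - s.mean(), 3))
print("grad^T of remainder:", np.round(grad.T @ remainder, 6))  # ~ 0 at every node
```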

In our paper, we employ the gradient field decomposition provided by the k=0 version of the Hodge decomposition theorem on graphs to compute the topological importance of nodes. For the sake of theoretical completeness, we then describe the theorem for the case k\in\mathbb{N}. To generalize the Hodge decomposition theorem on graphs from k=0 to k\in\mathbb{N}, we first denote K_{k} as the set of k-cliques on the graph. Then we have

(10) \Omega^{k}(\mathcal{G})\triangleq\{u:\mathcal{V}^{k+1}\mapsto\mathbb{R}\,|\,u(i_{\sigma(0)},\cdots,i_{\sigma(k)})=\mathrm{sign}(\sigma)u(i_{0},\cdots,i_{k}),\;(i_{0},\cdots,i_{k})\in K_{k+1}\}.

For d_{k}:\Omega^{k}(\mathcal{G})\mapsto\Omega^{k+1}(\mathcal{G}) and \delta_{k}:\Omega^{k}(\mathcal{G})\mapsto\Omega^{k+1}(\mathcal{G}), we define

(11) (d_{k}u)(i_{0},\cdots,i_{k+1})\triangleq\sum^{k+1}_{j=0}(-1)^{j+1}u(i_{0},\cdots,i_{j-1},i_{j+1},\cdots,i_{k+1}),

and

(12) (\delta_{k}u)(i_{0},\cdots,i_{k+1})\triangleq\sum^{k+1}_{j=0}(-1)^{j}u(i_{0},\cdots,i_{j-1},i_{j+1},\cdots,i_{k+1}).

Then we have

(13) \Delta_{k}\triangleq\delta_{k}d_{k}+d_{k-1}\delta_{k-1}.

Given the definitions above, we state the Hodge decomposition theorem on graphs for k\in\mathbb{N}:

Theorem 3.9 (Hodge Decomposition Theorem on Graphs).

Let \mathbf{Im} denote the image and \mathbf{ker} the kernel; we have

(14) \Omega^{k}(\mathcal{G})=\mathbf{Im}\;d_{k}\oplus\mathbf{Im}\;\delta_{k}\oplus\mathbf{ker}\;\Delta_{k}.

Theorem 3.9 reveals that a graph signal can be decomposed into three orthogonal components:

  • Gradient component: The gradient component represents the conservative or curl-free part of the graph signal. It can be expressed as the gradient of a potential function defined on the nodes of the graph. This means that the flow along any closed path in the graph sums up to zero. In other words, the gradient component captures the portion of the signal that exhibits no rotational behavior. It is analogous to the irrotational component in the continuous Hodge decomposition.

  • Curl component: The curl component represents the rotational or divergence-free part of the graph signal. It can be expressed as the curl of a potential function defined on the edges of the graph. This means that the net flow out of any node in the graph is zero. The curl component captures the portion of the signal that exhibits rotational behavior but has no divergence.

  • Harmonic component: The harmonic component is the part of the graph signal that is both gradient-free and divergence-free. It represents the signal’s behavior that is not captured by either the gradient or the curl components. In other words, it is the part of the signal that is constant on connected components of the graph and has zero gradient and zero divergence.

These three components (i.e., gradient, curl, and harmonic) form a complete and orthogonal decomposition of the graph signal. They provide a way to analyze and understand the different aspects of the signal’s behavior on the graph. The gradient component captures the conservative part, the curl component captures the rotational part, and the harmonic component captures the part that is neither conservative nor rotational. This decomposition is particularly useful in applications involving signal processing, data analysis, and machine learning on graph-structured data.

3.3. Graph Laplacian Operator

In this section, for completeness, we give the proof that \Delta_{0} in our paper corresponds to the usual graph Laplacian operator; it may also be found in (Hodge1).

Proof.

Let s\in\Omega^{0}(\mathcal{G}); we have

(15) (\mathbf{grad}\;s)(i,j)=\begin{cases}s(j)-s(i),&\mathrm{if}\;(i,j)\in\mathcal{E},\\ 0,&\mathrm{otherwise}.\end{cases}

Let a_{ij} be an element of the adjacency matrix \mathbf{A}; the gradient may then be written as (\mathbf{grad}\;s)(i,j)=a_{ij}(s(j)-s(i)), and thus

(16) (\Delta_{0}s)(i)=-(\mathbf{div}(\mathbf{grad}\;s))(i)=-(\mathbf{div}\;a_{ij}(s(j)-s(i)))(i)=-\sum^{n}_{j=1}a_{ij}(s(j)-s(i))=\xi_{i}s(i)-\sum^{n}_{j=1}a_{ij}s(j),

where, for any node v_{i}\;(i=1,\cdots,n), we define its degree as

(17) \xi_{i}=\mathrm{deg}(i)=\sum^{n}_{j=1}a_{ij}.

If we regard a function s\in\Omega^{0}(\mathcal{G}) as a vector (s_{1},\cdots,s_{n})\in\mathbb{R}^{n} with s(i)=s_{i}, and set \mathbf{D}=\mathrm{diag}(\xi_{1},\cdots,\xi_{n})\in\mathbb{R}^{n\times n}, then Eq. (16) becomes

(18) \Delta_{0}s=\left[\begin{array}{cccc}\xi_{1}-a_{11}&-a_{12}&\cdots&-a_{1n}\\ -a_{21}&\xi_{2}-a_{22}&\cdots&-a_{2n}\\ \vdots&&\ddots&\vdots\\ -a_{n1}&-a_{n2}&\cdots&\xi_{n}-a_{nn}\end{array}\right]\left[\begin{array}{c}s_{1}\\ s_{2}\\ \vdots\\ s_{n}\end{array}\right]=(\mathbf{D}-\mathbf{A})s.

So \Delta_{0} may be regarded as \mathbf{D}-\mathbf{A}, the usual definition of the graph Laplacian. ∎
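As a concrete check of Eq. (18), consider the unweighted path graph on three nodes 1–2–3:

\mathbf{D}-\mathbf{A}=\left[\begin{array}{ccc}1&-1&0\\ -1&2&-1\\ 0&-1&1\end{array}\right],\qquad(\Delta_{0}s)(2)=2s_{2}-s_{1}-s_{3}=-\big[(s_{1}-s_{2})+(s_{3}-s_{2})\big],

i.e., the negative sum of the outgoing gradients at node 2, in agreement with Eq. (16).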

4. Additional Experimental Details

4.1. Implementation Details

We use the Adam optimizer, setting the initial learning rate to 0.005 and the number of training epochs to 200 on all datasets. The regularizer hyper-parameter for EWC (EWC), MAS, and TWP is always set to 10,000, and the β for TWP (TWP) is set to 0.01. For the experience replay baselines, i.e., GEM (GEM), ER-GNN (ERGNN), and SSM (SSM), we set the per-class buffer size to 60, 60, 400, and 100 for Amazon Computers, Corafull, OGB-Arxiv, and Reddit, respectively. For our method, we use the same buffer size as the other methods and select a suitable β from [0.0, 0.25, 0.5, 0.75, 1.0] for each dataset. Additionally, for fairness, all backbones are 2-layer networks with a hidden dimension of 256. Finally, since each method is evaluated on all three backbones, we report only the best result of each method on each dataset.
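For reference, the settings above can be summarized in the following configuration sketch; the dictionary keys are illustrative and do not correspond to the authors' actual code.

```python
config = {
    "optimizer": "Adam",
    "learning_rate": 0.005,
    "epochs": 200,
    "backbone": {"type": "GCN | GAT | GIN", "num_layers": 2, "hidden_dim": 256},
    "regularizer_lambda": 10_000,        # EWC, MAS, TWP
    "twp_beta": 0.01,
    "buffer_size_per_class": {           # GEM, ER-GNN, SSM, and FTF-ER
        "Amazon Computers": 60,
        "Corafull": 60,
        "OGB-Arxiv": 400,
        "Reddit": 100,
    },
    "ftf_er_beta_grid": [0.0, 0.25, 0.5, 0.75, 1.0],
}
```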

Figure 1. The influence of β on average accuracy.

4.2. Sensitivity to the Hyper-parameter

To analyze the sensitivity of our method to the hyper-parameter β, we examine the influence of β on the average accuracy (AA) over all four datasets. In this experiment, we use a finer grid of β values, [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], to observe its effect more carefully. Figure 1 shows that the value of β affects different datasets to different degrees. For example, on the Amazon Computers dataset it causes a performance fluctuation of approximately 14%, while on the Corafull dataset the fluctuation is only about 0.8%. Further analysis reveals that the peak consistently occurs at intermediate values of β on all datasets, which further confirms the effectiveness of our feature-topology fusion sampling.

Figure 2. 2-D t-SNE projections of embeddings from four models. Nodes with different labels are shown as dots in different colors.

4.3. Qualitative Analysis

To demonstrate the effectiveness of the node representations learned by FTF-ER, we conduct a qualitative analysis on the Amazon Computers dataset. For this purpose, we generate standard 2-D t-SNE (t-SNE) projections of the node representations. We use the complete test set, containing 5 tasks and 10 classes, to analyze the overall performance of each model after the full continual learning process. Given FTF-ER's ability to differentiate the importance of nodes at both the feature and the topological level, we anticipate that nodes sharing the same label will be positioned closely in the projection space, indicating similar representation vectors. Figure 2 visualizes the hidden-layer representations of four CGL methods, namely FTF-ER, SSM (SSM), ER-GNN (ERGNN), and TWP (TWP). The results show that FTF-ER exhibits a clearer separation of nodes from distinct communities than the alternative methods, showcasing its capability to capture distinctions among nodes within diverse communities through the gathered node information.
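A minimal sketch of how such projections can be produced with scikit-learn is given below; the function name and plotting details are illustrative, not the exact procedure used to generate Figure 2.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings, labels, title):
    """embeddings: (N, d) array of hidden node representations from a trained model;
    labels: (N,) array of class labels. Produces a 2-D t-SNE scatter plot."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.axis("off")
    plt.show()
```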