
Supplementary Materials: FTF-ER: Feature-Topology Fusion-Based Experience Replay Method for Continual Graph Learning


1. Algorithm

Algorithm 1 Framework of our FTF-ER.
0:  Input: Task sequence \mathcal{T}=\{\mathcal{T}_{1},\mathcal{T}_{2},\dots,\mathcal{T}_{K}\}; experience buffer \mathcal{B}; number of sampled nodes per class b.
0:  Output: A model f(\cdot;\boldsymbol{\theta}) that performs well on all tasks.
1:  \mathcal{B}\leftarrow(\emptyset,\emptyset)
2:  Initialize \boldsymbol{\theta} at random
3:  for \mathcal{T}_{i} in \mathcal{T} do
4:     Obtain training dataset \mathcal{D}^{tr}_{i}=(\mathcal{V}^{tr}_{i},\mathcal{G}_{i}) from \mathcal{T}_{i}
5:     Extract experience nodes V(\mathcal{B}) and subgraph \mathcal{G}_{\mathcal{B}} from \mathcal{B}
6:     Compute \mathcal{L}(f(\cdot;\boldsymbol{\theta}),\mathcal{D}^{tr}_{i},V(\mathcal{B}),\mathcal{G}_{\mathcal{B}})
7:     \boldsymbol{\theta}\leftarrow\arg\min_{\boldsymbol{\theta}}(\mathcal{L})
8:     Compute \mathbf{S}^{mix}(f(\cdot;\boldsymbol{\theta}),\mathcal{D}^{tr}_{i})
9:     \mathcal{V}^{buf}_{i}\leftarrow Select(\mathcal{V}^{tr}_{i},\mathbf{S}^{mix},b)
10:    \mathcal{B}\leftarrow(V(\mathcal{B})\cup\mathcal{V}^{buf}_{i},\;\mathcal{G}_{\mathcal{B}}\cup\mathcal{G}_{i}(\mathcal{V}^{buf}_{i}))
11:  end for
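To make the control flow of Algorithm 1 concrete, the following is a minimal PyTorch-style sketch of the outer training loop. The helpers compute_loss, compute_mixed_score, select_nodes, and merge_subgraphs, as well as task.train_data(), are hypothetical placeholders standing in for lines 4, 6, 8, 9, and 10 of the algorithm; they are not the authors' actual implementation.

```python
import torch

def ftf_er(tasks, model, b, epochs=200, lr=0.005):
    """Sketch of Algorithm 1; `tasks` yields the task sequence T_1, ..., T_K."""
    buffer_nodes, buffer_graph = [], None              # B <- (empty, empty)      (line 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for task in tasks:                                 # for T_i in T             (line 3)
        train_nodes, graph = task.train_data()         # D_i^tr = (V_i^tr, G_i)   (line 4)

        for _ in range(epochs):                        # theta <- argmin_theta L  (lines 6-7)
            loss = compute_loss(model, train_nodes, graph,
                                buffer_nodes, buffer_graph)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        scores = compute_mixed_score(model, train_nodes, graph)   # S^mix         (line 8)
        new_nodes = select_nodes(train_nodes, scores, b)          # V_i^buf       (line 9)
        buffer_nodes = buffer_nodes + new_nodes                   # update buffer (line 10)
        buffer_graph = merge_subgraphs(buffer_graph, graph, new_nodes)
    return model
```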

2. Graph Neural Networks

Graph neural networks (GNNs) are deep learning models defined on a graph \mathcal{G}. At each layer of a GNN, the nodes in \mathcal{V} update their hidden representations by aggregating and transforming information from their neighborhoods. This aggregation and transformation is accomplished by an update function that takes into account the hidden representations of the node and of its adjacent nodes.

Formally, for each node v in \mathcal{V}, its hidden representation at the l-th layer of the GNN, denoted as h_{v}^{(l)}, is computed as follows:

(1) h_{v}^{(l)}=U(h_{v}^{(l-1)},h_{u}^{(l-1)}),\quad\forall u\in\mathcal{N}(v),

where U(\cdot) is a differentiable function and \mathcal{N}(v) represents the set of neighbors of node v in the graph. U(\cdot) takes the hidden representation of the current node, h_{v}^{(l-1)}, and those of its neighboring nodes, h_{u}^{(l-1)}, as inputs.

The graph convolutional network (GCN) (GCN) designs U(\cdot) based on a first-order approximation of spectral graph convolutions, using the fixed adjacency matrix \mathbf{A}. In attention-based GNNs such as the graph attention network (GAT) (GAT), U(\cdot) is designed based on pairwise attention. Furthermore, in (GIN), the authors analyze the limitations of GNNs in expressive power and propose the graph isomorphism network (GIN). In our paper, these three types of neural networks serve as the backbones of FTF-ER.
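As an illustration of Eq. (1), the following is a minimal sketch of one possible choice of update function U(\cdot) using mean aggregation over neighbors. It is a generic message-passing layer, not the specific GCN, GAT, or GIN layers used as backbones in our experiments.

```python
import torch
import torch.nn as nn

class MeanAggregationLayer(nn.Module):
    """One GNN layer implementing a simple choice of U(.) from Eq. (1):
    each node combines its own state with the mean of its neighbors' states."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, adj):
        # h:   (N, in_dim) node representations h^(l-1)
        # adj: (N, N) dense adjacency matrix A
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)   # avoid dividing by zero
        neighbor_mean = adj @ h / deg                     # mean over N(v)
        return torch.relu(self.linear(torch.cat([h, neighbor_mean], dim=1)))
```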

3. Theorem and Proof

3.1. Hodge Decomposition Theorem

The Hodge decomposition theorem is a fundamental result in the theory of differential forms and Riemannian geometry. On a compact and oriented Riemannian manifold, it states that any differential form can be uniquely decomposed into the sum of three components:

  • An exact form: A differential form that is the exterior derivative of another form.

  • A co-exact form: A differential form that is the codifferential (the adjoint of the exterior derivative) of another form.

  • A harmonic form: A differential form that is both closed (its exterior derivative is zero) and co-closed (its codifferential is zero).

Let \Omega^{k}(M) denote the space of k-forms on an n-dimensional smooth manifold M, d the exterior derivative operator, and \delta the adjoint map of d. Then we can define the Hodge-Laplace operator:

Definition 3.1 (Hodge-Laplace operator).
(2) \Delta\triangleq d\delta+\delta d:\Omega^{k}(M)\mapsto\Omega^{k}(M).

Given Definition 3.1, we state the Hodge decomposition theorem as follows:

Theorem 3.2 (Hodge Decomposition Theorem).

For any \omega\in\Omega^{k}(M), there exist \alpha\in\Omega^{k-1}(M), \beta\in\Omega^{k+1}(M) and \gamma with \Delta\gamma=0 such that

(3) \omega=d\alpha+\delta\beta+\gamma,

where d\alpha is an exact k-form, \delta\beta is a co-exact k-form, and \gamma, which satisfies \Delta\gamma=0, is referred to as a harmonic form.

This decomposition is unique and orthogonal with respect to the L^{2} inner product on the space of differential forms. The Hodge decomposition theorem has significant implications in various areas of mathematics and physics, including the study of cohomology, the theory of partial differential equations, and the formulation of Maxwell's equations in the language of differential forms. Moreover, this theorem allows us to categorize differential forms into three distinct types, each with its own physical interpretation:

  • An exact form d\alpha: This differential form can be expressed as the gradient of a scalar field, making it particularly useful for describing potential energies associated with various fields. In physics, an exact form is commonly employed to represent electric potential energy or magnetic potential energy, providing valuable insights into the behavior of these fields.

  • A co-exact form \delta\beta: A differential form that can be written as the curl of a vector field is classified as a co-exact form. It plays a crucial role in describing circulation phenomena related to fields, such as magnetic flux in the context of magnetic fields. By utilizing a co-exact form, physicists can effectively model and analyze the circulatory aspects of these fields.

  • A harmonic form \gamma: Unlike an exact or a co-exact form, a harmonic form is characterized by the absence of potential energies or circulations. It represents a state of equilibrium or scenarios devoid of sources or sinks. In physics, a harmonic form is frequently associated with fields that are free of sources, such as source-free electric fields or source-free magnetic fields. This form provides a framework for studying the behavior of fields in the absence of external influences.
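For intuition, these three interpretations mirror the classical Helmholtz(-Hodge) decomposition of a smooth vector field on a suitable domain, which can be written informally (omitting the boundary conditions needed for uniqueness) as

\mathbf{F}=\underbrace{\nabla\varphi}_{\text{potential part}}+\underbrace{\nabla\times\mathbf{A}}_{\text{circulating part}}+\underbrace{\mathbf{H}}_{\text{harmonic part}},\qquad\nabla\times(\nabla\varphi)=\mathbf{0},\quad\nabla\cdot(\nabla\times\mathbf{A})=0,

where \mathbf{H} is both curl-free and divergence-free.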

3.2. Hodge Decomposition Theorem on Graphs

By defining several concepts on graphs, including d, \Delta, and \delta for the case k=0, we can generalize the Hodge decomposition theorem from its manifold version to graphs.

Definition 3.3 (Hodge Potential Score).
(4) \Omega^{0}(\mathcal{G})\triangleq\{s:\mathcal{V}\mapsto\mathbb{R}\}.
Definition 3.4 (Edge Flows).
(5) \Omega^{1}(\mathcal{G})\triangleq\{\mathcal{X}:\mathcal{V}\times\mathcal{V}\mapsto\mathbb{R}\,|\,\mathcal{X}(i,j)=-\mathcal{X}(j,i),\;(i,j)\in\mathcal{E}\}.
Definition 3.5 (Gradient Operator).

Let \mathbf{grad} be the gradient operator and s_{i},s_{j}\in\Omega^{0}(\mathcal{G}); we have

(6) (d_{0}s)(i,j)\triangleq(\mathbf{grad}\,s)(i,j)\triangleq s_{j}-s_{i},\quad(i,j)\in\mathcal{E}.
Definition 3.6 (Negative Divergence Operator).

Let w denote the weight of an element of \Omega^{k}(\mathcal{G}), with w_{i}\in\Omega^{0}(\mathcal{G}) and w_{ij}\in\Omega^{1}(\mathcal{G}), and let \mathbf{div} be the divergence operator; we have

(7) (\delta_{0}\mathcal{X})(i)\triangleq(-\mathbf{div}\,\mathcal{X})(i)=-\sum_{j}\frac{w_{ij}}{w_{i}}\mathcal{X}(i,j).
Definition 3.7 (Graph Laplacian Operator).
(8) \Delta_{0}\triangleq\delta_{0}d_{0}\triangleq-\mathbf{div}(\mathbf{grad}).

We refer to \Delta_{0} as the graph Laplacian operator; the proof that \Delta_{0} corresponds to the usual graph Laplacian is given in Section 3.3.

Given the definitions above, we first state the Hodge decomposition theorem on graphs for the case k=0:

Theorem 3.8 (Hodge Decomposition Theorem on Graphs for k=0).

Let \mathbf{Im} denote the image and \mathbf{ker} the kernel; we have

(9) \Omega^{0}(\mathcal{G})=\mathbf{Im}\;d_{0}\oplus\mathbf{Im}\;\delta_{0}\oplus\mathbf{ker}\;\Delta_{0}.
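To illustrate how the k=0 decomposition yields node-level topological importance, the following is a minimal NumPy sketch, assuming an unweighted graph (w_{ij}=a_{ij}, w_{i}=1): an edge flow is split into a gradient component \mathbf{grad}\,s, whose least-squares potential s plays the role of the Hodge potential score, and a divergence-free remainder. The example graph, the edge flow, and all variable names are illustrative only.

```python
import numpy as np

# Small unweighted example graph (4 nodes): w_ij = a_ij, w_i = 1.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
n = A.shape[0]
L = np.diag(A.sum(axis=1)) - A          # graph Laplacian Delta_0 = D - A (Section 3.3)

# Gradient operator d_0 as an (edges x nodes) incidence matrix: (grad s)(i, j) = s_j - s_i.
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i, j] > 0]
grad = np.zeros((len(edges), n))
for e, (i, j) in enumerate(edges):
    grad[e, i], grad[e, j] = -1.0, 1.0

# An arbitrary edge flow X in Omega^1(G), one value per edge in `edges`.
X = np.array([1.0, 2.0, -0.5, 0.3, 1.2])

# Least-squares potential s (equivalently, a solution of the normal equations
# L s = grad^T X, since grad^T grad = L for an unweighted graph).
s, *_ = np.linalg.lstsq(grad, X, rcond=None)
gradient_part = grad @ s                 # component in Im d_0
remainder = X - gradient_part            # divergence-free (curl + harmonic) part

print("Hodge potential scores:", np.round(s - s.mean(), 3))
print("grad^T of remainder:", np.round(grad.T @ remainder, 6))  # ~ 0 at every node
```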

In our paper, we employ the gradient field decomposition provided by the k=0 version of the Hodge decomposition theorem on graphs to compute the topological importance of nodes. For the sake of theoretical completeness, we then describe the theorem for the case k\in\mathbb{N}. To generalize the Hodge decomposition theorem on graphs from k=0 to k\in\mathbb{N}, we first denote K_{k} as the set of k-cliques on the graph. Then we have

(10) \Omega^{k}(\mathcal{G})\triangleq\{u:\mathcal{V}^{k+1}\mapsto\mathbb{R}\,|\,u(i_{\sigma(0)},\cdots,i_{\sigma(k)})=\mathrm{sign}(\sigma)u(i_{0},\cdots,i_{k}),\;(i_{0},\cdots,i_{k})\in K_{k+1}\}.

For d_{k}:\Omega^{k}(\mathcal{G})\mapsto\Omega^{k+1}(\mathcal{G}) and \delta_{k}:\Omega^{k}(\mathcal{G})\mapsto\Omega^{k+1}(\mathcal{G}), we define

(11) (d_{k}u)(i_{0},\cdots,i_{k+1})\triangleq\sum^{k+1}_{j=0}(-1)^{j+1}u(i_{0},\cdots,i_{j-1},i_{j+1},\cdots,i_{k+1}),

and

(12) (\delta_{k}u)(i_{0},\cdots,i_{k+1})\triangleq\sum^{k+1}_{j=0}(-1)^{j}u(i_{0},\cdots,i_{j-1},i_{j+1},\cdots,i_{k+1}).

Then we have

(13) \Delta_{k}\triangleq\delta_{k}d_{k}+d_{k-1}\delta_{k-1}.

Given the definitions above, we state the Hodge decomposition theorem on graphs for k\in\mathbb{N}:

Theorem 3.9 (Hodge Decomposition Theorem on Graphs).

Let \mathbf{Im} denote the image and \mathbf{ker} the kernel; we have

(14) \Omega^{k}(\mathcal{G})=\mathbf{Im}\;d_{k}\oplus\mathbf{Im}\;\delta_{k}\oplus\mathbf{ker}\;\Delta_{k}.

Theorem 3.9 reveals that a graph signal can be decomposed into three orthogonal components:

  • Gradient component: The gradient component represents the conservative or curl-free part of the graph signal. It can be expressed as the gradient of a potential function defined on the nodes of the graph. This means that the flow along any closed path in the graph sums up to zero. In other words, the gradient component captures the portion of the signal that exhibits no rotational behavior. It is analogous to the irrotational component in the continuous Hodge decomposition.

  • Curl component: The curl component represents the rotational or divergence-free part of the graph signal. It can be expressed as the curl of a potential function defined on the edges of the graph. This means that the net flow out of any node in the graph is zero. The curl component captures the portion of the signal that exhibits rotational behavior but has no divergence.

  • Harmonic component: The harmonic component is the part of the graph signal that is both gradient-free and divergence-free. It represents the signal’s behavior that is not captured by either the gradient or the curl components. In other words, it is the part of the signal that is constant on connected components of the graph and has zero gradient and zero divergence.

These three components (i.e., gradient, curl, and harmonic) form a complete and orthogonal decomposition of the graph signal. They provide a way to analyze and understand the different aspects of the signal’s behavior on the graph. The gradient component captures the conservative part, the curl component captures the rotational part, and the harmonic component captures the part that is neither conservative nor rotational. This decomposition is particularly useful in applications involving signal processing, data analysis, and machine learning on graph-structured data.

3.3. Graph Laplacian Operator

In this section, for completeness, we give the proof that \Delta_{0} in our paper corresponds to the usual graph Laplacian operator; it may also be found in (Hodge1).

Proof.

Let s\in\Omega^{0}(\mathcal{G}); we have

(15) (\mathbf{grad}\;s)(i,j)=\begin{cases}s(j)-s(i),&\mathrm{if}\;(i,j)\in\mathcal{E},\\ 0,&\mathrm{otherwise}.\end{cases}

Let a_{ij} be an element of the adjacency matrix \mathbf{A}; the gradient may then be written as (\mathbf{grad}\;s)(i,j)=a_{ij}(s(j)-s(i)), and thus

(16) (\Delta_{0}s)(i)=-(\mathbf{div}(\mathbf{grad}\;s))(i)=-(\mathbf{div}\;a_{ij}(s(j)-s(i)))(i)=-\sum^{n}_{j=1}a_{ij}(s(j)-s(i))=\xi_{i}s(i)-\sum^{n}_{j=1}a_{ij}s(j),

where, for any node v_{i}\;(i=1,\cdots,n), we define its degree as

(17) \xi_{i}=\mathrm{deg}(i)=\sum^{n}_{j=1}a_{ij}.

If we regard a function s\in\Omega^{0}(\mathcal{G}) as a vector (s_{1},\cdots,s_{n})\in\mathbb{R}^{n} with s(i)=s_{i}, and set \mathbf{D}=\mathrm{diag}(\xi_{1},\cdots,\xi_{n})\in\mathbb{R}^{n\times n}, then Eq. (16) becomes

(18) \Delta_{0}s=\left[\begin{array}{cccc}\xi_{1}-a_{11}&-a_{12}&\cdots&-a_{1n}\\ -a_{21}&\xi_{2}-a_{22}&\cdots&-a_{2n}\\ \vdots&&\ddots&\vdots\\ -a_{n1}&-a_{n2}&\cdots&\xi_{n}-a_{nn}\end{array}\right]\left[\begin{array}{c}s_{1}\\ s_{2}\\ \vdots\\ s_{n}\end{array}\right]=(\mathbf{D}-\mathbf{A})s.

So \Delta_{0} may be regarded as \mathbf{D}-\mathbf{A}, the usual definition of the graph Laplacian. ∎
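As a concrete check of Eq. (18), consider the unweighted path graph on three nodes 1–2–3:

\mathbf{D}-\mathbf{A}=\left[\begin{array}{ccc}1&-1&0\\ -1&2&-1\\ 0&-1&1\end{array}\right],\qquad(\Delta_{0}s)(2)=2s_{2}-s_{1}-s_{3}=-\big[(s_{1}-s_{2})+(s_{3}-s_{2})\big],

i.e., the negative sum of the outgoing gradients at node 2, in agreement with Eq. (16).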

4. Additional Experimental Details

4.1. Implementation Details

We use the Adam optimizer, setting the initial learning rate to 0.005 and the number of training epochs to 200 on all datasets. The regularizer hyper-parameter for EWC (EWC), MAS, and TWP is always set to 10,000, and the β for TWP (TWP) is set to 0.01. For the experience replay baselines, i.e., GEM (GEM), ER-GNN (ERGNN), and SSM (SSM), we set the per-class buffer size to 60, 60, 400, and 100 for Amazon Computers, Corafull, OGB-Arxiv, and Reddit, respectively. For our method, we use the same buffer size as the other methods and select a suitable β from [0.0, 0.25, 0.5, 0.75, 1.0] for each dataset. Additionally, for fairness, all backbones are 2-layer networks with a hidden dimension of 256. Finally, since each method is evaluated on all three backbones, we report only the best result of each method on each dataset.
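For reference, the settings above can be summarized in the following configuration sketch; the dictionary keys are illustrative and do not correspond to the authors' actual code.

```python
config = {
    "optimizer": "Adam",
    "learning_rate": 0.005,
    "epochs": 200,
    "backbone": {"type": "GCN | GAT | GIN", "num_layers": 2, "hidden_dim": 256},
    "regularizer_lambda": 10_000,        # EWC, MAS, TWP
    "twp_beta": 0.01,
    "buffer_size_per_class": {           # GEM, ER-GNN, SSM, and FTF-ER
        "Amazon Computers": 60,
        "Corafull": 60,
        "OGB-Arxiv": 400,
        "Reddit": 100,
    },
    "ftf_er_beta_grid": [0.0, 0.25, 0.5, 0.75, 1.0],
}
```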

Figure 1. The influence of β on average accuracy.

4.2. Sensitivity to the Hyper-parameter

To analyze the sensitivity of our method to the hyper-parameter β, we examine the influence of β on the average accuracy (AA) over all four datasets. In this experiment, we use a finer grid of β values, [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], to observe its effect more carefully. Figure 1 shows that the value of β affects different datasets to different degrees. For example, on the Amazon Computers dataset it causes a performance fluctuation of approximately 14%, while on the Corafull dataset the fluctuation is only about 0.8%. Further analysis reveals that the peak consistently occurs at intermediate values of β on all datasets, which further confirms the effectiveness of our feature-topology fusion sampling.

Figure 2. 2-D t-SNE projections of embeddings from four models. Nodes with different labels are shown as dots in different colors.

4.3. Qualitative Analysis

To demonstrate the effectiveness of the node representations learned by FTF-ER, we conduct a qualitative analysis on the Amazon Computers dataset. For this purpose, we generate standard 2-D t-SNE (t-SNE) projections of the node representations. We use the complete test set, containing 5 tasks and 10 classes, to analyze the overall performance of each model after the full continual learning process. Given FTF-ER's ability to differentiate the importance of nodes at both the feature and the topological level, we anticipate that nodes sharing the same label will be positioned closely in the projection space, indicating similar representation vectors. Figure 2 visualizes the hidden-layer representations of four CGL methods, namely FTF-ER, SSM (SSM), ER-GNN (ERGNN), and TWP (TWP). The results show that FTF-ER exhibits a clearer separation of nodes from distinct communities than the alternative methods, showcasing its capability to capture distinctions among nodes within diverse communities through the gathered node information.
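A minimal sketch of how such projections can be produced with scikit-learn is given below; the function name and plotting details are illustrative, not the exact procedure used to generate Figure 2.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings, labels, title):
    """embeddings: (N, d) array of hidden node representations from a trained model;
    labels: (N,) array of class labels. Produces a 2-D t-SNE scatter plot."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab10")
    plt.title(title)
    plt.axis("off")
    plt.show()
```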