
Beyond Graph Convolutional Network: An Interpretable Regularizer-centered Optimization Framework

Shiping Wang1,2, Zhihao Wu1,2, Yuhong Chen1,2, Yong Chen3 (corresponding author)
Abstract

Graph convolutional networks (GCNs) have been attracting widespread attention due to their encouraging performance and powerful generalization ability. However, few works provide a general view to interpret various GCNs and to guide their design. In this paper, by revisiting the original GCN, we induce an interpretable regularizer-centered optimization framework in which, by building appropriate regularizers, we can interpret most GCNs, such as APPNP, JKNet, DAGNN, and GNN-LF/HF. Further, under the proposed framework, we devise a dual-regularizer graph convolutional network (dubbed tsGCN) to capture topological and semantic structures from graph data. Since the derived learning rule for tsGCN contains the inverse of a large matrix and is thus time-consuming, we leverage the Woodbury matrix identity and low-rank approximation tricks to successfully decrease the high computational complexity of computing infinite-order graph convolutions. Extensive experiments on eight public datasets demonstrate that tsGCN achieves superior performance against quite a few state-of-the-art competitors on classification tasks.

Introduction

Owing to its powerful ability to aggregate neighborhood information, the Graph Convolutional Network (GCN) has been successfully applied to diverse domains, such as computer vision [1, 2, 3], recommender systems [4, 5], privacy preserving [6], and traffic forecasting [7, 8]. Rooted in a series of theoretical foundations, GCN extends convolution operations to non-Euclidean spaces and effectively propagates label signals, and therefore its variants have been extensively employed for a variety of graph-related tasks, including classification [9, 10], clustering [11, 12], and link prediction [13, 14]. In a nutshell, GCN generates graph embeddings with well-established graph convolutional layers that gather semantics from neighbors according to the network topology, and these layers have been revealed to be its most critical component.

Although GCN has performed well in many machine learning tasks, many studies have pointed out certain drawbacks and made efforts toward further improvements. Bo et al. [15] indicated that the propagation mechanism could be considered a special form of low-pass filter, and presented a GCN with an adaptive frequency. Zhang et al. [16] argued that most GCN-based methods ignore global information and proposed SHNE, which leverages structure and feature similarity to capture latent semantics. Wang et al. [17] revealed that the original GCN aggregates information from node neighbors inadequately, and then developed a multi-channel GCN by utilizing a feature-based semantic graph. Despite the performance boosts of these GCN variants, they do not establish a generalized framework: they understand and enhance GCN from specific, non-generalizable perspectives, which makes them difficult to develop further and limits their interpretability.

Consequently, it is desirable to construct a unified framework for various GCNs with better interpretability; however, such work remains scarce. Zhao et al. [18] linked GCN and graph-regularized PCA (GPCA), and then proposed a multi-layer network by stacking GPCA layers. Zhu et al. [19] attempted to interpret existing GCN-based methods with a unified optimization framework, under which they devised an adjustable graph filter for a new GCN variant. Yang et al. [20] designed a family of graph convolutional layers inspired by the updating rules of two typical iterative algorithms. Although these efforts have contributed to a better understanding of GCNs, they explain GCNs only in partial aspects, motivating a more comprehensive analysis of GCNs.

To tackle the aforementioned issues, this paper induces an interpretable regularizer-centered optimization framework, which provides a novel perspective from which to digest various GCNs: the framework captures the common essential properties of existing state-of-the-art GCN variants and can define them simply by devising different regularizers. Moreover, in light of the analyses of current representative GCNs, we find that most existing approaches only consider the topological regularization, while the feature-based semantic structure is underutilized; this motivates us to design a dual-regularizer graph convolutional network (called tsGCN) within the regularizer-centered optimization framework to fully explore both structures and semantics of graph data. Since performing infinite-order graph convolutions is computationally expensive, the unified framework provides a straightforward remedy that employs truncated polynomials to approximate the graph Laplacian, similar to the truncated Chebyshev polynomials adopted by the vanilla GCN, which restricts the message passing of a single graph convolution to the first-order neighborhood.

The main contributions of this paper can be summarized as the following three aspects:

  • Propose a regularizer-centered constrained optimization framework, which interprets various existing GCNs with specific regularizers.

  • Establish a new dual-regularizer graph convolutional network (tsGCN), which exploits topological and semantic structures of the given data; and develop an efficient algorithm to reduce the computational complexity of solving infinite-order graph convolutions.

  • Conduct a series of experiments to show that tsGCN performs much better than many state-of-the-art GCNs, and also consumes much less time than the recently proposed GNN-HF/LF.

Related Work

Graph Convolutional Networks

The original GCN was first introduced by Kipf et al. [21], who generalized convolution operations from the Euclidean domain to the non-Euclidean domain. SGC [22] assumed that the nonlinear transforms of GCN are not that significant, and devised a simplified GCN by removing the nonlinear activation functions and collapsing the weight matrices. PPNP [23] employed the relationship between PageRank and GCN to improve the propagation mechanism of GCN, and an iterative version called APPNP was further proposed to reduce the high computational complexity. Attempting to adaptively learn the influence radii for each node and task, JKNet [24] combined various aggregations at the last layer and was able to learn representations of different orders for graph substructures. GNN-LF and GNN-HF [19] considered the low-pass and the high-pass filter as convolution kernels, respectively, to improve GCN's expressive power. AdaGCN [25] leveraged the AdaBoost strategy to enhance GCN, allowing information to be shared between layers. To sum up, a main characteristic of these methods is that they explore GCN from the perspective of redesigning information aggregation strategies or modifying graph convolutions, and few works try to construct a unified framework to interpret various GCNs and reveal the underlying common principles.

Further Insights on GCNs

Quite a few studies have been devoted to exploring the mechanisms of GCN for deeper insights. Li et al. [26] indicated that the convolutional operation of GCN is a special form of Laplacian smoothing, owing to which GCN suffers from the so-called over-smoothing problem. Specifically, the performance of GCN decreases as the number of layers increases, which has been validated by many other studies. However, Liu et al. [27] held a different opinion that the entanglement of two steps in GCN damages the performance of deep GCNs, where the two steps are explained as propagation and transformation. Based on this view, they decoupled the two operations and further presented a deeper GCN. Zhu et al. [19] also decomposed the convolution operation of GCN into two separate stages, called aggregation and transformation, and focused on the aggregation process, formulating an optimization objective to interpret it. Yang et al. [28] explored network topology refinement, leveraging a topology optimization process for the explanation. Oono et al. [29] analyzed the forward propagation of GCN and interpreted it with a specific dynamical system, relating GCN to the underlying topological structures. Overall, these studies have contributed to the interpretability of GCNs and have helped researchers better understand them. In this paper, we build a unified optimization framework from the novel view of graph regularizers to interpret and understand GCNs.

Mathematical Notations

For the convenience of formal descriptions, derivations, and analyses, necessary notations are narrated as below. A graph is denoted as 𝒢=(𝒱,,𝐀)\mathcal{G}=(\mathcal{V},\mathcal{E},\mathbf{A}), where 𝒱\mathcal{V} marks the vertex set with |𝒱|=N|\mathcal{V}|=N (NN is the total number of nodes in graph 𝒢\mathcal{G}), \mathcal{E} marks the edge set, and 𝐀=[𝐀ij]N×N\mathbf{A}=[\mathbf{A}_{ij}]_{N\times N} marks an affinity matrix of which 𝐀ij\mathbf{A}_{ij} measures the similarity between the ii-th and the jj-th node. In addition, 𝐃=[𝐃ij]N×N\mathbf{D}=[\mathbf{D}_{ij}]_{N\times N} represents the degree matrix of 𝒢\mathcal{G} with 𝐃ii=j=1N𝐀ij\mathbf{D}_{ii}=\sum_{j=1}^{N}\mathbf{A}_{ij}, and then the normalized symmetrical graph Laplacian of 𝒢\mathcal{G} is computed as 𝐋~=𝐈𝐀~\widetilde{\mathbf{L}}=\mathbf{I}-\widetilde{\mathbf{A}} with 𝐀~=𝐃12𝐀𝐃12\widetilde{\mathbf{A}}=\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}.
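To make the notations concrete, the following sketch builds $\mathbf{A}$, $\mathbf{D}$, $\widetilde{\mathbf{A}}$ and $\widetilde{\mathbf{L}}$ for a small toy graph with NumPy (the graph itself is an illustrative assumption, not data from the paper).

```python
import numpy as np

# Toy undirected, unweighted graph on 4 nodes (illustrative assumption).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

deg = A.sum(axis=1)                       # D_ii = sum_j A_ij
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))  # D^{-1/2}
A_norm = D_inv_sqrt @ A @ D_inv_sqrt      # Ã = D^{-1/2} A D^{-1/2}
L_norm = np.eye(A.shape[0]) - A_norm      # L̃ = I - Ã

# Eigenvalues of the normalized Laplacian lie in [0, 2].
print(np.round(np.linalg.eigvalsh(L_norm), 3))
```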

Revisiting Graph Convolutional Network

Methods | Propagation Rules | Regularizer $\mathcal{L}(\mathbf{H}^{(l)};\mathcal{G})$ | Projective Set
GCN | $\mathbf{H}^{(l)}=\sigma\left(\widehat{\mathbf{A}}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)$ | $\mathrm{Tr}\left({\mathbf{H}^{(l)}}^{\top}\widetilde{\mathbf{L}}\mathbf{H}^{(l)}\right)$ | $\mathcal{S}^{(l)}=\mathcal{S}_{+},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
SGC | $\mathbf{H}^{(l)}=\sigma\left(\widehat{\mathbf{A}}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)$ | $\mathrm{Tr}\left({\mathbf{H}^{(l)}}^{\top}\widetilde{\mathbf{L}}\mathbf{H}^{(l)}\right)$ | $\mathcal{S}^{(l)}=\mathcal{S},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
APPNP | $\mathbf{H}^{(l)}=\sigma\left((1-\alpha)\widehat{\mathbf{A}}\mathbf{H}^{(l-1)}+\alpha\mathbf{H}^{(0)}\right)$ | $\mathrm{Tr}\left(\frac{1}{1-\alpha}{\mathbf{H}^{(l)}}^{\top}\widehat{\mathbf{A}}^{-1}(\mathbf{H}^{(l)}-2\alpha\mathbf{H}^{(0)})\right)$ | $\mathcal{S}^{(l)}=\mathcal{S},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
JKNet | $\mathbf{H}^{(l)}=\sigma\left(\sum_{k=1}^{K}\alpha_{k}\widehat{\mathbf{A}}^{k}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)$ | $\mathrm{Tr}\left({\mathbf{H}^{(l)}}^{\top}\widehat{\mathbf{A}}^{-1}(\mathbf{I}+\beta\widetilde{\mathbf{L}})\mathbf{H}^{(l)}\right)$ | $\mathcal{S}^{(l)}=\mathcal{S},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
DAGNN | $\mathbf{H}^{(l)}=\sigma\left(\sum_{k=0}^{K}\alpha_{k}\widehat{\mathbf{A}}^{k}\mathbf{H}^{(l-1)}\right)$ | $\mathrm{Tr}\left({\mathbf{H}^{(l)}}^{\top}(\mathbf{I}+\beta\widetilde{\mathbf{L}})\mathbf{H}^{(l)}\right)$ | $\mathcal{S}^{(l)}=\mathcal{S},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
GNN-HF | $\mathbf{H}^{(l)}=\sigma\left((\mathbf{I}+\alpha\widehat{\mathbf{L}})^{-1}(\mathbf{I}+\beta\widehat{\mathbf{L}})\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)$ | $\mathrm{Tr}\left({\mathbf{H}^{(l)}}^{\top}(\mathbf{I}+\beta\widehat{\mathbf{L}})^{-1}(\mathbf{I}+\alpha\widehat{\mathbf{L}})\mathbf{H}^{(l)}\right)$ | $\mathcal{S}^{(l)}=\mathcal{S}_{+},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
GNN-LF | $\mathbf{H}^{(l)}=\sigma\left((\mathbf{I}+\alpha\widehat{\mathbf{A}})^{-1}(\mathbf{I}+\beta\widehat{\mathbf{A}})\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)$ | $\mathrm{Tr}\left({\mathbf{H}^{(l)}}^{\top}(\mathbf{I}+\beta\widehat{\mathbf{A}})^{-1}(\mathbf{I}+\alpha\widehat{\mathbf{A}})\mathbf{H}^{(l)}\right)$ | $\mathcal{S}^{(l)}=\mathcal{S}_{+},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
tsGCN | $\mathbf{H}^{(l)}=\sigma\left((\mathbf{I}+\alpha\widetilde{\mathbf{L}}_{\mathcal{G}}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}})^{-1}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)$ | $\mathrm{Tr}\left({\mathbf{H}^{(l)}}^{\top}(\mathbf{I}+\alpha\widetilde{\mathbf{L}}_{\mathcal{G}}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}})\mathbf{H}^{(l)}\right)$ | $\mathcal{S}^{(l)}=\mathcal{S}_{+},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
Table 1: Different regularizers can derive different GCN variants under the regularizer-centered optimization framework.

For a graph 𝒢=(𝒱,,𝐀)\mathcal{G}=(\mathcal{V},\mathcal{E},\mathbf{A}), the svd of its graph Laplacian is 𝐋=𝐔𝚲𝐔\mathbf{L}=\mathbf{U}\mathbf{\Lambda}\mathbf{U}^{\top}, where 𝐔N×N\mathbf{U}\in\mathbb{R}^{N\times N} is comprised of orthonormal eigenvectors and 𝚲=diag(λ1,,λN)\mathbf{\Lambda}=\mbox{diag}(\lambda_{1},\cdots,\lambda_{N}) is a diagonal matrix with λi\lambda_{i} denoting the ii-th eigenvalue and λiλj\lambda_{i}\geq\lambda_{j} (i=1,,Ni=1,\cdots,N). Essentially, this decomposition induces a Fourier transform on the graph domain, where eigenvectors correspond to Fourier components and eigenvalues represent frequencies of the graph. For an input signal 𝐱N\mathbf{x}\in\mathbb{R}^{N} defined on the graph 𝒢\mathcal{G}, the corresponding graph Fourier transform of 𝐱\mathbf{x} is 𝐱^=𝐔𝐱\widehat{\mathbf{x}}=\mathbf{U}^{\top}\mathbf{x}, and its inverse transform is derived as 𝐱=𝐔𝐱^\mathbf{x}=\mathbf{U}\widehat{\mathbf{x}}. Consequently, the graph convolution between the signal 𝐱\mathbf{x} and the filter 𝐠N\mathbf{g}\in\mathbb{R}^{N} is

𝐠𝐱=𝐔(𝐠^𝐱^)=𝐔((𝐔𝐠)(𝐔𝐱)),\displaystyle\mathbf{g}*\mathbf{x}=\mathbf{U}(\widehat{\mathbf{g}}\odot\widehat{\mathbf{x}})=\mathbf{U}((\mathbf{U}^{\top}\mathbf{g})\odot(\mathbf{U}^{\top}{\mathbf{x}})), (1)

where \odot is the Hadamard product between two vectors. Particularly, denoting 𝐠𝚯=diag(𝚯):=𝐔𝐠\mathbf{g}_{\mathbf{\Theta}}=\mbox{diag}(\mathbf{\Theta}):=\mathbf{U}^{\top}\mathbf{g} parameterized by 𝚯N\mathbf{\Theta}\in\mathbb{R}^{N}, the graph convolution between 𝐱\mathbf{x} and 𝐠\mathbf{g} can be rewritten as

𝐠𝐱=𝐔(𝐠^𝐱^)=𝐔𝐠𝚯𝐔𝐱,\displaystyle\mathbf{g}*\mathbf{x}=\mathbf{U}(\widehat{\mathbf{g}}\odot\widehat{\mathbf{x}})=\mathbf{U}\mathbf{g}_{\mathbf{\Theta}}\mathbf{U}^{\top}{\mathbf{x}}, (2)

where 𝚯\mathbf{\Theta} is regarded as the filter coefficients to be optimized. Especially, 𝚯\mathbf{\Theta} is assumed to be the polynomials of the spectrums of the graph Laplacian [30], i.e.,

𝚯=𝚯(𝚲)=i=1K𝚯i𝚲i,\displaystyle\mathbf{\Theta}=\mathbf{\Theta}(\mathbf{\Lambda})=\sum_{i=1}^{K}\mathbf{\Theta}_{i}\mathbf{\Lambda}^{i}, (3)

where KK is the order of Chebyshev polynomials. By fixing K=2K=2, the graph convolutional network (GCN) [21] takes an effective form

𝐠𝐱=θ(𝐈+𝐋)𝐱,\displaystyle\mathbf{g}*\mathbf{x}=\theta(\mathbf{I}+\mathbf{L})\mathbf{x}, (4)

where 𝚯=[θ]\mathbf{\Theta}=[\theta] is a parameter to be optimized. When extending single channel signal 𝐱\mathbf{x} and filter θ\theta to multi-channel 𝐇(l)N×dl\mathbf{H}^{(l)}\in\mathbb{R}^{N\times d_{l}} and 𝚯(l)dl×fl\mathbf{\Theta}^{(l)}\in\mathbb{R}^{d_{l}\times f_{l}}, the GCN is converted to

𝐇(l)=σ(𝐀^𝐇(l1)𝚯(l)),\displaystyle\mathbf{H}^{(l)}=\sigma(\widehat{\mathbf{A}}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}), (5)

where 𝐀^\widehat{\mathbf{A}} is a normalized version of 𝐈+𝐀~\mathbf{I}+\widetilde{\mathbf{A}}, σ()\sigma(\cdot) is an activation function, and 𝐇(l)N×dl\mathbf{H}^{(l)}\in\mathbb{R}^{N\times d_{l}} is the output of the ll-th layer with 𝐇(0)=𝐗\mathbf{H}^{(0)}=\mathbf{X} being the input feature matrix.
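As a minimal illustration of the propagation rule (5), the sketch below applies one GCN layer in NumPy, assuming the common renormalization $\widehat{\mathbf{A}}=\widehat{\mathbf{D}}^{-\frac{1}{2}}(\mathbf{I}+\mathbf{A})\widehat{\mathbf{D}}^{-\frac{1}{2}}$ and ReLU as $\sigma(\cdot)$; all sizes and random inputs are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, d_out = 5, 8, 4

# Random symmetric adjacency matrix (toy data, assumption).
A = (rng.random((N, N)) < 0.4).astype(float)
A = np.triu(A, 1)
A = A + A.T

# Renormalization: Â = D̂^{-1/2} (I + A) D̂^{-1/2}.
A_hat = A + np.eye(N)
d_hat = A_hat.sum(axis=1)
A_hat = A_hat / np.sqrt(np.outer(d_hat, d_hat))

H_prev = rng.standard_normal((N, d_in))     # H^{(l-1)}, here random features
Theta = rng.standard_normal((d_in, d_out))  # Θ^{(l)}, the trainable weights

# H^{(l)} = σ(Â H^{(l-1)} Θ^{(l)}) with σ = ReLU.
H = np.maximum(A_hat @ H_prev @ Theta, 0.0)
print(H.shape)  # (5, 4)
```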

An Interpretable Regularizer-centered Optimization Framework for GCNs

Given the input $\mathbf{H}^{(l-1)}$ of the $l$-th layer, GCN can compute the output $\mathbf{H}^{(l)}$ by minimizing

=Tr(𝐇(l)𝐇(l1)𝚯(l))+12Tr(𝐇(l)𝐋~𝐇(l))\displaystyle\begin{split}\mathcal{L}=-\text{Tr}({\mathbf{H}^{(l)}}^{\top}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)})+\frac{1}{2}\text{Tr}({\mathbf{H}^{(l)}}^{\top}\widetilde{\mathbf{L}}\mathbf{H}^{(l)})\\ \end{split} (6)
 s.t. 𝐇(l)𝟎,\mbox{ s.t. }\mathbf{H}^{(l)}\geq\mathbf{0},

where $\frac{1}{2}\text{Tr}({\mathbf{H}^{(l)}}^{\top}\widetilde{\mathbf{L}}\mathbf{H}^{(l)})=\frac{1}{4}\sum_{j=1}^{N}\sum_{i=1}^{N}\mathbf{A}_{ij}\|\frac{\mathbf{h}^{(l)}_{i}}{\sqrt{\mathbf{D}_{ii}}}-\frac{\mathbf{h}^{(l)}_{j}}{\sqrt{\mathbf{D}_{jj}}}\|_{2}^{2}$ with $\mathbf{H}^{(l)}=[\mathbf{h}_{1}^{(l)};\cdots;\mathbf{h}_{N}^{(l)}]$; it is a normalized regularizer that preserves the pairwise similarity of any two nodes in the given graph. Besides, $-\text{Tr}({\mathbf{H}^{(l)}}^{\top}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)})$ is actually a fitting loss term between $\mathbf{H}^{(l)}$ and $\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}$, i.e., $\|\mathbf{H}^{(l)}-\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\|_{F}^{2}$ with $\mathbf{H}^{(l-1)}$ and $\mathbf{\Theta}^{(l)}$ fixed when optimizing $\mathbf{H}^{(l)}$. Note that the square term $\|\mathbf{H}^{(l)}\|_{F}^{2}$ is an L2-regularized smoother, which can be ignored or absorbed into the graph regularizer $\text{Tr}({\mathbf{H}^{(l)}}^{\top}\widetilde{\mathbf{L}}\mathbf{H}^{(l)})$.

Taking the derivative of $\mathcal{L}$ with respect to $\mathbf{H}^{(l)}$ and setting it to zero, we obtain $\mathbf{H}^{(l_{+})}$ as

𝐇(l+)=(𝐈𝐀~)1𝐇(l1)𝚯(l);\displaystyle\mathbf{H}^{(l_{+})}=(\mathbf{I}-\widetilde{\mathbf{A}})^{-1}\mathbf{H}^{(l-1)}{\mathbf{\Theta}^{(l)}}; (7)

and then there yields

𝐇(l)=σ(𝐇(l+)),\mathbf{H}^{(l)}=\sigma\left(\mathbf{H}^{(l_{+})}\right), (8)

when the nonnegative constraints $\mathbf{H}^{(l)}\geq\mathbf{0}$ are further considered. Notice that $\sigma(\cdot)$ here is the $\mathrm{ReLU}(\cdot)$ activation function. If the matrix inverse $(\mathbf{I}-\widetilde{\mathbf{A}})^{-1}=\sum_{i=0}^{\infty}\widetilde{\mathbf{A}}^{i}$ is approximated by its first-order expansion, i.e., $(\mathbf{I}-\widetilde{\mathbf{A}})^{-1}\approx\mathbf{I}+\widetilde{\mathbf{A}}$, then Eq. (8) leads to the updating rule (5) of GCN.
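The step from Eq. (7) to rule (5) rests on the Neumann series $(\mathbf{I}-\widetilde{\mathbf{A}})^{-1}=\sum_{i=0}^{\infty}\widetilde{\mathbf{A}}^{i}$ and its first-order truncation $\mathbf{I}+\widetilde{\mathbf{A}}$. The sketch below compares the two numerically; since $\mathbf{I}-\widetilde{\mathbf{A}}$ is singular for a connected graph ($\widetilde{\mathbf{A}}$ has eigenvalue 1), a damping factor $\gamma<1$ is introduced purely so that the exact inverse exists (an assumption made for this check, not part of the paper).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
A = (rng.random((N, N)) < 0.5).astype(float)
A = np.triu(A, 1)
A = A + A.T
deg = np.maximum(A.sum(axis=1), 1.0)           # guard against isolated nodes
A_tilde = A / np.sqrt(np.outer(deg, deg))      # Ã = D^{-1/2} A D^{-1/2}

gamma = 0.5                                    # damping (assumption) so that I - γÃ is invertible
M = gamma * A_tilde

exact = np.linalg.inv(np.eye(N) - M)           # (I - γÃ)^{-1} = Σ_i (γÃ)^i
first_order = np.eye(N) + M                    # first-order truncation, as used by GCN

X = rng.standard_normal((N, 3))
rel_err = np.linalg.norm(exact @ X - first_order @ X) / np.linalg.norm(exact @ X)
print(f"relative error of the first-order expansion: {rel_err:.3f}")
```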

Usually, the activation functions in GCN are $\mathrm{ReLU}(\cdot)$ and $\mathrm{Softmax}(\cdot)$, which can be converted to different projection optimizations. Concretely, the $\mathrm{ReLU}(\cdot)$ activation function is equivalent to projecting a point $\mathbf{x}$ onto the non-negative orthant $\mathcal{S}_{+}=\{\mathbf{s}\in\mathbb{R}^{d}\,|\,\mathbf{s}\geq\mathbf{0}\}$, i.e.,

ReLU(𝐱)=argmin𝐲𝒮+𝐱𝐲+12𝐲22.\displaystyle\mathrm{ReLU}(\mathbf{x})=\mathop{\arg\min}_{\mathbf{y}\in\mathcal{S}_{+}}-\mathbf{x}^{\top}\mathbf{y}+\frac{1}{2}||\mathbf{y}||_{2}^{2}. (9)

In addition, we denote $\mathcal{S}=\{\mathbf{s}\in\mathbb{R}^{d}\}$, which corresponds to the identity activation function. In terms of the $\mathrm{Softmax}(\cdot)$ activation function, it can be regarded as projecting $\mathbf{x}$ onto the set $\mathcal{S}_{simplex}=\{\mathbf{s}\in\mathbb{R}^{d}\,|\,\mathbf{1}^{\top}\mathbf{s}=1,\mathbf{s}\geq\mathbf{0}\}$, i.e.,

Softmax(𝐱)=argmin𝐲𝒮simplex𝐱𝐲+𝐲log(𝐲),\displaystyle\mathrm{Softmax}(\mathbf{x})=\mathop{\arg\min}_{\mathbf{y}\in\mathcal{S}_{simplex}}-\mathbf{x}^{\top}\mathbf{y}+\mathbf{y}^{\top}\log(\mathbf{y}), (10)

where $\mathbf{y}^{\top}\log(\mathbf{y})=\sum_{i=1}^{d}\mathbf{y}_{i}\log(\mathbf{y}_{i})$ is the negative entropy of $\mathbf{y}$ [31]. In fact, other activation functions can also be viewed as projections of a point onto some feasible set under some metric.
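Both projections admit closed forms that coincide with the usual activations: the minimizer of Eq. (9) is $\max(\mathbf{x},\mathbf{0})$ and the minimizer of Eq. (10) is the softmax of $\mathbf{x}$. The sketch below checks this numerically with SciPy's generic SLSQP solver, used here only as a sanity check.

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.5, -0.3, 0.7, -2.0])

# Eq. (9): projection onto S_+ under  -x^T y + 0.5 ||y||^2  recovers ReLU(x).
relu_obj = lambda y: -x @ y + 0.5 * y @ y
res_relu = minimize(relu_obj, x0=np.ones_like(x), method="SLSQP",
                    bounds=[(0.0, None)] * len(x))
print(np.abs(res_relu.x - np.maximum(x, 0.0)).max())   # close to 0

# Eq. (10): projection onto the simplex under  -x^T y + y^T log y  recovers Softmax(x).
eps = 1e-12
soft_obj = lambda y: -x @ y + y @ np.log(y + eps)
res_soft = minimize(soft_obj, x0=np.full_like(x, 1.0 / len(x)), method="SLSQP",
                    bounds=[(0.0, 1.0)] * len(x),
                    constraints=[{"type": "eq", "fun": lambda y: y.sum() - 1.0}])
softmax = np.exp(x) / np.exp(x).sum()
print(np.abs(res_soft.x - softmax).max())               # close to 0
```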

So far, we have utilized a constrained optimization problem to interpret GCN, covering both its information propagation (i.e., Eq. (7)) and its nonlinear activation functions (i.e., $\mathrm{ReLU}(\cdot)$ and $\mathrm{Softmax}(\cdot)$).

The above analyses can not only explain the vanilla GCN, but also stimulate a regularizer-centered optimization framework that can further unify various GCNs. By extending the optimization (6), a more general framework is written as

=Tr(𝐇(l)𝐇(l1)𝚯(l))+12(𝐇(l);𝒢)\displaystyle\begin{split}\mathcal{L}=-\text{Tr}({\mathbf{H}^{(l)}}^{\top}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)})+\frac{1}{2}\mathcal{L}(\mathbf{H}^{(l)};\mathcal{G})\\ \end{split} (11)
 s.t. 𝐇(l){𝒮+or𝒮},l[L1],𝐇(L)𝒮simplex.\mbox{ s.t. }\mathbf{H}^{(l)}\in\{\mathcal{S}_{+}\;or\;\mathcal{S}\},l\in[L-1],\mathbf{H}^{(L)}\in\mathcal{S}_{simplex}.

Under this framework, different regularizers could derive different GCNs, for example,

  • If (𝐇(l);𝒢)=Tr(𝐇(l)(𝐈+μ𝐋^)1(𝐈+λ𝐋^)𝐇(l))\mathcal{L}(\mathbf{H}^{(l)};\mathcal{G})=Tr\left({\mathbf{H}^{(l)}}^{\top}(\mathbf{I}+\mu\widehat{\mathbf{L}})^{-1}(\mathbf{I}+\lambda\widehat{\mathbf{L}})\mathbf{H}^{(l)}\right) with λ=β+1α1\lambda=\beta+\frac{1}{\alpha}-1, μ=β\mu=\beta, and 𝐋^=𝐈𝐀^\widehat{\mathbf{L}}=\mathbf{I}-\widehat{\mathbf{A}}, then it induces the updating rule 𝐇(l)=σ((𝐈+α𝐀^)1(𝐈+β𝐀^)𝐇(l1)𝚯(l))\mathbf{H}^{(l)}=\sigma\left((\mathbf{I}+\alpha\widehat{\mathbf{A}})^{-1}(\mathbf{I}+\beta\widehat{\mathbf{A}})\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right), which corresponds to GNN-HF [19].

  • If (𝐇(l);𝒢)=Tr(𝐇(l)(𝐈+μ𝐀^)1(𝐈+λ𝐀^)𝐇(l))\mathcal{L}(\mathbf{H}^{(l)};\mathcal{G})=Tr\left({\mathbf{H}^{(l)}}^{\top}(\mathbf{I}+\mu\widehat{\mathbf{A}})^{-1}(\mathbf{I}+\lambda\widehat{\mathbf{A}})\mathbf{H}^{(l)}\right) with λ=αβ+2α1αβα+1\lambda=\frac{-\alpha\beta+2\alpha-1}{\alpha\beta-\alpha+1} and μ=1β1\mu=\frac{1}{\beta}-1, then it gives rise to the updating rule 𝐇(l)=σ((𝐈+α𝐀^)1(𝐈+β𝐀^)𝐇(l1)𝚯(l))\mathbf{H}^{(l)}=\sigma\left((\mathbf{I}+\alpha\widehat{\mathbf{A}})^{-1}(\mathbf{I}+\beta\widehat{\mathbf{A}})\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right), which corresponds to GNN-LF [19].

For more cases, the results are summarized in Table 4; their derivations parallel that of the original GCN (from Eq. (7) to Eq. (10)) and are detailed in the supplementary.

Remarks. The work [19] is the most similar to ours and shares the same research goal: both aim to propose a unified framework to interpret current GCNs and to guide the design of new GCN variants; however, the two are realized in different ways. Specifically, (1) Zhu et al. [19] develop an optimization framework that explains only the propagation processes of different GCNs, whereas we propose a constrained optimization framework that interprets not only the propagation processes of various GCNs but also their nonlinear activation layers; (2) [19] unifies various GCNs by devising various fitting terms, which are essentially constructed from a limited set of graph filters, while our work derives different GCNs by designing different regularizers. To sum up, our work interprets GCNs as a whole (not only in part) with regularizer-centered constrained optimizations.

Algorithm 1 Topological and Semantic Regularized GCN
0:  Graph data 𝒢=(𝒱,,𝐀)\mathcal{G}=(\mathcal{V},\mathcal{E},\mathbf{A}), labels 𝐲\mathbf{y}, number of layers LL, and hyperparameters {α,β,r}\{\alpha,\beta,r\}.
0:  Predicted label set {yi}i=n+1N\{y_{i}^{*}\}_{i=n+1}^{N}.
1:  Initialize model parameters {𝐇(l),𝚯(l)}l=1L\{\mathbf{H}^{(l)},\mathbf{\Theta}^{(l)}\}_{l=1}^{L};
2:  Compute the joint graph Laplacian α𝐋~𝒢+β𝐋~𝒳\alpha\widetilde{\mathbf{L}}_{\mathcal{G}}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}} and its low-rank factorization 𝐖𝐕\mathbf{W}\mathbf{V}^{\top};
3:  Substitute the matrix inverse (𝐈+α𝐋~𝒢+β𝐋~𝒳)1(\mathbf{I}+\alpha\widetilde{\mathbf{L}}_{\mathcal{G}}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}})^{-1} with 𝐈𝐖(𝐈+𝐕𝐖)1𝐕\mathbf{I}-\mathbf{W}(\mathbf{I}+\mathbf{V}^{\top}\mathbf{W})^{-1}\mathbf{V}^{\top};
4:  while not convergent do
5:     Calculate hidden layers {𝐇(l)}l=1L\{\mathbf{H}^{(l)}\}_{l=1}^{L} by Eq. (14);
6:     Update weights: 𝚯(l+1)𝚯(l)η𝚯(l)\mathbf{\Theta}^{(l+1)}\leftarrow\mathbf{\Theta}^{(l)}-\eta\frac{\partial\mathcal{L}}{\partial\mathbf{\Theta}^{(l)}};
7:  end while
8:  return  The predicted labels: yi=argmaxj𝐇ij(L)y_{i}^{*}=\arg\max_{j}\mathbf{H}^{(L)}_{ij}.
Table 2: Dataset statistics.
Datasets #Nodes #Edges #Classes #Features #Train/Val/Test
Cora 2,708 5,429 7 1,433 140/500/1,000
Citeseer 3,327 4,732 6 3,703 120/500/1,000
Pubmed 19,717 44,338 3 500 60/500/1,000
ACM 3,025 13,128 3 1,870 60/500/1,000
BlogCatalog 5,196 171,743 6 8,189 120/500/1,000
CoraFull 19,793 65,311 70 8,710 1,400/500/1,000
Flickr 7,575 239,738 9 12,047 180/500/1,000
UAI 3,067 28,311 19 4,973 367/500/1,000
Table 3: Accuracy and F1-score (mean% and standard deviation%) of all methods, where the best results are in red and the second-best are in blue. Note that GraphSAGE fails to work on the ACM dataset, and thus its results are marked with “—”.
Methods / Datasets Cora Citeseer Pubmed ACM BlogCatalog CoraFull Flickr UAI
Accuracy (ACC):
Chebyshev 76.2 (0.7) 69.3 (0.4) 74.0 (0.8) 82.8 (1.4) 68.3 (1.6) 57.2 (1.1) 38.5 (1.6) 49.7 (0.4)
GraphSAGE 76.7 (0.6) 64.4 (0.9) 75.5 (0.2) — 57.8 (0.7) 59.9 (0.7) 32.7 (1.0) 41.7 (1.4)
GAT 79.1 (0.8) 68.3 (0.5) 78.4 (0.3) 84.6 (0.5) 67.1 (1.7) 62.4 (0.4) 40.4 (0.9) 49.7 (3.0)
GCN 80.6 (1.4) 69.1 (1.5) 77.6 (1.3) 88.8 (0.5) 84.2 (0.6) 62.8 (0.4) 51.0 (1.2) 58.5 (1.1)
SGC 79.3 (1.0) 66.4 (1.7) 76.8 (2.0) 80.8 (2.7) 81.3 (0.2) 62.9 (2.2) 51.0 (0.1) 56.5 (3.5)
APPNP 78.0 (0.1) 65.8 (0.2) 78.0 (0.0) 88.2 (0.0) 87.7 (0.3) 63.1 (0.5) 57.5 (0.2) 62.3 (1.2)
JKNet 83.1 (0.1) 72.3 (0.1) 80.1 (0.2) 82.3 (0.6) 75.7 (0.1) 62.6 (0.0) 54.0 (0.3) 45.6 (0.5)
DAGNN 81.9 (0.7) 70.0 (1.1) 80.6 (0.7) 87.4 (0.9) 84.6 (1.9) 65.6 (0.3) 54.6 (5.9) 46.7 (12.4)
GNN-LF 81.1 (0.5) 72.3 (0.9) 80.0 (0.4) 90.8 (0.5) 86.7 (0.6) 63.5 (0.9) 56.6 (0.6) 36.6 (19.8)
GNN-HF 80.7 (0.2) 68.8 (1.3) 77.7 (0.2) 91.2 (0.5) 84.5 (0.4) 63.0 (0.7) 60.7 (0.4) 54.8 (1.4)
tsGCN (inv) 80.3 (0.3) 73.3 (0.4) 78.4 (0.3) 85.1 (1.6) 87.8 (6.3) 67.0 (0.9) 53.3 (12.6) 64.2 (1.8)
tsGCN 82.0 (0.3) 73.1 (0.4) 82.4 (0.1) 92.8 (0.3) 92.3 (0.5) 67.9 (0.9) 79.1 (3.0) 67.9 (0.6)
F1-score:
Chebyshev 76.3 (0.7) 65.4 (0.8) 73.9 (0.7) 82.5 (1.4) 64.3 (1.6) 40.0 (0.5) 38.4 (1.5) 39.1 (0.2)
GraphSAGE 76.7 (0.5) 60.7 (0.5) 74.7 (0.2) — 54.7 (0.6) 51.9 (0.6) 31.0 (1.1) 35.3 (1.0)
GAT 77.1 (0.7) 64.6 (0.5) 78.2 (0.2) 84.8 (0.5) 66.3 (1.9) 46.4 (0.4) 38.1 (1.1) 40.8 (1.3)
GCN 79.4 (1.4) 65.2 (2.4) 77.2 (1.4) 88.9 (0.5) 82.4 (0.5) 52.8 (0.8) 50.0 (1.7) 45.0 (1.1)
SGC 77.7 (0.9) 61.5 (1.7) 76.5 (2.3) 81.1 (2.6) 80.7 (0.3) 53.2 (2.1) 44.2 (0.2) 46.7 (1.7)
APPNP 77.6 (0.1) 63.2 (0.2) 77.7 (0.0) 88.3 (0.0) 85.7 (0.3) 48.2 (0.7) 56.9 (0.2) 48.6 (1.6)
JKNet 82.3 (0.3) 67.8 (0.1) 79.3 (0.3) 82.2 (0.6) 75.0 (0.1) 51.3 (0.1) 51.1 (0.5) 31.7 (1.5)
DAGNN 80.0 (0.7) 65.7 (0.7) 80.7 (0.7) 87.5 (0.9) 83.8 (2.4) 53.0 (0.9) 55.5 (6.7) 39.3 (11.2)
GNN-LF 79.1 (0.7) 66.7 (0.4) 80.2 (0.5) 90.9 (0.5) 85.9 (0.6) 50.5 (1.9) 54.3 (1.0) 29.7 (15.1)
GNN-HF 78.6 (0.3) 64.3 (1.7) 78.1 (0.2) 91.3 (0.5) 83.8 (0.4) 49.0 (1.1) 58.6 (0.6) 44.9 (0.8)
tsGCN (inv) 78.5 (0.3) 69.6 (0.4) 78.7 (0.3) 85.1 (1.5) 85.2 (7.1) 57.2 (1.1) 52.9 (15.8) 48.5 (0.8)
tsGCN 80.5 (0.5) 69.0 (0.3) 82.4 (0.1) 92.8 (0.4) 90.1 (0.6) 58.7 (0.7) 79.3 (2.9) 50.1 (0.1)

tsGCN: Topological and Semantic Regularized Graph Convolutional Network

One finding about most existing GCNs is that they often ignore feature-based semantic structures, which can weaken the representation learning ability of graph networks; hence, we focus on two regularizers, i.e.,

1(𝐇(l);𝒢)=12Tr({𝐇(l)}(12𝐈+α𝐋~𝒢)𝐇(l)),\displaystyle\mathcal{L}_{1}(\mathbf{H}^{(l)};\mathcal{G})=\frac{1}{2}\text{Tr}\left(\{\mathbf{H}^{(l)}\}^{\top}(\frac{1}{2}\mathbf{I}+\alpha\widetilde{\mathbf{L}}_{\mathcal{G}})\mathbf{H}^{(l)}\right), (12)
2(𝐇(l);𝒳)=12Tr({𝐇(l)}(12𝐈+β𝐋~𝒳)𝐇(l)),\displaystyle\mathcal{L}_{2}(\mathbf{H}^{(l)};\mathcal{X})=\frac{1}{2}\text{Tr}\left(\{\mathbf{H}^{(l)}\}^{\top}(\frac{1}{2}\mathbf{I}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}})\mathbf{H}^{(l)}\right), (13)

where $\widetilde{\mathbf{L}}_{\mathcal{G}}$ is a graph Laplacian from the given adjacency matrix (e.g., $\widetilde{\mathbf{L}}_{\mathcal{G}}=\widetilde{\mathbf{L}}$), and $\widetilde{\mathbf{L}}_{\mathcal{X}}$ is a graph Laplacian calculated from the pairwise similarity of any two graph nodes. Hence, we devise a dual-regularizer, i.e., $\mathcal{L}(\mathbf{H}^{(l)})=\mathcal{L}_{1}(\mathbf{H}^{(l)};\mathcal{G})+\mathcal{L}_{2}(\mathbf{H}^{(l)};\mathcal{X})$; plugging it into the optimization framework (19) yields the following updating rule

𝐇(l)=σ((𝐈+α𝐋~𝒢+β𝐋~𝒳)1𝐇(l1)𝚯(l)).\displaystyle\mathbf{H}^{(l)}=\sigma\left((\mathbf{I}+\alpha\widetilde{\mathbf{L}}_{\mathcal{G}}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}})^{-1}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right). (14)

Since this method seeks to preserve both the topological and semantic structures for more accurate representations, we call it tsGCN (i.e., Topological and Semantic regularized GCN).
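The construction of $\widetilde{\mathbf{L}}_{\mathcal{X}}$ from pairwise node similarities is not fully specified in this excerpt; the sketch below assumes a cosine-similarity kNN graph that is then symmetrically normalized in the same way as $\widetilde{\mathbf{L}}_{\mathcal{G}}$.

```python
import numpy as np

def semantic_laplacian(X, k=10):
    """Normalized Laplacian built from node-feature similarities.

    Assumptions (not specified in the paper excerpt): cosine similarity,
    sparsified to each node's k most similar neighbors, then symmetrized.
    """
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    S = Xn @ Xn.T                        # pairwise cosine similarities
    np.fill_diagonal(S, 0.0)
    S = np.maximum(S, 0.0)               # keep non-negative affinities only

    # kNN sparsification: zero out all but the k largest entries per row.
    drop = np.argsort(S, axis=1)[:, :-k]
    A_x = S.copy()
    np.put_along_axis(A_x, drop, 0.0, axis=1)
    A_x = np.maximum(A_x, A_x.T)         # symmetrize

    deg = np.maximum(A_x.sum(axis=1), 1e-12)
    A_x = A_x / np.sqrt(np.outer(deg, deg))
    return np.eye(X.shape[0]) - A_x      # L̃_X = I - D^{-1/2} A_X D^{-1/2}

X = np.random.default_rng(2).standard_normal((50, 16))   # toy features
print(semantic_laplacian(X, k=10).shape)                  # (50, 50)
```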

Notably, the computational complexity of $(\mathbf{I}+\alpha\widetilde{\mathbf{L}}_{\mathcal{G}}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}})^{-1}$ is $\mathcal{O}(N^{3})$, which tends to be unaffordable in practical applications. To this end, a low-rank approximation is adopted, i.e., $\alpha\widetilde{\mathbf{L}}_{\mathcal{G}}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}}\approx\mathbf{W}\mathbf{V}^{\top}$, where $\mathbf{W},\mathbf{V}\in\mathbb{R}^{N\times r}$ with $r\ll N$. This allows the Woodbury matrix identity to be applied:

(𝐈+𝐖𝐕)1=𝐈𝐖(𝐈+𝐕𝐖)1𝐕,\displaystyle(\mathbf{I}+\mathbf{W}\mathbf{V}^{\top})^{-1}=\mathbf{I}-\mathbf{W}(\mathbf{I}+\mathbf{V}^{\top}\mathbf{W})^{-1}\mathbf{V}^{\top}, (15)

whose computation costs only $\mathcal{O}(N^{2})$.
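A quick numerical check of the identity (15): with $\mathbf{W}=\mathbf{V}$ (as in the paper's choice $\mathbf{W}^{*}=\mathbf{V}^{*}=\mathbf{U}\mathbf{\Sigma}^{\frac{1}{2}}$, which makes $\mathbf{I}+\mathbf{W}\mathbf{V}^{\top}$ positive definite), only an $r\times r$ system has to be inverted; the random factors below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, r = 200, 8
W = rng.standard_normal((N, r))
V = W                                  # W = V, mirroring W* = V* = U Σ^{1/2}

lhs = np.linalg.inv(np.eye(N) + W @ V.T)                          # dense O(N^3) inverse
rhs = np.eye(N) - W @ np.linalg.inv(np.eye(r) + V.T @ W) @ V.T    # Woodbury: only an r x r inverse
print(np.allclose(lhs, rhs, atol=1e-8))                           # True
```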

Note that the optimal $\mathbf{M}^{*}$ of the following problem

min𝐌N×N:rank(𝐌)=r𝐌(α𝐋~𝒢+β𝐋~𝒳)F2\displaystyle\min_{\mathbf{M}\in\mathbb{R}^{N\times N}:~{}\text{rank}(\mathbf{M})=r}||\mathbf{M}-(\alpha\widetilde{\mathbf{L}}_{\mathcal{G}}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}})||_{\textsc{F}}^{2} (16)

is attained at the $r$-truncated singular value decomposition of $\alpha\widetilde{\mathbf{L}}_{\mathcal{G}}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}}$, i.e., $\mathbf{M}^{*}=\mathbf{U}\mathbf{\Sigma}\mathbf{U}^{\top}$, where $\mathbf{\Sigma}\in\mathbb{R}^{r\times r}$ is a diagonal matrix containing the $r$ largest singular values. An optimal $\{\mathbf{W}^{*},\mathbf{V}^{*}\}$ for $\alpha\widetilde{\mathbf{L}}_{\mathcal{G}}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}}\approx\mathbf{W}\mathbf{V}^{\top}$ is then given in closed form by $\mathbf{W}^{*}=\mathbf{V}^{*}=\mathbf{U}\mathbf{\Sigma}^{\frac{1}{2}}$.

To obtain the optimum $\{\mathbf{W}^{*},\mathbf{V}^{*}\}$, the iterative algorithm [32] with $\mathcal{O}(N^{2})$ complexity is leveraged as

𝐙(t+1)(α𝐋~𝒢+β𝐋~𝒳)𝐔(t),\mathbf{Z}^{(t+1)}\leftarrow(\alpha\widetilde{\mathbf{L}}_{\mathcal{G}}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}})\mathbf{U}^{(t)}, (17)
{𝐔(t+1),𝐑(t+1)}QR(𝐙(t+1)),\{\mathbf{U}^{(t+1)},\mathbf{R}^{(t+1)}\}\leftarrow\text{QR}(\mathbf{Z}^{(t+1)}), (18)

where QR$(\cdot)$ denotes the QR decomposition. Note that this algorithm converges to the $r$ largest eigenvalues (contained in $\mathbf{R}^{(t+1)}$) and the corresponding eigenvectors (contained in $\mathbf{U}^{(t+1)}$) when the iteration number $t$ is large enough. Finally, $\mathbf{W}^{*}=\mathbf{V}^{*}=\mathbf{U}^{(t+1)}[\mathbf{R}^{(t+1)}]^{\frac{1}{2}}$.
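The iteration (17)-(18) is the classical orthogonal (subspace) iteration; the sketch below runs it on a random symmetric positive semi-definite matrix standing in for $\alpha\widetilde{\mathbf{L}}_{\mathcal{G}}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}}$ (an assumption for illustration) and compares the recovered leading eigenvalues with a dense eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(4)
N, r, iters = 300, 5, 100

# Stand-in for αL̃_G + βL̃_X: a random symmetric PSD matrix (illustrative assumption).
B = rng.standard_normal((N, N))
M = B @ B.T / N

U = rng.standard_normal((N, r))        # U^{(0)}
for _ in range(iters):
    Z = M @ U                          # Z^{(t+1)} = (αL̃_G + βL̃_X) U^{(t)}
    U, R = np.linalg.qr(Z)             # QR step re-orthonormalizes the basis

rayleigh = np.diag(U.T @ M @ U)        # per-column Rayleigh quotients ≈ leading eigenvalues
print(np.round(np.sort(rayleigh)[::-1], 4))
print(np.round(np.sort(np.linalg.eigvalsh(M))[::-1][:r], 4))

# Low-rank factor as in the paper: W = V = U Λ_r^{1/2}.
W = U @ np.diag(np.sqrt(np.maximum(rayleigh, 0.0)))
```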

Gathering all analyses mentioned above, the procedure for tsGCN is summarized in Algorithm 1.
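Putting the pieces together, the sketch below performs one tsGCN propagation step, Eq. (14), without ever forming an $N\times N$ inverse: the low-rank factors are plugged into the Woodbury form used in Algorithm 1. Shapes and factors are toy assumptions; a full training loop with cross-entropy over labeled nodes would wrap this layer, as in line 6 of Algorithm 1.

```python
import numpy as np

def ts_gcn_layer(H_prev, Theta, W, V):
    """One tsGCN layer, Eq. (14): σ((I + αL̃_G + βL̃_X)^{-1} H^{(l-1)} Θ^{(l)}),
    with the inverse replaced by I - W (I + VᵀW)^{-1} Vᵀ (Woodbury, Algorithm 1)."""
    M = H_prev @ Theta                                  # H^{(l-1)} Θ^{(l)}
    r = W.shape[1]
    core = np.linalg.inv(np.eye(r) + V.T @ W)           # small r x r inverse
    out = M - W @ (core @ (V.T @ M))                    # never forms an N x N matrix
    return np.maximum(out, 0.0)                         # σ = ReLU

# Toy shapes and random low-rank factors (illustrative assumptions).
rng = np.random.default_rng(5)
N, d, h, r = 100, 32, 16, 6
W = rng.standard_normal((N, r)); V = W                  # W = V = U Σ^{1/2} in the paper
H0 = rng.standard_normal((N, d))                        # H^{(0)} = X
Theta1 = rng.standard_normal((d, h))
print(ts_gcn_layer(H0, Theta1, W, V).shape)             # (100, 16)
```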

Experiment

This section will show tsGCN’s effectiveness and efficiency via comprehensive experiments.

Datasets

Cora, Citeseer, and Pubmed are citation networks, and CoraFull is a larger version of Cora; ACM is a paper network, and BlogCatalog and Flickr are social networks; UAI has been utilized for community detection. The detailed statistics of these eight public datasets are summarized in Table 2.

Compared Methods

Two types of methods are employed here for comparisons. Chebyshev [33], GraphSAGE [34] and GAT [35] are classical graph neural networks. GCN, SGC [22], APPNP [23], JKNet [24], DAGNN [27], GNN-LF and GNN-HF [19] are selected as state-of-the-art GCN variants.

Figure 1: (a) Runtime of all methods on two large datasets. (b) The classification accuracy of tsGCN w.r.t. the rank r on all datasets.
Figure 2: The classification accuracy of tsGCN w.r.t. hyperparameters α and β on all datasets; panels: (a) Cora, (b) Citeseer, (c) Pubmed, (d) ACM, (e) BlogCatalog, (f) CoraFull, (g) Flickr, (h) UAI.

Parameter Setups

For all experiments, we randomly split samples into a small training set with 20 labeled samples per class, a validation set of 500 samples, and a test set of 1,000 samples. For the ten baseline methods, all configurations are set to the defaults in their original papers. With respect to tsGCN, following the vanilla GCN, the learning rate, weight decay, and size of hidden units are set to $1\times 10^{-2}$, $5\times 10^{-4}$, and 32, respectively. The hyperparameters $\alpha$ and $\beta$ are selected from $\{0.1,0.2,\ldots,1.0\}$ for different datasets, and $r$ is chosen from $\{\lfloor\frac{d}{2^{11}}\rfloor,\lfloor\frac{d}{2^{10}}\rfloor,\ldots,\lfloor\frac{d}{2^{3}}\rfloor\}$, where $d$ is the feature dimension of the original data.

Semi-supervised Classification

Performance Comparisons. The semi-supervised classification task is conducted on selected datasets, whose results are recorded in Table 3. Specifically, we compare our tsGCN with the ten baseline methods in terms of both accuracy and F1-score, marking the best and second-best results on each dataset. Note that tsGCN (inv) denotes tsGCN without the low-rank approximation, which directly calculates the matrix inverse in Eq. (14). From Table 3, we have the following observations:

  • tsGCN achieves the best performances on most datasets, and is only slightly inferior to the JKNet method on the smallest Cora dataset.

  • tsGCN yields higher scores than JKNet and APPNP, especially on Pubmed, CoraFull, BlogCatalog, and Flickr, where the first two are relatively large datasets and the latter two have dense edges. tsGCN even outperforms the second-best approach GNN-HF by about 20% on Flickr.

It is worth mentioning that tsGCN utilizes high-order information via the infinite-order graph convolution, and JKNet and APPNP also develop different techniques toward the same goal. Hence, the advantage of tsGCN over APPNP and JKNet implies that the infinite-order graph convolution implemented by the low-rank approximation not only incurs lower computational complexity, but also effectively captures high-order neighborhood information and filters out significant noise.

Runtime Comparisons. This section collects the training time (i.e., runtime) of all methods on two selected large datasets, i.e., Pubmed and CoraFull, as exhibited in Fig. 1(a): the first three columns correspond to classical GNNs, while the rest are GCNs. From Fig. 1(a), we find that tsGCN takes much less runtime than Chebyshev, GAT, and GraphSAGE, and sits in the middle of the state-of-the-art GCN variants. Specifically, tsGCN is (1) slower than SGC, JKNet, and DAGNN; (2) on par with the original GCN; (3) and faster than the recently proposed GNN-LF and GNN-HF.

Figure 3: Accuracy and F1-score of tsGCN and its variants on all datasets.
Figure 4: t-SNE visualizations of different methods on BlogCatalog, where each color corresponds to one class; panels: (a) Chebyshev, (b) GraphSAGE, (c) GAT, (d) GCN, (e) SGC, (f) APPNP, (g) JKNet, (h) DAGNN, (i) GNN-LF, (j) GNN-HF, (k) tsGCN (inv), (l) tsGCN.

Parameter Sensitivity Analysis

Fig. 1(b) plots the accuracy of tsGCN w.r.t. various ranks, with the other parameters $\alpha$ and $\beta$ fixed. Considering that different datasets hold different distributions, their optimal ranks for the optimization (16) also differ. For example, with regard to the curves on BlogCatalog and ACM, the accuracy first rises to a high value and then keeps steady, which indicates that when the rank is $r=\lfloor d/512\rfloor$, the low-rank approximation is already effective and efficient enough. For the curve on Pubmed, the performance monotonically decreases as the rank $r$ becomes larger, which implies that a very low-rank approximation (i.e., $r=\lfloor d/2048\rfloor$) suffices to preserve enough information for good results. However, for the other curves, such as those on Flickr and Cora, the accuracy generally rises to a peak first and then falls continuously as the rank $r$ increases. For these cases, the optimal ranks are located at their peaks.

Fig. 2 plots the accuracy of tsGCN w.r.t. (α\alpha, β\beta) by fixing the optimal ranks. On Cora, Citeseer, Pubmed, BlogCatalog, and CoraFull, tsGCN performs well with large α\alpha and small β\beta; while, on ACM, Flickr, and UAI, tsGCN generates high accuracy when these two parameters are both large.

For detailed settings of these hyperparameters, please refer to the code and datasets to be released on GitHub.

Ablation Study

The results of GCN, tsGCN-s, tsGCN-t, tsGCN (inv), and tsGCN are plotted in Fig. 3 (note that tsGCN-s and tsGCN-t employ only the semantic and the topological regularizer, respectively), which tells us:

  • The performance is unsatisfactory when either regularizer is adopted alone, while tsGCN can always effectively fuse the two to better capture the underlying structures.

  • tsGCN (inv) is even worse than the single-regularizer models on some datasets, indicating that infinite-order graph convolutions implemented via the exact matrix inverse can introduce instability into the model.

  • Compared to GCN, tsGCN (inv) performs comparably or even worse, whereas tsGCN shows substantial improvements on all datasets, which indicates that the low-rank approximation enhances the robustness of infinite-order graph convolutions.

Data Visualization

Fig. 4 exhibits the graph representations learned by different methods on BlogCatalog. As can be seen clearly, the results of the three classical graph neural networks, i.e., Chebyshev, GraphSAGE, and GAT, are unsatisfactory; as for the other competitors, we observe:

  • Both tsGCN (inv) and tsGCN are better than other GCNs, which indicates that the dual-regularizer can extract more accurate inter-relationships from the topological and semantic structures.

  • Comparing the embeddings learned by tsGCN with those of tsGCN (inv), the classes in the former are more clearly separated and the clusters are more compact, which testifies to the effectiveness of the low-rank approximation.

In a nutshell, the embeddings of the proposed model show the best inter-class separation and intra-class aggregation.

Conclusion

By revisiting GCN, this paper puts forward an interpretable regularizer-centered optimization framework, in which the connections between existing GCNs and diverse regularizers are revealed. It is worth mentioning that this framework provides a new perspective for interpreting existing work and for guiding new GCNs simply by designing new graph regularizers. Motivated by the significant effectiveness of the feature-based semantic graph, we further combine it with nodes' topological structures and develop a novel dual-regularizer graph convolutional network, called tsGCN. Since the analytical updating rule of tsGCN contains a time-consuming matrix inverse, we devise an efficient algorithm with low-rank approximation tricks. Experiments on node classification tasks demonstrate that tsGCN performs much better than quite a few state-of-the-art competitors, and also show that tsGCN runs much faster than the very recently proposed GCN variants, e.g., GNN-HF and GNN-LF.

Acknowledgments

This work is in part supported by the National Natural Science Foundation of China (Grant Nos. U21A20472 and 62276065), the Natural Science Foundation of Fujian Province (Grant No. 2020J01130193).

Supplementary

In this supplementary, we mainly present specific details to link various GCNs with various graph regularizers under the regularizer-centered optimization framework. Besides, more experimental settings and results are provided to further enrich the main paper.

The Framework Review

An interpretable regularizer-centered constrained optimization framework is induced as

argmin𝐇(l)=Tr(𝐇(l)𝐇(l1)𝚯(l))fitting+12(𝐇(l);𝒢)regularization\displaystyle\begin{split}\arg\min_{\mathbf{H}^{(l)}}\mathcal{L}=\underbrace{-\text{Tr}({\mathbf{H}^{(l)}}^{\top}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)})}_{fitting}+\underbrace{\frac{1}{2}\mathcal{L}(\mathbf{H}^{(l)};\mathcal{G})}_{regularization}\\ \end{split} (19)
 s.t. 𝐇(l){𝒮+or𝒮},l[L1],𝐇(L)𝒮simplex,\mbox{ s.t. }\mathbf{H}^{(l)}\in\{\mathcal{S}_{+}\;or\;\mathcal{S}\},l\in[L-1],\mathbf{H}^{(L)}\in\mathcal{S}_{simplex},

with the aim of unifying various GCNs in an interpretable way and of guiding the design of new GCN variants. Note that the first term in optimization (19) is equivalent to the fitting loss between the forward propagation $\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}$ and the output $\mathbf{H}^{(l)}$, while the second term is the priors-based graph regularizer. Besides, $\mathcal{S}$, $\mathcal{S}_{+}$ and $\mathcal{S}_{simplex}$ are respectively defined as

𝒮={𝐬d},\mathcal{S}=\{\mathbf{s}\in\mathbb{R}^{d}\}, (20)
𝒮+={𝐬d|𝐬𝟎},\mathcal{S}_{+}=\{\mathbf{s}\in\mathbb{R}^{d}|\mathbf{s}\geq\mathbf{0}\}, (21)

and

𝒮simplex={𝐬d|𝟏𝐬=1,𝐬𝟎}.\mathcal{S}_{simplex}=\{\mathbf{s}\in\mathbb{R}^{d}|\mathbf{1}^{\top}\mathbf{s}=1,\mathbf{s}\geq\mathbf{0}\}. (22)

The above three sets correspond to the $\text{Identity}(\cdot)$, $\mathrm{ReLU}(\cdot)$, and $\mathrm{Softmax}(\cdot)$ activation functions frequently used in graph convolutional networks, respectively.

By designing different regularizers, this framework can give rise to different GCN methods. In the following, we give specific details about how they are derived from optimization (19).

Link Various GCNs with Various Regularizers

Methods | Propagation Rules | Regularizer $\mathcal{L}(\mathbf{H}^{(l)};\mathcal{G})$ | Projective Set
GCN | $\mathbf{H}^{(l)}=\sigma\left(\widehat{\mathbf{A}}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)$ | $\text{Tr}\left({\mathbf{H}^{(l)}}^{\top}\widetilde{\mathbf{L}}\mathbf{H}^{(l)}\right)$ | $\mathcal{S}^{(l)}=\mathcal{S}_{+},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
SGC | $\mathbf{H}^{(l)}=\sigma\left(\widehat{\mathbf{A}}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)$ | $\text{Tr}\left({\mathbf{H}^{(l)}}^{\top}\widetilde{\mathbf{L}}\mathbf{H}^{(l)}\right)$ | $\mathcal{S}^{(l)}=\mathcal{S},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
APPNP | $\mathbf{H}^{(l)}=\sigma\left((1-\alpha)\widehat{\mathbf{A}}\mathbf{H}^{(l-1)}+\alpha\mathbf{H}^{(0)}\right)$ | $\text{Tr}\left(\frac{1}{1-\alpha}{\mathbf{H}^{(l)}}^{\top}\widehat{\mathbf{A}}^{-1}(\mathbf{H}^{(l)}-2\alpha\mathbf{H}^{(0)})\right)$ | $\mathcal{S}^{(l)}=\mathcal{S},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
JKNet | $\mathbf{H}^{(l)}=\sigma\left(\sum_{k=1}^{K}\alpha_{k}\widehat{\mathbf{A}}^{k}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)$ | $\text{Tr}\left({\mathbf{H}^{(l)}}^{\top}\widehat{\mathbf{A}}^{-1}(\mathbf{I}+\beta\widetilde{\mathbf{L}})\mathbf{H}^{(l)}\right)$ | $\mathcal{S}^{(l)}=\mathcal{S},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
DAGNN | $\mathbf{H}^{(L)}=\sigma\left(\sum_{k=0}^{K}\alpha_{k}\widehat{\mathbf{A}}^{k}\mathbf{H}^{(0)}\right)$ | $\text{Tr}\left({\mathbf{H}^{(l)}}^{\top}(\mathbf{I}+\beta\widetilde{\mathbf{L}})\mathbf{H}^{(l)}\right)$ | $\mathcal{S}^{(l)}=\mathcal{S},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
GNN-HF | $\mathbf{H}^{(l)}=\sigma\left((\mathbf{I}+\alpha\widehat{\mathbf{L}})^{-1}(\mathbf{I}+\beta\widehat{\mathbf{L}})\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)$ | $\text{Tr}\left({\mathbf{H}^{(l)}}^{\top}(\mathbf{I}+\beta\widehat{\mathbf{L}})^{-1}(\mathbf{I}+\alpha\widehat{\mathbf{L}})\mathbf{H}^{(l)}\right)$ | $\mathcal{S}^{(l)}=\mathcal{S}_{+},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
GNN-LF | $\mathbf{H}^{(l)}=\sigma\left((\mathbf{I}+\alpha\widehat{\mathbf{A}})^{-1}(\mathbf{I}+\beta\widehat{\mathbf{A}})\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)$ | $\text{Tr}\left({\mathbf{H}^{(l)}}^{\top}(\mathbf{I}+\beta\widehat{\mathbf{A}})^{-1}(\mathbf{I}+\alpha\widehat{\mathbf{A}})\mathbf{H}^{(l)}\right)$ | $\mathcal{S}^{(l)}=\mathcal{S}_{+},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
tsGCN | $\mathbf{H}^{(l)}=\sigma\left((\mathbf{I}+\alpha\widetilde{\mathbf{L}}_{\mathcal{G}}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}})^{-1}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)$ | $\text{Tr}\left({\mathbf{H}^{(l)}}^{\top}(\mathbf{I}+\alpha\widetilde{\mathbf{L}}_{\mathcal{G}}+\beta\widetilde{\mathbf{L}}_{\mathcal{X}})\mathbf{H}^{(l)}\right)$ | $\mathcal{S}^{(l)}=\mathcal{S}_{+},\,l\in[L-1];\ \mathcal{S}^{(L)}=\mathcal{S}_{simplex}$
Table 4: Different regularizers can derive different GCN variants under the regularizer-centered optimization framework.
Theorem 1.

The updating rule of the vanilla GCN [21]

𝐇(l)=σ(𝐀^𝐇(l1)𝚯(l)),l[L],\displaystyle\mathbf{H}^{(l)}=\sigma\left(\widehat{\mathbf{A}}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right),l\in[L], (23)

is equivalent to solving the following optimization

𝐇(l)=argmin𝐇𝒮(l)𝒥(l)\mathbf{H}^{(l)}=\arg\min_{\mathbf{H}\in\mathcal{S}^{(l)}}\mathcal{J}^{(l)} (24)
s.t.𝒮(l){𝒮or𝒮+or𝒮simplex},s.t.\;\mathcal{S}^{(l)}\in\{\mathcal{S}\;or\;\mathcal{S}_{+}\;or\;\mathcal{S}_{simplex}\},

where

𝒥(l)=Tr(𝐇𝐇(l1)𝚯(l))+12Tr(𝐇𝐋~𝐇).\displaystyle\mathcal{J}^{(l)}=-{\rm{Tr}}\left(\mathbf{H}^{\top}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)+\frac{1}{2}{\rm{Tr}}\left(\mathbf{H}^{\top}\widetilde{\mathbf{L}}\mathbf{H}\right). (25)
Proof.

Taking derivative of 𝒥(l)\mathcal{J}^{(l)} w.r.t. 𝐇\mathbf{H}, we obtain

𝒥(l)𝐇=𝐇(l1)𝚯(l)+𝐋~𝐇;\displaystyle\frac{\partial\mathcal{J}^{(l)}}{\partial\mathbf{H}}=-\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}+\widetilde{\mathbf{L}}\mathbf{H}; (26)

if it (𝒥(l)𝐇\frac{\partial\mathcal{J}^{(l)}}{\partial\mathbf{H}}) is further set to zero, then there yields

𝐇=(𝐈𝐀~)1𝐇(l1)𝚯(l).\mathbf{H}^{*}=(\mathbf{I}-\widetilde{\mathbf{A}})^{-1}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}. (27)

By projecting 𝐇\mathbf{H}^{*} onto 𝒮(l)\mathcal{S}^{(l)}, we could arrive at

𝐇(l)=σ(𝐇).\mathbf{H}^{(l)}=\sigma(\mathbf{H}^{*}). (28)

Notably, (𝐈𝐀~)1=i=0𝐀~i(\mathbf{I}-\widetilde{\mathbf{A}})^{-1}=\sum_{i=0}^{\infty}\widetilde{\mathbf{A}}^{i}; and when its first-order approximation is leveraged, i.e., (𝐈𝐀~)1𝐈+𝐀~=𝐀^(\mathbf{I}-\widetilde{\mathbf{A}})^{-1}\approx\mathbf{I}+\tilde{\mathbf{A}}=\widehat{\mathbf{A}}, Eq. (28) gives birth to the updating rule (23).

The above analyses reveal that when the regularizer is designed to 12Tr(𝐇𝐋~𝐇)\frac{1}{2}{\rm{Tr}}\left(\mathbf{H}^{\top}\widetilde{\mathbf{L}}\mathbf{H}\right), the framework (19) could generate the vanilla GCN [21]. ∎

Theorem 2.

Given 𝐇(0)=fΘMLP(𝐗)\mathbf{H}^{(0)}=f_{\Theta}^{MLP}(\mathbf{X}) and α[0,1)\alpha\in[0,1), the updating rule of APPNP [23]

𝐇(l)=σ((1α)𝐀^𝐇(l1)+α𝐇(0)),l[L],\displaystyle\mathbf{H}^{(l)}=\sigma\left((1-\alpha)\widehat{\mathbf{A}}\mathbf{H}^{(l-1)}+\alpha\mathbf{H}^{(0)}\right),\;l\in[L], (29)

is equivalent to solving the following optimization

𝐇(l)=argmin𝐇𝒮(l)𝒥(l),\mathbf{H}^{(l)}=\arg\min_{\mathbf{H}\in\mathcal{S}^{(l)}}\mathcal{J}^{(l)}, (30)
s.t.𝐇(l)=𝒮,l[L1],𝐇(L)=𝒮simplex,s.t.~{}\mathbf{H}^{(l)}=\mathcal{S},\;l\in[L-1],\;\mathbf{H}^{(L)}=\mathcal{S}_{simplex},

where

𝒥(l)\displaystyle\mathcal{J}^{(l)} =Tr(𝐇𝐇(l1)𝚯(l))\displaystyle=-{\rm{Tr}}\left(\mathbf{H}^{\top}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right) (31)
+12Tr(11α𝐇𝐀^1(𝐇2α𝐇(0))).\displaystyle+\frac{1}{2}{\rm{Tr}}\left(\frac{1}{1-\alpha}\mathbf{H}^{\top}{\widehat{\mathbf{A}}}^{-1}(\mathbf{H}-2\alpha\mathbf{H}^{(0)})\right).
Proof.

Taking the derivative of 𝒥(l)\mathcal{J}^{(l)} w.r.t. 𝐇\mathbf{H} and setting it to zero, we come to

𝒥(l)𝐇=𝐇(l1)+11α𝐀^1(𝐇α𝐇(0))=𝟎,\displaystyle\frac{\partial\mathcal{J}^{(l)}}{\partial\mathbf{H}}=-\mathbf{H}^{(l-1)}+\frac{1}{1-\alpha}{\widehat{\mathbf{A}}}^{-1}(\mathbf{H}-\alpha\mathbf{H}^{(0)})=\mathbf{0}, (32)

which leads to

𝐇=(1α)𝐀^𝐇(l1)+α𝐇(0).\displaystyle\mathbf{H}^{*}=(1-\alpha)\widehat{\mathbf{A}}\mathbf{H}^{(l-1)}+\alpha\mathbf{H}^{(0)}. (33)

By projecting 𝐇\mathbf{H}^{*} onto 𝒮(l)\mathcal{S}^{(l)}, we could achieve

𝐇(l)=σ((1α)𝐀^𝐇(l1)+α𝐇(0)),l[L],\displaystyle\mathbf{H}^{(l)}=\sigma\left((1-\alpha)\widehat{\mathbf{A}}\mathbf{H}^{(l-1)}+\alpha\mathbf{H}^{(0)}\right),\;l\in[L], (34)

which completes the proof.

The above analyses reveal that when the regularizer is devised to 12Tr(11α𝐇𝐀^1(𝐇2α𝐇(0)))\frac{1}{2}{\rm{Tr}}\left(\frac{1}{1-\alpha}\mathbf{H}^{\top}{\widehat{\mathbf{A}}}^{-1}(\mathbf{H}-2\alpha\mathbf{H}^{(0)})\right), the framework (19) could give birth to APPNP [23]. ∎

Theorem 3.

The updating rule of JKNet [24]

𝐇(l)=σ(k=1Kαk𝐀^k𝐇(l1)𝚯(l)),l[L],\displaystyle\begin{split}\mathbf{H}^{(l)}=\sigma\left(\sum_{k=1}^{K}\alpha_{k}{\widehat{\mathbf{A}}}^{k}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right),\;l\in[L],\end{split} (35)

is equivalent to solving the following optimization

𝐇(l)=argmin𝐇𝒮(l)𝒥(l)\mathbf{H}^{(l)}=\arg\min_{\mathbf{H}\in\mathcal{S}^{(l)}}\mathcal{J}^{(l)} (36)
s.t.𝐇(l)=𝒮,l[L1],𝐇(L)=𝒮simplex,s.t.~{}\mathbf{H}^{(l)}=\mathcal{S},l\in[L-1],\mathbf{H}^{(L)}=\mathcal{S}_{simplex},

where

𝒥(l)\displaystyle\mathcal{J}^{(l)} =Tr(𝐇𝐇(l1)𝚯(l))\displaystyle=-{\rm{Tr}}\left(\mathbf{H}^{\top}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right) (37)
+12Tr(𝐇𝐀^1(𝐈+β𝐋~)𝐇).\displaystyle+\frac{1}{2}{\rm{Tr}}\left(\mathbf{H}^{\top}{\widehat{\mathbf{A}}}^{-1}(\mathbf{I}+\beta\widetilde{\mathbf{L}})\mathbf{H}\right).
Proof.

Taking the derivative of 𝒥(l)\mathcal{J}^{(l)} w.r.t. 𝐇\mathbf{H} and setting it to zero, we have

𝒥(l)𝐇=𝐇(l1)𝚯(l)+𝐀^1(𝐈+β𝐋~)𝐇=𝟎,\displaystyle\frac{\partial\mathcal{J}^{(l)}}{\partial\mathbf{H}}=-\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}+{\widehat{\mathbf{A}}}^{-1}(\mathbf{I}+\beta\widetilde{\mathbf{L}})\mathbf{H}=\mathbf{0}, (38)

which leads to

𝐇=1β+1(𝐈ββ+1𝐀^)1𝐀^𝐇(l1)𝚯(l).\mathbf{H}^{*}=\frac{1}{\beta+1}\left(\mathbf{I}-\frac{\beta}{\beta+1}\widehat{\mathbf{A}}\right)^{-1}\widehat{\mathbf{A}}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}. (39)

It is noted that the spectral radius of ββ+1𝐀^\frac{\beta}{\beta+1}\widehat{\mathbf{A}} is smaller than one, indicating

(𝐈ββ+1𝐀^)1=k=0[ββ+1𝐀^]k.\displaystyle\left(\mathbf{I}-\frac{\beta}{\beta+1}\widehat{\mathbf{A}}\right)^{-1}=\sum_{k=0}^{\infty}\left[\frac{\beta}{\beta+1}\widehat{\mathbf{A}}\right]^{k}. (40)

If its $(K-1)$-order approximation is employed, then we have

(𝐈ββ+1𝐀^)1k=0K1βk(β+1)k𝐀^k,\displaystyle\left(\mathbf{I}-\frac{\beta}{\beta+1}\widehat{\mathbf{A}}\right)^{-1}\approx\sum_{k=0}^{K-1}\frac{\beta^{k}}{(\beta+1)^{k}}\widehat{\mathbf{A}}^{k}, (41)

which suggests that 𝐇\mathbf{H}^{*} can be approximated by

𝐇=k=1Kβk1(β+1)k𝐀^k𝐇(l1)𝚯(l).\displaystyle\mathbf{H}^{*}=\sum_{k=1}^{K}\frac{\beta^{k-1}}{(\beta+1)^{k}}\widehat{\mathbf{A}}^{k}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}. (42)

Denoting $\alpha_{k}=\frac{\beta^{k-1}}{(\beta+1)^{k}}$ ($k\in[K]$), $\{\alpha_{k}\}_{k=1}^{\infty}$ is a set of parameters with $\sum_{k=1}^{\infty}\alpha_{k}=\frac{1}{\beta+1}\cdot\frac{1}{1-\frac{\beta}{\beta+1}}=1$.

By projecting 𝐇\mathbf{H}^{*} onto 𝒮(l)\mathcal{S}^{(l)}, we can realize

𝐇(l)=σ(k=1Kαk𝐀^k𝐇(l1)𝚯(l)),l[L],\displaystyle\mathbf{H}^{(l)}=\sigma\left(\sum_{k=1}^{K}\alpha_{k}{\widehat{\mathbf{A}}}^{k}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right),l\in[L], (43)

which completes the proof.

The above analyses reveal that when the regularizer is devised to 12Tr(𝐇𝐀^1(𝐈+β𝐋~)𝐇)\frac{1}{2}{\rm{Tr}}\left(\mathbf{H}^{\top}{\widehat{\mathbf{A}}}^{-1}(\mathbf{I}+\beta\widetilde{\mathbf{L}})\mathbf{H}\right), the framework (19) can produce JKNet [24]. ∎

Theorem 4.

Given 𝐇(0)=fΘMLP(𝐗)\mathbf{H}^{(0)}=f_{\Theta}^{MLP}(\mathbf{X}) and a trainable projection vector αK+1\alpha\in\mathbb{R}^{K+1}, the updating rule of DAGNN [27]

𝐇(l)=σ(k=0Kαk𝐀^k𝐇(0)),\displaystyle\mathbf{H}^{(l)}=\sigma\left(\sum_{k=0}^{K}\alpha_{k}{\widehat{\mathbf{A}}}^{k}\mathbf{H}^{(0)}\right), (44)

is equivalent to solving the following optimization

𝐇(l)=argmin𝐇𝒮(l)𝒥(l)\mathbf{H}^{(l)}=\arg\min_{\mathbf{H}\in\mathcal{S}^{(l)}}\mathcal{J}^{(l)} (45)
s.t.𝐇(l)=𝒮,l[L1],𝐇(L)=𝒮simplex,s.t.~{}\mathbf{H}^{(l)}=\mathcal{S},\;l\in[L-1],\;\mathbf{H}^{(L)}=\mathcal{S}_{simplex},

where

𝒥(l)\displaystyle\mathcal{J}^{(l)} =Tr(𝐇𝐇(l1)𝚯(l))\displaystyle=-{\rm{Tr}}\left(\mathbf{H}^{\top}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right) (46)
+12Tr(𝐇(𝐈+β𝐋~)𝐇).\displaystyle+\frac{1}{2}{\rm{Tr}}\left(\mathbf{H}^{\top}(\mathbf{I}+\beta\widetilde{\mathbf{L}})\mathbf{H}\right).
Proof.

Taking the derivative of $\mathcal{J}^{(l)}$ w.r.t. $\mathbf{H}$ and setting it to zero, we obtain

𝒥(l)𝐇=𝐇(0)+(𝐈+β𝐋~)𝐇=𝟎.\displaystyle\frac{\partial\mathcal{J}^{(l)}}{\partial\mathbf{H}}=-\mathbf{H}^{(0)}+(\mathbf{I}+\beta\widetilde{\mathbf{L}})\mathbf{H}=\mathbf{0}. (47)

Similar to the proof of Theorem 3, the KK-order approximation of (𝐈+β𝐋~)1=1β+1(𝐈ββ+1𝐀^)1\left(\mathbf{I}+\beta\widetilde{\mathbf{L}}\right)^{-1}=\frac{1}{\beta+1}\left(\mathbf{I}-\frac{\beta}{\beta+1}\widehat{\mathbf{A}}\right)^{-1} is utilized, and then we obtain

𝐇=k=0Kαk𝐀^k𝐇(0).\displaystyle\mathbf{H}^{*}=\sum_{k=0}^{K}\alpha_{k}{\widehat{\mathbf{A}}}^{k}\mathbf{H}^{(0)}. (48)

By projecting 𝐇\mathbf{H}^{*} onto 𝒮(l)\mathcal{S}^{(l)}, we can arrive at

𝐇(l)=σ(k=0Kαk𝐀^k𝐇(0)),l[L],\displaystyle\mathbf{H}^{(l)}=\sigma\left(\sum_{k=0}^{K}\alpha_{k}{\widehat{\mathbf{A}}}^{k}\mathbf{H}^{(0)}\right),\;l\in[L], (49)

which completes the proof.

The above analyses reveal that when the regularizer is devised to 12Tr(𝐇(𝐈+β𝐋~)𝐇)\frac{1}{2}{\rm{Tr}}\left(\mathbf{H}^{\top}(\mathbf{I}+\beta\widetilde{\mathbf{L}})\mathbf{H}\right), the framework (19) can produce DAGNN [27]. ∎

Theorem 5.

The updating rule of GNN-HF [19]

\mathbf{H}^{(l)}=\sigma\left(\left((\beta+\frac{1}{\alpha})\mathbf{I}+(1-\beta-\frac{1}{\alpha})\widehat{\mathbf{A}}\right)^{-1}(\mathbf{I}+\beta\widehat{\mathbf{L}})\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right), (50)

is equivalent to solving the following optimization

𝐇(l)=argmin𝐇𝒮(l)𝒥(l)\mathbf{H}^{(l)}=\arg\min_{\mathbf{H}\in\mathcal{S}^{(l)}}\mathcal{J}^{(l)} (51)
s.t.𝐇(l)=𝒮+,l[L1],𝐇(L)=𝒮simplex,s.t.~{}\mathbf{H}^{(l)}=\mathcal{S}_{+},\;l\in[L-1],\;\mathbf{H}^{(L)}=\mathcal{S}_{simplex},

where

\mathcal{J}^{(l)}=-\mathrm{Tr}\left(\mathbf{H}^{\top}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)+\frac{1}{2}\mathrm{Tr}\left(\mathbf{H}^{\top}(\mathbf{I}+\mu\widehat{\mathbf{L}})^{-1}(\mathbf{I}+\lambda\widehat{\mathbf{L}})\mathbf{H}\right) (52)

with $\lambda=\beta+\frac{1}{\alpha}-1$ and $\mu=\beta$.

Proof.

Taking the derivative of $\mathcal{J}^{(l)}$ w.r.t. $\mathbf{H}$ and setting it to zero, we have

\frac{\partial\mathcal{J}^{(l)}}{\partial\mathbf{H}}=-\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}+(\mathbf{I}+\mu\widehat{\mathbf{L}})^{-1}(\mathbf{I}+\lambda\widehat{\mathbf{L}})\mathbf{H}=\mathbf{0}, (53)

which yields

\mathbf{H}^{*}=(\mathbf{I}+\lambda\widehat{\mathbf{L}})^{-1}(\mathbf{I}+\mu\widehat{\mathbf{L}})\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}. (54)

Substituting $\lambda=\beta+\frac{1}{\alpha}-1$ and $\mu=\beta$ into Eq. (54), we obtain

(\mathbf{I}+\lambda\widehat{\mathbf{L}})^{-1}=\left((1+\lambda)\mathbf{I}-\lambda\widehat{\mathbf{A}}\right)^{-1}=\left((\beta+\frac{1}{\alpha})\mathbf{I}+(1-\beta-\frac{1}{\alpha})\widehat{\mathbf{A}}\right)^{-1}. (55)

By projecting $\mathbf{H}^{*}$ onto $\mathcal{S}^{(l)}$, we achieve

\mathbf{H}^{(l)}=\sigma\left(\mathbf{H}^{*}\right),\; l\in[L], (56)

which completes the proof.

The above analyses reveal that when the regularizer is set to $\frac{1}{2}\mathrm{Tr}\left(\mathbf{H}^{\top}(\mathbf{I}+\mu\widehat{\mathbf{L}})^{-1}(\mathbf{I}+\lambda\widehat{\mathbf{L}})\mathbf{H}\right)$, the framework (19) can produce GNN-HF [19]. ∎
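The rewriting in Eq. (55) and the resulting closed-form filter of Eq. (50) can also be verified numerically; the NumPy snippet below uses a small synthetic graph and arbitrary values of $\alpha$ and $\beta$ purely as a sanity check, not as part of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic graph: symmetric adjacency with self-loops, then the
# symmetrically normalized A_hat and Laplacian L_hat = I - A_hat.
A = (rng.random((6, 6)) < 0.4).astype(float)
A = np.triu(A, 1); A = A + A.T + np.eye(6)
d_inv_sqrt = 1.0 / np.sqrt(A.sum(1))
A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
L_hat = np.eye(6) - A_hat

alpha, beta = 0.3, 0.7
lam, mu = beta + 1.0 / alpha - 1.0, beta     # lambda and mu of Theorem 5

# Eq. (55): (I + lambda*L_hat)^{-1} rewritten in terms of A_hat.
lhs = np.linalg.inv(np.eye(6) + lam * L_hat)
rhs = np.linalg.inv((beta + 1 / alpha) * np.eye(6) + (1 - beta - 1 / alpha) * A_hat)
print(np.allclose(lhs, rhs))                 # True

# Eq. (54)/(50): closed-form GNN-HF propagation on a random block standing
# for H^(l-1) Theta^(l) (sigma omitted).
H_theta = rng.random((6, 4))
H_star = rhs @ (np.eye(6) + mu * L_hat) @ H_theta
print(H_star.shape)                          # (6, 4)
```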

Theorem 6.

The updating rule of GNN-LF [19]

\mathbf{H}^{(l)}=\sigma\left(\left(\beta\mathbf{I}+(1-\beta)\widehat{\mathbf{A}}+(\frac{1}{\alpha}-1)\widehat{\mathbf{L}}\right)^{-1}\left(\beta\mathbf{I}+(1-\beta)\widehat{\mathbf{A}}\right)\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right), (57)

is equivalent to solving the following optimization

\mathbf{H}^{(l)}=\arg\min_{\mathbf{H}\in\mathcal{S}^{(l)}}\mathcal{J}^{(l)} (58)
s.t.\; \mathcal{S}^{(l)}=\mathcal{S}_{+},\; l\in[L-1],\; \mathcal{S}^{(L)}=\mathcal{S}_{simplex},

where

\mathcal{J}^{(l)}=-\mathrm{Tr}\left(\mathbf{H}^{\top}\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}\right)+\frac{1}{2}\mathrm{Tr}\left(\mathbf{H}^{\top}(\mathbf{I}+\mu\widehat{\mathbf{A}})^{-1}(\mathbf{I}+\lambda\widehat{\mathbf{A}})\mathbf{H}\right) (59)

with $\lambda=\frac{-\alpha\beta+2\alpha-1}{\alpha\beta-\alpha+1}$ and $\mu=\frac{1}{\beta}-1$.

Proof.

Taking the derivative of $\mathcal{J}^{(l)}$ w.r.t. $\mathbf{H}$ and setting it to zero, we get

\frac{\partial\mathcal{J}^{(l)}}{\partial\mathbf{H}}=-\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}+(\mathbf{I}+\mu\widehat{\mathbf{A}})^{-1}(\mathbf{I}+\lambda\widehat{\mathbf{A}})\mathbf{H}=\mathbf{0}, (60)

which leads to

\mathbf{H}^{*}=(\mathbf{I}+\lambda\widehat{\mathbf{A}})^{-1}(\mathbf{I}+\mu\widehat{\mathbf{A}})\mathbf{H}^{(l-1)}\mathbf{\Theta}^{(l)}. (61)
Table 5: Specific ($\alpha$, $\beta$, $r$) and other parameters of tsGCN on all datasets.

| Datasets    | $\alpha$ | $\beta$ | $r$                     | Learning rate     | Weight decay      | Hidden units |
|-------------|----------|---------|-------------------------|-------------------|-------------------|--------------|
| Cora        | 1.0      | 0.2     | $\lfloor d/16\rfloor$   | $1\times 10^{-2}$ | $5\times 10^{-4}$ | 32           |
| Citeseer    | 1.0      | 0.4     | $\lfloor d/16\rfloor$   | $1\times 10^{-2}$ | $5\times 10^{-4}$ | 32           |
| Pubmed      | 1.0      | 0.3     | $\lfloor d/2048\rfloor$ | $1\times 10^{-2}$ | $5\times 10^{-4}$ | 32           |
| ACM         | 1.0      | 0.9     | $\lfloor d/64\rfloor$   | $1\times 10^{-2}$ | $5\times 10^{-4}$ | 32           |
| BlogCatalog | 1.0      | 0.5     | $\lfloor d/64\rfloor$   | $1\times 10^{-2}$ | $5\times 10^{-4}$ | 32           |
| CoraFull    | 1.0      | 0.1     | $\lfloor d/8\rfloor$    | $1\times 10^{-2}$ | $5\times 10^{-4}$ | 32           |
| Flickr      | 1.0      | 1.0     | $\lfloor d/64\rfloor$   | $1\times 10^{-2}$ | $5\times 10^{-4}$ | 32           |
| UAI         | 1.0      | 1.0     | $\lfloor d/16\rfloor$   | $1\times 10^{-2}$ | $5\times 10^{-4}$ | 32           |

Absorbing the scale $\frac{\alpha\beta}{\alpha\beta-\alpha+1}$ into the learnable parameter $\mathbf{\Theta}^{(l)}$, and substituting $\lambda=\frac{-\alpha\beta+2\alpha-1}{\alpha\beta-\alpha+1}$ and $\mu=\frac{1}{\beta}-1$ into Eq. (61), we have

\frac{\alpha\beta}{\alpha\beta-\alpha+1}(\mathbf{I}+\lambda\widehat{\mathbf{A}})^{-1}(\mathbf{I}+\mu\widehat{\mathbf{A}}) (62)
=\frac{\alpha\beta}{\alpha\beta-\alpha+1}\left(\mathbf{I}+\frac{-\alpha\beta+2\alpha-1}{\alpha\beta-\alpha+1}\widehat{\mathbf{A}}\right)^{-1}\left(\mathbf{I}+(\frac{1}{\beta}-1)\widehat{\mathbf{A}}\right)
=\left((1-\frac{1}{\beta}+\frac{1}{\alpha\beta})\mathbf{I}+(-1+\frac{2}{\beta}-\frac{1}{\alpha\beta})\widehat{\mathbf{A}}\right)^{-1}\left(\mathbf{I}+(\frac{1}{\beta}-1)\widehat{\mathbf{A}}\right)
=\left((\beta-1+\frac{1}{\alpha})\mathbf{I}+(-\beta+2-\frac{1}{\alpha})\widehat{\mathbf{A}}\right)^{-1}\left(\beta\mathbf{I}+(1-\beta)\widehat{\mathbf{A}}\right).

By projecting $\mathbf{H}^{*}$ onto $\mathcal{S}^{(l)}$, we attain

\mathbf{H}^{(l)}=\sigma\left(\mathbf{H}^{*}\right),\; l\in[L]. (63)

For notation consistency, we denote $\lambda$ and $\mu$ as $\alpha$ and $\beta$ in Table 4, which completes the proof.

The above analyses reveal that when the regularizer is set to $\frac{1}{2}\mathrm{Tr}\left(\mathbf{H}^{\top}(\mathbf{I}+\mu\widehat{\mathbf{A}})^{-1}(\mathbf{I}+\lambda\widehat{\mathbf{A}})\mathbf{H}\right)$, the framework (19) can generate GNN-LF [19]. ∎
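Likewise, the chain of rewritings in Eq. (62) can be checked numerically; the NumPy snippet below (synthetic graph, arbitrary $\alpha$ and $\beta$) confirms that the rescaled filter on the left equals the GNN-LF form of Eq. (57). It is a verification sketch only, independent of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic symmetrically normalized adjacency A_hat and Laplacian L_hat.
A = (rng.random((5, 5)) < 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T + np.eye(5)
d_inv_sqrt = 1.0 / np.sqrt(A.sum(1))
A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
L_hat = np.eye(5) - A_hat

alpha, beta = 0.4, 0.6
lam = (-alpha * beta + 2 * alpha - 1) / (alpha * beta - alpha + 1)  # lambda of Theorem 6
mu = 1.0 / beta - 1.0                                               # mu of Theorem 6
scale = alpha * beta / (alpha * beta - alpha + 1)                   # factor absorbed into Theta^(l)

# Left-hand side of Eq. (62) ...
lhs = scale * np.linalg.inv(np.eye(5) + lam * A_hat) @ (np.eye(5) + mu * A_hat)
# ... and the GNN-LF filter of Eq. (57).
rhs = np.linalg.inv(beta * np.eye(5) + (1 - beta) * A_hat + (1 / alpha - 1) * L_hat) \
      @ (beta * np.eye(5) + (1 - beta) * A_hat)
print(np.allclose(lhs, rhs))                                        # True
```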

More Experimental Settings and Results

In this part, we provide more experimental settings and results for tsGCN, including hyperparameter settings, t-SNE visualizations of various methods, and the parameter sensitivity analysis of tsGCN w.r.t. F1-score.

Hyperparameter Settings. The detailed values of the hyperparameters are recorded in Table 5, which can be used to reproduce the reported experimental results; the code is also provided as a supplementary file.
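For convenience, the settings of Table 5 can be gathered into a single configuration mapping such as the hypothetical one below; the dictionary layout and key names (e.g., r_divisor) are ours, and $d$ follows the notation of Table 5.

```python
# Hypothetical consolidation of Table 5; r = floor(d / r_divisor) with d as in the table.
COMMON = {"lr": 1e-2, "weight_decay": 5e-4, "hidden_units": 32}

TSGCN_CONFIG = {
    "Cora":        {"alpha": 1.0, "beta": 0.2, "r_divisor": 16,   **COMMON},
    "Citeseer":    {"alpha": 1.0, "beta": 0.4, "r_divisor": 16,   **COMMON},
    "Pubmed":      {"alpha": 1.0, "beta": 0.3, "r_divisor": 2048, **COMMON},
    "ACM":         {"alpha": 1.0, "beta": 0.9, "r_divisor": 64,   **COMMON},
    "BlogCatalog": {"alpha": 1.0, "beta": 0.5, "r_divisor": 64,   **COMMON},
    "CoraFull":    {"alpha": 1.0, "beta": 0.1, "r_divisor": 8,    **COMMON},
    "Flickr":      {"alpha": 1.0, "beta": 1.0, "r_divisor": 64,   **COMMON},
    "UAI":         {"alpha": 1.0, "beta": 1.0, "r_divisor": 16,   **COMMON},
}


def low_rank_size(dataset: str, d: int) -> int:
    """Low-rank size r = floor(d / r_divisor) for the given dataset."""
    return d // TSGCN_CONFIG[dataset]["r_divisor"]
```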

More Visualizations. We draw the t-SNE visualizations of the embeddings generated by all methods on all datasets in Figs. 5–11, from which we have the following observations:

  • The results generally match the quantitative performance, i.e., tsGCN achieves better results than the other methods on most datasets.

  • Embeddings generated by tsGCN exhibit better inter-class separation and intra-class clustering than those generated by tsGCN (inv), even when their quantitative performance is comparable.

Parameter Sensitivity. As shown in Fig. 12, the F1-scores of tsGCN w.r.t. ($\alpha$, $\beta$) exhibit trends similar to the classification accuracies of tsGCN.

Figure 5: Different methods’ t-SNE visualizations on Cora, where each color corresponds to one class.
Figure 6: Different methods’ t-SNE visualizations on Citeseer, where each color corresponds to one class.
Figure 7: Different methods’ t-SNE visualizations on Pubmed, where each color corresponds to one class.
Figure 8: Different methods’ t-SNE visualizations on ACM, where each color corresponds to one class. Note that GraphSAGE fails to run on ACM.
Figure 9: Different methods’ t-SNE visualizations on CoraFull, where each color corresponds to one class.
Figure 10: Different methods’ t-SNE visualizations on Flickr, where each color corresponds to one class.
Figure 11: Different methods’ t-SNE visualizations on UAI, where each color corresponds to one class.
Figure 12: The classification F1-scores of tsGCN w.r.t. different hyperparameters $\alpha$ and $\beta$ on all datasets.

References

  • [1] Z. Chen, X. Wei, P. Wang, Y. Guo, Multi-label image recognition with graph convolutional networks, in: CVPR, 2019, pp. 5177–5186.
  • [2] W. Nie, Y. Zhao, A. Liu, Z. Gao, Y. Su, Multi-graph convolutional network for unsupervised 3d shape retrieval, in: MM, 2020, pp. 3395–3403.
  • [3] Y. Wang, M. Cao, Z. Fan, S. Peng, Learning to detect 3d facial landmarks via heatmap regression with graph convolutional network, in: AAAI, 2022, pp. 2595–2603.
  • [4] F. Xu, J. Lian, Z. Han, Y. Li, Y. Xu, X. Xie, Relation-aware graph convolutional networks for agent-initiated social e-commerce recommendation, in: CIKM, 2019, pp. 529–538.
  • [5] Y. Chen, L. Huang, C. Wang, J. Lai, Hybrid-order gated graph neural network for session-based recommendation, IEEE Transactions on Industrial Informatics 18 (3) (2022) 1458–1467.
  • [6] H. Hu, L. Cheng, J. P. Vap, M. Borowczak, Learning privacy-preserving graph convolutional network with partially observed sensitive attributes, in: WWW, 2022, pp. 3552–3561.
  • [7] B. Yu, H. Yin, Z. Zhu, Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting, in: IJCAI, 2018, pp. 3634–3640.
  • [8] W. Chen, L. Chen, Y. Xie, W. Cao, Y. Gao, X. Feng, Multi-range attentive bicomponent graph convolutional network for traffic forecasting, in: AAAI, 2020, pp. 3529–3536.
  • [9] Y. Zhang, S. Pal, M. Coates, D. Üstebay, Bayesian graph convolutional neural networks for semi-supervised classification, in: AAAI, 2019, pp. 5829–5836.
  • [10] M. Yang, Y. Shen, R. Li, H. Qi, Q. Zhang, B. Yin, A new perspective on the effects of spectrum in graph neural networks, in: ICML, 2022, pp. 25261–25279.
  • [11] S. Fan, X. Wang, C. Shi, E. Lu, K. Lin, B. Wang, One2multi graph autoencoder for multi-view graph clustering, in: WWW, 2020, pp. 3070–3076.
  • [12] H. Zhu, P. Koniusz, Simple spectral graph convolution, in: ICLR, 2021, pp. 1–11.
  • [13] H. Chen, H. Yin, X. Sun, T. Chen, B. Gabrys, K. Musial, Multi-level graph convolutional networks for cross-platform anchor link prediction, in: KDD, 2020, pp. 1503–1511.
  • [14] N. Halliwell, Evaluating explanations of relational graph convolutional network link predictions on knowledge graphs, in: AAAI, 2022, pp. 12880–12881.
  • [15] D. Bo, X. Wang, C. Shi, H. Shen, Beyond low-frequency information in graph convolutional networks, in: AAAI, 2021, pp. 3950–3957.
  • [16] Z. Zhang, C. Chen, Y. Chang, W. Hu, X. Xing, Y. Zhou, Z. Zheng, SHNE: Semantics and homophily preserving network embedding, IEEE Transactions on Neural Networks and Learning Systems (2021) 1–12.
  • [17] X. Wang, M. Zhu, D. Bo, P. Cui, C. Shi, J. Pei, AM-GCN: Adaptive multi-channel graph convolutional networks, in: KDD, 2020, pp. 1243–1253.
  • [18] L. Zhao, L. Akoglu, Connecting graph convolutional networks and graph-regularized pca, arXiv preprint arXiv:2006.12294 (2020).
  • [19] M. Zhu, X. Wang, C. Shi, H. Ji, P. Cui, Interpreting and unifying graph neural networks with an optimization framework, in: WWW, 2021, pp. 1215–1226.
  • [20] Y. Yang, T. Liu, Y. Wang, J. Zhou, Q. Gan, Z. Wei, Z. Zhang, Z. Huang, D. Wipf, Graph neural networks inspired by classical iterative algorithms, in: ICML, 2021, pp. 11773–11783.
  • [21] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: ICLR, 2017, pp. 1–13.
  • [22] F. Wu, A. H. de Souza Jr., T. Zhang, C. Fifty, T. Yu, K. Q. Weinberger, Simplifying graph convolutional networks, in: ICML, 2019, pp. 6861–6871.
  • [23] J. Klicpera, A. Bojchevski, S. Günnemann, Predict then propagate: Graph neural networks meet personalized pagerank, in: ICLR, 2019, pp. 1–15.
  • [24] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, S. Jegelka, Representation learning on graphs with jumping knowledge networks, in: ICML, 2018, pp. 5449–5458.
  • [25] K. Sun, Z. Zhu, Z. Lin, AdaGCN: Adaboosting graph convolutional networks into deep models, in: ICLR, 2021, pp. 1–15.
  • [26] Q. Li, Z. Han, X. Wu, Deeper insights into graph convolutional networks for semi-supervised learning, in: AAAI, 2018, pp. 3538–3545.
  • [27] M. Liu, H. Gao, S. Ji, Towards deeper graph neural networks, in: KDD, 2020, pp. 338–348.
  • [28] L. Yang, Z. Kang, X. Cao, D. Jin, B. Yang, Y. Guo, Topology optimization based graph convolutional network, in: AAAI, 2019, pp. 4054–4061.
  • [29] K. Oono, T. Suzuki, Graph neural networks exponentially lose expressive power for node classification, in: ICLR, 2020, pp. 1–8.
  • [30] D. K. Hammond, P. Vandergheynst, R. Gribonval, Wavelets on graphs via spectral graph theory, Applied and Computational Harmonic Analysis 30 (2011) 129–150.
  • [31] B. Amos, Differentiable optimization-based modeling for machine learning, Ph.D. thesis, Carnegie Mellon University (2019).
  • [32] J. Sun, Z. Xu, Neural diffusion distance for image segmentation, in: NeurIPS, 2019, pp. 1441–1451.
  • [33] M. Defferrard, X. Bresson, P. Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, in: NeurIPS, 2016, pp. 1–9.
  • [34] W. L. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs, in: NeurIPS, 2017, pp. 1024–1034.
  • [35] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio, Graph attention networks, in: ICLR, 2018, pp. 1–12.