Neighborhood Convolutional Network: A New Paradigm of Graph Neural Networks for Node Classification

Jinsong Chen^1,2,∗, Boyu Li^1,∗, Kun He^1,† ¹ School of Computer Science and Technology, Huazhong University of Science and Technology Wuhan China² Institute of Artificial Intelligence, Huazhong University of Science and Technology Wuhan China chenjinsong,afterslby,[email protected]

(2023)

Abstract.

The decoupled Graph Convolutional Network (GCN), a recent development of GCN that decouples the neighborhood aggregation and feature transformation in each convolutional layer, has shown promising performance for graph representation learning. Existing decoupled GCNs first utilize a simple neural network (e.g., MLP) to learn the hidden features of the nodes, then propagate the learned features on the graph with fixed steps to aggregate the information of multi-hop neighborhoods. Despite effectiveness, the aggregation operation, which requires the whole adjacency matrix as the input, is involved in the model training, causing high training cost that hinders its potential on larger graphs. On the other hand, due to the independence of node attributes as the input, the neural networks used in decoupled GCNs are very simple, and advanced techniques cannot be applied to the modeling. To this end, we further liberate the aggregation operation from the decoupled GCN and propose a new paradigm of GCN, termed Neighborhood Convolutional Network (NCN), that utilizes the neighborhood aggregation result as the input, followed by a special convolutional neural network tailored for extracting expressive node representations from the aggregation input. In this way, the model could inherit the merit of decoupled GCN for aggregating neighborhood information, at the same time, develop much more powerful feature learning modules. A training strategy called mask training is incorporated to further boost the model performance. Extensive results demonstrate the effectiveness of our model for the node classification task on diverse homophilic graphs and heterophilic graphs.

Decoupled Graph Convolutional Network, Neighborhood Aggregation, Multi-hop Neighborhoods, Convolutional Neural Network, Node Classification

\ast

Both authors contributed equally to this research.

{\dagger}

Corresponding author.

^†^†copyright: acmcopyright^†^†journalyear: 2023^†^†doi: XXXXXXX.XXXXXXX^†^†conference: Conference XXXXXX 2023; ; ^†^†price: ^†^†submissionid: 4676^†^†ccs: Computing methodologies Supervised learning by classification^†^†ccs: Mathematics of computing Graph algorithms

1. Introduction

Graphs are ubiquitous in real-world scenarios, such as social networks on social media platforms and molecular structures in biology analysis. Due to the wide applications of graphs, many data mining tasks on graphs have been proposed and investigated well in the field of graph representation learning. As one of the most fundamental and important tasks, node classification, whose goal is to predict node labels, has received considerable attention in recent years. Among various techniques for node classification, Graph Convolutional Network (GCN) (Kipf and Welling, 2017), as a typical model of Graph Neural Network (GNN), has drawn great popularity owing to its promising performance and high flexibility.

The vanilla GCN (Gilmer et al., 2017), consisting of a few GCN layers, aggregates the topology features and attribute features of immediate neighbors to learn the representations of nodes. There are two main modules, neighborhood aggregation and feature transformation (Dong et al., 2021), in each GCN layer. First, the neighborhood aggregation module collects the features of the immediate neighbors of each node in the graph through the adjacency matrix. Then, the feature transformation module utilizes a linear layer to learn the hidden representation of each node based on the aggregated information. Neighborhood aggregation and feature transformation are always coupled in each layer of the GCN frameworks (Chen et al., 2020c; Klicpera et al., 2019b; Xu et al., 2018; Jin et al., 2021b). Nevertheless, recent studies illustrate that such coupled design would cause high training cost (Klicpera et al., 2019a) and lead to over-smoothing issue (Wu et al., 2019) if stacking many GCN layers, limiting the model to capturing deep graph structural information.

Thereafter, decoupled GCN (Dong et al., 2021) emerges to tackle the above issues. Its key idea is to separate the neighborhood aggregation module (a.k.a. the propagation module) and feature transformation module into independent modules. Decoupled GCN first leverages a neural network (i.e., MLP) as the feature transformation module to learn the hidden representations of nodes. Then, decoupled GCN propagates the hidden representations of nodes over the graph to extract the final representations of nodes that involve the semantic information of multi-hop neighborhoods. By proposing various propagation methods (Klicpera et al., 2019a; Wu et al., 2019) and aggregation strategies (Liu et al., 2020; Chien et al., 2021), decoupled GCNs have shown promising performances for the node classification task, attracting increasing attention in recent years.

Despite effectiveness, we observe there are two main weaknesses in existing decoupled GCNs. First, the costs of model training, including time and space, are expensive on large graphs, limiting their scalability. It is because the propagation operation is involved in the training stage, so that the whole adjacency matrix has to be utilized in every training epoch. Second, the semantic information of multi-hop neighborhoods has not been fully explored yet. Decoupled GCNs utilize the normalized adjacency matrix to learn the hidden representations of nodes, which is inefficient in discovering the semantic information of multi-hop neighborhoods.

To this end, we propose a novel and flexible framework called Neighborhood Convolutional Network (NCN), which consists of three main stages: data pre-processing, feature extracting and feature fusion. For data pre-processing, NCN propagates the raw attribute features of nodes and transforms the features of multi-hop neighborhoods into a grid-like feature matrix for each node. The grid-like feature matrix brings non-Euclidean space of independent nodes to order, and more advanced deep learning modules can be applied on the grid-like feature matrix to extract more semantic information for nodes. Then, in the feature extracting stage, NCN develops a Convolutional Neural Network (CNN) based module to capture the high-level semantic features from the grid-like feature matrix. In this way, NCN could extract more semantic information from multi-hop neighborhoods than decoupled GCNs. Finally, in the feature fusion stage, a learnable weight layer is utilized to learn the aggregation weights of the original attribute features of nodes and the extracted features learned from multi-hop neighborhoods, respectively. Thus, NCN can adaptively adjust the aggregation weights according to the characteristics of different graphs. Moreover, NCN develops a mask-based training strategy to further improve the performance of model training. We conduct extensive experiments on various benchmark datasets, including homophilic graphs and heterophilic graphs, to validate the effectiveness of NCN for the node classification task.

The main contributions of this work are summarized as follows:

•

We propose NCN, a new and effective paradigm of GCN for the node classification task, that can extract the node representations involving rich semantic information of multi-hop neighborhoods.
•

We utilize CNN based modules to effectively extract the features from multi-hop neighborhoods and apply a learnable weight layer to adaptively learn the aggregation weights of the original attribute features and the extracted features. Besides, a mask-based training strategy is proposed to further boost training performance.
•

We analyze the time complexity and space complexity of NCN. Compared to decoupled GCN, NCN is more efficient and scalable, especially on large-scale graphs.
•

We conduct extensive experiments for the node classification task on ten benchmark datasets, including homophilic graphs and heterophilic graphs. The results demonstrate the superiority of NCN over the state-of-the-art baselines.

2. Preliminaries

Refer to caption — Figure 1. Illustration of NCN. There are three main stages. In Stage 1, NCN develops a novel aggregator that utilizes the propagation operation to calculate the features of multi-hop neighborhoods and further transform them into the grid-like feature matrix. In Stage 2, NCN leverages a CNN based module to extract the neighborhood features from the feature matrix. In Stage 3, NCN designs a learnable weight layer to combine the features of node attributes and neighborhoods adaptively. A mask-based strategy is applied to boost the model training.

2.1. Problem Formulation

Given a network modeled as an undirected and attributed graph $G=(\mathcal{V},\mathcal{E},\mathbf{X})$ , where $\mathcal{V}$ and $\mathcal{E}$ denote the sets of nodes and edges, respectively, and $\mathbf{X}\in\mathbb{R}^{n\times d}$ represents the attribute feature matrix of nodes given that $n$ is the number of nodes and $d$ the dimension of feature vector. Let $\mathbf{A}\in\mathbb{R}^{n\times n}$ represent the adjacency matrix, and the normalized adjacency matrix is defined as $\hat{\mathbf{A}}=\mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$ , where $\mathbf{D}$ is the diagonal degree matrix given as $\mathbf{D}_{ii}=\sum_{j=1}^{n}\mathbf{A}_{ij}$ . $\mathbf{Y}\in\mathbb{R}^{n\times c}$ denotes the label matrix of nodes, where $c$ denotes the number of labels. Given a set of nodes $V_{l}$ whose labels are known, the node classification task is to predict the labels for nodes $V/V_{l}$ .

2.2. Homophily and Heterophily

In this paper, we use the edge homophily ratio $\mathcal{H}(G)$ (Zhu et al., 2020; He et al., 2022; Chien et al., 2021) to measure the homophily level of a graph $G$ , such that

(1)

\mathcal{H}(G)=\frac{1}{|V|}\sum_{v_{i}\in V}\frac{|\{v_{j}|v_{j}\in\mathcal{N}_{v_{i}},\mathbf{Y}_{v_{j}}=\mathbf{Y}_{v_{i}}\}|}{|\mathcal{N}_{v_{i}}|},

where $\mathcal{N}_{v_{i}}$ denotes the immediate neighbors of node $v_{i}$ . Then, $\mathcal{H}(G)\in[0,1]$ denotes the degree of homophily, where $\mathcal{H}(G)\to 1$ indicates that $G$ has strong homophily and $\mathcal{H}(G)\to 0$ indicates that $G$ has low homophily (strong heterophily).

2.3. Graph Convolutional Network

The key idea of Graph Convolutional Network (GCN) is to apply the first-order approximation of spectral convolution (Defferrard et al., 2016) that aggregates the information of immediate neighbors. A GCN layer can be given as

(2)

\mathbf{H}^{(l+1)}=\sigma(\hat{\tilde{\mathbf{A}}}\mathbf{H}^{(l)}\mathbf{W}^{(l)}),

where $\hat{\tilde{\mathbf{A}}}=\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}$ represents the normalized adjacency matrix given that $\tilde{\mathbf{A}}$ is the adjacency matrix with self-loops and $\widetilde{\mathbf{D}}$ is the corresponding diagonal degree matrix, $\mathbf{H}^{(l)}\in\mathbb{R}^{n\times d^{(l)}}$ denotes the representations of nodes in the $l$ -th layer, $\mathbf{W}^{(l)}\in\mathbb{R}^{d^{(l)}\times d^{(l+1)}}$ is a learnable parameter matrix, and $\sigma(\cdot)$ represents a non-linear activation function.

2.4. Decoupled Graph Convolutional Network

According to Equation 2, a GCN layer contains two coupled operations, namely neighborhood aggregation and feature transformation. Such coupled framework could cause the over-smoothing issue (Chen et al., 2020a) and increase the training cost (Wu et al., 2019) when stacking many GCN layers. To address these issues, decoupled GCN (Chien et al., 2021) separates the two operations by removing the activation function and collapsing the parameter matrices between layers. In this way, the model can be regarded as a combination of an aggregation module (also known as a feature propagation module) and a feature transformation module. A general framework of decoupled GCN is given as

(3)

\mathbf{Z}=\sum_{k=0}^{K}\alpha_{k}\mathbf{H}^{(k)},\mathbf{H}^{(k)}=\hat{\tilde{\mathbf{A}}}\mathbf{H}^{(k-1)},\mathbf{H}^{(0)}=\boldsymbol{f}_{\theta}(\mathbf{X}),

where $\boldsymbol{f}_{\theta}$ denotes a neural network (e.g., MLP) with a parameter set $\theta$ , $\mathbf{H}^{(k)}\in\mathbb{R}^{n\times d^{\prime}}$ and $\alpha_{k}$ represent the representations of nodes and aggregation weight at step $k$ , respectively. Here $K$ is the total propagation step, and $\mathbf{Z}\in\mathbb{R}^{n\times d^{\prime}}$ denotes the output of decoupled GCN.

3. Methodology

In this section, we propose a novel framework called Neighborhood Convolutional Network (NCN), which consists of three main stages, namely data pre-processing, feature extraction, and feature fusion. The overall framework of NCN is shown in Figure 1, and we detail each stage in the following subsections.

3.1. Data Pre-processing

In the data pre-processing stage, we apply a novel aggregator called grid-like neighborhood aggregator (GNA) to preserve the information of multi-hop neighborhoods for each node that can be effectively used for feature extraction. GNA contains two main steps, namely propagating the features of multi-hop neighborhoods and constructing a grid-like feature matrix. First, GNA utilizes a propagation operation to aggregate the attribute features of neighborhood nodes, such that

(4)

\mathbf{B}^{(k)}=\mathbf{S}^{(k)}\mathbf{X},~{}~{}k\in\{0,1,2,...,K\},

where $\mathbf{S}^{(k)}\in\mathbb{R}^{n\times n}$ denotes an aggregation weight matrix at the propagation step $k$ , and $\mathbf{S}^{(0)}$ is initialized as the identity matrix $\mathbf{I}_{n}$ , such that we consider the nodes themselves as their 0-hop neighborhood node. $\mathbf{B}^{(k)}\in\mathbb{R}^{n\times d}$ represents the features aggregated from the $k$ -hop neighborhood nodes.

In this paper, we apply the approximate personalized PageRank (Klicpera et al., 2019a; Dong et al., 2021) to calculate $\mathbf{S}^{(k)}$ , given as $\mathbf{S}^{(k)}=(1-\gamma)^{k}\hat{\mathbf{A}}^{k}+\gamma\sum_{i=0}^{k-1}(1-\gamma)^{i}\hat{\mathbf{A}}^{i}$ , where $\gamma\in{(0,1]}$ is the teleport probability. Note that the propagation operation can be implemented by various methods according to the actual demands. For instance, we can set $\mathbf{S}^{(k)}=\hat{\mathbf{A}}^{k}$ that utilizes the standard random walk to obtain the features of neighborhood nodes.

In the second step, GNA transfers all the features obtained by the first step into a grid-like feature matrix. Specifically, for node $v$ , let $\{\mathbf{B}^{(0)}_{v},\mathbf{B}^{(1)}_{v}$ $,\cdots,\mathbf{B}^{(K)}_{v}\}$ represent the features aggregated by performing the propagation operation on the node. GNA constructs a grid-like feature matrix $\mathbf{T}_{v}\in\mathbb{R}^{(K+1)\times d}$ by stacking $\{\mathbf{B}^{(0)}_{v},\mathbf{B}^{(1)}_{v}$ $,\cdots,\mathbf{B}^{(K)}_{v}\}$ in the ascending order according to the propagation step. In this way, the grid-like feature matrix preserves the feature information of multi-hop neighborhoods from shallow to deep. For instance, the feature vector of the $k$ th row in the grid-like feature matrix contains wider extracted information than those of less than $k$ th rows. Such design brings non-Euclidean space of independent nodes to order. As a result, more advanced deep learning modules, such as CNN and RNN, can be applied on the grid-like feature matrix to extract more semantic information for nodes.

Note that the whole data pre-processing is non-parametric and can be conducted before the model training process, and therefore it can significantly reduce the training cost. Moreover, by assigning an independent feature matrix for each node, the mini-batch training strategy can be efficiently performed on the training model that guarantees the scalability for large-scale graphs.

3.2. Feature Extraction

In the feature extraction stage, we capture the complex semantic information from the grid-like neighborhood feature matrix by using a CNN-based module. Specifically, we design a neural network block consisting of two convolutional layers with different kernel sizes. For the first convolutional layer, we set the kernel size as $((\lfloor K/2\rfloor+1),1)$ that can capture the semantic association between the features of different hop neighborhoods, where $\lfloor\cdot\rfloor$ denotes the operation of rounding down the value. For the second convolutional layer, the kernel size is set as $(1,1)$ for feature transformation, which is a popular design of CNN in computer vision (He et al., 2016).

In NCN, we first utilize a linear layer to reduce the dimension of the grid-like neighborhood feature matrix since the dimension of attribute features could be very large in some networks. By stacking two proposed neural network blocks, NCN extracts a feature matrix $\mathbf{H}^{N}\in\mathbb{R}^{n\times d^{\prime}}$ that represents the semantic information from multi-hop neighborhoods, given as:

(5)

\mathbf{H}^{N}=\mathrm{Block2}(\sigma(\mathrm{Block1}(\mathbf{T}))),

where $\mathbf{T}\in\mathbb{R}^{n\times(K+1)\times d}$ is the grid-like neighborhood feature matrix, $\sigma(\cdot)$ denotes an activation function, and Block1 and Block2 are the two convolutional blocks, respectively.

Note that the design of the neural network block can be flexibly adjusted according to different scenarios. In other words, various convolutional layers can be applied to improve the performance of NCN. We will study the influence of combining a variety of convolutional layers in our future work.

3.3. Feature Fusion

Although the extracted features include rich semantic information of multi-hop neighborhoods, the function of raw attribute features of nodes is seriously weakened. Recent studies (Zhu et al., 2020; Chien et al., 2021; Li et al., 2022) prove that utilizing only the raw attribute features of nodes can also achieve a competitive performance on some graphs. It is because, in some graphs, the raw attribute features of nodes dominate the node labels. Therefore, to accurately predict the labels of nodes, a feasible solution is to learn the representations of nodes by combining the extracted features and the raw attribute features.

To this end, we present an adaptive feature fusion module that can fuse the extracted features and the raw attribute features for different graphs. First, we transfer the raw feature matrix $\mathbf{X}$ into a matrix whose shape is the same as the extracted feature matrix, such that:

(6)

\mathbf{H}^{R}=\mathbf{XW_{R}},

where $\mathbf{W_{R}}\in\mathbb{R}^{d\times d^{\prime}}$ denotes a parameter matrix. Then, we apply a learnable weight layer to adaptively calculate the weights for the feature matrices of $\mathbf{H}^{R}$ and $\mathbf{H}^{N}$ , given as:

(7)

\mathbf{a}=\mathrm{softmax}(\varphi(\mathbf{H}^{R}||\mathbf{H}^{N})),

where $||$ denotes the concatenation operation and $\varphi(\cdot)$ is a linear layer with a parameter matrix $\mathbf{W}_{a}\in\mathbb{R}^{2d^{\prime}\times 2}$ shared by all nodes. The output $\mathbf{a}\in\mathbb{R}^{n\times 2}$ denotes the aggregation weight vector in which $\mathbf{a}_{0}$ and $\mathbf{a}_{1}$ are the aggregation weights of $\mathbf{H}^{R}$ and $\mathbf{H}^{N}$ , respectively. Based on the aggregation weights, the final representation of nodes is calculated as:

(8)

\mathbf{H}^{F}=[\mathbf{a}_{0}\odot\mathbf{H}^{R}||\mathbf{a}_{1}\odot\mathbf{H}^{N}],

where $\odot$ denotes the element-wise product. In practice, the broadcast mechanism of the programming tools (e.g., Pytorch) guarantees the calculation of Equation 8 by repeating the columns of $\mathbf{a}_{0}$ and $\mathbf{a}_{1}$ for matching the dimensions of $\mathbf{H}^{R}$ and $\mathbf{H}^{N}$ respectively.

Mask Training Additionally, inspired by the famous training technique of Dropout (Srivastava et al., 2014), we propose a training strategy called mask training that randomly sets $\mathbf{a}_{0}$ or $\mathbf{a}_{1}$ to zero during the training process. In this way, NCN can alleviate the over-fitting problem during the training. Specifically, for each training epoch, we randomly partition the training set into three parts $N_{0}$ , $N_{1}$ and $N_{2}$ according to the ratios, $\beta$ , $\beta$ and $1-2*\beta$ , respectively, where $\beta\in[0,0.5)$ is a hyperparameter. We set $\mathbf{a}_{0}=0$ for nodes in $N_{0}$ and $\mathbf{a}_{1}=0$ for nodes in $N_{1}$ . Then values of $\mathbf{a}_{0}$ and $\mathbf{a}_{1}$ are kept for the remaining nodes in $N_{2}$ .

Algorithm 1 The learning algorithm of NCN

1:Adjacency matrix

\mathbf{A}

; Feature matrix

\mathbf{X}

; Training Set

V_{l}

; Observed labels

\mathbf{Y}_{V_{l}}

;

2:Model Parameter

\Theta

;

3:Generate

\mathbf{T}

with

\mathrm{GNA}

(

\mathbf{X}

\mathbf{A}

);

\triangleright

Pre-processing step

4:for

t=1

Epoch_{max}

5: for

V_{b}

V_{l}

\triangleright

Mini-batch training

6: Calculate

\mathbf{H}^{N}_{V_{b}}

with Equation 5;

7: Calculate

\mathbf{H}^{R}_{V_{b}}

with Equation 6;

8: Calculate the aggregation weights

\mathbf{a}_{V_{b}}

with Equation 7;

9: Sample

N_{0}

and

N_{1}

from

V_{b}

;

10: Set

\mathbf{a}_{N_{0},0}=0

and

\mathbf{a}_{N_{1},1}=0

;

11: Calculate

\mathbf{H}^{F}_{V_{b}}

with Equation 8;

12: Calculate loss

\boldsymbol{L}

with Equation 10

13: Optimize

\Theta

by minimizing

\boldsymbol{L}

with gradient descent

14: end for

15:end for

16:return Model Parameter

\Theta

;

3.4. Inference and Optimization

Following the common setting (Klicpera et al., 2019a; Chien et al., 2021; He et al., 2022), we apply MLP to predict the labels of nodes according to $\mathbf{H}^{F}$ , such that:

(9)

\hat{\mathbf{Y}}=\mathrm{MLP}(\mathbf{H}^{F}),

where $\hat{\mathbf{Y}}\in\mathbb{R}^{n\times c}$ denotes the predicted labels. Then, we utilize the Cross-Entropy loss, which is widely used in the node classification task (Kipf and Welling, 2017; Klicpera et al., 2019a; Chien et al., 2021), to calculate the loss, given that

(10)

\boldsymbol{L}=-\sum_{v\in V_{l}}\mathbf{Y}_{v}\mathrm{ln}\hat{\mathbf{Y}}_{v}.

The overall learning algorithm of NCN is presented in Algorithm 1.

3.5. Complexity Analysis

3.5.1. Time complexity.

The time complexity of NCN depends on two parts, i.e., convolutional layers and fully connected layers. Their time complexities are $O(nd^{\prime}K^{2})$ and $O(ndd^{\prime}+nd^{\prime}c)$ , respectively, where $n$ is the number of nodes, $K$ the propagation step, $c$ the number of labels, $d$ and $d^{\prime}$ denote the dimension of the raw feature vector and the hidden feature vector, respectively. Thus, the total complexity of NCN is $O(nd^{\prime}K^{2}+ndd^{\prime}+nd^{\prime}c)$ . Since the values of $K$ and $c$ are always very small, the time complexity can be written as $O(ndd^{\prime})$ . Comparing to the decoupled GCN, which has $O(mKc+ndc)$ given that $m$ is the number of edges, NCN is more efficient to perform on dense graphs in which the number of edges is significantly larger than the number of nodes. The reason is that the time complexity of decoupled GCN relies on both $n$ and $m$ , while the time complexity of NCN depends only on $n$ .

3.5.2. Space complexity.

The space complexity of NCN depends on two parts, i.e., model parameters and intermediate variables. Their space complexities are $O(K+dd^{\prime}+d^{\prime}c)$ and $O(bKd^{\prime})$ , respectively, and the total space complexity of NCN is $O(bKd^{\prime}+dd^{\prime}+d^{\prime}c)$ , where $b$ denotes the batch size. Compared to the decoupled GCN, which has $O(nc+dc+m)$ , NCN is more scalable on large-scale graphs as the space cost of NCN does not count the graph size, $n$ and $m$ .

4. Experiments

Table 1. The statistics of datasets.

Dataset	# Nodes	# Edges	# Features	# Classes	$\mathcal{H}$
Cora	2,708	5,278	1,433	7	0.83
Citeseer	3,327	4,552	3,703	6	0.72
Pubmed	19,717	44,324	500	3	0.79
Computer	13,752	491,722	767	10	0.80
Photo	7,650	238,163	745	8	0.85
Actor	7,600	26,659	932	5	0.22
Squirrel	5,201	198,353	2,089	5	0.22
Chameleon	2,277	31,371	2,325	5	0.23
Cornell	183	277	1,703	5	0.30
Texas	183	279	1,703	5	0.11

Table 2. Comparison of all models in terms of mean accuracy

\pm

stdev (%) on all datasets. The best results appear in bold and the runner-up results are underlined.

	Cora	Citeseer	Pubmed	Computer	Photo	Actor	Squirrel	Chameleon	Cornell	Texas
$\mathcal{H}(G)$	0.83	0.72	0.79	0.80	0.85	0.22	0.22	0.23	0.30	0.11
MLP	77.30 $\pm$ 0.29	75.42 $\pm$ 0.32	86.92 $\pm$ 0.16	77.48 $\pm$ 0.28	86.69 $\pm$ 0.31	36.54 $\pm$ 0.89	29.77 $\pm$ 1.72	48.21 $\pm$ 2.98	82.18 $\pm$ 3.13	81.81 $\pm$ 3.75
GCN	87.22 $\pm$ 0.41	77.05 $\pm$ 0.34	88.21 $\pm$ 0.11	89.53 $\pm$ 0.41	94.76 $\pm$ 0.17	31.52 $\pm$ 0.35	44.76 $\pm$ 1.21	59.45 $\pm$ 1.92	60.28 $\pm$ 2.23	66.28 $\pm$ 1.55
GAT	88.77 $\pm$ 0.27	77.10 $\pm$ 0.38	87.87 $\pm$ 0.13	90.76 $\pm$ 0.56	94.98 $\pm$ 0.24	29.91 $\pm$ 0.26	36.23 $\pm$ 1.96	49.71 $\pm$ 1.86	52.53 $\pm$ 2.08	62.12 $\pm$ 1.79
SGC	87.59 $\pm$ 0.37	76.95 $\pm$ 0.19	87.42 $\pm$ 0.16	90.22 $\pm$ 0.21	95.11 $\pm$ 0.09	29.21 $\pm$ 0.22	39.63 $\pm$ 2.06	46.94 $\pm$ 1.85	62.85 $\pm$ 2.98	65.71 $\pm$ 3.15
APPNP	89.12 $\pm$ 0.28	77.11 $\pm$ 0.39	88.69 $\pm$ 0.25	90.91 $\pm$ 0.32	95.09 $\pm$ 0.14	34.86 $\pm$ 0.45	32.55 $\pm$ 1.32	47.25 $\pm$ 2.11	68.57 $\pm$ 2.02	71.28 $\pm$ 1.95
GPR-GNN	88.96 $\pm$ 0.23	77.30 $\pm$ 0.29	89.64 $\pm$ 0.34	89.32 $\pm$ 0.29	95.49 $\pm$ 0.14	36.31 $\pm$ 0.25	34.75 $\pm$ 1.38	68.54 $\pm$ 1.72	89.71 $\pm$ 1.01	90.85 $\pm$ 1.50
LINKX	85.42 $\pm$ 0.43	74.08 $\pm$ 0.52	86.14 $\pm$ 0.17	87.35 $\pm$ 0.42	93.24 $\pm$ 0.24	36.79 $\pm$ 0.85	61.68 $\pm$ 2.97	65.53 $\pm$ 2.82	74.58 $\pm$ 3.13	71.32 $\pm$ 2.95
BM-GCN	89.49 $\pm$ 0.15	77.47 $\pm$ 0.37	89.82 $\pm$ 0.52	89.19 $\pm$ 0.25	95.35 $\pm$ 0.07	34.53 $\pm$ 0.12	47.53 $\pm$ 1.35	69.12 $\pm$ 1.49	84.57 $\pm$ 0.83	89.13 $\pm$ 0.88
GloGNN	88.57 $\pm$ 0.42	77.23 $\pm$ 0.41	88.58 $\pm$ 0.41	89.12 $\pm$ 0.33	95.17 $\pm$ 0.18	39.59 $\pm$ 0.29	59.75 $\pm$ 1.81	75.16 $\pm$ 1.42	91.42 $\pm$ 1.21	90.56 $\pm$ 1.15
NCN	89.51 $\pm$ 0.14	77.26 $\pm$ 0.16	90.94 $\pm$ 0.36	91.23 $\pm$ 0.13	95.93 $\pm$ 0.08	38.45 $\pm$ 0.48	74.33 $\pm$ 1.50	78.61 $\pm$ 1.20	91.71 $\pm$ 1.94	91.14 $\pm$ 1.23

In this section, we conduct extensive experiments to evaluate the performance of NCN compared with the state-of-the-art methods for node classification on ten widely used datasets. We also provide the parameter sensitivity analysis and ablation study for a deeper understanding of the proposed method.

4.1. Experimental Setup

Datasets. We utilize ten widely used datasets, including homophilic graphs and heterophilic graphs for experiments. The homophilic graphs contain Cora, Citeseer, Pubmed, Computers, and Photo (Chien et al., 2021), and the heterophilic graphs include Actor, Squirrel, Chameleon, Texas, and Cornell (Pei et al., 2020). The statistics of datasets are reported in Table 1, where the edge homophily ratio $\mathcal{H}$ measures the degree of homophily as introduced in Section 2.2.

For the homophilic graphs, Cora, Citeseer and Pubmed are constructed from the citation networks, where nodes represent papers, and edges represent citation relationships between papers. Computer and Photo are extracted from the Amazon co-purchase graph, where nodes represent goods and edges indicate that two goods are frequently bought together.

For the heterophilic graphs, Actor, Squirrel and Chameleon derive from the Wikipedia pages. Actor is the actor-only induced subgraph of the film-director-actor-writer network where nodes correspond to actors, and the edge between two nodes denotes co-occurrence on the same Wikipedia. Squirrel and Chameleon are page-page networks on specific topics in Wikipedia where nodes represent web pages, and edges are mutual links between pages. Cornell and Texas are web page datasets collected from computer science departments of various universities, where nodes represent web pages and edges are hyperlinks between them.

Baselines. We utilize eight state-of-the-art methods of node classification as the baselines for comparison, which are GCN (Kipf and Welling, 2017), GAT (Velikovi et al., 2018), SGC (Wu et al., 2019), APPNP (Klicpera et al., 2019a), LINKX (Lim et al., 2021), GPR-GNN (Chien et al., 2021), BM-GCN (He et al., 2022), and GloGNN (Li et al., 2022). Note that the first four methods apply to homophilic graphs, while the latter four methods are designed for homophilic graphs and heterophilic graphs simultaneously. In addition, we also use MLP as a simple comparison method that only utilizes the attribute features of nodes to predict the node labels.

Settings. We randomly split each dataset into 60%/20%/20% train/val/test three parts for experiments. For the baselines, we use their default hyper-parameters as presented in their papers, and for NCN, we use the grid search method to determine the hyper-parameters. We try the hidden dimension in $\{128,256,512\}$ , $\beta$ in $\{0.1,0.2,0.3,0.4\}$ , and the propagation steps in $\{2,4,\cdots,10\}$ . Parameters are optimized with AdamW (Loshchilov and Hutter, 2019) optimizer, with the learning rate of in $\{1e-3,5e-4,1e-4,\}$ and weight decay in $\{1e-4,1e-5\}$ . The batch size is set to 1000. The training process is early stopped within 50 epochs. All experiments are conducted on a Linux server with 1 I9-9900k CPU, 1 RTX 2080TI GPU and 64G RAM.

4.2. Performance Comparison

We adopt accuracy as the evaluation metric for the node classification task to compare the model performance. We run ten times for each method with random splits and report the average results as summarized in Table 2. We can observe that our NCN achieves the best performance on almost all the datasets except Citeseer and Actor. Particularly, NCN outperforms GloGNN, BM-GCN and LINKX, which are powerful methods for node classification on heterophilic graphs, demonstrating the effectiveness of NCN. Although NCN misses the best results on Citeseer and Actor, it still exhibits competitive performances on these datasets. The results confirm that NCN is capable of the node classification task on both homophilic and heterophilic graphs, showing the effectiveness of NCN. Moreover, NCN beats representative decoupled GCNs (APPNP, SGC and GPRGNN) on all the datasets (except behind GPRGNN on Citeseer), indicating that the designs of NCN are beneficial for learning the powerful node representations from the information of multi-hop neighborhoods.

4.3. Ablation Study

We compare NCN with its two variants on all datasets for evaluating the effectiveness of the key designs in NCN. The two variants of NCN are described as follows

•

NCN-w/o-RA: We remove the adaptive feature fusion from NCN, and only the features learned from the neighborhood information is preserved.
•

NCN-w/o-Mask: We do not utilize the mask training strategy in this variant.

According to the comparison results in Figure 4, we can find that NCN outperforms the variants on all datasets, demonstrating that the proposed feature fusion module and mask-based training strategy can improve the performance for the node classification task. Furthermore, NCN-w/o-Mask beats NCN-w/o-RA on all datasets, indicating that introducing the original features of nodes plays a vital role in learning meaningful representations of the nodes.

4.4. Parameter Analysis

In this subsection, we perform experiments on four datasets to explore the influence of the propagation step $K$ on the performance of NCN. We vary the value of $K$ in $\{2,4,6,8,10\}$ and give the corresponding results in Figure 3. We observe that $K$ affects NCN to a large extent because it determines the range of aggregated information for nodes. For different datasets, the value of $K$ varies for the best performance of NCN because the characteristics of graphs are different. Furthermore, we can also find that setting $K$ less than or equal to 6 could obtain good results. It is because the aggregated information involves too much irrelevant information when $K$ is large that weakens the performance of NCN.

4.5. Visualization

In this subsection, to intuitively demonstrate the effectiveness of our proposed adaptive feature fusion model, we visualize the distribution of the learnable aggregation weights, $a_{0}$ and $a_{1}$ in Equation 8, which measure the importance of the information learned from the node itself and its neighborhoods respectively. We select four representative datasets, including Pubmed, Photo, Squirrel and Chameleon. According to the results of MLP in Table 2, the raw attribute features of nodes are highly correlated to the labels of nodes on Pubmed and Photo. In contrast, the raw attribute features of nodes are not quite irrelevant to the labels of nodes on Squirrel and Chameleon. We randomly select 100 nodes from the test set for each dataset and show the distributions of aggregation weights in Figure 4.

In Figure 4, we can find that the average values of $a_{0}$ and $a_{1}$ are similar on Pubmed and Photo indicating that the raw attribute features and multi-hop neighborhoods are almost equally important for learning the node representations. On the other hand, the values of $a_{1}$ are extremely larger than that of $a_{0}$ on Squirrel and Chameleon, demonstrating that the information of neighborhoods plays a more important role in learning the representations of nodes than the raw attribute features. Therefore, we can conclude that our proposed method can adaptively and effectively learn the aggregation weights according to the characteristics of different graphs.

5. Related Work

This section reviews recent works of GCN and decoupled GCN.

Graph Convolutional Network. Graph Convolutional Network (GCN) (Kipf and Welling, 2017) is a typical GNN that applies a non-linear smoothing operation on the first-order neighbors of nodes to learn the representations of nodes. Due to its flexibility, many follow-up GCN models have been proposed to improve performance. In the original GCN, the aggregation operation only relies on the graph topology. The attribute information, one of the most important information of graphs, is ignored, preventing the model from learning the powerful node representations. To address this problem, some recent works (Velikovi et al., 2018; Wang et al., 2020; Jin et al., 2021a, b) introduce the attribute features into the aggregation of the plain GCN for preserving both topology and attribute information. GAT (Velikovi et al., 2018) introduces the attention mechanism to specify different weights to different one-hop neighbors of the central node for aggregating the information adaptively. AM-GCN (Wang et al., 2020) utilizes the GCN layer to extract the features from the original graph and similarity graph generated by node attributes to learn the node representations from the perspectives of topology information and attribute information.

Recent studies argue that GCN implicitly obeys the graph homophily assumption (Zhu et al., 2020; Li et al., 2022) that the connected nodes tend to share the same labels. Thus, GCN performs poorly on heterophilic graphs where adjacent nodes tend to have different labels. Many works have been proposed to overcome this limitation in recent years (Zheng et al., 2022). Zhu et al. (Zhu et al., 2020) theoretically prove that two-hop neighbors tend to contain more nodes with the same class of the central node and propose H2GNN, which combines the features of one-hop neighborhood and two-hop neighborhood in a GCN layer to strengthen the representation of the central node. BM-GCN (He et al., 2022) and HOG-GCN (Wang et al., 2021) generate the soft labels of unlabeled nodes based on the raw attribute information. And they further estimate the block matrix or homophily degree matrix to learn the class-aware similarity of node pairs for guiding the aggregation operation of GCN. For more works of generalizing GCN to heterophilic graphs, please refer to a survey (Zheng et al., 2022).

Decoupled Graph Convolutional Network. The coupled design of neighborhood aggregation and feature transformation in the GCN layer will increase the cost of model training (Klicpera et al., 2019a; Chen et al., 2020b) and cause the over-smoothing issue (Li et al., 2018; Chen et al., 2020a; Liu et al., 2020) when stacking many GCN layers, limiting the model performance. Recent works, called decoupled GCNs (Wu et al., 2019; Klicpera et al., 2019a; Chien et al., 2021; Liu et al., 2020; Chen et al., 2020b), decouple the neighborhood aggregation and feature transformation and treat them as independent modules, and have become the latest paradigm of GCN (Dong et al., 2021). Some works introduce various propagation methods for aggregating neighborhood information, such as random walk (Wu et al., 2019) and personalized PageRank (Klicpera et al., 2019a). To reduce the computational cost of the propagation operation, Chen et al. (Chen et al., 2020b) developed a localized bidirectional propagation process to accelerate the model training. Several works aim to design various feature extractors for learning the node presentations from multi-hop neighborhoods. GPRGNN (Chien et al., 2021) assigns learnable aggregation weights to each hop neighborhood to fuse the neighborhood information adaptively. DAGNN (Liu et al., 2020) utilizes a trainable projection vector to adaptively balance the information from local and global neighborhoods for each node.

Despite effectiveness, the feature extractors in decoupled GCN could be regarded as a simple linear combination, limiting the model’s ability to capture more meaningful information from the multi-hop neighborhoods. In contrast, our proposed NCN abstracts the feature extractor as an independent and flexible module, enabling the model to develop more powerful neural network-based feature extractors for capturing the complex semantic features of multi-hop neighborhoods.

6. Conclusion

In this paper, we propose a general framework, termed Neighborhood Convolutional Network (NCN), which is a new paradigm of Graph Convolutional Neural Network for the node classification task. Benefiting from the proposed pre-processing aggregator that transforms the neighborhood features into the grid-like feature matrix, NCN develops a powerful CNN based module to extract the neighborhood features from the generated feature matrix of multi-hop neighborhoods. Through an adaptive feature fusion module, NCN can effectively learn the node representations from the raw attribute features and neighborhood features adaptively. Moreover, a mask-based training strategy is developed to further boost the model’s performance. We evaluate NCN on ten benchmark datasets, including homophilic and heterophilic graphs. The results demonstrate the superiority of NCN over the state-of-the-art baselines for node classification.

In future work, we will further explore the NCN model in two main directions. First, we will apply other possible model designs to extract the neighborhood information, such as ResNet (He et al., 2016). Second, we will investigate the framework of NCN on other well-known tasks, such as link prediction and graph classification.

References

(1)
Chen et al. (2020a) Deli Chen, Yankai Lin, Wei Li, Peng Li, Jie Zhou, and Xu Sun. 2020a. Measuring and Relieving the Over-Smoothing Problem for Graph Neural Networks from the Topological View. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
Chen et al. (2020b) Ming Chen, Zhewei Wei, Bolin Ding, Yaliang Li, Ye Yuan, Xiaoyong Du, and Ji-Rong Wen. 2020b. Scalable Graph Neural Networks via Bidirectional Propagation. In Proceedings of the Advances in Neural Information Processing Systems, 2020.
Chen et al. (2020c) Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. 2020c. Simple and Deep Graph Convolutional Networks. In Proceedings of the International Conference on Machine Learning, 2020.
Chien et al. (2021) Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. 2021. Adaptive Universal Generalized PageRank Graph Neural Network. In Proceedings of the International Conference on Learning Representations, 2021.
Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Proceedings of the Advances in Neural Information Processing Systems, 2016.
Dong et al. (2021) Hande Dong, Jiawei Chen, Fuli Feng, Xiangnan He, Shuxian Bi, Zhaolin Ding, and Peng Cui. 2021. On the Equivalence of Decoupled Graph Convolution Network and Label Propagation. In Proceedings of the Web Conference, 2021.
Gilmer et al. (2017) Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. Neural Message Passing for Quantum Chemistry. In Proceedings of the International Conference on Machine Learning, 2017.
He et al. (2022) Dongxiao He, Chundong Liang, Huixin Liu, Mingxiang Wen, Pengfei Jiao, and Zhiyong Feng. 2022. Block Modeling-Guided Graph Convolutional Neural Networks. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Jin et al. (2021b) Di Jin, Zhizhi Yu, Cuiying Huo, Rui Wang, Xiao Wang, Dongxiao He, and Jiawei Han. 2021b. Universal Graph Convolutional Networks. In Proceedings of the Advances in Neural Information Processing Systems, 2021.
Jin et al. (2021a) Wei Jin, Tyler Derr, Yiqi Wang, Yao Ma, Zitao Liu, and Jiliang Tang. 2021a. Node similarity preserving graph convolutional networks. In Proceedings of the ACM International Conference on Web Search and Data Mining, 2021.
Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations, 2017.
Klicpera et al. (2019a) Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. 2019a. Predict then Propagate: Graph Neural Networks meet Personalized PageRank. In Proceedings of the International Conference on Learning Representations, 2019.
Klicpera et al. (2019b) Johannes Klicpera, Stefan Weißenberger, and Stephan Günnemann. 2019b. Diffusion improves graph learning. In Proceedings of the Advances in Neural Information Processing Systems, 2019.
Li et al. (2018) Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper Insights Into Graph Convolutional Networks for Semi-Supervised Learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
Li et al. (2022) Xiang Li, Renyu Zhu, Yao Cheng, Caihua Shan, Siqiang Luo, Dongsheng Li, and Weining Qian. 2022. Finding Global Homophily in Graph Neural Networks When Meeting Heterophily. arXiv preprint arXiv:2205.07308 (2022).
Lim et al. (2021) Derek Lim, Felix Hohne, Xiuyu Li, Sijia Linda Huang, Vaishnavi Gupta, Omkar Bhalerao, and Ser Nam Lim. 2021. Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods. In Proceedings of the Advances in Neural Information Processing Systems, 2021.
Liu et al. (2020) Meng Liu, Hongyang Gao, and Shuiwang Ji. 2020. Towards Deeper Graph Neural Networks. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2020.
Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, 2019.
Pei et al. (2020) Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. 2020. Geom-GCN: Geometric Graph Convolutional Networks. In Proceedings of the International Conference on Learning Representations, 2020.
Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
Velikovi et al. (2018) Petar Velikovi, G. Cucurull, A. Casanova, A. Romero, P Liò, and Y. Bengio. 2018. Graph Attention Networks. In Proceedings of the International Conference on Learning Representations, 2018.
Wang et al. (2021) Tao Wang, Rui Wang, Di Jin, Dongxiao He, and Yuxiao Huang. 2021. Powerful Graph Convolutioal Networks with Adaptive Propagation Mechanism for Homophily and Heterophily. CoRR abs/2112.13562 (2021).
Wang et al. (2020) Xiao Wang, Meiqi Zhu, Deyu Bo, Peng Cui, Chuan Shi, and Jian Pei. 2020. AM-GCN: Adaptive Multi-channel Graph Convolutional Networks. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2020.
Wu et al. (2019) Felix Wu, Amauri H. Souza Jr., Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Q. Weinberger. 2019. Simplifying Graph Convolutional Networks. In Proceedings of the International Conference on Machine Learning, 2019.
Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation Learning on Graphs with Jumping Knowledge Networks. In Proceedings of the International conference on machine learning, 2018.
Zheng et al. (2022) Xin Zheng, Yixin Liu, Shirui Pan, Miao Zhang, Di Jin, and Philip S. Yu. 2022. Graph Neural Networks for Graphs with Heterophily: A Survey. CoRR abs/2202.07082 (2022).
Zhu et al. (2020) Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. 2020. Beyond Homophily in Graph Neural Networks: Current Limitations and Effective Designs. In Proceedings of the Advances in Neural Information Processing Systems, 2020.