
LSGNN: Towards General Graph Neural Network
in Node Classification by Local Similarity

Yuhan Chen1∗, Yihong Luo2,3∗, Jing Tang2,3†, Liang Yang4, Siya Qiu2,3, Chuan Wang5†, Xiaochun Cao6
∗Equal contribution: Yuhan Chen ([email protected]) and Yihong Luo ([email protected]).
†Corresponding authors: Jing Tang ([email protected]) and Chuan Wang ([email protected]).
1School of Computer Science and Engineering, Sun Yat-sen University
2The Hong Kong University of Science and Technology (Guangzhou)
3The Hong Kong University of Science and Technology
4School of Artificial Intelligence, Hebei University of Technology
5Institute of Information Engineering, Chinese Academy of Sciences
6School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University
Abstract

Heterophily has been considered an issue that hurts the performance of Graph Neural Networks (GNNs). To address this issue, some existing work uses a graph-level weighted fusion of the information of multi-hop neighbors to include more nodes with homophily. However, the heterophily might differ among nodes, which calls for considering the local topology. Motivated by this, we propose to use the local similarity (LocalSim) to learn node-level weighted fusion, which can also serve as a plug-and-play module. For better fusion, we propose a novel and efficient Initial Residual Difference Connection (IRDC) to extract more informative multi-hop information. Moreover, we provide theoretical analysis on the effectiveness of LocalSim in representing node homophily on synthetic graphs. Extensive evaluations over real benchmark datasets show that our proposed method, namely Local Similarity Graph Neural Network (LSGNN), can offer comparable or superior state-of-the-art performance on both homophilic and heterophilic graphs. Meanwhile, the plug-and-play module can significantly boost the performance of existing GNNs. Our code is provided at https://github.com/draym28/LSGNN.

Section 1 Introduction

Graph Neural Network (GNN) has received significant interest in recent years due to its strong performance in various real-world applications based on graph-structured data, e.g., node classification Kipf and Welling (2016), graph classification Errica et al. (2020), and link prediction Zhang and Chen (2018). Combining convolutional networks and graph signal processing, numerous Graph Convolutional Network (GCN) architectures Scarselli et al. (2008); Defferrard et al. (2016); Hamilton et al. (2017); Velickovic et al. (2017); Kipf and Welling (2016) have been proposed and show superior performance in the above application domains. Recent work Battaglia et al. (2018) argues that the success of GCN and its variants is built on the homophily assumption: connected nodes tend to have the same class label Hamilton (2020). This assumption provides a proper inductive bias and gives rise to the general message-passing mechanism: aggregating neighbour information to update the ego-node feature, a special form of low-pass filtering Bo et al. (2021).

Figure 1: An illustration of the LSGNN framework. ‘Pre-Computed’ means the part can be pre-computed without training.

However, the homophily assumption does not always hold McPherson et al. (2001); Jiang et al. (2013). In such cases, empirical evidence shows that GCN may even be worse than simple Multi-Layer Perceptrons (MLPs) that use merely node features as input Chien et al. (2021) (the heterophily issue). A potential explanation is that the low-pass filter is believed to hurt the performance of GCN on heterophilic graphs. Recently, the heterophily issue has received the community’s interest, and various methods Pei et al. (2020); Chien et al. (2021); He et al. (2021); Li et al. (2022) have been proposed to address it. Many existing works Chien et al. (2021); He et al. (2021); Li et al. (2022) propose to fuse intermediate representations, i.e., outputs of different layers, for learning on heterophilic graphs. However, they only consider graph-level heterophily. Moreover, intermediate representations extracted by traditional propagation methods Kipf and Welling (2016) may drop some important information, which limits the effectiveness of information fusion among neighbors.

We show that in real-world datasets the heterophily among nodes can be significantly different (Section 3.3), which suggests that only considering graph-level heterophily is insufficient. Although Liu et al. (2020) consider node-level heterophily, they simply map each node’s representation to its node-level weight without taking local topology into consideration, yielding suboptimal results. Motivated by the deficiency of existing techniques, we propose to use local similarity as an indicator of node homophily to conduct better node-level weighted fusion adaptively. Moreover, we empirically and theoretically show its capabilities on synthetic graphs (Section 4). Furthermore, to obtain more informative intermediate representations for better fusion, we propose a novel Initial Residual Difference Connection (IRDC), which can leverage all the input information. Meanwhile, the IRDC propagation preserves the key property of SGC Wu et al. (2019) of removing the nonlinear transformations between propagation steps, which are time- and GPU-consuming in GCNs. This property makes the feature propagation of LSGNN efficient and deep-learning free. Based on the above novel designs, LSGNN offers high performance and high efficiency in evaluations (Section 6).

To summarize, we make the following contributions: 1) We study the local node homophily of real-world datasets and suggest using LocalSim as an indicator of node homophily. Moreover, we empirically and theoretically show the effectiveness of using LocalSim to indicate node homophily on synthetic graphs. 2) We propose LSGNN, which contains a LocalSim-aware multi-hop fusion to guide the learning of node-level weight for intermediate representations and an IRDC to extract more informative intermediate representations for better fusion. 3) LocalSim-based node-level weight is a plug-and-play module and can significantly boost the performance of state-of-the-art models, such as H2GCN and GPRGNN. 4) We conduct extensive experiments to demonstrate the superiority of our method, LSGNN, which can offer comparable or superior performance against 13 other methods over homophilic and heterophilic graphs.

Section 2 Preliminaries

2.1 Notations

We denote an undirected graph without self-loops as $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}=\{v_i\}_{i=1}^{n}$ is the node set and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is the edge set. Let $\mathcal{N}_i$ denote the neighbours of node $v_i$. Denote by $\mathbf{A}\in\mathbb{R}^{n\times n}$ the adjacency matrix, and by $\mathbf{D}$ the diagonal degree matrix such that $\mathbf{D}_{ii}=\sum_{j=1}^{n}\mathbf{A}_{ij}$. Let $\tilde{\mathbf{A}}$ and $\tilde{\mathbf{D}}$ be the corresponding matrices with self-loops, i.e., $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}$ and $\tilde{\mathbf{D}}=\mathbf{D}+\mathbf{I}$, where $\mathbf{I}$ is the identity matrix. Denote by $\mathbf{X}=\{\bm{x}_i\}_{i=1}^{n}\in\mathbb{R}^{n\times d}$ the initial node feature matrix, where $d$ is the number of dimensions, and by $\mathbf{H}^{(k)}=\{\bm{h}_i^{(k)}\}_{i=1}^{n}$ the representation matrix in the $k$-th layer. We use $\mathbf{Y}=\{\bm{y}_i\}_{i=1}^{n}\in\mathbb{R}^{n\times C}$ to denote the ground-truth node label matrix, where $C$ is the number of classes and $\bm{y}_i$ is the one-hot encoding of node $v_i$'s label.

2.2 Simple Graph Convolution (SGC)

SGC Wu et al. (2019) claims that the nonlinearity between propagation steps in GCN (i.e., the ReLU activation function) is not critical, and that the majority of the benefit arises from the propagation itself. Therefore, SGC removes the nonlinear transformation between propagation steps, so that the class prediction $\hat{\mathbf{Y}}$ of a $K$-layer SGC becomes

$\hat{\mathbf{Y}}=\operatorname{softmax}\left(\mathbf{S}^{K}\mathbf{X}\mathbf{W}\right),$ (1)

where $\mathbf{S}$ is a GNN filter, e.g., $\mathbf{S}=\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}$ in the vanilla SGC, and $\mathbf{W}$ is the learned weight matrix. By removing the nonlinearities, this simplified linear model yields orders-of-magnitude speedups over GCN while achieving comparable empirical performance.
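For reference, the linear part of Equation (1) can be pre-computed once; the following is a minimal NumPy/SciPy sketch (the function name and the choice of a SciPy CSR input are our own; the softmax classifier with weight matrix $\mathbf{W}$ is trained on top of the returned features and omitted here).

```python
import numpy as np
import scipy.sparse as sp

def sgc_features(adj: sp.csr_matrix, X: np.ndarray, K: int) -> np.ndarray:
    """Pre-compute S^K X with S = D~^{-1/2} A~ D~^{-1/2} (Eq. 1)."""
    n = adj.shape[0]
    A_tilde = adj + sp.eye(n)                      # add self-loops
    deg = np.asarray(A_tilde.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    S = D_inv_sqrt @ A_tilde @ D_inv_sqrt          # symmetric normalization
    H = X.copy()
    for _ in range(K):                             # K linear propagation steps
        H = S @ H
    return H
```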

2.3 Heterophily Issue

Existing GNN models are usually based on the homophily assumption that connected nodes are more likely to belong to the same class. However, many real-world graphs are heterophilic. For example, the matching couples in dating networks are usually heterosexual. Empirical evidence shows that when the homophily assumption is broken, GCN may be even worse than simple Multi-Layer Perceptrons (MLPs) that use merely node features as input Chien et al. (2021). Therefore, it is essential to develop a general GNN model for both homophilic and heterophilic graphs.

Pei et al. (2020) introduce a simple index to measure node homophily/heterophily. That is, the homophily $\mathcal{H}(v_i)$ of node $v_i$ is the ratio of the number of $v_i$'s neighbours that share the same label as $v_i$ to the total number of $v_i$'s neighbours, i.e.,

$\mathcal{H}(v_i)=\frac{|\{v_j\mid v_j\in\mathcal{N}_i,\ \bm{y}_j=\bm{y}_i\}|}{|\mathcal{N}_i|},$ (2)

and the homophily of a graph is the average over all nodes, i.e.,

$\mathcal{H}(\mathcal{G})=\frac{1}{n}\sum_{v_i\in\mathcal{V}}\mathcal{H}(v_i).$ (3)

Similarly, the heterophily of node $v_i$ is measured by $1-\mathcal{H}(v_i)$, and the heterophily of a graph is $1-\mathcal{H}(\mathcal{G})$.
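Equations (2) and (3) can be computed directly from the adjacency structure; a small sketch, assuming a SciPy CSR adjacency matrix and an integer label vector (the treatment of isolated nodes is our own choice, since the definition leaves it open):

```python
import numpy as np
import scipy.sparse as sp

def node_homophily(adj: sp.csr_matrix, labels: np.ndarray) -> np.ndarray:
    """H(v_i) of Eq. (2): fraction of v_i's neighbours sharing v_i's label."""
    n = adj.shape[0]
    h = np.zeros(n)
    for i in range(n):
        nbrs = adj.indices[adj.indptr[i]:adj.indptr[i + 1]]
        if len(nbrs) > 0:
            h[i] = np.mean(labels[nbrs] == labels[i])  # isolated nodes stay at 0
    return h

def graph_homophily(adj: sp.csr_matrix, labels: np.ndarray) -> float:
    """H(G) of Eq. (3): average node homophily over all nodes."""
    return float(node_homophily(adj, labels).mean())
```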

Section 3 Methodology

3.1 Overview

In this section, we present the Local Similarity Graph Neural Network (LSGNN), which consists of a novel propagation method, named Initial Residual Difference Connection (IRDC), for better fusion (Section 3.2), and a LocalSim-aware multi-hop fusion (Section 3.4), so that LSGNN can cope with both homophilic and heterophilic graphs. We also demonstrate the motivation for using LocalSim as an effective indicator of homophily (Section 3.3). Figure 1 shows the framework of LSGNN.

3.2 Initial Residual Difference Connection (IRDC)

Simply aggregating representation from the previous layer (as applied by most GNNs) might lose important information and consequently incur over-smoothing issues. To obtain more informative intermediate representations, i.e., outputs of different layers, we propose a novel propagation method, namely Initial Residual Difference Connection (IRDC).

General Form.

Inspired by SGC, we remove nonlinearity between layers and formulate IRDC as follows:

$\mathbf{H}^{(1)} = \operatorname{IRDC}^{(1)}(\mathbf{S},\mathbf{X}) = \mathbf{S}\mathbf{X},$ (4)
$\mathbf{H}^{(k)} = \operatorname{IRDC}^{(k)}(\mathbf{S},\mathbf{X}) = \mathbf{S}\Big((1-\gamma)\mathbf{X}-\gamma\sum_{\ell=1}^{k-1}\mathbf{H}^{(\ell)}\Big),$

for $k=2,\dotsc,K$, where $\mathbf{S}$ is a GNN filter and $\gamma\in[0,1]$ is a hyperparameter. The term $\sum_{\ell=1}^{k-1}\mathbf{H}^{(\ell)}$ in Equation (4) can be seen as the total information extracted by the previous $k-1$ layers. In this way, IRDC feeds the information that has never been processed from the original input into the next layer, and thus can fully exploit all the information in the initial node features. A normalization operation is conducted after IRDC to handle the potential value-scale issue. See Appendix B for a comparison between different residual connections.
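A minimal sketch of the IRDC recursion in Equation (4), assuming a sparse filter matrix and dense features; since all steps are linear, the whole routine can be pre-computed before training (the exact form of the normalization is not specified here, so the sketch leaves it out):

```python
import numpy as np
import scipy.sparse as sp

def irdc(S: sp.spmatrix, X: np.ndarray, K: int, gamma: float) -> list:
    """Initial Residual Difference Connection (Eq. 4): returns [H^(1), ..., H^(K)]."""
    outputs = []
    cumulative = np.zeros_like(X)          # running sum of H^(1), ..., H^(k-1)
    for k in range(1, K + 1):
        if k == 1:
            H = S @ X
        else:
            H = S @ ((1.0 - gamma) * X - gamma * cumulative)
        cumulative = cumulative + H
        # A normalization is applied after IRDC in the paper to control value scale;
        # its exact form is not pinned down here, so the raw H^(k) is returned.
        outputs.append(H)
    return outputs
```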

Parameterization in Practice.

Many GCNs use the adjacency matrix with self-loops as a low-pass GNN filter to extract similar information from neighbours. However, the self-loops added in GCNs may not always be helpful Zhu et al. (2020b). Instead, we use enhanced low-pass and high-pass filters Bo et al. (2021) (with $\mathcal{F}_L+\mathcal{F}_H=\mathbf{I}$) by adding a weighted identity matrix $\mathbf{I}$ (i.e., weighted self-loops),

$\mathcal{F}_L = \beta\mathbf{I}+\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}},$ (5)
$\mathcal{F}_H = \mathbf{I}-\mathcal{F}_L = (1-\beta)\mathbf{I}-\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}},$

where $\beta\in[0,1]$ is a hyperparameter. Intuitively, combining low-pass and high-pass filters can learn better representations for both homophilic and heterophilic graphs. In a $K$-layer LSGNN, the outputs $\mathbf{H}_L^{(k)},\mathbf{H}_H^{(k)}\in\mathbb{R}^{n\times d}$ of the $k$-th low-pass and high-pass graph filter layers are

$\mathbf{H}_L^{(k)}=\operatorname{IRDC}^{(k)}(\mathcal{F}_L,\mathbf{X})$ and $\mathbf{H}_H^{(k)}=\operatorname{IRDC}^{(k)}(\mathcal{F}_H,\mathbf{X})$,

where $k=1,2,\dotsc,K$. Next, $\mathbf{H}_L^{(k)}$ and $\mathbf{H}_H^{(k)}$ are transformed into $z$ dimensions by learnable $\mathbf{W}_L^{(k)},\mathbf{W}_H^{(k)}\in\mathbb{R}^{d\times z}$, i.e.,

$\tilde{\mathbf{H}}_L^{(k)}=\sigma\left(\mathbf{H}_L^{(k)}\mathbf{W}_L^{(k)}\right)$ and $\tilde{\mathbf{H}}_H^{(k)}=\sigma\left(\mathbf{H}_H^{(k)}\mathbf{W}_H^{(k)}\right)$,

where $\sigma$ is the ReLU function. To leverage the initial node features, we also obtain a transformation $\tilde{\mathbf{H}}_I$ of $\mathbf{X}$ such that

$\tilde{\mathbf{H}}_I=\sigma\left(\mathbf{X}\mathbf{W}_I\right),$

where $\mathbf{W}_I\in\mathbb{R}^{d\times z}$ is a learned weight matrix. In what follows, $\tilde{\mathbf{H}}_I,\tilde{\mathbf{H}}_L^{(k)},\tilde{\mathbf{H}}_H^{(k)}\in\mathbb{R}^{n\times z}$ ($k=1,2,\dotsc,K$) are delivered to the LocalSim-aware multi-hop fusion (Section 3.4) to generate the final node representations.
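The filter construction in Equation (5) and the per-channel transforms can be sketched as follows (NumPy/SciPy for the fixed filters, PyTorch for the learnable transform; $\mathbf{H}_L^{(k)}$ and $\mathbf{H}_H^{(k)}$ would be obtained by calling the IRDC sketch above with $\mathcal{F}_L$ and $\mathcal{F}_H$; the names are ours):

```python
import numpy as np
import scipy.sparse as sp
import torch
import torch.nn as nn

def enhanced_filters(adj: sp.csr_matrix, beta: float):
    """Eq. (5): F_L = beta*I + D^{-1/2} A D^{-1/2}, F_H = I - F_L."""
    n = adj.shape[0]
    deg = np.asarray(adj.sum(axis=1)).ravel()
    deg[deg == 0] = 1.0                              # guard against isolated nodes
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    A_norm = D_inv_sqrt @ adj @ D_inv_sqrt
    F_L = beta * sp.eye(n) + A_norm
    F_H = sp.eye(n) - F_L
    return F_L, F_H

class ChannelTransform(nn.Module):
    """Per-layer transform mapping H^(k) (n x d) to H~^(k) = ReLU(H^(k) W^(k)) (n x z)."""
    def __init__(self, d: int, z: int):
        super().__init__()
        self.lin = nn.Linear(d, z)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.lin(H))
```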

3.3 LocalSim: An Indicator of Homophily

Figure 2: Positive relation between node homophily and LocalSim on Cornell (left column) and Citeseer (right column) datasets. The upper figures show the histogram of node homophily, while the lower figures show the result for node homophily vs. LocalSim.

Based on the hypothesis that nodes with similar features are likely to belong to the same class, we start out by defining a simple version of Local Similarity (LocalSim) $\phi_i$ of node $v_i$ to indicate its homophily, i.e.,

$\phi_i=\frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i}d_{ij}=\frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i}\operatorname{sim}(\bm{x}_i,\bm{x}_j),$ (6)

where $\operatorname{sim}(\cdot,\cdot)\colon(\mathbb{R}^d,\mathbb{R}^d)\mapsto\mathbb{R}$ is a similarity measure, e.g.,

$\operatorname{sim}(\bm{x}_i,\bm{x}_j)=\begin{cases}\frac{\bm{x}_i^{\top}\bm{x}_j}{\|\bm{x}_i\|\,\|\bm{x}_j\|},&\text{Cosine similarity},\\ -\|\bm{x}_i-\bm{x}_j\|_2,&\text{Euclidean similarity}.\end{cases}$

Figure 2 shows the relation between node homophily and LocalSim (using Cosine similarity) on the Cornell (left column) and Citeseer (right column) datasets. From the upper figures, we observe that the homophily/heterophily differs across nodes. In particular, Cornell is a heterophilic graph, while Citeseer is a homophilic graph. Moreover, the lower figures show a clear positive correlation between node homophily and LocalSim, which indicates that LocalSim can be used to represent node homophily appropriately.
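A direct implementation of the naive LocalSim in Equation (6), assuming a SciPy CSR adjacency matrix and a dense feature matrix (the handling of isolated nodes is our own choice):

```python
import numpy as np
import scipy.sparse as sp

def local_sim(adj: sp.csr_matrix, X: np.ndarray, measure: str = "cosine") -> np.ndarray:
    """Naive LocalSim (Eq. 6): mean feature similarity between each node and its neighbours."""
    n = adj.shape[0]
    phi = np.zeros(n)
    for i in range(n):
        nbrs = adj.indices[adj.indptr[i]:adj.indptr[i + 1]]
        if len(nbrs) == 0:
            continue                                     # isolated node: LocalSim left at 0
        if measure == "cosine":
            xi = X[i] / (np.linalg.norm(X[i]) + 1e-12)
            xj = X[nbrs] / (np.linalg.norm(X[nbrs], axis=1, keepdims=True) + 1e-12)
            sims = xj @ xi
        else:                                            # negative Euclidean distance
            sims = -np.linalg.norm(X[nbrs] - X[i], axis=1)
        phi[i] = sims.mean()
    return phi
```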

3.4 LocalSim-Aware Multi-Hop Fusion

In Section 3.3, we show that LocalSim, leveraging the local topology information, can identify the homophily of nodes. We apply LocalSim to learn the adaptive fusion of intermediate representations for each node.

The naive LocalSim in Equation (6) simply averages the similarity between the ego-node and its neighbours. In what follows, we refine the definition of LocalSim by incorporating nonlinearity, i.e., $d_{ij}^2$. Specifically, the refined LocalSim $\varphi_i$ of node $v_i$ is given by

$\varphi_i=\frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i}\operatorname{MLP_{ls}}([d_{ij},d_{ij}^2]),$ (7)

where $\operatorname{MLP_{ls}}\colon\mathbb{R}^2\mapsto\mathbb{R}$ is a 2-layer perceptron. It is trivial to see that incorporating $d_{ij}^2$ via $\operatorname{MLP_{ls}}$ can encode the variance of the series $\{d_{ij}\}_{v_j\in\mathcal{N}_i}$, since $\operatorname{Var}[d_{ij}]=\mathbb{E}[d_{ij}^2]-\mathbb{E}^2[d_{ij}]$ with respect to $v_j\in\mathcal{N}_i$. As a consequence, $\varphi_i$ can better characterize the real “local similarity” than $\phi_i$. To illustrate, we compare $\varphi_i$ and $\phi_i$ through a simple example. Suppose that node $v_1$ has two neighbours $v_2$ and $v_3$ such that $d_{12}=0$ and $d_{13}=1$, while node $v_4$ has two neighbours $v_5$ and $v_6$ such that $d_{45}=d_{46}=0.5$. Then, $\phi_1=\phi_4=0.5$, which cannot characterize the distinct neighbour distributions. On the other hand, by introducing nonlinearity, we can easily distinguish $\varphi_1$ and $\varphi_4$. Finally, we have the LocalSim vector for all nodes $\bm{\varphi}=[\varphi_1,\varphi_2,\dotsc,\varphi_n]\in\mathbb{R}^n$.
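The toy example can be verified directly; a few lines showing why feeding $[d_{ij}, d_{ij}^2]$ into $\operatorname{MLP_{ls}}$ carries the variance information that the naive average misses:

```python
import numpy as np

# Toy neighbourhoods from the example above: d_{1j} in {0, 1} vs d_{4j} = {0.5, 0.5}.
d1, d4 = np.array([0.0, 1.0]), np.array([0.5, 0.5])
phi1, phi4 = d1.mean(), d4.mean()               # naive LocalSim: both 0.5
var1 = (d1 ** 2).mean() - phi1 ** 2             # 0.25
var4 = (d4 ** 2).mean() - phi4 ** 2             # 0.0
print(phi1, phi4, var1, var4)                   # the second moment separates v_1 from v_4
```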

LocalSim can guide the fusion of intermediate representations, i.e., we use $\bm{\varphi}$ as local topology information when learning the weights for fusion. In addition, we also introduce nonlinearity by using $\bm{\varphi}$ and $\bm{\varphi}^2$ together to generate the weights, i.e.,

$\left[\bm{\alpha}_I,\bm{\alpha}_L,\bm{\alpha}_H\right]=\operatorname{MLP}_{\alpha}([\bm{\varphi},\bm{\varphi}^2]),$ (8)

where $\bm{\alpha}_I,\bm{\alpha}_L,\bm{\alpha}_H\in\mathbb{R}^{n\times K}$ are the weights for the node representations from each channel and $\operatorname{MLP}_{\alpha}\colon\mathbb{R}^2\mapsto\mathbb{R}^{3K}$ is a 2-layer perceptron. Then, the node representations for the $k$-th layer are computed as follows:

$\mathbf{Z}^{(k)}=\bm{\alpha}_I^{(k)}\odot\tilde{\mathbf{H}}_I+\bm{\alpha}_L^{(k)}\odot\tilde{\mathbf{H}}_L^{(k)}+\bm{\alpha}_H^{(k)}\odot\tilde{\mathbf{H}}_H^{(k)},$ (9)

where $\mathbf{Z}^{(k)}\in\mathbb{R}^{n\times z}$ and $\bm{\alpha}_I^{(k)},\bm{\alpha}_L^{(k)},\bm{\alpha}_H^{(k)}\in\mathbb{R}^n$ for $k=1,2,\dotsc,K$, and $\bm{\alpha}_I^{(k)}\odot\tilde{\mathbf{H}}_I$ is the $i$-th entry of $\bm{\alpha}_I^{(k)}$ times the $i$-th row of $\tilde{\mathbf{H}}_I$ for $i=1,2,\dotsc,n$ (and similarly for $\bm{\alpha}_L^{(k)}\odot\tilde{\mathbf{H}}_L^{(k)}$ and $\bm{\alpha}_H^{(k)}\odot\tilde{\mathbf{H}}_H^{(k)}$). Then, we obtain the final representation as $\tilde{\mathbf{H}}_I\|\mathbf{Z}^{(1)}\|\mathbf{Z}^{(2)}\|\dotsb\|\mathbf{Z}^{(K)}$, where $\|$ denotes concatenation. Finally, the class probability prediction matrix $\hat{\mathbf{Y}}\in\mathbb{R}^{n\times C}$ is computed by

$\hat{\mathbf{Y}}=\left[\tilde{\mathbf{H}}_I\|\mathbf{Z}^{(1)}\|\mathbf{Z}^{(2)}\|\dotsb\|\mathbf{Z}^{(K)}\right]\mathbf{W}_{\mathrm{out}},$ (10)

where $\mathbf{W}_{\mathrm{out}}\in\mathbb{R}^{(K+1)z\times C}$ is a learned matrix.
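A PyTorch sketch of the LocalSim-aware fusion in Equations (7)–(10); the input format (an edge list with ego/neighbour rows, per-edge similarities $d_{ij}$, and the pre-transformed channel representations) and the hidden width of the perceptrons are our own assumptions, not a definitive implementation:

```python
import torch
import torch.nn as nn

class LocalSimFusion(nn.Module):
    """Sketch of LocalSim-aware multi-hop fusion (Eqs. 7-10)."""

    def __init__(self, K: int, z: int, num_classes: int, hidden: int = 16):
        super().__init__()
        self.K = K
        self.mlp_ls = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.mlp_alpha = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 3 * K))
        self.w_out = nn.Linear((K + 1) * z, num_classes, bias=False)

    def forward(self, edge_index, d_ij, H_I, H_L, H_H):
        # edge_index: (2, E) LongTensor with rows (ego, neighbour); d_ij: (E,) similarities.
        # H_I: (n, z); H_L, H_H: lists of K tensors of shape (n, z).
        n = H_I.size(0)
        e = self.mlp_ls(torch.stack([d_ij, d_ij ** 2], dim=-1)).squeeze(-1)   # per-edge score
        num = torch.zeros(n, device=H_I.device).index_add_(0, edge_index[0], e)
        deg = torch.zeros(n, device=H_I.device).index_add_(0, edge_index[0], torch.ones_like(e))
        varphi = num / deg.clamp(min=1)                                        # Eq. (7)
        alpha = self.mlp_alpha(torch.stack([varphi, varphi ** 2], dim=-1))     # Eq. (8)
        a_I, a_L, a_H = alpha.chunk(3, dim=-1)                                 # each (n, K)
        Z = [H_I]
        for k in range(self.K):                                                # Eq. (9)
            Z.append(a_I[:, k:k + 1] * H_I + a_L[:, k:k + 1] * H_L[k] + a_H[:, k:k + 1] * H_H[k])
        return self.w_out(torch.cat(Z, dim=-1))                                # Eq. (10)
```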

Section 4 Case Study of Toy Example

In this section, we conduct a case study using a simple synthetic graph to demonstrate the superiority of our LocalSim-based node-level weight.

4.1 Empirical Investigation

Inspired by the Featured Stochastic Block Model (FSBM) Chanpuriya and Musco (2022), we introduce, in the following, the FSBM with a mixture of heterophily, which can generate simple synthetic graphs whose nodes have different heterophily.

Definition 1 (FSBM with Mixture of Heterophily).

A Stochastic Block Model (SBM) graph $\mathcal{G}$ with a mixture of heterophily has $n$ nodes partitioned into $r$ communities $C_1, C_2, \dotsc, C_r$ and consists of $t$ subgraphs $\mathcal{G}_1, \mathcal{G}_2, \dotsc, \mathcal{G}_t$ with no edges between any two subgraphs, where the intra-community and inter-community edge probabilities in subgraph $\mathcal{G}_k$ are $p_k$ and $q_k$, respectively. Let $\bm{c}_1, \bm{c}_2, \dotsc, \bm{c}_r\in\{0,1\}^n$ be indicator vectors for membership in each community, i.e., the $j$-th entry of $\bm{c}_i$ is $1$ if the $j$-th node is in $C_i$ and $0$ otherwise. An FSBM is such a graph model $\mathcal{G}$, plus a feature vector $\bm{x}=\bm{f}+\bm{\eta}\in\mathbb{R}^n$, where $\bm{\eta}\sim\mathcal{N}(0,\sigma^2\mathbf{I})$ is zero-centered, isotropic Gaussian noise and $\bm{f}=\sum_i\mu_i\bm{c}_i$ for some $\mu_1,\mu_2,\dotsc,\mu_r\in\mathbb{R}$, which are the expected feature values of each community.

According to Definition 1 and Equation (2), nodes in the same subgraph have identical homophily in expectation, while nodes from different subgraphs have distinct homophily. We consider FSBMs with $n=1000$, 2 equally-sized communities $C_1$ and $C_2$, 2 equally-sized subgraphs $\mathcal{G}_1$ and $\mathcal{G}_2$, feature means $\mu_1=1$ and $\mu_2=-1$, and noise variance $\sigma=1$. We generate different graphs by varying $\lambda_1=\frac{p_1}{p_1+q_1}$ and $\lambda_2=\frac{p_2}{p_2+q_2}$, where $\lambda_1,\lambda_2\in[0,1]$ and a high $\lambda_\tau$ means high homophily, while fixing the expected degree of all nodes to 10 (which means $\frac{1}{4}(p_\tau+q_\tau)n=10$ for $\tau=1,2$).

We employ a graph-level weight and the LocalSim-based node-level weight to classify the nodes. For the LocalSim-based node-level weight, we use the method proposed in Section 3. For the graph-level weight, we replace $\alpha$ in Equation (8) with a learnable graph-level value. For the similarity measure, we use $\operatorname{sim}(x,y)=-(x-y)^2$, where $x$ and $y$ are scalar values. Finally, considering the simplicity of the synthetic networks, we use the simple LocalSim in Equation (6), only use the adjacency matrix with self-loops $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}$, and set the number of IRDC layers to one.
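A sketch of how such an FSBM instance could be generated (parameter defaults are illustrative; the experiments use $n=1000$, $\mu_1=1$, $\mu_2=-1$, $\sigma=1$, and expected degree 10; function and variable names are our own):

```python
import numpy as np

def fsbm_mixture(n=1000, mu=(1.0, -1.0), sigma=1.0, lambdas=(0.9, 0.1),
                 expected_degree=10, seed=0):
    """FSBM with 2 equal communities and 2 equal subgraphs (Definition 1).

    lambdas[t] = p_t / (p_t + q_t) controls the homophily of subgraph t.
    """
    rng = np.random.default_rng(seed)
    half, quarter = n // 2, n // 4
    community = np.tile(np.repeat([0, 1], quarter), 2)   # within each subgraph: half C1, half C2
    subgraph = np.repeat([0, 1], half)
    A = np.zeros((n, n))
    for t, lam in enumerate(lambdas):
        # Expected degree with equal-sized communities implies (p_t + q_t) * n / 4 = expected_degree.
        p_plus_q = 4.0 * expected_degree / n
        p, q = lam * p_plus_q, (1.0 - lam) * p_plus_q
        idx = np.where(subgraph == t)[0]
        for pos, i in enumerate(idx):
            for j in idx[pos + 1:]:
                prob = p if community[i] == community[j] else q
                if rng.random() < prob:
                    A[i, j] = A[j, i] = 1.0
    x = np.array([mu[c] for c in community]) + rng.normal(0.0, sigma, size=n)  # scalar features
    return A, x, community, subgraph
```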

Figure 3: Accuracy (higher is better) on the synthetic graphs using graph-level weight and LocalSim-based node-level weight. Node homophily is measured by $\lambda_i$. ‘Raw’ shows the result of directly using the raw feature.

As shown in Figure 3, our LocalSim-based node-level weight consistently outperforms the graph-level weight. In particular, when node homophily differs significantly among subgraphs, the graph-level weight performs poorly, while our approach still performs very well, since it can recognize node homophily and adaptively provide suitable filters.

4.2 Theoretical Analysis

In addition to the empirical investigation, we theoretically confirm the capabilities of the LocalSim node-level weight. For simplicity, we assume that, for every node, the numbers of inter-community and intra-community edges equal their expectations and that there are no self-loops. Suppose that the optimal graph-level weight can provide a global filter that achieves the best average accuracy among all subgraphs (not the best in every subgraph), while the optimal node-level weight can provide the optimal filter for each subgraph, which achieves the best accuracy in every subgraph. However, to achieve the optimal node-level weight, we need to estimate the node homophily $\lambda$, and LocalSim is used for this aim. Next, we investigate the effectiveness of using LocalSim to indicate $\lambda$.

Theorem 1 (Effectiveness of LocalSim Indicating $\lambda$).

Consider FSBMs with 2 equally-sized communities and 2 equally-sized subgraphs, and assume that, for every node, the numbers of inter-community and intra-community edges equal their expectations and that there are no self-loops. Consider LocalSim defined as $\phi(v_i)=-\frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i}(x_i-x_j)^2$. For any node $v_i$ that belongs to $\mathcal{G}_\tau$ for $\tau=1,2$, we have $\mathbb{E}[\phi(v_i)]=-2\sigma^2-(1-\lambda_\tau)(\mu_1-\mu_2)^2$, where $\lambda_\tau=\frac{p_\tau}{p_\tau+q_\tau}$. Moreover, for any two nodes $v_i\in\mathcal{G}_1$ and $v_j\in\mathcal{G}_2$, the expectation of the L1-norm distance between $\phi(v_i)$ and $\phi(v_j)$ is no less than $|\lambda_1-\lambda_2|(\mu_1-\mu_2)^2$.

We defer the proof to Appendix A. Theorem 1 states that $\mathbb{E}[\phi(v_i)]$ is linearly correlated with $\lambda_\tau$ when $v_i\in\mathcal{G}_\tau$, which suggests that LocalSim can represent the homophily $\lambda_\tau$ of node $v_i$ unbiasedly. We also see that the expectation of the L1-norm distance positively correlates with $|\lambda_1-\lambda_2|$, which indicates that our estimation is likely to perform better as $|\lambda_1-\lambda_2|$ grows. When the gap between $\lambda_1$ and $\lambda_2$ shrinks, the two subgraphs become similar to each other, and so does the optimal filter for each community. These results confirm the effectiveness of LocalSim in indicating node homophily, which yields high performance in our empirical investigation.
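Under the stated assumptions, Theorem 1 can be checked numerically with the FSBM sketch from Section 4.1; the per-subgraph mean of $\phi(v_i)$ should roughly match the stated expectation (the specific $\lambda$ values below are illustrative, and the match is only approximate since a sampled graph does not hit the expected edge counts exactly):

```python
import numpy as np

# Compare the empirical mean LocalSim per subgraph with -2*sigma^2 - (1 - lambda_tau)*(mu_1 - mu_2)^2,
# reusing the fsbm_mixture sketch from Section 4.1.
lambdas, sigma, mu1, mu2 = (0.9, 0.3), 1.0, 1.0, -1.0
A, x, community, subgraph = fsbm_mixture(lambdas=lambdas, sigma=sigma, mu=(mu1, mu2))
for t, lam in enumerate(lambdas):
    vals = []
    for i in np.where(subgraph == t)[0]:
        nbrs = np.where(A[i] > 0)[0]
        if len(nbrs):
            vals.append(-np.mean((x[i] - x[nbrs]) ** 2))
    print(f"subgraph {t}: empirical {np.mean(vals):.2f} "
          f"vs. theory {-2 * sigma**2 - (1 - lam) * (mu1 - mu2)**2:.2f}")
```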

Section 5 Additional Related Work

In this section, we discuss some related work that attempts to address the heterophily issue. Mixhop Abu-El-Haija et al. (2019) proposes to extract features from multi-hop neighborhoods to obtain more information. CPGNN Zhu et al. (2020a) modifies feature propagation in GCN based on node classes to accommodate heterophily. Geom-GCN Pei et al. (2020) performs aggregation over a graph structure defined by geometric relationships to address heterophily. FAGCN Bo et al. (2021) uses an attention mechanism to learn edge-level aggregation weights for low-pass and high-pass filters to adaptively address the heterophily issue, whereas we use LocalSim to parameterize node-level weights, which achieves lower computational complexity and better performance. H2GCN Zhu et al. (2020b) and GPRGNN Chien et al. (2021) propose fusing intermediate representations at the graph level based on their analysis of graphs. ACM-GCN Luan et al. (2022) divides feature propagation into three channels (low-pass, high-pass, and identity) and adaptively mixes the three channels during propagation, but does not fuse different intermediate representations. Compared with H2GCN, GPRGNN, and ACM-GCN, our work only needs to propagate once, which means the propagation can be pre-computed, while these methods are trained via back-propagation through repeated feature propagation, which incurs huge computation costs. ASGC Chanpuriya and Musco (2022) directly uses least squares to filter the features obtained by SGC Wu et al. (2019) for classification, while our work fuses multiple pieces of intermediate information via node-level weights. These designs yield better performance based on pre-propagated features. Different from all these methods, our work uses LocalSim to parameterize node-level weights for efficient and better adaptive fusion, and performs IRDC as an efficient propagation method for more informative representations, thus boosting the performance while keeping high efficiency. Moreover, we show that our LocalSim-based node-level weight can be used to improve the performance of H2GCN and GPRGNN, which confirms the superiority of our method.

Method Cora Citeseer Pubmed Chameleon Squirrel Actor Cornell Texas Wisconsin Avg
MLP² 74.75 72.41 86.65 46.36 29.68 35.76 81.08 81.89 85.29 65.99
GCN² 87.28 76.68 87.38 59.82 36.89 30.26 57.03 59.46 59.80 61.62
GAT² 82.68 75.46 84.68 54.69 30.62 26.28 58.92 58.38 55.29 58.56
MixHop² 87.61 76.26 85.31 60.50 43.80 32.22 73.51 77.84 75.88 68.10
GCNII³ 88.37 77.33 90.15 63.86 38.47 37.44 77.86 77.57 80.39 70.16
Geom-GCN¹ 85.27 77.99 90.05 60.90 38.14 31.63 60.81 67.57 64.12 64.05
H2GCN¹ 87.81 77.07 89.59 59.39 37.90 35.86 82.16 84.86 86.67 71.26
WRGNN¹ 88.20 76.81 88.52 65.24 48.85 36.53 81.62 83.62 86.98 72.93
GPRGNN³ 87.95 77.13 87.54 46.58 31.61 34.63 80.27 78.38 82.94 67.45
LINKX³ 84.64 73.19 87.86 68.42 61.81 36.10 77.84 74.60 75.49 71.11
ACM-GCN³ 87.91 77.32 90.00 66.93 54.40 36.28 85.14 87.84 88.43 74.92
GGCN³ 87.95 77.14 89.15 71.14 55.17 37.54 85.68 84.86 86.86 75.05
GloGNN¹ 88.33 77.41 89.62 71.21 57.88 37.70 85.95 84.32 88.04 75.61
LSGNN 88.49 76.71 90.23 79.04 72.81 36.18 88.92 90.27 90.20 79.21
Table 1: Results on real-world benchmark datasets: mean accuracy (%). Boldface and underlined numbers mark the top-1 and top-2 results, respectively. ¹ denotes results taken from the respective papers, ² from the H2GCN paper, and ³ from the GloGNN paper.

Section 6 Experiment

6.1 Datasets and Setups

We evaluate LSGNN on nine small real-world benchmark datasets, including both homophilic and heterophilic graphs. For homophilic graphs, we adopt three widely used citation networks: Cora, Citeseer, and Pubmed Sen et al. (2008); Getoor (2012). For heterophilic graphs, Chameleon and Squirrel are page-page Wikipedia networks Rozemberczki et al. (2021), Actor is an actor network Pei et al. (2020), and Cornell, Texas, and Wisconsin are web-page networks Pei et al. (2020). Statistics of these datasets can be found in Appendix D.

For all datasets, we use the feature matrix, class labels, and 10 random splits (48%/32%/20%) provided by Pei et al. (2020). Meanwhile, we use Optuna to tune the hyperparameters with 200 trials in the experiments. For LSGNN, we set the number of layers to $K=5$.

There are three types of baseline methods, including non-graph models (MLP), traditional GNNs (GCN, GAT Velickovic et al. (2017), etc.), and GNNs tolerating heterophily (Geom-GCN, GPRGNN, etc.).

6.2 Evaluations on Real Benchmark Datasets

As shown in Table 1, LSGNN outperforms most of the baseline methods. On homophilic graphs, LSGNN achieves the SOTA performance on the Cora and Pubmed datasets and is lower than the best result by only 1.28% on the Citeseer dataset. On heterophilic graphs except Actor, LSGNN significantly outperforms all the other baseline models by 1.77%–11.00%. Specifically, the accuracy of LSGNN is 7.83% and 2.97% higher than GloGNN, the existing SOTA method on heterophilic graphs, on Chameleon and Cornell, is 11.00% higher than LINKX Lim et al. (2021) on Squirrel, and is 2.43% and 1.77% higher than ACM-GCN on Texas and Wisconsin. In particular, LSGNN achieves the best average accuracy, higher than the second-best by 3.60%. This demonstrates that LSGNN can achieve comparable or superior SOTA performance on both homophilic and heterophilic graphs.

Dataset $\mathcal{H}(\mathcal{G})$ GCN SGC GCNII LINKX GloGNN LSGNN
arXiv-year 0.22 46.02 32.83 47.21 56.00 54.79 56.42
ogbn-arXiv 0.66 71.74 69.39 72.74 54.45 49.22 72.64
Avg NA 58.88 51.11 59.98 55.23 52.01 64.53
Table 2: Average accuracy (%) over 5 runs on large datasets. The marked field denotes our implementation.

6.3 Evaluations on Large Benchmark Datasets

We also conduct experiments on two large datasets: ogbn-arXiv (homophilic) and arXiv-year (heterophilic), both with 170k nodes and 1M edges but different labels, following settings in Hu et al. (2020) and Lim et al. (2021), see Appendix D for setting details. Table 2 shows that our method has competitive performance on both homophilic and heterophilic larger graphs.

Figure 4: (a) Comparison of the average Training Cost Time (s) on nine real-world datasets (left). (b) Average results with various model depths on nine real-world datasets (right).

6.4 Training Cost Comparison

LSGNN only needs to propagate once during training, and the networks for node-level weight are small. Thus the efficiency of LSGNN can be close to SGC’s. As shown in Figure 4(a), on the average training time over nine datasets, our proposed method is only slower than SGC, a simple and effective model, and is nearly 2× faster than GCN and GAT. This confirms the high training efficiency of LSGNN. See Appendix C for more setting details and detailed comparisons of training costs on each dataset.

6.5 Alleviate Over-Smoothing Issue

To validate whether LSGNN can alleviate the over-smoothing issue, we compare the performance of vanilla GCN and LSGNN under different numbers of propagation layers (Figure 4(b)); see Appendix E for full results. It can be seen that GCN achieves the best performance at 2 layers, and its performance decreases rapidly as the number of layers increases. In contrast, the results of LSGNN remain stable and high. The reasons are two-fold: (i) our IRDC can extract better informative representations, and (ii) our LocalSim-based fusion can adaptively retain distinguishable representations and drop indistinguishable ones. Through these two designs, LSGNN performs well as the number of layers increases, which indicates the capability of LSGNN to prevent the over-smoothing issue.

6.6 LocalSim-based Node-level Weight as a Plug-and-Play Module

Cora Citeseer Pubmed Chameleon Squirrel Actor Cornell Texas Wisconsin Avg
H2GCN 87.81 77.07 89.59 59.39 37.90 35.86 82.16 84.86 86.67 71.26
w/ LocalSim 88.42 76.77 89.91 64.91 41.73 36.29 85.68 85.68 88.82 73.13
GPRGNN 87.95 77.13 87.54 46.58 31.61 34.63 80.27 78.38 82.94 67.45
w/ LocalSim 88.83 76.60 90.26 65.39 51.96 35.59 85.14 87.84 87.65 74.36
DAGNN 82.63 75.47 88.50 58.03 39.59 30.20 61.89 58.92 55.10 61.15
w/ LocalSim 84.12 75.29 89.97 61.10 44.91 34.13 79.73 77.03 85.29 70.17
Table 3: LocalSim-based node-level weight as a plug-and-play module added to other models. w/ LocalSim denotes that the LocalSim-based node-level weight is used, replacing the graph-level weight in H2GCN and GPRGNN, and replacing the random node-level weight (learned without any local topology information) in DAGNN. The best results are in boldface.
Components (added to baseline) Cora Citeseer Pubmed Chameleon Squirrel Actor Cornell Texas Wisconsin Avg
Baseline 86.33 74.53 89.99 67.28 47.73 35.27 86.47 88.92 88.04 73.84
+ Rand 87.93 75.32 90.08 67.72 48.99 35.65 78.38 88.51 88.43 73.45
+ LocalSim 87.62 76.60 90.20 69.96 49.95 35.80 87.57 88.92 89.80 75.16
+ LocalSim, w-self-loop 87.65 76.89 89.98 75.95 65.50 35.80 87.84 88.11 89.80 77.50
LSGNN (+ LocalSim, w-self-loop, IRDC) 88.49 76.71 90.23 79.04 72.81 36.18 88.92 90.27 90.20 79.21
Table 4: Ablation study on 9 real-world benchmark datasets. Each row marks the components applied to the baseline model. Boldface and underlined mark the top-1 and top-2 results. ‘Rand’ denotes random node-level weight (i.e., learned without any local topology information). ‘LocalSim’ means that the added node-level weight is learned under the guidance of LocalSim. ‘w-self-loop’ denotes weighted self-loops. ‘IRDC’ stands for the propagation via IRDC.

LocalSim-based node-level weight (Section 3.4) can be regarded as a plug-and-play module, which can be added to existing GNNs that also fuse intermediate representations without involving extra hyperparameters, such as H2GCN, GPRGNN, and DAGNN.

H2GCN and GPRGNN use learned graph-level weight to fuse the intermediate representations, which is only replaced by LocalSim-based node-level weight here. Note that we only add LocalSim-based node-level weight into H2GCN and GPRGNN, but not the enhanced filters or the high-pass filters. As shown in Table 3, the performance of both H2GCN and GPRGNN increases significantly, indicating that taking each node’s local topology into consideration when fusing intermediate representations is more reasonable and effective than merely considering graph-level structure.

DAGNN also fuses intermediate representations by node-level weight, which, however, is simply mapped from each node’s representation and is learned without involving any local topology information, leading to suboptimal results. We use the LocalSim-based node-level weight in DAGNN similarly. The dramatic improvement in Table 3 suggests that using LocalSim as local topology information ensures that the weight is learned more effectively. Although DAGNN is not designed for heterophilic graphs, it can also perform well on heterophilic graphs after using the LocalSim-based node-level weight, and is even superior to H2GCN on Squirrel.

6.7 Ablation Study

In what follows, we investigate the effectiveness of LocalSim-based node-level weight, weighted self-loops, and IRDC. Specifically, in the baseline model, we use graph-level weight learned without any local topology information to fuse the intermediate representations, add self-loops when conducting graph filtering, and apply the propagation method used in GCNs. Then, the above components are added one by one to the baseline model for performance comparison. As shown in Table 4, all the components are effective and can bring great benefits to the model in terms of performance improvement.

Note that learning node-level weight without any local topology information (‘Rand’) might even bring a negative effect on the model (lower than the baseline by 0.39% on average). Once LocalSim is used as local topology information, the model can easily learn an optimal node-level weight (75.16% on average, higher than the baseline by 1.32%), which indicates the superiority of LocalSim-aware multi-hop fusion.

When replacing the fixed self-loops with weighted self-loops, the performance increases significantly (by 2.34% to 77.50% on average), especially on heterophilic graphs, such as Chameleon and Squirrel. This is because self-loops might not always be helpful on heterophilic graphs.

IRDC also brings a significant improvement (by an increase of 1.71% to 79.21% on average). This suggests that such a propagation method can extract more informative representation for better performance.

Section 7 Conclusion

In this paper, LSGNN is proposed for better performance on both homophilic and heterophilic graphs. Many GNNs use graph-level weight to fuse intermediate representations, which does not fully consider the local structure of different nodes. Some GNNs use node-level weight learned without involving topology information, yielding suboptimal results. Our empirical study and theoretical analysis on synthetic graphs demonstrate the importance of node-level weight that considers the local topology information of nodes. The proposed LocalSim-aware multi-hop fusion uses local similarity as guidance to generate more appropriate node-level weights for intermediate representation fusion, and it can also be used as a plug-and-play module to improve the performance of existing GNNs. For better fusion, IRDC is proposed to extract more informative intermediate representations, further boosting performance. Evaluations over real-world benchmark datasets show the superiority of LSGNN in handling both homophilic and heterophilic graphs and the effectiveness of all the proposed components.

ACKNOWLEDGEMENTS

This work is partially supported by the National Key R&D Program of China under Grant No. 2022YFB2703303, by the National Natural Science Foundation of China (NSFC) under Grant No. U22B2060, Grant No. 61972442, Grant No. 62102413, and Grant No. U2001202, by Guangzhou Municipal Science and Technology Bureau under Grant No. 2023A03J0667, by HKUST(GZ) under a Startup Grant, by the S&T Program of Hebei under Grant No. 20350802D, by the Natural Science Foundation of Hebei Province of China under Grant No. F2020202040, and by the Natural Science Foundation of Tianjin of China under Grant No. 20JCYBJC00650.

References

  • Abu-El-Haija et al. [2019] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In international conference on machine learning, pages 21–29. PMLR, 2019.
  • Battaglia et al. [2018] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • Bo et al. [2021] Deyu Bo, Xiao Wang, Chuan Shi, and Huawei Shen. Beyond low-frequency information in graph convolutional networks. arXiv preprint arXiv:2101.00797, 2021.
  • Chanpuriya and Musco [2022] Sudhanshu Chanpuriya and Cameron N Musco. Simplified graph convolution with heterophily. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • Chen et al. [2020] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In International Conference on Machine Learning, pages 1725–1735. PMLR, 2020.
  • Chien et al. [2021] Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations, 2021.
  • Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. arXiv, abs/1606.09375, 2016.
  • Errica et al. [2020] Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neural networks for graph classification. In ICLR. OpenReview.net, 2020.
  • Gasteiger et al. [2018] Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
  • Getoor [2012] Lise Getoor. Query-driven active surveying for collective classification. 2012.
  • Hamilton et al. [2017] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. arXiv, abs/1706.02216, 2017.
  • Hamilton [2020] William L Hamilton. Graph representation learning. Synthesis Lectures on Artifical Intelligence and Machine Learning, 14(3):1–159, 2020.
  • He et al. [2021] Mingguo He, Zhewei Wei, Hongteng Xu, et al. Bernnet: Learning arbitrary graph spectral filters via bernstein approximation. Advances in Neural Information Processing Systems, 34, 2021.
  • Hu et al. [2020] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.
  • Jiang et al. [2013] Yuexin Jiang, Daniel I Bolnick, and Mark Kirkpatrick. Assortative mating in animals. The American Naturalist, 181(6):E125–E138, 2013.
  • Kipf and Welling [2016] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv, abs/1609.02907, 2016.
  • Li et al. [2022] Xiang Li, Renyu Zhu, Yao Cheng, Caihua Shan, Siqiang Luo, Dongsheng Li, and Weining Qian. Finding global homophily in graph neural networks when meeting heterophily. arXiv preprint arXiv:2205.07308, 2022.
  • Lim et al. [2021] Derek Lim, Felix Hohne, Xiuyu Li, Sijia Linda Huang, Vaishnavi Gupta, Omkar Bhalerao, and Ser Nam Lim. Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods. Advances in Neural Information Processing Systems, 34:20887–20902, 2021.
  • Liu et al. [2020] Meng Liu, Hongyang Gao, and Shuiwang Ji. Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 338–348, 2020.
  • Luan et al. [2022] Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, and Doina Precup. Revisiting heterophily for graph neural networks. arXiv preprint arXiv:2210.07606, 2022.
  • McPherson et al. [2001] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual review of sociology, 27(1):415–444, 2001.
  • Page et al. [1999] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford infolab, 1999.
  • Pei et al. [2020] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-gcn: Geometric graph convolutional networks. arXiv preprint arXiv:2002.05287, 2020.
  • Rozemberczki et al. [2021] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-Scale Attributed Node Embedding. Journal of Complex Networks, 9(2), 2021.
  • Scarselli et al. [2008] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008.
  • Sen et al. [2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. Ai Magazine, 2008.
  • Velickovic et al. [2017] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv, abs/1710.10903, 2017.
  • Wu et al. [2019] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr, Christopher Fifty, Tao Yu, and Kilian Q Weinberger. Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153, 2019.
  • Yang et al. [2022] Liang Yang, Weihang Peng, Wenmiao Zhou, Bingxin Niu, Junhua Gu, Chuan Wang, Yuanfang Guo, Dongxiao He, and Xiaochun Cao. Difference residual graph neural networks. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3356–3364, 2022.
  • Zhang and Chen [2018] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In NeurIPS, pages 5171–5181, 2018.
  • Zhu et al. [2020a] Jiong Zhu, Ryan A Rossi, Anup Rao, Tung Mai, Nedim Lipka, Nesreen K Ahmed, and Danai Koutra. Graph neural networks with heterophily. arXiv preprint arXiv:2009.13566, 2020.
  • Zhu et al. [2020b] Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. Beyond homophily in graph neural networks: Current limitations and effective designs. Advances in Neural Information Processing Systems, 33, 2020.

Appendix A Proof of Theorem 1

For node $v_i\in\mathcal{G}_\tau$, without loss of generality, suppose that $v_i\in C_1$. (The analysis is analogous when $v_i\in C_2$.) When, for every node, the numbers of inter-community and intra-community edges equal their expectations and there are no self-loops, we have

$\mathbb{E}[\phi(v_i)] = -\frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i}\mathbb{E}[(x_i-x_j)^2] = -\frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i}\Big(\mathbb{E}[x_i^2]+\mathbb{E}[x_j^2]-2\mathbb{E}[x_i x_j]\Big).$

By definition, we have

$\mathbb{E}[x_i^2]=\operatorname{Var}[x_i]+\mathbb{E}^2[x_i]=\sigma^2+\mu_1^2.$

Similarly,

$\mathbb{E}[x_j^2]=\operatorname{Var}[x_j]+\mathbb{E}^2[x_j]=\begin{cases}\sigma^2+\mu_1^2,&\text{if }v_j\in C_1,\\ \sigma^2+\mu_2^2,&\text{if }v_j\in C_2.\end{cases}$

Moreover, since $x_i$ and $x_j$ are independent random variables, we have

$\mathbb{E}[x_i x_j]=\mathbb{E}[x_i]\,\mathbb{E}[x_j]=\begin{cases}\mu_1^2,&\text{if }v_j\in C_1,\\ \mu_1\mu_2,&\text{if }v_j\in C_2.\end{cases}$

Putting it together yields

$\mathbb{E}[\phi(v_i)] = -\frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i\cap C_1}\Big(\sigma^2+\mu_1^2+\sigma^2+\mu_1^2-2\mu_1^2\Big) - \frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i\cap C_2}\Big(\sigma^2+\mu_1^2+\sigma^2+\mu_2^2-2\mu_1\mu_2\Big)$
$= -\frac{p_\tau\cdot 2\sigma^2}{p_\tau+q_\tau}-\frac{q_\tau\cdot\big(2\sigma^2+(\mu_1-\mu_2)^2\big)}{p_\tau+q_\tau}$
$= -2\sigma^2-(1-\lambda_\tau)(\mu_1-\mu_2)^2.$

For the second part, where $v_i\in\mathcal{G}_1$ and $v_j\in\mathcal{G}_2$, using the above result, we have

$\mathbb{E}[\phi(v_i)] = -2\sigma^2-(1-\lambda_1)(\mu_1-\mu_2)^2,$
$\mathbb{E}[\phi(v_j)] = -2\sigma^2-(1-\lambda_2)(\mu_1-\mu_2)^2.$

Therefore,

$\mathbb{E}[|\phi(v_i)-\phi(v_j)|]\geq|\mathbb{E}[\phi(v_i)-\phi(v_j)]|=|\lambda_1-\lambda_2|(\mu_1-\mu_2)^2.$

This completes the proof.

Appendix B Comparison Between IRDC and Other Residual Connections

Table 5 shows the comparison between IRDC and other residual connections. Note that coefficients are omitted, $\mathbf{S}$ is a GNN filter, $\mathbf{X}$ is the original node feature matrix, $\mathbf{H}^{(k)}$ is the intermediate node feature with only propagation, and $\mathbf{Z}^{(k)}$ is the intermediate node feature with both propagation and nonlinear transformation.

GCN Kipf and Welling [2016] and SGC Wu et al. [2019] use the output of the previous layer as the input to the next layer, which might result in the over-smoothing issue. To overcome this issue, many residual connection methods have been proposed.

GCNII Chen et al. [2020] constructs a connection to the initial representation, which ensures that the final representation retains at least a fraction of the input layer even if many layers are stacked. APPNP Gasteiger et al. [2018] applies a similar initial residual connection based on Personalized PageRank Page et al. [1999].

DRC Yang et al. [2022] proposes a novel difference residual connection, enabling each layer to process the information that has not been properly handled in the previous layer.

Differently, our proposed IRDC aims at enabling each layer to process the total information that has not been handled in all the previous layers, thus can leverage all the input information and extract more informative representations.

Another key difference between IRDC and other residual connections is that IRDC is pre-computed using only the original node features, without any trainable parameters or gradient back-propagation, which brings high efficiency, whereas other residual connections are based on the transformed node representations.

Table 5: Comparison between IRDC and other residual connections
Residual Connections Formula
Without Residual Connection (GCN): $\mathbf{Z}^{(k)}=\sigma\left(\mathbf{S}\mathbf{Z}^{(k-1)}\mathbf{W}\right)$
Without Residual Connection (SGC): $\mathbf{Z}^{(k)}=\mathbf{S}\mathbf{Z}^{(k-1)}$
Initial Residual Connection (GCNII): $\mathbf{Z}^{(k)}=\sigma\left(\left(\mathbf{Z}^{(0)}+\mathbf{S}\mathbf{Z}^{(k-1)}\right)\mathbf{W}\right)$
Initial Residual Connection (APPNP): $\mathbf{Z}^{(k)}=\mathbf{Z}^{(0)}+\mathbf{S}\mathbf{Z}^{(k-1)}$
Difference Residual Connection (DRC): $\mathbf{Z}^{(k)}=\mathbf{S}\left(\mathbf{Z}^{(k-2)}-\mathbf{Z}^{(k-1)}\right)$
Initial Residual Difference Connection (IRDC, ours): $\mathbf{H}^{(k)}=\mathbf{S}\left(\mathbf{X}-\sum_{\ell=1}^{k-1}\mathbf{H}^{(\ell)}\right)$

Appendix C Comparison on Training Cost Time

Experiment setting.

For GCN, SGC, and GAT, we use five layers, the same as our proposed method. We run the experiments on a computer with an RTX-3080 GPU. We train all models for 200 epochs on all datasets, repeat each experiment 10 times, and report the average time.

Experiment results.

The detailed training cost time on each dataset for each model can be found in Table 7. It can be seen that our proposed method is consistently nearly 2× faster than GCN and GAT, and is close to SGC.

Appendix D Additional Experiment Details

D.1 Statistics Of Benchmark Datasets

The statistics of nine real-world datasets used in the experiment can be seen in Table 8.

D.2 Statistics Of Large Benchmark Datasets

The statistics of two large datasets used in the experiment can be seen in Table 9.

D.3 Experiment Setting Details of Large Benchmark Datasets

The selected large datasets, ogbn-arXiv and arXiv-year, have the same nodes and edges, which represent papers and citations, respectively. In ogbn-arXiv, node labels represent the papers’ academic topics, which tends to be a homophilic setting, while in arXiv-year, the node labels represent the publishing year, thus exhibiting more heterophilic label patterns.

For ogbn-arXiv, we use the official split Hu et al. [2020], and for arXiv-year, we use the 5 random splits (50%/25%/25%) provided by Lim et al. [2021]. Note that, as observed in SOTA models on arXiv-year (e.g., LINKX Lim et al. [2021] and GloGNN Li et al. [2022]), incorporating the embedding of the adjacency matrix is critical for achieving high performance. Therefore, we also include this embedding in our approach to enhance performance; details can be found in our code.

Appendix E Detailed Results

Full results on alleviating the over-smoothing issue in Section 6.5.

See Figure 5 for the full results of the comparison between vanilla GCN and LSGNN under different numbers of propagation layers. It can be seen that the performance of GCN decreases rapidly as the number of layers increases on most datasets, except Cornell and Squirrel. But we note that Cornell is a simple dataset on which even MLP can achieve 80%+ accuracy, and Squirrel only has five classes, while GCN only achieves 25%–30% accuracy, which is close to the accuracy of random classification if the classes are balanced (20%). In contrast, our method (i.e., LSGNN) achieves high performance on all datasets for all numbers of layers, which indicates the capability of LSGNN to prevent the over-smoothing issue. Finally, for fairness, we use Optuna to tune the hyper-parameters with 200 trials for both LSGNN and GCN.

Hyper-parameters Details.

We use Optuna to tune the hyper-parameters with 200 trials in all experiments, and the number of IRDC layers is 5 in all experiments except the over-smoothing experiments. We only set a rough search space for each hyper-parameter across all datasets, as shown in Table 6 (a sketch of the tuning loop is given after the table). For the searched hyper-parameters and all other training details, we refer readers to our code, which will be released after publication.

Table 6: Hyper-parameters Search Space
Hyper-parameter Search Space
learning rate [1e-3, 1e-1]
weight decay [1e-6, 1e-1]
dropout {0.1, 0.5, 0.6, 0.7, 0.8, 0.9}
$\beta$ {0.1, 0.3, 0.5, 0.7, 0.9, 1.0}
$\gamma$ {0.1, 0.3, 0.5, 0.7, 0.9, 1.0}
similarity measure {cosine, euclidean}
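As referenced above, the tuning loop could look as follows. This is a hedged sketch: `train_and_eval` is a hypothetical helper standing in for the actual training/validation routine, and the log-scale sampling of the learning rate and weight decay is our own assumption; only the search ranges come from Table 6.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Search space mirroring Table 6; `train_and_eval` is a placeholder (not shown here).
    params = {
        "lr": trial.suggest_float("lr", 1e-3, 1e-1, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-1, log=True),
        "dropout": trial.suggest_categorical("dropout", [0.1, 0.5, 0.6, 0.7, 0.8, 0.9]),
        "beta": trial.suggest_categorical("beta", [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]),
        "gamma": trial.suggest_categorical("gamma", [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]),
        "similarity": trial.suggest_categorical("similarity", ["cosine", "euclidean"]),
    }
    return train_and_eval(params)            # hypothetical helper returning validation accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200)       # 200 trials, as used in the experiments
```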
Table 7: Time consumption of several GNN models (seconds per 200 epochs). These are the times for running 200 epochs with each model; each experiment is repeated 10 times.
Cora Citeseer Pubmed Chameleon Squirrel Actor Cornell Texas Wisconsin Avg
GCN 3.85 3.52 3.69 3.72 3.31 3.82 3.59 3.69 3.83 3.67
SGC 0.91 0.80 1.00 0.79 0.81 0.85 0.78 0.80 0.80 0.84
GAT 4.50 4.42 5.06 4.15 6.36 4.37 4.42 4.31 4.27 4.65
LSGNN 1.76 2.07 2.16 1.56 2.10 1.92 1.47 1.49 1.47 1.78
Table 8: Benchmark datasets statistics for node classification
Cora Citeseer Pubmed Chameleon Squirrel Actor Cornell Texas Wisconsin
# Nodes 2708 3327 19717 2277 5201 7600 183 183 251
# Edges 5278 4552 44324 18050 108536 15009 149 162 257
# Classes 7 6 3 5 5 5 5 5 5
# Features 1433 3703 500 2325 2089 932 1703 1703 1703
$\mathcal{H}(\mathcal{G})$ 0.81 0.74 0.80 0.24 0.22 0.22 0.31 0.11 0.20
Table 9: Large Benchmark datasets statistics for node classification
# Nodes # Edges # Classes # Features $\mathcal{H}(\mathcal{G})$
ogbn-arXiv 169343 1166243 40 128 0.66
arXiv-year 169343 1166243 5 128 0.22
Figure 5: Results with various model depths on nine real-world datasets.