
LSGNN: Towards General Graph Neural Network
in Node Classification by Local Similarity

Yuhan Chen1∗, Yihong Luo2,3∗, Jing Tang2,3†, Liang Yang4, Siya Qiu2,3, Chuan Wang5†, Xiaochun Cao6
∗Equal contribution: Yuhan Chen ([email protected]) and Yihong Luo ([email protected]).
†Corresponding authors: Jing Tang ([email protected]) and Chuan Wang ([email protected]).
1School of Computer Science and Engineering, Sun Yat-sen University
2The Hong Kong University of Science and Technology (Guangzhou)
3The Hong Kong University of Science and Technology
4School of Artificial Intelligence, Hebei University of Technology
5Institute of Information Engineering, Chinese Academy of Sciences
6School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University
Abstract

Heterophily has been considered an issue that hurts the performance of Graph Neural Networks (GNNs). To address this issue, some existing work uses a graph-level weighted fusion of the information of multi-hop neighbors to include more nodes with homophily. However, the heterophily might differ among nodes, which calls for considering the local topology. Motivated by this, we propose to use the local similarity (LocalSim) to learn node-level weighted fusion, which can also serve as a plug-and-play module. For better fusion, we propose a novel and efficient Initial Residual Difference Connection (IRDC) to extract more informative multi-hop information. Moreover, we provide theoretical analysis on the effectiveness of LocalSim in representing node homophily on synthetic graphs. Extensive evaluations over real benchmark datasets show that our proposed method, namely Local Similarity Graph Neural Network (LSGNN), can offer comparable or superior state-of-the-art performance on both homophilic and heterophilic graphs. Meanwhile, the plug-and-play module can significantly boost the performance of existing GNNs. Our code is provided at https://github.com/draym28/LSGNN.

Section 1 Introduction

Graph Neural Network (GNN) has received significant interest in recent years due to its strong performance in various real-world applications based on graph-structured data, e.g., node classification Kipf and Welling (2016), graph classification Errica et al. (2020), and link prediction Zhang and Chen (2018). Combining convolutional networks and graph signal processing, numerous Graph Convolutional Network (GCN) architectures Scarselli et al. (2008); Defferrard et al. (2016); Hamilton et al. (2017); Velickovic et al. (2017); Kipf and Welling (2016) have been proposed and show superior performance in the above application domains. Recent work Battaglia et al. (2018) argues that the success of GCN and its variants is built on the homophily assumption: connected nodes tend to have the same class label Hamilton (2020). This assumption provides a proper inductive bias and gives rise to the general message-passing mechanism: aggregating neighbour information to update the ego-node feature, a special form of low-pass filtering Bo et al. (2021).

Figure 1: An illustration of the LSGNN framework. ‘Pre-Computed’ means the part can be pre-computed without training.

However, the homophily assumption does not always hold McPherson et al. (2001); Jiang et al. (2013). In such cases, empirical evidence shows that GCN may even be worse than simple Multi-Layer Perceptrons (MLPs) that use merely node features as input Chien et al. (2021) (the heterophily issue). A potential explanation is that the low-pass filter is believed to hurt the performance of GCN on heterophilic graphs. Recently, the heterophily issue has received the community’s interest, and various methods Pei et al. (2020); Chien et al. (2021); He et al. (2021); Li et al. (2022) have been proposed to address it. Many existing works Chien et al. (2021); He et al. (2021); Li et al. (2022) propose to fuse intermediate representations, i.e., outputs of different layers, for learning on heterophilic graphs. However, they only consider graph-level heterophily. Moreover, intermediate representations extracted by traditional propagation methods Kipf and Welling (2016) may drop some important information, which limits the effectiveness of information fusion among neighbors.

We show that in real-world datasets the heterophily among nodes can be significantly different (Section 3.3), which suggests that only considering graph-level heterophily is insufficient. Although Liu et al. (2020) consider node-level heterophily, they simply map each node’s representation to its node-level weight without taking local topology into consideration, yielding suboptimal results. Motivated by the deficiency of existing techniques, we propose to use local similarity as an indicator of node homophily to conduct better node-level weighted fusion adaptively. Moreover, we empirically and theoretically show its capabilities on synthetic graphs (Section 4). Furthermore, to obtain more informative intermediate representations for better fusion, we propose a novel Initial Residual Difference Connection (IRDC), which can leverage all the input information. Meanwhile, the IRDC propagation preserves the key property of SGC Wu et al. (2019) of removing the nonlinear transformations between propagation steps, which are time- and GPU-consuming in GCNs. This property makes the feature propagation of LSGNN efficient and deep-learning free. Based on the above novel designs, LSGNN offers high performance and high efficiency in evaluations (Section 6).

To summarize, we make the following contributions: 1) We study the local node homophily of real-world datasets and suggest using LocalSim as an indicator of node homophily. Moreover, we empirically and theoretically show the effectiveness of using LocalSim to indicate node homophily on synthetic graphs. 2) We propose LSGNN, which contains a LocalSim-aware multi-hop fusion to guide the learning of node-level weight for intermediate representations and an IRDC to extract more informative intermediate representations for better fusion. 3) LocalSim-based node-level weight is a plug-and-play module and can significantly boost the performance of state-of-the-art models, such as H2GCN and GPRGNN. 4) We conduct extensive experiments to demonstrate the superiority of our method, LSGNN, which can offer comparable or superior performance against 13 other methods over homophilic and heterophilic graphs.

Section 2 Preliminaries

2.1 Notations

We denote an undirected graph without self-loops as $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}=\{v_i\}_{i=1}^{n}$ is the node set and $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is the edge set. Let $\mathcal{N}_i$ denote the neighbours of node $v_i$. Denote by $\mathbf{A}\in\mathbb{R}^{n\times n}$ the adjacency matrix, and by $\mathbf{D}$ the diagonal degree matrix such that $\mathbf{D}_{ii}=\sum_{j=1}^{n}\mathbf{A}_{ij}$. Let $\tilde{\mathbf{A}}$ and $\tilde{\mathbf{D}}$ be the corresponding matrices with self-loops, i.e., $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}$ and $\tilde{\mathbf{D}}=\mathbf{D}+\mathbf{I}$, where $\mathbf{I}$ is the identity matrix. Denote by $\mathbf{X}=\{\bm{x}_i\}_{i=1}^{n}\in\mathbb{R}^{n\times d}$ the initial node feature matrix, where $d$ is the number of dimensions, and by $\mathbf{H}^{(k)}=\{\bm{h}_i^{(k)}\}_{i=1}^{n}$ the representation matrix in the $k$-th layer. We use $\mathbf{Y}=\{\bm{y}_i\}_{i=1}^{n}\in\mathbb{R}^{n\times C}$ to denote the ground-truth node label matrix, where $C$ is the number of classes and $\bm{y}_i$ is the one-hot encoding of node $v_i$'s label.

2.2 Simple Graph Convolution (SGC)

SGC Wu et al. (2019) claims that the nonlinearity between propagation steps in GCN (i.e., the ReLU activation function) is not critical, and that the majority of the benefit arises from the propagation itself. Therefore, SGC removes the nonlinear transformation between propagation steps, so that the class prediction $\hat{\mathbf{Y}}$ of a $K$-layer SGC becomes

$\hat{\mathbf{Y}}=\operatorname{softmax}\left(\mathbf{S}^{K}\mathbf{X}\mathbf{W}\right),$ (1)

where $\mathbf{S}$ is a GNN filter, e.g., $\mathbf{S}=\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}$ in the vanilla SGC, and $\mathbf{W}$ is the learned weight matrix. By removing the nonlinearities, this simplified linear model yields orders-of-magnitude speedups over GCN while achieving comparable empirical performance.
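For reference, the linear part of Equation (1) can be pre-computed once; the following is a minimal NumPy/SciPy sketch (the function name and the choice of a SciPy CSR input are our own; the softmax classifier with weight matrix $\mathbf{W}$ is trained on top of the returned features and omitted here).

```python
import numpy as np
import scipy.sparse as sp

def sgc_features(adj: sp.csr_matrix, X: np.ndarray, K: int) -> np.ndarray:
    """Pre-compute S^K X with S = D~^{-1/2} A~ D~^{-1/2} (Eq. 1)."""
    n = adj.shape[0]
    A_tilde = adj + sp.eye(n)                      # add self-loops
    deg = np.asarray(A_tilde.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    S = D_inv_sqrt @ A_tilde @ D_inv_sqrt          # symmetric normalization
    H = X.copy()
    for _ in range(K):                             # K linear propagation steps
        H = S @ H
    return H
```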

2.3 Heterophily Issue

Existing GNN models are usually based on the homophily assumption that connected nodes are more likely to belong to the same class. However, many real-world graphs are heterophilic. For example, the matching couples in dating networks are usually heterosexual. Empirical evidence shows that when the homophily assumption is broken, GCN may be even worse than simple Multi-Layer Perceptrons (MLPs) that use merely node features as input Chien et al. (2021). Therefore, it is essential to develop a general GNN model for both homophilic and heterophilic graphs.

Pei et al. (2020) introduce a simple index to measure node homophily/heterophily. That is, the homophily $\mathcal{H}(v_i)$ of node $v_i$ is the ratio of the number of $v_i$'s neighbours that share the same label as $v_i$ to the total number of $v_i$'s neighbours, i.e.,

$\mathcal{H}(v_i)=\frac{|\{v_j\mid v_j\in\mathcal{N}_i,\ \bm{y}_j=\bm{y}_i\}|}{|\mathcal{N}_i|},$ (2)

and the homophily of a graph is the average over all nodes, i.e.,

$\mathcal{H}(\mathcal{G})=\frac{1}{n}\sum_{v_i\in\mathcal{V}}\mathcal{H}(v_i).$ (3)

Similarly, the heterophily of node $v_i$ is measured by $1-\mathcal{H}(v_i)$, and the heterophily of a graph is $1-\mathcal{H}(\mathcal{G})$.
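Equations (2) and (3) can be computed directly from the adjacency structure; a small sketch, assuming a SciPy CSR adjacency matrix and an integer label vector (the treatment of isolated nodes is our own choice, since the definition leaves it open):

```python
import numpy as np
import scipy.sparse as sp

def node_homophily(adj: sp.csr_matrix, labels: np.ndarray) -> np.ndarray:
    """H(v_i) of Eq. (2): fraction of v_i's neighbours sharing v_i's label."""
    n = adj.shape[0]
    h = np.zeros(n)
    for i in range(n):
        nbrs = adj.indices[adj.indptr[i]:adj.indptr[i + 1]]
        if len(nbrs) > 0:
            h[i] = np.mean(labels[nbrs] == labels[i])  # isolated nodes stay at 0
    return h

def graph_homophily(adj: sp.csr_matrix, labels: np.ndarray) -> float:
    """H(G) of Eq. (3): average node homophily over all nodes."""
    return float(node_homophily(adj, labels).mean())
```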

Section 3 Methodology

3.1 Overview

In this section, we present the Local Similarity Graph Neural Network (LSGNN), which consists of a novel propagation method, named Initial Residual Difference Connection (IRDC), for better fusion (Section 3.2), and a LocalSim-aware multi-hop fusion (Section 3.4), so that LSGNN can cope with both homophilic and heterophilic graphs. We also demonstrate the motivation for using LocalSim as an effective indicator of homophily (Section 3.3). Figure 1 shows the framework of LSGNN.

3.2 Initial Residual Difference Connection (IRDC)

Simply aggregating representation from the previous layer (as applied by most GNNs) might lose important information and consequently incur over-smoothing issues. To obtain more informative intermediate representations, i.e., outputs of different layers, we propose a novel propagation method, namely Initial Residual Difference Connection (IRDC).

General Form.

Inspired by SGC, we remove nonlinearity between layers and formulate IRDC as follows:

$\mathbf{H}^{(1)} = \operatorname{IRDC}^{(1)}(\mathbf{S},\mathbf{X}) = \mathbf{S}\mathbf{X},$ (4)
$\mathbf{H}^{(k)} = \operatorname{IRDC}^{(k)}(\mathbf{S},\mathbf{X}) = \mathbf{S}\Big((1-\gamma)\mathbf{X}-\gamma\sum_{\ell=1}^{k-1}\mathbf{H}^{(\ell)}\Big),$

for $k=2,\dotsc,K$, where $\mathbf{S}$ is a GNN filter and $\gamma\in[0,1]$ is a hyperparameter. The term $\sum_{\ell=1}^{k-1}\mathbf{H}^{(\ell)}$ in Equation (4) can be seen as the total information extracted by the previous $k-1$ layers. In this way, IRDC feeds the information that has never been processed from the original input into the next layer, and thus can fully exploit all the information in the initial node features. A normalization operation is conducted after IRDC to handle the potential value-scale issue. See Appendix B for a comparison between different residual connections.
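A minimal sketch of the IRDC recursion in Equation (4), assuming a sparse filter matrix and dense features; since all steps are linear, the whole routine can be pre-computed before training (the exact form of the normalization is not specified here, so the sketch leaves it out):

```python
import numpy as np
import scipy.sparse as sp

def irdc(S: sp.spmatrix, X: np.ndarray, K: int, gamma: float) -> list:
    """Initial Residual Difference Connection (Eq. 4): returns [H^(1), ..., H^(K)]."""
    outputs = []
    cumulative = np.zeros_like(X)          # running sum of H^(1), ..., H^(k-1)
    for k in range(1, K + 1):
        if k == 1:
            H = S @ X
        else:
            H = S @ ((1.0 - gamma) * X - gamma * cumulative)
        cumulative = cumulative + H
        # A normalization is applied after IRDC in the paper to control value scale;
        # its exact form is not pinned down here, so the raw H^(k) is returned.
        outputs.append(H)
    return outputs
```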

Parameterization in Practice.

Many GCNs use the adjacency matrix with self-loops as a low-pass GNN filter to extract similar information from neighbours. However, the self-loops added in GCNs may not always be helpful Zhu et al. (2020b). Instead, we use enhanced low-pass and high-pass filters Bo et al. (2021) (with $\mathcal{F}_L+\mathcal{F}_H=\mathbf{I}$) by adding a weighted identity matrix $\mathbf{I}$ (i.e., weighted self-loops),

$\mathcal{F}_L = \beta\mathbf{I}+\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}},$ (5)
$\mathcal{F}_H = \mathbf{I}-\mathcal{F}_L = (1-\beta)\mathbf{I}-\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}},$

where $\beta\in[0,1]$ is a hyperparameter. Intuitively, combining low-pass and high-pass filters can learn better representations for both homophilic and heterophilic graphs. In a $K$-layer LSGNN, the outputs $\mathbf{H}_L^{(k)},\mathbf{H}_H^{(k)}\in\mathbb{R}^{n\times d}$ of the $k$-th low-pass and high-pass graph filter layers are

$\mathbf{H}_L^{(k)}=\operatorname{IRDC}^{(k)}(\mathcal{F}_L,\mathbf{X})$ and $\mathbf{H}_H^{(k)}=\operatorname{IRDC}^{(k)}(\mathcal{F}_H,\mathbf{X})$,

where $k=1,2,\dotsc,K$. Next, $\mathbf{H}_L^{(k)}$ and $\mathbf{H}_H^{(k)}$ are transformed into $z$ dimensions by learnable $\mathbf{W}_L^{(k)},\mathbf{W}_H^{(k)}\in\mathbb{R}^{d\times z}$, i.e.,

$\tilde{\mathbf{H}}_L^{(k)}=\sigma\left(\mathbf{H}_L^{(k)}\mathbf{W}_L^{(k)}\right)$ and $\tilde{\mathbf{H}}_H^{(k)}=\sigma\left(\mathbf{H}_H^{(k)}\mathbf{W}_H^{(k)}\right)$,

where $\sigma$ is the ReLU function. To leverage the initial node features, we also obtain a transformation $\tilde{\mathbf{H}}_I$ of $\mathbf{X}$ such that

$\tilde{\mathbf{H}}_I=\sigma\left(\mathbf{X}\mathbf{W}_I\right),$

where $\mathbf{W}_I\in\mathbb{R}^{d\times z}$ is a learned weight matrix. In what follows, $\tilde{\mathbf{H}}_I,\tilde{\mathbf{H}}_L^{(k)},\tilde{\mathbf{H}}_H^{(k)}\in\mathbb{R}^{n\times z}$ ($k=1,2,\dotsc,K$) are delivered to the LocalSim-aware multi-hop fusion (Section 3.4) to generate the final node representations.
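The filter construction in Equation (5) and the per-channel transforms can be sketched as follows (NumPy/SciPy for the fixed filters, PyTorch for the learnable transform; $\mathbf{H}_L^{(k)}$ and $\mathbf{H}_H^{(k)}$ would be obtained by calling the IRDC sketch above with $\mathcal{F}_L$ and $\mathcal{F}_H$; the names are ours):

```python
import numpy as np
import scipy.sparse as sp
import torch
import torch.nn as nn

def enhanced_filters(adj: sp.csr_matrix, beta: float):
    """Eq. (5): F_L = beta*I + D^{-1/2} A D^{-1/2}, F_H = I - F_L."""
    n = adj.shape[0]
    deg = np.asarray(adj.sum(axis=1)).ravel()
    deg[deg == 0] = 1.0                              # guard against isolated nodes
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    A_norm = D_inv_sqrt @ adj @ D_inv_sqrt
    F_L = beta * sp.eye(n) + A_norm
    F_H = sp.eye(n) - F_L
    return F_L, F_H

class ChannelTransform(nn.Module):
    """Per-layer transform mapping H^(k) (n x d) to H~^(k) = ReLU(H^(k) W^(k)) (n x z)."""
    def __init__(self, d: int, z: int):
        super().__init__()
        self.lin = nn.Linear(d, z)

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.lin(H))
```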

3.3 LocalSim: An Indicator of Homophily

Figure 2: Positive relation between node homophily and LocalSim on Cornell (left column) and Citeseer (right column) datasets. The upper figures show the histogram of node homophily, while the lower figures show the result for node homophily vs. LocalSim.

Based on the hypothesis that nodes with similar features are likely to belong to the same class, we start out by defining a simple version of Local Similarity (LocalSim) $\phi_i$ of node $v_i$ to indicate its homophily, i.e.,

$\phi_i=\frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i}d_{ij}=\frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i}\operatorname{sim}(\bm{x}_i,\bm{x}_j),$ (6)

where $\operatorname{sim}(\cdot,\cdot)\colon(\mathbb{R}^d,\mathbb{R}^d)\mapsto\mathbb{R}$ is a similarity measure, e.g.,

$\operatorname{sim}(\bm{x}_i,\bm{x}_j)=\begin{cases}\frac{\bm{x}_i^{\top}\bm{x}_j}{\|\bm{x}_i\|\,\|\bm{x}_j\|},&\text{Cosine similarity},\\ -\|\bm{x}_i-\bm{x}_j\|_2,&\text{Euclidean similarity}.\end{cases}$

Figure 2 shows the relation between node homophily and LocalSim (using Cosine similarity) on the Cornell (left column) and Citeseer (right column) datasets. From the upper figures, we observe that the homophily/heterophily differs across nodes. In particular, Cornell is a heterophilic graph, while Citeseer is a homophilic graph. Moreover, the lower figures show a clear positive correlation between node homophily and LocalSim, which indicates that LocalSim can be used to represent node homophily appropriately.
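A direct implementation of the naive LocalSim in Equation (6), assuming a SciPy CSR adjacency matrix and a dense feature matrix (the handling of isolated nodes is our own choice):

```python
import numpy as np
import scipy.sparse as sp

def local_sim(adj: sp.csr_matrix, X: np.ndarray, measure: str = "cosine") -> np.ndarray:
    """Naive LocalSim (Eq. 6): mean feature similarity between each node and its neighbours."""
    n = adj.shape[0]
    phi = np.zeros(n)
    for i in range(n):
        nbrs = adj.indices[adj.indptr[i]:adj.indptr[i + 1]]
        if len(nbrs) == 0:
            continue                                     # isolated node: LocalSim left at 0
        if measure == "cosine":
            xi = X[i] / (np.linalg.norm(X[i]) + 1e-12)
            xj = X[nbrs] / (np.linalg.norm(X[nbrs], axis=1, keepdims=True) + 1e-12)
            sims = xj @ xi
        else:                                            # negative Euclidean distance
            sims = -np.linalg.norm(X[nbrs] - X[i], axis=1)
        phi[i] = sims.mean()
    return phi
```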

3.4 LocalSim-Aware Multi-Hop Fusion

In Section 3.3, we show that LocalSim, leveraging the local topology information, can identify the homophily of nodes. We apply LocalSim to learn the adaptive fusion of intermediate representations for each node.

The naive LocalSim in Equation (6) simply averages the similarity between the ego-node and its neighbours. In what follows, we refine the definition of LocalSim by incorporating nonlinearity, i.e., $d_{ij}^2$. Specifically, the refined LocalSim $\varphi_i$ of node $v_i$ is given by

$\varphi_i=\frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i}\operatorname{MLP_{ls}}([d_{ij},d_{ij}^2]),$ (7)

where $\operatorname{MLP_{ls}}\colon\mathbb{R}^2\mapsto\mathbb{R}$ is a 2-layer perceptron. It is trivial to see that incorporating $d_{ij}^2$ via $\operatorname{MLP_{ls}}$ can encode the variance of the series $\{d_{ij}\}_{v_j\in\mathcal{N}_i}$, since $\operatorname{Var}[d_{ij}]=\mathbb{E}[d_{ij}^2]-\mathbb{E}^2[d_{ij}]$ with respect to $v_j\in\mathcal{N}_i$. As a consequence, $\varphi_i$ can better characterize the real “local similarity” than $\phi_i$. To illustrate, we compare $\varphi_i$ and $\phi_i$ through a simple example. Suppose that node $v_1$ has two neighbours $v_2$ and $v_3$ such that $d_{12}=0$ and $d_{13}=1$, while node $v_4$ has two neighbours $v_5$ and $v_6$ such that $d_{45}=d_{46}=0.5$. Then, $\phi_1=\phi_4=0.5$, which cannot characterize the distinct neighbour distributions. On the other hand, by introducing nonlinearity, we can easily distinguish $\varphi_1$ and $\varphi_4$. Finally, we have the LocalSim vector for all nodes $\bm{\varphi}=[\varphi_1,\varphi_2,\dotsc,\varphi_n]\in\mathbb{R}^n$.
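The toy example can be verified directly; a few lines showing why feeding $[d_{ij}, d_{ij}^2]$ into $\operatorname{MLP_{ls}}$ carries the variance information that the naive average misses:

```python
import numpy as np

# Toy neighbourhoods from the example above: d_{1j} in {0, 1} vs d_{4j} = {0.5, 0.5}.
d1, d4 = np.array([0.0, 1.0]), np.array([0.5, 0.5])
phi1, phi4 = d1.mean(), d4.mean()               # naive LocalSim: both 0.5
var1 = (d1 ** 2).mean() - phi1 ** 2             # 0.25
var4 = (d4 ** 2).mean() - phi4 ** 2             # 0.0
print(phi1, phi4, var1, var4)                   # the second moment separates v_1 from v_4
```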

LocalSim can guide the fusion of intermediate representations, i.e., we use $\bm{\varphi}$ as local topology information when learning the weights for fusion. In addition, we also introduce nonlinearity by using $\bm{\varphi}$ and $\bm{\varphi}^2$ together to generate the weights, i.e.,

$\left[\bm{\alpha}_I,\bm{\alpha}_L,\bm{\alpha}_H\right]=\operatorname{MLP}_{\alpha}([\bm{\varphi},\bm{\varphi}^2]),$ (8)

where $\bm{\alpha}_I,\bm{\alpha}_L,\bm{\alpha}_H\in\mathbb{R}^{n\times K}$ are the weights for the node representations from each channel and $\operatorname{MLP}_{\alpha}\colon\mathbb{R}^2\mapsto\mathbb{R}^{3K}$ is a 2-layer perceptron. Then, the node representations for the $k$-th layer are computed as follows:

$\mathbf{Z}^{(k)}=\bm{\alpha}_I^{(k)}\odot\tilde{\mathbf{H}}_I+\bm{\alpha}_L^{(k)}\odot\tilde{\mathbf{H}}_L^{(k)}+\bm{\alpha}_H^{(k)}\odot\tilde{\mathbf{H}}_H^{(k)},$ (9)

where $\mathbf{Z}^{(k)}\in\mathbb{R}^{n\times z}$ and $\bm{\alpha}_I^{(k)},\bm{\alpha}_L^{(k)},\bm{\alpha}_H^{(k)}\in\mathbb{R}^n$ for $k=1,2,\dotsc,K$, and $\bm{\alpha}_I^{(k)}\odot\tilde{\mathbf{H}}_I$ is the $i$-th entry of $\bm{\alpha}_I^{(k)}$ times the $i$-th row of $\tilde{\mathbf{H}}_I$ for $i=1,2,\dotsc,n$ (and similarly for $\bm{\alpha}_L^{(k)}\odot\tilde{\mathbf{H}}_L^{(k)}$ and $\bm{\alpha}_H^{(k)}\odot\tilde{\mathbf{H}}_H^{(k)}$). Then, we obtain the final representation as $\tilde{\mathbf{H}}_I\|\mathbf{Z}^{(1)}\|\mathbf{Z}^{(2)}\|\dotsb\|\mathbf{Z}^{(K)}$, where $\|$ denotes concatenation. Finally, the class probability prediction matrix $\hat{\mathbf{Y}}\in\mathbb{R}^{n\times C}$ is computed by

$\hat{\mathbf{Y}}=\left[\tilde{\mathbf{H}}_I\|\mathbf{Z}^{(1)}\|\mathbf{Z}^{(2)}\|\dotsb\|\mathbf{Z}^{(K)}\right]\mathbf{W}_{\mathrm{out}},$ (10)

where $\mathbf{W}_{\mathrm{out}}\in\mathbb{R}^{(K+1)z\times C}$ is a learned matrix.
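A PyTorch sketch of the LocalSim-aware fusion in Equations (7)–(10); the input format (an edge list with ego/neighbour rows, per-edge similarities $d_{ij}$, and the pre-transformed channel representations) and the hidden width of the perceptrons are our own assumptions, not a definitive implementation:

```python
import torch
import torch.nn as nn

class LocalSimFusion(nn.Module):
    """Sketch of LocalSim-aware multi-hop fusion (Eqs. 7-10)."""

    def __init__(self, K: int, z: int, num_classes: int, hidden: int = 16):
        super().__init__()
        self.K = K
        self.mlp_ls = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.mlp_alpha = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 3 * K))
        self.w_out = nn.Linear((K + 1) * z, num_classes, bias=False)

    def forward(self, edge_index, d_ij, H_I, H_L, H_H):
        # edge_index: (2, E) LongTensor with rows (ego, neighbour); d_ij: (E,) similarities.
        # H_I: (n, z); H_L, H_H: lists of K tensors of shape (n, z).
        n = H_I.size(0)
        e = self.mlp_ls(torch.stack([d_ij, d_ij ** 2], dim=-1)).squeeze(-1)   # per-edge score
        num = torch.zeros(n, device=H_I.device).index_add_(0, edge_index[0], e)
        deg = torch.zeros(n, device=H_I.device).index_add_(0, edge_index[0], torch.ones_like(e))
        varphi = num / deg.clamp(min=1)                                        # Eq. (7)
        alpha = self.mlp_alpha(torch.stack([varphi, varphi ** 2], dim=-1))     # Eq. (8)
        a_I, a_L, a_H = alpha.chunk(3, dim=-1)                                 # each (n, K)
        Z = [H_I]
        for k in range(self.K):                                                # Eq. (9)
            Z.append(a_I[:, k:k + 1] * H_I + a_L[:, k:k + 1] * H_L[k] + a_H[:, k:k + 1] * H_H[k])
        return self.w_out(torch.cat(Z, dim=-1))                                # Eq. (10)
```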

Section 4 Case Study of Toy Example

In this section, we conduct a case study using a simple synthetic graph to demonstrate the superiority of our LocalSim-based node-level weight.

4.1 Empirical Investigation

Inspired by the Featured Stochastic Block Model (FSBM) Chanpuriya and Musco (2022), we introduce, in the following, the FSBM with a mixture of heterophily, which can generate simple synthetic graphs whose nodes have different heterophily.

Definition 1 (FSBM with Mixture of Heterophily).

A Stochastic Block Model (SBM) graph $\mathcal{G}$ with a mixture of heterophily has $n$ nodes partitioned into $r$ communities $C_1, C_2, \dotsc, C_r$ and consists of $t$ subgraphs $\mathcal{G}_1, \mathcal{G}_2, \dotsc, \mathcal{G}_t$ with no edges between any two subgraphs, where the intra-community and inter-community edge probabilities in subgraph $\mathcal{G}_k$ are $p_k$ and $q_k$, respectively. Let $\bm{c}_1, \bm{c}_2, \dotsc, \bm{c}_r\in\{0,1\}^n$ be indicator vectors for membership in each community, i.e., the $j$-th entry of $\bm{c}_i$ is $1$ if the $j$-th node is in $C_i$ and $0$ otherwise. An FSBM is such a graph model $\mathcal{G}$, plus a feature vector $\bm{x}=\bm{f}+\bm{\eta}\in\mathbb{R}^n$, where $\bm{\eta}\sim\mathcal{N}(0,\sigma^2\mathbf{I})$ is zero-centered, isotropic Gaussian noise and $\bm{f}=\sum_i\mu_i\bm{c}_i$ for some $\mu_1,\mu_2,\dotsc,\mu_r\in\mathbb{R}$, which are the expected feature values of each community.

According to Definition 1 and Equation (2), nodes in the same subgraph have identical homophily in expectation, while nodes from different subgraphs have distinct homophily. We consider FSBMs with $n=1000$, 2 equally-sized communities $C_1$ and $C_2$, 2 equally-sized subgraphs $\mathcal{G}_1$ and $\mathcal{G}_2$, feature means $\mu_1=1$ and $\mu_2=-1$, and noise variance $\sigma=1$. We generate different graphs by varying $\lambda_1=\frac{p_1}{p_1+q_1}$ and $\lambda_2=\frac{p_2}{p_2+q_2}$, where $\lambda_1,\lambda_2\in[0,1]$ and a high $\lambda_\tau$ means high homophily, while fixing the expected degree of all nodes to 10 (which means $\frac{1}{4}(p_\tau+q_\tau)n=10$ for $\tau=1,2$).

We employ a graph-level weight and the LocalSim-based node-level weight to classify the nodes. For the LocalSim-based node-level weight, we use the method proposed in Section 3. For the graph-level weight, we replace $\alpha$ in Equation (8) with a learnable graph-level value. For the similarity measure, we use $\operatorname{sim}(x,y)=-(x-y)^2$, where $x$ and $y$ are scalar values. Finally, considering the simplicity of the synthetic networks, we use the simple LocalSim in Equation (6), only use the adjacency matrix with self-loops $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}$, and set the number of IRDC layers to one.
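A sketch of how such an FSBM instance could be generated (parameter defaults are illustrative; the experiments use $n=1000$, $\mu_1=1$, $\mu_2=-1$, $\sigma=1$, and expected degree 10; function and variable names are our own):

```python
import numpy as np

def fsbm_mixture(n=1000, mu=(1.0, -1.0), sigma=1.0, lambdas=(0.9, 0.1),
                 expected_degree=10, seed=0):
    """FSBM with 2 equal communities and 2 equal subgraphs (Definition 1).

    lambdas[t] = p_t / (p_t + q_t) controls the homophily of subgraph t.
    """
    rng = np.random.default_rng(seed)
    half, quarter = n // 2, n // 4
    community = np.tile(np.repeat([0, 1], quarter), 2)   # within each subgraph: half C1, half C2
    subgraph = np.repeat([0, 1], half)
    A = np.zeros((n, n))
    for t, lam in enumerate(lambdas):
        # Expected degree with equal-sized communities implies (p_t + q_t) * n / 4 = expected_degree.
        p_plus_q = 4.0 * expected_degree / n
        p, q = lam * p_plus_q, (1.0 - lam) * p_plus_q
        idx = np.where(subgraph == t)[0]
        for pos, i in enumerate(idx):
            for j in idx[pos + 1:]:
                prob = p if community[i] == community[j] else q
                if rng.random() < prob:
                    A[i, j] = A[j, i] = 1.0
    x = np.array([mu[c] for c in community]) + rng.normal(0.0, sigma, size=n)  # scalar features
    return A, x, community, subgraph
```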

Figure 3: Accuracy (higher is better) on the synthetic graphs using graph-level weight and LocalSim-based node-level weight. Node homophily is measured by $\lambda_i$. ‘Raw’ shows the result of directly using the raw feature.

As shown in Figure 3, our LocalSim-based node-level weight consistently outperforms the graph-level weight. In particular, when node homophily differs significantly among subgraphs, the graph-level weight performs poorly, while our approach still performs very well, since it can recognize node homophily and adaptively provide suitable filters.

4.2 Theoretical Analysis

In addition to the empirical investigation, we theoretically confirm the capabilities of the LocalSim node-level weight. For simplicity, we assume that, for every node, the numbers of inter-community and intra-community edges equal their expectations and that there are no self-loops. Suppose that the optimal graph-level weight can provide a global filter that achieves the best average accuracy among all subgraphs (not the best in every subgraph), while the optimal node-level weight can provide the optimal filter for each subgraph, which achieves the best accuracy in every subgraph. However, to achieve the optimal node-level weight, we need to estimate the node homophily $\lambda$, and LocalSim is used for this aim. Next, we investigate the effectiveness of using LocalSim to indicate $\lambda$.

Theorem 1 (Effectiveness of LocalSim Indicating $\lambda$).

Consider FSBMs with 2 equally-sized communities and 2 equally-sized subgraphs, and assume that, for every node, the numbers of inter-community and intra-community edges equal their expectations and that there are no self-loops. Consider LocalSim defined as $\phi(v_i)=-\frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i}(x_i-x_j)^2$. For any node $v_i$ that belongs to $\mathcal{G}_\tau$ for $\tau=1,2$, we have $\mathbb{E}[\phi(v_i)]=-2\sigma^2-(1-\lambda_\tau)(\mu_1-\mu_2)^2$, where $\lambda_\tau=\frac{p_\tau}{p_\tau+q_\tau}$. Moreover, for any two nodes $v_i\in\mathcal{G}_1$ and $v_j\in\mathcal{G}_2$, the expectation of the L1-norm distance between $\phi(v_i)$ and $\phi(v_j)$ is no less than $|\lambda_1-\lambda_2|(\mu_1-\mu_2)^2$.

We defer the proof to Appendix A. Theorem 1 states that $\mathbb{E}[\phi(v_i)]$ is linearly correlated with $\lambda_\tau$ when $v_i\in\mathcal{G}_\tau$, which suggests that LocalSim can represent the homophily $\lambda_\tau$ of node $v_i$ unbiasedly. We also see that the expectation of the L1-norm distance positively correlates with $|\lambda_1-\lambda_2|$, which indicates that our estimation is likely to perform better as $|\lambda_1-\lambda_2|$ grows. When the gap between $\lambda_1$ and $\lambda_2$ shrinks, the two subgraphs become similar to each other, and so does the optimal filter for each community. These results confirm the effectiveness of LocalSim in indicating node homophily, which yields high performance in our empirical investigation.
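Under the stated assumptions, Theorem 1 can be checked numerically with the FSBM sketch from Section 4.1; the per-subgraph mean of $\phi(v_i)$ should roughly match the stated expectation (the specific $\lambda$ values below are illustrative, and the match is only approximate since a sampled graph does not hit the expected edge counts exactly):

```python
import numpy as np

# Compare the empirical mean LocalSim per subgraph with -2*sigma^2 - (1 - lambda_tau)*(mu_1 - mu_2)^2,
# reusing the fsbm_mixture sketch from Section 4.1.
lambdas, sigma, mu1, mu2 = (0.9, 0.3), 1.0, 1.0, -1.0
A, x, community, subgraph = fsbm_mixture(lambdas=lambdas, sigma=sigma, mu=(mu1, mu2))
for t, lam in enumerate(lambdas):
    vals = []
    for i in np.where(subgraph == t)[0]:
        nbrs = np.where(A[i] > 0)[0]
        if len(nbrs):
            vals.append(-np.mean((x[i] - x[nbrs]) ** 2))
    print(f"subgraph {t}: empirical {np.mean(vals):.2f} "
          f"vs. theory {-2 * sigma**2 - (1 - lam) * (mu1 - mu2)**2:.2f}")
```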

Section 5 Additional Related Work

In this section, we discuss some related work that attempts to address the heterophily issue. Mixhop Abu-El-Haija et al. (2019) proposes to extract features from multi-hop neighborhoods to obtain more information. CPGNN Zhu et al. (2020a) modifies feature propagation in GCN based on node classes to accommodate heterophily. Geom-GCN Pei et al. (2020) performs aggregation over a graph structure defined by geometric relationships to address heterophily. FAGCN Bo et al. (2021) uses an attention mechanism to learn edge-level aggregation weights for low-pass and high-pass filters to adaptively address the heterophily issue, whereas we use LocalSim to parameterize node-level weights, which achieves lower computational complexity and better performance. H2GCN Zhu et al. (2020b) and GPRGNN Chien et al. (2021) propose fusing intermediate representations at the graph level based on their analysis of graphs. ACM-GCN Luan et al. (2022) divides feature propagation into three channels (low-pass, high-pass, and identity) and adaptively mixes the three channels during propagation, but does not fuse different intermediate representations. Compared with H2GCN, GPRGNN, and ACM-GCN, our work only needs to propagate once, which means the propagation can be pre-computed, while these methods are trained via back-propagation through repeated feature propagation, which incurs huge computation costs. ASGC Chanpuriya and Musco (2022) directly uses least squares to filter the features obtained by SGC Wu et al. (2019) for classification, while our work fuses multiple pieces of intermediate information via node-level weights. These designs yield better performance based on pre-propagated features. Different from all these methods, our work uses LocalSim to parameterize node-level weights for efficient and better adaptive fusion, and performs IRDC as an efficient propagation method for more informative representations, thus boosting the performance while keeping high efficiency. Moreover, we show that our LocalSim-based node-level weight can be used to improve the performance of H2GCN and GPRGNN, which confirms the superiority of our method.

Method Cora Citeseer Pubmed Chameleon Squirrel Actor Cornell Texas Wisconsin Avg
MLP² 74.75 72.41 86.65 46.36 29.68 35.76 81.08 81.89 85.29 65.99
GCN² 87.28 76.68 87.38 59.82 36.89 30.26 57.03 59.46 59.80 61.62
GAT² 82.68 75.46 84.68 54.69 30.62 26.28 58.92 58.38 55.29 58.56
MixHop² 87.61 76.26 85.31 60.50 43.80 32.22 73.51 77.84 75.88 68.10
GCNII³ 88.37 77.33 90.15 63.86 38.47 37.44 77.86 77.57 80.39 70.16
Geom-GCN¹ 85.27 77.99 90.05 60.90 38.14 31.63 60.81 67.57 64.12 64.05
H2GCN¹ 87.81 77.07 89.59 59.39 37.90 35.86 82.16 84.86 86.67 71.26
WRGNN¹ 88.20 76.81 88.52 65.24 48.85 36.53 81.62 83.62 86.98 72.93
GPRGNN³ 87.95 77.13 87.54 46.58 31.61 34.63 80.27 78.38 82.94 67.45
LINKX³ 84.64 73.19 87.86 68.42 61.81 36.10 77.84 74.60 75.49 71.11
ACM-GCN³ 87.91 77.32 90.00 66.93 54.40 36.28 85.14 87.84 88.43 74.92
GGCN³ 87.95 77.14 89.15 71.14 55.17 37.54 85.68 84.86 86.86 75.05
GloGNN¹ 88.33 77.41 89.62 71.21 57.88 37.70 85.95 84.32 88.04 75.61
LSGNN 88.49 76.71 90.23 79.04 72.81 36.18 88.92 90.27 90.20 79.21
Table 1: Results on real-world benchmark datasets: mean accuracy (%). Boldface and underlined numbers mark the top-1 and top-2 results, respectively. ¹ denotes results taken from the respective papers, ² from the H2GCN paper, and ³ from the GloGNN paper.

Section 6 Experiment

6.1 Datasets and Setups

We evaluate LSGNN on nine small real-world benchmark datasets, including both homophilic and heterophilic graphs. For homophilic graphs, we adopt three widely used citation networks: Cora, Citeseer, and Pubmed Sen et al. (2008); Getoor (2012). For heterophilic graphs, Chameleon and Squirrel are page-page Wikipedia networks Rozemberczki et al. (2021), Actor is an actor network Pei et al. (2020), and Cornell, Texas, and Wisconsin are web-page networks Pei et al. (2020). Statistics of these datasets can be found in Appendix D.

For all datasets, we use the feature matrix, class labels, and 10 random splits (48%/32%/20%) provided by Pei et al. (2020). Meanwhile, we use Optuna to tune the hyperparameters with 200 trials in the experiments. For LSGNN, we set the number of layers to $K=5$.

There are three types of baseline methods, including non-graph models (MLP), traditional GNNs (GCN, GAT Velickovic et al. (2017), etc.), and GNNs tolerating heterophily (Geom-GCN, GPRGNN, etc.).

6.2 Evaluations on Real Benchmark Datasets

As shown in Table 1, LSGNN outperforms most of the baseline methods. On homophilic graphs, LSGNN achieves the SOTA performance on the Cora and Pubmed datasets and is lower than the best result by only 1.28% on the Citeseer dataset. On heterophilic graphs except Actor, LSGNN significantly outperforms all the other baseline models by 1.77%–11.00%. Specifically, the accuracy of LSGNN is 7.83% and 2.97% higher than GloGNN, the existing SOTA method on heterophilic graphs, on Chameleon and Cornell, is 11.00% higher than LINKX Lim et al. (2021) on Squirrel, and is 2.43% and 1.77% higher than ACM-GCN on Texas and Wisconsin. In particular, LSGNN achieves the best average accuracy, higher than the second-best by 3.60%. This demonstrates that LSGNN can achieve comparable or superior SOTA performance on both homophilic and heterophilic graphs.

Dataset $\mathcal{H}(\mathcal{G})$ GCN SGC GCNII LINKX GloGNN LSGNN
arXiv-year 0.22 46.02 32.83 47.21 56.00 54.79 56.42
ogbn-arXiv 0.66 71.74 69.39 72.74 54.45 49.22 72.64
Avg NA 58.88 51.11 59.98 55.23 52.01 64.53
Table 2: Average accuracy (%) over 5 runs on large datasets. The marked field denotes our implementation.

6.3 Evaluations on Large Benchmark Datasets

We also conduct experiments on two large datasets: ogbn-arXiv (homophilic) and arXiv-year (heterophilic), both with 170k nodes and 1M edges but different labels, following settings in Hu et al. (2020) and Lim et al. (2021), see Appendix D for setting details. Table 2 shows that our method has competitive performance on both homophilic and heterophilic larger graphs.

Figure 4: (a) Comparison of the average Training Cost Time (s) on nine real-world datasets (left). (b) Average results with various model depths on nine real-world datasets (right).

6.4 Training Cost Comparison

LSGNN only needs to propagate once during training, and the networks for node-level weight are small. Thus the efficiency of LSGNN can be close to SGC’s. As shown in Figure 4(a), on the average training time over nine datasets, our proposed method is only slower than SGC, a simple and effective model, and is nearly 2× faster than GCN and GAT. This confirms the high training efficiency of LSGNN. See Appendix C for more setting details and detailed comparisons of training costs on each dataset.

6.5 Alleviate Over-Smoothing Issue

To validate whether LSGNN can alleviate the over-smoothing issue, we compare the performance of vanilla GCN and LSGNN under different numbers of propagation layers (Figure 4(b)); see Appendix E for full results. It can be seen that GCN achieves the best performance at 2 layers, and its performance decreases rapidly as the number of layers increases. In contrast, the results of LSGNN remain stable and high. The reasons are two-fold: (i) our IRDC can extract better informative representations, and (ii) our LocalSim-based fusion can adaptively retain distinguishable representations and drop indistinguishable ones. Through these two designs, LSGNN performs well as the number of layers increases, which indicates the capability of LSGNN to prevent the over-smoothing issue.

6.6 LocalSim-based Node-level Weight as a Plug-and-Play Module

Cora Citeseer Pubmed Chameleon Squirrel Actor Cornell Texas Wisconsin Avg
H2GCN 87.81 77.07 89.59 59.39 37.90 35.86 82.16 84.86 86.67 71.26
w/ LocalSim 88.42 76.77 89.91 64.91 41.73 36.29 85.68 85.68 88.82 73.13
GPRGNN 87.95 77.13 87.54 46.58 31.61 34.63 80.27 78.38 82.94 67.45
w/ LocalSim 88.83 76.60 90.26 65.39 51.96 35.59 85.14 87.84 87.65 74.36
DAGNN 82.63 75.47 88.50 58.03 39.59 30.20 61.89 58.92 55.10 61.15
w/ LocalSim 84.12 75.29 89.97 61.10 44.91 34.13 79.73 77.03 85.29 70.17
Table 3: LocalSim-based node-level weight as a plug-and-play module added to other models. w/ LocalSim denotes that the LocalSim-based node-level weight is used, replacing the graph-level weight in H2GCN and GPRGNN, and replacing the random node-level weight (learned without any local topology information) in DAGNN. The best results are in boldface.
Components (added to baseline) Cora Citeseer Pubmed Chameleon Squirrel Actor Cornell Texas Wisconsin Avg
Baseline 86.33 74.53 89.99 67.28 47.73 35.27 86.47 88.92 88.04 73.84
+ Rand 87.93 75.32 90.08 67.72 48.99 35.65 78.38 88.51 88.43 73.45
+ LocalSim 87.62 76.60 90.20 69.96 49.95 35.80 87.57 88.92 89.80 75.16
+ LocalSim, w-self-loop 87.65 76.89 89.98 75.95 65.50 35.80 87.84 88.11 89.80 77.50
LSGNN (+ LocalSim, w-self-loop, IRDC) 88.49 76.71 90.23 79.04 72.81 36.18 88.92 90.27 90.20 79.21
Table 4: Ablation study on 9 real-world benchmark datasets. Each row marks the components applied to the baseline model. Boldface and underlined mark the top-1 and top-2 results. ‘Rand’ denotes random node-level weight (i.e., learned without any local topology information). ‘LocalSim’ means that the added node-level weight is learned under the guidance of LocalSim. ‘w-self-loop’ denotes weighted self-loops. ‘IRDC’ stands for the propagation via IRDC.

LocalSim-based node-level weight (Section 3.4) can be regarded as a plug-and-play module, which can be added to existing GNNs that also fuse intermediate representations without involving extra hyperparameters, such as H2GCN, GPRGNN, and DAGNN.

H2GCN and GPRGNN use learned graph-level weight to fuse the intermediate representations, which is only replaced by LocalSim-based node-level weight here. Note that we only add LocalSim-based node-level weight into H2GCN and GPRGNN, but not the enhanced filters or the high-pass filters. As shown in Table 3, the performance of both H2GCN and GPRGNN increases significantly, indicating that taking each node’s local topology into consideration when fusing intermediate representations is more reasonable and effective than merely considering graph-level structure.

DAGNN also fuses intermediate representations by node-level weight, which, however, is simply mapped from each node’s representation and is learned without involving any local topology information, leading to suboptimal results. We use the LocalSim-based node-level weight in DAGNN similarly. The dramatic improvement in Table 3 suggests that using LocalSim as local topology information ensures that the weight is learned more effectively. Although DAGNN is not designed for heterophilic graphs, it can also perform well on heterophilic graphs after using the LocalSim-based node-level weight, and is even superior to H2GCN on Squirrel.

6.7 Ablation Study

In what follows, we investigate the effectiveness of LocalSim-based node-level weight, weighted self-loops, and IRDC. Specifically, in the baseline model, we use graph-level weight learned without any local topology information to fuse the intermediate representations, add self-loops when conducting graph filtering, and apply the propagation method used in GCNs. Then, the above components are added one by one to the baseline model for performance comparison. As shown in Table 4, all the components are effective and can bring great benefits to the model in terms of performance improvement.

Note that learning node-level weight without any local topology information (‘Rand’) might even bring a negative effect on the model (lower than the baseline by 0.39% on average). Once LocalSim is used as local topology information, the model can easily learn an optimal node-level weight (75.16% on average, higher than the baseline by 1.32%), which indicates the superiority of LocalSim-aware multi-hop fusion.

When replacing the fixed self-loops with weighted self-loops, the performance increases significantly (by 2.34% to 77.50% on average), especially on heterophilic graphs, such as Chameleon and Squirrel. This is because self-loops might not always be helpful on heterophilic graphs.

IRDC also brings a significant improvement (by an increase of 1.71% to 79.21% on average). This suggests that such a propagation method can extract more informative representation for better performance.

Section 7 Conclusion

In this paper, LSGNN is proposed for better performance on both homophilic and heterophilic graphs. Many GNNs use graph-level weight to fuse intermediate representations, which does not fully consider the local structure of different nodes. Some GNNs use node-level weight learned without involving topology information, yielding suboptimal results. Our empirical study and theoretical analysis on synthetic graphs demonstrate the importance of node-level weight that considers the local topology information of nodes. The proposed LocalSim-aware multi-hop fusion uses local similarity as guidance to generate more appropriate node-level weights for intermediate representation fusion, and it can also be used as a plug-and-play module to improve the performance of existing GNNs. For better fusion, IRDC is proposed to extract more informative intermediate representations, further boosting performance. Evaluations over real-world benchmark datasets show the superiority of LSGNN in handling both homophilic and heterophilic graphs and the effectiveness of all the proposed components.

ACKNOWLEDGEMENTS

This work is partially supported by the National Key R&D Program of China under Grant No. 2022YFB2703303, by the National Natural Science Foundation of China (NSFC) under Grant No. U22B2060, Grant No. 61972442, Grant No. 62102413, and Grant No. U2001202, by Guangzhou Municipal Science and Technology Bureau under Grant No. 2023A03J0667, by HKUST(GZ) under a Startup Grant, by the S&T Program of Hebei under Grant No. 20350802D, by the Natural Science Foundation of Hebei Province of China under Grant No. F2020202040, and by the Natural Science Foundation of Tianjin of China under Grant No. 20JCYBJC00650.

References

  • Abu-El-Haija et al. [2019] Sami Abu-El-Haija, Bryan Perozzi, Amol Kapoor, Nazanin Alipourfard, Kristina Lerman, Hrayr Harutyunyan, Greg Ver Steeg, and Aram Galstyan. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In international conference on machine learning, pages 21–29. PMLR, 2019.
  • Battaglia et al. [2018] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • Bo et al. [2021] Deyu Bo, Xiao Wang, Chuan Shi, and Huawei Shen. Beyond low-frequency information in graph convolutional networks. arXiv preprint arXiv:2101.00797, 2021.
  • Chanpuriya and Musco [2022] Sudhanshu Chanpuriya and Cameron N Musco. Simplified graph convolution with heterophily. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
  • Chen et al. [2020] Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, and Yaliang Li. Simple and deep graph convolutional networks. In International Conference on Machine Learning, pages 1725–1735. PMLR, 2020.
  • Chien et al. [2021] Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations, 2021.
  • Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. arXiv, abs/1606.09375, 2016.
  • Errica et al. [2020] Federico Errica, Marco Podda, Davide Bacciu, and Alessio Micheli. A fair comparison of graph neural networks for graph classification. In ICLR. OpenReview.net, 2020.
  • Gasteiger et al. [2018] Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
  • Getoor [2012] Lise Getoor. Query-driven active surveying for collective classification. 2012.
  • Hamilton et al. [2017] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. arXiv, abs/1706.02216, 2017.
  • Hamilton [2020] William L Hamilton. Graph representation learning. Synthesis Lectures on Artifical Intelligence and Machine Learning, 14(3):1–159, 2020.
  • He et al. [2021] Mingguo He, Zhewei Wei, Hongteng Xu, et al. Bernnet: Learning arbitrary graph spectral filters via bernstein approximation. Advances in Neural Information Processing Systems, 34, 2021.
  • Hu et al. [2020] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.
  • Jiang et al. [2013] Yuexin Jiang, Daniel I Bolnick, and Mark Kirkpatrick. Assortative mating in animals. The American Naturalist, 181(6):E125–E138, 2013.
  • Kipf and Welling [2016] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv, abs/1609.02907, 2016.
  • Li et al. [2022] Xiang Li, Renyu Zhu, Yao Cheng, Caihua Shan, Siqiang Luo, Dongsheng Li, and Weining Qian. Finding global homophily in graph neural networks when meeting heterophily. arXiv preprint arXiv:2205.07308, 2022.
  • Lim et al. [2021] Derek Lim, Felix Hohne, Xiuyu Li, Sijia Linda Huang, Vaishnavi Gupta, Omkar Bhalerao, and Ser Nam Lim. Large scale learning on non-homophilous graphs: New benchmarks and strong simple methods. Advances in Neural Information Processing Systems, 34:20887–20902, 2021.
  • Liu et al. [2020] Meng Liu, Hongyang Gao, and Shuiwang Ji. Towards deeper graph neural networks. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 338–348, 2020.
  • Luan et al. [2022] Sitao Luan, Chenqing Hua, Qincheng Lu, Jiaqi Zhu, Mingde Zhao, Shuyuan Zhang, Xiao-Wen Chang, and Doina Precup. Revisiting heterophily for graph neural networks. arXiv preprint arXiv:2210.07606, 2022.
  • McPherson et al. [2001] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual review of sociology, 27(1):415–444, 2001.
  • Page et al. [1999] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford infolab, 1999.
  • Pei et al. [2020] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-gcn: Geometric graph convolutional networks. arXiv preprint arXiv:2002.05287, 2020.
  • Rozemberczki et al. [2021] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-Scale Attributed Node Embedding. Journal of Complex Networks, 9(2), 2021.
  • Scarselli et al. [2008] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE transactions on neural networks, 20(1):61–80, 2008.
  • Sen et al. [2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. Ai Magazine, 2008.
  • Velickovic et al. [2017] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv, abs/1710.10903, 2017.
  • Wu et al. [2019] Felix Wu, Tianyi Zhang, Amauri Holanda de Souza Jr, Christopher Fifty, Tao Yu, and Kilian Q Weinberger. Simplifying graph convolutional networks. arXiv preprint arXiv:1902.07153, 2019.
  • Yang et al. [2022] Liang Yang, Weihang Peng, Wenmiao Zhou, Bingxin Niu, Junhua Gu, Chuan Wang, Yuanfang Guo, Dongxiao He, and Xiaochun Cao. Difference residual graph neural networks. In Proceedings of the 30th ACM International Conference on Multimedia, pages 3356–3364, 2022.
  • Zhang and Chen [2018] Muhan Zhang and Yixin Chen. Link prediction based on graph neural networks. In NeurIPS, pages 5171–5181, 2018.
  • Zhu et al. [2020a] Jiong Zhu, Ryan A Rossi, Anup Rao, Tung Mai, Nedim Lipka, Nesreen K Ahmed, and Danai Koutra. Graph neural networks with heterophily. arXiv preprint arXiv:2009.13566, 2020.
  • Zhu et al. [2020b] Jiong Zhu, Yujun Yan, Lingxiao Zhao, Mark Heimann, Leman Akoglu, and Danai Koutra. Beyond homophily in graph neural networks: Current limitations and effective designs. Advances in Neural Information Processing Systems, 33, 2020.

Appendix A Proof of Theorem 1

For node $v_i\in\mathcal{G}_\tau$, without loss of generality, suppose that $v_i\in C_1$. (The analysis is analogous when $v_i\in C_2$.) When, for every node, the numbers of inter-community and intra-community edges equal their expectations and there are no self-loops, we have

$\mathbb{E}[\phi(v_i)] = -\frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i}\mathbb{E}[(x_i-x_j)^2] = -\frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i}\Big(\mathbb{E}[x_i^2]+\mathbb{E}[x_j^2]-2\mathbb{E}[x_i x_j]\Big).$

By definition, we have

$\mathbb{E}[x_i^2]=\operatorname{Var}[x_i]+\mathbb{E}^2[x_i]=\sigma^2+\mu_1^2.$

Similarly,

$\mathbb{E}[x_j^2]=\operatorname{Var}[x_j]+\mathbb{E}^2[x_j]=\begin{cases}\sigma^2+\mu_1^2,&\text{if }v_j\in C_1,\\ \sigma^2+\mu_2^2,&\text{if }v_j\in C_2.\end{cases}$

Moreover, since $x_i$ and $x_j$ are independent random variables, we have

$\mathbb{E}[x_i x_j]=\mathbb{E}[x_i]\,\mathbb{E}[x_j]=\begin{cases}\mu_1^2,&\text{if }v_j\in C_1,\\ \mu_1\mu_2,&\text{if }v_j\in C_2.\end{cases}$

Putting it together yields

$\mathbb{E}[\phi(v_i)] = -\frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i\cap C_1}\Big(\sigma^2+\mu_1^2+\sigma^2+\mu_1^2-2\mu_1^2\Big) - \frac{1}{|\mathcal{N}_i|}\sum_{v_j\in\mathcal{N}_i\cap C_2}\Big(\sigma^2+\mu_1^2+\sigma^2+\mu_2^2-2\mu_1\mu_2\Big)$
$= -\frac{p_\tau\cdot 2\sigma^2}{p_\tau+q_\tau}-\frac{q_\tau\cdot\big(2\sigma^2+(\mu_1-\mu_2)^2\big)}{p_\tau+q_\tau}$
$= -2\sigma^2-(1-\lambda_\tau)(\mu_1-\mu_2)^2.$

For the second part, where $v_i\in\mathcal{G}_1$ and $v_j\in\mathcal{G}_2$, using the above result, we have

$\mathbb{E}[\phi(v_i)] = -2\sigma^2-(1-\lambda_1)(\mu_1-\mu_2)^2,$
$\mathbb{E}[\phi(v_j)] = -2\sigma^2-(1-\lambda_2)(\mu_1-\mu_2)^2.$

Therefore,

$\mathbb{E}[|\phi(v_i)-\phi(v_j)|]\geq|\mathbb{E}[\phi(v_i)-\phi(v_j)]|=|\lambda_1-\lambda_2|(\mu_1-\mu_2)^2.$

This completes the proof.

Appendix B Comparison Between IRDC and Other Residual Connections

Table 5 shows the comparison between IRDC and other residual connections. Note that coefficients are omitted, $\mathbf{S}$ is a GNN filter, $\mathbf{X}$ is the original node feature matrix, $\mathbf{H}^{(k)}$ is the intermediate node feature with only propagation, and $\mathbf{Z}^{(k)}$ is the intermediate node feature with both propagation and nonlinear transformation.

GCN Kipf and Welling [2016] and SGC Wu et al. [2019] use the output of the previous layer as the input to the next layer, which might result in the over-smoothing issue. To overcome this issue, many residual connection methods have been proposed.

GCNII Chen et al. [2020] constructs a connection to the initial representation, which ensures that the final representation retains at least a fraction of the input layer even if many layers are stacked. APPNP Gasteiger et al. [2018] applies a similar initial residual connection based on Personalized PageRank Page et al. [1999].

DRC Yang et al. [2022] proposes a novel difference residual connection, enabling each layer to process the information that has not been properly handled in the previous layer.

Differently, our proposed IRDC aims at enabling each layer to process the total information that has not been handled in all the previous layers, thus can leverage all the input information and extract more informative representations.

Another key difference between IRDC and other residual connections is that IRDC is pre-computed using only the original node features, without any trainable parameters or gradient back-propagation, which brings high efficiency, whereas other residual connections are based on the transformed node representations.

Table 5: Comparison between IRDC and other residual connections
Residual Connections Formula
Without Residual Connection (GCN): $\mathbf{Z}^{(k)}=\sigma\left(\mathbf{S}\mathbf{Z}^{(k-1)}\mathbf{W}\right)$
Without Residual Connection (SGC): $\mathbf{Z}^{(k)}=\mathbf{S}\mathbf{Z}^{(k-1)}$
Initial Residual Connection (GCNII): $\mathbf{Z}^{(k)}=\sigma\left(\left(\mathbf{Z}^{(0)}+\mathbf{S}\mathbf{Z}^{(k-1)}\right)\mathbf{W}\right)$
Initial Residual Connection (APPNP): $\mathbf{Z}^{(k)}=\mathbf{Z}^{(0)}+\mathbf{S}\mathbf{Z}^{(k-1)}$
Difference Residual Connection (DRC): $\mathbf{Z}^{(k)}=\mathbf{S}\left(\mathbf{Z}^{(k-2)}-\mathbf{Z}^{(k-1)}\right)$
Initial Residual Difference Connection (IRDC, ours): $\mathbf{H}^{(k)}=\mathbf{S}\left(\mathbf{X}-\sum_{\ell=1}^{k-1}\mathbf{H}^{(\ell)}\right)$

Appendix C Comparison on Training Cost Time

Experiment setting.

For GCN, SGC, and GAT, we use five layers, the same as our proposed method. We run the experiments on a computer with an RTX-3080 GPU. We train all models for 200 epochs on all datasets, repeat each experiment 10 times, and report the average time.

Experiment results.

The detailed training cost time on each dataset for each model can be found in Table 7. It can be seen that our proposed method is consistently nearly 2× faster than GCN and GAT, and is close to SGC.

Appendix D Additional Experiment Details

D.1 Statistics Of Benchmark Datasets

The statistics of nine real-world datasets used in the experiment can be seen in Table 8.

D.2 Statistics Of Large Benchmark Datasets

The statistics of two large datasets used in the experiment can be seen in Table 9.

D.3 Experiment Setting Details of Large Benchmark Datasets

The selected large datasets, ogbn-arXiv and arXiv-year, have the same nodes and edges, which represent papers and citations, respectively. In ogbn-arXiv, node labels represent the papers’ academic topics, which tends to be a homophilic setting, while in arXiv-year, the node labels represent the publishing year, thus exhibiting more heterophilic label patterns.

For ogbn-arXiv, we use the official split Hu et al. [2020], and for arXiv-year, we use the 5 random splits (50%/25%/25%) provided by Lim et al. [2021]. Note that, as observed in SOTA models on arXiv-year (e.g., LINKX Lim et al. [2021] and GloGNN Li et al. [2022]), incorporating the embedding of the adjacency matrix is critical for achieving high performance. Therefore, we also include this embedding in our approach to enhance performance; details can be found in our code.

Appendix E Detailed Results

Full results on alleviating the over-smoothing issue in Section 6.5.

See Figure 5 for the full results of the comparison between vanilla GCN and LSGNN under different numbers of propagation layers. It can be seen that the performance of GCN decreases rapidly as the number of layers increases on most datasets, except Cornell and Squirrel. But we note that Cornell is a simple dataset on which even MLP can achieve 80%+ accuracy, and Squirrel only has five classes, while GCN only achieves 25%–30% accuracy, which is close to the accuracy of random classification if the classes are balanced (20%). In contrast, our method (i.e., LSGNN) achieves high performance on all datasets for all numbers of layers, which indicates the capability of LSGNN to prevent the over-smoothing issue. Finally, for fairness, we use Optuna to tune the hyper-parameters with 200 trials for both LSGNN and GCN.

Hyper-parameters Details.

We use Optuna to tune the hyper-parameters with 200 trials in all experiments, and the number of IRDC layers is 5 in all experiments except the over-smoothing experiments. We only set a rough search space for each hyper-parameter across all datasets, as shown in Table 6 (a sketch of the tuning loop is given after the table). For the searched hyper-parameters and all other training details, we refer readers to our code, which will be released after publication.

Table 6: Hyper-parameters Search Space
Hyper-parameter Search Space
learning rate [1e-3, 1e-1]
weight decay [1e-6, 1e-1]
dropout {0.1, 0.5, 0.6, 0.7, 0.8, 0.9}
$\beta$ {0.1, 0.3, 0.5, 0.7, 0.9, 1.0}
$\gamma$ {0.1, 0.3, 0.5, 0.7, 0.9, 1.0}
similarity measure {cosine, euclidean}
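As referenced above, the tuning loop could look as follows. This is a hedged sketch: `train_and_eval` is a hypothetical helper standing in for the actual training/validation routine, and the log-scale sampling of the learning rate and weight decay is our own assumption; only the search ranges come from Table 6.

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Search space mirroring Table 6; `train_and_eval` is a placeholder (not shown here).
    params = {
        "lr": trial.suggest_float("lr", 1e-3, 1e-1, log=True),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-1, log=True),
        "dropout": trial.suggest_categorical("dropout", [0.1, 0.5, 0.6, 0.7, 0.8, 0.9]),
        "beta": trial.suggest_categorical("beta", [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]),
        "gamma": trial.suggest_categorical("gamma", [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]),
        "similarity": trial.suggest_categorical("similarity", ["cosine", "euclidean"]),
    }
    return train_and_eval(params)            # hypothetical helper returning validation accuracy

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=200)       # 200 trials, as used in the experiments
```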
Table 7: Time consumption of several GNN models (seconds per 200 epochs). These are the times for running 200 epochs with each model; each experiment is repeated 10 times.
Cora Citeseer Pubmed Chameleon Squirrel Actor Cornell Texas Wisconsin Avg
GCN 3.85 3.52 3.69 3.72 3.31 3.82 3.59 3.69 3.83 3.67
SGC 0.91 0.80 1.00 0.79 0.81 0.85 0.78 0.80 0.80 0.84
GAT 4.50 4.42 5.06 4.15 6.36 4.37 4.42 4.31 4.27 4.65
LSGNN 1.76 2.07 2.16 1.56 2.10 1.92 1.47 1.49 1.47 1.78
Table 8: Benchmark datasets statistics for node classification
Cora Citeseer Pubmed Chameleon Squirrel Actor Cornell Texas Wisconsin
# Nodes 2708 3327 19717 2277 5201 7600 183 183 251
# Edges 5278 4552 44324 18050 108536 15009 149 162 257
# Classes 7 6 3 5 5 5 5 5 5
# Features 1433 3703 500 2325 2089 932 1703 1703 1703
$\mathcal{H}(\mathcal{G})$ 0.81 0.74 0.80 0.24 0.22 0.22 0.31 0.11 0.20
Table 9: Large Benchmark datasets statistics for node classification
# Nodes # Edges # Classes # Features $\mathcal{H}(\mathcal{G})$
ogbn-arXiv 169343 1166243 40 128 0.66
arXiv-year 169343 1166243 5 128 0.22
Figure 5: Results with various model depths on nine real-world datasets.