Heterophily-Aware Graph Attention Network
Abstract
Graph Neural Networks (GNNs) have shown remarkable success in graph representation learning. Unfortunately, the weight assignment schemes in standard GNNs, such as those based on node degrees or pair-wise representations, can hardly be effective on networks with heterophily, in which connected nodes usually possess different labels or features. Existing heterophilic GNNs tend to ignore modeling the heterophily of each edge, which is also a vital component of tackling the heterophily problem. In this paper, we first propose a heterophily-aware attention scheme and reveal the benefits of modeling the edge heterophily: if a GNN assigns different weights to edges according to their heterophilic types, it can learn effective local attention patterns, which enable nodes to acquire appropriate information from distinct neighbors. Then, we propose a novel Heterophily-Aware Graph Attention Network (HA-GAT) that fully explores and utilizes the local distribution as the underlying heterophily, to handle networks with different homophily ratios. To demonstrate the effectiveness of the proposed HA-GAT, we analyze the proposed heterophily-aware attention scheme and local distribution exploration, and seek an interpretation from their mechanism. Extensive results demonstrate that our HA-GAT achieves state-of-the-art performances on eight datasets with different homophily ratios in both the supervised and semi-supervised node classification tasks.
keywords: Graph Neural Networks, Graph Attention Networks, Networks with Heterophily

Affiliations:
[label1] State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China
[label2] School of Computer Science and Engineering, Beihang University, Beijing 100191, China
[label3] Shen Yuan Honors College, Beihang University, Beijing 100191, China
[label4] School of Artificial Intelligence, Hebei University of Technology, Tianjin 300401, China
[label5] Hebei Province Key Laboratory of Big Data Calculation, Hebei University of Technology, Tianjin 300401, China
1 Introduction
Due to their success in modeling irregular graph data, Graph Neural Networks (GNNs) [1, 2, 3, 4, 5, 6, 7, 8] have been applied to various tasks, including social analysis [9], computer vision [10], and natural language processing [11]. Current GNNs mostly follow the message passing strategy [12], where the node state is iteratively updated by interacting with its neighboring nodes. Existing approaches mainly differ in their node aggregation mechanisms, which describe how each node combines the representations of its neighbors with its own by adopting different weight assignment schemes.
Therefore, one of the critical interests in GNN research is the design of a powerful weight assignment scheme, which can effectively determine the importance of the representations of every node and its neighbors. Graph Convolutional Network (GCN) [1] and its variants [3, 13, 14] utilize the structural information, e.g., symmetric normalization with respect to the node degrees, as the edge weight. Besides, Graph Attention Network (GAT) [2] generates an attention score (weight) by utilizing the self-attention mechanism: it allows every node to compute an attention score for each of its neighbors and softly select the most relevant ones. Several attention-based weight assignment schemes have subsequently been proposed by employing different pair-wise attention functions [15, 16, 17, 18, 19]. Typically, these two kinds of weight assignment schemes can achieve state-of-the-art results on networks that satisfy the homophily assumption, i.e., nodes prefer to connect with other nodes possessing the same label. Unfortunately, recent studies [20, 21, 22] have shown that the above-mentioned weight assignment schemes cannot perform well when the homophily assumption is not satisfied, e.g., on heterophilic networks, in which the linked nodes usually possess different labels or attributes.
Several improvements have been proposed to tackle the heterophily problem by designing adaptive weight assignment schemes. From the spectral perspective, FAGCN [23] adopts a self-gating attention mechanism to assign weights to the coefficients of both the low-frequency and high-frequency signals. DMP [22] considers attribute heterophily and specifies a propagation weight for every attribute on each edge. However, these methods neglect to model the edge heterophily, which is determined, according to the labels or attributes, by the types of the connected nodes of each edge. This is also critical for solving the heterophily problem. BM-GCN [24] attempts to estimate the label heterophily directly and assigns weights based on the block similarity. Unfortunately, it assigns higher weights to the edges connecting similar nodes, which is insufficient to exploit information from dissimilar nodes.
In this paper, we first propose an intuitive solution by constructing a GNN which can assign different weights to edges according to their heterophilic types. Then, the GNN can learn an effective local attention pattern, which enables nodes to acquire appropriate information from both similar and dissimilar nodes. For verification, we develop a heterophily-aware attention scheme to handle graphs where a prior edge heterophily, specified by the labels of the connected nodes, is available for every edge. By modeling the importance of different edge heterophilic types via adaptive weights, we empirically validate that such a GNN can excel in node classification tasks, even on heterophilic graphs. For example, on the Actor and Chameleon datasets, it achieves accuracies of 63.38/33.31 and 92.27/67.96 with/without the adaptive weights for different heterophilic types, respectively. This observation indicates that the edge heterophily plays a crucial role in exploiting information from both similar and dissimilar nodes, and thus in solving the heterophily problem.
Unfortunately, in real-world node classification tasks, the edge heterophily cannot be directly computed when the majority of the node labels are not given, especially in the semi-supervised settings. Thus, the remaining challenge is how to acquire proper heterophily information. Previous studies [24, 25] attempt to estimate the edge heterophily from node labels predicted by a pre-trained MLP. Unfortunately, precise node labels are difficult to predict directly with a pre-trained MLP.
Recent studies [26, 27, 28] have shown that the local distribution, which describes the distribution measured by certain characteristics in the local neighborhood, plays a critical role in handling the heterophilic graphs. According to [26, 27], GCN can outperform MLP on the heterophilic graphs, in which the local distributions computed by the node labels are distinctive, i.e., nodes in the same classes tend to possess similar local distributions while nodes from different classes tend to possess dissimilar ones. Meanwhile, according to [28], the topological structure, node features, and positional identity are also decisive for estimating effective local distributions and then facilitating the weight assignment scheme.
In this paper, we tackle the heterophily problem by proposing a novel Heterophily-Aware Graph Attention Network (HA-GAT), which can handle graphs with different homophily ratios. Instead of directly estimating the heterophily from the node labels, we construct an explorer network to excavate local distributions that represent the underlying heterophily of nodes. The heterophily preference of each edge is derived by integrating the underlying categories of its connected nodes. Then, according to our heterophily-aware attention scheme, the heterophily preference is utilized to generate the attention coefficients. Specifically, a learnable parsing matrix with a gradient scaling factor is constructed in each layer, which operates on the heterophily preference matrices to generate the attention coefficients. Each element in the parsing matrix represents the corresponding importance of a particular underlying heterophilic edge type, which can be learned adaptively. At last, the normalized attention coefficients are utilized in the message aggregation process.
To demonstrate the effectiveness of the proposed HA-GAT, we analyze the proposed heterophily-aware attention scheme and local distribution exploration, by seeking an interpretation from their mechanism. Extensive experiments on eight datasets with different homophily ratios demonstrate the effectiveness of our proposed HA-GAT, for both the supervised and semi-supervised node classification tasks. Our major contributions are summarized as follows:
1. We empirically reveal the benefits of modeling the edge heterophily, i.e., if a GNN assigns different weights to edges according to different heterophilic types, it can learn effective local attention patterns, enabling nodes to acquire appropriate information from distinct neighbors.
2. We propose a heterophily-aware attention scheme to adaptively assign weights to different heterophilic types, by fully exploiting the given edge heterophily.
3. We propose a novel Heterophily-Aware Graph Attention Network (HA-GAT) by fully exploring and utilizing the local distribution as the underlying heterophily, to handle the networks with different homophily ratios.
2 Related Work
2.1 Attention-based Graph Neural Networks
Graph Attention Network (GAT) [2] pioneered computing the hidden representation of each node by attending over its neighbors, based on a self-attention strategy. Specifically, it generates attention coefficients by performing an inner product between a learnable vector and the concatenated representations of two connected nodes. Subsequently, several GAT variants have been proposed with different calculation schemes for the attention coefficients, such as cosine similarity [15] and element-wise product [16, 17]. Besides, some methods improve GAT from other perspectives. CS-GNN [18] improves GAT by predicting the attention coefficients based on separate scoring representations, instead of using the node representations directly. SuperGAT [17] proposes a self-supervised strategy to learn the attention coefficients with respect to the edge information. GATv2 [19] argues that the attention scores learned by GAT [2] are irrelevant to the ego-representation, and presents an improved variant by modifying the order of operations. All of these attention-based GNNs function decently under the homophily assumption, i.e., they attempt to measure similarities or distances by employing distinct pair-wise attention functions, which makes them unsuitable for processing networks with heterophily.
2.2 GNNs for Heterophily
Recently, several methods have been proposed to tackle the heterophily problem, which can be divided into two lines.
The first line of approaches intends to model the diverse local patterns by designing adaptive message aggregation schemes. Geom-GCN [21] builds a structural neighborhood by leveraging network embeddings. The neighborhood is then partitioned into several geometric blocks, and the neighbors within each block are aggregated separately. FAGCN [23] adopts a self-gating attention mechanism to learn the attention coefficients for both the low-frequency and high-frequency signals. Besides, ACM-GNN [29] proposes an Adaptive Channel Mixing framework to adaptively exploit the low-frequency, high-frequency, and identity channels. GBK-GNN [30] employs a bi-kernel feature transformation and a selection gate to simultaneously model the similarity and dissimilarity (i.e., the positive and negative correlations) between node features. By taking node attributes as weak labels, DMP [22] considers the attribute heterophily for diverse message passing and specifies a propagation weight for every attribute on each edge. WRGNN [31] transforms a heterophilic input graph into a multi-relational graph, in which the proximity and structural information are represented as distinct types of edges. BM-GCN [24] leverages a two-staged block-guided classified aggregation mechanism to directly model the explicit heterophily and assign edge weights according to the block similarities. MWGNN [28] utilizes well-designed handcrafted features, i.e., the local degree profile, sorted node features, and shortest path distances, to predict the underlying local distribution, and assigns the edge weights by a topology-attribute decoupling mechanism. Although MWGNN also considers the local distribution, its handcrafted features may not be appropriate for all graphs. Different from the existing literature, our proposed HA-GAT estimates the underlying heterophilic type of each edge to avoid utilizing imprecise label information. It leverages a heterophily-aware attention scheme to adaptively learn the importance of each heterophilic type, presenting a novel paradigm for addressing heterophily.
The other line of approaches focuses on expanding the local neighborhood. H2GCN [20] explicitly aggregates information from high-order neighborhoods to find more similar neighbors, which has proved to be beneficial in heterophily settings. This design is also employed in [32, 33]. GPR-GNN [34] proposes a generalized PageRank architecture to adaptively model the importance of different orders of neighborhoods. Besides, some works utilize different ranking strategies to find potential neighbors. Non-local GNN [35] designs an attention-guided sorting strategy to find distant but informative nodes and group them together. GPNN [36] leverages a pointer network to select the most relevant nodes from a large set of multi-hop neighbors. UGCN [37] introduces a discriminative aggregation to simultaneously fuse the 1-hop, 2-hop, and k-nearest neighbors. GloGNN [38] generates node embeddings by aggregating information from global nodes within the graph. Our HA-GAT follows the first line of approaches and is thus orthogonal to these methods.
3 Preliminaries
3.1 Notations
An attributed graph is denoted as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with the vertex set $\mathcal{V}$ ($|\mathcal{V}| = n$) and edge set $\mathcal{E}$. Each node $v \in \mathcal{V}$ contains a feature $\mathbf{x}_v \in \mathbb{R}^{d}$. $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the collection of the features of all the nodes. $\mathbf{A} \in \{0, 1\}^{n \times n}$ is the adjacency matrix, where $A_{uv} = 1$ if nodes $u$ and $v$ are connected, and $A_{uv} = 0$ otherwise. $d_v$ stands for the degree of node $v$, and $\mathbf{D} = \mathrm{diag}(d_1, \dots, d_n)$ represents the degree matrix corresponding to the adjacency matrix $\mathbf{A}$. For each node $v$, $\mathcal{N}_v$ represents the set of its neighbors, and $\tilde{\mathcal{N}}_v = \mathcal{N}_v \cup \{v\}$ denotes the set containing both the node $v$ and its neighbors.
3.2 Homophily versus Heterophily
To clearly measure the homophily/heterophily of a graph, we follow [21] to define the homophily ratio as
$$\mathcal{H}(\mathcal{G}) = \frac{1}{|\mathcal{V}|} \sum_{v \in \mathcal{V}} \frac{\left| \{ u \in \mathcal{N}_v : y_u = y_v \} \right|}{|\mathcal{N}_v|}, \quad (1)$$

where $y_v$ denotes the label of node $v$. A high homophily ratio indicates that the graph is with strong homophily, while a graph with strong heterophily possesses a small homophily ratio.
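As a concrete reference, the following minimal NumPy sketch computes Eq. 1 on a toy edge list. It is our illustrative reconstruction; the function name and the toy data are ours, not from the paper.

```python
import numpy as np

def homophily_ratio(edges, labels):
    """Eq. 1: average, over nodes, of the fraction of neighbors
    that share the node's label."""
    n = len(labels)
    same = np.zeros(n)
    deg = np.zeros(n)
    for u, v in edges:            # count each undirected edge in both directions
        for a, b in ((u, v), (v, u)):
            deg[a] += 1
            same[a] += float(labels[a] == labels[b])
    mask = deg > 0                # skip isolated nodes
    return float(np.mean(same[mask] / deg[mask]))

# toy example: a 4-node path with labels [0, 0, 1, 1] -> H = 0.75
print(homophily_ratio([(0, 1), (1, 2), (2, 3)], np.array([0, 0, 1, 1])))
```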
4 Heterophily-Aware Attention Scheme
This section develops a heterophily-aware attention scheme to adaptively assign weights by fully exploiting the given edge heterophily.
We first assume that the edge heterophily preference matrix $\mathbf{S}_{uv}$ for each edge $(u, v) \in \mathcal{E}$ is acquirable, with respect to the node labels. It can be computed via

$$\mathbf{S}_{uv} = \mathbf{y}_u \otimes \mathbf{y}_v = \mathbf{y}_u \mathbf{y}_v^{\top}, \quad (2)$$

where $\mathbf{S}_{uv} \in \{0, 1\}^{c \times c}$ represents the distribution among the heterophilic types for edge $(u, v)$, and $\otimes$ represents the outer product of two vectors. Note that $\mathbf{y}_v \in \{0, 1\}^{c}$ is a one-hot vector, which represents the ground-truth label of node $v$ among the $c$ classes. The element $(\mathbf{S}_{uv})_{ij}$ in $\mathbf{S}_{uv}$ is 1 if and only if the nodes $u$ and $v$ belong to classes $i$ and $j$, respectively, and 0 otherwise.
Then, the heterophily preference $\mathbf{S}_{uv}$ of each edge is interpreted by a learnable parsing matrix to generate the attention coefficient in each layer. In the $l$-th layer, the parsing matrix $\mathbf{P}^{(l)} \in \mathbb{R}^{c \times c}$ is employed as a parameter matrix, which can be optimized in the training process. Then, the heterophily-aware attention coefficient $e_{uv}^{(l)}$ for each edge $(u, v)$ can be predicted via

$$e_{uv}^{(l)} = \left\langle \sigma\big(\mathbf{P}^{(l)}\big), \mathbf{S}_{uv} \right\rangle_F = \mathrm{tr}\!\left( \sigma\big(\mathbf{P}^{(l)}\big)^{\top} \mathbf{S}_{uv} \right), \quad (3)$$

where $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius inner product, which is an inner product operation in the matrix domain, and $\mathrm{tr}(\cdot)$ is the trace of the matrix. Here, $\sigma(\cdot)$ is an element-wise computing function to calculate the importance of each heterophilic edge type with respect to the real-valued parameters. For each parameter $p_{ij}^{(l)}$ in $\mathbf{P}^{(l)}$,
$$\sigma\!\left(p_{ij}^{(l)}\right) = \exp\!\left(\lambda\, p_{ij}^{(l)}\right), \quad (4)$$

where $\lambda$ is a gradient scaling factor, which is employed to enlarge/shrink the gradient of each $\sigma(p_{ij}^{(l)})$. Specifically, each parameter $p_{ij}^{(l)}$ is initially set as 0 to ensure $\sigma(p_{ij}^{(l)})$ being one, i.e., each heterophilic type of edges possesses equivalent importance at the beginning of the training process. With the gradient scaling factor $\lambda$, the modification of $\sigma(p_{ij}^{(l)})$ in each optimization step will be scaled accordingly. Besides, the $\exp$ function is utilized to ensure the weights to be non-negative. During the training process, the elements of $\sigma(\mathbf{P}^{(l)})$ gradually become diverse, each of which reflects the corresponding importance of a certain heterophilic edge type.
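The scaling effect follows directly from differentiating Eq. 4 (this one-line derivation is our addition):

$$\frac{\partial\, \sigma\big(p_{ij}^{(l)}\big)}{\partial\, p_{ij}^{(l)}} = \lambda \exp\!\big(\lambda\, p_{ij}^{(l)}\big) = \lambda\, \sigma\big(p_{ij}^{(l)}\big),$$

so every gradient reaching a parsing parameter is multiplied by $\lambda$, which enlarges ($\lambda > 1$) or shrinks ($\lambda < 1$) the effective update step.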
Typically, the self-loop connection, which is a special type of edge, is always taken into consideration for combining the representations of the central node and its neighbors. Thus, a parameter $p_{s}^{(l)}$ is utilized here to model the importance of the self-loop connection, as

$$e_{vv}^{(l)} = \sigma\!\left(p_{s}^{(l)}\right) = \exp\!\left(\lambda\, p_{s}^{(l)}\right). \quad (5)$$

Note that $\sigma(\mathbf{P}^{(l)})$, together with the self-loop weight $\sigma(p_{s}^{(l)})$, determines how each type of nodes assigns weights to their neighbors and themselves. We denote it as a Local Attention Pattern (LAP) to facilitate the subsequent analysis.
Neighbor Norm. Different from the commonly utilized normalization methods, we construct a Neighbor Norm to normalize the attention coefficients, as

$$\hat{e}_{uv}^{(l)} = \frac{e_{uv}^{(l)}}{\sum_{k \in \tilde{\mathcal{N}}_v} e_{kv}^{(l)}}, \quad (6)$$

where $\tilde{\mathcal{N}}_v$ is the set containing node $v$ and its neighbors. Each attention coefficient of edge $(u, v)$ is normalized by the weighted degree of the neighbor node $v$. This design assumes that the connections to general neighbors (i.e., neighbors with small weighted degrees) are more responsive to the characteristics of the node than the connections to popular neighbors (i.e., neighbors with large weighted degrees). This assumption is consistent with various real-world networks, e.g., social networks and citation networks.
After obtaining the normalized attention coefficients, the message aggregation step can be performed as

$$\mathbf{h}_u^{(l+1)} = \rho\!\left( \sum_{v \in \tilde{\mathcal{N}}_u} \hat{e}_{uv}^{(l)}\, \mathbf{W}^{(l)} \mathbf{h}_v^{(l)} \right), \quad (7)$$

where $\mathbf{W}^{(l)}$ is a parameter matrix and $\rho(\cdot)$ is the ReLU function.
Remark. Our heterophily-aware attention scheme utilizes one parameter to model the importance of each heterophilic edge type. It attempts to find an effective LAP to help nodes extract appropriate information from both the similar and dissimilar neighbors.
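To make the scheme concrete, the sketch below walks through Eqs. 2–7 on a toy labeled graph. It is our illustrative NumPy reconstruction rather than the authors' released code: the trainable parameters $\mathbf{P}^{(l)}$ and $p_s^{(l)}$ are left at their zero initialization, and the projection matrix $\mathbf{W}^{(l)}$ in Eq. 7 is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, lam = 5, 3, 1.0                     # nodes, classes, scaling factor lambda
labels = np.array([0, 0, 1, 2, 1])        # ground-truth classes (the label prior)
Y = np.eye(c)[labels]                     # one-hot label vectors y_v
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]  # undirected toy graph

P = np.zeros((c, c))                      # parsing matrix, zero-init so sigma(P) = 1
p_s = 0.0                                 # self-loop parameter
sigma = lambda p: np.exp(lam * p)         # Eq. 4

# Eq. 3: e_uv = <sigma(P), y_u y_v^T>_F; for one-hot labels this reduces
# to looking up sigma(P)[class(u), class(v)]
E = np.zeros((n, n))
for u, v in edges:
    E[u, v] = np.trace(sigma(P).T @ np.outer(Y[u], Y[v]))
    E[v, u] = np.trace(sigma(P).T @ np.outer(Y[v], Y[u]))
np.fill_diagonal(E, sigma(p_s))           # Eq. 5: self-loop weights

# Eq. 6 (Neighbor Norm): divide e_uv by the weighted degree of the neighbor v
E_hat = E / E.sum(axis=0, keepdims=True)

# Eq. 7: one message-aggregation step (ReLU, identity projection)
X = rng.standard_normal((n, 4))           # toy node features
H = np.maximum(E_hat @ X, 0.0)
print(H.shape)                            # (5, 4)
```

During training, gradients would flow into P and p_s through the attention coefficients, so the importance of every (class-of-u, class-of-v) edge type is learned adaptively.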
[Fig. 1: An overall illustration of the proposed HA-GAT.]
5 Heterophily-Aware Graph Attention Network
The heterophily-aware attention scheme proposed in Sec. 4 can adaptively assign weights based on the given edge heterophily, thereby addressing the heterophily issue. Unfortunately, in real-world node classification tasks, the edge heterophily with respect to node labels cannot be directly computed when the majority of node labels are unavailable, especially in semi-supervised settings.
To address this challenge, this section introduces a Heterophily-Aware Graph Attention Network (HA-GAT) that fully explores and utilizes the local distribution as underlying heterophily. Subsequently, we compare our HA-GAT with typical attention-based GNNs and elucidate its connections to existing GNNs, for a deeper understanding. A comprehensive illustration of our HA-GAT is provided in Fig. 1.
5.1 Local Distribution Exploration
Local distribution describes a distribution measured by certain characteristics in each local neighborhood. It can be measured by the node label [26] as well as other local characteristics [28]. In general, the local distribution of each node can be derived from its local neighborhood, via

$$\mathbf{z}_v = f_e\!\left( \left\{ \mathbf{x}_u \mid u \in \mathcal{N}_v^{(k)} \right\} \right), \quad (8)$$

where $\mathbf{z}_v \in \mathbb{R}^{t}$ represents the local distribution of node $v$ and $\mathcal{N}_v^{(k)}$ stands for the collection of nodes within the $k$-hop local neighborhood of $v$. Note that $t$ is a preset integer constant which represents the number of underlying categories. Each element $z_{v,i}$ in $\mathbf{z}_v$ ($1 \le i \le t$) represents the probability of assigning category $i$ to node $v$, and $\sum_{i=1}^{t} z_{v,i} = 1$. Note that $f_e$ is expected to estimate the appropriate local distribution, according to certain underlying measurements, and thus facilitates the node classification task.
In this paper, instead of generating the local distribution by directly estimating the node labels or from well-designed handcrafted features [28], we excavate local information from the local smoothed features. This design is motivated by two aspects. Firstly, based on the analysis in [26] under certain ideal assumptions, the local smoothed features tend to contribute similarly to representing the heterophily, compared to the local distributions; under such circumstances, GCN can perform decently by obtaining the local smoothed features, because the local distributions according to the node labels are discriminative. Secondly, in practical situations, [28] empirically verifies that the local distributions, which are generated from the local node features and the topological information, can facilitate the weight assignment scheme. However, instead of generating the local distributions based on handcrafted characteristics of the neighborhoods, e.g., the local degree profile, the local smoothed features can be utilized to adaptively select appropriate characteristics as the underlying measurements.
Therefore, a 2-layered GCN is employed as an explorer network to excavate local distributions from the local smoothed features as the underlying heterophily. Then, Eq. 8 can be rewritten as

$$\mathbf{Z} = \mathrm{Softmax}\!\left( \hat{\mathbf{A}}\, \rho\!\left( \hat{\mathbf{A}} \mathbf{X} \mathbf{W}_0 \right) \mathbf{W}_1 \right), \quad (9)$$

where $\mathbf{Z} \in \mathbb{R}^{n \times t}$ represents the local distributions. As stated above, we assume that the local distribution reflects the corresponding node categories, i.e., it can be transformed to the category distribution by a certain linear transformation $\mathbf{M}$. Thus, we can utilize $\mathbf{Z}$ to approximately model the underlying heterophily by combining the linear projections $\mathbf{W}_1$ and $\mathbf{M}$. Then, $\mathbf{z}_v$, the $v$-th row of $\mathbf{Z}$, represents the category distribution among the $t$ underlying categories of node $v$. Note that $\rho(\cdot)$ is the ReLU function, $\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} (\mathbf{A} + \mathbf{I}) \tilde{\mathbf{D}}^{-1/2}$, and $\tilde{\mathbf{D}}$ is the degree matrix of $\mathbf{A} + \mathbf{I}$.
Then, the underlying heterophily preference of each edge is re-computed via

$$\mathbf{S}_{uv} = \mathbf{z}_u \otimes \mathbf{z}_v = \mathbf{z}_u \mathbf{z}_v^{\top}, \quad (10)$$

where $\mathbf{S}_{uv} \in \mathbb{R}^{t \times t}$ represents the probability distribution of the heterophilic types for edge $(u, v)$. Each element $(\mathbf{S}_{uv})_{ij} = z_{u,i}\, z_{v,j}$, where $1 \le i, j \le t$, represents the heterophily preference that edge $(u, v)$ connects two nodes with their categories being $i$ and $j$, respectively. Since $\sum_{i} z_{u,i} = 1$ and $\sum_{j} z_{v,j} = 1$, $\mathbf{S}_{uv}$ is also a probability matrix, i.e., $\sum_{i,j} (\mathbf{S}_{uv})_{ij} = 1$. Therefore, $\mathbf{S}_{uv}$ indicates the heterophily preference of each edge $(u, v)$, which is utilized in the heterophily-aware attention scheme to generate the attention coefficients and constructs our HA-GAT. Note that we denote the special variant of our HA-GAT with a prior edge heterophily according to the node labels, introduced in Sec. 4, as HA-GAT(L).
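The following NumPy sketch illustrates Eqs. 9 and 10 under stated assumptions (random, untrained weights on a toy path graph); variable names such as W0, W1, and A_hat are ours.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, h, t = 6, 8, 16, 3                  # nodes, input dim, hidden dim, categories
A = np.zeros((n, n))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    A[u, v] = A[v, u] = 1.0
X = rng.standard_normal((n, d))

# GCN-style normalized adjacency with self-loops: D^{-1/2} (A + I) D^{-1/2}
A_tilde = A + np.eye(n)
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Eq. 9: 2-layer GCN explorer mapping smoothed features to category
# distributions Z (each row sums to 1)
W0 = 0.1 * rng.standard_normal((d, h))
W1 = 0.1 * rng.standard_normal((h, t))
Z = softmax(A_hat @ np.maximum(A_hat @ X @ W0, 0.0) @ W1)

# Eq. 10: underlying heterophily preference of an edge (u, v)
u, v = 0, 1
S_uv = np.outer(Z[u], Z[v])               # t x t probability matrix
assert np.isclose(S_uv.sum(), 1.0)
```

In HA-GAT, the explorer network is optimized jointly with the classification network; the two-staged alternative is studied separately as the HA-GAT(T) variant in Sec. 5.2.2.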
5.2 Discussions with Existing GNNs
Here, we compare the proposed HA-GAT with typical attention-based GNNs, and analyze its connections to existing GNNs by proposing several HA-GAT variants.
5.2.1 Comparisons with Existing Attention-based GNNs
Currently, the attention-based GNNs [2, 19, 18, 17] usually exploit a self-attention strategy, where the attention coefficients on the edges are correlated to the representations of their correspondingly connected nodes, i.e.,

$$e_{uv}^{(l)} = f_{\mathrm{att}}\!\left( \mathbf{h}_u^{(l)}, \mathbf{h}_v^{(l)} \right), \quad (11)$$

where $e_{uv}^{(l)}$ is the attention coefficient of edge $(u, v)$ in the $l$-th layer and $\mathbf{h}_v^{(l)}$ is the representation of node $v$ in the $l$-th layer. Note that $f_{\mathrm{att}}$ is a well-designed attention function, which is designed differently in the existing methods [2, 19, 15]. Then, the attention coefficient is normalized by a Softmax function [2, 19], i.e.,

$$\hat{e}_{uv}^{(l)} = \mathrm{Softmax}_{v \in \mathcal{N}_u}\!\left( e_{uv}^{(l)} \right) = \frac{\exp\!\left( e_{uv}^{(l)} \right)}{\sum_{k \in \mathcal{N}_u} \exp\!\left( e_{uk}^{(l)} \right)}. \quad (12)$$
On the contrary, our HA-GAT calculates the underlying heterophily preference for each edge, which is correlated to the local neighborhood in the attributed graph, i.e.,

$$\mathbf{S}_{uv} = f_{\mathrm{het}}\!\left( \left\{ \mathbf{x}_w \mid w \in \mathcal{N}_u^{(k)} \cup \mathcal{N}_v^{(k)} \right\} \right), \quad (13)$$

where $f_{\mathrm{het}}$ calculates the heterophily preference from the local neighborhood. Then, $\mathbf{S}_{uv}$ is utilized to compute the attention coefficients in each layer as

$$e_{uv}^{(l)} = f_{\mathrm{parse}}^{(l)}\!\left( \mathbf{S}_{uv} \right), \quad (14)$$

where $f_{\mathrm{parse}}^{(l)}$ is the parsing function in the $l$-th layer, which is utilized to generate the heterophily-aware attention coefficient from the heterophily preference matrix.
Comparing Eq. 13 with Eq. 11, our $f_{\mathrm{het}}$ fully excavates the information in the local neighborhood, while $f_{\mathrm{att}}$ only measures the attention coefficient based on the pair-wise node representations. Although these pair-wise node representations contain information of the local neighborhood when $l > 1$, they are task-specific, i.e., they are restricted to maintain the classification information. Thus, their correspondingly calculated attention coefficients can hardly be optimal without fully considering the local neighborhood.
To clearly demonstrate the effectiveness of our attention scheme, we develop 1) a GAT-like HA-GAT variant, i.e., HA-GAT(G), which utilizes the pair-wise node representations in each HA-GAT layer to generate $\mathbf{S}_{uv}$, and 2) an MLP-based HA-GAT variant, i.e., HA-GAT(M), which replaces the 2-layered GCN with a 2-layered MLP as the explorer network.
5.2.2 Variants and Connections to existing GNNs
To better understand our HA-GAT, we discuss its connections to the existing GNNs by constructing three HA-GAT variants.
(1) When the underlying category dimension $t$ is set to 1, the calculation of the attention coefficient in Eq. 3 degenerates to

$$e_{uv}^{(l)} = \sigma\big(p^{(l)}\big) = \exp\big(\lambda\, p^{(l)}\big), \quad (15)$$

where $p^{(l)}$ is a single parameter value shared by all the inter-node edges. Then, this weight assignment scheme can only distinguish the inter-node connections from the self-loop connections. This variant is similar to GraphSAGE, except that it forces the neighbor projection matrix to be a multiple of the residual projection matrix, and it is denoted as HA-GAT(O).
(2) When the gradient scaling factor $\lambda$ is set to a very small value, e.g., $10^{-5}$, the gradients to the LAPs will also be very small. Then, each attention coefficient remains approximately 1, and our HA-GAT degenerates to a GCN variant with Neighbor Norm. This variant is denoted as HA-GAT(Z).
(3) BM-GCN [24] attempts to directly extract the edge heterophily according to the node labels and leverages a two-staged learning strategy. To demonstrate the superiority of our design, we set the category dimension $t$ of our HA-GAT to the number of classes $c$, and employ the same two-staged learning strategy as [24]. This variant is denoted as HA-GAT(T).
5.3 Complexity Analysis
Our HA-GAT has two backbone networks: 1) a 2-layered GCN as the explorer network; 2) the classification network.
Time Complexity. For the explorer network, the time complexity of each layer is $\mathcal{O}(|\mathcal{E}| d + n d^2)$, i.e., the time complexity of a GCN layer, where $d$ denotes the feature dimension. For the classification network, in each layer, the complexity of calculating the edge weights is $\mathcal{O}(|\mathcal{E}| t^2)$, and the complexity of the message aggregation step is still $\mathcal{O}(|\mathcal{E}| d + n d^2)$. Thus, the total time complexity of each classification layer is $\mathcal{O}(|\mathcal{E}| (d + t^2) + n d^2)$. Note that $t$ represents the number of underlying node categories, which is usually set to a relatively small number. We analyze the impacts of $t$ in Sec. 6.4. In practice, we set $t = 3$ for all the experiments in Tabs. 2 and 3.
Memory Complexity. For the explorer network, the memory complexity of each layer is $\mathcal{O}(d^2)$, i.e., the memory complexity of a GCN layer. In the classification network, the parameters of each layer are $\mathbf{W}^{(l)}$, $\mathbf{P}^{(l)}$, and $p_s^{(l)}$. Compared with a GCN layer, a classification network layer only introduces $t^2 + 1$ extra parameters to obtain the weights, so the complexity is $\mathcal{O}(d^2 + t^2)$, where $t$ is a very small number, as stated above.
Overall, for an L-layered HA-GAT, we utilize a 2-layered GCN as the explorer network and an L-layered classification network. The final time and memory complexities are the sums of the corresponding complexities of these (2+L) layers. Considering that $t$ is a relatively small number, e.g., $t = 3$ in our settings, the time and memory complexities of an L-layered HA-GAT are approximately those of an (L+2)-layered GCN.
Table 1: Statistics of the eight datasets.

Dataset | Actor | Chameleon | Squirrel | Texas | Cornell | Cora | CiteSeer | PubMed
---|---|---|---|---|---|---|---|---
Classes | 5 | 5 | 5 | 5 | 5 | 7 | 6 | 5
Features | 932 | 2325 | 2089 | 1703 | 1703 | 1433 | 3703 | 500
Nodes | 7600 | 2277 | 5201 | 183 | 183 | 2708 | 3327 | 19717
Edges | 26659 | 31371 | 198353 | 279 | 277 | 5278 | 4552 | 44324
Homophily ratio $\mathcal{H}$ | 0.158 | 0.248 | 0.218 | 0.087 | 0.306 | 0.825 | 0.706 | 0.762
6 Evaluations
In this section, we evaluate and analyze the effectiveness of our HA-GAT, which contains a local distribution exploration and a heterophily-aware attention scheme, on eight node classification datasets in both the supervised and semi-supervised settings, and seek an interpretation from their mechanism.
Table 2: Results (mean accuracy ± std) of supervised node classification.

Methods | Actor | Chameleon | Squirrel | Texas | Cornell | Cora | CiteSeer | PubMed
---|---|---|---|---|---|---|---|---
MLP | 40.26±1.08 | 48.94±1.86 | 31.43±1.25 | 89.29±5.43 | 89.07±4.20 | 75.97±1.71 | 76.01±1.33 | 86.78±0.44 |
GraphSAGE | 37.78±1.17 | 67.89±1.76 | 54.19±1.17 | 84.20±4.92 | 81.96±7.39 | 87.73±1.29 | 80.11±1.50 | 89.17±0.52 |
GCN | 33.31±1.09 | 67.96±2.13 | 53.85±1.25 | 77.00±6.82 | 77.54±4.21 | 87.17±1.23 | 80.80±1.15 | 87.71±0.42 |
GAT | 34.44±1.06 | 65.89±2.60 | 51.45±2.59 | 78.03±4.31 | 78.63±4.26 | 87.39±1.27 | 80.46±1.47 | 87.15±0.42 |
GATv2 | 34.94±1.21 | 66.93±2.25 | 50.27±1.61 | 79.29±4.03 | 80.33±3.33 | 88.43±1.31 | 80.52±1.18 | 88.06±0.40 |
Geom-GCN | 31.81±0.24 | 61.06±0.49 | 38.28±0.27 | 55.59±1.59 | 58.56±1.77 | 85.09±1.44 | 77.78±1.62 | 88.59±0.47 |
JKNet-GCN | 34.41±1.30 | 65.58±2.03 | 50.11±3.66 | 79.62±3.94 | 77.16±4.21 | 86.33±1.53 | 79.05±1.57 | 88.37±0.65 |
ChebyNet | 39.17±1.43 | 62.40±2.21 | 43.36±1.29 | 83.01±6.03 | 84.59±4.49 | 87.11±1.17 | 80.02±1.58 | 88.34±0.46 |
H2GCN | 36.35±0.76 | 65.43±2.27 | 58.13±1.59 | 69.47±5.13 | 73.68±7.67 | 80.00±0.70 | 64.06±1.52 | 76.00±0.44 |
DMP | 37.26±1.79 | M | M | 81.64±5.20 | 87.00±5.52 | 84.03±1.25 | 74.61±1.78 | M |
GPR-GNN | 39.21±1.27 | 67.52±2.37 | 52.06±1.85 | 89.52±4.41 | 92.79±3.12 | 88.41±1.35 | 80.27±1.42 | 89.31±0.87 |
BM-GCN | 39.56±1.43 | 66.93±1.92 | 47.32±1.42 | 78.14±4.80 | 81.97±6.13 | 87.34±1.22 | 79.65±1.72 | 87.00±0.74 |
HA-GAT | 41.96±1.36 | 72.58±2.08 | 69.02±1.50 | 91.75±3.52 | 93.44±2.84 | 88.26±1.24 | 81.91±1.20 | 89.52±0.52 |
HA-GAT(G) | 40.16±1.20 | 67.14±2.40 | 63.39±1.31 | 86.89±7.68 | 88.74±5.69 | 87.11±1.45 | 80.97±1.35 | 88.30±0.61 |
HA-GAT(M) | 42.16±1.14 | 69.22±1.68 | 62.73±1.49 | 91.28±3.22 | 92.84±3.52 | 87.76±1.40 | 81.05±1.01 | 88.76±0.54 |
HA-GAT(O) | 41.73±1.21 | 69.31±2.51 | 57.40±5.44 | 88.19±6.25 | 89.72±4.76 | 87.80±1.22 | 80.28±1.67 | 88.42±0.56 |
HA-GAT(Z) | 33.39±1.09 | 67.53±2.10 | 53.08±1.56 | 78.85±3.86 | 77.32±5.52 | 88.15±1.11 | 80.44±1.63 | 87.79±0.48 |
HA-GAT(T) | 35.07±3.09 | 70.75±2.02 | 61.00±1.89 | 84.48±6.51 | 79.51±7.08 | 86.80±1.54 | 80.27±1.58 | 87.93±0.53 |
HA-GAT(L) | 63.38±5.38 | 92.26±1.43 | 85.46±5.52 | 91.86±3.37 | 92.13±2.18 | 93.26±1.29 | 87.29±1.83 | 96.90±0.99 |
6.1 Experimental Settings
Datasets. We adopt eight node classification benchmarks with varying homophily ratios. Based on their homophily ratios, calculated using Eq. 1, they can be categorized into five heterophilic networks and three homophilic networks. Specifically, the homophilic networks contain three citation networks, including Cora, CiteSeer and PubMed [39, 40]. The heterophilic networks include two Wikipedia networks (i.e., Chameleon and Squirrel), one co-occurrence network (i.e., Actor), and two WebKB webpage networks (i.e., Texas and Cornell) [21]. Their details are presented in Tab. 1.
Baselines. We compare our method with 12 state-of-the-art methods. (1) One graph-agnostic method: Multi-layer Perception Machine (MLP); (2) Four Standard GNNs: GCN [1], GAT [2], GATv2 [19] and GraphSAGE [3]; (3) Seven heterophilic GNNs: JKNet-GCN [41], ChebyNet [32], Geom-GCN [21], H2GCN [20], GPR-GNN [34], DMP [22] and BM-GCN [24].
Setups. For the supervised node classification tasks, we follow the strategy in [34], which randomly splits 60%/20%/20% of the nodes into the training/validation/testing sets. For the semi-supervised setting, we adopt the public splits [40] for Cora, CiteSeer, and PubMed, and randomly split 10%/10%/80% of the nodes as training/validation/testing samples on the other heterophilic datasets. Every experiment is run 100 times with different random splits and initializations. In our evaluations, the category dimension $t$ is set to 3, and the gradient scaling factor $\lambda$ is tuned via grid search. A two-layered HA-GAT with a 64-neuron hidden layer is employed for both the supervised and semi-supervised settings. The cross-entropy loss is utilized as the loss function, and the Adam [42] optimizer is employed. The learning rate, weight decay rate, and dropout rate are obtained by the grid-search strategy. Note that, for fair comparisons, we utilize the same grid-search strategy for the existing GNNs.
[Fig. 2: Visualization of the learned LAPs of HA-GAT(L) on Chameleon.]
[Fig. 3: Visualization of the learned LAPs of HA-GAT and HA-GAT(M) on Actor.]
Table 3: Results (mean accuracy ± std) of semi-supervised node classification.

Methods | Actor | Chameleon | Squirrel | Texas | Cornell | Cora | CiteSeer | PubMed
---|---|---|---|---|---|---|---|---
MLP | 33.47±0.47 | 37.13±1.20 | 25.69±0.87 | 65.81±4.55 | 70.06±5.12 | 60.07±1.30 | 60.82±1.01 | 73.12±0.48 |
GCN | 26.06±1.21 | 52.32±1.42 | 31.26±1.83 | 56.35±4.95 | 49.93±5.29 | 81.69±0.67 | 71.58±0.55 | 79.15±0.49 |
GAT | 27.37±1.04 | 52.85±1.89 | 31.27±1.72 | 53.78±4.14 | 53.85±7.79 | 83.04±0.73 | 71.48±0.69 | 78.50±0.38 |
GPR-GNN | 32.51±1.05 | 47.13±4.10 | 33.11±1.81 | 65.54±6.85 | 65.54±6.85 | 83.43±0.69 | 71.41±0.80 | 79.21±0.53 |
BM-GCN | 30.84±0.90 | 54.13±1.41 | 33.94±1.34 | 64.86±7.00 | 54.86±9.32 | 81.01±0.51 | 69.34±1.12 | 76.50±0.62 |
HA-GAT | 35.14±0.68 | 58.84±1.84 | 45.67±0.97 | 75.47±5.32 | 73.71±5.66 | 83.43±0.43 | 72.51±0.62 | 79.68±0.34 |
6.2 Effectiveness of Heterophily-Aware Attention Scheme
The proposed heterophily-aware attention scheme attempts to find an effective LAP to help nodes extract appropriate information from both similar and dissimilar neighbors. As can be observed in Tab. 2, by exploiting the prior edge heterophily to identify edges according to the node labels, HA-GAT(L) achieves superior results on all the datasets, especially on the heterophilic graphs. It indicates that even on the heterophilic graphs, where the nodes usually connect with other dissimilar nodes, there still exists an effective LAP which can enable the nodes to acquire appropriate information from distinct neighbors. Note that since Texas and Cornell are quite small, their performances are vulnerable to the topological bias. Fig. 2 visualizes the learned LAPs of HA-GAT(L) on Chameleon, where the specific weight assignments from each type of nodes to all their neighbors can be observed. For example, in the first layer, the edges coming into the nodes of C5 have large weights, which suggests that the nodes of C5 may possess salient characteristics after the aggregation. Then, the classification tasks may benefit accordingly. (Note that the nodes of C5 represent the web pages with the most significant averaged traffic.)
6.3 Main Results of HA-GAT
When the prior edge heterophily is not given, our HA-GAT excavates effective local distributions as a substitute for the underlying heterophily. Here, we demonstrate the effectiveness of the proposed HA-GAT for both the supervised and semi-supervised node classification on real-world graphs with different homophily ratios.
Supervised Node Classification. Recent studies [26, 27] have proved that standard GNNs, such as GCN, can also achieve decent performances on heterophilic graphs where the node local distributions are distinguishable, e.g., on the Chameleon and Squirrel datasets. As can be observed in Tab. 2, all the GNNs outperform MLP on these two datasets. Our HA-GAT, which excavates effective local distributions and leverages a heterophily-aware attention scheme, can further improve the weight assignment scheme and achieve superior performances. For example, GCN improves the accuracy on Squirrel by 22.43% over MLP, because of the distinguishable local distributions, and obtains a performance comparable to the existing heterophilic GNNs. Our HA-GAT further achieves a gain of up to 15.17% over GCN, due to our heterophily-aware attention scheme, by fully considering the local distributions. On the Actor, Cornell, and Texas datasets, where the node local distributions are indistinguishable [26], the accuracies of standard GNNs, e.g., GCN and GAT, are lower than those of MLP. The heterophilic GNNs which emphasize ego-representations, e.g., GPR-GNN, can achieve excellent performances. However, as can be observed, few methods perform better than MLP on Actor. Since our attention scheme leverages an adaptive weight to model the importance of the self-loop connections and excavates useful information among the neighborhoods, our HA-GAT outperforms the others on these datasets. As observed in Fig. 3(a), both the edges and the self-loops are considered. Besides, our HA-GAT(Z) and HA-GAT(O), which are the equivalents of certain variants of GCN and GraphSAGE, achieve performances similar to their corresponding counterparts. Overall, the proposed HA-GAT achieves state-of-the-art performances on all the datasets; hence, it is applicable to graphs with different homophily ratios.
Semi-Supervised Node Classification. Under the semi-supervised settings, as can be observed from Tab. 3, all the results decrease significantly compared to those under the supervised settings. However, the relative performances of the baselines on different graphs are similar to those under the supervised settings. For example, MLP outperforms GCN on Actor, while GCN achieves better performances on the other datasets. An exception is BM-GCN, which directly estimates the edge heterophily according to the node labels and achieves relatively worse performances on the three homophilic datasets than the other GNN baselines. It suggests that estimating the label heterophily is not applicable under the semi-supervised settings, where only a minority of the nodes are labeled (e.g., CiteSeer only provides 120/500 labeled nodes for training and validation). On the contrary, the proposed HA-GAT can still excavate effective local distributions to represent the underlying heterophily, and achieves state-of-the-art results on all the datasets under the semi-supervised settings, which verifies the superiority of our method.
[Fig. 4: Analysis on Chameleon: (a) accuracies of HA-GAT with different category dimensions t; (b) the overall node categories and heterophily preference distribution when t = 2; (c), (d) the learned LAPs of the two layers; (e) the overall distribution for a larger t.]
6.4 Analysis of the Local Distribution Exploration
Our HA-GAT excavates effective local distributions from the local features and topology information, to represent the underlying heterophily. The local distribution exploration is thoroughly examined and analyzed in this subsection.
Effectiveness of the Local Distribution Exploration. As can be observed in Tab. 2, on the Chameleon and Squirrel datasets, our HA-GAT performs much better than HA-GAT(M), i.e., the HA-GAT variant with an MLP as the explorer network. Note that on the Squirrel dataset, our HA-GAT achieves a 6.29% gain, which demonstrates the necessity of the local distribution exploration. On heterophilic graphs with indistinguishable local label distributions, e.g., Actor, Texas, and Cornell, HA-GAT and HA-GAT(M) achieve approximately identical performances. These results suggest that, on such graphs, the GCN-based and MLP-based explorer networks tend to excavate similar information. Besides, as can be observed in Fig. 3, the weight for the self-loop connections is dominant in both HA-GAT and HA-GAT(M) on Actor. Meanwhile, the edge heterophily possesses limited effects, which further interprets their approximately identical performances. In addition, we also leverage two variants of our HA-GAT, i.e., HA-GAT(G) and HA-GAT(T), for more comparisons. In general, our HA-GAT outperforms these two variants on all the datasets, which further verifies our superiority.
Impacts of the Dimension t. The explorer network attempts to model the local distribution among $t$ categories. Fig. 4(a) gives the accuracies of our HA-GAT on Chameleon with different category dimensions $t$. As can be observed, the performance of HA-GAT is substantially improved by increasing $t$ from 1 to 2, which indicates that assigning nodes to only two underlying categories can already effectively model the underlying heterophily. When $t$ is greater than 2, the accuracy still increases, which suggests that our local distribution exploration can provide a greater representation ability with a larger category dimension.
Going Deeper into the Mechanism. To better understand the mechanism of our local distribution exploration, we employ $\sum_{(u,v) \in \mathcal{E}} \mathbf{S}_{uv}$ (i.e., the overall heterophily preference matrix) and $\sum_{v \in \mathcal{V}} \mathbf{z}_v$ (i.e., the overall node categories) to analyze the overall distributions of the heterophily information captured by our HA-GAT. According to the learned local distributions, the nodes are softly classified. As shown in Fig. 4(b), when $t = 2$, the nodes are classified into two underlying categories, T1 (with a score of 1598) and T2 (with a score of 678). According to the overall heterophily preference matrix, the majority of the edges connect different nodes within T1, while the others mainly connect a node in T1 with a node in T2. Note that only a minority of the edges connect different nodes within T2. This phenomenon indicates that HA-GAT can effectively capture the underlying heterophily information, i.e., it sparsely selects nodes (as the nodes in T2) which are distinctive from the others in the local neighborhood. Then, according to the learned LAPs, which are shown in Figs. 4(c) and 4(d), the weight of the edges which connect distinctive nodes is well modeled. When $t$ is increased to a larger number, a similar phenomenon can be observed, as shown in Fig. 4(e), i.e., the edges which connect the nodes in the underlying categories 2 to 5 are rare. By combining the underlying categories 2 to 5, we can generate LAPs and a distribution similar to those when $t = 2$. The results indicate that the improvement over $t = 2$ is achieved because the network attempts to further classify the selected nodes into four underlying categories. Then, the performances can further benefit from increasing $t$, as shown in Fig. 4(a).
6.5 Visualization of the Learned Weights
[Fig. 5: Layer-wise distributions of the learned weights in GCN, GAT, GATv2, and our HA-GAT on different datasets.]
To provide an intuitive understanding, this subsection visualizes the distributions of the learned weights in our HA-GAT, two attention-based GNNs (i.e., GAT and GATv2), and a structure-based GNN (i.e., GCN). All of these GNNs are constructed with two layers. Fig. 5 shows their layer-wise distributions of the learned weights. As can be observed, the distributions of the learned weights in GCN, GAT, and GATv2 are similar. Besides, there is little variation in their distributions across different layers. On the contrary, the distributions of our HA-GAT are diverse across different layers and datasets. Due to the employed Neighbor Norm, our HA-GAT can assign diverse weights, even larger than one, which makes our attention scheme more flexible and able to handle graphs with complex local distributions.
For further comparisons, we replace our normalization approach with the Softmax Norm (as shown in Eq. 18), which is employed in GAT and GATv2; we denote this variant as HA-GAT(S). Fig. 6 shows the differences between the weights learned by the different models and the Mean Norm weights (as shown in Eq. 16). GCN, which employs the GCN Norm (as shown in Eq. 17) to normalize the edge weights, is utilized as a baseline. As can be observed, the weights learned by GAT and GATv2 are very close to the averaged normalization, which indicates that they assign most edges equal importance. On the contrary, the heterophily-aware attention scheme in our HA-GAT(S) can generate diverse weights to differentiate various types of edges, which demonstrates the superiority of our attention scheme.
[Fig. 6: Differences between the weights learned by each model and the corresponding Mean Norm weights.]
[Fig. 7: Performances of GCN and HA-GAT with different normalization methods.]
6.6 Ablation Studies of Critical Designs
Here, we verify the effectiveness of our Neighbor Norm and the gradient scaling factor.
6.6.1 Neighbor Norm
Our HA-GAT employs the Neighbor Norm to normalize the attention coefficients, which exploits the weighted degree of the neighbor nodes as the normalization item. We leverage three commonly utilized normalization methods for comparison, i.e., Mean Norm, GCN Norm, and Softmax Norm. Their details are as follows.
Mean Norm. It utilizes the weighted degree of the central node as the normalization item, i.e.,

$$\hat{e}_{uv}^{(l)} = \frac{e_{uv}^{(l)}}{\sum_{k \in \tilde{\mathcal{N}}_u} e_{uk}^{(l)}}, \quad (16)$$

where $u$ is the central node and each $e_{uv}^{(l)}$ is the heterophily-aware attention coefficient for edge $(u, v)$ in the $l$-th layer, which is calculated via Eq. 3.
GCN Norm. It is utilized in GCN [1], which considers the weighted degrees of both the central node $u$ and the neighbor node $v$, i.e.,

$$\hat{e}_{uv}^{(l)} = \frac{e_{uv}^{(l)}}{\sqrt{\sum_{k \in \tilde{\mathcal{N}}_u} e_{uk}^{(l)}} \sqrt{\sum_{k \in \tilde{\mathcal{N}}_v} e_{kv}^{(l)}}}. \quad (17)$$

Although this method is usually employed for handling undirected graphs, we exploit it here to consider the effects of both the central node and the corresponding neighbor.
Softmax Norm. It utilizes the Softmax function to normalize the weights among all the edges coming into the central node and is usually used in the attention-based GNNs [2, 19], i.e.,

$$\hat{e}_{uv}^{(l)} = \frac{\exp\!\left( e_{uv}^{(l)} \right)}{\sum_{k \in \tilde{\mathcal{N}}_u} \exp\!\left( e_{uk}^{(l)} \right)}. \quad (18)$$

Note that when utilizing the Softmax Norm for our HA-GAT, we eliminate the $\exp$ function in Eq. 4.
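For reference, below is a compact sketch of the four normalization schemes, assuming a dense coefficient matrix E with zeros for absent edges, self-loop weights on the diagonal, and rows indexed by the central node; the function names are ours.

```python
import numpy as np

def mean_norm(E):      # Eq. 16: divide by the weighted degree of the central node u
    return E / E.sum(axis=1, keepdims=True)

def gcn_norm(E):       # Eq. 17: symmetric normalization by both endpoints
    deg = E.sum(axis=1)
    return E / np.sqrt(np.outer(deg, deg))

def softmax_norm(E):   # Eq. 18: softmax over the edges incident to the central node
    mask = E > 0
    M = np.where(mask, np.exp(E), 0.0)     # only existing edges compete
    return np.where(mask, M / M.sum(axis=1, keepdims=True), 0.0)

def neighbor_norm(E):  # Eq. 6: divide by the weighted degree of the neighbor v
    return E / E.sum(axis=0, keepdims=True)

# toy weighted coefficients on a 3-node graph with self-loops
E = np.array([[1.0, 2.0, 0.0],
              [2.0, 1.0, 4.0],
              [0.0, 4.0, 1.0]])
print(neighbor_norm(E).sum(axis=1))  # rows need not sum to 1, unlike the others
```

Note that, unlike the other three schemes, the rows produced by neighbor_norm need not sum to one, which is exactly the flexibility discussed in Sec. 6.5.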
As can be observed in Fig. 7, for both HA-GAT and GCN, four normalization methods achieve outstanding performances on CiteSeer, due to its simple local distributions. However, their performances are quite different on the heterophilic graphs. GCN and HA-GAT with Neighbor Norm outperform their corresponding variants with other normalization methods on Chameleon, which suggests that Neighbor Norm is more compatible under such circumstances. It leads to a flexible scoring mechanism and allows diverse normalized weights (shown in Sec. 6.5), thus it can be applied to more complex local distributions. Besides, Neighbor Norm functions decently in our heterophily-aware attention scheme. As shown in Fig. 4, it makes the characteristics of nodes from type T1 significant (in the first layer) and then uses them to influence other types of nodes (in the second layer). This observation further reveals the mechanism of our HA-GAT.
6.6.2 Gradient Scaling Factor
[Fig. 8: The learned LAPs of the first layer in a 2-layered HA-GAT with different gradient scaling factors λ on Squirrel.]
Here, we experimentally study the impacts of the employed gradient scaling factor $\lambda$. Fig. 8 visualizes the learned LAPs of the first layer in a 2-layered HA-GAT with different choices of the gradient scaling factor on Squirrel. Note that all the elements in the LAPs are initially set to ones. As observed, when $\lambda$ is a small value, i.e., 1e-5, the learned LAP is approximately equivalent to the initial LAP. Then, our HA-GAT degenerates to the GCN with Neighbor Norm. As $\lambda$ increases, the elements of the learned LAP become more diverse, i.e., our HA-GAT assigns the edge weights more boldly. It demonstrates that the gradient scaling factor can enlarge/shrink the gradients of the parsing matrix, hence enabling our LAP to adapt to distinct graphs.
6.7 Parameter Study
[Fig. 9: (a) Performances of HA-GAT with different numbers of hidden neurons; (b) performances of HA-GAT with different numbers of layers.]
Here, we explore the impact of two critical hyperparameters: the number of hidden neurons and the number of layers within our HA-GAT.
As depicted in Fig. 9(a), the performance of our HA-GAT increases with the number of hidden neurons and then stabilizes at high accuracies beyond a certain threshold. This suggests that a critical minimum number of hidden neurons is required for our HA-GAT to effectively learn node representations and accomplish the downstream tasks. According to this observation, we adopt HA-GAT with 64 hidden neurons in our experiments.
Fig. 9(b) displays the performance of our HA-GAT, configured with 64 hidden neurons, across varying numbers of layers. It is evident that the performance of our HA-GAT significantly benefits from increasing the number of layers from 1 to 2. However, the performance diminishes when the number of layers exceeds 2, owing to the over-smoothing issue [43]. This trend is consistent with findings in the broader GNN literature [1, 2]. According to the above analysis, a 2-layered HA-GAT is employed in all our settings.
6.8 Running Time
Table 4: Running times (s) of the experiments in Tab. 2.

Methods | Actor | Chameleon | Squirrel | Cornell
---|---|---|---|---
MLP | 1.86 ± 0.13 | 1.82 ± 0.27 | 1.10 ± 0.10 | 1.03 ± 0.05 |
GCN | 1.42 ± 0.02 | 6.73 ± 0.19 | 1.93 ± 0.04 | 2.17 ± 0.85 |
GAT | 2.52 ± 0.32 | 3.08 ± 0.55 | 3.72 ± 0.62 | 2.85 ± 0.45 |
GPR-GNN | 4.13 ± 0.17 | 3.67 ± 0.07 | 4.80 ± 1.19 | 3.89 ± 0.08
DMP | 20.52 ± 0.49 | M | M | 14.65 ± 1.32 |
BM-GCN | 115.13 ± 1.28 | 26.16 ± 5.85 | M | 21.01 ± 4.09 |
HA-GAT | 5.45 ± 0.75 | 4.20 ± 0.69 | 9.94 ± 3.83 | 3.72 ± 0.17 |
Tab. 4 presents the running times (s) of the experiments in Tab. 2. All the experiments are conducted on an NVIDIA GeForce RTX 2080Ti GPU. In our settings, all the methods except BM-GCN are trained for a maximum of 1000 epochs with an early-stopping patience of 200 epochs. For BM-GCN, we follow its two-staged learning strategy, which contains 400 and 1600 epochs for its two training stages. As can be observed, the reported running times of our HA-GAT match the complexity analysis.
7 Conclusion
This paper first reveals the benefits of modeling the edge heterophily for handling the heterophily problem, i.e., if a GNN assigns different weights to edges according to different heterophilic types, it can learn effective local attention patterns, enabling nodes to acquire appropriate information from distinct neighbors. Then, by fully exploring and utilizing the local distribution as the underlying heterophily, a novel Heterophily-Aware Graph Attention Network (HA-GAT) is proposed to handle networks with different homophily ratios. Extensive results have demonstrated that our HA-GAT achieves state-of-the-art performances on eight datasets with different homophily ratios. To further demonstrate the effectiveness of the proposed HA-GAT, we analyze the proposed heterophily-aware attention scheme and local distribution exploration, and seek an interpretation from their mechanism. More interpretations of the learned weights will require domain knowledge of the networks under study and are left as future work. We believe our work can be applied in various areas, such as social analysis [44, 45].
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under Grants 62272020, U20B2069, and 62376088, in part by State Key Laboratory of Software Development Environment (SKLSDE-2023ZX-16), in part by Hebei Natural Science Foundation No. F2024202047, in part by the Outstanding Research Project of Shen Yuan Honors College, BUAA under Grant 230123201, and in part by the Fundamental Research Funds for the Central Universities.
References
- [1] T. N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, in: ICLR, 2017.
- [2] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio, Graph attention networks, in: ICLR, 2018.
- [3] W. L. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs, in: NIPS, 2017, pp. 1024–1034.
- [4] J. Wang, Y. Wang, Z. Yang, L. Yang, Y. Guo, Bi-gcn: Binary graph convolutional network, in: IEEE CVPR, 2021, pp. 1561–1570.
- [5] L. Bai, Y. Jiao, L. Cui, L. Rossi, Y. Wang, P. S. Yu, E. R. Hancock, Learning graph convolutional networks based on quantum vertex information propagation, IEEE Transactions on Knowledge and Data Engineering 35 (2) (2021) 1747–1760.
- [6] L. Cui, L. Bai, X. Bai, Y. Wang, E. R. Hancock, Learning aligned vertex convolutional networks for graph classification, IEEE Transactions on Neural Networks and Learning Systems (2021).
- [7] J. Wang, Y. Guo, L. Yang, Y. Wang, Enabling homogeneous gnns to handle heterogeneous graphs via relation embedding, IEEE Transactions on Big Data 9 (6) (2023) 1697–1710.
- [8] J. Wang, Y. Guo, L. Yang, Y. Wang, Binary graph convolutional network with capacity exploration, IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (5) (2024) 3031–3046.
- [9] J. Qiu, J. Tang, H. Ma, Y. Dong, K. Wang, J. Tang, Deepinf: Social influence prediction with deep learning, in: ACM SIGKDD, 2018, pp. 2110–2119.
- [10] X. Weng, Y. Wang, Y. Man, K. M. Kitani, GNN3DMOT: graph neural network for 3d multi-object tracking with 2d-3d multi-feature learning, in: IEEE CVPR, 2020, pp. 6498–6507.
- [11] D. Marcheggiani, I. Titov, Encoding sentences with graph convolutional networks for semantic role labeling, in: EMNLP, 2017, pp. 1506–1515.
- [12] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, G. E. Dahl, Neural message passing for quantum chemistry, in: ICML, 2017, pp. 1263–1272.
- [13] S. Yun, M. Jeong, R. Kim, J. Kang, H. J. Kim, Graph transformer networks, in: NIPS, 2019, pp. 11983–11993.
- [14] A. Wijesinghe, Q. Wang, A new perspective on "How graph neural networks go beyond Weisfeiler-Lehman?", in: ICLR, 2022.
- [15] K. K. Thekumparampil, C. Wang, S. Oh, L.-J. Li, Attention-based graph neural network for semi-supervised learning, arXiv preprint arXiv:1803.03735 (2018).
- [16] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, D.-Y. Yeung, Gaan: Gated attention networks for learning on large and spatiotemporal graphs, arXiv preprint arXiv:1803.07294 (2018).
- [17] D. Kim, A. Oh, How to find your friendly neighborhood: Graph attention design with self-supervision, in: ICLR, 2021.
- [18] Y. Hou, J. Zhang, J. Cheng, K. Ma, R. T. Ma, H. Chen, M.-C. Yang, Measuring and improving the use of graph information in graph neural networks, in: ICLR, 2020.
- [19] S. Brody, U. Alon, E. Yahav, How attentive are graph attention networks?, in: ICLR, 2022.
- [20] J. Zhu, Y. Yan, L. Zhao, M. Heimann, L. Akoglu, D. Koutra, Beyond homophily in graph neural networks: Current limitations and effective designs, in: NIPS, Vol. 33, 2020, pp. 7793–7804.
- [21] H. Pei, B. Wei, K. C.-C. Chang, Y. Lei, B. Yang, Geom-gcn: Geometric graph convolutional networks, in: ICLR, 2020.
- [22] L. Yang, M. Li, L. Liu, C. Wang, X. Cao, Y. Guo, Diverse message passing for attribute with heterophily, in: NIPS, Vol. 34, 2021, pp. 4751–4763.
- [23] D. Bo, X. Wang, C. Shi, H. Shen, Beyond low-frequency information in graph convolutional networks, in: AAAI, Vol. 35, 2021, pp. 3950–3957.
- [24] D. He, C. Liang, H. Liu, M. Wen, P. Jiao, Z. Feng, Block modeling-guided graph convolutional neural networks, in: AAAI, Vol. 36, 2022, pp. 4022–4029.
- [25] J. Zhu, R. A. Rossi, A. Rao, T. Mai, N. Lipka, N. K. Ahmed, D. Koutra, Graph neural networks with heterophily, in: AAAI, Vol. 35, 2021, pp. 11168–11176.
- [26] Y. Ma, X. Liu, N. Shah, J. Tang, Is homophily a necessity for graph neural networks?, in: ICLR, 2022.
- [27] J. Wang, Y. Guo, L. Yang, Y. Wang, Understanding heterophily for graph neural networks, arXiv preprint arXiv:2401.09125 (2024).
- [28] X. Ma, Q. Chen, Y. Ren, G. Song, L. Wang, Meta-weight graph neural network: Push the limits beyond global homophily, in: WWW, 2022, pp. 1270–1280.
- [29] S. Luan, C. Hua, Q. Lu, J. Zhu, M. Zhao, S. Zhang, X.-W. Chang, D. Precup, Revisiting heterophily for graph neural networks, in: NeurIPS, Vol. 35, 2022, pp. 1362–1375.
- [30] L. Du, X. Shi, Q. Fu, X. Ma, H. Liu, S. Han, D. Zhang, Gbk-gnn: Gated bi-kernel graph neural networks for modeling both homophily and heterophily, in: WWW, 2022, pp. 1550–1558.
- [31] S. Suresh, V. Budde, J. Neville, P. Li, J. Ma, Breaking the limit of graph neural networks by improving the assortativity of graphs with local mixing patterns, in: ACM SIGKDD, 2021, pp. 1541–1551.
- [32] M. Defferrard, X. Bresson, P. Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, in: NIPS, 2016, pp. 253–261.
- [33] S. Abu-El-Haija, B. Perozzi, A. Kapoor, N. Alipourfard, K. Lerman, H. Harutyunyan, G. V. Steeg, A. Galstyan, Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing, in: ICML, 2019, pp. 21–29.
- [34] E. Chien, J. Peng, P. Li, O. Milenkovic, Adaptive universal generalized pagerank graph neural network, in: ICLR, 2021.
- [35] M. Liu, Z. Wang, S. Ji, Non-local graph neural networks, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021) 1–1.
- [36] T. Yang, Y. Wang, Z. Yue, Y. Yang, Y. Tong, J. Bai, Graph pointer neural networks, arXiv preprint arXiv:2110.00973 (2021).
- [37] D. Jin, Z. Yu, C. Huo, R. Wang, X. Wang, D. He, J. Han, Universal graph convolutional networks, in: NIPS, Vol. 34, 2021, pp. 10654–10664.
- [38] X. Li, R. Zhu, Y. Cheng, C. Shan, S. Luo, D. Li, W. Qian, Finding global homophily in graph neural networks when meeting heterophily, in: ICML, Vol. 162, 2022, pp. 13242–13256.
- [39] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, T. Eliassi-Rad, Collective classification in network data, AI magazine 29 (3) (2008) 93–93.
- [40] Z. Yang, W. W. Cohen, R. Salakhutdinov, Revisiting semi-supervised learning with graph embeddings, in: ICML, 2016, pp. 40–48.
- [41] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, S. Jegelka, Representation learning on graphs with jumping knowledge networks, in: ICML, 2018, pp. 5453–5462.
- [42] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: ICLR, 2015.
- [43] Q. Li, Z. Han, X. Wu, Deeper insights into graph convolutional networks for semi-supervised learning, in: AAAI, 2018, pp. 3538–3545.
- [44] Z. Wang, M. Jusup, L. Shi, J.-H. Lee, Y. Iwasa, S. Boccaletti, Exploiting a cognitive bias promotes cooperation in social dilemma experiments, Nature communications 9 (1) (2018) 2954.
- [45] Z. Wang, M. Jusup, R.-W. Wang, L. Shi, Y. Iwasa, Y. Moreno, J. Kurths, Onymity promotes cooperation in social dilemma experiments, Science advances 3 (3) (2017) e1601444.