
Rethinking Graph Transformer Architecture Design for Node Classification

Jiajun Zhou1,2,3, Xuanze Chen1,2,3, Chenxuan Xie1,2, Shanqing Yu1,2, Qi Xuan1,2, Xiaoniu Yang1,4
Abstract

The Graph Transformer (GT), as a special type of Graph Neural Network (GNN), utilizes multi-head attention to facilitate high-order message passing. However, this also imposes several limitations in node classification applications: 1) nodes are susceptible to global noise; 2) self-attention computation cannot scale well to large graphs. In this work, we conduct extensive observational experiments to explore the adaptability of the GT architecture to node classification tasks and draw several conclusions: the multi-head self-attention module in the current GT architecture can be completely replaced, while the feed-forward network module proves to be valuable. Based on this, we decouple the propagation (𝐏\mathbf{P}) and transformation (𝐓\mathbf{T}) of GNNs and explore a powerful GT architecture, named GNNFormer, which is built on P/T-combination message passing and adapted for node classification in both homophilous and heterophilous scenarios. Extensive experiments on 12 benchmark datasets demonstrate that our proposed GT architecture can effectively adapt to node classification tasks without being affected by global noise or computational efficiency limitations.

Introduction

Transformer can alleviate certain limitations encountered by Graph Neural Networks (GNNs) in graph-related tasks, such as over-squashing (Alon and Yahav 2021), long-range dependency (Rampášek et al. 2022; Shirzad et al. 2023; Ma et al. 2023), weak connectivity (Wu et al. 2022; Ying et al. 2021), and limited expressiveness (Kreuzer et al. 2021; Mialon et al. 2021; Rong et al. 2020). These advantages have motivated extensive research on Graph Transformer (GT), demonstrating its strong potential in tasks such as graph classification, link prediction, and node classification.

Despite the superior performance of GT over classical GNNs in certain scenarios, its adaptation to the prevalent node classification task exhibits several limitations, which can be summarized as follows: 1) The self-attention mechanism inevitably introduces noise. Previous studies (Xie et al. 2023; Zhu et al. 2020a; Abu-El-Haija et al. 2019) suggest that shallow GNNs can only capture local neighborhood information of nodes and cannot effectively access higher-order information. In contrast, GTs employ the self-attention mechanism to receive information from all other nodes, thereby capturing higher-order relationships such as long-range dependencies. However, self-attention actually disregards the topological structure of the graph and treats it as fully connected (Ying et al. 2021), which inevitably introduces irrelevant information, or noise, especially in homophilous scenarios. Thus, the adaptability of GTs relying on self-attention to the node classification task is still open to question. 2) The self-attention mechanism is unsuitable for large-scale graph computation. For vanilla GT, the computational complexity of self-attention is quadratic with respect to the number of nodes. In real-world scenarios such as social networks and e-commerce recommendations, graph-structured data often contains a massive number of nodes, making the direct application of the self-attention mechanism computationally and memory intensive. Therefore, GTs relying on self-attention are highly inefficient for node classification on large-scale graphs.

For the former limitation, several studies (Min et al. 2022; Dwivedi and Bresson 2020) utilize topological information as masks or biases to recalibrate the self-attention distribution, thereby reducing the introduction of noise to some extent. Regarding the latter limitation, several studies (Wu et al. 2022, 2024; Zhang et al. 2022b; Liu et al. 2023) attempt to improve the efficiency of GTs on large-scale graphs by reducing the computational complexity of attention computation or compressing the graph size. Despite the observed performance improvements, these studies persist in applying the GT architecture to node classification tasks without thoroughly investigating whether the current GT architecture is genuinely suitable for this purpose.

Figure 1: Comparison of node classification performance between vanilla GT, vanilla GNN, and variant GT.

In this work, we question this adaptability and obtain supporting observations, as shown in Figure 1, by substituting the multi-head self-attention module (MHA) in vanilla GT with a single graph attention layer (more details will be discussed later). Specifically, we find that: 1) the applicability of the self-attention module in the GT architecture to node classification is not guaranteed; 2) the feed-forward network module (FFN) in the GT architecture appears to be beneficial for node classification. Based on these extensive observations, we further disentangle message passing into propagation 𝐏\mathbf{P} and transformation 𝐓\mathbf{T}, and design a graph transformer architecture called GNNFormer that allows for flexible combinations of 𝐏\mathbf{P} and 𝐓\mathbf{T}, making it suitable for node classification in both homophilous and heterophilous scenarios. The main differences compared to the vanilla GT architecture are as follows: 1) substituting the multi-head self-attention module with stackable message-passing blocks (PT-blocks); 2) optimizing the feed-forward network with SwishGLU; 3) improving the residual connections to adaptive initial residual connections. The incorporation of these three key designs in GNNFormer mitigates global noise interference during message passing, facilitates effective fusion of graph attribute and structural features, and ultimately achieves powerful node representation learning. Extensive experiments on 12 homophilous and heterophilous benchmark datasets validate the effectiveness and superiority of our GNNFormer framework in node classification tasks, compared to existing graph transformers.

Preliminaries and Related Work

A graph is denoted as G=(V,E,𝑿,𝒀)G=(V,E,\boldsymbol{X},\boldsymbol{Y}), where VV and EE are the set of nodes and edges respectively, 𝑿|V|×d\boldsymbol{X}\in\mathbb{R}^{|V|\times d} is the node feature matrix, and 𝒀|V|×C\boldsymbol{Y}\in\mathbb{R}^{|V|\times C} is the node label matrix. Each node is associated with a feature vector 𝒙d\boldsymbol{x}\in\mathbb{R}^{d} and a one-hot encoded label vector 𝒚C\boldsymbol{y}\in\mathbb{R}^{C}. Here we use |V||V|, dd and CC to denote the number of nodes, the dimension of the node features, and the number of classes, respectively. The graph topology information (V,E)(V,E) can also be represented by an adjacency matrix 𝑨|V|×|V|\boldsymbol{A}\in\mathbb{R}^{|V|\times|V|}, where 𝑨ij=1\boldsymbol{A}_{ij}=1 indicates the existence of an edge between viv_{i} and vjv_{j}, and 𝑨ij=0\boldsymbol{A}_{ij}=0 otherwise. Based on the adjacency matrix, we can define the degree distribution of GG as a diagonal degree matrix 𝑫|V|×|V|\boldsymbol{D}\in\mathbb{R}^{|V|\times|V|} with 𝑫ii=j=1|V|𝑨ij\boldsymbol{D}_{ii}=\sum_{j=1}^{|V|}\boldsymbol{A}_{ij} representing the degree value of viv_{i}. Node classification is a fundamental task in graph machine learning. It involves assigning labels to the nodes of a graph based on their features and the graph topology structure.
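To make the notation concrete, here is a minimal sketch (illustrative, not taken from the paper) that builds the adjacency matrix and the diagonal degree matrix of a small undirected graph in PyTorch; the edge list is a hypothetical example.

```python
import torch

num_nodes = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]  # hypothetical edge set E

# Adjacency matrix A: A_ij = 1 iff an edge exists between v_i and v_j
A = torch.zeros(num_nodes, num_nodes)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0  # undirected graph: symmetric adjacency

# Diagonal degree matrix D with D_ii = sum_j A_ij
D = torch.diag(A.sum(dim=1))
```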

Disentanglement for GNN Architecture

Currently, most existing GNNs follow a unified message-passing framework (Gilmer et al. 2017), in which the message passing phase is decomposed into three processes: message generation, aggregation, and node feature update.

message: 𝒎ij(l)=MES(l)(𝒉j(l1),𝒉i(l1),𝒆ji)\displaystyle\boldsymbol{m}_{i\leftarrow j}^{(l)}=\text{MES}^{(l)}\left(\boldsymbol{h}_{j}^{(l-1)},\boldsymbol{h}_{i}^{(l-1)},\boldsymbol{e}_{ji}\right) (1)
aggregation: 𝒎i(l)=AGG({𝒎ij(l)j𝒩(i)})\displaystyle\boldsymbol{m}_{i}^{(l)}=\text{AGG}\left(\left\{\boldsymbol{m}_{i\leftarrow j}^{(l)}\mid j\in\mathcal{N}(i)\right\}\right)
update: 𝒉i(l)=UPD(l)(𝒉i(l1),𝒎i(l))\displaystyle\boldsymbol{h}_{i}^{(l)}=\text{UPD}^{(l)}\left(\boldsymbol{h}_{i}^{(l-1)},\boldsymbol{m}_{i}^{(l)}\right)

where 𝒎ij(l)\boldsymbol{m}_{i\leftarrow j}^{(l)} denotes the message sent from node vjv_{j} to viv_{i} at iteration step ll, and depends on the feature 𝒉j(l1)\boldsymbol{h}_{j}^{(l-1)} of the sending node, the feature 𝒉i(l1)\boldsymbol{h}_{i}^{(l-1)} of the receiving node and the feature 𝒆ji\boldsymbol{e}_{ji} of edge between them. The message function MES(l)\text{MES}^{(l)} can be a parameterized model such as MLPs. AGG is the aggregation function, such as summation, averaging and maximum, which is used to aggregate messages from the neighborhood 𝒩(i)\mathcal{N}(i) of the target node viv_{i}. The update function UPD(l)\text{UPD}^{(l)} can be a neural network that integrates the current node state and the aggregated messages to produce a new node state. From a more decoupled perspective, the message-passing phase can be decomposed into two functionally independent operations: Propagation and Transformation (Zhang et al. 2022a).

prop: 𝒉i(l)=𝐏(𝒉i(l1),{𝒉j(l1),𝒆jij𝒩(i)})\displaystyle\textbf{prop:~{}}\boldsymbol{h}_{i}^{(l)}=\mathbf{P}\left(\boldsymbol{h}_{i}^{(l-1)},\left\{\boldsymbol{h}_{j}^{(l-1)},\boldsymbol{e}_{ji}\mid j\in\mathcal{N}(i)\right\}\right) (2)
trans: 𝒉i(l)=𝐓(𝒉i(l))\displaystyle\textbf{trans:~{}}\boldsymbol{h}_{i}^{(l)}=\mathbf{T}\left(\boldsymbol{h}_{i}^{(l)}\right)

where 𝐏\mathbf{P} is the propagation function that combines message generation and aggregation from neighbor node vjv_{j} to target node viv_{i}. 𝐓\mathbf{T} performs a non-linear transformation on the state of the nodes after propagation. Based on this disentanglement, existing GNN architectures can be loosely categorized into four types according to the stacking order of propagation and transformation operations: 𝐏𝐓𝐏𝐓\mathbf{PTPT}, 𝐏𝐏𝐓𝐓\mathbf{PPTT}, 𝐓𝐓𝐏𝐏\mathbf{TTPP}, and 𝐓𝐏𝐓𝐏\mathbf{TPTP}, as listed in Table 1.

Table 1: Disentanglement for existing model architectures.
Category Method Message Passing
NN 𝐓𝐓\mathbf{TT} MLP, LINKX 𝐓(𝐓(𝑿))\mathbf{T}(\mathbf{T}(\boldsymbol{X}))
GNN 𝐏𝐓𝐏𝐓\mathbf{PTPT} GCN, GAT, SAGE, FSGNN 𝐓(𝐏(𝐓(𝐏(𝑿))))\mathbf{T}(\mathbf{P}(\mathbf{T}(\mathbf{P}(\boldsymbol{X}))))
𝐏𝐏𝐓𝐓\mathbf{PPTT} ACMGCN 𝐓(𝐓(𝐏(𝐏(𝑿))))\mathbf{T}(\mathbf{T}(\mathbf{P}(\mathbf{P}(\boldsymbol{X}))))
𝐓𝐓𝐏𝐏\mathbf{TTPP} GPRGNN, H2GCN, GloGNN 𝐏(𝐏(𝐓(𝐓(𝑿))))\mathbf{P}(\mathbf{P}(\mathbf{T}(\mathbf{T}(\boldsymbol{X}))))
𝐓𝐏𝐓𝐏\mathbf{TPTP} FAGCN 𝐏(𝐓(𝐏(𝐓(𝑿))))\mathbf{P}(\mathbf{T}(\mathbf{P}(\mathbf{T}(\boldsymbol{X}))))
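To make the disentanglement concrete, the sketch below uses a simple normalized-adjacency propagation as 𝐏\mathbf{P} and a linear-plus-ReLU layer as 𝐓\mathbf{T} (both illustrative stand-ins for the instantiations in Table 1) and shows how the four architecture types differ only in stacking order:

```python
import torch
import torch.nn as nn

def P(h, A_norm):
    """Propagation: aggregate neighbor states via a (normalized) adjacency matrix.
    GCN-, GAT- and SAGE-style propagation are different instantiations of this step."""
    return A_norm @ h

class T(nn.Module):
    """Transformation: graph-agnostic non-linear feature mapping."""
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, h):
        return torch.relu(self.lin(h))

# The four disentangled GNN families in Table 1 differ only in stacking order:
def ptpt(x, A, t1, t2): return t2(P(t1(P(x, A)), A))  # T(P(T(P(X))))
def pptt(x, A, t1, t2): return t2(t1(P(P(x, A), A)))  # T(T(P(P(X))))
def ttpp(x, A, t1, t2): return P(P(t2(t1(x)), A), A)  # P(P(T(T(X))))
def tptp(x, A, t1, t2): return P(t2(P(t1(x), A)), A)  # P(T(P(T(X))))
```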

Graph Transformer for Node Classification

The graph transformer, serving as a new graph representation learning architecture, is proposed to alleviate the limitations of GNNs in dealing with issues such as over-squashing (Alon and Yahav 2021), long-range dependency (Rampášek et al. 2022; Shirzad et al. 2023), weak connectivity (Wu et al. 2022; Ying et al. 2021), etc. Generally, the vanilla GT architecture broadly follows the classic Transformer (Vaswani et al. 2017) and consists of two essential modules: a multi-head self-attention module (MHA) and a feed-forward network (FFN), as shown in Figure 3(a). The MHA module first transforms the node features into query vectors (𝑸\boldsymbol{Q}), key vectors (𝑲\boldsymbol{K}) and value vectors (𝑽\boldsymbol{V}), then performs the inner-product operation on the former two to compute the attention scores, and ultimately employs the obtained scores for weighted aggregation of the value vector:

𝑸=𝑯𝑾Q,𝑲=𝑯𝑾K,𝑽=𝑯𝑾V𝑯=softmax(𝑸𝑲d)𝑽\begin{gathered}\boldsymbol{Q}=\boldsymbol{H}\boldsymbol{W}_{Q},\quad\boldsymbol{K}=\boldsymbol{H}\boldsymbol{W}_{K},\quad\boldsymbol{V}=\boldsymbol{H}\boldsymbol{W}_{V}\\ \boldsymbol{H}^{\prime}=\operatorname{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^{\top}}{\sqrt{d^{\prime}}}\right)\boldsymbol{V}\end{gathered} (3)

where 𝑯=[𝒉1,,𝒉|V|]|V|×d\boldsymbol{H}=\left[\boldsymbol{h}_{1}^{\top},\cdots,\boldsymbol{h}_{|V|}^{\top}\right]\in\mathbb{R}^{|V|\times d} denotes the input node embedding matrix, 𝑾Q\boldsymbol{W}_{Q}, 𝑾K\boldsymbol{W}_{K}, 𝑾Vd×d\boldsymbol{W}_{V}\in\mathbb{R}^{d\times d^{\prime}} are the projection matrices, and dd^{\prime} is the output hidden dimension. The FFN follows the MHA to enhance the expressiveness of GT through linear and non-linear transformations. Additionally, a residual connection (He et al. 2016) is employed to prevent gradients from vanishing as the model depth increases, followed by layer normalization (Ba, Kiros, and Hinton 2016) to stabilize model training and accelerate convergence.
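For reference, a single-head sketch of Eq. (3) is given below (the multi-head split and output projection are omitted for brevity; class and variable names are illustrative), which makes the quadratic cost in the number of nodes explicit:

```python
import math
import torch
import torch.nn as nn

class GlobalSelfAttention(nn.Module):
    """Single-head global self-attention as in Eq. (3); the score matrix Q K^T
    has shape [|V|, |V|], hence quadratic cost in the number of nodes."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.Wq = nn.Linear(d_in, d_out, bias=False)
        self.Wk = nn.Linear(d_in, d_out, bias=False)
        self.Wv = nn.Linear(d_in, d_out, bias=False)

    def forward(self, H):  # H: [|V|, d_in] node embedding matrix
        Q, K, V = self.Wq(H), self.Wk(H), self.Wv(H)
        scores = Q @ K.t() / math.sqrt(K.size(-1))   # [|V|, |V|] attention scores
        return torch.softmax(scores, dim=-1) @ V     # weighted aggregation of values
```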

In this paper, we primarily focus on research related to GT in node classification. Graph-BERT (Zhang et al. 2020) utilizes top-k affinity sampling to extract unlinked subgraphs as training samples, and incorporates three positional embeddings to pre-train a vanilla GT for node representation learning. Considering the importance of topology structure in graph representation learning, several studies (Dwivedi and Bresson 2020; Hussain, Zaki, and Subramanian 2022) integrate edge embeddings as additional inputs to the MHA to recalibrate the attention computation, thereby better utilizing the graph connectivity inductive bias. To address the efficiency limitations of vanilla GT on large-scale graphs, researchers have attempted to enhance the scalability of GT by reducing the computational complexity of attention or compressing the graph size. NodeFormer (Wu et al. 2022) introduces a kernelized Gumbel-Softmax operator to reduce the quadratic complexity of self-attention computation to linear complexity. SGFormer (Wu et al. 2024) designs a single-layer, single-head global attention model that avoids stacking multiple multi-head attention layers, significantly reducing computational complexity. NAGphormer (Chen et al. 2023) transforms the 1- to k-hop neighborhood representations of each node into a sequence of tokens and trains in a mini-batch manner to improve scalability to large-scale graphs. ANS-GT (Zhang et al. 2022b) and Gapformer (Liu et al. 2023) first utilize graph coarsening or pooling techniques to significantly compress the number of nodes in large-scale graphs, then train GT on the coarsened graph, thereby reducing computational overhead.

Figure 2: Comparison of node classification performance between different message passing architectures.

Motivating Observations

Vanilla GT for Node Classification Is Debatable

Compared to GNNs, GT actually employs self-attention to achieve global message passing. However, this practice also reveals the limitations of GT in node classification, which can be summarized as follows:

Proposition 1.  The self-attention mechanism in the vanilla GT architecture actually treats the graph as fully connected for global message propagation, inevitably introducing noise from irrelevant nodes. Additionally, the quadratic computational complexity of the attention mechanism makes GT inefficient for node classification on large-scale graphs. Therefore, the adaptability of the vanilla GT architecture for node classification tasks remains debatable.

Observation 1.  To illustrate our proposition, we first train a vanilla GT and a two-layer GAT (Veličković et al. 2018), using common settings of 4 attention heads, 64 hidden dimensions, 0.5 dropout rate, ReLU activation and 10 runs. The performance comparison of node classification on homophilous and heterophilous datasets is shown in Figure 1. It is evident that vanilla GT underperforms GAT on 8 out of the 12 benchmarks. Further, by replacing the MHA in vanilla GT with a single graph attention layer, we derive a variant GT that demonstrates not only greater stability but also significant performance improvements compared to vanilla GT in most cases. This indicates that the self-attention mechanism indeed introduces irrelevant noise during message passing, thereby limiting model performance. Additionally, the variant GT outperforms GAT in most cases, suggesting that incorporating a subsequent FFN offers valuable guidance for designing GT architectures more suited for node classification, meriting further exploration. Finally, vanilla GT also experiences out-of-memory (OOM) issues on several datasets, further highlighting its efficiency limitations.
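For clarity, the variant GT used in this observation can be sketched as follows; this is an illustrative reconstruction of the setup described above (using PyTorch Geometric's GATConv in place of the MHA), not the exact implementation:

```python
import torch.nn as nn
from torch_geometric.nn import GATConv

class VariantGTBlock(nn.Module):
    """Vanilla GT block with the MHA swapped for a single graph attention layer,
    i.e. local attention over edges instead of global all-pairs attention."""
    def __init__(self, dim, heads=4):
        super().__init__()
        # dim is assumed divisible by heads so the concatenated output keeps size dim
        self.attn = GATConv(dim, dim // heads, heads=heads)  # replaces the MHA module
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, h, edge_index):
        h = self.norm1(h + self.attn(h, edge_index))  # residual + LN, as in vanilla GT
        return self.norm2(h + self.ffn(h))            # FFN sub-layer is kept
```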

Figure 3: Illustration of graph transformer architectures.

Exploring Message Passing in GT Is Valuable

In this work, we regard GT and its variants as another form of GNNs. Based on the disentangled GNN architectures mentioned above, we can further approximate the vanilla GT architecture in Figure 3(a) as a 𝐓𝐏𝐓𝐓\mathbf{TPTT} form, where the Linear, MHA, and FFN modules can be respectively approximated as 𝐓\mathbf{T}, 𝐏\mathbf{P}, and 𝐓𝐓\mathbf{TT}. Considering that the instantiation of 𝐏\mathbf{P} with multi-head self-attention leads to unstable performance of GT in node classification, this drives us to rethink the design of the GT architecture.

Proposition 2.  To enhance the adaptability of the GT architecture for node classification tasks, rethinking the message passing design to replace the original MHA module is a straightforward practice.

Observation 2.  In Observation 1, we directly substitute the MHA module in vanilla GT with a single multi-head graph attention layer, yielding a variant GT. Although this architectural assembly is naive, it achieves commendable performance. Based on this, we further explore message passing designs based on the P/T combination (PT-block) that can replace the MHA in vanilla GT. We substitute the MHA in vanilla GT with different PT-blocks (𝐏𝐓𝐏𝐓\mathbf{PTPT}, 𝐏𝐏𝐓𝐓\mathbf{PPTT}, 𝐓𝐓𝐏𝐏\mathbf{TTPP}, and 𝐓𝐏𝐓𝐏\mathbf{TPTP}, with P instantiated by GAT-like propagation). These blocks use initial residual connections and layer normalization, making the resulting variants more aligned with the Transformer architecture. As shown in Figure 2, all models share common settings: 4 attention heads, 64 hidden dimensions, 0.5 dropout rate, ReLU activation and 10 runs. We can observe that these P/T-based variants outperform the naively assembled variant GT (GAT+FFN) in most cases, indicating that: 1) Message passing modules based on the P/T combination can be more flexibly integrated into the GT architecture; 2) Certain architectural components of Transformers, such as residual connections, also contribute to node representation learning. Moreover, the varying performance of different PT-blocks across different types of datasets further motivates us to explore more universal message passing designs within GT architectures.

GNNFormer: New GT Architecture Design

Based on the aforementioned motivating observations, we propose a novel graph transformer architecture named GNNFormer, specifically tailored for node classification. The architecture is illustrated in Figure 3(b). Compared to the vanilla GT block, the modifications are mainly reflected in three aspects: 1) the original multi-head self-attention module is substituted with stackable P/T-combination message passing blocks (PT-blocks); 2) the FFN is optimized via SwishGLU; 3) the original residual connection is improved to an adaptive initial residual connection. The design details of GNNFormer are discussed below.

As shown in Figure 3(a), vanilla GT learns node representations by stacking transformer blocks, with the input to the current block being the output of the previous block. In contrast, our GNNFormer consists of only a single transformer block, which takes node features and adjacency information as input and outputs the final node representations.

First, the input features 𝑿\boldsymbol{X} will be transformed into an initial feature embedding through a linear transformation parameterized by 𝑾0d×d\boldsymbol{W}_{0}\in\mathbb{R}^{d\times d^{\prime}} and a ReLU activation:

𝑯(0)=ReLU(𝑿𝑾0)\boldsymbol{H}^{(0)}=\operatorname{ReLU}\left(\boldsymbol{X}\boldsymbol{W}_{0}\right) (4)

where dd^{\prime} is the hidden dimension. Next, we stack message passing blocks, i.e., PT-blocks, to further learn node representations. Each PT-block consists of a message passing function ff, an initial residual connection, and a layer normalization operation LN()\operatorname{LN}\left(\cdot\right). For the ll-th PT-block, it takes the adjacency matrix 𝑨\boldsymbol{A} and the node representations 𝑯(l1)\boldsymbol{H}^{(l-1)} output by the previous PT-block as input, and outputs new node representations via propagation and transformation:

𝑯(l)=LN(αl𝑯(0)+(1αl)f(𝑯(l1),𝑨))\boldsymbol{H}^{(l)}=\operatorname{LN}\left(\alpha_{l}\cdot\boldsymbol{H}^{(0)}+(1-\alpha_{l})\cdot f\left(\boldsymbol{H}^{(l-1)},\boldsymbol{A}\right)\right)\\ (5)

where ff is selected from the candidate message passing set {𝐏𝐏\mathbf{PP}, 𝐏𝐓\mathbf{PT}, 𝐓𝐏\mathbf{TP}, 𝐓𝐓\mathbf{TT}}, and αl\alpha_{l} is a learnable parameter that controls the adaptive residual connection.
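A minimal sketch of one PT-block is given below, instantiating f as a 𝐓𝐏\mathbf{TP} combination with GCN-like propagation; squashing the learnable residual weight into (0, 1) with a sigmoid is an implementation assumption rather than a detail specified above:

```python
import torch
import torch.nn as nn

class PTBlock(nn.Module):
    """Sketch of Eq. (5): message passing f, adaptive initial residual, and LayerNorm.
    Here f is a T-P combination with GCN-like propagation; the other candidates
    (PP, PT, TT) follow the same pattern."""
    def __init__(self, dim):
        super().__init__()
        self.trans = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # T
        self.alpha = nn.Parameter(torch.tensor(0.0))                # learnable residual weight
        self.norm = nn.LayerNorm(dim)

    def forward(self, h, h0, A_norm):
        out = A_norm @ self.trans(h)    # f(H^(l-1), A): transformation then propagation
        a = torch.sigmoid(self.alpha)   # keep the mixing weight in (0, 1): an assumption
        return self.norm(a * h0 + (1.0 - a) * out)
```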

After message passing through ll PT-blocks, GNNFormer has effectively fused the attribute information of the nodes with the topological information. Furthermore, motivated by observation 1, which suggests that appending an FFN can enhance the expressive power of vanilla GNNs, we preserve the FFN module in the GNNFormer architecture. Specifically, the node representations 𝑯(l)\boldsymbol{H}^{(l)} will first be fed into a two-layer FFN that employs the SwishGLU in place of the first linear layer and the activation function:

𝒁=(Swish(𝑯(l)𝑾1)𝑯(l)𝑾2)𝑾3\boldsymbol{Z}^{\prime}=\left(\operatorname{Swish}\left(\boldsymbol{H}^{(l)}\boldsymbol{W}_{1}\right)\otimes\boldsymbol{H}^{(l)}\boldsymbol{W}_{2}\right)\boldsymbol{W}_{3} (6)

where 𝑾1\boldsymbol{W}_{1}, 𝑾2\boldsymbol{W}_{2}, 𝑾3d×d\boldsymbol{W}_{3}\in\mathbb{R}^{d^{\prime}\times d^{\prime}} are the transformation weight matrices, and \otimes is the element-wise multiplication. The SwishGLU (Shazeer 2020) combines the smoothness of the Swish (Ramachandran, Zoph, and Le 2017) activation function with the selectivity of Gated Linear Units (Dauphin et al. 2017). The former ensures more stable gradient updates during back-propagation, while the latter allows the GNNFormer to selectively propagate more important features, especially in heterophilous scenarios. Consequently, the enhanced FFN improves the non-linear expressive power of GNNFormer, enabling it to better capture complex node relationships and feature interactions within various graphs.
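A minimal sketch of this SwishGLU-based FFN is given below; bias terms and any hidden-dimension expansion are implementation choices not specified here:

```python
import torch.nn as nn
import torch.nn.functional as F

class SwishGLUFFN(nn.Module):
    """Sketch of Eq. (6): Z' = (Swish(H W1) ⊗ H W2) W3, with ⊗ the element-wise product."""
    def __init__(self, dim):
        super().__init__()
        self.W1 = nn.Linear(dim, dim, bias=False)
        self.W2 = nn.Linear(dim, dim, bias=False)
        self.W3 = nn.Linear(dim, dim, bias=False)

    def forward(self, H):
        # F.silu is the Swish activation with beta = 1
        return self.W3(F.silu(self.W1(H)) * self.W2(H))
```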

Subsequently, an initial residual connection and layer normalization will be applied to the output of the FFN:

𝒁=LN(β𝑯(0)+(1β)𝒁)\boldsymbol{Z}^{\prime}=\operatorname{LN}\left(\beta\cdot\boldsymbol{H}^{(0)}+(1-\beta)\cdot\boldsymbol{Z}^{\prime}\right)\\ (7)

where β\beta is a learnable parameter that controls the adaptive residual connection. Finally, we further integrate the topology information to obtain the final node representations:

𝒁=γ𝒁+(1γ)𝑨𝑾4\boldsymbol{Z}=\gamma\cdot\boldsymbol{Z}^{\prime}+(1-\gamma)\cdot\boldsymbol{A}\boldsymbol{W}_{4} (8)

where 𝑾4|V|×d\boldsymbol{W}_{4}\in\mathbb{R}^{|V|\times d^{\prime}} is the weight matrix that encodes the topology information, and γ\gamma is a learnable parameter that controls the contribution of topology information.

To achieve node classification, we finally use a prediction head fpredf_{\text{pred}} parameterized by 𝑾5d×C\boldsymbol{W}_{5}\in\mathbb{R}^{d^{\prime}\times C} and Softmax activation to obtain the node predictions:

𝒀^=Softmax(𝒁𝑾5)\hat{\boldsymbol{Y}}=\operatorname{Softmax}\left(\boldsymbol{Z}\boldsymbol{W}_{5}\right) (9)
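The output stage of Eqs. (7)-(9) can be sketched as follows (illustrative; as before, constraining the learnable weights with a sigmoid is an assumption, and the topology encoder is realized as a linear layer acting on the rows of the adjacency matrix):

```python
import torch
import torch.nn as nn

class OutputStage(nn.Module):
    """Sketch of Eqs. (7)-(9): adaptive initial residual + LayerNorm over the FFN
    output, fusion with a learned topology embedding A W4, and a softmax head."""
    def __init__(self, num_nodes, dim, num_classes):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.beta = nn.Parameter(torch.tensor(0.0))
        self.gamma = nn.Parameter(torch.tensor(0.0))
        self.W4 = nn.Linear(num_nodes, dim, bias=False)    # encodes each row of A (|V| -> d')
        self.W5 = nn.Linear(dim, num_classes, bias=False)  # prediction head

    def forward(self, Z_ffn, H0, A):
        b = torch.sigmoid(self.beta)                       # assumption: keep weights in (0, 1)
        Zp = self.norm(b * H0 + (1.0 - b) * Z_ffn)         # Eq. (7)
        g = torch.sigmoid(self.gamma)
        Z = g * Zp + (1.0 - g) * self.W4(A)                # Eq. (8): topology term A W4
        return torch.softmax(self.W5(Z), dim=-1)           # Eq. (9): class probabilities
```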

During model training, the cross-entropy classification loss is used as the optimization objective:

=trace(𝒀trainlog𝒀^train)\mathcal{L}=-\operatorname{trace}\left(\boldsymbol{Y}_{\text{train}}^{\top}\cdot\log\hat{\boldsymbol{Y}}_{\text{train}}\right) (10)

where the trace operation trace()\operatorname{trace}\left(\cdot\right) is used to compute the sum of the diagonal elements of the matrix.
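Since the label matrix is one-hot, the trace form above is equivalent to summing the negative log-probability of each training node's true class, i.e., the standard cross-entropy. A small sketch (the epsilon for numerical stability is an implementation choice):

```python
import torch

def trace_ce_loss(Y_hat_train, Y_train, eps=1e-12):
    """Eq. (10): L = -trace(Y^T log Y_hat) over the training nodes.
    With one-hot Y, the diagonal of Y^T log Y_hat collects, per class, the summed
    log-probabilities assigned to nodes of that class, so the trace equals the
    usual summed cross-entropy."""
    return -torch.trace(Y_train.t() @ torch.log(Y_hat_train + eps))
```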

Experiments

Datasets

We conduct extensive exploratory and evaluation experiments on 12 benchmark datasets, which include (1) Six homophilous datasets: Computers, Photo (McAuley et al. 2015), Coauthor CS, Coauthor Physics (Shchur et al. 2018), Wiki-CS (Mernyei and Cangea 2020), and Facebook (Rozemberczki, Allen, and Sarkar 2021); (2) Six heterophilous datasets: Actor (Tang et al. 2009), Chameleon-fix, Squirrel-fix, Tolokers, Roman-empire (Platonov et al. 2023), and Penn94 (Lim et al. 2021). All datasets are divided into training, validation, and testing sets in a proportion of 48%: 32%: 20%. More details are presented in Appendix A.

Baselines

To evaluate the effectiveness and superiority of the new GT architecture, we compare it against 15 baselines, including (1) MLP; (2) Classical GNNs: GCN (Kipf and Welling 2017), GAT (Veličković et al. 2018), GraphSAGE (Hamilton, Ying, and Leskovec 2017); (3) Heterophilous GNNs: LINKX (Lim et al. 2021), H2GCN (Zhu et al. 2020b), GPRGNN (Chien et al. 2021), FAGCN (Bo et al. 2021), ACMGCN (Luan et al. 2022), GloGNN (Li et al. 2022), FSGNN (Maurya, Liu, and Murata 2022); (4) Graph Transformers: vanilla GT, ANS-GT (Zhang et al. 2022b), SGformer (Wu et al. 2024), NAGphormer (Chen et al. 2023). More details are presented in Appendix B.

Experimental Settings

To ensure the reproducibility of our experiments, we utilize 10 random seeds to fix the data splits and model initialization, and report the average accuracy and standard deviation over 10 experimental runs. For all methods, we set the search space of common parameters as follows: maximum epochs to 500 with a patience of 100, hidden dimension dd^{\prime} to 64, optimizer to AdamW (Loshchilov and Hutter 2019), learning rate in {0.005, 0.01, 0.05, 0.1}, dropout rate in {0.1, 0.3, 0.5, 0.7, 0.9}. For our GNNFormer, we search the number of PT-blocks in {1, 2, 3}. For all baselines, we search the common parameters in the same parameter spaces; more default parameter details of baselines are reported in Appendix C. Moreover, GNNFormer is implemented in PyTorch 1.11.0 and PyTorch-Geometric 2.1.0 with CUDA 12.0 and Python 3.9. All experiments are conducted on an NVIDIA A100 40GB GPU.

Table 2: Node classification results: average test accuracy (%) ±\pm standard deviation. The best results are highlighted in bold, while the second-best results are underlined. “Local Rank” indicates the average performance ranking across homophilous or heterophilous datasets, “Global Rank” indicates the average performance ranking across all datasets.
Method | Computers | Photo | Coauthor CS | Coauthor Physics | Wiki-CS | Facebook | Local Rank | Actor | Chameleon-fix | Squirrel-fix | Tolokers | Roman-empire | Penn94 | Local Rank | Global Rank
Vanilla MLP 85.01 ±\pm 0.84 92.00 ±\pm 0.56 94.80 ±\pm 0.35 96.11 ±\pm 0.14 79.57 ±\pm 0.81 76.86 ±\pm 0.34 25.17 37.14 ±\pm 1.06 33.31 ±\pm 2.32 34.47 ±\pm 3.09 53.18 ±\pm 6.35 65.98 ±\pm 0.43 75.18 ±\pm 0.35 23.83 24.50
GCN 91.17 ±\pm 0.54 94.26 ±\pm 0.59 93.40 ±\pm 0.45 96.37 ±\pm 0.20 83.80 ±\pm 0.66 93.98 ±\pm 0.34 21.50 30.65 ±\pm 1.06 41.85 ±\pm 3.22 33.89 ±\pm 2.61 70.34 ±\pm 1.64 50.76 ±\pm 0.46 80.45 ±\pm 0.27 24.17 22.83
GAT 91.44 ±\pm 0.43 94.42 ±\pm 0.61 93.20 ±\pm 0.64 96.28 ±\pm 0.31 83.99 ±\pm 0.73 94.03 ±\pm 0.36 20.83 30.58 ±\pm 1.18 43.31 ±\pm 3.42 36.27 ±\pm 2.12 79.93 ±\pm 0.77 57.34 ±\pm 1.81 78.10 ±\pm 1.28 23.00 21.92
GraphSAGE 90.94 ±\pm 0.56 95.41 ±\pm 0.45 94.17 ±\pm 0.46 96.69 ±\pm 0.23 84.75 ±\pm 0.64 93.72 ±\pm 0.35 19.17 37.60 ±\pm 0.95 44.94 ±\pm 3.67 36.61 ±\pm 3.06 82.37 ±\pm 0.64 77.77 ±\pm 0.49 OOM 16.83 18.00
Heterophilous H2GCN 91.69 ±\pm 0.33 95.59 ±\pm 0.48 95.62 ±\pm 0.27 97.00 ±\pm 0.16 84.62 ±\pm 0.66 94.36 ±\pm 0.32 13.33 37.27 ±\pm 1.27 43.09 ±\pm 3.85 40.07 ±\pm 2.73 81.34 ±\pm 1.16 79.47 ±\pm 0.43 75.91 ±\pm 0.44 17.33 15.33
GPRGNN 91.80 ±\pm 0.55 95.44 ±\pm 0.33 95.17 ±\pm 0.34 96.94 ±\pm 0.20 84.84 ±\pm 0.54 94.84 ±\pm 0.24 12.33 36.89 ±\pm 0.83 44.27 ±\pm 5.23 40.58 ±\pm 2.00 73.84 ±\pm 1.40 67.72 ±\pm 0.63 84.34 ±\pm 0.29 17.67 15.00
FAGCN 89.54 ±\pm 0.75 94.44 ±\pm 0.62 94.93 ±\pm 0.22 96.91 ±\pm 0.27 84.47 ±\pm 0.75 91.90 ±\pm 1.95 21.00 37.59 ±\pm 0.95 45.28 ±\pm 4.33 41.05 ±\pm 2.67 81.38 ±\pm 1.34 75.83 ±\pm 0.35 79.01 ±\pm 1.09 14.00 17.50
ACMGCN 91.66 ±\pm 0.78 95.42 ±\pm 0.39 95.47 ±\pm 0.33 97.00 ±\pm 0.27 85.10 ±\pm 0.77 94.27 ±\pm 0.33 13.83 36.89 ±\pm 1.13 43.99 ±\pm 2.02 36.58 ±\pm 2.75 83.52 ±\pm 0.87 81.57 ±\pm 0.35 83.01 ±\pm 0.46 17.67 15.75
GloGNN 89.48 ±\pm 0.63 94.34 ±\pm 0.58 95.32 ±\pm 0.29 OOM 80.59 ±\pm 0.53 84.57 ±\pm 0.62 23.67 37.30 ±\pm 1.41 41.46 ±\pm 3.89 37.66 ±\pm 2.12 58.74 ±\pm 13.41 66.46 ±\pm 0.41 85.33 ±\pm 0.27 19.17 21.42
FSGNN 91.03 ±\pm 0.56 95.50 ±\pm 0.41 95.51 ±\pm 0.32 96.98 ±\pm 0.20 85.10 ±\pm 0.58 94.32 ±\pm 0.32 14.50 37.14 ±\pm 1.06 45.79 ±\pm 3.31 38.25 ±\pm 2.62 83.87 ±\pm 0.98 79.76 ±\pm 0.41 83.87 ±\pm 0.98 15.00 14.75
LINKX 90.75 ±\pm 0.36 94.58 ±\pm 0.56 95.52 ±\pm 0.30 96.93 ±\pm 0.16 83.51 ±\pm 0.78 93.84 ±\pm 0.32 18.50 31.17 ±\pm 0.61 44.94 ±\pm 3.08 38.40 ±\pm 3.54 77.55 ±\pm 0.80 61.36 ±\pm 0.60 84.97 ±\pm 0.46 19.50 19.00
GT Vanilla GT 84.41 ±\pm 0.72 91.58 ±\pm 0.73 94.61 ±\pm 0.30 OOM 79.05 ±\pm 0.97 OOM 26.17 37.08 ±\pm 1.08 44.27 ±\pm 3.98 39.55 ±\pm 3.10 72.24 ±\pm 1.17 OOM OOM 21.17 23.67
ANS-GT 90.01 ±\pm 0.38 94.51 ±\pm 0.24 93.93 ±\pm 0.23 96.28 ±\pm 0.19 83.27 ±\pm 0.49 92.61 ±\pm 0.16 22.67 37.80 ±\pm 0.95 40.74 ±\pm 2.26 36.65 ±\pm 0.80 76.91 ±\pm 0.85 80.36 ±\pm 0.71 OOM 18.17 20.42
NAGFormer 90.22 ±\pm 0.42 94.95 ±\pm 0.52 94.96 ±\pm 0.25 96.43 ±\pm 0.20 84.31 ±\pm 0.72 93.35 ±\pm 0.28 20.17 36.99 ±\pm 1.39 46.12 ±\pm 2.25 38.31 ±\pm 2.43 66.73 ±\pm 1.18 75.92 ±\pm 0.69 73.98 ±\pm 0.53 19.33 19.75
SGFormer 90.70 ±\pm 0.59 94.46 ±\pm 0.49 95.21 ±\pm 0.20 96.87 ±\pm 0.18 82.67 ±\pm 0.58 86.66 ±\pm 0.54 21.17 36.59 ±\pm 0.90 44.27 ±\pm 3.68 38.83 ±\pm 2.19 80.46 ±\pm 0.91 76.41 ±\pm 0.50 76.65 ±\pm 0.49 19.00 20.08
GNNFormer TP+TP GCN-like P 92.21 ±\pm 0.35 95.90 ±\pm 0.37 95.79 ±\pm 0.23 97.25 ±\pm 0.15 84.95 ±\pm 0.77 94.66 ±\pm 0.33 4.83 37.72 ±\pm 0.92 47.13 ±\pm 2.13 38.79 ±\pm 1.72 85.62 ±\pm 0.83 80.80 ±\pm 0.79 85.44 ±\pm 0.36 6.33 5.58
SAGE-like P 92.06 ±\pm 0.47 95.90 ±\pm 0.50 95.64 ±\pm 0.19 97.13 ±\pm 0.12 85.41 ±\pm 0.62 94.87 ±\pm 0.18 4.17 37.44 ±\pm 1.32 46.07 ±\pm 3.31 40.22 ±\pm 2.36 84.73 ±\pm 0.76 83.91 ±\pm 0.55 85.41 ±\pm 0.44 7.00 5.58
GAT-like P 91.62 ±\pm 0.40 95.92 ±\pm 0.46 95.69 ±\pm 0.29 97.15 ±\pm 0.13 85.34 ±\pm 0.68 94.83 ±\pm 0.18 5.50 37.45 ±\pm 1.19 46.12 ±\pm 3.63 39.98 ±\pm 2.04 84.94 ±\pm 0.73 83.20 ±\pm 0.79 85.32 ±\pm 0.41 8.17 6.83
GNNFormer PT+PT GCN-like P 92.01 ±\pm 0.36 95.55 ±\pm 0.17 95.79 ±\pm 0.23 97.24 ±\pm 0.12 84.73 ±\pm 0.61 94.54 ±\pm 0.32 8.33 37.53 ±\pm 0.87 45.90 ±\pm 2.26 40.20 ±\pm 2.78 85.66 ±\pm 0.69 79.38 ±\pm 0.43 85.40 ±\pm 0.35 8.33 8.33
SAGE-like P 91.73 ±\pm 0.38 95.69 ±\pm 0.34 95.70 ±\pm 0.20 97.12 ±\pm 0.13 85.14 ±\pm 0.74 94.79 ±\pm 0.29 8.17 37.99 ±\pm 1.31 46.52 ±\pm 3.39 40.13 ±\pm 2.68 84.22 ±\pm 0.54 82.21 ±\pm 0.44 85.38 ±\pm 0.29 6.17 7.17
GAT-like P 91.65 ±\pm 0.36 95.69 ±\pm 0.47 95.72 ±\pm 0.29 97.13 ±\pm 0.15 85.30 ±\pm 0.83 94.83 ±\pm 0.36 7.17 37.82 ±\pm 0.99 46.24 ±\pm 3.81 40.16 ±\pm 2.22 85.57 ±\pm 0.72 81.54 ±\pm 0.63 85.37 ±\pm 0.37 6.50 6.83
GNNFormer TT+PP GCN-like P 91.91 ±\pm 0.49 95.86 ±\pm 0.28 95.84 ±\pm 0.19 97.19 ±\pm 0.13 85.38 ±\pm 0.77 94.62 ±\pm 0.26 5.00 37.31 ±\pm 1.00 47.08 ±\pm 3.92 41.60 ±\pm 2.30 84.65 ±\pm 0.66 77.06 ±\pm 0.77 85.31 ±\pm 0.28 8.67 6.83
SAGE-like P 91.75 ±\pm 0.48 95.73 ±\pm 0.41 95.66 ±\pm 0.15 97.14 ±\pm 0.17 85.36 ±\pm 0.55 94.78 ±\pm 0.32 7.17 37.74 ±\pm 1.00 47.98 ±\pm 4.60 41.75 ±\pm 3.06 83.58 ±\pm 0.56 83.35 ±\pm 0.52 85.03 ±\pm 0.62 6.17 6.67
GAT-like P 92.08 ±\pm 0.15 95.88 ±\pm 0.43 95.69 ±\pm 0.25 97.13 ±\pm 0.20 85.32 ±\pm 0.47 94.79 ±\pm 0.28 5.00 37.45 ±\pm 0.82 46.97 ±\pm 4.23 41.57 ±\pm 2.03 84.65 ±\pm 0.71 83.41 ±\pm 0.51 85.10 ±\pm 0.40 6.67 5.83
GNNFormer PP+TT GCN-like P 91.95 ±\pm 0.41 95.68 ±\pm 0.16 95.68 ±\pm 0.16 97.13 ±\pm 0.14 85.03 ±\pm 0.78 94.64 ±\pm 0.40 9.17 37.05 ±\pm 0.98 46.29 ±\pm 2.28 41.57 ±\pm 3.03 84.01 ±\pm 0.74 74.48 ±\pm 1.13 85.23 ±\pm 0.27 12.00 10.58
SAGE-like P 92.00 ±\pm 0.34 95.80 ±\pm 0.48 95.67 ±\pm 0.17 97.08 ±\pm 0.14 85.33 ±\pm 0.59 94.74 ±\pm 0.33 7.83 37.84 ±\pm 0.85 46.74 ±\pm 3.92 41.62 ±\pm 3.55 83.79 ±\pm 0.73 81.65 ±\pm 0.46 85.18 ±\pm 0.51 6.67 7.25
GAT-like P 91.22 ±\pm 0.70 95.78 ±\pm 0.36 95.73 ±\pm 0.25 97.11 ±\pm 0.22 85.21 ±\pm 0.56 94.76 ±\pm 0.28 9.00 37.43 ±\pm 0.95 46.12 ±\pm 3.68 42.29 ±\pm 2.50 85.19 ±\pm 0.74 82.10 ±\pm 0.82 85.25 ±\pm 0.47 7.00 8.00

Evaluation on Node Classification

Table 2 reports the node classification results of all methods, from which we can draw the following conclusions:

1) Overall performance: GNNFormer consistently exhibits superior local and global average performance rankings on both homophilous and heterophilous datasets, surpassing the three types of baselines significantly in terms of performance superiority and stability;

2) Better universality for transformation-first message passing: A more universal GNNFormer architecture tends to prioritize message passing settings where transformation precedes propagation (i.e., 𝐓𝐏𝐓𝐏\mathbf{TPTP}, 𝐓𝐓𝐏𝐏\mathbf{TTPP}) for node representation learning. This is supported by the higher local and global performance rankings of GNNFormer (TP+TP, TT+PP). A reasonable explanation is that the transformation operation can map different types of node features into a unified feature space, achieving feature alignment, while also enhancing the expressive power of features through linear transformation and non-linear activation. Therefore, the strategy of transformation followed by propagation improves the model’s adaptability and performance across different types of graphs by unifying and enhancing node/edge feature representations before message propagation, resulting in better universality in both homophilous and heterophilous scenarios.

3) More significant preferences in heterophilous scenarios: Due to the diversity of heterophilous scenarios, the optimal architecture of GNNFormer varies for different heterophilous datasets. For example, on Squirrel-fix, GNNFormer tends to perform multiple propagations (𝐏𝐏\mathbf{PP}) to capture higher-order homophily, or multiple transformations (𝐓𝐓\mathbf{TT}) to adjust and transform features to adapt to heterophily. In contrast, on Tolokers and Penn94, GNNFormer prefers alternating transformation and propagation to adapt to heterophily via immediate feature adjustment and transformation. These observations drive us to explore the automatic search for GNNFormer architectures in future work.

Table 3: Ablation analysis of the GNNFormer.
GNNFormer variant | Computers | Photo | Coauthor CS | Coauthor Physics | Wiki-CS | Facebook | Actor | Chameleon-fix | Squirrel-fix | Tolokers | Roman-empire | Penn94 | Global Rank
Best architecture | TP+TP (GCN-like P) | TP+TP (GAT-like P) | TT+PP (GCN-like P) | TP+TP (GCN-like P) | TP+TP (SAGE-like P) | TP+TP (SAGE-like P) | PT+PT (SAGE-like P) | TT+PP (SAGE-like P) | PP+TT (GAT-like P) | PT+PT (GCN-like P) | TP+TP (SAGE-like P) | TP+TP (GCN-like P) | 1.17
Best architecture accuracy 92.21 ±\pm 0.35 95.92 ±\pm 0.46 95.84 ±\pm 0.19 97.25 ±\pm 0.15 85.41 ±\pm 0.62 94.87 ±\pm 0.18 37.99 ±\pm 1.31 47.98 ±\pm 4.60 42.29 ±\pm 2.50 85.66 ±\pm 0.69 83.91 ±\pm 0.55 85.44 ±\pm 0.36
w/o FFN 92.05 ±\pm 0.44 95.77 ±\pm 0.28 95.29 ±\pm 0.38 96.98 ±\pm 0.24 85.13 ±\pm 0.53 94.87 ±\pm 0.29 36.57 ±\pm 0.91 47.25 ±\pm 4.63 41.75 ±\pm 3.74 85.54 ±\pm 0.75 82.12 ±\pm 0.32 85.01 ±\pm 0.48 3.92
FFN(GEGLU) 92.00 ±\pm 0.34 95.81 ±\pm 0.34 95.17 ±\pm 0.37 97.23 ±\pm 0.15 85.17 ±\pm 0.50 94.79 ±\pm 0.24 37.66 ±\pm 1.13 47.53 ±\pm 4.22 40.56 ±\pm 2.11 85.46 ±\pm 0.49 83.69 ±\pm 0.45 85.40 ±\pm 0.36 3.75
FFN(ReGLU) 92.11 ±\pm 0.43 95.90 ±\pm 0.26 95.15 ±\pm 0.40 97.21 ±\pm 0.14 85.10 ±\pm 0.63 94.75 ±\pm 0.25 37.72 ±\pm 1.05 48.26 ±\pm 2.83 41.08 ±\pm 3.10 85.60 ±\pm 0.91 83.63 ±\pm 0.54 85.34 ±\pm 0.42 3.33
w/o AIRes 92.06 ±\pm 0.45 93.34 ±\pm 0.65 95.18 ±\pm 0.38 97.25 ±\pm 0.16 83.60 ±\pm 0.49 94.62 ±\pm 0.35 36.02 ±\pm 1.06 47.87 ±\pm 4.27 41.98 ±\pm 2.39 85.52 ±\pm 0.57 80.81 ±\pm 0.88 85.38 ±\pm 0.38 4.25
AIRes - Res 91.80 ±\pm 0.26 95.35 ±\pm 0.65 95.03 ±\pm 0.54 96.82 ±\pm 0.24 85.08 ±\pm 0.57 94.84 ±\pm 0.20 36.82 ±\pm 1.16 47.98 ±\pm 3.87 42.13 ±\pm 2.61 79.13 ±\pm 6.53 84.35 ±\pm 0.56 81.87 ±\pm 0.72 4.33
Figure 4: Impact of model depth.

More Analysis

Impact on FFN

We further analyze the impact of the FFN module on GNNFormer, as shown in Table 3. It is evident that when the FFN module is removed (w/o FFN), the model experiences significant performance degradation across all datasets, indicating that the FFN can effectively enhance the expressive power of GNNFormer. Furthermore, we compare the FFN equipped with GEGLU and ReGLU, and observe that FFN with SwishGLU generally maintains higher classification performance in most cases. This is because SwishGLU combines the smoothness of Swish with the flexibility of GLU—the former stabilizes gradient updates, while the latter helps GNNFormer selectively retain and propagate more informative features. Therefore, GNNFormer, utilizing SwishGLU, exhibits better non-linear expressive power, allowing it to capture complex node relationships and feature interactions within various graphs.
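For reference, the gated-FFN variants compared here differ only in the activation applied to the gate branch (Shazeer 2020); a minimal sketch with illustrative names:

```python
import torch.nn.functional as F

def gated_ffn(H, W1, W2, W3, act):
    """Generic GLU-style FFN: (act(H W1) ⊗ H W2) W3; the ablation swaps only `act`."""
    return (act(H @ W1) * (H @ W2)) @ W3

# SwishGLU: act = F.silu    GEGLU: act = F.gelu    ReGLU: act = F.relu
```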

Impact on Residual Connection

We further conduct an ablation analysis on the Adaptive Initial Residual Connection (AIRes) in the GNNFormer architecture, as shown in Table 3. When this component is removed (w/o AIRes), GNNFormer experiences a significant performance degradation, indicating that leveraging initial features can substantially enhance GNNFormer’s representation learning ability in diverse scenarios. Additionally, when we replace the AIRes with a standard residual connection, the model exhibits even more significant performance degradation in most cases. This suggests that while initial features are important, improperly utilizing them can introduce noise and interfere with the model’s representation learning.

Impact on Model Depth

To validate the immunity of our GNNFormer to over-smoothing, we conduct model depth experiments on four datasets. Specifically, we compare GNNFormer with vanilla GNNs (GCN, GAT) and a higher-order GNN method (H2GCN). The latter acquires higher-order information by stacking message-passing modules, whereas our GNNFormer achieves this by stacking PT-blocks while employing AIRes. Figure 4 shows the performance of all methods at different model depths. It is evident that the performance of vanilla GNNs rapidly declines as model depth increases, indicating the presence of over-smoothing. For H2GCN, performance gradually decreases as the model depth increases from 2 to 8 layers, and the model experiences memory overflow when the depth exceeds 16 layers. In contrast, our GNNFormer consistently maintains stable performance when stacking PT-block message-passing modules, indicating its immunity to the over-smoothing problem.

Efficiency Analysis

The efficiency analysis of all baselines and GNNFormer (TPTP) is shown in Figure 5. On the small-scale Chameleon-fix dataset, the runtime advantage of GNNFormer is not particularly evident. On Wiki-CS, where the number of nodes exceeds 10,000, the average runtime of GT methods increases by more than 2.5 times compared to Chameleon-fix, while the runtime of our GNNFormer remains relatively stable. On the Penn94 dataset, which contains over 40,000 nodes, GT methods involving quadratic-complexity attention computation encounter OOM issues, whereas the runtime of GNNFormer remains at a relatively low level. Therefore, our GNNFormer architecture demonstrates a clear advantage in computational efficiency compared to traditional GT architectures.

Figure 5: Average running time per epoch (ms).

Conclusion

To demonstrate the inherent misalignment of Graph Transformers with node classification tasks, we start with observational experiments and progressively develop the GNNFormer framework. Extensive experiments on 12 benchmark datasets demonstrate that, compared to existing Graph Transformer methods, our framework not only maintains superior performance but also improves computational efficiency. Additionally, we find that GNNFormer, which employs a transformation-first message-passing manner, exhibits greater universality in both homophilous and heterophilous scenarios. This insight also motivates us to explore the automatic search of GNNFormer architectures in future work, i.e., driving GNNFormer to automatically search for the optimal combination of transformation and propagation operations, which will further enhance the generalizability and superiority of our framework.

Acknowledgement

This work was supported in part by the Key R&D Program of Zhejiang under Grants 2022C01018 and 2024C01025, by the National Natural Science Foundation of China under Grants 62103374 and U21B2001.

References

  • Abu-El-Haija et al. (2019) Abu-El-Haija, S.; Perozzi, B.; Kapoor, A.; Alipourfard, N.; Lerman, K.; Harutyunyan, H.; Ver Steeg, G.; and Galstyan, A. 2019. Mixhop: Higher-order graph convolutional architectures via sparsified neighborhood mixing. In international conference on machine learning, 21–29. PMLR.
  • Alon and Yahav (2021) Alon, U.; and Yahav, E. 2021. On the Bottleneck of Graph Neural Networks and its Practical Implications. In International Conference on Learning Representations.
  • Ba, Kiros, and Hinton (2016) Ba, J. L.; Kiros, J. R.; and Hinton, G. E. 2016. Layer Normalization. stat, 1050: 21.
  • Bo et al. (2021) Bo, D.; Wang, X.; Shi, C.; and Shen, H. 2021. Beyond Low-frequency Information in Graph Convolutional Networks. In AAAI. AAAI Press.
  • Chen et al. (2023) Chen, J.; Gao, K.; Li, G.; and He, K. 2023. NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs. In The Eleventh International Conference on Learning Representations.
  • Chien et al. (2021) Chien, E.; Peng, J.; Li, P.; and Milenkovic, O. 2021. Adaptive Universal Generalized PageRank Graph Neural Network. In International Conference on Learning Representations.
  • Dauphin et al. (2017) Dauphin, Y. N.; Fan, A.; Auli, M.; and Grangier, D. 2017. Language Modeling with Gated Convolutional Networks. In International Conference on Machine Learning, 933–941. PMLR.
  • Dwivedi and Bresson (2020) Dwivedi, V. P.; and Bresson, X. 2020. A generalization of transformer networks to graphs. arXiv preprint arXiv:2012.09699.
  • Gilmer et al. (2017) Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural message passing for quantum chemistry. In International Conference on Machine Learning, 1263–1272. PMLR.
  • Hamilton, Ying, and Leskovec (2017) Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive Representation Learning on Large Graphs. Advances in Neural Information Processing Systems, 30.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Hussain, Zaki, and Subramanian (2022) Hussain, M. S.; Zaki, M. J.; and Subramanian, D. 2022. Global Self-attention As A Replacement for Graph Convolution. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 655–665.
  • Kipf and Welling (2017) Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In International Conference on Learning Representations.
  • Kreuzer et al. (2021) Kreuzer, D.; Beaini, D.; Hamilton, W.; Létourneau, V.; and Tossou, P. 2021. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34: 21618–21629.
  • Li et al. (2022) Li, X.; Zhu, R.; Cheng, Y.; Shan, C.; Luo, S.; Li, D.; and Qian, W. 2022. Finding Global Homophily in Graph Neural Networks When Meeting Heterophily. arXiv preprint arXiv:2205.07308.
  • Lim et al. (2021) Lim, D.; Hohne, F.; Li, X.; Huang, S. L.; Gupta, V.; Bhalerao, O.; and Lim, S. N. 2021. Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods. In Ranzato, M.; Beygelzimer, A.; Dauphin, Y.; Liang, P.; and Vaughan, J. W., eds., Advances in Neural Information Processing Systems, volume 34, 20887–20902. Curran Associates, Inc.
  • Liu et al. (2023) Liu, C.; Zhan, Y.; Ma, X.; Ding, L.; Tao, D.; Wu, J.; and Hu, W. 2023. Gapformer: graph transformer with graph pooling for node classification. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2196–2205.
  • Loshchilov and Hutter (2019) Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.
  • Luan et al. (2022) Luan, S.; Hua, C.; Lu, Q.; Zhu, J.; Zhao, M.; Zhang, S.; Chang, X.-W.; and Precup, D. 2022. Revisiting heterophily for graph neural networks. Advances in neural information processing systems, 35: 1362–1375.
  • Ma et al. (2023) Ma, L.; Lin, C.; Lim, D.; Romero-Soriano, A.; Dokania, P. K.; Coates, M.; Torr, P.; and Lim, S.-N. 2023. Graph inductive biases in transformers without message passing. In International Conference on Machine Learning, 23321–23337. PMLR.
  • Maurya, Liu, and Murata (2022) Maurya, S. K.; Liu, X.; and Murata, T. 2022. Simplifying approach to node classification in Graph Neural Networks. Journal of Computational Science, 62: 101695.
  • McAuley et al. (2015) McAuley, J.; Targett, C.; Shi, Q.; and van den Hengel, A. 2015. Image-Based Recommendations on Styles and Substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, 43–52.
  • Mernyei and Cangea (2020) Mernyei, P.; and Cangea, C. 2020. Wiki-CS: A Wikipedia-Based Benchmark for Graph Neural Networks. arXiv preprint arXiv:2007.02901.
  • Mialon et al. (2021) Mialon, G.; Chen, D.; Selosse, M.; and Mairal, J. 2021. Graphit: Encoding graph structure in transformers. arXiv preprint arXiv:2106.05667.
  • Min et al. (2022) Min, E.; Chen, R.; Bian, Y.; Xu, T.; Zhao, K.; Huang, W.; Zhao, P.; Huang, J.; Ananiadou, S.; and Rong, Y. 2022. Transformer for graphs: An overview from architecture perspective. arXiv preprint arXiv:2202.08455.
  • Platonov et al. (2023) Platonov, O.; Kuznedelev, D.; Diskin, M.; Babenko, A.; and Prokhorenkova, L. 2023. A Critical Look at The Evaluation of GNNs under Heterophily: Are We Really Making Progress? In The Eleventh International Conference on Learning Representations.
  • Ramachandran, Zoph, and Le (2017) Ramachandran, P.; Zoph, B.; and Le, Q. V. 2017. Searching for Activation Functions. arXiv preprint arXiv:1710.05941.
  • Rampášek et al. (2022) Rampášek, L.; Galkin, M.; Dwivedi, V. P.; Luu, A. T.; Wolf, G.; and Beaini, D. 2022. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35: 14501–14515.
  • Rong et al. (2020) Rong, Y.; Bian, Y.; Xu, T.; Xie, W.; Wei, Y.; Huang, W.; and Huang, J. 2020. Self-supervised graph transformer on large-scale molecular data. Advances in neural information processing systems, 33: 12559–12571.
  • Rozemberczki, Allen, and Sarkar (2021) Rozemberczki, B.; Allen, C.; and Sarkar, R. 2021. Multi-Scale attributed node embedding. Journal of Complex Networks, 9(2): cnab014.
  • Shazeer (2020) Shazeer, N. 2020. Glu Variants Improve Transformer. arXiv preprint arXiv:2002.05202.
  • Shchur et al. (2018) Shchur, O.; Mumme, M.; Bojchevski, A.; and Günnemann, S. 2018. Pitfalls of Graph Neural Network Evaluation. ArXiv, abs/1811.05868.
  • Shirzad et al. (2023) Shirzad, H.; Velingker, A.; Venkatachalam, B.; Sutherland, D. J.; and Sinop, A. K. 2023. Exphormer: Sparse transformers for graphs. In International Conference on Machine Learning, 31613–31632. PMLR.
  • Tang et al. (2009) Tang, J.; Sun, J.; Wang, C.; and Yang, Z. 2009. Social influence analysis in large-scale networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’09, 807–816. New York, NY, USA: Association for Computing Machinery. ISBN 9781605584959.
  • Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is All You Need. Advances in Neural Information Processing Systems, 30.
  • Veličković et al. (2018) Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph Attention Networks. In International Conference on Learning Representations.
  • Wu et al. (2022) Wu, Q.; Zhao, W.; Li, Z.; Wipf, D. P.; and Yan, J. 2022. Nodeformer: A scalable graph structure learning transformer for node classification. Advances in Neural Information Processing Systems, 35: 27387–27401.
  • Wu et al. (2024) Wu, Q.; Zhao, W.; Yang, C.; Zhang, H.; Nie, F.; Jiang, H.; Bian, Y.; and Yan, J. 2024. Simplifying and empowering transformers for large-graph representations. Advances in Neural Information Processing Systems, 36.
  • Xie et al. (2023) Xie, C.; Zhou, J.; Gong, S.; Wan, J.; Qian, J.; Yu, S.; Xuan, Q.; and Yang, X. 2023. PathMLP: Smooth Path Towards High-order Homophily. arXiv preprint arXiv:2306.13532.
  • Ying et al. (2021) Ying, C.; Cai, T.; Luo, S.; Zheng, S.; Ke, G.; He, D.; Shen, Y.; and Liu, T.-Y. 2021. Do transformers really perform badly for graph representation? Advances in neural information processing systems, 34: 28877–28888.
  • Zhang et al. (2020) Zhang, J.; Zhang, H.; Xia, C.; and Sun, L. 2020. Graph-bert: Only attention is needed for learning graph representations. arXiv preprint arXiv:2001.05140.
  • Zhang et al. (2022a) Zhang, W.; Sheng, Z.; Yin, Z.; Jiang, Y.; Xia, Y.; Gao, J.; Yang, Z.; and Cui, B. 2022a. Model degradation hinders deep graph neural networks. In Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, 2493–2503.
  • Zhang et al. (2022b) Zhang, Z.; Liu, Q.; Hu, Q.; and Lee, C.-K. 2022b. Hierarchical graph transformer with adaptive node sampling. Advances in Neural Information Processing Systems, 35: 21171–21183.
  • Zhu et al. (2020a) Zhu, J.; Yan, Y.; Zhao, L.; Heimann, M.; Akoglu, L.; and Koutra, D. 2020a. Beyond homophily in graph neural networks: Current limitations and effective designs. Advances in neural information processing systems, 33: 7793–7804.
  • Zhu et al. (2020b) Zhu, J.; Yan, Y.; Zhao, L.; Heimann, M.; Akoglu, L.; and Koutra, D. 2020b. Beyond homophily in graph neural networks: current limitations and effective designs. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20. Red Hook, NY, USA: Curran Associates Inc. ISBN 9781713829546.

Appendix

A. Dataset Details

  • Computers and Photo (McAuley et al. 2015) are segments of the Amazon co-purchase graph, where nodes represent products, edges represent the co-purchased relations of products, and features are bag-of-words vectors extracted from product reviews.

  • Coauthor CS and Coauthor Physics (Shchur et al. 2018) are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge, where nodes represent authors, edges indicate that the corresponding authors have co-authored a paper, features consist of keywords from each author’s published papers, and the class labels denote the most active research fields for each author.

  • Wiki-CS (Mernyei and Cangea 2020) is a Wikipedia-based dataset which is constructed from Wikipedia categories, specifically 10 classes corresponding to branches of computer science, with very high connectivity. The node features are derived from the text of the corresponding articles. They were calculated as the average of pretrained GloVe word embeddings, resulting in 300-dimensional node features.

  • Facebook (Rozemberczki, Allen, and Sarkar 2021) is a page-page graph of verified Facebook sites, where nodes correspond to official Facebook pages, links to mutual likes between sites, and features are extracted from the site descriptions.

  • Actor (Tang et al. 2009) is a network dataset designed for analyzing co-occurrence relationships among actors, where each node represents an actor, and the edges between nodes indicate their co-occurrence on the same Wikipedia page. The node features are constructed from keywords extracted from the respective actors’ Wikipedia pages.

  • Chameleon-fix and Squirrel-fix (Platonov et al. 2023) are two page-page networks focusing on specific topics in Wikipedia, where nodes represent web pages, and edges denote mutual links between the pages. The node features are composed of informative nouns extracted from the corresponding Wikipedia pages. The task of these datasets is to categorize the nodes into five distinct groups based on the average monthly traffic received by each web page.

  • Tolokers (Platonov et al. 2023) is a social network extracted from the Toloka crowdsourcing platform, where nodes represent workers and two workers are connected if they participate in the same task. The node features are constructed from the workers’ profile information and task performance statistics, while the labels indicate whether a worker is banned in a project.

  • Roman-empire (Platonov et al. 2023) is derived from the Roman Empire article on Wikipedia, where nodes represent words from the article and edges indicate word dependencies. The node features are constructed from word embeddings obtained using the FastText method, and labels denote the syntactic roles of the words.

  • Penn94 (Lim et al. 2021) is a Facebook social network, where nodes denote students and are labeled with the gender of users, and edges represent relationships between students. Node features are constructed from basic information about students, including major, second major/minor, dorm/house, year and high school.

Table 4: Summary of datasets used
Dataset | #Node Features | #Nodes | #Edges | #Classes
Computers 767 13752 491722 10
Photo 745 7650 238162 8
Coauthor CS 6805 18333 163788 15
Coauthor Physics 8415 34493 495924 5
Wiki-CS 300 11701 431726 10
Facebook 128 22470 342004 4
Actor 932 7600 30019 5
Chameleon-fix 2325 890 13584 5
Squirrel-fix 2089 2223 65718 5
Tolokers 10 11758 1038000 2
Roman-empire 300 22662 65854 18
Penn94 4814 41554 2724458 2

B. Baseline Details

Table 5: Optimal parameters for GNNFormer
Parameter | Computers | Photo | Coauthor CS | Coauthor Physics | Wiki-CS | Facebook | Actor | Chameleon-fix | Squirrel-fix | Tolokers | Roman-empire | Penn94
P type GCN-like P GAT-like P GCN-like P GCN-like P SAGE-like P SAGE-like P SAGE-like P SAGE-like P GAT-like P GCN-like P SAGE-like P GCN-like P
model type TPTP TPTP TTPP TPTP TPTP TPTP PTPT TTPP PPTT PTPT TPTP TPTP
learning rate 5e-3 5e-3 1e-2 5e-3 5e-3 5e-3 5e-3 5e-3 1e-2 5e-2 5e-3 5e-3
weight decay 5e-4 1e-5 1e-4 5e-5 5e-4 1e-4 1e-4 5e-4 5e-4 5e-3 5e-4 5e-5
dropout 0.7 0.5 0.1 0.7 0.3 0.5 0.5 0.9 0.5 0.3 0.3 0.5
  • MLP is a two-layer neural network that operates only on the original features of the nodes, without any propagation or aggregation rules.

  • GCN is a neural network that aggregates information among neighboring nodes through message passing.

  • GAT is a neural network that leverages multi-head attention to weight node features effectively on graph data.

  • SAGE is a graph neural network that learns node representations by sampling and aggregating neighborhood information.

  • H2GCN constructs a neural network by separating ego and neighbor embeddings, aggregating higher-order neighborhood information, and combining intermediate representations.

  • GPRGNN is a graph neural network that optimizes node feature and topology extraction by adaptively learning Generalized PageRank weights.

  • FAGCN is a novel graph convolutional network that integrates low and high-frequency signals through an adaptive gating mechanism.

  • ACMGCN adaptively employs aggregation, diversification, and identity channels to extract richer local information for each node at every layer.

  • GloGNN generates node embeddings by aggregating global node information and effectively captures homophily by learning a correlation matrix between nodes.

  • FSGNN is a simplified graph neural network model that enhances node classification performance by introducing a soft selection mechanism.

  • LINKX combines independent embeddings of the adjacency matrix and node features, generating predictions through a multi-layer perceptron and simple transformations.

  • ANS-GT is a graph transformer architecture that effectively captures long-range dependencies and global context information through adaptive node sampling and hierarchical graph attention mechanisms.

  • NAGphormer is a novel graph transformer that handles node classification tasks on large graphs by treating each node as a sequence of tokens aggregated from features of neighbors at various hops.

  • SGFormer is a simplified and efficient graph transformer model that handles large-scale graph data through a single-layer global attention mechanism, achieving node representation learning with linear complexity.

C. More Parameter Settings

We used the Neural Network Intelligence (NNI) tool for hyper-parameter tuning to conduct experiments on the baseline models. The experiments were conducted using the same base parameters as our method, along with specific parameters unique to each baseline model. The special parameters are as follows:

  • GloGNN: norm_layers within {1,2,3}, orders within {2,3,4}, term weight within {0,1}, weighting factor within {0,1,10} and {0.1,1,10,100,1000} and the balanced term parameters.

  • FSGNN: aggregator within {cat, sum}.

  • ANS-GT: data_augmentation within {4,8,16,32}, n_layer within {2,3,4} and batch size within {8,16,32}.

  • NAGphormer: hidden dimension within {128,256,512}, number of Transformer layers within {1,2,3,4,5} and number of propagation steps K within {7,10}.

  • SGFormer: number of global attention layers is fixed as 1, number of GCN layers within {1,2,3}, weight α\alpha within {0.5,0.8}.

Table 5 presents the optimal parameters for the GNNFormer framework, encompassing the learning rate, weight decay, and dropout as the key hyperparameters. Additionally, it highlights the optimal model type and the 𝐏\mathbf{P} operation type.