
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

ChebMixer: Efficient Graph Representation Learning with MLP Mixer

Xiaoyan Kui
Central South University
   Haonan Yan
Central South University
   Qinsong Li
Central South University
   Liming Chen
Ecole Centrale de Lyon
   Beiji Zou
Central South University
Abstract

Graph neural networks have achieved remarkable success in learning graph representations, especially graph Transformers, which have recently shown superior performance on various graph mining tasks. However, graph Transformers generally treat nodes as tokens, which results in quadratic complexity in the number of nodes during self-attention computation. The graph MLP Mixer addresses this challenge by using the efficient MLP Mixer technique from computer vision. However, the time-consuming process of extracting graph tokens limits its performance. In this paper, we present a novel architecture named ChebMixer, a graph MLP Mixer that uses fast Chebyshev polynomial-based spectral filtering to extract a sequence of tokens. First, we produce multiscale representations of graph nodes via fast Chebyshev polynomial-based spectral filtering. Next, we treat the multiscale representations of each node as a sequence of tokens and refine the node representation with an effective MLP Mixer. Finally, we aggregate the multiscale representations of the nodes through Chebyshev interpolation. Owing to the powerful representation capability and fast computational properties of the MLP Mixer, we can quickly extract more informative node representations to improve the performance of downstream tasks. The experimental results demonstrate significant improvements in a variety of scenarios ranging from graph node classification to medical image segmentation.

Figure 1: Applications of ChebMixer. Graph node classification and medical image segmentation.

1 Introduction

Graphs provide an extremely flexible model for approximating the data domains of a large class of problems. For example, computer networks, transportation (road, rail, airplane) networks, or social networks can all be described by weighted graphs, with the vertices corresponding to individual computers, cities, or people, respectively. The remarkable success of deep learning on text [38] and images [34], which are defined on regular domains, motivates extensions to irregular, non-Euclidean spaces such as manifolds and graphs. This has given rise to representation learning on irregular data, termed graph neural networks (GNNs) or geometric deep learning [4, 8, 5, 50]. With their rapid development, GNNs have achieved state-of-the-art performance on almost all tasks among various graph representation learning methods [51, 49, 48, 21].

GNNs can be roughly divided into two categories: spatial-based and spectral-based. Spatial GNNs [41, 31, 45, 1] imitate the convolution operator in Euclidean space, and their core idea is to aggregate one-hop neighborhood information, which can be implemented efficiently via the message-passing scheme. Spatial GNNs can learn long-range dependencies by increasing the number of layers, but then suffer from over-smoothing [29, 39] and over-squashing [43]. Spectral GNNs [13, 23, 47] are GNNs that design graph signal filters in the spectral domain of the graph Laplace operator. Theoretically, an arbitrary filter can be represented by choosing a suitable set of basis functions. However, spectral GNNs often overfit because sparse labeling leads them to learn illegal coefficients [24].

Recently, Transformers emerged as a new architecture and have shown superior performance on a variety of data with an underlying Euclidean or grid-like structure, such as natural language processing [30] and images [14, 36]. Their great modeling capability motivates generalizing Transformers to non-Euclidean data like graphs. Due to the irregular structure of graphs, the Transformer cannot be directly applied. Graphormer [52], SAN [33], and GPS [40] design powerful positional and structural embeddings to further improve its expressive power. However, graph Transformers generally treat nodes as tokens, which results in quadratic complexity in the number of nodes during self-attention computation. NAGphormer [9] therefore extracts the $K$-hop representations of each node from the adjacency matrix as a preprocessing step and treats each node as a sequence of tokens that is fed directly into the Transformer. However, once inside the Transformer, each node is optimized without considering the relationships between nodes, which makes the model prone to overfitting. Graph MLP Mixer [25] extracts graph patches or tokens via a graph clustering algorithm and uses the efficient MLP Mixer [42] from computer vision to learn token representations instead of the self-attention mechanism. However, the clustering algorithm is time-consuming, and the method is unsuitable for graph node and link prediction tasks. More importantly, most graph neural networks are generally validated only on graph benchmarks and do not verify performance on other data representations. We believe a unified framework based on graph neural networks is possible because of graphs' flexible and general representation capabilities.

To address these issues, we propose a novel, efficient graph neural network and demonstrate its effectiveness on diverse tasks from different data domains, ranging from graph node classification to medical image segmentation. Inspired by NAGphormer [9], we extract multiscale representations of each node and treat each node as a sequence of tokens. An influential way to extract multiscale node representations is the spectral graph wavelet transform [20], whose computation can be accelerated via spectral filtering based on wavelet filters. For simplicity, we use Chebyshev polynomials as filters to extract the $K$-hop representations of nodes due to their $K$-localized property [13]. Instead of costly self-attention computation, we use a more efficient MLP mixer to learn the representations of different-hop neighborhoods based on their semantic correlation. Finally, we aggregate the different-hop neighborhood representations via Chebyshev interpolation to avoid learning illegal coefficients [24]. Overall, due to the powerful representation capability and fast computational properties of the MLP Mixer, we can quickly extract more informative node representations that benefit downstream tasks. For validation, we apply our method to tasks from the fields of graph node classification and medical image segmentation and show that it outperforms state-of-the-art approaches. Moreover, it demonstrates that creating a unified architecture via graph representation learning for tasks on different data domains is possible. We summarize our main contributions as follows:

  • We present a novel graph MLP mixer for graph representation learning, which uses an MLP mixer to learn node representations of different-hop neighborhoods, leading to more informative node representations after aggregation.

  • In addition to the general task of graph node classification, we apply our method to a medical image segmentation task. We convert an image into a kNN graph and learn the image representation based on our proposed graph representation learning approach.

  • The experimental results demonstrate significant improvements in a variety of scenarios ranging from graph node classification to medical image segmentation. Moreover, they show that creating a unified architecture via graph representation learning for tasks on different data domains is possible.

2 Related Work

2.1 Graph Neural Networks (GNNs)

Graph Neural Networks (GNNs) are a successful tool for graph representation learning, which have achieved state-of-the-art performance on almost all tasks among various graph representation learning methods [51, 49, 48, 21]. Generally, GNNs can be roughly divided into two categories: spatial-based and spectral-based.

Spatial GNNs [41, 31, 45, 1] imitate the convolution operator in Euclidean space, and their core idea is to aggregate one-hop neighborhood information, which can be implemented efficiently via the message-passing scheme. Scarselli et al. [41] propose the first spatial graph convolution method by directly summing the neighborhood information of nodes, with residual and skip connections applied to preserve the information of each layer. However, it uses a non-standardized adjacency matrix, which leads to hidden node states with different scales. DCNN [1] considers graph convolution as a diffusion process, where information is transferred from one node to one of its neighboring nodes with a certain probability to balance the information distribution after several rounds. GraphSAGE [19] is a generalized inductive framework that aggregates information by uniformly sampling a fixed-size set of neighbors instead of using the complete set. The message passing neural network (MPNN) [18] utilizes K-step message-passing iterations to propagate information further. To reduce noise and improve performance, some researchers extend the attention operator to graphs by assigning varying weights to neighbors through the attention mechanism [45, 54]. Spatial GNNs can learn long-range dependencies by increasing the number of layers, but then suffer from over-smoothing [29, 39] and over-squashing [43].

Spectral GNNs [13, 23, 47] are GNNs that design graph signal filters in the spectral domain of the graph Laplace operator. This line of methods was first proposed by Bruna et al. [6] but suffers from the costly graph Fourier transform. He et al. [23] categorize spectral GNNs into two categories based on the type of filtering operation used. The first category comprises spectral GNNs with fixed filters, such as APPNP [32], which uses personalized PageRank to construct the filter function. The second category includes spectral GNNs with learnable filters. ChebNet [13] utilizes diagonal matrices based on Chebyshev-polynomial eigenvalue expansions to approximate graph convolution. GPRGNN [10] employs direct gradient descent on polynomial coefficients. Additionally, some works use other polynomial bases [23, 3, 56, 47], such as BernNet [23], which uses the Bernstein polynomial basis, and ARMA [3] and GNN-LF/HF [56], which utilize rational functions. Among the mentioned methods, GPRGNN and ChebNet are considered more expressive because the former can express all polynomial filters, and the latter uses Chebyshev polynomials, which form a complete basis of the polynomial space. Despite being theoretically more expressive, ChebNet often suffers from overfitting in experimental settings. To alleviate the overfitting problem in ChebNet, Kipf et al. propose GCN [31], simplifying ChebNet by using filters that operate on 1-hop neighborhoods of the graph. He et al. [24] theoretically demonstrate that the overfitting issue of ChebNet is caused by the illegal parameters it learns. To address this problem, they propose ChebNetII, which enhances the approximation of the original Chebyshev polynomials using Chebyshev interpolation.

2.2 Graph Transformer and MLP-Mixer

Graph Transformers have emerged as a new architecture and have shown superior performance on various graph mining tasks recently. The core idea of the graph Transformer is to learn node representations by integrating graph structure into the Transformer architecture. Recognizing the similarity between the attention weights and the weighted neighbor matrix of a fully connected graph, Dwivedi et al. [15] combine the Transformer with GNNs. Several studies have explored strategies to replace the traditional position encoding in the Transformer with powerful position and structure encodings based on relevant graph information. For instance, SAN [33], LSPE [16], and NAGphormer [9] use the Laplacian operator or random walk operator to learn the position encoding. Graphormer [52] develops a centrality encoding based on degree centrality, and spatial and edge encodings based on the shortest-path distance. In general, most graph Transformers can solve the over-squashing and limited long-range dependency issues of GNNs by using a non-local self-attention mechanism. However, they also increase the complexity from $O(E)$ to $O(N^{2})$, resulting in computational bottlenecks.

Due to the high computational cost of the Transformer, it has recently been challenged by a more time-efficient model called MLP-Mixer [42]. Instead of convolutional or attention mechanisms, MLP-Mixer uses multi-layer perceptrons, which are unaffected by over-squashing or limited long-range dependencies. MLP-Mixer was initially used for image-related tasks such as image classification [42, 44, 53] and image segmentation [35]. However, generalizing the MLP Mixer to graphs is challenging given their irregular and variable nature. He et al. [25] successfully generalize the MLP Mixer to graphs by using a graph clustering algorithm to extract graph patches or tokens. However, the clustering algorithm is time-consuming, and the method is unsuitable for graph node and link prediction tasks.

Figure 2: The proposed ChebMixer. ChebMixer first uses a novel neighborhood extraction module, the $K$-hop extractor, to generate multiscale (multi-hop) representations of graph nodes and treats them as a sequence of tokens. ChebMixer then refines the multi-hop representations of each graph node with an effective MLP Mixer and uses a novel aggregator to aggregate the multiscale representations of the nodes.

3 ChebMixer

In this section, we first introduce some essential knowledge related to graphs and spectral filtering. After that, we introduce the proposed ChebMixer in detail.

3.1 Preliminary

Graph. A graph is composed of a finite non-empty set of vertices and a set of edges, which can be represented as $G=(V,E)$, where $V$ is a finite set of $|V|=N$ vertices, $E$ is a set of edges, and $A\in\mathbb{R}^{N\times N}$ is a weighted adjacency matrix encoding the connection weight between two nodes. In graph machine learning and graph neural networks, the node features can be regarded as a matrix $X\in\mathbb{R}^{N\times d}$, where $d$ is the feature dimension of each node. Specifically, $A_{i,j}$ is set to 1 or the edge's weight if an edge exists from node $i$ to node $j$; otherwise, it is set to 0.

Spectral convolution is a graph convolution method based on spectral graph theory [11]. It converts the graph signal into the spectral domain, processes it with a spectral filter, and then converts it back into the spatial domain. Spectral convolution is based on the graph Laplacian $L$, whose combinatorial definition is $L=D-A$, where $D\in\mathbb{R}^{N\times N}$ is the diagonal degree matrix with $D_{ii}=\sum_{j}A_{ij}$, and whose symmetric normalized definition is $L=I_{N}-D^{-1/2}AD^{-1/2}$, where $I_{N}$ is the identity matrix. As $L$ is a real symmetric positive semidefinite matrix, it can be decomposed as $L=U\Lambda U^{T}$, where $U$ is the eigenvector matrix of the Laplacian and $\Lambda$ is the diagonal matrix whose diagonal elements are the eigenvalues of the Laplacian. Spectral convolution then filters the eigenvalues by designing different filters. Given a spectral filter $h(\lambda)$, the spectral convolution operator can be defined as

Y=U\,h(\Lambda)\,U^{T}X \qquad (1)

where $Y$ denotes the filtering result of $X$. Many methods learn the optimal spectral filter via a polynomial expansion $h(\lambda)\approx\sum^{K}_{k=0}\theta_{k}\lambda^{k}$, where the expansion coefficients $\theta_{k}$ are trainable. So we have

Y=U\,h(\Lambda)\,U^{T}X\approx U\sum^{K}_{k=0}\theta_{k}\Lambda^{k}U^{T}X\approx\sum^{K}_{k=0}\theta_{k}L^{k}X \qquad (2)
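The last step of Eq. (2) relies on the identity $U\Lambda^{k}U^{T}=L^{k}$, so the polynomial filter can be applied directly in the vertex domain without an eigendecomposition. Below is a minimal NumPy check of Eq. (2) on a toy graph; the graph, features, and coefficients are purely illustrative and not taken from our experiments.

```python
import numpy as np

# Toy undirected graph: a 4-cycle
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(deg ** -0.5)
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt     # symmetric normalized Laplacian

lam, U = np.linalg.eigh(L)                      # L = U diag(lam) U^T
X = np.random.randn(4, 3)                       # node features
theta = [0.5, -0.3, 0.2]                        # example polynomial coefficients

# Spectral form: U h(Lambda) U^T X with h(lam) = sum_k theta_k lam^k
h_lam = sum(t * lam**k for k, t in enumerate(theta))
Y_spectral = U @ np.diag(h_lam) @ U.T @ X

# Vertex-domain form: sum_k theta_k L^k X (no eigendecomposition needed)
Y_spatial = sum(t * np.linalg.matrix_power(L, k) @ X for k, t in enumerate(theta))

print(np.allclose(Y_spectral, Y_spatial))       # True
```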

ChebNet [13] is an outstanding contribution to the polynomial spectral convolution field, using Chebyshev polynomials to implement filtering, as shown below:

Y\approx\sum^{K}_{k=0}\theta_{k}T_{k}(\hat{L})X \qquad (3)

where $T_{k}(\hat{L})\in\mathbb{R}^{N\times N}$ is the Chebyshev polynomial of order $k$ evaluated at $\hat{L}$, and $\hat{L}$ denotes the scaled Laplacian, $\hat{L}=2L/\lambda_{max}-I_{N}$. The Chebyshev polynomial $T_{k}(x)$ of order $k$ can be computed recursively by $T_{k}(x)=2xT_{k-1}(x)-T_{k-2}(x)$ with $T_{0}(x)=1$ and $T_{1}(x)=x$. Due to the $K$-localized property of $T_{k}(\hat{L})$, we can reinterpret Eq. (3) as two steps: (1) extract the $K$-hop neighborhood representations; (2) aggregate the representations of the different hops. Our core idea is to use an effective MLP mixer to enhance the $K$-hop neighborhood representations and obtain more informative node features.

Figure 3: Architecture for medical image segmentation, which is composed of an encoder, a decoder, and skip connections. The encoder contains two modules: a patch merging layer and GraphProp. The former performs downsampling, and the latter is constructed based on ChebMixer for feature extraction. The encoded feature representations at each encoder layer are fed into the corresponding CNN decoder (i.e., the ResUpBlock in the figure, which performs upsampling) via skip connections. The final output of the decoder goes through a head to obtain the segmentation result.

3.2 Methods

An overview of our method is shown in Fig. 2. It includes three parts: extracting the $K$-hop representations of nodes via fast Chebyshev polynomial spectral filtering, mixing the $K$-hop representations of nodes with an effective MLP mixer, and aggregating the $K$-hop representations via Chebyshev interpolation.

$K$-hop extractor. We design a module for $K$-hop neighborhood computation based on Chebyshev polynomials. Specifically, for an input adjacency matrix $A\in\mathbb{R}^{N\times N}$ and node feature matrix $X\in\mathbb{R}^{N\times d}$, we first compute the scaled normalized Laplacian matrix $\hat{L}$. Using the recurrence relation of the Chebyshev polynomials, we compute the $K$-hop neighborhood feature embeddings of all nodes as in Algorithm 1. The final output is $X_{G}=(X^{0},X^{1},...,X^{k},...,X^{K})$, where $K$ is a hyperparameter, $X^{k}\in\mathbb{R}^{N\times d}$ denotes the $k$-th hop neighborhood information, and $X_{G}\in\mathbb{R}^{N\times(K+1)\times d}$ denotes the sequence of $K$-hop neighborhood features of all nodes.

Algorithm 1 $K$-hop extractor based on Chebyshev polynomials.
Input: Scaled normalized Laplacian matrix $\hat{L}$; feature matrix $X\in\mathbb{R}^{N\times d}$; Chebyshev polynomial order $K$
Output: The $K$-hop neighborhood representation of all nodes $X_{G}\in\mathbb{R}^{N\times(K+1)\times d}$
1:  $X_{G}[:,0,:]=X$, $X_{G}[:,1,:]=\hat{L}X$
2:  for $i=2$ to $K$ do
3:     $X_{G}[:,i,:]=2\hat{L}X_{G}[:,i-1,:]-X_{G}[:,i-2,:]$
4:  end for
5:  return $X_{G}\in\mathbb{R}^{N\times(K+1)\times d}$
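A possible PyTorch sketch of Algorithm 1 is given below. The function name, the dense-matrix arithmetic, and the common simplification $\lambda_{max}\approx 2$ for the scaled Laplacian are our own assumptions, not the released implementation.

```python
import torch


def cheb_khop_extractor(A: torch.Tensor, X: torch.Tensor, K: int) -> torch.Tensor:
    """Return X_G of shape (N, K+1, d): the K-hop token sequence of every node (assumes K >= 1)."""
    N = A.size(0)
    deg = A.sum(dim=1).clamp(min=1e-12)
    D_inv_sqrt = torch.diag(deg.pow(-0.5))
    L = torch.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt   # symmetric normalized Laplacian
    L_hat = L - torch.eye(N)                         # 2L/lambda_max - I_N, with lambda_max taken as 2

    tokens = [X, L_hat @ X]                          # T_0(L_hat) X and T_1(L_hat) X
    for _ in range(2, K + 1):                        # Chebyshev recurrence (Algorithm 1, line 3)
        tokens.append(2.0 * (L_hat @ tokens[-1]) - tokens[-2])
    return torch.stack(tokens, dim=1)                # (N, K+1, d)


# Example: 5 nodes with 8-dimensional features, K = 3
A = (torch.rand(5, 5) > 0.5).float()
A = torch.triu(A, 1)
A = A + A.T                                          # symmetric adjacency, no self-loops
X = torch.randn(5, 8)
print(cheb_khop_extractor(A, X, K=3).shape)          # torch.Size([5, 4, 8])
```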

$K$-hop mixer. Having obtained the $K$-hop neighborhood information sequence $X_{G}\in\mathbb{R}^{N\times(K+1)\times d}$, we treat each node as a sequence of tokens and follow the approach of Tolstikhin et al. [42], alternating $K$-hop and channel mixing steps using a simple mixer layer. Token mixing operates along the $K$-hop dimension, while channel mixing operates along the channel dimension. The mixer layer can be expressed as

X_{G} = X_{G}+\big(W_{2}\,\sigma(W_{1}\,\mathrm{LayerNorm}(X_{G}))\big) \qquad (4)
X_{G} = X_{G}+\big(W_{4}\,\sigma(W_{3}\,\mathrm{LayerNorm}(X_{G})^{T})\big)^{T}

where LayerNorm is layer normalization [2], $\sigma$ is a nonlinear activation function [27], and the learnable matrices are $W_{1}\in\mathbb{R}^{d_{s}\times(K+1)}$, $W_{2}\in\mathbb{R}^{(K+1)\times d_{s}}$, $W_{3}\in\mathbb{R}^{d_{c}\times d}$, and $W_{4}\in\mathbb{R}^{d\times d_{c}}$, where both $d_{s}$ and $d_{c}$ are hyperparameters.
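A minimal PyTorch sketch of one mixer layer consistent with Eq. (4) is shown below, where the two MLPs act along the hop axis and the channel axis respectively. The class and argument names are ours, and GELU [27] is assumed as the activation.

```python
import torch
import torch.nn as nn


class KHopMixerLayer(nn.Module):
    """One mixer layer over a (N, K+1, d) token sequence: hop mixing followed by channel mixing."""

    def __init__(self, num_hops: int, dim: int, ds: int, dc: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.hop_mlp = nn.Sequential(                 # mixes along the K+1 hop axis (W_1, W_2)
            nn.Linear(num_hops, ds), nn.GELU(), nn.Linear(ds, num_hops))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(             # mixes along the feature axis (W_3, W_4)
            nn.Linear(dim, dc), nn.GELU(), nn.Linear(dc, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, K+1, d)
        y = self.norm1(x).transpose(1, 2)             # (N, d, K+1)
        x = x + self.hop_mlp(y).transpose(1, 2)       # hop (token) mixing + residual
        x = x + self.channel_mlp(self.norm2(x))       # channel mixing + residual
        return x


# Example: mix the 4-hop token sequences of 5 nodes with 8-dimensional features
mixer = KHopMixerLayer(num_hops=4, dim=8, ds=16, dc=32)
print(mixer(torch.randn(5, 4, 8)).shape)              # torch.Size([5, 4, 8])
```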

$K$-hop aggregator. Finally, we aggregate the sequence of $K$-hop neighborhood features $X_{G}$ to produce more informative node features. To alleviate illegal coefficient learning, we adopt the method of ChebNetII [24], which reparameterizes the learned coefficients via Chebyshev interpolation. In contrast to ChebNetII, which uses the same parameters for all channels, we assert that different channels carry different information and should have different aggregation weights. Hence, we propose an aggregator with a set of learnable weights $W\in\mathbb{R}^{(K+1)\times d}$. We constrain $W$ according to Eq. 5 and aggregate the $K$-hop neighborhood information sequence according to Eq. 6.

W^{k}=\frac{2}{K+1}\sum_{j=0}^{K}\gamma_{j}T_{k}(x_{j}) \qquad (5)
X_{agg}=\sum_{k=0}^{K}W^{k}\times X_{G}^{k} \qquad (6)

where $W^{k}$ denotes the vector of the learnable weights $W$ at index $k$, $X_{G}^{k}$ denotes the feature matrix of the sequence at index $k$, $\gamma_{j}\in\mathbb{R}^{d}$ are the learnable parameters, and $x_{j}=\cos((j+1/2)\pi/(K+1))$ are the Chebyshev nodes of $T_{K+1}$. The detailed implementation is given in Algorithm 2.

Algorithm 2 $K$-hop aggregator
Input: Learnable weights $W\in\mathbb{R}^{(K+1)\times d}$; sequence of feature matrices $X_{G}\in\mathbb{R}^{N\times(K+1)\times d}$
Output: Aggregated node feature matrix $X_{agg}\in\mathbb{R}^{N\times d}$
1:  Clone $W$ to $\gamma$; initialize $X_{agg}$ with 0
2:  for $k=0$ to $K$ do
3:     $W[k]=\gamma[0]\times T_{k}(x_{0})$
4:     for $j=1$ to $K$ do
5:        $W[k]=W[k]+\gamma[j]\times T_{k}(x_{j})$
6:     end for
7:     $W[k]=2W[k]/(K+1)$
8:  end for
9:  for $k=0$ to $K$ do
10:     $X_{agg}=X_{agg}+X_{G}[:,k,:]\times W[k]$
11:  end for
12:  return Aggregated node feature matrix $X_{agg}\in\mathbb{R}^{N\times d}$
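A vectorized PyTorch sketch of Algorithm 2 and Eqs. (5)-(6) follows. The class name and the initialization of $\gamma$ (here simply ones rather than a clone of $W$) are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn


class KHopAggregator(nn.Module):
    """Aggregate (N, K+1, d) hop tokens into (N, d) with per-channel Chebyshev-interpolated weights."""

    def __init__(self, K: int, dim: int):
        super().__init__()
        self.K = K
        self.gamma = nn.Parameter(torch.ones(K + 1, dim))       # learnable gamma_j, one per channel

        # Precompute T_k(x_j) at the Chebyshev nodes x_j = cos((j + 1/2) * pi / (K + 1))
        xj = torch.cos((torch.arange(K + 1) + 0.5) * math.pi / (K + 1))
        T = torch.zeros(K + 1, K + 1)                           # T[k, j] = T_k(x_j)
        T[0] = 1.0
        if K >= 1:
            T[1] = xj
        for k in range(2, K + 1):
            T[k] = 2.0 * xj * T[k - 1] - T[k - 2]
        self.register_buffer("T", T)

    def forward(self, x_g: torch.Tensor) -> torch.Tensor:      # x_g: (N, K+1, d)
        W = 2.0 / (self.K + 1) * (self.T @ self.gamma)          # Eq. (5): (K+1, d)
        return (x_g * W.unsqueeze(0)).sum(dim=1)                # Eq. (6): (N, d)


agg = KHopAggregator(K=3, dim=8)
print(agg(torch.randn(5, 4, 8)).shape)                          # torch.Size([5, 8])
```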
Method Cora Citeseer Pubmed Computers Photo CoauthorCS ogbn-arxiv
MLP 72.91 ± 0.16 76.18 ± 0.70 84.08 ± 0.11 74.20 ± 0.85 74.89 ± 0.42 94.51 ± 0.15 51.58 ± 0.14
GCN [31] 88.61 ± 0.20 79.38 ± 0.33 86.34 ± 0.08 79.55 ± 0.11 88.47 ± 0.20 94.61 ± 0.03 66.19 ± 0.57
GAT [45] 86.35 ± 0.40 79.48 ± 0.66 85.98 ± 0.20 81.56 ± 0.63 86.96 ± 0.37 93.38 ± 0.18 64.86 ± 0.34
APPNP [32] 88.65 ± 0.20 79.33 ± 0.84 85.79 ± 0.21 72.29 ± 0.20 84.31 ± 1.72 94.07 ± 0.10 65.98 ± 0.29
GPRGNN [10] 89.29 ± 0.36 80.33 ± 0.09 87.94 ± 0.28 89.08 ± 0.96 94.33 ± 0.36 95.13 ± 0.06 69.86 ± 0.70
ChebNet [13] 84.94 ± 0.32 78.16 ± 0.42 88.25 ± 0.11 90.60 ± 0.13 93.88 ± 0.14 94.49 ± 0.11 70.79 ± 0.20
ChebNetII [24] 89.69 ± 0.64 80.94 ± 0.37 88.93 ± 0.29 81.76 ± 2.27 89.65 ± 0.69 94.53 ± 0.28 72.32 ± 0.23
NAGphormer [9] 87.68 ± 0.52 77.12 ± 0.80 89.02 ± 0.19 91.22 ± 0.14 95.49 ± 0.11 95.75 ± 0.09 71.01 ± 0.13
Ours 89.46 ± 0.24 81.04 ± 0.28 89.22 ± 0.35 92.95 ± 0.20 96.19 ± 0.17 95.53 ± 0.13 73.28 ± 0.06
Table 1: Comparison of all models in terms of mean accuracy ± stdev (%) on graph node classification. The best results appear in bold. Blue indicates that the results of our model are comparable to the best results.

3.3 Network Architecture Details

The ChebMixer module we developed is illustrated in Fig. 2. To evaluate the effectiveness of our approach, we construct different network architectures and experiment on both graph node classification and medical image segmentation tasks. For graph node classification, we begin by feeding the node feature matrix $X$ into a linear projection layer to map the node feature dimension to $d$, followed by sequentially updating the features with the three modules described in Sec. 3.2. The resulting output is then passed through a classification head.

For medical image segmentation, we construct a UNet-like architecture using the architectures of ViG [21] and SwinUNETR [22], as shown in Fig. 3. It comprises an encoder, a decoder, and skip connections. The encoder has four layers, each with a GraphProp module based on ChebMixer. This module transforms the image into a graph and extracts features. Specifically, for an image of size $H\times W\times 3$, we first divide it into $N$ patches. By converting each patch into a feature vector $x_{i}\in\mathbb{R}^{d}$, we obtain a feature matrix $X=(x_{1},x_{2},...,x_{N})$. Each patch can be considered a node, yielding a node set $V$. After that, for each node $v_{i}\in V$ (i.e., patch), we construct a $k$-nearest-neighbor graph in feature space. By considering the image as a graph, we can extract node representations using the proposed ChebMixer module. To maintain the hierarchical structure of the encoder, a patch merging layer is used before each stage to perform $2\times$ downsampling and double the feature dimension. The feature representations extracted by the encoder are passed to the decoder via skip connections at each layer. Each layer of the decoder contains a module named ResUpBlock, which performs upsampling using convolution. This process reshapes the feature maps of adjacent stages into feature maps with $2\times$ resolution and halves the feature dimension accordingly.
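As a concrete illustration of the image-to-graph step described above, the sketch below builds patch features by flattening non-overlapping patches and connects each patch to its nearest neighbors in feature space. The flattening-based patch embedding and the value of k are assumptions for illustration, not the exact GraphProp implementation.

```python
import torch
import torch.nn.functional as F


def image_to_knn_graph(img: torch.Tensor, patch: int, k: int):
    """img: (3, H, W) -> patch feature matrix X of shape (N, d) and a kNN adjacency A of shape (N, N)."""
    X = F.unfold(img.unsqueeze(0), kernel_size=patch, stride=patch)   # (1, 3*patch*patch, N)
    X = X.squeeze(0).transpose(0, 1)                                  # (N, d): one feature vector per patch

    dist = torch.cdist(X, X)                                          # pairwise distances in feature space
    knn = dist.topk(k + 1, largest=False).indices[:, 1:]              # k nearest neighbors (skip self)

    N = X.size(0)
    A = torch.zeros(N, N)
    A.scatter_(1, knn, 1.0)                                           # directed kNN edges
    A = torch.maximum(A, A.T)                                         # symmetrize the adjacency
    return X, A


X, A = image_to_knn_graph(torch.randn(3, 256, 256), patch=16, k=9)
print(X.shape, A.shape)                                               # torch.Size([256, 768]) torch.Size([256, 256])
```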

4 Results

This section introduces our experiments and results in detail. Our models are implemented in PyTorch, and all experiments are carried out on a machine with an NVIDIA RTX 3090 GPU (24 GB memory), a 16-core Intel Xeon CPU (2.1 GHz), and 64 GB of RAM.

4.1 Graph Node Classification

Datasets. We evaluate ChebMixer on seven widely used real-world datasets: six small-scale datasets (Cora, Citeseer, Pubmed, Computers, Photo, and CoauthorCS) and one large-scale benchmark dataset, ogbn-arxiv. We apply 60%/20%/20% train/val/test random splits for the small-scale datasets. For ogbn-arxiv, we use the split provided by OGB [28].

Baselines and settings. We compare ChebMixer with 8 strong baselines: GCN [31], GAT [45], APPNP [32], GPRGNN [10], MLP, ChebNet [13], ChebNetII [24], and NAGphormer [9]. For all models, we employ the AdamW [37] optimizer with early stopping (patience of 50 epochs) and train for a maximum of 2000 epochs.

Results. We conduct five trials for each model with a different random seed per trial to minimize potential bias. The results are shown in Tab. 1. We observe that ChebMixer outperforms the baselines on most datasets and remains competitive on the rest, demonstrating its superiority.

Model Dice \uparrow IOU \uparrow Params (M)
UTNet [17] 89.04 80.88 10.02
BAT [46] 90.03 82.41 46.73
TransFuse [55] 90.31 82.87 26.27
SwinUnet [7] 89.55 81.75 41.39
SwinUNETR [22] 90.26 82.77 25.13
HiFormer [26] 90.46 83.07 23.37
Base* 90.42 83.04 5.23
Ours 91.68 84.57 7.60
  * Base is a UNet-like architecture based on ViG [21].

Table 2: Comparison of segmentation results between our model, Base, and SOTA methods. All models are validated by five-fold cross-validation; mean values (%) are reported. We report the models' parameter counts in millions (M).

4.2 Medical Image Segmentation

Datasets and evaluation metrics. We conduct experiments on the skin disease segmentation dataset ISIC2018 [12], which contains 2594 images and corresponding segmentation masks. We resize the images to 256 × 256 and augment them with random flipping, rotation, Gaussian noise, and contrast and brightness variations.

Implementation details. We train the model with a combination of binary cross-entropy (BCE) and Dice loss. The loss $\mathcal{L}$ between the prediction $\hat{y}$ and the target $y$ can be formulated as:

\mathcal{L}=\mathrm{BCE}(\hat{y},y)+\mathrm{Dice}(\hat{y},y) \qquad (7)

We train the models for 200 epochs with the AdamW [37] optimizer. The learning rate is initialized to $1\times 10^{-4}$ and decayed by a scheduler with a step size of 50 epochs and decay factor $\gamma=0.5$. We evaluate the models with the Dice and IOU metrics.
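A minimal sketch of the combined objective in Eq. (7) is shown below, assuming a single-channel binary mask and a sigmoid output; the smoothing constant and the exact Dice formulation are common choices rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F


def bce_dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """logits, target: (B, 1, H, W); returns BCE(y_hat, y) + Dice(y_hat, y) as in Eq. (7)."""
    bce = F.binary_cross_entropy_with_logits(logits, target)

    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)               # Dice loss per sample
    return bce + dice.mean()


pred = torch.randn(2, 1, 256, 256)                 # raw network outputs (logits)
mask = (torch.rand(2, 1, 256, 256) > 0.5).float()  # binary segmentation masks
print(bce_dice_loss(pred, mask).item())
```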

Comparisons with the state of the art. We compare our model with Base and six other SOTA models. Base is a UNet-like architecture based on ViG [21], consistent with our model; the difference is that its graph convolution module adopts EdgeConv [48]. The SOTA models include UTNet [17], BAT [46], TransFuse [55], SwinUnet [7], SwinUNETR [22], and HiFormer [26]. All models are validated by five-fold cross-validation, and mean values are reported. The experimental results are recorded in Tab. 2, which shows that our model achieves better results with fewer parameters. We also show a visual comparison of the skin lesion segmentation results in Fig. 4, demonstrating that our proposed method captures finer structures and generates more accurate contours.

Figure 4: Visual comparison of segmentation results between our model and SOTA methods.

4.3 Ablation Studies

To evaluate the effectiveness of our model, we conduct a series of experiments on graph node classification. Specifically, we perform ablation experiments on each module of ChebMixer and explore the influence of the polynomial order $K$. These experiments provide an in-depth evaluation of the proposed architecture and offer insights into its underlying mechanisms.

$K$-hop extractor. This module extracts the $K$-hop neighborhood information in our model. To evaluate its efficacy, we compare it with the Hop2Token module in NAGphormer [9]. The remaining modules and experimental settings are consistent with Sec. 4.1, and the experimental results are shown in Tab. 3. The results indicate that our module yields varying degrees of improvement on all datasets except Citeseer and CoauthorCS. In addition, our module exhibits a lower standard deviation on almost all datasets, which suggests greater stability.

Hop2Token Ours Δ
Cora 89.01 ± 0.64 89.46 ± 0.24 + 0.45
Citeseer 81.32 ± 2.07 81.04 ± 0.28 - 0.28
Pubmed 89.03 ± 0.42 89.22 ± 0.35 + 0.19
Computers 91.64 ± 0.17 92.95 ± 0.20 + 1.31
Photo 95.82 ± 0.21 96.19 ± 0.17 + 0.37
CoauthorCS 95.59 ± 0.20 95.53 ± 0.13 - 0.06
ogbn-arxiv 72.91 ± 0.11 73.28 ± 0.06 + 0.37
Table 3: Ablation experiment on the $K$-hop extractor. Δ denotes the change in performance, where "+" denotes an increase and "-" denotes a decrease.

$K$-hop mixer. We conduct a comparative analysis of our proposed model with and without the $K$-hop mixer module while keeping the other modules and experimental settings consistent with Sec. 4.1. As depicted in Tab. 4, the experimental results reveal a substantial improvement in the model's performance on most datasets when the $K$-hop mixer module is incorporated. This finding underscores the significance of updating features across the $K$-hop neighborhoods, which is essential for obtaining comprehensive information and improving the model's performance.

✘ ✔ Δ
Cora 85.59 ± 0.88 89.46 ± 0.24 + 3.87
Citeseer 81.17 ± 0.61 81.04 ± 0.28 - 0.13
Pubmed 88.77 ± 0.10 89.22 ± 0.35 + 0.45
Computers 89.28 ± 0.27 92.95 ± 0.20 + 3.67
Photo 94.71 ± 0.41 96.19 ± 0.17 + 1.48
CoauthorCS 94.47 ± 0.28 95.53 ± 0.13 + 1.06
ogbn-arxiv 72.53 ± 0.17 73.28 ± 0.06 + 0.75
Table 4: Ablation experiment on the $K$-hop mixer. ✔ represents experiments with the MLP mixer, while ✘ represents experiments without it.
Figure 5: Ablation experiment on the $K$-hop aggregator. "ChebNetII" denotes the aggregation module in ChebNetII [24].
Figure 6: The influence of the polynomial order $K$.

$K$-hop aggregator. To verify the effectiveness of the aggregation module, we compare our aggregator with summation, averaging, maximization, and the aggregation method of ChebNetII [24]. The experimental results, depicted in Fig. 5, demonstrate the effectiveness of our designed aggregator and highlight the advantage of learning distinct weights for different channels for more informative aggregation.

The influence of the polynomial order $K$. We experiment with different values of $K$ on various graph node classification datasets. As depicted in Fig. 6, when $K$ is small (less than 6), increasing $K$ significantly improves the model's performance. As $K$ is further increased, the performance exhibits a jittery but gradually rising trend.

4.4 Runtime

To evaluate the time efficiency of our proposed method, we compare its per-epoch training time with that of NAGphormer [9]. For all datasets, the hyperparameter $K$, the hidden dimension $d$, and the number of model layers $l$ are set to 7, 64, and 1, respectively. The experimental results are illustrated in Fig. 7. As the dataset size increases, the training time increases proportionally. Notably, our method surpasses NAGphormer in computational efficiency on almost all datasets.

Figure 7: Comparison of computational efficiency with NAGphormer.

5 Conclusion

We propose ChebMixer, a novel and efficient graph MLP mixer for graph representation learning that addresses the issues of standard message-passing GNNs. By using fast Chebyshev polynomial spectral filtering to extract multiscale (multi-hop) representations of graph nodes, we can treat each node as a sequence of tokens and efficiently enhance the information of different hops via an MLP mixer. After a well-designed aggregator, we produce more informative node representations that improve the performance of downstream tasks. Experimental results demonstrate that our approach yields state-of-the-art results in graph node classification and medical image segmentation. We hope our attempt at tasks on different data domains will encourage the unified modeling of graph and image signals via graph representation learning.

References

  • Atwood and Towsley [2016] James Atwood and Don Towsley. Diffusion-convolutional neural networks. Advances in neural information processing systems, 29, 2016.
  • Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. stat, 1050:21, 2016.
  • Bianchi et al. [2021] Filippo Maria Bianchi, Daniele Grattarola, Lorenzo Livi, and Cesare Alippi. Graph neural networks with convolutional arma filters. IEEE transactions on pattern analysis and machine intelligence, 44(7):3496–3507, 2021.
  • Bronstein et al. [2017] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • Bronstein et al. [2021] Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.
  • Bruna et al. [2014] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and deep locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR 2014, 2014.
  • Cao et al. [2022] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. In European conference on computer vision, pages 205–218. Springer, 2022.
  • Cao et al. [2020] Wenming Cao, Zhiyue Yan, Zhiquan He, and Zhihai He. A comprehensive survey on geometric deep learning. IEEE Access, 8:35929–35949, 2020.
  • Chen et al. [2023] Jinsong Chen, Kaiyuan Gao, Gaichao Li, and Kun He. Nagphormer: A tokenized graph transformer for node classification in large graphs. In Proceedings of the International Conference on Learning Representations, 2023.
  • Chien et al. [2021] Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations, 2021.
  • Chung and Graham [1997] Fan RK Chung and Fan Chung Graham. Spectral graph theory. Number 92. American Mathematical Soc., 1997.
  • Codella et al. [2019] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368, 2019.
  • Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NeurIPS), page 3844–3852, 2016.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • Dwivedi and Bresson [2020] Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. CoRR, abs/2012.09699, 2020.
  • Dwivedi et al. [2022] Vijay Prakash Dwivedi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Graph neural networks with learnable structural and positional representations. In International Conference on Learning Representations, 2022.
  • Gao et al. [2021] Yunhe Gao, Mu Zhou, and Dimitris N Metaxas. UTNet: A hybrid transformer architecture for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 61–71, 2021.
  • Gilmer et al. [2017] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International conference on machine learning, pages 1263–1272. PMLR, 2017.
  • Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017.
  • Hammond et al. [2011] David K Hammond, Pierre Vandergheynst, and Remi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
  • Han et al. [2022] Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, and Enhua Wu. Vision gnn: An image is worth graph of nodes. In NeurIPS, 2022.
  • Hatamizadeh et al. [2021] Ali Hatamizadeh, Vishwesh Nath, Yucheng Tang, Dong Yang, Holger R Roth, and Daguang Xu. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In International MICCAI Brainlesion Workshop, pages 272–284. Springer, 2021.
  • He et al. [2021] Mingguo He, Zhewei Wei, Hongteng Xu, et al. Bernnet: Learning arbitrary graph spectral filters via bernstein approximation. Advances in Neural Information Processing Systems, 34:14239–14251, 2021.
  • He et al. [2022] Mingguo He, Zhewei Wei, and Ji-Rong Wen. Convolutional neural networks on graphs with chebyshev approximation, revisited. In NeurIPS, 2022.
  • He et al. [2023] Xiaoxin He, Bryan Hooi, Thomas Laurent, Adam Perold, Yann LeCun, and Xavier Bresson. A generalization of vit/mlp-mixer to graphs. In International Conference on Machine Learning, pages 12724–12745. PMLR, 2023.
  • Heidari et al. [2023] Moein Heidari, Amirhossein Kazerouni, Milad Soltany, Reza Azad, Ehsan Khodapanah Aghdam, Julien Cohen-Adad, and Dorit Merhof. Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6202–6212, 2023.
  • Hendrycks and Gimpel [2016] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
  • Hu et al. [2020] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.
  • Huang et al. [2020] Wenbing Huang, Yu Rong, Tingyang Xu, Fuchun Sun, and Junzhou Huang. Tackling over-smoothing for general graph convolutional networks. arXiv preprint arXiv:2008.09864, 2020.
  • Kenton and Toutanova [2019] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
  • Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • Klicpera et al. [2019] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. In International Conference on Learning Representations (ICLR), 2019.
  • Kreuzer et al. [2021] Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618–21629, 2021.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2012.
  • Lai et al. [2022] Hong-Phuc Lai, Thi-Thao Tran, and Van-Truong Pham. Axial attention mlp-mixer: A new architecture for image segmentation. In 2022 IEEE Ninth International Conference on Communications and Electronics (ICCE), pages 381–386. IEEE, 2022.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • Loshchilov and Hutter [2018] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
  • Mikolov et al. [2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2013.
  • Oono and Suzuki [2020] Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, 2020.
  • Rampášek et al. [2022] Ladislav Rampášek, Michael Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35:14501–14515, 2022.
  • Scarselli et al. [2009] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
  • Tolstikhin et al. [2021] Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. Advances in neural information processing systems, 34:24261–24272, 2021.
  • Topping et al. [2022] Jake Topping, Francesco Di Giovanni, Benjamin Paul Chamberlain, Xiaowen Dong, and Michael M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. In International Conference on Learning Representations, 2022.
  • Touvron et al. [2022] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, et al. Resmlp: Feedforward networks for image classification with data-efficient training. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):5314–5321, 2022.
  • Veličković et al. [2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
  • Wang et al. [2021] Jiacheng Wang, Lan Wei, Liansheng Wang, Qichao Zhou, Lei Zhu, and Jing Qin. Boundary-aware transformers for skin lesion segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 206–216. Springer, 2021.
  • Wang and Zhang [2022] Xiyuan Wang and Muhan Zhang. How powerful are spectral graph neural networks. In International Conference on Machine Learning, pages 23341–23362. PMLR, 2022.
  • Wang et al. [2019] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (tog), 38(5):1–12, 2019.
  • Wei et al. [2023] Tianjun Wei, Tommy WS Chow, Jianghong Ma, and Mingbo Zhao. Expgcn: Review-aware graph convolution network for explainable recommendation. Neural Networks, 157:202–215, 2023.
  • Wu et al. [2020] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020.
  • Yao et al. [2019] Liang Yao, Chengsheng Mao, and Yuan Luo. Graph convolutional networks for text classification. In Proceedings of the AAAI conference on artificial intelligence, pages 7370–7377, 2019.
  • Ying et al. [2021] Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 28877–28888, 2021.
  • Zhang et al. [2022] Hong Zhang, ZhiXiang Dong, Bo Li, and Siyuan He. Multi-scale mlp-mixer for image classification. Knowledge-Based Systems, 258:109792, 2022.
  • Zhang et al. [2018] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit Yan Yeung. Gaan: Gated attention networks for learning on large and spatiotemporal graphs. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, 2018.
  • Zhang et al. [2021] Yundong Zhang, Huiye Liu, and Qiang Hu. Transfuse: Fusing transformers and cnns for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 14–24. Springer, 2021.
  • Zhu et al. [2021] Meiqi Zhu, Xiao Wang, Chuan Shi, Houye Ji, and Peng Cui. Interpreting and unifying graph neural networks with an optimization framework. In Proceedings of the Web Conference 2021, pages 1215–1226, 2021.