
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

ChebMixer: Efficient Graph Representation Learning with MLP Mixer

Xiaoyan Kui
Central South University
   Haonan Yan
Central South University
   Qinsong Li
Central South University
   Liming Chen
Ecole Centrale de Lyon
   Beiji Zou
Central South University
Abstract

Graph neural networks have achieved remarkable success in learning graph representations, especially graph Transformers, which have recently shown superior performance on various graph mining tasks. However, graph Transformers generally treat nodes as tokens, which results in quadratic complexity in the number of nodes during self-attention computation. The graph MLP Mixer addresses this challenge by using the efficient MLP Mixer technique from computer vision. However, the time-consuming process of extracting graph tokens limits its performance. In this paper, we present a novel architecture named ChebMixer, a graph MLP Mixer that uses fast Chebyshev polynomial-based spectral filtering to extract a sequence of tokens. First, we produce multiscale representations of graph nodes via fast Chebyshev polynomial-based spectral filtering. Next, we treat the multiscale representations of each node as a sequence of tokens and refine the node representation with an effective MLP Mixer. Finally, we aggregate the multiscale representations of the nodes through Chebyshev interpolation. Owing to the powerful representation capability and fast computational properties of the MLP Mixer, we can quickly extract more informative node representations to improve the performance of downstream tasks. The experimental results demonstrate significant improvements in a variety of scenarios ranging from graph node classification to medical image segmentation.

Figure 1: Applications of ChebMixer. Graph node classification and medical image segmentation.

1 Introduction

Graphs provide an extremely flexible model for approximating the data domains of a large class of problems. For example, computer networks, transportation (road, rail, airplane) networks, or social networks can all be described by weighted graphs, with the vertices corresponding to individual computers, cities, or people, respectively. The remarkable success of deep learning on text [38] and images [34], which are defined on regular domains, motivates extensions to irregular, non-Euclidean spaces such as manifolds and graphs. This has given rise to representation learning on irregular data, termed graph neural networks (GNNs) or geometric deep learning [4, 8, 5, 50]. With their rapid development, GNNs have achieved state-of-the-art performance on almost all tasks among various graph representation learning methods [51, 49, 48, 21].

GNNs can be roughly divided into two categories: spatial-based and spectral-based. Spatial GNNs [41, 31, 45, 1] imitate the convolution operator in Euclidean space, and their core idea is to aggregate one-hop neighborhood information, which can be implemented efficiently via the message-passing scheme. Spatial GNNs can learn long-range dependencies by increasing the number of layers, but then suffer from over-smoothing [29, 39] and over-squashing [43]. Spectral GNNs [13, 23, 47] are GNNs that design graph signal filters in the spectral domain of the graph Laplace operator. Theoretically, an arbitrary filter can be represented by choosing a suitable set of basis functions. However, spectral GNNs often overfit because sparse labeling leads them to learn illegal coefficients [24].

Recently, Transformers emerged as a new architecture and have shown superior performance on a variety of data with an underlying Euclidean or grid-like structure, such as natural language processing [30] and images [14, 36]. Their great modeling capability motivates generalizing Transformers to non-Euclidean data like graphs. Due to the irregular structure of graphs, the Transformer cannot be directly applied. Graphormer [52], SAN [33], and GPS [40] design powerful positional and structural embeddings to further improve its expressive power. However, graph Transformers generally treat nodes as tokens, which results in quadratic complexity in the number of nodes during self-attention computation. NAGphormer [9] therefore extracts the $K$-hop representations of each node from the adjacency matrix as a preprocessing step and treats each node as a sequence of tokens that is fed directly into the Transformer. However, once inside the Transformer, each node is optimized without considering the relationships between nodes, which makes the model prone to overfitting. Graph MLP Mixer [25] extracts graph patches or tokens via a graph clustering algorithm and uses the efficient MLP Mixer [42] from computer vision to learn token representations instead of the self-attention mechanism. However, the clustering algorithm is time-consuming, and the method is unsuitable for graph node and link prediction tasks. More importantly, most graph neural networks are generally validated only on graph benchmarks and do not verify performance on other data representations. We believe a unified framework based on graph neural networks is possible because of graphs' flexible and general representation capabilities.

To address these issues, we propose a novel, efficient graph neural network and demonstrate its effectiveness on diverse tasks from different data domains, ranging from graph node classification to medical image segmentation. Inspired by NAGphormer [9], we extract multiscale representations of each node and treat each node as a sequence of tokens. An influential way to extract multiscale node representations is the spectral graph wavelet transform [20], whose computation can be accelerated via spectral filtering based on wavelet filters. For simplicity, we use Chebyshev polynomials as filters to extract the $K$-hop representations of nodes due to their $K$-localized property [13]. Instead of costly self-attention computation, we use a more efficient MLP mixer to learn the representations of different-hop neighborhoods based on their semantic correlation. Finally, we aggregate the different-hop neighborhood representations via Chebyshev interpolation to avoid learning illegal coefficients [24]. Overall, due to the powerful representation capability and fast computational properties of the MLP Mixer, we can quickly extract more informative node representations that benefit downstream tasks. For validation, we apply our method to tasks from the fields of graph node classification and medical image segmentation and show that it outperforms state-of-the-art approaches. Moreover, it demonstrates that creating a unified architecture via graph representation learning for tasks on different data domains is possible. We summarize our main contributions as follows:

  • We present a novel graph MLP mixer for graph representation learning, which uses an MLP mixer to learn node representations of different-hop neighborhoods, leading to more informative node representations after aggregation.

  • In addition to the general task of graph node classification, we apply our method to a medical image segmentation task. We convert an image into a kNN graph and learn the image representation based on our proposed graph representation learning approach.

  • The experimental results demonstrate significant improvements in a variety of scenarios ranging from graph node classification to medical image segmentation. Moreover, they show that creating a unified architecture via graph representation learning for tasks on different data domains is possible.

2 Related Work

2.1 Graph Neural Networks (GNNs)

Graph Neural Networks (GNNs) are a successful tool for graph representation learning, which have achieved state-of-the-art performance on almost all tasks among various graph representation learning methods [51, 49, 48, 21]. Generally, GNNs can be roughly divided into two categories: spatial-based and spectral-based.

Spatial GNNs [41, 31, 45, 1] imitate the convolution operator in Euclidean space, and their core idea is to aggregate one-hop neighborhood information, which can be implemented efficiently via the message-passing scheme. Scarselli et al. [41] propose the first spatial graph convolution method by directly summing the neighborhood information of nodes, with residual and skip connections applied to preserve the information of each layer. However, it uses a non-standardized adjacency matrix, which leads to hidden node states with different scales. DCNN [1] considers graph convolution as a diffusion process, where information is transferred from one node to one of its neighboring nodes with a certain probability to balance the information distribution after several rounds. GraphSAGE [19] is a generalized inductive framework that aggregates information by uniformly sampling a fixed-size set of neighbors instead of using the complete set. The message passing neural network (MPNN) [18] utilizes K-step message-passing iterations to propagate information further. To reduce noise and improve performance, some researchers extend the attention operator to graphs by assigning varying weights to neighbors through the attention mechanism [45, 54]. Spatial GNNs can learn long-range dependencies by increasing the number of layers, but then suffer from over-smoothing [29, 39] and over-squashing [43].

Spectral GNNs [13, 23, 47] are GNNs that design graph signal filters in the spectral domain of the graph Laplace operator. This line of methods was first proposed by Bruna et al. [6] but suffers from the costly graph Fourier transform. He et al. [23] categorize spectral GNNs into two categories based on the type of filtering operation used. The first category comprises spectral GNNs with fixed filters, such as APPNP [32], which uses personalized PageRank to construct the filter function. The second category includes spectral GNNs with learnable filters. ChebNet [13] utilizes diagonal matrices based on Chebyshev-polynomial eigenvalue expansions to approximate graph convolution. GPRGNN [10] employs direct gradient descent on polynomial coefficients. Additionally, some works use other polynomial bases [23, 3, 56, 47], such as BernNet [23], which uses the Bernstein polynomial basis, and ARMA [3] and GNN-LF/HF [56], which utilize rational functions. Among the mentioned methods, GPRGNN and ChebNet are considered more expressive because the former can express all polynomial filters, and the latter uses Chebyshev polynomials, which form a complete basis of the polynomial space. Despite being theoretically more expressive, ChebNet often suffers from overfitting in experimental settings. To alleviate the overfitting problem in ChebNet, Kipf et al. propose GCN [31], simplifying ChebNet by using filters that operate on 1-hop neighborhoods of the graph. He et al. [24] theoretically demonstrate that the overfitting issue of ChebNet is caused by the illegal parameters it learns. To address this problem, they propose ChebNetII, which enhances the approximation of the original Chebyshev polynomials using Chebyshev interpolation.

2.2 Graph Transformer and MLP-Mixer

Graph Transformers have emerged as a new architecture and have shown superior performance on various graph mining tasks recently. The core idea of the graph Transformer is to learn node representations by integrating graph structure into the Transformer architecture. Recognizing the similarity between the attention weights and the weighted neighbor matrix of a fully connected graph, Dwivedi et al. [15] combine the Transformer with GNNs. Several studies have explored strategies to replace the traditional position encoding in the Transformer with powerful position and structure encodings based on relevant graph information. For instance, SAN [33], LSPE [16], and NAGphormer [9] use the Laplacian operator or random walk operator to learn the position encoding. Graphormer [52] develops a centrality encoding based on degree centrality, and spatial and edge encodings based on the shortest-path distance. In general, most graph Transformers can solve the over-squashing and limited long-range dependency issues of GNNs by using a non-local self-attention mechanism. However, they also increase the complexity from $O(E)$ to $O(N^{2})$, resulting in computational bottlenecks.

Due to the high computational cost of the Transformer, it has recently been challenged by a more time-efficient model called MLP-Mixer [42]. Instead of convolutional or attention mechanisms, MLP-Mixer uses multi-layer perceptrons, which are unaffected by over-squashing or limited long-range dependencies. MLP-Mixer was initially used for image-related tasks such as image classification [42, 44, 53] and image segmentation [35]. However, generalizing the MLP Mixer to graphs is challenging given their irregular and variable nature. He et al. [25] successfully generalize the MLP Mixer to graphs by using a graph clustering algorithm to extract graph patches or tokens. However, the clustering algorithm is time-consuming, and the method is unsuitable for graph node and link prediction tasks.

Figure 2: The proposed ChebMixer. ChebMixer first uses a novel neighborhood extraction module, the $K$-hop extractor, to generate multiscale (multi-hop) representations of graph nodes and treats them as a sequence of tokens. ChebMixer then refines the multi-hop representations of each graph node with an effective MLP Mixer and uses a novel aggregator to aggregate the multiscale representations of the nodes.

3 ChebMixer

In this section, we first introduce some essential knowledge related to graphs and spectral filtering. After that, we introduce the proposed ChebMixer in detail.

3.1 Preliminary

Graph. A graph is composed of a finite non-empty set of vertices and a set of edges, which can be represented as $G=(V,E)$, where $V$ is a finite set of $|V|=N$ vertices, $E$ is a set of edges, and $A\in\mathbb{R}^{N\times N}$ is a weighted adjacency matrix encoding the connection weight between two nodes. In graph machine learning and graph neural networks, the node features can be regarded as a matrix $X\in\mathbb{R}^{N\times d}$, where $d$ is the feature dimension of each node. Specifically, $A_{i,j}$ is set to 1 or the edge's weight if an edge exists from node $i$ to node $j$; otherwise, it is set to 0.

Spectral convolution is a graph convolution method based on spectral graph theory [11]. It converts the graph signal into the spectral domain, processes it with a spectral filter, and then converts it back into the spatial domain. Spectral convolution is based on the graph Laplacian $L$, whose combinatorial definition is $L=D-A$, where $D\in\mathbb{R}^{N\times N}$ is the diagonal degree matrix with $D_{ii}=\sum_{j}A_{ij}$, and whose symmetric normalized definition is $L=I_{N}-D^{-1/2}AD^{-1/2}$, where $I_{N}$ is the identity matrix. As $L$ is a real symmetric positive semidefinite matrix, it can be decomposed as $L=U\Lambda U^{T}$, where $U$ is the eigenvector matrix of the Laplacian and $\Lambda$ is the diagonal matrix whose diagonal elements are the eigenvalues of the Laplacian. Spectral convolution then filters the eigenvalues by designing different filters. Given a spectral filter $h(\lambda)$, the spectral convolution operator can be defined as

Y=U\,h(\Lambda)\,U^{T}X \qquad (1)

where $Y$ denotes the filtering result of $X$. Many methods learn the optimal spectral filter via a polynomial expansion $h(\lambda)\approx\sum^{K}_{k=0}\theta_{k}\lambda^{k}$, where the expansion coefficients $\theta_{k}$ are trainable. So we have

Y=U\,h(\Lambda)\,U^{T}X\approx U\sum^{K}_{k=0}\theta_{k}\Lambda^{k}U^{T}X\approx\sum^{K}_{k=0}\theta_{k}L^{k}X \qquad (2)
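The last step of Eq. (2) relies on the identity $U\Lambda^{k}U^{T}=L^{k}$, so the polynomial filter can be applied directly in the vertex domain without an eigendecomposition. Below is a minimal NumPy check of Eq. (2) on a toy graph; the graph, features, and coefficients are purely illustrative and not taken from our experiments.

```python
import numpy as np

# Toy undirected graph: a 4-cycle
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
D_inv_sqrt = np.diag(deg ** -0.5)
L = np.eye(4) - D_inv_sqrt @ A @ D_inv_sqrt     # symmetric normalized Laplacian

lam, U = np.linalg.eigh(L)                      # L = U diag(lam) U^T
X = np.random.randn(4, 3)                       # node features
theta = [0.5, -0.3, 0.2]                        # example polynomial coefficients

# Spectral form: U h(Lambda) U^T X with h(lam) = sum_k theta_k lam^k
h_lam = sum(t * lam**k for k, t in enumerate(theta))
Y_spectral = U @ np.diag(h_lam) @ U.T @ X

# Vertex-domain form: sum_k theta_k L^k X (no eigendecomposition needed)
Y_spatial = sum(t * np.linalg.matrix_power(L, k) @ X for k, t in enumerate(theta))

print(np.allclose(Y_spectral, Y_spatial))       # True
```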

ChebNet [13] is an outstanding contribution to the polynomial spectral convolution field, using Chebyshev polynomials to implement filtering, as shown below:

Y\approx\sum^{K}_{k=0}\theta_{k}T_{k}(\hat{L})X \qquad (3)

where $T_{k}(\hat{L})\in\mathbb{R}^{N\times N}$ is the Chebyshev polynomial of order $k$ evaluated at $\hat{L}$, and $\hat{L}$ denotes the scaled Laplacian, $\hat{L}=2L/\lambda_{max}-I_{N}$. The Chebyshev polynomial $T_{k}(x)$ of order $k$ can be computed recursively by $T_{k}(x)=2xT_{k-1}(x)-T_{k-2}(x)$ with $T_{0}(x)=1$ and $T_{1}(x)=x$. Due to the $K$-localized property of $T_{k}(\hat{L})$, we can reinterpret Eq. (3) as two steps: (1) extract the $K$-hop neighborhood representations; (2) aggregate the representations of the different hops. Our core idea is to use an effective MLP mixer to enhance the $K$-hop neighborhood representations and obtain more informative node features.

Figure 3: Architecture for medical image segmentation, which is composed of an encoder, a decoder, and skip connections. The encoder contains two modules: a patch merging layer and GraphProp. The former performs downsampling, and the latter is constructed based on ChebMixer for feature extraction. The encoded feature representations at each encoder layer are fed into the corresponding CNN decoder (i.e., the ResUpBlock in the figure, which performs upsampling) via skip connections. The final output of the decoder goes through a head to obtain the segmentation result.

3.2 Methods

An overview of our method is shown in Fig. 2. It includes three parts: extracting the $K$-hop representations of nodes via fast Chebyshev polynomial spectral filtering, mixing the $K$-hop representations of nodes with an effective MLP mixer, and aggregating the $K$-hop representations via Chebyshev interpolation.

$K$-hop extractor. We design a module for $K$-hop neighborhood computation based on Chebyshev polynomials. Specifically, for an input adjacency matrix $A\in\mathbb{R}^{N\times N}$ and node feature matrix $X\in\mathbb{R}^{N\times d}$, we first compute the scaled normalized Laplacian matrix $\hat{L}$. Using the recurrence relation of the Chebyshev polynomials, we compute the $K$-hop neighborhood feature embeddings of all nodes as in Algorithm 1. The final output is $X_{G}=(X^{0},X^{1},...,X^{k},...,X^{K})$, where $K$ is a hyperparameter, $X^{k}\in\mathbb{R}^{N\times d}$ denotes the $k$-th hop neighborhood information, and $X_{G}\in\mathbb{R}^{N\times(K+1)\times d}$ denotes the sequence of $K$-hop neighborhood features of all nodes.

Algorithm 1 $K$-hop extractor based on Chebyshev polynomials.
Input: Scaled normalized Laplacian matrix $\hat{L}$; feature matrix $X\in\mathbb{R}^{N\times d}$; Chebyshev polynomial order $K$
Output: The $K$-hop neighborhood representation of all nodes $X_{G}\in\mathbb{R}^{N\times(K+1)\times d}$
1:  $X_{G}[:,0,:]=X$, $X_{G}[:,1,:]=\hat{L}X$
2:  for $i=2$ to $K$ do
3:     $X_{G}[:,i,:]=2\hat{L}X_{G}[:,i-1,:]-X_{G}[:,i-2,:]$
4:  end for
5:  return $X_{G}\in\mathbb{R}^{N\times(K+1)\times d}$
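A possible PyTorch sketch of Algorithm 1 is given below. The function name, the dense-matrix arithmetic, and the common simplification $\lambda_{max}\approx 2$ for the scaled Laplacian are our own assumptions, not the released implementation.

```python
import torch


def cheb_khop_extractor(A: torch.Tensor, X: torch.Tensor, K: int) -> torch.Tensor:
    """Return X_G of shape (N, K+1, d): the K-hop token sequence of every node (assumes K >= 1)."""
    N = A.size(0)
    deg = A.sum(dim=1).clamp(min=1e-12)
    D_inv_sqrt = torch.diag(deg.pow(-0.5))
    L = torch.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt   # symmetric normalized Laplacian
    L_hat = L - torch.eye(N)                         # 2L/lambda_max - I_N, with lambda_max taken as 2

    tokens = [X, L_hat @ X]                          # T_0(L_hat) X and T_1(L_hat) X
    for _ in range(2, K + 1):                        # Chebyshev recurrence (Algorithm 1, line 3)
        tokens.append(2.0 * (L_hat @ tokens[-1]) - tokens[-2])
    return torch.stack(tokens, dim=1)                # (N, K+1, d)


# Example: 5 nodes with 8-dimensional features, K = 3
A = (torch.rand(5, 5) > 0.5).float()
A = torch.triu(A, 1)
A = A + A.T                                          # symmetric adjacency, no self-loops
X = torch.randn(5, 8)
print(cheb_khop_extractor(A, X, K=3).shape)          # torch.Size([5, 4, 8])
```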

$K$-hop mixer. Having obtained the $K$-hop neighborhood information sequence $X_{G}\in\mathbb{R}^{N\times(K+1)\times d}$, we treat each node as a sequence of tokens and follow the approach of Tolstikhin et al. [42], alternating $K$-hop and channel mixing steps using a simple mixer layer. Token mixing operates along the $K$-hop dimension, while channel mixing operates along the channel dimension. The mixer layer can be expressed as

X_{G} = X_{G}+\big(W_{2}\,\sigma(W_{1}\,\mathrm{LayerNorm}(X_{G}))\big) \qquad (4)
X_{G} = X_{G}+\big(W_{4}\,\sigma(W_{3}\,\mathrm{LayerNorm}(X_{G})^{T})\big)^{T}

where LayerNorm is layer normalization [2], $\sigma$ is a nonlinear activation function [27], and the learnable matrices are $W_{1}\in\mathbb{R}^{d_{s}\times(K+1)}$, $W_{2}\in\mathbb{R}^{(K+1)\times d_{s}}$, $W_{3}\in\mathbb{R}^{d_{c}\times d}$, and $W_{4}\in\mathbb{R}^{d\times d_{c}}$, where both $d_{s}$ and $d_{c}$ are hyperparameters.
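A minimal PyTorch sketch of one mixer layer consistent with Eq. (4) is shown below, where the two MLPs act along the hop axis and the channel axis respectively. The class and argument names are ours, and GELU [27] is assumed as the activation.

```python
import torch
import torch.nn as nn


class KHopMixerLayer(nn.Module):
    """One mixer layer over a (N, K+1, d) token sequence: hop mixing followed by channel mixing."""

    def __init__(self, num_hops: int, dim: int, ds: int, dc: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.hop_mlp = nn.Sequential(                 # mixes along the K+1 hop axis (W_1, W_2)
            nn.Linear(num_hops, ds), nn.GELU(), nn.Linear(ds, num_hops))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(             # mixes along the feature axis (W_3, W_4)
            nn.Linear(dim, dc), nn.GELU(), nn.Linear(dc, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (N, K+1, d)
        y = self.norm1(x).transpose(1, 2)             # (N, d, K+1)
        x = x + self.hop_mlp(y).transpose(1, 2)       # hop (token) mixing + residual
        x = x + self.channel_mlp(self.norm2(x))       # channel mixing + residual
        return x


# Example: mix the 4-hop token sequences of 5 nodes with 8-dimensional features
mixer = KHopMixerLayer(num_hops=4, dim=8, ds=16, dc=32)
print(mixer(torch.randn(5, 4, 8)).shape)              # torch.Size([5, 4, 8])
```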

$K$-hop aggregator. Finally, we aggregate the sequence of $K$-hop neighborhood features $X_{G}$ to produce more informative node features. To alleviate illegal coefficient learning, we adopt the method of ChebNetII [24], which reparameterizes the learned coefficients via Chebyshev interpolation. In contrast to ChebNetII, which uses the same parameters for all channels, we assert that different channels carry different information and should have different aggregation weights. Hence, we propose an aggregator with a set of learnable weights $W\in\mathbb{R}^{(K+1)\times d}$. We constrain $W$ according to Eq. 5 and aggregate the $K$-hop neighborhood information sequence according to Eq. 6.

W^{k}=\frac{2}{K+1}\sum_{j=0}^{K}\gamma_{j}T_{k}(x_{j}) \qquad (5)
X_{agg}=\sum_{k=0}^{K}W^{k}\times X_{G}^{k} \qquad (6)

where $W^{k}$ denotes the vector of the learnable weights $W$ at index $k$, $X_{G}^{k}$ denotes the feature matrix of the sequence at index $k$, $\gamma_{j}\in\mathbb{R}^{d}$ are the learnable parameters, and $x_{j}=\cos((j+1/2)\pi/(K+1))$ are the Chebyshev nodes of $T_{K+1}$. The detailed implementation is given in Algorithm 2.

Algorithm 2 $K$-hop aggregator
Input: Learnable weights $W\in\mathbb{R}^{(K+1)\times d}$; sequence of feature matrices $X_{G}\in\mathbb{R}^{N\times(K+1)\times d}$
Output: Aggregated node feature matrix $X_{agg}\in\mathbb{R}^{N\times d}$
1:  Clone $W$ to $\gamma$; initialize $X_{agg}$ with 0
2:  for $k=0$ to $K$ do
3:     $W[k]=\gamma[0]\times T_{k}(x_{0})$
4:     for $j=1$ to $K$ do
5:        $W[k]=W[k]+\gamma[j]\times T_{k}(x_{j})$
6:     end for
7:     $W[k]=2W[k]/(K+1)$
8:  end for
9:  for $k=0$ to $K$ do
10:     $X_{agg}=X_{agg}+X_{G}[:,k,:]\times W[k]$
11:  end for
12:  return Aggregated node feature matrix $X_{agg}\in\mathbb{R}^{N\times d}$
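A vectorized PyTorch sketch of Algorithm 2 and Eqs. (5)-(6) follows. The class name and the initialization of $\gamma$ (here simply ones rather than a clone of $W$) are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn


class KHopAggregator(nn.Module):
    """Aggregate (N, K+1, d) hop tokens into (N, d) with per-channel Chebyshev-interpolated weights."""

    def __init__(self, K: int, dim: int):
        super().__init__()
        self.K = K
        self.gamma = nn.Parameter(torch.ones(K + 1, dim))       # learnable gamma_j, one per channel

        # Precompute T_k(x_j) at the Chebyshev nodes x_j = cos((j + 1/2) * pi / (K + 1))
        xj = torch.cos((torch.arange(K + 1) + 0.5) * math.pi / (K + 1))
        T = torch.zeros(K + 1, K + 1)                           # T[k, j] = T_k(x_j)
        T[0] = 1.0
        if K >= 1:
            T[1] = xj
        for k in range(2, K + 1):
            T[k] = 2.0 * xj * T[k - 1] - T[k - 2]
        self.register_buffer("T", T)

    def forward(self, x_g: torch.Tensor) -> torch.Tensor:      # x_g: (N, K+1, d)
        W = 2.0 / (self.K + 1) * (self.T @ self.gamma)          # Eq. (5): (K+1, d)
        return (x_g * W.unsqueeze(0)).sum(dim=1)                # Eq. (6): (N, d)


agg = KHopAggregator(K=3, dim=8)
print(agg(torch.randn(5, 4, 8)).shape)                          # torch.Size([5, 8])
```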
Method Cora Citeseer Pubmed Computers Photo CoauthorCS ogbn-arxiv
MLP 72.91 ± 0.16 76.18 ± 0.70 84.08 ± 0.11 74.20 ± 0.85 74.89 ± 0.42 94.51 ± 0.15 51.58 ± 0.14
GCN [31] 88.61 ± 0.20 79.38 ± 0.33 86.34 ± 0.08 79.55 ± 0.11 88.47 ± 0.20 94.61 ± 0.03 66.19 ± 0.57
GAT [45] 86.35 ± 0.40 79.48 ± 0.66 85.98 ± 0.20 81.56 ± 0.63 86.96 ± 0.37 93.38 ± 0.18 64.86 ± 0.34
APPNP [32] 88.65 ± 0.20 79.33 ± 0.84 85.79 ± 0.21 72.29 ± 0.20 84.31 ± 1.72 94.07 ± 0.10 65.98 ± 0.29
GPRGNN [10] 89.29 ± 0.36 80.33 ± 0.09 87.94 ± 0.28 89.08 ± 0.96 94.33 ± 0.36 95.13 ± 0.06 69.86 ± 0.70
ChebNet [13] 84.94 ± 0.32 78.16 ± 0.42 88.25 ± 0.11 90.60 ± 0.13 93.88 ± 0.14 94.49 ± 0.11 70.79 ± 0.20
ChebNetII [24] 89.69 ± 0.64 80.94 ± 0.37 88.93 ± 0.29 81.76 ± 2.27 89.65 ± 0.69 94.53 ± 0.28 72.32 ± 0.23
NAGphormer [9] 87.68 ± 0.52 77.12 ± 0.80 89.02 ± 0.19 91.22 ± 0.14 95.49 ± 0.11 95.75 ± 0.09 71.01 ± 0.13
Ours 89.46 ± 0.24 81.04 ± 0.28 89.22 ± 0.35 92.95 ± 0.20 96.19 ± 0.17 95.53 ± 0.13 73.28 ± 0.06
Table 1: Comparison of all models in terms of mean accuracy ± stdev (%) on graph node classification. The best results appear in bold. Blue indicates that the results of our model are comparable to the best results.

3.3 Network Architecture Details

The ChebMixer module we developed is illustrated in Fig. 2. To evaluate the effectiveness of our approach, we construct different network architectures and experiment on both graph node classification and medical image segmentation tasks. For graph node classification, we begin by feeding the node feature matrix $X$ into a linear projection layer to map the node feature dimension to $d$, followed by sequentially updating the features with the three modules described in Sec. 3.2. The resulting output is then passed through a classification head.

For medical image segmentation, we construct a UNet-like architecture using the architectures of ViG [21] and SwinUNETR [22], as shown in Fig. 3. It comprises an encoder, a decoder, and skip connections. The encoder has four layers, each with a GraphProp module based on ChebMixer. This module transforms the image into a graph and extracts features. Specifically, for an image of size $H\times W\times 3$, we first divide it into $N$ patches. By converting each patch into a feature vector $x_{i}\in\mathbb{R}^{d}$, we obtain a feature matrix $X=(x_{1},x_{2},...,x_{N})$. Each patch can be considered a node, yielding a node set $V$. After that, for each node $v_{i}\in V$ (i.e., patch), we construct a $k$-nearest-neighbor graph in feature space. By considering the image as a graph, we can extract node representations using the proposed ChebMixer module. To maintain the hierarchical structure of the encoder, a patch merging layer is used before each stage to perform $2\times$ downsampling and double the feature dimension. The feature representations extracted by the encoder are passed to the decoder via skip connections at each layer. Each layer of the decoder contains a module named ResUpBlock, which performs upsampling using convolution. This process reshapes the feature maps of adjacent stages into feature maps with $2\times$ resolution and halves the feature dimension accordingly.
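As a concrete illustration of the image-to-graph step described above, the sketch below builds patch features by flattening non-overlapping patches and connects each patch to its nearest neighbors in feature space. The flattening-based patch embedding and the value of k are assumptions for illustration, not the exact GraphProp implementation.

```python
import torch
import torch.nn.functional as F


def image_to_knn_graph(img: torch.Tensor, patch: int, k: int):
    """img: (3, H, W) -> patch feature matrix X of shape (N, d) and a kNN adjacency A of shape (N, N)."""
    X = F.unfold(img.unsqueeze(0), kernel_size=patch, stride=patch)   # (1, 3*patch*patch, N)
    X = X.squeeze(0).transpose(0, 1)                                  # (N, d): one feature vector per patch

    dist = torch.cdist(X, X)                                          # pairwise distances in feature space
    knn = dist.topk(k + 1, largest=False).indices[:, 1:]              # k nearest neighbors (skip self)

    N = X.size(0)
    A = torch.zeros(N, N)
    A.scatter_(1, knn, 1.0)                                           # directed kNN edges
    A = torch.maximum(A, A.T)                                         # symmetrize the adjacency
    return X, A


X, A = image_to_knn_graph(torch.randn(3, 256, 256), patch=16, k=9)
print(X.shape, A.shape)                                               # torch.Size([256, 768]) torch.Size([256, 256])
```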

4 Results

This section introduces our experiments and results in detail. Our models are implemented in PyTorch, and all experiments are carried out on a machine with an NVIDIA RTX 3090 GPU (24 GB memory), a 16-core Intel Xeon CPU (2.1 GHz), and 64 GB of RAM.

4.1 Graph Node Classification

Datasets. We evaluate ChebMixer on seven widely used real-world datasets: six small-scale datasets (Cora, Citeseer, Pubmed, Computers, Photo, and CoauthorCS) and one large-scale benchmark dataset, ogbn-arxiv. We apply 60%/20%/20% train/val/test random splits for the small-scale datasets. For ogbn-arxiv, we use the split provided by OGB [28].

Baselines and settings. We compare ChebMixer with 8 strong baselines: GCN [31], GAT [45], APPNP [32], GPRGNN [10], MLP, ChebNet [13], ChebNetII [24], and NAGphormer [9]. For all models, we employ the AdamW [37] optimizer with early stopping (patience of 50 epochs) and train for a maximum of 2000 epochs.

Results. We conduct five trials for each model with a different random seed per trial to minimize potential bias. The results are shown in Tab. 1. We observe that ChebMixer outperforms the baselines on most datasets and remains competitive on the rest, demonstrating its superiority.

Model Dice \uparrow IOU \uparrow Params (M)
UTNet [17] 89.04 80.88 10.02
BAT [46] 90.03 82.41 46.73
TransFuse [55] 90.31 82.87 26.27
SwinUnet [7] 89.55 81.75 41.39
SwinUNETR [22] 90.26 82.77 25.13
HiFormer [26] 90.46 83.07 23.37
Base* 90.42 83.04 5.23
Ours 91.68 84.57 7.60
  * Base is a UNet-like architecture based on ViG [21].

Table 2: Comparison of segmentation results between our model, Base, and SOTA methods. All models are validated by five-fold cross-validation; mean values (%) are reported. We report the models' parameter counts in millions (M).

4.2 Medical Image Segmentation

Datasets and evaluation metrics. We conduct experiments on the skin disease segmentation dataset ISIC2018 [12], which contains 2594 images and corresponding segmentation masks. We resize the images to 256 × 256 and augment them with random flipping, rotation, Gaussian noise, and contrast and brightness variations.

Implementation details. We train the model with a combination of binary cross-entropy (BCE) and Dice loss. The loss $\mathcal{L}$ between the prediction $\hat{y}$ and the target $y$ can be formulated as:

\mathcal{L}=\mathrm{BCE}(\hat{y},y)+\mathrm{Dice}(\hat{y},y) \qquad (7)

We train the models for 200 epochs with the AdamW [37] optimizer. The learning rate is initialized to $1\times 10^{-4}$ and decayed by a scheduler with a step size of 50 epochs and decay factor $\gamma=0.5$. We evaluate the models with the Dice and IOU metrics.
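A minimal sketch of the combined objective in Eq. (7) is shown below, assuming a single-channel binary mask and a sigmoid output; the smoothing constant and the exact Dice formulation are common choices rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F


def bce_dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """logits, target: (B, 1, H, W); returns BCE(y_hat, y) + Dice(y_hat, y) as in Eq. (7)."""
    bce = F.binary_cross_entropy_with_logits(logits, target)

    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = prob.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)               # Dice loss per sample
    return bce + dice.mean()


pred = torch.randn(2, 1, 256, 256)                 # raw network outputs (logits)
mask = (torch.rand(2, 1, 256, 256) > 0.5).float()  # binary segmentation masks
print(bce_dice_loss(pred, mask).item())
```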

Comparisons with the state of the art. We compare our model with Base and six other SOTA models. Base is a UNet-like architecture based on ViG [21], consistent with our model; the difference is that its graph convolution module adopts EdgeConv [48]. The SOTA models include UTNet [17], BAT [46], TransFuse [55], SwinUnet [7], SwinUNETR [22], and HiFormer [26]. All models are validated by five-fold cross-validation, and mean values are reported. The experimental results are recorded in Tab. 2, which shows that our model achieves better results with fewer parameters. We also show a visual comparison of the skin lesion segmentation results in Fig. 4, demonstrating that our proposed method captures finer structures and generates more accurate contours.

Figure 4: Visual comparison of segmentation results between our model and SOTA methods.

4.3 Ablation Studies

To evaluate the effectiveness of our model, we conduct a series of experiments on graph node classification. Specifically, we perform ablation experiments on each module of ChebMixer and explore the influence of the polynomial order $K$. These experiments provide an in-depth evaluation of the proposed architecture and offer insights into its underlying mechanisms.

$K$-hop extractor. This module extracts the $K$-hop neighborhood information in our model. To evaluate its efficacy, we compare it with the Hop2Token module in NAGphormer [9]. The remaining modules and experimental settings are consistent with Sec. 4.1, and the experimental results are shown in Tab. 3. The results indicate that our module yields varying degrees of improvement on all datasets except Citeseer and CoauthorCS. In addition, our module exhibits a lower standard deviation on almost all datasets, which suggests greater stability.

Hop2Token Ours Δ
Cora 89.01 ± 0.64 89.46 ± 0.24 + 0.45
Citeseer 81.32 ± 2.07 81.04 ± 0.28 - 0.28
Pubmed 89.03 ± 0.42 89.22 ± 0.35 + 0.19
Computers 91.64 ± 0.17 92.95 ± 0.20 + 1.31
Photo 95.82 ± 0.21 96.19 ± 0.17 + 0.37
CoauthorCS 95.59 ± 0.20 95.53 ± 0.13 - 0.06
ogbn-arxiv 72.91 ± 0.11 73.28 ± 0.06 + 0.37
Table 3: Ablation experiment on the $K$-hop extractor. Δ denotes the change in performance, where "+" denotes an increase and "-" denotes a decrease.

$K$-hop mixer. We conduct a comparative analysis of our proposed model with and without the $K$-hop mixer module while keeping the other modules and experimental settings consistent with Sec. 4.1. As depicted in Tab. 4, the experimental results reveal a substantial improvement in the model's performance on most datasets when the $K$-hop mixer module is incorporated. This finding underscores the significance of updating features across the $K$-hop neighborhoods, which is essential for obtaining comprehensive information and improving the model's performance.

✘ ✔ Δ
Cora 85.59 ± 0.88 89.46 ± 0.24 + 3.87
Citeseer 81.17 ± 0.61 81.04 ± 0.28 - 0.13
Pubmed 88.77 ± 0.10 89.22 ± 0.35 + 0.45
Computers 89.28 ± 0.27 92.95 ± 0.20 + 3.67
Photo 94.71 ± 0.41 96.19 ± 0.17 + 1.48
CoauthorCS 94.47 ± 0.28 95.53 ± 0.13 + 1.06
ogbn-arxiv 72.53 ± 0.17 73.28 ± 0.06 + 0.75
Table 4: Ablation experiment on the $K$-hop mixer. ✔ represents experiments with the MLP mixer, while ✘ represents experiments without it.
Figure 5: Ablation experiment on the $K$-hop aggregator. "ChebNetII" denotes the aggregation module in ChebNetII [24].
Figure 6: The influence of the polynomial order $K$.

$K$-hop aggregator. To verify the effectiveness of the aggregation module, we compare our aggregator with summation, averaging, maximization, and the aggregation method of ChebNetII [24]. The experimental results, depicted in Fig. 5, demonstrate the effectiveness of our designed aggregator and highlight the advantage of learning distinct weights for different channels for more informative aggregation.

The influence of the polynomial order $K$. We experiment with different values of $K$ on various graph node classification datasets. As depicted in Fig. 6, when $K$ is small (less than 6), increasing $K$ significantly improves the model's performance. As $K$ is further increased, the performance exhibits a jittery but gradually rising trend.

4.4 Runtime

To evaluate the time efficiency of our proposed method, we compare its per-epoch training time with that of NAGphormer [9]. For all datasets, the hyperparameter $K$, the hidden dimension $d$, and the number of model layers $l$ are set to 7, 64, and 1, respectively. The experimental results are illustrated in Fig. 7. As the dataset size increases, the training time increases proportionally. Notably, our method surpasses NAGphormer in computational efficiency on almost all datasets.

Figure 7: Comparison of computational efficiency with NAGphormer.

5 Conclusion

We propose ChebMixer, a novel and efficient graph MLP mixer for graph representation learning that addresses the issues of standard message-passing GNNs. By using fast Chebyshev polynomial spectral filtering to extract multiscale (multi-hop) representations of graph nodes, we can treat each node as a sequence of tokens and efficiently enhance the information of different hops via an MLP mixer. After a well-designed aggregator, we produce more informative node representations that improve the performance of downstream tasks. Experimental results demonstrate that our approach yields state-of-the-art results in graph node classification and medical image segmentation. We hope our attempt at tasks on different data domains will encourage the unified modeling of graph and image signals via graph representation learning.

References

  • Atwood and Towsley [2016] James Atwood and Don Towsley. Diffusion-convolutional neural networks. Advances in neural information processing systems, 29, 2016.
  • Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. stat, 1050:21, 2016.
  • Bianchi et al. [2021] Filippo Maria Bianchi, Daniele Grattarola, Lorenzo Livi, and Cesare Alippi. Graph neural networks with convolutional arma filters. IEEE transactions on pattern analysis and machine intelligence, 44(7):3496–3507, 2021.
  • Bronstein et al. [2017] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • Bronstein et al. [2021] Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.
  • Bruna et al. [2014] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and deep locally connected networks on graphs. In 2nd International Conference on Learning Representations, ICLR 2014, 2014.
  • Cao et al. [2022] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. In European conference on computer vision, pages 205–218. Springer, 2022.
  • Cao et al. [2020] Wenming Cao, Zhiyue Yan, Zhiquan He, and Zhihai He. A comprehensive survey on geometric deep learning. IEEE Access, 8:35929–35949, 2020.
  • Chen et al. [2023] Jinsong Chen, Kaiyuan Gao, Gaichao Li, and Kun He. Nagphormer: A tokenized graph transformer for node classification in large graphs. In Proceedings of the International Conference on Learning Representations, 2023.
  • Chien et al. [2021] Eli Chien, Jianhao Peng, Pan Li, and Olgica Milenkovic. Adaptive universal generalized pagerank graph neural network. In International Conference on Learning Representations, 2021.
  • Chung and Graham [1997] Fan RK Chung and Fan Chung Graham. Spectral graph theory. Number 92. American Mathematical Soc., 1997.
  • Codella et al. [2019] Noel Codella, Veronica Rotemberg, Philipp Tschandl, M Emre Celebi, Stephen Dusza, David Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael Marchetti, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368, 2019.
  • Defferrard et al. [2016] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NeurIPS), page 3844–3852, 2016.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • Dwivedi and Bresson [2020] Vijay Prakash Dwivedi and Xavier Bresson. A generalization of transformer networks to graphs. CoRR, abs/2012.09699, 2020.
  • Dwivedi et al. [2022] Vijay Prakash Dwivedi, Anh Tuan Luu, Thomas Laurent, Yoshua Bengio, and Xavier Bresson. Graph neural networks with learnable structural and positional representations. In International Conference on Learning Representations, 2022.
  • Gao et al. [2021] Yunhe Gao, Mu Zhou, and Dimitris N Metaxas. UTNet: A hybrid transformer architecture for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 61–71, 2021.
  • Gilmer et al. [2017] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International conference on machine learning, pages 1263–1272. PMLR, 2017.
  • Hamilton et al. [2017] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017.
  • Hammond et al. [2011] David K Hammond, Pierre Vandergheynst, and Remi Gribonval. Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150, 2011.
  • Han et al. [2022] Kai Han, Yunhe Wang, Jianyuan Guo, Yehui Tang, and Enhua Wu. Vision gnn: An image is worth graph of nodes. In NeurIPS, 2022.
  • Hatamizadeh et al. [2021] Ali Hatamizadeh, Vishwesh Nath, Yucheng Tang, Dong Yang, Holger R Roth, and Daguang Xu. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In International MICCAI Brainlesion Workshop, pages 272–284. Springer, 2021.
  • He et al. [2021] Mingguo He, Zhewei Wei, Hongteng Xu, et al. Bernnet: Learning arbitrary graph spectral filters via bernstein approximation. Advances in Neural Information Processing Systems, 34:14239–14251, 2021.
  • He et al. [2022] Mingguo He, Zhewei Wei, and Ji-Rong Wen. Convolutional neural networks on graphs with chebyshev approximation, revisited. In NeurIPS, 2022.
  • He et al. [2023] Xiaoxin He, Bryan Hooi, Thomas Laurent, Adam Perold, Yann LeCun, and Xavier Bresson. A generalization of vit/mlp-mixer to graphs. In International Conference on Machine Learning, pages 12724–12745. PMLR, 2023.
  • Heidari et al. [2023] Moein Heidari, Amirhossein Kazerouni, Milad Soltany, Reza Azad, Ehsan Khodapanah Aghdam, Julien Cohen-Adad, and Dorit Merhof. Hiformer: Hierarchical multi-scale representations using transformers for medical image segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6202–6212, 2023.
  • Hendrycks and Gimpel [2016] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
  • Hu et al. [2020] Weihua Hu, Matthias Fey, Marinka Zitnik, Yuxiao Dong, Hongyu Ren, Bowen Liu, Michele Catasta, and Jure Leskovec. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.
  • Huang et al. [2020] Wenbing Huang, Yu Rong, Tingyang Xu, Fuchun Sun, and Junzhou Huang. Tackling over-smoothing for general graph convolutional networks. arXiv preprint arXiv:2008.09864, 2020.
  • Kenton and Toutanova [2019] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
  • Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
  • Klicpera et al. [2019] Johannes Klicpera, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. In International Conference on Learning Representations (ICLR), 2019.
  • Kreuzer et al. [2021] Devin Kreuzer, Dominique Beaini, Will Hamilton, Vincent Létourneau, and Prudencio Tossou. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618–21629, 2021.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2012.
  • Lai et al. [2022] Hong-Phuc Lai, Thi-Thao Tran, and Van-Truong Pham. Axial attention mlp-mixer: A new architecture for image segmentation. In 2022 IEEE Ninth International Conference on Communications and Electronics (ICCE), pages 381–386. IEEE, 2022.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • Loshchilov and Hutter [2018] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018.
  • Mikolov et al. [2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2013.
  • Oono and Suzuki [2020] Kenta Oono and Taiji Suzuki. Graph neural networks exponentially lose expressive power for node classification. In International Conference on Learning Representations, 2020.
  • Rampášek et al. [2022] Ladislav Rampášek, Michael Galkin, Vijay Prakash Dwivedi, Anh Tuan Luu, Guy Wolf, and Dominique Beaini. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35:14501–14515, 2022.
  • Scarselli et al. [2009] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009.
  • Tolstikhin et al. [2021] Ilya O Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. Advances in neural information processing systems, 34:24261–24272, 2021.
  • Topping et al. [2022] Jake Topping, Francesco Di Giovanni, Benjamin Paul Chamberlain, Xiaowen Dong, and Michael M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. In International Conference on Learning Representations, 2022.
  • Touvron et al. [2022] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Gautier Izacard, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, et al. Resmlp: Feedforward networks for image classification with data-efficient training. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(4):5314–5321, 2022.
  • Veličković et al. [2018] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.
  • Wang et al. [2021] Jiacheng Wang, Lan Wei, Liansheng Wang, Qichao Zhou, Lei Zhu, and Jing Qin. Boundary-aware transformers for skin lesion segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 206–216. Springer, 2021.
  • Wang and Zhang [2022] Xiyuan Wang and Muhan Zhang. How powerful are spectral graph neural networks. In International Conference on Machine Learning, pages 23341–23362. PMLR, 2022.
  • Wang et al. [2019] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics (tog), 38(5):1–12, 2019.
  • Wei et al. [2023] Tianjun Wei, Tommy WS Chow, Jianghong Ma, and Mingbo Zhao. Expgcn: Review-aware graph convolution network for explainable recommendation. Neural Networks, 157:202–215, 2023.
  • Wu et al. [2020] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020.
  • Yao et al. [2019] Liang Yao, Chengsheng Mao, and Yuan Luo. Graph convolutional networks for text classification. In Proceedings of the AAAI conference on artificial intelligence, pages 7370–7377, 2019.
  • Ying et al. [2021] Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. Do transformers really perform badly for graph representation? In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pages 28877–28888, 2021.
  • Zhang et al. [2022] Hong Zhang, ZhiXiang Dong, Bo Li, and Siyuan He. Multi-scale mlp-mixer for image classification. Knowledge-Based Systems, 258:109792, 2022.
  • Zhang et al. [2018] Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit Yan Yeung. Gaan: Gated attention networks for learning on large and spatiotemporal graphs. In 34th Conference on Uncertainty in Artificial Intelligence 2018, UAI 2018, 2018.
  • Zhang et al. [2021] Yundong Zhang, Huiye Liu, and Qiang Hu. Transfuse: Fusing transformers and cnns for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, pages 14–24. Springer, 2021.
  • Zhu et al. [2021] Meiqi Zhu, Xiao Wang, Chuan Shi, Houye Ji, and Peng Cui. Interpreting and unifying graph neural networks with an optimization framework. In Proceedings of the Web Conference 2021, pages 1215–1226, 2021.