
GraphMinNet: Learning Dependencies in Graphs with Light Complexity Minimal Architecture

Md Atik Ahamed    Andrew Cheng    Qiang Ye    Qiang Cheng
Abstract

Graph Neural Networks (GNNs) have demonstrated remarkable success in various applications, yet they often struggle to capture long-range dependencies (LRD) effectively. This paper introduces GraphMinNet, a novel GNN architecture that generalizes the idea of minimal Gated Recurrent Units to graph-structured data. Our approach achieves efficient LRD modeling with linear computational complexity while maintaining permutation equivariance and stability. The model incorporates both structural and positional information through a unique combination of feature and positional encodings, leading to provably stronger expressiveness than the 1-WL test. Theoretical analysis establishes that GraphMinNet maintains non-decaying gradients over long distances, ensuring effective long-range information propagation. Extensive experiments on ten diverse datasets, including molecular graphs, image graphs, and synthetic networks, demonstrate that GraphMinNet achieves state-of-the-art performance while being computationally efficient. Our results show superior performance on 6 out of 10 datasets and competitive results on the others, validating the effectiveness of our approach in capturing both local and global graph structures.

Machine Learning, ICML, Graph, Convolution, RNN, minGRU

1 Introduction

Graphs are widely used in various fields ranging from social networks and knowledge representations to engineering. Graph neural networks (GNNs) provide crucial techniques for extracting information and making inference over graph data. While numerous GNN models have been developed, important challenges still need to be overcome.

A fundamental challenge in GNNs is capturing long-range dependencies (LRDs) - the ability to model relationships between nodes that are far apart in the graph structure. Classical GNNs typically use message passing between neighboring nodes, where each layer only allows information to travel one hop. To reach distant nodes, information must pass through many layers, leading to a phenomenon called over-squashing: as information propagates through multiple layers, messages from many source nodes are compressed into fixed-size node representations, causing excessive compression and loss of important details. This over-squashing problem, particularly severe in deep GNNs or highly connected graphs, creates information bottlenecks that prevent effective modeling of long-range relationships.

The ability to capture LRD is critical because many real-world graphs inherently contain important long-distance relationships and require understanding of global graph structure. Beyond over-squashing, attempting to capture these dependencies through deep message passing leads to additional challenges such as over-smoothing (where node features become too similar), gradient vanishing, and information dilution. Thus, while GNN performance often depends on capturing both local and distant graph interactions, existing approaches struggle with this fundamental tension.

Several approaches have been proposed to address these challenges, including attention mechanisms over longer paths (Ying et al., 2021; Kreuzer et al., 2021; Rampášek et al., 2022), global graph features (Zhang & Li, 2021; You et al., 2021), skip connections (Wu & Cheng, 2022), graph diffusion (Chamberlain et al., 2021), and multi-scale architectures (Ying et al., 2018). While these approaches show promise, attention-based and multi-scale methods often face computational scalability issues with large graphs, whereas simpler approaches like skip connections and global features can be prone to overfitting on complex graph structures.

To address these limitations, we propose a novel approach that achieves effective LRD modeling with linear computational complexity. Our key insight comes from recurrent neural networks (RNNs), particularly an emerging variant called minGRU (Feng et al., 2024) that has demonstrated remarkable ability to capture long-range dependencies in sequential data with linear complexity.

However, introducing the idea of minGRU to graph data presents fundamental challenges due to the inherent differences between sequential and graph structures. Unlike sequential data, where elements have natural ordering and positional information, graph nodes lack intrinsic ordering. Moreover, graphs contain explicit structural information through edges that is absent in sequential data. To bridge these differences, we develop GraphMinNet, which generalizes minGRU's efficient mechanisms to the graph domain while preserving its feature learning advantages and incorporating graph-specific structural information.

In summary, this paper has the following contributions:

  • We generalize the idea of minimal GRU, an RNN variant, to graph-structured data by developing an efficient algorithm that integrates node features with positional encoding;

  • The resulting model, GraphMinNet, has key advantages including strong ability to capture LRD, expressivity between 1-WL and 3-WL in terms of graph discrimination power, and linear computational complexity, all with provable performance guarantees;

  • Through extensive experiments across diverse datasets, we demonstrate that our algorithm has superior predictive accuracy and efficiency compared to state-of-the-art baselines.

2 Related Works

Figure 1: Schematic diagram of our proposed method. Here $u\in\mathbb{R}^{l}$ represents each node. $\sigma$ denotes the sigmoid activation function. $\beta$ is a learnable parameter that balances the global term and the self-term in Eq. (11).

We categorize the existing GNNs for prediction or classification tasks into several groups by their main network architectures. Notably, more recent models often combine elements from multiple categories.

GCN or message passing based. These methods leverage either spatial or spectral domain operations through message passing or graph convolutions. Key approaches include Graph Convolutional Networks (GCN) (Kipf & Welling, 2016), GatedGCN (Bresson & Laurent, 2017), Graph Isomorphism Networks (GIN) (Xu et al., 2018), Graph Attention Networks (GAT) (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017), and Principal Neighborhood Aggregation (PNA) (Corso et al., 2020). While efficient and scalable, these models typically have limited ability to capture LRD. Moreover, standard GCNs have limited expressiveness, as they are equivalent to the 1-Weisfeiler-Lehman (WL) test, whereas higher-order k-GNNs are proven to be more expressive (Morris et al., 2019).

Kernel based. Graph kernel methods have been extensively studied, including neural tangent kernel (Jacot et al., 2018), graph neural tangent kernel (Du et al., 2019), graph kernels with neural networks (Morris et al., 2019), and spectral kernel learning (Zhi et al., 2023). These methods offer theoretical guarantees through strong mathematical foundations from kernel theory, particularly in terms of provable convergence properties. However, they face several challenges in adaptability to hierarchical structures, capturing complex patterns, and incorporating node/edge features. Thus, they may have limited representation power.

Transformer or SSM based. These recent models leverage Transformers or State Space Models (SSMs) to capture LRD. Following the success of Transformers in text and image domains, they have been adapted for graph learning. Key approaches include Graphormer (Ying et al., 2021), SAN (Kreuzer et al., 2021), GraphGPS (Rampášek et al., 2022), Exphormer (Shirzad et al., 2023), Grit (Ma et al., 2023), and Specformer (Bo et al., 2022). With the emergence of SSMs such as Mamba (Gu & Dao, 2023), new graph models like Graph-Mamba (Wang et al., 2024) have been developed. While these models effectively capture LRD between distant nodes, Transformer-based models typically have quadratic complexity and are computationally demanding, whereas SSM-based models may not fully preserve the permutation equivariance of graph data (Zhang et al., 2024).

Position or structural aware. Various techniques incorporate position or substructure information in graphs, from early approaches like DeepWalk (Perozzi et al., 2014) and node2vec (Grover & Leskovec, 2016), to recent work in position encoding (You et al., 2019), distance encoding (Li et al., 2020), structural RNN (Jain et al., 2016), and SPE (Huang et al., 2024a). Recent models have explored substructural information through nested structures (NGNN (Zhang & Li, 2021)), identity-aware patterns (ID-GNN (You et al., 2021)), augmented kernels (GIN-AK+ (Zhao et al., 2021)), iterative learning (I2-GNN (Huang et al., 2022)), and sign patterns (SUN (Frasca et al., 2022), SignNet (Lim et al., 2022)). While these techniques effectively capture higher-order interactions with provable expressiveness bounds, they face scalability challenges, e.g., due to expensive distance computations and requirements for full graph structure access. They may also suffer from generalization issues, including overfitting to specific structures and sensitivity to graph variations.

Implicit and continuous-depth architecture. Continuous-depth architectures have emerged as a promising direction for graph learning, starting with GDE (Poli et al., 2019) which models network dynamics as a continuous-time system. GRAND (Chamberlain et al., 2021) developed graph neural diffusion, while Continuous GNNs (Xhonneux et al., 2020) provided a framework for continuous-depth networks, and GRED (Ding et al., 2024) enhanced this with recurrent and dilated memory. These approaches offer memory efficiency and natural handling of dynamical processes, but face challenges in solving differential equations and require careful tuning to balance stability and expressiveness.

3 Methods

3.1 Preliminary

GRU has been improved to minGRU (Feng et al., 2024) to overcome gradient vanishing/explosion and enable better capture of global dependencies. For an input sequence of tokens, minGRU is defined as:

\begin{split}
h_{t} &= (1-z_{t})\odot h_{t-1} + z_{t}\odot\tilde{h}_{t}\\
z_{t} &= \sigma(\text{Linear}_{d_{h}}(x_{t}))\\
\tilde{h}_{t} &= \text{Linear}_{d_{h}}(x_{t}),
\end{split}    (1)

where $x_{t}\in\mathbb{R}^{d_{x}}$ is the input feature vector and $h_{t}\in\mathbb{R}^{d_{h}}$ is its corresponding state-space representation at time $t$. Here, $\sigma(\cdot)$ is an element-wise non-linear activation function with values in $(0,1)$, and $\text{Linear}_{d_{h}}(x_{t})$ projects $x_{t}$ to a $d_{h}$-dimensional state space via an MLP.

The model achieves computational efficiency with complexity $O(2d_{x}d_{h})$ (Feng et al., 2024), significantly lower than the original GRU's $O(3d_{h}(d_{x}+d_{h}))$. Training can be parallelized using a parallel scan algorithm (Feng et al., 2024). To enable more effective feature extraction, minGRU employs state expansion, where the state dimension is $d_{h}=\alpha d_{x}$ with $\alpha\geq 1$.
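For concreteness, the following is a minimal PyTorch sketch of the recurrence in Eq. (1) in its sequential form; the module layout and the state-expansion factor `alpha` are illustrative assumptions rather than the reference implementation of (Feng et al., 2024), which is trained with a parallel scan.

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    def __init__(self, d_x: int, alpha: int = 2):
        super().__init__()
        d_h = alpha * d_x                      # expanded state dimension d_h = alpha * d_x
        self.to_z = nn.Linear(d_x, d_h)        # gate: z_t = sigmoid(Linear(x_t))
        self.to_h = nn.Linear(d_x, d_h)        # candidate state: h~_t = Linear(x_t)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_x); returns states of shape (batch, seq_len, d_h)
        z = torch.sigmoid(self.to_z(x))
        h_tilde = self.to_h(x)
        h_prev = torch.zeros_like(z[:, 0])
        outs = []
        for t in range(x.shape[1]):            # sequential unrolling of Eq. (1)
            h_prev = (1 - z[:, t]) * h_prev + z[:, t] * h_tilde[:, t]
            outs.append(h_prev)
        return torch.stack(outs, dim=1)
```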

3.2 GraphMinNet for Graph Learning

The minGRU model enhances the original GRU with several key advantages: 1) the ability to capture global dependency, 2) linear efficiency in terms of input sequence length, 3) scalable model size with respect to input length, and 4) shift equivariance. Moreover, compared to state-space models (Gu et al., 2020, 2021), in particular Mamba (Gu & Dao, 2023), which has linear complexity and scalability, minGRU does not have a fixed state-space length and thus has a stronger ability to maintain context awareness or content selectivity (Feng et al., 2024). These advantages offer potential solutions to common GNN challenges, motivating our development of GraphMinNet.

To develop this model, we first unroll the minGRU recurrence to obtain an explicit expression that contains no intermediate state variables $h_{t-1}, h_{t-2}, \ldots$:

\begin{split}
h_{t} &= z_{t}\odot\tilde{h}_{t} + (1-z_{t})\odot h_{t-1}\\
&= z_{t}\odot\tilde{h}_{t} + (1-z_{t})\odot z_{t-1}\odot\tilde{h}_{t-1} + (1-z_{t})\odot(1-z_{t-1})\odot h_{t-2} = \cdots\\
&= \sum_{i=1}^{t}\prod_{j=i+1}^{t}(1-z_{j})\odot z_{i}\odot\tilde{h}_{i}.
\end{split}    (2)

In the last equality of Eq. (2) and hereafter, the notation $\prod$ stands for element-wise multiplication of $d_{h}$-dimensional vectors across the relevant indexes. By denoting $c_{i}=\prod_{j=1}^{i}(1-z_{j})$, $i=1,\cdots,t$, Eq. (2) can be written as

h_{t} = \Big(\sum_{i=1}^{t} z_{i}\odot\tilde{h}_{i}\odot c_{i}^{-1}\Big)\odot c_{t}.    (3)

The above equation facilitates deriving a corresponding model for graph learning, as it does not contain any intermediate latent state variables.
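As a quick numeric sanity check (a sketch with synthetic tensors and the assumption $h_{0}=0$), the closed form of Eq. (3) reproduces the recurrence of Eq. (1):

```python
import torch

T, d = 6, 4
z = torch.sigmoid(torch.randn(T, d))
h_tilde = torch.randn(T, d)

# Recurrence of Eq. (1): h_t = (1 - z_t) * h_{t-1} + z_t * h~_t, starting from h_0 = 0
h_rec = torch.zeros(d)
for t in range(T):
    h_rec = (1 - z[t]) * h_rec + z[t] * h_tilde[t]

# Closed form of Eq. (3): h_T = (sum_i z_i * h~_i / c_i) * c_T, with c_i = prod_{j<=i} (1 - z_j)
c = torch.cumprod(1 - z, dim=0)
h_closed = (z * h_tilde / c).sum(dim=0) * c[-1]

assert torch.allclose(h_rec, h_closed, atol=1e-5)
```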

For a graph $G=(V,A)$ with $n$ nodes in set $V$ and adjacency matrix $A$, each node $u$ has a feature vector $x_{u}\in\mathcal{R}^{l}$. To introduce the idea of minGRU to graph-structured data, we make three key observations:

1) Position Indexing: While minGRU uses sequence positions as indices, we associate these with graph nodes. Since graphs lack natural ordering, we replace $\sum_{i\leq t}$ with $\sum_{v\in V}$ to ensure permutation equivariance.

2) Positional Information: To capture node positions in the graph structure, we employ Laplacian positional embedding (LPE) (Wang et al., 2022; Huang et al., 2024b, 2023). Given the graph Laplacian eigendecomposition $L=\tilde{V}\Lambda\tilde{V}^{T}$, where $\tilde{V}\in\mathcal{R}^{n\times n}$ contains the eigenvectors and $\Lambda$ is the diagonal eigenvalue matrix, we define the $d$-dimensional LPE for node $u$ as $p_{u}=\tilde{V}[u,1:d]\in\mathcal{R}^{d}$. This encoding captures the absolute position of node $u$ in the graph. We denote by $\Lambda_{d}$ the vector of top $d$ non-zero eigenvalues (a computational sketch is given after these observations).

3) Content Dependence: The interaction terms $c_{i}^{-1}\odot c_{t}$ depend on $z_{j}$ (for $i+1\leq j\leq t$), which in turn depend on the input features $x_{j}$, creating content-dependent state variables. This mechanism parallels the selection mechanism in Mamba (Gu & Dao, 2023; Ahamed & Cheng, 2024). To fully capture both structural and feature information, we encode positions and features separately in matrix form. Therefore, we need to replace the $\odot$ operation involving the matrices with suitable operations that produce vectors.
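For reference, the LPE $p_{u}$ and the eigenvalue vector $\Lambda_{d}$ from observation 2 can be pre-computed as in the sketch below; the dense eigendecomposition is used for clarity, and keeping the smallest $d$ eigenvalues here is a convention choice rather than a prescription from the text.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian

def laplacian_pe(adj: sp.csr_matrix, d: int):
    """Return (Lambda_d, P) where P[u] = p_u, the d-dimensional LPE of node u."""
    L = laplacian(adj, normed=True)                 # normalized graph Laplacian
    eigvals, eigvecs = np.linalg.eigh(L.toarray())  # eigenvalues in ascending order
    return eigvals[:d], eigvecs[:, :d]              # Lambda_d: (d,), P: (n, d)
```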

Based on these observations, we construct the node embedding as follows. First, we define the feature component $A_{u}\in\mathcal{R}^{d\times m}$ for node $u$:

A_{u} = [\phi_{1}(\Lambda_{d})\odot(Wx_{u}),\cdots,\phi_{m}(\Lambda_{d})\odot(Wx_{u})],    (4)

where $W\in\mathcal{R}^{d\times l}$ is a learnable weight matrix and $\phi_{i}(\cdot):\mathcal{R}^{d}\rightarrow\mathcal{R}^{d}$ are learnable permutation equivariant functions, $i=1,\cdots,m$. Here, the $\phi_{i}(\Lambda_{d})$ are permutation equivariant with respect to the top-$d$ eigenvalues of the Laplacian, similar to the global permutation-equivariant set aggregation functions in (Wang et al., 2022; Huang et al., 2024b).

Similarly, we construct the positional component $C_{u}\in\mathcal{R}^{d\times m}$:

C_{u} = [\phi_{1}(\Lambda_{d})\odot p_{u},\cdots,\phi_{m}(\Lambda_{d})\odot p_{u}].    (5)

The overall embedding for node $u$ combines these components:

a_{u} = B(A_{u}\oplus_{1}C_{u}),    (6)

where $B\in\mathcal{R}^{l\times d}$ is a learnable matrix and $\oplus_{1}$ denotes element-wise aggregation (e.g., addition or multiplication) between the matrices $A_{u}$ and $C_{u}$. The resulting embedding $a_{u}$ has size $l\times m$.
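The construction of Eqs. (4)-(6) can be sketched in PyTorch as follows; realizing the $\phi_{i}$ as an element-wise MLP over the eigenvalues and taking $\oplus_{1}$ to be element-wise multiplication are illustrative choices, not the only ones admitted by the formulation.

```python
import torch
import torch.nn as nn

class NodeEmbedding(nn.Module):
    def __init__(self, l: int, d: int, m: int):
        super().__init__()
        self.W = nn.Linear(l, d, bias=False)          # W in R^{d x l}, Eq. (4)
        self.B = nn.Linear(d, l, bias=False)          # B in R^{l x d}, Eq. (6)
        # phi_i(Lambda_d): applied element-wise to each eigenvalue, hence permutation equivariant
        self.phi = nn.Sequential(nn.Linear(1, 16), nn.SiLU(), nn.Linear(16, m))

    def forward(self, x, p, lam):
        # x: (n, l) node features, p: (n, d) LPE, lam: (d,) top-d eigenvalues
        phi_lam = self.phi(lam.unsqueeze(-1))         # (d, m): column i is phi_i(Lambda_d)
        A = phi_lam * self.W(x).unsqueeze(-1)         # (n, d, m): phi_i(Lambda_d) ⊙ (W x_u)
        C = phi_lam * p.unsqueeze(-1)                 # (n, d, m): phi_i(Lambda_d) ⊙ p_u
        # a_u = B(A_u ⊙ C_u): apply B along the d-dimension, result (n, l, m)
        return self.B((A * C).transpose(1, 2)).transpose(1, 2)
```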

The inverse operation in minGRU is adapted to graphs by associating node encodings with the quantities in Eq. (2) and Eq. (3): $a_{v}\leftarrow c_{j}^{-1}\odot z_{j}\odot\tilde{h}_{j}$ for any node $v\in V$ and $a_{v}\odot z_{v}\odot\tilde{h}_{v}\leftarrow c_{j}$, where $\odot$ between $a_{v}$ and $z_{v}$ denotes element-wise multiplication of each column of $a_{v}$ with $z_{v}$.

Inspired by Eq. (3), we formulate GraphMinNet for node $u$ as:

h_{u} = \sum_{v\in V} a_{v}\odot(a_{u}\odot z_{u}\odot\tilde{h}_{u})    (7)
z_{u} = \sigma(\text{Linear}(x_{u}))    (8)
\tilde{h}_{u} = \text{Linear}(x_{u}).    (9)

Here, the state variable $h_{u}$ has the same dimension $l$ as the node feature $x_{u}$. Our formulation differs from the graph convolution in (Huang et al., 2024b) in two key aspects. First, nonlinear feature dependence: $h_{u}$ depends nonlinearly on $x_{u}$ through $\tilde{h}_{u}$ (linear transformation), $z_{u}$ (gated feature attention), and $A_{u}$ (in $a_{u}$, containing $x_{u}$ in each column), with the gated feature attention providing automatic focus on important features. Second, while (Huang et al., 2024b) primarily emphasizes positional encoding, our formulation incorporates features through $A_{u}$, $z_{u}$, and $\tilde{h}_{u}$.

For the matrix $a_{v}$, the $\odot$ operation with the vector $z_{v}\odot\tilde{h}_{v}$ multiplies the vector element-wise with each matrix column. Defining $\bar{a}=\sum_{v\in V}a_{v}$, we can reformulate Eq. (7) as:

h_{u} = \langle a_{u},\bar{a}\rangle\odot z_{u}\odot\tilde{h}_{u}.    (10)

Here, we generalize the $\odot$ operation between $a_{u}$ and $\bar{a}$ to an inner product operation $\langle\cdot,\cdot\rangle$ because $a_{u}$ and $\bar{a}$ are matrices. As $\bar{a}=\sum_{v\in V}a_{v}$, we define $\langle a_{u},a_{v}\rangle$ using four types of inner products:

Type 1. Elementary inner product between corresponding matrix rows;

  Type 2. For $m=l$, elementary inner product between corresponding columns, followed by transposition into a column vector;

  Type 3. Inner product of vectorized matrices, with the result multiplied by the $l$-dimensional all-ones vector;

  Type 4. Either $\langle BC_{u},BC_{v}\rangle\langle A_{u},A_{v}\rangle$ or $\langle BA_{u},BA_{v}\rangle\langle C_{u},C_{v}\rangle$, where the first inner product uses any of Types 1-3 and the second uses Type 2 or 3.

Note that the Einstein summation operation (einsum) in torch can be used to efficiently calculate these different types of products. In the model, all nodes $v$ share the same weight matrices $B$ (in $a_{v}$) and $W$ (in $A_{v}$). Because emphasizing node $u$ itself can help capture its useful information, we may further consider the following formulation:

h_{u} = z_{u}\odot\tilde{h}_{u}\odot\big(\beta\langle a_{u},\bar{a}\rangle + (2-\beta)\langle W_{s1}C_{u},W_{s2}C_{u}\rangle 1_{l}\big),    (11)

where $\beta\in[0,2]$ is a weighting parameter, $W_{s1},W_{s2}\in\mathcal{R}^{d\times d}$ are learnable matrices, and $1_{l}\in\mathcal{R}^{l}$ is an all-ones vector. The inner product $\langle\cdot,\cdot\rangle$ in the second term uses Type 3 (matrix vectorization). This formulation generalizes Eq. (7), which can be recovered as a special case when $\beta=1$, showing how the self-loop term integrates with the original structure.
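Combining Eqs. (8)-(10), the sketch below implements the global update with the Type-1 (row-wise) inner product via torch.einsum, as suggested above; it takes the embeddings $a_{u}$ of Eq. (6) as input, omits the optional self-term of Eq. (11), and is an illustrative layer rather than the released implementation.

```python
import torch
import torch.nn as nn

class GraphMinNetLayer(nn.Module):
    """Global update of Eq. (10); `a` holds the node embeddings a_u from Eq. (6)."""
    def __init__(self, l: int):
        super().__init__()
        self.to_z = nn.Linear(l, l)                   # z_u = sigmoid(Linear(x_u)), Eq. (8)
        self.to_h = nn.Linear(l, l)                   # h~_u = Linear(x_u), Eq. (9)

    def forward(self, x: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # x: (n, l) node features; a: (n, l, m) node embeddings a_u
        a_bar = a.sum(dim=0)                          # ā = sum_v a_v, shape (l, m)
        inner = torch.einsum('nlm,lm->nl', a, a_bar)  # Type-1 inner product <a_u, ā>, row-wise
        z = torch.sigmoid(self.to_z(x))
        h_tilde = self.to_h(x)
        return inner * z * h_tilde                    # h_u = <a_u, ā> ⊙ z_u ⊙ h~_u
```

Since $\bar{a}$ is computed once and reused for all nodes, this update costs $O(nlm)$, in line with the complexity stated in Proposition 3.6.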

3.3 Properties of GraphMinNet

We present several key properties of our GraphMinNet formulation. Detailed proofs are provided in the appendix.

Definition 3.1 (Permutation Equivariance).

For a graph $G$ with node features $X$ and adjacency matrix $A$, given a permutation matrix $Q$, a GNN function $f$ is permutation equivariant if and only if $f(QX,QAQ^{T})=Qf(X,A)$.

This property ensures that when nodes of an input graph are reordered (permuted), the node-level outputs transform according to the same permutation. Since node ordering in graphs is arbitrary, permutation equivariance provides an essential inductive bias that enables the model to learn representations independent of node indexing.

Definition 3.2 ((Lipschitz) Stability of a GNN).

For a GNN $f$, input graphs $G$ and $G^{\prime}$, and distance metrics $d_{i}$ and $d_{o}$ for graphs and outputs respectively, if $d_{i}(G,G^{\prime})\leq\epsilon$, then $d_{o}(f(G),f(G^{\prime}))\leq L\epsilon$, where $L$ is the Lipschitz constant of the network.

Stability encompasses three aspects. Structural stability refers to how outputs change when edges are added or removed, that is, how the output responds to changes in graph connectivity. Feature stability refers to how outputs change when node/edge features are perturbed, and thus concerns sensitivity to noise in feature values. Spectral stability refers to how changes in the graph's eigenvalues affect the output, which is particularly important for spectral-based GNN approaches. As shown in (Huang et al., 2023), stability is more general than equivariance and implies generalization ability (Bousquet & Elisseeff, 2002; Shalev-Shwartz et al., 2010). Therefore, the stability property is critical to a GNN's robustness to noise in real-world data, generalization ability, and adversarial robustness.

Proposition 3.3.

The formulation of GraphMinNet in Equations (7) (or (11)) through (9) is permutation equivariant. Moreover, if the functions $\phi_{i}(\cdot)$ used in forming $A_{u}$ and $C_{u}$ (Equations (4) and (5)) are Lipschitz, then GraphMinNet is Lipschitz stable with respect to both feature vectors and eigenvalues.

Many common permutation equivariant functions in neural networks are naturally Lipschitz, including linear permutation equivariant functions, element-wise Lipschitz operations, max/min pooling, and mean pooling. Thus, the Lipschitz condition is readily satisfied for typical choices of $\phi(\cdot)$, ensuring provable stability of GraphMinNet. As stability implies strong generalizability and robustness, we have the following result.

Corollary 3.4.

When the condition in Proposition 3.3 is satisfied, GraphMinNet has provable generalization ability and robustness.

The following property establishes GraphMinNet’s ability to capture long-range dependencies, which is critical for effective graph learning.

Proposition 3.5 (Long-range Dependency).

There exists $\phi(\cdot)$ such that the gradient norm $\|\frac{\partial h_{u}}{\partial x_{v}}\|$ of GraphMinNet does not decay as $spd(u,v)$ grows (with $n$ tending to $\infty$), where $spd(u,v)$ is the shortest path distance between $u$ and $v$.

In this paper, we assume the eigenvalue decomposition of $A$ is pre-computed. For large sparse graphs, we adopt the Lanczos algorithm (e.g., as implemented in ARPACK) to efficiently compute the $d$ largest/smallest eigenvalues and eigenvectors of the adjacency/Laplacian matrices. This computation has complexity $O(Md)$, where $M$ is the number of edges. Given these pre-computed eigenvalues and eigenvectors, the complexity and scalability of our algorithm are as follows:

Proposition 3.6 (Complexity and Scalability).

The hidden states $h_{1},\cdots,h_{n}$ can be computed from $x_{1},\cdots,x_{n}$ with a complexity of $O(nmdl)$, where $n$ is the number of nodes, $m$ is the number of columns in the node encoding ($A_{u}$, $C_{u}$, or $a_{u}$), $d$ is the number of rows in the node encoding, and $l$ is the feature dimension.

Additionally, the GraphMinNet algorithm in Equations (7) or (11) achieves linear scalability with respect to the number of nodes.
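For the pre-computation step discussed before Proposition 3.6, the sketch below obtains the $d$ smallest eigenpairs of the normalized Laplacian of a large sparse graph using the Lanczos-based ARPACK routine wrapped by SciPy; the helper name and the choice of the smallest eigenvalues are illustrative assumptions.

```python
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh   # Lanczos iteration via ARPACK

def precompute_spectrum(adj: sp.csr_matrix, d: int):
    """d smallest eigenvalues/eigenvectors of the normalized Laplacian of a sparse graph."""
    L = laplacian(adj, normed=True).asfptype()
    # 'SA' = smallest algebraic eigenvalues; each Lanczos iteration costs O(M) for M edges,
    # consistent with the O(Md) estimate given in the text.
    eigvals, eigvecs = eigsh(L, k=d, which='SA')
    return eigvals, eigvecs
```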

Proposition 3.7 (Expressiveness).

The formulation of GraphMinNet is more powerful than the 1-WL test but not more powerful than the 3-WL test.

These properties collectively demonstrate that GraphMinNet achieves efficient long-range dependency modeling with linear complexity while maintaining strong expressive power between 1-WL and 3-WL tests.

4 Experiments

Table 1: Performance comparison on various datasets. Best results are marked (1) for first, (2) for second, and (3) for third best. Results are reported as mean±std.

Model | MNIST Acc.↑ | CIFAR10 Acc.↑ | PATTERN Acc.↑ | CLUSTER Acc.↑ | Molhiv AUROC↑ | PascalVOC-SP F1↑ | Peptides-func AP↑ | Peptides-struct MAE↓ | ZINC MAE↓ | MalNet-Tiny Acc.↑ | Avg. Rank (lower is better)↓
GCN | 90.705±0.218 | 55.710±0.381 | 71.892±0.334 | 68.498±0.976 | 75.99±1.19 | 0.1268±0.0060 | 0.5930±0.0023 | 0.3496±0.0013 | 0.367±0.011 | 81.0 | 9.70
GIN | 96.485±0.252 | 55.255±1.527 | 85.387±0.136 | 64.716±1.553 | 77.07±1.49 | 0.1265±0.0076 | 0.5498±0.0079 | 0.3547±0.0045 | 0.526±0.051 | 88.98±0.56 | 9.90
GAT | 95.535±0.205 | 64.223±0.455 | 78.271±0.186 | 70.587±0.447 | - | - | - | - | 0.384±0.007 | 92.10±0.24 | 9.50
GatedGCN | 97.340±0.143 | 67.312±0.311 | 85.568±0.088 | 73.840±0.326 | - | 0.2873±0.0219 | 0.5864±0.0077 | 0.3420±0.0013 | - | 92.23±0.56 | 8.25
SAN | - | - | 86.581±0.037 | 76.691±0.650 | 77.85±2.47 | 0.3216±0.0027 | 0.6439±0.0075 | 0.2545±0.0012 | 0.139±0.006 | - | 7.43
GraphGPS | 98.051±0.126 | 72.298±0.356 | 86.685±0.059 | 78.016±0.180 | 78.80±1.01 (3) | 0.3748±0.0109 | 0.6535±0.0041 | 0.2500±0.0005 | 0.070±0.004 | 93.50±0.41 | 5.70
Exphormer | 98.550±0.039 (2) | 74.690±0.125 | 86.740±0.015 | 78.070±0.037 | 78.79±1.31 | 0.3975±0.0037 | 0.6527±0.0043 | 0.2481±0.0007 | 0.111±0.007 | 94.02±0.21 (2) | 4.70
Grit | 98.108±0.111 | 76.468±0.881 | 87.196±0.076 (3) | 80.026±0.277 (1) | - | - | 0.6988±0.0082 | 0.2460±0.0012 | 0.059±0.002 (1) | - | 3.29 (3)
GRED | 98.383±0.012 | 76.853±0.185 (3) | 86.759±0.020 | 78.495±0.103 | - | - | 0.7133±0.0011 (2) | 0.2455±0.0013 (2) | 0.077±0.002 | - | 3.71
Graph-Mamba | 98.420±0.080 | 73.700±0.340 | 86.710±0.050 | 76.800±0.360 | 78.23±1.21 | 0.4191±0.0126 (3) | 0.6739±0.0087 | 0.2478±0.0016 | 0.067±0.002 | 93.40±0.27 | 5.00
GSSC | 98.492±0.051 (3) | 77.642±0.456 (2) | 87.510±0.082 (2) | 79.156±0.152 (3) | 80.35±1.42 (2) | 0.4561±0.0039 (1) | 0.7081±0.0062 (3) | 0.2459±0.0020 (3) | 0.064±0.002 (3) | 94.06±0.64 (1) | 2.30
Ours | 98.598±0.138 (1) | 78.068±0.785 (1) | 87.552±0.123 (1) | 79.284±0.122 (2) | 80.86±0.56 (1) | 0.4352±0.0030 (2) | 0.7182±0.0024 (1) | 0.2438±0.0014 (1) | 0.063±0.001 (2) | 93.72±0.29 (3) | 1.50

In this section, we present our experimental results. Table 1 summarizes the performance of our proposed approach compared against several robust baselines across multiple datasets. We evaluated our method on 10 diverse datasets, with detailed descriptions provided in Appendix 8 (Table 8).

Our evaluation encompasses several key dataset categories: 1) The Long Range Graph Benchmark (Dwivedi et al., 2022), which requires effective reasoning over long-range interactions, consisting of three tasks: Peptides-func (graph-level classification with 10 peptide functional labels), Peptides-struct (graph-level regression of 11 molecular structural properties), and PascalVOC-SP (superpixel classification in image graphs). 2) Molecular graph datasets, including ZINC (Dwivedi et al., 2023) for graph regression of molecular properties and ogbg-molhiv (Hu et al., 2020) with 41k molecular graphs for classification. 3) Image graph datasets MNIST and CIFAR10 (Dwivedi et al., 2023), represented as 8-nearest neighbor superpixel graphs for classification. 4) Synthetic graph datasets PATTERN and CLUSTER (Dwivedi et al., 2023), generated using the Stochastic Block Model for node-level community classification. 5) Function call graphs from MalNet-Tiny (Freitas et al., 2020) for classification tasks. Our experimental results across these diverse settings demonstrate our method’s robustness and versatility.

We report the experimental results in Table 1, following the evaluation protocols from GSSC (Huang et al., 2024b) with mean and standard deviation across five random seeds (0 to 4). We report relevant hyperparameters in Appendix 9, Table 9. The results demonstrate GraphMinNet’s superior performance, achieving the best results on 6 out of 10 datasets and ranking second on 3 datasets, leading to the highest overall average rank among all methods. Even on the remaining dataset, GraphMinNet shows competitive performance comparable to state-of-the-art baselines.

We further evaluate GraphMinNet’s efficiency in terms of both model parameters and computational cost. Table 2 compares parameter counts against recent baselines, while Fig. 2 shows runtime analysis. Our model shows better or comparable efficiency to SOTA models.

Table 2: Model parameter count comparison across datasets.
Model | MNIST | CIFAR10 | PATTERN | CLUSTER | Molhiv | PascalVOC-SP | Peptides-func | Peptides-struct | ZINC | MalNet-Tiny
GraphGPS | 115.39K | 112.73K | 337.20K | 502.05K | 558.63K | 510.45K | 504.36K | 504.46K | 423.72K | 527.24K
Grit | 102.14K | 99.486K | 477.95K | 432.21K | - | - | 443.34K | 438.83K | 473.47K | -
GSSC | 133K | 131K | 539K | 539K | 351K | 375K | 410K | 410K | 436K | 299K
Ours | 122.82K | 120.13K | 431.54K | 432.01K | 338.80K | 474.10K | 386.32K | 391.92K | 415.28K | 279.17K
Figure 2: Runtime comparison per epoch, including training, validation, and test phases.

To demonstrate scalability, we conducted experiments on randomly generated graphs with node counts ranging from 1,000 to 20,000. These graphs were generated using PyTorch Geometric (Fey & Lenssen, 2019) with an average node degree of 5, maintaining realistic sparsity constraints. Node features were randomly initialized, and we included eigenvalues, eigenvectors, and logarithmic degree values to simulate diverse graph properties. We evaluated scalability through two metrics: FLOPs (Floating Point Operations), computed using thop (https://pypi.org/project/thop), and Maximum Memory Utilization, measured via torch.cuda.max_memory_allocated. As shown in Figure 3, both FLOPs and memory utilization demonstrate linear growth with respect to the number of nodes. This linear scalability confirms our theoretical analysis and demonstrates our method's practical efficiency for large-scale graph applications.
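The measurement loop for this experiment can be sketched as follows; thop.profile and torch.cuda.max_memory_allocated are the tools named above, while the helper name and the assumption that the model's forward takes a tuple of tensors are ours.

```python
import torch
from thop import profile

def measure_cost(model, inputs, device="cuda"):
    """Return (MACs, parameter count, peak GPU memory in bytes) for one forward pass."""
    model = model.to(device).eval()
    inputs = tuple(t.to(device) for t in inputs)
    macs, params = profile(model, inputs=inputs)        # thop reports MACs; FLOPs ≈ 2 * MACs
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        model(*inputs)
    peak_bytes = torch.cuda.max_memory_allocated(device)
    return macs, params, peak_bytes
```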

(a) FLOPs Utilization
(b) Memory Utilization
Figure 3: Scalability analysis of GraphMinNet. (a) shows the linear growth of FLOPs, reflecting computational efficiency with increasing graph size, and (b) depicts the linear growth of maximum memory usage, demonstrating feasible memory requirements.

4.1 Robustness Analysis

To validate our theoretical stability results empirically, we evaluate our method's robustness to feature perturbations. We introduce controlled synthetic noise to node embeddings by adding Gaussian perturbations proportional to the feature magnitudes. For each node feature $x_{u}\in\mathbb{R}^{l}$, we compute the perturbed embedding $x^{\prime}_{u}$ as:

x^{\prime}_{u} = x_{u} + \epsilon\cdot n_{u},    (12)

where $\epsilon$ is the noise level and $n_{u}\in\mathcal{R}^{l}$ is a noise term. We consider two types of noise: additive white noise $n_{u}\sim\mathcal{N}(0,I)$ and signal-dependent noise $n_{u}=\mathcal{N}(0,I)\odot\mu(x)$. Here, $\mathcal{N}(0,I)$ is the standard multivariate Gaussian and $\mu(x)$ denotes the mean feature magnitude. The second type models perturbations that scale appropriately with the underlying feature values. Figure 4 shows our method's performance on the Molhiv and Peptides-func datasets under increasing noise levels (0%, 5%, …, 30%). The results demonstrate consistent performance across noise levels, empirically confirming our theoretical stability guarantees.
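For reference, the perturbation of Eq. (12) can be generated as in the sketch below; interpreting $\mu(x)$ as the mean absolute feature magnitude is one reading of the definition above.

```python
import torch

def perturb_features(x: torch.Tensor, eps: float, signal_dependent: bool = False) -> torch.Tensor:
    # x: (n, l) node features; returns x' = x + eps * n_u as in Eq. (12)
    noise = torch.randn_like(x)                    # white noise: n_u ~ N(0, I)
    if signal_dependent:
        noise = noise * x.abs().mean()             # scale by the mean feature magnitude mu(x)
    return x + eps * noise
```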

Figure 4: Robustness analysis of our model under varying noise levels. SD: signal-dependent; WN: white noise.

5 Ablation Studies

We conducted ablation studies to analyze key components of GraphMinNet, including the optional self-term contribution, dropout effectiveness, and local method impact.

5.1 Optional Self-term

We first investigate the impact of the self-term introduced in Eq. (11). Results in Table 3 show dataset-dependent effects: while the self-term improves performance on Molhiv (AUROC increase of 2.55%), it slightly decreases performance on CLUSTER and Peptides-func. This variation suggests the self-term’s effectiveness correlates with specific graph properties - it appears more beneficial for molecular graphs with complex local structures (like Molhiv) compared to more regular graph structures (like CLUSTER).

Self-term | CLUSTER | Peptides-func | Molhiv
Without | 79.284±0.122 | 0.7182±0.0024 | 78.31±1.06
With | 78.942±0.126 | 0.7020±0.0100 | 80.86±0.56
Table 3: Ablation on the optional self-term in our method.

5.2 Embedding Representation

We analyzed different methods for aggregating the feature and positional components into the node embedding $a_{u}$ in Eq. (6), comparing two element-wise operations: multiplication and addition. Our results show that element-wise multiplication achieves higher accuracy across most datasets (e.g., the two datasets shown in Table 4). Based on these results, we adopt element-wise multiplication by default, except for MalNet-Tiny, which uses element-wise addition.

Aggregation | Peptides-func | Molhiv
Multiplication | 0.7182±0.0024 | 80.86±0.56
Addition | 0.6941±0.0028 | 80.19±0.32
Table 4: Comparison of the aggregation operation $\oplus_{1}$.

5.3 Effectiveness of Dropout Regularization

We analyze the impact of dropout regularization on model performance. Figure 5 compares training and validation performance with and without dropout. The results demonstrate that dropout plays a crucial role in preventing overfitting - models without dropout show significant performance degradation on validation sets, particularly on complex datasets like Peptides-func and Molhiv. This confirms dropout’s importance as a regularization mechanism in our architecture.

Figure 5: Effectiveness of dropouts.

5.4 Local Method

Method | CLUSTER | Peptides-func | Molhiv
+GatedGCN | 79.284±0.122 | 0.7182±0.0024 | 80.86±0.56
+GINE | 77.200±0.415 | 0.7129±0.0037 | 80.33±0.83
Table 5: Ablation results with an additional local method.

We evaluate GraphMinNet's stability when combined with different local methods that use local neighborhood information, including edge features, such as GatedGCN and GINE. Table 5 shows that GraphMinNet maintains consistent performance across different local methods: performance variations remain within about 2% on CLUSTER, about 0.6% on Peptides-func, and about 0.6% on Molhiv. These minimal differences across diverse datasets demonstrate that GraphMinNet can effectively integrate local structural information regardless of the specific local method employed, confirming its architectural robustness and versatility.

5.5 Nonlinear $W$ and $B$

Method | CLUSTER | Peptides-func | Molhiv
Linear | 79.284±0.122 | 0.7182±0.0024 | 80.86±0.56
Nonlinear | 78.672±0.073 | 0.6526±0.0067 | 79.76±1.16
Table 6: Ablation on linear versus nonlinear $W$ and $B$.

In our method, we use a single linear transformation for each of the matrices $W$ and $B$. To explore further, we conduct an ablation study that replaces each linear projection with two linear layers separated by a nonlinear activation (ReLU/GELU/SiLU) for both $W$ and $B$. We report the results in Table 6. For Molhiv and CLUSTER, the performance variation is around 1%, but for Peptides-func, performance drops by around 7%. Therefore, adding more layers may lead to overfitting on graph data, particularly for smaller datasets like Peptides-func.
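A sketch of the nonlinear variant used in this ablation, versus the default single linear map, is given below; the GELU activation and the layer widths are illustrative assumptions.

```python
import torch.nn as nn

def make_projection(in_dim: int, out_dim: int, nonlinear: bool = False) -> nn.Module:
    """Default: a single linear map (as for W and B in our method).
    Ablation variant: two linear layers with a GELU in between."""
    if nonlinear:
        return nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim))
    return nn.Linear(in_dim, out_dim, bias=False)
```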

5.6 Projection Strategy for $z_{u}$ and $\tilde{h}_{u}$

Method | CLUSTER | Peptides-func | Molhiv
LP | 79.284±0.122 | 0.7182±0.0024 | 80.86±0.56
NLP | 78.980±0.239 | 0.7038±0.0040 | 79.95±1.34
Table 7: Ablation study comparing the linear projection (LP) and nonlinear projection (NLP) used to obtain $z_{u}$ and $\tilde{h}_{u}$.

Inspired by minGRU (Feng et al., 2024), we adopt a linear projection (LP) for both $z_{u}$ and $\tilde{h}_{u}$ in our method. To further explore the projection strategy, we conduct an ablation study employing a nonlinear projection (NLP), similar to Subsection 5.5. The results presented in Table 7 indicate that NLP maintains competitive performance and may be beneficial for datasets with higher-dimensional node features to capture complex patterns. However, since NLP does not provide significant improvements while introducing additional parameters, we adopt LP as the default projection strategy for all datasets.

6 Conclusion

This paper presents GraphMinNet, a novel graph neural network that effectively captures long-range dependencies while maintaining linear computational complexity. Our approach successfully generalizes the idea of minimal GRU to graph-structured data while preserving permutation equivariance and ensuring stability, with theoretical guarantees for non-decaying gradients over long distances. Our key contributions include an efficient integration of node features with positional encoding, achieving linear complexity and scalability, along with theoretical proofs establishing the model’s stability and expressiveness bounds between 1-WL and 3-WL tests. Extensive experimental results across diverse datasets demonstrate GraphMinNet’s effectiveness, achieving superior performance while maintaining computational efficiency. Future directions include extensions to dynamic graphs, applications to larger-scale networks, and adaptation to heterogeneous graph structures.

One limitation of our model is that edge features are not explicitly accounted for. In future work, we would like to incorporate edge features directly into our model.

Impact Statement

This work aims to advance machine learning methods for graph-structured data. While our technical contributions focus on graph neural networks, we acknowledge that ML systems can have broader societal impacts. Potential applications of our work include modeling social networks, analyzing biological systems, and understanding complex network interactions. However, careful consideration must be given to bias in graph construction and the representativeness of training data in deployment contexts. We particularly emphasize the importance of responsible data collection and proper validation when applying these methods to sensitive domains. We encourage future work to investigate these aspects and develop robust guidelines for ethical applications.

Acknowledgements

This work was supported in part by the NSF under Grants IIS 2327113 and ITE 2433190, and by the NIH under Grant P30AG072946. We also thank the NSF for supporting AI research resources through NAIRR240219. We would like to thank the University of Kentucky Center for Computational Sciences and Information Technology Services Research Computing for their support and use of the Lipscomb Compute Cluster and associated research computing resources.

References

  • Achanta et al. (2012) Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., and Süsstrunk, S. Slic superpixels compared to state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine intelligence, 34(11):2274–2282, 2012.
  • Ahamed & Cheng (2024) Ahamed, M. A. and Cheng, Q. Timemachine: A time series is worth 4 mambas for long-term forecasting. arXiv preprint arXiv:2403.09898, 2024.
  • Bo et al. (2022) Bo, D., Shi, C., Wang, L., and Liao, R. Specformer: Spectral graph neural networks meet transformers. In The Eleventh International Conference on Learning Representations, 2022.
  • Bousquet & Elisseeff (2002) Bousquet, O. and Elisseeff, A. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
  • Bresson & Laurent (2017) Bresson, X. and Laurent, T. Residual gated graph convnets. arXiv preprint arXiv:1711.07553, 2017.
  • Chamberlain et al. (2021) Chamberlain, B., Rowbottom, J., Gorinova, M. I., Bronstein, M., Webb, S., and Rossi, E. Grand: Graph neural diffusion. In International conference on machine learning, pp.  1407–1418. PMLR, 2021.
  • Corso et al. (2020) Corso, G., Cavalleri, L., Beaini, D., Liò, P., and Veličković, P. Principal neighbourhood aggregation for graph nets. Advances in Neural Information Processing Systems, 33:13260–13271, 2020.
  • Ding et al. (2024) Ding, Y., Orvieto, A., He, B., and Hofmann, T. Recurrent distance filtering for graph representation learning. In Forty-first International Conference on Machine Learning, 2024.
  • Du et al. (2019) Du, S. S., Hou, K., Salakhutdinov, R. R., Poczos, B., Wang, R., and Xu, K. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. Advances in neural information processing systems, 32, 2019.
  • Dwivedi et al. (2022) Dwivedi, V. P., Rampášek, L., Galkin, M., Parviz, A., Wolf, G., Luu, A. T., and Beaini, D. Long range graph benchmark. Advances in Neural Information Processing Systems, 35:22326–22340, 2022.
  • Dwivedi et al. (2023) Dwivedi, V. P., Joshi, C. K., Luu, A. T., Laurent, T., Bengio, Y., and Bresson, X. Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43):1–48, 2023.
  • Everingham et al. (2010) Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010.
  • Feng et al. (2024) Feng, L., Tung, F., Ahmed, M. O., Bengio, Y., and Hajimirsadegh, H. Were rnns all we needed? arXiv preprint arXiv:2410.01201, 2024.
  • Fey & Lenssen (2019) Fey, M. and Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
  • Frasca et al. (2022) Frasca, F., Bevilacqua, B., Bronstein, M., and Maron, H. Understanding and extending subgraph gnns by rethinking their symmetries. Advances in Neural Information Processing Systems, 35:31376–31390, 2022.
  • Freitas et al. (2020) Freitas, S., Dong, Y., Neil, J., and Chau, D. H. A large-scale database for graph representation learning. arXiv preprint arXiv:2011.07682, 2020.
  • Grover & Leskovec (2016) Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp.  855–864, 2016.
  • Gu & Dao (2023) Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
  • Gu et al. (2020) Gu, A., Dao, T., Ermon, S., Rudra, A., and Ré, C. Hippo: Recurrent memory with optimal polynomial projections. Advances in neural information processing systems, 33:1474–1487, 2020.
  • Gu et al. (2021) Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., and Ré, C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021.
  • Hamilton et al. (2017) Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017.
  • Hu et al. (2020) Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.
  • Huang et al. (2022) Huang, Y., Peng, X., Ma, J., and Zhang, M. Boosting the cycle counting power of graph neural networks with i2-gnns. In The Eleventh International Conference on Learning Representations, 2022.
  • Huang et al. (2023) Huang, Y., Lu, W., Robinson, J., Yang, Y., Zhang, M., Jegelka, S., and Li, P. On the stability of expressive positional encodings for graph neural networks. arXiv preprint arXiv:2310.02579, 2023.
  • Huang et al. (2024a) Huang, Y., Lu, W., Robinson, J., Yang, Y., Zhang, M., Jegelka, S., and Li, P. On the stability of expressive positional encodings for graph neural networks. In The Twelfth International Conference on Learning Representations, 2024a.
  • Huang et al. (2024b) Huang, Y., Miao, S., and Li, P. What can we learn from state space models for machine learning on graphs? arXiv preprint arXiv:2406.05815, 2024b.
  • Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  • Jain et al. (2016) Jain, A., Zamir, A. R., Savarese, S., and Saxena, A. Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of the ieee conference on computer vision and pattern recognition, pp.  5308–5317, 2016.
  • Kipf & Welling (2016) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2016.
  • Kreuzer et al. (2021) Kreuzer, D., Beaini, D., Hamilton, W., Létourneau, V., and Tossou, P. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618–21629, 2021.
  • Li et al. (2020) Li, P., Wang, Y., Wang, H., and Leskovec, J. Distance encoding: Design provably more powerful neural networks for graph representation learning. Advances in Neural Information Processing Systems, 33:4465–4478, 2020.
  • Lim et al. (2022) Lim, D., Robinson, J. D., Zhao, L., Smidt, T., Sra, S., Maron, H., and Jegelka, S. Sign and basis invariant networks for spectral graph representation learning. In The Eleventh International Conference on Learning Representations, 2022.
  • Ma et al. (2023) Ma, L., Lin, C., Lim, D., Romero-Soriano, A., Dokania, P. K., Coates, M., Torr, P., and Lim, S.-N. Graph inductive biases in transformers without message passing. In International Conference on Machine Learning, pp.  23321–23337. PMLR, 2023.
  • Morris et al. (2019) Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pp.  4602–4609, 2019.
  • Perozzi et al. (2014) Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp.  701–710, 2014.
  • Poli et al. (2019) Poli, M., Massaroli, S., Park, J., Yamashita, A., Asama, H., and Park, J. Graph neural ordinary differential equations. arXiv preprint arXiv:1911.07532, 2019.
  • Rampášek et al. (2022) Rampášek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., and Beaini, D. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35:14501–14515, 2022.
  • Shalev-Shwartz et al. (2010) Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.
  • Shirzad et al. (2023) Shirzad, H., Velingker, A., Venkatachalam, B., Sutherland, D. J., and Sinop, A. K. Exphormer: Sparse transformers for graphs. In International Conference on Machine Learning, pp.  31613–31632. PMLR, 2023.
  • Singh et al. (2016) Singh, S., Chaudhary, K., Dhanda, S. K., Bhalla, S., Usmani, S. S., Gautam, A., Tuknait, A., Agrawal, P., Mathur, D., and Raghava, G. P. Satpdb: a database of structurally annotated therapeutic peptides. Nucleic acids research, 44(D1):D1119–D1126, 2016.
  • Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations, 2018.
  • Wang et al. (2022) Wang, H., Yin, H., Zhang, M., and Li, P. Equivariant and stable positional encoding for more powerful graph neural networks. In International Conference on Learning Representations, 2022.
  • Wang et al. (2024) Wang, X., Wang, S., Ding, Y., Li, Y., Wu, W., Rong, Y., Kong, W., Huang, J., Li, S., Yang, H., et al. State space model for new-generation network alternative to transformers: A survey. arXiv preprint arXiv:2404.09516, 2024.
  • Wu & Cheng (2022) Wu, X. and Cheng, Q. Stabilizing and enhancing link prediction through deepened graph auto-encoders. In IJCAI: proceedings of the conference, volume 2022, pp.  3587. NIH Public Access, 2022.
  • Wu et al. (2018) Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.
  • Xhonneux et al. (2020) Xhonneux, L.-P., Qu, M., and Tang, J. Continuous graph neural networks. In International conference on machine learning, pp.  10432–10441. PMLR, 2020.
  • Xu et al. (2018) Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2018.
  • Ying et al. (2021) Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? Advances in neural information processing systems, 34:28877–28888, 2021.
  • Ying et al. (2018) Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. Advances in neural information processing systems, 31, 2018.
  • You et al. (2019) You, J., Ying, R., and Leskovec, J. Position-aware graph neural networks. In International conference on machine learning, pp.  7134–7143. PMLR, 2019.
  • You et al. (2021) You, J., Gomes-Selman, J. M., Ying, R., and Leskovec, J. Identity-aware graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pp.  10737–10745, 2021.
  • Zhang et al. (2024) Zhang, B., Zhao, L., and Maron, H. On the expressive power of spectral invariant graph neural networks. arXiv preprint arXiv:2406.04336, 2024.
  • Zhang & Li (2021) Zhang, M. and Li, P. Nested graph neural networks. Advances in Neural Information Processing Systems, 34:15734–15747, 2021.
  • Zhao et al. (2021) Zhao, L., Jin, W., Akoglu, L., and Shah, N. From stars to subgraphs: Uplifting any gnn with local structure awareness. In International Conference on Learning Representations, 2021.
  • Zhi et al. (2023) Zhi, Y.-C., Ng, Y. C., and Dong, X. Gaussian processes on graphs via spectral kernel learning. IEEE Transactions on Signal and Information Processing over Networks, 9:304–314, 2023.

7 Appendix

7.1 Proofs of Model Properties

Proposition 7.1.

GraphMinNet in Eq. (7) is permutation equivariant. Moreover, if the $\phi_{i}(\cdot)$ are Lipschitz, GraphMinNet is also stable with respect to features and eigenvalues.

Proof.

We first establish three fundamental facts, which can be straightforwardly proved:

Fact 1. Permutation equivariance holds for node-wise operations that are applied independently to each node’s features.

Fact 2. The composition of permutation equivariant functions is also permutation equivariant.

Fact 3. The composition of Lipschitz functions is also Lipschitz.

Permutation Equivariance:
By Fact 1, $z_{u}$ and $\tilde{h}_{u}$ in Eq. (7) are permutation equivariant, as both are node-wise operations. Next, we prove that $A_{u}$, $C_{u}$, and $h_{u}$ are also permutation equivariant. For a node permutation $Q$, let $q(u)$ denote the index of node $u$ after permutation. Then, the Laplacian of the permuted graph is

\hat{L} = QLQ^{T} = (Q\tilde{V})\Lambda(Q\tilde{V})^{T}.

After node permutation, $\Lambda$ does not change. Then, we have the hat-versions $\hat{p}_{u}$, $\hat{A}_{u}$, $\hat{C}_{u}$, and $\hat{h}_{u}$ for $\hat{L}$, which are counterparts of the corresponding quantities for $L$. Therefore:

\begin{split}
\hat{p}_{q(u)} &= (Q\tilde{V})[q(u),1:d] = \tilde{V}[u,1:d] = p_{u},\\
\hat{A}_{q(u)} &= [\phi_{1}(\Lambda_{d})\odot(W\hat{x}_{q(u)}),\cdots,\phi_{m}(\Lambda_{d})\odot(W\hat{x}_{q(u)})]\\
&= [\phi_{1}(\Lambda_{d})\odot(Wx_{u}),\cdots,\phi_{m}(\Lambda_{d})\odot(Wx_{u})] = A_{u},\\
\hat{C}_{q(u)} &= [\phi_{1}(\Lambda_{d})\odot\hat{p}_{q(u)},\cdots,\phi_{m}(\Lambda_{d})\odot\hat{p}_{q(u)}]\\
&= [\phi_{1}(\Lambda_{d})\odot p_{u},\cdots,\phi_{m}(\Lambda_{d})\odot p_{u}] = C_{u}.
\end{split}

Thus, $\hat{a}_{q(u)}=a_{u}$. Finally, we have:

\begin{split}\hat{h}_{q(u)}&=\sum_{v\in V}\hat{a}_{q(v)}\odot(\hat{a}_{q(u)}\odot\hat{z}_{q(u)}\odot\hat{\tilde{h}}_{q(u)})\\ &=\sum_{v\in V}{a}_{v}\odot({a}_{u}\odot{z}_{u}\odot{\tilde{h}}_{u})\\ &=h_{u}.\end{split}

Therefore, the GraphMinNet algorithm is permutation equivariant.
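The following minimal NumPy sketch (illustrative only, not the authors' implementation) numerically checks this equivariance for a simplified single-channel instance of the layer, with $m=1$ and $B=I$; the latter choice keeps the computation invariant to eigenvector sign ambiguity, and the parameter names (c1, c2, W, phi) are assumptions made for the sketch.

```python
# Numerical check of permutation equivariance for a simplified GraphMinNet-style
# layer (m = 1, B = I). All parameter choices here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, l, d = 8, 4, 5                       # nodes, feature dim, spectral dim

# Weighted complete graph: a generic spectrum avoids eigenvalue multiplicities.
A = rng.uniform(0.1, 1.0, size=(n, n))
A = np.triu(A, 1); A = A + A.T
X = rng.normal(size=(n, l))

c1 = rng.normal(size=(l, l))            # parameters of z_u = sigma(c1 x_u)
c2 = rng.normal(size=(l, l))            # parameters of h~_u = c2 x_u
W = rng.normal(size=(d, l))             # projection used in A_u

def layer(A, X):
    deg = A.sum(1)
    L = np.eye(len(A)) - A / np.sqrt(np.outer(deg, deg))  # normalized Laplacian
    lam, V = np.linalg.eigh(L)                            # ascending eigenvalues
    P, lam_d = V[:, :d], lam[:d]                          # positional encodings p_u
    phi = np.sqrt(np.abs(lam_d))                          # one admissible Lipschitz phi
    Z = 1.0 / (1.0 + np.exp(-X @ c1))                     # z_u
    H_tilde = X @ c2                                      # h~_u
    A_enc = phi * (X @ W.T)                               # A_u, shape (n, d)
    C_enc = phi * P                                       # C_u, shape (n, d)
    a = A_enc * C_enc                                     # a_u with B = I
    score = a @ a.sum(0)                                  # <a_u, a_bar>, shape (n,)
    return score[:, None] * Z * H_tilde                   # h_u

perm = rng.permutation(n)
Q = np.eye(n)[perm]                     # relabels original node perm[i] as node i
print(np.allclose(layer(Q @ A @ Q.T, Q @ X), Q @ layer(A, X)))  # True: h_{q(u)} = h_u
```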

Stability Analysis:
First, we consider feature stability. For any node $u$ and its feature vector $x_{u}$ with perturbation $\Delta x_{u}$:

  • $z_{u}=\sigma(c_{1}x_{u})$ is Lipschitz since $\sigma$ is Lipschitz;

  • $\tilde{h}_{u}=c_{2}x_{u}$ is Lipschitz with constant $|c_{2}|$.

By Fact 3 and the chain rule, $h_{u}$ is Lipschitz, being a composition of Lipschitz functions.

For spectral and structural stability, consider a symmetric perturbation matrix $E$ applied to the adjacency matrix $A$. By Weyl's inequality, we have

|\lambda_{i}(A+E)-\lambda_{i}(A)|\leq\|E\|_{2}.

Since the $\phi_{i}(\cdot)$ are Lipschitz by assumption with constants $L_{i}$, and $A_{u}$ involves the composition of $\phi_{i}$ with the eigenvalues, we have

\|\Delta A_{u}\|\leq\max_{i}L_{i}\|E\|_{2}.

Finally, since $h_{u}$ is Lipschitz in $a_{u}$ and $a_{u}$ is Lipschitz in $A_{u}$, $h_{u}$ is stable with respect to both feature and eigenvalue perturbations. ∎
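As a quick illustration of the spectral bound used above, the following NumPy snippet (illustrative, not from the paper) checks Weyl's inequality for a random symmetric perturbation of a random adjacency matrix.

```python
# Check |lambda_i(A + E) - lambda_i(A)| <= ||E||_2 for a symmetric perturbation E.
import numpy as np

rng = np.random.default_rng(1)
n = 20
A = rng.integers(0, 2, (n, n)); A = np.triu(A, 1); A = (A + A.T).astype(float)
E = rng.normal(scale=0.05, size=(n, n)); E = (E + E.T) / 2   # symmetric perturbation

lam = np.linalg.eigvalsh(A)                 # sorted eigenvalues of A
lam_pert = np.linalg.eigvalsh(A + E)        # sorted eigenvalues of A + E
spec_norm_E = np.linalg.norm(E, 2)          # ||E||_2 (largest singular value)

print(np.max(np.abs(lam_pert - lam)), "<=", spec_norm_E)
assert np.all(np.abs(lam_pert - lam) <= spec_norm_E + 1e-12)
```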

Proposition 7.2.

There exists $\phi(\cdot)$ such that the gradient norm $\|\frac{\partial h_{u}}{\partial x_{v}}\|$ of GraphMinNet does not decay as $spd(u,v)$ grows (with $n$ tending to $\infty$), where $spd(u,v)$ is the shortest-path distance between $u$ and $v$.

Proof.

We consider a nonnegative adjacency matrix $A$ and $u\neq v$, since $spd(u,v)=0$ whenever $u=v$. Without loss of generality, we consider the case $l=m=1$ and $d=n$, where $x_{v}$, $z_{v}$, and $\tilde{h}_{v}$ are scalars with $z_{v}=\sigma(c_{1}x_{v})>0$, $\tilde{h}_{v}=c_{2}x_{v}$, and $c_{2}>0$. Let $\tilde{A}=D^{-1/2}AD^{-1/2}$ be the normalized adjacency matrix, with $D$ the diagonal degree matrix of $A$. Then

\begin{split}A_{u}&=\phi_{1}(\Lambda)\odot(Wx_{u})=\texttt{diag}(\phi_{1}(\Lambda))(Wx_{u}),\\ C_{u}&=\phi_{1}(\Lambda)\odot p_{u}.\end{split}

By taking $\oplus_{1}$ to be $\odot$ (the case where it is $+$ can be proved similarly), we have

\begin{split}a_{u}&=B(A_{u}\odot C_{u})\\ &=B(\texttt{diag}(\phi_{1}(\Lambda))(Wx_{u}))\odot(\texttt{diag}(\phi_{1}(\Lambda))p_{u}).\end{split}

Thus, we have

\begin{split}h_{u}&=\langle a_{u},\bar{a}\rangle\odot z_{u}\odot\tilde{h}_{u}\quad\text{(by Eq.\ (7))}\\ &=\sum_{v\in V}\langle B\,\texttt{diag}(\phi_{1}(\Lambda))(Wx_{u}),\,B\,\texttt{diag}(\phi_{1}(\Lambda))(Wx_{v})\rangle\,\langle C_{u},C_{v}\rangle\odot z_{u}\odot\tilde{h}_{u}\\ &=x_{u}z_{u}\tilde{h}_{u}\,(B\,\texttt{diag}(\phi_{1}(\Lambda))\,W)^{2}\sum_{v\in V}\big(x_{v}\,p_{u}^{T}\texttt{diag}(\phi_{1}(\Lambda))^{2}p_{v}\big)\quad\text{(as $l=1$)}\\ &=x_{u}z_{u}\tilde{h}_{u}\,(B\,\texttt{diag}(\phi_{1}(\Lambda))\,W)^{2}\sum_{v\in V}\big(x_{v}\,e_{u}^{T}\tilde{V}\texttt{diag}(\phi_{1}(\Lambda))^{2}\tilde{V}^{T}e_{v}\big)\\ &=x_{u}z_{u}\tilde{h}_{u}\,(B\,\texttt{diag}(\phi_{1}(\Lambda))\,W)^{2}\sum_{v\in V}\big(x_{v}\,e_{u}^{T}\phi_{1}^{2}(\tilde{V}\Lambda\tilde{V}^{T})e_{v}\big)\quad\text{(by spectral decomposition)}\\ &=x_{u}z_{u}\tilde{h}_{u}\,(B\,\texttt{diag}(\phi_{1}(\Lambda))\,W)^{2}\sum_{v\in V}\big(x_{v}\,e_{u}^{T}\phi_{1}^{2}(L)e_{v}\big).\end{split}

In the above, $e_{u}$ is the unit vector whose $u$-th element is 1; the second equality follows from the Type 4 definition of the inner product in Eq. (10). While the proof is shown for Type 4, it holds for the other types by appropriately choosing $B$. The learnable parameters $B$ and $W$ must be chosen to ensure $B\,\texttt{diag}(\phi_{1}(\Lambda))\,W\neq 0$. We define $\phi(\lambda)=\sum_{k=1}^{n}b_{k}\lambda^{k}$ with positive constants $b_{k}>0$ and let $\phi_{1}^{2}(L)=\phi(\tilde{A})$. Then, we have

\begin{split}\frac{\partial h_{u}}{\partial x_{v}}&=x_{u}z_{u}\tilde{h}_{u}\,(B\,\texttt{diag}(\phi_{1}(\Lambda))\,W)^{2}\,e_{u}^{T}\phi_{1}^{2}(L)e_{v}\\ &=c_{2}\,(B\,\texttt{diag}(\phi_{1}(\Lambda))\,W)^{2}\,x_{u}^{2}z_{u}\sum_{k=1}^{n}b_{k}(\tilde{A}^{k})_{u,v}.\end{split}

Let $k_{0}=spd(u,v)>0$. Since $(\tilde{A}^{k_{0}})_{u,v}$ is the degree-weighted sum over all walks of length $k_{0}$ from $u$ to $v$, and there exists at least one path of length $spd(u,v)$ between $u$ and $v$, it follows that $(\tilde{A}^{k_{0}})_{u,v}>0$. We denote this value by $\gamma$. Therefore, we have

\begin{split}\Big\|\frac{\partial h_{u}}{\partial x_{v}}\Big\|&=c_{2}\,(B\,\texttt{diag}(\phi_{1}(\Lambda))\,W)^{2}\,x_{u}^{2}z_{u}\,\Big|\sum_{k=1}^{n}b_{k}(\tilde{A}^{k})_{u,v}\Big|\\ &\geq c_{2}\,(B\,\texttt{diag}(\phi_{1}(\Lambda))\,W)^{2}\,x_{u}^{2}z_{u}\,b_{k_{0}}\gamma>0.\end{split}

The last inequality holds since $b_{k}>0$ and $(\tilde{A}^{k})_{u,v}\geq 0$ for all $k$, so every term in the sum is non-negative and the term at $k=k_{0}$ contributes $b_{k_{0}}\gamma>0$. This lower bound does not decay as $spd(u,v)$ grows, proving that the gradient does not vanish with distance and establishing GraphMinNet's long-range dependence property. ∎
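The following NumPy sketch (an illustrative check under the assumption $b_{k}=1$ for all $k$) evaluates the factor $\sum_{k=1}^{n}b_{k}(\tilde{A}^{k})_{u,v}$ from the bound above on a path graph, a worst case for distance, and confirms it is strictly positive for every node pair, including the two endpoints at maximal shortest-path distance.

```python
# Positivity of the walk-sum factor in the gradient lower bound, on a path graph.
# b_k = 1 for all k is an illustrative choice of positive coefficients.
import numpy as np

n = 30
A = np.zeros((n, n))
for i in range(n - 1):                      # path graph: nodes 0 - 1 - ... - (n-1)
    A[i, i + 1] = A[i + 1, i] = 1.0
deg = A.sum(1)
A_tilde = A / np.sqrt(np.outer(deg, deg))   # D^{-1/2} A D^{-1/2}

b = np.ones(n)                              # positive coefficients b_k > 0
S = np.zeros_like(A)
P = np.eye(n)
for k in range(1, n + 1):                   # S = sum_{k=1}^{n} b_k * A_tilde^k
    P = P @ A_tilde
    S += b[k - 1] * P

print(bool(np.all(S > 0)))                  # True: strictly positive for every pair
print(S[0, n - 1])                          # small but positive at distance n - 1
```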

Assuming that the eigenvalue decomposition of the adjacency matrix $A$ is precomputed and thus given, we have the following computational complexity for GraphMinNet:

Proposition 7.3.

The hidden states $h_{1},\cdots,h_{n}$ can be computed from $x_{1},\cdots,x_{n}$ with a complexity of $O(nmdl)$, where $n$ is the number of nodes, $m$ is the number of columns in the node encodings ($A_{u}$, $C_{u}$, or $c_{u}$), $d$ is the row dimension of the node encodings, and $l$ is the feature dimension.

Additionally, the GraphMinNet algorithm in Equations (7) or (11) achieves linear scalability with respect to the number of nodes.

Proof.

For linear complexity: The operation count is straightforward. For each node $u\in V$:

  1. Computing $z_{u},\tilde{h}_{u}\in\mathcal{R}^{l}$: the linear projections and the activation $\sigma(\cdot)$ cost $O(l^{2})$.

  2. Computing the node encodings:
     $Wx_{u}$ with $W\in\mathcal{R}^{d\times l}$: $O(dl)$;
     $A_{u}$: $m$ columns $\times$ $d$ multiplications $=O(dm)$;
     $C_{u}$: similarly $O(dm)$;
     $a_{u}=B(C_{u}\odot A_{u})$: the element-wise product $C_{u}\odot A_{u}$ costs $O(dm)$, and the multiplication with $B$ costs $O(ldm)$.

  3. Computing the final output: the inner product $\langle a_{u},\bar{a}\rangle$ costs $O(ldm)$.

Total complexity per node: $O(l^{2}+dl+dm+ldm)=O(ldm)$ (assuming $m,d>l$).

Overall complexity for $n$ nodes: $O(nldm)$.

For linear scalability: Given the eigenvalue decomposition, all computations are node-wise independent except for the global sum $\bar{a}=\sum_{v\in V}a_{v}$. The final step requires only one inner product per node with this global sum. Thus, the algorithm scales linearly with $n$ through:

  • independent node-wise computations;

  • a single global aggregation;

  • final node-wise inner products. ∎

This property certifies both the linear complexity and the linear scalability of GraphMinNet with respect to the number of nodes in the graph.
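A minimal NumPy sketch of this aggregation pattern (simplified to $m=1$ and ignoring the feature dimension, not the authors' code) shows why only a single global sum is needed: the quadratic pairwise form $\sum_{v}\langle a_{u},a_{v}\rangle$ coincides with the linear-time form $\langle a_{u},\bar{a}\rangle$.

```python
# Linear-time aggregation via one global sum, checked against the O(n^2) pairwise form.
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 16
a = rng.normal(size=(n, d))                 # per-node encodings a_u (node-wise work)

# Linear-time path: one global aggregate, then one inner product per node
a_bar = a.sum(axis=0)                       # O(n d)
s_linear = a @ a_bar                        # O(n d)

# Quadratic reference: explicit pairwise inner products
s_quadratic = (a @ a.T).sum(axis=1)         # O(n^2 d)

print(np.allclose(s_linear, s_quadratic))   # True
```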

Proposition 7.4.

The formulation of GraphMinNet is more powerful than the 1-WL test but not more powerful than the 3-WL test.

Proof.

First, we show that GraphMinNet is more powerful than 1-WL. Let $B=I$, $d=l$, and $\phi(\Lambda_{d})=\Lambda_{d}^{1/2}$. Then:

\begin{split}\langle a_{u},a_{v}\rangle&=\langle A_{u}\odot C_{u},\,A_{v}\odot C_{v}\rangle\\ &=\langle C_{u},C_{v}\rangle\,\langle A_{u},A_{v}\rangle=A_{u,v}\,\langle A_{u},A_{v}\rangle.\end{split}

The second equality follows from the Type 2 definition of the matrix inner product (see Eq. (10)). Since $A_{u,v}$ is an adjacency matrix element and $\langle A_{u},A_{v}\rangle$ is a function of the node features $x_{u}$ and $x_{v}$, GraphMinNet can implement standard message passing. As message passing is equivalent to the 1-WL test (Xu et al., 2018), GraphMinNet is at least as powerful as 1-WL. The additional structural components make it strictly more powerful.

For the upper bound, note that GraphMinNet is an eigenspace-projection GNN that uses a basis-invariant function of the positional encoding. By Zhang et al. (2024), such architectures cannot exceed the power of the 3-WL test. ∎
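For concreteness, the following sketch implements standard 1-WL color refinement (a textbook algorithm, not the paper's code) and shows its classical failure case, a 6-cycle versus two disjoint triangles; a model strictly more powerful than 1-WL must be able to separate such pairs.

```python
# Standard 1-WL color refinement and a classic pair it cannot distinguish.
import numpy as np
from collections import Counter

def wl_colors(adj, rounds=3):
    n = len(adj)
    colors = [0] * n                                   # uniform initial colors
    for _ in range(rounds):
        signatures = [
            (colors[u], tuple(sorted(colors[v] for v in range(n) if adj[u][v])))
            for u in range(n)
        ]
        relabel = {sig: i for i, sig in enumerate(sorted(set(signatures)))}
        colors = [relabel[sig] for sig in signatures]  # refine colors
    return Counter(colors)                             # color histogram of the graph

def cycle(n):
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
    return A

C6 = cycle(6)                                          # one 6-cycle
two_triangles = np.zeros((6, 6), dtype=int)
two_triangles[:3, :3] = cycle(3)                       # two disjoint triangles
two_triangles[3:, 3:] = cycle(3)

print(wl_colors(C6) == wl_colors(two_triangles))       # True: 1-WL cannot tell them apart
```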

Table 8: Dataset statistics used in the experiments.
Dataset | # Graphs | Avg. # nodes | Avg. # edges | Prediction level | Prediction task
ZINC | 12,000 | 23.2 | 24.9 | graph | regression
ogbg-molhiv | 41,127 | 25.5 | 27.5 | graph | binary classification
MNIST | 70,000 | 70.6 | 564.5 | graph | 10-class classification
CIFAR10 | 60,000 | 117.6 | 941.1 | graph | 10-class classification
PATTERN | 14,000 | 118.9 | 3,039.3 | node | binary classification
CLUSTER | 12,000 | 117.2 | 2,150.9 | node | 6-class classification
MalNet-Tiny | 5,000 | 1,410.3 | 2,859.9 | graph | 5-class classification
Peptides-func | 15,535 | 150.9 | 307.3 | graph | 10-class classification
Peptides-struct | 15,535 | 150.9 | 307.3 | graph | regression
PascalVOC-SP | 11,355 | 479.4 | 2,710.5 | node | 21-class classification

8 Datasets

We use 10 datasets that are widely adopted in the graph machine learning community. They cover a range of tasks, including graph-level regression, binary classification, and node-level classification. All datasets come with predefined training, validation, and test splits, ensuring consistency across experiments. In line with established practice, we report test results for the model that performs best on the validation set. To ensure the robustness of our findings, we evaluate each dataset over five distinct random seeds. This protocol follows prior studies (Rampášek et al., 2022; Ma et al., 2023; Shirzad et al., 2023; Huang et al., 2024b).

ZINC:

The ZINC-12k dataset (Dwivedi et al., 2023) is a subset of the ZINC database, which contains commercially available chemical compounds. This dataset comprises 12,000 molecular graphs, where each graph represents a small molecule with the number of atoms ranging from 9 to 37. In this representation, nodes correspond to heavy atoms (with 28 distinct atom types), and edges symbolize chemical bonds (of 3 different types). The primary task associated with ZINC-12k is graph-level regression.

ogbg-molhiv:

The ogbg-molhiv dataset (Hu et al., 2020) is adopted from the MoleculeNet collection (Wu et al., 2018) by the Open Graph Benchmark (OGB) project. It consists of molecular graphs where nodes and edges are featurized to represent various chemophysical properties. The task for this dataset is a binary graph-level classification, aiming to predict whether a molecule can inhibit HIV replication.

MNIST & CIFAR10:

The MNIST and CIFAR10 datasets (Dwivedi et al., 2023) are derived from the well-known image classification datasets of the same names. Each image is converted into a graph by constructing an 8-nearest-neighbor graph over its SLIC superpixels (Achanta et al., 2012). The task for these datasets is a 10-class graph-level classification, mirroring the original image classification tasks.
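A rough sketch of this construction is given below; it assumes scikit-image and scikit-learn are available, and the parameter values (e.g., n_segments) are illustrative rather than the benchmark's exact settings.

```python
# Image -> superpixel graph: SLIC superpixels as nodes, 8-NN edges over centroids.
import numpy as np
from skimage.segmentation import slic
from sklearn.neighbors import kneighbors_graph

image = np.random.rand(28, 28, 3)                  # stand-in for an MNIST-like image
labels = slic(image, n_segments=75, compactness=10, start_label=0)

# Node attributes: mean color and centroid of each superpixel
node_ids = np.unique(labels)
centroids = np.array([np.argwhere(labels == s).mean(axis=0) for s in node_ids])
colors = np.array([image[labels == s].mean(axis=0) for s in node_ids])

# 8-nearest-neighbor adjacency over superpixel centroids (sparse matrix)
adj = kneighbors_graph(centroids, n_neighbors=min(8, len(node_ids) - 1),
                       include_self=False)
print(adj.shape, colors.shape)
```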

PATTERN and CLUSTER:

The PATTERN and CLUSTER datasets (Dwivedi et al., 2023) are synthetic datasets that model community structures using the Stochastic Block Model (SBM). Both tasks involve inductive node-level classification. In the PATTERN dataset, the goal is to identify nodes that belong to one of 100 randomly generated subgraph patterns, which are distinct from the rest of the graph in terms of SBM parameters. For the CLUSTER dataset, each graph is composed of 6 clusters generated by the SBM, and the task is to predict the cluster ID of 6 test nodes, each representing a unique cluster within the graph.

MalNet-Tiny:

The MalNet-Tiny dataset (Freitas et al., 2020) is a subset of the larger MalNet repository, which contains function call graphs extracted from Android APKs. This subset includes 5,000 graphs, each with up to 5,000 nodes, representing either benign software or four categories of malware. In the MalNet-Tiny subset, all original node and edge attributes have been removed, and the classification task is based solely on the graph structure.

Peptides-func and Peptides-struct:

The Peptides-func and Peptides-struct datasets (Dwivedi et al., 2022) are derived from 15,000 peptides retrieved from the SATPdb database (Singh et al., 2016). Both datasets use the same set of graphs but focus on different prediction tasks. Peptides-func is a graph-level classification task with 10 functional labels associated with peptide functions. In contrast, Peptides-struct is a graph-level regression task aimed at predicting 11 structural properties of the molecules.

PascalVOC-SP:

The PascalVOC-SP dataset (Dwivedi et al., 2022) is a node classification dataset based on the Pascal VOC 2011 image dataset (Everingham et al., 2010). Superpixel nodes are extracted using the SLIC algorithm (Achanta et al., 2012), and a region-adjacency ("rag-boundary") graph that interconnects these nodes is constructed. The task is to classify each node into its corresponding object class, akin to semantic segmentation.

Table 9: Hyperparameters used to achieve our results.
Dataset | Hidden dim $(l)$ | Local method | # Layers | Lap. dim | Batch | LR | Weight decay | Dropouts
ZINC | 64 | GINE | 10 | 16 | 32 | 0.001 | 1e-5 | {0.1, 0.0, 0.6, 0.0}
ogbg-molhiv | 64 | GatedGCN | 6 | 16 | 128 | 0.002 | 0.001 | {0.1, 0.3, 0.1, 0.1}
MNIST | 52 | GatedGCN | 3 | 32 | 16 | 0.005 | 0.01 | {0.1, 0.1, 0.1, 0.4}
CIFAR10 | 52 | GatedGCN | 3 | 32 | 16 | 0.005 | 0.01 | {0.1, 0.1, 0.1, 0.0}
PATTERN | 36 | GatedGCN | 24 | 32 | 32 | 0.001 | 0.1 | {0.1, 0.1, 0.5, 0.0}
CLUSTER | 36 | GatedGCN | 24 | 32 | 16 | 0.001 | 0.1 | {0.1, 0.3, 0.3, 0.0}
MalNet-Tiny | 64 | GatedGCN | 5 | 0 | 16 | 0.0015 | 1e-5 | {0.1, 0.1, 0.35, 0.05}
Peptides-func | 100 | GatedGCN | 3 | 31 | 256 | 0.003 | 0.1 | {0.0, 0.1, 0.1, 0.3}
Peptides-struct | 100 | GatedGCN | 3 | 31 | 128 | 0.003 | 0.1 | {0.0, 0.1, 0.1, 0.5}
PascalVOC-SP | 96 | GatedGCN | 4 | 63 | 16 | 0.002 | 0.1 | {0.0, 0.0, 0.5, 0.0}

9 Hyperparameters

In this section, we summarize the key hyperparameters used to achieve the results presented in Table 1. These parameters, detailed in Table 9, include the hidden dimension ($l$), the type of local method (such as GINE or GatedGCN), the number of layers, the dimensionality of the Laplacian features (eigenvalues and eigenvectors), the batch size, the learning rate, the weight decay, and the dropout rates. The dropout rates are specified for different components of the model: the feedforward network, the local method, the residual connection, and our proposed GraphMinNet. For the MalNet-Tiny and PascalVOC-SP datasets, we use the zeroth-order power function $\phi_{i}$ in $A_{u}$. Additionally, for MalNet-Tiny, we set $\phi_{i}$ in $C_{u}$ to be the zero function and define the $\oplus_{1}$ operation as elementwise addition. For all other datasets, we use a generalized permutation-equivariant set aggregation function (Wang et al., 2022).
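As an illustration, the hypothetical configuration below organizes the Table 9 settings for ZINC in one place; the field names are our own, and the four dropout values are assumed to map to the components in the order listed above (feedforward, local method, residual, GraphMinNet).

```python
# Illustrative configuration sketch (hypothetical structure, not the authors' code)
# collecting the Table 9 hyperparameters for ZINC.
zinc_config = {
    "hidden_dim": 64,          # l
    "local_method": "GINE",
    "num_layers": 10,
    "laplacian_dim": 16,       # number of eigenvalue/eigenvector pairs
    "batch_size": 32,
    "learning_rate": 1e-3,
    "weight_decay": 1e-5,
    "dropout": {               # assumed ordering of the four dropout values
        "feedforward": 0.1,
        "local_method": 0.0,
        "residual": 0.6,
        "graphminnet": 0.0,
    },
}
```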