GraphMinNet: Learning Dependencies in Graphs with Light Complexity Minimal Architecture
Abstract
Graph Neural Networks (GNNs) have demonstrated remarkable success in various applications, yet they often struggle to capture long-range dependencies (LRD) effectively. This paper introduces GraphMinNet, a novel GNN architecture that generalizes the idea of minimal Gated Recurrent Units to graph-structured data. Our approach achieves efficient LRD modeling with linear computational complexity while maintaining permutation equivariance and stability. The model incorporates both structural and positional information through a unique combination of feature and positional encodings, leading to provably stronger expressiveness than the 1-WL test. Theoretical analysis establishes that GraphMinNet maintains non-decaying gradients over long distances, ensuring effective long-range information propagation. Extensive experiments on ten diverse datasets, including molecular graphs, image graphs, and synthetic networks, demonstrate that GraphMinNet achieves state-of-the-art performance while being computationally efficient. Our results show superior performance on 6 out of 10 datasets and competitive results on the others, validating the effectiveness of our approach in capturing both local and global graph structures.
1 Introduction
Graphs are widely used in various fields ranging from social networks and knowledge representations to engineering. Graph neural networks (GNNs) provide crucial techniques for extracting information and making inference over graph data. While numerous GNN models have been developed, important challenges still need to be overcome.
A fundamental challenge in GNNs is capturing long-range dependencies (LRDs) - the ability to model relationships between nodes that are far apart in the graph structure. Classical GNNs typically use message passing between neighboring nodes, where each layer only allows information to travel one hop. To reach distant nodes, information must pass through many layers, leading to a phenomenon called over-squashing: as information propagates through multiple layers, messages from many source nodes are compressed into fixed-size node representations, causing excessive compression and loss of important details. This over-squashing problem, particularly severe in deep GNNs or highly connected graphs, creates information bottlenecks that prevent effective modeling of long-range relationships.
The ability to capture LRD is critical because many real-world graphs inherently contain important long-distance relationships and require understanding of global graph structure. Beyond over-squashing, attempting to capture these dependencies through deep message passing leads to additional challenges such as over-smoothing (where node features become too similar), gradient vanishing, and information dilution. Thus, while GNN performance often depends on capturing both local and distant graph interactions, existing approaches struggle with this fundamental tension.
Several approaches have been proposed to address these challenges, including attention mechanisms over longer paths (Ying et al., 2021; Kreuzer et al., 2021; Rampášek et al., 2022), global graph features (Zhang & Li, 2021; You et al., 2021), skip connections (Wu & Cheng, 2022), graph diffusion (Chamberlain et al., 2021), and multi-scale architectures (Ying et al., 2018). While these approaches show promise, attention-based and multi-scale methods often face computational scalability issues with large graphs, whereas simpler approaches like skip connections and global features can be prone to overfitting on complex graph structures.
To address these limitations, we propose a novel approach that achieves effective LRD modeling with linear computational complexity. Our key insight comes from recurrent neural networks (RNNs), particularly an emerging variant called minGRU (Feng et al., 2024) that has demonstrated remarkable ability to capture long-range dependencies in sequential data with linear complexity.
However, introducing the idea of minGRU to graph data presents fundamental challenges due to the inherent differences between sequential and graph structures. Unlike sequential data, where elements have natural ordering and positional information, graph nodes lack intrinsic ordering. Moreover, graphs contain explicit structural information through edges that is absent in sequential data. To bridge these differences, we develop GraphMinNet, which generalizes minGRU’s efficient mechanisms to the graph domain while preserving its feature learning advantages and incorporating graph-specific structural information.
In summary, this paper has the following contributions:
• We generalize the idea of minimal GRU, an RNN variant, to graph-structured data by developing an efficient algorithm that integrates node features with positional encoding;
• The resulting model, GraphMinNet, has key advantages including strong ability to capture LRD, expressivity between 1-WL and 3-WL in terms of graph discrimination power, and linear computational complexity, all with provable performance guarantees;
• Through extensive experiments across diverse datasets, we demonstrate that our algorithm has superior predictive accuracy and efficiency compared to state-of-the-art baselines.
2 Related Works

We categorize the existing GNNs for prediction or classification tasks into several groups by their main network architectures. Notably, more recent models often combine elements from multiple categories.
GCN or message passing based. These methods leverage either spatial or spectral domain operations through message passing or graph convolutions. Key approaches include Graph Convolutional Networks (GCN) (Kipf & Welling, 2016), GatedGCN (Bresson & Laurent, 2017), Graph Isomorphism Networks (GIN) (Xu et al., 2018), Graph Attention Networks (GAT) (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017), and Principal Neighborhood Aggregation (PNA) (Corso et al., 2020). While efficient and scalable, these models typically have limited ability to capture LRD. Moreover, standard GCNs have limited expressiveness, as they are equivalent to the 1-Weisfeiler-Lehman (WL) test, whereas higher-order k-GNNs are proven to be more expressive (Morris et al., 2019).
Kernel based. Graph kernel methods have been extensively studied, including neural tangent kernel (Jacot et al., 2018), graph neural tangent kernel (Du et al., 2019), graph kernels with neural networks (Morris et al., 2019), and spectral kernel learning (Zhi et al., 2023). These methods offer theoretical guarantees through strong mathematical foundations from kernel theory, particularly in terms of provable convergence properties. However, they face several challenges in adaptability to hierarchical structures, capturing complex patterns, and incorporating node/edge features. Thus, they may have limited representation power.
Transformer or SSM based. These recent models leverage Transformers or State Space Models (SSMs) to capture LRD. Following the success of Transformers in text and image domains, they have been adapted for graph learning. Key approaches include Graphormer (Ying et al., 2021), SAN (Kreuzer et al., 2021), GraphGPS (Rampášek et al., 2022), Exphormer (Shirzad et al., 2023), Grit (Ma et al., 2023), and Specformer (Bo et al., 2022). With the emergence of SSMs such as Mamba (Gu & Dao, 2023), new graph models like Graph-Mamba (Wang et al., 2024) have been developed. While these models effectively capture LRD between distant nodes, Transformer-based models typically have quadratic complexity and are computationally demanding, whereas SSM-based models may not fully preserve the permutation equivariance of graph data (Zhang et al., 2024).
Position or structural aware. Various techniques incorporate position or substructure information in graphs, from early approaches like DeepWalk (Perozzi et al., 2014) and node2vec (Grover & Leskovec, 2016), to recent work in position encoding (You et al., 2019), distance encoding (Li et al., 2020), structural RNN (Jain et al., 2016), and SPE (Huang et al., 2024a). Recent models have explored substructural information through nested structures (NGNN (Zhang & Li, 2021)), identity-aware patterns (ID-GNN (You et al., 2021)), augmented kernels (GIN-AK+ (Zhao et al., 2021)), iterative learning (I2-GNN (Huang et al., 2022)), and sign patterns (SUN (Frasca et al., 2022), SignNet (Lim et al., 2022)). While these techniques effectively capture higher-order interactions with provable expressiveness bounds, they face scalability challenges, e.g., due to expensive distance computations and requirements for full graph structure access. They may also suffer from generalization issues, including overfitting to specific structures and sensitivity to graph variations.
Implicit and continuous-depth architecture. Continuous-depth architectures have emerged as a promising direction for graph learning, starting with GDE (Poli et al., 2019) which models network dynamics as a continuous-time system. GRAND (Chamberlain et al., 2021) developed graph neural diffusion, while Continuous GNNs (Xhonneux et al., 2020) provided a framework for continuous-depth networks, and GRED (Ding et al., 2024) enhanced this with recurrent and dilated memory. These approaches offer memory efficiency and natural handling of dynamical processes, but face challenges in solving differential equations and require careful tuning to balance stability and expressiveness.
3 Methods
3.1 Preliminary
GRU has been improved to minGRU (Feng et al., 2024) to overcome gradient vanishing/explosion and enable better capture of global dependencies. For an input sequence of tokens $x_1, \dots, x_T$, minGRU is defined as:
$$z_t = \sigma(\mathrm{Linear}_{d_h}(x_t)), \qquad \tilde h_t = \mathrm{Linear}_{d_h}(x_t), \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t, \tag{1}$$
where $x_t \in \mathbb{R}^{d}$ is the input feature vector and $h_t \in \mathbb{R}^{d_h}$ is its corresponding state-space representation at time $t$. Here, $\sigma$ is an element-wise non-linear activation function with values in $(0,1)$, and $\mathrm{Linear}_{d_h}$ projects $x_t$ to a $d_h$-dimensional state space via an MLP.
The model achieves computational efficiency with complexity $O(2\,d\,d_h)$ (Feng et al., 2024), significantly lower than the original GRU's $O(3\,d_h(d + d_h))$. Training can be parallelized using a parallel scan algorithm (Feng et al., 2024). To enable more effective feature extraction, minGRU employs state expansion, where the state dimension $d_h = \alpha d$ with expansion factor $\alpha \ge 1$.
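For concreteness, the following is a minimal PyTorch sketch of the recurrence in Eq. (1); the class name, dimensions, and the sequential loop (used here in place of the parallel scan of Feng et al. (2024)) are illustrative choices, not the reference implementation.

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    """Minimal GRU sketch: gates depend only on the current input, so the
    recurrence is an element-wise affine scan over time (Eq. (1))."""
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.to_z = nn.Linear(d_in, d_hidden)        # gate z_t
        self.to_h_tilde = nn.Linear(d_in, d_hidden)  # candidate state h~_t

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in)
        z = torch.sigmoid(self.to_z(x))              # (B, T, d_h), values in (0, 1)
        h_tilde = self.to_h_tilde(x)                 # (B, T, d_h)
        h = torch.zeros(x.size(0), z.size(-1), device=x.device)
        outs = []
        for t in range(x.size(1)):                   # sequential form; a parallel scan
            h = (1 - z[:, t]) * h + z[:, t] * h_tilde[:, t]  # yields the same result
            outs.append(h)
        return torch.stack(outs, dim=1)              # (B, T, d_h)

# usage
seq = torch.randn(2, 16, 8)
print(MinGRU(8, 32)(seq).shape)   # torch.Size([2, 16, 32])
```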
3.2 GraphMinNet for Graph Learning
The minGRU model enhances the original GRU with several key advantages: 1) the ability to capture global dependencies, 2) linear efficiency in terms of input sequence length, 3) scalable model size with respect to input length, and 4) shift equivariance. Moreover, compared to state-space models (Gu et al., 2020; 2021), in particular Mamba (Gu & Dao, 2023), which has linear complexity and scalability, minGRU does not have a fixed state-space length and thus has a stronger ability to maintain context awareness or content selectivity (Feng et al., 2024). These advantages offer potential solutions to common GNN challenges, motivating our development of GraphMinNet.
To develop this model, we first obtain an explicit expression for minGRU that contains no intermediate state variable $h_{t-1}$:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t = \cdots = \sum_{k=1}^{t} \Big(\prod_{j=k+1}^{t} (1 - z_j)\Big) \odot z_k \odot \tilde h_k. \tag{2}$$
In the last equality of Eq. (2) and hereafter, the product notation stands for element-wise multiplication of $d_h$-dimensional vectors across the relevant indexes. By denoting $c_t = \prod_{j=1}^{t}(1 - z_j)$ and $b_k = z_k \odot \tilde h_k$, Eq. (2) can be written as
$$h_t = c_t \odot \sum_{k=1}^{t} c_k^{-1} \odot b_k, \tag{3}$$ where the inverse $c_k^{-1}$ is taken element-wise.
The above equation facilitates deriving a corresponding model for graph learning, as it does not contain any intermediate latent state variables.
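Assuming the unrolled form above (with $c_t = \prod_{j\le t}(1-z_j)$ and $b_k = z_k \odot \tilde h_k$), the sketch below checks numerically that the prefix-product reformulation matches the sequential recurrence; in practice a log-space parallel scan is preferred for numerical stability, but the plain version suffices to illustrate the algebra.

```python
import torch

def mingru_unrolled(z: torch.Tensor, h_tilde: torch.Tensor) -> torch.Tensor:
    """All h_t at once via h_t = c_t * sum_{k<=t} (z_k * h~_k) / c_k, c_t = prod_{j<=t}(1 - z_j).
    z, h_tilde: (T, d). A log-space scan is preferred in practice for stability."""
    c = torch.cumprod(1.0 - z, dim=0)          # c_t
    b = z * h_tilde                            # b_k
    return c * torch.cumsum(b / c, dim=0)      # prefix sums give every h_t at once

def mingru_sequential(z: torch.Tensor, h_tilde: torch.Tensor) -> torch.Tensor:
    h, outs = torch.zeros_like(z[0]), []
    for t in range(z.size(0)):
        h = (1 - z[t]) * h + z[t] * h_tilde[t]
        outs.append(h)
    return torch.stack(outs)

z = torch.rand(32, 4, dtype=torch.float64) * 0.8 + 0.1   # keep gates away from 0 and 1
h_tilde = torch.randn(32, 4, dtype=torch.float64)
print(torch.allclose(mingru_unrolled(z, h_tilde), mingru_sequential(z, h_tilde)))  # True
```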
For a graph $G = (\mathcal{V}, \mathcal{E})$ with $N$ nodes in set $\mathcal{V}$ and adjacency matrix $A$, each node $i \in \mathcal{V}$ has a feature vector $x_i \in \mathbb{R}^{d}$. To introduce the idea of minGRU to graph-structured data, we make three key observations:
1) Position Indexing: While minGRU uses sequence positions as indices, we associate these indices with graph nodes. Since graphs lack a natural ordering, we replace the order-dependent prefix aggregation over positions $k \le t$ with an order-independent aggregation over all nodes to ensure permutation equivariance.
2) Positional Information: To capture node positions in the graph structure, we employ Laplacian positional embedding (LPE) (Wang et al., 2022; Huang et al., 2024b; 2023). Given the graph Laplacian eigendecomposition $L = V \Lambda V^{\top}$, where $V$ contains the eigenvectors and $\Lambda$ is the diagonal eigenvalue matrix, we define the $m$-dimensional LPE for node $i$ from the $i$-th row of the top-$m$ eigenvectors. This encoding captures the absolute position of node $i$ in the graph. We denote by $\lambda \in \mathbb{R}^{m}$ the vector of the top $m$ non-zero eigenvalues. (A sketch of precomputing these eigenpairs with a sparse eigensolver is given after this list.)
3) Content Dependence: The interaction terms in Eq. (3) depend on $z_k$ and $\tilde h_k$ (for $k \le t$), which in turn depend on the input features $x_k$, creating content-dependent state variables. This mechanism parallels the selection mechanism in Mamba (Gu & Dao, 2023; Ahamed & Cheng, 2024). To fully capture both structural and feature information, we encode positions and features separately in matrix form. Therefore, we need to replace the operations involving these matrices with suitable operations that produce vectors.
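As one way to realize the LPE precomputation of observation 2), the sketch below uses ARPACK's Lanczos solver via scipy.sparse.linalg.eigsh on the combinatorial Laplacian; the graph construction, the choice of $m$, and the smallest-algebraic mode are illustrative (shift-invert may be faster on very large graphs).

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def laplacian_pe(edge_index: np.ndarray, num_nodes: int, m: int):
    """Top-m nontrivial eigenpairs of the combinatorial Laplacian L = D - A (Lanczos/ARPACK)."""
    row, col = edge_index
    A = sp.coo_matrix((np.ones(row.shape[0]), (row, col)),
                      shape=(num_nodes, num_nodes)).tocsr()
    A = A.maximum(A.T)                                        # symmetrize
    D = sp.diags(np.asarray(A.sum(axis=1)).ravel())
    L = (D - A).asfptype()
    vals, vecs = eigsh(L, k=m + 1, which="SA")                # smallest-algebraic eigenpairs
    order = np.argsort(vals)
    return vals[order][1:m + 1], vecs[:, order][:, 1:m + 1]   # drop the trivial lambda_0 = 0

# usage on a small ring graph (illustrative)
n = 50
ring = np.stack([np.arange(n), (np.arange(n) + 1) % n])
lam, V = laplacian_pe(ring, n, m=8)
print(lam.shape, V.shape)   # (8,) (50, 8)
```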
Based on these observations, we construct the node embedding as follows. First, we define the feature component $u_i$ for node $i$:
(4)
where $W$ is a learnable weight matrix and $f_1, \dots, f_m$ are learnable permutation equivariant functions of the eigenvalues. Here, $f_1, \dots, f_m$ are permutation equivariant with respect to the top-$m$ eigenvalues of the Laplacian, similar to the global permutation-equivariant set aggregation functions in (Wang et al., 2022; Huang et al., 2024b).
Similarly, we construct the positional component $p_i$:
(5)
The overall embedding $\phi_i$ for node $i$ combines these components:
(6)
where a learnable weight matrix is applied and $\star$ denotes element-wise aggregation (e.g., addition or multiplication) between the matrices $u_i$ and $p_i$. The resulting embedding $\phi_i$ is a matrix of the same size as $u_i$ and $p_i$.
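Because the exact forms of Eqs. (4)–(6) are stated above only in words, the following schematic sketch should be read as an assumption-laden illustration of how a feature component and a positional component built from the top-$m$ eigenpairs might be formed and combined element-wise; the shapes, the choice of a shared elementwise MLP over eigenvalues, and the class interface are ours, not the paper's exact definitions.

```python
import torch
import torch.nn as nn

class NodeEncoding(nn.Module):
    """Schematic node encoding: a feature part from x_i and a positional part from
    the Laplacian eigenpairs (V, lam), combined element-wise. Shapes are illustrative."""
    def __init__(self, d_feat: int, m_eigs: int, d_out: int, combine: str = "mul"):
        super().__init__()
        self.W = nn.Linear(d_feat, d_out, bias=False)            # learnable weight on features
        self.f = nn.Sequential(nn.Linear(1, d_out), nn.SiLU())    # equivariant map applied
        self.combine = combine                                    # to each eigenvalue

    def forward(self, x, V, lam):
        # x: (N, d_feat); V: (N, m) top-m eigenvectors; lam: (m,) top-m eigenvalues
        spec = self.f(lam.view(-1, 1)).unsqueeze(0)               # (1, m, d_out)
        u = self.W(x).unsqueeze(1) * spec                         # feature component, (N, m, d_out)
        p = V.unsqueeze(-1) * spec                                # positional component, (N, m, d_out)
        return u * p if self.combine == "mul" else u + p          # element-wise aggregation

N, d, m = 6, 8, 4
x, V, lam = torch.randn(N, d), torch.randn(N, m), torch.rand(m)
print(NodeEncoding(d, m, 16)(x, V, lam).shape)   # torch.Size([6, 4, 16])
```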
The inverse operation in minGRU (the $c_k^{-1}$ term in Eq. (3)) is adapted to graphs by associating the node encodings with the corresponding quantities in Eq. (2) and Eq. (3) for every node, where the product between a matrix and a vector denotes element-wise multiplication of each column of the matrix with the vector.
Inspired by Eq. (3), we formulate GraphMinNet for node $i$ as:
(7)
(8)
(9)
Here, the state variable $h_i$ has the same dimension as the node feature $x_i$. Our formulation differs from the graph convolution in (Huang et al., 2024b) in two key aspects. First, nonlinear feature dependence: $h_i$ depends nonlinearly on the node features through a linear transformation, a gated feature attention term, and the feature-dependent node encoding, with the gated feature attention providing automatic focus on important features. Second, while (Huang et al., 2024b) primarily emphasizes positional encoding, our formulation incorporates node features throughout these terms.
For a matrix, its operation with a vector multiplies the vector element-wise with each matrix column. Defining $A_i$ and $B_j$ as the corresponding matrix-valued encodings of nodes $i$ and $j$ (with $B_j$ incorporating the gated features of node $j$), we can reformulate Eq. (7) as:
$$h_i = \Big\langle A_i,\ \sum_{j \in \mathcal{V}} B_j \Big\rangle. \tag{10}$$
Here, we generalize the element-wise operation between $A_i$ and $\sum_j B_j$ to an inner product because both operands are matrices. We define this matrix inner product $\langle \cdot, \cdot \rangle$ using four types of inner products:
Type 1. Elementary inner product between corresponding matrix rows;
Type 2. Elementary inner product between corresponding matrix columns, followed by transposition into a column vector;
Type 3. Inner product of the vectorized matrices, with the result multiplied by a $d$-dimensional all-ones vector;
Type 4. A composition of two inner products, where the first uses any of Types 1–3 and the second uses Type 2 or 3.
Note that the Einstein summation operation (einsum) in PyTorch can be used to efficiently calculate these different types of products. In the model, all nodes share the same weight matrices used in $A_i$ and $B_j$. Because emphasizing node $i$ itself potentially facilitates capturing its useful information, we may further consider the following formulation with an additional self-loop term:
(11)
where $\gamma$ is a weighting parameter, the additional weight matrices are learnable, and $\mathbf{1}$ is an all-ones vector. The inner product in the second term uses Type 3 (matrix vectorization). This formulation generalizes Eq. (7), which can be recovered as a special case when $\gamma = 0$, showing how the self-loop term integrates with the original structure.
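The four inner-product types map directly onto torch.einsum expressions. The sketch below illustrates Types 1–3 for two encodings of assumed shape $(m, d)$; Type 4 is simply a composition of these.

```python
import torch

m, d = 4, 8
A_i = torch.randn(m, d)   # encoding of node i (rows indexed by eigen-components)
B = torch.randn(m, d)     # globally aggregated encoding, same assumed shape

# Type 1: inner product between corresponding rows -> one scalar per row, shape (m,)
t1 = torch.einsum("md,md->m", A_i, B)

# Type 2: inner product between corresponding columns -> one scalar per column, shape (d,)
t2 = torch.einsum("md,md->d", A_i, B)

# Type 3: inner product of the vectorized matrices, broadcast to a d-dimensional vector
t3 = torch.einsum("md,md->", A_i, B) * torch.ones(d)

# Type 4 composes two of the above (the first from Types 1-3, the second from Type 2 or 3).
print(t1.shape, t2.shape, t3.shape)   # torch.Size([4]) torch.Size([8]) torch.Size([8])
```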
3.3 Properties of GraphMinNet
We present several key properties of our GraphMinNet formulation. Detailed proofs are provided in the appendix.
Definition 3.1 (Permutation Equivariance).
For a graph with node features $X$ and adjacency matrix $A$, given a permutation matrix $P$, a GNN function $f$ is permutation equivariant if and only if $f(PX, PAP^{\top}) = P\,f(X, A)$.
This property ensures that when nodes of an input graph are reordered (permuted), the node-level outputs transform according to the same permutation. Since node ordering in graphs is arbitrary, permutation equivariance provides an essential inductive bias that enables the model to learn representations independent of node indexing.
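Definition 3.1 can be verified numerically for any candidate layer. The sketch below checks $f(PX, PAP^{\top}) = P\,f(X, A)$ for a simple one-hop averaging layer, which stands in here for a generic equivariant GNN function.

```python
import torch

def simple_gnn(X: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """Stand-in equivariant layer: average of self and mean-aggregated neighbor features."""
    deg = A.sum(dim=1, keepdim=True).clamp(min=1)
    return 0.5 * (X + A @ X / deg)

N, d = 5, 3
X = torch.randn(N, d)
A = (torch.rand(N, N) < 0.4).float()
A = ((A + A.t()) > 0).float()                  # symmetric, unweighted adjacency
P = torch.eye(N)[torch.randperm(N)]            # random permutation matrix

lhs = simple_gnn(P @ X, P @ A @ P.t())         # permute the input graph first
rhs = P @ simple_gnn(X, A)                     # permute the output instead
print(torch.allclose(lhs, rhs, atol=1e-6))     # True: the layer is equivariant
```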
Definition 3.2 ((Lipschitz) Stability of a GNN).
For a GNN $f$, input graphs $G_1$ and $G_2$, and distance metrics $d_{\mathcal{G}}$ and $d_{\mathcal{O}}$ for graphs and outputs respectively, $f$ is (Lipschitz) stable if whenever $d_{\mathcal{G}}(G_1, G_2) \le \epsilon$, then $d_{\mathcal{O}}(f(G_1), f(G_2)) \le C\epsilon$, where $C$ is the Lipschitz constant of the network.
Stability encompasses three aspects. Structural stability refers to how outputs change when edges are added or removed, i.e., the output response to changes in graph connectivity. Feature stability refers to how outputs change when node/edge features are perturbed, i.e., sensitivity to noise in feature values. Spectral stability refers to how changes in the graph’s eigenvalues affect the output, which is particularly important for spectral-based GNN approaches. As shown in (Huang et al., 2023), stability is more general than equivariance and implies generalization ability (Bousquet & Elisseeff, 2002; Shalev-Shwartz et al., 2010). Therefore, the stability property is critical to a GNN’s robustness to noise in real-world data, its generalization ability, and its adversarial robustness.
Proposition 3.3.
Many common permutation equivariant functions in neural networks are naturally Lipschitz, including linear permutation equivariant functions, element-wise Lipschitz operations, max/min pooling, and mean pooling. Thus, the Lipschitz condition is readily satisfied for typical choices of $f_1, \dots, f_m$, ensuring provable stability of GraphMinNet. As stability implies strong generalizability and robustness, we have the following result.
Corollary 3.4.
When the condition in Proposition 3.3 is satisfied, GraphMinNet has provable generalization ability and robustness.
The following property establishes GraphMinNet’s ability to capture long-range dependencies, which is critical for effective graph learning.
Proposition 3.5 (Long-range Dependency).
There exist model parameters such that the gradient norm $\big\|\partial h_i / \partial x_j\big\|$ of GraphMinNet does not decay as $d(i,j)$ grows (with $d(i,j)$ tending to $\infty$), where $d(i,j)$ is the shortest path distance between nodes $i$ and $j$.
In this paper, we assume the eigenvalue decomposition of the Laplacian (or adjacency) matrix is pre-computed. For large sparse graphs, we adopt the Lanczos algorithm (e.g., as implemented in ARPACK) to efficiently compute the largest/smallest eigenvalues and eigenvectors of the adjacency/Laplacian matrices. This computation has complexity $O(k|E|)$, where $|E|$ is the number of edges and $k$ is the number of computed eigenpairs. Given these pre-computed eigenvalues and eigenvectors, the complexity and scalability of our algorithm are given as follows:
Proposition 3.6 (Complexity and Scalability).
The hidden states $\{h_i\}$ can be computed from $\{x_i\}$ with a complexity of $O(Nmrd)$, where $N$ is the number of nodes, $m$ is the number of columns in the node encoding ($u_i$, $p_i$, or $\phi_i$), $r$ is the dimension of the rows in the node encoding, and $d$ is the feature dimension.
Proposition 3.7 (Expressiveness).
The formulation of GraphMinNet is more powerful than the 1-WL test but not more powerful than the 3-WL test.
These properties collectively demonstrate that GraphMinNet achieves efficient long-range dependency modeling with linear complexity while maintaining strong expressive power between 1-WL and 3-WL tests.
4 Experiments
  | MNIST | CIFAR10 | PATTERN | CLUSTER | Molhiv | PascalVOC-SP | Peptides-func | Peptides-struct | ZINC | MalNet-Tiny | Avg. Rank
---|---|---|---|---|---|---|---|---|---|---|---
  | Accuracy | Accuracy | Accuracy | Accuracy | AUROC | F1 score | AP | MAE | MAE | Accuracy | Lower better
GCN | | | | | | | | | | |
GIN | | | | | | | | | | |
GAT | | | | | | | | | | |
GatedGCN | | | | | | | | | | |
SAN | | | | | | | | | | |
GraphGPS | | | | | | | | | | |
Exphormer | | | | | | | | | | |
Grit | | | | | | | | | | |
GRED | | | | | | | | | | |
Graph-Mamba | | | | | | | | | | |
GSSC | | | | | | | | | | |
Ours | | | | | | | | | | |
In this section, we present our experimental results. Table 1 summarizes the performance of our proposed approach compared against several robust baselines across multiple datasets. We evaluated our method on 10 diverse datasets, with detailed descriptions provided in Appendix 8 (Table 8).
Our evaluation encompasses several key dataset categories: 1) The Long Range Graph Benchmark (Dwivedi et al., 2022), which requires effective reasoning over long-range interactions, consisting of three tasks: Peptides-func (graph-level classification with 10 peptide functional labels), Peptides-struct (graph-level regression of 11 molecular structural properties), and PascalVOC-SP (superpixel classification in image graphs). 2) Molecular graph datasets, including ZINC (Dwivedi et al., 2023) for graph regression of molecular properties and ogbg-molhiv (Hu et al., 2020) with 41k molecular graphs for classification. 3) Image graph datasets MNIST and CIFAR10 (Dwivedi et al., 2023), represented as 8-nearest neighbor superpixel graphs for classification. 4) Synthetic graph datasets PATTERN and CLUSTER (Dwivedi et al., 2023), generated using the Stochastic Block Model for node-level community classification. 5) Function call graphs from MalNet-Tiny (Freitas et al., 2020) for classification tasks. Our experimental results across these diverse settings demonstrate our method’s robustness and versatility.
We report the experimental results in Table 1, following the evaluation protocols from GSSC (Huang et al., 2024b) with mean and standard deviation across five random seeds (0 to 4). We report relevant hyperparameters in Appendix 9, Table 9. The results demonstrate GraphMinNet’s superior performance, achieving the best results on 6 out of 10 datasets and ranking second on 3 datasets, leading to the highest overall average rank among all methods. Even on the remaining dataset, GraphMinNet shows competitive performance comparable to state-of-the-art baselines.
We further evaluate GraphMinNet’s efficiency in terms of both model parameters and computational cost. Table 2 compares parameter counts against recent baselines, while Fig. 2 shows runtime analysis. Our model shows better or comparable efficiency to SOTA models.
MNIST | CIFAR10 | PATTERN | CLUSTER | Molhiv | PascalVOC-SP | Peptides-func | Peptides-struct | ZINC | MalNet-Tiny | |
---|---|---|---|---|---|---|---|---|---|---|
GraphGPS | 115.39K | 112.73K | 337.20K | 502.05K | 558.63K | 510.45K | 504.36K | 504.46K | 423.72K | 527.24K |
Grit | 102.14K | 99.486K | 477.95K | 432.21K | - | - | 443.34K | 438.83K | 473.47K | - |
GSSC | 133K | 131K | 539K | 539K | 351K | 375K | 410K | 410K | 436K | 299K
Ours | 122.82K | 120.13K | 431.54K | 432.01K | 338.80K | 474.10K | 386.32K | 391.92K | 415.28K | 279.17K |

To demonstrate scalability, we conducted experiments on randomly generated graphs with node counts ranging from 1,000 to 20,000. These graphs were generated using PyTorch Geometric (Fey & Lenssen, 2019) with an average node degree of 5, maintaining realistic sparsity constraints. Node features were randomly initialized, and we included eigenvalues, eigenvectors, and logarithmic degree values to simulate diverse graph properties. We evaluated scalability through two metrics: FLOPs (Floating Point Operations), computed using thop (https://pypi.org/project/thop), and Maximum Memory Utilization, measured via torch.cuda.max_memory_allocated. As shown in Figure 3, both FLOPs and memory utilization demonstrate linear growth with respect to the number of nodes. This linear scalability confirms our theoretical analysis and demonstrates our method's practical efficiency for large-scale graph applications.
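A sketch of such a measurement loop is shown below; the random-graph generator, feature dimension, and model interface are placeholders (the eigenvalue, eigenvector, and log-degree inputs used in our experiments are omitted for brevity), while thop.profile and torch.cuda.max_memory_allocated are the tools named above.

```python
import torch
from torch_geometric.utils import erdos_renyi_graph
from thop import profile

@torch.no_grad()
def measure(model, num_nodes: int, feat_dim: int = 64, avg_degree: int = 5):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # random sparse graph with roughly the requested average degree
    edge_prob = avg_degree / max(num_nodes - 1, 1)
    edge_index = erdos_renyi_graph(num_nodes, edge_prob).to(device)
    x = torch.randn(num_nodes, feat_dim, device=device)

    flops, _ = profile(model.to(device), inputs=(x, edge_index), verbose=False)

    if device == "cuda":
        torch.cuda.reset_peak_memory_stats()
        model(x, edge_index)
        peak_mb = torch.cuda.max_memory_allocated() / 2**20
    else:
        peak_mb = float("nan")
    return flops, peak_mb

# usage (GraphMinNetLayer is a placeholder for the actual model class)
# for n in (1_000, 5_000, 10_000, 20_000):
#     print(n, measure(GraphMinNetLayer(64), n))
```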


4.1 Robustness Analysis
To validate our theoretical stability results empirically, we evaluate our method's robustness to feature perturbations. We introduce controlled synthetic noise to the node embeddings by adding Gaussian perturbations proportional to the feature magnitudes. For each node $i$, we compute the perturbed embedding as:
$$\hat{x}_i = x_i + \epsilon\, \eta_i, \tag{12}$$
where $\epsilon$ is the noise level and $\eta_i$ is a noise term. We consider two types of noise: additive white noise, $\eta_i = \bar{x}\,\xi_i$, and signal-dependent noise, $\eta_i = x_i \odot \xi_i$. Here, $\xi_i$ is standard multivariate Gaussian, and $\bar{x}$ denotes the mean feature magnitude. The second type models perturbations that scale appropriately with the underlying feature values. Figure 4 shows our method's performance on the Molhiv and Peptides-func datasets under increasing noise levels (0%, 5%, …, 30%). The results demonstrate consistent performance across noise levels, empirically confirming our theoretical stability guarantees.
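The perturbation procedure can be sketched as follows; since Eq. (12) is reconstructed from the description above, the two noise definitions should be read as assumptions.

```python
import torch

def perturb(x: torch.Tensor, eps: float, signal_dependent: bool = False) -> torch.Tensor:
    """Perturb node features x (shape (N, d)) at relative noise level eps."""
    xi = torch.randn_like(x)                 # standard Gaussian noise
    if signal_dependent:
        eta = x * xi                         # noise scales with each feature value
    else:
        eta = x.abs().mean() * xi            # white noise scaled by the mean feature magnitude
    return x + eps * eta

x = torch.randn(100, 16)
for eps in (0.0, 0.05, 0.10, 0.30):
    print(eps, perturb(x, eps, signal_dependent=True).sub(x).abs().mean().item())
```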

5 Ablation Studies
We conducted ablation studies to analyze key components of GraphMinNet, including the optional self-term contribution, dropout effectiveness, and local method impact.
5.1 Optional Self-term
We first investigate the impact of the self-term introduced in Eq. (11). Results in Table 3 show dataset-dependent effects: while the self-term improves performance on Molhiv (AUROC increase of 2.55%), it slightly decreases performance on CLUSTER and Peptides-func. This variation suggests the self-term’s effectiveness correlates with specific graph properties - it appears more beneficial for molecular graphs with complex local structures (like Molhiv) compared to more regular graph structures (like CLUSTER).
Self-term | CLUSTER | Peptides-func | Molhiv |
---|---|---|---|
✗ | |||
✓ |
5.2 Embedding Representation
We analyzed different methods for aggregating node embeddings in Eq. (6), comparing two element-wise operations for combining feature and positional information: multiplication and addition. Our results showed that element-wise multiplication achieves higher accuracy across most datasets (e.g., two datasets shown in Table 4). Based on these results, we adopt element-wise multiplication by default except for MalNet-Tiny, which uses element-wise addition.
Aggregation | Peptides-func | Molhiv |
---|---|---|
Multiplication | ||
Addition |
5.3 Effectiveness of Dropout Regularization
We analyze the impact of dropout regularization on model performance. Figure 5 compares training and validation performance with and without dropout. The results demonstrate that dropout plays a crucial role in preventing overfitting - models without dropout show significant performance degradation on validation sets, particularly on complex datasets like Peptides-func and Molhiv. This confirms dropout’s importance as a regularization mechanism in our architecture.

5.4 Local Method
Method | CLUSTER | Peptides-func | Molhiv |
---|---|---|---|
+GatedGCN | |||
+GINE |
We evaluate GraphMinNet's stability when combined with different local methods that use local neighborhood information including edge features, such as GatedGCN and GINE. Table 5 shows that GraphMinNet maintains consistent performance across different local methods: performance variations remain small on CLUSTER, Peptides-func, and Molhiv. These minimal differences across diverse datasets demonstrate that GraphMinNet can effectively integrate local structural information regardless of the specific local method employed, confirming its architectural robustness and versatility.
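Since the exact combination rule of the local method with our global module is not spelled out here, the sketch below shows only one plausible GraphGPS-style residual arrangement, with GINEConv from PyTorch Geometric standing in for the local method and a placeholder global module.

```python
import torch.nn as nn
from torch_geometric.nn import GINEConv

class HybridBlock(nn.Module):
    """One plausible local + global block: a message-passing layer over edges,
    combined residually with a global (GraphMinNet-style) module."""
    def __init__(self, d: int, edge_dim: int, global_module: nn.Module):
        super().__init__()
        mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.local = GINEConv(mlp, edge_dim=edge_dim)   # local neighborhood + edge features
        self.global_ = global_module                     # long-range module (placeholder)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x, edge_index, edge_attr, pe):
        x = self.norm1(x + self.local(x, edge_index, edge_attr))  # residual local update
        x = self.norm2(x + self.global_(x, pe))                   # residual global update
        return x
```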
5.5 Nonlinear $A$ and $B$
Method | CLUSTER | Peptides-func | Molhiv |
---|---|---|---|
Linear | |||
Nonlinear |
In our method, we use a single-layer linear transformation for the matrices $A$ and $B$. To explore further, instead of only using a linear projection, we perform an ablation study with two linear layers and a nonlinear activation (ReLU/GELU/SiLU) in between for both matrices. We report the results in Table 6. For Molhiv and CLUSTER, the performance variation is small; however, for Peptides-func, performance drops noticeably. Therefore, adding more layers may lead to overfitting on graph data, particularly for smaller datasets such as Peptides-func.
5.6 Projection Strategy on $z_i$ and $\tilde{x}_i$
Method | CLUSTER | Peptides-func | Molhiv |
---|---|---|---|
LP | |||
NLP |
Inspired by minGRU (Feng et al., 2024), we adopt a linear projection (LP) for both $z_i$ and $\tilde{x}_i$ in our method. To further explore the projection strategy, we conduct an ablation study employing a nonlinear projection (NLP), similar to Subsection 5.5. The results presented in Table 7 indicate that NLP maintains competitive performance and may be beneficial for datasets with higher-dimensional node features to capture complex patterns. However, since NLP does not provide significant improvements while introducing additional parameters, we adopt LP as the default projection strategy for all datasets.
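In code, this ablation amounts to swapping the projection module; a minimal sketch (the helper name and dimensions are ours):

```python
import torch.nn as nn

def make_projection(d_in: int, d_out: int, nonlinear: bool = False) -> nn.Module:
    """LP (default): a single linear layer. NLP: two linear layers with a GELU in between."""
    if not nonlinear:
        return nn.Linear(d_in, d_out)
    return nn.Sequential(nn.Linear(d_in, d_out), nn.GELU(), nn.Linear(d_out, d_out))

# proj = make_projection(64, 64)            # LP, as used by default
# proj_nlp = make_projection(64, 64, True)  # NLP variant evaluated in Table 7
```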
6 Conclusion
This paper presents GraphMinNet, a novel graph neural network that effectively captures long-range dependencies while maintaining linear computational complexity. Our approach successfully generalizes the idea of minimal GRU to graph-structured data while preserving permutation equivariance and ensuring stability, with theoretical guarantees for non-decaying gradients over long distances. Our key contributions include an efficient integration of node features with positional encoding, achieving linear complexity and scalability, along with theoretical proofs establishing the model’s stability and expressiveness bounds between 1-WL and 3-WL tests. Extensive experimental results across diverse datasets demonstrate GraphMinNet’s effectiveness, achieving superior performance while maintaining computational efficiency. Future directions include extensions to dynamic graphs, applications to larger-scale networks, and adaptation to heterogeneous graph structures.
One limitation of our model is that edge features are not explicitly accounted for. In future work, we would like to incorporate edge features directly into our model.
Impact Statement
This work aims to advance machine learning methods for graph-structured data. While our technical contributions focus on graph neural networks, we acknowledge that ML systems can have broader societal impacts. Potential applications of our work include modeling social networks, analyzing biological systems, and understanding complex network interactions. However, careful consideration must be given to bias in graph construction and the representativeness of training data in deployment contexts. We particularly emphasize the importance of responsible data collection and proper validation when applying these methods to sensitive domains. We encourage future work to investigate these aspects and develop robust guidelines for ethical applications.
Acknowledgements
This work was supported in part by the NSF under Grants IIS 2327113 and ITE 2433190, and the NIH under Grant P30AG072946. We also acknowledge NSF support for AI research resources under NAIRR240219. We would like to thank the University of Kentucky Center for Computational Sciences and Information Technology Services Research Computing for their support and the use of the Lipscomb Compute Cluster and associated research computing resources.
References
- Achanta et al. (2012) Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., and Süsstrunk, S. Slic superpixels compared to state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine intelligence, 34(11):2274–2282, 2012.
- Ahamed & Cheng (2024) Ahamed, M. A. and Cheng, Q. Timemachine: A time series is worth 4 mambas for long-term forecasting. arXiv preprint arXiv:2403.09898, 2024.
- Bo et al. (2022) Bo, D., Shi, C., Wang, L., and Liao, R. Specformer: Spectral graph neural networks meet transformers. In The Eleventh International Conference on Learning Representations, 2022.
- Bousquet & Elisseeff (2002) Bousquet, O. and Elisseeff, A. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
- Bresson & Laurent (2017) Bresson, X. and Laurent, T. Residual gated graph convnets. arXiv preprint arXiv:1711.07553, 2017.
- Chamberlain et al. (2021) Chamberlain, B., Rowbottom, J., Gorinova, M. I., Bronstein, M., Webb, S., and Rossi, E. Grand: Graph neural diffusion. In International conference on machine learning, pp. 1407–1418. PMLR, 2021.
- Corso et al. (2020) Corso, G., Cavalleri, L., Beaini, D., Liò, P., and Veličković, P. Principal neighbourhood aggregation for graph nets. Advances in Neural Information Processing Systems, 33:13260–13271, 2020.
- Ding et al. (2024) Ding, Y., Orvieto, A., He, B., and Hofmann, T. Recurrent distance filtering for graph representation learning. In Forty-first International Conference on Machine Learning, 2024.
- Du et al. (2019) Du, S. S., Hou, K., Salakhutdinov, R. R., Poczos, B., Wang, R., and Xu, K. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. Advances in neural information processing systems, 32, 2019.
- Dwivedi et al. (2022) Dwivedi, V. P., Rampášek, L., Galkin, M., Parviz, A., Wolf, G., Luu, A. T., and Beaini, D. Long range graph benchmark. Advances in Neural Information Processing Systems, 35:22326–22340, 2022.
- Dwivedi et al. (2023) Dwivedi, V. P., Joshi, C. K., Luu, A. T., Laurent, T., Bengio, Y., and Bresson, X. Benchmarking graph neural networks. Journal of Machine Learning Research, 24(43):1–48, 2023.
- Everingham et al. (2010) Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The pascal visual object classes (voc) challenge. International journal of computer vision, 88:303–338, 2010.
- Feng et al. (2024) Feng, L., Tung, F., Ahmed, M. O., Bengio, Y., and Hajimirsadegh, H. Were rnns all we needed? arXiv preprint arXiv:2410.01201, 2024.
- Fey & Lenssen (2019) Fey, M. and Lenssen, J. E. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
- Frasca et al. (2022) Frasca, F., Bevilacqua, B., Bronstein, M., and Maron, H. Understanding and extending subgraph gnns by rethinking their symmetries. Advances in Neural Information Processing Systems, 35:31376–31390, 2022.
- Freitas et al. (2020) Freitas, S., Dong, Y., Neil, J., and Chau, D. H. A large-scale database for graph representation learning. arXiv preprint arXiv:2011.07682, 2020.
- Grover & Leskovec (2016) Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864, 2016.
- Gu & Dao (2023) Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- Gu et al. (2020) Gu, A., Dao, T., Ermon, S., Rudra, A., and Ré, C. Hippo: Recurrent memory with optimal polynomial projections. Advances in neural information processing systems, 33:1474–1487, 2020.
- Gu et al. (2021) Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., and Ré, C. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021.
- Hamilton et al. (2017) Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. Advances in neural information processing systems, 30, 2017.
- Hu et al. (2020) Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118–22133, 2020.
- Huang et al. (2022) Huang, Y., Peng, X., Ma, J., and Zhang, M. Boosting the cycle counting power of graph neural networks with i2-gnns. In The Eleventh International Conference on Learning Representations, 2022.
- Huang et al. (2023) Huang, Y., Lu, W., Robinson, J., Yang, Y., Zhang, M., Jegelka, S., and Li, P. On the stability of expressive positional encodings for graph neural networks. arXiv preprint arXiv:2310.02579, 2023.
- Huang et al. (2024a) Huang, Y., Lu, W., Robinson, J., Yang, Y., Zhang, M., Jegelka, S., and Li, P. On the stability of expressive positional encodings for graph neural networks. In The Twelfth International Conference on Learning Representations, 2024a.
- Huang et al. (2024b) Huang, Y., Miao, S., and Li, P. What can we learn from state space models for machine learning on graphs? arXiv preprint arXiv:2406.05815, 2024b.
- Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
- Jain et al. (2016) Jain, A., Zamir, A. R., Savarese, S., and Saxena, A. Structural-rnn: Deep learning on spatio-temporal graphs. In Proceedings of the ieee conference on computer vision and pattern recognition, pp. 5308–5317, 2016.
- Kipf & Welling (2016) Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2016.
- Kreuzer et al. (2021) Kreuzer, D., Beaini, D., Hamilton, W., Létourneau, V., and Tossou, P. Rethinking graph transformers with spectral attention. Advances in Neural Information Processing Systems, 34:21618–21629, 2021.
- Li et al. (2020) Li, P., Wang, Y., Wang, H., and Leskovec, J. Distance encoding: Design provably more powerful neural networks for graph representation learning. Advances in Neural Information Processing Systems, 33:4465–4478, 2020.
- Lim et al. (2022) Lim, D., Robinson, J. D., Zhao, L., Smidt, T., Sra, S., Maron, H., and Jegelka, S. Sign and basis invariant networks for spectral graph representation learning. In The Eleventh International Conference on Learning Representations, 2022.
- Ma et al. (2023) Ma, L., Lin, C., Lim, D., Romero-Soriano, A., Dokania, P. K., Coates, M., Torr, P., and Lim, S.-N. Graph inductive biases in transformers without message passing. In International Conference on Machine Learning, pp. 23321–23337. PMLR, 2023.
- Morris et al. (2019) Morris, C., Ritzert, M., Fey, M., Hamilton, W. L., Lenssen, J. E., Rattan, G., and Grohe, M. Weisfeiler and leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pp. 4602–4609, 2019.
- Perozzi et al. (2014) Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710, 2014.
- Poli et al. (2019) Poli, M., Massaroli, S., Park, J., Yamashita, A., Asama, H., and Park, J. Graph neural ordinary differential equations. arXiv preprint arXiv:1911.07532, 2019.
- Rampášek et al. (2022) Rampášek, L., Galkin, M., Dwivedi, V. P., Luu, A. T., Wolf, G., and Beaini, D. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems, 35:14501–14515, 2022.
- Shalev-Shwartz et al. (2010) Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.
- Shirzad et al. (2023) Shirzad, H., Velingker, A., Venkatachalam, B., Sutherland, D. J., and Sinop, A. K. Exphormer: Sparse transformers for graphs. In International Conference on Machine Learning, pp. 31613–31632. PMLR, 2023.
- Singh et al. (2016) Singh, S., Chaudhary, K., Dhanda, S. K., Bhalla, S., Usmani, S. S., Gautam, A., Tuknait, A., Agrawal, P., Mathur, D., and Raghava, G. P. Satpdb: a database of structurally annotated therapeutic peptides. Nucleic acids research, 44(D1):D1119–D1126, 2016.
- Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations, 2018.
- Wang et al. (2022) Wang, H., Yin, H., Zhang, M., and Li, P. Equivariant and stable positional encoding for more powerful graph neural networks. In International Conference on Learning Representations, 2022.
- Wang et al. (2024) Wang, X., Wang, S., Ding, Y., Li, Y., Wu, W., Rong, Y., Kong, W., Huang, J., Li, S., Yang, H., et al. State space model for new-generation network alternative to transformers: A survey. arXiv preprint arXiv:2404.09516, 2024.
- Wu & Cheng (2022) Wu, X. and Cheng, Q. Stabilizing and enhancing link prediction through deepened graph auto-encoders. In IJCAI: proceedings of the conference, volume 2022, pp. 3587. NIH Public Access, 2022.
- Wu et al. (2018) Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. Moleculenet: a benchmark for molecular machine learning. Chemical science, 9(2):513–530, 2018.
- Xhonneux et al. (2020) Xhonneux, L.-P., Qu, M., and Tang, J. Continuous graph neural networks. In International conference on machine learning, pp. 10432–10441. PMLR, 2020.
- Xu et al. (2018) Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful are graph neural networks? In International Conference on Learning Representations, 2018.
- Ying et al. (2021) Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? Advances in neural information processing systems, 34:28877–28888, 2021.
- Ying et al. (2018) Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., and Leskovec, J. Hierarchical graph representation learning with differentiable pooling. Advances in neural information processing systems, 31, 2018.
- You et al. (2019) You, J., Ying, R., and Leskovec, J. Position-aware graph neural networks. In International conference on machine learning, pp. 7134–7143. PMLR, 2019.
- You et al. (2021) You, J., Gomes-Selman, J. M., Ying, R., and Leskovec, J. Identity-aware graph neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pp. 10737–10745, 2021.
- Zhang et al. (2024) Zhang, B., Zhao, L., and Maron, H. On the expressive power of spectral invariant graph neural networks. arXiv preprint arXiv:2406.04336, 2024.
- Zhang & Li (2021) Zhang, M. and Li, P. Nested graph neural networks. Advances in Neural Information Processing Systems, 34:15734–15747, 2021.
- Zhao et al. (2021) Zhao, L., Jin, W., Akoglu, L., and Shah, N. From stars to subgraphs: Uplifting any gnn with local structure awareness. In International Conference on Learning Representations, 2021.
- Zhi et al. (2023) Zhi, Y.-C., Ng, Y. C., and Dong, X. Gaussian processes on graphs via spectral kernel learning. IEEE Transactions on Signal and Information Processing over Networks, 9:304–314, 2023.
7 Appendix
7.1 Proofs of Model Properties
Proposition 7.1.
The GraphMinNet in Eq. (7) is permutation equivariant. Moreover, if $f_1, \dots, f_m$ are Lipschitz, GraphMinNet is also stable in terms of features and eigenvalues.
Proof.
We first establish three fundamental facts, which can be straightforwardly proved:
Fact 1. Permutation equivariance holds for node-wise operations that are applied independently to each node’s features.
Fact 2. The composition of permutation equivariant functions is also permutation equivariant.
Fact 3. The composition of Lipschitz functions is also Lipschitz.
Permutation Equivariance:
By Fact 1, $z_i$ and $\tilde{x}_i$ in Eq. (7) are permutation equivariant, as both are node-wise operations. Next, we prove that the node encodings $u_i$, $p_i$, and $\phi_i$ are also permutation equivariant. For a node permutation $\pi$ with permutation matrix $P$, let $\pi(i)$ denote the index of node $i$ after permutation.
Then, the Laplacian of the permuted graph is $\hat{L} = P L P^{\top}$.
After node permutation, the eigenvalue matrix $\Lambda$ does not change, while the eigenvectors are permuted row-wise. Then, we have the hat-versions $\hat{u}_{\pi(i)}$, $\hat{p}_{\pi(i)}$, and $\hat{\phi}_{\pi(i)}$ for the permuted graph, which coincide with the corresponding quantities for node $i$ in the original graph.
Thus, $\hat{h}_{\pi(i)} = h_i$ for every node $i$. Finally, permuting the input graph permutes the node-level outputs accordingly.
Therefore, the GraphMinNet algorithm is permutation equivariant.
Stability Analysis:
First, we consider feature stability. For any node $i$ and its feature vector $x_i$ with perturbation $\delta_i$:
• $z_i$ is Lipschitz in $x_i$ since $\sigma$ is Lipschitz;
• the linear map producing $\tilde{x}_i$ is clearly Lipschitz, with constant given by its operator norm.
By Fact 3 and the chain rule, $h_i$ is Lipschitz in $x_i$ as a composition of Lipschitz functions.
For spectral and structural stability, consider a symmetric perturbation matrix $\Delta$ to the adjacency matrix $A$. By Weyl's inequality, $|\lambda_k(A + \Delta) - \lambda_k(A)| \le \|\Delta\|_2$ for every $k$. Since $f_1, \dots, f_m$ are Lipschitz by assumption with constants $C_1, \dots, C_m$, and the node encoding involves a composition of these functions with the eigenvalues, the change in the encoding is bounded by a constant multiple of $\|\Delta\|_2$.
Finally, as $h_i$ is Lipschitz in the node encoding and the encoding is Lipschitz in the eigenvalues, GraphMinNet is stable with respect to both feature and eigenvalue perturbations. ∎
Proposition 7.2.
There exist model parameters such that the gradient norm $\big\|\partial h_i / \partial x_j\big\|$ of GraphMinNet does not decay as $d(i,j)$ grows (with $d(i,j)$ tending to $\infty$), where $d(i,j)$ is the shortest path distance between nodes $i$ and $j$.
Proof.
We only consider a nonnegative adjacency matrix $A$, and note that $z_i \in (0,1)$ for any node $i$. Without loss of generality (w.l.o.g.), we consider the case where the relevant learnable components reduce to positive scalars. Let $\tilde{A} = D^{-1/2} A D^{-1/2}$ be the normalized adjacency matrix, with $D$ being the diagonal degree matrix of the original adjacency matrix $A$.
By taking the aggregation $\star$ as multiplication (the case that it is addition can be proved similarly), we have
Thus, the gradient can be expanded step by step, by Eq. (7), by the definition of the node encodings, and by the spectral decomposition of $\tilde{A}$.
In the above, $e_i$ is the unit vector with the $i$-th element being 1; the second equality follows from the Type 4 definition of the inner product in Eq. (10). While the proof is shown using Type 4, it holds for the other types by appropriately choosing the parameters. The learnable parameters must be chosen so that the relevant terms are positive; defining the remaining scalar quantities with positive constants, we have
Since the $(i,j)$ entry of $\tilde{A}^{d(i,j)}$ represents the degree-weighted sum of all walks of length $d(i,j)$ from $i$ to $j$, and there exists at least one path of length $d(i,j)$ between $i$ and $j$, this entry is strictly positive; we denote it by $w_{ij}$. Therefore, we have
The last inequality holds since all remaining factors are non-negative for every index, making all terms in the sum non-negative. This lower bound is independent of the distance $d(i,j)$, proving that the gradient does not decay with distance and establishing GraphMinNet's long-range dependency property. ∎
Assuming that the eigenvalue decomposition of the adjacency matrix is precomputed and thus given, we have the following computational complexity for GraphMinNet:
Proposition 7.3.
The hidden states $\{h_i\}$ can be computed from $\{x_i\}$ with a complexity of $O(Nmrd)$, where $N$ is the number of nodes, $m$ is the number of columns in the node encoding ($u_i$, $p_i$, or $\phi_i$), $r$ is the dimension of the rows in the node encoding, and $d$ is the feature dimension.
Proof.
For linear complexity: this can be straightforwardly counted as follows. For each node $i$:
1. Computing the gate $z_i$ and the transformed feature $\tilde{x}_i$: linear projections and the activation $\sigma$;
2. Computing the node encodings: evaluating $f_1, \dots, f_m$ on the eigenvalues, taking element-wise products with the corresponding eigenvector entries, and multiplying with the weight matrix;
3. Computing the final output: the inner product between the node encoding and the globally aggregated encoding.
Each step requires a number of operations proportional to the size of the node encoding and the feature dimension, so the total complexity per node is $O(mrd)$ (under mild assumptions on the relative sizes of these dimensions), and the overall complexity for $N$ nodes is $O(Nmrd)$.
For linear scalability: given the eigenvalue decomposition, the computations are node-wise independent except for a single global aggregation over nodes. The final step requires only one inner product per node with this global sum. Thus, the algorithm scales linearly with $N$ through:
• Independent node-wise computations;
• A single global aggregation;
• Final node-wise inner products. ∎
The above property provides a certificate guaranteeing the linear complexity as well as the linear scalability with respect to the number of nodes of the graph.
Proposition 7.4.
The formulation of GraphMinNet is more powerful than the 1-WL test but not more powerful than the 3-WL test.
Proof.
First, we show GraphMinNet is more powerful than 1-WL. By choosing the weight matrices and gating appropriately, the node update reduces to a sum over nodes $j$ of terms weighted by the adjacency entries $a_{ij}$; the second equality in this reduction follows from the Type 2 definition of the matrix inner product (see Eq. (10)). Since $a_{ij}$ is an adjacency matrix element and the remaining factor is a function of the node features $x_i$ and $x_j$, GraphMinNet can implement standard message passing. As message passing is equivalent to the 1-WL test (Xu et al., 2018), GraphMinNet is at least as powerful as 1-WL. The additional structural components make it strictly more powerful.
For the upper bound, note that GraphMinNet is an eigenspace projection GNN using a basis invariant function of positional encoding. By (Zhang et al., 2024), such architectures cannot exceed the power of 3-WL test. ∎
Dataset | # Graphs | Avg. # nodes | Avg. # edges | Prediction level | Prediction task |
---|---|---|---|---|---|
ZINC | 12,000 | 23.2 | 24.9 | graph | regression |
ogbg-molhiv | 41,127 | 25.5 | 27.5 | graph | binary classification |
MNIST | 70,000 | 70.6 | 564.5 | graph | 10-class classification |
CIFAR10 | 60,000 | 117.6 | 941.1 | graph | 10-class classification |
PATTERN | 14,000 | 118.9 | 3,039.3 | node | binary classification |
CLUSTER | 12,000 | 117.2 | 2,150.9 | node | 6-class classification |
MalNet-Tiny | 5,000 | 1,410.3 | 2,859.9 | graph | 5-class classification |
Peptides-func | 15,535 | 150.9 | 307.3 | graph | 10-class classification |
Peptides-struct | 15,535 | 150.9 | 307.3 | graph | regression |
PascalVOC-SP | 11,355 | 479.4 | 2,710.5 | node | 21-class classification |
8 Datasets
We utilized 10 datasets, which are widely adopted in the graph machine learning community. These datasets cover a range of tasks, including graph-level regression, binary classification, and node-level classification. All datasets utilized in our study are equipped with predefined training, validation, and test splits, ensuring consistency across experiments. In line with established practices in the field, we report the test results based on the best-performing models on the validation set. To ensure the robustness of our findings, we conduct evaluations over five distinct random seeds for each dataset. This approach aligns with methodologies outlined in prior studies (Rampášek et al., 2022; Ma et al., 2023; Shirzad et al., 2023; Huang et al., 2024b).
ZINC:
The ZINC-12k dataset (Dwivedi et al., 2023) is a subset of the ZINC database, which contains commercially available chemical compounds. This dataset comprises 12,000 molecular graphs, where each graph represents a small molecule with the number of atoms ranging from 9 to 37. In this representation, nodes correspond to heavy atoms (with 28 distinct atom types), and edges symbolize chemical bonds (of 3 different types). The primary task associated with ZINC-12k is graph-level regression.
ogbg-molhiv:
The ogbg-molhiv dataset (Hu et al., 2020) is adopted from the MoleculeNet collection (Wu et al., 2018) by the Open Graph Benchmark (OGB) project. It consists of molecular graphs where nodes and edges are featurized to represent various chemophysical properties. The task for this dataset is a binary graph-level classification, aiming to predict whether a molecule can inhibit HIV replication.
MNIST & CIFAR10:
The MNIST and CIFAR10 datasets (Dwivedi et al., 2023) are derived from well-known image classification datasets. In these graph versions, each image is converted into a graph by constructing an 8-nearest-neighbor graph of SLIC superpixels (Achanta et al., 2012) for each image. The task for these datasets is a 10-class graph-level classification, mirroring the original image classification tasks.
PATTERN and CLUSTER:
The PATTERN and CLUSTER datasets (Dwivedi et al., 2023) are synthetic datasets that model community structures using the Stochastic Block Model (SBM). Both tasks involve inductive node-level classification. In the PATTERN dataset, the goal is to identify nodes that belong to one of 100 randomly generated subgraph patterns, which are distinct from the rest of the graph in terms of SBM parameters. For the CLUSTER dataset, each graph is composed of 6 clusters generated by the SBM, and the task is to predict the cluster ID of 6 test nodes, each representing a unique cluster within the graph.
MalNet-Tiny:
The MalNet-Tiny dataset (Freitas et al., 2020) is a subset of the larger MalNet repository, which contains function call graphs extracted from Android APKs. This subset includes 5,000 graphs, each with up to 5,000 nodes, representing either benign software or four categories of malware. In the MalNet-Tiny subset, all original node and edge attributes have been removed, and the classification task is based solely on the graph structure.
Peptides-func and Peptides-struct:
The Peptides-func and Peptides-struct datasets (Dwivedi et al., 2022) are derived from 15,000 peptides retrieved from the SATPdb database (Singh et al., 2016). Both datasets use the same set of graphs but focus on different prediction tasks. Peptides-func is a graph-level classification task with 10 functional labels associated with peptide functions. In contrast, Peptides-struct is a graph-level regression task aimed at predicting 11 structural properties of the molecules.
PascalVOC-SP:
The PascalVOC-SP dataset (Dwivedi et al., 2022) is a node classification dataset based on the Pascal VOC 2011 image dataset (Everingham et al., 2010). Superpixel nodes are extracted using the SLIC algorithm (Achanta et al., 2012), and a rag-boundary graph that interconnects these nodes is constructed. The task is to classify each node into corresponding object classes, akin to semantic segmentation.
Dataset | Hidden dimension | Local Method | # Layers | Lap. dim | Batch | LR | Weight Decay | Dropouts |
---|---|---|---|---|---|---|---|---|
ZINC | GINE | |||||||
ogbg-molhiv | GatedGCN | |||||||
MNIST | GatedGCN | |||||||
CIFAR10 | GatedGCN | |||||||
PATTERN | GatedGCN | |||||||
CLUSTER | GatedGCN | |||||||
MalNet-Tiny | GatedGCN | 16 | 0.0015 | |||||
Peptides-func | GatedGCN | |||||||
Peptides-struct | GatedGCN | |||||||
PascalVOC-SP | GatedGCN |
9 Hyperparameters
In this section, we summarize the key hyperparameters used to achieve the results presented in Table 1. These parameters, detailed in Table 9, include the hidden dimension, the type of local method (such as GINE or GatedGCN), the number of layers, the dimensionality of the Laplacian features (eigenvalues and eigenvectors), batch size, learning rate, weight decay, and dropout rates. The dropout rates are specified for different components of the model: the feedforward network, the local method, the residual connection, and our proposed GraphMinNet. For the MalNet-Tiny and PascalVOC-SP datasets, we use the 0-th order power function for the permutation equivariant functions. Additionally, for MalNet-Tiny, we set the corresponding function in the positional component to be the zero function and define the aggregation operation as element-wise addition. For all other datasets, we use a generalized permutation equivariant set aggregation function (Wang et al., 2022).