
IGNN-Solver: A Graph Neural Solver for
Implicit Graph Neural Networks

Junchao Lin, Zenan Ling, Zhanbo Feng, Jingwen Xu, Minxuan Liao, Feng Zhou, Tianqi Hou, Zhenyu Liao, and Robert C. Qiu. Manuscript created on February 17, 2025. The work of Zenan Ling is supported in part by the National Natural Science Foundation of China (via NSFC-62406119), the Natural Science Foundation of Hubei Province (2024AFB074), and the Guangdong Provincial Key Laboratory of Mathematical Foundations for Artificial Intelligence (2023B1212010001). The work of Zhenyu Liao is supported in part by the National Natural Science Foundation of China (via NSFC-62206101). The work of Robert Caiming Qiu is supported in part by the National Natural Science Foundation of China (via NSFC-12141107) and in part by the Key Research and Development Program of Guangxi (GuiKe-AB21196034). (Corresponding author: Zenan Ling.) Junchao Lin, Zenan Ling, Minxuan Liao, Zhenyu Liao, and Robert C. Qiu are with the School of EIC, Huazhong University of Science and Technology. Zhanbo Feng is with the Department of Computer Science and Engineering, Shanghai Jiao Tong University. Feng Zhou is with the Center for Applied Statistics and School of Statistics, Renmin University of China. Tianqi Hou is with the Lagrange Mathematics and Computation Research Center of Huawei. Jingwen Xu is with the School of Science, Wuhan University of Technology.
Abstract

Implicit graph neural networks (IGNNs), which exhibit strong expressive power with a single layer, have recently demonstrated remarkable performance in capturing long-range dependencies (LRD) in underlying graphs while effectively mitigating the over-smoothing problem. However, IGNNs rely on computationally expensive fixed-point iterations, which lead to significant speed and scalability limitations, hindering their application to large-scale graphs. To achieve fast fixed-point solving for IGNNs, we propose a novel graph neural solver, IGNN-Solver, which leverages the generalized Anderson Acceleration method, parameterized by a tiny GNN, and learns iterative updates as a graph-dependent temporal process. To improve effectiveness on large-scale graph tasks, we further integrate sparsification and storage compression methods, specifically tailored for the IGNN-Solver, into its design. Extensive experiments demonstrate that the IGNN-Solver significantly accelerates inference on both small- and large-scale tasks, achieving a $1.5\times$ to $8\times$ speedup without sacrificing accuracy. This advantage becomes more pronounced as the graph scale grows, facilitating its large-scale deployment in real-world applications. The code to reproduce our results is available at https://github.com/landrarwolf/IGNN-Solver.

Index Terms:
Implicit graph neural networks, large-scale graphs, fixed-point solver, deep equilibrium models.

I Introduction

Implicit graph neural networks (IGNNs) [1, 2, 3] have emerged as a significant advancement in graph learning frameworks. Unlike traditional graph neural networks (GNNs) that stack multiple explicit layers, IGNNs utilize a single implicit layer formulated as a fixed-point equation. The solution to this fixed-point equation, known as the equilibrium, is equivalent to the output obtained by iterating an explicit layer infinitely. This allows the implicit layer to access infinite hops of neighbors, providing IGNNs with global receptive fields within just one layer [4]. As a result, IGNNs effectively address the long-standing issue of over-smoothing in conventional explicit GNNs and capture long-range dependencies in graph-structured data [5, 6, 7].

However, existing IGNNs suffer from slow speeds and have difficulty scaling to large-scale graphs [8, 9, 10, 11]. This is primarily because IGNNs derive features by solving for fixed points, demanding substantial computational resources. For example, even on the Citeseer dataset [12] classification task with a small-scale graph, IGNNs require more than 20 forward iterative computations to nearly converge to a fixed point [1]. The computational overhead of solving fixed points is amplified by the task scale, resulting in notably slow inference speeds compared to explicit GNNs. This substantial drawback poses challenges for IGNNs in generalizing to large-scale graphs in practical scenarios.

In response to this challenge, we propose a novel graph neural solver for IGNNs, termed IGNN-Solver. It takes the IGNN layer as input, which includes the graph information matrix and layer parameters, and outputs the solution to the corresponding fixed-point equation. Unlike conventional solvers, which compute output features through iterative forward passes in a potentially large GNN (relying on root-finding algorithms like Broyden’s method [13]), the IGNN-Solver predicts the next iteration step using a tiny graph network. Consequently, the proposed IGNN-Solver remarkably accelerates the model’s inference speed without compromising accuracy and with only a slight increase in training overhead. This advantage becomes increasingly pronounced as the scale of the graph grows, making it particularly beneficial for deploying IGNNs in large-scale graph tasks.

Our IGNN-Solver comprises two components. First, we introduce a learnable initializer that estimates an optimal initial point for the optimization process. Second, we propose a generalized version of Anderson Acceleration (AA) [13], employing a tiny graph network to model the iterative updates as a sequence of graph-dependent steps. Compared to the solvers proposed for conventional Implicit Neural Networks (INNs) [14, 15, 16], we introduce novel improvements: learning solver parameters through a GNN-based method. This approach circumvents the potential loss of graph information, thereby improving the model’s performance. Moreover, IGNN-Solver has significantly fewer parameters compared to IGNN, and its training is independent of the IGNN’s inference. Consequently, the training of IGNN-Solver proceeds rapidly without sacrificing generalization.

Moreover, to address the challenges posed by large-scale graphs, we incorporate both graph sparsification and storage compression in the solver design. Specifically, to prevent the network from becoming excessively large and deviating from the initial goal of using a compact network for prediction [17, 18], we utilize graph sparsification to obtain a lighter graph. Additionally, since directly mapping data from a high to an extremely low dimension is not appropriate [19], we employ a storage compression technique to mitigate this issue.

In our experiments, we apply IGNN-Solver to nine real-world datasets from diverse domains and scales, including four large-scale datasets: Amazon-all [20], Reddit [21], ogbn-arxiv [22] and ogbn-products [22]. Our results demonstrate that the IGNN-Solver achieves higher accuracy with reduced inference time, showing a 1.5× to 8× speedup, and incurs minimal additional training overhead, constituting only 1% of the IGNN training time.

Our main contributions are summarized as follows:

  • We introduce IGNN-Solver, a method designed to predict the next fixed-point iteration of the IGNN layer. This innovative approach mitigates the need for extensive iterations, common in vanilla IGNNs, thereby substantially accelerating inference while preserving accuracy and minimizing parameter consumption.

  • Compared to conventional solvers proposed for INNs, our solver introduces a novel improvement by incorporating a tiny GNN to learn parameters. This approach not only preserves graph information but also maintains the simplicity and lightweight characteristics of neural solvers.

  • For large-scale graph tasks, we integrate custom-designed sparsification and storage compression methods into the solver’s architecture. These tailored techniques enable efficient processing of large-scale graphs, improving performance and reducing overhead.

  • We validate our approach through extensive experiments on nine real-world datasets, including four large-scale ones. Our results demonstrate that IGNN-Solver incurs minimal computational overhead (approximately 1% of the total) and achieves up to an 8-fold increase in inference speed without compromising accuracy.

II Related Works

The typical GNN [12] and its variants [23, 21] have been widely used for graph data modeling in various tasks. Different GNNs have been proposed to utilize attention mechanisms [23], neighbor sampling [21], pseudo-coordinates [24] and graph fusion [25]. However, due to issues such as over-smoothing, depth, and bottlenecks, these models typically involve a finite number of aggregation layers [5, 6, 7]. To address these issues, recent works [1, 2, 3] have developed implicit graph neural networks, encouraging these models to capture long-range dependencies on graphs. Here, we highlight the contributions of our proposed IGNN-Solver through a detailed comparison with Implicit Graph Neural Networks (IGNNs) and Deep Equilibrium Models (DEQs), both of which are closely related to our approach.

II-A Implicit Graph Neural Networks

Instead of stacking a series of operators hierarchically, implicit GNNs define their outputs as solutions to nonlinear dynamical systems, an approach initially introduced by [1] to tackle challenges associated with learning long-range dependencies in graphs. [8] proposes a new implicit graph model enabling mini-batch training without sacrificing the ability to capture long-range dependencies. Subsequently, [26] introduces a novel approach based on implicit layers to model multi-scale structures on graphs. Additionally, [3] theoretically investigates the well-posedness of the IGNN model from a monotone operator viewpoint.

Although the aforementioned IGNNs alleviate the over-smoothing of features by allowing meaningful fixed points to propagate implicitly, the inherently slow inference of implicit networks poses a major obstacle to their scalability. The main reason is that the fixed-point solvers they rely on are inefficient (e.g., the Picard iteration used by IGNN [1] and the Anderson acceleration solver used by MIGNN [3]), so the overhead of the fixed-point solver is magnified by the task scale. In comparison, our proposed IGNN-Solver accelerates the iteration process of IGNNs, addressing their most limiting drawback compared to traditional feedforward models.

Another class of IGNNs based on Neural ODEs [27] has emerged to address issues like depth and bottlenecks. For example, [28] models continuous residual layers using GCNs. [29] proposes methods for modeling static and dynamic graphs with GCNs, along with a hybrid approach where latent states evolve continuously between RNN steps in dynamic graphs. [30] tackles the problem of continuous message passing. [31] introduces a graph neural diffusion network based on the discretization of diffusion PDEs on graphs. [32] enhances graph neural diffusion with a source term and connects the model to a random walk formulation on graphs. However, these methods essentially view deep learning on graphs as a continuous diffusion process, which differs from IGNNs based on fixed-point networks, and thus require manual tuning of the termination time and step size of the diffusion equation. In contrast, our IGNN-Solver uses implicit formulations instead of explicit diffusion discretization, which admits an equilibrium corresponding to infinite diffusion steps and expands the receptive field.

II-B Fixed-point Solvers for Deep Equilibrium Models

Traditional deep learning models use multi-layered networks, while DEQs [14] find a fixed-point of a single layer, representing the network’s equilibrium state. Viewed as infinitely deep networks, DEQs use root-finding for forward propagation and implicit differentiation for backward propagation. Therefore, DEQs capture complex patterns with constant memory and computational costs [33]. Monotone operator theory has been used to guarantee the convergence of DEQs [34] and to improve the stability of implicit networks [16].

However, it is well-known that DEQs, as typical implicit models, suffer from slow training and inference speeds [35, 8], which is highly disadvantageous for large-scale scenarios like GraphGPT [36]. To alleviate this issue, recent efforts have explored certain improved solver methods for DEQs, further optimizing root-finding problems and making these models easier to solve. [37] discusses the amortization of the cost of the iterative solver that would otherwise make implicit models slow. [9] proposes a novel gradient approximation method for implicit models, which significantly accelerates the backward passes in training implicit models. [16] discusses the superiority of learnable solvers over generic solvers in implicit models using a tiny neural network and significantly enhances the efficiency of such models through custom neural solvers.

IGNN has a formulation similar to that of DEQ, as both define the output as the solution of an equilibrium equation. The difference is that IGNN's equilibrium equation encodes the graph structure while DEQ's does not, which further aggravates the slow inference and makes IGNN's solution even more expensive. Motivated by this insight, our proposed IGNN-Solver leverages graph information to guide solver acceleration, achieving fast and meaningful implicit graph network propagation, especially when the graph data is large-scale.

III Preliminaries

III-A Explicit GNNs

Let $\mathcal{G}=\{{\bm{A}},{\bm{X}}\}$ represent an undirected graph, where ${\bm{A}}\in\mathbb{R}^{n\times n}$ is the adjacency matrix indicating the relationships between the nodes in $\mathcal{V}=\{{\bm{v}}_{1},{\bm{v}}_{2},\ldots,{\bm{v}}_{n}\}$, and $n$ is the number of nodes. The node feature matrix ${\bm{X}}=[{\bm{x}}_{1},{\bm{x}}_{2},\ldots,{\bm{x}}_{n}]\in\mathbb{R}^{n\times d}$ contains the features of the nodes, with $d$ representing the feature dimension. Conventional (explicit) GNNs [12, 23] feature a learnable aggregation process centered on the message-passing operation within the graph [38]. This process iteratively propagates information from each node to its neighboring nodes. The formal general structure for each layer $l$ is defined as follows:

${\bm{H}}^{[l+1]}=f_{\theta}({\bm{A}},{\bm{H}}^{[l]}),\quad{\bm{H}}^{[0]}={\bm{X}},$ (1)

where ${\bm{H}}^{[l]}$ represents the hidden node representation and $f_{\theta}$ denotes the parameterized aggregation of the $l$-th layer. A commonly used GNN is the Graph Convolutional Network (GCN) [12], defined as ${\bm{H}}^{[l+1]}=\sigma(\widehat{{\bm{A}}}{\bm{H}}^{[l]}{\bm{W}}^{[l]})$, where ${\bm{W}}^{[l]}$ denotes the weight matrix of the $l$-th layer, $\sigma(\cdot)$ denotes the activation function, and $\widehat{{\bm{A}}}=\tilde{{\bm{D}}}^{-1/2}({\bm{A}}+{\bm{I}})\tilde{{\bm{D}}}^{-1/2}$ represents the symmetrically normalized graph matrix. $\tilde{{\bm{D}}}$ is a diagonal matrix with $\tilde{{\bm{D}}}_{ii}=1+\sum_{j}{\bm{A}}_{ij}$. GNNs leverage the message-passing operation in (1) to learn useful information. Still, they often involve a limited number of layers $l$ due to over-smoothing [6], making it challenging for GNNs to capture long-range dependencies (LRD) on graphs.
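For concreteness, a minimal PyTorch sketch of a dense GCN layer implementing $\sigma(\widehat{{\bm{A}}}{\bm{H}}^{[l]}{\bm{W}}^{[l]})$ and of the normalization of $\widehat{{\bm{A}}}$ is given below; the class and function names are illustrative and not taken from any released code.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One dense GCN layer: H_next = relu(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_hat, H):
        # A_hat: (n, n) symmetrically normalized adjacency; H: (n, d) node features
        return torch.relu(self.weight(A_hat @ H))

def normalize_adjacency(A):
    # A_hat = D~^{-1/2} (A + I) D~^{-1/2}, with D~_ii = 1 + sum_j A_ij
    A_tilde = A + torch.eye(A.size(0))
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
```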

III-B Implicit GNNs

III-B1 Architecture

Similar to traditional explicit GNNs, IGNNs [1, 2, 3, 4] also involve an aggregation process, with the distinction that the depth of layers (iteration step $k$) is infinite. The aggregation process in IGNNs is typically defined as

${\bm{Z}}^{[k+1]}=\sigma({\bm{W}}{\bm{Z}}^{[k]}\widehat{{\bm{A}}}+b_{\Omega}({\bm{X}})),\quad k=1,2,\ldots,\infty,$ (2)

where $b_{\Omega}$ represents an affine transformation parameterized by $\bm{\Omega}$, and the weight matrices ${\bm{W}}$ and $\bm{\Omega}$ are globally shared at each iteration step. The IGNN model $f_{\theta}$ is formally described by

${\bm{Z}}^{\star}=f_{\theta}({\bm{Z}}^{\star},\widehat{{\bm{A}}},{\bm{X}})=\sigma({\bm{W}}{\bm{Z}}^{\star}\widehat{{\bm{A}}}+b_{\Omega}({\bm{X}})),$ (3)

where the representation, given as the “internal state” ${\bm{Z}}^{\star}$, is obtained as the fixed-point solution of the equilibrium equation (2). The final representation theoretically encompasses information from all neighbors in the graph. Therefore, IGNNs capture LRD within the graph, offering better performance compared to GNNs with finite iterations [1, 2]. Another notable advantage of this framework is its memory efficiency, as it only needs to retain the current state ${\bm{Z}}$ without requiring additional intermediate representations.
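As an illustration of the iteration in (2), the following hedged sketch runs a plain Picard iteration of the IGNN layer until the relative change falls below a tolerance; tensor shapes follow the notation above (${\bm{Z}}\in\mathbb{R}^{m\times n}$) and all names are illustrative.

```python
import torch

def ignn_fixed_point(W, A_hat, b_X, sigma=torch.relu, tol=1e-5, max_iter=300):
    """Plain Picard iteration for Eqs. (2)-(3): Z = sigma(W Z A_hat + b_Omega(X)).
    b_X is the precomputed affine term b_Omega(X) of shape (m, n)."""
    Z = torch.zeros_like(b_X)
    for k in range(max_iter):
        Z_next = sigma(W @ Z @ A_hat + b_X)
        if torch.norm(Z_next - Z) / (torch.norm(Z_next) + 1e-12) < tol:
            return Z_next, k + 1
        Z = Z_next
    return Z, max_iter
```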

III-B2 Well-posedness

Notably, to obtain a valid representation for any given input ${\bm{X}}$ from IGNNs, we need to solve for the unique internal state ${\bm{Z}}^{\star}$ in (3). However, a well-defined solution ${\bm{Z}}^{\star}$ may not always exist for some inputs ${\bm{X}}$ [1]. To guarantee the existence and uniqueness of the solution to (3), we introduce the concept of well-posedness for IGNNs $f_{\theta}$ with activation function $\sigma$ [1]. Specifically, the weight matrix ${\bm{W}}\in\mathbb{R}^{m\times m}$ and the adjacency matrix $\widehat{{\bm{A}}}\in\mathbb{R}^{n\times n}$ are said to be well-posed for $\sigma$ if the solution ${\bm{Z}}^{\star}\in\mathbb{R}^{m\times n}$ of (3) exists and is unique for any $b_{\Omega}({\bm{X}})\in\mathbb{R}^{m\times n}$.

Based on the monotone operator theorem [39, 34], a sufficient condition for the well-posedness of IGNNs is stated as follows. Let the nonlinearity $\sigma$ be a non-expansive activation function (e.g., Sigmoid, Tanh, ReLU, etc.), and let $\widehat{{\bm{W}}}=\widehat{{\bm{A}}}^{\top}\otimes{\bm{W}}$, where $\otimes$ denotes the Kronecker product. Then, the IGNN model defined in (3) is well-posed as long as ${\bm{I}}-\widehat{{\bm{W}}}\succeq m{\bm{I}}$ for some $m>0$. Here, ${\bm{A}}\succeq{\bm{B}}$ means that ${\bm{A}}-{\bm{B}}$ is positive semi-definite; for a non-symmetric matrix $\widehat{{\bm{W}}}$, positive definiteness is defined via the symmetric part, i.e., ${\bm{I}}-\widehat{{\bm{W}}}\succeq m{\bm{I}}$ if and only if ${\bm{I}}-\frac{\widehat{{\bm{W}}}+\widehat{{\bm{W}}}^{\top}}{2}\succeq m{\bm{I}}$.

This establishes that if $\widehat{{\bm{W}}}$ is symmetric and ${\bm{I}}-\widehat{{\bm{W}}}\succeq m{\bm{I}}$ for some $m>0$, then the existence and uniqueness of the equilibrium point ${\bm{Z}}^{\star}$ are guaranteed, ensuring the model is well-posed. This follows from the sufficient condition that ${\bm{I}}-\widehat{{\bm{W}}}$ is strongly monotone.
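The sufficient condition can be checked numerically on small examples by examining the smallest eigenvalue of the symmetric part of ${\bm{I}}-\widehat{{\bm{W}}}$; the sketch below assumes dense matrices and forms the Kronecker product explicitly, which is only feasible for toy sizes.

```python
import torch

def is_well_posed(W, A_hat, margin=1e-3):
    # Check the sufficient condition I - W_hat >= margin * I, with W_hat = A_hat^T (x) W,
    # via the smallest eigenvalue of the symmetric part. The dense Kronecker product
    # below is only meant for small, illustrative sizes.
    W_hat = torch.kron(A_hat.T.contiguous(), W)
    M = torch.eye(W_hat.size(0)) - 0.5 * (W_hat + W_hat.T)
    return torch.linalg.eigvalsh(M).min().item() >= margin
```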

To satisfy the well-posedness of IGNNs, a commonly used method [40, 3] is to orthogonally reparameterize ${\bm{W}}$ in the IGNN layer $f_{\theta}$ using the following Cayley map, inspired by unitary operators [41]:

${\bm{W}}=\kappa({\bm{I}}-{\bm{S}})({\bm{I}}+{\bm{S}})^{-1},$ (4)

where $\kappa\in(0,1)$ and ${\bm{S}}={\bm{C}}-{\bm{C}}^{\top}$ is a skew-symmetric matrix with ${\bm{C}}$ an arbitrary parameterized matrix. From (4), it follows that ${\bm{W}}$ is positive definite and its real eigenvalues lie in $(0,1)$. Moreover, based on the definition of $\widehat{{\bm{A}}}$, the eigenvalues of $\widehat{{\bm{W}}}$ also lie in $(0,1)$, which ensures that ${\bm{I}}-\widehat{{\bm{W}}}\succeq m{\bm{I}}$, thereby guaranteeing the well-posedness of IGNN. We refer interested readers to [3] for a detailed proof.
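A minimal sketch of the Cayley-map reparameterization in (4) is given below; the default value of $\kappa$ is an illustrative choice.

```python
import torch

def cayley_reparameterize(C, kappa=0.95):
    """Build W = kappa (I - S)(I + S)^{-1} with S = C - C^T skew-symmetric, as in Eq. (4).
    The Cayley transform of a skew-symmetric matrix is orthogonal, so scaling by
    kappa in (0, 1) keeps the spectrum of W strictly inside the unit disk."""
    S = C - C.T
    I = torch.eye(C.size(0), dtype=C.dtype, device=C.device)
    return kappa * (I - S) @ torch.linalg.inv(I + S)
```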

III-C Training of IGNNs

Additionally, to train the parameters $\bm{\theta}=\{{\bm{W}},\bm{\Omega}\}$ during back-propagation, IGNNs compute the gradient $\nabla_{\bm{Z}}\mathcal{L}$ by solving an equilibrium equation derived from implicit differentiation and the implicit function theorem [42]:

$\nabla_{\bm{Z}}\mathcal{L}={\bm{D}}\odot({\bm{W}}^{\top}\nabla_{\bm{Z}}\mathcal{L}\,\widehat{{\bm{A}}}^{\top}+\nabla_{\bm{X}}\mathcal{L}),$ (5)

where ${\bm{D}}=\sigma^{\prime}({\bm{W}}{\bm{Z}}^{\star}\widehat{{\bm{A}}}+b_{\Omega}({\bm{X}}))$ and $\sigma^{\prime}(\cdot)$ refers to the element-wise derivative of the map $\sigma$. Once $\nabla_{\bm{Z}}\mathcal{L}$ is obtained, $\nabla_{\bm{W}}\mathcal{L}$ and $\nabla_{\bm{\Omega}}\mathcal{L}$ are derived via the chain rule, eliminating the need to store activations at each layer during gradient updates.
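Since (5) is itself a fixed-point equation in $\nabla_{\bm{Z}}\mathcal{L}$, it can be solved by the same kind of iteration as the forward pass. The sketch below runs a plain Picard iteration on (5); `grad_out` stands in for the incoming gradient term (written $\nabla_{\bm{X}}\mathcal{L}$ in (5)), and all names are illustrative.

```python
import torch

def ignn_backward_gradient(W, A_hat, D, grad_out, tol=1e-5, max_iter=300):
    """Picard iteration for the linear fixed point in Eq. (5):
    G = D * (W^T G A_hat^T + grad_out), where D = sigma'(.) at the equilibrium."""
    G = torch.zeros_like(grad_out)
    for _ in range(max_iter):
        G_next = D * (W.T @ G @ A_hat.T + grad_out)
        if torch.norm(G_next - G) / (torch.norm(G_next) + 1e-12) < tol:
            return G_next
        G = G_next
    return G
```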

However, IGNNs typically require multiple iterations to reach equilibrium, both in finding the fixed point during the forward process (2) and in solving the implicit differential during the backward process (5), resulting in significant overhead during both training and inference. Therefore, developing a fast and efficient solver to find their fixed points is crucial.

III-D Traditional Anderson Acceleration Solver

Algorithm 1 Anderson Acceleration Solver
1: fixed-point function $f_{\theta}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$, max storage size $m$, residual control parameter $\beta$
2: Set the initial point ${\bm{Z}}^{[0]}\in\mathbb{R}^{n}$ \triangleright normally set to $\mathbf{0}$ or a random value
3: for $k=0,\ldots,K-1$ do
4:  1) Set $m_{k}=\min\{m,k\}$ \triangleright storage of the most recent steps
5:  2) Solve $\alpha^{[k]}=\arg\min_{\alpha\in\mathbb{R}^{m_{k}+1}}\|{\bm{G}}^{[k]}\alpha\|_{2}$, s.t. $\mathbf{1}^{\top}\alpha^{[k]}=1$ \triangleright compute weights
6:  3) ${\bm{Z}}^{[k+1]}=\beta\sum_{i=0}^{m_{k}}\alpha_{i}^{[k]}f_{\theta}({\bm{Z}}^{[k-m_{k}+i]})+(1-\beta)\sum_{i=0}^{m_{k}}\alpha_{i}^{[k]}{\bm{Z}}^{[k-m_{k}+i]}$ \triangleright Anderson step
7: end for

Since implicit networks typically require multiple iterations to reach equilibrium, a commonly used solver is Anderson Acceleration [13], which is widely applied in [3, 15, 14, 16]. Specifically, given an implicit network layer $f_{\theta}$ and input ${\bm{X}}$, we aim to find the fixed point ${\bm{Z}}^{\star}$ representing the equilibrium state of the network output by solving the root-finding problem

$g_{\theta}({\bm{Z}}^{\star},{\bm{X}}):=f_{\theta}({\bm{Z}}^{\star},{\bm{X}})-{\bm{Z}}^{\star}=0,$ (6)

via fixed-point solvers such as the classical Anderson Acceleration (AA) [13], which can directly seek the equilibrium point ${\bm{Z}}^{\star}$ through quasi-Newton methods. This approach demonstrates super-linear convergence properties.

We briefly introduce the AA solver, as our approach is inspired by it. Algorithm 1 provides pseudocode, where ${\bm{G}}^{[k]}=[g_{\theta}({\bm{z}}^{[k-m_{k}]}),\ldots,g_{\theta}({\bm{z}}^{[k]})]$ collects the past residuals. The key idea behind AA is to accelerate the next approximate solution by leveraging information stored from the past $m$ iteration steps, constructing a normalized linear combination of past updates with weights $\alpha^{[k]}$ and $\beta$. These weights are computed greedily to minimize the norm of the linear combination of past residuals. This approach is typically effective for accelerating the solution of implicit functions.

Notably, because the computation relies only on the final output and does not require storing any intermediate activation values [14], the memory consumption during training remains constant, which is a key reason why fixed-point solvers are attractive.
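For reference, a compact and hedged PyTorch sketch of Algorithm 1 is shown below; it flattens the state into a vector and solves the constrained least-squares step by the standard difference substitution, so it is a generic AA implementation rather than the exact routine used in prior work.

```python
import torch

def anderson_solver(f, Z0, m=5, beta=1.0, max_iter=50, tol=1e-5):
    """Classical Anderson Acceleration for Z = f(Z) on a flattened state.
    The constrained step min ||G alpha|| s.t. 1^T alpha = 1 is solved by
    substituting alpha_0 = 1 - sum(gamma) and minimizing over gamma."""
    z0 = Z0.reshape(-1)
    f0 = f(Z0).reshape(-1)
    Zs, Fs = [z0, f0], [f0, f(f0.reshape(Z0.shape)).reshape(-1)]
    for k in range(1, max_iter):
        mk = min(m, k)
        Zw = torch.stack(Zs[-(mk + 1):], dim=1)   # (n, m_k+1) past iterates, oldest first
        Fw = torch.stack(Fs[-(mk + 1):], dim=1)   # (n, m_k+1) their images under f
        G = Fw - Zw                               # residuals g_i = f(Z_i) - Z_i
        dG = G[:, 1:] - G[:, :1]                  # differences against the oldest residual
        gamma = torch.linalg.lstsq(dG, -G[:, :1]).solution.squeeze(1)
        alpha = torch.cat([(1 - gamma.sum()).reshape(1), gamma])   # enforces 1^T alpha = 1
        z_next = beta * (Fw @ alpha) + (1 - beta) * (Zw @ alpha)   # Anderson step
        f_next = f(z_next.reshape(Z0.shape)).reshape(-1)
        Zs.append(z_next)
        Fs.append(f_next)
        if torch.norm(f_next - z_next) / (torch.norm(z_next) + 1e-12) < tol:
            break
    return Zs[-1].reshape(Z0.shape)
```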

Figure 1: Comparison of our IGNN-Solver and the generic Anderson solver at each iteration. (a) The canonical Anderson solver performs local least-squares fitting at each iteration, with $\beta^{[k]}=\beta$ fixed to a constant. (b) Our learnable IGNN-Solver incorporates learnable iterative updates for enhanced convergence via a tiny GNN.

IV Our Proposed IGNN-Solver

Although traditional fixed-point solvers for IGNNs demonstrate functionality, they are characterized by relatively slow performance, necessitate manual parameter tuning, and are not well-suited for graph-based applications. To address these limitations, we propose a lightweight, learnable, and content-aware fixed-point solver for IGNN that incorporates a tiny graph neural network, combining the speed advantages of the solver with the information benefits of graph learning. By encoding the relationship between past residuals and the parameters $\alpha^{[k]}$ and $\beta$, it enables more efficient adjustments to enhance the learning capabilities of the IGNN-Solver, as shown in Figure 1.

Given the implicit model’s characteristic that its representation capability is decoupled from the forward computation (i.e., the solver has no knowledge of the task, such as node classification, and vice versa), we can train this neural solver in a lightweight, unsupervised manner. Once learned, the IGNN-Solver can solve the fixed-point equation within the IGNN layer, facilitating further model updates. Note that our solver does not alter the original structure of the model but only accelerates the solution process. Details of the training strategy involving the IGNN-Solver are provided in Appendix B.

Algorithm 2 IGNN-Solver Iterations and Training
1: frozen IGNN model $f_{\theta}$, initializer $h_{\phi}$, tiny-GNN predictor $s_{\xi}$, sparse subgraph $\widehat{{\bm{A}}}_{s}$, storage ${\bm{G}}\in\mathbb{R}^{(m+1)\times d^{\prime}}$
2: Compute the initial fixed-point estimate ${\bm{Z}}^{[0]}=h_{\phi}({\bm{X}})\in\mathbb{R}^{d^{\prime}}$ \triangleright initializer guesses the initial value
3: Define $g_{\theta}({\bm{Z}})=f_{\theta}({\bm{Z}})-{\bm{Z}}$ and set ${\bm{G}}[0]=g_{\theta}({\bm{Z}}^{[0]})$
4: for $k=0,\ldots,K$ do
5:  1) Set $m_{k}=\min\{m,k\}$ and ${\bm{G}}^{[k]}={\bm{G}}[0:(m_{k}+1)]\in\mathbb{R}^{(m_{k}+1)\times d^{\prime}}$ \triangleright compress storage
6:  2) Predict $\alpha^{[k]},\beta^{[k]}=s_{\xi}({\bm{G}}^{[k]},\widehat{{\bm{A}}}_{s})$, where $\alpha^{[k]}\in\mathbb{R}^{m_{k}+1}$, $\mathbf{1}^{\top}\alpha^{[k]}=1$ and $\beta^{[k]}\in\mathbb{R}$ \triangleright predict with the tiny GNN
7:  3) Update ${\bm{Z}}^{[k+1]}=\beta^{[k]}\cdot\mathbf{1}^{\top}{\bm{G}}^{[k]}+(1-\beta^{[k]})\sum_{i=0}^{m_{k}}\alpha_{i}^{[k]}{\bm{Z}}^{[k-m_{k}+i]}$ \triangleright Anderson step
8:  4) Update ${\bm{G}}=\operatorname{concat}({\bm{G}}[1:],[g_{\theta}({\bm{Z}}^{[k+1]})])$
9: end for
10: if it is the inference stage then
11:  return ${\bm{Z}}^{[k+1]}$
12: else
13:  return $({\bm{Z}}^{[k+1]},{\bm{G}}^{[k]},\alpha^{[k]},\beta^{[k]})_{k=0,\ldots,K}$ and ${\bm{Z}}^{[0]}$
14:  Compute $\mathcal{L}_{\text{total}}$ and back-propagate it to update the IGNN-Solver $\{s_{\xi},h_{\phi}\}$
15: end if

IV-A General Formulation

Let ${\bm{X}}\in\mathbb{R}^{d\times n}$ be the input features and $\widehat{{\bm{A}}}\in\mathbb{R}^{n\times n}$ the graph; IGNN learns the node representation by finding the fixed point below

${\bm{Z}}=f_{\theta}({\bm{Z}},\widehat{{\bm{A}}},{\bm{X}})=\sigma({\bm{W}}{\bm{Z}}\widehat{{\bm{A}}}+b_{\Omega}({\bm{X}})).$ (7)

The overall structure of the IGNN-Solver is shown in Algorithm 2. Specifically, it learns the parameters $\alpha$, which control the weights among the past $m$ approximate solutions, and $\beta$, which controls the weight between past residuals and approximate solutions, through a predictor $s_{\xi}$, in order to accelerate the approximate solution of the next update step. To train the IGNN-Solver, we minimize the joint loss function $\mathcal{L}_{\text{total}}$ by back-propagating through this $K$-step temporal process, which is discussed in Section IV-C. It is worth noting that the original IGNN is frozen (i.e., the model parameters $\theta$ are fixed) and only the IGNN-Solver parameters $\{\phi,\xi\}$ are trained here, so we do not need the ground-truth label $y$ corresponding to the input ${\bm{X}}$. This implies that the IGNN-Solver can also be fine-tuned during inference after deployment.
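To make the control flow of Algorithm 2 concrete, the following hedged sketch unrolls the solver for $K$ steps with a frozen IGNN layer, a learnable initializer, and a tiny-GNN predictor passed in as callables. The update combines past iterates and residuals in the Anderson-step form of Algorithm 1, with $\alpha^{[k]}$ and $\beta^{[k]}$ supplied by the predictor; shapes and the exact mixing rule of the released implementation may differ.

```python
import torch

def ignn_solver_unroll(f_theta, h_phi, s_xi, X, A_s, K=6, m=5):
    # f_theta: frozen IGNN layer Z -> f_theta(Z); h_phi: initializer X -> Z^{[0]};
    # s_xi: tiny-GNN predictor (G, A_s) -> (alpha, beta). Per-node state of shape (n, c).
    Z = h_phi(X)                                        # Z^{[0]} = h_phi(X)
    iterates, residuals, trajectory = [Z], [f_theta(Z) - Z], []
    for k in range(K):
        mk = min(m, k)
        G = torch.stack(residuals[-(mk + 1):], dim=0)   # (m_k+1, n, c) recent residuals
        Zw = torch.stack(iterates[-(mk + 1):], dim=0)   # (m_k+1, n, c) matching iterates
        alpha, beta = s_xi(G, A_s)                      # alpha: (m_k+1,), sums to 1; beta: scalar
        Z_mix = torch.tensordot(alpha, Zw, dims=1)      # sum_i alpha_i Z^{[k-m_k+i]}
        G_mix = torch.tensordot(alpha, G, dims=1)       # sum_i alpha_i g_theta(Z^{[k-m_k+i]})
        Z = Z_mix + beta * G_mix                        # Anderson-form update with learned weights
        iterates.append(Z)
        residuals.append(f_theta(Z) - Z)
        trajectory.append((Z, G, alpha, beta))
    return Z, trajectory                                # fixed-point estimate and training tuples
```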

IV-A1 Initializer

To accelerate the convergence of predicted values, we propose constructing an initializer

${\bm{Z}}^{[0]}=h_{\phi}({\bm{X}}),$ (8)

where $h_{\phi}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d^{\prime}}$, with learnable parameters $\phi$, provides a rapid and reasonable input-based initial estimate rather than simply setting the initial value to $\mathbf{0}$ or a random vector. We set the intermediate dimension $d^{\prime}$ to be significantly smaller than the feature dimension $d$ to reduce the training overhead of the initializer.
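A minimal sketch of such an initializer is a small MLP from the input feature dimension $d$ to the compressed dimension $d^{\prime}$; the hidden width below is an illustrative choice.

```python
import torch.nn as nn

def make_initializer(d, d_prime, hidden=64):
    # h_phi in Eq. (8): a small MLP mapping node features (n, d) to an initial state (n, d')
    return nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d_prime))

# Example usage (shapes illustrative): Z0 = make_initializer(d=500, d_prime=32)(X)
```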

IV-A2 Improved Anderson Iterations with tiny-GNN

In the original AA solver, the parameter $\beta$ is predetermined and fixed [13], whereas the parameters $\alpha^{[k]}$ are determined by solving numerous linear systems via least squares. Although this method operates adequately, its efficiency and adaptability in optimizing IGNN models are limited. Therefore, we propose introducing a tiny, learnable graph neural network as

$\alpha,\beta=s_{\xi}({\bm{G}},\widehat{{\bm{A}}}),$ (9)

where $s_{\xi}:\mathbb{R}^{(m_{k}+1)\times d^{\prime}}\times\mathbb{R}^{n\times n}\rightarrow\mathbb{R}^{m_{k}+1}\times\mathbb{R}$, to predict the two parameters instead of setting them as the least-squares solution over the past residuals ${\bm{G}}$.

Remark 1 (Well-posedness of IGNN-Solver).

The proposed IGNN-Solver remains well-posed when the IGNN satisfies the reparameterization defined in (4). Note that the IGNN-Solver introduces two key improvements while maintaining the structure and parameters of the model. First, we propose an initializer for a better initial guess, establishing a more meaningful starting point, reducing solving time, and shortening the fixed-point trajectory. Second, we parameterize the Anderson weights and step sizes within their original range, ensuring that well-posedness is preserved. Consequently, the modifications introduced in the IGNN-Solver retain the foundational characteristics of classical AA without compromising the well-posedness condition. Therefore, for an IGNN $(f_{\theta},b_{\Omega},\sigma)$ where $f_{\theta}$ satisfies the well-posedness condition, the corresponding IGNN-Solver $(h_{\phi},s_{\xi})$ is also well-posed, thereby guaranteeing the existence and uniqueness of its solution.

IV-B Graph Sparsification and Storage Compression

Directly using the original graph for modeling would result in an oversized network [17, 18], deviating from the initial goal of using a tiny network for prediction. Moreover, owing to the curse of dimensionality, directly mapping the data from a high to an extremely low dimension is not appropriate [19]. Therefore, in response to the challenges posed by large-scale graphs, we take both graph sparsification and storage compression into account in the design of the solver, as detailed below.

IV-B1 Graph Sparsification

Firstly, the number of edges in the graph $\widehat{{\bm{A}}}$ is usually large in practice, as it scales with the input (for example, in the ogbn-products dataset, the number of edges is close to 62M). To maintain the lightweight nature of the tiny GNN and reduce the computational cost of prediction, we introduce graph sparsification [43, 44, 45] to obtain a lighter graph:

$\widehat{{\bm{A}}}_{s}=\varsigma(\widehat{{\bm{A}}},\beta),$ (10)

where $\widehat{{\bm{A}}}_{s}$ denotes the sparse subgraph of $\widehat{{\bm{A}}}$, obtained via a graph sparsification operator $\varsigma$ whose parameter $\beta$ controls the sparsity of $\widehat{{\bm{A}}}_{s}$. We adopt the RPI-Graph approach [17] in this paper, a plug-and-play graph sparsification method based on the principle of relevant information. It is worth noting that this subgraph is used exclusively in the tiny-GNN predictor $s_{\xi}$ and does not alter the original graph used by the IGNN model $f_{\theta}$.
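The sketch below shows a generic stand-in for the sparsification operator $\varsigma$ that simply keeps the $k$ largest-weight entries per row of a dense adjacency; since the paper adopts RPI-Graph for this step, this top-$k$ rule is only an illustrative placeholder with a comparable interface.

```python
import torch

def sparsify_topk(A_hat, k=8):
    """Illustrative placeholder for Eq. (10): keep only the k largest-weight entries in
    each row of the (dense) normalized adjacency; everything else is zeroed out."""
    values, indices = torch.topk(A_hat, k=min(k, A_hat.size(1)), dim=1)
    A_s = torch.zeros_like(A_hat)
    A_s.scatter_(1, indices, values)   # keep only the selected entries per row
    return A_s
```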

IV-B2 Storage Compression

Besides, considering that $n$ (the number of nodes) is relatively large (for example, $n$ is approximately 169K in the ogbn-arxiv dataset), it is not appropriate to map $s_{\xi}$ directly from a high-dimensional space to a very low-dimensional one in (9) [25, 46, 47], owing to the curse of dimensionality [19]. Therefore, to keep $s_{\xi}$ fast and compact, we compress ${\bm{G}}^{[k]}_{i}$ into a smaller yet still representative version $\bar{{\bm{G}}}^{[k]}_{i}$. Specifically, we map it from $\mathbb{R}^{(m_{k}+1)\times d^{\prime}}$ to an appropriate space $\mathbb{R}^{(m_{k}+1)\times p}$ with a multilayer perceptron (MLP). Then, the most recent $m_{k}+1$ sets of fixed points are merged into one feature matrix:

$\bar{{\bm{G}}}^{[k]}=\big\Vert_{i=k-m_{k}}^{k}\bar{{\bm{G}}}^{[k]}_{i},$ (11)

where $\Vert_{i=k-m_{k}}^{k}$ denotes the concatenation operation that stacks the compressed storage matrices $\bar{{\bm{G}}}^{[k]}_{i}\in\mathbb{R}^{(m_{k}+1)\times p}$ at the $k$-th iteration.

Once the smaller yet representative storage matrix $\bar{{\bm{G}}}$ is obtained as described above, we treat it as a $p$-channel matrix and employ a tiny graph neural network layer:

$\alpha^{[k]},\beta^{[k]}=s_{\xi}(\bar{{\bm{G}}}^{[k]},\widehat{{\bm{A}}}_{s}),$ (12)

to predict the relative weights $\alpha^{[k]}$ assigned to the past $m_{k}+1$ residuals, as well as the AA mixing coefficient $\beta^{[k]}$, at the $k$-th iteration. Notably, we normalize the original output $\widehat{\alpha}^{[k]}$ as

$\alpha^{[k]}=\widehat{\alpha}^{[k]}+\frac{1-\mathbf{1}^{\top}\widehat{\alpha}^{[k]}}{m_{k}+1}\cdot\mathbf{1}$

to ensure that the constraint $\mathbf{1}^{\top}\alpha^{[k]}=1$ is satisfied.

In this way, $s_{\xi}$ autonomously learns and adjusts the parameters $\alpha^{[k]}$ and $\beta^{[k]}$ based on previous solver steps, while also receiving gradients from subsequent iterations. We provide a comparison of different choices for $s_{\xi}$ in Section V-D, demonstrating that the IGNN-Solver improves the convergence path. The training procedure is introduced below.
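A hedged sketch of one possible realization of $s_{\xi}$, combining the storage compression of (11), a single propagation over $\widehat{{\bm{A}}}_{s}$, and the $\alpha$ normalization above, is given below; the layer layout, pooling, and the sigmoid range assumed for $\beta^{[k]}$ are illustrative choices rather than the released design.

```python
import torch
import torch.nn as nn

class TinyGNNPredictor(nn.Module):
    """Sketch of s_xi: compress each stored residual per node, mix once over the
    sparse subgraph A_s, pool, and emit (alpha, beta) with 1^T alpha = 1."""
    def __init__(self, channels, p=16):
        super().__init__()
        self.compress = nn.Linear(channels, p)   # storage compression to p channels
        self.alpha_head = nn.Linear(p, 1)
        self.beta_head = nn.Linear(p, 1)

    def forward(self, G, A_s):
        # G: (m_k+1, n, channels) recent residuals per node; A_s: (n, n) sparse subgraph
        H = torch.relu(self.compress(G))                 # (m_k+1, n, p)
        H = torch.einsum("ij,kjc->kic", A_s, H)          # one propagation step over A_s
        pooled = H.mean(dim=1)                           # (m_k+1, p) summary per past step
        alpha_hat = self.alpha_head(pooled).squeeze(-1)  # (m_k+1,) raw weights
        # Shift so that 1^T alpha = 1, as in the normalization formula above
        alpha = alpha_hat + (1 - alpha_hat.sum()) / alpha_hat.numel()
        beta = torch.sigmoid(self.beta_head(pooled.mean(dim=0)))  # assumed range (0, 1)
        return alpha, beta.squeeze(-1)
```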

IV-C Training IGNN-Solver

Unlike a neural ODE solver, the fixed-point trajectory of the IGNN is not unique. Therefore, trajectory fitting is not applicable to IGNN-Solver training. Instead, the goal is simply to bring every iterate as close as possible to ${\bm{Z}}^{\star}$. Formally, given an IGNN-Solver $\{s_{\xi},h_{\phi}\}$ that returns $({\bm{Z}}^{[k+1]},{\bm{G}}^{[k]},\alpha^{[k]},\beta^{[k]})_{k=0,\ldots,K}$ and ${\bm{Z}}^{[0]}$, we introduce three objective functions for its training. The complete training procedure is given in Algorithm 2.

Initializer Loss Functions

To train the initializer, we minimize the distance between the initial guess and the fixed point via

$\mathcal{L}_{\text{init}}=\left\|h_{\phi}({\bm{X}})-{\bm{Z}}^{\star}\right\|_{2},$ (13)

thereby facilitating the subsequent iteration process. Since the initializer predicts directly from the input ${\bm{X}}$ without going through iterations, we separate this loss from the other components.

Reconstruction Loss Functions

The training of the solver does not require label information or any trajectory fitting. Apart from the loss $\mathcal{L}_{\text{init}}$ needed to train the initializer above, we introduce the reconstruction loss

$\mathcal{L}_{\text{rec}}^{[k]}=\|{\bm{Z}}^{[k]}-{\bm{Z}}^{\star}\|_{2},$ (14)

where the reconstruction loss $\mathcal{L}_{\text{rec}}^{[k]}$ aims to make all intermediate predictions ${\bm{Z}}^{[k]}$ converge to the accurate fixed point ${\bm{Z}}^{\star}$ as closely as possible.

Auxiliary Loss Functions

Although we utilize $\alpha^{[k]}$ and $\beta^{[k]}$ to improve the generic solver, we have empirically found that an auxiliary loss guiding the solver's prediction of $\alpha$ is sometimes beneficial, especially in the early stages of training. Therefore, in practice, we suggest including this loss and gradually diminishing it as training progresses (i.e., decaying its weight to 0):

$\mathcal{L}_{\alpha}=\sum_{k=0}^{K}\|{\bm{G}}^{[k]}\alpha^{[k]}\|_{2}.$ (15)

The final form of the joint loss function is as follows:

$\mathcal{L}_{\text{total}}=\lambda_{1}\mathcal{L}_{\text{rec}}^{[k]}+\lambda_{2}\mathcal{L}_{\text{init}}+\lambda_{3}\mathcal{L}_{\alpha},$ (16)

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ control the weights of the three loss functions above; they are elaborated in Section A-B.
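The three terms can be assembled as in the following sketch, which consumes the tuples returned by the unrolled solver; the per-step weighting of the reconstruction term and the concrete $\lambda$ values are illustrative.

```python
import torch

def solver_training_loss(Z0, trajectory, Z_star, lambdas=(1.0, 1.0, 0.1)):
    """Sketch of Eqs. (13)-(16). `trajectory` holds (Z^{[k+1]}, G^{[k]}, alpha^{[k]}, beta^{[k]})
    for k = 0..K as returned by the unrolled solver; Z_star is a reference fixed point.
    A per-step weighting of the reconstruction term (e.g. increasing in k, as studied in
    the ablation) is omitted for brevity."""
    l1, l2, l3 = lambdas
    loss_init = torch.norm(Z0 - Z_star)                                   # Eq. (13)
    loss_rec = sum(torch.norm(Z - Z_star) for Z, _, _, _ in trajectory)   # Eq. (14), summed over k
    loss_alpha = sum(torch.norm(torch.tensordot(a, G, dims=1))            # Eq. (15)
                     for _, G, a, _ in trajectory)
    return l1 * loss_rec + l2 * loss_init + l3 * loss_alpha               # Eq. (16)
```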

Figure 2: Speed-accuracy tradeoff curves comparing the inference time of IGNN with IGNN-Solver and its variants IGNN-Solver (NN) and IGNN-Solver (AA), across four large-scale datasets: Amazon-all (2(a)), Reddit (2(b)), ogbn-arxiv (2(c)) and ogbn-products (2(d)). Each plot shows the average performance over five independent runs under identical experimental conditions.
TABLE I: Node classification accuracy results on four large-scale real-world datasets (more results on small-scale datasets including Citeseer, ACM, CoraFull, BlogCatalog, Flickr can be found in Table II), with experiments conducted over five trials. The mean accuracy (%) ± standard deviation is reported. The best and the runner-up results are highlighted in boldface and underline, respectively.
Type Models Amazon-all Reddit ogbn-arxiv ogbn-products
Nodes 334,863 232,965 169,343 2,449,029
Edges 14,202,057 11,606,919 1,166,243 61,859,140
Explicit GCN [12] 79.12±1.11 89.65±0.14 71.56±0.25 71.91±0.27
GAT [23] 76.02±2.09 90.08±0.44 71.10±0.24 72.33±0.32
SGC [48] 75.56±1.75 91.44±0.41 64.66±1.02 70.48±0.19
APPNP [49] 79.80±1.22 91.68±0.24 71.28±0.29 74.46±0.64
JKNet [50] 81.19±1.09 91.71±0.31 71.08±0.35 73.70±0.58
AM-GCN [25] 81.85±1.46 90.20±0.34 68.83±0.67 73.19±0.49
DEMO-Net [51] 84.08±1.29 89.53±0.29 68.40±0.24 73.88±0.62
GCNII [52] 83.60±2.40 89.87±0.38 67.66±0.20 72.88±0.18
ACM-GCN [53] 83.04±2.61 90.37±0.42 68.32±0.18 71.68±0.41
Implicit IGNN [1] 83.90±0.51 92.30±1.55 70.49±0.75 74.63±0.24
EIGNN [2] \underline{84.32}±0.57 92.00±0.24 70.59±0.31 74.58±0.26
MIGNN [3] 83.68±0.82 91.98±0.42 \underline{71.95}±0.44 74.62±0.32
IGNN-Solver (AA) 83.44±0.19 92.37±0.35 71.73±0.41 74.61±0.28
IGNN-Solver (NN) 84.13±1.04 \underline{92.42}±0.41 70.78±0.37 \underline{74.69}±0.31
IGNN-Solver (Ours) \textbf{84.50}±0.70 \textbf{93.91}±0.31 \textbf{72.53}±0.41 \textbf{74.90}±0.20

V Experiments

In this section, we compare the speed/accuracy Pareto curve (rather than a single point on the curve) of IGNN-Solver against IGNNs and several state-of-the-art (SOTA) GNNs across various graph classification tasks, both at the node and graph levels. Our goal is to demonstrate that: 1) IGNN-Solver achieves nearly a 1.5× to 8× inference acceleration without sacrificing accuracy; and 2) IGNN-Solver is extremely compact, adding little overhead to training. Specifically, we compare our approach with numerous representative baselines and two variants of our method: IGNN w. AA (using the original Anderson Acceleration to speed up IGNN) and IGNN w. NN (replacing our proposed graph neural solver with a standard neural network solver for parameter learning).

In Section V-D, we then demonstrate that the training overhead of IGNN-Solver is small relative to the total training time of IGNN. Additionally, we provide further evidence regarding the convergence and generalizability of the IGNN-Solver. We compare the training/inference time costs per epoch between IGNN and IGNN-Solver, visualizing the advantages of our approach. Furthermore, we conduct ablation studies to validate the stability of the IGNN-Solver and assess the effectiveness of its components. Additional experimental setups and descriptions are detailed in Appendix A.

Figure 3: Speed-accuracy tradeoff curves comparing the inference time of IGNN with IGNN-Solver and its variants IGNN-Solver (NN) and IGNN-Solver (AA), across five small-scale datasets: Citeseer (3(a)), ACM (3(b)), CoraFull (3(c)), BlogCatalog (3(d)) and Flickr (3(e)). Each plot represents the average performance over five independent runs under identical experimental conditions.
TABLE II: Node Classification accuracy results on five small-scale real-world datasets, with experiments conducted over five trials. The mean accuracy (%) ± standard deviation is reported. The best and the runner-up results are highlighted in boldface and underline, respectively.
Type Models Citeseer ACM CoraFull BlogCatalog Flickr
Nodes 3,327 3,025 19,793 5,196 7,575
Edges 4,732 13,128 65,311 171,743 239,738
Explicit GCN [12] 64.80±4.15 83.78±3.95 35.98±9.44 71.30±1.12 76.98±1.83
GAT [23] 71.24±2.88 80.60±5.12 50.24±1.87 76.24±2.76 72.64±3.23
SGC [48] 70.32±2.75 85.40±1.23 46.56±4.43 69.76±1.11 71.88±3.47
APPNP [49] 61.56±8.92 84.16±4.25 21.24±4.76 61.34±9.39 73.42±3.80
JKNet [50] 63.78±8.76 64.96±5.47 23.04±6.01 72.62±6.83 75.68±1.15
AM-GCN [25] 73.10±1.62 89.56±0.30 53.40±1.59 73.86±1.10 76.86±2.02
DEMO-Net [51] 68.34±2.94 84.38±2.19 61.74±3.65 74.26±2.70 75.60±3.95
GCNII [52] 71.98±0.80 85.36±1.05 57.64±3.34 74.94±3.81 \underline{79.92}±2.14
ACM-GCN [53] 72.38±1.46 88.98±0.41 59.88±1.59 \textbf{78.18}±1.75 74.82±3.78
Implicit IGNN [1] 72.96±1.83 90.88±0.95 65.52±0.51 75.68±0.55 75.80±0.29
EIGNN [2] 72.38±1.36 88.36±1.03 61.80±0.60 75.34±0.38 75.66±0.94
MIGNN [3] 73.79±0.94 89.59±1.61 62.94±0.46 76.68±1.49 74.96±0.49
IGNN-Solver (AA) \underline{75.28}±0.38 \textbf{91.34}±0.46 \underline{65.88}±0.34 76.82±0.34 75.84±0.27
IGNN-Solver (NN) 74.78±0.42 91.08±0.56 65.62±0.16 76.88±0.74 78.55±0.49
IGNN-Solver (Ours) \textbf{75.60}±0.22 \underline{91.20}±1.15 \textbf{66.08}±1.23 \underline{77.64}±0.54 \textbf{80.14}±0.23

V-A Experiments Settings

V-A1 Datasets

The evaluation is conducted on nine different-field, real-world benchmarks for node classification tasks, including four citation datasets (Citeseer, ACM, CoraFull, ogbn-arxiv), three social interaction datasets (BlogCatalog, Flickr, Reddit), and two product network datasets (Amazon-all, ogbn-products); as well as five bioinformatics benchmarks (MUTAG, PTC_MR, COX2, PROTEINS, NCI1) for graph classification tasks. Further details can be found in Appendix A. Notably, to demonstrate the scalability of the proposed model on larger datasets, we use four large-scale datasets: Amazon-all, Reddit, ogbn-arxiv, and ogbn-products, two of which are from the Open Graph Benchmark (OGB) [22].

V-A2 Baselines

We compare IGNN-Solver with a comprehensive set of baselines in graph node classification tasks, including three implicit GNNs, i.e., MIGNN [3], EIGNN [2], and IGNN [1], and nine explicit/traditional GNNs, i.e., AM-GCN [25], GCN [12], GAT [23], SGC [48], APPNP [49], JKNet [50], DEMO-Net [51], GCNII [52], and ACM-GCN [53]. For fairness, we use the same hidden-layer configuration for all baseline models. For example, on the Flickr dataset, the dimension of the hidden layers is uniformly set to 512 for all methods (including ours, see Table IV). In addition, for EIGNN, the gamma parameter is set to 0.8, consistent with the original paper. For AM-GCN, the number of nearest neighbors in the KNN graph is chosen from {5, 6, 7}. For GCN, the hidden-layer setup is the same as for the other methods. For GAT, three attention heads are used per layer. For SGC, the power of self-loops in the graph adjacency matrix is set to 3. For APPNP, we set the number of iterations to 3 and the teleport probability $\alpha$ to 0.5. For JKNet, we set the number of layers to 1 on small datasets and to 8 on large ones. For DEMO-Net, the regularization parameter is set to 5e-4, the learning rate to 5e-3, and the hash dimension to 256. For GCNII, we set $\alpha_{\ell}=0.1$ and the $L_{2}$ regularization to 5e-4, consistent with the original paper. For ACM-GCN, we set the number of layers to 1 on small datasets and to 4 on large ones. For our IGNN-Solver and its variants, we employ the same settings as the base IGNN model. Furthermore, we tune the parameters affecting the convergence and performance of IGNN through a combination of hierarchical grid search and manual tuning. The Adam optimizer [54] is used for optimization.

V-B Performance on Node Classification Tasks

V-B1 Experimental setup

To demonstrate the superiority of the IGNN-Solver over IGNNs in terms of both performance and efficiency, we analyze the movement of the speed/accuracy Pareto curve across various datasets, rather than concentrating on a single point on the curve, which is depicted in Figures  2 and 3. All experiments are conducted five times, and the best results are reported. Furthermore, we present the node classification performance results for each method on various datasets, which are shown in Tables I and II. All experiments are conducted five times, and we report the average results along with the standard deviation. Additional details on the training procedure and the hyperparameters used for each task are provided in Appendices A and B.

V-B2 Results on large-scale tasks

From Figure 2 (the speed/accuracy Pareto curve on large-scale datasets: Amazon-all, Reddit, ogbn-arxiv and ogbn-products), we observe that:

  1.

    Comprehensive performance and efficiency: IGNN-Solver generally outperforms all other methods, particularly on large datasets with a greater graph radius, where its advantage is more pronounced. The improved performance is attributed to IGNN-Solver's significant enhancement of the convergence path and the better initialization of the fixed-point equation provided by the initializer;

  2.

    Rapid inference advantage: IGNN-Solver demonstrates a notable speed advantage during inference. For instance, on the large-scale Amazon-all dataset, our IGNN-Solver achieves 8× faster inference than IGNN while maintaining the same performance; across the other datasets, we consistently observe at least a 1.5× acceleration, with a similar pattern throughout.

From Table I, we observe that implicit GNNs with infinite depth generally outperform shallow explicit GNNs in most cases. Furthermore, IGNN-Solver consistently shows higher accuracy across most datasets than other state-of-the-art explicit GNNs such as DEMO-Net [51] and GCNII [52], indicating its superior performance.

V-B3 Results on small-scale tasks

We further evaluate IGNN-Solver on several small-scale graph node classification tasks, including Citeseer, ACM, CoraFull, BlogCatalog, and Flickr. The hyperparameters used are outlined in Table IV, Section A-B. The mean accuracy and standard deviation are reported in Table II, while the speed/accuracy Pareto curve is presented in Figure 3. In general, learning LRD is not crucial for these tasks, as the graph diameters are relatively small [3]. However, as shown in Table II, even for these small-scale node classification tasks, IGNN-Solver outperforms IGNNs and many explicit GNNs, as well as other advanced implicit models. As illustrated in Figure 3, the solver continues to improve the convergence path of IGNN, even in small-scale tasks. These results, along with those in Section V-B, confirm the expressive power of the IGNN-Solver using learnable graph neural networks, even surpassing that of many explicit and enhanced implicit GNNs.

V-C Performance on Graph Classification Tasks

TABLE III: Graph classification accuracy results on five benchmark datasets. The mean accuracy (%) ± standard deviation is reported. The best and the runner-up results are highlighted in boldface and underline, respectively.
Type Method MUTAG PTC_MR COX2 PROTEINS NCI1
Explicit WL [55] 84.1 ± 1.9 58.0 ± 2.5 83.2 ± 0.2 74.7 ± 0.5 84.5 ± 0.5
DCNN [56] 67.0 56.6 - 61.3 62.6
DGCNN [57] 85.8 58.6 - 75.5 74.4
GIN [58] 89.4 ± 5.6 64.6 ± 7.0 - 76.2 ± 2.8 82.7 ± 1.7
FDGNN [59] 88.5 ± 3.8 63.4 ± 5.4 83.3 ± 2.9 76.8 ± 2.9 77.8 ± 1.6
PK [60] 76.0 ± 2.7 59.5 ± 2.4 81.0 ± 0.2 73.7 ± 0.7 82.5 ± 0.5
GCN [12] 85.6 ± 5.8 64.2 ± 4.3 - 76.0 ± 3.2 80.2 ± 2.0
Implicit IGNN [1] 76.0 ± 13.4 60.5 ± 6.4 79.7 ± 3.4 76.5 ± 3.4 73.5 ± 1.9
CGS [61] 89.4 ± 5.6 64.7 ± 6.4 - 76.3 ± 4.9 77.6 ± 2.0
GIND [62] 89.3 ± 7.4 66.9 ± 6.6 84.8 ± 4.2 77.2 ± 2.9 78.8 ± 1.7
MIGNN [3] 81.8 ± 9.1 72.6 ± 5.4 85.0 ± 5.3 77.9 ± 2.6 79.6 ± 1.8
IGNN-Solver (Ours) 90.5 ± 1.6 71.8 ± 3.1 85.3 ± 3.0 78.4 ± 2.6 80.2 ± 1.1

V-C1 Experimental setup

We extend our evaluation to graph classification tasks to further demonstrate the effectiveness and versatility of IGNN-Solver. Experiments are conducted on several widely used bioinformatics benchmark datasets for graph classification, including MUTAG, PTC_MR, COX2, PROTEINS, and NCI1. These datasets vary in size and complexity, as detailed in Appendix A. Following the experimental setup of [1], we evaluate the test performance of IGNN-Solver with standard 10-fold cross-validation on each dataset, with baselines and their results drawn from [3, 62]. The average prediction accuracy and standard deviations are reported in Table III.

V-C2 Results and discussion

The results of the graph classification experiments, summarized in Table III, show that our IGNN-Solver outperforms the baseline models on almost all datasets, clearly demonstrating its ability to effectively capture graph structures and significantly improve classification accuracy. These findings confirm the effectiveness of the IGNN-Solver in graph classification tasks, further supporting its potential for deployment in real-world applications where graph data is prevalent.

V-D Empirical Understandings of IGNN-Solver

V-D1 Overhead in training

(a) Overhead proportion on Citeseer. (b) Overhead proportion on Amazon-all.
Figure 4: Relative parameter size and training time of the IGNN-Solver on the small Citeseer dataset and the large Amazon-All dataset, with consistent patterns observed across other datasets.

To investigate the proportion of training overhead attributed to the solver throughout the entire training process, we depict in Figure 4 the percentage of training computations spent on IGNN-Solver relative to the total time overhead of IGNN. Our approach is not only effective but also incurs minimal training overhead for the solver: the solver module is relatively small and requires only about 1% of the original training time required by the IGNN model. Moreover, this proportion becomes even lower for large-scale data. For instance, on the Amazon dataset, IGNN requires a total runtime of 3 hours, whereas the solver requires only approximately 1.6 minutes, indicating that our solver maintains its lightweight nature across datasets of any scale.

V-D2 Efficiency study

Panels: (a) Amazon-all, (b) Reddit, (c) ogbn-arxiv, (d) ogbn-products, (e) MUTAG, (f) PTC_MR, (g) COX2, (h) PROTEINS.
Figure 5: Comparison of convergence curves across eight datasets, highlighting the substantial acceleration in convergence achieved by our IGNN-Solver.

We present additional evidence on the convergence and generalizability of the neural solvers in Figure 5, where the ordinate represents the relative residual $\frac{\|f({\bm{z}})-{\bm{z}}\|_{F}}{\|{\bm{z}}\|_{F}}$, with $\|\cdot\|_{F}$ denoting the Frobenius norm. We compare the convergence of a pre-trained IGNN model under two conditions: 1) canonical iteration in IGNN [1], and 2) IGNN-Solver with $K=6$ unrolling steps.

From Figure 5, we observe that the IGNN-Solver, trained with $K$ unrolling steps, is capable of generalizing beyond $K$ steps, with both solvers ultimately reaching a stable state and demonstrating good convergence. Moreover, thanks to the graph-neural parameterization of the fixed-point solution, IGNN-Solver consistently improves upon the convergence of the canonical iterations while being more computationally efficient: each step of the neural solver is cheaper than a standard iterator step. This explains the inference-efficiency improvement of at least 1.5× observed in Section V-B.
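For completeness, the relative residual plotted in Figure 5 can be computed as in the short sketch below; `f_theta` stands for the frozen IGNN layer with the graph and input features already bound.

```python
import torch

def relative_residual(f_theta, Z):
    """Relative residual ||f(Z) - Z||_F / ||Z||_F used for the convergence curves."""
    return (torch.norm(f_theta(Z) - Z) / torch.norm(Z)).item()
```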

V-D3 Other choices for graph neural layers in IGNN-Solver

(a) The relative error curve during warm-up. (b) The relative error curve during inference.
Figure 6: Comparison of IGNN-Solver with GN, NN, GRU, and TCN. Our proposed IGNN-Solver (with GN) achieves the fastest convergence and superior overall performance, owing to its lightweight design and ability to effectively utilize rich graph information.

We provide additional evidence on the convergence and generalization of IGNN-Solver by exploring alternatives to our proposed graph neural (GN) layers, including neural network (NN), temporal convolutional network (TCN) [63], and gated recurrent unit (GRU) [64]. The relative residual curves on the Citeseer dataset, observed during both the warm-up and final inference stages, are presented in Figure 6. Interestingly, despite their larger number of parameters and more complex architectures, TCN and GRU consistently underperform compared to NN. Notably, IGNN-Solver (GN) significantly improves the convergence path of all solvers and exhibits the fastest rate of residual reduction, establishing itself as the most effective predictor in IGNN-Solver due to its ability to effectively exploit rich graph information.

Panels: (a) Amazon-all, (b) Reddit, (c) ogbn-arxiv, (d) ogbn-products, (e) Citeseer, (f) ACM, (g) CoraFull, (h) BlogCatalog, (i) Flickr.
Figure 7: Ablation studies for IGNN-Solver. This study shows that all components contribute significantly to accelerating the convergence process.

V-D4 Ablation study

In this section, we analyze the effect of the different loss components on the IGNN-Solver. The performance of IGNN-Solver and its variants on the Citeseer dataset is shown in Figure 7, with similar patterns observed across the other datasets and solvers. For the main convergence loss $\mathcal{L}_{\text{conv}}$ of the fixed-point iterations (16), we design two contrasting schemes for its weights $\lambda_1^{[k]}$: setting them to a constant value (i.e., $\lambda_1^{[k]}=\lambda_1$ for all $k$), or setting all terms except the $K$-th to zero (i.e., $\lambda_1^{[k]}=\lambda_1$ if $k=K$ and $0$ otherwise). Both variants still perform well, yet our suggested monotonically increasing scheme (i.e., placing greater emphasis on later steps) performs best.
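For concreteness, the three weighting schemes compared above can be written as simple schedules over the $K$ unrolling steps. The sketch below is illustrative only; in particular, the linear ramp is just one way to realize a monotonically increasing schedule, and the function names are ours.

```python
import numpy as np

def constant_weights(K, lam=0.1):
    """Scheme 1: lambda_1^[k] = lambda_1 for all k."""
    return np.full(K, lam)

def last_step_only(K, lam=0.1):
    """Scheme 2: lambda_1^[k] = lambda_1 if k == K, and 0 otherwise."""
    w = np.zeros(K)
    w[-1] = lam
    return w

def increasing_weights(K, lam=0.1):
    """Monotonically increasing weights emphasizing later unrolling steps
    (a linear ramp is one simple realization)."""
    return lam * np.arange(1, K + 1) / K
```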

Furthermore, removing either the initializer or the alpha loss $\mathcal{L}_{\alpha}$ degrades performance, with the former having the more detrimental effect on the solver. Nonetheless, all of these ablated configurations still significantly outperform methods without a solver, underscoring the advantage of employing a custom learnable solver for the IGNN model.

VI Conclusion

This paper introduces IGNN-Solver, a novel graph neural solver tailored to fast fixed-point solving in IGNNs. Unlike previous approaches that impose constraints on the structure or parameters of the implicit layer, we build on the key idea that implicit models separate the representation from the forward computation. By leveraging the generalized Anderson Acceleration method and parameterizing it with a tiny GNN, IGNN-Solver learns iterative updates as a graph-dependent temporal process. To address large-scale graph tasks, we further tailor sparsification and storage compression methods to the IGNN-Solver and integrate them into its design. Our extensive experiments demonstrate that IGNN-Solver significantly accelerates inference, with a speedup ranging from $1.5\times$ to $8\times$, while maintaining accuracy. Notably, this acceleration becomes more pronounced as the graph scale grows. These findings underscore the potential of IGNN-Solver for large-scale end-to-end deployment in real-world applications.

Looking ahead, IGNN-Solver opens new avenues for the integration of learnable solvers in implicit GNNs, bridging the gap between solver efficiency and adaptability to diverse graph structures. Future research could explore extending IGNN-Solver to handle dynamic or heterogeneous graphs, further broadening its applicability. Additionally, investigating the interplay between solver generalization and graph topology holds promise for uncovering deeper insights into the dynamics of graph-based learning. With its ability to combine speed, scalability, and accuracy, IGNN-Solver represents a pivotal step toward making implicit graph models a viable choice for large-scale real-world systems. We hope that IGNN-Solver inspires further innovations in learnable solvers for IGNNs, driving progress in scalable and efficient algorithms on graphs.

References

  • [1] F. Gu, H. Chang, W. Zhu, S. Sojoudi, and L. El Ghaoui, “Implicit graph neural networks,” Advances in Neural Information Processing Systems, vol. 33, pp. 11 984–11 995, 2020.
  • [2] J. Liu, K. Kawaguchi, B. Hooi, Y. Wang, and X. Xiao, “Eignn: Efficient infinite-depth graph neural networks,” Advances in Neural Information Processing Systems, vol. 34, pp. 18 762–18 773, 2021.
  • [3] J. Baker, Q. Wang, C. D. Hauck, and B. Wang, “Implicit graph neural networks: a monotone operator viewpoint,” in International Conference on Machine Learning.   PMLR, 2023, pp. 1521–1548.
  • [4] Q. Chen, Y. Wang, Y. Wang, J. Chang, Q. Tian, J. Yang, and Z. Lin, “Efficient and scalable implicit graph neural networks with virtual equilibrium,” in 2022 IEEE International Conference on Big Data (Big Data).   IEEE, 2022, pp. 864–873.
  • [5] Q. Li, Z. Han, and X. M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in 32nd AAAI Conference on Artificial Intelligence, AAAI 2018.   AAAI press, 2018, pp. 3538–3545.
  • [6] U. Alon and E. Yahav, “On the bottleneck of graph neural networks and its practical implications,” in International Conference on Learning Representations, 2020.
  • [7] X. Zhang, Y. Xu, W. He, W. Guo, and L. Cui, “A comprehensive review of the oversmoothing in graph neural networks,” in CCF Conference on Computer Supported Cooperative Work and Social Computing.   Springer, 2023, pp. 451–465.
  • [8] J. Liu, B. Hooi, K. Kawaguchi, Y. Wang, C. Dong, and X. Xiao, “Scalable and effective implicit graph neural networks on large graphs,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=QcMdPYBwTu
  • [9] Z. Geng, X.-Y. Zhang, S. Bai, Y. Wang, and Z. Lin, “On training implicit models,” Advances in Neural Information Processing Systems, vol. 34, pp. 24 247–24 260, 2021.
  • [10] Y. Yang, T. Liu, Y. Wang, Z. Huang, and D. Wipf, “Implicit vs unfolded graph neural networks,” arXiv preprint arXiv:2111.06592, 2021.
  • [11] M. Nastorg, M.-A. Bucci, T. Faney, J.-M. Gratien, G. Charpiat, and M. Schoenauer, “An implicit gnn solver for poisson-like problems,” arXiv preprint arXiv:2302.10891, 2023.
  • [12] T. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.   OpenReview.net, 2017.
  • [13] D. G. Anderson, “Iterative procedures for nonlinear integral equations,” Journal of the ACM (JACM), vol. 12, no. 4, pp. 547–560, 1965.
  • [14] S. Bai, J. Z. Kolter, and V. Koltun, “Deep equilibrium models,” Advances in neural information processing systems, vol. 32, 2019.
  • [15] S. Bai, V. Koltun, and J. Z. Kolter, “Multiscale deep equilibrium models,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, pp. 5238–5250.
  • [16] ——, “Neural deep equilibrium solvers,” in International Conference on Learning Representations, 2021.
  • [17] S. Yu, F. Alesiani, W. Yin, R. Jenssen, and J. C. Principe, “Principle of relevant information for graph sparsification,” in Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, ser. Proceedings of Machine Learning Research, J. Cussens and K. Zhang, Eds., vol. 180.   PMLR, 01–05 Aug 2022, pp. 2331–2341. [Online]. Available: https://proceedings.mlr.press/v180/yu22c.html
  • [18] C. Zheng, B. Zong, W. Cheng, D. Song, J. Ni, W. Yu, H. Chen, and W. Wang, “Robust graph representation learning via neural sparsification,” in International Conference on Machine Learning.   PMLR, 2020, pp. 11 458–11 468.
  • [19] M. Köppen, “The curse of dimensionality,” in 5th online world conference on soft computing in industrial applications (WSC5), vol. 1, 2000, pp. 4–8.
  • [20] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” in Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, 2012, pp. 1–8.
  • [21] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” Advances in neural information processing systems, vol. 30, 2017.
  • [22] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec, “Open graph benchmark: Datasets for machine learning on graphs,” Advances in neural information processing systems, vol. 33, pp. 22 118–22 133, 2020.
  • [23] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph Attention Networks,” International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=rJXMpikCZ
  • [24] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model cnns,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 5115–5124.
  • [25] X. Wang, M. Zhu, D. Bo, P. Cui, C. Shi, and J. Pei, “Am-gcn: Adaptive multi-channel graph convolutional networks,” in Proceedings of the 26th ACM SIGKDD International conference on knowledge discovery & data mining, 2020, pp. 1243–1253.
  • [26] J. Liu, B. Hooi, K. Kawaguchi, and X. Xiao, “Mgnni: Multiscale graph neural networks with implicit layers,” Advances in Neural Information Processing Systems, vol. 35, pp. 21 358–21 370, 2022.
  • [27] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, “Neural ordinary differential equations,” Advances in neural information processing systems, vol. 31, 2018.
  • [28] P. H. Avelar, A. R. Tavares, M. Gori, and L. C. Lamb, “Discrete and continuous deep residual learning over graphs,” arXiv preprint arXiv:1911.09554, 2019.
  • [29] M. Poli, S. Massaroli, J. Park, A. Yamashita, H. Asama, and J. Park, “Graph neural ordinary differential equations,” arXiv preprint arXiv:1911.07532, 2019.
  • [30] L.-P. Xhonneux, M. Qu, and J. Tang, “Continuous graph neural networks,” in International conference on machine learning.   PMLR, 2020, pp. 10 432–10 441.
  • [31] B. Chamberlain, J. Rowbottom, M. I. Gorinova, M. Bronstein, S. Webb, and E. Rossi, “Grand: Graph neural diffusion,” in International conference on machine learning.   PMLR, 2021, pp. 1407–1418.
  • [32] M. Thorpe, T. Nguyen, H. Xia, T. Strohmer, A. Bertozzi, S. Osher, and B. Wang, “Grand++: Graph neural diffusion with a source term,” ICLR, 2022.
  • [33] Z. Ling, X. Xie, Q. Wang, Z. Zhang, and Z. Lin, “Global convergence of over-parameterized deep equilibrium models,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2023, pp. 767–787.
  • [34] E. Winston and J. Z. Kolter, “Monotone operator equilibrium networks,” Advances in neural information processing systems, vol. 33, pp. 10 718–10 728, 2020.
  • [35] Z. Ling, L. Li, Z. Feng, Y. Zhang, F. Zhou, R. C. Qiu, and Z. Liao, “Deep equilibrium models are almost equivalent to not-so-deep explicit models for high-dimensional Gaussian mixtures,” in Proceedings of the 41st International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 235.   PMLR, 21–27 Jul 2024, pp. 30 585–30 609. [Online]. Available: https://proceedings.mlr.press/v235/ling24a.html
  • [36] Q. Zhao, W. Ren, T. Li, X. Xu, and H. Liu, “Graphgpt: Graph learning with generative pre-trained transformers,” arXiv preprint arXiv:2401.00529, 2023.
  • [37] Z. Huang, S. Bai, and J. Z. Kolter, “(implicit)2(implicit)^{2}: Implicit layers for implicit representations,” Advances in Neural Information Processing Systems, vol. 34, pp. 9639–9650, 2021.
  • [38] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in International conference on machine learning.   PMLR, 2017, pp. 1263–1272.
  • [39] E. K. Ryu and S. Boyd, “Primer on monotone operator methods,” Appl. comput. math, vol. 15, no. 1, pp. 3–43, 2016.
  • [40] L. Jing, Y. Shen, T. Dubcek, J. Peurifoy, S. Skirlo, Y. LeCun, M. Tegmark, and M. Soljačić, “Tunable efficient unitary neural networks (eunn) and their application to rnns,” in International Conference on Machine Learning.   PMLR, 2017, pp. 1733–1741.
  • [41] P. R. Halmos, “A hilbert space problem book,” Graduate Texts in Mathematics, 1982.
  • [42] S. G. Krantz and H. R. Parks, The implicit function theorem: history, theory, and applications.   Springer Science & Business Media, 2002.
  • [43] B. Hui, D. Yan, X. Ma, and W.-S. Ku, “Rethinking graph lottery tickets: Graph sparsity matters,” arXiv preprint arXiv:2305.02190, 2023.
  • [44] J. F. Lutzeyer, C. Wu, and M. Vazirgiannis, “Sparsifying the update step in graph neural networks,” in Topological, Algebraic and Geometric Learning Workshops 2022.   PMLR, 2022, pp. 258–268.
  • [45] S. Zhang, M. Wang, P.-Y. Chen, S. Liu, S. Lu, and M. Liu, “Joint edge-model sparse learning is provably efficient for graph neural networks,” arXiv preprint arXiv:2302.02922, 2023.
  • [46] J. Zhao, X. Wang, C. Shi, B. Hu, G. Song, and Y. Ye, “Heterogeneous graph structure learning for graph neural networks,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, 2021, pp. 4697–4705.
  • [47] M. Poli, S. Massaroli, A. Yamashita, H. Asama, and J. Park, “Hypersolvers: Toward fast continuous-depth models,” Advances in Neural Information Processing Systems, vol. 33, pp. 21 105–21 117, 2020.
  • [48] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in International conference on machine learning.   PMLR, 2019, pp. 6861–6871.
  • [49] J. Gasteiger, A. Bojchevski, and S. Günnemann, “Predict then propagate: Graph neural networks meet personalized pagerank,” in International Conference on Learning Representations, 2018.
  • [50] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, and S. Jegelka, “Representation learning on graphs with jumping knowledge networks,” in International conference on machine learning.   PMLR, 2018, pp. 5453–5462.
  • [51] J. Wu, J. He, and J. Xu, “Demo-net: Degree-specific graph neural networks for node and graph classification,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 406–415. [Online]. Available: https://doi.org/10.1145/3292500.3330950
  • [52] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li, “Simple and deep graph convolutional networks,” in International conference on machine learning.   PMLR, 2020, pp. 1725–1735.
  • [53] S. Luan, C. Hua, Q. Lu, J. Zhu, M. Zhao, S. Zhang, X.-W. Chang, and D. Precup, “Is heterophily a real nightmare for graph neural networks to do node classification?” arXiv preprint arXiv:2109.05641, 2021.
  • [54] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [55] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt, “Efficient graphlet kernels for large graph comparison,” in Artificial intelligence and statistics.   PMLR, 2009, pp. 488–495.
  • [56] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,” Advances in neural information processing systems, vol. 29, 2016.
  • [57] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning architecture for graph classification,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
  • [58] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
  • [59] C. Gallicchio and A. Micheli, “Fast and deep graph neural networks,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 04, 2020, pp. 3898–3905.
  • [60] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting, “Propagation kernels: efficient graph kernels from propagated information,” Machine learning, vol. 102, pp. 209–245, 2016.
  • [61] J. Park, J. Choo, and J. Park, “Convergent graph solvers,” arXiv preprint arXiv:2106.01680, 2021.
  • [62] Q. Chen, Y. Wang, Y. Wang, J. Yang, and Z. Lin, “Optimization-induced graph implicit nonlinear diffusion,” in International Conference on Machine Learning.   PMLR, 2022, pp. 3648–3661.
  • [63] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 156–165.
  • [64] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–decoder approaches,” in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, D. Wu, M. Carpuat, X. Carreras, and E. M. Vecchi, Eds.   Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 103–111. [Online]. Available: https://aclanthology.org/W14-4012
  • [65] S. Yun, M. Jeong, R. Kim, J. Kang, and H. J. Kim, “Graph transformer networks,” in Advances in Neural Information Processing Systems, vol. 32.   Curran Associates, Inc., 2019, pp. 11 960–11 970. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/9d63484abb477c97640154d40595a3bb-Paper.pdf
  • [66] A. Bojchevski and S. Günnemann, “Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking,” in International Conference on Learning Representations, 2018, pp. 1–13.
  • [67] Z. Meng, S. Liang, H. Bao, and X. Zhang, “Co-embedding attributed networks,” in Proceedings of the twelfth ACM international conference on web search and data mining, 2019, pp. 393–401.
  • [68] W. Feng, J. Zhang, Y. Dong, Y. Han, H. Luan, Q. Xu, Q. Yang, E. Kharlamov, and J. Tang, “Graph random neural networks for semi-supervised learning on graphs,” Advances in neural information processing systems, vol. 33, pp. 22 092–22 103, 2020.
  • [69] P. Yanardag and S. Vishwanathan, “Deep graph kernels,” in Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 1365–1374.
  • [70] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A tutorial on the cross-entropy method,” Annals of operations research, vol. 134, pp. 19–67, 2005.

Supplementary Material
IGNN-Solver: A Graph Neural Solver for Implicit Graph Neural Networks

Appendix A Experimental Details

The experiments are conducted on nine public datasets of different scales for node classification, comprising five widely adopted benchmarks (Citeseer, ACM, CoraFull, BlogCatalog, and Flickr) and four large-scale benchmarks (Amazon-all, Reddit, ogbn-arxiv, and ogbn-products), as well as five bioinformatics benchmarks for graph classification: MUTAG, PTC_MR, COX2, PROTEINS, and NCI1. We conduct inference tests, including performance and efficiency comparisons, under nearly identical hyperparameter settings to prior work [3, 2, 1, 25, 12, 23, 48, 49, 50]. All experiments are run independently five times (i.e., with five random seeds) on a machine with an Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz and a single 3090 GPU. A detailed description of all tasks and additional details are provided below.

A-A Datasets

  • Citeseer [12]: This dataset is a citation network of research papers divided into six categories. The network contains 3,327 scientific papers as nodes and 4,732 citation links as edges. Each node feature is a word vector indicating whether the corresponding words appear in the paper.

  • ACM [65]: A citation network dataset in which nodes represent papers and node features are constructed from paper keywords. Papers are divided into three categories according to the type of conference they appear in.

  • CoraFull [66]: Similar to Citeseer, CoraFull is a well-known citation network labeled by paper topic, containing 19,793 scientific publications. Each publication is classified into one of 70 categories; nodes represent papers and edges represent citations.

  • BlogCatalog [67]: This dataset is a social network composed of bloggers and their social relationships (such as friendships). Node attributes are constructed from keywords in user profiles, and the labels represent bloggers’ interests. All nodes are divided into six categories.

  • Flickr [51]: A social network where nodes represent users and edges correspond to friendships among users. All nodes are divided into nine classes according to the users’ interest groups.

  • Amazon-all [20]: The widely used Amazon-all benchmark is the Amazon product co-purchasing network, in which products are nodes and co-purchases are edges. It includes 58 product types, each with over 5,000 products, selected from a total pool of 75,149 product types.

  • Reddit [68]: The Reddit dataset is a social network where nodes represent posts, and an edge indicates that the same user commented on both connected posts. Each node has a 602-dimensional feature vector.

  • ogbn-arxiv [22]: The ogbn-arxiv dataset is a citation network between all Computer Science (CS) arXiv papers. Each node is an arXiv paper, and each directed edge indicates that one paper cites another. Each paper comes with a 128-dimensional feature vector obtained by averaging the embeddings of words in its title and abstract. The task is to predict which of the 40 subject areas each arXiv CS paper belongs to.

  • ogbn-products [22]: The ogbn-products dataset contains an undirected and unweighted graph, representing an Amazon product co-purchasing network. Nodes represent products sold on Amazon, and edges between two products indicate that the products are purchased together. The task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used for target labels.

  • MUTAG [69]: A dataset of nitroaromatic compounds, where the task is to predict the mutagenic effect on a bacterium. It contains 188 graphs with an average of 17.9 nodes per graph.

  • PTC_MR [69]: A dataset of 344 chemical compounds labeled according to their carcinogenicity in male and female rats. Each compound is represented as a graph with atoms as nodes and chemical bonds as edges; node features encode atom types and other chemical properties. The average number of nodes per graph is 14.3, and the task is binary classification of whether a compound is carcinogenic.

  • COX2 [69]: A dataset of 467 cyclooxygenase-2 (COX-2) inhibitors. The dataset contains chemical compounds that inhibit COX-2 enzyme activity, which is important in inflammation and pain. Each compound is represented as a graph where atoms are nodes, and chemical bonds are edges.

  • PROTEINS [69]: This dataset consists of proteins represented as graphs, where nodes are secondary structure elements, and edges indicate neighborhood in the amino-acid sequence or in 3D space. It includes 1,113 graphs with an average of 39.1 nodes per graph.

  • NCI1 [69]: A dataset of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines. It contains 4,110 graphs with an average of 29.8 nodes per graph.

TABLE IV: Hyper-parameter settings for training. *In addition, we linearly increase the loss weight $\lambda_1$ from 0 to its set value throughout all IGNN-Solver training steps.

Scale | Dataset       | nhid | dropout | lr    | K  | epoch_max | training | test      | λ1* | λ2 | λ3
Small | Citeseer      | 128  | 0.5     | 0.002 | 10 | 100       | 360      | 1,000     | 0.1 | 5  | 1e-4
Small | ACM           | 128  | 0.5     | 0.001 | 10 | 100       | 180      | 1,000     | 0.3 | 5  | 1e-4
Small | CoraFull      | 512  | 0.5     | 0.002 | 15 | 300       | 4,200    | 1,000     | 0.3 | 5  | 1e-4
Small | BlogCatalog   | 512  | 0.5     | 0.001 | 10 | 100       | 360      | 1,000     | 0.5 | 5  | 1e-4
Small | Flickr        | 512  | 0.5     | 0.002 | 15 | 200       | 540      | 1,000     | 0.1 | 5  | 1e-5
Large | Amazon-all    | 128  | 0.5     | 0.005 | 20 | 1000      | 16,970   | 28,285    | 0.3 | 5  | 1e-5
Large | Reddit        | 512  | 0.5     | 0.005 | 15 | 2000      | 139,779  | 46,593    | 0.5 | 5  | 1e-5
Large | ogbn-arxiv    | 128  | 0.5     | 0.001 | 20 | 2000      | 90,941   | 48,603    | 0.5 | 5  | 1e-5
Large | ogbn-products | 128  | 0.5     | 0.001 | 20 | 2000      | 196,615  | 2,213,091 | 0.5 | 5  | 1e-5

A-B Hyper-parameters Setting in IGNN-Solver

We present in Table IV all hyper-parameters preset for the IGNN-Solver across all datasets, where $K$ denotes the maximum number of solver iterations in IGNN-Solver and $\text{epoch}_{\text{max}}$ denotes the maximum number of training steps for IGNN. It is worth noting that we increase $\lambda_1$ linearly so that the intermediate estimation error is penalized more heavily at later training steps. Similarly, we decrease $\lambda_3$ linearly to prevent overfitting $\alpha$ during later training epochs.
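A minimal sketch of these training-time schedules is given below; the linear-interpolation endpoints and the helper names are our assumptions, since only the directions of the ramps are specified above.

```python
def lambda1_at(step, total_steps, lambda1_final):
    """Linearly increase lambda_1 from 0 to its preset value over training steps."""
    return lambda1_final * min(step / total_steps, 1.0)

def lambda3_at(step, total_steps, lambda3_initial):
    """Linearly decrease lambda_3 from its preset value over training steps
    (decaying toward 0 at the end is our assumed endpoint)."""
    return lambda3_initial * max(1.0 - step / total_steps, 0.0)
```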

Appendix B Training Strategies

B-A IGNN Training with frozen IGNN-Solver

During training of the IGNN model, we assume that both the initializer $h_{\phi}$ and the neural solver $s_{\xi}$ of the IGNN-Solver have already approached the desired state via the training method outlined in Section IV, and they are no longer updated during the subsequent model training. For a given IGNN layer $f_{\theta}$, together with the input graph $\widehat{{\bm{A}}}$ and feature matrix ${\bm{X}}$, we define:

g_{\theta}({\bm{Z}}^{\star},\widehat{{\bm{A}}},{\bm{X}}) := f_{\theta}({\bm{Z}}^{\star},\widehat{{\bm{A}}},{\bm{X}}) - {\bm{Z}}^{\star} = 0, \quad (17)

and use the IGNN-Solver to solve (17), obtaining the fixed point ${\bm{Z}}^{\star}$. The fixed point ${\bm{Z}}^{\star}$ is then passed through a linear layer to obtain an embedding ${\bm{H}} = \text{Linear}({\bm{Z}}^{\star})$. Finally, jointly with the downstream task, we compute the Binary Cross Entropy (BCE) loss [70] and back-propagate it to update the parameters of $f_{\theta}$ in a supervised way. The complete procedure is given in Algorithm 3.

Algorithm 3 IGNN model training
1: Require: graph matrix $\widehat{{\bm{A}}}$, feature matrix ${\bm{X}}$, frozen IGNN-Solver $s_{\xi}$ and initializer $h_{\phi}$
2: Initialize ${\bm{Z}}^{[0]} = h_{\phi}({\bm{X}})$ and randomly initialize $f_{\theta}$
3: while stopping condition is not met do
4:     ${\bm{Z}}^{\star} \leftarrow$ solve $f_{\theta}({\bm{Z}}^{\star},\widehat{{\bm{A}}},{\bm{X}}) - {\bm{Z}}^{\star} = 0$  ▷ via frozen $s_{\xi}$, $\widehat{{\bm{A}}}$, and initial value ${\bm{Z}}^{[0]}$
5:     ${\bm{H}} \leftarrow \text{Linear}({\bm{Z}}^{\star})$
6:     $\mathcal{L}_{\text{task}} \leftarrow \text{BCE}({\bm{H}}, {\bm{Y}})$  ▷ compute the label loss
7:     Back-propagate $\mathcal{L}_{\text{task}}$ to update $f_{\theta}$
8: end while
9: return $f_{\theta}$
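For readers who prefer code, the sketch below gives a minimal PyTorch-style rendering of Algorithm 3. It is a sketch under several assumptions: `solver.solve`, `ignn_layer.hidden_dim`, and the one-step gradient surrogate are illustrative placeholders, not the interfaces of the released implementation.

```python
import torch
import torch.nn as nn

def train_ignn(ignn_layer, solver, initializer, A_hat, X, Y, epochs=100, lr=1e-3):
    """Sketch of Algorithm 3: train the IGNN layer f_theta with a frozen IGNN-Solver."""
    readout = nn.Linear(ignn_layer.hidden_dim, Y.shape[1])        # H = Linear(Z*)
    optimizer = torch.optim.Adam(
        list(ignn_layer.parameters()) + list(readout.parameters()), lr=lr)
    criterion = nn.BCEWithLogitsLoss()                            # BCE label loss

    with torch.no_grad():
        Z0 = initializer(X)                                       # Z^[0] = h_phi(X), frozen
    for _ in range(epochs):
        # Solve f_theta(Z*, A_hat, X) - Z* = 0 with the frozen neural solver.
        with torch.no_grad():
            Z_star = solver.solve(lambda Z: ignn_layer(Z, A_hat, X), Z0)
        # One extra application of f_theta re-attaches the equilibrium to the
        # computation graph (a simplified surrogate for the exact implicit gradient).
        Z_star = ignn_layer(Z_star.detach(), A_hat, X)
        loss = criterion(readout(Z_star), Y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return ignn_layer, readout
```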

B-B Advanced IGNN Training Strategy with IGNN-Solver

Given that the model parameters $\theta$ change only gradually over the training iterations (i.e., we assume $\left\|\theta_{t+1}-\theta_{t}\right\|$ depends only on $\theta_{t}$ at training step $t$), we propose training the lightweight IGNN-Solver ($s_{\xi}$, $h_{\phi}$) and the large model $f_{\theta}$ in an alternating manner. Specifically, we adopt the following procedure, with a code sketch given after the list:

(i) Warm up and train the IGNN model and its solver (the IGNN-Solver) for a number of steps.

(ii) Fix the IGNN-Solver $s_{\xi}$, $h_{\phi}$ and solve for the fixed points of $f_{\theta}$ in IGNN via Algorithm 3 for $T_{1}$ steps.

(iii) Fix the current model parameters $\theta$ and fine-tune the IGNN-Solver $s_{\xi}$, $h_{\phi}$ via Algorithm 2 for $T_{2}$ steps.

(iv) Repeat steps (ii) and (iii) until the maximum number of training steps for the IGNN model is reached.

Here, $T_{1}$ is approximately 4% to 10% of $T_{2}$, adjusted as needed, and the sum of all $T_{2}$ values is referred to as $\text{epoch}_{\text{max}}$. Reassuringly, the additional cost of fine-tuning the IGNN-Solver (step (iii)) is more than offset by the substantial benefit of accelerated solving (step (ii)). Moreover, we find that the proposed IGNN-Solver does not lose expressive capability for excessively small or large values of $T_{1}$: thanks to its robust stability, setting $T_{1}$ somewhat too high does not noticeably affect the normal training of the IGNN, while setting it slightly too low still yields good generalization. Thus, the IGNN-Solver shows low sensitivity to $T_{1}$.
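The alternating schedule can be summarized with the following minimal sketch. The helpers `train_ignn_steps` (Algorithm 3) and `finetune_solver_steps` (Algorithm 2) are hypothetical names, the warm-up phase is omitted, and the stopping rule follows the statement above that $\text{epoch}_{\text{max}}$ is the sum of all $T_{2}$ values.

```python
def alternating_training(ignn_layer, solver, initializer, data, T1, T2, epoch_max):
    """Alternate IGNN training (Algorithm 3) with solver fine-tuning (Algorithm 2)."""
    solver_epochs = 0
    while solver_epochs < epoch_max:
        # (ii) Freeze the solver and train the IGNN layer f_theta for T1 steps.
        train_ignn_steps(ignn_layer, solver, initializer, data, steps=T1)
        # (iii) Freeze f_theta and fine-tune the solver (and initializer) for T2 steps.
        finetune_solver_steps(solver, initializer, ignn_layer, data, steps=T2)
        # Per the paper, epoch_max is defined as the sum of all T2 values.
        solver_epochs += T2
    return ignn_layer, solver
```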

Appendix C More Experiments on IGNN-Solver

C-A Training Dynamics

In this section, we present the time-cost trends of IGNN and IGNN-Solver during training and inference on the Amazon-all dataset. From Figure 8, it is evident that IGNN is relatively fast at the beginning of training. However, as training progresses, the model becomes increasingly complex, leading to a sharp rise in the time cost of fixed-point computation. After 100 epochs, this cost remains persistently high and fluctuates continuously, because IGNN reaches the maximum number of iterations without achieving the preset minimum error.

Figure 8: Training time and inference time (s) per epoch of IGNN and IGNN-Solver on the Amazon-all dataset.

In contrast, our IGNN-Solver incurs significantly lower training and inference time than IGNN and remains stable throughout the process. This is attributed to the solver’s consistent memory consumption and its rapid fixed-point computation, which together constitute the unique advantage of IGNN-Solver.

C-B Long-Range Dependencies (LRD)

To evaluate the capability of GNNs to capture long-range dependencies within graphs, we follow the methodology of [1] and create a synthetic dataset called Chains, in which the objective is to classify nodes positioned on a chain of length $l$. The class information is provided sparsely, appearing only as a feature on one end node of the chain. Our experimental setup consists of small datasets: a training set with 20 nodes, a validation set with 100 nodes, and a test set with 200 nodes. For simplicity, we focus on binary classification. Following the experimental protocol of IGNN, we implement and evaluate four representative baseline models.
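For reproducibility, the sketch below shows one way to construct a single chain from this synthetic task, following the description above; the feature dimension and the exact feature encoding are our illustrative assumptions, not necessarily those used in [1].

```python
import numpy as np

def make_chain(length, label, feat_dim=100):
    """Build one chain graph for the synthetic Chains task.
    Class information appears only in the features of one end node; all other
    node features are zero, so correct predictions require propagating
    information along the whole chain."""
    adj = np.zeros((length, length))
    for i in range(length - 1):            # undirected edges i -- i+1
        adj[i, i + 1] = adj[i + 1, i] = 1.0
    feats = np.zeros((length, feat_dim))
    feats[0, label] = 1.0                  # class label encoded at one end node
    labels = np.full(length, label)        # all nodes in the chain share the label
    return adj, feats, labels

# Example: one chain of 10 nodes, all belonging to class 1.
A, X, y = make_chain(length=10, label=1)
```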

Figure 9: Average accuracy with respect to the chain length.

As demonstrated in Figure 9, both IGNN-Solver and IGNN effectively capture long-range dependencies on longer chains, whereas the performance of the graph transformer (GTN) [65] drops slightly once the chain length exceeds 6. In contrast, convolutional GNNs with a fixed number of iterations ($T=2$), such as GCN and GAT, struggle to make meaningful predictions as the chain length grows, owing to their limited capacity to capture dependencies beyond two hops. Interestingly, increasing the number of iterations $T$ for these models does not yield a significant improvement in this setting.