
IGNN-Solver: A Graph Neural Solver for
Implicit Graph Neural Networks

Junchao Lin, Zenan Ling, Zhanbo Feng, Jingwen Xu, Minxuan Liao, Feng Zhou, Tianqi Hou, Zhenyu Liao, and Robert C. Qiu. Manuscript created on February 17, 2025. The work of Zenan Ling is supported in part by the National Natural Science Foundation of China (via NSFC-62406119), the Natural Science Foundation of Hubei Province (2024AFB074), and the Guangdong Provincial Key Laboratory of Mathematical Foundations for Artificial Intelligence (2023B1212010001). The work of Zhenyu Liao is supported in part by the National Natural Science Foundation of China (via NSFC-62206101). The work of Robert Caiming Qiu is supported in part by the National Natural Science Foundation of China (via NSFC-12141107) and in part by the Key Research and Development Program of Guangxi (GuiKe-AB21196034). (Corresponding author: Zenan Ling.) Junchao Lin, Zenan Ling, Minxuan Liao, Zhenyu Liao, and Robert C. Qiu are with the School of EIC, Huazhong University of Science and Technology. Zhanbo Feng is with the Department of Computer Science and Engineering, Shanghai Jiao Tong University. Feng Zhou is with the Center for Applied Statistics and School of Statistics, Renmin University of China. Tianqi Hou is with the Lagrange Mathematics and Computation Research Center of Huawei. Jingwen Xu is with the School of Science, Wuhan University of Technology.
Abstract

Implicit graph neural networks (IGNNs), which exhibit strong expressive power with a single layer, have recently demonstrated remarkable performance in capturing long-range dependencies (LRD) in underlying graphs while effectively mitigating the over-smoothing problem. However, IGNNs rely on computationally expensive fixed-point iterations, which lead to significant speed and scalability limitations, hindering their application to large-scale graphs. To achieve fast fixed-point solving for IGNNs, we propose a novel graph neural solver, IGNN-Solver, which leverages the generalized Anderson Acceleration method, parameterized by a tiny GNN, and learns iterative updates as a graph-dependent temporal process. To improve effectiveness on large-scale graph tasks, we further integrate sparsification and storage compression methods, specifically tailored for the IGNN-Solver, into its design. Extensive experiments demonstrate that the IGNN-Solver significantly accelerates inference on both small- and large-scale tasks, achieving a $1.5\times$ to $8\times$ speedup without sacrificing accuracy. This advantage becomes more pronounced as the graph scale grows, facilitating its large-scale deployment in real-world applications. The code to reproduce our results is available at https://github.com/landrarwolf/IGNN-Solver.

Index Terms:
Implicit graph neural networks, large-scale graphs, fixed-point solver, deep equilibrium models.

I Introduction

Implicit graph neural networks (IGNNs) [1, 2, 3] have emerged as a significant advancement in graph learning frameworks. Unlike traditional graph neural networks (GNNs) that stack multiple explicit layers, IGNNs utilize a single implicit layer formulated as a fixed-point equation. The solution to this fixed-point equation, known as the equilibrium, is equivalent to the output obtained by iterating an explicit layer infinitely. This allows the implicit layer to access infinite hops of neighbors, providing IGNNs with global receptive fields within just one layer [4]. As a result, IGNNs effectively address the long-standing issue of over-smoothing in conventional explicit GNNs and capture long-range dependencies in graph-structured data [5, 6, 7].

However, existing IGNNs suffer from slow speeds and have difficulty scaling to large-scale graphs [8, 9, 10, 11]. This is primarily because IGNNs derive features by solving for fixed points, demanding substantial computational resources. For example, even on the Citeseer dataset [12] classification task with a small-scale graph, IGNNs require more than 20 forward iterative computations to nearly converge to a fixed point [1]. The computational overhead of solving fixed points is amplified by the task scale, resulting in notably slow inference speeds compared to explicit GNNs. This substantial drawback poses challenges for IGNNs in generalizing to large-scale graphs in practical scenarios.

In response to this challenge, we propose a novel graph neural solver for IGNNs, termed IGNN-Solver. It takes the IGNN layer as input, which includes the graph information matrix and layer parameters, and outputs the solution to the corresponding fixed-point equation. Unlike conventional solvers, which compute output features through iterative forward passes in a potentially large GNN (relying on root-finding algorithms like Broyden’s method [13]), the IGNN-Solver predicts the next iteration step using a tiny graph network. Consequently, the proposed IGNN-Solver remarkably accelerates the model’s inference speed without compromising accuracy and with only a slight increase in training overhead. This advantage becomes increasingly pronounced as the scale of the graph grows, making it particularly beneficial for deploying IGNNs in large-scale graph tasks.

Our IGNN-Solver comprises two components. First, we introduce a learnable initializer that estimates an optimal initial point for the optimization process. Second, we propose a generalized version of Anderson Acceleration (AA) [13], employing a tiny graph network to model the iterative updates as a sequence of graph-dependent steps. Compared to the solvers proposed for conventional Implicit Neural Networks (INNs) [14, 15, 16], we introduce novel improvements: learning solver parameters through a GNN-based method. This approach circumvents the potential loss of graph information, thereby improving the model’s performance. Moreover, IGNN-Solver has significantly fewer parameters compared to IGNN, and its training is independent of the IGNN’s inference. Consequently, the training of IGNN-Solver proceeds rapidly without sacrificing generalization.

Moreover, to address the challenges posed by large-scale graphs, we incorporate both graph sparsification and storage compression in the solver design. Specifically, to prevent the network from becoming excessively large and deviating from the initial goal of using a compact network for prediction [17, 18], we utilize graph sparsification to obtain a lighter graph. Additionally, since directly mapping data from a high to an extremely low dimension is not appropriate [19], we employ a storage compression technique to mitigate this issue.

In our experiments, we apply IGNN-Solver to nine real-world datasets from diverse domains and scales, including four large-scale datasets: Amazon-all [20], Reddit [21], ogbn-arxiv [22] and ogbn-products [22]. Our results demonstrate that the IGNN-Solver achieves higher accuracy with reduced inference time, showing a 1.5× to 8× speedup, and incurs minimal additional training overhead, constituting only 1% of the IGNN training time.

Our main contributions are summarized as follows:

  • We introduce IGNN-Solver, a method designed to predict the next fixed-point iteration of the IGNN layer. This innovative approach mitigates the need for extensive iterations, common in vanilla IGNNs, thereby substantially accelerating inference while preserving accuracy and minimizing parameter consumption.

  • Compared to conventional solvers proposed for INNs, our solver introduces a novel improvement by incorporating a tiny GNN to learn parameters. This approach not only preserves graph information but also maintains the simplicity and lightweight characteristics of neural solvers.

  • For large-scale graph tasks, we integrate custom-designed sparsification and storage compression methods into the solver’s architecture. These tailored techniques enable efficient processing of large-scale graphs, improving performance and reducing overhead.

  • We validate our approach through extensive experiments on nine real-world datasets, including four large-scale ones. Our results demonstrate that IGNN-Solver incurs minimal computational overhead (approximately 1% of the total) and achieves up to an 8-fold increase in inference speed without compromising accuracy.

II Related Works

The typical GNN [12] and its variants [23, 21] have been widely used for graph data modeling in various tasks. Different GNNs have been proposed to utilize attention mechanisms [23], neighbor sampling [21], pseudo-coordinates [24] and graph fusion [25]. However, due to issues such as over-smoothing, depth, and bottlenecks, these models typically involve a finite number of aggregation layers [5, 6, 7]. To address these issues, recent works [1, 2, 3] have developed implicit graph neural networks, encouraging these models to capture long-range dependencies on graphs. Here, we highlight the contributions of our proposed IGNN-Solver through a detailed comparison with Implicit Graph Neural Networks (IGNNs) and Deep Equilibrium Models (DEQs), both of which are closely related to our approach.

II-A Implicit Graph Neural Networks

Instead of stacking a series of operators hierarchically, implicit GNNs define their outputs as solutions to nonlinear dynamical systems, an approach initially introduced by [1] to tackle challenges associated with learning long-range dependencies in graphs. [8] proposes a new implicit graph model enabling mini-batch training without sacrificing the ability to capture long-range dependencies. Subsequently, [26] introduces a novel approach based on implicit layers to model multi-scale structures on graphs. Additionally, [3] theoretically investigates the well-posedness of the IGNN model from a monotone operator viewpoint.

Although the aforementioned IGNNs alleviate the over-smoothing of features by allowing meaningful fixed points to propagate implicitly, the inherently slow inference of implicit networks poses a major obstacle to their scalability. The main reason is that the fixed-point solvers they rely on are inefficient (e.g., the Picard iteration used by IGNN [1] and the Anderson acceleration solver used by MIGNN [3]), so the overhead of the fixed-point solver is magnified by the task scale. In comparison, our proposed IGNN-Solver accelerates the iteration process of IGNNs, addressing their most limiting drawback compared to traditional feedforward models.

Another class of IGNNs based on Neural ODEs [27] has emerged to address issues like depth and bottlenecks. For example, [28] models continuous residual layers using GCNs. [29] proposes methods for modeling static and dynamic graphs with GCNs, along with a hybrid approach where latent states evolve continuously between RNN steps in dynamic graphs. [30] tackles the problem of continuous message passing. [31] introduces a graph neural diffusion network based on the discretization of diffusion PDEs on graphs. [32] enhances graph neural diffusion with a source term and connects the model to a random walk formulation on graphs. However, these methods essentially view deep learning on graphs as a continuous diffusion process, which differs from IGNNs based on fixed-point networks, and thus require manual tuning of the termination time and step size of the diffusion equation. In contrast, our IGNN-Solver uses implicit formulations instead of explicit diffusion discretization, which admits an equilibrium corresponding to infinite diffusion steps and expands the receptive field.

II-B Fixed-point Solvers for Deep Equilibrium Models

Traditional deep learning models use multi-layered networks, while DEQs [14] find a fixed-point of a single layer, representing the network’s equilibrium state. Viewed as infinitely deep networks, DEQs use root-finding for forward propagation and implicit differentiation for backward propagation. Therefore, DEQs capture complex patterns with constant memory and computational costs [33]. Monotone operator theory has been used to guarantee the convergence of DEQs [34] and to improve the stability of implicit networks [16].

However, it is well-known that DEQs, as typical implicit models, suffer from slow training and inference speeds [35, 8], which is highly disadvantageous for large-scale scenarios like GraphGPT [36]. To alleviate this issue, recent efforts have explored certain improved solver methods for DEQs, further optimizing root-finding problems and making these models easier to solve. [37] discusses the amortization of the cost of the iterative solver that would otherwise make implicit models slow. [9] proposes a novel gradient approximation method for implicit models, which significantly accelerates the backward passes in training implicit models. [16] discusses the superiority of learnable solvers over generic solvers in implicit models using a tiny neural network and significantly enhances the efficiency of such models through custom neural solvers.

IGNN has a formulation similar to that of DEQ, as both define the output as the solution of an equilibrium equation. The difference is that IGNN's equilibrium equation encodes the graph structure while DEQ's does not, which further aggravates the slow inference and makes IGNN's solution even more expensive. Motivated by this insight, our proposed IGNN-Solver leverages graph information to guide solver acceleration, achieving fast and meaningful implicit graph network propagation, especially when the graph data is large-scale.

III Preliminaries

III-A Explicit GNNs

Let $\mathcal{G}=\{{\bm{A}},{\bm{X}}\}$ represent an undirected graph, where ${\bm{A}}\in\mathbb{R}^{n\times n}$ is the adjacency matrix indicating the relationships between the nodes in $\mathcal{V}=\{{\bm{v}}_{1},{\bm{v}}_{2},\ldots,{\bm{v}}_{n}\}$, and $n$ is the number of nodes. The node feature matrix ${\bm{X}}=[{\bm{x}}_{1},{\bm{x}}_{2},\ldots,{\bm{x}}_{n}]\in\mathbb{R}^{n\times d}$ contains the features of the nodes, with $d$ representing the feature dimension. Conventional (explicit) GNNs [12, 23] feature a learnable aggregation process centered on the message-passing operation within the graph [38]. This process iteratively propagates information from each node to its neighboring nodes. The formal general structure for each layer $l$ is defined as follows:

${\bm{H}}^{[l+1]}=f_{\theta}({\bm{A}},{\bm{H}}^{[l]}),\quad{\bm{H}}^{[0]}={\bm{X}},$ (1)

where ${\bm{H}}^{[l]}$ represents the hidden node representation and $f_{\theta}$ denotes the parameterized aggregation of the $l$-th layer. A commonly used GNN is the Graph Convolutional Network (GCN) [12], defined as ${\bm{H}}^{[l+1]}=\sigma(\widehat{{\bm{A}}}{\bm{H}}^{[l]}{\bm{W}}^{[l]})$, where ${\bm{W}}^{[l]}$ denotes the weight matrix of the $l$-th layer, $\sigma(\cdot)$ denotes the activation function, and $\widehat{{\bm{A}}}=\tilde{{\bm{D}}}^{-1/2}({\bm{A}}+{\bm{I}})\tilde{{\bm{D}}}^{-1/2}$ represents the symmetrically normalized graph matrix. $\tilde{{\bm{D}}}$ is a diagonal matrix with $\tilde{{\bm{D}}}_{ii}=1+\sum_{j}{\bm{A}}_{ij}$. GNNs leverage the message-passing operation in (1) to learn useful information. Still, they often involve a limited number of layers $l$ due to over-smoothing [6], making it challenging for GNNs to capture long-range dependencies (LRD) on graphs.
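For concreteness, a minimal PyTorch sketch of a dense GCN layer implementing $\sigma(\widehat{{\bm{A}}}{\bm{H}}^{[l]}{\bm{W}}^{[l]})$ and of the normalization of $\widehat{{\bm{A}}}$ is given below; the class and function names are illustrative and not taken from any released code.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One dense GCN layer: H_next = relu(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, A_hat, H):
        # A_hat: (n, n) symmetrically normalized adjacency; H: (n, d) node features
        return torch.relu(self.weight(A_hat @ H))

def normalize_adjacency(A):
    # A_hat = D~^{-1/2} (A + I) D~^{-1/2}, with D~_ii = 1 + sum_j A_ij
    A_tilde = A + torch.eye(A.size(0))
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
```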

III-B Implicit GNNs

III-B1 Architecture

Similar to traditional explicit GNNs, IGNNs [1, 2, 3, 4] also involve an aggregation process, with the distinction that the depth of layers (iteration step $k$) is infinite. The aggregation process in IGNNs is typically defined as

${\bm{Z}}^{[k+1]}=\sigma({\bm{W}}{\bm{Z}}^{[k]}\widehat{{\bm{A}}}+b_{\Omega}({\bm{X}})),\quad k=1,2,\ldots,\infty,$ (2)

where $b_{\Omega}$ represents an affine transformation parameterized by $\bm{\Omega}$, and the weight matrices ${\bm{W}}$ and $\bm{\Omega}$ are globally shared at each iteration step. The IGNN model $f_{\theta}$ is formally described by

${\bm{Z}}^{\star}=f_{\theta}({\bm{Z}}^{\star},\widehat{{\bm{A}}},{\bm{X}})=\sigma({\bm{W}}{\bm{Z}}^{\star}\widehat{{\bm{A}}}+b_{\Omega}({\bm{X}})),$ (3)

where the representation, given as the “internal state” ${\bm{Z}}^{\star}$, is obtained as the fixed-point solution of the equilibrium equation (2). The final representation theoretically encompasses information from all neighbors in the graph. Therefore, IGNNs capture LRD within the graph, offering better performance compared to GNNs with finite iterations [1, 2]. Another notable advantage of this framework is its memory efficiency, as it only needs to retain the current state ${\bm{Z}}$ without requiring additional intermediate representations.
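As an illustration of the iteration in (2), the following hedged sketch runs a plain Picard iteration of the IGNN layer until the relative change falls below a tolerance; tensor shapes follow the notation above (${\bm{Z}}\in\mathbb{R}^{m\times n}$) and all names are illustrative.

```python
import torch

def ignn_fixed_point(W, A_hat, b_X, sigma=torch.relu, tol=1e-5, max_iter=300):
    """Plain Picard iteration for Eqs. (2)-(3): Z = sigma(W Z A_hat + b_Omega(X)).
    b_X is the precomputed affine term b_Omega(X) of shape (m, n)."""
    Z = torch.zeros_like(b_X)
    for k in range(max_iter):
        Z_next = sigma(W @ Z @ A_hat + b_X)
        if torch.norm(Z_next - Z) / (torch.norm(Z_next) + 1e-12) < tol:
            return Z_next, k + 1
        Z = Z_next
    return Z, max_iter
```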

III-B2 Well-posedness

Notably, to obtain a valid representation for any given input ${\bm{X}}$ from IGNNs, we need to solve for the unique internal state ${\bm{Z}}^{\star}$ in (3). However, a well-defined solution ${\bm{Z}}^{\star}$ may not always exist for some inputs ${\bm{X}}$ [1]. To guarantee the existence and uniqueness of the solution to (3), we introduce the concept of well-posedness for IGNNs $f_{\theta}$ with activation function $\sigma$ [1]. Specifically, the weight matrix ${\bm{W}}\in\mathbb{R}^{m\times m}$ and the adjacency matrix $\widehat{{\bm{A}}}\in\mathbb{R}^{n\times n}$ are said to be well-posed for $\sigma$ if the solution ${\bm{Z}}^{\star}\in\mathbb{R}^{m\times n}$ of (3) exists and is unique for any $b_{\Omega}({\bm{X}})\in\mathbb{R}^{m\times n}$.

Based on the monotone operator theorem [39, 34], a sufficient condition for the well-posedness of IGNNs is stated as follows. Let the nonlinearity $\sigma$ be a non-expansive activation function (e.g., Sigmoid, Tanh, ReLU, etc.), and let $\widehat{{\bm{W}}}=\widehat{{\bm{A}}}^{\top}\otimes{\bm{W}}$, where $\otimes$ denotes the Kronecker product. Then, the IGNN model defined in (3) is well-posed as long as ${\bm{I}}-\widehat{{\bm{W}}}\succeq m{\bm{I}}$ for some $m>0$. Here, ${\bm{A}}\succeq{\bm{B}}$ means that ${\bm{A}}-{\bm{B}}$ is positive semi-definite; for a non-symmetric matrix $\widehat{{\bm{W}}}$, positive definiteness is defined via the symmetric part, i.e., ${\bm{I}}-\widehat{{\bm{W}}}\succeq m{\bm{I}}$ if and only if ${\bm{I}}-\frac{\widehat{{\bm{W}}}+\widehat{{\bm{W}}}^{\top}}{2}\succeq m{\bm{I}}$.

This establishes that if $\widehat{{\bm{W}}}$ is symmetric and ${\bm{I}}-\widehat{{\bm{W}}}\succeq m{\bm{I}}$ for some $m>0$, then the existence and uniqueness of the equilibrium point ${\bm{Z}}^{\star}$ are guaranteed, ensuring the model is well-posed. This follows from the sufficient condition that ${\bm{I}}-\widehat{{\bm{W}}}$ is strongly monotone.
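The sufficient condition can be checked numerically on small examples by examining the smallest eigenvalue of the symmetric part of ${\bm{I}}-\widehat{{\bm{W}}}$; the sketch below assumes dense matrices and forms the Kronecker product explicitly, which is only feasible for toy sizes.

```python
import torch

def is_well_posed(W, A_hat, margin=1e-3):
    # Check the sufficient condition I - W_hat >= margin * I, with W_hat = A_hat^T (x) W,
    # via the smallest eigenvalue of the symmetric part. The dense Kronecker product
    # below is only meant for small, illustrative sizes.
    W_hat = torch.kron(A_hat.T.contiguous(), W)
    M = torch.eye(W_hat.size(0)) - 0.5 * (W_hat + W_hat.T)
    return torch.linalg.eigvalsh(M).min().item() >= margin
```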

To satisfy the well-posedness of IGNNs, a commonly used method [40, 3] is to orthogonally reparameterize ${\bm{W}}$ in the IGNN layer $f_{\theta}$ using the following Cayley map, inspired by unitary operators [41]:

${\bm{W}}=\kappa({\bm{I}}-{\bm{S}})({\bm{I}}+{\bm{S}})^{-1},$ (4)

where $\kappa\in(0,1)$ and ${\bm{S}}={\bm{C}}-{\bm{C}}^{\top}$ is a skew-symmetric matrix with ${\bm{C}}$ an arbitrary parameterized matrix. From (4), it follows that ${\bm{W}}$ is positive definite and its real eigenvalues lie in $(0,1)$. Moreover, based on the definition of $\widehat{{\bm{A}}}$, the eigenvalues of $\widehat{{\bm{W}}}$ also lie in $(0,1)$, which ensures that ${\bm{I}}-\widehat{{\bm{W}}}\succeq m{\bm{I}}$, thereby guaranteeing the well-posedness of IGNN. We refer interested readers to [3] for a detailed proof.
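A minimal sketch of the Cayley-map reparameterization in (4) is given below; the default value of $\kappa$ is an illustrative choice.

```python
import torch

def cayley_reparameterize(C, kappa=0.95):
    """Build W = kappa (I - S)(I + S)^{-1} with S = C - C^T skew-symmetric, as in Eq. (4).
    The Cayley transform of a skew-symmetric matrix is orthogonal, so scaling by
    kappa in (0, 1) keeps the spectrum of W strictly inside the unit disk."""
    S = C - C.T
    I = torch.eye(C.size(0), dtype=C.dtype, device=C.device)
    return kappa * (I - S) @ torch.linalg.inv(I + S)
```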

III-C Training of IGNNs

Additionally, to train the parameters $\bm{\theta}=\{{\bm{W}},\bm{\Omega}\}$ during back-propagation, IGNNs compute the gradient $\nabla_{\bm{Z}}\mathcal{L}$ by solving an equilibrium equation derived from implicit differentiation and the implicit function theorem [42]:

$\nabla_{\bm{Z}}\mathcal{L}={\bm{D}}\odot({\bm{W}}^{\top}\nabla_{\bm{Z}}\mathcal{L}\,\widehat{{\bm{A}}}^{\top}+\nabla_{\bm{X}}\mathcal{L}),$ (5)

where ${\bm{D}}=\sigma^{\prime}({\bm{W}}{\bm{Z}}^{\star}\widehat{{\bm{A}}}+b_{\Omega}({\bm{X}}))$ and $\sigma^{\prime}(\cdot)$ refers to the element-wise derivative of the map $\sigma$. Once $\nabla_{\bm{Z}}\mathcal{L}$ is obtained, $\nabla_{\bm{W}}\mathcal{L}$ and $\nabla_{\bm{\Omega}}\mathcal{L}$ are derived via the chain rule, eliminating the need to store activations at each layer during gradient updates.
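Since (5) is itself a fixed-point equation in $\nabla_{\bm{Z}}\mathcal{L}$, it can be solved by the same kind of iteration as the forward pass. The sketch below runs a plain Picard iteration on (5); `grad_out` stands in for the incoming gradient term (written $\nabla_{\bm{X}}\mathcal{L}$ in (5)), and all names are illustrative.

```python
import torch

def ignn_backward_gradient(W, A_hat, D, grad_out, tol=1e-5, max_iter=300):
    """Picard iteration for the linear fixed point in Eq. (5):
    G = D * (W^T G A_hat^T + grad_out), where D = sigma'(.) at the equilibrium."""
    G = torch.zeros_like(grad_out)
    for _ in range(max_iter):
        G_next = D * (W.T @ G @ A_hat.T + grad_out)
        if torch.norm(G_next - G) / (torch.norm(G_next) + 1e-12) < tol:
            return G_next
        G = G_next
    return G
```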

However, IGNNs typically require multiple iterations to reach equilibrium, both in finding the fixed point during the forward process (2) and in solving the implicit differential during the backward process (5), resulting in significant overhead during both training and inference. Therefore, developing a fast and efficient solver to find their fixed points is crucial.

III-D Traditional Anderson Acceleration Solver

Algorithm 1 Anderson Acceleration Solver
1: fixed-point function $f_{\theta}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$, max storage size $m$, residual control parameter $\beta$
2: Set the initial point ${\bm{Z}}^{[0]}\in\mathbb{R}^{n}$ \triangleright normally set to $\mathbf{0}$ or a random value
3: for $k=0,\ldots,K-1$ do
4:  1) Set $m_{k}=\min\{m,k\}$ \triangleright storage of the most recent steps
5:  2) Solve $\alpha^{[k]}=\arg\min_{\alpha\in\mathbb{R}^{m_{k}+1}}\|{\bm{G}}^{[k]}\alpha\|_{2}$, s.t. $\mathbf{1}^{\top}\alpha^{[k]}=1$ \triangleright compute weights
6:  3) ${\bm{Z}}^{[k+1]}=\beta\sum_{i=0}^{m_{k}}\alpha_{i}^{[k]}f_{\theta}({\bm{Z}}^{[k-m_{k}+i]})+(1-\beta)\sum_{i=0}^{m_{k}}\alpha_{i}^{[k]}{\bm{Z}}^{[k-m_{k}+i]}$ \triangleright Anderson step
7: end for

Since implicit networks typically require multiple iterations to reach equilibrium, a commonly used solver is Anderson Acceleration [13], which is widely applied in [3, 15, 14, 16]. Specifically, given an implicit network layer $f_{\theta}$ and input ${\bm{X}}$, we aim to find the fixed point ${\bm{Z}}^{\star}$ representing the equilibrium state of the network output by solving the root-finding problem

$g_{\theta}({\bm{Z}}^{\star},{\bm{X}}):=f_{\theta}({\bm{Z}}^{\star},{\bm{X}})-{\bm{Z}}^{\star}=0,$ (6)

via fixed-point solvers such as the classical Anderson Acceleration (AA) [13], which can directly seek the equilibrium point ${\bm{Z}}^{\star}$ through quasi-Newton methods. This approach demonstrates super-linear convergence properties.

We briefly introduce the AA solver, as our approach is inspired by it. Algorithm 1 provides pseudocode, where ${\bm{G}}^{[k]}=[g_{\theta}({\bm{z}}^{[k-m_{k}]}),\ldots,g_{\theta}({\bm{z}}^{[k]})]$ collects the past residuals. The key idea behind AA is to accelerate the next approximate solution by leveraging information stored from the past $m$ iteration steps, constructing a normalized linear combination of past updates with weights $\alpha^{[k]}$ and $\beta$. These weights are computed greedily to minimize the norm of the linear combination of past residuals. This approach is typically effective for accelerating the solution of implicit functions.

Notably, because the computation relies only on the final output and does not require storing any intermediate activation values [14], the memory consumption during training remains constant, which is a key reason why fixed-point solvers are attractive.
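For reference, a compact and hedged PyTorch sketch of Algorithm 1 is shown below; it flattens the state into a vector and solves the constrained least-squares step by the standard difference substitution, so it is a generic AA implementation rather than the exact routine used in prior work.

```python
import torch

def anderson_solver(f, Z0, m=5, beta=1.0, max_iter=50, tol=1e-5):
    """Classical Anderson Acceleration for Z = f(Z) on a flattened state.
    The constrained step min ||G alpha|| s.t. 1^T alpha = 1 is solved by
    substituting alpha_0 = 1 - sum(gamma) and minimizing over gamma."""
    z0 = Z0.reshape(-1)
    f0 = f(Z0).reshape(-1)
    Zs, Fs = [z0, f0], [f0, f(f0.reshape(Z0.shape)).reshape(-1)]
    for k in range(1, max_iter):
        mk = min(m, k)
        Zw = torch.stack(Zs[-(mk + 1):], dim=1)   # (n, m_k+1) past iterates, oldest first
        Fw = torch.stack(Fs[-(mk + 1):], dim=1)   # (n, m_k+1) their images under f
        G = Fw - Zw                               # residuals g_i = f(Z_i) - Z_i
        dG = G[:, 1:] - G[:, :1]                  # differences against the oldest residual
        gamma = torch.linalg.lstsq(dG, -G[:, :1]).solution.squeeze(1)
        alpha = torch.cat([(1 - gamma.sum()).reshape(1), gamma])   # enforces 1^T alpha = 1
        z_next = beta * (Fw @ alpha) + (1 - beta) * (Zw @ alpha)   # Anderson step
        f_next = f(z_next.reshape(Z0.shape)).reshape(-1)
        Zs.append(z_next)
        Fs.append(f_next)
        if torch.norm(f_next - z_next) / (torch.norm(z_next) + 1e-12) < tol:
            break
    return Zs[-1].reshape(Z0.shape)
```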

Figure 1: Comparison of our IGNN-Solver and the generic Anderson solver at each iteration. (a) The canonical Anderson solver performs local least-squares fitting at each iteration, with $\beta^{[k]}=\beta$ fixed to a constant. (b) Our learnable IGNN-Solver incorporates learnable iterative updates for enhanced convergence via a tiny GNN.

IV Our Proposed IGNN-Solver

Although traditional fixed-point solvers for IGNNs demonstrate functionality, they are characterized by relatively slow performance, necessitate manual parameter tuning, and are not well-suited for graph-based applications. To address these limitations, we propose a lightweight, learnable, and content-aware fixed-point solver for IGNN that incorporates a tiny graph neural network, combining the speed advantages of the solver with the information benefits of graph learning. By encoding the relationship between past residuals and the parameters $\alpha^{[k]}$ and $\beta$, it enables more efficient adjustments to enhance the learning capabilities of the IGNN-Solver, as shown in Figure 1.

Given the implicit model’s characteristic that its representation capability is decoupled from the forward computation (i.e., the solver has no knowledge of the task, such as node classification, and vice versa), we can train this neural solver in a lightweight, unsupervised manner. Once learned, the IGNN-Solver can solve the fixed-point equation within the IGNN layer, facilitating further model updates. Note that our solver does not alter the original structure of the model but only accelerates the solution process. Details of the training strategy involving the IGNN-Solver are provided in Appendix B.

Algorithm 2 IGNN-Solver Iterations and Training
1: frozen IGNN model $f_{\theta}$, initializer $h_{\phi}$, tiny-GNN predictor $s_{\xi}$, sparse subgraph $\widehat{{\bm{A}}}_{s}$, storage ${\bm{G}}\in\mathbb{R}^{(m+1)\times d^{\prime}}$
2: Compute the initial fixed-point estimate ${\bm{Z}}^{[0]}=h_{\phi}({\bm{X}})\in\mathbb{R}^{d^{\prime}}$ \triangleright initializer guesses the initial value
3: Define $g_{\theta}({\bm{Z}})=f_{\theta}({\bm{Z}})-{\bm{Z}}$ and set ${\bm{G}}[0]=g_{\theta}({\bm{Z}}^{[0]})$
4: for $k=0,\ldots,K$ do
5:  1) Set $m_{k}=\min\{m,k\}$ and ${\bm{G}}^{[k]}={\bm{G}}[0:(m_{k}+1)]\in\mathbb{R}^{(m_{k}+1)\times d^{\prime}}$ \triangleright compress storage
6:  2) Predict $\alpha^{[k]},\beta^{[k]}=s_{\xi}({\bm{G}}^{[k]},\widehat{{\bm{A}}}_{s})$, where $\alpha^{[k]}\in\mathbb{R}^{m_{k}+1}$, $\mathbf{1}^{\top}\alpha^{[k]}=1$ and $\beta^{[k]}\in\mathbb{R}$ \triangleright predict with the tiny GNN
7:  3) Update ${\bm{Z}}^{[k+1]}=\beta^{[k]}\cdot\mathbf{1}^{\top}{\bm{G}}^{[k]}+(1-\beta^{[k]})\sum_{i=0}^{m_{k}}\alpha_{i}^{[k]}{\bm{Z}}^{[k-m_{k}+i]}$ \triangleright Anderson step
8:  4) Update ${\bm{G}}=\operatorname{concat}({\bm{G}}[1:],[g_{\theta}({\bm{Z}}^{[k+1]})])$
9: end for
10: if it is the inference stage then
11:  return ${\bm{Z}}^{[k+1]}$
12: else
13:  return $({\bm{Z}}^{[k+1]},{\bm{G}}^{[k]},\alpha^{[k]},\beta^{[k]})_{k=0,\ldots,K}$ and ${\bm{Z}}^{[0]}$
14:  Compute $\mathcal{L}_{\text{total}}$ and back-propagate it to update the IGNN-Solver $\{s_{\xi},h_{\phi}\}$
15: end if

IV-A General Formulation

Let ${\bm{X}}\in\mathbb{R}^{d\times n}$ be the input features and $\widehat{{\bm{A}}}\in\mathbb{R}^{n\times n}$ the graph; IGNN learns the node representation by finding the fixed point below

${\bm{Z}}=f_{\theta}({\bm{Z}},\widehat{{\bm{A}}},{\bm{X}})=\sigma({\bm{W}}{\bm{Z}}\widehat{{\bm{A}}}+b_{\Omega}({\bm{X}})).$ (7)

The overall structure of the IGNN-Solver is shown in Algorithm 2. Specifically, it learns the parameters $\alpha$, which control the weights among the past $m$ approximate solutions, and $\beta$, which controls the weight between past residuals and approximate solutions, through a predictor $s_{\xi}$, in order to accelerate the approximate solution of the next update step. To train the IGNN-Solver, we minimize the joint loss function $\mathcal{L}_{\text{total}}$ by back-propagating through this $K$-step temporal process, which is discussed in Section IV-C. It is worth noting that the original IGNN is frozen (i.e., the model parameters $\theta$ are fixed) and only the IGNN-Solver parameters $\{\phi,\xi\}$ are trained here, so we do not need the ground-truth label $y$ corresponding to the input ${\bm{X}}$. This implies that the IGNN-Solver can also be fine-tuned during inference after deployment.
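To make the control flow of Algorithm 2 concrete, the following hedged sketch unrolls the solver for $K$ steps with a frozen IGNN layer, a learnable initializer, and a tiny-GNN predictor passed in as callables. The update combines past iterates and residuals in the Anderson-step form of Algorithm 1, with $\alpha^{[k]}$ and $\beta^{[k]}$ supplied by the predictor; shapes and the exact mixing rule of the released implementation may differ.

```python
import torch

def ignn_solver_unroll(f_theta, h_phi, s_xi, X, A_s, K=6, m=5):
    # f_theta: frozen IGNN layer Z -> f_theta(Z); h_phi: initializer X -> Z^{[0]};
    # s_xi: tiny-GNN predictor (G, A_s) -> (alpha, beta). Per-node state of shape (n, c).
    Z = h_phi(X)                                        # Z^{[0]} = h_phi(X)
    iterates, residuals, trajectory = [Z], [f_theta(Z) - Z], []
    for k in range(K):
        mk = min(m, k)
        G = torch.stack(residuals[-(mk + 1):], dim=0)   # (m_k+1, n, c) recent residuals
        Zw = torch.stack(iterates[-(mk + 1):], dim=0)   # (m_k+1, n, c) matching iterates
        alpha, beta = s_xi(G, A_s)                      # alpha: (m_k+1,), sums to 1; beta: scalar
        Z_mix = torch.tensordot(alpha, Zw, dims=1)      # sum_i alpha_i Z^{[k-m_k+i]}
        G_mix = torch.tensordot(alpha, G, dims=1)       # sum_i alpha_i g_theta(Z^{[k-m_k+i]})
        Z = Z_mix + beta * G_mix                        # Anderson-form update with learned weights
        iterates.append(Z)
        residuals.append(f_theta(Z) - Z)
        trajectory.append((Z, G, alpha, beta))
    return Z, trajectory                                # fixed-point estimate and training tuples
```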

IV-A1 Initializer

To accelerate the convergence of predicted values, we propose constructing an initializer

${\bm{Z}}^{[0]}=h_{\phi}({\bm{X}}),$ (8)

where $h_{\phi}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d^{\prime}}$, with learnable parameters $\phi$, provides a rapid and reasonable input-based initial estimate rather than simply setting the initial value to $\mathbf{0}$ or a random vector. We set the intermediate dimension $d^{\prime}$ to be significantly smaller than the feature dimension $d$ to reduce the training overhead of the initializer.
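A minimal sketch of such an initializer is a small MLP from the input feature dimension $d$ to the compressed dimension $d^{\prime}$; the hidden width below is an illustrative choice.

```python
import torch.nn as nn

def make_initializer(d, d_prime, hidden=64):
    # h_phi in Eq. (8): a small MLP mapping node features (n, d) to an initial state (n, d')
    return nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, d_prime))

# Example usage (shapes illustrative): Z0 = make_initializer(d=500, d_prime=32)(X)
```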

IV-A2 Improved Anderson Iterations with tiny-GNN

In the original AA solver, the parameter $\beta$ is predetermined and fixed [13], whereas the parameters $\alpha^{[k]}$ are determined by solving numerous linear systems via least squares. Although this method operates adequately, its efficiency and adaptability in optimizing IGNN models are limited. Therefore, we propose introducing a tiny, learnable graph neural network as

$\alpha,\beta=s_{\xi}({\bm{G}},\widehat{{\bm{A}}}),$ (9)

where $s_{\xi}:\mathbb{R}^{(m_{k}+1)\times d^{\prime}}\times\mathbb{R}^{n\times n}\rightarrow\mathbb{R}^{m_{k}+1}\times\mathbb{R}$, to predict the two parameters instead of setting them as the least-squares solution over the past residuals ${\bm{G}}$.

Remark 1 (Well-posedness of IGNN-Solver).

The proposed IGNN-Solver remains well-posed when the IGNN satisfies the reparameterization defined in (4). Note that the IGNN-Solver introduces two key improvements while maintaining the structure and parameters of the model. First, we propose an initializer for a better initial guess, establishing a more meaningful starting point, reducing solving time, and shortening the fixed-point trajectory. Second, we parameterize the Anderson weights and step sizes within their original range, ensuring that well-posedness is preserved. Consequently, the modifications introduced in the IGNN-Solver retain the foundational characteristics of classical AA without compromising the well-posedness condition. Therefore, for an IGNN $(f_{\theta},b_{\Omega},\sigma)$ where $f_{\theta}$ satisfies the well-posedness condition, the corresponding IGNN-Solver $(h_{\phi},s_{\xi})$ is also well-posed, thereby guaranteeing the existence and uniqueness of its solution.

IV-B Graph Sparsification and Storage Compression

Directly using the original graph for modeling would result in an oversized network [17, 18], deviating from the initial goal of using a tiny network for prediction. Moreover, owing to the curse of dimensionality, directly mapping the data from a high to an extremely low dimension is not appropriate [19]. Therefore, in response to the challenges posed by large-scale graphs, we take both graph sparsification and storage compression into account in the design of the solver, as detailed below.

IV-B1 Graph Sparsification

Firstly, the number of edges in the graph $\widehat{{\bm{A}}}$ is usually large in practice, as it scales with the input (for example, in the ogbn-products dataset, the number of edges is close to 62M). To maintain the lightweight nature of the tiny GNN and reduce the computational cost of prediction, we introduce graph sparsification [43, 44, 45] to obtain a lighter graph:

$\widehat{{\bm{A}}}_{s}=\varsigma(\widehat{{\bm{A}}},\beta),$ (10)

where $\widehat{{\bm{A}}}_{s}$ denotes the sparse subgraph of $\widehat{{\bm{A}}}$, obtained via a graph sparsification operator $\varsigma$ whose parameter $\beta$ controls the sparsity of $\widehat{{\bm{A}}}_{s}$. We adopt the RPI-Graph approach [17] in this paper, a plug-and-play graph sparsification method based on the principle of relevant information. It is worth noting that this subgraph is used exclusively in the tiny-GNN predictor $s_{\xi}$ and does not alter the original graph used by the IGNN model $f_{\theta}$.
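The sketch below shows a generic stand-in for the sparsification operator $\varsigma$ that simply keeps the $k$ largest-weight entries per row of a dense adjacency; since the paper adopts RPI-Graph for this step, this top-$k$ rule is only an illustrative placeholder with a comparable interface.

```python
import torch

def sparsify_topk(A_hat, k=8):
    """Illustrative placeholder for Eq. (10): keep only the k largest-weight entries in
    each row of the (dense) normalized adjacency; everything else is zeroed out."""
    values, indices = torch.topk(A_hat, k=min(k, A_hat.size(1)), dim=1)
    A_s = torch.zeros_like(A_hat)
    A_s.scatter_(1, indices, values)   # keep only the selected entries per row
    return A_s
```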

IV-B2 Storage Compression

Besides, considering that $n$ (the number of nodes) is relatively large (for example, $n$ is approximately 169K in the ogbn-arxiv dataset), it is not appropriate to map $s_{\xi}$ directly from a high-dimensional space to a very low-dimensional one in (9) [25, 46, 47], owing to the curse of dimensionality [19]. Therefore, to keep $s_{\xi}$ fast and compact, we compress ${\bm{G}}^{[k]}_{i}$ into a smaller yet still representative version $\bar{{\bm{G}}}^{[k]}_{i}$. Specifically, we map it from $\mathbb{R}^{(m_{k}+1)\times d^{\prime}}$ to an appropriate space $\mathbb{R}^{(m_{k}+1)\times p}$ with a multilayer perceptron (MLP). Then, the most recent $m_{k}+1$ sets of fixed points are merged into one feature matrix:

$\bar{{\bm{G}}}^{[k]}=\big\Vert_{i=k-m_{k}}^{k}\bar{{\bm{G}}}^{[k]}_{i},$ (11)

where $\Vert_{i=k-m_{k}}^{k}$ denotes the concatenation operation that stacks the compressed storage matrices $\bar{{\bm{G}}}^{[k]}_{i}\in\mathbb{R}^{(m_{k}+1)\times p}$ at the $k$-th iteration.

Once the smaller yet representative storage matrix $\bar{{\bm{G}}}$ is obtained as described above, we treat it as a $p$-channel matrix and employ a tiny graph neural network layer:

$\alpha^{[k]},\beta^{[k]}=s_{\xi}(\bar{{\bm{G}}}^{[k]},\widehat{{\bm{A}}}_{s}),$ (12)

to predict the relative weights $\alpha^{[k]}$ assigned to the past $m_{k}+1$ residuals, as well as the AA mixing coefficient $\beta^{[k]}$, at the $k$-th iteration. Notably, we normalize the original output $\widehat{\alpha}^{[k]}$ as

$\alpha^{[k]}=\widehat{\alpha}^{[k]}+\frac{1-\mathbf{1}^{\top}\widehat{\alpha}^{[k]}}{m_{k}+1}\cdot\mathbf{1}$

to ensure that the constraint $\mathbf{1}^{\top}\alpha^{[k]}=1$ is satisfied.

In this way, $s_{\xi}$ autonomously learns and adjusts the parameters $\alpha^{[k]}$ and $\beta^{[k]}$ based on previous solver steps, while also receiving gradients from subsequent iterations. We provide a comparison of different choices for $s_{\xi}$ in Section V-D, demonstrating that the IGNN-Solver improves the convergence path. The training procedure is introduced below.
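A hedged sketch of one possible realization of $s_{\xi}$, combining the storage compression of (11), a single propagation over $\widehat{{\bm{A}}}_{s}$, and the $\alpha$ normalization above, is given below; the layer layout, pooling, and the sigmoid range assumed for $\beta^{[k]}$ are illustrative choices rather than the released design.

```python
import torch
import torch.nn as nn

class TinyGNNPredictor(nn.Module):
    """Sketch of s_xi: compress each stored residual per node, mix once over the
    sparse subgraph A_s, pool, and emit (alpha, beta) with 1^T alpha = 1."""
    def __init__(self, channels, p=16):
        super().__init__()
        self.compress = nn.Linear(channels, p)   # storage compression to p channels
        self.alpha_head = nn.Linear(p, 1)
        self.beta_head = nn.Linear(p, 1)

    def forward(self, G, A_s):
        # G: (m_k+1, n, channels) recent residuals per node; A_s: (n, n) sparse subgraph
        H = torch.relu(self.compress(G))                 # (m_k+1, n, p)
        H = torch.einsum("ij,kjc->kic", A_s, H)          # one propagation step over A_s
        pooled = H.mean(dim=1)                           # (m_k+1, p) summary per past step
        alpha_hat = self.alpha_head(pooled).squeeze(-1)  # (m_k+1,) raw weights
        # Shift so that 1^T alpha = 1, as in the normalization formula above
        alpha = alpha_hat + (1 - alpha_hat.sum()) / alpha_hat.numel()
        beta = torch.sigmoid(self.beta_head(pooled.mean(dim=0)))  # assumed range (0, 1)
        return alpha, beta.squeeze(-1)
```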

IV-C Training IGNN-Solver

Unlike a neural ODE solver, the fixed-point trajectory of the IGNN is not unique. Therefore, trajectory fitting is not applicable to IGNN-Solver training. Instead, the goal is simply to bring every iterate as close as possible to ${\bm{Z}}^{\star}$. Formally, given an IGNN-Solver $\{s_{\xi},h_{\phi}\}$ that returns $({\bm{Z}}^{[k+1]},{\bm{G}}^{[k]},\alpha^{[k]},\beta^{[k]})_{k=0,\ldots,K}$ and ${\bm{Z}}^{[0]}$, we introduce three objective functions for its training. The complete training procedure is given in Algorithm 2.

Initializer Loss Functions

To train the initializer, we minimize the distance between the initial guess and the fixed point via

$\mathcal{L}_{\text{init}}=\left\|h_{\phi}({\bm{X}})-{\bm{Z}}^{\star}\right\|_{2},$ (13)

thereby facilitating the subsequent iteration process. Since the initializer predicts directly from the input ${\bm{X}}$ without going through iterations, we separate this loss from the other components.

Reconstruction Loss Functions

The training of the solver does not require label information or any trajectory fitting. Apart from the loss $\mathcal{L}_{\text{init}}$ needed to train the initializer above, we introduce the reconstruction loss

$\mathcal{L}_{\text{rec}}^{[k]}=\|{\bm{Z}}^{[k]}-{\bm{Z}}^{\star}\|_{2},$ (14)

where the reconstruction loss $\mathcal{L}_{\text{rec}}^{[k]}$ aims to make all intermediate predictions ${\bm{Z}}^{[k]}$ converge to the accurate fixed point ${\bm{Z}}^{\star}$ as closely as possible.

Auxiliary Loss Functions

Although we utilize $\alpha^{[k]}$ and $\beta^{[k]}$ to improve the generic solver, we have empirically found that an auxiliary loss guiding the solver's prediction of $\alpha$ is sometimes beneficial, especially in the early stages of training. Therefore, in practice, we suggest including this loss and gradually diminishing it as training progresses (i.e., decaying its weight to 0):

$\mathcal{L}_{\alpha}=\sum_{k=0}^{K}\|{\bm{G}}^{[k]}\alpha^{[k]}\|_{2}.$ (15)

The final form of the joint loss function is as follows:

$\mathcal{L}_{\text{total}}=\lambda_{1}\mathcal{L}_{\text{rec}}^{[k]}+\lambda_{2}\mathcal{L}_{\text{init}}+\lambda_{3}\mathcal{L}_{\alpha},$ (16)

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ control the weights of the three loss functions above; they are elaborated in Section A-B.
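The three terms can be assembled as in the following sketch, which consumes the tuples returned by the unrolled solver; the per-step weighting of the reconstruction term and the concrete $\lambda$ values are illustrative.

```python
import torch

def solver_training_loss(Z0, trajectory, Z_star, lambdas=(1.0, 1.0, 0.1)):
    """Sketch of Eqs. (13)-(16). `trajectory` holds (Z^{[k+1]}, G^{[k]}, alpha^{[k]}, beta^{[k]})
    for k = 0..K as returned by the unrolled solver; Z_star is a reference fixed point.
    A per-step weighting of the reconstruction term (e.g. increasing in k, as studied in
    the ablation) is omitted for brevity."""
    l1, l2, l3 = lambdas
    loss_init = torch.norm(Z0 - Z_star)                                   # Eq. (13)
    loss_rec = sum(torch.norm(Z - Z_star) for Z, _, _, _ in trajectory)   # Eq. (14), summed over k
    loss_alpha = sum(torch.norm(torch.tensordot(a, G, dims=1))            # Eq. (15)
                     for _, G, a, _ in trajectory)
    return l1 * loss_rec + l2 * loss_init + l3 * loss_alpha               # Eq. (16)
```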

Figure 2: Speed-accuracy tradeoff curves comparing the inference time of IGNN with IGNN-Solver and its variants IGNN-Solver (NN) and IGNN-Solver (AA), across four large-scale datasets: Amazon-all (2(a)), Reddit (2(b)), ogbn-arxiv (2(c)) and ogbn-products (2(d)). Each plot shows the average performance over five independent runs under identical experimental conditions.
TABLE I: Node classification accuracy results on four large-scale real-world datasets (more results on small-scale datasets including Citeseer, ACM, CoraFull, BlogCatalog, Flickr can be found in Table II), with experiments conducted over five trials. The mean accuracy (%) ± standard deviation is reported. The best and the runner-up results are highlighted in boldface and underline, respectively.
Type Models Amazon-all Reddit ogbn-arxiv ogbn-products
Nodes 334,863 232,965 169,343 2,449,029
Edges 14,202,057 11,606,919 1,166,243 61,859,140
Explicit GCN [12] 79.12±1.11 89.65±0.14 71.56±0.25 71.91±0.27
GAT [23] 76.02±2.09 90.08±0.44 71.10±0.24 72.33±0.32
SGC [48] 75.56±1.75 91.44±0.41 64.66±1.02 70.48±0.19
APPNP [49] 79.80±1.22 91.68±0.24 71.28±0.29 74.46±0.64
JKNet [50] 81.19±1.09 91.71±0.31 71.08±0.35 73.70±0.58
AM-GCN [25] 81.85±1.46 90.20±0.34 68.83±0.67 73.19±0.49
DEMO-Net [51] 84.08±1.29 89.53±0.29 68.40±0.24 73.88±0.62
GCNII [52] 83.60±2.40 89.87±0.38 67.66±0.20 72.88±0.18
ACM-GCN [53] 83.04±2.61 90.37±0.42 68.32±0.18 71.68±0.41
Implicit IGNN [1] 83.90±0.51 92.30±1.55 70.49±0.75 74.63±0.24
EIGNN [2] \underline{84.32}±0.57 92.00±0.24 70.59±0.31 74.58±0.26
MIGNN [3] 83.68±0.82 91.98±0.42 \underline{71.95}±0.44 74.62±0.32
IGNN-Solver (AA) 83.44±0.19 92.37±0.35 71.73±0.41 74.61±0.28
IGNN-Solver (NN) 84.13±1.04 \underline{92.42}±0.41 70.78±0.37 \underline{74.69}±0.31
IGNN-Solver (Ours) \textbf{84.50}±0.70 \textbf{93.91}±0.31 \textbf{72.53}±0.41 \textbf{74.90}±0.20

V Experiments

In this section, we compare the speed/accuracy Pareto curve (rather than a single point on the curve) of IGNN-Solver against IGNNs and several state-of-the-art (SOTA) GNNs across various graph classification tasks, both at the node and graph levels. Our goal is to demonstrate that: 1) IGNN-Solver achieves nearly a 1.5× to 8× inference acceleration without sacrificing accuracy; and 2) IGNN-Solver is extremely compact, adding little overhead to training. Specifically, we compare our approach with numerous representative baselines and two variants of our method: IGNN w. AA (using the original Anderson Acceleration to speed up IGNN) and IGNN w. NN (replacing our proposed graph neural solver with a standard neural network solver for parameter learning).

In Section V-D, we then demonstrate that the training overhead of IGNN-Solver is small relative to the total training time of IGNN. Additionally, we provide further evidence regarding the convergence and generalizability of the IGNN-Solver. We compare the training/inference time costs per epoch between IGNN and IGNN-Solver, visualizing the advantages of our approach. Furthermore, we conduct ablation studies to validate the stability of the IGNN-Solver and assess the effectiveness of its components. Additional experimental setups and descriptions are detailed in Appendix A.

Figure 3: Speed-accuracy tradeoff curves comparing the inference time of IGNN with IGNN-Solver and its variants IGNN-Solver (NN) and IGNN-Solver (AA), across five small-scale datasets: Citeseer (3(a)), ACM (3(b)), CoraFull (3(c)), BlogCatalog (3(d)) and Flickr (3(e)). Each plot represents the average performance over five independent runs under identical experimental conditions.
TABLE II: Node Classification accuracy results on five small-scale real-world datasets, with experiments conducted over five trials. The mean accuracy (%) ± standard deviation is reported. The best and the runner-up results are highlighted in boldface and underline, respectively.
Type Models Citeseer ACM CoraFull BlogCatalog Flickr
Nodes 3,327 3,025 19,793 5,196 7,575
Edges 4,732 13,128 65,311 171,743 239,738
Explicit GCN [12] 64.80±4.15 83.78±3.95 35.98±9.44 71.30±1.12 76.98±1.83
GAT [23] 71.24±2.88 80.60±5.12 50.24±1.87 76.24±2.76 72.64±3.23
SGC [48] 70.32±2.75 85.40±1.23 46.56±4.43 69.76±1.11 71.88±3.47
APPNP [49] 61.56±8.92 84.16±4.25 21.24±4.76 61.34±9.39 73.42±3.80
JKNet [50] 63.78±8.76 64.96±5.47 23.04±6.01 72.62±6.83 75.68±1.15
AM-GCN [25] 73.10±1.62 89.56±0.30 53.40±1.59 73.86±1.10 76.86±2.02
DEMO-Net [51] 68.34±2.94 84.38±2.19 61.74±3.65 74.26±2.70 75.60±3.95
GCNII [52] 71.98±0.80 85.36±1.05 57.64±3.34 74.94±3.81 \underline{79.92}±2.14
ACM-GCN [53] 72.38±1.46 88.98±0.41 59.88±1.59 \textbf{78.18}±1.75 74.82±3.78
Implicit IGNN [1] 72.96±1.83 90.88±0.95 65.52±0.51 75.68±0.55 75.80±0.29
EIGNN [2] 72.38±1.36 88.36±1.03 61.80±0.60 75.34±0.38 75.66±0.94
MIGNN [3] 73.79±0.94 89.59±1.61 62.94±0.46 76.68±1.49 74.96±0.49
IGNN-Solver (AA) \underline{75.28}±0.38 \textbf{91.34}±0.46 \underline{65.88}±0.34 76.82±0.34 75.84±0.27
IGNN-Solver (NN) 74.78±0.42 91.08±0.56 65.62±0.16 76.88±0.74 78.55±0.49
IGNN-Solver (Ours) \textbf{75.60}±0.22 \underline{91.20}±1.15 \textbf{66.08}±1.23 \underline{77.64}±0.54 \textbf{80.14}±0.23

V-A Experiments Settings

V-A1 Datasets

The evaluation is conducted on nine different-field, real-world benchmarks for node classification tasks, including four citation datasets (Citeseer, ACM, CoraFull, ogbn-arxiv), three social interaction datasets (BlogCatalog, Flickr, Reddit), and two product network datasets (Amazon-all, ogbn-products); as well as five bioinformatics benchmarks (MUTAG, PTC_MR, COX2, PROTEINS, NCI1) for graph classification tasks. Further details can be found in Appendix A. Notably, to demonstrate the scalability of the proposed model on larger datasets, we use four large-scale datasets: Amazon-all, Reddit, ogbn-arxiv, and ogbn-products, two of which are from the Open Graph Benchmark (OGB) [22].

V-A2 Baselines

We compare IGNN-Solver with a comprehensive set of baselines in graph node classification tasks, including three implicit GNNs, i.e., MIGNN [3], EIGNN [2], and IGNN [1], and nine explicit/traditional GNNs, i.e., AM-GCN [25], GCN [12], GAT [23], SGC [48], APPNP [49], JKNet [50], DEMO-Net [51], GCNII [52], and ACM-GCN [53]. For fairness, we use the same hidden-layer configuration for all baseline models. For example, on the Flickr dataset, the dimension of the hidden layers is uniformly set to 512 for all methods (including ours, see Table IV). In addition, for EIGNN, the gamma parameter is set to 0.8, consistent with the original paper. For AM-GCN, the number of nearest neighbors in the KNN graph is chosen from {5, 6, 7}. For GCN, the hidden-layer setup is the same as for the other methods. For GAT, three attention heads are used per layer. For SGC, the power of self-loops in the graph adjacency matrix is set to 3. For APPNP, we set the number of iterations to 3 and the teleport probability $\alpha$ to 0.5. For JKNet, we set the number of layers to 1 on small datasets and to 8 on large ones. For DEMO-Net, the regularization parameter is set to 5e-4, the learning rate to 5e-3, and the hash dimension to 256. For GCNII, we set $\alpha_{\ell}=0.1$ and the $L_{2}$ regularization to 5e-4, consistent with the original paper. For ACM-GCN, we set the number of layers to 1 on small datasets and to 4 on large ones. For our IGNN-Solver and its variants, we employ the same settings as the base IGNN model. Furthermore, we tune the parameters affecting the convergence and performance of IGNN through a combination of hierarchical grid search and manual tuning. The Adam optimizer [54] is used for optimization.

V-B Performance on Node Classification Tasks

V-B1 Experimental setup

To demonstrate the superiority of the IGNN-Solver over IGNNs in terms of both performance and efficiency, we analyze the movement of the speed/accuracy Pareto curve across various datasets, rather than concentrating on a single point on the curve, which is depicted in Figures  2 and 3. All experiments are conducted five times, and the best results are reported. Furthermore, we present the node classification performance results for each method on various datasets, which are shown in Tables I and II. All experiments are conducted five times, and we report the average results along with the standard deviation. Additional details on the training procedure and the hyperparameters used for each task are provided in Appendices A and B.

V-B2 Results on large-scale tasks

From Figure 2 (the speed/accuracy Pareto curve on large-scale datasets: Amazon-all, Reddit, ogbn-arxiv and ogbn-products), we observe that:

  1.

    Comprehensive performance and efficiency: IGNN-Solver generally outperforms all other methods, particularly on large datasets with a greater graph radius, where its advantage is more pronounced. The improved performance is attributed to IGNN-Solver's significant enhancement of the convergence path and the better initialization of the fixed-point equation provided by the initializer;

  2.

    Rapid inference advantage: IGNN-Solver demonstrates a notable speed advantage during inference. For instance, on the large-scale Amazon-all dataset, our IGNN-Solver achieves 8× faster inference than IGNN while maintaining the same performance; across the other datasets, we consistently observe at least a 1.5× acceleration, with a similar pattern throughout.

From Table I, we observe that implicit GNNs with infinite depth generally outperform shallow explicit GNNs in most cases. Furthermore, IGNN-Solver consistently shows higher accuracy across most datasets than other state-of-the-art explicit GNNs such as DEMO-Net [51] and GCNII [52], indicating its superior performance.

V-B3 Results on small-scale tasks

We further evaluate IGNN-Solver on several small-scale graph node classification tasks, including Citeseer, ACM, CoraFull, BlogCatalog, and Flickr. The hyperparameters used are outlined in Table IV, Section A-B. The mean accuracy and standard deviation are reported in Table II, while the speed/accuracy Pareto curve is presented in Figure 3. In general, learning LRD is not crucial for these tasks, as the graph diameters are relatively small [3]. However, as shown in Table II, even for these small-scale node classification tasks, IGNN-Solver outperforms IGNNs and many explicit GNNs, as well as other advanced implicit models. As illustrated in Figure 3, the solver continues to improve the convergence path of IGNN, even in small-scale tasks. These results, along with those in Section V-B, confirm the expressive power of the IGNN-Solver using learnable graph neural networks, even surpassing that of many explicit and enhanced implicit GNNs.

V-C Performance on Graph Classification Tasks

TABLE III: Graph classification accuracy results on five benchmark datasets. The mean accuracy (%) ± standard deviation is reported. The best and the runner-up results are highlighted in boldface and underline, respectively.
Type Method MUTAG PTC_MR COX2 PROTEINS NCI1
Explicit WL [55] 84.1 ± 1.9 58.0 ± 2.5 83.2 ± 0.2 74.7 ± 0.5 84.5 ± 0.5
DCNN [56] 67.0 56.6 - 61.3 62.6
DGCNN [57] 85.8 58.6 - 75.5 74.4
GIN [58] 89.4 ± 5.6 64.6 ± 7.0 - 76.2 ± 2.8 82.7 ± 1.7
FDGNN [59] 88.5 ± 3.8 63.4 ± 5.4 83.3 ± 2.9 76.8 ± 2.9 77.8 ± 1.6
PK [60] 76.0 ± 2.7 59.5 ± 2.4 81.0 ± 0.2 73.7 ± 0.7 82.5 ± 0.5
GCN [12] 85.6 ± 5.8 64.2 ± 4.3 - 76.0 ± 3.2 80.2 ± 2.0
Implicit IGNN [1] 76.0 ± 13.4 60.5 ± 6.4 79.7 ± 3.4 76.5 ± 3.4 73.5 ± 1.9
CGS [61] 89.4 ± 5.6 64.7 ± 6.4 - 76.3 ± 4.9 77.6 ± 2.0
GIND [62] 89.3 ± 7.4 66.9 ± 6.6 84.8 ± 4.2 77.2 ± 2.9 78.8 ± 1.7
MIGNN [3] 81.8 ± 9.1 72.6 ± 5.4 85.0 ± 5.3 77.9 ± 2.6 79.6 ± 1.8
IGNN-Solver (Ours) 90.5 ± 1.6 71.8 ± 3.1 85.3 ± 3.0 78.4 ± 2.6 80.2 ± 1.1

V-C1 Experimental setup

We extend our evaluation to graph classification tasks to further demonstrate the effectiveness and versatility of IGNN-Solver. Experiments are conducted on several widely used bioinformatics benchmark datasets for graph classification, including MUTAG, PTC_MR, COX2, PROTEINS, and NCI1. These datasets vary in size and complexity, as detailed in Appendix A. Following the experimental setup of [1], we evaluate the test performance of IGNN-Solver with standard 10-fold cross-validation on each dataset, with baselines and their results drawn from [3, 62]. The average prediction accuracy and standard deviations are reported in Table III.

V-C2 Results and discussion

The results of the graph classification experiments, summarized in Table III, show that our IGNN-Solver outperforms the baseline models on almost all datasets, clearly demonstrating its ability to effectively capture graph structures and significantly improve classification accuracy. These findings confirm the effectiveness of the IGNN-Solver in graph classification tasks, further supporting its potential for deployment in real-world applications where graph data is prevalent.

V-D Empirical Understandings of IGNN-Solver

V-D1 Overhead in training

(a) Overhead proportion on Citeseer. (b) Overhead proportion on Amazon-all.
Figure 4: Relative parameter size and training time of the IGNN-Solver on the small Citeseer dataset and the large Amazon-All dataset, with consistent patterns observed across other datasets.

To investigate the proportion of training overhead attributed to the solver throughout the entire training process, we depict in Figure 4 the percentage of training computations spent on IGNN-Solver relative to the total time overhead of IGNN. Our approach is not only effective but also incurs minimal training overhead for the solver: the solver module is relatively small and requires only about 1% of the original training time required by the IGNN model. Moreover, this proportion becomes even lower for large-scale data. For instance, on the Amazon dataset, IGNN requires a total runtime of 3 hours, whereas the solver requires only approximately 1.6 minutes, indicating that our solver maintains its lightweight nature across datasets of any scale.

V-D2 Efficiency study

Panels: (a) Amazon-all, (b) Reddit, (c) ogbn-arxiv, (d) ogbn-products, (e) MUTAG, (f) PTC_MR, (g) COX2, (h) PROTEINS.
Figure 5: Comparison of convergence curves across eight datasets, highlighting the substantial acceleration in convergence achieved by our IGNN-Solver.

We present additional evidence on the convergence and generalizability of the neural solvers in Figure 5, where the ordinate represents the relative residual $\frac{\|f({\bm{z}})-{\bm{z}}\|_{F}}{\|{\bm{z}}\|_{F}}$, with $\|\cdot\|_{F}$ denoting the Frobenius norm. We compare the convergence of a pre-trained IGNN model under two conditions: 1) canonical iteration in IGNN [1], and 2) IGNN-Solver with $K=6$ unrolling steps.

From Figure 5, we observe that the IGNN-Solver, trained with $K$ unrolling steps, is capable of generalizing beyond $K$ steps, with both solvers ultimately reaching a stable state and demonstrating good convergence. Moreover, thanks to the graph-neural parameterization of the fixed-point solution, IGNN-Solver consistently improves upon the convergence of the canonical iterations while being more computationally efficient: each step of the neural solver is cheaper than a standard iterator step. This explains the inference-efficiency improvement of at least 1.5× observed in Section V-B.
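For completeness, the relative residual plotted in Figure 5 can be computed as in the short sketch below; `f_theta` stands for the frozen IGNN layer with the graph and input features already bound.

```python
import torch

def relative_residual(f_theta, Z):
    """Relative residual ||f(Z) - Z||_F / ||Z||_F used for the convergence curves."""
    return (torch.norm(f_theta(Z) - Z) / torch.norm(Z)).item()
```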

V-D3 Other choices for graph neural layers in IGNN-Solver

(a) The relative error curve during warm-up. (b) The relative error curve during inference.
Figure 6: Comparison of IGNN-Solver with GN, NN, GRU, and TCN. Our proposed IGNN-Solver (with GN) achieves the fastest convergence and superior overall performance, owing to its lightweight design and ability to effectively utilize rich graph information.

We provide additional evidence on the convergence and generalization of IGNN-Solver by exploring alternatives to our proposed graph neural (GN) layers, including neural network (NN), temporal convolutional network (TCN) [63], and gated recurrent unit (GRU) [64]. The relative residual curves on the Citeseer dataset, observed during both the warm-up and final inference stages, are presented in Figure 6. Interestingly, despite their larger number of parameters and more complex architectures, TCN and GRU consistently underperform compared to NN. Notably, IGNN-Solver (GN) significantly improves the convergence path of all solvers and exhibits the fastest rate of residual reduction, establishing itself as the most effective predictor in IGNN-Solver due to its ability to effectively exploit rich graph information.

Panels: (a) Amazon-all, (b) Reddit, (c) ogbn-arxiv, (d) ogbn-products, (e) Citeseer, (f) ACM, (g) CoraFull, (h) BlogCatalog, (i) Flickr.
Figure 7: Ablation studies for IGNN-Solver. This study shows that all components contribute significantly to accelerating the convergence process.

V-D4 Ablation study

In this section, we analyze the effect of the different loss components on the IGNN-Solver. The performance of IGNN-Solver and its variants on the Citeseer dataset is shown in Figure 7, with similar patterns observed across the other datasets and solvers. For the main convergence loss $\mathcal{L}_{\text{conv}}$ of the fixed-point iterations (16), we design two contrasting schemes for its weights $\lambda_1^{[k]}$: setting them to a constant value (i.e., $\lambda_1^{[k]}=\lambda_1$ for all $k$), or setting all terms except the $K$-th to zero (i.e., $\lambda_1^{[k]}=\lambda_1$ if $k=K$ and $0$ otherwise). Both variants still perform well, yet our suggested monotonically increasing scheme (i.e., placing greater emphasis on later steps) performs best.
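For concreteness, the three weighting schemes compared above can be written as simple schedules over the $K$ unrolling steps. The sketch below is illustrative only; in particular, the linear ramp is just one way to realize a monotonically increasing schedule, and the function names are ours.

```python
import numpy as np

def constant_weights(K, lam=0.1):
    """Scheme 1: lambda_1^[k] = lambda_1 for all k."""
    return np.full(K, lam)

def last_step_only(K, lam=0.1):
    """Scheme 2: lambda_1^[k] = lambda_1 if k == K, and 0 otherwise."""
    w = np.zeros(K)
    w[-1] = lam
    return w

def increasing_weights(K, lam=0.1):
    """Monotonically increasing weights emphasizing later unrolling steps
    (a linear ramp is one simple realization)."""
    return lam * np.arange(1, K + 1) / K
```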

Furthermore, removing either the initializer or the alpha loss $\mathcal{L}_{\alpha}$ degrades performance, with the former having the more detrimental effect on the solver. Nonetheless, all of these ablated configurations still significantly outperform methods without a solver, underscoring the advantage of employing a custom learnable solver for the IGNN model.

VI Conclusion

This paper introduces IGNN-Solver, a novel graph neural solver tailored to fast fixed-point solving in IGNNs. Unlike previous approaches that impose constraints on the structure or parameters of the implicit layer, we build on the key idea that implicit models separate the representation from the forward computation. By leveraging the generalized Anderson Acceleration method and parameterizing it with a tiny GNN, IGNN-Solver learns iterative updates as a graph-dependent temporal process. To address large-scale graph tasks, we further tailor sparsification and storage compression methods to the IGNN-Solver and integrate them into its design. Our extensive experiments demonstrate that IGNN-Solver significantly accelerates inference, with a speedup ranging from $1.5\times$ to $8\times$, while maintaining accuracy. Notably, this acceleration becomes more pronounced as the graph scale grows. These findings underscore the potential of IGNN-Solver for large-scale end-to-end deployment in real-world applications.

Looking ahead, IGNN-Solver opens new avenues for the integration of learnable solvers in implicit GNNs, bridging the gap between solver efficiency and adaptability to diverse graph structures. Future research could explore extending IGNN-Solver to handle dynamic or heterogeneous graphs, further broadening its applicability. Additionally, investigating the interplay between solver generalization and graph topology holds promise for uncovering deeper insights into the dynamics of graph-based learning. With its ability to combine speed, scalability, and accuracy, IGNN-Solver represents a pivotal step toward making implicit graph models a viable choice for large-scale real-world systems. We hope that IGNN-Solver inspires further innovations in learnable solvers for IGNNs, driving progress in scalable and efficient algorithms on graphs.

References

  • [1] F. Gu, H. Chang, W. Zhu, S. Sojoudi, and L. El Ghaoui, “Implicit graph neural networks,” Advances in Neural Information Processing Systems, vol. 33, pp. 11 984–11 995, 2020.
  • [2] J. Liu, K. Kawaguchi, B. Hooi, Y. Wang, and X. Xiao, “Eignn: Efficient infinite-depth graph neural networks,” Advances in Neural Information Processing Systems, vol. 34, pp. 18 762–18 773, 2021.
  • [3] J. Baker, Q. Wang, C. D. Hauck, and B. Wang, “Implicit graph neural networks: a monotone operator viewpoint,” in International Conference on Machine Learning.   PMLR, 2023, pp. 1521–1548.
  • [4] Q. Chen, Y. Wang, Y. Wang, J. Chang, Q. Tian, J. Yang, and Z. Lin, “Efficient and scalable implicit graph neural networks with virtual equilibrium,” in 2022 IEEE International Conference on Big Data (Big Data).   IEEE, 2022, pp. 864–873.
  • [5] Q. Li, Z. Han, and X. M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in 32nd AAAI Conference on Artificial Intelligence, AAAI 2018.   AAAI press, 2018, pp. 3538–3545.
  • [6] U. Alon and E. Yahav, “On the bottleneck of graph neural networks and its practical implications,” in International Conference on Learning Representations, 2020.
  • [7] X. Zhang, Y. Xu, W. He, W. Guo, and L. Cui, “A comprehensive review of the oversmoothing in graph neural networks,” in CCF Conference on Computer Supported Cooperative Work and Social Computing.   Springer, 2023, pp. 451–465.
  • [8] J. Liu, B. Hooi, K. Kawaguchi, Y. Wang, C. Dong, and X. Xiao, “Scalable and effective implicit graph neural networks on large graphs,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=QcMdPYBwTu
  • [9] Z. Geng, X.-Y. Zhang, S. Bai, Y. Wang, and Z. Lin, “On training implicit models,” Advances in Neural Information Processing Systems, vol. 34, pp. 24 247–24 260, 2021.
  • [10] Y. Yang, T. Liu, Y. Wang, Z. Huang, and D. Wipf, “Implicit vs unfolded graph neural networks,” arXiv preprint arXiv:2111.06592, 2021.
  • [11] M. Nastorg, M.-A. Bucci, T. Faney, J.-M. Gratien, G. Charpiat, and M. Schoenauer, “An implicit gnn solver for poisson-like problems,” arXiv preprint arXiv:2302.10891, 2023.
  • [12] T. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.   OpenReview.net, 2017.
  • [13] D. G. Anderson, “Iterative procedures for nonlinear integral equations,” Journal of the ACM (JACM), vol. 12, no. 4, pp. 547–560, 1965.
  • [14] S. Bai, J. Z. Kolter, and V. Koltun, “Deep equilibrium models,” Advances in neural information processing systems, vol. 32, 2019.
  • [15] S. Bai, V. Koltun, and J. Z. Kolter, “Multiscale deep equilibrium models,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, pp. 5238–5250.
  • [16] ——, “Neural deep equilibrium solvers,” in International Conference on Learning Representations, 2021.
  • [17] S. Yu, F. Alesiani, W. Yin, R. Jenssen, and J. C. Principe, “Principle of relevant information for graph sparsification,” in Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, ser. Proceedings of Machine Learning Research, J. Cussens and K. Zhang, Eds., vol. 180.   PMLR, 01–05 Aug 2022, pp. 2331–2341. [Online]. Available: https://proceedings.mlr.press/v180/yu22c.html
  • [18] C. Zheng, B. Zong, W. Cheng, D. Song, J. Ni, W. Yu, H. Chen, and W. Wang, “Robust graph representation learning via neural sparsification,” in International Conference on Machine Learning.   PMLR, 2020, pp. 11 458–11 468.
  • [19] M. Köppen, “The curse of dimensionality,” in 5th online world conference on soft computing in industrial applications (WSC5), vol. 1, 2000, pp. 4–8.
  • [20] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” in Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, 2012, pp. 1–8.
  • [21] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” Advances in neural information processing systems, vol. 30, 2017.
  • [22] W. Hu, M. Fey, M. Zitnik, Y. Dong, H. Ren, B. Liu, M. Catasta, and J. Leskovec, “Open graph benchmark: Datasets for machine learning on graphs,” Advances in neural information processing systems, vol. 33, pp. 22 118–22 133, 2020.
  • [23] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph Attention Networks,” International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=rJXMpikCZ
  • [24] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein, “Geometric deep learning on graphs and manifolds using mixture model cnns,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017, pp. 5115–5124.
  • [25] X. Wang, M. Zhu, D. Bo, P. Cui, C. Shi, and J. Pei, “Am-gcn: Adaptive multi-channel graph convolutional networks,” in Proceedings of the 26th ACM SIGKDD International conference on knowledge discovery & data mining, 2020, pp. 1243–1253.
  • [26] J. Liu, B. Hooi, K. Kawaguchi, and X. Xiao, “Mgnni: Multiscale graph neural networks with implicit layers,” Advances in Neural Information Processing Systems, vol. 35, pp. 21 358–21 370, 2022.
  • [27] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, “Neural ordinary differential equations,” Advances in neural information processing systems, vol. 31, 2018.
  • [28] P. H. Avelar, A. R. Tavares, M. Gori, and L. C. Lamb, “Discrete and continuous deep residual learning over graphs,” arXiv preprint arXiv:1911.09554, 2019.
  • [29] M. Poli, S. Massaroli, J. Park, A. Yamashita, H. Asama, and J. Park, “Graph neural ordinary differential equations,” arXiv preprint arXiv:1911.07532, 2019.
  • [30] L.-P. Xhonneux, M. Qu, and J. Tang, “Continuous graph neural networks,” in International conference on machine learning.   PMLR, 2020, pp. 10 432–10 441.
  • [31] B. Chamberlain, J. Rowbottom, M. I. Gorinova, M. Bronstein, S. Webb, and E. Rossi, “Grand: Graph neural diffusion,” in International conference on machine learning.   PMLR, 2021, pp. 1407–1418.
  • [32] M. Thorpe, T. Nguyen, H. Xia, T. Strohmer, A. Bertozzi, S. Osher, and B. Wang, “Grand++: Graph neural diffusion with a source term,” ICLR, 2022.
  • [33] Z. Ling, X. Xie, Q. Wang, Z. Zhang, and Z. Lin, “Global convergence of over-parameterized deep equilibrium models,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2023, pp. 767–787.
  • [34] E. Winston and J. Z. Kolter, “Monotone operator equilibrium networks,” Advances in neural information processing systems, vol. 33, pp. 10 718–10 728, 2020.
  • [35] Z. Ling, L. Li, Z. Feng, Y. Zhang, F. Zhou, R. C. Qiu, and Z. Liao, “Deep equilibrium models are almost equivalent to not-so-deep explicit models for high-dimensional Gaussian mixtures,” in Proceedings of the 41st International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 235.   PMLR, 21–27 Jul 2024, pp. 30 585–30 609. [Online]. Available: https://proceedings.mlr.press/v235/ling24a.html
  • [36] Q. Zhao, W. Ren, T. Li, X. Xu, and H. Liu, “Graphgpt: Graph learning with generative pre-trained transformers,” arXiv preprint arXiv:2401.00529, 2023.
  • [37] Z. Huang, S. Bai, and J. Z. Kolter, “(implicit)2(implicit)^{2}: Implicit layers for implicit representations,” Advances in Neural Information Processing Systems, vol. 34, pp. 9639–9650, 2021.
  • [38] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in International conference on machine learning.   PMLR, 2017, pp. 1263–1272.
  • [39] E. K. Ryu and S. Boyd, “Primer on monotone operator methods,” Appl. comput. math, vol. 15, no. 1, pp. 3–43, 2016.
  • [40] L. Jing, Y. Shen, T. Dubcek, J. Peurifoy, S. Skirlo, Y. LeCun, M. Tegmark, and M. Soljačić, “Tunable efficient unitary neural networks (eunn) and their application to rnns,” in International Conference on Machine Learning.   PMLR, 2017, pp. 1733–1741.
  • [41] P. R. Halmos, “A hilbert space problem book,” Graduate Texts in Mathematics, 1982.
  • [42] S. G. Krantz and H. R. Parks, The implicit function theorem: history, theory, and applications.   Springer Science & Business Media, 2002.
  • [43] B. Hui, D. Yan, X. Ma, and W.-S. Ku, “Rethinking graph lottery tickets: Graph sparsity matters,” arXiv preprint arXiv:2305.02190, 2023.
  • [44] J. F. Lutzeyer, C. Wu, and M. Vazirgiannis, “Sparsifying the update step in graph neural networks,” in Topological, Algebraic and Geometric Learning Workshops 2022.   PMLR, 2022, pp. 258–268.
  • [45] S. Zhang, M. Wang, P.-Y. Chen, S. Liu, S. Lu, and M. Liu, “Joint edge-model sparse learning is provably efficient for graph neural networks,” arXiv preprint arXiv:2302.02922, 2023.
  • [46] J. Zhao, X. Wang, C. Shi, B. Hu, G. Song, and Y. Ye, “Heterogeneous graph structure learning for graph neural networks,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, 2021, pp. 4697–4705.
  • [47] M. Poli, S. Massaroli, A. Yamashita, H. Asama, and J. Park, “Hypersolvers: Toward fast continuous-depth models,” Advances in Neural Information Processing Systems, vol. 33, pp. 21 105–21 117, 2020.
  • [48] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger, “Simplifying graph convolutional networks,” in International conference on machine learning.   PMLR, 2019, pp. 6861–6871.
  • [49] J. Gasteiger, A. Bojchevski, and S. Günnemann, “Predict then propagate: Graph neural networks meet personalized pagerank,” in International Conference on Learning Representations, 2018.
  • [50] K. Xu, C. Li, Y. Tian, T. Sonobe, K.-i. Kawarabayashi, and S. Jegelka, “Representation learning on graphs with jumping knowledge networks,” in International conference on machine learning.   PMLR, 2018, pp. 5453–5462.
  • [51] J. Wu, J. He, and J. Xu, “Demo-net: Degree-specific graph neural networks for node and graph classification,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 406–415. [Online]. Available: https://doi.org/10.1145/3292500.3330950
  • [52] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li, “Simple and deep graph convolutional networks,” in International conference on machine learning.   PMLR, 2020, pp. 1725–1735.
  • [53] S. Luan, C. Hua, Q. Lu, J. Zhu, M. Zhao, S. Zhang, X.-W. Chang, and D. Precup, “Is heterophily a real nightmare for graph neural networks to do node classification?” arXiv preprint arXiv:2109.05641, 2021.
  • [54] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [55] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt, “Efficient graphlet kernels for large graph comparison,” in Artificial intelligence and statistics.   PMLR, 2009, pp. 488–495.
  • [56] J. Atwood and D. Towsley, “Diffusion-convolutional neural networks,” Advances in neural information processing systems, vol. 29, 2016.
  • [57] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning architecture for graph classification,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
  • [58] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” arXiv preprint arXiv:1810.00826, 2018.
  • [59] C. Gallicchio and A. Micheli, “Fast and deep graph neural networks,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 04, 2020, pp. 3898–3905.
  • [60] M. Neumann, R. Garnett, C. Bauckhage, and K. Kersting, “Propagation kernels: efficient graph kernels from propagated information,” Machine learning, vol. 102, pp. 209–245, 2016.
  • [61] J. Park, J. Choo, and J. Park, “Convergent graph solvers,” arXiv preprint arXiv:2106.01680, 2021.
  • [62] Q. Chen, Y. Wang, Y. Wang, J. Yang, and Z. Lin, “Optimization-induced graph implicit nonlinear diffusion,” in International Conference on Machine Learning.   PMLR, 2022, pp. 3648–3661.
  • [63] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 156–165.
  • [64] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, “On the properties of neural machine translation: Encoder–decoder approaches,” in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, D. Wu, M. Carpuat, X. Carreras, and E. M. Vecchi, Eds.   Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 103–111. [Online]. Available: https://aclanthology.org/W14-4012
  • [65] S. Yun, M. Jeong, R. Kim, J. Kang, and H. J. Kim, “Graph transformer networks,” in Advances in Neural Information Processing Systems, vol. 32.   Curran Associates, Inc., 2019, pp. 11 960–11 970. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/9d63484abb477c97640154d40595a3bb-Paper.pdf
  • [66] A. Bojchevski and S. Günnemann, “Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking,” in International Conference on Learning Representations, 2018, pp. 1–13.
  • [67] Z. Meng, S. Liang, H. Bao, and X. Zhang, “Co-embedding attributed networks,” in Proceedings of the twelfth ACM international conference on web search and data mining, 2019, pp. 393–401.
  • [68] W. Feng, J. Zhang, Y. Dong, Y. Han, H. Luan, Q. Xu, Q. Yang, E. Kharlamov, and J. Tang, “Graph random neural networks for semi-supervised learning on graphs,” Advances in neural information processing systems, vol. 33, pp. 22 092–22 103, 2020.
  • [69] P. Yanardag and S. Vishwanathan, “Deep graph kernels,” in Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 1365–1374.
  • [70] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, “A tutorial on the cross-entropy method,” Annals of operations research, vol. 134, pp. 19–67, 2005.

Supplementary Material
IGNN-Solver: A Graph Neural Solver for Implicit Graph Neural Networks

Appendix A Experimental Details

The experiments are conducted on nine public datasets of different scales for node classification, comprising five widely adopted benchmarks (Citeseer, ACM, CoraFull, BlogCatalog, and Flickr) and four large-scale benchmarks (Amazon-all, Reddit, ogbn-arxiv, and ogbn-products), as well as five bioinformatics benchmarks for graph classification: MUTAG, PTC_MR, COX2, PROTEINS, and NCI1. We conduct inference tests, including performance and efficiency comparisons, under nearly identical hyperparameter settings to prior work [3, 2, 1, 25, 12, 23, 48, 49, 50]. All experiments are run independently five times (i.e., with five random seeds) on a machine with an Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz and a single 3090 GPU. A detailed description of all tasks and additional details are provided below.

A-A Datasets

  • Citeseer [12]: This dataset is a citation network of research papers divided into six categories. The network contains 3,327 scientific papers as nodes and 4,732 citation links as edges. Each node feature is a word vector indicating whether the corresponding words appear in the paper.

  • ACM [65]: A citation network dataset in which nodes represent papers and node features are constructed from paper keywords. Papers are divided into three categories according to the type of conference they appear in.

  • CoraFull [66]: Similar to Citeseer, CoraFull is a well-known citation network labeled by paper topic, containing 19,793 scientific publications. Each publication is classified into one of 70 categories; nodes represent papers and edges represent citations.

  • BlogCatalog [67]: This dataset is a social network composed of bloggers and their social relationships (such as friendships). Node attributes are constructed from keywords in user profiles, and the labels represent bloggers’ interests. All nodes are divided into six categories.

  • Flickr [51]: A social network where nodes represent users and edges correspond to friendships among users. All nodes are divided into nine classes according to the users’ interest groups.

  • Amazon-all [20]: The widely used Amazon-all benchmark is the Amazon product co-purchasing network, in which products are nodes and co-purchases are edges. It includes 58 product types, each with over 5,000 products, selected from a total pool of 75,149 product types.

  • Reddit [68]: The Reddit dataset is a social network where nodes represent posts, and an edge indicates that the same user commented on both connected posts. Each node has a 602-dimensional feature vector.

  • ogbn-arxiv [22]: The ogbn-arxiv dataset is a citation network between all Computer Science (CS) arXiv papers. Each node is an arXiv paper, and each directed edge indicates that one paper cites another. Each paper comes with a 128-dimensional feature vector obtained by averaging the embeddings of words in its title and abstract. The task is to predict which of the 40 subject areas each arXiv CS paper belongs to.

  • ogbn-products [22]: The ogbn-products dataset contains an undirected and unweighted graph, representing an Amazon product co-purchasing network. Nodes represent products sold on Amazon, and edges between two products indicate that the products are purchased together. The task is to predict the category of a product in a multi-class classification setup, where the 47 top-level categories are used for target labels.

  • MUTAG [69]: A dataset of nitroaromatic compounds, where the task is to predict the mutagenic effect on a bacterium. It contains 188 graphs with an average of 17.9 nodes per graph.

  • PTC_MR [69]: A dataset of 344 chemical compounds labeled according to their carcinogenicity in male and female rats. Each compound is represented as a graph with atoms as nodes and chemical bonds as edges; node features encode atom types and other chemical properties. The average number of nodes per graph is 14.3, and the task is binary classification of whether a compound is carcinogenic.

  • COX2 [69]: A dataset of 467 cyclooxygenase-2 (COX-2) inhibitors. The dataset contains chemical compounds that inhibit COX-2 enzyme activity, which is important in inflammation and pain. Each compound is represented as a graph where atoms are nodes, and chemical bonds are edges.

  • PROTEINS [69]: This dataset consists of proteins represented as graphs, where nodes are secondary structure elements, and edges indicate neighborhood in the amino-acid sequence or in 3D space. It includes 1,113 graphs with an average of 39.1 nodes per graph.

  • NCI1 [69]: A dataset of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines. It contains 4,110 graphs with an average of 29.8 nodes per graph.

TABLE IV: Hyper-parameter settings for training. *In addition, we linearly increase the loss weight $\lambda_1$ from 0 to its set value throughout all IGNN-Solver training steps.

Scale | Dataset       | nhid | dropout | lr    | K  | epoch_max | training | test      | λ1* | λ2 | λ3
Small | Citeseer      | 128  | 0.5     | 0.002 | 10 | 100       | 360      | 1,000     | 0.1 | 5  | 1e-4
Small | ACM           | 128  | 0.5     | 0.001 | 10 | 100       | 180      | 1,000     | 0.3 | 5  | 1e-4
Small | CoraFull      | 512  | 0.5     | 0.002 | 15 | 300       | 4,200    | 1,000     | 0.3 | 5  | 1e-4
Small | BlogCatalog   | 512  | 0.5     | 0.001 | 10 | 100       | 360      | 1,000     | 0.5 | 5  | 1e-4
Small | Flickr        | 512  | 0.5     | 0.002 | 15 | 200       | 540      | 1,000     | 0.1 | 5  | 1e-5
Large | Amazon-all    | 128  | 0.5     | 0.005 | 20 | 1000      | 16,970   | 28,285    | 0.3 | 5  | 1e-5
Large | Reddit        | 512  | 0.5     | 0.005 | 15 | 2000      | 139,779  | 46,593    | 0.5 | 5  | 1e-5
Large | ogbn-arxiv    | 128  | 0.5     | 0.001 | 20 | 2000      | 90,941   | 48,603    | 0.5 | 5  | 1e-5
Large | ogbn-products | 128  | 0.5     | 0.001 | 20 | 2000      | 196,615  | 2,213,091 | 0.5 | 5  | 1e-5

A-B Hyper-parameters Setting in IGNN-Solver

We present in Table IV all hyper-parameters preset for the IGNN-Solver across all datasets, where $K$ denotes the maximum number of solver iterations in IGNN-Solver and $\text{epoch}_{\text{max}}$ denotes the maximum number of training steps for IGNN. It is worth noting that we increase $\lambda_1$ linearly so that the intermediate estimation error is penalized more heavily at later training steps. Similarly, we decrease $\lambda_3$ linearly to prevent overfitting $\alpha$ during later training epochs.
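A minimal sketch of these training-time schedules is given below; the linear-interpolation endpoints and the helper names are our assumptions, since only the directions of the ramps are specified above.

```python
def lambda1_at(step, total_steps, lambda1_final):
    """Linearly increase lambda_1 from 0 to its preset value over training steps."""
    return lambda1_final * min(step / total_steps, 1.0)

def lambda3_at(step, total_steps, lambda3_initial):
    """Linearly decrease lambda_3 from its preset value over training steps
    (decaying toward 0 at the end is our assumed endpoint)."""
    return lambda3_initial * max(1.0 - step / total_steps, 0.0)
```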

Appendix B Training Strategies

B-A IGNN Training with frozen IGNN-Solver

During training of the IGNN model, we assume that both the initializer $h_{\phi}$ and the neural solver $s_{\xi}$ of the IGNN-Solver have already approached the desired state via the training method outlined in Section IV, and they are no longer updated during the subsequent model training. For a given IGNN layer $f_{\theta}$, together with the input graph $\widehat{{\bm{A}}}$ and feature matrix ${\bm{X}}$, we define:

g_{\theta}({\bm{Z}}^{\star},\widehat{{\bm{A}}},{\bm{X}}) := f_{\theta}({\bm{Z}}^{\star},\widehat{{\bm{A}}},{\bm{X}}) - {\bm{Z}}^{\star} = 0, \quad (17)

and use the IGNN-Solver to solve (17), obtaining the fixed point ${\bm{Z}}^{\star}$. The fixed point ${\bm{Z}}^{\star}$ is then passed through a linear layer to obtain an embedding ${\bm{H}} = \text{Linear}({\bm{Z}}^{\star})$. Finally, jointly with the downstream task, we compute the Binary Cross Entropy (BCE) loss [70] and back-propagate it to update the parameters of $f_{\theta}$ in a supervised way. The complete procedure is given in Algorithm 3.

Algorithm 3 IGNN model training
1: Require: graph matrix $\widehat{{\bm{A}}}$, feature matrix ${\bm{X}}$, frozen IGNN-Solver $s_{\xi}$ and initializer $h_{\phi}$
2: Initialize ${\bm{Z}}^{[0]} = h_{\phi}({\bm{X}})$ and randomly initialize $f_{\theta}$
3: while stopping condition is not met do
4:     ${\bm{Z}}^{\star} \leftarrow$ solve $f_{\theta}({\bm{Z}}^{\star},\widehat{{\bm{A}}},{\bm{X}}) - {\bm{Z}}^{\star} = 0$  ▷ via frozen $s_{\xi}$, $\widehat{{\bm{A}}}$, and initial value ${\bm{Z}}^{[0]}$
5:     ${\bm{H}} \leftarrow \text{Linear}({\bm{Z}}^{\star})$
6:     $\mathcal{L}_{\text{task}} \leftarrow \text{BCE}({\bm{H}}, {\bm{Y}})$  ▷ compute the label loss
7:     Back-propagate $\mathcal{L}_{\text{task}}$ to update $f_{\theta}$
8: end while
9: return $f_{\theta}$
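For readers who prefer code, the sketch below gives a minimal PyTorch-style rendering of Algorithm 3. It is a sketch under several assumptions: `solver.solve`, `ignn_layer.hidden_dim`, and the one-step gradient surrogate are illustrative placeholders, not the interfaces of the released implementation.

```python
import torch
import torch.nn as nn

def train_ignn(ignn_layer, solver, initializer, A_hat, X, Y, epochs=100, lr=1e-3):
    """Sketch of Algorithm 3: train the IGNN layer f_theta with a frozen IGNN-Solver."""
    readout = nn.Linear(ignn_layer.hidden_dim, Y.shape[1])        # H = Linear(Z*)
    optimizer = torch.optim.Adam(
        list(ignn_layer.parameters()) + list(readout.parameters()), lr=lr)
    criterion = nn.BCEWithLogitsLoss()                            # BCE label loss

    with torch.no_grad():
        Z0 = initializer(X)                                       # Z^[0] = h_phi(X), frozen
    for _ in range(epochs):
        # Solve f_theta(Z*, A_hat, X) - Z* = 0 with the frozen neural solver.
        with torch.no_grad():
            Z_star = solver.solve(lambda Z: ignn_layer(Z, A_hat, X), Z0)
        # One extra application of f_theta re-attaches the equilibrium to the
        # computation graph (a simplified surrogate for the exact implicit gradient).
        Z_star = ignn_layer(Z_star.detach(), A_hat, X)
        loss = criterion(readout(Z_star), Y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return ignn_layer, readout
```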

B-B Advanced IGNN Training Strategy with IGNN-Solver

Given that the model parameters $\theta$ change only gradually over the training iterations (i.e., we assume $\left\|\theta_{t+1}-\theta_{t}\right\|$ depends only on $\theta_{t}$ at training step $t$), we propose training the lightweight IGNN-Solver ($s_{\xi}$, $h_{\phi}$) and the large model $f_{\theta}$ in an alternating manner. Specifically, we adopt the following procedure, with a code sketch given after the list:

(i) Warm up and train the IGNN model and its solver (the IGNN-Solver) for a number of steps.

(ii) Fix the IGNN-Solver $s_{\xi}$, $h_{\phi}$ and solve for the fixed points of $f_{\theta}$ in IGNN via Algorithm 3 for $T_{1}$ steps.

(iii) Fix the current model parameters $\theta$ and fine-tune the IGNN-Solver $s_{\xi}$, $h_{\phi}$ via Algorithm 2 for $T_{2}$ steps.

(iv) Repeat steps (ii) and (iii) until the maximum number of training steps for the IGNN model is reached.

Here, $T_{1}$ is approximately 4% to 10% of $T_{2}$, adjusted as needed, and the sum of all $T_{2}$ values is referred to as $\text{epoch}_{\text{max}}$. Reassuringly, the additional cost of fine-tuning the IGNN-Solver (step (iii)) is more than offset by the substantial benefit of accelerated solving (step (ii)). Moreover, we find that the proposed IGNN-Solver does not lose expressive capability for excessively small or large values of $T_{1}$: thanks to its robust stability, setting $T_{1}$ somewhat too high does not noticeably affect the normal training of the IGNN, while setting it slightly too low still yields good generalization. Thus, the IGNN-Solver shows low sensitivity to $T_{1}$.
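The alternating schedule can be summarized with the following minimal sketch. The helpers `train_ignn_steps` (Algorithm 3) and `finetune_solver_steps` (Algorithm 2) are hypothetical names, the warm-up phase is omitted, and the stopping rule follows the statement above that $\text{epoch}_{\text{max}}$ is the sum of all $T_{2}$ values.

```python
def alternating_training(ignn_layer, solver, initializer, data, T1, T2, epoch_max):
    """Alternate IGNN training (Algorithm 3) with solver fine-tuning (Algorithm 2)."""
    solver_epochs = 0
    while solver_epochs < epoch_max:
        # (ii) Freeze the solver and train the IGNN layer f_theta for T1 steps.
        train_ignn_steps(ignn_layer, solver, initializer, data, steps=T1)
        # (iii) Freeze f_theta and fine-tune the solver (and initializer) for T2 steps.
        finetune_solver_steps(solver, initializer, ignn_layer, data, steps=T2)
        # Per the paper, epoch_max is defined as the sum of all T2 values.
        solver_epochs += T2
    return ignn_layer, solver
```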

Appendix C More Experiments on IGNN-Solver

C-A Training Dynamics

In this section, we present the time-cost trends of IGNN and IGNN-Solver during training and inference on the Amazon-all dataset. From Figure 8, it is evident that IGNN is relatively fast at the beginning of training. However, as training progresses, the model becomes increasingly complex, leading to a sharp rise in the time cost of fixed-point computation. After 100 epochs, this cost remains persistently high and fluctuates continuously, because IGNN reaches the maximum number of iterations without achieving the preset minimum error.

Figure 8: Training time and inference time (s) per epoch of IGNN and IGNN-Solver on the Amazon-all dataset.

In contrast, our IGNN-Solver incurs significantly lower training and inference time than IGNN and remains stable throughout the process. This is attributed to the solver’s consistent memory consumption and its rapid fixed-point computation, which together constitute the unique advantage of IGNN-Solver.

C-B Long-Range Dependencies (LRD)

To evaluate the capability of GNNs to capture long-range dependencies within graphs, we follow the methodology of [1] and create a synthetic dataset called Chains, in which the objective is to classify nodes positioned on a chain of length $l$. The class information is provided sparsely, appearing only as a feature on one end node of the chain. Our experimental setup consists of small datasets: a training set with 20 nodes, a validation set with 100 nodes, and a test set with 200 nodes. For simplicity, we focus on binary classification. Following the experimental protocol of IGNN, we implement and evaluate four representative baseline models.
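For reproducibility, the sketch below shows one way to construct a single chain from this synthetic task, following the description above; the feature dimension and the exact feature encoding are our illustrative assumptions, not necessarily those used in [1].

```python
import numpy as np

def make_chain(length, label, feat_dim=100):
    """Build one chain graph for the synthetic Chains task.
    Class information appears only in the features of one end node; all other
    node features are zero, so correct predictions require propagating
    information along the whole chain."""
    adj = np.zeros((length, length))
    for i in range(length - 1):            # undirected edges i -- i+1
        adj[i, i + 1] = adj[i + 1, i] = 1.0
    feats = np.zeros((length, feat_dim))
    feats[0, label] = 1.0                  # class label encoded at one end node
    labels = np.full(length, label)        # all nodes in the chain share the label
    return adj, feats, labels

# Example: one chain of 10 nodes, all belonging to class 1.
A, X, y = make_chain(length=10, label=1)
```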

Figure 9: Average accuracy with respect to the chain length.

As demonstrated in Figure 9, both IGNN-Solver and IGNN effectively capture long-range dependencies on longer chains, whereas the performance of the graph transformer (GTN) [65] drops slightly once the chain length exceeds 6. In contrast, convolutional GNNs with a fixed number of iterations ($T=2$), such as GCN and GAT, struggle to make meaningful predictions as the chain length grows, owing to their limited capacity to capture dependencies beyond two hops. Interestingly, increasing the number of iterations $T$ for these models does not yield a significant improvement in this setting.