
Is Solving Graph Neural Tangent Kernel Equivalent to Training Graph Neural Network?

Lianke Qin [email protected]. UCSB.    Zhao Song [email protected]. Adobe.    Baocheng Sun [email protected]. Weizmann Institute of Science.

A rising trend in theoretical deep learning is to understand why deep learning works through the Neural Tangent Kernel (NTK) [44], a kernel method that is equivalent to using gradient descent to train a multi-layer infinitely wide neural network. NTK is a major step forward in theoretical deep learning because it allows researchers to use traditional mathematical tools to analyze properties of deep neural networks and to explain various neural network techniques from a theoretical view. A natural extension of NTK to graph learning is the Graph Neural Tangent Kernel (GNTK), and researchers have already provided a GNTK formulation for graph-level regression and shown empirically that this kernel method can achieve similar accuracy as GNNs on various bioinformatics datasets [23]. The remaining question is whether solving GNTK regression is equivalent to training an infinitely wide multi-layer GNN using gradient descent. In this paper, we provide three new theoretical results. First, we formally prove this equivalence for graph-level regression. Second, we present the first GNTK formulation for node-level regression. Finally, we prove the equivalence for node-level regression.

1 Introduction

Many deep learning tasks need to deal with graph data, including social networks [95], bio-informatics [100, 94], recommendation systems [92], and autonomous driving [85, 93]. Due to the importance of these tasks, Graph Neural Networks (GNNs) have become the de facto method for machine learning on graph data. GNNs achieve state-of-the-art results on many graph-based learning tasks compared to approaches built on hand-crafted combinatorial features. As a result, GNNs have delivered convincing performance on a large number of tasks over graph-structured data.

Today, one major trend in theoretical deep learning is to understand when and how deep learning works through the Neural Tangent Kernel (NTK) [44]. NTK is a pure kernel method, and it has been shown that solving NTK regression is equivalent to training a fully-connected, infinitely wide neural network with gradient descent [3, 52]. This is a major step forward in theoretical deep learning because researchers can now use traditional mathematical tools to analyze properties of deep neural networks and to explain various neural network optimizations from a theoretical perspective. For example, [44] uses NTK to explain why early stopping works in training neural networks. [10, 9] build general toolkits to analyze multi-layer networks with ReLU activations and prove why stochastic gradient descent (SGD) can find global minima of the training objective of DNNs in polynomial time.

A natural next step is thus to apply NTK to GNNs, called Graph Neural Tangent Kernel (GNTK). [23] provides the formulation for graph-level regression and demonstrates that GNTK can empirically achieve similar accuracy as GNNs in various bioinformatics datasets.

This raises an important theoretical question:

Is solving Graph Neural Tangent Kernel regression equivalent to using gradient descent to train an infinitely wide graph neural network?

Figure 1: Our definitions of the Graph Neural Tangent Kernel (GNTK) and the Graph Dynamic Kernel. We define W as the parameters of the GNN. When we load the parameters W from W(t), the GNN parameters obtained by gradient descent at iteration t, the inner product of the gradients with respect to the GNN parameters is the Graph Dynamic Kernel \mathsf{K}_{t}(G,H). When we load the parameters W from samples of a normal distribution, the expectation of the inner product of the gradients with respect to the GNN parameters is the GNTK \mathsf{K}_{\mathrm{gntk}}(G,H).

This question is worthwhile because GNTK can potentially enable researchers to understand why and how GNNs work, in the same way that NTK analysis helps explain neural network training. At the same time, this question is more challenging than the equivalence result for NTK, because we need to deal with the structure of the graphs.

In this paper, we provide an affirmative answer. We start from the formulation of graph-level regression in [23]. We provide the first formal proof of the equivalence. To achieve this result, we develop new proof techniques, including iterative Graph Neural Tangent Kernel regression and the Graph Dynamic Kernel. The high-level idea is that we use an iterative regression to bridge the connection between the neural network and the kernel method.

We then consider node-level regression, another graph learning task which has many real-world applications, such as social networks. We provide the first GNTK formulation for node-level regression. We also establish the equivalence result for node-level regression.

Our paper makes the following contributions:

  • We present the first graph neural tangent kernel formulation for node-level regression.

  • We prove the equivalence of graph neural tangent kernel and graph neural network training for both graph-level and node-level regression.

2 Related Work

Kernel methods

Instead of learning some fixed set of parameters based on a set of features, kernel methods can “remember” training examples in their weights [29, 50, 10, 9]. After learning on the set of inputs, the kernel function \mathsf{K} is generated, and the kernel function can be used to predict unlabeled inputs. Our work is related to the recent breakthroughs in drawing the connections between kernel methods and neural networks [22, 21, 44, 17]. To the best of our knowledge, our work is the first that draws the theoretical connections between kernel methods and deep learning in the context of graph neural networks.

Convergence of neural network

Many works study the convergence of deep neural networks [12, 80, 102, 67, 55, 101, 28, 32, 13] under random input assumptions. Recently, instead of assuming random input, several works study the convergence of neural networks in the over-parametrization regime [50, 29, 10, 9, 25, 2, 3, 75, 52, 14, 78, 79]. These works only need a weak assumption on the input data, namely that the input data is separable: the distance between any two input points is sufficiently large. These works prove that if the neural network's width is sufficiently large, the neural network converges under gradient descent. We use this convergence result in our proofs to bound the test predictor of graph neural network training.

Graph neural network (GNNs) and graph neural tangent kernel (GNTK)

Graph neural networks are deep learning methods applied to graphs [97, 83, 16, 96, 90, 63, 89]. They have delivered promising results in many areas of AI, including social networks [95, 82], bio-informatics [100, 94, 31], knowledge graphs [41], recommendation systems [92], and autonomous driving [85, 93]. Due to their convincing performance, GNNs have become a widely applied graph learning method.

Graph neural tangent kernel applies the idea of the neural tangent kernel to GNNs. [23] provides the first formulation for graph-level regression, and it shows that the kernel method can deliver similar accuracy as training graph neural networks on several bio-informatics datasets. However, [23] does not provide a formal proof, and it does not consider node-level regression.

Sketching

Sketching is a well-known technique to improve performance or memory complexity [19]. It has wide applications in linear algebra, such as linear regression and low-rank approximation [19, 57, 56, 65, 71, 38, 6, 72, 73, 24, 35], training over-parameterized neural networks [78, 79, 98], empirical risk minimization [53, 62], linear programming [53, 49, 76], distributed problems [86, 15, 45, 46], clustering [30], generative adversarial networks [91], kernel density estimation [60], tensor decomposition [74], trace estimation [48], projected gradient descent [39, 88], matrix sensing [27, 61, 36], softmax regression [8, 54, 26, 35, 69], John Ellipsoid computation [18, 77], semi-definite programming [34], kernel methods [1, 4, 20, 70, 37], adversarial training [33], anomaly detection [81, 84], the cutting plane method [47], discrepancy [99], federated learning [64], reinforcement learning [5, 87, 68], and relational databases [59].

3 Graph Neural Tangent Kernel

We first consider the graph regression task. Before we continue, we introduce some useful notation. Let G=(U,E) be a graph with node set U and edge set E. A graph contains |U|=N nodes. The dimension of the feature vector of each input node is d. \mathcal{N} is the neighboring matrix that represents the topology of G. Each node u\in U has a feature vector h(G)_{u}\in\mathbb{R}^{d}. \mathcal{N}(u) represents all the neighboring nodes of u. A dataset contains n graphs. We denote the training graphs by \mathsf{G}=\{G_{1},\cdots,G_{n}\} with G_{i}\in\mathbb{R}^{d\times N}, i\in[n]. We denote by Y\in\mathbb{R}^{n} the label vector corresponding to \mathsf{G}. Our goal is to learn to predict the label of a graph G_{\mathrm{test}}. For presentation simplicity, we fix an ordering of all the nodes in a graph, so that we can use u_{i} to denote the i-th node in the graph. We use [k] to denote the set \{1,2,\cdots,k\}.

3.1 Graph Neural Network

For simplicity, we only consider a single-level, single-layer GNN. A graph neural network (GNN) consists of three operations: Aggregate, Combine and ReadOut.

Aggregate. The Aggregate operation aggregates the information from the neighboring nodes. We define the sum of feature vectors of neighboring nodes as h_{G,u}:=\sum_{v\in\mathcal{N}(u)}h(G)_{v}.

Let R:=\max_{i\in[n]}\max_{u\in G_{i}}\|h(G_{i})_{u}\|_{2}. We have \|\sum_{v\in{\cal N}(u)}h(G_{i})_{v}\|_{2}\leq\sum_{v\in{\cal N}(u)}\|h(G_{i})_{v}\|_{2}\leq R\cdot|{\cal N}(u)|, where the first step follows from the triangle inequality and the second step follows from the definition of R. As a result, \max_{i\in[n]}\max_{u\in G_{i}}\|\sum_{v\in{\cal N}(u)}h(G_{i})_{v}\|_{2}\leq R\max_{i\in[n]}\max_{u\in G_{i}}|{\cal N}(u)|. When the graphs in \mathsf{G} are sparse, the upper bound above becomes \alpha R, where \alpha is a constant satisfying \alpha\ll N. When the graphs are dense, the upper bound becomes NR.

Combine. The Combine operation applies a fully connected layer with ReLU activation. W=[w_{1},\cdots,w_{m}]\in\mathbb{R}^{d\times m} represents the weights in the neural network, \sigma is the ReLU function (\sigma(z)=\max\{0,z\}), and m represents the output dimension of the neural network. This layer can be formulated as h^{\prime}_{G,u}:=\frac{1}{\sqrt{m}}\cdot\sigma(W^{\top}h_{G,u})\in\mathbb{R}^{m}.

ReadOut. The final output of the GNN on graph G is generated by the ReadOut operation. We take a summation over the output features with weights a_{r}\in\{-1,+1\}, r\in[m], i.e., a=[a_{1},\cdots,a_{m}]^{\top}, which is a standard setting in the literature [29, 75, 14]. This layer can be formulated as f_{\mathrm{gnn}}(G):=\sum_{r=1}^{m}a_{r}\sum_{u\in G}[h^{\prime}_{G,u}]_{r}\in\mathbb{R}.

Combining the previous three operations, we now have the definition of a graph neural network function.

Definition 3.1 (Graph neural network function).

We define a graph neural network f_{\mathrm{gnn}}(W,a,G):\mathbb{R}^{d\times m}\times\mathbb{R}^{m}\times\mathbb{R}^{d\times N}\rightarrow\mathbb{R} of the following form: f_{\mathrm{gnn}}(W,a,G)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sum_{l=1}^{N}\sigma(w_{r}^{\top}\sum_{v\in{\cal N}(u_{l})}h(G)_{v}).

Furthermore, we denote f_{\mathrm{gnn}}(W,a,\mathsf{G})=[f_{\mathrm{gnn}}(W,a,G_{1}),\cdots,f_{\mathrm{gnn}}(W,a,G_{n})]^{\top} to represent the GNN output on the training set.

The normalization 1/\sqrt{m} is a standard formulation for GNN [23]. Similar to other works in theoretical deep learning [50, 10, 29, 75, 14, 79, 7, 42], we consider only training W while fixing a, so we can write f_{\mathrm{gnn}}(W,G)=f_{\mathrm{gnn}}(W,a,G).
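To make Definition 3.1 concrete, the following is a minimal NumPy sketch of the forward pass. The representation of a graph as a feature matrix h together with neighbor index lists, and the particular values of d, m, and N, are illustrative assumptions of the sketch rather than part of the formal setup.

import numpy as np

def gnn_forward(W, a, h, neighbors):
    # W: (d, m) Combine weights; a: (m,) ReadOut signs in {-1, +1}
    # h: (N, d) node features; neighbors: list of neighbor index lists
    d, m = W.shape
    out = 0.0
    for u in range(h.shape[0]):
        # Aggregate: sum the feature vectors of the neighbors of u
        h_agg = h[neighbors[u]].sum(axis=0) if len(neighbors[u]) > 0 else np.zeros(d)
        # Combine: fully connected layer with ReLU, scaled by 1/sqrt(m)
        h_comb = np.maximum(W.T @ h_agg, 0.0) / np.sqrt(m)
        # ReadOut: weighted sum over the m output coordinates, summed over nodes
        out += a @ h_comb
    return out

# Illustrative usage with random data.
rng = np.random.default_rng(0)
d, m, N = 4, 64, 5
W = rng.standard_normal((d, m))
a = rng.choice([-1.0, 1.0], size=m)
h = rng.standard_normal((N, d))
neighbors = [[1, 2], [0], [0, 3], [2, 4], [3]]
print(gnn_forward(W, a, h, neighbors))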

3.2 Graph Neural Tangent Kernel

We now translate the GNN architecture to its corresponding Graph Neural Tangent Kernel (GNTK).

Definition 3.2 (Graph neural tangent kernel).

We define the graph neural tangent kernel (GNTK) corresponding to the graph neural network f_{\mathrm{gnn}} as follows:

\displaystyle\mathsf{K}_{\mathrm{gntk}}(G,H)=\operatorname*{{\mathbb{E}}}_{W}\left[\left\langle\frac{\partial f_{\mathrm{gnn}}(W,G)}{\partial W},\frac{\partial f_{\mathrm{gnn}}(W,H)}{\partial W}\right\rangle\right],

where

  • G,H are any input graphs,

  • and the expectation is taken over

    \displaystyle w_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\ r=1,\cdots,m.

We define H^{\mathrm{cts}}\in\mathbb{R}^{n\times n} as the kernel matrix between training data, with [H^{\mathrm{cts}}]_{i,j}=\mathsf{K}_{\mathrm{gntk}}(G_{i},G_{j})\in\mathbb{R}. We denote the smallest eigenvalue of H^{\mathrm{cts}} by \Lambda_{0}.

In Figure 1, we provide an illustration of the GNTK. We sample the parameters from a normal distribution and then load the sampled parameters into the GNN. We use those loaded parameters to perform the forward pass on the graph and then calculate the gradient. We feed each pair of graphs in the dataset \mathsf{G} to the same GNN with the same parameters and calculate the gradient with respect to the GNN parameters W. Finally, taking the expectation of the inner product of the gradients for each pair of graphs, we obtain the GNTK. The GNTK is a constant and does not change during the GNN training process.
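Since the expectation in Definition 3.2 is over random weights, it can be approximated numerically by sampling: draw several weight matrices from the normal distribution, compute the gradient of f_gnn with respect to W for each graph, and average the inner products. The sketch below does this for the single-layer GNN above; each graph is passed as a pair (h, neighbors), and the number of samples is an illustrative assumption.

import numpy as np

def aggregate(h, neighbors):
    # Row u is sum_{v in N(u)} h_v, i.e. the Aggregate output h_{G,u}.
    return np.stack([h[nb].sum(axis=0) for nb in neighbors])

def gnn_grad_W(W, a, h, neighbors):
    # Gradient of f_gnn(W, a, G) with respect to W for the single-layer GNN.
    m = W.shape[1]
    H_agg = aggregate(h, neighbors)          # (N, d)
    pre = H_agg @ W                          # (N, m), entries w_r^T h_{G,u}
    act = (pre > 0).astype(float)            # ReLU derivative
    # Column r equals (a_r / sqrt(m)) * sum_u 1[pre_{u,r} > 0] * h_{G,u}.
    return (H_agg.T @ act) * a / np.sqrt(m)  # (d, m)

def gntk_monte_carlo(G1, G2, d, m, a, num_samples=200, seed=0):
    # Estimate K_gntk(G, H) = E_W[<df/dW(G), df/dW(H)>] by sampling W.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_samples):
        W = rng.standard_normal((d, m))
        total += np.sum(gnn_grad_W(W, a, *G1) * gnn_grad_W(W, a, *G2))
    return total / num_samples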

We define the corresponding feature function as follows:

Definition 3.3 (Feature function).

Let \mathcal{F} be a Hilbert space.

  • We denote the feature function corresponding to the kernel \mathsf{K}_{\mathrm{gntk}} as \Phi:\mathbb{R}^{d\times N}\rightarrow\mathcal{F}, which satisfies \langle\Phi(G),\Phi(H)\rangle_{\mathcal{F}}=\mathsf{K}_{\mathrm{gntk}}(G,H) for any graph data G, H.

  • We write \Phi(\mathsf{G})=[\Phi(G_{1}),\cdots,\Phi(G_{n})]^{\top}.

Next, we provide the definition of Graph Neural Tangent Kernel regression.

Definition 3.4 (Graph Neural Tangent Kernel regression).

We consider the following neural tangent kernel regression problem: \min_{\beta}\frac{1}{2}\|Y-\kappa f_{\mathrm{gntk}}(\beta,\mathsf{G})\|_{2}^{2} where

  • \beta\in\mathcal{F} is the weight of the kernel regression,

  • f_{\mathrm{gntk}}(\beta,\mathsf{G})=\Phi(\mathsf{G})\beta\in\mathbb{R}^{n} denotes the prediction function.
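Once the kernel values are available, solving the regression in Definition 3.4 reduces to a linear solve against the kernel matrix, and the multiplier \kappa cancels in the resulting test prediction. A minimal sketch, assuming H^{\mathrm{cts}} is positive definite (i.e., \Lambda_{0}>0) so the solve is well posed:

import numpy as np

def gntk_regression_predict(H_cts, Y, k_test):
    # H_cts:  (n, n) kernel matrix with [H_cts]_{ij} = K_gntk(G_i, G_j)
    # k_test: (n,)   vector with entries K_gntk(G_test, G_i)
    # Returns the GNTK regression prediction u*_test = k_test^T (H_cts)^{-1} Y.
    alpha = np.linalg.solve(H_cts, Y)
    return k_test @ alpha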

3.3 Notations for Node-Level regression

Here, we consider the node-level regression task. In our setting, the dataset contains a single graph G=(V,E), where V denotes the set of nodes in graph G and E denotes the set of edges in graph G. Suppose that |V|=N. The dataset also contains labels Y=(y_{1},\cdots,y_{N}), with y_{i} corresponding to the label of node u_{i}, i\in[N], in the graph.

In this paper, we consider two different settings for node-level regression training: inductive training and transductive training. We partition V into two disjoint sets V_{\mathrm{train}} and V_{\mathrm{test}}. In the transductive training process, we can observe the whole graph G=(V,E); we can observe the labels of nodes in V_{\mathrm{train}}, but we cannot observe the labels of nodes in V_{\mathrm{test}}. In the inductive training process, let E_{\mathrm{train}}=\{(u,v)\in E\ |\ u\in V_{\mathrm{train}}\wedge v\in V_{\mathrm{train}}\} and G_{\mathrm{train}}=(V_{\mathrm{train}},E_{\mathrm{train}}); we can only observe the subgraph G_{\mathrm{train}}=(V_{\mathrm{train}},E_{\mathrm{train}}) and the features of nodes in V_{\mathrm{train}}. During the test process, we can observe the whole graph G=(V,E) in both the inductive and the transductive setting, but we cannot observe any label. We prove the equivalence between GNN and GNTK for both settings. For simplicity, we only consider the transductive setting here and leave the inductive setting to the appendix.

In the rest of this section, we will provide the definitions of GNN and GNTK for node level regression and the definition of their corresponding training process.

We first give the definition of the GNN at the node level.

Definition 3.5 (Graph neural network function (node level)).

Given the graph G, we define a graph neural network f_{\mathrm{gnn},\mathrm{node}}(W,a,u):\mathbb{R}^{d\times m}\times\mathbb{R}^{m}\times[N]\rightarrow\mathbb{R} of the following form:

\displaystyle f_{\mathrm{gnn},\mathrm{node}}(W,a,u)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sigma(w_{r}^{\top}\sum_{v\in{\cal N}(u)}h(G)_{v}).

Besides, we denote

\displaystyle f_{\mathrm{gnn},\mathrm{node}}(W,a,G)=[f_{\mathrm{gnn},\mathrm{node}}(W,a,u_{1}),\cdots,f_{\mathrm{gnn},\mathrm{node}}(W,a,u_{N})]^{\top}\in\mathbb{R}^{N}

to represent the GNN output on the training set.

Note that at the node level, the definition of the GNN does not contain a ReadOut layer; there is only a single Aggregate layer and a single Combine layer.
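A minimal NumPy sketch of the node-level network of Definition 3.5; it returns one prediction per node because there is no ReadOut layer. As before, the (features, neighbor lists) graph representation is an assumption of the sketch.

import numpy as np

def gnn_node_forward(W, a, h, neighbors):
    # Returns an (N,) vector whose u-th entry is f_{gnn,node}(W, a, u).
    m = W.shape[1]
    H_agg = np.stack([h[nb].sum(axis=0) for nb in neighbors])  # Aggregate: (N, d)
    H_comb = np.maximum(H_agg @ W, 0.0) / np.sqrt(m)           # Combine: (N, m)
    return H_comb @ a                                          # no ReadOut: per-node outputs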

We now give the definition of the GNTK at the node level as follows:

Definition 3.6 (Graph neural tangent kernel (node level)).

We define the graph neural tangent kernel (GNTK) corresponding to the graph neural network f_{\mathrm{gnn},\mathrm{node}} as follows:

\displaystyle\mathsf{K}_{\mathrm{gntk},\mathrm{node}}(u,v)=\operatorname*{{\mathbb{E}}}_{W}\left[\left\langle\frac{\partial f_{\mathrm{gnn},\mathrm{node}}(W,u)}{\partial W},\frac{\partial f_{\mathrm{gnn},\mathrm{node}}(W,v)}{\partial W}\right\rangle\right],

where

  • u,v are any input nodes in graph G,

  • and the expectation is taken over

    \displaystyle w_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\ r=1,\cdots,m.

We define H^{\mathrm{cts},\mathrm{node}}\in\mathbb{R}^{N\times N} as the kernel matrix between training data, with [H^{\mathrm{cts},\mathrm{node}}]_{i,j}=\mathsf{K}_{\mathrm{gntk},\mathrm{node}}(u_{i},u_{j})\in\mathbb{R}. We denote the smallest eigenvalue of H^{\mathrm{cts},\mathrm{node}} by \Lambda_{0,\mathrm{node}}.

In the graph-level Definition 3.2, each kernel entry involves two different graphs, while in the node-level Definition 3.6, each kernel entry involves two different nodes in a single graph.

The training process of the GNN at the node level is similar to the training of the GNN at the graph level defined in Definition D.2.

Definition 3.7 (Training graph neural network (node level)).

Let \kappa\in(0,1) be a small multiplier. We initialize the network as a_{r}\overset{i.i.d.}{\sim}\operatorname{unif}[\{-1,1\}],\ w_{r}(0)\overset{i.i.d.}{\sim}\mathcal{N}(0,I_{d}). We consider solving the following optimization problem using gradient descent:

\displaystyle\min_{W}\frac{1}{2}\|Y-\kappa f_{\mathrm{gnn},\mathrm{node}}(W,G)\|_{2}^{2}.

Furthermore, we define

  • w_{r}(t),r\in[m] as the weight at iteration t,

  • training data predictor at iteration t as u_{\mathrm{gnn},\mathrm{node}}(t)=\kappa f_{\mathrm{gnn},\mathrm{node}}(W(t),G),

  • and test data predictor at iteration t as

    \displaystyle u_{\mathrm{gnn},\mathrm{node},\mathrm{test}}(t)=\kappa f_{\mathrm{gnn},\mathrm{node}}(W(t),v_{\mathrm{test}}).
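A minimal sketch of the gradient-descent training of Definition 3.7, with a held fixed; the learning rate, width, and number of steps are illustrative assumptions of the sketch.

import numpy as np

def train_gnn_node(Y, h, neighbors, m=256, kappa=0.1, lr=0.1, T=500, seed=0):
    # Gradient descent on (1/2) * ||Y - kappa * f_{gnn,node}(W, G)||_2^2 with a fixed.
    rng = np.random.default_rng(seed)
    N, d = h.shape
    W = rng.standard_normal((d, m))                    # w_r(0) ~ N(0, I_d)
    a = rng.choice([-1.0, 1.0], size=m)                # a_r ~ unif{-1, +1}, not trained
    H_agg = np.stack([h[nb].sum(axis=0) for nb in neighbors])   # (N, d) aggregated features
    for _ in range(T):
        pre = H_agg @ W                                # (N, m)
        u = kappa / np.sqrt(m) * (np.maximum(pre, 0.0) @ a)     # predictor u_{gnn,node}(t)
        residual = Y - u                               # (N,)
        act = (pre > 0).astype(float) * a              # (N, m): 1[w_r^T h_{G,u} > 0] * a_r
        grad_W = -kappa / np.sqrt(m) * (H_agg.T @ (residual[:, None] * act))
        W -= lr * grad_W                               # gradient step on W only
    return W, a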

The regression with the GNTK at the node level is similar to the regression with the GNTK at the graph level defined in Definition 3.4.

Definition 3.8 (Graph Neural Tangent Kernel regression (node level)).

We consider the following neural tangent kernel regression problem:

\displaystyle\min_{\beta}\frac{1}{2}\|Y-\kappa f_{\mathrm{gntk},\mathrm{node}}(\beta,G)\|_{2}^{2}

where

  • \beta\in\mathcal{F} is the weight of the kernel regression.

  • f_{\mathrm{gntk},\mathrm{node}}(\beta,G)=\Phi(G)\beta\in\mathbb{R}^{N} denotes the prediction function.

4 Equivalence between GNTK and infinite-wide GNN

4.1 Equivalence result for graph-level regression

Our main result is the following equivalence between solving kernel regression and using gradient descent to train an infinitely wide single-layer graph neural network. It states that if the neural network width m is wide enough and the number of training iterations T is large enough, then the difference between the test data predictor of the graph neural network, u_{\mathrm{gnn},\mathrm{test}}(T), and that of the Graph Neural Tangent Kernel regression, u_{\mathrm{test}}^{*}, can be made smaller than any positive accuracy \varepsilon.

In our result, we use the fact that the inputs are bounded, which is a standard setting in the optimization field [52]. Let R denote the parameter that bounds the feature vectors h(G)_{u}: \max_{i\in[n]}\max_{u\in G_{i}}\|h(G_{i})_{u}\|_{2}\leq R. This means that there is a bound on the maximum norm of the input data.

The equivalence result is as follows:

Theorem 4.1 (Equivalence result for graph level regression).

Given

  • training graph data \mathsf{G}=\{G_{1},\cdots,G_{n}\},

  • corresponding label vector Y\in\mathbb{R}^{n},

  • arbitrary test data G_{\mathrm{test}}.

Let

  • T>0 be the total number of iterations,

  • N be the number of nodes in a graph,

  • u_{\mathrm{gnn},\mathrm{test}}(t)\in\mathbb{R} and u_{\mathrm{test}}^{*}\in\mathbb{R} be the test data predictors defined in Definition D.2 and Definition 3.4 respectively.

For any accuracy \varepsilon\in(0,1/10) and failure probability \delta\in(0,1/10), if the following conditions hold:

  • the multiplier \kappa=\widetilde{O}(N^{-2}\mathrm{poly}(\varepsilon,\Lambda_{0},R^{-1},n^{-1})),

  • the total iterations T=\widetilde{O}(N^{4}\mathrm{poly}(\varepsilon^{-1},\Lambda_{0}^{-1},R,n)),

  • the width of the neural network m\geq\widetilde{O}(N^{2}\mathrm{poly}(\varepsilon^{-1},\Lambda_{0}^{-1},n,d)),

  • and the smallest eigenvalue of the GNTK \Lambda_{0}>0,

then for any G_{\mathrm{test}}, with probability at least 1-\delta over the random initialization, we have

\displaystyle\|u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}\|_{2}\leq\varepsilon.

Here \widetilde{O}(\cdot) hides \mathrm{poly}\log(Nn/(\varepsilon\delta\Lambda_{0})).

Note that as \varepsilon goes to 0, the required width of the GNN goes to infinity; our result thus implies that \|u_{\mathrm{gnn},\mathrm{test}}(T)-u^{*}_{\mathrm{test}}\|_{2}\rightarrow 0 as the width of the GNN m\rightarrow\infty and the number of training iterations T\rightarrow\infty. It is worth noting that our result is a non-asymptotic convergence result: we provide explicit lower bounds on the number of training iterations T and the width of the layers m.

The number of nodes N has a non-linear impact on the required training iterations T and the width of the layer m. Our Theorem 4.1 shows that m depends on N^{2} and T depends on N^{4}.

Moreover, the number of vertices does not need to be fixed in our Theorem 4.1. In fact, for any graph that contains fewer than N vertices, we can add dummy vertices with zero features until the graph contains N vertices. This reduces to the case where all graphs in the dataset \mathsf{G} contain N vertices.
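A small sketch of the padding described above, under the same (features, neighbor lists) graph representation used in the earlier sketches; the added dummy vertices are isolated and carry zero features, so they do not change the Aggregate output of any original node.

import numpy as np

def pad_graph(h, neighbors, N):
    # Pad a graph with isolated dummy vertices (zero features) up to N vertices.
    n_cur, d = h.shape
    h_pad = np.vstack([h, np.zeros((N - n_cur, d))])
    neighbors_pad = list(neighbors) + [[] for _ in range(N - n_cur)]
    return h_pad, neighbors_pad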

Next, we show the generalized \delta-separation assumption under which the condition \Lambda_{0}>0 in Theorem 4.1 always holds.

Theorem 4.2 (Spectral gap of shifted GNTK).

For a graph neural network with shifted rectified linear unit (ReLU) activation, f_{\mathrm{gnn},\mathrm{shift}}(W,a,b,G)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sum_{l=1}^{N}\sigma(w_{r}^{\top}\sum_{v\in{\cal N}(u_{l})}h(G)_{v}-b), where b\in\mathbb{R}, we define H^{\mathrm{cts}}_{\mathrm{shift}}\in\mathbb{R}^{n\times n} as the shifted kernel matrix between training data:

\displaystyle[H^{\mathrm{cts}}_{\mathrm{shift}}]_{i,j}=\operatorname*{{\mathbb{E}}}_{W}\left[\left\langle\frac{\partial f_{\mathrm{gnn},\mathrm{shift}}(W,G)}{\partial W},\frac{\partial f_{\mathrm{gnn},\mathrm{shift}}(W,H)}{\partial W}\right\rangle\right],

where

  • G,H are any input graphs,

  • and the expectation is taken over w_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\ r=1,\cdots,m.

Let

  • h_{G,u}=\sum_{v\in\mathcal{N}(u)}h(G)_{v},\ x_{i,p}=h_{G_{i},u_{p}},\ i\in[n],p\in[N].

  • initialize w_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\ r=1,\cdots,m.

Assume that our dataset satisfies the \delta-separation condition:

For any i,j\in[n],\ p,q\in[N] with (i,p)\neq(j,q),

\displaystyle\min\{\|\overline{x}_{i,p}-\overline{x}_{j,q}\|_{2},\|\overline{x}_{i,p}+\overline{x}_{j,q}\|_{2}\}\geq\delta,

where \overline{x}_{i,p}=x_{i,p}/\|x_{i,p}\|_{2}. Then we claim that, for \lambda:=\lambda_{\min}(H^{\mathrm{cts}}_{\mathrm{shift}}),

\displaystyle\exp(-b^{2}/2)\geq\lambda\geq\exp(-b^{2}/2)\cdot\mathrm{poly}(\delta,n^{-1},N^{-1}).

In this theorem, we consider the shifted GNTK, which is motivated by the shifted NTK [78] and is more general than the standard GNTK definition [23]. Here, we extend \delta-separation to the graph setting. The theorem says that if \delta-separation is satisfied for the graphs, then the Graph Neural Tangent Kernel is positive definite. More specifically, in the theorem above, when we set b=0, we have that \Lambda_{0}=\lambda\geq\mathrm{poly}(\delta,n^{-1},N^{-1}).
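For reference, the separation parameter of a concrete dataset can be computed directly: reading the assumption as a lower bound, the largest admissible \delta is the minimum, over distinct pairs of normalized nonzero aggregated features, of \min\{\|\overline{x}-\overline{x}'\|_{2},\|\overline{x}+\overline{x}'\|_{2}\}. A minimal sketch, where stacking all x_{i,p} into one array is an assumption of the sketch:

import numpy as np

def separation_delta(X):
    # X: (M, d) rows are the nonzero aggregated feature vectors x_{i,p}.
    Xbar = X / np.linalg.norm(X, axis=1, keepdims=True)
    delta = np.inf
    M = Xbar.shape[0]
    for i in range(M):
        for j in range(i + 1, M):
            sep = min(np.linalg.norm(Xbar[i] - Xbar[j]),
                      np.linalg.norm(Xbar[i] + Xbar[j]))
            delta = min(delta, sep)
    return delta   # the dataset satisfies delta-separation for any delta up to this value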

4.2 Equivalence result for node-level regression

We also prove the equivalence result for the node-level regression setting. The result is as follows:

Theorem 4.3 (Equivalence result for node level regression, informal version of Theorem D.31).

Given

  • training graph data G,

  • corresponding label vector Y\in\mathbb{R}^{N},

  • arbitrary test data v.

Let

  • T>0 be the total number of iterations.

  • u_{\mathrm{gnn},\mathrm{node},\mathrm{test}}(t)\in\mathbb{R} and u_{\mathrm{test},\mathrm{node}}^{*}\in\mathbb{R} be the test data predictors defined in Definition 3.7 and Definition 3.8 respectively.

For any accuracy \varepsilon\in(0,1/10) and failure probability \delta\in(0,1/10), if

  • the multiplier \kappa=\widetilde{O}(N^{-1}\mathrm{poly}(\varepsilon,\Lambda_{0,\mathrm{node}},R^{-1})),

  • the total iterations T=\widetilde{O}(N^{2}\mathrm{poly}(\varepsilon^{-1},\Lambda_{0,\mathrm{node}}^{-1})),

  • the width of the neural network m\geq\widetilde{O}(N^{10}\mathrm{poly}(\varepsilon^{-1},\Lambda_{0,\mathrm{node}}^{-1},d)),

  • and the smallest eigenvalue of the node-level GNTK \Lambda_{0,\mathrm{node}}>0,

then for any v_{\mathrm{test}}, with probability at least 1-\delta over the random initialization, we have

\displaystyle\|u_{\mathrm{gnn},\mathrm{test},\mathrm{node}}(T)-u_{\mathrm{test},\mathrm{node}}^{*}\|_{2}\leq\varepsilon.

Here \widetilde{O}(\cdot) hides \mathrm{poly}\log(N/(\varepsilon\delta\Lambda_{0,\mathrm{node}})).

Our result shows that the width of the neural network m depends on N^{10}, where N is the number of nodes in the graph. The total number of training iterations T depends on N^{2}.

5 Overview of Techniques

High-level Approach

Given any test graph G_{\mathrm{test}}, our goal is to bound the distance between the prediction of graph kernel regression, u_{\mathrm{test}}^{*}, and the prediction of the GNN, u_{\mathrm{gnn},\mathrm{test}}(T), at training iteration t=T. Our main approach is to consider an iterative GNTK regression process u_{\mathrm{gntk},\mathrm{test}}(t) that is optimized using gradient descent.

Solving the GNTK kernel regression is a single-step process, while training a GNN involves multiple iterations. Our proposed iterative GNTK regression u_{\mathrm{gntk},\mathrm{test}}(t), which is optimized by gradient descent, bridges this gap. We prove that when the iterative GNTK regression finishes, the final prediction on G_{\mathrm{test}} is exactly the same as solving GNTK regression in a single step. Thus, to prove GNTK = GNN, the remaining step is to prove the equivalence of iterative GNTK regression and the GNN training process.

More specifically, we use the iterative GNTK regression's test data predictor u_{\mathrm{gntk},\mathrm{test}}(T) at training iteration T as a bridge to connect the GNTK regression predictor u_{\mathrm{test}}^{*} and the GNN predictor u_{\mathrm{gnn},\mathrm{test}}(T). There are three major steps to prove our main result.

  • By picking a sufficiently large number of training iterations T, we bound the difference between the test predictor of iterative GNTK regression at iteration T and the test predictor after GNTK regression converges, |u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}|\leq\varepsilon/2, for any accuracy \varepsilon>0.

  • After fixing the number of training iterations T in Step 1, we bound the test predictor difference between the GNN and the iterative GNTK regression at iteration T, |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|\leq\varepsilon/2.

  • Lastly, we combine the results of Step 1 and Step 2 using the triangle inequality. This finishes the proof of the equivalence between GNN training and GNTK kernel regression, i.e., |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}|\leq\varepsilon.

For the first part, the high-level idea is that u_{\mathrm{test}}^{*} is the limit of u_{\mathrm{gntk},\mathrm{test}}(t) as t\to\infty; we then obtain the desired result from the linear convergence of the gradient flow of u_{\mathrm{gntk},\mathrm{test}}(t). The second part needs more sophisticated tools. The high-level idea is that we split the second part into two sub-problems using the Newton-Leibniz formula and the almost-invariant property of the Graph Dynamic Kernel. The third part follows from the triangle inequality.

Iterative Graph Neural Tangent Kernel regression

Iterative Graph Neural Tangent Kernel regression provides a powerful theoretical tool to compare the GNN training process with GNTK regression. We consider solving the following optimization problem using gradient descent with \beta(0) initialized as 0:

\displaystyle\min_{\beta}\frac{1}{2}\|Y-\kappa f_{\mathrm{gntk}}(\beta,\mathsf{G})\|_{2}^{2},

where f_{\mathrm{gntk}}(\beta,\mathsf{G}) is defined in Definition 3.4. At iteration t, we define the weight as \beta(t), the training data predictor as u_{\mathrm{gntk}}(t)=\kappa\Phi(\mathsf{G})\beta(t)\in\mathbb{R}^{n}, and the test data predictor as u_{\mathrm{gntk},\mathrm{test}}(t)=\kappa\Phi(G_{\mathrm{test}})^{\top}\beta(t)\in\mathbb{R}.
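Because gradient descent on \beta only enters the analysis through the induced dynamics on the predictors, the iterative GNTK regression can be simulated in the kernel (dual) form without materializing \Phi. A minimal sketch, assuming the kernel matrix H^{\mathrm{cts}} and the test kernel vector are available and the step size is small enough for convergence:

import numpy as np

def iterative_gntk_regression(H_cts, Y, k_test, kappa=0.1, lr=0.5, T=10000):
    # Gradient descent on beta with beta(0) = 0 induces, on the predictors,
    #   u(t+1)      = u(t)      + lr * kappa^2 * H_cts  @ (Y - u(t))
    #   u_test(t+1) = u_test(t) + lr * kappa^2 * k_test @ (Y - u(t)),
    # where u(t) = kappa * Phi(G) beta(t) and u_test(t) = kappa * Phi(G_test)^T beta(t).
    u, u_test = np.zeros_like(Y, dtype=float), 0.0
    for _ in range(T):
        residual = Y - u
        u = u + lr * kappa**2 * (H_cts @ residual)
        u_test = u_test + lr * kappa**2 * (k_test @ residual)
    return u, u_test   # u_test approaches k_test^T (H_cts)^{-1} Y as T grows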

Bridge GNTK regression and iterative GNTK regression

We now state the relationship between GNTK regression and iterative GNTK regression. As the iteration count goes to infinity, we obtain the final training data predictor u^{*}=\lim_{t\to\infty}u_{\mathrm{gntk}}(t) and the final test data predictor u_{\mathrm{test}}^{*}=\lim_{t\to\infty}u_{\mathrm{gntk},\mathrm{test}}(t). Due to strong convexity [52], the gradient flow converges to the optimal solution of the objective \|Y-\kappa f_{\mathrm{gntk}}(\beta,\mathsf{G})\|_{2}^{2}; therefore, the optimal parameters obtained from GNTK regression are

\displaystyle\beta^{*}=\lim_{t\to\infty}\beta(t)=\kappa^{-1}(\Phi(\mathsf{G})^{\top}\Phi(\mathsf{G}))^{-1}\Phi(\mathsf{G})^{\top}Y,

which indicates that the optimal parameters \beta^{*} are exactly \lim_{t\to\infty}\beta(t), the parameters of iterative GNTK regression when training finishes.

Graph Dynamic Kernel

Before we continue, we introduce the Graph Dynamic Kernel for the GNN training process. The Graph Dynamic Kernel helps us analyze the dynamics of the GNN training process. For any data G,H\in\mathbb{R}^{d\times N}, we define \mathsf{K}_{t}(G,H)\in\mathbb{R} as

\displaystyle\mathsf{K}_{t}(G,H)=\left\langle\frac{\partial f_{\mathrm{gnn}}(W(t),G)}{\partial W(t)},\frac{\partial f_{\mathrm{gnn}}(W(t),H)}{\partial W(t)}\right\rangle,

where W(t)\in\mathbb{R}^{d\times m} denotes the parameters of the neural network at iteration t, as defined in Definition D.2. Similar to the tangent kernel, we define \mathsf{H}(t)\in\mathbb{R}^{n\times n} as

\displaystyle[\mathsf{H}(t)]_{i,j}=\mathsf{K}_{t}(G_{i},G_{j})\in\mathbb{R}.

As in Figure 1, this concept has some common points with the GNTK defined in Definition 3.2: both of them load parameters W, and both of them compute the inner product of a pair of gradients. However, the Graph Dynamic Kernel differs from the GNTK in a few ways. First, the GNTK is an expectation taken over a normal distribution, while the Graph Dynamic Kernel is not an expectation. Second, the GNTK is a constant during the training process, while the Graph Dynamic Kernel changes with the training iteration t. Third, the Graph Dynamic Kernel loads parameters from the GNN training process, while the GNTK loads parameters sampled from a normal distribution.

The Graph Dynamic Kernel is quite different from the Dynamic Kernel defined in the NTK literature [29, 75, 52, 78]. Our key result is that the Graph Dynamic Kernel is almost invariant during the training process, meaning that it becomes a constant as the width of the neural network goes to infinity. Although it is already known that the Dynamic Kernel has similar properties [3, 52], no previous work shows whether the almost-invariant property still holds for the Graph Dynamic Kernel. In fact, GNNs have Combine and Aggregate layers, while standard neural networks do not, and whether these layers affect the almost-invariant property is not trivial. Indeed, the perturbation of the Graph Dynamic Kernel during the training process depends polynomially on N, the number of nodes in a graph. As a result, for a graph neural network, the width should be chosen larger to ensure that it behaves like a GNTK.

Linear convergence

We first show that both the GNN and iterative GNTK regression converge linearly. Linear convergence can be proved by calculating the gradient flow of the GNTK regression predictor u_{\mathrm{gntk},\mathrm{test}}(t) and the GNN training predictor u_{\mathrm{gnn},\mathrm{test}}(t) with respect to the training iteration t. Based on the gradient, we obtain the desired result:

\displaystyle\frac{\mathrm{d}\|u(t)-u^{*}\|_{2}^{2}}{\mathrm{d}t}\leq-2(\kappa^{2}\Lambda_{0})\|u(t)-u^{*}\|_{2}^{2},

where u(t) can be either u_{\mathrm{gnn}}(t) or u_{\mathrm{gntk}}(t).

Then, we show that the test predictor difference between GNTK regression and iterative GNTK regression is sufficiently small when T is sufficiently large. The linear convergence indicates that both u_{\mathrm{gnn}}(t) and u_{\mathrm{gntk}}(t) converge quickly. In fact, linear convergence implies that the squared distance between u(t) and u^{*} converges to 0 exponentially fast. Based on this convergence result, we immediately get

\displaystyle|u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}|<\varepsilon/2

by picking sufficiently large T.
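For concreteness, integrating the differential inequality above (a standard Gronwall-type step, stated here as a sketch) gives

\displaystyle\|u(t)-u^{*}\|_{2}^{2}\leq\exp(-2\kappa^{2}\Lambda_{0}t)\cdot\|u(0)-u^{*}\|_{2}^{2},

so the residual decays geometrically in t, which is the quantitative form of linear convergence used to pick T above.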

Bridge iterative GNTK regression and GNN training process

In this step, we bound |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|. The high-level approach is to use the Newton-Leibniz formula to separate the problem into two parts: first, we bound the initialization perturbation |u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|. Then, we bound

\displaystyle|\mathrm{d}(u_{\mathrm{gnn},\mathrm{test}}(t)-u_{\mathrm{gntk},\mathrm{test}}(t))/\mathrm{d}t|,

the change rate of the perturbation.

For the first part, note that u_{\mathrm{gntk},\mathrm{test}}(0)=0, so we bound |u_{\mathrm{gnn},\mathrm{test}}(0)| by the concentration property of the Gaussian initialization. For the second part, we can prove that the gradient flow of the GNN predictor during GNN training is associated with \mathsf{H}(t), while the gradient flow of the iterative GNTK regression predictor is associated with the GNTK \mathsf{H}^{\mathrm{cts}}. Therefore, bounding \|\mathsf{H}(t)-\mathsf{H}^{\mathrm{cts}}\|_{2} provides a bound on |\mathrm{d}(u_{\mathrm{gnn},\mathrm{test}}(t)-u_{\mathrm{gntk},\mathrm{test}}(t))/\mathrm{d}t|. Note that \mathsf{H}^{\mathrm{cts}} is exactly the expectation of \mathsf{H}(0), where the expectation is taken over the random initialization of the GNN. We bridge \mathsf{H}(t) and \mathsf{H}^{\mathrm{cts}} through the initial value of the Graph Dynamic Kernel \mathsf{H}(0). To bound \|\mathsf{H}(0)-\mathsf{H}^{\mathrm{cts}}\|_{2}, we use the fact that \operatorname*{{\mathbb{E}}}_{W(0)}[\mathsf{H}(0)]=\mathsf{H}^{\mathrm{cts}} together with the Hoeffding inequality. To bound \|\mathsf{H}(0)-\mathsf{H}(t)\|_{2}, we use the almost-invariant property of the Graph Dynamic Kernel.
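Written out, the decomposition used in this step is (a sketch of the identity we integrate):

\displaystyle u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)=\big(u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)\big)+\int_{0}^{T}\frac{\mathrm{d}}{\mathrm{d}t}\big(u_{\mathrm{gnn},\mathrm{test}}(t)-u_{\mathrm{gntk},\mathrm{test}}(t)\big)\,\mathrm{d}t,

so it suffices to bound the initialization gap and the integrand separately, as described above.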

For the node-level equivalence, given any test node v_{\mathrm{test}}, the goal is to bound the distance between the node-level graph kernel predictor u_{\mathrm{node},\mathrm{test}}^{*} and the node-level GNN predictor u_{\mathrm{gnn},\mathrm{node},\mathrm{test}}(T). To prove the node-level equivalence, we use a toolbox similar to that of the graph-level equivalence. We introduce node-level iterative GNTK regression, which is optimized by gradient descent and connects the node-level GNTK regression and the node-level GNN training process. By leveraging the linear convergence of the gradient flow and the almost-invariant property of the node-level Graph Dynamic Kernel, we obtain the node-level equivalence result.

Spectral gap of shifted GNTK

To prove a lower bound on the smallest eigenvalue of the shifted GNTK, we first calculate the explicit formulation of the shifted GNTK, which turns out to be a summation of Hadamard products between the covariance of the input graph dataset and the covariance of an indicator mapping. The indicator mapping takes value 1 when the ReLU is activated by the input. We bound the eigenvalue of the Hadamard product by analyzing the two covariance matrices separately. For the covariance of the input graph dataset, we need to lower bound the minimum magnitude of its diagonal entries, which can in turn be bounded by the norm of the input data. For the covariance of the indicator mapping, we need to bound its smallest eigenvalue, which reduces to lower bounding the expectation of the inner product between the indicator mapping and any real vector. We bound this expectation by carefully decomposing the vector into its maximum coordinate and its other coordinates. Providing a good bound on the maximum coordinate turns out to be an anti-concentration problem for a Gaussian random matrix.
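The Hadamard-product step uses an eigenvalue bound of the following form for positive semidefinite matrices A and B (a consequence of the Schur product theorem, stated here as a sketch of the argument; in our setting A plays the role of the covariance of the input graph dataset and B that of the covariance of the indicator mapping):

\displaystyle\lambda_{\min}(A\circ B)\geq\Big(\min_{i}A_{i,i}\Big)\cdot\lambda_{\min}(B).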

We defer further discussion of our techniques to Section B due to space limitations.

6 Conclusion

Neural Tangent Kernel [44] is a useful tool in theoretical deep learning that helps us understand how and why deep learning works. A natural extension of NTK to graph learning is the Graph Neural Tangent Kernel (GNTK). [23] has shown that GNTK can achieve similar accuracy as training graph neural networks on several bioinformatics datasets for graph-level regression. We provide the first GNTK formulation for node-level regression. We present the first formal proof that training a graph neural network is equivalent to solving GNTK regression for both graph-level and node-level regression. Since our work is purely theoretical, we do not foresee any negative societal impacts.

References

  • ACW [17] Haim Avron, Kenneth L Clarkson, and David P Woodruff. Faster kernel ridge regression using sketching and preconditioning. SIAM Journal on Matrix Analysis and Applications, 38(4):1116–1138, 2017.
  • [2] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019.
  • [3] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In NeurIPS. arXiv preprint arXiv:1904.11955, 2019.
  • AKK+ [20] Thomas D Ahle, Michael Kapralov, Jakob BT Knudsen, Rasmus Pagh, Ameya Velingker, David P Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 141–160. SIAM, 2020.
  • AKL [17] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In International Conference on Machine Learning, pages 166–175. PMLR, 2017.
  • ALS+ [18] Alexandr Andoni, Chengyu Lin, Ying Sheng, Peilin Zhong, and Ruiqi Zhong. Subspace embedding and linear regression with orlicz norm. In ICML, 2018.
  • ALS+ [22] Josh Alman, Jiehao Liang, Zhao Song, Ruizhe Zhang, and Danyang Zhuo. Bypass exponential time preprocessing: Fast neural network training via weight-data correlation preprocessing. arXiv preprint arXiv:2211.14227, 2022.
  • AS [23] Josh Alman and Zhao Song. Fast attention requires bounded entries. arXiv preprint arXiv:2302.13214, 2023.
  • [9] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML, 2019.
  • [10] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In NeurIPS, 2019.
  • Ber [24] Sergei Bernstein. On a modification of chebyshev’s inequality and of the error formula of laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math, 1(4):38–49, 1924.
  • BG [17] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In International conference on machine learning, pages 605–614. PMLR, 2017.
  • BJW [19] Ainesh Bakshi, Rajesh Jayaram, and David P Woodruff. Learning two layer rectified neural networks in polynomial time. In Conference on Learning Theory, pages 195–268. PMLR, 2019.
  • BPSW [21] Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. Training (overparametrized) neural networks in near-linear time. In ITCS, 2021.
  • BWZ [16] Christos Boutsidis, David P Woodruff, and Peilin Zhong. Optimal principal component analysis in distributed and streaming models. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing (STOC), pages 236–249, 2016.
  • CAEHP+ [20] Ines Chami, Sami Abu-El-Haija, Bryan Perozzi, Christopher Ré, and Kevin Murphy. Machine learning on graphs: A model and comprehensive taxonomy. arXiv preprint arXiv:2005.03675, 2020.
  • CB [19] Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. In 32nd Conf. Neural Information Processing Systems (NeurIPS 2018), 2019.
  • CCLY [19] Michael B Cohen, Ben Cousins, Yin Tat Lee, and Xin Yang. A near-optimal algorithm for approximating the john ellipsoid. In Conference on Learning Theory, pages 849–873. PMLR, 2019.
  • CW [13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference (STOC), pages 81–90. https://arxiv.org/pdf/1207.6365, 2013.
  • CY [21] Yifan Chen and Yun Yang. Accumulations of projections—a unified framework for random sketches in kernel ridge regression. In International Conference on Artificial Intelligence and Statistics, pages 2953–2961. PMLR, 2021.
  • Dan [17] Amit Daniely. Sgd learns the conjugate kernel class of the network. arXiv preprint arXiv:1702.08503, 2017.
  • DFS [16] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. arXiv preprint arXiv:1602.05897, 2016.
  • DHS+ [19] Simon S Du, Kangcheng Hou, Russ R Salakhutdinov, Barnabas Poczos, Ruosong Wang, and Keyulu Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems (NeurIPS), pages 5723–5733, 2019.
  • DJS+ [19] Huaian Diao, Rajesh Jayaram, Zhao Song, Wen Sun, and David Woodruff. Optimal sketching for kronecker product regression and low rank approximation. Advances in neural information processing systems, 32, 2019.
  • DLL+ [19] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685. PMLR, 2019.
  • [26] Yichuan Deng, Zhihang Li, and Zhao Song. Attention scheme inspired softmax regression. arXiv preprint arXiv:2304.10411, 2023.
  • [27] Yichuan Deng, Zhihang Li, and Zhao Song. An improved sample complexity for rank-1 matrix sensing. arXiv preprint arXiv:2303.06895, 2023.
  • DLT+ [18] Simon Du, Jason Lee, Yuandong Tian, Aarti Singh, and Barnabas Poczos. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. In International Conference on Machine Learning, pages 1339–1348. PMLR, 2018.
  • DZPS [19] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In ICLR. arXiv preprint arXiv:1810.02054, 2019.
  • EMZ [21] Hossein Esfandiari, Vahab Mirrokni, and Peilin Zhong. Almost linear time density level set estimation via dbscan. In AAAI, 2021.
  • Fou [17] Alex M Fout. Protein interface prediction using graph convolutional networks. PhD thesis, Colorado State University, 2017.
  • GLM [17] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
  • GQSW [22] Yeqi Gao, Lianke Qin, Zhao Song, and Yitan Wang. A sublinear adversarial training algorithm. arXiv preprint arXiv:2208.05395, 2022.
  • GS [22] Yuzhou Gu and Zhao Song. A faster small treewidth sdp solver. arXiv preprint arXiv:2211.06033, 2022.
  • GSY [23] Yeqi Gao, Zhao Song, and Junze Yin. An iterative algorithm for rescaled hyperbolic functions regression. arXiv preprint arXiv:2305.00660, 2023.
  • GSYZ [23] Yuzhou Gu, Zhao Song, Junze Yin, and Lichen Zhang. Low rank matrix completion via robust alternating minimization in nearly linear time. arXiv preprint arXiv:2302.11068, 2023.
  • GSZ [23] Yuzhou Gu, Zhao Song, and Lichen Zhang. A nearly-linear time algorithm for structured support vector machines. arXiv preprint arXiv:2307.07735, 2023.
  • HLW [17] Jarvis Haupt, Xingguo Li, and David P Woodruff. Near optimal sketching of low-rank tensor regression. In ICML, 2017.
  • HMR [18] Filip Hanzely, Konstantin Mishchenko, and Peter Richtárik. Sega: Variance reduction via gradient sketching. Advances in Neural Information Processing Systems, 31, 2018.
  • Hoe [63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • HOSM [17] Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach. arXiv preprint arXiv:1706.05674, 2017.
  • HSWZ [22] Hang Hu, Zhao Song, Omri Weinstein, and Danyang Zhuo. Training overparametrized neural networks in sublinear time. arXiv preprint arXiv:2208.04508, 2022.
  • HZRS [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • JGH [18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems (NeurIPS), pages 8571–8580, 2018.
  • JLL+ [20] Shuli Jiang, Dongyu Li, Irene Mengze Li, Arvind V Mahankali, and David Woodruff. An efficient protocol for distributed column subset selection in the entrywise \ell_{p} norm. 2020.
  • JLL+ [21] Shuli Jiang, Dennis Li, Irene Mengze Li, Arvind V Mahankali, and David Woodruff. Streaming and distributed algorithms for robust column subset selection. In International Conference on Machine Learning, pages 4971–4981. PMLR, 2021.
  • JLSW [20] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games and its applications. In STOC, 2020.
  • JPWZ [21] Shuli Jiang, Hai Pham, David Woodruff, and Richard Zhang. Optimal sketching for trace estimation. Advances in Neural Information Processing Systems, 34, 2021.
  • JSWZ [21] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamic matrix inverse for faster lps. In STOC. arXiv preprint arXiv:2004.07470, 2021.
  • LL [18] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In NeurIPS, 2018.
  • LS [01] Wenbo V Li and Q-M Shao. Gaussian processes: inequalities, small ball probabilities and applications. Handbook of Statistics, 19:533–597, 2001.
  • LSS+ [20] Jason D Lee, Ruoqi Shen, Zhao Song, Mengdi Wang, and Zheng Yu. Generalized leverage score sampling for neural networks. In NeurIPS, 2020.
  • LSZ [19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In COLT. https://arxiv.org/pdf/1905.04447.pdf, 2019.
  • LSZ [23] Zhihang Li, Zhao Song, and Tianyi Zhou. Solving regularized exp, cosh and sinh regression problems. arXiv preprint arXiv:2303.15725, 2023.
  • LY [17] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. arXiv preprint arXiv:1705.09886, 2017.
  • MM [13] Xiangrui Meng and Michael W Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing (STOC), pages 91–100, 2013.
  • NN [13] Jelani Nelson and Huy L Nguyên. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 117–126. IEEE, https://arxiv.org/pdf/1211.1002, 2013.
  • OS [20] Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. In IEEE Journal on Selected Areas in Information Theory. IEEE, 2020.
  • QJS+ [22] Lianke Qin, Rajesh Jayaram, Elaine Shi, Zhao Song, Danyang Zhuo, and Shumo Chu. Adore: Differentially oblivious relational database operators. In VLDB, 2022.
  • QRS+ [22] Lianke Qin, Aravind Reddy, Zhao Song, Zhaozhuo Xu, and Danyang Zhuo. Adaptive and dynamic multi-resolution hashing for pairwise summations. In BigData, 2022.
  • QSZ [23] Lianke Qin, Zhao Song, and Ruizhe Zhang. A general algorithm for solving rank-one matrix sensing. arXiv preprint arXiv:2303.12298, 2023.
  • QSZZ [23] Lianke Qin, Zhao Song, Lichen Zhang, and Danyang Zhuo. An online and unified algorithm for projection matrix vector multiplication with application to empirical risk minimization. In International Conference on Artificial Intelligence and Statistics, pages 101–156. PMLR, 2023.
  • QWF [22] Can Qin, Yizhou Wang, and Yun Fu. Robust semi-supervised domain adaptation against noisy labels. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 4409–4413, 2022.
  • RPU+ [20] Daniel Rothchild, Ashwinee Panda, Enayat Ullah, Nikita Ivkin, Ion Stoica, Vladimir Braverman, Joseph Gonzalez, and Raman Arora. Fetchsgd: Communication-efficient federated learning with sketching. In International Conference on Machine Learning, pages 8253–8265. PMLR, 2020.
  • RSW [16] Ilya Razenshteyn, Zhao Song, and David P Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the 48th Annual Symposium on the Theory of Computing (STOC), 2016.
  • Sch [11] Jssai Schur. Bemerkungen zur theorie der beschränkten bilinearformen mit unendlich vielen veränderlichen. 1911.
  • Sol [17] Mahdi Soltanolkotabi. Learning relus via gradient descent. arXiv preprint arXiv:1705.04591, 2017.
  • SSX [23] Anshumali Shrivastava, Zhao Song, and Zhaozhuo Xu. A tale of two efficient value iteration algorithms for solving linear mdps with large action space. In AISTATS, 2023.
  • SSZ [23] Ritwik Sinha, Zhao Song, and Tianyi Zhou. A mathematical abstraction for balancing the trade-off between creativity and reality in large language models. arXiv preprint arXiv:2306.02295, 2023.
  • SWYZ [21] Zhao Song, David Woodruff, Zheng Yu, and Lichen Zhang. Fast sketching of polynomial kernels of polynomial degree. In International Conference on Machine Learning, pages 9812–9823. PMLR, 2021.
  • SWZ [17] Zhao Song, David P Woodruff, and Peilin Zhong. Low rank approximation with entrywise \ell_{1}-norm error. In Proceedings of the 49th Annual Symposium on the Theory of Computing (STOC). ACM, https://arxiv.org/pdf/1611.00898, 2017.
  • [72] Zhao Song, David Woodruff, and Peilin Zhong. Average case column subset selection for entrywise \ell_{1}-norm loss. Advances in Neural Information Processing Systems (NeurIPS), 32:10111–10121, 2019.
  • [73] Zhao Song, David Woodruff, and Peilin Zhong. Towards a zero-one law for column subset selection. Advances in Neural Information Processing Systems, 32:6123–6134, 2019.
  • [74] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In SODA. arXiv preprint arXiv:1704.08246, 2019.
  • SY [19] Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix chernoff bound. In arXiv preprint. https://arxiv.org/pdf/1906.03593.pdf, 2019.
  • SY [21] Zhao Song and Zheng Yu. Oblivious sketching-based central path method for solving linear programming problems. In 38th International Conference on Machine Learning (ICML), 2021.
  • SYYZ [22] Zhao Song, Xin Yang, Yuanyuan Yang, and Tianyi Zhou. Faster algorithm for structured john ellipsoid computation. arXiv preprint arXiv:2211.14407, 2022.
  • SYZ [21] Zhao Song, Shuo Yang, and Ruizhe Zhang. Does preprocessing help training over-parameterized neural networks? Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.
  • SZZ [21] Zhao Song, Lichen Zhang, and Ruizhe Zhang. Training multi-layer over-parametrized neural network in subquadratic time. arXiv preprint arXiv:2112.07628, 2021.
  • Tia [17] Yuandong Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In International Conference on Machine Learning, pages 3404–3413. PMLR, 2017.
  • WGLF [23] Yizhou Wang, Dongliang Guo, Sheng Li, and Yun Fu. Towards explainable visual anomaly detection. arXiv preprint arXiv:2302.06670, 2023.
  • WLX+ [20] Yongji Wu, Defu Lian, Yiheng Xu, Le Wu, and Enhong Chen. Graph convolutional networks with markov random field reasoning for social spammer detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 1054–1061, 2020.
  • WPC+ [20] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 2020.
  • WQB+ [22] Yizhou Wang, Can Qin, Yue Bai, Yi Xu, Xu Ma, and Yun Fu. Making reconstruction-based method great again for video anomaly detection. In 2022 IEEE International Conference on Data Mining (ICDM), pages 1215–1220. IEEE, 2022.
  • WWMK [20] Xinshuo Weng, Yongxin Wang, Yunze Man, and Kris M Kitani. Gnn3dmot: Graph neural network for 3d multi-object tracking with 2d-3d multi-feature learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6499–6508, 2020.
  • WZ [16] David P Woodruff and Peilin Zhong. Distributed low rank approximation of implicit functions of a matrix. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 847–858. IEEE, 2016.
  • WZD+ [20] Ruosong Wang, Peilin Zhong, Simon S Du, Russ R Salakhutdinov, and Lin F Yang. Planning with general objective functions: Going beyond total rewards. In Annual Conference on Neural Information Processing Systems (NeurIPS), 2020.
  • XSS [21] Zhaozhuo Xu, Zhao Song, and Anshumali Shrivastava. Breaking the linear iteration cost barrier for some well-known conditional gradient methods using maxip data-structures. Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.
  • XWW+ [21] Yi Xu, Lichen Wang, Yizhou Wang, Can Qin, Yulun Zhang, and Yun Fu. Memrein: Rein the domain shift for cross-domain few-shot learning. 2021.
  • XWWF [22] Yi Xu, Lichen Wang, Yizhou Wang, and Yun Fu. Adaptive trajectory prediction via transferable gnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6520–6531, 2022.
  • XZZ [18] Chang Xiao, Peilin Zhong, and Changxi Zheng. Bourgan: generative networks with metric embeddings. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pages 2275–2286, 2018.
  • YHC+ [18] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 974–983, 2018.
  • YSLJ [20] Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11040–11048, 2020.
  • YWH+ [20] Xiang Yue, Zhen Wang, Jingong Huang, Srinivasan Parthasarathy, Soheil Moosavinasab, Yungui Huang, Simon M Lin, Wen Zhang, Ping Zhang, and Huan Sun. Graph embedding on biomedical networks: methods, applications and evaluations. Bioinformatics, 36(4):1241–1251, 2020.
  • YZW+ [20] Carl Yang, Jieyu Zhang, Haonan Wang, Sha Li, Myungwan Kim, Matt Walker, Yiou Xiao, and Jiawei Han. Relation learning on social networks with multi-modal graph edge variational autoencoders. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 699–707, 2020.
  • ZCH+ [20] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, 2020.
  • ZCZ [20] Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering, 2020.
  • ZHA+ [21] Amir Zandieh, Insu Han, Haim Avron, Neta Shoham, Chaewon Kim, and Jinwoo Shin. Scaling neural tangent kernels via sketching and random features. Advances in Neural Information Processing Systems, 34, 2021.
  • Zha [22] Lichen Zhang. Speeding up optimizations via data structures: Faster search, sample and maintenance. Master’s thesis, Carnegie Mellon University, 2022.
  • ZL [17] Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):i190–i198, 2017.
  • ZSD [17] Kai Zhong, Zhao Song, and Inderjit S Dhillon. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017.
  • ZSJ+ [17] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International conference on machine learning, pages 4140–4149. PMLR, 2017.

Appendix

Roadmap.

In Section A, we introduce several notations and tools that are useful in our paper. In Section B, we provide more discussion of the technique overview. In Section C, we give the formula for calculating the node-level GNTK. In Section D, we prove the equivalence of GNN and GNTK for both graph-level and node-level regression. In Section E, we provide a lower bound and an upper bound on the eigenvalue of the GNTK. In Section F, we derive the formulation of the node-level GNTK.

Appendix A Notations and Probability Tools

In this section, we introduce some notations and several probability tools that we use in the proofs.

Notations.

Let G=(U,E)G=(U,E) be a graph with node set UU and edge set EE. A graph contains |U|=N|U|=N nodes. The dimension of the feature vector of each input node is dd. 𝒩\mathcal{N} is the neighboring matrix that represents the topology of GG. Each node uUu\in U has a feature vector h(G)udh(G)_{u}\in\mathbb{R}^{d}. 𝒩(u)\mathcal{N}(u) denotes the set of neighboring nodes of uu. A dataset contains nn graphs. We denote the training graphs by 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} with Gid×Ni[n]G_{i}\in\mathbb{R}^{d\times N}\leavevmode\nobreak\ i\in[n], and the corresponding label vector by YnY\in\mathbb{R}^{n}. For graph-level regression, our goal is to learn to predict the label of the graph GtestG_{\mathrm{test}}. For node-level regression, our goal is to learn to predict the label of the node vtestv_{\mathrm{test}}. For presentation simplicity, we fix an ordering of all the nodes in a graph and use uiu_{i} to denote the iith node. We use σ()\sigma(\cdot) to denote the non-linear activation function. In this paper, σ()\sigma(\cdot) usually denotes the rectified linear unit (ReLU) function. We use σ˙()\dot{\sigma}(\cdot) to denote the derivative of σ()\sigma(\cdot). We use [k][k] to denote the set {1,2,,k}\{1,2,\cdots,k\}.

We state Chernoff, Hoeffding and Bernstein inequalities.

Lemma A.1 (Hoeffding bound [40]).

Let X1,,XnX_{1},\cdots,X_{n} denote nn independent bounded variables in [ai,bi][a_{i},b_{i}]. Let X=i=1nXiX=\sum_{i=1}^{n}X_{i}, then we have

Pr[|X𝔼[X]|t]2exp(2t2i=1n(biai)2).\displaystyle\Pr[|X-\operatorname*{{\mathbb{E}}}[X]|\geq t]\leq 2\exp\left(-\frac{2t^{2}}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}\right).
Lemma A.2 (Bernstein inequality [11]).

Let X1,,XnX_{1},\cdots,X_{n} be independent zero-mean random variables. Suppose that |Xi|M|X_{i}|\leq M almost surely, for all ii. Then, for all positive tt,

Pr[i=1nXi>t]exp(t2/2j=1n𝔼[Xj2]+Mt/3).\displaystyle\Pr\left[\sum_{i=1}^{n}X_{i}>t\right]\leq\exp\left(-\frac{t^{2}/2}{\sum_{j=1}^{n}\operatorname*{{\mathbb{E}}}[X_{j}^{2}]+Mt/3}\right).

We state three inequalities for Gaussian random variables.

Lemma A.3 (Gaussian tail bounds).

Let X𝒩(μ,σ2)X\sim\mathcal{N}(\mu,\sigma^{2}) be a Gaussian random variable with mean μ\mu and variance σ2\sigma^{2}. Then for all t0t\geq 0, we have

\displaystyle\Pr[|X-\mu|\geq t]\leq 2e^{-\frac{t^{2}}{2\sigma^{2}}}.
Claim A.4 (Theorem 3.1 in [51]).

Let b>0b>0 and r>0r>0. Then,

exp(b2/2)Prx𝒩(0,1)[|x|r]Prx𝒩(0,1)[|xb|r]Prx𝒩(0,1)[|x|r].\displaystyle\exp(-b^{2}/2)\Pr_{x\sim\mathcal{N}(0,1)}[|x|\leq r]\leq\leavevmode\nobreak\ \Pr_{x\sim\mathcal{N}(0,1)}[|x-b|\leq r]\leq\leavevmode\nobreak\ \Pr_{x\sim\mathcal{N}(0,1)}[|x|\leq r].
Lemma A.5 (Anti-concentration of Gaussian distribution).

Let Z𝒩(0,σ2)Z\sim{\mathcal{N}}(0,\sigma^{2}). Then, for t>0t>0,

Pr[|Z|t]2t2πσ.\displaystyle\Pr[|Z|\leq t]\leq\frac{2t}{\sqrt{2\pi}\sigma}.

Appendix B More Discussion of Technique Overview

Node level techniques

The formulation of the node-level GNTK in Section C is derived from the Gaussian-process nature of infinitely wide neural networks. In the infinite-width limit, the pre-activations W(r)(l)f(r1)(l)(u)W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u) at every hidden layer rr and every level ll have all their coordinates tending to i.i.d. centered Gaussian processes. Note that, for the activation layer, if the covariance matrix and the distribution of the pre-activations are explicitly known, we can compute the covariance matrix of the activated values exactly. As a result, we can recursively compute the covariance of the centered Gaussian processes 𝚺(r)(l)(u,u)\bm{\Sigma}^{(l)}_{(r)}(u,u^{\prime}). By computing the partial derivative of the GNN output f(R)(L)f^{(L)}_{(R)} with respect to the weights W(r)(l)W^{(l)}_{(r)}, we obtain an explicit formula for computing the node-level GNTK.

Appendix C Calculating Node-Level GNTK

In this section, we give the formula for calculating the node-level GNTK for a multi-level, multi-layer graph neural network.

Let f(r)(l)f^{(l)}_{(r)} denote the output of the Graph Neural Network in the ll-th level and rr-th layer. Suppose that there are LL levels and, in each level, there are RR layers. At the beginning of each level, we aggregate features over the neighbors 𝒩(u)\mathcal{N}(u) of each node uu. More specifically, the definition of the Aggregate layer is as follows: f(1)(1)(u)=v𝒩(u)h(G)v,f^{(1)}_{(1)}(u)=\sum_{v\in\mathcal{N}(u)}h(G)_{v}, and f(1)(l)(u)=v𝒩(u)f(R)(l1)(v),l{2,3,,L}.f^{(l)}_{(1)}(u)=\sum_{v\in\mathcal{N}(u)}f^{(l-1)}_{(R)}(v),\leavevmode\nobreak\ l\in\{2,3,\cdots,L\}. In each level, we then apply RR fully-connected layers to the output of the Aggregate layer, which are also called Combine layers. The Combine layers can be formulated as: f(r)(l)(u)=cσmσ(W(r)(l)f(r1)(l)(u)),l[L],r[R]f^{(l)}_{(r)}(u)=\sqrt{\frac{c_{\sigma}}{m}}\sigma(W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u)),\leavevmode\nobreak\ l\in[L],r\in[R] where cσc_{\sigma} is a scaling factor, W(r)(l)W^{(l)}_{(r)} is the weight of the neural network in the ll-th level and rr-th layer, and mm is the width of the layer. In this paper, we set cσc_{\sigma} as in [43], i.e., cσ=2c_{\sigma}=2. Let f(R)(L)f^{(L)}_{(R)} be the output of the Graph Neural Network. A code sketch of this forward pass is given below.
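To make the Aggregate/Combine recursion concrete, the following is a minimal NumPy sketch of the finite-width forward pass; the function name gnn_forward, the dense adjacency matrix A, and the toy path graph are our own illustrative choices (we read each level as one Aggregate step followed by its Combine layers, with Gaussian Combine weights and ReLU activation).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gnn_forward(A, X, weights, c_sigma=2.0):
    """Finite-width forward pass of the multi-level, multi-layer GNN above.

    A       : (N, N) adjacency matrix, A[u, v] = 1 iff v is a neighbor of u.
    X       : (N, d) node features, row u is h(G)_u.
    weights : weights[l][r] is the Combine weight W^{(l)}_{(r)} of level l, layer r.
    Returns the node representations f^{(L)}_{(R)}, one row per node.
    """
    H = X
    for level in weights:
        H = A @ H                                    # Aggregate: sum over neighbors N(u)
        for W in level:                              # Combine: R fully-connected ReLU layers
            m = W.shape[0]
            H = np.sqrt(c_sigma / m) * relu(H @ W.T)
    return H

# Toy usage: a 4-node path graph, d = 3, L = 2 levels, R = 2 Combine layers, width m = 16.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))
dims = [3, 16, 16, 16, 16]
weights = [[rng.normal(size=(dims[2 * l + r + 1], dims[2 * l + r])) for r in range(2)]
           for l in range(2)]
print(gnn_forward(A, X, weights).shape)  # (4, 16)
```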

We show how to compute Graph Neural Tangent Kernel 𝖪gntk,node(u,u)=𝖪(R)(L)(u,u)\mathsf{K}_{\mathrm{gntk},\mathrm{node}}(u,u^{\prime})=\mathsf{K}_{(R)}^{(L)}(u,u^{\prime}) as follows. Let 𝚺(r)(l)(u,u)\bm{\Sigma}^{(l)}_{(r)}(u,u^{\prime}) be defined as the covariance matrix of the Gaussian process of the pre-activations W(r)(l)f(r1)(l)(u)W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u). We provide the formula of the GNTK in a recursive way. First, we give the recursive formula for Aggregate layer. We initialize with 𝚺(1)(1)(u,u)=h(G)uh(G)u.\bm{\Sigma}^{(1)}_{(1)}(u,u^{\prime})=h(G)_{u}^{\top}h(G)_{u^{\prime}}. For l{2,3,,L}l\in\{2,3,\cdots,L\}, 𝚺(1)(l)(u,u)=v𝒩(u)v𝒩(u)𝚺(R)(l1)(v,v).\bm{\Sigma}^{(l)}_{(1)}(u,u^{\prime})=\sum_{v\in\mathcal{N}(u)}\sum_{v^{\prime}\in\mathcal{N}(u^{\prime})}\bm{\Sigma}^{(l-1)}_{(R)}(v,v^{\prime}).

Then, we give the recursive formula for the Combine layer.

𝚲(r)(l)(u,u)=\displaystyle\bm{\Lambda}^{(l)}_{(r)}(u,u^{\prime})= (𝚺(r1)(l)(u,u)𝚺(r1)(l)(u,u)𝚺(r1)(l)(u,u)𝚺(r1)(l)(u,u))2×2,\displaystyle\begin{pmatrix}\bm{\Sigma}_{(r-1)}^{(l)}(u,u)&\bm{\Sigma}_{(r-1)}^{(l)}(u,u^{\prime})\\ \bm{\Sigma}_{(r-1)}^{(l)}(u^{\prime},u)&\bm{\Sigma}_{(r-1)}^{(l)}(u^{\prime},u^{\prime})\end{pmatrix}\in\mathbb{R}^{2\times 2},
𝚺(r)(l)(u,u)=\displaystyle\bm{\Sigma}^{(l)}_{(r)}(u,u^{\prime})= cσ𝔼(a,b)𝒩(𝟎,𝚲(r)(l)(u,u))[σ(a)σ(b)],\displaystyle c_{\sigma}\mathbb{E}_{(a,b)\sim\mathcal{N}\left(\bm{0},\bm{\Lambda}^{(l)}_{(r)}\left(u,u^{\prime}\right)\right)}[\sigma(a)\sigma(b)],
𝚺˙(r)(l)(u,u)=\displaystyle\bm{\dot{\Sigma}}^{(l)}_{(r)}\left(u,u^{\prime}\right)= cσ𝔼(a,b)𝒩(𝟎,𝚲(r)(l)(u,u))[σ˙(a)σ˙(b)].\displaystyle c_{\sigma}\mathbb{E}_{(a,b)\sim\mathcal{N}\left(\bm{0},\bm{\Lambda}^{(l)}_{(r)}\left(u,u^{\prime}\right)\right)}[\dot{\sigma}(a)\dot{\sigma}(b)].

Finally, we give the formula for calculating the graph neural tangent kernel. For each Aggregate layer, the kernel can be calculated as follows: 𝖪(1)(1)(u,u)=h(G)uh(G)u,\mathsf{K}^{(1)}_{(1)}(u,u^{\prime})=\leavevmode\nobreak\ h(G)_{u}^{\top}h(G)_{u^{\prime}}, and

𝖪(1)(l)(u,u)=v𝒩(u)v𝒩(u)𝖪(R)(l1)(v,v).\displaystyle\mathsf{K}^{(l)}_{(1)}(u,u^{\prime})=\leavevmode\nobreak\ \sum_{v\in\mathcal{N}(u)}\sum_{v^{\prime}\in\mathcal{N}(u^{\prime})}\mathsf{K}^{(l-1)}_{(R)}(v,v^{\prime}).

For each Combine layer, the kernel can be calculated as follows:

𝖪(r)(l)(u,u)=𝖪(r1)(l)(u,u)𝚺˙(r)(l)(u,u)+𝚺(r)(l)(u,u).\displaystyle\mathsf{K}_{(r)}^{(l)}(u,u^{\prime})=\mathsf{K}_{(r-1)}^{(l)}(u,u^{\prime})\dot{\bm{\Sigma}}^{(l)}_{(r)}\left(u,u^{\prime}\right)+\bm{\Sigma}^{(l)}_{(r)}\left(u,u^{\prime}\right).

The Graph neural tangent kernel 𝖪gntk,node(u,u)\mathsf{K}_{\mathrm{gntk},\mathrm{node}}(u,u^{\prime}) is the output 𝖪(R)(L)(u,u)\mathsf{K}_{(R)}^{(L)}(u,u^{\prime}).
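Below is a minimal NumPy sketch of the recursion above for a single graph, assuming the ReLU activation with cσ=2 so that the Gaussian expectations have their standard closed forms: for correlation cos λ, E[σ(a)σ(b)] equals (sqrt(Σ(u,u)Σ(u′,u′))/(2π))(sin λ + (π−λ)cos λ) and E[σ̇(a)σ̇(b)] equals (π−λ)/(2π). The function name node_gntk is ours, and we apply R Combine updates per level, reading the recursion exactly as written (in particular, the level-1 initialization uses the raw features without aggregation).

```python
import numpy as np

def node_gntk(A, X, L, R, c_sigma=2.0):
    """Node-level GNTK K^{(L)}_{(R)}(u, u') for all node pairs of one graph.

    A : (N, N) adjacency matrix (A[u, v] = 1 iff v is in N(u)).
    X : (N, d) node features, row u is h(G)_u.
    """
    Sigma = X @ X.T          # Sigma^{(1)}_{(1)}(u, u') = h(G)_u^T h(G)_{u'}
    K = Sigma.copy()         # K^{(1)}_{(1)}(u, u')
    for l in range(L):
        if l > 0:
            # Aggregate layer: sum over neighbor pairs (v, v') of (u, u').
            Sigma = A @ Sigma @ A.T
            K = A @ K @ A.T
        for r in range(R):
            # Combine layer: closed-form Gaussian expectations for ReLU.
            diag = np.sqrt(np.maximum(np.outer(np.diag(Sigma), np.diag(Sigma)), 1e-12))
            cos_lam = np.clip(Sigma / diag, -1.0, 1.0)
            lam = np.arccos(cos_lam)
            Sigma_next = (c_sigma / (2 * np.pi)) * diag * (np.sin(lam) + (np.pi - lam) * cos_lam)
            Sigma_dot = (c_sigma / (2 * np.pi)) * (np.pi - lam)
            K = K * Sigma_dot + Sigma_next   # K^{(l)}_{(r)} = K^{(l)}_{(r-1)} * Sigma_dot + Sigma
            Sigma = Sigma_next
    return K

# Usage: the (N, N) kernel matrix over all node pairs of a toy 4-node graph.
rng = np.random.default_rng(0)
A = (rng.random((4, 4)) < 0.5).astype(float)
K = node_gntk(A, rng.normal(size=(4, 3)), L=2, R=2)
print(K.shape)  # (4, 4)
```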

Appendix D Equivalence Results

In this section, we prove the equivalence between GNNs and GNTK regression as the width goes to infinity, for a one-layer GNN with exactly one Aggregate operation, one Combine operation, and one ReadOut operation.

In Section D.1, we present some useful definitions. In Section D.2, we compute the gradient flow of iterative GNTK regression and of the GNN training process. In Section D.3, we prove an upper bound for |ugntk,test(T)utest||u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}|. In Section D.4, we bound the initialization and kernel perturbation. In Section D.5, we present the concentration results for kernels. In Section D.6, we upper bound the initialization perturbation. In Section D.7, we upper bound the kernel perturbation. In Section D.8, we connect iterative GNTK regression with the GNN training process. In Section D.9, we present the main theorem.

D.1 Definitions

In this section we first present some formal definitions.

Definition D.1 (Graph neural network function).

We define a graph neural networks with rectified linear unit (ReLU) activation as the following form

fgnn(W,a,G)=1mr=1marl=1Nσ(wrxl),\displaystyle f_{\mathrm{gnn}}(W,a,G)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sum_{l=1}^{N}\sigma(w_{r}^{\top}x_{l})\in\mathbb{R},

where

  • xd×Nx\in\mathbb{R}^{d\times N} (decided by GG) is the input,

  • wrd,r[m]w_{r}\in\mathbb{R}^{d},\leavevmode\nobreak\ r\in[m] is the weight vector of the first layer,

  • W=[w1,,wm]d×mW=[w_{1},\cdots,w_{m}]\in\mathbb{R}^{d\times m}, ar,r[m]a_{r}\in\mathbb{R},\leavevmode\nobreak\ r\in[m] is the output weight,

  • a=[a1,,am]a=[a_{1},\cdots,a_{m}]^{\top} and σ()\sigma(\cdot) is the ReLU activation function: σ(z)=max{0,z}\sigma(z)=\max\{0,z\}.

In this paper, we consider only training the first layer WW while fixing aa. So we also write fgnn(W,G)=fgnn(W,a,G)f_{\mathrm{gnn}}(W,G)=f_{\mathrm{gnn}}(W,a,G). We denote fgnn(W,𝖦)=[fgnn(W,G1),,fgnn(W,Gn)]nf_{\mathrm{gnn}}(W,\mathsf{G})=[f_{\mathrm{gnn}}(W,G_{1}),\cdots,f_{\mathrm{gnn}}(W,G_{n})]^{\top}\in\mathbb{R}^{n}.
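The following is a minimal NumPy sketch of this one-hidden-layer network; the helper name f_gnn and the toy shapes are ours, and X collects the node features x_1, …, x_N of the graph G as columns.

```python
import numpy as np

def f_gnn(W, a, X):
    """f_gnn(W, a, G) = (1 / sqrt(m)) * sum_r a_r * sum_l relu(w_r^T x_l).

    W : (d, m) first-layer weights, column r is w_r.
    a : (m,)   output weights in {-1, +1} (kept fixed during training).
    X : (d, N) node features of the graph G, column l is x_l.
    """
    m = W.shape[1]
    return float(a @ np.maximum(W.T @ X, 0.0).sum(axis=1)) / np.sqrt(m)

# Toy usage with d = 3 features, N = 5 nodes, width m = 8.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
a = rng.choice([-1.0, 1.0], size=8)
print(f_gnn(W, a, rng.normal(size=(3, 5))))
```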

The training process of the GNN is as follows:

Definition D.2 (Training graph neural network).

Given

  • training data graph 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} (which can be viewed as a tensor Xn×d×NX\in\mathbb{R}^{n\times d\times N} and say X1,,XNd×nX_{1},\cdots,X_{N}\in\mathbb{R}^{d\times n}),

  • and corresponding label vector YnY\in\mathbb{R}^{n} where dd represent the dimension of a feature of a node and NN denote the number of nodes in a graph and nn is the size of training set.

Let

  • fgnnf_{\mathrm{gnn}} be defined as in Definition D.1.

  • κ(0,1)\kappa\in(0,1) be a small multiplier.

We initialize the network as ari.i.d.unif[{1,1}]a_{r}\overset{i.i.d.}{\sim}\operatorname{unif}[\{-1,1\}] and wr(0)i.i.d.𝒩(0,Id)w_{r}(0)\overset{i.i.d.}{\sim}\mathcal{N}(0,I_{d}). Then we consider solving the following optimization problem using gradient descent:

minW12Yκfgnn(W,𝖦)22\displaystyle\min_{W}\frac{1}{2}\|Y-\kappa f_{\mathrm{gnn}}(W,\mathsf{G})\|_{2}^{2} (1)

We denote wr(t),r[m]w_{r}(t),r\in[m] as the variable at iteration tt. We denote

ugnn(t)=\displaystyle u_{\mathrm{gnn}}(t)= κfgnn(W(t),𝖦)\displaystyle\leavevmode\nobreak\ \kappa f_{\mathrm{gnn}}(W(t),\mathsf{G})
=\displaystyle= κmr=1marl=1Nσ(wr(t)Xl)n\displaystyle\leavevmode\nobreak\ \frac{\kappa}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sum_{l=1}^{N}\sigma(w_{r}(t)^{\top}X_{l})\in\mathbb{R}^{n} (2)

as the training data predictor at iteration tt. Given any test data {xtest,l}l[N]d\{x_{\mathrm{test},l}\}_{l\in[N]}\in\mathbb{R}^{d} (decided by graph GtestG_{\mathrm{test}}), we denote

ugnn,test(t)=\displaystyle u_{\mathrm{gnn},\mathrm{test}}(t)= κfgnn(W(t),Gtest)\displaystyle\leavevmode\nobreak\ \kappa f_{\mathrm{gnn}}(W(t),G_{\mathrm{test}})
=\displaystyle= κmr=1marl=1Nσ(wr(t)xtest,l)\displaystyle\leavevmode\nobreak\ \frac{\kappa}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sum_{l=1}^{N}\sigma(w_{r}(t)^{\top}x_{\mathrm{test},l})\in\mathbb{R} (3)

as the test data predictor at iteration tt.

In the definition above, κ\kappa is the scaling factor [52, 3].
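As an illustration of Definition D.2, here is a finite-width, discrete-time sketch that runs plain gradient descent on the objective in Eq. (1) while keeping a fixed; the function names, learning rate, and step count are illustrative choices, and we use the fact that the gradient of f_gnn with respect to w_r is (1/sqrt(m)) a_r Σ_l 1[w_r^T x_l > 0] x_l.

```python
import numpy as np

def f_gnn(W, a, X):
    """Same one-hidden-layer GNN as in the previous sketch; X has shape (d, N)."""
    return float(a @ np.maximum(W.T @ X, 0.0).sum(axis=1)) / np.sqrt(W.shape[1])

def grad_f_gnn(W, a, X):
    """Gradient of f_gnn with respect to W, shape (d, m)."""
    m = W.shape[1]
    act = (W.T @ X > 0).astype(float)        # (m, N) ReLU indicators
    return (X @ act.T) * a / np.sqrt(m)      # column r: a_r * sum_l 1[w_r^T x_l > 0] x_l

def train_gnn(graphs, Y, d, m, kappa=0.1, lr=0.1, steps=500, seed=0):
    """Gradient descent on (1/2) * ||Y - kappa * f_gnn(W, G)||_2^2, training only W."""
    rng = np.random.default_rng(seed)
    a = rng.choice([-1.0, 1.0], size=m)      # a_r ~ unif{-1, +1}, then frozen
    W = rng.normal(size=(d, m))              # w_r(0) ~ N(0, I_d)
    for _ in range(steps):
        preds = np.array([kappa * f_gnn(W, a, X) for X in graphs])
        residual = Y - preds
        grad = sum(-kappa * r * grad_f_gnn(W, a, X) for r, X in zip(residual, graphs))
        W -= lr * grad
    return W, a

# Toy usage: n = 6 graphs with d = 3 features and N = 5 nodes each.
rng = np.random.default_rng(1)
graphs = [rng.normal(size=(3, 5)) for _ in range(6)]
Y = rng.normal(size=6)
W, a = train_gnn(graphs, Y, d=3, m=256)
print(np.linalg.norm(Y - np.array([0.1 * f_gnn(W, a, X) for X in graphs])))
```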

Definition D.3 (Graph neural tangent kernel and feature function).

We define the graph neural tangent kernel(GNTK) and the feature function corresponding to the graph neural networks fgnnf_{\mathrm{gnn}} defined in Definition D.1 as following

𝖪gntk(G,H)=𝔼[fgnn(W,G)W,fgnn(W,H)W]\displaystyle\mathsf{K}_{\mathrm{gntk}}(G,H)=\operatorname*{{\mathbb{E}}}\left[\left\langle\frac{\partial f_{\mathrm{gnn}}(W,G)}{\partial W},\frac{\partial f_{\mathrm{gnn}}(W,H)}{\partial W}\right\rangle\right]

where

  • G,HG,H are any input data,

  • and the expectation is taking over wri.i.d.𝒩(0,I),r=1,,mw_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\leavevmode\nobreak\ r=1,\cdots,m.

Given

  • training data matrix 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\},

we define Hctsn×nH^{\mathrm{cts}}\in\mathbb{R}^{n\times n} as the kernel matrix between training data as

[Hcts]i,j=𝖪gntk(Gi,Gj).\displaystyle[H^{\mathrm{cts}}]_{i,j}=\mathsf{K}_{\mathrm{gntk}}(G_{i},G_{j})\in\mathbb{R}.

Let Λ0>0\Lambda_{0}>0 denote the smallest eigenvalue of HctsH^{\mathrm{cts}}, under the assumption that HctsH^{\mathrm{cts}} is positive definite. Additionally, for any test graph data GG, the kernel between the test data and the training data is denoted by 𝖪gntk(G,𝖦)\mathsf{K}_{\mathrm{gntk}}(G,\mathsf{G}) and is a member of n\mathbb{R}^{n}. We write it as:

𝖪gntk(G,𝖦)=[𝖪gntk(G,G1),,𝖪gntk(G,Gn)]n.\displaystyle\mathsf{K}_{\mathrm{gntk}}(G,\mathsf{G})=[\mathsf{K}_{\mathrm{gntk}}(G,G_{1}),\cdots,\mathsf{K}_{\mathrm{gntk}}(G,G_{n})]^{\top}\in\mathbb{R}^{n}.

We denote the feature function corresponding to the kernel 𝖪gntk\mathsf{K}_{\mathrm{gntk}} as we defined above as Φ:d×N\Phi:\mathbb{R}^{d\times N}\rightarrow\mathcal{F}, which satisfies

Φ(G),Φ(H)=𝖪gntk(G,H),\displaystyle\langle\Phi(G),\Phi(H)\rangle_{\mathcal{F}}=\mathsf{K}_{\mathrm{gntk}}(G,H),

for any graph data GG, HH. And we write Φ(𝖦)=[Φ(G1),,Φ(Gn)]\Phi(\mathsf{G})=[\Phi(G_{1}),\cdots,\Phi(G_{n})]^{\top}.
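Since the expectation in the definition of 𝖪_gntk is over Gaussian weights, a simple Monte Carlo estimate is available for the network of Definition D.1: using a_r^2 = 1, the inner product of the two gradients reduces in expectation over w ~ N(0, I_d) to E_w[Σ_{l,l'} σ̇(w^T x_l) σ̇(w^T y_{l'}) x_l^T y_{l'}]. The sketch below, with our own function name gntk_mc and sample count, estimates this expectation.

```python
import numpy as np

def gntk_mc(X, Y, num_samples=20000, seed=0):
    """Monte Carlo estimate of K_gntk(G, H) for the network of Definition D.1.

    X, Y : (d, N) feature matrices of the two graphs G and H.
    """
    d = X.shape[0]
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(num_samples, d))          # i.i.d. samples w ~ N(0, I_d)
    indic_x = (W @ X > 0).astype(float)            # ReLU derivatives at the nodes of G
    indic_y = (W @ Y > 0).astype(float)            # ReLU derivatives at the nodes of H
    inner = X.T @ Y                                # pairwise inner products x_l^T y_{l'}
    # Average over samples of sum_{l, l'} indicator * indicator * inner product.
    return float(np.einsum('sl,sk,lk->', indic_x, indic_y, inner) / num_samples)

# Usage: estimate one entry of H^cts from two toy graphs.
rng = np.random.default_rng(2)
print(gntk_mc(rng.normal(size=(3, 5)), rng.normal(size=(3, 4))))
```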

Definition D.4 (Neural tangent kernel regression).

Given

  • training data matrix 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\}

  • and corresponding label vector YnY\in\mathbb{R}^{n}.

Let

  • 𝖪gntk\mathsf{K}_{\mathrm{gntk}}, 𝖧cts\mathsf{H}^{\mathrm{cts}} and Φ\Phi be the neural tangent kernel and corresponding feature functions defined as in Definition D.3

Then we consider the following neural tangent kernel regression problem:

minβ12Yκfgntk(β,𝖦)22\displaystyle\min_{\beta}\frac{1}{2}\|Y-\kappa f_{\mathrm{gntk}}(\beta,\mathsf{G})\|_{2}^{2} (4)

where fgntk(β,G)=Φ(G)βf_{\mathrm{gntk}}(\beta,G)=\Phi(G)^{\top}\beta\in\mathbb{R} denotes the prediction function and

fgntk(β,𝖦)=[fgntk(β,G1),,fgntk(β,Gn)]n.\displaystyle f_{\mathrm{gntk}}(\beta,\mathsf{G})=[f_{\mathrm{gntk}}(\beta,G_{1}),\cdots,f_{\mathrm{gntk}}(\beta,G_{n})]^{\top}\in\mathbb{R}^{n}.

Consider the gradient flow associated with solving the problem given by equation (4), starting with the initial condition β(0)=0\beta(0)=0. Let β(t)\beta(t) represent the variable at the tt-th iteration. We represent

ugntk(t)=κΦ(𝖦)β(t)n\displaystyle u_{\mathrm{gntk}}(t)=\kappa\Phi(\mathsf{G})\beta(t)\in\mathbb{R}^{n} (5)

as the training data predictor at iteration tt. Given any test data xtestdx_{\mathrm{test}}\in\mathbb{R}^{d}, we denote

ugntk,test(t)=κΦ(Gtest)β(t)\displaystyle u_{\mathrm{gntk},\mathrm{test}}(t)=\kappa\Phi(G_{\mathrm{test}})^{\top}\beta(t)\in\mathbb{R} (6)

as the predictor for the test data at the tt-th iteration. It is important to highlight that the gradient flow converges to the optimal solution of problem (4) because of the strong convexity of the problem. We denote

β=limtβ(t)=κ1(Φ(𝖦)Φ(𝖦))1Φ(𝖦)Y\displaystyle\beta^{*}=\lim_{t\to\infty}\beta(t)=\kappa^{-1}(\Phi(\mathsf{G})^{\top}\Phi(\mathsf{G}))^{-1}\Phi(\mathsf{G})^{\top}Y (7)

and the optimal predictor derived from training data

u=limtugntk(t)=κΦ(𝖦)β=Yn\displaystyle u^{*}=\lim_{t\to\infty}u_{\mathrm{gntk}}(t)=\kappa\Phi(\mathsf{G})\beta^{*}=Y\in\mathbb{R}^{n} (8)

and the optimal test data predictor

utest=\displaystyle u_{\mathrm{test}}^{*}= limtugntk,test(t)\displaystyle\leavevmode\nobreak\ \lim_{t\to\infty}u_{\mathrm{gntk},\mathrm{test}}(t)
=\displaystyle= κΦ(Gtest)β\displaystyle\leavevmode\nobreak\ \kappa\Phi(G_{\mathrm{test}})^{\top}\beta^{*}
=\displaystyle= 𝖪gntk(Gtest,𝖦)(𝖧cts)1Y.\displaystyle\leavevmode\nobreak\ \mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})^{\top}(\mathsf{H}^{\mathrm{cts}})^{-1}Y\in\mathbb{R}. (9)
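As a usage note, the optimal predictors in Eq. (8) and Eq. (9) are standard kernel regression formulas; a minimal sketch follows (with our own helper name gntk_predict, and assuming the kernel matrix H^cts is invertible).

```python
import numpy as np

def gntk_predict(H_cts, k_test, Y):
    """Optimal GNTK regression test predictor u*_test from Eq. (9).

    H_cts  : (n, n) kernel matrix between the training graphs.
    k_test : (n,)   kernel vector K_gntk(G_test, G) against the training graphs.
    Y      : (n,)   training labels.
    """
    return float(k_test @ np.linalg.solve(H_cts, Y))

# Usage with a toy positive definite kernel matrix.
rng = np.random.default_rng(3)
B = rng.normal(size=(5, 5))
H_cts = B @ B.T + np.eye(5)
print(gntk_predict(H_cts, rng.normal(size=5), rng.normal(size=5)))
```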
Definition D.5 (Dynamic kernel).

Given

  • W(t)d×mW(t)\in\mathbb{R}^{d\times m} as the parameters of the neural network at training time tt as defined in Definition D.2.

For any data G,Hd×NG,H\in\mathbb{R}^{d\times N} , we define 𝖪t(G,H)\mathsf{K}_{t}(G,H)\in\mathbb{R} as

𝖪t(G,H)=fgnn(W(t),G)W(t),fgnn(W(t),H)W(t)\displaystyle\mathsf{K}_{t}(G,H)=\left\langle\frac{\partial f_{\mathrm{gnn}}(W(t),G)}{\partial W(t)},\frac{\partial f_{\mathrm{gnn}}(W(t),H)}{\partial W(t)}\right\rangle

Given training data matrix 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\}, we define 𝖧(t)n×n\mathsf{H}(t)\in\mathbb{R}^{n\times n} as

[𝖧(t)]i,j=𝖪t(Gi,Gj).\displaystyle[\mathsf{H}(t)]_{i,j}=\mathsf{K}_{t}(G_{i},G_{j})\in\mathbb{R}.

Further, given a test graph data GtestG_{\mathrm{test}}, we define 𝖪t(Gtest,𝖦)n\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\in\mathbb{R}^{n} as

𝖪t(Gtest,𝖦)=[𝖪t(Gtest,G1),,𝖪t(Gtest,Gn)]n.\displaystyle\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})=[\mathsf{K}_{t}(G_{\mathrm{test}},G_{1}),\cdots,\mathsf{K}_{t}(G_{\mathrm{test}},G_{n})]^{\top}\in\mathbb{R}^{n}.
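For the one-hidden-layer network of Definition D.1, the dynamic kernel at the current parameters W(t) can be evaluated directly from the gradients; a minimal sketch (the name dynamic_kernel is ours):

```python
import numpy as np

def dynamic_kernel(W, a, X_G, X_H):
    """K_t(G, H) = <d f_gnn(W, G) / dW, d f_gnn(W, H) / dW> at the current W.

    W        : (d, m) current first-layer weights.
    a        : (m,)   fixed output weights in {-1, +1}.
    X_G, X_H : (d, N) node features of the two graphs.
    """
    m = W.shape[1]
    def grad(X):
        act = (W.T @ X > 0).astype(float)     # ReLU indicators, shape (m, N)
        return (X @ act.T) * a / np.sqrt(m)   # d f_gnn / dW, shape (d, m)
    return float(np.sum(grad(X_G) * grad(X_H)))

# Usage: one entry of H(t) for two toy graphs at random initialization.
rng = np.random.default_rng(4)
W = rng.normal(size=(3, 64))
a = rng.choice([-1.0, 1.0], size=64)
print(dynamic_kernel(W, a, rng.normal(size=(3, 5)), rng.normal(size=(3, 5))))
```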

D.2 Gradient flow

In this section, we compute the gradient flow of iterative GNTK regression and of the GNN training process.

First, we compute the gradient flow of kernel regression.

Lemma D.6.

Given

  • training graph data 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\},

  • and corresponding label vector YnY\in\mathbb{R}^{n}. Let fgntkf_{\mathrm{gntk}} be defined as in Definition D.4.

Let

  • β(t)\beta(t), κ(0,1)\kappa\in(0,1) and ugntk(t)nu_{\mathrm{gntk}}(t)\in\mathbb{R}^{n} be defined as in Definition D.4.

  • \mathsf{K}_{\mathrm{gntk}}:\mathbb{R}^{d\times N}\times\mathbb{R}^{n\times d\times N}\to\mathbb{R}^{n} be defined as in Definition D.3.

Then for any data Gd×NG\in\mathbb{R}^{d\times N}, we have

dfgntk(β(t),G)dt=κ𝖪gntk(G,𝖦)(Yugntk(t))\displaystyle\frac{\mathrm{d}f_{\mathrm{gntk}}(\beta(t),G)}{\mathrm{d}t}=\kappa\cdot\mathsf{K}_{\mathrm{gntk}}(G,\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t))
Proof.

Denote L(t)=12Yugntk(t)22L(t)=\frac{1}{2}\|Y-u_{\mathrm{gntk}}(t)\|_{2}^{2}. By the rule of gradient descent, we have

dβ(t)dt=dLdβ=κΦ(𝖦)(Yugntk(t)),\displaystyle\frac{\mathrm{d}\beta(t)}{\mathrm{d}t}=-\frac{\mathrm{d}L}{\mathrm{d}\beta}=\kappa\Phi(\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t)),

where Φ\Phi is defined in Definition D.3. Thus we have

dfgntk(β(t),G)dt=\displaystyle\frac{\mathrm{d}f_{\mathrm{gntk}}(\beta(t),G)}{\mathrm{d}t}= dfgntk(β(t),G)dβ(t)dβ(t)dt\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}f_{\mathrm{gntk}}(\beta(t),G)}{\mathrm{d}\beta(t)}\frac{\mathrm{d}\beta(t)}{\mathrm{d}t}
=\displaystyle= Φ(G)(κΦ(𝖦)(Yugntk(t)))\displaystyle\leavevmode\nobreak\ \Phi(G)^{\top}(\kappa\Phi(\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t)))
=\displaystyle= κ𝖪gntk(G,𝖦)(Yugntk(t))\displaystyle\leavevmode\nobreak\ \kappa\mathsf{K}_{\mathrm{gntk}}(G,\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t))

where the initial step arises from the chain rule. The subsequent step is a consequence of the relationship dfgntk(β,G)/dβ=Φ(G)\mathrm{d}f_{\mathrm{gntk}}(\beta,G)/\mathrm{d}\beta=\Phi(G)^{\top}. The final step is based on the definition of the kernel, with 𝖪gntk(G,𝖦)=Φ(𝖦)Φ(G)\mathsf{K}_{\mathrm{gntk}}(G,\mathsf{G})=\Phi(\mathsf{G})\Phi(G) belonging to n\mathbb{R}^{n}. ∎

Corollary D.7 (Gradient of prediction of kernel regression).

Given

  • training data matrix 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} and corresponding label vector YnY\in\mathbb{R}^{n}.

  • a test data GtestG_{\mathrm{test}}.

Let

  • fgntkf_{\mathrm{gntk}} be defined as in Definition D.4.

  • β(t)\beta(t), κ(0,1)\kappa\in(0,1) and ugntk(t)nu_{\mathrm{gntk}}(t)\in\mathbb{R}^{n} be defined as in Definition D.4.

  • \mathsf{K}_{\mathrm{gntk}}:\mathbb{R}^{d\times N}\times\mathbb{R}^{n\times d\times N}\rightarrow\mathbb{R}^{n},\leavevmode\nobreak\ H^{\mathrm{cts}}\in\mathbb{R}^{n\times n} be defined as in Definition D.3.

Then we have

dugntk(t)dt\displaystyle\frac{\mathrm{d}u_{\mathrm{gntk}}(t)}{\mathrm{d}t} =κ2Hcts(Yugntk(t))\displaystyle=\kappa^{2}H^{\mathrm{cts}}(Y-u_{\mathrm{gntk}}(t))
dugntk,test(t)dt\displaystyle\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t} =κ2𝖪gntk(Gtest,𝖦)(Yugntk(t)).\displaystyle=\kappa^{2}\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t)).
Proof.

Plugging in G=GiG=G_{i} in Lemma D.6, we have

dfgntk(β(t),Gi)dt=κ𝖪gntk(Gi,𝖦)(Yugntk(t)).\displaystyle\frac{\mathrm{d}f_{\mathrm{gntk}}(\beta(t),G_{i})}{\mathrm{d}t}=\kappa\mathsf{K}_{\mathrm{gntk}}(G_{i},\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t)).

Note [ugntk(t)]i=κfgntk(β(t),Gi)[u_{\mathrm{gntk}}(t)]_{i}=\kappa f_{\mathrm{gntk}}(\beta(t),G_{i}) and [𝖧cts]:,i=𝖪gntk(Gi,𝖦)[\mathsf{H}^{\mathrm{cts}}]_{:,i}=\mathsf{K}_{\mathrm{gntk}}(G_{i},\mathsf{G}), so writing all the data in a compact form, we have

dugntk(t)dt=κ2𝖧cts(Yugntk(t)).\displaystyle\frac{\mathrm{d}u_{\mathrm{gntk}}(t)}{\mathrm{d}t}=\kappa^{2}\mathsf{H}^{\mathrm{cts}}(Y-u_{\mathrm{gntk}}(t)).

Plugging in data G=GtestG=G_{\mathrm{test}} in Lemma D.6, we have

dfgntk(β(t),Gtest)dt=κ𝖪gntk(Gtest,𝖦)(Yugntk(t)).\displaystyle\frac{\mathrm{d}f_{\mathrm{gntk}}(\beta(t),G_{\mathrm{test}})}{\mathrm{d}t}=\kappa\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t)).

Note by definition, ugntk,test(t)=κfgntk(β(t),Gtest)u_{\mathrm{gntk},\mathrm{test}}(t)=\kappa f_{\mathrm{gntk}}(\beta(t),G_{\mathrm{test}})\in\mathbb{R}, so we have

dugntk,test(t)dt=κ2𝖪gntk(Gtest,𝖦)(Yugntk(t)).\displaystyle\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t}=\kappa^{2}\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t)).

We prove the linear convergence of kernel regression:

Lemma D.8.

Given

  • training graph data 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\},

  • and corresponding label vector YnY\in\mathbb{R}^{n}.

Let

  • κ(0,1)\kappa\in(0,1) and ugntk(t)nu_{\mathrm{gntk}}(t)\in\mathbb{R}^{n} be defined as in Definition D.4.

  • unu^{*}\in\mathbb{R}^{n} be defined in Definition D.4.

  • Λ0>0\Lambda_{0}>0 be defined as in Definition D.3.

Then we have

dugntk(t)u22dt2(κ2Λ0)ugntk(t)u22.\displaystyle\frac{\mathrm{d}\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2}}{\mathrm{d}t}\leq-2(\kappa^{2}\Lambda_{0})\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2}.

Further, we have

ugntk(t)u2e(κ2Λ0)tugntk(0)u2.\displaystyle\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}\leq e^{-(\kappa^{2}\Lambda_{0})t}\|u_{\mathrm{gntk}}(0)-u^{*}\|_{2}.
Proof.

So we have

dugntk(t)u22dt\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2}}{\mathrm{d}t}
=\displaystyle= 2(ugntk(t)u)dugntk(t)dt\displaystyle\leavevmode\nobreak\ 2(u_{\mathrm{gntk}}(t)-u^{*})^{\top}\frac{\mathrm{d}u_{\mathrm{gntk}}(t)}{\mathrm{d}t}
=\displaystyle= 2κ2(ugntk(t)u)Hcts(ugntk(t)Y)\displaystyle\leavevmode\nobreak\ -2\kappa^{2}(u_{\mathrm{gntk}}(t)-u^{*})^{\top}H^{\mathrm{cts}}(u_{\mathrm{gntk}}(t)-Y)
\displaystyle\leq 2(κ2Λ0)ugntk(t)u22,\displaystyle\leavevmode\nobreak\ -2(\kappa^{2}\Lambda_{0})\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2}, (10)

where the initial step is derived from the chain rule. The subsequent step is based on Corollary D.7, and the concluding step is in accordance with the definition of Λ0\Lambda_{0}. Further, since

d(e2(κ2Λ0)tugntk(t)u22)dt\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}(e^{2(\kappa^{2}\Lambda_{0})t}\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2})}{\mathrm{d}t}
=\displaystyle= 2(κ2Λ0)e2(κ2Λ0)tugntk(t)u22\displaystyle\leavevmode\nobreak\ 2(\kappa^{2}\Lambda_{0})e^{2(\kappa^{2}\Lambda_{0})t}\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2}
+e2(κ2Λ0)tdugntk(t)u22dt\displaystyle\leavevmode\nobreak\ \quad+e^{2(\kappa^{2}\Lambda_{0})t}\cdot\frac{\mathrm{d}\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2}}{\mathrm{d}t}
\displaystyle\leq 0,\displaystyle\leavevmode\nobreak\ 0,

where the initial step involves gradient computation, while the subsequent step is derived from Eq. (10). Consequently, the value of e^{2(\kappa^{2}\Lambda_{0})t}\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2} does not increase, leading to the inference that

\displaystyle\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}\leq e^{-(\kappa^{2}\Lambda_{0})t}\|u_{\mathrm{gntk}}(0)-u^{*}\|_{2}.
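A forward-Euler discretization of the gradient flow in Corollary D.7 makes this linear convergence easy to check numerically; the toy kernel matrix, step size, and horizon below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, kappa, eta, steps = 5, 0.5, 0.05, 2000
B = rng.normal(size=(n, n))
H_cts = B @ B.T + np.eye(n)                 # a toy positive definite stand-in for H^cts
lam0 = np.linalg.eigvalsh(H_cts).min()      # Lambda_0
Y = rng.normal(size=n)

u = np.zeros(n)                             # u_gntk(0) = 0 since beta(0) = 0
for _ in range(steps):
    u += eta * kappa**2 * H_cts @ (Y - u)   # Euler step of du/dt = kappa^2 H^cts (Y - u)

T = eta * steps
# The residual should be (up to discretization error) at most exp(-kappa^2 * Lambda_0 * T) * ||Y||_2.
print(np.linalg.norm(u - Y), np.exp(-kappa**2 * lam0 * T) * np.linalg.norm(Y))
```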

We compute the gradient flow of neural network training:

Lemma D.9.

Given

  • training graph data 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} ,

  • and corresponding label vector YnY\in\mathbb{R}^{n}.

Let

  • fgnn:d×m×d×Nf_{\mathrm{gnn}}:\mathbb{R}^{d\times m}\times\mathbb{R}^{d\times N}\rightarrow\mathbb{R} be defined as in Definition D.1.

  • W(t)d×mW(t)\in\mathbb{R}^{d\times m}, κ(0,1)\kappa\in(0,1) and ugnn(t)nu_{\mathrm{gnn}}(t)\in\mathbb{R}^{n} be defined as in Definition D.2.

  • 𝖪t:d×N×n×d×Nn\mathsf{K}_{t}:\mathbb{R}^{d\times N}\times\mathbb{R}^{n\times d\times N}\rightarrow\mathbb{R}^{n} be defined as in Definition D.5.

Then for any data Gd×NG\in\mathbb{R}^{d\times N}, we have

dfgnn(W(t),G)dt=κ𝖪t(G,𝖦)(Yugnn(t)).\displaystyle\frac{\mathrm{d}f_{\mathrm{gnn}}(W(t),G)}{\mathrm{d}t}=\kappa\mathsf{K}_{t}(G,\mathsf{G})^{\top}(Y-u_{\mathrm{gnn}}(t)).
Proof.

Denote L(t)=12Yugnn(t)22L(t)=\frac{1}{2}\|Y-u_{\mathrm{gnn}}(t)\|_{2}^{2}.

By the rule of gradient descent, we have

dwrdt=Lwr=(ugnnwr)(Yugnn).\displaystyle\frac{\mathrm{d}w_{r}}{\mathrm{d}t}=-\frac{\partial L}{\partial w_{r}}=(\frac{\partial u_{\mathrm{gnn}}}{\partial w_{r}})^{\top}(Y-u_{\mathrm{gnn}}). (11)

Thus, we have

dfgnn(W(t),G)dt\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}f_{\mathrm{gnn}}(W(t),G)}{\mathrm{d}t}
=\displaystyle= dfgnn(W(t),G)dW(t),dW(t)dt\displaystyle\leavevmode\nobreak\ \Big{\langle}\frac{\mathrm{d}f_{\mathrm{gnn}}(W(t),G)}{\mathrm{d}W(t)},\frac{\mathrm{d}W(t)}{\mathrm{d}t}\Big{\rangle}
=\displaystyle= j=1n(yjκfgnn(W(t),Gj))\displaystyle\leavevmode\nobreak\ \sum_{j=1}^{n}(y_{j}-\kappa f_{\mathrm{gnn}}(W(t),G_{j}))
dfgnn(W(t),G)dW(t),dκfgnn(W(t),Gj)dW(t)\displaystyle\quad\cdot\Big{\langle}\frac{\mathrm{d}f_{\mathrm{gnn}}(W(t),G)}{\mathrm{d}W(t)},\frac{\mathrm{d}\kappa f_{\mathrm{gnn}}(W(t),G_{j})}{\mathrm{d}W(t)}\Big{\rangle}
=\displaystyle= κj=1n(yjκfgnn(W(t),Gj))𝖪t(G,Gj)\displaystyle\leavevmode\nobreak\ \kappa\sum_{j=1}^{n}(y_{j}-\kappa f_{\mathrm{gnn}}(W(t),G_{j}))\cdot\mathsf{K}_{t}(G,G_{j})
=\displaystyle= κ𝖪t(G,𝖦)(Yugnn(t))\displaystyle\leavevmode\nobreak\ \kappa\mathsf{K}_{t}(G,\mathsf{G})^{\top}(Y-u_{\mathrm{gnn}}(t))

where the initial step is derived from the chain rule. The subsequent step is based on Eq. (11). The third step adheres to the definition of 𝖪t\mathsf{K}_{t}, and the concluding step presents the formula in a more concise manner. ∎

Corollary D.10 (Gradient of prediction of graph neural network).

Given

  • training graph data 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} and corresponding label vector YnY\in\mathbb{R}^{n}.

  • a test graph data GtestG_{\mathrm{test}}. Let fgnn:d×m×d×Nf_{\mathrm{gnn}}:\mathbb{R}^{d\times m}\times\mathbb{R}^{d\times N}\rightarrow\mathbb{R} be defined as in Definition D.1.

Let

  • W(t)d×mW(t)\in\mathbb{R}^{d\times m}, κ(0,1)\kappa\in(0,1) and ugnn(t)nu_{\mathrm{gnn}}(t)\in\mathbb{R}^{n} be defined as in Definition D.2.

  • 𝖪t:d×N×n×d×Nn,H(t)n×n\mathsf{K}_{t}:\mathbb{R}^{d\times N}\times\mathbb{R}^{n\times d\times N}\rightarrow\mathbb{R}^{n},\leavevmode\nobreak\ H(t)\in\mathbb{R}^{n\times n} be defined as in Definition D.5.

Then we have

dugnn(t)dt=\displaystyle\frac{\mathrm{d}u_{\mathrm{gnn}}(t)}{\mathrm{d}t}= κ2H(t)(Yugnn(t))\displaystyle\leavevmode\nobreak\ \kappa^{2}H(t)(Y-u_{\mathrm{gnn}}(t))
dugnn,test(t)dt=\displaystyle\frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t}= κ2𝖪t(Gtest,𝖦)(Yugnn(t)).\displaystyle\leavevmode\nobreak\ \kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(Y-u_{\mathrm{gnn}}(t)).
Proof.

Plugging in G=Gid×NG=G_{i}\in\mathbb{R}^{d\times N} in Lemma D.9, we have

dfgnn(W(t),Gi)dt=κ𝖪t(Gi,𝖦)(Yugnn(t)).\displaystyle\frac{\mathrm{d}f_{\mathrm{gnn}}(W(t),G_{i})}{\mathrm{d}t}=\kappa\mathsf{K}_{t}(G_{i},\mathsf{G})^{\top}(Y-u_{\mathrm{gnn}}(t)).

Note [ugnn(t)]i=κfgnn(W(t),Gi)[u_{\mathrm{gnn}}(t)]_{i}=\kappa f_{\mathrm{gnn}}(W(t),G_{i}) and [H(t)]:,i=𝖪t(Gi,𝖦)[H(t)]_{:,i}=\mathsf{K}_{t}(G_{i},\mathsf{G}), so writing all the data in a compact form, we have

dugnn(t)dt=κ2H(t)(Yugnn(t)).\displaystyle\frac{\mathrm{d}u_{\mathrm{gnn}}(t)}{\mathrm{d}t}=\kappa^{2}H(t)(Y-u_{\mathrm{gnn}}(t)).

Plugging in data G=GtestG=G_{\mathrm{test}} in Lemma D.9, we have

dfgnn(W(t),Gtest)dt=κ𝖪t(Gtest,𝖦)(Yugnn(t)).\displaystyle\frac{\mathrm{d}f_{\mathrm{gnn}}(W(t),G_{\mathrm{test}})}{\mathrm{d}t}=\kappa\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(Y-u_{\mathrm{gnn}}(t)).

Note by definition, ugnn,test(t)=κfgnn(W(t),Gtest)u_{\mathrm{gnn},\mathrm{test}}(t)=\kappa f_{\mathrm{gnn}}(W(t),G_{\mathrm{test}}), so we have

dugnn,test(t)dt=κ2𝖪t(Gtest,𝖦)(Yugnn(t)).\displaystyle\frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t}=\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(Y-u_{\mathrm{gnn}}(t)).

We prove the linear convergence of neural network training:

Lemma D.11.

Given

  • training graph data matrix 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} and corresponding label vector YnY\in\mathbb{R}^{n}.

  • the total number of iterations T>0T>0.

Let

  • κ(0,1)\kappa\in(0,1) and ugnn(t)nu_{\mathrm{gnn}}(t)\in\mathbb{R}^{n} be defined as in Definition D.2.

  • unu^{*}\in\mathbb{R}^{n} be defined in Eq. (8).

  • Hctsn×nH^{\mathrm{cts}}\in\mathbb{R}^{n\times n} and Λ0>0\Lambda_{0}>0 be defined as in Definition D.3.

  • H(t)n×nH(t)\in\mathbb{R}^{n\times n} be defined as in Definition D.5.

Then we have

dugnn(t)u22dt(κ2Λ0)ugnn(t)u22.\displaystyle\frac{\mathrm{d}\|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}}{\mathrm{d}t}\leq-(\kappa^{2}\Lambda_{0})\|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}.
Proof.

We have

dugnn(t)u22dt\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}\|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}}{\mathrm{d}t}
=\displaystyle= 2(ugnn(t)u)dugnn(t)dt\displaystyle\leavevmode\nobreak\ 2(u_{\mathrm{gnn}}(t)-u^{*})^{\top}\frac{\mathrm{d}u_{\mathrm{gnn}}(t)}{\mathrm{d}t}
=\displaystyle= 2κ2(ugnn(t)u)H(t)(ugnn(t)Y)\displaystyle\leavevmode\nobreak\ -2\kappa^{2}(u_{\mathrm{gnn}}(t)-u^{*})^{\top}H(t)(u_{\mathrm{gnn}}(t)-Y)
=\displaystyle= 2(ugnn(t)u)(κ2H(t))(ugnn(t)u)\displaystyle\leavevmode\nobreak\ -2(u_{\mathrm{gnn}}(t)-u^{*})^{\top}(\kappa^{2}H(t))(u_{\mathrm{gnn}}(t)-u^{*})
\displaystyle\leq (κ2Λ0)ugnn(t)u22\displaystyle\leavevmode\nobreak\ -(\kappa^{2}\Lambda_{0})\|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}

where the first step follows from the chain rule, the second step follows from Corollary D.10, the third step uses basic linear algebra, and the last step follows from the assumption H(t)HctsΛ0/2\|H(t)-H^{\mathrm{cts}}\|\leq\Lambda_{0}/2, which implies λmin(H(t))Λ0/2\lambda_{\min}(H(t))\geq\Lambda_{0}/2. ∎

D.3 Perturbation during iterative GNTK regression

In this section, we prove an upper bound for |ugntk,test(T)utest||u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}|.

Lemma D.12.

Let

  • ugntk,test(T)u_{\mathrm{gntk},\mathrm{test}}(T)\in\mathbb{R} and utestu_{\mathrm{test}}^{*}\in\mathbb{R} be defined as in Definition D.4.

Given

  • any accuracy ε>0\varepsilon>0,

if κ(0,1)\kappa\in(0,1), then by picking T=O~(1κ2Λ0)T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), we have

|ugntk,test(T)utest|ε/2.\displaystyle|u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}|\leq\varepsilon/2.

where O~()\widetilde{O}(\cdot) here hides polylog(n/(εΛ0))\mathrm{poly}\log(n/(\varepsilon\Lambda_{0})).

Proof.

Due to the linear convergence of kernel regression, i.e.,

dβ(t)β22dt2(κ2Λ0)β(t)β22\displaystyle\frac{\mathrm{d}\|\beta(t)-\beta^{*}\|_{2}^{2}}{\mathrm{d}t}\leq-2(\kappa^{2}\Lambda_{0})\|\beta(t)-\beta^{*}\|_{2}^{2}

Thus,

|ugntk,test(T)utest|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}|
=\displaystyle= |κΦ(Gtest)β(T)κΦ(Gtest)β|\displaystyle\leavevmode\nobreak\ |\kappa\Phi(G_{\mathrm{test}})^{\top}\beta(T)-\kappa\Phi(G_{\mathrm{test}})^{\top}\beta^{*}|
\displaystyle\leq κΦ(Gtest)2β(T)β2\displaystyle\leavevmode\nobreak\ \kappa\|\Phi(G_{\mathrm{test}})\|_{2}\|\beta(T)-\beta^{*}\|_{2}
\displaystyle\leq κe(κ2Λ0)Tβ(0)β2\displaystyle\leavevmode\nobreak\ \kappa e^{-(\kappa^{2}\Lambda_{0})T}\|\beta(0)-\beta^{*}\|_{2}
\displaystyle\leq e(κ2Λ0)Tpoly(κ,N,n,1/Λ0)\displaystyle\leavevmode\nobreak\ e^{-(\kappa^{2}\Lambda_{0})T}\cdot\mathrm{poly}(\kappa,N,n,1/\Lambda_{0})

where the final step is derived from the conditions β(0)=0\beta(0)=0 and the norm β2\|\beta^{*}\|_{2} being bounded by a polynomial in κ,N,n,\kappa,N,n, and 1/Λ01/\Lambda_{0}.

It’s worth noting that κ\kappa lies in the interval (0,1). Consequently, by selecting T=O~(1κ2Λ0)T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), it follows that

ugntk,test(T)utest2ε/2,\displaystyle\|u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}\|_{2}\leq\varepsilon/2,

where O~()\widetilde{O}(\cdot) here hides polylog(Nn/(εΛ0))\mathrm{poly}\log(Nn/(\varepsilon\Lambda_{0})).∎

Lemma D.13.

Let ugntk,test,node(T)u_{\mathrm{gntk},\mathrm{test},\mathrm{node}}(T)\in\mathbb{R} and utest,nodeu_{\mathrm{test},\mathrm{node}}^{*}\in\mathbb{R} be defined as in Definition 3.8. Given

  • any accuracy ε>0\varepsilon>0,

if

  • κ(0,1)\kappa\in(0,1),

then by picking T=O~(1κ2Λ0)T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), we have

|ugntk,test,node(T)utest,node|ε/2.\displaystyle|u_{\mathrm{gntk},\mathrm{test},\mathrm{node}}(T)-u_{\mathrm{test},\mathrm{node}}^{*}|\leq\varepsilon/2.

D.4 Bounding initialization and kernel perturbation

In this section, we prove Lemma D.14, which shows that in order to bound the prediction perturbation, it suffices to bound both the initialization perturbation and the kernel perturbation. We prove this result using the Newton–Leibniz formula.

Lemma D.14 (Prediction perturbation implies kernel perturbation).

Given

  • training graph data 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} and corresponding label vector YnY\in\mathbb{R}^{n}.

  • the total number of iterations T>0T>0.

  • arbitrary test data GtestG_{\mathrm{test}}.

Let

  • ugnn,test(t)nu_{\mathrm{gnn},\mathrm{test}}(t)\in\mathbb{R}^{n} and ugntk,test(t)nu_{\mathrm{gntk},\mathrm{test}}(t)\in\mathbb{R}^{n} be the test data predictors defined in Definition D.2 and Definition D.4 respectively.

  • κ(0,1)\kappa\in(0,1) be the corresponding multiplier.

  • 𝖪gntk(Gtest,𝖦)n,𝖪t(Gtest,𝖦)n,𝖧ctsn×n,𝖧(t)n×n,Λ0>0\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})\in\mathbb{R}^{n},\leavevmode\nobreak\ \mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\in\mathbb{R}^{n},\leavevmode\nobreak\ \mathsf{H}^{\mathrm{cts}}\in\mathbb{R}^{n\times n},\leavevmode\nobreak\ \mathsf{H}(t)\in\mathbb{R}^{n\times n},\leavevmode\nobreak\ \Lambda_{0}>0 be defined in Definition D.3 and Definition D.5.

  • unu^{*}\in\mathbb{R}^{n} be defined as in Eq. (8).

  • εK(0,1)\varepsilon_{K}\in(0,1), εinit(0,1)\varepsilon_{\mathrm{init}}\in(0,1) and εH(0,1)\varepsilon_{H}\in(0,1) denote parameters that are independent of tt, and the following conditions hold for all t[0,T]t\in[0,T],

    • ugnn(0)2nεinit\|u_{\mathrm{gnn}}(0)\|_{2}\leq\sqrt{n}\varepsilon_{\mathrm{init}} and |ugnn,test(0)|εinit|u_{\mathrm{gnn},\mathrm{test}}(0)|\leq\varepsilon_{\mathrm{init}}

    • 𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦)2εK\|\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\leq\varepsilon_{K}

    • 𝖧(t)𝖧ctsεH\|\mathsf{H}(t)-\mathsf{H}^{\mathrm{cts}}\|\leq\varepsilon_{H}

then we have

|ugnn,test(T)ugntk,test(T)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|
\displaystyle\leq (1+κ2nT)εinit+εKu2Λ0+nT2κ4εHu2\displaystyle\leavevmode\nobreak\ (1+\kappa^{2}nT)\varepsilon_{\mathrm{init}}+\varepsilon_{K}\cdot\frac{\|u^{*}\|_{2}}{\Lambda_{0}}+\sqrt{n}T^{2}\kappa^{4}\varepsilon_{H}\|u^{*}\|_{2}
Proof.

Combining results from Lemma D.15, Claim D.16D.17D.18, we complete the proof. We have

|ugnn,test(T)ugntk,test(T)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|
\displaystyle\leq |ugnn,test(0)ugntk,test(0)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|
+κ2|0T(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)dt|\displaystyle\leavevmode\nobreak\ +\kappa^{2}\Big{|}\int_{0}^{T}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)\mathrm{d}t\Big{|}
+κ2|0T𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))dt|\displaystyle\leavevmode\nobreak\ +\kappa^{2}\Big{|}\int_{0}^{T}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t))\mathrm{d}t\Big{|}
\displaystyle\leq εinit+εKuΛ0+κ2nεinitT+nT2κ4εHu2\displaystyle\leavevmode\nobreak\ \varepsilon_{\mathrm{init}}+\varepsilon_{K}\cdot\frac{\|u^{*}\|}{\Lambda_{0}}+\kappa^{2}n\varepsilon_{\mathrm{init}}T+\sqrt{n}T^{2}\cdot\kappa^{4}\varepsilon_{H}\cdot\|u^{*}\|_{2}
\displaystyle\leq (1+κ2nT)εinit+εKu2Λ0+nT2κ4εHu2\displaystyle\leavevmode\nobreak\ (1+\kappa^{2}nT)\varepsilon_{\mathrm{init}}+\varepsilon_{K}\cdot\frac{\|u^{*}\|_{2}}{\Lambda_{0}}+\sqrt{n}T^{2}\kappa^{4}\varepsilon_{H}\|u^{*}\|_{2}

where the initial step is derived from Lemma D.15. The subsequent step is based on Claim D.16D.17, and D.18. The concluding step streamlines the expression. ∎

To prove Lemma D.14, we first show that |ugnn,test(T)ugntk,test(T)||u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)| can be upper bounded by three terms that are defined in Lemma D.15, then we bound each of these three terms respectively in Claim D.16, Claim D.17, and Claim D.18.

Lemma D.15.

Follow the same notation as Lemma D.14, we have

|ugnn,test(T)ugntk,test(T)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|
\displaystyle\leq |ugnn,test(0)ugntk,test(0)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|
+|0Tκ2(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)dt|\displaystyle\quad+\Big{|}\int_{0}^{T}\kappa^{2}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)\mathrm{d}t\Big{|}
+|0Tκ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))dt|.\displaystyle\quad+\Big{|}\int_{0}^{T}\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t))\mathrm{d}t\Big{|}.
Proof.
|ugnn,test(T)ugntk,test(T)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|
=\displaystyle= |ugnn,test(0)ugntk,test(0)\displaystyle\leavevmode\nobreak\ \Big{|}u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)
+0T(dugnn,test(t)dtdugntk,test(t)dt)dt|\displaystyle\quad+\int_{0}^{T}(\frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t}-\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t})\mathrm{d}t\Big{|}
\displaystyle\leq |ugnn,test(0)ugntk,test(0)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|
+|0T(dugnn,test(t)dtdugntk,test(t)dt)dt|,\displaystyle\quad+\Big{|}\int_{0}^{T}(\frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t}-\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t})\mathrm{d}t\Big{|}, (12)

where the initial step is derived from the integral’s definition. The subsequent step is based on the triangle inequality. It’s worth noting that, as per Corollary D.7 and  D.10, their respective gradient flows are described as

dugntk,test(t)dt\displaystyle\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t} =κ2𝖪gntk(Gtest,𝖦)(ugntk(t)Y)\displaystyle=-\kappa^{2}\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-Y) (13)
dugnn,test(t)dt\displaystyle\frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t} =κ2𝖪t(Gtest,𝖦)(ugnn(t)Y)\displaystyle=-\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gnn}}(t)-Y) (14)

where ugntk(t)nu_{\mathrm{gntk}}(t)\in\mathbb{R}^{n} and ugnn(t)nu_{\mathrm{gnn}}(t)\in\mathbb{R}^{n} are the predictors for training data defined in Definition D.4 and Definition D.2. Thus, we have

dugnn,test(t)dtdugntk,test(t)dt\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t}-\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t}
=\displaystyle= κ2𝖪t(Gtest,𝖦)(ugnn(t)Y)\displaystyle\leavevmode\nobreak\ -\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gnn}}(t)-Y)
+κ2𝖪gntk(Gtest,𝖦)(ugntk(t)Y)\displaystyle\quad+\kappa^{2}\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-Y)
=\displaystyle= κ2(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)\displaystyle\leavevmode\nobreak\ \kappa^{2}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)
κ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))\displaystyle\quad-\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)) (15)

where the initial step is derived from both Eq.(13) and Eq.(14). The subsequent step reformulates the formula. So we have

|0T(dugnn,test(t)dtdugntk,test(t)dt)dt|\displaystyle\leavevmode\nobreak\ \Big{|}\int_{0}^{T}(\frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t}-\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t})\mathrm{d}t\Big{|}
=\displaystyle= |0Tκ2((𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)\displaystyle\leavevmode\nobreak\ \Big{|}\int_{0}^{T}\kappa^{2}((\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)
κ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t)))dt|\displaystyle\quad-\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)))\mathrm{d}t\Big{|} (16)

Thus,

|ugnn,test(T)ugntk,test(T)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|
\displaystyle\leq |ugnn,test(0)ugntk,test(0)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|
+|0T(dugnn,test(t)dtdugntk,test(t)dt)dt|\displaystyle\quad+\Big{|}\int_{0}^{T}(\frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t}-\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t})\mathrm{d}t\Big{|}
\displaystyle\leq |ugnn,test(0)ugntk,test(0)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|
+|0Tκ2((𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)\displaystyle\quad+\Big{|}\int_{0}^{T}\kappa^{2}((\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)
κ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t)))dt|\displaystyle\quad-\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)))\mathrm{d}t\Big{|}
\displaystyle\leq |ugnn,test(0)ugntk,test(0)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|
+|0Tκ2(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)dt|\displaystyle\quad+\Big{|}\int_{0}^{T}\kappa^{2}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)\mathrm{d}t\Big{|}
+|0Tκ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))dt|\displaystyle\quad+\Big{|}\int_{0}^{T}\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t))\mathrm{d}t\Big{|}

where the initial step is based on Eq. (12), the next step is derived from Eq. (16), and the third step arises from the triangle inequality. ∎

Now we bound each of these three terms |ugnn,test(0)ugntk,test(0)||u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|, |0Tκ2(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)dt|\Big{|}\int_{0}^{T}\kappa^{2}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)\mathrm{d}t\Big{|} and |0Tκ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))dt|\Big{|}\int_{0}^{T}\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t))\mathrm{d}t\Big{|} in the following three claims.

Claim D.16.

We have

|ugnn,test(0)ugntk,test(0)|εinit.\displaystyle|u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|\leq\varepsilon_{\mathrm{init}}.
Proof.

Note ugntk,test(0)=0u_{\mathrm{gntk},\mathrm{test}}(0)=0, so by assumption we have

|ugnn,test(0)ugntk,test(0)|=|ugnn,test(0)|εinit.\displaystyle|u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|=|u_{\mathrm{gnn},\mathrm{test}}(0)|\leq\varepsilon_{\mathrm{init}}.

Claim D.17.

We have

|0Tκ2(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)dt|εKuΛ0.\displaystyle\Big{|}\int_{0}^{T}\kappa^{2}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)\mathrm{d}t\Big{|}\leq\varepsilon_{K}\cdot\frac{\|u^{*}\|}{\Lambda_{0}}.
Proof.

Note

κ2|0T(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)dt|\displaystyle\leavevmode\nobreak\ \kappa^{2}\Big{|}\int_{0}^{T}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)\mathrm{d}t\Big{|}
\displaystyle\leq κ2maxt[0,T]𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦)2\displaystyle\leavevmode\nobreak\ \kappa^{2}\max_{t\in[0,T]}\|\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}
0Tugntk(t)Y2dt,\displaystyle\quad\cdot\int_{0}^{T}\|u_{\mathrm{gntk}}(t)-Y\|_{2}\mathrm{d}t,

where the initial step is derived from the Cauchy-Schwarz inequality. It’s important to mention that, as per Lemma D.8, the kernel regression predictor ugntk(t)u_{\mathrm{gntk}}(t), which belongs to n\mathbb{R}^{n}, exhibits linear convergence towards the optimal predictor u=Yu^{*}=Y, which is also in n\mathbb{R}^{n}. Specifically,

ugntk(t)u2e(κ2Λ0)tugntk(0)u2.\displaystyle\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}\leq e^{-(\kappa^{2}\Lambda_{0})t}\|u_{\mathrm{gntk}}(0)-u^{*}\|_{2}. (17)

Thus, we have

0Tugntk(t)Y2dt\displaystyle\leavevmode\nobreak\ \int_{0}^{T}\|u_{\mathrm{gntk}}(t)-Y\|_{2}\mathrm{d}t
\displaystyle\leq 0Tugntk(t)u2dt+0TuY2dt\displaystyle\leavevmode\nobreak\ \int_{0}^{T}\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}\mathrm{d}t+\int_{0}^{T}\|u^{*}-Y\|_{2}\mathrm{d}t
\displaystyle\leq 0Te(κ2Λ0)tugntk(0)u2dt\displaystyle\leavevmode\nobreak\ \int_{0}^{T}e^{-(\kappa^{2}\Lambda_{0})t}\|u_{\mathrm{gntk}}(0)-u^{*}\|_{2}\mathrm{d}t
\displaystyle\leq ugntk(0)u2κ2Λ0\displaystyle\leavevmode\nobreak\ \frac{\|u_{\mathrm{gntk}}(0)-u^{*}\|_{2}}{\kappa^{2}\Lambda_{0}}
=\displaystyle= uκ2Λ0,\displaystyle\leavevmode\nobreak\ \frac{\|u^{*}\|}{\kappa^{2}\Lambda_{0}}, (18)

where the initial step arises from the triangle inequality. The subsequent step is based on Eq. (17) together with u=Yu^{*}=Y. The third step involves computing the integral, and the concluding step is derived from the condition ugntk(0)=0u_{\mathrm{gntk}}(0)=0. Thus, we have

|0Tκ2(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)dt|\displaystyle\leavevmode\nobreak\ \Big{|}\int_{0}^{T}\kappa^{2}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)\mathrm{d}t\Big{|}
\displaystyle\leq κ2maxt[0,T]𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦)20Tugntk(t)Y2dt\displaystyle\leavevmode\nobreak\ \kappa^{2}\max_{t\in[0,T]}\|\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\cdot\int_{0}^{T}\|u_{\mathrm{gntk}}(t)-Y\|_{2}\mathrm{d}t
\displaystyle\leq εKuΛ0.\displaystyle\leavevmode\nobreak\ \varepsilon_{K}\cdot\frac{\|u^{*}\|}{\Lambda_{0}}.

where the first step restates the Cauchy–Schwarz bound above, and the second step follows from Eq. (18) together with the definition of εK\varepsilon_{K}. ∎

Claim D.18.

We have

|0Tκ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))dt|κ2nεinitT+nT2κ4εHu2\displaystyle\Big{|}\int_{0}^{T}\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t))\mathrm{d}t\Big{|}\leq\kappa^{2}n\varepsilon_{\mathrm{init}}T+\sqrt{n}T^{2}\cdot\kappa^{4}\varepsilon_{H}\cdot\|u^{*}\|_{2}
Proof.

Note

κ2|0T𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))dt|\displaystyle\leavevmode\nobreak\ \kappa^{2}\Big{|}\int_{0}^{T}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t))\mathrm{d}t\Big{|}
\displaystyle\leq κ2maxt[0,T]𝖪t(Gtest,𝖦)2maxt[0,T]ugntk(t)ugnn(t)2T\displaystyle\leavevmode\nobreak\ \kappa^{2}\max_{t\in[0,T]}\|\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\max_{t\in[0,T]}\|u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)\|_{2}\cdot T (19)

where the first step follows from the Cauchy–Schwarz inequality.

To bound term maxt[0,T]ugntk(t)ugnn(t)2\max_{t\in[0,T]}\|u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)\|_{2}, notice that for any t[0,T]t\in[0,T], we have

ugntk(t)ugnn(t)2\displaystyle\leavevmode\nobreak\ \|u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)\|_{2}
\displaystyle\leq ugntk(0)ugnn(0)2+0td(ugntk(τ)ugnn(τ))dτdτ2\displaystyle\leavevmode\nobreak\ \|u_{\mathrm{gntk}}(0)-u_{\mathrm{gnn}}(0)\|_{2}+\Big{\|}\int_{0}^{t}\frac{\mathrm{d}(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau))}{\mathrm{d}\tau}\mathrm{d}\tau\Big{\|}_{2}
=\displaystyle= nεinit+0td(ugntk(τ)ugnn(τ))dτdτ2,\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+\Big{\|}\int_{0}^{t}\frac{\mathrm{d}(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau))}{\mathrm{d}\tau}\mathrm{d}\tau\Big{\|}_{2}, (20)

where the first step follows from the triangle inequality, and the second step follows from ugntk(0)=0u_{\mathrm{gntk}}(0)=0 together with the assumption ugnn(0)2nεinit\|u_{\mathrm{gnn}}(0)\|_{2}\leq\sqrt{n}\varepsilon_{\mathrm{init}}. Further,

d(ugntk(τ)ugnn(τ))dτ\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau))}{\mathrm{d}\tau}
=\displaystyle= κ2Hcts(ugntk(τ)Y)+κ2H(τ)(ugnn(τ)Y)\displaystyle\leavevmode\nobreak\ -\kappa^{2}H^{\mathrm{cts}}(u_{\mathrm{gntk}}(\tau)-Y)+\kappa^{2}H(\tau)(u_{\mathrm{gnn}}(\tau)-Y)
=\displaystyle= (κ2H(τ))(ugntk(τ)ugnn(τ))\displaystyle\leavevmode\nobreak\ -(\kappa^{2}H(\tau))(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau))
+κ2(H(τ)Hcts)(ugntk(τ)Y),\displaystyle\quad+\kappa^{2}(H(\tau)-H^{\mathrm{cts}})(u_{\mathrm{gntk}}(\tau)-Y),

where the first step follows from Corollary D.7 and Corollary D.10, and the second step rewrites the formula. Since κ2H(τ)\kappa^{2}H(\tau) is positive semidefinite, the term (κ2H(τ))(ugntk(τ)ugnn(τ))-(\kappa^{2}H(\tau))(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau)) only makes 0td(ugntk(τ)ugnn(τ))dτdτ2\|\int_{0}^{t}\frac{\mathrm{d}(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau))}{\mathrm{d}\tau}\mathrm{d}\tau\|_{2} smaller, so we may drop it.

Taking the integral and apply the 2\ell_{2} norm, we have

0td(ugntk(τ)ugnn(τ))dτdτ2\displaystyle\leavevmode\nobreak\ \Big{\|}\int_{0}^{t}\frac{\mathrm{d}(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau))}{\mathrm{d}\tau}\mathrm{d}\tau\Big{\|}_{2}
\displaystyle\leq 0tκ2(H(τ)Hcts)(ugntk(τ)Y)dτ2.\displaystyle\leavevmode\nobreak\ \Big{\|}\int_{0}^{t}\kappa^{2}(H(\tau)-H^{\mathrm{cts}})(u_{\mathrm{gntk}}(\tau)-Y)\mathrm{d}\tau\Big{\|}_{2}. (21)

Thus,

maxt[0,T]ugntk(t)ugnn(t)2\displaystyle\leavevmode\nobreak\ \max_{t\in[0,T]}\|u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)\|_{2}
\displaystyle\leq nεinit+maxt[0,T]0td(ugntk(τ)ugnn(τ))dτdτ2\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+\max_{t\in[0,T]}\Big{\|}\int_{0}^{t}\frac{\mathrm{d}(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau))}{\mathrm{d}\tau}\mathrm{d}\tau\Big{\|}_{2}
\displaystyle\leq nεinit+maxt[0,T]0tκ2(H(τ)Hcts)(ugntk(τ)Y)dτ2\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+\max_{t\in[0,T]}\Big{\|}\int_{0}^{t}\kappa^{2}(H(\tau)-H^{\mathrm{cts}})(u_{\mathrm{gntk}}(\tau)-Y)\mathrm{d}\tau\Big{\|}_{2}
\displaystyle\leq nεinit+maxt[0,T]0tκ2H(τ)Hctsugntk(τ)Y2dτ\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+\max_{t\in[0,T]}\int_{0}^{t}\kappa^{2}\|H(\tau)-H^{\mathrm{cts}}\|\cdot\|u_{\mathrm{gntk}}(\tau)-Y\|_{2}\mathrm{d}\tau
\displaystyle\leq nεinit+maxt[0,T]κ2εH(0tugntk(τ)u2dτ+0tuY2dτ)\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+\max_{t\in[0,T]}\kappa^{2}\varepsilon_{H}\cdot\Big{(}\int_{0}^{t}\|u_{\mathrm{gntk}}(\tau)-u^{*}\|_{2}\mathrm{d}\tau+\int_{0}^{t}\|u^{*}-Y\|_{2}\mathrm{d}\tau\Big{)}
\displaystyle\leq nεinit+maxt[0,T]κ2εH0tugntk(0)u2dτ\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+\max_{t\in[0,T]}\kappa^{2}\varepsilon_{H}\int_{0}^{t}\|u_{\mathrm{gntk}}(0)-u^{*}\|_{2}\mathrm{d}\tau
\displaystyle\leq nεinit+maxt[0,T]tκ2εHu2\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+\max_{t\in[0,T]}t\cdot\kappa^{2}\varepsilon_{H}\cdot\|u^{*}\|_{2}
\displaystyle\leq nεinit+Tκ2εHu2\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+T\cdot\kappa^{2}\varepsilon_{H}\cdot\|u^{*}\|_{2} (22)

where the first step is based on Eq. (20), the second step follows from Eq. (21), the third step follows from the triangle inequality, the fourth step uses the condition $\|H(\tau)-H^{\mathrm{cts}}\|\leq\varepsilon_{H}$ for all $\tau\leq T$ together with the triangle inequality, the fifth step uses the linear convergence of $\|u_{\mathrm{gntk}}(\tau)-u^{*}\|_{2}$ described in Lemma D.8, the sixth step uses the condition $u_{\mathrm{gntk}}(0)=0$, and the last step takes the maximum over $t\in[0,T]$. Therefore,

|0Tκ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))dt|\displaystyle\leavevmode\nobreak\ \Big{|}\int_{0}^{T}\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t))\mathrm{d}t\Big{|}
\displaystyle\leq κ2maxt[0,T]𝖪t(Gtest,𝖦)2maxt[0,T]ugntk(t)ugnn(t)2T\displaystyle\leavevmode\nobreak\ \kappa^{2}\max_{t\in[0,T]}\|\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\max_{t\in[0,T]}\|u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)\|_{2}\cdot T
\displaystyle\leq κ2maxt[0,T]𝖪t(Gtest,𝖦)2(nεinitT+T2κ2εHu2)\displaystyle\leavevmode\nobreak\ \kappa^{2}\max_{t\in[0,T]}\|\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\cdot(\sqrt{n}\varepsilon_{\mathrm{init}}T+T^{2}\cdot\kappa^{2}\varepsilon_{H}\cdot\|u^{*}\|_{2})
\displaystyle\leq κ2nεinitT+nT2κ4εHu2\displaystyle\leavevmode\nobreak\ \kappa^{2}n\varepsilon_{\mathrm{init}}T+\sqrt{n}T^{2}\cdot\kappa^{4}\varepsilon_{H}\cdot\|u^{*}\|_{2}

where the first step follows from Eq. (19), the second step follows from Eq. (22), and the last step uses the bound $\mathsf{K}_{t}(G,H)\leq 1$, which holds under the data assumptions on the graphs $G$ and $H$ and implies $\|\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\leq\sqrt{n}$. ∎

D.5 Concentration results for kernels

For simplicity, we use $x_{i,l}$ to denote $h_{G_{i},u}$, where $u$ is the $l$-th node among the $N$ nodes of the $i$-th graph $G_{i}$. For convenience, for each $(i,j)\in[n]\times[n]$, we write

fgnn(W,Gi)W,fgnn(W,Gj)W\displaystyle\leavevmode\nobreak\ \Big{\langle}\frac{\partial f_{\mathrm{gnn}}(W,G_{i})}{\partial W},\frac{\partial f_{\mathrm{gnn}}(W,G_{j})}{\partial W}\Big{\rangle}
=\displaystyle= 1mr=1ml1=1Nl2=1Nxi,l1xj,l2𝟏wrxi,l10𝟏wrxj,l20\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}x_{i,l_{1}}^{\top}x_{j,l_{2}}{\bf 1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0}{\bf 1}_{w_{r}^{\top}x_{j,l_{2}}\geq 0}

Recall the analogous quantity for a classical (non-graph) neural network:

fnn(W,xi)W,fnn(W,xj)W\displaystyle\leavevmode\nobreak\ \Big{\langle}\frac{\partial f_{\mathrm{nn}}(W,x_{i})}{\partial W},\frac{\partial f_{\mathrm{nn}}(W,x_{j})}{\partial W}\Big{\rangle}
=\displaystyle= 1mr=1mxixj𝟏wrxi0𝟏wrxj0\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}x_{i}^{\top}x_{j}{\bf 1}_{w_{r}^{\top}x_{i}\geq 0}{\bf 1}_{w_{r}^{\top}x_{j}\geq 0}
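To make these two quantities concrete, the following minimal NumPy sketch evaluates both gradient inner products by direct summation over the random features; the synthetic features, sizes, and function names are illustrative assumptions rather than part of the formal setup.

```python
import numpy as np

def graph_kernel_entry(W, Xi, Xj):
    """(1/m) * sum_r sum_{l1,l2} <x_{i,l1}, x_{j,l2}> 1{w_r^T x_{i,l1} >= 0} 1{w_r^T x_{j,l2} >= 0}.
    W: (m, d) rows w_r;  Xi, Xj: (N, d) node features of the two graphs."""
    m = W.shape[0]
    Ai = (Xi @ W.T >= 0).astype(float)            # (N, m), entries 1{w_r^T x_{i,l} >= 0}
    Aj = (Xj @ W.T >= 0).astype(float)
    return np.einsum('ar,ab,br->', Ai, Xi @ Xj.T, Aj) / m

def nn_kernel_entry(W, xi, xj):
    """Classical two-layer ReLU kernel entry: (1/m) * sum_r <x_i, x_j> 1{w_r^T x_i >= 0} 1{w_r^T x_j >= 0}."""
    m = W.shape[0]
    mask = ((W @ xi >= 0) & (W @ xj >= 0)).astype(float)
    return (xi @ xj) * mask.sum() / m

rng = np.random.default_rng(0)
d, N, m = 4, 3, 4096
Xi, Xj = rng.normal(size=(N, d)), rng.normal(size=(N, d))
W = rng.normal(size=(m, d))
print(graph_kernel_entry(W, Xi, Xj), nn_kernel_entry(W, Xi[0], Xj[0]))
```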

In this section, we prove a result that generalizes Lemma 3.1 of [75]. It shows that when the width $m$ is sufficiently large, the continuous and discrete versions of the Gram matrix of the input data are close in spectral norm.

Lemma D.19.

We write 𝖧cts,𝖧disn×n\mathsf{H}^{\mathrm{cts}},\mathsf{H}^{\mathrm{dis}}\in\mathbb{R}^{n\times n} as follows

𝖧i,jcts=\displaystyle\mathsf{H}^{\mathrm{cts}}_{i,j}= 𝔼w𝒩(0,I)[l1,l2xi,l1xj,l2𝟏wxi,l10,wxj,l20],\displaystyle\leavevmode\nobreak\ \operatorname*{{\mathbb{E}}}_{w\sim\mathcal{N}(0,I)}\left[\sum_{l_{1},l_{2}}x_{i,l_{1}}^{\top}x_{j,l_{2}}{\bf 1}_{w^{\top}x_{i,l_{1}}\geq 0,w^{\top}x_{j,l_{2}}\geq 0}\right],
𝖧i,jdis=\displaystyle\mathsf{H}^{\mathrm{dis}}_{i,j}= 1mr=1m[l1,l2xi,l1xj,l2𝟏wrxi,l10,wrxj,l20].\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\left[\sum_{l_{1},l_{2}}x_{i,l_{1}}^{\top}x_{j,l_{2}}{\bf 1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0,w_{r}^{\top}x_{j,l_{2}}\geq 0}\right].

Let λ=λmin(𝖧cts)\lambda=\lambda_{\min}(\mathsf{H}^{\mathrm{cts}}). If m=Ω(λ2N2n2log(n/δ))m=\Omega(\lambda^{-2}N^{2}n^{2}\log(n/\delta)), we have

𝖧dis𝖧ctsFλ4,andλmin(𝖧dis)34λ.\displaystyle\|\mathsf{H}^{\mathrm{dis}}-\mathsf{H}^{\mathrm{cts}}\|_{F}\leq\frac{\lambda}{4},\mathrm{\leavevmode\nobreak\ and\leavevmode\nobreak\ }\lambda_{\min}(\mathsf{H}^{\mathrm{dis}})\geq\frac{3}{4}\lambda.

hold with probability at least 1δ1-\delta.

Proof.

For every fixed pair (i,j)(i,j), 𝖧i,jdis\mathsf{H}_{i,j}^{\mathrm{dis}} is an average of independent random variables, i.e.

𝖧i,jdis=1mr=1ml1=1Nl2=1Nxi,l1xj,l2𝟏wrxi,l10,wrxj,l20.\displaystyle\mathsf{H}_{i,j}^{\mathrm{dis}}=\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}x_{i,l_{1}}^{\top}x_{j,l_{2}}\mathbf{1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0,w_{r}^{\top}x_{j,l_{2}}\geq 0}.

Then the expectation of 𝖧i,jdis\mathsf{H}_{i,j}^{\mathrm{dis}} is

𝔼[𝖧i,jdis]\displaystyle\leavevmode\nobreak\ \operatorname*{{\mathbb{E}}}[\mathsf{H}_{i,j}^{\mathrm{dis}}]
=\displaystyle= 1mr=1ml1=1Nl2=1N𝔼wr𝒩(0,Id)[xi,l1xj,l2𝟏wrxi,l10,wrxj,l20]\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}\operatorname*{{\mathbb{E}}}_{w_{r}\sim{\mathcal{N}}(0,I_{d})}\left[x_{i,l_{1}}^{\top}x_{j,l_{2}}\mathbf{1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0,w_{r}^{\top}x_{j,l_{2}}\geq 0}\right]
=\displaystyle= 𝔼w𝒩(0,Id)l1=1Nl2=1N[xi,l1xj,l2𝟏wrxi,l10,wrxj,l20]\displaystyle\leavevmode\nobreak\ \operatorname*{{\mathbb{E}}}_{w\sim{\mathcal{N}}(0,I_{d})}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}\left[x_{i,l_{1}}^{\top}x_{j,l_{2}}\mathbf{1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0,w_{r}^{\top}x_{j,l_{2}}\geq 0}\right]
=\displaystyle= 𝖧i,jcts.\displaystyle\leavevmode\nobreak\ \mathsf{H}_{i,j}^{\mathrm{cts}}.

For r[m]r\in[m], we define zrz_{r}

zr:=1ml1,l2xi,l1xj,l2𝟏wrxi,l10,wrxj,l20.\displaystyle z_{r}:=\frac{1}{m}\sum_{l_{1},l_{2}}x_{i,l_{1}}^{\top}x_{j,l_{2}}\mathbf{1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0,w_{r}^{\top}x_{j,l_{2}}\geq 0}.

Then zrz_{r} is a random function of wrw_{r}, hence {zr}r[m]\{z_{r}\}_{r\in[m]} are mutually independent. Moreover, 1mN2zr1mN2-\frac{1}{m}N^{2}\leq z_{r}\leq\frac{1}{m}N^{2}. So by Hoeffding inequality (Lemma A.1) we have for all t>0t>0,

Pr[|𝖧i,jdis𝖧i,jcts|t]\displaystyle\Pr\left[|\mathsf{H}_{i,j}^{\mathrm{dis}}-\mathsf{H}_{i,j}^{\mathrm{cts}}|\geq t\right]\leq 2exp(2t24N2/m)\displaystyle\leavevmode\nobreak\ 2\exp\Big{(}-\frac{2t^{2}}{4N^{2}/m}\Big{)}
=\displaystyle= 2exp(mt2/(2N2)).\displaystyle\leavevmode\nobreak\ 2\exp(-mt^{2}/(2N^{2})).

Setting t=(1m2N2log(2n2/δ))1/2t=(\frac{1}{m}2N^{2}\log(2n^{2}/\delta))^{1/2}, we can apply union bound on all pairs (i,j)(i,j) to get with probability at least 1δ1-\delta, for all i,j[n]i,j\in[n],

|𝖧i,jdis𝖧i,jcts|\displaystyle|\mathsf{H}_{i,j}^{\mathrm{dis}}-\mathsf{H}_{i,j}^{\mathrm{cts}}|\leq (2mN2log(2n2/δ))1/2\displaystyle\leavevmode\nobreak\ \Big{(}\frac{2}{m}N^{2}\log(2n^{2}/\delta)\Big{)}^{1/2}
\displaystyle\leq 4(N2log(n/δ)m)1/2.\displaystyle\leavevmode\nobreak\ 4\Big{(}\frac{N^{2}\log(n/\delta)}{m}\Big{)}^{1/2}.

Thus we have

𝖧dis𝖧cts2\displaystyle\|\mathsf{H}^{\mathrm{dis}}-\mathsf{H}^{\mathrm{cts}}\|^{2}\leq 𝖧dis𝖧ctsF2\displaystyle\leavevmode\nobreak\ \|\mathsf{H}^{\mathrm{dis}}-\mathsf{H}^{\mathrm{cts}}\|_{F}^{2}
=\displaystyle= i=1nj=1n|𝖧i,jdis𝖧i,jcts|2\displaystyle\leavevmode\nobreak\ \sum_{i=1}^{n}\sum_{j=1}^{n}|\mathsf{H}_{i,j}^{\mathrm{dis}}-\mathsf{H}_{i,j}^{\mathrm{cts}}|^{2}
\displaystyle\leq 1m16N2n2log(n/δ).\displaystyle\leavevmode\nobreak\ \frac{1}{m}16N^{2}n^{2}\log(n/\delta).

Hence if m=Ω(λ2N2n2log(n/δ))m=\Omega(\lambda^{-2}N^{2}n^{2}\log(n/\delta)) we have the desired result. ∎
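As an informal sanity check on Lemma D.19, the sketch below estimates $\mathsf{H}^{\mathrm{cts}}$ with a very wide random-feature network and observes the Frobenius gap to $\mathsf{H}^{\mathrm{dis}}$ shrinking as $m$ grows, roughly at the $1/\sqrt{m}$ rate suggested by the proof; all data and sizes are synthetic and illustrative.

```python
import numpy as np

def discrete_gram(W, graphs):
    """[H^dis]_{ij} = (1/m) sum_r sum_{l1,l2} x_{i,l1}^T x_{j,l2}
    1{w_r^T x_{i,l1} >= 0, w_r^T x_{j,l2} >= 0};  graphs is a list of (N, d) feature arrays."""
    n, m = len(graphs), W.shape[0]
    acts = [(X @ W.T >= 0).astype(float) for X in graphs]     # indicators, each (N, m)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = np.einsum('ar,ab,br->', acts[i], graphs[i] @ graphs[j].T, acts[j]) / m
    return H

rng = np.random.default_rng(1)
n, N, d = 5, 3, 6
graphs = [rng.normal(size=(N, d)) for _ in range(n)]
H_ref = discrete_gram(rng.normal(size=(100000, d)), graphs)   # very wide network as a proxy for H^cts
for m in (100, 1000, 10000):
    H_dis = discrete_gram(rng.normal(size=(m, d)), graphs)
    print(m, np.linalg.norm(H_dis - H_ref, 'fro'))            # gap shrinks roughly like 1/sqrt(m)
```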

Similarly, we can show

Lemma D.20.

Let

  • R0(0,1)R_{0}\in(0,1).

If the following conditions hold:

  • w~1,,w~m\widetilde{w}_{1},\cdots,\widetilde{w}_{m} are i.i.d. generated from 𝒩(0,I){\cal N}(0,I).

  • For any set of weight vectors w1,,wmdw_{1},\cdots,w_{m}\in\mathbb{R}^{d} that satisfy for any r[m]r\in[m], w~rwr2R0\|\widetilde{w}_{r}-w_{r}\|_{2}\leq R_{0},

then for the matrix-valued function $\mathsf{H}:\mathbb{R}^{m\times d}\rightarrow\mathbb{R}^{n\times n}$ defined by

[𝖧(W)]i,j=1mr=1ml1=1Nl2=1Nxi,l1xj,l2𝟏wrxi,l10,wrxj,l20.\displaystyle[\mathsf{H}(W)]_{i,j}=\frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}x_{i,l_{1}}^{\top}x_{j,l_{2}}{\bf 1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0,w_{r}^{\top}x_{j,l_{2}}\geq 0}.

we have

𝖧(w)𝖧(w~)F<2NnR0\displaystyle\|\mathsf{H}(w)-\mathsf{H}(\widetilde{w})\|_{F}<2NnR_{0}

holds with probability at least 1N2n2exp(mR0/10)1-N^{2}n^{2}\cdot\exp(-mR_{0}/10).

D.6 Upper bounding initialization perturbation

In this section we bound $\varepsilon_{\mathrm{init}}\leq\varepsilon$ by choosing a sufficiently small $\kappa$. Our goal is to prove Lemma D.21, which shows that the predictor's output is small at initialization. The proof relies on Gaussian tail bounds.

Lemma D.21 (Bounding initialization perturbation).

Let fgnnf_{\mathrm{gnn}} be as defined in Definition D.1. Assume

  • the initial weights of the network $w_{r}(0)\in\mathbb{R}^{d},\leavevmode\nobreak\ r=1,\cdots,m$ as defined in Definition D.2 are drawn independently from the standard Gaussian distribution $\mathcal{N}(0,I_{d})$.

  • and ar,r=1,,ma_{r}\in\mathbb{R},\leavevmode\nobreak\ r=1,\cdots,m as defined in Definition D.1 are drawn independently from unif[{1,+1}]\operatorname{unif}[\{-1,+1\}].

Let

  • κ(0,1)\kappa\in(0,1), ugnn(t)u_{\mathrm{gnn}}(t) and ugnn,test(t)u_{\mathrm{gnn},\mathrm{test}}(t)\in\mathbb{R} be defined as in Definition D.2.

For any data $G$, assume its corresponding features $\{x_{l}\}_{l\in[N]}\subset\mathbb{R}^{d}$ satisfy $\|x_{l}\|_{2}\leq R$ for all $l\in[N]$. Then, with probability $1-\delta$,

|fgnn(W(0),G)|2NRlog(2Nm/δ).\displaystyle|f_{\mathrm{gnn}}(W(0),G)|\leq 2NR\log(2Nm/\delta).

Further, given any accuracy $\varepsilon\in(0,1)$, if $\kappa=\widetilde{O}(\varepsilon\Lambda_{0}/(NRn))$ and we let $\varepsilon_{\mathrm{init}}=\varepsilon\Lambda_{0}/(NRn)$, then we have

|ugnn(0)|nεinitand|ugnn,test(0)|εinit\displaystyle|u_{\mathrm{gnn}}(0)|\leq\sqrt{n}\varepsilon_{\mathrm{init}}\leavevmode\nobreak\ \text{and}\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)|\leq\varepsilon_{\mathrm{init}}

hold with probability 1δ1-\delta, where O~()\widetilde{O}(\cdot) hides the polylog(Nn/(εδΛ0))\mathrm{poly}\log(Nn/(\varepsilon\delta\Lambda_{0})).

Proof.

Note by definition,

fgnn(W(0),G)=1mr=1marl=1Nσ(wr(0)xl).\displaystyle f_{\mathrm{gnn}}(W(0),G)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sum_{l=1}^{N}\sigma(w_{r}(0)^{\top}x_{l}).

Since $w_{r}(0)\sim\mathcal{N}(0,I_{d})$, we have $w_{r}(0)^{\top}x_{\mathrm{test},l}\sim\mathcal{N}(0,\|x_{\mathrm{test},l}\|_{2}^{2})$. Note that $\|x_{\mathrm{test},l}\|_{2}\leq R$ for all $l\in[N]$, so by the Gaussian tail bound (Lemma A.3), we have with probability $1-\delta/(2Nm)$:

|wr(0)x|R2log(2Nm/δ).\displaystyle|w_{r}(0)^{\top}x|\leq R\sqrt{2\log(2Nm/\delta)}. (23)

Conditioning on Eq. (23) holding for all $r\in[m]$, denote $Z_{r}=a_{r}\sum_{l=1}^{N}\sigma(w_{r}(0)^{\top}x_{l})$; then we have $\operatorname*{{\mathbb{E}}}[Z_{r}]=0$ and $|Z_{r}|\leq NR\sqrt{2\log(2Nm/\delta)}$. By Lemma A.1, with probability $1-\delta/2$:

|r=1mZr|2NRmlog(2Nm/δ).\displaystyle\Big{|}\sum_{r=1}^{m}Z_{r}\Big{|}\leq 2NR\sqrt{m}\log{(2Nm/\delta)}. (24)

Since $f_{\mathrm{gnn}}(W(0),G_{\mathrm{test}})=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}Z_{r}$, by combining Eq. (23), (24) and a union bound over all $r\in[m]$, we have with probability $1-\delta$:

\displaystyle|f_{\mathrm{gnn}}(W(0),G_{\mathrm{test}})|\leq 2NR\log(2Nm/\delta).

Further, note [ugnn(0)]i=κfgnn(W(0),Gi)[u_{\mathrm{gnn}}(0)]_{i}=\kappa f_{\mathrm{gnn}}(W(0),G_{i}) and ugnn,test(0)=κfgnn(W(0),Gtest)u_{\mathrm{gnn},\mathrm{test}}(0)=\kappa f_{\mathrm{gnn}}(W(0),G_{\mathrm{test}}). Thus, by choosing

κ=O~(εΛ0/(NRn)),\displaystyle\kappa=\widetilde{O}(\varepsilon\Lambda_{0}/(NRn)),

taking the union bound over all training and test data, we have

|ugnn(0)|nεinitand|ugnn,test(0)|εinit\displaystyle|u_{\mathrm{gnn}}(0)|\leq\sqrt{n}\varepsilon_{\mathrm{init}}\leavevmode\nobreak\ \text{and}\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)|\leq\varepsilon_{\mathrm{init}}

hold with probability 1δ1-\delta, where O~()\widetilde{O}(\cdot) hides the polylog(Nn/(εδΛ0))\mathrm{poly}\log(Nn/(\varepsilon\delta\Lambda_{0})). ∎
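The following minimal sketch mirrors the calculation above: it evaluates the two-layer predictor $f_{\mathrm{gnn}}$ at a random initialization and then scales it by a small multiplier $\kappa$, illustrating why the initialization perturbation $\varepsilon_{\mathrm{init}}$ can be made small. The normalization $R=1$, the sizes, and all names are illustrative assumptions.

```python
import numpy as np

def f_gnn_init(W0, a, X):
    """Two-layer predictor at initialization:
    f(W(0), G) = (1/sqrt(m)) * sum_r a_r * sum_l relu(w_r(0)^T x_l),
    with X the (N, d) features {x_l} of graph G."""
    m = W0.shape[0]
    hidden = np.maximum(X @ W0.T, 0.0)            # (N, m) activations relu(w_r^T x_l)
    return (a * hidden.sum(axis=0)).sum() / np.sqrt(m)

rng = np.random.default_rng(2)
N, d, m = 4, 8, 4096
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # enforce ||x_l||_2 <= R with R = 1
W0 = rng.normal(size=(m, d))                      # w_r(0) ~ N(0, I_d)
a = rng.choice([-1.0, 1.0], size=m)               # a_r ~ unif{-1, +1}
out = f_gnn_init(W0, a, X)
kappa = 1e-3                                      # small multiplier, in the spirit of Lemma D.21
print("f_gnn(W(0), G) =", out, "  kappa * f_gnn(W(0), G) =", kappa * out)
```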

D.7 Upper bounding kernel perturbation

In this section, our goal is to prove Lemma D.22. We bound the kernel perturbation using induction. More specifically, we prove four induction lemmas, each of which states that a property still holds after one more step. We then combine them to prove Lemma D.22.

Lemma D.22 (Bounding kernel perturbation).

Given

  • training data $\mathsf{G}\in\mathbb{R}^{n\times d\times N}$, labels $Y\in\mathbb{R}^{n}$, and a test graph $G_{\mathrm{test}}\in\mathbb{R}^{d}$.

Let

  • T>0T>0 denotes the total number of iterations,

  • m>0m>0 denotes the width of the network,

  • εtrain\varepsilon_{\mathrm{train}} denotes a fixed training error threshold,

  • δ>0\delta>0 denotes the failure probability.

  • ugnn(t)nu_{\mathrm{gnn}}(t)\in\mathbb{R}^{n} and ugntk(t)nu_{\mathrm{gntk}}(t)\in\mathbb{R}^{n} be the training data predictors defined in Definition D.2 and Definition D.4 respectively.

  • κ(0,1)\kappa\in(0,1) be the corresponding multiplier.

  • 𝖪gntk(Gtest,𝖦)n,𝖪t(Gtest,𝖦)n,𝖧(t)n×n,Λ0>0\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})\in\mathbb{R}^{n},\leavevmode\nobreak\ \mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\in\mathbb{R}^{n},\leavevmode\nobreak\ \mathsf{H}(t)\in\mathbb{R}^{n\times n},\leavevmode\nobreak\ \Lambda_{0}>0 be the kernel related quantities defined in Definition D.3 and Definition D.5.

  • unu^{*}\in\mathbb{R}^{n} be defined as in Eq. (8).

  • W(t)=[w1(t),,wm(t)]d×mW(t)=[w_{1}(t),\cdots,w_{m}(t)]\in\mathbb{R}^{d\times m} be the parameters of the neural network defined in Definition D.2.

For any accuracy ε(0,1/10)\varepsilon\in(0,1/10). If

  • κ=O~(εΛ0NRn)\kappa=\widetilde{O}(\frac{\varepsilon\Lambda_{0}}{NRn}), T=O~(1κ2Λ0)T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), εtrain=O~(ugnn(0)u2)\varepsilon_{\mathrm{train}}=\widetilde{O}(\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}), mO~(N2n10dε6Λ010)m\geq\widetilde{O}(\frac{N^{2}n^{10}d}{\varepsilon^{6}\Lambda_{0}^{10}}), with probability 1δ1-\delta,

there exist εW,εH,εK>0\varepsilon_{W},\leavevmode\nobreak\ \varepsilon_{H}^{\prime},\leavevmode\nobreak\ \varepsilon_{K}^{\prime}>0 that are independent of tt, such that the following hold for all 0tT0\leq t\leq T:

  • 1. wr(0)wr(t)2εW\|w_{r}(0)-w_{r}(t)\|_{2}\leq\varepsilon_{W}, r[m]\forall r\in[m]

  • 2. 𝖧(0)𝖧(t)2εH\|\mathsf{H}(0)-\mathsf{H}(t)\|_{2}\leq\varepsilon_{H}^{\prime}

  • 3. ugnn(t)u22max{exp((κ2Λ0)t/2)ugnn(0)u22,εtrain2}\|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}\leq\max\{\exp(-(\kappa^{2}\Lambda_{0})t/2)\cdot\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}^{2},\leavevmode\nobreak\ \varepsilon_{\mathrm{train}}^{2}\}

  • 4. 𝖪0(Gtest,𝖦)𝖪t(Gtest,𝖦)2εK\|\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\leq\varepsilon_{K}^{\prime}

Further,

εWO~(εΛ02n2),εHO~(εΛ02n)andεKO~(εΛ02n1.5).\displaystyle\varepsilon_{W}\leq\widetilde{O}(\frac{\varepsilon\Lambda_{0}^{2}}{n^{2}}),\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \varepsilon_{H}^{\prime}\leq\widetilde{O}(\frac{\varepsilon\Lambda_{0}^{2}}{n})\mathrm{\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ and\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ }\varepsilon_{K}^{\prime}\leq\widetilde{O}(\frac{\varepsilon\Lambda_{0}^{2}}{n^{1.5}}).

Here O~()\widetilde{O}(\cdot) hides the polylog(Nn/(εδΛ0))\mathrm{poly}\log(Nn/(\varepsilon\delta\Lambda_{0})).

In order to prove this lemma, we first state some helpful concentration results for random initialization.

Lemma D.23 (Random initialization result).

Assume

  • initial value wr(0)d,r=1,,mw_{r}(0)\in\mathbb{R}^{d},\leavevmode\nobreak\ r=1,\cdots,m are drawn independently from standard Gaussian distribution 𝒩(0,Id)\mathcal{N}(0,I_{d}),

then with probability 13δ1-3\delta we have

wr(0)22d+2log(m/δ)for allr[m]\displaystyle\leavevmode\nobreak\ \|w_{r}(0)\|_{2}\leq 2\sqrt{d}+2\sqrt{\log{(m/\delta)}}\leavevmode\nobreak\ \text{for all}\leavevmode\nobreak\ r\in[m] (25)
𝖧(0)𝖧cts4Nn(log(n/δ)/m)1/2\displaystyle\leavevmode\nobreak\ \|\mathsf{H}(0)-\mathsf{H}^{\mathrm{cts}}\|\leq 4Nn(\log(n/\delta)/m)^{1/2} (26)
𝖪0(Gtest,𝖦)𝖪gntk(Gtest,𝖦)2(2N2nlog(2n/δ)/m)1/2\displaystyle\leavevmode\nobreak\ \|\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})\|_{2}\leq(2N^{2}n\log{(2n/\delta)}/m)^{1/2} (27)
Proof.

By standard concentration inequality, with probability at least 1δ1-\delta,

wr(0)2d+log(m/δ)\displaystyle\|w_{r}(0)\|_{2}\leq\sqrt{d}+\sqrt{\log(m/\delta)}

holds for all r[m]r\in[m].

Using Lemma D.19, we have

H(0)HctsεH′′=4Nn(log(n/δ)/m)1/2\displaystyle\|H(0)-H^{\mathrm{cts}}\|\leq\varepsilon_{H}^{\prime\prime}=4Nn(\log{(n/\delta)}/m)^{1/2}

holds with probability at least 1δ1-\delta.
Note by definition,

𝔼[𝖪0(Gtest,xi)]=𝖪gntk(Gtest,Gi)\displaystyle\operatorname*{{\mathbb{E}}}[\mathsf{K}_{0}(G_{\mathrm{test}},x_{i})]=\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},G_{i})

holds for any training data GiG_{i}. By Hoeffding inequality, we have for any t>0t>0,

Pr[|𝖪0(Gtest,Gi)𝖪gntk(Gtest,Gi)|t]2exp(mt2/(2N2)).\displaystyle\Pr[|\mathsf{K}_{0}(G_{\mathrm{test}},G_{i})-\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},G_{i})|\geq t]\leq 2\exp{(-mt^{2}/(2N^{2}))}.

Setting t=(2mN2log(2n/δ))1/2t=(\frac{2}{m}N^{2}\log{(2n/\delta)})^{1/2}, we can apply union bound on all training data GiG_{i} to get with probability at least 1δ1-\delta, for all i[n]i\in[n],

|𝖪0(Gtest,Gi)𝖪gntk(Gtest,Gi)|(2N2log(2n/δ)/m)1/2.\displaystyle|\mathsf{K}_{0}(G_{\mathrm{test}},G_{i})-\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},G_{i})|\leq(2N^{2}\log(2n/\delta)/m)^{1/2}.

Thus, we have

𝖪0(Gtest,𝖦)𝖪gntk(Gtest,𝖦)2(2N2nlog(2n/δ)/m)1/2\displaystyle\|\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})\|_{2}\leq(2N^{2}n\log(2n/\delta)/m)^{1/2} (28)

holds with probability at least 1δ1-\delta.
Using union bound over above three events, we finish the proof. ∎

Conditioning on Eq. (25), (26), and (27), we show by induction that all four statements in Lemma D.22 hold.

We define the following quantity:

εW:=\displaystyle\varepsilon_{W}:= Nnmmax{4ugnn(0)u2/(κ2Λ0),εtrainT}\displaystyle\leavevmode\nobreak\ \frac{\sqrt{Nn}}{\sqrt{m}}\max\{4\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}/(\kappa^{2}\Lambda_{0}),\varepsilon_{\mathrm{train}}\cdot T\} (29)
εH:=\displaystyle\varepsilon_{H}^{\prime}:= 2nεW\displaystyle\leavevmode\nobreak\ 2n\varepsilon_{W}
εK:=\displaystyle\varepsilon_{K}:= 2nεW\displaystyle\leavevmode\nobreak\ 2\sqrt{n}\varepsilon_{W}

which are independent of tt.

Note that the base case $t=0$ trivially holds. Under the induction hypothesis that Lemma D.22 holds before time $t\in[0,T]$, we will prove that it still holds at time $t$. We prove this in Lemmas D.24, D.25, D.26, and D.27.

Lemma D.24.

If for any τ<t\tau<t, we have

\displaystyle\| ugnn(τ)u22\displaystyle\leavevmode\nobreak\ u_{\mathrm{gnn}}(\tau)-u^{*}\|_{2}^{2}
\displaystyle\leq max{exp((κ2Λ0)τ/2)ugnn(0)u22,εtrain2}\displaystyle\leavevmode\nobreak\ \max\{\exp(-(\kappa^{2}\Lambda_{0})\tau/2)\cdot\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}^{2},\leavevmode\nobreak\ \varepsilon_{\mathrm{train}}^{2}\}

and

wr(0)wr(τ)2εW1\displaystyle\|w_{r}(0)-w_{r}(\tau)\|_{2}\leq\varepsilon_{W}\leq 1

and

wr(0)2d+log(m/δ)for allr[m]\displaystyle\|w_{r}(0)\|_{2}\leq\leavevmode\nobreak\ \sqrt{d}+\sqrt{\log(m/\delta)}\leavevmode\nobreak\ \text{for all}\leavevmode\nobreak\ r\in[m]

hold, then

wr(0)wr(t)2εW\displaystyle\|w_{r}(0)-w_{r}(t)\|_{2}\leq\varepsilon_{W}
Proof.

For simplicity, we consider the case $N=1$. For the general case, the upper bound is blown up by a factor of $N$.

Recall the gradient flow as Eq. (11)

dwr(τ)dτ=i=1n1mar(yiugnn(τ)i)xiσ(wr(τ)xi)\displaystyle\frac{\mathrm{d}w_{r}(\tau)}{\mathrm{d}\tau}=\leavevmode\nobreak\ \sum_{i=1}^{n}\frac{1}{\sqrt{m}}a_{r}(y_{i}-u_{\mathrm{gnn}}(\tau)_{i})x_{i}\sigma^{\prime}(w_{r}(\tau)^{\top}x_{i}) (30)

So we have

dwr(τ)dτ2=\displaystyle\Big{\|}\frac{\mathrm{d}w_{r}(\tau)}{\mathrm{d}\tau}\Big{\|}_{2}= i=1n1mar(yiugnn(τ)i)xiσ(wr(τ)xi)2\displaystyle\leavevmode\nobreak\ \left\|\sum_{i=1}^{n}\frac{1}{\sqrt{m}}a_{r}(y_{i}-u_{\mathrm{gnn}}(\tau)_{i})x_{i}\sigma^{\prime}(w_{r}(\tau)^{\top}x_{i})\right\|_{2} (31)
\displaystyle\leq 1mi=1n|yiugnn(τ)i|\displaystyle\leavevmode\nobreak\ \frac{1}{\sqrt{m}}\sum_{i=1}^{n}|y_{i}-u_{\mathrm{gnn}}(\tau)_{i}|
\displaystyle\leq nmYugnn(τ)2\displaystyle\leavevmode\nobreak\ \frac{\sqrt{n}}{\sqrt{m}}\|Y-u_{\mathrm{gnn}}(\tau)\|_{2}
\displaystyle\leq nm(Yu2+ugnn(τ)u2)\displaystyle\leavevmode\nobreak\ \frac{\sqrt{n}}{\sqrt{m}}(\|Y-u^{*}\|_{2}+\|u_{\mathrm{gnn}}(\tau)-u^{*}\|_{2})
\displaystyle\leq nmmax{e(κ2Λ0)τ/4ugnn(0)u2,εtrain}\displaystyle\leavevmode\nobreak\ \frac{\sqrt{n}}{\sqrt{m}}\max\{e^{-(\kappa^{2}\Lambda_{0})\tau/4}\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2},\varepsilon_{\mathrm{train}}\}

where the first step is based on Eq. (30), the second step follows from the triangle inequality, the third step follows from the Cauchy-Schwarz inequality, the fourth step again uses the triangle inequality, and the last step uses the induction hypothesis $\|u_{\mathrm{gnn}}(\tau)-u^{*}\|_{2}^{2}\leq\max\{\exp(-(\kappa^{2}\Lambda_{0})\tau/2)\cdot\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}^{2},\varepsilon_{\mathrm{train}}^{2}\}$.

Thus, for any tTt\leq T,

wr(0)wr(t)2\displaystyle\leavevmode\nobreak\ \|w_{r}(0)-w_{r}(t)\|_{2}
\displaystyle\leq 0tdwr(τ)dτ2𝑑τ\displaystyle\leavevmode\nobreak\ \int_{0}^{t}\Big{\|}\frac{\mathrm{d}w_{r}(\tau)}{\mathrm{d}\tau}\Big{\|}_{2}d\tau
\displaystyle\leq Nnmmax{4ugnn(0)u2/(κ2Λ0),εtrainT}\displaystyle\leavevmode\nobreak\ \frac{\sqrt{Nn}}{\sqrt{m}}\max\{4\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}/(\kappa^{2}\Lambda_{0}),\varepsilon_{\mathrm{train}}\cdot T\}
=\displaystyle= εW\displaystyle\leavevmode\nobreak\ \varepsilon_{W}

where the first step follows from the triangle inequality, the second step follows from Eq. (31), and the last step follows from the definition of $\varepsilon_{W}$ in Eq. (29). ∎
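A forward-Euler discretization of the gradient flow in Eq. (30) illustrates the behavior underlying Lemma D.24: the training residual shrinks while every hidden weight barely moves from its initialization. The sketch below uses the simplified $N=1$ case with synthetic data, omits the multiplier $\kappa$, and its step size, width, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 10, 5, 2048
lr, steps = 0.2, 1000                              # forward-Euler step size and number of steps
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.normal(size=n)
W = rng.normal(size=(m, d)); W0 = W.copy()
a = rng.choice([-1.0, 1.0], size=m)

def predict(W):
    # u_i = (1/sqrt(m)) * sum_r a_r * relu(w_r^T x_i)
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

for _ in range(steps):
    u = predict(W)
    act = (X @ W.T > 0).astype(float)              # sigma'(w_r^T x_i)
    # dw_r/dt = sum_i (1/sqrt(m)) a_r (y_i - u_i) sigma'(w_r^T x_i) x_i, as in Eq. (30)
    dW = (a[None, :] * act * (Y - u)[:, None]).T @ X / np.sqrt(m)
    W = W + lr * dW
print("training residual ||u - Y||_2:", np.linalg.norm(predict(W) - Y))
print("max_r ||w_r(T) - w_r(0)||_2:", np.linalg.norm(W - W0, axis=1).max())
```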

Lemma D.25.

If r[m]\forall r\in[m],

wr(0)wr(t)2εW<1,\displaystyle\|w_{r}(0)-w_{r}(t)\|_{2}\leq\varepsilon_{W}<1,

then

𝖧(0)𝖧(t)F2nεW\displaystyle\|\mathsf{H}(0)-\mathsf{H}(t)\|_{F}\leq 2n\varepsilon_{W}

holds with probability 1N2n2exp(mεW/(10N2))1-N^{2}n^{2}\cdot\exp{(-m\varepsilon_{W}/(10N^{2}))}.

Proof.

Applying Lemma D.20 completes the proof. ∎

Lemma D.26.

We have

ugnn(t)u22\displaystyle\leavevmode\nobreak\ \|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}
\displaystyle\leq max{exp((κ2Λ0)t/2)ugnn(0)u22,εtrain2}.\displaystyle\leavevmode\nobreak\ \max\{\exp(-(\kappa^{2}\Lambda_{0})t/2)\cdot\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}^{2},\leavevmode\nobreak\ \varepsilon_{\mathrm{train}}^{2}\}.
Proof.

By Lemma D.11, for any τ<t\tau<t, we have

dugnn(τ)u22dτ\displaystyle\frac{\mathrm{d}\|u_{\mathrm{gnn}}(\tau)-u^{*}\|_{2}^{2}}{\mathrm{d}\tau}\leq 2(κ2Λ0)ugnn(τ)u22\displaystyle\leavevmode\nobreak\ -2(\kappa^{2}\Lambda_{0})\cdot\|u_{\mathrm{gnn}}(\tau)-u^{*}\|_{2}^{2} (32)

Integrating this differential inequality over $[0,t]$ gives

ugnn(t)u22exp(2(κ2Λ0)t)ugnn(0)u22.\displaystyle\|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}\leq\exp{(-2(\kappa^{2}\Lambda_{0})t)}\cdot\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}^{2}.

Hence we conclude

ugnn(t)u22\displaystyle\leavevmode\nobreak\ \|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}
\displaystyle\leq max{exp((κ2Λ0)t/2)ugnn(0)u22,εtrain2}.\displaystyle\leavevmode\nobreak\ \max\{\exp(-(\kappa^{2}\Lambda_{0})t/2)\cdot\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}^{2},\leavevmode\nobreak\ \varepsilon_{\mathrm{train}}^{2}\}.

Lemma D.27.

Fix εW(0,1)\varepsilon_{W}\in(0,1) independent of tt. If r[m]\forall r\in[m], we have

wr(t)wr(0)2εW\displaystyle\|w_{r}(t)-w_{r}(0)\|_{2}\leq\varepsilon_{W}

then

𝖪t(Gtest,𝖦)𝖪0(Gtest,𝖦)2εK=2nεW\displaystyle\|\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})\|_{2}\leq\varepsilon_{K}^{\prime}=2\sqrt{n}\varepsilon_{W}

holds with probability at least 1N2nexp(mεW/(10N2))1-N^{2}\cdot n\cdot\exp{(-m\varepsilon_{W}/(10N^{2}))}.

Proof.

Recall the definition of 𝖪0\mathsf{K}_{0} and 𝖪t\mathsf{K}_{t}

𝖪0(Gtest,Gi)\displaystyle\leavevmode\nobreak\ \mathsf{K}_{0}(G_{\mathrm{test}},G_{i})
=\displaystyle= 1mr=1ml1=1Nl2=1Nxtest,l1xi,l2σ(xtest,l1wr(0))σ(xi,l2wr(0))\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}x_{\mathrm{test},l_{1}}^{\top}x_{i,l_{2}}\sigma^{\prime}(x_{\mathrm{test},l_{1}}^{\top}w_{r}(0))\sigma^{\prime}(x_{i,l_{2}}^{\top}w_{r}(0))
𝖪t(Gtest,Gi)\displaystyle\leavevmode\nobreak\ \mathsf{K}_{t}(G_{\mathrm{test}},G_{i})
=\displaystyle= 1mr=1ml1=1Nl2=1Nxtest,l1xi,l2σ(xtest,l1wr(t))σ(xi,l2wr(t))\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}x_{\mathrm{test},l_{1}}^{\top}x_{i,l_{2}}\sigma^{\prime}(x_{\mathrm{test},l_{1}}^{\top}w_{r}(t))\sigma^{\prime}(x_{i,l_{2}}^{\top}w_{r}(t))

By direct calculation we have

\displaystyle\|\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}^{2}\leq\sum_{i=1}^{n}\Big{(}\frac{1}{m}\sum_{r=1}^{m}s_{r,i}\Big{)}^{2},

where

sr,i=\displaystyle s_{r,i}= l1=1Nl2=1N𝟏[wr(0)xtest,l10,wr(0)xi,l20]\displaystyle\leavevmode\nobreak\ \sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}{\bf 1}[w_{r}(0)^{\top}x_{\mathrm{test},l_{1}}\geq 0,w_{r}(0)^{\top}x_{i,l_{2}}\geq 0]
𝟏[wr(t)xtest,l10,wr(t)xi,l20],\displaystyle\leavevmode\nobreak\ -{\bf 1}[w_{r}(t)^{\top}x_{\mathrm{test},l_{1}}\geq 0,w_{r}(t)^{\top}x_{i,l_{2}}\geq 0],
r[m],i[n].\displaystyle\leavevmode\nobreak\ \forall r\in[m],i\in[n].

Fix $i\in[n]$. By the Bernstein inequality (Lemma A.2), we have

Pr[1mr=1msr,i2εW]\displaystyle\leavevmode\nobreak\ \Pr\Big{[}\frac{1}{m}\sum_{r=1}^{m}s_{r,i}\geq 2\varepsilon_{W}\Big{]}
\displaystyle\leq exp(mεW/(10N2)).\displaystyle\leavevmode\nobreak\ \exp(-m\varepsilon_{W}/(10N^{2})).

Thus, applying union bound over all training data xi,i[n]x_{i},\leavevmode\nobreak\ i\in[n], we conclude

\displaystyle\Pr[\|\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\leq 2\sqrt{n}\varepsilon_{W}]\geq 1-n\cdot\exp{(-m\varepsilon_{W}/(10N^{2}))}.

Note that by definition $\varepsilon_{K}^{\prime}=2\sqrt{n}\varepsilon_{W}$, so we finish the proof. ∎

Now, we choose the parameters as follows: κ=O~(εΛ0NRn),T=O~(1κ2Λ0)\kappa=\widetilde{O}(\frac{\varepsilon\Lambda_{0}}{NRn}),\leavevmode\nobreak\ T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), εtrain=O~(ugnn(0)u2)\varepsilon_{\mathrm{train}}=\widetilde{O}(\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}), mO~(N2n10dε6Λ010)m\geq\widetilde{O}(\frac{N^{2}n^{10}d}{\varepsilon^{6}\Lambda_{0}^{10}}).

Further, with probability 1δ1-\delta, we have

ugnn(0)u2\displaystyle\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}\leq ugnn(0)2+Yu2+Y2\displaystyle\leavevmode\nobreak\ \|u_{\mathrm{gnn}}(0)\|_{2}+\|Y-u^{*}\|_{2}+\|Y\|_{2}
\displaystyle\leq O~(εΛ0n)+0+O~(n)\displaystyle\leavevmode\nobreak\ \widetilde{O}(\frac{\varepsilon\Lambda_{0}}{\sqrt{n}})+0+\widetilde{O}(\sqrt{n})
\displaystyle\leq O~(n)\displaystyle\leavevmode\nobreak\ \widetilde{O}(\sqrt{n})

where the first step follows from the triangle inequality, the second step follows from Lemma D.21 and $\|Y\|_{2}=O(\sqrt{n})$, and the last step follows from $\varepsilon,\leavevmode\nobreak\ \Lambda_{0}<1$. By the same reasoning,

ugnn(0)u2\displaystyle\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}\geq ugnn(0)2Yu2+Y2\displaystyle\leavevmode\nobreak\ -\|u_{\mathrm{gnn}}(0)\|_{2}-\|Y-u^{*}\|_{2}+\|Y\|_{2}
\displaystyle\geq O~(εΛ0n)0+O~(n)\displaystyle\leavevmode\nobreak\ -\widetilde{O}(\frac{\varepsilon\Lambda_{0}}{\sqrt{n}})-0+\widetilde{O}(\sqrt{n})
\displaystyle\geq O~(n).\displaystyle\leavevmode\nobreak\ \widetilde{O}(\sqrt{n}).

Thus, we have ugnn(0)u2=O~(n)\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}=\widetilde{O}(\sqrt{n}).

We have proved that with high probability all the induction conditions are satisfied. Note that the failure probability only comes from Lemmas D.21, D.23, D.25, and D.27, which only depend on the initialization. By a union bound over these failure events, with high probability all four statements in Lemma D.22 are satisfied. This completes the proof.

D.8 Connect iterative GNTK regression and GNN training process

In this section we prove Lemma D.28, which bridges the GNN test predictor and the GNTK test predictor. We prove this result by bounding the difference between the graph dynamic kernel and the GNTK, which is in turn controlled through the graph dynamic kernel at initialization.

Lemma D.28 (Upper bounding test error).

Given

  • training graph data 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} and corresponding label vector YnY\in\mathbb{R}^{n}.

  • the total number of iterations T>0T>0.

  • arbitrary test data GtestdG_{\mathrm{test}}\in\mathbb{R}^{d}.

Let

  • ugnn,test(t)nu_{\mathrm{gnn},\mathrm{test}}(t)\in\mathbb{R}^{n} and ugntk,test(t)nu_{\mathrm{gntk},\mathrm{test}}(t)\in\mathbb{R}^{n} be the test data predictors defined in Definition D.2 and Definition D.4 respectively.

  • κ(0,1)\kappa\in(0,1) be the corresponding multiplier.

Given accuracy ε>0\varepsilon>0, if

  • κ=O~(εΛ0NRn)\kappa=\widetilde{O}(\frac{\varepsilon\Lambda_{0}}{NRn}), T=O~(1κ2Λ0)T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), mO~(N2n10dε6Λ010)m\geq\widetilde{O}(\frac{N^{2}n^{10}d}{\varepsilon^{6}\Lambda_{0}^{10}}) .

Then for any GtestdG_{\mathrm{test}}\in\mathbb{R}^{d}, with probability at least 1δ1-\delta over the random initialization, we have

ugnn,test(T)ugntk,test(T)2ε/2,\displaystyle\|u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)\|_{2}\leq\varepsilon/2,

where O~()\widetilde{O}(\cdot) hides polylog(n/(εδΛ0))\mathrm{poly}\log(n/(\varepsilon\delta\Lambda_{0})).

Proof.

By Lemma D.14, we have

|ugnn,test(T)ugntk,test(T)|\displaystyle|u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|\leq (1+κ2nT)εinit+εKu2Λ0\displaystyle\leavevmode\nobreak\ (1+\kappa^{2}nT)\varepsilon_{\mathrm{init}}+\varepsilon_{K}\cdot\frac{\|u^{*}\|_{2}}{\Lambda_{0}}
+nT2κ4εHu2\displaystyle\leavevmode\nobreak\ +\sqrt{n}T^{2}\kappa^{4}\varepsilon_{H}\|u^{*}\|_{2} (33)

By Lemma D.21, we can choose $\varepsilon_{\mathrm{init}}=\varepsilon\Lambda_{0}/n$.

Further, note

𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦)2\displaystyle\leavevmode\nobreak\ \|\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}
\displaystyle\leq\leavevmode\nobreak\ \|\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})\|_{2}
+𝖪0(Gtest,𝖦)𝖪t(Gtest,𝖦)2\displaystyle\quad+\|\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}
\displaystyle\leq (2N2nlog(2n/δ)/m)1/2+𝖪0(Gtest,𝖦)𝖪t(Gtest,𝖦)2\displaystyle\leavevmode\nobreak\ (2N^{2}n\log{(2n/\delta)}/m)^{1/2}+\|\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}
\displaystyle\leq O~(εΛ02n1.5)\displaystyle\leavevmode\nobreak\ \widetilde{O}(\frac{\varepsilon\Lambda_{0}^{2}}{n^{1.5}})

where the initial step arises due to the triangle inequality. The subsequent step is derived from Lemma D.23. The concluding step is based on Lemma D.22. Consequently, we can select εK=εΛ02n1.5\varepsilon_{K}=\frac{\varepsilon\Lambda_{0}^{2}}{n^{1.5}}.

Also,

𝖧cts𝖧(t)\displaystyle\leavevmode\nobreak\ \|\mathsf{H}^{\mathrm{cts}}-\mathsf{H}(t)\|
\displaystyle\leq 𝖧cts𝖧(0)+𝖧(0)𝖧(t)2\displaystyle\leavevmode\nobreak\ \|\mathsf{H}^{\mathrm{cts}}-\mathsf{H}(0)\|+\|\mathsf{H}(0)-\mathsf{H}(t)\|_{2}
\displaystyle\leq 4Nn(log(n/δ)/m)1/2+𝖧(0)𝖧(t)2\displaystyle\leavevmode\nobreak\ 4Nn(\log(n/\delta)/m)^{1/2}+\|\mathsf{H}(0)-\mathsf{H}(t)\|_{2}
\displaystyle\leq O~(εΛ02n)\displaystyle\leavevmode\nobreak\ \widetilde{O}(\frac{\varepsilon\Lambda_{0}^{2}}{n})

where the initial step is informed by the triangle inequality. The next step is based on Lemma D.23. The final step is derived from Lemma D.22. As a result, we can set εH=εΛ02n\varepsilon_{H}=\frac{\varepsilon\Lambda_{0}^{2}}{n}.

Note that $\|u^{*}\|_{2}\leq\sqrt{n}$. Plugging the values of $\varepsilon_{\mathrm{init}},\leavevmode\nobreak\ \varepsilon_{K},\leavevmode\nobreak\ \varepsilon_{H}$ into Eq. (33), we have

|ugnn,test(T)ugntk,test(T)|ε/2.\displaystyle|u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|\leq\varepsilon/2.

Lemma D.29 (Upper bounding test error (node version)).

Given

  • training graph GG and corresponding label vector YnY\in\mathbb{R}^{n}.

  • T>0T>0 total number of iterations.

  • arbitrary test node vdv\in\mathbb{R}^{d}.

Let

  • ugnn,test,node(t)nu_{\mathrm{gnn},\mathrm{test},\mathrm{node}}(t)\in\mathbb{R}^{n} and ugntk,test,node(t)nu_{\mathrm{gntk},\mathrm{test},\mathrm{node}}(t)\in\mathbb{R}^{n} be the test data predictors defined in Definition 3.7 and Definition 3.8 respectively.

  • κ(0,1)\kappa\in(0,1) be the corresponding multiplier.

Given accuracy ε>0\varepsilon>0, if

  • κ=O~(εΛ0RN)\kappa=\widetilde{O}(\frac{\varepsilon\Lambda_{0}}{RN}), T=O~(1κ2Λ0)T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), mO~(N10dε6Λ010)m\geq\widetilde{O}(\frac{N^{10}d}{\varepsilon^{6}\Lambda_{0}^{10}}) .

Then with probability at least 1δ1-\delta over the random initialization, we have

ugnn,test,node(T)ugntk,test,node(T)2ε/2,\displaystyle\|u_{\mathrm{gnn},\mathrm{test},\mathrm{node}}(T)-u_{\mathrm{gntk},\mathrm{test},\mathrm{node}}(T)\|_{2}\leq\varepsilon/2,

where O~()\widetilde{O}(\cdot) hides polylog(N/(εδΛ0))\mathrm{poly}\log(N/(\varepsilon\delta\Lambda_{0})).

D.9 Main result

In this section we present our main theorems. First, we present the graph-level equivalence between training a GNN and GNTK regression for test data prediction. Second, we present the node-level equivalence between training a GNN and GNTK regression for test data prediction.

Theorem D.30 (Equivalence between training a graph neural network and kernel regression for test data prediction).

Given

  • training graph data 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} and corresponding label vector YnY\in\mathbb{R}^{n}. Let T>0T>0 be the total number of iterations.

  • arbitrary test data GtestG_{\mathrm{test}}.

Let

  • ugnn,test(t)nu_{\mathrm{gnn},\mathrm{test}}(t)\in\mathbb{R}^{n} and utestnu_{\mathrm{test}}^{*}\in\mathbb{R}^{n} be the test data predictors defined in Definition D.2 and Definition D.4 respectively.

For any accuracy ε(0,1/10)\varepsilon\in(0,1/10) and failure probability δ(0,1/10)\delta\in(0,1/10), if

  • κ=O~(εΛ0NRn)\kappa=\widetilde{O}(\frac{\varepsilon\Lambda_{0}}{NRn}), T=O~(1κ2Λ0)T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), mO~(N2n10dε6Λ010)m\geq\widetilde{O}(\frac{N^{2}n^{10}d}{\varepsilon^{6}\Lambda_{0}^{10}}).

Then for any GtestG_{\mathrm{test}}, with probability at least 1δ1-\delta over the random initialization, we have

ugnn,test(T)utest2ε.\displaystyle\|u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}\|_{2}\leq\varepsilon.

Here O~()\widetilde{O}(\cdot) hides polylog(Nn/(εδΛ0))\mathrm{poly}\log(Nn/(\varepsilon\delta\Lambda_{0})).

Proof.

It follows from combining results of bounding ugnn,test(T)ugntk,test(T)2ε/2\|u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)\|_{2}\leq\varepsilon/2 as shown in Lemma D.28 and ugntk,test(T)utest2ε/2\|u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}\|_{2}\leq\varepsilon/2 as shown in Lemma D.12 using triangle inequality. ∎
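Theorem D.30 says that, in the wide-width regime, the trained GNN's test prediction can be replaced by plain kernel regression with the GNTK. The sketch below assumes the standard kernel-regression form $u_{\mathrm{test}}^{*}=\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})^{\top}(\mathsf{H}^{\mathrm{cts}})^{-1}Y$ (the exact normalization in Definition D.4 and Eq. (8) may differ) and estimates both kernels by Monte Carlo over the random weights; all data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, d, m = 8, 3, 6, 100000
graphs = [rng.normal(size=(N, d)) for _ in range(n)]
graphs = [X / np.linalg.norm(X, axis=1, keepdims=True) for X in graphs]
G_test = rng.normal(size=(N, d)); G_test /= np.linalg.norm(G_test, axis=1, keepdims=True)
Y = rng.normal(size=n)
W = rng.normal(size=(m, d))                        # Monte-Carlo samples standing in for E_w[...]

def gntk(Xi, Xj):
    Ai, Aj = (Xi @ W.T >= 0).astype(float), (Xj @ W.T >= 0).astype(float)
    return np.einsum('ar,ab,br->', Ai, Xi @ Xj.T, Aj) / m

H = np.array([[gntk(Xi, Xj) for Xj in graphs] for Xi in graphs])   # approximates H^cts
k = np.array([gntk(G_test, Xi) for Xi in graphs])                  # approximates K_gntk(G_test, G)
u_test_star = k @ np.linalg.solve(H, Y)                            # kernel-regression prediction
print("GNTK regression prediction for G_test:", u_test_star)
```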

Theorem D.31 (Equivalence result for node level regression, formal version of Theorem 4.3).

Given

  • training graph data GG and corresponding label vector YNY\in\mathbb{R}^{N}.

  • T>0T>0 be the total number of iterations.

  • arbitrary test data vv.

Let

  • ugnn,node,test(t)u_{\mathrm{gnn},\mathrm{node},\mathrm{test}}(t)\in\mathbb{R} and utest,nodeu_{\mathrm{test},\mathrm{node}}^{*}\in\mathbb{R} be the test data predictors defined in Definition 3.7 and Definition 3.4 respectively.

For any accuracy ε(0,1/10)\varepsilon\in(0,1/10) and failure probability δ(0,1/10)\delta\in(0,1/10), if

  • the multiplier κ=O~(N1poly(ε,Λ0,node,R1))\kappa=\widetilde{O}(N^{-1}\mathrm{poly}(\varepsilon,\Lambda_{0,\mathrm{node}},R^{-1})),

  • the total iterations T=O~(N2poly(ε1,Λ0,node1))T=\widetilde{O}(N^{2}\mathrm{poly}(\varepsilon^{-1},\Lambda_{0,\mathrm{node}}^{-1})),

  • the width of the neural network mO~(N10poly(ε1,Λ0,node1,d))m\geq\widetilde{O}(N^{10}\mathrm{poly}(\varepsilon^{-1},\Lambda_{0,\mathrm{node}}^{-1},d)),

  • and the smallest eigenvalue of node level GNTK Λ0,node>0\Lambda_{0,\mathrm{node}}>0,

then for any vtestv_{\mathrm{test}}, with probability at least 1δ1-\delta over the random initialization, we have

ugnn,test,node(T)utest,node2ε.\displaystyle\|u_{\mathrm{gnn},\mathrm{test},\mathrm{node}}(T)-u_{\mathrm{test},\mathrm{node}}^{*}\|_{2}\leq\varepsilon.

Here O~()\widetilde{O}(\cdot) hides polylog(N/(εδΛ0,node))\mathrm{poly}\log(N/(\varepsilon\delta\Lambda_{0,\mathrm{node}})).

Proof.

It follows from combining results of bounding ugnn,test,node(T)ugntk,test,node(T)2ε/2\|u_{\mathrm{gnn},\mathrm{test},\mathrm{node}}(T)-u_{\mathrm{gntk},\mathrm{test},\mathrm{node}}(T)\|_{2}\leq\varepsilon/2 as shown in Lemma D.29 and ugntk,test,node(T)utest,node2ε/2\|u_{\mathrm{gntk},\mathrm{test},\mathrm{node}}(T)-u_{\mathrm{test},\mathrm{node}}^{*}\|_{2}\leq\varepsilon/2 as shown in Lemma D.13 using triangle inequality. ∎

Appendix E Bounds for the Spectral Gap with Data Separation

In this section, we provide lower and upper bounds on the smallest eigenvalue of the GNTK. These bounds are proved by carefully decomposing the relevant probabilities. In Section E.1, we consider the standard setting. In Section E.2, we consider the shifted NTK setting.

E.1 Standard setting

In this section, we consider the GNTK defined in Definition D.3.

Lemma E.1 (Corollary I.2 of [58]).

We define (z)=𝟏z0\mathcal{I}(z)={\bf 1}_{z\geq 0}. Let

  • $x_{1},\cdots,x_{n}$ be points in $\mathbb{R}^{d}$ with unit Euclidean norm and $w\sim\mathcal{N}(0,I_{d})$.

  • Form the matrix $X=[x_{1},\cdots,x_{n}]^{\top}\in\mathbb{R}^{n\times d}$.

Suppose there exists δ>0\delta>0 such that for every 1ijn1\leq i\neq j\leq n we have that

min(xixj2,xi+xj2)δ.\displaystyle\min(\|x_{i}-x_{j}\|_{2},\|x_{i}+x_{j}\|_{2})\geq\delta.

Then, the covariance of the vector (Xw)\mathcal{I}(Xw) obeys

𝔼[(Xw)(Xw)XX]δ100n2\displaystyle\operatorname*{{\mathbb{E}}}[\mathcal{I}(Xw)\mathcal{I}(Xw)^{\top}\odot XX^{\top}]\succeq\frac{\delta}{100n^{2}}

Then we prove our eigenvalue bound for GNTK.

Theorem E.2 (GNTK with generalized datasets δ\delta-separation assumption).

Let us define

hG,u=v𝒩(u)h(G)v,xi,p=hGi,up,i[n],p[N].\displaystyle h_{G,u}=\sum_{v\in\mathcal{N}(u)}h(G)_{v},\leavevmode\nobreak\ x_{i,p}=h_{G_{i},u_{p}},\leavevmode\nobreak\ i\in[n],p\in[N].

Let wrw_{r} be defined as in Definition 3.1. Assuming that

  • wri.i.d.𝒩(0,I),r=1,,mw_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\leavevmode\nobreak\ r=1,\cdots,m.

  • our datasets satisfy the δ\delta-separation: For any i,j[n],p,q[N]i,j\in[n],\leavevmode\nobreak\ p,q\in[N]

    min{x¯i,px¯j,q2,x¯i,p+x¯j,q2}δ.\displaystyle\min\{\|\overline{x}_{i,p}-\overline{x}_{j,q}\|_{2},\|\overline{x}_{i,p}+\overline{x}_{j,q}\|_{2}\}\geq\delta.

    where x¯i,p=xi,p/xi,p2\overline{x}_{i,p}=x_{i,p}/\|x_{i,p}\|_{2}.

We define HctsH^{\mathrm{cts}} as in Definition 3.2. Then we can claim that

Hctsδ100n2,Λ0>δ100n2\displaystyle H^{\mathrm{cts}}\succ\frac{\delta}{100n^{2}},\leavevmode\nobreak\ \Lambda_{0}>\frac{\delta}{100n^{2}}
Proof.

Because

[𝖧cts]i,j\displaystyle[\mathsf{H}^{\mathrm{cts}}]_{i,j}
=\displaystyle= EW[1mr=1ml1=1Nl2=1Nxi,l1xj,l2𝟏wrxi,l10,wrxj,l20].\displaystyle E_{W}\left[\frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}x_{i,l_{1}}^{\top}x_{j,l_{2}}{\bf 1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0,w_{r}^{\top}x_{j,l_{2}}\geq 0}\right].

we can rewrite it as

𝖧cts=\displaystyle\mathsf{H}^{\mathrm{cts}}= 𝔼W[1mr=1ml1=1Nl2=1NXl1Xl2𝟏wrXl10𝟏wrXl20].\displaystyle\leavevmode\nobreak\ \operatorname*{{\mathbb{E}}}_{W}\left[\frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}X_{l_{1}}X_{l_{2}}^{\top}\odot{\bf 1}_{w_{r}^{\top}X_{l_{1}}\geq 0}{\bf 1}_{w_{r}^{\top}X_{l_{2}}\geq 0}\right].
=\displaystyle= 1mr=1m𝔼W[(l1=1NXl1𝟏wrXl10)(l2=1NXl2𝟏wrXl20)].\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\operatorname*{{\mathbb{E}}}_{W}\left[(\sum_{l_{1}=1}^{N}X_{l_{1}}\odot{\bf 1}_{w_{r}^{\top}X_{l_{1}}\geq 0})(\sum_{l_{2}=1}^{N}X_{l_{2}}^{\top}\odot{\bf 1}_{w_{r}^{\top}X_{l_{2}}\geq 0})\right].

where $X_{i}=[x_{1,i},\cdots,x_{n,i}]^{\top}\in\mathbb{R}^{n\times d}$. By Lemma E.1, we get that

EW[Xl1Xl2𝟏wrXl10𝟏wrXl20]δ100n2\displaystyle E_{W}\left[X_{l_{1}}X_{l_{2}}^{\top}\odot{\bf 1}_{w_{r}^{\top}X_{l_{1}}\geq 0}{\bf 1}_{w_{r}^{\top}X_{l_{2}}\geq 0}\right]\succ\frac{\delta}{100n^{2}}

As a result, we can immediately get that 𝖧ctsδ100n2\mathsf{H}^{\mathrm{cts}}\succ\frac{\delta}{100n^{2}}. ∎
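As an informal check on Theorem E.2 (in the single-node case $N=1$), the sketch below measures the $\delta$-separation of synthetic unit-norm features, estimates $H^{\mathrm{cts}}$ by Monte Carlo, and compares its smallest eigenvalue with the $\delta/(100n^{2})$ lower bound; everything here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, m = 8, 6, 100000
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm features
# delta-separation of the (already normalized) data points
delta = min(min(np.linalg.norm(X[i] - X[j]), np.linalg.norm(X[i] + X[j]))
            for i in range(n) for j in range(n) if i != j)
W = rng.normal(size=(m, d))
ind = (X @ W.T >= 0).astype(float)                  # indicators 1{w_r^T x_i >= 0}
H = (X @ X.T) * (ind @ ind.T) / m                   # Monte-Carlo estimate of H^cts (N = 1 case)
lam_min = np.linalg.eigvalsh(H)[0]
print("lambda_min(H) ~", lam_min, "   delta/(100 n^2) =", delta / (100 * n * n))
```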

E.2 Shifted NTK

In this section, we consider the shifted GNTK, which is defined via a shifted ReLU activation. We prove that the shifted GNTK is still positive definite. Moreover, its smallest eigenvalue decays to $0$ exponentially as the magnitude of the bias $b$ grows.

We first state a claim that lets us bound the eigenvalues of a Hadamard product.

Claim E.3 ([66]).

Let

  • M1,M2n×nM_{1},M_{2}\in\mathbb{R}^{n\times n} be two PSD matrices.

  • M1M2M_{1}\circ M_{2} denote the Hadamard product of M1M_{1} and M2M_{2}.

Then,

λmin(M1M2)\displaystyle\lambda_{\min}(M_{1}\circ M_{2})\geq (mini[n]M2i,i)λmin(M1),\displaystyle\leavevmode\nobreak\ (\min_{i\in[n]}{M_{2}}_{i,i})\cdot\lambda_{\min}(M_{1}),
λmax(M1M2)\displaystyle\lambda_{\max}(M_{1}\circ M_{2})\leq (maxi[n]M2i,i)λmax(M1).\displaystyle\leavevmode\nobreak\ (\max_{i\in[n]}{M_{2}}_{i,i})\cdot\lambda_{\max}(M_{1}).

Then, we state a result that bounds the eigenvalues of the shifted NTK.

Theorem E.4 ([78]).

Let

  • x1,,xnx_{1},\dots,x_{n} be points in d\mathbb{R}^{d} with unit Euclidean norm and w𝒩(0,Id)w\sim{\cal N}(0,I_{d}).

  • Form the matrix Xn×d=[x1xn]X\in\mathbb{R}^{n\times d}=[x_{1}\leavevmode\nobreak\ \dots\leavevmode\nobreak\ x_{n}]^{\top}.

Suppose there exists δ(0,2)\delta\in(0,\sqrt{2}) such that

minij[n]{xixj2,xi+xj2}δ.\displaystyle\min_{i\neq j\in[n]}\{\|x_{i}-x_{j}\|_{2},\|x_{i}+x_{j}\|_{2}\}\geq\delta.

Let b0b\geq 0. Recall the continuous Hessian matrix HctsH^{\mathrm{cts}} is defined by

Hi,jcts:=𝔼w𝒩(0,I)[xixj𝟏wxib,wxjb](i,j)[n]×[n].\displaystyle H^{\mathrm{cts}}_{i,j}:=\operatorname*{{\mathbb{E}}}_{w\sim\mathcal{N}(0,I)}\left[x_{i}^{\top}x_{j}{\bf 1}_{w^{\top}x_{i}\geq b,w^{\top}x_{j}\geq b}\right]\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall(i,j)\in[n]\times[n].

Let λ:=λmin(Hcts)\lambda:=\lambda_{\min}(H^{\mathrm{cts}}). Then, we have

exp(b2/2)λexp(b2/2)δ100n2.\displaystyle\exp(-b^{2}/2)\geq\leavevmode\nobreak\ \lambda\geq\leavevmode\nobreak\ \exp(-b^{2}/2)\cdot\frac{\delta}{100n^{2}}. (34)

We first state our definition of shifted graph neural network function, which can be viewed as a generalization of the standard graph neural network.

Definition E.5 (Shifted graph neural network function).

We define a graph neural network fgnn,shift(W,a,b,G):d×m×m××d×Nf_{\mathrm{gnn},\mathrm{shift}}(W,a,b,G):\mathbb{R}^{d\times m}\times\mathbb{R}^{m}\times\mathbb{R}\times\mathbb{R}^{d\times N}\rightarrow\mathbb{R} as the following form

fgnn,shift(W,a,b,G)=1mr=1marl=1Nσ(wrv𝒩(ul)h(G)vb).\displaystyle f_{\mathrm{gnn},\mathrm{shift}}(W,a,b,G)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sum_{l=1}^{N}\sigma(w_{r}^{\top}\sum_{v\in{\cal N}(u_{l})}h(G)_{v}-b).

Besides, we denote $f_{\mathrm{gnn},\mathrm{shift}}(W,a,b,\mathsf{G})=[f_{\mathrm{gnn},\mathrm{shift}}(W,a,b,G_{1}),\cdots,f_{\mathrm{gnn},\mathrm{shift}}(W,a,b,G_{n})]^{\top}$ to represent the GNN output on the training set.

Next, we define shifted graph neural tangent kernel for the shifted graph neural network function.

Definition E.6 (Shifted graph neural tangent kernel).

We define the graph neural tangent kernel (GNTK) corresponding to the graph neural networks fgnn,shiftf_{\mathrm{gnn},\mathrm{shift}} as follows:

𝖪gntk,shift(G,H)=𝔼W[fgnn,shift(W,G)W,fgnn,shift(W,H)W],\displaystyle\mathsf{K}_{\mathrm{gntk},\mathrm{shift}}(G,H)=\operatorname*{{\mathbb{E}}}_{W}\left[\left\langle\frac{\partial f_{\mathrm{gnn},\mathrm{shift}}(W,G)}{\partial W},\frac{\partial f_{\mathrm{gnn},\mathrm{shift}}(W,H)}{\partial W}\right\rangle\right],

where

  • G,HG,H are any input graph,

  • and the expectation is taking over wri.i.d.𝒩(0,I),r=1,,mw_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\leavevmode\nobreak\ r=1,\cdots,m.

We define Hshiftctsn×nH^{\mathrm{cts}}_{\mathrm{shift}}\in\mathbb{R}^{n\times n} as the kernel matrix between training data as [Hshiftcts]i,j=𝖪gntk,shift(Gi,Gj)[H^{\mathrm{cts}}_{\mathrm{shift}}]_{i,j}=\mathsf{K}_{\mathrm{gntk},\mathrm{shift}}(G_{i},G_{j})\in\mathbb{R}.

Then, we prove our eigenvalue bounds for shifted graph neural tangent kernel.

Theorem E.7.

Let us define

hG,u=v𝒩(u)h(G)v,xi,p=hGi,up,i[n],p[N].\displaystyle h_{G,u}=\sum_{v\in\mathcal{N}(u)}h(G)_{v},\leavevmode\nobreak\ x_{i,p}=h_{G_{i},u_{p}},\leavevmode\nobreak\ i\in[n],p\in[N].

Let $w_{r}$ be defined as in Definition E.5. Assume that $w_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\leavevmode\nobreak\ r=1,\cdots,m$, and that our datasets satisfy $\delta$-separation: for any $i,j\in[n],\leavevmode\nobreak\ p,q\in[N]$,

min{x¯i,px¯j,q2,x¯i,p+x¯j,q2}δ.\displaystyle\min\{\|\overline{x}_{i,p}-\overline{x}_{j,q}\|_{2},\|\overline{x}_{i,p}+\overline{x}_{j,q}\|_{2}\}\geq\delta.

where x¯i,p=xi,p/xi,p2\overline{x}_{i,p}=x_{i,p}/\|x_{i,p}\|_{2}.

We define HshiftctsH^{\mathrm{cts}}_{\mathrm{shift}} as in Definition E.6. Then we can claim that, for λ:=λmin(Hshiftcts)\lambda:=\lambda_{\min}(H^{\mathrm{cts}}_{\mathrm{shift}})

exp(b2/2)λexp(b2/2)δ100n2\displaystyle\exp(-b^{2}/2)\geq\lambda\geq\exp(-b^{2}/2)\cdot\frac{\delta}{100n^{2}}
Proof.

Because

\displaystyle[\mathsf{H}^{\mathrm{cts}}_{\mathrm{shift}}]_{i,j}=\operatorname*{{\mathbb{E}}}_{W}\left[\frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}x_{i,l_{1}}^{\top}x_{j,l_{2}}{\bf 1}_{w_{r}^{\top}x_{i,l_{1}}\geq b,w_{r}^{\top}x_{j,l_{2}}\geq b}\right].

we can rewrite it as

\displaystyle\mathsf{H}^{\mathrm{cts}}_{\mathrm{shift}}=\leavevmode\nobreak\ \operatorname*{{\mathbb{E}}}_{W}\left[\frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}X_{l_{1}}X_{l_{2}}^{\top}\odot{\bf 1}_{w_{r}^{\top}X_{l_{1}}\geq b}{\bf 1}_{w_{r}^{\top}X_{l_{2}}\geq b}\right].
=\displaystyle= 1mr=1m𝔼W[(l1=1NXl1𝟏wrXl1b)(l2=1NXl2𝟏wrXl2b)].\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\operatorname*{{\mathbb{E}}}_{W}\left[(\sum_{l_{1}=1}^{N}X_{l_{1}}\odot{\bf 1}_{w_{r}^{\top}X_{l_{1}}\geq b})(\sum_{l_{2}=1}^{N}X_{l_{2}}^{\top}\odot{\bf 1}_{w_{r}^{\top}X_{l_{2}}\geq b})\right].

where $X_{i}=[x_{1,i},\cdots,x_{n,i}]^{\top}\in\mathbb{R}^{n\times d}$. By Theorem E.4, we get that

exp(b2/2)𝔼W[Xl1Xl2𝟏wrXl1b𝟏wrXl2b]exp(b2/2)δ100n2\displaystyle\exp(-b^{2}/2)\succ\operatorname*{{\mathbb{E}}}_{W}\left[X_{l_{1}}X_{l_{2}}^{\top}\odot{\bf 1}_{w_{r}^{\top}X_{l_{1}}\geq b}{\bf 1}_{w_{r}^{\top}X_{l_{2}}\geq b}\right]\succ\exp(-b^{2}/2)\cdot\frac{\delta}{100n^{2}}

As a result, we can immediately get that $\exp(-b^{2}/2)\succ\mathsf{H}^{\mathrm{cts}}_{\mathrm{shift}}\succ\exp(-b^{2}/2)\frac{\delta}{100n^{2}}$. ∎
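The exponential dependence on the bias in Theorem E.7 is easy to observe numerically. The sketch below (again in the single-node case) estimates the shifted kernel by Monte Carlo for several values of $b$ and reports its smallest eigenvalue next to $\exp(-b^{2}/2)$; all data and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, m = 8, 6, 100000
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.normal(size=(m, d))
P = X @ W.T                                        # (n, m) projections w_r^T x_i
for b in (0.0, 1.0, 2.0):
    ind = (P >= b).astype(float)
    H_shift = (X @ X.T) * (ind @ ind.T) / m        # [H_shift]_{ij} ~ x_i^T x_j E[1{w^T x_i >= b, w^T x_j >= b}]
    lam_min = np.linalg.eigvalsh(H_shift)[0]
    print(f"b = {b}: lambda_min ~ {lam_min:.5f},  exp(-b^2/2) = {np.exp(-b * b / 2):.5f}")
```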

Appendix F Node-Level GNTK Formulation

In this section we derive the GNTK formulation for the node-level regression task. We first review the definitions. Let $f^{(l)}_{(r)}$ denote the output of the graph neural network at the $l$-th level and $r$-th layer. The Aggregate layer is defined as follows:

f(1)(1)(u)=v𝒩(u)h(G)v,\displaystyle f^{(1)}_{(1)}(u)=\sum_{v\in\mathcal{N}(u)}h(G)_{v},

and

f(1)(l)(u)=v𝒩(u)f(R)(l1)(v),l{2,3,,L}.\displaystyle f^{(l)}_{(1)}(u)=\sum_{v\in\mathcal{N}(u)}f^{(l-1)}_{(R)}(v),\leavevmode\nobreak\ l\in\{2,3,\cdots,L\}.

The definition of Combine layer is formulated as follows:

f(r)(l)(u)=cσmσ(W(r)(l)f(r1)(l)(u)),l[L],r[R]\displaystyle f^{(l)}_{(r)}(u)=\sqrt{\frac{c_{\sigma}}{m}}\sigma(W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u)),\leavevmode\nobreak\ l\in[L],r\in[R]

where $c_{\sigma}$ is a scaling factor, $W^{(l)}_{(r)}$ is the weight matrix of the neural network at the $l$-th level and $r$-th layer, and $m$ is the width of the layer.
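For concreteness, the following sketch implements a finite-width forward pass of the Aggregate/Combine architecture just described. The choice $\mathcal{N}(u)=\{u\}\cup\{\text{neighbors of }u\}$, the layer indexing, and all sizes and names are our own illustrative assumptions.

```python
import numpy as np

def gnn_forward(H_feat, adj, weights, c_sigma=2.0):
    """Finite-width forward pass of the Aggregate/Combine architecture above.
    H_feat  : (N, d) input node features h(G)_v
    adj     : (N, N) 0/1 adjacency; N(u) is taken to include u itself (an assumption)
    weights : weights[l][r] is W^{(l)}_{(r)}; the first Combine of level 1 maps d -> m."""
    A = adj + np.eye(adj.shape[0])
    f = H_feat                                     # the first Aggregate below produces f^{(1)}_{(1)}
    for level in weights:
        f = A @ f                                  # Aggregate: sum over v in N(u)
        for W in level:                            # Combine layers r = 1, ..., R
            m = W.shape[0]
            f = np.sqrt(c_sigma / m) * np.maximum(f @ W.T, 0.0)
    return f                                       # (N, m) node representations f^{(L)}_{(R)}

rng = np.random.default_rng(8)
N, d, m, L, R = 5, 4, 64, 2, 2
H_feat = rng.normal(size=(N, d))
adj = np.triu((rng.random((N, N)) < 0.4).astype(float), 1); adj = adj + adj.T
weights = [[rng.normal(size=(m, d if (l == 0 and r == 0) else m)) for r in range(R)]
           for l in range(L)]
print(gnn_forward(H_feat, adj, weights).shape)     # -> (5, 64)
```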

We define the Graph Neural Tangent Kernel as 𝖪gntk,node(u,u)=𝖪(R)(L)(u,u)\mathsf{K}_{\mathrm{gntk},\mathrm{node}}(u,u^{\prime})=\mathsf{K}_{(R)}^{(L)}(u,u^{\prime}). Let 𝚺(r)(l)(u,u)\bm{\Sigma}^{(l)}_{(r)}(u,u^{\prime}) be defined as the covariance matrix of the Gaussian process of the pre-activations W(r)(l)f(r1)(l)(u)W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u). We have that initially:

𝚺(1)(1)(u,u)=h(G)uh(G)u.\displaystyle\bm{\Sigma}^{(1)}_{(1)}(u,u^{\prime})=h(G)_{u}^{\top}h(G)_{u^{\prime}}.

For l{2,3,,L}l\in\{2,3,\cdots,L\},

𝚺(1)(l)(u,u)=v𝒩(u)v𝒩(u)𝚺(R)(l1)(v,v).\displaystyle\bm{\Sigma}^{(l)}_{(1)}(u,u^{\prime})=\sum_{v\in\mathcal{N}(u)}\sum_{v^{\prime}\in\mathcal{N}(u^{\prime})}\bm{\Sigma}^{(l-1)}_{(R)}(v,v^{\prime}).

Then, we derive the formulation for the Combine layer. Note that

𝔼[(W(r+1)(l)f(r)(l)(u))i(W(r+1)(l)f(r)(l)(u))i]\displaystyle\leavevmode\nobreak\ \operatorname*{{\mathbb{E}}}[(W^{(l)}_{(r+1)}f^{(l)}_{(r)}(u))_{i}\cdot(W^{(l)}_{(r+1)}f^{(l)}_{(r)}(u^{\prime}))_{i}]
=\displaystyle= f(r)(l)(u)f(r)(l)(u)\displaystyle\leavevmode\nobreak\ f^{(l)}_{(r)}(u)\circ f^{(l)}_{(r)}(u^{\prime})
=\displaystyle= cσmσ(W(r)(l)f(r1)(l)(u))σ(W(r)(l)f(r1)(l)(u))\displaystyle\leavevmode\nobreak\ \frac{c_{\sigma}}{m}\sigma(W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u))\circ\sigma(W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u^{\prime}))

Then, we have that

limm𝔼[(W(r+1)(l)f(r)(l)(u))i(W(r+1)(l)f(r)(l)(u))i]=𝚺(r+1)(l)(u,u),\displaystyle\lim_{m\to\infty}\operatorname*{{\mathbb{E}}}[(W^{(l)}_{(r+1)}f^{(l)}_{(r)}(u))_{i}\cdot(W^{(l)}_{(r+1)}f^{(l)}_{(r)}(u^{\prime}))_{i}]=\bm{\Sigma}^{(l)}_{(r+1)}(u,u^{\prime}),

where W(r+1)(l)m×mW^{(l)}_{(r+1)}\in\mathbb{R}^{m\times m}.

We conclude the formula for the Combine layer.

𝚲(r)(l)(u,u)=\displaystyle\bm{\Lambda}^{(l)}_{(r)}(u,u^{\prime})= (𝚺(r1)(l)(u,u)𝚺(r1)(l)(u,u)𝚺(r1)(l)(u,u)𝚺(r1)(l)(u,u))2×2,\displaystyle\begin{pmatrix}\bm{\Sigma}_{(r-1)}^{(l)}(u,u)&\bm{\Sigma}_{(r-1)}^{(l)}(u,u^{\prime})\\ \bm{\Sigma}_{(r-1)}^{(l)}(u^{\prime},u)&\bm{\Sigma}_{(r-1)}^{(l)}(u^{\prime},u^{\prime})\end{pmatrix}\in\mathbb{R}^{2\times 2},
𝚺(r)(l)(u,u)=\displaystyle\bm{\Sigma}^{(l)}_{(r)}(u,u^{\prime})= cσ𝔼(a,b)𝒩(𝟎,𝚲(r)(l)(u,u))[σ(a)σ(b)],\displaystyle c_{\sigma}\mathbb{E}_{(a,b)\sim\mathcal{N}\left(\bm{0},\bm{\Lambda}^{(l)}_{(r)}\left(u,u^{\prime}\right)\right)}[\sigma(a)\sigma(b)],
𝚺˙(r)(l)(u,u)=\displaystyle\bm{\dot{\Sigma}}^{(l)}_{(r)}\left(u,u^{\prime}\right)= cσ𝔼(a,b)𝒩(𝟎,𝚲(r)(l)(u,u))[σ˙(a)σ˙(b)].\displaystyle c_{\sigma}\mathbb{E}_{(a,b)\sim\mathcal{N}\left(\bm{0},\bm{\Lambda}^{(l)}_{(r)}\left(u,u^{\prime}\right)\right)}[\dot{\sigma}(a)\dot{\sigma}(b)].

We write the partial derivative with respect to a particular weight W(r)(l)W^{(l)}_{(r)}:

f(R)(L)W(r)(l)=f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))f(r1)(l)\displaystyle\frac{\partial f^{(L)}_{(R)}}{\partial W^{(l)}_{(r)}}=\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))f^{(l)}_{(r-1)}

where

f(R)(L)f(r1)(l)=f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))W(r)(l)\displaystyle\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r-1)}}=\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))W^{(l)}_{(r)}

Then, we compute the inner product:

f(R)(L)(u)W(r)(l)f(R)(L)(u)W(r)(l)\displaystyle\leavevmode\nobreak\ \frac{\partial f^{(L)}_{(R)}(u)}{\partial W^{(l)}_{(r)}}\circ\frac{\partial f^{(L)}_{(R)}(u^{\prime})}{\partial W^{(l)}_{(r)}}
=\displaystyle= f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))f(r1)(l)(u),f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))f(r1)(l)(u)\displaystyle\leavevmode\nobreak\ \langle\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))f^{(l)}_{(r-1)}(u),\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))f^{(l)}_{(r-1)}(u^{\prime})\rangle
=\displaystyle= f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))(u),f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))(u)f(r1)(l)(u),f(r1)(l)(u)\displaystyle\leavevmode\nobreak\ \langle\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))(u),\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))(u^{\prime})\rangle\cdot\langle f^{(l)}_{(r-1)}(u),f^{(l)}_{(r-1)}(u^{\prime})\rangle
=\displaystyle= f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))(u),f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))(u)𝚺(r)(l)(u,u)\displaystyle\leavevmode\nobreak\ \langle\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))(u),\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))(u^{\prime})\rangle\cdot{\bf\Sigma}^{(l)}_{(r)}(u,u^{\prime})

where

f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l))(u)),f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)(u)))\displaystyle\leavevmode\nobreak\ \langle\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)})(u)),\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u^{\prime})))\rangle
=\displaystyle= cσmtr[diag(σ˙(W(r)(l)f(r1)(l)))(u)diag(σ˙(W(r)(l)f(r1)(l)))(u)]f(R)(L)(u)f(r)(l),f(R)(L)(u)f(r)(l)\displaystyle\leavevmode\nobreak\ \frac{c_{\sigma}}{m}\mathrm{tr}[\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))(u)\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))(u^{\prime})]\langle\frac{\partial f^{(L)}_{(R)}(u)}{\partial f^{(l)}_{(r)}},\frac{\partial f^{(L)}_{(R)}(u^{\prime})}{\partial f^{(l)}_{(r)}}\rangle
=\displaystyle= cσm𝚺˙(r)(l)(u,u)f(R)(L)(u)f(r)(l),f(R)(L)(u)f(r)(l)\displaystyle\leavevmode\nobreak\ \frac{c_{\sigma}}{m}\dot{\bf\Sigma}^{(l)}_{(r)}(u,u^{\prime})\langle\frac{\partial f^{(L)}_{(R)}(u)}{\partial f^{(l)}_{(r)}},\frac{\partial f^{(L)}_{(R)}(u^{\prime})}{\partial f^{(l)}_{(r)}}\rangle

Finally, we conclude the formula for graph neural tangent kernel. For each Aggregate layer:

𝖪(1)(1)(u,u)=h(G)uh(G)u,\displaystyle\mathsf{K}^{(1)}_{(1)}(u,u^{\prime})=\leavevmode\nobreak\ h(G)_{u}^{\top}h(G)_{u^{\prime}},

and

𝖪(1)(l)(u,u)=v𝒩(u)v𝒩(u)𝖪(R)(l1)(v,v).\displaystyle\mathsf{K}^{(l)}_{(1)}(u,u^{\prime})=\leavevmode\nobreak\ \sum_{v\in\mathcal{N}(u)}\sum_{v^{\prime}\in\mathcal{N}(u^{\prime})}\mathsf{K}^{(l-1)}_{(R)}(v,v^{\prime}).

For the Combine layer:

𝖪(r)(l)(u,u)=𝖪(r1)(l)(u,u)𝚺˙(r)(l)(u,u)+𝚺(r)(l)(u,u).\displaystyle\mathsf{K}_{(r)}^{(l)}(u,u^{\prime})=\mathsf{K}_{(r-1)}^{(l)}(u,u^{\prime})\dot{\bm{\Sigma}}^{(l)}_{(r)}\left(u,u^{\prime}\right)+\bm{\Sigma}^{(l)}_{(r)}\left(u,u^{\prime}\right).

The Graph neural tangent kernel:

𝖪gntk,node(u,u)=𝖪(R)(L)(u,u).\displaystyle\mathsf{K}_{\mathrm{gntk},\mathrm{node}}(u,u^{\prime})=\mathsf{K}_{(R)}^{(L)}(u,u^{\prime}).
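The recursion above translates directly into a small dynamic program. The sketch below assumes a ReLU activation, for which the Gaussian expectations defining $\bm{\Sigma}$ and $\dot{\bm{\Sigma}}$ have the standard arc-cosine closed form, and uses $c_{\sigma}=2$; the split into one Aggregate step followed by Combine updates per level and the inclusion of $u$ in $\mathcal{N}(u)$ are our reading of the recursion, so the code is a sketch rather than a definitive implementation.

```python
import numpy as np

def relu_gauss_expectations(s_uu, s_vv, s_uv):
    """Closed-form Gaussian expectations for ReLU (arc-cosine kernel):
    E[sigma(a)sigma(b)] and E[sigma'(a)sigma'(b)] for (a, b) ~ N(0, Lambda)."""
    denom = np.sqrt(max(s_uu * s_vv, 1e-12))
    theta = np.arccos(np.clip(s_uv / denom, -1.0, 1.0))
    e_sig = denom * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)
    e_dot = (np.pi - theta) / (2.0 * np.pi)
    return e_sig, e_dot

def node_gntk(H_feat, adj, L=2, R=2, c_sigma=2.0):
    """K_gntk,node(u, u') following the recursion above with ReLU sigma."""
    A = adj + np.eye(adj.shape[0])                  # N(u) includes u itself (an assumption)
    Sigma = H_feat @ H_feat.T                       # Sigma^(1)_(1)(u, u') = h_u^T h_{u'}
    K = Sigma.copy()                                # K^(1)_(1) = Sigma^(1)_(1)
    num_nodes = Sigma.shape[0]
    for level in range(1, L + 1):
        if level > 1:                               # Aggregate layer for levels l >= 2
            Sigma = A @ Sigma @ A.T
            K = A @ K @ A.T
        for _ in range(2, R + 1):                   # Combine layers
            new_S = np.zeros_like(Sigma)
            dot_S = np.zeros_like(Sigma)
            for u in range(num_nodes):
                for v in range(num_nodes):
                    e_sig, e_dot = relu_gauss_expectations(Sigma[u, u], Sigma[v, v], Sigma[u, v])
                    new_S[u, v] = c_sigma * e_sig   # Sigma^(l)_(r)(u, u')
                    dot_S[u, v] = c_sigma * e_dot   # dot-Sigma^(l)_(r)(u, u')
            K = K * dot_S + new_S                   # K^(l)_(r) = K^(l)_(r-1) * dot-Sigma + Sigma
            Sigma = new_S
    return K

rng = np.random.default_rng(9)
N, d = 5, 4
H_feat = rng.normal(size=(N, d))
adj = np.triu((rng.random((N, N)) < 0.4).astype(float), 1); adj = adj + adj.T
print(node_gntk(H_feat, adj))
```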