
Is Solving Graph Neural Tangent Kernel Equivalent to Training Graph Neural Network?

Lianke Qin [email protected]. UCSB.    Zhao Song [email protected]. Adobe.    Baocheng Sun [email protected]. Weizmann Institute of Science.

A rising trend in theoretical deep learning is to understand why deep learning works through the Neural Tangent Kernel (NTK) [44], a kernel method that is equivalent to using gradient descent to train a multi-layer infinitely wide neural network. NTK is a major step forward in theoretical deep learning because it allows researchers to use traditional mathematical tools to analyze properties of deep neural networks and to explain various neural network techniques from a theoretical view. A natural extension of NTK to graph learning is the Graph Neural Tangent Kernel (GNTK), and researchers have already provided a GNTK formulation for graph-level regression and shown empirically that this kernel method can achieve similar accuracy as GNNs on various bioinformatics datasets [23]. The remaining question is whether solving GNTK regression is equivalent to training an infinitely wide multi-layer GNN using gradient descent. In this paper, we provide three new theoretical results. First, we formally prove this equivalence for graph-level regression. Second, we present the first GNTK formulation for node-level regression. Finally, we prove the equivalence for node-level regression.

1 Introduction

Many deep learning tasks need to deal with graph data, including social networks [95], bio-informatics [100, 94], recommendation systems [92], and autonomous driving [85, 93]. Due to the importance of these tasks, Graph Neural Networks (GNNs) have become the de facto method for machine learning on graph data. GNNs achieve state-of-the-art results on many graph-based learning tasks compared to approaches built on hand-crafted combinatorial features. As a result, GNNs have delivered convincing performance on a large number of tasks over graph-structured data.

Today, one major trend in theoretical deep learning is to understand when and how deep learning works through the Neural Tangent Kernel (NTK) [44]. NTK is a pure kernel method, and it has been shown that solving NTK regression is equivalent to training a fully-connected, infinitely wide neural network with gradient descent [3, 52]. This is a major step forward in theoretical deep learning because researchers can now use traditional mathematical tools to analyze properties of deep neural networks and to explain various neural network optimizations from a theoretical perspective. For example, [44] uses NTK to explain why early stopping works in training neural networks. [10, 9] build general toolkits to analyze multi-layer networks with ReLU activations and prove why stochastic gradient descent (SGD) can find global minima of the training objective of DNNs in polynomial time.

A natural next step is thus to apply NTK to GNNs, called Graph Neural Tangent Kernel (GNTK). [23] provides the formulation for graph-level regression and demonstrates that GNTK can empirically achieve similar accuracy as GNNs in various bioinformatics datasets.

This raises an important theoretical question:

Is solving Graph Neural Tangent Kernel regression equivalent to using gradient descent to train an infinitely wide graph neural network?

Figure 1: Our definitions of the Graph Neural Tangent Kernel (GNTK) and the Graph Dynamic Kernel. We define W as the parameters of the GNN. When we load the parameters W from W(t), the GNN parameters obtained by gradient descent at iteration t, the inner product of the gradients with respect to the GNN parameters is the Graph Dynamic Kernel \mathsf{K}_{t}(G,H). When we load the parameters W from samples of a normal distribution, the expectation of the inner product of the gradients with respect to the GNN parameters is the GNTK \mathsf{K}_{\mathrm{gntk}}(G,H).

This question is worthwhile because GNTK can potentially enable researchers to understand why and how GNNs work, in the same way that NTK analysis helps explain neural network training. At the same time, this question is more challenging than the equivalence result for NTK, because we need to deal with the structure of the graphs.

In this paper, we provide an affirmative answer. We start from the formulation of graph-level regression in [23]. We provide the first formal proof of the equivalence. To achieve this result, we develop new proof techniques, including iterative Graph Neural Tangent Kernel regression and the Graph Dynamic Kernel. The high-level idea is that we use an iterative regression to bridge the connection between the neural network and the kernel method.

We then consider node-level regression, another graph learning task which has many real-world applications, such as social networks. We provide the first GNTK formulation for node-level regression. We also establish the equivalence result for node-level regression.

Our paper makes the following contributions:

  • We present the first graph neural tangent kernel formulation for node-level regression.

  • We prove the equivalence of graph neural tangent kernel and graph neural network training for both graph-level and node-level regression.

2 Related Work

Kernel methods

Instead of learning some fixed set of parameters based on a set of features, kernel methods can “remember” training examples in their weights [29, 50, 10, 9]. After learning on the set of inputs, the kernel function \mathsf{K} is generated, and the kernel function can be used to predict unlabeled inputs. Our work is related to the recent breakthroughs in drawing the connections between kernel methods and neural networks [22, 21, 44, 17]. To the best of our knowledge, our work is the first that draws the theoretical connections between kernel methods and deep learning in the context of graph neural networks.

Convergence of neural network

Many works study the convergence of deep neural networks [12, 80, 102, 67, 55, 101, 28, 32, 13] under random input assumptions. Recently, instead of assuming random input, several works study the convergence of neural networks in the over-parametrization regime [50, 29, 10, 9, 25, 2, 3, 75, 52, 14, 78, 79]. These works only need a weak assumption on the input data, namely that the input data is separable: the distance between any two input points is sufficiently large. These works prove that if the neural network's width is sufficiently large, the neural network converges under gradient descent. We use this convergence result in our proofs to bound the test predictor of graph neural network training.

Graph neural network (GNNs) and graph neural tangent kernel (GNTK)

Graph neural networks are deep learning methods applied to graphs [97, 83, 16, 96, 90, 63, 89]. They have delivered promising results in many areas of AI, including social networks [95, 82], bio-informatics [100, 94, 31], knowledge graphs [41], recommendation systems [92], and autonomous driving [85, 93]. Due to their convincing performance, GNNs have become a widely applied graph learning method.

Graph neural tangent kernel applies the idea of the neural tangent kernel to GNNs. [23] provides the first formulation for graph-level regression, and it shows that the kernel method can deliver similar accuracy as training graph neural networks on several bio-informatics datasets. However, [23] does not provide a formal proof, and it does not consider node-level regression.

Sketching

Sketching is a well-known technique to improve performance or memory complexity [19]. It has wide applications in linear algebra, such as linear regression and low-rank approximation [19, 57, 56, 65, 71, 38, 6, 72, 73, 24, 35], training over-parameterized neural networks [78, 79, 98], empirical risk minimization [53, 62], linear programming [53, 49, 76], distributed problems [86, 15, 45, 46], clustering [30], generative adversarial networks [91], kernel density estimation [60], tensor decomposition [74], trace estimation [48], projected gradient descent [39, 88], matrix sensing [27, 61, 36], softmax regression [8, 54, 26, 35, 69], John Ellipsoid computation [18, 77], semi-definite programming [34], kernel methods [1, 4, 20, 70, 37], adversarial training [33], anomaly detection [81, 84], the cutting plane method [47], discrepancy [99], federated learning [64], reinforcement learning [5, 87, 68], and relational databases [59].

3 Graph Neural Tangent Kernel

We first consider the graph regression task. Before we continue, we introduce some useful notation. Let G=(U,E) be a graph with node set U and edge set E. A graph contains |U|=N nodes. The dimension of the feature vector of each input node is d. \mathcal{N} is the neighboring matrix that represents the topology of G. Each node u\in U has a feature vector h(G)_{u}\in\mathbb{R}^{d}. \mathcal{N}(u) represents all the neighboring nodes of u. A dataset contains n graphs. We denote the training graphs by \mathsf{G}=\{G_{1},\cdots,G_{n}\} with G_{i}\in\mathbb{R}^{d\times N}, i\in[n]. We denote by Y\in\mathbb{R}^{n} the label vector corresponding to \mathsf{G}. Our goal is to learn to predict the label of a graph G_{\mathrm{test}}. For presentation simplicity, we fix an ordering of all the nodes in a graph, so that we can use u_{i} to denote the i-th node in the graph. We use [k] to denote the set \{1,2,\cdots,k\}.

3.1 Graph Neural Network

For simplicity, we only consider a single-level, single-layer GNN. A graph neural network (GNN) consists of three operations: Aggregate, Combine and ReadOut.

Aggregate. The Aggregate operation aggregates the information from the neighboring nodes. We define the sum of feature vectors of neighboring nodes as h_{G,u}:=\sum_{v\in\mathcal{N}(u)}h(G)_{v}.

Let R:=\max_{i\in[n]}\max_{u\in G_{i}}\|h(G_{i})_{u}\|_{2}. We have \|\sum_{v\in{\cal N}(u)}h(G_{i})_{v}\|_{2}\leq\sum_{v\in{\cal N}(u)}\|h(G_{i})_{v}\|_{2}\leq R\cdot|{\cal N}(u)|, where the first step follows from the triangle inequality and the second step follows from the definition of R. As a result, \max_{i\in[n]}\max_{u\in G_{i}}\|\sum_{v\in{\cal N}(u)}h(G_{i})_{v}\|_{2}\leq R\max_{i\in[n]}\max_{u\in G_{i}}|{\cal N}(u)|. When the graphs in \mathsf{G} are sparse, the upper bound above becomes \alpha R, where \alpha is a constant satisfying \alpha\ll N. When the graphs are dense, the upper bound becomes NR.

Combine. The Combine operation applies a fully connected layer with ReLU activation. W=[w_{1},\cdots,w_{m}]\in\mathbb{R}^{d\times m} represents the weights in the neural network, \sigma is the ReLU function (\sigma(z)=\max\{0,z\}), and m represents the output dimension of the neural network. This layer can be formulated as h^{\prime}_{G,u}:=\frac{1}{\sqrt{m}}\cdot\sigma(W^{\top}h_{G,u})\in\mathbb{R}^{m}.

ReadOut. The final output of the GNN on graph G is generated by the ReadOut operation. We take a summation over the output features with weights a_{r}\in\{-1,+1\}, r\in[m], i.e., a=[a_{1},\cdots,a_{m}]^{\top}, which is a standard setting in the literature [29, 75, 14]. This layer can be formulated as f_{\mathrm{gnn}}(G):=\sum_{r=1}^{m}a_{r}\sum_{u\in G}[h^{\prime}_{G,u}]_{r}\in\mathbb{R}.

Combining the previous three operations, we now have the definition of a graph neural network function.

Definition 3.1 (Graph neural network function).

We define a graph neural network f_{\mathrm{gnn}}(W,a,G):\mathbb{R}^{d\times m}\times\mathbb{R}^{m}\times\mathbb{R}^{d\times N}\rightarrow\mathbb{R} of the following form: f_{\mathrm{gnn}}(W,a,G)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sum_{l=1}^{N}\sigma(w_{r}^{\top}\sum_{v\in{\cal N}(u_{l})}h(G)_{v}).

Furthermore, we denote f_{\mathrm{gnn}}(W,a,\mathsf{G})=[f_{\mathrm{gnn}}(W,a,G_{1}),\cdots,f_{\mathrm{gnn}}(W,a,G_{n})]^{\top} to represent the GNN output on the training set.

The normalization 1/\sqrt{m} is a standard formulation for GNN [23]. Similar to other works in theoretical deep learning [50, 10, 29, 75, 14, 79, 7, 42], we consider only training W while fixing a, so we can write f_{\mathrm{gnn}}(W,G)=f_{\mathrm{gnn}}(W,a,G).
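To make Definition 3.1 concrete, the following is a minimal NumPy sketch of the forward pass. The representation of a graph as a feature matrix h together with neighbor index lists, and the particular values of d, m, and N, are illustrative assumptions of the sketch rather than part of the formal setup.

import numpy as np

def gnn_forward(W, a, h, neighbors):
    # W: (d, m) Combine weights; a: (m,) ReadOut signs in {-1, +1}
    # h: (N, d) node features; neighbors: list of neighbor index lists
    d, m = W.shape
    out = 0.0
    for u in range(h.shape[0]):
        # Aggregate: sum the feature vectors of the neighbors of u
        h_agg = h[neighbors[u]].sum(axis=0) if len(neighbors[u]) > 0 else np.zeros(d)
        # Combine: fully connected layer with ReLU, scaled by 1/sqrt(m)
        h_comb = np.maximum(W.T @ h_agg, 0.0) / np.sqrt(m)
        # ReadOut: weighted sum over the m output coordinates, summed over nodes
        out += a @ h_comb
    return out

# Illustrative usage with random data.
rng = np.random.default_rng(0)
d, m, N = 4, 64, 5
W = rng.standard_normal((d, m))
a = rng.choice([-1.0, 1.0], size=m)
h = rng.standard_normal((N, d))
neighbors = [[1, 2], [0], [0, 3], [2, 4], [3]]
print(gnn_forward(W, a, h, neighbors))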

3.2 Graph Neural Tangent Kernel

We now translate the GNN architecture to its corresponding Graph Neural Tangent Kernel (GNTK).

Definition 3.2 (Graph neural tangent kernel).

We define the graph neural tangent kernel (GNTK) corresponding to the graph neural network f_{\mathrm{gnn}} as follows:

\displaystyle\mathsf{K}_{\mathrm{gntk}}(G,H)=\operatorname*{{\mathbb{E}}}_{W}\left[\left\langle\frac{\partial f_{\mathrm{gnn}}(W,G)}{\partial W},\frac{\partial f_{\mathrm{gnn}}(W,H)}{\partial W}\right\rangle\right],

where

  • G,H are any input graphs,

  • and the expectation is taken over

    \displaystyle w_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\ r=1,\cdots,m.

We define H^{\mathrm{cts}}\in\mathbb{R}^{n\times n} as the kernel matrix between training data, with [H^{\mathrm{cts}}]_{i,j}=\mathsf{K}_{\mathrm{gntk}}(G_{i},G_{j})\in\mathbb{R}. We denote the smallest eigenvalue of H^{\mathrm{cts}} by \Lambda_{0}.

In Figure 1, we provide an illustration of the GNTK. We sample the parameters from a normal distribution and then load the sampled parameters into the GNN. We use those loaded parameters to perform the forward pass on the graph and then calculate the gradient. We feed each pair of graphs in the dataset \mathsf{G} to the same GNN with the same parameters and calculate the gradient with respect to the GNN parameters W. Finally, taking the expectation of the inner product of the gradients for each pair of graphs, we obtain the GNTK. The GNTK is a constant and does not change during the GNN training process.
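Since the expectation in Definition 3.2 is over random weights, it can be approximated numerically by sampling: draw several weight matrices from the normal distribution, compute the gradient of f_gnn with respect to W for each graph, and average the inner products. The sketch below does this for the single-layer GNN above; each graph is passed as a pair (h, neighbors), and the number of samples is an illustrative assumption.

import numpy as np

def aggregate(h, neighbors):
    # Row u is sum_{v in N(u)} h_v, i.e. the Aggregate output h_{G,u}.
    return np.stack([h[nb].sum(axis=0) for nb in neighbors])

def gnn_grad_W(W, a, h, neighbors):
    # Gradient of f_gnn(W, a, G) with respect to W for the single-layer GNN.
    m = W.shape[1]
    H_agg = aggregate(h, neighbors)          # (N, d)
    pre = H_agg @ W                          # (N, m), entries w_r^T h_{G,u}
    act = (pre > 0).astype(float)            # ReLU derivative
    # Column r equals (a_r / sqrt(m)) * sum_u 1[pre_{u,r} > 0] * h_{G,u}.
    return (H_agg.T @ act) * a / np.sqrt(m)  # (d, m)

def gntk_monte_carlo(G1, G2, d, m, a, num_samples=200, seed=0):
    # Estimate K_gntk(G, H) = E_W[<df/dW(G), df/dW(H)>] by sampling W.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_samples):
        W = rng.standard_normal((d, m))
        total += np.sum(gnn_grad_W(W, a, *G1) * gnn_grad_W(W, a, *G2))
    return total / num_samples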

We define the corresponding feature function as follows:

Definition 3.3 (Feature function).

Let \mathcal{F} be a Hilbert space.

  • We denote the feature function corresponding to the kernel \mathsf{K}_{\mathrm{gntk}} as \Phi:\mathbb{R}^{d\times N}\rightarrow\mathcal{F}, which satisfies \langle\Phi(G),\Phi(H)\rangle_{\mathcal{F}}=\mathsf{K}_{\mathrm{gntk}}(G,H) for any graph data G, H.

  • We write \Phi(\mathsf{G})=[\Phi(G_{1}),\cdots,\Phi(G_{n})]^{\top}.

Next, we provide the definition of Graph Neural Tangent Kernel regression.

Definition 3.4 (Graph Neural Tangent Kernel regression).

We consider the following neural tangent kernel regression problem: \min_{\beta}\frac{1}{2}\|Y-\kappa f_{\mathrm{gntk}}(\beta,\mathsf{G})\|_{2}^{2} where

  • \beta\in\mathcal{F} is the weight of the kernel regression,

  • f_{\mathrm{gntk}}(\beta,\mathsf{G})=\Phi(\mathsf{G})\beta\in\mathbb{R}^{n} denotes the prediction function.
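Once the kernel values are available, solving the regression in Definition 3.4 reduces to a linear solve against the kernel matrix, and the multiplier \kappa cancels in the resulting test prediction. A minimal sketch, assuming H^{\mathrm{cts}} is positive definite (i.e., \Lambda_{0}>0) so the solve is well posed:

import numpy as np

def gntk_regression_predict(H_cts, Y, k_test):
    # H_cts:  (n, n) kernel matrix with [H_cts]_{ij} = K_gntk(G_i, G_j)
    # k_test: (n,)   vector with entries K_gntk(G_test, G_i)
    # Returns the GNTK regression prediction u*_test = k_test^T (H_cts)^{-1} Y.
    alpha = np.linalg.solve(H_cts, Y)
    return k_test @ alpha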

3.3 Notations for Node-Level regression

Here, we consider the node-level regression task. In our setting, the dataset contains a single graph G=(V,E), where V denotes the set of nodes in graph G and E denotes the set of edges in graph G. Suppose that |V|=N. The dataset also contains labels Y=(y_{1},\cdots,y_{N}), with y_{i} corresponding to the label of node u_{i}, i\in[N], in the graph.

In this paper, we consider two different settings for node-level regression training: inductive training and transductive training. We partition V into two disjoint sets V_{\mathrm{train}} and V_{\mathrm{test}}. In the transductive training process, we can observe the whole graph G=(V,E); we can observe the labels of nodes in V_{\mathrm{train}}, but we cannot observe the labels of nodes in V_{\mathrm{test}}. In the inductive training process, let E_{\mathrm{train}}=\{(u,v)\in E\ |\ u\in V_{\mathrm{train}}\wedge v\in V_{\mathrm{train}}\} and G_{\mathrm{train}}=(V_{\mathrm{train}},E_{\mathrm{train}}); we can only observe the subgraph G_{\mathrm{train}}=(V_{\mathrm{train}},E_{\mathrm{train}}) and the features of nodes in V_{\mathrm{train}}. During the test process, we can observe the whole graph G=(V,E) in both the inductive and the transductive setting, but we cannot observe any label. We prove the equivalence between GNN and GNTK for both settings. For simplicity, we only consider the transductive setting here and leave the inductive setting to the appendix.

In the rest of this section, we will provide the definitions of GNN and GNTK for node level regression and the definition of their corresponding training process.

We first give the definition of the GNN at the node level.

Definition 3.5 (Graph neural network function (node level)).

Given the graph G, we define a graph neural network f_{\mathrm{gnn},\mathrm{node}}(W,a,u):\mathbb{R}^{d\times m}\times\mathbb{R}^{m}\times[N]\rightarrow\mathbb{R} of the following form:

\displaystyle f_{\mathrm{gnn},\mathrm{node}}(W,a,u)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sigma(w_{r}^{\top}\sum_{v\in{\cal N}(u)}h(G)_{v}).

Besides, we denote

\displaystyle f_{\mathrm{gnn},\mathrm{node}}(W,a,G)=[f_{\mathrm{gnn},\mathrm{node}}(W,a,u_{1}),\cdots,f_{\mathrm{gnn},\mathrm{node}}(W,a,u_{N})]^{\top}\in\mathbb{R}^{N}

to represent the GNN output on the training set.

Note that at the node level, the definition of the GNN does not contain a ReadOut layer; there is only a single Aggregate layer and a single Combine layer.
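A minimal NumPy sketch of the node-level network of Definition 3.5; it returns one prediction per node because there is no ReadOut layer. As before, the (features, neighbor lists) graph representation is an assumption of the sketch.

import numpy as np

def gnn_node_forward(W, a, h, neighbors):
    # Returns an (N,) vector whose u-th entry is f_{gnn,node}(W, a, u).
    m = W.shape[1]
    H_agg = np.stack([h[nb].sum(axis=0) for nb in neighbors])  # Aggregate: (N, d)
    H_comb = np.maximum(H_agg @ W, 0.0) / np.sqrt(m)           # Combine: (N, m)
    return H_comb @ a                                          # no ReadOut: per-node outputs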

We now give the definition of the GNTK at the node level as follows:

Definition 3.6 (Graph neural tangent kernel (node level)).

We define the graph neural tangent kernel (GNTK) corresponding to the graph neural network f_{\mathrm{gnn},\mathrm{node}} as follows:

\displaystyle\mathsf{K}_{\mathrm{gntk},\mathrm{node}}(u,v)=\operatorname*{{\mathbb{E}}}_{W}\left[\left\langle\frac{\partial f_{\mathrm{gnn},\mathrm{node}}(W,u)}{\partial W},\frac{\partial f_{\mathrm{gnn},\mathrm{node}}(W,v)}{\partial W}\right\rangle\right],

where

  • u,v are any input nodes in graph G,

  • and the expectation is taken over

    \displaystyle w_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\ r=1,\cdots,m.

We define H^{\mathrm{cts},\mathrm{node}}\in\mathbb{R}^{N\times N} as the kernel matrix between training data, with [H^{\mathrm{cts},\mathrm{node}}]_{i,j}=\mathsf{K}_{\mathrm{gntk},\mathrm{node}}(u_{i},u_{j})\in\mathbb{R}. We denote the smallest eigenvalue of H^{\mathrm{cts},\mathrm{node}} by \Lambda_{0,\mathrm{node}}.

In the graph-level Definition 3.2, each kernel entry involves two different graphs, while in the node-level Definition 3.6, each kernel entry involves two different nodes in a single graph.

The training process of the GNN at the node level is similar to the training of the GNN at the graph level defined in Definition D.2.

Definition 3.7 (Training graph neural network (node level)).

Let \kappa\in(0,1) be a small multiplier. We initialize the network as a_{r}\overset{i.i.d.}{\sim}\operatorname{unif}[\{-1,1\}],\ w_{r}(0)\overset{i.i.d.}{\sim}\mathcal{N}(0,I_{d}). We consider solving the following optimization problem using gradient descent:

\displaystyle\min_{W}\frac{1}{2}\|Y-\kappa f_{\mathrm{gnn},\mathrm{node}}(W,G)\|_{2}^{2}.

Furthermore, we define

  • w_{r}(t),r\in[m] as the weight at iteration t,

  • training data predictor at iteration t as u_{\mathrm{gnn},\mathrm{node}}(t)=\kappa f_{\mathrm{gnn},\mathrm{node}}(W(t),G),

  • and test data predictor at iteration t as

    \displaystyle u_{\mathrm{gnn},\mathrm{node},\mathrm{test}}(t)=\kappa f_{\mathrm{gnn},\mathrm{node}}(W(t),v_{\mathrm{test}}).
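A minimal sketch of the gradient-descent training of Definition 3.7, with a held fixed; the learning rate, width, and number of steps are illustrative assumptions of the sketch.

import numpy as np

def train_gnn_node(Y, h, neighbors, m=256, kappa=0.1, lr=0.1, T=500, seed=0):
    # Gradient descent on (1/2) * ||Y - kappa * f_{gnn,node}(W, G)||_2^2 with a fixed.
    rng = np.random.default_rng(seed)
    N, d = h.shape
    W = rng.standard_normal((d, m))                    # w_r(0) ~ N(0, I_d)
    a = rng.choice([-1.0, 1.0], size=m)                # a_r ~ unif{-1, +1}, not trained
    H_agg = np.stack([h[nb].sum(axis=0) for nb in neighbors])   # (N, d) aggregated features
    for _ in range(T):
        pre = H_agg @ W                                # (N, m)
        u = kappa / np.sqrt(m) * (np.maximum(pre, 0.0) @ a)     # predictor u_{gnn,node}(t)
        residual = Y - u                               # (N,)
        act = (pre > 0).astype(float) * a              # (N, m): 1[w_r^T h_{G,u} > 0] * a_r
        grad_W = -kappa / np.sqrt(m) * (H_agg.T @ (residual[:, None] * act))
        W -= lr * grad_W                               # gradient step on W only
    return W, a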

The regression with the GNTK at the node level is similar to the regression with the GNTK at the graph level defined in Definition 3.4.

Definition 3.8 (Graph Neural Tangent Kernel regression (node level)).

We consider the following neural tangent kernel regression problem:

\displaystyle\min_{\beta}\frac{1}{2}\|Y-\kappa f_{\mathrm{gntk},\mathrm{node}}(\beta,G)\|_{2}^{2}

where

  • \beta\in\mathcal{F} is the weight of the kernel regression.

  • f_{\mathrm{gntk},\mathrm{node}}(\beta,G)=\Phi(G)\beta\in\mathbb{R}^{N} denotes the prediction function.

4 Equivalence between GNTK and infinite-wide GNN

4.1 Equivalence result for graph-level regression

Our main result is the following equivalence between solving kernel regression and using gradient descent to train an infinitely wide single-layer graph neural network. It states that if the neural network width m is wide enough and the number of training iterations T is large enough, then the difference between the test data predictor of the graph neural network, u_{\mathrm{gnn},\mathrm{test}}(T), and that of the Graph Neural Tangent Kernel regression, u_{\mathrm{test}}^{*}, can be made smaller than any positive accuracy \varepsilon.

In our result, we use the fact that the inputs are bounded, which is a standard setting in the optimization field [52]. Let R denote the parameter that bounds the feature vectors h(G)_{u}: \max_{i\in[n]}\max_{u\in G_{i}}\|h(G_{i})_{u}\|_{2}\leq R. This means that there is a bound on the maximum norm of the input data.

The equivalence result is as follows:

Theorem 4.1 (Equivalence result for graph level regression).

Given

  • training graph data \mathsf{G}=\{G_{1},\cdots,G_{n}\},

  • corresponding label vector Y\in\mathbb{R}^{n},

  • arbitrary test data G_{\mathrm{test}}.

Let

  • T>0 be the total number of iterations,

  • N be the number of nodes in a graph,

  • u_{\mathrm{gnn},\mathrm{test}}(t)\in\mathbb{R} and u_{\mathrm{test}}^{*}\in\mathbb{R} be the test data predictors defined in Definition D.2 and Definition 3.4 respectively.

For any accuracy \varepsilon\in(0,1/10) and failure probability \delta\in(0,1/10), if the following conditions hold:

  • the multiplier \kappa=\widetilde{O}(N^{-2}\mathrm{poly}(\varepsilon,\Lambda_{0},R^{-1},n^{-1})),

  • the total iterations T=\widetilde{O}(N^{4}\mathrm{poly}(\varepsilon^{-1},\Lambda_{0}^{-1},R,n)),

  • the width of the neural network m\geq\widetilde{O}(N^{2}\mathrm{poly}(\varepsilon^{-1},\Lambda_{0}^{-1},n,d)),

  • and the smallest eigenvalue of the GNTK \Lambda_{0}>0,

then for any G_{\mathrm{test}}, with probability at least 1-\delta over the random initialization, we have

\displaystyle\|u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}\|_{2}\leq\varepsilon.

Here \widetilde{O}(\cdot) hides \mathrm{poly}\log(Nn/(\varepsilon\delta\Lambda_{0})).

Note that as \varepsilon goes to 0, the required width of the GNN goes to infinity; our result thus implies that \|u_{\mathrm{gnn},\mathrm{test}}(T)-u^{*}_{\mathrm{test}}\|_{2}\rightarrow 0 as the width of the GNN m\rightarrow\infty and the number of training iterations T\rightarrow\infty. It is worth noting that our result is a non-asymptotic convergence result: we provide explicit lower bounds on the number of training iterations T and the width of the layers m.

The number of nodes N has a non-linear impact on the required training iterations T and the width of the layer m. Our Theorem 4.1 shows that m depends on N^{2} and T depends on N^{4}.

Moreover, the number of vertices does not need to be fixed in our Theorem 4.1. In fact, for any graph that contains fewer than N vertices, we can add dummy vertices with zero features until the graph contains N vertices. This reduces to the case where all graphs in the dataset \mathsf{G} contain N vertices.
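A small sketch of the padding described above, under the same (features, neighbor lists) graph representation used in the earlier sketches; the added dummy vertices are isolated and carry zero features, so they do not change the Aggregate output of any original node.

import numpy as np

def pad_graph(h, neighbors, N):
    # Pad a graph with isolated dummy vertices (zero features) up to N vertices.
    n_cur, d = h.shape
    h_pad = np.vstack([h, np.zeros((N - n_cur, d))])
    neighbors_pad = list(neighbors) + [[] for _ in range(N - n_cur)]
    return h_pad, neighbors_pad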

Next, we show the generalized \delta-separation assumption under which the condition \Lambda_{0}>0 in Theorem 4.1 always holds.

Theorem 4.2 (Spectral gap of shifted GNTK).

For a graph neural network with shifted rectified linear unit (ReLU) activation, f_{\mathrm{gnn},\mathrm{shift}}(W,a,b,G)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sum_{l=1}^{N}\sigma(w_{r}^{\top}\sum_{v\in{\cal N}(u_{l})}h(G)_{v}-b), where b\in\mathbb{R}, we define H^{\mathrm{cts}}_{\mathrm{shift}}\in\mathbb{R}^{n\times n} as the shifted kernel matrix between training data:

\displaystyle[H^{\mathrm{cts}}_{\mathrm{shift}}]_{i,j}=\operatorname*{{\mathbb{E}}}_{W}\left[\left\langle\frac{\partial f_{\mathrm{gnn},\mathrm{shift}}(W,G)}{\partial W},\frac{\partial f_{\mathrm{gnn},\mathrm{shift}}(W,H)}{\partial W}\right\rangle\right],

where

  • G,H are any input graphs,

  • and the expectation is taken over w_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\ r=1,\cdots,m.

Let

  • h_{G,u}=\sum_{v\in\mathcal{N}(u)}h(G)_{v},\ x_{i,p}=h_{G_{i},u_{p}},\ i\in[n],p\in[N].

  • initialize w_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\ r=1,\cdots,m.

Assume that our dataset satisfies the \delta-separation condition:

For any i,j\in[n],\ p,q\in[N] with (i,p)\neq(j,q),

\displaystyle\min\{\|\overline{x}_{i,p}-\overline{x}_{j,q}\|_{2},\|\overline{x}_{i,p}+\overline{x}_{j,q}\|_{2}\}\geq\delta,

where \overline{x}_{i,p}=x_{i,p}/\|x_{i,p}\|_{2}. Then we claim that, for \lambda:=\lambda_{\min}(H^{\mathrm{cts}}_{\mathrm{shift}}),

\displaystyle\exp(-b^{2}/2)\geq\lambda\geq\exp(-b^{2}/2)\cdot\mathrm{poly}(\delta,n^{-1},N^{-1}).

In this theorem, we consider the shifted GNTK, which is motivated by the shifted NTK [78] and is more general than the standard GNTK definition [23]. Here, we extend \delta-separation to the graph setting. The theorem says that if \delta-separation is satisfied for the graphs, then the Graph Neural Tangent Kernel is positive definite. More specifically, in the theorem above, when we set b=0, we have that \Lambda_{0}=\lambda\geq\mathrm{poly}(\delta,n^{-1},N^{-1}).
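For reference, the separation parameter of a concrete dataset can be computed directly: reading the assumption as a lower bound, the largest admissible \delta is the minimum, over distinct pairs of normalized nonzero aggregated features, of \min\{\|\overline{x}-\overline{x}'\|_{2},\|\overline{x}+\overline{x}'\|_{2}\}. A minimal sketch, where stacking all x_{i,p} into one array is an assumption of the sketch:

import numpy as np

def separation_delta(X):
    # X: (M, d) rows are the nonzero aggregated feature vectors x_{i,p}.
    Xbar = X / np.linalg.norm(X, axis=1, keepdims=True)
    delta = np.inf
    M = Xbar.shape[0]
    for i in range(M):
        for j in range(i + 1, M):
            sep = min(np.linalg.norm(Xbar[i] - Xbar[j]),
                      np.linalg.norm(Xbar[i] + Xbar[j]))
            delta = min(delta, sep)
    return delta   # the dataset satisfies delta-separation for any delta up to this value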

4.2 Equivalence result for node-level regression

We also prove the equivalence result for the node-level regression setting. The result is as follows:

Theorem 4.3 (Equivalence result for node level regression, informal version of Theorem D.31).

Given

  • training graph data G,

  • corresponding label vector Y\in\mathbb{R}^{N},

  • arbitrary test data v.

Let

  • T>0 be the total number of iterations.

  • u_{\mathrm{gnn},\mathrm{node},\mathrm{test}}(t)\in\mathbb{R} and u_{\mathrm{test},\mathrm{node}}^{*}\in\mathbb{R} be the test data predictors defined in Definition 3.7 and Definition 3.8 respectively.

For any accuracy \varepsilon\in(0,1/10) and failure probability \delta\in(0,1/10), if

  • the multiplier \kappa=\widetilde{O}(N^{-1}\mathrm{poly}(\varepsilon,\Lambda_{0,\mathrm{node}},R^{-1})),

  • the total iterations T=\widetilde{O}(N^{2}\mathrm{poly}(\varepsilon^{-1},\Lambda_{0,\mathrm{node}}^{-1})),

  • the width of the neural network m\geq\widetilde{O}(N^{10}\mathrm{poly}(\varepsilon^{-1},\Lambda_{0,\mathrm{node}}^{-1},d)),

  • and the smallest eigenvalue of the node-level GNTK \Lambda_{0,\mathrm{node}}>0,

then for any v_{\mathrm{test}}, with probability at least 1-\delta over the random initialization, we have

\displaystyle\|u_{\mathrm{gnn},\mathrm{test},\mathrm{node}}(T)-u_{\mathrm{test},\mathrm{node}}^{*}\|_{2}\leq\varepsilon.

Here \widetilde{O}(\cdot) hides \mathrm{poly}\log(N/(\varepsilon\delta\Lambda_{0,\mathrm{node}})).

Our result shows that the width of the neural network m depends on N^{10}, where N is the number of nodes in the graph. The total number of training iterations T depends on N^{2}.

5 Overview of Techniques

High-level Approach

Given any test graph G_{\mathrm{test}}, our goal is to bound the distance between the prediction of graph kernel regression, u_{\mathrm{test}}^{*}, and the prediction of the GNN, u_{\mathrm{gnn},\mathrm{test}}(T), at training iteration t=T. Our main approach is to consider an iterative GNTK regression process u_{\mathrm{gntk},\mathrm{test}}(t) that is optimized using gradient descent.

Solving the GNTK kernel regression is a single-step process, while training a GNN involves multiple iterations. Our proposed iterative GNTK regression u_{\mathrm{gntk},\mathrm{test}}(t), which is optimized by gradient descent, bridges this gap. We prove that when the iterative GNTK regression finishes, the final prediction on G_{\mathrm{test}} is exactly the same as solving GNTK regression in a single step. Thus, to prove GNTK = GNN, the remaining step is to prove the equivalence of iterative GNTK regression and the GNN training process.

More specifically, we use the iterative GNTK regression's test data predictor u_{\mathrm{gntk},\mathrm{test}}(T) at training iteration T as a bridge to connect the GNTK regression predictor u_{\mathrm{test}}^{*} and the GNN predictor u_{\mathrm{gnn},\mathrm{test}}(T). There are three major steps to prove our main result.

  • By picking a sufficiently large number of training iterations T, we bound the difference between the test predictor of iterative GNTK regression at iteration T and the test predictor after GNTK regression converges, |u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}|\leq\varepsilon/2, for any accuracy \varepsilon>0.

  • After fixing the number of training iterations T in Step 1, we bound the test predictor difference between the GNN and the iterative GNTK regression at iteration T, |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|\leq\varepsilon/2.

  • Lastly, we combine the results of Step 1 and Step 2 using the triangle inequality. This finishes the proof of the equivalence between GNN training and GNTK kernel regression, i.e., |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}|\leq\varepsilon.

For the first part, the high-level idea is that u_{\mathrm{test}}^{*} is the limit of u_{\mathrm{gntk},\mathrm{test}}(t) as t\to\infty; we then obtain the desired result from the linear convergence of the gradient flow of u_{\mathrm{gntk},\mathrm{test}}(t). The second part needs more sophisticated tools. The high-level idea is that we split the second part into two sub-problems using the Newton-Leibniz formula and the almost-invariant property of the Graph Dynamic Kernel. The third part follows from the triangle inequality.

Iterative Graph Neural Tangent Kernel regression

Iterative Graph Neural Tangent Kernel regression provides a powerful theoretical tool to compare the GNN training process with GNTK regression. We consider solving the following optimization problem using gradient descent with \beta(0) initialized as 0:

\displaystyle\min_{\beta}\frac{1}{2}\|Y-\kappa f_{\mathrm{gntk}}(\beta,\mathsf{G})\|_{2}^{2},

where f_{\mathrm{gntk}}(\beta,\mathsf{G}) is defined in Definition 3.4. At iteration t, we define the weight as \beta(t), the training data predictor as u_{\mathrm{gntk}}(t)=\kappa\Phi(\mathsf{G})\beta(t)\in\mathbb{R}^{n}, and the test data predictor as u_{\mathrm{gntk},\mathrm{test}}(t)=\kappa\Phi(G_{\mathrm{test}})^{\top}\beta(t)\in\mathbb{R}.
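Because gradient descent on \beta only enters the analysis through the induced dynamics on the predictors, the iterative GNTK regression can be simulated in the kernel (dual) form without materializing \Phi. A minimal sketch, assuming the kernel matrix H^{\mathrm{cts}} and the test kernel vector are available and the step size is small enough for convergence:

import numpy as np

def iterative_gntk_regression(H_cts, Y, k_test, kappa=0.1, lr=0.5, T=10000):
    # Gradient descent on beta with beta(0) = 0 induces, on the predictors,
    #   u(t+1)      = u(t)      + lr * kappa^2 * H_cts  @ (Y - u(t))
    #   u_test(t+1) = u_test(t) + lr * kappa^2 * k_test @ (Y - u(t)),
    # where u(t) = kappa * Phi(G) beta(t) and u_test(t) = kappa * Phi(G_test)^T beta(t).
    u, u_test = np.zeros_like(Y, dtype=float), 0.0
    for _ in range(T):
        residual = Y - u
        u = u + lr * kappa**2 * (H_cts @ residual)
        u_test = u_test + lr * kappa**2 * (k_test @ residual)
    return u, u_test   # u_test approaches k_test^T (H_cts)^{-1} Y as T grows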

Bridge GNTK regression and iterative GNTK regression

We now state the relationship between GNTK regression and iterative GNTK regression. As the iteration count goes to infinity, we obtain the final training data predictor u^{*}=\lim_{t\to\infty}u_{\mathrm{gntk}}(t) and the final test data predictor u_{\mathrm{test}}^{*}=\lim_{t\to\infty}u_{\mathrm{gntk},\mathrm{test}}(t). Due to strong convexity [52], the gradient flow converges to the optimal solution of the objective \|Y-\kappa f_{\mathrm{gntk}}(\beta,\mathsf{G})\|_{2}^{2}; therefore, the optimal parameters obtained from GNTK regression are

\displaystyle\beta^{*}=\lim_{t\to\infty}\beta(t)=\kappa^{-1}(\Phi(\mathsf{G})^{\top}\Phi(\mathsf{G}))^{-1}\Phi(\mathsf{G})^{\top}Y,

which indicates that the optimal parameters \beta^{*} are exactly \lim_{t\to\infty}\beta(t), the parameters of iterative GNTK regression when training finishes.

Graph Dynamic Kernel

Before we continue, we introduce the Graph Dynamic Kernel for the GNN training process. The Graph Dynamic Kernel helps us analyze the dynamics of the GNN training process. For any data G,H\in\mathbb{R}^{d\times N}, we define \mathsf{K}_{t}(G,H)\in\mathbb{R} as

\displaystyle\mathsf{K}_{t}(G,H)=\left\langle\frac{\partial f_{\mathrm{gnn}}(W(t),G)}{\partial W(t)},\frac{\partial f_{\mathrm{gnn}}(W(t),H)}{\partial W(t)}\right\rangle,

where W(t)\in\mathbb{R}^{d\times m} denotes the parameters of the neural network at iteration t, as defined in Definition D.2. Similar to the tangent kernel, we define \mathsf{H}(t)\in\mathbb{R}^{n\times n} as

\displaystyle[\mathsf{H}(t)]_{i,j}=\mathsf{K}_{t}(G_{i},G_{j})\in\mathbb{R}.

As in Figure 1, this concept has some common points with the GNTK defined in Definition 3.2: both of them load parameters W, and both of them compute the inner product of a pair of gradients. However, the Graph Dynamic Kernel differs from the GNTK in a few ways. First, the GNTK is an expectation taken over a normal distribution, while the Graph Dynamic Kernel is not an expectation. Second, the GNTK is a constant during the training process, while the Graph Dynamic Kernel changes with the training iteration t. Third, the Graph Dynamic Kernel loads parameters from the GNN training process, while the GNTK loads parameters sampled from a normal distribution.

The Graph Dynamic Kernel is quite different from the Dynamic Kernel defined in the NTK literature [29, 75, 52, 78]. Our key result is that the Graph Dynamic Kernel is almost invariant during the training process, meaning that it becomes a constant as the width of the neural network goes to infinity. Although it is already known that the Dynamic Kernel has similar properties [3, 52], no previous work shows whether the almost-invariant property still holds for the Graph Dynamic Kernel. In fact, GNNs have Combine and Aggregate layers, while standard neural networks do not, and whether these layers affect the almost-invariant property is not trivial. Indeed, the perturbation of the Graph Dynamic Kernel during the training process depends polynomially on N, the number of nodes in a graph. As a result, for a graph neural network, the width should be chosen larger to ensure that it behaves like a GNTK.

Linear convergence

We first show that both the GNN and iterative GNTK regression converge linearly. Linear convergence can be proved by calculating the gradient flow of the GNTK regression predictor u_{\mathrm{gntk},\mathrm{test}}(t) and the GNN training predictor u_{\mathrm{gnn},\mathrm{test}}(t) with respect to the training iteration t. Based on the gradient, we obtain the desired result:

\displaystyle\frac{\mathrm{d}\|u(t)-u^{*}\|_{2}^{2}}{\mathrm{d}t}\leq-2(\kappa^{2}\Lambda_{0})\|u(t)-u^{*}\|_{2}^{2},

where u(t) can be either u_{\mathrm{gnn}}(t) or u_{\mathrm{gntk}}(t).

Then, we show that the test predictor difference between GNTK regression and iterative GNTK regression is sufficiently small when T is sufficiently large. The linear convergence indicates that both u_{\mathrm{gnn}}(t) and u_{\mathrm{gntk}}(t) converge quickly. In fact, linear convergence implies that the squared distance between u(t) and u^{*} converges to 0 exponentially fast. Based on this convergence result, we immediately get

\displaystyle|u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}|<\varepsilon/2

by picking sufficiently large T.
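For concreteness, integrating the differential inequality above (a standard Gronwall-type step, stated here as a sketch) gives

\displaystyle\|u(t)-u^{*}\|_{2}^{2}\leq\exp(-2\kappa^{2}\Lambda_{0}t)\cdot\|u(0)-u^{*}\|_{2}^{2},

so the residual decays geometrically in t, which is the quantitative form of linear convergence used to pick T above.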

Bridge iterative GNTK regression and GNN training process

In this step, we bound |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|. The high-level approach is to use the Newton-Leibniz formula to separate the problem into two parts: first, we bound the initialization perturbation |u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|. Then, we bound

\displaystyle|\mathrm{d}(u_{\mathrm{gnn},\mathrm{test}}(t)-u_{\mathrm{gntk},\mathrm{test}}(t))/\mathrm{d}t|,

the change rate of the perturbation.

For the first part, note that u_{\mathrm{gntk},\mathrm{test}}(0)=0, so we bound |u_{\mathrm{gnn},\mathrm{test}}(0)| by the concentration property of the Gaussian initialization. For the second part, we can prove that the gradient flow of the GNN predictor during GNN training is associated with \mathsf{H}(t), while the gradient flow of the iterative GNTK regression predictor is associated with the GNTK \mathsf{H}^{\mathrm{cts}}. Therefore, bounding \|\mathsf{H}(t)-\mathsf{H}^{\mathrm{cts}}\|_{2} provides a bound on |\mathrm{d}(u_{\mathrm{gnn},\mathrm{test}}(t)-u_{\mathrm{gntk},\mathrm{test}}(t))/\mathrm{d}t|. Note that \mathsf{H}^{\mathrm{cts}} is exactly the expectation of \mathsf{H}(0), where the expectation is taken over the random initialization of the GNN. We bridge \mathsf{H}(t) and \mathsf{H}^{\mathrm{cts}} through the initial value of the Graph Dynamic Kernel \mathsf{H}(0). To bound \|\mathsf{H}(0)-\mathsf{H}^{\mathrm{cts}}\|_{2}, we use the fact that \operatorname*{{\mathbb{E}}}_{W(0)}[\mathsf{H}(0)]=\mathsf{H}^{\mathrm{cts}} together with the Hoeffding inequality. To bound \|\mathsf{H}(0)-\mathsf{H}(t)\|_{2}, we use the almost-invariant property of the Graph Dynamic Kernel.
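Written out, the decomposition used in this step is (a sketch of the identity we integrate):

\displaystyle u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)=\big(u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)\big)+\int_{0}^{T}\frac{\mathrm{d}}{\mathrm{d}t}\big(u_{\mathrm{gnn},\mathrm{test}}(t)-u_{\mathrm{gntk},\mathrm{test}}(t)\big)\,\mathrm{d}t,

so it suffices to bound the initialization gap and the integrand separately, as described above.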

For the node-level equivalence, given any test node v_{\mathrm{test}}, the goal is to bound the distance between the node-level graph kernel predictor u_{\mathrm{node},\mathrm{test}}^{*} and the node-level GNN predictor u_{\mathrm{gnn},\mathrm{node},\mathrm{test}}(T). To prove the node-level equivalence, we use a toolbox similar to that of the graph-level equivalence. We introduce node-level iterative GNTK regression, which is optimized by gradient descent and connects the node-level GNTK regression and the node-level GNN training process. By leveraging the linear convergence of the gradient flow and the almost-invariant property of the node-level Graph Dynamic Kernel, we obtain the node-level equivalence result.

Spectral gap of shifted GNTK

To prove a lower bound on the smallest eigenvalue of the shifted GNTK, we first calculate the explicit formulation of the shifted GNTK, which turns out to be a summation of Hadamard products between the covariance of the input graph dataset and the covariance of an indicator mapping. The indicator mapping takes value 1 when the ReLU is activated by the input. We bound the eigenvalue of the Hadamard product by analyzing the two covariance matrices separately. For the covariance of the input graph dataset, we need to lower bound the minimum magnitude of its diagonal entries, which can in turn be bounded by the norm of the input data. For the covariance of the indicator mapping, we need to bound its smallest eigenvalue, which reduces to lower bounding the expectation of the inner product between the indicator mapping and any real vector. We bound this expectation by carefully decomposing the vector into its maximum coordinate and its other coordinates. Providing a good bound on the maximum coordinate turns out to be an anti-concentration problem for a Gaussian random matrix.
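The Hadamard-product step uses an eigenvalue bound of the following form for positive semidefinite matrices A and B (a consequence of the Schur product theorem, stated here as a sketch of the argument; in our setting A plays the role of the covariance of the input graph dataset and B that of the covariance of the indicator mapping):

\displaystyle\lambda_{\min}(A\circ B)\geq\Big(\min_{i}A_{i,i}\Big)\cdot\lambda_{\min}(B).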

We defer further discussion of our techniques to Section B due to space limitations.

6 Conclusion

Neural Tangent Kernel [44] is a useful tool in theoretical deep learning that helps us understand how and why deep learning works. A natural extension of NTK to graph learning is the Graph Neural Tangent Kernel (GNTK). [23] has shown that GNTK can achieve similar accuracy as training graph neural networks on several bioinformatics datasets for graph-level regression. We provide the first GNTK formulation for node-level regression. We present the first formal proof that training a graph neural network is equivalent to solving GNTK regression for both graph-level and node-level regression. Since our work is purely theoretical, we do not foresee any negative societal impacts.

References

  • ACW [17] Haim Avron, Kenneth L Clarkson, and David P Woodruff. Faster kernel ridge regression using sketching and preconditioning. SIAM Journal on Matrix Analysis and Applications, 38(4):1116–1138, 2017.
  • [2] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019.
  • [3] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In NeurIPS. arXiv preprint arXiv:1904.11955, 2019.
  • AKK+ [20] Thomas D Ahle, Michael Kapralov, Jakob BT Knudsen, Rasmus Pagh, Ameya Velingker, David P Woodruff, and Amir Zandieh. Oblivious sketching of high-degree polynomial kernels. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 141–160. SIAM, 2020.
  • AKL [17] Jacob Andreas, Dan Klein, and Sergey Levine. Modular multitask reinforcement learning with policy sketches. In International Conference on Machine Learning, pages 166–175. PMLR, 2017.
  • ALS+ [18] Alexandr Andoni, Chengyu Lin, Ying Sheng, Peilin Zhong, and Ruiqi Zhong. Subspace embedding and linear regression with orlicz norm. In ICML, 2018.
  • ALS+ [22] Josh Alman, Jiehao Liang, Zhao Song, Ruizhe Zhang, and Danyang Zhuo. Bypass exponential time preprocessing: Fast neural network training via weight-data correlation preprocessing. arXiv preprint arXiv:2211.14227, 2022.
  • AS [23] Josh Alman and Zhao Song. Fast attention requires bounded entries. arXiv preprint arXiv:2302.13214, 2023.
  • [9] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML, 2019.
  • [10] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In NeurIPS, 2019.
  • Ber [24] Sergei Bernstein. On a modification of chebyshev’s inequality and of the error formula of laplace. Ann. Sci. Inst. Sav. Ukraine, Sect. Math, 1(4):38–49, 1924.
  • BG [17] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In International conference on machine learning, pages 605–614. PMLR, 2017.
  • BJW [19] Ainesh Bakshi, Rajesh Jayaram, and David P Woodruff. Learning two layer rectified neural networks in polynomial time. In Conference on Learning Theory, pages 195–268. PMLR, 2019.
  • BPSW [21] Jan van den Brand, Binghui Peng, Zhao Song, and Omri Weinstein. Training (overparametrized) neural networks in near-linear time. In ITCS, 2021.
  • BWZ [16] Christos Boutsidis, David P Woodruff, and Peilin Zhong. Optimal principal component analysis in distributed and streaming models. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing (STOC), pages 236–249, 2016.
  • CAEHP+ [20] Ines Chami, Sami Abu-El-Haija, Bryan Perozzi, Christopher Ré, and Kevin Murphy. Machine learning on graphs: A model and comprehensive taxonomy. arXiv preprint arXiv:2005.03675, 2020.
  • CB [19] Lenaic Chizat and Francis Bach. A note on lazy training in supervised differentiable programming. In 32nd Conf. Neural Information Processing Systems (NeurIPS 2018), 2019.
  • CCLY [19] Michael B Cohen, Ben Cousins, Yin Tat Lee, and Xin Yang. A near-optimal algorithm for approximating the john ellipsoid. In Conference on Learning Theory, pages 849–873. PMLR, 2019.
  • CW [13] Kenneth L. Clarkson and David P. Woodruff. Low rank approximation and regression in input sparsity time. In Symposium on Theory of Computing Conference (STOC), pages 81–90. https://arxiv.org/pdf/1207.6365, 2013.
  • CY [21] Yifan Chen and Yun Yang. Accumulations of projections—a unified framework for random sketches in kernel ridge regression. In International Conference on Artificial Intelligence and Statistics, pages 2953–2961. PMLR, 2021.
  • Dan [17] Amit Daniely. Sgd learns the conjugate kernel class of the network. arXiv preprint arXiv:1702.08503, 2017.
  • DFS [16] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. arXiv preprint arXiv:1602.05897, 2016.
  • DHS+ [19] Simon S Du, Kangcheng Hou, Russ R Salakhutdinov, Barnabas Poczos, Ruosong Wang, and Keyulu Xu. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. In Advances in Neural Information Processing Systems (NeurIPS), pages 5723–5733, 2019.
  • DJS+ [19] Huaian Diao, Rajesh Jayaram, Zhao Song, Wen Sun, and David Woodruff. Optimal sketching for kronecker product regression and low rank approximation. Advances in neural information processing systems, 32, 2019.
  • DLL+ [19] Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pages 1675–1685. PMLR, 2019.
  • [26] Yichuan Deng, Zhihang Li, and Zhao Song. Attention scheme inspired softmax regression. arXiv preprint arXiv:2304.10411, 2023.
  • [27] Yichuan Deng, Zhihang Li, and Zhao Song. An improved sample complexity for rank-1 matrix sensing. arXiv preprint arXiv:2303.06895, 2023.
  • DLT+ [18] Simon Du, Jason Lee, Yuandong Tian, Aarti Singh, and Barnabas Poczos. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. In International Conference on Machine Learning, pages 1339–1348. PMLR, 2018.
  • DZPS [19] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In ICLR. arXiv preprint arXiv:1810.02054, 2019.
  • EMZ [21] Hossein Esfandiari, Vahab Mirrokni, and Peilin Zhong. Almost linear time density level set estimation via dbscan. In AAAI, 2021.
  • Fou [17] Alex M Fout. Protein interface prediction using graph convolutional networks. PhD thesis, Colorado State University, 2017.
  • GLM [17] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
  • GQSW [22] Yeqi Gao, Lianke Qin, Zhao Song, and Yitan Wang. A sublinear adversarial training algorithm. arXiv preprint arXiv:2208.05395, 2022.
  • GS [22] Yuzhou Gu and Zhao Song. A faster small treewidth sdp solver. arXiv preprint arXiv:2211.06033, 2022.
  • GSY [23] Yeqi Gao, Zhao Song, and Junze Yin. An iterative algorithm for rescaled hyperbolic functions regression. arXiv preprint arXiv:2305.00660, 2023.
  • GSYZ [23] Yuzhou Gu, Zhao Song, Junze Yin, and Lichen Zhang. Low rank matrix completion via robust alternating minimization in nearly linear time. arXiv preprint arXiv:2302.11068, 2023.
  • GSZ [23] Yuzhou Gu, Zhao Song, and Lichen Zhang. A nearly-linear time algorithm for structured support vector machines. arXiv preprint arXiv:2307.07735, 2023.
  • HLW [17] Jarvis Haupt, Xingguo Li, and David P Woodruff. Near optimal sketching of low-rank tensor regression. In ICML, 2017.
  • HMR [18] Filip Hanzely, Konstantin Mishchenko, and Peter Richtárik. Sega: Variance reduction via gradient sketching. Advances in Neural Information Processing Systems, 31, 2018.
  • Hoe [63] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • HOSM [17] Takuo Hamaguchi, Hidekazu Oiwa, Masashi Shimbo, and Yuji Matsumoto. Knowledge transfer for out-of-knowledge-base entities: A graph neural network approach. arXiv preprint arXiv:1706.05674, 2017.
  • HSWZ [22] Hang Hu, Zhao Song, Omri Weinstein, and Danyang Zhuo. Training overparametrized neural networks in sublinear time. arXiv preprint arXiv:2208.04508, 2022.
  • HZRS [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • JGH [18] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems (NeurIPS), pages 8571–8580, 2018.
  • JLL+ [20] Shuli Jiang, Dongyu Li, Irene Mengze Li, Arvind V Mahankali, and David Woodruff. An efficient protocol for distributed column subset selection in the entrywise \ell_{p} norm. 2020.
  • JLL+ [21] Shuli Jiang, Dennis Li, Irene Mengze Li, Arvind V Mahankali, and David Woodruff. Streaming and distributed algorithms for robust column subset selection. In International Conference on Machine Learning, pages 4971–4981. PMLR, 2021.
  • JLSW [20] Haotian Jiang, Yin Tat Lee, Zhao Song, and Sam Chiu-wai Wong. An improved cutting plane method for convex optimization, convex-concave games and its applications. In STOC, 2020.
  • JPWZ [21] Shuli Jiang, Hai Pham, David Woodruff, and Richard Zhang. Optimal sketching for trace estimation. Advances in Neural Information Processing Systems, 34, 2021.
  • JSWZ [21] Shunhua Jiang, Zhao Song, Omri Weinstein, and Hengjie Zhang. Faster dynamic matrix inverse for faster lps. In STOC. arXiv preprint arXiv:2004.07470, 2021.
  • LL [18] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In NeurIPS, 2018.
  • LS [01] Wenbo V Li and Q-M Shao. Gaussian processes: inequalities, small ball probabilities and applications. Handbook of Statistics, 19:533–597, 2001.
  • LSS+ [20] Jason D Lee, Ruoqi Shen, Zhao Song, Mengdi Wang, and Zheng Yu. Generalized leverage score sampling for neural networks. In NeurIPS, 2020.
  • LSZ [19] Yin Tat Lee, Zhao Song, and Qiuyi Zhang. Solving empirical risk minimization in the current matrix multiplication time. In COLT. https://arxiv.org/pdf/1905.04447.pdf, 2019.
  • LSZ [23] Zhihang Li, Zhao Song, and Tianyi Zhou. Solving regularized exp, cosh and sinh regression problems. arXiv preprint arXiv:2303.15725, 2023.
  • LY [17] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. arXiv preprint arXiv:1705.09886, 2017.
  • MM [13] Xiangrui Meng and Michael W Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing (STOC), pages 91–100, 2013.
  • NN [13] Jelani Nelson and Huy L Nguyên. Osnap: Faster numerical linear algebra algorithms via sparser subspace embeddings. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pages 117–126. IEEE, https://arxiv.org/pdf/1211.1002, 2013.
  • OS [20] Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. In IEEE Journal on Selected Areas in Information Theory. IEEE, 2020.
  • QJS+ [22] Lianke Qin, Rajesh Jayaram, Elaine Shi, Zhao Song, Danyang Zhuo, and Shumo Chu. Adore: Differentially oblivious relational database operators. In VLDB, 2022.
  • QRS+ [22] Lianke Qin, Aravind Reddy, Zhao Song, Zhaozhuo Xu, and Danyang Zhuo. Adaptive and dynamic multi-resolution hashing for pairwise summations. In BigData, 2022.
  • QSZ [23] Lianke Qin, Zhao Song, and Ruizhe Zhang. A general algorithm for solving rank-one matrix sensing. arXiv preprint arXiv:2303.12298, 2023.
  • QSZZ [23] Lianke Qin, Zhao Song, Lichen Zhang, and Danyang Zhuo. An online and unified algorithm for projection matrix vector multiplication with application to empirical risk minimization. In International Conference on Artificial Intelligence and Statistics, pages 101–156. PMLR, 2023.
  • QWF [22] Can Qin, Yizhou Wang, and Yun Fu. Robust semi-supervised domain adaptation against noisy labels. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management, pages 4409–4413, 2022.
  • RPU+ [20] Daniel Rothchild, Ashwinee Panda, Enayat Ullah, Nikita Ivkin, Ion Stoica, Vladimir Braverman, Joseph Gonzalez, and Raman Arora. Fetchsgd: Communication-efficient federated learning with sketching. In International Conference on Machine Learning, pages 8253–8265. PMLR, 2020.
  • RSW [16] Ilya Razenshteyn, Zhao Song, and David P Woodruff. Weighted low rank approximations with provable guarantees. In Proceedings of the 48th Annual Symposium on the Theory of Computing (STOC), 2016.
  • Sch [11] Jssai Schur. Bemerkungen zur theorie der beschränkten bilinearformen mit unendlich vielen veränderlichen. 1911.
  • Sol [17] Mahdi Soltanolkotabi. Learning relus via gradient descent. arXiv preprint arXiv:1705.04591, 2017.
  • SSX [23] Anshumali Shrivastava, Zhao Song, and Zhaozhuo Xu. A tale of two efficient value iteration algorithms for solving linear mdps with large action space. In AISTATS, 2023.
  • SSZ [23] Ritwik Sinha, Zhao Song, and Tianyi Zhou. A mathematical abstraction for balancing the trade-off between creativity and reality in large language models. arXiv preprint arXiv:2306.02295, 2023.
  • SWYZ [21] Zhao Song, David Woodruff, Zheng Yu, and Lichen Zhang. Fast sketching of polynomial kernels of polynomial degree. In International Conference on Machine Learning, pages 9812–9823. PMLR, 2021.
  • SWZ [17] Zhao Song, David P Woodruff, and Peilin Zhong. Low rank approximation with entrywise \ell_{1}-norm error. In Proceedings of the 49th Annual Symposium on the Theory of Computing (STOC). ACM, https://arxiv.org/pdf/1611.00898, 2017.
  • [72] Zhao Song, David Woodruff, and Peilin Zhong. Average case column subset selection for entrywise \ell_{1}-norm loss. Advances in Neural Information Processing Systems (NeurIPS), 32:10111–10121, 2019.
  • [73] Zhao Song, David Woodruff, and Peilin Zhong. Towards a zero-one law for column subset selection. Advances in Neural Information Processing Systems, 32:6123–6134, 2019.
  • [74] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank approximation. In SODA. arXiv preprint arXiv:1704.08246, 2019.
  • SY [19] Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix chernoff bound. In arXiv preprint. https://arxiv.org/pdf/1906.03593.pdf, 2019.
  • SY [21] Zhao Song and Zheng Yu. Oblivious sketching-based central path method for solving linear programming problems. In 38th International Conference on Machine Learning (ICML), 2021.
  • SYYZ [22] Zhao Song, Xin Yang, Yuanyuan Yang, and Tianyi Zhou. Faster algorithm for structured john ellipsoid computation. arXiv preprint arXiv:2211.14407, 2022.
  • SYZ [21] Zhao Song, Shuo Yang, and Ruizhe Zhang. Does preprocessing help training over-parameterized neural networks? Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.
  • SZZ [21] Zhao Song, Lichen Zhang, and Ruizhe Zhang. Training multi-layer over-parametrized neural network in subquadratic time. arXiv preprint arXiv:2112.07628, 2021.
  • Tia [17] Yuandong Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In International Conference on Machine Learning, pages 3404–3413. PMLR, 2017.
  • WGLF [23] Yizhou Wang, Dongliang Guo, Sheng Li, and Yun Fu. Towards explainable visual anomaly detection. arXiv preprint arXiv:2302.06670, 2023.
  • WLX+ [20] Yongji Wu, Defu Lian, Yiheng Xu, Le Wu, and Enhong Chen. Graph convolutional networks with markov random field reasoning for social spammer detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 1054–1061, 2020.
  • WPC+ [20] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems, 2020.
  • WQB+ [22] Yizhou Wang, Can Qin, Yue Bai, Yi Xu, Xu Ma, and Yun Fu. Making reconstruction-based method great again for video anomaly detection. In 2022 IEEE International Conference on Data Mining (ICDM), pages 1215–1220. IEEE, 2022.
  • WWMK [20] Xinshuo Weng, Yongxin Wang, Yunze Man, and Kris M Kitani. Gnn3dmot: Graph neural network for 3d multi-object tracking with 2d-3d multi-feature learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6499–6508, 2020.
  • WZ [16] David P Woodruff and Peilin Zhong. Distributed low rank approximation of implicit functions of a matrix. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pages 847–858. IEEE, 2016.
  • WZD+ [20] Ruosong Wang, Peilin Zhong, Simon S Du, Russ R Salakhutdinov, and Lin F Yang. Planning with general objective functions: Going beyond total rewards. In Annual Conference on Neural Information Processing Systems (NeurIPS), 2020.
  • XSS [21] Zhaozhuo Xu, Zhao Song, and Anshumali Shrivastava. Breaking the linear iteration cost barrier for some well-known conditional gradient methods using maxip data-structures. Advances in Neural Information Processing Systems (NeurIPS), 34, 2021.
  • XWW+ [21] Yi Xu, Lichen Wang, Yizhou Wang, Can Qin, Yulun Zhang, and Yun Fu. Memrein: Rein the domain shift for cross-domain few-shot learning. 2021.
  • XWWF [22] Yi Xu, Lichen Wang, Yizhou Wang, and Yun Fu. Adaptive trajectory prediction via transferable gnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6520–6531, 2022.
  • XZZ [18] Chang Xiao, Peilin Zhong, and Changxi Zheng. Bourgan: generative networks with metric embeddings. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS), pages 2275–2286, 2018.
  • YHC+ [18] Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 974–983, 2018.
  • YSLJ [20] Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11040–11048, 2020.
  • YWH+ [20] Xiang Yue, Zhen Wang, Jingong Huang, Srinivasan Parthasarathy, Soheil Moosavinasab, Yungui Huang, Simon M Lin, Wen Zhang, Ping Zhang, and Huan Sun. Graph embedding on biomedical networks: methods, applications and evaluations. Bioinformatics, 36(4):1241–1251, 2020.
  • YZW+ [20] Carl Yang, Jieyu Zhang, Haonan Wang, Sha Li, Myungwan Kim, Matt Walker, Yiou Xiao, and Jiawei Han. Relation learning on social networks with multi-modal graph edge variational autoencoders. In Proceedings of the 13th International Conference on Web Search and Data Mining, pages 699–707, 2020.
  • ZCH+ [20] Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. Graph neural networks: A review of methods and applications. AI Open, 1:57–81, 2020.
  • ZCZ [20] Ziwei Zhang, Peng Cui, and Wenwu Zhu. Deep learning on graphs: A survey. IEEE Transactions on Knowledge and Data Engineering, 2020.
  • ZHA+ [21] Amir Zandieh, Insu Han, Haim Avron, Neta Shoham, Chaewon Kim, and Jinwoo Shin. Scaling neural tangent kernels via sketching and random features. Advances in Neural Information Processing Systems, 34, 2021.
  • Zha [22] Lichen Zhang. Speeding up optimizations via data structures: Faster search, sample and maintenance. Master’s thesis, Carnegie Mellon University, 2022.
  • ZL [17] Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multi-layer tissue networks. Bioinformatics, 33(14):i190–i198, 2017.
  • ZSD [17] Kai Zhong, Zhao Song, and Inderjit S Dhillon. Learning non-overlapping convolutional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017.
  • ZSJ+ [17] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International conference on machine learning, pages 4140–4149. PMLR, 2017.

Appendix

Roadmap.

In Section A, we introduce several notations and tools that are useful in our paper. In Section B, we provide more discussion of the technique overview. In Section C, we give the formula for calculating the node-level GNTK. In Section D, we prove the equivalence of GNN and GNTK for both graph-level and node-level regression. In Section E, we provide a lower bound and an upper bound on the eigenvalue of the GNTK. In Section F, we derive the formulation of the node-level GNTK.

Appendix A Notations and Probability Tools

In this section, we introduce some notations and several probability tools that we use in the proofs.

Notations.

Let G=(U,E)G=(U,E) be a graph with node set UU and edge set EE. A graph contains |U|=N|U|=N nodes. The dimension of the feature vector of each input node is dd. 𝒩\mathcal{N} is the neighboring matrix that represents the topology of GG. Each node uUu\in U has a feature vector h(G)udh(G)_{u}\in\mathbb{R}^{d}. 𝒩(u)\mathcal{N}(u) denotes the set of neighboring nodes of uu. A dataset contains nn graphs. We denote the training graphs by 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} with Gid×Ni[n]G_{i}\in\mathbb{R}^{d\times N}\leavevmode\nobreak\ i\in[n], and the corresponding label vector by YnY\in\mathbb{R}^{n}. For graph-level regression, our goal is to learn to predict the label of the graph GtestG_{\mathrm{test}}. For node-level regression, our goal is to learn to predict the label of the node vtestv_{\mathrm{test}}. For presentation simplicity, we fix an ordering of all the nodes in a graph and use uiu_{i} to denote the iith node. We use σ()\sigma(\cdot) to denote the non-linear activation function. In this paper, σ()\sigma(\cdot) usually denotes the rectified linear unit (ReLU) function. We use σ˙()\dot{\sigma}(\cdot) to denote the derivative of σ()\sigma(\cdot). We use [k][k] to denote the set {1,2,,k}\{1,2,\cdots,k\}.

We state Chernoff, Hoeffding and Bernstein inequalities.

Lemma A.1 (Hoeffding bound [40]).

Let X1,,XnX_{1},\cdots,X_{n} denote nn independent bounded variables in [ai,bi][a_{i},b_{i}]. Let X=i=1nXiX=\sum_{i=1}^{n}X_{i}, then we have

Pr[|X𝔼[X]|t]2exp(2t2i=1n(biai)2).\displaystyle\Pr[|X-\operatorname*{{\mathbb{E}}}[X]|\geq t]\leq 2\exp\left(-\frac{2t^{2}}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}\right).
Lemma A.2 (Bernstein inequality [11]).

Let X1,,XnX_{1},\cdots,X_{n} be independent zero-mean random variables. Suppose that |Xi|M|X_{i}|\leq M almost surely, for all ii. Then, for all positive tt,

Pr[i=1nXi>t]exp(t2/2j=1n𝔼[Xj2]+Mt/3).\displaystyle\Pr\left[\sum_{i=1}^{n}X_{i}>t\right]\leq\exp\left(-\frac{t^{2}/2}{\sum_{j=1}^{n}\operatorname*{{\mathbb{E}}}[X_{j}^{2}]+Mt/3}\right).

We state three inequalities for Gaussian random variables.

Lemma A.3 (Gaussian tail bounds).

Let X𝒩(μ,σ2)X\sim\mathcal{N}(\mu,\sigma^{2}) be a Gaussian random variable with mean μ\mu and variance σ2\sigma^{2}. Then for all t0t\geq 0, we have

\displaystyle\Pr[|X-\mu|\geq t]\leq 2e^{-\frac{t^{2}}{2\sigma^{2}}}.
Claim A.4 (Theorem 3.1 in [51]).

Let b>0b>0 and r>0r>0. Then,

exp(b2/2)Prx𝒩(0,1)[|x|r]Prx𝒩(0,1)[|xb|r]Prx𝒩(0,1)[|x|r].\displaystyle\exp(-b^{2}/2)\Pr_{x\sim\mathcal{N}(0,1)}[|x|\leq r]\leq\leavevmode\nobreak\ \Pr_{x\sim\mathcal{N}(0,1)}[|x-b|\leq r]\leq\leavevmode\nobreak\ \Pr_{x\sim\mathcal{N}(0,1)}[|x|\leq r].
Lemma A.5 (Anti-concentration of Gaussian distribution).

Let Z𝒩(0,σ2)Z\sim{\mathcal{N}}(0,\sigma^{2}). Then, for t>0t>0,

Pr[|Z|t]2t2πσ.\displaystyle\Pr[|Z|\leq t]\leq\frac{2t}{\sqrt{2\pi}\sigma}.

Appendix B More Discussion of Technique Overview

Node level techniques

The formulation of the node-level GNTK in Section C is derived from the Gaussian-process nature of infinitely wide neural networks. In the infinite-width limit, the pre-activations W(r)(l)f(r1)(l)(u)W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u) at every hidden layer rr and every level ll have all their coordinates tending to i.i.d. centered Gaussian processes. Note that, for the activation layer, if the covariance matrix and the distribution of the pre-activations are explicitly known, we can compute the covariance matrix of the activated values exactly. As a result, we can recursively compute the covariance of the centered Gaussian processes 𝚺(r)(l)(u,u)\bm{\Sigma}^{(l)}_{(r)}(u,u^{\prime}). By computing the partial derivative of the GNN output f(R)(L)f^{(L)}_{(R)} with respect to the weights W(r)(l)W^{(l)}_{(r)}, we obtain an explicit formula for computing the node-level GNTK.

Appendix C Calculating Node-Level GNTK

In this section, we give the formula for calculating the node-level GNTK for a multi-level, multi-layer graph neural network.

Let f(r)(l)f^{(l)}_{(r)} denote the output of the Graph Neural Network in the ll-th level and rr-th layer. Suppose that there are LL levels and, in each level, there are RR layers. At the beginning of each level, we aggregate features over the neighbors 𝒩(u)\mathcal{N}(u) of each node uu. More specifically, the definition of the Aggregate layer is as follows: f(1)(1)(u)=v𝒩(u)h(G)v,f^{(1)}_{(1)}(u)=\sum_{v\in\mathcal{N}(u)}h(G)_{v}, and f(1)(l)(u)=v𝒩(u)f(R)(l1)(v),l{2,3,,L}.f^{(l)}_{(1)}(u)=\sum_{v\in\mathcal{N}(u)}f^{(l-1)}_{(R)}(v),\leavevmode\nobreak\ l\in\{2,3,\cdots,L\}. In each level, we then apply RR fully-connected layers to the output of the Aggregate layer, which are also called Combine layers. The Combine layers can be formulated as: f(r)(l)(u)=cσmσ(W(r)(l)f(r1)(l)(u)),l[L],r[R]f^{(l)}_{(r)}(u)=\sqrt{\frac{c_{\sigma}}{m}}\sigma(W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u)),\leavevmode\nobreak\ l\in[L],r\in[R] where cσc_{\sigma} is a scaling factor, W(r)(l)W^{(l)}_{(r)} is the weight of the neural network in the ll-th level and rr-th layer, and mm is the width of the layer. In this paper, we set cσc_{\sigma} as in [43], i.e., cσ=2c_{\sigma}=2. Let f(R)(L)f^{(L)}_{(R)} be the output of the Graph Neural Network. A code sketch of this forward pass is given below.
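To make the Aggregate/Combine recursion concrete, the following is a minimal NumPy sketch of the finite-width forward pass; the function name gnn_forward, the dense adjacency matrix A, and the toy path graph are our own illustrative choices (we read each level as one Aggregate step followed by its Combine layers, with Gaussian Combine weights and ReLU activation).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gnn_forward(A, X, weights, c_sigma=2.0):
    """Finite-width forward pass of the multi-level, multi-layer GNN above.

    A       : (N, N) adjacency matrix, A[u, v] = 1 iff v is a neighbor of u.
    X       : (N, d) node features, row u is h(G)_u.
    weights : weights[l][r] is the Combine weight W^{(l)}_{(r)} of level l, layer r.
    Returns the node representations f^{(L)}_{(R)}, one row per node.
    """
    H = X
    for level in weights:
        H = A @ H                                    # Aggregate: sum over neighbors N(u)
        for W in level:                              # Combine: R fully-connected ReLU layers
            m = W.shape[0]
            H = np.sqrt(c_sigma / m) * relu(H @ W.T)
    return H

# Toy usage: a 4-node path graph, d = 3, L = 2 levels, R = 2 Combine layers, width m = 16.
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))
dims = [3, 16, 16, 16, 16]
weights = [[rng.normal(size=(dims[2 * l + r + 1], dims[2 * l + r])) for r in range(2)]
           for l in range(2)]
print(gnn_forward(A, X, weights).shape)  # (4, 16)
```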

We show how to compute Graph Neural Tangent Kernel 𝖪gntk,node(u,u)=𝖪(R)(L)(u,u)\mathsf{K}_{\mathrm{gntk},\mathrm{node}}(u,u^{\prime})=\mathsf{K}_{(R)}^{(L)}(u,u^{\prime}) as follows. Let 𝚺(r)(l)(u,u)\bm{\Sigma}^{(l)}_{(r)}(u,u^{\prime}) be defined as the covariance matrix of the Gaussian process of the pre-activations W(r)(l)f(r1)(l)(u)W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u). We provide the formula of the GNTK in a recursive way. First, we give the recursive formula for Aggregate layer. We initialize with 𝚺(1)(1)(u,u)=h(G)uh(G)u.\bm{\Sigma}^{(1)}_{(1)}(u,u^{\prime})=h(G)_{u}^{\top}h(G)_{u^{\prime}}. For l{2,3,,L}l\in\{2,3,\cdots,L\}, 𝚺(1)(l)(u,u)=v𝒩(u)v𝒩(u)𝚺(R)(l1)(v,v).\bm{\Sigma}^{(l)}_{(1)}(u,u^{\prime})=\sum_{v\in\mathcal{N}(u)}\sum_{v^{\prime}\in\mathcal{N}(u^{\prime})}\bm{\Sigma}^{(l-1)}_{(R)}(v,v^{\prime}).

Then, we give the recursive formula for the Combine layer.

𝚲(r)(l)(u,u)=\displaystyle\bm{\Lambda}^{(l)}_{(r)}(u,u^{\prime})= (𝚺(r1)(l)(u,u)𝚺(r1)(l)(u,u)𝚺(r1)(l)(u,u)𝚺(r1)(l)(u,u))2×2,\displaystyle\begin{pmatrix}\bm{\Sigma}_{(r-1)}^{(l)}(u,u)&\bm{\Sigma}_{(r-1)}^{(l)}(u,u^{\prime})\\ \bm{\Sigma}_{(r-1)}^{(l)}(u^{\prime},u)&\bm{\Sigma}_{(r-1)}^{(l)}(u^{\prime},u^{\prime})\end{pmatrix}\in\mathbb{R}^{2\times 2},
𝚺(r)(l)(u,u)=\displaystyle\bm{\Sigma}^{(l)}_{(r)}(u,u^{\prime})= cσ𝔼(a,b)𝒩(𝟎,𝚲(r)(l)(u,u))[σ(a)σ(b)],\displaystyle c_{\sigma}\mathbb{E}_{(a,b)\sim\mathcal{N}\left(\bm{0},\bm{\Lambda}^{(l)}_{(r)}\left(u,u^{\prime}\right)\right)}[\sigma(a)\sigma(b)],
𝚺˙(r)(l)(u,u)=\displaystyle\bm{\dot{\Sigma}}^{(l)}_{(r)}\left(u,u^{\prime}\right)= cσ𝔼(a,b)𝒩(𝟎,𝚲(r)(l)(u,u))[σ˙(a)σ˙(b)].\displaystyle c_{\sigma}\mathbb{E}_{(a,b)\sim\mathcal{N}\left(\bm{0},\bm{\Lambda}^{(l)}_{(r)}\left(u,u^{\prime}\right)\right)}[\dot{\sigma}(a)\dot{\sigma}(b)].

Finally, we give the formula for calculating the graph neural tangent kernel. For each Aggregate layer, the kernel can be calculated as follows: 𝖪(1)(1)(u,u)=h(G)uh(G)u,\mathsf{K}^{(1)}_{(1)}(u,u^{\prime})=\leavevmode\nobreak\ h(G)_{u}^{\top}h(G)_{u^{\prime}}, and

𝖪(1)(l)(u,u)=v𝒩(u)v𝒩(u)𝖪(R)(l1)(v,v).\displaystyle\mathsf{K}^{(l)}_{(1)}(u,u^{\prime})=\leavevmode\nobreak\ \sum_{v\in\mathcal{N}(u)}\sum_{v^{\prime}\in\mathcal{N}(u^{\prime})}\mathsf{K}^{(l-1)}_{(R)}(v,v^{\prime}).

For each Combine layer, the kernel can be calculated as follows:

𝖪(r)(l)(u,u)=𝖪(r1)(l)(u,u)𝚺˙(r)(l)(u,u)+𝚺(r)(l)(u,u).\displaystyle\mathsf{K}_{(r)}^{(l)}(u,u^{\prime})=\mathsf{K}_{(r-1)}^{(l)}(u,u^{\prime})\dot{\bm{\Sigma}}^{(l)}_{(r)}\left(u,u^{\prime}\right)+\bm{\Sigma}^{(l)}_{(r)}\left(u,u^{\prime}\right).

The Graph neural tangent kernel 𝖪gntk,node(u,u)\mathsf{K}_{\mathrm{gntk},\mathrm{node}}(u,u^{\prime}) is the output 𝖪(R)(L)(u,u)\mathsf{K}_{(R)}^{(L)}(u,u^{\prime}).
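Below is a minimal NumPy sketch of the recursion above for a single graph, assuming the ReLU activation with cσ=2 so that the Gaussian expectations have their standard closed forms: for correlation cos λ, E[σ(a)σ(b)] equals (sqrt(Σ(u,u)Σ(u′,u′))/(2π))(sin λ + (π−λ)cos λ) and E[σ̇(a)σ̇(b)] equals (π−λ)/(2π). The function name node_gntk is ours, and we apply R Combine updates per level, reading the recursion exactly as written (in particular, the level-1 initialization uses the raw features without aggregation).

```python
import numpy as np

def node_gntk(A, X, L, R, c_sigma=2.0):
    """Node-level GNTK K^{(L)}_{(R)}(u, u') for all node pairs of one graph.

    A : (N, N) adjacency matrix (A[u, v] = 1 iff v is in N(u)).
    X : (N, d) node features, row u is h(G)_u.
    """
    Sigma = X @ X.T          # Sigma^{(1)}_{(1)}(u, u') = h(G)_u^T h(G)_{u'}
    K = Sigma.copy()         # K^{(1)}_{(1)}(u, u')
    for l in range(L):
        if l > 0:
            # Aggregate layer: sum over neighbor pairs (v, v') of (u, u').
            Sigma = A @ Sigma @ A.T
            K = A @ K @ A.T
        for r in range(R):
            # Combine layer: closed-form Gaussian expectations for ReLU.
            diag = np.sqrt(np.maximum(np.outer(np.diag(Sigma), np.diag(Sigma)), 1e-12))
            cos_lam = np.clip(Sigma / diag, -1.0, 1.0)
            lam = np.arccos(cos_lam)
            Sigma_next = (c_sigma / (2 * np.pi)) * diag * (np.sin(lam) + (np.pi - lam) * cos_lam)
            Sigma_dot = (c_sigma / (2 * np.pi)) * (np.pi - lam)
            K = K * Sigma_dot + Sigma_next   # K^{(l)}_{(r)} = K^{(l)}_{(r-1)} * Sigma_dot + Sigma
            Sigma = Sigma_next
    return K

# Usage: the (N, N) kernel matrix over all node pairs of a toy 4-node graph.
rng = np.random.default_rng(0)
A = (rng.random((4, 4)) < 0.5).astype(float)
K = node_gntk(A, rng.normal(size=(4, 3)), L=2, R=2)
print(K.shape)  # (4, 4)
```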

Appendix D Equivalence Results

In this section, we prove the equivalence between GNNs and GNTK regression as the width goes to infinity, for a one-layer GNN with exactly one Aggregate operation, one Combine operation, and one ReadOut operation.

In Section D.1, we present some useful definitions. In Section D.2, we compute the gradient flow of iterative GNTK regression and of the GNN training process. In Section D.3, we prove an upper bound for |ugntk,test(T)utest||u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}|. In Section D.4, we bound the initialization and kernel perturbation. In Section D.5, we present the concentration results for kernels. In Section D.6, we upper bound the initialization perturbation. In Section D.7, we upper bound the kernel perturbation. In Section D.8, we connect iterative GNTK regression with the GNN training process. In Section D.9, we present the main theorem.

D.1 Definitions

In this section we first present some formal definitions.

Definition D.1 (Graph neural network function).

We define a graph neural networks with rectified linear unit (ReLU) activation as the following form

fgnn(W,a,G)=1mr=1marl=1Nσ(wrxl),\displaystyle f_{\mathrm{gnn}}(W,a,G)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sum_{l=1}^{N}\sigma(w_{r}^{\top}x_{l})\in\mathbb{R},

where

  • xd×Nx\in\mathbb{R}^{d\times N} (decided by GG) is the input,

  • wrd,r[m]w_{r}\in\mathbb{R}^{d},\leavevmode\nobreak\ r\in[m] is the weight vector of the first layer,

  • W=[w1,,wm]d×mW=[w_{1},\cdots,w_{m}]\in\mathbb{R}^{d\times m}, ar,r[m]a_{r}\in\mathbb{R},\leavevmode\nobreak\ r\in[m] is the output weight,

  • a=[a1,,am]a=[a_{1},\cdots,a_{m}]^{\top} and σ()\sigma(\cdot) is the ReLU activation function: σ(z)=max{0,z}\sigma(z)=\max\{0,z\}.

In this paper, we consider only training the first layer WW while fixing aa. So we also write fgnn(W,G)=fgnn(W,a,G)f_{\mathrm{gnn}}(W,G)=f_{\mathrm{gnn}}(W,a,G). We denote fgnn(W,𝖦)=[fgnn(W,G1),,fgnn(W,Gn)]nf_{\mathrm{gnn}}(W,\mathsf{G})=[f_{\mathrm{gnn}}(W,G_{1}),\cdots,f_{\mathrm{gnn}}(W,G_{n})]^{\top}\in\mathbb{R}^{n}.
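The following is a minimal NumPy sketch of this one-hidden-layer network; the helper name f_gnn and the toy shapes are ours, and X collects the node features x_1, …, x_N of the graph G as columns.

```python
import numpy as np

def f_gnn(W, a, X):
    """f_gnn(W, a, G) = (1 / sqrt(m)) * sum_r a_r * sum_l relu(w_r^T x_l).

    W : (d, m) first-layer weights, column r is w_r.
    a : (m,)   output weights in {-1, +1} (kept fixed during training).
    X : (d, N) node features of the graph G, column l is x_l.
    """
    m = W.shape[1]
    return float(a @ np.maximum(W.T @ X, 0.0).sum(axis=1)) / np.sqrt(m)

# Toy usage with d = 3 features, N = 5 nodes, width m = 8.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 8))
a = rng.choice([-1.0, 1.0], size=8)
print(f_gnn(W, a, rng.normal(size=(3, 5))))
```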

The training process of the GNN is as follows:

Definition D.2 (Training graph neural network).

Given

  • training data graph 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} (which can be viewed as a tensor Xn×d×NX\in\mathbb{R}^{n\times d\times N} and say X1,,XNd×nX_{1},\cdots,X_{N}\in\mathbb{R}^{d\times n}),

  • and corresponding label vector YnY\in\mathbb{R}^{n} where dd represent the dimension of a feature of a node and NN denote the number of nodes in a graph and nn is the size of training set.

Let

  • fgnnf_{\mathrm{gnn}} be defined as in Definition D.1.

  • κ(0,1)\kappa\in(0,1) be a small multiplier.

We initialize the network as ari.i.d.unif[{1,1}]a_{r}\overset{i.i.d.}{\sim}\operatorname{unif}[\{-1,1\}] and wr(0)i.i.d.𝒩(0,Id)w_{r}(0)\overset{i.i.d.}{\sim}\mathcal{N}(0,I_{d}). Then we consider solving the following optimization problem using gradient descent:

minW12Yκfgnn(W,𝖦)22\displaystyle\min_{W}\frac{1}{2}\|Y-\kappa f_{\mathrm{gnn}}(W,\mathsf{G})\|_{2}^{2} (1)

We denote wr(t),r[m]w_{r}(t),r\in[m] as the variable at iteration tt. We denote

ugnn(t)=\displaystyle u_{\mathrm{gnn}}(t)= κfgnn(W(t),𝖦)\displaystyle\leavevmode\nobreak\ \kappa f_{\mathrm{gnn}}(W(t),\mathsf{G})
=\displaystyle= κmr=1marl=1Nσ(wr(t)Xl)n\displaystyle\leavevmode\nobreak\ \frac{\kappa}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sum_{l=1}^{N}\sigma(w_{r}(t)^{\top}X_{l})\in\mathbb{R}^{n} (2)

as the training data predictor at iteration tt. Given any test data {xtest,l}l[N]d\{x_{\mathrm{test},l}\}_{l\in[N]}\in\mathbb{R}^{d} (decided by graph GtestG_{\mathrm{test}}), we denote

ugnn,test(t)=\displaystyle u_{\mathrm{gnn},\mathrm{test}}(t)= κfgnn(W(t),Gtest)\displaystyle\leavevmode\nobreak\ \kappa f_{\mathrm{gnn}}(W(t),G_{\mathrm{test}})
=\displaystyle= κmr=1marl=1Nσ(wr(t)xtest,l)\displaystyle\leavevmode\nobreak\ \frac{\kappa}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sum_{l=1}^{N}\sigma(w_{r}(t)^{\top}x_{\mathrm{test},l})\in\mathbb{R} (3)

as the test data predictor at iteration tt.

In the definition above, κ\kappa is the scaling factor [52, 3].
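As an illustration of Definition D.2, here is a finite-width, discrete-time sketch that runs plain gradient descent on the objective in Eq. (1) while keeping a fixed; the function names, learning rate, and step count are illustrative choices, and we use the fact that the gradient of f_gnn with respect to w_r is (1/sqrt(m)) a_r Σ_l 1[w_r^T x_l > 0] x_l.

```python
import numpy as np

def f_gnn(W, a, X):
    """Same one-hidden-layer GNN as in the previous sketch; X has shape (d, N)."""
    return float(a @ np.maximum(W.T @ X, 0.0).sum(axis=1)) / np.sqrt(W.shape[1])

def grad_f_gnn(W, a, X):
    """Gradient of f_gnn with respect to W, shape (d, m)."""
    m = W.shape[1]
    act = (W.T @ X > 0).astype(float)        # (m, N) ReLU indicators
    return (X @ act.T) * a / np.sqrt(m)      # column r: a_r * sum_l 1[w_r^T x_l > 0] x_l

def train_gnn(graphs, Y, d, m, kappa=0.1, lr=0.1, steps=500, seed=0):
    """Gradient descent on (1/2) * ||Y - kappa * f_gnn(W, G)||_2^2, training only W."""
    rng = np.random.default_rng(seed)
    a = rng.choice([-1.0, 1.0], size=m)      # a_r ~ unif{-1, +1}, then frozen
    W = rng.normal(size=(d, m))              # w_r(0) ~ N(0, I_d)
    for _ in range(steps):
        preds = np.array([kappa * f_gnn(W, a, X) for X in graphs])
        residual = Y - preds
        grad = sum(-kappa * r * grad_f_gnn(W, a, X) for r, X in zip(residual, graphs))
        W -= lr * grad
    return W, a

# Toy usage: n = 6 graphs with d = 3 features and N = 5 nodes each.
rng = np.random.default_rng(1)
graphs = [rng.normal(size=(3, 5)) for _ in range(6)]
Y = rng.normal(size=6)
W, a = train_gnn(graphs, Y, d=3, m=256)
print(np.linalg.norm(Y - np.array([0.1 * f_gnn(W, a, X) for X in graphs])))
```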

Definition D.3 (Graph neural tangent kernel and feature function).

We define the graph neural tangent kernel(GNTK) and the feature function corresponding to the graph neural networks fgnnf_{\mathrm{gnn}} defined in Definition D.1 as following

𝖪gntk(G,H)=𝔼[fgnn(W,G)W,fgnn(W,H)W]\displaystyle\mathsf{K}_{\mathrm{gntk}}(G,H)=\operatorname*{{\mathbb{E}}}\left[\left\langle\frac{\partial f_{\mathrm{gnn}}(W,G)}{\partial W},\frac{\partial f_{\mathrm{gnn}}(W,H)}{\partial W}\right\rangle\right]

where

  • G,HG,H are any input data,

  • and the expectation is taking over wri.i.d.𝒩(0,I),r=1,,mw_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\leavevmode\nobreak\ r=1,\cdots,m.

Given

  • training data matrix 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\},

we define Hctsn×nH^{\mathrm{cts}}\in\mathbb{R}^{n\times n} as the kernel matrix between training data as

[Hcts]i,j=𝖪gntk(Gi,Gj).\displaystyle[H^{\mathrm{cts}}]_{i,j}=\mathsf{K}_{\mathrm{gntk}}(G_{i},G_{j})\in\mathbb{R}.

Let Λ0>0\Lambda_{0}>0 denote the smallest eigenvalue of HctsH^{\mathrm{cts}}, under the assumption that HctsH^{\mathrm{cts}} is positive definite. Additionally, for any test graph data GG, the kernel between the test data and the training data is denoted by 𝖪gntk(G,𝖦)\mathsf{K}_{\mathrm{gntk}}(G,\mathsf{G}) and is a member of n\mathbb{R}^{n}. We write it as:

𝖪gntk(G,𝖦)=[𝖪gntk(G,G1),,𝖪gntk(G,Gn)]n.\displaystyle\mathsf{K}_{\mathrm{gntk}}(G,\mathsf{G})=[\mathsf{K}_{\mathrm{gntk}}(G,G_{1}),\cdots,\mathsf{K}_{\mathrm{gntk}}(G,G_{n})]^{\top}\in\mathbb{R}^{n}.

We denote the feature function corresponding to the kernel 𝖪gntk\mathsf{K}_{\mathrm{gntk}} as we defined above as Φ:d×N\Phi:\mathbb{R}^{d\times N}\rightarrow\mathcal{F}, which satisfies

Φ(G),Φ(H)=𝖪gntk(G,H),\displaystyle\langle\Phi(G),\Phi(H)\rangle_{\mathcal{F}}=\mathsf{K}_{\mathrm{gntk}}(G,H),

for any graph data GG, HH. And we write Φ(𝖦)=[Φ(G1),,Φ(Gn)]\Phi(\mathsf{G})=[\Phi(G_{1}),\cdots,\Phi(G_{n})]^{\top}.
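Since the expectation in the definition of 𝖪_gntk is over Gaussian weights, a simple Monte Carlo estimate is available for the network of Definition D.1: using a_r^2 = 1, the inner product of the two gradients reduces in expectation over w ~ N(0, I_d) to E_w[Σ_{l,l'} σ̇(w^T x_l) σ̇(w^T y_{l'}) x_l^T y_{l'}]. The sketch below, with our own function name gntk_mc and sample count, estimates this expectation.

```python
import numpy as np

def gntk_mc(X, Y, num_samples=20000, seed=0):
    """Monte Carlo estimate of K_gntk(G, H) for the network of Definition D.1.

    X, Y : (d, N) feature matrices of the two graphs G and H.
    """
    d = X.shape[0]
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(num_samples, d))          # i.i.d. samples w ~ N(0, I_d)
    indic_x = (W @ X > 0).astype(float)            # ReLU derivatives at the nodes of G
    indic_y = (W @ Y > 0).astype(float)            # ReLU derivatives at the nodes of H
    inner = X.T @ Y                                # pairwise inner products x_l^T y_{l'}
    # Average over samples of sum_{l, l'} indicator * indicator * inner product.
    return float(np.einsum('sl,sk,lk->', indic_x, indic_y, inner) / num_samples)

# Usage: estimate one entry of H^cts from two toy graphs.
rng = np.random.default_rng(2)
print(gntk_mc(rng.normal(size=(3, 5)), rng.normal(size=(3, 4))))
```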

Definition D.4 (Neural tangent kernel regression).

Given

  • training data matrix 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\}

  • and corresponding label vector YnY\in\mathbb{R}^{n}.

Let

  • 𝖪gntk\mathsf{K}_{\mathrm{gntk}}, 𝖧cts\mathsf{H}^{\mathrm{cts}} and Φ\Phi be the neural tangent kernel and corresponding feature functions defined as in Definition D.3

Then we consider the following neural tangent kernel regression problem:

minβ12Yκfgntk(β,𝖦)22\displaystyle\min_{\beta}\frac{1}{2}\|Y-\kappa f_{\mathrm{gntk}}(\beta,\mathsf{G})\|_{2}^{2} (4)

where fgntk(β,G)=Φ(G)βf_{\mathrm{gntk}}(\beta,G)=\Phi(G)^{\top}\beta\in\mathbb{R} denotes the prediction function and

fgntk(β,𝖦)=[fgntk(β,G1),,fgntk(β,Gn)]n.\displaystyle f_{\mathrm{gntk}}(\beta,\mathsf{G})=[f_{\mathrm{gntk}}(\beta,G_{1}),\cdots,f_{\mathrm{gntk}}(\beta,G_{n})]^{\top}\in\mathbb{R}^{n}.

Consider the gradient flow associated with solving the problem given by equation (4), starting with the initial condition β(0)=0\beta(0)=0. Let β(t)\beta(t) represent the variable at the tt-th iteration. We represent

ugntk(t)=κΦ(𝖦)β(t)n\displaystyle u_{\mathrm{gntk}}(t)=\kappa\Phi(\mathsf{G})\beta(t)\in\mathbb{R}^{n} (5)

as the training data predictor at iteration tt. Given any test data xtestdx_{\mathrm{test}}\in\mathbb{R}^{d}, we denote

ugntk,test(t)=κΦ(Gtest)β(t)\displaystyle u_{\mathrm{gntk},\mathrm{test}}(t)=\kappa\Phi(G_{\mathrm{test}})^{\top}\beta(t)\in\mathbb{R} (6)

as the predictor for the test data at the tt-th iteration. It is important to highlight that the gradient flow converges to the optimal solution of problem (4) because of the strong convexity of the problem. We denote

β=limtβ(t)=κ1(Φ(𝖦)Φ(𝖦))1Φ(𝖦)Y\displaystyle\beta^{*}=\lim_{t\to\infty}\beta(t)=\kappa^{-1}(\Phi(\mathsf{G})^{\top}\Phi(\mathsf{G}))^{-1}\Phi(\mathsf{G})^{\top}Y (7)

and the optimal predictor derived from training data

u=limtugntk(t)=κΦ(𝖦)β=Yn\displaystyle u^{*}=\lim_{t\to\infty}u_{\mathrm{gntk}}(t)=\kappa\Phi(\mathsf{G})\beta^{*}=Y\in\mathbb{R}^{n} (8)

and the optimal test data predictor

utest=\displaystyle u_{\mathrm{test}}^{*}= limtugntk,test(t)\displaystyle\leavevmode\nobreak\ \lim_{t\to\infty}u_{\mathrm{gntk},\mathrm{test}}(t)
=\displaystyle= κΦ(Gtest)β\displaystyle\leavevmode\nobreak\ \kappa\Phi(G_{\mathrm{test}})^{\top}\beta^{*}
=\displaystyle= 𝖪gntk(Gtest,𝖦)(𝖧cts)1Y.\displaystyle\leavevmode\nobreak\ \mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})^{\top}(\mathsf{H}^{\mathrm{cts}})^{-1}Y\in\mathbb{R}. (9)
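As a usage note, the optimal predictors in Eq. (8) and Eq. (9) are standard kernel regression formulas; a minimal sketch follows (with our own helper name gntk_predict, and assuming the kernel matrix H^cts is invertible).

```python
import numpy as np

def gntk_predict(H_cts, k_test, Y):
    """Optimal GNTK regression test predictor u*_test from Eq. (9).

    H_cts  : (n, n) kernel matrix between the training graphs.
    k_test : (n,)   kernel vector K_gntk(G_test, G) against the training graphs.
    Y      : (n,)   training labels.
    """
    return float(k_test @ np.linalg.solve(H_cts, Y))

# Usage with a toy positive definite kernel matrix.
rng = np.random.default_rng(3)
B = rng.normal(size=(5, 5))
H_cts = B @ B.T + np.eye(5)
print(gntk_predict(H_cts, rng.normal(size=5), rng.normal(size=5)))
```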
Definition D.5 (Dynamic kernel).

Given

  • W(t)d×mW(t)\in\mathbb{R}^{d\times m} as the parameters of the neural network at training time tt as defined in Definition D.2.

For any data G,Hd×NG,H\in\mathbb{R}^{d\times N} , we define 𝖪t(G,H)\mathsf{K}_{t}(G,H)\in\mathbb{R} as

𝖪t(G,H)=fgnn(W(t),G)W(t),fgnn(W(t),H)W(t)\displaystyle\mathsf{K}_{t}(G,H)=\left\langle\frac{\partial f_{\mathrm{gnn}}(W(t),G)}{\partial W(t)},\frac{\partial f_{\mathrm{gnn}}(W(t),H)}{\partial W(t)}\right\rangle

Given training data matrix 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\}, we define 𝖧(t)n×n\mathsf{H}(t)\in\mathbb{R}^{n\times n} as

[𝖧(t)]i,j=𝖪t(Gi,Gj).\displaystyle[\mathsf{H}(t)]_{i,j}=\mathsf{K}_{t}(G_{i},G_{j})\in\mathbb{R}.

Further, given a test graph data GtestG_{\mathrm{test}}, we define 𝖪t(Gtest,𝖦)n\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\in\mathbb{R}^{n} as

𝖪t(Gtest,𝖦)=[𝖪t(Gtest,G1),,𝖪t(Gtest,Gn)]n.\displaystyle\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})=[\mathsf{K}_{t}(G_{\mathrm{test}},G_{1}),\cdots,\mathsf{K}_{t}(G_{\mathrm{test}},G_{n})]^{\top}\in\mathbb{R}^{n}.
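For the one-hidden-layer network of Definition D.1, the dynamic kernel at the current parameters W(t) can be evaluated directly from the gradients; a minimal sketch (the name dynamic_kernel is ours):

```python
import numpy as np

def dynamic_kernel(W, a, X_G, X_H):
    """K_t(G, H) = <d f_gnn(W, G) / dW, d f_gnn(W, H) / dW> at the current W.

    W        : (d, m) current first-layer weights.
    a        : (m,)   fixed output weights in {-1, +1}.
    X_G, X_H : (d, N) node features of the two graphs.
    """
    m = W.shape[1]
    def grad(X):
        act = (W.T @ X > 0).astype(float)     # ReLU indicators, shape (m, N)
        return (X @ act.T) * a / np.sqrt(m)   # d f_gnn / dW, shape (d, m)
    return float(np.sum(grad(X_G) * grad(X_H)))

# Usage: one entry of H(t) for two toy graphs at random initialization.
rng = np.random.default_rng(4)
W = rng.normal(size=(3, 64))
a = rng.choice([-1.0, 1.0], size=64)
print(dynamic_kernel(W, a, rng.normal(size=(3, 5)), rng.normal(size=(3, 5))))
```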

D.2 Gradient flow

In this section, we compute the gradient flow of iterative GNTK regression and of the GNN training process.

First, we compute the gradient flow of kernel regression.

Lemma D.6.

Given

  • training graph data 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\},

  • and corresponding label vector YnY\in\mathbb{R}^{n}. Let fgntkf_{\mathrm{gntk}} be defined as in Definition D.4.

Let

  • β(t)\beta(t), κ(0,1)\kappa\in(0,1) and ugntk(t)nu_{\mathrm{gntk}}(t)\in\mathbb{R}^{n} be defined as in Definition D.4.

  • \mathsf{K}_{\mathrm{gntk}}:\mathbb{R}^{d\times N}\times\mathbb{R}^{n\times d\times N}\to\mathbb{R}^{n} be defined as in Definition D.3.

Then for any data Gd×NG\in\mathbb{R}^{d\times N}, we have

dfgntk(β(t),G)dt=κ𝖪gntk(G,𝖦)(Yugntk(t))\displaystyle\frac{\mathrm{d}f_{\mathrm{gntk}}(\beta(t),G)}{\mathrm{d}t}=\kappa\cdot\mathsf{K}_{\mathrm{gntk}}(G,\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t))
Proof.

Denote L(t)=12Yugntk(t)22L(t)=\frac{1}{2}\|Y-u_{\mathrm{gntk}}(t)\|_{2}^{2}. By the rule of gradient descent, we have

dβ(t)dt=dLdβ=κΦ(𝖦)(Yugntk(t)),\displaystyle\frac{\mathrm{d}\beta(t)}{\mathrm{d}t}=-\frac{\mathrm{d}L}{\mathrm{d}\beta}=\kappa\Phi(\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t)),

where Φ\Phi is defined in Definition D.3. Thus we have

dfgntk(β(t),G)dt=\displaystyle\frac{\mathrm{d}f_{\mathrm{gntk}}(\beta(t),G)}{\mathrm{d}t}= dfgntk(β(t),G)dβ(t)dβ(t)dt\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}f_{\mathrm{gntk}}(\beta(t),G)}{\mathrm{d}\beta(t)}\frac{\mathrm{d}\beta(t)}{\mathrm{d}t}
=\displaystyle= Φ(G)(κΦ(𝖦)(Yugntk(t)))\displaystyle\leavevmode\nobreak\ \Phi(G)^{\top}(\kappa\Phi(\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t)))
=\displaystyle= κ𝖪gntk(G,𝖦)(Yugntk(t))\displaystyle\leavevmode\nobreak\ \kappa\mathsf{K}_{\mathrm{gntk}}(G,\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t))

where the initial step arises from the chain rule. The subsequent step is a consequence of the relationship dfgntk(β,G)/dβ=Φ(G)\mathrm{d}f_{\mathrm{gntk}}(\beta,G)/\mathrm{d}\beta=\Phi(G)^{\top}. The final step is based on the definition of the kernel, with 𝖪gntk(G,𝖦)=Φ(𝖦)Φ(G)\mathsf{K}_{\mathrm{gntk}}(G,\mathsf{G})=\Phi(\mathsf{G})\Phi(G) belonging to n\mathbb{R}^{n}. ∎

Corollary D.7 (Gradient of prediction of kernel regression).

Given

  • training data matrix 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} and corresponding label vector YnY\in\mathbb{R}^{n}.

  • a test data GtestG_{\mathrm{test}}.

Let

  • fgntkf_{\mathrm{gntk}} be defined as in Definition D.4.

  • β(t)\beta(t), κ(0,1)\kappa\in(0,1) and ugntk(t)nu_{\mathrm{gntk}}(t)\in\mathbb{R}^{n} be defined as in Definition D.4.

  • \mathsf{K}_{\mathrm{gntk}}:\mathbb{R}^{d\times N}\times\mathbb{R}^{n\times d\times N}\rightarrow\mathbb{R}^{n},\leavevmode\nobreak\ H^{\mathrm{cts}}\in\mathbb{R}^{n\times n} be defined as in Definition D.3.

Then we have

dugntk(t)dt\displaystyle\frac{\mathrm{d}u_{\mathrm{gntk}}(t)}{\mathrm{d}t} =κ2Hcts(Yugntk(t))\displaystyle=\kappa^{2}H^{\mathrm{cts}}(Y-u_{\mathrm{gntk}}(t))
dugntk,test(t)dt\displaystyle\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t} =κ2𝖪gntk(Gtest,𝖦)(Yugntk(t)).\displaystyle=\kappa^{2}\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t)).
Proof.

Plugging in G=GiG=G_{i} in Lemma D.6, we have

dfgntk(β(t),Gi)dt=κ𝖪gntk(Gi,𝖦)(Yugntk(t)).\displaystyle\frac{\mathrm{d}f_{\mathrm{gntk}}(\beta(t),G_{i})}{\mathrm{d}t}=\kappa\mathsf{K}_{\mathrm{gntk}}(G_{i},\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t)).

Note [ugntk(t)]i=κfgntk(β(t),Gi)[u_{\mathrm{gntk}}(t)]_{i}=\kappa f_{\mathrm{gntk}}(\beta(t),G_{i}) and [𝖧cts]:,i=𝖪gntk(Gi,𝖦)[\mathsf{H}^{\mathrm{cts}}]_{:,i}=\mathsf{K}_{\mathrm{gntk}}(G_{i},\mathsf{G}), so writing all the data in a compact form, we have

dugntk(t)dt=κ2𝖧cts(Yugntk(t)).\displaystyle\frac{\mathrm{d}u_{\mathrm{gntk}}(t)}{\mathrm{d}t}=\kappa^{2}\mathsf{H}^{\mathrm{cts}}(Y-u_{\mathrm{gntk}}(t)).

Plugging in data G=GtestG=G_{\mathrm{test}} in Lemma D.6, we have

dfgntk(β(t),Gtest)dt=κ𝖪gntk(Gtest,𝖦)(Yugntk(t)).\displaystyle\frac{\mathrm{d}f_{\mathrm{gntk}}(\beta(t),G_{\mathrm{test}})}{\mathrm{d}t}=\kappa\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t)).

Note by definition, ugntk,test(t)=κfgntk(β(t),Gtest)u_{\mathrm{gntk},\mathrm{test}}(t)=\kappa f_{\mathrm{gntk}}(\beta(t),G_{\mathrm{test}})\in\mathbb{R}, so we have

dugntk,test(t)dt=κ2𝖪gntk(Gtest,𝖦)(Yugntk(t)).\displaystyle\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t}=\kappa^{2}\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})^{\top}(Y-u_{\mathrm{gntk}}(t)).

We prove the linear convergence of kernel regression:

Lemma D.8.

Given

  • training graph data 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\},

  • and corresponding label vector YnY\in\mathbb{R}^{n}.

Let

  • κ(0,1)\kappa\in(0,1) and ugntk(t)nu_{\mathrm{gntk}}(t)\in\mathbb{R}^{n} be defined as in Definition D.4.

  • unu^{*}\in\mathbb{R}^{n} be defined in Definition D.4.

  • Λ0>0\Lambda_{0}>0 be defined as in Definition D.3.

Then we have

dugntk(t)u22dt2(κ2Λ0)ugntk(t)u22.\displaystyle\frac{\mathrm{d}\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2}}{\mathrm{d}t}\leq-2(\kappa^{2}\Lambda_{0})\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2}.

Further, we have

ugntk(t)u2e(κ2Λ0)tugntk(0)u2.\displaystyle\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}\leq e^{-(\kappa^{2}\Lambda_{0})t}\|u_{\mathrm{gntk}}(0)-u^{*}\|_{2}.
Proof.

So we have

dugntk(t)u22dt\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2}}{\mathrm{d}t}
=\displaystyle= 2(ugntk(t)u)dugntk(t)dt\displaystyle\leavevmode\nobreak\ 2(u_{\mathrm{gntk}}(t)-u^{*})^{\top}\frac{\mathrm{d}u_{\mathrm{gntk}}(t)}{\mathrm{d}t}
=\displaystyle= 2κ2(ugntk(t)u)Hcts(ugntk(t)Y)\displaystyle\leavevmode\nobreak\ -2\kappa^{2}(u_{\mathrm{gntk}}(t)-u^{*})^{\top}H^{\mathrm{cts}}(u_{\mathrm{gntk}}(t)-Y)
\displaystyle\leq 2(κ2Λ0)ugntk(t)u22,\displaystyle\leavevmode\nobreak\ -2(\kappa^{2}\Lambda_{0})\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2}, (10)

where the initial step is derived from the chain rule. The subsequent step is based on Corollary D.7, and the concluding step is in accordance with the definition of Λ0\Lambda_{0}. Further, since

d(e2(κ2Λ0)tugntk(t)u22)dt\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}(e^{2(\kappa^{2}\Lambda_{0})t}\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2})}{\mathrm{d}t}
=\displaystyle= 2(κ2Λ0)e2(κ2Λ0)tugntk(t)u22\displaystyle\leavevmode\nobreak\ 2(\kappa^{2}\Lambda_{0})e^{2(\kappa^{2}\Lambda_{0})t}\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2}
+e2(κ2Λ0)tdugntk(t)u22dt\displaystyle\leavevmode\nobreak\ \quad+e^{2(\kappa^{2}\Lambda_{0})t}\cdot\frac{\mathrm{d}\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2}}{\mathrm{d}t}
\displaystyle\leq 0,\displaystyle\leavevmode\nobreak\ 0,

where the initial step involves gradient computation, while the subsequent step is derived from Eq. (10). Consequently, the value of e^{2(\kappa^{2}\Lambda_{0})t}\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}^{2} does not increase, leading to the inference that

\displaystyle\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}\leq e^{-(\kappa^{2}\Lambda_{0})t}\|u_{\mathrm{gntk}}(0)-u^{*}\|_{2}.
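A forward-Euler discretization of the gradient flow in Corollary D.7 makes this linear convergence easy to check numerically; the toy kernel matrix, step size, and horizon below are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, kappa, eta, steps = 5, 0.5, 0.05, 2000
B = rng.normal(size=(n, n))
H_cts = B @ B.T + np.eye(n)                 # a toy positive definite stand-in for H^cts
lam0 = np.linalg.eigvalsh(H_cts).min()      # Lambda_0
Y = rng.normal(size=n)

u = np.zeros(n)                             # u_gntk(0) = 0 since beta(0) = 0
for _ in range(steps):
    u += eta * kappa**2 * H_cts @ (Y - u)   # Euler step of du/dt = kappa^2 H^cts (Y - u)

T = eta * steps
# The residual should be (up to discretization error) at most exp(-kappa^2 * Lambda_0 * T) * ||Y||_2.
print(np.linalg.norm(u - Y), np.exp(-kappa**2 * lam0 * T) * np.linalg.norm(Y))
```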

We compute the gradient flow of neural network training:

Lemma D.9.

Given

  • training graph data 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} ,

  • and corresponding label vector YnY\in\mathbb{R}^{n}.

Let

  • fgnn:d×m×d×Nf_{\mathrm{gnn}}:\mathbb{R}^{d\times m}\times\mathbb{R}^{d\times N}\rightarrow\mathbb{R} be defined as in Definition D.1.

  • W(t)d×mW(t)\in\mathbb{R}^{d\times m}, κ(0,1)\kappa\in(0,1) and ugnn(t)nu_{\mathrm{gnn}}(t)\in\mathbb{R}^{n} be defined as in Definition D.2.

  • 𝖪t:d×N×n×d×Nn\mathsf{K}_{t}:\mathbb{R}^{d\times N}\times\mathbb{R}^{n\times d\times N}\rightarrow\mathbb{R}^{n} be defined as in Definition D.5.

Then for any data Gd×NG\in\mathbb{R}^{d\times N}, we have

dfgnn(W(t),G)dt=κ𝖪t(G,𝖦)(Yugnn(t)).\displaystyle\frac{\mathrm{d}f_{\mathrm{gnn}}(W(t),G)}{\mathrm{d}t}=\kappa\mathsf{K}_{t}(G,\mathsf{G})^{\top}(Y-u_{\mathrm{gnn}}(t)).
Proof.

Denote L(t)=12Yugnn(t)22L(t)=\frac{1}{2}\|Y-u_{\mathrm{gnn}}(t)\|_{2}^{2}.

By the rule of gradient descent, we have

dwrdt=Lwr=(ugnnwr)(Yugnn).\displaystyle\frac{\mathrm{d}w_{r}}{\mathrm{d}t}=-\frac{\partial L}{\partial w_{r}}=(\frac{\partial u_{\mathrm{gnn}}}{\partial w_{r}})^{\top}(Y-u_{\mathrm{gnn}}). (11)

Thus, we have

dfgnn(W(t),G)dt\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}f_{\mathrm{gnn}}(W(t),G)}{\mathrm{d}t}
=\displaystyle= dfgnn(W(t),G)dW(t),dW(t)dt\displaystyle\leavevmode\nobreak\ \Big{\langle}\frac{\mathrm{d}f_{\mathrm{gnn}}(W(t),G)}{\mathrm{d}W(t)},\frac{\mathrm{d}W(t)}{\mathrm{d}t}\Big{\rangle}
=\displaystyle= j=1n(yjκfgnn(W(t),Gj))\displaystyle\leavevmode\nobreak\ \sum_{j=1}^{n}(y_{j}-\kappa f_{\mathrm{gnn}}(W(t),G_{j}))
dfgnn(W(t),G)dW(t),dκfgnn(W(t),Gj)dW(t)\displaystyle\quad\cdot\Big{\langle}\frac{\mathrm{d}f_{\mathrm{gnn}}(W(t),G)}{\mathrm{d}W(t)},\frac{\mathrm{d}\kappa f_{\mathrm{gnn}}(W(t),G_{j})}{\mathrm{d}W(t)}\Big{\rangle}
=\displaystyle= κj=1n(yjκfgnn(W(t),Gj))𝖪t(G,Gj)\displaystyle\leavevmode\nobreak\ \kappa\sum_{j=1}^{n}(y_{j}-\kappa f_{\mathrm{gnn}}(W(t),G_{j}))\cdot\mathsf{K}_{t}(G,G_{j})
=\displaystyle= κ𝖪t(G,𝖦)(Yugnn(t))\displaystyle\leavevmode\nobreak\ \kappa\mathsf{K}_{t}(G,\mathsf{G})^{\top}(Y-u_{\mathrm{gnn}}(t))

where the initial step is derived from the chain rule. The subsequent step is based on Eq. (11). The third step adheres to the definition of 𝖪t\mathsf{K}_{t}, and the concluding step presents the formula in a more concise manner. ∎

Corollary D.10 (Gradient of prediction of graph neural network).

Given

  • training graph data 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} and corresponding label vector YnY\in\mathbb{R}^{n}.

  • a test graph data GtestG_{\mathrm{test}}. Let fgnn:d×m×d×Nf_{\mathrm{gnn}}:\mathbb{R}^{d\times m}\times\mathbb{R}^{d\times N}\rightarrow\mathbb{R} be defined as in Definition D.1.

Let

  • W(t)d×mW(t)\in\mathbb{R}^{d\times m}, κ(0,1)\kappa\in(0,1) and ugnn(t)nu_{\mathrm{gnn}}(t)\in\mathbb{R}^{n} be defined as in Definition D.2.

  • 𝖪t:d×N×n×d×Nn,H(t)n×n\mathsf{K}_{t}:\mathbb{R}^{d\times N}\times\mathbb{R}^{n\times d\times N}\rightarrow\mathbb{R}^{n},\leavevmode\nobreak\ H(t)\in\mathbb{R}^{n\times n} be defined as in Definition D.5.

Then we have

dugnn(t)dt=\displaystyle\frac{\mathrm{d}u_{\mathrm{gnn}}(t)}{\mathrm{d}t}= κ2H(t)(Yugnn(t))\displaystyle\leavevmode\nobreak\ \kappa^{2}H(t)(Y-u_{\mathrm{gnn}}(t))
dugnn,test(t)dt=\displaystyle\frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t}= κ2𝖪t(Gtest,𝖦)(Yugnn(t)).\displaystyle\leavevmode\nobreak\ \kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(Y-u_{\mathrm{gnn}}(t)).
Proof.

Plugging in G=Gid×NG=G_{i}\in\mathbb{R}^{d\times N} in Lemma D.9, we have

dfgnn(W(t),Gi)dt=κ𝖪t(Gi,𝖦)(Yugnn(t)).\displaystyle\frac{\mathrm{d}f_{\mathrm{gnn}}(W(t),G_{i})}{\mathrm{d}t}=\kappa\mathsf{K}_{t}(G_{i},\mathsf{G})^{\top}(Y-u_{\mathrm{gnn}}(t)).

Note [ugnn(t)]i=κfgnn(W(t),Gi)[u_{\mathrm{gnn}}(t)]_{i}=\kappa f_{\mathrm{gnn}}(W(t),G_{i}) and [H(t)]:,i=𝖪t(Gi,𝖦)[H(t)]_{:,i}=\mathsf{K}_{t}(G_{i},\mathsf{G}), so writing all the data in a compact form, we have

dugnn(t)dt=κ2H(t)(Yugnn(t)).\displaystyle\frac{\mathrm{d}u_{\mathrm{gnn}}(t)}{\mathrm{d}t}=\kappa^{2}H(t)(Y-u_{\mathrm{gnn}}(t)).

Plugging in data G=GtestG=G_{\mathrm{test}} in Lemma D.9, we have

dfgnn(W(t),Gtest)dt=κ𝖪t(Gtest,𝖦)(Yugnn(t)).\displaystyle\frac{\mathrm{d}f_{\mathrm{gnn}}(W(t),G_{\mathrm{test}})}{\mathrm{d}t}=\kappa\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(Y-u_{\mathrm{gnn}}(t)).

Note by definition, ugnn,test(t)=κfgnn(W(t),Gtest)u_{\mathrm{gnn},\mathrm{test}}(t)=\kappa f_{\mathrm{gnn}}(W(t),G_{\mathrm{test}}), so we have

dugnn,test(t)dt=κ2𝖪t(Gtest,𝖦)(Yugnn(t)).\displaystyle\frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t}=\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(Y-u_{\mathrm{gnn}}(t)).

We prove the linear convergence of neural network training:

Lemma D.11.

Given

  • training graph data matrix 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} and corresponding label vector YnY\in\mathbb{R}^{n}.

  • the total number of iterations T>0T>0.

Let

  • κ(0,1)\kappa\in(0,1) and ugnn(t)nu_{\mathrm{gnn}}(t)\in\mathbb{R}^{n} be defined as in Definition D.2.

  • unu^{*}\in\mathbb{R}^{n} be defined in Eq. (8).

  • Hctsn×nH^{\mathrm{cts}}\in\mathbb{R}^{n\times n} and Λ0>0\Lambda_{0}>0 be defined as in Definition D.3.

  • H(t)n×nH(t)\in\mathbb{R}^{n\times n} be defined as in Definition D.5.

Then we have

dugnn(t)u22dt(κ2Λ0)ugnn(t)u22.\displaystyle\frac{\mathrm{d}\|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}}{\mathrm{d}t}\leq-(\kappa^{2}\Lambda_{0})\|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}.
Proof.

We have

dugnn(t)u22dt\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}\|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}}{\mathrm{d}t}
=\displaystyle= 2(ugnn(t)u)dugnn(t)dt\displaystyle\leavevmode\nobreak\ 2(u_{\mathrm{gnn}}(t)-u^{*})^{\top}\frac{\mathrm{d}u_{\mathrm{gnn}}(t)}{\mathrm{d}t}
=\displaystyle= 2κ2(ugnn(t)u)H(t)(ugnn(t)Y)\displaystyle\leavevmode\nobreak\ -2\kappa^{2}(u_{\mathrm{gnn}}(t)-u^{*})^{\top}H(t)(u_{\mathrm{gnn}}(t)-Y)
=\displaystyle= 2(ugnn(t)u)(κ2H(t))(ugnn(t)u)\displaystyle\leavevmode\nobreak\ -2(u_{\mathrm{gnn}}(t)-u^{*})^{\top}(\kappa^{2}H(t))(u_{\mathrm{gnn}}(t)-u^{*})
\displaystyle\leq (κ2Λ0)ugnn(t)u22\displaystyle\leavevmode\nobreak\ -(\kappa^{2}\Lambda_{0})\|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}

where the first step follows from the chain rule, the second step follows from Corollary D.10, the third step uses basic linear algebra, and the last step follows from the assumption H(t)HctsΛ0/2\|H(t)-H^{\mathrm{cts}}\|\leq\Lambda_{0}/2, which implies λmin(H(t))Λ0/2\lambda_{\min}(H(t))\geq\Lambda_{0}/2. ∎

D.3 Perturbation during iterative GNTK regression

In this section, we prove an upper bound for |ugntk,test(T)utest||u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}|.

Lemma D.12.

Let

  • ugntk,test(T)u_{\mathrm{gntk},\mathrm{test}}(T)\in\mathbb{R} and utestu_{\mathrm{test}}^{*}\in\mathbb{R} be defined as in Definition D.4.

Given

  • any accuracy ε>0\varepsilon>0,

if κ(0,1)\kappa\in(0,1), then by picking T=O~(1κ2Λ0)T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), we have

|ugntk,test(T)utest|ε/2.\displaystyle|u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}|\leq\varepsilon/2.

where O~()\widetilde{O}(\cdot) here hides polylog(n/(εΛ0))\mathrm{poly}\log(n/(\varepsilon\Lambda_{0})).

Proof.

Due to the linear convergence of kernel regression, i.e.,

dβ(t)β22dt2(κ2Λ0)β(t)β22\displaystyle\frac{\mathrm{d}\|\beta(t)-\beta^{*}\|_{2}^{2}}{\mathrm{d}t}\leq-2(\kappa^{2}\Lambda_{0})\|\beta(t)-\beta^{*}\|_{2}^{2}

Thus,

|ugntk,test(T)utest|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}|
=\displaystyle= |κΦ(Gtest)β(T)κΦ(Gtest)β|\displaystyle\leavevmode\nobreak\ |\kappa\Phi(G_{\mathrm{test}})^{\top}\beta(T)-\kappa\Phi(G_{\mathrm{test}})^{\top}\beta^{*}|
\displaystyle\leq κΦ(Gtest)2β(T)β2\displaystyle\leavevmode\nobreak\ \kappa\|\Phi(G_{\mathrm{test}})\|_{2}\|\beta(T)-\beta^{*}\|_{2}
\displaystyle\leq κe(κ2Λ0)Tβ(0)β2\displaystyle\leavevmode\nobreak\ \kappa e^{-(\kappa^{2}\Lambda_{0})T}\|\beta(0)-\beta^{*}\|_{2}
\displaystyle\leq e(κ2Λ0)Tpoly(κ,N,n,1/Λ0)\displaystyle\leavevmode\nobreak\ e^{-(\kappa^{2}\Lambda_{0})T}\cdot\mathrm{poly}(\kappa,N,n,1/\Lambda_{0})

where the final step is derived from the conditions β(0)=0\beta(0)=0 and the norm β2\|\beta^{*}\|_{2} being bounded by a polynomial in κ,N,n,\kappa,N,n, and 1/Λ01/\Lambda_{0}.

It’s worth noting that κ\kappa lies in the interval (0,1). Consequently, by selecting T=O~(1κ2Λ0)T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), it follows that

ugntk,test(T)utest2ε/2,\displaystyle\|u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}\|_{2}\leq\varepsilon/2,

where O~()\widetilde{O}(\cdot) here hides polylog(Nn/(εΛ0))\mathrm{poly}\log(Nn/(\varepsilon\Lambda_{0})).∎

Lemma D.13.

Let ugntk,test,node(T)u_{\mathrm{gntk},\mathrm{test},\mathrm{node}}(T)\in\mathbb{R} and utest,nodeu_{\mathrm{test},\mathrm{node}}^{*}\in\mathbb{R} be defined as in Definition 3.8. Given

  • any accuracy ε>0\varepsilon>0,

if

  • κ(0,1)\kappa\in(0,1),

then by picking T=O~(1κ2Λ0)T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), we have

|ugntk,test,node(T)utest,node|ε/2.\displaystyle|u_{\mathrm{gntk},\mathrm{test},\mathrm{node}}(T)-u_{\mathrm{test},\mathrm{node}}^{*}|\leq\varepsilon/2.

D.4 Bounding initialization and kernel perturbation

In this section, we prove Lemma D.14, which shows that in order to bound the prediction perturbation, it suffices to bound both the initialization perturbation and the kernel perturbation. We prove this result using the Newton–Leibniz formula.

Lemma D.14 (Prediction perturbation implies kernel perturbation).

Given

  • training graph data 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} and corresponding label vector YnY\in\mathbb{R}^{n}.

  • the total number of iterations T>0T>0.

  • arbitrary test data GtestG_{\mathrm{test}}.

Let

  • ugnn,test(t)nu_{\mathrm{gnn},\mathrm{test}}(t)\in\mathbb{R}^{n} and ugntk,test(t)nu_{\mathrm{gntk},\mathrm{test}}(t)\in\mathbb{R}^{n} be the test data predictors defined in Definition D.2 and Definition D.4 respectively.

  • κ(0,1)\kappa\in(0,1) be the corresponding multiplier.

  • 𝖪gntk(Gtest,𝖦)n,𝖪t(Gtest,𝖦)n,𝖧ctsn×n,𝖧(t)n×n,Λ0>0\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})\in\mathbb{R}^{n},\leavevmode\nobreak\ \mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\in\mathbb{R}^{n},\leavevmode\nobreak\ \mathsf{H}^{\mathrm{cts}}\in\mathbb{R}^{n\times n},\leavevmode\nobreak\ \mathsf{H}(t)\in\mathbb{R}^{n\times n},\leavevmode\nobreak\ \Lambda_{0}>0 be defined in Definition D.3 and Definition D.5.

  • unu^{*}\in\mathbb{R}^{n} be defined as in Eq. (8).

  • εK(0,1)\varepsilon_{K}\in(0,1), εinit(0,1)\varepsilon_{\mathrm{init}}\in(0,1) and εH(0,1)\varepsilon_{H}\in(0,1) denote parameters that are independent of tt, and the following conditions hold for all t[0,T]t\in[0,T],

    • ugnn(0)2nεinit\|u_{\mathrm{gnn}}(0)\|_{2}\leq\sqrt{n}\varepsilon_{\mathrm{init}} and |ugnn,test(0)|εinit|u_{\mathrm{gnn},\mathrm{test}}(0)|\leq\varepsilon_{\mathrm{init}}

    • 𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦)2εK\|\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\leq\varepsilon_{K}

    • 𝖧(t)𝖧ctsεH\|\mathsf{H}(t)-\mathsf{H}^{\mathrm{cts}}\|\leq\varepsilon_{H}

then we have

|ugnn,test(T)ugntk,test(T)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|
\displaystyle\leq (1+κ2nT)εinit+εKu2Λ0+nT2κ4εHu2\displaystyle\leavevmode\nobreak\ (1+\kappa^{2}nT)\varepsilon_{\mathrm{init}}+\varepsilon_{K}\cdot\frac{\|u^{*}\|_{2}}{\Lambda_{0}}+\sqrt{n}T^{2}\kappa^{4}\varepsilon_{H}\|u^{*}\|_{2}
Proof.

Combining results from Lemma D.15, Claim D.16D.17D.18, we complete the proof. We have

|ugnn,test(T)ugntk,test(T)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|
\displaystyle\leq |ugnn,test(0)ugntk,test(0)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|
+κ2|0T(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)dt|\displaystyle\leavevmode\nobreak\ +\kappa^{2}\Big{|}\int_{0}^{T}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)\mathrm{d}t\Big{|}
+κ2|0T𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))dt|\displaystyle\leavevmode\nobreak\ +\kappa^{2}\Big{|}\int_{0}^{T}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t))\mathrm{d}t\Big{|}
\displaystyle\leq εinit+εKuΛ0+κ2nεinitT+nT2κ4εHu2\displaystyle\leavevmode\nobreak\ \varepsilon_{\mathrm{init}}+\varepsilon_{K}\cdot\frac{\|u^{*}\|}{\Lambda_{0}}+\kappa^{2}n\varepsilon_{\mathrm{init}}T+\sqrt{n}T^{2}\cdot\kappa^{4}\varepsilon_{H}\cdot\|u^{*}\|_{2}
\displaystyle\leq (1+κ2nT)εinit+εKu2Λ0+nT2κ4εHu2\displaystyle\leavevmode\nobreak\ (1+\kappa^{2}nT)\varepsilon_{\mathrm{init}}+\varepsilon_{K}\cdot\frac{\|u^{*}\|_{2}}{\Lambda_{0}}+\sqrt{n}T^{2}\kappa^{4}\varepsilon_{H}\|u^{*}\|_{2}

where the initial step is derived from Lemma D.15. The subsequent step is based on Claim D.16D.17, and D.18. The concluding step streamlines the expression. ∎

To prove Lemma D.14, we first show that |ugnn,test(T)ugntk,test(T)||u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)| can be upper bounded by three terms that are defined in Lemma D.15, then we bound each of these three terms respectively in Claim D.16, Claim D.17, and Claim D.18.

Lemma D.15.

Follow the same notation as Lemma D.14, we have

|ugnn,test(T)ugntk,test(T)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|
\displaystyle\leq |ugnn,test(0)ugntk,test(0)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|
+|0Tκ2(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)dt|\displaystyle\quad+\Big{|}\int_{0}^{T}\kappa^{2}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)\mathrm{d}t\Big{|}
+|0Tκ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))dt|.\displaystyle\quad+\Big{|}\int_{0}^{T}\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t))\mathrm{d}t\Big{|}.
Proof.
|ugnn,test(T)ugntk,test(T)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|
=\displaystyle= |ugnn,test(0)ugntk,test(0)\displaystyle\leavevmode\nobreak\ \Big{|}u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)
+0T(dugnn,test(t)dtdugntk,test(t)dt)dt|\displaystyle\quad+\int_{0}^{T}(\frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t}-\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t})\mathrm{d}t\Big{|}
\displaystyle\leq |ugnn,test(0)ugntk,test(0)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|
+|0T(dugnn,test(t)dtdugntk,test(t)dt)dt|,\displaystyle\quad+\Big{|}\int_{0}^{T}(\frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t}-\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t})\mathrm{d}t\Big{|}, (12)

where the initial step is derived from the integral’s definition. The subsequent step is based on the triangle inequality. It’s worth noting that, as per Corollary D.7 and  D.10, their respective gradient flows are described as

dugntk,test(t)dt\displaystyle\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t} =κ2𝖪gntk(Gtest,𝖦)(ugntk(t)Y)\displaystyle=-\kappa^{2}\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-Y) (13)
dugnn,test(t)dt\displaystyle\frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t} =κ2𝖪t(Gtest,𝖦)(ugnn(t)Y)\displaystyle=-\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gnn}}(t)-Y) (14)

where ugntk(t)nu_{\mathrm{gntk}}(t)\in\mathbb{R}^{n} and ugnn(t)nu_{\mathrm{gnn}}(t)\in\mathbb{R}^{n} are the predictors for training data defined in Definition D.4 and Definition D.2. Thus, we have

dugnn,test(t)dtdugntk,test(t)dt\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t}-\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t}
=\displaystyle= κ2𝖪t(Gtest,𝖦)(ugnn(t)Y)\displaystyle\leavevmode\nobreak\ -\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gnn}}(t)-Y)
+κ2𝖪gntk(Gtest,𝖦)(ugntk(t)Y)\displaystyle\quad+\kappa^{2}\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-Y)
=\displaystyle= κ2(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)\displaystyle\leavevmode\nobreak\ \kappa^{2}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)
κ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))\displaystyle\quad-\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)) (15)

where the initial step is derived from both Eq.(13) and Eq.(14). The subsequent step reformulates the formula. So we have

|0T(dugnn,test(t)dtdugntk,test(t)dt)dt|\displaystyle\leavevmode\nobreak\ \Big{|}\int_{0}^{T}(\frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t}-\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t})\mathrm{d}t\Big{|}
=\displaystyle= |0Tκ2((𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)\displaystyle\leavevmode\nobreak\ \Big{|}\int_{0}^{T}\kappa^{2}((\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)
κ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t)))dt|\displaystyle\quad-\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)))\mathrm{d}t\Big{|} (16)

Thus,

|ugnn,test(T)ugntk,test(T)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|
\displaystyle\leq |ugnn,test(0)ugntk,test(0)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|
+|0T(dugnn,test(t)dtdugntk,test(t)dt)dt|\displaystyle\quad+\Big{|}\int_{0}^{T}(\frac{\mathrm{d}u_{\mathrm{gnn},\mathrm{test}}(t)}{\mathrm{d}t}-\frac{\mathrm{d}u_{\mathrm{gntk},\mathrm{test}}(t)}{\mathrm{d}t})\mathrm{d}t\Big{|}
\displaystyle\leq |ugnn,test(0)ugntk,test(0)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|
+|0Tκ2((𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)\displaystyle\quad+\Big{|}\int_{0}^{T}\kappa^{2}((\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)
κ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t)))dt|\displaystyle\quad-\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)))\mathrm{d}t\Big{|}
\displaystyle\leq |ugnn,test(0)ugntk,test(0)|\displaystyle\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|
+|0Tκ2(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)dt|\displaystyle\quad+\Big{|}\int_{0}^{T}\kappa^{2}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)\mathrm{d}t\Big{|}
+|0Tκ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))dt|\displaystyle\quad+\Big{|}\int_{0}^{T}\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t))\mathrm{d}t\Big{|}

where the initial step is based on Eq. (12), the next step is derived from Eq. (16), and the third step arises from the triangle inequality. ∎

Now we bound each of these three terms |ugnn,test(0)ugntk,test(0)||u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|, |0Tκ2(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)dt|\Big{|}\int_{0}^{T}\kappa^{2}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)\mathrm{d}t\Big{|} and |0Tκ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))dt|\Big{|}\int_{0}^{T}\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t))\mathrm{d}t\Big{|} in the following three claims.

Claim D.16.

We have

|ugnn,test(0)ugntk,test(0)|εinit.\displaystyle|u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|\leq\varepsilon_{\mathrm{init}}.
Proof.

Note ugntk,test(0)=0u_{\mathrm{gntk},\mathrm{test}}(0)=0, so by assumption we have

|ugnn,test(0)ugntk,test(0)|=|ugnn,test(0)|εinit.\displaystyle|u_{\mathrm{gnn},\mathrm{test}}(0)-u_{\mathrm{gntk},\mathrm{test}}(0)|=|u_{\mathrm{gnn},\mathrm{test}}(0)|\leq\varepsilon_{\mathrm{init}}.

Claim D.17.

We have

|0Tκ2(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)dt|εKuΛ0.\displaystyle\Big{|}\int_{0}^{T}\kappa^{2}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)\mathrm{d}t\Big{|}\leq\varepsilon_{K}\cdot\frac{\|u^{*}\|}{\Lambda_{0}}.
Proof.

Note

κ2|0T(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)dt|\displaystyle\leavevmode\nobreak\ \kappa^{2}\Big{|}\int_{0}^{T}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)\mathrm{d}t\Big{|}
\displaystyle\leq κ2maxt[0,T]𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦)2\displaystyle\leavevmode\nobreak\ \kappa^{2}\max_{t\in[0,T]}\|\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}
0Tugntk(t)Y2dt,\displaystyle\quad\cdot\int_{0}^{T}\|u_{\mathrm{gntk}}(t)-Y\|_{2}\mathrm{d}t,

where the initial step is derived from the Cauchy-Schwarz inequality. It’s important to mention that, as per Lemma D.8, the kernel regression predictor ugntk(t)u_{\mathrm{gntk}}(t), which belongs to n\mathbb{R}^{n}, exhibits linear convergence towards the optimal predictor u=Yu^{*}=Y, which is also in n\mathbb{R}^{n}. Specifically,

ugntk(t)u2e(κ2Λ0)tugntk(0)u2.\displaystyle\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}\leq e^{-(\kappa^{2}\Lambda_{0})t}\|u_{\mathrm{gntk}}(0)-u^{*}\|_{2}. (17)

Thus, we have

0Tugntk(t)Y2dt\displaystyle\leavevmode\nobreak\ \int_{0}^{T}\|u_{\mathrm{gntk}}(t)-Y\|_{2}\mathrm{d}t
\displaystyle\leq 0Tugntk(t)u2dt+0TuY2dt\displaystyle\leavevmode\nobreak\ \int_{0}^{T}\|u_{\mathrm{gntk}}(t)-u^{*}\|_{2}\mathrm{d}t+\int_{0}^{T}\|u^{*}-Y\|_{2}\mathrm{d}t
\displaystyle\leq 0Te(κ2Λ0)tugntk(0)u2dt\displaystyle\leavevmode\nobreak\ \int_{0}^{T}e^{-(\kappa^{2}\Lambda_{0})t}\|u_{\mathrm{gntk}}(0)-u^{*}\|_{2}\mathrm{d}t
\displaystyle\leq ugntk(0)u2κ2Λ0\displaystyle\leavevmode\nobreak\ \frac{\|u_{\mathrm{gntk}}(0)-u^{*}\|_{2}}{\kappa^{2}\Lambda_{0}}
=\displaystyle= uκ2Λ0,\displaystyle\leavevmode\nobreak\ \frac{\|u^{*}\|}{\kappa^{2}\Lambda_{0}}, (18)

where the initial step arises from the triangle inequality. The subsequent step is based on Eq. (17) together with u=Yu^{*}=Y. The third step involves computing the integral, and the concluding step is derived from the condition ugntk(0)=0u_{\mathrm{gntk}}(0)=0. Thus, we have

|0Tκ2(𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦))(ugntk(t)Y)dt|\displaystyle\leavevmode\nobreak\ \Big{|}\int_{0}^{T}\kappa^{2}(\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G}))^{\top}(u_{\mathrm{gntk}}(t)-Y)\mathrm{d}t\Big{|}
\displaystyle\leq κ2maxt[0,T]𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦)20Tugntk(t)Y2dt\displaystyle\leavevmode\nobreak\ \kappa^{2}\max_{t\in[0,T]}\|\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\cdot\int_{0}^{T}\|u_{\mathrm{gntk}}(t)-Y\|_{2}\mathrm{d}t
\displaystyle\leq εKuΛ0.\displaystyle\leavevmode\nobreak\ \varepsilon_{K}\cdot\frac{\|u^{*}\|}{\Lambda_{0}}.

where the first step restates the Cauchy–Schwarz bound above, and the second step follows from Eq. (18) together with the definition of εK\varepsilon_{K}. ∎

Claim D.18.

We have

|0Tκ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))dt|κ2nεinitT+nT2κ4εHu2\displaystyle\Big{|}\int_{0}^{T}\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t))\mathrm{d}t\Big{|}\leq\kappa^{2}n\varepsilon_{\mathrm{init}}T+\sqrt{n}T^{2}\cdot\kappa^{4}\varepsilon_{H}\cdot\|u^{*}\|_{2}
Proof.

Note

κ2|0T𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))dt|\displaystyle\leavevmode\nobreak\ \kappa^{2}\Big{|}\int_{0}^{T}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t))\mathrm{d}t\Big{|}
\displaystyle\leq κ2maxt[0,T]𝖪t(Gtest,𝖦)2maxt[0,T]ugntk(t)ugnn(t)2T\displaystyle\leavevmode\nobreak\ \kappa^{2}\max_{t\in[0,T]}\|\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\max_{t\in[0,T]}\|u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)\|_{2}\cdot T (19)

where the first step follows from the Cauchy–Schwarz inequality.

To bound term maxt[0,T]ugntk(t)ugnn(t)2\max_{t\in[0,T]}\|u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)\|_{2}, notice that for any t[0,T]t\in[0,T], we have

ugntk(t)ugnn(t)2\displaystyle\leavevmode\nobreak\ \|u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)\|_{2}
\displaystyle\leq ugntk(0)ugnn(0)2+0td(ugntk(τ)ugnn(τ))dτdτ2\displaystyle\leavevmode\nobreak\ \|u_{\mathrm{gntk}}(0)-u_{\mathrm{gnn}}(0)\|_{2}+\Big{\|}\int_{0}^{t}\frac{\mathrm{d}(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau))}{\mathrm{d}\tau}\mathrm{d}\tau\Big{\|}_{2}
=\displaystyle= nεinit+0td(ugntk(τ)ugnn(τ))dτdτ2,\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+\Big{\|}\int_{0}^{t}\frac{\mathrm{d}(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau))}{\mathrm{d}\tau}\mathrm{d}\tau\Big{\|}_{2}, (20)

where the first step follows from the triangle inequality, and the second step follows from ugntk(0)=0u_{\mathrm{gntk}}(0)=0 together with the assumption ugnn(0)2nεinit\|u_{\mathrm{gnn}}(0)\|_{2}\leq\sqrt{n}\varepsilon_{\mathrm{init}}. Further,

d(ugntk(τ)ugnn(τ))dτ\displaystyle\leavevmode\nobreak\ \frac{\mathrm{d}(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau))}{\mathrm{d}\tau}
=\displaystyle= κ2Hcts(ugntk(τ)Y)+κ2H(τ)(ugnn(τ)Y)\displaystyle\leavevmode\nobreak\ -\kappa^{2}H^{\mathrm{cts}}(u_{\mathrm{gntk}}(\tau)-Y)+\kappa^{2}H(\tau)(u_{\mathrm{gnn}}(\tau)-Y)
=\displaystyle= (κ2H(τ))(ugntk(τ)ugnn(τ))\displaystyle\leavevmode\nobreak\ -(\kappa^{2}H(\tau))(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau))
+κ2(H(τ)Hcts)(ugntk(τ)Y),\displaystyle\quad+\kappa^{2}(H(\tau)-H^{\mathrm{cts}})(u_{\mathrm{gntk}}(\tau)-Y),

where the first step follows from Corollary D.7 and Corollary D.10, and the second step rewrites the formula. Since κ2H(τ)\kappa^{2}H(\tau) is positive semidefinite, the term (κ2H(τ))(ugntk(τ)ugnn(τ))-(\kappa^{2}H(\tau))(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau)) only makes 0td(ugntk(τ)ugnn(τ))dτdτ2\|\int_{0}^{t}\frac{\mathrm{d}(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau))}{\mathrm{d}\tau}\mathrm{d}\tau\|_{2} smaller, so we may drop it.

Taking the integral and apply the 2\ell_{2} norm, we have

0td(ugntk(τ)ugnn(τ))dτdτ2\displaystyle\leavevmode\nobreak\ \Big{\|}\int_{0}^{t}\frac{\mathrm{d}(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau))}{\mathrm{d}\tau}\mathrm{d}\tau\Big{\|}_{2}
\displaystyle\leq 0tκ2(H(τ)Hcts)(ugntk(τ)Y)dτ2.\displaystyle\leavevmode\nobreak\ \Big{\|}\int_{0}^{t}\kappa^{2}(H(\tau)-H^{\mathrm{cts}})(u_{\mathrm{gntk}}(\tau)-Y)\mathrm{d}\tau\Big{\|}_{2}. (21)

Thus,

maxt[0,T]ugntk(t)ugnn(t)2\displaystyle\leavevmode\nobreak\ \max_{t\in[0,T]}\|u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)\|_{2}
\displaystyle\leq nεinit+maxt[0,T]0td(ugntk(τ)ugnn(τ))dτdτ2\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+\max_{t\in[0,T]}\Big{\|}\int_{0}^{t}\frac{\mathrm{d}(u_{\mathrm{gntk}}(\tau)-u_{\mathrm{gnn}}(\tau))}{\mathrm{d}\tau}\mathrm{d}\tau\Big{\|}_{2}
\displaystyle\leq nεinit+maxt[0,T]0tκ2(H(τ)Hcts)(ugntk(τ)Y)dτ2\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+\max_{t\in[0,T]}\Big{\|}\int_{0}^{t}\kappa^{2}(H(\tau)-H^{\mathrm{cts}})(u_{\mathrm{gntk}}(\tau)-Y)\mathrm{d}\tau\Big{\|}_{2}
\displaystyle\leq nεinit+maxt[0,T]0tκ2H(τ)Hctsugntk(τ)Y2dτ\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+\max_{t\in[0,T]}\int_{0}^{t}\kappa^{2}\|H(\tau)-H^{\mathrm{cts}}\|\cdot\|u_{\mathrm{gntk}}(\tau)-Y\|_{2}\mathrm{d}\tau
\displaystyle\leq nεinit+maxt[0,T]κ2εH(0tugntk(τ)u2dτ+0tuY2dτ)\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+\max_{t\in[0,T]}\kappa^{2}\varepsilon_{H}\cdot\Big{(}\int_{0}^{t}\|u_{\mathrm{gntk}}(\tau)-u^{*}\|_{2}\mathrm{d}\tau+\int_{0}^{t}\|u^{*}-Y\|_{2}\mathrm{d}\tau\Big{)}
\displaystyle\leq nεinit+maxt[0,T]κ2εH0tugntk(0)u2dτ\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+\max_{t\in[0,T]}\kappa^{2}\varepsilon_{H}\int_{0}^{t}\|u_{\mathrm{gntk}}(0)-u^{*}\|_{2}\mathrm{d}\tau
\displaystyle\leq nεinit+maxt[0,T]tκ2εHu2\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+\max_{t\in[0,T]}t\cdot\kappa^{2}\varepsilon_{H}\cdot\|u^{*}\|_{2}
\displaystyle\leq nεinit+Tκ2εHu2\displaystyle\leavevmode\nobreak\ \sqrt{n}\varepsilon_{\mathrm{init}}+T\cdot\kappa^{2}\varepsilon_{H}\cdot\|u^{*}\|_{2} (22)

where the first step is based on Eq. (20), the second step follows from Eq. (21), the third step follows from the triangle inequality, the fourth step uses the condition $\|H(\tau)-H^{\mathrm{cts}}\|\leq\varepsilon_{H}$ for all $\tau\leq T$ together with the triangle inequality, the fifth step uses the linear convergence of $\|u_{\mathrm{gntk}}(\tau)-u^{*}\|_{2}$ described in Lemma D.8, the sixth step uses the condition $u_{\mathrm{gntk}}(0)=0$, and the last step takes the maximum over $t\in[0,T]$. Therefore,

|0Tκ2𝖪t(Gtest,𝖦)(ugntk(t)ugnn(t))dt|\displaystyle\leavevmode\nobreak\ \Big{|}\int_{0}^{T}\kappa^{2}\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})^{\top}(u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t))\mathrm{d}t\Big{|}
\displaystyle\leq κ2maxt[0,T]𝖪t(Gtest,𝖦)2maxt[0,T]ugntk(t)ugnn(t)2T\displaystyle\leavevmode\nobreak\ \kappa^{2}\max_{t\in[0,T]}\|\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\max_{t\in[0,T]}\|u_{\mathrm{gntk}}(t)-u_{\mathrm{gnn}}(t)\|_{2}\cdot T
\displaystyle\leq κ2maxt[0,T]𝖪t(Gtest,𝖦)2(nεinitT+T2κ2εHu2)\displaystyle\leavevmode\nobreak\ \kappa^{2}\max_{t\in[0,T]}\|\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\cdot(\sqrt{n}\varepsilon_{\mathrm{init}}T+T^{2}\cdot\kappa^{2}\varepsilon_{H}\cdot\|u^{*}\|_{2})
\displaystyle\leq κ2nεinitT+nT2κ4εHu2\displaystyle\leavevmode\nobreak\ \kappa^{2}n\varepsilon_{\mathrm{init}}T+\sqrt{n}T^{2}\cdot\kappa^{4}\varepsilon_{H}\cdot\|u^{*}\|_{2}

where the first step follows from Eq. (19), the second step follows from Eq. (22), and the last step uses the bound $\mathsf{K}_{t}(G,H)\leq 1$, which holds under the data assumptions on the graphs $G$ and $H$ and implies $\|\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\leq\sqrt{n}$. ∎

D.5 Concentration results for kernels

For simplicity, we use $x_{i,l}$ to denote $h_{G_{i},u}$, where $u$ is the $l$-th node among the $N$ nodes of the $i$-th graph $G_{i}$. For convenience, for each $(i,j)\in[n]\times[n]$, we write

fgnn(W,Gi)W,fgnn(W,Gj)W\displaystyle\leavevmode\nobreak\ \Big{\langle}\frac{\partial f_{\mathrm{gnn}}(W,G_{i})}{\partial W},\frac{\partial f_{\mathrm{gnn}}(W,G_{j})}{\partial W}\Big{\rangle}
=\displaystyle= 1mr=1ml1=1Nl2=1Nxi,l1xj,l2𝟏wrxi,l10𝟏wrxj,l20\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}x_{i,l_{1}}^{\top}x_{j,l_{2}}{\bf 1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0}{\bf 1}_{w_{r}^{\top}x_{j,l_{2}}\geq 0}

Recall the analogous quantity for a classical (non-graph) neural network:

fnn(W,xi)W,fnn(W,xj)W\displaystyle\leavevmode\nobreak\ \Big{\langle}\frac{\partial f_{\mathrm{nn}}(W,x_{i})}{\partial W},\frac{\partial f_{\mathrm{nn}}(W,x_{j})}{\partial W}\Big{\rangle}
=\displaystyle= 1mr=1mxixj𝟏wrxi0𝟏wrxj0\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}x_{i}^{\top}x_{j}{\bf 1}_{w_{r}^{\top}x_{i}\geq 0}{\bf 1}_{w_{r}^{\top}x_{j}\geq 0}
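To make these two quantities concrete, the following minimal NumPy sketch evaluates both gradient inner products by direct summation over the random features; the synthetic features, sizes, and function names are illustrative assumptions rather than part of the formal setup.

```python
import numpy as np

def graph_kernel_entry(W, Xi, Xj):
    """(1/m) * sum_r sum_{l1,l2} <x_{i,l1}, x_{j,l2}> 1{w_r^T x_{i,l1} >= 0} 1{w_r^T x_{j,l2} >= 0}.
    W: (m, d) rows w_r;  Xi, Xj: (N, d) node features of the two graphs."""
    m = W.shape[0]
    Ai = (Xi @ W.T >= 0).astype(float)            # (N, m), entries 1{w_r^T x_{i,l} >= 0}
    Aj = (Xj @ W.T >= 0).astype(float)
    return np.einsum('ar,ab,br->', Ai, Xi @ Xj.T, Aj) / m

def nn_kernel_entry(W, xi, xj):
    """Classical two-layer ReLU kernel entry: (1/m) * sum_r <x_i, x_j> 1{w_r^T x_i >= 0} 1{w_r^T x_j >= 0}."""
    m = W.shape[0]
    mask = ((W @ xi >= 0) & (W @ xj >= 0)).astype(float)
    return (xi @ xj) * mask.sum() / m

rng = np.random.default_rng(0)
d, N, m = 4, 3, 4096
Xi, Xj = rng.normal(size=(N, d)), rng.normal(size=(N, d))
W = rng.normal(size=(m, d))
print(graph_kernel_entry(W, Xi, Xj), nn_kernel_entry(W, Xi[0], Xj[0]))
```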

In this section, we prove a result that generalizes Lemma 3.1 of [75]. It shows that when the width $m$ is sufficiently large, the continuous and discrete versions of the Gram matrix of the input data are close in spectral norm.

Lemma D.19.

We write 𝖧cts,𝖧disn×n\mathsf{H}^{\mathrm{cts}},\mathsf{H}^{\mathrm{dis}}\in\mathbb{R}^{n\times n} as follows

𝖧i,jcts=\displaystyle\mathsf{H}^{\mathrm{cts}}_{i,j}= 𝔼w𝒩(0,I)[l1,l2xi,l1xj,l2𝟏wxi,l10,wxj,l20],\displaystyle\leavevmode\nobreak\ \operatorname*{{\mathbb{E}}}_{w\sim\mathcal{N}(0,I)}\left[\sum_{l_{1},l_{2}}x_{i,l_{1}}^{\top}x_{j,l_{2}}{\bf 1}_{w^{\top}x_{i,l_{1}}\geq 0,w^{\top}x_{j,l_{2}}\geq 0}\right],
𝖧i,jdis=\displaystyle\mathsf{H}^{\mathrm{dis}}_{i,j}= 1mr=1m[l1,l2xi,l1xj,l2𝟏wrxi,l10,wrxj,l20].\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\left[\sum_{l_{1},l_{2}}x_{i,l_{1}}^{\top}x_{j,l_{2}}{\bf 1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0,w_{r}^{\top}x_{j,l_{2}}\geq 0}\right].

Let λ=λmin(𝖧cts)\lambda=\lambda_{\min}(\mathsf{H}^{\mathrm{cts}}). If m=Ω(λ2N2n2log(n/δ))m=\Omega(\lambda^{-2}N^{2}n^{2}\log(n/\delta)), we have

𝖧dis𝖧ctsFλ4,andλmin(𝖧dis)34λ.\displaystyle\|\mathsf{H}^{\mathrm{dis}}-\mathsf{H}^{\mathrm{cts}}\|_{F}\leq\frac{\lambda}{4},\mathrm{\leavevmode\nobreak\ and\leavevmode\nobreak\ }\lambda_{\min}(\mathsf{H}^{\mathrm{dis}})\geq\frac{3}{4}\lambda.

hold with probability at least 1δ1-\delta.

Proof.

For every fixed pair (i,j)(i,j), 𝖧i,jdis\mathsf{H}_{i,j}^{\mathrm{dis}} is an average of independent random variables, i.e.

𝖧i,jdis=1mr=1ml1=1Nl2=1Nxi,l1xj,l2𝟏wrxi,l10,wrxj,l20.\displaystyle\mathsf{H}_{i,j}^{\mathrm{dis}}=\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}x_{i,l_{1}}^{\top}x_{j,l_{2}}\mathbf{1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0,w_{r}^{\top}x_{j,l_{2}}\geq 0}.

Then the expectation of 𝖧i,jdis\mathsf{H}_{i,j}^{\mathrm{dis}} is

𝔼[𝖧i,jdis]\displaystyle\leavevmode\nobreak\ \operatorname*{{\mathbb{E}}}[\mathsf{H}_{i,j}^{\mathrm{dis}}]
=\displaystyle= 1mr=1ml1=1Nl2=1N𝔼wr𝒩(0,Id)[xi,l1xj,l2𝟏wrxi,l10,wrxj,l20]\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}\operatorname*{{\mathbb{E}}}_{w_{r}\sim{\mathcal{N}}(0,I_{d})}\left[x_{i,l_{1}}^{\top}x_{j,l_{2}}\mathbf{1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0,w_{r}^{\top}x_{j,l_{2}}\geq 0}\right]
=\displaystyle= 𝔼w𝒩(0,Id)l1=1Nl2=1N[xi,l1xj,l2𝟏wrxi,l10,wrxj,l20]\displaystyle\leavevmode\nobreak\ \operatorname*{{\mathbb{E}}}_{w\sim{\mathcal{N}}(0,I_{d})}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}\left[x_{i,l_{1}}^{\top}x_{j,l_{2}}\mathbf{1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0,w_{r}^{\top}x_{j,l_{2}}\geq 0}\right]
=\displaystyle= 𝖧i,jcts.\displaystyle\leavevmode\nobreak\ \mathsf{H}_{i,j}^{\mathrm{cts}}.

For r[m]r\in[m], we define zrz_{r}

zr:=1ml1,l2xi,l1xj,l2𝟏wrxi,l10,wrxj,l20.\displaystyle z_{r}:=\frac{1}{m}\sum_{l_{1},l_{2}}x_{i,l_{1}}^{\top}x_{j,l_{2}}\mathbf{1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0,w_{r}^{\top}x_{j,l_{2}}\geq 0}.

Then zrz_{r} is a random function of wrw_{r}, hence {zr}r[m]\{z_{r}\}_{r\in[m]} are mutually independent. Moreover, 1mN2zr1mN2-\frac{1}{m}N^{2}\leq z_{r}\leq\frac{1}{m}N^{2}. So by Hoeffding inequality (Lemma A.1) we have for all t>0t>0,

Pr[|𝖧i,jdis𝖧i,jcts|t]\displaystyle\Pr\left[|\mathsf{H}_{i,j}^{\mathrm{dis}}-\mathsf{H}_{i,j}^{\mathrm{cts}}|\geq t\right]\leq 2exp(2t24N2/m)\displaystyle\leavevmode\nobreak\ 2\exp\Big{(}-\frac{2t^{2}}{4N^{2}/m}\Big{)}
=\displaystyle= 2exp(mt2/(2N2)).\displaystyle\leavevmode\nobreak\ 2\exp(-mt^{2}/(2N^{2})).

Setting t=(1m2N2log(2n2/δ))1/2t=(\frac{1}{m}2N^{2}\log(2n^{2}/\delta))^{1/2}, we can apply union bound on all pairs (i,j)(i,j) to get with probability at least 1δ1-\delta, for all i,j[n]i,j\in[n],

|𝖧i,jdis𝖧i,jcts|\displaystyle|\mathsf{H}_{i,j}^{\mathrm{dis}}-\mathsf{H}_{i,j}^{\mathrm{cts}}|\leq (2mN2log(2n2/δ))1/2\displaystyle\leavevmode\nobreak\ \Big{(}\frac{2}{m}N^{2}\log(2n^{2}/\delta)\Big{)}^{1/2}
\displaystyle\leq 4(N2log(n/δ)m)1/2.\displaystyle\leavevmode\nobreak\ 4\Big{(}\frac{N^{2}\log(n/\delta)}{m}\Big{)}^{1/2}.

Thus we have

𝖧dis𝖧cts2\displaystyle\|\mathsf{H}^{\mathrm{dis}}-\mathsf{H}^{\mathrm{cts}}\|^{2}\leq 𝖧dis𝖧ctsF2\displaystyle\leavevmode\nobreak\ \|\mathsf{H}^{\mathrm{dis}}-\mathsf{H}^{\mathrm{cts}}\|_{F}^{2}
=\displaystyle= i=1nj=1n|𝖧i,jdis𝖧i,jcts|2\displaystyle\leavevmode\nobreak\ \sum_{i=1}^{n}\sum_{j=1}^{n}|\mathsf{H}_{i,j}^{\mathrm{dis}}-\mathsf{H}_{i,j}^{\mathrm{cts}}|^{2}
\displaystyle\leq 1m16N2n2log(n/δ).\displaystyle\leavevmode\nobreak\ \frac{1}{m}16N^{2}n^{2}\log(n/\delta).

Hence if m=Ω(λ2N2n2log(n/δ))m=\Omega(\lambda^{-2}N^{2}n^{2}\log(n/\delta)) we have the desired result. ∎
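As an informal sanity check on Lemma D.19, the sketch below estimates $\mathsf{H}^{\mathrm{cts}}$ with a very wide random-feature network and observes the Frobenius gap to $\mathsf{H}^{\mathrm{dis}}$ shrinking as $m$ grows, roughly at the $1/\sqrt{m}$ rate suggested by the proof; all data and sizes are synthetic and illustrative.

```python
import numpy as np

def discrete_gram(W, graphs):
    """[H^dis]_{ij} = (1/m) sum_r sum_{l1,l2} x_{i,l1}^T x_{j,l2}
    1{w_r^T x_{i,l1} >= 0, w_r^T x_{j,l2} >= 0};  graphs is a list of (N, d) feature arrays."""
    n, m = len(graphs), W.shape[0]
    acts = [(X @ W.T >= 0).astype(float) for X in graphs]     # indicators, each (N, m)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            H[i, j] = np.einsum('ar,ab,br->', acts[i], graphs[i] @ graphs[j].T, acts[j]) / m
    return H

rng = np.random.default_rng(1)
n, N, d = 5, 3, 6
graphs = [rng.normal(size=(N, d)) for _ in range(n)]
H_ref = discrete_gram(rng.normal(size=(100000, d)), graphs)   # very wide network as a proxy for H^cts
for m in (100, 1000, 10000):
    H_dis = discrete_gram(rng.normal(size=(m, d)), graphs)
    print(m, np.linalg.norm(H_dis - H_ref, 'fro'))            # gap shrinks roughly like 1/sqrt(m)
```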

Similarly, we can show

Lemma D.20.

Let

  • R0(0,1)R_{0}\in(0,1).

If the following conditions hold:

  • w~1,,w~m\widetilde{w}_{1},\cdots,\widetilde{w}_{m} are i.i.d. generated from 𝒩(0,I){\cal N}(0,I).

  • For any set of weight vectors w1,,wmdw_{1},\cdots,w_{m}\in\mathbb{R}^{d} that satisfy for any r[m]r\in[m], w~rwr2R0\|\widetilde{w}_{r}-w_{r}\|_{2}\leq R_{0},

then for the matrix-valued function $\mathsf{H}:\mathbb{R}^{m\times d}\rightarrow\mathbb{R}^{n\times n}$ defined by

[𝖧(W)]i,j=1mr=1ml1=1Nl2=1Nxi,l1xj,l2𝟏wrxi,l10,wrxj,l20.\displaystyle[\mathsf{H}(W)]_{i,j}=\frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}x_{i,l_{1}}^{\top}x_{j,l_{2}}{\bf 1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0,w_{r}^{\top}x_{j,l_{2}}\geq 0}.

we have

𝖧(w)𝖧(w~)F<2NnR0\displaystyle\|\mathsf{H}(w)-\mathsf{H}(\widetilde{w})\|_{F}<2NnR_{0}

holds with probability at least 1N2n2exp(mR0/10)1-N^{2}n^{2}\cdot\exp(-mR_{0}/10).

D.6 Upper bounding initialization perturbation

In this section we bound $\varepsilon_{\mathrm{init}}\leq\varepsilon$ by choosing a sufficiently small $\kappa$. Our goal is to prove Lemma D.21, which shows that the predictor's output is small at initialization. The proof relies on Gaussian tail bounds.

Lemma D.21 (Bounding initialization perturbation).

Let fgnnf_{\mathrm{gnn}} be as defined in Definition D.1. Assume

  • the initial weights of the network $w_{r}(0)\in\mathbb{R}^{d},\leavevmode\nobreak\ r=1,\cdots,m$ as defined in Definition D.2 are drawn independently from the standard Gaussian distribution $\mathcal{N}(0,I_{d})$.

  • and ar,r=1,,ma_{r}\in\mathbb{R},\leavevmode\nobreak\ r=1,\cdots,m as defined in Definition D.1 are drawn independently from unif[{1,+1}]\operatorname{unif}[\{-1,+1\}].

Let

  • κ(0,1)\kappa\in(0,1), ugnn(t)u_{\mathrm{gnn}}(t) and ugnn,test(t)u_{\mathrm{gnn},\mathrm{test}}(t)\in\mathbb{R} be defined as in Definition D.2.

For any data $G$, assume its corresponding features $\{x_{l}\}_{l\in[N]}\subset\mathbb{R}^{d}$ satisfy $\|x_{l}\|_{2}\leq R$ for all $l\in[N]$. Then, with probability $1-\delta$,

|fgnn(W(0),G)|2NRlog(2Nm/δ).\displaystyle|f_{\mathrm{gnn}}(W(0),G)|\leq 2NR\log(2Nm/\delta).

Further, given any accuracy $\varepsilon\in(0,1)$, if $\kappa=\widetilde{O}(\varepsilon\Lambda_{0}/(NRn))$ and we let $\varepsilon_{\mathrm{init}}=\varepsilon\Lambda_{0}/(NRn)$, then we have

|ugnn(0)|nεinitand|ugnn,test(0)|εinit\displaystyle|u_{\mathrm{gnn}}(0)|\leq\sqrt{n}\varepsilon_{\mathrm{init}}\leavevmode\nobreak\ \text{and}\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)|\leq\varepsilon_{\mathrm{init}}

hold with probability 1δ1-\delta, where O~()\widetilde{O}(\cdot) hides the polylog(Nn/(εδΛ0))\mathrm{poly}\log(Nn/(\varepsilon\delta\Lambda_{0})).

Proof.

Note by definition,

fgnn(W(0),G)=1mr=1marl=1Nσ(wr(0)xl).\displaystyle f_{\mathrm{gnn}}(W(0),G)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sum_{l=1}^{N}\sigma(w_{r}(0)^{\top}x_{l}).

Since $w_{r}(0)\sim\mathcal{N}(0,I_{d})$, we have $w_{r}(0)^{\top}x_{\mathrm{test},l}\sim\mathcal{N}(0,\|x_{\mathrm{test},l}\|_{2}^{2})$. Note that $\|x_{\mathrm{test},l}\|_{2}\leq R$ for all $l\in[N]$, so by the Gaussian tail bound (Lemma A.3), we have with probability $1-\delta/(2Nm)$:

|wr(0)x|R2log(2Nm/δ).\displaystyle|w_{r}(0)^{\top}x|\leq R\sqrt{2\log(2Nm/\delta)}. (23)

Conditioning on Eq. (23) holding for all $r\in[m]$, denote $Z_{r}=a_{r}\sum_{l=1}^{N}\sigma(w_{r}(0)^{\top}x_{l})$; then we have $\operatorname*{{\mathbb{E}}}[Z_{r}]=0$ and $|Z_{r}|\leq NR\sqrt{2\log(2Nm/\delta)}$. By Lemma A.1, with probability $1-\delta/2$:

|r=1mZr|2NRmlog(2Nm/δ).\displaystyle\Big{|}\sum_{r=1}^{m}Z_{r}\Big{|}\leq 2NR\sqrt{m}\log{(2Nm/\delta)}. (24)

Since $f_{\mathrm{gnn}}(W(0),G_{\mathrm{test}})=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}Z_{r}$, by combining Eq. (23), (24) and a union bound over all $r\in[m]$, we have with probability $1-\delta$:

\displaystyle|f_{\mathrm{gnn}}(W(0),G_{\mathrm{test}})|\leq 2NR\log(2Nm/\delta).

Further, note [ugnn(0)]i=κfgnn(W(0),Gi)[u_{\mathrm{gnn}}(0)]_{i}=\kappa f_{\mathrm{gnn}}(W(0),G_{i}) and ugnn,test(0)=κfgnn(W(0),Gtest)u_{\mathrm{gnn},\mathrm{test}}(0)=\kappa f_{\mathrm{gnn}}(W(0),G_{\mathrm{test}}). Thus, by choosing

κ=O~(εΛ0/(NRn)),\displaystyle\kappa=\widetilde{O}(\varepsilon\Lambda_{0}/(NRn)),

taking the union bound over all training and test data, we have

|ugnn(0)|nεinitand|ugnn,test(0)|εinit\displaystyle|u_{\mathrm{gnn}}(0)|\leq\sqrt{n}\varepsilon_{\mathrm{init}}\leavevmode\nobreak\ \text{and}\leavevmode\nobreak\ |u_{\mathrm{gnn},\mathrm{test}}(0)|\leq\varepsilon_{\mathrm{init}}

hold with probability 1δ1-\delta, where O~()\widetilde{O}(\cdot) hides the polylog(Nn/(εδΛ0))\mathrm{poly}\log(Nn/(\varepsilon\delta\Lambda_{0})). ∎
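The following minimal sketch mirrors the calculation above: it evaluates the two-layer predictor $f_{\mathrm{gnn}}$ at a random initialization and then scales it by a small multiplier $\kappa$, illustrating why the initialization perturbation $\varepsilon_{\mathrm{init}}$ can be made small. The normalization $R=1$, the sizes, and all names are illustrative assumptions.

```python
import numpy as np

def f_gnn_init(W0, a, X):
    """Two-layer predictor at initialization:
    f(W(0), G) = (1/sqrt(m)) * sum_r a_r * sum_l relu(w_r(0)^T x_l),
    with X the (N, d) features {x_l} of graph G."""
    m = W0.shape[0]
    hidden = np.maximum(X @ W0.T, 0.0)            # (N, m) activations relu(w_r^T x_l)
    return (a * hidden.sum(axis=0)).sum() / np.sqrt(m)

rng = np.random.default_rng(2)
N, d, m = 4, 8, 4096
X = rng.normal(size=(N, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # enforce ||x_l||_2 <= R with R = 1
W0 = rng.normal(size=(m, d))                      # w_r(0) ~ N(0, I_d)
a = rng.choice([-1.0, 1.0], size=m)               # a_r ~ unif{-1, +1}
out = f_gnn_init(W0, a, X)
kappa = 1e-3                                      # small multiplier, in the spirit of Lemma D.21
print("f_gnn(W(0), G) =", out, "  kappa * f_gnn(W(0), G) =", kappa * out)
```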

D.7 Upper bounding kernel perturbation

In this section, our goal is to prove Lemma D.22. We bound the kernel perturbation using induction. More specifically, we prove four induction lemmas, each of which states that a property still holds after one more step. We then combine them to prove Lemma D.22.

Lemma D.22 (Bounding kernel perturbation).

Given

  • training data $\mathsf{G}\in\mathbb{R}^{n\times d\times N}$, labels $Y\in\mathbb{R}^{n}$, and a test graph $G_{\mathrm{test}}\in\mathbb{R}^{d}$.

Let

  • T>0T>0 denotes the total number of iterations,

  • m>0m>0 denotes the width of the network,

  • εtrain\varepsilon_{\mathrm{train}} denotes a fixed training error threshold,

  • δ>0\delta>0 denotes the failure probability.

  • ugnn(t)nu_{\mathrm{gnn}}(t)\in\mathbb{R}^{n} and ugntk(t)nu_{\mathrm{gntk}}(t)\in\mathbb{R}^{n} be the training data predictors defined in Definition D.2 and Definition D.4 respectively.

  • κ(0,1)\kappa\in(0,1) be the corresponding multiplier.

  • 𝖪gntk(Gtest,𝖦)n,𝖪t(Gtest,𝖦)n,𝖧(t)n×n,Λ0>0\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})\in\mathbb{R}^{n},\leavevmode\nobreak\ \mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\in\mathbb{R}^{n},\leavevmode\nobreak\ \mathsf{H}(t)\in\mathbb{R}^{n\times n},\leavevmode\nobreak\ \Lambda_{0}>0 be the kernel related quantities defined in Definition D.3 and Definition D.5.

  • unu^{*}\in\mathbb{R}^{n} be defined as in Eq. (8).

  • W(t)=[w1(t),,wm(t)]d×mW(t)=[w_{1}(t),\cdots,w_{m}(t)]\in\mathbb{R}^{d\times m} be the parameters of the neural network defined in Definition D.2.

For any accuracy ε(0,1/10)\varepsilon\in(0,1/10). If

  • κ=O~(εΛ0NRn)\kappa=\widetilde{O}(\frac{\varepsilon\Lambda_{0}}{NRn}), T=O~(1κ2Λ0)T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), εtrain=O~(ugnn(0)u2)\varepsilon_{\mathrm{train}}=\widetilde{O}(\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}), mO~(N2n10dε6Λ010)m\geq\widetilde{O}(\frac{N^{2}n^{10}d}{\varepsilon^{6}\Lambda_{0}^{10}}), with probability 1δ1-\delta,

there exist εW,εH,εK>0\varepsilon_{W},\leavevmode\nobreak\ \varepsilon_{H}^{\prime},\leavevmode\nobreak\ \varepsilon_{K}^{\prime}>0 that are independent of tt, such that the following hold for all 0tT0\leq t\leq T:

  • 1. wr(0)wr(t)2εW\|w_{r}(0)-w_{r}(t)\|_{2}\leq\varepsilon_{W}, r[m]\forall r\in[m]

  • 2. 𝖧(0)𝖧(t)2εH\|\mathsf{H}(0)-\mathsf{H}(t)\|_{2}\leq\varepsilon_{H}^{\prime}

  • 3. ugnn(t)u22max{exp((κ2Λ0)t/2)ugnn(0)u22,εtrain2}\|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}\leq\max\{\exp(-(\kappa^{2}\Lambda_{0})t/2)\cdot\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}^{2},\leavevmode\nobreak\ \varepsilon_{\mathrm{train}}^{2}\}

  • 4. 𝖪0(Gtest,𝖦)𝖪t(Gtest,𝖦)2εK\|\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\leq\varepsilon_{K}^{\prime}

Further,

εWO~(εΛ02n2),εHO~(εΛ02n)andεKO~(εΛ02n1.5).\displaystyle\varepsilon_{W}\leq\widetilde{O}(\frac{\varepsilon\Lambda_{0}^{2}}{n^{2}}),\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \varepsilon_{H}^{\prime}\leq\widetilde{O}(\frac{\varepsilon\Lambda_{0}^{2}}{n})\mathrm{\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ and\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ }\varepsilon_{K}^{\prime}\leq\widetilde{O}(\frac{\varepsilon\Lambda_{0}^{2}}{n^{1.5}}).

Here O~()\widetilde{O}(\cdot) hides the polylog(Nn/(εδΛ0))\mathrm{poly}\log(Nn/(\varepsilon\delta\Lambda_{0})).

In order to prove this lemma, we first state some helpful concentration results for random initialization.

Lemma D.23 (Random initialization result).

Assume

  • initial value wr(0)d,r=1,,mw_{r}(0)\in\mathbb{R}^{d},\leavevmode\nobreak\ r=1,\cdots,m are drawn independently from standard Gaussian distribution 𝒩(0,Id)\mathcal{N}(0,I_{d}),

then with probability 13δ1-3\delta we have

wr(0)22d+2log(m/δ)for allr[m]\displaystyle\leavevmode\nobreak\ \|w_{r}(0)\|_{2}\leq 2\sqrt{d}+2\sqrt{\log{(m/\delta)}}\leavevmode\nobreak\ \text{for all}\leavevmode\nobreak\ r\in[m] (25)
𝖧(0)𝖧cts4Nn(log(n/δ)/m)1/2\displaystyle\leavevmode\nobreak\ \|\mathsf{H}(0)-\mathsf{H}^{\mathrm{cts}}\|\leq 4Nn(\log(n/\delta)/m)^{1/2} (26)
𝖪0(Gtest,𝖦)𝖪gntk(Gtest,𝖦)2(2N2nlog(2n/δ)/m)1/2\displaystyle\leavevmode\nobreak\ \|\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})\|_{2}\leq(2N^{2}n\log{(2n/\delta)}/m)^{1/2} (27)
Proof.

By standard concentration inequality, with probability at least 1δ1-\delta,

wr(0)2d+log(m/δ)\displaystyle\|w_{r}(0)\|_{2}\leq\sqrt{d}+\sqrt{\log(m/\delta)}

holds for all r[m]r\in[m].

Using Lemma D.19, we have

H(0)HctsεH′′=4Nn(log(n/δ)/m)1/2\displaystyle\|H(0)-H^{\mathrm{cts}}\|\leq\varepsilon_{H}^{\prime\prime}=4Nn(\log{(n/\delta)}/m)^{1/2}

holds with probability at least 1δ1-\delta.
Note by definition,

𝔼[𝖪0(Gtest,xi)]=𝖪gntk(Gtest,Gi)\displaystyle\operatorname*{{\mathbb{E}}}[\mathsf{K}_{0}(G_{\mathrm{test}},x_{i})]=\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},G_{i})

holds for any training data GiG_{i}. By Hoeffding inequality, we have for any t>0t>0,

Pr[|𝖪0(Gtest,Gi)𝖪gntk(Gtest,Gi)|t]2exp(mt2/(2N2)).\displaystyle\Pr[|\mathsf{K}_{0}(G_{\mathrm{test}},G_{i})-\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},G_{i})|\geq t]\leq 2\exp{(-mt^{2}/(2N^{2}))}.

Setting t=(2mN2log(2n/δ))1/2t=(\frac{2}{m}N^{2}\log{(2n/\delta)})^{1/2}, we can apply union bound on all training data GiG_{i} to get with probability at least 1δ1-\delta, for all i[n]i\in[n],

|𝖪0(Gtest,Gi)𝖪gntk(Gtest,Gi)|(2N2log(2n/δ)/m)1/2.\displaystyle|\mathsf{K}_{0}(G_{\mathrm{test}},G_{i})-\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},G_{i})|\leq(2N^{2}\log(2n/\delta)/m)^{1/2}.

Thus, we have

𝖪0(Gtest,𝖦)𝖪gntk(Gtest,𝖦)2(2N2nlog(2n/δ)/m)1/2\displaystyle\|\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})\|_{2}\leq(2N^{2}n\log(2n/\delta)/m)^{1/2} (28)

holds with probability at least 1δ1-\delta.
Using union bound over above three events, we finish the proof. ∎

Conditioning on Eq. (25), (26), and (27), we show by induction that all four statements in Lemma D.22 hold.

We define the following quantity:

εW:=\displaystyle\varepsilon_{W}:= Nnmmax{4ugnn(0)u2/(κ2Λ0),εtrainT}\displaystyle\leavevmode\nobreak\ \frac{\sqrt{Nn}}{\sqrt{m}}\max\{4\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}/(\kappa^{2}\Lambda_{0}),\varepsilon_{\mathrm{train}}\cdot T\} (29)
εH:=\displaystyle\varepsilon_{H}^{\prime}:= 2nεW\displaystyle\leavevmode\nobreak\ 2n\varepsilon_{W}
εK:=\displaystyle\varepsilon_{K}:= 2nεW\displaystyle\leavevmode\nobreak\ 2\sqrt{n}\varepsilon_{W}

which are independent of tt.

Note that the base case $t=0$ trivially holds. Under the induction hypothesis that Lemma D.22 holds before time $t\in[0,T]$, we will prove that it still holds at time $t$. We prove this in Lemmas D.24, D.25, D.26, and D.27.

Lemma D.24.

If for any τ<t\tau<t, we have

\displaystyle\| ugnn(τ)u22\displaystyle\leavevmode\nobreak\ u_{\mathrm{gnn}}(\tau)-u^{*}\|_{2}^{2}
\displaystyle\leq max{exp((κ2Λ0)τ/2)ugnn(0)u22,εtrain2}\displaystyle\leavevmode\nobreak\ \max\{\exp(-(\kappa^{2}\Lambda_{0})\tau/2)\cdot\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}^{2},\leavevmode\nobreak\ \varepsilon_{\mathrm{train}}^{2}\}

and

wr(0)wr(τ)2εW1\displaystyle\|w_{r}(0)-w_{r}(\tau)\|_{2}\leq\varepsilon_{W}\leq 1

and

wr(0)2d+log(m/δ)for allr[m]\displaystyle\|w_{r}(0)\|_{2}\leq\leavevmode\nobreak\ \sqrt{d}+\sqrt{\log(m/\delta)}\leavevmode\nobreak\ \text{for all}\leavevmode\nobreak\ r\in[m]

hold, then

wr(0)wr(t)2εW\displaystyle\|w_{r}(0)-w_{r}(t)\|_{2}\leq\varepsilon_{W}
Proof.

For simplicity, we consider the case $N=1$. For the general case, the upper bound is blown up by a factor of $N$.

Recall the gradient flow as Eq. (11)

dwr(τ)dτ=i=1n1mar(yiugnn(τ)i)xiσ(wr(τ)xi)\displaystyle\frac{\mathrm{d}w_{r}(\tau)}{\mathrm{d}\tau}=\leavevmode\nobreak\ \sum_{i=1}^{n}\frac{1}{\sqrt{m}}a_{r}(y_{i}-u_{\mathrm{gnn}}(\tau)_{i})x_{i}\sigma^{\prime}(w_{r}(\tau)^{\top}x_{i}) (30)

So we have

dwr(τ)dτ2=\displaystyle\Big{\|}\frac{\mathrm{d}w_{r}(\tau)}{\mathrm{d}\tau}\Big{\|}_{2}= i=1n1mar(yiugnn(τ)i)xiσ(wr(τ)xi)2\displaystyle\leavevmode\nobreak\ \left\|\sum_{i=1}^{n}\frac{1}{\sqrt{m}}a_{r}(y_{i}-u_{\mathrm{gnn}}(\tau)_{i})x_{i}\sigma^{\prime}(w_{r}(\tau)^{\top}x_{i})\right\|_{2} (31)
\displaystyle\leq 1mi=1n|yiugnn(τ)i|\displaystyle\leavevmode\nobreak\ \frac{1}{\sqrt{m}}\sum_{i=1}^{n}|y_{i}-u_{\mathrm{gnn}}(\tau)_{i}|
\displaystyle\leq nmYugnn(τ)2\displaystyle\leavevmode\nobreak\ \frac{\sqrt{n}}{\sqrt{m}}\|Y-u_{\mathrm{gnn}}(\tau)\|_{2}
\displaystyle\leq nm(Yu2+ugnn(τ)u2)\displaystyle\leavevmode\nobreak\ \frac{\sqrt{n}}{\sqrt{m}}(\|Y-u^{*}\|_{2}+\|u_{\mathrm{gnn}}(\tau)-u^{*}\|_{2})
\displaystyle\leq nmmax{e(κ2Λ0)τ/4ugnn(0)u2,εtrain}\displaystyle\leavevmode\nobreak\ \frac{\sqrt{n}}{\sqrt{m}}\max\{e^{-(\kappa^{2}\Lambda_{0})\tau/4}\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2},\varepsilon_{\mathrm{train}}\}

where the first step is based on Eq. (30), the second step follows from the triangle inequality, the third step follows from the Cauchy-Schwarz inequality, the fourth step again uses the triangle inequality, and the last step uses the induction hypothesis $\|u_{\mathrm{gnn}}(\tau)-u^{*}\|_{2}^{2}\leq\max\{\exp(-(\kappa^{2}\Lambda_{0})\tau/2)\cdot\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}^{2},\varepsilon_{\mathrm{train}}^{2}\}$.

Thus, for any tTt\leq T,

wr(0)wr(t)2\displaystyle\leavevmode\nobreak\ \|w_{r}(0)-w_{r}(t)\|_{2}
\displaystyle\leq 0tdwr(τ)dτ2𝑑τ\displaystyle\leavevmode\nobreak\ \int_{0}^{t}\Big{\|}\frac{\mathrm{d}w_{r}(\tau)}{\mathrm{d}\tau}\Big{\|}_{2}d\tau
\displaystyle\leq Nnmmax{4ugnn(0)u2/(κ2Λ0),εtrainT}\displaystyle\leavevmode\nobreak\ \frac{\sqrt{Nn}}{\sqrt{m}}\max\{4\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}/(\kappa^{2}\Lambda_{0}),\varepsilon_{\mathrm{train}}\cdot T\}
=\displaystyle= εW\displaystyle\leavevmode\nobreak\ \varepsilon_{W}

where the first step follows from the triangle inequality, the second step follows from Eq. (31), and the last step follows from the definition of $\varepsilon_{W}$ in Eq. (29). ∎
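A forward-Euler discretization of the gradient flow in Eq. (30) illustrates the behavior underlying Lemma D.24: the training residual shrinks while every hidden weight barely moves from its initialization. The sketch below uses the simplified $N=1$ case with synthetic data, omits the multiplier $\kappa$, and its step size, width, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 10, 5, 2048
lr, steps = 0.2, 1000                              # forward-Euler step size and number of steps
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.normal(size=n)
W = rng.normal(size=(m, d)); W0 = W.copy()
a = rng.choice([-1.0, 1.0], size=m)

def predict(W):
    # u_i = (1/sqrt(m)) * sum_r a_r * relu(w_r^T x_i)
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)

for _ in range(steps):
    u = predict(W)
    act = (X @ W.T > 0).astype(float)              # sigma'(w_r^T x_i)
    # dw_r/dt = sum_i (1/sqrt(m)) a_r (y_i - u_i) sigma'(w_r^T x_i) x_i, as in Eq. (30)
    dW = (a[None, :] * act * (Y - u)[:, None]).T @ X / np.sqrt(m)
    W = W + lr * dW
print("training residual ||u - Y||_2:", np.linalg.norm(predict(W) - Y))
print("max_r ||w_r(T) - w_r(0)||_2:", np.linalg.norm(W - W0, axis=1).max())
```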

Lemma D.25.

If r[m]\forall r\in[m],

wr(0)wr(t)2εW<1,\displaystyle\|w_{r}(0)-w_{r}(t)\|_{2}\leq\varepsilon_{W}<1,

then

𝖧(0)𝖧(t)F2nεW\displaystyle\|\mathsf{H}(0)-\mathsf{H}(t)\|_{F}\leq 2n\varepsilon_{W}

holds with probability 1N2n2exp(mεW/(10N2))1-N^{2}n^{2}\cdot\exp{(-m\varepsilon_{W}/(10N^{2}))}.

Proof.

Applying Lemma D.20 completes the proof. ∎

Lemma D.26.

We have

ugnn(t)u22\displaystyle\leavevmode\nobreak\ \|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}
\displaystyle\leq max{exp((κ2Λ0)t/2)ugnn(0)u22,εtrain2}.\displaystyle\leavevmode\nobreak\ \max\{\exp(-(\kappa^{2}\Lambda_{0})t/2)\cdot\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}^{2},\leavevmode\nobreak\ \varepsilon_{\mathrm{train}}^{2}\}.
Proof.

By Lemma D.11, for any τ<t\tau<t, we have

dugnn(τ)u22dτ\displaystyle\frac{\mathrm{d}\|u_{\mathrm{gnn}}(\tau)-u^{*}\|_{2}^{2}}{\mathrm{d}\tau}\leq 2(κ2Λ0)ugnn(τ)u22\displaystyle\leavevmode\nobreak\ -2(\kappa^{2}\Lambda_{0})\cdot\|u_{\mathrm{gnn}}(\tau)-u^{*}\|_{2}^{2} (32)

Integrating this differential inequality over $[0,t]$ gives

ugnn(t)u22exp(2(κ2Λ0)t)ugnn(0)u22.\displaystyle\|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}\leq\exp{(-2(\kappa^{2}\Lambda_{0})t)}\cdot\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}^{2}.

Hence we conclude

ugnn(t)u22\displaystyle\leavevmode\nobreak\ \|u_{\mathrm{gnn}}(t)-u^{*}\|_{2}^{2}
\displaystyle\leq max{exp((κ2Λ0)t/2)ugnn(0)u22,εtrain2}.\displaystyle\leavevmode\nobreak\ \max\{\exp(-(\kappa^{2}\Lambda_{0})t/2)\cdot\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}^{2},\leavevmode\nobreak\ \varepsilon_{\mathrm{train}}^{2}\}.

Lemma D.27.

Fix εW(0,1)\varepsilon_{W}\in(0,1) independent of tt. If r[m]\forall r\in[m], we have

wr(t)wr(0)2εW\displaystyle\|w_{r}(t)-w_{r}(0)\|_{2}\leq\varepsilon_{W}

then

𝖪t(Gtest,𝖦)𝖪0(Gtest,𝖦)2εK=2nεW\displaystyle\|\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})\|_{2}\leq\varepsilon_{K}^{\prime}=2\sqrt{n}\varepsilon_{W}

holds with probability at least 1N2nexp(mεW/(10N2))1-N^{2}\cdot n\cdot\exp{(-m\varepsilon_{W}/(10N^{2}))}.

Proof.

Recall the definition of 𝖪0\mathsf{K}_{0} and 𝖪t\mathsf{K}_{t}

𝖪0(Gtest,Gi)\displaystyle\leavevmode\nobreak\ \mathsf{K}_{0}(G_{\mathrm{test}},G_{i})
=\displaystyle= 1mr=1ml1=1Nl2=1Nxtest,l1xi,l2σ(xtest,l1wr(0))σ(xi,l2wr(0))\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}x_{\mathrm{test},l_{1}}^{\top}x_{i,l_{2}}\sigma^{\prime}(x_{\mathrm{test},l_{1}}^{\top}w_{r}(0))\sigma^{\prime}(x_{i,l_{2}}^{\top}w_{r}(0))
𝖪t(Gtest,Gi)\displaystyle\leavevmode\nobreak\ \mathsf{K}_{t}(G_{\mathrm{test}},G_{i})
=\displaystyle= 1mr=1ml1=1Nl2=1Nxtest,l1xi,l2σ(xtest,l1wr(t))σ(xi,l2wr(t))\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}x_{\mathrm{test},l_{1}}^{\top}x_{i,l_{2}}\sigma^{\prime}(x_{\mathrm{test},l_{1}}^{\top}w_{r}(t))\sigma^{\prime}(x_{i,l_{2}}^{\top}w_{r}(t))

By direct calculation we have

\displaystyle\|\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}^{2}\leq\sum_{i=1}^{n}\Big{(}\frac{1}{m}\sum_{r=1}^{m}s_{r,i}\Big{)}^{2},

where

sr,i=\displaystyle s_{r,i}= l1=1Nl2=1N𝟏[wr(0)xtest,l10,wr(0)xi,l20]\displaystyle\leavevmode\nobreak\ \sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}{\bf 1}[w_{r}(0)^{\top}x_{\mathrm{test},l_{1}}\geq 0,w_{r}(0)^{\top}x_{i,l_{2}}\geq 0]
𝟏[wr(t)xtest,l10,wr(t)xi,l20],\displaystyle\leavevmode\nobreak\ -{\bf 1}[w_{r}(t)^{\top}x_{\mathrm{test},l_{1}}\geq 0,w_{r}(t)^{\top}x_{i,l_{2}}\geq 0],
r[m],i[n].\displaystyle\leavevmode\nobreak\ \forall r\in[m],i\in[n].

Fix $i\in[n]$. By the Bernstein inequality (Lemma A.2), we have

Pr[1mr=1msr,i2εW]\displaystyle\leavevmode\nobreak\ \Pr\Big{[}\frac{1}{m}\sum_{r=1}^{m}s_{r,i}\geq 2\varepsilon_{W}\Big{]}
\displaystyle\leq exp(mεW/(10N2)).\displaystyle\leavevmode\nobreak\ \exp(-m\varepsilon_{W}/(10N^{2})).

Thus, applying union bound over all training data xi,i[n]x_{i},\leavevmode\nobreak\ i\in[n], we conclude

\displaystyle\Pr[\|\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}\leq 2\sqrt{n}\varepsilon_{W}]\geq 1-n\cdot\exp{(-m\varepsilon_{W}/(10N^{2}))}.

Note that by definition $\varepsilon_{K}^{\prime}=2\sqrt{n}\varepsilon_{W}$, so we finish the proof. ∎

Now, we choose the parameters as follows: κ=O~(εΛ0NRn),T=O~(1κ2Λ0)\kappa=\widetilde{O}(\frac{\varepsilon\Lambda_{0}}{NRn}),\leavevmode\nobreak\ T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), εtrain=O~(ugnn(0)u2)\varepsilon_{\mathrm{train}}=\widetilde{O}(\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}), mO~(N2n10dε6Λ010)m\geq\widetilde{O}(\frac{N^{2}n^{10}d}{\varepsilon^{6}\Lambda_{0}^{10}}).

Further, with probability 1δ1-\delta, we have

ugnn(0)u2\displaystyle\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}\leq ugnn(0)2+Yu2+Y2\displaystyle\leavevmode\nobreak\ \|u_{\mathrm{gnn}}(0)\|_{2}+\|Y-u^{*}\|_{2}+\|Y\|_{2}
\displaystyle\leq O~(εΛ0n)+0+O~(n)\displaystyle\leavevmode\nobreak\ \widetilde{O}(\frac{\varepsilon\Lambda_{0}}{\sqrt{n}})+0+\widetilde{O}(\sqrt{n})
\displaystyle\leq O~(n)\displaystyle\leavevmode\nobreak\ \widetilde{O}(\sqrt{n})

where the first step follows from the triangle inequality, the second step follows from Lemma D.21 and $\|Y\|_{2}=O(\sqrt{n})$, and the last step follows from $\varepsilon,\leavevmode\nobreak\ \Lambda_{0}<1$. By the same reasoning,

ugnn(0)u2\displaystyle\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}\geq ugnn(0)2Yu2+Y2\displaystyle\leavevmode\nobreak\ -\|u_{\mathrm{gnn}}(0)\|_{2}-\|Y-u^{*}\|_{2}+\|Y\|_{2}
\displaystyle\geq O~(εΛ0n)0+O~(n)\displaystyle\leavevmode\nobreak\ -\widetilde{O}(\frac{\varepsilon\Lambda_{0}}{\sqrt{n}})-0+\widetilde{O}(\sqrt{n})
\displaystyle\geq O~(n).\displaystyle\leavevmode\nobreak\ \widetilde{O}(\sqrt{n}).

Thus, we have ugnn(0)u2=O~(n)\|u_{\mathrm{gnn}}(0)-u^{*}\|_{2}=\widetilde{O}(\sqrt{n}).

We have proved that with high probability all the induction conditions are satisfied. Note that the failure probability only comes from Lemmas D.21, D.23, D.25, and D.27, which only depend on the initialization. By a union bound over these failure events, with high probability all four statements in Lemma D.22 are satisfied. This completes the proof.

D.8 Connect iterative GNTK regression and GNN training process

In this section we prove Lemma D.28, which bridges the GNN test predictor and the GNTK test predictor. We prove this result by bounding the difference between the graph dynamic kernel and the GNTK, which is in turn controlled through the graph dynamic kernel at initialization.

Lemma D.28 (Upper bounding test error).

Given

  • training graph data 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} and corresponding label vector YnY\in\mathbb{R}^{n}.

  • the total number of iterations T>0T>0.

  • arbitrary test data GtestdG_{\mathrm{test}}\in\mathbb{R}^{d}.

Let

  • ugnn,test(t)nu_{\mathrm{gnn},\mathrm{test}}(t)\in\mathbb{R}^{n} and ugntk,test(t)nu_{\mathrm{gntk},\mathrm{test}}(t)\in\mathbb{R}^{n} be the test data predictors defined in Definition D.2 and Definition D.4 respectively.

  • κ(0,1)\kappa\in(0,1) be the corresponding multiplier.

Given accuracy ε>0\varepsilon>0, if

  • κ=O~(εΛ0NRn)\kappa=\widetilde{O}(\frac{\varepsilon\Lambda_{0}}{NRn}), T=O~(1κ2Λ0)T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), mO~(N2n10dε6Λ010)m\geq\widetilde{O}(\frac{N^{2}n^{10}d}{\varepsilon^{6}\Lambda_{0}^{10}}) .

Then for any GtestdG_{\mathrm{test}}\in\mathbb{R}^{d}, with probability at least 1δ1-\delta over the random initialization, we have

ugnn,test(T)ugntk,test(T)2ε/2,\displaystyle\|u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)\|_{2}\leq\varepsilon/2,

where O~()\widetilde{O}(\cdot) hides polylog(n/(εδΛ0))\mathrm{poly}\log(n/(\varepsilon\delta\Lambda_{0})).

Proof.

By Lemma D.14, we have

|ugnn,test(T)ugntk,test(T)|\displaystyle|u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|\leq (1+κ2nT)εinit+εKu2Λ0\displaystyle\leavevmode\nobreak\ (1+\kappa^{2}nT)\varepsilon_{\mathrm{init}}+\varepsilon_{K}\cdot\frac{\|u^{*}\|_{2}}{\Lambda_{0}}
+nT2κ4εHu2\displaystyle\leavevmode\nobreak\ +\sqrt{n}T^{2}\kappa^{4}\varepsilon_{H}\|u^{*}\|_{2} (33)

By Lemma D.21, we can choose $\varepsilon_{\mathrm{init}}=\varepsilon\Lambda_{0}/n$.

Further, note

𝖪gntk(Gtest,𝖦)𝖪t(Gtest,𝖦)2\displaystyle\leavevmode\nobreak\ \|\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}
\displaystyle\leq\leavevmode\nobreak\ \|\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})\|_{2}
+𝖪0(Gtest,𝖦)𝖪t(Gtest,𝖦)2\displaystyle\quad+\|\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}
\displaystyle\leq (2N2nlog(2n/δ)/m)1/2+𝖪0(Gtest,𝖦)𝖪t(Gtest,𝖦)2\displaystyle\leavevmode\nobreak\ (2N^{2}n\log{(2n/\delta)}/m)^{1/2}+\|\mathsf{K}_{0}(G_{\mathrm{test}},\mathsf{G})-\mathsf{K}_{t}(G_{\mathrm{test}},\mathsf{G})\|_{2}
\displaystyle\leq O~(εΛ02n1.5)\displaystyle\leavevmode\nobreak\ \widetilde{O}(\frac{\varepsilon\Lambda_{0}^{2}}{n^{1.5}})

where the initial step arises due to the triangle inequality. The subsequent step is derived from Lemma D.23. The concluding step is based on Lemma D.22. Consequently, we can select εK=εΛ02n1.5\varepsilon_{K}=\frac{\varepsilon\Lambda_{0}^{2}}{n^{1.5}}.

Also,

𝖧cts𝖧(t)\displaystyle\leavevmode\nobreak\ \|\mathsf{H}^{\mathrm{cts}}-\mathsf{H}(t)\|
\displaystyle\leq 𝖧cts𝖧(0)+𝖧(0)𝖧(t)2\displaystyle\leavevmode\nobreak\ \|\mathsf{H}^{\mathrm{cts}}-\mathsf{H}(0)\|+\|\mathsf{H}(0)-\mathsf{H}(t)\|_{2}
\displaystyle\leq 4Nn(log(n/δ)/m)1/2+𝖧(0)𝖧(t)2\displaystyle\leavevmode\nobreak\ 4Nn(\log(n/\delta)/m)^{1/2}+\|\mathsf{H}(0)-\mathsf{H}(t)\|_{2}
\displaystyle\leq O~(εΛ02n)\displaystyle\leavevmode\nobreak\ \widetilde{O}(\frac{\varepsilon\Lambda_{0}^{2}}{n})

where the initial step is informed by the triangle inequality. The next step is based on Lemma D.23. The final step is derived from Lemma D.22. As a result, we can set εH=εΛ02n\varepsilon_{H}=\frac{\varepsilon\Lambda_{0}^{2}}{n}.

Note that $\|u^{*}\|_{2}\leq\sqrt{n}$. Plugging the values of $\varepsilon_{\mathrm{init}},\leavevmode\nobreak\ \varepsilon_{K},\leavevmode\nobreak\ \varepsilon_{H}$ into Eq. (33), we have

|ugnn,test(T)ugntk,test(T)|ε/2.\displaystyle|u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)|\leq\varepsilon/2.

Lemma D.29 (Upper bounding test error (node version)).

Given

  • training graph GG and corresponding label vector YnY\in\mathbb{R}^{n}.

  • T>0T>0 total number of iterations.

  • arbitrary test node vdv\in\mathbb{R}^{d}.

Let

  • ugnn,test,node(t)nu_{\mathrm{gnn},\mathrm{test},\mathrm{node}}(t)\in\mathbb{R}^{n} and ugntk,test,node(t)nu_{\mathrm{gntk},\mathrm{test},\mathrm{node}}(t)\in\mathbb{R}^{n} be the test data predictors defined in Definition 3.7 and Definition 3.8 respectively.

  • κ(0,1)\kappa\in(0,1) be the corresponding multiplier.

Given accuracy ε>0\varepsilon>0, if

  • κ=O~(εΛ0RN)\kappa=\widetilde{O}(\frac{\varepsilon\Lambda_{0}}{RN}), T=O~(1κ2Λ0)T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), mO~(N10dε6Λ010)m\geq\widetilde{O}(\frac{N^{10}d}{\varepsilon^{6}\Lambda_{0}^{10}}) .

Then with probability at least 1δ1-\delta over the random initialization, we have

ugnn,test,node(T)ugntk,test,node(T)2ε/2,\displaystyle\|u_{\mathrm{gnn},\mathrm{test},\mathrm{node}}(T)-u_{\mathrm{gntk},\mathrm{test},\mathrm{node}}(T)\|_{2}\leq\varepsilon/2,

where O~()\widetilde{O}(\cdot) hides polylog(N/(εδΛ0))\mathrm{poly}\log(N/(\varepsilon\delta\Lambda_{0})).

D.9 Main result

In this section we present our main theorems. First, we present the graph-level equivalence between training a GNN and GNTK regression for test data prediction. Second, we present the node-level equivalence between training a GNN and GNTK regression for test data prediction.

Theorem D.30 (Equivalence between training a graph neural network and kernel regression for test data prediction).

Given

  • training graph data 𝖦={G1,,Gn}\mathsf{G}=\{G_{1},\cdots,G_{n}\} and corresponding label vector YnY\in\mathbb{R}^{n}. Let T>0T>0 be the total number of iterations.

  • arbitrary test data GtestG_{\mathrm{test}}.

Let

  • ugnn,test(t)nu_{\mathrm{gnn},\mathrm{test}}(t)\in\mathbb{R}^{n} and utestnu_{\mathrm{test}}^{*}\in\mathbb{R}^{n} be the test data predictors defined in Definition D.2 and Definition D.4 respectively.

For any accuracy ε(0,1/10)\varepsilon\in(0,1/10) and failure probability δ(0,1/10)\delta\in(0,1/10), if

  • κ=O~(εΛ0NRn)\kappa=\widetilde{O}(\frac{\varepsilon\Lambda_{0}}{NRn}), T=O~(1κ2Λ0)T=\widetilde{O}(\frac{1}{\kappa^{2}\Lambda_{0}}), mO~(N2n10dε6Λ010)m\geq\widetilde{O}(\frac{N^{2}n^{10}d}{\varepsilon^{6}\Lambda_{0}^{10}}).

Then for any GtestG_{\mathrm{test}}, with probability at least 1δ1-\delta over the random initialization, we have

ugnn,test(T)utest2ε.\displaystyle\|u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}\|_{2}\leq\varepsilon.

Here O~()\widetilde{O}(\cdot) hides polylog(Nn/(εδΛ0))\mathrm{poly}\log(Nn/(\varepsilon\delta\Lambda_{0})).

Proof.

It follows from combining results of bounding ugnn,test(T)ugntk,test(T)2ε/2\|u_{\mathrm{gnn},\mathrm{test}}(T)-u_{\mathrm{gntk},\mathrm{test}}(T)\|_{2}\leq\varepsilon/2 as shown in Lemma D.28 and ugntk,test(T)utest2ε/2\|u_{\mathrm{gntk},\mathrm{test}}(T)-u_{\mathrm{test}}^{*}\|_{2}\leq\varepsilon/2 as shown in Lemma D.12 using triangle inequality. ∎
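Theorem D.30 says that, in the wide-width regime, the trained GNN's test prediction can be replaced by plain kernel regression with the GNTK. The sketch below assumes the standard kernel-regression form $u_{\mathrm{test}}^{*}=\mathsf{K}_{\mathrm{gntk}}(G_{\mathrm{test}},\mathsf{G})^{\top}(\mathsf{H}^{\mathrm{cts}})^{-1}Y$ (the exact normalization in Definition D.4 and Eq. (8) may differ) and estimates both kernels by Monte Carlo over the random weights; all data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, N, d, m = 8, 3, 6, 100000
graphs = [rng.normal(size=(N, d)) for _ in range(n)]
graphs = [X / np.linalg.norm(X, axis=1, keepdims=True) for X in graphs]
G_test = rng.normal(size=(N, d)); G_test /= np.linalg.norm(G_test, axis=1, keepdims=True)
Y = rng.normal(size=n)
W = rng.normal(size=(m, d))                        # Monte-Carlo samples standing in for E_w[...]

def gntk(Xi, Xj):
    Ai, Aj = (Xi @ W.T >= 0).astype(float), (Xj @ W.T >= 0).astype(float)
    return np.einsum('ar,ab,br->', Ai, Xi @ Xj.T, Aj) / m

H = np.array([[gntk(Xi, Xj) for Xj in graphs] for Xi in graphs])   # approximates H^cts
k = np.array([gntk(G_test, Xi) for Xi in graphs])                  # approximates K_gntk(G_test, G)
u_test_star = k @ np.linalg.solve(H, Y)                            # kernel-regression prediction
print("GNTK regression prediction for G_test:", u_test_star)
```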

Theorem D.31 (Equivalence result for node level regression, formal version of Theorem 4.3).

Given

  • training graph data GG and corresponding label vector YNY\in\mathbb{R}^{N}.

  • T>0T>0 be the total number of iterations.

  • arbitrary test data vv.

Let

  • ugnn,node,test(t)u_{\mathrm{gnn},\mathrm{node},\mathrm{test}}(t)\in\mathbb{R} and utest,nodeu_{\mathrm{test},\mathrm{node}}^{*}\in\mathbb{R} be the test data predictors defined in Definition 3.7 and Definition 3.4 respectively.

For any accuracy ε(0,1/10)\varepsilon\in(0,1/10) and failure probability δ(0,1/10)\delta\in(0,1/10), if

  • the multiplier κ=O~(N1poly(ε,Λ0,node,R1))\kappa=\widetilde{O}(N^{-1}\mathrm{poly}(\varepsilon,\Lambda_{0,\mathrm{node}},R^{-1})),

  • the total iterations T=O~(N2poly(ε1,Λ0,node1))T=\widetilde{O}(N^{2}\mathrm{poly}(\varepsilon^{-1},\Lambda_{0,\mathrm{node}}^{-1})),

  • the width of the neural network mO~(N10poly(ε1,Λ0,node1,d))m\geq\widetilde{O}(N^{10}\mathrm{poly}(\varepsilon^{-1},\Lambda_{0,\mathrm{node}}^{-1},d)),

  • and the smallest eigenvalue of node level GNTK Λ0,node>0\Lambda_{0,\mathrm{node}}>0,

then for any vtestv_{\mathrm{test}}, with probability at least 1δ1-\delta over the random initialization, we have

ugnn,test,node(T)utest,node2ε.\displaystyle\|u_{\mathrm{gnn},\mathrm{test},\mathrm{node}}(T)-u_{\mathrm{test},\mathrm{node}}^{*}\|_{2}\leq\varepsilon.

Here O~()\widetilde{O}(\cdot) hides polylog(N/(εδΛ0,node))\mathrm{poly}\log(N/(\varepsilon\delta\Lambda_{0,\mathrm{node}})).

Proof.

It follows from combining results of bounding ugnn,test,node(T)ugntk,test,node(T)2ε/2\|u_{\mathrm{gnn},\mathrm{test},\mathrm{node}}(T)-u_{\mathrm{gntk},\mathrm{test},\mathrm{node}}(T)\|_{2}\leq\varepsilon/2 as shown in Lemma D.29 and ugntk,test,node(T)utest,node2ε/2\|u_{\mathrm{gntk},\mathrm{test},\mathrm{node}}(T)-u_{\mathrm{test},\mathrm{node}}^{*}\|_{2}\leq\varepsilon/2 as shown in Lemma D.13 using triangle inequality. ∎

Appendix E Bounds for the Spectral Gap with Data Separation

In this section, we provide lower and upper bounds on the smallest eigenvalue of the GNTK. These bounds are proved by carefully decomposing the relevant probabilities. In Section E.1, we consider the standard setting. In Section E.2, we consider the shifted NTK setting.

E.1 Standard setting

In this section, we consider the GNTK defined in Definition D.3.

Lemma E.1 (Corollary I.2 of [58]).

We define (z)=𝟏z0\mathcal{I}(z)={\bf 1}_{z\geq 0}. Let

  • $x_{1},\cdots,x_{n}$ be points in $\mathbb{R}^{d}$ with unit Euclidean norm and $w\sim\mathcal{N}(0,I_{d})$.

  • Form the matrix $X=[x_{1},\cdots,x_{n}]^{\top}\in\mathbb{R}^{n\times d}$.

Suppose there exists δ>0\delta>0 such that for every 1ijn1\leq i\neq j\leq n we have that

min(xixj2,xi+xj2)δ.\displaystyle\min(\|x_{i}-x_{j}\|_{2},\|x_{i}+x_{j}\|_{2})\geq\delta.

Then, the covariance of the vector (Xw)\mathcal{I}(Xw) obeys

𝔼[(Xw)(Xw)XX]δ100n2\displaystyle\operatorname*{{\mathbb{E}}}[\mathcal{I}(Xw)\mathcal{I}(Xw)^{\top}\odot XX^{\top}]\succeq\frac{\delta}{100n^{2}}

Then we prove our eigenvalue bound for GNTK.

Theorem E.2 (GNTK with generalized datasets δ\delta-separation assumption).

Let us define

hG,u=v𝒩(u)h(G)v,xi,p=hGi,up,i[n],p[N].\displaystyle h_{G,u}=\sum_{v\in\mathcal{N}(u)}h(G)_{v},\leavevmode\nobreak\ x_{i,p}=h_{G_{i},u_{p}},\leavevmode\nobreak\ i\in[n],p\in[N].

Let wrw_{r} be defined as in Definition 3.1. Assuming that

  • wri.i.d.𝒩(0,I),r=1,,mw_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\leavevmode\nobreak\ r=1,\cdots,m.

  • our datasets satisfy the δ\delta-separation: For any i,j[n],p,q[N]i,j\in[n],\leavevmode\nobreak\ p,q\in[N]

    min{x¯i,px¯j,q2,x¯i,p+x¯j,q2}δ.\displaystyle\min\{\|\overline{x}_{i,p}-\overline{x}_{j,q}\|_{2},\|\overline{x}_{i,p}+\overline{x}_{j,q}\|_{2}\}\geq\delta.

    where x¯i,p=xi,p/xi,p2\overline{x}_{i,p}=x_{i,p}/\|x_{i,p}\|_{2}.

We define HctsH^{\mathrm{cts}} as in Definition 3.2. Then we can claim that

Hctsδ100n2,Λ0>δ100n2\displaystyle H^{\mathrm{cts}}\succ\frac{\delta}{100n^{2}},\leavevmode\nobreak\ \Lambda_{0}>\frac{\delta}{100n^{2}}
Proof.

Because

[𝖧cts]i,j\displaystyle[\mathsf{H}^{\mathrm{cts}}]_{i,j}
=\displaystyle= EW[1mr=1ml1=1Nl2=1Nxi,l1xj,l2𝟏wrxi,l10,wrxj,l20].\displaystyle E_{W}\left[\frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}x_{i,l_{1}}^{\top}x_{j,l_{2}}{\bf 1}_{w_{r}^{\top}x_{i,l_{1}}\geq 0,w_{r}^{\top}x_{j,l_{2}}\geq 0}\right].

we can rewrite it as

𝖧cts=\displaystyle\mathsf{H}^{\mathrm{cts}}= 𝔼W[1mr=1ml1=1Nl2=1NXl1Xl2𝟏wrXl10𝟏wrXl20].\displaystyle\leavevmode\nobreak\ \operatorname*{{\mathbb{E}}}_{W}\left[\frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}X_{l_{1}}X_{l_{2}}^{\top}\odot{\bf 1}_{w_{r}^{\top}X_{l_{1}}\geq 0}{\bf 1}_{w_{r}^{\top}X_{l_{2}}\geq 0}\right].
=\displaystyle= 1mr=1m𝔼W[(l1=1NXl1𝟏wrXl10)(l2=1NXl2𝟏wrXl20)].\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\operatorname*{{\mathbb{E}}}_{W}\left[(\sum_{l_{1}=1}^{N}X_{l_{1}}\odot{\bf 1}_{w_{r}^{\top}X_{l_{1}}\geq 0})(\sum_{l_{2}=1}^{N}X_{l_{2}}^{\top}\odot{\bf 1}_{w_{r}^{\top}X_{l_{2}}\geq 0})\right].

where $X_{i}=[x_{1,i},\cdots,x_{n,i}]^{\top}\in\mathbb{R}^{n\times d}$. By Lemma E.1, we get that

EW[Xl1Xl2𝟏wrXl10𝟏wrXl20]δ100n2\displaystyle E_{W}\left[X_{l_{1}}X_{l_{2}}^{\top}\odot{\bf 1}_{w_{r}^{\top}X_{l_{1}}\geq 0}{\bf 1}_{w_{r}^{\top}X_{l_{2}}\geq 0}\right]\succ\frac{\delta}{100n^{2}}

As a result, we can immediately get that 𝖧ctsδ100n2\mathsf{H}^{\mathrm{cts}}\succ\frac{\delta}{100n^{2}}. ∎
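As an informal check on Theorem E.2 (in the single-node case $N=1$), the sketch below measures the $\delta$-separation of synthetic unit-norm features, estimates $H^{\mathrm{cts}}$ by Monte Carlo, and compares its smallest eigenvalue with the $\delta/(100n^{2})$ lower bound; everything here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, m = 8, 6, 100000
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm features
# delta-separation of the (already normalized) data points
delta = min(min(np.linalg.norm(X[i] - X[j]), np.linalg.norm(X[i] + X[j]))
            for i in range(n) for j in range(n) if i != j)
W = rng.normal(size=(m, d))
ind = (X @ W.T >= 0).astype(float)                  # indicators 1{w_r^T x_i >= 0}
H = (X @ X.T) * (ind @ ind.T) / m                   # Monte-Carlo estimate of H^cts (N = 1 case)
lam_min = np.linalg.eigvalsh(H)[0]
print("lambda_min(H) ~", lam_min, "   delta/(100 n^2) =", delta / (100 * n * n))
```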

E.2 Shifted NTK

In this section, we consider the shifted GNTK, which is defined via a shifted ReLU activation. We prove that the shifted GNTK is still positive definite. Moreover, its smallest eigenvalue decays to $0$ exponentially as the magnitude of the bias $b$ grows.

We first state a claim that lets us bound the eigenvalues of a Hadamard product.

Claim E.3 ([66]).

Let

  • M1,M2n×nM_{1},M_{2}\in\mathbb{R}^{n\times n} be two PSD matrices.

  • M1M2M_{1}\circ M_{2} denote the Hadamard product of M1M_{1} and M2M_{2}.

Then,

λmin(M1M2)\displaystyle\lambda_{\min}(M_{1}\circ M_{2})\geq (mini[n]M2i,i)λmin(M1),\displaystyle\leavevmode\nobreak\ (\min_{i\in[n]}{M_{2}}_{i,i})\cdot\lambda_{\min}(M_{1}),
λmax(M1M2)\displaystyle\lambda_{\max}(M_{1}\circ M_{2})\leq (maxi[n]M2i,i)λmax(M1).\displaystyle\leavevmode\nobreak\ (\max_{i\in[n]}{M_{2}}_{i,i})\cdot\lambda_{\max}(M_{1}).

Then, we state a result that bounds the eigenvalues of the shifted NTK.

Theorem E.4 ([78]).

Let

  • x1,,xnx_{1},\dots,x_{n} be points in d\mathbb{R}^{d} with unit Euclidean norm and w𝒩(0,Id)w\sim{\cal N}(0,I_{d}).

  • Form the matrix Xn×d=[x1xn]X\in\mathbb{R}^{n\times d}=[x_{1}\leavevmode\nobreak\ \dots\leavevmode\nobreak\ x_{n}]^{\top}.

Suppose there exists δ(0,2)\delta\in(0,\sqrt{2}) such that

minij[n]{xixj2,xi+xj2}δ.\displaystyle\min_{i\neq j\in[n]}\{\|x_{i}-x_{j}\|_{2},\|x_{i}+x_{j}\|_{2}\}\geq\delta.

Let b0b\geq 0. Recall the continuous Hessian matrix HctsH^{\mathrm{cts}} is defined by

Hi,jcts:=𝔼w𝒩(0,I)[xixj𝟏wxib,wxjb](i,j)[n]×[n].\displaystyle H^{\mathrm{cts}}_{i,j}:=\operatorname*{{\mathbb{E}}}_{w\sim\mathcal{N}(0,I)}\left[x_{i}^{\top}x_{j}{\bf 1}_{w^{\top}x_{i}\geq b,w^{\top}x_{j}\geq b}\right]\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall(i,j)\in[n]\times[n].

Let λ:=λmin(Hcts)\lambda:=\lambda_{\min}(H^{\mathrm{cts}}). Then, we have

exp(b2/2)λexp(b2/2)δ100n2.\displaystyle\exp(-b^{2}/2)\geq\leavevmode\nobreak\ \lambda\geq\leavevmode\nobreak\ \exp(-b^{2}/2)\cdot\frac{\delta}{100n^{2}}. (34)

We first state our definition of shifted graph neural network function, which can be viewed as a generalization of the standard graph neural network.

Definition E.5 (Shifted graph neural network function).

We define a graph neural network fgnn,shift(W,a,b,G):d×m×m××d×Nf_{\mathrm{gnn},\mathrm{shift}}(W,a,b,G):\mathbb{R}^{d\times m}\times\mathbb{R}^{m}\times\mathbb{R}\times\mathbb{R}^{d\times N}\rightarrow\mathbb{R} as the following form

fgnn,shift(W,a,b,G)=1mr=1marl=1Nσ(wrv𝒩(ul)h(G)vb).\displaystyle f_{\mathrm{gnn},\mathrm{shift}}(W,a,b,G)=\frac{1}{\sqrt{m}}\sum_{r=1}^{m}a_{r}\sum_{l=1}^{N}\sigma(w_{r}^{\top}\sum_{v\in{\cal N}(u_{l})}h(G)_{v}-b).

Besides, we denote $f_{\mathrm{gnn},\mathrm{shift}}(W,a,b,\mathsf{G})=[f_{\mathrm{gnn},\mathrm{shift}}(W,a,b,G_{1}),\cdots,f_{\mathrm{gnn},\mathrm{shift}}(W,a,b,G_{n})]^{\top}$ to represent the GNN output on the training set.

Next, we define shifted graph neural tangent kernel for the shifted graph neural network function.

Definition E.6 (Shifted graph neural tangent kernel).

We define the graph neural tangent kernel (GNTK) corresponding to the graph neural networks fgnn,shiftf_{\mathrm{gnn},\mathrm{shift}} as follows:

𝖪gntk,shift(G,H)=𝔼W[fgnn,shift(W,G)W,fgnn,shift(W,H)W],\displaystyle\mathsf{K}_{\mathrm{gntk},\mathrm{shift}}(G,H)=\operatorname*{{\mathbb{E}}}_{W}\left[\left\langle\frac{\partial f_{\mathrm{gnn},\mathrm{shift}}(W,G)}{\partial W},\frac{\partial f_{\mathrm{gnn},\mathrm{shift}}(W,H)}{\partial W}\right\rangle\right],

where

  • G,HG,H are any input graph,

  • and the expectation is taking over wri.i.d.𝒩(0,I),r=1,,mw_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\leavevmode\nobreak\ r=1,\cdots,m.

We define Hshiftctsn×nH^{\mathrm{cts}}_{\mathrm{shift}}\in\mathbb{R}^{n\times n} as the kernel matrix between training data as [Hshiftcts]i,j=𝖪gntk,shift(Gi,Gj)[H^{\mathrm{cts}}_{\mathrm{shift}}]_{i,j}=\mathsf{K}_{\mathrm{gntk},\mathrm{shift}}(G_{i},G_{j})\in\mathbb{R}.

Then, we prove our eigenvalue bounds for shifted graph neural tangent kernel.

Theorem E.7.

Let us define

hG,u=v𝒩(u)h(G)v,xi,p=hGi,up,i[n],p[N].\displaystyle h_{G,u}=\sum_{v\in\mathcal{N}(u)}h(G)_{v},\leavevmode\nobreak\ x_{i,p}=h_{G_{i},u_{p}},\leavevmode\nobreak\ i\in[n],p\in[N].

Let $w_{r}$ be defined as in Definition E.5. Assume that $w_{r}\overset{i.i.d.}{\sim}\mathcal{N}(0,I),\leavevmode\nobreak\ r=1,\cdots,m$, and that our datasets satisfy $\delta$-separation: for any $i,j\in[n],\leavevmode\nobreak\ p,q\in[N]$,

min{x¯i,px¯j,q2,x¯i,p+x¯j,q2}δ.\displaystyle\min\{\|\overline{x}_{i,p}-\overline{x}_{j,q}\|_{2},\|\overline{x}_{i,p}+\overline{x}_{j,q}\|_{2}\}\geq\delta.

where x¯i,p=xi,p/xi,p2\overline{x}_{i,p}=x_{i,p}/\|x_{i,p}\|_{2}.

We define HshiftctsH^{\mathrm{cts}}_{\mathrm{shift}} as in Definition E.6. Then we can claim that, for λ:=λmin(Hshiftcts)\lambda:=\lambda_{\min}(H^{\mathrm{cts}}_{\mathrm{shift}})

exp(b2/2)λexp(b2/2)δ100n2\displaystyle\exp(-b^{2}/2)\geq\lambda\geq\exp(-b^{2}/2)\cdot\frac{\delta}{100n^{2}}
Proof.

Because

\displaystyle[\mathsf{H}^{\mathrm{cts}}_{\mathrm{shift}}]_{i,j}=\operatorname*{{\mathbb{E}}}_{W}\left[\frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}x_{i,l_{1}}^{\top}x_{j,l_{2}}{\bf 1}_{w_{r}^{\top}x_{i,l_{1}}\geq b,w_{r}^{\top}x_{j,l_{2}}\geq b}\right].

we can rewrite it as

\displaystyle\mathsf{H}^{\mathrm{cts}}_{\mathrm{shift}}=\leavevmode\nobreak\ \operatorname*{{\mathbb{E}}}_{W}\left[\frac{1}{m}\sum_{r=1}^{m}\sum_{l_{1}=1}^{N}\sum_{l_{2}=1}^{N}X_{l_{1}}X_{l_{2}}^{\top}\odot{\bf 1}_{w_{r}^{\top}X_{l_{1}}\geq b}{\bf 1}_{w_{r}^{\top}X_{l_{2}}\geq b}\right].
=\displaystyle= 1mr=1m𝔼W[(l1=1NXl1𝟏wrXl1b)(l2=1NXl2𝟏wrXl2b)].\displaystyle\leavevmode\nobreak\ \frac{1}{m}\sum_{r=1}^{m}\operatorname*{{\mathbb{E}}}_{W}\left[(\sum_{l_{1}=1}^{N}X_{l_{1}}\odot{\bf 1}_{w_{r}^{\top}X_{l_{1}}\geq b})(\sum_{l_{2}=1}^{N}X_{l_{2}}^{\top}\odot{\bf 1}_{w_{r}^{\top}X_{l_{2}}\geq b})\right].

where $X_{i}=[x_{1,i},\cdots,x_{n,i}]^{\top}\in\mathbb{R}^{n\times d}$. By Theorem E.4, we get that

exp(b2/2)𝔼W[Xl1Xl2𝟏wrXl1b𝟏wrXl2b]exp(b2/2)δ100n2\displaystyle\exp(-b^{2}/2)\succ\operatorname*{{\mathbb{E}}}_{W}\left[X_{l_{1}}X_{l_{2}}^{\top}\odot{\bf 1}_{w_{r}^{\top}X_{l_{1}}\geq b}{\bf 1}_{w_{r}^{\top}X_{l_{2}}\geq b}\right]\succ\exp(-b^{2}/2)\cdot\frac{\delta}{100n^{2}}

As a result, we can immediately get that $\exp(-b^{2}/2)\succ\mathsf{H}^{\mathrm{cts}}_{\mathrm{shift}}\succ\exp(-b^{2}/2)\frac{\delta}{100n^{2}}$. ∎
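The exponential dependence on the bias in Theorem E.7 is easy to observe numerically. The sketch below (again in the single-node case) estimates the shifted kernel by Monte Carlo for several values of $b$ and reports its smallest eigenvalue next to $\exp(-b^{2}/2)$; all data and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, m = 8, 6, 100000
X = rng.normal(size=(n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.normal(size=(m, d))
P = X @ W.T                                        # (n, m) projections w_r^T x_i
for b in (0.0, 1.0, 2.0):
    ind = (P >= b).astype(float)
    H_shift = (X @ X.T) * (ind @ ind.T) / m        # [H_shift]_{ij} ~ x_i^T x_j E[1{w^T x_i >= b, w^T x_j >= b}]
    lam_min = np.linalg.eigvalsh(H_shift)[0]
    print(f"b = {b}: lambda_min ~ {lam_min:.5f},  exp(-b^2/2) = {np.exp(-b * b / 2):.5f}")
```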

Appendix F Node-Level GNTK Formulation

In this section we derive the GNTK formulation for the node-level regression task. We first review the definitions. Let $f^{(l)}_{(r)}$ denote the output of the graph neural network at the $l$-th level and $r$-th layer. The Aggregate layer is defined as follows:

f(1)(1)(u)=v𝒩(u)h(G)v,\displaystyle f^{(1)}_{(1)}(u)=\sum_{v\in\mathcal{N}(u)}h(G)_{v},

and

f(1)(l)(u)=v𝒩(u)f(R)(l1)(v),l{2,3,,L}.\displaystyle f^{(l)}_{(1)}(u)=\sum_{v\in\mathcal{N}(u)}f^{(l-1)}_{(R)}(v),\leavevmode\nobreak\ l\in\{2,3,\cdots,L\}.

The definition of Combine layer is formulated as follows:

f(r)(l)(u)=cσmσ(W(r)(l)f(r1)(l)(u)),l[L],r[R]\displaystyle f^{(l)}_{(r)}(u)=\sqrt{\frac{c_{\sigma}}{m}}\sigma(W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u)),\leavevmode\nobreak\ l\in[L],r\in[R]

where $c_{\sigma}$ is a scaling factor, $W^{(l)}_{(r)}$ is the weight matrix of the neural network at the $l$-th level and $r$-th layer, and $m$ is the width of the layer.
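For concreteness, the following sketch implements a finite-width forward pass of the Aggregate/Combine architecture just described. The choice $\mathcal{N}(u)=\{u\}\cup\{\text{neighbors of }u\}$, the layer indexing, and all sizes and names are our own illustrative assumptions.

```python
import numpy as np

def gnn_forward(H_feat, adj, weights, c_sigma=2.0):
    """Finite-width forward pass of the Aggregate/Combine architecture above.
    H_feat  : (N, d) input node features h(G)_v
    adj     : (N, N) 0/1 adjacency; N(u) is taken to include u itself (an assumption)
    weights : weights[l][r] is W^{(l)}_{(r)}; the first Combine of level 1 maps d -> m."""
    A = adj + np.eye(adj.shape[0])
    f = H_feat                                     # the first Aggregate below produces f^{(1)}_{(1)}
    for level in weights:
        f = A @ f                                  # Aggregate: sum over v in N(u)
        for W in level:                            # Combine layers r = 1, ..., R
            m = W.shape[0]
            f = np.sqrt(c_sigma / m) * np.maximum(f @ W.T, 0.0)
    return f                                       # (N, m) node representations f^{(L)}_{(R)}

rng = np.random.default_rng(8)
N, d, m, L, R = 5, 4, 64, 2, 2
H_feat = rng.normal(size=(N, d))
adj = np.triu((rng.random((N, N)) < 0.4).astype(float), 1); adj = adj + adj.T
weights = [[rng.normal(size=(m, d if (l == 0 and r == 0) else m)) for r in range(R)]
           for l in range(L)]
print(gnn_forward(H_feat, adj, weights).shape)     # -> (5, 64)
```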

We define the Graph Neural Tangent Kernel as 𝖪gntk,node(u,u)=𝖪(R)(L)(u,u)\mathsf{K}_{\mathrm{gntk},\mathrm{node}}(u,u^{\prime})=\mathsf{K}_{(R)}^{(L)}(u,u^{\prime}). Let 𝚺(r)(l)(u,u)\bm{\Sigma}^{(l)}_{(r)}(u,u^{\prime}) be defined as the covariance matrix of the Gaussian process of the pre-activations W(r)(l)f(r1)(l)(u)W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u). We have that initially:

𝚺(1)(1)(u,u)=h(G)uh(G)u.\displaystyle\bm{\Sigma}^{(1)}_{(1)}(u,u^{\prime})=h(G)_{u}^{\top}h(G)_{u^{\prime}}.

For l{2,3,,L}l\in\{2,3,\cdots,L\},

𝚺(1)(l)(u,u)=v𝒩(u)v𝒩(u)𝚺(R)(l1)(v,v).\displaystyle\bm{\Sigma}^{(l)}_{(1)}(u,u^{\prime})=\sum_{v\in\mathcal{N}(u)}\sum_{v^{\prime}\in\mathcal{N}(u^{\prime})}\bm{\Sigma}^{(l-1)}_{(R)}(v,v^{\prime}).

Then, we derive the formulation for the Combine layer. Note that

𝔼[(W(r+1)(l)f(r)(l)(u))i(W(r+1)(l)f(r)(l)(u))i]\displaystyle\leavevmode\nobreak\ \operatorname*{{\mathbb{E}}}[(W^{(l)}_{(r+1)}f^{(l)}_{(r)}(u))_{i}\cdot(W^{(l)}_{(r+1)}f^{(l)}_{(r)}(u^{\prime}))_{i}]
=\displaystyle= f(r)(l)(u)f(r)(l)(u)\displaystyle\leavevmode\nobreak\ f^{(l)}_{(r)}(u)\circ f^{(l)}_{(r)}(u^{\prime})
=\displaystyle= cσmσ(W(r)(l)f(r1)(l)(u))σ(W(r)(l)f(r1)(l)(u))\displaystyle\leavevmode\nobreak\ \frac{c_{\sigma}}{m}\sigma(W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u))\circ\sigma(W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u^{\prime}))

Then, we have that

limm𝔼[(W(r+1)(l)f(r)(l)(u))i(W(r+1)(l)f(r)(l)(u))i]=𝚺(r+1)(l)(u,u),\displaystyle\lim_{m\to\infty}\operatorname*{{\mathbb{E}}}[(W^{(l)}_{(r+1)}f^{(l)}_{(r)}(u))_{i}\cdot(W^{(l)}_{(r+1)}f^{(l)}_{(r)}(u^{\prime}))_{i}]=\bm{\Sigma}^{(l)}_{(r+1)}(u,u^{\prime}),

where W(r+1)(l)m×mW^{(l)}_{(r+1)}\in\mathbb{R}^{m\times m}.

We conclude the formula for the Combine layer.

𝚲(r)(l)(u,u)=\displaystyle\bm{\Lambda}^{(l)}_{(r)}(u,u^{\prime})= (𝚺(r1)(l)(u,u)𝚺(r1)(l)(u,u)𝚺(r1)(l)(u,u)𝚺(r1)(l)(u,u))2×2,\displaystyle\begin{pmatrix}\bm{\Sigma}_{(r-1)}^{(l)}(u,u)&\bm{\Sigma}_{(r-1)}^{(l)}(u,u^{\prime})\\ \bm{\Sigma}_{(r-1)}^{(l)}(u^{\prime},u)&\bm{\Sigma}_{(r-1)}^{(l)}(u^{\prime},u^{\prime})\end{pmatrix}\in\mathbb{R}^{2\times 2},
𝚺(r)(l)(u,u)=\displaystyle\bm{\Sigma}^{(l)}_{(r)}(u,u^{\prime})= cσ𝔼(a,b)𝒩(𝟎,𝚲(r)(l)(u,u))[σ(a)σ(b)],\displaystyle c_{\sigma}\mathbb{E}_{(a,b)\sim\mathcal{N}\left(\bm{0},\bm{\Lambda}^{(l)}_{(r)}\left(u,u^{\prime}\right)\right)}[\sigma(a)\sigma(b)],
𝚺˙(r)(l)(u,u)=\displaystyle\bm{\dot{\Sigma}}^{(l)}_{(r)}\left(u,u^{\prime}\right)= cσ𝔼(a,b)𝒩(𝟎,𝚲(r)(l)(u,u))[σ˙(a)σ˙(b)].\displaystyle c_{\sigma}\mathbb{E}_{(a,b)\sim\mathcal{N}\left(\bm{0},\bm{\Lambda}^{(l)}_{(r)}\left(u,u^{\prime}\right)\right)}[\dot{\sigma}(a)\dot{\sigma}(b)].

We write the partial derivative with respect to a particular weight W(r)(l)W^{(l)}_{(r)}:

f(R)(L)W(r)(l)=f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))f(r1)(l)\displaystyle\frac{\partial f^{(L)}_{(R)}}{\partial W^{(l)}_{(r)}}=\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))f^{(l)}_{(r-1)}

where

f(R)(L)f(r1)(l)=f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))W(r)(l)\displaystyle\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r-1)}}=\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))W^{(l)}_{(r)}

Then, we compute the inner product:

f(R)(L)(u)W(r)(l)f(R)(L)(u)W(r)(l)\displaystyle\leavevmode\nobreak\ \frac{\partial f^{(L)}_{(R)}(u)}{\partial W^{(l)}_{(r)}}\circ\frac{\partial f^{(L)}_{(R)}(u^{\prime})}{\partial W^{(l)}_{(r)}}
=\displaystyle= f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))f(r1)(l)(u),f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))f(r1)(l)(u)\displaystyle\leavevmode\nobreak\ \langle\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))f^{(l)}_{(r-1)}(u),\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))f^{(l)}_{(r-1)}(u^{\prime})\rangle
=\displaystyle= f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))(u),f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))(u)f(r1)(l)(u),f(r1)(l)(u)\displaystyle\leavevmode\nobreak\ \langle\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))(u),\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))(u^{\prime})\rangle\cdot\langle f^{(l)}_{(r-1)}(u),f^{(l)}_{(r-1)}(u^{\prime})\rangle
=\displaystyle= f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))(u),f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)))(u)𝚺(r)(l)(u,u)\displaystyle\leavevmode\nobreak\ \langle\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))(u),\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))(u^{\prime})\rangle\cdot{\bf\Sigma}^{(l)}_{(r)}(u,u^{\prime})

where

f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l))(u)),f(R)(L)f(r)(l)cσmdiag(σ˙(W(r)(l)f(r1)(l)(u)))\displaystyle\leavevmode\nobreak\ \langle\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)})(u)),\frac{\partial f^{(L)}_{(R)}}{\partial f^{(l)}_{(r)}}\sqrt{\frac{c_{\sigma}}{m}}\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}(u^{\prime})))\rangle
=\displaystyle= cσmtr[diag(σ˙(W(r)(l)f(r1)(l)))(u)diag(σ˙(W(r)(l)f(r1)(l)))(u)]f(R)(L)(u)f(r)(l),f(R)(L)(u)f(r)(l)\displaystyle\leavevmode\nobreak\ \frac{c_{\sigma}}{m}\mathrm{tr}[\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))(u)\mathrm{diag}(\dot{\sigma}(W^{(l)}_{(r)}f^{(l)}_{(r-1)}))(u^{\prime})]\langle\frac{\partial f^{(L)}_{(R)}(u)}{\partial f^{(l)}_{(r)}},\frac{\partial f^{(L)}_{(R)}(u^{\prime})}{\partial f^{(l)}_{(r)}}\rangle
=\displaystyle= cσm𝚺˙(r)(l)(u,u)f(R)(L)(u)f(r)(l),f(R)(L)(u)f(r)(l)\displaystyle\leavevmode\nobreak\ \frac{c_{\sigma}}{m}\dot{\bf\Sigma}^{(l)}_{(r)}(u,u^{\prime})\langle\frac{\partial f^{(L)}_{(R)}(u)}{\partial f^{(l)}_{(r)}},\frac{\partial f^{(L)}_{(R)}(u^{\prime})}{\partial f^{(l)}_{(r)}}\rangle

Finally, we conclude the formula for graph neural tangent kernel. For each Aggregate layer:

𝖪(1)(1)(u,u)=h(G)uh(G)u,\displaystyle\mathsf{K}^{(1)}_{(1)}(u,u^{\prime})=\leavevmode\nobreak\ h(G)_{u}^{\top}h(G)_{u^{\prime}},

and

𝖪(1)(l)(u,u)=v𝒩(u)v𝒩(u)𝖪(R)(l1)(v,v).\displaystyle\mathsf{K}^{(l)}_{(1)}(u,u^{\prime})=\leavevmode\nobreak\ \sum_{v\in\mathcal{N}(u)}\sum_{v^{\prime}\in\mathcal{N}(u^{\prime})}\mathsf{K}^{(l-1)}_{(R)}(v,v^{\prime}).

For the Combine layer:

𝖪(r)(l)(u,u)=𝖪(r1)(l)(u,u)𝚺˙(r)(l)(u,u)+𝚺(r)(l)(u,u).\displaystyle\mathsf{K}_{(r)}^{(l)}(u,u^{\prime})=\mathsf{K}_{(r-1)}^{(l)}(u,u^{\prime})\dot{\bm{\Sigma}}^{(l)}_{(r)}\left(u,u^{\prime}\right)+\bm{\Sigma}^{(l)}_{(r)}\left(u,u^{\prime}\right).

The Graph neural tangent kernel:

𝖪gntk,node(u,u)=𝖪(R)(L)(u,u).\displaystyle\mathsf{K}_{\mathrm{gntk},\mathrm{node}}(u,u^{\prime})=\mathsf{K}_{(R)}^{(L)}(u,u^{\prime}).
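The recursion above translates directly into a small dynamic program. The sketch below assumes a ReLU activation, for which the Gaussian expectations defining $\bm{\Sigma}$ and $\dot{\bm{\Sigma}}$ have the standard arc-cosine closed form, and uses $c_{\sigma}=2$; the split into one Aggregate step followed by Combine updates per level and the inclusion of $u$ in $\mathcal{N}(u)$ are our reading of the recursion, so the code is a sketch rather than a definitive implementation.

```python
import numpy as np

def relu_gauss_expectations(s_uu, s_vv, s_uv):
    """Closed-form Gaussian expectations for ReLU (arc-cosine kernel):
    E[sigma(a)sigma(b)] and E[sigma'(a)sigma'(b)] for (a, b) ~ N(0, Lambda)."""
    denom = np.sqrt(max(s_uu * s_vv, 1e-12))
    theta = np.arccos(np.clip(s_uv / denom, -1.0, 1.0))
    e_sig = denom * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2.0 * np.pi)
    e_dot = (np.pi - theta) / (2.0 * np.pi)
    return e_sig, e_dot

def node_gntk(H_feat, adj, L=2, R=2, c_sigma=2.0):
    """K_gntk,node(u, u') following the recursion above with ReLU sigma."""
    A = adj + np.eye(adj.shape[0])                  # N(u) includes u itself (an assumption)
    Sigma = H_feat @ H_feat.T                       # Sigma^(1)_(1)(u, u') = h_u^T h_{u'}
    K = Sigma.copy()                                # K^(1)_(1) = Sigma^(1)_(1)
    num_nodes = Sigma.shape[0]
    for level in range(1, L + 1):
        if level > 1:                               # Aggregate layer for levels l >= 2
            Sigma = A @ Sigma @ A.T
            K = A @ K @ A.T
        for _ in range(2, R + 1):                   # Combine layers
            new_S = np.zeros_like(Sigma)
            dot_S = np.zeros_like(Sigma)
            for u in range(num_nodes):
                for v in range(num_nodes):
                    e_sig, e_dot = relu_gauss_expectations(Sigma[u, u], Sigma[v, v], Sigma[u, v])
                    new_S[u, v] = c_sigma * e_sig   # Sigma^(l)_(r)(u, u')
                    dot_S[u, v] = c_sigma * e_dot   # dot-Sigma^(l)_(r)(u, u')
            K = K * dot_S + new_S                   # K^(l)_(r) = K^(l)_(r-1) * dot-Sigma + Sigma
            Sigma = new_S
    return K

rng = np.random.default_rng(9)
N, d = 5, 4
H_feat = rng.normal(size=(N, d))
adj = np.triu((rng.random((N, N)) < 0.4).astype(float), 1); adj = adj + adj.T
print(node_gntk(H_feat, adj))
```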