Characterizing the spectrum of the NTK via a power series expansion

\dagger

^∗ Michael Murray,

\dagger

^∗ Hui Jin,

\dagger

^∗ Benjamin Bowman,

\dagger{\ddagger}\mathsection

Guido Montufar

\dagger

Department of Mathematics, UCLA, CA, USA

{\ddagger}

Department of Statistics, UCLA, CA, USA

\mathsection

Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
[mmurray,huijin,benbowman314,montufar]@math.ucla.edu
^∗Equal contribution

Abstract

Under mild conditions on the network initialization we derive a power series expansion for the Neural Tangent Kernel (NTK) of arbitrarily deep feedforward networks in the infinite width limit. We provide expressions for the coefficients of this power series which depend on both the Hermite coefficients of the activation function as well as the depth of the network. We observe faster decay of the Hermite coefficients leads to faster decay in the NTK coefficients and explore the role of depth. Using this series, first we relate the effective rank of the NTK to the effective rank of the input-data Gram. Second, for data drawn uniformly on the sphere we study the eigenvalues of the NTK, analyzing the impact of the choice of activation function. Finally, for generic data and activation functions with sufficiently fast Hermite coefficient decay, we derive an asymptotic upper bound on the spectrum of the NTK.

1 Introduction

Neural networks currently dominate modern artificial intelligence, however, despite their empirical success establishing a principled theoretical foundation for them remains an active challenge. The key difficulties are that neural networks induce nonconvex optimization objectives (Sontag & Sussmann, 1989) and typically operate in an overparameterized regime which precludes classical statistical learning theory (Anthony & Bartlett, 2002). The persistent success of overparameterized models tuned via non-convex optimization suggests that the relationship between the parameterization, optimization, and generalization is more sophisticated than that which can be addressed using classical theory.

A recent breakthrough on understanding the success of overparameterized networks was established through the Neural Tangent Kernel (NTK) (Jacot et al., 2018). In the infinite width limit the optimization dynamics are described entirely by the NTK and the parameterization behaves like a linear model (Lee et al., 2019). In this regime explicit guarantees for the optimization and generalization can be obtained (Du et al., 2019a, b; Arora et al., 2019a; Allen-Zhu et al., 2019; Zou et al., 2020). While one must be judicious when extrapolating insights from the NTK to finite width networks (Lee et al., 2020), the NTK remains one of the most promising avenues for understanding deep learning on a principled basis.

The spectrum of the NTK is fundamental to both the optimization and generalization of wide networks. In particular, bounding the smallest eigenvalue of the NTK Gram matrix is a staple technique for establishing convergence guarantees for the optimization (Du et al., 2019a, b; Oymak & Soltanolkotabi, 2020). Furthermore, the full spectrum of the NTK Gram matrix governs the dynamics of the empirical risk (Arora et al., 2019b), and the eigenvalues of the associated integral operator characterize the dynamics of the generalization error outside the training set (Bowman & Montufar, 2022; Bowman & Montúfar, 2022). Moreover, the decay rate of the generalization error for Gaussian process regression using the NTK can be characterized by the decay rate of the spectrum (Caponnetto & De Vito, 2007; Cui et al., 2021; Jin et al., 2022).

The importance of the spectrum of the NTK has led to a variety of efforts to characterize its structure via random matrix theory and other tools (Yang & Salman, 2019; Fan & Wang, 2020). There is a broader body of work studying the closely related Conjugate Kernel, Fisher Information Matrix, and Hessian (Poole et al., 2016; Pennington & Worah, 2017, 2018; Louart et al., 2018; Karakida et al., 2020). These results often require complex random matrix theory or operate in a regime where the input dimension is sent to infinity. By contrast, using a just a power series expansion we are able to characterize a variety of attributes of the spectrum for fixed input dimension and recover key results from prior work.

1.1 Contributions

In Theorem 3.1 we derive coefficients for the power series expansion of the NTK under unit variance initialization, see Assumption 2. Consequently we are able to derive insights into the NTK spectrum, notably concerning the outlier eigenvalues as well as the asymptotic decay.

•

In Theorem 4.1 and Observation 4.2 we demonstrate that the largest eigenvalue $\lambda_{1}(\mathbf{K})$ of the NTK takes up an $\Omega(1)$ proportion of the trace and that there are $O(1)$ outlier eigenvalues of the same order as $\lambda_{1}(\mathbf{K})$ .
•

In Theorem 4.3 and Theorem 4.5 we show that the effective rank $Tr(\mathbf{K})/\lambda_{1}(\mathbf{K})$ of the NTK is upper bounded by a constant multiple of the effective rank $Tr(\mathbf{X}\mathbf{X}^{T})/\lambda_{1}(\mathbf{X}\mathbf{X}^{T})$ of the input data Gram matrix for both infinite and finite width networks.
•

In Corollary 4.7 and Theorem 4.8 we characterize the asymptotic behavior of the NTK spectrum for both uniform and nonuniform data distributions on the sphere.

1.2 Related work

Neural Tangent Kernel (NTK): the NTK was introduced by Jacot et al. (2018), who demonstrated that in the infinite width limit neural network optimization is described via a kernel gradient descent. As a consequence, when the network is polynomially wide in the number of samples, global convergence guarantees for gradient descent can be obtained (Du et al., 2019a, b; Allen-Zhu et al., 2019; Zou & Gu, 2019; Lee et al., 2019; Zou et al., 2020; Oymak & Soltanolkotabi, 2020; Nguyen & Mondelli, 2020; Nguyen, 2021). Furthermore, the connection between infinite width networks and Gaussian processes, which traces back to Neal (1996), has been reinvigorated in light of the NTK. Recent investigations include Lee et al. (2018); de G. Matthews et al. (2018); Novak et al. (2019).

Analysis of NTK Spectrum: theoretical analysis of the NTK spectrum via random matrix theory was investigated by Yang & Salman (2019); Fan & Wang (2020) in the high dimensional limit. Velikanov & Yarotsky (2021) demonstrated that for ReLU networks the spectrum of the NTK integral operator asymptotically follows a power law, which is consistent with our results for the uniform data distribution. Basri et al. (2019) calculated the NTK spectrum for shallow ReLU networks under the uniform distribution, which was then expanded to the nonuniform case by Basri et al. (2020). Geifman et al. (2022) analyzed the spectrum of the conjugate kernel and NTK for convolutional networks with ReLU activations whose pixels are uniformly distributed on the sphere. Geifman et al. (2020); Bietti & Bach (2021); Chen & Xu (2021) analyzed the reproducing kernel Hilbert spaces of the NTK for ReLU networks and the Laplace kernel via the decay rate of the spectrum of the kernel. In contrast to previous works, we are able to address the spectrum in the finite dimensional setting and characterize the impact of different activation functions on it.

Hermite Expansion: Daniely et al. (2016) used Hermite expansion to the study the expressivity of the Conjugate Kernel. Simon et al. (2022) used this technique to demonstrate that any dot product kernel can be realized by the NTK or Conjugate Kernel of a shallow, zero bias network. Oymak & Soltanolkotabi (2020) use Hermite expansion to study the NTK and establish a quantitative bound on the smallest eigenvalue for shallow networks. This approach was incorporated by Nguyen & Mondelli (2020) to handle convergence for deep networks, with sharp bounds on the smallest NTK eigenvalue for deep ReLU networks provided by Nguyen et al. (2021). The Hermite approach was utilized by Panigrahi et al. (2020) to analyze the smallest NTK eigenvalue of shallow networks under various activations. Finally, in a concurrent work Han et al. (2022) use Hermite expansions to develop a principled and efficient polynomial based approximation algorithm for the NTK and CNTK. In contrast to the aforementioned works, here we employ the Hermite expansion to characterize both the outlier and asymptotic portions of the spectrum for both shallow and deep networks under general activations.

2 Preliminaries

For our notation, lower case letters, e.g., $x,y$ , denote scalars, lower case bold characters, e.g., $\mathbf{x},\mathbf{y}$ are for vectors, and upper case bold characters, e.g., $\mathbf{X},\mathbf{Y}$ , are for matrices. For natural numbers $k_{1},k_{2}\in\mathbb{N}$ we let $[k_{1}]=\{1,\ldots,k_{1}\}$ and $[k_{2},k_{1}]=\{k_{2},\ldots,k_{1}\}$ . If $k_{2}>k_{1}$ then $[k_{2},k_{1}]$ is the empty set. We use $\left\lVert\cdot\right\rVert_{p}$ to denote the $p$ -norm of the matrix or vector in question and as default use $\left\lVert\cdot\right\rVert$ as the operator or 2-norm respectively. We use $\mathbf{1}_{m\times n}\in\mathbb{R}^{m\times n}$ to denote the matrix with all entries equal to one. We define $\delta_{p=c}$ to take the value $1$ if $p=c$ and be zero otherwise. We will frequently overload scalar functions $\phi:\mathbb{R}\rightarrow\mathbb{R}$ by applying them elementwise to vectors and matrices. The entry in the $i$ th row and $j$ th column of a matrix we access using the notation $[\mathbf{X}]_{ij}$ . The Hadamard or entrywise product of two matrices $\mathbf{X},\mathbf{Y}\in\mathbb{R}^{m\times n}$ we denote $\mathbf{X}\odot\mathbf{Y}$ as is standard. The $p$ th Hadamard power we denote $\mathbf{X}^{\odot p}$ and define it as the Hadamard product of $\mathbf{X}$ with itself $p$ times,

\mathbf{X}^{\odot p}\vcentcolon=\mathbf{X}\odot\mathbf{X}\odot\cdots\odot\mathbf{X}.

Given a Hermitian or symmetric matrix $\mathbf{X}\in\mathbb{R}^{n\times n}$ , we adopt the convention that $\lambda_{i}(\mathbf{X})$ denotes the $i$ th largest eigenvalue,

\lambda_{1}(\mathbf{X})\geq\lambda_{2}(\mathbf{X})\geq\cdots\geq\lambda_{n}(\mathbf{X}).

Finally, for a square matrix $\mathbf{X}\in\mathbb{R}^{n\times n}$ we let $Tr(\mathbf{X})=\sum_{i=1}^{n}[\mathbf{X}]_{ii}$ denote the trace.

2.1 Hermite Expansion

We say that a function $f\colon\mathbb{R}\rightarrow\mathbb{R}$ is square integrable with respect to the standard Gaussian measure $\gamma(z)=\frac{1}{\sqrt{2\pi}}e^{-z^{2}/2}$ if $\mathbb{E}_{X\sim\mathcal{N}(0,1)}[f(X)^{2}]<\infty$ . We denote by $L^{2}(\mathbb{R},\gamma)$ the space of all such functions. The normalized probabilist’s Hermite polynomials are defined as

\displaystyle h_{k}(x)=\frac{{(-1)}^{k}e^{x^{2}/2}}{\sqrt{k!}}\frac{d^{k}}{dx^{k}}e^{-x^{2}/2},\quad k=0,1,\ldots

and form a complete orthonormal basis in $L^{2}(\mathbb{R},\gamma)$ (O’Donnell, 2014, §11). The Hermite expansion of a function $\phi\in L^{2}(\mathbb{R},\gamma)$ is given by $\phi(x)=\sum_{k=0}^{\infty}\mu_{k}(\phi)h_{k}(x)$ , where $\mu_{k}(\phi)=\mathbb{E}_{X\sim\mathcal{N}(0,1)}[\phi(X)h_{k}(X)]$ is the $k$ th normalized probabilist’s Hermite coefficient of $\phi$ .

2.2 NTK Parametrization

In what follows, for $n,d\in\mathbb{N}$ let $\mathbf{X}\in\mathbb{R}^{n\times d}$ denote a matrix which stores $n$ points in $\mathbb{R}^{d}$ row-wise. Unless otherwise stated, we assume $d\leq n$ and denote the $i$ th row of $\mathbf{X}_{n}$ as $\mathbf{x}_{i}$ . In this work we consider fully-connected neural networks of the form $f^{(L+1)}\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ with $L\in\mathbb{N}$ hidden layers and a linear output layer. For a given input vector $\mathbf{x}\in\mathbb{R}^{d}$ , the activation $f^{(l)}$ and preactivation $g^{(l)}$ at each layer $l\in[L+1]$ are defined via the following recurrence relations,

		$\displaystyle g^{(1)}(\mathbf{x})=\gamma_{w}\mathbf{W}^{(1)}\mathbf{x}+\gamma_{b}\mathbf{b}^{(1)},\;f^{(1)}(\mathbf{x})=\phi\left(g^{(1)}(\mathbf{x})\right),$		(1)
		$\displaystyle g^{(l)}(\mathbf{x})=\frac{\sigma_{w}}{\sqrt{m_{l-1}}}\mathbf{W}^{(l)}f^{(l-1)}(\mathbf{x})+\sigma_{b}\mathbf{b}^{(l)},\;f^{(l)}(\mathbf{x})=\phi\left(g^{(l)}(\mathbf{x})\right),\;\forall l\in[2,L],$
		$\displaystyle g^{(L+1)}(\mathbf{x})=\frac{\sigma_{w}}{\sqrt{m_{L}}}\mathbf{W}^{(L+1)}f^{(L)}(\mathbf{x}),\;f^{(L+1)}(\mathbf{x})=g^{(L+1)}(\mathbf{x}).$

The parameters $\mathbf{W}^{(l)}\in\mathbb{R}^{m_{l}\times m_{l-1}}$ and $\mathbf{b}^{(l)}\in\mathbb{R}^{m_{l}}$ are the weight matrix and bias vector at the $l$ th layer respectively, $m_{0}=d$ , $m_{L+1}=1$ , and $\phi\colon\mathbb{R}\rightarrow\mathbb{R}$ is the activation function applied elementwise. The variables $\gamma_{w},\sigma_{w}\in\mathbb{R}_{>0}$ and $\gamma_{b},\sigma_{b}\in\mathbb{R}_{\geq 0}$ correspond to weight and bias hyperparameters respectively. Let $\theta_{l}\in\mathbb{R}^{p}$ denote a vector storing the network parameters $(\mathbf{W}^{(h)},\mathbf{b}^{(h)})_{h=1}^{l}$ up to and including the $l$ th layer. The Neural Tangent Kernel (Jacot et al., 2018) $\tilde{\Theta}^{(l)}\colon\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}$ associated with $f^{(l)}$ at layer $l\in[L+1]$ is defined as

\tilde{\Theta}^{(l)}(\mathbf{x},\mathbf{y}):=\langle\nabla_{\theta_{l}}f^{(l)}(\mathbf{x}),\nabla_{\theta_{l}}f^{(l)}(\mathbf{y})\rangle.

(2)

We will mostly study the NTK under the following standard assumptions.

Assumption 1.

NTK initialization.

1.

At initialization all network parameters are distributed as $\mathcal{N}(0,1)$ and are mutually independent.
2.

The activation function satisfies $\phi\in L^{2}(\mathbb{R},\gamma)$ , is differentiable almost everywhere and its derivative, which we denote $\phi^{\prime}$ , also satisfies $\phi^{\prime}\in L^{2}(\mathbb{R},\gamma)$ .
3.

The widths are sent to infinity in sequence, $m_{1}\rightarrow\infty,m_{2}\rightarrow\infty,\ldots,m_{L}\rightarrow\infty$ .

Under Assumption 1, for any $l\in[L+1]$ , $\tilde{\Theta}^{(l)}(\mathbf{x},\mathbf{y})$ converges in probability to a deterministic limit $\Theta^{(l)}\colon\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}$ (Jacot et al., 2018) and the network behaves like a kernelized linear predictor during training; see, e.g., Arora et al. (2019b); Lee et al. (2019); Woodworth et al. (2020). Given access to the rows $(\mathbf{x}_{i})_{i=1}^{n}$ of $\mathbf{X}$ the NTK matrix at layer $l\in[L+1]$ , which we denote $\mathbf{K}_{l}$ , is the $n\times n$ matrix with entries defined as

[\mathbf{K}_{l}]_{ij}=\frac{1}{n}\Theta^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j}),\;\forall(i,j)\in[n]\times[n].

(3)

3 Expressing the NTK as a power series

The following assumption allows us to study a power series for the NTK of deep networks and with general activation functions. We remark that power series for the NTK of deep networks with positive homogeneous activation functions, namely ReLU, have been studied in prior works Han et al. (2022); Chen & Xu (2021); Bietti & Bach (2021); Geifman et al. (2022). We further remark that while these works focus on the asymptotics of the NTK spectrum we also study the large eigenvalues.

Assumption 2.

The hyperparameters of the network satisfy $\gamma_{w}^{2}+\gamma_{b}^{2}=1$ , $\sigma_{w}^{2}\mathbb{E}_{Z\sim\mathcal{N}(0,1)}[\phi(Z)^{2}]\leq 1$ and $\sigma_{b}^{2}=1-\sigma_{w}^{2}\mathbb{E}_{Z\sim\mathcal{N}(0,1)}[\phi(Z)^{2}]$ . The data is normalized so that $\left\lVert\mathbf{x}_{i}\right\rVert=1$ for all $i\in[n]$ .

Recall under Assumption 1 that the preactivations of the network are centered Gaussian processes (Neal, 1996; Lee et al., 2018). Assumption 2 ensures the preactivation of each neuron has unit variance and thus is reminiscent of the LeCun et al. (2012), Glorot & Bengio (2010) and He et al. (2015) initializations, which are designed to avoid vanishing and exploding gradients. We refer the reader to Appendix A.3 for a thorough discussion. Under Assumption 2 we will show it is possible to write the NTK not only as a dot-product kernel but also as an analytic power series on $[-1,1]$ and derive expressions for the coefficients. In order to state this result recall, given a function $f\in L^{2}(\mathbb{R},\gamma)$ , that the $p$ th normalized probabilist’s Hermite coefficient of $f$ is denoted $\mu_{p}(f)$ , we refer the reader to Appendix A.4 for an overview of the Hermite polynomials and their properties. Furthermore, letting $\bar{a}=(a_{j})_{j=0}^{\infty}$ denote a sequence of real numbers, then for any $p,k\in\mathbb{Z}_{\geq 0}$ we define

F(p,k,\bar{a})=\begin{cases}1,\;&k=0\text{ and }p=0,\\ 0,\;&k=0\text{ and }p\geq 1,\\ \sum_{(j_{i})\in\mathcal{J}(p,k)}\prod_{i=1}^{k}a_{j_{i}},\;&k\geq 1\text{ and }p\geq 0,\end{cases}

(4)

where

\mathcal{J}(p,k)\vcentcolon=\big{\{}(j_{i})_{i\in[k]}\;:\;j_{i}\geq 0\;\forall i\in[k],\;\sum_{i=1}^{k}j_{i}=p\big{\}}\quad\text{for all $p\in\mathbb{Z}_{\geq 0}$, $k\in\mathbb{N}$}.

Here $\mathcal{J}(p,k)$ is the set of all $k$ -tuples of nonnegative integers which sum to $p$ and $F(p,k,\bar{a})$ is therefore the sum of all ordered products of $k$ elements of $\bar{a}$ whose indices sum to $p$ . We are now ready to state the key result of this section, Theorem 3.1, whose proof is provided in Appendix B.1.

Theorem 3.1.

Under Assumptions 1 and 2, for all $l\in[L+1]$

n\mathbf{K}_{l}=\sum_{p=0}^{\infty}\kappa_{p,l}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot p}.

(5)

The series for each entry $n[\mathbf{K}_{l}]_{ij}$ converges absolutely and the coefficients $\kappa_{p,l}$ are nonnegative and can be evaluated using the recurrence relationships

\kappa_{p,l}=\begin{cases}\delta_{p=0}\gamma_{b}^{2}+\delta_{p=1}\gamma_{w}^{2},&l=1,\\ \alpha_{p,l}+\sum_{q=0}^{p}\kappa_{q,l-1}\upsilon_{p-q,l},&l\in[2,L+1],\end{cases}

(6)

where

\alpha_{p,l}=\begin{cases}\sigma_{w}^{2}\mu_{p}^{2}(\phi)+\delta_{p=0}\sigma_{b}^{2},&l=2,\\ \sum_{k=0}^{\infty}\alpha_{k,2}F(p,k,\bar{\alpha}_{l-1}),&l\geq 3,\end{cases}

(7)

and

\upsilon_{p,l}=\begin{cases}\sigma_{w}^{2}\mu_{p}^{2}(\phi^{\prime}),&l=2,\\ \sum_{k=0}^{\infty}\upsilon_{k,2}F(p,k,\bar{\alpha}_{l-1}),&l\geq 3,\end{cases}

(8)

are likewise nonnegative for all $p\in\mathbb{Z}_{\geq 0}$ and $l\in[2,L+1]$ .

As already remarked, power series for the NTK have been studied in previous works, however, to the best of our knowledge Theorem 3.1 is the first to explicitly express the coefficients at a layer in terms of the coefficients of previous layers. To compute the coefficients of the NTK as per Theorem 3.1, the Hermite coefficients of both $\phi$ and $\phi^{\prime}$ are required. Under Assumption 3 below, which has minimal impact on the generality of our results, this calculation can be simplified. In short, under Assumption 3 $\upsilon_{p,2}=(p+1)\alpha_{p+1,2}$ and therefore only the Hermite coefficients of $\phi$ are required. We refer the reader to Lemma B.3 in Appendix B.2 for further details.

Assumption 3.

The activation function $\phi\colon\mathbb{R}\rightarrow\mathbb{R}$ is absolutely continuous on $[-a,a]$ for all $a>0$ , differentiable almost everywhere, and is polynomially bounded, i.e., $|\phi(x)|=\mathcal{O}(|x|^{\beta})$ for some $\beta>0$ . Further, the derivative $\phi^{\prime}\colon\mathbb{R}\rightarrow\mathbb{R}$ satisfies $\phi^{\prime}\in L^{2}(\mathbb{R},\gamma)$ .

We remark that ReLU, Tanh, Sigmoid, Softplus and many other commonly used activation functions satisfy Assumption 3. In order to understand the relationship between the Hermite coefficients of the activation function and the coefficients of the NTK, we first consider the simple two-layer case with $L=1$ hidden layers. From Theorem 3.1

\kappa_{p,2}=\sigma_{w}^{2}(1+\gamma_{w}^{2}p)\mu_{p}^{2}(\phi)+\sigma_{w}^{2}\gamma_{b}^{2}(1+p)\mu_{p+1}^{2}(\phi)+\delta_{p=0}\sigma_{b}^{2}.

(9)

As per Table 1, a general trend we observe across all activation functions is that the first few coefficients account for the large majority of the total NTK coefficient series.

Table 1: Percentage of

\sum_{p=0}^{\infty}\kappa_{p,2}

accounted for by the first

T+1

NTK coefficients assuming

\gamma_{w}^{2}=1

\gamma_{b}^{2}=0

\sigma_{w}^{2}=1

and

\sigma_{b}^{2}=1-\mathbb{E}[\phi(Z)^{2}]

$T=$	0	1	2	3	4	5
ReLU	43.944	77.277	93.192	93.192	95.403	95.403
Tanh	41.362	91.468	91.468	97.487	97.487	99.090
Sigmoid	91.557	99.729	99.729	99.977	99.977	99.997
Gaussian	95.834	95.834	98.729	98.729	99.634	99.634

However, the asymptotic rate of decay of the NTK coefficients varies significantly by activation function, due to the varying behavior of their tails. In Lemma 3.2 we choose ReLU, Tanh and Gaussian as prototypical examples of activations functions with growing, constant, and decaying tails respectively, and analyze the corresponding NTK coefficients in the two layer setting. For typographical ease we denote the zero mean Gaussian density function with variance $\sigma^{2}$ as $\omega_{\sigma}(z)\vcentcolon=(1/\sqrt{2\pi\sigma^{2}})\exp\left(-z^{2}/(2\sigma^{2})\right)$ .

Lemma 3.2.

Under Assumptions 1 and 2,

1.

if $\phi(z)=ReLU(z)$ , then $\kappa_{p,2}=\delta_{(\gamma_{b}>0)\cup(p\text{ even})}\Theta(p^{-3/2})$ ,
2.

if $\phi(z)=Tanh(z)$ , then $\kappa_{p,2}=\mathcal{O}\left(\exp\left(-\frac{\pi\sqrt{p-1}}{2}\right)\right)$ ,
3.

if $\phi(z)=\omega_{\sigma}(z)$ , then $\kappa_{p,2}=\delta_{(\gamma_{b}>0)\cup(p\text{ even})}\Theta(p^{1/2}(\sigma^{2}+1)^{-p})$ .

The trend we observe from Lemma 3.2 is that activation functions whose Hermite coefficients decay quickly, such as $\omega_{\sigma}$ , result in a faster decay of the NTK coefficients. We remark that analyzing the rates of decay for $l\geq 3$ is challenging due to the calculation of $F(p,k,\bar{\alpha}_{l-1})$ (4). In Appendix B.4 we provide preliminary results in this direction, upper bounding, in a very specific setting, the decay of the NTK coefficients for depths $l\geq 2$ . Finally, we briefly pause here to highlight the potential for using a truncation of (5) in order to perform efficient numerical approximation of the infinite width NTK. We remark that this idea is also addressed in a concurrent work by Han et al. (2022), albeit under a somewhat different set of assumptions ¹¹1In particular, in Han et al. (2022) the authors focus on homogeneous activation functions and allow the data to lie off the sphere. By contrast, we require the data to lie on the sphere but can handle non-homogeneous activation functions in the deep setting.. As per our observations thus far that the coefficients of the NTK power series (5) typically decay quite rapidly, one might consider approximating $\Theta^{(l)}$ by computing just the first few terms in each series of (5). Figure 2 in Appendix B.3 displays the absolute error between the truncated ReLU NTK and the analytical expression for the ReLU NTK, which is also defined in Appendix B.3. Letting $\rho$ denote the input correlation then the key takeaway is that while for $\left\lvert\rho\right\rvert$ close to one the approximation is poor, for $\left\lvert\rho\right\rvert<0.5$ , which is arguably more realistic for real-world data, with just $50$ coefficients machine level precision can be achieved. We refer the interested reader to Appendix B.3 for a proper discussion.

4 Analyzing the spectrum of the NTK via its power series

In this section, we consider a general kernel matrix power series of the form $n\mathbf{K}=\sum_{p=0}^{\infty}c_{p}(\mathbf{X}\mathbf{X}^{T})^{\odot p}$ where $\{c_{p}\}_{p=0}^{\infty}$ are coefficients and $\mathbf{X}$ is the data matrix. According to Theorem 3.1, the coefficients of the NTK power series (5) are always nonnegative, thus we only consider the case where $c_{p}$ are nonnegative. We will also consider the kernel function power series, which we denote as $K(x_{1},x_{2})=\sum_{p=0}^{\infty}c_{p}\langle x_{1},x_{2}\rangle^{p}$ . Later on we will analyze the spectrum of kernel matrix $\mathbf{K}$ and kernel function $K$ .

4.1 Analysis of the upper spectrum and effective rank

In this section we analyze the upper part of the spectrum of the NTK, corresponding to the large eigenvalues, using the power series given in Theorem 3.1. Our first result concerns the effective rank (Huang et al., 2022) of the NTK. Given a positive semidefinite matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$ we define the effective rank of $\mathbf{A}$ to be

\mathrm{eff}(\mathbf{A})=\frac{Tr(\mathbf{A})}{\lambda_{1}(\mathbf{A})}.

The effective rank quantifies how many eigenvalues are on the order of the largest eigenvalue. This follows from the Markov-like inequality

\left\lvert\{p:\lambda_{p}(\mathbf{A})\geq c\lambda_{1}(\mathbf{A})\}\right\rvert\leq c^{-1}\mathrm{eff}(\mathbf{A})

(10)

and the eigenvalue bound

\frac{\lambda_{p}(\mathbf{A})}{\lambda_{1}(\mathbf{A})}\leq\frac{\mathrm{eff}(\mathbf{A})}{p}.

Our first result is that the effective rank of the NTK can be bounded in terms of a ratio involving the power series coefficients. As we are assuming the data is normalized so that $\left\lVert\mathbf{x}_{i}\right\rVert=1$ for all $i\in[n]$ , then observe by the linearity of the trace

Tr(n\mathbf{K})=\sum_{p=0}^{\infty}c_{p}Tr((\mathbf{X}\mathbf{X}^{T})^{\odot p})=n\sum_{p=0}^{\infty}c_{p},

where we have used the fact that $Tr((\mathbf{X}\mathbf{X}^{T})^{\odot p})=n$ for all $p\in\mathbb{N}$ . On the other hand,

\lambda_{1}(n\mathbf{K})\geq\lambda_{1}(c_{0}(\mathbf{X}\mathbf{X}^{T})^{0})=\lambda_{1}(c_{0}\mathbf{1}_{n\times n})=nc_{0}.

Combining these two results we get the following theorem.

Theorem 4.1.

Assume that we have a kernel Gram matrix $\mathbf{K}$ of the form $n\mathbf{K}=\sum_{p=0}^{\infty}c_{p}(\mathbf{X}\mathbf{X}^{T})^{\odot p}$ where $c_{0}\neq 0$ . Furthermore, assume the input data $\mathbf{x}_{i}$ are normalized so that $\left\lVert\mathbf{x}_{i}\right\rVert=1$ for all $i\in[n]$ . Then

\mathrm{eff}(\mathbf{K})\leq\frac{\sum_{p=0}^{\infty}c_{p}}{c_{0}}.

By Theorem 3.1 $c_{0}\neq 0$ provided the network has biases or the activation function has nonzero Gaussian expectation (i.e., $\mu_{0}(\phi)\neq 0$ ). Thus we have that the effective rank of $\mathbf{K}$ is bounded by an $O(1)$ quantity. In the case of ReLU for example, as evidenced by Table 1, the effective rank will be roughly $2.3$ for a shallow network. By contrast, a well-conditioned matrix would have an effective rank that is $\Omega(n)$ . Combining Theorem 4.1 and the Markov-type bound (10) we make the following important observation.

Observation 4.2.

The largest eigenvalue $\lambda_{1}(\mathbf{K})$ of the NTK takes up an $\Omega(1)$ fraction of the entire trace and there are $O(1)$ eigenvalues on the same order of magnitude as $\lambda_{1}(\mathbf{K})$ , where the $O(1)$ and $\Omega(1)$ notation are with respect to the parameter $n$ .

While the constant term $c_{0}\mathbf{1}_{n\times n}$ in the kernel leads to a significant outlier in the spectrum of $\mathbf{K}$ , it is rather uninformative beyond this. What interests us is how the structure of the data $\mathbf{X}$ manifests in the spectrum of the kernel matrix $\mathbf{K}$ . For this reason we will examine the centered kernel matrix $\widetilde{\mathbf{K}}:=\mathbf{K}-\frac{c_{0}}{n}\mathbf{1}_{n\times n}$ . By a very similar argument as before we get the following result.

Theorem 4.3.

Assume that we have a kernel Gram matrix $\mathbf{K}$ of the form $n\mathbf{K}=\sum_{p=0}^{\infty}c_{p}(\mathbf{X}\mathbf{X}^{T})^{\odot p}$ where $c_{1}\neq 0$ . Furthermore, assume the input data $\mathbf{x}_{i}$ are normalized so that $\left\lVert\mathbf{x}_{i}\right\rVert=1$ for all $i\in[n]$ . Then the centered kernel $\widetilde{\mathbf{K}}:=\mathbf{K}-\frac{c_{0}}{n}\mathbf{1}_{n\times n}$ satisfies

\mathrm{eff}(\widetilde{\mathbf{K}})\leq\mathrm{eff}(\mathbf{X}\mathbf{X}^{T})\frac{\sum_{p=1}^{\infty}c_{p}}{c_{1}}.

Thus we have that the effective rank of the centered kernel $\widetilde{\mathbf{K}}$ is upper bounded by a constant multiple of the effective rank of the input data Gram $\mathbf{X}\mathbf{X}^{T}$ . Furthermore, we can take the ratio $\frac{\sum_{p=1}^{\infty}c_{p}}{c_{1}}$ as a measure of how much the NTK inherits the behavior of the linear kernel $\mathbf{X}\mathbf{X}^{T}$ : in particular, if the input data gram has low effective rank and this ratio is moderate then we may conclude that the centered NTK must also have low effective rank. Again from Table 1, in the shallow setting we see that this ratio tends to be small for many of the common activations, for example, for ReLU it is roughly 1.3. To summarize then from Theorem 4.3 we make the important observation.

Observation 4.4.

Whenever the input data are approximately low rank, the centered kernel matrix $\widetilde{\mathbf{K}}=\mathbf{K}-\frac{c_{0}}{n}\mathbf{1}_{n\times n}$ is also approximately low rank.

It turns out that this phenomenon also holds for finite-width networks at initialization. Consider the shallow model

\sum_{\ell=1}^{m}a_{\ell}\phi(\langle\mathbf{w}_{\ell},\mathbf{x}\rangle),

where $\mathbf{x}\in\mathbb{R}^{d}$ and $\mathbf{w}_{\ell}\in\mathbb{R}^{d}$ , $a_{\ell}\in\mathbb{R}$ for all $\ell\in[m]$ . The following theorem demonstrates that when the width $m$ is linear in the number of samples $n$ then $\mathrm{eff}(\mathbf{K})$ is upper bounded by a constant multiple of $\mathrm{eff}(\mathbf{X}\mathbf{X}^{T})$ .

Theorem 4.5.

Assume $\phi(x)=ReLU(x)$ and $n\geq d$ . Fix $\epsilon>0$ small. Suppose that $\mathbf{w}_{1},\ldots,\mathbf{w}_{m}\sim N(0,\nu_{1}^{2}I_{d})$ i.i.d. and $a_{1},\ldots,a_{m}\sim N(0,\nu_{2}^{2})$ . Set $M=\max_{i\in[n]}\left\lVert\mathbf{x}_{i}\right\rVert_{2}$ , and let

\Sigma:=\mathbb{E}_{\mathbf{w}\sim N(0,\nu_{1}^{2}I)}[\phi(\mathbf{X}\mathbf{w})\phi(\mathbf{w}^{T}\mathbf{X}^{T})].

Then

m=\Omega\left(\max(\lambda_{1}(\Sigma)^{-2},1)\max(n,\log(1/\epsilon))\right),\quad\nu_{1}=O(1/M\sqrt{m})

suffices to ensure that, with probability at least $1-\epsilon$ over the sampling of the parameter initialization,

\mathrm{eff}(\mathbf{K})\leq C\cdot\mathrm{eff}(\mathbf{X}\mathbf{X}^{T}),

where $C>0$ is an absolute constant.

Many works consider the model where the outer layer weights are fixed and have constant magnitude and only the inner layer weights are trained. This is the setting considered by Xie et al. (2017), Arora et al. (2019a), Du et al. (2019b), Oymak et al. (2019), Li et al. (2020), and Oymak & Soltanolkotabi (2020). In this setting we can reduce the dependence on the width $m$ to only be logarithmic in the number of samples $n$ , and we have an accompanying lower bound. See Theorem C.5 in the Appendix C.2.3 for details.

In Figure 1 we empirically validate our theory by computing the spectrum of the NTK on both Caltech101 (Li et al., 2022) and isotropic Gaussian data for feedforward networks. We use the functorch²²2https://pytorch.org/functorch/stable/notebooks/neural_tangent_kernels.html module in PyTorch (Paszke et al., 2019) using an algorithmic approach inspired by Novak et al. (2022). As per Theorem 4.1 and Observation 4.2, we observe all network architectures exhibit a dominant outlier eigenvalue due to the nonzero constant coefficient in the power series. Furthermore, this dominant outlier becomes more pronounced with depth, as can be observed if one carries out the calculations described in Theorem 3.1. Additionally, this outlier is most pronounced for ReLU, as the combination of its Gaussian mean plus bias term is the largest out of the activations considered here. As predicted by Theorem 4.3, Observation 4.4 and Theorem 4.5, we observe real-world data, which has a skewed spectrum and hence a low effective rank, results in the spectrum of the NTK being skewed. By contrast, isotropic Gaussian data has a flat spectrum, and as a result beyond the outlier the decay of eigenvalues of the NTK is more gradual. These observations support the claim that the NTK inherits its spectral structure from the data. We also observe that the spectrum for Tanh is closer to the linear activation relative to ReLU: intuitively this should not be surprising as close to the origin Tanh is well approximated by the identity. Our theory provides a formal explanation for this observation, indeed, the power series coefficients for Tanh networks decay quickly relative to ReLU. We provide further experimental results in Appendix C.3, including for CNNs where we observe the same trends. We note that the effective rank has implications for the generalization error. The Rademacher complexity of a kernel method (and hence the NTK model) within a parameter ball is determined by its its trace (Bartlett & Mendelson, 2002). Since for the NTK $\lambda_{1}(\mathbf{K})=O(1)$ , lower effective rank implies smaller trace and hence limited complexity.

Refer to caption — Figure 1: (Feedforward NTK Spectrum) We plot the normalized eigenvalues $\lambda_{p}/\lambda_{1}$ of the NTK Gram matrix $\mathbf{K}$ and the data Gram matrix $\mathbf{X}\mathbf{X}^{T}$ for Caltech101 and isotropic Gaussian datasets. To compute the NTK we randomly initialize feedforward networks of depths $2$ and $5$ with width $500$ . We use the standard parameterization and Pytorch’s default Kaiming uniform initialization in order to better connect our results with what is used in practice. We consider a batch size of $n=200$ and plot the first $100$ eigenvalues. The thick part of each curve corresponds to the mean across 10 trials, while the transparent part corresponds to the 95% confidence interval

4.2 Analysis of the lower spectrum

In this section, we analyze the lower part of the spectrum using the power series. We first analyze the kernel function $K$ which we recall is a dot-product kernel of the form $K(x_{1},x_{2})=\sum_{p=0}^{\infty}c_{p}\langle x_{1},x_{2}\rangle^{p}$ . Assuming the training data is uniformly distributed on a hypersphere it was shown by Basri et al. (2019); Bietti & Mairal (2019) that the eigenfunctions of $K$ are the spherical harmonics. Azevedo & Menegatto (2015) gave the eigenvalues of the kernel $K$ in terms of the power series coefficients.

Theorem 4.6.

[Azevedo & Menegatto (2015)] Let $\Gamma$ denote the gamma function. Suppose that the training data are uniformly sampled from the unit hypersphere $\mathbb{S}^{d}$ , $d\geq 2$ . If the dot-product kernel function has the expansion $K(x_{1},x_{2})=\sum_{p=0}^{\infty}c_{p}\langle x_{1},x_{2}\rangle^{p}$ where $c_{p}\geq 0$ , then the eigenvalue of every spherical harmonic of frequency $k$ is given by

\overline{\lambda_{k}}=\frac{\pi^{d/2}}{2^{k-1}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}c_{p}\frac{\Gamma(p+1)\Gamma(\frac{p-k+1}{2})}{\Gamma(p-k+1)\Gamma(\frac{p-k+1}{2}+k+d/2)}.

A proof of Theorem 4.6 is provided in Appendix C.4 for the reader’s convenience. This theorem connects the coefficients $c_{p}$ of the kernel power series with the eigenvalues $\overline{\lambda_{k}}$ of the kernel. In particular, given a specific decay rate for the coefficients $c_{p}$ one may derive the decay rate of $\overline{\lambda_{k}}$ : for example, Scetbon & Harchaoui (2021) examined the decay rate of $\overline{\lambda_{k}}$ if $c_{p}$ admits a polynomial decay or exponential decay. The following Corollary summarizes the decay rates of $\overline{\lambda_{k}}$ corresponding to two layer networks with different activations.

Corollary 4.7.

Under the same setting as in Theorem 4.6,

1.

if $c_{p}=\Theta(p^{-a})$ where $a\geq 1$ , then $\overline{\lambda_{k}}=\Theta(k^{-d-2a+2})$ ,
2.

if $c_{p}=\delta_{(p\text{ even})}\Theta(p^{-a})$ , then $\overline{\lambda_{k}}=\delta_{(k\text{ even})}\Theta(k^{-d-2a+2})$ ,
3.

if $c_{p}=\mathcal{O}\left(\exp\left(-a\sqrt{p}\right)\right)$ , then $\overline{\lambda_{k}}=\mathcal{O}\left(k^{-d+1/2}\exp\left(-a\sqrt{k}\right)\right)$ ,
4.

if $c_{p}=\Theta(p^{1/2}a^{-p})$ , then $\overline{\lambda_{k}}=\mathcal{O}\left(k^{-d+1}a^{-k}\right)$ and $\overline{\lambda_{k}}=\Omega\left(k^{-d/2+1}2^{-k}a^{-k}\right)$ .

In addition to recovering existing results for ReLU networks Basri et al. (2019); Velikanov & Yarotsky (2021); Geifman et al. (2020); Bietti & Bach (2021), Corollary 4.7 also provides the decay rates for two-layer networks with Tanh and Gaussian activations. As faster eigenvalue decay implies a smaller RKHS Corollary 4.7 shows using ReLU results in a larger RKHS relative to Tanh or Gaussian activations. Numerics for Corollary 4.7 are provided in Figure 4 in Appendix C.3. Finally, in Theorem 4.8 we relate a kernel’s power series to its spectral decay for arbitrary data distributions.

Theorem 4.8 (Informal).

Let the rows of $\mathbf{X}\in\mathbb{R}^{n\times d}$ be arbitrary points on the unit sphere. Consider the kernel matrix $n\mathbf{K}=\sum_{p=0}^{\infty}c_{p}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot p}$ and let $r(n)\leq d$ denote the rank of $\mathbf{X}\mathbf{X}^{T}$ . Then

1.

if $c_{p}=\mathcal{O}(p^{-\alpha})$ with $\alpha>r(n)+1$ for all $n\in\mathbb{Z}_{\geq 0}$ then $\lambda_{n}(\mathbf{K})=\mathcal{O}\left(n^{-\frac{\alpha-1}{r(n)}}\right)$ ,
2.

if $c_{p}=\mathcal{O}(e^{-\alpha\sqrt{p}})$ then $\lambda_{n}(\mathbf{K})=\mathcal{O}\left(n^{\frac{1}{2r(n)}}\exp\left(-\alpha^{\prime}n^{\frac{1}{2r(n)}}\right)\right)$ for any $\alpha^{\prime}<\alpha 2^{-1/2r(n)}$ ,
3.

if $c_{p}=\mathcal{O}(e^{-\alpha p})$ then $\lambda_{n}(\mathbf{K})=\mathcal{O}\left(\exp\left(-\alpha^{\prime}n^{\frac{1}{r(n)}}\right)\right)$ for any $\alpha^{\prime}<\alpha 2^{-1/2r(n)}$ .

Although the presence of the factor $1/r(n)$ in the exponents of $n$ in these bounds is a weakness, Theorem 4.8 still illustrates how, in a highly general setting, the asymptotic decay of the coefficients of the power series ensures a certain asymptotic decay in the eigenvalues of the kernel matrix. A formal version of this result is provided in Appendix C.5 along with further discussion.

5 Conclusion

Using a power series expansion we derived a number of insights into both the outliers as well as the asymptotic decay of the spectrum of the NTK. We are able to perform our analysis without recourse to a high dimensional limit or the use of random matrix theory. Interesting avenues for future work include better characterizing the role of depth and performing the same analysis on networks with convolutional or residual layers.

Reproducibility Statement

To ensure reproducibility, we make the code public at https://github.com/bbowman223/data_ntk.

Acknowledgements and Disclosure of Funding

This project has been supported by ERC Grant 757983 and NSF CAREER Grant DMS-2145630.

References

Allen-Zhu et al. (2019) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 242–252. PMLR, 2019. URL https://proceedings.mlr.press/v97/allen-zhu19a.html.
Anthony & Bartlett (2002) Martin Anthony and Peter L. Bartlett. Neural Network Learning - Theoretical Foundations. Cambridge University Press, 2002. URL http://www.cambridge.org/gb/knowledge/isbn/item1154061/?site_locale=en_GB.
Arora et al. (2019a) Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 322–332. PMLR, 2019a. URL https://proceedings.mlr.press/v97/arora19a.html.
Arora et al. (2019b) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019b. URL https://proceedings.neurips.cc/paper/2019/file/dbc4d84bfcfe2284ba11beffb853a8c4-Paper.pdf.
Azevedo & Menegatto (2015) Douglas Azevedo and Valdir A Menegatto. Eigenvalues of dot-product kernels on the sphere. Proceeding Series of the Brazilian Society of Computational and Applied Mathematics, 3(1), 2015.
Bartlett & Mendelson (2002) Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, 2002. URL http://dblp.uni-trier.de/db/journals/jmlr/jmlr3.html#BartlettM02.
Basri et al. (2019) Ronen Basri, David W. Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 4763–4772, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/5ac8bb8a7d745102a978c5f8ccdb61b8-Abstract.html.
Basri et al. (2020) Ronen Basri, Meirav Galun, Amnon Geifman, David Jacobs, Yoni Kasten, and Shira Kritchman. Frequency bias in neural networks for input of non-uniform density. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 685–694. PMLR, 2020. URL https://proceedings.mlr.press/v119/basri20a.html.
Bietti & Bach (2021) Alberto Bietti and Francis Bach. Deep equals shallow for ReLU networks in kernel regimes. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=aDjoksTpXOP.
Bietti & Mairal (2019) Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/c4ef9c39b300931b69a36fb3dbb8d60e-Paper.pdf.
Bowman & Montúfar (2022) Benjamin Bowman and Guido Montúfar. Implicit bias of MSE gradient optimization in underparameterized neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=VLgmhQDVBV.
Bowman & Montufar (2022) Benjamin Bowman and Guido Montufar. Spectral bias outside the training set for deep networks in the kernel regime. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=a01PL2gb7W5.
Caponnetto & De Vito (2007) Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
Chen & Xu (2021) Lin Chen and Sheng Xu. Deep neural tangent kernel and laplace kernel have the same RKHS. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=vK9WrZ0QYQ.
Cui et al. (2021) Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. In Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=Da_EHrAcfwd.
Daniely et al. (2016) Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/abea47ba24142ed16b7d8fbf2c740e0d-Paper.pdf.
Davis (2021) Tom Davis. A general expression for Hermite expansions with applications. 2021. doi: 10.13140/RG.2.2.30843.44325. URL https://www.researchgate.net/profile/Tom-Davis-2/publication/352374514_A_GENERAL_EXPRESSION_FOR_HERMITE_EXPANSIONS_WITH_APPLICATIONS/links/60c873c5a6fdcc8267cf74d4/A-GENERAL-EXPRESSION-FOR-HERMITE-EXPANSIONS-WITH-APPLICATIONS.pdf.
de G. Matthews et al. (2018) Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1-nGgWC-.
Du et al. (2019a) Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1675–1685. PMLR, 2019a. URL https://proceedings.mlr.press/v97/du19c.html.
Du et al. (2019b) Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=S1eK3i09YQ.
Engel et al. (2022) Andrew Engel, Zhichao Wang, Anand Sarwate, Sutanay Choudhury, and Tony Chiang. TorchNTK: A library for calculation of neural tangent kernels of PyTorch models. 2022.
Fan & Wang (2020) Zhou Fan and Zhichao Wang. Spectra of the conjugate kernel and neural tangent kernel for linear-width neural networks. In Advances in Neural Information Processing Systems, volume 33, pp. 7710–7721. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/572201a4497b0b9f02d4f279b09ec30d-Paper.pdf.
Folland (1999) G. B. Folland. Real analysis: Modern techniques and their applications. Wiley, New York, 1999.
Geifman et al. (2020) Amnon Geifman, Abhay Yadav, Yoni Kasten, Meirav Galun, David Jacobs, and Basri Ronen. On the similarity between the Laplace and neural tangent kernels. In Advances in Neural Information Processing Systems, volume 33, pp. 1451–1461. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1006ff12c465532f8c574aeaa4461b16-Paper.pdf.
Geifman et al. (2022) Amnon Geifman, Meirav Galun, David Jacobs, and Ronen Basri. On the spectral bias of convolutional neural tangent and gaussian process kernels. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=gthKzdymDu2.
Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 249–256. PMLR, 2010. URL https://proceedings.mlr.press/v9/glorot10a.html.
Han et al. (2022) Insu Han, Amir Zandieh, Jaehoon Lee, Roman Novak, Lechao Xiao, and Amin Karbasi. Fast neural kernel embeddings for general activations. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=yLilJ1vZgMe.
He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034, 2015.
Huang et al. (2022) Ningyuan Teresa Huang, David W. Hogg, and Soledad Villar. Dimensionality reduction, regularization, and generalization in overparameterized regressions. SIAM J. Math. Data Sci., 4(1):126–152, 2022. URL https://doi.org/10.1137/20m1387821.
Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.
Jin et al. (2022) Hui Jin, Pradeep Kr. Banerjee, and Guido Montúfar. Learning curves for gaussian process regression with power-law priors and targets. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=KeI9E-gsoB.
Karakida et al. (2020) Ryo Karakida, Shotaro Akaho, and Shun ichi Amari. Universal statistics of Fisher information in deep neural networks: mean field approach. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124005, 2020. URL https://doi.org/10.1088/1742-5468/abc62e.
Kazarinoff (1956) Donat K. Kazarinoff. On Wallis’ formula. Edinburgh Mathematical Notes, 40:19–21, 1956.
LeCun et al. (2012) Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient BackProp, pp. 9–48. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. URL https://doi.org/10.1007/978-3-642-35289-8_3.
Lee et al. (2018) Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.
Lee et al. (2019) Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/0d1a9651497a38d8b1c3871c84528bd4-Paper.pdf.
Lee et al. (2020) Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. In Advances in Neural Information Processing Systems, volume 33, pp. 15156–15172. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ad086f59924fffe0773f8d0ca22ea712-Paper.pdf.
Li et al. (2022) Li, Andreeto, Ranzato, and Perona. Caltech 101, Apr 2022.
Li et al. (2020) Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 4313–4324. PMLR, 2020. URL https://proceedings.mlr.press/v108/li20j.html.
Louart et al. (2018) Cosme Louart, Zhenyu Liao, and Romain Couillet. A random matrix approach to neural networks. The Annals of Applied Probability, 28(2):1190–1248, 2018. URL https://www.jstor.org/stable/26542333.
Mishkin & Matas (2016) Dmytro Mishkin and Jiri Matas. All you need is a good init. In Yoshua Bengio and Yann LeCun (eds.), 4th International Conference on Learning Representations, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.06422.
Murray et al. (2022) M. Murray, V. Abrol, and J. Tanner. Activation function design for deep networks: linearity and effective initialisation. Applied and Computational Harmonic Analysis, 59:117–154, 2022. URL https://www.sciencedirect.com/science/article/pii/S1063520321001111. Special Issue on Harmonic Analysis and Machine Learning.
Neal (1996) Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, Berlin, Heidelberg, 1996.
Nguyen (2021) Quynh Nguyen. On the proof of global convergence of gradient descent for deep relu networks with linear widths. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8056–8062. PMLR, 2021. URL https://proceedings.mlr.press/v139/nguyen21a.html.
Nguyen & Mondelli (2020) Quynh Nguyen and Marco Mondelli. Global convergence of deep networks with one wide layer followed by pyramidal topology. In Advances in Neural Information Processing Systems, volume 33, pp. 11961–11972. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/8abfe8ac9ec214d68541fcb888c0b4c3-Paper.pdf.
Nguyen et al. (2021) Quynh Nguyen, Marco Mondelli, and Guido Montúfar. Tight bounds on the smallest eigenvalue of the neural tangent kernel for deep ReLU networks. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8119–8129. PMLR, 2021. URL https://proceedings.mlr.press/v139/nguyen21g.html.
Novak et al. (2019) Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. In 7th International Conference on Learning Representations. OpenReview.net, 2019. URL https://openreview.net/forum?id=B1g30j0qF7.
Novak et al. (2022) Roman Novak, Jascha Sohl-Dickstein, and Samuel S Schoenholz. Fast finite width neural tangent kernel. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 17018–17044. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/novak22a.html.
O’Donnell (2014) Ryan O’Donnell. Analysis of Boolean functions. Cambridge University Press, 2014.
Oymak & Soltanolkotabi (2020) Samet Oymak and Mahdi Soltanolkotabi. Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1(1), 2020. URL https://par.nsf.gov/biblio/10200049.
Oymak et al. (2019) Samet Oymak, Zalan Fabian, Mingchen Li, and Mahdi Soltanolkotabi. Generalization guarantees for neural networks via harnessing the low-rank structure of the Jacobian. CoRR, abs/1906.05392, 2019. URL http://arxiv.org/abs/1906.05392.
Panigrahi et al. (2020) Abhishek Panigrahi, Abhishek Shetty, and Navin Goyal. Effect of activation functions on the training of overparametrized neural nets. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgfdeBYvH.
Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
Pennington & Worah (2017) Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/0f3d014eead934bbdbacb62a01dc4831-Paper.pdf.
Pennington & Worah (2018) Jeffrey Pennington and Pratik Worah. The spectrum of the Fisher information matrix of a single-hidden-layer neural network. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/18bb68e2b38e4a8ce7cf4f6b2625768c-Paper.pdf.
Poole et al. (2016) Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/148510031349642de5ca0c544f31b2ef-Paper.pdf.
Scetbon & Harchaoui (2021) Meyer Scetbon and Zaid Harchaoui. A spectral analysis of dot-product kernels. In International conference on artificial intelligence and statistics, pp. 3394–3402. PMLR, 2021.
Schoenholz et al. (2017) Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. In International Conference on Learning Representations (ICLR), 2017. URL https://openreview.net/pdf?id=H1W1UN9gg.
Schur (1911) J. Schur. Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen. Journal für die reine und angewandte Mathematik, 140:1–28, 1911. URL http://eudml.org/doc/149352.
Simon et al. (2022) James Benjamin Simon, Sajant Anand, and Mike Deweese. Reverse engineering the neural tangent kernel. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 20215–20231. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/simon22a.html.
Sontag & Sussmann (1989) Eduardo D. Sontag and Héctor J. Sussmann. Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems, 3:91–106, 1989.
Velikanov & Yarotsky (2021) Maksim Velikanov and Dmitry Yarotsky. Explicit loss asymptotics in the gradient descent training of neural networks. In Advances in Neural Information Processing Systems, volume 34, pp. 2570–2582. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/14faf969228fc18fcd4fcf59437b0c97-Paper.pdf.
Vershynin (2012) Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices, pp. 210–268. Cambridge University Press, 2012.
Weyl (1912) Hermann Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441–479, 1912. URL https://doi.org/10.1007/BF01456804.
Woodworth et al. (2020) Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp. 3635–3673. PMLR, 2020. URL https://proceedings.mlr.press/v125/woodworth20a.html.
Xie et al. (2017) Bo Xie, Yingyu Liang, and Le Song. Diverse Neural Network Learns True Target Functions. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 1216–1224. PMLR, 2017. URL https://proceedings.mlr.press/v54/xie17a.html.
Yang & Salman (2019) Greg Yang and Hadi Salman. A fine-grained spectral perspective on neural networks, 2019. URL https://arxiv.org/abs/1907.10599.
Zou & Gu (2019) Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/6a61d423d02a1c56250dc23ae7ff12f3-Paper.pdf.
Zou et al. (2020) Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep ReLU networks. Machine learning, 109(3):467–492, 2020.

Appendix

The appendix is organized as follows.

•

Appendix A gives background material on Gaussan kernels, NTK, unit variance intitialization, and Hermite polynomial expansions.
•

Appendix B provides details for Section 3.
•

Appendix C provides details for Section 4.

Appendix A Background material

A.1 Gaussian kernel

Observe by construction that the flattened collection of preactivations at the first layer $(g^{(1)}(\mathbf{x}_{i}))_{i=1}^{n}$ form a centered Gaussian process, with the covariance between the $\alpha$ th and $\beta$ th neuron being described by

\Sigma_{\alpha_{\beta}}^{(1)}(\mathbf{x}_{i},\mathbf{x}_{j})\vcentcolon=\mathbb{E}[g^{(1)}_{\alpha}(\mathbf{x}_{i})g^{(1)}_{\beta}(\mathbf{x}_{j})]=\delta_{\alpha=\beta}\left(\gamma_{w}^{2}\mathbf{x}_{i}^{T}\mathbf{x}_{j}+\gamma_{b}^{2}\right).

Under the Assumption 1, the preactivations at each layer $l\in[L+1]$ converge also in distribution to centered Gaussian processes (Neal, 1996; Lee et al., 2018). We remark that the sequential width limit condition of Assumption 1 is not necessary for this behavior, for example the same result can be derived in the setting where the widths of the network are sent to infinity simultaneously under certain conditions on the activation function (de G. Matthews et al., 2018). However, as our interests lie in analyzing the limit rather than the conditions for convergence to said limit, for simplicity we consider only the sequential width limit. As per Lee et al. (2018, Eq. 4), the covariance between the preactivations of the $\alpha$ th and $\beta$ th neurons at layer $l\geq 2$ for any input pair $\mathbf{x},\mathbf{y}\in\mathbb{R}$ are described by the following kernel,

	$\displaystyle\Sigma_{\alpha_{\beta}}^{(l)}(\mathbf{x},\mathbf{y})$	$\displaystyle\vcentcolon=\mathbb{E}[g^{(l)}_{\alpha}(\mathbf{x})g^{(l)}_{\beta}(\mathbf{y})]$
		$\displaystyle=\delta_{\alpha=\beta}\left(\sigma_{w}^{2}\mathbb{E}_{g^{(l-1)}\sim\mathcal{G}\mathcal{P}(0,\Sigma^{l-1})}[\phi(g^{(l-1)}_{\alpha}(\mathbf{x}))\phi(g^{(l-1)}_{\beta}(\mathbf{y}))]+\sigma_{b}^{2}\right).$

We refer to this kernel as the Gaussian kernel. As each neuron is identically distributed and the covariance between pairs of neurons is 0 unless $\alpha=\beta$ , moving forward we drop the subscript and discuss only the covariance between the preactivations of an arbitrary neuron given two inputs. As per the discussion by Lee et al. (2018, Section 2.3), the expectations involved in the computation of these Gaussian kernels can be computed with respect to a bivariate Gaussian distribution, whose covariance matrix has three distinct entries: the variance of a preactivation of $\mathbf{x}$ at the previous layer, $\Sigma^{(l-1)}(\mathbf{x},\mathbf{x})$ , the variance of a preactivation of $\mathbf{y}$ at the previous layer, $\Sigma^{(l)}(\mathbf{y},\mathbf{y})$ , and the covariance between preactivations of $\mathbf{x}$ and $\mathbf{y}$ , $\Sigma^{(l-1)}(\mathbf{x},\mathbf{y})$ . Therefore the Gaussian kernel, or covariance function, and its derivative, which we will require later for our analysis of the NTK, can be computed via the the following recurrence relations, see for instance (Lee et al., 2018; Jacot et al., 2018; Arora et al., 2019b; Nguyen et al., 2021),

		$\displaystyle\Sigma^{(1)}(\mathbf{x},\mathbf{y})=\gamma_{w}^{2}\mathbf{x}^{T}\mathbf{x}+\gamma_{b}^{2},$		(11)
		$\displaystyle\mathbf{A}^{(l)}(\mathbf{x},\mathbf{y})=\begin{bmatrix}\Sigma^{(l-1)}(\mathbf{x},\mathbf{x})&\Sigma^{(l-1)}(\mathbf{x},\mathbf{y})\\ \Sigma^{(l-1)}(\mathbf{y},\mathbf{x})&\Sigma^{(l-1)}(\mathbf{x},\mathbf{x})\end{bmatrix}$
		$\displaystyle\Sigma^{(l)}(\mathbf{x},\mathbf{y})=\sigma_{w}^{2}\mathbb{E}_{(B_{1},B_{2})\sim\mathcal{N}(0,\mathbf{A}^{(l)}(\mathbf{x},\mathbf{y}))}[\phi(B_{1})\phi(B_{2})]+\sigma_{b}^{2},$
		$\displaystyle\dot{\Sigma}^{(l)}(\mathbf{x},\mathbf{y})=\sigma_{w}^{2}\mathbb{E}_{(B_{1},B_{2})\sim\mathcal{N}(0,\mathbf{A}^{(l)}(\mathbf{x},\mathbf{y}))}\left[\phi^{\prime}(B_{1})\phi^{\prime}(B_{2})\right].$

A.2 Neural Tangent Kernel (NTK)

As discussed in the Section 1, under Assumption 1 $\tilde{\Theta}^{(l)}$ converges in probability to a deterministic limit, which we denote $\Theta^{(l)}$ . This deterministic limit kernel can be expressed in terms of the Gaussian kernels and their derivatives from Section A.1 via the following recurrence relationships (Jacot et al., 2018, Theorem 1),

$\displaystyle\Theta^{(1)}(\mathbf{x},\mathbf{y})$	$\displaystyle=\Sigma^{(1)}(\mathbf{x},\mathbf{y}),$	(12)
$\displaystyle\Theta^{(l)}(\mathbf{x},\mathbf{y})$	$\displaystyle=\Theta^{(l-1)}(\mathbf{x},\mathbf{y})\dot{\Sigma}^{(l)}(\mathbf{x},\mathbf{y})+\Sigma^{(l)}(\mathbf{x},\mathbf{y})$
	$\displaystyle=\Sigma^{(l)}(\mathbf{x},\mathbf{y})+\sum_{h=1}^{l-1}\Sigma^{(h)}(\mathbf{x},\mathbf{y})\left(\prod_{h^{\prime}=h+1}^{l}\dot{\Sigma}^{(h^{\prime})}(\mathbf{x},\mathbf{y})\right)\;\forall l\in[2,L+1].$

A useful expression for the NTK matrix, which is a straightforward extension and generalization of Nguyen et al. (2021, Lemma 3.1), is provided in Lemma A.1 below.

Lemma A.1.

(Based on Nguyen et al. 2021, Lemma 3.1) Under Assumption 1, a sequence of positive semidefinite matrices $(\mathbf{G}_{l})_{l=1}^{L+1}$ in $\mathbb{R}^{n\times n}$ , and the related sequence $(\dot{\mathbf{G}}_{l})_{l=2}^{L+1}$ also in $\mathbb{R}^{n\times n}$ , can be constructed via the following recurrence relationships,

$\displaystyle\mathbf{G}_{1}$	$\displaystyle=\gamma_{w}^{2}\mathbf{X}\mathbf{X}^{T}+\gamma_{b}^{2}\textbf{1}_{n\times n},$	(13)
$\displaystyle\mathbf{G}_{2}$	$\displaystyle=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{d})}[\phi(\mathbf{X}\mathbf{w})\phi(\mathbf{X}\mathbf{w})^{T}]+\sigma_{b}^{2}\textbf{1}_{n\times n},$
$\displaystyle\dot{\mathbf{G}}_{2}$	$\displaystyle=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{n})}[\phi^{\prime}(\mathbf{X}\mathbf{w})\phi^{\prime}(\mathbf{X}\mathbf{w})^{T}],$
$\displaystyle\mathbf{G}_{l}$	$\displaystyle=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{n})}[\phi(\sqrt{\mathbf{G}_{l-1}}\mathbf{w})\phi(\sqrt{\mathbf{G}_{l-1}}\mathbf{w})^{T}]+\sigma_{b}^{2}\textbf{1}_{n\times n},\;l\in[3,L+1],$
$\displaystyle\dot{\mathbf{G}}_{l}$	$\displaystyle=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{n})}[\phi^{\prime}(\sqrt{\mathbf{G}_{l-1}}\mathbf{w})\phi^{\prime}(\sqrt{\mathbf{G}_{l-1}}\mathbf{w})^{T}],\;l\in[3,L+1].$

The sequence of NTK matrices $(\mathbf{K}_{l})_{l=1}^{L+1}$ can in turn be written using the following recurrence relationship,

$\displaystyle n\mathbf{K}_{1}$	$\displaystyle=\mathbf{G}_{1},$	(14)
$\displaystyle n\mathbf{K}_{l}$	$\displaystyle=\mathbf{G}_{l}+n\mathbf{K}_{l-1}\odot\dot{\mathbf{G}}_{l}$
	$\displaystyle=\mathbf{G}_{l}+\sum_{i=1}^{l-1}\left(\mathbf{G}_{i}\odot\left(\odot_{j=i+1}^{l}\dot{\mathbf{G}}_{j}\right)\right).$

Proof.

For the sequence $(\mathbf{G}_{l})_{l=1}^{L+1}$ it suffices to prove for any $i,j\in[n]$ and $l\in[L+1]$ that

[\mathbf{G}_{l}]_{i,j}=\Sigma^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j})

and $\mathbf{G}_{l}$ is positive semi-definite. We proceed by induction, considering the base case $l=1$ and comparing (13) with (11) then it is evident that

[\mathbf{G}_{1}]_{i,j}=\Sigma^{(1)}(\mathbf{x}_{i},\mathbf{x}_{j}).

In addition, $\mathbf{G}_{1}$ is also clearly positive semi-definite as for any $\mathbf{u}\in\mathbb{R}^{n}$

\mathbf{u}^{T}\mathbf{G}_{1}\mathbf{u}=\gamma_{w}^{2}\left\lVert\mathbf{X}^{T}\mathbf{u}\right\rVert^{2}+\gamma_{b}^{2}\left\lVert\textbf{1}_{n}^{T}\mathbf{u}\right\rVert^{2}\geq 0.

We now assume the induction hypothesis is true for $\mathbf{G}_{l-1}$ . We will need to distinguish slightly between two cases, $l=2$ and $l\in[3,L+1]$ . The proof of the induction step in either case is identical. To this end, and for notational ease, let $\mathbf{V}=\mathbf{X}$ , $\mathbf{w}\sim\mathcal{N}(0,\textbf{I}_{d})$ when $l=2$ , and $\mathbf{V}=\sqrt{\mathbf{G}_{l-1}}$ , $\mathbf{w}\sim\mathcal{N}(0,\textbf{I}_{n})$ for $l\in[3,L+1]$ . In either case we let $\mathbf{v}_{i}$ denote the $i$ th row of $\mathbf{V}$ . For any $i,j\in[n]$

[\mathbf{G}_{l}]_{ij}=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}}[\phi(\mathbf{v}_{i}^{T}\mathbf{w})\phi(\mathbf{v}_{j}^{T}\mathbf{w})]+\sigma_{b}^{2}.

Now let $B_{1}=\mathbf{v}_{i}^{T}\mathbf{w}$ , $B_{2}=\mathbf{v}_{j}^{T}\mathbf{w}$ and observe for any $\alpha_{1},\alpha_{2}\in\mathbb{R}$ that $\alpha_{1}B_{1}+\alpha_{2}B_{2}=\sum_{k}^{n}(\alpha_{1}v_{ik}+\alpha_{2}v_{jk})w_{k}\sim\mathcal{N}(0,\left\lVert\alpha_{1}\mathbf{v}_{i}+\alpha_{2}\mathbf{v}_{j}\right\rVert^{2})$ . Therefore the joint distribution of $(B_{1},B_{2})$ is a mean 0 bivariate normal distribution. Denoting the covariance matrix of this distribution as $\tilde{\mathbf{A}}\in\mathbb{R}^{2\times 2}$ , then $[\mathbf{G}_{l}]_{ij}$ can be expressed as

[\mathbf{G}_{l}]_{ij}=\sigma_{w}^{2}\mathbb{E}_{(B_{1},B_{2})\sim\tilde{\mathbf{A}}}[\phi(B_{1})\phi(B_{2})]+\sigma_{b}^{2}.

To prove $[\mathbf{G}_{l}]_{i,j}=\Sigma^{(l)}$ it therefore suffices to show that $\tilde{\mathbf{A}}=\mathbf{A}^{(l)}$ as per (11). This follows by the induction hypothesis as

	$\displaystyle\mathbb{E}[B_{1}^{2}]$	$\displaystyle=\mathbf{v}_{i}^{T}\mathbf{v}_{i}=[\mathbf{G}_{l-1}]_{ii}=\Sigma^{(l-1)}(\mathbf{x}_{i},\mathbf{x}_{i}),$
	$\displaystyle\mathbb{E}[B_{2}^{2}]$	$\displaystyle=\mathbf{v}_{j}^{T}\mathbf{v}_{j}=[\mathbf{G}_{l-1}]_{jj}=\Sigma^{(l-1)}(\mathbf{x}_{j},\mathbf{x}_{j}),$
	$\displaystyle\mathbb{E}[B_{1}B_{2}]$	$\displaystyle=\mathbf{v}_{i}^{T}\mathbf{v}_{j}=[\mathbf{G}_{l-1}]_{ij}=\Sigma^{(l-1)}(\mathbf{x}_{i},\mathbf{x}_{j}).$

Finally, $\mathbf{G}_{l}$ is positive semi-definite as long as $\mathbb{E}_{\mathbf{w}}[\phi(\mathbf{V}\mathbf{w})\phi(\mathbf{V}\mathbf{w})^{T}]$ is positive semi-definite. Let $M(\mathbf{w})=\phi(\mathbf{V}\mathbf{w})\in\mathbb{R}^{n\times n}$ and observe for any $\mathbf{w}$ that $M(\mathbf{w})M(\mathbf{w})^{T}$ is positive semi-definite. Therefore $\mathbb{E}_{\mathbf{w}}[M(\mathbf{w})M(\mathbf{w})^{T}]$ must also be positive semi-definite. Thus the inductive step is complete and we may conclude for $l\in[L+1]$ that

[\mathbf{G}_{l}]_{i,j}=\Sigma^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j}).

(15)

For the proof of the expression for the sequence $(\dot{\mathbf{G}}_{l})_{l=2}^{L+1}$ it suffices to prove for any $i,j\in[n]$ and $l\in[L+1]$ that

[\dot{\mathbf{G}}_{l}]_{i,j}=\dot{\Sigma}^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j}).

By comparing (13) with (11) this follows immediately from (15). Therefore with (13) proven (14) follows from (12). ∎

A.3 Unit variance initialization

The initialization scheme for a neural network, particularly a deep neural network, needs to be designed with some care in order to avoid either vanishing or exploding gradients during training Glorot & Bengio (2010); He et al. (2015); Mishkin & Matas (2016); LeCun et al. (2012). Some of the most popular initialization strategies used in practice today, in particular LeCun et al. (2012) and Glorot & Bengio (2010) initialization, first model the preactivations of the network as Gaussian random variables and then select the network hyperparameters in order that the variance of these idealized preactivations is fixed at one. Under Assumption 1 this idealized model on the preactivations is actually realized and if we additionally assume the conditions of Assumption 2 hold then likewise the variance of the preactivations at every layer will be fixed at one. To this end, and as in Poole et al. (2016); Murray et al. (2022), consider the function $V\colon\mathbb{R}_{\geq 0}\rightarrow\mathbb{R}_{\geq 0}$ defined as

V(q)=\sigma_{w}^{2}\mathbb{E}_{Z\sim\mathcal{N}(0,1)}\left[\phi\left(\sqrt{q}Z\right)^{2}\right]+\sigma_{b}^{2}.

(16)

Noting that $V$ is another expression for $\Sigma^{(l)}(\mathbf{x},\mathbf{x})$ , derived via a change of variables as per Poole et al. (2016), the sequence of variances $(\Sigma^{(l)}(\mathbf{x},\mathbf{x}))_{l=2}^{L}$ can therefore be generated as follows,

\Sigma^{(l)}(\mathbf{x},\mathbf{x})=V(\Sigma^{(l-1)}(\mathbf{x},\mathbf{x})).

(17)

The linear correlation $\rho^{(l)}:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow[-1,1]$ between the preactivations of two inputs $\mathbf{x},\mathbf{y}\in\mathbb{R}^{d}$ we define as

\displaystyle\rho^{(l)}(\mathbf{x},\mathbf{y})=\frac{\Sigma^{(l)}(\mathbf{x},\mathbf{y})}{\sqrt{\Sigma^{(l)}(\mathbf{x},\mathbf{x})\Sigma^{(l)}(\mathbf{y},\mathbf{y})}}.

(18)

Assuming $\Sigma^{(l)}(\mathbf{x},\mathbf{x})=\Sigma^{(l)}(\mathbf{y},\mathbf{y})=1$ for all $l\in[L+1]$ , then $\rho^{(l)}(\mathbf{x},\mathbf{y})=\Sigma^{(l)}(\mathbf{x},\mathbf{y})$ . Again as in Murray et al. (2022) and analogous to (16), with $Z_{1},Z_{2}\sim\mathcal{N}(0,1)$ independent, $U_{1}\vcentcolon=Z_{1}$ , $U_{2}(\rho)\vcentcolon=(\rho Z_{1}+\sqrt{1-\rho^{2}}Z_{2})$ ³³3We remark that $U_{1},U_{2}$ are dependent and identically distributed as $U_{1},U_{2}\sim\mathcal{N}(0,1)$ . we define the correlation function $R:[-1,1]\rightarrow[-1,1]$ as

\displaystyle R(\rho)=\sigma_{w}^{2}\mathbb{E}[\phi(U_{1})\phi(U_{2}(\rho))]+\sigma_{b}^{2}.

(19)

Noting under these assumptions that $R$ is equivalent to $\Sigma^{(l)}(\mathbf{x},\mathbf{y})$ , the sequence of correlations $(\rho^{(l)}(\mathbf{x},\mathbf{y}))_{l=2}^{L}$ can thus be generated as

\rho^{(l)}(\mathbf{x},\mathbf{y})=R(\rho^{(l-1)}(\mathbf{x},\mathbf{y})).

As observed in Poole et al. (2016); Schoenholz et al. (2017), $R(1)=V(1)=1$ , hence $\rho=1$ is a fixed point of $R$ . We remark that as all preactivations are distributed as $\mathcal{N}(0,1)$ , then a correlation of one between preactivations implies they are equal. The stability of the fixed point $\rho=1$ is of particular significance in the context of initializing deep neural networks successfully. Under mild conditions on the activation function one can compute the derivative of $R$ , see e.g., Poole et al. (2016); Schoenholz et al. (2017); Murray et al. (2022), as follows,

\displaystyle R^{\prime}(\rho)=\sigma_{w}^{2}\mathbb{E}[\phi^{\prime}(U_{1})\phi^{\prime}(U_{2}(\rho))].

(20)

Observe that the expression for $\dot{\Sigma}^{(l)}$ and $R^{\prime}$ are equivalent via a change of variables (Poole et al., 2016), and therefore the sequence of correlation derivatives may be computed as

\dot{\Sigma}^{(l)}(\mathbf{x},\mathbf{y})=R^{\prime}(\rho^{(l)}(\mathbf{x},\mathbf{y})).

With the relevant background material now in place we are in a position to prove Lemma A.2.

Lemma A.2.

Under Assumptions 1 and 2 and defining $\chi=\sigma_{w}^{2}\mathbb{E}_{Z\sim\mathcal{N}(0,1)}[\phi^{\prime}(Z)^{2}]\in\mathbb{R}_{>0}$ , then for all $i,j\in[n]$ , $l\in[L+1]$

•

$[\mathbf{G}_{n,l}]_{ij}\in[-1,1]$ and $[\mathbf{G}_{n,l}]_{ii}=1$ ,
•

$[\dot{\mathbf{G}}_{n,l}]_{ij}\in[-\chi,\chi]$ and $[\dot{\mathbf{G}}_{n,l}]_{ii}=\chi$ .

Furthermore, the NTK is a dot product kernel, meaning $\Theta(\mathbf{x}_{i},\mathbf{x}_{j})$ can be written as a function of the inner product between the two inputs, $\Theta(\mathbf{x}_{i}^{T}\mathbf{x}_{j})$ .

Proof.

Recall from Lemma A.1 and its proof that for any $l\in[L+1]$ , $i,j\in[n]$ $[\mathbf{G}_{n,l}]_{ij}=\Sigma^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j})$ and $[\dot{\mathbf{G}}_{n,l}]_{ij}=\dot{\Sigma}^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j})$ . We first prove by induction $\Sigma^{(l)}(\mathbf{x}_{i},\mathbf{x}_{i})=1$ for all $l\in[L+1]$ . The base case $l=1$ follows as

\Sigma^{(1)}(\mathbf{x},\mathbf{x})=\gamma_{w}^{2}\mathbf{x}^{T}\mathbf{x}+\gamma_{b}^{2}=\gamma_{w}^{2}+\gamma_{b}^{2}=1.

Assume the induction hypothesis is true for layer $l-1$ . With $Z\sim\mathcal{N}(0,1)$ , then from (16) and (17)

	$\displaystyle\Sigma^{(l)}(\mathbf{x},\mathbf{x})$	$\displaystyle=V(\Sigma^{(l-1)}(\mathbf{x},\mathbf{x}))$
		$\displaystyle=\sigma_{w}^{2}\mathbb{E}\left[\phi^{2}\left(\sqrt{\Sigma^{(l-1)}(\mathbf{x},\mathbf{x})}Z\right)\right]+\sigma_{b}^{2}$
		$\displaystyle=\sigma_{w}^{2}\mathbb{E}\left[\phi^{2}\left(Z\right)\right]+\sigma_{b}^{2}$
		$\displaystyle=1,$

thus the inductive step is complete. As an immediate consequence it follows that $[\mathbf{G}_{l}]_{ii}=1$ . Also, for any $i,j\in[n]$ and $l\in[L+1]$ ,

\displaystyle\Sigma^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j})

\displaystyle=\rho^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j})=R(\rho^{(l-1)}(\mathbf{x}_{i},\mathbf{x}_{j}))=R(...R(R(\mathbf{x}_{i}^{T}\mathbf{x}_{j}))).

Thus we can consider $\Sigma^{(l)}$ as a univariate function of the input correlation $\Sigma:[-1,1]\rightarrow[-1,1]$ and also conclude that $[\mathbf{G}_{l}]_{ij}\in[-1,1]$ . Furthermore,

\displaystyle\dot{\Sigma}^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j})=R^{\prime}(\rho^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j}))=R^{\prime}(R(...R(R(\mathbf{x}_{i}^{T}\mathbf{x}_{j})))),

which likewise implies $\dot{\Sigma}$ is a dot product kernel. Recall now the random variables introduced to define $R$ : $Z_{1},Z_{2}\sim\mathcal{N}(0,1)$ are independent and $U_{1}=Z_{1}$ , $U_{2}=(\rho Z_{1}+\sqrt{1-\rho^{2}}Z_{2})$ . Observe $U_{1},U_{2}$ are dependent but identically distributed as $U_{1},U_{2}\sim\mathcal{N}(0,1)$ . For any $\rho\in[-1,1]$ then applying the Cauchy-Schwarz inequality gives

\displaystyle|R^{\prime}(\rho)|^{2}=\sigma_{w}^{4}\left|\mathbb{E}[\phi^{\prime}(U_{1})\phi^{\prime}(U_{2})]\right|^{2}\leq\sigma_{w}^{4}\mathbb{E}[\phi^{\prime}(U_{1})^{2}]\mathbb{E}[\phi^{\prime}(U_{2})^{2}]=\sigma_{w}^{4}\mathbb{E}[\phi^{\prime}(U_{1})^{2}]^{2}=|R^{\prime}(1)|^{2}.

As a result, under the assumptions of the lemma $\dot{\Sigma}^{(l)}:[-1,1]\rightarrow[-\chi,\chi]$ and $\dot{\Sigma}^{(l)}(\mathbf{x}_{i},\mathbf{x}_{i})=\chi$ . From this it immediately follows that $[\dot{\mathbf{G}}_{l}]_{ij}\in[-\chi,\chi]$ and $[\dot{\mathbf{G}}_{l}]_{ii}=\chi$ as claimed. Finally, as $\Sigma:[-1,1]\rightarrow[-1,1]$ and $\dot{\Sigma}:[-1,1]\rightarrow[-\chi,\chi]$ are dot product kernels, then from (12) the NTK must also be a dot product kernel and furthermore a univariate function of the pairwise correlation of its input arguments. ∎

The following corollary, which follows immediately from Lemma A.2 and (14), characterizes the trace of the NTK matrix in terms of the trace of the input gram.

Corollary A.3.

Under the same conditions as Lemma A.2, suppose $\phi$ and $\sigma_{w}^{2}$ are chosen such that $\chi=1$ . Then

Tr(\mathbf{K}_{n,l})=l.

(21)

A.4 Hermite Expansions

We say that a function $f:\mathbb{R}\rightarrow\mathbb{R}$ is square integrable w.r.t. the standard Gaussian measure $\gamma=e^{-x^{2}/2}/\sqrt{2\pi}$ if $\mathbb{E}_{x\sim\mathcal{N}(0,1)}[f(x)^{2}]<\infty$ . We denote by $L^{2}(\mathbb{R},\gamma)$ the space of all such functions. The probabilist’s Hermite polynomials are given by

\displaystyle H_{k}(x)={(-1)}^{k}e^{x^{2}/2}\frac{d^{k}}{dx^{k}}e^{-x^{2}/2},\quad k=0,1,\ldots.

The first three Hermite polynomials are $H_{0}(x)=1$ , $H_{1}(x)=x$ , $H_{2}(x)=(x^{2}-1)$ . Let $h_{k}(x)=\tfrac{H_{k}(x)}{\sqrt{k!}}$ denote the normalized probabilist’s Hermite polynomials. The normalized Hermite polynomials form a complete orthonormal basis in $L^{2}(\mathbb{R},\gamma)$ (O’Donnell, 2014, §11): in all that follows, whenever we reference the Hermite polynomials, we will be referring to the normalized Hermite polynomials. The Hermite expansion of a function $\phi\in L^{2}(\mathbb{R},\gamma)$ is given by

\displaystyle\phi(x)=\sum_{k=0}^{\infty}\mu_{k}(\phi)h_{k}(x),

(22)

where

\mu_{k}(\phi)=\mathbb{E}_{X\sim\mathcal{N}(0,1)}[\phi(X)h_{k}(X)]

(23)

is the $k$ th normalized probabilist’s Hermite coefficient of $\phi$ . In what follows we shall make use of the following identities.

	$\displaystyle\forall k\geq 1,\,h_{k}^{\prime}(x)$	$\displaystyle=\sqrt{k}h_{k-1}(x),$		(24)
	$\displaystyle\forall k\geq 1,\,xh_{k}(x)$	$\displaystyle=\sqrt{k+1}h_{k+1}(x)+\sqrt{k}h_{k-1}(x).$		(25)

\displaystyle\begin{array}[]{c}h_{k}(0)=\left\{\begin{array}[]{ll}0,&\text{ if }k\text{ is odd }\\ \frac{1}{\sqrt{k!}}(-1)^{\frac{k}{2}}(k-1)!!&\text{ if }k\text{ is even }\end{array}\right.,\\ \text{ where }k!!=\left\{\begin{array}[]{ll}1,&k\leq 0\\ k\cdot(k-2)\cdots 5\cdot 3\cdot 1,&k>0\text{ odd }\\ k\cdot(k-2)\cdots 6\cdot 4\cdot 2,&k>0\text{ even }.\end{array}\right.\end{array}

(33)

We also remark that the more commonly encountered physicist’s Hermite polynomials, which we denote $\tilde{H}_{k}$ , are related to the normalized probablist’s polynomials as follows,

h_{k}(z)=\frac{2^{-k/2}\tilde{H}_{k}(z/\sqrt{2})}{\sqrt{k!}}.

The Hermite expansion of the activation function deployed will play a key role in determining the coefficients of the NTK power series. In particular, the Hermite coefficients of ReLU are as follows.

Lemma A.4.

Daniely et al. (2016) For $\phi(z)=\max\{0,z\}$ the Hermite coefficients are given by

\mu_{k}(\phi)=\begin{cases}1/\sqrt{2\pi},&\text{$k=0$},\\ 1/2,&\text{$k=1$},\\ (k-3)!!/\sqrt{2\pi k!},&\text{$k$ even and $k\geq 2$},\\ 0,&\text{$k$ odd and $k>3$}.\end{cases}

(34)

Appendix B Expressing the NTK as a power series

B.1 Deriving a power series for the NTK

We will require the following minor adaptation of Nguyen & Mondelli (2020, Lemma D.2). We remark this result was first stated for ReLU and Softplus activations in the work of Oymak & Soltanolkotabi (2020, Lemma H.2).

Lemma B.1.

For arbitrary $n,d\in\mathbb{N}$ , let $\mathbf{A}\in\mathbb{R}^{n\times d}$ . For $i\in[n]$ , we denote the $i$ th row of $\mathbf{A}$ as $\mathbf{a}_{i}$ , and further assume that $\left\lVert\mathbf{a}_{i}\right\rVert=1$ . Let $\phi:\mathbb{R}\rightarrow\mathbb{R}$ satisfy $\phi\in L^{2}(\mathbb{R},\gamma)$ and define

\mathbf{M}=\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(0,\textbf{I}_{n})}[\phi(\mathbf{A}\mathbf{w})\phi(\mathbf{A}\mathbf{w})^{T}]\in\mathbb{R}^{n\times n}.

Then the matrix series

\mathbf{S}_{K}=\sum_{k=0}^{K}\mu_{k}^{2}(\phi)\left(\mathbf{A}\mathbf{A}^{T}\right)^{\odot k}

converges uniformly to $\mathbf{M}$ as $K\rightarrow\infty$ .

The proof of Lemma B.1 follows exactly as in (Nguyen & Mondelli, 2020, Lemma D.2), and is in fact slightly simpler due to the fact we assume the rows of $\mathbf{A}$ are unit length and $\mathbf{w}\sim\mathcal{N}(0,\textbf{I}_{d})$ instead of $\sqrt{d}$ and $\mathbf{w}\sim\mathcal{N}(0,\frac{1}{d}\textbf{I}_{d})$ respectively. For the ease of the reader, we now recall the following definitions, which are also stated in Section 3. Letting $\bar{\alpha}_{l}\vcentcolon=(\alpha_{p,l})_{p=0}^{\infty}$ denote a sequence of real coefficients, then

F(p,k,\bar{\alpha}_{l})\vcentcolon=\begin{cases}1\;&k=0\text{ and }p=0,\\ 0\;&k=0\text{ and }p\geq 1,\\ \sum_{(j_{i})\in\mathcal{J}(p,k)}\prod_{i=1}^{k}\alpha_{j_{i},l}\;&k\geq 1\text{ and }p\geq 0,\end{cases}

(35)

where

\mathcal{J}(p,k)\vcentcolon=\{(j_{i})_{i\in[k]}\;:\;j_{i}\geq 0\;\forall i\in[k],\;\sum_{i=1}^{k}j_{i}=p\}

for all $p\in\mathbb{Z}_{\geq 0}$ , $k\in\mathbb{Z}_{\geq 1}$ .

We are now ready to derive power series for elements of $(\mathbf{G}_{l}))_{l=1}^{L+1}$ and $(\dot{\mathbf{G}}_{l}))_{l=2}^{L+1}$ .

Lemma B.2.

Under Assumptions 1 and 2, for all $l\in[2,L+1]$

\displaystyle\mathbf{G}_{l}=\sum_{k=0}^{\infty}\alpha_{k,l}(\mathbf{X}\mathbf{X}^{T})^{\odot k},

(36)

where the series for each element $[\mathbf{G}_{l}]_{ij}$ converges absolutely and the coefficients $\alpha_{p,l}$ are nonnegative. The coefficients of the series (36) for all $p\in\mathbb{Z}_{\geq 0}$ can be expressed via the following recurrence relationship,

\alpha_{p,l}=\begin{cases}\sigma_{w}^{2}\mu_{p}^{2}(\phi)+\delta_{p=0}\sigma_{b}^{2},&l=2,\\ \sum_{k=0}^{\infty}\alpha_{k,2}F(p,k,\bar{\alpha}_{l-1}),&l\geq 3.\end{cases}

(37)

Furthermore,

\displaystyle\dot{\mathbf{G}}_{l}=\sum_{k=0}^{\infty}\upsilon_{k,l}(\mathbf{X}\mathbf{X}^{T})^{\odot k},

(38)

where likewise the series for each entry $[\dot{\mathbf{G}}_{l}]_{ij}$ converges absolutely and the coefficients $\upsilon_{p,l}$ for all $p\in\mathbb{Z}_{\geq 0}$ are nonnegative and can be expressed via the following recurrence relationship,

\upsilon_{p,l}=\begin{cases}\sigma_{w}^{2}\mu_{p}^{2}(\phi^{\prime}),&l=2,\\ \sum_{k=0}^{\infty}\upsilon_{k,2}F(p,k,\bar{\alpha}_{l-1}),&l\geq 3.\end{cases}

(39)

Proof.

We start by proving (36) and (37). Proceeding by induction, consider the base case $l=2$ . From Lemma A.1

\mathbf{G}_{2}=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{d})}[\phi(\mathbf{X}\mathbf{w})\phi(\mathbf{X}\mathbf{w})^{T}]+\sigma_{b}^{2}\textbf{1}_{n\times n}.

By the assumptions of the lemma, the conditions of Lemma B.1 are satisfied and therefore

	$\displaystyle\mathbf{G}_{2}$	$\displaystyle=\sigma_{w}^{2}\sum_{k=0}^{\infty}\mu_{k}^{2}(\phi)\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot k}+\sigma_{b}^{2}\textbf{1}_{n\times n}$
		$\displaystyle=\alpha_{0,2}\textbf{1}_{n\times n}+\sum_{k=1}^{\infty}\alpha_{k,2}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot k}.$

Observe the coefficients $(\alpha_{k,2})_{k\in\mathbb{Z}_{\geq 0}}$ are nonnegative. Therefore, for any $i,j\in[n]$ using Lemma A.2 the series for $[\mathbf{G}_{l}]_{ij}$ satisfies

\displaystyle\sum_{k=0}^{\infty}\left\lvert\alpha_{k,2}\right\rvert\left\lvert\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{k}\right\rvert

\displaystyle\leq\sum_{k=0}^{\infty}\alpha_{k,2}\langle\mathbf{x}_{i},\mathbf{x}_{i}\rangle^{k}

\displaystyle=[\mathbf{G}_{l}]_{ii}

\displaystyle=1

(40)

and so must be absolutely convergent. With the base case proved we proceed to assume the inductive hypothesis holds for arbitrary $\mathbf{G}_{l}$ with $l\in[2,L]$ . Observe

\displaystyle\mathbf{G}_{l+1}

\displaystyle=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{n})}[\phi(\mathbf{A}\mathbf{w})\phi(\mathbf{A}\mathbf{w})^{T}]+\sigma_{b}^{2}\textbf{1}_{n\times n},

where $\mathbf{A}$ is a matrix square root of $\mathbf{G}_{l}$ , meaning $\mathbf{G}_{l}=\mathbf{A}\mathbf{A}$ . Recall from Lemma A.1 that $\mathbf{G}_{l}$ is also symmetric and positive semi-definite, therefore we may additionally assume, without loss of generality, that $\mathbf{A}\in\mathbb{R}^{n\times n}$ is symmetric, which conveniently implies $\mathbf{G}_{n,l}=\mathbf{A}\mathbf{A}^{T}$ . Under the assumptions of the lemma the conditions for Lemma A.2 are satisfied and as a result $[\mathbf{G}_{n,l}]_{ii}=\left\lVert\mathbf{a}_{i}\right\rVert=1$ for all $i\in[n]$ , where we recall $\mathbf{a}_{i}$ denotes the $i$ th row of $\mathbf{A}$ . Therefore we may again apply Lemma A.1,

	$\displaystyle\mathbf{G}_{l+1}$	$\displaystyle=\sigma_{w}^{2}\sum_{k=0}^{\infty}\mu_{k}^{2}(\phi)\left(\mathbf{A}\mathbf{A}^{T}\right)^{\odot k}+\sigma_{b}^{2}\textbf{1}_{n\times n}$
		$\displaystyle=(\sigma_{w}^{2}\mu_{0}^{2}(\phi)+\sigma_{b}^{2})\textbf{1}_{n\times n}+\sigma_{w}^{2}\sum_{k=1}^{\infty}\mu_{k}^{2}(\phi)\left(\mathbf{G}_{n,l}\right)^{\odot k}$
		$\displaystyle=(\sigma_{w}^{2}\mu_{0}^{2}(\phi)+\sigma_{b}^{2})\textbf{1}_{n\times n}+\sigma_{w}^{2}\sum_{k=1}^{\infty}\mu_{k}^{2}(\phi)\left(\sum_{m=0}^{\infty}\alpha_{m,l}(\mathbf{X}\mathbf{X}^{T})^{\odot m}\right)^{\odot k},$

where the final equality follows from the inductive hypothesis. For any pair of indices $i,j\in[n]$

[\mathbf{G}_{l+1}]_{ij}=(\sigma_{w}^{2}\mu_{0}^{2}(\phi)+\sigma_{b}^{2})+\sigma_{w}^{2}\sum_{k=1}^{\infty}\mu_{k}^{2}(\phi)\left(\sum_{m=0}^{\infty}\alpha_{m,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{m}\right)^{k}.

By the induction hypothesis, for any $i,j\in[n]$ the series $\sum_{m=0}^{\infty}\alpha_{m,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{m}$ is absolutely convergent. Therefore, from the Cauchy product of power series and for any $k\in\mathbb{Z}_{\geq 0}$ we have

\left(\sum_{m=0}^{\infty}\alpha_{m,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{m}\right)^{k}=\sum_{p=0}^{\infty}F(p,k,\bar{\alpha}_{l})\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p},

(41)

where $F(p,k,\bar{\alpha}_{l})$ is defined in (4). By definition, $F(p,k,\bar{\alpha}_{l})$ is a sum of products of positive coefficients, and therefore $\left\lvert F(p,k,\bar{\alpha}_{l})\right\rvert=F(p,k,\bar{\alpha}_{l})$ . In addition, recall again by Assumption 2 and Lemma A.2 that $[\mathbf{G}_{l}]_{ii}=1$ . As a result, for any $k\in\mathbb{Z}_{\geq 0}$ , as $\left\lvert\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle\right\rvert\leq 1$

\sum_{p=0}^{\infty}\left\lvert F(p,k,\bar{\alpha}_{l})\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}\right\rvert\leq\left(\sum_{m=0}^{\infty}\alpha_{m,l}\right)^{k}=[\mathbf{G}_{n,l}]_{ii}=1

(42)

and therefore the series $\sum_{p=0}^{\infty}F(p,k,\bar{\alpha}_{l})\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}$ converges absolutely. Recalling from the proof of the base case that the series $\sum_{p=1}^{\infty}\alpha_{p,2}$ is absolutely convergent and has only nonnegative elements, we may therefore interchange the order of summation in the following,

	_ij	$\displaystyle=(\sigma_{w}^{2}\mu_{0}^{2}(\phi)+\sigma_{b}^{2})+\sigma_{w}^{2}\sum_{k=1}^{\infty}\mu_{k}^{2}(\phi)\left(\sum_{p=0}^{\infty}F(p,k,\bar{\alpha}_{l})\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}\right)$
		$\displaystyle=\alpha_{0,2}+\sum_{k=1}^{\infty}\alpha_{k,2}\left(\sum_{p=0}^{\infty}F(p,k,\bar{\alpha}_{l})\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}\right)$
		$\displaystyle=\alpha_{0,2}+\sum_{p=0}^{\infty}\left(\sum_{k=1}^{\infty}\alpha_{k,2}F(p,k,\bar{\alpha}_{l})\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}.$

Recalling the definition of $F(p,k,l)$ in (4), in particular $F(0,0,\bar{\alpha}_{l})=1$ and $F(p,0,\bar{\alpha}_{l})=0$ for $p\in\mathbb{Z}_{\geq 1}$ , then

	_ij	$\displaystyle=\left(\alpha_{0,2}+\sum_{k=1}^{\infty}\alpha_{k,2}F(0,k,\bar{\alpha}_{l})\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{0}+\sum_{p=1}^{\infty}\left(\sum_{k=1}^{\infty}\alpha_{k,2}F(p,k,\bar{\alpha}_{l})\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}$
		$\displaystyle=\left(\sum_{k=0}^{\infty}\alpha_{k,2}F(0,k,\bar{\alpha}_{l})\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{0}+\sum_{p=1}^{\infty}\left(\sum_{k=0}^{\infty}\alpha_{k,2}F(p,k,\bar{\alpha}_{l})\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}$
		$\displaystyle=\sum_{p=0}^{\infty}\left(\sum_{k=0}^{\infty}\alpha_{k,2}F(p,k,\bar{\alpha}_{l})\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}$
		$\displaystyle=\sum_{p=0}^{\infty}\alpha_{p,l+1}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}.$

As the indices $i,j\in[n]$ were arbitrary we conclude that

\mathbf{G}_{l+1}=\sum_{p=0}^{\infty}\alpha_{p,l+1}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot p}

as claimed. In addition, by inspection and using the induction hypothesis it is clear that the coefficients $(\alpha_{p,l+1})_{p=0}^{\infty}$ are nonnegative. Therefore, by an argument identical to (40), the series for each entry of $[\mathbf{G}_{l+1}]_{ij}$ is absolutely convergent. This concludes the proof of (36) and (37).

We now turn our attention to proving the (38) and (39). Under the assumptions of the lemma the conditions for Lemmas A.1 and B.1 are satisfied and therefore for the base case $l=2$

	$\displaystyle\dot{\mathbf{G}}_{2}$	$\displaystyle=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{n})}[\phi^{\prime}(\mathbf{X}\mathbf{w})\phi^{\prime}(\mathbf{X}\mathbf{w})^{T}]$
		$\displaystyle=\sigma_{w}^{2}\sum_{k=0}^{\infty}\mu_{k}^{2}(\phi^{\prime})\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot k}$
		$\displaystyle=\sum_{k=0}^{\infty}\upsilon_{k,2}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot k}.$

By inspection the coefficients $(\upsilon_{p,2})_{p=0}^{\infty}$ are nonnegative and as a result by an argument again identical to (40) the series for each entry of $[\dot{\mathbf{G}}_{2}]_{ij}$ is absolutely convergent. For $l\in[2,L]$ , from (36) and its proof there is a matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$ such that $\mathbf{G}_{l}=\mathbf{A}\mathbf{A}^{T}$ . Again applying Lemma B.1

	$\displaystyle\dot{\mathbf{G}}_{n,l+1}$	$\displaystyle=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{n})}[\phi^{\prime}(\mathbf{A}\mathbf{w})\phi^{\prime}(\mathbf{A}\mathbf{w})^{T}]$
		$\displaystyle=\sigma_{w}^{2}\sum_{k=0}^{\infty}\mu_{k}^{2}(\phi^{\prime})\left(\mathbf{A}\mathbf{A}^{T}\right)^{\odot k}$
		$\displaystyle=\sum_{k=0}^{\infty}\upsilon_{k,2}\left(\mathbf{G}_{n,l}\right)^{\odot k}$
		$\displaystyle=\sum_{k=0}^{\infty}\upsilon_{k,2}\left(\sum_{p=0}^{\infty}\alpha_{p,l}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot p}\right)^{\odot k}$

Analyzing now an arbitrary entry $[\dot{\mathbf{G}}_{l+1}]_{ij}$ , by substituting in the power series expression for $\mathbf{G}_{l}$ from (36) and using (41) we have

	_ij	$\displaystyle=\sum_{k=0}^{\infty}\upsilon_{k,2}\left(\sum_{p=0}^{\infty}\alpha_{p,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}\right)^{k}$
		$\displaystyle=\sum_{k=0}^{\infty}\upsilon_{k,2}\left(\sum_{p=0}^{\infty}F(p,k,\bar{\alpha}_{l})\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}\right)$
		$\displaystyle=\sum_{p=0}^{\infty}\left(\sum_{k=0}^{\infty}\upsilon_{k,2}F(p,k,\bar{\alpha}_{l})\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}$
		$\displaystyle=\sum_{p=0}^{\infty}\upsilon_{p,l+1}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}.$

Note that exchanging the order of summation in the third equality above is justified as for any $k\in\mathbb{Z}_{\geq 0}$ by (42) we have $\sum_{p=0}^{\infty}F(p,k,\bar{\alpha}_{l})|\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle|^{p}\leq 1$ and therefore $\sum_{k=0}^{\infty}\sum_{p=0}^{\infty}\upsilon_{k,2}F(p,k,\bar{\alpha}_{l})\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}$ converges absolutely. As the indices $i,j\in[n]$ were arbitrary we conclude that

\dot{\mathbf{G}}_{l+1}=\sum_{p=0}^{\infty}\upsilon_{p,l+1}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot p}

as claimed. Finally, by inspection the coefficients $(\upsilon_{p,l+1})_{p=0}^{\infty}$ are nonnegative, therefore, and again by an argument identical to (40), the series for each entry of $[\dot{\mathbf{G}}_{n,l+1}]_{ij}$ is absolutely convergent. This concludes the proof. ∎

We are now prove the key result of Section 3.

See 3.1

Proof.

We proceed by induction. The base case $l=1$ follows trivially from Lemma A.1. We therefore assume the induction hypothesis holds for an arbitrary $l-1\in[1,L]$ . From (14) and Lemma B.2

	$\displaystyle n\mathbf{K}_{l}$	$\displaystyle=\mathbf{G}_{l}+n\mathbf{K}_{l-1}\odot\dot{\mathbf{G}}_{l}$
		$\displaystyle=\left(\sum_{p=0}^{\infty}\alpha_{p,l}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot p}\right)+\left(n\sum_{q=0}^{\infty}\kappa_{q,l-1}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot q}\right)\odot\left(\sum_{w=0}^{\infty}\upsilon_{w,l}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot w}\right).$

Therefore, for arbitrary $i,j\in[n]$

_ij

\displaystyle=\sum_{p=0}^{\infty}\alpha_{p,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}+\left(n\sum_{q=0}^{\infty}\kappa_{q,l-1}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{q}\right)\left(\sum_{w=0}^{\infty}\upsilon_{w,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{w}\right).

Observe $n\sum_{q=0}^{\infty}\kappa_{q,l-1}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{q}=\Theta^{(l-1)}(\mathbf{x}_{i},\mathbf{x}_{j})$ and therefore the series must converge due to the convergence of the NTK. Furthermore, $\sum_{w=0}^{\infty}\upsilon_{w,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{w}=[\dot{\mathbf{G}}_{n,l}]_{ij}$ and therefore is absolutely convergent by Lemma B.2. As a result, by Merten’s Theorem the product of these two series is equal to their Cauchy product. Therefore

	_ij	$\displaystyle=\sum_{p=0}^{\infty}\alpha_{p,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}+\sum_{p=0}^{\infty}\left(\sum_{q=0}^{p}\kappa_{q,l-1}\upsilon_{p-q,l}\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}$
		$\displaystyle=\sum_{p=0}^{\infty}\left(\alpha_{p,l}+\sum_{q=0}^{p}\kappa_{q,l-1}\upsilon_{p-q,l}\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}$
		$\displaystyle=\sum_{p=0}^{\infty}\kappa_{p,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p},$

from which the (5) immediately follows. ∎

B.2 Analyzing the coefficients of the NTK power series

In this section we study the coefficients of the NTK power series stated in Theorem 3.1. Our first observation is that, under additional assumptions on the activation function $\phi$ , the recurrence relationship (6) can be simplified in order to depend only on the Hermite expansion of $\phi$ .

Lemma B.3.

Under Assumption 3 the Hermite coefficients of $\phi^{\prime}$ satisfy

\mu_{k}(\phi^{\prime})=\sqrt{k+1}\mu_{k+1}(\phi)

for all $k\in\mathbb{Z}_{\geq 0}$ .

Proof.

Note for each $n\in\mathbb{N}$ as $\phi$ is absolutely continuous on $[-n,n]$ it is differentiable a.e. on $[-n,n]$ . It follows by the countable additivity of the Lebesgue measure that $\phi$ is differentiable a.e. on $\mathbb{R}$ . Furthermore, as $\phi$ is polynomially bounded we have $\phi\in L^{2}(\mathbb{R},e^{-x^{2}/2}/\sqrt{2\pi})$ . Fix $a>0$ . Since $\phi$ is absolutely continuous on $[-a,a]$ it is of bounded variation on $[-a,a]$ . Also note that $h_{k}(x)e^{-x^{2}/2}$ is of bounded variation on $[-a,a]$ due to having a bounded derivative. Thus we have by Lebesgue-Stieltjes integration-by-parts (see e.g. Folland 1999, Chapter 3)

	$\displaystyle\int_{-a}^{a}\phi^{\prime}(x)h_{k}(x)e^{-x^{2}/2}dx$
	$\displaystyle=\phi(a)h_{k}(a)e^{-a^{2}/2}-\phi(-a)h_{k}(-a)e^{-a^{2}/2}+\int_{-a}^{a}\phi(x)[xh_{k}(x)-h_{k}^{\prime}(x)]e^{-x^{2}/2}dx$
	$\displaystyle=\phi(a)h_{k}(a)e^{-a^{2}/2}-\phi(-a)h_{k}(-a)e^{-a^{2}/2}+\int_{-a}^{a}\phi(x)\sqrt{k+1}h_{k+1}(x)e^{-x^{2}/2}dx,$

where in the last line above we have used the fact that (24) and (25) imply that $xh_{k}(x)-h_{k}^{\prime}(x)=\sqrt{k+1}h_{k+1}(x)$ . Thus we have shown

	$\displaystyle\int_{-a}^{a}\phi^{\prime}(x)h_{k}(x)e^{-x^{2}/2}dx$
	$\displaystyle=\phi(a)h_{k}(a)e^{-a^{2}/2}-\phi(-a)h_{k}(-a)e^{-a^{2}/2}+\int_{-a}^{a}\phi(x)\sqrt{k+1}h_{k+1}(x)e^{-x^{2}/2}dx.$

We note that since $|\phi(x)h_{k}(x)|=\mathcal{O}(|x|^{\beta+k})$ we have that as $a\rightarrow\infty$ the first two terms above vanish. Thus by sending $a\rightarrow\infty$ we have

\int_{-\infty}^{\infty}\phi^{\prime}(x)h_{k}(x)e^{-x^{2}/2}dx=\int_{-\infty}^{\infty}\sqrt{k+1}\phi(x)h_{k+1}(x)e^{-x^{2}/2}dx.

After dividing by $\sqrt{2\pi}$ we get the desired result. ∎

In particular, under Assumption 3, and as highlighted by Corollary B.4, which follows directly from Lemmas B.2 and B.3, the NTK coefficients can be computed only using the Hermite coefficients of $\phi$ .

Corollary B.4.

Under Assumptions 1, 2 and 3, for all $p\in\mathbb{Z}_{\geq 0}$

\upsilon_{p,l}=\begin{cases}(p+1)\alpha_{p+1,2},&l=2,\\ \sum_{k=0}^{\infty}\upsilon_{k,2}F(p,k,\bar{\alpha}_{l-1}),&l\geq 3.\end{cases}

(43)

With these results in place we proceed to analyze the decay of the coefficients of the NTK for depth two networks. As stated in the main text, the decay of the NTK coefficients depends on the decay of the Hermite coefficients of the activation function deployed. This in turn is strongly influenced by the behavior of the tails of the activation function. To this end we roughly group activation functions into three categories: growing tails, flat or constant tails and finally decaying tails. Analyzing each of these groups in full generality is beyond the scope of this paper, we therefore instead study the behavior of ReLU, Tanh and Gaussian activation functions, being prototypical and practically used examples of each of these three groups respectively. We remark that these three activation functions satisfy Assumption 3. For typographical ease we let $\omega_{\sigma}(z)\vcentcolon=(1/\sqrt{2\pi\sigma^{2}})\exp\left(-z^{2}/(2\sigma^{2})\right)$ denote the Gaussian activation function with variance $\sigma^{2}$ .

See 3.2

Proof.

Recall (9),

\kappa_{p,2}=\sigma_{w}^{2}(1+\gamma_{w}^{2}p)\mu_{p}^{2}(\phi)+\sigma_{w}^{2}\gamma_{b}^{2}(1+p)\mu_{p+1}^{2}(\phi)+\delta_{p=0}\sigma_{b}^{2}.

In order to bound $\kappa_{p,2}$ we proceed by using Lemma A.4 to bound the square of the Hermite coefficients. We start with ReLU. Note Lemma A.4 actually provides precise expressions for the Hermite coefficients of ReLU, however, these are not immediately easy to interpret. Observe from Lemma A.4 that above index $p=2$ all odd indexed Hermite coefficients are $0$ . It therefore suffices to bound the even indexed terms, given by

\mu_{p}(ReLU)=\frac{1}{\sqrt{2\pi}}\frac{(p-3)!!}{\sqrt{p!}}.

Observe from (33) that for $p$ even

h_{p}(0)=(-1)^{p/2}\frac{(p-1)!!}{\sqrt{p!}},

therefore

\mu_{p}(ReLU)=\frac{1}{\sqrt{2\pi}}\frac{(p-3)!!}{\sqrt{p!}}=\frac{1}{\sqrt{2\pi}}\frac{\left\lvert h_{p}(0)\right\rvert}{p-1}.

Analyzing now $\left\lvert h_{p}(0)\right\rvert$ ,

\frac{(p-1)!!}{\sqrt{p!}}=\frac{\prod_{i=1}^{p/2}(2i-1)}{\sqrt{\prod_{i=1}^{p/2}(2i-1)2i}}=\sqrt{\frac{\prod_{i=1}^{p/2}(2i-1)}{\prod_{i=1}^{p/2}2i}}=\sqrt{\frac{(p-1)!!}{p!!}}.

Here, the expression inside the square root is referred to in the literature as the Wallis ratio, for which the following lower and upper bounds are available Kazarinoff (1956),

\sqrt{\frac{1}{\pi(p+0.5)}}<\frac{(p-1)!!}{p!!}<\sqrt{\frac{1}{\pi(p+0.25)}}.

(44)

As a result

\left\lvert h_{p}(0)\right\rvert=\Theta(p^{-1/4})

and therefore

\mu_{p}(ReLU)=\begin{cases}\Theta(p^{-5/4}),\;&p\text{ even},\\ 0,\;&p\text{ odd}.\end{cases}

As $(p+1)^{-3/2}=\Theta(p^{-3/2})$ , then from (9)

	$\displaystyle\kappa_{p,2}$	$\displaystyle=\Theta((p\mu_{p}^{2}(ReLU)+\delta_{\gamma_{b}>0}(p+1)\mu_{p+1}^{2}(ReLU)))$
		$\displaystyle=\Theta((\delta_{p\text{ even}}p^{-3/2}+\delta_{(p\text{ odd})\cap(\gamma_{b}>0)}(p+1)^{-3/2}))$
		$\displaystyle=\Theta\left(\delta_{(p\text{ even})\cup\left((p\text{ odd})\cap(\gamma_{b}>0)\right)}p^{-3/2}\right)$
		$\displaystyle=\delta_{(p\text{ even})\cup(\gamma_{b}>0)}\Theta\left(p^{-3/2}\right)$

as claimed in item 1.

We now proceed to analyze $\phi(z)=Tanh(z)$ . From Panigrahi et al. (2020, Corollary F.7.1)

\mu_{p}(Tanh^{\prime})=\mathcal{O}\left(\exp\left(-\frac{\pi\sqrt{p}}{4}\right)\right).

As Tanh satisfies the conditions of Lemma B.3

\mu_{p}(Tanh)=p^{-1/2}\mu_{p-1}(Tanh^{\prime})=\mathcal{O}\left(p^{-1/2}\exp\left(-\frac{\pi\sqrt{p-1}}{4}\right)\right).

Therefore the result claimed in item 2. follows as

	$\displaystyle\kappa_{p,2}$	$\displaystyle=\mathcal{O}((p\mu_{p}^{2}(Tanh)+(p+1)\mu_{p+1}^{2}(Tanh)))$
		$\displaystyle=\mathcal{O}\left(\exp\left(-\frac{\pi\sqrt{p-1}}{2}\right)+\exp\left(-\frac{\pi\sqrt{p}}{2}\right)\right)$
		$\displaystyle=\mathcal{O}\left(\exp\left(-\frac{\pi\sqrt{p-1}}{2}\right)\right).$

Finally, we now consider $\phi(z)=\omega_{\sigma}(z)$ where $\omega_{\sigma}(z)$ is the density function of $\mathcal{N}(0,\sigma^{2})$ . Similar to ReLU, analytic expressions for the Hermite coefficients of $\omega_{\sigma}(z)$ are known (see e.g., Davis, 2021, Theorem 2.9),

\mu_{p}^{2}(\omega_{\sigma})=\begin{cases}\frac{p!}{(\left(p/2\right)!)^{2}2^{p}2\pi(\sigma^{2}+1)^{p+1}},\;&\text{p even},\\ 0,\;&\text{p odd}.\\ \end{cases}

For $p$ even

(p/2)!=p!!2^{-p/2}.

Therefore

\displaystyle\frac{p!}{(p/2)!(p/2)!}=2^{p}\frac{p!}{p!!p!!}=2^{p}\frac{(p-1)!!}{p!!}.

As a result, for $p$ even and using (44), it follows that

\displaystyle\mu_{p}^{2}(\omega_{\sigma})

\displaystyle=\frac{(\sigma^{2}+1)^{-(p+1)}}{2\pi}\frac{(p-1)!!}{p!!}

\displaystyle=\Theta(p^{-1/2}(\sigma^{2}+1)^{-p}).

Finally, since $(p+1)^{1/2}(\sigma^{2}+1)^{-p-1}=\Theta(p^{1/2}(\sigma^{2}+1)^{-p})$ , then from (9)

	$\displaystyle\kappa_{p,2}$	$\displaystyle=\Theta((p\mu_{p}^{2}(\omega_{\sigma})+\delta_{\gamma_{b}>0}(p+1)\mu_{p+1}^{2}(\omega_{\sigma})))$
		$\displaystyle=\Theta\left(\delta_{(p\text{ even})\cup\left((p\text{ odd})\cap(\gamma_{b}>0)\right)}p^{1/2}(\sigma^{2}+1)^{-p}\right)$
		$\displaystyle=\delta_{(p\text{ even})\cup(\gamma_{b}>0)}\Theta\left(p^{1/2}(\sigma^{2}+1)^{-p}\right)$

as claimed in item 3. ∎

B.3 Numerical approximation via a truncated NTK power series and interpretation of Figure 2

Currently, computing the infinite width NTK requires either a) explicit evaluation of the Gaussian integrals highlighted in (13), b) numerical approximation of these same integrals such as in Lee et al. (2018), or c) approximation via a sufficiently wide yet still finite width network, see for instance Engel et al. (2022); Novak et al. (2022). These Gaussian integrals (13) can be solved solved analytically only for a minority of activation functions, notably ReLU as discussed for example by Arora et al. (2019b), while the numerical integration and finite width approximation approaches are relatively computationally expensive. The truncated NTK power series we define as analogous to (5) but with the series involved being computed only up to the $T$ th element. Once the top $T$ coefficients are computed, then for any input correlation the NTK can be approximated by evaluating the corresponding finite degree $T$ polynomial.

Definition B.5.

For an arbitrary pair $\mathbf{x},\mathbf{y}\in\mathbb{S}^{d-1}$ let $\rho=\mathbf{x}^{T}\mathbf{y}$ denote their linear correlation. Under Assumptions 1, 2 and 3, for all $l\in[2,L+1]$ the $T$ -truncated NTK power series $\hat{\Theta}_{T}^{(l)}:[-1,1]\rightarrow\mathbb{R}$ is defined as

\Theta_{T}^{(l)}(\rho)=\sum_{p=0}^{T}\hat{\kappa}_{p,l}\rho^{p}.

(45)

and whose coefficients are defined via the following recurrence relation,

\hat{\kappa}_{p,l}=\begin{cases}\delta_{p=0}\gamma_{b}^{2}+\delta_{p=1}\gamma_{w}^{2},&l=1,\\ \hat{\alpha}_{p,l}+\sum_{q=0}^{p}\hat{\kappa}_{q,l-1}\hat{\upsilon}_{p-q,l},&l\in[2,L+1].\end{cases}

(46)

Here, with $\bar{\hat{\alpha}}_{l-1}=(\hat{\alpha}_{p,l-1})_{p=0}^{T}$ ,

\hat{\alpha}_{p,l}\vcentcolon=\begin{cases}\sigma_{w}^{2}\mu_{p}^{2}(\phi)+\delta_{p=0}\sigma_{b}^{2},&l=2,\\ \sum_{k=0}^{T}\hat{\alpha}_{k,2}F(p,k,\bar{\hat{\alpha}}_{l-1}),&l\geq 3\end{cases}

(47)

and

\hat{\upsilon}_{p,l}\vcentcolon=\begin{cases}\sqrt{p+1}\hat{\alpha}_{p+1,2},&l=2,\\ \sum_{k=0}^{T}\sqrt{k+1}\hat{\alpha}_{p+1,2}F(p,k,\bar{\hat{\alpha}}_{l}),&l\geq 3.\end{cases}

(48)

In order to analyze the performance and potential of the truncated NTK for numerical approximation, we compute it for ReLU and compare it with its analytical expression Arora et al. (2019b). To recall this result, let

	$\displaystyle R(\rho)$	$\displaystyle\vcentcolon=\frac{\sqrt{1-\rho^{2}}+\rho\cdot\arcsin(\rho)}{\pi}+\frac{\rho}{2},$
	$\displaystyle R^{\prime}(\rho)$	$\displaystyle\vcentcolon=\frac{\arcsin(\rho)}{\pi}+\frac{1}{2}.$

Under Assumptions 1 and 2, with $\phi(z)=ReLU(z)$ , $\gamma_{w}^{2}=1$ , $\sigma_{w}^{2}=2$ , $\sigma_{b}^{2}=\gamma_{b}^{2}=0$ , $\mathbf{x},\mathbf{y}\in\mathbb{S}^{d}$ and $\rho_{1}\vcentcolon=\mathbf{x}^{T}\mathbf{y}$ , then $\Theta_{1}(\mathbf{x},\mathbf{y})=\rho$ and for all $l\in[2,L+1]$

		$\displaystyle\rho_{l}=R(\rho_{l-1}),$		(49)
		$\displaystyle\Theta_{l}(\mathbf{x},\mathbf{y})=\rho_{l}+\rho_{l-1}R^{\prime}(\rho_{l-1}).$		(49)

Turning our attention to Figure 2, we observe particularly for input correlations $\left\lvert\rho\right\rvert\approx 0.5$ and below then the truncated ReLU NTK power series achieves machine level precision. For $\left\lvert\rho\right\rvert\approx 1$ higher order coefficients play a more significant role. As the truncated ReLU NTK power series approximates these coefficients less well the overall approximation of the ReLU NTK is worse. We remark also that negative correlations have a smaller absolute error as odd indexed terms cancel with even index terms: we emphasize again that in Figure 2 we plot the absolute not relative error. In addition, for $L=1$ there is symmetry in the absolute error for positive and negative correlations as $\alpha_{p,2}=0$ for all odd $p$ . One also observes that approximation accuracy goes down with depth, which is due to the error in the coefficients at the previous layer contributing to the error in the coefficients at the next, thereby resulting in an accumulation of error with depth. Also, and certainly as one might expect, a larger truncation point $T$ results in overall better approximation. Finally, as the decay in the Hermite coefficients for ReLU is relatively slow, see e.g., Table 1 and Lemma 3.2, we expect the truncated ReLU NTK power series to perform worse relative to the truncated NTK’s for other activation functions.

B.4 Characterizing NTK power series coefficient decay rates for deep networks

In general, Theorem 3.1 does not provide a straightforward path to analyzing the decay of the NTK power series coefficients for depths greater than two. This is at least in part due to the difficulty of analyzing $F(p,k,\bar{\alpha}_{l-1})$ , which recall is the sum of all ordered products of $k$ elements of $\bar{\alpha}_{l-1}$ whose indices sum to $p$ , defined in (4). However, in the setting where the squares of the Hermite coefficients, and therefore the series $(\alpha_{p,2})_{p=0}^{\infty}$ , decay at an exponential rate, this quantity can be characterized and therefore an analysis, at least to a certain degree, of the impact of depth conducted. Although admittedly limited in scope, we highlight that this setting is relevant for the study of Gaussian activation functions and radial basis function (RBF) networks. We will also make the additional simplifying assumption that the activation function has zero Gaussian mean (which can be obtained by centering). Unfortunately this further reduces the applicability of the following results to activation functions commonly used in practice. We leave the study of relaxing this zero bias assumption, perhaps only enforcing exponential decay asymptotically, as well as a proper exploration of other decay patterns, to future work.

The following lemma precisely describes, in the specific setting considered here, the evolution of the coefficients of the Gaussian Process kernel with depth.

Lemma B.6.

Let $\alpha_{0,2}=0$ and $\alpha_{p,2}=C_{2}\eta_{2}^{-p}$ for $p\in\mathbb{Z}_{\geq 1}$ , where $C_{2}$ and $\eta_{2}$ are constants such that $\sum_{p=1}^{\infty}\alpha_{p,2}=1$ . Then for all $l\geq 2$ and $p\in\mathbb{Z}_{\geq 0}$

\alpha_{p,l+1}=\begin{cases}0,&\;p=0,\\ C_{l+1}\eta_{l+1}^{-p},&\;p\geq 1\end{cases}

(50)

where the constants $\eta_{l+1}$ and $C_{l+1}$ are defined as

\displaystyle\eta_{l+1}=\frac{\eta_{l}\eta_{2}}{\eta_{2}+C_{l}},\;\;C_{l+1}=\frac{C_{l}C_{2}}{\eta_{2}+C_{l}}.

(51)

Proof.

Observe for $l=2$ , we have that $\alpha_{0,l}=0$ and $\alpha_{p,l}=C_{l}\eta_{l}^{-p}$ hold by assumption. Thus by induction it suffices to show that $\alpha_{0,l}=0$ and $\alpha_{p,l}=C_{l}\eta_{l}^{-p}$ implies (50) and (51) hold. Thus assume for some $l\geq 2$ we have that $\alpha_{0,l}=0$ and $\alpha_{p,l}=C_{l}\eta_{l}^{-p}$ . Recall the definition of $F$ from (4): as $\alpha_{0,l}=0$ then with $p\geq 1$ and $1\leq k\leq p$

F(p,k,\bar{\alpha}_{l})=\sum_{(j_{i})\in\mathcal{J}(p,k)}\prod_{i=1}^{k}\alpha_{j_{i},l}=\sum_{(j_{i})\in\mathcal{J}_{+}(p,k)}\prod_{i=1}^{k}\alpha_{j_{i},l},

where

\mathcal{J}_{+}(p,k)\vcentcolon=\big{\{}(j_{i})_{i\in[k]}\;:\;j_{i}\geq 1\;\forall i\in[k],\;\sum_{i=1}^{k}j_{i}=p\big{\}}\quad\text{for all $p\in\mathbb{Z}_{\geq 1}$, $k\in[p]$},

which is the set of all $k$ -tuples of positive (instead of non-negative) integers which sum to $p$ . Substituting $\alpha_{p,l}=C_{l}\eta_{l}^{-p}$ then

F(p,k,\bar{\alpha}_{l})=\sum_{(j_{i})\in\mathcal{J}_{+}(p,k)}C_{l}^{k}\eta_{l}^{-p}=C_{l}^{k}\eta_{l}^{-p}|\mathcal{J}_{+}(p,k)|=C_{l}^{k}\eta_{l}^{-p}\binom{p-1}{k-1},

where the final equality follows from a stars and bars argument. Now observe for $k>p$ that at least one of the indices in $(j_{i})_{i=1}^{k}$ must be 0 and therefore $\prod_{i=1}^{k}\alpha_{j_{i},2}=0$ . As a result under the assumptions of the lemma

F(p,k,\bar{\alpha}_{l})=\begin{cases}1,\;&k=0\text{ and }p=0,\\ C_{l}^{k}\eta_{l}^{-p}\binom{p-1}{k-1},\;&\;k\in[p]\text{ and }p\geq 1,\\ 0,\;&\text{ otherwise.}\\ \end{cases}

(52)

Substituting (52) into (7) it follows that

\alpha_{0,l+1}=\sum_{k=0}^{\infty}\alpha_{k,2}F(0,k,\bar{\alpha}_{l})=\alpha_{0,2}=0

and for $p\geq 1$

	$\displaystyle\alpha_{p,l+1}$	$\displaystyle=\sum_{k=0}^{\infty}\alpha_{k,2}F(p,k,\bar{\alpha}_{l})$
		$\displaystyle=C_{2}\eta_{l}^{-p}\sum_{k=1}^{p}\left(\frac{C_{l}}{\eta_{2}}\right)^{k}\binom{p-1}{k-1}$
		$\displaystyle=\eta_{l}^{-p}C_{l}\eta_{2}^{-1}C_{2}\sum_{h=0}^{p-1}\left(\frac{C_{l}}{\eta_{2}}\right)^{h}\binom{p-1}{h}$
		$\displaystyle=\eta_{l}^{-p}C_{l}\eta_{2}^{-1}C_{2}\left(1+\frac{C_{l}}{\eta_{2}}\right)^{p-1}$
		$\displaystyle=\frac{C_{l}C_{2}}{\eta_{2}+C_{l}}\left(\frac{\eta_{l}\eta_{2}}{\eta_{2}+C_{l}}\right)^{-p}$
		$\displaystyle=C_{l+1}\eta_{l+1}^{-p}$

as claimed. ∎

We now analyze the coefficients of the derivative of the Gaussian Process kernel.

Lemma B.7.

In addition to the assumptions of Lemma B.6, assume also that $\phi$ satisfies Assumption 3. Then $\upsilon_{p,2}=\frac{C_{2}}{\eta_{2}}(1+p)\eta_{2}^{-p}$ . Furthermore, for all $l\geq 2$ and $p\in\mathbb{Z}_{\geq 0}$

\upsilon_{p,l+1}=\begin{cases}C_{2}\eta_{2}^{-1},&\;p=0,\\ (V_{l+1}^{\prime}+V_{l+1}p)\eta_{l+1}^{-p},&\;p\geq 1,\end{cases}

(53)

where the constants $V_{l+1}^{\prime}$ and $V_{l+1}$ are defined as

\displaystyle V_{l+1}^{\prime}\vcentcolon=\frac{2C_{2}C_{l}}{\eta_{2}(C_{l}+\eta_{2})}-\frac{C_{2}C_{l}^{2}}{\eta_{2}(C_{l}+\eta_{2})^{2}},\;\;V_{l+1}\vcentcolon=\frac{C_{2}C_{l}^{2}}{\eta_{2}(C_{l}+\eta_{2})^{2}}

(54)

and $C_{l}$ and $\eta_{l}$ are defined in (51).

Proof.

Under Assumption 3 then for all $p\in\mathbb{Z}_{\geq 0}$ we have

\upsilon_{p,2}=\sigma_{w}^{2}\mu_{p}^{2}(\phi^{\prime})=\sigma_{w}^{2}(p+1)\mu_{p+1}(\phi)^{2}=(p+1)\alpha_{p+1,2}=\frac{C_{2}}{\eta_{2}}(1+p)\eta_{2}^{-p}.

For $l\geq 2$ and $p=0$ it therefore follows that

\upsilon_{0,l+1}=\sum_{k=0}^{\infty}(k+1)\alpha_{k+1,2}F(0,k,\bar{\alpha}_{l})=\alpha_{1,2}=C_{2}\eta_{2}^{-1}.

For $l\geq 2$ and $p\geq 1$ then

	$\displaystyle\upsilon_{p,l+1}$	$\displaystyle=\sum_{k=0}^{\infty}\upsilon_{k,2}F(p,k,\bar{\alpha}_{l})$
		$\displaystyle=\sum_{k=0}^{\infty}(k+1)\alpha_{k+1,2}F(p,k,\bar{\alpha}_{l})$
		$\displaystyle=\sum_{h=1}^{\infty}hC_{2}\eta_{2}^{-h}F(p,h-1,\bar{\alpha}_{l})$
		$\displaystyle=\frac{C_{2}}{C_{l}}\eta_{l}^{-p}\sum_{h=2}^{p+1}h\left(\frac{C_{l}}{\eta_{2}}\right)^{h}\binom{p-1}{h-2}$
		$\displaystyle=\frac{C_{2}}{C_{l}}\eta_{l}^{-p}\sum_{r=0}^{p-1}(r+2)\left(\frac{C_{l}}{\eta_{2}}\right)^{r+2}\binom{p-1}{r}$
		$\displaystyle=\frac{C_{2}C_{l}}{\eta_{2}^{2}}\eta_{l}^{-p}\left(2\sum_{r=0}^{p-1}\left(\frac{C_{l}}{\eta_{2}}\right)^{r}\binom{p-1}{r}+\sum_{r=0}^{p-1}r\left(\frac{C_{l}}{\eta_{2}}\right)^{r}\binom{p-1}{r}\right)$
		$\displaystyle=\frac{C_{2}C_{l}}{\eta_{2}^{2}}\eta_{l}^{-p}\left(2\left(1+\frac{C_{l}}{\eta_{2}}\right)^{p-1}+\frac{C_{l}}{\eta_{2}}(p-1)\left(1+\frac{C_{l}}{\eta_{2}}\right)^{p-2}\right)$
		$\displaystyle=\frac{2C_{2}C_{l}}{\eta_{2}(C_{l}+\eta_{2})}\left(\frac{\eta_{l}\eta_{2}}{\eta_{2}+C_{l}}\right)^{-p}+\frac{C_{2}C_{l}^{2}}{\eta_{2}(C_{l}+\eta_{2})^{2}}(p-1)\left(\frac{\eta_{l}\eta_{2}}{\eta_{2}+C_{l}}\right)^{-p}$
		$\displaystyle=\left(\frac{2C_{2}C_{l}}{\eta_{2}(C_{l}+\eta_{2})}-\frac{C_{2}C_{l}^{2}}{\eta_{2}(C_{l}+\eta_{2})^{2}}\right)\eta_{l+1}^{-p}+\left(\frac{C_{2}C_{l}^{2}}{\eta_{2}(C_{l}+\eta_{2})^{2}}\right)p\eta_{l+1}^{-p}$
		$\displaystyle=(V_{l+1}^{\prime}+V_{l+1}p)\eta_{l+1}^{-p}$

as claimed. ∎

With the coefficients of both the Gaussian Process kernel and its derivative characterized, we proceed to upper bound the decay of the NTK coefficients in the specific setting outlined in Lemma B.6 and B.7.

Lemma B.8.

Let the data, hyperparameters and activation function $\phi$ be such that Assumptions 1, 2 and 3 are satisfied along with the conditions of of Lemma B.6. Then for any $l\geq 2$ there exist positive constants $M_{l}^{\prime}$ and $K_{l}^{\prime}$ such that for all $p\in\mathbb{Z}_{\geq 1}$

\kappa_{p,l}\leq(M_{l}^{\prime}+K_{l}^{\prime}p^{2l-3})\eta_{l}^{-p}

(55)

where $\eta_{l}$ is defined in Lemma B.6.

Proof.

We proceed by induction starting with the base case $l=2$ . Applying the results of Lemmas B.6 and B.7 to (6) then for $p\in\mathbb{Z}_{\geq 1}$

\kappa_{p,2}=((C_{2}+\gamma_{b}^{2}C_{2}\eta_{2}^{-1})+(\gamma_{b}^{2}C_{2}\eta_{2}^{-1}+\gamma_{w}^{2}C_{2})p)\eta_{2}^{-p}.

(56)

If we define $M_{2}^{\prime}\vcentcolon=C_{2}+\gamma_{b}^{2}C_{2}\eta_{2}^{-1}$ and $K_{2}^{\prime}\vcentcolon=\gamma_{b}^{2}C_{2}\eta_{2}^{-1}+\gamma_{w}^{2}C_{2}$ , which are clearly positive constants, then $\kappa_{p,2}=(M_{2}^{\prime}+K_{2}^{\prime}p)\eta_{2}^{-p}$ and so for $l=2$ the induction hypothesis clearly holds. We now assume the inductive hypothesis holds for some $l\geq 2$ . Observe from (53), with $l\geq 2$ and $p\in\mathbb{Z}_{\geq 0}$ that

\upsilon_{p,l+1}\leq(A_{l+1}^{\prime}+V_{l+1}p)\eta_{l+1}^{-p}.

(57)

where $A_{l+1}^{\prime}\vcentcolon=\max\{C_{2}\eta_{2}^{-1},V_{l+1}^{\prime}\}$ . Substituting 57 and the inductive hypothesis inequality into (6) it follows for $p\geq 1$ that

	$\displaystyle\kappa_{p,l+1}$	$\displaystyle\leq C_{l+1}\eta_{l+1}^{-p}+\eta_{l+1}^{-p}\sum_{q=0}^{p}(M_{l}^{\prime}+K_{l}^{\prime}q^{2l-3})\eta_{l}^{-q}(A_{l+1}^{\prime}+V_{l+1}(p-q))\eta_{l+1}^{q}$
		$\displaystyle=C_{l+1}\eta_{l+1}^{-p}+\eta_{l+1}^{-p}\sum_{q=0}^{p}(M_{l}^{\prime}+K_{l}^{\prime}q^{2l-3})(A_{l+1}^{\prime}+V_{l+1}(p-q))\left(\frac{\eta_{2}}{\eta_{2}+C_{l}}\right)^{q}$
		$\displaystyle\leq C_{l+1}\eta_{l+1}^{-p}+\eta_{l+1}^{-p}\sum_{q=0}^{p}(M_{l}^{\prime}+K_{l}^{\prime}q^{2l-3})(A_{l+1}^{\prime}+V_{l+1}(p-q))$
		$\displaystyle\leq C_{l+1}\eta_{l+1}^{-p}+\eta_{l+1}^{-p}\sum_{q=0}^{p}(M_{l}^{\prime}+K_{l}^{\prime}q^{2l-3})(A_{l+1}^{\prime}+V_{l+1}p)$
		$\displaystyle\leq(C_{l+1}+M_{l}^{\prime}A_{l+1}^{\prime})\eta_{l+1}^{-p}+\left(M_{l}^{\prime}V_{l+1}p+\sum_{q=1}^{p}(M_{l}^{\prime}+K_{l}^{\prime}q^{2l-3})(A_{l+1}^{\prime}+V_{l+1}p)\right)\eta_{l+1}^{-p}$
		$\displaystyle\leq(C_{l+1}+M_{l}^{\prime}A_{l+1}^{\prime})\eta_{l+1}^{-p}+\left(M_{l}^{\prime}V_{l+1}p+p(M_{l}^{\prime}+K_{l}^{\prime}p^{2l-3})(A_{l+1}^{\prime}+V_{l+1}p)\right)\eta_{l+1}^{-p}$
		$\displaystyle\leq(C_{l+1}+M_{l}^{\prime}A_{l+1}^{\prime})\eta_{l+1}^{-p}+p\left(M_{l}^{\prime}A_{l+1}^{\prime}+2M_{l}^{\prime}V_{l+1}p+K_{l}^{\prime}A_{l+1}^{\prime}p^{2l-3}+K_{l}^{\prime}V_{l+1}p^{2l-2}\right)\eta_{l+1}^{-p}$
		$\displaystyle\leq\left((C_{l+1}+M_{l}^{\prime}A_{l+1}^{\prime})+\left(M_{l}^{\prime}A_{l+1}^{\prime}+2M_{l}^{\prime}V_{l+1}+K_{l}^{\prime}A_{l+1}^{\prime}+K_{l}^{\prime}V_{l+1}\right)p^{2l-1}\right)\eta_{l+1}^{-p}$

Therefore there exist positive constants $M_{l+1}^{\prime}=C_{l+1}+M_{l}^{\prime}A_{l+1}^{\prime}$ and $K_{l+1}^{\prime}=M_{l}^{\prime}A_{l+1}^{\prime}+2M_{l}^{\prime}V_{l+1}+K_{l}^{\prime}A_{l+1}^{\prime}+K_{l}^{\prime}V_{l+1}$ such that $\kappa_{p,l+1}\leq(M_{l+1}^{\prime}+K_{l+1}^{\prime}p^{2(l+1)-3})\eta_{l+1}^{-p}$ as claimed. This completes the inductive step and therefore also the proof of the lemma. ∎

Appendix C Analyzing the spectrum of the NTK via its power series

C.1 Effective rank of power series kernels

Recall that for a positive semidefinite matrix $\mathbf{A}$ we define the effective rank Huang et al. (2022) via the following ratio

\mathrm{eff}(\mathbf{A}):=\frac{Tr(\mathbf{A})}{\lambda_{1}(\mathbf{A})}.

We consider a kernel Gram matrix $\mathbf{K}\in\mathbb{R}^{n\times n}$ that has the following power series representation in terms of an input gram matrix $\mathbf{X}\mathbf{X}^{T}$

n\mathbf{K}=\sum_{i=0}^{\infty}c_{i}(\mathbf{X}\mathbf{X}^{T})^{\odot i}.

Whenever $c_{0}\neq 0$ the effective rank of $\mathbf{K}$ is $O(1)$ , as displayed in the following theorem. See 4.1

Proof.

By linearity of trace we have that

Tr(n\mathbf{K})=\sum_{i=0}^{\infty}c_{i}Tr((\mathbf{X}\mathbf{X}^{T})^{\odot i})=n\sum_{i=0}^{\infty}c_{i}

where we have used the fact that $Tr((\mathbf{X}\mathbf{X}^{T})^{\odot i})=n$ for all $i\in\mathbb{N}$ . On the other hand

\lambda_{1}(n\mathbf{K})\geq\lambda_{1}(c_{0}(\mathbf{X}\mathbf{X}^{T})^{0})=\lambda_{1}(c_{0}\mathbf{1}_{n\times n})=nc_{0}.

Thus we have that

\mathrm{eff}(\mathbf{K})=\frac{Tr(\mathbf{K})}{\lambda_{1}(\mathbf{K})}=\frac{Tr(n\mathbf{K})}{\lambda_{1}(n\mathbf{K})}\leq\frac{\sum_{i=0}^{\infty}c_{i}}{c_{0}}.

∎

The above theorem demonstrates that the constant term $c_{0}\mathbf{1}_{n\times n}$ in the kernel leads to a significant outlier in the spectrum of $\mathbf{K}$ . However this fails to capture how the structure of the input data $\mathbf{X}$ manifests in the spectrum of $\mathbf{K}$ . For this we will examine the centered kernel matrix $\widetilde{\mathbf{K}}:=\mathbf{K}-\frac{c_{0}}{n}\mathbf{1}\mathbf{1}^{T}$ . Using a very similar argument as before we can demonstrate that the effective rank of $\widetilde{\mathbf{K}}$ is controlled by the effective rank of the input data gram $\mathbf{X}\mathbf{X}^{T}$ . This is formalized in the following theorem. See 4.3

Proof.

By the linearity of the trace we have that

Tr(n\widetilde{\mathbf{K}})=\sum_{i=1}^{\infty}c_{i}Tr((\mathbf{X}\mathbf{X}^{T})^{\odot i})=Tr(\mathbf{X}\mathbf{X}^{T})\sum_{i=1}^{\infty}c_{i}

where we have used the fact that $Tr((\mathbf{X}\mathbf{X}^{T})^{\odot i})=Tr(\mathbf{X}\mathbf{X}^{T})=n$ for all $i\in[n]$ . On the other hand we have that

\lambda_{1}(n\widetilde{\mathbf{K}})\geq\lambda_{1}(c_{1}\mathbf{X}\mathbf{X}^{T})=c_{1}\lambda_{1}(\mathbf{X}\mathbf{X}^{T}).

Thus we conclude

\mathrm{eff}(\widetilde{\mathbf{K}})=\frac{Tr(\widetilde{\mathbf{K}})}{\lambda_{1}(\widetilde{\mathbf{K}})}=\frac{Tr(n\widetilde{\mathbf{K}})}{\lambda_{1}(n\widetilde{\mathbf{K}})}\leq\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\frac{\sum_{i=1}^{\infty}c_{i}}{c_{1}}.

∎

C.2 Effective rank of the NTK for finite width networks

C.2.1 Notation and definitions

We will let $[k]:=\{1,2,\ldots,k\}$ . We consider a neural network

\sum_{\ell=1}^{m}a_{\ell}\phi(\langle\mathbf{w}_{\ell},\mathbf{x}\rangle)

where $\mathbf{x}\in\mathbb{R}^{d}$ and $\mathbf{w}_{\ell}\in\mathbb{R}^{d}$ , $a_{\ell}\in\mathbb{R}$ for all $\ell\in[m]$ and $\phi$ is a scalar valued activation function. The network we present here does not have any bias values in the inner-layer, however the results we will prove later apply to the nonzero bias case by replacing $\mathbf{x}$ with $[\mathbf{x}^{T},1]^{T}$ . We let $\mathbf{W}\in\mathbb{R}^{m\times d}$ be the matrix whose $\ell$ -th row is equal to $\mathbf{w}_{\ell}$ and $\mathbf{a}\in\mathbb{R}^{m}$ be the vector whose $\ell$ -th entry is equal to $a_{\ell}$ . We can then write the neural network in vector form

f(\mathbf{x};\mathbf{W},\mathbf{a})=\mathbf{a}^{T}\phi(\mathbf{W}\mathbf{x})

where $\phi$ is understood to be applied entry-wise.

Suppose we have $n$ training data inputs $\mathbf{x}_{1},\ldots,\mathbf{x}_{n}\in\mathbb{R}^{d}$ . We will let $\mathbf{X}\in\mathbb{R}^{n\times d}$ be the matrix whose $i$ -th row is equal to $\mathbf{x}_{i}$ . Let $\mathbf{\theta}_{inner}=vec(\mathbf{W})$ denote the row-wise vectorization of the inner-layer weights. We consider the Jacobian of the neural networks predictions on the training data with respect to the inner layer weights:

\mathbf{J}_{inner}^{T}=\left[\frac{\partial f(\mathbf{x}_{1})}{\partial\mathbf{\theta}_{inner}},\frac{\partial f(\mathbf{x}_{2})}{\partial\mathbf{\theta}_{inner}},\ldots,\frac{\partial f(\mathbf{x}_{n})}{\partial\mathbf{\theta}_{inner}}\right]

Similarly we can look at the analagous quantity for the outer layer weights

\mathbf{J}_{outer}^{T}=\left[\frac{\partial f(\mathbf{x}_{1})}{\partial\mathbf{a}},\frac{\partial f(\mathbf{x}_{2})}{\partial\mathbf{a}},\ldots,\frac{\partial f(\mathbf{x}_{n})}{\partial\mathbf{a}}\right]=\phi\left(\mathbf{W}\mathbf{X}^{T}\right).

Our first observation is that the per-example gradients for the inner layer weights have a nice Kronecker product representation

\frac{\partial f(\mathbf{x})}{\partial\mathbf{\theta}_{inner}}=\begin{bmatrix}a_{1}\phi^{\prime}(\langle\mathbf{w}_{1},\mathbf{x}\rangle)\\ a_{2}\phi^{\prime}(\langle\mathbf{w}_{2},\mathbf{x}\rangle)\\ \cdots\\ a_{m}\phi^{\prime}(\langle\mathbf{w}_{m},\mathbf{x}\rangle)\end{bmatrix}\otimes\mathbf{x}.

For convenience we will let

\mathbf{Y}_{i}:=\begin{bmatrix}a_{1}\phi^{\prime}(\langle\mathbf{w}_{1},\mathbf{x}_{i}\rangle)\\ a_{2}\phi^{\prime}(\langle\mathbf{w}_{2},\mathbf{x}_{i}\rangle)\\ \cdots\\ a_{m}\phi^{\prime}(\langle\mathbf{w}_{m},\mathbf{x}_{i}\rangle)\end{bmatrix}.

where the dependence of $\mathbf{Y}_{i}$ on the parameters $\mathbf{W}$ and $\mathbf{a}$ is suppressed (formally $\mathbf{Y}_{i}=\mathbf{Y}_{i}(\mathbf{W},\mathbf{a})$ ). This way we may write

\frac{\partial f(\mathbf{x}_{i})}{\partial\mathbf{\theta}_{inner}}=\mathbf{Y}_{i}\otimes\mathbf{x}_{i}.

We will study the NTK with respect to the inner-layer weights

\mathbf{K}_{inner}=\mathbf{J}_{inner}\mathbf{J}_{inner}^{T}

and the same quantity for the outer-layer weights

\mathbf{K}_{outer}=\mathbf{J}_{outer}\mathbf{J}_{outer}^{T}.

For a hermitian matrix $\mathbf{A}$ we will let $\lambda_{i}(\mathbf{A})$ denote the $i$ th largest eigenvalue of $\mathbf{A}$ so that $\lambda_{1}(\mathbf{A})\geq\lambda_{2}(\mathbf{A})\geq\cdots\geq\lambda_{n}(\mathbf{A})$ . Similarly for an arbitrary matrix $\mathbf{A}$ we will let $\sigma_{i}(\mathbf{A})$ to the $i$ th largest singular value of $\mathbf{A}$ . For a matrix $\mathbf{A}\in\mathbb{R}^{r\times k}$ we will let $\sigma_{min}(\mathbf{A})=\sigma_{\min(r,k)}$ .

C.2.2 Effective rank

For a positive semidefinite matrix $\mathbf{A}$ we define the effective rank (Huang et al., 2022) of $\mathbf{A}$ to be the quantity

\mathrm{eff}(\mathbf{A}):=\frac{Tr(\mathbf{A})}{\lambda_{1}(\mathbf{A})}.

The effective rank quantifies how many eigenvalues are on the order of the largest eigenvalue. We have the Markov-like inequality

\left\lvert\{i:\lambda_{i}(\mathbf{A})\geq c\lambda_{1}(\mathbf{A})\}\right\rvert\leq c^{-1}\frac{Tr(\mathbf{A})}{\lambda_{1}(\mathbf{A})}

and the eigenvalue bound

\frac{\lambda_{i}(\mathbf{A})}{\lambda_{1}(\mathbf{A})}\leq\frac{1}{i}\frac{Tr(\mathbf{A})}{\lambda_{1}(\mathbf{A})}.

Let $\mathbf{A}$ and $\mathbf{B}$ be positive semidefinite matrices. Then we have

\displaystyle\frac{Tr(\mathbf{A}+\mathbf{B})}{\lambda_{1}(\mathbf{A}+\mathbf{B})}\leq\frac{Tr(\mathbf{A})+Tr(\mathbf{B})}{\max\left(\lambda_{1}(\mathbf{A}),\lambda_{1}(\mathbf{B})\right)}\leq\frac{Tr(\mathbf{A})}{\lambda_{1}(\mathbf{A})}+\frac{Tr(\mathbf{B})}{\lambda_{1}(\mathbf{B})}.

Thus the effective rank is subadditive for positive semidefinite matrices.

We will be interested in bounding the effective rank of the NTK. Let $\mathbf{K}=\mathbf{J}\mathbf{J}^{T}=\mathbf{J}_{outer}\mathbf{J}_{outer}^{T}+\mathbf{J}_{inner}\mathbf{J}_{inner}^{T}=\mathbf{K}_{outer}+\mathbf{K}_{inner}$ be the NTK matrix with respect to all the network parameters. Note that by subadditivity

\frac{Tr(\mathbf{K})}{\lambda_{1}(\mathbf{K})}\leq\frac{Tr(\mathbf{K}_{outer})}{\lambda_{1}(\mathbf{K}_{outer})}+\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}(\mathbf{K}_{inner})}.

In this vein we will control the effective rank of $\mathbf{K}_{inner}$ and $\mathbf{K}_{outer}$ separately.

C.2.3 Effective rank of inner-layer NTK

We will show that the effective rank of inner-layer NTK is bounded by a multiple of the effective rank of the data input gram $\mathbf{X}\mathbf{X}^{T}$ . We introduce the following meta-theorem that we will use to prove various corollaries later

Theorem C.1.

Set $\alpha:=\sup_{\left\lVert\mathbf{b}\right\rVert=1}\left[\min_{j\in[n]}|\langle\mathbf{Y}_{j},\mathbf{b}\rangle|\right]$ . Assume $\alpha>0$ . Then

\frac{\min_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}Tr(\mathbf{X}\mathbf{X}^{T})}{\max_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}\left(\mathbf{K}_{inner}\right)}\leq\frac{\max_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}}{\alpha^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

Proof.

We will first prove the upper bound. We first observe that

	$\displaystyle Tr(\mathbf{K}_{inner})=\sum_{i=1}^{n}\left\lVert\frac{\partial f(\mathbf{x}_{i})}{\partial\mathbf{\theta}_{inner}}\right\rVert_{2}^{2}=\sum_{i=1}^{n}\left\lVert\mathbf{Y}_{i}\otimes\mathbf{x}_{i}\right\rVert_{2}^{2}=\sum_{i=1}^{n}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}\left\lVert\mathbf{x}_{i}\right\rVert_{2}^{2}$
	$\displaystyle\leq\max_{j\in[n]}\left\lVert\mathbf{Y}_{j}\right\rVert_{2}^{2}\sum_{i=1}^{n}\left\lVert\mathbf{x}_{i}\right\rVert_{2}^{2}=\max_{j\in[n]}\left\lVert\mathbf{Y}_{j}\right\rVert_{2}^{2}Tr(\mathbf{X}\mathbf{X}^{T})$

Recall that

\lambda_{1}\left(\mathbf{K}_{inner}\right)=\lambda_{1}\left(\mathbf{J}_{inner}\mathbf{J}_{inner}^{T}\right)=\lambda_{1}\left(\mathbf{J}_{inner}^{T}\mathbf{J}_{inner}\right).

Well

	$\displaystyle\mathbf{J}_{inner}^{T}\mathbf{J}_{inner}=\sum_{i=1}^{n}\frac{\partial f(\mathbf{x}_{i})}{\partial\mathbf{\theta}_{inner}}\frac{\partial f(\mathbf{x}_{i})}{\partial\mathbf{\theta}_{inner}}^{T}=\sum_{i=1}^{n}\left[\mathbf{Y}_{i}\otimes\mathbf{x}_{i}\right]\left[\mathbf{Y}_{i}\otimes\mathbf{x}_{i}\right]^{T}$
	$\displaystyle=\sum_{i=1}^{n}\left[\mathbf{Y}_{i}\mathbf{Y}_{i}^{T}\right]\otimes\left[\mathbf{x}_{i}\mathbf{x}_{i}^{T}\right]$

Well then we may use the fact that

\lambda_{1}(\mathbf{J}_{inner}^{T}\mathbf{J}_{inner})=\max_{\left\lVert\mathbf{b}\right\rVert_{2}=1}\mathbf{b}^{T}\mathbf{J}_{inner}^{T}\mathbf{J}_{inner}\mathbf{b}

Let $\mathbf{b}_{1}\in\mathbb{R}^{m}$ and $\mathbf{b}_{2}\in\mathbb{R}^{d}$ be vectors that we will optimize later satisfying $\left\lVert\mathbf{b}_{1}\right\rVert_{2}\left\lVert\mathbf{b}_{2}\right\rVert_{2}=1$ . Then we have that $\left\lVert\mathbf{b}_{1}\otimes\mathbf{b}_{2}\right\rVert=1$ and

	$\displaystyle(\mathbf{b}_{1}\otimes\mathbf{b}_{2})^{T}\mathbf{J}_{inner}^{T}\mathbf{J}_{inner}(\mathbf{b}_{1}\otimes\mathbf{b}_{2})=\sum_{i=1}^{n}(\mathbf{b}_{1}\otimes\mathbf{b}_{2})^{T}\left(\left[\mathbf{Y}_{i}\mathbf{Y}_{i}^{T}\right]\otimes\left[\mathbf{x}_{i}\mathbf{x}_{i}^{T}\right]\right)(\mathbf{b}_{1}\otimes\mathbf{b}_{2})$
	$\displaystyle=\sum_{i=1}^{n}\left[\mathbf{b}_{1}^{T}\mathbf{Y}_{i}\mathbf{Y}_{i}^{T}\mathbf{b}_{1}\right]\left[\mathbf{b}_{2}^{T}\mathbf{x}_{i}\mathbf{x}_{i}^{T}\mathbf{b}_{2}\right]\geq\left[\min_{j\in[n]}\mathbf{b}_{1}^{T}\mathbf{Y}_{j}\mathbf{Y}_{j}^{T}\mathbf{b}_{1}\right]\sum_{i=1}^{n}\mathbf{b}_{2}^{T}\mathbf{x}_{i}\mathbf{x}_{i}^{T}\mathbf{b}_{2}$
	$\displaystyle=\left[\min_{j\in[n]}\mathbf{b}_{1}^{T}\mathbf{Y}_{j}\mathbf{Y}_{j}^{T}\mathbf{b}_{1}\right]\mathbf{b}_{2}^{T}\left[\sum_{i=1}^{n}\mathbf{x}_{i}\mathbf{x}_{i}^{T}\right]\mathbf{b}_{2}=\left[\min_{j\in[n]}\mathbf{b}_{1}^{T}\mathbf{Y}_{j}\mathbf{Y}_{j}^{T}\mathbf{b}_{1}\right]\mathbf{b}_{2}\mathbf{X}^{T}\mathbf{X}\mathbf{b}_{2}$

Pick $\mathbf{b}_{2}$ so that $\left\lVert\mathbf{b}_{2}\right\rVert=1$ and

\mathbf{b}_{2}\mathbf{X}^{T}\mathbf{X}\mathbf{b}_{2}=\lambda_{1}(\mathbf{X}^{T}\mathbf{X})=\lambda_{1}(\mathbf{X}\mathbf{X}^{T}).

Thus for this choice of $\mathbf{b}_{2}$ we have

	$\displaystyle\lambda_{1}(\mathbf{J}_{inner}^{T}\mathbf{J}_{inner})\geq(\mathbf{b}_{1}\otimes\mathbf{b}_{2})^{T}\mathbf{J}_{inner}^{T}\mathbf{J}_{inner}(\mathbf{b}_{1}\otimes\mathbf{b}_{2})\geq$
	$\displaystyle\left[\min_{j\in[n]}\mathbf{b}_{1}^{T}\mathbf{Y}_{j}\mathbf{Y}_{j}^{T}\mathbf{b}_{1}\right]\mathbf{b}_{2}\mathbf{X}^{T}\mathbf{X}\mathbf{b}_{2}=\left[\min_{j\in[n]}\mathbf{b}_{1}^{T}\mathbf{Y}_{j}\mathbf{Y}_{j}^{T}\mathbf{b}_{1}\right]\lambda_{1}(\mathbf{X}\mathbf{X}^{T})$

Now note that $\alpha^{2}=\sup_{\left\lVert\mathbf{b}_{1}\right\rVert=1}\left[\min_{j\in[n]}\mathbf{b}_{1}^{T}\mathbf{Y}_{j}\mathbf{Y}_{j}^{T}\mathbf{b}_{1}\right]$ . Thus by taking the sup over $\mathbf{b}_{1}$ in our previous bound we have

\lambda_{1}(\mathbf{K}_{inner})=\lambda_{1}(\mathbf{J}_{inner}^{T}\mathbf{J}_{inner})\geq\alpha^{2}\lambda_{1}(\mathbf{X}\mathbf{X}^{T}).

Thus combined with our previous result we have

\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}\left(\mathbf{K}_{inner}\right)}\leq\frac{\max_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}}{\alpha^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.

We now prove the lower bound.

	$\displaystyle Tr(\mathbf{K}_{inner})=\sum_{i=1}^{n}\left\lVert\frac{\partial f(\mathbf{x}_{i})}{\partial\mathbf{\theta}_{inner}}\right\rVert_{2}^{2}=\sum_{i=1}^{n}\left\lVert\mathbf{Y}_{i}\otimes\mathbf{x}_{i}\right\rVert_{2}^{2}=\sum_{i=1}^{n}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}\left\lVert\mathbf{x}_{i}\right\rVert_{2}^{2}$
	$\displaystyle\geq\min_{j\in[n]}\left\lVert\mathbf{Y}_{j}\right\rVert_{2}^{2}\sum_{i=1}^{n}\left\lVert\mathbf{x}_{i}\right\rVert_{2}^{2}=\min_{j\in[n]}\left\lVert\mathbf{Y}_{j}\right\rVert_{2}^{2}Tr(\mathbf{X}\mathbf{X}^{T})$

Let $\mathbf{Y}\in\mathbb{R}^{n\times m}$ be the matrix whose $i$ th row is equal to $\mathbf{Y}_{i}$ . Then observe that

\mathbf{K}_{inner}=[\mathbf{Y}\mathbf{Y}^{T}]\odot[\mathbf{X}\mathbf{X}^{T}]

where $\odot$ denotes the entry-wise Hadamard product of two matrices. We now recall that if $\mathbf{A}$ and $\mathbf{B}$ are two positive semidefinite matrices we have (Oymak & Soltanolkotabi, 2020, Lemma 2)

\lambda_{1}(\mathbf{A}\odot\mathbf{B})\leq\max_{i\in[n]}\mathbf{A}_{i,i}\lambda_{1}(\mathbf{B}).

Applying this to $\mathbf{K}_{inner}$ we get that

\lambda_{1}(\mathbf{K}_{inner})\leq\max_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}\lambda_{1}(\mathbf{X}\mathbf{X}^{T})

Combining this with our previous result we get

\frac{\min_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}Tr(\mathbf{X}\mathbf{X}^{T})}{\max_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}(\mathbf{K}_{inner})}

∎

We can immediately get a useful corollary that applies to the ReLU activation function

Corollary C.2.

Set $\alpha:=\sup_{\left\lVert\mathbf{b}\right\rVert=1}\left[\min_{j\in[n]}|\langle\mathbf{Y}_{j},\mathbf{b}\rangle|\right]$ and $\gamma_{max}:=\sup_{x\in\mathbb{R}}|\phi^{\prime}(x)|$ . Assume $\alpha>0$ and $\gamma_{max}<\infty$ . Then

\frac{\alpha^{2}}{\gamma_{max}^{2}\left\lVert\mathbf{a}\right\rVert_{2}^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}\left(\mathbf{K}_{inner}\right)}\leq\frac{\gamma_{max}^{2}\left\lVert\mathbf{a}\right\rVert_{2}^{2}}{\alpha^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

Proof.

Note that the hypothesis on $|\phi^{\prime}|$ gives $\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}\leq\gamma_{max}^{2}\left\lVert\mathbf{a}\right\rVert_{2}^{2}$ for all $i\in[n]$ . Moreover by Cauchy-Schwarz we have that $\min_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}\geq\alpha$ . Thus by theorem C.1 we get the desired result. ∎

If $\phi$ is a leaky ReLU type activation (say like those used in Nguyen & Mondelli (2020)) Theorem C.1 translates into an even simpler bound

Corollary C.3.

Suppose $\phi^{\prime}(x)\in[\gamma_{min},\gamma_{max}]$ for all $x\in\mathbb{R}$ where $\gamma_{min}>0$ . Then

\frac{\gamma_{min}^{2}Tr(\mathbf{X}\mathbf{X}^{T})}{\gamma_{max}^{2}\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}\left(\mathbf{K}_{inner}\right)}\leq\frac{\gamma_{max}^{2}}{\gamma_{min}^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

Proof.

We will lower bound

\alpha:=\sup_{\left\lVert\mathbf{b}\right\rVert=1}\left[\min_{j\in[n]}|\langle\mathbf{Y}_{j},\mathbf{b}\rangle|\right]

so that we can apply Corollary C.2. Set $\mathbf{b}=\mathbf{a}/\left\lVert\mathbf{a}\right\rVert_{2}$ . Then we have that

\langle\mathbf{Y}_{j},\mathbf{b}\rangle=\sum_{\ell=1}^{m}a_{\ell}\phi^{\prime}(\langle\mathbf{w}_{\ell},\mathbf{x}_{j}\rangle)a_{\ell}/\left\lVert\mathbf{a}\right\rVert_{2}\geq\frac{\gamma_{min}}{\left\lVert\mathbf{a}\right\rVert_{2}}\sum_{\ell=1}^{m}a_{\ell}^{2}=\gamma_{min}\left\lVert\mathbf{a}\right\rVert_{2}

Thus $\alpha\geq\gamma_{min}\left\lVert\mathbf{a}\right\rVert_{2}$ . The result then follows from Corollary C.2 ∎

To control $\alpha$ in Theorem C.1 when $\phi$ is the ReLU activation function requires a bit more work. To this end we introduce the following lemma.

Lemma C.4.

Assume $\phi(x)=ReLU(x)$ . Let $R_{min},R_{max}>0$ and define $\tau=\{\ell\in[m]:|a_{\ell}|\in[R_{min},R_{max}]\}$ . Set $T=\min_{i\in[n]}\sum_{\ell\in\tau}\mathbb{I}\left[\langle\mathbf{x}_{i},\mathbf{w}_{\ell}\rangle\geq 0\right]$ . Then

\alpha:=\sup_{\left\lVert\mathbf{b}\right\rVert=1}\left[\min_{i\in[n]}|\langle\mathbf{Y}_{i},\mathbf{b}\rangle|\right]\geq\frac{R_{min}^{2}}{R_{max}}\frac{T}{|\tau|^{1/2}}

Proof.

Let $\mathbf{a}_{\tau}$ be the vector such that $(\mathbf{a}_{\tau})_{\ell}=a_{\ell}\mathbb{I}[\ell\in\tau]$ . Then note that

	$\displaystyle\langle\mathbf{Y}_{j},\mathbf{a}_{\tau}/\left\lVert\mathbf{a}_{\tau}\right\rVert_{2}\rangle=\frac{1}{\left\lVert\mathbf{a}_{\tau}\right\rVert}\sum_{\ell\in\tau}a_{\ell}^{2}\mathbb{I}[\langle\mathbf{w}_{\ell},\mathbf{x}_{j}\rangle\geq 0]\geq$
	$\displaystyle\frac{R_{min}^{2}}{\left\lVert\mathbf{a}_{\tau}\right\rVert}\sum_{\ell\in\tau}\mathbb{I}[\langle\mathbf{w}_{\ell},\mathbf{x}_{j}\rangle\geq 0]\geq\frac{R_{min}^{2}}{\left\lVert\mathbf{a}_{\tau}\right\rVert_{2}}T\geq\frac{R_{min}^{2}}{R_{max}\|\tau\|^{1/2}}T.$

∎

Roughly what Lemma C.4 says is that $\alpha$ is controlled when there is a set of inner-layer neurons that are active for each data point whose outer layer weights are similar in magnitude. Note that in Du et al. (2019b), Arora et al. (2019a), Oymak et al. (2019), Li et al. (2020), Xie et al. (2017) and Oymak & Soltanolkotabi (2020) the outer layer weights all have fixed constant magnitude. Thus in that case we can set $R_{min}=R_{max}$ in Lemma C.4 so that $\tau=[m]$ . In this setting we have the following result.

Theorem C.5.

Assume $\phi(x)=ReLU(x)$ . Suppose $|a_{\ell}|=R>0$ for all $\ell\in[m]$ . Furthermore suppose $\mathbf{w}_{1},\ldots,\mathbf{w}_{m}$ are independent random vectors such that $\mathbf{w}_{\ell}/\left\lVert\mathbf{w}_{\ell}\right\rVert$ has the uniform distribution on the sphere for each $\ell\in[m]$ . Also assume $m\geq\frac{4\log(n/\epsilon)}{\delta^{2}}$ for some $\delta,\epsilon\in(0,1)$ . Then with probability at least $1-\epsilon$ we have that

\frac{(1-\delta)^{2}}{4}\mathrm{eff}(\mathbf{X}\mathbf{X}^{T})\leq\mathrm{eff}(\mathbf{K}_{inner})\leq\frac{4}{(1-\delta)^{2}}\mathrm{eff}(\mathbf{X}\mathbf{X}^{T}).

Proof.

Fix $j\in[n]$ . Note by the assumption on the $\mathbf{w}_{\ell}$ ’s we have that $\mathbb{I}[\langle\mathbf{w}_{1},\mathbf{x}_{j}\rangle\geq 0],\ldots,\mathbb{I}[\langle\mathbf{w}_{m},\mathbf{x}_{j}\rangle\geq 0]$ are i.i.d. Bernouilli random variables taking the values $0$ and $1$ with probability $1/2$ . Thus by the Chernoff bound for Binomial random variables we have that

\mathbb{P}\left(\sum_{\ell=1}^{m}\mathbb{I}[\langle\mathbf{w}_{\ell},\mathbf{x}_{j}\rangle\geq 0]\leq\frac{m}{2}(1-\delta)\right)\leq\exp\left(-\delta^{2}\frac{m}{4}\right).

Thus taking the union bound over every $j\in[n]$ we get that if $m\geq\frac{4\log(n/\epsilon)}{\delta^{2}}$ then

\min_{j\in[n]}\sum_{\ell=1}^{m}\mathbb{I}[\langle\mathbf{w}_{\ell},\mathbf{x}_{j}\rangle\geq 0]\geq\frac{m}{2}(1-\delta)

holds with probability at least $1-\epsilon$ . Now note that if we set $R_{min}=R_{max}=R$ we have that $\tau=[m]$ where $\tau$ is defined as it is in Lemma C.4. In this case by our previous bound we have that $T$ as defined in Lemma C.4 satisfies $T\geq\frac{m}{2}(1-\delta)$ with probability at least $1-\epsilon$ . In this case the conclusion of Lemma C.4 gives us

\alpha\geq Rm^{1/2}\frac{(1-\delta)}{2}=\left\lVert\mathbf{a}\right\rVert_{2}\frac{(1-\delta)}{2}.

Thus by Corollary C.2 and the above bound for $\alpha$ we get the desired result. ∎

We will now use Lemma C.4 to prove a bound in the case of Gaussian initialization.

Lemma C.6.

Assume $\phi(x)=ReLU(x)$ . Suppose that $a_{\ell}\sim N(0,\nu^{2})$ for each $\ell\in[m]$ i.i.d. Furthermore suppose $\mathbf{w}_{1},\ldots,\mathbf{w}_{m}$ are random vectors independent of each other and $\mathbf{a}$ such that $\mathbf{w}_{\ell}/\left\lVert\mathbf{w}_{\ell}\right\rVert$ has the uniform distribution on the sphere for each $\ell\in[m]$ . Set $p=\mathbb{P}_{z\sim N(0,1)}\left(|z|\in[1/2,1]\right)\approx 0.3$ . Assume

m\geq\frac{4\log(n/\epsilon)}{\delta^{2}(1-\delta)p}

for some $\epsilon,\delta\in(0,1)$ . Then with probability at least $(1-\epsilon)^{2}$ we have that

\alpha:=\sup_{\left\lVert\mathbf{b}\right\rVert=1}\left[\min_{i\in[n]}|\langle\mathbf{Y}_{i},\mathbf{b}\rangle|\right]\geq\frac{\nu}{8}(1-\delta)^{3/2}p^{1/2}m^{1/2}

Proof.

Set $R_{min}=\nu/2$ and $R_{max}=\nu$ . Now set

	$\displaystyle p=\mathbb{P}_{a\sim N(0,\nu^{2})}\left(\|a\|\in[R_{min},R_{max}]\right)=2\mathbb{P}_{z\sim N(0,1)}\left(z\in\left[\frac{R_{min}}{\nu},\frac{R_{max}}{\nu}\right]\right)$
	$\displaystyle=2\mathbb{P}_{z\sim N(0,1)}\left(z\in\left[1/2,1\right]\right)\approx 0.3.$

Now define $\tau=\{\ell\in[m]:|a_{\ell}|\in[R_{min},R_{max}]\}$ . We have by the Chernoff bound for binomial random variables

\mathbb{P}\left(|\tau|\leq(1-\delta)mp\right)\leq\exp\left(-\delta^{2}\frac{mp}{2}\right).

Thus if $m\geq\log\left(\frac{1}{\epsilon}\right)\frac{2}{p\delta^{2}}$ (a weaker condition than the hypothesis on $m$ ) then we have that $|\tau|\geq(1-\delta)mp$ with probability at least $1-\epsilon$ . From now on assume such a $\tau$ has been observed and view it as fixed so that the only remaining randomness is over the $\mathbf{w}_{\ell}$ ’s. Now set $T=\min_{i\in[n]}\sum_{\ell\in\tau}\mathbb{I}\left[\langle\mathbf{x}_{i},\mathbf{w}_{\ell}\rangle\geq 0\right]$ . By the Chernoff bound again we get that for fixed $i\in[n]$

\mathbb{P}\left(\sum_{\ell\in\tau}\mathbb{I}\left[\langle\mathbf{x}_{i},\mathbf{w}_{\ell}\rangle\geq 0\right]\leq\frac{(1-\delta)}{2}|\tau|\right)\leq\exp\left(-\delta^{2}\frac{|\tau|}{4}\right).

Thus by taking the union bound over $i\in[n]$ we get

	$\displaystyle\mathbb{P}\left(T\leq\frac{(1-\delta)}{2}\|\tau\|\right)\leq n\exp\left(-\delta^{2}\frac{\|\tau\|}{4}\right)$
	$\displaystyle\leq n\exp\left(-\delta^{2}\frac{(1-\delta)mp}{4}\right)$

Thus if we consider $\tau$ as fixed and $m\geq\frac{4\log(n/\epsilon)}{\delta^{2}(1-\delta)p}$ then with probability at least $1-\epsilon$ over the sampling of the $\mathbf{w}_{\ell}$ ’s we have that

T\geq\frac{(1-\delta)}{2}|\tau|

In this case by lemma C.4 we have that

	$\displaystyle\alpha:=\sup_{\left\lVert\mathbf{b}\right\rVert=1}\left[\min_{i\in[n]}\|\langle\mathbf{Y}_{i},\mathbf{b}\rangle\|\right]\geq\frac{R_{min}^{2}}{R_{max}}\frac{T}{\|\tau\|^{1/2}}$
	$\displaystyle\geq\frac{\nu}{8}(1-\delta)^{3/2}m^{1/2}p^{1/2}.$

Thus the above holds with probability at least $(1-\epsilon)^{2}$ . ∎

This lemma now allows us to bound the effective rank of $\mathbf{K}_{inner}$ in the case of Gaussian initialization.

Theorem C.7.

m\geq\frac{4\log(n/\epsilon)}{\delta^{2}(1-\delta)p}

then with probability at least $1-3\epsilon$ we have that

\frac{1}{C}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}\left(\mathbf{K}_{inner}\right)}\leq C\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

where

C=\frac{64}{(1-\delta)^{3}p}\left[1+\frac{\max\{c^{-1}K\log(1/\epsilon),mK\}}{m}\right].

Proof.

By Bernstein’s inequality

\mathbb{P}\left(\left\lVert\mathbf{a}/\nu\right\rVert_{2}^{2}-m\geq t\right)\leq\exp\left[-c\cdot\min\left(\frac{t^{2}}{mK^{2}},\frac{t}{K}\right)\right]

where $c$ is an absolute constant. Set $t=\max\{c^{-1}K\log(1/\epsilon),mK\}$ so that the right hand side of the above inequality is bounded by $\epsilon$ . Thus by Lemma C.6 and the union bound we can ensure that with probability at least

1-\epsilon-[1-(1-\epsilon)^{2}]=1-3\epsilon+\epsilon^{2}\geq 1-3\epsilon

that $\left\lVert\mathbf{a}/\nu\right\rVert_{2}^{2}\leq m+t$ and the conclusion of Lemma C.6 hold simultaneously. In that case

\displaystyle\frac{\left\lVert\mathbf{a}\right\rVert_{2}^{2}}{\alpha^{2}}\leq\frac{\nu^{2}[m+t]}{\frac{\nu^{2}}{64}(1-\delta)^{3}mp}=\frac{64}{(1-\delta)^{3}p}\left[1+\frac{t}{m}\right]=C.

Thus by Corollary C.2 we get the desired result. ∎

By fixing $\delta>0$ in the previous theorem we get the immediate corollary

Corollary C.8.

Assume $\phi(x)=ReLU(x)$ . Suppose that $a_{\ell}\sim N(0,\nu^{2})$ for each $\ell\in[m]$ i.i.d. Furthermore suppose $\mathbf{w}_{1},\ldots,\mathbf{w}_{m}$ are random vectors independent of each other and $\mathbf{a}$ such that $\mathbf{w}_{\ell}/\left\lVert\mathbf{w}_{\ell}\right\rVert$ has the uniform distribution on the sphere for each $\ell\in[m]$ . Then there exists an absolute constant $C>0$ such that $m=\Omega(\log(n/\epsilon))$ ensures that with probability at least $1-\epsilon$

\frac{1}{C}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}\left(\mathbf{K}_{inner}\right)}\leq C\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

C.2.4 Effective rank of outer-layer NTK

Throughout this section $\phi(x)=ReLU(x)$ . Our goal of this section, similar to before, is to bound the effective rank of $\mathbf{K}_{outer}$ by the effective rank of the input data gram $\mathbf{X}\mathbf{X}^{T}$ . In this section we will use often make use of the basic identities

\left\lVert\mathbf{A}\mathbf{B}\right\rVert_{F}\leq\left\lVert\mathbf{A}\right\rVert_{2}\left\lVert\mathbf{B}\right\rVert_{F}

\left\lVert\mathbf{A}\mathbf{B}\right\rVert_{F}\leq\left\lVert\mathbf{A}\right\rVert_{F}\left\lVert\mathbf{B}\right\rVert_{2}

Tr(\mathbf{A}\mathbf{A}^{T})=Tr(\mathbf{A}^{T}\mathbf{A})=\left\lVert\mathbf{A}\right\rVert_{F}^{2}

\left\lVert\mathbf{A}\right\rVert_{2}=\left\lVert\mathbf{A}^{T}\right\rVert_{2}

\lambda_{1}(\mathbf{A}^{T}\mathbf{A})=\lambda_{1}(\mathbf{A}\mathbf{A}^{T})=\left\lVert\mathbf{A}\right\rVert_{2}^{2}.

To begin bounding the effective rank of $\mathbf{K}_{outer}$ , we prove the following lemma.

Lemma C.9.

Assume $\phi(x)=ReLU(x)$ and $\mathbf{W}$ is full rank with $m\geq d$ . Then

\frac{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{\left[\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}+\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}\right]^{2}}\leq\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

Proof.

First note that

\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}\leq\left\lVert\mathbf{W}\mathbf{X}^{T}\right\rVert_{F}^{2}\leq\left\lVert\mathbf{W}\right\rVert_{2}^{2}\left\lVert\mathbf{X}^{T}\right\rVert_{F}^{2}=\left\lVert\mathbf{W}\right\rVert_{2}^{2}Tr(\mathbf{X}\mathbf{X}^{T}).

Pick $\mathbf{b}\in\mathbb{R}^{d}$ such that $\left\lVert\mathbf{b}\right\rVert_{2}=1$ and $\left\lVert\mathbf{X}\mathbf{b}\right\rVert_{2}=\left\lVert\mathbf{X}\right\rVert_{2}$ . Since $\mathbf{W}^{T}$ is full rank we may set $\mathbf{u}=(\mathbf{W}^{T})^{\dagger}\mathbf{b}$ so that $\mathbf{W}^{T}\mathbf{u}=\mathbf{b}$ where $\left\lVert\mathbf{u}\right\rVert_{2}\leq\sigma_{min}(\mathbf{W}^{T})^{-1}$ where $\sigma_{min}(\mathbf{W}^{T})$ is the smallest nonzero singular value of $\mathbf{W}^{T}$ . Well then

	$\displaystyle\left\lVert\mathbf{X}\right\rVert_{2}=\left\lVert\mathbf{X}\mathbf{b}\right\rVert_{2}=\left\lVert\mathbf{X}\mathbf{W}^{T}\mathbf{u}\right\rVert_{2}\leq\left\lVert\mathbf{X}\mathbf{W}^{T}\right\rVert_{2}\left\lVert\mathbf{u}\right\rVert_{2}\leq\left\lVert\mathbf{X}\mathbf{W}^{T}\right\rVert_{2}\sigma_{min}(\mathbf{W}^{T})^{-1}$
	$\displaystyle=\left\lVert\mathbf{W}\mathbf{X}^{T}\right\rVert_{2}\sigma_{min}(\mathbf{W})^{-1}$

Now using the fact that $x=\phi(x)-\phi(-x)$ we have that

\left\lVert\mathbf{W}\mathbf{X}^{T}\right\rVert_{2}=\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})-\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}\leq\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}+\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}

Thus combined with our previous results gives

\left\lVert\mathbf{X}\right\rVert_{2}\leq\sigma_{min}(\mathbf{W})^{-1}\left[\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}+\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}\right]

Therefore

	$\displaystyle\frac{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{\sigma_{min}(\mathbf{W})^{-2}\left[\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}+\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}\right]^{2}}\leq\frac{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{\left\lVert\mathbf{X}\right\rVert_{2}^{2}}$
	$\displaystyle\leq\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}Tr(\mathbf{X}\mathbf{X}^{T})}{\left\lVert\mathbf{X}\right\rVert_{2}^{2}}=\left\lVert\mathbf{W}\right\rVert_{2}^{2}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}$

which gives us the desired result. ∎

Corollary C.10.

Assume $\phi(x)=ReLU(x)$ and $\mathbf{W}$ is full rank with $m\geq d$ . Then

\frac{\max\left(\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2},\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}\right)}{\max\left(\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2},\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}\right)}\leq 4\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.

Proof.

Using the fact that

\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}+\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}\leq 2\max\left(\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2},\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}\right)

and lemma C.9 we have that

\frac{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{4\max\left(\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2},\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}\right)}\leq\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

Note that the right hand side and the denominator of the left hand side do not change when you replace $\mathbf{W}$ with $-\mathbf{W}$ . Therefore by using the above bound for both $\mathbf{W}$ and $-\mathbf{W}$ as the weight matrix separately we can conclude

\frac{\max\left(\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2},\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}\right)}{4\max\left(\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2},\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}\right)}\leq\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.

∎

Corollary C.11.

Assume $\phi(x)=ReLU(x)$ and $m\geq d$ . Suppose $\mathbf{W}$ and $-\mathbf{W}$ have the same distribution. Then conditioned on $\mathbf{W}$ being full rank we have that with probability at least $1/2$

\frac{Tr(\mathbf{K}_{outer})}{\lambda_{1}(\mathbf{K}_{outer})}\leq 4\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.

Proof.

Fix $\mathbf{W}$ where $\mathbf{W}$ is full rank. We have by corollary C.10 that either

\frac{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}}\leq 4\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.

holds or

\frac{\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}}\leq 4\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

(the first holds in the case where $\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}\geq\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}$ and the second in the case $\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}<\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}$ ). Since $\mathbf{W}$ and $-\mathbf{W}$ have the same distribution, it follows that the first inequality must hold at least $1/2$ of the time. From

\frac{Tr(\mathbf{K}_{outer})}{\lambda_{1}(\mathbf{K}_{outer})}=\frac{\left\lVert\mathbf{J}_{outer}^{T}\right\rVert_{F}^{2}}{\left\lVert\mathbf{J}_{outer}^{T}\right\rVert_{2}^{2}}=\frac{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}}

we get the desired result. ∎

We now note that when $\mathbf{W}$ is rectangular shaped and the entries of $\mathbf{W}$ are i.i.d. Gaussians that $\mathbf{W}$ is full rank with high probability and $\sigma_{min}(\mathbf{W})^{-2}\left\lVert\mathbf{W}\right\rVert_{2}^{2}$ is well behaved. We recall the result from Vershynin (2012)

Theorem C.12.

Let $\mathbf{A}$ be a $N\times n$ matrix whose entries are independent standard normal random variables. Then for every $t\geq 0$ , with probability at least $1-2\exp(-t^{2}/2)$ one has

\sqrt{N}-\sqrt{n}-t\leq\sigma_{min}(\mathbf{A})\leq\sigma_{1}(\mathbf{A})\leq\sqrt{N}+\sqrt{n}+t

Corollary C.11 gives us a bound that works at least half the time. However, we would like to derive a bound that holds with high probability. We will have that when $m\gtrsim n$ we have sufficient concentration of the largest singular value of $\phi(\mathbf{W}\mathbf{X}^{T})$ to prove such a bound. We recall the result from Vershynin (2012) (Remark 5.40)

Theorem C.13.

Assume that $\mathbf{A}$ is an $N\times n$ matrix whose rows $\mathbf{A}_{i}$ are independent sub-gaussian random vectors in $\mathbb{R}^{n}$ with second moment matrix $\Sigma$ . Then for every $t\geq 0$ , the following inequality holds with probability at least $1-2\exp(-ct^{2})$

\left\lVert\frac{1}{N}\mathbf{A}^{*}\mathbf{A}-\Sigma\right\rVert_{2}\leq\max(\delta,\delta^{2})\quad\text{where}\quad\delta=C\sqrt{\frac{n}{N}}+\frac{t}{\sqrt{N}}

where $C=C_{K},c=c_{K}>0$ depend only on $K:=\max_{i}\left\lVert\mathbf{A}_{i}\right\rVert_{\psi_{2}}$ .

We will use theorem C.13 in the following lemma.

Lemma C.14.

Assume $\phi(x)=ReLU(x)$ . Let $\mathbf{A}=\phi(\mathbf{W}\mathbf{X}^{T})$ and $M=\max_{i\in[n]}\left\lVert\mathbf{x}_{i}\right\rVert_{2}$ . Suppose that $\mathbf{w}_{1},\ldots,\mathbf{w}_{m}\sim N(0,\nu^{2}I_{d})$ i.i.d. Set $K=M\nu\sqrt{n}$ and define

\Sigma:=\mathbb{E}_{\mathbf{w}\sim N(0,\nu^{2}I)}[\phi(\mathbf{X}\mathbf{w})\phi(\mathbf{w}^{T}\mathbf{X}^{T})]

Then for every $t\geq 0$ the following inequality holds with probability at least $1-2\exp(-c_{K}t^{2})$

\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}-\Sigma\right\rVert_{2}\leq\max(\delta,\delta^{2})\quad\text{where}\quad\delta=C_{K}\sqrt{\frac{n}{m}}+\frac{t}{\sqrt{m}},

where $c_{K},C_{K}>0$ are absolute constants that depend only on $K$ .

Proof.

We will let $\mathbf{A}_{\ell\colon}$ denote the $\ell$ th row of $\mathbf{A}$ (considered as a column vector). Note that

\mathbf{A}_{\ell\colon}=\phi(\mathbf{X}\mathbf{w}_{\ell}).

We immediately get that the rows of $\mathbf{A}$ are i.i.d. We will now bound $\left\lVert\mathbf{A}_{\ell\colon}\right\rVert_{\psi_{2}}$ . Let $\mathbf{b}\in\mathbb{R}^{n}$ such that $\left\lVert\mathbf{b}\right\rVert_{2}=1$ . Then

	$\displaystyle\left\lVert\langle\phi(\mathbf{X}\mathbf{w}_{\ell}),\mathbf{b}\rangle\right\rVert_{\psi_{2}}=\left\lVert\sum_{i=1}^{n}\phi(\langle\mathbf{x}_{i},\mathbf{w}_{\ell}\rangle)b_{i}\right\rVert_{\psi_{2}}$
	$\displaystyle\leq\sum_{i=1}^{n}\|b_{i}\|\left\lVert\phi(\langle\mathbf{x}_{i},\mathbf{w}_{\ell}\rangle)\right\rVert_{\psi_{2}}\leq\sum_{i=1}^{n}\|b_{i}\|\left\lVert\langle\mathbf{x}_{i},\mathbf{w}_{\ell}\rangle\right\rVert_{\psi_{2}}$
	$\displaystyle\leq\sum_{i=1}^{n}\|b_{i}\|C\left\lVert\mathbf{x}_{i}\right\rVert_{2}\nu\leq CM\nu\left\lVert\mathbf{b}\right\rVert_{1}\leq CM\nu\sqrt{n}$

where $C>0$ is an absolute constant. Set $K:=M\nu\sqrt{n}$ . Well then by theorem C.13 we have the following. For every $t\geq 0$ the following inequality holds with probability at least $1-2\exp(-c_{K}t^{2})$

\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}-\Sigma\right\rVert_{2}\leq\max(\delta,\delta^{2})\quad\text{where}\quad\delta=C_{K}\sqrt{\frac{n}{m}}+\frac{t}{\sqrt{m}}

∎

We are now ready to prove a high probability bound for the effective rank of $\mathbf{K}_{outer}$ .

Theorem C.15.

Assume $\phi(x)=ReLU(x)$ and $m\geq d$ . Let $M=\max_{i\in[n]}\left\lVert\mathbf{x}_{i}\right\rVert_{2}$ . Suppose that $\mathbf{w}_{1},\ldots,\mathbf{w}_{m}\sim N(0,\nu^{2}I_{d})$ i.i.d. Set $K=M\nu\sqrt{n}$

\Sigma:=\mathbb{E}_{\mathbf{w}\sim N(0,\nu^{2}I)}[\phi(\mathbf{X}\mathbf{w})\phi(\mathbf{w}^{T}\mathbf{X}^{T})]

\delta=C_{K}\left[\sqrt{\frac{n}{m}}+\sqrt{\frac{\log(2/\epsilon)}{m}}\right]

where $\epsilon>0$ is small. Now assume

\sqrt{m}>\sqrt{d}+\sqrt{2\log(2/\epsilon)}

and

\max(\delta,\delta^{2})\leq\frac{1}{2}\lambda_{1}(\Sigma)

Then with probability at least $1-3\epsilon$

\frac{Tr(\mathbf{K}_{outer})}{\lambda_{1}(\mathbf{K}_{outer})}\leq 12\left(\frac{\sqrt{m}+\sqrt{d}+t_{1}}{\sqrt{m}-\sqrt{d}-t_{1}}\right)^{2}\frac{Tr(\mathbf{X}^{T}\mathbf{X})}{\lambda_{1}(\mathbf{X}^{T}\mathbf{X})}

Proof.

By theorem C.12 with $t_{1}=\sqrt{2\log(2/\epsilon)}$ we have that with probability at least $1-\epsilon$ that

\sqrt{m}-\sqrt{d}-t_{1}\leq\sigma_{min}(\mathbf{W}/\nu)\leq\sigma_{1}(\mathbf{W}/\nu)\leq\sqrt{m}+\sqrt{d}+t_{1}

(58)

The above inequalities and the hypothesis on $m$ imply that $\mathbf{W}$ is full rank.

Let $\mathbf{A}=\phi(\mathbf{W}\mathbf{X}^{T})$ and $\tilde{\mathbf{A}}=\phi(-\mathbf{W}\mathbf{X}^{T})$ . Set $t_{2}=\sqrt{\frac{\log(2/\epsilon)}{c_{K}}}$ where $c_{K}$ is defined as in theorem C.14. Note that $\mathbf{A}$ and $\tilde{\mathbf{A}}$ are identical in distribution. Thus by theorem C.14 and the union bound we get that with probability at least $1-2\epsilon$

\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}-\Sigma\right\rVert_{2},\left\lVert\frac{1}{m}\tilde{\mathbf{A}}^{T}\tilde{\mathbf{A}}-\Sigma\right\rVert_{2}\leq\max(\delta,\delta^{2})=:\rho

(59)

where

\delta=C_{K}\sqrt{\frac{n}{m}}+\frac{t_{2}}{\sqrt{m}}.

By our previous results and the union bound we can ensure with probability at least $1-3\epsilon$ that the bounds (58) and (59) all hold simultaneously. In this case we have

	$\displaystyle\left\lVert\frac{1}{m}\tilde{\mathbf{A}}^{T}\tilde{\mathbf{A}}\right\rVert_{2}\leq\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}\right\rVert_{2}+2\rho$
	$\displaystyle=\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}\right\rVert_{2}\left[1+\frac{2\rho}{\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}\right\rVert_{2}}\right]\leq\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}\right\rVert_{2}\left[1+\frac{2\rho}{\lambda_{1}(\Sigma)-\rho}\right]$

Assuming $\rho\leq\lambda_{1}(\Sigma)/2$ we have by the above bound

\left\lVert\frac{1}{m}\tilde{\mathbf{A}}^{T}\tilde{\mathbf{A}}\right\rVert_{2}\leq 3\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}\right\rVert_{2}.

Now note that

\left\lVert\mathbf{A}^{T}\mathbf{A}\right\rVert_{2}=\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}\quad\left\lVert\tilde{\mathbf{A}}^{T}\tilde{\mathbf{A}}\right\rVert_{2}=\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}

so that our previous bound implies

\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}\leq 3\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}

then we have by corollary C.10 that

	$\displaystyle\frac{Tr(\mathbf{K}_{outer})}{\lambda_{1}(\mathbf{K}_{outer})}=\frac{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}}\leq 12\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}$
	$\displaystyle\leq 12\left(\frac{\sqrt{m}+\sqrt{d}+t_{1}}{\sqrt{m}-\sqrt{d}-t_{1}}\right)^{2}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.$

∎

From the above theorem we get the following corollary.

Corollary C.16.

Assume $\phi(x)=ReLU(x)$ and $n\geq d$ . Suppose that $\mathbf{w}_{1},\ldots,\mathbf{w}_{m}\sim N(0,\nu^{2}I_{d})$ i.i.d. Fix $\epsilon>0$ small. Set $M=\max_{i\in[n]}\left\lVert\mathbf{x}_{i}\right\rVert_{2}$ . Then

m=\Omega\left(\max(\lambda_{1}(\Sigma)^{-2},1)\max(n,\log(1/\epsilon))\right)

and

\nu=O(1/M\sqrt{m})

suffices to ensure that with probability at least $1-\epsilon$

\frac{Tr(\mathbf{K}_{outer})}{\lambda_{1}(\mathbf{K}_{outer})}\leq C\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

where $C>0$ is an absolute constant.

C.2.5 Bound for the combined NTK

Based on the results in the previous two sections, we can now bound the effective rank of the combined NTK gram matrix $\mathbf{K}=\mathbf{K}_{inner}+\mathbf{K}_{outer}$ . See 4.5

Proof.

This follows from the union bound and Corollaries C.8 and C.16. ∎

C.2.6 Magnitude of the spectrum

By our results in sections C.2.3 and C.2.4 we have that $m\gtrsim n$ suffices to ensure that

\frac{Tr(\mathbf{K})}{\lambda_{1}(\mathbf{K})}\lesssim\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq d

Well note that

\displaystyle i\frac{\lambda_{i}(\mathbf{K})}{\lambda_{1}(\mathbf{K})}\leq\frac{Tr(\mathbf{K})}{\lambda_{1}(\mathbf{K})}\lesssim d

If $i\gg d$ then $\lambda_{i}(\mathbf{K})/\lambda_{1}(\mathbf{K})$ is small. Thus the NTK only has $O(d)$ large eigenvalues. The smallest eigenvalue $\lambda_{n}(\mathbf{K})$ of the NTK has been of interest in proving convergence guarantees (Du et al., 2019a, b; Oymak & Soltanolkotabi, 2020). By our previous inequality

\frac{\lambda_{n}(\mathbf{K})}{\lambda_{1}(\mathbf{K})}\lesssim\frac{d}{n}

Thus in the setting where $m\gtrsim n\gg d$ we have that the smallest eigenvalue will be driven to zero relative to the largest eigenvalue. Alternatively we can view the above inequality as a lower bound on the condition number

\frac{\lambda_{1}(\mathbf{K})}{\lambda_{n}(\mathbf{K})}\gtrsim\frac{n}{d}

We will first bound the analytical NTK in the setting when the outer layer weights have fixed constant magnitude. This is the setting considered by Xie et al. (2017), Arora et al. (2019a), Du et al. (2019b), Oymak et al. (2019), Li et al. (2020), and Oymak & Soltanolkotabi (2020).

Theorem C.17.

Let $\phi(x)=ReLU(x)$ and assume $\mathbf{X}\neq 0$ . Let $\mathbf{K}_{inner}^{\infty}\in\mathbb{R}^{n\times n}$ be the analytical NTK, i.e.

(\mathbf{K}_{inner}^{\infty})_{i,j}:=\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle\mathbb{E}_{\mathbf{w}\sim N(0,I_{d})}\left[\phi^{\prime}(\langle\mathbf{x}_{i},\mathbf{w}\rangle)\phi^{\prime}(\langle\mathbf{x}_{j},\mathbf{w}\rangle)\right].

Then

\frac{1}{4}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner}^{\infty})}{\lambda_{1}\left(\mathbf{K}_{inner}^{\infty}\right)}\leq 4\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.

Proof.

We consider the setting where $|a_{\ell}|=1/\sqrt{m}$ for all $\ell\in[m]$ and $\mathbf{w}_{\ell}\sim N(0,I_{d})$ i.i.d.. As was shown by Jacot et al. (2018), Du et al. (2019b) in this setting we have that if we fix the training data $\mathbf{X}$ and send $m\rightarrow\infty$ we have that

\left\lVert\mathbf{K}_{inner}-\mathbf{K}_{inner}^{\infty}\right\rVert_{2}\rightarrow 0

in probability. Therefore by continuity of the effective rank we have that

\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}(\mathbf{K}_{inner})}\rightarrow\frac{Tr(\mathbf{K}_{inner}^{\infty})}{\lambda_{1}(\mathbf{K}_{inner}^{\infty})}

in probability. Let $\eta>0$ . Then there exists an $M\in\mathbb{N}$ such that $m\geq M$ implies that

\left\lvert\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}(\mathbf{K}_{inner})}-\frac{Tr(\mathbf{K}_{inner}^{\infty})}{\lambda_{1}(\mathbf{K}_{inner}^{\infty})}\right\rvert\leq\eta

(60)

with probability greater than $1/2$ . Now fix $\delta\in(0,1)$ . On the other hand by Theorem C.5 with $\epsilon=1/4$ we have that if $m\geq\frac{4}{\delta^{2}}\log(4n)$ then with probability at least $3/4$ that

\frac{(1-\delta)^{2}}{4}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}\left(\mathbf{K}_{inner}\right)}\leq\frac{4}{(1-\delta)^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.

(61)

Thus if we set $m=\max(\frac{4}{\delta^{2}}\log(4n),M)$ we have with probability at least $3/4-1/2=1/4$ that (60) and (61) hold simultaneously. In this case we have that

\frac{(1-\delta)^{2}}{4}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}-\eta\leq\frac{Tr(\mathbf{K}_{inner}^{\infty})}{\lambda_{1}\left(\mathbf{K}_{inner}^{\infty}\right)}\leq\frac{4}{(1-\delta)^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}+\eta

Note that the above argument runs through for any $\eta>0$ and $\delta\in(0,1)$ . Thus we may send $\eta\rightarrow 0^{+}$ and $\delta\rightarrow 0^{+}$ in the above inequality to get

\frac{1}{4}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner}^{\infty})}{\lambda_{1}\left(\mathbf{K}_{inner}^{\infty}\right)}\leq 4\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

∎

We thus have the following corollary about the conditioning of the analytical NTK.

Corollary C.18.

Let $\phi(x)=ReLU(x)$ and assume $\mathbf{X}\neq 0$ . Let $\mathbf{K}_{inner}^{\infty}\in\mathbb{R}^{n\times n}$ be the analytical NTK, i.e.

(\mathbf{K}_{inner}^{\infty})_{i,j}:=\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle\mathbb{E}_{\mathbf{w}\sim N(0,I_{d})}\left[\phi^{\prime}(\langle\mathbf{x}_{i},\mathbf{w}\rangle)\phi^{\prime}(\langle\mathbf{x}_{j},\mathbf{w}\rangle)\right].

Then

\frac{\lambda_{n}(\mathbf{K}_{inner}^{\infty})}{\lambda_{1}(\mathbf{K}_{inner}^{\infty})}\leq 4\frac{d}{n}.

C.3 Experimental validation of results on the NTK spectrum

We experimentally test the theory developed in Section 4.1 and its implications by analyzing the spectrum of the NTK for both fully connected neural network architectures (FCNNs), the results of which are displayed in Figure 1, and also convolutional neural network architectures (CNNs), shown in Figure 3. For the feedforward architectures we consider networks of depth 2 and 5 with the width of all layers being set at 500. With regard to the activation function we test linear, ReLU and Tanh, and in terms of initialization we use Kaiming uniform (He et al., 2015), which is very common in practice and is the default in PyTorch (Paszke et al., 2019). For the convolutional architectures we again consider depths 2 and 5, with each layer consisting of 100 channels with the filter size set to 5x5. In terms of data, we consider 40x40 patches from both real world images, generated by applying Pytorch’s RandomResizedCrop transform to a random batch of Caltech101 images (Li et al., 2022), as well as synthetic images corresponding to isotropic Gaussian vectors. The batch sized is fixed at 200 and we plot only the first 100 normalized eigenvalues. Each experiment was repeated 10 times. Finally, to compute the NTK we use the functorch⁴⁴4https://pytorch.org/functorch/stable/notebooks/neural_tangent_kernels.html module in PyTorch using an algorithmic approach inspired by Novak et al. (2022).

The results for convolutional neural networks show the same trends as observed in feedforward neural networks, which we discussed in Section 4.1. In particular, we again observe the dominant outlier eigenvalue, which increases with both depth and the size of the Gaussian mean of the activation. We also again see that the NTK spectrum inherits its structure from the data, i.e., is skewed for skewed data or relatively flat for isotropic Gaussian data. Finally, we also see that the spectrum for Tanh is closer to the spectrum for the linear activation when compared with the ReLU spectrum. In terms of differences between the CNN and FCNN experiments, we observe that the spread of the 95% confidence interval is slightly larger for convolutional nets, implying a slightly larger variance between trials. We remark that this is likely attributable to the fact that there are only 100 channels in each layer and by increasing this quantity we would expect the variance to reduce. In summary, despite the fact that our analysis is concerned with FCNNs, it appears that the broad implications and trends also hold for CNNs. We leave a thorough study of the NTK spectrum for CNNs and other network architectures to future work.

To test our theory in Section 4.2, we numerically plot the spectrum of NTK of two-layer feedforward networks with ReLU, Tanh, and Gaussian activations in Figure 4. The input data are uniformly drawn from $\mathbb{S}^{2}$ . Notice that when $d=2$ , $k=\Theta(\ell^{1/2})$ . Then Corollary 4.7 shows that for the ReLU activation $\lambda_{\ell}=\Theta(\ell^{-3/2})$ , for the Tanh activation $\lambda_{\ell}=O\left(\ell^{-3/4}\exp(-\frac{\pi}{2}\ell^{1/4})\right)$ , and for the Gaussian activation $\lambda_{\ell}=O(\ell^{-1/2}2^{-\ell^{1/2}})$ . These theoretical decay rates for the NTK spectrum are verified by the experimental results in Figure 4.

C.4 Analysis of the lower spectrum: uniform data

See 4.6

Proof.

Let $\theta(t)=\sum_{p=0}^{\infty}c_{p}t^{p}$ , then $K(x_{1},x_{2})=\theta(\langle x_{1},x_{2}\rangle)$ According to Funk Hecke theorem (Basri et al., 2019, Section 4.2), we have

\displaystyle\overline{\lambda_{k}}=\mathrm{Vol}(\mathbb{S}^{d-1})\int_{-1}^{1}\theta(t)P_{k,d}(t)(1-t^{2})^{\frac{d-2}{2}}\mathrm{d}t,

(62)

where $\mathrm{Vol}(\mathbb{S}^{d-1})=\frac{2\pi^{d/2}}{\Gamma(d/2)}$ is the volume of the hypersphere $\mathbb{S}^{d-1}$ , and $P_{k,d}(t)$ is the Gegenbauer polynomial, given by

P_{k,d}(t)=\frac{(-1)^{k}}{2^{k}}\frac{\Gamma(d/2)}{\Gamma(k+d/2)}\frac{1}{(1-t^{2})^{(d-2)/2}}\frac{d^{k}}{dt^{k}}(1-t^{2})^{k+(d-2)/2},

and $\Gamma$ is the gamma function.

From (62) we have

$\displaystyle\overline{\lambda_{k}}$	$\displaystyle=\mathrm{Vol}(\mathbb{S}^{d-1})\int_{-1}^{1}\theta(t)P_{k,d}(t)(1-t^{2})^{\frac{d-2}{2}}\mathrm{d}t$
	$\displaystyle=\frac{2\pi^{d/2}}{\Gamma(d/2)}\int_{-1}^{1}\theta(t)\frac{(-1)^{k}}{2^{k}}\frac{\Gamma(d/2)}{\Gamma(k+d/2)}\frac{d^{k}}{dt^{k}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t$
	$\displaystyle=\frac{2\pi^{d/2}}{\Gamma(d/2)}\frac{(-1)^{k}}{2^{k}}\frac{\Gamma(d/2)}{\Gamma(k+d/2)}\sum_{p=0}^{\infty}c_{p}\int_{-1}^{1}t^{p}\frac{d^{k}}{dt^{k}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t.$	(63)

Using integration by parts, we have

	$\displaystyle\int_{-1}^{1}t^{p}\frac{d^{k}}{dt^{k}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t$
	$\displaystyle=\left.t^{p}\frac{d^{k-1}}{dt^{k-1}}(1-t^{2})^{k+(d-2)/2}\right\|_{-1}^{1}-p\int_{-1}^{1}t^{p-1}\frac{d^{k-1}}{dt^{k-1}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t$
	$\displaystyle=-p\int_{-1}^{1}t^{p-1}\frac{d^{k-1}}{dt^{k-1}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t,$		(64)

where the last line in (64) holds because $\frac{d^{k-1}}{dt^{k-1}}(1-t^{2})^{k+(d-2)/2}=0$ when $t=1$ or $t=-1$ .

When $p<k$ , repeat the above procedure (64) $p$ times, we get

$\displaystyle\int_{-1}^{1}t^{p}\frac{d^{k}}{dt^{k}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t$	$\displaystyle=(-1)^{p}p!\int_{-1}^{1}\frac{d^{k-p}}{dt^{k-p}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t$
	$\displaystyle=(-1)^{p}p!\left.\frac{d^{k-p-1}}{dt^{k-p-1}}(1-t^{2})^{k+(d-2)/2}\right\|_{-1}^{1}$
	$\displaystyle=0.$	(65)

When $p\geq k$ , repeat the above procedure (64) $k$ times, we get

\displaystyle\int_{-1}^{1}t^{p}\frac{d^{k}}{dt^{k}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t

\displaystyle=(-1)^{k}p(p-1)\cdots(p-k+1)\int_{-1}^{1}t^{p-k}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t.

(66)

When $p-k$ is odd, $t^{p-k}(1-t^{2})^{k+(d-2)/2}$ is an odd function, then

\int_{-1}^{1}t^{p-k}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t=0.

(67)

When $p-k$ is even,

$\displaystyle\int_{-1}^{1}t^{p-k}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t$	$\displaystyle=2\int_{0}^{1}t^{p-k}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t$
	$\displaystyle=\int_{0}^{1}(t^{2})^{(p-k-1)/2}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t^{2}$
	$\displaystyle=B\left(\frac{p-k+1}{2},k+d/2\right)$
	$\displaystyle=\frac{\Gamma(\frac{p-k+1}{2})\Gamma(k+d/2)}{\Gamma(\frac{p-k+1}{2}+k+d/2)},$	(68)

where $B$ is the beta function.

Plugging (68) , (65) and (67) into (66), we get

	$\displaystyle\int_{-1}^{1}t^{p}\frac{d^{k}}{dt^{k}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t$
	$\displaystyle=\begin{cases}(-1)^{k}p(p-1)\ldots(p-k+1)\frac{\Gamma(\frac{p-k+1}{2})\Gamma(k+d/2)}{\Gamma(\frac{p-k+1}{2}+k+d/2)},&p-k\text{ is even and }p\geq k,\\ 0,&\text{otherwise}.\end{cases}$		(69)

Plugging (69) into (63), we get

	$\displaystyle\overline{\lambda_{k}}$	$\displaystyle=\frac{2\pi^{d/2}}{\Gamma(d/2)}\frac{(-1)^{k}}{2^{k}}\frac{\Gamma(d/2)}{\Gamma(k+d/2)}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}c_{p}(-1)^{k}p(p-1)\ldots(p-k+1)\frac{\Gamma(\frac{p-k+1}{2})\Gamma(k+d/2)}{\Gamma(\frac{p-k+1}{2}+k+d/2)}$
		$\displaystyle=\frac{\pi^{d/2}}{2^{k-1}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}c_{p}\frac{p(p-1)\ldots(p-k+1)\Gamma(\frac{p-k+1}{2})}{\Gamma(\frac{p-k+1}{2}+k+d/2)}$
		$\displaystyle=\frac{\pi^{d/2}}{2^{k-1}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}c_{p}\frac{\Gamma(p+1)\Gamma(\frac{p-k+1}{2})}{\Gamma(p-k+1)\Gamma(\frac{p-k+1}{2}+k+d/2)}.$

∎

See 4.7

Proof of Corollary C.4, part 1.

We first prove $\overline{\lambda_{k}}=O(k^{-d-2a+2})$ . Suppose that $c_{p}\leq Cp^{-a}$ for some constant $C$ , then according to Theorem 4.6 we have

\displaystyle\overline{\lambda_{k}}

\displaystyle\leq\frac{\pi^{d/2}}{2^{k-1}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}Cp^{-a}\frac{\Gamma(p+1)\Gamma(\frac{p-k+1}{2})}{\Gamma(p-k+1)\Gamma(\frac{p-k+1}{2}+k+d/2)}.

According to Stirling’s formula, we have

\Gamma(z)=\sqrt{\frac{2\pi}{z}}\left(\frac{z}{e}\right)^{z}\left(1+O\left(\frac{1}{z}\right)\right).

(70)

Then for any $z\geq\frac{1}{2}$ , we can find constants $C_{1}$ and $C_{2}$ such that

C_{1}\sqrt{\frac{2\pi}{z}}\left(\frac{z}{e}\right)^{z}\leq\Gamma(z)\leq C_{2}\sqrt{\frac{2\pi}{z}}\left(\frac{z}{e}\right)^{z}.

(71)

Then

$\displaystyle\overline{\lambda_{k}}$	$\displaystyle\leq\frac{\pi^{d/2}}{2^{k-1}}\frac{C_{2}^{2}}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}Cp^{-a}\frac{\sqrt{\frac{2\pi}{p+1}}\left(\frac{p+1}{e}\right)^{p+1}\sqrt{\frac{2\pi}{\frac{p-k+1}{2}}}\left(\frac{\frac{p-k+1}{2}}{e}\right)^{\frac{p-k+1}{2}}}{\sqrt{\frac{2\pi}{{p-k+1}}}\left(\frac{p-k+1}{e}\right)^{p-k+1}\sqrt{\frac{2\pi}{\frac{p-k+1}{2}+k+d/2}}\left(\frac{\frac{p-k+1}{2}+k+d/2}{e}\right)^{\frac{p-k+1}{2}+k+d/2}}$
	$\displaystyle=\frac{\pi^{d/2}}{2^{k-1}}\frac{C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}p^{-a}\frac{e^{\frac{d}{2}}\sqrt{\frac{2}{p+1}}\left({p+1}\right)^{p+1}\left({\frac{p-k+1}{2}}\right)^{\frac{p-k+1}{2}}}{\left({p-k+1}\right)^{p-k+1}\sqrt{\frac{1}{\frac{p-k+1}{2}+k+d/2}}\left({\frac{p-k+1}{2}+k+d/2}\right)^{\frac{p-k+1}{2}+k+d/2}}$
	$\displaystyle=\frac{\pi^{d/2}}{2^{k-1}}\frac{C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}p^{-a}\frac{e^{\frac{d}{2}}2^{\frac{-p+k}{2}}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({\frac{p-k+1}{2}+k+d/2}\right)^{\frac{p-k}{2}+k+d/2}}$
	$\displaystyle=2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}\frac{p^{-a}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}.$	(72)

We define

f_{a}(p)=\frac{p^{-a}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}.

(73)

By applying the chain rule to $e^{\log f_{a}(p)}$ , we have that the derivative of $f_{a}$ is

f_{a}^{\prime}(p)=\frac{(p+1)^{p+\frac{1}{2}}p^{-a}}{2(p-k+1)^{\frac{p-k+1}{2}}(p+k+d+1)^{\frac{p+k+d}{2}}}\\ \cdot\left(-\frac{2a}{p}-\frac{k+d}{(p+1)(p+k+d+1)}+\log(1+\frac{k^{2}-d(p-k+1)}{(p-k+1)(p+k+d+1)})\right).

(74)

Let $g_{a}(p)=-\frac{2a}{p}-\frac{k+d}{(p+1)(p+k+d+1)}+\log(1+\frac{k^{2}-d(p-k+1)}{(p-k+1)(p+k+d+1)})$ . Then $g_{a}(p)$ and $f_{a}^{\prime}(p)$ have the same sign. Next we will show that $g_{a}(p)\geq 0$ for $k\leq p\leq\frac{k^{2}}{d+24a}$ when $k$ is large enough.

First when $p\geq k$ and $\frac{k^{2}-d(p-k+1)}{(p-k+1)(p+k+d+1)}\geq 1$ , we have

\displaystyle g_{a}(p)\geq-\frac{2a}{k}-\frac{k+d}{(k+1)(k+k+d+1)}+\log(2)\geq 0,

(75)

when $k$ is sufficiently large.

Second when $p\geq k$ and $0\leq\frac{k^{2}-d(p-k+1)}{(p-k+1)(p+k+d+1)}\leq 1$ , since $\log(1+x)\geq\frac{x}{2}$ for $0\leq x\leq 1$ , we have

	$\displaystyle g_{a}(p)$	$\displaystyle\geq-\frac{2a}{p}-\frac{k+d}{(p+1)(p+k+d+1)}+\frac{k^{2}-d(p-k+1)}{2(p-k+1)(p+k+d+1)}$
		$\displaystyle\geq-\frac{2a}{p}-\frac{k+d}{(p+1)(p+k+d+1)}+\frac{k^{2}-dp}{2p(p+k+d+1)}.$

When $p\leq\frac{k^{2}}{d+24a}$ , we have $k^{2}-dp\geq 24ap$ . Then

\frac{k^{2}-dp}{4p(p+k+d+1)}\geq\frac{24ap}{4p(p+k+d+1)}\geq\frac{6ap}{(p+1)(p+k+d+1)}\geq\frac{k+d}{(p+1)(p+k+d+1)}

when $k$ is sufficiently large. Also we have

\frac{k^{2}-dp}{4r(p+k+d+1)}\geq\frac{24ap}{4r(p+k+d+1)}\geq\frac{6a}{p+k+d+1}\geq\frac{2a}{p}

when $k$ is sufficiently large.

Combining all the arguments above, we conclude that $g_{a}(p)\geq 0$ and $f_{a}^{\prime}(p)\geq 0$ when $k\leq p\leq\frac{k^{2}}{d+24a}$ . Then when $k\leq p\leq\frac{k^{2}}{d+24a}$ , we have

f_{a}(p)\leq f_{a}\left(\frac{k^{2}}{d+24a}\right).

(76)

When $p\geq\frac{k^{2}}{d+24a}$ , we have

	$\displaystyle f_{a}(p)$	$\displaystyle=\frac{p^{-a}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}$
		$\displaystyle=\frac{p^{-a}\left({p+1}\right)^{p+\frac{1}{2}}}{\left((p+1)^{2}-k^{2}+d(p-k+1)\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{2k+d-1}{2}}}$
		$\displaystyle=\frac{p^{-a}\left({p+1}\right)^{-\frac{d}{2}}}{\left(1-\frac{k^{2}-d(p-k+1)}{(p+1)^{2}}\right)^{\frac{p-k+1}{2}}\left({1+\frac{k+d}{p+1}}\right)^{\frac{2k+d-1}{2}}}$
		$\displaystyle\leq\frac{p^{-a-\frac{d}{2}}}{\left(1-\frac{k^{2}-d(p-k+1)}{(p+1)^{2}}\right)^{\frac{p-k+1}{2}}}.$

If $k^{2}-d(p-k+1)<0$ , $\left(1-\frac{k^{2}-d(p-k+1)}{(p+1)^{2}}\right)^{\frac{p-k+1}{2}}\geq 1$ . If $k^{2}-d(p-k+1)\geq 0$ , i.e., $p\leq\frac{k^{2}+dk-d}{d}$ , for sufficiently large $k$ , we have

	$\displaystyle\left(1-\frac{k^{2}-d(p-k+1)}{(p+1)^{2}}\right)^{\frac{p-k+1}{2}}$	$\displaystyle\geq\left(1-\frac{k^{2}-d(\frac{k^{2}}{d+24a}-k+1)}{(\frac{k^{2}}{d+24a}+1)^{2}}\right)^{\frac{\frac{k^{2}+dk-d}{d}-k+1}{2}}$
		$\displaystyle\geq\left(1-\frac{48a(d+24a)}{k^{2}}\right)^{\frac{k^{2}}{2d}}$
		$\displaystyle\geq e^{-\frac{k^{2}}{2d}\frac{48a(d+24a)}{k^{2}}}=e^{-\frac{48a(d+24a)}{2d}},$

which is a constant independent of $k$ . Then for $p\geq\frac{k^{2}}{d+24a}$ , we have

f_{a}(p)\leq e^{\frac{48a(d+24a)}{2d}}p^{-a-\frac{d}{2}}.

(77)

Finally we have

	$\displaystyle\overline{\lambda_{k}}$	$\displaystyle=2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}f_{a}(p)$
		$\displaystyle\leq O\left(\sum_{\begin{subarray}{c}k\leq p\leq\frac{k^{2}}{d+24a}\\ p-k\text{ is even}\end{subarray}}f_{a}(p)+\sum_{\begin{subarray}{c}p\geq\frac{k^{2}}{d+24a}\\ p-k\text{ is even}\end{subarray}}f_{a}(p)\right)$
		$\displaystyle\leq O\left(\left(\frac{k^{2}}{d+24a}-k+1\right)f_{a}\left(\frac{k^{2}}{d+24a}\right)+\sum_{\begin{subarray}{c}p\geq\frac{k^{2}}{d+24a}\\ p-k\text{ is even}\end{subarray}}e^{\frac{48a(d+24a)}{2d}}p^{-a-\frac{d}{2}}\right)$
		$\displaystyle\leq O\left(\left(\frac{k^{2}}{d+24a}-k+1\right)e^{\frac{48a(d+24a)}{2d}}\left(\frac{k^{2}}{d+24a}\right)^{-a-\frac{d}{2}}+e^{\frac{48a(d+24a)}{2d}}\frac{1}{{a+\frac{d}{2}-1}}\left(\frac{k^{2}}{d+24a}-1\right)^{1-a-\frac{d}{2}}\right)$
		$\displaystyle=O(k^{-d-2a+2}).$

Next we prove $\overline{\lambda_{k}}=\Omega(k^{-d-2a+2})$ . Since $c_{p}$ are nonnegative and $c_{p}=\Theta(p^{-a})$ , we have that $c_{p}\geq C^{\prime}p^{-a}$ for some constant $C^{\prime}$ . Then we have

\displaystyle\overline{\lambda_{k}}

\displaystyle\geq\frac{\pi^{d/2}}{2^{k-1}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}C^{\prime}p^{-a}\frac{\Gamma(p+1)\Gamma(\frac{p-k+1}{2})}{\Gamma(p-k+1)\Gamma(\frac{p-k+1}{2}+k+d/2)}.

(78)

According to Stirling’s formula (70) and (71), using the similar argument as (72) we have

$\displaystyle\overline{\lambda_{k}}$	$\displaystyle\geq\frac{\pi^{d/2}}{2^{k-1}}\frac{C_{1}^{2}}{C_{2}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}C^{\prime}p^{-a}\frac{\sqrt{\frac{2\pi}{p+1}}\left(\frac{p+1}{e}\right)^{p+1}\sqrt{\frac{2\pi}{\frac{p-k+1}{2}}}\left(\frac{\frac{p-k+1}{2}}{e}\right)^{\frac{p-k+1}{2}}}{\sqrt{\frac{2\pi}{{p-k+1}}}\left(\frac{p-k+1}{e}\right)^{p-k+1}\sqrt{\frac{2\pi}{\frac{p-k+1}{2}+k+d/2}}\left(\frac{\frac{p-k+1}{2}+k+d/2}{e}\right)^{\frac{p-k+1}{2}+k+d/2}}$	(79)
	$\displaystyle=2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{1}^{2}C^{\prime}}{C_{2}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}\frac{p^{-a}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}$	(80)
	$\displaystyle\geq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{1}^{2}C^{\prime}}{C_{2}^{2}}\sum_{\begin{subarray}{c}p\geq k^{2}\\ p-k\text{ is even}\end{subarray}}f_{a}(p),$	(81)

where $f_{a}(p)$ is defined in (73). When $p\geq k^{2}$ , we have

	$\displaystyle f_{a}(p)$	$\displaystyle=\frac{p^{-a}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}$
		$\displaystyle=\frac{p^{-a}\left({p+1}\right)^{p+\frac{1}{2}}}{\left((p+1)^{2}-k^{2}+d(p-k+1)\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{2k+d-1}{2}}}$
		$\displaystyle\geq\frac{\left({p+1}\right)^{-a-\frac{d}{2}}}{\left(1-\frac{k^{2}-d(p-k+1)}{(p+1)^{2}}\right)^{\frac{p-k+1}{2}}\left({1+\frac{k+d}{p+1}}\right)^{\frac{2k+d-1}{2}}}$

For sufficiently large $k$ , $k^{2}-d(p-k+1)<0$ . Then we have

	$\displaystyle\left(1-\frac{k^{2}-d(p-k+1)}{(p+1)^{2}}\right)^{\frac{p-k+1}{2}}$	$\displaystyle=\left(1-\frac{k^{2}-d(p-k+1)}{(p+1)^{2}}\right)^{\frac{-(p+1)^{2}}{k^{2}-d(p-k+1)}\cdot\frac{-k^{2}+d(p-k+1)}{(p+1)^{2}}\cdot\frac{p-k+1}{2}}$
		$\displaystyle\leq e^{\frac{-k^{2}+d(p-k+1)}{(p+1)^{2}}\cdot\frac{p-k+1}{2}}$
		$\displaystyle\leq e^{\frac{dp^{2}}{2p^{2}}}=e^{\frac{d}{2}}$

which is a constant independent of $k$ . Also, for sufficiently large $k$ , we have

	$\displaystyle\left({1+\frac{k+d}{p+1}}\right)^{\frac{2k+d-1}{2}}$	$\displaystyle=\left({1+\frac{k+d}{p+1}}\right)^{\frac{p+1}{k+d}\frac{k+d}{p+1}\frac{2k+d-1}{2}}$
		$\displaystyle\leq e^{\frac{k+d}{p+1}\frac{2k+d-1}{2}}$
		$\displaystyle\leq e^{\frac{3k^{2}}{2r}}=e^{\frac{3}{2}}$

Then for $p\geq k^{2}$ , we have $f_{a}(p)\geq e^{-\frac{d}{2}-\frac{3}{2}}(p+1)^{-a-\frac{d}{2}}$ .

Finally we have

$\displaystyle\overline{\lambda_{k}}$	$\displaystyle\geq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{1}^{2}C^{\prime}}{C_{2}^{2}}\sum_{\begin{subarray}{c}p\geq k^{2}\\ p-k\text{ is even}\end{subarray}}f_{a}(p)$	(82)
	$\displaystyle\geq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{1}^{2}C^{\prime}}{C_{2}^{2}}\sum_{\begin{subarray}{c}p\geq k^{2}\\ p-k\text{ is even}\end{subarray}}e^{-\frac{d}{2}-\frac{3}{2}}(p+1)^{-a-\frac{d}{2}}$	(83)
	$\displaystyle\geq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{1}^{2}C^{\prime}}{C_{2}^{2}}e^{-\frac{d}{2}-\frac{3}{2}}\frac{1}{2(a+\frac{d}{2}-1)}(k^{2}+2)^{1-a-\frac{d}{2}}$	(84)
	$\displaystyle=\Omega(k^{-d-2a+2}).$	(85)

Overall, we have $\overline{\lambda_{k}}=\Theta(k^{-d-2a+2})$ . ∎

Proof of Corollary C.4, part 2.

It is easy to verify that $\overline{\lambda_{k}}=0$ when $k$ is even because $c_{p}=0$ when $p\geq k$ and $p-k$ is even. When $k$ is odd, the proof of Theorem 4.6 still applies. ∎

Proof of Corollary C.4, part 3.

Since $c_{p}=\mathcal{O}\left(\exp\left(-a\sqrt{p}\right)\right)$ , we have that $c_{p}\leq Ce^{-a\sqrt{p}}$ for some constant $C$ . Similar to (72), we have

\displaystyle\overline{\lambda_{k}}

\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}\frac{e^{-a\sqrt{p}}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}.

(86)

Use the definition in (73) and let $a=0$ , we have

f_{0}(p)=\frac{\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}.

(87)

Then according to (76) and (77), for sufficiently large $k$ , we have $f_{0}(p)\leq f_{0}\left(\frac{k^{2}}{d}\right)$ when $k\leq p\leq\frac{k^{2}}{d}$ and $f_{0}(p)\leq C_{3}p^{-\frac{d}{2}}$ for some constant $C_{3}$ when $p\geq\frac{k^{2}}{d}$ . Then when $k\leq p\leq\frac{k^{2}}{d}$ , we have $f_{0}(p)\leq f_{0}\left(\frac{k^{2}}{d}\right)\leq C_{3}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}}$ . When $p\geq\frac{k^{2}}{d}$ , we have $f_{0}(p)\leq C_{3}p^{-\frac{d}{2}}\leq C_{3}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}}$ . Overall, for all $p\geq k$ , we have

f_{0}(p)\leq C_{3}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}}.

(88)

Then we have

$\displaystyle\overline{\lambda_{k}}$	$\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}e^{-a\sqrt{p}}f_{0}(p)$	(89)
	$\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C_{3}C}{C_{1}^{2}}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}e^{-a\sqrt{p}}$	(90)
	$\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C_{3}C}{C_{1}^{2}}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}}\frac{2e^{-a\sqrt{k-1}}(a\sqrt{k-1}+1)}{a^{2}}$	(91)
	$\displaystyle=\mathcal{O}\left(k^{-d+1/2}\exp\left(-a\sqrt{k}\right)\right)$	(92)

∎

Proof of Corollary C.4, part 4.

Since $c_{p}=\Theta(p^{1/2}a^{-p})$ , we have that $c_{p}\leq Cp^{1/2}a^{-p}$ for some constant $C$ . Similar to (72), we have

\displaystyle\overline{\lambda_{k}}

\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}\frac{p^{1/2}a^{-p}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}.

(93)

Use the definition in (73) and let $a=0$ , we have

f_{0}(p)=\frac{\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}.

(94)

Then according to (76) and (77), for sufficiently large $k$ , we have $f_{0}(p)\leq f_{0}\left(\frac{k^{2}}{d}\right)$ when $k\leq p\leq\frac{k^{2}}{d}$ and $f_{0}(p)\leq C_{3}p^{-\frac{d}{2}}$ for some constant $C_{3}$ when $p\geq\frac{k^{2}}{d}$ . Then when $k\leq p\leq\frac{k^{2}}{d}$ , we have $p^{1/2}f_{0}(p)\leq p^{1/2}f_{0}\left(\frac{k^{2}}{d}\right)\leq C_{3}\left(\frac{k^{2}}{d}\right)^{1/2}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}}$ . When $p\geq\frac{k^{2}}{d}$ , we have $p^{1/2}f_{0}(p)\leq C_{3}p^{1/2}p^{-\frac{d}{2}}\leq C_{3}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}+\frac{1}{2}}$ . Overall, for all $p\geq k$ , we have

p^{1/2}f_{0}(p)\leq C_{3}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}+\frac{1}{2}}.

(95)

Then we have

$\displaystyle\overline{\lambda_{k}}$	$\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}p^{1/2}a^{-p}f_{0}(p)$	(96)
	$\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C_{3}C}{C_{1}^{2}}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}+\frac{1}{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}a^{-p}$	(97)
	$\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C_{3}C}{C_{1}^{2}}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}+\frac{1}{2}}\frac{1}{\log a}a^{-(k-1)}$	(98)
	$\displaystyle=\mathcal{O}\left(k^{-d+1}a^{-k}\right).$	(99)

On the other hand, since $c_{p}=\Theta(p^{1/2}a^{-p})$ , we have that $c_{p}\geq C^{\prime}p^{1/2}a^{-p}$ for some constant $C^{\prime}$ . Similar to (80), we have

$\displaystyle\overline{\lambda_{k}}$	$\displaystyle\geq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{1}^{2}C^{\prime}}{C_{2}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}\frac{p^{1/2}a^{-p}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}$	(100)
	$\displaystyle\geq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{1}^{2}C^{\prime}}{C_{2}^{2}}\frac{k^{1/2}a^{-k}\left({k+1}\right)^{k+\frac{1}{2}}}{\left({k-k+1}\right)^{\frac{k-k+1}{2}}\left({k+k+1+d}\right)^{\frac{k+k+d}{2}}}$	(101)
	$\displaystyle=\Omega\left(\frac{k^{-d/2+1}a^{-k}\left({k+1}\right)^{k}}{\left({k+k+1+d}\right)^{k}}\right).$	(102)

Since $\left({k+1}\right)^{k}=k^{k}(1+1/k)^{k}=\Theta(k^{k})$ . Similarly, $\left({k+k+1+d}\right)^{k}=\Theta((2k)^{k})$ . Then we have

$\displaystyle\overline{\lambda_{k}}$	$\displaystyle=\Omega\left(\frac{k^{-d/2+1}a^{-k}\left({k+1}\right)^{k}}{\left({k+k+1+d}\right)^{k}}\right)$	(103)
	$\displaystyle=\Omega\left(\frac{k^{-d/2+1}a^{-k}k^{k}}{(2k)^{k}}\right)$	(104)
	$\displaystyle=\Omega\left(k^{-d/2+1}2^{-k}a^{-k}\right).$	(105)

∎

For the NTK of a two-layer ReLU network with $\gamma_{b}>0$ , then according to Lemma 3.2 we have $c_{p}=\kappa_{p,2}=\Theta(p^{-3/2})$ . Therefore using Corollary 4.7 $\overline{\lambda_{k}}=\Theta(k^{-d-1})$ . Notice here that $k$ refers to the frequency, and the number of spherical harmonics of frequency at most $k$ is $\Theta(k^{d})$ . Therefore, for the $\ell$ th largest eigenvalue $\lambda_{\ell}$ we have $\lambda_{\ell}=\Theta(\ell^{-(d+1)/d})$ . This rate agrees with Basri et al. (2019) and Velikanov & Yarotsky (2021). For the NTK of a two-layer ReLU network with $\gamma_{b}=0$ , the eigenvalues corresponding to the even frequencies are $0$ , which also agrees with Basri et al. (2019). Corollary 4.7 also shows the decay rates of eigenvalues for the NTK of two-layer networks with Tanh activation and Gaussian activation. We observe that when the coefficients of the kernel power series decay quickly then the eigenvalues of the kernel also decay quickly. As a faster decay of the eigenvalues of the kernel implies a smaller RKHS, Corollary 4.7 demonstrates that using ReLU results in a larger RKHS relative to using either Tanh or Gaussian activations. We numerically illustrate Corollary 4.7 in Figure 4, Appendix C.3.

C.5 Analysis of the lower spectrum: non-uniform data

The purpose of this section is to prove a formal version of Theorem 4.8. In order to prove this result we first need the following lemma.

Lemma C.19.

Let the coefficients $(c_{j})_{j=0}^{\infty}$ with $c_{j}\in\mathbb{R}_{\geq 0}$ for all $j\in\mathbb{Z}_{\geq 0}$ be such that the series $\sum_{j=0}^{\infty}c_{j}\rho^{j}$ converges for all $\rho\in[-1,1]$ . Given a data matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$ with $\left\lVert\mathbf{x}_{i}\right\rVert=1$ for all $i\in[n]$ , define $r\vcentcolon=\operatorname{rank}(\mathbf{X})\geq 2$ and the gram matrix $\mathbf{G}\vcentcolon=\mathbf{X}\mathbf{X}^{T}$ . Consider the kernel matrix

n\mathbf{K}=\sum_{j=0}^{\infty}c_{j}\mathbf{G}^{\odot j}.

For arbitrary $m\in\mathbb{Z}_{\geq 1}$ , let the eigenvalue index $k$ satisfy $n\geq k>\operatorname{rank}\left(\mathbf{H}_{m}\right)$ , where $\mathbf{H}_{m}\vcentcolon=\sum_{j=0}^{m-1}c_{j}\mathbf{G}^{\odot j}$ . Then

\lambda_{k}(\mathbf{K})\leq\frac{\left\lVert\mathbf{G}^{\odot m}\right\rVert}{n}\sum_{j=m}^{\infty}c_{j}.

(106)

Proof.

We start our analysis by considering $\lambda_{k}(n\mathbf{K})$ for some arbitrary $k\in\mathbb{N}_{\leq n}$ . Let

	$\displaystyle\mathbf{H}_{m}$	$\displaystyle\vcentcolon=\sum_{j=0}^{m-1}c_{j}\mathbf{G}^{\odot j},$
	$\displaystyle\mathbf{T}_{m}$	$\displaystyle\vcentcolon=\sum_{j=m}^{\infty}c_{j}\mathbf{G}^{\odot j}$

be the $m$ -head and $m$ -tail of the Hermite expansion of $n\mathbf{K}$ : clearly $n\mathbf{K}=\mathbf{H}_{m}+\mathbf{T}_{m}$ for any $m\in\mathbb{N}$ . Recall that a constant matrix is symmetric and positive semi-definite, furthermore, by the Schur product theorem, the Hadamard product of two positive semi-definite matrices is positive semi-definite. As a result, $\mathbf{G}^{\odot j}$ is symmetric and positive semi-definite for all $j\in\mathbb{Z}_{\geq 0}$ and therefore $\mathbf{H}_{m}$ and $\mathbf{T}_{m}$ are also symmetric positive semi-definite matrices. From Weyl’s inequality (Weyl, 1912, Satz 1) it follows that

n\lambda_{k}(\mathbf{K})\leq\lambda_{k}(\mathbf{H}_{m})+\lambda_{1}(\mathbf{T}_{m}).

(107)

In order to upper bound $\lambda_{1}(\mathbf{T}_{m})$ , observe, as $\mathbf{T}_{m}$ is square, symmetric and positive semi-definite, that $\lambda_{1}(\mathbf{T}_{m})=\left\lVert\mathbf{T}_{m}\right\rVert$ . Using the non-negativity of the coefficients $(c_{j})_{j=0}^{\infty}$ and the triangle inequality we have

\displaystyle\lambda_{1}(\mathbf{T}_{m})=\left\lVert\sum_{j=m}^{\infty}c_{j}\mathbf{G}^{\odot j}\right\rVert\leq\sum_{j=m}^{\infty}c_{j}\left\lVert\mathbf{G}^{\odot j}\right\rVert

By the assumptions of the lemma $[\mathbf{G}]_{ii}=1$ and therefore $[\mathbf{G}]_{ii}^{j}=1$ for all $j\in\mathbb{Z}_{\geq 0}$ . Furthermore, for any pair of positive semi-definite matrices $\mathbf{A},\mathbf{B}\in\mathbb{R}^{n\times n}$ and $k\in[n]$

\lambda_{1}(\mathbf{A}\odot\mathbf{B})\leq\max_{i\in[n]}[\mathbf{A}]_{ii}\lambda_{1}(\mathbf{B}),

(108)

Schur (1911). Therefore, as $\max_{i\in[n]}[\mathbf{G}]_{ii}=1$ ,

\left\lVert\mathbf{G}^{\odot j}\right\rVert=\lambda_{1}(\mathbf{G}^{\odot j})=\lambda_{1}(\mathbf{G}\odot\mathbf{G}^{\odot(j-1)})\leq\lambda_{1}(\mathbf{G}^{\odot(j-1)})=\left\lVert\mathbf{G}^{\odot(j-1)}\right\rVert

for all $j\in\mathbb{N}$ . As a result

\displaystyle\lambda_{1}(\mathbf{T}_{m})\leq\left\lVert\mathbf{G}^{\odot m}\right\rVert\sum_{j=m}^{\infty}c_{j}.

Finally, we now turn our attention to the analysis of $\lambda_{k}(\mathbf{H}_{m})$ . Upper bounding a small eigenvalue is typically challenging, however, the problem simplifies when and $k$ exceeds the rank of $\mathbf{H}_{m}$ , as is assumed here, as this trivially implies $\lambda_{k}(\mathbf{H}_{m})=0$ . Therefore, for $k>\operatorname{rank}(\mathbf{H}_{m})$

\lambda_{k}(\mathbf{K})\leq\frac{\left\lVert\mathbf{G}^{m}\right\rVert}{n}\sum_{j=m}^{\infty}c_{j}

as claimed. ∎

In order to use Lemma C.19 we require an upper bound on the rank of $\mathbf{H}_{m}$ . To this end we provide Lemma C.20.

Lemma C.20.

Let $\mathbf{G}\in\mathbb{R}^{n\times n}$ be a symmetric, positive semi-definite matrix of rank $2\leq r\leq d$ . Define $\mathbf{H}_{m}\in\mathbb{R}^{n\times n}$ as

\mathbf{H}_{m}=\sum_{j=0}^{m-1}c_{j}\mathbf{G}^{\odot j}

(109)

where $(c_{j})_{j=0}^{m-1}$ is a sequence of real coefficients. Then

	$\displaystyle\text{rank}\left(\mathbf{H}_{m}\right)\leq$	$\displaystyle 1+\min\{r-1,m-1\}(2e)^{r-1}$		(110)
		$\displaystyle+\max\{0,m-r\}\left(\frac{2e}{r-1}\right)^{r-1}(m-1)^{r-1}.$		(110)

Proof.

As $\mathbf{G}$ is a symmetric and positive semi-definite matrix, its eigenvalues are real and non-negative and its eigenvectors are orthogonal. Let $\{\mathbf{v}_{i}\}_{i=1}^{r}$ be a set of orthogonal eigenvectors for $\mathbf{G}$ and $\gamma_{i}$ the eigenvalue associated with $\mathbf{v}_{i}\in\mathbb{R}^{n}$ . Then $\mathbf{G}$ may be written as a sum of rank one matrices as follows,

\mathbf{G}=\sum_{i=1}^{r}\gamma_{i}\mathbf{v}_{i}\mathbf{v}_{i}^{T}.

As the Hadamard product is commutative, associative and distributive over addition, for any $j\in\mathbb{Z}_{\geq 0}$ $\mathbf{G}^{\odot j}$ can also be expressed as a sum of rank 1 matrices,

	$\displaystyle\mathbf{G}^{\odot j}$	$\displaystyle=\left(\sum_{i=1}^{r}\gamma_{i}\mathbf{v}_{i}\mathbf{v}_{i}^{T}\right)^{\odot j}$
		$\displaystyle=\left(\sum_{i_{1}=1}^{r}\gamma_{i_{1}}\mathbf{v}_{i_{1}}\mathbf{v}_{i_{1}}^{T}\right)\odot\left(\sum_{i_{2}=1}^{r}\gamma_{i_{2}}\mathbf{v}_{i_{2}}\mathbf{v}_{i_{2}}^{T}\right)\odot\cdots\odot\left(\sum_{i_{j}=1}^{r}\gamma_{i_{j}}\mathbf{v}_{i_{j}}\mathbf{v}_{i_{j}}^{T}\right)$
		$\displaystyle=\sum_{i_{1},i_{2}...i_{j}=1}^{r}\gamma_{i_{1}}\gamma_{i_{2}}\cdots\gamma_{i_{r}}\left(\mathbf{v}_{i_{1}}\mathbf{v}_{i_{1}}^{T}\right)\odot\left(\mathbf{v}_{i_{2}}\mathbf{v}_{i_{2}}^{T}\right)\odot\cdots\odot\left(\mathbf{v}_{i_{j}}\mathbf{v}_{i_{j}}^{T}\right)$
		$\displaystyle=\sum_{i_{1},i_{2},\ldots,i_{j}=1}^{r}\gamma_{i_{1}}\gamma_{i_{2}}\cdots\gamma_{i_{j}}\left(\mathbf{v}_{i_{1}}\odot\mathbf{v}_{i_{2}}\odot\cdots\odot\mathbf{v}_{i_{j}}\right)\left(\mathbf{v}_{i_{1}}\odot\mathbf{v}_{i_{2}}\odot\cdots\odot\mathbf{v}_{i_{j}}\right)^{T}.$

Note the fourth equality in the above follows from $\mathbf{v}_{i}\mathbf{v}_{i}^{T}=\mathbf{v}_{i}\otimes\mathbf{v}_{i}$ and an application of the mixed-product property of the Hadamard product. As matrix rank is sub-additive, the rank of $\mathbf{G}^{\odot j}$ is less than or equal to the number of distinct rank-one matrix summands. This quantity in turn is equal to the number of vectors of the form $\left(\mathbf{v}_{i_{1}}\odot\mathbf{v}_{i_{2}}\odot\cdots\odot\mathbf{v}_{i_{j}}\right)$ , where $i_{1},i_{2},\ldots,i_{j}\in[r]$ . This in turn is equivalent to computing the number of $j$ -combinations with repetition from $r$ objects. Via a stars and bars argument this is equal to $\binom{r+j-1}{j}=\binom{r+j-1}{r(n)-1}$ . It therefore follows that

	$\displaystyle\operatorname{rank}(\mathbf{G}^{\odot j})$	$\displaystyle\leq\binom{r+j-1}{r-1}$
		$\displaystyle\leq\left(\frac{e(r+j-1)}{r-1}\right)^{r-1}$
		$\displaystyle\leq e^{r-1}\left(1+\frac{j}{r-1}\right)^{r-1}$
		$\displaystyle\leq(2e)^{r-1}\left(\delta_{j\leq r-1}+\delta_{j>r-1}\left(\frac{j}{r-1}\right)^{r-1}\right).$

The rank of $\mathbf{H}_{m}$ can therefore be bounded via subadditivity of the rank as

$\displaystyle\operatorname{rank}(\mathbf{H}_{m})=$	$\displaystyle\operatorname{rank}\left(a_{0}\textbf{1}_{n\times n}+\sum_{j=1}^{m-1}c_{j}\mathbf{G}^{\odot j}\right)$	(111)
$\displaystyle\leq$	$\displaystyle 1+\sum_{j=1}^{m-1}\operatorname{rank}\left(\mathbf{G}^{\odot j}\right)$
$\displaystyle\leq$	$\displaystyle 1+\sum_{j=1}^{m-1}(2e)^{r-1}\left(\delta_{j\leq r-1}+\delta_{j>r-1}\left(\frac{j}{r-1}\right)^{r-1}\right)$
$\displaystyle\leq$	$\displaystyle 1+\min\{r-1,m-1\}(2e)^{r-1}$
	$\displaystyle\phantom{1}+\max\{0,m-r\}\left(\frac{2e}{r-1}\right)^{r-1}(m-1)^{r-1}.$

∎

As our goal here is to characterize the small eigenvalues, then as $n$ grows we need both $k$ and therefore $m$ to grow as well. As a result we will therefore be operating in the regime where $m>r$ . To this end we provide the following corollary.

Corollary C.21.

Under the same conditions and setup as Lemma C.20 with $m\geq r\geq 7$ then

\operatorname{rank}(\mathbf{H}_{m})<2m^{r}.

Proof.

If $r\geq 7>2e+1$ then $r-1>2e$ . As a result from Lemma C.20

	$\displaystyle\operatorname{rank}(\mathbf{H}_{m})$	$\displaystyle\leq 1+(r-1)(2e)^{r-1}+(m-r)\left(\frac{2e}{r-1}\right)^{r-1}(m-1)^{r-1}$
		$\displaystyle<r(2e)^{r-1}+(m-1)^{r}$
		$\displaystyle<2m^{r}$

as claimed. ∎

Corollary C.21 implies for any $k\geq 2m^{r}$ , $k\leq n$ that we can apply Lemma C.19 to upper bound the size of the $k$ th eigenvalue. Our goal is to upper bound the decay of the smallest eigenvalue. To this end, and in order to make our bounds as tight as possible, we therefore choose the truncation point $m(n)=\lfloor(n/2)^{1/r}\rfloor$ , note this is the largest truncation which still satisfies $2m(n)^{r}\leq n$ . In order to state the next lemma, we introduce the following pieces of notation: with $\mathcal{L}\vcentcolon=\{\ell:\mathbb{R}_{\geq 0}\rightarrow\mathbb{R}_{\geq 0}\}$ define $U:\mathcal{L}\times\mathbb{Z}_{\geq 1}\rightarrow\mathbb{R}_{\geq 0}$ as

U(\ell,m)=\int_{m-1}^{\infty}\ell(x)dx.

Lemma C.22.

Given a sequence of data points $(\mathbf{x}_{i})_{i\in\mathbb{Z}_{\geq 1}}$ with $\mathbf{x}_{i}\in\mathbb{S}^{d}$ for all $i\in\mathbb{Z}_{\geq 1}$ , construct a sequence of row-wise data matrices $(\mathbf{X}_{n})_{n\in\mathbb{Z}_{\geq 1}}$ , $\mathbf{X}_{n}\in\mathbb{R}^{n\times d}$ , with $\mathbf{x}_{i}$ corresponding to the $i$ th row of $\mathbf{X}_{n}$ . The corresponding sequence of gram matrices we denote $\mathbf{G}_{n}\vcentcolon=\mathbf{X}_{n}\mathbf{X}_{n}^{T}$ . Let $m(n)\vcentcolon=\lfloor(n/2)^{1/r(n)}\rfloor$ where $r(n)\vcentcolon=\operatorname{rank}(\mathbf{X}_{n})$ and suppose for all sufficiently large $n$ that $m(n)\geq r(n)\geq 7$ . Let the coefficients $(c_{j})_{j=0}^{\infty}$ with $c_{j}\in\mathbb{R}_{\geq 0}$ for all $j\in\mathbb{Z}_{\geq 0}$ be such that 1) the series $\sum_{j=0}^{\infty}c_{j}\rho^{j}$ converges for all $\rho\in[-1,1]$ and 2) $(c_{j})_{j=0}^{\infty}=\mathcal{O}(\ell(j))$ , where $\ell\in\mathcal{L}$ satisfies $U(\ell,m(n))<\infty$ for all $n$ and is monotonically decreasing. Consider the sequence of kernel matrices indexed by $n$ and defined as

n\mathbf{K}_{n}=\sum_{j=0}^{\infty}c_{j}\mathbf{G}_{n}^{\odot j}.

With $\nu:\mathbb{Z}_{\geq 1}\rightarrow\mathbb{Z}_{\geq 1}$ suppose $\left\lVert\mathbf{G}_{n}^{\odot m(n)}\right\rVert=\mathcal{O}(n^{-\nu(n)+1})$ , then

\lambda_{n}(\mathbf{K}_{n})=\mathcal{O}(n^{-\nu(n)}U(\ell,m(n))).

(112)

Proof.

By the assumptions of the Lemma we may apply Lemma C.19 and Corollary C.21, which results in

\lambda_{n}(\mathbf{K}_{n})\leq\frac{\left\lVert\mathbf{G}_{n}^{\odot m(n)}\right\rVert}{n}\sum_{j=m(n)}^{\infty}c_{j}=\mathcal{O}(n^{-\nu(n)})\sum_{j=m(n)}^{\infty}c_{j}.

Additionally, as $(c_{j})_{j=0}^{\infty}=\mathcal{O}(\ell(j))$ then

	$\displaystyle\lambda_{n}(\mathbf{K}_{n})$	$\displaystyle=\mathcal{O}\left(n^{-\nu(n)}\sum_{j=m(n)}^{\infty}\ell(j)\right)$
		$\displaystyle=\mathcal{O}\left(n^{-\nu(n)}\int_{m(n)-1}^{\infty}\ell(x)dx\right)$
		$\displaystyle=\mathcal{O}\left(n^{-\nu(n)}U(\ell,m(n))\right)$

as claimed. ∎

Based on Lemma C.19 we provide Theorem C.23, which considers three specific scenarios for the decay of the power series coefficients inspired by Lemma 3.2.

Theorem C.23.

In the same setting, and also under the same assumptions as in Lemma C.22, then

1.

if $c_{p}=\mathcal{O}(p^{-\alpha})$ with $\alpha>r(n)+1$ for all $n\in\mathbb{Z}_{\geq 0}$ then $\lambda_{n}(\mathbf{K}_{n})=\mathcal{O}\left(n^{-\frac{\alpha-1}{r(n)}}\right)$ ,
2.

if $c_{p}=\mathcal{O}(e^{-\alpha\sqrt{p}})$ , then $\lambda_{n}(\mathbf{K}_{n})=\mathcal{O}\left(n^{\frac{1}{2r(n)}}\exp\left(-\alpha^{\prime}n^{\frac{1}{2r(n)}}\right)\right)$ for any $\alpha^{\prime}<\alpha 2^{-1/2r(n)}$ ,
3.

if $c_{p}=\mathcal{O}(e^{-\alpha p})$ , then $\lambda_{n}(\mathbf{K}_{n})=\mathcal{O}\left(\exp\left(-\alpha^{\prime}n^{\frac{1}{r(n)}}\right)\right)$ for any $\alpha^{\prime}<\alpha 2^{-1/2r(n)}$ .

Proof.

First, as $[\mathbf{G}_{n}]_{ij}\leq 1$ then

\frac{\left\lVert\mathbf{G}^{\odot m(n)}\right\rVert}{n}\leq\frac{\text{Trace}(\mathbf{G}^{\odot m(n)})}{n}=1.

Therefore, to recover the three results listed we now apply Lemma C.22 with $\nu(n)=0$ . First, to prove 1., under the assumption $\ell(x)=x^{-\alpha}$ with $\alpha>0$ then

\int_{m(n)-1}^{\infty}x^{-\alpha}dx=\frac{(m(n)-1)^{-\alpha+1}}{\alpha-1}.

As a result

\lambda_{n}(\mathbf{K}_{n})=\mathcal{O}\left(n^{-\frac{\alpha-1}{r(n)}}\right).

To prove ii), under the assumption $\ell(x)=e^{-\alpha\sqrt{x}}$ with $\alpha>0$ then

\int_{m(n)-1}^{\infty}e^{-\alpha\sqrt{x}}dx=\frac{2\exp(-\alpha(\sqrt{m(n)-1})(\alpha\sqrt{m(n)-1}+1)}{\alpha^{2}}.

As a result

\lambda_{n}(\mathbf{K}_{n})=\mathcal{O}\left(n^{\frac{1}{2r(n)}}\exp\left(-\alpha^{\prime}n^{\frac{1}{2r(n)}}\right)\right)

for any $\alpha^{\prime}<\alpha 2^{-1/2r(n)}$ . Finally, to prove iii), under the assumption $\ell(x)=e^{-\alpha x}$ with $\alpha>0$ then

\int_{m(n)-1}^{\infty}e^{-\alpha x}dx=\frac{\exp(-\alpha(m(n)-1)}{\alpha}.

Therefore

\lambda_{n}(\mathbf{K}_{n})=\mathcal{O}\left(\exp\left(-\alpha^{\prime}n^{\frac{1}{r(n)}}\right)\right)

again for any $\alpha^{\prime}<\alpha 2^{-1/2r(n)}$ . ∎

Unfortunately, the curse of dimensionality is clearly present in these results due to the $1/r(n)$ factor in the exponents of $n$ . However, although perhaps somewhat loose we emphasize that these results are certainly far from trivial. In particular, while trivially we know that $\lambda_{n}(\mathbf{K}_{n})\leq Tr(\mathbf{K}_{n})/n=\mathcal{O}(n^{-1})$ , in contrast, even the weakest result concerning the power law decay our result is a clear improvement as long as $\alpha>r(n)+1$ . For the other settings, i.e., those specified in 2. and 3., our results are significantly stronger.