
Characterizing the spectrum of the NTK via a power series expansion

†Michael Murray, †Hui Jin, †Benjamin Bowman, †‡§Guido Montúfar
†Department of Mathematics, UCLA, CA, USA
‡Department of Statistics, UCLA, CA, USA
§Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
[mmurray,huijin,benbowman314,montufar]@math.ucla.edu
Equal contribution
Abstract

Under mild conditions on the network initialization we derive a power series expansion for the Neural Tangent Kernel (NTK) of arbitrarily deep feedforward networks in the infinite width limit. We provide expressions for the coefficients of this power series which depend on both the Hermite coefficients of the activation function and the depth of the network. We observe that faster decay of the Hermite coefficients leads to faster decay of the NTK coefficients, and we explore the role of depth. Using this series, we first relate the effective rank of the NTK to the effective rank of the input-data Gram matrix. Second, for data drawn uniformly on the sphere we study the eigenvalues of the NTK, analyzing the impact of the choice of activation function. Finally, for generic data and activation functions with sufficiently fast Hermite coefficient decay, we derive an asymptotic upper bound on the spectrum of the NTK.

1 Introduction

Neural networks currently dominate modern artificial intelligence; however, despite their empirical success, establishing a principled theoretical foundation for them remains an active challenge. The key difficulties are that neural networks induce nonconvex optimization objectives (Sontag & Sussmann, 1989) and typically operate in an overparameterized regime which precludes the application of classical statistical learning theory (Anthony & Bartlett, 2002). The persistent success of overparameterized models tuned via nonconvex optimization suggests that the relationship between parameterization, optimization, and generalization is more subtle than classical theory can capture.

A recent breakthrough on understanding the success of overparameterized networks was established through the Neural Tangent Kernel (NTK) (Jacot et al., 2018). In the infinite width limit the optimization dynamics are described entirely by the NTK and the parameterization behaves like a linear model (Lee et al., 2019). In this regime explicit guarantees for the optimization and generalization can be obtained (Du et al., 2019a, b; Arora et al., 2019a; Allen-Zhu et al., 2019; Zou et al., 2020). While one must be judicious when extrapolating insights from the NTK to finite width networks (Lee et al., 2020), the NTK remains one of the most promising avenues for understanding deep learning on a principled basis.

The spectrum of the NTK is fundamental to both the optimization and generalization of wide networks. In particular, bounding the smallest eigenvalue of the NTK Gram matrix is a staple technique for establishing convergence guarantees for the optimization (Du et al., 2019a, b; Oymak & Soltanolkotabi, 2020). Furthermore, the full spectrum of the NTK Gram matrix governs the dynamics of the empirical risk (Arora et al., 2019b), and the eigenvalues of the associated integral operator characterize the dynamics of the generalization error outside the training set (Bowman & Montufar, 2022; Bowman & Montúfar, 2022). Moreover, the decay rate of the generalization error for Gaussian process regression using the NTK can be characterized by the decay rate of the spectrum (Caponnetto & De Vito, 2007; Cui et al., 2021; Jin et al., 2022).

The importance of the spectrum of the NTK has led to a variety of efforts to characterize its structure via random matrix theory and other tools (Yang & Salman, 2019; Fan & Wang, 2020). There is a broader body of work studying the closely related Conjugate Kernel, Fisher Information Matrix, and Hessian (Poole et al., 2016; Pennington & Worah, 2017, 2018; Louart et al., 2018; Karakida et al., 2020). These results often require complex random matrix theory or operate in a regime where the input dimension is sent to infinity. By contrast, using just a power series expansion we are able to characterize a variety of attributes of the spectrum for fixed input dimension and recover key results from prior work.

1.1 Contributions

In Theorem 3.1 we derive coefficients for the power series expansion of the NTK under unit variance initialization (see Assumption 2). This expansion yields insights into the NTK spectrum, notably concerning the outlier eigenvalues as well as the asymptotic decay.

  • In Theorem 4.1 and Observation 4.2 we demonstrate that the largest eigenvalue $\lambda_{1}(\mathbf{K})$ of the NTK takes up an $\Omega(1)$ proportion of the trace and that there are $O(1)$ outlier eigenvalues of the same order as $\lambda_{1}(\mathbf{K})$.

  • In Theorem 4.3 and Theorem 4.5 we show that the effective rank $\mathrm{Tr}(\mathbf{K})/\lambda_{1}(\mathbf{K})$ of the NTK is upper bounded by a constant multiple of the effective rank $\mathrm{Tr}(\mathbf{X}\mathbf{X}^{T})/\lambda_{1}(\mathbf{X}\mathbf{X}^{T})$ of the input data Gram matrix for both infinite and finite width networks.

  • In Corollary 4.7 and Theorem 4.8 we characterize the asymptotic behavior of the NTK spectrum for both uniform and nonuniform data distributions on the sphere.

1.2 Related work

Neural Tangent Kernel (NTK): the NTK was introduced by Jacot et al. (2018), who demonstrated that in the infinite width limit neural network optimization is described via a kernel gradient descent. As a consequence, when the network is polynomially wide in the number of samples, global convergence guarantees for gradient descent can be obtained (Du et al., 2019a, b; Allen-Zhu et al., 2019; Zou & Gu, 2019; Lee et al., 2019; Zou et al., 2020; Oymak & Soltanolkotabi, 2020; Nguyen & Mondelli, 2020; Nguyen, 2021). Furthermore, the connection between infinite width networks and Gaussian processes, which traces back to Neal (1996), has been reinvigorated in light of the NTK. Recent investigations include Lee et al. (2018); de G. Matthews et al. (2018); Novak et al. (2019).

Analysis of NTK Spectrum: theoretical analysis of the NTK spectrum via random matrix theory was investigated by Yang & Salman (2019); Fan & Wang (2020) in the high dimensional limit. Velikanov & Yarotsky (2021) demonstrated that for ReLU networks the spectrum of the NTK integral operator asymptotically follows a power law, which is consistent with our results for the uniform data distribution. Basri et al. (2019) calculated the NTK spectrum for shallow ReLU networks under the uniform distribution, which was then expanded to the nonuniform case by Basri et al. (2020). Geifman et al. (2022) analyzed the spectrum of the conjugate kernel and NTK for convolutional networks with ReLU activations whose pixels are uniformly distributed on the sphere. Geifman et al. (2020); Bietti & Bach (2021); Chen & Xu (2021) analyzed the reproducing kernel Hilbert spaces of the NTK for ReLU networks and the Laplace kernel via the decay rate of the spectrum of the kernel. In contrast to previous works, we are able to address the spectrum in the finite dimensional setting and characterize the impact of different activation functions on it.

Hermite Expansion: Daniely et al. (2016) used Hermite expansion to study the expressivity of the Conjugate Kernel. Simon et al. (2022) used this technique to demonstrate that any dot product kernel can be realized by the NTK or Conjugate Kernel of a shallow, zero bias network. Oymak & Soltanolkotabi (2020) used Hermite expansion to study the NTK and establish a quantitative bound on the smallest eigenvalue for shallow networks. This approach was incorporated by Nguyen & Mondelli (2020) to handle convergence for deep networks, with sharp bounds on the smallest NTK eigenvalue for deep ReLU networks provided by Nguyen et al. (2021). The Hermite approach was utilized by Panigrahi et al. (2020) to analyze the smallest NTK eigenvalue of shallow networks under various activations. Finally, in a concurrent work Han et al. (2022) use Hermite expansions to develop a principled and efficient polynomial based approximation algorithm for the NTK and CNTK. In contrast to the aforementioned works, here we employ the Hermite expansion to characterize both the outlier and asymptotic portions of the spectrum for both shallow and deep networks under general activations.

2 Preliminaries

For our notation, lower case letters, e.g., $x, y$, denote scalars, lower case bold characters, e.g., $\mathbf{x},\mathbf{y}$, are for vectors, and upper case bold characters, e.g., $\mathbf{X},\mathbf{Y}$, are for matrices. For natural numbers $k_{1},k_{2}\in\mathbb{N}$ we let $[k_{1}]=\{1,\ldots,k_{1}\}$ and $[k_{2},k_{1}]=\{k_{2},\ldots,k_{1}\}$. If $k_{2}>k_{1}$ then $[k_{2},k_{1}]$ is the empty set. We use $\left\lVert\cdot\right\rVert_{p}$ to denote the $p$-norm of the matrix or vector in question and by default use $\left\lVert\cdot\right\rVert$ for the operator norm of a matrix and the 2-norm of a vector. We use $\mathbf{1}_{m\times n}\in\mathbb{R}^{m\times n}$ to denote the matrix with all entries equal to one. We define $\delta_{p=c}$ to take the value $1$ if $p=c$ and zero otherwise. We frequently overload scalar functions $\phi:\mathbb{R}\rightarrow\mathbb{R}$ by applying them elementwise to vectors and matrices. The entry in the $i$th row and $j$th column of a matrix $\mathbf{X}$ is accessed using the notation $[\mathbf{X}]_{ij}$. The Hadamard or entrywise product of two matrices $\mathbf{X},\mathbf{Y}\in\mathbb{R}^{m\times n}$ is denoted $\mathbf{X}\odot\mathbf{Y}$ as is standard. The $p$th Hadamard power is denoted $\mathbf{X}^{\odot p}$ and defined as the Hadamard product of $\mathbf{X}$ with itself $p$ times,

\mathbf{X}^{\odot p} := \mathbf{X}\odot\mathbf{X}\odot\cdots\odot\mathbf{X}.

Given a Hermitian or symmetric matrix $\mathbf{X}\in\mathbb{R}^{n\times n}$, we adopt the convention that $\lambda_{i}(\mathbf{X})$ denotes the $i$th largest eigenvalue,

\lambda_{1}(\mathbf{X})\geq\lambda_{2}(\mathbf{X})\geq\cdots\geq\lambda_{n}(\mathbf{X}).

Finally, for a square matrix $\mathbf{X}\in\mathbb{R}^{n\times n}$ we let $\mathrm{Tr}(\mathbf{X})=\sum_{i=1}^{n}[\mathbf{X}]_{ii}$ denote the trace.

2.1 Hermite Expansion

We say that a function $f\colon\mathbb{R}\rightarrow\mathbb{R}$ is square integrable with respect to the standard Gaussian measure $\gamma(z)=\frac{1}{\sqrt{2\pi}}e^{-z^{2}/2}$ if $\mathbb{E}_{X\sim\mathcal{N}(0,1)}[f(X)^{2}]<\infty$. We denote by $L^{2}(\mathbb{R},\gamma)$ the space of all such functions. The normalized probabilist's Hermite polynomials are defined as

h_{k}(x)=\frac{(-1)^{k}e^{x^{2}/2}}{\sqrt{k!}}\frac{d^{k}}{dx^{k}}e^{-x^{2}/2},\quad k=0,1,\ldots

and form a complete orthonormal basis in $L^{2}(\mathbb{R},\gamma)$ (O'Donnell, 2014, §11). The Hermite expansion of a function $\phi\in L^{2}(\mathbb{R},\gamma)$ is given by $\phi(x)=\sum_{k=0}^{\infty}\mu_{k}(\phi)h_{k}(x)$, where $\mu_{k}(\phi)=\mathbb{E}_{X\sim\mathcal{N}(0,1)}[\phi(X)h_{k}(X)]$ is the $k$th normalized probabilist's Hermite coefficient of $\phi$.
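Since several of our results are stated in terms of these coefficients, the following minimal sketch (ours, not part of the paper's released code) estimates $\mu_{k}(\phi)$ numerically by Gauss-Hermite quadrature; the activation, truncation level and quadrature degree are illustrative choices.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval
from math import factorial, sqrt, pi

def hermite_coeffs(phi, K, deg=200):
    """Estimate [mu_0(phi), ..., mu_K(phi)] for phi in L^2(R, gamma) by quadrature."""
    x, w = hermegauss(deg)                     # nodes/weights for the weight exp(-x^2/2)
    w = w / sqrt(2 * pi)                       # renormalize to the Gaussian measure gamma
    vals = phi(x)
    mus = []
    for k in range(K + 1):
        He_k = hermeval(x, [0.0] * k + [1.0])  # probabilist's Hermite polynomial He_k
        mus.append(np.sum(w * vals * He_k / sqrt(factorial(k))))  # h_k = He_k / sqrt(k!)
    return np.array(mus)

relu = lambda z: np.maximum(z, 0.0)
print(hermite_coeffs(relu, 5))   # e.g. mu_0(ReLU) = 1/sqrt(2*pi) ~ 0.399 and mu_1(ReLU) = 0.5
```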

2.2 NTK Parametrization

In what follows, for $n,d\in\mathbb{N}$ let $\mathbf{X}\in\mathbb{R}^{n\times d}$ denote a matrix which stores $n$ points in $\mathbb{R}^{d}$ row-wise. Unless otherwise stated, we assume $d\leq n$ and denote the $i$th row of $\mathbf{X}$ as $\mathbf{x}_{i}$. In this work we consider fully-connected neural networks of the form $f^{(L+1)}\colon\mathbb{R}^{d}\rightarrow\mathbb{R}$ with $L\in\mathbb{N}$ hidden layers and a linear output layer. For a given input vector $\mathbf{x}\in\mathbb{R}^{d}$, the activation $f^{(l)}$ and preactivation $g^{(l)}$ at each layer $l\in[L+1]$ are defined via the following recurrence relations,

g^{(1)}(\mathbf{x})=\gamma_{w}\mathbf{W}^{(1)}\mathbf{x}+\gamma_{b}\mathbf{b}^{(1)},\quad f^{(1)}(\mathbf{x})=\phi\left(g^{(1)}(\mathbf{x})\right), \quad (1)
g^{(l)}(\mathbf{x})=\frac{\sigma_{w}}{\sqrt{m_{l-1}}}\mathbf{W}^{(l)}f^{(l-1)}(\mathbf{x})+\sigma_{b}\mathbf{b}^{(l)},\quad f^{(l)}(\mathbf{x})=\phi\left(g^{(l)}(\mathbf{x})\right),\quad\forall l\in[2,L],
g^{(L+1)}(\mathbf{x})=\frac{\sigma_{w}}{\sqrt{m_{L}}}\mathbf{W}^{(L+1)}f^{(L)}(\mathbf{x}),\quad f^{(L+1)}(\mathbf{x})=g^{(L+1)}(\mathbf{x}).

The parameters $\mathbf{W}^{(l)}\in\mathbb{R}^{m_{l}\times m_{l-1}}$ and $\mathbf{b}^{(l)}\in\mathbb{R}^{m_{l}}$ are the weight matrix and bias vector at the $l$th layer respectively, $m_{0}=d$, $m_{L+1}=1$, and $\phi\colon\mathbb{R}\rightarrow\mathbb{R}$ is the activation function applied elementwise. The variables $\gamma_{w},\sigma_{w}\in\mathbb{R}_{>0}$ and $\gamma_{b},\sigma_{b}\in\mathbb{R}_{\geq 0}$ correspond to weight and bias hyperparameters respectively. Let $\theta_{l}\in\mathbb{R}^{p}$ denote a vector storing the network parameters $(\mathbf{W}^{(h)},\mathbf{b}^{(h)})_{h=1}^{l}$ up to and including the $l$th layer. The Neural Tangent Kernel (Jacot et al., 2018) $\tilde{\Theta}^{(l)}\colon\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}$ associated with $f^{(l)}$ at layer $l\in[L+1]$ is defined as

\tilde{\Theta}^{(l)}(\mathbf{x},\mathbf{y}):=\langle\nabla_{\theta_{l}}f^{(l)}(\mathbf{x}),\nabla_{\theta_{l}}f^{(l)}(\mathbf{y})\rangle. \quad (2)
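For concreteness, here is a minimal sketch (ours, not the paper's released code) of the empirical kernel in (2) for a small finite-width network in the parametrization of (1), using plain PyTorch autograd; the width, depth, Tanh activation and omission of bias terms are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
d, m, L = 4, 64, 2           # input dimension, width, number of hidden layers (assumed)
sigma_w, gamma_w = 1.0, 1.0  # hyperparameters from Eq. (1); biases omitted for brevity

# N(0,1) parameters as in Assumption 1; the 1/sqrt(m) NTK scaling is applied in the forward pass.
Ws = [torch.randn(m, d, requires_grad=True)]
Ws += [torch.randn(m, m, requires_grad=True) for _ in range(L - 1)]
Ws += [torch.randn(1, m, requires_grad=True)]

def f(x):
    h = torch.tanh(gamma_w * (Ws[0] @ x))                  # layer 1 of Eq. (1)
    for W in Ws[1:-1]:
        h = torch.tanh((sigma_w / m ** 0.5) * (W @ h))     # layers 2..L
    return ((sigma_w / m ** 0.5) * (Ws[-1] @ h))[0]        # linear output layer

def ntk_entry(x, y):
    gx = torch.autograd.grad(f(x), Ws)
    gy = torch.autograd.grad(f(y), Ws)
    return sum((a * b).sum() for a, b in zip(gx, gy)).item()   # inner product in Eq. (2)

x = torch.randn(d); x = x / x.norm()                       # unit-norm inputs
y = torch.randn(d); y = y / y.norm()
print(ntk_entry(x, y))
```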

We will mostly study the NTK under the following standard assumptions.

Assumption 1 (NTK initialization).

  1. At initialization all network parameters are distributed as $\mathcal{N}(0,1)$ and are mutually independent.

  2. The activation function satisfies $\phi\in L^{2}(\mathbb{R},\gamma)$, is differentiable almost everywhere and its derivative, which we denote $\phi^{\prime}$, also satisfies $\phi^{\prime}\in L^{2}(\mathbb{R},\gamma)$.

  3. The widths are sent to infinity in sequence, $m_{1}\rightarrow\infty$, $m_{2}\rightarrow\infty$, $\ldots$, $m_{L}\rightarrow\infty$.

Under Assumption 1, for any $l\in[L+1]$, $\tilde{\Theta}^{(l)}(\mathbf{x},\mathbf{y})$ converges in probability to a deterministic limit $\Theta^{(l)}\colon\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}$ (Jacot et al., 2018) and the network behaves like a kernelized linear predictor during training; see, e.g., Arora et al. (2019b); Lee et al. (2019); Woodworth et al. (2020). Given access to the rows $(\mathbf{x}_{i})_{i=1}^{n}$ of $\mathbf{X}$, the NTK matrix at layer $l\in[L+1]$, which we denote $\mathbf{K}_{l}$, is the $n\times n$ matrix with entries defined as

[\mathbf{K}_{l}]_{ij}=\frac{1}{n}\Theta^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j}),\quad\forall(i,j)\in[n]\times[n]. \quad (3)

3 Expressing the NTK as a power series

The following assumption allows us to study a power series for the NTK of deep networks with general activation functions. We remark that power series for the NTK of deep networks with positive homogeneous activation functions, namely ReLU, have been studied in prior works (Han et al., 2022; Chen & Xu, 2021; Bietti & Bach, 2021; Geifman et al., 2022). We further remark that while these works focus on the asymptotics of the NTK spectrum, we also study the large eigenvalues.

Assumption 2.

The hyperparameters of the network satisfy $\gamma_{w}^{2}+\gamma_{b}^{2}=1$, $\sigma_{w}^{2}\mathbb{E}_{Z\sim\mathcal{N}(0,1)}[\phi(Z)^{2}]\leq 1$ and $\sigma_{b}^{2}=1-\sigma_{w}^{2}\mathbb{E}_{Z\sim\mathcal{N}(0,1)}[\phi(Z)^{2}]$. The data is normalized so that $\left\lVert\mathbf{x}_{i}\right\rVert=1$ for all $i\in[n]$.
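As a small illustration, the following sketch (ours) picks $\sigma_{b}^{2}$ so that Assumption 2 holds for a given activation, estimating $\mathbb{E}[\phi(Z)^{2}]$ by Gauss-Hermite quadrature; the choice of Tanh and $\sigma_{w}^{2}=1$ is illustrative.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def second_moment(phi, deg=200):
    """Gauss-Hermite estimate of E_{Z ~ N(0,1)}[phi(Z)^2]."""
    z, w = hermegauss(deg)                        # weight exp(-z^2/2); weights sum to sqrt(2*pi)
    return np.sum(w * phi(z) ** 2) / np.sqrt(2 * np.pi)

sigma_w2 = 1.0                                    # illustrative weight variance
sigma_b2 = 1.0 - sigma_w2 * second_moment(np.tanh)
print(sigma_b2)                                   # bias variance required by Assumption 2
```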

Recall that under Assumption 1 the preactivations of the network are centered Gaussian processes (Neal, 1996; Lee et al., 2018). Assumption 2 ensures the preactivation of each neuron has unit variance and is thus reminiscent of the LeCun et al. (2012), Glorot & Bengio (2010) and He et al. (2015) initializations, which are designed to avoid vanishing and exploding gradients. We refer the reader to Appendix A.3 for a thorough discussion. Under Assumption 2 we will show it is possible to write the NTK not only as a dot-product kernel but also as an analytic power series on $[-1,1]$, and we derive expressions for the coefficients. In order to state this result, recall that given a function $f\in L^{2}(\mathbb{R},\gamma)$, the $p$th normalized probabilist's Hermite coefficient of $f$ is denoted $\mu_{p}(f)$; we refer the reader to Appendix A.4 for an overview of the Hermite polynomials and their properties. Furthermore, letting $\bar{a}=(a_{j})_{j=0}^{\infty}$ denote a sequence of real numbers, for any $p,k\in\mathbb{Z}_{\geq 0}$ we define

F(p,k,\bar{a})=\begin{cases}1,&k=0\text{ and }p=0,\\ 0,&k=0\text{ and }p\geq 1,\\ \sum_{(j_{i})\in\mathcal{J}(p,k)}\prod_{i=1}^{k}a_{j_{i}},&k\geq 1\text{ and }p\geq 0,\end{cases} \quad (4)

where

\mathcal{J}(p,k):=\big\{(j_{i})_{i\in[k]}\;:\;j_{i}\geq 0\;\forall i\in[k],\;\sum_{i=1}^{k}j_{i}=p\big\}\quad\text{for all }p\in\mathbb{Z}_{\geq 0},\ k\in\mathbb{N}.

Here $\mathcal{J}(p,k)$ is the set of all $k$-tuples of nonnegative integers which sum to $p$, and $F(p,k,\bar{a})$ is therefore the sum of all ordered products of $k$ elements of $\bar{a}$ whose indices sum to $p$; a brute-force evaluation of this definition is sketched below. We are now ready to state the key result of this section, Theorem 3.1, whose proof is provided in Appendix B.1.
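For small $p$ and $k$ the definition (4) can be checked directly by enumerating $\mathcal{J}(p,k)$; the brute-force sketch below (ours, purely illustrative) does exactly this. Equivalently, for $k\geq 1$, $F(p,k,\bar{a})$ is the coefficient of $x^{p}$ in $\big(\sum_{j\geq 0}a_{j}x^{j}\big)^{k}$, which is how it is computed efficiently in the sketch following Theorem 3.1.

```python
from itertools import product
from math import prod

def F(p, k, a):
    """Brute-force evaluation of definition (4); a needs at least p + 1 entries."""
    if k == 0:
        return 1.0 if p == 0 else 0.0
    return sum(prod(a[j] for j in js)
               for js in product(range(p + 1), repeat=k)
               if sum(js) == p)

# e.g. F(2, 2, a) = a_0*a_2 + a_1*a_1 + a_2*a_0
print(F(2, 2, [1.0, 2.0, 3.0]))   # 1*3 + 2*2 + 3*1 = 10
```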

Theorem 3.1.

Under Assumptions 1 and 2, for all $l\in[L+1]$,

n\mathbf{K}_{l}=\sum_{p=0}^{\infty}\kappa_{p,l}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot p}. \quad (5)

The series for each entry $n[\mathbf{K}_{l}]_{ij}$ converges absolutely and the coefficients $\kappa_{p,l}$ are nonnegative and can be evaluated using the recurrence relationships

\kappa_{p,l}=\begin{cases}\delta_{p=0}\gamma_{b}^{2}+\delta_{p=1}\gamma_{w}^{2},&l=1,\\ \alpha_{p,l}+\sum_{q=0}^{p}\kappa_{q,l-1}\upsilon_{p-q,l},&l\in[2,L+1],\end{cases} \quad (6)

where

\alpha_{p,l}=\begin{cases}\sigma_{w}^{2}\mu_{p}^{2}(\phi)+\delta_{p=0}\sigma_{b}^{2},&l=2,\\ \sum_{k=0}^{\infty}\alpha_{k,2}F(p,k,\bar{\alpha}_{l-1}),&l\geq 3,\end{cases} \quad (7)

and

\upsilon_{p,l}=\begin{cases}\sigma_{w}^{2}\mu_{p}^{2}(\phi^{\prime}),&l=2,\\ \sum_{k=0}^{\infty}\upsilon_{k,2}F(p,k,\bar{\alpha}_{l-1}),&l\geq 3,\end{cases} \quad (8)

are likewise nonnegative for all $p\in\mathbb{Z}_{\geq 0}$ and $l\in[2,L+1]$.
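The recurrence (6)-(8) is straightforward to evaluate numerically once the series are truncated. The sketch below (ours) computes approximate leading coefficients $\kappa_{p,l}$ from the Hermite coefficients of $\phi$ and $\phi^{\prime}$; the truncation of the degree and of the infinite sums over $k$ in (7)-(8), and the default hyperparameter values, are our own choices, so the output is an approximation rather than the exact coefficients.

```python
import numpy as np

def trunc_mul(a, b, P):
    """Product of two power series (coefficient arrays), truncated to degree P."""
    return np.convolve(a, b)[:P + 1]

def trunc_compose(outer, inner, P):
    """outer(inner(x)) truncated to degree P; the sum over k is truncated to len(outer) terms."""
    out, pw = np.zeros(P + 1), np.zeros(P + 1)
    pw[0] = 1.0                                    # inner(x)^0
    for c in outer:                                # sum_k outer_k * inner(x)^k, cf. F(p, k, .) in (4)
        out += c * pw
        pw = trunc_mul(pw, inner, P)
    return out

def ntk_coeffs(mu, mu_prime, L, P, gw2=1.0, gb2=0.0, sw2=1.0, sb2=None):
    """Approximate kappa_{p, L+1} for p = 0..P; mu, mu_prime hold >= P+1 Hermite coefficients."""
    if sb2 is None:
        sb2 = 1.0 - sw2 * np.sum(mu ** 2)          # Assumption 2 (truncated estimate)
    alpha2 = sw2 * mu[:P + 1] ** 2
    alpha2[0] += sb2                               # Eq. (7), l = 2
    ups2 = sw2 * mu_prime[:P + 1] ** 2             # Eq. (8), l = 2
    kappa = np.zeros(P + 1)
    kappa[0], kappa[1] = gb2, gw2                  # Eq. (6), l = 1
    alpha_prev = None
    for l in range(2, L + 2):
        if l == 2:
            alpha_l, ups_l = alpha2, ups2
        else:                                      # compose with the previous layer's series, Eqs. (7)-(8)
            alpha_l = trunc_compose(alpha2, alpha_prev, P)
            ups_l = trunc_compose(ups2, alpha_prev, P)
        kappa = alpha_l + trunc_mul(kappa, ups_l, P)   # Eq. (6), l >= 2
        alpha_prev = alpha_l
    return kappa
```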

As already remarked, power series for the NTK have been studied in previous works; however, to the best of our knowledge Theorem 3.1 is the first to explicitly express the coefficients at a layer in terms of the coefficients of previous layers. To compute the coefficients of the NTK as per Theorem 3.1, the Hermite coefficients of both $\phi$ and $\phi^{\prime}$ are required. Under Assumption 3 below, which has minimal impact on the generality of our results, this calculation can be simplified. In short, under Assumption 3, $\upsilon_{p,2}=(p+1)\alpha_{p+1,2}$ and therefore only the Hermite coefficients of $\phi$ are required. We refer the reader to Lemma B.3 in Appendix B.2 for further details.

Assumption 3.

The activation function $\phi\colon\mathbb{R}\rightarrow\mathbb{R}$ is absolutely continuous on $[-a,a]$ for all $a>0$, differentiable almost everywhere, and is polynomially bounded, i.e., $|\phi(x)|=\mathcal{O}(|x|^{\beta})$ for some $\beta>0$. Further, the derivative $\phi^{\prime}\colon\mathbb{R}\rightarrow\mathbb{R}$ satisfies $\phi^{\prime}\in L^{2}(\mathbb{R},\gamma)$.

We remark that ReLU, Tanh, Sigmoid, Softplus and many other commonly used activation functions satisfy Assumption 3. In order to understand the relationship between the Hermite coefficients of the activation function and the coefficients of the NTK, we first consider the simple two-layer case with $L=1$ hidden layer. From Theorem 3.1,

\kappa_{p,2}=\sigma_{w}^{2}(1+\gamma_{w}^{2}p)\mu_{p}^{2}(\phi)+\sigma_{w}^{2}\gamma_{b}^{2}(1+p)\mu_{p+1}^{2}(\phi)+\delta_{p=0}\sigma_{b}^{2}. \quad (9)

As per Table 1, a general trend we observe across all activation functions is that the first few coefficients account for the large majority of the total NTK coefficient series.

Table 1: Percentage of $\sum_{p=0}^{\infty}\kappa_{p,2}$ accounted for by the first $T+1$ NTK coefficients, assuming $\gamma_{w}^{2}=1$, $\gamma_{b}^{2}=0$, $\sigma_{w}^{2}=1$ and $\sigma_{b}^{2}=1-\mathbb{E}[\phi(Z)^{2}]$.
T =        0       1       2       3       4       5
ReLU       43.944  77.277  93.192  93.192  95.403  95.403
Tanh       41.362  91.468  91.468  97.487  97.487  99.090
Sigmoid    91.557  99.729  99.729  99.977  99.977  99.997
Gaussian   95.834  95.834  98.729  98.729  99.634  99.634
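As a quick illustration of (9) and Table 1, the following snippet (ours) tabulates the cumulative share of the truncated sum $\sum_{p}\kappa_{p,2}$ for Tanh with the hyperparameters of Table 1, reusing the hermite_coeffs quadrature helper sketched in Section 2.1; up to truncation and quadrature error the output should be close to the Tanh row of the table.

```python
import numpy as np

mu = hermite_coeffs(np.tanh, 30)           # quadrature sketch from Section 2.1
sb2 = 1.0 - np.sum(mu ** 2)                # sigma_b^2 under Assumption 2 (truncated estimate)
p = np.arange(mu.size)
kappa2 = (1.0 + p) * mu ** 2               # Eq. (9) with gamma_w^2 = sigma_w^2 = 1, gamma_b = 0
kappa2[0] += sb2
share = 100 * np.cumsum(kappa2) / np.sum(kappa2)
print(np.round(share[:6], 3))              # cumulative percentage for T = 0, ..., 5
```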

However, the asymptotic rate of decay of the NTK coefficients varies significantly by activation function, due to the varying behavior of their tails. In Lemma 3.2 we choose ReLU, Tanh and Gaussian as prototypical examples of activation functions with growing, constant, and decaying tails respectively, and analyze the corresponding NTK coefficients in the two-layer setting. For typographical ease we denote the zero-mean Gaussian density function with variance $\sigma^{2}$ as $\omega_{\sigma}(z):=(1/\sqrt{2\pi\sigma^{2}})\exp\left(-z^{2}/(2\sigma^{2})\right)$.

Lemma 3.2.

Under Assumptions 1 and 2,

  1. if $\phi(z)=\mathrm{ReLU}(z)$, then $\kappa_{p,2}=\delta_{(\gamma_{b}>0)\cup(p\text{ even})}\Theta(p^{-3/2})$,

  2. if $\phi(z)=\mathrm{Tanh}(z)$, then $\kappa_{p,2}=\mathcal{O}\left(\exp\left(-\frac{\pi\sqrt{p-1}}{2}\right)\right)$,

  3. if $\phi(z)=\omega_{\sigma}(z)$, then $\kappa_{p,2}=\delta_{(\gamma_{b}>0)\cup(p\text{ even})}\Theta(p^{1/2}(\sigma^{2}+1)^{-p})$.

The trend we observe from Lemma 3.2 is that activation functions whose Hermite coefficients decay quickly, such as $\omega_{\sigma}$, result in a faster decay of the NTK coefficients. We remark that analyzing the rates of decay for $l\geq 3$ is challenging due to the calculation of $F(p,k,\bar{\alpha}_{l-1})$ in (4). In Appendix B.4 we provide preliminary results in this direction, upper bounding, in a very specific setting, the decay of the NTK coefficients for depths $l\geq 2$. Finally, we briefly pause here to highlight the potential for using a truncation of (5) in order to perform efficient numerical approximation of the infinite width NTK. We remark that this idea is also addressed in a concurrent work by Han et al. (2022), albeit under a somewhat different set of assumptions. (In particular, in Han et al. (2022) the authors focus on homogeneous activation functions and allow the data to lie off the sphere. By contrast, we require the data to lie on the sphere but can handle non-homogeneous activation functions in the deep setting.) As per our observations thus far that the coefficients of the NTK power series (5) typically decay quite rapidly, one might consider approximating $\Theta^{(l)}$ by computing just the first few terms in each series of (5). Figure 2 in Appendix B.3 displays the absolute error between the truncated ReLU NTK and the analytical expression for the ReLU NTK, which is also defined in Appendix B.3. Letting $\rho$ denote the input correlation, the key takeaway is that while for $|\rho|$ close to one the approximation is poor, for $|\rho|<0.5$, which is arguably more realistic for real-world data, machine-level precision can be achieved with just $50$ coefficients. We refer the interested reader to Appendix B.3 for a proper discussion.

4 Analyzing the spectrum of the NTK via its power series

In this section, we consider a general kernel matrix power series of the form $n\mathbf{K}=\sum_{p=0}^{\infty}c_{p}(\mathbf{X}\mathbf{X}^{T})^{\odot p}$, where $\{c_{p}\}_{p=0}^{\infty}$ are coefficients and $\mathbf{X}$ is the data matrix. According to Theorem 3.1, the coefficients of the NTK power series (5) are always nonnegative, thus we only consider the case where the $c_{p}$ are nonnegative. We will also consider the kernel function power series, which we denote $K(x_{1},x_{2})=\sum_{p=0}^{\infty}c_{p}\langle x_{1},x_{2}\rangle^{p}$. Later on we will analyze the spectrum of the kernel matrix $\mathbf{K}$ and the kernel function $K$.

4.1 Analysis of the upper spectrum and effective rank

In this section we analyze the upper part of the spectrum of the NTK, corresponding to the large eigenvalues, using the power series given in Theorem 3.1. Our first result concerns the effective rank (Huang et al., 2022) of the NTK. Given a positive semidefinite matrix $\mathbf{A}\in\mathbb{R}^{n\times n}$ we define the effective rank of $\mathbf{A}$ to be

\mathrm{eff}(\mathbf{A})=\frac{\mathrm{Tr}(\mathbf{A})}{\lambda_{1}(\mathbf{A})}.

The effective rank quantifies how many eigenvalues are on the order of the largest eigenvalue. This follows from the Markov-like inequality

\left\lvert\{p:\lambda_{p}(\mathbf{A})\geq c\lambda_{1}(\mathbf{A})\}\right\rvert\leq c^{-1}\mathrm{eff}(\mathbf{A}) \quad (10)

and the eigenvalue bound

\frac{\lambda_{p}(\mathbf{A})}{\lambda_{1}(\mathbf{A})}\leq\frac{\mathrm{eff}(\mathbf{A})}{p}.
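The following small sketch (ours) illustrates the effective rank and the Markov-type bound (10) on a truncated power-series kernel built from unit-norm data; the sample size, data distribution and coefficients are arbitrary illustrative choices.

```python
import numpy as np

def eff(A):
    """Effective rank Tr(A) / lambda_1(A) of a symmetric PSD matrix."""
    return np.trace(A) / np.linalg.eigvalsh(A)[-1]

rng = np.random.default_rng(0)
n, d, c = 200, 20, 0.5
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)            # unit-norm rows
coeffs = [0.5, 0.3, 0.1, 0.05]                           # illustrative nonnegative c_p
G = X @ X.T
K = sum(cp * G ** p for p, cp in enumerate(coeffs)) / n  # truncated power series (5)

lam = np.linalg.eigvalsh(K)[::-1]                        # eigenvalues, largest first
print(eff(K), sum(coeffs) / coeffs[0])                   # bound of Theorem 4.1 (derived next)
print(np.sum(lam >= c * lam[0]), eff(K) / c)             # Markov-type bound (10)
```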

Our first result is that the effective rank of the NTK can be bounded in terms of a ratio involving the power series coefficients. As the data is normalized so that $\left\lVert\mathbf{x}_{i}\right\rVert=1$ for all $i\in[n]$, the linearity of the trace gives

\mathrm{Tr}(n\mathbf{K})=\sum_{p=0}^{\infty}c_{p}\mathrm{Tr}((\mathbf{X}\mathbf{X}^{T})^{\odot p})=n\sum_{p=0}^{\infty}c_{p},

where we have used the fact that $\mathrm{Tr}((\mathbf{X}\mathbf{X}^{T})^{\odot p})=n$ for all $p\in\mathbb{N}$. On the other hand,

\lambda_{1}(n\mathbf{K})\geq\lambda_{1}(c_{0}(\mathbf{X}\mathbf{X}^{T})^{\odot 0})=\lambda_{1}(c_{0}\mathbf{1}_{n\times n})=nc_{0}.

Combining these two results we get the following theorem.

Theorem 4.1.

Assume that we have a kernel Gram matrix $\mathbf{K}$ of the form $n\mathbf{K}=\sum_{p=0}^{\infty}c_{p}(\mathbf{X}\mathbf{X}^{T})^{\odot p}$ where $c_{0}\neq 0$. Furthermore, assume the input data $\mathbf{x}_{i}$ are normalized so that $\left\lVert\mathbf{x}_{i}\right\rVert=1$ for all $i\in[n]$. Then

\mathrm{eff}(\mathbf{K})\leq\frac{\sum_{p=0}^{\infty}c_{p}}{c_{0}}.

By Theorem 3.1, $c_{0}\neq 0$ provided the network has biases or the activation function has nonzero Gaussian expectation (i.e., $\mu_{0}(\phi)\neq 0$). Thus the effective rank of $\mathbf{K}$ is bounded by an $O(1)$ quantity. In the case of ReLU, for example, as evidenced by Table 1, the effective rank will be roughly $2.3$ for a shallow network. By contrast, a well-conditioned matrix would have an effective rank that is $\Omega(n)$. Combining Theorem 4.1 and the Markov-type bound (10) we make the following important observation.

Observation 4.2.

The largest eigenvalue $\lambda_{1}(\mathbf{K})$ of the NTK takes up an $\Omega(1)$ fraction of the entire trace and there are $O(1)$ eigenvalues on the same order of magnitude as $\lambda_{1}(\mathbf{K})$, where the $O(1)$ and $\Omega(1)$ notation is with respect to the parameter $n$.

While the constant term $c_{0}\mathbf{1}_{n\times n}$ in the kernel leads to a significant outlier in the spectrum of $\mathbf{K}$, it is rather uninformative beyond this. What interests us is how the structure of the data $\mathbf{X}$ manifests in the spectrum of the kernel matrix $\mathbf{K}$. For this reason we will examine the centered kernel matrix $\widetilde{\mathbf{K}}:=\mathbf{K}-\frac{c_{0}}{n}\mathbf{1}_{n\times n}$. By a very similar argument as before we get the following result.

Theorem 4.3.

Assume that we have a kernel Gram matrix $\mathbf{K}$ of the form $n\mathbf{K}=\sum_{p=0}^{\infty}c_{p}(\mathbf{X}\mathbf{X}^{T})^{\odot p}$ where $c_{1}\neq 0$. Furthermore, assume the input data $\mathbf{x}_{i}$ are normalized so that $\left\lVert\mathbf{x}_{i}\right\rVert=1$ for all $i\in[n]$. Then the centered kernel $\widetilde{\mathbf{K}}:=\mathbf{K}-\frac{c_{0}}{n}\mathbf{1}_{n\times n}$ satisfies

\mathrm{eff}(\widetilde{\mathbf{K}})\leq\mathrm{eff}(\mathbf{X}\mathbf{X}^{T})\frac{\sum_{p=1}^{\infty}c_{p}}{c_{1}}.

Thus the effective rank of the centered kernel $\widetilde{\mathbf{K}}$ is upper bounded by a constant multiple of the effective rank of the input data Gram matrix $\mathbf{X}\mathbf{X}^{T}$. Furthermore, we can take the ratio $\frac{\sum_{p=1}^{\infty}c_{p}}{c_{1}}$ as a measure of how much the NTK inherits the behavior of the linear kernel $\mathbf{X}\mathbf{X}^{T}$: in particular, if the input data Gram has low effective rank and this ratio is moderate, then we may conclude that the centered NTK must also have low effective rank. Again from Table 1, in the shallow setting we see that this ratio tends to be small for many common activations; for ReLU, for example, it is roughly 1.3. To summarize, from Theorem 4.3 we make the following important observation.

Observation 4.4.

Whenever the input data are approximately low rank, the centered kernel matrix $\widetilde{\mathbf{K}}=\mathbf{K}-\frac{c_{0}}{n}\mathbf{1}_{n\times n}$ is also approximately low rank.
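A quick numerical illustration (ours) of Observation 4.4 and the bound of Theorem 4.3: inputs lying close to a low-dimensional subspace yield a centered kernel whose effective rank is correspondingly small. The rank, noise level and coefficients below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, r = 200, 50, 3                                     # inputs near a rank-r subspace
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))
X += 0.05 * rng.standard_normal((n, d))                  # small full-rank perturbation
X /= np.linalg.norm(X, axis=1, keepdims=True)            # unit-norm rows

coeffs = [0.5, 0.3, 0.1, 0.05]                           # illustrative nonnegative c_p
G = X @ X.T
K = sum(cp * G ** p for p, cp in enumerate(coeffs)) / n
K_tilde = K - coeffs[0] / n                              # centered kernel K - (c_0 / n) 1 1^T

eff = lambda A: np.trace(A) / np.linalg.eigvalsh(A)[-1]
print(eff(G), eff(K_tilde))                              # both small when inputs are near low rank
print(eff(G) * sum(coeffs[1:]) / coeffs[1])              # upper bound from Theorem 4.3
```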

It turns out that this phenomenon also holds for finite-width networks at initialization. Consider the shallow model

\sum_{\ell=1}^{m}a_{\ell}\phi(\langle\mathbf{w}_{\ell},\mathbf{x}\rangle),

where $\mathbf{x}\in\mathbb{R}^{d}$ and $\mathbf{w}_{\ell}\in\mathbb{R}^{d}$, $a_{\ell}\in\mathbb{R}$ for all $\ell\in[m]$. The following theorem demonstrates that when the width $m$ is linear in the number of samples $n$ then $\mathrm{eff}(\mathbf{K})$ is upper bounded by a constant multiple of $\mathrm{eff}(\mathbf{X}\mathbf{X}^{T})$.

Theorem 4.5.

Assume $\phi(x)=\mathrm{ReLU}(x)$ and $n\geq d$. Fix $\epsilon>0$ small. Suppose that $\mathbf{w}_{1},\ldots,\mathbf{w}_{m}\sim N(0,\nu_{1}^{2}I_{d})$ i.i.d. and $a_{1},\ldots,a_{m}\sim N(0,\nu_{2}^{2})$. Set $M=\max_{i\in[n]}\left\lVert\mathbf{x}_{i}\right\rVert_{2}$, and let

\Sigma:=\mathbb{E}_{\mathbf{w}\sim N(0,\nu_{1}^{2}I)}[\phi(\mathbf{X}\mathbf{w})\phi(\mathbf{w}^{T}\mathbf{X}^{T})].

Then

m=\Omega\left(\max(\lambda_{1}(\Sigma)^{-2},1)\max(n,\log(1/\epsilon))\right),\quad\nu_{1}=O(1/M\sqrt{m})

suffices to ensure that, with probability at least $1-\epsilon$ over the sampling of the parameter initialization,

\mathrm{eff}(\mathbf{K})\leq C\,\mathrm{eff}(\mathbf{X}\mathbf{X}^{T}),

where $C>0$ is an absolute constant.

Many works consider the model where the outer layer weights are fixed with constant magnitude and only the inner layer weights are trained. This is the setting considered by Xie et al. (2017), Arora et al. (2019a), Du et al. (2019b), Oymak et al. (2019), Li et al. (2020), and Oymak & Soltanolkotabi (2020). In this setting we can reduce the dependence on the width $m$ to only logarithmic in the number of samples $n$, and we have an accompanying lower bound; see Theorem C.5 in Appendix C.2.3 for details.

In Figure 1 we empirically validate our theory by computing the spectrum of the NTK on both Caltech101 (Li et al., 2022) and isotropic Gaussian data for feedforward networks. We compute the NTK with the functorch module in PyTorch (Paszke et al., 2019) (see https://pytorch.org/functorch/stable/notebooks/neural_tangent_kernels.html), using an algorithmic approach inspired by Novak et al. (2022). As per Theorem 4.1 and Observation 4.2, we observe that all network architectures exhibit a dominant outlier eigenvalue due to the nonzero constant coefficient in the power series. Furthermore, this dominant outlier becomes more pronounced with depth, as can be observed if one carries out the calculations described in Theorem 3.1. Additionally, this outlier is most pronounced for ReLU, as the combination of its Gaussian mean plus bias term is the largest out of the activations considered here. As predicted by Theorem 4.3, Observation 4.4 and Theorem 4.5, we observe that real-world data, which has a skewed spectrum and hence a low effective rank, results in the spectrum of the NTK being skewed. By contrast, isotropic Gaussian data has a flat spectrum, and as a result, beyond the outlier, the decay of the eigenvalues of the NTK is more gradual. These observations support the claim that the NTK inherits its spectral structure from the data. We also observe that the spectrum for Tanh is closer to that of the linear activation relative to ReLU: intuitively this should not be surprising, as close to the origin Tanh is well approximated by the identity. Our theory provides a formal explanation for this observation: indeed, the power series coefficients for Tanh networks decay quickly relative to ReLU. We provide further experimental results in Appendix C.3, including for CNNs, where we observe the same trends. We note that the effective rank has implications for the generalization error. The Rademacher complexity of a kernel method (and hence of the NTK model) within a parameter ball is determined by its trace (Bartlett & Mendelson, 2002). Since for the NTK $\lambda_{1}(\mathbf{K})=O(1)$, lower effective rank implies a smaller trace and hence limited complexity.

Figure 1: (Feedforward NTK Spectrum) We plot the normalized eigenvalues $\lambda_{p}/\lambda_{1}$ of the NTK Gram matrix $\mathbf{K}$ and the data Gram matrix $\mathbf{X}\mathbf{X}^{T}$ for Caltech101 and isotropic Gaussian datasets. To compute the NTK we randomly initialize feedforward networks of depths $2$ and $5$ with width $500$. We use the standard parameterization and PyTorch's default Kaiming uniform initialization in order to better connect our results with what is used in practice. We consider a batch size of $n=200$ and plot the first $100$ eigenvalues. The thick part of each curve corresponds to the mean across 10 trials, while the transparent part corresponds to the 95% confidence interval.

4.2 Analysis of the lower spectrum

In this section, we analyze the lower part of the spectrum using the power series. We first analyze the kernel function $K$, which we recall is a dot-product kernel of the form $K(x_{1},x_{2})=\sum_{p=0}^{\infty}c_{p}\langle x_{1},x_{2}\rangle^{p}$. Assuming the training data is uniformly distributed on a hypersphere, it was shown by Basri et al. (2019); Bietti & Mairal (2019) that the eigenfunctions of $K$ are the spherical harmonics. Azevedo & Menegatto (2015) gave the eigenvalues of the kernel $K$ in terms of the power series coefficients.

Theorem 4.6.

[Azevedo & Menegatto (2015)] Let $\Gamma$ denote the gamma function. Suppose that the training data are uniformly sampled from the unit hypersphere $\mathbb{S}^{d}$, $d\geq 2$. If the dot-product kernel function has the expansion $K(x_{1},x_{2})=\sum_{p=0}^{\infty}c_{p}\langle x_{1},x_{2}\rangle^{p}$ where $c_{p}\geq 0$, then the eigenvalue of every spherical harmonic of frequency $k$ is given by

\overline{\lambda_{k}}=\frac{\pi^{d/2}}{2^{k-1}}\sum_{\substack{p\geq k\\ p-k\text{ is even}}}c_{p}\frac{\Gamma(p+1)\Gamma(\frac{p-k+1}{2})}{\Gamma(p-k+1)\Gamma(\frac{p-k+1}{2}+k+d/2)}.
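The formula of Theorem 4.6 is easy to evaluate numerically; the sketch below (ours) does so using log-gamma for numerical stability, truncating the sum over $p$. The coefficient sequence used in the example is illustrative, not one arising from a particular network.

```python
import numpy as np
from scipy.special import gammaln

def sphere_eigenvalue(k, d, c):
    """lambda_bar_k from Theorem 4.6 for sum_p c[p] <x,y>^p on S^d (sum truncated at len(c)-1)."""
    total = 0.0
    for p in range(k, len(c)):
        if (p - k) % 2:                                   # only p with p - k even contribute
            continue
        log_term = (gammaln(p + 1) + gammaln((p - k + 1) / 2)
                    - gammaln(p - k + 1) - gammaln((p - k + 1) / 2 + k + d / 2))
        total += c[p] * np.exp(log_term)
    return np.pi ** (d / 2) / 2 ** (k - 1) * total

c = [2.0 ** (-p) for p in range(200)]                     # illustrative geometric decay c_p = 2^{-p}
print([sphere_eigenvalue(k, d=3, c=c) for k in range(5)])
```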

A proof of Theorem 4.6 is provided in Appendix C.4 for the reader's convenience. This theorem connects the coefficients $c_{p}$ of the kernel power series with the eigenvalues $\overline{\lambda_{k}}$ of the kernel. In particular, given a specific decay rate for the coefficients $c_{p}$ one may derive the decay rate of $\overline{\lambda_{k}}$: for example, Scetbon & Harchaoui (2021) examined the decay rate of $\overline{\lambda_{k}}$ when $c_{p}$ admits a polynomial or exponential decay. The following corollary summarizes the decay rates of $\overline{\lambda_{k}}$ corresponding to two-layer networks with different activations.

Corollary 4.7.

Under the same setting as in Theorem 4.6,

  1. if $c_{p}=\Theta(p^{-a})$ where $a\geq 1$, then $\overline{\lambda_{k}}=\Theta(k^{-d-2a+2})$,

  2. if $c_{p}=\delta_{(p\text{ even})}\Theta(p^{-a})$, then $\overline{\lambda_{k}}=\delta_{(k\text{ even})}\Theta(k^{-d-2a+2})$,

  3. if $c_{p}=\mathcal{O}\left(\exp\left(-a\sqrt{p}\right)\right)$, then $\overline{\lambda_{k}}=\mathcal{O}\left(k^{-d+1/2}\exp\left(-a\sqrt{k}\right)\right)$,

  4. if $c_{p}=\Theta(p^{1/2}a^{-p})$, then $\overline{\lambda_{k}}=\mathcal{O}\left(k^{-d+1}a^{-k}\right)$ and $\overline{\lambda_{k}}=\Omega\left(k^{-d/2+1}2^{-k}a^{-k}\right)$.

In addition to recovering existing results for ReLU networks (Basri et al., 2019; Velikanov & Yarotsky, 2021; Geifman et al., 2020; Bietti & Bach, 2021), Corollary 4.7 also provides the decay rates for two-layer networks with Tanh and Gaussian activations. As faster eigenvalue decay implies a smaller RKHS, Corollary 4.7 shows that using ReLU results in a larger RKHS relative to Tanh or Gaussian activations. Numerics for Corollary 4.7 are provided in Figure 4 in Appendix C.3. Finally, in Theorem 4.8 we relate a kernel's power series to its spectral decay for arbitrary data distributions.

Theorem 4.8 (Informal).

Let the rows of $\mathbf{X}\in\mathbb{R}^{n\times d}$ be arbitrary points on the unit sphere. Consider the kernel matrix $n\mathbf{K}=\sum_{p=0}^{\infty}c_{p}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot p}$ and let $r(n)\leq d$ denote the rank of $\mathbf{X}\mathbf{X}^{T}$. Then

  1. if $c_{p}=\mathcal{O}(p^{-\alpha})$ with $\alpha>r(n)+1$ for all $n\in\mathbb{Z}_{\geq 0}$, then $\lambda_{n}(\mathbf{K})=\mathcal{O}\left(n^{-\frac{\alpha-1}{r(n)}}\right)$,

  2. if $c_{p}=\mathcal{O}(e^{-\alpha\sqrt{p}})$, then $\lambda_{n}(\mathbf{K})=\mathcal{O}\left(n^{\frac{1}{2r(n)}}\exp\left(-\alpha^{\prime}n^{\frac{1}{2r(n)}}\right)\right)$ for any $\alpha^{\prime}<\alpha 2^{-1/2r(n)}$,

  3. if $c_{p}=\mathcal{O}(e^{-\alpha p})$, then $\lambda_{n}(\mathbf{K})=\mathcal{O}\left(\exp\left(-\alpha^{\prime}n^{\frac{1}{r(n)}}\right)\right)$ for any $\alpha^{\prime}<\alpha 2^{-1/2r(n)}$.

Although the presence of the factor $1/r(n)$ in the exponents of $n$ in these bounds is a weakness, Theorem 4.8 still illustrates how, in a highly general setting, the asymptotic decay of the coefficients of the power series ensures a certain asymptotic decay in the eigenvalues of the kernel matrix. A formal version of this result is provided in Appendix C.5 along with further discussion.

5 Conclusion

Using a power series expansion we derived a number of insights into both the outliers as well as the asymptotic decay of the spectrum of the NTK. We are able to perform our analysis without recourse to a high dimensional limit or the use of random matrix theory. Interesting avenues for future work include better characterizing the role of depth and performing the same analysis on networks with convolutional or residual layers.

Reproducibility Statement

To ensure reproducibility, we make the code public at https://github.com/bbowman223/data_ntk.

Acknowledgements and Disclosure of Funding

This project has been supported by ERC Grant 757983 and NSF CAREER Grant DMS-2145630.

References

  • Allen-Zhu et al. (2019) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 242–252. PMLR, 2019. URL https://proceedings.mlr.press/v97/allen-zhu19a.html.
  • Anthony & Bartlett (2002) Martin Anthony and Peter L. Bartlett. Neural Network Learning - Theoretical Foundations. Cambridge University Press, 2002. URL http://www.cambridge.org/gb/knowledge/isbn/item1154061/?site_locale=en_GB.
  • Arora et al. (2019a) Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 322–332. PMLR, 2019a. URL https://proceedings.mlr.press/v97/arora19a.html.
  • Arora et al. (2019b) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019b. URL https://proceedings.neurips.cc/paper/2019/file/dbc4d84bfcfe2284ba11beffb853a8c4-Paper.pdf.
  • Azevedo & Menegatto (2015) Douglas Azevedo and Valdir A Menegatto. Eigenvalues of dot-product kernels on the sphere. Proceeding Series of the Brazilian Society of Computational and Applied Mathematics, 3(1), 2015.
  • Bartlett & Mendelson (2002) Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, 2002. URL http://dblp.uni-trier.de/db/journals/jmlr/jmlr3.html#BartlettM02.
  • Basri et al. (2019) Ronen Basri, David W. Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32, pp.  4763–4772, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/5ac8bb8a7d745102a978c5f8ccdb61b8-Abstract.html.
  • Basri et al. (2020) Ronen Basri, Meirav Galun, Amnon Geifman, David Jacobs, Yoni Kasten, and Shira Kritchman. Frequency bias in neural networks for input of non-uniform density. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  685–694. PMLR, 2020. URL https://proceedings.mlr.press/v119/basri20a.html.
  • Bietti & Bach (2021) Alberto Bietti and Francis Bach. Deep equals shallow for ReLU networks in kernel regimes. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=aDjoksTpXOP.
  • Bietti & Mairal (2019) Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/c4ef9c39b300931b69a36fb3dbb8d60e-Paper.pdf.
  • Bowman & Montúfar (2022) Benjamin Bowman and Guido Montúfar. Implicit bias of MSE gradient optimization in underparameterized neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=VLgmhQDVBV.
  • Bowman & Montufar (2022) Benjamin Bowman and Guido Montufar. Spectral bias outside the training set for deep networks in the kernel regime. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=a01PL2gb7W5.
  • Caponnetto & De Vito (2007) Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
  • Chen & Xu (2021) Lin Chen and Sheng Xu. Deep neural tangent kernel and laplace kernel have the same RKHS. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=vK9WrZ0QYQ.
  • Cui et al. (2021) Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. In Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=Da_EHrAcfwd.
  • Daniely et al. (2016) Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/abea47ba24142ed16b7d8fbf2c740e0d-Paper.pdf.
  • Davis (2021) Tom Davis. A general expression for Hermite expansions with applications. 2021. doi: 10.13140/RG.2.2.30843.44325. URL https://www.researchgate.net/profile/Tom-Davis-2/publication/352374514_A_GENERAL_EXPRESSION_FOR_HERMITE_EXPANSIONS_WITH_APPLICATIONS/links/60c873c5a6fdcc8267cf74d4/A-GENERAL-EXPRESSION-FOR-HERMITE-EXPANSIONS-WITH-APPLICATIONS.pdf.
  • de G. Matthews et al. (2018) Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1-nGgWC-.
  • Du et al. (2019a) Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1675–1685. PMLR, 2019a. URL https://proceedings.mlr.press/v97/du19c.html.
  • Du et al. (2019b) Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=S1eK3i09YQ.
  • Engel et al. (2022) Andrew Engel, Zhichao Wang, Anand Sarwate, Sutanay Choudhury, and Tony Chiang. TorchNTK: A library for calculation of neural tangent kernels of PyTorch models. 2022.
  • Fan & Wang (2020) Zhou Fan and Zhichao Wang. Spectra of the conjugate kernel and neural tangent kernel for linear-width neural networks. In Advances in Neural Information Processing Systems, volume 33, pp.  7710–7721. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/572201a4497b0b9f02d4f279b09ec30d-Paper.pdf.
  • Folland (1999) G. B. Folland. Real analysis: Modern techniques and their applications. Wiley, New York, 1999.
  • Geifman et al. (2020) Amnon Geifman, Abhay Yadav, Yoni Kasten, Meirav Galun, David Jacobs, and Basri Ronen. On the similarity between the Laplace and neural tangent kernels. In Advances in Neural Information Processing Systems, volume 33, pp.  1451–1461. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1006ff12c465532f8c574aeaa4461b16-Paper.pdf.
  • Geifman et al. (2022) Amnon Geifman, Meirav Galun, David Jacobs, and Ronen Basri. On the spectral bias of convolutional neural tangent and gaussian process kernels. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=gthKzdymDu2.
  • Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp.  249–256. PMLR, 2010. URL https://proceedings.mlr.press/v9/glorot10a.html.
  • Han et al. (2022) Insu Han, Amir Zandieh, Jaehoon Lee, Roman Novak, Lechao Xiao, and Amin Karbasi. Fast neural kernel embeddings for general activations. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=yLilJ1vZgMe.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pp.  1026–1034, 2015.
  • Huang et al. (2022) Ningyuan Teresa Huang, David W. Hogg, and Soledad Villar. Dimensionality reduction, regularization, and generalization in overparameterized regressions. SIAM J. Math. Data Sci., 4(1):126–152, 2022. URL https://doi.org/10.1137/20m1387821.
  • Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.
  • Jin et al. (2022) Hui Jin, Pradeep Kr. Banerjee, and Guido Montúfar. Learning curves for gaussian process regression with power-law priors and targets. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=KeI9E-gsoB.
  • Karakida et al. (2020) Ryo Karakida, Shotaro Akaho, and Shun ichi Amari. Universal statistics of Fisher information in deep neural networks: mean field approach. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124005, 2020. URL https://doi.org/10.1088/1742-5468/abc62e.
  • Kazarinoff (1956) Donat K. Kazarinoff. On Wallis’ formula. Edinburgh Mathematical Notes, 40:19–21, 1956.
  • LeCun et al. (2012) Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient BackProp, pp.  9–48. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. URL https://doi.org/10.1007/978-3-642-35289-8_3.
  • Lee et al. (2018) Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.
  • Lee et al. (2019) Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/0d1a9651497a38d8b1c3871c84528bd4-Paper.pdf.
  • Lee et al. (2020) Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. In Advances in Neural Information Processing Systems, volume 33, pp.  15156–15172. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ad086f59924fffe0773f8d0ca22ea712-Paper.pdf.
  • Li et al. (2022) Li, Andreeto, Ranzato, and Perona. Caltech 101, Apr 2022.
  • Li et al. (2020) Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp.  4313–4324. PMLR, 2020. URL https://proceedings.mlr.press/v108/li20j.html.
  • Louart et al. (2018) Cosme Louart, Zhenyu Liao, and Romain Couillet. A random matrix approach to neural networks. The Annals of Applied Probability, 28(2):1190–1248, 2018. URL https://www.jstor.org/stable/26542333.
  • Mishkin & Matas (2016) Dmytro Mishkin and Jiri Matas. All you need is a good init. In Yoshua Bengio and Yann LeCun (eds.), 4th International Conference on Learning Representations, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.06422.
  • Murray et al. (2022) M. Murray, V. Abrol, and J. Tanner. Activation function design for deep networks: linearity and effective initialisation. Applied and Computational Harmonic Analysis, 59:117–154, 2022. URL https://www.sciencedirect.com/science/article/pii/S1063520321001111. Special Issue on Harmonic Analysis and Machine Learning.
  • Neal (1996) Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, Berlin, Heidelberg, 1996.
  • Nguyen (2021) Quynh Nguyen. On the proof of global convergence of gradient descent for deep relu networks with linear widths. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp.  8056–8062. PMLR, 2021. URL https://proceedings.mlr.press/v139/nguyen21a.html.
  • Nguyen & Mondelli (2020) Quynh Nguyen and Marco Mondelli. Global convergence of deep networks with one wide layer followed by pyramidal topology. In Advances in Neural Information Processing Systems, volume 33, pp.  11961–11972. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/8abfe8ac9ec214d68541fcb888c0b4c3-Paper.pdf.
  • Nguyen et al. (2021) Quynh Nguyen, Marco Mondelli, and Guido Montúfar. Tight bounds on the smallest eigenvalue of the neural tangent kernel for deep ReLU networks. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp.  8119–8129. PMLR, 2021. URL https://proceedings.mlr.press/v139/nguyen21g.html.
  • Novak et al. (2019) Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. In 7th International Conference on Learning Representations. OpenReview.net, 2019. URL https://openreview.net/forum?id=B1g30j0qF7.
  • Novak et al. (2022) Roman Novak, Jascha Sohl-Dickstein, and Samuel S Schoenholz. Fast finite width neural tangent kernel. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  17018–17044. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/novak22a.html.
  • O’Donnell (2014) Ryan O’Donnell. Analysis of Boolean functions. Cambridge University Press, 2014.
  • Oymak & Soltanolkotabi (2020) Samet Oymak and Mahdi Soltanolkotabi. Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1(1), 2020. URL https://par.nsf.gov/biblio/10200049.
  • Oymak et al. (2019) Samet Oymak, Zalan Fabian, Mingchen Li, and Mahdi Soltanolkotabi. Generalization guarantees for neural networks via harnessing the low-rank structure of the Jacobian. CoRR, abs/1906.05392, 2019. URL http://arxiv.org/abs/1906.05392.
  • Panigrahi et al. (2020) Abhishek Panigrahi, Abhishek Shetty, and Navin Goyal. Effect of activation functions on the training of overparametrized neural nets. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgfdeBYvH.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
  • Pennington & Worah (2017) Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/0f3d014eead934bbdbacb62a01dc4831-Paper.pdf.
  • Pennington & Worah (2018) Jeffrey Pennington and Pratik Worah. The spectrum of the Fisher information matrix of a single-hidden-layer neural network. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/18bb68e2b38e4a8ce7cf4f6b2625768c-Paper.pdf.
  • Poole et al. (2016) Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/148510031349642de5ca0c544f31b2ef-Paper.pdf.
  • Scetbon & Harchaoui (2021) Meyer Scetbon and Zaid Harchaoui. A spectral analysis of dot-product kernels. In International conference on artificial intelligence and statistics, pp.  3394–3402. PMLR, 2021.
  • Schoenholz et al. (2017) Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. In International Conference on Learning Representations (ICLR), 2017. URL https://openreview.net/pdf?id=H1W1UN9gg.
  • Schur (1911) J. Schur. Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen. Journal für die reine und angewandte Mathematik, 140:1–28, 1911. URL http://eudml.org/doc/149352.
  • Simon et al. (2022) James Benjamin Simon, Sajant Anand, and Mike Deweese. Reverse engineering the neural tangent kernel. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  20215–20231. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/simon22a.html.
  • Sontag & Sussmann (1989) Eduardo D. Sontag and Héctor J. Sussmann. Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems, 3:91–106, 1989.
  • Velikanov & Yarotsky (2021) Maksim Velikanov and Dmitry Yarotsky. Explicit loss asymptotics in the gradient descent training of neural networks. In Advances in Neural Information Processing Systems, volume 34, pp.  2570–2582. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/14faf969228fc18fcd4fcf59437b0c97-Paper.pdf.
  • Vershynin (2012) Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices, pp.  210–268. Cambridge University Press, 2012.
  • Weyl (1912) Hermann Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441–479, 1912. URL https://doi.org/10.1007/BF01456804.
  • Woodworth et al. (2020) Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp. 3635–3673. PMLR, 2020. URL https://proceedings.mlr.press/v125/woodworth20a.html.
  • Xie et al. (2017) Bo Xie, Yingyu Liang, and Le Song. Diverse Neural Network Learns True Target Functions. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp.  1216–1224. PMLR, 2017. URL https://proceedings.mlr.press/v54/xie17a.html.
  • Yang & Salman (2019) Greg Yang and Hadi Salman. A fine-grained spectral perspective on neural networks, 2019. URL https://arxiv.org/abs/1907.10599.
  • Zou & Gu (2019) Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/6a61d423d02a1c56250dc23ae7ff12f3-Paper.pdf.
  • Zou et al. (2020) Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep ReLU networks. Machine learning, 109(3):467–492, 2020.

Appendix

The appendix is organized as follows.

  • Appendix A gives background material on Gaussian kernels, the NTK, unit variance initialization, and Hermite polynomial expansions.

  • Appendix B provides details for Section 3.

  • Appendix C provides details for Section 4.

Appendix A Background material

A.1 Gaussian kernel

Observe by construction that the flattened collection of preactivations at the first layer (g(1)(𝐱i))i=1n(g^{(1)}(\mathbf{x}_{i}))_{i=1}^{n} form a centered Gaussian process, with the covariance between the α\alphath and β\betath neuron being described by

Σαβ(1)(𝐱i,𝐱j):=𝔼[gα(1)(𝐱i)gβ(1)(𝐱j)]=δα=β(γw2𝐱iT𝐱j+γb2).\Sigma_{\alpha_{\beta}}^{(1)}(\mathbf{x}_{i},\mathbf{x}_{j})\vcentcolon=\mathbb{E}[g^{(1)}_{\alpha}(\mathbf{x}_{i})g^{(1)}_{\beta}(\mathbf{x}_{j})]=\delta_{\alpha=\beta}\left(\gamma_{w}^{2}\mathbf{x}_{i}^{T}\mathbf{x}_{j}+\gamma_{b}^{2}\right).

Under Assumption 1, the preactivations at each layer l[L+1]l\in[L+1] also converge in distribution to centered Gaussian processes (Neal, 1996; Lee et al., 2018). We remark that the sequential width limit condition of Assumption 1 is not necessary for this behavior; for example, the same result can be derived in the setting where the widths of the network are sent to infinity simultaneously under certain conditions on the activation function (de G. Matthews et al., 2018). However, as our interests lie in analyzing the limit rather than the conditions for convergence to said limit, for simplicity we consider only the sequential width limit. As per Lee et al. (2018, Eq. 4), the covariance between the preactivations of the α\alphath and β\betath neurons at layer l2l\geq 2 for any input pair 𝐱,𝐲\mathbf{x},\mathbf{y}\in\mathbb{R}^{d} is described by the following kernel,

Σαβ(l)(𝐱,𝐲)\displaystyle\Sigma_{\alpha_{\beta}}^{(l)}(\mathbf{x},\mathbf{y}) :=𝔼[gα(l)(𝐱)gβ(l)(𝐲)]\displaystyle\vcentcolon=\mathbb{E}[g^{(l)}_{\alpha}(\mathbf{x})g^{(l)}_{\beta}(\mathbf{y})]
=δα=β(σw2𝔼g(l1)𝒢𝒫(0,Σl1)[ϕ(gα(l1)(𝐱))ϕ(gβ(l1)(𝐲))]+σb2).\displaystyle=\delta_{\alpha=\beta}\left(\sigma_{w}^{2}\mathbb{E}_{g^{(l-1)}\sim\mathcal{G}\mathcal{P}(0,\Sigma^{l-1})}[\phi(g^{(l-1)}_{\alpha}(\mathbf{x}))\phi(g^{(l-1)}_{\beta}(\mathbf{y}))]+\sigma_{b}^{2}\right).

We refer to this kernel as the Gaussian kernel. As each neuron is identically distributed and the covariance between pairs of neurons is 0 unless α=β\alpha=\beta, moving forward we drop the subscript and discuss only the covariance between the preactivations of an arbitrary neuron given two inputs. As per the discussion by Lee et al. (2018, Section 2.3), the expectations involved in the computation of these Gaussian kernels can be computed with respect to a bivariate Gaussian distribution, whose covariance matrix has three distinct entries: the variance of a preactivation of 𝐱\mathbf{x} at the previous layer, Σ(l1)(𝐱,𝐱)\Sigma^{(l-1)}(\mathbf{x},\mathbf{x}), the variance of a preactivation of 𝐲\mathbf{y} at the previous layer, Σ(l1)(𝐲,𝐲)\Sigma^{(l-1)}(\mathbf{y},\mathbf{y}), and the covariance between preactivations of 𝐱\mathbf{x} and 𝐲\mathbf{y}, Σ(l1)(𝐱,𝐲)\Sigma^{(l-1)}(\mathbf{x},\mathbf{y}). Therefore the Gaussian kernel, or covariance function, and its derivative, which we will require later for our analysis of the NTK, can be computed via the following recurrence relations, see for instance (Lee et al., 2018; Jacot et al., 2018; Arora et al., 2019b; Nguyen et al., 2021),

Σ(1)(𝐱,𝐲)=γw2𝐱T𝐲+γb2,\displaystyle\Sigma^{(1)}(\mathbf{x},\mathbf{y})=\gamma_{w}^{2}\mathbf{x}^{T}\mathbf{y}+\gamma_{b}^{2}, (11)
𝐀(l)(𝐱,𝐲)=[Σ(l1)(𝐱,𝐱)Σ(l1)(𝐱,𝐲)Σ(l1)(𝐲,𝐱)Σ(l1)(𝐲,𝐲)]\displaystyle\mathbf{A}^{(l)}(\mathbf{x},\mathbf{y})=\begin{bmatrix}\Sigma^{(l-1)}(\mathbf{x},\mathbf{x})&\Sigma^{(l-1)}(\mathbf{x},\mathbf{y})\\ \Sigma^{(l-1)}(\mathbf{y},\mathbf{x})&\Sigma^{(l-1)}(\mathbf{y},\mathbf{y})\end{bmatrix}
Σ(l)(𝐱,𝐲)=σw2𝔼(B1,B2)𝒩(0,𝐀(l)(𝐱,𝐲))[ϕ(B1)ϕ(B2)]+σb2,\displaystyle\Sigma^{(l)}(\mathbf{x},\mathbf{y})=\sigma_{w}^{2}\mathbb{E}_{(B_{1},B_{2})\sim\mathcal{N}(0,\mathbf{A}^{(l)}(\mathbf{x},\mathbf{y}))}[\phi(B_{1})\phi(B_{2})]+\sigma_{b}^{2},
Σ˙(l)(𝐱,𝐲)=σw2𝔼(B1,B2)𝒩(0,𝐀(l)(𝐱,𝐲))[ϕ(B1)ϕ(B2)].\displaystyle\dot{\Sigma}^{(l)}(\mathbf{x},\mathbf{y})=\sigma_{w}^{2}\mathbb{E}_{(B_{1},B_{2})\sim\mathcal{N}(0,\mathbf{A}^{(l)}(\mathbf{x},\mathbf{y}))}\left[\phi^{\prime}(B_{1})\phi^{\prime}(B_{2})\right].

A.2 Neural Tangent Kernel (NTK)

As discussed in Section 1, under Assumption 1 Θ~(l)\tilde{\Theta}^{(l)} converges in probability to a deterministic limit, which we denote Θ(l)\Theta^{(l)}. This deterministic limit kernel can be expressed in terms of the Gaussian kernels and their derivatives from Section A.1 via the following recurrence relationships (Jacot et al., 2018, Theorem 1),

Θ(1)(𝐱,𝐲)\displaystyle\Theta^{(1)}(\mathbf{x},\mathbf{y}) =Σ(1)(𝐱,𝐲),\displaystyle=\Sigma^{(1)}(\mathbf{x},\mathbf{y}), (12)
Θ(l)(𝐱,𝐲)\displaystyle\Theta^{(l)}(\mathbf{x},\mathbf{y}) =Θ(l1)(𝐱,𝐲)Σ˙(l)(𝐱,𝐲)+Σ(l)(𝐱,𝐲)\displaystyle=\Theta^{(l-1)}(\mathbf{x},\mathbf{y})\dot{\Sigma}^{(l)}(\mathbf{x},\mathbf{y})+\Sigma^{(l)}(\mathbf{x},\mathbf{y})
=Σ(l)(𝐱,𝐲)+h=1l1Σ(h)(𝐱,𝐲)(h=h+1lΣ˙(h)(𝐱,𝐲))l[2,L+1].\displaystyle=\Sigma^{(l)}(\mathbf{x},\mathbf{y})+\sum_{h=1}^{l-1}\Sigma^{(h)}(\mathbf{x},\mathbf{y})\left(\prod_{h^{\prime}=h+1}^{l}\dot{\Sigma}^{(h^{\prime})}(\mathbf{x},\mathbf{y})\right)\;\forall l\in[2,L+1].
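
For concreteness, the following is a minimal numerical sketch of the recursions (11) and (12) for a single input pair, with the bivariate Gaussian expectations estimated by Monte Carlo sampling. The activation (ReLU), the hyperparameter choices γw2=1, γb2=σb2=0, σw2=2 and the sample size are illustrative and not prescribed by the text.

```python
import numpy as np

# Minimal Monte Carlo sketch of the recursions (11)-(12) for a single input pair (x, y).
# The activation (ReLU), the hyperparameters gw2=1, gb2=sb2=0, sw2=2 and the sample size
# are illustrative choices, not prescribed by the text.
rng = np.random.default_rng(0)

def ntk_pair(x, y, phi, dphi, L, gw2=1.0, gb2=0.0, sw2=2.0, sb2=0.0, n_mc=200_000):
    # Layer 1: Sigma^{(1)} and Theta^{(1)} as in (11) and (12).
    Sxx, Syy, Sxy = gw2 * x @ x + gb2, gw2 * y @ y + gb2, gw2 * x @ y + gb2
    theta = Sxy
    for _ in range(2, L + 2):                              # layers l = 2, ..., L+1
        A = np.array([[Sxx, Sxy], [Sxy, Syy]])             # A^{(l)}, built from layer l-1
        B = rng.multivariate_normal([0.0, 0.0], A, n_mc)   # (B1, B2) ~ N(0, A^{(l)})
        Sxy_new = sw2 * np.mean(phi(B[:, 0]) * phi(B[:, 1])) + sb2
        Sdot = sw2 * np.mean(dphi(B[:, 0]) * dphi(B[:, 1]))
        Z = rng.standard_normal(n_mc)                      # diagonal entries only need a 1d Gaussian
        Sxx_new = sw2 * np.mean(phi(np.sqrt(Sxx) * Z) ** 2) + sb2
        Syy_new = sw2 * np.mean(phi(np.sqrt(Syy) * Z) ** 2) + sb2
        theta = theta * Sdot + Sxy_new                     # Theta^{(l)} = Theta^{(l-1)} Sigma_dot^{(l)} + Sigma^{(l)}
        Sxx, Syy, Sxy = Sxx_new, Syy_new, Sxy_new
    return theta

relu, drelu = (lambda z: np.maximum(z, 0.0)), (lambda z: (z > 0).astype(float))
x, y = np.array([1.0, 0.0]), np.array([0.6, 0.8])          # unit-norm inputs, as in Assumption 2
print(ntk_pair(x, y, relu, drelu, L=2))
```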

A useful expression for the NTK matrix, which is a straightforward extension and generalization of Nguyen et al. (2021, Lemma 3.1), is provided in Lemma A.1 below.

Lemma A.1.

(Based on Nguyen et al. 2021, Lemma 3.1) Under Assumption 1, a sequence of positive semidefinite matrices (𝐆l)l=1L+1(\mathbf{G}_{l})_{l=1}^{L+1} in n×n\mathbb{R}^{n\times n}, and the related sequence (𝐆˙l)l=2L+1(\dot{\mathbf{G}}_{l})_{l=2}^{L+1} also in n×n\mathbb{R}^{n\times n}, can be constructed via the following recurrence relationships,

𝐆1\displaystyle\mathbf{G}_{1} =γw2𝐗𝐗T+γb21n×n,\displaystyle=\gamma_{w}^{2}\mathbf{X}\mathbf{X}^{T}+\gamma_{b}^{2}\textbf{1}_{n\times n}, (13)
𝐆2\displaystyle\mathbf{G}_{2} =σw2𝔼𝐰𝒩(0,Id)[ϕ(𝐗𝐰)ϕ(𝐗𝐰)T]+σb21n×n,\displaystyle=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{d})}[\phi(\mathbf{X}\mathbf{w})\phi(\mathbf{X}\mathbf{w})^{T}]+\sigma_{b}^{2}\textbf{1}_{n\times n},
𝐆˙2\displaystyle\dot{\mathbf{G}}_{2} =σw2𝔼𝐰𝒩(0,Id)[ϕ(𝐗𝐰)ϕ(𝐗𝐰)T],\displaystyle=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{d})}[\phi^{\prime}(\mathbf{X}\mathbf{w})\phi^{\prime}(\mathbf{X}\mathbf{w})^{T}],
𝐆l\displaystyle\mathbf{G}_{l} =σw2𝔼𝐰𝒩(0,In)[ϕ(𝐆l1𝐰)ϕ(𝐆l1𝐰)T]+σb21n×n,l[3,L+1],\displaystyle=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{n})}[\phi(\sqrt{\mathbf{G}_{l-1}}\mathbf{w})\phi(\sqrt{\mathbf{G}_{l-1}}\mathbf{w})^{T}]+\sigma_{b}^{2}\textbf{1}_{n\times n},\;l\in[3,L+1],
𝐆˙l\displaystyle\dot{\mathbf{G}}_{l} =σw2𝔼𝐰𝒩(0,In)[ϕ(𝐆l1𝐰)ϕ(𝐆l1𝐰)T],l[3,L+1].\displaystyle=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{n})}[\phi^{\prime}(\sqrt{\mathbf{G}_{l-1}}\mathbf{w})\phi^{\prime}(\sqrt{\mathbf{G}_{l-1}}\mathbf{w})^{T}],\;l\in[3,L+1].

The sequence of NTK matrices (𝐊l)l=1L+1(\mathbf{K}_{l})_{l=1}^{L+1} can in turn be written using the following recurrence relationship,

n𝐊1\displaystyle n\mathbf{K}_{1} =𝐆1,\displaystyle=\mathbf{G}_{1}, (14)
n𝐊l\displaystyle n\mathbf{K}_{l} =𝐆l+n𝐊l1𝐆˙l\displaystyle=\mathbf{G}_{l}+n\mathbf{K}_{l-1}\odot\dot{\mathbf{G}}_{l}
=𝐆l+i=1l1(𝐆i(j=i+1l𝐆˙j)).\displaystyle=\mathbf{G}_{l}+\sum_{i=1}^{l-1}\left(\mathbf{G}_{i}\odot\left(\odot_{j=i+1}^{l}\dot{\mathbf{G}}_{j}\right)\right).
Proof.

For the sequence (𝐆l)l=1L+1(\mathbf{G}_{l})_{l=1}^{L+1} it suffices to prove for any i,j[n]i,j\in[n] and l[L+1]l\in[L+1] that

[𝐆l]i,j=Σ(l)(𝐱i,𝐱j)[\mathbf{G}_{l}]_{i,j}=\Sigma^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j})

and 𝐆l\mathbf{G}_{l} is positive semi-definite. We proceed by induction. Considering the base case l=1l=1 and comparing (13) with (11), it is evident that

[𝐆1]i,j=Σ(1)(𝐱i,𝐱j).[\mathbf{G}_{1}]_{i,j}=\Sigma^{(1)}(\mathbf{x}_{i},\mathbf{x}_{j}).

In addition, 𝐆1\mathbf{G}_{1} is also clearly positive semi-definite as for any 𝐮n\mathbf{u}\in\mathbb{R}^{n}

𝐮T𝐆1𝐮=γw2𝐗T𝐮2+γb21nT𝐮20.\mathbf{u}^{T}\mathbf{G}_{1}\mathbf{u}=\gamma_{w}^{2}\left\lVert\mathbf{X}^{T}\mathbf{u}\right\rVert^{2}+\gamma_{b}^{2}\left\lVert\textbf{1}_{n}^{T}\mathbf{u}\right\rVert^{2}\geq 0.

We now assume the induction hypothesis is true for 𝐆l1\mathbf{G}_{l-1}. We will need to distinguish slightly between two cases, l=2l=2 and l[3,L+1]l\in[3,L+1]. The proof of the induction step in either case is identical. To this end, and for notational ease, let 𝐕=𝐗\mathbf{V}=\mathbf{X}, 𝐰𝒩(0,Id)\mathbf{w}\sim\mathcal{N}(0,\textbf{I}_{d}) when l=2l=2, and 𝐕=𝐆l1\mathbf{V}=\sqrt{\mathbf{G}_{l-1}}, 𝐰𝒩(0,In)\mathbf{w}\sim\mathcal{N}(0,\textbf{I}_{n}) for l[3,L+1]l\in[3,L+1]. In either case we let 𝐯i\mathbf{v}_{i} denote the iith row of 𝐕\mathbf{V}. For any i,j[n]i,j\in[n]

[𝐆l]ij=σw2𝔼𝐰[ϕ(𝐯iT𝐰)ϕ(𝐯jT𝐰)]+σb2.[\mathbf{G}_{l}]_{ij}=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}}[\phi(\mathbf{v}_{i}^{T}\mathbf{w})\phi(\mathbf{v}_{j}^{T}\mathbf{w})]+\sigma_{b}^{2}.

Now let B1=𝐯iT𝐰B_{1}=\mathbf{v}_{i}^{T}\mathbf{w}, B2=𝐯jT𝐰B_{2}=\mathbf{v}_{j}^{T}\mathbf{w} and observe for any α1,α2\alpha_{1},\alpha_{2}\in\mathbb{R} that α1B1+α2B2=kn(α1vik+α2vjk)wk𝒩(0,α1𝐯i+α2𝐯j2)\alpha_{1}B_{1}+\alpha_{2}B_{2}=\sum_{k}^{n}(\alpha_{1}v_{ik}+\alpha_{2}v_{jk})w_{k}\sim\mathcal{N}(0,\left\lVert\alpha_{1}\mathbf{v}_{i}+\alpha_{2}\mathbf{v}_{j}\right\rVert^{2}). Therefore the joint distribution of (B1,B2)(B_{1},B_{2}) is a mean 0 bivariate normal distribution. Denoting the covariance matrix of this distribution as 𝐀~2×2\tilde{\mathbf{A}}\in\mathbb{R}^{2\times 2}, then [𝐆l]ij[\mathbf{G}_{l}]_{ij} can be expressed as

[𝐆l]ij=σw2𝔼(B1,B2)𝒩(0,𝐀~)[ϕ(B1)ϕ(B2)]+σb2.[\mathbf{G}_{l}]_{ij}=\sigma_{w}^{2}\mathbb{E}_{(B_{1},B_{2})\sim\mathcal{N}(0,\tilde{\mathbf{A}})}[\phi(B_{1})\phi(B_{2})]+\sigma_{b}^{2}.

To prove [𝐆l]i,j=Σ(l)(𝐱i,𝐱j)[\mathbf{G}_{l}]_{i,j}=\Sigma^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j}) it therefore suffices to show that 𝐀~=𝐀(l)\tilde{\mathbf{A}}=\mathbf{A}^{(l)} as per (11). This follows by the induction hypothesis as

𝔼[B12]\displaystyle\mathbb{E}[B_{1}^{2}] =𝐯iT𝐯i=[𝐆l1]ii=Σ(l1)(𝐱i,𝐱i),\displaystyle=\mathbf{v}_{i}^{T}\mathbf{v}_{i}=[\mathbf{G}_{l-1}]_{ii}=\Sigma^{(l-1)}(\mathbf{x}_{i},\mathbf{x}_{i}),
𝔼[B22]\displaystyle\mathbb{E}[B_{2}^{2}] =𝐯jT𝐯j=[𝐆l1]jj=Σ(l1)(𝐱j,𝐱j),\displaystyle=\mathbf{v}_{j}^{T}\mathbf{v}_{j}=[\mathbf{G}_{l-1}]_{jj}=\Sigma^{(l-1)}(\mathbf{x}_{j},\mathbf{x}_{j}),
𝔼[B1B2]\displaystyle\mathbb{E}[B_{1}B_{2}] =𝐯iT𝐯j=[𝐆l1]ij=Σ(l1)(𝐱i,𝐱j).\displaystyle=\mathbf{v}_{i}^{T}\mathbf{v}_{j}=[\mathbf{G}_{l-1}]_{ij}=\Sigma^{(l-1)}(\mathbf{x}_{i},\mathbf{x}_{j}).

Finally, 𝐆l\mathbf{G}_{l} is positive semi-definite as long as 𝔼𝐰[ϕ(𝐕𝐰)ϕ(𝐕𝐰)T]\mathbb{E}_{\mathbf{w}}[\phi(\mathbf{V}\mathbf{w})\phi(\mathbf{V}\mathbf{w})^{T}] is positive semi-definite. Let M(𝐰)=ϕ(𝐕𝐰)nM(\mathbf{w})=\phi(\mathbf{V}\mathbf{w})\in\mathbb{R}^{n} and observe for any 𝐰\mathbf{w} that M(𝐰)M(𝐰)TM(\mathbf{w})M(\mathbf{w})^{T} is positive semi-definite. Therefore 𝔼𝐰[M(𝐰)M(𝐰)T]\mathbb{E}_{\mathbf{w}}[M(\mathbf{w})M(\mathbf{w})^{T}] must also be positive semi-definite. Thus the inductive step is complete and we may conclude for l[L+1]l\in[L+1] that

[𝐆l]i,j=Σ(l)(𝐱i,𝐱j).[\mathbf{G}_{l}]_{i,j}=\Sigma^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j}). (15)

For the proof of the expression for the sequence (𝐆˙l)l=2L+1(\dot{\mathbf{G}}_{l})_{l=2}^{L+1} it suffices to prove for any i,j[n]i,j\in[n] and l[L+1]l\in[L+1] that

[𝐆˙l]i,j=Σ˙(l)(𝐱i,𝐱j).[\dot{\mathbf{G}}_{l}]_{i,j}=\dot{\Sigma}^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j}).

By comparing (13) with (11) this follows immediately from (15). Therefore with (13) proven (14) follows from (12). ∎
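
As a numerical sanity check on Lemma A.1, the following is a minimal Monte Carlo sketch of the matrix recursions (13) and (14), using a symmetric matrix square root for the layers l≥3. The ReLU activation, the hyperparameters and the sample size are illustrative choices.

```python
import numpy as np
from scipy.linalg import sqrtm

# Minimal Monte Carlo sketch of the matrix recursions (13)-(14) of Lemma A.1. The ReLU
# activation, the hyperparameters and the sample size are illustrative choices; under
# Assumption 2 the rows of the matrix square root have (approximately) unit norm.
rng = np.random.default_rng(0)

def ntk_matrix(X, phi, dphi, L, gw2=1.0, gb2=0.0, sw2=2.0, sb2=0.0, n_mc=100_000):
    n = X.shape[0]
    ones = np.ones((n, n))
    G = gw2 * X @ X.T + gb2 * ones                  # G_1
    nK = G.copy()                                   # n K_1
    for l in range(2, L + 2):
        A = X if l == 2 else np.real(sqrtm(G))      # use X at l = 2, sqrt(G_{l-1}) for l >= 3
        P = A @ rng.standard_normal((A.shape[1], n_mc))
        G = sw2 * (phi(P) @ phi(P).T) / n_mc + sb2 * ones    # G_l
        Gdot = sw2 * (dphi(P) @ dphi(P).T) / n_mc            # Gdot_l
        nK = G + nK * Gdot                          # n K_l = G_l + n K_{l-1} (Hadamard) Gdot_l
    return nK / n                                   # K_{L+1}

relu, drelu = (lambda z: np.maximum(z, 0.0)), (lambda z: (z > 0).astype(float))
X = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])  # unit-norm rows
print(ntk_matrix(X, relu, drelu, L=2))
```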

A.3 Unit variance initialization

The initialization scheme for a neural network, particularly a deep neural network, needs to be designed with some care in order to avoid either vanishing or exploding gradients during training (Glorot & Bengio, 2010; He et al., 2015; Mishkin & Matas, 2016; LeCun et al., 2012). Some of the most popular initialization strategies used in practice today, in particular LeCun et al. (2012) and Glorot & Bengio (2010) initialization, first model the preactivations of the network as Gaussian random variables and then select the network hyperparameters so that the variance of these idealized preactivations is fixed at one. Under Assumption 1 this idealized model of the preactivations is actually realized, and if we additionally assume that the conditions of Assumption 2 hold, then likewise the variance of the preactivations at every layer will be fixed at one. To this end, and as in Poole et al. (2016); Murray et al. (2022), consider the function V:00V\colon\mathbb{R}_{\geq 0}\rightarrow\mathbb{R}_{\geq 0} defined as

V(q)=σw2𝔼Z𝒩(0,1)[ϕ(qZ)2]+σb2.V(q)=\sigma_{w}^{2}\mathbb{E}_{Z\sim\mathcal{N}(0,1)}\left[\phi\left(\sqrt{q}Z\right)^{2}\right]+\sigma_{b}^{2}. (16)

Noting that VV is another expression for Σ(l)(𝐱,𝐱)\Sigma^{(l)}(\mathbf{x},\mathbf{x}), derived via a change of variables as per Poole et al. (2016), the sequence of variances (Σ(l)(𝐱,𝐱))l=2L(\Sigma^{(l)}(\mathbf{x},\mathbf{x}))_{l=2}^{L} can therefore be generated as follows,

Σ(l)(𝐱,𝐱)=V(Σ(l1)(𝐱,𝐱)).\Sigma^{(l)}(\mathbf{x},\mathbf{x})=V(\Sigma^{(l-1)}(\mathbf{x},\mathbf{x})). (17)

The linear correlation ρ(l):d×d[1,1]\rho^{(l)}:\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow[-1,1] between the preactivations of two inputs 𝐱,𝐲d\mathbf{x},\mathbf{y}\in\mathbb{R}^{d} we define as

ρ(l)(𝐱,𝐲)=Σ(l)(𝐱,𝐲)Σ(l)(𝐱,𝐱)Σ(l)(𝐲,𝐲).\displaystyle\rho^{(l)}(\mathbf{x},\mathbf{y})=\frac{\Sigma^{(l)}(\mathbf{x},\mathbf{y})}{\sqrt{\Sigma^{(l)}(\mathbf{x},\mathbf{x})\Sigma^{(l)}(\mathbf{y},\mathbf{y})}}. (18)

Assuming Σ(l)(𝐱,𝐱)=Σ(l)(𝐲,𝐲)=1\Sigma^{(l)}(\mathbf{x},\mathbf{x})=\Sigma^{(l)}(\mathbf{y},\mathbf{y})=1 for all l[L+1]l\in[L+1], then ρ(l)(𝐱,𝐲)=Σ(l)(𝐱,𝐲)\rho^{(l)}(\mathbf{x},\mathbf{y})=\Sigma^{(l)}(\mathbf{x},\mathbf{y}). Again as in Murray et al. (2022) and analogous to (16), with Z1,Z2𝒩(0,1)Z_{1},Z_{2}\sim\mathcal{N}(0,1) independent, U1:=Z1U_{1}\vcentcolon=Z_{1} and U2(ρ):=(ρZ1+1ρ2Z2)U_{2}(\rho)\vcentcolon=(\rho Z_{1}+\sqrt{1-\rho^{2}}Z_{2}) (we remark that U1,U2U_{1},U_{2} are dependent but identically distributed, with U1,U2𝒩(0,1)U_{1},U_{2}\sim\mathcal{N}(0,1)), we define the correlation function R:[1,1][1,1]R:[-1,1]\rightarrow[-1,1] as

R(ρ)=σw2𝔼[ϕ(U1)ϕ(U2(ρ))]+σb2.\displaystyle R(\rho)=\sigma_{w}^{2}\mathbb{E}[\phi(U_{1})\phi(U_{2}(\rho))]+\sigma_{b}^{2}. (19)

Noting under these assumptions that RR is equivalent to Σ(l)(𝐱,𝐲)\Sigma^{(l)}(\mathbf{x},\mathbf{y}), the sequence of correlations (ρ(l)(𝐱,𝐲))l=2L(\rho^{(l)}(\mathbf{x},\mathbf{y}))_{l=2}^{L} can thus be generated as

ρ(l)(𝐱,𝐲)=R(ρ(l1)(𝐱,𝐲)).\rho^{(l)}(\mathbf{x},\mathbf{y})=R(\rho^{(l-1)}(\mathbf{x},\mathbf{y})).

As observed in Poole et al. (2016); Schoenholz et al. (2017), R(1)=V(1)=1R(1)=V(1)=1, hence ρ=1\rho=1 is a fixed point of RR. We remark that as all preactivations are distributed as 𝒩(0,1)\mathcal{N}(0,1), then a correlation of one between preactivations implies they are equal. The stability of the fixed point ρ=1\rho=1 is of particular significance in the context of initializing deep neural networks successfully. Under mild conditions on the activation function one can compute the derivative of RR, see e.g., Poole et al. (2016); Schoenholz et al. (2017); Murray et al. (2022), as follows,

R(ρ)=σw2𝔼[ϕ(U1)ϕ(U2(ρ))].\displaystyle R^{\prime}(\rho)=\sigma_{w}^{2}\mathbb{E}[\phi^{\prime}(U_{1})\phi^{\prime}(U_{2}(\rho))]. (20)

Observe that the expressions for Σ˙(l)\dot{\Sigma}^{(l)} and RR^{\prime} are equivalent via a change of variables (Poole et al., 2016), and therefore the sequence of correlation derivatives may be computed as

Σ˙(l)(𝐱,𝐲)=R(ρ(l1)(𝐱,𝐲)).\dot{\Sigma}^{(l)}(\mathbf{x},\mathbf{y})=R^{\prime}(\rho^{(l-1)}(\mathbf{x},\mathbf{y})).
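
The following is a small quadrature sketch of V, R and R′ in (16), (19) and (20) for ReLU with σw2=2 and σb2=0, one choice satisfying Assumption 2; it numerically confirms V(1)=R(1)=1 and R′(1)=χ=1 for this choice. The quadrature degree and helper names are illustrative.

```python
import numpy as np

# Quadrature sketch of V, R and R' in (16), (19), (20) for ReLU with sw2 = 2, sb2 = 0,
# one choice satisfying Assumption 2; it checks V(1) = R(1) = 1 and R'(1) = chi = 1.
# The quadrature degree is an illustrative choice.
z, w = np.polynomial.hermite_e.hermegauss(80)       # nodes/weights for the weight e^{-x^2/2}
w = w / np.sqrt(2 * np.pi)                          # so that sum(w * f(z)) ~ E[f(Z)], Z ~ N(0,1)
phi, dphi = (lambda t: np.maximum(t, 0.0)), (lambda t: (t > 0).astype(float))
sw2, sb2 = 2.0, 0.0

def V(q):                                           # eq. (16)
    return sw2 * np.sum(w * phi(np.sqrt(q) * z) ** 2) + sb2

def R(rho, f=phi, bias=True):                       # eq. (19); with f = dphi, bias=False it gives (20)
    Z1, Z2 = np.meshgrid(z, z, indexing="ij")
    W2 = np.outer(w, w)
    U1, U2 = Z1, rho * Z1 + np.sqrt(1.0 - rho ** 2) * Z2
    return sw2 * np.sum(W2 * f(U1) * f(U2)) + (sb2 if bias else 0.0)

print(V(1.0), R(1.0), R(1.0, f=dphi, bias=False))   # all approximately 1 for this choice
```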

With the relevant background material now in place we are in a position to prove Lemma A.2.

Lemma A.2.

Under Assumptions 1 and 2 and defining χ=σw2𝔼Z𝒩(0,1)[ϕ(Z)2]>0\chi=\sigma_{w}^{2}\mathbb{E}_{Z\sim\mathcal{N}(0,1)}[\phi^{\prime}(Z)^{2}]\in\mathbb{R}_{>0}, then for all i,j[n]i,j\in[n], l[L+1]l\in[L+1]

  • [𝐆n,l]ij[1,1][\mathbf{G}_{n,l}]_{ij}\in[-1,1] and [𝐆n,l]ii=1[\mathbf{G}_{n,l}]_{ii}=1,

  • [𝐆˙n,l]ij[χ,χ][\dot{\mathbf{G}}_{n,l}]_{ij}\in[-\chi,\chi] and [𝐆˙n,l]ii=χ[\dot{\mathbf{G}}_{n,l}]_{ii}=\chi.

Furthermore, the NTK is a dot product kernel, meaning Θ(𝐱i,𝐱j)\Theta(\mathbf{x}_{i},\mathbf{x}_{j}) can be written as a function of the inner product between the two inputs, Θ(𝐱iT𝐱j)\Theta(\mathbf{x}_{i}^{T}\mathbf{x}_{j}).

Proof.

Recall from Lemma A.1 and its proof that for any l[L+1]l\in[L+1], i,j[n]i,j\in[n] [𝐆n,l]ij=Σ(l)(𝐱i,𝐱j)[\mathbf{G}_{n,l}]_{ij}=\Sigma^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j}) and [𝐆˙n,l]ij=Σ˙(l)(𝐱i,𝐱j)[\dot{\mathbf{G}}_{n,l}]_{ij}=\dot{\Sigma}^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j}). We first prove by induction Σ(l)(𝐱i,𝐱i)=1\Sigma^{(l)}(\mathbf{x}_{i},\mathbf{x}_{i})=1 for all l[L+1]l\in[L+1]. The base case l=1l=1 follows as

Σ(1)(𝐱,𝐱)=γw2𝐱T𝐱+γb2=γw2+γb2=1.\Sigma^{(1)}(\mathbf{x},\mathbf{x})=\gamma_{w}^{2}\mathbf{x}^{T}\mathbf{x}+\gamma_{b}^{2}=\gamma_{w}^{2}+\gamma_{b}^{2}=1.

Assume the induction hypothesis is true for layer l1l-1. With Z𝒩(0,1)Z\sim\mathcal{N}(0,1), then from (16) and (17)

Σ(l)(𝐱,𝐱)\displaystyle\Sigma^{(l)}(\mathbf{x},\mathbf{x}) =V(Σ(l1)(𝐱,𝐱))\displaystyle=V(\Sigma^{(l-1)}(\mathbf{x},\mathbf{x}))
=σw2𝔼[ϕ2(Σ(l1)(𝐱,𝐱)Z)]+σb2\displaystyle=\sigma_{w}^{2}\mathbb{E}\left[\phi^{2}\left(\sqrt{\Sigma^{(l-1)}(\mathbf{x},\mathbf{x})}Z\right)\right]+\sigma_{b}^{2}
=σw2𝔼[ϕ2(Z)]+σb2\displaystyle=\sigma_{w}^{2}\mathbb{E}\left[\phi^{2}\left(Z\right)\right]+\sigma_{b}^{2}
=1,\displaystyle=1,

thus the inductive step is complete. As an immediate consequence it follows that [𝐆l]ii=1[\mathbf{G}_{l}]_{ii}=1. Also, for any i,j[n]i,j\in[n] and l[L+1]l\in[L+1],

Σ(l)(𝐱i,𝐱j)\displaystyle\Sigma^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j}) =ρ(l)(𝐱i,𝐱j)=R(ρ(l1)(𝐱i,𝐱j))=R(R(R(𝐱iT𝐱j))).\displaystyle=\rho^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j})=R(\rho^{(l-1)}(\mathbf{x}_{i},\mathbf{x}_{j}))=R(...R(R(\mathbf{x}_{i}^{T}\mathbf{x}_{j}))).

Thus we can consider Σ(l)\Sigma^{(l)} as a univariate function of the input correlation Σ:[1,1][1,1]\Sigma:[-1,1]\rightarrow[-1,1] and also conclude that [𝐆l]ij[1,1][\mathbf{G}_{l}]_{ij}\in[-1,1]. Furthermore,

Σ˙(l)(𝐱i,𝐱j)=R(ρ(l1)(𝐱i,𝐱j))=R(R(R(R(𝐱iT𝐱j)))),\displaystyle\dot{\Sigma}^{(l)}(\mathbf{x}_{i},\mathbf{x}_{j})=R^{\prime}(\rho^{(l-1)}(\mathbf{x}_{i},\mathbf{x}_{j}))=R^{\prime}(R(...R(R(\mathbf{x}_{i}^{T}\mathbf{x}_{j})))),

which likewise implies Σ˙\dot{\Sigma} is a dot product kernel. Recall now the random variables introduced to define RR: Z1,Z2𝒩(0,1)Z_{1},Z_{2}\sim\mathcal{N}(0,1) are independent and U1=Z1U_{1}=Z_{1}, U2=(ρZ1+1ρ2Z2)U_{2}=(\rho Z_{1}+\sqrt{1-\rho^{2}}Z_{2}). Observe U1,U2U_{1},U_{2} are dependent but identically distributed as U1,U2𝒩(0,1)U_{1},U_{2}\sim\mathcal{N}(0,1). For any ρ[1,1]\rho\in[-1,1] then applying the Cauchy-Schwarz inequality gives

|R(ρ)|2=σw4|𝔼[ϕ(U1)ϕ(U2)]|2σw4𝔼[ϕ(U1)2]𝔼[ϕ(U2)2]=σw4𝔼[ϕ(U1)2]2=|R(1)|2.\displaystyle|R^{\prime}(\rho)|^{2}=\sigma_{w}^{4}\left|\mathbb{E}[\phi^{\prime}(U_{1})\phi^{\prime}(U_{2})]\right|^{2}\leq\sigma_{w}^{4}\mathbb{E}[\phi^{\prime}(U_{1})^{2}]\mathbb{E}[\phi^{\prime}(U_{2})^{2}]=\sigma_{w}^{4}\mathbb{E}[\phi^{\prime}(U_{1})^{2}]^{2}=|R^{\prime}(1)|^{2}.

As a result, under the assumptions of the lemma Σ˙(l):[1,1][χ,χ]\dot{\Sigma}^{(l)}:[-1,1]\rightarrow[-\chi,\chi] and Σ˙(l)(𝐱i,𝐱i)=χ\dot{\Sigma}^{(l)}(\mathbf{x}_{i},\mathbf{x}_{i})=\chi. From this it immediately follows that [𝐆˙l]ij[χ,χ][\dot{\mathbf{G}}_{l}]_{ij}\in[-\chi,\chi] and [𝐆˙l]ii=χ[\dot{\mathbf{G}}_{l}]_{ii}=\chi as claimed. Finally, as Σ:[1,1][1,1]\Sigma:[-1,1]\rightarrow[-1,1] and Σ˙:[1,1][χ,χ]\dot{\Sigma}:[-1,1]\rightarrow[-\chi,\chi] are dot product kernels, then from (12) the NTK must also be a dot product kernel and furthermore a univariate function of the pairwise correlation of its input arguments. ∎

The following corollary, which follows immediately from Lemma A.2 and (14), characterizes the trace of the NTK matrix at each layer under unit variance initialization.

Corollary A.3.

Under the same conditions as Lemma A.2, suppose ϕ\phi and σw2\sigma_{w}^{2} are chosen such that χ=1\chi=1. Then

Tr(𝐊n,l)=l.Tr(\mathbf{K}_{n,l})=l. (21)

A.4 Hermite Expansions

We say that a function f:f:\mathbb{R}\rightarrow\mathbb{R} is square integrable w.r.t. the standard Gaussian measure γ=ex2/2/2π\gamma=e^{-x^{2}/2}/\sqrt{2\pi} if 𝔼x𝒩(0,1)[f(x)2]<\mathbb{E}_{x\sim\mathcal{N}(0,1)}[f(x)^{2}]<\infty. We denote by L2(,γ)L^{2}(\mathbb{R},\gamma) the space of all such functions. The probabilist’s Hermite polynomials are given by

Hk(x)=(1)kex2/2dkdxkex2/2,k=0,1,.\displaystyle H_{k}(x)={(-1)}^{k}e^{x^{2}/2}\frac{d^{k}}{dx^{k}}e^{-x^{2}/2},\quad k=0,1,\ldots.

The first three Hermite polynomials are H0(x)=1H_{0}(x)=1, H1(x)=xH_{1}(x)=x, H2(x)=(x21)H_{2}(x)=(x^{2}-1). Let hk(x)=Hk(x)k!h_{k}(x)=\tfrac{H_{k}(x)}{\sqrt{k!}} denote the normalized probabilist’s Hermite polynomials. The normalized Hermite polynomials form a complete orthonormal basis in L2(,γ)L^{2}(\mathbb{R},\gamma) (O’Donnell, 2014, §11): in all that follows, whenever we reference the Hermite polynomials, we will be referring to the normalized Hermite polynomials. The Hermite expansion of a function ϕL2(,γ)\phi\in L^{2}(\mathbb{R},\gamma) is given by

ϕ(x)=k=0μk(ϕ)hk(x),\displaystyle\phi(x)=\sum_{k=0}^{\infty}\mu_{k}(\phi)h_{k}(x), (22)

where

μk(ϕ)=𝔼X𝒩(0,1)[ϕ(X)hk(X)]\mu_{k}(\phi)=\mathbb{E}_{X\sim\mathcal{N}(0,1)}[\phi(X)h_{k}(X)] (23)

is the kkth normalized probabilist’s Hermite coefficient of ϕ\phi. In what follows we shall make use of the following identities.

k1,hk(x)\displaystyle\forall k\geq 1,\,h_{k}^{\prime}(x) =khk1(x),\displaystyle=\sqrt{k}h_{k-1}(x), (24)
k1,xhk(x)\displaystyle\forall k\geq 1,\,xh_{k}(x) =k+1hk+1(x)+khk1(x).\displaystyle=\sqrt{k+1}h_{k+1}(x)+\sqrt{k}h_{k-1}(x). (25)
hk(0)={0, if k is odd 1k!(1)k2(k1)!! if k is even , where k!!={1,k0k(k2)531,k>0 odd k(k2)642,k>0 even .\displaystyle\begin{array}[]{c}h_{k}(0)=\left\{\begin{array}[]{ll}0,&\text{ if }k\text{ is odd }\\ \frac{1}{\sqrt{k!}}(-1)^{\frac{k}{2}}(k-1)!!&\text{ if }k\text{ is even }\end{array}\right.,\\ \text{ where }k!!=\left\{\begin{array}[]{ll}1,&k\leq 0\\ k\cdot(k-2)\cdots 5\cdot 3\cdot 1,&k>0\text{ odd }\\ k\cdot(k-2)\cdots 6\cdot 4\cdot 2,&k>0\text{ even }.\end{array}\right.\end{array} (33)

We also remark that the more commonly encountered physicist’s Hermite polynomials, which we denote H~k\tilde{H}_{k}, are related to the normalized probablist’s polynomials as follows,

hk(z)=2k/2H~k(z/2)k!.h_{k}(z)=\frac{2^{-k/2}\tilde{H}_{k}(z/\sqrt{2})}{\sqrt{k!}}.

The Hermite expansion of the activation function deployed will play a key role in determining the coefficients of the NTK power series. In particular, the Hermite coefficients of ReLU are as follows.

Lemma A.4.

Daniely et al. (2016) For ϕ(z)=max{0,z}\phi(z)=\max\{0,z\} the Hermite coefficients are given by

μk(ϕ)={1/2π,k=0,1/2,k=1,(k3)!!/2πk!,k even and k2,0,k odd and k3.\mu_{k}(\phi)=\begin{cases}1/\sqrt{2\pi},&\text{$k=0$},\\ 1/2,&\text{$k=1$},\\ (k-3)!!/\sqrt{2\pi k!},&\text{$k$ even and $k\geq 2$},\\ 0,&\text{$k$ odd and $k\geq 3$}.\end{cases} (34)
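
The Hermite coefficients (23) can also be evaluated numerically. The following sketch computes them for ReLU by Gauss–Hermite quadrature and compares against the closed form of Lemma A.4; the quadrature degree is an illustrative choice, and since ReLU has a kink at the origin the agreement is only up to quadrature error.

```python
import math
import numpy as np

# Sketch: compute the normalized Hermite coefficients (23) of ReLU by Gauss-Hermite
# quadrature and compare with the closed form of Lemma A.4. The quadrature degree is an
# illustrative choice, and since ReLU has a kink the agreement is only up to quadrature error.
z, w = np.polynomial.hermite_e.hermegauss(120)
w = w / np.sqrt(2 * np.pi)                                   # E[f(Z)] ~ sum(w * f(z))

def h(k, x):                                                 # normalized probabilists' Hermite h_k
    c = np.zeros(k + 1); c[k] = 1.0
    return np.polynomial.hermite_e.hermeval(x, c) / math.sqrt(math.factorial(k))

def mu(phi, k):                                              # k-th Hermite coefficient, eq. (23)
    return float(np.sum(w * phi(z) * h(k, z)))

def dfact(n):                                                # double factorial, with (-1)!! = 0!! = 1
    return 1.0 if n <= 0 else float(np.prod(np.arange(n, 0, -2)))

def mu_relu_closed(k):                                       # Lemma A.4
    if k == 0: return 1.0 / math.sqrt(2 * math.pi)
    if k == 1: return 0.5
    return dfact(k - 3) / math.sqrt(2 * math.pi * math.factorial(k)) if k % 2 == 0 else 0.0

relu = lambda t: np.maximum(t, 0.0)
for k in range(7):
    print(k, round(mu(relu, k), 5), round(mu_relu_closed(k), 5))
```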

Appendix B Expressing the NTK as a power series

B.1 Deriving a power series for the NTK

We will require the following minor adaptation of Nguyen & Mondelli (2020, Lemma D.2). We remark this result was first stated for ReLU and Softplus activations in the work of Oymak & Soltanolkotabi (2020, Lemma H.2).

Lemma B.1.

For arbitrary n,dn,d\in\mathbb{N}, let 𝐀n×d\mathbf{A}\in\mathbb{R}^{n\times d}. For i[n]i\in[n], we denote the iith row of 𝐀\mathbf{A} as 𝐚i\mathbf{a}_{i}, and further assume that 𝐚i=1\left\lVert\mathbf{a}_{i}\right\rVert=1. Let ϕ:\phi:\mathbb{R}\rightarrow\mathbb{R} satisfy ϕL2(,γ)\phi\in L^{2}(\mathbb{R},\gamma) and define

𝐌=𝔼𝐰𝒩(0,Id)[ϕ(𝐀𝐰)ϕ(𝐀𝐰)T]n×n.\mathbf{M}=\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(0,\textbf{I}_{d})}[\phi(\mathbf{A}\mathbf{w})\phi(\mathbf{A}\mathbf{w})^{T}]\in\mathbb{R}^{n\times n}.

Then the matrix series

𝐒K=k=0Kμk2(ϕ)(𝐀𝐀T)k\mathbf{S}_{K}=\sum_{k=0}^{K}\mu_{k}^{2}(\phi)\left(\mathbf{A}\mathbf{A}^{T}\right)^{\odot k}

converges uniformly to 𝐌\mathbf{M} as KK\rightarrow\infty.

The proof of Lemma B.1 follows exactly as in (Nguyen & Mondelli, 2020, Lemma D.2), and is in fact slightly simpler since we assume the rows of 𝐀\mathbf{A} have unit length and 𝐰𝒩(0,Id)\mathbf{w}\sim\mathcal{N}(0,\textbf{I}_{d}), rather than length d\sqrt{d} and 𝐰𝒩(0,1dId)\mathbf{w}\sim\mathcal{N}(0,\frac{1}{d}\textbf{I}_{d}) respectively. For the ease of the reader, we now recall the following definitions, which are also stated in Section 3. Letting α¯l:=(αp,l)p=0\bar{\alpha}_{l}\vcentcolon=(\alpha_{p,l})_{p=0}^{\infty} denote a sequence of real coefficients, then

F(p,k,α¯l):={1k=0 and p=0,0k=0 and p1,(ji)𝒥(p,k)i=1kαji,lk1 and p0,F(p,k,\bar{\alpha}_{l})\vcentcolon=\begin{cases}1\;&k=0\text{ and }p=0,\\ 0\;&k=0\text{ and }p\geq 1,\\ \sum_{(j_{i})\in\mathcal{J}(p,k)}\prod_{i=1}^{k}\alpha_{j_{i},l}\;&k\geq 1\text{ and }p\geq 0,\end{cases} (35)

where

𝒥(p,k):={(ji)i[k]:ji0i[k],i=1kji=p}\mathcal{J}(p,k)\vcentcolon=\{(j_{i})_{i\in[k]}\;:\;j_{i}\geq 0\;\forall i\in[k],\;\sum_{i=1}^{k}j_{i}=p\}

for all p0p\in\mathbb{Z}_{\geq 0}, k1k\in\mathbb{Z}_{\geq 1}.
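
Since F(p,k,ᾱ) is precisely the coefficient of the degree-p term in the k-fold product of the power series with coefficients ᾱ, it can be computed by iterated truncated polynomial convolution, as in the following sketch; the toy coefficient sequence and the truncation length are illustrative choices.

```python
import numpy as np

# Sketch: F(p, k, alpha) in (35) is the coefficient of x^p in (sum_j alpha_j x^j)^k, so it
# can be computed by iterated, truncated polynomial convolution. The toy coefficient
# sequence and the truncation length P are illustrative choices.
def F_row(alpha, k, P):
    """Return [F(0, k, alpha), ..., F(P, k, alpha)] for the coefficients alpha[0], ..., alpha[P]."""
    out = np.zeros(P + 1); out[0] = 1.0                      # k = 0: the constant polynomial 1
    for _ in range(k):
        out = np.convolve(out, alpha[: P + 1])[: P + 1]      # multiply in one more factor, truncate at degree P
    return out

alpha = np.array([0.25, 0.5, 0.25])                          # hypothetical coefficient sequence
print(F_row(alpha, k=2, P=4))                                # F(2, 2, alpha) = 2(0.25)(0.25) + 0.5^2 = 0.375
```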

We are now ready to derive power series for the elements of (𝐆l)l=1L+1(\mathbf{G}_{l})_{l=1}^{L+1} and (𝐆˙l)l=2L+1(\dot{\mathbf{G}}_{l})_{l=2}^{L+1}.

Lemma B.2.

Under Assumptions 1 and 2, for all l[2,L+1]l\in[2,L+1]

𝐆l=k=0αk,l(𝐗𝐗T)k,\displaystyle\mathbf{G}_{l}=\sum_{k=0}^{\infty}\alpha_{k,l}(\mathbf{X}\mathbf{X}^{T})^{\odot k}, (36)

where the series for each element [𝐆l]ij[\mathbf{G}_{l}]_{ij} converges absolutely and the coefficients αp,l\alpha_{p,l} are nonnegative. The coefficients of the series (36) for all p0p\in\mathbb{Z}_{\geq 0} can be expressed via the following recurrence relationship,

αp,l={σw2μp2(ϕ)+δp=0σb2,l=2,k=0αk,2F(p,k,α¯l1),l3.\alpha_{p,l}=\begin{cases}\sigma_{w}^{2}\mu_{p}^{2}(\phi)+\delta_{p=0}\sigma_{b}^{2},&l=2,\\ \sum_{k=0}^{\infty}\alpha_{k,2}F(p,k,\bar{\alpha}_{l-1}),&l\geq 3.\end{cases} (37)

Furthermore,

𝐆˙l=k=0υk,l(𝐗𝐗T)k,\displaystyle\dot{\mathbf{G}}_{l}=\sum_{k=0}^{\infty}\upsilon_{k,l}(\mathbf{X}\mathbf{X}^{T})^{\odot k}, (38)

where likewise the series for each entry [𝐆˙l]ij[\dot{\mathbf{G}}_{l}]_{ij} converges absolutely and the coefficients υp,l\upsilon_{p,l} for all p0p\in\mathbb{Z}_{\geq 0} are nonnegative and can be expressed via the following recurrence relationship,

υp,l={σw2μp2(ϕ),l=2,k=0υk,2F(p,k,α¯l1),l3.\upsilon_{p,l}=\begin{cases}\sigma_{w}^{2}\mu_{p}^{2}(\phi^{\prime}),&l=2,\\ \sum_{k=0}^{\infty}\upsilon_{k,2}F(p,k,\bar{\alpha}_{l-1}),&l\geq 3.\end{cases} (39)
Proof.

We start by proving (36) and (37). Proceeding by induction, consider the base case l=2l=2. From Lemma A.1

𝐆2=σw2𝔼𝐰𝒩(0,Id)[ϕ(𝐗𝐰)ϕ(𝐗𝐰)T]+σb21n×n.\mathbf{G}_{2}=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{d})}[\phi(\mathbf{X}\mathbf{w})\phi(\mathbf{X}\mathbf{w})^{T}]+\sigma_{b}^{2}\textbf{1}_{n\times n}.

By the assumptions of the lemma, the conditions of Lemma B.1 are satisfied and therefore

𝐆2\displaystyle\mathbf{G}_{2} =σw2k=0μk2(ϕ)(𝐗𝐗T)k+σb21n×n\displaystyle=\sigma_{w}^{2}\sum_{k=0}^{\infty}\mu_{k}^{2}(\phi)\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot k}+\sigma_{b}^{2}\textbf{1}_{n\times n}
=α0,21n×n+k=1αk,2(𝐗𝐗T)k.\displaystyle=\alpha_{0,2}\textbf{1}_{n\times n}+\sum_{k=1}^{\infty}\alpha_{k,2}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot k}.

Observe the coefficients (αk,2)k0(\alpha_{k,2})_{k\in\mathbb{Z}_{\geq 0}} are nonnegative. Therefore, for any i,j[n]i,j\in[n], using Lemma A.2 the series for [𝐆2]ij[\mathbf{G}_{2}]_{ij} satisfies

k=0|αk,2||𝐱i,𝐱jk|\displaystyle\sum_{k=0}^{\infty}\left\lvert\alpha_{k,2}\right\rvert\left\lvert\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{k}\right\rvert k=0αk,2𝐱i,𝐱ik\displaystyle\leq\sum_{k=0}^{\infty}\alpha_{k,2}\langle\mathbf{x}_{i},\mathbf{x}_{i}\rangle^{k} =[𝐆l]ii\displaystyle=[\mathbf{G}_{l}]_{ii} =1\displaystyle=1 (40)

and so must be absolutely convergent. With the base case proved we proceed to assume the inductive hypothesis holds for arbitrary 𝐆l\mathbf{G}_{l} with l[2,L]l\in[2,L]. Observe

𝐆l+1\displaystyle\mathbf{G}_{l+1} =σw2𝔼𝐰𝒩(0,In)[ϕ(𝐀𝐰)ϕ(𝐀𝐰)T]+σb21n×n,\displaystyle=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{n})}[\phi(\mathbf{A}\mathbf{w})\phi(\mathbf{A}\mathbf{w})^{T}]+\sigma_{b}^{2}\textbf{1}_{n\times n},

where 𝐀\mathbf{A} is a matrix square root of 𝐆l\mathbf{G}_{l}, meaning 𝐆l=𝐀𝐀\mathbf{G}_{l}=\mathbf{A}\mathbf{A}. Recall from Lemma A.1 that 𝐆l\mathbf{G}_{l} is also symmetric and positive semi-definite, therefore we may additionally assume, without loss of generality, that 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} is symmetric, which conveniently implies 𝐆n,l=𝐀𝐀T\mathbf{G}_{n,l}=\mathbf{A}\mathbf{A}^{T}. Under the assumptions of the lemma the conditions for Lemma A.2 are satisfied and as a result [𝐆n,l]ii=𝐚i=1[\mathbf{G}_{n,l}]_{ii}=\left\lVert\mathbf{a}_{i}\right\rVert=1 for all i[n]i\in[n], where we recall 𝐚i\mathbf{a}_{i} denotes the iith row of 𝐀\mathbf{A}. Therefore we may again apply Lemma B.1,

𝐆l+1\displaystyle\mathbf{G}_{l+1} =σw2k=0μk2(ϕ)(𝐀𝐀T)k+σb21n×n\displaystyle=\sigma_{w}^{2}\sum_{k=0}^{\infty}\mu_{k}^{2}(\phi)\left(\mathbf{A}\mathbf{A}^{T}\right)^{\odot k}+\sigma_{b}^{2}\textbf{1}_{n\times n}
=(σw2μ02(ϕ)+σb2)1n×n+σw2k=1μk2(ϕ)(𝐆n,l)k\displaystyle=(\sigma_{w}^{2}\mu_{0}^{2}(\phi)+\sigma_{b}^{2})\textbf{1}_{n\times n}+\sigma_{w}^{2}\sum_{k=1}^{\infty}\mu_{k}^{2}(\phi)\left(\mathbf{G}_{n,l}\right)^{\odot k}
=(σw2μ02(ϕ)+σb2)1n×n+σw2k=1μk2(ϕ)(m=0αm,l(𝐗𝐗T)m)k,\displaystyle=(\sigma_{w}^{2}\mu_{0}^{2}(\phi)+\sigma_{b}^{2})\textbf{1}_{n\times n}+\sigma_{w}^{2}\sum_{k=1}^{\infty}\mu_{k}^{2}(\phi)\left(\sum_{m=0}^{\infty}\alpha_{m,l}(\mathbf{X}\mathbf{X}^{T})^{\odot m}\right)^{\odot k},

where the final equality follows from the inductive hypothesis. For any pair of indices i,j[n]i,j\in[n]

[𝐆l+1]ij=(σw2μ02(ϕ)+σb2)+σw2k=1μk2(ϕ)(m=0αm,l𝐱i,𝐱jm)k.[\mathbf{G}_{l+1}]_{ij}=(\sigma_{w}^{2}\mu_{0}^{2}(\phi)+\sigma_{b}^{2})+\sigma_{w}^{2}\sum_{k=1}^{\infty}\mu_{k}^{2}(\phi)\left(\sum_{m=0}^{\infty}\alpha_{m,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{m}\right)^{k}.

By the induction hypothesis, for any i,j[n]i,j\in[n] the series m=0αm,l𝐱i,𝐱jm\sum_{m=0}^{\infty}\alpha_{m,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{m} is absolutely convergent. Therefore, from the Cauchy product of power series and for any k0k\in\mathbb{Z}_{\geq 0} we have

(m=0αm,l𝐱i,𝐱jm)k=p=0F(p,k,α¯l)𝐱i,𝐱jp,\left(\sum_{m=0}^{\infty}\alpha_{m,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{m}\right)^{k}=\sum_{p=0}^{\infty}F(p,k,\bar{\alpha}_{l})\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}, (41)

where F(p,k,α¯l)F(p,k,\bar{\alpha}_{l}) is defined in (4). By definition, F(p,k,α¯l)F(p,k,\bar{\alpha}_{l}) is a sum of products of nonnegative coefficients, and therefore |F(p,k,α¯l)|=F(p,k,α¯l)\left\lvert F(p,k,\bar{\alpha}_{l})\right\rvert=F(p,k,\bar{\alpha}_{l}). In addition, recall again by Assumption 2 and Lemma A.2 that [𝐆l]ii=1[\mathbf{G}_{l}]_{ii}=1. As a result, for any k0k\in\mathbb{Z}_{\geq 0}, as |𝐱i,𝐱j|1\left\lvert\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle\right\rvert\leq 1

p=0|F(p,k,α¯l)𝐱i,𝐱jp|(m=0αm,l)k=[𝐆n,l]ii=1\sum_{p=0}^{\infty}\left\lvert F(p,k,\bar{\alpha}_{l})\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}\right\rvert\leq\left(\sum_{m=0}^{\infty}\alpha_{m,l}\right)^{k}=[\mathbf{G}_{n,l}]_{ii}=1 (42)

and therefore the series p=0F(p,k,α¯l)𝐱i,𝐱jp\sum_{p=0}^{\infty}F(p,k,\bar{\alpha}_{l})\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p} converges absolutely. Recalling from the proof of the base case that the series p=1αp,2\sum_{p=1}^{\infty}\alpha_{p,2} is absolutely convergent and has only nonnegative elements, we may therefore interchange the order of summation in the following,

\displaystyle[\mathbf{G}_{l+1}]_{ij}=(\sigma_{w}^{2}\mu_{0}^{2}(\phi)+\sigma_{b}^{2})+\sigma_{w}^{2}\sum_{k=1}^{\infty}\mu_{k}^{2}(\phi)\left(\sum_{p=0}^{\infty}F(p,k,\bar{\alpha}_{l})\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}\right)
=α0,2+k=1αk,2(p=0F(p,k,α¯l)𝐱i,𝐱jp)\displaystyle=\alpha_{0,2}+\sum_{k=1}^{\infty}\alpha_{k,2}\left(\sum_{p=0}^{\infty}F(p,k,\bar{\alpha}_{l})\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}\right)
=α0,2+p=0(k=1αk,2F(p,k,α¯l))𝐱i,𝐱jp.\displaystyle=\alpha_{0,2}+\sum_{p=0}^{\infty}\left(\sum_{k=1}^{\infty}\alpha_{k,2}F(p,k,\bar{\alpha}_{l})\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}.

Recalling the definition of F(p,k,α¯l)F(p,k,\bar{\alpha}_{l}) in (4), in particular F(0,0,α¯l)=1F(0,0,\bar{\alpha}_{l})=1 and F(p,0,α¯l)=0F(p,0,\bar{\alpha}_{l})=0 for p1p\in\mathbb{Z}_{\geq 1}, then

\displaystyle[\mathbf{G}_{l+1}]_{ij}=\left(\alpha_{0,2}+\sum_{k=1}^{\infty}\alpha_{k,2}F(0,k,\bar{\alpha}_{l})\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{0}+\sum_{p=1}^{\infty}\left(\sum_{k=1}^{\infty}\alpha_{k,2}F(p,k,\bar{\alpha}_{l})\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}
=(k=0αk,2F(0,k,α¯l))𝐱i,𝐱j0+p=1(k=0αk,2F(p,k,α¯l))𝐱i,𝐱jp\displaystyle=\left(\sum_{k=0}^{\infty}\alpha_{k,2}F(0,k,\bar{\alpha}_{l})\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{0}+\sum_{p=1}^{\infty}\left(\sum_{k=0}^{\infty}\alpha_{k,2}F(p,k,\bar{\alpha}_{l})\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}
=p=0(k=0αk,2F(p,k,α¯l))𝐱i,𝐱jp\displaystyle=\sum_{p=0}^{\infty}\left(\sum_{k=0}^{\infty}\alpha_{k,2}F(p,k,\bar{\alpha}_{l})\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}
=p=0αp,l+1𝐱i,𝐱jp.\displaystyle=\sum_{p=0}^{\infty}\alpha_{p,l+1}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}.

As the indices i,j[n]i,j\in[n] were arbitrary we conclude that

𝐆l+1=p=0αp,l+1(𝐗𝐗T)p\mathbf{G}_{l+1}=\sum_{p=0}^{\infty}\alpha_{p,l+1}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot p}

as claimed. In addition, by inspection and using the induction hypothesis it is clear that the coefficients (αp,l+1)p=0(\alpha_{p,l+1})_{p=0}^{\infty} are nonnegative. Therefore, by an argument identical to (40), the series for each entry of [𝐆l+1]ij[\mathbf{G}_{l+1}]_{ij} is absolutely convergent. This concludes the proof of (36) and (37).

We now turn our attention to proving (38) and (39). Under the assumptions of the lemma the conditions for Lemmas A.1 and B.1 are satisfied and therefore for the base case l=2l=2

𝐆˙2\displaystyle\dot{\mathbf{G}}_{2} =σw2𝔼𝐰𝒩(0,Id)[ϕ(𝐗𝐰)ϕ(𝐗𝐰)T]\displaystyle=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{d})}[\phi^{\prime}(\mathbf{X}\mathbf{w})\phi^{\prime}(\mathbf{X}\mathbf{w})^{T}]
=σw2k=0μk2(ϕ)(𝐗𝐗T)k\displaystyle=\sigma_{w}^{2}\sum_{k=0}^{\infty}\mu_{k}^{2}(\phi^{\prime})\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot k}
=k=0υk,2(𝐗𝐗T)k.\displaystyle=\sum_{k=0}^{\infty}\upsilon_{k,2}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot k}.

By inspection the coefficients (υp,2)p=0(\upsilon_{p,2})_{p=0}^{\infty} are nonnegative and as a result by an argument again identical to (40) the series for each entry of [𝐆˙2]ij[\dot{\mathbf{G}}_{2}]_{ij} is absolutely convergent. For l[2,L]l\in[2,L], from (36) and its proof there is a matrix 𝐀n×n\mathbf{A}\in\mathbb{R}^{n\times n} such that 𝐆l=𝐀𝐀T\mathbf{G}_{l}=\mathbf{A}\mathbf{A}^{T}. Again applying Lemma B.1

𝐆˙n,l+1\displaystyle\dot{\mathbf{G}}_{n,l+1} =σw2𝔼𝐰𝒩(0,In)[ϕ(𝐀𝐰)ϕ(𝐀𝐰)T]\displaystyle=\sigma_{w}^{2}\mathbb{E}_{\mathbf{w}\sim\mathcal{N}(\textbf{0},\textbf{I}_{n})}[\phi^{\prime}(\mathbf{A}\mathbf{w})\phi^{\prime}(\mathbf{A}\mathbf{w})^{T}]
=σw2k=0μk2(ϕ)(𝐀𝐀T)k\displaystyle=\sigma_{w}^{2}\sum_{k=0}^{\infty}\mu_{k}^{2}(\phi^{\prime})\left(\mathbf{A}\mathbf{A}^{T}\right)^{\odot k}
=k=0υk,2(𝐆n,l)k\displaystyle=\sum_{k=0}^{\infty}\upsilon_{k,2}\left(\mathbf{G}_{n,l}\right)^{\odot k}
=k=0υk,2(p=0αp,l(𝐗𝐗T)p)k\displaystyle=\sum_{k=0}^{\infty}\upsilon_{k,2}\left(\sum_{p=0}^{\infty}\alpha_{p,l}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot p}\right)^{\odot k}

Analyzing now an arbitrary entry [𝐆˙l+1]ij[\dot{\mathbf{G}}_{l+1}]_{ij}, by substituting in the power series expression for 𝐆l\mathbf{G}_{l} from (36) and using (41) we have

\displaystyle[\dot{\mathbf{G}}_{l+1}]_{ij}=\sum_{k=0}^{\infty}\upsilon_{k,2}\left(\sum_{p=0}^{\infty}\alpha_{p,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}\right)^{k}
=k=0υk,2(p=0F(p,k,α¯l)𝐱i,𝐱jp)\displaystyle=\sum_{k=0}^{\infty}\upsilon_{k,2}\left(\sum_{p=0}^{\infty}F(p,k,\bar{\alpha}_{l})\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}\right)
=p=0(k=0υk,2F(p,k,α¯l))𝐱i,𝐱jp\displaystyle=\sum_{p=0}^{\infty}\left(\sum_{k=0}^{\infty}\upsilon_{k,2}F(p,k,\bar{\alpha}_{l})\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}
=p=0υp,l+1𝐱i,𝐱jp.\displaystyle=\sum_{p=0}^{\infty}\upsilon_{p,l+1}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}.

Note that exchanging the order of summation in the third equality above is justified as for any k0k\in\mathbb{Z}_{\geq 0} by (42) we have p=0F(p,k,α¯l)|𝐱i,𝐱j|p1\sum_{p=0}^{\infty}F(p,k,\bar{\alpha}_{l})|\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle|^{p}\leq 1 and therefore k=0p=0υk,2F(p,k,α¯l)𝐱i,𝐱jp\sum_{k=0}^{\infty}\sum_{p=0}^{\infty}\upsilon_{k,2}F(p,k,\bar{\alpha}_{l})\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p} converges absolutely. As the indices i,j[n]i,j\in[n] were arbitrary we conclude that

𝐆˙l+1=p=0υp,l+1(𝐗𝐗T)p\dot{\mathbf{G}}_{l+1}=\sum_{p=0}^{\infty}\upsilon_{p,l+1}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot p}

as claimed. Finally, by inspection the coefficients (υp,l+1)p=0(\upsilon_{p,l+1})_{p=0}^{\infty} are nonnegative, therefore, and again by an argument identical to (40), the series for each entry of [𝐆˙n,l+1]ij[\dot{\mathbf{G}}_{n,l+1}]_{ij} is absolutely convergent. This concludes the proof. ∎

We are now ready to prove the key result of Section 3.

See Theorem 3.1.

Proof.

We proceed by induction. The base case l=1l=1 follows trivially from Lemma A.1. We therefore assume the induction hypothesis holds for an arbitrary l1[1,L]l-1\in[1,L]. From (14) and Lemma B.2

n𝐊l\displaystyle n\mathbf{K}_{l} =𝐆l+n𝐊l1𝐆˙l\displaystyle=\mathbf{G}_{l}+n\mathbf{K}_{l-1}\odot\dot{\mathbf{G}}_{l}
=(p=0αp,l(𝐗𝐗T)p)+(nq=0κq,l1(𝐗𝐗T)q)(w=0υw,l(𝐗𝐗T)w).\displaystyle=\left(\sum_{p=0}^{\infty}\alpha_{p,l}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot p}\right)+\left(n\sum_{q=0}^{\infty}\kappa_{q,l-1}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot q}\right)\odot\left(\sum_{w=0}^{\infty}\upsilon_{w,l}\left(\mathbf{X}\mathbf{X}^{T}\right)^{\odot w}\right).

Therefore, for arbitrary i,j[n]i,j\in[n]

\displaystyle n[\mathbf{K}_{l}]_{ij}=\sum_{p=0}^{\infty}\alpha_{p,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}+\left(n\sum_{q=0}^{\infty}\kappa_{q,l-1}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{q}\right)\left(\sum_{w=0}^{\infty}\upsilon_{w,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{w}\right).

Observe nq=0κq,l1𝐱i,𝐱jq=Θ(l1)(𝐱i,𝐱j)n\sum_{q=0}^{\infty}\kappa_{q,l-1}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{q}=\Theta^{(l-1)}(\mathbf{x}_{i},\mathbf{x}_{j}) and therefore the series must converge due to the convergence of the NTK. Furthermore, w=0υw,l𝐱i,𝐱jw=[𝐆˙n,l]ij\sum_{w=0}^{\infty}\upsilon_{w,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{w}=[\dot{\mathbf{G}}_{n,l}]_{ij} and therefore is absolutely convergent by Lemma B.2. As a result, by Mertens’ theorem the product of these two series is equal to their Cauchy product. Therefore

\displaystyle n[\mathbf{K}_{l}]_{ij}=\sum_{p=0}^{\infty}\alpha_{p,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}+\sum_{p=0}^{\infty}\left(\sum_{q=0}^{p}\kappa_{q,l-1}\upsilon_{p-q,l}\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}
=p=0(αp,l+q=0pκq,l1υpq,l)𝐱i,𝐱jp\displaystyle=\sum_{p=0}^{\infty}\left(\alpha_{p,l}+\sum_{q=0}^{p}\kappa_{q,l-1}\upsilon_{p-q,l}\right)\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p}
=p=0κp,l𝐱i,𝐱jp,\displaystyle=\sum_{p=0}^{\infty}\kappa_{p,l}\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle^{p},

from which (5) immediately follows. ∎

B.2 Analyzing the coefficients of the NTK power series

In this section we study the coefficients of the NTK power series stated in Theorem 3.1. Our first observation is that, under additional assumptions on the activation function ϕ\phi, the recurrence relationship (6) can be simplified in order to depend only on the Hermite expansion of ϕ\phi.

Lemma B.3.

Under Assumption 3 the Hermite coefficients of ϕ\phi^{\prime} satisfy

μk(ϕ)=k+1μk+1(ϕ)\mu_{k}(\phi^{\prime})=\sqrt{k+1}\mu_{k+1}(\phi)

for all k0k\in\mathbb{Z}_{\geq 0}.

Proof.

Note for each nn\in\mathbb{N} as ϕ\phi is absolutely continuous on [n,n][-n,n] it is differentiable a.e. on [n,n][-n,n]. It follows by the countable additivity of the Lebesgue measure that ϕ\phi is differentiable a.e. on \mathbb{R}. Furthermore, as ϕ\phi is polynomially bounded we have ϕL2(,ex2/2/2π)\phi\in L^{2}(\mathbb{R},e^{-x^{2}/2}/\sqrt{2\pi}). Fix a>0a>0. Since ϕ\phi is absolutely continuous on [a,a][-a,a] it is of bounded variation on [a,a][-a,a]. Also note that hk(x)ex2/2h_{k}(x)e^{-x^{2}/2} is of bounded variation on [a,a][-a,a] due to having a bounded derivative. Thus we have by Lebesgue-Stieltjes integration-by-parts (see e.g. Folland 1999, Chapter 3)

aaϕ(x)hk(x)ex2/2𝑑x\displaystyle\int_{-a}^{a}\phi^{\prime}(x)h_{k}(x)e^{-x^{2}/2}dx
=ϕ(a)hk(a)ea2/2ϕ(a)hk(a)ea2/2+aaϕ(x)[xhk(x)hk(x)]ex2/2𝑑x\displaystyle=\phi(a)h_{k}(a)e^{-a^{2}/2}-\phi(-a)h_{k}(-a)e^{-a^{2}/2}+\int_{-a}^{a}\phi(x)[xh_{k}(x)-h_{k}^{\prime}(x)]e^{-x^{2}/2}dx
=ϕ(a)hk(a)ea2/2ϕ(a)hk(a)ea2/2+aaϕ(x)k+1hk+1(x)ex2/2𝑑x,\displaystyle=\phi(a)h_{k}(a)e^{-a^{2}/2}-\phi(-a)h_{k}(-a)e^{-a^{2}/2}+\int_{-a}^{a}\phi(x)\sqrt{k+1}h_{k+1}(x)e^{-x^{2}/2}dx,

where in the last line above we have used the fact that (24) and (25) imply that xhk(x)hk(x)=k+1hk+1(x)xh_{k}(x)-h_{k}^{\prime}(x)=\sqrt{k+1}h_{k+1}(x). Thus we have shown

aaϕ(x)hk(x)ex2/2𝑑x\displaystyle\int_{-a}^{a}\phi^{\prime}(x)h_{k}(x)e^{-x^{2}/2}dx
=ϕ(a)hk(a)ea2/2ϕ(a)hk(a)ea2/2+aaϕ(x)k+1hk+1(x)ex2/2𝑑x.\displaystyle=\phi(a)h_{k}(a)e^{-a^{2}/2}-\phi(-a)h_{k}(-a)e^{-a^{2}/2}+\int_{-a}^{a}\phi(x)\sqrt{k+1}h_{k+1}(x)e^{-x^{2}/2}dx.

We note that since |ϕ(x)hk(x)|=𝒪(|x|β+k)|\phi(x)h_{k}(x)|=\mathcal{O}(|x|^{\beta+k}) we have that as aa\rightarrow\infty the first two terms above vanish. Thus by sending aa\rightarrow\infty we have

ϕ(x)hk(x)ex2/2𝑑x=k+1ϕ(x)hk+1(x)ex2/2𝑑x.\int_{-\infty}^{\infty}\phi^{\prime}(x)h_{k}(x)e^{-x^{2}/2}dx=\int_{-\infty}^{\infty}\sqrt{k+1}\phi(x)h_{k+1}(x)e^{-x^{2}/2}dx.

After dividing by 2π\sqrt{2\pi} we get the desired result. ∎
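
As a quick numerical check of Lemma B.3, the following sketch compares μk(ϕ′) with √(k+1)μk+1(ϕ) for ϕ=Tanh, with the coefficients (23) evaluated by Gauss–Hermite quadrature; the quadrature degree is an illustrative choice.

```python
import math
import numpy as np

# Quick numerical check of Lemma B.3, mu_k(phi') = sqrt(k+1) mu_{k+1}(phi), for phi = Tanh;
# the coefficients (23) are evaluated by Gauss-Hermite quadrature of an illustrative degree.
z, w = np.polynomial.hermite_e.hermegauss(100)
w = w / np.sqrt(2 * np.pi)

def mu(f, k):                                        # k-th normalized Hermite coefficient of f
    c = np.zeros(k + 1); c[k] = 1.0
    hk = np.polynomial.hermite_e.hermeval(z, c) / math.sqrt(math.factorial(k))
    return float(np.sum(w * f(z) * hk))

tanh, dtanh = np.tanh, (lambda t: 1.0 - np.tanh(t) ** 2)
for k in range(5):
    print(k, mu(dtanh, k), math.sqrt(k + 1) * mu(tanh, k + 1))   # the two columns should agree
```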

In particular, under Assumption 3, and as highlighted by Corollary B.4, which follows directly from Lemmas B.2 and B.3, the NTK coefficients can be computed using only the Hermite coefficients of ϕ\phi.

Corollary B.4.

Under Assumptions 1, 2 and 3, for all p0p\in\mathbb{Z}_{\geq 0}

υp,l={(p+1)αp+1,2,l=2,k=0υk,2F(p,k,α¯l1),l3.\upsilon_{p,l}=\begin{cases}(p+1)\alpha_{p+1,2},&l=2,\\ \sum_{k=0}^{\infty}\upsilon_{k,2}F(p,k,\bar{\alpha}_{l-1}),&l\geq 3.\end{cases} (43)

With these results in place we proceed to analyze the decay of the coefficients of the NTK for depth two networks. As stated in the main text, the decay of the NTK coefficients depends on the decay of the Hermite coefficients of the activation function deployed. This in turn is strongly influenced by the behavior of the tails of the activation function. To this end we roughly group activation functions into three categories: growing tails, flat or constant tails and finally decaying tails. Analyzing each of these groups in full generality is beyond the scope of this paper; we therefore instead study the behavior of ReLU, Tanh and Gaussian activation functions, which are prototypical and practically used examples of these three groups respectively. We remark that these three activation functions satisfy Assumption 3. For typographical ease we let ωσ(z):=(1/2πσ2)exp(z2/(2σ2))\omega_{\sigma}(z)\vcentcolon=(1/\sqrt{2\pi\sigma^{2}})\exp\left(-z^{2}/(2\sigma^{2})\right) denote the Gaussian activation function with variance σ2\sigma^{2}.

See Lemma 3.2.

Proof.

Recall (9),

κp,2=σw2(1+γw2p)μp2(ϕ)+σw2γb2(1+p)μp+12(ϕ)+δp=0σb2.\kappa_{p,2}=\sigma_{w}^{2}(1+\gamma_{w}^{2}p)\mu_{p}^{2}(\phi)+\sigma_{w}^{2}\gamma_{b}^{2}(1+p)\mu_{p+1}^{2}(\phi)+\delta_{p=0}\sigma_{b}^{2}.

In order to bound κp,2\kappa_{p,2} we proceed by using Lemma A.4 to bound the square of the Hermite coefficients. We start with ReLU. Note Lemma A.4 actually provides precise expressions for the Hermite coefficients of ReLU; however, these are not immediately easy to interpret. Observe from Lemma A.4 that above index p=2p=2 all odd indexed Hermite coefficients are 0. It therefore suffices to bound the even indexed terms, given by

μp(ReLU)=12π(p3)!!p!.\mu_{p}(ReLU)=\frac{1}{\sqrt{2\pi}}\frac{(p-3)!!}{\sqrt{p!}}.

Observe from (33) that for pp even

hp(0)=(1)p/2(p1)!!p!,h_{p}(0)=(-1)^{p/2}\frac{(p-1)!!}{\sqrt{p!}},

therefore

μp(ReLU)=12π(p3)!!p!=12π|hp(0)|p1.\mu_{p}(ReLU)=\frac{1}{\sqrt{2\pi}}\frac{(p-3)!!}{\sqrt{p!}}=\frac{1}{\sqrt{2\pi}}\frac{\left\lvert h_{p}(0)\right\rvert}{p-1}.

Analyzing now |hp(0)|\left\lvert h_{p}(0)\right\rvert,

(p1)!!p!=i=1p/2(2i1)i=1p/2(2i1)2i=i=1p/2(2i1)i=1p/22i=(p1)!!p!!.\frac{(p-1)!!}{\sqrt{p!}}=\frac{\prod_{i=1}^{p/2}(2i-1)}{\sqrt{\prod_{i=1}^{p/2}(2i-1)2i}}=\sqrt{\frac{\prod_{i=1}^{p/2}(2i-1)}{\prod_{i=1}^{p/2}2i}}=\sqrt{\frac{(p-1)!!}{p!!}}.

Here, the expression inside the square root is referred to in the literature as the Wallis ratio, for which, for pp even, the following lower and upper bounds are available (Kazarinoff, 1956),

2π(p+1)<(p1)!!p!!<2π(p+0.5).\sqrt{\frac{2}{\pi(p+1)}}<\frac{(p-1)!!}{p!!}<\sqrt{\frac{2}{\pi(p+0.5)}}. (44)

As a result

|hp(0)|=Θ(p1/4)\left\lvert h_{p}(0)\right\rvert=\Theta(p^{-1/4})

and therefore

μp(ReLU)={Θ(p5/4),p even,0,p odd.\mu_{p}(ReLU)=\begin{cases}\Theta(p^{-5/4}),\;&p\text{ even},\\ 0,\;&p\text{ odd}.\end{cases}

As (p+1)3/2=Θ(p3/2)(p+1)^{-3/2}=\Theta(p^{-3/2}), then from (9)

κp,2\displaystyle\kappa_{p,2} =Θ((pμp2(ReLU)+δγb>0(p+1)μp+12(ReLU)))\displaystyle=\Theta((p\mu_{p}^{2}(ReLU)+\delta_{\gamma_{b}>0}(p+1)\mu_{p+1}^{2}(ReLU)))
=Θ((δp evenp3/2+δ(p odd)(γb>0)(p+1)3/2))\displaystyle=\Theta((\delta_{p\text{ even}}p^{-3/2}+\delta_{(p\text{ odd})\cap(\gamma_{b}>0)}(p+1)^{-3/2}))
=Θ(δ(p even)((p odd)(γb>0))p3/2)\displaystyle=\Theta\left(\delta_{(p\text{ even})\cup\left((p\text{ odd})\cap(\gamma_{b}>0)\right)}p^{-3/2}\right)
=δ(p even)(γb>0)Θ(p3/2)\displaystyle=\delta_{(p\text{ even})\cup(\gamma_{b}>0)}\Theta\left(p^{-3/2}\right)

as claimed in item 1.

We now proceed to analyze ϕ(z)=Tanh(z)\phi(z)=Tanh(z). From Panigrahi et al. (2020, Corollary F.7.1)

μp(Tanh)=𝒪(exp(πp4)).\mu_{p}(Tanh^{\prime})=\mathcal{O}\left(\exp\left(-\frac{\pi\sqrt{p}}{4}\right)\right).

As Tanh satisfies the conditions of Lemma B.3

μp(Tanh)=p1/2μp1(Tanh)=𝒪(p1/2exp(πp14)).\mu_{p}(Tanh)=p^{-1/2}\mu_{p-1}(Tanh^{\prime})=\mathcal{O}\left(p^{-1/2}\exp\left(-\frac{\pi\sqrt{p-1}}{4}\right)\right).

Therefore the result claimed in item 2. follows as

κp,2\displaystyle\kappa_{p,2} =𝒪((pμp2(Tanh)+(p+1)μp+12(Tanh)))\displaystyle=\mathcal{O}((p\mu_{p}^{2}(Tanh)+(p+1)\mu_{p+1}^{2}(Tanh)))
=𝒪(exp(πp12)+exp(πp2))\displaystyle=\mathcal{O}\left(\exp\left(-\frac{\pi\sqrt{p-1}}{2}\right)+\exp\left(-\frac{\pi\sqrt{p}}{2}\right)\right)
=𝒪(exp(πp12)).\displaystyle=\mathcal{O}\left(\exp\left(-\frac{\pi\sqrt{p-1}}{2}\right)\right).

Finally, we now consider ϕ(z)=ωσ(z)\phi(z)=\omega_{\sigma}(z) where ωσ(z)\omega_{\sigma}(z) is the density function of 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}). Similar to ReLU, analytic expressions for the Hermite coefficients of ωσ(z)\omega_{\sigma}(z) are known (see e.g., Davis, 2021, Theorem 2.9),

μp2(ωσ)={p!((p/2)!)22p2π(σ2+1)p+1,p even,0,p odd.\mu_{p}^{2}(\omega_{\sigma})=\begin{cases}\frac{p!}{(\left(p/2\right)!)^{2}2^{p}2\pi(\sigma^{2}+1)^{p+1}},\;&\text{p even},\\ 0,\;&\text{p odd}.\\ \end{cases}

For pp even

(p/2)!=p!!2p/2.(p/2)!=p!!2^{-p/2}.

Therefore

p!(p/2)!(p/2)!=2pp!p!!p!!=2p(p1)!!p!!.\displaystyle\frac{p!}{(p/2)!(p/2)!}=2^{p}\frac{p!}{p!!p!!}=2^{p}\frac{(p-1)!!}{p!!}.

As a result, for pp even and using (44), it follows that

μp2(ωσ)\displaystyle\mu_{p}^{2}(\omega_{\sigma}) =(σ2+1)(p+1)2π(p1)!!p!!\displaystyle=\frac{(\sigma^{2}+1)^{-(p+1)}}{2\pi}\frac{(p-1)!!}{p!!} =Θ(p1/2(σ2+1)p).\displaystyle=\Theta(p^{-1/2}(\sigma^{2}+1)^{-p}).

Finally, since (p+1)1/2(σ2+1)p1=Θ(p1/2(σ2+1)p)(p+1)^{1/2}(\sigma^{2}+1)^{-p-1}=\Theta(p^{1/2}(\sigma^{2}+1)^{-p}), then from (9)

κp,2\displaystyle\kappa_{p,2} =Θ((pμp2(ωσ)+δγb>0(p+1)μp+12(ωσ)))\displaystyle=\Theta((p\mu_{p}^{2}(\omega_{\sigma})+\delta_{\gamma_{b}>0}(p+1)\mu_{p+1}^{2}(\omega_{\sigma})))
=Θ(δ(p even)((p odd)(γb>0))p1/2(σ2+1)p)\displaystyle=\Theta\left(\delta_{(p\text{ even})\cup\left((p\text{ odd})\cap(\gamma_{b}>0)\right)}p^{1/2}(\sigma^{2}+1)^{-p}\right)
=δ(p even)(γb>0)Θ(p1/2(σ2+1)p)\displaystyle=\delta_{(p\text{ even})\cup(\gamma_{b}>0)}\Theta\left(p^{1/2}(\sigma^{2}+1)^{-p}\right)

as claimed in item 3. ∎
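
As a numerical illustration of item 1, the following sketch evaluates κp,2 for ReLU via (9) and the closed-form coefficients of Lemma A.4, working in log space to avoid overflow; the choices σw2=2, γw2=1, γb=σb=0 are illustrative and consistent with Assumption 2, and p^{3/2}κp,2 should approach a constant for large even p.

```python
import math

# Numerical illustration of item 1: kappa_{p,2} for ReLU via (9) and Lemma A.4, computed in
# log space to avoid overflow. The choices sw2 = 2, gw2 = 1, gb = sb = 0 are illustrative and
# consistent with Assumption 2; p^{3/2} kappa_{p,2} should approach a constant for large even p.
def log_dfact_odd(m):                                   # log(m!!) for odd m, with (-1)!! = 1
    if m <= 0:
        return 0.0
    k = (m + 1) // 2                                    # m = 2k - 1, so m!! = (2k)! / (2^k k!)
    return math.lgamma(2 * k + 1) - k * math.log(2) - math.lgamma(k + 1)

def log_mu_relu(p):                                     # log mu_p(ReLU) for even p >= 2, Lemma A.4
    return log_dfact_odd(p - 3) - 0.5 * (math.log(2 * math.pi) + math.lgamma(p + 1))

for p in [10, 100, 1000, 10000]:                        # even p, gamma_b = 0 so only even terms survive
    log_kappa = math.log(2.0 * (1.0 + p)) + 2.0 * log_mu_relu(p)    # eq. (9)
    print(p, math.exp(log_kappa + 1.5 * math.log(p)))
```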

B.3 Numerical approximation via a truncated NTK power series and interpretation of Figure 2

Currently, computing the infinite width NTK requires either a) explicit evaluation of the Gaussian integrals highlighted in (13), b) numerical approximation of these same integrals such as in Lee et al. (2018), or c) approximation via a sufficiently wide yet still finite width network, see for instance Engel et al. (2022); Novak et al. (2022). These Gaussian integrals (13) can be solved analytically only for a minority of activation functions, notably ReLU as discussed for example by Arora et al. (2019b), while the numerical integration and finite width approximation approaches are relatively computationally expensive. We define the truncated NTK power series analogously to (5), but with each of the series involved computed only up to the TTth element. Once the top TT coefficients are computed, then for any input correlation the NTK can be approximated by evaluating the corresponding finite degree TT polynomial.

Definition B.5.

For an arbitrary pair 𝐱,𝐲𝕊d1\mathbf{x},\mathbf{y}\in\mathbb{S}^{d-1} let ρ=𝐱T𝐲\rho=\mathbf{x}^{T}\mathbf{y} denote their linear correlation. Under Assumptions 1, 2 and 3, for all l[2,L+1]l\in[2,L+1] the TT-truncated NTK power series Θ^T(l):[1,1]\hat{\Theta}_{T}^{(l)}:[-1,1]\rightarrow\mathbb{R} is defined as

Θ^T(l)(ρ)=p=0Tκ^p,lρp,\hat{\Theta}_{T}^{(l)}(\rho)=\sum_{p=0}^{T}\hat{\kappa}_{p,l}\rho^{p}, (45)

and whose coefficients are defined via the following recurrence relation,

κ^p,l={δp=0γb2+δp=1γw2,l=1,α^p,l+q=0pκ^q,l1υ^pq,l,l[2,L+1].\hat{\kappa}_{p,l}=\begin{cases}\delta_{p=0}\gamma_{b}^{2}+\delta_{p=1}\gamma_{w}^{2},&l=1,\\ \hat{\alpha}_{p,l}+\sum_{q=0}^{p}\hat{\kappa}_{q,l-1}\hat{\upsilon}_{p-q,l},&l\in[2,L+1].\end{cases} (46)

Here, with α^¯l1=(α^p,l1)p=0T\bar{\hat{\alpha}}_{l-1}=(\hat{\alpha}_{p,l-1})_{p=0}^{T},

α^p,l:={σw2μp2(ϕ)+δp=0σb2,l=2,k=0Tα^k,2F(p,k,α^¯l1),l3\hat{\alpha}_{p,l}\vcentcolon=\begin{cases}\sigma_{w}^{2}\mu_{p}^{2}(\phi)+\delta_{p=0}\sigma_{b}^{2},&l=2,\\ \sum_{k=0}^{T}\hat{\alpha}_{k,2}F(p,k,\bar{\hat{\alpha}}_{l-1}),&l\geq 3\end{cases} (47)

and

υ^p,l:={(p+1)α^p+1,2,l=2,k=0T(k+1)α^k+1,2F(p,k,α^¯l1),l3.\hat{\upsilon}_{p,l}\vcentcolon=\begin{cases}(p+1)\hat{\alpha}_{p+1,2},&l=2,\\ \sum_{k=0}^{T}(k+1)\hat{\alpha}_{k+1,2}F(p,k,\bar{\hat{\alpha}}_{l-1}),&l\geq 3.\end{cases} (48)
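
The following is a minimal sketch of the truncated coefficients of Definition B.5 for ReLU with σw2=2, σb2=γb2=0 and γw2=1, illustrative choices satisfying Assumption 2; the helper names are not part of the text and mu_relu implements Lemma A.4.

```python
import math
import numpy as np

# Minimal sketch of the T-truncated coefficients of Definition B.5 for ReLU with sw2 = 2,
# sb2 = gb2 = 0, gw2 = 1 (illustrative choices satisfying Assumption 2). mu_relu implements
# Lemma A.4 and the helper names are not part of the text.
def dfact(n):
    return 1.0 if n <= 0 else float(np.prod(np.arange(n, 0, -2)))

def mu_relu(k):
    if k == 0: return 1.0 / math.sqrt(2 * math.pi)
    if k == 1: return 0.5
    return dfact(k - 3) / math.sqrt(2 * math.pi * math.factorial(k)) if k % 2 == 0 else 0.0

def truncated_ntk_coeffs(T, L, mu=mu_relu, sw2=2.0, sb2=0.0, gw2=1.0, gb2=0.0):
    """Return (kappa_hat_{p,L+1})_{p=0}^T as in (46)-(48)."""
    alpha2 = np.array([sw2 * mu(p) ** 2 for p in range(T + 2)])      # alpha_hat_{p,2}, eq. (47)
    alpha2[0] += sb2
    ups2 = np.arange(1, T + 2) * alpha2[1:]                          # ups_hat_{p,2} = (p+1) alpha_hat_{p+1,2}
    kappa = np.zeros(T + 1); kappa[0], kappa[1] = gb2, gw2           # kappa_hat_{p,1}, eq. (46)
    alpha_prev = None                                                # alpha_hat_{., l-1}, needed for F when l >= 3
    for l in range(2, L + 2):
        if l == 2:
            alpha_l, ups_l = alpha2[: T + 1].copy(), ups2.copy()
        else:
            Fk = np.zeros(T + 1); Fk[0] = 1.0                        # F(., 0, alpha_hat_{l-1})
            alpha_l, ups_l = alpha2[0] * Fk.copy(), ups2[0] * Fk.copy()
            for k in range(1, T + 1):
                Fk = np.convolve(Fk, alpha_prev)[: T + 1]            # F(., k, alpha_hat_{l-1})
                alpha_l += alpha2[k] * Fk
                ups_l += ups2[k] * Fk
        kappa = alpha_l + np.convolve(kappa, ups_l)[: T + 1]         # eq. (46): truncated Cauchy product
        alpha_prev = alpha_l
    return kappa

kappa = truncated_ntk_coeffs(T=50, L=1)
theta_hat = lambda rho: np.polynomial.polynomial.polyval(rho, kappa)   # eq. (45)
print(theta_hat(0.5))
```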

In order to analyze the performance and potential of the truncated NTK for numerical approximation, we compute it for ReLU and compare it with its analytical expression (Arora et al., 2019b). To recall this result, let

R(ρ)\displaystyle R(\rho) :=1ρ2+ρarcsin(ρ)π+ρ2,\displaystyle\vcentcolon=\frac{\sqrt{1-\rho^{2}}+\rho\cdot\arcsin(\rho)}{\pi}+\frac{\rho}{2},
R(ρ)\displaystyle R^{\prime}(\rho) :=arcsin(ρ)π+12.\displaystyle\vcentcolon=\frac{\arcsin(\rho)}{\pi}+\frac{1}{2}.

Under Assumptions 1 and 2, with ϕ(z)=ReLU(z)\phi(z)=ReLU(z), γw2=1\gamma_{w}^{2}=1, σw2=2\sigma_{w}^{2}=2, σb2=γb2=0\sigma_{b}^{2}=\gamma_{b}^{2}=0, 𝐱,𝐲𝕊d1\mathbf{x},\mathbf{y}\in\mathbb{S}^{d-1} and ρ1:=𝐱T𝐲\rho_{1}\vcentcolon=\mathbf{x}^{T}\mathbf{y}, then Θ1(𝐱,𝐲)=ρ1\Theta_{1}(\mathbf{x},\mathbf{y})=\rho_{1} and for all l[2,L+1]l\in[2,L+1]

ρl=R(ρl1),\displaystyle\rho_{l}=R(\rho_{l-1}), (49)
Θl(𝐱,𝐲)=ρl+Θl1(𝐱,𝐲)R(ρl1).\displaystyle\Theta_{l}(\mathbf{x},\mathbf{y})=\rho_{l}+\Theta_{l-1}(\mathbf{x},\mathbf{y})R^{\prime}(\rho_{l-1}).

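For reference, the following is a minimal sketch of the analytical recursion (49); its output can be compared against the truncated power series sketch above at matching ρ, in the spirit of Figure 2.

```python
import numpy as np

# Sketch of the analytical ReLU NTK recursion (49); its output can be compared with the
# truncated power series sketch above at matching rho, in the spirit of Figure 2.
def relu_ntk_analytic(rho, L):
    R = lambda r: (np.sqrt(1.0 - r ** 2) + r * np.arcsin(r)) / np.pi + r / 2.0
    Rp = lambda r: np.arcsin(r) / np.pi + 0.5
    theta, rho_prev = rho, rho                       # Theta_1 = rho_1 = x^T y
    for _ in range(2, L + 2):                        # l = 2, ..., L+1
        rho_l = R(rho_prev)
        theta = rho_l + theta * Rp(rho_prev)         # Theta_l = rho_l + Theta_{l-1} R'(rho_{l-1})
        rho_prev = rho_l
    return theta

print(relu_ntk_analytic(0.5, L=1))                   # compare with theta_hat(0.5) above
```
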
Turning our attention to Figure 2, we observe particularly for input correlations |ρ|0.5\left\lvert\rho\right\rvert\approx 0.5 and below then the truncated ReLU NTK power series achieves machine level precision. For |ρ|1\left\lvert\rho\right\rvert\approx 1 higher order coefficients play a more significant role. As the truncated ReLU NTK power series approximates these coefficients less well the overall approximation of the ReLU NTK is worse. We remark also that negative correlations have a smaller absolute error as odd indexed terms cancel with even index terms: we emphasize again that in Figure 2 we plot the absolute not relative error. In addition, for L=1L=1 there is symmetry in the absolute error for positive and negative correlations as αp,2=0\alpha_{p,2}=0 for all odd pp. One also observes that approximation accuracy goes down with depth, which is due to the error in the coefficients at the previous layer contributing to the error in the coefficients at the next, thereby resulting in an accumulation of error with depth. Also, and certainly as one might expect, a larger truncation point TT results in overall better approximation. Finally, as the decay in the Hermite coefficients for ReLU is relatively slow, see e.g., Table 1 and Lemma 3.2, we expect the truncated ReLU NTK power series to perform worse relative to the truncated NTK’s for other activation functions.

Figure 2: (NTK Approximation via Truncation) Absolute error between the analytical ReLU NTK and the truncated ReLU NTK power series as a function of the input correlation ρ, for two values of the truncation point T and three values of the depth L of the network. Although the truncated NTK achieves a uniform approximation error of only 10^{-1} on [-1,1], for |ρ| ≤ 0.5, which is more typical of real-world data, T=50 suffices for the truncated NTK to achieve machine-level precision.
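The depth-two comparison underlying Figure 2 can be reproduced in a few lines. The sketch below is our own code, not part of the original experiments: it evaluates the truncated power series with coefficients κ_{p,2} = (1+p)α_{p,2}, where α_{p,2} = 2μ_p^2(ReLU), using the standard closed form of the squared Hermite coefficients of ReLU (only the squares enter, so signs are irrelevant), and compares the result against the analytical expression R(ρ) + ρR'(ρ).

import numpy as np
from math import factorial, pi

def relu_sq_hermite(P):
    # mu_0^2 = 1/(2 pi), mu_1^2 = 1/4, mu_p = 0 for odd p >= 3,
    # mu_p^2 = ((p - 3)!!)^2 / (2 pi p!) for even p >= 2.
    mu2 = np.zeros(P + 1)
    mu2[0] = 1.0 / (2.0 * pi)
    if P >= 1:
        mu2[1] = 0.25
    dfact = 1.0                                  # (p - 3)!!, with (-1)!! = 1
    for p in range(2, P + 1, 2):
        if p >= 4:
            dfact *= p - 3
        mu2[p] = dfact ** 2 / (2.0 * pi * factorial(p))
    return mu2

def truncated_ntk2(rho, T=50):
    # kappa_{p,2} = (1 + p) * alpha_{p,2} with gamma_w^2 = 1, gamma_b^2 = sigma_b^2 = 0.
    alpha = 2.0 * relu_sq_hermite(T)             # alpha_{p,2} = sigma_w^2 mu_p^2, sigma_w^2 = 2
    kappa = (1.0 + np.arange(T + 1)) * alpha
    return sum(kappa[p] * rho ** p for p in range(T + 1))

def analytic_ntk2(rho):
    # Theta_2(rho) = R(rho) + rho * R'(rho) for the ReLU NTK.
    R = (np.sqrt(1.0 - rho ** 2) + rho * np.arcsin(rho)) / np.pi + rho / 2.0
    return R + rho * (np.arcsin(rho) / np.pi + 0.5)

for r in (0.25, 0.5, 0.99):
    print(r, abs(truncated_ntk2(r) - analytic_ntk2(r)))

Consistent with Figure 2, in this sketch the error is at machine precision for |ρ| ≤ 0.5 and several orders of magnitude larger near |ρ| = 1 when T = 50.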

B.4 Characterizing NTK power series coefficient decay rates for deep networks

In general, Theorem 3.1 does not provide a straightforward path to analyzing the decay of the NTK power series coefficients for depths greater than two. This is at least in part due to the difficulty of analyzing F(p,k,ᾱ_{l-1}), which we recall from (4) is the sum of all ordered products of k elements of ᾱ_{l-1} whose indices sum to p. However, in the setting where the squares of the Hermite coefficients, and therefore the sequence (α_{p,2})_{p=0}^∞, decay at an exponential rate, this quantity can be characterized, which in turn allows at least a partial analysis of the impact of depth. Although admittedly limited in scope, we highlight that this setting is relevant for the study of Gaussian activation functions and radial basis function (RBF) networks. We also make the additional simplifying assumption that the activation function has zero Gaussian mean (which can be achieved by centering); unfortunately, this further reduces the applicability of the following results to activation functions commonly used in practice. We leave the study of relaxing this zero-mean assumption, perhaps by enforcing exponential decay only asymptotically, as well as a proper exploration of other decay patterns, to future work.

The following lemma precisely describes, in the specific setting considered here, the evolution of the coefficients of the Gaussian Process kernel with depth.

Lemma B.6.

Let α0,2=0\alpha_{0,2}=0 and αp,2=C2η2p\alpha_{p,2}=C_{2}\eta_{2}^{-p} for p1p\in\mathbb{Z}_{\geq 1}, where C2C_{2} and η2\eta_{2} are constants such that p=1αp,2=1\sum_{p=1}^{\infty}\alpha_{p,2}=1. Then for all l2l\geq 2 and p0p\in\mathbb{Z}_{\geq 0}

αp,l+1={0,p=0,Cl+1ηl+1p,p1\alpha_{p,l+1}=\begin{cases}0,&\;p=0,\\ C_{l+1}\eta_{l+1}^{-p},&\;p\geq 1\end{cases} (50)

where the constants ηl+1\eta_{l+1} and Cl+1C_{l+1} are defined as

ηl+1=ηlη2η2+Cl,Cl+1=ClC2η2+Cl.\displaystyle\eta_{l+1}=\frac{\eta_{l}\eta_{2}}{\eta_{2}+C_{l}},\;\;C_{l+1}=\frac{C_{l}C_{2}}{\eta_{2}+C_{l}}. (51)
Proof.

Observe that for l=2 the properties α_{0,l}=0 and α_{p,l}=C_l η_l^{-p} for p ≥ 1 hold by assumption. By induction it therefore suffices to show that these properties at layer l imply that (50) and (51) hold. To this end, assume for some l ≥ 2 that α_{0,l}=0 and α_{p,l}=C_l η_l^{-p} for p ≥ 1. Recalling the definition of F from (4), since α_{0,l}=0, for p ≥ 1 and 1 ≤ k ≤ p we have

F(p,k,α¯l)=(ji)𝒥(p,k)i=1kαji,l=(ji)𝒥+(p,k)i=1kαji,l,F(p,k,\bar{\alpha}_{l})=\sum_{(j_{i})\in\mathcal{J}(p,k)}\prod_{i=1}^{k}\alpha_{j_{i},l}=\sum_{(j_{i})\in\mathcal{J}_{+}(p,k)}\prod_{i=1}^{k}\alpha_{j_{i},l},

where

𝒥+(p,k):={(ji)i[k]:ji1i[k],i=1kji=p}for all p1k[p],\mathcal{J}_{+}(p,k)\vcentcolon=\big{\{}(j_{i})_{i\in[k]}\;:\;j_{i}\geq 1\;\forall i\in[k],\;\sum_{i=1}^{k}j_{i}=p\big{\}}\quad\text{for all $p\in\mathbb{Z}_{\geq 1}$, $k\in[p]$},

which is the set of all kk-tuples of positive (instead of non-negative) integers which sum to pp. Substituting αp,l=Clηlp\alpha_{p,l}=C_{l}\eta_{l}^{-p} then

F(p,k,α¯l)=(ji)𝒥+(p,k)Clkηlp=Clkηlp|𝒥+(p,k)|=Clkηlp(p1k1),F(p,k,\bar{\alpha}_{l})=\sum_{(j_{i})\in\mathcal{J}_{+}(p,k)}C_{l}^{k}\eta_{l}^{-p}=C_{l}^{k}\eta_{l}^{-p}|\mathcal{J}_{+}(p,k)|=C_{l}^{k}\eta_{l}^{-p}\binom{p-1}{k-1},

where the final equality follows from a stars and bars argument. Now observe that for k > p at least one of the indices in (j_i)_{i=1}^k must equal 0 and therefore ∏_{i=1}^k α_{j_i,l} = 0. As a result, under the assumptions of the lemma,

F(p,k,α¯l)={1,k=0 and p=0,Clkηlp(p1k1),k[p] and p1,0, otherwise.F(p,k,\bar{\alpha}_{l})=\begin{cases}1,\;&k=0\text{ and }p=0,\\ C_{l}^{k}\eta_{l}^{-p}\binom{p-1}{k-1},\;&\;k\in[p]\text{ and }p\geq 1,\\ 0,\;&\text{ otherwise.}\\ \end{cases} (52)

Substituting (52) into (7) it follows that

α0,l+1=k=0αk,2F(0,k,α¯l)=α0,2=0\alpha_{0,l+1}=\sum_{k=0}^{\infty}\alpha_{k,2}F(0,k,\bar{\alpha}_{l})=\alpha_{0,2}=0

and for p1p\geq 1

αp,l+1\displaystyle\alpha_{p,l+1} =k=0αk,2F(p,k,α¯l)\displaystyle=\sum_{k=0}^{\infty}\alpha_{k,2}F(p,k,\bar{\alpha}_{l})
=C2ηlpk=1p(Clη2)k(p1k1)\displaystyle=C_{2}\eta_{l}^{-p}\sum_{k=1}^{p}\left(\frac{C_{l}}{\eta_{2}}\right)^{k}\binom{p-1}{k-1}
=ηlpClη21C2h=0p1(Clη2)h(p1h)\displaystyle=\eta_{l}^{-p}C_{l}\eta_{2}^{-1}C_{2}\sum_{h=0}^{p-1}\left(\frac{C_{l}}{\eta_{2}}\right)^{h}\binom{p-1}{h}
=ηlpClη21C2(1+Clη2)p1\displaystyle=\eta_{l}^{-p}C_{l}\eta_{2}^{-1}C_{2}\left(1+\frac{C_{l}}{\eta_{2}}\right)^{p-1}
=ClC2η2+Cl(ηlη2η2+Cl)p\displaystyle=\frac{C_{l}C_{2}}{\eta_{2}+C_{l}}\left(\frac{\eta_{l}\eta_{2}}{\eta_{2}+C_{l}}\right)^{-p}
=Cl+1ηl+1p\displaystyle=C_{l+1}\eta_{l+1}^{-p}

as claimed. ∎
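As a quick illustration of this recursion, the short sketch below (ours, plain Python) iterates (51) from an assumed starting value η_2 and C_2 = η_2 − 1, which enforces the normalization Σ_{p≥1} C_2 η_2^{-p} = 1.

# A minimal sketch (ours): iterate the recursion (51) for a few layers.
eta2 = 4.0                       # assumed example value, eta_2 > 1
C2 = eta2 - 1.0                  # enforces sum_{p >= 1} C_2 eta_2^{-p} = 1
eta, C = eta2, C2
print(f"l=2: eta_l={eta:.4f}, C_l={C:.4f}")
for l in range(2, 8):
    # both updates use the previous pair (eta_l, C_l)
    eta, C = eta * eta2 / (eta2 + C), C * C2 / (eta2 + C)
    print(f"l={l + 1}: eta_l={eta:.4f}, C_l={C:.4f}")

In this example η_l decreases monotonically towards one while η_l − C_l = 1 is preserved; that is, the coefficients of the Gaussian Process kernel remain normalized but their exponential decay becomes slower with depth.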

We now analyze the coefficients of the derivative of the Gaussian Process kernel.

Lemma B.7.

In addition to the assumptions of Lemma B.6, assume also that ϕ\phi satisfies Assumption 3. Then υp,2=C2η2(1+p)η2p\upsilon_{p,2}=\frac{C_{2}}{\eta_{2}}(1+p)\eta_{2}^{-p}. Furthermore, for all l2l\geq 2 and p0p\in\mathbb{Z}_{\geq 0}

υp,l+1={C2η21,p=0,(Vl+1+Vl+1p)ηl+1p,p1,\upsilon_{p,l+1}=\begin{cases}C_{2}\eta_{2}^{-1},&\;p=0,\\ (V_{l+1}^{\prime}+V_{l+1}p)\eta_{l+1}^{-p},&\;p\geq 1,\end{cases} (53)

where the constants Vl+1V_{l+1}^{\prime} and Vl+1V_{l+1} are defined as

Vl+1:=2C2Clη2(Cl+η2)C2Cl2η2(Cl+η2)2,Vl+1:=C2Cl2η2(Cl+η2)2\displaystyle V_{l+1}^{\prime}\vcentcolon=\frac{2C_{2}C_{l}}{\eta_{2}(C_{l}+\eta_{2})}-\frac{C_{2}C_{l}^{2}}{\eta_{2}(C_{l}+\eta_{2})^{2}},\;\;V_{l+1}\vcentcolon=\frac{C_{2}C_{l}^{2}}{\eta_{2}(C_{l}+\eta_{2})^{2}} (54)

and ClC_{l} and ηl\eta_{l} are defined in (51).

Proof.

Under Assumption 3 then for all p0p\in\mathbb{Z}_{\geq 0} we have

υp,2=σw2μp2(ϕ)=σw2(p+1)μp+1(ϕ)2=(p+1)αp+1,2=C2η2(1+p)η2p.\upsilon_{p,2}=\sigma_{w}^{2}\mu_{p}^{2}(\phi^{\prime})=\sigma_{w}^{2}(p+1)\mu_{p+1}(\phi)^{2}=(p+1)\alpha_{p+1,2}=\frac{C_{2}}{\eta_{2}}(1+p)\eta_{2}^{-p}.

For l2l\geq 2 and p=0p=0 it therefore follows that

υ0,l+1=k=0(k+1)αk+1,2F(0,k,α¯l)=α1,2=C2η21.\upsilon_{0,l+1}=\sum_{k=0}^{\infty}(k+1)\alpha_{k+1,2}F(0,k,\bar{\alpha}_{l})=\alpha_{1,2}=C_{2}\eta_{2}^{-1}.

For l2l\geq 2 and p1p\geq 1 then

υp,l+1\displaystyle\upsilon_{p,l+1} =k=0υk,2F(p,k,α¯l)\displaystyle=\sum_{k=0}^{\infty}\upsilon_{k,2}F(p,k,\bar{\alpha}_{l})
=k=0(k+1)αk+1,2F(p,k,α¯l)\displaystyle=\sum_{k=0}^{\infty}(k+1)\alpha_{k+1,2}F(p,k,\bar{\alpha}_{l})
=h=1hC2η2hF(p,h1,α¯l)\displaystyle=\sum_{h=1}^{\infty}hC_{2}\eta_{2}^{-h}F(p,h-1,\bar{\alpha}_{l})
=C2Clηlph=2p+1h(Clη2)h(p1h2)\displaystyle=\frac{C_{2}}{C_{l}}\eta_{l}^{-p}\sum_{h=2}^{p+1}h\left(\frac{C_{l}}{\eta_{2}}\right)^{h}\binom{p-1}{h-2}
=C2Clηlpr=0p1(r+2)(Clη2)r+2(p1r)\displaystyle=\frac{C_{2}}{C_{l}}\eta_{l}^{-p}\sum_{r=0}^{p-1}(r+2)\left(\frac{C_{l}}{\eta_{2}}\right)^{r+2}\binom{p-1}{r}
=C2Clη22ηlp(2r=0p1(Clη2)r(p1r)+r=0p1r(Clη2)r(p1r))\displaystyle=\frac{C_{2}C_{l}}{\eta_{2}^{2}}\eta_{l}^{-p}\left(2\sum_{r=0}^{p-1}\left(\frac{C_{l}}{\eta_{2}}\right)^{r}\binom{p-1}{r}+\sum_{r=0}^{p-1}r\left(\frac{C_{l}}{\eta_{2}}\right)^{r}\binom{p-1}{r}\right)
=C2Clη22ηlp(2(1+Clη2)p1+Clη2(p1)(1+Clη2)p2)\displaystyle=\frac{C_{2}C_{l}}{\eta_{2}^{2}}\eta_{l}^{-p}\left(2\left(1+\frac{C_{l}}{\eta_{2}}\right)^{p-1}+\frac{C_{l}}{\eta_{2}}(p-1)\left(1+\frac{C_{l}}{\eta_{2}}\right)^{p-2}\right)
=2C2Clη2(Cl+η2)(ηlη2η2+Cl)p+C2Cl2η2(Cl+η2)2(p1)(ηlη2η2+Cl)p\displaystyle=\frac{2C_{2}C_{l}}{\eta_{2}(C_{l}+\eta_{2})}\left(\frac{\eta_{l}\eta_{2}}{\eta_{2}+C_{l}}\right)^{-p}+\frac{C_{2}C_{l}^{2}}{\eta_{2}(C_{l}+\eta_{2})^{2}}(p-1)\left(\frac{\eta_{l}\eta_{2}}{\eta_{2}+C_{l}}\right)^{-p}
=(2C2Clη2(Cl+η2)C2Cl2η2(Cl+η2)2)ηl+1p+(C2Cl2η2(Cl+η2)2)pηl+1p\displaystyle=\left(\frac{2C_{2}C_{l}}{\eta_{2}(C_{l}+\eta_{2})}-\frac{C_{2}C_{l}^{2}}{\eta_{2}(C_{l}+\eta_{2})^{2}}\right)\eta_{l+1}^{-p}+\left(\frac{C_{2}C_{l}^{2}}{\eta_{2}(C_{l}+\eta_{2})^{2}}\right)p\eta_{l+1}^{-p}
=(Vl+1+Vl+1p)ηl+1p\displaystyle=(V_{l+1}^{\prime}+V_{l+1}p)\eta_{l+1}^{-p}

as claimed. ∎

With the coefficients of both the Gaussian Process kernel and its derivative characterized, we proceed to upper bound the NTK coefficients, and thereby their decay, in the specific setting outlined in Lemmas B.6 and B.7.

Lemma B.8.

Let the data, hyperparameters and activation function ϕ be such that Assumptions 1, 2 and 3 are satisfied, along with the conditions of Lemma B.6. Then for any l ≥ 2 there exist positive constants M_l' and K_l' such that for all p ∈ ℤ_{≥1}

κp,l(Ml+Klp2l3)ηlp\kappa_{p,l}\leq(M_{l}^{\prime}+K_{l}^{\prime}p^{2l-3})\eta_{l}^{-p} (55)

where ηl\eta_{l} is defined in Lemma B.6.

Proof.

We proceed by induction, starting with the base case l=2. Applying the results of Lemmas B.6 and B.7 to (6), for p ∈ ℤ_{≥1} we obtain

κp,2=((C2+γb2C2η21)+(γb2C2η21+γw2C2)p)η2p.\kappa_{p,2}=((C_{2}+\gamma_{b}^{2}C_{2}\eta_{2}^{-1})+(\gamma_{b}^{2}C_{2}\eta_{2}^{-1}+\gamma_{w}^{2}C_{2})p)\eta_{2}^{-p}. (56)

If we define M2:=C2+γb2C2η21M_{2}^{\prime}\vcentcolon=C_{2}+\gamma_{b}^{2}C_{2}\eta_{2}^{-1} and K2:=γb2C2η21+γw2C2K_{2}^{\prime}\vcentcolon=\gamma_{b}^{2}C_{2}\eta_{2}^{-1}+\gamma_{w}^{2}C_{2}, which are clearly positive constants, then κp,2=(M2+K2p)η2p\kappa_{p,2}=(M_{2}^{\prime}+K_{2}^{\prime}p)\eta_{2}^{-p} and so for l=2l=2 the induction hypothesis clearly holds. We now assume the inductive hypothesis holds for some l2l\geq 2. Observe from (53), with l2l\geq 2 and p0p\in\mathbb{Z}_{\geq 0} that

υp,l+1(Al+1+Vl+1p)ηl+1p.\upsilon_{p,l+1}\leq(A_{l+1}^{\prime}+V_{l+1}p)\eta_{l+1}^{-p}. (57)

where A_{l+1}' := max{C_2 η_2^{-1}, V_{l+1}'}. Substituting (57) and the inductive hypothesis into (6), it follows for p ≥ 1 that

κp,l+1\displaystyle\kappa_{p,l+1} Cl+1ηl+1p+ηl+1pq=0p(Ml+Klq2l3)ηlq(Al+1+Vl+1(pq))ηl+1q\displaystyle\leq C_{l+1}\eta_{l+1}^{-p}+\eta_{l+1}^{-p}\sum_{q=0}^{p}(M_{l}^{\prime}+K_{l}^{\prime}q^{2l-3})\eta_{l}^{-q}(A_{l+1}^{\prime}+V_{l+1}(p-q))\eta_{l+1}^{q}
=Cl+1ηl+1p+ηl+1pq=0p(Ml+Klq2l3)(Al+1+Vl+1(pq))(η2η2+Cl)q\displaystyle=C_{l+1}\eta_{l+1}^{-p}+\eta_{l+1}^{-p}\sum_{q=0}^{p}(M_{l}^{\prime}+K_{l}^{\prime}q^{2l-3})(A_{l+1}^{\prime}+V_{l+1}(p-q))\left(\frac{\eta_{2}}{\eta_{2}+C_{l}}\right)^{q}
Cl+1ηl+1p+ηl+1pq=0p(Ml+Klq2l3)(Al+1+Vl+1(pq))\displaystyle\leq C_{l+1}\eta_{l+1}^{-p}+\eta_{l+1}^{-p}\sum_{q=0}^{p}(M_{l}^{\prime}+K_{l}^{\prime}q^{2l-3})(A_{l+1}^{\prime}+V_{l+1}(p-q))
Cl+1ηl+1p+ηl+1pq=0p(Ml+Klq2l3)(Al+1+Vl+1p)\displaystyle\leq C_{l+1}\eta_{l+1}^{-p}+\eta_{l+1}^{-p}\sum_{q=0}^{p}(M_{l}^{\prime}+K_{l}^{\prime}q^{2l-3})(A_{l+1}^{\prime}+V_{l+1}p)
(Cl+1+MlAl+1)ηl+1p+(MlVl+1p+q=1p(Ml+Klq2l3)(Al+1+Vl+1p))ηl+1p\displaystyle\leq(C_{l+1}+M_{l}^{\prime}A_{l+1}^{\prime})\eta_{l+1}^{-p}+\left(M_{l}^{\prime}V_{l+1}p+\sum_{q=1}^{p}(M_{l}^{\prime}+K_{l}^{\prime}q^{2l-3})(A_{l+1}^{\prime}+V_{l+1}p)\right)\eta_{l+1}^{-p}
(Cl+1+MlAl+1)ηl+1p+(MlVl+1p+p(Ml+Klp2l3)(Al+1+Vl+1p))ηl+1p\displaystyle\leq(C_{l+1}+M_{l}^{\prime}A_{l+1}^{\prime})\eta_{l+1}^{-p}+\left(M_{l}^{\prime}V_{l+1}p+p(M_{l}^{\prime}+K_{l}^{\prime}p^{2l-3})(A_{l+1}^{\prime}+V_{l+1}p)\right)\eta_{l+1}^{-p}
(Cl+1+MlAl+1)ηl+1p+p(MlAl+1+2MlVl+1p+KlAl+1p2l3+KlVl+1p2l2)ηl+1p\displaystyle\leq(C_{l+1}+M_{l}^{\prime}A_{l+1}^{\prime})\eta_{l+1}^{-p}+p\left(M_{l}^{\prime}A_{l+1}^{\prime}+2M_{l}^{\prime}V_{l+1}p+K_{l}^{\prime}A_{l+1}^{\prime}p^{2l-3}+K_{l}^{\prime}V_{l+1}p^{2l-2}\right)\eta_{l+1}^{-p}
((Cl+1+MlAl+1)+(MlAl+1+2MlVl+1+KlAl+1+KlVl+1)p2l1)ηl+1p\displaystyle\leq\left((C_{l+1}+M_{l}^{\prime}A_{l+1}^{\prime})+\left(M_{l}^{\prime}A_{l+1}^{\prime}+2M_{l}^{\prime}V_{l+1}+K_{l}^{\prime}A_{l+1}^{\prime}+K_{l}^{\prime}V_{l+1}\right)p^{2l-1}\right)\eta_{l+1}^{-p}

Therefore there exist positive constants Ml+1=Cl+1+MlAl+1M_{l+1}^{\prime}=C_{l+1}+M_{l}^{\prime}A_{l+1}^{\prime} and Kl+1=MlAl+1+2MlVl+1+KlAl+1+KlVl+1K_{l+1}^{\prime}=M_{l}^{\prime}A_{l+1}^{\prime}+2M_{l}^{\prime}V_{l+1}+K_{l}^{\prime}A_{l+1}^{\prime}+K_{l}^{\prime}V_{l+1} such that κp,l+1(Ml+1+Kl+1p2(l+1)3)ηl+1p\kappa_{p,l+1}\leq(M_{l+1}^{\prime}+K_{l+1}^{\prime}p^{2(l+1)-3})\eta_{l+1}^{-p} as claimed. This completes the inductive step and therefore also the proof of the lemma. ∎

Appendix C Analyzing the spectrum of the NTK via its power series

C.1 Effective rank of power series kernels

Recall that for a positive semidefinite matrix 𝐀 we define the effective rank (Huang et al., 2022) via the following ratio

eff(𝐀):=Tr(𝐀)λ1(𝐀).\mathrm{eff}(\mathbf{A}):=\frac{Tr(\mathbf{A})}{\lambda_{1}(\mathbf{A})}.

We consider a kernel Gram matrix 𝐊 ∈ ℝ^{n×n} that has the following power series representation in terms of the input Gram matrix 𝐗𝐗^T

n𝐊=i=0ci(𝐗𝐗T)i.n\mathbf{K}=\sum_{i=0}^{\infty}c_{i}(\mathbf{X}\mathbf{X}^{T})^{\odot i}.

Whenever c_0 ≠ 0 the effective rank of 𝐊 is O(1), as stated in Theorem 4.1, whose proof we now give.

Proof.

By linearity of trace we have that

Tr(n𝐊)=i=0ciTr((𝐗𝐗T)i)=ni=0ciTr(n\mathbf{K})=\sum_{i=0}^{\infty}c_{i}Tr((\mathbf{X}\mathbf{X}^{T})^{\odot i})=n\sum_{i=0}^{\infty}c_{i}

where we have used the fact that Tr((𝐗𝐗^T)^{⊙i}) = n for all i ∈ ℤ_{≥0}, which holds because the inputs have unit norm and hence every diagonal entry of 𝐗𝐗^T equals one. On the other hand, since each term c_i(𝐗𝐗^T)^{⊙i} of the series is positive semidefinite,

λ1(n𝐊)λ1(c0(𝐗𝐗T)0)=λ1(c0𝟏n×n)=nc0.\lambda_{1}(n\mathbf{K})\geq\lambda_{1}(c_{0}(\mathbf{X}\mathbf{X}^{T})^{0})=\lambda_{1}(c_{0}\mathbf{1}_{n\times n})=nc_{0}.

Thus we have that

eff(𝐊)=Tr(𝐊)λ1(𝐊)=Tr(n𝐊)λ1(n𝐊)i=0cic0.\mathrm{eff}(\mathbf{K})=\frac{Tr(\mathbf{K})}{\lambda_{1}(\mathbf{K})}=\frac{Tr(n\mathbf{K})}{\lambda_{1}(n\mathbf{K})}\leq\frac{\sum_{i=0}^{\infty}c_{i}}{c_{0}}.

The above theorem demonstrates that the constant term c_0 𝟏_{n×n} in the kernel leads to a significant outlier in the spectrum of 𝐊. However, it fails to capture how the structure of the input data 𝐗 manifests in the spectrum of 𝐊. For this we examine the centered kernel matrix 𝐊̃ := 𝐊 − (c_0/n)𝟏𝟏^T. Using a very similar argument to the one above, we can show that the effective rank of 𝐊̃ is controlled by the effective rank of the input data Gram matrix 𝐗𝐗^T. This is formalized in Theorem 4.3, whose proof we now give.

Proof.

By the linearity of the trace we have that

Tr(n𝐊~)=i=1ciTr((𝐗𝐗T)i)=Tr(𝐗𝐗T)i=1ciTr(n\widetilde{\mathbf{K}})=\sum_{i=1}^{\infty}c_{i}Tr((\mathbf{X}\mathbf{X}^{T})^{\odot i})=Tr(\mathbf{X}\mathbf{X}^{T})\sum_{i=1}^{\infty}c_{i}

where we have used the fact that Tr((𝐗𝐗^T)^{⊙i}) = Tr(𝐗𝐗^T) = n for all i ∈ ℤ_{≥1}. On the other hand we have that

λ1(n𝐊~)λ1(c1𝐗𝐗T)=c1λ1(𝐗𝐗T).\lambda_{1}(n\widetilde{\mathbf{K}})\geq\lambda_{1}(c_{1}\mathbf{X}\mathbf{X}^{T})=c_{1}\lambda_{1}(\mathbf{X}\mathbf{X}^{T}).

Thus we conclude

eff(𝐊~)=Tr(𝐊~)λ1(𝐊~)=Tr(n𝐊~)λ1(n𝐊~)Tr(𝐗𝐗T)λ1(𝐗𝐗T)i=1cic1.\mathrm{eff}(\widetilde{\mathbf{K}})=\frac{Tr(\widetilde{\mathbf{K}})}{\lambda_{1}(\widetilde{\mathbf{K}})}=\frac{Tr(n\widetilde{\mathbf{K}})}{\lambda_{1}(n\widetilde{\mathbf{K}})}\leq\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\frac{\sum_{i=1}^{\infty}c_{i}}{c_{1}}.
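A minimal numerical sanity check (ours) of the two bounds above, using a random unit-norm data matrix and assumed geometric coefficients c_i = 2^{-i}:

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # unit-norm rows, so diag(X X^T) = 1
G = X @ X.T
c = 0.5 ** np.arange(12)                            # assumed example coefficients
nK = sum(ci * G ** i for i, ci in enumerate(c))     # n*K = sum_i c_i (X X^T)^{Hadamard i}
nK_centered = nK - c[0] * np.ones((n, n))           # n * (K - (c_0 / n) 1 1^T)

def eff(A):
    return np.trace(A) / np.linalg.eigvalsh(A)[-1]  # Tr(A) / lambda_1(A)

print(eff(nK) <= c.sum() / c[0])                            # Theorem 4.1 bound
print(eff(nK_centered) <= eff(G) * c[1:].sum() / c[1])      # Theorem 4.3 bound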

C.2 Effective rank of the NTK for finite width networks

C.2.1 Notation and definitions

We will let [k]:={1,2,,k}[k]:=\{1,2,\ldots,k\}. We consider a neural network

=1maϕ(𝐰,𝐱)\sum_{\ell=1}^{m}a_{\ell}\phi(\langle\mathbf{w}_{\ell},\mathbf{x}\rangle)

where 𝐱 ∈ ℝ^d, 𝐰_ℓ ∈ ℝ^d and a_ℓ ∈ ℝ for all ℓ ∈ [m], and ϕ is a scalar-valued activation function. The network we present here does not have any bias terms in the inner layer; however, the results we prove later apply to the nonzero bias case by replacing 𝐱 with [𝐱^T,1]^T. We let 𝐖 ∈ ℝ^{m×d} be the matrix whose ℓ-th row is equal to 𝐰_ℓ and 𝐚 ∈ ℝ^m be the vector whose ℓ-th entry is equal to a_ℓ. We can then write the neural network in vector form

f(𝐱;𝐖,𝐚)=𝐚Tϕ(𝐖𝐱)f(\mathbf{x};\mathbf{W},\mathbf{a})=\mathbf{a}^{T}\phi(\mathbf{W}\mathbf{x})

where ϕ\phi is understood to be applied entry-wise.

Suppose we have n training data inputs 𝐱_1,…,𝐱_n ∈ ℝ^d. We will let 𝐗 ∈ ℝ^{n×d} be the matrix whose i-th row is equal to 𝐱_i. Let θ_inner = vec(𝐖) denote the row-wise vectorization of the inner-layer weights. We consider the Jacobian of the network's predictions on the training data with respect to the inner-layer weights:

𝐉innerT=[f(𝐱1)θinner,f(𝐱2)θinner,,f(𝐱n)θinner]\mathbf{J}_{inner}^{T}=\left[\frac{\partial f(\mathbf{x}_{1})}{\partial\mathbf{\theta}_{inner}},\frac{\partial f(\mathbf{x}_{2})}{\partial\mathbf{\theta}_{inner}},\ldots,\frac{\partial f(\mathbf{x}_{n})}{\partial\mathbf{\theta}_{inner}}\right]

Similarly, we can look at the analogous quantity for the outer-layer weights

𝐉outerT=[f(𝐱1)𝐚,f(𝐱2)𝐚,,f(𝐱n)𝐚]=ϕ(𝐖𝐗T).\mathbf{J}_{outer}^{T}=\left[\frac{\partial f(\mathbf{x}_{1})}{\partial\mathbf{a}},\frac{\partial f(\mathbf{x}_{2})}{\partial\mathbf{a}},\ldots,\frac{\partial f(\mathbf{x}_{n})}{\partial\mathbf{a}}\right]=\phi\left(\mathbf{W}\mathbf{X}^{T}\right).

Our first observation is that the per-example gradients for the inner layer weights have a nice Kronecker product representation

f(𝐱)θinner=[a1ϕ(𝐰1,𝐱)a2ϕ(𝐰2,𝐱)amϕ(𝐰m,𝐱)]𝐱.\frac{\partial f(\mathbf{x})}{\partial\mathbf{\theta}_{inner}}=\begin{bmatrix}a_{1}\phi^{\prime}(\langle\mathbf{w}_{1},\mathbf{x}\rangle)\\ a_{2}\phi^{\prime}(\langle\mathbf{w}_{2},\mathbf{x}\rangle)\\ \cdots\\ a_{m}\phi^{\prime}(\langle\mathbf{w}_{m},\mathbf{x}\rangle)\end{bmatrix}\otimes\mathbf{x}.

For convenience we will let

𝐘i:=[a1ϕ(𝐰1,𝐱i)a2ϕ(𝐰2,𝐱i)amϕ(𝐰m,𝐱i)].\mathbf{Y}_{i}:=\begin{bmatrix}a_{1}\phi^{\prime}(\langle\mathbf{w}_{1},\mathbf{x}_{i}\rangle)\\ a_{2}\phi^{\prime}(\langle\mathbf{w}_{2},\mathbf{x}_{i}\rangle)\\ \cdots\\ a_{m}\phi^{\prime}(\langle\mathbf{w}_{m},\mathbf{x}_{i}\rangle)\end{bmatrix}.

where the dependence of 𝐘i\mathbf{Y}_{i} on the parameters 𝐖\mathbf{W} and 𝐚\mathbf{a} is suppressed (formally 𝐘i=𝐘i(𝐖,𝐚)\mathbf{Y}_{i}=\mathbf{Y}_{i}(\mathbf{W},\mathbf{a})). This way we may write

f(𝐱i)θinner=𝐘i𝐱i.\frac{\partial f(\mathbf{x}_{i})}{\partial\mathbf{\theta}_{inner}}=\mathbf{Y}_{i}\otimes\mathbf{x}_{i}.

We will study the NTK with respect to the inner-layer weights

𝐊inner=𝐉inner𝐉innerT\mathbf{K}_{inner}=\mathbf{J}_{inner}\mathbf{J}_{inner}^{T}

and the same quantity for the outer-layer weights

𝐊outer=𝐉outer𝐉outerT.\mathbf{K}_{outer}=\mathbf{J}_{outer}\mathbf{J}_{outer}^{T}.
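As a small sanity check of this notation (our own sketch, not part of the original text), the Kronecker representation of the per-example gradients implies the Hadamard factorization 𝐊_inner = (𝐘𝐘^T) ⊙ (𝐗𝐗^T) used later in the proof of Theorem C.1, where 𝐘 is the matrix whose rows are the 𝐘_i:

import numpy as np

rng = np.random.default_rng(1)
n, d, m = 5, 3, 7
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)
phi_prime = lambda z: (z > 0).astype(float)                  # ReLU derivative (example choice)
Y = phi_prime(X @ W.T) * a                                   # Y[i, l] = a_l * phi'(<w_l, x_i>)
J_inner = np.stack([np.kron(Y[i], X[i]) for i in range(n)])  # rows Y_i (kron) x_i
K_inner = J_inner @ J_inner.T
print(np.allclose(K_inner, (Y @ Y.T) * (X @ X.T)))           # True: Hadamard factorization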

For a Hermitian matrix 𝐀 we will let λ_i(𝐀) denote the i-th largest eigenvalue of 𝐀, so that λ_1(𝐀) ≥ λ_2(𝐀) ≥ ⋯ ≥ λ_n(𝐀). Similarly, for an arbitrary matrix 𝐀 we will let σ_i(𝐀) denote the i-th largest singular value of 𝐀. For a matrix 𝐀 ∈ ℝ^{r×k} we will let σ_min(𝐀) := σ_{min(r,k)}(𝐀).

C.2.2 Effective rank

For a positive semidefinite matrix 𝐀\mathbf{A} we define the effective rank (Huang et al., 2022) of 𝐀\mathbf{A} to be the quantity

eff(𝐀):=Tr(𝐀)λ1(𝐀).\mathrm{eff}(\mathbf{A}):=\frac{Tr(\mathbf{A})}{\lambda_{1}(\mathbf{A})}.

The effective rank quantifies how many eigenvalues are on the order of the largest eigenvalue. We have the Markov-like inequality

|{i:λi(𝐀)cλ1(𝐀)}|c1Tr(𝐀)λ1(𝐀)\left\lvert\{i:\lambda_{i}(\mathbf{A})\geq c\lambda_{1}(\mathbf{A})\}\right\rvert\leq c^{-1}\frac{Tr(\mathbf{A})}{\lambda_{1}(\mathbf{A})}

and the eigenvalue bound

λi(𝐀)λ1(𝐀)1iTr(𝐀)λ1(𝐀).\frac{\lambda_{i}(\mathbf{A})}{\lambda_{1}(\mathbf{A})}\leq\frac{1}{i}\frac{Tr(\mathbf{A})}{\lambda_{1}(\mathbf{A})}.

Let 𝐀\mathbf{A} and 𝐁\mathbf{B} be positive semidefinite matrices. Then we have

Tr(𝐀+𝐁)λ1(𝐀+𝐁)Tr(𝐀)+Tr(𝐁)max(λ1(𝐀),λ1(𝐁))Tr(𝐀)λ1(𝐀)+Tr(𝐁)λ1(𝐁).\displaystyle\frac{Tr(\mathbf{A}+\mathbf{B})}{\lambda_{1}(\mathbf{A}+\mathbf{B})}\leq\frac{Tr(\mathbf{A})+Tr(\mathbf{B})}{\max\left(\lambda_{1}(\mathbf{A}),\lambda_{1}(\mathbf{B})\right)}\leq\frac{Tr(\mathbf{A})}{\lambda_{1}(\mathbf{A})}+\frac{Tr(\mathbf{B})}{\lambda_{1}(\mathbf{B})}.

Thus the effective rank is subadditive for positive semidefinite matrices.
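A short numerical check (ours) of this subadditivity for random positive semidefinite matrices:

import numpy as np

def eff(A):
    return np.trace(A) / np.linalg.eigvalsh(A)[-1]

rng = np.random.default_rng(2)
M1, M2 = rng.standard_normal((2, 6, 6))
A, B = M1 @ M1.T, M2 @ M2.T                 # random PSD matrices
print(eff(A + B) <= eff(A) + eff(B))        # True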

We will be interested in bounding the effective rank of the NTK. Let 𝐊=𝐉𝐉T=𝐉outer𝐉outerT+𝐉inner𝐉innerT=𝐊outer+𝐊inner\mathbf{K}=\mathbf{J}\mathbf{J}^{T}=\mathbf{J}_{outer}\mathbf{J}_{outer}^{T}+\mathbf{J}_{inner}\mathbf{J}_{inner}^{T}=\mathbf{K}_{outer}+\mathbf{K}_{inner} be the NTK matrix with respect to all the network parameters. Note that by subadditivity

Tr(𝐊)λ1(𝐊)Tr(𝐊outer)λ1(𝐊outer)+Tr(𝐊inner)λ1(𝐊inner).\frac{Tr(\mathbf{K})}{\lambda_{1}(\mathbf{K})}\leq\frac{Tr(\mathbf{K}_{outer})}{\lambda_{1}(\mathbf{K}_{outer})}+\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}(\mathbf{K}_{inner})}.

In this vein we will control the effective rank of 𝐊inner\mathbf{K}_{inner} and 𝐊outer\mathbf{K}_{outer} separately.

C.2.3 Effective rank of inner-layer NTK

We will show that the effective rank of the inner-layer NTK is bounded by a multiple of the effective rank of the input data Gram matrix 𝐗𝐗^T. We introduce the following meta-theorem, which we will use to prove various corollaries later.

Theorem C.1.

Set α:=sup𝐛=1[minj[n]|𝐘j,𝐛|]\alpha:=\sup_{\left\lVert\mathbf{b}\right\rVert=1}\left[\min_{j\in[n]}|\langle\mathbf{Y}_{j},\mathbf{b}\rangle|\right]. Assume α>0\alpha>0. Then

mini[n]𝐘i22Tr(𝐗𝐗T)maxi[n]𝐘i22λ1(𝐗𝐗T)Tr(𝐊inner)λ1(𝐊inner)maxi[n]𝐘i22α2Tr(𝐗𝐗T)λ1(𝐗𝐗T)\frac{\min_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}Tr(\mathbf{X}\mathbf{X}^{T})}{\max_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}\left(\mathbf{K}_{inner}\right)}\leq\frac{\max_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}}{\alpha^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}
Proof.

We first prove the upper bound. Observe that

Tr(𝐊inner)=i=1nf(𝐱i)θinner22=i=1n𝐘i𝐱i22=i=1n𝐘i22𝐱i22\displaystyle Tr(\mathbf{K}_{inner})=\sum_{i=1}^{n}\left\lVert\frac{\partial f(\mathbf{x}_{i})}{\partial\mathbf{\theta}_{inner}}\right\rVert_{2}^{2}=\sum_{i=1}^{n}\left\lVert\mathbf{Y}_{i}\otimes\mathbf{x}_{i}\right\rVert_{2}^{2}=\sum_{i=1}^{n}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}\left\lVert\mathbf{x}_{i}\right\rVert_{2}^{2}
maxj[n]𝐘j22i=1n𝐱i22=maxj[n]𝐘j22Tr(𝐗𝐗T)\displaystyle\leq\max_{j\in[n]}\left\lVert\mathbf{Y}_{j}\right\rVert_{2}^{2}\sum_{i=1}^{n}\left\lVert\mathbf{x}_{i}\right\rVert_{2}^{2}=\max_{j\in[n]}\left\lVert\mathbf{Y}_{j}\right\rVert_{2}^{2}Tr(\mathbf{X}\mathbf{X}^{T})

Recall that

λ1(𝐊inner)=λ1(𝐉inner𝐉innerT)=λ1(𝐉innerT𝐉inner).\lambda_{1}\left(\mathbf{K}_{inner}\right)=\lambda_{1}\left(\mathbf{J}_{inner}\mathbf{J}_{inner}^{T}\right)=\lambda_{1}\left(\mathbf{J}_{inner}^{T}\mathbf{J}_{inner}\right).

Now observe that

𝐉innerT𝐉inner=i=1nf(𝐱i)θinnerf(𝐱i)θinnerT=i=1n[𝐘i𝐱i][𝐘i𝐱i]T\displaystyle\mathbf{J}_{inner}^{T}\mathbf{J}_{inner}=\sum_{i=1}^{n}\frac{\partial f(\mathbf{x}_{i})}{\partial\mathbf{\theta}_{inner}}\frac{\partial f(\mathbf{x}_{i})}{\partial\mathbf{\theta}_{inner}}^{T}=\sum_{i=1}^{n}\left[\mathbf{Y}_{i}\otimes\mathbf{x}_{i}\right]\left[\mathbf{Y}_{i}\otimes\mathbf{x}_{i}\right]^{T}
=i=1n[𝐘i𝐘iT][𝐱i𝐱iT]\displaystyle=\sum_{i=1}^{n}\left[\mathbf{Y}_{i}\mathbf{Y}_{i}^{T}\right]\otimes\left[\mathbf{x}_{i}\mathbf{x}_{i}^{T}\right]

We may then use the fact that

λ1(𝐉innerT𝐉inner)=max𝐛2=1𝐛T𝐉innerT𝐉inner𝐛\lambda_{1}(\mathbf{J}_{inner}^{T}\mathbf{J}_{inner})=\max_{\left\lVert\mathbf{b}\right\rVert_{2}=1}\mathbf{b}^{T}\mathbf{J}_{inner}^{T}\mathbf{J}_{inner}\mathbf{b}

Let 𝐛1m\mathbf{b}_{1}\in\mathbb{R}^{m} and 𝐛2d\mathbf{b}_{2}\in\mathbb{R}^{d} be vectors that we will optimize later satisfying 𝐛12𝐛22=1\left\lVert\mathbf{b}_{1}\right\rVert_{2}\left\lVert\mathbf{b}_{2}\right\rVert_{2}=1. Then we have that 𝐛1𝐛2=1\left\lVert\mathbf{b}_{1}\otimes\mathbf{b}_{2}\right\rVert=1 and

(𝐛1𝐛2)T𝐉innerT𝐉inner(𝐛1𝐛2)=i=1n(𝐛1𝐛2)T([𝐘i𝐘iT][𝐱i𝐱iT])(𝐛1𝐛2)\displaystyle(\mathbf{b}_{1}\otimes\mathbf{b}_{2})^{T}\mathbf{J}_{inner}^{T}\mathbf{J}_{inner}(\mathbf{b}_{1}\otimes\mathbf{b}_{2})=\sum_{i=1}^{n}(\mathbf{b}_{1}\otimes\mathbf{b}_{2})^{T}\left(\left[\mathbf{Y}_{i}\mathbf{Y}_{i}^{T}\right]\otimes\left[\mathbf{x}_{i}\mathbf{x}_{i}^{T}\right]\right)(\mathbf{b}_{1}\otimes\mathbf{b}_{2})
=i=1n[𝐛1T𝐘i𝐘iT𝐛1][𝐛2T𝐱i𝐱iT𝐛2][minj[n]𝐛1T𝐘j𝐘jT𝐛1]i=1n𝐛2T𝐱i𝐱iT𝐛2\displaystyle=\sum_{i=1}^{n}\left[\mathbf{b}_{1}^{T}\mathbf{Y}_{i}\mathbf{Y}_{i}^{T}\mathbf{b}_{1}\right]\left[\mathbf{b}_{2}^{T}\mathbf{x}_{i}\mathbf{x}_{i}^{T}\mathbf{b}_{2}\right]\geq\left[\min_{j\in[n]}\mathbf{b}_{1}^{T}\mathbf{Y}_{j}\mathbf{Y}_{j}^{T}\mathbf{b}_{1}\right]\sum_{i=1}^{n}\mathbf{b}_{2}^{T}\mathbf{x}_{i}\mathbf{x}_{i}^{T}\mathbf{b}_{2}
\displaystyle=\left[\min_{j\in[n]}\mathbf{b}_{1}^{T}\mathbf{Y}_{j}\mathbf{Y}_{j}^{T}\mathbf{b}_{1}\right]\mathbf{b}_{2}^{T}\left[\sum_{i=1}^{n}\mathbf{x}_{i}\mathbf{x}_{i}^{T}\right]\mathbf{b}_{2}=\left[\min_{j\in[n]}\mathbf{b}_{1}^{T}\mathbf{Y}_{j}\mathbf{Y}_{j}^{T}\mathbf{b}_{1}\right]\mathbf{b}_{2}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{b}_{2}

Pick 𝐛2\mathbf{b}_{2} so that 𝐛2=1\left\lVert\mathbf{b}_{2}\right\rVert=1 and

\mathbf{b}_{2}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{b}_{2}=\lambda_{1}(\mathbf{X}^{T}\mathbf{X})=\lambda_{1}(\mathbf{X}\mathbf{X}^{T}).

Thus for this choice of 𝐛2\mathbf{b}_{2} we have

λ1(𝐉innerT𝐉inner)(𝐛1𝐛2)T𝐉innerT𝐉inner(𝐛1𝐛2)\displaystyle\lambda_{1}(\mathbf{J}_{inner}^{T}\mathbf{J}_{inner})\geq(\mathbf{b}_{1}\otimes\mathbf{b}_{2})^{T}\mathbf{J}_{inner}^{T}\mathbf{J}_{inner}(\mathbf{b}_{1}\otimes\mathbf{b}_{2})\geq
\displaystyle\left[\min_{j\in[n]}\mathbf{b}_{1}^{T}\mathbf{Y}_{j}\mathbf{Y}_{j}^{T}\mathbf{b}_{1}\right]\mathbf{b}_{2}^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{b}_{2}=\left[\min_{j\in[n]}\mathbf{b}_{1}^{T}\mathbf{Y}_{j}\mathbf{Y}_{j}^{T}\mathbf{b}_{1}\right]\lambda_{1}(\mathbf{X}\mathbf{X}^{T})

Now note that α2=sup𝐛1=1[minj[n]𝐛1T𝐘j𝐘jT𝐛1]\alpha^{2}=\sup_{\left\lVert\mathbf{b}_{1}\right\rVert=1}\left[\min_{j\in[n]}\mathbf{b}_{1}^{T}\mathbf{Y}_{j}\mathbf{Y}_{j}^{T}\mathbf{b}_{1}\right]. Thus by taking the sup over 𝐛1\mathbf{b}_{1} in our previous bound we have

λ1(𝐊inner)=λ1(𝐉innerT𝐉inner)α2λ1(𝐗𝐗T).\lambda_{1}(\mathbf{K}_{inner})=\lambda_{1}(\mathbf{J}_{inner}^{T}\mathbf{J}_{inner})\geq\alpha^{2}\lambda_{1}(\mathbf{X}\mathbf{X}^{T}).

Thus combined with our previous result we have

Tr(𝐊inner)λ1(𝐊inner)maxi[n]𝐘i22α2Tr(𝐗𝐗T)λ1(𝐗𝐗T).\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}\left(\mathbf{K}_{inner}\right)}\leq\frac{\max_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}}{\alpha^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.

We now prove the lower bound.

Tr(𝐊inner)=i=1nf(𝐱i)θinner22=i=1n𝐘i𝐱i22=i=1n𝐘i22𝐱i22\displaystyle Tr(\mathbf{K}_{inner})=\sum_{i=1}^{n}\left\lVert\frac{\partial f(\mathbf{x}_{i})}{\partial\mathbf{\theta}_{inner}}\right\rVert_{2}^{2}=\sum_{i=1}^{n}\left\lVert\mathbf{Y}_{i}\otimes\mathbf{x}_{i}\right\rVert_{2}^{2}=\sum_{i=1}^{n}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}\left\lVert\mathbf{x}_{i}\right\rVert_{2}^{2}
minj[n]𝐘j22i=1n𝐱i22=minj[n]𝐘j22Tr(𝐗𝐗T)\displaystyle\geq\min_{j\in[n]}\left\lVert\mathbf{Y}_{j}\right\rVert_{2}^{2}\sum_{i=1}^{n}\left\lVert\mathbf{x}_{i}\right\rVert_{2}^{2}=\min_{j\in[n]}\left\lVert\mathbf{Y}_{j}\right\rVert_{2}^{2}Tr(\mathbf{X}\mathbf{X}^{T})

Let 𝐘n×m\mathbf{Y}\in\mathbb{R}^{n\times m} be the matrix whose iith row is equal to 𝐘i\mathbf{Y}_{i}. Then observe that

𝐊inner=[𝐘𝐘T][𝐗𝐗T]\mathbf{K}_{inner}=[\mathbf{Y}\mathbf{Y}^{T}]\odot[\mathbf{X}\mathbf{X}^{T}]

where \odot denotes the entry-wise Hadamard product of two matrices. We now recall that if 𝐀\mathbf{A} and 𝐁\mathbf{B} are two positive semidefinite matrices we have (Oymak & Soltanolkotabi, 2020, Lemma 2)

λ1(𝐀𝐁)maxi[n]𝐀i,iλ1(𝐁).\lambda_{1}(\mathbf{A}\odot\mathbf{B})\leq\max_{i\in[n]}\mathbf{A}_{i,i}\lambda_{1}(\mathbf{B}).

Applying this to 𝐊inner\mathbf{K}_{inner} we get that

λ1(𝐊inner)maxi[n]𝐘i22λ1(𝐗𝐗T)\lambda_{1}(\mathbf{K}_{inner})\leq\max_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}\lambda_{1}(\mathbf{X}\mathbf{X}^{T})

Combining this with our previous result we get

mini[n]𝐘i22Tr(𝐗𝐗T)maxi[n]𝐘i22λ1(𝐗𝐗T)Tr(𝐊inner)λ1(𝐊inner)\frac{\min_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}Tr(\mathbf{X}\mathbf{X}^{T})}{\max_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}(\mathbf{K}_{inner})}

We immediately obtain a useful corollary that applies to the ReLU activation function.

Corollary C.2.

Set α:=sup𝐛=1[minj[n]|𝐘j,𝐛|]\alpha:=\sup_{\left\lVert\mathbf{b}\right\rVert=1}\left[\min_{j\in[n]}|\langle\mathbf{Y}_{j},\mathbf{b}\rangle|\right] and γmax:=supx|ϕ(x)|\gamma_{max}:=\sup_{x\in\mathbb{R}}|\phi^{\prime}(x)|. Assume α>0\alpha>0 and γmax<\gamma_{max}<\infty. Then

α2γmax2𝐚22Tr(𝐗𝐗T)λ1(𝐗𝐗T)Tr(𝐊inner)λ1(𝐊inner)γmax2𝐚22α2Tr(𝐗𝐗T)λ1(𝐗𝐗T)\frac{\alpha^{2}}{\gamma_{max}^{2}\left\lVert\mathbf{a}\right\rVert_{2}^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}\left(\mathbf{K}_{inner}\right)}\leq\frac{\gamma_{max}^{2}\left\lVert\mathbf{a}\right\rVert_{2}^{2}}{\alpha^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}
Proof.

Note that the hypothesis on |ϕ||\phi^{\prime}| gives 𝐘i22γmax2𝐚22\left\lVert\mathbf{Y}_{i}\right\rVert_{2}^{2}\leq\gamma_{max}^{2}\left\lVert\mathbf{a}\right\rVert_{2}^{2} for all i[n]i\in[n]. Moreover by Cauchy-Schwarz we have that mini[n]𝐘i2α\min_{i\in[n]}\left\lVert\mathbf{Y}_{i}\right\rVert_{2}\geq\alpha. Thus by theorem C.1 we get the desired result. ∎

If ϕ is a leaky-ReLU-type activation (such as those used in Nguyen & Mondelli (2020)), Theorem C.1 translates into an even simpler bound.

Corollary C.3.

Suppose ϕ(x)[γmin,γmax]\phi^{\prime}(x)\in[\gamma_{min},\gamma_{max}] for all xx\in\mathbb{R} where γmin>0\gamma_{min}>0. Then

γmin2Tr(𝐗𝐗T)γmax2λ1(𝐗𝐗T)Tr(𝐊inner)λ1(𝐊inner)γmax2γmin2Tr(𝐗𝐗T)λ1(𝐗𝐗T)\frac{\gamma_{min}^{2}Tr(\mathbf{X}\mathbf{X}^{T})}{\gamma_{max}^{2}\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}\left(\mathbf{K}_{inner}\right)}\leq\frac{\gamma_{max}^{2}}{\gamma_{min}^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}
Proof.

We will lower bound

α:=sup𝐛=1[minj[n]|𝐘j,𝐛|]\alpha:=\sup_{\left\lVert\mathbf{b}\right\rVert=1}\left[\min_{j\in[n]}|\langle\mathbf{Y}_{j},\mathbf{b}\rangle|\right]

so that we can apply Corollary C.2. Set 𝐛=𝐚/𝐚2\mathbf{b}=\mathbf{a}/\left\lVert\mathbf{a}\right\rVert_{2}. Then we have that

𝐘j,𝐛==1maϕ(𝐰,𝐱j)a/𝐚2γmin𝐚2=1ma2=γmin𝐚2\langle\mathbf{Y}_{j},\mathbf{b}\rangle=\sum_{\ell=1}^{m}a_{\ell}\phi^{\prime}(\langle\mathbf{w}_{\ell},\mathbf{x}_{j}\rangle)a_{\ell}/\left\lVert\mathbf{a}\right\rVert_{2}\geq\frac{\gamma_{min}}{\left\lVert\mathbf{a}\right\rVert_{2}}\sum_{\ell=1}^{m}a_{\ell}^{2}=\gamma_{min}\left\lVert\mathbf{a}\right\rVert_{2}

Thus α ≥ γ_min ‖𝐚‖_2, and the result then follows from Corollary C.2. ∎
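The bound of Corollary C.3 can be checked numerically; the sketch below (ours) uses a leaky-ReLU-type derivative with γ_min = 0.1 and γ_max = 1 together with the Hadamard factorization of 𝐊_inner from the proof of Theorem C.1:

import numpy as np

rng = np.random.default_rng(3)
n, d, m = 50, 10, 200
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))
a = rng.standard_normal(m)
gmin, gmax = 0.1, 1.0
Y = np.where(X @ W.T > 0, gmax, gmin) * a              # a_l * phi'(<w_l, x_i>)
K_inner = (Y @ Y.T) * (X @ X.T)                        # Hadamard factorization of K_inner

def eff(A):
    return np.trace(A) / np.linalg.eigvalsh(A)[-1]

ratio = eff(K_inner) / eff(X @ X.T)
print((gmin / gmax) ** 2 <= ratio <= (gmax / gmin) ** 2)   # True, by Corollary C.3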

Controlling α in Theorem C.1 when ϕ is the ReLU activation function requires a bit more work. To this end we introduce the following lemma.

Lemma C.4.

Assume ϕ(x)=ReLU(x)\phi(x)=ReLU(x). Let Rmin,Rmax>0R_{min},R_{max}>0 and define τ={[m]:|a|[Rmin,Rmax]}\tau=\{\ell\in[m]:|a_{\ell}|\in[R_{min},R_{max}]\}. Set T=mini[n]τ𝕀[𝐱i,𝐰0]T=\min_{i\in[n]}\sum_{\ell\in\tau}\mathbb{I}\left[\langle\mathbf{x}_{i},\mathbf{w}_{\ell}\rangle\geq 0\right]. Then

α:=sup𝐛=1[mini[n]|𝐘i,𝐛|]Rmin2RmaxT|τ|1/2\alpha:=\sup_{\left\lVert\mathbf{b}\right\rVert=1}\left[\min_{i\in[n]}|\langle\mathbf{Y}_{i},\mathbf{b}\rangle|\right]\geq\frac{R_{min}^{2}}{R_{max}}\frac{T}{|\tau|^{1/2}}
Proof.

Let 𝐚τ\mathbf{a}_{\tau} be the vector such that (𝐚τ)=a𝕀[τ](\mathbf{a}_{\tau})_{\ell}=a_{\ell}\mathbb{I}[\ell\in\tau]. Then note that

𝐘j,𝐚τ/𝐚τ2=1𝐚ττa2𝕀[𝐰,𝐱j0]\displaystyle\langle\mathbf{Y}_{j},\mathbf{a}_{\tau}/\left\lVert\mathbf{a}_{\tau}\right\rVert_{2}\rangle=\frac{1}{\left\lVert\mathbf{a}_{\tau}\right\rVert}\sum_{\ell\in\tau}a_{\ell}^{2}\mathbb{I}[\langle\mathbf{w}_{\ell},\mathbf{x}_{j}\rangle\geq 0]\geq
Rmin2𝐚ττ𝕀[𝐰,𝐱j0]Rmin2𝐚τ2TRmin2Rmax|τ|1/2T.\displaystyle\frac{R_{min}^{2}}{\left\lVert\mathbf{a}_{\tau}\right\rVert}\sum_{\ell\in\tau}\mathbb{I}[\langle\mathbf{w}_{\ell},\mathbf{x}_{j}\rangle\geq 0]\geq\frac{R_{min}^{2}}{\left\lVert\mathbf{a}_{\tau}\right\rVert_{2}}T\geq\frac{R_{min}^{2}}{R_{max}|\tau|^{1/2}}T.

Roughly speaking, Lemma C.4 says that α is controlled whenever there is a set of inner-layer neurons, whose outer-layer weights are similar in magnitude, that are active for every data point. Note that in Du et al. (2019b), Arora et al. (2019a), Oymak et al. (2019), Li et al. (2020), Xie et al. (2017) and Oymak & Soltanolkotabi (2020) the outer-layer weights all have a fixed constant magnitude. In that case we can set R_min = R_max in Lemma C.4 so that τ = [m]. In this setting we have the following result.

Theorem C.5.

Assume ϕ(x)=ReLU(x)\phi(x)=ReLU(x). Suppose |a|=R>0|a_{\ell}|=R>0 for all [m]\ell\in[m]. Furthermore suppose 𝐰1,,𝐰m\mathbf{w}_{1},\ldots,\mathbf{w}_{m} are independent random vectors such that 𝐰/𝐰\mathbf{w}_{\ell}/\left\lVert\mathbf{w}_{\ell}\right\rVert has the uniform distribution on the sphere for each [m]\ell\in[m]. Also assume m4log(n/ϵ)δ2m\geq\frac{4\log(n/\epsilon)}{\delta^{2}} for some δ,ϵ(0,1)\delta,\epsilon\in(0,1). Then with probability at least 1ϵ1-\epsilon we have that

(1δ)24eff(𝐗𝐗T)eff(𝐊inner)4(1δ)2eff(𝐗𝐗T).\frac{(1-\delta)^{2}}{4}\mathrm{eff}(\mathbf{X}\mathbf{X}^{T})\leq\mathrm{eff}(\mathbf{K}_{inner})\leq\frac{4}{(1-\delta)^{2}}\mathrm{eff}(\mathbf{X}\mathbf{X}^{T}).
Proof.

Fix j ∈ [n]. Note that by the assumption on the 𝐰_ℓ's, the indicators 𝕀[⟨𝐰_1,𝐱_j⟩ ≥ 0],…,𝕀[⟨𝐰_m,𝐱_j⟩ ≥ 0] are i.i.d. Bernoulli random variables taking the values 0 and 1 each with probability 1/2. Thus by the Chernoff bound for binomial random variables we have that

(=1m𝕀[𝐰,𝐱j0]m2(1δ))exp(δ2m4).\mathbb{P}\left(\sum_{\ell=1}^{m}\mathbb{I}[\langle\mathbf{w}_{\ell},\mathbf{x}_{j}\rangle\geq 0]\leq\frac{m}{2}(1-\delta)\right)\leq\exp\left(-\delta^{2}\frac{m}{4}\right).

Thus taking the union bound over every j[n]j\in[n] we get that if m4log(n/ϵ)δ2m\geq\frac{4\log(n/\epsilon)}{\delta^{2}} then

minj[n]=1m𝕀[𝐰,𝐱j0]m2(1δ)\min_{j\in[n]}\sum_{\ell=1}^{m}\mathbb{I}[\langle\mathbf{w}_{\ell},\mathbf{x}_{j}\rangle\geq 0]\geq\frac{m}{2}(1-\delta)

holds with probability at least 1ϵ1-\epsilon. Now note that if we set Rmin=Rmax=RR_{min}=R_{max}=R we have that τ=[m]\tau=[m] where τ\tau is defined as it is in Lemma C.4. In this case by our previous bound we have that TT as defined in Lemma C.4 satisfies Tm2(1δ)T\geq\frac{m}{2}(1-\delta) with probability at least 1ϵ1-\epsilon. In this case the conclusion of Lemma C.4 gives us

αRm1/2(1δ)2=𝐚2(1δ)2.\alpha\geq Rm^{1/2}\frac{(1-\delta)}{2}=\left\lVert\mathbf{a}\right\rVert_{2}\frac{(1-\delta)}{2}.

Thus by Corollary C.2 and the above bound for α\alpha we get the desired result. ∎
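A Monte Carlo check (ours) of the conclusion of Theorem C.5 with δ = 1/2, equal-magnitude outer weights and Gaussian inner weights (whose directions are uniform on the sphere):

import numpy as np

rng = np.random.default_rng(5)
n, d, m, R, delta = 40, 8, 4000, 1.0, 0.5
X = rng.standard_normal((n, d))
W = rng.standard_normal((m, d))                    # w_l / ||w_l|| uniform on the sphere
a = R * rng.choice([-1.0, 1.0], size=m)            # |a_l| = R
Y = (X @ W.T > 0).astype(float) * a                # a_l * phi'(<w_l, x_i>) for ReLU
K_inner = (Y @ Y.T) * (X @ X.T)

def eff(A):
    return np.trace(A) / np.linalg.eigvalsh(A)[-1]

ratio = eff(K_inner) / eff(X @ X.T)
print((1 - delta) ** 2 / 4 <= ratio <= 4 / (1 - delta) ** 2)   # True with high probability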

We will now use Lemma C.4 to prove a bound in the case of Gaussian initialization.

Lemma C.6.

Assume ϕ(x)=ReLU(x)\phi(x)=ReLU(x). Suppose that aN(0,ν2)a_{\ell}\sim N(0,\nu^{2}) for each [m]\ell\in[m] i.i.d. Furthermore suppose 𝐰1,,𝐰m\mathbf{w}_{1},\ldots,\mathbf{w}_{m} are random vectors independent of each other and 𝐚\mathbf{a} such that 𝐰/𝐰\mathbf{w}_{\ell}/\left\lVert\mathbf{w}_{\ell}\right\rVert has the uniform distribution on the sphere for each [m]\ell\in[m]. Set p=zN(0,1)(|z|[1/2,1])0.3p=\mathbb{P}_{z\sim N(0,1)}\left(|z|\in[1/2,1]\right)\approx 0.3. Assume

m4log(n/ϵ)δ2(1δ)pm\geq\frac{4\log(n/\epsilon)}{\delta^{2}(1-\delta)p}

for some ϵ,δ(0,1)\epsilon,\delta\in(0,1). Then with probability at least (1ϵ)2(1-\epsilon)^{2} we have that

α:=sup𝐛=1[mini[n]|𝐘i,𝐛|]ν8(1δ)3/2p1/2m1/2\alpha:=\sup_{\left\lVert\mathbf{b}\right\rVert=1}\left[\min_{i\in[n]}|\langle\mathbf{Y}_{i},\mathbf{b}\rangle|\right]\geq\frac{\nu}{8}(1-\delta)^{3/2}p^{1/2}m^{1/2}
Proof.

Set Rmin=ν/2R_{min}=\nu/2 and Rmax=νR_{max}=\nu. Now set

p=aN(0,ν2)(|a|[Rmin,Rmax])=2zN(0,1)(z[Rminν,Rmaxν])\displaystyle p=\mathbb{P}_{a\sim N(0,\nu^{2})}\left(|a|\in[R_{min},R_{max}]\right)=2\mathbb{P}_{z\sim N(0,1)}\left(z\in\left[\frac{R_{min}}{\nu},\frac{R_{max}}{\nu}\right]\right)
=2zN(0,1)(z[1/2,1])0.3.\displaystyle=2\mathbb{P}_{z\sim N(0,1)}\left(z\in\left[1/2,1\right]\right)\approx 0.3.

Now define τ={[m]:|a|[Rmin,Rmax]}\tau=\{\ell\in[m]:|a_{\ell}|\in[R_{min},R_{max}]\}. We have by the Chernoff bound for binomial random variables

(|τ|(1δ)mp)exp(δ2mp2).\mathbb{P}\left(|\tau|\leq(1-\delta)mp\right)\leq\exp\left(-\delta^{2}\frac{mp}{2}\right).

Thus if mlog(1ϵ)2pδ2m\geq\log\left(\frac{1}{\epsilon}\right)\frac{2}{p\delta^{2}} (a weaker condition than the hypothesis on mm) then we have that |τ|(1δ)mp|\tau|\geq(1-\delta)mp with probability at least 1ϵ1-\epsilon. From now on assume such a τ\tau has been observed and view it as fixed so that the only remaining randomness is over the 𝐰\mathbf{w}_{\ell}’s. Now set T=mini[n]τ𝕀[𝐱i,𝐰0]T=\min_{i\in[n]}\sum_{\ell\in\tau}\mathbb{I}\left[\langle\mathbf{x}_{i},\mathbf{w}_{\ell}\rangle\geq 0\right]. By the Chernoff bound again we get that for fixed i[n]i\in[n]

(τ𝕀[𝐱i,𝐰0](1δ)2|τ|)exp(δ2|τ|4).\mathbb{P}\left(\sum_{\ell\in\tau}\mathbb{I}\left[\langle\mathbf{x}_{i},\mathbf{w}_{\ell}\rangle\geq 0\right]\leq\frac{(1-\delta)}{2}|\tau|\right)\leq\exp\left(-\delta^{2}\frac{|\tau|}{4}\right).

Thus by taking the union bound over i[n]i\in[n] we get

(T(1δ)2|τ|)nexp(δ2|τ|4)\displaystyle\mathbb{P}\left(T\leq\frac{(1-\delta)}{2}|\tau|\right)\leq n\exp\left(-\delta^{2}\frac{|\tau|}{4}\right)
nexp(δ2(1δ)mp4)\displaystyle\leq n\exp\left(-\delta^{2}\frac{(1-\delta)mp}{4}\right)

Thus if we consider τ\tau as fixed and m4log(n/ϵ)δ2(1δ)pm\geq\frac{4\log(n/\epsilon)}{\delta^{2}(1-\delta)p} then with probability at least 1ϵ1-\epsilon over the sampling of the 𝐰\mathbf{w}_{\ell}’s we have that

T(1δ)2|τ|T\geq\frac{(1-\delta)}{2}|\tau|

In this case by lemma C.4 we have that

α:=sup𝐛=1[mini[n]|𝐘i,𝐛|]Rmin2RmaxT|τ|1/2\displaystyle\alpha:=\sup_{\left\lVert\mathbf{b}\right\rVert=1}\left[\min_{i\in[n]}|\langle\mathbf{Y}_{i},\mathbf{b}\rangle|\right]\geq\frac{R_{min}^{2}}{R_{max}}\frac{T}{|\tau|^{1/2}}
ν8(1δ)3/2m1/2p1/2.\displaystyle\geq\frac{\nu}{8}(1-\delta)^{3/2}m^{1/2}p^{1/2}.

Thus the above holds with probability at least (1ϵ)2(1-\epsilon)^{2}. ∎

This lemma now allows us to bound the effective rank of 𝐊inner\mathbf{K}_{inner} in the case of Gaussian initialization.

Theorem C.7.

Assume ϕ(x)=ReLU(x). Suppose that a_ℓ ∼ N(0,ν^2) i.i.d. for each ℓ ∈ [m]. Furthermore, suppose 𝐰_1,…,𝐰_m are random vectors, independent of each other and of 𝐚, such that 𝐰_ℓ/‖𝐰_ℓ‖ has the uniform distribution on the sphere for each ℓ ∈ [m]. Set p = ℙ_{z∼N(0,1)}(|z| ∈ [1/2,1]) ≈ 0.3 and let ϵ,δ ∈ (0,1). Then there exist absolute constants c,K > 0 such that if

m4log(n/ϵ)δ2(1δ)pm\geq\frac{4\log(n/\epsilon)}{\delta^{2}(1-\delta)p}

then with probability at least 13ϵ1-3\epsilon we have that

1CTr(𝐗𝐗T)λ1(𝐗𝐗T)Tr(𝐊inner)λ1(𝐊inner)CTr(𝐗𝐗T)λ1(𝐗𝐗T)\frac{1}{C}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}\left(\mathbf{K}_{inner}\right)}\leq C\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

where

C=64(1δ)3p[1+max{c1Klog(1/ϵ),mK}m].C=\frac{64}{(1-\delta)^{3}p}\left[1+\frac{\max\{c^{-1}K\log(1/\epsilon),mK\}}{m}\right].
Proof.

By Bernstein’s inequality

(𝐚/ν22mt)exp[cmin(t2mK2,tK)]\mathbb{P}\left(\left\lVert\mathbf{a}/\nu\right\rVert_{2}^{2}-m\geq t\right)\leq\exp\left[-c\cdot\min\left(\frac{t^{2}}{mK^{2}},\frac{t}{K}\right)\right]

where cc is an absolute constant. Set t=max{c1Klog(1/ϵ),mK}t=\max\{c^{-1}K\log(1/\epsilon),mK\} so that the right hand side of the above inequality is bounded by ϵ\epsilon. Thus by Lemma C.6 and the union bound we can ensure that with probability at least

1ϵ[1(1ϵ)2]=13ϵ+ϵ213ϵ1-\epsilon-[1-(1-\epsilon)^{2}]=1-3\epsilon+\epsilon^{2}\geq 1-3\epsilon

that 𝐚/ν22m+t\left\lVert\mathbf{a}/\nu\right\rVert_{2}^{2}\leq m+t and the conclusion of Lemma C.6 hold simultaneously. In that case

𝐚22α2ν2[m+t]ν264(1δ)3mp=64(1δ)3p[1+tm]=C.\displaystyle\frac{\left\lVert\mathbf{a}\right\rVert_{2}^{2}}{\alpha^{2}}\leq\frac{\nu^{2}[m+t]}{\frac{\nu^{2}}{64}(1-\delta)^{3}mp}=\frac{64}{(1-\delta)^{3}p}\left[1+\frac{t}{m}\right]=C.

Thus by Corollary C.2 we get the desired result. ∎

By fixing δ ∈ (0,1) in the previous theorem we obtain the following immediate corollary.

Corollary C.8.

Assume ϕ(x)=ReLU(x)\phi(x)=ReLU(x). Suppose that aN(0,ν2)a_{\ell}\sim N(0,\nu^{2}) for each [m]\ell\in[m] i.i.d. Furthermore suppose 𝐰1,,𝐰m\mathbf{w}_{1},\ldots,\mathbf{w}_{m} are random vectors independent of each other and 𝐚\mathbf{a} such that 𝐰/𝐰\mathbf{w}_{\ell}/\left\lVert\mathbf{w}_{\ell}\right\rVert has the uniform distribution on the sphere for each [m]\ell\in[m]. Then there exists an absolute constant C>0C>0 such that m=Ω(log(n/ϵ))m=\Omega(\log(n/\epsilon)) ensures that with probability at least 1ϵ1-\epsilon

1CTr(𝐗𝐗T)λ1(𝐗𝐗T)Tr(𝐊inner)λ1(𝐊inner)CTr(𝐗𝐗T)λ1(𝐗𝐗T)\frac{1}{C}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}\left(\mathbf{K}_{inner}\right)}\leq C\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

C.2.4 Effective rank of outer-layer NTK

Throughout this section ϕ(x)=ReLU(x). Our goal, as before, is to bound the effective rank of 𝐊_outer by the effective rank of the input data Gram matrix 𝐗𝐗^T. We will often make use of the following basic identities:

𝐀𝐁F𝐀2𝐁F\left\lVert\mathbf{A}\mathbf{B}\right\rVert_{F}\leq\left\lVert\mathbf{A}\right\rVert_{2}\left\lVert\mathbf{B}\right\rVert_{F}
𝐀𝐁F𝐀F𝐁2\left\lVert\mathbf{A}\mathbf{B}\right\rVert_{F}\leq\left\lVert\mathbf{A}\right\rVert_{F}\left\lVert\mathbf{B}\right\rVert_{2}
Tr(𝐀𝐀T)=Tr(𝐀T𝐀)=𝐀F2Tr(\mathbf{A}\mathbf{A}^{T})=Tr(\mathbf{A}^{T}\mathbf{A})=\left\lVert\mathbf{A}\right\rVert_{F}^{2}
𝐀2=𝐀T2\left\lVert\mathbf{A}\right\rVert_{2}=\left\lVert\mathbf{A}^{T}\right\rVert_{2}
λ1(𝐀T𝐀)=λ1(𝐀𝐀T)=𝐀22.\lambda_{1}(\mathbf{A}^{T}\mathbf{A})=\lambda_{1}(\mathbf{A}\mathbf{A}^{T})=\left\lVert\mathbf{A}\right\rVert_{2}^{2}.

To begin bounding the effective rank of 𝐊outer\mathbf{K}_{outer}, we prove the following lemma.

Lemma C.9.

Assume ϕ(x)=ReLU(x)\phi(x)=ReLU(x) and 𝐖\mathbf{W} is full rank with mdm\geq d. Then

ϕ(𝐖𝐗T)F2[ϕ(𝐖𝐗T)2+ϕ(𝐖𝐗T)2]2𝐖22σmin(𝐖)2Tr(𝐗𝐗T)λ1(𝐗𝐗T)\frac{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{\left[\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}+\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}\right]^{2}}\leq\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}
Proof.

First note that

ϕ(𝐖𝐗T)F2𝐖𝐗TF2𝐖22𝐗TF2=𝐖22Tr(𝐗𝐗T).\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}\leq\left\lVert\mathbf{W}\mathbf{X}^{T}\right\rVert_{F}^{2}\leq\left\lVert\mathbf{W}\right\rVert_{2}^{2}\left\lVert\mathbf{X}^{T}\right\rVert_{F}^{2}=\left\lVert\mathbf{W}\right\rVert_{2}^{2}Tr(\mathbf{X}\mathbf{X}^{T}).

Pick 𝐛 ∈ ℝ^d such that ‖𝐛‖_2 = 1 and ‖𝐗𝐛‖_2 = ‖𝐗‖_2. Since 𝐖^T is full rank we may set 𝐮 = (𝐖^T)^† 𝐛 so that 𝐖^T𝐮 = 𝐛 and ‖𝐮‖_2 ≤ σ_min(𝐖^T)^{-1}, where σ_min(𝐖^T) is the smallest nonzero singular value of 𝐖^T. Then

𝐗2=𝐗𝐛2=𝐗𝐖T𝐮2𝐗𝐖T2𝐮2𝐗𝐖T2σmin(𝐖T)1\displaystyle\left\lVert\mathbf{X}\right\rVert_{2}=\left\lVert\mathbf{X}\mathbf{b}\right\rVert_{2}=\left\lVert\mathbf{X}\mathbf{W}^{T}\mathbf{u}\right\rVert_{2}\leq\left\lVert\mathbf{X}\mathbf{W}^{T}\right\rVert_{2}\left\lVert\mathbf{u}\right\rVert_{2}\leq\left\lVert\mathbf{X}\mathbf{W}^{T}\right\rVert_{2}\sigma_{min}(\mathbf{W}^{T})^{-1}
=𝐖𝐗T2σmin(𝐖)1\displaystyle=\left\lVert\mathbf{W}\mathbf{X}^{T}\right\rVert_{2}\sigma_{min}(\mathbf{W})^{-1}

Now using the fact that x=ϕ(x)ϕ(x)x=\phi(x)-\phi(-x) we have that

𝐖𝐗T2=ϕ(𝐖𝐗T)ϕ(𝐖𝐗T)2ϕ(𝐖𝐗T)2+ϕ(𝐖𝐗T)2\left\lVert\mathbf{W}\mathbf{X}^{T}\right\rVert_{2}=\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})-\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}\leq\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}+\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}

Combining this with our previous results gives

𝐗2σmin(𝐖)1[ϕ(𝐖𝐗T)2+ϕ(𝐖𝐗T)2]\left\lVert\mathbf{X}\right\rVert_{2}\leq\sigma_{min}(\mathbf{W})^{-1}\left[\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}+\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}\right]

Therefore

ϕ(𝐖𝐗T)F2σmin(𝐖)2[ϕ(𝐖𝐗T)2+ϕ(𝐖𝐗T)2]2ϕ(𝐖𝐗T)F2𝐗22\displaystyle\frac{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{\sigma_{min}(\mathbf{W})^{-2}\left[\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}+\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}\right]^{2}}\leq\frac{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{\left\lVert\mathbf{X}\right\rVert_{2}^{2}}
𝐖22Tr(𝐗𝐗T)𝐗22=𝐖22Tr(𝐗𝐗T)λ1(𝐗𝐗T)\displaystyle\leq\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}Tr(\mathbf{X}\mathbf{X}^{T})}{\left\lVert\mathbf{X}\right\rVert_{2}^{2}}=\left\lVert\mathbf{W}\right\rVert_{2}^{2}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

which gives us the desired result. ∎

Corollary C.10.

Assume ϕ(x)=ReLU(x)\phi(x)=ReLU(x) and 𝐖\mathbf{W} is full rank with mdm\geq d. Then

max(ϕ(𝐖𝐗T)F2,ϕ(𝐖𝐗T)F2)max(ϕ(𝐖𝐗T)22,ϕ(𝐖𝐗T)22)4𝐖22σmin(𝐖)2Tr(𝐗𝐗T)λ1(𝐗𝐗T).\frac{\max\left(\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2},\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}\right)}{\max\left(\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2},\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}\right)}\leq 4\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.
Proof.

Using the fact that

ϕ(𝐖𝐗T)2+ϕ(𝐖𝐗T)22max(ϕ(𝐖𝐗T)2,ϕ(𝐖𝐗T)2)\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}+\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}\leq 2\max\left(\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2},\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}\right)

and lemma C.9 we have that

ϕ(𝐖𝐗T)F24max(ϕ(𝐖𝐗T)22,ϕ(𝐖𝐗T)22)𝐖22σmin(𝐖)2Tr(𝐗𝐗T)λ1(𝐗𝐗T)\frac{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{4\max\left(\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2},\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}\right)}\leq\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

Note that neither the right-hand side nor the denominator of the left-hand side changes when 𝐖 is replaced by −𝐖. Therefore, applying the above bound with both 𝐖 and −𝐖 as the weight matrix, we conclude

max(ϕ(𝐖𝐗T)F2,ϕ(𝐖𝐗T)F2)4max(ϕ(𝐖𝐗T)22,ϕ(𝐖𝐗T)22)𝐖22σmin(𝐖)2Tr(𝐗𝐗T)λ1(𝐗𝐗T).\frac{\max\left(\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2},\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}\right)}{4\max\left(\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2},\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}\right)}\leq\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.

Corollary C.11.

Assume ϕ(x)=ReLU(x)\phi(x)=ReLU(x) and mdm\geq d. Suppose 𝐖\mathbf{W} and 𝐖-\mathbf{W} have the same distribution. Then conditioned on 𝐖\mathbf{W} being full rank we have that with probability at least 1/21/2

Tr(𝐊outer)λ1(𝐊outer)4𝐖22σmin(𝐖)2Tr(𝐗𝐗T)λ1(𝐗𝐗T).\frac{Tr(\mathbf{K}_{outer})}{\lambda_{1}(\mathbf{K}_{outer})}\leq 4\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.
Proof.

Fix a full-rank 𝐖. By Corollary C.10, either

ϕ(𝐖𝐗T)F2ϕ(𝐖𝐗T)224𝐖22σmin(𝐖)2Tr(𝐗𝐗T)λ1(𝐗𝐗T).\frac{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}}\leq 4\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.

holds or

ϕ(𝐖𝐗T)F2ϕ(𝐖𝐗T)224𝐖22σmin(𝐖)2Tr(𝐗𝐗T)λ1(𝐗𝐗T)\frac{\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}}\leq 4\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

(the first holds in the case ‖ϕ(𝐖𝐗^T)‖_2^2 ≥ ‖ϕ(−𝐖𝐗^T)‖_2^2 and the second in the case ‖ϕ(𝐖𝐗^T)‖_2^2 < ‖ϕ(−𝐖𝐗^T)‖_2^2). Since 𝐖 and −𝐖 have the same distribution, it follows that the first inequality holds with probability at least 1/2. From

Tr(𝐊outer)λ1(𝐊outer)=𝐉outerTF2𝐉outerT22=ϕ(𝐖𝐗T)F2ϕ(𝐖𝐗T)22\frac{Tr(\mathbf{K}_{outer})}{\lambda_{1}(\mathbf{K}_{outer})}=\frac{\left\lVert\mathbf{J}_{outer}^{T}\right\rVert_{F}^{2}}{\left\lVert\mathbf{J}_{outer}^{T}\right\rVert_{2}^{2}}=\frac{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}}

we get the desired result. ∎

We now note that when 𝐖 is rectangular with i.i.d. Gaussian entries, it is full rank with high probability and σ_min(𝐖)^{-2}‖𝐖‖_2^2 is well behaved. We recall the following result from Vershynin (2012).

Theorem C.12.

Let 𝐀\mathbf{A} be a N×nN\times n matrix whose entries are independent standard normal random variables. Then for every t0t\geq 0, with probability at least 12exp(t2/2)1-2\exp(-t^{2}/2) one has

Nntσmin(𝐀)σ1(𝐀)N+n+t\sqrt{N}-\sqrt{n}-t\leq\sigma_{min}(\mathbf{A})\leq\sigma_{1}(\mathbf{A})\leq\sqrt{N}+\sqrt{n}+t

Corollary C.11 gives a bound that holds with probability at least 1/2. However, we would like a bound that holds with high probability. When m ≳ n the largest singular value of ϕ(𝐖𝐗^T) concentrates sufficiently well for us to prove such a bound. We recall the following result from Vershynin (2012, Remark 5.40).

Theorem C.13.

Assume that 𝐀\mathbf{A} is an N×nN\times n matrix whose rows 𝐀i\mathbf{A}_{i} are independent sub-gaussian random vectors in n\mathbb{R}^{n} with second moment matrix Σ\Sigma. Then for every t0t\geq 0, the following inequality holds with probability at least 12exp(ct2)1-2\exp(-ct^{2})

1N𝐀𝐀Σ2max(δ,δ2)whereδ=CnN+tN\left\lVert\frac{1}{N}\mathbf{A}^{*}\mathbf{A}-\Sigma\right\rVert_{2}\leq\max(\delta,\delta^{2})\quad\text{where}\quad\delta=C\sqrt{\frac{n}{N}}+\frac{t}{\sqrt{N}}

where C=CK,c=cK>0C=C_{K},c=c_{K}>0 depend only on K:=maxi𝐀iψ2K:=\max_{i}\left\lVert\mathbf{A}_{i}\right\rVert_{\psi_{2}}.

We will use Theorem C.13 in the following lemma.

Lemma C.14.

Assume ϕ(x)=ReLU(x)\phi(x)=ReLU(x). Let 𝐀=ϕ(𝐖𝐗T)\mathbf{A}=\phi(\mathbf{W}\mathbf{X}^{T}) and M=maxi[n]𝐱i2M=\max_{i\in[n]}\left\lVert\mathbf{x}_{i}\right\rVert_{2}. Suppose that 𝐰1,,𝐰mN(0,ν2Id)\mathbf{w}_{1},\ldots,\mathbf{w}_{m}\sim N(0,\nu^{2}I_{d}) i.i.d. Set K=MνnK=M\nu\sqrt{n} and define

Σ:=𝔼𝐰N(0,ν2I)[ϕ(𝐗𝐰)ϕ(𝐰T𝐗T)]\Sigma:=\mathbb{E}_{\mathbf{w}\sim N(0,\nu^{2}I)}[\phi(\mathbf{X}\mathbf{w})\phi(\mathbf{w}^{T}\mathbf{X}^{T})]

Then for every t0t\geq 0 the following inequality holds with probability at least 12exp(cKt2)1-2\exp(-c_{K}t^{2})

1m𝐀T𝐀Σ2max(δ,δ2)whereδ=CKnm+tm,\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}-\Sigma\right\rVert_{2}\leq\max(\delta,\delta^{2})\quad\text{where}\quad\delta=C_{K}\sqrt{\frac{n}{m}}+\frac{t}{\sqrt{m}},

where cK,CK>0c_{K},C_{K}>0 are absolute constants that depend only on KK.

Proof.

We will let 𝐀:\mathbf{A}_{\ell\colon} denote the \ellth row of 𝐀\mathbf{A} (considered as a column vector). Note that

𝐀:=ϕ(𝐗𝐰).\mathbf{A}_{\ell\colon}=\phi(\mathbf{X}\mathbf{w}_{\ell}).

We immediately get that the rows of 𝐀\mathbf{A} are i.i.d. We will now bound 𝐀:ψ2\left\lVert\mathbf{A}_{\ell\colon}\right\rVert_{\psi_{2}}. Let 𝐛n\mathbf{b}\in\mathbb{R}^{n} such that 𝐛2=1\left\lVert\mathbf{b}\right\rVert_{2}=1. Then

ϕ(𝐗𝐰),𝐛ψ2=i=1nϕ(𝐱i,𝐰)biψ2\displaystyle\left\lVert\langle\phi(\mathbf{X}\mathbf{w}_{\ell}),\mathbf{b}\rangle\right\rVert_{\psi_{2}}=\left\lVert\sum_{i=1}^{n}\phi(\langle\mathbf{x}_{i},\mathbf{w}_{\ell}\rangle)b_{i}\right\rVert_{\psi_{2}}
i=1n|bi|ϕ(𝐱i,𝐰)ψ2i=1n|bi|𝐱i,𝐰ψ2\displaystyle\leq\sum_{i=1}^{n}|b_{i}|\left\lVert\phi(\langle\mathbf{x}_{i},\mathbf{w}_{\ell}\rangle)\right\rVert_{\psi_{2}}\leq\sum_{i=1}^{n}|b_{i}|\left\lVert\langle\mathbf{x}_{i},\mathbf{w}_{\ell}\rangle\right\rVert_{\psi_{2}}
i=1n|bi|C𝐱i2νCMν𝐛1CMνn\displaystyle\leq\sum_{i=1}^{n}|b_{i}|C\left\lVert\mathbf{x}_{i}\right\rVert_{2}\nu\leq CM\nu\left\lVert\mathbf{b}\right\rVert_{1}\leq CM\nu\sqrt{n}

where C > 0 is an absolute constant. Set K := Mν√n. Then by Theorem C.13, for every t ≥ 0 the following inequality holds with probability at least 1−2exp(−c_K t^2):

1m𝐀T𝐀Σ2max(δ,δ2)whereδ=CKnm+tm\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}-\Sigma\right\rVert_{2}\leq\max(\delta,\delta^{2})\quad\text{where}\quad\delta=C_{K}\sqrt{\frac{n}{m}}+\frac{t}{\sqrt{m}}

We are now ready to prove a high probability bound for the effective rank of 𝐊outer\mathbf{K}_{outer}.

Theorem C.15.

Assume ϕ(x)=ReLU(x) and m ≥ d. Let M = max_{i∈[n]} ‖𝐱_i‖_2. Suppose that 𝐰_1,…,𝐰_m ∼ N(0,ν^2 I_d) i.i.d. Set K = Mν√n and define

Σ:=𝔼𝐰N(0,ν2I)[ϕ(𝐗𝐰)ϕ(𝐰T𝐗T)]\Sigma:=\mathbb{E}_{\mathbf{w}\sim N(0,\nu^{2}I)}[\phi(\mathbf{X}\mathbf{w})\phi(\mathbf{w}^{T}\mathbf{X}^{T})]
δ=CK[nm+log(2/ϵ)m]\delta=C_{K}\left[\sqrt{\frac{n}{m}}+\sqrt{\frac{\log(2/\epsilon)}{m}}\right]

where ϵ>0\epsilon>0 is small. Now assume

m>d+2log(2/ϵ)\sqrt{m}>\sqrt{d}+\sqrt{2\log(2/\epsilon)}

and

max(δ,δ2)12λ1(Σ)\max(\delta,\delta^{2})\leq\frac{1}{2}\lambda_{1}(\Sigma)

Then, with t_1 := √(2 log(2/ϵ)), we have with probability at least 1−3ϵ that

Tr(𝐊outer)λ1(𝐊outer)12(m+d+t1mdt1)2Tr(𝐗T𝐗)λ1(𝐗T𝐗)\frac{Tr(\mathbf{K}_{outer})}{\lambda_{1}(\mathbf{K}_{outer})}\leq 12\left(\frac{\sqrt{m}+\sqrt{d}+t_{1}}{\sqrt{m}-\sqrt{d}-t_{1}}\right)^{2}\frac{Tr(\mathbf{X}^{T}\mathbf{X})}{\lambda_{1}(\mathbf{X}^{T}\mathbf{X})}
Proof.

By Theorem C.12 with t_1 = √(2 log(2/ϵ)), with probability at least 1−ϵ we have that

mdt1σmin(𝐖/ν)σ1(𝐖/ν)m+d+t1\sqrt{m}-\sqrt{d}-t_{1}\leq\sigma_{min}(\mathbf{W}/\nu)\leq\sigma_{1}(\mathbf{W}/\nu)\leq\sqrt{m}+\sqrt{d}+t_{1} (58)

The above inequalities and the hypothesis on mm imply that 𝐖\mathbf{W} is full rank.

Let 𝐀=ϕ(𝐖𝐗T)\mathbf{A}=\phi(\mathbf{W}\mathbf{X}^{T}) and 𝐀~=ϕ(𝐖𝐗T)\tilde{\mathbf{A}}=\phi(-\mathbf{W}\mathbf{X}^{T}). Set t2=log(2/ϵ)cKt_{2}=\sqrt{\frac{\log(2/\epsilon)}{c_{K}}} where cKc_{K} is defined as in theorem C.14. Note that 𝐀\mathbf{A} and 𝐀~\tilde{\mathbf{A}} are identical in distribution. Thus by theorem C.14 and the union bound we get that with probability at least 12ϵ1-2\epsilon

1m𝐀T𝐀Σ2,1m𝐀~T𝐀~Σ2max(δ,δ2)=:ρ\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}-\Sigma\right\rVert_{2},\left\lVert\frac{1}{m}\tilde{\mathbf{A}}^{T}\tilde{\mathbf{A}}-\Sigma\right\rVert_{2}\leq\max(\delta,\delta^{2})=:\rho (59)

where

δ=CKnm+t2m.\delta=C_{K}\sqrt{\frac{n}{m}}+\frac{t_{2}}{\sqrt{m}}.

By our previous results and the union bound we can ensure with probability at least 13ϵ1-3\epsilon that the bounds (58) and (59) all hold simultaneously. In this case we have

1m𝐀~T𝐀~21m𝐀T𝐀2+2ρ\displaystyle\left\lVert\frac{1}{m}\tilde{\mathbf{A}}^{T}\tilde{\mathbf{A}}\right\rVert_{2}\leq\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}\right\rVert_{2}+2\rho
=1m𝐀T𝐀2[1+2ρ1m𝐀T𝐀2]1m𝐀T𝐀2[1+2ρλ1(Σ)ρ]\displaystyle=\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}\right\rVert_{2}\left[1+\frac{2\rho}{\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}\right\rVert_{2}}\right]\leq\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}\right\rVert_{2}\left[1+\frac{2\rho}{\lambda_{1}(\Sigma)-\rho}\right]

Assuming ρλ1(Σ)/2\rho\leq\lambda_{1}(\Sigma)/2 we have by the above bound

1m𝐀~T𝐀~231m𝐀T𝐀2.\left\lVert\frac{1}{m}\tilde{\mathbf{A}}^{T}\tilde{\mathbf{A}}\right\rVert_{2}\leq 3\left\lVert\frac{1}{m}\mathbf{A}^{T}\mathbf{A}\right\rVert_{2}.

Now note that

𝐀T𝐀2=ϕ(𝐖𝐗T)22𝐀~T𝐀~2=ϕ(𝐖𝐗T)22\left\lVert\mathbf{A}^{T}\mathbf{A}\right\rVert_{2}=\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}\quad\left\lVert\tilde{\mathbf{A}}^{T}\tilde{\mathbf{A}}\right\rVert_{2}=\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}

so that our previous bound implies

ϕ(𝐖𝐗T)223ϕ(𝐖𝐗T)22\left\lVert\phi(-\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}\leq 3\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}

Then, by corollary C.10, we have

Tr(𝐊outer)λ1(𝐊outer)=ϕ(𝐖𝐗T)F2ϕ(𝐖𝐗T)2212𝐖22σmin(𝐖)2Tr(𝐗𝐗T)λ1(𝐗𝐗T)\displaystyle\frac{Tr(\mathbf{K}_{outer})}{\lambda_{1}(\mathbf{K}_{outer})}=\frac{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{F}^{2}}{\left\lVert\phi(\mathbf{W}\mathbf{X}^{T})\right\rVert_{2}^{2}}\leq 12\frac{\left\lVert\mathbf{W}\right\rVert_{2}^{2}}{\sigma_{min}(\mathbf{W})^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}
12(m+d+t1mdt1)2Tr(𝐗𝐗T)λ1(𝐗𝐗T).\displaystyle\leq 12\left(\frac{\sqrt{m}+\sqrt{d}+t_{1}}{\sqrt{m}-\sqrt{d}-t_{1}}\right)^{2}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.

From the above theorem we get the following corollary.

Corollary C.16.

Assume ϕ(x)=ReLU(x)\phi(x)=ReLU(x) and ndn\geq d. Suppose that 𝐰1,,𝐰mN(0,ν2Id)\mathbf{w}_{1},\ldots,\mathbf{w}_{m}\sim N(0,\nu^{2}I_{d}) i.i.d. Fix ϵ>0\epsilon>0 small. Set M=maxi[n]𝐱i2M=\max_{i\in[n]}\left\lVert\mathbf{x}_{i}\right\rVert_{2}. Then

m=Ω(max(λ1(Σ)2,1)max(n,log(1/ϵ)))m=\Omega\left(\max(\lambda_{1}(\Sigma)^{-2},1)\max(n,\log(1/\epsilon))\right)

and

ν=O(1/Mm)\nu=O(1/M\sqrt{m})

suffice to ensure that, with probability at least 1ϵ1-\epsilon,

Tr(𝐊outer)λ1(𝐊outer)CTr(𝐗𝐗T)λ1(𝐗𝐗T)\frac{Tr(\mathbf{K}_{outer})}{\lambda_{1}(\mathbf{K}_{outer})}\leq C\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

where C>0C>0 is an absolute constant.
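As a sanity check of Corollary C.16, the two effective ranks can be compared directly. A small sketch follows; by the identity used in the proof of Theorem C.15, the ratio of the squared Frobenius and spectral norms of ϕ(𝐖𝐗^T) equals Tr(𝐊_outer)/λ₁(𝐊_outer), so 𝐊_outer itself need not be constructed. The sizes, seed and scale ν below are arbitrary illustrative choices.

import numpy as np

# Sketch: compare the effective rank of phi(W X^T) (which equals that of K_outer)
# with the effective rank of the data Gram matrix X X^T.  Sizes are placeholders.
rng = np.random.default_rng(1)
n, d, m = 200, 20, 4000
nu = 1.0 / np.sqrt(m)                               # consistent with nu = O(1/(M sqrt(m))), M ~ 1
X = rng.standard_normal((n, d))
W = nu * rng.standard_normal((m, d))
A = np.maximum(W @ X.T, 0.0)                        # phi(W X^T)

eff_rank_outer = np.linalg.norm(A, "fro") ** 2 / np.linalg.norm(A, 2) ** 2
eff_rank_data = np.linalg.norm(X, "fro") ** 2 / np.linalg.norm(X, 2) ** 2
print(eff_rank_outer, eff_rank_data)                # the former should be at most a constant times the latter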

C.2.5 Bound for the combined NTK

Based on the results in the previous two sections, we can now bound the effective rank of the combined NTK gram matrix 𝐊=𝐊inner+𝐊outer\mathbf{K}=\mathbf{K}_{inner}+\mathbf{K}_{outer}. See 4.5

Proof.

This follows from the union bound and Corollaries C.8 and C.16. ∎

C.2.6 Magnitude of the spectrum

By our results in sections C.2.3 and C.2.4 we have that mnm\gtrsim n suffices to ensure that

Tr(𝐊)λ1(𝐊)Tr(𝐗𝐗T)λ1(𝐗𝐗T)d\frac{Tr(\mathbf{K})}{\lambda_{1}(\mathbf{K})}\lesssim\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq d

Now note that, since the eigenvalues of 𝐊\mathbf{K} are nonnegative and ordered, for any i[n]i\in[n] we have

iλi(𝐊)λ1(𝐊)Tr(𝐊)λ1(𝐊)d\displaystyle i\frac{\lambda_{i}(\mathbf{K})}{\lambda_{1}(\mathbf{K})}\leq\frac{Tr(\mathbf{K})}{\lambda_{1}(\mathbf{K})}\lesssim d

If idi\gg d then λi(𝐊)/λ1(𝐊)\lambda_{i}(\mathbf{K})/\lambda_{1}(\mathbf{K}) is small. Thus the NTK only has O(d)O(d) large eigenvalues. The smallest eigenvalue λn(𝐊)\lambda_{n}(\mathbf{K}) of the NTK has been of interest in proving convergence guarantees (Du et al., 2019a, b; Oymak & Soltanolkotabi, 2020). By our previous inequality

λn(𝐊)λ1(𝐊)dn\frac{\lambda_{n}(\mathbf{K})}{\lambda_{1}(\mathbf{K})}\lesssim\frac{d}{n}

Thus in the setting where mndm\gtrsim n\gg d we have that the smallest eigenvalue will be driven to zero relative to the largest eigenvalue. Alternatively we can view the above inequality as a lower bound on the condition number

λ1(𝐊)λn(𝐊)nd\frac{\lambda_{1}(\mathbf{K})}{\lambda_{n}(\mathbf{K})}\gtrsim\frac{n}{d}

We will first bound the effective rank of the analytical NTK in the setting where the outer layer weights have fixed constant magnitude. This is the setting considered by Xie et al. (2017), Arora et al. (2019a), Du et al. (2019b), Oymak et al. (2019), Li et al. (2020), and Oymak & Soltanolkotabi (2020).

Theorem C.17.

Let ϕ(x)=ReLU(x)\phi(x)=ReLU(x) and assume 𝐗0\mathbf{X}\neq 0. Let 𝐊innern×n\mathbf{K}_{inner}^{\infty}\in\mathbb{R}^{n\times n} be the analytical NTK, i.e.

(𝐊inner)i,j:=𝐱i,𝐱j𝔼𝐰N(0,Id)[ϕ(𝐱i,𝐰)ϕ(𝐱j,𝐰)].(\mathbf{K}_{inner}^{\infty})_{i,j}:=\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle\mathbb{E}_{\mathbf{w}\sim N(0,I_{d})}\left[\phi^{\prime}(\langle\mathbf{x}_{i},\mathbf{w}\rangle)\phi^{\prime}(\langle\mathbf{x}_{j},\mathbf{w}\rangle)\right].

Then

14Tr(𝐗𝐗T)λ1(𝐗𝐗T)Tr(𝐊inner)λ1(𝐊inner)4Tr(𝐗𝐗T)λ1(𝐗𝐗T).\frac{1}{4}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner}^{\infty})}{\lambda_{1}\left(\mathbf{K}_{inner}^{\infty}\right)}\leq 4\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}.
Proof.

We consider the setting where |a|=1/m|a_{\ell}|=1/\sqrt{m} for all [m]\ell\in[m] and 𝐰N(0,Id)\mathbf{w}_{\ell}\sim N(0,I_{d}) i.i.d. As was shown by Jacot et al. (2018) and Du et al. (2019b), in this setting, if we fix the training data 𝐗\mathbf{X} and send mm\rightarrow\infty, then

𝐊inner𝐊inner20\left\lVert\mathbf{K}_{inner}-\mathbf{K}_{inner}^{\infty}\right\rVert_{2}\rightarrow 0

in probability. Therefore by continuity of the effective rank we have that

Tr(𝐊inner)λ1(𝐊inner)Tr(𝐊inner)λ1(𝐊inner)\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}(\mathbf{K}_{inner})}\rightarrow\frac{Tr(\mathbf{K}_{inner}^{\infty})}{\lambda_{1}(\mathbf{K}_{inner}^{\infty})}

in probability. Let η>0\eta>0. Then there exists an MM\in\mathbb{N} such that mMm\geq M implies that

|Tr(𝐊inner)λ1(𝐊inner)Tr(𝐊inner)λ1(𝐊inner)|η\left\lvert\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}(\mathbf{K}_{inner})}-\frac{Tr(\mathbf{K}_{inner}^{\infty})}{\lambda_{1}(\mathbf{K}_{inner}^{\infty})}\right\rvert\leq\eta (60)

with probability greater than 1/21/2. Now fix δ(0,1)\delta\in(0,1). On the other hand, by Theorem C.5 with ϵ=1/4\epsilon=1/4, if m4δ2log(4n)m\geq\frac{4}{\delta^{2}}\log(4n) then with probability at least 3/43/4

(1δ)24Tr(𝐗𝐗T)λ1(𝐗𝐗T)Tr(𝐊inner)λ1(𝐊inner)4(1δ)2Tr(𝐗𝐗T)λ1(𝐗𝐗T).\frac{(1-\delta)^{2}}{4}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner})}{\lambda_{1}\left(\mathbf{K}_{inner}\right)}\leq\frac{4}{(1-\delta)^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}. (61)

Thus if we set m=max(4δ2log(4n),M)m=\max(\frac{4}{\delta^{2}}\log(4n),M) we have with probability at least 3/41/2=1/43/4-1/2=1/4 that (60) and (61) hold simultaneously. In this case we have that

(1δ)24Tr(𝐗𝐗T)λ1(𝐗𝐗T)ηTr(𝐊inner)λ1(𝐊inner)4(1δ)2Tr(𝐗𝐗T)λ1(𝐗𝐗T)+η\frac{(1-\delta)^{2}}{4}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}-\eta\leq\frac{Tr(\mathbf{K}_{inner}^{\infty})}{\lambda_{1}\left(\mathbf{K}_{inner}^{\infty}\right)}\leq\frac{4}{(1-\delta)^{2}}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}+\eta

Note that the above argument runs through for any η>0\eta>0 and δ(0,1)\delta\in(0,1). Thus we may send η0+\eta\rightarrow 0^{+} and δ0+\delta\rightarrow 0^{+} in the above inequality to get

14Tr(𝐗𝐗T)λ1(𝐗𝐗T)Tr(𝐊inner)λ1(𝐊inner)4Tr(𝐗𝐗T)λ1(𝐗𝐗T)\frac{1}{4}\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}\leq\frac{Tr(\mathbf{K}_{inner}^{\infty})}{\lambda_{1}\left(\mathbf{K}_{inner}^{\infty}\right)}\leq 4\frac{Tr(\mathbf{X}\mathbf{X}^{T})}{\lambda_{1}(\mathbf{X}\mathbf{X}^{T})}

We thus have the following corollary about the conditioning of the analytical NTK.

Corollary C.18.

Let ϕ(x)=ReLU(x)\phi(x)=ReLU(x) and assume 𝐗0\mathbf{X}\neq 0. Let 𝐊innern×n\mathbf{K}_{inner}^{\infty}\in\mathbb{R}^{n\times n} be the analytical NTK, i.e.

(𝐊inner)i,j:=𝐱i,𝐱j𝔼𝐰N(0,Id)[ϕ(𝐱i,𝐰)ϕ(𝐱j,𝐰)].(\mathbf{K}_{inner}^{\infty})_{i,j}:=\langle\mathbf{x}_{i},\mathbf{x}_{j}\rangle\mathbb{E}_{\mathbf{w}\sim N(0,I_{d})}\left[\phi^{\prime}(\langle\mathbf{x}_{i},\mathbf{w}\rangle)\phi^{\prime}(\langle\mathbf{x}_{j},\mathbf{w}\rangle)\right].

Then

λn(𝐊inner)λ1(𝐊inner)4dn.\frac{\lambda_{n}(\mathbf{K}_{inner}^{\infty})}{\lambda_{1}(\mathbf{K}_{inner}^{\infty})}\leq 4\frac{d}{n}.
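Corollary C.18 is also easy to probe numerically. The sketch below builds the analytical NTK using the standard closed form E_{w~N(0,I_d)}[phi'(<x_i,w>) phi'(<x_j,w>)] = (pi - theta_ij)/(2 pi), where theta_ij is the angle between x_i and x_j; this orthant-probability identity is an assumption of the sketch rather than something established above, and the sizes and seed are arbitrary.

import numpy as np

# Sketch: build (K_inner^infty)_{ij} = <x_i, x_j> * (pi - theta_ij)/(2 pi) for ReLU
# (closed form assumed, see the lead-in) and compare lambda_n/lambda_1 with 4 d / n.
rng = np.random.default_rng(2)
n, d = 1000, 10
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)       # unit-norm rows

C = np.clip(X @ X.T, -1.0, 1.0)                      # cosines <x_i, x_j>
theta = np.arccos(C)
K_inf = C * (np.pi - theta) / (2 * np.pi)

eigs = np.linalg.eigvalsh(K_inf)                     # ascending order
print(eigs[0] / eigs[-1], 4 * d / n)                 # lambda_n / lambda_1 vs 4 d / n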

C.3 Experimental validation of results on the NTK spectrum

Figure 3: (NTK Spectrum for CNNs) We plot the normalized eigenvalues λp/λ1\lambda_{p}/\lambda_{1} of the NTK Gram matrix 𝐊\mathbf{K} and the data Gram matrix 𝐗𝐗T\mathbf{X}\mathbf{X}^{T} for Caltech101 and isotropic Gaussian datasets. To compute the NTK, we randomly initialize convolutional neural networks of depth 22 and 55 with 100100 channels per layer. We use the standard parameterization and Pytorch’s default Kaiming uniform initialization in order to better connect our results with what is used in practice. We consider a batch size of n=200n=200 and plot the first 100100 eigenvalues. The thick part of each curve corresponds to the mean across 10 trials while the transparent part corresponds to the 95% confidence interval.
Figure 4: (Asymptotic NTK Spectrum) NTK spectrum of two-layer fully connected networks with ReLU, Tanh and Gaussian activations under the NTK parameterization. The orange curves show the experimentally computed eigenvalues. The blue curves in the left column show the regression fit for the experimental eigenvalues as a function of the eigenvalue index \ell, of the form λ=ab\lambda_{\ell}=a\ell^{-b}, where aa and bb are parameters determined by the regression. The blue curves in the middle column show the regression fit of the form λ=a0.75b1/4\lambda_{\ell}=a\ell^{-0.75}b^{-\ell^{1/4}}, and the blue curves in the right column show the regression fit of the form λ=a0.5b1/2\lambda_{\ell}=a\ell^{-0.5}b^{-\ell^{1/2}}.

We experimentally test the theory developed in Section 4.1 and its implications by analyzing the spectrum of the NTK for both fully connected neural network architectures (FCNNs), the results of which are displayed in Figure 1, and also convolutional neural network architectures (CNNs), shown in Figure 3. For the feedforward architectures we consider networks of depth 2 and 5 with the width of all layers set to 500. With regard to the activation function we test linear, ReLU and Tanh, and in terms of initialization we use Kaiming uniform (He et al., 2015), which is very common in practice and is the default in PyTorch (Paszke et al., 2019). For the convolutional architectures we again consider depths 2 and 5, with each layer consisting of 100 channels and a filter size of 5x5. In terms of data, we consider 40x40 patches from both real world images, generated by applying PyTorch’s RandomResizedCrop transform to a random batch of Caltech101 images (Li et al., 2022), as well as synthetic images corresponding to isotropic Gaussian vectors. The batch size is fixed at 200 and we plot only the first 100 normalized eigenvalues. Each experiment was repeated 10 times. Finally, to compute the NTK we use the functorch module in PyTorch (https://pytorch.org/functorch/stable/notebooks/neural_tangent_kernels.html), following an algorithmic approach inspired by Novak et al. (2022).
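For reference, the empirical NTK Gram matrix used in these experiments is simply K_ij = <grad_theta f(x_i), grad_theta f(x_j)>. The functorch notebook linked above gives a vectorized recipe; the following is only a minimal, unoptimized PyTorch sketch using explicit per-example gradients, with a small placeholder fully connected network and random inputs standing in for the image patches.

import torch

# Minimal sketch: form K_ij = <grad_theta f(x_i), grad_theta f(x_j)> by stacking
# per-example parameter gradients.  Network and data sizes are placeholders.
torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(40 * 40, 500), torch.nn.ReLU(), torch.nn.Linear(500, 1))
X = torch.randn(20, 40 * 40)                       # stand-in for flattened 40x40 patches

def flat_grad(x):
    net.zero_grad()
    net(x.unsqueeze(0)).squeeze().backward()       # scalar output f(x)
    return torch.cat([p.grad.reshape(-1) for p in net.parameters()])

J = torch.stack([flat_grad(x) for x in X])         # shape: batch x num_parameters
K = J @ J.T                                        # empirical NTK Gram matrix
evals = torch.linalg.eigvalsh(K)
print(evals[-5:] / evals[-1])                      # largest normalized eigenvalues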

The results for convolutional neural networks show the same trends as observed in feedforward neural networks, which we discussed in Section 4.1. In particular, we again observe the dominant outlier eigenvalue, which increases with both depth and the size of the Gaussian mean of the activation. We also again see that the NTK spectrum inherits its structure from the data, i.e., is skewed for skewed data or relatively flat for isotropic Gaussian data. Finally, we also see that the spectrum for Tanh is closer to the spectrum for the linear activation when compared with the ReLU spectrum. In terms of differences between the CNN and FCNN experiments, we observe that the spread of the 95% confidence interval is slightly larger for convolutional nets, implying a slightly larger variance between trials. We remark that this is likely attributable to the fact that there are only 100 channels in each layer and by increasing this quantity we would expect the variance to reduce. In summary, despite the fact that our analysis is concerned with FCNNs, it appears that the broad implications and trends also hold for CNNs. We leave a thorough study of the NTK spectrum for CNNs and other network architectures to future work.

To test our theory in Section 4.2, we numerically plot the spectrum of NTK of two-layer feedforward networks with ReLU, Tanh, and Gaussian activations in Figure 4. The input data are uniformly drawn from 𝕊2\mathbb{S}^{2}. Notice that when d=2d=2, k=Θ(1/2)k=\Theta(\ell^{1/2}). Then Corollary 4.7 shows that for the ReLU activation λ=Θ(3/2)\lambda_{\ell}=\Theta(\ell^{-3/2}), for the Tanh activation λ=O(3/4exp(π21/4))\lambda_{\ell}=O\left(\ell^{-3/4}\exp(-\frac{\pi}{2}\ell^{1/4})\right), and for the Gaussian activation λ=O(1/221/2)\lambda_{\ell}=O(\ell^{-1/2}2^{-\ell^{1/2}}). These theoretical decay rates for the NTK spectrum are verified by the experimental results in Figure 4.
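The regression fits reported in Figure 4 can be reproduced with a simple least-squares fit in log space. Below is a short sketch for the pure power-law model λ_ℓ = aℓ^{-b}; the eigenvalue array is synthetic placeholder data, and the same approach applies to the other two functional forms after taking logarithms.

import numpy as np

# Sketch of the power-law fit lambda_ell ~ a * ell^(-b): linear regression of
# log(lambda) on log(ell).  The eigenvalues below are synthetic placeholders.
lam = 2.0 * np.arange(1, 101, dtype=float) ** -1.5
ell = np.arange(1, lam.size + 1, dtype=float)
slope, intercept = np.polyfit(np.log(ell), np.log(lam), 1)
print("a =", np.exp(intercept), "b =", -slope)      # expect a ~ 2 and b ~ 1.5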

C.4 Analysis of the lower spectrum: uniform data

See 4.6

Proof.

Let θ(t)=p=0cptp\theta(t)=\sum_{p=0}^{\infty}c_{p}t^{p}, so that K(x1,x2)=θ(x1,x2)K(x_{1},x_{2})=\theta(\langle x_{1},x_{2}\rangle). According to the Funk–Hecke theorem (Basri et al., 2019, Section 4.2), we have

λk¯=Vol(𝕊d1)11θ(t)Pk,d(t)(1t2)d22dt,\displaystyle\overline{\lambda_{k}}=\mathrm{Vol}(\mathbb{S}^{d-1})\int_{-1}^{1}\theta(t)P_{k,d}(t)(1-t^{2})^{\frac{d-2}{2}}\mathrm{d}t, (62)

where Vol(𝕊d1)=2πd/2Γ(d/2)\mathrm{Vol}(\mathbb{S}^{d-1})=\frac{2\pi^{d/2}}{\Gamma(d/2)} is the volume of the hypersphere 𝕊d1\mathbb{S}^{d-1}, and Pk,d(t)P_{k,d}(t) is the Gegenbauer polynomial, given by

Pk,d(t)=(1)k2kΓ(d/2)Γ(k+d/2)1(1t2)(d2)/2dkdtk(1t2)k+(d2)/2,P_{k,d}(t)=\frac{(-1)^{k}}{2^{k}}\frac{\Gamma(d/2)}{\Gamma(k+d/2)}\frac{1}{(1-t^{2})^{(d-2)/2}}\frac{d^{k}}{dt^{k}}(1-t^{2})^{k+(d-2)/2},

and Γ\Gamma is the gamma function.

From (62) we have

λk¯\displaystyle\overline{\lambda_{k}} =Vol(𝕊d1)11θ(t)Pk,d(t)(1t2)d22dt\displaystyle=\mathrm{Vol}(\mathbb{S}^{d-1})\int_{-1}^{1}\theta(t)P_{k,d}(t)(1-t^{2})^{\frac{d-2}{2}}\mathrm{d}t
=2πd/2Γ(d/2)11θ(t)(1)k2kΓ(d/2)Γ(k+d/2)dkdtk(1t2)k+(d2)/2dt\displaystyle=\frac{2\pi^{d/2}}{\Gamma(d/2)}\int_{-1}^{1}\theta(t)\frac{(-1)^{k}}{2^{k}}\frac{\Gamma(d/2)}{\Gamma(k+d/2)}\frac{d^{k}}{dt^{k}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t
=2πd/2Γ(d/2)(1)k2kΓ(d/2)Γ(k+d/2)p=0cp11tpdkdtk(1t2)k+(d2)/2dt.\displaystyle=\frac{2\pi^{d/2}}{\Gamma(d/2)}\frac{(-1)^{k}}{2^{k}}\frac{\Gamma(d/2)}{\Gamma(k+d/2)}\sum_{p=0}^{\infty}c_{p}\int_{-1}^{1}t^{p}\frac{d^{k}}{dt^{k}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t. (63)

Using integration by parts, we have

11tpdkdtk(1t2)k+(d2)/2dt\displaystyle\int_{-1}^{1}t^{p}\frac{d^{k}}{dt^{k}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t
=tpdk1dtk1(1t2)k+(d2)/2|11p11tp1dk1dtk1(1t2)k+(d2)/2dt\displaystyle=\left.t^{p}\frac{d^{k-1}}{dt^{k-1}}(1-t^{2})^{k+(d-2)/2}\right|_{-1}^{1}-p\int_{-1}^{1}t^{p-1}\frac{d^{k-1}}{dt^{k-1}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t
=p11tp1dk1dtk1(1t2)k+(d2)/2dt,\displaystyle=-p\int_{-1}^{1}t^{p-1}\frac{d^{k-1}}{dt^{k-1}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t, (64)

where the last line in (64) holds because dk1dtk1(1t2)k+(d2)/2=0\frac{d^{k-1}}{dt^{k-1}}(1-t^{2})^{k+(d-2)/2}=0 when t=1t=1 or t=1t=-1.

When p<kp<k, repeating the above procedure (64) pp times we get

11tpdkdtk(1t2)k+(d2)/2dt\displaystyle\int_{-1}^{1}t^{p}\frac{d^{k}}{dt^{k}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t =(1)pp!11dkpdtkp(1t2)k+(d2)/2dt\displaystyle=(-1)^{p}p!\int_{-1}^{1}\frac{d^{k-p}}{dt^{k-p}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t
=(1)pp!dkp1dtkp1(1t2)k+(d2)/2|11\displaystyle=(-1)^{p}p!\left.\frac{d^{k-p-1}}{dt^{k-p-1}}(1-t^{2})^{k+(d-2)/2}\right|_{-1}^{1}
=0.\displaystyle=0. (65)

When pkp\geq k, repeating the above procedure (64) kk times we get

11tpdkdtk(1t2)k+(d2)/2dt\displaystyle\int_{-1}^{1}t^{p}\frac{d^{k}}{dt^{k}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t =(1)kp(p1)(pk+1)11tpk(1t2)k+(d2)/2dt.\displaystyle=(-1)^{k}p(p-1)\cdots(p-k+1)\int_{-1}^{1}t^{p-k}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t. (66)

When pkp-k is odd, tpk(1t2)k+(d2)/2t^{p-k}(1-t^{2})^{k+(d-2)/2} is an odd function, then

11tpk(1t2)k+(d2)/2dt=0.\int_{-1}^{1}t^{p-k}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t=0. (67)

When pkp-k is even,

11tpk(1t2)k+(d2)/2dt\displaystyle\int_{-1}^{1}t^{p-k}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t =201tpk(1t2)k+(d2)/2dt\displaystyle=2\int_{0}^{1}t^{p-k}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t
=01(t2)(pk1)/2(1t2)k+(d2)/2dt2\displaystyle=\int_{0}^{1}(t^{2})^{(p-k-1)/2}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t^{2}
=B(pk+12,k+d/2)\displaystyle=B\left(\frac{p-k+1}{2},k+d/2\right)
=Γ(pk+12)Γ(k+d/2)Γ(pk+12+k+d/2),\displaystyle=\frac{\Gamma(\frac{p-k+1}{2})\Gamma(k+d/2)}{\Gamma(\frac{p-k+1}{2}+k+d/2)}, (68)

where BB is the beta function.

Combining (65), (67), and (68) with (66), we get

11tpdkdtk(1t2)k+(d2)/2dt\displaystyle\int_{-1}^{1}t^{p}\frac{d^{k}}{dt^{k}}(1-t^{2})^{k+(d-2)/2}\mathrm{d}t
={(1)kp(p1)(pk+1)Γ(pk+12)Γ(k+d/2)Γ(pk+12+k+d/2),pk is even and pk,0,otherwise.\displaystyle=\begin{cases}(-1)^{k}p(p-1)\ldots(p-k+1)\frac{\Gamma(\frac{p-k+1}{2})\Gamma(k+d/2)}{\Gamma(\frac{p-k+1}{2}+k+d/2)},&p-k\text{ is even and }p\geq k,\\ 0,&\text{otherwise}.\end{cases} (69)

Plugging (69) into (63), we get

λk¯\displaystyle\overline{\lambda_{k}} =2πd/2Γ(d/2)(1)k2kΓ(d/2)Γ(k+d/2)pkpk is evencp(1)kp(p1)(pk+1)Γ(pk+12)Γ(k+d/2)Γ(pk+12+k+d/2)\displaystyle=\frac{2\pi^{d/2}}{\Gamma(d/2)}\frac{(-1)^{k}}{2^{k}}\frac{\Gamma(d/2)}{\Gamma(k+d/2)}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}c_{p}(-1)^{k}p(p-1)\ldots(p-k+1)\frac{\Gamma(\frac{p-k+1}{2})\Gamma(k+d/2)}{\Gamma(\frac{p-k+1}{2}+k+d/2)}
=πd/22k1pkpk is evencpp(p1)(pk+1)Γ(pk+12)Γ(pk+12+k+d/2)\displaystyle=\frac{\pi^{d/2}}{2^{k-1}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}c_{p}\frac{p(p-1)\ldots(p-k+1)\Gamma(\frac{p-k+1}{2})}{\Gamma(\frac{p-k+1}{2}+k+d/2)}
=πd/22k1pkpk is evencpΓ(p+1)Γ(pk+12)Γ(pk+1)Γ(pk+12+k+d/2).\displaystyle=\frac{\pi^{d/2}}{2^{k-1}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}c_{p}\frac{\Gamma(p+1)\Gamma(\frac{p-k+1}{2})}{\Gamma(p-k+1)\Gamma(\frac{p-k+1}{2}+k+d/2)}.
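The final expression above is straightforward to evaluate numerically for a given coefficient sequence. The sketch below does so using log-gamma functions for numerical stability; the truncation point of the series, the choice d=3 and the power-law coefficients c_p = p^{-a} are arbitrary choices made for illustration, and the rescaled values printed should approach a constant as k grows, in line with the decay rates derived in the corollary below.

import math

# Sketch: evaluate lambda_bar_k = pi^{d/2} / 2^{k-1} * sum_{p >= k, p-k even}
# c_p * Gamma(p+1) Gamma((p-k+1)/2) / (Gamma(p-k+1) Gamma((p-k+1)/2 + k + d/2))
# by truncating the series; log-gamma keeps the individual terms stable.
def lambda_bar(k, d, c, p_max=50_000):
    total = 0.0
    for p in range(k, p_max + 1, 2):               # only p with p - k even contribute
        log_term = (math.lgamma(p + 1) + math.lgamma((p - k + 1) / 2)
                    - math.lgamma(p - k + 1) - math.lgamma((p - k + 1) / 2 + k + d / 2))
        total += c(p) * math.exp(log_term)
    return math.pi ** (d / 2) / 2 ** (k - 1) * total

d, a = 3, 1.5                                      # arbitrary illustrative choices
for k in [5, 10, 20, 40]:
    val = lambda_bar(k, d, lambda p: p ** -a)
    print(k, val, val * k ** (d + 2 * a - 2))      # rescaled values track the k^{-d-2a+2} rate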

See 4.7

Proof of Corollary C.4, part 1.

We first prove λk¯=O(kd2a+2)\overline{\lambda_{k}}=O(k^{-d-2a+2}). Suppose that cpCpac_{p}\leq Cp^{-a} for some constant CC, then according to Theorem 4.6 we have

λk¯\displaystyle\overline{\lambda_{k}} πd/22k1pkpk is evenCpaΓ(p+1)Γ(pk+12)Γ(pk+1)Γ(pk+12+k+d/2).\displaystyle\leq\frac{\pi^{d/2}}{2^{k-1}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}Cp^{-a}\frac{\Gamma(p+1)\Gamma(\frac{p-k+1}{2})}{\Gamma(p-k+1)\Gamma(\frac{p-k+1}{2}+k+d/2)}.

According to Stirling’s formula, we have

Γ(z)=2πz(ze)z(1+O(1z)).\Gamma(z)=\sqrt{\frac{2\pi}{z}}\left(\frac{z}{e}\right)^{z}\left(1+O\left(\frac{1}{z}\right)\right). (70)

Then for any z12z\geq\frac{1}{2}, we can find constants C1C_{1} and C2C_{2} such that

C12πz(ze)zΓ(z)C22πz(ze)z.C_{1}\sqrt{\frac{2\pi}{z}}\left(\frac{z}{e}\right)^{z}\leq\Gamma(z)\leq C_{2}\sqrt{\frac{2\pi}{z}}\left(\frac{z}{e}\right)^{z}. (71)

Then

λk¯\displaystyle\overline{\lambda_{k}} πd/22k1C22C12pkpk is evenCpa2πp+1(p+1e)p+12πpk+12(pk+12e)pk+122πpk+1(pk+1e)pk+12πpk+12+k+d/2(pk+12+k+d/2e)pk+12+k+d/2\displaystyle\leq\frac{\pi^{d/2}}{2^{k-1}}\frac{C_{2}^{2}}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}Cp^{-a}\frac{\sqrt{\frac{2\pi}{p+1}}\left(\frac{p+1}{e}\right)^{p+1}\sqrt{\frac{2\pi}{\frac{p-k+1}{2}}}\left(\frac{\frac{p-k+1}{2}}{e}\right)^{\frac{p-k+1}{2}}}{\sqrt{\frac{2\pi}{{p-k+1}}}\left(\frac{p-k+1}{e}\right)^{p-k+1}\sqrt{\frac{2\pi}{\frac{p-k+1}{2}+k+d/2}}\left(\frac{\frac{p-k+1}{2}+k+d/2}{e}\right)^{\frac{p-k+1}{2}+k+d/2}}
=πd/22k1C22CC12pkpk is evenpaed22p+1(p+1)p+1(pk+12)pk+12(pk+1)pk+11pk+12+k+d/2(pk+12+k+d/2)pk+12+k+d/2\displaystyle=\frac{\pi^{d/2}}{2^{k-1}}\frac{C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}p^{-a}\frac{e^{\frac{d}{2}}\sqrt{\frac{2}{p+1}}\left({p+1}\right)^{p+1}\left({\frac{p-k+1}{2}}\right)^{\frac{p-k+1}{2}}}{\left({p-k+1}\right)^{p-k+1}\sqrt{\frac{1}{\frac{p-k+1}{2}+k+d/2}}\left({\frac{p-k+1}{2}+k+d/2}\right)^{\frac{p-k+1}{2}+k+d/2}}
=πd/22k1C22CC12pkpk is evenpaed22p+k2(p+1)p+12(pk+1)pk+12(pk+12+k+d/2)pk2+k+d/2\displaystyle=\frac{\pi^{d/2}}{2^{k-1}}\frac{C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}p^{-a}\frac{e^{\frac{d}{2}}2^{\frac{-p+k}{2}}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({\frac{p-k+1}{2}+k+d/2}\right)^{\frac{p-k}{2}+k+d/2}}
=2πd/22d2ed2C22CC12pkpk is evenpa(p+1)p+12(pk+1)pk+12(p+k+1+d)p+k+d2.\displaystyle=2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}\frac{p^{-a}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}. (72)

We define

fa(p)=pa(p+1)p+12(pk+1)pk+12(p+k+1+d)p+k+d2.f_{a}(p)=\frac{p^{-a}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}. (73)

By applying the chain rule to elogfa(p)e^{\log f_{a}(p)}, we have that the derivative of faf_{a} is

fa(p)=(p+1)p+12pa2(pk+1)pk+12(p+k+d+1)p+k+d2(2apk+d(p+1)(p+k+d+1)+log(1+k2d(pk+1)(pk+1)(p+k+d+1))).f_{a}^{\prime}(p)=\frac{(p+1)^{p+\frac{1}{2}}p^{-a}}{2(p-k+1)^{\frac{p-k+1}{2}}(p+k+d+1)^{\frac{p+k+d}{2}}}\\ \cdot\left(-\frac{2a}{p}-\frac{k+d}{(p+1)(p+k+d+1)}+\log(1+\frac{k^{2}-d(p-k+1)}{(p-k+1)(p+k+d+1)})\right). (74)

Let ga(p)=2apk+d(p+1)(p+k+d+1)+log(1+k2d(pk+1)(pk+1)(p+k+d+1))g_{a}(p)=-\frac{2a}{p}-\frac{k+d}{(p+1)(p+k+d+1)}+\log(1+\frac{k^{2}-d(p-k+1)}{(p-k+1)(p+k+d+1)}). Then ga(p)g_{a}(p) and fa(p)f_{a}^{\prime}(p) have the same sign. Next we will show that ga(p)0g_{a}(p)\geq 0 for kpk2d+24ak\leq p\leq\frac{k^{2}}{d+24a} when kk is large enough.

First when pkp\geq k and k2d(pk+1)(pk+1)(p+k+d+1)1\frac{k^{2}-d(p-k+1)}{(p-k+1)(p+k+d+1)}\geq 1, we have

ga(p)2akk+d(k+1)(k+k+d+1)+log(2)0,\displaystyle g_{a}(p)\geq-\frac{2a}{k}-\frac{k+d}{(k+1)(k+k+d+1)}+\log(2)\geq 0, (75)

when kk is sufficiently large.

Second when pkp\geq k and 0k2d(pk+1)(pk+1)(p+k+d+1)10\leq\frac{k^{2}-d(p-k+1)}{(p-k+1)(p+k+d+1)}\leq 1, since log(1+x)x2\log(1+x)\geq\frac{x}{2} for 0x10\leq x\leq 1, we have

ga(p)\displaystyle g_{a}(p) 2apk+d(p+1)(p+k+d+1)+k2d(pk+1)2(pk+1)(p+k+d+1)\displaystyle\geq-\frac{2a}{p}-\frac{k+d}{(p+1)(p+k+d+1)}+\frac{k^{2}-d(p-k+1)}{2(p-k+1)(p+k+d+1)}
2apk+d(p+1)(p+k+d+1)+k2dp2p(p+k+d+1).\displaystyle\geq-\frac{2a}{p}-\frac{k+d}{(p+1)(p+k+d+1)}+\frac{k^{2}-dp}{2p(p+k+d+1)}.

When pk2d+24ap\leq\frac{k^{2}}{d+24a}, we have k2dp24apk^{2}-dp\geq 24ap. Then

k2dp4p(p+k+d+1)24ap4p(p+k+d+1)6ap(p+1)(p+k+d+1)k+d(p+1)(p+k+d+1)\frac{k^{2}-dp}{4p(p+k+d+1)}\geq\frac{24ap}{4p(p+k+d+1)}\geq\frac{6ap}{(p+1)(p+k+d+1)}\geq\frac{k+d}{(p+1)(p+k+d+1)}

when kk is sufficiently large. Also we have

\frac{k^{2}-dp}{4p(p+k+d+1)}\geq\frac{24ap}{4p(p+k+d+1)}\geq\frac{6a}{p+k+d+1}\geq\frac{2a}{p}

when kk is sufficiently large.

Combining all the arguments above, we conclude that ga(p)0g_{a}(p)\geq 0 and fa(p)0f_{a}^{\prime}(p)\geq 0 when kpk2d+24ak\leq p\leq\frac{k^{2}}{d+24a}. Then when kpk2d+24ak\leq p\leq\frac{k^{2}}{d+24a}, we have

fa(p)fa(k2d+24a).f_{a}(p)\leq f_{a}\left(\frac{k^{2}}{d+24a}\right). (76)

When pk2d+24ap\geq\frac{k^{2}}{d+24a}, we have

fa(p)\displaystyle f_{a}(p) =pa(p+1)p+12(pk+1)pk+12(p+k+1+d)p+k+d2\displaystyle=\frac{p^{-a}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}
=pa(p+1)p+12((p+1)2k2+d(pk+1))pk+12(p+k+1+d)2k+d12\displaystyle=\frac{p^{-a}\left({p+1}\right)^{p+\frac{1}{2}}}{\left((p+1)^{2}-k^{2}+d(p-k+1)\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{2k+d-1}{2}}}
=pa(p+1)d2(1k2d(pk+1)(p+1)2)pk+12(1+k+dp+1)2k+d12\displaystyle=\frac{p^{-a}\left({p+1}\right)^{-\frac{d}{2}}}{\left(1-\frac{k^{2}-d(p-k+1)}{(p+1)^{2}}\right)^{\frac{p-k+1}{2}}\left({1+\frac{k+d}{p+1}}\right)^{\frac{2k+d-1}{2}}}
pad2(1k2d(pk+1)(p+1)2)pk+12.\displaystyle\leq\frac{p^{-a-\frac{d}{2}}}{\left(1-\frac{k^{2}-d(p-k+1)}{(p+1)^{2}}\right)^{\frac{p-k+1}{2}}}.

If k2d(pk+1)<0k^{2}-d(p-k+1)<0, (1k2d(pk+1)(p+1)2)pk+121\left(1-\frac{k^{2}-d(p-k+1)}{(p+1)^{2}}\right)^{\frac{p-k+1}{2}}\geq 1. If k2d(pk+1)0k^{2}-d(p-k+1)\geq 0, i.e., pk2+dkddp\leq\frac{k^{2}+dk-d}{d}, for sufficiently large kk, we have

(1k2d(pk+1)(p+1)2)pk+12\displaystyle\left(1-\frac{k^{2}-d(p-k+1)}{(p+1)^{2}}\right)^{\frac{p-k+1}{2}} (1k2d(k2d+24ak+1)(k2d+24a+1)2)k2+dkddk+12\displaystyle\geq\left(1-\frac{k^{2}-d(\frac{k^{2}}{d+24a}-k+1)}{(\frac{k^{2}}{d+24a}+1)^{2}}\right)^{\frac{\frac{k^{2}+dk-d}{d}-k+1}{2}}
(148a(d+24a)k2)k22d\displaystyle\geq\left(1-\frac{48a(d+24a)}{k^{2}}\right)^{\frac{k^{2}}{2d}}
ek22d48a(d+24a)k2=e48a(d+24a)2d,\displaystyle\geq e^{-\frac{k^{2}}{2d}\frac{48a(d+24a)}{k^{2}}}=e^{-\frac{48a(d+24a)}{2d}},

which is a constant independent of kk. Then for pk2d+24ap\geq\frac{k^{2}}{d+24a}, we have

fa(p)e48a(d+24a)2dpad2.f_{a}(p)\leq e^{\frac{48a(d+24a)}{2d}}p^{-a-\frac{d}{2}}. (77)

Finally we have

λk¯\displaystyle\overline{\lambda_{k}} =2πd/22d2ed2C22CC12pkpk is evenfa(p)\displaystyle=2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}f_{a}(p)
O(kpk2d+24apk is evenfa(p)+pk2d+24apk is evenfa(p))\displaystyle\leq O\left(\sum_{\begin{subarray}{c}k\leq p\leq\frac{k^{2}}{d+24a}\\ p-k\text{ is even}\end{subarray}}f_{a}(p)+\sum_{\begin{subarray}{c}p\geq\frac{k^{2}}{d+24a}\\ p-k\text{ is even}\end{subarray}}f_{a}(p)\right)
O((k2d+24ak+1)fa(k2d+24a)+pk2d+24apk is evene48a(d+24a)2dpad2)\displaystyle\leq O\left(\left(\frac{k^{2}}{d+24a}-k+1\right)f_{a}\left(\frac{k^{2}}{d+24a}\right)+\sum_{\begin{subarray}{c}p\geq\frac{k^{2}}{d+24a}\\ p-k\text{ is even}\end{subarray}}e^{\frac{48a(d+24a)}{2d}}p^{-a-\frac{d}{2}}\right)
O((k2d+24ak+1)e48a(d+24a)2d(k2d+24a)ad2+e48a(d+24a)2d1a+d21(k2d+24a1)1ad2)\displaystyle\leq O\left(\left(\frac{k^{2}}{d+24a}-k+1\right)e^{\frac{48a(d+24a)}{2d}}\left(\frac{k^{2}}{d+24a}\right)^{-a-\frac{d}{2}}+e^{\frac{48a(d+24a)}{2d}}\frac{1}{{a+\frac{d}{2}-1}}\left(\frac{k^{2}}{d+24a}-1\right)^{1-a-\frac{d}{2}}\right)
=O(kd2a+2).\displaystyle=O(k^{-d-2a+2}).

Next we prove λk¯=Ω(kd2a+2)\overline{\lambda_{k}}=\Omega(k^{-d-2a+2}). Since cpc_{p} are nonnegative and cp=Θ(pa)c_{p}=\Theta(p^{-a}), we have that cpCpac_{p}\geq C^{\prime}p^{-a} for some constant CC^{\prime}. Then we have

λk¯\displaystyle\overline{\lambda_{k}} πd/22k1pkpk is evenCpaΓ(p+1)Γ(pk+12)Γ(pk+1)Γ(pk+12+k+d/2).\displaystyle\geq\frac{\pi^{d/2}}{2^{k-1}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}C^{\prime}p^{-a}\frac{\Gamma(p+1)\Gamma(\frac{p-k+1}{2})}{\Gamma(p-k+1)\Gamma(\frac{p-k+1}{2}+k+d/2)}. (78)

According to Stirling’s formula (70) and (71), and using a similar argument to (72), we have

λk¯\displaystyle\overline{\lambda_{k}} πd/22k1C12C22pkpk is evenCpa2πp+1(p+1e)p+12πpk+12(pk+12e)pk+122πpk+1(pk+1e)pk+12πpk+12+k+d/2(pk+12+k+d/2e)pk+12+k+d/2\displaystyle\geq\frac{\pi^{d/2}}{2^{k-1}}\frac{C_{1}^{2}}{C_{2}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}C^{\prime}p^{-a}\frac{\sqrt{\frac{2\pi}{p+1}}\left(\frac{p+1}{e}\right)^{p+1}\sqrt{\frac{2\pi}{\frac{p-k+1}{2}}}\left(\frac{\frac{p-k+1}{2}}{e}\right)^{\frac{p-k+1}{2}}}{\sqrt{\frac{2\pi}{{p-k+1}}}\left(\frac{p-k+1}{e}\right)^{p-k+1}\sqrt{\frac{2\pi}{\frac{p-k+1}{2}+k+d/2}}\left(\frac{\frac{p-k+1}{2}+k+d/2}{e}\right)^{\frac{p-k+1}{2}+k+d/2}} (79)
=2πd/22d2ed2C12CC22pkpk is evenpa(p+1)p+12(pk+1)pk+12(p+k+1+d)p+k+d2\displaystyle=2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{1}^{2}C^{\prime}}{C_{2}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}\frac{p^{-a}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}} (80)
2πd/22d2ed2C12CC22pk2pk is evenfa(p),\displaystyle\geq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{1}^{2}C^{\prime}}{C_{2}^{2}}\sum_{\begin{subarray}{c}p\geq k^{2}\\ p-k\text{ is even}\end{subarray}}f_{a}(p), (81)

where fa(p)f_{a}(p) is defined in (73). When pk2p\geq k^{2}, we have

fa(p)\displaystyle f_{a}(p) =pa(p+1)p+12(pk+1)pk+12(p+k+1+d)p+k+d2\displaystyle=\frac{p^{-a}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}
=pa(p+1)p+12((p+1)2k2+d(pk+1))pk+12(p+k+1+d)2k+d12\displaystyle=\frac{p^{-a}\left({p+1}\right)^{p+\frac{1}{2}}}{\left((p+1)^{2}-k^{2}+d(p-k+1)\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{2k+d-1}{2}}}
(p+1)ad2(1k2d(pk+1)(p+1)2)pk+12(1+k+dp+1)2k+d12\displaystyle\geq\frac{\left({p+1}\right)^{-a-\frac{d}{2}}}{\left(1-\frac{k^{2}-d(p-k+1)}{(p+1)^{2}}\right)^{\frac{p-k+1}{2}}\left({1+\frac{k+d}{p+1}}\right)^{\frac{2k+d-1}{2}}}

For sufficiently large kk, k2d(pk+1)<0k^{2}-d(p-k+1)<0. Then we have

(1k2d(pk+1)(p+1)2)pk+12\displaystyle\left(1-\frac{k^{2}-d(p-k+1)}{(p+1)^{2}}\right)^{\frac{p-k+1}{2}} =(1k2d(pk+1)(p+1)2)(p+1)2k2d(pk+1)k2+d(pk+1)(p+1)2pk+12\displaystyle=\left(1-\frac{k^{2}-d(p-k+1)}{(p+1)^{2}}\right)^{\frac{-(p+1)^{2}}{k^{2}-d(p-k+1)}\cdot\frac{-k^{2}+d(p-k+1)}{(p+1)^{2}}\cdot\frac{p-k+1}{2}}
ek2+d(pk+1)(p+1)2pk+12\displaystyle\leq e^{\frac{-k^{2}+d(p-k+1)}{(p+1)^{2}}\cdot\frac{p-k+1}{2}}
edp22p2=ed2\displaystyle\leq e^{\frac{dp^{2}}{2p^{2}}}=e^{\frac{d}{2}}

which is a constant independent of kk. Also, for sufficiently large kk, we have

(1+k+dp+1)2k+d12\displaystyle\left({1+\frac{k+d}{p+1}}\right)^{\frac{2k+d-1}{2}} =(1+k+dp+1)p+1k+dk+dp+12k+d12\displaystyle=\left({1+\frac{k+d}{p+1}}\right)^{\frac{p+1}{k+d}\frac{k+d}{p+1}\frac{2k+d-1}{2}}
ek+dp+12k+d12\displaystyle\leq e^{\frac{k+d}{p+1}\frac{2k+d-1}{2}}
\displaystyle\leq e^{\frac{3k^{2}}{2p}}\leq e^{\frac{3}{2}}

Then for pk2p\geq k^{2}, we have fa(p)ed232(p+1)ad2f_{a}(p)\geq e^{-\frac{d}{2}-\frac{3}{2}}(p+1)^{-a-\frac{d}{2}}.

Finally we have

λk¯\displaystyle\overline{\lambda_{k}} 2πd/22d2ed2C12CC22pk2pk is evenfa(p)\displaystyle\geq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{1}^{2}C^{\prime}}{C_{2}^{2}}\sum_{\begin{subarray}{c}p\geq k^{2}\\ p-k\text{ is even}\end{subarray}}f_{a}(p) (82)
2πd/22d2ed2C12CC22pk2pk is evened232(p+1)ad2\displaystyle\geq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{1}^{2}C^{\prime}}{C_{2}^{2}}\sum_{\begin{subarray}{c}p\geq k^{2}\\ p-k\text{ is even}\end{subarray}}e^{-\frac{d}{2}-\frac{3}{2}}(p+1)^{-a-\frac{d}{2}} (83)
2πd/22d2ed2C12CC22ed23212(a+d21)(k2+2)1ad2\displaystyle\geq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{1}^{2}C^{\prime}}{C_{2}^{2}}e^{-\frac{d}{2}-\frac{3}{2}}\frac{1}{2(a+\frac{d}{2}-1)}(k^{2}+2)^{1-a-\frac{d}{2}} (84)
=Ω(kd2a+2).\displaystyle=\Omega(k^{-d-2a+2}). (85)

Overall, we have λk¯=Θ(kd2a+2)\overline{\lambda_{k}}=\Theta(k^{-d-2a+2}). ∎

Proof of Corollary C.4, part 2.

It is easy to verify that λk¯=0\overline{\lambda_{k}}=0 when kk is even because cp=0c_{p}=0 when pkp\geq k and pkp-k is even. When kk is odd, the proof of Theorem 4.6 still applies. ∎

Proof of Corollary C.4, part 3.

Since cp=𝒪(exp(ap))c_{p}=\mathcal{O}\left(\exp\left(-a\sqrt{p}\right)\right), we have that cpCeapc_{p}\leq Ce^{-a\sqrt{p}} for some constant CC. Similar to (72), we have

λk¯\displaystyle\overline{\lambda_{k}} 2πd/22d2ed2C22CC12pkpk is eveneap(p+1)p+12(pk+1)pk+12(p+k+1+d)p+k+d2.\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}\frac{e^{-a\sqrt{p}}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}. (86)

Using the definition in (73) with a=0a=0, we have

f0(p)=(p+1)p+12(pk+1)pk+12(p+k+1+d)p+k+d2.f_{0}(p)=\frac{\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}. (87)

Then according to (76) and (77), for sufficiently large kk, we have f0(p)f0(k2d)f_{0}(p)\leq f_{0}\left(\frac{k^{2}}{d}\right) when kpk2dk\leq p\leq\frac{k^{2}}{d} and f0(p)C3pd2f_{0}(p)\leq C_{3}p^{-\frac{d}{2}} for some constant C3C_{3} when pk2dp\geq\frac{k^{2}}{d}. Then when kpk2dk\leq p\leq\frac{k^{2}}{d}, we have f0(p)f0(k2d)C3(k2d)d2f_{0}(p)\leq f_{0}\left(\frac{k^{2}}{d}\right)\leq C_{3}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}}. When pk2dp\geq\frac{k^{2}}{d}, we have f0(p)C3pd2C3(k2d)d2f_{0}(p)\leq C_{3}p^{-\frac{d}{2}}\leq C_{3}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}}. Overall, for all pkp\geq k, we have

f0(p)C3(k2d)d2.f_{0}(p)\leq C_{3}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}}. (88)

Then we have

λk¯\displaystyle\overline{\lambda_{k}} 2πd/22d2ed2C22CC12pkpk is eveneapf0(p)\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}e^{-a\sqrt{p}}f_{0}(p) (89)
2πd/22d2ed2C22C3CC12(k2d)d2pkpk is eveneap\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C_{3}C}{C_{1}^{2}}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}e^{-a\sqrt{p}} (90)
2πd/22d2ed2C22C3CC12(k2d)d22eak1(ak1+1)a2\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C_{3}C}{C_{1}^{2}}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}}\frac{2e^{-a\sqrt{k-1}}(a\sqrt{k-1}+1)}{a^{2}} (91)
=𝒪(kd+1/2exp(ak))\displaystyle=\mathcal{O}\left(k^{-d+1/2}\exp\left(-a\sqrt{k}\right)\right) (92)

Proof of Corollary C.4, part 4.

Since cp=Θ(p1/2ap)c_{p}=\Theta(p^{1/2}a^{-p}), we have that cpCp1/2apc_{p}\leq Cp^{1/2}a^{-p} for some constant CC. Similar to (72), we have

λk¯\displaystyle\overline{\lambda_{k}} 2πd/22d2ed2C22CC12pkpk is evenp1/2ap(p+1)p+12(pk+1)pk+12(p+k+1+d)p+k+d2.\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}\frac{p^{1/2}a^{-p}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}. (93)

Using the definition in (73) with a=0a=0, we have

f0(p)=(p+1)p+12(pk+1)pk+12(p+k+1+d)p+k+d2.f_{0}(p)=\frac{\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}}. (94)

Then according to (76) and (77), for sufficiently large kk, we have f0(p)f0(k2d)f_{0}(p)\leq f_{0}\left(\frac{k^{2}}{d}\right) when kpk2dk\leq p\leq\frac{k^{2}}{d} and f0(p)C3pd2f_{0}(p)\leq C_{3}p^{-\frac{d}{2}} for some constant C3C_{3} when pk2dp\geq\frac{k^{2}}{d}. Then when kpk2dk\leq p\leq\frac{k^{2}}{d}, we have p1/2f0(p)p1/2f0(k2d)C3(k2d)1/2(k2d)d2p^{1/2}f_{0}(p)\leq p^{1/2}f_{0}\left(\frac{k^{2}}{d}\right)\leq C_{3}\left(\frac{k^{2}}{d}\right)^{1/2}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}}. When pk2dp\geq\frac{k^{2}}{d}, we have p1/2f0(p)C3p1/2pd2C3(k2d)d2+12p^{1/2}f_{0}(p)\leq C_{3}p^{1/2}p^{-\frac{d}{2}}\leq C_{3}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}+\frac{1}{2}}. Overall, for all pkp\geq k, we have

p1/2f0(p)C3(k2d)d2+12.p^{1/2}f_{0}(p)\leq C_{3}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}+\frac{1}{2}}. (95)

Then we have

λk¯\displaystyle\overline{\lambda_{k}} 2πd/22d2ed2C22CC12pkpk is evenp1/2apf0(p)\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C}{C_{1}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}p^{1/2}a^{-p}f_{0}(p) (96)
2πd/22d2ed2C22C3CC12(k2d)d2+12pkpk is evenap\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C_{3}C}{C_{1}^{2}}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}+\frac{1}{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}a^{-p} (97)
2πd/22d2ed2C22C3CC12(k2d)d2+121logaa(k1)\displaystyle\leq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{2}^{2}C_{3}C}{C_{1}^{2}}\left(\frac{k^{2}}{d}\right)^{-\frac{d}{2}+\frac{1}{2}}\frac{1}{\log a}a^{-(k-1)} (98)
=𝒪(kd+1ak).\displaystyle=\mathcal{O}\left(k^{-d+1}a^{-k}\right). (99)

On the other hand, since cp=Θ(p1/2ap)c_{p}=\Theta(p^{1/2}a^{-p}), we have that cpCp1/2apc_{p}\geq C^{\prime}p^{1/2}a^{-p} for some constant CC^{\prime}. Similar to (80), we have

λk¯\displaystyle\overline{\lambda_{k}} 2πd/22d2ed2C12CC22pkpk is evenp1/2ap(p+1)p+12(pk+1)pk+12(p+k+1+d)p+k+d2\displaystyle\geq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{1}^{2}C^{\prime}}{C_{2}^{2}}\sum_{\begin{subarray}{c}p\geq k\\ p-k\text{ is even}\end{subarray}}\frac{p^{1/2}a^{-p}\left({p+1}\right)^{p+\frac{1}{2}}}{\left({p-k+1}\right)^{\frac{p-k+1}{2}}\left({p+k+1+d}\right)^{\frac{p+k+d}{2}}} (100)
2πd/22d2ed2C12CC22k1/2ak(k+1)k+12(kk+1)kk+12(k+k+1+d)k+k+d2\displaystyle\geq 2\pi^{d/2}\frac{2^{\frac{d}{2}}e^{\frac{d}{2}}C_{1}^{2}C^{\prime}}{C_{2}^{2}}\frac{k^{1/2}a^{-k}\left({k+1}\right)^{k+\frac{1}{2}}}{\left({k-k+1}\right)^{\frac{k-k+1}{2}}\left({k+k+1+d}\right)^{\frac{k+k+d}{2}}} (101)
=Ω(kd/2+1ak(k+1)k(k+k+1+d)k).\displaystyle=\Omega\left(\frac{k^{-d/2+1}a^{-k}\left({k+1}\right)^{k}}{\left({k+k+1+d}\right)^{k}}\right). (102)

Note that (k+1)k=kk(1+1/k)k=Θ(kk)\left({k+1}\right)^{k}=k^{k}(1+1/k)^{k}=\Theta(k^{k}) and, similarly, (k+k+1+d)k=Θ((2k)k)\left({k+k+1+d}\right)^{k}=\Theta((2k)^{k}). Then we have

λk¯\displaystyle\overline{\lambda_{k}} =Ω(kd/2+1ak(k+1)k(k+k+1+d)k)\displaystyle=\Omega\left(\frac{k^{-d/2+1}a^{-k}\left({k+1}\right)^{k}}{\left({k+k+1+d}\right)^{k}}\right) (103)
=Ω(kd/2+1akkk(2k)k)\displaystyle=\Omega\left(\frac{k^{-d/2+1}a^{-k}k^{k}}{(2k)^{k}}\right) (104)
=Ω(kd/2+12kak).\displaystyle=\Omega\left(k^{-d/2+1}2^{-k}a^{-k}\right). (105)

For the NTK of a two-layer ReLU network with γb>0\gamma_{b}>0, according to Lemma 3.2 we have cp=κp,2=Θ(p3/2)c_{p}=\kappa_{p,2}=\Theta(p^{-3/2}). Therefore, by Corollary 4.7, λk¯=Θ(kd1)\overline{\lambda_{k}}=\Theta(k^{-d-1}). Notice here that kk refers to the frequency, and the number of spherical harmonics of frequency at most kk is Θ(kd)\Theta(k^{d}). Therefore, for the \ellth largest eigenvalue λ\lambda_{\ell} we have λ=Θ((d+1)/d)\lambda_{\ell}=\Theta(\ell^{-(d+1)/d}). This rate agrees with Basri et al. (2019) and Velikanov & Yarotsky (2021). For the NTK of a two-layer ReLU network with γb=0\gamma_{b}=0, the eigenvalues corresponding to the even frequencies are 0, which also agrees with Basri et al. (2019). Corollary 4.7 also gives the decay rates of the eigenvalues for the NTK of two-layer networks with Tanh and Gaussian activations. We observe that when the coefficients of the kernel power series decay quickly, the eigenvalues of the kernel also decay quickly. As a faster decay of the eigenvalues of the kernel implies a smaller RKHS, Corollary 4.7 demonstrates that using ReLU results in a larger RKHS relative to using either Tanh or Gaussian activations. We numerically illustrate Corollary 4.7 in Figure 4, Appendix C.3.
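The conversion from frequencies to eigenvalue indices described above can be illustrated directly: each frequency k contributes a block of eigenvalues whose size is the dimension of the degree-k spherical harmonics on the sphere S^d, namely binom(d+k, k) - binom(d+k-2, k-2). The sketch below expands a frequency-indexed decay k^{-d-1} into an index-ordered sequence and recovers the exponent -(d+1)/d; the constant in front of the eigenvalues and the range of frequencies are arbitrary.

from math import comb
import numpy as np

# Sketch: expand lambda_bar_k = k^{-d-1} with multiplicity N(d, k) (number of
# degree-k spherical harmonics on S^d) into an index-ordered eigenvalue list and
# fit the decay exponent; the fitted slope should be close to -(d+1)/d.
d = 2
lam, mult = [], []
for k in range(1, 200):
    N = comb(d + k, k) - (comb(d + k - 2, k - 2) if k >= 2 else 0)
    lam.append(float(k) ** (-d - 1))
    mult.append(N)
eigs = np.repeat(lam, mult)                        # eigenvalues listed with multiplicity, descending
ell = np.arange(1, eigs.size + 1)
slope, _ = np.polyfit(np.log(ell[10:]), np.log(eigs[10:]), 1)
print(slope, -(d + 1) / d)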

C.5 Analysis of the lower spectrum: non-uniform data

The purpose of this section is to prove a formal version of Theorem 4.8. In order to prove this result we first need the following lemma.

Lemma C.19.

Let the coefficients (cj)j=0(c_{j})_{j=0}^{\infty} with cj0c_{j}\in\mathbb{R}_{\geq 0} for all j0j\in\mathbb{Z}_{\geq 0} be such that the series j=0cjρj\sum_{j=0}^{\infty}c_{j}\rho^{j} converges for all ρ[1,1]\rho\in[-1,1]. Given a data matrix 𝐗n×d\mathbf{X}\in\mathbb{R}^{n\times d} with 𝐱i=1\left\lVert\mathbf{x}_{i}\right\rVert=1 for all i[n]i\in[n], define r:=rank(𝐗)2r\vcentcolon=\operatorname{rank}(\mathbf{X})\geq 2 and the gram matrix 𝐆:=𝐗𝐗T\mathbf{G}\vcentcolon=\mathbf{X}\mathbf{X}^{T}. Consider the kernel matrix

n𝐊=j=0cj𝐆j.n\mathbf{K}=\sum_{j=0}^{\infty}c_{j}\mathbf{G}^{\odot j}.

For arbitrary m1m\in\mathbb{Z}_{\geq 1}, let the eigenvalue index kk satisfy nk>rank(𝐇m)n\geq k>\operatorname{rank}\left(\mathbf{H}_{m}\right), where 𝐇m:=j=0m1cj𝐆j\mathbf{H}_{m}\vcentcolon=\sum_{j=0}^{m-1}c_{j}\mathbf{G}^{\odot j}. Then

λk(𝐊)𝐆mnj=mcj.\lambda_{k}(\mathbf{K})\leq\frac{\left\lVert\mathbf{G}^{\odot m}\right\rVert}{n}\sum_{j=m}^{\infty}c_{j}. (106)
Proof.

We start our analysis by considering λk(n𝐊)\lambda_{k}(n\mathbf{K}) for some arbitrary knk\in\mathbb{N}_{\leq n}. Let

𝐇m\displaystyle\mathbf{H}_{m} :=j=0m1cj𝐆j,\displaystyle\vcentcolon=\sum_{j=0}^{m-1}c_{j}\mathbf{G}^{\odot j},
𝐓m\displaystyle\mathbf{T}_{m} :=j=mcj𝐆j\displaystyle\vcentcolon=\sum_{j=m}^{\infty}c_{j}\mathbf{G}^{\odot j}

be the mm-head and mm-tail of the power series expansion of n𝐊n\mathbf{K}: clearly n𝐊=𝐇m+𝐓mn\mathbf{K}=\mathbf{H}_{m}+\mathbf{T}_{m} for any mm\in\mathbb{N}. Recall that the all-ones matrix 𝐆0=1n×n\mathbf{G}^{\odot 0}=\textbf{1}_{n\times n} is symmetric and positive semi-definite; furthermore, by the Schur product theorem, the Hadamard product of two positive semi-definite matrices is positive semi-definite. As a result, 𝐆j\mathbf{G}^{\odot j} is symmetric and positive semi-definite for all j0j\in\mathbb{Z}_{\geq 0}, and therefore 𝐇m\mathbf{H}_{m} and 𝐓m\mathbf{T}_{m} are also symmetric positive semi-definite matrices. From Weyl’s inequality (Weyl, 1912, Satz 1) it follows that

nλk(𝐊)λk(𝐇m)+λ1(𝐓m).n\lambda_{k}(\mathbf{K})\leq\lambda_{k}(\mathbf{H}_{m})+\lambda_{1}(\mathbf{T}_{m}). (107)

In order to upper bound λ1(𝐓m)\lambda_{1}(\mathbf{T}_{m}), observe, as 𝐓m\mathbf{T}_{m} is square, symmetric and positive semi-definite, that λ1(𝐓m)=𝐓m\lambda_{1}(\mathbf{T}_{m})=\left\lVert\mathbf{T}_{m}\right\rVert. Using the non-negativity of the coefficients (cj)j=0(c_{j})_{j=0}^{\infty} and the triangle inequality we have

λ1(𝐓m)=j=mcj𝐆jj=mcj𝐆j\displaystyle\lambda_{1}(\mathbf{T}_{m})=\left\lVert\sum_{j=m}^{\infty}c_{j}\mathbf{G}^{\odot j}\right\rVert\leq\sum_{j=m}^{\infty}c_{j}\left\lVert\mathbf{G}^{\odot j}\right\rVert

By the assumptions of the lemma [𝐆]ii=1[\mathbf{G}]_{ii}=1 and therefore [𝐆]iij=1[\mathbf{G}]_{ii}^{j}=1 for all j0j\in\mathbb{Z}_{\geq 0}. Furthermore, for any pair of positive semi-definite matrices 𝐀,𝐁n×n\mathbf{A},\mathbf{B}\in\mathbb{R}^{n\times n} and k[n]k\in[n]

λ1(𝐀𝐁)maxi[n][𝐀]iiλ1(𝐁),\lambda_{1}(\mathbf{A}\odot\mathbf{B})\leq\max_{i\in[n]}[\mathbf{A}]_{ii}\lambda_{1}(\mathbf{B}), (108)

Schur (1911). Therefore, as maxi[n][𝐆]ii=1\max_{i\in[n]}[\mathbf{G}]_{ii}=1,

𝐆j=λ1(𝐆j)=λ1(𝐆𝐆(j1))λ1(𝐆(j1))=𝐆(j1)\left\lVert\mathbf{G}^{\odot j}\right\rVert=\lambda_{1}(\mathbf{G}^{\odot j})=\lambda_{1}(\mathbf{G}\odot\mathbf{G}^{\odot(j-1)})\leq\lambda_{1}(\mathbf{G}^{\odot(j-1)})=\left\lVert\mathbf{G}^{\odot(j-1)}\right\rVert

for all jj\in\mathbb{N}. As a result

λ1(𝐓m)𝐆mj=mcj.\displaystyle\lambda_{1}(\mathbf{T}_{m})\leq\left\lVert\mathbf{G}^{\odot m}\right\rVert\sum_{j=m}^{\infty}c_{j}.

Finally, we now turn our attention to the analysis of λk(𝐇m)\lambda_{k}(\mathbf{H}_{m}). Upper bounding a small eigenvalue is typically challenging; however, the problem simplifies when kk exceeds the rank of 𝐇m\mathbf{H}_{m}, as is assumed here, since this trivially implies λk(𝐇m)=0\lambda_{k}(\mathbf{H}_{m})=0. Therefore, for k>rank(𝐇m)k>\operatorname{rank}(\mathbf{H}_{m})

\lambda_{k}(\mathbf{K})\leq\frac{\left\lVert\mathbf{G}^{\odot m}\right\rVert}{n}\sum_{j=m}^{\infty}c_{j}

as claimed. ∎
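Lemma C.19 can be checked numerically on synthetic data by comparing λ_k(𝐊) against the right hand side of (106). A minimal sketch follows; the coefficient choice c_j = 2^{-j}, the truncation of the power series at sixty terms and the matrix sizes are all arbitrary illustrative choices, and the numerical rank of 𝐇_m is used in place of its exact rank.

import numpy as np

# Sketch: verify lambda_k(K) <= ||G^{Hadamard m}|| / n * sum_{j >= m} c_j for k > rank(H_m).
# Hadamard powers are elementwise powers (G ** j), not matrix powers.
rng = np.random.default_rng(3)
n, d, m = 300, 4, 5
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-norm rows so diag(G) = 1
G = X @ X.T

c = [2.0 ** -j for j in range(60)]                 # nonnegative, summable coefficients
K = sum(cj * G ** j for j, cj in enumerate(c)) / n
H = sum(c[j] * G ** j for j in range(m))           # m-head of the series

k = int(np.linalg.matrix_rank(H)) + 1              # any index beyond rank(H_m) works
bound = np.linalg.norm(G ** m, 2) / n * sum(c[m:])
eigs = np.linalg.eigvalsh(K)[::-1]                 # descending order
print(eigs[k - 1], bound)                          # lambda_k(K) vs the bound in (106)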

In order to use Lemma C.19 we require an upper bound on the rank of 𝐇m\mathbf{H}_{m}. To this end we provide Lemma C.20.

Lemma C.20.

Let 𝐆n×n\mathbf{G}\in\mathbb{R}^{n\times n} be a symmetric, positive semi-definite matrix of rank 2rd2\leq r\leq d. Define 𝐇mn×n\mathbf{H}_{m}\in\mathbb{R}^{n\times n} as

𝐇m=j=0m1cj𝐆j\mathbf{H}_{m}=\sum_{j=0}^{m-1}c_{j}\mathbf{G}^{\odot j} (109)

where (cj)j=0m1(c_{j})_{j=0}^{m-1} is a sequence of real coefficients. Then

rank(𝐇m)\displaystyle\text{rank}\left(\mathbf{H}_{m}\right)\leq 1+min{r1,m1}(2e)r1\displaystyle 1+\min\{r-1,m-1\}(2e)^{r-1} (110)
+max{0,mr}(2er1)r1(m1)r1.\displaystyle+\max\{0,m-r\}\left(\frac{2e}{r-1}\right)^{r-1}(m-1)^{r-1}.
Proof.

As 𝐆\mathbf{G} is a symmetric and positive semi-definite matrix, its eigenvalues are real and non-negative and its eigenvectors are orthogonal. Let {𝐯i}i=1r\{\mathbf{v}_{i}\}_{i=1}^{r} be a set of orthogonal eigenvectors for 𝐆\mathbf{G} and γi\gamma_{i} the eigenvalue associated with 𝐯in\mathbf{v}_{i}\in\mathbb{R}^{n}. Then 𝐆\mathbf{G} may be written as a sum of rank one matrices as follows,

𝐆=i=1rγi𝐯i𝐯iT.\mathbf{G}=\sum_{i=1}^{r}\gamma_{i}\mathbf{v}_{i}\mathbf{v}_{i}^{T}.

As the Hadamard product is commutative, associative and distributive over addition, for any j0j\in\mathbb{Z}_{\geq 0} 𝐆j\mathbf{G}^{\odot j} can also be expressed as a sum of rank 1 matrices,

𝐆j\displaystyle\mathbf{G}^{\odot j} =(i=1rγi𝐯i𝐯iT)j\displaystyle=\left(\sum_{i=1}^{r}\gamma_{i}\mathbf{v}_{i}\mathbf{v}_{i}^{T}\right)^{\odot j}
=(i1=1rγi1𝐯i1𝐯i1T)(i2=1rγi2𝐯i2𝐯i2T)(ij=1rγij𝐯ij𝐯ijT)\displaystyle=\left(\sum_{i_{1}=1}^{r}\gamma_{i_{1}}\mathbf{v}_{i_{1}}\mathbf{v}_{i_{1}}^{T}\right)\odot\left(\sum_{i_{2}=1}^{r}\gamma_{i_{2}}\mathbf{v}_{i_{2}}\mathbf{v}_{i_{2}}^{T}\right)\odot\cdots\odot\left(\sum_{i_{j}=1}^{r}\gamma_{i_{j}}\mathbf{v}_{i_{j}}\mathbf{v}_{i_{j}}^{T}\right)
\displaystyle=\sum_{i_{1},i_{2}\ldots i_{j}=1}^{r}\gamma_{i_{1}}\gamma_{i_{2}}\cdots\gamma_{i_{j}}\left(\mathbf{v}_{i_{1}}\mathbf{v}_{i_{1}}^{T}\right)\odot\left(\mathbf{v}_{i_{2}}\mathbf{v}_{i_{2}}^{T}\right)\odot\cdots\odot\left(\mathbf{v}_{i_{j}}\mathbf{v}_{i_{j}}^{T}\right)
=i1,i2,,ij=1rγi1γi2γij(𝐯i1𝐯i2𝐯ij)(𝐯i1𝐯i2𝐯ij)T.\displaystyle=\sum_{i_{1},i_{2},\ldots,i_{j}=1}^{r}\gamma_{i_{1}}\gamma_{i_{2}}\cdots\gamma_{i_{j}}\left(\mathbf{v}_{i_{1}}\odot\mathbf{v}_{i_{2}}\odot\cdots\odot\mathbf{v}_{i_{j}}\right)\left(\mathbf{v}_{i_{1}}\odot\mathbf{v}_{i_{2}}\odot\cdots\odot\mathbf{v}_{i_{j}}\right)^{T}.

Note the fourth equality in the above follows from 𝐯i𝐯iT=𝐯i𝐯i\mathbf{v}_{i}\mathbf{v}_{i}^{T}=\mathbf{v}_{i}\otimes\mathbf{v}_{i} and an application of the mixed-product property of the Hadamard product. As matrix rank is sub-additive, the rank of 𝐆j\mathbf{G}^{\odot j} is less than or equal to the number of distinct rank-one matrix summands. This quantity is equal to the number of vectors of the form (𝐯i1𝐯i2𝐯ij)\left(\mathbf{v}_{i_{1}}\odot\mathbf{v}_{i_{2}}\odot\cdots\odot\mathbf{v}_{i_{j}}\right), where i1,i2,,ij[r]i_{1},i_{2},\ldots,i_{j}\in[r], which in turn is the number of jj-combinations with repetition from rr objects. Via a stars and bars argument this is equal to \binom{r+j-1}{j}=\binom{r+j-1}{r-1}. It therefore follows that

rank(𝐆j)\displaystyle\operatorname{rank}(\mathbf{G}^{\odot j}) (r+j1r1)\displaystyle\leq\binom{r+j-1}{r-1}
(e(r+j1)r1)r1\displaystyle\leq\left(\frac{e(r+j-1)}{r-1}\right)^{r-1}
er1(1+jr1)r1\displaystyle\leq e^{r-1}\left(1+\frac{j}{r-1}\right)^{r-1}
(2e)r1(δjr1+δj>r1(jr1)r1).\displaystyle\leq(2e)^{r-1}\left(\delta_{j\leq r-1}+\delta_{j>r-1}\left(\frac{j}{r-1}\right)^{r-1}\right).

The rank of 𝐇m\mathbf{H}_{m} can therefore be bounded via subadditivity of the rank as

\displaystyle\operatorname{rank}(\mathbf{H}_{m})=\operatorname{rank}\left(c_{0}\textbf{1}_{n\times n}+\sum_{j=1}^{m-1}c_{j}\mathbf{G}^{\odot j}\right) (111)
\displaystyle\leq 1+j=1m1rank(𝐆j)\displaystyle 1+\sum_{j=1}^{m-1}\operatorname{rank}\left(\mathbf{G}^{\odot j}\right)
\displaystyle\leq 1+j=1m1(2e)r1(δjr1+δj>r1(jr1)r1)\displaystyle 1+\sum_{j=1}^{m-1}(2e)^{r-1}\left(\delta_{j\leq r-1}+\delta_{j>r-1}\left(\frac{j}{r-1}\right)^{r-1}\right)
\displaystyle\leq 1+min{r1,m1}(2e)r1\displaystyle 1+\min\{r-1,m-1\}(2e)^{r-1}
+max{0,mr}(2er1)r1(m1)r1.\displaystyle\phantom{1}+\max\{0,m-r\}\left(\frac{2e}{r-1}\right)^{r-1}(m-1)^{r-1}.

As our goal here is to characterize the small eigenvalues, as nn grows we need both kk, and therefore mm, to grow as well. As a result we will be operating in the regime where m>rm>r. To this end we provide the following corollary.

Corollary C.21.

Under the same conditions and setup as Lemma C.20, if mr7m\geq r\geq 7 then

rank(𝐇m)<2mr.\operatorname{rank}(\mathbf{H}_{m})<2m^{r}.
Proof.

If r7>2e+1r\geq 7>2e+1 then r1>2er-1>2e. As a result from Lemma C.20

rank(𝐇m)\displaystyle\operatorname{rank}(\mathbf{H}_{m}) 1+(r1)(2e)r1+(mr)(2er1)r1(m1)r1\displaystyle\leq 1+(r-1)(2e)^{r-1}+(m-r)\left(\frac{2e}{r-1}\right)^{r-1}(m-1)^{r-1}
<r(2e)r1+(m1)r\displaystyle<r(2e)^{r-1}+(m-1)^{r}
<2mr\displaystyle<2m^{r}

as claimed. ∎
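The rank bound underlying Lemma C.20 and Corollary C.21 is also easy to observe empirically: for a rank-r Gram matrix, the numerical rank of its jth Hadamard power should not exceed binom(r+j-1, r-1). The following short sketch illustrates this; the sizes, the rank r and the number of powers checked are arbitrary.

import numpy as np
from math import comb

# Sketch: numerical rank of Hadamard powers of a rank-r Gram matrix versus the
# combinatorial bound C(r + j - 1, r - 1) from the proof of Lemma C.20.
rng = np.random.default_rng(4)
n, r = 400, 3
X = rng.standard_normal((n, r))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # normalization keeps entries in [-1, 1]
G = X @ X.T                                        # rank r with probability one
for j in range(1, 6):
    print(j, np.linalg.matrix_rank(G ** j), comb(r + j - 1, r - 1))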

Corollary C.21 implies that for any kk with 2mrkn2m^{r}\leq k\leq n we can apply Lemma C.19 to upper bound the size of the kkth eigenvalue. Our goal is to upper bound the smallest eigenvalue and thereby characterize its decay. To this end, and in order to make our bounds as tight as possible, we choose the truncation point m(n)=(n/2)1/rm(n)=\lfloor(n/2)^{1/r}\rfloor; note that this is the largest truncation which still satisfies 2m(n)rn2m(n)^{r}\leq n. In order to state the next lemma, we introduce the following piece of notation: with :={:00}\mathcal{L}\vcentcolon=\{\ell:\mathbb{R}_{\geq 0}\rightarrow\mathbb{R}_{\geq 0}\} define U:×10U:\mathcal{L}\times\mathbb{Z}_{\geq 1}\rightarrow\mathbb{R}_{\geq 0} as

U(,m)=m1(x)𝑑x.U(\ell,m)=\int_{m-1}^{\infty}\ell(x)dx.
Lemma C.22.

Given a sequence of data points (𝐱i)i1(\mathbf{x}_{i})_{i\in\mathbb{Z}_{\geq 1}} with 𝐱i𝕊d\mathbf{x}_{i}\in\mathbb{S}^{d} for all i1i\in\mathbb{Z}_{\geq 1}, construct a sequence of row-wise data matrices (𝐗n)n1(\mathbf{X}_{n})_{n\in\mathbb{Z}_{\geq 1}}, 𝐗nn×d\mathbf{X}_{n}\in\mathbb{R}^{n\times d}, with 𝐱i\mathbf{x}_{i} corresponding to the iith row of 𝐗n\mathbf{X}_{n}. The corresponding sequence of gram matrices we denote 𝐆n:=𝐗n𝐗nT\mathbf{G}_{n}\vcentcolon=\mathbf{X}_{n}\mathbf{X}_{n}^{T}. Let m(n):=(n/2)1/r(n)m(n)\vcentcolon=\lfloor(n/2)^{1/r(n)}\rfloor where r(n):=rank(𝐗n)r(n)\vcentcolon=\operatorname{rank}(\mathbf{X}_{n}) and suppose for all sufficiently large nn that m(n)r(n)7m(n)\geq r(n)\geq 7. Let the coefficients (cj)j=0(c_{j})_{j=0}^{\infty} with cj0c_{j}\in\mathbb{R}_{\geq 0} for all j0j\in\mathbb{Z}_{\geq 0} be such that 1) the series j=0cjρj\sum_{j=0}^{\infty}c_{j}\rho^{j} converges for all ρ[1,1]\rho\in[-1,1] and 2) (cj)j=0=𝒪((j))(c_{j})_{j=0}^{\infty}=\mathcal{O}(\ell(j)), where \ell\in\mathcal{L} satisfies U(,m(n))<U(\ell,m(n))<\infty for all nn and is monotonically decreasing. Consider the sequence of kernel matrices indexed by nn and defined as

n𝐊n=j=0cj𝐆nj.n\mathbf{K}_{n}=\sum_{j=0}^{\infty}c_{j}\mathbf{G}_{n}^{\odot j}.

With ν:10\nu:\mathbb{Z}_{\geq 1}\rightarrow\mathbb{Z}_{\geq 0} suppose that 𝐆nm(n)=𝒪(nν(n)+1)\left\lVert\mathbf{G}_{n}^{\odot m(n)}\right\rVert=\mathcal{O}(n^{-\nu(n)+1}). Then

λn(𝐊n)=𝒪(nν(n)U(,m(n))).\lambda_{n}(\mathbf{K}_{n})=\mathcal{O}(n^{-\nu(n)}U(\ell,m(n))). (112)
Proof.

By the assumptions of the Lemma we may apply Lemma C.19 and Corollary C.21, which results in

λn(𝐊n)𝐆nm(n)nj=m(n)cj=𝒪(nν(n))j=m(n)cj.\lambda_{n}(\mathbf{K}_{n})\leq\frac{\left\lVert\mathbf{G}_{n}^{\odot m(n)}\right\rVert}{n}\sum_{j=m(n)}^{\infty}c_{j}=\mathcal{O}(n^{-\nu(n)})\sum_{j=m(n)}^{\infty}c_{j}.

Additionally, as (cj)j=0=𝒪((j))(c_{j})_{j=0}^{\infty}=\mathcal{O}(\ell(j)) then

λn(𝐊n)\displaystyle\lambda_{n}(\mathbf{K}_{n}) =𝒪(nν(n)j=m(n)(j))\displaystyle=\mathcal{O}\left(n^{-\nu(n)}\sum_{j=m(n)}^{\infty}\ell(j)\right)
=𝒪(nν(n)m(n)1(x)𝑑x)\displaystyle=\mathcal{O}\left(n^{-\nu(n)}\int_{m(n)-1}^{\infty}\ell(x)dx\right)
=𝒪(nν(n)U(,m(n)))\displaystyle=\mathcal{O}\left(n^{-\nu(n)}U(\ell,m(n))\right)

as claimed. ∎

Based on Lemma C.22 we provide Theorem C.23, which considers three specific scenarios for the decay of the power series coefficients, inspired by Lemma 3.2.

Theorem C.23.

In the same setting and under the same assumptions as Lemma C.22, the following hold:

  1. if cp=𝒪(pα)c_{p}=\mathcal{O}(p^{-\alpha}) with α>r(n)+1\alpha>r(n)+1 for all n0n\in\mathbb{Z}_{\geq 0} then λn(𝐊n)=𝒪(nα1r(n))\lambda_{n}(\mathbf{K}_{n})=\mathcal{O}\left(n^{-\frac{\alpha-1}{r(n)}}\right),

  2. if cp=𝒪(eαp)c_{p}=\mathcal{O}(e^{-\alpha\sqrt{p}}), then λn(𝐊n)=𝒪(n12r(n)exp(αn12r(n)))\lambda_{n}(\mathbf{K}_{n})=\mathcal{O}\left(n^{\frac{1}{2r(n)}}\exp\left(-\alpha^{\prime}n^{\frac{1}{2r(n)}}\right)\right) for any α<α21/2r(n)\alpha^{\prime}<\alpha 2^{-1/2r(n)},

  3. if cp=𝒪(eαp)c_{p}=\mathcal{O}(e^{-\alpha p}), then λn(𝐊n)=𝒪(exp(αn1r(n)))\lambda_{n}(\mathbf{K}_{n})=\mathcal{O}\left(\exp\left(-\alpha^{\prime}n^{\frac{1}{r(n)}}\right)\right) for any α<α21/r(n)\alpha^{\prime}<\alpha 2^{-1/r(n)}.

Proof.

First, as 𝐆nm(n)\mathbf{G}_{n}^{\odot m(n)} is positive semi-definite with [𝐆n]ii=1[\mathbf{G}_{n}]_{ii}=1 for all i[n]i\in[n], we have

𝐆m(n)nTrace(𝐆m(n))n=1.\frac{\left\lVert\mathbf{G}^{\odot m(n)}\right\rVert}{n}\leq\frac{\text{Trace}(\mathbf{G}^{\odot m(n)})}{n}=1.

Therefore, to recover the three results listed we now apply Lemma C.22 with ν(n)=0\nu(n)=0. First, to prove 1., under the assumption (x)=xα\ell(x)=x^{-\alpha} with α>1\alpha>1 we have

m(n)1xα𝑑x=(m(n)1)α+1α1.\int_{m(n)-1}^{\infty}x^{-\alpha}dx=\frac{(m(n)-1)^{-\alpha+1}}{\alpha-1}.

As a result

λn(𝐊n)=𝒪(nα1r(n)).\lambda_{n}(\mathbf{K}_{n})=\mathcal{O}\left(n^{-\frac{\alpha-1}{r(n)}}\right).

To prove 2., under the assumption (x)=eαx\ell(x)=e^{-\alpha\sqrt{x}} with α>0\alpha>0 we have

\int_{m(n)-1}^{\infty}e^{-\alpha\sqrt{x}}dx=\frac{2\exp(-\alpha\sqrt{m(n)-1})(\alpha\sqrt{m(n)-1}+1)}{\alpha^{2}}.

As a result

λn(𝐊n)=𝒪(n12r(n)exp(αn12r(n)))\lambda_{n}(\mathbf{K}_{n})=\mathcal{O}\left(n^{\frac{1}{2r(n)}}\exp\left(-\alpha^{\prime}n^{\frac{1}{2r(n)}}\right)\right)

for any α<α21/2r(n)\alpha^{\prime}<\alpha 2^{-1/2r(n)}. Finally, to prove 3., under the assumption (x)=eαx\ell(x)=e^{-\alpha x} with α>0\alpha>0 we have

\int_{m(n)-1}^{\infty}e^{-\alpha x}dx=\frac{\exp(-\alpha(m(n)-1))}{\alpha}.

Therefore

λn(𝐊n)=𝒪(exp(αn1r(n)))\lambda_{n}(\mathbf{K}_{n})=\mathcal{O}\left(\exp\left(-\alpha^{\prime}n^{\frac{1}{r(n)}}\right)\right)

again for any α<α21/r(n)\alpha^{\prime}<\alpha 2^{-1/r(n)}. ∎
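The three rates can be made concrete by evaluating the truncation point m(n) = ⌊(n/2)^{1/r}⌋ together with the corresponding tail integrals U(ℓ, m(n)) computed in the proof. The sketch below does this for a fixed rank r and a fixed decay parameter α; the particular values of r, α and n are arbitrary and are chosen only so that m(n) ≥ r ≥ 7 holds.

import numpy as np

# Sketch: truncation point m(n) and the three tail integrals from the proof of
# Theorem C.23, for fixed rank r and decay parameter alpha (arbitrary choices).
r, alpha = 7, 10.0
for n in [2 * 10**6, 2 * 10**8, 2 * 10**10]:
    m = int(np.floor((n / 2) ** (1 / r)))
    poly_tail = (m - 1) ** (-alpha + 1) / (alpha - 1)                  # case 1: c_p = O(p^-alpha)
    sqrt_exp_tail = (2 * np.exp(-alpha * np.sqrt(m - 1))
                     * (alpha * np.sqrt(m - 1) + 1) / alpha ** 2)      # case 2: c_p = O(e^{-alpha sqrt(p)})
    exp_tail = np.exp(-alpha * (m - 1)) / alpha                        # case 3: c_p = O(e^{-alpha p})
    print(n, m, poly_tail, sqrt_exp_tail, exp_tail)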

Unfortunately, the curse of dimensionality is clearly present in these results due to the 1/r(n)1/r(n) factor in the exponents of nn. However, although perhaps somewhat loose, these results are certainly far from trivial. In particular, while trivially we know that λn(𝐊n)Tr(𝐊n)/n=𝒪(n1)\lambda_{n}(\mathbf{K}_{n})\leq Tr(\mathbf{K}_{n})/n=\mathcal{O}(n^{-1}), even our weakest result, the one concerning power law coefficient decay, is a clear improvement over this trivial bound as long as α>r(n)+1\alpha>r(n)+1. For the other settings, i.e., those specified in 2. and 3., our results are significantly stronger.