Characterizing the spectrum of the NTK via a power series expansion
Abstract
Under mild conditions on the network initialization we derive a power series expansion for the Neural Tangent Kernel (NTK) of arbitrarily deep feedforward networks in the infinite width limit. We provide expressions for the coefficients of this power series, which depend on both the Hermite coefficients of the activation function and the depth of the network. We observe that faster decay of the Hermite coefficients leads to faster decay of the NTK coefficients, and we explore the role of depth. Using this series, we first relate the effective rank of the NTK to the effective rank of the input-data Gram matrix. Second, for data drawn uniformly on the sphere we study the eigenvalues of the NTK, analyzing the impact of the choice of activation function. Finally, for generic data and activation functions with sufficiently fast Hermite coefficient decay, we derive an asymptotic upper bound on the spectrum of the NTK.
1 Introduction
Neural networks currently dominate modern artificial intelligence; however, despite their empirical success, establishing a principled theoretical foundation for them remains an active challenge. The key difficulties are that neural networks induce nonconvex optimization objectives (Sontag & Sussmann, 1989) and typically operate in an overparameterized regime which lies outside the scope of classical statistical learning theory (Anthony & Bartlett, 2002). The persistent success of overparameterized models tuned via nonconvex optimization suggests that the relationship between the parameterization, optimization, and generalization is more subtle than classical theory can address.
A recent breakthrough on understanding the success of overparameterized networks was established through the Neural Tangent Kernel (NTK) (Jacot et al., 2018). In the infinite width limit the optimization dynamics are described entirely by the NTK and the parameterization behaves like a linear model (Lee et al., 2019). In this regime explicit guarantees for the optimization and generalization can be obtained (Du et al., 2019a, b; Arora et al., 2019a; Allen-Zhu et al., 2019; Zou et al., 2020). While one must be judicious when extrapolating insights from the NTK to finite width networks (Lee et al., 2020), the NTK remains one of the most promising avenues for understanding deep learning on a principled basis.
The spectrum of the NTK is fundamental to both the optimization and generalization of wide networks. In particular, bounding the smallest eigenvalue of the NTK Gram matrix is a staple technique for establishing convergence guarantees for the optimization (Du et al., 2019a, b; Oymak & Soltanolkotabi, 2020). Furthermore, the full spectrum of the NTK Gram matrix governs the dynamics of the empirical risk (Arora et al., 2019b), and the eigenvalues of the associated integral operator characterize the dynamics of the generalization error outside the training set (Bowman & Montufar, 2022; Bowman & Montúfar, 2022). Moreover, the decay rate of the generalization error for Gaussian process regression using the NTK can be characterized by the decay rate of the spectrum (Caponnetto & De Vito, 2007; Cui et al., 2021; Jin et al., 2022).
The importance of the spectrum of the NTK has led to a variety of efforts to characterize its structure via random matrix theory and other tools (Yang & Salman, 2019; Fan & Wang, 2020). There is a broader body of work studying the closely related Conjugate Kernel, Fisher Information Matrix, and Hessian (Poole et al., 2016; Pennington & Worah, 2017, 2018; Louart et al., 2018; Karakida et al., 2020). These results often require involved random matrix theory or operate in a regime where the input dimension is sent to infinity. By contrast, using just a power series expansion we are able to characterize a variety of attributes of the spectrum for fixed input dimension and recover key results from prior work.
1.1 Contributions
1.2 Related work
Neural Tangent Kernel (NTK): the NTK was introduced by Jacot et al. (2018), who demonstrated that in the infinite width limit neural network optimization is described via a kernel gradient descent. As a consequence, when the network is polynomially wide in the number of samples, global convergence guarantees for gradient descent can be obtained (Du et al., 2019a, b; Allen-Zhu et al., 2019; Zou & Gu, 2019; Lee et al., 2019; Zou et al., 2020; Oymak & Soltanolkotabi, 2020; Nguyen & Mondelli, 2020; Nguyen, 2021). Furthermore, the connection between infinite width networks and Gaussian processes, which traces back to Neal (1996), has been reinvigorated in light of the NTK. Recent investigations include Lee et al. (2018); de G. Matthews et al. (2018); Novak et al. (2019).
Analysis of NTK Spectrum: theoretical analysis of the NTK spectrum via random matrix theory was carried out by Yang & Salman (2019); Fan & Wang (2020) in the high dimensional limit. Velikanov & Yarotsky (2021) demonstrated that for ReLU networks the spectrum of the NTK integral operator asymptotically follows a power law, which is consistent with our results for the uniform data distribution. Basri et al. (2019) calculated the NTK spectrum for shallow ReLU networks under the uniform distribution, which was then extended to the nonuniform case by Basri et al. (2020). Geifman et al. (2022) analyzed the spectrum of the conjugate kernel and NTK for convolutional networks with ReLU activations when the pixels of the input are uniformly distributed on the sphere. Geifman et al. (2020); Bietti & Bach (2021); Chen & Xu (2021) analyzed the reproducing kernel Hilbert spaces of the NTK for ReLU networks and the Laplace kernel via the decay rate of the spectrum of the kernel. In contrast to previous works, we are able to address the spectrum in the finite dimensional setting and characterize the impact of different activation functions on it.
Hermite Expansion: Daniely et al. (2016) used Hermite expansions to study the expressivity of the Conjugate Kernel. Simon et al. (2022) used this technique to demonstrate that any dot-product kernel can be realized by the NTK or Conjugate Kernel of a shallow, zero-bias network. Oymak & Soltanolkotabi (2020) used Hermite expansions to study the NTK and establish a quantitative bound on the smallest eigenvalue for shallow networks. This approach was adopted by Nguyen & Mondelli (2020) to handle convergence for deep networks, with sharp bounds on the smallest NTK eigenvalue for deep ReLU networks provided by Nguyen et al. (2021). The Hermite approach was also utilized by Panigrahi et al. (2020) to analyze the smallest NTK eigenvalue of shallow networks under various activations. Finally, in a concurrent work, Han et al. (2022) use Hermite expansions to develop a principled and efficient polynomial-based approximation algorithm for the NTK and CNTK. In contrast to the aforementioned works, here we employ the Hermite expansion to characterize both the outlier and asymptotic portions of the spectrum for both shallow and deep networks under general activations.
2 Preliminaries
For our notation, lower case letters, e.g., $x$, denote scalars, lower case bold characters, e.g., $\mathbf{x}$, are for vectors, and upper case bold characters, e.g., $\mathbf{X}$, are for matrices. For natural numbers we use the notation $\mathbb{N} = \{1, 2, \ldots\}$ and $[n] = \{1, 2, \ldots, n\}$; if $n = 0$ then $[n]$ is the empty set. We use $\|\cdot\|_p$ to denote the $p$-norm of the matrix or vector in question and as default use $\|\cdot\|$ for the operator or 2-norm respectively. We use $\mathbf{1}$ to denote the matrix or vector with all entries equal to one. We define $\mathbb{1}\{A\}$ to take the value $1$ if the event $A$ holds and be zero otherwise. We will frequently overload scalar functions by applying them elementwise to vectors and matrices. The entry in the $i$th row and $j$th column of a matrix $\mathbf{M}$ we access using the notation $(\mathbf{M})_{ij}$. The Hadamard or entrywise product of two matrices $\mathbf{M}$ and $\mathbf{N}$ we denote $\mathbf{M} \odot \mathbf{N}$ as is standard. The $p$th Hadamard power we denote $\mathbf{M}^{\odot p}$ and define it as the Hadamard product of $\mathbf{M}$ with itself $p$ times,
$$\mathbf{M}^{\odot p} := \underbrace{\mathbf{M} \odot \mathbf{M} \odot \cdots \odot \mathbf{M}}_{p \text{ times}}.$$
Given a Hermitian or symmetric matrix $\mathbf{M} \in \mathbb{R}^{n \times n}$, we adopt the convention that $\lambda_i(\mathbf{M})$ denotes the $i$th largest eigenvalue,
$$\lambda_1(\mathbf{M}) \geq \lambda_2(\mathbf{M}) \geq \cdots \geq \lambda_n(\mathbf{M}).$$
Finally, for a square matrix $\mathbf{M}$ we let $\mathrm{Tr}(\mathbf{M})$ denote its trace.
2.1 Hermite Expansion
We say that a function $f: \mathbb{R} \to \mathbb{R}$ is square integrable with respect to the standard Gaussian measure $\gamma(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ if $\int_{\mathbb{R}} f(z)^2 \gamma(z)\, dz < \infty$. We denote by $L^2(\mathbb{R}, \gamma)$ the space of all such functions. The normalized probabilist's Hermite polynomials are defined as
$$h_k(x) = \frac{He_k(x)}{\sqrt{k!}}, \qquad He_k(x) = (-1)^k e^{x^2/2} \frac{d^k}{dx^k} e^{-x^2/2}, \qquad k = 0, 1, 2, \ldots,$$
and form a complete orthonormal basis in $L^2(\mathbb{R}, \gamma)$ (O'Donnell, 2014, §11). The Hermite expansion of a function $\phi \in L^2(\mathbb{R}, \gamma)$ is given by $\phi(x) = \sum_{k=0}^{\infty} \mu_k(\phi)\, h_k(x)$, where $\mu_k(\phi) = \mathbb{E}_{Z \sim \mathcal{N}(0,1)}[\phi(Z)\, h_k(Z)]$ is the $k$th normalized probabilist's Hermite coefficient of $\phi$.
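For concreteness, the normalized Hermite coefficients of a given activation can be estimated numerically via Gauss–Hermite quadrature. The sketch below is our own illustrative helper (not code from the paper), using NumPy's probabilist's Hermite module; it recovers, for example, the leading ReLU coefficients discussed later in Appendix A.4.

```python
import numpy as np
from math import factorial
from numpy.polynomial import hermite_e as He  # probabilist's Hermite polynomials

def hermite_coefficients(phi, K, quad_deg=100):
    """Estimate the first K normalized probabilist's Hermite coefficients of phi."""
    # Nodes and weights for the weight function exp(-z^2 / 2); the weights sum to sqrt(2 pi).
    z, w = He.hermegauss(quad_deg)
    w = w / np.sqrt(2 * np.pi)  # turn the quadrature rule into an expectation over N(0, 1)
    coeffs = []
    for k in range(K):
        basis = np.zeros(k + 1)
        basis[k] = 1.0
        h_k = He.hermeval(z, basis) / np.sqrt(factorial(k))  # normalized Hermite polynomial h_k
        coeffs.append(np.sum(w * phi(z) * h_k))              # mu_k(phi) = E[phi(Z) h_k(Z)]
    return np.array(coeffs)

relu = lambda z: np.maximum(z, 0.0)
mu = hermite_coefficients(relu, 8)
print(mu)               # mu_0 = 1/sqrt(2 pi) ~ 0.399, mu_1 = 0.5, odd coefficients beyond 1 vanish
print(np.sum(mu ** 2))  # partial Parseval sum; the full sum equals E[relu(Z)^2] = 1/2
```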
2.2 NTK Parametrization
In what follows, for $n$ input points $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ let $\mathbf{X} \in \mathbb{R}^{n \times d}$ denote the matrix which stores these points row-wise. Unless otherwise stated, we denote the $i$th row of $\mathbf{X}$ as $\mathbf{x}_i^T$. In this work we consider fully-connected neural networks of the form $f^{(L+1)}: \mathbb{R}^d \to \mathbb{R}$ with $L$ hidden layers and a linear output layer. For a given input vector $\mathbf{x} \in \mathbb{R}^d$, the preactivation $f^{(l)}(\mathbf{x})$ and activation $g^{(l)}(\mathbf{x})$ at each layer are defined via the following recurrence relations,
$$f^{(1)}(\mathbf{x}) = \sqrt{\frac{\gamma_w^2}{d}}\, \mathbf{W}^{(1)} \mathbf{x} + \gamma_b \mathbf{b}^{(1)}, \qquad g^{(l)}(\mathbf{x}) = \phi\!\left(f^{(l)}(\mathbf{x})\right), \qquad f^{(l+1)}(\mathbf{x}) = \sqrt{\frac{\gamma_w^2}{m_l}}\, \mathbf{W}^{(l+1)} g^{(l)}(\mathbf{x}) + \gamma_b \mathbf{b}^{(l+1)}, \qquad l \in [L]. \tag{1}$$
The parameters $\mathbf{W}^{(l)} \in \mathbb{R}^{m_l \times m_{l-1}}$ and $\mathbf{b}^{(l)} \in \mathbb{R}^{m_l}$ are the weight matrix and bias vector at the $l$th layer respectively, $m_0 = d$, $m_{L+1} = 1$, and $\phi: \mathbb{R} \to \mathbb{R}$ is the activation function applied elementwise. The variables $\gamma_w$ and $\gamma_b$ correspond to weight and bias hyperparameters respectively. Let $\theta_l$ denote a vector storing the network parameters up to and including the $l$th layer. The Neural Tangent Kernel (Jacot et al., 2018) associated with $f^{(l)}$ at layer $l \in [L+1]$ is defined as
$$\tilde{K}_l(\mathbf{x}, \mathbf{y}) := \left\langle \nabla_{\theta_l} f^{(l)}(\mathbf{x}),\, \nabla_{\theta_l} f^{(l)}(\mathbf{y}) \right\rangle. \tag{2}$$
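To make the parametrization concrete, the following NumPy sketch implements the forward recurrence (1) at finite width. The helper names, layer widths, and the choice $\gamma_w = 1$, $\gamma_b = 0$ are illustrative assumptions on our part rather than the paper's experimental configuration.

```python
import numpy as np

def init_params(widths, rng):
    """widths = [d, m_1, ..., m_L, 1]; all parameters are i.i.d. N(0, 1) (cf. Assumption 1)."""
    Ws = [rng.standard_normal((widths[l + 1], widths[l])) for l in range(len(widths) - 1)]
    bs = [rng.standard_normal(widths[l + 1]) for l in range(len(widths) - 1)]
    return Ws, bs

def forward(x, Ws, bs, phi, gamma_w=1.0, gamma_b=0.0):
    """Return the network output under the NTK parametrization of equation (1)."""
    g = x
    for l, (W, b) in enumerate(zip(Ws, bs)):
        fan_in = W.shape[1]
        f = np.sqrt(gamma_w ** 2 / fan_in) * (W @ g) + gamma_b * b  # preactivation at layer l + 1
        g = phi(f) if l < len(Ws) - 1 else f                        # linear output layer
    return g

rng = np.random.default_rng(0)
d = 16
Ws, bs = init_params([d, 256, 256, 1], rng)
x = rng.standard_normal(d)
x = np.sqrt(d) * x / np.linalg.norm(x)  # normalize so that ||x||_2 = sqrt(d)
print(forward(x, Ws, bs, np.tanh))
```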
We will mostly study the NTK under the following standard assumptions.
Assumption 1.
NTK initialization.
1. At initialization all network parameters are distributed as $\mathcal{N}(0, 1)$ and are mutually independent.
2. The activation function $\phi$ satisfies $\phi \in L^2(\mathbb{R}, \gamma)$, is differentiable almost everywhere and its derivative, which we denote $\phi'$, also satisfies $\phi' \in L^2(\mathbb{R}, \gamma)$.
3. The widths are sent to infinity in sequence, $m_1 \to \infty, m_2 \to \infty, \ldots, m_L \to \infty$.
Under Assumption 1, for any $l \in [L+1]$, $\tilde{K}_l$ converges in probability to a deterministic limit, which we denote $K_l$ (Jacot et al., 2018), and the network behaves like a kernelized linear predictor during training; see, e.g., Arora et al. (2019b); Lee et al. (2019); Woodworth et al. (2020). Given the rows $(\mathbf{x}_i)_{i=1}^{n}$ of $\mathbf{X}$, the NTK matrix at layer $l$, which we denote $\mathbf{K}_l \in \mathbb{R}^{n \times n}$, is the matrix with entries defined as
$$(\mathbf{K}_l)_{ij} := K_l(\mathbf{x}_i, \mathbf{x}_j), \qquad i, j \in [n]. \tag{3}$$
3 Expressing the NTK as a power series
The following assumption allows us to study a power series for the NTK of deep networks with general activation functions. We remark that power series for the NTK of deep networks with positive homogeneous activation functions, namely ReLU, have been studied in prior works (Han et al., 2022; Chen & Xu, 2021; Bietti & Bach, 2021; Geifman et al., 2022). We further remark that while these works focus on the asymptotics of the NTK spectrum, we also study the large eigenvalues.
Assumption 2.
The hyperparameters of the network satisfy $\gamma_w^2 + \gamma_b^2 = 1$ and $\gamma_w^2\, \mathbb{E}_{Z \sim \mathcal{N}(0,1)}[\phi(Z)^2] + \gamma_b^2 = 1$. The data is normalized so that $\|\mathbf{x}_i\|_2 = \sqrt{d}$ for all $i \in [n]$.
Recall under Assumption 1 that the preactivations of the network are centered Gaussian processes (Neal, 1996; Lee et al., 2018). Assumption 2 ensures the preactivation of each neuron has unit variance and thus is reminiscent of the LeCun et al. (2012), Glorot & Bengio (2010) and He et al. (2015) initializations, which are designed to avoid vanishing and exploding gradients. We refer the reader to Appendix A.3 for a thorough discussion. Under Assumption 2 we will show it is possible to write the NTK not only as a dot-product kernel but also as an analytic power series in the pairwise input correlation on $[-1, 1]$ and derive expressions for the coefficients. In order to state this result recall that, given a function $f \in L^2(\mathbb{R}, \gamma)$, the $k$th normalized probabilist's Hermite coefficient of $f$ is denoted $\mu_k(f)$; we refer the reader to Appendix A.4 for an overview of the Hermite polynomials and their properties. Furthermore, letting $(c_j)_{j=0}^{\infty}$ denote a sequence of real numbers, then for any $p, k \in \mathbb{Z}_{\geq 0}$ we define
$$F\left(p, k, (c_j)_{j=0}^{\infty}\right) := \sum_{(j_1, \ldots, j_k) \in J(p, k)} \; \prod_{i=1}^{k} c_{j_i}, \tag{4}$$
where
$$J(p, k) := \left\{ (j_1, \ldots, j_k) \in (\mathbb{Z}_{\geq 0})^k \;:\; \sum_{i=1}^{k} j_i = p \right\}.$$
Here $J(p, k)$ is the set of all $k$-tuples of nonnegative integers which sum to $p$, and $F(p, k, (c_j)_{j=0}^{\infty})$ is therefore the sum of all ordered products of $k$ elements of $(c_j)_{j=0}^{\infty}$ whose indices sum to $p$. We are now ready to state the key result of this section, Theorem 3.1, whose proof is provided in Appendix B.1.
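As a brief computational aside before the theorem: since $F(p, k, (c_j))$ is precisely the coefficient of $z^p$ in the $k$th power of the series $\sum_j c_j z^j$, it can be evaluated by repeated truncated polynomial multiplication rather than by enumerating tuples. The sketch below is our own helper, with a brute-force check against the defining sum; the names and coefficient values are illustrative.

```python
import numpy as np
from itertools import product

def F(p, k, c):
    """Coefficient of z^p in (sum_j c[j] z^j)^k: the sum over ordered k-tuples of
    nonnegative indices summing to p of the products of the corresponding c's."""
    if k == 0:
        return 1.0 if p == 0 else 0.0
    poly = np.zeros(p + 1)
    m = min(len(c), p + 1)
    poly[:m] = c[:m]
    out = np.array([1.0])
    for _ in range(k):
        out = np.convolve(out, poly)[: p + 1]  # truncate terms beyond degree p
    return out[p]

# Brute-force check against the definition for small p and k.
c = [0.3, 0.5, 0.1, 0.05]
p, k = 3, 2
brute = sum(np.prod([c[j] for j in tup])
            for tup in product(range(p + 1), repeat=k) if sum(tup) == p)
print(F(p, k, c), brute)  # both should equal 0.13
```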
Theorem 3.1.
As already remarked, power series for the NTK have been studied in previous works; however, to the best of our knowledge Theorem 3.1 is the first to explicitly express the coefficients at a given layer in terms of the coefficients of the previous layers. To compute the coefficients of the NTK as per Theorem 3.1, the Hermite coefficients of both $\phi$ and $\phi'$ are required. Under Assumption 3 below, which has minimal impact on the generality of our results, this calculation can be simplified. In short, under Assumption 3 we have $\mu_k(\phi') = \sqrt{k+1}\, \mu_{k+1}(\phi)$, and therefore only the Hermite coefficients of $\phi$ are required. We refer the reader to Lemma B.3 in Appendix B.2 for further details.
Assumption 3.
The activation function $\phi$ is absolutely continuous on $[-a, a]$ for all $a > 0$, differentiable almost everywhere, and is polynomially bounded, i.e., $|\phi(x)| = \mathcal{O}(|x|^{\beta})$ as $|x| \to \infty$ for some $\beta > 0$. Further, the derivative $\phi'$ satisfies $\phi' \in L^2(\mathbb{R}, \gamma)$.
We remark that ReLU, Tanh, Sigmoid, Softplus and many other commonly used activation functions satisfy Assumption 3. In order to understand the relationship between the Hermite coefficients of the activation function and the coefficients of the NTK, we first consider the simple two-layer case, i.e., a network with a single hidden layer. From Theorem 3.1,
(9) |
As per Table 1, which reports cumulative percentages of the total NTK coefficient series, a general trend we observe across all activation functions is that the first few coefficients account for the large majority of the total.
| Activation | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| ReLU | 43.944 | 77.277 | 93.192 | 93.192 | 95.403 | 95.403 |
| Tanh | 41.362 | 91.468 | 91.468 | 97.487 | 97.487 | 99.090 |
| Sigmoid | 91.557 | 99.729 | 99.729 | 99.977 | 99.977 | 99.997 |
| Gaussian | 95.834 | 95.834 | 98.729 | 98.729 | 99.634 | 99.634 |
However, the asymptotic rate of decay of the NTK coefficients varies significantly by activation function, due to the varying behavior of their tails. In Lemma 3.2 we choose ReLU, Tanh and Gaussian as prototypical examples of activation functions with growing, constant, and decaying tails respectively, and analyze the corresponding NTK coefficients in the two-layer setting. For typographical ease we denote the zero-mean Gaussian density function with variance $\sigma^2$ by $\omega_{\sigma}$.
The trend we observe from Lemma 3.2 is that activation functions whose Hermite coefficients decay quickly, such as the Gaussian, result in a faster decay of the NTK coefficients. We remark that analyzing the rates of decay for deeper networks is challenging due to the calculation of (4). In Appendix B.4 we provide preliminary results in this direction, upper bounding, in a very specific setting, the decay of the NTK coefficients for depths greater than two. Finally, we briefly pause here to highlight the potential for using a truncation of (5) in order to perform efficient numerical approximation of the infinite width NTK. We remark that this idea is also addressed in a concurrent work by Han et al. (2022), albeit under a somewhat different set of assumptions (in particular, in Han et al. (2022) the authors focus on homogeneous activation functions and allow the data to lie off the sphere; by contrast, we require the data to lie on the sphere but can handle non-homogeneous activation functions in the deep setting). As per our observations thus far that the coefficients of the NTK power series (5) typically decay quite rapidly, one might consider approximating the NTK by computing just the first few terms in each series of (5). Figure 2 in Appendix B.3 displays the absolute error between the truncated ReLU NTK and the analytical expression for the ReLU NTK, which is also defined in Appendix B.3. Letting $\rho$ denote the input correlation, the key takeaway is that while for $\rho$ close to one the approximation is poor, for smaller values of $\rho$, which are arguably more typical of real-world data, machine-level precision can be achieved with a modest number of coefficients. We refer the interested reader to Appendix B.3 for a proper discussion.
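To illustrate the truncation idea in the simplest case where a closed form is available, the sketch below compares a truncated Hermite power series against the classical arc-cosine expressions for a shallow, zero-bias ReLU network ($\gamma_w = 1$, $\gamma_b = 0$). The closed forms and coefficient formulas used here are standard facts about ReLU rather than code from the paper, so this should be read as a sanity check of the idea behind Figure 2, not a reproduction of it.

```python
import numpy as np
from math import factorial, pi

def double_factorial(n):  # convention: (-1)!! = 0!! = 1
    return 1 if n <= 0 else n * double_factorial(n - 2)

def relu_sq_coeffs(T):
    """Squared normalized Hermite coefficients of relu(x) = max(x, 0)."""
    a = np.zeros(T)
    a[0], a[1] = 1 / (2 * pi), 1 / 4
    for k in range(2, T, 2):  # nonzero even coefficients; odd coefficients beyond k = 1 vanish
        a[k] = double_factorial(k - 3) ** 2 / (2 * pi * factorial(k))
    return a

def step_sq_coeffs(T):
    """Squared normalized Hermite coefficients of the Heaviside step function (= relu')."""
    b = np.zeros(T)
    b[0] = 1 / 4
    for k in range(1, T, 2):  # nonzero odd coefficients; even coefficients beyond k = 0 vanish
        b[k] = double_factorial(k - 2) ** 2 / (2 * pi * factorial(k))
    return b

rho = np.linspace(-0.9, 0.9, 7)   # input correlations kept away from +-1
# Closed forms for E[relu(u) relu(v)] and E[relu'(u) relu'(v)], (u, v) standard Gaussian, corr rho.
Sigma = (np.sqrt(1 - rho ** 2) + rho * (np.pi - np.arccos(rho))) / (2 * np.pi)
Sigma_dot = (np.pi - np.arccos(rho)) / (2 * np.pi)
ntk_exact = rho * Sigma_dot + Sigma            # shallow zero-bias ReLU NTK as a function of rho

for T in (5, 10, 25, 50):
    a, b = relu_sq_coeffs(T), step_sq_coeffs(T)
    ntk_trunc = rho * np.polyval(b[::-1], rho) + np.polyval(a[::-1], rho)
    print(T, np.max(np.abs(ntk_trunc - ntk_exact)))  # truncation error shrinks as T grows
```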
4 Analyzing the spectrum of the NTK via its power series
In this section, we consider a general kernel matrix power series of the form $\mathbf{K} = \sum_{p=0}^{\infty} c_p \left(\frac{\mathbf{X}\mathbf{X}^T}{d}\right)^{\odot p}$, where the $c_p$ are coefficients and $\mathbf{X} \in \mathbb{R}^{n \times d}$ is the data matrix. According to Theorem 3.1, the coefficients of the NTK power series (5) are always nonnegative, thus we only consider the case where the $c_p$ are nonnegative. We will also consider the corresponding kernel function power series, which we denote $K(\mathbf{x}, \mathbf{y}) = \sum_{p=0}^{\infty} c_p \left(\frac{\langle \mathbf{x}, \mathbf{y} \rangle}{d}\right)^{p}$. Later on we will analyze the spectra of both the kernel matrix $\mathbf{K}$ and the kernel function $K$.
4.1 Analysis of the upper spectrum and effective rank
In this section we analyze the upper part of the spectrum of the NTK, corresponding to the large eigenvalues, using the power series given in Theorem 3.1. Our first result concerns the effective rank (Huang et al., 2022) of the NTK. Given a positive semidefinite matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$, we define the effective rank of $\mathbf{A}$ to be
$$\mathrm{eff}(\mathbf{A}) := \frac{\mathrm{Tr}(\mathbf{A})}{\lambda_1(\mathbf{A})}.$$
The effective rank quantifies how many eigenvalues are on the order of the largest eigenvalue. This follows from the Markov-like inequality
$$\left|\left\{ i \;:\; \lambda_i(\mathbf{A}) \geq c^{-1} \lambda_1(\mathbf{A}) \right\}\right| \leq c \, \mathrm{eff}(\mathbf{A}) \quad \text{for any } c \geq 1, \tag{10}$$
and the eigenvalue bound $\lambda_i(\mathbf{A}) \leq \mathrm{Tr}(\mathbf{A})/i = \lambda_1(\mathbf{A})\, \mathrm{eff}(\mathbf{A})/i$.
Our first result is that the effective rank of the NTK can be bounded in terms of a ratio involving the power series coefficients. As we are assuming the data is normalized so that $\|\mathbf{x}_i\|_2 = \sqrt{d}$ for all $i \in [n]$, observe by the linearity of the trace that
$$\mathrm{Tr}(\mathbf{K}) = \sum_{i=1}^{n} K(\mathbf{x}_i, \mathbf{x}_i) = n \sum_{p=0}^{\infty} c_p,$$
where we have used the fact that $\frac{\langle \mathbf{x}_i, \mathbf{x}_i \rangle}{d} = 1$ for all $i \in [n]$. On the other hand, since each term of the power series is positive semidefinite,
$$\lambda_1(\mathbf{K}) \geq \frac{\mathbf{1}^T \mathbf{K} \mathbf{1}}{n} \geq \frac{\mathbf{1}^T \left( c_0 \mathbf{1}\mathbf{1}^T \right) \mathbf{1}}{n} = n c_0.$$
Combining these two results we get the following theorem.
Theorem 4.1.
Assume that we have a kernel Gram matrix of the form $\mathbf{K} = \sum_{p=0}^{\infty} c_p \left(\frac{\mathbf{X}\mathbf{X}^T}{d}\right)^{\odot p}$ where $c_p \geq 0$ for all $p$. Furthermore, assume the input data are normalized so that $\|\mathbf{x}_i\|_2 = \sqrt{d}$ for all $i \in [n]$. Then
$$\mathrm{eff}(\mathbf{K}) \leq \frac{\sum_{p=0}^{\infty} c_p}{c_0}.$$
By Theorem 3.1, $c_0 > 0$ provided the network has biases or the activation function has nonzero Gaussian expectation (i.e., $\mathbb{E}_{Z \sim \mathcal{N}(0,1)}[\phi(Z)] \neq 0$). Thus we have that the effective rank of $\mathbf{K}$ is bounded by an $O(1)$ quantity. In the case of ReLU, for example, as evidenced by Table 1 the effective rank will be roughly $2.3$ for a shallow network. By contrast, a well-conditioned matrix would have an effective rank that is $\Omega(n)$. Combining Theorem 4.1 and the Markov-type bound (10) we make the following important observation.
Observation 4.2.
The largest eigenvalue $\lambda_1(\mathbf{K})$ of the NTK takes up an $\Omega(1)$ fraction of the entire trace, and there are $O(1)$ eigenvalues on the same order of magnitude as $\lambda_1(\mathbf{K})$, where the $\Omega(1)$ and $O(1)$ notation are with respect to the number of samples $n$.
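The quantities in Observation 4.2 are straightforward to probe numerically. The toy sketch below builds a kernel of the assumed power-series form with arbitrary nonnegative coefficients, computes its effective rank, and checks the Markov-like counting bound; the coefficient values and data are illustrative.

```python
import numpy as np

def effective_rank(M):
    eigs = np.linalg.eigvalsh(M)
    return np.sum(eigs) / eigs[-1]    # trace divided by the largest eigenvalue

rng = np.random.default_rng(0)
n, d = 200, 30
X = rng.standard_normal((n, d))
X = np.sqrt(d) * X / np.linalg.norm(X, axis=1, keepdims=True)   # rows of norm sqrt(d)

c = np.array([0.44, 0.33, 0.16, 0.05, 0.02])     # illustrative nonnegative coefficients
R = X @ X.T / d                                  # correlation matrix with unit diagonal
K = sum(cp * R ** p for p, cp in enumerate(c))   # truncated power-series kernel (Hadamard powers)

eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
print("effective rank:", effective_rank(K), "  coefficient ratio:", c.sum() / c[0])
c_factor = 10.0
count = np.sum(eigs >= eigs[0] / c_factor)
print("eigenvalues within a factor", c_factor, "of the largest:", count,
      "<=", c_factor * effective_rank(K))
```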
While the constant term in the kernel leads to a significant outlier in the spectrum of $\mathbf{K}$, it is rather uninformative beyond this. What interests us is how the structure of the data manifests in the spectrum of the kernel matrix $\mathbf{K}$. For this reason we will examine the centered kernel matrix $\mathbf{K} - c_0 \mathbf{1}\mathbf{1}^T$, obtained by removing the constant term from the power series. By a very similar argument as before we get the following result.
Theorem 4.3.
Assume that we have a kernel Gram matrix of the form $\mathbf{K} = \sum_{p=0}^{\infty} c_p \left(\frac{\mathbf{X}\mathbf{X}^T}{d}\right)^{\odot p}$ where $c_p \geq 0$ for all $p$. Furthermore, assume the input data are normalized so that $\|\mathbf{x}_i\|_2 = \sqrt{d}$ for all $i \in [n]$. Then the centered kernel satisfies
$$\mathrm{eff}\left(\mathbf{K} - c_0 \mathbf{1}\mathbf{1}^T\right) \leq \frac{\sum_{p=1}^{\infty} c_p}{c_1} \, \mathrm{eff}\!\left(\frac{\mathbf{X}\mathbf{X}^T}{d}\right).$$
Thus we have that the effective rank of the centered kernel is upper bounded by a constant multiple of the effective rank of the input data Gram matrix $\mathbf{X}\mathbf{X}^T$. Furthermore, we can take the ratio appearing in this bound as a measure of how much the NTK inherits the behavior of the linear kernel $\mathbf{X}\mathbf{X}^T$: in particular, if the input data Gram has low effective rank and this ratio is moderate, then we may conclude that the centered NTK must also have low effective rank. Again from Table 1, in the shallow setting we see that this ratio tends to be small for many of the common activations; for example, for ReLU it is roughly 1.3. To summarize, from Theorem 4.3 we make the following important observation.
Observation 4.4.
Whenever the input data are approximately low rank, the centered kernel matrix is also approximately low rank.
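Observation 4.4 can likewise be illustrated with a few lines of code: generate approximately low-rank data, build a power-series kernel, remove the constant term, and compare effective ranks. The construction below, including the particular coefficient vector, is our own toy example rather than the paper's experiment.

```python
import numpy as np

def effective_rank(M):
    eigs = np.linalg.eigvalsh(M)
    return np.sum(eigs) / eigs[-1]

rng = np.random.default_rng(1)
n, d, r = 300, 50, 5
# Approximately rank-r data: a strong r-dimensional signal plus small isotropic noise.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d)) + 0.1 * rng.standard_normal((n, d))
X = np.sqrt(d) * X / np.linalg.norm(X, axis=1, keepdims=True)

c = np.array([0.44, 0.33, 0.16, 0.05, 0.02])     # illustrative nonnegative coefficients
R = X @ X.T / d
K = sum(cp * R ** p for p, cp in enumerate(c))
K_centered = K - c[0] * np.ones((n, n))          # remove the constant (outlier) term

print("effective rank of the data Gram   :", effective_rank(R))
print("effective rank of the centered K  :", effective_rank(K_centered))
print("tail-coefficient ratio            :", c[1:].sum() / c[1])
```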
It turns out that this phenomenon also holds for finite-width networks at initialization. Consider the shallow model
$$f(\mathbf{x}) = \frac{1}{\sqrt{m}} \sum_{i=1}^{m} a_i \, \phi\!\left(\langle \mathbf{w}_i, \mathbf{x} \rangle\right),$$
where $\mathbf{w}_i \in \mathbb{R}^d$ and $a_i \in \mathbb{R}$ for all $i \in [m]$. The following theorem demonstrates that when the width $m$ is linear in the number of samples $n$, then the effective rank of the empirical NTK Gram matrix is upper bounded by a constant multiple of the effective rank of $\mathbf{X}\mathbf{X}^T$.
Theorem 4.5.
Assume and . Fix small. Suppose that i.i.d. and . Set , and let
Then
suffices to ensure that, with probability at least over the sampling of the parameter initialization,
where is an absolute constant.
Many works consider the model where the outer layer weights are fixed to have constant magnitude and only the inner layer weights are trained. This is the setting considered by Xie et al. (2017), Arora et al. (2019a), Du et al. (2019b), Oymak et al. (2019), Li et al. (2020), and Oymak & Soltanolkotabi (2020). In this setting we can reduce the requirement on the width to be only logarithmic in the number of samples $n$, and we have an accompanying lower bound. See Theorem C.5 in Appendix C.2.3 for details.
In Figure 1 we empirically validate our theory by computing the spectrum of the NTK on both Caltech101 (Li et al., 2022) and isotropic Gaussian data for feedforward networks. We use the functorch module in PyTorch (Paszke et al., 2019) (https://pytorch.org/functorch/stable/notebooks/neural_tangent_kernels.html), following an algorithmic approach inspired by Novak et al. (2022). As per Theorem 4.1 and Observation 4.2, we observe that all network architectures exhibit a dominant outlier eigenvalue due to the nonzero constant coefficient in the power series. Furthermore, this dominant outlier becomes more pronounced with depth, as can be observed if one carries out the calculations described in Theorem 3.1. Additionally, this outlier is most pronounced for ReLU, as the combination of its Gaussian mean plus bias term is the largest out of the activations considered here. As predicted by Theorem 4.3, Observation 4.4 and Theorem 4.5, we observe that real-world data, which has a skewed spectrum and hence a low effective rank, results in the spectrum of the NTK being skewed. By contrast, isotropic Gaussian data has a flat spectrum, and as a result, beyond the outlier the decay of the eigenvalues of the NTK is more gradual. These observations support the claim that the NTK inherits its spectral structure from the data. We also observe that the spectrum for Tanh is closer to that of the linear activation relative to ReLU: intuitively this should not be surprising, as close to the origin Tanh is well approximated by the identity. Our theory provides a formal explanation for this observation; indeed, the power series coefficients for Tanh networks decay quickly relative to ReLU. We provide further experimental results in Appendix C.3, including for CNNs, where we observe the same trends. We note that the effective rank has implications for the generalization error. The Rademacher complexity of a kernel method (and hence of the NTK model) within a parameter ball is determined by its trace (Bartlett & Mendelson, 2002). Since for the NTK $\mathrm{Tr}(\mathbf{K}) = \lambda_1(\mathbf{K})\, \mathrm{eff}(\mathbf{K})$, lower effective rank implies a smaller trace and hence limited complexity.
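For readers who wish to reproduce this kind of plot, a minimal sketch of how the empirical NTK Gram matrix and its spectrum can be computed with torch.func (the successor to functorch) is given below. The architecture, widths, and random data are placeholders, and the network uses PyTorch's default initialization rather than the NTK parametrization, so this is an illustration of the algorithmic approach only.

```python
import torch
from torch.func import functional_call, jacrev, vmap

torch.manual_seed(0)
d, width, n = 32, 256, 64
model = torch.nn.Sequential(
    torch.nn.Linear(d, width), torch.nn.Tanh(),
    torch.nn.Linear(width, width), torch.nn.Tanh(),
    torch.nn.Linear(width, 1),
)
params = dict(model.named_parameters())
X = torch.nn.functional.normalize(torch.randn(n, d), dim=1)  # rows on the unit sphere

def f(p, x):
    # Scalar network output for a single input x.
    return functional_call(model, p, (x.unsqueeze(0),)).squeeze()

# Per-example Jacobians with respect to all parameters, contracted to form the NTK Gram matrix.
jac = vmap(jacrev(f), (None, 0))(params, X)                   # dict of tensors with leading dim n
jac_flat = torch.cat([j.reshape(n, -1) for j in jac.values()], dim=1)
K = jac_flat @ jac_flat.T                                     # n x n empirical NTK

eigs = torch.linalg.eigvalsh(K).flip(0)                       # descending eigenvalues
print("fraction of trace in the largest eigenvalue:", (eigs[0] / eigs.sum()).item())
print("top five eigenvalues:", eigs[:5])
```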

4.2 Analysis of the lower spectrum
In this section, we analyze the lower part of the spectrum using the power series. We first analyze the kernel function $K$, which we recall is a dot-product kernel of the form $K(\mathbf{x}, \mathbf{y}) = \sum_{p=0}^{\infty} c_p \left(\frac{\langle \mathbf{x}, \mathbf{y} \rangle}{d}\right)^{p}$. Assuming the training data is uniformly distributed on a hypersphere, it was shown by Basri et al. (2019); Bietti & Mairal (2019) that the eigenfunctions of the associated integral operator are the spherical harmonics. Azevedo & Menegatto (2015) gave the eigenvalues of the kernel in terms of the power series coefficients.
Theorem 4.6.
[Azevedo & Menegatto (2015)] Let $\Gamma$ denote the gamma function. Suppose that the training data are uniformly sampled from the unit hypersphere $\mathbb{S}^{d-1} \subset \mathbb{R}^d$. If the dot-product kernel function has the expansion $K(\mathbf{x}, \mathbf{y}) = \sum_{p=0}^{\infty} c_p \langle \mathbf{x}, \mathbf{y} \rangle^{p}$ where $c_p \geq 0$, then the eigenvalue $\lambda_k$ of every spherical harmonic of frequency $k$ is given by
A proof of Theorem 4.6 is provided in Appendix C.4 for the reader's convenience. This theorem connects the coefficients of the kernel power series with the eigenvalues of the kernel. In particular, given a specific decay rate for the coefficients $c_p$ one may derive the decay rate of the eigenvalues $\lambda_k$: for example, Scetbon & Harchaoui (2021) examined the decay rate of $\lambda_k$ when the coefficients $c_p$ admit a polynomial or an exponential decay. The following corollary summarizes the decay rates of $\lambda_k$ corresponding to two-layer networks with different activations.
Corollary 4.7.
Under the same setting as in Theorem 4.6,
1. if where , then ,
2. if , then ,
3. if , then ,
4. if , then and .
In addition to recovering existing results for ReLU networks (Basri et al., 2019; Velikanov & Yarotsky, 2021; Geifman et al., 2020; Bietti & Bach, 2021), Corollary 4.7 also provides the decay rates for two-layer networks with Tanh and Gaussian activations. As faster eigenvalue decay implies a smaller RKHS, Corollary 4.7 shows that using ReLU results in a larger RKHS relative to Tanh or Gaussian activations. Numerics for Corollary 4.7 are provided in Figure 4 in Appendix C.3. Finally, in Theorem 4.8 we relate a kernel's power series to its spectral decay for arbitrary data distributions.
Theorem 4.8 (Informal).
Let the rows of $\mathbf{X} \in \mathbb{R}^{n \times d}$ be arbitrary points on the unit sphere. Consider the kernel matrix $\mathbf{K} = \sum_{p=0}^{\infty} c_p \left(\mathbf{X}\mathbf{X}^T\right)^{\odot p}$ and let $r$ denote the rank of $\mathbf{X}$. Then
1. if with for all then ,
2. if then for any ,
3. if then for any .
Although the presence of the rank $r$ in the exponents appearing in these bounds is a weakness, Theorem 4.8 still illustrates how, in a highly general setting, the asymptotic decay of the coefficients of the power series ensures a certain asymptotic decay in the eigenvalues of the kernel matrix. A formal version of this result is provided in Appendix C.5 along with further discussion.
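The qualitative message of Theorems 4.6 and 4.8, namely that faster coefficient decay forces faster spectral decay, is easy to visualize numerically. The sketch below compares two power-series kernels with polynomially and exponentially decaying coefficients on the same non-uniform spherical data; the coefficient choices are illustrative and not tied to any particular activation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, P = 500, 3, 60
X = rng.standard_normal((n, d))
X[: n // 2] += 2.0                               # make the distribution on the sphere non-uniform
X /= np.linalg.norm(X, axis=1, keepdims=True)    # rows on the unit sphere
R = X @ X.T

def kernel(coeffs):
    return sum(c * R ** p for p, c in enumerate(coeffs))

poly_c = np.array([(p + 1.0) ** -3 for p in range(P)])    # polynomially decaying coefficients
expo_c = np.array([np.exp(-0.5 * p) for p in range(P)])   # exponentially decaying coefficients

for name, coeffs in [("polynomial", poly_c), ("exponential", expo_c)]:
    eigs = np.sort(np.linalg.eigvalsh(kernel(coeffs)))[::-1]
    # Faster coefficient decay leads to a faster-decaying spectral tail.
    print(name, (eigs[[0, 5, 20, 80, 200]] / eigs[0]).round(8))
```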
5 Conclusion
Using a power series expansion we derived a number of insights into both the outliers as well as the asymptotic decay of the spectrum of the NTK. We are able to perform our analysis without recourse to a high dimensional limit or the use of random matrix theory. Interesting avenues for future work include better characterizing the role of depth and performing the same analysis on networks with convolutional or residual layers.
Reproducibility Statement
To ensure reproducibility, we make the code public at https://github.com/bbowman223/data_ntk.
Acknowledgements and Disclosure of Funding
This project has been supported by ERC Grant 757983 and NSF CAREER Grant DMS-2145630.
References
- Allen-Zhu et al. (2019) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 242–252. PMLR, 2019. URL https://proceedings.mlr.press/v97/allen-zhu19a.html.
- Anthony & Bartlett (2002) Martin Anthony and Peter L. Bartlett. Neural Network Learning - Theoretical Foundations. Cambridge University Press, 2002. URL http://www.cambridge.org/gb/knowledge/isbn/item1154061/?site_locale=en_GB.
- Arora et al. (2019a) Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 322–332. PMLR, 2019a. URL https://proceedings.mlr.press/v97/arora19a.html.
- Arora et al. (2019b) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019b. URL https://proceedings.neurips.cc/paper/2019/file/dbc4d84bfcfe2284ba11beffb853a8c4-Paper.pdf.
- Azevedo & Menegatto (2015) Douglas Azevedo and Valdir A Menegatto. Eigenvalues of dot-product kernels on the sphere. Proceeding Series of the Brazilian Society of Computational and Applied Mathematics, 3(1), 2015.
- Bartlett & Mendelson (2002) Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, 2002. URL http://dblp.uni-trier.de/db/journals/jmlr/jmlr3.html#BartlettM02.
- Basri et al. (2019) Ronen Basri, David W. Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 4763–4772, 2019. URL https://proceedings.neurips.cc/paper/2019/hash/5ac8bb8a7d745102a978c5f8ccdb61b8-Abstract.html.
- Basri et al. (2020) Ronen Basri, Meirav Galun, Amnon Geifman, David Jacobs, Yoni Kasten, and Shira Kritchman. Frequency bias in neural networks for input of non-uniform density. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 685–694. PMLR, 2020. URL https://proceedings.mlr.press/v119/basri20a.html.
- Bietti & Bach (2021) Alberto Bietti and Francis Bach. Deep equals shallow for ReLU networks in kernel regimes. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=aDjoksTpXOP.
- Bietti & Mairal (2019) Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/c4ef9c39b300931b69a36fb3dbb8d60e-Paper.pdf.
- Bowman & Montúfar (2022) Benjamin Bowman and Guido Montúfar. Implicit bias of MSE gradient optimization in underparameterized neural networks. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=VLgmhQDVBV.
- Bowman & Montufar (2022) Benjamin Bowman and Guido Montufar. Spectral bias outside the training set for deep networks in the kernel regime. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=a01PL2gb7W5.
- Caponnetto & De Vito (2007) Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
- Chen & Xu (2021) Lin Chen and Sheng Xu. Deep neural tangent kernel and laplace kernel have the same RKHS. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=vK9WrZ0QYQ.
- Cui et al. (2021) Hugo Cui, Bruno Loureiro, Florent Krzakala, and Lenka Zdeborová. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. In Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=Da_EHrAcfwd.
- Daniely et al. (2016) Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/abea47ba24142ed16b7d8fbf2c740e0d-Paper.pdf.
- Davis (2021) Tom Davis. A general expression for Hermite expansions with applications. 2021. doi: 10.13140/RG.2.2.30843.44325. URL https://www.researchgate.net/profile/Tom-Davis-2/publication/352374514_A_GENERAL_EXPRESSION_FOR_HERMITE_EXPANSIONS_WITH_APPLICATIONS/links/60c873c5a6fdcc8267cf74d4/A-GENERAL-EXPRESSION-FOR-HERMITE-EXPANSIONS-WITH-APPLICATIONS.pdf.
- de G. Matthews et al. (2018) Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1-nGgWC-.
- Du et al. (2019a) Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 1675–1685. PMLR, 2019a. URL https://proceedings.mlr.press/v97/du19c.html.
- Du et al. (2019b) Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019b. URL https://openreview.net/forum?id=S1eK3i09YQ.
- Engel et al. (2022) Andrew Engel, Zhichao Wang, Anand Sarwate, Sutanay Choudhury, and Tony Chiang. TorchNTK: A library for calculation of neural tangent kernels of PyTorch models. 2022.
- Fan & Wang (2020) Zhou Fan and Zhichao Wang. Spectra of the conjugate kernel and neural tangent kernel for linear-width neural networks. In Advances in Neural Information Processing Systems, volume 33, pp. 7710–7721. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/572201a4497b0b9f02d4f279b09ec30d-Paper.pdf.
- Folland (1999) G. B. Folland. Real analysis: Modern techniques and their applications. Wiley, New York, 1999.
- Geifman et al. (2020) Amnon Geifman, Abhay Yadav, Yoni Kasten, Meirav Galun, David Jacobs, and Basri Ronen. On the similarity between the Laplace and neural tangent kernels. In Advances in Neural Information Processing Systems, volume 33, pp. 1451–1461. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1006ff12c465532f8c574aeaa4461b16-Paper.pdf.
- Geifman et al. (2022) Amnon Geifman, Meirav Galun, David Jacobs, and Ronen Basri. On the spectral bias of convolutional neural tangent and gaussian process kernels. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=gthKzdymDu2.
- Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 249–256. PMLR, 2010. URL https://proceedings.mlr.press/v9/glorot10a.html.
- Han et al. (2022) Insu Han, Amir Zandieh, Jaehoon Lee, Roman Novak, Lechao Xiao, and Amin Karbasi. Fast neural kernel embeddings for general activations. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=yLilJ1vZgMe.
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034, 2015.
- Huang et al. (2022) Ningyuan Teresa Huang, David W. Hogg, and Soledad Villar. Dimensionality reduction, regularization, and generalization in overparameterized regressions. SIAM J. Math. Data Sci., 4(1):126–152, 2022. URL https://doi.org/10.1137/20m1387821.
- Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.
- Jin et al. (2022) Hui Jin, Pradeep Kr. Banerjee, and Guido Montúfar. Learning curves for gaussian process regression with power-law priors and targets. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=KeI9E-gsoB.
- Karakida et al. (2020) Ryo Karakida, Shotaro Akaho, and Shun ichi Amari. Universal statistics of Fisher information in deep neural networks: mean field approach. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124005, 2020. URL https://doi.org/10.1088/1742-5468/abc62e.
- Kazarinoff (1956) Donat K. Kazarinoff. On Wallis’ formula. Edinburgh Mathematical Notes, 40:19–21, 1956.
- LeCun et al. (2012) Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient BackProp, pp. 9–48. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. URL https://doi.org/10.1007/978-3-642-35289-8_3.
- Lee et al. (2018) Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=B1EA-M-0Z.
- Lee et al. (2019) Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/0d1a9651497a38d8b1c3871c84528bd4-Paper.pdf.
- Lee et al. (2020) Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. In Advances in Neural Information Processing Systems, volume 33, pp. 15156–15172. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ad086f59924fffe0773f8d0ca22ea712-Paper.pdf.
- Li et al. (2022) Li, Andreeto, Ranzato, and Perona. Caltech 101, Apr 2022.
- Li et al. (2020) Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pp. 4313–4324. PMLR, 2020. URL https://proceedings.mlr.press/v108/li20j.html.
- Louart et al. (2018) Cosme Louart, Zhenyu Liao, and Romain Couillet. A random matrix approach to neural networks. The Annals of Applied Probability, 28(2):1190–1248, 2018. URL https://www.jstor.org/stable/26542333.
- Mishkin & Matas (2016) Dmytro Mishkin and Jiri Matas. All you need is a good init. In Yoshua Bengio and Yann LeCun (eds.), 4th International Conference on Learning Representations, Conference Track Proceedings, 2016. URL http://arxiv.org/abs/1511.06422.
- Murray et al. (2022) M. Murray, V. Abrol, and J. Tanner. Activation function design for deep networks: linearity and effective initialisation. Applied and Computational Harmonic Analysis, 59:117–154, 2022. URL https://www.sciencedirect.com/science/article/pii/S1063520321001111. Special Issue on Harmonic Analysis and Machine Learning.
- Neal (1996) Radford M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag, Berlin, Heidelberg, 1996.
- Nguyen (2021) Quynh Nguyen. On the proof of global convergence of gradient descent for deep relu networks with linear widths. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8056–8062. PMLR, 2021. URL https://proceedings.mlr.press/v139/nguyen21a.html.
- Nguyen & Mondelli (2020) Quynh Nguyen and Marco Mondelli. Global convergence of deep networks with one wide layer followed by pyramidal topology. In Advances in Neural Information Processing Systems, volume 33, pp. 11961–11972. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/8abfe8ac9ec214d68541fcb888c0b4c3-Paper.pdf.
- Nguyen et al. (2021) Quynh Nguyen, Marco Mondelli, and Guido Montúfar. Tight bounds on the smallest eigenvalue of the neural tangent kernel for deep ReLU networks. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 8119–8129. PMLR, 2021. URL https://proceedings.mlr.press/v139/nguyen21g.html.
- Novak et al. (2019) Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. In 7th International Conference on Learning Representations. OpenReview.net, 2019. URL https://openreview.net/forum?id=B1g30j0qF7.
- Novak et al. (2022) Roman Novak, Jascha Sohl-Dickstein, and Samuel S Schoenholz. Fast finite width neural tangent kernel. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 17018–17044. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/novak22a.html.
- O’Donnell (2014) Ryan O’Donnell. Analysis of Boolean functions. Cambridge University Press, 2014.
- Oymak & Soltanolkotabi (2020) Samet Oymak and Mahdi Soltanolkotabi. Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1(1), 2020. URL https://par.nsf.gov/biblio/10200049.
- Oymak et al. (2019) Samet Oymak, Zalan Fabian, Mingchen Li, and Mahdi Soltanolkotabi. Generalization guarantees for neural networks via harnessing the low-rank structure of the Jacobian. CoRR, abs/1906.05392, 2019. URL http://arxiv.org/abs/1906.05392.
- Panigrahi et al. (2020) Abhishek Panigrahi, Abhishek Shetty, and Navin Goyal. Effect of activation functions on the training of overparametrized neural nets. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgfdeBYvH.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc., 2019.
- Pennington & Worah (2017) Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/0f3d014eead934bbdbacb62a01dc4831-Paper.pdf.
- Pennington & Worah (2018) Jeffrey Pennington and Pratik Worah. The spectrum of the Fisher information matrix of a single-hidden-layer neural network. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/18bb68e2b38e4a8ce7cf4f6b2625768c-Paper.pdf.
- Poole et al. (2016) Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/148510031349642de5ca0c544f31b2ef-Paper.pdf.
- Scetbon & Harchaoui (2021) Meyer Scetbon and Zaid Harchaoui. A spectral analysis of dot-product kernels. In International conference on artificial intelligence and statistics, pp. 3394–3402. PMLR, 2021.
- Schoenholz et al. (2017) Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information propagation. In International Conference on Learning Representations (ICLR), 2017. URL https://openreview.net/pdf?id=H1W1UN9gg.
- Schur (1911) J. Schur. Bemerkungen zur Theorie der beschränkten Bilinearformen mit unendlich vielen Veränderlichen. Journal für die reine und angewandte Mathematik, 140:1–28, 1911. URL http://eudml.org/doc/149352.
- Simon et al. (2022) James Benjamin Simon, Sajant Anand, and Mike Deweese. Reverse engineering the neural tangent kernel. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp. 20215–20231. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/simon22a.html.
- Sontag & Sussmann (1989) Eduardo D. Sontag and Héctor J. Sussmann. Backpropagation can give rise to spurious local minima even for networks without hidden layers. Complex Systems, 3:91–106, 1989.
- Velikanov & Yarotsky (2021) Maksim Velikanov and Dmitry Yarotsky. Explicit loss asymptotics in the gradient descent training of neural networks. In Advances in Neural Information Processing Systems, volume 34, pp. 2570–2582. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/14faf969228fc18fcd4fcf59437b0c97-Paper.pdf.
- Vershynin (2012) Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices, pp. 210–268. Cambridge University Press, 2012.
- Weyl (1912) Hermann Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441–479, 1912. URL https://doi.org/10.1007/BF01456804.
- Woodworth et al. (2020) Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp. 3635–3673. PMLR, 2020. URL https://proceedings.mlr.press/v125/woodworth20a.html.
- Xie et al. (2017) Bo Xie, Yingyu Liang, and Le Song. Diverse Neural Network Learns True Target Functions. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 1216–1224. PMLR, 2017. URL https://proceedings.mlr.press/v54/xie17a.html.
- Yang & Salman (2019) Greg Yang and Hadi Salman. A fine-grained spectral perspective on neural networks, 2019. URL https://arxiv.org/abs/1907.10599.
- Zou & Gu (2019) Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/6a61d423d02a1c56250dc23ae7ff12f3-Paper.pdf.
- Zou et al. (2020) Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep ReLU networks. Machine learning, 109(3):467–492, 2020.
Appendix
Appendix A Background material
A.1 Gaussian kernel
Observe by construction that the flattened collection of preactivations at the first layer form a centered Gaussian process, with the covariance between the th and th neuron being described by
Under Assumption 1, the preactivations at each layer converge also in distribution to centered Gaussian processes (Neal, 1996; Lee et al., 2018). We remark that the sequential width limit condition of Assumption 1 is not necessary for this behavior; for example, the same result can be derived in the setting where the widths of the network are sent to infinity simultaneously under certain conditions on the activation function (de G. Matthews et al., 2018). However, as our interests lie in analyzing the limit rather than the conditions for convergence to said limit, for simplicity we consider only the sequential width limit. As per Lee et al. (2018, Eq. 4), the covariance between the preactivations of the $i$th and $j$th neurons at layer $l$ for any input pair is described by the following kernel,
We refer to this kernel as the Gaussian kernel. As each neuron is identically distributed and the covariance between pairs of neurons is 0 unless $i = j$, moving forward we drop the subscript and discuss only the covariance between the preactivations of an arbitrary neuron given two inputs. As per the discussion by Lee et al. (2018, Section 2.3), the expectations involved in the computation of these Gaussian kernels can be computed with respect to a bivariate Gaussian distribution, whose covariance matrix has three distinct entries: the variance of a preactivation of the first input at the previous layer, the variance of a preactivation of the second input at the previous layer, and the covariance between the preactivations of the two inputs at the previous layer. Therefore the Gaussian kernel, or covariance function, and its derivative, which we will require later for our analysis of the NTK, can be computed via the following recurrence relations; see for instance (Lee et al., 2018; Jacot et al., 2018; Arora et al., 2019b; Nguyen et al., 2021),
(11) | ||||
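In the unit-variance setting of Assumption 2, the recurrence (11) reduces to propagating a single correlation through a bivariate Gaussian expectation, which can be approximated with two-dimensional Gauss–Hermite quadrature. The sketch below is our own illustration: it uses Tanh, sets $\gamma_b = 0$, and enforces unit variance by rescaling $\gamma_w$, and the placement of the $\gamma_w^2$ factor in the derivative kernel follows our convention rather than a formula quoted from the paper.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

z, w = He.hermegauss(80)
w1 = w / np.sqrt(2 * np.pi)                 # 1D weights for an expectation over N(0, 1)
Z1, Z2 = np.meshgrid(z, z, indexing="ij")
W2 = np.outer(w1, w1)                       # 2D weights for two independent N(0, 1) variables

phi = np.tanh
dphi = lambda x: 1.0 - np.tanh(x) ** 2
gamma_b = 0.0
gamma_w = 1.0 / np.sqrt(np.sum(w1 * phi(z) ** 2))   # enforce unit preactivation variance

def kernel_step(rho):
    """One step of the covariance recurrence in the unit-variance setting: map the correlation
    rho of the previous layer's preactivations to the next layer's covariance and to the
    corresponding derivative kernel E[phi'(u) phi'(v)] (scaled by gamma_w^2)."""
    U = Z1
    V = rho * Z1 + np.sqrt(max(1.0 - rho ** 2, 0.0)) * Z2   # corr(U, V) = rho
    sigma = gamma_w ** 2 * np.sum(W2 * phi(U) * phi(V)) + gamma_b ** 2
    sigma_dot = gamma_w ** 2 * np.sum(W2 * dphi(U) * dphi(V))
    return sigma, sigma_dot

rho = 0.3
for layer in range(2, 5):
    rho, rho_dot = kernel_step(rho)
    print(layer, rho, rho_dot)
```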
A.2 Neural Tangent Kernel (NTK)
As discussed in Section 1, under Assumption 1 the NTK converges in probability to a deterministic limit, which we denote $K_l$. This deterministic limit kernel can be expressed in terms of the Gaussian kernels and their derivatives from Section A.1 via the following recurrence relationships (Jacot et al., 2018, Theorem 1),
(12) | ||||
A useful expression for the NTK matrix, which is a straightforward extension and generalization of Nguyen et al. (2021, Lemma 3.1), is provided in Lemma A.1 below.
Lemma A.1.
(Based on Nguyen et al. 2021, Lemma 3.1) Under Assumption 1, a sequence of positive semidefinite matrices in , and the related sequence also in , can be constructed via the following recurrence relationships,
(13) | ||||
The sequence of NTK matrices can in turn be written using the following recurrence relationship,
(14) | ||||
Proof.
For the sequence it suffices to prove for any and that
and is positive semi-definite. We proceed by induction, considering the base case and comparing (13) with (11) then it is evident that
In addition, is also clearly positive semi-definite as for any
We now assume the induction hypothesis is true for . We will need to distinguish slightly between two cases, and . The proof of the induction step in either case is identical. To this end, and for notational ease, let , when , and , for . In either case we let denote the th row of . For any
Now let , and observe for any that . Therefore the joint distribution of is a mean 0 bivariate normal distribution. Denoting the covariance matrix of this distribution as , then can be expressed as
To prove it therefore suffices to show that as per (11). This follows by the induction hypothesis as
Finally, is positive semi-definite as long as is positive semi-definite. Let and observe for any that is positive semi-definite. Therefore must also be positive semi-definite. Thus the inductive step is complete and we may conclude for that
(15) |
For the proof of the expression for the sequence it suffices to prove for any and that
By comparing (13) with (11) this follows immediately from (15). Therefore with (13) proven (14) follows from (12). ∎
A.3 Unit variance initialization
The initialization scheme for a neural network, particularly a deep neural network, needs to be designed with some care in order to avoid either vanishing or exploding gradients during training Glorot & Bengio (2010); He et al. (2015); Mishkin & Matas (2016); LeCun et al. (2012). Some of the most popular initialization strategies used in practice today, in particular LeCun et al. (2012) and Glorot & Bengio (2010) initialization, first model the preactivations of the network as Gaussian random variables and then select the network hyperparameters in order that the variance of these idealized preactivations is fixed at one. Under Assumption 1 this idealized model on the preactivations is actually realized and if we additionally assume the conditions of Assumption 2 hold then likewise the variance of the preactivations at every layer will be fixed at one. To this end, and as in Poole et al. (2016); Murray et al. (2022), consider the function defined as
(16) |
Noting that is another expression for , derived via a change of variables as per Poole et al. (2016), the sequence of variances can therefore be generated as follows,
(17) |
The linear correlation between the preactivations of two inputs we define as
(18) |
Assuming for all , then . Again as in Murray et al. (2022) and analogous to (16), with independent, , 333We remark that are dependent and identically distributed as . we define the correlation function as
(19) |
Noting under these assumptions that is equivalent to , the sequence of correlations can thus be generated as
As observed in Poole et al. (2016); Schoenholz et al. (2017), , hence is a fixed point of . We remark that as all preactivations are distributed as , then a correlation of one between preactivations implies they are equal. The stability of the fixed point is of particular significance in the context of initializing deep neural networks successfully. Under mild conditions on the activation function one can compute the derivative of , see e.g., Poole et al. (2016); Schoenholz et al. (2017); Murray et al. (2022), as follows,
(20) |
Observe that the expressions for the derivative of the correlation function and for the corresponding Gaussian expectation are equivalent via a change of variables (Poole et al., 2016), and therefore the sequence of correlation derivatives may be computed alongside the correlations themselves.
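The correlation map and its derivative can be iterated numerically in a few lines; the sketch below uses a Tanh network with zero bias, normalizes the activation so that preactivations have unit variance, and estimates the slope of the map at the fixed point $\rho = 1$ with a one-sided finite difference. All names and parameter choices are illustrative.

```python
import numpy as np
from numpy.polynomial import hermite_e as He

z, w = He.hermegauss(80)
w1 = w / np.sqrt(2 * np.pi)
W2 = np.outer(w1, w1)
phi = np.tanh
norm_sq = np.sum(w1 * phi(z) ** 2)          # E[phi(Z)^2], used to normalize the activation

def corr_map(rho):
    """Correlation map rho -> E[phi_hat(u) phi_hat(v)] with phi_hat = phi / sqrt(E[phi(Z)^2])
    and (u, v) standard Gaussians with correlation rho, computed on a 2D quadrature grid."""
    U = z[:, None]
    V = rho * z[:, None] + np.sqrt(max(1.0 - rho ** 2, 0.0)) * z[None, :]
    return np.sum(W2 * phi(U) * phi(V)) / norm_sq

# rho = 1 is always a fixed point; its stability is governed by the slope of the map at 1.
for rho0 in (0.2, 0.6, 0.95):
    rho = rho0
    for _ in range(10):
        rho = corr_map(rho)
    print(rho0, "->", rho)

eps = 1e-4
print("slope at rho = 1:", (corr_map(1.0) - corr_map(1.0 - eps)) / eps)
```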
With the relevant background material now in place we are in a position to prove Lemma A.2.
Lemma A.2.
Proof.
Recall from Lemma A.1 and its proof that for any , and . We first prove by induction for all . The base case follows as
Assume the induction hypothesis is true for layer . With , then from (16) and (17)
thus the inductive step is complete. As an immediate consequence it follows that . Also, for any and ,
Thus we can consider as a univariate function of the input correlation and also conclude that . Furthermore,
which likewise implies is a dot product kernel. Recall now the random variables introduced to define : are independent and , . Observe are dependent but identically distributed as . For any then applying the Cauchy-Schwarz inequality gives
As a result, under the assumptions of the lemma and . From this it immediately follows that and as claimed. Finally, as and are dot product kernels, then from (12) the NTK must also be a dot product kernel and furthermore a univariate function of the pairwise correlation of its input arguments. ∎
The following corollary, which follows immediately from Lemma A.2 and (14), characterizes the trace of the NTK matrix in terms of the trace of the input gram.
Corollary A.3.
Under the same conditions as Lemma A.2, suppose and are chosen such that . Then
(21) |
A.4 Hermite Expansions
We say that a function $f: \mathbb{R} \to \mathbb{R}$ is square integrable w.r.t. the standard Gaussian measure $\gamma$ if $\int_{\mathbb{R}} f(z)^2 \gamma(z)\, dz < \infty$. We denote by $L^2(\mathbb{R}, \gamma)$ the space of all such functions. The probabilist's Hermite polynomials are given by
$$He_k(x) = (-1)^k e^{x^2/2} \frac{d^k}{dx^k} e^{-x^2/2}, \qquad k = 0, 1, 2, \ldots$$
The first three Hermite polynomials are $He_0(x) = 1$, $He_1(x) = x$, $He_2(x) = x^2 - 1$. Let $h_k(x) = He_k(x)/\sqrt{k!}$ denote the normalized probabilist's Hermite polynomials. The normalized Hermite polynomials form a complete orthonormal basis in $L^2(\mathbb{R}, \gamma)$ (O'Donnell, 2014, §11): in all that follows, whenever we reference the Hermite polynomials, we will be referring to the normalized Hermite polynomials. The Hermite expansion of a function $\phi \in L^2(\mathbb{R}, \gamma)$ is given by
$$\phi(x) = \sum_{k=0}^{\infty} \mu_k(\phi)\, h_k(x), \tag{22}$$
where
$$\mu_k(\phi) = \mathbb{E}_{Z \sim \mathcal{N}(0,1)}\left[\phi(Z)\, h_k(Z)\right] \tag{23}$$
is the $k$th normalized probabilist's Hermite coefficient of $\phi$. In what follows we shall make use of the following identities.
(24) | ||||
(25) |
(33) |
We also remark that the more commonly encountered physicist's Hermite polynomials, which we denote $H_k$, are related to the normalized probabilist's polynomials as follows,
$$H_k(x) = 2^{k/2} \sqrt{k!}\, h_k\!\left(\sqrt{2}\, x\right).$$
The Hermite expansion of the activation function deployed will play a key role in determining the coefficients of the NTK power series. In particular, the Hermite coefficients of ReLU are as follows.
Lemma A.4.
(Daniely et al., 2016) For $\phi(x) = \max\{0, x\}$ the Hermite coefficients are given by
$$\mu_0(\phi) = \frac{1}{\sqrt{2\pi}}, \quad \mu_1(\phi) = \frac{1}{2}, \quad \mu_k(\phi) = 0 \ \text{ for odd } k \geq 3, \quad \mu_k(\phi) = \frac{(k-3)!!}{\sqrt{2\pi\, k!}} (-1)^{\frac{k}{2}+1} \ \text{ for even } k \geq 2. \tag{34}$$
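The coefficients in Lemma A.4 (as reconstructed above) can be checked against the defining integral (23) by quadrature; agreement is only up to the quadrature error induced by the kink of ReLU at zero. A small sketch:

```python
import numpy as np
from math import factorial, pi
from numpy.polynomial import hermite_e as He

def double_factorial(n):  # convention: (-1)!! = 0!! = 1
    return 1 if n <= 0 else n * double_factorial(n - 2)

z, w = He.hermegauss(150)
w = w / np.sqrt(2 * np.pi)
relu = np.maximum(z, 0.0)

for k in range(6):
    basis = np.zeros(k + 1)
    basis[k] = 1.0
    h_k = He.hermeval(z, basis) / np.sqrt(factorial(k))
    numeric = np.sum(w * relu * h_k)          # mu_k(relu) via quadrature
    if k == 0:
        closed = 1.0 / np.sqrt(2 * pi)
    elif k == 1:
        closed = 0.5
    elif k % 2 == 1:
        closed = 0.0
    else:
        closed = (-1) ** (k // 2 + 1) * double_factorial(k - 3) / np.sqrt(2 * pi * factorial(k))
    print(k, numeric, closed)
```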
Appendix B Expressing the NTK as a power series
B.1 Deriving a power series for the NTK
We will require the following minor adaptation of Nguyen & Mondelli (2020, Lemma D.2). We remark this result was first stated for ReLU and Softplus activations in the work of Oymak & Soltanolkotabi (2020, Lemma H.2).
Lemma B.1.
For arbitrary , let . For , we denote the th row of as , and further assume that . Let satisfy and define
Then the matrix series
converges uniformly to as .
The proof of Lemma B.1 follows exactly as in (Nguyen & Mondelli, 2020, Lemma D.2), and is in fact slightly simpler due to the fact we assume the rows of are unit length and instead of and respectively. For the ease of the reader, we now recall the following definitions, which are also stated in Section 3. Letting denote a sequence of real coefficients, then
(35) |
where
for all , .
We are now ready to derive power series for elements of and .
Lemma B.2.
Under Assumptions 1 and 2, for all
(36) |
where the series for each element converges absolutely and the coefficients are nonnegative. The coefficients of the series (36) for all can be expressed via the following recurrence relationship,
(37) |
Furthermore,
(38) |
where likewise the series for each entry converges absolutely and the coefficients for all are nonnegative and can be expressed via the following recurrence relationship,
(39) |
Proof.
We start by proving (36) and (37). Proceeding by induction, consider the base case . From Lemma A.1
By the assumptions of the lemma, the conditions of Lemma B.1 are satisfied and therefore
Observe the coefficients are nonnegative. Therefore, for any using Lemma A.2 the series for satisfies
(40) |
and so must be absolutely convergent. With the base case proved we proceed to assume the inductive hypothesis holds for arbitrary with . Observe
where is a matrix square root of , meaning . Recall from Lemma A.1 that is also symmetric and positive semi-definite, therefore we may additionally assume, without loss of generality, that is symmetric, which conveniently implies . Under the assumptions of the lemma the conditions for Lemma A.2 are satisfied and as a result for all , where we recall denotes the th row of . Therefore we may again apply Lemma A.1,
where the final equality follows from the inductive hypothesis. For any pair of indices
By the induction hypothesis, for any the series is absolutely convergent. Therefore, from the Cauchy product of power series and for any we have
(41) |
where is defined in (4). By definition, is a sum of products of positive coefficients, and therefore . In addition, recall again by Assumption 2 and Lemma A.2 that . As a result, for any , as
(42) |
and therefore the series converges absolutely. Recalling from the proof of the base case that the series is absolutely convergent and has only nonnegative elements, we may therefore interchange the order of summation in the following,
ij | |||
Recalling the definition of in (4), in particular and for , then
ij | |||
As the indices were arbitrary we conclude that
as claimed. In addition, by inspection and using the induction hypothesis it is clear that the coefficients are nonnegative. Therefore, by an argument identical to (40), the series for each entry of is absolutely convergent. This concludes the proof of (36) and (37).
We now turn our attention to proving (38) and (39). Under the assumptions of the lemma the conditions for Lemmas A.1 and B.1 are satisfied and therefore for the base case
By inspection the coefficients are nonnegative and as a result by an argument again identical to (40) the series for each entry of is absolutely convergent. For , from (36) and its proof there is a matrix such that . Again applying Lemma B.1
Analyzing now an arbitrary entry , by substituting in the power series expression for from (36) and using (41) we have
ij | |||
Note that exchanging the order of summation in the third equality above is justified as for any by (42) we have and therefore converges absolutely. As the indices were arbitrary we conclude that
as claimed. Finally, by inspection the coefficients are nonnegative, therefore, and again by an argument identical to (40), the series for each entry of is absolutely convergent. This concludes the proof. ∎
We now prove the key result of Section 3.
See 3.1
Proof.
We proceed by induction. The base case follows trivially from Lemma A.1. We therefore assume the induction hypothesis holds for an arbitrary . From (14) and Lemma B.2
Therefore, for arbitrary
ij |
Observe and therefore the series must converge due to the convergence of the NTK. Furthermore, and therefore is absolutely convergent by Lemma B.2. As a result, by Mertens’ theorem the product of these two series is equal to their Cauchy product. Therefore
ij | |||
from which (5) immediately follows. ∎
B.2 Analyzing the coefficients of the NTK power series
In this section we study the coefficients of the NTK power series stated in Theorem 3.1. Our first observation is that, under additional assumptions on the activation function , the recurrence relationship (6) can be simplified in order to depend only on the Hermite expansion of .
Lemma B.3.
Proof.
Note for each as is absolutely continuous on it is differentiable a.e. on . It follows by the countable additivity of the Lebesgue measure that is differentiable a.e. on . Furthermore, as is polynomially bounded we have . Fix . Since is absolutely continuous on it is of bounded variation on . Also note that is of bounded variation on due to having a bounded derivative. Thus we have by Lebesgue-Stieltjes integration-by-parts (see e.g. Folland 1999, Chapter 3)
where in the last line above we have used the fact that (24) and (25) imply that . Thus we have shown
We note that since we have that as the first two terms above vanish. Thus by sending we have
After dividing by we get the desired result. ∎
In particular, under Assumption 3, and as highlighted by Corollary B.4, which follows directly from Lemmas B.2 and B.3, the NTK coefficients can be computed using only the Hermite coefficients of .
With these results in place we proceed to analyze the decay of the coefficients of the NTK for depth two networks. As stated in the main text, the decay of the NTK coefficients depends on the decay of the Hermite coefficients of the activation function deployed, which in turn is strongly influenced by the behavior of the tails of the activation function. To this end we roughly group activation functions into three categories: growing tails, flat or constant tails, and decaying tails. Analyzing each of these groups in full generality is beyond the scope of this paper; we therefore instead study the behavior of the ReLU, Tanh and Gaussian activation functions, which are prototypical and practically relevant examples of these three groups respectively. We remark that these three activation functions satisfy Assumption 3. For typographical ease we let denote the Gaussian activation function with variance .
See 3.2
Proof.
Recall (9),
In order to bound we proceed by using Lemma A.4 to bound the square of the Hermite coefficients. We start with ReLU. Note Lemma A.4 actually provides precise expressions for the Hermite coefficients of ReLU; however, these are not immediately easy to interpret. Observe from Lemma A.4 that above index all odd indexed Hermite coefficients are . It therefore suffices to bound the even indexed terms, given by
Observe from (33) that for even
therefore
Analyzing now ,
Here, the expression inside the square root is referred to in the literature as the Wallis ratio, for which the following lower and upper bounds are available (Kazarinoff, 1956),
(44) |
As a result
and therefore
As , then from (9)
as claimed in item 1.
We now proceed to analyze . From Panigrahi et al. (2020, Corollary F.7.1)
As Tanh satisfies the conditions of Lemma B.3
Therefore the result claimed in item 2. follows as
Finally, we now consider where is the density function of . Similar to ReLU, analytic expressions for the Hermite coefficients of are known (see e.g., Davis, 2021, Theorem 2.9),
For even
Therefore
As a result, for even and using (44), it follows that
Finally, since , then from (9)
as claimed in item 3. ∎
B.3 Numerical approximation via a truncated NTK power series and interpretation of Figure 2
Currently, computing the infinite width NTK requires either a) explicit evaluation of the Gaussian integrals highlighted in (13), b) numerical approximation of these same integrals such as in Lee et al. (2018), or c) approximation via a sufficiently wide yet still finite width network, see for instance Engel et al. (2022); Novak et al. (2022). These Gaussian integrals (13) can be solved analytically only for a minority of activation functions, notably ReLU as discussed for example by Arora et al. (2019b), while the numerical integration and finite width approximation approaches are relatively computationally expensive. We define the truncated NTK power series analogously to (5), but with the series involved computed only up to the th element. Once the top coefficients are computed, the NTK can be approximated at any input correlation by evaluating the corresponding finite degree polynomial.
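As a minimal sketch of the approximation step just described, the following snippet evaluates a truncated power series at a batch of input correlations via Horner's scheme; the coefficient array is a placeholder rather than values produced by the paper's recurrences.

```python
import numpy as np

def truncated_ntk(rho, kappa):
    """Evaluate sum_{p=0}^{T} kappa[p] * rho**p via Horner's scheme."""
    rho = np.asarray(rho, dtype=float)
    acc = np.zeros_like(rho)
    for c in kappa[::-1]:            # start from the highest-order coefficient
        acc = acc * rho + c
    return acc

# hypothetical truncated coefficients (placeholders, not the computed NTK values)
kappa = np.array([0.50, 0.25, 0.12, 0.06, 0.03])
rhos = np.linspace(-1.0, 1.0, 9)     # input correlations for unit-norm inputs
print(truncated_ntk(rhos, kappa))
```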
Definition B.5.
In order to analyze the performance and potential of the truncated NTK for numerical approximation, we compute it for ReLU and compare it with its analytical expression (Arora et al., 2019b). To recall this result, let
Under Assumptions 1 and 2, with , , , , and , then and for all
(49) | ||||
Turning our attention to Figure 2, we observe that, particularly for input correlations of and below, the truncated ReLU NTK power series achieves machine level precision. For larger correlations higher order coefficients play a more significant role; as the truncated ReLU NTK power series approximates these coefficients less well, the overall approximation of the ReLU NTK is worse. We remark also that negative correlations have a smaller absolute error as odd indexed terms cancel with even indexed terms: we emphasize again that in Figure 2 we plot the absolute, not relative, error. In addition, for there is symmetry in the absolute error for positive and negative correlations as for all odd . One also observes that the approximation accuracy goes down with depth, which is due to the error in the coefficients at the previous layer contributing to the error in the coefficients at the next, thereby resulting in an accumulation of error with depth. Also, and certainly as one might expect, a larger truncation point results in overall better approximation. Finally, as the decay in the Hermite coefficients for ReLU is relatively slow, see e.g., Table 1 and Lemma 3.2, we expect the truncated ReLU NTK power series to perform worse relative to the truncated NTKs for other activation functions.

B.4 Characterizing NTK power series coefficient decay rates for deep networks
In general, Theorem 3.1 does not provide a straightforward path to analyzing the decay of the NTK power series coefficients for depths greater than two. This is at least in part due to the difficulty of analyzing , which recall is the sum of all ordered products of elements of whose indices sum to , defined in (4). However, in the setting where the squares of the Hermite coefficients, and therefore the series , decay at an exponential rate, this quantity can be characterized and therefore an analysis of the impact of depth can be conducted, at least to a certain degree. Although admittedly limited in scope, we highlight that this setting is relevant for the study of Gaussian activation functions and radial basis function (RBF) networks. We will also make the additional simplifying assumption that the activation function has zero Gaussian mean (which can be obtained by centering). Unfortunately this further reduces the applicability of the following results to activation functions commonly used in practice. We leave the study of relaxing this zero bias assumption, perhaps only enforcing exponential decay asymptotically, as well as a proper exploration of other decay patterns, to future work.
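To make the quantity discussed above concrete, the following sketch computes, for an illustrative coefficient sequence, the sum of all ordered products of coefficients over a fixed number of factors whose indices sum to a given value, both by brute force over index tuples and as the corresponding coefficient of an iterated Cauchy product (polynomial convolution); the two agree.

```python
import numpy as np
from itertools import product

def sum_of_ordered_products(c, k, p):
    """Brute force: sum of c[i1]*...*c[ik] over all (i1,...,ik) with i1+...+ik = p."""
    return sum(np.prod([c[i] for i in idx])
               for idx in product(range(p + 1), repeat=k) if sum(idx) == p)

def kth_power_coefficient(c, k, p):
    """Same quantity, computed as the degree-p coefficient of the k-th power of the series."""
    out = np.array([1.0])
    for _ in range(k):
        out = np.convolve(out, c[:p + 1])   # iterated Cauchy product
    return out[p]

c = np.array([0.0, 0.7, 0.49, 0.343, 0.2401])   # e.g. zero constant term, geometric decay
print(sum_of_ordered_products(c, k=3, p=4), kth_power_coefficient(c, k=3, p=4))
```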
The following lemma precisely describes, in the specific setting considered here, the evolution of the coefficients of the Gaussian Process kernel with depth.
Lemma B.6.
Let and for , where and are constants such that . Then for all and
(50) |
where the constants and are defined as
(51) |
Proof.
Observe for , we have that and hold by assumption. Thus by induction it suffices to show that and implies (50) and (51) hold. Thus assume for some we have that and . Recall the definition of from (4): as then with and
where
which is the set of all -tuples of positive (instead of non-negative) integers which sum to . Substituting then
where the final equality follows from a stars and bars argument. Now observe for that at least one of the indices in must be 0 and therefore . As a result under the assumptions of the lemma
(52) |
Substituting (52) into (7) it follows that
and for
as claimed. ∎
We now analyze the coefficients of the derivative of the Gaussian Process kernel.
Lemma B.7.
Proof.
Under Assumption 3 then for all we have
For and it therefore follows that
For and then
as claimed. ∎
With the coefficients of both the Gaussian Process kernel and its derivative characterized, we proceed to upper bound the decay of the NTK coefficients in the specific setting outlined in Lemmas B.6 and B.7.
Lemma B.8.
Proof.
We proceed by induction starting with the base case . Applying the results of Lemmas B.6 and B.7 to (6) then for
(56) |
If we define and , which are clearly positive constants, then and so for the induction hypothesis clearly holds. We now assume the inductive hypothesis holds for some . Observe from (53), with and that
(57) |
where . Substituting (57) and the inductive hypothesis inequality into (6) it follows for that
Therefore there exist positive constants and such that as claimed. This completes the inductive step and therefore also the proof of the lemma. ∎
Appendix C Analyzing the spectrum of the NTK via its power series
C.1 Effective rank of power series kernels
Recall that for a positive semidefinite matrix we define the effective rank (Huang et al., 2022) via the following ratio
We consider a kernel Gram matrix that has the following power series representation in terms of an input gram matrix
Whenever the effective rank of is , as displayed in the following theorem. See 4.1
Proof.
By linearity of trace we have that
where we have used the fact that for all . On the other hand
Thus we have that
∎
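The statement just proved can be checked numerically: for a kernel built from a power series with a nonzero constant term, the effective rank remains bounded as the sample size grows. The following sketch uses synthetic unit-norm data and illustrative coefficients.

```python
import numpy as np

def effective_rank(K):
    eigvals = np.linalg.eigvalsh(K)          # ascending; K is symmetric PSD
    return eigvals.sum() / eigvals[-1]

rng = np.random.default_rng(0)
for n in (100, 400, 1600):
    X = rng.standard_normal((n, 20))
    X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs
    G = X @ X.T
    K = 0.5 + 0.3 * G + 0.2 * G**2           # constant term c_0 = 0.5 gives an all-ones component
    print(n, round(effective_rank(K), 2))    # stays O(1) as n grows
```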
The above theorem demonstrates that the constant term in the kernel leads to a significant outlier in the spectrum of . However, this fails to capture how the structure of the input data manifests in the spectrum of . For this we will examine the centered kernel matrix . Using a very similar argument to the one before, we can demonstrate that the effective rank of is controlled by the effective rank of the input data gram . This is formalized in the following theorem. See 4.3
Proof.
By the linearity of the trace we have that
where we have used the fact that for all . On the other hand we have that
Thus we conclude
∎
C.2 Effective rank of the NTK for finite width networks
C.2.1 Notation and definitions
We will let . We consider a neural network
where and , for all and is a scalar valued activation function. The network we present here does not have any bias values in the inner layer; however, the results we will prove later apply to the nonzero bias case by replacing with . We let be the matrix whose -th row is equal to and be the vector whose -th entry is equal to . We can then write the neural network in vector form
where is understood to be applied entry-wise.
Suppose we have training data inputs . We will let be the matrix whose -th row is equal to . Let denote the row-wise vectorization of the inner-layer weights. We consider the Jacobian of the neural network’s predictions on the training data with respect to the inner layer weights:
Similarly we can look at the analogous quantity for the outer layer weights
Our first observation is that the per-example gradients for the inner layer weights have a nice Kronecker product representation
For convenience we will let
where the dependence of on the parameters and is suppressed (formally ). This way we may write
We will study the NTK with respect to the inner-layer weights
and the same quantity for the outer-layer weights
For a Hermitian matrix we will let denote the th largest eigenvalue of so that . Similarly, for an arbitrary matrix we will let denote the th largest singular value of . For a matrix we will let .
C.2.2 Effective rank
For a positive semidefinite matrix we define the effective rank (Huang et al., 2022) of to be the quantity
The effective rank quantifies how many eigenvalues are on the order of the largest eigenvalue. We have the Markov-like inequality
and the eigenvalue bound
Let and be positive semidefinite matrices. Then we have
Thus the effective rank is subadditive for positive semidefinite matrices.
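A minimal sketch of the effective rank computation, together with a numerical check of the subadditivity property above; the random matrices are illustrative only.

```python
import numpy as np

def effective_rank(K):
    """trace(K) / lambda_max(K) for a symmetric PSD matrix."""
    e = np.linalg.eigvalsh(K)
    return e.sum() / e[-1]

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5));  K1 = A @ A.T      # PSD, rank <= 5
B = rng.standard_normal((50, 40)); K2 = B @ B.T      # PSD, rank <= 40
lhs = effective_rank(K1 + K2)
rhs = effective_rank(K1) + effective_rank(K2)
print(lhs, rhs, lhs <= rhs)                          # subadditivity: prints True
```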
We will be interested in bounding the effective rank of the NTK. Let be the NTK matrix with respect to all the network parameters. Note that by subadditivity
In this vein we will control the effective rank of and separately.
C.2.3 Effective rank of inner-layer NTK
We will show that the effective rank of the inner-layer NTK is bounded by a multiple of the effective rank of the data input gram . We introduce the following meta-theorem that we will use to prove various corollaries later.
Theorem C.1.
Set . Assume . Then
Proof.
We first prove the upper bound. We begin by observing that
Recall that
Well
Well then we may use the fact that
Let and be vectors that we will optimize later satisfying . Then we have that and
Pick so that and
Thus for this choice of we have
Now note that . Thus by taking the sup over in our previous bound we have
Thus combined with our previous result we have
We now prove the lower bound.
Let be the matrix whose th row is equal to . Then observe that
where denotes the entry-wise Hadamard product of two matrices. We now recall that if and are two positive semidefinite matrices we have (Oymak & Soltanolkotabi, 2020, Lemma 2)
Applying this to we get that
Combining this with our previous result we get
∎
We can immediately get a useful corollary that applies to the ReLU activation function
Corollary C.2.
Set and . Assume and . Then
Proof.
Note that the hypothesis on gives for all . Moreover by Cauchy-Schwarz we have that . Thus by Theorem C.1 we get the desired result. ∎
If is a leaky ReLU type activation (say like those used in Nguyen & Mondelli (2020)), Theorem C.1 translates into an even simpler bound.
Corollary C.3.
Suppose for all where . Then
Proof.
Controlling in Theorem C.1 when is the ReLU activation function requires a bit more work. To this end we introduce the following lemma.
Lemma C.4.
Assume . Let and define . Set . Then
Proof.
Let be the vector such that . Then note that
∎
Roughly, Lemma C.4 says that is controlled when, for each data point, there is a set of active inner-layer neurons whose outer layer weights are similar in magnitude. Note that in Du et al. (2019b), Arora et al. (2019a), Oymak et al. (2019), Li et al. (2020), Xie et al. (2017) and Oymak & Soltanolkotabi (2020) the outer layer weights all have fixed constant magnitude. Thus in that case we can set in Lemma C.4 so that . In this setting we have the following result.
Theorem C.5.
Assume . Suppose for all . Furthermore suppose are independent random vectors such that has the uniform distribution on the sphere for each . Also assume for some . Then with probability at least we have that
Proof.
Fix . Note by the assumption on the ’s we have that are i.i.d. Bernoulli random variables taking the values and with probability . Thus by the Chernoff bound for Binomial random variables we have that
Thus taking the union bound over every we get that if then
holds with probability at least . Now note that if we set we have that where is defined as it is in Lemma C.4. In this case by our previous bound we have that as defined in Lemma C.4 satisfies with probability at least . In this case the conclusion of Lemma C.4 gives us
Thus by Corollary C.2 and the above bound for we get the desired result. ∎
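The Chernoff-type argument used in the proof above can be illustrated numerically: for inputs and first-layer weights drawn uniformly on the sphere, each input activates close to half of the ReLU neurons. The parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 200, 2000, 50
X = rng.standard_normal((n, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
W = rng.standard_normal((m, d)); W /= np.linalg.norm(W, axis=1, keepdims=True)

# For each input x_i, count neurons with <w_l, x_i> > 0; each indicator is
# Bernoulli(1/2), independent over l, so the counts concentrate around m/2.
active = (X @ W.T > 0).sum(axis=1)
print(active.min(), active.max(), m // 2)
```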
We will now use Lemma C.4 to prove a bound in the case of Gaussian initialization.
Lemma C.6.
Assume . Suppose that for each i.i.d. Furthermore suppose are random vectors independent of each other and such that has the uniform distribution on the sphere for each . Set . Assume
for some . Then with probability at least we have that
Proof.
Set and . Now set
Now define . We have by the Chernoff bound for binomial random variables
Thus if (a weaker condition than the hypothesis on ) then we have that with probability at least . From now on assume such a has been observed and view it as fixed so that the only remaining randomness is over the ’s. Now set . By the Chernoff bound again we get that for fixed
Thus by taking the union bound over we get
Thus if we consider as fixed and then with probability at least over the sampling of the ’s we have that
In this case by Lemma C.4 we have that
Thus the above holds with probability at least . ∎
This lemma now allows us to bound the effective rank of in the case of Gaussian initialization.
Theorem C.7.
Assume . Suppose that for each i.i.d. Furthermore suppose are random vectors independent of each other and such that has the uniform distribution on the sphere for each . Set . Let . Then there exist absolute constants such that if
then with probability at least we have that
where
Proof.
By Bernstein’s inequality
where is an absolute constant. Set so that the right hand side of the above inequality is bounded by . Thus by Lemma C.6 and the union bound we can ensure that with probability at least
that and the conclusion of Lemma C.6 hold simultaneously. In that case
Thus by Corollary C.2 we get the desired result. ∎
By fixing in the previous theorem we get the following immediate corollary.
Corollary C.8.
Assume . Suppose that for each i.i.d. Furthermore suppose are random vectors independent of each other and such that has the uniform distribution on the sphere for each . Then there exists an absolute constant such that ensures that with probability at least
C.2.4 Effective rank of outer-layer NTK
Throughout this section . Our goal in this section, similar to before, is to bound the effective rank of by the effective rank of the input data gram . We will often make use of the basic identities
To begin bounding the effective rank of , we prove the following lemma.
Lemma C.9.
Assume and is full rank with . Then
Proof.
First note that
Pick such that and . Since is full rank we may set so that where is the smallest nonzero singular value of . Well then
Now using the fact that we have that
Thus, combined with our previous results, this gives
Therefore
which gives us the desired result. ∎
Corollary C.10.
Assume and is full rank with . Then
Proof.
Using the fact that
and Lemma C.9 we have that
Note that the right hand side and the denominator of the left hand side do not change when you replace with . Therefore by using the above bound for both and as the weight matrix separately we can conclude
∎
Corollary C.11.
Assume and . Suppose and have the same distribution. Then conditioned on being full rank we have that with probability at least
Proof.
Fix where is full rank. We have by Corollary C.10 that either
holds or
(the first holds in the case where and the second in the case ). Since and have the same distribution, it follows that the first inequality must hold at least of the time. From
we get the desired result. ∎
We now note that when is rectangular and the entries of are i.i.d. Gaussian, is full rank with high probability and is well behaved. We recall the following result from Vershynin (2012).
Theorem C.12.
Let be a matrix whose entries are independent standard normal random variables. Then for every , with probability at least one has
Corollary C.11 gives us a bound that works at least half the time. However, we would like to derive a bound that holds with high probability. We will be able to prove such a bound when we have sufficient concentration of the largest singular value of . We recall the following result from Vershynin (2012, Remark 5.40).
Theorem C.13.
Assume that is an matrix whose rows are independent sub-gaussian random vectors in with second moment matrix . Then for every , the following inequality holds with probability at least
where depend only on .
We will use Theorem C.13 in the following lemma.
Lemma C.14.
Assume . Let and . Suppose that i.i.d. Set and define
Then for every the following inequality holds with probability at least
where are absolute constants that depend only on .
Proof.
We will let denote the th row of (considered as a column vector). Note that
We immediately get that the rows of are i.i.d. We will now bound . Let such that . Then
where is an absolute constant. Set . Well then by Theorem C.13 we have the following. For every the following inequality holds with probability at least
∎
We are now ready to prove a high probability bound for the effective rank of .
Theorem C.15.
Assume and . Let . Suppose that i.i.d. Set
where is small. Now assume
and
Then with probability at least
Proof.
By Theorem C.12 with we have with probability at least that
(58) |
The above inequalities and the hypothesis on imply that is full rank.
From the above theorem we get the following corollary.
Corollary C.16.
Assume and . Suppose that i.i.d. Fix small. Set . Then
and
suffices to ensure that with probability at least
where is an absolute constant.
C.2.5 Bound for the combined NTK
Based on the results in the previous two sections, we can now bound the effective rank of the combined NTK gram matrix . See 4.5
C.2.6 Magnitude of the spectrum
By our results in Sections C.2.3 and C.2.4 we have that suffices to ensure that
Well note that
If then is small. Thus the NTK only has large eigenvalues. The smallest eigenvalue of the NTK has been of interest in proving convergence guarantees (Du et al., 2019a, b; Oymak & Soltanolkotabi, 2020). By our previous inequality
Thus in the setting where we have that the smallest eigenvalue will be driven to zero relative to the largest eigenvalue. Alternatively we can view the above inequality as a lower bound on the condition number
We will first bound the analytical NTK in the setting when the outer layer weights have fixed constant magnitude. This is the setting considered by Xie et al. (2017), Arora et al. (2019a), Du et al. (2019b), Oymak et al. (2019), Li et al. (2020), and Oymak & Soltanolkotabi (2020).
Theorem C.17.
Let and assume . Let be the analytical NTK, i.e.
Then
Proof.
We consider the setting where for all and i.i.d. As was shown by Jacot et al. (2018) and Du et al. (2019b), in this setting if we fix the training data and send we have that
in probability. Therefore by continuity of the effective rank we have that
in probability. Let . Then there exists an such that implies that
(60) |
with probability greater than . Now fix . On the other hand by Theorem C.5 with we have that if then with probability at least that
(61) |
Thus if we set we have with probability at least that (60) and (61) hold simultaneously. In this case we have that
Note that the above argument runs through for any and . Thus we may send and in the above inequality to get
∎
We thus have the following corollary about the conditioning of the analytical NTK.
Corollary C.18.
Let and assume . Let be the analytical NTK, i.e.
Then
C.3 Experimental validation of results on the NTK spectrum


We experimentally test the theory developed in Section 4.1 and its implications by analyzing the spectrum of the NTK for both fully connected neural network architectures (FCNNs), the results of which are displayed in Figure 1, and also convolutional neural network architectures (CNNs), shown in Figure 3. For the feedforward architectures we consider networks of depth 2 and 5 with the width of all layers set to 500. With regard to the activation function we test linear, ReLU and Tanh, and in terms of initialization we use Kaiming uniform (He et al., 2015), which is very common in practice and is the default in PyTorch (Paszke et al., 2019). For the convolutional architectures we again consider depths 2 and 5, with each layer consisting of 100 channels with the filter size set to 5x5. In terms of data, we consider 40x40 patches from both real world images, generated by applying PyTorch’s RandomResizedCrop transform to a random batch of Caltech101 images (Li et al., 2022), as well as synthetic images corresponding to isotropic Gaussian vectors. The batch size is fixed at 200 and we plot only the first 100 normalized eigenvalues. Each experiment was repeated 10 times. Finally, to compute the NTK we use the functorch module in PyTorch (https://pytorch.org/functorch/stable/notebooks/neural_tangent_kernels.html), using an algorithmic approach inspired by Novak et al. (2022).
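For reference, the following is a minimal sketch, not the authors' code, of an empirical NTK computation in the spirit of the functorch tutorial linked above, written against torch.func (which subsumes functorch in recent PyTorch releases); the architecture, width and batch size are placeholders rather than the settings used in the experiments.

```python
import torch
from torch import nn
from torch.func import functional_call, jacrev, vmap

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(1600, 100), nn.ReLU(), nn.Linear(100, 1))
params = {k: v.detach() for k, v in model.named_parameters()}

def f(p, x):
    # forward pass on a single example; output has shape [1]
    return functional_call(model, p, (x.unsqueeze(0),)).squeeze(0)

def empirical_ntk(x1, x2):
    # per-example Jacobians w.r.t. every parameter tensor: [N, out_dim, *param_shape]
    j1 = vmap(jacrev(f), (None, 0))(params, x1)
    j2 = vmap(jacrev(f), (None, 0))(params, x2)
    j1 = [j.flatten(2) for j in j1.values()]
    j2 = [j.flatten(2) for j in j2.values()]
    # contract over parameters and sum the contribution of each parameter tensor
    return sum(torch.einsum('Naf,Mbf->NMab', a, b) for a, b in zip(j1, j2))

x = torch.randn(50, 1600)                           # e.g. flattened 40x40 inputs
x = x / x.norm(dim=1, keepdim=True)
ntk = empirical_ntk(x, x).squeeze(-1).squeeze(-1)   # [50, 50] Gram matrix
eigs = torch.linalg.eigvalsh(ntk)                   # ascending order
print(eigs.flip(0)[:10] / eigs[-1])                 # leading normalized eigenvalues
```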
The results for convolutional neural networks show the same trends as observed in feedforward neural networks, which we discussed in Section 4.1. In particular, we again observe the dominant outlier eigenvalue, which increases with both depth and the size of the Gaussian mean of the activation. We also again see that the NTK spectrum inherits its structure from the data, i.e., is skewed for skewed data or relatively flat for isotropic Gaussian data. Finally, we also see that the spectrum for Tanh is closer to the spectrum for the linear activation when compared with the ReLU spectrum. In terms of differences between the CNN and FCNN experiments, we observe that the spread of the 95% confidence interval is slightly larger for convolutional nets, implying a slightly larger variance between trials. We remark that this is likely attributable to the fact that there are only 100 channels in each layer and by increasing this quantity we would expect the variance to reduce. In summary, despite the fact that our analysis is concerned with FCNNs, it appears that the broad implications and trends also hold for CNNs. We leave a thorough study of the NTK spectrum for CNNs and other network architectures to future work.
To test our theory in Section 4.2, we numerically plot the spectrum of the NTK of two-layer feedforward networks with ReLU, Tanh, and Gaussian activations in Figure 4. The input data are uniformly drawn from . Notice that when , . Then Corollary 4.7 shows that for the ReLU activation , for the Tanh activation , and for the Gaussian activation . These theoretical decay rates for the NTK spectrum are verified by the experimental results in Figure 4.
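As a rough illustration of this kind of check, the sketch below draws data uniformly on a low-dimensional sphere, assembles a kernel Gram matrix from a truncated power series in the input Gram with an assumed coefficient decay law (not the actual NTK coefficients), and estimates the eigenvalue decay exponent from a log-log fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, P = 1000, 3, 60
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)                     # uniform on the sphere
G = X @ X.T

coeffs = np.array([0.0] + [p ** -4.0 for p in range(1, P + 1)])   # assumed power-law decay
K = sum(c * G**p for p, c in enumerate(coeffs))                   # entry-wise powers of the Gram

eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
ks = np.arange(5, 200)
slope = np.polyfit(np.log(ks), np.log(eigs[ks]), 1)[0]
print(f"fitted log-log slope of the k-th eigenvalue: {slope:.2f}")
```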
C.4 Analysis of the lower spectrum: uniform data
See 4.6
Proof.
Let , then according to the Funk-Hecke theorem (Basri et al., 2019, Section 4.2), we have
(62) |
where is the volume of the hypersphere , and is the Gegenbauer polynomial, given by
and is the gamma function.
From (62) we have
(63) |
Using integration by parts, we have
(64) |
where the last line in (64) holds because when or .
See 4.7
Proof of Corollary 4.7, part 1.
We first prove . Suppose that for some constant , then according to Theorem 4.6 we have
According to Stirling’s formula, we have
(70) |
Then for any , we can find constants and such that
(71) |
Then
(72) |
We define
(73) |
By applying the chain rule to , we have that the derivative of is
(74) |
Let . Then and have the same sign. Next we will show that for when is large enough.
First when and , we have
(75) |
when is sufficiently large.
Second when and , since for , we have
When , we have . Then
when is sufficiently large. Also we have
when is sufficiently large.
Combining all the arguments above, we conclude that and when . Then when , we have
(76) |
When , we have
If , . If , i.e., , for sufficiently large , we have
which is a constant independent of . Then for , we have
(77) |
Finally we have
Next we prove . Since are nonnegative and , we have that for some constant . Then we have
(78) |
According to Stirling’s formula (70) and (71), using a similar argument to (72) we have
(79) | ||||
(80) | ||||
(81) |
where is defined in (73). When , we have
For sufficiently large , . Then we have
which is a constant independent of . Also, for sufficiently large , we have
Then for , we have .
Finally we have
(82) | ||||
(83) | ||||
(84) | ||||
(85) |
Overall, we have . ∎
Proof of Corollary 4.7, part 2.
It is easy to verify that when is even because when and is even. When is odd, the proof of Theorem 4.6 still applies. ∎
Proof of Corollary 4.7, part 3.
Since , we have that for some constant . Similar to (72), we have
(86) |
Using the definition in (73) and letting , we have
(87) |
Then according to (76) and (77), for sufficiently large , we have when and for some constant when . Then when , we have . When , we have . Overall, for all , we have
(88) |
Then we have
(89) | ||||
(90) | ||||
(91) | ||||
(92) |
∎
Proof of Corollary 4.7, part 4.
Since , we have that for some constant . Similar to (72), we have
(93) |
Using the definition in (73) and letting , we have
(94) |
Then according to (76) and (77), for sufficiently large , we have when and for some constant when . Then when , we have . When , we have . Overall, for all , we have
(95) |
Then we have
(96) | ||||
(97) | ||||
(98) | ||||
(99) |
On the other hand, since , we have that for some constant . Similar to (80), we have
(100) | ||||
(101) | ||||
(102) |
Since . Similarly, . Then we have
(103) | ||||
(104) | ||||
(105) |
∎
For the NTK of a two-layer ReLU network with , according to Lemma 3.2 we have . Therefore, using Corollary 4.7, . Notice here that refers to the frequency, and the number of spherical harmonics of frequency at most is . Therefore, for the th largest eigenvalue we have . This rate agrees with Basri et al. (2019) and Velikanov & Yarotsky (2021). For the NTK of a two-layer ReLU network with , the eigenvalues corresponding to the even frequencies are , which also agrees with Basri et al. (2019). Corollary 4.7 also shows the decay rates of eigenvalues for the NTK of two-layer networks with Tanh activation and Gaussian activation. We observe that when the coefficients of the kernel power series decay quickly then the eigenvalues of the kernel also decay quickly. As a faster decay of the eigenvalues of the kernel implies a smaller RKHS, Corollary 4.7 demonstrates that using ReLU results in a larger RKHS relative to using either Tanh or Gaussian activations. We numerically illustrate Corollary 4.7 in Figure 4, Appendix C.3.
C.5 Analysis of the lower spectrum: non-uniform data
The purpose of this section is to prove a formal version of Theorem 4.8. In order to prove this result we first need the following lemma.
Lemma C.19.
Let the coefficients with for all be such that the series converges for all . Given a data matrix with for all , define and the gram matrix . Consider the kernel matrix
For arbitrary , let the eigenvalue index satisfy , where . Then
(106) |
Proof.
We start our analysis by considering for some arbitrary . Let
be the -head and -tail of the Hermite expansion of : clearly for any . Recall that a constant matrix is symmetric and positive semi-definite; furthermore, by the Schur product theorem, the Hadamard product of two positive semi-definite matrices is positive semi-definite. As a result, is symmetric and positive semi-definite for all and therefore and are also symmetric positive semi-definite matrices. From Weyl’s inequality (Weyl, 1912, Satz 1) it follows that
(107) |
In order to upper bound , observe, as is square, symmetric and positive semi-definite, that . Using the non-negativity of the coefficients and the triangle inequality we have
By the assumptions of the lemma and therefore for all . Furthermore, for any pair of positive semi-definite matrices and
(108) |
Schur (1911). Therefore, as ,
for all . As a result
Finally, we now turn our attention to the analysis of . Upper bounding a small eigenvalue is typically challenging; however, the problem simplifies when and exceeds the rank of , as is assumed here, as this trivially implies . Therefore, for
as claimed. ∎
In order to use Lemma C.19 we require an upper bound on the rank of . To this end we provide Lemma C.20.
Lemma C.20.
Let be a symmetric, positive semi-definite matrix of rank . Define as
(109) |
where is a sequence of real coefficients. Then
(110) | ||||
Proof.
As is a symmetric and positive semi-definite matrix, its eigenvalues are real and non-negative and its eigenvectors are orthogonal. Let be a set of orthogonal eigenvectors for and the eigenvalue associated with . Then may be written as a sum of rank one matrices as follows,
As the Hadamard product is commutative, associative and distributive over addition, for any can also be expressed as a sum of rank 1 matrices,
Note the fourth equality in the above follows from and an application of the mixed-product property of the Hadamard product. As matrix rank is sub-additive, the rank of is less than or equal to the number of distinct rank-one matrix summands. This quantity in turn is equal to the number of vectors of the form , where . This in turn is equivalent to computing the number of -combinations with repetition from objects. Via a stars and bars argument this is equal to . It therefore follows that
The rank of can therefore be bounded via subadditivity of the rank as
(111) | ||||
∎
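The rank bound established above is easy to verify numerically: for a PSD matrix of rank r, the k-fold Hadamard power has rank at most the number of k-combinations with repetition from r objects, and for generic matrices of sufficiently large dimension the bound is attained. The sizes below are illustrative.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n, r = 60, 3
A = rng.standard_normal((n, r))
G = A @ A.T                                        # PSD with rank r

for k in range(1, 5):
    bound = comb(r + k - 1, k)                     # k-combinations with repetition from r objects
    print(k, np.linalg.matrix_rank(G**k), bound)   # entry-wise power; generically rank equals bound
```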
As our goal here is to characterize the small eigenvalues, as grows we need both and therefore to grow as well. As a result we will be operating in the regime where . To this end we provide the following corollary.
Corollary C.21.
Under the same conditions and setup as Lemma C.20 with then
Proof.
Corollary C.21 implies, for any , that we can apply Lemma C.19 to upper bound the size of the th eigenvalue. Our goal is to upper bound the decay of the smallest eigenvalue. To this end, and in order to make our bounds as tight as possible, we choose the truncation point ; note that this is the largest truncation which still satisfies . In order to state the next lemma, we introduce the following pieces of notation: with define as
Lemma C.22.
Given a sequence of data points with for all , construct a sequence of row-wise data matrices , , with corresponding to the th row of . The corresponding sequence of gram matrices we denote . Let where and suppose for all sufficiently large that . Let the coefficients with for all be such that 1) the series converges for all and 2) , where satisfies for all and is monotonically decreasing. Consider the sequence of kernel matrices indexed by and defined as
With suppose , then
(112) |
Proof.
Based on Lemma C.19 we provide Theorem C.23, which considers three specific scenarios for the decay of the power series coefficients inspired by Lemma 3.2.
Theorem C.23.
In the same setting, and also under the same assumptions as in Lemma C.22, then
-
1.
if with for all then ,
-
2.
if , then for any ,
-
3.
if , then for any .
Proof.
First, as then
Therefore, to recover the three results listed we now apply Lemma C.22 with . First, to prove 1., under the assumption with then
As a result
To prove 2., under the assumption with then
As a result
for any . Finally, to prove 3., under the assumption with then
Therefore
again for any . ∎
Unfortunately, the curse of dimensionality is clearly present in these results due to the factor in the exponents of . However, although perhaps somewhat loose, we emphasize that these results are certainly far from trivial. In particular, while trivially we know that , even for the weakest result, concerning the power law decay, our result is a clear improvement as long as . For the other settings, i.e., those specified in 2. and 3., our results are significantly stronger.