On the Generalization Power of
Overfitted Two-Layer Neural Tangent Kernel Models
Abstract
In this paper, we study the generalization performance of min $\ell_2$-norm overfitting solutions for the neural tangent kernel (NTK) model of a two-layer neural network with ReLU activation that has no bias term. We show that, depending on the ground-truth function, the test error of overfitted NTK models exhibits characteristics that are different from the “double descent” of other overparameterized linear models with simple Fourier or Gaussian features. Specifically, for a class of learnable functions, we provide a new upper bound on the generalization error that approaches a small limiting value, even when the number of neurons approaches infinity. This limiting value further decreases with the number of training samples. For functions outside this class, we provide a lower bound on the generalization error that does not diminish to zero even when both the number of neurons and the number of training samples are large.
1 Introduction
Recently, there has been significant interest in understanding why overparameterized deep neural networks (DNNs) can still generalize well (Zhang et al., 2017; Advani et al., 2020), which seems to defy the classical understanding of the bias-variance tradeoff in statistical learning (Bishop, 2006; Hastie et al., 2009; Stein, 1956; James & Stein, 1992; LeCun et al., 1991; Tikhonov, 1943). Towards this direction, a recent line of study has focused on overparameterized linear models (Belkin et al., 2018b, 2019; Bartlett et al., 2020; Hastie et al., 2019; Muthukumar et al., 2019; Ju et al., 2020; Mei & Montanari, 2019). For linear models with simple features (e.g., Gaussian features and Fourier features) (Belkin et al., 2018b, 2019; Bartlett et al., 2020; Hastie et al., 2019; Muthukumar et al., 2019; Ju et al., 2020), an interesting “double-descent” phenomenon has been observed. That is, there is a regime where the number of model parameters (or linear features) is larger than the number of samples (and thus overfitting occurs), but the generalization error actually decreases with the number of features. However, linear models with these simple features are still quite different from nonlinear neural networks. Thus, although such results provide some hints as to why overparameterization and overfitting may be harmless, it is still unclear whether similar conclusions apply to neural networks.
In this paper, we are interested in linear models based on the neural tangent kernel (NTK) (Jacot et al., 2018), which can be viewed as a useful intermediate step towards modeling nonlinear neural networks. Essentially, the NTK model is a linear approximation of a neural network when the weights of the neurons do not change much. Indeed, Li & Liang (2018) and Du et al. (2018) have shown that, for a wide and fully-connected two-layer neural network, both the neuron weights and their activation patterns do not change much after gradient descent (GD) training with a sufficiently small step size. As a result, such a shallow and wide neural network is approximately linear in the weights when there are sufficiently many neurons, which suggests the utility of the NTK model.
Despite its linearity, however, characterizing the double descent of such an NTK model remains elusive. The work in Mei & Montanari (2019) also studies the double descent of a linear version of a two-layer neural network. It uses the so-called “random-feature” model, where the bottom-layer weights are random and fixed, and only the top-layer weights are trained. (In comparison, the NTK model for such a two-layer neural network corresponds to training only the bottom-layer weights.) However, the setting there requires the number of neurons, the number of samples, and the data dimension to all grow proportionally to infinity. In contrast, we are interested in the setting where the number of samples is given, and the number of neurons is allowed to be much larger than the number of samples. As a consequence of this different setting, in Mei & Montanari (2019) eventually only linear ground-truth functions can be learned. (Similar settings are also studied in d’Ascoli et al. (2020).) In contrast, we will show that far more complex functions can be learned in our setting. In a related work, Ghorbani et al. (2019) shows that both the random-feature model and the NTK model can approximate highly nonlinear ground-truth functions with a sufficient number of neurons. However, Ghorbani et al. (2019) mainly studies the expressiveness of the models, and therefore does not explain why overfitting solutions can still generalize well. To the best of our knowledge, our work is the first to characterize the double descent of overfitting solutions based on the NTK model.
Specifically, in this paper we study the generalization error of the min $\ell_2$-norm overfitting solution for a linear model based on the NTK of a two-layer neural network with ReLU activation that has no bias. Only the bottom-layer weights are trained. We are interested in min $\ell_2$-norm overfitting solutions because gradient descent (GD) can be shown to converge to such solutions while driving the training error to zero (Zhang et al., 2017) (see also Section 2). Given a class of ground-truth functions (see details in Section 3), which we refer to as “learnable functions,” our main result (Theorem 1) provides an upper bound on the generalization error of the min $\ell_2$-norm overfitting solution for the two-layer NTK model, for a given number of training samples and any finite number of neurons larger than a polynomial function of the number of samples. This upper bound confirms that the generalization error of the overfitting solution indeed exhibits descent in the overparameterized regime as the number of neurons increases. Further, our upper bound can also account for the noise in the training samples.
Our results reveal several important insights. First, we find that the (double) descent of the overfitted two-layer NTK model is drastically different from that of linear models with simple Gaussian or Fourier features (Belkin et al., 2018b, 2019; Bartlett et al., 2020; Hastie et al., 2019; Muthukumar et al., 2019). Specifically, for linear models with simple features, when the number of features increases, the generalization error will eventually grow again and approach the so-called “null risk” (Hastie et al., 2019), which is the error of a trivial model that always predicts zero. In contrast, for the class of learnable functions described earlier, the generalization error of the overfitted NTK model continues to descend as the number of neurons grows to infinity, and approaches a limiting value that depends on the number of samples. Further, when there is no noise, this limiting value decreases to zero as the number of samples increases. This difference is shown in Fig. 1(a). As the number of features increases, the test mean-square-error (MSE) of the min-norm overfitting solutions for Fourier features (blue and red curves) eventually grows back to the null risk (the black dashed line), even though it exhibits a descent at smaller numbers of features. In contrast, the error of the overfitted NTK model continues to descend to a much lower level.
The second important insight is that the aforementioned behavior critically depends on the ground-truth function belonging to the class of “learnable functions.” Further, this class of learnable functions depends on the specific network architecture. For our NTK model (with ReLU activation that has no bias), we precisely characterize this class of learnable functions. Specifically, for ground-truth functions that are outside the class of learnable functions, we show a lower bound on the generalization error that does not diminish to zero for any number of samples and neurons (see Proposition 2 and Section 4). This difference is shown in Fig. 1(b), where we use an almost identical setting as Fig. 1(a), except a different ground-truth function. We can see in Fig. 1(b) that the test error of the overfitted NTK model is always above the null risk and looks very different from that in Fig. 1(a). We note that whether certain functions are learnable or not critically depends on the specific structure of the NTK model, such as the choice of the activation unit. Recently, Satpathi & Srikant (2021) showed that all polynomials can be learned by a two-layer NTK model with ReLU activation that has a bias term, provided that the number of neurons is sufficiently large. (See further discussions in Remark 2. However, Satpathi & Srikant (2021) does not characterize the descent of the generalization error as the number of neurons increases.) This difference in the class of learnable functions between the two settings (ReLU with or without bias) also turns out to be consistent with the difference in the expressiveness of the neural networks. That is, shallow networks with biased ReLU are known to be universal function approximators (Ji et al., 2019), while those without bias can only approximate the sum of linear functions and even functions (Ghorbani et al., 2019).

A closely related result to ours is the work in Arora et al. (2019), which characterizes the generalization performance of wide two-layer neural networks whose bottom-layer weights are trained by gradient descent (GD) to overfit the training samples. In particular, our class of learnable functions almost coincides with that of Arora et al. (2019). This is not surprising because, when the number of neurons is large, NTK becomes a close approximation of such two-layer neural networks. In that sense, the results in Arora et al. (2019) are even more faithful in following the GD dynamics of the original two-layer network. However, the advantage of the NTK model is that it is easier to analyze. In particular, the results in this paper can quantify how the generalization error descends with the number of neurons. In contrast, the results in Arora et al. (2019) provide only a generalization bound that is independent of the number of neurons (provided that it is sufficiently large), but do not quantify the descent behavior as the number of neurons increases. Our numerical results in Fig. 1(a) suggest that, over a wide range of network widths, the descent behavior of the NTK model (the green curve) matches well with that of two-layer neural networks trained by gradient descent (the cyan curve). Thus, we believe that our results also provide guidance for the latter model. The work in Fiat et al. (2019) studies a different neural network architecture with gated ReLU, whose NTK model turns out to be the same as ours. However, similar to Arora et al. (2019), the result in Fiat et al. (2019) does not capture the speed of descent with respect to the number of neurons either. Second, Arora et al. (2019) only provides upper bounds on the generalization error. There is no corresponding lower bound to explain whether ground-truth functions outside a certain class are not learnable. Our result in Proposition 2 provides such a lower bound, and therefore more completely characterizes the class of learnable functions. (See further comparison in Remark 1 of Section 3 and Remark 3 of Section 5.) Another related work, Allen-Zhu et al. (2019), also characterizes the class of learnable functions for two-layer and three-layer networks. However, Allen-Zhu et al. (2019) studies a training method that takes a new sample in every iteration, and thus does not overfit all training data. Finally, our paper studies the generalization of NTK models in the regression setting, which is different from the classification setting that assumes a separability condition, e.g., in Ji & Telgarsky (2019).
2 Problem Setup
We assume a standard data model in which each output equals the ground-truth function evaluated at the input, plus an independent noise term. We are given a set of training samples generated from this model, and we collect the training inputs and outputs into a matrix and a vector, respectively. After training (to be described below), we denote the trained model by the learned function. Then, for any new test input, the test error is the difference between the learned function and the ground-truth function at that input, and the mean squared error (MSE) is the expectation of the squared test error over the input distribution.
For training, consider a fully-connected two-layer neural network with a given number of neurons. Each neuron has a scalar top-layer weight and a vector of bottom-layer weights (see Fig. 2). We collect the bottom-layer weights of all neurons into a single long column vector, and use subscript ranges to denote the corresponding sub-vectors of its elements. We choose ReLU as the activation function for all neurons, and there is no bias term in the ReLU activation function.
Now we are ready to introduce the NTK model (Jacot et al., 2018). We fix the top-layer weights and let the initial bottom-layer weights be randomly chosen. We then train only the bottom-layer weights, and denote the bottom-layer weights after training accordingly. The change of the output after training is then
In the NTK model, one assumes that the change of the bottom-layer weights is very small. As a result, the ReLU activation pattern of most neurons remains the same as at initialization. Thus, the change of the output can be approximated by
where is given by , and is given by
(1)
In the NTK model, we assume that the output of the trained model is exactly given by Eq. (1), i.e.,
(2) |
In other words, the NTK model can be viewed as a linear approximation of the two-layer network when the change of the bottom-layer weights is small.
Define the design matrix whose rows are the NTK feature vectors of the training inputs. Throughout the paper, we will focus on the following min $\ell_2$-norm overfitting solution
Whenever it exists, this solution can be written in closed form as
(3) |
The reason that we are interested in this solution is that gradient descent (GD) or stochastic gradient descent (SGD) for the NTK model in Eq. (2) is known to converge to it (proven in Supplementary Material, Appendix B).
Using Eq. (2) and Eq. (3), the trained model is then
(4) |
In the rest of the paper, we will study the generalization error of Eq. (4).
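To make Eqs. (2)–(4) concrete, the following is a minimal numerical sketch of the NTK feature map and the min $\ell_2$-norm overfitting solution. Everything here is an illustrative assumption rather than the paper's exact setup: the sizes, the $1/\sqrt{p}$ feature scaling, the omission of the $\pm 1$ top-layer weights (which only flip the signs of feature blocks and do not change the predictions), and the choice of ground-truth function.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 50, 1000, 3        # illustrative sizes: samples, neurons, input dimension

def sample_sphere(k, dim, rng):
    """k i.i.d. points uniformly distributed on the unit sphere."""
    v = rng.standard_normal((k, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

X = sample_sphere(n, d, rng)            # training inputs
W0 = sample_sphere(p, d, rng)           # initial bottom-layer weight directions
f_true = lambda Z: Z[:, 0] ** 2         # an even ground-truth function (taken to be learnable; cf. Section 4)
y = f_true(X)                           # noiseless training outputs

def ntk_features(Z, W0):
    """Bias-free ReLU NTK feature map: for each neuron j, the feature block of input x
    is x * 1{x^T w_j0 > 0}; the 1/sqrt(p) scaling is an illustrative simplification."""
    act = (Z @ W0.T > 0).astype(float)                 # activation patterns
    feats = act[:, :, None] * Z[:, None, :]            # shape (num_points, p, d)
    return feats.reshape(Z.shape[0], -1) / np.sqrt(W0.shape[0])

H = ntk_features(X, W0)                                # n x (p*d) design matrix
delta_w = H.T @ np.linalg.solve(H @ H.T, y)            # min l2-norm overfitting solution, Eq. (3)

X_test = sample_sphere(2000, d, rng)
test_mse = np.mean((ntk_features(X_test, W0) @ delta_w - f_true(X_test)) ** 2)
print("training residual:", np.linalg.norm(H @ delta_w - y))
print("test MSE:", test_mse)
```

Increasing the number of samples and neurons in this sketch mimics the descent behavior discussed in the introduction.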
We collect some assumptions. The inputs are assumed to lie on the unit sphere. Without loss of generality, we make the following assumptions: (i) the inputs are i.i.d. uniformly distributed on the unit sphere, and the initial weights are i.i.d. uniformly distributed in all directions; (ii) the model is overparameterized, i.e., the number of parameters is at least the number of training samples; (iii) no two training inputs are parallel to each other. We provide detailed justification of these assumptions in Supplementary Material, Appendix C.
3 Learnable Functions and Generalization Performance
We now show that the generalization performance of the overfitted NTK model in Eq. (4) crucially depends on the ground-truth function: good generalization occurs only when the ground-truth function is “learnable.” Below, we first describe a candidate class of ground-truth functions and explain why they may correspond to the class of “learnable functions.” Then, we give an upper bound on the generalization performance for this class of ground-truth functions. Finally, we give a lower bound on the generalization performance when the ground-truth function is outside of this class.
We first define a set of ground-truth functions.
Definition 1.
.
Note that in Definition 1, the equality means that the two functions are equal almost everywhere. The weight function may be any finite-valued function in the corresponding function space. Further, we also allow it to contain (as components) Dirac δ-functions on the unit sphere. Note that a δ-function has zero value everywhere except at a single point, yet integrates to one. Thus, the weight function may contain any sum of δ-functions and finite-valued functions. (Alternatively, we can also interpret the weight function as a signed measure (Rao & Rao, 1983) on the unit sphere. Then, δ-functions correspond to point masses, and the boundedness condition implies that the corresponding unsigned version of the measure is bounded.)
To see why may correspond to the class of learnable functions, we can first examine what the learned function in Eq. (4) should look like. Recall that . Thus, , where denotes the -th standard basis. Combining Eq. (3) and Eq. (4), we can see that the learned function in Eq. (4) is of the form
(5) |
For all , define , and its cardinality is given by
(6) |
Then, using Eq. (1), we can show . It is not hard to show that
(7) |
where the arrow denotes convergence in probability (see Supplementary Material, Appendix D.5). Thus, if we let
(8) |
then, as the number of neurons tends to infinity, Eq. (3) should approach a function in this class. This explains why it is a candidate class of “learnable functions.” However, note that the above discussion only addresses the expressiveness of the model. It is still unclear whether any function in this class can be learned with low generalization error. Theorem 1 below provides the answer; the short numerical sketch after this paragraph illustrates the convergence in Eq. (7).
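The following minimal sketch illustrates the convergence in Eq. (7): the fraction of neurons whose ReLU is activated by both a fixed test input and a fixed training input stabilizes as the number of neurons grows. The dimension, sizes, and sampling scheme are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

def sample_sphere(k, dim, rng):
    v = rng.standard_normal((k, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

x, x_i = sample_sphere(2, d, rng)        # a test input and a training input
for p in [100, 1_000, 10_000, 100_000]:
    W0 = sample_sphere(p, d, rng)                   # random initial weight directions
    both_active = (W0 @ x > 0) & (W0 @ x_i > 0)     # neurons activated by both inputs
    print(p, both_active.mean())                    # fraction |C|/p stabilizes as p grows
```

The value at which the fraction stabilizes depends only on the angle between the two inputs (see Appendices D.4 and D.5).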
For some , define (recall that is the dimension of )
(9) |
Theorem 1.
Assume a ground-truth function in the class of Definition 1 whose weight function is square-integrable. (This requirement can be relaxed: we show in Supplementary Material, Appendix L that, even when the weight function is a δ-function, we can still obtain a result similar to Eq. (10), but Term 1 will have a slower speed of decay in the number of samples than shown in Eq. (10); Term 4 of Eq. (10) will also be different in that case, but it still goes to zero when the number of samples and the number of neurons are large.) Then, for any admissible choice of the parameters in Eq. (9) and for almost every realization of the training inputs, we must have the following bound, where the probability is over the random initial weights:
(10)
To interpret Theorem 1, we can first focus on the noiseless case, where the noise variance and Term 3 are zero. If we fix the number of samples and let the number of neurons tend to infinity, then Terms 2, 5, and 6 all approach zero. We can then conclude that, in the noiseless and heavily overparameterized setting, the generalization error converges to a small limiting value (Term 1) that depends only on the number of samples. Further, this limiting value (Term 1) converges to zero (as do Terms 4 and 7) as the number of samples grows, i.e., when there are sufficiently many training samples. Finally, Theorem 1 holds even when there is noise.
The parameters of and can be tuned to make Eq. (10) sharper when and are large. For example, as we increase , Term 1 will approach . Although a larger makes Terms 4, 5, and 6 bigger, as long as and are sufficiently large, those terms will still be close to 0. Similarly, if we increase , then will approach the order of . As a result, Term 3 approaches the order of times and the requirement approaches the order of .
Remark 1.
We note that Arora et al. (2019) shows that, for two-layer neural networks whose bottom-layer weights are trained by gradient descent, the generalization error for a sufficiently large number of neurons has the following upper bound, which holds with high probability:
(11) |
where the remaining quantities are as defined in Arora et al. (2019). For a certain class of learnable functions (we will compare them with ours in Section 4), the corresponding quantity is bounded. Thus, the bound in Eq. (11) also decreases as the number of samples grows. The second term in Eq. (11) contains the minimum eigenvalue of the corresponding Gram matrix, which decreases with the number of samples. (Indeed, we show an upper bound on this minimum eigenvalue in Supplementary Material, Appendix G.) Thus, Eq. (11) may decrease a little more slowly, which is consistent with Term 1 in Eq. (10) (when the number of samples is large). Note that one term in Eq. (11) captures how the complexity of the ground-truth function affects the generalization error. Similarly, the norm of the weight function also captures the impact of the complexity of the ground-truth function in Eq. (10). (Although Term 1 in Eq. (10) in its current form does not depend on this norm, it is possible to modify our proof so that the norm of the weight function also enters Term 1.) However, we caution that the GD solution in Arora et al. (2019) is based on the original neural network, which is usually different from our min $\ell_2$-norm solution based on the NTK model (even though they are close for a very large number of neurons). Thus, the two results may not be directly comparable.
Theorem 1 reveals several important insights on the generalization performance when the ground-truth function belongs to the learnable class.
(i) Descent in the overparameterized region: When the number of neurons increases, both sides of Eq. (10) decrease, suggesting that the test error of the overfitted NTK model decreases with the number of neurons. In Fig. 1(a), we choose a ground-truth function in the learnable class (we will explain why this function is learnable later in Section 4). The test MSE of the aforementioned NTK model (green curve) confirms the overall trend of descent in the overparameterized region. (This curve oscillates at the early stage when the number of neurons is small. We suspect this is because, at small widths, the convergence in Eq. (7) has not yet taken effect, and thus the randomness in the initial weights makes the simulation results more volatile.) We note that while Arora et al. (2019) provides a generalization error upper bound for a large number of neurons (i.e., Eq. (11)), the upper bound there does not capture the dependency on the number of neurons and thus does not predict this descent.
More importantly, we note a significant difference between the descent in Theorem 1 and that of min $\ell_2$-norm overfitting solutions for linear models with simple features (Belkin et al., 2018b, 2019; Bartlett et al., 2020; Hastie et al., 2019; Muthukumar et al., 2019; Liao et al., 2020; Jacot et al., 2020). For example, for linear models with Gaussian features, we can obtain (see, e.g., Theorem 2 of Belkin et al. (2019)):
(12) |
where the last quantity denotes the variance of the noise. If we let the number of features go to infinity in Eq. (12), we can see that the MSE quickly approaches the so-called “null risk” (Hastie et al., 2019), i.e., the MSE of a model that always predicts zero. Note that the null risk is at the level of the signal, and thus is quite large. In contrast, as the number of neurons grows, the test error of the NTK model converges to a value determined by the number of samples and the noise level (and is independent of the null risk). This difference is confirmed in Fig. 1(a), where the test MSE for the NTK model (green curve) is much lower than the null risk (the dashed line) at large widths, while both min-norm solutions with Fourier features (the red and blue curves; Ju et al. (2020)) rise to the null risk as the number of features grows. Finally, note that the descent in Theorem 1 requires the number of neurons to increase much faster than the number of samples. Specifically, to keep Term 2 in Eq. (10) small, it suffices to let the number of neurons increase a little faster than a certain polynomial of the number of samples. This is again quite different from the descent shown in Eq. (12) and in other related work using Fourier and Gaussian features (Liao et al., 2020; Jacot et al., 2020), where the number of features only needs to grow proportionally with the number of samples.
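For comparison, the sketch below evaluates a closed-form risk expression of the type in Eq. (12). The specific formula is an assumed stand-in (the standard finite-sample expression for the min $\ell_2$-norm interpolator with isotropic Gaussian features, in the spirit of Belkin et al. (2019) and Hastie et al. (2019)), not a restatement of Eq. (12) itself.

```python
import numpy as np

# Assumed stand-in for Eq. (12): expected risk of the min l2-norm interpolator with
# p isotropic Gaussian features, n samples, signal energy ||beta||^2 and noise
# variance sigma^2, in the overparameterized regime p > n + 1:
#   risk(p) = ||beta||^2 * (1 - n / p) + sigma^2 * n / (p - n - 1)
def gaussian_feature_risk(p, n, signal_energy, sigma2):
    assert p > n + 1
    return signal_energy * (1.0 - n / p) + sigma2 * n / (p - n - 1)

n, signal_energy, sigma2 = 100, 1.0, 0.01
for p in [102, 150, 300, 1_000, 10_000, 1_000_000]:
    print(p, round(gaussian_feature_risk(p, n, signal_energy, sigma2), 4))
# After an initial descent, the risk climbs back toward the null risk ||beta||^2 = 1
# as p grows, unlike the NTK behavior described around Theorem 1.
```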
(ii) Speed of the descent: Since Theorem 1 holds for a finite number of neurons, it also characterizes the speed of descent. In particular, Term 2 depends explicitly on the number of neurons and shrinks as the number of neurons becomes large. Again, such a speed of descent is not captured in Arora et al. (2019). As we show in Fig. 1(a), the test error of the gradient descent solution under the original neural network (cyan curve) is usually quite close to that of the NTK model (green curve). Thus, our result provides useful guidance on how fast the generalization error descends with the number of neurons for such neural networks.

(iii) The effect of noise: Term 3 in Eq. (10) characterizes the impact of the noise, and it neither decreases nor increases with the number of neurons. Notice that this is again very different from Eq. (12), i.e., the result for min $\ell_2$-norm overfitting solutions with simple features, where the noise term vanishes as the number of features goes to infinity. We use Fig. 3(a) to validate this insight. In Fig. 3(a), we fix the number of samples and plot the test MSE of the NTK overfitting solution as the number of neurons increases. We let the noise in each training sample be i.i.d. Gaussian with zero mean, and consider three different variances. The green, red, and blue curves in Fig. 3(a) correspond to these three noise levels, respectively. We can see that all three curves become flat when the number of neurons is very large, which implies that the gap across different noise levels does not shrink as the number of neurons grows, in contrast to Eq. (12).
In Fig. 3(b), we instead fix the number of neurons and increase the number of samples. We plot the test MSE both for the noiseless setting (green curve) and for a noisy setting (red curve). The difference between the two curves (dashed blue curve) then captures the impact of noise, which is related to Term 3 in Eq. (10). Somewhat surprisingly, we find that the dashed blue curve is insensitive to the number of samples, which suggests that Term 3 in Eq. (10) may have room for improvement.
In summary, we have shown that any ground-truth function in the learnable class leads to low generalization error for overfitted NTK models. It is then natural to ask what happens if the ground-truth function is not in this class. To answer this, we consider the closure of the learnable class, and measure the distance from a ground-truth function to this class by the infimum of its distance to every function in the class. (Here we work in the corresponding normed space of functions on the unit sphere. Notice that, although the weight function in Definition 1 may not belong to this space, the resulting learnable function always does; in particular, it is bounded everywhere whenever the weight function has bounded total mass.)
Proposition 2.
(i) For any given number of samples, there exists a function in the closure of the learnable class such that, uniformly over all test inputs, the learned model converges to it as the number of neurons goes to infinity. (ii) Consequently, if the ground-truth function is outside the closure of the learnable class (or, equivalently, has positive distance to it), then the MSE of the learned model (with respect to the ground-truth function) is at least a positive value determined by this distance.
Intuitively, Proposition 2 (proven in Supplementary Material, Appendix J) suggests that, if a ground-truth function is outside the closure of the learnable class, then no matter how large the number of samples is, the test error of an NTK model with infinitely many neurons cannot be small (regardless of whether or not the training samples contain noise). We validate this in Fig. 1(b), where a ground-truth function is chosen outside the learnable class. The test MSE of the NTK overfitting solution (green curve) is above the null risk (dashed black line) and thus is much higher than in Fig. 1(a). We also plot the test MSE of the GD solution of the real neural network (cyan curve), which seems to show the same trend.
4 What Exactly are the Functions in the Learnable Class?
Our expression for the learnable functions in Definition 1 is still in an indirect form, i.e., through an unknown weight function. In Arora et al. (2019), the authors show that, for a similar two-layer network with ReLU activation that has no bias, all finite linear combinations of powers of inner products between the input and fixed directions are learnable by GD (assuming a large number of neurons and a small step size), provided that every power is either one or an even number. In the following, we will show that our learnable functions in Definition 1 have a similar form. Further, we can show that functions given by odd powers of degree at least three of such inner products are not learnable. Our characterization uses an interesting connection to harmonics and filtering on the sphere, which may be of independent interest.
Towards this end, we first note that the integral form in Definition 1 can be viewed as a convolution on the unit sphere. Specifically, for any function in the learnable class, we can rewrite it as
(13)
(14) |
where the filter depends only on the inner product between the input and the rotated direction, and the rotation is an orthogonal matrix chosen from the set of all rotations of the sphere. An important property of the convolution in Eq. (13) is that it corresponds to multiplication in the frequency domain, similar to Fourier coefficients. To define such a transformation to the frequency domain, we use a set of hyper-spherical harmonics (Vilenkin, 1968; Dokmanic & Petrinovic, 2009), which form an orthonormal basis for functions on the sphere. These harmonics are indexed by a degree and an order (all non-negative integers, with finitely many orders for each degree). Any function on the sphere (including even δ-functions (Li & Wong, 2013)) can be decomposed uniquely into these harmonics, with coefficients given by the projections onto the corresponding basis functions. In Eq. (13), let the harmonic coefficients of the filter and of the weight function be given by their respective decompositions. Then, we must have (Dokmanic & Petrinovic, 2009)
(15) |
where the right-hand side involves a normalization constant. Notice that in Eq. (15), the coefficient of the filter enters only through its degree (not its order), which is due to the intrinsic rotational symmetry of such a convolution (Dokmanic & Petrinovic, 2009).
The above decomposition has an interesting “filtering” interpretation as follows. We can regard the kernel as a “filter” or “channel,” and the weight function as a transmitted “signal.” Then, the learnable function in Eq. (13) and Eq. (15) can be regarded as the received signal after the transmitted signal goes through the channel/filter. Therefore, when a coefficient of the filter is non-zero, the corresponding coefficient of the received signal can take any value (because we can arbitrarily choose the transmitted signal). In contrast, if a coefficient of the filter is zero, then the corresponding coefficients of the received signal must also be zero for all orders.
Ideally, if the filter contained all “frequencies,” i.e., all of its coefficients were non-zero, then the received signal could also contain all “frequencies,” which means that the learnable class would contain almost all functions. Unfortunately, this is not true for the filter given in Eq. (14). Specifically, using the harmonics defined in Dokmanic & Petrinovic (2009), the relevant basis functions turn out to have the form
(16) |
where the leading coefficients are positive constants. Note that the expression in Eq. (16) contains either only even powers (if the degree is even) or only odd powers (if the degree is odd). Then, for the filter in Eq. (14), we have the following proposition (proven in Supplementary Material, Appendix K.4). We note that Basri et al. (2019) contains a similar harmonic analysis, where an expression for the filter coefficients is given. However, it is not obvious from the expression in Basri et al. (2019) that the coefficients must be non-zero for all the relevant degrees, which is made clear by Proposition 3 as follows.
Proposition 3.
The harmonic coefficient of the filter is zero for every odd degree greater than or equal to three, and is non-zero for degree one and for every even degree.
We are now ready to characterize which functions are learnable. By the form of Eq. (16), any even power of an inner product between the input and a fixed direction is a linear combination of harmonics of even degree, and any odd power is a linear combination of harmonics of odd degree. By Proposition 3, we thus conclude that any such power whose exponent is one or an even number can be written in the form of Eq. (15) in the frequency domain, and is therefore learnable. In contrast, any power whose exponent is odd and at least three cannot be written in the form of Eq. (15), and is therefore not learnable. Further, the distance from such a function to the learnable class is positive (it equals the norm of the corresponding non-learnable component), so the generalization-error lower bound in Proposition 2 applies. Finally, by Eq. (13), the learnable class is invariant under rotations and finite linear combinations. Therefore, any finite sum of the learnable powers above (with arbitrary directions and coefficients) also belongs to the learnable class.
For the special case of dimension two, the input corresponds to an angle on the circle, and the above-mentioned harmonics become the Fourier series of sines and cosines. We can then obtain similar results on which frequencies are learnable (while the others are not), which explains the learnable and non-learnable ground-truth functions used in Fig. 1. Details can be found in Supplementary Material, Appendix K.5.
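As a numerical sanity check of this picture in dimension two, the sketch below computes the Fourier cosine coefficients of the filter, where the filter is taken (as an assumption consistent with the convolution form above) to be $K(\theta) = \cos\theta \cdot 1\{\cos\theta > 0\}$, i.e., the bias-free ReLU kernel on the circle.

```python
import numpy as np

# Assumed d = 2 filter: K(theta) = cos(theta) * 1{cos(theta) > 0}.
# Its Fourier cosine coefficients indicate which frequencies can pass through
# the convolution in Eq. (13) on the circle.
theta = np.linspace(-np.pi, np.pi, 400_001)
dt = theta[1] - theta[0]
K = np.maximum(np.cos(theta), 0.0)

for l in range(9):
    coeff = np.sum(K * np.cos(l * theta)) * dt / np.pi    # simple numerical quadrature
    print(l, round(coeff, 6))
# Numerically, the coefficients are non-zero for l = 0, l = 1 and even l,
# and vanish for odd l >= 3, consistent with Proposition 3.
```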
Remark 2.
We caution that the above claim on non-learnable functions critically depends on the network architecture. That is, we assume throughout this paper that the ReLU activation has no bias. It is known from an expressiveness point of view that, using ReLU without bias, a shallow network can only approximate the sum of linear functions and even functions (Ghorbani et al., 2019). Thus, it is not surprising that other odd-power (but non-linear) polynomials cannot be learned. In contrast, by adding a bias, a shallow network using ReLU becomes a universal approximator (Ji et al., 2019). The recent work of Satpathi & Srikant (2021) shows that polynomials with all powers can be learned by the corresponding two-layer NTK model. These results are consistent with ours because a ReLU activation with a bias can be equivalently viewed as a bias-free ReLU operating on an input with one extra dimension whose last coordinate is fixed. Even though only a subset of functions is learnable in the higher-dimensional space, when projected onto the lower-dimensional subspace, these functions may already span all functions. For example, an odd power of an inner product in the original dimension can be written as a linear combination of functions that are learnable in the augmented dimension. (See Supplementary Material, Appendix K.6 for details.) It remains an interesting question whether similar differences arise for other network architectures (e.g., with more than two layers).
5 Proof Sketch of Theorem 1
In this section, we sketch the key steps to prove Theorem 1. Starting from Eq. (3), we have
(17) |
For the learned model given in Eq. (4), the error for any test input is then
(18) |
In the classical “bias-variance” analysis of the MSE (Belkin et al., 2018a), the first term on the right-hand side of Eq. (18) contributes to the bias and the second term contributes to the variance. We first quantify the second term (i.e., the variance) in the following proposition.
Proposition 4.
For any , , , if , we must have .
The proof is in Supplementary Material, Appendix F. Proposition 4 implies that, for a fixed number of samples and fixed training inputs, when the number of neurons is large, with high probability the variance will not exceed a certain factor of the noise. In other words, the variance will not go to infinity as the number of neurons grows. The main step in the proof is to lower bound the minimum eigenvalue of the Gram matrix. Note that this is the main place where we use the assumption that the input is uniformly distributed. We expect that our main proof techniques can be generalized to other distributions (with a different eigenvalue estimate), which we leave for future work.
Remark 3.
In the upper bound in Arora et al. (2019) (i.e., Eq. (11)), any noise added to the training outputs contributes to the generalization upper bound in Eq. (11) by a positive term that involves the minimum eigenvalue of the corresponding Gram matrix. Thus, their upper bound may also grow as this minimum eigenvalue decreases. One of the contributions of Proposition 4 is to characterize this minimum eigenvalue.
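To make the role of this minimum eigenvalue concrete, the sketch below estimates the smallest eigenvalue of the (scaled) Gram matrix of bias-free ReLU NTK features for random inputs and random initial weights. The $1/p$ scaling and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 3

def sample_sphere(k, dim, rng):
    v = rng.standard_normal((k, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

X = sample_sphere(n, d, rng)          # training inputs on the unit sphere

def min_eig_gram(X, p, rng):
    """lambda_min of the NTK Gram matrix, scaled by 1/p so that its entries
    (x_i . x_k) * |C_{x_i, x_k}| / p converge as p grows."""
    W0 = sample_sphere(p, X.shape[1], rng)
    act = (X @ W0.T > 0).astype(float)           # n x p activation patterns
    G = (X @ X.T) * (act @ act.T) / p
    return np.linalg.eigvalsh(G)[0]

for p in [100, 1_000, 10_000, 100_000]:
    print(p, min_eig_gram(X, p, rng))
# The smallest eigenvalue stabilizes at a positive value as p grows; up to the
# illustrative scaling, this is the quantity whose lower bound drives Proposition 4.
```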
We now bound the bias part. We first study the class of ground-truth functions that can be learned for a fixed realization of the initial weights. We refer to them as pseudo ground-truth functions, to differentiate them from the set of learnable functions under random initial weights. They are defined with respect to the same weight function, so that we can later extend the analysis to the “real” ground-truth functions in the learnable class when considering the randomness of the initial weights.
Definition 2.
Given the initial weights, for any learnable ground-truth function with its corresponding weight function, define the corresponding pseudo ground-truth as
The reason that this class of functions may be the learnable functions for a fixed initialization is similar to the discussion around Eq. (3) and Eq. (6). Indeed, using the same choice as in Eq. (8), the learned function in Eq. (3) for a fixed initialization is always of the form in Definition 2.
The following proposition gives an upper bound on the generalization performance when the data model is based on the pseudo ground-truth and the NTK model uses exactly the same initial weights.
Proposition 5.
Assume fixed (thus and are also fixed), there is no noise. If the ground-truth function is in Definition 2 and , then for any and , we have .
The proof is in Supplementary Material, Appendix H. Note that the threshold of the probability event and the upper bound coincide with Term 1 and Term 4, respectively, in Eq. (10). Here we sketch the proof of Proposition 5. Based on the definition of the pseudo ground-truth, we can rewrite the vector of training outputs in terms of the underlying weight function evaluated at the initial weights. From Eq. (3) and Eq. (4), we can see that the learned model applies an orthogonal projection onto the row space of the design matrix. Thus, the bias can be written in terms of the distance from a certain vector to the row space of the design matrix. Note that this distance is no greater than the distance between that vector and any particular point in the row space. Thus, in order to obtain an upper bound, we only need to find one point in the row space that makes this distance as small as possible, especially when the number of samples is large. Our proof uses a specific candidate vector constructed from the training inputs. See Supplementary Material, Appendix H for the rest of the details.
The final step is to allow the initial weights to be random. Given any realization of the random initial weights, any learnable ground-truth function can be viewed as the sum of a pseudo ground-truth function (defined with the same weight function) and a difference term. This difference can be viewed as a special form of “noise,” and thus we can use Proposition 4 to quantify its impact. Further, the magnitude of this “noise” decreases with the number of neurons (because of Eq. (7)). Combining this argument with Proposition 5, we can then prove Theorem 1. See Supplementary Material, Appendix I for details. The linear-algebra fact underlying the bias bound is illustrated by the short sketch below.
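The fact used above (the projection residual is bounded by the distance to any particular point of the row space) can be checked with a tiny, purely hypothetical example; the random matrices and the candidate coefficient vector below are arbitrary stand-ins, not the objects used in the actual proof.

```python
import numpy as np

rng = np.random.default_rng(3)
n, q = 20, 500                         # q plays the role of the large feature dimension
H = rng.standard_normal((n, q))        # stand-in design matrix
v = rng.standard_normal(q)             # stand-in for the vector projected in the proof

P = H.T @ np.linalg.solve(H @ H.T, H)  # orthogonal projection onto the row space of H
residual = np.linalg.norm(v - P @ v)   # distance from v to the row space

a = H @ v / q                          # an arbitrary (hypothetical) candidate coefficient vector
upper = np.linalg.norm(v - H.T @ a)    # distance from v to one particular point H^T a
print(residual, "<=", upper)           # always holds; a good choice of `a` makes the bound tight
```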
6 Conclusions
In this paper, we studied the generalization performance of the min $\ell_2$-norm overfitting solution for a two-layer NTK model. We provide a precise characterization of the learnable ground-truth functions for such models, by providing a generalization upper bound for all functions in the learnable class, and a generalization lower bound for all functions outside (the closure of) this class. We show that, while the test error of the overfitted NTK model also exhibits descent in the overparameterized regime, the descent behavior can be quite different from the double descent of linear models with simple features.
There are several interesting directions for future work. First, based on Fig. 3(b), our estimate of the effect of noise could be further improved. Second, it would be interesting to explore whether the methodology can be extended to NTK models for other neural networks, e.g., with different activation functions and with more than two layers.
Acknowledgements
This work is partially supported by an NSF sub-award via Duke University (IIS-1932630), by NSF grants CNS-1717493, CNS-1901057, and CNS-2007231, and by Office of Naval Research under Grant N00014-17-1-241. The authors would like to thank Professor R. Srikant at the University of Illinois at Urbana-Champaign and anonymous reviewers for their valuable comments and suggestions.
References
- Advani et al. (2020) Advani, M. S., Saxe, A. M., and Sompolinsky, H. High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132:428–446, 2020.
- Allen-Zhu et al. (2019) Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in neural information processing systems, pp. 6158–6169, 2019.
- Arora et al. (2019) Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pp. 322–332, 2019.
- Bartlett et al. (2020) Bartlett, P. L., Long, P. M., Lugosi, G., and Tsigler, A. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2020.
- Basri et al. (2019) Basri, R., Jacobs, D., Kasten, Y., and Kritchman, S. The convergence rate of neural networks for learned functions of different frequencies. arXiv preprint arXiv:1906.00425, 2019.
- Belkin et al. (2018a) Belkin, M., Hsu, D., Ma, S., and Mandal, S. Reconciling modern machine learning and the bias-variance trade-off. arXiv preprint arXiv:1812.11118, 2018a.
- Belkin et al. (2018b) Belkin, M., Ma, S., and Mandal, S. To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pp. 541–549, 2018b.
- Belkin et al. (2019) Belkin, M., Hsu, D., and Xu, J. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019.
- Bishop (2006) Bishop, C. M. Pattern recognition and machine learning. Springer, 2006.
- Chaudhry et al. (1997) Chaudhry, M. A., Qadir, A., Rafique, M., and Zubair, S. Extension of euler’s beta function. Journal of computational and applied mathematics, 78(1):19–32, 1997.
- Dokmanic & Petrinovic (2009) Dokmanic, I. and Petrinovic, D. Convolution on the -sphere with application to pdf modeling. IEEE transactions on signal processing, 58(3):1157–1170, 2009.
- Du et al. (2018) Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2018.
- Dutka (1981) Dutka, J. The incomplete beta function—a historical profile. Archive for history of exact sciences, pp. 11–29, 1981.
- d’Ascoli et al. (2020) d’Ascoli, S., Refinetti, M., Biroli, G., and Krzakala, F. Double trouble in double descent: Bias and variance (s) in the lazy regime. In International Conference on Machine Learning, pp. 2280–2290. PMLR, 2020.
- Fiat et al. (2019) Fiat, J., Malach, E., and Shalev-Shwartz, S. Decoupling gating from linearity. arXiv preprint arXiv:1906.05032, 2019.
- Ghorbani et al. (2019) Ghorbani, B., Mei, S., Misiakiewicz, T., and Montanari, A. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019.
- Goemans (2015) Goemans, M. Chernoff bounds, and some applications. URL http://math.mit.edu/goemans/18310S15/chernoff-notes.pdf, 2015.
- Hastie et al. (2009) Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009.
- Hastie et al. (2019) Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
- Hayes (2005) Hayes, T. P. A large-deviation inequality for vector-valued martingales. Combinatorics, Probability and Computing, 2005.
- Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580, 2018.
- Jacot et al. (2020) Jacot, A., Simsek, B., Spadaro, F., Hongler, C., and Gabriel, F. Implicit regularization of random feature models. In International Conference on Machine Learning, pp. 4631–4640. PMLR, 2020.
- James & Stein (1992) James, W. and Stein, C. Estimation with quadratic loss. In Breakthroughs in Statistics, pp. 443–460. Springer, 1992.
- Ji & Telgarsky (2019) Ji, Z. and Telgarsky, M. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. arXiv preprint arXiv:1909.12292, 2019.
- Ji et al. (2019) Ji, Z., Telgarsky, M., and Xian, R. Neural tangent kernels, transportation mappings, and universal approximation. arXiv preprint arXiv:1910.06956, 2019.
- Ju et al. (2020) Ju, P., Lin, X., and Liu, J. Overfitting can be harmless for basis pursuit, but only to a degree. Advances in Neural Information Processing Systems, 33, 2020.
- LeCun et al. (1991) LeCun, Y., Kanter, I., and Solla, S. A. Second order properties of error surfaces: Learning time and generalization. In Advances in Neural Information Processing Systems, pp. 918–924, 1991.
- Li (2011) Li, S. Concise formulas for the area and volume of a hyperspherical cap. Asian Journal of Mathematics and Statistics, 4(1):66–70, 2011.
- Li & Liang (2018) Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in Neural Information Processing Systems, 31:8157–8166, 2018.
- Li & Wong (2013) Li, Y. and Wong, R. Integral and series representations of the dirac delta function. arXiv preprint arXiv:1303.1943, 2013.
- Liao et al. (2020) Liao, Z., Couillet, R., and Mahoney, M. W. A random matrix analysis of random fourier features: beyond the gaussian kernel, a precise phase transition, and the corresponding double descent. arXiv preprint arXiv:2006.05013, 2020.
- Mei & Montanari (2019) Mei, S. and Montanari, A. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
- Muthukumar et al. (2019) Muthukumar, V., Vodrahalli, K., and Sahai, A. Harmless interpolation of noisy data in regression. In 2019 IEEE International Symposium on Information Theory (ISIT), pp. 2299–2303. IEEE, 2019.
- Rao & Rao (1983) Rao, K. B. and Rao, M. B. Theory of charges: a study of finitely additive measures. Academic Press, 1983.
- Satpathi & Srikant (2021) Satpathi, S. and Srikant, R. The dynamics of gradient descent for overparametrized neural networks. In 3rd Annual Learning for Dynamics and Control Conference (L4DC), 2021.
- Stein (1956) Stein, C. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Technical report, Stanford University Stanford United States, 1956.
- Tikhonov (1943) Tikhonov, A. N. On the stability of inverse problems. In Dokl. Akad. Nauk SSSR, volume 39, pp. 195–198, 1943.
- Vilenkin (1968) Vilenkin, N. Y. Special functions and the theory of group representations. Providence: American Mathematical Society, 1968.
- Wainwright (2015) Wainwright, M. Uniform laws of large numbers, 2015. https://www.stat.berkeley.edu/~mjwain/stat210b/Chap4_Uniform_Feb4_2015.pdf, Accessed: Feb. 7, 2021.
- Wendel (1962) Wendel, J. G. A problem in geometric probability. Math. Scand, 11:109–111, 1962.
- Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, 2017.
Appendix A Extra Notations
In addition to the notation introduced in the main body of this paper, we need some extra notation in the following appendices. The distribution of the initial weights is denoted by a probability density on the weight space, and the directions of the initial weights (i.e., the normalized initial weights) follow a probability density on the unit sphere. We also use the Lebesgue measure on Euclidean spaces of the appropriate dimension.
Let denote the binomial distribution, where is the number of trials and is the success probability. Let denote the regularized incomplete beta function Dutka (1981). Let denote the beta function Chaudhry et al. (1997). Specifically,
(19)
(20) |
Define a cap on a unit hyper-sphere as the intersection of with an open ball in centered at with radius , i.e.,
(21) |
Remark 4.
For ease of exposition, we will sometimes neglect the subscript of and use instead, when the quantity that we are estimating only depends on but not . For example, where we are interested in the area of , it only depends on but not . Thus, we write instead.
For any such that , define two halves of the cap as
(22) |
Define the set of directions of the initial weights ’s as
(23) |
Appendix B GD (gradient descent) Converges to Min -Norm Solutions
We assume that the GD algorithm for minimizing the training MSE is given by
(24) |
where denotes the solution in the -th GD iteration (), and denotes the step size of the -th iteration.
Lemma 6.
If exists and GD in Eq. (24) converges to zero-training loss (i.e., ), then .
Proof.
Because the iterates start from zero and each update in Eq. (24) adds a vector in the row space of the design matrix, every iterate remains in that row space. Thus, we can write the limit as a linear combination of the rows. When GD converges to zero training loss, the limit also interpolates the training data. A vector that both lies in the row space and interpolates the training data must equal the min $\ell_2$-norm solution in Eq. (3). Therefore, the GD limit is the min $\ell_2$-norm solution. ∎
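A quick numerical check of Lemma 6 under illustrative choices (a random Gaussian stand-in for the design matrix and a constant step size): gradient descent initialized at zero drives the training loss to zero and converges to the min $\ell_2$-norm solution of Eq. (3).

```python
import numpy as np

rng = np.random.default_rng(4)
n, q = 10, 200                           # underdetermined: more parameters than samples
H = rng.standard_normal((n, q))          # stand-in design matrix
y = rng.standard_normal(n)               # stand-in training targets

w = np.zeros(q)                          # zero initialization, as in the NTK model
lr = 0.5 / np.linalg.norm(H, 2) ** 2     # a safe constant step size
for _ in range(5_000):
    w -= lr * H.T @ (H @ w - y)          # gradient step on (1/2) * ||H w - y||^2

w_min_norm = H.T @ np.linalg.solve(H @ H.T, y)     # closed-form min l2-norm solution
print(np.linalg.norm(H @ w - y))                   # training residual: essentially zero
print(np.linalg.norm(w - w_min_norm))              # distance to the min-norm solution: essentially zero
```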
Appendix C Assumptions and Justifications
Because the inputs can always be normalized by preprocessing, we assume without loss of generality that they lie on the unit sphere. For simplicity, we focus on the simplest situation where the randomness of the inputs and of the initial weights is uniform. Nonetheless, the methods and results of this paper can be readily generalized to other continuous distributions, which we leave for future work. We thus make the following Assumption 1.
Assumption 1.
The inputs are uniformly distributed on the unit sphere. The initial weights are uniform in all directions. In other words, both of the corresponding densities are uniform.
We study the overparameterized and overfitted setting, so in this paper we always assume , i.e., the number of parameters is larger than or equal to the number of training samples . The situation of is relatively trivial, so we only consider the case . We then make Assumption 2.
Assumption 2.
and .
If the inputs are continuous random vectors, then almost surely no two training inputs are equal or antipodal (because the probability that a continuous random variable equals a given value is zero), i.e., no two training inputs are parallel; similar non-degeneracy conditions also hold almost surely. We thus make Assumption 3.
Assumption 3.
for any , and for any .
With these assumptions, the following lemma says that when the number of neurons is large enough, with high probability the design matrix has full row rank (and thus the closed-form solution in Eq. (3) exists).
Lemma 7.
.
Proof.
See Appendix E. ∎
Appendix D Some Useful Supporting Results
Here we collect some useful lemmas that are needed for proofs in other appendices, many of which are estimations of certain quantities that we will use later.
D.1 Quantities related to the area of a cap on a hyper-sphere
The following lemma is introduced by Li (2011), which gives the area of a cap on a hyper-sphere with respect to the colatitude angle.
Lemma 8.
Let denote the colatitude angle of the smaller cap on , then the area (in the measure of ) of this hyper-spherical cap is
The following lemma is another representation of the area of the cap with respect to the radius (recall the definition of in Eq. (21) and Remark 4).
Lemma 9.
If , then we have
Proof.
The area of a cap can be interpreted as the probability of the event that a uniformly-distributed random vector falls into that cap. We have the following lemma.
Lemma 10.
Suppose that a random vector follows uniform distribution in all directions. Given any and for any , we have
Proof.
Notice that is a hyper-spherical cap. Define its colatitude angle as . We have . Thus, we have . By Lemma 8, we then have
Further, by symmetry, we have
Because follows uniform distribution in all directions, we have
∎
D.2 Estimation of certain norms
In this subsection, we bound a certain norm in Lemma 11 and upper bound the norm of the product of two matrices by the product of their norms in Lemma 12. Finally, Lemma 13 states that if two vectors differ a lot, then the sum of their norms cannot be too small.
Lemma 11.
for any .
Proof.
Lemma 12.
If , then . Here , , and could be scalars, vectors, or matrices.
Proof.
This lemma follows directly from the definition of the matrix norm. ∎
Remark 5.
Note that the matrix norm (i.e., spectral norm) of a vector is exactly its vector norm (i.e., Euclidean norm). (To see this, consider a row or column vector; in both cases, the value of the matrix norm equals the Euclidean norm of the vector.) Therefore, when applying Lemma 12, we do not need to worry about whether the objects involved are matrices or vectors.
Lemma 13.
For any , we have
Proof.
It is easy to prove that is convex. Thus, we have
∎
D.3 Estimates of certain tail probabilities
The following is the (restated) Corollary 5 of Goemans (2015).
Lemma 14.
If the random variable follows , then for all , we have
The following lemma is the (restated) Theorem 1.8 of Hayes (2005).
Lemma 15 (Azuma–Hoeffding inequality for random vectors).
Let be i.i.d. random vectors with zero mean (of the same dimension) in a real Euclidean space such that for all . Then, for every ,
In the following lemma, we use Azuma–Hoeffding inequality to upper bound the deviation of the empirical mean value of a bounded random vector from its expectation.
Lemma 16.
Let be i.i.d. random vectors (of the same dimension) in a real Euclidean space such that for all . Then, for any ,
Proof.
Because , we have . By triangle inequality, we have , i.e.,
(26) |
We also have
(27) |
We then have
∎
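A small simulation in the spirit of Lemma 16, with an illustrative choice of bounded i.i.d. random vectors (uniform directions, zero mean, unit norm): the norm of their empirical mean shrinks at roughly the $1/\sqrt{m}$ rate.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 5

def sample_sphere(k, dim, rng):
    v = rng.standard_normal((k, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

for m in [100, 10_000, 1_000_000]:
    Z = sample_sphere(m, d, rng)          # i.i.d., zero mean, norm exactly 1
    print(m, np.linalg.norm(Z.mean(axis=0)), 1 / np.sqrt(m))   # deviation vs 1/sqrt(m) scale
```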
D.4 Calculation of certain integrals
The following lemma calculates the ratio between the intersection area of two hyper-hemispheres and the area of the whole hyper-sphere.
Lemma 17.
(28) |
(Recall that denotes the distribution of the normalized version of on and is assumed to be uniform in all directions.)
Before we give the proof of Lemma 17, we give its geometric explanation.
Geometric explanation of Eq. (28): Since the direction of the random weight is uniform on the sphere, the integral on the left-hand side of Eq. (28) represents the probability that a random point falls into the intersection of the two hyper-hemispheres determined by the two given directions. We can calculate that probability by
(29) |
where the angle between the two vectors is measured in radians; this leads to Eq. (28).


To help readers understand Eq. (29), we give examples for 2D and 3D in Fig. 4 and Fig. 5, respectively. In the 2D case depicted in Fig. 4, the two given directions determine two semi-circles, and their intersection is an arc. Notice that the angle subtended by this arc equals π minus the angle θ between the two directions. Therefore, the ratio of the length of the arc to the perimeter of the circle equals (π − θ)/(2π). Similarly, in the 3D case depicted in Fig. 5, the spherical lune is the intersection of the hemisphere in the first direction and the hemisphere in the second direction. We can see that the area of the spherical lune is still proportional to the angle of the lune. Thus, we still have the result that the area of the spherical lune is a fraction (π − θ)/(2π) of the area of the whole sphere. The proof below, on the other hand, applies to arbitrary dimensions.
Proof.
Due to symmetry, we know that the integral of Eq. (28) only depends on the angle between and . Thus, without loss of generality, we let
where
(30) |
Thus, for any that makes and , it only needs to satisfy
(31) |
We compute the spherical coordinates where and with the convention that
Thus, we have . Similarly, the spherical coordinates for is . Let the spherical coordinates for be . Thus, Eq. (31) is equivalent to
(32)
(33) |
Because (by the convention of spherical coordinates), we have
Thus, for Eq. (32) and Eq. (33), we have
i.e., . We have
(by defining )
The result of this lemma thus follows. ∎
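A Monte Carlo check of the geometric explanation above, under illustrative dimensions: the probability that a uniformly random point on the sphere has positive inner products with both of two fixed directions is $(\pi - \theta)/(2\pi)$, where $\theta$ is the angle between the two directions.

```python
import numpy as np

rng = np.random.default_rng(5)
d, m = 4, 2_000_000                     # illustrative dimension and Monte Carlo sample size

def sample_sphere(k, dim, rng):
    v = rng.standard_normal((k, dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

u, w = sample_sphere(2, d, rng)                          # two fixed directions
theta = np.arccos(np.clip(u @ w, -1.0, 1.0))             # angle between them

Z = sample_sphere(m, d, rng)                             # uniform points on the sphere
empirical = np.mean((Z @ u > 0) & (Z @ w > 0))           # both inner products positive
print(empirical, "vs", (np.pi - theta) / (2 * np.pi))    # matches up to Monte Carlo error
```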
D.5 Limits of when
We introduce some notions given by Wainwright (2015).
Glivenko-Cantelli class. Let be a class of integrable real-valued functions with domain , and let be a collection of i.i.d. samples from some distribution over . Consider the random variable
which measures the maximum deviation (over the class ) between the sample average and the population average . We say that is a Glivenko-Cantelli class for if converges to zero in probability as .
Polynomial discrimination. A class of functions with domain has polynomial discrimination of order if for each positive integer and collection of points in , the set has cardinality upper bounded by
The following lemma is shown in Page 108 of Wainwright (2015).
Lemma 18.
Any bounded function class with polynomial discrimination is Glivenko-Cantelli.
For our case, we care about the following value.
In the language of the Glivenko-Cantelli class, the function class here consists of functions that map each initial weight direction to 0 or 1, where every pair of inputs corresponds to a distinct function in the class. According to Lemma 18, we need to calculate the order of the polynomial discrimination for this class. Towards this end, we need the following lemma, which can be derived from the counting quantity in Wendel (1962) (which is the quantity appearing in the following lemma).
Lemma 19.
Given , the number of different values (i.e., the cardinality) of the set is at most , where
Intuitively, Lemma 19 counts the number of different regions into which hyper-planes through the origin (i.e., the kernels of the inner products with each training input) can cut the sphere, because all directions within one region correspond to the same value of the tuple of activation indicators. For example, in the 2D case, diameters of a circle can cut the whole circle into at most twice as many arcs as there are diameters. Notice that if some training inputs are parallel (so that some diameters overlap), then the total number of different parts can only be smaller. That is why Lemma 19 states that the cardinality is “at most” the given quantity.
The following lemma shows that the cardinality in Lemma 19 is polynomial in .
Lemma 20.
Recall the definition in Lemma 19. For any integer and , we must have .
Proof.
When , because when , we have (the last step uses and ). When , because , we have . In summary, for any integer and , the result always holds. ∎
We can now calculate the order of the polynomial discrimination for the function class . Because
by Lemma 19 and Lemma 20, we know that
(Here means .)
Thus, the function class has polynomial discrimination of the order given above. Notice that all functions in the class are bounded because their outputs can only be 0 or 1. Therefore, by Lemma 18 (i.e., any bounded function class with polynomial discrimination is Glivenko-Cantelli), we know that the class is Glivenko-Cantelli. In other words, we have shown the following lemma.
Lemma 21.
(34) |
Appendix E Proof of Lemma 7 ( has full row-rank with high probability as )
In this section, we prove Lemma 7, i.e., the matrix has full row-rank with high probability when . We first introduce two useful lemmas as follows.
The following lemma states that, given training inputs satisfying Assumption 3, for every index there always exists a vector that is orthogonal to that training input but not orthogonal to any other training input. An intuitive explanation is that, because no two training inputs are parallel (as stated in Assumption 3), the set of vectors that are orthogonal to at least two training inputs is too small. That gives us many options to pick a vector that is orthogonal to one input but not to the others.
Lemma 22.
For all we have
Proof.

The following lemma plays an important role in answering whether has full row-rank. Further, it is also closely related to our estimation on the later in Appendix F.
Lemma 23.
Consider any . For any satisfying , we define
(37) |
If there exist such that
(38) |
then we must have
(39)
(40)
(41) |
(Notice that Eq. (38) implies .)
Remark 6.
We first give an intuitive geometric interpretation of Lemma 23. In Fig. 6, the sphere centered at denotes , the vector denotes , the vector denotes one of other ’s, the vector denotes , which is perpendicular to (i.e., ). The upper half of the cap denotes , the lower half of the cap denotes . The great circle cuts the sphere into two semi-spheres. The semi-sphere in the direction of corresponds to all vectors on the sphere that have positive inner product with (i.e., ), and the semi-sphere in the opposite direction of corresponds to all vectors on the sphere that have negative inner product with (i.e., ). The great circle is similar to the great circle , but is perpendicular to the direction (i.e., ). By choosing the radius of the cap in Eq. (37), we can ensure that all great circles that are perpendicular to other ’s do not pass the cap . In other words, for the two semi-spheres cut by the great circle perpendicular to , , the cap must be contained in one of them. Therefore, vectors on the upper half of the cap and the vectors on the lower half of the cap must have the same sign when calculating the inner product with all ’s, for all .
Now, let us consider the meaning of Eq. (38) in this geometric setup depicted in Fig. 6. The expression means that the direction of is in the upper half of the cap . By the definition of in Eq. (1), we must then have . Similarly, the expression means that the direction of is in the lower half of the cap , and thus . Then, based on the discussions in the previous paragraph, we know that and has the same activation pattern under ReLU for all ’s that , which implies that . These are precisely the conclusions in Eqs. (39)(40)(41).
Later in Appendix F, Lemma 23 plays an important role in estimating . To see this, let denote the -th element of . By Eq. (39), we have . By Eq. (40) and Eq. (41), we have . Combining them, we have . As long as is not zero, then regardless of the values of the other elements of , is a non-zero vector. This implies , which will be useful for estimating in Appendix F.
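The following small simulation sketches the geometry described in Remark 6, under assumed illustrative parameters (the dimension, inputs, and cap radius below are placeholders, not the paper's exact quantities): all directions inside a sufficiently small cap around a direction orthogonal to one input share the same ReLU activation pattern with respect to all other inputs, while the sign of the inner product with that one input can go either way.

```python
# A sketch of the cap geometry in Remark 6, with assumed parameters.
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 8
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

i = 0
z = rng.standard_normal(d)
z -= (z @ X[i]) * X[i]
z /= np.linalg.norm(z)                               # z is orthogonal to x_i but (generically) to no other x_j

# Cap radius chosen smaller than the smallest margin |x_j^T z| over j != i.
eps = 0.5 * np.min(np.abs(np.delete(X @ z, i)))

patterns, signs_i = set(), set()
for _ in range(2000):
    g = rng.standard_normal(d)
    delta = eps * rng.uniform() * g / np.linalg.norm(g)   # perturbation with norm at most eps
    v = z + delta
    patterns.add(tuple(np.sign(np.delete(X @ v, i)) > 0))
    signs_i.add(bool(X[i] @ v > 0))

print("distinct activation patterns over j != i:", len(patterns))   # exactly 1
print("signs of x_i^T v observed inside the cap:", signs_i)         # typically both True and False
```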
Proof.
Now, we are ready to prove Lemma 7.
Proof.
We prove by contradiction. Suppose on the contrary that with some nonzero probability, the design matrix is not full row-rank as . Note that when the design matrix is not full row-rank, there exists a set of indices such that
(44) |
The proof consists of two steps: 1) find an event that happens almost surely when ; 2) prove that this event contradicts Eq. (44).
Step 1:
For all , we define several events as follows.
(Recall the geometric interpretation in Remark 6. The events and mean that there exists in the upper half and in the lower half of the cap , respectively. The event means that there exist in both halves of the cap . Finally, the event occurs when occurs for all , although the vector that falls into the two halves may differ across . As we will show later, whenever the event occurs, the matrix has full row-rank, which is why we are interested in the probability of the event .)
These definitions imply that
(47) | |||
(48) |
Thus, we have
(49) |
For a fixed , recall that by Eq. (46), we have . Because and are two halves of , we have
(50) |
Therefore, we have
(all ’s are independent and Assumption 1)
Notice that is determined only by , and is independent of and . Therefore, we have
(51) |
Plugging Eq. (51) into Eq. (49), we have
Step 2:
To complete the proof, it remains to show that the event contradicts Eq. (44). Towards this end, we assume the event happens. By Eq. (44), we can pick one . Further, by the definition of , there exists such that and . In other words, there must exist such that
By Lemma 23, we have
(52) | |||
(53) |
We now show that restricted to the columns corresponding to and cannot be linearly dependent. Specifically, we have
This contradicts the assumption Eq. (44) that
The result thus follows. ∎
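As a Monte-Carlo sanity check on Step 1 of the proof above (with an assumed dimension and cap angle, since the paper's exact quantities are elided here), the following sketch estimates the probability that m i.i.d. uniformly distributed directions land in both halves of a fixed spherical cap around a direction orthogonal to a given input; since each half is hit by a single uniform direction with a fixed positive probability, this probability approaches one as m grows, mirroring the almost-sure event used in the proof.

```python
# Monte-Carlo sketch with assumed, illustrative parameters.
import numpy as np

rng = np.random.default_rng(2)
d = 4
x_i = np.array([1.0, 0.0, 0.0, 0.0])
z = np.array([0.0, 1.0, 0.0, 0.0])                   # z is orthogonal to x_i
cos_cap = np.cos(0.4)                                # cap of colatitude 0.4 around z

def both_halves_hit(m):
    V = rng.standard_normal((m, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    in_cap = V @ z >= cos_cap
    return np.any(in_cap & (V @ x_i > 0)) and np.any(in_cap & (V @ x_i < 0))

for m in (10, 100, 1000, 10000):
    prob = np.mean([both_halves_hit(m) for _ in range(200)])
    print(f"m = {m:6d}:  estimated P(both halves of the cap are hit) = {prob:.2f}")
```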
Appendix F Proof of Proposition 4 (the upper bound of the variance)
The following lemma shows the relationship between the variance term and .
Lemma 24.
Proof.
We have
(54) |
Thus, we have
∎
The following lemma shows our estimation on .
Lemma 25.
For any , , , if , we must have
Proposition 4 directly follows from Lemma 25 and Lemma 24. (We can see that the key step in the proof of Proposition 4 is to estimate . Lemma 25 shows a lower bound of , which is almost when is large. However, our estimate of this value may be loose. We will show an upper bound, which is ; see Appendix G.)
In the rest of this section, we show how to prove Lemma 25. The following lemma shows that, to estimate , it is equivalent to estimate .
Lemma 26.
Proof.
Perform the singular value decomposition (SVD) of as , where
By properties of singular values, we have
We also have
This equation is indeed the eigenvalue decomposition of , which implies that its eigenvalues are . Thus, we have
∎
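The linear-algebra fact behind Lemma 26 is standard; the following quick check (stated in generic form, since the paper's matrix is not reproduced in this excerpt) verifies that for a full-row-rank matrix the smallest eigenvalue of H Hᵀ equals the square of the smallest singular value of H.

```python
# Generic numerical check of the eigenvalue/singular-value relation used in Lemma 26.
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 50                                  # n rows (samples), p >> n columns (features)
H = rng.standard_normal((n, p))

sigma_min = np.linalg.svd(H, compute_uv=False).min()
lambda_min = np.linalg.eigvalsh(H @ H.T).min()
print(sigma_min**2, lambda_min)               # the two values agree up to numerical error
```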
Therefore, to finish the proof of Proposition 4, it only remains to estimate .
By Lemma 7 and its proof in Appendix E, we have already shown that is unlikely to be zero (i.e., ) when . Here, we use essentially the same method as in Appendix E, but with more precise quantification.
Recall the definitions in Eqs. (21)(22)(23). For any , we choose one
(55) |
(Note that here, unlike in Eq. (45), we do not require for all . This is important as we would like to be independent of for all .) Further, for any , we define
(56) |
Then, we define
(57) | |||
(58) |
(Note that here or may be zero. Later we will show that they can be lower bounded with high probability.) Define
(59) |
Similar to Remark 6, these definitions have their geometric interpretation in Fig. 6. The value denotes the number of distinct pairs (here, “distinct” means that any normalized version of can appear in at most one pair) such that is in the upper half of the cap and is in the lower half of the cap . The quantities , , and can all be used as the radius of the cap . The ratio is proportional to the area of the cap with radius (or, equivalently, to the probability that the normalized falls in the cap ).
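The cap-area statement above can be checked numerically. The sketch below (with an assumed dimension and cap angle) compares the Monte-Carlo probability that a uniformly distributed direction falls in a cap of a given colatitude against the standard closed-form area fraction, expressed through the regularized incomplete beta function; the two quantities agree, which is why the ratio above is proportional to the cap area.

```python
# Monte-Carlo vs. closed-form cap-area fraction, with assumed parameters.
# The closed form used is the standard 0.5 * I_{sin^2(theta)}((d-1)/2, 1/2), for theta <= pi/2.
import numpy as np
from scipy.special import betainc

rng = np.random.default_rng(4)
d, theta, trials = 5, 0.7, 200_000
z = np.zeros(d)
z[0] = 1.0

V = rng.standard_normal((trials, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)
empirical = np.mean(V @ z >= np.cos(theta))
exact = 0.5 * betainc((d - 1) / 2, 0.5, np.sin(theta) ** 2)
print(f"empirical cap probability {empirical:.4f}  vs  exact area fraction {exact:.4f}")
```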
The following lemma gives an estimate of when is given. We put its proof in Appendix F.1.
Lemma 27.
Given , we have
Notice that only depends on and it may even be zero if is zero. However, after we introduce the randomness of , we can show that is lower bounded with high probability. We can then obtain the following lemma. We put its proof in Appendix F.2.
Define
(60) | |||
(61) |
Lemma 28.
For any , we have
Notice that Lemma 28 is already very close to Lemma 25, and we put the final steps of the proof of Lemma 25 in Appendix F.3.
F.1 Proof of Lemma 27
Proof.
Define events as follows.
Those definitions directly imply that
(62) |
Step 1: prove
To show , we only need to prove that implies . To that end, it suffices to show for the vector defined in . Because and , we have
(63) |
By Eq. (56), we can construct pairs for (all ’s are different and all ’s are different), such that
Thus, we have
We then have
Further, we have
(64) |
Clearly, if the event occurs, then . Combining with Eq. (64), we then have . In other words, the event must occur. Hence, we have shown that .
Step 2: estimate the probability of
For all , because is uniformly distributed in all directions, for any fixed , we have
Thus, follows the distribution given . By Lemma 14 (with ), we have
(65) |
Similarly, we have
(66) |
By plugging Eq. (65) and Eq. (66) into Eq. (56) and applying the union bound, we have
By letting and by Eq. (59), we thus have
i.e.,
(67) |
Step 3: estimate the probability of
We have
Thus, we have
The result of this lemma thus follows. ∎
F.2 Proof of Lemma 28
Based on Lemma 27, it remains to estimate , which will then allow us to bound . Towards this end, we need a few lemmas to estimate and .
Lemma 29.
For any , we must have .
Proof.
Consider the function . It suffices to show that for all . We have . In other words, when , and when . Thus, is monotone decreasing when , and is monotone increasing when . Hence, achieves its minimum value at , i.e., for any . The conclusion of this lemma thus follows. ∎
Lemma 30.
For any , we have
Proof.
Lemma 31.
For any , we must have
Proof.
The result directly follows from when . ∎
Lemma 32.
Further, if , we have
Proof.
When , we have . When , we have . When , we have . It is easy to verify that the statement of the lemma holds for , , and . It remains to validate the case of . We first prove the lower bound. For any , we have
(because is monotone increasing with respect to when )
By letting , we thus have
Then, applying Lemma 30, we have
Thus, by Lemma 31, we have
Now we prove the upper bound. For any , we have
By letting , we thus have
Notice that . The result of this lemma thus follows.
∎
Lemma 33.
Recall is defined in Eq. (60). If and , then
Proof.
We have
∎
Lemma 34.
For any , we must have .
Lemma 35.
For any , we must have
and
Proof.
We have
(because )
Thus, we have
which implies
∎
Lemma 36.
For any and for any , we have
We also have
Proof.
Now we are ready to prove Lemma 28.
Recall defined in Eq. (55). For any , we have, for independent of and with distribution ,
(71) |
Since each of the , , is independent of , we have
Or, equivalently,
(72) |
Recall the definition of and in Eqs. (57)(58). Thus, we then have
(73) |
By Lemma 9 and Eq. (61), we have
Thus, we have
(74) |
By Eq. (59) and Eq. (74), we have
Notice that only depends on and is independent of . By Lemma 27, for any that makes , we must have
In other words,
We thus have
(75) |
Thus, we have
The result of this lemma thus follows.
F.3 Proof of Lemma 25
Based on Lemma 28, it only remains to estimate . We start with a lemma.
Lemma 37.
If and , we must have
(76) |
For any given and , we must have
Proof.
We start from
(77) |
Thus, we have
For any given and , we have
∎
Now we are ready to finish our proof of Lemma 25.
We have
Thus, when , we have
Then, we have
Appendix G Upper bound of
By Lemma 26, to get an upper bound of , it is equivalent to get an upper bound of . To that end, we only need to construct a vector and calculate the value of , which automatically becomes an upper bound .
The following lemma shows that, for a given , if two training inputs and are close to each other, then is unlikely to be large.
Lemma 38.
If there exist and such that and , then
Intuitively, Lemma 38 holds because, when and are similar, and (the -th and -th rows of , respectively) will also likely be similar, i.e., is not likely to be large. Thus, we can construct such that is proportional to , which leads to the result of Lemma 38. We put the proof of Lemma 38 in Appendix G.1.
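The following sketch illustrates this intuition numerically. It assumes the rows of H take the standard two-layer ReLU NTK feature form h_j = (1/sqrt(m)) [x_j 1{v_kᵀ x_j > 0}]_{k=1,...,m}; this form is not spelled out in this excerpt, so it is an assumption of the sketch. When two inputs are nearly parallel, the corresponding rows nearly coincide, so the quadratic form along e_i - e_j is small and the smallest eigenvalue of H Hᵀ is small as well.

```python
# Sketch under an ASSUMED NTK feature form (see the lead-in above).
import numpy as np

rng = np.random.default_rng(5)
d, n, m = 4, 8, 2000
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
X[1] = X[0] + 1e-2 * rng.standard_normal(d)            # make x_1 nearly parallel to x_0
X[1] /= np.linalg.norm(X[1])

V = rng.standard_normal((m, d))                        # random (fixed) bottom-layer weights
act = (X @ V.T > 0).astype(float)                      # n-by-m activation patterns
H = (act[:, :, None] * X[:, None, :]).reshape(n, m * d) / np.sqrt(m)

G = H @ H.T
a = np.zeros(n)
a[0], a[1] = 1.0, -1.0
print("quadratic form along e_0 - e_1:", a @ G @ a)                    # small, since h_0 ~ h_1
print("lambda_min(H H^T)             :", np.linalg.eigvalsh(G).min())  # at most half of the value above
```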
The next step is to estimate this difference between and (or, equivalently, the angle between them). We have the following lemma.
Lemma 39.
When , there must exist two different ’s such that the angle between them is at most
Lemma 39 is intuitive because has finite area: when there are many ’s on , at least two of them must be relatively close. We put the proof of Lemma 39 in Appendix G.2.
Finally, we have the following lemma.
Lemma 40.
When , we have
By Lemma 40, we can conclude that when is much larger than , with high probability.
G.1 Proof of Lemma 38
We first prove a useful lemma.
Lemma 41.
For any , we must have . For any , we must have .
Proof.
To prove the first part of the lemma, note that
Thus, the function is monotone increasing with respect to . Therefore, we have
In other words, we have for any .
To prove the second part of the lemma, note that when , we have
Thus, the function is convex with respect to . Because the maximum of a convex function must be attained at an endpoint of the domain interval, we have
Thus, we have for any . ∎
Now we are ready to prove Lemma 38.
Proof.
Throughout the proof, we fix and , and only consider the randomness of . Because is the angle between and , and because of Assumption 1, we have
(78) |
Let , where denotes the -th standard basis vector, . Then, we have
(79) |
Since is fixed and the direction of is uniformly distributed, we have and
Thus, based on the randomness of , when are given, we have
By letting , , in Lemma 14, we then have
(80) | |||
(81) |
Thus, we have
(by Eq. (79))
(by the union bound)
The result of Lemma 38 thus follows. ∎
G.2 Proof of Lemma 39
We first prove a useful lemma. Recall the definition of in Eq. (60).
Lemma 42.
We have
Now we are ready to prove Lemma 39.
Proof.
Recall the definition of in Lemma 39. Draw caps on centered at with the colatitude angle where
(82) |
By Lemma 42 and , we have . Thus, by Lemma 41, we have
(83) |
By Lemma 8, the area of each cap is
Applying Lemma 35 and Eq. (83), we thus have
In other words, we have
By the pigeonhole principle, there exist at least two different caps that overlap, i.e., the angle between their centers is at most . The result of this lemma thus follows by Eq. (82). ∎
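A small simulation (with an assumed dimension) makes the pigeonhole argument concrete: as the number of unit vectors on the sphere grows, the smallest pairwise angle among them must shrink, because many disjoint caps of a fixed angular radius cannot all fit on a sphere of finite area.

```python
# Sketch of the pigeonhole/cap-packing intuition behind Lemma 39, with assumed parameters.
import numpy as np

rng = np.random.default_rng(6)
d = 4
for n in (10, 100, 1000, 2000):
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    cosines = X @ X.T
    np.fill_diagonal(cosines, -1.0)                    # exclude self-pairs
    min_angle = np.arccos(np.clip(cosines.max(), -1.0, 1.0))
    print(f"n = {n:5d}:  smallest pairwise angle ~ {min_angle:.4f} rad")
```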
Appendix H Proof of Proposition 5
We follow the proof sketch in Section 5. Recall the definition of the pseudo ground-truth function in Definition 2, and the corresponding that
(84) |
We first show that the pseudo ground-truth can be written in a linear form.
Lemma 43.
for all .
Proof.
For all , we have
∎
Let . Since and , we know that is an orthogonal projection onto the row-space of . Next, we give an expression for the test error. Note that even though Proposition 4 assumes no noise, we state a more general version with noise below (which will be useful later).
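A quick generic check of the projection property used here (the exact matrix is not reproduced in this excerpt, so the standard formula P = Hᵀ(H Hᵀ)⁻¹H for a full-row-rank H is assumed): P is symmetric, idempotent, and fixes every vector in the row-space of H, so it is the orthogonal projection onto that row-space.

```python
# Generic check of the orthogonal-projection properties, under the assumed standard formula.
import numpy as np

rng = np.random.default_rng(7)
n, p = 5, 40
H = rng.standard_normal((n, p))
P = H.T @ np.linalg.inv(H @ H.T) @ H

print("symmetric:      ", np.allclose(P, P.T))
print("idempotent:     ", np.allclose(P @ P, P))
row_vec = H.T @ rng.standard_normal(n)                 # an arbitrary vector in the row-space
print("fixes row-space:", np.allclose(P @ row_vec, row_vec))
```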
Lemma 44.
If the ground-truth is for all , then we have
Proof.
When there is no noise, Lemma 44 reduces to . As we described in Section 5, has the interpretation of the distance from to the row-space of . We then have the following.
Lemma 45.
For all , we have
Proof.
Recall that . Thus, we have
(85) |
We then have
Therefore, we have
∎
Now we are ready to prove Proposition 5.
Proof.
Because there is no noise, we have . Thus, by Lemma 44, we have
(86) |
We then have, for all ,
(87) |
It only remains to find the vector . Define for as
By Eq. (84), for all , we have
(88) |
Because , we have
Thus, we have
i.e.,
(89) |
We now construct the vector . Define whose -th element is , . Notice that is well-defined because . Then, for all , we have
i.e.,
(90) |
Thus, by Eq. (89) and Lemma 16 (with , , and ), we have
Further, by Eq. (90) and Eq. (88), we have
(91) |
Plugging Eq. (91) into Eq. (87), we thus have
∎
Appendix I Proof of Theorem 1
We first prove a useful lemma.
Lemma 46.
If , then for any , we must have
Proof.
This follows from Fubini’s theorem by changing the order of integration. Specifically, because , we have
Thus, we have
Because when and , we have
Thus, by Fubini’s theorem, we can exchange the order of integration, i.e., we have
∎
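The precise statement of Lemma 46 is elided in this excerpt; the order-of-integration exchange in its proof is of the same flavor as the standard layer-cake identity E[X] = ∫₀^∞ P(X > t) dt for a nonnegative random variable, which the following assumed example checks numerically.

```python
# Numerical check of the layer-cake identity on an assumed example distribution.
import numpy as np

rng = np.random.default_rng(8)
X = rng.exponential(scale=2.0, size=200_000)           # nonnegative samples with mean 2

t_grid = np.arange(0.0, 40.0, 0.1)
tail = np.array([np.mean(X > t) for t in t_grid])
integral_of_tail = np.sum(tail) * 0.1                  # Riemann sum of the tail probability

print("E[X]                    :", X.mean())
print("integral of P(X > t) dt :", integral_of_tail)   # agrees up to Monte-Carlo/discretization error
```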
The following proposition characterizes generalization performance when , i.e., the bias term in Eq. (5).
Proposition 47.
Assume no noise (), a ground truth where , , , , and . Then, for any and for almost every , we must have
Proof.
We split the proof into five steps as follows.
Step 1: use the pseudo ground-truth as an “intermediary”
Recall Definition 2 where we define the pseudo ground-truth . We then define the output of the pseudo ground-truth for training input as
The rest of the proof uses the pseudo ground-truth as an “intermediary” to connect the ground-truth and the model output . Specifically, we have
(92) |
Thus, we have
(93) |
In Eq. (93), we can see that the term corresponds to the test error of the pseudo ground-truth, the term corresponds to the impact of the difference between the pseudo ground-truth and the real ground-truth in the training data, and the term corresponds to the impact of the difference between the pseudo ground-truth and the real ground-truth in the test data. Using the terminology of the bias-variance decomposition, we refer to the term as the “pseudo bias” and the term as the “pseudo variance”.
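The exact form of Eq. (93) is not reproduced here, but the “intermediary” idea it rests on can be illustrated generically with a triangle-inequality-style bound: the error of the learned model against the true ground-truth is controlled by its error against the pseudo ground-truth plus the gap between the pseudo and true ground-truths. All values in the sketch below are made-up stand-ins.

```python
# Generic illustration of bounding a test error through an intermediary function.
import numpy as np

rng = np.random.default_rng(9)
n_test = 1000
f_true = rng.standard_normal(n_test)                       # stand-in for ground-truth values at test points
f_pseudo = f_true + 0.1 * rng.standard_normal(n_test)      # stand-in for the pseudo ground-truth
f_model = f_pseudo + 0.05 * rng.standard_normal(n_test)    # stand-in for the learned model output

direct = np.linalg.norm(f_true - f_model)
via_intermediary = np.linalg.norm(f_true - f_pseudo) + np.linalg.norm(f_pseudo - f_model)
print(f"direct error {direct:.3f}  <=  intermediary bound {via_intermediary:.3f}")
```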
Step 2: estimate term
We have
(94) |
Step 3: estimate term
For all , define
We now show that is bounded with mean equal to , where is defined in Definition 1. Specifically, we have
(95) |
From Definition 2, we have
(96) |
Because the ’s are i.i.d., the ’s are also i.i.d. Thus, we have
(97) |
Further, for any , we have
(98) |
Thus, by Lemma 16, we have
Further, by Eq. (I) and Eq. (I), we have
Because , we have
Because does not change with , we thus have
(99) |
Step 4: estimate term
Our idea is to treat as a special form of noise, and then apply Proposition 4. We first bound the magnitude of this special noise. For , we define
Then, we have
Similar to how we obtained Eq. (99) in Step 3, we have
(100) |
Thus, we have
(101) |
Step 5: estimate
We have
The last step exactly gives the conclusion of this proposition.
∎
Appendix J Proof of Proposition 2 (lower bound for ground-truth functions outside )
We first show what looks like. Define where its -th element is
Notice that
By Lemma 21, we have that converges in probability to as uniformly in . In other words,
(102) |
Let denote the standard basis in . For , define
(103) |
which is a number. Further, define
Further, define the number
and
By Eq. (102), we have
(104) |
For any given , we define as
(105) |
By the definition of the Dirac delta function with peak position at , we can write as an integral
Notice that only depends on the training data and does not change with (and thus is finite). Therefore, we have . It remains to show why converges to in probability. The following lemma shows what looks like.
Lemma 48.
.
Proof.
For any , we have
By Eq. (6), we thus have
(106) |
By the definition of the Dirac delta function, we have
∎
Now we are ready to prove the statement of Proposition 2, i.e., uniformly over all , as (notice that we have already shown that ). To be more specific, we restate that uniform convergence as the following lemma.
Lemma 49.
For any given , as .
Proof.
For any , define two events:
By Lemma 21, there exists a threshold such that for any ,
By Eq. (104), there exists a threshold such that for any ,
Thus, by the union bound, when , we have
(107) |
When happens, we have
(by Lemma 48 and Eq. (105))
Because is fixed when is given, can be arbitrarily small as long as is small enough. The conclusion of this lemma thus follows by Eq. (107). ∎
If the ground-truth function (or equivalently, ), then the MSE of (with respect to the ground-truth function ) is at least (because ). Therefore, we have proved Proposition 2. Below we state an even stronger result than part (ii) of Proposition 2, i.e., it captures not only the MSE of , but also that of for sufficiently large .
Lemma 50.
For any given and , there exists a threshold such that for all , .
Proof.
By Lemma 49, for any , there must exist a threshold such that for all ,
When , we have
Because , we have . Thus, by the triangle inequality, we have . Putting these together, we have
Notice that . The result of this lemma thus follows. ∎
Appendix K Details for Section 4 (hyper-spherical harmonics decomposition on )
K.1 Convolution on
First, we introduce the definition of the convolution on . In Dokmanic & Petrinovic (2009), the convolution on is defined as follows.
where is an orthogonal matrix that denotes a rotation in , chosen from the set of all rotations. In the following, we show Eq. (13). To that end, we have
(108) |
Now, we replace by . Thus, we have
Because is an orthogonal matrix, we have . Therefore, we have . Thus, by Eq. (14), we have
(109) |
By plugging Eq. (109) into Eq. (108), we have
Eq. (13) thus follows.
The following lemma shows the intrinsic symmetry of such a convolution.
Lemma 51.
Let denote any rotation in . If , then .
Proof.
Because , we can find such that
Thus, we have
(because is a rotation, we have ) | |||
The result of this lemma thus follows. ∎
K.2 Hyper-spherical harmonics
We follow the conventions of hyper-spherical harmonics in Dokmanic & Petrinovic (2009). We express in a set of hyper-spherical polar coordinates as follows.
Notice that and . Let . In such coordinates, hyper-spherical harmonics are given by Dokmanic & Petrinovic (2009)
(110) |
where the normalization factor is
and are the Gegenbauer polynomials of degree . These Gegenbauer polynomials can be defined as the coefficients of in the power-series expansion of the following function,
Further, the Gegenbauer polynomials can be computed by a three-term recursive relation,
(111) |
with and .
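The concrete constants in Eq. (111) are not reproduced in this excerpt, so the sketch below uses the textbook three-term recursion for Gegenbauer polynomials, C_0^a(x) = 1, C_1^a(x) = 2ax, and n C_n^a(x) = 2(n + a - 1) x C_{n-1}^a(x) - (n + 2a - 2) C_{n-2}^a(x), and checks it against scipy's implementation; the value a = 1.5 corresponds to (d - 2)/2 for an assumed d = 5.

```python
# Check of the standard Gegenbauer three-term recursion against scipy.
import numpy as np
from scipy.special import eval_gegenbauer

def gegenbauer_recursive(n, a, x):
    c_prev, c_curr = np.ones_like(x), 2.0 * a * x
    if n == 0:
        return c_prev
    for k in range(2, n + 1):
        c_prev, c_curr = c_curr, (2.0 * (k + a - 1) * x * c_curr
                                  - (k + 2.0 * a - 2) * c_prev) / k
    return c_curr

x = np.linspace(-1.0, 1.0, 7)
a = 1.5
for n in range(6):
    assert np.allclose(gegenbauer_recursive(n, a, x), eval_gegenbauer(n, a, x))
print("recursion matches scipy.special.eval_gegenbauer for n = 0, ..., 5")
```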
K.3 Calculate where
Recall that and . By plugging into Eq. (110), we have
(112) |
The following lemma gives an explicit form of Gegenbauer polynomials.
Lemma 52.
(113) |
Proof.
We use mathematical induction. We already know that and , which both satisfy Eq. (113). Suppose that and satisfy Eq. (113), i.e.,
It remains to show that also satisfies Eq. (113). By Eq. (111), it suffices to show that
(114) |
To that end, it suffices to show that the coefficients of are the same for both sides of Eq. (114), for . For the first step, we verify the coefficients of for . We have
coefficients of on the right-hand-side of Eq. (114)
For the second step, we verify the coefficient of for , i.e., the coefficient of . We have
coefficients of on the right-hand-side of Eq. (114)
For the third step, we verify the coefficient of for . We consider two cases: 1) is even, and 2) is odd. When is even, we have , i.e., . Thus, we have
coefficients of on the right-hand-side of Eq. (114)
When is odd, we have and this case has already been verified in the first step.
In conclusion, the coefficients of are the same for both sides of Eq. (114), for . Thus, by mathematical induction, the result of this lemma follows. ∎
K.4 Proof of Proposition 3
Recall that
Notice that . Thus, we have
The function has the following Taylor series expansion:
which converges when . Thus, we have
(116) |
By comparing the terms with even and odd powers of in Eq. (115) and Eq. (116), we immediately see that when , and when . It remains to examine whether or for . We first introduce the following lemma.
Lemma 53.
Let and be two non-negative integers. Define the function
We must have
(117) |
Proof.
We have
Thus, to finish the proof, we only need to consider the case of in Eq. (117). We then prove by mathematical induction on the first parameter of , i.e., in Eq. (117). When , we have
Thus, Eq. (117) holds for all when . Suppose that Eq. (117) holds when . To complete the mathematical induction, it only remains to show that Eq. (117) also holds for all when . By Eq. (111) and Eq. (112), for any , we have
Thus, we have
(118) |
where
It is obvious that and . Applying Eq. (118) multiple times, we have
(119) | |||
(120) | |||
(121) |
(Notice that we have already let , so all in those equations are well-defined.) By plugging Eq. (120) and Eq. (121) into Eq. (119), we have
(122) |
To prove that Eq. (117) holds when for all , we consider two cases, Case 1: , and Case 2: . Notice that by the induction hypothesis, we already know that Eq. (117) holds when for all .
Case 1. When , we have . Thus, by the induction hypothesis for , we have (by ), which implies that the third term of the right-hand-side of Eq. (122) is positive. Further, by the induction hypothesis for , we also know that and (regardless of the value of ), which means that the first and second terms of Eq. (122) are non-negative. Thus, by considering all three terms in Eq. (122) together, we have when .
Case 2. When , we have , , and . Thus, by the induction hypothesis for , we have . Therefore, by Eq. (122), we have .
In summary, Eq. (117) holds when for all . The mathematical induction is completed and the result of this lemma follows. ∎
K.5 A special case: when
When , denotes the unit circle. Therefore, every corresponds to an angle such that . In this situation, the hyper-spherical harmonics reduce to the well-known Fourier basis, i.e., . Thus, we can explicitly calculate all Fourier coefficients of more easily.
Similar to Appendix K.1, we first write down the convolution for , which also takes a simpler form. For any function , we have
Define . We then have
where denotes (continuous) circular convolution. Let and (where ) denote the (complex) Fourier series coefficients for , , and , respectively. Specifically, we have
Thus, we have
(123) |
Now we calculate , i.e., the Fourier decomposition of . We have
It is easy to verify that
Thus, we have
Similarly, we have
Now we consider the situation of . We have
Notice that . Therefore, we have
Similarly, we have
In summary, we have
By Eq. (123), we thus have
In other words, when , functions in can only contain frequencies , and cannot contain other frequencies .
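The frequency statement above can be checked numerically. The sketch below verifies the circular-convolution property behind Eq. (123), namely that the Fourier coefficients of a circular convolution are the (scaled) products of the individual coefficients, and then, using the illustrative assumed kernel slice g(θ) = max(0, cos θ) (an assumption of this sketch, not taken from the excerpt), shows that only a restricted set of low frequencies survives.

```python
# Check of the convolution theorem on the circle, plus the surviving frequencies of an assumed kernel slice.
import numpy as np

N = 512
theta = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
g = np.maximum(0.0, np.cos(theta))                 # assumed illustrative kernel slice
f = np.cos(3 * theta) + 0.5 * np.sin(theta)        # assumed illustrative function

# Discrete circular convolution approximating the integral (f * g)(theta).
conv = np.real(np.fft.ifft(np.fft.fft(f) * np.fft.fft(g))) * (2.0 * np.pi / N)

lhs = np.fft.fft(conv) / N                                        # Fourier coefficients of f * g
rhs = (np.fft.fft(f) / N) * (np.fft.fft(g) / N) * 2.0 * np.pi     # scaled product of coefficients
print("max coefficient mismatch:", np.max(np.abs(lhs - rhs)))

g_hat = np.fft.fft(g) / N
print("frequencies k <= 8 with a nonzero coefficient of g:",
      [k for k in range(9) if abs(g_hat[k]) > 1e-6])              # only k = 0, 1 and even k survive
```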
K.6 Details of Remark 2
As we discussed in Remark 2, a ReLU activation function with bias that operates on can be equivalently viewed as one without bias that operates on , but with the last element of fixed at . Note that by fixing the last element of at a constant , we essentially consider ground-truth functions with a much smaller domain . Correspondingly, define a vector and such that . We claim that for any and for all non-negative integers , a ground-truth function must be learnable. In other words, all polynomials can be learned in the constrained domain . Towards this end, recall that we have already shown that polynomials (of ) to the power of are learnable. Thus, it suffices to prove that polynomials of to the power of can be represented by a finite sum of those to the power of . The idea is to utilize the fact that the binomial expansion of contains for all . Here we give an example of writing as a linear combination of learnable components; other values of can be handled in a similar way. Notice that
(124) |
Thus, for all and , we have
which is a sum of learnable components (corresponding to the polynomials with power of , , , , and , respectively).
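A small symbolic check of the binomial-expansion idea used above (the symbols t and c below are placeholders for the relevant inner product and the fixed constant): expanding (t + c)^l produces every power t^k for k = 0, ..., l with a nonzero coefficient, so any single power can be isolated as a linear combination of such expansions with l = 0, 1, 2, ....

```python
# Symbolic check of the binomial-expansion idea with placeholder symbols.
import sympy as sp

t, c = sp.symbols("t c", positive=True)
l = 4
expansion = sp.expand((t + c) ** l)
print(expansion)
print("coefficients of t^0, ..., t^4:", [expansion.coeff(t, k) for k in range(l + 1)])
```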
Appendix L Discussion when is a -function ()
We now discuss what happens to the conclusion of Theorem 1 if contains a -function, in which case . In Eq. (10) of Theorem 1, only Term 1 and Term 4 (which come from Proposition 5) will be affected when . This is because only Proposition 5 requires in the proof of Theorem 1. To accommodate the situation where contains a -function (), we need a new version of Proposition 5. In other words, we need to know the performance of the overfitted NTK solution in learning the pseudo ground-truth when .
Without loss of generality, we consider the situation that . We have the following proposition.
Proposition 54.
Proposition 54 implies that when is large and is much larger than , the test error between the pseudo ground-truth and the learned result decreases with at the speed . Further, if we let be large, then the decreasing speed with respect to is almost . When , this speed is slower than the one described in Proposition 5 (i.e., Term 1 in Eq. (10) of Theorem 1). When , the decreasing speed with respect to is for both Proposition 5 and Proposition 54. Nonetheless, Proposition 54 implies that the ground-truth function is still learnable even when is a -function (i.e., ), although the test error potentially suffers a slower convergence speed with respect to when is large.

In Fig. 7, we plot the curves of the model error for learning the pseudo ground-truth with respect to when (two blue curves) and when is constant (two red curves). We plot both the case of (two solid curves) and the case of (two dashed curves). By Lemma 44, the model error represents the generalization performance for learning the pseudo ground-truth when there is no noise. In Fig. 7, we can see that the two curves corresponding to have different slopes, while the other two curves corresponding to have similar slopes, which confirms our prediction in the previous paragraph (i.e., when , the test error decays at the same speed regardless of whether contains a -function, but when , the test error decays more slowly when contains a -function).
L.1 Proof of Proposition 54
We first show two useful lemmas.
Lemma 55.
For any , if , then
Proof.
Lemma 56.
Consider where . For any , there must exist such that and
(126) |
We will explain the intuition of Lemma 56 in Remark 8, right after we use the lemma. We put the proof of Lemma 56 in Appendix L.2.
Now we are ready to prove Proposition 54. Recall defined in Eq. (84). By Eq. (1) and , we have
Define
Thus, we have
(127) |
(Graphically, Eq. (L.1) means that a chord is not longer than the corresponding arc.)
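The parenthetical fact above (a chord is never longer than the corresponding arc) can be checked directly: for unit vectors u and v with angle θ between them, the chord length ‖u - v‖ = 2 sin(θ/2) never exceeds the arc length θ. A quick numerical confirmation:

```python
# Generic check that chord length never exceeds arc length for unit vectors.
import numpy as np

rng = np.random.default_rng(10)
d, pairs = 6, 1000
U = rng.standard_normal((pairs, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)
W = rng.standard_normal((pairs, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

chords = np.linalg.norm(U - W, axis=1)
arcs = np.arccos(np.clip(np.sum(U * W, axis=1), -1.0, 1.0))
print("chord <= arc for all sampled pairs:", bool(np.all(chords <= arcs + 1e-12)))
```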
As we discussed in the proof sketch of Proposition 5, we now construct the vector such that is close to . Define whose -th element is
Thus, we have . Therefore, we have
Thus, we have
(128) |
Remark 7.
We give a geometric interpretation of Eq. (128) when using Fig. 4, where denotes and denotes . Then, corresponds to the number of ’s whose direction is in the arc or the arc , and corresponds to the angle . Intuitively, when increases, and get closer, so becomes smaller. At the same time, both the arc and the arc become shorter. Consequently, the value of Eq. (128) decreases as increases. In the rest of the proof, we quantitatively estimate this relationship.
Recall in Eq. (60). Define
(129) |
For any , we define two events:
If both and happen, by Eq. (128), we must then have
Thus, by Lemma 44 and Lemma 45, if and both and happen, then for any , we must have
(130) |
It then only remains to estimate the probability of .
Step 1: Estimate the probability of conditional on .
When happens, we have . By Lemma 56, we can find such that the angle between and is exactly and
(131) |
Remark 8.
We give a geometric interpretation of Eq. (131) (i.e., Lemma 56) when using Fig. 4. Recall from Remark 7 that, if we take as and as , then corresponds to the number of ’s whose direction is in the arc or the arc . If we fix (i.e., ) and increase the angle (corresponding to ), then both the arc and the arc become longer. In other words, if we replace by such that the angle (between and ) increases to the angle (between and ), then and , and thus Eq. (131) follows.
We next estimate the probability that the right-hand-side of Eq. (131) is greater than . By Eq. (6), we have
(132) |
Notice that the angle between and is , and the angle between and is also . By Lemma 17 and Assumption 1, we know that Term A in Eq. (132) follows a Bernoulli distribution with probability . By letting , , in Lemma 14, we have
By Eq. (131), we then have
Step 2: Estimate the probability of .
L.2 Proof of Lemma 56
Proof.
When , the conclusion of this lemma trivially holds because ( and cannot both be positive or both be negative at the same time when ). It remains to consider . Define
Thus, we have and , i.e., and are orthonormal basis vectors on the 2D plane spanned by and . Thus, we can represent as
For any , we construct as
In order to show , we only need to prove that any must be in . For any , , define the angle as the angle between and ’s projected component on (such an angle is well defined as long as is not perpendicular to ; we do not need to worry about those ’s with , because when , we have , and thus those ’s do not belong to any of the sets , , , or in Eq. (126)), i.e.,
By the proof of Lemma 17, we know that if and only if (mod ). Similarly, if and only if (mod ). Because and , we have
Thus, whenever , we must have . Therefore, we conclude that . Using a similar method, we can also show that . The result of this lemma thus follows. ∎