Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds and Benign Overfitting
Abstract
We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class’s Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of [66] for minimum-norm interpolators, and confirms a prediction of [73] for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum ℓ1-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings.
Collaboration on the Theoretical Foundations of Deep Learning (deepfoundations.ai)
1 Introduction
The traditional understanding of machine learning suggests that models with zero training error tend to overfit, and explicit regularization is often necessary to achieve good generalization. Given the empirical success of deep learning models with zero training error [58, 55] and the (re-)discovery of the “double descent” phenomenon [61], however, it has become clear that the textbook U-shaped learning curve is only part of a larger picture: it is possible for an overparameterized model with zero training loss to achieve low population error in a noisy setting. In an effort to understand how interpolation learning occurs, there has been much recent study of the testbed problem of linear regression with Gaussian features [e.g. 66, 72, 67, 63, 69, 73, 70, 71]. Significant progress has been made in this setting, including nearly-matching necessary and sufficient conditions for consistency of the minimal norm interpolator [66].
Despite the fundamental role of uniform convergence in statistical learning theory, most of this line of work has used other techniques to analyze the particular minimal-norm interpolator. ([71] argue that [66]’s proof technique is fundamentally based on uniform convergence of a surrogate predictor; [76] study a closely related setting with a uniform convergence-type argument, but do not establish consistency. We discuss both papers in more detail in Section 4.) Instead of directly analyzing the population error of a learning algorithm, a uniform convergence-type argument would control the worst-case generalization gap over a class of predictors containing the typical outputs of a learning rule. Typically, this is done because for many algorithms – unlike the minimal Euclidean norm interpolator – it is difficult to exactly characterize the learned predictor, but we may be able to say e.g. that its norm is not too large. Since uniform convergence does not tightly depend on a specific algorithm, the resulting analysis can highlight the key properties that lead to good generalization: it can give bounds not only for, say, the minimal-norm interpolator, but also for other interpolators with low norm [e.g. 73], increasing our confidence that low norm – and not some other property the particular minimal-norm interpolator happens to have – is key to generalization. In linear regression, practical training algorithms may not always find the exact minimal Euclidean norm solution, so it is also reassuring that all interpolators with sufficiently low Euclidean norm generalize.
[64], however, raised significant questions about the applicability of typical uniform convergence arguments to certain high-dimensional regimes, similar to those seen in interpolation learning. Following their work, [73, 65, 76, 71] all demonstrated the failure of forms of uniform convergence in various interpolation learning setups. To sidestep these negative results, [73] suggested considering bounds which are uniform only over predictors with zero training error. This weaker notion of uniform convergence has been standard in analyses of “realizable” (noiseless) learning at least since the work of [39, Chapter 6.4] and [40]. [73] demonstrated that at least in one particular noisy setting, such uniform convergence is sufficient for showing consistency of the minimal norm interpolator, even though “non-realizable” uniform convergence arguments (those over predictors regardless of their training error) cannot succeed. It remains unknown, however, whether these types of arguments can apply to more general linear regression problems and more typical asymptotic regimes, particularly showing rates of convergence rather than just consistency.
In this work, we show for the first time that uniform convergence is indeed able to explain benign overfitting in general high-dimensional Gaussian linear regression problems. Similarly to how the standard analysis for learning with Lipschitz losses bounds generalization gaps through Rademacher complexity [e.g. 53], our Theorem 1 (Section 3) establishes a finite-sample high probability bound on the uniform convergence of the error of interpolating predictors in a hypothesis class, in terms of its Gaussian width. This is done through an application of the Gaussian Minimax Theorem; see the proof sketch in Section 7. Combined with an analysis of the norm of the minimal norm interpolator (Theorem 2 in Section 4), our bound recovers known consistency results [66], as well as proving a conjectured upper bound for larger-norm interpolators [73].
In addition, since we do not restrict ourselves to Euclidean norm balls but instead consider interpolators in an arbitrary compact set, our results allow for a wide range of other applications. Our analysis leads to a natural extension of the consistency result and notions of effective rank of [66] for arbitrary norms (Theorem 5 in Section 5). As a demonstration of our general theory, in Section 6 we show novel consistency results for the minimal ℓ1 norm interpolator (basis pursuit) in particular settings, which we believe are the first results of their kind.
2 Problem Formulation
Notation.
We use ‖·‖₂ for the Euclidean norm, and ‖·‖ for a generic norm. We always take a minimum to be +∞ when the set it ranges over is empty, and similarly a maximum to be −∞. We use standard O(·) and o(·) notation, and ≲ for inequality up to an absolute constant. For a positive semidefinite matrix Σ, the Mahalanobis (semi-)norm is ‖v‖_Σ = √(v^⊤ Σ v). For a matrix A and set S, AS denotes the set {As : s ∈ S}.
Data model.
We assume that data is generated as
y = Xw* + ξ,    (1)
where X ∈ ℝ^{n×d} has i.i.d. Gaussian rows x_i ∼ N(0, Σ), w* ∈ ℝ^d is arbitrary, and ξ ∼ N(0, σ²I_n) is Gaussian noise independent of X. Though our proof techniques crucially depend on X being Gaussian, we can easily relax the assumption on the noise to only being sub-Gaussian; we assume Gaussian noise here for simplicity. The empirical and population loss are defined as, respectively,
L̂(w) = (1/n)‖y − Xw‖₂²  and  L(w) = E[(y − ⟨w, x⟩)²],
where in the expectation (x, y) is a fresh sample generated as in (1), independent of the training data. For an arbitrary norm ‖·‖, the minimal norm interpolator is ŵ = argmin_{w : Xw = y} ‖w‖. For the Euclidean norm specifically, the minimal norm interpolator can be written explicitly as ŵ = X^⊤(XX^⊤)^{−1}y = X^†y. If there is more than one minimal norm interpolator, all of our guarantees will hold for any minimizer ŵ.
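As a purely illustrative numerical sketch of this setup (not part of the formal development; the dimensions, noise level, and ground truth below are arbitrary choices), the following Python snippet draws data from (1) with Σ = I, computes the minimal Euclidean norm interpolator via the explicit formula above, and evaluates its empirical and population loss:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 2000, 1.0            # heavily overparameterized: d >> n
w_star = np.zeros(d); w_star[0] = 1.0  # arbitrary ground truth
X = rng.standard_normal((n, d))        # isotropic Gaussian features (Sigma = I)
y = X @ w_star + sigma * rng.standard_normal(n)

# Minimal Euclidean norm interpolator: w_hat = X^T (X X^T)^{-1} y = X^+ y.
w_hat = X.T @ np.linalg.solve(X @ X.T, y)

train_loss = np.mean((y - X @ w_hat) ** 2)             # ~ 0: w_hat interpolates
pop_loss = sigma ** 2 + np.sum((w_hat - w_star) ** 2)  # L(w) = sigma^2 + ||w - w*||_Sigma^2 for Sigma = I
print(train_loss, pop_loss, np.linalg.norm(w_hat))
```

With d ≫ n the training loss is numerically zero, while the population loss depends on the geometry of Σ; the rest of the paper quantifies when it can approach the Bayes risk σ².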
Speculative bound.
[73] studied uniform convergence of low norm interpolators,
sup{ L(w) : Xw = y, ‖w‖₂ ≤ B }.    (2)
Clearly, when B ≥ ‖ŵ‖₂, this quantity upper-bounds the population risk of the minimal Euclidean norm interpolator ŵ. [73] evaluated the asymptotic limit of (2) in one particular setting. But they further speculated that a bound of the following form may hold more generally:
(⋆)
where C matches the typical size of ‖x‖: when ‖x‖ concentrates, C² might be a high-probability bound on ‖x‖², or, for (sub-)Gaussian data as in our case, C² = E‖x‖² = tr(Σ). As discussed by [73], a bound almost of this form is implied by results of [47] for general data distributions, except that approach gives a large leading constant and logarithmic factors. To show consistency in benign overfitting, though, we need (⋆) to hold without even a constant multiplicative factor. [73] ask whether and when this holds, speculating that it might in broad generality.
The goal of this paper is essentially to prove (⋆), at least for Gaussian data, and to use it to show consistency of the minimal norm interpolator. Our main result (Theorem 1) can be thought of as showing (⋆) for Gaussian data with C² = tr(Σ), as well as strengthening and significantly generalizing it. Our result is more general, as it applies to general compact hypothesis sets beyond just the Euclidean ball as in (⋆). But it also falls short of fully proving the speculative (⋆), since our results are limited to Gaussian data, while there is no reason we are aware of to believe that a tight uniform convergence guarantee of the form (⋆) does not hold much more broadly. We leave extending our results beyond the Gaussian case open.
3 Generic Uniform Convergence Guarantee
To state our results, we first need to introduce some key tools.
Definition 1.
The Gaussian width and the radius of a set K ⊆ ℝ^d are
W(K) = E_{g∼N(0,I_d)} [ sup_{v∈K} ⟨g, v⟩ ]  and  rad(K) = sup_{v∈K} ‖v‖₂.
The radius measures the size of a set in the Euclidean norm. The Gaussian width of a set can be interpreted as the number of dimensions that a random projection needs to approximately preserve the norms of points in K [57, 42]. These two complexity measures are connected by Gaussian concentration: Gaussian width is the expected value of the supremum of some Gaussian process, and the radius upper bounds the typical deviation of that supremum from its expected value.
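To make these definitions concrete, here is a small Monte Carlo sketch (illustrative only; the dimension, radius, and trial count are arbitrary) estimating the Gaussian width of an ℓ2 ball and an ℓ1 ball, using the closed forms sup_{‖v‖₂≤B} ⟨g, v⟩ = B‖g‖₂ and sup_{‖v‖₁≤B} ⟨g, v⟩ = B‖g‖∞; both sets have radius B, but their widths scale like B√d and B√(2 log d) respectively:

```python
import numpy as np

def gaussian_width(sup_over_K, d, trials=500, seed=0):
    """Monte Carlo estimate of W(K) = E sup_{v in K} <g, v>, given a routine
    that computes sup_{v in K} <g, v> for a fixed vector g."""
    rng = np.random.default_rng(seed)
    return np.mean([sup_over_K(rng.standard_normal(d)) for _ in range(trials)])

d, B = 400, 1.0
w_l2 = gaussian_width(lambda g: B * np.linalg.norm(g), d)    # l2 ball of radius B
w_l1 = gaussian_width(lambda g: B * np.max(np.abs(g)), d)    # l1 ball of radius B
print(w_l2, B * np.sqrt(d))                 # width ~ B sqrt(d); radius = B
print(w_l1, B * np.sqrt(2 * np.log(d)))     # width ~ B sqrt(2 log d); radius = B
```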
Definition 2 (Covariance splitting).
Given a positive semidefinite matrix Σ, we write Σ = Σ₁ ⊕ Σ₂ if Σ = Σ₁ + Σ₂, each matrix Σᵢ is positive semidefinite, and their spans are orthogonal.
Note that the splitting Σ = Σ₁ ⊕ Σ₂ effectively splits the eigenvectors of Σ into two disjoint parts.
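As an illustration of covariance splitting (a sketch only; the helper split_covariance and the example spectrum are our own arbitrary choices, not objects used in the proofs), one can split Σ along its eigendecomposition into a low-rank “spike” part and the remaining tail:

```python
import numpy as np

def split_covariance(Sigma, k):
    """Split Sigma = Sigma1 + Sigma2, with Sigma1 spanned by the top-k
    eigendirections and Sigma2 by the rest, so their spans are orthogonal."""
    lam, V = np.linalg.eigh(Sigma)            # ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]            # reorder to descending
    Sigma1 = (V[:, :k] * lam[:k]) @ V[:, :k].T
    Sigma2 = (V[:, k:] * lam[k:]) @ V[:, k:].T
    return Sigma1, Sigma2

d = 100
Sigma = np.eye(d); Sigma[0, 0] = 50.0         # one "spike" plus an isotropic tail
S1, S2 = split_covariance(Sigma, k=1)
print(np.allclose(S1 + S2, Sigma), np.allclose(S1 @ S2, np.zeros((d, d))))
```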
We can now state our generic bound. Section 7 sketches the proof; all full proofs are in the appendix.
Theorem 1 (Main generalization bound).
There exists an absolute constant such that the following is true. Under the model assumptions in (1), let be an arbitrary compact set, and take any covariance splitting . Fixing , let . If is large enough that , then the following holds with probability at least :
In our applications, we consider for an arbitrary norm, with based on a high-probability upper bound for . Depending on the application, the rank of will be either constant or , so that . The term generally does not scale with and hence is often negligible. As hinted earlier, we can think of the Gaussian width term and the radius term as bias and variance, respectively. To achieve consistency, we can expect that the Gaussian width should scale as . This agrees with the intuition that we need increasing norm to memorize noise when the model is not realizable. The radius term requires some care in our applications, but can be handled by the covariance splitting technique. As part of the analysis in the following sections, we will rigorously show in many settings that the dominant term in the upper bound is the Gaussian width. In these cases, our upper bound is roughly , which can be viewed as the ratio between the (probabilistic) dimension of our hypothesis class and sample size. We will also analyze how large must be to contain any interpolators, allowing us to find consistency results.
4 Application: Euclidean Norm Ball
It can be easily seen that the Gaussian width of a Euclidean norm ball reduces nicely to the product of the norm of our predictor with the typical norm of x: if K = {w : ‖w‖₂ ≤ B}, then
W(Σ^{1/2}K) = B · E‖Σ^{1/2}g‖₂ ≤ B√(tr(Σ)).    (3)
Therefore, it is plausible that (⋆) holds with C² = tr(Σ). Figure 1 illustrates this generalization bound in two simple examples, motivated by [63, 73]. Indeed, an application of our main theorem proves that this is exactly the case for Gaussian data.
Corollary 1 (Proof of the speculative bound (⋆) for Gaussian data).
Fix any . Under the model assumptions in (1) with and , for some , it holds with probability at least that
(4)
The above bound is clean and simple, but the rate of convergence it yields is sub-optimal. This is because the choice of covariance split used in the proof of Corollary 1 uses no information about the particular structure of Σ. The bound can also be slightly loose in situations where the eigenvalues of Σ decay rapidly, in which case tr(Σ) can be replaced by a smaller quantity. We next state a more precise bound on the generalization error, which requires introducing the following notions of effective rank.
Definition 3 ([66]).
The effective ranks of a covariance matrix Σ ≠ 0 are
r(Σ) = tr(Σ) / ‖Σ‖_op  and  R(Σ) = tr(Σ)² / tr(Σ²).
The rank r(Σ) can roughly be understood as the squared ratio between the Gaussian width and radius in our previous bound. It is related to the concentration of the norm of a Gaussian vector with covariance Σ. In fact, both definitions of effective ranks can be derived by applying Bernstein’s inequality to ‖Σ^{1/2}g‖₂². We will only need r(Σ) in the generalization bound below, but we will show in Theorem 2 that R(Σ) can be used to control the norm of the minimal norm interpolator ŵ.
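For concreteness, the following sketch computes these effective ranks from the spectrum, assuming they take the form r(Σ) = tr(Σ)/‖Σ‖_op and R(Σ) = tr(Σ)²/tr(Σ²) stated above; the example spectra are arbitrary:

```python
import numpy as np

def effective_ranks(Sigma):
    """r(Sigma) = tr(Sigma) / ||Sigma||_op and R(Sigma) = tr(Sigma)^2 / tr(Sigma^2)."""
    lam = np.linalg.eigvalsh(Sigma)
    tr = lam.sum()
    return tr / lam.max(), tr ** 2 / np.sum(lam ** 2)

# Isotropic covariance: both effective ranks equal the ambient dimension.
print(effective_ranks(np.eye(500)))                        # (500.0, 500.0)

# Rapidly decaying spectrum: effective ranks stay O(1) even in high dimension.
print(effective_ranks(np.diag(1.0 / np.arange(1, 501) ** 2)))
```

An isotropic covariance has both effective ranks equal to the ambient dimension, while a rapidly decaying spectrum keeps them bounded; this is exactly the distinction that matters for benign overfitting.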
Corollary 2.
There exists an absolute constant such that the following is true. Under (1), pick any split , fix , and let . If and is large enough that , the following holds with probability at least :
(5) |
In order to use Corollary 1 or 2 to prove consistency, we need a high-probability bound for ‖ŵ‖₂, the norm of the minimal norm interpolator, so that the ball K will be large enough to contain any interpolators. Theorem 2 gives exactly such a bound, showing that if the effective ranks r(Σ₂) and R(Σ₂) are large, then we can construct an interpolator with Euclidean norm nearly √(‖w*‖₂² + σ²n/tr(Σ₂)).
Theorem 2 (Euclidean norm bound; special case of Theorem 4).
Fix any . Under the model assumptions in (1) with any choice of covariance splitting , there exists some such that the following is true. If and the effective ranks are such that and , then with probability at least , it holds that
(6) |
Plugging in this estimate of ‖ŵ‖₂ to our scale-sensitive bound Corollary 2, we obtain a population loss guarantee for the minimal norm interpolator ŵ in terms of effective ranks.
Theorem 3 (Benign overfitting).
Fix any δ ∈ (0, 1). Under the model assumptions in (1) with any covariance splitting Σ = Σ₁ ⊕ Σ₂, let the quantities be as defined in Corollary 2 and Theorem 2. Suppose that n and the effective ranks r(Σ₂) and R(Σ₂) are sufficiently large, as required there. Then, with probability at least 1 − δ,
(7) |
From (7), we can see that to ensure consistency, i.e. L(ŵ) → σ², it is enough that each term on the right-hand side of (7) vanishes. Recalling the definitions of the various quantities and using [66, Lemma 5], we arrive at the following conditions.
Sufficient conditions for consistency of the minimal ℓ2 norm interpolator.
As n → ∞, L(ŵ) converges in probability to σ² if there exists a sequence of covariance splits Σ = Σ₁ ⊕ Σ₂ such that
(8) |
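As a purely numerical illustration of this kind of benign overfitting (a sketch under an assumed “one spike plus a flat high-dimensional tail” covariance; the particular constants are arbitrary and this is not a verification of the formal conditions above), the risk of the minimal ℓ2-norm interpolator approaches the Bayes risk σ² = 1 as n grows while the tail dimension grows much faster:

```python
import numpy as np

def min_l2_norm_risk(n, d_tail, spike=1.0, tail_var=0.2, sigma=1.0, seed=0):
    """Risk of the minimal l2-norm interpolator with one signal coordinate of
    variance `spike` plus d_tail coordinates of total variance `tail_var`."""
    rng = np.random.default_rng(seed)
    scales = np.concatenate(([np.sqrt(spike)], np.full(d_tail, np.sqrt(tail_var / d_tail))))
    w_star = np.zeros(1 + d_tail); w_star[0] = 1.0
    X = rng.standard_normal((n, 1 + d_tail)) * scales             # rows ~ N(0, diag(scales^2))
    y = X @ w_star + sigma * rng.standard_normal(n)
    w_hat = X.T @ np.linalg.solve(X @ X.T, y)                     # minimal l2-norm interpolator
    return sigma ** 2 + np.sum((scales * (w_hat - w_star)) ** 2)  # sigma^2 + ||w_hat - w*||_Sigma^2

for n in [25, 100, 400]:
    print(n, min_l2_norm_risk(n, d_tail=50 * n))   # approaches the Bayes risk sigma^2 = 1
```

Making the flat tail higher-dimensional relative to n drives the excess risk down, mirroring the role of the effective rank conditions above.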
Relationship to [66].
Our set of sufficient conditions above subsumes and is slightly more general than the conditions of [66]. There are two differences:
1. They choose the covariance split specifically to minimize rank(Σ₁) such that the effective rank of Σ₂ is at least a constant multiple of n.
2. Their version of the second condition replaces one of our terms with a larger one.
From the perspective of showing consistency, the first difference is immaterial: if there exists a choice of split that satisfies our conditions, it can be shown that there exists a (possibly different) split of the form they consider which also satisfies them (see Section D.2.1). The second point is a genuine improvement over the consistency result of [66] when Σ has a few very large eigenvalues; this improvement has also been implicitly obtained by [72].
Regarding the rate of convergence, our additional term and the dependence on instead of is slightly worse than that of [66], but our bound can be applied for a smaller value of and is better in the term. We believe these differences are minimal in most cases, and not so important for our primary goal to showcase the power of uniform convergence.
Relationship to [71].
The consistency result of [66] can also be recovered with a uniform convergence-based argument [71]. Instead of considering uniform convergence over a norm ball, [71] applied uniform convergence to a surrogate predictor, and separately showed that the minimal-norm interpolator has risk close to the surrogate (and, indeed, argue that this was fundamentally the proof strategy of [66] all along). Their analysis reveals an interesting connection between realizability and interpolation learning, but it does not highlight that low norm is key to good generalization, nor does it predict the worst-case error for other low-norm interpolators.
Relationship to [65].
[65] recently showed that it is impossible to find a tight excess risk bound that only depends on the learned predictor and sample size. This does not, however, contradict our results. A closer look at the construction of their lower bound reveals that the excess risk bounds being ruled out cannot depend on either the training error or the population noise level σ². The former is crucial: their considered class of bounds cannot incorporate the knowledge that the training error is small, which is the defining property of uniform convergence of interpolators. The latter point is also important; they consider excess risk (L(ŵ) − σ²), but (⋆) and our bounds are about the generalization gap (L(ŵ) − L̂(ŵ)).
Relationship to [76].
[76] give expressions for the asymptotic generalization error of predictors in a norm ball, in a random feature model. Their model is not directly comparable to ours (their labels are effectively a nonlinear function of their non-Gaussian random features), but they similarly showed that uniform convergence of interpolators can lead to a non-vacuous bound. It is unclear, though, whether uniform convergence of low-norm interpolators can yield consistency in their model: they only study sets of the form {w : ‖w‖₂ ≤ B} with B a constant, where we would expect a loss bounded away from the Bayes risk – i.e. would not expect consistency. They also rely on numerical methods to compare their (quite complicated) analytic expressions. It remains possible that the gap between uniform convergence of interpolators and the Bayes risk vanishes in their setting as the relevant parameter approaches 1.
5 General Norm Ball
All the results on the Euclidean setting are special cases of the following results for arbitrary norms. It is worth keeping in mind that the Euclidean norm will still play a role in these analyses, via the Gaussian width and radius appearing in Theorem 1 and the projection appearing in Theorem 4.
Definition 4.
The dual norm of a norm ‖·‖ on ℝ^d is ‖x‖_* = sup_{‖v‖ ≤ 1} ⟨x, v⟩, and the set of all its sub-gradients with respect to x is denoted ∂‖x‖_*.
The Euclidean norm’s dual is itself; for it and many other norms, ∂‖x‖_* is a singleton set for every x ≠ 0. Using these notions, we will now give versions of the effective ranks appropriate for generic norm balls.
Definition 5.
The effective -ranks of a covariance matrix are given as follows. Let , and define . Then
The first effective -rank is the squared ratio of Gaussian width to the radius of the set , where is a norm ball ; the importance of this ratio should be clear from Theorem 1. The Gaussian width is given by , while the radius can be written so that the factors of cancel.
The choice of arises naturally from our bound on in Theorem 4 below. Large effective rank of means the sub-gradient of is small in the norm. This is, in fact, closely related to the existence of low-norm interpolators. First, note that corresponds to the small-eigenvalue components of the covariate vector. For to be a sub-gradient means that moving the weight vector in the direction of is very effective at changing the prediction ; having small norm means that moving in this direction has a very small effect on the population loss . Together, this means the sub-gradient will be a good direction for benignly overfitting the noise.
Remark 1 (Definitions of effective ranks).
Using the Euclidean norm in Definition 5 yields slightly different effective ranks than those of Definition 3, but the difference is small and asymptotically negligible. Both and use in their numerators, while and use . The denominators of and agree; Lemma 9, in Section C.2, shows that . The denominator of uses the sole sub-gradient ; the denominator is then , giving that . Equation 75, in Section C.2, shows that for some that converges to 1 as , as is required by the consistency conditions. The other direction, , also holds with as . It would be possible (and probably even more natural with our analysis) to state the consistency conditions in Section 4 in terms of and ; we used and mainly to allow direct comparison to [66].
Using the general notion of effective ranks, we can find an analogue of Corollary 2 for general norms.
Corollary 3.
There exists an absolute constant such that the following is true. Under the model assumptions in (1), take any covariance splitting and let be an arbitrary norm. Fixing , let . If and is large enough that , then the following holds with probability at least :
(9) |
As in the Euclidean special case, we still need a bound on the norm of the minimal norm interpolator ŵ in order to use this result to study its consistency. This leads us to the second main technical result of this paper, which essentially says that if the effective ranks of Σ₂ are sufficiently large, then there exists an interpolator with appropriately small norm.
Theorem 4 (General norm bound).
There exists an absolute constant such that the following is true. Under the model assumptions in (1) with any covariance split , let be an arbitrary norm, and fix . Denote the orthogonal projection matrix onto the space spanned by as . Let , and let . Suppose that there exist such that with probability at least
(10) |
let . Then if and the effective ranks are large enough that , with probability at least , it holds that
(11) |
For a specific choice of norm , we can verify that is small. In the Euclidean case, for example, this is done by (78) in Section C.2; in our basis pursuit application to come, this is done by (92). The term measures the cost of using a projected version of the subgradient; in most of our applications, we can take . Recalling that , this is obviously true with the Euclidean norm for any . More generally, if is diagonal, then it is natural to only consider covariance splits such that is diagonal. Then, when is the norm (or norms more generally), it can be easily seen that and so .
Straightforwardly combining Corollaries 3 and 4 yields the following theorem, which gives guarantees for minimal-norm interpolators in terms of effective rank conditions. Just as in the Euclidean case, we can extract from this result a simple set of sufficient conditions for consistency of the minimal norm interpolator.
Theorem 5 (Benign overfitting with general norm).
Fix any . Under the model assumptions in (1), let be an arbitrary norm and pick a covariance split . Suppose that and the effective ranks are sufficiently large such that with the same choice of and as in Corollary 3 and Theorem 4. Then, with probability at least ,
(12) |
Sufficient conditions for consistency of the minimal norm interpolator.
As n → ∞, L(ŵ) converges in probability to σ² if there exists a sequence of covariance splits Σ = Σ₁ ⊕ Σ₂ such that
(13) |
and, with the same definition of and as in Theorem 4, it holds for any that
(14) |
As we see, the conditions for a minimal norm interpolator to succeed with a general norm generalize those from the Euclidean setting in a natural way. As discussed above, (14) is always satisfied for the Euclidean norm. The only remaining notable difference from the Euclidean setting is that we have two large effective dimension conditions on Σ₂ instead of a single one; in the Euclidean case, one of these conditions implies the other.
6 Application: ℓ1 Norm Balls for Basis Pursuit
The theory for the minimal ℓ1 norm interpolator – also known as basis pursuit [44] – is much less developed than that of the minimal ℓ2 norm interpolator. In this section, we illustrate the consequences of our general theory for basis pursuit. Full statements and proofs of results in this section are given in Appendix E.
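Although our results concern the statistical behavior of basis pursuit rather than its computation, the following sketch records the standard way the minimal ℓ1 norm interpolator can be computed, by rewriting min ‖w‖₁ subject to Xw = y as a linear program over the split w = u − v with u, v ≥ 0 (scipy's linprog is used here purely for illustration; the problem sizes and sparsity are arbitrary):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Minimum l1-norm interpolator: min ||w||_1 subject to X w = y,
    solved as an LP over the split w = u - v with u, v >= 0."""
    n, d = X.shape
    c = np.ones(2 * d)                 # objective: sum(u) + sum(v) = ||w||_1 at the optimum
    A_eq = np.hstack([X, -X])          # X u - X v = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    u, v = res.x[:d], res.x[d:]
    return u - v

rng = np.random.default_rng(0)
n, d = 30, 300
w_star = np.zeros(d); w_star[:3] = 1.0     # sparse ground truth (arbitrary choice)
X = rng.standard_normal((n, d))
y = X @ w_star + 0.1 * rng.standard_normal(n)
w_bp = basis_pursuit(X, y)
print(np.max(np.abs(X @ w_bp - y)), np.sum(np.abs(w_bp)))   # interpolates y; small l1 norm
```

This LP formulation is only used for illustration; the analysis below does not depend on how the interpolator is computed.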
The dual of the ℓ1 norm is the ℓ∞ norm, and for x ≠ 0 the sub-differential ∂‖x‖_∞ is the convex hull of {sign(x_i) e_i : |x_i| = ‖x‖_∞}. From the definition of sub-gradient, we observe that
(15) |
Furthermore, by convexity we have
(16) |
and so . Therefore, we can use a single notion of effective rank. For simplicity, we denote . Combining this with (13) and the previous discussion of (14), we obtain the following sufficient conditions for consistency of basis pursuit.
Sufficient conditions for consistency of basis pursuit.
As n → ∞, L(ŵ) converges to σ² in probability if there exists a sequence of covariance splits Σ = Σ₁ ⊕ Σ₂ with Σ₁ and Σ₂ diagonal such that
(17) |
Application: Junk features.
We now consider the behavior of basis pursuit in a junk feature model similar to that of [73]. Suppose that Σ = Σ_S ⊕ (λ/d_J) I_{d_J}, where Σ_S is a fixed matrix, λ > 0 is fixed, and d_J is the number of junk features. Quite naturally, we choose the covariance splitting Σ₁ = Σ_S ⊕ 0 and Σ₂ = 0 ⊕ (λ/d_J) I_{d_J}, which has constant rank(Σ₁), so that the first sufficient condition is immediately satisfied.
By standard results on the maximum of independent Gaussian variables [e.g. 60], it is routine to check that
(18) |
Therefore, basis pursuit will be consistent provided that the number of junk features d_J grows quickly enough relative to n. To the best of our knowledge, this is the first time that basis pursuit has been shown to give consistent predictions in any setting with Gaussian covariates and nonzero noise. Although we show consistency, the dimension must be quite high, and the rate of convergence depends on n and d_J.
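The √(2 log d) growth of the expected maximum of independent standard Gaussians, which drives the effective rank estimate above, is easy to check numerically (an illustrative sketch; the dimensions and number of Monte Carlo trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [10 ** 2, 10 ** 3, 10 ** 4, 10 ** 5]:
    g = rng.standard_normal((100, d))
    emp = np.abs(g).max(axis=1).mean()         # Monte Carlo estimate of E max_i |g_i|
    print(d, round(emp, 3), round(np.sqrt(2 * np.log(d)), 3))
```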
Application: Isotropic features.
As in the Euclidean case, we generally do not expect basis pursuit to be consistent when Σ = I_d and d grows proportionally to n. However, we can expect its risk to approach the null risk when d/n → ∞; we will show this using uniform convergence (without covariance splitting).
A direct application of Theorem 5 is not enough because the term diverges, but we can remove the dependence on with a better norm bound. Let be the support of and denote as the matrix formed by selecting the columns of in . The key observation is that we can rewrite our model as , which corresponds to the case when and the Bayes risk is . If we interpolate using only the features in , the minimal norm will be approximately upper bounded by as long as , by Theorem 4. This implies the original model can also be upper bounded by the same quantity with high probability. Plugging the norm estimate in to Corollary 3 yields a risk bound of .
Relationship to previous works.
Both [69] and [74] study the minimal norm interpolator in the isotropic setting. They consider a more realistic scaling where is not large and the target is not the null risk. The best bound of [69], their Corollary 3, is , where is the ground truth sparsity. Note that even when , this bound does not show consistency. Similarly, [74] establish sufficient conditions for , which is nontrivial but also does not show consistency for any ; see also the work of [68] for a similar result in the Euclidean setting. In contrast, the constants in our result are tight enough to show in the isotropic setting when and in the junk feature setting when is bounded.
Like our work, the results of [74] generalize to arbitrary norms; they also consider a larger class of anti-concentrated covariate distributions than just Gaussians, as in the work of [52]. If and (i.e. the model is well-specified and noiseless), their work as well as that of [52] can recover generalization bounds similar to our Corollary 3, but with a large leading constant.
7 Proof Sketches
A key ingredient in our analysis is a celebrated result from Gaussian process theory known as the Gaussian Minmax Theorem (GMT) [41, 56]. Since the seminal work of [45], the GMT has seen numerous applications to problems in statistics, machine learning, and signal processing [e.g. 51, 75, 46, 59]. Most relevant to us is the work of [56], which introduced the Convex Gaussian Minmax Theorem (CGMT) and developed a framework for the precise analysis of regularized linear regression. Here we apply the GMT/CGMT to study uniform convergence and the norm of the minimal norm interpolator.
Proof sketch of Theorem 1.
For simplicity, assume here that there is no covariance splitting: Σ₁ = 0 and Σ₂ = Σ. By a change of variable and introducing the Lagrangian, we can rewrite the generalization gap as
(19) |
where G is a random matrix with i.i.d. standard normal entries. By the GMT (we ignore a compactness issue here, which is handled rigorously by a truncation argument in Section B.1), we can control the upper tail of the max-min problem above (PO) by the auxiliary problem below (AO), with g ∼ N(0, I_n) and h ∼ N(0, I_d):
(20) |
By standard concentration results, we can expect and , so expanding the second constraint in the AO, we obtain . Plugging into (19), we have essentially shown that
(21) |
Applying concentration on the right-hand side concludes the proof sketch. In situations where the supremum does not sharply concentrate around its mean, we can apply the GMT only to the small-variance directions of Σ. This requires a slightly more general version of the GMT, which we prove in Appendix A. We also show that the additional terms contributed by the large-variance components of Σ cancel out due to Wishart concentration. This is reflected in the term involving Σ₁ in our theorem statement.
Proof sketch of Theorem 4.
Since the minimal norm problem is convex-concave, we can apply the CGMT, which provides a useful direction that GMT cannot. By the same argument as above
(22) |
To upper bound the infimum, it suffices to construct a feasible point. Consider a candidate of the form w* + α v, where v is a fixed direction and α a scalar. Plugging this into the constraint, we can solve for α. Rearranging the terms concludes the proof sketch when there is no covariance splitting. The general proof (in Section C.1) is more technical, but follows the same idea.
8 Discussion
In this work, we prove a generic generalization bound in terms of the Gaussian width and radius of a hypothesis class. We also provide a general high probability upper bound for the norm of the minimal norm interpolator. Combining these results, we recover the sufficient conditions from [66] in the ℓ2 case, confirm the conjecture of [73] for Gaussian data, and obtain novel consistency results in the ℓ1 case. Our results provide concrete evidence that uniform convergence is indeed sufficient to explain interpolation learning, at least in some settings.
A future direction of our work is to extend the main results to settings with non-Gaussian features; this has been achieved in other applications of the GMT [59], and indeed we expect that a version of (⋆) likely holds for non-Gaussian data as well. Another interesting problem is to study uniform convergence of low-norm near-interpolators, and to characterize the worst-case population error as the norm and training error both grow. This could lead to a more precise understanding of early stopping, by connecting the optimization path with the regularization path. Finally, it is unknown whether our sufficient conditions for consistency in Section 5 are necessary, and it remains a challenge to apply uniform convergence of interpolators to more complex models such as deep neural networks.
Acknowledgments and Disclosure of Funding
Frederic Koehler was supported in part by E. Mossel’s Vannevar Bush Faculty Fellowship ONR-N00014-20-1-2826. Research supported in part by NSF IIS award 1764032, NSF HDR TRIPODS award 1934843, and the Canada CIFAR AI Chairs program. This work was done as part of the Collaboration on the Theoretical Foundations of Deep Learning (deepfoundations.ai).
References
- [39] Vladimir Vapnik “Estimation of Dependences Based on Empirical Data”, Springer Series in Statistics Springer-Verlag, 1982
- [40] Leslie G. Valiant “A theory of the learnable” In Communications of the ACM 27.11, 1984, pp. 1134–1142
- [41] Yehoram Gordon “Some inequalities for Gaussian processes and applications” In Israel Journal of Mathematics 50.4 Springer, 1985, pp. 265–289
- [42] Yehoram Gordon “On Milman’s inequality and random subspaces which escape through a mesh in ℝⁿ” In Geometric Aspects of Functional Analysis 1317, Lecture Notes in Mathematics Springer, 1988, pp. 84–106
- [43] Michel Ledoux “A heat semigroup approach to concentration on the sphere and on a compact Riemannian manifold” In Geometric & Functional Analysis GAFA 2.2 Springer, 1992, pp. 221–224
- [44] Scott Shaobing Chen, David L. Donoho and Michael A. Saunders “Atomic decomposition by basis pursuit” In SIAM Review 43.1, 2001, pp. 129–159
- [45] Mark Rudelson and Roman Vershynin “On sparse reconstruction from Fourier and Gaussian measurements” In Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences 61.8 Wiley Online Library, 2008, pp. 1025–1045
- [46] Samet Oymak and Babak Hassibi “New null space results and recovery thresholds for matrix rank minimization”, 2010 arXiv:1011.6326
- [47] Nathan Srebro, Karthik Sridharan and Ambuj Tewari “Optimistic Rates for Learning with a Smooth Loss”, 2010 arXiv:1009.3896
- [48] Roman Vershynin “Introduction to the non-asymptotic analysis of random matrices”, 2010 arXiv:1011.3027
- [49] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo and Alan S. Willsky “The convex geometry of linear inverse problems” In Foundations of Computational Mathematics 12.6 Springer, 2012, pp. 805–849
- [50] Gautam Kamath “Bounds on the Expectation of the Maximum of Samples from a Gaussian”, 2013 URL: http://www.gautamkamath.com/writings/gaussian_max.pdf
- [51] Mihailo Stojnic “A framework to characterize performance of LASSO algorithms”, 2013 arXiv:1303.7291
- [52] Shahar Mendelson “Learning without concentration” In Conference on Learning Theory, 2014, pp. 25–39 PMLR
- [53] Shai Shalev-Shwartz and Shai Ben-David “Understanding Machine Learning: From Theory to Algorithms” Cambridge University Press, 2014
- [54] Ramon van Handel “Probability in High Dimension”, Lecture notes, Princeton University, 2014 URL: https://web.math.princeton.edu/~rvan/APC550.pdf
- [55] Behnam Neyshabur, Ryota Tomioka and Nathan Srebro “In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning” In International Conference on Learning Representations – Workshop, 2015 arXiv:1412.6614
- [56] Christos Thrampoulidis, Samet Oymak and Babak Hassibi “Regularized linear regression: A precise analysis of the estimation error” In Conference on Learning Theory, 2015, pp. 1683–1709 PMLR
- [57] Afonso S. Bandeira “Ten Lectures and Forty-Two Open Problems in the Mathematics of Data Science”, Lecture notes, MIT, 2016 URL: https://people.math.ethz.ch/~abandeira/TenLecturesFortyTwoProblems.pdf
- [58] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht and Oriol Vinyals “Understanding deep learning requires rethinking generalization” In International Conference on Learning Representations, 2017 arXiv:1611.03530
- [59] Samet Oymak and Joel A Tropp “Universality laws for randomized dimension reduction, with applications” In Information and Inference: A Journal of the IMA 7.3 Oxford University Press, 2018, pp. 337–446
- [60] Roman Vershynin “High-dimensional probability: An introduction with applications in data science” Cambridge University Press, 2018
- [61] Mikhail Belkin, Daniel Hsu, Siyuan Ma and Soumik Mandal “Reconciling modern machine learning practice and the bias-variance trade-off” In Proceedings of the National Academy of Sciences 116.32, 2019, pp. 15849–15854 arXiv:1812.11118
- [62] Rick Durrett “Probability: theory and examples” Cambridge University Press, 2019
- [63] Trevor Hastie, Andrea Montanari, Saharon Rosset and Ryan J. Tibshirani “Surprises in High-Dimensional Ridgeless Least Squares Interpolation”, 2019 arXiv:1903.08560
- [64] Vaishnavh Nagarajan and J. Zico Kolter “Uniform convergence may be unable to explain generalization in deep learning” In Advances in Neural Information Processing Systems, 2019 arXiv:1902.04742
- [65] Peter L. Bartlett and Philip M. Long “Failures of model-dependent generalization bounds for least-norm interpolation”, 2020 arXiv:2010.08479
- [66] Peter L. Bartlett, Philip M. Long, Gábor Lugosi and Alexander Tsigler “Benign overfitting in linear regression” In Proceedings of the National Academy of Sciences 117.48, 2020, pp. 30063–30070 arXiv:1906.11300
- [67] Mikhail Belkin, Daniel Hsu and Ji Xu “Two models of double descent for weak features” In SIAM Journal on Mathematics of Data Science 2.4, 2020, pp. 1167–1180 arXiv:1903.07571
- [68] Geoffrey Chinot and Matthieu Lerasle “On the robustness of the minimum ℓ2 interpolator”, 2020 arXiv:2003.05838
- [69] Peizhong Ju, Xiaojun Lin and Jia Liu “Overfitting Can Be Harmless for Basis Pursuit: Only to a Degree” In Advances in Neural Information Processing Systems, 2020 arXiv:2002.00492
- [70] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian and Anant Sahai “Harmless interpolation of noisy data in regression” In IEEE Journal on Selected Areas in Information Theory, 2020 arXiv:1903.09139
- [71] Jeffrey Negrea, Gintare Karolina Dziugaite and Daniel M. Roy “In Defense of Uniform Convergence: Generalization via derandomization with an application to interpolating predictors” In International Conference on Machine Learning, 2020 arXiv:1912.04265
- [72] Alexander Tsigler and Peter L. Bartlett “Benign overfitting in ridge regression”, 2020 arXiv:2009.14286
- [73] Lijia Zhou, Danica J. Sutherland and Nathan Srebro “On Uniform Convergence and Low-Norm Interpolation Learning” In Advances in Neural Information Processing Systems, 2020 arXiv:2006.05942
- [74] Geoffrey Chinot, Matthias Löffler and Sara van de Geer “On the robustness of minimum-norm interpolators”, 2021 arXiv:2012.00807
- [75] Zeyu Deng, Abla Kammoun and Christos Thrampoulidis “A model of double descent for high-dimensional binary linear classification” In Information and Inference: A Journal of the IMA, 2021 arXiv:1911.05822
- [76] Zitong Yang, Yu Bai and Song Mei “Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models” In International Conference on Machine Learning, 2021 arXiv:2103.04554
Appendices
In the remainder of the paper, we give self-contained proofs of all results from the main text.
In Appendix A, we introduce some technical results that we will use in our analysis.
In Appendix B, we prove the main generalization bound (Theorem 1) and show its specialization to norm balls (Corollaries 1, 2 and 3).
In Appendix C, we prove upper bounds on the norm of the minimal-norm interpolator for a general norm (Theorem 4), and show applications to the Euclidean case (Theorem 2).
In Appendix D, we show how to combine the previous sets of results to give risk guarantees for the minimal norm interpolators (Theorems 3 and 5). In particular, Section D.2.1 shows the equivalence of conditions for consistency in the Euclidean norm setting.
In Appendix E, we provide full theorem statements and proofs of the results on ℓ1 interpolation (basis pursuit) mentioned in Section 6.
Appendix A Preliminaries
We will first give some general results useful to the rest of the proofs. Most are standard, but a few are variations on existing results.
Concentration of Lipschitz functions.
Recall that a function f : ℝ^d → ℝ is L-Lipschitz with respect to the norm ‖·‖ if it holds for all x, x′ that |f(x) − f(x′)| ≤ L‖x − x′‖. We use the concentration of Lipschitz functions of a Gaussian vector.
Theorem 6 ([54], Theorem 3.25).
If f : ℝ^d → ℝ is L-Lipschitz with respect to the Euclidean norm and Z ∼ N(0, I_d), then for every t ≥ 0,
P( |f(Z) − E f(Z)| ≥ t ) ≤ 2 exp(−t²/(2L²)).    (23)
We also use a similar result for functions of a uniformly spherical vector [see 60, Theorem 5.1.4 and Exercise 5.1.12]; we cite a result with a sharp constant factor from [43].
Theorem 7 (Spherical concentration; [43]).
If is -Lipschitz with respect to the Euclidean norm and where is the unit sphere, is the uniform measure on the sphere, and , then
(24) |
The following lemma, which we will use multiple times, says that a low-dimensional subspace cannot align too well with a random spherically symmetric vector.
Lemma 1.
Suppose that is a fixed subspace of dimension in with , is the orthogonal projection onto , and is a spherically symmetric random vector (i.e. is uniform on the sphere). Then
(25) |
with probability at least . Conditional on this inequality holding, we therefore have uniformly for all that
(26) |
Proof.
This is trivial if , since the left-hand side is at most . Thus assume without loss of generality that . By symmetry, it suffices to fix to be the span of basis vectors and to bound for a uniformly random chosen vector from the unit sphere in . Recall that for any coordinate , we have by symmetry among the coordinates and the fact that almost surely. The function is a -Lipschitz function and , so by Theorem 7 above
with probability at least . Using gives the result. ∎
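A quick numerical check of the phenomenon in Lemma 1 (illustrative only; the dimension, subspace dimension, and trial count are arbitrary): the projection of a uniformly random unit vector onto a fixed k-dimensional subspace has norm concentrating around √(k/d):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, trials = 2000, 20, 500
v = rng.standard_normal((trials, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)     # uniform on the unit sphere
proj_norm = np.linalg.norm(v[:, :k], axis=1)      # ||P v|| for P = projection onto first k coords
print(proj_norm.mean(), np.sqrt(k / d), proj_norm.std())
```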
The concentration of the Euclidean norm of a Gaussian vector follows from Theorem 6; we state it explicitly below.
Lemma 2.
Suppose that Z ∼ N(0, I_d). Then
(27) |
Proof.
First we recall the standard fact [see e.g. 49] that
Because the norm is 1-Lipschitz, it follows from Theorem 6 that
so
Now using that shows
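Numerically, the dimension-free fluctuation behind Lemma 2 is easy to see (an illustrative sketch with arbitrary dimensions and trial counts): the norm of a standard Gaussian vector sits within an O(1) window of √d in every dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [10, 100, 10000]:
    norms = np.linalg.norm(rng.standard_normal((1000, d)), axis=1)
    # mean is close to sqrt(d); the standard deviation stays O(1) in every dimension
    print(d, round(norms.mean(), 2), round(np.sqrt(d), 2), round(norms.std(), 3))
```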
Wishart concentration.
We recall the notation ⪰ for the Loewner order on symmetric matrices: A ⪰ B means that A − B is positive semidefinite. Let s_min(A) denote the minimum singular value of an arbitrary matrix A, and s_max(A) the maximum singular value. Similarly, let λ_min(A) denote the minimum eigenvalue. We use ‖A‖_op to denote the operator norm of a matrix A.
Theorem 8 ([48], Corollary 5.35).
Let n ≥ d. Let A be an n × d random matrix with entries i.i.d. N(0, 1). Then for any t ≥ 0, it holds with probability at least 1 − 2 exp(−t²/2) that
√n − √d − t ≤ s_min(A) ≤ s_max(A) ≤ √n + √d + t.    (28)
Corollary 4.
Suppose are independent with a positive semidefinite matrix, and . Let be the empirical covariance matrix. Then with probability at least ,
(29) |
with .
Proof.
Let be the random matrix with rows so that . By equality in distribution, we can take to have independent entries and write and
By definition of singular values, from Theorem 8 the eigenvalues of are bounded between and . Since , using the inequality for , we have shown that
Rewriting and taking gives the result. ∎
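An illustrative numerical check of Theorem 8 (arbitrary matrix sizes and trial count): over repeated draws, all singular values of an n × d standard Gaussian matrix fall within [√n − √d, √n + √d] up to small fluctuations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, trials = 2000, 100, 20
extremes = []
for _ in range(trials):
    s = np.linalg.svd(rng.standard_normal((n, d)), compute_uv=False)
    extremes.append((s.min(), s.max()))
lo = min(e[0] for e in extremes); hi = max(e[1] for e in extremes)
print(np.sqrt(n) - np.sqrt(d), lo, hi, np.sqrt(n) + np.sqrt(d))
```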
Gaussian Minmax Theorem.
The following result is Theorem 3 of [56], known as the Convex Gaussian Minmax Theorem or CGMT (see also Theorem 1 in the same reference). As explained there, it is a consequence of the main result of [41], known as Gordon’s Theorem or the Gaussian Minmax Theorem. Despite the name, convexity is only required for one of the theorem’s conclusions.
Theorem 9 (Convex Gaussian Minmax Theorem; [56, 41]).
Let be a matrix with i.i.d. entries and suppose and are independent of and each other. Let be compact sets and be an arbitrary continuous function. Define the Primary Optimization (PO) problem
(30) |
and the Auxiliary Optimization (AO) problem
(31) |
Under these assumptions, for any .
Furthermore, if we suppose that are convex sets and is convex in and concave in , then .
In other words, the first conclusion says that high probability lower bounds on the auxiliary optimization imply high probability lower bounds on the primary optimization . Importantly, this direction holds without any convexity assumptions. Under the additional convexity assumptions, the second conclusion gives a similar comparison of high probability upper bounds.
In our analysis, we need a slightly more general statement of the Gaussian Minmax Theorem than Theorem 9: we need the minmax formulation to include additional variables which only affect the deterministic term in the minmax problem. It’s straightforward to prove this result by repeating the argument in [56]; below we give an alternative proof which reduces to Theorem 9, by introducing extremely small extra dimensions to contain the extra variables. Intuitively, this works because the statement of the GMT allows for arbitrary continuous functions , with no dependence on their quantitative smoothness.
Theorem 10 (Variant of GMT).
Let be a matrix with i.i.d. entries and suppose and are independent of and each other. Let be compact sets in and respectively, and let be an arbitrary continuous function. Define the Primary Optimization (PO) problem
(32) |
and the Auxiliary Optimization (AO) problem
(33) |
Under these assumptions, for any .
Proof.
Let be arbitrary and
Define so that if and , then . We also define . The other sets and are defined similarly. It is clear that , and are all still compact in their respective topology, and is continuous for every .
Let be a matrix with i.i.d. entries such that the top left matrix is . Similarly, we define to be a -dimensional Gaussian vector with independent coordinates such that the first coordinates are , and to be a -dimensional Gaussian vector with independent coordinates such that the first coordinates are . Next, consider the augmented PO and AO:
(34) |
It is clear that for a small value of , the augmented problem will be close to the original problem. More precisely, for every and
(35) |
where is deterministic and does not depend on . Similarly, it is routine to check
so by the triangle inequality and Cauchy-Schwarz inequality, we have
(36) |
and
(37) |
From (35), it follows that
(38) |
Similarly, from (36) and (37), it follows that
(39) |
Approximating the original PO and AO by (34) allows us to directly apply the Gaussian Minmax Theorem. For any , we have
where we used (38) in the first inequality, Theorem 9 in the second inequality, and (39) in the last inequality. This holds for arbitrary and taking the limit shows the result, because the CDF is right continuous [62] and the remaining terms go to zero by standard concentration inequalities (Lemmas 2 and 8). ∎
Appendix B Uniform Convergence Bounds
We will now prove the main generalization bound, as well as its special cases in norm balls and specifically Euclidean norm balls.
B.1 General case: Proof of Theorem 1
For convenience, we restate the definition of covariance splitting here: See 2
It follows from our definition that Σ₁Σ₂ = Σ₂Σ₁ = 0. Although our results in Appendix C require this orthogonality condition (in particular, Lemma 8), we note that all of our results here in Appendix B continue to hold as long as Σ = Σ₁ + Σ₂ and both matrices are positive semi-definite. To apply the Gaussian Minimax Theorem, we first formulate the generalization gap as an optimization problem in terms of a random matrix with i.i.d. standard normal entries.
Lemma 3.
Under the model assumptions in (1), let be an arbitrary compact set and . Define the primary optimization problem (PO) as
(40) |
where
(41) |
and are both random matrices with i.i.d. standard normal entries independent of and each other. Then the generalization gap of interpolators is equal in distribution to the sum of the Bayes risk and the PO:
(42) |
Proof.
Recall that and is equivalent to . Observe that
so we can decompose
Lemma 4 (Application of GMT).
In the same setting as Lemma 3, let be Gaussian vectors independent of and each other. With the same definition of , define the auxiliary optimization problem (AO) as
(43) |
Then it holds that
(44) |
and taking expectations we have
(45) |
Proof.
By introducing Lagrange multipliers, we have
By independence, the distribution of remains the same after conditioning on and and the randomness in comes solely from . Since the mapping from to is continuous and is compact, is compact. To apply Theorem 10, we can take , which is clearly continuous. The only challenge is that the domain of is not compact, but we can handle it by a truncation argument. Define
(46) |
and observe that , since the minimum in the definition of ranges over a smaller set. The AO associated with is
(47) |
We observe that the untruncated auxiliary problem from (43) has a completely analogous form:
This is because if then the minimum is achieved at , and if do not satisfy the constraint then taking sends the minimum to . From this formulation, we see that for any since the minimum is taken over a larger set as grows, and is unconstrained in .
The proof that is an exercise in real analysis, which splits into two cases:
-
1.
The auxiliary problem is infeasible. In this case, we know that for all
By compactness of and continuity of the right hand side, there exists (in particular, independent of ) such that
Therefore, we show
Since the second term is bounded and has no dependence on , taking we have as desired (since by definition).
-
2.
The auxiliary problem is feasible. In this case, we can let be an arbitrary maximizer achieving the objective for each by compactness. By compactness again, the sequence at positive integer values of has a subsequential limit , i.e. this point satisfies for some sequence satisfying .
Suppose that does not satisfy the last constraint defining , then by continuity, there exists and a sufficiently small such that for all and , we have
This implies that for sufficiently large , we have
and
so – but this is impossible, since considering any feasible element of we can show that . By contradiction, we find that is feasible for .
By taking in the definition of we have
By continuity, we show that
Since , the limit of exists and equals . We can conclude that because is a monotone decreasing function of .
By our version of the Gaussian Minmax Theorem, Theorem 10,
We introduce the negative signs here because we have originally a max-min problem instead of a min-max problem. This means the comparison theorem gives an upper bound, instead of a lower bound, on the quantity of interest.
Finally, we can conclude
where the last step uses continuity (from above) of probability measure and the fact that monotonically decreases to almost surely. ∎
Recall the definition of Gaussian width and radius:
See 1
It remains to analyze the auxiliary problem, which we do in the following lemma:
Lemma 5.
Let . If is sufficiently large such that , then with probability at least , it holds that
(48) |
Proof.
For notational simplicity, define
By a union bound, the following collection of events, which together we call , occurs with probability at least :
- 1.
-
2.
(Approximate isometry.) By Corollary 4, uniformly over all , it holds that
(51) - 3.
-
4.
(Typical size of .) By the standard Gaussian tail bound , it holds that
(54) because the marginal law of is .
-
5.
(Gaussian process concentration.) By Theorem 6, it holds that
(55) because is a -Lipschitz function of .
From now on, the argument is conditional on the event defined above. By squaring the last constraint in the definition of we see that
where in the last line we used (49) and the AM-GM inequality (). Rearranging gives the inequality
where in the second inequality we used (50) and the AM-GM inequality again, in the third inequality we used (52) and in the last inequality we used (51). This shows
Dividing through by the first two factors on the left hand side and plugging in (53) gives
We can simplify the first term by defining and the second term by observing . Finally, plugging into (43) gives
by the triangle inequality, and (48) follows by (54) and (55). To deduce the explicit bound for , first use that
and similarly to show
If , then
Provided that (which implies that ), we can use the inequality for to show that
and thus we can choose
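For reference, the AM-GM step used twice in the proof above is the following elementary inequality: for all reals $a, b$ and every $t > 0$ (the specific choice of $t$ made in the proof is not reproduced here),
\[
2ab \;\le\; t a^{2} + \frac{b^{2}}{t},
\qquad\text{which follows from}\qquad
\Bigl(\sqrt{t}\,a - \tfrac{b}{\sqrt{t}}\Bigr)^{2} \;\ge\; 0 .
\]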
We are finally ready to prove our main generalization bound:
See 1
Proof.
Remark 2 (Translation-invariant version).
Our generalization guarantee is stated in terms of and , which are not translation-invariant. However, Theorem 1 can be made translation-invariant, e.g. by replacing with for an arbitrary : simply recenter the problem before applying Theorem 1, i.e. subtract from both sides of the interpolation constraint .
We also note that Theorem 1 does not require , so the true function need not lie in the class even when there is no noise ().
B.2 Specialization to General Norm Balls
For convenience, we restate the general definition of effective rank.
See 5
Applying Theorem 1 to an arbitrary norm ball yields the following:
See 3
B.3 Special Case: Euclidean Norm
In the Euclidean setting, the effective ranks are defined as follows:
See 3
Due to the small difference between and , our generalization bound below requires a slightly different argument (see the discussion in Section 5), but the overall proof strategy is exactly the same.
See 2
Proof.
The proof is identical to that of Corollary 3 except for the inconsequential difference between and . It is easy to see that
Next, by choosing a particular covariance split, we prove the speculative bound from [73] when the features are Gaussian:
See 1
Proof.
By Theorem 1 and the same argument as in the proof of Corollary 2, we obtain
Let contain the largest eigenvalues; then we have
Plugging in the inequality shows
Therefore, we can pick and it is clear that
for sufficiently large and . To balance the last two terms, we can pick a covariance split such that is of order , which proves the rate. ∎
Appendix C Bounds on the Norm of the Minimal-Norm Interpolator
In this section, we will give bounds – again based on the Gaussian Minimax Theorem – for the norm of the minimal norm interpolator, first in general and then in the Euclidean case.
C.1 General Norms: Proof of Theorem 4
Similar to the analysis in the previous section, we first formulate the minimal norm as an optimization problem in terms of a random matrix with entries. Next, we apply the Convex Gaussian Minimax Theorem.
Lemma 6.
Under the model assumptions in (1), let be an arbitrary norm and be a matrix with i.i.d. entries independent of . Define the primary optimization problem (PO) as
(56)
Then for any , it holds that
(57)
Proof.
By equality in distribution, we can write . By the triangle inequality and two changes of variables, we have
Lemma 7 (Application of CGMT).
In the same setting as Lemma 6, let be Gaussian vectors independent of and each other. Define the auxiliary optimization problem (AO) as
(58)
Then it holds that
(59)
and taking expectations we have
(60)
Proof.
By introducing Lagrange multipliers, we have
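In generic notation, writing the interpolation constraint as $Xw = y$ and letting $\lambda$ denote the multiplier vector (these symbols are assumed here only for illustration), a standard reformulation of this type reads
\[
\min_{w \,:\, Xw = y} \|w\|
\;=\;
\min_{w}\,\sup_{\lambda}\,\Bigl(\|w\| + \lambda^{\top}(y - Xw)\Bigr),
\]
since the inner supremum equals $\|w\|$ when $Xw = y$ and $+\infty$ otherwise.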
By independence, the distribution of remains the same after conditioning on , and the randomness in comes solely from . Therefore, we can apply the CGMT (Theorem 9) with because is convex-concave, but we again face the technical difficulty that the domains of and are not compact. To overcome this, we will use a double truncation argument. For any , we define
(61)
and the corresponding AO
(62)
Note that the optimization in and now ranges over compact sets. We will also use an intermediate problem between and , defined as
(63)
We similarly define the intermediate AO as
(64)
Compared to the definition of , we have instead of , but this difference is negligible because is Gaussian. It can easily be seen that the event is the same as , and the same holds for and . It is also clear that , and we can connect with via the CGMT. It remains to show that as .
By definition, for . We consider two cases:
1. , i.e. the minimization problem defining is infeasible. In this case, we know that for all
By compactness, there exists (in particular, independent of ) such that
Therefore, considering along the direction of shows that
so as .
2. Otherwise , i.e. the minimization problem defining is feasible. In this case, we can let be an arbitrary minimizer achieving the objective for each by compactness. By compactness again, the sequence at positive integer values of has a subsequential limit such that . Equivalently, there exists an increasing sequence such that .
Suppose for the sake of contradiction that . Then, by continuity, there exist and a sufficiently small such that for all
This implies that for sufficiently large , we have
and by the same argument as in the previous case
so , but this is impossible since . By contradiction, it must be the case that . By taking in the definition of , we have
By continuity, we show that
Since , the limit of exists and equals . We can conclude that because is an increasing function of .
By the last part of Theorem 9 (the CGMT),
By continuity (from below) of the probability measure, and the fact that monotonically increases to almost surely, we can conclude
It remains to analyze the auxiliary problem, which we do in the following lemma:
Lemma 8.
For any covariance splitting , let denote the orthogonal projection matrix onto the space spanned by , and let . Assume that there exists such that with probability at least ,
(65)
and
(66)
Let
If and the effective ranks are sufficiently large such that , then with probability at least , it holds that
(67)
Proof.
For notational simplicity, we define
By a union bound, the following collection of events occurs with probability at least :
1. (Approximate Orthogonality.) By Lemma 1, it holds that
(68)
2. (69)
3. (70)
By a change of variables, recall that
Equations 68, 69 and 70 imply that
To upper bound , it suffices to construct a that satisfies the constraint. Consider of the form ; then . Plugging in, it suffices to choose such that
Solving for , we can choose
given that it is positive. By (65) and (71), we have
If , then
and using the inequality , we show
Therefore, we can conclude that
where we define
Provided that (which also guarantees that and our definition of is sensible), we can use the inequality for to show that
and thus by (66)
with . ∎
Finally, we are ready to prove our general norm bound.
See 4
C.2 Special Case: Euclidean Norm
Lemma 9.
For any covariance matrix , it holds that
(72)
and
(73)
As a result, it holds that
(74)
and
(75)
Proof.
Observe that if , then it can easily be checked that
and so by the Gaussian Poincaré inequality [54, Corollary 2.27], we have
Rearranging the terms proves (72). To prove (73), without loss of generality assume that is diagonal, with diagonal entries . Observe that for any integer , we can pad with 0’s such that divides , and we have
By Jensen’s inequality, ; it follows that
In the last equality, we use the fact that for each the random variable follows an inverse Chi-square distribution with degrees of freedom; its expectation is . In addition, notice that
Plugging the above estimate into our upper bound shows for any integer , it holds that
We can show (73) by choosing :
It remains to verify (74) and (75). By (72), we can check
The other direction follows directly from an application of the Cauchy-Schwarz inequality. By Jensen’s inequality and the Cauchy-Schwarz inequality, we show
Recall that . By the Cauchy-Schwarz inequality and (73), it follows that
and also by (72)
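For reference, the two standard facts driving the proof above are, in their usual forms (generic notation; here $g \sim N(0, I_n)$, $f$ is smooth and $\chi^2_k$ denotes a chi-square variable with $k$ degrees of freedom):
\[
\operatorname{Var}\bigl(f(g)\bigr) \;\le\; \mathbb{E}\bigl\|\nabla f(g)\bigr\|_2^{2}
\quad\text{(Gaussian Poincaré inequality)},
\qquad
\mathbb{E}\Bigl[\tfrac{1}{\chi^2_k}\Bigr] \;=\; \frac{1}{k-2}
\quad\text{for } k > 2 .
\]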
Lemma 10.
For any covariance matrix , it holds that with probability at least ,
(76)
and
(77)
Therefore, provided that , it holds that
(78)
Proof.
Because we are considering the norm and is standard Gaussian, without loss of generality we can assume that is diagonal, and we denote the diagonal entries of by . By the sub-exponential Bernstein inequality [60, Corollary 2.8.3], we have with probability at least
where the last inequality uses that , shown in Lemma 5 of [66]. Using the sub-exponential Bernstein inequality again, we show with probability at least
From Lemma 5 of [66], we know that the effective ranks are at least 1. This implies
Provided that , we have
in which case it holds that
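The concentration steps above rely on Bernstein's inequality for sums of independent sub-exponential random variables; in a standard form (cf. [60]), for independent mean-zero sub-exponential $X_1, \dots, X_N$ with $\max_i \|X_i\|_{\psi_1} \le K$ and an absolute constant $c > 0$,
\[
\Pr\Bigl(\Bigl|\tfrac{1}{N}\sum_{i=1}^{N} X_i\Bigr| \ge t\Bigr)
\;\le\;
2\exp\Bigl(-c\,N\min\Bigl(\tfrac{t^{2}}{K^{2}},\, \tfrac{t}{K}\Bigr)\Bigr)
\qquad\text{for all } t \ge 0 .
\]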
See 2
Appendix D Benign Overfitting
In this section, we will combine results from the previous two sections to study when interpolators are consistent.
D.1 General Norm
See 5
Proof.
By Theorem 4, if we choose
then with large probability, has non-empty intersection with , which contains the minimal norm interpolator . Also, it is clear that and so by Corollary 3, it holds that
∎
Theorem 11 (Sufficient conditions).
Under the model assumptions in (1), let be an arbitrary norm. Suppose that as goes to , there exists a sequence of covariance splits such that the following properties hold:
1. (Small large-variance dimension.)
(79)
2. (Large effective dimension.)
(80)
3. (No aliasing condition.)
(81)
4. (Contracting projection condition.) With the same definition of and as in Theorem 4, it holds that for any ,
(82)
Then converges to in probability. In other words, minimum norm interpolation is consistent.
Proof.
Fix any . For sufficiently small and , it is clear that
(83)
For any , by the definition of in Corollary 3 and our assumptions, the terms and can be made arbitrarily small for large enough . Also by our assumption, in the definition of in Theorem 4 can be arbitrarily small. Note that
converges to 0 by assumption. Then by Markov’s inequality, for any , it holds that for all sufficiently large
and we can pick
This implies that
By Theorem 5, we have shown that for sufficiently large (so that and are small enough for (83) to hold), it holds that
As a result, we show for any . To summarize, for any fixed , we have
and so converges to in probability. ∎
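The probabilistic step invoked in the argument above is Markov's inequality in its standard form: for any nonnegative random variable $X$ and any $a > 0$,
\[
\Pr(X \ge a) \;\le\; \frac{\mathbb{E}[X]}{a} .
\]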
D.2 Euclidean Norm
See 3
Proof.
The proof follows the same strategy as that of Theorem 5. By Theorem 2, if we choose
then with large probability, has non-empty intersection with . This intersection necessarily contains the minimal norm interpolator .
Also, it is clear that and so by Corollary 2, it holds that
Theorem 12 (Sufficient conditions).
Under the model assumptions in (1), let be the minimal norm interpolator. Suppose that as goes to , there exists a sequence of covariance splittings such that the following conditions hold:
1. (Small large-variance dimension.)
(84)
2. (Large effective dimension.)
(85)
3. (No aliasing condition.)
(86)
Then converges to in probability. In other words, minimum norm interpolation is consistent.
Proof.
Fix any . For sufficiently small and , it is clear that
(87)
From Lemma 5 of [66], it holds that , and so the condition implies that . For any , by the definitions of in Corollary 2 and Theorem 2 and our assumptions, the terms and can be made small enough for (87) to hold with a sufficiently large . By Theorem 3, we show that
Since the choice of is arbitrary, we have shown that converges to in probability. ∎
D.2.1 Equivalence of Consistency Conditions
If we assume that , our consistency condition (Theorem 12) for minimum norm interpolation is the existence of a covariance splitting such that
(88)
We compare the above conditions to the following conditions:
(89)
Obviously, the conditions in (89) imply (88), but we show in Theorem 13 that the existence of a splitting that satisfies (88) also implies the existence of a (potentially different) splitting that satisfies (89). This is one way to see that the particular choice of from [66] can be made without loss of generality, at least if we only consider the consistency conditions.
Theorem 13.
Proof.
Let denote the vector of eigenvalues of , and the vector obtained by setting the coordinates of corresponding to to 0. By our assumptions in (88), there exists such that
For any , we let and define by setting the coordinates of in to be 0. For simplicity of notation, define and . Observe that
This shows that
and
In addition, observe that
The above inequalities imply that
and
Finally, we pick by setting . By our assumption that , we can check
and
By Hölder’s inequality, we also have
It is clear that and , so picking the covariance splitting that corresponds to concludes the proof. ∎
Appendix E Basis Pursuit (Minimum -Norm Interpolation)
In this section, we illustrate the consequences of our general theory for basis pursuit. The following generalization bound for basis pursuit follows immediately from Corollary 3:
Corollary 5 (Generalization bound for norm balls).
There exists an absolute constant such that the following is true. Under the model assumptions in (1) with , fix and let . If and is large enough that , then the following holds with probability at least :
(90)
Proof.
Recall that the dual of the norm is the norm. By convexity
and so we can use . ∎
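As an illustration of the width computation (under the assumption, made here only for concreteness, that the class is the $\ell_1$ ball of radius $B$ in $\mathbb{R}^{d}$ and that $g \sim N(0, I_d)$), the $\ell_1$–$\ell_\infty$ duality and the standard Gaussian maximal inequality give
\[
\mathbb{E}\sup_{\|w\|_1 \le B}\,\langle g, w\rangle
\;=\; B\,\mathbb{E}\|g\|_\infty
\;\le\; B\sqrt{2\log(2d)},
\]
so the Gaussian width of the class grows only logarithmically with the dimension.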
The following norm bound for basis pursuit follows from Theorem 4:
Corollary 6 ( norm bound).
There exists an absolute constant such that the following is true. Under the model assumptions in (1), let such that is diagonal. Fix and let . Then if and the effective rank are large enough that , with probability at least , it holds that
(91)
Proof.
Recall that , where denotes the convex hull of . By definition, it holds almost surely that
(92)
and so we can pick such that
and
In addition, since is diagonal, the coordinates of that correspond to the zero diagonal entries of are 0. Therefore, must also have zero entries in those coordinates. In other words, lies in the span of . As is the orthogonal projection onto the space spanned by , this implies , and so , so that we can take . Plugging into Theorem 4 concludes the proof. ∎
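The convex-hull fact recalled at the start of this proof is, in generic notation, the standard description of the unit $\ell_1$ ball:
\[
\bigl\{\, w \in \mathbb{R}^{d} : \|w\|_1 \le 1 \,\bigr\}
\;=\;
\operatorname{conv}\bigl\{\pm e_1, \dots, \pm e_d\bigr\},
\]
so a linear functional maximized over the ball attains its maximum at a signed standard basis vector.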
Theorem 14 (Benign overfitting).
Fix any . Under the model assumptions in (1), let such that is diagonal. Suppose that and the effective rank are sufficiently large such that with the same choice of and as in Corollaries 5 and 6. Then, with probability at least :
(93)
The proof of Theorem 14 uses Corollaries 5 and 6 and follows the same lines as the proof of Theorem 5. The details are repetitive, so we omit them here. As before, we can use the finite-sample bound to deduce sufficient conditions for consistency.
Theorem 15 (Sufficient conditions).
Under the model assumptions in (1), let be the minimal norm interpolator. Suppose that as goes to , there exists a sequence of covariance splits such that is diagonal and the following conditions hold:
1. (Small large-variance dimension.)
(94)
2. (Large effective dimension.)
(95)
3. (No aliasing condition.)
(96)
Then converges to in probability. In other words, minimum norm interpolation is consistent.
Again, the proof of Theorem 15 is exactly analogous to that of Theorem 12, so we omit it here.
E.1 Isotropic features
Theorem 16.
There exists an absolute constant such that the following is true. Under the model assumptions in (1) with , let denote the support of . Fix and let . Then if and are large enough that , the following holds with probability , where :
(97)
Proof.
Write , where is formed by selecting the columns of in . Also let ; then the entries of are i.i.d. and independent of . Observe that . By choosing in Corollary 6, we show with large probability
for some . By the bound of [50], it holds that
and so we can choose . Observe that if , then and . It follows that
Theorem 17.
Under the model assumptions in (1) with , fix any and let . Suppose that and are large enough that . Then, with probability at least ,
(98)
Proof.
By Theorem 16, if we choose
then with large probability, has non-empty intersection with , which contains the minimal norm interpolator . It can be easily seen that
and so by Theorem 1, with large probability
where and . Observe that
Combined with the lower bound of [50], we show
In addition, we have
Finally, it is a routine calculation to show
using the inequality for . ∎