
Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds and Benign Overfitting

Frederic Koehler (MIT, [email protected])
Lijia Zhou (University of Chicago, [email protected])
Danica J. Sutherland (UBC and Amii, [email protected])
Nathan Srebro (TTI-Chicago, [email protected])
These authors contributed equally.
Abstract

We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class’s Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of [66] for minimum-norm interpolators, and confirms a prediction of [73] for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum \ell_1-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings.

Collaboration on the Theoretical Foundations of Deep Learning (deepfoundations.ai)

1 Introduction

The traditional understanding of machine learning suggests that models with zero training error tend to overfit, and explicit regularization is often necessary to achieve good generalization. Given the empirical success of deep learning models with zero training error [58, 55] and the (re-)discovery of the “double descent” phenomenon [61], however, it has become clear that the textbook U-shaped learning curve is only part of a larger picture: it is possible for an overparameterized model with zero training loss to achieve low population error in a noisy setting. In an effort to understand how interpolation learning occurs, there has been much recent study of the testbed problem of linear regression with Gaussian features [e.g., 66, 72, 67, 63, 69, 73, 70, 71]. Significant progress has been made in this setting, including nearly-matching necessary and sufficient conditions for consistency of the minimal \ell_2 norm interpolator [66].

Despite the fundamental role of uniform convergence in statistical learning theory, most of this line of work has used other techniques to analyze the particular minimal-norm interpolator. ([71] argue that [66]’s proof technique is fundamentally based on uniform convergence of a surrogate predictor; [76] study a closely related setting with a uniform convergence-type argument, but do not establish consistency. We discuss both papers in more detail in Section 4.) Instead of directly analyzing the population error of a learning algorithm, a uniform convergence-type argument would control the worst-case generalization gap over a class of predictors containing the typical outputs of a learning rule. Typically, this is done because for many algorithms – unlike the minimal Euclidean norm interpolator – it is difficult to exactly characterize the learned predictor, but we may be able to say, e.g., that its norm is not too large. Since uniform convergence does not tightly depend on a specific algorithm, the resulting analysis can highlight the key properties that lead to good generalization: it can give bounds not only for, say, the minimal-norm interpolator, but also for other interpolators with low norm [e.g., 73], increasing our confidence that low norm – and not some other property the particular minimal-norm interpolator happens to have – is key to generalization. In linear regression, practical training algorithms may not always find the exact minimal Euclidean norm solution, so it is also reassuring that all interpolators with sufficiently low Euclidean norm generalize.

[64], however, raised significant questions about the applicability of typical uniform convergence arguments to certain high-dimensional regimes, similar to those seen in interpolation learning. Following their work, [73, 65, 76, 71] all demonstrated the failure of forms of uniform convergence in various interpolation learning setups. To sidestep these negative results, [73] suggested considering bounds which are uniform only over predictors with zero training error. This weaker notion of uniform convergence has been standard in analyses of “realizable” (noiseless) learning at least since the work of [39, Chapter 6.4] and [40]. [73] demonstrated that at least in one particular noisy setting, such uniform convergence is sufficient for showing consistency of the minimal \ell_2 norm interpolator, even though “non-realizable” uniform convergence arguments (those over predictors regardless of their training error) cannot succeed. It remains unknown, however, whether these types of arguments can apply to more general linear regression problems and more typical asymptotic regimes, particularly showing rates of convergence rather than just consistency.

In this work, we show for the first time that uniform convergence is indeed able to explain benign overfitting in general high-dimensional Gaussian linear regression problems. Similarly to how the standard analysis for learning with Lipschitz losses bounds generalization gaps through Rademacher complexity [e.g., 53], our Theorem 1 (Section 3) establishes a finite-sample high probability bound on the uniform convergence of the error of interpolating predictors in a hypothesis class, in terms of its Gaussian width. This is done through an application of the Gaussian Minimax Theorem; see the proof sketch in Section 7. Combined with an analysis of the norm of the minimal \ell_2 norm interpolator (Theorem 2 in Section 4), our bound recovers known consistency results [66], as well as proving a conjectured upper bound for larger-norm interpolators [73].

In addition, since we do not restrict ourselves to Euclidean norm balls but instead consider interpolators in an arbitrary compact set, our results allow for a wide range of other applications. Our analysis leads to a natural extension of the consistency result and notions of effective rank of [66] to arbitrary norms (Theorem 5 in Section 5). As a demonstration of our general theory, in Section 6 we show novel consistency results for the minimal \ell_1 norm interpolator (basis pursuit) in particular settings, which we believe are the first results of their kind.

2 Problem Formulation

Notation.

We use \lVert\cdot\rVert_p for the \ell_p norm, \lVert x\rVert_p = \left(\sum_i |x_i|^p\right)^{1/p}. We always take \max_{x\in S} f(x) to be -\infty when S is empty, and similarly \min_{x\in S} f(x) to be \infty. We use standard O(\cdot) notation, and a \lesssim b for inequality up to an absolute constant. For a positive semidefinite matrix A, the Mahalanobis (semi-)norm is \lVert x\rVert_A^2 := \langle x, Ax\rangle. For a matrix A and set S, AS denotes the set \{Ax : x\in S\}.

Data model.

We assume that data (X, Y) is generated as

Y = Xw^* + \xi, \qquad X_i \stackrel{iid}{\sim} N(0, \Sigma), \qquad \xi \sim N(0, \sigma^2 I_n), \qquad (1)

where X \in \mathbb{R}^{n\times d} has i.i.d. Gaussian rows X_1, \ldots, X_n, d \geq n, w^* is arbitrary, and \xi is Gaussian and independent of X. Though our proof techniques crucially depend on X_i being Gaussian, we can easily relax the assumption on the noise \xi to only being sub-Gaussian; we assume Gaussian noise here for simplicity. The empirical and population loss are defined as, respectively,

\hat{L}(w) = \frac{1}{n}\lVert Y - Xw\rVert_2^2, \qquad L(w) = \operatorname*{\mathbb{E}}_{(x,y)}(y - \langle w, x\rangle)^2 = \sigma^2 + \lVert w - w^*\rVert_\Sigma^2,

where in the expectation y = \langle x, w^*\rangle + \xi_0 with x \sim N(0, \Sigma) independent of \xi_0 \sim N(0, \sigma^2). For an arbitrary norm \lVert\cdot\rVert, the minimal norm interpolator is \hat{w} = \operatorname*{arg\,min}_{\hat{L}(w)=0} \lVert w\rVert. For the Euclidean norm specifically, the minimal norm interpolator can be written explicitly as \hat{w} = X^{\mathsf{T}}(XX^{\mathsf{T}})^{-1}Y. If there is more than one minimal norm interpolator, all of our guarantees hold for any minimizer \hat{w}.
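As a concrete, purely illustrative instance of this setup (our own sketch; the diagonal covariance and the values of n, d, and \sigma are arbitrary choices), the following numpy snippet draws data from (1) and computes the minimal Euclidean norm interpolator and its empirical and population losses:

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 500, 0.5

# Diagonal covariance: one unit-variance direction plus many small "tail" directions.
diag = np.concatenate([[1.0], 0.01 * np.ones(d - 1)])
w_star = np.zeros(d)
w_star[0] = 1.0

X = rng.standard_normal((n, d)) * np.sqrt(diag)   # rows X_i ~ N(0, Sigma), Sigma = diag(diag)
Y = X @ w_star + sigma * rng.standard_normal(n)   # Y = X w* + xi, as in (1)

# Minimal Euclidean norm interpolator: w_hat = X^T (X X^T)^{-1} Y.
w_hat = X.T @ np.linalg.solve(X @ X.T, Y)

train_loss = np.mean((Y - X @ w_hat) ** 2)                    # \hat{L}(w_hat), ~0 up to numerical error
pop_loss = sigma**2 + np.sum(diag * (w_hat - w_star) ** 2)    # L(w_hat) = sigma^2 + ||w_hat - w*||_Sigma^2
print(train_loss, pop_loss)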

Speculative bound.

[73] studied uniform convergence of low-norm interpolators,

\sup_{\lVert w\rVert \leq B,\ \hat{L}(w)=0} L(w) - \hat{L}(w). \qquad (2)

Clearly, when B \geq \lVert\hat{w}\rVert, this quantity upper-bounds the population risk of \hat{w}. [73] evaluated the asymptotic limit of (2) in one particular setting. But they further speculated that a bound of the following form may hold more generally:

\sup_{\lVert w\rVert_2 \leq B,\ \hat{L}(w)=0} L(w) - \hat{L}(w) \leq \frac{B^2 \psi_n}{n} + o(1), \qquad (\star)

where \psi_n \approx \lVert x\rVert^2. (When \lVert x\rVert^2 concentrates, we need \psi_n to match its typical value: \psi_n might be a high-probability bound on \lVert x\rVert^2, or, for (sub)Gaussian data as in our case, \psi_n = \operatorname*{\mathbb{E}}\lVert x\rVert^2.) As discussed by [73], a bound almost of this form is implied by results of [47] for general data distributions, except that approach gives a large leading constant and logarithmic factors. To show consistency of benign overfitting, though, we need (\star) to hold without even a constant multiplicative factor. [73] ask whether and when this holds, speculating that it might in broad generality.

The goal of this paper is essentially to prove (\star), at least for Gaussian data, and to use it to show consistency of the minimal norm interpolator. Our main result (Theorem 1) can be thought of as showing (\star) for Gaussian data with \psi_n = \operatorname*{\mathbb{E}}\lVert x\rVert^2, as well as strengthening and significantly generalizing it. Our result is more general, as it applies to general compact hypothesis sets beyond just the Euclidean ball in (\star). But it also falls short of fully proving the speculative (\star), since our results are limited to Gaussian data; we are not aware of any reason to believe that a tight uniform convergence guarantee of the form (\star) does not hold much more broadly. We leave extending our results beyond the Gaussian case open.

3 Generic Uniform Convergence Guarantee

To state our results, we first need to introduce some key tools.

Definition 1.

The Gaussian width and the radius of a set S \subset \mathbb{R}^d are

W(S) := \operatorname*{\mathbb{E}}_{H\sim N(0,I_d)} \sup_{s\in S} |\langle s, H\rangle| \qquad\text{and}\qquad \operatorname{rad}(S) := \sup_{s\in S} \lVert s\rVert_2.

The radius measures the size of a set in the Euclidean norm. The Gaussian width of a set S can be interpreted as the number of dimensions that a random projection needs to approximately preserve the norms of points in S [57, 42]. These two complexity measures are connected by Gaussian concentration: Gaussian width is the expected value of the supremum of a Gaussian process, and the radius upper bounds the typical deviation of that supremum from its expected value.
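To make these two quantities concrete, here is a small Monte Carlo sketch (our own illustration; the sets and sizes are arbitrary) estimating the Gaussian width of an \ell_2 ball and an \ell_1 ball via their support functions; both sets have radius B:

import numpy as np

def gaussian_width(support_fn, d, n_samples=2000, seed=0):
    """Monte Carlo estimate of W(S) = E sup_{s in S} |<s, H>|, given the support function of S."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((n_samples, d))
    return np.mean([support_fn(h) for h in H])

d, B = 1000, 2.0
# For the l2 ball {||s||_2 <= B}: sup_s |<s, H>| = B ||H||_2, and rad = B.
w_l2 = gaussian_width(lambda h: B * np.linalg.norm(h), d)
# For the l1 ball {||s||_1 <= B}: sup_s |<s, H>| = B ||H||_inf, and rad = B as well.
w_l1 = gaussian_width(lambda h: B * np.max(np.abs(h)), d)

print(w_l2, B * np.sqrt(d))               # ~ B sqrt(d)
print(w_l1, B * np.sqrt(2 * np.log(d)))   # ~ B sqrt(2 log d)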

Definition 2 (Covariance splitting).

Given a positive semidefinite matrix \Sigma \in \mathbb{R}^{d\times d}, we write \Sigma = \Sigma_1 \oplus \Sigma_2 if \Sigma = \Sigma_1 + \Sigma_2, each matrix is positive semidefinite, and their spans are orthogonal.

Note that \Sigma = \Sigma_1 \oplus \Sigma_2 effectively splits the eigenvectors of \Sigma into two disjoint parts.

We can now state our generic bound. Section 7 sketches the proof; all full proofs are in the appendix.

Theorem 1 (Main generalization bound).

There exists an absolute constant C_1 \leq 66 such that the following is true. Under the model assumptions in (1), let \mathcal{K} be an arbitrary compact set, and take any covariance splitting \Sigma = \Sigma_1 \oplus \Sigma_2. Fixing \delta \leq 1/4, let \beta = C_1\left(\sqrt{\frac{\log(1/\delta)}{n}} + \sqrt{\frac{\operatorname{rank}(\Sigma_1)}{n}}\right). If n is large enough that \beta \leq 1, then the following holds with probability at least 1-\delta:

\sup_{w\in\mathcal{K},\,\hat{L}(w)=0} L(w) \leq \frac{1+\beta}{n}\left[W(\Sigma_2^{1/2}\mathcal{K}) + \left(\operatorname{rad}(\Sigma_2^{1/2}\mathcal{K}) + \lVert w^*\rVert_{\Sigma_2}\right)\sqrt{2\log\left(\frac{32}{\delta}\right)}\right]^2.

In our applications, we consider \mathcal{K} = \{w\in\mathbb{R}^d : \lVert w\rVert \leq B\} for an arbitrary norm, with B based on a high-probability upper bound for \lVert\hat{w}\rVert. Depending on the application, the rank of \Sigma_1 will be either constant or o(n), so that \beta \to 0. The term \lVert w^*\rVert_{\Sigma_2} generally does not scale with n and hence is often negligible. As hinted earlier, we can think of the Gaussian width term and the radius term as bias and variance, respectively. To achieve consistency, we can expect the Gaussian width to scale as \sigma\sqrt{n}. This agrees with the intuition that we need increasing norm to memorize noise when the model is not realizable. The radius term requires some care in our applications, but can be handled by the covariance splitting technique. As part of the analysis in the following sections, we will rigorously show in many settings that the dominant term in the upper bound is the Gaussian width. In these cases, our upper bound is roughly W(\Sigma_2^{1/2}\mathcal{K})^2/n, which can be viewed as the ratio between the (probabilistic) dimension of our hypothesis class and the sample size. We will also analyze how large \mathcal{K} must be to contain any interpolators, allowing us to derive consistency results.

4 Application: Euclidean Norm Ball

It can be easily seen that the Gaussian width of a Euclidean norm ball reduces nicely to the product of the norm of our predictor with the typical norm of x: if \mathcal{K} = \{w\in\mathbb{R}^d : \lVert w\rVert_2 \leq B\}, then

W(\Sigma^{1/2}\mathcal{K}) = B\cdot\operatorname*{\mathbb{E}}_{H\sim N(0,I_d)}\lVert\Sigma^{1/2}H\rVert_2 \leq \sqrt{B^2\operatorname*{\mathbb{E}}\lVert x\rVert_2^2}. \qquad (3)

Therefore, it is plausible that (\star) holds with \psi_n = \operatorname*{\mathbb{E}}\lVert x\rVert_2^2 = \operatorname{Tr}(\Sigma). Figure 1 illustrates this generalization bound in two simple examples, motivated by [63, 73]. Indeed, an application of our main theorem proves that this is exactly the case for Gaussian data.

Corollary 1 (Proof of the speculative bound (\star) for Gaussian data).

Fix any \delta \leq 1/4. Under the model assumptions in (1) with B \geq \lVert w^*\rVert_2 and n \gtrsim \log(1/\delta), for some \gamma \lesssim \sqrt[4]{\log(1/\delta)/n}, it holds with probability at least 1-\delta that

\sup_{\lVert w\rVert_2 \leq B,\,\hat{L}(w)=0} L(w) \leq (1+\gamma)\frac{B^2\operatorname{Tr}(\Sigma)}{n}. \qquad (4)
[Figure 1: two panels, (a) \lambda = 1 (isotropic) and (b) \lambda = 0.1.] Figure 1 caption: Illustration of our generalization bound when \Sigma = \begin{bmatrix}1 & 0\\ 0 & \lambda^2 I_{d-1}\end{bmatrix}, n = 200, \sigma^2 = 1/2, w^* = (1/\sqrt{2}, 0, \ldots, 0), and d is varied (x-axis). Averages (curve) and standard deviations (error bars) are estimated from 400 trials for each value of d. The curve marked “loss” corresponds to L(\hat{w}) for the minimum Euclidean norm interpolator \hat{w}; “bound” to \lVert\hat{w}\rVert_2^2\operatorname{Tr}(\Sigma)/n = \lVert\hat{w}\rVert_2^2(1+\lambda^2(d-1))/n, which is an asymptotic bound on L(\hat{w}) due to Corollary 1; “null” is the loss L(0) = 1 of the zero estimator; and “bayes” is the Bayes-optimal error L(w^*) = \sigma^2 = 1/2. The vertical line is d/n = 1, the location of the double-descent peak; for d/n < 1 there are almost surely no interpolators.
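A minimal sketch (ours, using fewer trials and a coarser grid of d than the figure) of the “loss” and “bound” curves from panel (b), reusing the closed form for the minimal-norm interpolator from Section 2:

import numpy as np

rng = np.random.default_rng(1)
n, sigma2, lam, trials = 200, 0.5, 0.1, 50

for d in [300, 1000, 4000]:
    losses, bounds = [], []
    for _ in range(trials):
        diag = np.concatenate([[1.0], lam**2 * np.ones(d - 1)])   # Sigma = diag(1, lam^2 I_{d-1})
        w_star = np.zeros(d)
        w_star[0] = 1 / np.sqrt(2)
        X = rng.standard_normal((n, d)) * np.sqrt(diag)
        Y = X @ w_star + np.sqrt(sigma2) * rng.standard_normal(n)
        w_hat = X.T @ np.linalg.solve(X @ X.T, Y)                  # minimum l2-norm interpolator
        losses.append(sigma2 + np.sum(diag * (w_hat - w_star) ** 2))   # "loss":  L(w_hat)
        bounds.append(np.sum(w_hat**2) * np.sum(diag) / n)             # "bound": ||w_hat||_2^2 Tr(Sigma)/n
    print(d, np.mean(losses), np.mean(bounds))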

The above bound is clean and simple, but only proves a sub-optimal rate of n^{-1/4}. This is because the choice of covariance split used in the proof of Corollary 1 uses no information about the particular structure of \Sigma. The bound can also be slightly loose in situations where the eigenvalues of \Sigma decay rapidly, in which case \operatorname{Tr}(\Sigma) can be replaced by a smaller quantity. We next state a more precise bound on the generalization error, which requires introducing the following notions of effective rank.

Definition 3 ([66]).

The effective ranks of a covariance matrix \Sigma are

r(\Sigma) = \frac{\operatorname{Tr}(\Sigma)}{\lVert\Sigma\rVert_{\mathit{op}}} \qquad\text{and}\qquad R(\Sigma) = \frac{\operatorname{Tr}(\Sigma)^2}{\operatorname{Tr}(\Sigma^2)}.

The rank r(\Sigma) can roughly be understood as the squared ratio between the Gaussian width and radius in our previous bound. It is related to the concentration of the \ell_2 norm of a Gaussian vector with covariance \Sigma. In fact, both definitions of effective rank can be derived by applying Bernstein’s inequality to \lVert x\rVert^2/\operatorname*{\mathbb{E}}\lVert x\rVert^2. We will only need r(\Sigma) in the generalization bound below, but we will show in Theorem 2 that R(\Sigma) can be used to control the norm of the minimal norm interpolator \hat{w}.
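For concreteness, a tiny helper (our own, hypothetical) computing both effective ranks from the spectrum; it also illustrates how splitting a large spike off into \Sigma_1 increases the effective ranks of the remaining \Sigma_2:

import numpy as np

def effective_ranks(Sigma):
    """r(Sigma) = Tr(Sigma) / ||Sigma||_op and R(Sigma) = Tr(Sigma)^2 / Tr(Sigma^2)."""
    eig = np.linalg.eigvalsh(Sigma)
    tr = eig.sum()
    return tr / eig.max(), tr**2 / np.sum(eig**2)

# One spike of size 100 plus 2000 unit-variance "tail" directions.
Sigma = np.diag(np.concatenate([[100.0], np.ones(2000)]))
print(effective_ranks(Sigma))         # r = 21, R = 367.5
print(effective_ranks(np.eye(2000)))  # after splitting off the spike, Sigma_2 = I: r = R = 2000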

Corollary 2.

There exists an absolute constant C_1 \leq 66 such that the following is true. Under (1), pick any split \Sigma = \Sigma_1 \oplus \Sigma_2, fix \delta \leq 1/4, and let \gamma = C_1\left(\sqrt{\frac{\log(1/\delta)}{r(\Sigma_2)}} + \sqrt{\frac{\log(1/\delta)}{n}} + \sqrt{\frac{\operatorname{rank}(\Sigma_1)}{n}}\right). If B \geq \lVert w^*\rVert_2 and n is large enough that \gamma \leq 1, the following holds with probability at least 1-\delta:

\sup_{\lVert w\rVert_2 \leq B,\,\hat{L}(w)=0} L(w) \leq (1+\gamma)\frac{B^2\operatorname{Tr}(\Sigma_2)}{n}. \qquad (5)

In order to use Corollary 1 or 2 to prove consistency, we need a high-probability bound for \lVert\hat{w}\rVert_2, the norm of the minimal norm interpolator, so that B will be large enough to contain any interpolators. Theorem 2 gives exactly such a bound, showing that if the effective ranks R(\Sigma_2) and r(\Sigma_2) are large, then we can construct an interpolator with Euclidean norm nearly \lVert w^*\rVert_2 + \sigma\sqrt{n/\operatorname{Tr}(\Sigma_2)}.

Theorem 2 (Euclidean norm bound; special case of Theorem 4).

Fix any \delta \leq 1/4. Under the model assumptions in (1) with any choice of covariance splitting \Sigma = \Sigma_1 \oplus \Sigma_2, there exists some \epsilon \lesssim \sqrt{\frac{\log(1/\delta)}{r(\Sigma_2)}} + \sqrt{\frac{\log(1/\delta)}{n}} + \frac{n\log(1/\delta)}{R(\Sigma_2)} such that the following is true. If n and the effective ranks are such that \epsilon \leq 1 and R(\Sigma_2) \gtrsim \log(1/\delta)^2, then with probability at least 1-\delta, it holds that

\lVert\hat{w}\rVert_2 \leq \lVert w^*\rVert_2 + (1+\epsilon)^{1/2}\,\sigma\sqrt{\frac{n}{\operatorname{Tr}(\Sigma_2)}}. \qquad (6)

Plugging estimates of \lVert\hat{w}\rVert into our scale-sensitive bound, Corollary 2, we obtain a population loss guarantee for \hat{w} in terms of effective ranks.

Theorem 3 (Benign overfitting).

Fix any \delta \leq 1/2. Under the model assumptions in (1) with any covariance splitting \Sigma = \Sigma_1 \oplus \Sigma_2, let \gamma and \epsilon be as defined in Corollary 2 and Theorem 2. Suppose that n and the effective ranks are such that R(\Sigma_2) \gtrsim \log(1/\delta)^2 and \gamma, \epsilon \leq 1. Then, with probability at least 1-\delta,

L(\hat{w}) \leq (1+\gamma)(1+\epsilon)\left(\sigma + \lVert w^*\rVert_2\sqrt{\frac{\operatorname{Tr}(\Sigma_2)}{n}}\right)^2. \qquad (7)

From (7), we can see that to ensure consistency, i.e. L(\hat{w}) \to \sigma^2, it is enough that \gamma \to 0, \epsilon \to 0, and \lVert w^*\rVert_2\sqrt{\frac{\operatorname{Tr}(\Sigma_2)}{n}} \to 0. Recalling the definitions of the various quantities and using that r(\Sigma_2)^2 \geq R(\Sigma_2) [66, Lemma 5], we arrive at the following conditions.

Sufficient conditions for consistency of \hat{w}.

As n \to \infty, L(\hat{w}) converges in probability to \sigma^2 if there exists a sequence of covariance splits \Sigma = \Sigma_1 \oplus \Sigma_2 such that

\frac{\operatorname{rank}(\Sigma_1)}{n} \to 0, \qquad \lVert w^*\rVert_2\sqrt{\frac{\operatorname{Tr}(\Sigma_2)}{n}} \to 0, \qquad \frac{n}{R(\Sigma_2)} \to 0. \qquad (8)
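As an illustrative sanity check (our own example, in the spirit of the junk-feature setting revisited in Section 6), take \Sigma_2 = \frac{\lambda_n}{d} I_d with \operatorname{rank}(\Sigma_1) fixed and \lVert w^*\rVert_2 bounded. Then \operatorname{Tr}(\Sigma_2) = \lambda_n and R(\Sigma_2) = d, so

\frac{\operatorname{rank}(\Sigma_1)}{n} \to 0 \ \text{trivially}, \qquad \lVert w^*\rVert_2\sqrt{\frac{\lambda_n}{n}} \to 0 \ \text{whenever } \lambda_n = o(n), \qquad \frac{n}{R(\Sigma_2)} = \frac{n}{d} \to 0 \ \text{whenever } n = o(d),

and the minimal \ell_2 norm interpolator is consistent in this regime.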
Relationship to [66].

Our set of sufficient conditions above subsumes and is slightly more general than the conditions of [66]. There are two differences:

  1. They choose the covariance split specifically to minimize \operatorname{rank}(\Sigma_1) such that r(\Sigma_2) \gtrsim n.

  2. Their version of the second condition replaces \operatorname{Tr}(\Sigma_2) by the larger term \operatorname{Tr}(\Sigma).

From the perspective of showing L(\hat{w}) \to \sigma^2, the first difference is immaterial: if there exists a choice of split that satisfies our conditions, it can be shown that there exists a (possibly different) split which also satisfies r(\Sigma_2) \gtrsim n (see Section D.2.1). The second point is a genuine improvement over the consistency result of [66] when \Sigma has a few very large eigenvalues; this improvement has also been implicitly obtained by [72].

Regarding the rate of convergence, our additional r(\Sigma_2)^{-1/2} term and the dependence on \sqrt{\operatorname{rank}(\Sigma_1)/n} instead of \operatorname{rank}(\Sigma_1)/n are slightly worse than in [66], but our bound can be applied for a smaller value of \operatorname{rank}(\Sigma_1) and is better in the \lVert w^*\rVert_2\sqrt{\operatorname{Tr}(\Sigma_2)/n} term. We believe these differences are minimal in most cases, and not so important for our primary goal of showcasing the power of uniform convergence.

Relationship to [71].

The consistency result of [66] can also be recovered with a uniform convergence-based argument [71]. Instead of considering uniform convergence over a norm ball, [71] applied uniform convergence to a surrogate predictor, and separately showed that the minimal-norm interpolator has risk close to the surrogate (and, indeed, argue that this was fundamentally the proof strategy of [66] all along). Their analysis reveals an interesting connection between realizability and interpolation learning, but it does not highlight that low norm is key to good generalization, nor does it predict the worst-case error for other low-norm interpolators.

Relationship to [65].

[65] recently showed that it is impossible to find a tight excess risk bound that depends only on the learned predictor and the sample size. This does not, however, contradict our results. A closer look at the construction of their lower bound reveals that the excess risk bounds being ruled out cannot depend on either the training error \hat{L} or the population noise level \sigma^2. The former is crucial: their considered class of bounds cannot incorporate the knowledge that the training error is small, which is the defining property of uniform convergence of interpolators. The latter point is also important; they consider excess risk (L - \sigma^2), but (\star) and our bounds are about the generalization gap (L - \hat{L}).

Relationship to [76].

[76] give expressions for the asymptotic generalization error of predictors in a norm ball, in a random feature model. Their model is not directly comparable to ours (their labels are effectively a nonlinear function of their non-Gaussian random features), but they similarly showed that uniform convergence of interpolators can lead to a non-vacuous bound. It is unclear, though, whether uniform convergence of low-norm interpolators can yield consistency in their model: they only study sets of the form \{\lVert w\rVert \leq \alpha\lVert\hat{w}\rVert\} with \alpha > 1 a constant, where we would expect a loss of \alpha^2\sigma^2 – i.e., would not expect consistency. They also rely on numerical methods to compare their (quite complicated) analytic expressions. It remains possible that the gap between uniform convergence of interpolators and the Bayes risk vanishes in their setting as \alpha approaches 1.

5 General Norm Ball

All of the results in the Euclidean setting are special cases of the following results for arbitrary norms. It is worth keeping in mind that the Euclidean norm still plays a role in these analyses, via the Gaussian width and radius appearing in Theorem 1 and the \ell_2 projection P appearing in Theorem 4.

Definition 4.

The dual norm of a norm \lVert\cdot\rVert on \mathbb{R}^d is \lVert u\rVert_* := \max_{\lVert v\rVert=1}\langle v, u\rangle, and the set of all its sub-gradients with respect to u is \partial\lVert u\rVert_* = \{v : \lVert v\rVert = 1,\ \langle v, u\rangle = \lVert u\rVert_*\}.

The Euclidean norm’s dual is itself; for it and many other norms, \partial\lVert u\rVert_* is a singleton set. Using these notions, we will now give versions of the effective ranks appropriate for generic norm balls.

Definition 5.

The effective \lVert\cdot\rVert-ranks of a covariance matrix \Sigma are given as follows. Let H \sim N(0, I_d), and define v^* = \operatorname*{arg\,min}_{v\in\partial\lVert\Sigma^{1/2}H\rVert_*}\lVert v\rVert_\Sigma. Then

r_{\lVert\cdot\rVert}(\Sigma) = \left(\frac{\operatorname*{\mathbb{E}}\lVert\Sigma^{1/2}H\rVert_*}{\sup_{\lVert w\rVert\leq 1}\lVert w\rVert_\Sigma}\right)^2 \qquad\text{and}\qquad R_{\lVert\cdot\rVert}(\Sigma) = \left(\frac{\operatorname*{\mathbb{E}}\lVert\Sigma^{1/2}H\rVert_*}{\operatorname*{\mathbb{E}}\lVert v^*\rVert_\Sigma}\right)^2.

The first effective \lVert\cdot\rVert-rank is the squared ratio of the Gaussian width to the radius of the set \Sigma^{1/2}\mathcal{K}, where \mathcal{K} is a norm ball \{w : \lVert w\rVert \leq B\}; the importance of this ratio should be clear from Theorem 1. The Gaussian width is given by W(\Sigma^{1/2}\mathcal{K}) = B\operatorname*{\mathbb{E}}\lVert x\rVert_*, while the radius can be written \sup_{\lVert w\rVert\leq B}\lVert w\rVert_\Sigma, so that the factors of B cancel.

The choice of R_{\lVert\cdot\rVert} arises naturally from our bound on \lVert\hat{w}\rVert in Theorem 4 below. Large effective rank of \Sigma_2 means the sub-gradient v^* of \lVert\Sigma_2^{1/2}H\rVert_* is small in the \lVert\cdot\rVert_{\Sigma_2} norm. This is, in fact, closely related to the existence of low-norm interpolators. First, note that \Sigma_2^{1/2}H corresponds to the small-eigenvalue components of the covariate vector. For v^* to be a sub-gradient means that moving the weight vector w in the direction of v^* is very effective at changing the prediction \langle w, X\rangle; having small \lVert\cdot\rVert_{\Sigma_2} norm means that moving in this direction has a very small effect on the population loss L(w). Together, this means the sub-gradient will be a good direction for benignly overfitting the noise.

Remark 1 (Definitions of effective ranks).

Using \lVert\cdot\rVert_2 in Definition 5 yields slightly different effective ranks than those of Definition 3, but the difference is small and asymptotically negligible. Both r and R use \operatorname*{\mathbb{E}}\lVert x\rVert^2 in their numerators, while r_{\lVert\cdot\rVert_2} and R_{\lVert\cdot\rVert_2} use (\operatorname*{\mathbb{E}}\lVert x\rVert)^2. The denominators of r and r_{\lVert\cdot\rVert_2} agree; Lemma 9, in Section C.2, shows that r(\Sigma) - 1 \leq r_{\lVert\cdot\rVert_2}(\Sigma) \leq r(\Sigma). The denominator of R_{\lVert\cdot\rVert_2} uses the sole sub-gradient v^* = \Sigma^{1/2}H/\lVert\Sigma^{1/2}H\rVert_2; the denominator is then \lVert\Sigma^{1/2}H\rVert_\Sigma^2/\lVert\Sigma^{1/2}H\rVert_2^2 \approx \operatorname{Tr}(\Sigma^2)/\operatorname{Tr}(\Sigma), giving that R_{\lVert\cdot\rVert_2}(\Sigma) \approx \operatorname{Tr}(\Sigma)^2/\operatorname{Tr}(\Sigma^2) = R(\Sigma). Equation 75, in Section C.2, shows that R_{\lVert\cdot\rVert_2}(\Sigma) \geq cR(\Sigma) for some c that converges to 1 as r(\Sigma) \to \infty, as is required by the consistency conditions. The other direction, R(\Sigma) \geq c'R_{\lVert\cdot\rVert_2}(\Sigma), also holds with c' \to 1 as r(\Sigma^2) \to \infty. It would be possible (and probably even more natural with our analysis) to state the consistency conditions in Section 4 in terms of r_{\lVert\cdot\rVert_2} and R_{\lVert\cdot\rVert_2}; we used r and R mainly to allow direct comparison to [66].

Using the general notion of effective ranks, we can find an analogue of Corollary 2 for general norms.

Corollary 3.

There exists an absolute constant C_1 \leq 66 such that the following is true. Under the model assumptions in (1), take any covariance splitting \Sigma = \Sigma_1 \oplus \Sigma_2 and let \lVert\cdot\rVert be an arbitrary norm. Fixing \delta \leq 1/4, let \gamma = C_1\left(\sqrt{\frac{\log(1/\delta)}{r_{\lVert\cdot\rVert}(\Sigma_2)}} + \sqrt{\frac{\log(1/\delta)}{n}} + \sqrt{\frac{\operatorname{rank}(\Sigma_1)}{n}}\right). If B \geq \lVert w^*\rVert and n is large enough that \gamma \leq 1, then the following holds with probability at least 1-\delta:

\sup_{\lVert w\rVert\leq B,\,\hat{L}(w)=0} L(w) \leq (1+\gamma)\frac{\left(B\cdot\operatorname*{\mathbb{E}}\lVert\Sigma_2^{1/2}H\rVert_*\right)^2}{n}. \qquad (9)

As in the Euclidean special case, we still need a bound on the norm of \hat{w} = \operatorname*{arg\,min}_{\hat{L}(w)=0}\lVert w\rVert to use this result to study the consistency of \hat{w}. This leads us to the second main technical result of this paper, which essentially says that if the effective ranks R_{\lVert\cdot\rVert}(\Sigma_2) and r_{\lVert\cdot\rVert}(\Sigma_2) are sufficiently large, then there exists an interpolator with norm nearly \lVert w^*\rVert + \sigma\sqrt{n}/\operatorname*{\mathbb{E}}\lVert\Sigma_2^{1/2}H\rVert_*.

Theorem 4 (General norm bound).

There exists an absolute constant C_2 \leq 64 such that the following is true. Under the model assumptions in (1) with any covariance split \Sigma = \Sigma_1 \oplus \Sigma_2, let \lVert\cdot\rVert be an arbitrary norm, and fix \delta \leq 1/4. Denote the \ell_2 orthogonal projection matrix onto the space spanned by \Sigma_2 as P. Let H \sim N(0, I_d), and let v^* = \operatorname*{arg\,min}_{v\in\partial\lVert\Sigma_2^{1/2}H\rVert_*}\lVert v\rVert_{\Sigma_2}. Suppose that there exist \epsilon_1, \epsilon_2 \geq 0 such that with probability at least 1-\delta/4

\lVert v^*\rVert_{\Sigma_2} \leq (1+\epsilon_1)\operatorname*{\mathbb{E}}\lVert v^*\rVert_{\Sigma_2} \qquad\text{and}\qquad \lVert Pv^*\rVert^2 \leq 1+\epsilon_2; \qquad (10)

let \epsilon = C_2\left(\sqrt{\frac{\log(1/\delta)}{r_{\lVert\cdot\rVert}(\Sigma_2)}} + \sqrt{\frac{\log(1/\delta)}{n}} + (1+\epsilon_1)^2\frac{n}{R_{\lVert\cdot\rVert}(\Sigma_2)} + \epsilon_2\right). Then if n and the effective ranks are large enough that \epsilon \leq 1, with probability at least 1-\delta, it holds that

\lVert\hat{w}\rVert \leq \lVert w^*\rVert + (1+\epsilon)^{1/2}\,\sigma\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\lVert\Sigma_2^{1/2}H\rVert_*}. \qquad (11)

For a specific choice of norm \lVert\cdot\rVert, we can verify that \lVert v^*\rVert_{\Sigma_2} is small. In the Euclidean case, for example, this is done by (78) in Section C.2; in our basis pursuit application to come, this is done by (92). The term \epsilon_2 measures the cost of using a projected version of the sub-gradient; in most of our applications, we can take \epsilon_2 = 0. Recalling that \lVert v^*\rVert = 1, this is obviously true with the Euclidean norm for any \Sigma_2. More generally, if \Sigma is diagonal, then it is natural to only consider covariance splits \Sigma = \Sigma_1 \oplus \Sigma_2 such that \Sigma_2 is diagonal. Then, when \lVert\cdot\rVert is the \ell_1 norm (or an \ell_p norm more generally), it can easily be seen that Pv^* = v^* and so \lVert Pv^*\rVert = \lVert v^*\rVert = 1.

Straightforwardly combining Corollary 3 and Theorem 4 yields the following theorem, which gives guarantees for minimal-norm interpolators in terms of effective rank conditions. Just as in the Euclidean case, we can extract from this result a simple set of sufficient conditions for consistency of the minimal norm interpolator.

Theorem 5 (Benign overfitting with general norm).

Fix any \delta \leq 1/2. Under the model assumptions in (1), let \lVert\cdot\rVert be an arbitrary norm and pick a covariance split \Sigma = \Sigma_1 \oplus \Sigma_2. Suppose that n and the effective ranks are sufficiently large that \gamma, \epsilon \leq 1, with the same choices of \gamma and \epsilon as in Corollary 3 and Theorem 4. Then, with probability at least 1-\delta,

L(\hat{w}) \leq (1+\gamma)(1+\epsilon)\left(\sigma + \lVert w^*\rVert\frac{\operatorname*{\mathbb{E}}\lVert\Sigma_2^{1/2}H\rVert_*}{\sqrt{n}}\right)^2. \qquad (12)
Sufficient conditions for consistency of \hat{w}.

As n \to \infty, L(\hat{w}) converges in probability to \sigma^2 if there exists a sequence of covariance splits \Sigma = \Sigma_1 \oplus \Sigma_2 such that

\frac{\operatorname{rank}(\Sigma_1)}{n} \to 0, \qquad \frac{\lVert w^*\rVert\operatorname*{\mathbb{E}}\lVert\Sigma_2^{1/2}H\rVert_*}{\sqrt{n}} \to 0, \qquad \frac{1}{r_{\lVert\cdot\rVert}(\Sigma_2)} \to 0, \qquad \frac{n}{R_{\lVert\cdot\rVert}(\Sigma_2)} \to 0, \qquad (13)

and, with the same definition of P and v^* as in Theorem 4, it holds for any \eta > 0 that

\Pr(\lVert Pv^*\rVert^2 > 1+\eta) \to 0. \qquad (14)

As we see, the conditions for a minimal norm interpolator to succeed with a general norm generalize those from the Euclidean setting in a natural way. As discussed above, (14) is always satisfied for the Euclidean norm. The only remaining notable difference from the Euclidean setting is that we have two large effective dimension conditions on \Sigma_2 instead of a single one; in the Euclidean case, the condition on R implies the condition on r.

6 Application: \ell_1 Norm Balls for Basis Pursuit

The theory for the minimal \ell_1 norm interpolator, \hat{w}_{\mathit{BP}} \in \operatorname*{arg\,min}_{\hat{L}(w)=0}\lVert w\rVert_1 – also known as basis pursuit [44] – is much less developed than that of the minimal \ell_2 norm interpolator. In this section, we illustrate the consequences of our general theory for basis pursuit. Full statements and proofs of results in this section are given in Appendix E.

The dual of the \ell_1 norm is the \ell_\infty norm \lVert u\rVert_\infty = \max_i|u_i|, and \partial\lVert u\rVert_\infty is the convex hull of \{\operatorname{sign}(u_i)\,e_i : i\in\arg\max|u_i|\}. From the definition of the sub-gradient, we observe that

\min_{v\in\partial\lVert\Sigma^{1/2}H\rVert_\infty}\lVert v\rVert_\Sigma \leq \max_{i\in[d]}\lVert e_i\rVert_\Sigma = \sqrt{\max_i\Sigma_{ii}}. \qquad (15)

Furthermore, by convexity we have

\max_{\lVert w\rVert_1\leq 1}\lVert w\rVert_\Sigma = \sqrt{\max_i\langle e_i, \Sigma e_i\rangle} = \sqrt{\max_i\Sigma_{ii}} \qquad (16)

and so r_{\lVert\cdot\rVert_1}(\Sigma) = \frac{\left(\operatorname*{\mathbb{E}}\lVert\Sigma^{1/2}H\rVert_\infty\right)^2}{\max_i\Sigma_{ii}} \leq R_{\lVert\cdot\rVert_1}(\Sigma). Therefore, we can use a single notion of effective rank; for simplicity, we write r_1(\Sigma) = r_{\lVert\cdot\rVert_1}(\Sigma). Combining this with (13) and the previous discussion of (14), we obtain the following sufficient conditions for consistency of basis pursuit.

Sufficient conditions for consistency of \hat{w}_{\mathit{BP}}.

As n \to \infty, L(\hat{w}_{\mathit{BP}}) converges to \sigma^2 in probability if there exists a sequence of covariance splits \Sigma = \Sigma_1 \oplus \Sigma_2 such that \Sigma_2 is diagonal and

\frac{\operatorname{rank}(\Sigma_1)}{n} \to 0, \qquad \frac{\lVert w^*\rVert_1\operatorname*{\mathbb{E}}\lVert\Sigma_2^{1/2}H\rVert_\infty}{\sqrt{n}} \to 0, \qquad \frac{n}{r_1(\Sigma_2)} \to 0. \qquad (17)
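For a quick numerical feel for r_1, here is a Monte Carlo sketch (our own, for diagonal \Sigma only) using the formula r_1(\Sigma) = (\operatorname*{\mathbb{E}}\lVert\Sigma^{1/2}H\rVert_\infty)^2/\max_i\Sigma_{ii}:

import numpy as np

def r1(diag_Sigma, n_samples=500, seed=0):
    """Monte Carlo estimate of r_1(Sigma) for diagonal Sigma = diag(diag_Sigma)."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((n_samples, len(diag_Sigma)))
    dual_norms = np.max(np.abs(H) * np.sqrt(diag_Sigma), axis=1)   # ||Sigma^{1/2} H||_inf per sample
    return np.mean(dual_norms) ** 2 / np.max(diag_Sigma)

for d in [10**3, 10**4]:
    print(d, r1(np.ones(d)), 2 * np.log(d))   # for Sigma = I_d, r_1(Sigma) grows like 2 log d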
Application: Junk features.

We now consider the behavior of basis pursuit in a junk feature model similar to that of [73]. Suppose that \Sigma = \begin{bmatrix}\Sigma_s & 0\\ 0 & \frac{\lambda_n}{\log(d)}I_d\end{bmatrix}, where \Sigma_s is a fixed matrix and \lVert w^*\rVert_1 is fixed. Quite naturally, we choose the covariance splitting \Sigma_1 = \begin{bmatrix}\Sigma_s & 0\\ 0 & 0\end{bmatrix}, which has constant rank, so that the first sufficient condition is immediately satisfied.

By standard results on the maximum of independent Gaussian variables [e.g., 60], it is routine to check that

\frac{\operatorname*{\mathbb{E}}\lVert\Sigma_2^{1/2}H\rVert_\infty}{\sqrt{n}} = \Theta\left(\sqrt{\frac{\lambda_n}{n}}\right) \qquad\text{and}\qquad r_1(\Sigma_2) = \Theta\left(\log(d)\right). \qquad (18)

Therefore, basis pursuit will be consistent provided that \lambda_n = o(n) and d = e^{\omega(n)}. To the best of our knowledge, this is the first time that basis pursuit has been shown to give consistent predictions in any setting with Gaussian covariates and \sigma > 0. Although we show consistency, the dimension must be quite high, and the rate of convergence depends on n/\log(d) and 1/\sqrt{\log(d)}.
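To make this concrete, here is a small sketch (our own; the dimension is far smaller than the d = e^{\omega(n)} regime the theory requires, so it only illustrates the trend) that solves basis pursuit as a linear program in a toy junk-feature model and evaluates its population loss:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d_junk, sigma, lam = 30, 2000, 0.5, 1.0

# Junk-feature covariance: one "signal" coordinate (Sigma_s = [1]) plus d_junk junk coordinates,
# each with variance lam / log(d_junk).
diag = np.concatenate([[1.0], (lam / np.log(d_junk)) * np.ones(d_junk)])
w_star = np.zeros(d_junk + 1)
w_star[0] = 1.0
X = rng.standard_normal((n, d_junk + 1)) * np.sqrt(diag)
Y = X @ w_star + sigma * rng.standard_normal(n)

# Basis pursuit: min ||w||_1 s.t. Xw = Y, written as an LP in (w_plus, w_minus) >= 0.
D = d_junk + 1
res = linprog(c=np.ones(2 * D), A_eq=np.hstack([X, -X]), b_eq=Y, bounds=(0, None), method="highs")
w_bp = res.x[:D] - res.x[D:]

pop_loss = sigma**2 + np.sum(diag * (w_bp - w_star) ** 2)   # L(w_hat_BP); consistency would mean -> sigma^2
print(np.sum(np.abs(w_bp)), pop_loss)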

Application: Isotropic features.

As in the Euclidean case, we generally do not expect basis pursuit to be consistent when \Sigma = I_d and w^* \neq 0. However, we can expect its risk to approach the null risk \sigma^2 + \lVert w^*\rVert^2 if d = e^{\omega(n)}; we will show this using uniform convergence (without covariance splitting).

A direct application of Theorem 5 is not enough because the \lVert w^*\rVert_1\sqrt{\log(d)/n} term diverges, but we can remove the dependence on \sqrt{\log(d)/n} with a better norm bound. Let S be the support of w^* and let X_S denote the matrix formed by selecting the columns of X in S. The key observation is that we can rewrite our model as Y = X_{S^{\mathsf{c}}}\,0 + (X_S w_S^* + \xi), which corresponds to the case where w^* = 0 and the Bayes risk is \sigma^2 + \lVert w^*\rVert_2^2. If we interpolate using only the features in S^{\mathsf{c}}, the minimal norm will be approximately upper bounded by \sqrt{\sigma^2 + \lVert w^*\rVert_2^2}\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\lVert H\rVert_*} as long as d - |S| = e^{\omega(n)}, by Theorem 4. This implies that \lVert\hat{w}_{\mathit{BP}}\rVert_1 in the original model can also be upper bounded by the same quantity with high probability. Plugging the norm estimate into Corollary 3 yields a risk bound of \sigma^2 + \lVert w^*\rVert_2^2.

Relationship to previous works.

Both [69] and [74] study the minimal \ell_1 norm interpolator in the isotropic setting. They consider a more realistic scaling where \log(d)/n is not large and the target is not the null risk. The best bound of [69], their Corollary 3, is L(\hat{w}_{\mathit{BP}}) \leq \sigma^2(2 + 32\sqrt{14}\sqrt{s})^2, where s is the ground-truth sparsity. Note that even when w^* = 0, this bound does not show consistency. Similarly, [74] establish sufficient conditions for L(\hat{w}_{\mathit{BP}}) = O(\sigma^2), which is nontrivial but also does not show consistency for any \sigma > 0; see also the work of [68] for a similar result in the Euclidean setting. In contrast, the constants in our result are tight enough to show L(\hat{w}_{\mathit{BP}}) \to \sigma^2 in the isotropic setting when w^* = 0 and in the junk feature setting when \lVert w^*\rVert_1 is bounded.

Like our work, the results of [74] generalize to arbitrary norms; they also consider a larger class of anti-concentrated covariate distributions than just Gaussians, as in the work of [52]. If \sigma = 0 and w^* \in \mathcal{K} (i.e., the model is well-specified and noiseless), their work as well as that of [52] can recover generalization bounds similar to our Corollary 3, but with a large leading constant.

7 Proof Sketches

A key ingredient in our analysis is a celebrated result from Gaussian process theory known as the Gaussian Minmax Theorem (GMT) [41, 56]. Since the seminal work of [45], the GMT has seen numerous applications to problems in statistics, machine learning, and signal processing [e.g., 51, 75, 46, 59]. Most relevant to us is the work of [56], which introduced the Convex Gaussian Minmax Theorem (CGMT) and developed a framework for the precise analysis of regularized linear regression. Here we apply the GMT/CGMT to study uniform convergence and the norm of the minimal norm interpolator.

Proof sketch of Theorem 1.

For simplicity, assume here that there is no covariance splitting: \Sigma_2 = \Sigma. By a change of variable and introducing the Lagrangian, we can rewrite the generalization gap as

\begin{split}\sup_{\begin{subarray}{c}w\in\mathcal{K}\\ Xw=Y\end{subarray}} L(w) &= \sigma^2 + \sup_{\begin{subarray}{c}w\in\Sigma^{1/2}(\mathcal{K}-w^*)\\ Zw=\xi\end{subarray}}\lVert w\rVert_2^2\\ &= \sigma^2 + \sup_{w\in\Sigma^{1/2}(\mathcal{K}-w^*)}\inf_{\lambda}\,\langle\lambda, Zw-\xi\rangle + \lVert w\rVert_2^2\end{split} \qquad (19)

where Z is a random matrix with i.i.d. standard normal entries. By the GMT (we ignore a compactness issue here, which is handled rigorously by a truncation argument in Section B.1), we can control the upper tail of the max-min problem above (PO) by the auxiliary problem below (AO), with G, H \sim N(0, I):

\sup_{w\in\Sigma^{1/2}(\mathcal{K}-w^*)}\inf_{\lambda}\,\lVert\lambda\rVert_2\langle H, w\rangle + \langle\lambda, G\lVert w\rVert_2 - \xi\rangle + \lVert w\rVert_2^2 = \sup_{\begin{subarray}{c}w\in\Sigma^{1/2}(\mathcal{K}-w^*)\\ \lVert G\lVert w\rVert_2 - \xi\rVert_2 \leq \langle H, w\rangle\end{subarray}}\lVert w\rVert_2^2. \qquad (20)

By standard concentration results, we can expect \lVert G\rVert_2^2/n \approx 1 and \lVert\xi\rVert_2^2/n \approx \sigma^2, so expanding the second constraint in the AO, we obtain \lVert w\rVert_2^2 + \sigma^2 \leq |\langle H, w\rangle|^2/n. Plugging into (19), we have essentially shown that

\sup_{\begin{subarray}{c}w\in\mathcal{K}\\ \hat{L}(w)=0\end{subarray}} L(w) \leq \sup_{w\in\Sigma^{1/2}(\mathcal{K}-w^*)}\frac{|\langle H, w\rangle|^2}{n} \leq \frac{\left(\sup_{w\in\Sigma^{1/2}\mathcal{K}}|\langle H, w\rangle| + |\langle H, \Sigma^{1/2}w^*\rangle|\right)^2}{n}. \qquad (21)

Applying concentration on the right-hand side concludes the proof sketch. In situations where the supremum does not sharply concentrate around its mean, we can apply the GMT only to the small-variance directions of \Sigma. This requires a slightly more general version of the GMT, which we prove in Appendix A. We also show that the additional terms contributed by the large-variance components of X cancel out due to Wishart concentration. This is reflected in the \beta term of our theorem statement.

Proof sketch of Theorem 4.

Since the minimal norm problem is convex-concave, we can apply the CGMT, which provides a useful direction that the GMT cannot. By the same argument as above,

\begin{split}\inf_{Xw=Y}\lVert w\rVert - \lVert w^*\rVert &\leq \inf_{Xw=\xi}\lVert w\rVert = \inf_{w}\sup_{\lambda}\lVert\Sigma^{-1/2}w\rVert + \langle\lambda, Zw-\xi\rangle\\ &\approx \inf_{\lVert w\rVert_2^2+\sigma^2\leq|\langle H, w\rangle|^2/n}\lVert\Sigma^{-1/2}w\rVert = \inf_{\lVert w\rVert_\Sigma^2+\sigma^2\leq|\langle\Sigma^{1/2}H, w\rangle|^2/n}\lVert w\rVert.\end{split} \qquad (22)

To upper bound the infimum, it suffices to construct a feasible w. Consider w of the form \alpha v, where v\in\partial\lVert\Sigma^{1/2}H\rVert_*. Plugging into the constraint, we can choose \lVert w\rVert = \alpha = \sqrt{\sigma^2\left(\frac{\lVert\Sigma^{1/2}H\rVert_*^2}{n} - \lVert v\rVert_\Sigma^2\right)^{-1}}. Rearranging the terms concludes the proof sketch when there is no covariance splitting. The general proof (in Section C.1) is more technical, but follows the same idea.

8 Discussion

In this work, we prove a generic generalization bound in terms of the Gaussian width and radius of a hypothesis class. We also provide a general high-probability upper bound for the norm of the minimal norm interpolator. Combining these results, we recover the sufficient conditions from [66] in the \ell_2 case, confirm the conjecture of [73] for Gaussian data, and obtain novel consistency results in the \ell_1 case. Our results provide concrete evidence that uniform convergence is indeed sufficient to explain interpolation learning, at least in some settings.

A future direction of our work is to extend the main results to settings with non-Gaussian features; this has been achieved in other applications of the GMT [59], and indeed we expect that a version of (\star) likely holds for non-Gaussian data as well. Another interesting problem is to study uniform convergence of low-norm near-interpolators, and characterize the worst-case population error as the norm and training error both grow. This could lead to a more precise understanding of early stopping, by connecting the optimization path with the regularization path. Finally, it is unknown whether our sufficient conditions for consistency in Section 5 are necessary, and it remains a challenge to apply uniform convergence of interpolators to more complex models such as deep neural networks.

Acknowledgments and Disclosure of Funding

Frederic Koehler was supported in part by E. Mossel’s Vannevar Bush Faculty Fellowship ONR-N00014-20-1-2826. Research supported in part by NSF IIS award 1764032, NSF HDR TRIPODS award 1934843, and the Canada CIFAR AI Chairs program. This work was done as part of the Collaboration on the Theoretical Foundations of Deep Learning (deepfoundations.ai).



References

  • [39] Vladimir Vapnik “Estimation of Dependences Based on Empirical Data”, Springer Series in Statistics Springer-Verlag, 1982
  • [40] Leslie G. Valiant “A theory of the learnable” In Communications of the ACM 27.11, 1984, pp. 1134–1142
  • [41] Yehoram Gordon “Some inequalities for Gaussian processes and applications” In Israel Journal of Mathematics 50.4 Springer, 1985, pp. 265–289
  • [42] Yehoram Gordon “On Milman’s inequality and random subspaces which escape through a mesh in \mathbb{R}^n” In Geometric Aspects of Functional Analysis 1317, Lecture Notes in Mathematics Springer, 1988, pp. 84–106
  • [43] Michel Ledoux “A heat semigroup approach to concentration on the sphere and on a compact Riemannian manifold” In Geometric & Functional Analysis GAFA 2.2 Springer, 1992, pp. 221–224
  • [44] Scott Shaobing Chen, David L. Donoho and Michael A. Saunders “Atomic decomposition by basis pursuit” In SIAM Review 43.1, 2001, pp. 129–159
  • [45] Mark Rudelson and Roman Vershynin “On sparse reconstruction from Fourier and Gaussian measurements” In Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences 61.8 Wiley Online Library, 2008, pp. 1025–1045
  • [46] Samet Oymak and Babak Hassibi “New null space results and recovery thresholds for matrix rank minimization”, 2010 arXiv:1011.6326
  • [47] Nathan Srebro, Karthik Sridharan and Ambuj Tewari “Optimistic Rates for Learning with a Smooth Loss”, 2010 arXiv:1009.3896
  • [48] Roman Vershynin “Introduction to the non-asymptotic analysis of random matrices”, 2010 arXiv:1011.3027
  • [49] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo and Alan S. Willsky “The convex geometry of linear inverse problems” In Foundations of Computational Mathematics 12.6 Springer, 2012, pp. 805–849
  • [50] Gautam Kamath “Bounds on the Expectation of the Maximum of Samples from a Gaussian”, 2013 URL: http://www.gautamkamath.com/writings/gaussian_max.pdf
  • [51] Mihailo Stojnic “A framework to characterize performance of LASSO algorithms”, 2013 arXiv:1303.7291
  • [52] Shahar Mendelson “Learning without concentration” In Conference on Learning Theory, 2014, pp. 25–39 PMLR
  • [53] Shai Shalev-Shwartz and Shai Ben-David “Understanding Machine Learning: From Theory to Algorithms” Cambridge University Press, 2014
  • [54] Ramon van Handel “Probability in High Dimension”, Lecture notes, Princeton University, 2014 URL: https://web.math.princeton.edu/~rvan/APC550.pdf
  • [55] Behnam Neyshabur, Ryota Tomioka and Nathan Srebro “In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning” In International Conference on Learning Representations – Workshop, 2015 arXiv:1412.6614
  • [56] Christos Thrampoulidis, Samet Oymak and Babak Hassibi “Regularized linear regression: A precise analysis of the estimation error” In Conference on Learning Theory, 2015, pp. 1683–1709 PMLR
  • [57] Afonso S. Bandeira “Ten Lectures and Forty-Two Open Problems in the Mathematics of Data Science”, Lecture notes, MIT, 2016 URL: https://people.math.ethz.ch/~abandeira/TenLecturesFortyTwoProblems.pdf
  • [58] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht and Oriol Vinyals “Understanding deep learning requires rethinking generalization” In International Conference on Learning Representations, 2017 arXiv:1611.03530
  • [59] Samet Oymak and Joel A Tropp “Universality laws for randomized dimension reduction, with applications” In Information and Inference: A Journal of the IMA 7.3 Oxford University Press, 2018, pp. 337–446
  • [60] Roman Vershynin “High-dimensional probability: An introduction with applications in data science” Cambridge University Press, 2018
  • [61] Mikhail Belkin, Daniel Hsu, Siyuan Ma and Soumik Mandal “Reconciling modern machine learning practice and the bias-variance trade-off” In Proceedings of the National Academy of Sciences 116.32, 2019, pp. 15849–15854 arXiv:1812.11118
  • [62] Rick Durrett “Probability: theory and examples” Cambridge University Press, 2019
  • [63] Trevor Hastie, Andrea Montanari, Saharon Rosset and Ryan J. Tibshirani “Surprises in High-Dimensional Ridgeless Least Squares Interpolation”, 2019 arXiv:1903.08560
  • [64] Vaishnavh Nagarajan and J. Kolter “Uniform convergence may be unable to explain generalization in deep learning” In Advances in Neural Information Processing Systems, 2019 arXiv:1902.04742
  • [65] Peter L. Bartlett and Philip M. Long “Failures of model-dependent generalization bounds for least-norm interpolation”, 2020 arXiv:2010.08479
  • [66] Peter L. Bartlett, Philip M. Long, Gábor Lugosi and Alexander Tsigler “Benign overfitting in linear regression” In Proceedings of the National Academy of Sciences 117.48, 2020, pp. 30063–30070 arXiv:1906.11300
  • [67] Mikhail Belkin, Daniel Hsu and Ji Xu “Two models of double descent for weak features” In SIAM Journal on Mathematics of Data Science 2.4, 2020, pp. 1167–1180 arXiv:1903.07571
  • [68] Geoffrey Chinot and Matthieu Lerasle “On the robustness of the minimum 2\ell_{2} interpolator”, 2020 arXiv:2003.05838
  • [69] Peizhong Ju, Xiaojun Lin and Jia Liu “Overfitting Can Be Harmless for Basis Pursuit: Only to a Degree” In Advances in Neural Information Processing Systems, 2020 arXiv:2002.00492
  • [70] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian and Anant Sahai “Harmless interpolation of noisy data in regression” In IEEE Journal on Selected Areas in Information Theory, 2020 arXiv:1903.09139
  • [71] Jeffrey Negrea, Gintare Karolina Dziugaite and Daniel M. Roy “In Defense of Uniform Convergence: Generalization via derandomization with an application to interpolating predictors” In International Conference on Machine Learning, 2020 arXiv:1912.04265
  • [72] Alexander Tsigler and Peter L. Bartlett “Benign overfitting in ridge regression”, 2020 arXiv:2009.14286
  • [73] Lijia Zhou, Danica J. Sutherland and Nathan Srebro “On Uniform Convergence and Low-Norm Interpolation Learning” In Advances in Neural Information Processing Systems, 2020 arXiv:2006.05942
  • [74] Geoffrey Chinot, Matthias Löffler and Sara van de Geer “On the robustness of minimum-norm interpolators”, 2021 arXiv:2012.00807
  • [75] Zeyu Deng, Abla Kammoun and Christos Thrampoulidis “A model of double descent for high-dimensional binary linear classification” In Information and Inference: A Journal of the IMA, 2021 arXiv:1911.05822
  • [76] Zitong Yang, Yu Bai and Song Mei “Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models” In International Conference on Machine Learning, 2021 arXiv:2103.04554

Appendices

In the remainder of the paper, we give self-contained proofs of all results from the main text.

In Appendix A, we introduce some technical results that we will use in our analysis.

In Appendix B, we prove the main generalization bound (Theorem 1) and show its specialization to norm balls (Corollaries 1, 2 and 3).

In Appendix C, we prove upper bounds on the norm of the minimal-norm interpolator for a general norm (Theorem 4), and show applications to the Euclidean case (Theorem 2).

In Appendix D, we show how to combine the previous sets of results to give risk guarantees for the minimal norm interpolators (Theorems 3 and 5). In particular, Section D.2.1 shows the equivalence of conditions for consistency in the Euclidean norm setting.

In Appendix E, we provide full theorem statements and proofs of the results on 1\ell_{1} interpolation (basis pursuit) mentioned in Section 6.

Appendix A Preliminaries

We will first give some general results useful to the rest of the proofs. Most are standard, but a few are variations on existing results.

Concentration of Lipschitz functions.

Recall that a function f:nf:\mathbb{R}^{n}\to\mathbb{R} is LL-Lipschitz with respect to the norm \left\lVert\cdot\right\rVert if it holds for all x,ynx,y\in\mathbb{R}^{n} that |f(x)f(y)|Lxy|f(x)-f(y)|\leq L\|x-y\|. We use the concentration of Lipschitz functions of a Gaussian.

Theorem 6 ([54], Theorem 3.25).

If ff is LL-Lipschitz with respect to the Euclidean norm and ZN(0,In)Z\sim N(0,I_{n}), then

Pr(|f(Z)𝔼f(Z)|t)2et2/2L2.\Pr(|f(Z)-\operatorname*{\mathbb{E}}f(Z)|\geq t)\leq 2e^{-t^{2}/2L^{2}}. (23)

We also use a similar result for functions of a uniformly spherical vector [see 60, Theorem 5.1.4 and Exercise 5.1.12]; we cite a result with a sharp constant factor from [43].

Theorem 7 (Spherical concentration; [43]).

If ff is LL-Lipschitz with respect to the Euclidean norm and ZUni(Sn1)Z\sim\text{Uni}(S^{n-1}) where Sn1={un:u=1}S^{n-1}=\{u\in\mathbb{R}^{n}:\|u\|=1\} is the unit sphere, Uni(Sn1)\text{Uni}(S^{n-1}) is the uniform measure on the sphere, and n3n\geq 3, then

Pr(|f(Z)𝔼f(Z)|t)2e(n2)t2/2L2.\Pr(|f(Z)-\operatorname*{\mathbb{E}}f(Z)|\geq t)\leq 2e^{-(n-2)t^{2}/2L^{2}}. (24)

The following lemma, which we will use multiple times, says that an o(n)o(n)-dimensional subspace cannot align with a random spherically symmetric vector.

Lemma 1.

Suppose that SS is a fixed subspace of dimension dd in n\mathbb{R}^{n} with n4n\geq 4, PSP_{S} is the orthogonal projection onto SS, and VV is a spherically symmetric random vector (i.e. V/V2V/\|V\|_{2} is uniform on the sphere). Then

PSV2V2d/n+2log(2/δ)/n.\frac{\|P_{S}V\|_{2}}{\|V\|_{2}}\leq\sqrt{d/n}+2\sqrt{\log(2/\delta)/n}. (25)

with probability at least 1δ1-\delta. Conditional on this inequality holding, we therefore have uniformly for all sSs\in S that

|\langle s,V\rangle|=|\langle s,P_{S}V\rangle|\leq\|s\|_{2}\|P_{S}V\|_{2}\leq\|s\|_{2}\|V\|_{2}\left(\sqrt{d/n}+2\sqrt{\log(2/\delta)/n}\right). (26)
Proof.

This is trivial if dnd\geq n, since the left-hand side is at most 11. Thus assume without loss of generality that d<nd<n. By symmetry, it suffices to fix SS to be the span of basis vectors e1,,ede_{1},\ldots,e_{d} and to bound PSV2\|P_{S}V\|_{2} for VV a vector chosen uniformly at random from the unit sphere in n\mathbb{R}^{n}. Recall that for any coordinate ii, we have 𝔼Vi2=1/n\operatorname*{\mathbb{E}}V_{i}^{2}=1/n by symmetry among the coordinates and the fact that V22=1\|V\|_{2}^{2}=1 almost surely. The function vPSv2v\mapsto\|P_{S}v\|_{2} is a 11-Lipschitz function and 𝔼PSV2𝔼PSV22=d/n\operatorname*{\mathbb{E}}\|P_{S}V\|_{2}\leq\sqrt{\operatorname*{\mathbb{E}}\|P_{S}V\|_{2}^{2}}=\sqrt{d/n}, so by Theorem 7 above

\|P_{S}V\|_{2}\leq\sqrt{d/n}+\sqrt{2\log(2/\delta)/(n-2)}

with probability at least 1δ1-\delta. Using n4n\geq 4 gives the result. ∎
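
As an illustrative numerical sanity check of Lemma 1 (a sketch that is not part of the formal argument; it assumes numpy, and the values of n, d and δ below are arbitrary), one can estimate the ratio \|P_{S}V\|_{2}/\|V\|_{2} for a standard Gaussian vector, which is spherically symmetric, with S the span of the first d coordinates, and compare its empirical 1-δ quantile to the bound in (25):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, delta, trials = 1000, 50, 0.05, 2000

    # S = span of the first d coordinates, so P_S V just keeps the first d entries.
    ratios = []
    for _ in range(trials):
        v = rng.standard_normal(n)  # Gaussian, hence spherically symmetric
        ratios.append(np.linalg.norm(v[:d]) / np.linalg.norm(v))

    bound = np.sqrt(d / n) + 2 * np.sqrt(np.log(2 / delta) / n)
    print(f"empirical {1 - delta:.0%} quantile: {np.quantile(ratios, 1 - delta):.3f}")
    print(f"bound from Lemma 1:         {bound:.3f}")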

The concentration of the Euclidean norm of a Gaussian vector follows from Theorem 6; we state it explicitly below.

Lemma 2.

Suppose that ZN(0,In)Z\sim N(0,I_{n}). Then

Pr(|Z2n|t)4et2/4.\Pr(\left|\|Z\|_{2}-\sqrt{n}\right|\geq t)\leq 4e^{-t^{2}/4}. (27)
Proof.

First we recall the standard fact [see, e.g., 49] that

n1nn+1𝔼Z2n.\sqrt{n}-1\leq\frac{n}{\sqrt{n+1}}\leq\operatorname*{\mathbb{E}}\|Z\|_{2}\leq\sqrt{n}.

Because the norm is 1-Lipschitz, it follows from Theorem 6 that

Pr(|Z2𝔼Z2|t)2et2/2\Pr(\left|\|Z\|_{2}-\operatorname*{\mathbb{E}}\|Z\|_{2}\right|\geq t)\leq 2e^{-t^{2}/2}

so

Pr(|Z2n|t+1)2et2/2.\Pr(\left|\|Z\|_{2}-\sqrt{n}\right|\geq t+1)\leq 2e^{-t^{2}/2}.

Now using that (t1)2t2/21(t-1)^{2}\geq t^{2}/2-1 shows

Pr(|Z2n|t)2e(t2/21)/24et2/4.\Pr(\left|\|Z\|_{2}-\sqrt{n}\right|\geq t)\leq 2e^{-(t^{2}/2-1)/2}\leq 4e^{-t^{2}/4}.\qed
Wishart concentration.

We recall the notation for the Loewner order on symmetric matrices: ABA\preceq B means that BAB-A is positive semidefinite. Let σmin(A)\sigma_{\min}(A) denote the minimum singular value of an arbitrary matrix AA, and σmax\sigma_{\max} the maximum singular value. Similarly, let λmin(A)\lambda_{\min}(A) denote the minimum eigenvalue. We use A𝑜𝑝=σmax(A)\|A\|_{\mathit{op}}=\sigma_{\max}(A) to denote the operator norm of matrix AA.

Theorem 8 ([48], Corollary 5.35).

Let n,Nn,N\in\mathbb{N}. Let AN×nA\in\mathbb{R}^{N\times n} be a random matrix with entries i.i.d. N(0,1)N(0,1). Then for any t>0t>0, it holds with probability at least 12exp(t2/2)1-2\exp(-t^{2}/2) that

Nntσmin(A)σmax(A)N+n+t.\sqrt{N}-\sqrt{n}-t\leq\sigma_{\text{min}}(A)\leq\sigma_{\text{max}}(A)\leq\sqrt{N}+\sqrt{n}+t. (28)
Corollary 4.

Suppose X1,,XnN(0,Σ)X_{1},\ldots,X_{n}\sim N(0,\Sigma) are independent with Σ:d×d\Sigma:d\times d a positive semidefinite matrix, δ(0,1)\delta\in(0,1), and n4(d+2log(2/δ))n\geq 4(d+2\log(2/\delta)). Let Σ^=1niXiXiT\hat{\Sigma}=\frac{1}{n}\sum_{i}X_{i}X_{i}^{T} be the empirical covariance matrix. Then with probability at least 1δ1-\delta,

(1ϵ)ΣΣ^(1+ϵ)Σ(1-\epsilon)\Sigma\preceq\hat{\Sigma}\preceq(1+\epsilon)\Sigma (29)

with ϵ=3d/n+32log(2/δ)/n\epsilon=3\sqrt{d/n}+3\sqrt{2\log(2/\delta)/n}.

Proof.

Let X:n×dX:n\times d be the random matrix with rows X1,,XnX_{1},\ldots,X_{n} so that Σ^=1nXTX\hat{\Sigma}=\frac{1}{n}X^{T}X. By equality in distribution, we can take Z:n×dZ:n\times d to have N(0,1)N(0,1) independent entries and write X=ZΣ1/2X=Z\Sigma^{1/2} and

\Sigma^{-1/2}\hat{\Sigma}\Sigma^{-1/2}=\frac{1}{n}\Sigma^{-1/2}X^{T}X\Sigma^{-1/2}=\frac{1}{n}Z^{T}Z.

By definition of singular values, from Theorem 8 the eigenvalues of ZTZ/nZ^{T}Z/n are bounded between (1d/nt2/n)2(1-\sqrt{d/n}-\sqrt{t^{2}/n})^{2} and (1+d/n+t2/n)2(1+\sqrt{d/n}+\sqrt{t^{2}/n})^{2}. Since 1(1x)2(1+x)211-(1-x)^{2}\leq(1+x)^{2}-1, using the inequality (1+x)21+3x(1+x)^{2}\leq 1+3x for x[0,1]x\in[0,1], we have shown that

IΣ1/2Σ^Σ1/2𝑜𝑝(1+d/n+t2/n)213d/n+3t2/n.\|I-\Sigma^{-1/2}\hat{\Sigma}\Sigma^{-1/2}\|_{\mathit{op}}\leq(1+\sqrt{d/n}+\sqrt{t^{2}/n})^{2}-1\leq 3\sqrt{d/n}+3\sqrt{t^{2}/n}.

Rewriting and taking t2=2log(2/δ)t^{2}=2\log(2/\delta) gives the result. ∎
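
As an illustrative numerical check of Corollary 4 (again a sketch assuming numpy, with an arbitrary diagonal Σ chosen only for convenience), one can verify that the eigenvalues of \Sigma^{-1/2}\hat{\Sigma}\Sigma^{-1/2} typically fall well inside [1-\epsilon,1+\epsilon]:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, delta = 4000, 20, 0.05

    Sigma = np.diag(np.linspace(0.5, 2.0, d))          # a fixed positive definite covariance
    X = rng.standard_normal((n, d)) @ np.sqrt(Sigma)   # rows ~ N(0, Sigma); entrywise sqrt is valid since Sigma is diagonal
    Sigma_hat = X.T @ X / n

    W = np.diag(1.0 / np.sqrt(np.diag(Sigma)))         # Sigma^{-1/2} for a diagonal Sigma
    eigs = np.linalg.eigvalsh(W @ Sigma_hat @ W)
    eps = 3 * np.sqrt(d / n) + 3 * np.sqrt(2 * np.log(2 / delta) / n)
    print(f"whitened eigenvalues in [{eigs.min():.3f}, {eigs.max():.3f}]; "
          f"Corollary 4 interval: [{1 - eps:.3f}, {1 + eps:.3f}]")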

Gaussian Minmax Theorem.

The following result is Theorem 3 of [56], known as the Convex Gaussian Minmax Theorem or CGMT (see also Theorem 1 in the same reference). As explained there, it is a consequence of the main result of [41], known as Gordon’s Theorem or the Gaussian Minmax Theorem. Despite the name, convexity is only required for one of the theorem’s conclusions.

Theorem 9 (Convex Gaussian Minmax Theorem; [56, 41]).

Let Z:n×dZ:n\times d be a matrix with i.i.d. N(0,1)N(0,1) entries and suppose GN(0,In)G\sim N(0,I_{n}) and HN(0,Id)H\sim N(0,I_{d}) are independent of ZZ and each other. Let Sw,SuS_{w},S_{u} be compact sets and ψ:Sw×Su\psi:S_{w}\times S_{u}\to\mathbb{R} be an arbitrary continuous function. Define the Primary Optimization (PO) problem

Φ(Z):=minwSwmaxuSuu,Zw+ψ(w,u)\Phi(Z):=\min_{w\in S_{w}}\max_{u\in S_{u}}\langle u,Zw\rangle+\psi(w,u) (30)

and the Auxiliary Optimization (AO) problem

ϕ(G,H):=minwSwmaxuSuw2G,u+u2H,w+ψ(w,u).\phi(G,H):=\min_{w\in S_{w}}\max_{u\in S_{u}}\|w\|_{2}\langle G,u\rangle+\|u\|_{2}\langle H,w\rangle+\psi(w,u). (31)

Under these assumptions, Pr(Φ(Z)<c)2Pr(ϕ(G,H)c)\Pr(\Phi(Z)<c)\leq 2\Pr(\phi(G,H)\leq c) for any cc\in\mathbb{R}.

Furthermore, if we suppose that Sw,SuS_{w},S_{u} are convex sets and ψ(w,u)\psi(w,u) is convex in ww and concave in uu, then Pr(Φ(Z)>c)2Pr(ϕ(G,H)c)\Pr(\Phi(Z)>c)\leq 2\Pr(\phi(G,H)\geq c).

In other words, the first conclusion says that high probability lower bounds on the auxiliary optimization ϕ(G,H)\phi(G,H) imply high probability lower bounds on the primary optimization Φ(Z)\Phi(Z). Importantly, this direction holds without any convexity assumptions. Under the additional convexity assumptions, the second conclusion gives a similar comparison of high probability upper bounds.
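
As a classical illustration of the first conclusion (a standard special case, recorded here only as a sanity check), take S_{w}=S^{d-1}, S_{u}=S^{n-1} and \psi\equiv 0 with n\geq d. Then

\Phi(Z)=\min_{\|w\|_{2}=1}\max_{\|u\|_{2}=1}\langle u,Zw\rangle=\min_{\|w\|_{2}=1}\|Zw\|_{2}=\sigma_{\min}(Z)\quad\text{and}\quad\phi(G,H)=\min_{\|w\|_{2}=1}\left(\|G\|_{2}+\langle H,w\rangle\right)=\|G\|_{2}-\|H\|_{2},

so \Pr(\sigma_{\min}(Z)<c)\leq 2\Pr(\|G\|_{2}-\|H\|_{2}\leq c); combined with Lemma 2, this recovers, up to constants in the failure probability, the lower bound on the smallest singular value from Theorem 8.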

In our analysis, we need a slightly more general statement of the Gaussian Minmax Theorem than Theorem 9: we need the minmax formulation to include additional variables which only affect the deterministic term in the minmax problem. It’s straightforward to prove this result by repeating the argument in [56]; below we give an alternative proof which reduces to Theorem 9, by introducing extremely small extra dimensions to contain the extra variables. Intuitively, this works because the statement of the GMT allows for arbitrary continuous functions ψ\psi, with no dependence on their quantitative smoothness.

Theorem 10 (Variant of GMT).

Let Z:n×dZ:n\times d be a matrix with i.i.d. N(0,1)N(0,1) entries and suppose GN(0,In)G\sim N(0,I_{n}) and HN(0,Id)H\sim N(0,I_{d}) are independent of ZZ and each other. Let SW,SUS_{W},S_{U} be compact sets in d×d\mathbb{R}^{d}\times\mathbb{R}^{d^{\prime}} and n×n\mathbb{R}^{n}\times\mathbb{R}^{n^{\prime}} respectively, and let ψ:SW×SU\psi:S_{W}\times S_{U}\to\mathbb{R} be an arbitrary continuous function. Define the Primary Optimization (PO) problem

Φ(Z):=min(w,w)SWmax(u,u)SUu,Zw+ψ((w,w),(u,u))\Phi(Z):=\min_{(w,w^{\prime})\in S_{W}}\max_{(u,u^{\prime})\in S_{U}}\langle u,Zw\rangle+\psi((w,w^{\prime}),(u,u^{\prime})) (32)

and the Auxiliary Optimization (AO) problem

ϕ(G,H):=min(w,w)SWmax(u,u)SUw2G,u+u2H,w+ψ((w,w),(u,u)).\phi(G,H):=\min_{(w,w^{\prime})\in S_{W}}\max_{(u,u^{\prime})\in S_{U}}\|w\|_{2}\langle G,u\rangle+\|u\|_{2}\langle H,w\rangle+\psi((w,w^{\prime}),(u,u^{\prime})). (33)

Under these assumptions, Pr(Φ(Z)<c)2Pr(ϕ(G,H)c)\Pr(\Phi(Z)<c)\leq 2\Pr(\phi(G,H)\leq c) for any cc\in\mathbb{R}.

Proof.

Let ϵ(0,1)\epsilon\in(0,1) be arbitrary and

SW,ϵ:={(w,ϵw):(w,w)SW},SU,ϵ:={(u,ϵu):(u,u)SU}.S_{W,\epsilon}:=\{(w,\epsilon w^{\prime}):(w,w^{\prime})\in S_{W}\},\quad S_{U,\epsilon}:=\{(u,\epsilon u^{\prime}):(u,u^{\prime})\in S_{U}\}.

Define ψϵ((w,w),(u,u)):=ψ((w,1ϵw),(u,1ϵu))\psi_{\epsilon}((w,w^{\prime}),(u,u^{\prime})):=\psi((w,\frac{1}{\epsilon}w^{\prime}),(u,\frac{1}{\epsilon}u^{\prime})) so that if W=(w,ϵw)W=(w,\epsilon w^{\prime}) and U=(u,ϵu)U=(u,\epsilon u^{\prime}), then ψϵ(W,U)=ψ((w,w),(u,u))\psi_{\epsilon}(W,U)=\psi((w,w^{\prime}),(u,u^{\prime})). We also define Sw={wd:ws.t.(w,w)SW}S_{w}=\{w\in\mathbb{R}^{d}:\exists w^{\prime}\,s.t.\,(w,w^{\prime})\in S_{W}\}. The other sets Sw,SuS_{w^{\prime}},S_{u} and SuS_{u^{\prime}} are defined similarly. It is clear that Sw,Sw,Su,SuS_{w},S_{w^{\prime}},S_{u},S_{u^{\prime}}, SW,ϵS_{W,\epsilon} and SU,ϵS_{U,\epsilon} are all still compact in their respective topology, and ψϵ\psi_{\epsilon} is continuous for every ϵ>0\epsilon>0.

Let Z:(n+n)×(d+d)Z^{\prime}:(n+n^{\prime})\times(d+d^{\prime}) be a matrix with i.i.d. N(0,1)N(0,1) entries such that the top left n×dn\times d matrix is ZZ. Similarly, we define GG^{\prime} to be a (n+n)(n+n^{\prime})-dimensional Gaussian vector with independent coordinates such that the first nn coordinates are GG, and HH^{\prime} to be a (d+d)(d+d^{\prime})-dimensional Gaussian vector with independent coordinates such that the first dd coordinates are HH. Next, consider the augmented PO and AO:

Φϵ(Z):=minWSW,ϵmaxUSU,ϵU,ZW+ψϵ(W,U)ϕϵ(G,H):=minWSW,ϵmaxUSU,ϵW2G,U+U2H,W+ψϵ(W,U)\begin{split}\Phi_{\epsilon}(Z^{\prime})&:=\min_{W\in S_{W,\epsilon}}\max_{U\in S_{U,\epsilon}}\langle U,Z^{\prime}W\rangle+\psi_{\epsilon}(W,U)\\ \phi_{\epsilon}(G^{\prime},H^{\prime})&:=\min_{W\in S_{W,\epsilon}}\max_{U\in S_{U,\epsilon}}\|W\|_{2}\langle G^{\prime},U\rangle+\|U\|_{2}\langle H^{\prime},W\rangle+\psi_{\epsilon}(W,U)\\ \end{split} (34)

It is clear that for a small value of ϵ\epsilon, the augmented problem will be close to the original problem. More precisely, for every (w,w)SW(w,w^{\prime})\in S_{W} and (u,u)SU(u,u^{\prime})\in S_{U}

\begin{split}|\langle(u,\epsilon&u^{\prime}),Z^{\prime}(w,\epsilon w^{\prime})\rangle-\langle u,Zw\rangle|\\ &=|\epsilon\langle(0,u^{\prime}),Z^{\prime}(w,0)\rangle+\epsilon\langle(u,0),Z^{\prime}(0,w^{\prime})\rangle+\epsilon^{2}\langle(0,u^{\prime}),Z^{\prime}(0,w^{\prime})\rangle|\\ &\leq\epsilon(R(S_{w})+R(S_{w^{\prime}}))(R(S_{u})+R(S_{u^{\prime}}))\|Z^{\prime}\|_{\mathit{op}}=\epsilon A\|Z^{\prime}\|_{\mathit{op}}\\ \end{split} (35)

where A:=(R(Sw)+R(Sw))(R(Su)+R(Su))A:=(R(S_{w})+R(S_{w^{\prime}}))(R(S_{u})+R(S_{u^{\prime}})) is deterministic and does not depend on ϵ\epsilon. Similarly, it is routine to check

w2G,u=w2(G,(u,ϵu)ϵG,(0,u))u2H,w=u2(H,(w,ϵw)ϵH,(0,w))\begin{split}\|w\|_{2}\langle G,u\rangle&=\|w\|_{2}(\langle G^{\prime},(u,\epsilon u^{\prime})\rangle-\epsilon\langle G^{\prime},(0,u^{\prime})\rangle)\\ \|u\|_{2}\langle H,w\rangle&=\|u\|_{2}(\langle H^{\prime},(w,\epsilon w^{\prime})\rangle-\epsilon\langle H^{\prime},(0,w^{\prime})\rangle)\\ \end{split}

so by the triangle inequality and Cauchy-Schwarz inequality, we have

|(w,ϵw)2G,(u,ϵu)w2G,u|ϵR(Sw)G2(R(Su)+ϵR(Su))+ϵR(Sw)G2R(Su)ϵAG2\begin{split}\big{\lvert}\|(w,\epsilon&w^{\prime})\|_{2}\langle G^{\prime},(u,\epsilon u^{\prime})\rangle-\|w\|_{2}\langle G,u\rangle\big{\rvert}\\ &\leq\epsilon R(S_{w^{\prime}})\|G^{\prime}\|_{2}(R(S_{u})+\epsilon R(S_{u^{\prime}}))+\epsilon R(S_{w})\|G^{\prime}\|_{2}R(S_{u^{\prime}})\leq\epsilon A\|G^{\prime}\|_{2}\\ \end{split} (36)

and

|(u,ϵu)2H,(w,ϵw)u2H,w|ϵR(Su)H2(R(Sw)+ϵR(Sw))+ϵR(Su)H2R(Sw)ϵAH2\begin{split}\big{\lvert}\|(u,\epsilon&u^{\prime})\|_{2}\langle H^{\prime},(w,\epsilon w^{\prime})\rangle-\|u\|_{2}\langle H,w\rangle\big{\rvert}\\ \leq\,&\epsilon R(S_{u^{\prime}})\|H^{\prime}\|_{2}(R(S_{w})+\epsilon R(S_{w^{\prime}}))+\epsilon R(S_{u})\|H^{\prime}\|_{2}R(S_{w^{\prime}})\leq\epsilon A\|H^{\prime}\|_{2}\\ \end{split} (37)

From (35), it follows that

|Φϵ(Z)Φ(Z)|ϵAZ𝑜𝑝.\left|\Phi_{\epsilon}(Z^{\prime})-\Phi(Z)\right|\leq\epsilon A\|Z^{\prime}\|_{\mathit{op}}. (38)

Similarly, from (36) and (37), it follows that

|ϕϵ(G,H)ϕ(G,H)|ϵA(G2+H2).|\phi_{\epsilon}(G^{\prime},H^{\prime})-\phi(G,H)|\leq\epsilon A(\|G^{\prime}\|_{2}+\|H^{\prime}\|_{2}). (39)

Approximating the original PO and AO by (34) allows us to directly apply the Gaussian Minmax Theorem. For any cc\in\mathbb{R}, we have

Pr(Φ(Z)<c)Pr(Φϵ(Z)<c+ϵ)+Pr(ϵAZ𝑜𝑝>ϵ)2Pr(ϕϵ(G,H)c+ϵ)+Pr(ϵAZ𝑜𝑝>ϵ)2Pr(ϕ(G,H)c+2ϵ)+2Pr(ϵA(G2+H2)>ϵ)+Pr(ϵAZ𝑜𝑝>ϵ)2Pr(ϕ(G,H)c+2ϵ)+2Pr(G2>12Aϵ)+2Pr(H2>12Aϵ)+Pr(Z𝑜𝑝>1Aϵ)\begin{split}\Pr(\Phi(Z)<c)&\leq\Pr(\Phi_{\epsilon}(Z^{\prime})<c+\sqrt{\epsilon})+\Pr(\epsilon A\|Z^{\prime}\|_{\mathit{op}}>\sqrt{\epsilon})\\ &\leq 2\Pr(\phi_{\epsilon}(G^{\prime},H^{\prime})\leq c+\sqrt{\epsilon})+\Pr(\epsilon A\|Z^{\prime}\|_{\mathit{op}}>\sqrt{\epsilon})\\ &\leq 2\Pr(\phi(G^{\prime},H^{\prime})\leq c+2\sqrt{\epsilon})+2\Pr\left(\epsilon A(\|G^{\prime}\|_{2}+\|H^{\prime}\|_{2})>\sqrt{\epsilon}\right)\\ &\quad+\Pr(\epsilon A\|Z^{\prime}\|_{\mathit{op}}>\sqrt{\epsilon})\\ &\leq 2\Pr(\phi(G^{\prime},H^{\prime})\leq c+2\sqrt{\epsilon})+2\Pr\left(\|G^{\prime}\|_{2}>\frac{1}{2A\sqrt{\epsilon}}\right)\\ &\quad+2\Pr\left(\|H^{\prime}\|_{2}>\frac{1}{2A\sqrt{\epsilon}}\right)+\Pr\left(\|Z^{\prime}\|_{\mathit{op}}>\frac{1}{A\sqrt{\epsilon}}\right)\\ \end{split}

where we used (38) in the first inequality, Theorem 9 in the second inequality, and (39) in the third inequality. This holds for arbitrary ϵ>0\epsilon>0, and taking the limit ϵ0\epsilon\to 0 shows the result, because the CDF is right continuous [62] and the remaining terms go to zero by standard concentration inequalities (Lemma 2 and Theorem 8). ∎

Appendix B Uniform Convergence Bounds

We will now prove the main generalization bound, as well as its special cases in norm balls and specifically Euclidean norm balls.

B.1 General case: Proof of Theorem 1

For convenience, we restate the definition of covariance splitting here: See Definition 2

It follows from our definition that Σ1Σ2=0\Sigma_{1}\Sigma_{2}=0. Although our results in Appendix C require this orthogonality condition (in particular, Lemma 8), we note that all of our results here in Appendix B continue to hold as long as Σ=Σ1+Σ2\Sigma=\Sigma_{1}+\Sigma_{2} and both Σ1,Σ2\Sigma_{1},\Sigma_{2} are positive semi-definite. To apply the Gaussian Minimax Theorem, we first formulate the generalization gap as an optimization problem in terms of a random matrix with N(0,1)N(0,1) entries.

Lemma 3.

Under the model assumptions in (1), let 𝒦\mathcal{K} be an arbitrary compact set and Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2}. Define the primary optimization problem (PO) as

Φ:=max(w1,w2)𝒮Z1w1+Z2w2=ξw122+w222\Phi:=\max_{\begin{subarray}{c}(w_{1},w_{2})\in\mathcal{S}\\ Z_{1}w_{1}+Z_{2}w_{2}=\xi\end{subarray}}\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2} (40)

where

𝒮={(w1,w2):w𝒦s.t.w1=Σ11/2(ww) and w2=Σ21/2(ww)}\mathcal{S}=\{(w_{1},w_{2}):\exists w\in\mathcal{K}\,s.t.\,w_{1}=\Sigma_{1}^{1/2}(w-w^{*})\text{ and }w_{2}=\Sigma_{2}^{1/2}(w-w^{*})\} (41)

and Z1,Z2Z_{1},Z_{2} are both n×dn\times d random matrices with i.i.d. standard normal entries independent of ξ\xi and each other. Then the generalization gap of interpolators is equal in distribution to the sum of the Bayes risk and the PO:

maxw𝒦,L^(w)=0L(w)L^(w)=𝒟σ2+Φ.\max_{w\in\mathcal{K},\hat{L}(w)=0}L(w)-\hat{L}(w)\overset{\mathcal{D}}{=}\sigma^{2}+\Phi. (42)
Proof.

Recall that L(w)=σ2+wwΣ2L(w)=\sigma^{2}+\|w-w^{*}\|_{\Sigma}^{2} and L^(w)=0\hat{L}(w)=0 is equivalent to Y=XwY=Xw. Observe that

X=𝒟Z1Σ11/2+Z2Σ21/2andwΣ2=wΣ12+wΣ22X\overset{\mathcal{D}}{=}Z_{1}\Sigma_{1}^{1/2}+Z_{2}\Sigma_{2}^{1/2}\quad\text{and}\quad\|w\|_{\Sigma}^{2}=\|w\|_{\Sigma_{1}}^{2}+\|w\|_{\Sigma_{2}}^{2}

so we can decompose

maxw𝒦,L^(w)=0L(w)L^(w)=σ2+maxw𝒦,Y=XwwwΣ2=σ2+maxw𝒦,X(ww)=ξwwΣ2=𝒟σ2+maxw𝒦w(Z1Σ11/2+Z2Σ21/2)w=ξwΣ12+wΣ22=σ2+Φ.\begin{split}\max_{w\in\mathcal{K},\hat{L}(w)=0}L(w)-\hat{L}(w)&=\sigma^{2}+\max_{w\in\mathcal{K},Y=Xw}\|w-w^{*}\|_{\Sigma}^{2}\\ &=\sigma^{2}+\max_{w\in\mathcal{K},X(w-w^{*})=\xi}\|w-w^{*}\|_{\Sigma}^{2}\\ &\overset{\mathcal{D}}{=}\sigma^{2}+\max_{\begin{subarray}{c}w\in\mathcal{K}-w^{*}\\ (Z_{1}\Sigma_{1}^{1/2}+Z_{2}\Sigma_{2}^{1/2})w=\xi\end{subarray}}\|w\|_{\Sigma_{1}}^{2}+\|w\|_{\Sigma_{2}}^{2}=\sigma^{2}+\Phi.\qed\end{split}
Lemma 4 (Application of GMT).

In the same setting as Lemma 3, let GN(0,In),HN(0,Id)G\sim N(0,I_{n}),H\sim N(0,I_{d}) be Gaussian vectors independent of Z1,Z2,ξZ_{1},Z_{2},\xi and each other. With the same definition of 𝒮\mathcal{S}, define the auxiliary optimization problem (AO) as

ϕ:=max(w1,w2)𝒮ξZ1w1Gw222w2,Hw122+w222\phi:=\max_{\begin{subarray}{c}(w_{1},w_{2})\in\mathcal{S}\\ \left\lVert\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\right\rVert_{2}\leq\langle w_{2},H\rangle\end{subarray}}\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2} (43)

Then it holds that

Pr(Φ>t|Z1,ξ)2Pr(ϕt|Z1,ξ),\Pr(\Phi>t\,|\,Z_{1},\xi)\leq 2\Pr(\phi\geq t\,|\,Z_{1},\xi), (44)

and taking expectations we have

Pr(Φ>t)2Pr(ϕt).\Pr(\Phi>t)\leq 2\Pr(\phi\geq t). (45)
Proof.

By introducing Lagrange multipliers, we have

Φ=max(w1,w2)𝒮minλw122+w222+λ,Z2w2(ξZ1w1)=max(w1,w2)𝒮minλλ,Z2w2+w122+w222λ,ξZ1w1.\begin{split}\Phi&=\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{\lambda}\,\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}+\langle\lambda,Z_{2}w_{2}-(\xi-Z_{1}w_{1})\rangle\\ &=\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{\lambda}\,\langle\lambda,Z_{2}w_{2}\rangle+\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}-\langle\lambda,\xi-Z_{1}w_{1}\rangle.\\ \end{split}

By independence, the distribution of Z2Z_{2} remains the same after conditioning on Z1Z_{1} and ξ\xi and the randomness in Φ\Phi comes solely from Z2Z_{2}. Since the mapping from ww to (w1,w2)(w_{1},w_{2}) is continuous and 𝒦\mathcal{K} is compact, 𝒮\mathcal{S} is compact. To apply Theorem 10, we can take ψ(w1,w2,λ)=w122+w222λ,ξZ1w1\psi(w_{1},w_{2},\lambda)=\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}-\langle\lambda,\xi-Z_{1}w_{1}\rangle, which is clearly continuous. The only challenge is that the domain of λ\lambda is not compact, but we can handle it by a truncation argument. Define

Φr:=max(w1,w2)𝒮minλrλ,Z2w2+w122+w222λ,ξZ1w1\Phi_{r}:=\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{\|\lambda\|\leq r}\,\langle\lambda,Z_{2}w_{2}\rangle+\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}-\langle\lambda,\xi-Z_{1}w_{1}\rangle (46)

and observe that ΦΦr\Phi\leq\Phi_{r}, since the minimum in the definition of Φr\Phi_{r} ranges over a smaller set. The AO associated with Φr\Phi_{r} is

ϕr:=max(w1,w2)𝒮minλrw22G,λ+λ2H,w2+w122+w222λ,ξZ1w1=max(w1,w2)𝒮minλrλ2H,w2λ,ξZ1w1Gw22+w122+w222=max(w1,w2)𝒮min0λrλ(H,w2ξZ1w1Gw222)+w122+w222.\begin{split}\phi_{r}:&=\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{\|\lambda\|\leq r}\,\|w_{2}\|_{2}\langle G,\lambda\rangle+\|\lambda\|_{2}\langle H,w_{2}\rangle+\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}-\langle\lambda,\xi-Z_{1}w_{1}\rangle\\ &=\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{\|\lambda\|\leq r}\,\|\lambda\|_{2}\langle H,w_{2}\rangle-\langle\lambda,\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\rangle+\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}\\ &=\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{0\leq\lambda\leq r}\,\lambda\left(\langle H,w_{2}\rangle-\|\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\|_{2}\right)+\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}.\\ \end{split} (47)

We observe that the untruncated auxiliary problem ϕ\phi from (43) has a completely analogous form:

ϕ=max(w1,w2)𝒮minλ0λ(H,w2ξZ1w1Gw222)+w122+w222.\phi=\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{\lambda\geq 0}\,\lambda\left(\langle H,w_{2}\rangle-\|\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\|_{2}\right)+\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}.\\

This is because if H,w2ξZ1w1Gw2220\langle H,w_{2}\rangle-\|\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\|_{2}\geq 0 then the minimum is achieved at λ=0\lambda=0, and if w1,w2w_{1},w_{2} do not satisfy the constraint then taking λ\lambda\to\infty sends the minimum to -\infty. From this formulation, we see that ϕϕrϕs\phi\leq\phi_{r}\leq\phi_{s} for any rs0r\geq s\geq 0 since the minimum is taken over a larger set as rr grows, and is unconstrained in ϕ\phi.

The proof that limrϕr=ϕ\lim_{r\to\infty}\phi_{r}=\phi is an exercise in real analysis, which splits into two cases:

  1.

    The auxiliary problem ϕ\phi is infeasible. In this case, we know that for all (w1,w2)𝒮(w_{1},w_{2})\in\mathcal{S}

    H,w2ξZ1w1Gw222<0.\langle H,w_{2}\rangle-\|\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\|_{2}<0.

    By compactness of 𝒮\mathcal{S} and continuity of the right hand side, there exists μ=μ(ξ,Z1,G,H)<0\mu=\mu(\xi,Z_{1},G,H)<0 (in particular, independent of rr) such that

    H,w2ξZ1w1Gw222μ.\langle H,w_{2}\rangle-\|\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\|_{2}\leq\mu.

    Therefore, we show

    ϕrmax(w1,w2)𝒮min0λrλμ+w122+w222=rμ+max(w1,w2)𝒮w122+w222.\begin{split}\phi_{r}&\leq\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{0\leq\lambda\leq r}\,\lambda\mu+\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}\\ &=r\mu+\max_{(w_{1},w_{2})\in\mathcal{S}}\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}.\end{split}

    Since the second term is bounded and has no dependence on rr, taking rr\to\infty we have ϕr\phi_{r}\to-\infty as desired (since ϕ=\phi=-\infty by definition).

  2.

    The auxiliary problem ϕ\phi is feasible. In this case, we can let (w1(r),w2(r))𝒮(w_{1}(r),w_{2}(r))\in\mathcal{S} be an arbitrary maximizer achieving the objective ϕr\phi_{r} for each r0r\geq 0 by compactness. By compactness again, the sequence (w1(r),w2(r))r=1(w_{1}(r),w_{2}(r))_{r=1}^{\infty} at positive integer values of rr has a subsequential limit (w1(),w2())𝒮(w_{1}(\infty),w_{2}(\infty))\in\mathcal{S}, i.e. this point satisfies (w1(),w2())=limn(w1(rn),w2(rn))(w_{1}(\infty),w_{2}(\infty))=\lim_{n\to\infty}(w_{1}(r_{n}),w_{2}(r_{n})) for some sequence rnr_{n} satisfying rnnr_{n}\geq n.

    Suppose that (w1(),w2())(w_{1}(\infty),w_{2}(\infty)) does not satisfy the last constraint defining ϕ\phi, then by continuity, there exists μ<0\mu<0 and a sufficiently small ϵ>0\epsilon>0 such that for all w1w1()2ϵ\|w_{1}-w_{1}(\infty)\|_{2}\leq\epsilon and w2w2()2ϵ\|w_{2}-w_{2}(\infty)\|_{2}\leq\epsilon, we have

    H,w2ξZ1w1Gw222μ.\langle H,w_{2}\rangle-\|\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\|_{2}\leq\mu.

    This implies that for sufficiently large nn, we have

    H,w2(rn)ξZ1w1(rn)Gw2(rn)22μ\langle H,w_{2}(r_{n})\rangle-\|\xi-Z_{1}w_{1}(r_{n})-G\|w_{2}(r_{n})\|_{2}\|_{2}\leq\mu

    and

    ϕrnrnμ+w1(rn)22+w2(rn)22rnμ+max(w1,w2)𝒮w122+w222\begin{split}\phi_{r_{n}}&\leq r_{n}\mu+\|w_{1}(r_{n})\|_{2}^{2}+\|w_{2}(r_{n})\|_{2}^{2}\\ &\leq r_{n}\mu+\max_{(w_{1},w_{2})\in\mathcal{S}}\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}\\ \end{split}

    so ϕrn\phi_{r_{n}}\to-\infty – but this is impossible, since considering any feasible element of ϕ\phi we can show that ϕrn0\phi_{r_{n}}\geq 0. By contradiction, we find that (w1(),w2())(w_{1}(\infty),w_{2}(\infty)) is feasible for ϕ\phi.

    By taking λ=0\lambda=0 in the definition of ϕr\phi_{r} we have

    ϕrnw1(rn)22+w2(rn)22.\phi_{r_{n}}\leq\|w_{1}(r_{n})\|_{2}^{2}+\|w_{2}(r_{n})\|_{2}^{2}.

    By continuity, we show that

    limsupnϕrnlimnw1(rn)22+w2(rn)22=w1()22+w2()22ϕ.\begin{split}\lim\sup_{n\to\infty}\phi_{r_{n}}&\leq\lim_{n\to\infty}\|w_{1}(r_{n})\|_{2}^{2}+\|w_{2}(r_{n})\|_{2}^{2}\\ &=\|w_{1}(\infty)\|_{2}^{2}+\|w_{2}(\infty)\|_{2}^{2}\leq\phi.\end{split}

    Since ϕrnϕ\phi_{r_{n}}\geq\phi, the limit of ϕrn\phi_{r_{n}} exists and equals ϕ\phi. We can conclude that limrϕr=ϕ\lim_{r\to\infty}\phi_{r}=\phi because ϕr\phi_{r} is a monotone decreasing function of rr.

By our version of the Gaussian Minmax Theorem, Theorem 10,

Pr(Φr>t|Z1,ξ)=Pr(Φr<t|Z1,ξ)2Pr(ϕrt|Z1,ξ)=2Pr(ϕrt|Z1,ξ)\Pr(\Phi_{r}>t|Z_{1},\xi)=\Pr(-\Phi_{r}<-t|Z_{1},\xi)\leq 2\Pr(-\phi_{r}\leq-t|Z_{1},\xi)=2\Pr(\phi_{r}\geq t|Z_{1},\xi)

We introduce the negative signs here because we originally have a max-min problem instead of a min-max problem; this means the comparison theorem gives an upper bound, instead of a lower bound, on the quantity of interest.

Finally, we can conclude

Pr(Φ>t|Z1,ξ)infr0Pr(Φr>t|Z1,ξ)2infr0Pr(ϕrt|Z1,ξ)2Pr(ϕt|Z1,ξ).\Pr(\Phi>t|Z_{1},\xi)\leq\inf_{r\geq 0}\Pr(\Phi_{r}>t|Z_{1},\xi)\leq 2\inf_{r\geq 0}\Pr(\phi_{r}\geq t|Z_{1},\xi)\leq 2\Pr(\phi\geq t|Z_{1},\xi).

where the last step uses continuity (from above) of the probability measure and the fact that ϕr\phi_{r} monotonically decreases to ϕ\phi almost surely. ∎

Recall the definition of Gaussian width and radius:

See Definition 1

It remains to analyze the auxiliary problem, which we do in the following lemma:

Lemma 5.

Let β=33log(32/δ)n+18rank(Σ1)n\beta=33\sqrt{\frac{\log(32/\delta)}{n}}+18\sqrt{\frac{\operatorname{rank}(\Sigma_{1})}{n}}. If nn is sufficiently large such that β1\beta\leq 1, then with probability at least 1δ1-\delta, it holds that

ϕ1+βn(W(Σ21/2𝒦)+rad(Σ21/2𝒦)2log(16/δ)+wΣ22log(16/δ))2σ2.\phi\leq\frac{1+\beta}{n}(W(\Sigma_{2}^{1/2}\mathcal{K})+\operatorname{rad}(\Sigma_{2}^{1/2}\mathcal{K})\sqrt{2\log(16/\delta)}+\|w^{*}\|_{\Sigma_{2}}\sqrt{2\log(16/\delta)})^{2}-\sigma^{2}. (48)
Proof.

For notational simplicity, define

α:=2log(32/δ)nγ:=3rank(Σ1)n+32log(16/δ)nρ:=rank(Σ1)+1n+2log(16/δ)n.\begin{split}\alpha&:=2\sqrt{\frac{\log(32/\delta)}{n}}\\ \gamma&:=3\sqrt{\frac{\operatorname{rank}(\Sigma_{1})}{n}}+3\sqrt{\frac{2\log(16/\delta)}{n}}\\ \rho&:=\sqrt{\frac{\operatorname{rank}(\Sigma_{1})+1}{n}}+2\sqrt{\frac{\log(16/\delta)}{n}}.\end{split}

By a union bound, the following collection of events, which together we call \mathcal{E}, occurs with probability at least 1δ1-\delta:

  1.

    (Approximate orthogonality.) By Lemma 1, uniformly over all w1Σ11/2(𝒦w)w_{1}\in\Sigma_{1}^{1/2}(\mathcal{K}-w^{*}) and aa\in\mathbb{R}, it holds that

    |ξaZ1w1,G|ξaZ1w12G2ρ|\langle\xi a-Z_{1}w_{1},G\rangle|\leq\|\xi a-Z_{1}w_{1}\|_{2}\|G\|_{2}\rho (49)

    and

    |ξ,Z1w1|ξ2Z1w12ρ.|\langle\xi,Z_{1}w_{1}\rangle|\leq\|\xi\|_{2}\|Z_{1}w_{1}\|_{2}\rho. (50)
  2.

    (Approximate isometry.) By Corollary 4, uniformly over all w1Σ11/2(𝒦w)w_{1}\in\Sigma_{1}^{1/2}(\mathcal{K}-w^{*}), it holds that

    (1γ)w122Z1w122n(1+γ)w122.(1-\gamma)\|w_{1}\|_{2}^{2}\leq\frac{\|Z_{1}w_{1}\|_{2}^{2}}{n}\leq(1+\gamma)\|w_{1}\|_{2}^{2}. (51)
  3.

    (Typical norm of GG and ξ\xi.) By Lemma 2, it holds that

    α1nG21α-\alpha\leq\frac{1}{\sqrt{n}}\|G\|_{2}-1\leq\alpha (52)

    and

    ασ1nξ2σασ.-\alpha\sigma\leq\frac{1}{\sqrt{n}}\|\xi\|_{2}-\sigma\leq\alpha\sigma. (53)
  4.

    (Typical size of Σ21/2w,H\langle\Sigma_{2}^{1/2}w^{*},H\rangle.) By the standard Gaussian tail bound Pr(|Z|t)2et2/2\Pr(|Z|\geq t)\leq 2e^{-t^{2}/2}, it holds that

    |Σ21/2w,H|wΣ22log(16/δ)|\langle\Sigma_{2}^{1/2}w^{*},H\rangle|\leq\|w^{*}\|_{\Sigma_{2}}\sqrt{2\log(16/\delta)} (54)

    because the marginal law of Σ21/2w,H\langle\Sigma_{2}^{1/2}w^{*},H\rangle is N(0,wΣ22)N(0,\|w^{*}\|_{\Sigma_{2}}^{2}).

  5.

    (Gaussian process concentration.) By Theorem 6, it holds that

    maxw2Σ21/2𝒦|w2,H|W(Σ21/2𝒦)+rad(Σ21/2𝒦)2log(16/δ)\max_{w_{2}\in\Sigma_{2}^{1/2}\mathcal{K}}|\langle w_{2},H\rangle|\leq W(\Sigma_{2}^{1/2}\mathcal{K})+\operatorname{rad}(\Sigma_{2}^{1/2}\mathcal{K})\sqrt{2\log(16/\delta)} (55)

    because maxw2Σ21/2𝒦|w2,H|\max_{w_{2}\in\Sigma_{2}^{1/2}\mathcal{K}}|\langle w_{2},H\rangle| is a rad(Σ21/2𝒦)\operatorname{rad}(\Sigma_{2}^{1/2}\mathcal{K})-Lipschitz function of HH.

From now on, the argument is conditional on the event \mathcal{E} defined above. By squaring the last constraint in the definition of ϕ\phi we see that

w2,H2ξZ1w1w22G22=ξZ1w122+w222G222ξZ1w1,w22G(1ρ)[ξZ1w122+w222G22]\begin{split}\langle w_{2},H\rangle^{2}&\geq\|\xi-Z_{1}w_{1}-\|w_{2}\|_{2}G\|_{2}^{2}\\ &=\left\lVert\xi-Z_{1}w_{1}\right\rVert_{2}^{2}+\|w_{2}\|_{2}^{2}\left\lVert G\right\rVert_{2}^{2}-2\langle\xi-Z_{1}w_{1},\|w_{2}\|_{2}G\rangle\\ &\geq(1-\rho)[\left\lVert\xi-Z_{1}w_{1}\right\rVert_{2}^{2}+\|w_{2}\|_{2}^{2}\left\lVert G\right\rVert_{2}^{2}]\\ \end{split}

where in the last line we used (49) and the AM-GM inequality (aba2/2+b2/2ab\leq a^{2}/2+b^{2}/2). Rearranging gives the inequality

w222(1ρ)1w2,H2ξZ1w122G22(1ρ)1w2,H2(1ρ)[ξ22+Z1w122]G22(1ρ)1w2,H2(1ρ)[ξ22+Z1w122](1α)2n(1γ)(1ρ)(1α)2w122+(1ρ)1w2,H2(1ρ)ξ22(1α)2n\begin{split}\|w_{2}\|_{2}^{2}&\leq\frac{(1-\rho)^{-1}\langle w_{2},H\rangle^{2}-\left\lVert\xi-Z_{1}w_{1}\right\rVert_{2}^{2}}{\|G\|_{2}^{2}}\\ &\leq\frac{(1-\rho)^{-1}\langle w_{2},H\rangle^{2}-(1-\rho)[\left\lVert\xi\right\rVert_{2}^{2}+\left\lVert Z_{1}w_{1}\right\rVert_{2}^{2}]}{\|G\|_{2}^{2}}\\ &\leq\frac{(1-\rho)^{-1}\langle w_{2},H\rangle^{2}-(1-\rho)[\left\lVert\xi\right\rVert_{2}^{2}+\left\lVert Z_{1}w_{1}\right\rVert_{2}^{2}]}{(1-\alpha)^{2}n}\\ &\leq-\frac{(1-\gamma)(1-\rho)}{(1-\alpha)^{2}}\left\lVert w_{1}\right\rVert_{2}^{2}+\frac{(1-\rho)^{-1}\langle w_{2},H\rangle^{2}-(1-\rho)\left\lVert\xi\right\rVert_{2}^{2}}{(1-\alpha)^{2}n}\\ \end{split}

where in the second inequality we used (50) and the AM-GM inequality again, in the third inequality we used (52) and in the last inequality we used (51). This shows

(1γ)(1ρ)(w122+w222)(1γ)(1ρ)(1α)2w122+w222(1ρ)1w2,H2(1ρ)ξ22(1α)2n.\begin{split}(1-\gamma)(1-\rho)(\left\lVert w_{1}\right\rVert_{2}^{2}+\left\lVert w_{2}\right\rVert_{2}^{2})&\leq\frac{(1-\gamma)(1-\rho)}{(1-\alpha)^{2}}\left\lVert w_{1}\right\rVert_{2}^{2}+\|w_{2}\|_{2}^{2}\\ &\leq\frac{(1-\rho)^{-1}\langle w_{2},H\rangle^{2}-(1-\rho)\left\lVert\xi\right\rVert_{2}^{2}}{(1-\alpha)^{2}n}.\\ \end{split}

Dividing through by the first two factors on the left hand side and plugging in (53) gives

w122+w222(1ρ)2w2,H2ξ22(1γ)(1α)2n1(1γ)(1α)2(1ρ)2w2,H2nσ21γ.\begin{split}\left\lVert w_{1}\right\rVert_{2}^{2}+\left\lVert w_{2}\right\rVert_{2}^{2}&\leq\frac{(1-\rho)^{-2}\langle w_{2},H\rangle^{2}-\left\lVert\xi\right\rVert_{2}^{2}}{(1-\gamma)(1-\alpha)^{2}n}\\ &\leq\frac{1}{(1-\gamma)(1-\alpha)^{2}(1-\rho)^{2}}\frac{\langle w_{2},H\rangle^{2}}{n}-\frac{\sigma^{2}}{1-\gamma}.\\ \end{split}

We can simplify the first term by defining β=(1γ)1(1α)2(1ρ)21\beta=(1-\gamma)^{-1}(1-\alpha)^{-2}(1-\rho)^{-2}-1 and the second term by observing σ21γσ2-\frac{\sigma^{2}}{1-\gamma}\leq-\sigma^{2}. Finally, plugging into (43) gives

ϕmax(w1,w2)S(1+β)w2,H2nσ2=1+βnmaxw2Σ21/2(𝒦w)|w2,H|2σ21+βn(maxw2Σ21/2𝒦|w2,H|+|Σ21/2w,H|)2σ2\begin{split}\phi&\leq\max_{(w_{1},w_{2})\in S}(1+\beta)\frac{\langle w_{2},H\rangle^{2}}{n}-\sigma^{2}\\ &=\frac{1+\beta}{n}\max_{w_{2}\in\Sigma_{2}^{1/2}(\mathcal{K}-w^{*})}|\langle w_{2},H\rangle|^{2}-\sigma^{2}\\ &\leq\frac{1+\beta}{n}\left(\max_{w_{2}\in\Sigma_{2}^{1/2}\mathcal{K}}|\langle w_{2},H\rangle|+|\langle\Sigma_{2}^{1/2}w^{*},H\rangle|\right)^{2}-\sigma^{2}\\ \end{split}

by the triangle inequality, and (48) follows by (54) and (55). To deduce the explicit bound for β\beta, first use that

(1α)2=12α+α212α(1-\alpha)^{2}=1-2\alpha+\alpha^{2}\geq 1-2\alpha

and similarly (1ρ)212ρ(1-\rho)^{2}\geq 1-2\rho to show

1(1γ)(1α)2(1ρ)21(1γ)(12ρ)(12α).\frac{1}{(1-\gamma)(1-\alpha)^{2}(1-\rho)^{2}}\leq\frac{1}{(1-\gamma)(1-2\rho)(1-2\alpha)}.

If γ,ρ<1/2\gamma,\rho<1/2, then

(1γ)(12α)(12ρ)=1γ2α2ρ+2γα+2γρ+4αρ4γαρ1γ2α2ρ4γαρ>12γ2α2ρ.\begin{split}(1-\gamma)(1-2\alpha)(1-2\rho)&=1-\gamma-2\alpha-2\rho+2\gamma\alpha+2\gamma\rho+4\alpha\rho-4\gamma\alpha\rho\\ &\geq 1-\gamma-2\alpha-2\rho-4\gamma\alpha\rho>1-2\gamma-2\alpha-2\rho.\\ \end{split}

Provided that 2γ+2α+2ρ<1/22\gamma+2\alpha+2\rho<1/2 (which implies that γ,ρ<1/2\gamma,\rho<1/2), we can use the inequality (1x)11+2x(1-x)^{-1}\leq 1+2x for x[0,1/2]x\in[0,1/2] to show that

1(1γ)(1α)2(1ρ)2112γ2α2ρ1+4γ+4α+4ρ\frac{1}{(1-\gamma)(1-\alpha)^{2}(1-\rho)^{2}}\leq\frac{1}{1-2\gamma-2\alpha-2\rho}\leq 1+4\gamma+4\alpha+4\rho

and thus we can choose

β=33log(32/δ)n+18rank(Σ1)n4γ+4α+4ρ\beta=33\sqrt{\frac{\log(32/\delta)}{n}}+18\sqrt{\frac{\operatorname{rank}(\Sigma_{1})}{n}}\geq 4\gamma+4\alpha+4\rho\qed

We are finally ready to prove our main generalization bound:

See Theorem 1

Proof.

By Lemmas 3 and 4, we show that for any tt

Pr(maxw𝒦,L^(w)=0L(w)L^(w)>t)=Pr(Φ>tσ2)2Pr(ϕtσ2).\Pr\left(\max_{w\in\mathcal{K},\hat{L}(w)=0}L(w)-\hat{L}(w)>t\right)=\Pr(\Phi>t-\sigma^{2})\leq 2\Pr(\phi\geq t-\sigma^{2}).

By Lemma 5, the above is upper bounded by δ\delta if we set tσ2t-\sigma^{2} according to (48) with δ\delta replaced by δ/2\delta/2. Observe that the σ2\sigma^{2} term cancels, and the proof is complete. ∎

Remark 2 (Translation-invariant version).

Our generalization guarantee is stated in terms of W()W(\cdot) and rad()\operatorname{rad}(\cdot), which are not translation-invariant. However, the generalization guarantee of Theorem 1 can be made translation invariant, e.g. replacing W(Σ21/2𝒦)W(\Sigma_{2}^{1/2}\mathcal{K}) by W(Σ21/2(𝒦a))W(\Sigma_{2}^{1/2}(\mathcal{K}-a)) for an arbitrary ada\in\mathbb{R}^{d}, by recentering the problem before applying Theorem 1, i.e. by subtracting XaXa from both sides of the interpolation constraint Xw=Xw+ξXw=Xw^{*}+\xi.

We also note that in Theorem 1, there is no requirement that w𝒦w^{*}\in\mathcal{K}, so the true function may not necessarily lie in the class even if there is no noise (σ=0\sigma=0).

B.2 Specialization to General Norm Balls

For convenience, we restate the general definition of effective rank.

See Definition 5

Applying Theorem 1 to an arbitrary norm ball yields the following:

See Corollary 3

Proof.

Let 𝒦={w:wB}\mathcal{K}=\{w:\|w\|\leq B\} in Theorem 1. It is easy to see that

W(Σ21/2𝒦)=𝔼supwB|Σ21/2w,H|=𝔼supwB|w,Σ21/2H|=B𝔼Σ21/2HW(\Sigma_{2}^{1/2}\mathcal{K})=\operatorname*{\mathbb{E}}\sup_{\|w\|\leq B}|\langle\Sigma_{2}^{1/2}w,H\rangle|=\operatorname*{\mathbb{E}}\sup_{\|w\|\leq B}|\langle w,\Sigma_{2}^{1/2}H\rangle|=B\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}

and

R(Σ21/2𝒦)=supwBΣ21/2w2=Bsupw1wΣ2.R(\Sigma_{2}^{1/2}\mathcal{K})=\sup_{\|w\|\leq B}\|\Sigma_{2}^{1/2}w\|_{2}=B\sup_{\|w\|\leq 1}\|w\|_{\Sigma_{2}}.

From our definition, it is clear that

r(Σ)=(W(Σ1/2𝒦)R(Σ1/2𝒦))2.r_{\left\lVert\cdot\right\rVert}(\Sigma)=\left(\frac{W(\Sigma^{1/2}\mathcal{K})}{R(\Sigma^{1/2}\mathcal{K})}\right)^{2}.

Observe that

wΣ2wsupw1wΣ2Bsupw1wΣ2=R(Σ21/2𝒦).\|w^{*}\|_{\Sigma_{2}}\leq\|w^{*}\|\sup_{\|w\|\leq 1}\|w\|_{\Sigma_{2}}\leq B\sup_{\|w\|\leq 1}\|w\|_{\Sigma_{2}}=R(\Sigma_{2}^{1/2}\mathcal{K}).

The two above equations imply

W(Σ21/2𝒦)+rad(Σ21/2𝒦)2log(32/δ)+wΣ22log(32/δ)W(Σ21/2𝒦)+22log(32/δ)rad(Σ21/2𝒦)=W(Σ21/2𝒦)+22log(32/δ)r(Σ2)W(Σ21/2𝒦)=(1+22log(32/δ)r(Σ2))(B𝔼Σ21/2H).\begin{split}&W(\Sigma_{2}^{1/2}\mathcal{K})+\operatorname{rad}(\Sigma_{2}^{1/2}\mathcal{K})\sqrt{2\log(32/\delta)}+\|w^{*}\|_{\Sigma_{2}}\sqrt{2\log(32/\delta)}\\ \leq\,&W(\Sigma_{2}^{1/2}\mathcal{K})+2\sqrt{2\log(32/\delta)}\operatorname{rad}(\Sigma_{2}^{1/2}\mathcal{K})\\ =\,&W(\Sigma_{2}^{1/2}\mathcal{K})+2\sqrt{\frac{2\log(32/\delta)}{r_{\left\lVert\cdot\right\rVert}(\Sigma_{2})}}W(\Sigma_{2}^{1/2}\mathcal{K})\\ =\,&\left(1+2\sqrt{\frac{2\log(32/\delta)}{r_{\left\lVert\cdot\right\rVert}(\Sigma_{2})}}\right)\left(B\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}\right).\\ \end{split}

Under our assumptions that γ1\gamma\leq 1 and δ1/4\delta\leq 1/4, using the inequality (1+x)(1+y)1+x+2y(1+x)(1+y)\leq 1+x+2y for x1x\leq 1, it is routine to check that

(1+β)(1+22log(32/δ)r(Σ2))21+γ.(1+\beta)\left(1+2\sqrt{\frac{2\log(32/\delta)}{r_{\left\lVert\cdot\right\rVert}(\Sigma_{2})}}\right)^{2}\leq 1+\gamma.

Plugging into Theorem 1 concludes the proof. ∎
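
For a concrete instance of this specialization (a worked example rather than a new statement), take \|\cdot\|=\|\cdot\|_{1} and the isotropic choice \Sigma_{2}=I_{d}. The dual norm is \|\cdot\|_{\infty}, and the standard bound on the expected maximum of Gaussians gives

\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}=\operatorname*{\mathbb{E}}\|H\|_{\infty}\leq\sqrt{2\log(2d)},

so the Gaussian width term is W(\Sigma_{2}^{1/2}\mathcal{K})\leq B\sqrt{2\log(2d)} and the leading term of the resulting bound scales as 2B^{2}\log(2d)/n (compare the \ell_{1} results discussed in Section 6 and Appendix E).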

B.3 Special Case: Euclidean Norm

In the Euclidean setting, the effective ranks are defined as follows:

See Definition 3

Due to the small difference between r(Σ)r(\Sigma) and r2(Σ)r_{\left\lVert\cdot\right\rVert_{2}}(\Sigma), our generalization bound below requires a slightly different argument (see the discussion in Section 5), but the proof strategy is exactly the same.

See Corollary 2

Proof.

The proof is identical to that of Corollary 3 except for the inconsequential difference between 𝔼Σ21/2g2\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}g\|_{2} and Tr(Σ2)1/2\operatorname{Tr}(\Sigma_{2})^{1/2}. It is easy to see that

W(Σ21/2𝒦)BTr(Σ2)1/2andR(Σ21/2𝒦)=BΣ2𝑜𝑝1/2.W(\Sigma_{2}^{1/2}\mathcal{K})\leq B\operatorname{Tr}(\Sigma_{2})^{1/2}\quad\text{and}\quad R(\Sigma_{2}^{1/2}\mathcal{K})=B\|\Sigma_{2}\|_{\mathit{op}}^{1/2}.

By the same argument, we can show that wΣ2R(Σ21/2𝒦)\|w^{*}\|_{\Sigma_{2}}\leq R(\Sigma_{2}^{1/2}\mathcal{K}) and

W(Σ21/2𝒦)+rad(Σ21/2𝒦)2log(32/δ)+wΣ22log(32/δ)W(Σ21/2𝒦)+22log(32/δ)rad(Σ21/2𝒦)BTr(Σ2)1/2+22log(32/δ)BΣ2𝑜𝑝1/2=(1+22log(32/δ)r(Σ2))BTr(Σ2)1/2.\begin{split}&W(\Sigma_{2}^{1/2}\mathcal{K})+\operatorname{rad}(\Sigma_{2}^{1/2}\mathcal{K})\sqrt{2\log(32/\delta)}+\|w^{*}\|_{\Sigma_{2}}\sqrt{2\log(32/\delta)}\\ \leq\,&W(\Sigma_{2}^{1/2}\mathcal{K})+2\sqrt{2\log(32/\delta)}\operatorname{rad}(\Sigma_{2}^{1/2}\mathcal{K})\\ \leq\,&B\operatorname{Tr}(\Sigma_{2})^{1/2}+2\sqrt{2\log(32/\delta)}B\|\Sigma_{2}\|_{\mathit{op}}^{1/2}\\ =\,&\left(1+2\sqrt{\frac{2\log(32/\delta)}{r(\Sigma_{2})}}\right)B\operatorname{Tr}(\Sigma_{2})^{1/2}.\\ \end{split}

Plugging into Theorem 1 concludes the proof. ∎
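
To give a sense of how close \operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}g\|_{2} and \operatorname{Tr}(\Sigma_{2})^{1/2} are in practice (an illustrative sketch assuming numpy, with an arbitrarily chosen decaying spectrum), one can compare a Monte Carlo estimate of the former with the latter:

    import numpy as np

    rng = np.random.default_rng(2)
    d, trials = 500, 5000
    lam = 1.0 / np.arange(1, d + 1)                    # eigenvalues of a diagonal Sigma_2
    g = rng.standard_normal((trials, d))
    norms = np.linalg.norm(g * np.sqrt(lam), axis=1)   # ||Sigma_2^{1/2} g||_2 for each sample
    print(f"Monte Carlo E||Sigma_2^(1/2) g||_2 = {norms.mean():.3f}, "
          f"Tr(Sigma_2)^(1/2) = {np.sqrt(lam.sum()):.3f}")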

Next, by choosing a particular covariance split, we prove the speculative bound from [73] when the features are Gaussian:

See Corollary 1

Proof.

By Theorem 1 and the same argument as in the proof of Corollary 2, we obtain

supw𝒦,Y=XwL(w)1+βn(BTr(Σ2)1/2+BΣ2𝑜𝑝1/222log(32δ))2B2n(1+β)(Tr(Σ)1/2+Σ2𝑜𝑝1/26log(1/δ))2.\begin{split}\sup_{w\in\mathcal{K},Y=Xw}L(w)&\leq\frac{1+\beta}{n}\left(B\operatorname{Tr}(\Sigma_{2})^{1/2}+B\|\Sigma_{2}\|_{\mathit{op}}^{1/2}\cdot 2\sqrt{2\log\left(\frac{32}{\delta}\right)}\right)^{2}\\ &\leq\frac{B^{2}}{n}(1+\beta)\left(\operatorname{Tr}(\Sigma)^{1/2}+\|\Sigma_{2}\|_{\mathit{op}}^{1/2}\cdot 6\sqrt{\log(1/\delta)}\right)^{2}.\\ \end{split}

Letting Σ1\Sigma_{1} contain the largest eigenvalues, we have

rank(Σ1)Σ2𝑜𝑝Tr(Σ).\operatorname{rank}(\Sigma_{1})\|\Sigma_{2}\|_{\mathit{op}}\leq\operatorname{Tr}(\Sigma).

Plugging in the inequality shows

supw𝒦,Y=XwL(w)B2Tr(Σ)n(1+β)(1+6log(1/δ)rank(Σ1))2.\begin{split}\sup_{w\in\mathcal{K},Y=Xw}L(w)&\leq\frac{B^{2}\operatorname{Tr}(\Sigma)}{n}(1+\beta)\left(1+6\sqrt{\frac{\log(1/\delta)}{\operatorname{rank}(\Sigma_{1})}}\right)^{2}.\\ \end{split}

Therefore, we can pick γ=(1+β)(1+6log(1/δ)rank(Σ1))21\gamma=(1+\beta)\left(1+6\sqrt{\frac{\log(1/\delta)}{\operatorname{rank}(\Sigma_{1})}}\right)^{2}-1 and it is clear that

γlog(1/δ)n+rank(Σ1)n+log(1/δ)rank(Σ1)\gamma\lesssim\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\operatorname{rank}(\Sigma_{1})}{n}}+\sqrt{\frac{\log(1/\delta)}{\operatorname{rank}(\Sigma_{1})}}

for sufficiently large nn and rank(Σ1)\operatorname{rank}(\Sigma_{1}). To balance the last two terms, we can pick a covariance split such that rank(Σ1)\operatorname{rank}(\Sigma_{1}) is of order [nlog(1/δ)]1/2[n\log(1/\delta)]^{1/2}; then both \sqrt{\operatorname{rank}(\Sigma_{1})/n} and \sqrt{\log(1/\delta)/\operatorname{rank}(\Sigma_{1})} are of order (\log(1/\delta)/n)^{1/4}, which proves the log(1/δ)/n4\sqrt[4]{\log(1/\delta)/n} rate. ∎

Appendix C Bounds on the Norm of the Minimal-Norm Interpolator

In this section, we will give bounds – again based on the Gaussian Minimax Theorem – for the norm of the minimal norm interpolator, first in general and then in the Euclidean case.

C.1 General Norms: Proof of Theorem 4

Similar to the analysis in the previous section, we first formulate the minimal norm as an optimization problem in terms of a random matrix with N(0,1)N(0,1) entries. Next, we apply the Convex Gaussian Minimax Theorem.

Lemma 6.

Under the model assumptions in (1), let \lVert\cdot\rVert be an arbitrary norm and Z:n×dZ:n\times d be a matrix with i.i.d. N(0,1)N(0,1) entries independent of ξ\xi. Define the primary optimization problem (PO) as

Φ:=minZw=ξΣ1/2w.\Phi:=\min_{Zw=\xi}\|\Sigma^{-1/2}w\|. (56)

Then for any tt, it holds that

Pr(minXw=Yw>t)Pr(w+Φ>t).\Pr\left(\,\min_{Xw=Y}\|w\|>t\,\right)\leq\Pr\left(\,\|w^{*}\|+\Phi>t\,\right). (57)
Proof.

By equality in distribution, we can write X=ZΣ1/2X=Z\Sigma^{1/2}. By the triangle inequality and two changes of variables, we have

minXw=Yw=minXw=ξw+ww+minZΣ1/2w=ξw=w+minZw=ξΣ1/2w.\begin{split}\min_{Xw=Y}\|w\|&=\min_{Xw=\xi}\|w+w^{*}\|\\ &\leq\|w^{*}\|+\min_{Z\Sigma^{1/2}w=\xi}\|w\|\\ &=\|w^{*}\|+\min_{Zw=\xi}\|\Sigma^{-1/2}w\|.\qed\end{split}
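
For the Euclidean norm and an invertible Σ, both sides of this comparison reduce to pseudoinverse computations (via the same change of variables as in the proof), which gives a quick numerical illustration of (57). The following is a sketch assuming numpy; the diagonal Σ and the specific dimensions are arbitrary choices for illustration:

    import numpy as np

    rng = np.random.default_rng(3)
    n, d, sigma = 50, 200, 1.0

    lam = np.linspace(0.5, 2.0, d)              # eigenvalues of a diagonal Sigma
    Z = rng.standard_normal((n, d))
    X = Z * np.sqrt(lam)                        # X = Z Sigma^{1/2} since Sigma is diagonal
    w_star = rng.standard_normal(d) / np.sqrt(d)
    xi = sigma * rng.standard_normal(n)
    Y = X @ w_star + xi

    min_norm = np.linalg.norm(np.linalg.pinv(X) @ Y)   # ell_2 norm of the minimum-norm interpolator
    Phi = np.linalg.norm(np.linalg.pinv(X) @ xi)       # PO value: min of ||Sigma^{-1/2} w||_2 over Zw = xi,
                                                       # computed via the substitution v = Sigma^{-1/2} w
    print(f"min-norm interpolator: {min_norm:.3f} <= ||w*||_2 + Phi = "
          f"{np.linalg.norm(w_star) + Phi:.3f}")
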
Lemma 7 (Application of CGMT).

In the same setting as Lemma 6, let GN(0,In),HN(0,Id)G\sim N(0,I_{n}),H\sim N(0,I_{d}) be Gaussian vectors independent of ξ\xi and each other. Define the auxiliary optimization problem (AO) as

ϕ:=minξw2G2H,wΣ1/2w.\phi:=\min_{\|\xi-\|w\|_{2}G\|_{2}\leq\langle H,w\rangle}\|\Sigma^{-1/2}w\|. (58)

Then it holds that

Pr(Φ>t|ξ)2Pr(ϕt|ξ),\Pr(\Phi>t\,|\,\xi)\leq 2\Pr(\phi\geq t\,|\,\xi), (59)

and taking expectations we have

Pr(Φ>t)2Pr(ϕt).\Pr(\Phi>t)\leq 2\Pr(\phi\geq t). (60)
Proof.

By introducing Lagrange multipliers, we have

Φ=minwmaxλΣ1/2w+λ,Zwξ=minwmaxλλ,Zw+Σ1/2wλ,ξ.\begin{split}\Phi&=\min_{w}\max_{\lambda}\,\|\Sigma^{-1/2}w\|+\langle\lambda,Zw-\xi\rangle\\ &=\min_{w}\max_{\lambda}\,\langle\lambda,Zw\rangle+\|\Sigma^{-1/2}w\|-\langle\lambda,\xi\rangle.\end{split}

By independence, the distribution of ZZ remains the same after conditioning on ξ\xi and the randomness in Φ\Phi comes solely from ZZ. Therefore, we can apply CGMT in Theorem 9 with ψ(w,λ)=Σ1/2wλ,ξ\psi(w,\lambda)=\|\Sigma^{-1/2}w\|-\langle\lambda,\xi\rangle because ψ\psi is convex-concave, but we again have the technical difficulty that the domains of ww and λ\lambda are not compact. To overcome this, we will use a double truncation argument. For any r,t>0r,t>0, we define

Φr(t):=minΣ1/2w2tmaxλ2rλ,Zw+Σ1/2wλ,ξ\Phi_{r}(t):=\min_{\|\Sigma^{-1/2}w\|\leq 2t}\max_{\|\lambda\|_{2}\leq r}\,\langle\lambda,Zw\rangle+\|\Sigma^{-1/2}w\|-\langle\lambda,\xi\rangle (61)

and the corresponding AO

ϕr(t):=minΣ1/2w2tmaxλ2rw2G,λ+λ2H,w+Σ1/2wλ,ξ=minΣ1/2w2tmaxλ2rλ2H,wλ,ξw2G+Σ1/2w=minΣ1/2w2tmax0λrλ(H,w+ξw2G2)+Σ1/2w.\begin{split}\phi_{r}(t):&=\min_{\|\Sigma^{-1/2}w\|\leq 2t}\max_{\|\lambda\|_{2}\leq r}\,\|w\|_{2}\langle G,\lambda\rangle+\|\lambda\|_{2}\langle H,w\rangle+\|\Sigma^{-1/2}w\|-\langle\lambda,\xi\rangle\\ &=\min_{\|\Sigma^{-1/2}w\|\leq 2t}\max_{\|\lambda\|_{2}\leq r}\,\|\lambda\|_{2}\langle H,w\rangle-\langle\lambda,\xi-\|w\|_{2}G\rangle+\|\Sigma^{-1/2}w\|\\ &=\min_{\|\Sigma^{-1/2}w\|\leq 2t}\max_{0\leq\lambda\leq r}\,\lambda\left(\langle H,w\rangle+\|\xi-\|w\|_{2}G\|_{2}\right)+\|\Sigma^{-1/2}w\|.\end{split} (62)

Note that the optimization in Φr(t)\Phi_{r}(t) and ϕr(t)\phi_{r}(t) now ranges over compact sets. We will also use an intermediate problem between Φ\Phi and Φr(t)\Phi_{r}(t), defined as

Φ(t):=minΣ1/2w2tmaxλλ,Zw+Σ1/2wλ,ξ=minZw=ξΣ1/2w2tΣ1/2w.\begin{split}\Phi(t):&=\min_{\|\Sigma^{-1/2}w\|\leq 2t}\max_{\lambda}\,\langle\lambda,Zw\rangle+\|\Sigma^{-1/2}w\|-\langle\lambda,\xi\rangle\\ &=\min_{\begin{subarray}{c}Zw=\xi\\ \|\Sigma^{-1/2}w\|\leq 2t\end{subarray}}\|\Sigma^{-1/2}w\|.\end{split} (63)

We similarly define the intermediate AO as

ϕ(t):=minΣ1/2w2tmaxλ0λ(H,w+ξw2G2)+Σ1/2w=minξw2G2H,wΣ1/2w2tΣ1/2w.\begin{split}\phi(t):&=\min_{\|\Sigma^{-1/2}w\|\leq 2t}\max_{\lambda\geq 0}\,\lambda\left(\langle H,w\rangle+\|\xi-\|w\|_{2}G\|_{2}\right)+\|\Sigma^{-1/2}w\|\\ &=\min_{\begin{subarray}{c}\|\xi-\|w\|_{2}G\|_{2}\leq\langle-H,w\rangle\\ \|\Sigma^{-1/2}w\|\leq 2t\end{subarray}}\|\Sigma^{-1/2}w\|.\end{split} (64)

Compared to the definition of ϕ\phi, we have H,w\langle-H,w\rangle instead of H,w\langle H,w\rangle, but this difference is negligible because HH is Gaussian. It can be easily seen that the event Φ>t\Phi>t is the same as Φ(t)>t\Phi(t)>t, and the same holds for ϕ\phi and ϕ(t)\phi(t). It is also clear that ϕ(t)ϕr(t)\phi(t)\geq\phi_{r}(t) and we can connect ϕr(t)\phi_{r}(t) with Φr(t)\Phi_{r}(t) by CGMT. It remains to show that Φr(t)Φ(t)\Phi_{r}(t)\to\Phi(t) as rr\to\infty.

By definition, Φr(t)Φs(t)\Phi_{r}(t)\leq\Phi_{s}(t) for rsr\leq s. We consider two cases:

  1.

    Φ(t)=\Phi(t)=\infty, i.e. the minimization problem defining Φ(t)\Phi(t) is infeasible. In this case, we know that for all Σ1/2w2t\|\Sigma^{-1/2}w\|\leq 2t

    Zwξ2>0.\|Zw-\xi\|_{2}>0.

    By compactness, there exists μ=μ(Z,ξ)>0\mu=\mu(Z,\xi)>0 (in particular, independent of rr) such that

    Zwξ2μ.\|Zw-\xi\|_{2}\geq\mu.

    Therefore, considering λ\lambda along the direction of ZwξZw-\xi shows that

    Φr(t)=minΣ1/2w2tmaxλ2rλ,Zwξ+Σ1/2wrμ\Phi_{r}(t)=\min_{\|\Sigma^{-1/2}w\|\leq 2t}\max_{\|\lambda\|_{2}\leq r}\,\langle\lambda,Zw-\xi\rangle+\|\Sigma^{-1/2}w\|\geq r\mu

    so Φr(t)\Phi_{r}(t)\to\infty as rr\to\infty.

  2.

    Otherwise Φ(t)<\Phi(t)<\infty, i.e. the minimization problem defining Φ(t)\Phi(t) is feasible. In this case, we can let w(r)w(r) be an arbitrary minimizer achieving the objective Φr(t)\Phi_{r}(t) for each r0r\geq 0 by compactness. By compactness again, the sequence {w(r)}r=1\{w(r)\}_{r=1}^{\infty} at positive integer values of rr has a subsequential limit w()w(\infty) such that Σ1/2w()2t\|\Sigma^{-1/2}w(\infty)\|\leq 2t. Equivalently, there exists an increasing sequence rnr_{n} such that limnw(rn)=w()\lim_{n\to\infty}w(r_{n})=w(\infty).

    Suppose for the sake of contradiction that Zw()ξZw(\infty)\neq\xi, then by continuity, there exists μ>0\mu>0 and a sufficiently small ϵ>0\epsilon>0 such that for all ww()2ϵ\|w-w(\infty)\|_{2}\leq\epsilon

    Zwξ2μ.\|Zw-\xi\|_{2}\geq\mu.

    This implies that for sufficiently large nn, we have

    Zw(rn)ξ2μ\|Zw(r_{n})-\xi\|_{2}\geq\mu

    and by the same argument as in the previous case

    \Phi_{r_{n}}(t)=\max_{\|\lambda\|_{2}\leq r_{n}}\,\langle\lambda,Zw(r_{n})-\xi\rangle+\|\Sigma^{-1/2}w(r_{n})\|\geq r_{n}\mu

    so Φrn\Phi_{r_{n}}\to\infty, but this is impossible since Φr(t)Φ(t)<\Phi_{r}(t)\leq\Phi(t)<\infty. By contradiction, it must be the case that Zw()=ξZw(\infty)=\xi. By taking λ=0\lambda=0 in the definition of Φr(t)\Phi_{r}(t), we have

    Φrn(t)Σ1/2w(rn).\Phi_{r_{n}}(t)\geq\|\Sigma^{-1/2}w(r_{n})\|.

    By continuity, we show that

    liminfnΦrn(t)limnΣ1/2w(rn)=Σ1/2w()Φ(t).\lim\inf_{n\to\infty}\Phi_{r_{n}}(t)\geq\lim_{n\to\infty}\|\Sigma^{-1/2}w(r_{n})\|=\|\Sigma^{-1/2}w(\infty)\|\geq\Phi(t).

    Since Φrn(t)Φ(t)\Phi_{r_{n}}(t)\leq\Phi(t), the limit of Φrn(t)\Phi_{r_{n}}(t) exists and equals Φ(t)\Phi(t). We can conclude that limrΦr(t)=Φ(t)\lim_{r\to\infty}\Phi_{r}(t)=\Phi(t) because Φr(t)\Phi_{r}(t) is an increasing function of rr.

By the last part of Theorem 9 (the CGMT),

Pr(Φr(t)>t|ξ)2Pr(ϕr(t)t|ξ).\Pr(\Phi_{r}(t)>t\,|\,\xi)\leq 2\Pr(\phi_{r}(t)\geq t\,|\,\xi).

By continuity (from below) of the probability measure, and the fact that Φr(t)\Phi_{r}(t) monotonically increases to Φ(t)\Phi(t) almost surely, we can conclude

Pr(Φ>t|ξ)=Pr(Φ(t)>t|ξ)Pr(rrrΦr(t)>t|ξ)=limrPr(rrΦr(t)>t|ξ)=limrPr(Φr(t)>t|ξ)2limrPr(ϕr(t)t|ξ)2Pr(ϕ(t)t|ξ)=2Pr(ϕt|ξ).\begin{split}\Pr(\Phi>t\,|\,\xi)&=\Pr(\Phi(t)>t\,|\,\xi)\leq\Pr\left(\cup_{r}\cap_{r^{\prime}\geq r}\Phi_{r^{\prime}}(t)>t\,|\,\xi\right)\\ &=\lim_{r\to\infty}\Pr\left(\cap_{r^{\prime}\geq r}\Phi_{r^{\prime}}(t)>t\,|\,\xi\right)=\lim_{r\to\infty}\Pr\left(\Phi_{r}(t)>t\,|\,\xi\right)\\ &\leq 2\lim_{r\to\infty}\Pr(\phi_{r}(t)\geq t\,|\,\xi)\leq 2\Pr(\phi(t)\geq t\,|\,\xi)\\ &=2\Pr(\phi\geq t\,|\,\xi).\qed\end{split}

It remains to analyze the auxiliary problem, which we do in the following lemma:

Lemma 8.

For any covariance splitting Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2}, let PP denote the orthogonal projection matrix onto the column space of Σ2\Sigma_{2}, and let v=argminvΣ21/2HvΣ2v^{*}=\operatorname*{arg\,min}_{v\in\partial\|\Sigma^{1/2}_{2}H\|_{*}}\left\lVert v\right\rVert_{\Sigma_{2}}. Assume that there exist ϵ1,ϵ20\epsilon_{1},\epsilon_{2}\geq 0 such that with probability at least 1δ/21-\delta/2,

vΣ2(1+ϵ1)𝔼vΣ2\left\lVert v^{*}\right\rVert_{\Sigma_{2}}\leq(1+\epsilon_{1})\operatorname*{\mathbb{E}}\left\lVert v^{*}\right\rVert_{\Sigma_{2}} (65)

and

Pv21+ϵ2.\left\lVert Pv^{*}\right\rVert^{2}\leq 1+\epsilon_{2}. (66)

Let

ϵ=8n1/2+28log(32/δ)n+8log(8/δ)r(Σ)+2(1+ϵ1)2nR(Σ2)+2ϵ2.\epsilon=8n^{-1/2}+28\sqrt{\frac{\log(32/\delta)}{n}}+8\sqrt{\frac{\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}+2(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}+2\epsilon_{2}.

If nn and the effective ranks are sufficiently large such that ϵ1\epsilon\leq 1, then with probability at least 1δ1-\delta, it holds that

ϕ2(1+ϵ)σ2n(𝔼Σ21/2H)2\phi^{2}\leq(1+\epsilon)\,\sigma^{2}\frac{n}{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*})^{2}} (67)
Proof.

For notational simplicity, we define

α=2log(32/δ)nρ=1n+2log(16/δ)n.\begin{split}\alpha&=2\sqrt{\frac{\log(32/\delta)}{n}}\\ \rho&=\sqrt{\frac{1}{n}}+2\sqrt{\frac{\log(16/\delta)}{n}}.\end{split}

By a union bound, the following collection of events occurs with probability at least 1δ/21-\delta/2:

  1.

    (Approximate Orthogonality.) By Lemma 1, it holds that

    |ξ,G|<ξ2G2ρ.|\langle\xi,G\rangle|<\left\lVert\xi\right\rVert_{2}\left\lVert G\right\rVert_{2}\rho. (68)
  2.

    (Typical Norm of GG and ξ\xi.) By Lemma 2, it holds that

    α1nG21α-\alpha\leq\frac{1}{\sqrt{n}}\|G\|_{2}-1\leq\alpha (69)

    and

    ασ1nξ2σασ.-\alpha\sigma\leq\frac{1}{\sqrt{n}}\|\xi\|_{2}-\sigma\leq\alpha\sigma. (70)
  3.

    (Typical Norm of Σ21/2H\Sigma_{2}^{1/2}H.) By Theorem 6, it holds that

    Σ21/2H𝔼Σ21/2Hsupu1uΣ22log(8/δ)=(12log(8/δ)r(Σ))𝔼Σ21/2H,\begin{split}\|\Sigma_{2}^{1/2}H\|_{*}&\geq\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}-\sup_{\|u\|\leq 1}\|u\|_{\Sigma_{2}}\sqrt{2\log(8/\delta)}\\ &=\left(1-\sqrt{\frac{2\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}\right)\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*},\end{split} (71)

    because Σ21/2H\|\Sigma_{2}^{1/2}H\|_{*} is a supu1uΣ2\sup_{\|u\|\leq 1}\|u\|_{\Sigma_{2}}-Lipschitz function of HH.

By a change of variables, recall that

ϕ:=minξΣ1/2w2G2H,Σ1/2ww.\phi:=\min_{\left\lVert\xi-\lVert\Sigma^{1/2}w\rVert_{2}G\right\rVert_{2}\leq\langle H,\Sigma^{1/2}w\rangle}\|w\|.

Equations 68, 69 and 70 imply that

ξΣ1/2w2G22=ξ222ξ,GΣ1/2w2+Σ1/2w22G22(1+ρ)(ξ22+Σ1/2w22G22)(1+ρ)(1+α)2n(σ2+Σ1/2w22).\begin{split}\left\lVert\xi-\|\Sigma^{1/2}w\|_{2}G\right\rVert_{2}^{2}&=\|\xi\|_{2}^{2}-2\langle\xi,G\rangle\|\Sigma^{1/2}w\|_{2}+\|\Sigma^{1/2}w\|_{2}^{2}\|G\|_{2}^{2}\\ &\leq(1+\rho)\left(\|\xi\|_{2}^{2}+\|\Sigma^{1/2}w\|_{2}^{2}\|G\|_{2}^{2}\right)\\ &\leq(1+\rho)(1+\alpha)^{2}n(\sigma^{2}+\|\Sigma^{1/2}w\|_{2}^{2}).\end{split}

To upper bound ϕ\phi, it suffices to construct a ww that satisfies the constraint. Consider ww of the form s(Pv)s(Pv^{*}); then Σ1/2w=sΣ21/2v\Sigma^{1/2}w=s\Sigma^{1/2}_{2}v^{*}, since Σ11/2P=0\Sigma_{1}^{1/2}P=0 and Σ21/2(IP)=0\Sigma_{2}^{1/2}(I-P)=0. Plugging in, it suffices to choose ss such that

(1+ρ)(1+α)2n(σ2+s2Σ21/2v22)s2H,Σ21/2v2=s2Σ21/2H2.\begin{split}(1+\rho)(1+\alpha)^{2}n(\sigma^{2}+s^{2}\|\Sigma^{1/2}_{2}v^{*}\|_{2}^{2})&\leq s^{2}\langle H,\Sigma^{1/2}_{2}v^{*}\rangle^{2}=s^{2}\|\Sigma^{1/2}_{2}H\|_{*}^{2}.\end{split}

Solving for ss, we can choose

s2=σ2(Σ21/2H2(1+ρ)(1+α)2nvΣ22)1s^{2}=\sigma^{2}\left(\frac{\|\Sigma^{1/2}_{2}H\|_{*}^{2}}{(1+\rho)(1+\alpha)^{2}n}-\|v^{*}\|_{\Sigma_{2}}^{2}\right)^{-1}

given that it is positive. By (65) and (71), we have

Σ21/2H2(1+ρ)(1+α)2nvΣ22(𝔼Σ21/2H)2(1+ρ)(1+α)2n(12log(8/δ)r(Σ))2(1+ϵ1)2(𝔼vΣ2)2(𝔼Σ21/2H)2n(1(1+ρ)(1+α)2(122log(8/δ)r(Σ))(1+ϵ1)2nR(Σ2)).\begin{split}&\frac{\|\Sigma^{1/2}_{2}H\|_{*}^{2}}{(1+\rho)(1+\alpha)^{2}n}-\|v^{*}\|_{\Sigma_{2}}^{2}\\ \geq\,&\frac{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*})^{2}}{(1+\rho)(1+\alpha)^{2}n}\left(1-\sqrt{\frac{2\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}\right)^{2}-(1+\epsilon_{1})^{2}(\operatorname*{\mathbb{E}}\left\lVert v^{*}\right\rVert_{\Sigma_{2}})^{2}\\ \geq\,&\frac{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*})^{2}}{n}\left(\frac{1}{(1+\rho)(1+\alpha)^{2}}\left(1-2\sqrt{\frac{2\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}\right)-(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}\right).\end{split}

If α<1\alpha<1, then

(1+ρ)(1+α)2=(1+ρ)(1+2α+α2)(1+ρ)(1+3α)=1+3α+ρ+3αρ1+3α+4ρ\begin{split}(1+\rho)(1+\alpha)^{2}&=(1+\rho)(1+2\alpha+\alpha^{2})\\ &\leq(1+\rho)(1+3\alpha)=1+3\alpha+\rho+3\alpha\rho\\ &\leq 1+3\alpha+4\rho\\ \end{split}

and using the inequality (1+x)11x(1+x)^{-1}\geq 1-x, we show

1(1+ρ)(1+α)21((1+ρ)(1+α)21)1(3α+4ρ).\begin{split}\frac{1}{(1+\rho)(1+\alpha)^{2}}&\geq 1-((1+\rho)(1+\alpha)^{2}-1)\\ &\geq 1-(3\alpha+4\rho).\\ \end{split}

Therefore, we can conclude that

1(1+ρ)(1+α)2(122log(8/δ)r(Σ))(1+ϵ1)2nR(Σ2)(1(3α+4ρ))(122log(8/δ)r(Σ))(1+ϵ1)2nR(Σ2)1(3α+4ρ)22log(8/δ)r(Σ)(1+ϵ1)2nR(Σ2)1ϵ\begin{split}&\frac{1}{(1+\rho)(1+\alpha)^{2}}\left(1-2\sqrt{\frac{2\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}\right)-(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}\\ \geq\,&(1-(3\alpha+4\rho))\left(1-2\sqrt{\frac{2\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}\right)-(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}\\ \geq\,&1-(3\alpha+4\rho)-2\sqrt{\frac{2\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}-(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}\geq 1-\epsilon^{\prime}\\ \end{split}

where we define

ϵ=4n1/2+14log(32/δ)n+4log(8/δ)r(Σ)+(1+ϵ1)2nR(Σ2).\epsilon^{\prime}=4n^{-1/2}+14\sqrt{\frac{\log(32/\delta)}{n}}+4\sqrt{\frac{\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}+(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}.

Provided that ϵ1/2\epsilon^{\prime}\leq 1/2 (which also guarantees that α<1\alpha<1 and our definition of s2s^{2} is sensible), we can use the inequality (1x)11+2x(1-x)^{-1}\leq 1+2x for x[0,1/2]x\in[0,1/2] to show that

s2σ2n(𝔼Σ21/2H)211ϵ(1+2ϵ)σ2n(𝔼Σ21/2H)2s^{2}\leq\sigma^{2}\frac{n}{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*})^{2}}\frac{1}{1-\epsilon^{\prime}}\leq(1+2\epsilon^{\prime})\,\sigma^{2}\frac{n}{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*})^{2}}

and thus by (66)

ϕ2s2Pv2(1+ϵ2)(1+2ϵ)σ2n(𝔼Σ21/2H)2(1+ϵ)σ2n(𝔼Σ21/2H)2\phi^{2}\leq s^{2}\|Pv^{*}\|^{2}\leq(1+\epsilon_{2})(1+2\epsilon^{\prime})\,\sigma^{2}\frac{n}{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*})^{2}}\leq(1+\epsilon)\,\sigma^{2}\frac{n}{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*})^{2}}

with ϵ=2ϵ+2ϵ2\epsilon=2\epsilon^{\prime}+2\epsilon_{2}. ∎
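
The feasible point constructed in this proof is easy to probe numerically in the simplest setting. The following is a minimal sketch (ours, not part of the formal argument), assuming the Euclidean norm, a single diagonal block Σ = Σ₂, and numpy: it draws G, H, ξ, takes the direction v* = Σ^{1/2}H / ||Σ^{1/2}H||₂ as in the proof, solves the resulting quadratic inequality for the smallest feasible scale s, and compares the implied upper bound on ϕ with the benchmark σ√(n/Tr(Σ)).

```python
# Illustrative sketch (not from the paper): the auxiliary-problem construction for the l2 norm.
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 20000, 1.0
lam = np.full(d, 1.0 / d)            # diagonal Sigma with Tr(Sigma) = 1 and many tiny eigenvalues

G = rng.standard_normal(n)
xi = sigma * rng.standard_normal(n)
H = rng.standard_normal(d)

sqrt_lam_H = np.sqrt(lam) * H                        # Sigma^{1/2} H
v_star = sqrt_lam_H / np.linalg.norm(sqrt_lam_H)     # v* for the Euclidean norm (||v*||_2 = 1)
v_norm = np.linalg.norm(np.sqrt(lam) * v_star)       # ||v*||_Sigma

# Feasibility of w = s * v* in the auxiliary problem:
#   ||xi - s * ||v*||_Sigma * G||_2 <= s * ||Sigma^{1/2} H||_2,
# i.e. the quadratic inequality  a*s^2 + 2*beta*s - c >= 0  in s >= 0.
a = np.linalg.norm(sqrt_lam_H) ** 2 - v_norm ** 2 * np.linalg.norm(G) ** 2
beta = v_norm * np.dot(xi, G)
c = np.linalg.norm(xi) ** 2
assert a > 0, "need ||Sigma^{1/2} H||_2^2 to dominate ||v*||_Sigma^2 ||G||_2^2"
s = (-beta + np.sqrt(beta ** 2 + a * c)) / a   # smallest feasible scale; phi <= s since ||v*||_2 = 1

print("feasible scale s (upper bound on phi) :", s)
print("benchmark sigma * sqrt(n / Tr(Sigma)) :", sigma * np.sqrt(n / lam.sum()))
```
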

Finally, we are ready to prove our general norm bound.

See 4

Proof.

By Lemmas 6 and 7, we show that for any tt

Pr(w^>t)Pr(Φ>tw)2Pr(ϕtw).\Pr(\left\lVert\hat{w}\right\rVert>t)\leq\Pr(\Phi>t-\left\lVert w^{*}\right\rVert)\leq 2\Pr(\phi\geq t-\left\lVert w^{*}\right\rVert).

By Lemma 8, the above is upper bounded by δ\delta if we set twt-\left\lVert w^{*}\right\rVert according to (67) with δ\delta replaced by δ/2\delta/2. Moving w\left\lVert w^{*}\right\rVert to the other side concludes the proof. ∎

C.2 Special Case: Euclidean Norm

Lemma 9.

For any covariance matrix Σ\Sigma, it holds that

(𝔼Σ1/2H2)2(11r(Σ))Tr(Σ)\left(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{2}\right)^{2}\geq\left(1-\frac{1}{r(\Sigma)}\right)\operatorname{Tr}(\Sigma) (72)

and

1Tr(Σ)(18r(Σ))𝔼[1HTΣH].\frac{1}{\operatorname{Tr}(\Sigma)}\geq\left(1-\sqrt{\frac{8}{r(\Sigma)}}\right)\operatorname*{\mathbb{E}}\left[\frac{1}{H^{T}\Sigma H}\right]. (73)

As a result, it holds that

r(Σ)1r2(Σ)r(Σ)r(\Sigma)-1\leq r_{\|\cdot\|_{2}}(\Sigma)\leq r(\Sigma) (74)

and

14r(Σ)R2(Σ)R(Σ)(18r(Σ2))1.1-\frac{4}{\sqrt{r(\Sigma)}}\leq\frac{R_{\|\cdot\|_{2}}(\Sigma)}{R(\Sigma)}\leq\left(1-\sqrt{\frac{8}{r(\Sigma^{2})}}\right)^{-1}. (75)
Proof.

Observe that if f(H)=Σ1/2H2f(H)=\|\Sigma^{1/2}H\|_{2}, then it can easily be checked that

f22=ΣH22Σ1/2H22Σ𝑜𝑝\|\nabla f\|_{2}^{2}=\frac{\|\Sigma H\|^{2}_{2}}{\|\Sigma^{1/2}H\|_{2}^{2}}\leq\|\Sigma\|_{\mathit{op}}

and so by the Gaussian Poincaré inequality [54, Corollary 2.27], we have

Tr(Σ)=𝔼Σ1/2H22=(𝔼Σ1/2H2)2+VarΣ1/2H2(𝔼Σ1/2H2)2+Σ𝑜𝑝=(𝔼Σ1/2H2)2+Tr(Σ)r(Σ).\begin{split}\operatorname{Tr}(\Sigma)=\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{2}^{2}&=(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{2})^{2}+\operatorname*{\mathrm{Var}}\|\Sigma^{1/2}H\|_{2}\\ &\leq(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{2})^{2}+\|\Sigma\|_{\mathit{op}}\\ &=(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{2})^{2}+\frac{\operatorname{Tr}(\Sigma)}{r(\Sigma)}.\end{split}

Rearranging the terms proves (72). To prove (73), without loss of generality assume that Σ\Sigma is diagonal, with diagonal entries λ1λ2λd\lambda_{1}\geq\lambda_{2}\geq...\geq\lambda_{d}. Observe that for any integer ν>2\nu>2, we can pad Σ\Sigma with 0’s such that ν\nu divides dd, and we have

HTΣH=i=1dλiHi2i=1d/νλνi(Hν(i1)+12++Hνi2).H^{T}\Sigma H=\sum_{i=1}^{d}\lambda_{i}H_{i}^{2}\geq\sum_{i=1}^{d/\nu}\lambda_{\nu i}(H_{\nu(i-1)+1}^{2}+...+H_{\nu i}^{2}).

By Jensen’s inequality, 1/𝔼[X]𝔼[1/X]1/\operatorname*{\mathbb{E}}[X]\leq\operatorname*{\mathbb{E}}[1/X]; it follows that

𝔼[1HTΣH]𝔼[1i=1d/νλνi(Hν(i1)+12++Hνi2)]=1j=1d/νλνj𝔼[1i=1d/νλνij=1d/νλνj(Hν(i1)+12++Hνi2)]1j=1d/νλνj𝔼[i=1d/νλνij=1d/νλνj1Hν(i1)+12++Hνi2]=1j=1d/νλνj1ν2.\begin{split}\operatorname*{\mathbb{E}}\left[\frac{1}{H^{T}\Sigma H}\right]&\leq\operatorname*{\mathbb{E}}\left[\frac{1}{\sum_{i=1}^{d/\nu}\lambda_{\nu i}(H_{\nu(i-1)+1}^{2}+...+H_{\nu i}^{2})}\right]\\ &=\frac{1}{\sum_{j=1}^{d/\nu}\lambda_{\nu j}}\operatorname*{\mathbb{E}}\left[\frac{1}{\sum_{i=1}^{d/\nu}\frac{\lambda_{\nu i}}{\sum_{j=1}^{d/\nu}\lambda_{\nu j}}(H_{\nu(i-1)+1}^{2}+...+H_{\nu i}^{2})}\right]\\ &\leq\frac{1}{\sum_{j=1}^{d/\nu}\lambda_{\nu j}}\operatorname*{\mathbb{E}}\left[\sum_{i=1}^{d/\nu}\frac{\lambda_{\nu i}}{\sum_{j=1}^{d/\nu}\lambda_{\nu j}}\frac{1}{H_{\nu(i-1)+1}^{2}+...+H_{\nu i}^{2}}\right]\\ &=\frac{1}{\sum_{j=1}^{d/\nu}\lambda_{\nu j}}\frac{1}{\nu-2}.\end{split}

In the last equality, we use the fact that for each ii the random variable (Hν(i1)+12++Hνi2)1\left(H_{\nu(i-1)+1}^{2}+...+H_{\nu i}^{2}\right)^{-1} follows an inverse Chi-square distribution with ν\nu degrees of freedom; its expectation is (ν2)1(\nu-2)^{-1}. In addition, notice that

νΣ𝑜𝑝+νi=1d/νλνi(λ1++λν)+i=1d/ν1(λνi+1++λν(i+1))=Tr(Σ).\nu\|\Sigma\|_{\mathit{op}}+\nu\sum_{i=1}^{d/\nu}\lambda_{\nu i}\geq(\lambda_{1}+...+\lambda_{\nu})+\sum_{i=1}^{d/\nu-1}(\lambda_{\nu i+1}+...+\lambda_{\nu(i+1)})=\operatorname{Tr}(\Sigma).

Plugging the above estimate into our upper bound shows that, for any integer ν>2\nu>2,

𝔼[1HTΣH]1Tr(Σ)νΣ𝑜𝑝νν2=1Tr(Σ)(1νr(Σ)2ν+2r(Σ))1.\operatorname*{\mathbb{E}}\left[\frac{1}{H^{T}\Sigma H}\right]\leq\frac{1}{\operatorname{Tr}(\Sigma)-\nu\|\Sigma\|_{\mathit{op}}}\frac{\nu}{\nu-2}=\frac{1}{\operatorname{Tr}(\Sigma)}\left(1-\frac{\nu}{r(\Sigma)}-\frac{2}{\nu}+\frac{2}{r(\Sigma)}\right)^{-1}.

We can show (73) by choosing ν=(2r(Σ))1/2\nu=\lceil(2r(\Sigma))^{1/2}\rceil:

𝔼[1HTΣH]1Tr(Σ)(18r(Σ))1.\operatorname*{\mathbb{E}}\left[\frac{1}{H^{T}\Sigma H}\right]\leq\frac{1}{\operatorname{Tr}(\Sigma)}\left(1-\sqrt{\frac{8}{r(\Sigma)}}\right)^{-1}.

It remains to verify (74) and (75). By (72), we can check

r2(Σ)=(𝔼Σ1/2H2)2Σ𝑜𝑝(11r(Σ))Tr(Σ)Σ𝑜𝑝=r(Σ)1.r_{\|\cdot\|_{2}}(\Sigma)=\frac{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{2})^{2}}{\|\Sigma\|_{\mathit{op}}}\geq\left(1-\frac{1}{r(\Sigma)}\right)\frac{\operatorname{Tr}(\Sigma)}{\|\Sigma\|_{\mathit{op}}}=r(\Sigma)-1.

The other direction r(Σ)r2(Σ)r(\Sigma)\geq r_{\|\cdot\|_{2}}(\Sigma) follows directly from an application of the Cauchy-Schwarz inequality. By Jensen’s inequality 1/𝔼[X]𝔼[1/X]1/\operatorname*{\mathbb{E}}[X]\leq\operatorname*{\mathbb{E}}[1/X] and the Cauchy-Schwarz inequality, we show

1Tr(Σ)(𝔼1ΣH22)1(𝔼Σ1/2H2ΣH2)2(𝔼ΣH2Σ1/2H2)2Tr(Σ2)𝔼1Σ1/2H22.\frac{1}{\operatorname{Tr}(\Sigma)}\left(\operatorname*{\mathbb{E}}\frac{1}{\|\Sigma H\|_{2}^{2}}\right)^{-1}\leq\left(\operatorname*{\mathbb{E}}\frac{\|\Sigma^{1/2}H\|_{2}}{\|\Sigma H\|_{2}}\right)^{-2}\leq\left(\operatorname*{\mathbb{E}}\frac{\|\Sigma H\|_{2}}{\|\Sigma^{1/2}H\|_{2}}\right)^{2}\leq\operatorname{Tr}(\Sigma^{2})\operatorname*{\mathbb{E}}\frac{1}{\|\Sigma^{1/2}H\|_{2}^{2}}.

Recall that R2(Σ)=(𝔼Σ1/2H2)2(𝔼ΣH2Σ1/2H2)2R_{\|\cdot\|_{2}}(\Sigma)=(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{2})^{2}\left(\operatorname*{\mathbb{E}}\frac{\|\Sigma H\|_{2}}{\|\Sigma^{1/2}H\|_{2}}\right)^{-2}. By Cauchy-Schwarz inequality and (73), it follows that

R2(Σ)Tr(Σ)2(𝔼1ΣH22)(18r(Σ2))1R(Σ)R_{\|\cdot\|_{2}}(\Sigma)\leq\operatorname{Tr}(\Sigma)^{2}\left(\operatorname*{\mathbb{E}}\frac{1}{\|\Sigma H\|_{2}^{2}}\right)\leq\left(1-\sqrt{\frac{8}{r(\Sigma^{2})}}\right)^{-1}R(\Sigma)

and also by (72)

R2(Σ)(11r(Σ))Tr(Σ)Tr(Σ2)(𝔼1Σ1/2H22)1(11r(Σ))(18r(Σ))R(Σ)(14r(Σ))R(Σ).\begin{split}R_{\|\cdot\|_{2}}(\Sigma)&\geq\left(1-\frac{1}{r(\Sigma)}\right)\frac{\operatorname{Tr}(\Sigma)}{\operatorname{Tr}(\Sigma^{2})}\left(\operatorname*{\mathbb{E}}\frac{1}{\|\Sigma^{1/2}H\|_{2}^{2}}\right)^{-1}\\ &\geq\left(1-\frac{1}{r(\Sigma)}\right)\left(1-\sqrt{\frac{8}{r(\Sigma)}}\right)R(\Sigma)\geq\left(1-\frac{4}{\sqrt{r(\Sigma)}}\right)R(\Sigma).\qed\end{split}
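
As a sanity check on (74) and (75), the Euclidean effective ranks can be estimated by Monte Carlo. This is an illustrative sketch only (ours), assuming a polynomially decaying spectrum and using numpy; the dimension and number of draws are arbitrary.

```python
# Illustrative sketch (not from the paper): Monte Carlo estimates of r_2(Sigma) and R_2(Sigma).
import numpy as np

rng = np.random.default_rng(1)
d, m = 4000, 1000                            # dimension and number of Monte Carlo draws
lam = np.arange(1, d + 1) ** -0.25           # eigenvalues of Sigma (WLOG diagonal)

H = rng.standard_normal((m, d))
norm_half = np.sqrt(H ** 2 @ lam)            # ||Sigma^{1/2} H||_2 for each draw
norm_full = np.sqrt(H ** 2 @ lam ** 2)       # ||Sigma H||_2 for each draw

r = lam.sum() / lam.max()                    # r(Sigma)
R = lam.sum() ** 2 / (lam ** 2).sum()        # R(Sigma)
r2 = norm_half.mean() ** 2 / lam.max()       # r_{||.||_2}(Sigma)
R2 = norm_half.mean() ** 2 / (norm_full / norm_half).mean() ** 2   # R_{||.||_2}(Sigma)

print(f"r(Sigma) = {r:8.1f}   r_2(Sigma) = {r2:8.1f}   (compare (74))")
print(f"R(Sigma) = {R:8.1f}   R_2(Sigma) = {R2:8.1f}   (compare (75))")
```
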
Lemma 10.

For any covariance matrix Σ\Sigma, it holds that with probability at least 1δ1-\delta,

1Σ1/2H22Tr(Σ)log(4/δ)R(Σ)1-\frac{\|\Sigma^{1/2}H\|_{2}^{2}}{\operatorname{Tr}(\Sigma)}\lesssim\frac{\log(4/\delta)}{\sqrt{R(\Sigma)}} (76)

and

ΣH22log(4/δ)Tr(Σ2).\|\Sigma H\|_{2}^{2}\lesssim\log(4/\delta)\operatorname{Tr}(\Sigma^{2}). (77)

Therefore, provided that R(Σ)log(4/δ)2R(\Sigma)\gtrsim\log(4/\delta)^{2}, it holds that

(ΣH2Σ1/2H2)2log(4/δ)Tr(Σ2)Tr(Σ).\left(\frac{\|\Sigma H\|_{2}}{\|\Sigma^{1/2}H\|_{2}}\right)^{2}\lesssim\log(4/\delta)\frac{\operatorname{Tr}(\Sigma^{2})}{\operatorname{Tr}(\Sigma)}. (78)
Proof.

Because we are considering the 2\ell_{2} norm and HH is standard Gaussian, we can assume without loss of generality that Σ\Sigma is diagonal, and we denote the diagonal entries of Σ\Sigma by λ1,,λd\lambda_{1},...,\lambda_{d}. By the sub-exponential Bernstein inequality [60, Corollary 2.8.3], we have with probability at least 1δ/21-\delta/2

|Σ1/2H22Tr(Σ)1|=|i=1dλijλj(Hi21)|log(4/δ)R(Σ)log(4/δ)r(Σ)log(4/δ)R(Σ)\left|\frac{\|\Sigma^{1/2}H\|_{2}^{2}}{\operatorname{Tr}(\Sigma)}-1\right|=\left|\sum_{i=1}^{d}\frac{\lambda_{i}}{\sum_{j}\lambda_{j}}(H_{i}^{2}-1)\right|\lesssim\sqrt{\frac{\log(4/\delta)}{R(\Sigma)}}\vee\frac{\log(4/\delta)}{r(\Sigma)}\leq\frac{\log(4/\delta)}{\sqrt{R(\Sigma)}}

where the last inequality uses that R(Σ)r(Σ)2R(\Sigma)\leq r(\Sigma)^{2}, shown in Lemma 5 of [66]. Using the sub-exponential Bernstein inequality again, we show with probability at least 1δ/21-\delta/2

|ΣH22Tr(Σ2)1|log(4/δ)R(Σ2)log(4/δ)r(Σ2)\left|\frac{\|\Sigma H\|_{2}^{2}}{\operatorname{Tr}(\Sigma^{2})}-1\right|\lesssim\sqrt{\frac{\log(4/\delta)}{R(\Sigma^{2})}}\vee\frac{\log(4/\delta)}{r(\Sigma^{2})}

From Lemma 5 of [66], we know that the effective ranks are at least 1. This implies

ΣH22log(4/δ)Tr(Σ2).\|\Sigma H\|_{2}^{2}\lesssim\log(4/\delta)\operatorname{Tr}(\Sigma^{2}).

Provided that R(Σ)log(4/δ)2R(\Sigma)\gtrsim\log(4/\delta)^{2}, we have

Σ1/2H2212Tr(Σ)\|\Sigma^{1/2}H\|_{2}^{2}\geq\frac{1}{2}\operatorname{Tr}(\Sigma)

in which case it holds that

ΣH22Σ1/2H22log(4/δ)Tr(Σ2)Tr(Σ).\frac{\|\Sigma H\|_{2}^{2}}{\|\Sigma^{1/2}H\|_{2}^{2}}\lesssim\log(4/\delta)\frac{\operatorname{Tr}(\Sigma^{2})}{\operatorname{Tr}(\Sigma)}.\qed

See 2

Proof.

To apply Theorem 4, it is clear that v=Σ21/2HΣ21/2H2v^{*}=\frac{\Sigma_{2}^{1/2}H}{\|\Sigma_{2}^{1/2}H\|_{2}} and so vΣ2=Σ2H2Σ21/2H2\|v^{*}\|_{\Sigma_{2}}=\frac{\|\Sigma_{2}H\|_{2}}{\|\Sigma_{2}^{1/2}H\|_{2}}. By (78), it suffices to pick ϵ1\epsilon_{1} such that for some constant c>0c>0

(1+ϵ1)𝔼vΣ2=clog(16/δ)Tr(Σ22)Tr(Σ2).(1+\epsilon_{1})\operatorname*{\mathbb{E}}\left\lVert v^{*}\right\rVert_{\Sigma_{2}}=c\sqrt{\log(16/\delta)\frac{\operatorname{Tr}(\Sigma_{2}^{2})}{\operatorname{Tr}(\Sigma_{2})}}.

By (72) of Lemma 9, for sufficiently large effective rank, it holds that (𝔼Σ21/2H2)2Tr(Σ2)\left(\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{2}\right)^{2}\gtrsim\operatorname{Tr}(\Sigma_{2}) and so

(1+ϵ1)2nR2(Σ2)=n(1+ϵ1)2(𝔼vΣ2)2(𝔼Σ21/2H2)2nlog(16/δ)Tr(Σ22)Tr(Σ2)2=nlog(16/δ)R(Σ2).(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|_{2}}(\Sigma_{2})}=n\frac{(1+\epsilon_{1})^{2}(\operatorname*{\mathbb{E}}\|v^{*}\|_{\Sigma_{2}})^{2}}{\left(\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{2}\right)^{2}}\lesssim n\log(16/\delta)\frac{\operatorname{Tr}(\Sigma_{2}^{2})}{\operatorname{Tr}(\Sigma_{2})^{2}}=\frac{n\log(16/\delta)}{R(\Sigma_{2})}.

Furthermore, it suffices to let ϵ2=0\epsilon_{2}=0 because PP is an orthogonal projection, so Pv2v2=1\|Pv^{*}\|_{2}\leq\|v^{*}\|_{2}=1. Combined with (74) of Lemma 9, we show

ϵlog(1/δ)n+log(1/δ)r(Σ2)+nlog(1/δ)R(Σ2).\epsilon\lesssim\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\log(1/\delta)}{r(\Sigma_{2})}}+\frac{n\log(1/\delta)}{R(\Sigma_{2})}.

Finally, using the inequality (1x)11+2x(1-x)^{-1}\leq 1+2x for x[0,1/2]x\in[0,1/2] and (72) of Lemma 9 again, we can conclude

(1+ϵ)1/2σn𝔼Σ21/2H2(1+ϵ)1/2(11r(Σ2))1/2σnTr(Σ2)(1+2ϵ+2r(Σ2))1/2σnTr(Σ2)\begin{split}(1+\epsilon)^{1/2}\,\sigma\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{2}}&\leq(1+\epsilon)^{1/2}\left(1-\frac{1}{r(\Sigma_{2})}\right)^{-1/2}\sigma\sqrt{\frac{n}{\operatorname{Tr}(\Sigma_{2})}}\\ &\leq\left(1+2\epsilon+\frac{2}{r(\Sigma_{2})}\right)^{1/2}\sigma\sqrt{\frac{n}{\operatorname{Tr}(\Sigma_{2})}}\\ \end{split}

and we can replace ϵ\epsilon with

ϵ=2ϵ+2r(Σ2)log(1/δ)n+log(1/δ)r(Σ2)+nlog(1/δ)R(Σ2).\epsilon^{\prime}=2\epsilon+\frac{2}{r(\Sigma_{2})}\lesssim\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\log(1/\delta)}{r(\Sigma_{2})}}+\frac{n\log(1/\delta)}{R(\Sigma_{2})}.
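
The resulting Euclidean norm bound is easy to visualize in simulation. Below is a small sketch (ours, illustrative only), assuming a flat, benign diagonal covariance with no Σ₁ component (so Σ₂ = Σ) and using numpy; it computes the minimum ℓ2-norm interpolator through the Gram matrix and compares its norm with ||w*||₂ + σ√(n/Tr(Σ₂)).

```python
# Illustrative sketch (not from the paper): the Euclidean norm bound for the min-norm interpolator.
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 100, 20000, 1.0
lam = np.full(d, 10.0 / d)             # benign spectrum: Tr(Sigma) = 10, R(Sigma) = d >> n

w_star = np.zeros(d)
w_star[0] = 1.0                        # ||w*||_2 = 1

X = rng.standard_normal((n, d)) * np.sqrt(lam)    # rows x_i ~ N(0, Sigma)
Y = X @ w_star + sigma * rng.standard_normal(n)

w_hat = X.T @ np.linalg.solve(X @ X.T, Y)         # minimum l2-norm interpolator
print("interpolation residual             :", np.linalg.norm(X @ w_hat - Y))
print("||w_hat||_2                        :", np.linalg.norm(w_hat))
print("||w*||_2 + sigma*sqrt(n/Tr(Sigma)) :",
      np.linalg.norm(w_star) + sigma * np.sqrt(n / lam.sum()))
```
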

Appendix D Benign Overfitting

In this section, we will combine results from the previous two sections to study when interpolators are consistent.

D.1 General Norm

See 5

Proof.

By Theorem 4, if we choose

B=w+(1+ϵ)1/2σn𝔼Σ21/2HB=\left\lVert w^{*}\right\rVert+(1+\epsilon)^{1/2}\,\sigma\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*}}

then with large probability, {w:wB}\{w:\|w\|\leq B\} has non-empty intersection with {w:Xw=Y}\{w:Xw=Y\}, and this intersection contains the minimal norm interpolator w^\hat{w}. Also, it is clear that B>wB>\|w^{*}\| and so by Corollary 3, it holds that

L(w^)supwB,L^(w)=0L(w)(1+γ)(w+(1+ϵ)1/2σn𝔼Σ21/2H)2(𝔼Σ21/2H)2n=(1+γ)(w𝔼Σ21/2Hn+(1+ϵ)1/2σ)2(1+γ)(1+ϵ)(σ+w𝔼Σ21/2Hn)2.\begin{split}L(\hat{w})&\leq\sup_{\|w\|\leq B,\hat{L}(w)=0}L(w)\\ &\leq(1+\gamma)\left(\left\lVert w^{*}\right\rVert+(1+\epsilon)^{1/2}\,\sigma\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*}}\right)^{2}\frac{\left(\operatorname*{\mathbb{E}}\lVert\Sigma^{1/2}_{2}H\rVert_{*}\right)^{2}}{n}\\ &=(1+\gamma)\left(\left\lVert w^{*}\right\rVert\frac{\operatorname*{\mathbb{E}}\lVert\Sigma^{1/2}_{2}H\rVert_{*}}{\sqrt{n}}+(1+\epsilon)^{1/2}\,\sigma\right)^{2}\\ &\leq(1+\gamma)(1+\epsilon)\left(\sigma+\|w^{*}\|\frac{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}}{\sqrt{n}}\right)^{2}.\\ \end{split}

Theorem 11 (Sufficient conditions).

Under the model assumptions in (1), let \left\lVert\cdot\right\rVert be an arbitrary norm. Suppose that as nn goes to \infty, there exists a sequence of covariance splits Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2} such that the following properties hold:

  1.

    (Small large-variance dimension.)

    limnrank(Σ1)n=0.\lim_{n\to\infty}\frac{\operatorname{rank}(\Sigma_{1})}{n}=0. (79)
  2.

    (Large effective dimension.)

    limn1r(Σ2)=0andlimnnR(Σ2)=0.\lim_{n\to\infty}\frac{1}{r_{\|\cdot\|}(\Sigma_{2})}=0\quad\text{and}\quad\lim_{n\to\infty}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}=0. (80)
  3.

    (No aliasing condition.)

    limnw𝔼Σ21/2Hn=0.\lim_{n\to\infty}\frac{\left\lVert w^{*}\right\rVert\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*}}{\sqrt{n}}=0. (81)
  4.

    (Contracting 2\ell_{2} projection condition.) With the same definition of PP and vv^{*} as in Theorem 4, it holds that for any η>0\eta>0,

    limnPr(Pv2>1+η)=0.\lim_{n\to\infty}\Pr(\|Pv^{*}\|^{2}>1+\eta)=0. (82)

Then L(w^)L(\hat{w}) converges to σ2\sigma^{2} in probability. In other words, minimum norm interpolation is consistent.

Proof.

Fix any η>0\eta>0. For sufficiently small γ,ϵ\gamma,\epsilon and w𝔼Σ21/2Hn\|w^{*}\|\frac{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}}{\sqrt{n}}, it is clear that

(1+γ)(1+ϵ)(σ+w𝔼Σ21/2Hn)2σ2η.(1+\gamma)(1+\epsilon)\left(\sigma+\|w^{*}\|\frac{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}}{\sqrt{n}}\right)^{2}-\sigma^{2}\leq\eta. (83)

For any δ>0\delta>0, by the definition of γ\gamma in Corollary 3 and our assumptions, the terms γ\gamma and w𝔼Σ21/2Hn\|w^{*}\|\frac{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}}{\sqrt{n}} can be made arbitrarily small for large enough nn. Also by our assumption, ϵ2\epsilon_{2} in the definition of ϵ\epsilon in Theorem 4 can be arbitrarily small. Note that

nR(Σ2)=𝔼[vΣ2𝔼Σ21/2H/n]\sqrt{\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}}=\operatorname*{\mathbb{E}}\left[\frac{\|v^{*}\|_{\Sigma_{2}}}{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}/\sqrt{n}}\right]

converges to 0 by assumption. Then by Markov’s inequality, for any η>0\eta^{\prime}>0, it holds that for all sufficiently large nn

Pr(vΣ2𝔼Σ21/2H/n>η)<δ\Pr\left(\frac{\|v^{*}\|_{\Sigma_{2}}}{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}/\sqrt{n}}>\sqrt{\eta^{\prime}}\right)<\delta

and we can pick

(1+ϵ1)𝔼vΣ2=η𝔼Σ21/2Hn.(1+\epsilon_{1})\operatorname*{\mathbb{E}}\left\lVert v^{*}\right\rVert_{\Sigma_{2}}=\sqrt{\eta^{\prime}}\frac{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}}{\sqrt{n}}.

This implies that

(1+ϵ1)2nR(Σ2)=n(𝔼Σ21/2H)2((1+ϵ1)𝔼vΣ2)2=η.(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}=\frac{n}{\left(\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}\right)^{2}}\left((1+\epsilon_{1})\operatorname*{\mathbb{E}}\|v^{*}\|_{\Sigma_{2}}\right)^{2}=\eta^{\prime}.

By Theorem 5, for all nn large enough that γ,ϵ\gamma,\epsilon and w𝔼Σ21/2Hn\|w^{*}\|\frac{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}}{\sqrt{n}} are small enough for (83) to hold, it holds that

Pr(|L(w^)σ2|>η)δ.\Pr(\,|L(\hat{w})-\sigma^{2}|>\eta\,)\leq\delta.

As a result, we show limnPr(|L(w^)σ2|>η)δ\lim_{n\to\infty}\Pr(\,|L(\hat{w})-\sigma^{2}|>\eta\,)\leq\delta for any δ>0\delta>0. To summarize, for any fixed η>0\eta>0, we have

limnPr(|L(w^)σ2|>η)=0\lim_{n\to\infty}\Pr(\,|L(\hat{w})-\sigma^{2}|>\eta\,)=0

and so L(w^)L(\hat{w}) converges to σ2\sigma^{2} in probability. ∎

D.2 Euclidean Norm

See 3

Proof.

The proof follows the same strategy as Theorem 5. By Theorem 2, if we choose

B=w2+(1+ϵ)1/2σnTr(Σ2),B=\|w^{*}\|_{2}+(1+\epsilon)^{1/2}\,\sigma\sqrt{\frac{n}{\operatorname{Tr}(\Sigma_{2})}},

then with large probability, {w:w2B}\{w:\|w\|_{2}\leq B\} has non-empty intersection with {w:Xw=Y}\{w:Xw=Y\}. This intersection necessarily contains the minimal norm interpolator w^\hat{w}.

Also, it is clear that B>wB>\|w^{*}\| and so by Corollary 2, it holds that

L(w^)supw2B,L^(w)=0L(w)(1+γ)(w2+(1+ϵ)1/2σnTr(Σ2))2Tr(Σ2)n=(1+γ)(w2Tr(Σ2)n+(1+ϵ)1/2σ)2(1+γ)(1+ϵ)(σ+w2Tr(Σ2)n)2.\begin{split}L(\hat{w})&\leq\sup_{\|w\|_{2}\leq B,\hat{L}(w)=0}L(w)\\ &\leq(1+\gamma)\left(\|w^{*}\|_{2}+(1+\epsilon)^{1/2}\,\sigma\sqrt{\frac{n}{\operatorname{Tr}(\Sigma_{2})}}\right)^{2}\frac{\operatorname{Tr}(\Sigma_{2})}{n}\\ &=(1+\gamma)\left(\left\lVert w^{*}\right\rVert_{2}\sqrt{\frac{\operatorname{Tr}(\Sigma_{2})}{n}}+(1+\epsilon)^{1/2}\,\sigma\right)^{2}\\ &\leq(1+\gamma)(1+\epsilon)\left(\sigma+\|w^{*}\|_{2}\sqrt{\frac{\operatorname{Tr}(\Sigma_{2})}{n}}\right)^{2}.\qed\end{split}
Theorem 12 (Sufficient conditions).

Under the model assumptions in (1), let w^\hat{w} be the minimal 2\ell_{2} norm interpolator. Suppose that as nn goes to \infty, there exists a sequence of covariance splits Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2} such that the following conditions hold:

  1.

    (Small large-variance dimension.)

    limnrank(Σ1)n=0.\lim_{n\to\infty}\frac{\operatorname{rank}(\Sigma_{1})}{n}=0. (84)
  2.

    (Large effective dimension.)

    limnnR(Σ2)=0.\lim_{n\to\infty}\frac{n}{R(\Sigma_{2})}=0. (85)
  3.

    (No aliasing condition.)

    limnw2𝔼Σ21/2H2n=0.\lim_{n\to\infty}\frac{\left\lVert w^{*}\right\rVert_{2}\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{2}}{\sqrt{n}}=0. (86)

Then L(w^)L(\hat{w}) converges to σ2\sigma^{2} in probability. In other words, minimum 2\ell_{2} norm interpolation is consistent.

Proof.

Fix any η>0\eta>0. For sufficiently small γ,ϵ\gamma,\epsilon and w2Tr(Σ2)n\|w^{*}\|_{2}\sqrt{\frac{\operatorname{Tr}(\Sigma_{2})}{n}}, it is clear that

(1+γ)(1+ϵ)(σ+w2Tr(Σ2)n)2σ2η.(1+\gamma)(1+\epsilon)\left(\sigma+\|w^{*}\|_{2}\sqrt{\frac{\operatorname{Tr}(\Sigma_{2})}{n}}\right)^{2}-\sigma^{2}\leq\eta. (87)

From Lemma 5 of [66], it holds that R(Σ2)r(Σ2)2R(\Sigma_{2})\leq r(\Sigma_{2})^{2}, and so the condition R(Σ2)=ω(n)R(\Sigma_{2})=\omega(n) implies that r(Σ2)=ω(n)=ω(1)r(\Sigma_{2})=\omega(\sqrt{n})=\omega(1). For any δ>0\delta>0, by the definition of γ,ϵ\gamma,\epsilon in Corollary 2 and Theorem 2 and our assumptions, the terms γ,ϵ\gamma,\epsilon and w2Tr(Σ2)n\|w^{*}\|_{2}\sqrt{\frac{\operatorname{Tr}(\Sigma_{2})}{n}} can be made small enough for Equation 87 to hold with a sufficiently large nn. By Theorem 3, we show that

limnPr(|L(w^)σ2|>η)δ\lim_{n\to\infty}\Pr(\,|L(\hat{w})-\sigma^{2}|>\eta\,)\leq\delta

Since the choice of δ>0\delta>0 is arbitrary, we have shown that L(w^)L(\hat{w}) converges to σ2\sigma^{2} in probability. ∎
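
To see the conditions of Theorem 12 at work, here is a small simulation (ours, purely illustrative; numpy only). It assumes a spiked covariance in which Σ₁ is a single unit-variance direction carrying the signal and Σ₂ is an isotropic tail of dimension n² with Tr(Σ₂) = √n, so that rank(Σ₁)/n → 0, Tr(Σ₂)/n → 0 and R(Σ₂)/n → ∞; the printed excess risk of the minimum ℓ2-norm interpolator should shrink as n grows.

```python
# Illustrative sketch (not from the paper): benign overfitting of min l2-norm interpolation.
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0

for n in [50, 100, 200]:
    d_tail = n ** 2                                  # tail dimension: R(Sigma_2) = d_tail = omega(n)
    lam = np.concatenate(([1.0], np.full(d_tail, np.sqrt(n) / d_tail)))   # Tr(Sigma_2) = sqrt(n)

    w_star = np.zeros(1 + d_tail)
    w_star[0] = 1.0                                  # signal lives on the spike (Sigma_1)

    X = rng.standard_normal((n, 1 + d_tail)) * np.sqrt(lam)
    Y = X @ w_star + sigma * rng.standard_normal(n)
    w_hat = X.T @ np.linalg.solve(X @ X.T, Y)        # minimum l2-norm interpolator

    excess = np.sum(lam * (w_hat - w_star) ** 2)     # L(w_hat) - sigma^2 = ||w_hat - w*||_Sigma^2
    print(f"n = {n:4d}   excess risk L(w_hat) - sigma^2 = {excess:.3f}")
```
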

D.2.1 Equivalence of Consistency Conditions

If we assume that w=Θ(1)\|w^{*}\|=\Theta(1), our consistency condition (Theorem 12) for minimum 2\ell_{2} norm interpolation is the existence of a covariance splitting such that

rank(Σ1)=o(n),TrΣ2=o(n),(TrΣ2)2Tr[(Σ2)2]=ω(n).\operatorname{rank}(\Sigma_{1})=o(n),\qquad\operatorname{Tr}\Sigma_{2}=o(n),\qquad\frac{(\operatorname{Tr}\Sigma_{2})^{2}}{\operatorname{Tr}[(\Sigma_{2})^{2}]}=\omega(n). (88)

We compare the above conditions to the following conditions:

rank(Σ1)=o(n),TrΣ2=o(n),TrΣ2Σ2𝑜𝑝=ω(n),(TrΣ2)2Tr[(Σ2)2]=ω(n).\operatorname{rank}(\Sigma_{1})=o(n),\quad\operatorname{Tr}\Sigma_{2}=o(n),\quad\frac{\operatorname{Tr}\Sigma_{2}}{\|\Sigma_{2}\|_{\mathit{op}}}=\omega(n),\quad\frac{(\operatorname{Tr}\Sigma_{2})^{2}}{\operatorname{Tr}[(\Sigma_{2})^{2}]}=\omega(n). (89)

Obviously, the conditions in (89) imply (88), but we show in Theorem 13 that the existence of a splitting that satisfies (88) also implies the existence of a (potentially different) splitting that satisfies (89). This is one way to see that the particular choice of kk^{*} from [66] can be made without loss of generality, at least if we only consider the consistency conditions.

Theorem 13.

Suppose that there exists Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2} that satisfies the conditions in (88). Then there exists a Σ=Σ1Σ2\Sigma=\Sigma^{{}^{\prime}}_{1}\oplus\Sigma^{{}^{\prime}}_{2} that satisfies the conditions in (89).

Proof.

Denote vv as the vector of eigenvalues of Σ\Sigma, and vkv_{k} as the vector obtained by setting the kk coordinates of vv corresponding to Σ1\Sigma_{1} to be 0. By our assumptions in (88), there exists k=o(n)k=o(n) such that

vk1=o(n),vk12vk22=ω(n).\|v_{k}\|_{1}=o(n),\qquad\frac{\|v_{k}\|_{1}^{2}}{\|v_{k}\|_{2}^{2}}=\omega(n).

For any τ0\tau\geq 0, we let Sτ={i[d]:|vk,i|τvk}S_{\tau}=\{i\in[d]:|v_{k,i}|\geq\tau\|v_{k}\|_{\infty}\} and define vk,τv_{k,\tau} by setting the coordinates of vkv_{k} in SτS_{\tau} to be 0. For simplicity of notation, define a=vk12/vk22a=\|v_{k}\|_{1}^{2}/\|v_{k}\|_{2}^{2} and b=vk1/vkb=\|v_{k}\|_{1}/\|v_{k}\|_{\infty}. Observe that

iSτ|vk,i|1τvkiSτvk,i2vk22τvk=vk1τba.\sum_{i\in S_{\tau}}|v_{k,i}|\leq\frac{1}{\tau\|v_{k}\|_{\infty}}\sum_{i\in S_{\tau}}v_{k,i}^{2}\leq\frac{\|v_{k}\|_{2}^{2}}{\tau\|v_{k}\|_{\infty}}=\frac{\|v_{k}\|_{1}}{\tau}\frac{b}{a}.

This shows that

vk,τ1(1bτa)vk1\|v_{k,\tau}\|_{1}\geq\left(1-\frac{b}{\tau a}\right)\|v_{k}\|_{1}

and

vk,ττvk=τbvk1.\|v_{k,\tau}\|_{\infty}\leq\tau\|v_{k}\|_{\infty}=\frac{\tau}{b}\|v_{k}\|_{1}.

In addition, observe that

τvk|Sτ|iSτ|vk,i|vk1τba.\tau\|v_{k}\|_{\infty}\cdot|S_{\tau}|\leq\sum_{i\in S_{\tau}}|v_{k,i}|\leq\frac{\|v_{k}\|_{1}}{\tau}\frac{b}{a}.

The above inequalities imply that

vk,τ1vk,τbτ(1bτa)\frac{\|v_{k,\tau}\|_{1}}{\|v_{k,\tau}\|_{\infty}}\geq\frac{b}{\tau}\left(1-\frac{b}{\tau a}\right)

and

|Sτ|a(bτa)2|S_{\tau}|\leq a\cdot\left(\frac{b}{\tau a}\right)^{2}

Finally, we pick τ\tau by setting b/(τa)=(n/a)3/4b/(\tau a)=(n/a)^{3/4}. By our assumption that a=ω(n)a=\omega(n), we can check

bτ(1bτa)=n3/4a1/4(1(n/a)3/4)=ω(n)(1o(1))=ω(n)\frac{b}{\tau}\left(1-\frac{b}{\tau a}\right)=n^{3/4}a^{1/4}(1-(n/a)^{3/4})=\omega(n)(1-o(1))=\omega(n)

and

a(bτa)2=a(n/a)3/2=n(n/a)1/2=o(n).a\cdot\left(\frac{b}{\tau a}\right)^{2}=a(n/a)^{3/2}=n(n/a)^{1/2}=o(n).

By Hölder’s inequality, we also have

vk,τ12vk,τ22vk,τ1vk,τ=ω(n).\frac{\|v_{k,\tau}\|_{1}^{2}}{\|v_{k,\tau}\|_{2}^{2}}\geq\frac{\|v_{k,\tau}\|_{1}}{\|v_{k,\tau}\|_{\infty}}=\omega(n).

It is clear that vk,τ1vk1=o(n)\|v_{k,\tau}\|_{1}\leq\|v_{k}\|_{1}=o(n) and k+|Sτ|=o(n)k+|S_{\tau}|=o(n), so picking the covariance splitting that corresponds to vk,τv_{k,\tau} concludes the proof. ∎
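
The trimming step in this proof is constructive, so it can be carried out numerically. The sketch below (ours, illustrative only; numpy) builds a tail spectrum whose ratio ||v_k||₁²/||v_k||₂² is much larger than a chosen n while ||v_k||₁/||v_k||_∞ is not, applies the cut S_τ with b/(τa) = (n/a)^{3/4}, and reports the quantities appearing in (89) for the trimmed tail.

```python
# Illustrative sketch (not from the paper): the eigenvalue-trimming construction of Theorem 13.
import numpy as np

n = 1000
# Tail spectrum with ||v||_1^2/||v||_2^2 >> n but ||v||_1/||v||_inf not much larger than n:
# a handful of moderate eigenvalues on top of many tiny ones.
v_k = np.concatenate([np.full(5, 0.05), np.full(2_000_000, 5e-6)])

a = v_k.sum() ** 2 / (v_k ** 2).sum()        # a = ||v_k||_1^2 / ||v_k||_2^2  (here >> n)
b = v_k.sum() / v_k.max()                    # b = ||v_k||_1 / ||v_k||_inf
tau = (b / a) * (a / n) ** 0.75              # chosen so that b/(tau*a) = (n/a)^{3/4}

S = v_k >= tau * v_k.max()                   # coordinates moved into Sigma_1'
v_trim = np.where(S, 0.0, v_k)               # remaining tail Sigma_2'

print("|S_tau| =", S.sum(), " (bound a*(n/a)^{3/2} =", round(a * (n / a) ** 1.5), ")")
print("Tr(Sigma_2')  =", v_trim.sum(), " (still o(n))")
print("r(Sigma_2')/n =", v_trim.sum() / v_trim.max() / n)
print("R(Sigma_2')/n =", v_trim.sum() ** 2 / (v_trim ** 2).sum() / n)
```
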

Appendix E Basis Pursuit (Minimum 1\ell_{1}-Norm Interpolation)

In this section, we illustrate the consequences of our general theory for basis pursuit. The following generalization bound for basis pursuit follows immediately from Corollary 3:

Corollary 5 (Generalization bound for 1\ell_{1} norm balls).

There exists an absolute constant C166C_{1}\leq 66 such that the following is true. Under the model assumptions in (1) with Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2}, fix δ1/4\delta\leq 1/4 and let γ=C1(log(1/δ)r1(Σ2)+log(1/δ)n+rank(Σ1)n)\gamma=C_{1}\left(\sqrt{\frac{\log(1/\delta)}{r_{1}(\Sigma_{2})}}+\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\operatorname{rank}(\Sigma_{1})}{n}}\right). If Bw1B\geq\|w^{*}\|_{1} and nn is large enough that γ1\gamma\leq 1, then the following holds with probability at least 1δ1-\delta:

supw1B,L^(w)=0L(w)(1+γ)(B𝔼Σ21/2H)2n.\sup_{\|w\|_{1}\leq B,\hat{L}(w)=0}L(w)\leq(1+\gamma)\frac{\left(B\cdot\operatorname*{\mathbb{E}}\lVert\Sigma^{1/2}_{2}H\rVert_{\infty}\right)^{2}}{n}. (90)
Proof.

Recall that the dual of the 1\ell_{1} norm is the \ell_{\infty} norm. By convexity

maxw11wΣ=maxiei,Σei=maxiΣii\max_{\|w\|_{1}\leq 1}\|w\|_{\Sigma}=\sqrt{\max_{i}\,\langle e_{i},\Sigma e_{i}\rangle}=\sqrt{\max_{i}\,\Sigma_{ii}}

and so we can use r1(Σ)=(𝔼Σ1/2H)2maxi(Σ)ii=r1(Σ)r_{1}(\Sigma)=\frac{\left(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{\infty}\right)^{2}}{\max_{i}(\Sigma)_{ii}}=r_{\|\cdot\|_{1}}(\Sigma). ∎
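
For a sense of scale, the quantity driving r_1 is the Gaussian expectation E||Σ^{1/2}H||_∞, which grows only logarithmically with the dimension when Σ is isotropic. Below is a small Monte Carlo sketch (ours, illustrative; numpy) comparing an estimate of (E||H||_∞)² with the lower bound log(d)/(π log 2) used in Appendix E.1 and with 2 log d.

```python
# Illustrative sketch (not from the paper): growth of r_1(I_d) = (E ||H||_inf)^2 with d.
import numpy as np

rng = np.random.default_rng(4)
m = 1000                                       # Monte Carlo draws

for d in [100, 1000, 10000]:
    H = rng.standard_normal((m, d))
    r1 = np.abs(H).max(axis=1).mean() ** 2     # r_1(I_d), since max_i Sigma_ii = 1
    print(f"d = {d:6d}   r_1(I_d) ~ {r1:6.2f}   "
          f"log(d)/(pi log 2) = {np.log(d) / (np.pi * np.log(2)):5.2f}   "
          f"2 log d = {2 * np.log(d):5.2f}")
```
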

The following norm bound for basis pursuit follows from Theorem 4:

Corollary 6 (1\ell_{1} norm bound).

There exists an absolute constant C264C_{2}\leq 64 such that the following is true. Under the model assumptions in (1), let Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2} such that Σ2\Sigma_{2} is diagonal. Fix δ1/4\delta\leq 1/4 and let ϵ=C2(log(1/δ)r1(Σ2)+log(1/δ)n+nr1(Σ2))\epsilon=C_{2}\left(\sqrt{\frac{\log(1/\delta)}{r_{1}(\Sigma_{2})}}+\sqrt{\frac{\log(1/\delta)}{n}}+\frac{n}{r_{1}(\Sigma_{2})}\right). Then if nn and the effective rank r1(Σ2)r_{1}(\Sigma_{2}) are large enough that ϵ1\epsilon\leq 1, with probability at least 1δ1-\delta, it holds that

w^1w1+(1+ϵ)1/2σn𝔼Σ21/2H.\left\lVert\hat{w}\right\rVert_{1}\leq\left\lVert w^{*}\right\rVert_{1}+(1+\epsilon)^{1/2}\,\sigma\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{\infty}}. (91)
Proof.

Recall that u=conv{sign(ui)ei:iargmax|ui|}\partial\|u\|_{*}=\text{conv}\{\operatorname{sign}(u_{i})\,e_{i}:i\in\arg\max|u_{i}|\}, where conv(S)\text{conv}(S) denotes the convex hull of SS. By definition, it holds almost surely that

vΣ2maxi[d]eiΣ2=maxi(Σ2)ii,\|v^{*}\|_{\Sigma_{2}}\leq\max_{i\in[d]}\,\|e_{i}\|_{\Sigma_{2}}=\sqrt{\max_{i}\,(\Sigma_{2})_{ii}}, (92)

and so we can pick ϵ1\epsilon_{1} such that

(1+ϵ1)𝔼vΣ2=maxi(Σ2)ii(1+\epsilon_{1})\operatorname*{\mathbb{E}}\left\lVert v^{*}\right\rVert_{\Sigma_{2}}=\sqrt{\max_{i}\,(\Sigma_{2})_{ii}}

and

(1+ϵ1)2nR1(Σ2)=n(1+ϵ1)2(𝔼vΣ2)2(𝔼Σ21/2H)2=nr1(Σ2).(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|_{1}}(\Sigma_{2})}=n\frac{(1+\epsilon_{1})^{2}(\operatorname*{\mathbb{E}}\|v^{*}\|_{\Sigma_{2}})^{2}}{\left(\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{\infty}\right)^{2}}=\frac{n}{r_{1}(\Sigma_{2})}.

In addition, since Σ2\Sigma_{2} is diagonal, the coordinates of Σ21/2H\Sigma_{2}^{1/2}H that correspond to the zero diagonals of Σ2\Sigma_{2} are 0. Therefore, vv^{*} must also have zero entries in those coordinates. In other words, vv^{*} lies in the span of Σ2\Sigma_{2}. As PP is the orthogonal projection onto the space spanned by Σ2\Sigma_{2}, this implies Pv=vPv^{*}=v^{*}, hence Pv1=v1=1\|Pv^{*}\|_{1}=\|v^{*}\|_{1}=1 and we can take ϵ2=0\epsilon_{2}=0. Plugging ϵ1,ϵ2\epsilon_{1},\epsilon_{2} into Theorem 4 concludes the proof. ∎

Theorem 14 (Benign overfitting).

Fix any δ1/2\delta\leq 1/2. Under the model assumptions in (1), let Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2} such that Σ2\Sigma_{2} is diagonal. Suppose that nn and the effective rank r1(Σ2)r_{1}(\Sigma_{2}) are sufficiently large such that γ,ϵ1\gamma,\epsilon\leq 1 with the same choice of γ\gamma and ϵ\epsilon as in Corollaries 5 and 6. Then, with probability at least 1δ1-\delta:

L(w^)(1+γ)(1+ϵ)(σ+w1𝔼Σ21/2Hn)2.L(\hat{w})\leq(1+\gamma)(1+\epsilon)\left(\sigma+\|w^{*}\|_{1}\frac{\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{\infty}}{\sqrt{n}}\right)^{2}. (93)

The proof of Theorem 14 uses Corollaries 5 and 6 and follows the same lines as the proof of Theorem 5. The details are repetitive, so we do not write them out in full here. As before, we can use the finite-sample bound to deduce sufficient conditions for consistency.

Theorem 15 (Sufficient conditions).

Under the model assumptions in (1), let w^\hat{w} be the minimal 1\ell_{1} norm interpolator. Suppose that as nn goes to \infty, there exists a sequence of covariance splits Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2} such that Σ2\Sigma_{2} is diagonal and the following conditions hold:

  1.

    (Small large-variance dimension.)

    limnrank(Σ1)n=0.\lim_{n\to\infty}\frac{\operatorname{rank}(\Sigma_{1})}{n}=0. (94)
  2.

    (Large effective dimension.)

    limnnr1(Σ2)=0.\lim_{n\to\infty}\frac{n}{r_{1}(\Sigma_{2})}=0. (95)
  3.

    (No aliasing condition.)

    limnw1𝔼Σ21/2Hn=0.\lim_{n\to\infty}\frac{\left\lVert w^{*}\right\rVert_{1}\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{\infty}}{\sqrt{n}}=0. (96)

Then L(w^)L(\hat{w}) converges to σ2\sigma^{2} in probability. In other words, minimum 1\ell_{1} norm interpolation is consistent.

Again, the proof of Theorem 15 is exactly analogous to Theorem 12, so we omit the full proof here.

E.1 Isotropic Features

Theorem 16.

There exists an absolute constant C3140C_{3}\leq 140 such that the following is true. Under the model assumptions in (1) with Σ=Id\Sigma=I_{d}, denote SS as the support of ww^{*}. Fix δ1/4\delta\leq 1/4 and let ϵ=C3(log(1/δ)n+log(1/δ)log(d|S|)+nlog(d|S|))\epsilon=C_{3}\left(\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\log(1/\delta)}{\log(d-|S|)}}+\frac{n}{\log(d-|S|)}\right). Then if nn and dd are large enough that ϵ1\epsilon\leq 1, the following holds with probability 1δ1-\delta where HN(0,Id|S|)H^{\prime}\sim N(0,I_{d-|S|}):

w^1(1+ϵ)1/2(σ2+w22)1/2n𝔼H.\left\lVert\hat{w}\right\rVert_{1}\leq(1+\epsilon)^{1/2}\,(\sigma^{2}+\|w^{*}\|_{2}^{2})^{1/2}\,\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}. (97)
Proof.

Write X=[XS,XS𝖼]X=[X_{S},X_{S^{\mathsf{c}}}], where XSX_{S} is formed by selecting the columns of XX in SS. Also let ξ=XSwS+ξ\xi^{\prime}=X_{S}w_{S}^{*}+\xi; then the entries of ξ\xi^{\prime} are i.i.d. N(0,σ2+w22)N(0,\sigma^{2}+\|w^{*}\|_{2}^{2}) and independent of XS𝖼X_{S^{\mathsf{c}}}. Observe that Y=XS𝖼0+ξY=X_{S^{\mathsf{c}}}0+\xi^{\prime}. By choosing Σ1=0\Sigma_{1}=0 in Corollary 6, we show with large probability

minXS𝖼w=Yw1(1+ϵ)1/2(σ2+w22)1/2n𝔼H\min_{X_{S^{\mathsf{c}}}w=Y}\|w\|_{1}\leq(1+\epsilon)^{1/2}\,(\sigma^{2}+\|w^{*}\|_{2}^{2})^{1/2}\,\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}

for some ϵ64(log(1/δ)n+log(1/δ)r1(Id|S|)+nr1(Id|S|))\epsilon\leq 64\left(\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\log(1/\delta)}{r_{1}(I_{d-|S|})}}+\frac{n}{r_{1}(I_{d-|S|})}\right). By the bound of [50], it holds that

r1(Id|S|)=(𝔼H)2log(d|S|)πlog2r_{1}(I_{d-|S|})=\left(\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}\right)^{2}\geq\frac{\log(d-|S|)}{\pi\log 2}

and so we can choose C364πlog2<140C_{3}\leq 64\pi\log 2<140. Observe that if XS𝖼w=YX_{S^{\mathsf{c}}}w=Y, then X(0,w)T=YX(0,w)^{T}=Y and (0,w)1=w1\|(0,w)\|_{1}=\|w\|_{1}. It follows that

w^1=minXw=Yw1minXS𝖼w=Yw1.\left\lVert\hat{w}\right\rVert_{1}=\min_{Xw=Y}\|w\|_{1}\leq\min_{X_{S^{\mathsf{c}}}w=Y}\|w\|_{1}.\qed
Theorem 17.

Under the model assumptions in (1) with Σ=Id\Sigma=I_{d}, fix any δ1/2\delta\leq 1/2 and let η=368(log(1/δ)n+log(1/δ)+log|S|log(d|S|)+nlog(d|S|))\eta=368\left(\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\log(1/\delta)+\log|S|}{\log(d-|S|)}}+\frac{n}{\log(d-|S|)}\right). Suppose that nn and dd are large enough that η1\eta\leq 1. Then, with probability at least 1δ1-\delta,

L(w^)(1+η)(σ2+w22).L(\hat{w})\leq(1+\eta)(\sigma^{2}+\|w^{*}\|_{2}^{2}). (98)
Proof.

By Theorem 16, if we choose

B=(1+ϵ)1/2(σ2+w22)1/2n𝔼HB=(1+\epsilon)^{1/2}\,(\sigma^{2}+\|w^{*}\|_{2}^{2})^{1/2}\,\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}

then with large probability, 𝒦={w:w1B}\mathcal{K}=\{w:\|w\|_{1}\leq B\} has non-empty intersection with {w:Xw=Y}\{w:Xw=Y\}, and this intersection contains the minimal 1\ell_{1} norm interpolator w^\hat{w}. It can easily be seen that

W(𝒦)=B𝔼HandR(𝒦)=BW(\mathcal{K})=B\operatorname*{\mathbb{E}}\|H\|_{\infty}\quad\text{and}\quad R(\mathcal{K})=B

and so by Theorem 1, with large probability

L(w^)supw1B,L^(w)=0L(w)1+βn(B𝔼H+B2log(64δ)+w22log(64δ))2=1+βnB2(𝔼H)2(1+γ)2=(1+β)(1+ϵ)(1+γ)2(𝔼H𝔼H)2(σ2+w22)\begin{split}L(\hat{w})&\leq\sup_{\|w\|_{1}\leq B,\hat{L}(w)=0}L(w)\\ &\leq\frac{1+\beta}{n}\left(B\operatorname*{\mathbb{E}}\|H\|_{\infty}+B\sqrt{2\log\left(\frac{64}{\delta}\right)}+\|w^{*}\|_{2}\sqrt{2\log\left(\frac{64}{\delta}\right)}\right)^{2}\\ &=\frac{1+\beta}{n}B^{2}(\operatorname*{\mathbb{E}}\|H\|_{\infty})^{2}\left(1+\gamma\right)^{2}\\ &=(1+\beta)(1+\epsilon)(1+\gamma)^{2}\left(\frac{\operatorname*{\mathbb{E}}\|H\|_{\infty}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}\right)^{2}(\sigma^{2}+\|w^{*}\|_{2}^{2})\\ \end{split}

where β=66log(1/δ)/n\beta=66\sqrt{\log(1/\delta)/n} and γ=2log(64δ)𝔼H+w22log(64δ)B𝔼H\gamma=\frac{\sqrt{2\log\left(\frac{64}{\delta}\right)}}{\operatorname*{\mathbb{E}}\|H\|_{\infty}}+\frac{\|w^{*}\|_{2}\sqrt{2\log\left(\frac{64}{\delta}\right)}}{B\operatorname*{\mathbb{E}}\|H\|_{\infty}}. Observe that

Bw2n𝔼Hand𝔼H𝔼H.B\geq\|w^{*}\|_{2}\,\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}\quad\text{and}\quad\operatorname*{\mathbb{E}}\|H\|_{\infty}\geq\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}.

Combined with the lower bound of [50], we show

γ2πlog(128/δ)logd+2log(64/δ)n8(log(1/δ)logd+log(1/δ)n).\gamma\leq\sqrt{\frac{2\pi\log(128/\delta)}{\log d}}+\sqrt{\frac{2\log(64/\delta)}{n}}\leq 8\left(\sqrt{\frac{\log(1/\delta)}{\log d}}+\sqrt{\frac{\log(1/\delta)}{n}}\right).

In addition, we have

𝔼H𝔼H=1+𝔼H𝔼H𝔼H1+2πlog(2)log|S|log(d|S|).\frac{\operatorname*{\mathbb{E}}\|H\|_{\infty}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}=1+\frac{\operatorname*{\mathbb{E}}\|H\|_{\infty}-\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}\leq 1+\sqrt{\frac{2\pi\log(2)\log|S|}{\log(d-|S|)}}.

Finally, it is a routine calculation to show

(1+β)(1+ϵ)(1+γ)2(𝔼H𝔼H)21+368(log(1/δ)n+log(1/δ)+log|S|log(d|S|)+nlog(d|S|))=1+η\begin{split}&(1+\beta)(1+\epsilon)(1+\gamma)^{2}\left(\frac{\operatorname*{\mathbb{E}}\|H\|_{\infty}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}\right)^{2}\\ \leq\,&1+368\left(\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\log(1/\delta)+\log|S|}{\log(d-|S|)}}+\frac{n}{\log(d-|S|)}\right)=1+\eta\\ \end{split}

using the inequality (1+x)(1+y)1+x+2y(1+x)(1+y)\leq 1+x+2y for x1x\leq 1. ∎
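
Since the minimum ℓ1-norm interpolator is the solution of a linear program, the isotropic quantities above can be examined directly. Below is a small sketch (ours, illustrative only), assuming numpy and scipy.optimize.linprog and a deliberately small problem size: it solves basis pursuit on noisy isotropic data, then compares ||ŵ||₁ with the leading term √(n(σ² + ||w*||₂²))/E||H'||_∞ of Theorem 16 and L(ŵ) with σ² + ||w*||₂² from Theorem 17. Note the toy size is far from the regime n = o(log d) needed for the guarantees, so these are order-of-magnitude comparisons only.

```python
# Illustrative sketch (not from the paper): basis pursuit on noisy isotropic data.
# The problem size is far from the n = o(log d) regime, so the printed benchmarks
# are only order-of-magnitude comparisons, not verified bounds.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
n, d, sigma = 25, 10000, 0.5
w_star = np.zeros(d)
w_star[0] = 0.5                                # sparse signal, ||w*||_2 = 0.5

X = rng.standard_normal((n, d))                # isotropic features, Sigma = I_d
Y = X @ w_star + sigma * rng.standard_normal(n)

# min ||w||_1 subject to Xw = Y, written as an LP over (u, v) with w = u - v, u, v >= 0
res = linprog(c=np.ones(2 * d), A_eq=np.hstack([X, -X]), b_eq=Y,
              bounds=(0, None), method="highs")
assert res.status == 0
w_hat = res.x[:d] - res.x[d:]

E_H_inf = np.abs(rng.standard_normal((500, d))).max(axis=1).mean()   # ~ E||H'||_inf (d vs d-|S| negligible)
print("||w_hat||_1                                  :", np.abs(w_hat).sum())
print("sqrt(n (sigma^2 + ||w*||_2^2)) / E||H'||_inf :",
      np.sqrt(n * (sigma ** 2 + w_star @ w_star)) / E_H_inf)
print("L(w_hat) = sigma^2 + ||w_hat - w*||_2^2      :",
      sigma ** 2 + np.sum((w_hat - w_star) ** 2))
print("sigma^2 + ||w*||_2^2                         :", sigma ** 2 + w_star @ w_star)
```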