
Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds and Benign Overfitting

Frederic Koehler (MIT, [email protected])
Lijia Zhou (University of Chicago, [email protected])
Danica J. Sutherland (UBC and Amii, [email protected])
Nathan Srebro (TTI-Chicago, [email protected])
These authors contributed equally.
Abstract

We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class’s Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of [66] for minimum-norm interpolators, and confirms a prediction of [73] for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum \ell_1-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings.

Collaboration on the Theoretical Foundations of Deep Learning (deepfoundations.ai)

1 Introduction

The traditional understanding of machine learning suggests that models with zero training error tend to overfit, and explicit regularization is often necessary to achieve good generalization. Given the empirical success of deep learning models with zero training error [58, 55] and the (re-)discovery of the “double descent” phenomenon [61], however, it has become clear that the textbook U-shaped learning curve is only part of a larger picture: it is possible for an overparameterized model with zero training loss to achieve low population error in a noisy setting. In an effort to understand how interpolation learning occurs, there has been much recent study of the testbed problem of linear regression with Gaussian features [e.g., 66, 72, 67, 63, 69, 73, 70, 71]. Significant progress has been made in this setting, including nearly-matching necessary and sufficient conditions for consistency of the minimal \ell_2 norm interpolator [66].

Despite the fundamental role of uniform convergence in statistical learning theory, most of this line of work has used other techniques to analyze the particular minimal-norm interpolator. ([71] argue that [66]’s proof technique is fundamentally based on uniform convergence of a surrogate predictor; [76] study a closely related setting with a uniform convergence-type argument, but do not establish consistency. We discuss both papers in more detail in Section 4.) Instead of directly analyzing the population error of a learning algorithm, a uniform convergence-type argument would control the worst-case generalization gap over a class of predictors containing the typical outputs of a learning rule. Typically, this is done because for many algorithms – unlike the minimal Euclidean norm interpolator – it is difficult to exactly characterize the learned predictor, but we may be able to say, e.g., that its norm is not too large. Since uniform convergence does not tightly depend on a specific algorithm, the resulting analysis can highlight the key properties that lead to good generalization: it can give bounds not only for, say, the minimal-norm interpolator, but also for other interpolators with low norm [e.g., 73], increasing our confidence that low norm – and not some other property the particular minimal-norm interpolator happens to have – is key to generalization. In linear regression, practical training algorithms may not always find the exact minimal Euclidean norm solution, so it is also reassuring that all interpolators with sufficiently low Euclidean norm generalize.

[64], however, raised significant questions about the applicability of typical uniform convergence arguments to certain high-dimensional regimes, similar to those seen in interpolation learning. Following their work, [73, 65, 76, 71] all demonstrated the failure of forms of uniform convergence in various interpolation learning setups. To sidestep these negative results, [73] suggested considering bounds which are uniform only over predictors with zero training error. This weaker notion of uniform convergence has been standard in analyses of “realizable” (noiseless) learning at least since the work of [39, Chapter 6.4] and [40]. [73] demonstrated that at least in one particular noisy setting, such uniform convergence is sufficient for showing consistency of the minimal \ell_2 norm interpolator, even though “non-realizable” uniform convergence arguments (those over predictors regardless of their training error) cannot succeed. It remains unknown, however, whether these types of arguments can apply to more general linear regression problems and more typical asymptotic regimes, particularly showing rates of convergence rather than just consistency.

In this work, we show for the first time that uniform convergence is indeed able to explain benign overfitting in general high-dimensional Gaussian linear regression problems. Similarly to how the standard analysis for learning with Lipschitz losses bounds generalization gaps through Rademacher complexity [e.g., 53], our Theorem 1 (Section 3) establishes a finite-sample high probability bound on the uniform convergence of the error of interpolating predictors in a hypothesis class, in terms of its Gaussian width. This is done through an application of the Gaussian Minimax Theorem; see the proof sketch in Section 7. Combined with an analysis of the norm of the minimal \ell_2 norm interpolator (Theorem 2 in Section 4), our bound recovers known consistency results [66], as well as proving a conjectured upper bound for larger-norm interpolators [73].

In addition, since we do not restrict ourselves to Euclidean norm balls but instead consider interpolators in an arbitrary compact set, our results allow for a wide range of other applications. Our analysis leads to a natural extension of the consistency result and notions of effective rank of [66] to arbitrary norms (Theorem 5 in Section 5). As a demonstration of our general theory, in Section 6 we show novel consistency results for the minimal \ell_1 norm interpolator (basis pursuit) in particular settings, which we believe are the first results of their kind.

2 Problem Formulation

Notation.

We use \lVert\cdot\rVert_p for the \ell_p norm, \lVert x\rVert_p = \left(\sum_i |x_i|^p\right)^{1/p}. We always take \max_{x\in S} f(x) to be -\infty when S is empty, and similarly \min_{x\in S} f(x) to be \infty. We use standard O(\cdot) notation, and a \lesssim b for inequality up to an absolute constant. For a positive semidefinite matrix A, the Mahalanobis (semi-)norm is \lVert x\rVert_A^2 := \langle x, Ax\rangle. For a matrix A and set S, AS denotes the set \{Ax : x\in S\}.

Data model.

We assume that data (X, Y) is generated as

Y = Xw^* + \xi, \qquad X_i \stackrel{iid}{\sim} N(0, \Sigma), \qquad \xi \sim N(0, \sigma^2 I_n), \qquad (1)

where X \in \mathbb{R}^{n\times d} has i.i.d. Gaussian rows X_1, \ldots, X_n, d \geq n, w^* is arbitrary, and \xi is Gaussian and independent of X. Though our proof techniques crucially depend on X_i being Gaussian, we can easily relax the assumption on the noise \xi to only being sub-Gaussian; we assume Gaussian noise here for simplicity. The empirical and population loss are defined as, respectively,

\hat{L}(w) = \frac{1}{n}\lVert Y - Xw\rVert_2^2, \qquad L(w) = \operatorname*{\mathbb{E}}_{(x,y)}(y - \langle w, x\rangle)^2 = \sigma^2 + \lVert w - w^*\rVert_\Sigma^2,

where in the expectation y = \langle x, w^*\rangle + \xi_0 with x \sim N(0, \Sigma) independent of \xi_0 \sim N(0, \sigma^2). For an arbitrary norm \lVert\cdot\rVert, the minimal norm interpolator is \hat{w} = \operatorname*{arg\,min}_{\hat{L}(w)=0} \lVert w\rVert. For the Euclidean norm specifically, the minimal norm interpolator can be written explicitly as \hat{w} = X^{\mathsf{T}}(XX^{\mathsf{T}})^{-1}Y. If there is more than one minimal norm interpolator, all of our guarantees hold for any minimizer \hat{w}.
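As a concrete, purely illustrative instance of this setup (our own sketch; the diagonal covariance and the values of n, d, and \sigma are arbitrary choices), the following numpy snippet draws data from (1) and computes the minimal Euclidean norm interpolator and its empirical and population losses:

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 500, 0.5

# Diagonal covariance: one unit-variance direction plus many small "tail" directions.
diag = np.concatenate([[1.0], 0.01 * np.ones(d - 1)])
w_star = np.zeros(d)
w_star[0] = 1.0

X = rng.standard_normal((n, d)) * np.sqrt(diag)   # rows X_i ~ N(0, Sigma), Sigma = diag(diag)
Y = X @ w_star + sigma * rng.standard_normal(n)   # Y = X w* + xi, as in (1)

# Minimal Euclidean norm interpolator: w_hat = X^T (X X^T)^{-1} Y.
w_hat = X.T @ np.linalg.solve(X @ X.T, Y)

train_loss = np.mean((Y - X @ w_hat) ** 2)                    # \hat{L}(w_hat), ~0 up to numerical error
pop_loss = sigma**2 + np.sum(diag * (w_hat - w_star) ** 2)    # L(w_hat) = sigma^2 + ||w_hat - w*||_Sigma^2
print(train_loss, pop_loss)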

Speculative bound.

[73] studied uniform convergence of low-norm interpolators,

\sup_{\lVert w\rVert \leq B,\ \hat{L}(w)=0} L(w) - \hat{L}(w). \qquad (2)

Clearly, when B \geq \lVert\hat{w}\rVert, this quantity upper-bounds the population risk of \hat{w}. [73] evaluated the asymptotic limit of (2) in one particular setting. But they further speculated that a bound of the following form may hold more generally:

\sup_{\lVert w\rVert_2 \leq B,\ \hat{L}(w)=0} L(w) - \hat{L}(w) \leq \frac{B^2 \psi_n}{n} + o(1), \qquad (\star)

where \psi_n \approx \lVert x\rVert^2. (When \lVert x\rVert^2 concentrates, we need \psi_n to match its typical value: \psi_n might be a high-probability bound on \lVert x\rVert^2, or, for (sub)Gaussian data as in our case, \psi_n = \operatorname*{\mathbb{E}}\lVert x\rVert^2.) As discussed by [73], a bound almost of this form is implied by results of [47] for general data distributions, except that approach gives a large leading constant and logarithmic factors. To show consistency of benign overfitting, though, we need (\star) to hold without even a constant multiplicative factor. [73] ask whether and when this holds, speculating that it might in broad generality.

The goal of this paper is essentially to prove (\star), at least for Gaussian data, and to use it to show consistency of the minimal norm interpolator. Our main result (Theorem 1) can be thought of as showing (\star) for Gaussian data with \psi_n = \operatorname*{\mathbb{E}}\lVert x\rVert^2, as well as strengthening and significantly generalizing it. Our result is more general, as it applies to general compact hypothesis sets beyond just the Euclidean ball in (\star). But it also falls short of fully proving the speculative (\star), since our results are limited to Gaussian data; we are not aware of any reason to believe that a tight uniform convergence guarantee of the form (\star) does not hold much more broadly. We leave extending our results beyond the Gaussian case open.

3 Generic Uniform Convergence Guarantee

To state our results, we first need to introduce some key tools.

Definition 1.

The Gaussian width and the radius of a set S \subset \mathbb{R}^d are

W(S) := \operatorname*{\mathbb{E}}_{H\sim N(0,I_d)} \sup_{s\in S} |\langle s, H\rangle| \qquad\text{and}\qquad \operatorname{rad}(S) := \sup_{s\in S} \lVert s\rVert_2.

The radius measures the size of a set in the Euclidean norm. The Gaussian width of a set S can be interpreted as the number of dimensions that a random projection needs to approximately preserve the norms of points in S [57, 42]. These two complexity measures are connected by Gaussian concentration: Gaussian width is the expected value of the supremum of a Gaussian process, and the radius upper bounds the typical deviation of that supremum from its expected value.
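To make these two quantities concrete, here is a small Monte Carlo sketch (our own illustration; the sets and sizes are arbitrary) estimating the Gaussian width of an \ell_2 ball and an \ell_1 ball via their support functions; both sets have radius B:

import numpy as np

def gaussian_width(support_fn, d, n_samples=2000, seed=0):
    """Monte Carlo estimate of W(S) = E sup_{s in S} |<s, H>|, given the support function of S."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((n_samples, d))
    return np.mean([support_fn(h) for h in H])

d, B = 1000, 2.0
# For the l2 ball {||s||_2 <= B}: sup_s |<s, H>| = B ||H||_2, and rad = B.
w_l2 = gaussian_width(lambda h: B * np.linalg.norm(h), d)
# For the l1 ball {||s||_1 <= B}: sup_s |<s, H>| = B ||H||_inf, and rad = B as well.
w_l1 = gaussian_width(lambda h: B * np.max(np.abs(h)), d)

print(w_l2, B * np.sqrt(d))               # ~ B sqrt(d)
print(w_l1, B * np.sqrt(2 * np.log(d)))   # ~ B sqrt(2 log d)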

Definition 2 (Covariance splitting).

Given a positive semidefinite matrix \Sigma \in \mathbb{R}^{d\times d}, we write \Sigma = \Sigma_1 \oplus \Sigma_2 if \Sigma = \Sigma_1 + \Sigma_2, each matrix is positive semidefinite, and their spans are orthogonal.

Note that \Sigma = \Sigma_1 \oplus \Sigma_2 effectively splits the eigenvectors of \Sigma into two disjoint parts.

We can now state our generic bound. Section 7 sketches the proof; all full proofs are in the appendix.

Theorem 1 (Main generalization bound).

There exists an absolute constant C_1 \leq 66 such that the following is true. Under the model assumptions in (1), let \mathcal{K} be an arbitrary compact set, and take any covariance splitting \Sigma = \Sigma_1 \oplus \Sigma_2. Fixing \delta \leq 1/4, let \beta = C_1\left(\sqrt{\frac{\log(1/\delta)}{n}} + \sqrt{\frac{\operatorname{rank}(\Sigma_1)}{n}}\right). If n is large enough that \beta \leq 1, then the following holds with probability at least 1-\delta:

\sup_{w\in\mathcal{K},\,\hat{L}(w)=0} L(w) \leq \frac{1+\beta}{n}\left[W(\Sigma_2^{1/2}\mathcal{K}) + \left(\operatorname{rad}(\Sigma_2^{1/2}\mathcal{K}) + \lVert w^*\rVert_{\Sigma_2}\right)\sqrt{2\log\left(\frac{32}{\delta}\right)}\right]^2.

In our applications, we consider \mathcal{K} = \{w\in\mathbb{R}^d : \lVert w\rVert \leq B\} for an arbitrary norm, with B based on a high-probability upper bound for \lVert\hat{w}\rVert. Depending on the application, the rank of \Sigma_1 will be either constant or o(n), so that \beta \to 0. The term \lVert w^*\rVert_{\Sigma_2} generally does not scale with n and hence is often negligible. As hinted earlier, we can think of the Gaussian width term and the radius term as bias and variance, respectively. To achieve consistency, we can expect the Gaussian width to scale as \sigma\sqrt{n}. This agrees with the intuition that we need increasing norm to memorize noise when the model is not realizable. The radius term requires some care in our applications, but can be handled by the covariance splitting technique. As part of the analysis in the following sections, we will rigorously show in many settings that the dominant term in the upper bound is the Gaussian width. In these cases, our upper bound is roughly W(\Sigma_2^{1/2}\mathcal{K})^2/n, which can be viewed as the ratio between the (probabilistic) dimension of our hypothesis class and the sample size. We will also analyze how large \mathcal{K} must be to contain any interpolators, allowing us to derive consistency results.

4 Application: Euclidean Norm Ball

It can be easily seen that the Gaussian width of a Euclidean norm ball reduces nicely to the product of the norm of our predictor with the typical norm of x: if \mathcal{K} = \{w\in\mathbb{R}^d : \lVert w\rVert_2 \leq B\}, then

W(\Sigma^{1/2}\mathcal{K}) = B\cdot\operatorname*{\mathbb{E}}_{H\sim N(0,I_d)}\lVert\Sigma^{1/2}H\rVert_2 \leq \sqrt{B^2\operatorname*{\mathbb{E}}\lVert x\rVert_2^2}. \qquad (3)

Therefore, it is plausible that (\star) holds with \psi_n = \operatorname*{\mathbb{E}}\lVert x\rVert_2^2 = \operatorname{Tr}(\Sigma). Figure 1 illustrates this generalization bound in two simple examples, motivated by [63, 73]. Indeed, an application of our main theorem proves that this is exactly the case for Gaussian data.

Corollary 1 (Proof of the speculative bound (\star) for Gaussian data).

Fix any \delta \leq 1/4. Under the model assumptions in (1) with B \geq \lVert w^*\rVert_2 and n \gtrsim \log(1/\delta), for some \gamma \lesssim \sqrt[4]{\log(1/\delta)/n}, it holds with probability at least 1-\delta that

\sup_{\lVert w\rVert_2 \leq B,\,\hat{L}(w)=0} L(w) \leq (1+\gamma)\frac{B^2\operatorname{Tr}(\Sigma)}{n}. \qquad (4)
[Figure 1: two panels, (a) \lambda = 1 (isotropic) and (b) \lambda = 0.1.] Figure 1 caption: Illustration of our generalization bound when \Sigma = \begin{bmatrix}1 & 0\\ 0 & \lambda^2 I_{d-1}\end{bmatrix}, n = 200, \sigma^2 = 1/2, w^* = (1/\sqrt{2}, 0, \ldots, 0), and d is varied (x-axis). Averages (curve) and standard deviations (error bars) are estimated from 400 trials for each value of d. The curve marked “loss” corresponds to L(\hat{w}) for the minimum Euclidean norm interpolator \hat{w}; “bound” to \lVert\hat{w}\rVert_2^2\operatorname{Tr}(\Sigma)/n = \lVert\hat{w}\rVert_2^2(1+\lambda^2(d-1))/n, which is an asymptotic bound on L(\hat{w}) due to Corollary 1; “null” is the loss L(0) = 1 of the zero estimator; and “bayes” is the Bayes-optimal error L(w^*) = \sigma^2 = 1/2. The vertical line is d/n = 1, the location of the double-descent peak; for d/n < 1 there are almost surely no interpolators.
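A minimal sketch (ours, using fewer trials and a coarser grid of d than the figure) of the “loss” and “bound” curves from panel (b), reusing the closed form for the minimal-norm interpolator from Section 2:

import numpy as np

rng = np.random.default_rng(1)
n, sigma2, lam, trials = 200, 0.5, 0.1, 50

for d in [300, 1000, 4000]:
    losses, bounds = [], []
    for _ in range(trials):
        diag = np.concatenate([[1.0], lam**2 * np.ones(d - 1)])   # Sigma = diag(1, lam^2 I_{d-1})
        w_star = np.zeros(d)
        w_star[0] = 1 / np.sqrt(2)
        X = rng.standard_normal((n, d)) * np.sqrt(diag)
        Y = X @ w_star + np.sqrt(sigma2) * rng.standard_normal(n)
        w_hat = X.T @ np.linalg.solve(X @ X.T, Y)                  # minimum l2-norm interpolator
        losses.append(sigma2 + np.sum(diag * (w_hat - w_star) ** 2))   # "loss":  L(w_hat)
        bounds.append(np.sum(w_hat**2) * np.sum(diag) / n)             # "bound": ||w_hat||_2^2 Tr(Sigma)/n
    print(d, np.mean(losses), np.mean(bounds))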

The above bound is clean and simple, but only proves a sub-optimal rate of n^{-1/4}. This is because the choice of covariance split used in the proof of Corollary 1 uses no information about the particular structure of \Sigma. The bound can also be slightly loose in situations where the eigenvalues of \Sigma decay rapidly, in which case \operatorname{Tr}(\Sigma) can be replaced by a smaller quantity. We next state a more precise bound on the generalization error, which requires introducing the following notions of effective rank.

Definition 3 ([66]).

The effective ranks of a covariance matrix \Sigma are

r(\Sigma) = \frac{\operatorname{Tr}(\Sigma)}{\lVert\Sigma\rVert_{\mathit{op}}} \qquad\text{and}\qquad R(\Sigma) = \frac{\operatorname{Tr}(\Sigma)^2}{\operatorname{Tr}(\Sigma^2)}.

The rank r(\Sigma) can roughly be understood as the squared ratio between the Gaussian width and radius in our previous bound. It is related to the concentration of the \ell_2 norm of a Gaussian vector with covariance \Sigma. In fact, both definitions of effective rank can be derived by applying Bernstein’s inequality to \lVert x\rVert^2/\operatorname*{\mathbb{E}}\lVert x\rVert^2. We will only need r(\Sigma) in the generalization bound below, but we will show in Theorem 2 that R(\Sigma) can be used to control the norm of the minimal norm interpolator \hat{w}.
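For concreteness, a tiny helper (our own, hypothetical) computing both effective ranks from the spectrum; it also illustrates how splitting a large spike off into \Sigma_1 increases the effective ranks of the remaining \Sigma_2:

import numpy as np

def effective_ranks(Sigma):
    """r(Sigma) = Tr(Sigma) / ||Sigma||_op and R(Sigma) = Tr(Sigma)^2 / Tr(Sigma^2)."""
    eig = np.linalg.eigvalsh(Sigma)
    tr = eig.sum()
    return tr / eig.max(), tr**2 / np.sum(eig**2)

# One spike of size 100 plus 2000 unit-variance "tail" directions.
Sigma = np.diag(np.concatenate([[100.0], np.ones(2000)]))
print(effective_ranks(Sigma))         # r = 21, R = 367.5
print(effective_ranks(np.eye(2000)))  # after splitting off the spike, Sigma_2 = I: r = R = 2000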

Corollary 2.

There exists an absolute constant C_1 \leq 66 such that the following is true. Under (1), pick any split \Sigma = \Sigma_1 \oplus \Sigma_2, fix \delta \leq 1/4, and let \gamma = C_1\left(\sqrt{\frac{\log(1/\delta)}{r(\Sigma_2)}} + \sqrt{\frac{\log(1/\delta)}{n}} + \sqrt{\frac{\operatorname{rank}(\Sigma_1)}{n}}\right). If B \geq \lVert w^*\rVert_2 and n is large enough that \gamma \leq 1, the following holds with probability at least 1-\delta:

\sup_{\lVert w\rVert_2 \leq B,\,\hat{L}(w)=0} L(w) \leq (1+\gamma)\frac{B^2\operatorname{Tr}(\Sigma_2)}{n}. \qquad (5)

In order to use Corollary 1 or 2 to prove consistency, we need a high-probability bound for \lVert\hat{w}\rVert_2, the norm of the minimal norm interpolator, so that B will be large enough to contain any interpolators. Theorem 2 gives exactly such a bound, showing that if the effective ranks R(\Sigma_2) and r(\Sigma_2) are large, then we can construct an interpolator with Euclidean norm nearly \lVert w^*\rVert_2 + \sigma\sqrt{n/\operatorname{Tr}(\Sigma_2)}.

Theorem 2 (Euclidean norm bound; special case of Theorem 4).

Fix any \delta \leq 1/4. Under the model assumptions in (1) with any choice of covariance splitting \Sigma = \Sigma_1 \oplus \Sigma_2, there exists some \epsilon \lesssim \sqrt{\frac{\log(1/\delta)}{r(\Sigma_2)}} + \sqrt{\frac{\log(1/\delta)}{n}} + \frac{n\log(1/\delta)}{R(\Sigma_2)} such that the following is true. If n and the effective ranks are such that \epsilon \leq 1 and R(\Sigma_2) \gtrsim \log(1/\delta)^2, then with probability at least 1-\delta, it holds that

\lVert\hat{w}\rVert_2 \leq \lVert w^*\rVert_2 + (1+\epsilon)^{1/2}\,\sigma\sqrt{\frac{n}{\operatorname{Tr}(\Sigma_2)}}. \qquad (6)

Plugging estimates of \lVert\hat{w}\rVert into our scale-sensitive bound, Corollary 2, we obtain a population loss guarantee for \hat{w} in terms of effective ranks.

Theorem 3 (Benign overfitting).

Fix any \delta \leq 1/2. Under the model assumptions in (1) with any covariance splitting \Sigma = \Sigma_1 \oplus \Sigma_2, let \gamma and \epsilon be as defined in Corollary 2 and Theorem 2. Suppose that n and the effective ranks are such that R(\Sigma_2) \gtrsim \log(1/\delta)^2 and \gamma, \epsilon \leq 1. Then, with probability at least 1-\delta,

L(\hat{w}) \leq (1+\gamma)(1+\epsilon)\left(\sigma + \lVert w^*\rVert_2\sqrt{\frac{\operatorname{Tr}(\Sigma_2)}{n}}\right)^2. \qquad (7)

From (7), we can see that to ensure consistency, i.e. L(\hat{w}) \to \sigma^2, it is enough that \gamma \to 0, \epsilon \to 0, and \lVert w^*\rVert_2\sqrt{\frac{\operatorname{Tr}(\Sigma_2)}{n}} \to 0. Recalling the definitions of the various quantities and using that r(\Sigma_2)^2 \geq R(\Sigma_2) [66, Lemma 5], we arrive at the following conditions.

Sufficient conditions for consistency of \hat{w}.

As n \to \infty, L(\hat{w}) converges in probability to \sigma^2 if there exists a sequence of covariance splits \Sigma = \Sigma_1 \oplus \Sigma_2 such that

\frac{\operatorname{rank}(\Sigma_1)}{n} \to 0, \qquad \lVert w^*\rVert_2\sqrt{\frac{\operatorname{Tr}(\Sigma_2)}{n}} \to 0, \qquad \frac{n}{R(\Sigma_2)} \to 0. \qquad (8)
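As an illustrative sanity check (our own example, in the spirit of the junk-feature setting revisited in Section 6), take \Sigma_2 = \frac{\lambda_n}{d} I_d with \operatorname{rank}(\Sigma_1) fixed and \lVert w^*\rVert_2 bounded. Then \operatorname{Tr}(\Sigma_2) = \lambda_n and R(\Sigma_2) = d, so

\frac{\operatorname{rank}(\Sigma_1)}{n} \to 0 \ \text{trivially}, \qquad \lVert w^*\rVert_2\sqrt{\frac{\lambda_n}{n}} \to 0 \ \text{whenever } \lambda_n = o(n), \qquad \frac{n}{R(\Sigma_2)} = \frac{n}{d} \to 0 \ \text{whenever } n = o(d),

and the minimal \ell_2 norm interpolator is consistent in this regime.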
Relationship to [66].

Our set of sufficient conditions above subsumes and is slightly more general than the conditions of [66]. There are two differences:

  1. They choose the covariance split specifically to minimize \operatorname{rank}(\Sigma_1) such that r(\Sigma_2) \gtrsim n.

  2. Their version of the second condition replaces \operatorname{Tr}(\Sigma_2) by the larger term \operatorname{Tr}(\Sigma).

From the perspective of showing L(\hat{w}) \to \sigma^2, the first difference is immaterial: if there exists a choice of split that satisfies our conditions, it can be shown that there exists a (possibly different) split which also satisfies r(\Sigma_2) \gtrsim n (see Section D.2.1). The second point is a genuine improvement over the consistency result of [66] when \Sigma has a few very large eigenvalues; this improvement has also been implicitly obtained by [72].

Regarding the rate of convergence, our additional r(\Sigma_2)^{-1/2} term and the dependence on \sqrt{\operatorname{rank}(\Sigma_1)/n} instead of \operatorname{rank}(\Sigma_1)/n are slightly worse than in [66], but our bound can be applied for a smaller value of \operatorname{rank}(\Sigma_1) and is better in the \lVert w^*\rVert_2\sqrt{\operatorname{Tr}(\Sigma_2)/n} term. We believe these differences are minimal in most cases, and not so important for our primary goal of showcasing the power of uniform convergence.

Relationship to [71].

The consistency result of [66] can also be recovered with a uniform convergence-based argument [71]. Instead of considering uniform convergence over a norm ball, [71] applied uniform convergence to a surrogate predictor, and separately showed that the minimal-norm interpolator has risk close to the surrogate (and, indeed, argue that this was fundamentally the proof strategy of [66] all along). Their analysis reveals an interesting connection between realizability and interpolation learning, but it does not highlight that low norm is key to good generalization, nor does it predict the worst-case error for other low-norm interpolators.

Relationship to [65].

[65] recently showed that it is impossible to find a tight excess risk bound that depends only on the learned predictor and the sample size. This does not, however, contradict our results. A closer look at the construction of their lower bound reveals that the excess risk bounds being ruled out cannot depend on either the training error \hat{L} or the population noise level \sigma^2. The former is crucial: their considered class of bounds cannot incorporate the knowledge that the training error is small, which is the defining property of uniform convergence of interpolators. The latter point is also important; they consider excess risk (L - \sigma^2), but (\star) and our bounds are about the generalization gap (L - \hat{L}).

Relationship to [76].

[76] give expressions for the asymptotic generalization error of predictors in a norm ball, in a random feature model. Their model is not directly comparable to ours (their labels are effectively a nonlinear function of their non-Gaussian random features), but they similarly showed that uniform convergence of interpolators can lead to a non-vacuous bound. It is unclear, though, whether uniform convergence of low-norm interpolators can yield consistency in their model: they only study sets of the form \{\lVert w\rVert \leq \alpha\lVert\hat{w}\rVert\} with \alpha > 1 a constant, where we would expect a loss of \alpha^2\sigma^2 – i.e., would not expect consistency. They also rely on numerical methods to compare their (quite complicated) analytic expressions. It remains possible that the gap between uniform convergence of interpolators and the Bayes risk vanishes in their setting as \alpha approaches 1.

5 General Norm Ball

All of the results in the Euclidean setting are special cases of the following results for arbitrary norms. It is worth keeping in mind that the Euclidean norm still plays a role in these analyses, via the Gaussian width and radius appearing in Theorem 1 and the \ell_2 projection P appearing in Theorem 4.

Definition 4.

The dual norm of a norm \lVert\cdot\rVert on \mathbb{R}^d is \lVert u\rVert_* := \max_{\lVert v\rVert=1}\langle v, u\rangle, and the set of all its sub-gradients with respect to u is \partial\lVert u\rVert_* = \{v : \lVert v\rVert = 1,\ \langle v, u\rangle = \lVert u\rVert_*\}.

The Euclidean norm’s dual is itself; for it and many other norms, \partial\lVert u\rVert_* is a singleton set. Using these notions, we will now give versions of the effective ranks appropriate for generic norm balls.

Definition 5.

The effective \lVert\cdot\rVert-ranks of a covariance matrix \Sigma are given as follows. Let H \sim N(0, I_d), and define v^* = \operatorname*{arg\,min}_{v\in\partial\lVert\Sigma^{1/2}H\rVert_*}\lVert v\rVert_\Sigma. Then

r_{\lVert\cdot\rVert}(\Sigma) = \left(\frac{\operatorname*{\mathbb{E}}\lVert\Sigma^{1/2}H\rVert_*}{\sup_{\lVert w\rVert\leq 1}\lVert w\rVert_\Sigma}\right)^2 \qquad\text{and}\qquad R_{\lVert\cdot\rVert}(\Sigma) = \left(\frac{\operatorname*{\mathbb{E}}\lVert\Sigma^{1/2}H\rVert_*}{\operatorname*{\mathbb{E}}\lVert v^*\rVert_\Sigma}\right)^2.

The first effective \lVert\cdot\rVert-rank is the squared ratio of the Gaussian width to the radius of the set \Sigma^{1/2}\mathcal{K}, where \mathcal{K} is a norm ball \{w : \lVert w\rVert \leq B\}; the importance of this ratio should be clear from Theorem 1. The Gaussian width is given by W(\Sigma^{1/2}\mathcal{K}) = B\operatorname*{\mathbb{E}}\lVert x\rVert_*, while the radius can be written \sup_{\lVert w\rVert\leq B}\lVert w\rVert_\Sigma, so that the factors of B cancel.

The choice of R_{\lVert\cdot\rVert} arises naturally from our bound on \lVert\hat{w}\rVert in Theorem 4 below. Large effective rank of \Sigma_2 means the sub-gradient v^* of \lVert\Sigma_2^{1/2}H\rVert_* is small in the \lVert\cdot\rVert_{\Sigma_2} norm. This is, in fact, closely related to the existence of low-norm interpolators. First, note that \Sigma_2^{1/2}H corresponds to the small-eigenvalue components of the covariate vector. For v^* to be a sub-gradient means that moving the weight vector w in the direction of v^* is very effective at changing the prediction \langle w, X\rangle; having small \lVert\cdot\rVert_{\Sigma_2} norm means that moving in this direction has a very small effect on the population loss L(w). Together, this means the sub-gradient will be a good direction for benignly overfitting the noise.

Remark 1 (Definitions of effective ranks).

Using \lVert\cdot\rVert_2 in Definition 5 yields slightly different effective ranks than those of Definition 3, but the difference is small and asymptotically negligible. Both r and R use \operatorname*{\mathbb{E}}\lVert x\rVert^2 in their numerators, while r_{\lVert\cdot\rVert_2} and R_{\lVert\cdot\rVert_2} use (\operatorname*{\mathbb{E}}\lVert x\rVert)^2. The denominators of r and r_{\lVert\cdot\rVert_2} agree; Lemma 9, in Section C.2, shows that r(\Sigma) - 1 \leq r_{\lVert\cdot\rVert_2}(\Sigma) \leq r(\Sigma). The denominator of R_{\lVert\cdot\rVert_2} uses the sole sub-gradient v^* = \Sigma^{1/2}H/\lVert\Sigma^{1/2}H\rVert_2; the denominator is then \lVert\Sigma^{1/2}H\rVert_\Sigma^2/\lVert\Sigma^{1/2}H\rVert_2^2 \approx \operatorname{Tr}(\Sigma^2)/\operatorname{Tr}(\Sigma), giving that R_{\lVert\cdot\rVert_2}(\Sigma) \approx \operatorname{Tr}(\Sigma)^2/\operatorname{Tr}(\Sigma^2) = R(\Sigma). Equation 75, in Section C.2, shows that R_{\lVert\cdot\rVert_2}(\Sigma) \geq cR(\Sigma) for some c that converges to 1 as r(\Sigma) \to \infty, as is required by the consistency conditions. The other direction, R(\Sigma) \geq c'R_{\lVert\cdot\rVert_2}(\Sigma), also holds with c' \to 1 as r(\Sigma^2) \to \infty. It would be possible (and probably even more natural with our analysis) to state the consistency conditions in Section 4 in terms of r_{\lVert\cdot\rVert_2} and R_{\lVert\cdot\rVert_2}; we used r and R mainly to allow direct comparison to [66].

Using the general notion of effective ranks, we can find an analogue of Corollary 2 for general norms.

Corollary 3.

There exists an absolute constant C_1 \leq 66 such that the following is true. Under the model assumptions in (1), take any covariance splitting \Sigma = \Sigma_1 \oplus \Sigma_2 and let \lVert\cdot\rVert be an arbitrary norm. Fixing \delta \leq 1/4, let \gamma = C_1\left(\sqrt{\frac{\log(1/\delta)}{r_{\lVert\cdot\rVert}(\Sigma_2)}} + \sqrt{\frac{\log(1/\delta)}{n}} + \sqrt{\frac{\operatorname{rank}(\Sigma_1)}{n}}\right). If B \geq \lVert w^*\rVert and n is large enough that \gamma \leq 1, then the following holds with probability at least 1-\delta:

\sup_{\lVert w\rVert\leq B,\,\hat{L}(w)=0} L(w) \leq (1+\gamma)\frac{\left(B\cdot\operatorname*{\mathbb{E}}\lVert\Sigma_2^{1/2}H\rVert_*\right)^2}{n}. \qquad (9)

As in the Euclidean special case, we still need a bound on the norm of \hat{w} = \operatorname*{arg\,min}_{\hat{L}(w)=0}\lVert w\rVert to use this result to study the consistency of \hat{w}. This leads us to the second main technical result of this paper, which essentially says that if the effective ranks R_{\lVert\cdot\rVert}(\Sigma_2) and r_{\lVert\cdot\rVert}(\Sigma_2) are sufficiently large, then there exists an interpolator with norm nearly \lVert w^*\rVert + \sigma\sqrt{n}/\operatorname*{\mathbb{E}}\lVert\Sigma_2^{1/2}H\rVert_*.

Theorem 4 (General norm bound).

There exists an absolute constant C_2 \leq 64 such that the following is true. Under the model assumptions in (1) with any covariance split \Sigma = \Sigma_1 \oplus \Sigma_2, let \lVert\cdot\rVert be an arbitrary norm, and fix \delta \leq 1/4. Denote the \ell_2 orthogonal projection matrix onto the space spanned by \Sigma_2 as P. Let H \sim N(0, I_d), and let v^* = \operatorname*{arg\,min}_{v\in\partial\lVert\Sigma_2^{1/2}H\rVert_*}\lVert v\rVert_{\Sigma_2}. Suppose that there exist \epsilon_1, \epsilon_2 \geq 0 such that with probability at least 1-\delta/4

\lVert v^*\rVert_{\Sigma_2} \leq (1+\epsilon_1)\operatorname*{\mathbb{E}}\lVert v^*\rVert_{\Sigma_2} \qquad\text{and}\qquad \lVert Pv^*\rVert^2 \leq 1+\epsilon_2; \qquad (10)

let \epsilon = C_2\left(\sqrt{\frac{\log(1/\delta)}{r_{\lVert\cdot\rVert}(\Sigma_2)}} + \sqrt{\frac{\log(1/\delta)}{n}} + (1+\epsilon_1)^2\frac{n}{R_{\lVert\cdot\rVert}(\Sigma_2)} + \epsilon_2\right). Then if n and the effective ranks are large enough that \epsilon \leq 1, with probability at least 1-\delta, it holds that

\lVert\hat{w}\rVert \leq \lVert w^*\rVert + (1+\epsilon)^{1/2}\,\sigma\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\lVert\Sigma_2^{1/2}H\rVert_*}. \qquad (11)

For a specific choice of norm \lVert\cdot\rVert, we can verify that \lVert v^*\rVert_{\Sigma_2} is small. In the Euclidean case, for example, this is done by (78) in Section C.2; in our basis pursuit application to come, this is done by (92). The term \epsilon_2 measures the cost of using a projected version of the sub-gradient; in most of our applications, we can take \epsilon_2 = 0. Recalling that \lVert v^*\rVert = 1, this is obviously true with the Euclidean norm for any \Sigma_2. More generally, if \Sigma is diagonal, then it is natural to only consider covariance splits \Sigma = \Sigma_1 \oplus \Sigma_2 such that \Sigma_2 is diagonal. Then, when \lVert\cdot\rVert is the \ell_1 norm (or an \ell_p norm more generally), it can easily be seen that Pv^* = v^* and so \lVert Pv^*\rVert = \lVert v^*\rVert = 1.

Straightforwardly combining Corollary 3 and Theorem 4 yields the following theorem, which gives guarantees for minimal-norm interpolators in terms of effective rank conditions. Just as in the Euclidean case, we can extract from this result a simple set of sufficient conditions for consistency of the minimal norm interpolator.

Theorem 5 (Benign overfitting with general norm).

Fix any \delta \leq 1/2. Under the model assumptions in (1), let \lVert\cdot\rVert be an arbitrary norm and pick a covariance split \Sigma = \Sigma_1 \oplus \Sigma_2. Suppose that n and the effective ranks are sufficiently large that \gamma, \epsilon \leq 1, with the same choices of \gamma and \epsilon as in Corollary 3 and Theorem 4. Then, with probability at least 1-\delta,

L(\hat{w}) \leq (1+\gamma)(1+\epsilon)\left(\sigma + \lVert w^*\rVert\frac{\operatorname*{\mathbb{E}}\lVert\Sigma_2^{1/2}H\rVert_*}{\sqrt{n}}\right)^2. \qquad (12)
Sufficient conditions for consistency of \hat{w}.

As n \to \infty, L(\hat{w}) converges in probability to \sigma^2 if there exists a sequence of covariance splits \Sigma = \Sigma_1 \oplus \Sigma_2 such that

\frac{\operatorname{rank}(\Sigma_1)}{n} \to 0, \qquad \frac{\lVert w^*\rVert\operatorname*{\mathbb{E}}\lVert\Sigma_2^{1/2}H\rVert_*}{\sqrt{n}} \to 0, \qquad \frac{1}{r_{\lVert\cdot\rVert}(\Sigma_2)} \to 0, \qquad \frac{n}{R_{\lVert\cdot\rVert}(\Sigma_2)} \to 0, \qquad (13)

and, with the same definition of P and v^* as in Theorem 4, it holds for any \eta > 0 that

\Pr(\lVert Pv^*\rVert^2 > 1+\eta) \to 0. \qquad (14)

As we see, the conditions for a minimal norm interpolator to succeed with a general norm generalize those from the Euclidean setting in a natural way. As discussed above, (14) is always satisfied for the Euclidean norm. The only remaining notable difference from the Euclidean setting is that we have two large effective dimension conditions on \Sigma_2 instead of a single one; in the Euclidean case, the condition on R implies the condition on r.

6 Application: \ell_1 Norm Balls for Basis Pursuit

The theory for the minimal \ell_1 norm interpolator, \hat{w}_{\mathit{BP}} \in \operatorname*{arg\,min}_{\hat{L}(w)=0}\lVert w\rVert_1 – also known as basis pursuit [44] – is much less developed than that of the minimal \ell_2 norm interpolator. In this section, we illustrate the consequences of our general theory for basis pursuit. Full statements and proofs of results in this section are given in Appendix E.

The dual of the \ell_1 norm is the \ell_\infty norm \lVert u\rVert_\infty = \max_i|u_i|, and \partial\lVert u\rVert_\infty is the convex hull of \{\operatorname{sign}(u_i)\,e_i : i\in\arg\max|u_i|\}. From the definition of the sub-gradient, we observe that

\min_{v\in\partial\lVert\Sigma^{1/2}H\rVert_\infty}\lVert v\rVert_\Sigma \leq \max_{i\in[d]}\lVert e_i\rVert_\Sigma = \sqrt{\max_i\Sigma_{ii}}. \qquad (15)

Furthermore, by convexity we have

\max_{\lVert w\rVert_1\leq 1}\lVert w\rVert_\Sigma = \sqrt{\max_i\langle e_i, \Sigma e_i\rangle} = \sqrt{\max_i\Sigma_{ii}} \qquad (16)

and so r_{\lVert\cdot\rVert_1}(\Sigma) = \frac{\left(\operatorname*{\mathbb{E}}\lVert\Sigma^{1/2}H\rVert_\infty\right)^2}{\max_i\Sigma_{ii}} \leq R_{\lVert\cdot\rVert_1}(\Sigma). Therefore, we can use a single notion of effective rank; for simplicity, we write r_1(\Sigma) = r_{\lVert\cdot\rVert_1}(\Sigma). Combining this with (13) and the previous discussion of (14), we obtain the following sufficient conditions for consistency of basis pursuit.

Sufficient conditions for consistency of \hat{w}_{\mathit{BP}}.

As n \to \infty, L(\hat{w}_{\mathit{BP}}) converges to \sigma^2 in probability if there exists a sequence of covariance splits \Sigma = \Sigma_1 \oplus \Sigma_2 such that \Sigma_2 is diagonal and

\frac{\operatorname{rank}(\Sigma_1)}{n} \to 0, \qquad \frac{\lVert w^*\rVert_1\operatorname*{\mathbb{E}}\lVert\Sigma_2^{1/2}H\rVert_\infty}{\sqrt{n}} \to 0, \qquad \frac{n}{r_1(\Sigma_2)} \to 0. \qquad (17)
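For a quick numerical feel for r_1, here is a Monte Carlo sketch (our own, for diagonal \Sigma only) using the formula r_1(\Sigma) = (\operatorname*{\mathbb{E}}\lVert\Sigma^{1/2}H\rVert_\infty)^2/\max_i\Sigma_{ii}:

import numpy as np

def r1(diag_Sigma, n_samples=500, seed=0):
    """Monte Carlo estimate of r_1(Sigma) for diagonal Sigma = diag(diag_Sigma)."""
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((n_samples, len(diag_Sigma)))
    dual_norms = np.max(np.abs(H) * np.sqrt(diag_Sigma), axis=1)   # ||Sigma^{1/2} H||_inf per sample
    return np.mean(dual_norms) ** 2 / np.max(diag_Sigma)

for d in [10**3, 10**4]:
    print(d, r1(np.ones(d)), 2 * np.log(d))   # for Sigma = I_d, r_1(Sigma) grows like 2 log d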
Application: Junk features.

We now consider the behavior of basis pursuit in a junk feature model similar to that of [73]. Suppose that \Sigma = \begin{bmatrix}\Sigma_s & 0\\ 0 & \frac{\lambda_n}{\log(d)}I_d\end{bmatrix}, where \Sigma_s is a fixed matrix and \lVert w^*\rVert_1 is fixed. Quite naturally, we choose the covariance splitting \Sigma_1 = \begin{bmatrix}\Sigma_s & 0\\ 0 & 0\end{bmatrix}, which has constant rank, so that the first sufficient condition is immediately satisfied.

By standard results on the maximum of independent Gaussian variables [e.g., 60], it is routine to check that

\frac{\operatorname*{\mathbb{E}}\lVert\Sigma_2^{1/2}H\rVert_\infty}{\sqrt{n}} = \Theta\left(\sqrt{\frac{\lambda_n}{n}}\right) \qquad\text{and}\qquad r_1(\Sigma_2) = \Theta\left(\log(d)\right). \qquad (18)

Therefore, basis pursuit will be consistent provided that \lambda_n = o(n) and d = e^{\omega(n)}. To the best of our knowledge, this is the first time that basis pursuit has been shown to give consistent predictions in any setting with Gaussian covariates and \sigma > 0. Although we show consistency, the dimension must be quite high, and the rate of convergence depends on n/\log(d) and 1/\sqrt{\log(d)}.
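To make this concrete, here is a small sketch (our own; the dimension is far smaller than the d = e^{\omega(n)} regime the theory requires, so it only illustrates the trend) that solves basis pursuit as a linear program in a toy junk-feature model and evaluates its population loss:

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d_junk, sigma, lam = 30, 2000, 0.5, 1.0

# Junk-feature covariance: one "signal" coordinate (Sigma_s = [1]) plus d_junk junk coordinates,
# each with variance lam / log(d_junk).
diag = np.concatenate([[1.0], (lam / np.log(d_junk)) * np.ones(d_junk)])
w_star = np.zeros(d_junk + 1)
w_star[0] = 1.0
X = rng.standard_normal((n, d_junk + 1)) * np.sqrt(diag)
Y = X @ w_star + sigma * rng.standard_normal(n)

# Basis pursuit: min ||w||_1 s.t. Xw = Y, written as an LP in (w_plus, w_minus) >= 0.
D = d_junk + 1
res = linprog(c=np.ones(2 * D), A_eq=np.hstack([X, -X]), b_eq=Y, bounds=(0, None), method="highs")
w_bp = res.x[:D] - res.x[D:]

pop_loss = sigma**2 + np.sum(diag * (w_bp - w_star) ** 2)   # L(w_hat_BP); consistency would mean -> sigma^2
print(np.sum(np.abs(w_bp)), pop_loss)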

Application: Isotropic features.

As in the Euclidean case, we generally do not expect basis pursuit to be consistent when \Sigma = I_d and w^* \neq 0. However, we can expect its risk to approach the null risk \sigma^2 + \lVert w^*\rVert^2 if d = e^{\omega(n)}; we will show this using uniform convergence (without covariance splitting).

A direct application of Theorem 5 is not enough because the \lVert w^*\rVert_1\sqrt{\log(d)/n} term diverges, but we can remove the dependence on \sqrt{\log(d)/n} with a better norm bound. Let S be the support of w^* and let X_S denote the matrix formed by selecting the columns of X in S. The key observation is that we can rewrite our model as Y = X_{S^{\mathsf{c}}}\,0 + (X_S w_S^* + \xi), which corresponds to the case where w^* = 0 and the Bayes risk is \sigma^2 + \lVert w^*\rVert_2^2. If we interpolate using only the features in S^{\mathsf{c}}, the minimal norm will be approximately upper bounded by \sqrt{\sigma^2 + \lVert w^*\rVert_2^2}\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\lVert H\rVert_*} as long as d - |S| = e^{\omega(n)}, by Theorem 4. This implies that \lVert\hat{w}_{\mathit{BP}}\rVert_1 in the original model can also be upper bounded by the same quantity with high probability. Plugging the norm estimate into Corollary 3 yields a risk bound of \sigma^2 + \lVert w^*\rVert_2^2.

Relationship to previous works.

Both [69] and [74] study the minimal \ell_1 norm interpolator in the isotropic setting. They consider a more realistic scaling where \log(d)/n is not large and the target is not the null risk. The best bound of [69], their Corollary 3, is L(\hat{w}_{\mathit{BP}}) \leq \sigma^2(2 + 32\sqrt{14}\sqrt{s})^2, where s is the ground-truth sparsity. Note that even when w^* = 0, this bound does not show consistency. Similarly, [74] establish sufficient conditions for L(\hat{w}_{\mathit{BP}}) = O(\sigma^2), which is nontrivial but also does not show consistency for any \sigma > 0; see also the work of [68] for a similar result in the Euclidean setting. In contrast, the constants in our result are tight enough to show L(\hat{w}_{\mathit{BP}}) \to \sigma^2 in the isotropic setting when w^* = 0 and in the junk feature setting when \lVert w^*\rVert_1 is bounded.

Like our work, the results of [74] generalize to arbitrary norms; they also consider a larger class of anti-concentrated covariate distributions than just Gaussians, as in the work of [52]. If \sigma = 0 and w^* \in \mathcal{K} (i.e., the model is well-specified and noiseless), their work as well as that of [52] can recover generalization bounds similar to our Corollary 3, but with a large leading constant.

7 Proof Sketches

A key ingredient in our analysis is a celebrated result from Gaussian process theory known as the Gaussian Minmax Theorem (GMT) [41, 56]. Since the seminal work of [45], the GMT has seen numerous applications to problems in statistics, machine learning, and signal processing [e.g., 51, 75, 46, 59]. Most relevant to us is the work of [56], which introduced the Convex Gaussian Minmax Theorem (CGMT) and developed a framework for the precise analysis of regularized linear regression. Here we apply the GMT/CGMT to study uniform convergence and the norm of the minimal norm interpolator.

Proof sketch of Theorem 1.

For simplicity, assume here that there is no covariance splitting: \Sigma_2 = \Sigma. By a change of variable and introducing the Lagrangian, we can rewrite the generalization gap as

\begin{split}\sup_{\begin{subarray}{c}w\in\mathcal{K}\\ Xw=Y\end{subarray}} L(w) &= \sigma^2 + \sup_{\begin{subarray}{c}w\in\Sigma^{1/2}(\mathcal{K}-w^*)\\ Zw=\xi\end{subarray}}\lVert w\rVert_2^2\\ &= \sigma^2 + \sup_{w\in\Sigma^{1/2}(\mathcal{K}-w^*)}\inf_{\lambda}\,\langle\lambda, Zw-\xi\rangle + \lVert w\rVert_2^2\end{split} \qquad (19)

where Z is a random matrix with i.i.d. standard normal entries. By the GMT (we ignore a compactness issue here, which is handled rigorously by a truncation argument in Section B.1), we can control the upper tail of the max-min problem above (PO) by the auxiliary problem below (AO), with G, H \sim N(0, I):

\sup_{w\in\Sigma^{1/2}(\mathcal{K}-w^*)}\inf_{\lambda}\,\lVert\lambda\rVert_2\langle H, w\rangle + \langle\lambda, G\lVert w\rVert_2 - \xi\rangle + \lVert w\rVert_2^2 = \sup_{\begin{subarray}{c}w\in\Sigma^{1/2}(\mathcal{K}-w^*)\\ \lVert G\lVert w\rVert_2 - \xi\rVert_2 \leq \langle H, w\rangle\end{subarray}}\lVert w\rVert_2^2. \qquad (20)

By standard concentration results, we can expect \lVert G\rVert_2^2/n \approx 1 and \lVert\xi\rVert_2^2/n \approx \sigma^2, so expanding the second constraint in the AO, we obtain \lVert w\rVert_2^2 + \sigma^2 \leq |\langle H, w\rangle|^2/n. Plugging into (19), we have essentially shown that

\sup_{\begin{subarray}{c}w\in\mathcal{K}\\ \hat{L}(w)=0\end{subarray}} L(w) \leq \sup_{w\in\Sigma^{1/2}(\mathcal{K}-w^*)}\frac{|\langle H, w\rangle|^2}{n} \leq \frac{\left(\sup_{w\in\Sigma^{1/2}\mathcal{K}}|\langle H, w\rangle| + |\langle H, \Sigma^{1/2}w^*\rangle|\right)^2}{n}. \qquad (21)

Applying concentration on the right-hand side concludes the proof sketch. In situations where the supremum does not sharply concentrate around its mean, we can apply the GMT only to the small-variance directions of \Sigma. This requires a slightly more general version of the GMT, which we prove in Appendix A. We also show that the additional terms contributed by the large-variance components of X cancel out due to Wishart concentration. This is reflected in the \beta term of our theorem statement.

Proof sketch of Theorem 4.

Since the minimal norm problem is convex-concave, we can apply the CGMT, which provides a useful direction that the GMT cannot. By the same argument as above,

\begin{split}\inf_{Xw=Y}\lVert w\rVert - \lVert w^*\rVert &\leq \inf_{Xw=\xi}\lVert w\rVert = \inf_{w}\sup_{\lambda}\lVert\Sigma^{-1/2}w\rVert + \langle\lambda, Zw-\xi\rangle\\ &\approx \inf_{\lVert w\rVert_2^2+\sigma^2\leq|\langle H, w\rangle|^2/n}\lVert\Sigma^{-1/2}w\rVert = \inf_{\lVert w\rVert_\Sigma^2+\sigma^2\leq|\langle\Sigma^{1/2}H, w\rangle|^2/n}\lVert w\rVert.\end{split} \qquad (22)

To upper bound the infimum, it suffices to construct a feasible w. Consider w of the form \alpha v, where v\in\partial\lVert\Sigma^{1/2}H\rVert_*. Plugging into the constraint, we can choose \lVert w\rVert = \alpha = \sqrt{\sigma^2\left(\frac{\lVert\Sigma^{1/2}H\rVert_*^2}{n} - \lVert v\rVert_\Sigma^2\right)^{-1}}. Rearranging the terms concludes the proof sketch when there is no covariance splitting. The general proof (in Section C.1) is more technical, but follows the same idea.

8 Discussion

In this work, we prove a generic generalization bound in terms of the Gaussian width and radius of a hypothesis class. We also provide a general high-probability upper bound for the norm of the minimal norm interpolator. Combining these results, we recover the sufficient conditions from [66] in the \ell_2 case, confirm the conjecture of [73] for Gaussian data, and obtain novel consistency results in the \ell_1 case. Our results provide concrete evidence that uniform convergence is indeed sufficient to explain interpolation learning, at least in some settings.

A future direction of our work is to extend the main results to settings with non-Gaussian features; this has been achieved in other applications of the GMT [59], and indeed we expect that a version of (\star) likely holds for non-Gaussian data as well. Another interesting problem is to study uniform convergence of low-norm near-interpolators, and characterize the worst-case population error as the norm and training error both grow. This could lead to a more precise understanding of early stopping, by connecting the optimization path with the regularization path. Finally, it is unknown whether our sufficient conditions for consistency in Section 5 are necessary, and it remains a challenge to apply uniform convergence of interpolators to more complex models such as deep neural networks.

Acknowledgments and Disclosure of Funding

Frederic Koehler was supported in part by E. Mossel’s Vannevar Bush Faculty Fellowship ONR-N00014-20-1-2826. Research supported in part by NSF IIS award 1764032, NSF HDR TRIPODS award 1934843, and the Canada CIFAR AI Chairs program. This work was done as part of the Collaboration on the Theoretical Foundations of Deep Learning (deepfoundations.ai).



References

  • [39] Vladimir Vapnik “Estimation of Dependences Based on Empirical Data”, Springer Series in Statistics Springer-Verlag, 1982
  • [40] Leslie G. Valiant “A theory of the learnable” In Communications of the ACM 27.11, 1984, pp. 1134–1142
  • [41] Yehoram Gordon “Some inequalities for Gaussian processes and applications” In Israel Journal of Mathematics 50.4 Springer, 1985, pp. 265–289
  • [42] Yehoram Gordon “On Milman’s inequality and random subspaces which escape through a mesh in \mathbb{R}^n” In Geometric Aspects of Functional Analysis 1317, Lecture Notes in Mathematics Springer, 1988, pp. 84–106
  • [43] Michel Ledoux “A heat semigroup approach to concentration on the sphere and on a compact Riemannian manifold” In Geometric & Functional Analysis GAFA 2.2 Springer, 1992, pp. 221–224
  • [44] Scott Shaobing Chen, David L. Donoho and Michael A. Saunders “Atomic decomposition by basis pursuit” In SIAM Review 43.1, 2001, pp. 129–159
  • [45] Mark Rudelson and Roman Vershynin “On sparse reconstruction from Fourier and Gaussian measurements” In Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences 61.8 Wiley Online Library, 2008, pp. 1025–1045
  • [46] Samet Oymak and Babak Hassibi “New null space results and recovery thresholds for matrix rank minimization”, 2010 arXiv:1011.6326
  • [47] Nathan Srebro, Karthik Sridharan and Ambuj Tewari “Optimistic Rates for Learning with a Smooth Loss”, 2010 arXiv:1009.3896
  • [48] Roman Vershynin “Introduction to the non-asymptotic analysis of random matrices”, 2010 arXiv:1011.3027
  • [49] Venkat Chandrasekaran, Benjamin Recht, Pablo A. Parrilo and Alan S. Willsky “The convex geometry of linear inverse problems” In Foundations of Computational Mathematics 12.6 Springer, 2012, pp. 805–849
  • [50] Gautam Kamath “Bounds on the Expectation of the Maximum of Samples from a Gaussian”, 2013 URL: http://www.gautamkamath.com/writings/gaussian_max.pdf
  • [51] Mihailo Stojnic “A framework to characterize performance of LASSO algorithms”, 2013 arXiv:1303.7291
  • [52] Shahar Mendelson “Learning without concentration” In Conference on Learning Theory, 2014, pp. 25–39 PMLR
  • [53] Shai Shalev-Shwartz and Shai Ben-David “Understanding Machine Learning: From Theory to Algorithms” Cambridge University Press, 2014
  • [54] Ramon van Handel “Probability in High Dimension”, Lecture notes, Princeton University, 2014 URL: https://web.math.princeton.edu/~rvan/APC550.pdf
  • [55] Behnam Neyshabur, Ryota Tomioka and Nathan Srebro “In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning” In International Conference on Learning Representations – Workshop, 2015 arXiv:1412.6614
  • [56] Christos Thrampoulidis, Samet Oymak and Babak Hassibi “Regularized linear regression: A precise analysis of the estimation error” In Conference on Learning Theory, 2015, pp. 1683–1709 PMLR
  • [57] Afonso S. Bandeira “Ten Lectures and Forty-Two Open Problems in the Mathematics of Data Science”, Lecture notes, MIT, 2016 URL: https://people.math.ethz.ch/~abandeira/TenLecturesFortyTwoProblems.pdf
  • [58] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht and Oriol Vinyals “Understanding deep learning requires rethinking generalization” In International Conference on Learning Representations, 2017 arXiv:1611.03530
  • [59] Samet Oymak and Joel A Tropp “Universality laws for randomized dimension reduction, with applications” In Information and Inference: A Journal of the IMA 7.3 Oxford University Press, 2018, pp. 337–446
  • [60] Roman Vershynin “High-dimensional probability: An introduction with applications in data science” Cambridge University Press, 2018
  • [61] Mikhail Belkin, Daniel Hsu, Siyuan Ma and Soumik Mandal “Reconciling modern machine learning practice and the bias-variance trade-off” In Proceedings of the National Academy of Sciences 116.32, 2019, pp. 15849–15854 arXiv:1812.11118
  • [62] Rick Durrett “Probability: theory and examples” Cambridge University Press, 2019
  • [63] Trevor Hastie, Andrea Montanari, Saharon Rosset and Ryan J. Tibshirani “Surprises in High-Dimensional Ridgeless Least Squares Interpolation”, 2019 arXiv:1903.08560
  • [64] Vaishnavh Nagarajan and J. Kolter “Uniform convergence may be unable to explain generalization in deep learning” In Advances in Neural Information Processing Systems, 2019 arXiv:1902.04742
  • [65] Peter L. Bartlett and Philip M. Long “Failures of model-dependent generalization bounds for least-norm interpolation”, 2020 arXiv:2010.08479
  • [66] Peter L. Bartlett, Philip M. Long, Gábor Lugosi and Alexander Tsigler “Benign overfitting in linear regression” In Proceedings of the National Academy of Sciences 117.48, 2020, pp. 30063–30070 arXiv:1906.11300
  • [67] Mikhail Belkin, Daniel Hsu and Ji Xu “Two models of double descent for weak features” In SIAM Journal on Mathematics of Data Science 2.4, 2020, pp. 1167–1180 arXiv:1903.07571
  • [68] Geoffrey Chinot and Matthieu Lerasle “On the robustness of the minimum 2\ell_{2} interpolator”, 2020 arXiv:2003.05838
  • [69] Peizhong Ju, Xiaojun Lin and Jia Liu “Overfitting Can Be Harmless for Basis Pursuit: Only to a Degree” In Advances in Neural Information Processing Systems, 2020 arXiv:2002.00492
  • [70] Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian and Anant Sahai “Harmless interpolation of noisy data in regression” In IEEE Journal on Selected Areas in Information Theory, 2020 arXiv:1903.09139
  • [71] Jeffrey Negrea, Gintare Karolina Dziugaite and Daniel M. Roy “In Defense of Uniform Convergence: Generalization via derandomization with an application to interpolating predictors” In International Conference on Machine Learning, 2020 arXiv:1912.04265
  • [72] Alexander Tsigler and Peter L. Bartlett “Benign overfitting in ridge regression”, 2020 arXiv:2009.14286
  • [73] Lijia Zhou, Danica J. Sutherland and Nathan Srebro “On Uniform Convergence and Low-Norm Interpolation Learning” In Advances in Neural Information Processing Systems, 2020 arXiv:2006.05942
  • [74] Geoffrey Chinot, Matthias Löffler and Sara van de Geer “On the robustness of minimum-norm interpolators”, 2021 arXiv:2012.00807
  • [75] Zeyu Deng, Abla Kammoun and Christos Thrampoulidis “A model of double descent for high-dimensional binary linear classification” In Information and Inference: A Journal of the IMA, 2021 arXiv:1911.05822
  • [76] Zitong Yang, Yu Bai and Song Mei “Exact Gap between Generalization Error and Uniform Convergence in Random Feature Models” In International Conference on Machine Learning, 2021 arXiv:2103.04554

Appendices

In the remainder of the paper, we give self-contained proofs of all results from the main text.

In Appendix A, we introduce some technical results that we will use in our analysis.

In Appendix B, we prove the main generalization bound (Theorem 1) and show its specialization to norm balls (Corollaries 1, 2 and 3).

In Appendix C, we prove upper bounds on the norm of the minimal-norm interpolator for a general norm (Theorem 4), and show applications to the Euclidean case (Theorem 2).

In Appendix D, we show how to combine the previous sets of results to give risk guarantees for the minimal norm interpolators (Theorems 3 and 5). In particular, Section D.2.1 shows the equivalence of conditions for consistency in the Euclidean norm setting.

In Appendix E, we provide full theorem statements and proofs of the results on 1\ell_{1} interpolation (basis pursuit) mentioned in Section 6.

Appendix A Preliminaries

We will first give some general results useful to the rest of the proofs. Most are standard, but a few are variations on existing results.

Concentration of Lipschitz functions.

Recall that a function f:nf:\mathbb{R}^{n}\to\mathbb{R} is LL-Lipschitz with respect to the norm \left\lVert\cdot\right\rVert if it holds for all x,ynx,y\in\mathbb{R}^{n} that |f(x)f(y)|Lxy|f(x)-f(y)|\leq L\|x-y\|. We use the concentration of Lipschitz functions of a Gaussian.

Theorem 6 ([54], Theorem 3.25).

If ff is LL-Lipschitz with respect to the Euclidean norm and ZN(0,In)Z\sim N(0,I_{n}), then

Pr(|f(Z)𝔼f(Z)|t)2et2/2L2.\Pr(|f(Z)-\operatorname*{\mathbb{E}}f(Z)|\geq t)\leq 2e^{-t^{2}/2L^{2}}. (23)

We also use a similar result for functions of a uniformly spherical vector [see 60, Theorem 5.1.4 and Exercise 5.1.12]; we cite a result with a sharp constant factor from [43].

Theorem 7 (Spherical concentration; [43]).

If ff is LL-Lipschitz with respect to the Euclidean norm and ZUni(Sn1)Z\sim\text{Uni}(S^{n-1}) where Sn1={un:u=1}S^{n-1}=\{u\in\mathbb{R}^{n}:\|u\|=1\} is the unit sphere, Uni(Sn1)\text{Uni}(S^{n-1}) is the uniform measure on the sphere, and n3n\geq 3, then

Pr(|f(Z)𝔼f(Z)|t)2e(n2)t2/2L2.\Pr(|f(Z)-\operatorname*{\mathbb{E}}f(Z)|\geq t)\leq 2e^{-(n-2)t^{2}/2L^{2}}. (24)

The following lemma, which we will use multiple times, says that an o(n)o(n)-dimensional subspace cannot align with a random spherically symmetric vector.

Lemma 1.

Suppose that SS is a fixed subspace of dimension dd in n\mathbb{R}^{n} with n4n\geq 4, PSP_{S} is the orthogonal projection onto SS, and VV is a spherically symmetric random vector (i.e. V/V2V/\|V\|_{2} is uniform on the sphere). Then

PSV2V2d/n+2log(2/δ)/n.\frac{\|P_{S}V\|_{2}}{\|V\|_{2}}\leq\sqrt{d/n}+2\sqrt{\log(2/\delta)/n}. (25)

with probability at least 1δ1-\delta. Conditional on this inequality holding, we therefore have uniformly for all sSs\in S that

|\langle s,V\rangle|=|\langle s,P_{S}V\rangle|\leq\|s\|_{2}\|P_{S}V\|_{2}\leq\|s\|_{2}\|V\|_{2}\left(\sqrt{d/n}+2\sqrt{\log(2/\delta)/n}\right). (26)
Proof.

This is trivial if dnd\geq n, since the left-hand side is at most 11. Thus assume without loss of generality that d<nd<n. By symmetry, it suffices to fix SS to be the span of basis vectors e1,,ede_{1},\ldots,e_{d} and to bound PSV2\|P_{S}V\|_{2} for VV a vector chosen uniformly at random from the unit sphere in n\mathbb{R}^{n}. Recall that for any coordinate ii, we have 𝔼Vi2=1/n\operatorname*{\mathbb{E}}V_{i}^{2}=1/n by symmetry among the coordinates and the fact that V22=1\|V\|_{2}^{2}=1 almost surely. The function vPSv2v\mapsto\|P_{S}v\|_{2} is a 11-Lipschitz function and 𝔼PSV2𝔼PSV22=d/n\operatorname*{\mathbb{E}}\|P_{S}V\|_{2}\leq\sqrt{\operatorname*{\mathbb{E}}\|P_{S}V\|_{2}^{2}}=\sqrt{d/n}, so by Theorem 7 above

\|P_{S}V\|_{2}\leq\sqrt{d/n}+\sqrt{2\log(2/\delta)/(n-2)}

with probability at least 1δ1-\delta. Using n4n\geq 4 gives the result. ∎
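
As an illustrative numerical sanity check of Lemma 1 (a sketch that is not part of the formal argument; it assumes numpy, and the values of n, d and δ below are arbitrary), one can estimate the ratio \|P_{S}V\|_{2}/\|V\|_{2} for a standard Gaussian vector, which is spherically symmetric, with S the span of the first d coordinates, and compare its empirical 1-δ quantile to the bound in (25):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, delta, trials = 1000, 50, 0.05, 2000

    # S = span of the first d coordinates, so P_S V just keeps the first d entries.
    ratios = []
    for _ in range(trials):
        v = rng.standard_normal(n)  # Gaussian, hence spherically symmetric
        ratios.append(np.linalg.norm(v[:d]) / np.linalg.norm(v))

    bound = np.sqrt(d / n) + 2 * np.sqrt(np.log(2 / delta) / n)
    print(f"empirical {1 - delta:.0%} quantile: {np.quantile(ratios, 1 - delta):.3f}")
    print(f"bound from Lemma 1:         {bound:.3f}")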

The concentration of the Euclidean norm of a Gaussian vector follows from Theorem 6; we state it explicitly below.

Lemma 2.

Suppose that ZN(0,In)Z\sim N(0,I_{n}). Then

Pr(|Z2n|t)4et2/4.\Pr(\left|\|Z\|_{2}-\sqrt{n}\right|\geq t)\leq 4e^{-t^{2}/4}. (27)
Proof.

First we recall the standard fact [see, e.g., 49] that

n1nn+1𝔼Z2n.\sqrt{n}-1\leq\frac{n}{\sqrt{n+1}}\leq\operatorname*{\mathbb{E}}\|Z\|_{2}\leq\sqrt{n}.

Because the norm is 1-Lipschitz, it follows from Theorem 6 that

Pr(|Z2𝔼Z2|t)2et2/2\Pr(\left|\|Z\|_{2}-\operatorname*{\mathbb{E}}\|Z\|_{2}\right|\geq t)\leq 2e^{-t^{2}/2}

so

Pr(|Z2n|t+1)2et2/2.\Pr(\left|\|Z\|_{2}-\sqrt{n}\right|\geq t+1)\leq 2e^{-t^{2}/2}.

Now using that (t1)2t2/21(t-1)^{2}\geq t^{2}/2-1 shows

Pr(|Z2n|t)2e(t2/21)/24et2/4.\Pr(\left|\|Z\|_{2}-\sqrt{n}\right|\geq t)\leq 2e^{-(t^{2}/2-1)/2}\leq 4e^{-t^{2}/4}.\qed
Wishart concentration.

We recall the notation for the Loewner order on symmetric matrices: ABA\preceq B means that BAB-A is positive semidefinite. Let σmin(A)\sigma_{\min}(A) denote the minimum singular value of an arbitrary matrix AA, and σmax\sigma_{\max} the maximum singular value. Similarly, let λmin(A)\lambda_{\min}(A) denote the minimum eigenvalue. We use A𝑜𝑝=σmax(A)\|A\|_{\mathit{op}}=\sigma_{\max}(A) to denote the operator norm of matrix AA.

Theorem 8 ([48], Corollary 5.35).

Let n,Nn,N\in\mathbb{N}. Let AN×nA\in\mathbb{R}^{N\times n} be a random matrix with entries i.i.d. N(0,1)N(0,1). Then for any t>0t>0, it holds with probability at least 12exp(t2/2)1-2\exp(-t^{2}/2) that

Nntσmin(A)σmax(A)N+n+t.\sqrt{N}-\sqrt{n}-t\leq\sigma_{\text{min}}(A)\leq\sigma_{\text{max}}(A)\leq\sqrt{N}+\sqrt{n}+t. (28)
Corollary 4.

Suppose X1,,XnN(0,Σ)X_{1},\ldots,X_{n}\sim N(0,\Sigma) are independent with Σ:d×d\Sigma:d\times d a positive semidefinite matrix, δ(0,1)\delta\in(0,1), and n4(d+2log(2/δ))n\geq 4(d+2\log(2/\delta)). Let Σ^=1niXiXiT\hat{\Sigma}=\frac{1}{n}\sum_{i}X_{i}X_{i}^{T} be the empirical covariance matrix. Then with probability at least 1δ1-\delta,

(1ϵ)ΣΣ^(1+ϵ)Σ(1-\epsilon)\Sigma\preceq\hat{\Sigma}\preceq(1+\epsilon)\Sigma (29)

with ϵ=3d/n+32log(2/δ)/n\epsilon=3\sqrt{d/n}+3\sqrt{2\log(2/\delta)/n}.

Proof.

Let X:n×dX:n\times d be the random matrix with rows X1,,XnX_{1},\ldots,X_{n} so that Σ^=1nXTX\hat{\Sigma}=\frac{1}{n}X^{T}X. By equality in distribution, we can take Z:n×dZ:n\times d to have N(0,1)N(0,1) independent entries and write X=ZΣ1/2X=Z\Sigma^{1/2} and

\Sigma^{-1/2}\hat{\Sigma}\Sigma^{-1/2}=\frac{1}{n}\Sigma^{-1/2}X^{T}X\Sigma^{-1/2}=\frac{1}{n}Z^{T}Z.

By definition of singular values, from Theorem 8 the eigenvalues of ZTZ/nZ^{T}Z/n are bounded between (1d/nt2/n)2(1-\sqrt{d/n}-\sqrt{t^{2}/n})^{2} and (1+d/n+t2/n)2(1+\sqrt{d/n}+\sqrt{t^{2}/n})^{2}. Since 1(1x)2(1+x)211-(1-x)^{2}\leq(1+x)^{2}-1, using the inequality (1+x)21+3x(1+x)^{2}\leq 1+3x for x[0,1]x\in[0,1], we have shown that

IΣ1/2Σ^Σ1/2𝑜𝑝(1+d/n+t2/n)213d/n+3t2/n.\|I-\Sigma^{-1/2}\hat{\Sigma}\Sigma^{-1/2}\|_{\mathit{op}}\leq(1+\sqrt{d/n}+\sqrt{t^{2}/n})^{2}-1\leq 3\sqrt{d/n}+3\sqrt{t^{2}/n}.

Rewriting and taking t2=2log(2/δ)t^{2}=2\log(2/\delta) gives the result. ∎
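
As an illustrative numerical check of Corollary 4 (again a sketch assuming numpy, with an arbitrary diagonal Σ chosen only for convenience), one can verify that the eigenvalues of \Sigma^{-1/2}\hat{\Sigma}\Sigma^{-1/2} typically fall well inside [1-\epsilon,1+\epsilon]:

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, delta = 4000, 20, 0.05

    Sigma = np.diag(np.linspace(0.5, 2.0, d))          # a fixed positive definite covariance
    X = rng.standard_normal((n, d)) @ np.sqrt(Sigma)   # rows ~ N(0, Sigma); entrywise sqrt is valid since Sigma is diagonal
    Sigma_hat = X.T @ X / n

    W = np.diag(1.0 / np.sqrt(np.diag(Sigma)))         # Sigma^{-1/2} for a diagonal Sigma
    eigs = np.linalg.eigvalsh(W @ Sigma_hat @ W)
    eps = 3 * np.sqrt(d / n) + 3 * np.sqrt(2 * np.log(2 / delta) / n)
    print(f"whitened eigenvalues in [{eigs.min():.3f}, {eigs.max():.3f}]; "
          f"Corollary 4 interval: [{1 - eps:.3f}, {1 + eps:.3f}]")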

Gaussian Minmax Theorem.

The following result is Theorem 3 of [56], known as the Convex Gaussian Minmax Theorem or CGMT (see also Theorem 1 in the same reference). As explained there, it is a consequence of the main result of [41], known as Gordon’s Theorem or the Gaussian Minmax Theorem. Despite the name, convexity is only required for one of the theorem’s conclusions.

Theorem 9 (Convex Gaussian Minmax Theorem; [56, 41]).

Let Z:n×dZ:n\times d be a matrix with i.i.d. N(0,1)N(0,1) entries and suppose GN(0,In)G\sim N(0,I_{n}) and HN(0,Id)H\sim N(0,I_{d}) are independent of ZZ and each other. Let Sw,SuS_{w},S_{u} be compact sets and ψ:Sw×Su\psi:S_{w}\times S_{u}\to\mathbb{R} be an arbitrary continuous function. Define the Primary Optimization (PO) problem

Φ(Z):=minwSwmaxuSuu,Zw+ψ(w,u)\Phi(Z):=\min_{w\in S_{w}}\max_{u\in S_{u}}\langle u,Zw\rangle+\psi(w,u) (30)

and the Auxiliary Optimization (AO) problem

ϕ(G,H):=minwSwmaxuSuw2G,u+u2H,w+ψ(w,u).\phi(G,H):=\min_{w\in S_{w}}\max_{u\in S_{u}}\|w\|_{2}\langle G,u\rangle+\|u\|_{2}\langle H,w\rangle+\psi(w,u). (31)

Under these assumptions, Pr(Φ(Z)<c)2Pr(ϕ(G,H)c)\Pr(\Phi(Z)<c)\leq 2\Pr(\phi(G,H)\leq c) for any cc\in\mathbb{R}.

Furthermore, if we suppose that Sw,SuS_{w},S_{u} are convex sets and ψ(w,u)\psi(w,u) is convex in ww and concave in uu, then Pr(Φ(Z)>c)2Pr(ϕ(G,H)c)\Pr(\Phi(Z)>c)\leq 2\Pr(\phi(G,H)\geq c).

In other words, the first conclusion says that high probability lower bounds on the auxiliary optimization ϕ(G,H)\phi(G,H) imply high probability lower bounds on the primary optimization Φ(Z)\Phi(Z). Importantly, this direction holds without any convexity assumptions. Under the additional convexity assumptions, the second conclusion gives a similar comparison of high probability upper bounds.
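
As a classical illustration of the first conclusion (a standard special case, recorded here only as a sanity check), take S_{w}=S^{d-1}, S_{u}=S^{n-1} and \psi\equiv 0 with n\geq d. Then

\Phi(Z)=\min_{\|w\|_{2}=1}\max_{\|u\|_{2}=1}\langle u,Zw\rangle=\min_{\|w\|_{2}=1}\|Zw\|_{2}=\sigma_{\min}(Z)\quad\text{and}\quad\phi(G,H)=\min_{\|w\|_{2}=1}\left(\|G\|_{2}+\langle H,w\rangle\right)=\|G\|_{2}-\|H\|_{2},

so \Pr(\sigma_{\min}(Z)<c)\leq 2\Pr(\|G\|_{2}-\|H\|_{2}\leq c); combined with Lemma 2, this recovers, up to constants in the failure probability, the lower bound on the smallest singular value from Theorem 8.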

In our analysis, we need a slightly more general statement of the Gaussian Minmax Theorem than Theorem 9: we need the minmax formulation to include additional variables which only affect the deterministic term in the minmax problem. It’s straightforward to prove this result by repeating the argument in [56]; below we give an alternative proof which reduces to Theorem 9, by introducing extremely small extra dimensions to contain the extra variables. Intuitively, this works because the statement of the GMT allows for arbitrary continuous functions ψ\psi, with no dependence on their quantitative smoothness.

Theorem 10 (Variant of GMT).

Let Z:n×dZ:n\times d be a matrix with i.i.d. N(0,1)N(0,1) entries and suppose GN(0,In)G\sim N(0,I_{n}) and HN(0,Id)H\sim N(0,I_{d}) are independent of ZZ and each other. Let SW,SUS_{W},S_{U} be compact sets in d×d\mathbb{R}^{d}\times\mathbb{R}^{d^{\prime}} and n×n\mathbb{R}^{n}\times\mathbb{R}^{n^{\prime}} respectively, and let ψ:SW×SU\psi:S_{W}\times S_{U}\to\mathbb{R} be an arbitrary continuous function. Define the Primary Optimization (PO) problem

Φ(Z):=min(w,w)SWmax(u,u)SUu,Zw+ψ((w,w),(u,u))\Phi(Z):=\min_{(w,w^{\prime})\in S_{W}}\max_{(u,u^{\prime})\in S_{U}}\langle u,Zw\rangle+\psi((w,w^{\prime}),(u,u^{\prime})) (32)

and the Auxiliary Optimization (AO) problem

ϕ(G,H):=min(w,w)SWmax(u,u)SUw2G,u+u2H,w+ψ((w,w),(u,u)).\phi(G,H):=\min_{(w,w^{\prime})\in S_{W}}\max_{(u,u^{\prime})\in S_{U}}\|w\|_{2}\langle G,u\rangle+\|u\|_{2}\langle H,w\rangle+\psi((w,w^{\prime}),(u,u^{\prime})). (33)

Under these assumptions, Pr(Φ(Z)<c)2Pr(ϕ(G,H)c)\Pr(\Phi(Z)<c)\leq 2\Pr(\phi(G,H)\leq c) for any cc\in\mathbb{R}.

Proof.

Let ϵ(0,1)\epsilon\in(0,1) be arbitrary and

SW,ϵ:={(w,ϵw):(w,w)SW},SU,ϵ:={(u,ϵu):(u,u)SU}.S_{W,\epsilon}:=\{(w,\epsilon w^{\prime}):(w,w^{\prime})\in S_{W}\},\quad S_{U,\epsilon}:=\{(u,\epsilon u^{\prime}):(u,u^{\prime})\in S_{U}\}.

Define ψϵ((w,w),(u,u)):=ψ((w,1ϵw),(u,1ϵu))\psi_{\epsilon}((w,w^{\prime}),(u,u^{\prime})):=\psi((w,\frac{1}{\epsilon}w^{\prime}),(u,\frac{1}{\epsilon}u^{\prime})) so that if W=(w,ϵw)W=(w,\epsilon w^{\prime}) and U=(u,ϵu)U=(u,\epsilon u^{\prime}), then ψϵ(W,U)=ψ((w,w),(u,u))\psi_{\epsilon}(W,U)=\psi((w,w^{\prime}),(u,u^{\prime})). We also define Sw={wd:ws.t.(w,w)SW}S_{w}=\{w\in\mathbb{R}^{d}:\exists w^{\prime}\,s.t.\,(w,w^{\prime})\in S_{W}\}. The other sets Sw,SuS_{w^{\prime}},S_{u} and SuS_{u^{\prime}} are defined similarly. It is clear that Sw,Sw,Su,SuS_{w},S_{w^{\prime}},S_{u},S_{u^{\prime}}, SW,ϵS_{W,\epsilon} and SU,ϵS_{U,\epsilon} are all still compact in their respective topology, and ψϵ\psi_{\epsilon} is continuous for every ϵ>0\epsilon>0.

Let Z:(n+n)×(d+d)Z^{\prime}:(n+n^{\prime})\times(d+d^{\prime}) be a matrix with i.i.d. N(0,1)N(0,1) entries such that the top left n×dn\times d matrix is ZZ. Similarly, we define GG^{\prime} to be a (n+n)(n+n^{\prime})-dimensional Gaussian vector with independent coordinates such that the first nn coordinates are GG, and HH^{\prime} to be a (d+d)(d+d^{\prime})-dimensional Gaussian vector with independent coordinates such that the first dd coordinates are HH. Next, consider the augmented PO and AO:

Φϵ(Z):=minWSW,ϵmaxUSU,ϵU,ZW+ψϵ(W,U)ϕϵ(G,H):=minWSW,ϵmaxUSU,ϵW2G,U+U2H,W+ψϵ(W,U)\begin{split}\Phi_{\epsilon}(Z^{\prime})&:=\min_{W\in S_{W,\epsilon}}\max_{U\in S_{U,\epsilon}}\langle U,Z^{\prime}W\rangle+\psi_{\epsilon}(W,U)\\ \phi_{\epsilon}(G^{\prime},H^{\prime})&:=\min_{W\in S_{W,\epsilon}}\max_{U\in S_{U,\epsilon}}\|W\|_{2}\langle G^{\prime},U\rangle+\|U\|_{2}\langle H^{\prime},W\rangle+\psi_{\epsilon}(W,U)\\ \end{split} (34)

It is clear that for a small value of ϵ\epsilon, the augmented problem will be close to the original problem. More precisely, for every (w,w)SW(w,w^{\prime})\in S_{W} and (u,u)SU(u,u^{\prime})\in S_{U}

\begin{split}|\langle(u,\epsilon&u^{\prime}),Z^{\prime}(w,\epsilon w^{\prime})\rangle-\langle u,Zw\rangle|\\ &=|\epsilon\langle(0,u^{\prime}),Z^{\prime}(w,0)\rangle+\epsilon\langle(u,0),Z^{\prime}(0,w^{\prime})\rangle+\epsilon^{2}\langle(0,u^{\prime}),Z^{\prime}(0,w^{\prime})\rangle|\\ &\leq\epsilon(R(S_{w})+R(S_{w^{\prime}}))(R(S_{u})+R(S_{u^{\prime}}))\|Z^{\prime}\|_{\mathit{op}}=\epsilon A\|Z^{\prime}\|_{\mathit{op}}\\ \end{split} (35)

where A:=(R(Sw)+R(Sw))(R(Su)+R(Su))A:=(R(S_{w})+R(S_{w^{\prime}}))(R(S_{u})+R(S_{u^{\prime}})) is deterministic and does not depend on ϵ\epsilon. Similarly, it is routine to check

w2G,u=w2(G,(u,ϵu)ϵG,(0,u))u2H,w=u2(H,(w,ϵw)ϵH,(0,w))\begin{split}\|w\|_{2}\langle G,u\rangle&=\|w\|_{2}(\langle G^{\prime},(u,\epsilon u^{\prime})\rangle-\epsilon\langle G^{\prime},(0,u^{\prime})\rangle)\\ \|u\|_{2}\langle H,w\rangle&=\|u\|_{2}(\langle H^{\prime},(w,\epsilon w^{\prime})\rangle-\epsilon\langle H^{\prime},(0,w^{\prime})\rangle)\\ \end{split}

so by the triangle inequality and Cauchy-Schwarz inequality, we have

|(w,ϵw)2G,(u,ϵu)w2G,u|ϵR(Sw)G2(R(Su)+ϵR(Su))+ϵR(Sw)G2R(Su)ϵAG2\begin{split}\big{\lvert}\|(w,\epsilon&w^{\prime})\|_{2}\langle G^{\prime},(u,\epsilon u^{\prime})\rangle-\|w\|_{2}\langle G,u\rangle\big{\rvert}\\ &\leq\epsilon R(S_{w^{\prime}})\|G^{\prime}\|_{2}(R(S_{u})+\epsilon R(S_{u^{\prime}}))+\epsilon R(S_{w})\|G^{\prime}\|_{2}R(S_{u^{\prime}})\leq\epsilon A\|G^{\prime}\|_{2}\\ \end{split} (36)

and

|(u,ϵu)2H,(w,ϵw)u2H,w|ϵR(Su)H2(R(Sw)+ϵR(Sw))+ϵR(Su)H2R(Sw)ϵAH2\begin{split}\big{\lvert}\|(u,\epsilon&u^{\prime})\|_{2}\langle H^{\prime},(w,\epsilon w^{\prime})\rangle-\|u\|_{2}\langle H,w\rangle\big{\rvert}\\ \leq\,&\epsilon R(S_{u^{\prime}})\|H^{\prime}\|_{2}(R(S_{w})+\epsilon R(S_{w^{\prime}}))+\epsilon R(S_{u})\|H^{\prime}\|_{2}R(S_{w^{\prime}})\leq\epsilon A\|H^{\prime}\|_{2}\\ \end{split} (37)

From (35), it follows that

|Φϵ(Z)Φ(Z)|ϵAZ𝑜𝑝.\left|\Phi_{\epsilon}(Z^{\prime})-\Phi(Z)\right|\leq\epsilon A\|Z^{\prime}\|_{\mathit{op}}. (38)

Similarly, from (36) and (37), it follows that

|ϕϵ(G,H)ϕ(G,H)|ϵA(G2+H2).|\phi_{\epsilon}(G^{\prime},H^{\prime})-\phi(G,H)|\leq\epsilon A(\|G^{\prime}\|_{2}+\|H^{\prime}\|_{2}). (39)

Approximating the original PO and AO by (34) allows us to directly apply the Gaussian Minmax Theorem. For any cc\in\mathbb{R}, we have

Pr(Φ(Z)<c)Pr(Φϵ(Z)<c+ϵ)+Pr(ϵAZ𝑜𝑝>ϵ)2Pr(ϕϵ(G,H)c+ϵ)+Pr(ϵAZ𝑜𝑝>ϵ)2Pr(ϕ(G,H)c+2ϵ)+2Pr(ϵA(G2+H2)>ϵ)+Pr(ϵAZ𝑜𝑝>ϵ)2Pr(ϕ(G,H)c+2ϵ)+2Pr(G2>12Aϵ)+2Pr(H2>12Aϵ)+Pr(Z𝑜𝑝>1Aϵ)\begin{split}\Pr(\Phi(Z)<c)&\leq\Pr(\Phi_{\epsilon}(Z^{\prime})<c+\sqrt{\epsilon})+\Pr(\epsilon A\|Z^{\prime}\|_{\mathit{op}}>\sqrt{\epsilon})\\ &\leq 2\Pr(\phi_{\epsilon}(G^{\prime},H^{\prime})\leq c+\sqrt{\epsilon})+\Pr(\epsilon A\|Z^{\prime}\|_{\mathit{op}}>\sqrt{\epsilon})\\ &\leq 2\Pr(\phi(G^{\prime},H^{\prime})\leq c+2\sqrt{\epsilon})+2\Pr\left(\epsilon A(\|G^{\prime}\|_{2}+\|H^{\prime}\|_{2})>\sqrt{\epsilon}\right)\\ &\quad+\Pr(\epsilon A\|Z^{\prime}\|_{\mathit{op}}>\sqrt{\epsilon})\\ &\leq 2\Pr(\phi(G^{\prime},H^{\prime})\leq c+2\sqrt{\epsilon})+2\Pr\left(\|G^{\prime}\|_{2}>\frac{1}{2A\sqrt{\epsilon}}\right)\\ &\quad+2\Pr\left(\|H^{\prime}\|_{2}>\frac{1}{2A\sqrt{\epsilon}}\right)+\Pr\left(\|Z^{\prime}\|_{\mathit{op}}>\frac{1}{A\sqrt{\epsilon}}\right)\\ \end{split}

where we used (38) in the first inequality, Theorem 9 in the second inequality, and (39) in the third inequality. This holds for arbitrary ϵ>0\epsilon>0, and taking the limit ϵ0\epsilon\to 0 shows the result, because the CDF is right continuous [62] and the remaining terms go to zero by standard concentration inequalities (Lemma 2 and Theorem 8). ∎

Appendix B Uniform Convergence Bounds

We will now prove the main generalization bound, as well as its special cases in norm balls and specifically Euclidean norm balls.

B.1 General case: Proof of Theorem 1

For convenience, we restate the definition of covariance splitting here: See Definition 2

It follows from our definition that Σ1Σ2=0\Sigma_{1}\Sigma_{2}=0. Although our results in Appendix C require this orthogonality condition (in particular, Lemma 8), we note that all of our results here in Appendix B continue to hold as long as Σ=Σ1+Σ2\Sigma=\Sigma_{1}+\Sigma_{2} and both Σ1,Σ2\Sigma_{1},\Sigma_{2} are positive semi-definite. To apply the Gaussian Minimax Theorem, we first formulate the generalization gap as an optimization problem in terms of a random matrix with N(0,1)N(0,1) entries.

Lemma 3.

Under the model assumptions in (1), let 𝒦\mathcal{K} be an arbitrary compact set and Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2}. Define the primary optimization problem (PO) as

Φ:=max(w1,w2)𝒮Z1w1+Z2w2=ξw122+w222\Phi:=\max_{\begin{subarray}{c}(w_{1},w_{2})\in\mathcal{S}\\ Z_{1}w_{1}+Z_{2}w_{2}=\xi\end{subarray}}\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2} (40)

where

𝒮={(w1,w2):w𝒦s.t.w1=Σ11/2(ww) and w2=Σ21/2(ww)}\mathcal{S}=\{(w_{1},w_{2}):\exists w\in\mathcal{K}\,s.t.\,w_{1}=\Sigma_{1}^{1/2}(w-w^{*})\text{ and }w_{2}=\Sigma_{2}^{1/2}(w-w^{*})\} (41)

and Z1,Z2Z_{1},Z_{2} are both n×dn\times d random matrices with i.i.d. standard normal entries independent of ξ\xi and each other. Then the generalization gap of interpolators is equal in distribution to the sum of the Bayes risk and the PO:

maxw𝒦,L^(w)=0L(w)L^(w)=𝒟σ2+Φ.\max_{w\in\mathcal{K},\hat{L}(w)=0}L(w)-\hat{L}(w)\overset{\mathcal{D}}{=}\sigma^{2}+\Phi. (42)
Proof.

Recall that L(w)=σ2+wwΣ2L(w)=\sigma^{2}+\|w-w^{*}\|_{\Sigma}^{2} and L^(w)=0\hat{L}(w)=0 is equivalent to Y=XwY=Xw. Observe that

X=𝒟Z1Σ11/2+Z2Σ21/2andwΣ2=wΣ12+wΣ22X\overset{\mathcal{D}}{=}Z_{1}\Sigma_{1}^{1/2}+Z_{2}\Sigma_{2}^{1/2}\quad\text{and}\quad\|w\|_{\Sigma}^{2}=\|w\|_{\Sigma_{1}}^{2}+\|w\|_{\Sigma_{2}}^{2}

so we can decompose

maxw𝒦,L^(w)=0L(w)L^(w)=σ2+maxw𝒦,Y=XwwwΣ2=σ2+maxw𝒦,X(ww)=ξwwΣ2=𝒟σ2+maxw𝒦w(Z1Σ11/2+Z2Σ21/2)w=ξwΣ12+wΣ22=σ2+Φ.\begin{split}\max_{w\in\mathcal{K},\hat{L}(w)=0}L(w)-\hat{L}(w)&=\sigma^{2}+\max_{w\in\mathcal{K},Y=Xw}\|w-w^{*}\|_{\Sigma}^{2}\\ &=\sigma^{2}+\max_{w\in\mathcal{K},X(w-w^{*})=\xi}\|w-w^{*}\|_{\Sigma}^{2}\\ &\overset{\mathcal{D}}{=}\sigma^{2}+\max_{\begin{subarray}{c}w\in\mathcal{K}-w^{*}\\ (Z_{1}\Sigma_{1}^{1/2}+Z_{2}\Sigma_{2}^{1/2})w=\xi\end{subarray}}\|w\|_{\Sigma_{1}}^{2}+\|w\|_{\Sigma_{2}}^{2}=\sigma^{2}+\Phi.\qed\end{split}
Lemma 4 (Application of GMT).

In the same setting as Lemma 3, let GN(0,In),HN(0,Id)G\sim N(0,I_{n}),H\sim N(0,I_{d}) be Gaussian vectors independent of Z1,Z2,ξZ_{1},Z_{2},\xi and each other. With the same definition of 𝒮\mathcal{S}, define the auxiliary optimization problem (AO) as

ϕ:=max(w1,w2)𝒮ξZ1w1Gw222w2,Hw122+w222\phi:=\max_{\begin{subarray}{c}(w_{1},w_{2})\in\mathcal{S}\\ \left\lVert\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\right\rVert_{2}\leq\langle w_{2},H\rangle\end{subarray}}\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2} (43)

Then it holds that

Pr(Φ>t|Z1,ξ)2Pr(ϕt|Z1,ξ),\Pr(\Phi>t\,|\,Z_{1},\xi)\leq 2\Pr(\phi\geq t\,|\,Z_{1},\xi), (44)

and taking expectations we have

Pr(Φ>t)2Pr(ϕt).\Pr(\Phi>t)\leq 2\Pr(\phi\geq t). (45)
Proof.

By introducing Lagrange multipliers, we have

Φ=max(w1,w2)𝒮minλw122+w222+λ,Z2w2(ξZ1w1)=max(w1,w2)𝒮minλλ,Z2w2+w122+w222λ,ξZ1w1.\begin{split}\Phi&=\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{\lambda}\,\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}+\langle\lambda,Z_{2}w_{2}-(\xi-Z_{1}w_{1})\rangle\\ &=\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{\lambda}\,\langle\lambda,Z_{2}w_{2}\rangle+\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}-\langle\lambda,\xi-Z_{1}w_{1}\rangle.\\ \end{split}

By independence, the distribution of Z2Z_{2} remains the same after conditioning on Z1Z_{1} and ξ\xi and the randomness in Φ\Phi comes solely from Z2Z_{2}. Since the mapping from ww to (w1,w2)(w_{1},w_{2}) is continuous and 𝒦\mathcal{K} is compact, 𝒮\mathcal{S} is compact. To apply Theorem 10, we can take ψ(w1,w2,λ)=w122+w222λ,ξZ1w1\psi(w_{1},w_{2},\lambda)=\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}-\langle\lambda,\xi-Z_{1}w_{1}\rangle, which is clearly continuous. The only challenge is that the domain of λ\lambda is not compact, but we can handle it by a truncation argument. Define

Φr:=max(w1,w2)𝒮minλrλ,Z2w2+w122+w222λ,ξZ1w1\Phi_{r}:=\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{\|\lambda\|\leq r}\,\langle\lambda,Z_{2}w_{2}\rangle+\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}-\langle\lambda,\xi-Z_{1}w_{1}\rangle (46)

and observe that ΦΦr\Phi\leq\Phi_{r}, since the minimum in the definition of Φr\Phi_{r} ranges over a smaller set. The AO associated with Φr\Phi_{r} is

ϕr:=max(w1,w2)𝒮minλrw22G,λ+λ2H,w2+w122+w222λ,ξZ1w1=max(w1,w2)𝒮minλrλ2H,w2λ,ξZ1w1Gw22+w122+w222=max(w1,w2)𝒮min0λrλ(H,w2ξZ1w1Gw222)+w122+w222.\begin{split}\phi_{r}:&=\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{\|\lambda\|\leq r}\,\|w_{2}\|_{2}\langle G,\lambda\rangle+\|\lambda\|_{2}\langle H,w_{2}\rangle+\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}-\langle\lambda,\xi-Z_{1}w_{1}\rangle\\ &=\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{\|\lambda\|\leq r}\,\|\lambda\|_{2}\langle H,w_{2}\rangle-\langle\lambda,\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\rangle+\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}\\ &=\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{0\leq\lambda\leq r}\,\lambda\left(\langle H,w_{2}\rangle-\|\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\|_{2}\right)+\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}.\\ \end{split} (47)

We observe that the untruncated auxiliary problem ϕ\phi from (43) has a completely analogous form:

ϕ=max(w1,w2)𝒮minλ0λ(H,w2ξZ1w1Gw222)+w122+w222.\phi=\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{\lambda\geq 0}\,\lambda\left(\langle H,w_{2}\rangle-\|\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\|_{2}\right)+\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}.\\

This is because if H,w2ξZ1w1Gw2220\langle H,w_{2}\rangle-\|\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\|_{2}\geq 0 then the minimum is achieved at λ=0\lambda=0, and if w1,w2w_{1},w_{2} do not satisfy the constraint then taking λ\lambda\to\infty sends the minimum to -\infty. From this formulation, we see that ϕϕrϕs\phi\leq\phi_{r}\leq\phi_{s} for any rs0r\geq s\geq 0 since the minimum is taken over a larger set as rr grows, and is unconstrained in ϕ\phi.

The proof that limrϕr=ϕ\lim_{r\to\infty}\phi_{r}=\phi is an exercise in real analysis, which splits into two cases:

  1.

    The auxiliary problem ϕ\phi is infeasible. In this case, we know that for all (w1,w2)𝒮(w_{1},w_{2})\in\mathcal{S}

    H,w2ξZ1w1Gw222<0.\langle H,w_{2}\rangle-\|\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\|_{2}<0.

    By compactness of 𝒮\mathcal{S} and continuity of the right hand side, there exists μ=μ(ξ,Z1,G,H)<0\mu=\mu(\xi,Z_{1},G,H)<0 (in particular, independent of rr) such that

    H,w2ξZ1w1Gw222μ.\langle H,w_{2}\rangle-\|\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\|_{2}\leq\mu.

    Therefore, we show

    ϕrmax(w1,w2)𝒮min0λrλμ+w122+w222=rμ+max(w1,w2)𝒮w122+w222.\begin{split}\phi_{r}&\leq\max_{(w_{1},w_{2})\in\mathcal{S}}\min_{0\leq\lambda\leq r}\,\lambda\mu+\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}\\ &=r\mu+\max_{(w_{1},w_{2})\in\mathcal{S}}\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}.\end{split}

    Since the second term is bounded and has no dependence on rr, taking rr\to\infty we have ϕr\phi_{r}\to-\infty as desired (since ϕ=\phi=-\infty by definition).

  2.

    The auxiliary problem ϕ\phi is feasible. In this case, we can let (w1(r),w2(r))𝒮(w_{1}(r),w_{2}(r))\in\mathcal{S} be an arbitrary maximizer achieving the objective ϕr\phi_{r} for each r0r\geq 0 by compactness. By compactness again, the sequence (w1(r),w2(r))r=1(w_{1}(r),w_{2}(r))_{r=1}^{\infty} at positive integer values of rr has a subsequential limit (w1(),w2())𝒮(w_{1}(\infty),w_{2}(\infty))\in\mathcal{S}, i.e. this point satisfies (w1(),w2())=limn(w1(rn),w2(rn))(w_{1}(\infty),w_{2}(\infty))=\lim_{n\to\infty}(w_{1}(r_{n}),w_{2}(r_{n})) for some sequence rnr_{n} satisfying rnnr_{n}\geq n.

    Suppose that (w1(),w2())(w_{1}(\infty),w_{2}(\infty)) does not satisfy the last constraint defining ϕ\phi, then by continuity, there exists μ<0\mu<0 and a sufficiently small ϵ>0\epsilon>0 such that for all w1w1()2ϵ\|w_{1}-w_{1}(\infty)\|_{2}\leq\epsilon and w2w2()2ϵ\|w_{2}-w_{2}(\infty)\|_{2}\leq\epsilon, we have

    H,w2ξZ1w1Gw222μ.\langle H,w_{2}\rangle-\|\xi-Z_{1}w_{1}-G\|w_{2}\|_{2}\|_{2}\leq\mu.

    This implies that for sufficiently large nn, we have

    H,w2(rn)ξZ1w1(rn)Gw2(rn)22μ\langle H,w_{2}(r_{n})\rangle-\|\xi-Z_{1}w_{1}(r_{n})-G\|w_{2}(r_{n})\|_{2}\|_{2}\leq\mu

    and

    ϕrnrnμ+w1(rn)22+w2(rn)22rnμ+max(w1,w2)𝒮w122+w222\begin{split}\phi_{r_{n}}&\leq r_{n}\mu+\|w_{1}(r_{n})\|_{2}^{2}+\|w_{2}(r_{n})\|_{2}^{2}\\ &\leq r_{n}\mu+\max_{(w_{1},w_{2})\in\mathcal{S}}\|w_{1}\|_{2}^{2}+\|w_{2}\|_{2}^{2}\\ \end{split}

    so ϕrn\phi_{r_{n}}\to-\infty – but this is impossible, since considering any feasible element of ϕ\phi we can show that ϕrn0\phi_{r_{n}}\geq 0. By contradiction, we find that (w1(),w2())(w_{1}(\infty),w_{2}(\infty)) is feasible for ϕ\phi.

    By taking λ=0\lambda=0 in the definition of ϕr\phi_{r} we have

    ϕrnw1(rn)22+w2(rn)22.\phi_{r_{n}}\leq\|w_{1}(r_{n})\|_{2}^{2}+\|w_{2}(r_{n})\|_{2}^{2}.

    By continuity, we show that

    limsupnϕrnlimnw1(rn)22+w2(rn)22=w1()22+w2()22ϕ.\begin{split}\lim\sup_{n\to\infty}\phi_{r_{n}}&\leq\lim_{n\to\infty}\|w_{1}(r_{n})\|_{2}^{2}+\|w_{2}(r_{n})\|_{2}^{2}\\ &=\|w_{1}(\infty)\|_{2}^{2}+\|w_{2}(\infty)\|_{2}^{2}\leq\phi.\end{split}

    Since ϕrnϕ\phi_{r_{n}}\geq\phi, the limit of ϕrn\phi_{r_{n}} exists and equals ϕ\phi. We can conclude that limrϕr=ϕ\lim_{r\to\infty}\phi_{r}=\phi because ϕr\phi_{r} is a monotone decreasing function of rr.

By our version of the Gaussian Minmax Theorem, Theorem 10,

Pr(Φr>t|Z1,ξ)=Pr(Φr<t|Z1,ξ)2Pr(ϕrt|Z1,ξ)=2Pr(ϕrt|Z1,ξ)\Pr(\Phi_{r}>t|Z_{1},\xi)=\Pr(-\Phi_{r}<-t|Z_{1},\xi)\leq 2\Pr(-\phi_{r}\leq-t|Z_{1},\xi)=2\Pr(\phi_{r}\geq t|Z_{1},\xi)

We introduce the negative signs here because we originally have a max-min problem instead of a min-max problem; this means the comparison theorem gives an upper bound, instead of a lower bound, on the quantity of interest.

Finally, we can conclude

Pr(Φ>t|Z1,ξ)infr0Pr(Φr>t|Z1,ξ)2infr0Pr(ϕrt|Z1,ξ)2Pr(ϕt|Z1,ξ).\Pr(\Phi>t|Z_{1},\xi)\leq\inf_{r\geq 0}\Pr(\Phi_{r}>t|Z_{1},\xi)\leq 2\inf_{r\geq 0}\Pr(\phi_{r}\geq t|Z_{1},\xi)\leq 2\Pr(\phi\geq t|Z_{1},\xi).

where the last step uses continuity (from above) of the probability measure and the fact that ϕr\phi_{r} monotonically decreases to ϕ\phi almost surely. ∎

Recall the definition of Gaussian width and radius:

See Definition 1

It remains to analyze the auxiliary problem, which we do in the following lemma:

Lemma 5.

Let β=33log(32/δ)n+18rank(Σ1)n\beta=33\sqrt{\frac{\log(32/\delta)}{n}}+18\sqrt{\frac{\operatorname{rank}(\Sigma_{1})}{n}}. If nn is sufficiently large such that β1\beta\leq 1, then with probability at least 1δ1-\delta, it holds that

ϕ1+βn(W(Σ21/2𝒦)+rad(Σ21/2𝒦)2log(16/δ)+wΣ22log(16/δ))2σ2.\phi\leq\frac{1+\beta}{n}(W(\Sigma_{2}^{1/2}\mathcal{K})+\operatorname{rad}(\Sigma_{2}^{1/2}\mathcal{K})\sqrt{2\log(16/\delta)}+\|w^{*}\|_{\Sigma_{2}}\sqrt{2\log(16/\delta)})^{2}-\sigma^{2}. (48)
Proof.

For notational simplicity, define

α:=2log(32/δ)nγ:=3rank(Σ1)n+32log(16/δ)nρ:=rank(Σ1)+1n+2log(16/δ)n.\begin{split}\alpha&:=2\sqrt{\frac{\log(32/\delta)}{n}}\\ \gamma&:=3\sqrt{\frac{\operatorname{rank}(\Sigma_{1})}{n}}+3\sqrt{\frac{2\log(16/\delta)}{n}}\\ \rho&:=\sqrt{\frac{\operatorname{rank}(\Sigma_{1})+1}{n}}+2\sqrt{\frac{\log(16/\delta)}{n}}.\end{split}

By a union bound, the following collection of events, which together we call \mathcal{E}, occurs with probability at least 1δ1-\delta:

  1.

    (Approximate orthogonality.) By Lemma 1, uniformly over all w1Σ11/2(𝒦w)w_{1}\in\Sigma_{1}^{1/2}(\mathcal{K}-w^{*}) and aa\in\mathbb{R}, it holds that

    |ξaZ1w1,G|ξaZ1w12G2ρ|\langle\xi a-Z_{1}w_{1},G\rangle|\leq\|\xi a-Z_{1}w_{1}\|_{2}\|G\|_{2}\rho (49)

    and

    |ξ,Z1w1|ξ2Z1w12ρ.|\langle\xi,Z_{1}w_{1}\rangle|\leq\|\xi\|_{2}\|Z_{1}w_{1}\|_{2}\rho. (50)
  2.

    (Approximate isometry.) By Corollary 4, uniformly over all w1Σ11/2(𝒦w)w_{1}\in\Sigma_{1}^{1/2}(\mathcal{K}-w^{*}), it holds that

    (1γ)w122Z1w122n(1+γ)w122.(1-\gamma)\|w_{1}\|_{2}^{2}\leq\frac{\|Z_{1}w_{1}\|_{2}^{2}}{n}\leq(1+\gamma)\|w_{1}\|_{2}^{2}. (51)
  3.

    (Typical norm of GG and ξ\xi.) By Lemma 2, it holds that

    α1nG21α-\alpha\leq\frac{1}{\sqrt{n}}\|G\|_{2}-1\leq\alpha (52)

    and

    ασ1nξ2σασ.-\alpha\sigma\leq\frac{1}{\sqrt{n}}\|\xi\|_{2}-\sigma\leq\alpha\sigma. (53)
  4.

    (Typical size of Σ21/2w,H\langle\Sigma_{2}^{1/2}w^{*},H\rangle.) By the standard Gaussian tail bound Pr(|Z|t)2et2/2\Pr(|Z|\geq t)\leq 2e^{-t^{2}/2}, it holds that

    |Σ21/2w,H|wΣ22log(16/δ)|\langle\Sigma_{2}^{1/2}w^{*},H\rangle|\leq\|w^{*}\|_{\Sigma_{2}}\sqrt{2\log(16/\delta)} (54)

    because the marginal law of Σ21/2w,H\langle\Sigma_{2}^{1/2}w^{*},H\rangle is N(0,wΣ22)N(0,\|w^{*}\|_{\Sigma_{2}}^{2}).

  5.

    (Gaussian process concentration.) By Theorem 6, it holds that

    maxw2Σ21/2𝒦|w2,H|W(Σ21/2𝒦)+rad(Σ21/2𝒦)2log(16/δ)\max_{w_{2}\in\Sigma_{2}^{1/2}\mathcal{K}}|\langle w_{2},H\rangle|\leq W(\Sigma_{2}^{1/2}\mathcal{K})+\operatorname{rad}(\Sigma_{2}^{1/2}\mathcal{K})\sqrt{2\log(16/\delta)} (55)

    because maxw2Σ21/2𝒦|w2,H|\max_{w_{2}\in\Sigma_{2}^{1/2}\mathcal{K}}|\langle w_{2},H\rangle| is a rad(Σ21/2𝒦)\operatorname{rad}(\Sigma_{2}^{1/2}\mathcal{K})-Lipschitz function of HH.

From now on, the argument is conditional on the event \mathcal{E} defined above. By squaring the last constraint in the definition of ϕ\phi we see that

w2,H2ξZ1w1w22G22=ξZ1w122+w222G222ξZ1w1,w22G(1ρ)[ξZ1w122+w222G22]\begin{split}\langle w_{2},H\rangle^{2}&\geq\|\xi-Z_{1}w_{1}-\|w_{2}\|_{2}G\|_{2}^{2}\\ &=\left\lVert\xi-Z_{1}w_{1}\right\rVert_{2}^{2}+\|w_{2}\|_{2}^{2}\left\lVert G\right\rVert_{2}^{2}-2\langle\xi-Z_{1}w_{1},\|w_{2}\|_{2}G\rangle\\ &\geq(1-\rho)[\left\lVert\xi-Z_{1}w_{1}\right\rVert_{2}^{2}+\|w_{2}\|_{2}^{2}\left\lVert G\right\rVert_{2}^{2}]\\ \end{split}

where in the last line we used (49) and the AM-GM inequality (aba2/2+b2/2ab\leq a^{2}/2+b^{2}/2). Rearranging gives the inequality

w222(1ρ)1w2,H2ξZ1w122G22(1ρ)1w2,H2(1ρ)[ξ22+Z1w122]G22(1ρ)1w2,H2(1ρ)[ξ22+Z1w122](1α)2n(1γ)(1ρ)(1α)2w122+(1ρ)1w2,H2(1ρ)ξ22(1α)2n\begin{split}\|w_{2}\|_{2}^{2}&\leq\frac{(1-\rho)^{-1}\langle w_{2},H\rangle^{2}-\left\lVert\xi-Z_{1}w_{1}\right\rVert_{2}^{2}}{\|G\|_{2}^{2}}\\ &\leq\frac{(1-\rho)^{-1}\langle w_{2},H\rangle^{2}-(1-\rho)[\left\lVert\xi\right\rVert_{2}^{2}+\left\lVert Z_{1}w_{1}\right\rVert_{2}^{2}]}{\|G\|_{2}^{2}}\\ &\leq\frac{(1-\rho)^{-1}\langle w_{2},H\rangle^{2}-(1-\rho)[\left\lVert\xi\right\rVert_{2}^{2}+\left\lVert Z_{1}w_{1}\right\rVert_{2}^{2}]}{(1-\alpha)^{2}n}\\ &\leq-\frac{(1-\gamma)(1-\rho)}{(1-\alpha)^{2}}\left\lVert w_{1}\right\rVert_{2}^{2}+\frac{(1-\rho)^{-1}\langle w_{2},H\rangle^{2}-(1-\rho)\left\lVert\xi\right\rVert_{2}^{2}}{(1-\alpha)^{2}n}\\ \end{split}

where in the second inequality we used (50) and the AM-GM inequality again, in the third inequality we used (52) and in the last inequality we used (51). This shows

(1γ)(1ρ)(w122+w222)(1γ)(1ρ)(1α)2w122+w222(1ρ)1w2,H2(1ρ)ξ22(1α)2n.\begin{split}(1-\gamma)(1-\rho)(\left\lVert w_{1}\right\rVert_{2}^{2}+\left\lVert w_{2}\right\rVert_{2}^{2})&\leq\frac{(1-\gamma)(1-\rho)}{(1-\alpha)^{2}}\left\lVert w_{1}\right\rVert_{2}^{2}+\|w_{2}\|_{2}^{2}\\ &\leq\frac{(1-\rho)^{-1}\langle w_{2},H\rangle^{2}-(1-\rho)\left\lVert\xi\right\rVert_{2}^{2}}{(1-\alpha)^{2}n}.\\ \end{split}

Dividing through by the first two factors on the left hand side and plugging in (53) gives

w122+w222(1ρ)2w2,H2ξ22(1γ)(1α)2n1(1γ)(1α)2(1ρ)2w2,H2nσ21γ.\begin{split}\left\lVert w_{1}\right\rVert_{2}^{2}+\left\lVert w_{2}\right\rVert_{2}^{2}&\leq\frac{(1-\rho)^{-2}\langle w_{2},H\rangle^{2}-\left\lVert\xi\right\rVert_{2}^{2}}{(1-\gamma)(1-\alpha)^{2}n}\\ &\leq\frac{1}{(1-\gamma)(1-\alpha)^{2}(1-\rho)^{2}}\frac{\langle w_{2},H\rangle^{2}}{n}-\frac{\sigma^{2}}{1-\gamma}.\\ \end{split}

We can simplify the first term by defining β=(1γ)1(1α)2(1ρ)21\beta=(1-\gamma)^{-1}(1-\alpha)^{-2}(1-\rho)^{-2}-1 and the second term by observing σ21γσ2-\frac{\sigma^{2}}{1-\gamma}\leq-\sigma^{2}. Finally, plugging into (43) gives

ϕmax(w1,w2)S(1+β)w2,H2nσ2=1+βnmaxw2Σ21/2(𝒦w)|w2,H|2σ21+βn(maxw2Σ21/2𝒦|w2,H|+|Σ21/2w,H|)2σ2\begin{split}\phi&\leq\max_{(w_{1},w_{2})\in S}(1+\beta)\frac{\langle w_{2},H\rangle^{2}}{n}-\sigma^{2}\\ &=\frac{1+\beta}{n}\max_{w_{2}\in\Sigma_{2}^{1/2}(\mathcal{K}-w^{*})}|\langle w_{2},H\rangle|^{2}-\sigma^{2}\\ &\leq\frac{1+\beta}{n}\left(\max_{w_{2}\in\Sigma_{2}^{1/2}\mathcal{K}}|\langle w_{2},H\rangle|+|\langle\Sigma_{2}^{1/2}w^{*},H\rangle|\right)^{2}-\sigma^{2}\\ \end{split}

by the triangle inequality, and (48) follows by (54) and (55). To deduce the explicit bound for β\beta, first use that

(1α)2=12α+α212α(1-\alpha)^{2}=1-2\alpha+\alpha^{2}\geq 1-2\alpha

and similarly (1ρ)212ρ(1-\rho)^{2}\geq 1-2\rho to show

1(1γ)(1α)2(1ρ)21(1γ)(12ρ)(12α).\frac{1}{(1-\gamma)(1-\alpha)^{2}(1-\rho)^{2}}\leq\frac{1}{(1-\gamma)(1-2\rho)(1-2\alpha)}.

If γ,ρ<1/2\gamma,\rho<1/2, then

(1γ)(12α)(12ρ)=1γ2α2ρ+2γα+2γρ+4αρ4γαρ1γ2α2ρ4γαρ>12γ2α2ρ.\begin{split}(1-\gamma)(1-2\alpha)(1-2\rho)&=1-\gamma-2\alpha-2\rho+2\gamma\alpha+2\gamma\rho+4\alpha\rho-4\gamma\alpha\rho\\ &\geq 1-\gamma-2\alpha-2\rho-4\gamma\alpha\rho>1-2\gamma-2\alpha-2\rho.\\ \end{split}

Provided that 2γ+2α+2ρ<1/22\gamma+2\alpha+2\rho<1/2 (which implies that γ,ρ<1/2\gamma,\rho<1/2), we can use the inequality (1x)11+2x(1-x)^{-1}\leq 1+2x for x[0,1/2]x\in[0,1/2] to show that

1(1γ)(1α)2(1ρ)2112γ2α2ρ1+4γ+4α+4ρ\frac{1}{(1-\gamma)(1-\alpha)^{2}(1-\rho)^{2}}\leq\frac{1}{1-2\gamma-2\alpha-2\rho}\leq 1+4\gamma+4\alpha+4\rho

and thus we can choose

β=33log(32/δ)n+18rank(Σ1)n4γ+4α+4ρ\beta=33\sqrt{\frac{\log(32/\delta)}{n}}+18\sqrt{\frac{\operatorname{rank}(\Sigma_{1})}{n}}\geq 4\gamma+4\alpha+4\rho\qed

We are finally ready to prove our main generalization bound:

See Theorem 1

Proof.

By Lemmas 3 and 4, we show that for any tt

Pr(maxw𝒦,L^(w)=0L(w)L^(w)>t)=Pr(Φ>tσ2)2Pr(ϕtσ2).\Pr\left(\max_{w\in\mathcal{K},\hat{L}(w)=0}L(w)-\hat{L}(w)>t\right)=\Pr(\Phi>t-\sigma^{2})\leq 2\Pr(\phi\geq t-\sigma^{2}).

By Lemma 5, the above is upper bounded by δ\delta if we set tσ2t-\sigma^{2} according to (48) with δ\delta replaced by δ/2\delta/2. Observe that the σ2\sigma^{2} term cancels, and the proof is complete. ∎

Remark 2 (Translation-invariant version).

Our generalization guarantee is stated in terms of W()W(\cdot) and rad()\operatorname{rad}(\cdot), which are not translation-invariant. However, the generalization guarantee of Theorem 1 can be made translation invariant, e.g. replacing W(Σ21/2𝒦)W(\Sigma_{2}^{1/2}\mathcal{K}) by W(Σ21/2(𝒦a))W(\Sigma_{2}^{1/2}(\mathcal{K}-a)) for an arbitrary ada\in\mathbb{R}^{d}, by recentering the problem before applying Theorem 1, i.e. by subtracting XaXa from both sides of the interpolation constraint Xw=Xw+ξXw=Xw^{*}+\xi.

We also note that in Theorem 1, there is no requirement that w𝒦w^{*}\in\mathcal{K}, so the true function may not necessarily lie in the class even if there is no noise (σ=0\sigma=0).

B.2 Specialization to General Norm Balls

For convenience, we restate the general definition of effective rank.

See Definition 5

Applying Theorem 1 to an arbitrary norm ball yields the following:

See Corollary 3

Proof.

Let 𝒦={w:wB}\mathcal{K}=\{w:\|w\|\leq B\} in Theorem 1. It is easy to see that

W(Σ21/2𝒦)=𝔼supwB|Σ21/2w,H|=𝔼supwB|w,Σ21/2H|=B𝔼Σ21/2HW(\Sigma_{2}^{1/2}\mathcal{K})=\operatorname*{\mathbb{E}}\sup_{\|w\|\leq B}|\langle\Sigma_{2}^{1/2}w,H\rangle|=\operatorname*{\mathbb{E}}\sup_{\|w\|\leq B}|\langle w,\Sigma_{2}^{1/2}H\rangle|=B\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}

and

R(Σ21/2𝒦)=supwBΣ21/2w2=Bsupw1wΣ2.R(\Sigma_{2}^{1/2}\mathcal{K})=\sup_{\|w\|\leq B}\|\Sigma_{2}^{1/2}w\|_{2}=B\sup_{\|w\|\leq 1}\|w\|_{\Sigma_{2}}.

From our definition, it is clear that

r(Σ)=(W(Σ1/2𝒦)R(Σ1/2𝒦))2.r_{\left\lVert\cdot\right\rVert}(\Sigma)=\left(\frac{W(\Sigma^{1/2}\mathcal{K})}{R(\Sigma^{1/2}\mathcal{K})}\right)^{2}.

Observe that

wΣ2wsupw1wΣ2Bsupw1wΣ2=R(Σ21/2𝒦).\|w^{*}\|_{\Sigma_{2}}\leq\|w^{*}\|\sup_{\|w\|\leq 1}\|w\|_{\Sigma_{2}}\leq B\sup_{\|w\|\leq 1}\|w\|_{\Sigma_{2}}=R(\Sigma_{2}^{1/2}\mathcal{K}).

The two above equations imply

W(Σ21/2𝒦)+rad(Σ21/2𝒦)2log(32/δ)+wΣ22log(32/δ)W(Σ21/2𝒦)+22log(32/δ)rad(Σ21/2𝒦)=W(Σ21/2𝒦)+22log(32/δ)r(Σ2)W(Σ21/2𝒦)=(1+22log(32/δ)r(Σ2))(B𝔼Σ21/2H).\begin{split}&W(\Sigma_{2}^{1/2}\mathcal{K})+\operatorname{rad}(\Sigma_{2}^{1/2}\mathcal{K})\sqrt{2\log(32/\delta)}+\|w^{*}\|_{\Sigma_{2}}\sqrt{2\log(32/\delta)}\\ \leq\,&W(\Sigma_{2}^{1/2}\mathcal{K})+2\sqrt{2\log(32/\delta)}\operatorname{rad}(\Sigma_{2}^{1/2}\mathcal{K})\\ =\,&W(\Sigma_{2}^{1/2}\mathcal{K})+2\sqrt{\frac{2\log(32/\delta)}{r_{\left\lVert\cdot\right\rVert}(\Sigma_{2})}}W(\Sigma_{2}^{1/2}\mathcal{K})\\ =\,&\left(1+2\sqrt{\frac{2\log(32/\delta)}{r_{\left\lVert\cdot\right\rVert}(\Sigma_{2})}}\right)\left(B\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}\right).\\ \end{split}

Under our assumptions that γ1\gamma\leq 1 and δ1/4\delta\leq 1/4, using the inequality (1+x)(1+y)1+x+2y(1+x)(1+y)\leq 1+x+2y for x1x\leq 1, it is routine to check that

(1+β)(1+22log(32/δ)r(Σ2))21+γ.(1+\beta)\left(1+2\sqrt{\frac{2\log(32/\delta)}{r_{\left\lVert\cdot\right\rVert}(\Sigma_{2})}}\right)^{2}\leq 1+\gamma.

Plugging into Theorem 1 concludes the proof. ∎
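
For a concrete instance of this specialization (a worked example rather than a new statement), take \|\cdot\|=\|\cdot\|_{1} and the isotropic choice \Sigma_{2}=I_{d}. The dual norm is \|\cdot\|_{\infty}, and the standard bound on the expected maximum of Gaussians gives

\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}=\operatorname*{\mathbb{E}}\|H\|_{\infty}\leq\sqrt{2\log(2d)},

so the Gaussian width term is W(\Sigma_{2}^{1/2}\mathcal{K})\leq B\sqrt{2\log(2d)} and the leading term of the resulting bound scales as 2B^{2}\log(2d)/n (compare the \ell_{1} results discussed in Section 6 and Appendix E).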

B.3 Special Case: Euclidean Norm

In the Euclidean setting, the effective ranks are defined as follows:

See Definition 3

Due to the small difference between r(Σ)r(\Sigma) and r2(Σ)r_{\left\lVert\cdot\right\rVert_{2}}(\Sigma), our generalization bound below requires a slightly different argument (see the discussion in Section 5), but the proof strategy is exactly the same.

See Corollary 2

Proof.

The proof is identical to that of Corollary 3 except for the inconsequential difference between 𝔼Σ21/2g2\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}g\|_{2} and Tr(Σ2)1/2\operatorname{Tr}(\Sigma_{2})^{1/2}. It is easy to see that

W(Σ21/2𝒦)BTr(Σ2)1/2andR(Σ21/2𝒦)=BΣ2𝑜𝑝1/2.W(\Sigma_{2}^{1/2}\mathcal{K})\leq B\operatorname{Tr}(\Sigma_{2})^{1/2}\quad\text{and}\quad R(\Sigma_{2}^{1/2}\mathcal{K})=B\|\Sigma_{2}\|_{\mathit{op}}^{1/2}.

By the same argument, we can show that wΣ2R(Σ21/2𝒦)\|w^{*}\|_{\Sigma_{2}}\leq R(\Sigma_{2}^{1/2}\mathcal{K}) and

W(Σ21/2𝒦)+rad(Σ21/2𝒦)2log(32/δ)+wΣ22log(32/δ)W(Σ21/2𝒦)+22log(32/δ)rad(Σ21/2𝒦)BTr(Σ2)1/2+22log(32/δ)BΣ2𝑜𝑝1/2=(1+22log(32/δ)r(Σ2))BTr(Σ2)1/2.\begin{split}&W(\Sigma_{2}^{1/2}\mathcal{K})+\operatorname{rad}(\Sigma_{2}^{1/2}\mathcal{K})\sqrt{2\log(32/\delta)}+\|w^{*}\|_{\Sigma_{2}}\sqrt{2\log(32/\delta)}\\ \leq\,&W(\Sigma_{2}^{1/2}\mathcal{K})+2\sqrt{2\log(32/\delta)}\operatorname{rad}(\Sigma_{2}^{1/2}\mathcal{K})\\ \leq\,&B\operatorname{Tr}(\Sigma_{2})^{1/2}+2\sqrt{2\log(32/\delta)}B\|\Sigma_{2}\|_{\mathit{op}}^{1/2}\\ =\,&\left(1+2\sqrt{\frac{2\log(32/\delta)}{r(\Sigma_{2})}}\right)B\operatorname{Tr}(\Sigma_{2})^{1/2}.\\ \end{split}

Plugging into Theorem 1 concludes the proof. ∎
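
To give a sense of how close \operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}g\|_{2} and \operatorname{Tr}(\Sigma_{2})^{1/2} are in practice (an illustrative sketch assuming numpy, with an arbitrarily chosen decaying spectrum), one can compare a Monte Carlo estimate of the former with the latter:

    import numpy as np

    rng = np.random.default_rng(2)
    d, trials = 500, 5000
    lam = 1.0 / np.arange(1, d + 1)                    # eigenvalues of a diagonal Sigma_2
    g = rng.standard_normal((trials, d))
    norms = np.linalg.norm(g * np.sqrt(lam), axis=1)   # ||Sigma_2^{1/2} g||_2 for each sample
    print(f"Monte Carlo E||Sigma_2^(1/2) g||_2 = {norms.mean():.3f}, "
          f"Tr(Sigma_2)^(1/2) = {np.sqrt(lam.sum()):.3f}")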

Next, by choosing a particular covariance split, we prove the speculative bound from [73] when the features are Gaussian:

See Corollary 1

Proof.

By Theorem 1 and the same argument as in the proof of Corollary 2, we obtain

supw𝒦,Y=XwL(w)1+βn(BTr(Σ2)1/2+BΣ2𝑜𝑝1/222log(32δ))2B2n(1+β)(Tr(Σ)1/2+Σ2𝑜𝑝1/26log(1/δ))2.\begin{split}\sup_{w\in\mathcal{K},Y=Xw}L(w)&\leq\frac{1+\beta}{n}\left(B\operatorname{Tr}(\Sigma_{2})^{1/2}+B\|\Sigma_{2}\|_{\mathit{op}}^{1/2}\cdot 2\sqrt{2\log\left(\frac{32}{\delta}\right)}\right)^{2}\\ &\leq\frac{B^{2}}{n}(1+\beta)\left(\operatorname{Tr}(\Sigma)^{1/2}+\|\Sigma_{2}\|_{\mathit{op}}^{1/2}\cdot 6\sqrt{\log(1/\delta)}\right)^{2}.\\ \end{split}

Letting Σ1\Sigma_{1} contain the largest eigenvalues, we have

rank(Σ1)Σ2𝑜𝑝Tr(Σ).\operatorname{rank}(\Sigma_{1})\|\Sigma_{2}\|_{\mathit{op}}\leq\operatorname{Tr}(\Sigma).

Plugging in the inequality shows

supw𝒦,Y=XwL(w)B2Tr(Σ)n(1+β)(1+6log(1/δ)rank(Σ1))2.\begin{split}\sup_{w\in\mathcal{K},Y=Xw}L(w)&\leq\frac{B^{2}\operatorname{Tr}(\Sigma)}{n}(1+\beta)\left(1+6\sqrt{\frac{\log(1/\delta)}{\operatorname{rank}(\Sigma_{1})}}\right)^{2}.\\ \end{split}

Therefore, we can pick γ=(1+β)(1+6log(1/δ)rank(Σ1))21\gamma=(1+\beta)\left(1+6\sqrt{\frac{\log(1/\delta)}{\operatorname{rank}(\Sigma_{1})}}\right)^{2}-1 and it is clear that

γlog(1/δ)n+rank(Σ1)n+log(1/δ)rank(Σ1)\gamma\lesssim\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\operatorname{rank}(\Sigma_{1})}{n}}+\sqrt{\frac{\log(1/\delta)}{\operatorname{rank}(\Sigma_{1})}}

for sufficiently large nn and rank(Σ1)\operatorname{rank}(\Sigma_{1}). To balance the last two terms, we can pick a covariance split such that rank(Σ1)\operatorname{rank}(\Sigma_{1}) is of order [nlog(1/δ)]1/2[n\log(1/\delta)]^{1/2}; then both \sqrt{\operatorname{rank}(\Sigma_{1})/n} and \sqrt{\log(1/\delta)/\operatorname{rank}(\Sigma_{1})} are of order (\log(1/\delta)/n)^{1/4}, which proves the log(1/δ)/n4\sqrt[4]{\log(1/\delta)/n} rate. ∎

Appendix C Bounds on the Norm of the Minimal-Norm Interpolator

In this section, we will give bounds – again based on the Gaussian Minimax Theorem – for the norm of the minimal norm interpolator, first in general and then in the Euclidean case.

C.1 General Norms: Proof of Theorem 4

Similar to the analysis in the previous section, we first formulate the minimal norm as an optimization problem in terms of a random matrix with N(0,1)N(0,1) entries. Next, we apply the Convex Gaussian Minimax Theorem.

Lemma 6.

Under the model assumptions in (1), let \lVert\cdot\rVert be an arbitrary norm and Z:n×dZ:n\times d be a matrix with i.i.d. N(0,1)N(0,1) entries independent of ξ\xi. Define the primary optimization problem (PO) as

Φ:=minZw=ξΣ1/2w.\Phi:=\min_{Zw=\xi}\|\Sigma^{-1/2}w\|. (56)

Then for any tt, it holds that

Pr(minXw=Yw>t)Pr(w+Φ>t).\Pr\left(\,\min_{Xw=Y}\|w\|>t\,\right)\leq\Pr\left(\,\|w^{*}\|+\Phi>t\,\right). (57)
Proof.

By equality in distribution, we can write X=ZΣ1/2X=Z\Sigma^{1/2}. By the triangle inequality and two changes of variables, we have

minXw=Yw=minXw=ξw+ww+minZΣ1/2w=ξw=w+minZw=ξΣ1/2w.\begin{split}\min_{Xw=Y}\|w\|&=\min_{Xw=\xi}\|w+w^{*}\|\\ &\leq\|w^{*}\|+\min_{Z\Sigma^{1/2}w=\xi}\|w\|\\ &=\|w^{*}\|+\min_{Zw=\xi}\|\Sigma^{-1/2}w\|.\qed\end{split}
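
For the Euclidean norm and an invertible Σ, both sides of this comparison reduce to pseudoinverse computations (via the same change of variables as in the proof), which gives a quick numerical illustration of (57). The following is a sketch assuming numpy; the diagonal Σ and the specific dimensions are arbitrary choices for illustration:

    import numpy as np

    rng = np.random.default_rng(3)
    n, d, sigma = 50, 200, 1.0

    lam = np.linspace(0.5, 2.0, d)              # eigenvalues of a diagonal Sigma
    Z = rng.standard_normal((n, d))
    X = Z * np.sqrt(lam)                        # X = Z Sigma^{1/2} since Sigma is diagonal
    w_star = rng.standard_normal(d) / np.sqrt(d)
    xi = sigma * rng.standard_normal(n)
    Y = X @ w_star + xi

    min_norm = np.linalg.norm(np.linalg.pinv(X) @ Y)   # ell_2 norm of the minimum-norm interpolator
    Phi = np.linalg.norm(np.linalg.pinv(X) @ xi)       # PO value: min of ||Sigma^{-1/2} w||_2 over Zw = xi,
                                                       # computed via the substitution v = Sigma^{-1/2} w
    print(f"min-norm interpolator: {min_norm:.3f} <= ||w*||_2 + Phi = "
          f"{np.linalg.norm(w_star) + Phi:.3f}")
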
Lemma 7 (Application of CGMT).

In the same setting as Lemma 6, let GN(0,In),HN(0,Id)G\sim N(0,I_{n}),H\sim N(0,I_{d}) be Gaussian vectors independent of ξ\xi and each other. Define the auxiliary optimization problem (AO) as

ϕ:=minξw2G2H,wΣ1/2w.\phi:=\min_{\|\xi-\|w\|_{2}G\|_{2}\leq\langle H,w\rangle}\|\Sigma^{-1/2}w\|. (58)

Then it holds that

Pr(Φ>t|ξ)2Pr(ϕt|ξ),\Pr(\Phi>t\,|\,\xi)\leq 2\Pr(\phi\geq t\,|\,\xi), (59)

and taking expectations we have

Pr(Φ>t)2Pr(ϕt).\Pr(\Phi>t)\leq 2\Pr(\phi\geq t). (60)
Proof.

By introducing Lagrange multipliers, we have

Φ=minwmaxλΣ1/2w+λ,Zwξ=minwmaxλλ,Zw+Σ1/2wλ,ξ.\begin{split}\Phi&=\min_{w}\max_{\lambda}\,\|\Sigma^{-1/2}w\|+\langle\lambda,Zw-\xi\rangle\\ &=\min_{w}\max_{\lambda}\,\langle\lambda,Zw\rangle+\|\Sigma^{-1/2}w\|-\langle\lambda,\xi\rangle.\end{split}

By independence, the distribution of ZZ remains the same after conditioning on ξ\xi and the randomness in Φ\Phi comes solely from ZZ. Therefore, we can apply CGMT in Theorem 9 with ψ(w,λ)=Σ1/2wλ,ξ\psi(w,\lambda)=\|\Sigma^{-1/2}w\|-\langle\lambda,\xi\rangle because ψ\psi is convex-concave, but we again have the technical difficulty that the domains of ww and λ\lambda are not compact. To overcome this, we will use a double truncation argument. For any r,t>0r,t>0, we define

Φr(t):=minΣ1/2w2tmaxλ2rλ,Zw+Σ1/2wλ,ξ\Phi_{r}(t):=\min_{\|\Sigma^{-1/2}w\|\leq 2t}\max_{\|\lambda\|_{2}\leq r}\,\langle\lambda,Zw\rangle+\|\Sigma^{-1/2}w\|-\langle\lambda,\xi\rangle (61)

and the corresponding AO

ϕr(t):=minΣ1/2w2tmaxλ2rw2G,λ+λ2H,w+Σ1/2wλ,ξ=minΣ1/2w2tmaxλ2rλ2H,wλ,ξw2G+Σ1/2w=minΣ1/2w2tmax0λrλ(H,w+ξw2G2)+Σ1/2w.\begin{split}\phi_{r}(t):&=\min_{\|\Sigma^{-1/2}w\|\leq 2t}\max_{\|\lambda\|_{2}\leq r}\,\|w\|_{2}\langle G,\lambda\rangle+\|\lambda\|_{2}\langle H,w\rangle+\|\Sigma^{-1/2}w\|-\langle\lambda,\xi\rangle\\ &=\min_{\|\Sigma^{-1/2}w\|\leq 2t}\max_{\|\lambda\|_{2}\leq r}\,\|\lambda\|_{2}\langle H,w\rangle-\langle\lambda,\xi-\|w\|_{2}G\rangle+\|\Sigma^{-1/2}w\|\\ &=\min_{\|\Sigma^{-1/2}w\|\leq 2t}\max_{0\leq\lambda\leq r}\,\lambda\left(\langle H,w\rangle+\|\xi-\|w\|_{2}G\|_{2}\right)+\|\Sigma^{-1/2}w\|.\end{split} (62)

Note that the optimization in Φr(t)\Phi_{r}(t) and ϕr(t)\phi_{r}(t) now ranges over compact sets. We will also use an intermediate problem between Φ\Phi and Φr(t)\Phi_{r}(t), defined as

Φ(t):=minΣ1/2w2tmaxλλ,Zw+Σ1/2wλ,ξ=minZw=ξΣ1/2w2tΣ1/2w.\begin{split}\Phi(t):&=\min_{\|\Sigma^{-1/2}w\|\leq 2t}\max_{\lambda}\,\langle\lambda,Zw\rangle+\|\Sigma^{-1/2}w\|-\langle\lambda,\xi\rangle\\ &=\min_{\begin{subarray}{c}Zw=\xi\\ \|\Sigma^{-1/2}w\|\leq 2t\end{subarray}}\|\Sigma^{-1/2}w\|.\end{split} (63)

We similarly define the intermediate AO as

ϕ(t):=minΣ1/2w2tmaxλ0λ(H,w+ξw2G2)+Σ1/2w=minξw2G2H,wΣ1/2w2tΣ1/2w.\begin{split}\phi(t):&=\min_{\|\Sigma^{-1/2}w\|\leq 2t}\max_{\lambda\geq 0}\,\lambda\left(\langle H,w\rangle+\|\xi-\|w\|_{2}G\|_{2}\right)+\|\Sigma^{-1/2}w\|\\ &=\min_{\begin{subarray}{c}\|\xi-\|w\|_{2}G\|_{2}\leq\langle-H,w\rangle\\ \|\Sigma^{-1/2}w\|\leq 2t\end{subarray}}\|\Sigma^{-1/2}w\|.\end{split} (64)

Compared to the definition of ϕ\phi, we have H,w\langle-H,w\rangle instead of H,w\langle H,w\rangle, but this difference is negligible because HH is Gaussian. It can be easily seen that the event Φ>t\Phi>t is the same as Φ(t)>t\Phi(t)>t, and the same holds for ϕ\phi and ϕ(t)\phi(t). It is also clear that ϕ(t)ϕr(t)\phi(t)\geq\phi_{r}(t) and we can connect ϕr(t)\phi_{r}(t) with Φr(t)\Phi_{r}(t) by CGMT. It remains to show that Φr(t)Φ(t)\Phi_{r}(t)\to\Phi(t) as rr\to\infty.

By definition, Φr(t)Φs(t)\Phi_{r}(t)\leq\Phi_{s}(t) for rsr\leq s. We consider two cases:

  1.

    Φ(t)=\Phi(t)=\infty, i.e. the minimization problem defining Φ(t)\Phi(t) is infeasible. In this case, we know that for all Σ1/2w2t\|\Sigma^{-1/2}w\|\leq 2t

    Zwξ2>0.\|Zw-\xi\|_{2}>0.

    By compactness, there exists μ=μ(Z,ξ)>0\mu=\mu(Z,\xi)>0 (in particular, independent of rr) such that

    Zwξ2μ.\|Zw-\xi\|_{2}\geq\mu.

    Therefore, considering λ\lambda along the direction of ZwξZw-\xi shows that

    Φr(t)=minΣ1/2w2tmaxλ2rλ,Zwξ+Σ1/2wrμ\Phi_{r}(t)=\min_{\|\Sigma^{-1/2}w\|\leq 2t}\max_{\|\lambda\|_{2}\leq r}\,\langle\lambda,Zw-\xi\rangle+\|\Sigma^{-1/2}w\|\geq r\mu

    so Φr(t)\Phi_{r}(t)\to\infty as rr\to\infty.

  2.

    Otherwise Φ(t)<\Phi(t)<\infty, i.e. the minimization problem defining Φ(t)\Phi(t) is feasible. In this case, we can let w(r)w(r) be an arbitrary minimizer achieving the objective Φr(t)\Phi_{r}(t) for each r0r\geq 0 by compactness. By compactness again, the sequence {w(r)}r=1\{w(r)\}_{r=1}^{\infty} at positive integer values of rr has a subsequential limit w()w(\infty) such that Σ1/2w()2t\|\Sigma^{-1/2}w(\infty)\|\leq 2t. Equivalently, there exists an increasing sequence rnr_{n} such that limnw(rn)=w()\lim_{n\to\infty}w(r_{n})=w(\infty).

    Suppose for the sake of contradiction that Zw()ξZw(\infty)\neq\xi, then by continuity, there exists μ>0\mu>0 and a sufficiently small ϵ>0\epsilon>0 such that for all ww()2ϵ\|w-w(\infty)\|_{2}\leq\epsilon

    Zwξ2μ.\|Zw-\xi\|_{2}\geq\mu.

    This implies that for sufficiently large nn, we have

    Zw(rn)ξ2μ\|Zw(r_{n})-\xi\|_{2}\geq\mu

    and by the same argument as in the previous case

    \Phi_{r_{n}}(t)=\max_{\|\lambda\|_{2}\leq r_{n}}\,\langle\lambda,Zw(r_{n})-\xi\rangle+\|\Sigma^{-1/2}w(r_{n})\|\geq r_{n}\mu

    so Φrn\Phi_{r_{n}}\to\infty, but this is impossible since Φr(t)Φ(t)<\Phi_{r}(t)\leq\Phi(t)<\infty. By contradiction, it must be the case that Zw()=ξZw(\infty)=\xi. By taking λ=0\lambda=0 in the definition of Φr(t)\Phi_{r}(t), we have

    Φrn(t)Σ1/2w(rn).\Phi_{r_{n}}(t)\geq\|\Sigma^{-1/2}w(r_{n})\|.

    By continuity, we show that

    liminfnΦrn(t)limnΣ1/2w(rn)=Σ1/2w()Φ(t).\lim\inf_{n\to\infty}\Phi_{r_{n}}(t)\geq\lim_{n\to\infty}\|\Sigma^{-1/2}w(r_{n})\|=\|\Sigma^{-1/2}w(\infty)\|\geq\Phi(t).

    Since Φrn(t)Φ(t)\Phi_{r_{n}}(t)\leq\Phi(t), the limit of Φrn(t)\Phi_{r_{n}}(t) exists and equals Φ(t)\Phi(t). We can conclude that limrΦr(t)=Φ(t)\lim_{r\to\infty}\Phi_{r}(t)=\Phi(t) because Φr(t)\Phi_{r}(t) is an increasing function of rr.

By the last part of Theorem 9 (the CGMT),

Pr(Φr(t)>t|ξ)2Pr(ϕr(t)t|ξ).\Pr(\Phi_{r}(t)>t\,|\,\xi)\leq 2\Pr(\phi_{r}(t)\geq t\,|\,\xi).

By continuity (from below) of the probability measure, and the fact that Φr(t)\Phi_{r}(t) monotonically increases to Φ(t)\Phi(t) almost surely, we can conclude

Pr(Φ>t|ξ)=Pr(Φ(t)>t|ξ)Pr(rrrΦr(t)>t|ξ)=limrPr(rrΦr(t)>t|ξ)=limrPr(Φr(t)>t|ξ)2limrPr(ϕr(t)t|ξ)2Pr(ϕ(t)t|ξ)=2Pr(ϕt|ξ).\begin{split}\Pr(\Phi>t\,|\,\xi)&=\Pr(\Phi(t)>t\,|\,\xi)\leq\Pr\left(\cup_{r}\cap_{r^{\prime}\geq r}\Phi_{r^{\prime}}(t)>t\,|\,\xi\right)\\ &=\lim_{r\to\infty}\Pr\left(\cap_{r^{\prime}\geq r}\Phi_{r^{\prime}}(t)>t\,|\,\xi\right)=\lim_{r\to\infty}\Pr\left(\Phi_{r}(t)>t\,|\,\xi\right)\\ &\leq 2\lim_{r\to\infty}\Pr(\phi_{r}(t)\geq t\,|\,\xi)\leq 2\Pr(\phi(t)\geq t\,|\,\xi)\\ &=2\Pr(\phi\geq t\,|\,\xi).\qed\end{split}

It remains to analyze the auxiliary problem, which we do in the following lemma:

Lemma 8.

For any covariance splitting Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2}, let PP denote the orthogonal projection matrix onto the column space of Σ2\Sigma_{2}, and let v=argminvΣ21/2HvΣ2v^{*}=\operatorname*{arg\,min}_{v\in\partial\|\Sigma^{1/2}_{2}H\|_{*}}\left\lVert v\right\rVert_{\Sigma_{2}}. Assume that there exist ϵ1,ϵ20\epsilon_{1},\epsilon_{2}\geq 0 such that with probability at least 1δ/21-\delta/2,

vΣ2(1+ϵ1)𝔼vΣ2\left\lVert v^{*}\right\rVert_{\Sigma_{2}}\leq(1+\epsilon_{1})\operatorname*{\mathbb{E}}\left\lVert v^{*}\right\rVert_{\Sigma_{2}} (65)

and

Pv21+ϵ2.\left\lVert Pv^{*}\right\rVert^{2}\leq 1+\epsilon_{2}. (66)

Let

ϵ=8n1/2+28log(32/δ)n+8log(8/δ)r(Σ)+2(1+ϵ1)2nR(Σ2)+2ϵ2.\epsilon=8n^{-1/2}+28\sqrt{\frac{\log(32/\delta)}{n}}+8\sqrt{\frac{\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}+2(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}+2\epsilon_{2}.

If nn and the effective ranks are sufficiently large such that ϵ1\epsilon\leq 1, then with probability at least 1δ1-\delta, it holds that

ϕ2(1+ϵ)σ2n(𝔼Σ21/2H)2\phi^{2}\leq(1+\epsilon)\,\sigma^{2}\frac{n}{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*})^{2}} (67)
Proof.

For notational simplicity, we define

α=2log(32/δ)nρ=1n+2log(16/δ)n.\begin{split}\alpha&=2\sqrt{\frac{\log(32/\delta)}{n}}\\ \rho&=\sqrt{\frac{1}{n}}+2\sqrt{\frac{\log(16/\delta)}{n}}.\end{split}

By a union bound, the following collection of events occurs with probability at least 1δ/21-\delta/2:

  1.

    (Approximate Orthogonality.) By Lemma 1, it holds that

    |ξ,G|<ξ2G2ρ.|\langle\xi,G\rangle|<\left\lVert\xi\right\rVert_{2}\left\lVert G\right\rVert_{2}\rho. (68)
  2.

    (Typical Norm of GG and ξ\xi.) By Lemma 2, it holds that

    α1nG21α-\alpha\leq\frac{1}{\sqrt{n}}\|G\|_{2}-1\leq\alpha (69)

    and

    ασ1nξ2σασ.-\alpha\sigma\leq\frac{1}{\sqrt{n}}\|\xi\|_{2}-\sigma\leq\alpha\sigma. (70)
  3.

    (Typical Norm of Σ21/2H\Sigma_{2}^{1/2}H.) By Theorem 6, it holds that

    Σ21/2H𝔼Σ21/2Hsupu1uΣ22log(8/δ)=(12log(8/δ)r(Σ))𝔼Σ21/2H,\begin{split}\|\Sigma_{2}^{1/2}H\|_{*}&\geq\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}-\sup_{\|u\|\leq 1}\|u\|_{\Sigma_{2}}\sqrt{2\log(8/\delta)}\\ &=\left(1-\sqrt{\frac{2\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}\right)\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*},\end{split} (71)

    because Σ21/2H\|\Sigma_{2}^{1/2}H\|_{*} is a supu1uΣ2\sup_{\|u\|\leq 1}\|u\|_{\Sigma_{2}}-Lipschitz function of HH.

By a change of variables, recall that

ϕ:=minξΣ1/2w2G2H,Σ1/2ww.\phi:=\min_{\left\lVert\xi-\lVert\Sigma^{1/2}w\rVert_{2}G\right\rVert_{2}\leq\langle H,\Sigma^{1/2}w\rangle}\|w\|.

Equations 68, 69 and 70 imply that

ξΣ1/2w2G22=ξ222ξ,GΣ1/2w2+Σ1/2w22G22(1+ρ)(ξ22+Σ1/2w22G22)(1+ρ)(1+α)2n(σ2+Σ1/2w22).\begin{split}\left\lVert\xi-\|\Sigma^{1/2}w\|_{2}G\right\rVert_{2}^{2}&=\|\xi\|_{2}^{2}-2\langle\xi,G\rangle\|\Sigma^{1/2}w\|_{2}+\|\Sigma^{1/2}w\|_{2}^{2}\|G\|_{2}^{2}\\ &\leq(1+\rho)\left(\|\xi\|_{2}^{2}+\|\Sigma^{1/2}w\|_{2}^{2}\|G\|_{2}^{2}\right)\\ &\leq(1+\rho)(1+\alpha)^{2}n(\sigma^{2}+\|\Sigma^{1/2}w\|_{2}^{2}).\end{split}

To upper bound ϕ\phi, it suffices to construct a ww that satisfies the constraint. Consider ww of the form s(Pv)s(Pv^{*}); then Σ1/2w=sΣ21/2v\Sigma^{1/2}w=s\Sigma^{1/2}_{2}v^{*}, since Σ11/2P=0\Sigma_{1}^{1/2}P=0 and Σ21/2(IP)=0\Sigma_{2}^{1/2}(I-P)=0. Plugging in, it suffices to choose ss such that

(1+ρ)(1+α)2n(σ2+s2Σ21/2v22)s2H,Σ21/2v2=s2Σ21/2H2.\begin{split}(1+\rho)(1+\alpha)^{2}n(\sigma^{2}+s^{2}\|\Sigma^{1/2}_{2}v^{*}\|_{2}^{2})&\leq s^{2}\langle H,\Sigma^{1/2}_{2}v^{*}\rangle^{2}=s^{2}\|\Sigma^{1/2}_{2}H\|_{*}^{2}.\end{split}

Solving for ss, we can choose

s2=σ2(Σ21/2H2(1+ρ)(1+α)2nvΣ22)1s^{2}=\sigma^{2}\left(\frac{\|\Sigma^{1/2}_{2}H\|_{*}^{2}}{(1+\rho)(1+\alpha)^{2}n}-\|v^{*}\|_{\Sigma_{2}}^{2}\right)^{-1}

given that it is positive. By (65) and (71), we have

Σ21/2H2(1+ρ)(1+α)2nvΣ22(𝔼Σ21/2H)2(1+ρ)(1+α)2n(12log(8/δ)r(Σ))2(1+ϵ1)2(𝔼vΣ2)2(𝔼Σ21/2H)2n(1(1+ρ)(1+α)2(122log(8/δ)r(Σ))(1+ϵ1)2nR(Σ2)).\begin{split}&\frac{\|\Sigma^{1/2}_{2}H\|_{*}^{2}}{(1+\rho)(1+\alpha)^{2}n}-\|v^{*}\|_{\Sigma_{2}}^{2}\\ \geq\,&\frac{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*})^{2}}{(1+\rho)(1+\alpha)^{2}n}\left(1-\sqrt{\frac{2\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}\right)^{2}-(1+\epsilon_{1})^{2}(\operatorname*{\mathbb{E}}\left\lVert v^{*}\right\rVert_{\Sigma_{2}})^{2}\\ \geq\,&\frac{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*})^{2}}{n}\left(\frac{1}{(1+\rho)(1+\alpha)^{2}}\left(1-2\sqrt{\frac{2\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}\right)-(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}\right).\end{split}

If α<1\alpha<1, then

(1+ρ)(1+α)2=(1+ρ)(1+2α+α2)(1+ρ)(1+3α)=1+3α+ρ+3αρ1+3α+4ρ\begin{split}(1+\rho)(1+\alpha)^{2}&=(1+\rho)(1+2\alpha+\alpha^{2})\\ &\leq(1+\rho)(1+3\alpha)=1+3\alpha+\rho+3\alpha\rho\\ &\leq 1+3\alpha+4\rho\\ \end{split}

and using the inequality (1+x)11x(1+x)^{-1}\geq 1-x, we show

1(1+ρ)(1+α)21((1+ρ)(1+α)21)1(3α+4ρ).\begin{split}\frac{1}{(1+\rho)(1+\alpha)^{2}}&\geq 1-((1+\rho)(1+\alpha)^{2}-1)\\ &\geq 1-(3\alpha+4\rho).\\ \end{split}

Therefore, we can conclude that

1(1+ρ)(1+α)2(122log(8/δ)r(Σ))(1+ϵ1)2nR(Σ2)(1(3α+4ρ))(122log(8/δ)r(Σ))(1+ϵ1)2nR(Σ2)1(3α+4ρ)22log(8/δ)r(Σ)(1+ϵ1)2nR(Σ2)1ϵ\begin{split}&\frac{1}{(1+\rho)(1+\alpha)^{2}}\left(1-2\sqrt{\frac{2\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}\right)-(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}\\ \geq\,&(1-(3\alpha+4\rho))\left(1-2\sqrt{\frac{2\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}\right)-(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}\\ \geq\,&1-(3\alpha+4\rho)-2\sqrt{\frac{2\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}-(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}\geq 1-\epsilon^{\prime}\\ \end{split}

where we define

ϵ=4n1/2+14log(32/δ)n+4log(8/δ)r(Σ)+(1+ϵ1)2nR(Σ2).\epsilon^{\prime}=4n^{-1/2}+14\sqrt{\frac{\log(32/\delta)}{n}}+4\sqrt{\frac{\log(8/\delta)}{r_{\|\cdot\|}(\Sigma)}}+(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}.

Provided that ϵ1/2\epsilon^{\prime}\leq 1/2 (which also guarantees that α<1\alpha<1 and our definition of s2s^{2} is sensible), we can use the inequality (1x)11+2x(1-x)^{-1}\leq 1+2x for x[0,1/2]x\in[0,1/2] to show that

s2σ2n(𝔼Σ21/2H)211ϵ(1+2ϵ)σ2n(𝔼Σ21/2H)2s^{2}\leq\sigma^{2}\frac{n}{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*})^{2}}\frac{1}{1-\epsilon^{\prime}}\leq(1+2\epsilon^{\prime})\,\sigma^{2}\frac{n}{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*})^{2}}

and thus by (66)

ϕ2s2Pv2(1+ϵ2)(1+2ϵ)σ2n(𝔼Σ21/2H)2(1+ϵ)σ2n(𝔼Σ21/2H)2\phi^{2}\leq s^{2}\|Pv^{*}\|^{2}\leq(1+\epsilon_{2})(1+2\epsilon^{\prime})\,\sigma^{2}\frac{n}{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*})^{2}}\leq(1+\epsilon)\,\sigma^{2}\frac{n}{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*})^{2}}

with ϵ=2ϵ+2ϵ2\epsilon=2\epsilon^{\prime}+2\epsilon_{2}. ∎
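
The feasible point constructed in this proof is easy to probe numerically in the simplest setting. The following is a minimal sketch (ours, not part of the formal argument), assuming the Euclidean norm, a single diagonal block Σ = Σ₂, and numpy: it draws G, H, ξ, takes the direction v* = Σ^{1/2}H / ||Σ^{1/2}H||₂ as in the proof, solves the resulting quadratic inequality for the smallest feasible scale s, and compares the implied upper bound on ϕ with the benchmark σ√(n/Tr(Σ)).

```python
# Illustrative sketch (not from the paper): the auxiliary-problem construction for the l2 norm.
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 20000, 1.0
lam = np.full(d, 1.0 / d)            # diagonal Sigma with Tr(Sigma) = 1 and many tiny eigenvalues

G = rng.standard_normal(n)
xi = sigma * rng.standard_normal(n)
H = rng.standard_normal(d)

sqrt_lam_H = np.sqrt(lam) * H                        # Sigma^{1/2} H
v_star = sqrt_lam_H / np.linalg.norm(sqrt_lam_H)     # v* for the Euclidean norm (||v*||_2 = 1)
v_norm = np.linalg.norm(np.sqrt(lam) * v_star)       # ||v*||_Sigma

# Feasibility of w = s * v* in the auxiliary problem:
#   ||xi - s * ||v*||_Sigma * G||_2 <= s * ||Sigma^{1/2} H||_2,
# i.e. the quadratic inequality  a*s^2 + 2*beta*s - c >= 0  in s >= 0.
a = np.linalg.norm(sqrt_lam_H) ** 2 - v_norm ** 2 * np.linalg.norm(G) ** 2
beta = v_norm * np.dot(xi, G)
c = np.linalg.norm(xi) ** 2
assert a > 0, "need ||Sigma^{1/2} H||_2^2 to dominate ||v*||_Sigma^2 ||G||_2^2"
s = (-beta + np.sqrt(beta ** 2 + a * c)) / a   # smallest feasible scale; phi <= s since ||v*||_2 = 1

print("feasible scale s (upper bound on phi) :", s)
print("benchmark sigma * sqrt(n / Tr(Sigma)) :", sigma * np.sqrt(n / lam.sum()))
```
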

Finally, we are ready to prove our general norm bound.

See 4

Proof.

By Lemmas 6 and 7, we show that for any tt

Pr(w^>t)Pr(Φ>tw)2Pr(ϕtw).\Pr(\left\lVert\hat{w}\right\rVert>t)\leq\Pr(\Phi>t-\left\lVert w^{*}\right\rVert)\leq 2\Pr(\phi\geq t-\left\lVert w^{*}\right\rVert).

By Lemma 8, the above is upper bounded by δ\delta if we set twt-\left\lVert w^{*}\right\rVert according to (67) with δ\delta replaced by δ/2\delta/2. Moving w\left\lVert w^{*}\right\rVert to the other side concludes the proof. ∎

C.2 Special Case: Euclidean Norm

Lemma 9.

For any covariance matrix Σ\Sigma, it holds that

(𝔼Σ1/2H2)2(11r(Σ))Tr(Σ)\left(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{2}\right)^{2}\geq\left(1-\frac{1}{r(\Sigma)}\right)\operatorname{Tr}(\Sigma) (72)

and

1Tr(Σ)(18r(Σ))𝔼[1HTΣH].\frac{1}{\operatorname{Tr}(\Sigma)}\geq\left(1-\sqrt{\frac{8}{r(\Sigma)}}\right)\operatorname*{\mathbb{E}}\left[\frac{1}{H^{T}\Sigma H}\right]. (73)

As a result, it holds that

r(Σ)1r2(Σ)r(Σ)r(\Sigma)-1\leq r_{\|\cdot\|_{2}}(\Sigma)\leq r(\Sigma) (74)

and

14r(Σ)R2(Σ)R(Σ)(18r(Σ2))1.1-\frac{4}{\sqrt{r(\Sigma)}}\leq\frac{R_{\|\cdot\|_{2}}(\Sigma)}{R(\Sigma)}\leq\left(1-\sqrt{\frac{8}{r(\Sigma^{2})}}\right)^{-1}. (75)
Proof.

Observe that if f(H)=Σ1/2H2f(H)=\|\Sigma^{1/2}H\|_{2}, then it can easily be checked that

f22=ΣH22Σ1/2H22Σ𝑜𝑝\|\nabla f\|_{2}^{2}=\frac{\|\Sigma H\|^{2}_{2}}{\|\Sigma^{1/2}H\|_{2}^{2}}\leq\|\Sigma\|_{\mathit{op}}

and so by the Gaussian Poincaré inequality [54, Corollary 2.27], we have

Tr(Σ)=𝔼Σ1/2H22=(𝔼Σ1/2H2)2+VarΣ1/2H2(𝔼Σ1/2H2)2+Σ𝑜𝑝=(𝔼Σ1/2H2)2+Tr(Σ)r(Σ).\begin{split}\operatorname{Tr}(\Sigma)=\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{2}^{2}&=(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{2})^{2}+\operatorname*{\mathrm{Var}}\|\Sigma^{1/2}H\|_{2}\\ &\leq(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{2})^{2}+\|\Sigma\|_{\mathit{op}}\\ &=(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{2})^{2}+\frac{\operatorname{Tr}(\Sigma)}{r(\Sigma)}.\end{split}

Rearranging the terms proves (72). To prove (73), without loss of generality assume that Σ\Sigma is diagonal, with diagonal entries λ1λ2λd\lambda_{1}\geq\lambda_{2}\geq...\geq\lambda_{d}. Observe that for any integer ν>2\nu>2, we can pad Σ\Sigma with 0’s such that ν\nu divides dd, and we have

HTΣH=i=1dλiHi2i=1d/νλνi(Hν(i1)+12++Hνi2).H^{T}\Sigma H=\sum_{i=1}^{d}\lambda_{i}H_{i}^{2}\geq\sum_{i=1}^{d/\nu}\lambda_{\nu i}(H_{\nu(i-1)+1}^{2}+...+H_{\nu i}^{2}).

By Jensen’s inequality, 1/𝔼[X]𝔼[1/X]1/\operatorname*{\mathbb{E}}[X]\leq\operatorname*{\mathbb{E}}[1/X]; it follows that

𝔼[1HTΣH]𝔼[1i=1d/νλνi(Hν(i1)+12++Hνi2)]=1j=1d/νλνj𝔼[1i=1d/νλνij=1d/νλνj(Hν(i1)+12++Hνi2)]1j=1d/νλνj𝔼[i=1d/νλνij=1d/νλνj1Hν(i1)+12++Hνi2]=1j=1d/νλνj1ν2.\begin{split}\operatorname*{\mathbb{E}}\left[\frac{1}{H^{T}\Sigma H}\right]&\leq\operatorname*{\mathbb{E}}\left[\frac{1}{\sum_{i=1}^{d/\nu}\lambda_{\nu i}(H_{\nu(i-1)+1}^{2}+...+H_{\nu i}^{2})}\right]\\ &=\frac{1}{\sum_{j=1}^{d/\nu}\lambda_{\nu j}}\operatorname*{\mathbb{E}}\left[\frac{1}{\sum_{i=1}^{d/\nu}\frac{\lambda_{\nu i}}{\sum_{j=1}^{d/\nu}\lambda_{\nu j}}(H_{\nu(i-1)+1}^{2}+...+H_{\nu i}^{2})}\right]\\ &\leq\frac{1}{\sum_{j=1}^{d/\nu}\lambda_{\nu j}}\operatorname*{\mathbb{E}}\left[\sum_{i=1}^{d/\nu}\frac{\lambda_{\nu i}}{\sum_{j=1}^{d/\nu}\lambda_{\nu j}}\frac{1}{H_{\nu(i-1)+1}^{2}+...+H_{\nu i}^{2}}\right]\\ &=\frac{1}{\sum_{j=1}^{d/\nu}\lambda_{\nu j}}\frac{1}{\nu-2}.\end{split}

In the last equality, we use the fact that for each ii the random variable (Hν(i1)+12++Hνi2)1\left(H_{\nu(i-1)+1}^{2}+...+H_{\nu i}^{2}\right)^{-1} follows an inverse Chi-square distribution with ν\nu degrees of freedom; its expectation is (ν2)1(\nu-2)^{-1}. In addition, notice that

νΣ𝑜𝑝+νi=1d/νλνi(λ1++λν)+i=1d/ν1(λνi+1++λν(i+1))=Tr(Σ).\nu\|\Sigma\|_{\mathit{op}}+\nu\sum_{i=1}^{d/\nu}\lambda_{\nu i}\geq(\lambda_{1}+...+\lambda_{\nu})+\sum_{i=1}^{d/\nu-1}(\lambda_{\nu i+1}+...+\lambda_{\nu(i+1)})=\operatorname{Tr}(\Sigma).

Plugging the above estimate into our upper bound shows that, for any integer ν>2\nu>2,

𝔼[1HTΣH]1Tr(Σ)νΣ𝑜𝑝νν2=1Tr(Σ)(1νr(Σ)2ν+2r(Σ))1.\operatorname*{\mathbb{E}}\left[\frac{1}{H^{T}\Sigma H}\right]\leq\frac{1}{\operatorname{Tr}(\Sigma)-\nu\|\Sigma\|_{\mathit{op}}}\frac{\nu}{\nu-2}=\frac{1}{\operatorname{Tr}(\Sigma)}\left(1-\frac{\nu}{r(\Sigma)}-\frac{2}{\nu}+\frac{2}{r(\Sigma)}\right)^{-1}.

We can show (73) by choosing ν=(2r(Σ))1/2\nu=\lceil(2r(\Sigma))^{1/2}\rceil:

𝔼[1HTΣH]1Tr(Σ)(18r(Σ))1.\operatorname*{\mathbb{E}}\left[\frac{1}{H^{T}\Sigma H}\right]\leq\frac{1}{\operatorname{Tr}(\Sigma)}\left(1-\sqrt{\frac{8}{r(\Sigma)}}\right)^{-1}.

It remains to verify (74) and (75). By (72), we can check

r2(Σ)=(𝔼Σ1/2H2)2Σ𝑜𝑝(11r(Σ))Tr(Σ)Σ𝑜𝑝=r(Σ)1.r_{\|\cdot\|_{2}}(\Sigma)=\frac{(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{2})^{2}}{\|\Sigma\|_{\mathit{op}}}\geq\left(1-\frac{1}{r(\Sigma)}\right)\frac{\operatorname{Tr}(\Sigma)}{\|\Sigma\|_{\mathit{op}}}=r(\Sigma)-1.

The other direction r(Σ)r2(Σ)r(\Sigma)\geq r_{\|\cdot\|_{2}}(\Sigma) follows directly from an application of the Cauchy-Schwarz inequality. By Jensen’s inequality 1/𝔼[X]𝔼[1/X]1/\operatorname*{\mathbb{E}}[X]\leq\operatorname*{\mathbb{E}}[1/X] and the Cauchy-Schwarz inequality, we show

1Tr(Σ)(𝔼1ΣH22)1(𝔼Σ1/2H2ΣH2)2(𝔼ΣH2Σ1/2H2)2Tr(Σ2)𝔼1Σ1/2H22.\frac{1}{\operatorname{Tr}(\Sigma)}\left(\operatorname*{\mathbb{E}}\frac{1}{\|\Sigma H\|_{2}^{2}}\right)^{-1}\leq\left(\operatorname*{\mathbb{E}}\frac{\|\Sigma^{1/2}H\|_{2}}{\|\Sigma H\|_{2}}\right)^{-2}\leq\left(\operatorname*{\mathbb{E}}\frac{\|\Sigma H\|_{2}}{\|\Sigma^{1/2}H\|_{2}}\right)^{2}\leq\operatorname{Tr}(\Sigma^{2})\operatorname*{\mathbb{E}}\frac{1}{\|\Sigma^{1/2}H\|_{2}^{2}}.

Recall that R2(Σ)=(𝔼Σ1/2H2)2(𝔼ΣH2Σ1/2H2)2R_{\|\cdot\|_{2}}(\Sigma)=(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{2})^{2}\left(\operatorname*{\mathbb{E}}\frac{\|\Sigma H\|_{2}}{\|\Sigma^{1/2}H\|_{2}}\right)^{-2}. By Cauchy-Schwarz inequality and (73), it follows that

R2(Σ)Tr(Σ)2(𝔼1ΣH22)(18r(Σ2))1R(Σ)R_{\|\cdot\|_{2}}(\Sigma)\leq\operatorname{Tr}(\Sigma)^{2}\left(\operatorname*{\mathbb{E}}\frac{1}{\|\Sigma H\|_{2}^{2}}\right)\leq\left(1-\sqrt{\frac{8}{r(\Sigma^{2})}}\right)^{-1}R(\Sigma)

and also by (72)

R2(Σ)(11r(Σ))Tr(Σ)Tr(Σ2)(𝔼1Σ1/2H22)1(11r(Σ))(18r(Σ))R(Σ)(14r(Σ))R(Σ).\begin{split}R_{\|\cdot\|_{2}}(\Sigma)&\geq\left(1-\frac{1}{r(\Sigma)}\right)\frac{\operatorname{Tr}(\Sigma)}{\operatorname{Tr}(\Sigma^{2})}\left(\operatorname*{\mathbb{E}}\frac{1}{\|\Sigma^{1/2}H\|_{2}^{2}}\right)^{-1}\\ &\geq\left(1-\frac{1}{r(\Sigma)}\right)\left(1-\sqrt{\frac{8}{r(\Sigma)}}\right)R(\Sigma)\geq\left(1-\frac{4}{\sqrt{r(\Sigma)}}\right)R(\Sigma).\qed\end{split}
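
As a sanity check on (74) and (75), the Euclidean effective ranks can be estimated by Monte Carlo. This is an illustrative sketch only (ours), assuming a polynomially decaying spectrum and using numpy; the dimension and number of draws are arbitrary.

```python
# Illustrative sketch (not from the paper): Monte Carlo estimates of r_2(Sigma) and R_2(Sigma).
import numpy as np

rng = np.random.default_rng(1)
d, m = 4000, 1000                            # dimension and number of Monte Carlo draws
lam = np.arange(1, d + 1) ** -0.25           # eigenvalues of Sigma (WLOG diagonal)

H = rng.standard_normal((m, d))
norm_half = np.sqrt(H ** 2 @ lam)            # ||Sigma^{1/2} H||_2 for each draw
norm_full = np.sqrt(H ** 2 @ lam ** 2)       # ||Sigma H||_2 for each draw

r = lam.sum() / lam.max()                    # r(Sigma)
R = lam.sum() ** 2 / (lam ** 2).sum()        # R(Sigma)
r2 = norm_half.mean() ** 2 / lam.max()       # r_{||.||_2}(Sigma)
R2 = norm_half.mean() ** 2 / (norm_full / norm_half).mean() ** 2   # R_{||.||_2}(Sigma)

print(f"r(Sigma) = {r:8.1f}   r_2(Sigma) = {r2:8.1f}   (compare (74))")
print(f"R(Sigma) = {R:8.1f}   R_2(Sigma) = {R2:8.1f}   (compare (75))")
```
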
Lemma 10.

For any covariance matrix Σ\Sigma, it holds that with probability at least 1δ1-\delta,

1Σ1/2H22Tr(Σ)log(4/δ)R(Σ)1-\frac{\|\Sigma^{1/2}H\|_{2}^{2}}{\operatorname{Tr}(\Sigma)}\lesssim\frac{\log(4/\delta)}{\sqrt{R(\Sigma)}} (76)

and

ΣH22log(4/δ)Tr(Σ2).\|\Sigma H\|_{2}^{2}\lesssim\log(4/\delta)\operatorname{Tr}(\Sigma^{2}). (77)

Therefore, provided that R(Σ)log(4/δ)2R(\Sigma)\gtrsim\log(4/\delta)^{2}, it holds that

(ΣH2Σ1/2H2)2log(4/δ)Tr(Σ2)Tr(Σ).\left(\frac{\|\Sigma H\|_{2}}{\|\Sigma^{1/2}H\|_{2}}\right)^{2}\lesssim\log(4/\delta)\frac{\operatorname{Tr}(\Sigma^{2})}{\operatorname{Tr}(\Sigma)}. (78)
Proof.

Because we are considering the 2\ell_{2} norm and HH is standard Gaussian, we can assume without loss of generality that Σ\Sigma is diagonal, and we denote the diagonal entries of Σ\Sigma by λ1,,λd\lambda_{1},...,\lambda_{d}. By the sub-exponential Bernstein inequality [60, Corollary 2.8.3], we have with probability at least 1δ/21-\delta/2

|Σ1/2H22Tr(Σ)1|=|i=1dλijλj(Hi21)|log(4/δ)R(Σ)log(4/δ)r(Σ)log(4/δ)R(Σ)\left|\frac{\|\Sigma^{1/2}H\|_{2}^{2}}{\operatorname{Tr}(\Sigma)}-1\right|=\left|\sum_{i=1}^{d}\frac{\lambda_{i}}{\sum_{j}\lambda_{j}}(H_{i}^{2}-1)\right|\lesssim\sqrt{\frac{\log(4/\delta)}{R(\Sigma)}}\vee\frac{\log(4/\delta)}{r(\Sigma)}\leq\frac{\log(4/\delta)}{\sqrt{R(\Sigma)}}

where the last inequality uses that R(Σ)r(Σ)2R(\Sigma)\leq r(\Sigma)^{2}, shown in Lemma 5 of [66]. Using the sub-exponential Bernstein inequality again, we show with probability at least 1δ/21-\delta/2

|ΣH22Tr(Σ2)1|log(4/δ)R(Σ2)log(4/δ)r(Σ2)\left|\frac{\|\Sigma H\|_{2}^{2}}{\operatorname{Tr}(\Sigma^{2})}-1\right|\lesssim\sqrt{\frac{\log(4/\delta)}{R(\Sigma^{2})}}\vee\frac{\log(4/\delta)}{r(\Sigma^{2})}

From Lemma 5 of [66], we know that the effective ranks are at least 1. This implies

ΣH22log(4/δ)Tr(Σ2).\|\Sigma H\|_{2}^{2}\lesssim\log(4/\delta)\operatorname{Tr}(\Sigma^{2}).

Provided that R(Σ)log(4/δ)2R(\Sigma)\gtrsim\log(4/\delta)^{2}, we have

Σ1/2H2212Tr(Σ)\|\Sigma^{1/2}H\|_{2}^{2}\geq\frac{1}{2}\operatorname{Tr}(\Sigma)

in which case it holds that

ΣH22Σ1/2H22log(4/δ)Tr(Σ2)Tr(Σ).\frac{\|\Sigma H\|_{2}^{2}}{\|\Sigma^{1/2}H\|_{2}^{2}}\lesssim\log(4/\delta)\frac{\operatorname{Tr}(\Sigma^{2})}{\operatorname{Tr}(\Sigma)}.\qed

See 2

Proof.

To apply Theorem 4, it is clear that v=Σ21/2HΣ21/2H2v^{*}=\frac{\Sigma_{2}^{1/2}H}{\|\Sigma_{2}^{1/2}H\|_{2}} and so vΣ2=Σ2H2Σ21/2H2\|v^{*}\|_{\Sigma_{2}}=\frac{\|\Sigma_{2}H\|_{2}}{\|\Sigma_{2}^{1/2}H\|_{2}}. By (78), it suffices to pick ϵ1\epsilon_{1} such that for some constant c>0c>0

(1+ϵ1)𝔼vΣ2=clog(16/δ)Tr(Σ22)Tr(Σ2).(1+\epsilon_{1})\operatorname*{\mathbb{E}}\left\lVert v^{*}\right\rVert_{\Sigma_{2}}=c\sqrt{\log(16/\delta)\frac{\operatorname{Tr}(\Sigma_{2}^{2})}{\operatorname{Tr}(\Sigma_{2})}}.

By (72) of Lemma 9, for sufficiently large effective rank, it holds that (𝔼Σ21/2H2)2Tr(Σ2)\left(\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{2}\right)^{2}\gtrsim\operatorname{Tr}(\Sigma_{2}) and so

(1+ϵ1)2nR2(Σ2)=n(1+ϵ1)2(𝔼vΣ2)2(𝔼Σ21/2H2)2nlog(16/δ)Tr(Σ22)Tr(Σ2)2=nlog(16/δ)R(Σ2).(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|_{2}}(\Sigma_{2})}=n\frac{(1+\epsilon_{1})^{2}(\operatorname*{\mathbb{E}}\|v^{*}\|_{\Sigma_{2}})^{2}}{\left(\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{2}\right)^{2}}\lesssim n\log(16/\delta)\frac{\operatorname{Tr}(\Sigma_{2}^{2})}{\operatorname{Tr}(\Sigma_{2})^{2}}=\frac{n\log(16/\delta)}{R(\Sigma_{2})}.

Furthermore, it suffices to let ϵ2=0\epsilon_{2}=0 because PP is an orthogonal projection, so Pv2v2=1\|Pv^{*}\|_{2}\leq\|v^{*}\|_{2}=1. Combined with (74) of Lemma 9, we show

ϵlog(1/δ)n+log(1/δ)r(Σ2)+nlog(1/δ)R(Σ2).\epsilon\lesssim\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\log(1/\delta)}{r(\Sigma_{2})}}+\frac{n\log(1/\delta)}{R(\Sigma_{2})}.

Finally, using the inequality (1x)11+2x(1-x)^{-1}\leq 1+2x for x[0,1/2]x\in[0,1/2] and (72) of Lemma 9 again, we can conclude

(1+ϵ)1/2σn𝔼Σ21/2H2(1+ϵ)1/2(11r(Σ2))1/2σnTr(Σ2)(1+2ϵ+2r(Σ2))1/2σnTr(Σ2)\begin{split}(1+\epsilon)^{1/2}\,\sigma\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{2}}&\leq(1+\epsilon)^{1/2}\left(1-\frac{1}{r(\Sigma_{2})}\right)^{-1/2}\sigma\sqrt{\frac{n}{\operatorname{Tr}(\Sigma_{2})}}\\ &\leq\left(1+2\epsilon+\frac{2}{r(\Sigma_{2})}\right)^{1/2}\sigma\sqrt{\frac{n}{\operatorname{Tr}(\Sigma_{2})}}\\ \end{split}

and we can replace ϵ\epsilon with

ϵ=2ϵ+2r(Σ2)log(1/δ)n+log(1/δ)r(Σ2)+nlog(1/δ)R(Σ2).\epsilon^{\prime}=2\epsilon+\frac{2}{r(\Sigma_{2})}\lesssim\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\log(1/\delta)}{r(\Sigma_{2})}}+\frac{n\log(1/\delta)}{R(\Sigma_{2})}.
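
The resulting Euclidean norm bound is easy to visualize in simulation. Below is a small sketch (ours, illustrative only), assuming a flat, benign diagonal covariance with no Σ₁ component (so Σ₂ = Σ) and using numpy; it computes the minimum ℓ2-norm interpolator through the Gram matrix and compares its norm with ||w*||₂ + σ√(n/Tr(Σ₂)).

```python
# Illustrative sketch (not from the paper): the Euclidean norm bound for the min-norm interpolator.
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 100, 20000, 1.0
lam = np.full(d, 10.0 / d)             # benign spectrum: Tr(Sigma) = 10, R(Sigma) = d >> n

w_star = np.zeros(d)
w_star[0] = 1.0                        # ||w*||_2 = 1

X = rng.standard_normal((n, d)) * np.sqrt(lam)    # rows x_i ~ N(0, Sigma)
Y = X @ w_star + sigma * rng.standard_normal(n)

w_hat = X.T @ np.linalg.solve(X @ X.T, Y)         # minimum l2-norm interpolator
print("interpolation residual             :", np.linalg.norm(X @ w_hat - Y))
print("||w_hat||_2                        :", np.linalg.norm(w_hat))
print("||w*||_2 + sigma*sqrt(n/Tr(Sigma)) :",
      np.linalg.norm(w_star) + sigma * np.sqrt(n / lam.sum()))
```
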

Appendix D Benign Overfitting

In this section, we will combine results from the previous two sections to study when interpolators are consistent.

D.1 General Norm

See 5

Proof.

By Theorem 4, if we choose

B=w+(1+ϵ)1/2σn𝔼Σ21/2HB=\left\lVert w^{*}\right\rVert+(1+\epsilon)^{1/2}\,\sigma\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*}}

then with large probability, {w:wB}\{w:\|w\|\leq B\} has non-empty intersection with {w:Xw=Y}\{w:Xw=Y\}, and this intersection contains the minimal norm interpolator w^\hat{w}. Also, it is clear that B>wB>\|w^{*}\| and so by Corollary 3, it holds that

L(w^)supwB,L^(w)=0L(w)(1+γ)(w+(1+ϵ)1/2σn𝔼Σ21/2H)2(𝔼Σ21/2H)2n=(1+γ)(w𝔼Σ21/2Hn+(1+ϵ)1/2σ)2(1+γ)(1+ϵ)(σ+w𝔼Σ21/2Hn)2.\begin{split}L(\hat{w})&\leq\sup_{\|w\|\leq B,\hat{L}(w)=0}L(w)\\ &\leq(1+\gamma)\left(\left\lVert w^{*}\right\rVert+(1+\epsilon)^{1/2}\,\sigma\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*}}\right)^{2}\frac{\left(\operatorname*{\mathbb{E}}\lVert\Sigma^{1/2}_{2}H\rVert_{*}\right)^{2}}{n}\\ &=(1+\gamma)\left(\left\lVert w^{*}\right\rVert\frac{\operatorname*{\mathbb{E}}\lVert\Sigma^{1/2}_{2}H\rVert_{*}}{\sqrt{n}}+(1+\epsilon)^{1/2}\,\sigma\right)^{2}\\ &\leq(1+\gamma)(1+\epsilon)\left(\sigma+\|w^{*}\|\frac{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}}{\sqrt{n}}\right)^{2}.\\ \end{split}

Theorem 11 (Sufficient conditions).

Under the model assumptions in (1), let \left\lVert\cdot\right\rVert be an arbitrary norm. Suppose that as nn goes to \infty, there exists a sequence of covariance splits Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2} such that the following properties hold:

  1.

    (Small large-variance dimension.)

    limnrank(Σ1)n=0.\lim_{n\to\infty}\frac{\operatorname{rank}(\Sigma_{1})}{n}=0. (79)
  2.

    (Large effective dimension.)

    limn1r(Σ2)=0andlimnnR(Σ2)=0.\lim_{n\to\infty}\frac{1}{r_{\|\cdot\|}(\Sigma_{2})}=0\quad\text{and}\quad\lim_{n\to\infty}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}=0. (80)
  3.

    (No aliasing condition.)

    limnw𝔼Σ21/2Hn=0.\lim_{n\to\infty}\frac{\left\lVert w^{*}\right\rVert\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{*}}{\sqrt{n}}=0. (81)
  4.

    (Contracting 2\ell_{2} projection condition.) With the same definition of PP and vv^{*} as in Theorem 4, it holds that for any η>0\eta>0,

    limnPr(Pv2>1+η)=0.\lim_{n\to\infty}\Pr(\|Pv^{*}\|^{2}>1+\eta)=0. (82)

Then L(w^)L(\hat{w}) converges to σ2\sigma^{2} in probability. In other words, minimum norm interpolation is consistent.

Proof.

Fix any η>0\eta>0. For sufficiently small γ,ϵ\gamma,\epsilon and w𝔼Σ21/2Hn\|w^{*}\|\frac{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}}{\sqrt{n}}, it is clear that

(1+γ)(1+ϵ)(σ+w𝔼Σ21/2Hn)2σ2η.(1+\gamma)(1+\epsilon)\left(\sigma+\|w^{*}\|\frac{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}}{\sqrt{n}}\right)^{2}-\sigma^{2}\leq\eta. (83)

For any δ>0\delta>0, by the definition of γ\gamma in Corollary 3 and our assumptions, the terms γ\gamma and w𝔼Σ21/2Hn\|w^{*}\|\frac{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}}{\sqrt{n}} can be made arbitrarily small for large enough nn. Also by our assumption, ϵ2\epsilon_{2} in the definition of ϵ\epsilon in Theorem 4 can be arbitrarily small. Note that

nR(Σ2)=𝔼[vΣ2𝔼Σ21/2H/n]\sqrt{\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}}=\operatorname*{\mathbb{E}}\left[\frac{\|v^{*}\|_{\Sigma_{2}}}{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}/\sqrt{n}}\right]

converges to 0 by assumption. Then by Markov’s inequality, for any η>0\eta^{\prime}>0, it holds that for all sufficiently large nn

Pr(vΣ2𝔼Σ21/2H/n>η)<δ\Pr\left(\frac{\|v^{*}\|_{\Sigma_{2}}}{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}/\sqrt{n}}>\sqrt{\eta^{\prime}}\right)<\delta

and we can pick

(1+ϵ1)𝔼vΣ2=η𝔼Σ21/2Hn.(1+\epsilon_{1})\operatorname*{\mathbb{E}}\left\lVert v^{*}\right\rVert_{\Sigma_{2}}=\sqrt{\eta^{\prime}}\frac{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}}{\sqrt{n}}.

This implies that

(1+ϵ1)2nR(Σ2)=n(𝔼Σ21/2H)2((1+ϵ1)𝔼vΣ2)2=η.(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|}(\Sigma_{2})}=\frac{n}{\left(\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}\right)^{2}}\left((1+\epsilon_{1})\operatorname*{\mathbb{E}}\|v^{*}\|_{\Sigma_{2}}\right)^{2}=\eta^{\prime}.

By Theorem 5, for all nn large enough that γ,ϵ\gamma,\epsilon and w𝔼Σ21/2Hn\|w^{*}\|\frac{\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{*}}{\sqrt{n}} are small enough for (83) to hold, it holds that

Pr(|L(w^)σ2|>η)δ.\Pr(\,|L(\hat{w})-\sigma^{2}|>\eta\,)\leq\delta.

As a result, we show limnPr(|L(w^)σ2|>η)δ\lim_{n\to\infty}\Pr(\,|L(\hat{w})-\sigma^{2}|>\eta\,)\leq\delta for any δ>0\delta>0. To summarize, for any fixed η>0\eta>0, we have

limnPr(|L(w^)σ2|>η)=0\lim_{n\to\infty}\Pr(\,|L(\hat{w})-\sigma^{2}|>\eta\,)=0

and so L(w^)L(\hat{w}) converges to σ2\sigma^{2} in probability. ∎

D.2 Euclidean Norm

See 3

Proof.

The proof follows the same strategy as Theorem 5. By Theorem 2, if we choose

B=w2+(1+ϵ)1/2σnTr(Σ2),B=\|w^{*}\|_{2}+(1+\epsilon)^{1/2}\,\sigma\sqrt{\frac{n}{\operatorname{Tr}(\Sigma_{2})}},

then with large probability, {w:w2B}\{w:\|w\|_{2}\leq B\} has non-empty intersection with {w:Xw=Y}\{w:Xw=Y\}. This intersection necessarily contains the minimal norm interpolator w^\hat{w}.

Also, it is clear that B>wB>\|w^{*}\| and so by Corollary 2, it holds that

L(w^)supw2B,L^(w)=0L(w)(1+γ)(w2+(1+ϵ)1/2σnTr(Σ2))2Tr(Σ2)n=(1+γ)(w2Tr(Σ2)n+(1+ϵ)1/2σ)2(1+γ)(1+ϵ)(σ+w2Tr(Σ2)n)2.\begin{split}L(\hat{w})&\leq\sup_{\|w\|_{2}\leq B,\hat{L}(w)=0}L(w)\\ &\leq(1+\gamma)\left(\|w^{*}\|_{2}+(1+\epsilon)^{1/2}\,\sigma\sqrt{\frac{n}{\operatorname{Tr}(\Sigma_{2})}}\right)^{2}\frac{\operatorname{Tr}(\Sigma_{2})}{n}\\ &=(1+\gamma)\left(\left\lVert w^{*}\right\rVert_{2}\sqrt{\frac{\operatorname{Tr}(\Sigma_{2})}{n}}+(1+\epsilon)^{1/2}\,\sigma\right)^{2}\\ &\leq(1+\gamma)(1+\epsilon)\left(\sigma+\|w^{*}\|_{2}\sqrt{\frac{\operatorname{Tr}(\Sigma_{2})}{n}}\right)^{2}.\qed\end{split}
Theorem 12 (Sufficient conditions).

Under the model assumptions in (1), let w^\hat{w} be the minimal 2\ell_{2} norm interpolator. Suppose that as nn goes to \infty, there exists a sequence of covariance splits Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2} such that the following conditions hold:

  1.

    (Small large-variance dimension.)

    limnrank(Σ1)n=0.\lim_{n\to\infty}\frac{\operatorname{rank}(\Sigma_{1})}{n}=0. (84)
  2.

    (Large effective dimension.)

    limnnR(Σ2)=0.\lim_{n\to\infty}\frac{n}{R(\Sigma_{2})}=0. (85)
  3.

    (No aliasing condition.)

    limnw2𝔼Σ21/2H2n=0.\lim_{n\to\infty}\frac{\left\lVert w^{*}\right\rVert_{2}\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{2}}{\sqrt{n}}=0. (86)

Then L(w^)L(\hat{w}) converges to σ2\sigma^{2} in probability. In other words, minimum 2\ell_{2} norm interpolation is consistent.

Proof.

Fix any η>0\eta>0. For sufficiently small γ,ϵ\gamma,\epsilon and w2Tr(Σ2)n\|w^{*}\|_{2}\sqrt{\frac{\operatorname{Tr}(\Sigma_{2})}{n}}, it is clear that

(1+γ)(1+ϵ)(σ+w2Tr(Σ2)n)2σ2η.(1+\gamma)(1+\epsilon)\left(\sigma+\|w^{*}\|_{2}\sqrt{\frac{\operatorname{Tr}(\Sigma_{2})}{n}}\right)^{2}-\sigma^{2}\leq\eta. (87)

From Lemma 5 of [66], it holds that R(Σ2)r(Σ2)2R(\Sigma_{2})\leq r(\Sigma_{2})^{2}, and so the condition R(Σ2)=ω(n)R(\Sigma_{2})=\omega(n) implies that r(Σ2)=ω(n)=ω(1)r(\Sigma_{2})=\omega(\sqrt{n})=\omega(1). For any δ>0\delta>0, by the definition of γ,ϵ\gamma,\epsilon in Corollary 2 and Theorem 2 and our assumptions, the terms γ,ϵ\gamma,\epsilon and w2Tr(Σ2)n\|w^{*}\|_{2}\sqrt{\frac{\operatorname{Tr}(\Sigma_{2})}{n}} can be made small enough for Equation 87 to hold with a sufficiently large nn. By Theorem 3, we show that

limnPr(|L(w^)σ2|>η)δ\lim_{n\to\infty}\Pr(\,|L(\hat{w})-\sigma^{2}|>\eta\,)\leq\delta

Since the choice of δ>0\delta>0 is arbitrary, we have shown that L(w^)L(\hat{w}) converges to σ2\sigma^{2} in probability. ∎
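
To see the conditions of Theorem 12 at work, here is a small simulation (ours, purely illustrative; numpy only). It assumes a spiked covariance in which Σ₁ is a single unit-variance direction carrying the signal and Σ₂ is an isotropic tail of dimension n² with Tr(Σ₂) = √n, so that rank(Σ₁)/n → 0, Tr(Σ₂)/n → 0 and R(Σ₂)/n → ∞; the printed excess risk of the minimum ℓ2-norm interpolator should shrink as n grows.

```python
# Illustrative sketch (not from the paper): benign overfitting of min l2-norm interpolation.
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0

for n in [50, 100, 200]:
    d_tail = n ** 2                                  # tail dimension: R(Sigma_2) = d_tail = omega(n)
    lam = np.concatenate(([1.0], np.full(d_tail, np.sqrt(n) / d_tail)))   # Tr(Sigma_2) = sqrt(n)

    w_star = np.zeros(1 + d_tail)
    w_star[0] = 1.0                                  # signal lives on the spike (Sigma_1)

    X = rng.standard_normal((n, 1 + d_tail)) * np.sqrt(lam)
    Y = X @ w_star + sigma * rng.standard_normal(n)
    w_hat = X.T @ np.linalg.solve(X @ X.T, Y)        # minimum l2-norm interpolator

    excess = np.sum(lam * (w_hat - w_star) ** 2)     # L(w_hat) - sigma^2 = ||w_hat - w*||_Sigma^2
    print(f"n = {n:4d}   excess risk L(w_hat) - sigma^2 = {excess:.3f}")
```
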

D.2.1 Equivalence of Consistency Conditions

If we assume that w=Θ(1)\|w^{*}\|=\Theta(1), our consistency condition (Theorem 12) for minimum 2\ell_{2} norm interpolation is the existence of a covariance splitting such that

rank(Σ1)=o(n),TrΣ2=o(n),(TrΣ2)2Tr[(Σ2)2]=ω(n).\operatorname{rank}(\Sigma_{1})=o(n),\qquad\operatorname{Tr}\Sigma_{2}=o(n),\qquad\frac{(\operatorname{Tr}\Sigma_{2})^{2}}{\operatorname{Tr}[(\Sigma_{2})^{2}]}=\omega(n). (88)

We compare the above conditions to the following conditions:

rank(Σ1)=o(n),TrΣ2=o(n),TrΣ2Σ2𝑜𝑝=ω(n),(TrΣ2)2Tr[(Σ2)2]=ω(n).\operatorname{rank}(\Sigma_{1})=o(n),\quad\operatorname{Tr}\Sigma_{2}=o(n),\quad\frac{\operatorname{Tr}\Sigma_{2}}{\|\Sigma_{2}\|_{\mathit{op}}}=\omega(n),\quad\frac{(\operatorname{Tr}\Sigma_{2})^{2}}{\operatorname{Tr}[(\Sigma_{2})^{2}]}=\omega(n). (89)

Obviously, the conditions in (89) imply (88), but we show in Theorem 13 that the existence of a splitting that satisfies (88) also implies the existence of a (potentially different) splitting that satisfies (89). This is one way to see that the particular choice of kk^{*} from [66] can be made without loss of generality, at least if we only consider the consistency conditions.

Theorem 13.

Suppose that there exists Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2} that satisfies the conditions in (88). Then there exists a Σ=Σ1Σ2\Sigma=\Sigma^{{}^{\prime}}_{1}\oplus\Sigma^{{}^{\prime}}_{2} that satisfies the conditions in (89).

Proof.

Denote vv as the vector of eigenvalues of Σ\Sigma, and vkv_{k} as the vector obtained by setting the kk coordinates of vv corresponding to Σ1\Sigma_{1} to be 0. By our assumptions in (88), there exists k=o(n)k=o(n) such that

vk1=o(n),vk12vk22=ω(n).\|v_{k}\|_{1}=o(n),\qquad\frac{\|v_{k}\|_{1}^{2}}{\|v_{k}\|_{2}^{2}}=\omega(n).

For any τ0\tau\geq 0, we let Sτ={i[d]:|vk,i|τvk}S_{\tau}=\{i\in[d]:|v_{k,i}|\geq\tau\|v_{k}\|_{\infty}\} and define vk,τv_{k,\tau} by setting the coordinates of vkv_{k} in SτS_{\tau} to be 0. For simplicity of notation, define a=vk12/vk22a=\|v_{k}\|_{1}^{2}/\|v_{k}\|_{2}^{2} and b=vk1/vkb=\|v_{k}\|_{1}/\|v_{k}\|_{\infty}. Observe that

iSτ|vk,i|1τvkiSτvk,i2vk22τvk=vk1τba.\sum_{i\in S_{\tau}}|v_{k,i}|\leq\frac{1}{\tau\|v_{k}\|_{\infty}}\sum_{i\in S_{\tau}}v_{k,i}^{2}\leq\frac{\|v_{k}\|_{2}^{2}}{\tau\|v_{k}\|_{\infty}}=\frac{\|v_{k}\|_{1}}{\tau}\frac{b}{a}.

This shows that

vk,τ1(1bτa)vk1\|v_{k,\tau}\|_{1}\geq\left(1-\frac{b}{\tau a}\right)\|v_{k}\|_{1}

and

vk,ττvk=τbvk1.\|v_{k,\tau}\|_{\infty}\leq\tau\|v_{k}\|_{\infty}=\frac{\tau}{b}\|v_{k}\|_{1}.

In addition, observe that

τvk|Sτ|iSτ|vk,i|vk1τba.\tau\|v_{k}\|_{\infty}\cdot|S_{\tau}|\leq\sum_{i\in S_{\tau}}|v_{k,i}|\leq\frac{\|v_{k}\|_{1}}{\tau}\frac{b}{a}.

The above inequalities imply that

vk,τ1vk,τbτ(1bτa)\frac{\|v_{k,\tau}\|_{1}}{\|v_{k,\tau}\|_{\infty}}\geq\frac{b}{\tau}\left(1-\frac{b}{\tau a}\right)

and

|Sτ|a(bτa)2|S_{\tau}|\leq a\cdot\left(\frac{b}{\tau a}\right)^{2}

Finally, we pick τ\tau by setting b/(τa)=(n/a)3/4b/(\tau a)=(n/a)^{3/4}. By our assumption that a=ω(n)a=\omega(n), we can check

bτ(1bτa)=n3/4a1/4(1(n/a)3/4)=ω(n)(1o(1))=ω(n)\frac{b}{\tau}\left(1-\frac{b}{\tau a}\right)=n^{3/4}a^{1/4}(1-(n/a)^{3/4})=\omega(n)(1-o(1))=\omega(n)

and

a(bτa)2=a(n/a)3/2=n(n/a)1/2=o(n).a\cdot\left(\frac{b}{\tau a}\right)^{2}=a(n/a)^{3/2}=n(n/a)^{1/2}=o(n).

By Hölder’s inequality, we also have

vk,τ12vk,τ22vk,τ1vk,τ=ω(n).\frac{\|v_{k,\tau}\|_{1}^{2}}{\|v_{k,\tau}\|_{2}^{2}}\geq\frac{\|v_{k,\tau}\|_{1}}{\|v_{k,\tau}\|_{\infty}}=\omega(n).

It is clear that vk,τ1vk1=o(n)\|v_{k,\tau}\|_{1}\leq\|v_{k}\|_{1}=o(n) and k+|Sτ|=o(n)k+|S_{\tau}|=o(n), so picking the covariance splitting that corresponds to vk,τv_{k,\tau} concludes the proof. ∎
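
The trimming step in this proof is constructive, so it can be carried out numerically. The sketch below (ours, illustrative only; numpy) builds a tail spectrum whose ratio ||v_k||₁²/||v_k||₂² is much larger than a chosen n while ||v_k||₁/||v_k||_∞ is not, applies the cut S_τ with b/(τa) = (n/a)^{3/4}, and reports the quantities appearing in (89) for the trimmed tail.

```python
# Illustrative sketch (not from the paper): the eigenvalue-trimming construction of Theorem 13.
import numpy as np

n = 1000
# Tail spectrum with ||v||_1^2/||v||_2^2 >> n but ||v||_1/||v||_inf not much larger than n:
# a handful of moderate eigenvalues on top of many tiny ones.
v_k = np.concatenate([np.full(5, 0.05), np.full(2_000_000, 5e-6)])

a = v_k.sum() ** 2 / (v_k ** 2).sum()        # a = ||v_k||_1^2 / ||v_k||_2^2  (here >> n)
b = v_k.sum() / v_k.max()                    # b = ||v_k||_1 / ||v_k||_inf
tau = (b / a) * (a / n) ** 0.75              # chosen so that b/(tau*a) = (n/a)^{3/4}

S = v_k >= tau * v_k.max()                   # coordinates moved into Sigma_1'
v_trim = np.where(S, 0.0, v_k)               # remaining tail Sigma_2'

print("|S_tau| =", S.sum(), " (bound a*(n/a)^{3/2} =", round(a * (n / a) ** 1.5), ")")
print("Tr(Sigma_2')  =", v_trim.sum(), " (still o(n))")
print("r(Sigma_2')/n =", v_trim.sum() / v_trim.max() / n)
print("R(Sigma_2')/n =", v_trim.sum() ** 2 / (v_trim ** 2).sum() / n)
```
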

Appendix E Basis Pursuit (Minimum 1\ell_{1}-Norm Interpolation)

In this section, we illustrate the consequences of our general theory for basis pursuit. The following generalization bound for basis pursuit follows immediately from Corollary 3:

Corollary 5 (Generalization bound for 1\ell_{1} norm balls).

There exists an absolute constant C166C_{1}\leq 66 such that the following is true. Under the model assumptions in (1) with Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2}, fix δ1/4\delta\leq 1/4 and let γ=C1(log(1/δ)r1(Σ2)+log(1/δ)n+rank(Σ1)n)\gamma=C_{1}\left(\sqrt{\frac{\log(1/\delta)}{r_{1}(\Sigma_{2})}}+\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\operatorname{rank}(\Sigma_{1})}{n}}\right). If Bw1B\geq\|w^{*}\|_{1} and nn is large enough that γ1\gamma\leq 1, then the following holds with probability at least 1δ1-\delta:

supw1B,L^(w)=0L(w)(1+γ)(B𝔼Σ21/2H)2n.\sup_{\|w\|_{1}\leq B,\hat{L}(w)=0}L(w)\leq(1+\gamma)\frac{\left(B\cdot\operatorname*{\mathbb{E}}\lVert\Sigma^{1/2}_{2}H\rVert_{\infty}\right)^{2}}{n}. (90)
Proof.

Recall that the dual of the 1\ell_{1} norm is the \ell_{\infty} norm. By convexity

maxw11wΣ=maxiei,Σei=maxiΣii\max_{\|w\|_{1}\leq 1}\|w\|_{\Sigma}=\sqrt{\max_{i}\,\langle e_{i},\Sigma e_{i}\rangle}=\sqrt{\max_{i}\,\Sigma_{ii}}

and so we can use r1(Σ)=(𝔼Σ1/2H)2maxi(Σ)ii=r1(Σ)r_{1}(\Sigma)=\frac{\left(\operatorname*{\mathbb{E}}\|\Sigma^{1/2}H\|_{\infty}\right)^{2}}{\max_{i}(\Sigma)_{ii}}=r_{\|\cdot\|_{1}}(\Sigma). ∎
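
For a sense of scale, the quantity driving r_1 is the Gaussian expectation E||Σ^{1/2}H||_∞, which grows only logarithmically with the dimension when Σ is isotropic. Below is a small Monte Carlo sketch (ours, illustrative; numpy) comparing an estimate of (E||H||_∞)² with the lower bound log(d)/(π log 2) used in Appendix E.1 and with 2 log d.

```python
# Illustrative sketch (not from the paper): growth of r_1(I_d) = (E ||H||_inf)^2 with d.
import numpy as np

rng = np.random.default_rng(4)
m = 1000                                       # Monte Carlo draws

for d in [100, 1000, 10000]:
    H = rng.standard_normal((m, d))
    r1 = np.abs(H).max(axis=1).mean() ** 2     # r_1(I_d), since max_i Sigma_ii = 1
    print(f"d = {d:6d}   r_1(I_d) ~ {r1:6.2f}   "
          f"log(d)/(pi log 2) = {np.log(d) / (np.pi * np.log(2)):5.2f}   "
          f"2 log d = {2 * np.log(d):5.2f}")
```
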

The following norm bound for basis pursuit follows from Theorem 4:

Corollary 6 (1\ell_{1} norm bound).

There exists an absolute constant C264C_{2}\leq 64 such that the following is true. Under the model assumptions in (1), let Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2} such that Σ2\Sigma_{2} is diagonal. Fix δ1/4\delta\leq 1/4 and let ϵ=C2(log(1/δ)r1(Σ2)+log(1/δ)n+nr1(Σ2))\epsilon=C_{2}\left(\sqrt{\frac{\log(1/\delta)}{r_{1}(\Sigma_{2})}}+\sqrt{\frac{\log(1/\delta)}{n}}+\frac{n}{r_{1}(\Sigma_{2})}\right). Then if nn and the effective rank r1(Σ2)r_{1}(\Sigma_{2}) are large enough that ϵ1\epsilon\leq 1, with probability at least 1δ1-\delta, it holds that

w^1w1+(1+ϵ)1/2σn𝔼Σ21/2H.\left\lVert\hat{w}\right\rVert_{1}\leq\left\lVert w^{*}\right\rVert_{1}+(1+\epsilon)^{1/2}\,\sigma\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{\infty}}. (91)
Proof.

Recall that u=conv{sign(ui)ei:iargmax|ui|}\partial\|u\|_{*}=\text{conv}\{\operatorname{sign}(u_{i})\,e_{i}:i\in\arg\max|u_{i}|\}, where conv(S)\text{conv}(S) denotes the convex hull of SS. By definition, it holds almost surely that

vΣ2maxi[d]eiΣ2=maxi(Σ2)ii,\|v^{*}\|_{\Sigma_{2}}\leq\max_{i\in[d]}\,\|e_{i}\|_{\Sigma_{2}}=\sqrt{\max_{i}\,(\Sigma_{2})_{ii}}, (92)

and so we can pick ϵ1\epsilon_{1} such that

(1+ϵ1)𝔼vΣ2=maxi(Σ2)ii(1+\epsilon_{1})\operatorname*{\mathbb{E}}\left\lVert v^{*}\right\rVert_{\Sigma_{2}}=\sqrt{\max_{i}\,(\Sigma_{2})_{ii}}

and

(1+ϵ1)2nR1(Σ2)=n(1+ϵ1)2(𝔼vΣ2)2(𝔼Σ21/2H)2=nr1(Σ2).(1+\epsilon_{1})^{2}\frac{n}{R_{\|\cdot\|_{1}}(\Sigma_{2})}=n\frac{(1+\epsilon_{1})^{2}(\operatorname*{\mathbb{E}}\|v^{*}\|_{\Sigma_{2}})^{2}}{\left(\operatorname*{\mathbb{E}}\|\Sigma_{2}^{1/2}H\|_{\infty}\right)^{2}}=\frac{n}{r_{1}(\Sigma_{2})}.

In addition, since Σ2\Sigma_{2} is diagonal, the coordinates of Σ21/2H\Sigma_{2}^{1/2}H that correspond to the zero diagonals of Σ2\Sigma_{2} are 0. Therefore, vv^{*} must also have zero entries in those coordinates. In other words, vv^{*} lies in the span of Σ2\Sigma_{2}. As PP is the orthogonal projection onto the space spanned by Σ2\Sigma_{2}, this implies Pv=vPv^{*}=v^{*}, hence Pv1=v1=1\|Pv^{*}\|_{1}=\|v^{*}\|_{1}=1 and we can take ϵ2=0\epsilon_{2}=0. Plugging ϵ1,ϵ2\epsilon_{1},\epsilon_{2} into Theorem 4 concludes the proof. ∎

Theorem 14 (Benign overfitting).

Fix any δ1/2\delta\leq 1/2. Under the model assumptions in (1), let Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2} such that Σ2\Sigma_{2} is diagonal. Suppose that nn and the effective rank r1(Σ2)r_{1}(\Sigma_{2}) are sufficiently large such that γ,ϵ1\gamma,\epsilon\leq 1 with the same choice of γ\gamma and ϵ\epsilon as in Corollaries 5 and 6. Then, with probability at least 1δ1-\delta:

L(w^)(1+γ)(1+ϵ)(σ+w1𝔼Σ21/2Hn)2.L(\hat{w})\leq(1+\gamma)(1+\epsilon)\left(\sigma+\|w^{*}\|_{1}\frac{\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{\infty}}{\sqrt{n}}\right)^{2}. (93)

The proof of Theorem 14 uses Corollaries 5 and 6 and follows the same lines as the proof of Theorem 5. The details are repetitive, so we do not write them out in full here. As before, we can use the finite-sample bound to deduce sufficient conditions for consistency.

Theorem 15 (Sufficient conditions).

Under the model assumptions in (1), let w^\hat{w} be the minimal 1\ell_{1} norm interpolator. Suppose that as nn goes to \infty, there exists a sequence of covariance splits Σ=Σ1Σ2\Sigma=\Sigma_{1}\oplus\Sigma_{2} such that Σ2\Sigma_{2} is diagonal and the following conditions hold:

  1.

    (Small large-variance dimension.)

    limnrank(Σ1)n=0.\lim_{n\to\infty}\frac{\operatorname{rank}(\Sigma_{1})}{n}=0. (94)
  2.

    (Large effective dimension.)

    limnnr1(Σ2)=0.\lim_{n\to\infty}\frac{n}{r_{1}(\Sigma_{2})}=0. (95)
  3.

    (No aliasing condition.)

    limnw1𝔼Σ21/2Hn=0.\lim_{n\to\infty}\frac{\left\lVert w^{*}\right\rVert_{1}\operatorname*{\mathbb{E}}\|\Sigma^{1/2}_{2}H\|_{\infty}}{\sqrt{n}}=0. (96)

Then L(w^)L(\hat{w}) converges to σ2\sigma^{2} in probability. In other words, minimum 1\ell_{1} norm interpolation is consistent.

Again, the proof of Theorem 15 is exactly analogous to Theorem 12, so we omit the full proof here.

E.1 Isotropic Features

Theorem 16.

There exists an absolute constant C3140C_{3}\leq 140 such that the following is true. Under the model assumptions in (1) with Σ=Id\Sigma=I_{d}, denote SS as the support of ww^{*}. Fix δ1/4\delta\leq 1/4 and let ϵ=C3(log(1/δ)n+log(1/δ)log(d|S|)+nlog(d|S|))\epsilon=C_{3}\left(\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\log(1/\delta)}{\log(d-|S|)}}+\frac{n}{\log(d-|S|)}\right). Then if nn and dd are large enough that ϵ1\epsilon\leq 1, the following holds with probability 1δ1-\delta where HN(0,Id|S|)H^{\prime}\sim N(0,I_{d-|S|}):

w^1(1+ϵ)1/2(σ2+w22)1/2n𝔼H.\left\lVert\hat{w}\right\rVert_{1}\leq(1+\epsilon)^{1/2}\,(\sigma^{2}+\|w^{*}\|_{2}^{2})^{1/2}\,\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}. (97)
Proof.

Write X=[XS,XS𝖼]X=[X_{S},X_{S^{\mathsf{c}}}], where XSX_{S} is formed by selecting the columns of XX in SS. Also let ξ=XSwS+ξ\xi^{\prime}=X_{S}w_{S}^{*}+\xi; then the entries of ξ\xi^{\prime} are i.i.d. N(0,σ2+w22)N(0,\sigma^{2}+\|w^{*}\|_{2}^{2}) and independent of XS𝖼X_{S^{\mathsf{c}}}. Observe that Y=XS𝖼0+ξY=X_{S^{\mathsf{c}}}0+\xi^{\prime}. By choosing Σ1=0\Sigma_{1}=0 in Corollary 6, we show with large probability

minXS𝖼w=Yw1(1+ϵ)1/2(σ2+w22)1/2n𝔼H\min_{X_{S^{\mathsf{c}}}w=Y}\|w\|_{1}\leq(1+\epsilon)^{1/2}\,(\sigma^{2}+\|w^{*}\|_{2}^{2})^{1/2}\,\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}

for some ϵ64(log(1/δ)n+log(1/δ)r1(Id|S|)+nr1(Id|S|))\epsilon\leq 64\left(\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\log(1/\delta)}{r_{1}(I_{d-|S|})}}+\frac{n}{r_{1}(I_{d-|S|})}\right). By the bound of [50], it holds that

r1(Id|S|)=(𝔼H)2log(d|S|)πlog2r_{1}(I_{d-|S|})=\left(\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}\right)^{2}\geq\frac{\log(d-|S|)}{\pi\log 2}

and so we can choose C364πlog2<140C_{3}\leq 64\pi\log 2<140. Observe that if XS𝖼w=YX_{S^{\mathsf{c}}}w=Y, then X(0,w)T=YX(0,w)^{T}=Y and (0,w)1=w1\|(0,w)\|_{1}=\|w\|_{1}. It follows that

w^1=minXw=Yw1minXS𝖼w=Yw1.\left\lVert\hat{w}\right\rVert_{1}=\min_{Xw=Y}\|w\|_{1}\leq\min_{X_{S^{\mathsf{c}}}w=Y}\|w\|_{1}.\qed
Theorem 17.

Under the model assumptions in (1) with Σ=Id\Sigma=I_{d}, fix any δ1/2\delta\leq 1/2 and let η=368(log(1/δ)n+log(1/δ)+log|S|log(d|S|)+nlog(d|S|))\eta=368\left(\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\log(1/\delta)+\log|S|}{\log(d-|S|)}}+\frac{n}{\log(d-|S|)}\right). Suppose that nn and dd are large enough that η1\eta\leq 1. Then, with probability at least 1δ1-\delta,

L(w^)(1+η)(σ2+w22).L(\hat{w})\leq(1+\eta)(\sigma^{2}+\|w^{*}\|_{2}^{2}). (98)
Proof.

By Theorem 16, if we choose

B=(1+ϵ)1/2(σ2+w22)1/2n𝔼HB=(1+\epsilon)^{1/2}\,(\sigma^{2}+\|w^{*}\|_{2}^{2})^{1/2}\,\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}

then with large probability, 𝒦={w:w1B}\mathcal{K}=\{w:\|w\|_{1}\leq B\} has non-empty intersection with {w:Xw=Y}\{w:Xw=Y\}, and this intersection contains the minimal 1\ell_{1} norm interpolator w^\hat{w}. It can easily be seen that

W(𝒦)=B𝔼HandR(𝒦)=BW(\mathcal{K})=B\operatorname*{\mathbb{E}}\|H\|_{\infty}\quad\text{and}\quad R(\mathcal{K})=B

and so by Theorem 1, with large probability

L(w^)supw1B,L^(w)=0L(w)1+βn(B𝔼H+B2log(64δ)+w22log(64δ))2=1+βnB2(𝔼H)2(1+γ)2=(1+β)(1+ϵ)(1+γ)2(𝔼H𝔼H)2(σ2+w22)\begin{split}L(\hat{w})&\leq\sup_{\|w\|_{1}\leq B,\hat{L}(w)=0}L(w)\\ &\leq\frac{1+\beta}{n}\left(B\operatorname*{\mathbb{E}}\|H\|_{\infty}+B\sqrt{2\log\left(\frac{64}{\delta}\right)}+\|w^{*}\|_{2}\sqrt{2\log\left(\frac{64}{\delta}\right)}\right)^{2}\\ &=\frac{1+\beta}{n}B^{2}(\operatorname*{\mathbb{E}}\|H\|_{\infty})^{2}\left(1+\gamma\right)^{2}\\ &=(1+\beta)(1+\epsilon)(1+\gamma)^{2}\left(\frac{\operatorname*{\mathbb{E}}\|H\|_{\infty}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}\right)^{2}(\sigma^{2}+\|w^{*}\|_{2}^{2})\\ \end{split}

where β=66log(1/δ)/n\beta=66\sqrt{\log(1/\delta)/n} and γ=2log(64δ)𝔼H+w22log(64δ)B𝔼H\gamma=\frac{\sqrt{2\log\left(\frac{64}{\delta}\right)}}{\operatorname*{\mathbb{E}}\|H\|_{\infty}}+\frac{\|w^{*}\|_{2}\sqrt{2\log\left(\frac{64}{\delta}\right)}}{B\operatorname*{\mathbb{E}}\|H\|_{\infty}}. Observe that

Bw2n𝔼Hand𝔼H𝔼H.B\geq\|w^{*}\|_{2}\,\frac{\sqrt{n}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}\quad\text{and}\quad\operatorname*{\mathbb{E}}\|H\|_{\infty}\geq\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}.

Combined with the lower bound of [50], we show

γ2πlog(128/δ)logd+2log(64/δ)n8(log(1/δ)logd+log(1/δ)n).\gamma\leq\sqrt{\frac{2\pi\log(128/\delta)}{\log d}}+\sqrt{\frac{2\log(64/\delta)}{n}}\leq 8\left(\sqrt{\frac{\log(1/\delta)}{\log d}}+\sqrt{\frac{\log(1/\delta)}{n}}\right).

In addition, we have

𝔼H𝔼H=1+𝔼H𝔼H𝔼H1+2πlog(2)log|S|log(d|S|).\frac{\operatorname*{\mathbb{E}}\|H\|_{\infty}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}=1+\frac{\operatorname*{\mathbb{E}}\|H\|_{\infty}-\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}\leq 1+\sqrt{\frac{2\pi\log(2)\log|S|}{\log(d-|S|)}}.

Finally, it is a routine calculation to show

(1+β)(1+ϵ)(1+γ)2(𝔼H𝔼H)21+368(log(1/δ)n+log(1/δ)+log|S|log(d|S|)+nlog(d|S|))=1+η\begin{split}&(1+\beta)(1+\epsilon)(1+\gamma)^{2}\left(\frac{\operatorname*{\mathbb{E}}\|H\|_{\infty}}{\operatorname*{\mathbb{E}}\|H^{\prime}\|_{\infty}}\right)^{2}\\ \leq\,&1+368\left(\sqrt{\frac{\log(1/\delta)}{n}}+\sqrt{\frac{\log(1/\delta)+\log|S|}{\log(d-|S|)}}+\frac{n}{\log(d-|S|)}\right)=1+\eta\\ \end{split}

using the inequality (1+x)(1+y)1+x+2y(1+x)(1+y)\leq 1+x+2y for x1x\leq 1. ∎
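
Since the minimum ℓ1-norm interpolator is the solution of a linear program, the isotropic quantities above can be examined directly. Below is a small sketch (ours, illustrative only), assuming numpy and scipy.optimize.linprog and a deliberately small problem size: it solves basis pursuit on noisy isotropic data, then compares ||ŵ||₁ with the leading term √(n(σ² + ||w*||₂²))/E||H'||_∞ of Theorem 16 and L(ŵ) with σ² + ||w*||₂² from Theorem 17. Note the toy size is far from the regime n = o(log d) needed for the guarantees, so these are order-of-magnitude comparisons only.

```python
# Illustrative sketch (not from the paper): basis pursuit on noisy isotropic data.
# The problem size is far from the n = o(log d) regime, so the printed benchmarks
# are only order-of-magnitude comparisons, not verified bounds.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
n, d, sigma = 25, 10000, 0.5
w_star = np.zeros(d)
w_star[0] = 0.5                                # sparse signal, ||w*||_2 = 0.5

X = rng.standard_normal((n, d))                # isotropic features, Sigma = I_d
Y = X @ w_star + sigma * rng.standard_normal(n)

# min ||w||_1 subject to Xw = Y, written as an LP over (u, v) with w = u - v, u, v >= 0
res = linprog(c=np.ones(2 * d), A_eq=np.hstack([X, -X]), b_eq=Y,
              bounds=(0, None), method="highs")
assert res.status == 0
w_hat = res.x[:d] - res.x[d:]

E_H_inf = np.abs(rng.standard_normal((500, d))).max(axis=1).mean()   # ~ E||H'||_inf (d vs d-|S| negligible)
print("||w_hat||_1                                  :", np.abs(w_hat).sum())
print("sqrt(n (sigma^2 + ||w*||_2^2)) / E||H'||_inf :",
      np.sqrt(n * (sigma ** 2 + w_star @ w_star)) / E_H_inf)
print("L(w_hat) = sigma^2 + ||w_hat - w*||_2^2      :",
      sigma ** 2 + np.sum((w_hat - w_star) ** 2))
print("sigma^2 + ||w*||_2^2                         :", sigma ** 2 + w_star @ w_star)
```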