The distribution of Ridgeless least squares interpolators
Abstract.
The Ridgeless minimum-norm interpolator in overparametrized linear regression has attracted considerable attention in recent years. While it seems to defy the conventional wisdom that overfitting leads to poor prediction, recent research reveals that its norm-minimizing property induces an ‘implicit regularization’ that helps prediction in spite of interpolation. This renders the Ridgeless interpolator a theoretically tractable proxy that offers useful insights into the mechanisms of modern machine learning methods.
This paper takes a different perspective that aims at understanding the precise stochastic behavior of the Ridgeless interpolator as a statistical estimator. Specifically, we characterize the distribution of the Ridgeless interpolator in high dimensions, in terms of a Ridge estimator in an associated Gaussian sequence model with positive regularization, which plays the role of the prescribed implicit regularization observed previously in the context of prediction risk. Our distributional characterizations hold for general random designs and extend uniformly to positively regularized Ridge estimators.
As a demonstration of the analytic power of these characterizations, we derive approximate formulae for a general class of weighted risks for Ridge(less) estimators that were previously available only for . Our theory also provides a further conceptual reconciliation with the conventional wisdom: given any (regular) data covariance, for all but an exponentially small proportion of the signals, a certain amount of regularization in Ridge regression remains beneficial across various statistical tasks including (in-sample) prediction, estimation and inference, as long as the noise level is non-trivial. Surprisingly, optimal tuning can be achieved simultaneously for all the designated statistical tasks by a single generalized or -fold cross-validation scheme, even though these schemes are designed specifically for tuning the prediction risk.
The proof follows a two-step strategy that first proceeds under a Gaussian design using Gordon’s comparison principles, and then lifts the Gaussianity via universality arguments. Our analysis relies on uniform localization and stability properties of the Gordon optimization problem, along with uniform delocalization of the Ridge(less) estimators, both of which remain valid down to the interpolation regime.
Key words and phrases: comparison inequality, cross validation, minimum norm interpolator, random matrix theory, ridge regression, universality
2000 Mathematics Subject Classification: 60E15, 60G15
1. Introduction
1.1. Overview
Consider the standard linear regression model
(1.1) |
where we observe i.i.d. feature vectors and responses , and ’s are unobservable errors. For notational simplicity, we write as the design matrix that collects all the feature vectors, and as the response vector. The feature vectors ’s are assumed to satisfy and , and the errors satisfy and .
Throughout this paper, we reserve for the sample size, and for the signal dimension. The aspect ratio , i.e., the number of samples per dimension, is then defined as . Accordingly, we refer to as the overparametrized regime, and as the underparametrized regime.
Within the linear model (1.1), the main object of interest is to recover/estimate the unknown signal . While a large class of regression techniques can be used for the purpose of signal recovery under various structural assumptions on , here we will focus our attention on one widely used class of regression estimators, namely, the Ridge estimator (cf. [HK70]) with regularization ,
(1.2) |
and the Ridgeless estimator (also known as the minimum-norm interpolator),
(1.3) |
which is almost surely (a.s.) well-defined in the overparametrized regime . Here is the Moore-Penrose pseudo-inverse of . The notation is justified since for , a.s. as .
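As a concrete illustration, the following minimal NumPy sketch computes a Ridge estimator of the form (1.2) and the Ridgeless interpolator (1.3) via the Moore-Penrose pseudo-inverse, and numerically checks that the former approaches the latter as the regularization vanishes. The objective normalization and all numerical values are illustrative choices, not taken from the displays above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                          # overparametrized regime: p > n
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def ridge(X, y, lam):
    # Ridge estimator under the (assumed) objective ||y - X b||^2/(2n) + lam*||b||^2/2
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

def ridgeless(X, y):
    # Minimum-norm interpolator via the Moore-Penrose pseudo-inverse
    return np.linalg.pinv(X) @ y

b0 = ridgeless(X, y)
print(np.linalg.norm(X @ b0 - y))       # ~ 0: the data are interpolated
for lam in (1e-1, 1e-3, 1e-6):
    print(lam, np.linalg.norm(ridge(X, y, lam) - b0))   # -> 0 as lam -> 0+
```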
From a conventional statistical point of view, the Ridgeless estimator seems far from an obviously good choice: As perfectly interpolates the data, it is susceptible to high variability due to the widely recognized bias-variance tradeoff inherent in ‘optimal’ statistical estimators [JWHT21, DSH23]. On the other hand, as the Ridgeless estimator is the limit point of the gradient descent algorithm run on the squared loss in the overparametrized regime , it provides a simple yet informative test case for understanding one major enigma of modern machine learning methods: these methods typically interpolate training data perfectly; still, they enjoy good generalization properties [JGH18, DZPS18, AZLS19, BHMM19, COB19, ZBH+21].
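The connection between gradient descent and the Ridgeless estimator can be checked directly: the sketch below runs gradient descent on the squared loss from the zero initialization and compares the output with the pseudo-inverse solution. The step size and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 120                              # overparametrized regime
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Gradient descent on the squared loss (1/2n)||y - X b||^2, started at zero.
b = np.zeros(p)
step = 0.5 * n / np.linalg.norm(X, 2) ** 2  # conservative step size
for _ in range(5000):
    b -= step * X.T @ (X @ b - y) / n

b_minnorm = np.linalg.pinv(X) @ y
print(np.linalg.norm(X @ b - y))            # ~ 0: training data interpolated
print(np.linalg.norm(b - b_minnorm))        # ~ 0: GD converges to the min-norm solution
```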
Inspired by this connection, recent years have witnessed a surge of interest in understanding the behavior of the Ridgeless estimator and its closely related Ridge estimator , with a particular focus on their prediction risks, cf. [TV+04, EK13, HKZ14, Dic16, DW18, EK18, ASS20, BHX20, WX20, BLLT20, BMR21, RMR21, HMRT22, TB22, CM22]. An emerging picture from these works is that the norm-minimizing property of the Ridgeless estimator, inherited from the gradient descent algorithm, induces a certain ‘implicit regularization’ that helps prediction accuracy despite interpolation. More interestingly, in certain scenarios of , the Ridgeless estimator has a smaller (limiting) prediction risk compared to any Ridge estimator with , and therefore in such scenarios interpolation becomes optimal for the task of prediction, cf. [KLS20, HMRT22, TB22]. As a result, the Ridgeless estimator can be viewed, at least to some extent, as a (substantially) simplified yet theoretically tractable proxy that captures some important features of modern machine learning methods. Moreover, understanding the prediction risk of serves as a basis for more complicated interpolation methods in, e.g., kernel ridge regression [LR20, BMR21], the random features model [MM22, MMM22], and the neural tangent model [MZ22], among others.
Despite this encouraging progress, there remains limited understanding of the behavior of the Ridgeless estimator as a statistical estimator. From a statistical perspective, a ‘good’ estimator usually possesses multiple desirable properties, and therefore it is natural to ask whether the Ridgeless estimator is also ‘good’ in senses beyond prediction accuracy. This is particularly relevant if we aim to treat as a ‘good’ estimator that can be applied in broader statistical contexts, rather than viewing it solely as a theoretical proxy designed to provide insights into the mechanisms of modern machine learning methods.
From a different angle, while much of the aforementioned research has focused on identifying favorable scenarios of in which prediction via can be accurate or even optimal, this line of theory does not automatically imply that in the practical statistical applications of overparametrized linear regression, the Ridgeless estimator remains the optimal choice for the task of prediction for ‘typical’ scenarios of . The prevailing conventional statistical wisdom, which strongly advocates the use of regularization to strike a balance between bias and variance, may still be at work. From both conceptual and practical standpoints, it is therefore essential to understand the extent to which the optimality phenomenon of interpolation, as observed in certain scenarios of in the literature, can be considered ‘generic’ within the context of Ridge regression.
The main goal of this paper is to take a further step in understanding the precise stochastic behavior of the Ridge(less) estimator , by developing a high-dimensional distributional characterization in the so-called proportional regime, where and are of the same order. As will be clear from below, the distributional characterization of the Ridge(less) estimator offers a detailed understanding of its statistical properties across various statistical tasks, extending beyond prediction accuracy. Notable examples include a characterization of a general class of weighted risks () associated with , as well as its capacity for statistical inference in terms of constructing confidence intervals.
In addition, the distributional characterization also leads to new insights on the (sub-)optimality of the Ridgeless estimator , which, interestingly, further reconciles with the conventional statistical wisdom: given any covariance structure , except for an exponentially small proportion of signal ’s within the norm ball, interpolation is optimal in prediction, estimation and inference, if and only if the noise level . In other words, in the common statistical setting where the noise level is nontrivial , a certain amount of regularization in Ridge regression remains beneficial for ‘most’ signal ’s across a number of different statistical tasks in the linear model (1.1)—including the most intensively studied prediction task in the literature.
Of course, in practice, the optimal regularization is unknown and may differ significantly for each statistical task. Surprisingly, our distributional characterization reveals that two widely used adaptive tuning methods, the generalized cross-validation scheme [CW79, GHW79] and the -fold cross-validation scheme [GKKW02, JWHT21], both of which are designed for tuning the prediction risk, are actually simultaneously optimal for all the aforementioned statistical tasks, at least for ‘most’ signal ’s. In particular, for ‘most’ signal ’s, these two popular adaptive tuning methods automatically lead to optimal prediction, estimation and in-sample risks. Moreover, when combined with the debiased Ridge estimator, they produce the shortest confidence intervals for the coordinates of with asymptotically valid coverage.
1.2. Distribution of Ridge(less) estimators
For simplicity of discussion, we will focus here on the overparametrized regime that is of our main interest.
1.2.1. The Gaussian sequence model
We will describe the distribution of the Ridge(less) estimator in the linear model (1.1), in terms of a corresponding Ridge estimator in a simpler Gaussian sequence model, defined for a given pair of and a noise level as
(1.4) |
The Ridge estimator with regularization in the Gaussian sequence model (1.4) is defined as
(1.5) |
1.2.2. Distributional characterization of via
Under standard assumptions on (i) the design matrix , where consists of independent mean-zero, unit-variance and light-tailed entries, and (ii) the error vector with light-tailed components, we show that the distribution can be characterized via as follows. For any , there exists a unique pair determined via an implicit fixed point equation (cf. Eqn. (2.1) in Section 2 ahead), such that the distribution of is approximately the same as that of . Formally, for any -Lipschitz function and any , we show in Theorems 2.3 and 2.4 that with high probability,
(1.6) |
A particularly important technical aspect of (1.6) is that the distributional approximation holds uniformly down to the interpolation regime for . This uniform guarantee will prove essential in the results to be detailed ahead.
From (1.6), the quantities and can be naturally interpreted as the effective noise and effective regularization for the Ridge estimator in the Gaussian sequence model (1.4). Moreover, can be regarded as the aforementioned ‘implicit regularization’. Interestingly, while this interpretation has been previously noted in the special context of prediction risk of the Ridge(less) estimator (cf. [BMR21, Remark 4.15], [CM22, Eqns. (16)-(19)]), our distributional characterization (1.6) reveals that such effective/implicit regularization is an inherently general phenomenon that persists at the level of distributional properties of .
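Since the fixed point equation (2.1) is not reproduced in this extract, the following sketch is only a rough numerical illustration: it iterates an assumed degrees-of-freedom form of the effective-regularization equation (which may differ from (2.1) in normalization) and shows that a strictly positive effective regularization emerges at zero explicit regularization in the overparametrized regime.

```python
import numpy as np

def effective_regularization(eigs, n, lam, iters=1000):
    """Fixed-point iteration for an effective regularization parameter.

    Assumed (hypothetical) degrees-of-freedom form, NOT copied from (2.1):
        lam = t * (1 - df(t)/n),   df(t) = sum_j eigs_j / (eigs_j + t).
    At lam = 0 with p > n this has a strictly positive solution t, playing
    the role of the 'implicit regularization' discussed above.
    """
    t = 1.0                                   # any positive initialization works
    for _ in range(iters):
        df = np.sum(eigs / (eigs + t))        # effective degrees of freedom
        t = lam + t * df / n                  # monotone fixed-point update
    return t

eigs = np.linspace(0.5, 2.0, 400)             # spectrum of a toy covariance, p = 400
print(effective_regularization(eigs, n=200, lam=0.0))   # strictly positive
print(effective_regularization(eigs, n=200, lam=0.1))   # exceeds 0.1
```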
1.2.3. Approximate formulae for general weighted risks
As a first, yet non-trivial demonstration of the analytic power of (1.6), we show in Theorem 2.5 that for ‘most’ signals ’s, the weighted risk () of , namely, with a well-behaved p.s.d. matrix , concentrates around some deterministic quantity in the following sense: with high probability,
(1.7) |
Here , and is the vector that collects all diagonal elements of a p.s.d. matrix , defined explicitly (in Eqn. (2.7)) via the weight matrix , the data covariance , the effective regularization , and the signal energy .
To the best of our knowledge, results of the form (1.7) are available in the literature (cited above) only for the special case , where the quantity admits a closed form expression in terms of the spectral statistics of the sample covariance that allows for direct applications of random matrix methods. Moving beyond this explicit closed form presents a notable analytic advantage of our theory (1.6) over existing random matrix approaches in analyzing risks of .
1.3. Phase transitions for the optimality of interpolation
While our theory (1.6) is strong enough to characterize all risks of the Ridge(less) estimator in the sense of (1.7), the uniform nature of (1.6) also sheds light on certain global, qualitative behavior of the most commonly studied risks in finite samples. To fix notation, we define
• (prediction risk) ,
• (estimation risk) ,
• (in-sample risk) .
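A quick Monte Carlo illustration of these three risks along the Ridge(less) path is given below; the definitions used in the code (covariance-weighted quadratic error for prediction, squared error for estimation, averaged squared in-sample error) are standard assumed forms, since the displayed formulas above are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 200, 400, 1.0
eigs = np.linspace(0.5, 2.0, p)
Sigma = np.diag(eigs)
mu0 = np.ones(p) / np.sqrt(p)
X = rng.standard_normal((n, p)) * np.sqrt(eigs)   # rows have covariance Sigma
y = X @ mu0 + sigma * rng.standard_normal(n)

def ridge(lam):
    if lam == 0:
        return np.linalg.pinv(X) @ y              # Ridgeless / min-norm interpolator
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

for lam in (0.0, 0.01, 0.1, 1.0):
    d = ridge(lam) - mu0
    pred = d @ Sigma @ d                          # assumed prediction risk
    est = d @ d                                   # assumed estimation risk
    ins = np.sum((X @ d) ** 2) / n                # assumed in-sample risk
    print(f"lam={lam:5.2f}  pred={pred:.3f}  est={est:.3f}  in-sample={ins:.3f}")
```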
Using our uniform distributional characterization in (1.6), we show in Theorem 3.4 that for ‘most’ ’s and all , with high probability,
(1.8) |
In fact, we prove in Theorem 3.4 a much stronger statement: for ‘most’ ’s, the (random) global optimum of for all will be achieved approximately at with high probability. Here is the usual notion of signal-to-noise ratio; when and , we shall interpret .
It must be stressed that, for different , the empirical risk curves concentrate around genuinely different deterministic counterparts with different mathematical expressions (cf. Theorem 3.1). As such, there is no a priori reason to expect that these risk curves share approximately the same global minimum. Remarkably, as a consequence of the approximate formulae for the deterministic risk curves (cf. Theorem 3.2), we show that the curves are qualitatively similar, in that they approximately behave locally like a quadratic function centered around (cf. Proposition 3.3), at least for ‘most’ signal ’s.
From a broader perspective, the phase transition (1.8) aligns closely with the conventional statistical wisdom: for ‘most’ signal ’s, a certain amount of regularization is necessary to achieve optimal performance for all the prediction, estimation and in-sample risks, as long as the noise level .
1.4. Cross-validation: optimality beyond prediction
The phase transition in (1.8) naturally raises the question of how one can choose the optimal regularization in a data-driven manner. Here we study two widely used adaptive tuning methods, namely,
(1) the generalized cross-validation scheme , and
(2) the -fold cross-validation scheme .
The readers are referred to (4.3) and (4.5) for precise definitions of in the context of Ridge regression. Both methods have a long history in the literature; see, e.g., [Sto74, Sto77, CW79, GHW79, Li85, Li86, Li87, DvdL05] for some historical references. In essence, both methods are designed to estimate the prediction risk, so it is natural to expect that they perform well for the task of prediction. Rigorous theoretical justifications along this line in Ridge regression in high dimensional settings can be found in, e.g., [LD19, PWRT21, HMRT22].
Interestingly, the insight from (1.8) suggests a far broader utility of these adaptive tuning methods. As all the empirical risk curves are approximately minimized at the same point with high probability, it is plausible to conjecture that the aforementioned cross-validation methods could also yield optimal performance for estimation and in-sample risks, at least for ‘most’ signal ’s. We show in Theorems 4.2 and 4.3 that this is indeed the case: for ‘most’ signal ’s and all , with high probability,
(1.9) |
Even more surprisingly, the optimality of extends to the much more difficult task of statistical inference. In fact, we show in Theorem 4.4 that in the so-called debiased Ridge scheme, these two adaptive tuning methods yield an asymptotically valid construction of confidence intervals for the coordinates of with the shortest possible length.
To the best of our knowledge, theoretical optimality properties of the cross-validation schemes beyond the realm of prediction accuracy have not been observed in the literature, either for Ridge regression or for other regularized regression estimators. Our findings here in the context of Ridge regression can therefore be viewed as a first step in understanding the potential broader merits of cross-validation schemes for a larger array of statistical inference problems.
1.5. Proof techniques
As previously mentioned, a significant body of recent research (cited above) has concentrated on various facets of the precise asymptotics of the prediction risk for the Ridgeless estimator . These studies extensively utilize the explicit form of (1.3), and rely almost exclusively on techniques from random matrix theory (RMT) to relate the behavior of the bias and variance terms in the prediction risk to the spectral properties of the sample covariance.
Here, as (1.6) lacks a direct connection to the spectrum of the sample covariance, we adopt a different, two-step strategy for its proof:
(1) In the first step, we establish the distributional characterization (1.6) under a Gaussian design, using Gordon’s comparison principles (the convex Gaussian min-max theorem, CGMT).
(2) In the second step, we prove universality that lifts the Gaussianity by leveraging the recently developed comparison inequalities in [HS22].
Under a Gaussian design, the proof method for establishing distributional properties of regularized regression estimators via the CGMT has been executed for the closely related Lasso estimator with strictly non-vanishing regularization, cf. [MM21, CMW22] (more literature on other statistical applications of the CGMT can be found in Section 6). Here in our setting, the major technical hurdle is to handle the vanishing strong convexity in the optimization problem (1.2) as . We overcome this technical issue by establishing uniform localization of Gordon’s min-max optimization, valid down to . The localization property, in a certain sense, allows us to conclude that the distributional properties of are stable as . The readers are referred to Section 6.3 for a proof outline.
For the universality problem, the key step of our arguments is to reduce, in a proper sense, the difficult problem on the universality of to an easier problem on the universality of for small , so that the comparison inequalities in [HS22] can be applied. This reduction is achieved by proving that (i) and related quantities are uniformly delocalized down to the interpolation regime , and (ii) both the primal and Gordon optimization problems are ‘stable’ in suitable senses as . A more detailed proof outline is contained in Section 6.4.
1.6. Organization
The rest of the paper is organized as follows. In Section 2, we present our main results on the distributional characterizations (1.6) of the Ridge(less) estimator , and the approximate risk formulae (1.7). In Section 3, we provide a number of approximate risk formulae via RMT, and establish the phase transition (1.8). In Section 4, we give a formal validation for the two cross validation schemes mentioned above, both in terms of (1.9) and of statistical inference via the debiased Ridge estimator. A set of illustrative simulation results is presented in Section 5 to corroborate (some of) the theoretical results. Due to the high technicalities involved in the proof of (1.6), a proof outline is given in Section 6. All the proof details are then presented in Sections 7-12.
1.7. Notation
For any positive integer , let denote the set . For , and . For , let . For , let denote its -norm , and . We simply write and . For a matrix , let denote the spectral and Frobenius norm of , respectively. is reserved for an identity matrix, written simply as (in the proofs) if no confusion arises. For a square matrix , we let .
We use to denote a generic constant that depends only on , whose numeric value may change from line to line unless otherwise specified. and mean and , abbreviated as respectively; means and , abbreviated as . and (resp. and ) denote the usual big and small O notation (resp. in probability). For a random variable , we use (resp. ) to indicate that the probability and expectation are taken with respect to (resp. conditional on ).
For a measurable map , let . is called -Lipschitz iff . For a proper, closed convex function defined on , its Moreau envelope and proximal operator for any are defined by and .
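As a small numerical check of these definitions, the sketch below evaluates the Moreau envelope and proximal operator of a quadratic penalty and compares them with their closed forms; the normalization of the envelope used in the code is an assumption, as the displayed formulas are not reproduced in this extract.

```python
import numpy as np
from scipy.optimize import minimize

def moreau_envelope_and_prox(f, x, tau):
    # Numerically minimize z -> ||x - z||^2/(2*tau) + f(z)
    obj = lambda z: np.sum((x - z) ** 2) / (2 * tau) + f(z)
    res = minimize(obj, x0=np.zeros_like(x))
    return res.fun, res.x

lam, tau = 0.7, 1.3
x = np.array([1.0, -2.0, 0.5])
f = lambda z: lam * np.sum(z ** 2) / 2           # f(z) = lam * ||z||^2 / 2

env, prox = moreau_envelope_and_prox(f, x, tau)

# Closed forms for this quadratic f:
#   prox_f(x; tau) = x / (1 + lam * tau),
#   e_f(x; tau)    = lam * ||x||^2 / (2 * (1 + lam * tau)).
print(np.allclose(prox, x / (1 + lam * tau), atol=1e-4))
print(np.isclose(env, lam * np.sum(x ** 2) / (2 * (1 + lam * tau)), atol=1e-5))
```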
Throughout this paper, for an invertible covariance matrix , we write as the harmonic mean of the eigenvalues of .
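In code, this harmonic mean is simply the dimension divided by the trace of the inverse covariance:

```python
import numpy as np

def harmonic_mean_eigs(Sigma):
    # p / tr(Sigma^{-1}) = harmonic mean of the eigenvalues of Sigma
    p = Sigma.shape[0]
    return p / np.trace(np.linalg.inv(Sigma))

print(harmonic_mean_eigs(np.diag([0.5, 1.0, 2.0])))   # 3 / (2 + 1 + 0.5) ~ 0.857
```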
2. Distribution of Ridge(less) estimators
2.1. Some definitions
For , let
This notation will be used throughout the paper for uniform-in- statements. In particular, in the overparametrized regime , we have .
2.2. Working assumptions
Assumption A.
, where (i) has independent, mean-zero, unit variance, uniformly sub-gaussian entries, and (ii) is an invertible covariance matrix with eigenvalues .
Here ‘uniform sub-gaussianity’ means for some universal , where is the Orlicz 2-norm (cf. [vdVW96, Section 2.2, pp. 95]).
We shall often write the Gaussian design as , where consists of i.i.d. entries.
Assumption B.
for some with i.i.d. mean zero, unit variance and uniform sub-gaussian entries.
The requirement on the noise level will be specified in concrete results below.
2.3. The fixed point equation
Fix . Consider the following fixed point equation in :
(2.1) |
We first establish some qualitative properties for the solution of (2.1).
Proposition 2.1.
Recall . The following hold.
(1) The fixed point equation (2.1) admits a unique solution , for all when and when .
(2) Suppose and for some . Then there exists some such that uniformly in ,
If furthermore and , then uniformly in ,
(3) Suppose and for some . Then there exists some such that the following hold. For any , we may find some with ,
where . When , we may take and the above inequality holds with .
As an important qualitative consequence of (2), under the condition , the effective regularization is a strictly increasing and concave function of . Moreover, in the overparametrized regime , the quantity —also known as ‘implicit regularization’ in the literature [BLLT20, BMR21, CM22, HMRT22, TB22]—is strictly bounded away from zero.
The claim in (3) offers a useful approximate representation of the effective noise in terms of the original noise , the effective regularization and the signal energy , without explicit dependence on . This representation will prove useful in the approximate risk formulae in Theorem 2.5, as well as in understanding some qualitative aspects of the risk curves in Section 3 ahead.
2.4. Some connections of (2.1) to RMT
The second equation of (2.1) has a natural connection to random matrix theory (RMT). To detail this connection, let and be the sample covariance matrix and its dimension-flipped companion matrix. For , let and be the Stieltjes transforms of the empirical spectral distribution and the asymptotic eigenvalue density (cf. [KY17, Definition 2.3]) of , respectively. It is well-known that can be determined uniquely via the fixed point equation
(2.2) |
See, e.g., [KY17, Lemma 2.2] for more technical details and historical references. We also note that while the above equation is initially defined for , it can be straightforwardly extended to the real axis provided that lies outside the support of the asymptotic spectrum of .
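Since the display (2.2) is not reproduced in this extract, the sketch below uses the standard Silverstein-type self-consistent equation as an assumed stand-in for the fixed point equation, and compares its solution with the empirical resolvent trace of a sampled companion matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 300, 600, 0.5
eigs = np.linspace(0.5, 2.0, p)

# (i) Empirical companion Stieltjes transform at z = -lam:
#     v_hat = (1/n) tr[(X X^T / n + lam I_n)^{-1}],  rows of X have covariance diag(eigs).
X = rng.standard_normal((n, p)) * np.sqrt(eigs)
v_hat = np.trace(np.linalg.inv(X @ X.T / n + lam * np.eye(n))) / n

# (ii) Assumed Silverstein-type self-consistent equation (a stand-in for (2.2)):
#      1/v = lam + (1/n) * sum_j eigs_j / (1 + v * eigs_j)
v = 1.0
for _ in range(1000):
    v = 1.0 / (lam + np.sum(eigs / (1.0 + v * eigs)) / n)

print(v_hat, v)        # the two values are close for large n, p
print(1.0 / v)         # plays the role of the effective regularization at this lam
```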
The following proposition provides a precise connection between the effective regularization defined via the second equation of (2.1), and the Stieltjes transform . This connection will prove important in some of the results ahead.
Proposition 2.2.
For any and with ,
(2.3) |
Proof.
While (2.3) may appear purely algebraic, it actually admits a natural statistical interpretation. Suppose is also Gaussian. We may then compute
(2.4) |
Now comparing the above display with (2.3), we arrive at the following intriguing equivalence between the averaged law in RMT, and the proximity of and in terms of “degrees-of-freedom”:
Below we will show the above proximity of and can be taken as far as the distributions of the two themselves.
2.5. Distribution of Ridge(less) estimators
In addition to , we will also consider the distribution of the (scaled) residual , defined by
(2.5) |
We define the ‘population’ version of as
(2.6) |
Here is independent of .
We are now in a position to state our main results on the distributions of the Ridge(less) estimator and the residual .
First we work under the Gaussian design , and we write . Recall .
Theorem 2.3.
The choice is made merely for simplicity of presentation; it can be replaced by with another constant that depends further on . The assumption is quite common in the literature of Ridge(less) regression; see, e.g., [BMR21, Assumption 4.12] or a slight variant in [MRSY23, Assumption 1]. The major assumption in the above theorem is the Gaussianity on the design . This may be lifted at the cost of a set of slightly stronger conditions.
Theorem 2.4.
Suppose Assumption A holds and the following hold for some .
• , .
• Assumption B holds with .
Fix . There exist some and two measurable sets with , such that the following hold.
(1) For any -Lipschitz function , and ,
(2) For any , and -Lipschitz function (which may depend on ),
Concrete forms of are specified in Proposition 10.3.
Theorems 2.3 and 2.4 are proved in Section 9 and Section 10, respectively. Due to the high technicalities in the proof, a sketch is outlined in Section 6.
We mention two particularly important features of the theorems above:
(1) The distributional characterizations for in both theorems above are uniformly valid down to the interpolation regime for . This uniform control will play a crucial role in our non-asymptotic analysis of cross-validation methods to be studied in Section 4 ahead.
(2) The distribution of the residual in (2) is formulated conditionally on the noise . A fundamental reason for adopting this formulation is that the distribution of is not universal with respect to the law of . In other words, one cannot simply assume Gaussianity of in Theorem 2.3 in the hope of proving universality of in Theorem 2.4.
In the context of distributional characterizations for regularized regression estimators in the proportional regime, results in a similar vein to Theorem 2.3 have been obtained in the closely related Lasso setting for isotropic in [MM21], and for general in [CMW22], both under Gaussian designs and with strictly non-vanishing regularization. A substantially simpler, isotropic () version of Theorem 2.4 is obtained in [HS22] that holds pointwise in the non-vanishing regularization level . As will be clear from the proof sketch in Section 6, in addition to the complications due to the implicit nature of the solution to the fixed point equation (2.1) for general , the major difficulty in proving Theorems 2.3 and 2.4 rests in handling the singularity of the optimization problem (1.2) as .
2.6. Weighted risks of
As a demonstration of the analytic power of the above Theorems 2.3 and 2.4, we compute below the weighted risk for a well-behaved matrix and .
Theorem 2.5.
The proof of the above theorem can be found in Section 10.7. To the best of our knowledge, general weighted risks for the Ridge(less) estimator have not been available in the literature except for the special case , for which admits a closed-form expression in terms of the spectral statistics of that facilitates direct applications of RMT techniques, cf. [TV+04, EK13, Dic16, DW18, EK18, ASS20, WX20, BMR21, RMR21, HMRT22, CM22].
Here, obtaining risks for via our Theorems 2.3 and 2.4 is relatively easy, as is -Lipschitz with respect to for . The stronger norm case is significantly harder. In fact, we need additionally the following delocalization result for .
Proposition 2.6.
Suppose the same conditions as in Theorem 2.5 hold for some . Fix . Then there exist some constant and a measurable set with , such that
The above proposition is a simplified version of Proposition 10.3, proved via the anisotropic local laws developed in [KY17]. In essence, delocalization allows us to apply Theorems 2.3 and 2.4 with a truncated version of the norm () with a well-controlled Lipschitz constant with respect to . Moreover, delocalization of also serves as a key technical ingredient in proving the universality Theorem 2.4; the readers are referred to Section 6 for a detailed account of the technical connection between delocalization and universality.
Convention on probability estimates:
3. risk formulae and phase transitions
In this section, we will study in some detail the behavior of various risks associated with . Compared to the general risks as in Theorem 2.5, the major additional analytic advantage of working with risks is their close connection to techniques from random matrix theory (RMT). As will be clear from below, this allows us to characterize certain global, qualitative behavior of these risk curves with respect to the regularization level .
3.1. Definitions of various risks
Recall the notation defined in Section 1.3. Let their ‘theoretical’ versions be defined as follows:
• .
• .
• .
We also define the residual and its theoretical version as
• , .
The following theorem follows easily from Theorems 2.3 and 2.4. The proof of this and all other results in this section can be found in Section 11.
Theorem 3.1.
Suppose Assumption A holds and the following hold for some .
• , .
• Assumption B holds with .
Fix a small enough . Then there exist a constant , and a measurable set with , such that for any , and ,
Here for and for , and is universal. Moreover, when , the supremum in the above display extends to , and the constant does not depend on .
Remark 1.
For , we may take at the cost of a worsened probability estimate , cf. Lemma 11.4.
A substantial recent line of research has focused on understanding the behavior of , with a special emphasis on the interpolating regime when , cf. [BLLT20, KLS20, WX20, BMR21, KZSS21, RMR21, CM22, HMRT22, TB22, ZKS+22]. We refer the readers to [TB22, Section 1.2 and Section 9] for a thorough account and a state-of-the-art review of various results on and their comparisons.
Here, the closest non-asymptotic results on exact risk characterizations related to our Theorem 3.1 appear to be those presented in (i) [HMRT22, Theorems 2 and 5], which proved non-asymptotic additive approximations , and (ii) [CM22, Theorems 1 and 2], which provided substantially refined, multiplicative approximations that hold beyond the proportional regime. Both works [HMRT22, CM22] leverage the closed form of the Ridge(less) estimator to analyze the bias and variance terms in , by means of calculus for the resolvent of the sample covariance. Their analysis works under for some suitable . For the case , Theorem 3.1 above complements the results in [HMRT22, CM22] by providing uniform control in when (under a set of different conditions).
3.2. Approximate representation of via RMT
Recall the Stieltjes transform defined via (2.2), and the signal-to-noise ratio (for , we interpret ). The following theorem provides an efficient RMT representation of that holds for ‘most’ ’s.
Theorem 3.2.
Suppose , and for some . There exists some constant such that for any , we may find a measurable set with ,
(3.1) |
Here with and ,
• ,
• ,
• ,
• .
When , we may take and (3.1) holds with .
The RMT representation above yields a crucial insight into the extremal behavior of the risk maps . In fact, the following derivative formulae hold.
Proposition 3.3.
For ,
Here with denoting the asymptotic eigenvalue density of , and ,
As (cf. [KY17, Lemma 2.2]), it is easy to see using the above integral representation via . In Proposition 11.3 ahead, we will give a different representation of via the effective regularization . With the help of the stability estimates of in Proposition 2.1, this representation allows us to derive a much stronger estimate
(3.2) |
under the condition .
A particularly important consequence of (3.2) is that the maps are approximately (locally) quadratic functions centered at the same point for all . Therefore, in view of Theorems 3.1 and 3.2, the risk maps behave approximately in the same way for ‘most’ ’s. Moreover, the approximate local quadraticity of due to (3.2) allows one to tightly relate the change in value of and that of the actual risks . This will play an important technical role in validating the optimality of cross-validation schemes beyond prediction errors in Section 4 ahead.
Remark 2.
3.3. Phase transitions on the optimality of interpolation
The fact that the maps admit the same global minimizer due to Proposition 3.3 may be viewed from a different perspective: if and only if for ‘most’ signal ’s. Combined with Theorems 3.1 and 3.2, it naturally suggests that for ‘most’ ’s, interpolation is optimal simultaneously for prediction, estimation and in-sample risks, if and only if the noise level is (nearly) zero. The following theorem makes this precise.
Theorem 3.4.
We note that the sub-optimality of interpolation in the noisy setting , at least for ‘most’ ’s as described in the above theorem, does not contradict the possible benign behavior of the prediction risk extensively studied in the literature (cited after Theorem 3.1). In fact, from a conceptual standpoint, the observation that interpolation may fall short while a certain amount of regularization remains advantageous in ‘typical’ scenarios aligns closely with the conventional statistical wisdom, which emphasizes the vital role of regularization in striking a balance between the bias and variance [JWHT21].
On the other hand, the optimality of interpolation in the noiseless setting stated in the above theorem is not as intuitively obvious. This phenomenon can be elucidated by our distributional characterizations in Theorems 2.3 and 2.4. As the effective regularization decreases as , and, for , the effective noise (cf. Proposition 2.1-(3)) also decreases as , both the bias and the variance of are expected to decrease as , at least for and for ‘most’ ’s. The above theorem rigorously establishes this heuristic for all the prediction, estimation and in-sample risks.
Remark 3.
Some remarks on the connection of Theorem 3.4 to the literature:
(1) Theorem 3.4-(1) is related to some results in [CDK22], which considers a Bayesian setting with an isotropic prior on in that and . In particular, [CDK22, Theorem 3-(ii)] shows that, using our notation, in the proportional asymptotics , if and the empirical spectral distribution of converges, then with -probability , . Here for the case , our Theorem 3.4-(1) above provides a non-asymptotic, and more importantly, a non-Bayesian version that holds for ‘most’ ’s.
(2) As mentioned before, the recent work [CM22] proves an important, multiplicative characterization beyond the proportional regime. In particular, the multiplicative formulation encompasses several important cases of data covariance with decaying eigenvalues that lead to benign overfitting , cf. [CM22, Section 4.2]. We conjecture that Theorem 3.4 also remains valid for such irregular data covariance, in a similar multiplicative formulation. However, formally establishing this validity remains an interesting open question.
4. Cross-validation: optimality beyond prediction
This section is devoted to the validation of the broad optimality of two widely used cross-validation schemes beyond the prediction risk. Some consequences to statistical inference via debiased Ridge(less) estimators will also be discussed.
4.1. Estimation of effective noise and regularization
We shall first take a slight detour, by considering estimation of the effective regularization and the effective noise . We propose the following estimators:
(4.1) |
These estimators will not only be useful in their own right; they will also play an important role in understanding the generalized cross-validation scheme in the next subsection.
The following theorem provides a formal justification for in (4.1). The proof of this and all other results in this section can be found in Section 12.
Theorem 4.1.
Remark 4.
The original noise level can be consistently estimated when is known. In particular, we may use
(4.2) |
Estimators for of a similar flavor for other convex regularized estimators under Gaussian designs can be found in [BEM13, Bel20, MM21]. An advantage of in (4.2) is its validity in the interpolation regime when . We may also replace in the above display by , as it can be solved exactly with known using the second equation of (2.1). A formal proof of can be carried out similarly to that of Theorem 4.1-(2) above, so we omit the details.
4.2. Validation of cross-validation schemes
4.2.1. Generalized cross-validation
Consider choosing by minimizing the estimated effective noise given in (4.1): for any ,
(4.3) |
The subscript on in will usually be suppressed for notational simplicity.
The above proposal (4.3) is known in the literature as the generalized cross validation [CW79, GHW79] in the underparametrized regime , with the same form of modification [HMRT22, Eqn. (48)] in the overparametrized regime . The connection there is strongly tied to the so-called shortcut formula for leave-one-out cross validation that is specific to Ridge regression, cf. [HMRT22, Eqn. (46)].
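For reference, a minimal sketch of the classical GCV criterion for Ridge regression is given below. The precise definition (4.3), including its modification for the overparametrized regime, is not reproduced in this extract, so the denominator in the code should be read as the classical (underparametrized) form only.

```python
import numpy as np

def gcv_curve(X, y, lams):
    """Classical GCV criterion for Ridge regression (underparametrized form).

    GCV(lam) = (1/n) ||y - S_lam y||^2 / (1 - tr(S_lam)/n)^2, with hat matrix
    S_lam = X (X^T X + n*lam*I)^{-1} X^T.  The paper's definition (4.3) modifies
    the denominator in the overparametrized regime (cf. [HMRT22, Eqn. (48)]).
    """
    n, p = X.shape
    vals = []
    for lam in lams:
        S = X @ np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T)
        resid = y - S @ y
        vals.append((resid @ resid / n) / (1.0 - np.trace(S) / n) ** 2)
    return np.array(vals)

# Hypothetical usage: pick the tuning parameter minimizing the GCV curve on a grid.
# lams = np.logspace(-3, 1, 40); lam_hat = lams[np.argmin(gcv_curve(X, y, lams))]
```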
Here we take a different perspective on (4.3). In view of the theory developed above, it is easy to believe that this tuning scheme ‘works’, since
(4.4) |
We therefore expect that minimization of is approximately the same as that of . Moreover, in view of Theorem 3.4, the minimizer of the latter problem should be roughly the same as for for ‘most’ ’s. Now it is natural to expect that the tuning method in (4.3)—which aims at minimizing the prediction risk—should simultaneously give optimal performance for all prediction, estimation and in-sample risks for ‘most’ signal ’s. We make precise the foregoing heuristics in the following theorem.
Theorem 4.2.
Earlier results for generalized cross validation in Ridge regression in low-dimensional settings include [Sto74, Sto77, CW79, Li85, Li86, Li87, DvdL05]. In the proportional high-dimensional regime, [HMRT22, Theorem 7] provides an asymptotic justification for the generalized cross validation under isotropic and an isotropic prior on . Subsequent improvement by [PWRT21, Theorem 4.1] allows for general , deterministic ’s and a much larger range of regularization levels that include and possibly even negative ’s. Both works consider the optimality of with respect to the prediction risk in an asymptotic framework; the proofs therein rely intrinsically on the asymptotics.
Here in Theorem 4.2 above, we provide a non-asymptotic justification for the optimality of that surprisingly holds simultaneously for all the three indicated risks. To the best of our knowledge, the optimality of the generalized cross validation in (4.3) beyond prediction risk has not been observed in prior literature.
4.2.2. -fold cross-validation
Next we consider the widely used -fold cross-validation. Before detailing the procedure, we need some further notation. Let be the sample size of batch , so . In the standard -fold cross validation, we choose equal-sized batches with (assumed to be an integer without loss of generality). Let (resp. ) be the submatrix of (resp. subvector of ) that contains all rows corresponding to the training data in batch . In a similar fashion, let (resp. ) be the submatrix of (resp. subvector of ) that removes all rows corresponding to (resp. ).
The -fold cross-validation works as follows. For , let be the Ridge estimator over with regularization . We then pick the tuning parameter that minimizes the averaged test errors of over : for any ,
(4.5) |
We shall often omit the subscript in . Intuitively, due to the independence between and , can be viewed as an estimator of the generalization error . So it is natural to expect that approximately minimizes . Based on the same heuristics as for in (4.3), we may therefore expect that in (4.5) simultaneously provides optimal prediction, estimation and in-sample risks for ‘most’ signal ’s. This is the content of the following theorem.
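Before stating the theorem, here is a minimal sketch of the -fold scheme just described; the fold construction and normalizations are illustrative choices, and the exact objective (4.5) is not reproduced above.

```python
import numpy as np

def kfold_cv_ridge(X, y, lams, K=5, seed=0):
    """K-fold cross-validation for the Ridge regularization level (a sketch)."""
    n, p = X.shape
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    errs = np.zeros(len(lams))
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        Xtr, ytr, Xte, yte = X[train], y[train], X[test], y[test]
        G = Xtr.T @ Xtr / len(train)
        for i, lam in enumerate(lams):
            if lam == 0:
                b = np.linalg.pinv(Xtr) @ ytr          # Ridgeless fit on the fold
            else:
                b = np.linalg.solve(G + lam * np.eye(p), Xtr.T @ ytr / len(train))
            errs[i] += np.mean((yte - Xte @ b) ** 2) / K   # averaged test error
    return lams[np.argmin(errs)], errs
```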
Theorem 4.3.
Suppose the same conditions as in Theorem 4.2 and hold for some . Fix , and a small enough . Further assume . There exist a constant and a measurable set with , such that for ,
Here .
Non-asymptotic results of this type for -fold cross validation were previously obtained for in the Lasso setting [MM21, Proposition 4.3] under isotropic , where the range of the regularization must stay strictly away from the interpolation regime. In contrast, our results above are valid down to when , and allow for general anisotropic .
Interestingly, the error bound in the above theorem reflects the folklore tension between the bias and variance in the selection of in the cross validation scheme (cf. [JWHT21, Chapter 5]):
• For a small number of , is biased for estimating ; this corresponds to the term in the error bound, which is known to be of the optimal order in Ridge regression (cf. [LD19]).
• For a large number of , the variability of increases; this corresponds to the remaining term in the error bound.
4.3. Implications to statistical inference via
As Ridge(less) estimators are in general biased, debiasing is necessary for statistical inference of , cf. [BZ23]. Here the debiasing scheme for can be readily read off from the distributional characterizations in Theorems 2.3 and 2.4. Assuming known covariance , let the debiased Ridge(less) estimator be defined as
(4.6) |
Similar to Remark 4, and are interchangeable in the above display, as is known. Using Theorems 2.3 and 2.4, we expect that . This motivates the following confidence intervals for :
(4.7) |
Here is the normal upper- quantile defined via . It is easy to see from the above definition that minimization of is equivalent to that of the CI length. As the former minimization procedure corresponds exactly to the proposal in (4.3), we expect that provide the shortest (asymptotic) -CIs along the regularization path, and so do .
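Since the constructions (4.6)-(4.7) are not reproduced in this extract, the snippet below only illustrates the generic template of coordinate-wise two-sided intervals formed from a debiased estimate, a per-coordinate scale and a normal quantile; the `widths` argument is a hypothetical placeholder for the scale prescribed by (4.7).

```python
import numpy as np
from scipy.stats import norm

def coordinate_cis(mu_debiased, widths, alpha=0.05):
    # Two-sided intervals mu_debiased_j +/- z_{alpha/2} * widths_j; `widths` stands in
    # for the per-coordinate scale prescribed by (4.7), which is not derived here.
    z = norm.ppf(1 - alpha / 2)
    return np.stack([mu_debiased - z * widths, mu_debiased + z * widths], axis=1)
```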
Below we give a rigorous statement on the above informal discussion. Let denote the averaged coverage of for . We have the following.
Theorem 4.4.
An interesting and somewhat non-standard special case of the above theorem is the noiseless setting in the overparametrized regime . In this case, exact recovery of is impossible and our CIs above provide a precise scheme for partial recovery of . Moreover, as the effective noise , Theorem 3.2 and Proposition 3.3 suggest that is approximately minimized at for ‘most’ ’s. This means that, in this noiseless case, the length of the adaptively tuned CIs is also approximately minimized at the interpolation regime for ‘most’ ’s.
Remark 5.
The debiased Ridge estimator in (4.6) takes a slightly non-standard form, which allows for interpolation when . To see its equivalence to the standard debiased form, for any , using the KKT condition and the calculation in (2.4),
The form in the right hand side of the above display matches the standard form of the debiased Ridge estimator, cf. [BZ23, Eqn. (3.15)].
5. Some illustrative simulations
[Figure 1: theoretical and empirical risk curves in the noisy (left) and noiseless (middle) settings, and boxplots of the differences between the empirically and theoretically optimal tuning parameters (right); see Section 5.2.]
In this section, we provide some illustrative numerical simulations to validate some of the developed theoretical results. In particular, we focus on the phase transitions for the optimality of interpolation in Theorem 3.4 and the optimality of the cross-validation schemes studied in Section 4.
5.1. Common numerical settings
We set , with representing an -dimensional all-one vector. The random design matrix and the error are both generated from a -distribution with degrees of freedom, scaled by . This scaling choice ensures that and have mean zero and variance one. The concrete choices of the signal dimension , the sample size , and will be specified later.
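One way to generate data matching the description above is sketched below; the degrees of freedom of the -distribution and the problem sizes are placeholder choices, as the concrete values are stripped from this extract.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, df, sigma = 200, 400, 6, 1.0              # df = 6 and the sizes are placeholders

scale = np.sqrt(df / (df - 2))                  # a t_df variable has variance df/(df-2)
Z = rng.standard_t(df, size=(n, p)) / scale     # design entries: mean zero, variance one
xi = rng.standard_t(df, size=n) / scale         # errors: mean zero, variance one

mu0 = np.ones(p) / np.sqrt(p)                   # signal proportional to the all-one vector
y = Z @ mu0 + sigma * xi                        # isotropic design used in this toy sketch
```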
5.2. Validation of the phase transitions in Theorem 3.4
First, we use , , and a unit chosen randomly (but fixed afterwards) from the sphere , and we plot both the theoretical risk curve and the empirical risk curve for all . The left panel of Figure 1 reports the outcome of this simulation with noise level and , while the middle panel of the same figure reports the noiseless case with . These plots show excellent agreement with the theory in Theorem 3.4 in that the global minima of both the theoretical and empirical risk curves are attained roughly at .
In order to demonstrate the validity of the aforementioned phenomenon for ‘most’ ’s as claimed in Theorem 3.4, we uniformly generate different over . Next, we discretize into grid points. We then select the optimal value by minimizing the empirical prediction, estimation, and in-sample risks. The difference between the chosen empirical optimal and the theoretically optimal tuning is depicted in the right panel of Figure 1 through a boxplot of . It is easily seen that the differences for all three risks are highly concentrated around .
[Figure 2: empirical risks under cross-validated tuning (left), averaged coverage of the debiased Ridge CIs (middle), and CI lengths compared with oracle tuning (right); see Section 5.3.]
5.3. Optimality of (generalized) cross-validation schemes
Next, we investigate the efficacy of the two cross-validation schemes from Section 4, namely in (4.3) and in (4.5). We keep the sample size fixed at , and allow the signal dimension to vary so that the aspect ratio ranges from . To facilitate the tuning process, we employ equidistant ’s within the range of . Moreover, the -fold cross validation scheme is carried out with the default choice .
To empirically verify Theorems 4.2 and 4.3, we report in the left panel of Figure 2 the empirical risks for all . All the empirical risk curves are found to concentrate around their theoretical optimal counterparts . We note again that as and are designed to tune the prediction risk, it is not surprising that concentrate around . The major surprise appears to be that and also provide optimal tuning for the estimation and in-sample risks, both theoretically validated in our Theorems 4.2 and 4.3 and empirically confirmed here.
To empirically verify Theorem 4.4, we report in the middle and right panels of Figure 2 the averaged coverage and length for the -debiased Ridge CIs with cross-validation, namely for , and with oracle tuning . For the middle panel, we observe that adaptive tuning via and both provide approximately nominal coverage for a moderate sample size and signal dimension . For the right panel, as the lengths of are solely determined by , we report here only the length of . We observe that the CI lengths for both are also in excellent agreement with the oracle length across different aspect ratios.
6. Proof outlines
6.1. Technical tools
The main technical tool we use for the proof of Theorem 2.3 is the following version of convex Gaussian min-max theorem, taken from [MM21, Corollary G.1].
Theorem 6.1 (Convex Gaussian Min-Max Theorem).
Suppose are compact sets, and is continuous. Let with ’s i.i.d. , and , be independent Gaussian vectors. For , write . Define
Then the following hold.
(1) For all , .
(2) If satisfies the conditions of Sion’s min-max theorem for the pair a.s. (for instance, are convex, and is convex-concave), then for any , .
Clearly, (resp. ) in (1) (resp. (2)) can be replaced with (resp ). In the proofs below, we shall assume without loss of generality that are independent Gaussian matrix/vectors defined on the same probability space.
As mentioned above, the CGMT above has been utilized for deriving precise risk/distributional asymptotics for a number of canonical statistical estimators across various important models; we only refer the readers to [TOH15, TAH18, SAH19, LGC+21, CMW22, DKT22, Han22, LS22, WWM22, ZZY22, MRSY23] for some selected references.
6.2. Reparametrization and further notation
Consider the reparametrization
Then with
(6.1) |
we have the following reparametrized version of :
Next we give some further notation for cost functions. Let for ,
(6.2) |
and for ,
(6.3)
We shall simply write and . When , we sometimes write and for simplicity of notation.
Let the empirical noise and its modified version be
(6.4) |
Finally we define and its deterministic version as follows:
(6.5) |
Here recall is the Moreau envelope of in (6.1). Note that depends on the choice of , but for notational convenience we drop this dependence here.
6.3. Proof outline for Theorem 2.3 for
We shall outline below the main steps for the proof of Theorem 2.3 for in the regime under a stronger condition . The high-level strategy of the proof shares conceptual similarities with [MM21, CMW22], but the details differ significantly.
(Step 1: Localization of the primal optimization). In this step, we show that for such that , with high probability (w.h.p.),
(6.6) |
A formal statement of the above localization can be found in Proposition 9.1. The key point here is that although optimizes a deterministic function under a random constraint, it can be efficiently rewritten (in a probabilistic sense) in a minimax form indexed by compact sets, which facilitates the application of the convex Gaussian min-max theorem (Theorem 6.1).
(Step 2: Characterization of the Gordon cost optimum). In this step, we show that a suitably localized version of concentrates around some deterministic quantity involving the function in (6.2). In particular, we show in Theorem 9.2 that for chosen large enough, w.h.p.,
(6.7) |
The proof of (6.7) is fairly involved, as the minimax problem (and its suitably localized versions) cannot be computed exactly. We get around this technical issue by the following bracketing strategy:
• (Step 2.1). We show in Proposition 9.3 that for the prescribed choice of , w.h.p., both
and the localization
hold for some large .
• (Step 2.2). We show in Proposition 9.4 that for localized minimax problems, we may replace by : w.h.p.,
• (Step 2.3). We show in Proposition 9.5 that (de)localization holds for the (deterministic) max-min optimization problem with :
Combining the above Steps 2.1-2.3 yields (6.7). An important step to prove the (de)localization claims above is to derive apriori estimates for the solutions of the fixed point equation (2.1) and its sample version, to be defined in (8.12). These estimates will be detailed in Section 8.
(Step 3: Locating the global minimizer of the Gordon objective). In this step, we show that a suitably localized version of the Gordon objective attains its global minimum approximately at in the following sense. For any and any that is -Lipschitz with respect to , let be the ‘exceptional set’. We show in Theorem 9.6 that again for chosen large enough, w.h.p.,
(6.8) |
The main challenge in proving (6.8) is partly attributed to the possible violation of strong convexity of the map , due to the necessity of working with non-Gaussian ’s. We will get around this technical issue, in a similar spirit to Step 2, by another bracketing strategy. In particular:
• (Step 3.1). We first introduce a surrogate of the (localized) Gordon objective that brackets it tightly and enjoys strong convexity.
• (Step 3.2). In Proposition 9.8, we show that the minimizers of can be computed exactly and are close enough to .
• (Step 3.3). In Proposition 9.9, combined with the tight bracketing and certain apriori estimates, we then conclude that all minimizers of must be close to .
With all the above steps, finally we prove (6.8) by (i) using the proximity of and its surrogate and (ii) exploiting the strong convexity of .
(Step 4: Putting pieces together and establishing uniform guarantees). In this final step, we shall use the convex Gaussian min-max theorem to translate the estimates (6.7) in Step 2 and (6.8) in Step 3 to their counterparts with primal cost function . For the global cost optimum, with the help of the localization in (6.6), by choosing , we have w.h.p.,
For the cost over the exceptional set, we have w.h.p.,
Combining the above two displays, we then conclude that w.h.p., . Finally, using an apriori estimate on , we may conclude that w.h.p., , i.e., .
The uniform guarantee in is then proved by (i) extending the above arguments to include any positive , and (ii) establishing (high probability) Lipschitz continuity (w.r.t. ) of the maps and .
Details of the above outline are implemented in Section 9.
6.4. Proof outline for Theorem 2.4 for
The main tool we will use to prove the universality Theorem 2.4 is the following set of comparison inequalities developed in [HS22]: Suppose matches the first two moments of , and possesses sufficiently high moments. Then for any measurable sets , and any smooth test function (standardized with derivatives of order in ),
(6.9) |
Here for with sufficiently small . The readers are referred to Theorems 10.1 and 10.2 for a precise statement of (6.4).
An important technical subtlety here is that while the first inequality in (6.4) holds down to , the second inequality does not. This is because , which minimizes a deterministic function under a random constraint arising from the unbounded constraint in the maximization of , is qualitatively different from for any .
Now we shall sketch how the comparison inequalities (6.4) lead to universality.
(Step 1: Universality of the global cost optimum). In this step, we shall use the first inequality in (6.4) to establish the universality of the global Gordon cost:
(6.10) |
The crux of establishing (6.10) via the first inequality of (6.4) is to show that the ranges of the minimum and the maximum of can be localized into an ball of order around . This amounts to showing that the stationary points , where and (cf. Eqn. (10.3)), are delocalized. We prove such delocalization properties in Proposition 10.3 for ‘most’ .
(Step 2: Universality of the cost over exceptional sets). In this step, we shall use the second inequality in (6.4) to establish the universality of the Gordon cost over exceptional sets . In particular, we show in Theorem 10.5 that with for sufficiently small and a large enough , w.h.p.,
(6.11) |
Here . As mentioned above, a technical difficulty in applying the second inequality of (6.4) rests in its singular behavior near the interpolation regime . Also, we note that for a general exceptional set , the maximum over in need not be delocalized, so the first inequality of (6.4) cannot be applied. This singularity issue will be resolved in two steps:
• First, we establish the universality (6.11) for regularization levels bounded away from the interpolation regime, where the second inequality of (6.4) applies.
• Second, we transfer the result down to the interpolation regime by exploiting the stability of both the primal and the Gordon optimization problems as the regularization vanishes.
A complete proof of the above outline is detailed in Section 10.
7. Proof preliminaries
7.1. Some properties of and
We write in this subsection. First we give an explicit expression for and .
Lemma 7.1.
For any ,
Proof.
Using the closed-form of , we may compute
(7.1) |
The claims follow from direct calculations. ∎
Next we give explicit expressions for and .
Lemma 7.2.
It holds that
Furthermore,
Proof.
The two identities in the first display follow from the definition of . For the second display, note that is equal to
We conclude using . ∎
The derivative formula below for will be useful.
Lemma 7.3.
It holds that
Proof.
See e.g., [TAH18, Lemmas B.5 and D.1]. ∎
Finally we provide a concentration inequality for .
Proposition 7.4.
There exists some universal constant such that
holds for any . Here .
Proof.
7.2. Some high probability events
Let
(7.3) |
For , consider the event
Here in the definition of , we interpret . Typically we think of and .
Lemma 7.5.
Fix and . Then , where .
Proof.
Lemma 7.6.
Suppose . Then there exists some such that .
Proof.
The claim for follows from standard concentration estimates. The claim for follows from, e.g., [RV09, Theorem 1.1]. ∎
Lemma 7.7.
Suppose , and for some , and Assumption B hold with . There exists some constant such that for all , with , for , we have .
Proof.
The claim follows by standard concentration inequalities. ∎
8. Properties of the fixed point equations
8.1. The fixed point equation (2.1)
Proposition 8.1.
The following hold.
(1) The fixed point equation (2.1) admits a unique solution , for all when and when .
(2) The following apriori bounds hold:
(3) If and for some , then there exists some such that uniformly in ,
If furthermore and , then uniformly in ,
Proof.
We shall write for notational simplicity. All the constants in below may depend on .
(1). First we prove the existence and uniqueness of . We rewrite the second equation of (2.1) as
(8.1) |
Clearly is smooth, non-increasing, for and for , and , so must admit a unique zero .
Next we prove the existence and uniqueness of . Using Lemma 7.1, the equation reads
(8.2) |
As by (8.1) and the fact , the above equation admits a unique solution , analytically given by
(8.3) |
(2). For the upper bound for , using the equation (8.1), we have
Solving for yields the desired upper bound. For the lower bound for , note that (8.1) leads to
or equivalently . Solving this quadratic inequality yields the lower bound for .
On the other hand, the lower bound is trivial by (8.3). For the upper bound for , using that
(8.4) |
and the first identity in (8.3), we have
Collecting the bounds proves the claim.
(3). The claim on is a simple consequence of (2). We shall prove the other claim on their derivatives. Viewing and taking derivative with respect to on both sides of (8.1) yield that, with for ,
Solving for yields that
(8.5) |
Further taking derivative with respect to on both sides of the above display (8.5), we have
(8.6) |
Using the apriori estimate for proved in (2), it follows that for ,
(8.7) |
For , let us define
Then
(8.8) |
We shall now prove bounds for . First, using (8.1), we have
In particular, uniformly in ,
(8.9) |
The derivatives are
Using the apriori estimates on and (8.7), it now follows that
(8.10) |
Combining (8.8)-(8.10) and using apriori estimates on , we arrive at
(8.11) |
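As a numerical illustration of the type of scalar fixed-point equation analyzed above, the sketch below solves one standard Silverstein-type self-consistent equation for Ridge regression (cf. [BS10, DW18, HMRT22]) by monotone root finding; its parameterization is generic and need not coincide with that of (2.1), and the function and variable names are illustrative.

```python
# Minimal sketch: solve a Ridge-type self-consistent equation of the form
#     1/v = lam + gamma * E[ t / (1 + t*v) ],   t ~ population spectrum,
# for v > 0 (a standard Silverstein-type parameterization; the paper's (2.1)
# may be parameterized differently).  The equivalent form
#     g(v) = v * (lam + gamma * E[t/(1+t*v)]) - 1 = 0
# has g strictly increasing with g(0) = -1, so the positive root is unique.
import numpy as np
from scipy.optimize import brentq

def solve_fixed_point(spec, gamma, lam):
    """spec: eigenvalues of a toy population covariance; gamma = p/n; lam > 0."""
    def g(v):
        return v * (lam + gamma * np.mean(spec / (1.0 + spec * v))) - 1.0
    return brentq(g, 1e-12, 1e12)   # g(1e-12) < 0 < g(1e12): bracketed root

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec = rng.uniform(0.5, 2.0, size=1000)      # toy spectrum
    for lam in [1.0, 0.1, 0.01]:
        print(f"lam={lam:4.2f}  v = {solve_fixed_point(spec, gamma=2.0, lam=lam):.4f}")
```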
8.2. Sample version of (2.1)
Let the sample version of (2.1) be defined by
(8.12) |
Here recall that is defined in (7.3), and is defined in (6.4).
Proposition 8.2.
We need two concentration lemmas before the proof of Proposition 8.2.
Lemma 8.3.
Let for . Suppose that for some . Then there exists some constant such that for ,
Lemma 8.4.
Let for . Suppose that for some . Then there exists some constant such that for ,
The proofs of these lemmas are deferred to the next subsection.
Proof of Proposition 8.2.
All the constants in , , and below may possibly depend on . We often suppress the dependence of on for simplicity.
(1). We shall write as and for notational simplicity. Using (7.1), any solution to the equations in (8.12) satisfies
(8.13) |
On the event with , and , using Lemma 7.5, the second equation in (8.13) becomes
Rearranging terms we obtain the inequality
So with , the equations in (8.13) reduce to
(8.14) |
The above equations match (2.1) up to the small perturbation that can be assimilated into the leading term with small enough such that . From here the existence (but not uniqueness) and apriori bounds for can be established similarly to the proof of Proposition 8.1.
(2). Now we shall prove the claimed error bounds. By using (8.1) and the first equation of (8.14), we have
Let . Then it is easy to calculate , and for any ,
Now using the apriori estimates on , we may conclude
(8.15) |
On the other hand, using (8.2) and the second equation of (8.14), we have
(8.16) |
Using the error bound in (8.15) and apriori estimates for , and the fact that , by an easy derivative estimate we have
-
•
, and
-
•
.
Now plugging these estimates into (8.2), with satisfying , we arrive at
Using apriori estimates on , we may then invert the above estimate into
(8.17) |
The claimed error bounds follow by combining (8.15) and (8.17). ∎
8.3. Proofs of Lemmas 8.3 and 8.4
Proof of Lemma 8.3.
We only handle the case . The case is similar. Note that the assumption on is invariant under orthogonal transforms, so for notational simplicity we assume without loss of generality that is diagonal. As , a standard concentration argument for the first term shows that for , with probability ,
(8.18) |
On the other hand, for to be chosen later, by taking an -net of , a union bound shows that with probability at least ,
Here in the last inequality we used the simple estimate . Finally by choosing , we conclude that for , with probability ,
(8.19) |
Proof of Lemma 8.4.
We focus on the case and will follow a similar idea used in the proof of Lemma 8.3 above. Similarly we assume is diagonal without loss of generality. All the constants in below may depend on .
First note that, by standard concentration, for any , with probability at least , . Similarly we have . This means for any , with probability at least ,
(8.20) |
Next we handle the suprema over by discretization over an -net . To this end, we shall establish a pointwise concentration. Note that . An application of Proposition B.1 then yields that, for each and , with probability at least ,
On the other hand, as and , we deduce that with probability at least ,
From here the claim follows by the same arguments used in the proof of Lemma 8.3 above. ∎
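For completeness, the covering estimate implicitly used in the discretization arguments of this section (and repeatedly later) is the standard volumetric bound below; the notation is generic rather than the paper's.

```latex
% Standard covering bound and net reduction (generic notation).
\[
  \mathcal{N}\big(B_2(0,R), \varepsilon\big) \;\le\; \Big(1 + \tfrac{2R}{\varepsilon}\Big)^{p},
  \qquad
  \sup_{v \in B_2(0,R)} f(v) \;\le\; \max_{v \in N_\varepsilon} f(v) + L\varepsilon
\]
for any $L$-Lipschitz $f:\mathbb{R}^p \to \mathbb{R}$ and any $\varepsilon$-net
$N_\varepsilon$ of the Euclidean ball $B_2(0,R)$; a union bound over
$N_\varepsilon$ then costs at most a factor $(1+2R/\varepsilon)^{p}$ in the
failure probability.
```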
9. Gaussian designs: Proof of Theorem 2.3
We assume without loss of generality that , so unless otherwise specified. Recall .
9.1. Localization of the primal problem
Proposition 9.1.
Suppose , and for some . Fix and . On the event , there exists some such that for any deterministic choice of with
we have .
Proof.
Using the first-order optimality condition for the minimax problem
(9.1) |
any saddle point of (9.1) must satisfy and , or equivalently,
Here recall . On the event ,
So on ,
This means that on the event , for any chosen as in the statement of the proposition,
The proof is complete by recalling the definition of . ∎
9.2. Characterization of the Gordon cost optimum
Theorem 9.2.
Suppose the following hold for some .
-
•
, .
-
•
Assumption B with .
There exist some depending on such that for any deterministic choice of , it holds for any , and ,
In the next subsection we will show that for large , the map attains its global minimum in an ball of constant order radius (under ) with high probability. This means that although the initial localization radius for the primal optimization may be highly suboptimal (which involves ), the Gordon objective can be further localized into an ball with constant order radius.
To prove Theorem 9.2, we shall first relate to and its localized versions.
Proposition 9.3.
Suppose , and for some . There exists some constant such that for any deterministic choice of , on the event (defined in Proposition 8.2) with and , we have for any ,
and the following localization holds:
Proof.
We write in the proof.
(Step 1). Fix any . We may compute
(9.2) |
Here in the last line we used Sion’s min-max theorem to flip the order of minimum and maximum in . The minimum over is achieved exactly at , so when , using the simple inequality
(9.3) |
on the event , we may further bound (9.2) as follows:
(9.4) |
We note that depends on , but this notational dependence will be dropped from now on for convenience.
(Step 2). Consider the minimax optimization problem in (9.2):
(9.5) |
Any saddle point of the above program must satisfy the first-order optimality condition
(9.6) |
Using the derivative formula in Lemma 7.3 and the form of in Lemma 7.2, we may compute
(9.7) |
Plugging (9.7) into (9.6), the first-order optimality condition for in the minimax program (9.2) is given by
Equivalently,
(9.8) |
Using the apriori estimates in Proposition 8.2, on the event we have and . This implies on the same event,
(9.9) |
Using the last equation of (9.6), we have
(9.10) |
In view of (9.9)-(9.10), by choosing for large enough , the constraints in the optimization in (9.2) can be dropped for free. ∎
Next we replace the random function in the above proposition by its deterministic counterpart in the localized versions.
Proposition 9.4.
Suppose , and for some . There exist some depending on such that for , , and ,
Proof.
In the proof, we write . All the constants in and below may depend on .
(Step 1). We first prove the following: On the event , for any ,
(9.11) |
To this end, with written as , and using , , ,
on the event , we may estimate . A similar estimate applies to the expectation versions, proving (9.11).
(Step 2). Next we show that for any , there exists such that for ,
(9.12) |
To prove the claim, we fix to be chosen later, and take an -net for . Then . So on the event , using the estimate in (9.11) and a union bound via the pointwise concentration inequality in Proposition 7.4, for , with probability at least ,
Here in the last inequality we used Lemma 7.2 to estimate , where is defined in Proposition 7.4. The claim (9.12) follows by choosing and some calculations.
(Step 3). By (9.12), for , on the event , it holds with probability at least that
The estimate in is uniform in , so the claim follows. ∎
Finally we delocalize the range constraints for in the deterministic minimax problem with in the above proposition.
Proposition 9.5.
Suppose , and for some . There exists some such that for any ,
Consequently,
(9.13) |
Proof.
The proof is essentially a deterministic version of Step 2 in the proof of Proposition 9.3. We give some details below. We write . First, using calculations similar to those of (9.7),
Then the first-order optimality condition for to be the saddle point of , i.e., a deterministic version of (9.8), is given by
Finally using the apriori estimates in Proposition 8.1, we obtain a deterministic analogue of (9.9) in that . The claimed localization follows. The continuity follows by the definition of and the proven localization. ∎
9.3. Locating the global minimizer of the Gordon objective
With denoting the unique solution to the system of equations (2.1), let
(9.14) |
For any , let the exceptional set be defined as
(9.15) |
Theorem 9.6.
Suppose the following hold for some .
-
•
, .
-
•
Assumption B holds with .
Fix any that is -Lipschitz with respect to . There exist constants depending on such that for , , and ,
Roughly speaking, the above theorem will be proved by approximating both from above and below by nicer, strongly convex surrogate functions whose minimizers can be directly located. Then we may relate the minimizer of and those of the surrogate functions.
We first formally define these surrogate functions. For , let
(9.16) |
Again we omit notational dependence of on for simplicity.
The following lemma provides uniform (bracketing) approximation of via on compact sets.
Lemma 9.7.
Fix . The following hold when .
-
(1)
For any , .
-
(2)
For any ,
Proof.
Next, we will study the properties of the global minimizers for .
Proposition 9.8.
Suppose , and for some . There exists some constant such that for any deterministic choice of , on the event (defined in Proposition 8.2) with and , for any , the maps attain their global minimum at with . Moreover, .
Proof.
Note that the optimization problem
Here in we used Sion’s min-max theorem to exchange minimum and maximum, as the maximum is taken over a compact set. The difference between the above minimax problem and (9.2) rests in its range constraint on . As proven in (9.9), all solutions to the unconstrained minimax problem (9.2) must satisfy on the event . So on this event, for the choice for some large , exactly corresponds to (9.2), whose minimizers admit the apriori estimate (9.10) (with minor modifications that change to the stronger estimate in ).
Finally we shall relate back to the global minimizer of . We note that the proposition below by itself is not formally used in the proof of Theorem 9.6, but will turn out to be useful in the proof of Theorem 2.3 ahead.
Proposition 9.9.
Suppose the conditions in Theorem 9.6 hold for some . There exist constants depending on such that for , , and ,
Proof.
Let us fix .
(Step 1). We first prove the apriori estimate for . To this end, for large enough depending on , choosing and in Proposition 9.8, it follows that
(9.17) |
On the other hand, choosing with leads to
(9.18) |
On , we may characterize the value of by applying Propositions 9.3-9.5: for ,
(9.19) |
Note by the strong convexity of with respect to , we have
This means on ,
This in particular means on ,
Consequently, by enlarging if necessary, using Lemma 9.7-(1), on
This implies, on , we have , proving the apriori bound.
(Step 2). Next we establish the announced error bound. On the event , by Lemma 9.7-(2),
(9.20) |
Consequently, on ,
(9.21) |
On this event, combining (9.20)-(9.21) with (9.3), and using again the strong convexity of with respect to , we have for ,
This means that on . The claim follows by intersecting the prescribed event with in (9.18) that controls the -probability of . ∎
Proof of Theorem 9.6.
Fix , and to be chosen later on. First, as is Lipschitz with respect to , by the Gaussian concentration inequality, there exists such that for , on an event with -probability at least ,
Moreover, by Proposition 9.8 and Propositions 9.3-9.5, there exist some depending on such that for , on an event with -probability , we have
-
(1)
, , and
-
(2)
Consequently, for , on the event , uniformly in ,
This implies that, for the prescribed range of and on the event ,
Using the strong convexity of with respect to , we have for , on the event ,
Now we may choose to conclude by adjusting constants. ∎
9.4. Proof of Theorem 2.3 for
Fix . All the constants in below may depend on .
(Step 1). In this step, we will obtain an upper bound . By Proposition 9.1 and the concentration estimate in Lemma 7.6, there exists some such that on an event with ,
(9.22) |
where
(9.23) |
Now we shall apply the convex(-side) Gaussian min-max theorem to obtain an upper bound for the right hand side of (9.22). Recall the definition of and in (6.2). Using Theorem 6.1-(2), for any ,
(9.24) |
By Proposition 9.9, there exist some depending on (which we assume without loss of generality exceed the constants in Theorems 9.2 and 9.6), such that on an event with -probability at least , the map attains its global minimum in . We may now apply Theorem 9.2: with , for ,
(9.25) |
Combining (9.4)-(9.4), by enlarging if necessary, for , and ,
(9.26) |
An entirely similar argument leads to a lower bound (which will be used later on):
(9.27) |
(Step 2). In this step, we will obtain a lower bound on for the exceptional set defined in (9.15), with a suitable choice of . Let us take to be the constants in Theorem 9.6, and let for . To this end, using Theorem 6.1-(1) (that holds without convexity), for any and
By choosing of constant order but large enough, and , we have for ,
(9.28) |
(Step 3). Combining (9.28) and the localization in (9.22), there exist some depending on such that for , on an event with ,
So on , , i.e., for ,
Using a change of variable and suitably adjusting the constant , for any -Lipschitz function , and ,
(Step 4). In this step we shall establish uniform guarantees. We write in this part of the proof. First, in the case , using , for ,
(9.29) |
Here the last inequality follows from the fact that for any p.s.d. matrix , . As under , there exists such that on an event with ,
(9.30) |
On the other hand, note that for , using Proposition 8.1-(3),
(9.31) |
So we have
(9.32) |
Now by taking an -net of and a union bound,
(9.33) |
By adjusting constants, we may replace by . We then conclude by further taking expectation with respect to , and noting that .
9.5. Proof of Theorem 2.3 for
Recall the cost function defined in (6.2). It is easy to see that
(9.34) |
We shall define the ‘population’ version of as
(9.35) |
in the Gordon problem.
Proposition 9.10.
Suppose the following hold for some .
-
•
, .
-
•
Assumption B holds with .
There exist constants depending on such that for and ,
satisfying and .
We need the following before the proof of Proposition 9.10.
Lemma 9.11.
Suppose , and for some . Recall defined in (9.14). Then there exist constants depending on such that for , and ,
Proof.
All the constants in below may depend on . Recall . Under the assumed conditions, . We shall consider the four terms separately below.
For the first term, we have
The concentration of the term can be handled using Gaussian tails and the fact that . For the term , with , it is easy to evaluate and , so Proposition B.1 applies to conclude the concentration of .
For the second term, we may decompose
From here we may handle the concentration of the above two terms in a completely similar fashion to and above.
For the third term, recall that , so
The concentration properties of the two terms on the right hand side above can be handled similarly to the case for the second term.
For the last term, we have
On the other hand, on the event ,
and . Combining the above estimates concludes the concentration claim for the last term. ∎
Proof of Proposition 9.10.
Fix . All the constants in below may depend on .
(Step 1). In this step, we establish both the uniqueness and the apriori estimates for . Using Lemma 9.11, we may choose a sufficiently large depending on such that ,
Therefore, on the event ,
Note that ; by choosing sufficiently large , we conclude on the event . This implies that is -strongly concave with respect to , so exists uniquely on .
Next we derive apriori estimates. We claim that on , takes the following form:
(9.36) |
To see this, using the definition
Some simple algebra leads to the expression in (9.36). The boundedness of then follows from the boundedness of .
(Step 2). In this step, we establish the bound on . The key observation is that we may rewrite defined via (9.35) into the following form
(9.37) |
This can be seen by observing
(9.38) |
and therefore . Now with (9.36)-(9.37), we may use Lemma 9.11 to estimate
(9.39) |
We first handle the term . As , on the event ,
(9.40) |
Next we handle . On the event ,
(9.41) |
(Step 3). In this step, we prove the claimed bound on . First note that
(9.42) |
On the other hand, with , ,
so we may rewrite as follows:
Further using the second and third equations in (9.38), it now follows that
(9.43) |
Now combining (9.5) and (9.5), on the event , we may estimate
completing the proof. ∎
Proof of Theorem 2.3 for .
Fix . All the constants in below may depend on . We sometimes write .
As , we only need to study . Fix , and any , let
(Step 1). In this step we establish the Gordon cost cap: there exist constants depending on such that for ,
(9.44) |
To this end, first note that by the Lipschitz property of , the Gaussian concentration and Proposition 9.10, there exist some depending on such that for , on an event with probability at least , we have uniformly in ,
and all the properties in Proposition 9.10 hold. In other words, on with the prescribed range of ,
Using the -strong concavity of on , we have
By choosing , we have on ,
(9.45) |
Adjusting constants proves the claim in (9.44).
(Step 2). In this step, we provide an upper bound for the original cost over the exceptional set. More concretely, we will prove that there exist constants depending on such that for any , and ,
(9.46) |
To see this, first note that by Proposition 9.9, there exists some such that on an event with , . So with , for any , an application of Theorem 6.1-(1) yields that for ,
proving the claim (9.46) after possibly adjusting constants.
(Step 3). In this step, we recall a lower bound for the original cost optimum, essentially established in Step 1 of the proof of Theorem 2.3. In particular, using (9.22), (9.23) and (9.27), there exist depending on , such that for ,
(9.47) |
and
(9.48) |
(Step 4). By choosing without loss of generality , on the event , (9.5)-(9.48) yield that for any ,
This means on the event , , i.e., there exist some depending on such that for and ,
(9.49) |
(Step 5). In this final step, we shall prove a uniform version of the estimate (9.49). For , using the definition of in (9.34),
Using that , we have
(9.50) |
In view of (9.30), there exists some depending on , such that on an event with ,
(9.51) |
On the other hand, using the definition of in (9.35), Proposition 8.1-(3) and the fact that , we have
(9.52) |
Now we may mimic the proof in (9.4) to conclude that, by possibly enlarging , for any and ,
as desired. ∎
10. Universality: Proof of Theorem 2.4
10.1. Comparison inequalities
For , let
The following theorem is proved in [HS22, Theorem 2.3].
Theorem 10.1.
Suppose for some . Let be two random matrices with independent components, such that and for all . Further assume that
Let and . Then there exists some such that the following hold: For any with , and any , we have
Here , and is defined by
where with the supremum taken over all such that . Consequently, for any ,
Here is an absolute multiple of .
Theorem 10.2.
Let be two random matrices with independent entries and matching first two moments, i.e., for all . There exists a universal constant such that the following hold. For any measurable subsets , with , and any , we have
Here , , , and with the supremum taken over all such that . The conclusion continues to hold when max-min is flipped to min-max.
10.2. Delocalization
Recall that defined in (1.3) can be rewritten as
For any , we have the following closed form for :
(10.2) |
The above formula does not include the interpolating case when . To give an alternative expression, note that the first-order condition for the above minimax optimization is , , or equivalently,
(10.3) |
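The closed form (10.2) and its interpolating limit can be checked numerically; the following sketch uses a toy Gaussian design and the textbook normalization (which may differ from the paper's scaling by constants), and recovers the minimum-norm interpolator as the regularization tends to zero via the pseudoinverse.

```python
# Minimal sketch (toy data; normalization may differ from the paper's by
# constants): the Ridge estimator
#     theta_hat(lam) = (X^T X / n + lam * I)^{-1} X^T Y / n
# converges, as lam -> 0+, to the minimum-l2-norm interpolator pinv(X) @ Y
# in the overparametrized regime p > n.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 300
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p) / np.sqrt(p)
Y = X @ theta_star + 0.5 * rng.standard_normal(n)

def ridge(X, Y, lam):
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ Y / n)

theta_minnorm = np.linalg.pinv(X) @ Y          # Ridgeless (min-norm) interpolator
print("interpolation residual:", np.linalg.norm(X @ theta_minnorm - Y))
for lam in [1.0, 1e-2, 1e-4, 1e-6]:
    print(f"lam={lam:.0e}  ||ridge - ridgeless|| = "
          f"{np.linalg.norm(ridge(X, Y, lam) - theta_minnorm):.3e}")
```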
The following proposition proves delocalization for and .
Proposition 10.3.
Remark 6.
Delocalization in the same sense as in the above proposition holds for with any deterministic matrix and vector satisfying , with a (slightly) different construction of .
Proof of Proposition 10.3.
All the constants in below may depend on .
(1). Let us consider delocalization for . Using (10.3), for any ,
(10.4) |
We first handle . Let be the asymptotic eigenvalue density of and fix . By [KY17, Theorem 3.16-(i), Remark 3.17 and Lemma 4.4-(i)], for any small and large ,
holds for all . With , by further using the simple relation , the error bound in the above display can be replaced by .
When , according to [BS10, Theorem 6.3-(2)], for some constant . Therefore, for with a small enough to be chosen later, it is easy to see that . Hence, on an event with ,
(10.5) |
where
-
•
,
-
•
.
By a derivative calculation, it is easy to derive
Now by using the concentration result in [RV09, Theorem 1.1], on an event with , we have .
For , using the boundedness of around for , we may estimate
Combining the above estimates, for chosen small enough, say, , on the event ,
Using and the definition of , and recalling defined in (9.14), we then have
(10.6) |
The term can be handled similarly, now reading off the element in [KY17, Eqn. (3.10)], which shows that for any ,
(10.7) |
Combining (10.2), (10.6) and (10.7), we have
(10.8) |
Now we will construct with the desired volume estimate, and . To this end, we place a uniform prior on , where and are independent of all other random variables. Then . Using Proposition 8.1-(3) and a standard Gaussian tail bound, . Moreover, . The pointwise-in- delocalization claim on follows. As is -Lipschitz with exponentially high probability, the uniform version follows by a standard discretization and union bound argument.
(2). Let us consider delocalization for . Using again (10.3), for any ,
The term can be handled by reading off the element in [KY17, Eqn. (3.10)], which shows that
(10.9) |
The term relies on the local law described by the element in [KY17, Eqn. (3.10)]: for any ,
(10.10) |
Consequently, combining (10.9)-(10.10), we have
The claim follows. ∎
10.3. Universality of the global cost optimum
Theorem 10.4.
Proof.
Fix , and as specified in Proposition 10.3. Let . By the same proposition, with -probability at least ,
(10.11) |
and
(10.12) |
By writing , we have
Now with , for , by applying Theorem 10.2, we have for any ,
(10.13) |
Replicating the last paragraph of the proof of [HS22, Theorem 2.3] (right above Section 4.3 therein), for any ,
Combined with (10.3)-(10.3), we have
In view of (9.26) (in Step 1 of the final proof of Theorem 2.3), for , we take and therein, so that for ,
Combining the estimates, for , ,
The first term above can be assimilated into the second one, and can be dropped. The lower bound follows similarly by utilizing (9.27). ∎
10.4. Universality of the cost over exceptional sets
Theorem 10.5.
Proof.
(Step 1). Let . For any and , with ,
(10.14) |
Now we may apply Theorem 10.1. To do so, let us write to match the notation. Then a simple calculation leads to
Consequently, an application of Theorem 10.1 leads to
Here in the last inequality we simply drop the constraint. Now for , by choosing and in Theorem 9.6, where is the constant therein, we have
(10.15) |
The constraints can be removed by enlarging if necessary.
(Step 2). In this step we shall trade the dependence of the above bound with respect to with a possibly worsened dependence on , primarily in the regime . Fix . Let be chosen later and . Without loss of generality we assume , so by (9.13) in Proposition 9.5, . By enlarging if necessary we assume that exceeds the constant in Lemma 10.6. Using Lemma 10.6, for , with the choice (we assume without loss of generality ),
The proof is complete by adjusting constants. ∎
Lemma 10.6.
Suppose . Let be -Lipschitz with respect to . Then there exists some constant such that for any with , we have .
10.5. Proof of the universality Theorem 2.4 for
Fix , and . Let , and . We assume that exceeds the constants in Proposition 10.3 and Theorem 10.5. By Proposition 10.3 and a simple estimate, . We further let for .
Let be -Lipschitz with respect to . Then for and , we have
Here in the last inequality we used the simple fact that
Invoking Theorems 10.4 and 10.5, by enlarging if necessary, we have for and ,
or equivalently, for being -Lipschitz with respect to ,
Now we may follow Step 4 in the proof of Theorem 2.3 to strengthen the above statement to a uniform one in ; we only sketch the differences below. Using (9.4) with therein replaced by , and the assumption , we arrive at a modified form of (9.30): on an event with , for any ,
(10.16) |
Using (9.31) with , we arrive at a modified form of (9.32): for any ,
Now using a standard discretization and a union bound, we have
The proof is complete by taking expectation with respect to and noting that as in Proposition 10.3. ∎
10.6. Proof of the universality Theorem 2.4 for
Proposition 10.7.
Proof.
Fix , and as specified in Proposition 10.3. We define a renormalized version of as
where . Let and be defined as in the proof of Theorem 10.4. Then we have,
Using the comparison inequality in Theorem 10.2 and a similar calculation as in (10.3), with ,
Using the convex Gaussian min-max theorem (cf. Theorem 6.1),
(10.17) |
On the other hand, using the definition of in (9.14), and the fact that for any , , we have . Combined with (10.6), we have
In view of (9.45), now by choosing and , for , , it follows that
The claim follows by adjusting constants. ∎
Proof of Theorem 2.4 for .
Fix , and as specified in Proposition 10.3. We continue writing in the proof. Using the delocalization results in Proposition 10.3, on an event with , we have with . Using Theorem 10.4, for , and , by possibly adjusting ,
(10.18) |
Let us take to be the constant in Proposition 10.7. By noting that
it follows from (10.6) and Proposition 10.7 that
Finally we only need to extend the above display to a uniform control over by continuity arguments similar to Step 5 of the proof of Theorem 2.3 for . By (9.50) (where therein is replaced by ) and (10.16), on an event with , for any ,
On the other hand, (9.5) remains valid, so we may proceed with an -net argument over to conclude. ∎
10.7. Proof of Theorem 2.5
To keep notation simple, we work with and write . The general case follows from minor modifications.
Lemma 10.8.
Suppose the conditions in Theorem 2.5 hold for some . Fix . There exists some constant such that uniformly in .
Proof.
We may write for some with , and not necessarily independent of each other. So for some ,
If for a large enough , the lower bound follows trivially. Otherwise, with , we have and , so by the Paley-Zygmund inequality, for some . In other words, on an event with , . Using the above display, this means that . ∎
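For the reader's convenience, the Paley-Zygmund inequality used above reads as follows; the notation is generic.

```latex
% Paley--Zygmund inequality: for a nonnegative random variable $Z$ with
% $\mathbb{E} Z^2 < \infty$ and any $\theta \in [0,1]$,
\[
  \mathbb{P}\big( Z > \theta\, \mathbb{E} Z \big)
  \;\ge\; (1-\theta)^2\, \frac{(\mathbb{E} Z)^2}{\mathbb{E} Z^2}.
\]
```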
Lemma 10.9.
Suppose the conditions in Theorem 2.5 hold for some . Fix . Then there exist constants , , and a measurable set with , such that
Here .
Proof.
We write , for notational simplicity in the proof. All the constants in below may depend on . Recall the general fact for and .
By Proposition 11.2 below, for any , there exists some with , such that . Consequently, uniformly in and ,
(10.19) |
For , let , and
Then for ,
By the Gaussian concentration inequality, for any , we may find some with , , such that uniformly in ,
(10.20) |
As , for small enough, uniformly in ,
(10.21) |
Combining (10.7)-(10.7), for small enough,
(10.22) |
Now let . Using that , we have . So with
we have . In other words, for this constructed set , we have the desired volume estimate , and by (10.22),
(10.23) |
On the other hand, using the definition of in (2.7), we may compute
(10.24) |
Combining (10.7), (10.23) and (10.24), for chosen small enough,
The claim follows from Lemma B.2. ∎
Proof of Theorem 2.5.
We write in the proof.
First we consider . This is the easy case, as is -Lipschitz with respect to . So applying Theorems 2.3 and 2.4 verifies the existence of some small such that for some with ,
The ratio formulation follows from Lemmas 10.8 and 10.9 by further intersecting and the set therein.
Next we consider . Let for some to be chosen later. Using Proposition 10.3 and its proof below (10.8), for chosen small enough, we may find some with the desired volume estimate, such that , and
(10.25) |
where we choose sufficiently large. Recall for and , . This motivates the choice
which verifies that is -Lipschitz with respect to . Using (10.25),
(10.26) |
and with ,
(10.27) |
As the map is -Lipschitz with respect to , Gaussian concentration yields
Using the Lipschitz property of the maps, we may strengthen the above inequality to a uniform control over . This means uniformly in ,
(10.28) |
Combining (10.7)-(10.28), we have uniformly in ,
(10.29) |
Combining (10.26) and (10.29) proves the existence of some small and some with the desired volume estimate, such that
The ratio formulation follows again from Lemmas 10.8 and 10.9. ∎
11. Proofs for Section 3
11.1. Proof of Theorem 3.1
For chosen small enough, we fix , where is specified in Theorem 2.4. We omit the subscripts in , , and write in the proof. All the constants in and below may possibly depend on .
(1). Consider the case . We omit the superscript as well. Using Theorem 2.4-(1) with , on an event with ,
By the Gaussian-Poincaré inequality, . As uniformly in , on ,
(11.1) |
On the other hand, using both the standard form and the alternative form , we have
(11.2) |
Consequently, on an event with ,
(11.3) |
Finally, using (11.1) and (11.3), on ,
The claim follows. The case follows from minor modifications so will be omitted.
(2). Consider the case . We omit the superscript as well. Further fix as specified in Theorem 2.4 (the concrete form of is given in Proposition 10.3). Using the same Theorem 2.4-(2) with ,
By the Gaussian-Poincaré inequality, . Combined with the fact that , for , using the stability estimate in Proposition 8.1-(3),
So for ,
Now taking expectation over , for the same range of ,
(11.4) |
On the other hand, using (11.2),
Consequently, on an event with , , and therefore
The claim follows. The case proceeds similarly, but with the function now taken as , and the claim follows by computing that
The proof is complete. ∎
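For reference, the Gaussian-Poincaré inequality invoked twice in the proof above is the following (see e.g. [BLM13]); the notation is generic.

```latex
% Gaussian--Poincar\'e inequality: for $g \sim N(0, I_d)$ and any weakly
% differentiable $f$ with $\mathbb{E}\|\nabla f(g)\|_2^2 < \infty$,
\[
  \operatorname{Var}\big( f(g) \big) \;\le\; \mathbb{E}\,\big\|\nabla f(g)\big\|_2^2 .
\]
```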
11.2. Proof of Theorem 3.2
Lemma 11.1.
Suppose , and for some . Then with , there exists some such that for , and ,
Proof.
We only prove the case . All the constants in below may depend on . We write for notational simplicity. Note that
Here in the last inequality we used . As
-
•
, and
-
•
,
we have uniformly in , . It is easy to see that . So by the Hanson-Wright inequality, there exists some constant such that for ,
On the other hand, for any , using Proposition 8.1-(3),
so we may conclude by a standard discretization and union bound argument. ∎
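For the reader's convenience, the Hanson-Wright inequality used above is recalled in one standard form (cf. Rudelson and Vershynin); the constant and the sub-gaussian norms are generic.

```latex
% Hanson--Wright inequality: if $X \in \mathbb{R}^n$ has independent,
% mean-zero components with $\|X_i\|_{\psi_2} \le K$, then for any
% $A \in \mathbb{R}^{n \times n}$ and $t \ge 0$,
\[
  \mathbb{P}\Big( \big| X^\top A X - \mathbb{E}\, X^\top A X \big| > t \Big)
  \;\le\; 2 \exp\Big( -c \min\Big\{ \tfrac{t^2}{K^4 \|A\|_F^2},\,
  \tfrac{t}{K^2 \|A\|_{\mathrm{op}}} \Big\} \Big)
\]
for some universal constant $c > 0$.
```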
Proposition 11.2.
The following hold with , .
-
(1)
and .
-
(2)
It holds that
-
(3)
Suppose , and for some . There exists some constant such that the following hold. For any , for some with ,
When , we may take and the above inequality holds with .
Proof.
(1) follows from definition so we focus on (2)-(3).
(2). Differentiating both sides of (2.3) with respect to yields that
Now using , we obtain the formula for .
Next, using that , we may solve
Differentiating with respect to on both sides of the above display, we obtain
proving the second identity.
(3). Let , where and are independent variables. Then is uniformly distributed on . For some to be chosen later, let
(11.5) |
Let . Using Lemma 11.1, there exists some constant such that , and moreover,
Note that when , the above estimate holds for all with .
Combining the above display with the formula (8.3) for , and the fact that the denominator therein is of order (depending on ), we have
Now using (2), the second term in the above display equals
The claim follows by adjusting constants. ∎
Proof of Theorem 3.2.
As , directly invoking Proposition 11.2-(3) yields the claim for .
Next we handle . Note that
(11.6) |
Using a similar construction as in the proof of Proposition 11.2 with the help of Lemma 11.1, this time with therein, we may find some with the desired volume estimate, such that both Proposition 11.2-(3) and
(11.7) |
hold. Combining (11.6)-(11.7), we may set
By Proposition 11.2-(2), we may compute separately:
Consequently,
The claims for and follow from Proposition 11.2-(3). ∎
11.3. Proof of Proposition 3.3
We will prove the following version of Proposition 3.3, where is represented via instead of . In the proof below, we will also verify the representation of via as stated in Proposition 3.3.
Proposition 11.3.
Recall . Then for ,
Here with for ,
Suppose further and for some . Then there exists some such that uniformly in and for all ,
-
(1)
, and
-
(2)
if additionally ,
Proof.
In the proof we write . Recall the notation , , and we naturally write . By differentiating both sides of with respect to , with some calculations we have
(11.8) |
Using , we may also write for . Here by convention .
(1). Using the formula for ,
Some calculations show that
so the identity follows.
(2). Using the formula for ,
(11.9) |
To compute the second term in the above display, recall the identity for in (8.5)-(8.1). Also recall defined in (8.5). Then
Here in we used . The claimed identity follows by combining the above display and (11.3). Using , we may write
(3). Using the formula for ,
(11.10) |
The second term in the above display requires some non-trivial calculations:
Expanding the terms in the bracket using , with some calculations we arrive at
The claimed identity follows by combining the above display and (11.3). Using , we may write
Finally, the claimed first two-sided bound on follows from Proposition 8.1, and the second bound follows by using the fundamental theorem of calculus. ∎
11.4. Proof of Theorem 3.4
The following lemma gives a technical extension of Theorem 3.1 for under when . For , the extension also allows uniform control over under both the above small variance scenario with , and under the original conditions.
Lemma 11.4.
Proof.
All the constants in below may possibly depend on .
(Part 1). We shall first extend the claim of Theorem 3.1 for to in the case . Note that uniformly in , for ,
(11.11) |
Using the estimate (11.2), uniformly in , for all ,
So on an event with , for ,
On the other hand, using Lemma 11.5-(2),
Using the above two displays, for any , by choosing , we have for any ,
(11.12) |
The first term on the right hand side of the above display can be handled by the proven claim in Theorem 3.1, upon noting that (i) the constant therein depends on polynomially, and here we choose to be larger than ; (ii) holds for chosen much larger than .
The extension of the claim of Theorem 3.1 for to follows a similar proof with minor modifications, so we omit the details.
(Part 2). Next we consider the case . We need to extend the corresponding claim of Theorem 3.1 to both and .
We first verify the (high probability) Lipschitz continuity of the maps . Note that uniformly in , by virtue of (11.11), for any ,
This verifies the high probability Lipschitz property of . The Lipschitz property of is easily verified. From here we may use a similar argument to (11.4) to conclude the extension of the claim of Theorem 3.1 for to .
Finally we verify the (high probability) Lipschitz continuity of the maps . Using the estimates (9.4) (with replaced by ) and (11.2), uniformly in and ,
The Lipschitz property of is again easily verified. Again from here we may argue similarly to (11.4) to extend the claim of Theorem 3.1 for to . The case for is similar so we omit repetitive details. ∎
Lemma 11.5.
Suppose . The following hold.
-
(1)
The system of equations
admits a unique solution .
-
(2)
It holds that . If furthermore and for some , then there exists some such that .
Proof.
Proof of Theorem 3.4.
Let be as specified in Theorem 3.1 or 2.4. In view of its explicit form given in Proposition 10.3, with , the volume estimates hold.
On the other hand, using the construction around (11.5), we may find some (for the latter, we take therein) with , such that for ,
(11.13) |
Now let
(11.14) |
Then we have the volume estimates .
12. Proofs for Section 4
12.1. Proof of Theorem 4.1
All the constants in may depend on .
Proof of Theorem 4.1 for .
Let be defined in the same way as in the proof of Proposition 10.3. Using a similar local law and continuity argument as in the proof of that proposition, on an event with ,
So on , where with , uniformly in ,
Here in the last inequality, we use the following estimate for : As is the Stieltjes transform of (cf. [KY17, Lemma 2.2]), for , and
The claim follows. ∎
Proof of Theorem 4.1 for .
Using Theorem 3.1, the stability of in Proposition 8.1, and the proven fact in (1) on , it holds for that
(12.1) |
Next we consider extension to in the regime . By the KKT condition, we have , so a.s. for any . So we only need to verify the high probability Lipschitz continuity for : for any , using the estimate (9.4) (with replaced by ) we obtain, for some universal ,
Finally we consider extension to in the same regime by verifying a similar high probability uniform-in- Lipschitz continuity property for : for any , using the estimate (11.11),
The claimed bound follows. ∎
12.2. Proof of Theorem 4.2
Recall we have . For both the case and with , we take as constructed in (11.14) in the proof of Theorem 3.4, with . Fix , then . Using Theorems 3.2 and 4.1, on an event with ,
(12.2) |
This in particular implies that on , both the following inequalities hold:
(12.3) |
Using the definition of which gives , the above two displays can be used to relate and : on the event ,
(12.4) |
As , for . Consequently, by the second inequality in Proposition 11.3, we have on the event ,
(12.5) |
This means on , for both ,
We may conclude from here by virtue of Theorems 3.1 and 3.2, together with Lemma 11.4. ∎
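Since Section 4 concerns tuning by generalized cross-validation, a minimal numerical sketch of the classical GCV criterion for Ridge regression [GHW79, CW79] is included below for orientation; the normalization of the smoother matrix and of the regularization here is the textbook one and may differ from the paper's, and the data and the grid are illustrative.

```python
# Minimal sketch of the classical GCV criterion for Ridge (textbook
# normalization; may differ from the paper's by constants):
#     GCV(lam) = (1/n) ||(I - S_lam) Y||^2 / (1 - tr(S_lam)/n)^2,
#     S_lam    = X (X^T X + n*lam*I)^{-1} X^T,
# minimized over a grid of lam > 0.
import numpy as np

def gcv_curve(X, Y, lams):
    n, p = X.shape
    scores = []
    for lam in lams:
        S = X @ np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T)
        resid = Y - S @ Y
        scores.append((resid @ resid / n) / (1.0 - np.trace(S) / n) ** 2)
    return np.array(scores)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 100, 300
    X = rng.standard_normal((n, p))
    Y = X @ (rng.standard_normal(p) / np.sqrt(p)) + rng.standard_normal(n)
    lams = np.logspace(-3, 1, 30)
    print("GCV-selected lambda:", lams[np.argmin(gcv_curve(X, Y, lams))])
```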
12.3. Proof of Theorem 4.3
Lemma 12.1.
Proof.
All the constants in below may depend on . We only need to prove (2). The method of proof is similar to that of Proposition 8.1-(3). Instead of considering (12.6), we shall consider the system of equations
(12.7) |
indexed by . For , the solution exists uniquely for and also for if additionally . Moreover, using the apriori estimate in Proposition 8.1-(2), we have uniformly in and , . Now differentiating on both sides of the second equation in (12.7) with respect to , we obtain
This means uniformly in and , . Next, using the first equation in (12.7), we obtain
Using similar calculations as in (8.9)-(8.10), we have uniformly in and , , and . This concludes the claim. ∎
Proof of Theorem 4.3.
All the constants in below may depend on .
As and is independent of , by using Lemma B.3 first conditionally on and then further taking expectation over , we have for ,
Here is a universal constant. Using similar arguments as in (11.3) (by noting that the normalization in is still ), there exists some constant such that for any , on an event with , . This means that for any , on the event ,
(12.8) |
On the other hand, using Theorem 3.1, we may find some with , such that for , on an event with , for ,
Here is taken from Lemma 12.1, and we extend the definition to with and . Using the statement (2) of the same Lemma 12.1, on the event , we then have
Replacing by yields that, on ,
(12.9) |
Combining (12.8)-(12.9), for , and ,
(12.10) |
Now we strengthen the estimate (12.3) into a uniform version. It is easy to verify that on an event with , , , and for , . So on , for ,
and
From here, using (i) (12.3) along with a discretization and union bound that strengthens (12.3) to a uniform control, and (ii) Theorem 3.1 which replaces by , we obtain for and ,
Now with the same as in (11.14) using , for any , we may further replace by in the above display (with a possibly slightly larger , but for notational simplicity we abuse this notation). In summary, for any , on an event with ,
From here, using similar arguments as in (12.2)-(12.4), on the event ,
Similar to (12.5), on the event , we have
(12.11) |
From here we may argue along the same lines as those following (12.5) in the proof of Theorem 4.2 to conclude with probability estimated at , by further noting that satisfies the desired volume estimate. Under the further condition , by taking , simplifies as indicated in the statement of the theorem for large. ∎
12.4. Proof of Theorem 4.4
We only prove the case for ; the other case is similar. All constants in and may possibly depend on . Let be as constructed in (11.14) with .
(1). We first prove the statement for the length of the CI. Note that . By Theorem 4.1-(2), on an event with the probability indicated therein,
Consequently, on the event , for any ,
As in the proof of Theorem 4.2, for , , so by using Proposition 11.3-(2), on the event , for any ,
The above reasoning also proves that on the same event , for any ,
From here, in view of (12.5), by adjusting constants, on an event with , it holds that
(12.12) |
This proves the claim for the length of the CI.
(2). Next we prove the statement for the coverage. We note that a similar Lipschitz continuity argument as in the proof of Lemma 11.4 shows that for any -Lipschitz , on an event with ,
(12.13) |
On the other hand, using the Lipschitz continuity of in Proposition 8.1-(3),
So by enlarging if necessary, we may assume without loss of generality that on , we have
(12.14) |
Now we shall make a good choice of in (12.4). Let and be a function such that on , on , and linearly interpolated in . Let
(12.15) |
It is easy to verify the Lipschitz property of : for any , . Consequently, we may apply (12.13) with defined in (12.15) to obtain that on the event ,
(12.16) |
Now using and the anti-concentration of the standard normal random variable, we may compute
(12.17) |
Combining the above two displays (12.4)-(12.4), on the event ,
Finally, choosing concludes the upper control. The lower control can be proved similarly, so we omit the details. ∎
Appendix A Further results on
For , recall defined in Theorem 3.2, and the optimally tuned risks defined as with .
Proposition A.1.
We have
Consequently,
-
(1)
, and ;
-
(2)
, , and .
Proof.
We only need to verify (2). This follows from the formula for and the following simple consequences of the second equation of (2.1):
-
•
,
-
•
,
-
•
.
The proof is complete. ∎
Appendix B Auxiliary results
Proposition B.1.
Let be a non-negative, differentiable function. Suppose there exists some deterministic such that almost surely for . Then there exists some universal constant such that for all ,
Proof.
The method of proof via the Gaussian log-Sobolev inequality and Herbst's argument is well known. We give some details for the convenience of the reader. Let be the centered version of , and . Then . By the Gaussian log-Sobolev inequality (see e.g., [BLM13, Theorem 5.4], or [GN16, Theorem 2.5.6]),
With denoting the moment generating function of , the above inequality is equivalent to
Now dividing on both sides of the above display, we have . Integrating both sides with the condition and , we arrive at . Solving for and using the standard method to convert to a tail bound yields the claimed inequality. ∎
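As a sanity check of the type of bound produced by Proposition B.1 and the Gaussian concentration used throughout, the following sketch compares the empirical tail of a 1-Lipschitz function of a standard Gaussian vector with the bound 2exp(-t^2/2); the dimension, the function and the sample size are arbitrary illustrative choices.

```python
# Monte-Carlo illustration: for f 1-Lipschitz and g ~ N(0, I_d),
#     P(|f(g) - E f(g)| >= t) <= 2 exp(-t^2 / 2).
# Here f(g) = ||g||_2, which is 1-Lipschitz.
import numpy as np

rng = np.random.default_rng(0)
d, reps = 100, 50_000
samples = np.linalg.norm(rng.standard_normal((reps, d)), axis=1)
center = samples.mean()                      # empirical proxy for E f(g)
for t in [0.5, 1.0, 1.5, 2.0]:
    empirical = np.mean(np.abs(samples - center) >= t)
    print(f"t={t:3.1f}  empirical tail = {empirical:.2e}  bound = {2*np.exp(-t**2/2):.2e}")
```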
Lemma B.2.
Let be an invertible covariance matrix with for some . Then for any , there exists some such that
where .
Proof.
Let . We first prove that for some ,
(B.1) |
The upper bound in the above display is trivial. For the lower bound, using , we find . This proves (B.1).
As , the map is -Lipschitz with respect to . So by Gaussian concentration, for any ,
Consequently, using the above concentration and (B.1),
By choosing for some sufficiently large , we have
The upper bound follows similarly. ∎
Lemma B.3.
Let be a random matrix with independent, mean-zero, unit variance, uniformly sub-gaussian components. Suppose the coordinates of are i.i.d. mean zero and uniformly subgaussian with variance , and are independent of . Then there exists some universal constant such that for any and , with probability at least ,
Proof.
Let be the rows of . Then
Using standard concentration estimates, with probability at least ,
-
•
,
-
•
,
-
•
.
Collecting the bounds concludes the proof. ∎
References
- [ASS20] Madhu S Advani, Andrew M Saxe, and Haim Sompolinsky, High-dimensional dynamics of generalization error in neural networks, Neural Networks 132 (2020), 428–446.
- [AZ20] Morgane Austern and Wenda Zhou, Asymptotics of cross-validation, arXiv preprint arXiv:2001.11111 (2020).
- [AZLS19] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song, A convergence theory for deep learning via over-parameterization, International Conference on Machine Learning, PMLR, 2019, pp. 242–252.
- [Bel20] Pierre C. Bellec, Out-of-sample error estimate for robust m-estimators with convex penalty, arXiv preprint arXiv:2008.11840 (2020).
- [BEM13] Mohsen Bayati, Murat A Erdogdu, and Andrea Montanari, Estimating lasso risk and noise level, Advances in Neural Information Processing Systems 26 (2013).
- [BHMM19] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal, Reconciling modern machine-learning practice and the classical bias-variance trade-off, Proc. Natl. Acad. Sci. USA 116 (2019), no. 32, 15849–15854.
- [BHX20] Mikhail Belkin, Daniel Hsu, and Ji Xu, Two models of double descent for weak features, SIAM J. Math. Data Sci. 2 (2020), no. 4, 1167–1180.
- [BLLT20] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler, Benign overfitting in linear regression, Proc. Natl. Acad. Sci. USA 117 (2020), no. 48, 30063–30070.
- [BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart, Concentration inequalities: A nonasymptotic theory of independence, Oxford University Press, Oxford, 2013.
- [BMR21] Peter L. Bartlett, Andrea Montanari, and Alexander Rakhlin, Deep learning: a statistical viewpoint, Acta Numer. 30 (2021), 87–201.
- [BS10] Zhidong Bai and Jack W. Silverstein, Spectral analysis of large dimensional random matrices, second ed., Springer Series in Statistics, Springer, New York, 2010.
- [BZ23] Pierre C. Bellec and Cun-Hui Zhang, Debiasing convex regularized estimators and interval estimation in linear models, Ann. Statist. 51 (2023), no. 2, 391–436.
- [CDK22] Chen Cheng, John Duchi, and Rohith Kuditipudi, Memorize to generalize: on the necessity of interpolation in high dimensional linear regression, Conference on Learning Theory, PMLR, 2022, pp. 5528–5560.
- [CM22] Chen Cheng and Andrea Montanari, Dimension free ridge regression, arXiv preprint arXiv:2210.08571 (2022).
- [CMW22] Michael Celentano, Andrea Montanari, and Yuting Wei, The lasso with general gaussian designs with applications to hypothesis testing, arXiv preprint arXiv:2007.13716v2 (2022).
- [COB19] Lenaic Chizat, Edouard Oyallon, and Francis Bach, On lazy training in differentiable programming, Advances in Neural Information Processing Systems 32 (2019).
- [CW79] Peter Craven and Grace Wahba, Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation, Numer. Math. 31 (1978/79), no. 4, 377–403.
- [Dic16] Lee H. Dicker, Ridge regression and asymptotic minimax estimation over spheres of growing dimension, Bernoulli 22 (2016), no. 1, 1–37.
- [DKT22] Zeyu Deng, Abla Kammoun, and Christos Thrampoulidis, A model of double descent for high-dimensional binary linear classification, Inf. Inference 11 (2022), no. 2, 435–495.
- [DSH23] Alexis Derumigny and Johannes Schmidt-Hieber, On lower bounds for the bias-variance trade-off, Ann. Statist., to appear. Available at arXiv:2006.00278 (2023+).
- [DvdL05] Sandrine Dudoit and Mark J. van der Laan, Asymptotics of cross-validated risk estimation in estimator selection and performance assessment, Stat. Methodol. 2 (2005), no. 2, 131–154.
- [DW18] Edgar Dobriban and Stefan Wager, High-dimensional asymptotics of prediction: ridge regression and classification, Ann. Statist. 46 (2018), no. 1, 247–279.
- [DZPS18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh, Gradient descent provably optimizes over-parameterized neural networks, arXiv preprint arXiv:1810.02054 (2018).
- [Efr04] Bradley Efron, The estimation of prediction error: covariance penalties and cross-validation, J. Amer. Statist. Assoc. 99 (2004), no. 467, 619–642, With comments and a rejoinder by the author.
- [EK13] Noureddine El Karoui, Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results, arXiv preprint arXiv:1311.2445 (2013).
- [EK18] Noureddine El Karoui, On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators, Probab. Theory Related Fields 170 (2018), no. 1-2, 95–175.
- [GHW79] Gene H. Golub, Michael Heath, and Grace Wahba, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21 (1979), no. 2, 215–223.
- [GKKW02] László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk, A Distribution-Free Theory of Nonparametric Regression, Springer Series in Statistics, Springer-Verlag, New York, 2002.
- [GN16] Evarist Giné and Richard Nickl, Mathematical foundations of infinite-dimensional statistical models, Cambridge Series in Statistical and Probabilistic Mathematics, vol. 40, Cambridge University Press, New York, 2016.
- [Gor85] Yehoram Gordon, Some inequalities for Gaussian processes and applications, Israel J. Math. 50 (1985), no. 4, 265–289.
- [Gor88] Yehoram Gordon, On Milman’s inequality and random subspaces which escape through a mesh in , Geometric aspects of functional analysis (1986/87), Lecture Notes in Math., vol. 1317, Springer, Berlin, 1988, pp. 84–106.
- [Han22] Qiyang Han, Noisy linear inverse problems under convex constraints: Exact risk asymptotics in high dimensions, arXiv preprint arXiv:2201.08435 (2022).
- [HK70] Arthur E. Hoerl and Robert W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970), no. 1, 55–67.
- [HKZ14] Daniel Hsu, Sham M. Kakade, and Tong Zhang, Random design analysis of ridge regression, Found. Comput. Math. 14 (2014), no. 3, 569–600.
- [HMRT22] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani, Surprises in high-dimensional ridgeless least squares interpolation, Ann. Statist. 50 (2022), no. 2, 949–986.
- [HS22] Qiyang Han and Yandi Shen, Universality of regularized regression estimators in high dimensions, arXiv preprint arXiv:2206.07936 (2022).
- [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler, Neural tangent kernel: Convergence and generalization in neural networks, Advances in Neural Information Processing Systems 31 (2018).
- [JWHT21] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An introduction to statistical learning—with applications in R, second edition, Springer Texts in Statistics, Springer, New York, 2021.
- [KL22] Nicholas Kissel and Jing Lei, On high-dimensional gaussian comparisons for cross-validation, arXiv preprint arXiv:2211.04958 (2022).
- [KLS20] Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez, The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization, J. Mach. Learn. Res. 21 (2020), Paper No. 169, 16.
- [KY17] Antti Knowles and Jun Yin, Anisotropic local laws for random matrices, Probab. Theory Related Fields 169 (2017), no. 1-2, 257–352.
- [KZSS21] Frederic Koehler, Lijia Zhou, Danica J Sutherland, and Nathan Srebro, Uniform convergence of interpolators: Gaussian width, norm bounds and benign overfitting, Advances in Neural Information Processing Systems 34 (2021), 20657–20668.
- [LD19] Sifan Liu and Edgar Dobriban, Ridge regression: Structure, cross-validation, and sketching, arXiv preprint arXiv:1910.02373 (2019).
- [LGC+21] Bruno Loureiro, Cedric Gerbelot, Hugo Cui, Sebastian Goldt, Florent Krzakala, Marc Mezard, and Lenka Zdeborová, Learning curves of generic features maps for realistic datasets with a teacher-student model, Advances in Neural Information Processing Systems 34 (2021), 18137–18151.
- [Li85] Ker-Chau Li, From Stein’s unbiased risk estimates to the method of generalized cross validation, Ann. Statist. 13 (1985), no. 4, 1352–1377.
- [Li86] Ker-Chau Li, Asymptotic optimality of and generalized cross-validation in ridge regression with application to spline smoothing, Ann. Statist. 14 (1986), no. 3, 1101–1112.
- [Li87] Ker-Chau Li, Asymptotic optimality for , , cross-validation and generalized cross-validation: discrete index set, Ann. Statist. 15 (1987), no. 3, 958–975.
- [LR20] Tengyuan Liang and Alexander Rakhlin, Just interpolate: kernel “ridgeless” regression can generalize, Ann. Statist. 48 (2020), no. 3, 1329–1347.
- [LS22] Tengyuan Liang and Pragya Sur, A precise high-dimensional asymptotic theory for boosting and minimum-1-norm interpolated classifiers, Ann. Statist. 50 (2022), no. 3, 1669–1695.
- [MM21] Léo Miolane and Andrea Montanari, The distribution of the Lasso: uniform control over sparse balls and adaptive parameter tuning, Ann. Statist. 49 (2021), no. 4, 2313–2335.
- [MM22] Song Mei and Andrea Montanari, The generalization error of random features regression: precise asymptotics and the double descent curve, Comm. Pure Appl. Math. 75 (2022), no. 4, 667–766.
- [MMM22] Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Generalization error of random feature and kernel methods: hypercontractivity and kernel matrix concentration, Appl. Comput. Harmon. Anal. 59 (2022), 3–84.
- [MRSY23] Andrea Montanari, Feng Ruan, Youngtak Sohn, and Jun Yan, The generalization error of max-margin linear classifiers: Benign overfitting and high-dimensional asymptotics in the overparametrized regime, arXiv preprint arXiv:1911.01544v3 (2023).
- [MZ22] Andrea Montanari and Yiqiao Zhong, The interpolation phase transition in neural networks: memorization and generalization under lazy training, Ann. Statist. 50 (2022), no. 5, 2816–2847.
- [PD23] Pratik Patil and Jin-Hong Du, Generalized equivalences between subsampling and ridge regularization, arXiv preprint arXiv:2305.18496 (2023).
- [PWRT21] Pratik Patil, Yuting Wei, Alessandro Rinaldo, and Ryan Tibshirani, Uniform consistency of cross-validation estimators for high-dimensional ridge regression, International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 3178–3186.
- [RMR21] Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco, Asymptotics of ridge (less) regression under general source condition, International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 3889–3897.
- [RV09] Mark Rudelson and Roman Vershynin, Smallest singular value of a random rectangular matrix, Comm. Pure Appl. Math. 62 (2009), no. 12, 1707–1739.
- [SAH19] Fariborz Salehi, Ehsan Abbasi, and Babak Hassibi, The impact of regularization on high-dimensional logistic regression, Advances in Neural Information Processing Systems 32 (2019).
- [Ste81] Charles M. Stein, Estimation of the mean of a multivariate normal distribution, Ann. Statist. 9 (1981), no. 6, 1135–1151.
- [Sto74] M. Stone, Cross-validatory choice and assessment of statistical predictions, J. Roy. Statist. Soc. Ser. B 36 (1974), 111–147.
- [Sto77] M. Stone, Asymptotics for and against cross-validation, Biometrika 64 (1977), no. 1, 29–35.
- [TAH18] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi, Precise error analysis of regularized -estimators in high dimensions, IEEE Trans. Inform. Theory 64 (2018), no. 8, 5592–5628.
- [TB22] Alexander Tsigler and Peter L. Bartlett, Benign overfitting in ridge regression, arXiv preprint arXiv:2009.14286v2 (2022).
- [TOH15] Christos Thrampoulidis, Samet Oymak, and Babak Hassibi, Regularized linear regression: A precise analysis of the estimation error, Conference on Learning Theory, PMLR, 2015, pp. 1683–1709.
- [TV+04] Antonia M Tulino, Sergio Verdú, et al., Random matrix theory and wireless communications, Foundations and Trends® in Communications and Information Theory 1 (2004), no. 1, 1–182.
- [vdVW96] Aad van der Vaart and Jon A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer-Verlag, New York, 1996.
- [WWM22] Shuaiwen Wang, Haolei Weng, and Arian Maleki, Does SLOPE outperform bridge regression?, Inf. Inference 11 (2022), no. 1, 1–54.
- [WX20] Denny Wu and Ji Xu, On the optimal weighted regularization in overparameterized linear regression, Advances in Neural Information Processing Systems 33 (2020), 10112–10123.
- [ZBH+21] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, Understanding deep learning (still) requires rethinking generalization, Communications of the ACM 64 (2021), no. 3, 107–115.
- [ZKS+22] Lijia Zhou, Frederic Koehler, Pragya Sur, Danica J Sutherland, and Nathan Srebro, A non-asymptotic moreau envelope theory for high-dimensional generalized linear models, arXiv preprint arXiv:2210.12082 (2022).
- [ZZY22] Xianyang Zhang, Huijuan Zhou, and Hanxuan Ye, A modern theory for high-dimensional Cox regression models, arXiv preprint arXiv:2204.01161 (2022).