
The distribution of Ridgeless least squares interpolators

Qiyang Han, Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA, and Xiaocong Xu, Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong.
Abstract.

The Ridgeless minimum $\ell_{2}$-norm interpolator in overparametrized linear regression has attracted considerable attention in recent years. While it seems to defy the conventional wisdom that overfitting leads to poor prediction, recent research reveals that its norm minimizing property induces an ‘implicit regularization’ that helps prediction in spite of interpolation. This renders the Ridgeless interpolator a theoretically tractable proxy that offers useful insights into the mechanisms of modern machine learning methods.

This paper takes a different perspective that aims at understanding the precise stochastic behavior of the Ridgeless interpolator as a statistical estimator. Specifically, we characterize the distribution of the Ridgeless interpolator in high dimensions, in terms of a Ridge estimator in an associated Gaussian sequence model with positive regularization, which plays the role of the prescribed implicit regularization observed previously in the context of prediction risk. Our distributional characterizations hold for general random designs and extend uniformly to positively regularized Ridge estimators.

As a demonstration of the analytic power of these characterizations, we derive approximate formulae for a general class of weighted $\ell_{q}$ risks $(0<q<\infty)$ for Ridge(less) estimators that were previously available only for $\ell_{2}$. Our theory also provides certain further conceptual reconciliation with the conventional wisdom: given any (regular) data covariance, for all but an exponentially small proportion of the signals, a certain amount of regularization in Ridge regression remains beneficial across various statistical tasks including (in-sample) prediction, estimation and inference, as long as the noise level is non-trivial. Surprisingly, optimal tuning can be achieved simultaneously for all the designated statistical tasks by a single generalized or $k$-fold cross-validation scheme, despite being designed specifically for tuning prediction risk.

The proof follows a two-step strategy that first proceeds under a Gaussian design using Gordon’s comparison principles, and then lifts the Gaussianity via universality arguments. Our analysis relies on uniform localization and stability properties of Gordon’s optimization problem, along with uniform delocalization of the Ridge(less) estimators, both of which remain valid down to the interpolation regime.

Key words and phrases:
comparison inequality, cross validation, minimum norm interpolator, random matrix theory, ridge regression, universality
2000 Mathematics Subject Classification:
60E15, 60G15
The research of Q. Han is partially supported by NSF grants DMS-1916221 and DMS-2143468. The research of X. Xu is partially supported by Hong Kong RGC grants GRF 16305421 and GRF 16303922.

1. Introduction

1.1. Overview

Consider the standard linear regression model

$$Y_{i}=X_{i}^{\top}\mu_{0}+\xi_{i},\quad 1\leq i\leq m, \tag{1.1}$$

where we observe i.i.d. feature vectors $X_{i}\in\mathbb{R}^{n}$ and responses $Y_{i}\in\mathbb{R}$, and the $\xi_{i}$’s are unobservable errors. For notational simplicity, we write $X=[X_{1}\cdots X_{m}]^{\top}\in\mathbb{R}^{m\times n}$ for the design matrix that collects all the feature vectors, and $Y=(Y_{1},\ldots,Y_{m})^{\top}\in\mathbb{R}^{m}$ for the response vector. The feature vectors $X_{i}$ are assumed to satisfy $\mathbb{E}X_{1}=0$ and $\operatorname{Cov}(X_{1})=\Sigma$, and the errors satisfy $\mathbb{E}\xi_{1}=0$ and $\operatorname{Var}(\xi_{1})=\sigma_{\xi}^{2}$.

Throughout this paper, we reserve $m$ for the sample size, and $n$ for the signal dimension. The aspect ratio $\phi$, i.e., the number of samples per dimension, is then defined as $\phi\equiv m/n$. Accordingly, we refer to $\phi^{-1}>1$ as the overparametrized regime, and $\phi^{-1}<1$ as the underparametrized regime.

Within the linear model (1.1), the main object of interest is to recover/estimate the unknown signal $\mu_{0}\in\mathbb{R}^{n}$. While a large class of regression techniques can be used for the purpose of signal recovery under various structural assumptions on $\mu_{0}$, here we will focus our attention on one widely used class of regression estimators, namely, the Ridge estimator (cf. [HK70]) with regularization $\eta>0$,

$$\widehat{\mu}_{\eta}=\operatorname*{arg\,min}_{\mu\in\mathbb{R}^{n}}\bigg\{\frac{1}{2n}\lVert Y-X\mu\rVert^{2}+\frac{\eta}{2}\lVert\mu\rVert^{2}\bigg\}=\frac{1}{n}\bigg(\frac{1}{n}X^{\top}X+\eta I_{n}\bigg)^{-1}X^{\top}Y, \tag{1.2}$$

and the Ridgeless estimator (also known as the minimum-norm interpolator),

$$\widehat{\mu}_{0}=\operatorname*{arg\,min}_{\mu\in\mathbb{R}^{n}}\big\{\lVert\mu\rVert^{2}:Y=X\mu\big\}=(X^{\top}X)^{-}X^{\top}Y, \tag{1.3}$$

which is almost surely (a.s.) well-defined in the overparametrized regime $\phi^{-1}>1$. Here $A^{-}$ is the Moore-Penrose pseudo-inverse of $A$. The notation $\widehat{\mu}_{0}$ is justified since for $\phi^{-1}>1$, $\widehat{\mu}_{\eta}\to\widehat{\mu}_{0}$ a.s. as $\eta\downarrow 0$.
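The two closed forms (1.2) and (1.3), and the convergence $\widehat{\mu}_{\eta}\to\widehat{\mu}_{0}$ as $\eta\downarrow 0$, can be checked numerically. The following is a minimal sketch under illustrative assumptions that are not from the paper: an isotropic Gaussian design, the sizes `m = 30`, `n = 60`, the noise level `0.5`, and the helper names `ridge` and `ridgeless`.

```python
import numpy as np

def ridge(X, Y, eta):
    """Ridge estimator (1.2): (X^T X / n + eta I)^{-1} X^T Y / n, for eta > 0."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X / n + eta * np.eye(n), X.T @ Y / n)

def ridgeless(X, Y):
    """Minimum-norm interpolator (1.3); pinv(X) @ Y equals (X^T X)^- X^T Y."""
    return np.linalg.pinv(X) @ Y

# Illustrative overparametrized problem (assumed sizes and noise level).
rng = np.random.default_rng(0)
m, n = 30, 60
X = rng.standard_normal((m, n))
mu0 = rng.standard_normal(n) / np.sqrt(n)
Y = X @ mu0 + 0.5 * rng.standard_normal(m)

mu_hat0 = ridgeless(X, Y)
# The Ridgeless estimator fits the training data exactly.
interpolation_gap = float(np.max(np.abs(X @ mu_hat0 - Y)))
# Distance between Ridge and Ridgeless estimators for decreasing eta.
gaps = [float(np.linalg.norm(ridge(X, Y, eta) - mu_hat0)) for eta in (1e-2, 1e-4, 1e-6)]
```

In this sketch `gaps` shrinks as `eta` decreases, which is the numerical counterpart of the limit statement above.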

From a conventional statistical point of view, the Ridgeless estimator seems far from an obviously good choice: As $\widehat{\mu}_{0}$ perfectly interpolates the data, it is susceptible to high variability due to the widely recognized bias-variance tradeoff inherent in ‘optimal’ statistical estimators [JWHT21, DSH23]. On the other hand, as the Ridgeless estimator $\widehat{\mu}_{0}$ is the limit point of the gradient descent algorithm run on the squared loss in the overparametrized regime $\phi^{-1}>1$, it provides a simple yet informative test case for understanding one major enigma of modern machine learning methods: these methods typically interpolate training data perfectly; still, they enjoy good generalization properties [JGH18, DZPS18, AZLS19, BHMM19, COB19, ZBH+21].

Inspired by this connection, recent years have witnessed a surge of interest in understanding the behavior of the Ridgeless estimator $\widehat{\mu}_{0}$ and its closely related Ridge estimator $\widehat{\mu}_{\eta}$, with a particular focus on their prediction risks, cf. [TV+04, EK13, HKZ14, Dic16, DW18, EK18, ASS20, BHX20, WX20, BLLT20, BMR21, RMR21, HMRT22, TB22, CM22]. An emerging picture from these works is that the norm minimizing property of the Ridgeless estimator $\widehat{\mu}_{0}$, inherited from the gradient descent algorithm, induces a certain ‘implicit regularization’ that helps prediction accuracy despite interpolation. More interestingly, in certain scenarios of $(\Sigma,\mu_{0})$, the Ridgeless estimator $\widehat{\mu}_{0}$ has a smaller (limiting) prediction risk compared to any Ridge estimator $\widehat{\mu}_{\eta}$ with $\eta>0$, and therefore in such scenarios interpolation becomes optimal for the task of prediction, cf. [KLS20, HMRT22, TB22]. As a result, the Ridgeless estimator $\widehat{\mu}_{0}$ can be viewed, at least to some extent, as a (substantially) simplified yet theoretically tractable proxy that captures some important features of modern machine learning methods. Moreover, an understanding of the prediction risk of $\widehat{\mu}_{0}$ serves as a basis for more complicated interpolation methods in, e.g., kernel ridge regression [LR20, BMR21], the random features model [MM22, MMM22], and the neural tangent model [MZ22], among others.

Despite this encouraging progress, there remains limited understanding of the behavior of the Ridgeless estimator $\widehat{\mu}_{0}$ as a statistical estimator. From a statistical perspective, a ‘good’ estimator usually possesses multiple desirable properties, and therefore it is natural to ask whether the Ridgeless estimator $\widehat{\mu}_{0}$ is also ‘good’ in other senses beyond prediction accuracy. This is particularly relevant if we aim to consider $\widehat{\mu}_{0}$ also as a ‘good’ estimator that can be applied in broader statistical contexts, rather than viewing it solely as a theoretical proxy designed to provide insights into the mechanisms of modern machine learning methods.

From a different angle, while much of the aforementioned research has focused on identifying favorable scenarios of $(\Sigma,\mu_{0})$ in which prediction via $\widehat{\mu}_{0}$ can be accurate or even optimal, this line of theory does not automatically imply that in the practical statistical applications of overparametrized linear regression, the Ridgeless estimator $\widehat{\mu}_{0}$ remains the optimal choice for the task of prediction for ‘typical’ scenarios of $(\Sigma,\mu_{0})$. The prevailing conventional statistical wisdom, which strongly advocates the use of regularization to strike a balance between bias and variance, may still be at work. From both conceptual and practical standpoints, it is therefore essential to understand the extent to which the optimality phenomenon of interpolation, as observed in certain scenarios of $(\Sigma,\mu_{0})$ in the literature, can be considered ‘generic’ within the context of Ridge regression.

The main goal of this paper is to make a further step in understanding the precise stochastic behavior of the Ridge(less) estimator $\widehat{\mu}_{\eta}$, by developing a high-dimensional distributional characterization in the so-called proportional regime, where $m$ and $n$ are of the same order. As will be clear from below, the distributional characterization of the Ridge(less) estimator $\widehat{\mu}_{\eta}$ offers a detailed understanding of its statistical properties across various statistical tasks, extending beyond prediction accuracy. Notable examples include a characterization for a general class of weighted $\ell_{q}$ risks ($0<q<\infty$) associated with $\widehat{\mu}_{\eta}$, as well as its capacity for statistical inference in terms of constructing confidence intervals.

In addition, the distributional characterization also leads to new insights on the (sub-)optimality of the Ridgeless estimator $\widehat{\mu}_{0}$, which, interestingly, further reconciles with the conventional statistical wisdom: given any covariance structure $\Sigma$, except for an exponentially small proportion of signal $\mu_{0}$’s within the $\ell_{2}$ norm ball, interpolation is optimal in prediction, estimation and inference, if and only if the noise level $\sigma_{\xi}^{2}=0$. In other words, in the common statistical setting where the noise level is nontrivial, i.e., $\sigma_{\xi}^{2}>0$, a certain amount of regularization in Ridge regression remains beneficial for ‘most’ signal $\mu_{0}$’s across a number of different statistical tasks in the linear model (1.1), including the most intensively studied prediction task in the literature.

Of course, in practice, the optimal regularization is unknown and may differ significantly for each statistical task. Surprisingly, our distributional characterization reveals that two widely used adaptive tuning methods, the generalized cross-validation scheme [CW79, GHW79] and the $k$-fold cross-validation scheme [GKKW02, JWHT21], both of which are designed for tuning the prediction risk, are actually simultaneously optimal for all the aforementioned statistical tasks, at least for ‘most’ signal $\mu_{0}$’s. In particular, for ‘most’ signal $\mu_{0}$’s, these two popular adaptive tuning methods automatically lead to optimal prediction, estimation and in-sample risks. Moreover, when combined with the debiased Ridge estimator, they produce the shortest confidence intervals for the coordinates of $\mu_{0}$ with asymptotically valid coverage.

1.2. Distribution of Ridge(less) estimators

For simplicity of discussion, we will focus here on the overparametrized regime $\phi^{-1}>1$, which is our main interest.

1.2.1. The Gaussian sequence model

We will describe the distribution of the Ridge(less) estimator $\widehat{\mu}_{\eta}$ in the linear model (1.1), in terms of a corresponding Ridge estimator in a simpler Gaussian sequence model, defined for a given pair of $(\Sigma,\mu_{0})$ and a noise level $\gamma>0$ as

$$y^{\mathsf{seq}}_{(\Sigma,\mu_{0})}(\gamma)\equiv\Sigma^{1/2}\mu_{0}+\frac{\gamma g}{\sqrt{n}},\quad g\sim\mathcal{N}(0,I_{n}). \tag{1.4}$$

The Ridge estimator $\widehat{\mu}_{(\Sigma,\mu_{0})}^{\mathsf{seq}}(\gamma;\tau)$ with regularization $\tau\geq 0$ in the Gaussian sequence model (1.4) is defined as

$$\begin{aligned}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\mathsf{seq}}(\gamma;\tau)&\equiv\operatorname*{arg\,min}_{\mu\in\mathbb{R}^{n}}\bigg\{\frac{1}{2}\lVert\Sigma^{1/2}\mu-y_{(\Sigma,\mu_{0})}^{\mathsf{seq}}(\gamma)\rVert^{2}+\frac{\tau}{2}\lVert\mu\rVert^{2}\bigg\}\\&=(\Sigma+\tau I_{n})^{-1}\Sigma^{1/2}\Big(\Sigma^{1/2}\mu_{0}+\frac{\gamma g}{\sqrt{n}}\Big).\end{aligned} \tag{1.5}$$
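The closed form (1.5) can be sanity-checked against the first-order optimality condition of the minimization problem that defines it. The sketch below assumes an arbitrary well-conditioned covariance and arbitrary values of `gamma` and `tau`; the helper names `psd_sqrt` and `seq_ridge` are hypothetical and introduced only for illustration.

```python
import numpy as np

def psd_sqrt(Sigma):
    """Symmetric square root of a positive semi-definite matrix."""
    w, V = np.linalg.eigh(Sigma)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def seq_ridge(Sigma, mu0, gamma, tau, g):
    """Observation (1.4) and Ridge estimator (1.5) in the Gaussian sequence model."""
    n = Sigma.shape[0]
    S_half = psd_sqrt(Sigma)
    y_seq = S_half @ mu0 + gamma * g / np.sqrt(n)
    mu_seq = np.linalg.solve(Sigma + tau * np.eye(n), S_half @ y_seq)
    return y_seq, mu_seq, S_half

# Illustrative inputs (assumed, not from the paper).
rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
Sigma = A @ A.T / n + 0.5 * np.eye(n)
mu0 = rng.standard_normal(n) / np.sqrt(n)
g = rng.standard_normal(n)
gamma, tau = 1.3, 0.7

y_seq, mu_seq, S_half = seq_ridge(Sigma, mu0, gamma, tau, g)
# Gradient of (1/2)||Sigma^{1/2} mu - y||^2 + (tau/2)||mu||^2 at mu_seq; it should vanish.
grad = S_half @ (S_half @ mu_seq - y_seq) + tau * mu_seq
grad_norm = float(np.linalg.norm(grad))
```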

1.2.2. Distributional characterization of $\widehat{\mu}_{\eta}$ via $\widehat{\mu}_{(\Sigma,\mu_{0})}^{\mathsf{seq}}$

Under standard assumptions on (i) the design matrix $X=Z\Sigma^{1/2}$, where $Z$ consists of independent mean $0$, unit-variance and light-tailed entries, and (ii) the error vector $\xi$ with light-tailed components, we show that the distribution of $\widehat{\mu}_{\eta}$ can be characterized via $\widehat{\mu}_{(\Sigma,\mu_{0})}^{\mathsf{seq}}$ as follows. For any $\eta\geq 0$, there exists a unique pair $(\gamma_{\eta,\ast},\tau_{\eta,\ast})\in(0,\infty)^{2}$ determined via an implicit fixed point equation (cf. Eqn. (2.1) in Section 2 ahead), such that the distribution of $\widehat{\mu}_{\eta}$ is about the same as that of $\widehat{\mu}_{(\Sigma,\mu_{0})}^{\mathsf{seq}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})$. Formally, for any $1$-Lipschitz function $\mathsf{g}:\mathbb{R}^{n}\to\mathbb{R}$ and any $K>0$, we show in Theorems 2.3 and 2.4 that with high probability,

$$\sup_{\eta\in[0,K]}\big\lvert\mathsf{g}(\widehat{\mu}_{\eta})-\mathbb{E}\,\mathsf{g}\big(\widehat{\mu}_{(\Sigma,\mu_{0})}^{\mathsf{seq}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})\big)\big\rvert\approx 0. \tag{1.6}$$

A particularly important technical aspect of the distributional approximation (1.6) is that it holds uniformly down to the interpolation regime $\eta=0$ for $\phi^{-1}>1$. This uniform guarantee will prove essential in the results to be detailed ahead.

From (1.6), the quantities $\gamma_{\eta,\ast}$ and $\tau_{\eta,\ast}$ can be naturally interpreted as the effective noise and effective regularization for the Ridge estimator $\widehat{\mu}_{(\Sigma,\mu_{0})}^{\mathsf{seq}}$ in the Gaussian sequence model (1.4). Moreover, $\tau_{0,\ast}>0$ can be regarded as the aforementioned ‘implicit regularization’. Interestingly, while this interpretation has been previously noted in the special context of prediction risk of the Ridge(less) estimator $\widehat{\mu}_{\eta}$ (cf. [BMR21, Remark 4.15], [CM22, Eqns. (16)-(19)]), our distributional characterization (1.6) reveals that such effective/implicit regularization is an inherently general phenomenon that persists at the level of distributional properties of $\widehat{\mu}_{\eta}$.

1.2.3. Approximate formulae for general weighted $\ell_{q}$ risks

As a first, yet non-trivial demonstration of the analytic power of (1.6), we show in Theorem 2.5 that for ‘most’ signal $\mu_{0}$’s, the weighted $\ell_{q}$ risk ($0<q<\infty$) of $\widehat{\mu}_{\eta}$, namely, $\lVert\mathsf{A}(\widehat{\mu}_{\eta}-\mu_{0})\rVert_{q}$ with a well-behaved p.s.d. matrix $\mathsf{A}$, concentrates around some deterministic quantity in the following sense: with high probability,

$$\sup_{\eta\in[0,K]}\bigg\lvert\frac{\lVert\mathsf{A}(\widehat{\mu}_{\eta}-\mu_{0})\rVert_{q}}{n^{-1/2}\lVert\mathrm{diag}(\Gamma_{\eta;(\Sigma,\lVert\mu_{0}\rVert)}^{\mathsf{A}})\rVert_{q/2}^{1/2}M_{q}}-1\bigg\rvert\approx 0. \tag{1.7}$$

Here $M_{q}\equiv\mathbb{E}^{1/q}\lvert\mathcal{N}(0,1)\rvert^{q}$, and $\mathrm{diag}(\Gamma_{\eta;(\Sigma,\lVert\mu_{0}\rVert)}^{\mathsf{A}})\in\mathbb{R}^{n}$ is the vector that collects all diagonal elements of a p.s.d. matrix $\Gamma_{\eta;(\Sigma,\lVert\mu_{0}\rVert)}^{\mathsf{A}}\in\mathbb{R}^{n\times n}$, defined explicitly (in Eqn. (2.7)) via the weight matrix $\mathsf{A}$, the data covariance $\Sigma$, the effective regularization $\tau_{\eta,\ast}$, and the signal energy $\lVert\mu_{0}\rVert$.
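The constant $M_{q}$ in (1.7) has an explicit value through the standard formula for absolute Gaussian moments, $\mathbb{E}\lvert\mathcal{N}(0,1)\rvert^{q}=2^{q/2}\Gamma((q+1)/2)/\sqrt{\pi}$. The short sketch below evaluates it; the function name `M_q` is chosen here only to mirror the notation, and the grid of $q$ values is illustrative.

```python
import math

def M_q(q):
    """M_q = (E|N(0,1)|^q)^{1/q}, via E|Z|^q = 2^{q/2} * Gamma((q+1)/2) / sqrt(pi)."""
    if q <= 0:
        raise ValueError("q must be positive")
    abs_moment = 2.0 ** (q / 2.0) * math.gamma((q + 1.0) / 2.0) / math.sqrt(math.pi)
    return abs_moment ** (1.0 / q)

# Illustrative grid of exponents, including values of q below 1.
values = {q: M_q(q) for q in (0.5, 1.0, 2.0, 4.0)}
```

For $q=2$ the constant equals $1$, so that (1.7) then involves only the trace of $\Gamma_{\eta;(\Sigma,\lVert\mu_{0}\rVert)}^{\mathsf{A}}$.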

To the best of our knowledge, results of the form (1.7) are available in the literature (cited above) only for the special case $q=2$, where the quantity $\lVert\mathsf{A}(\widehat{\mu}_{\eta}-\mu_{0})\rVert_{2}$ admits a closed form expression in terms of the spectral statistics of the sample covariance that allows for direct applications of random matrix methods. Moving beyond this explicit closed form presents a notable analytic advantage of our theory (1.6) over existing random matrix approaches in analyzing $\ell_{2}$ risks of $\widehat{\mu}_{\eta}$.

1.3. Phase transitions for the optimality of interpolation

While our theory (1.6) is strong enough to characterize all $\ell_{q}$ risks of the Ridge(less) estimator $\widehat{\mu}_{\eta}$ in the sense of (1.7), the uniform nature of (1.6) also yields novel insights into certain global, qualitative behavior of the most commonly studied $\ell_{2}$ risks for finite samples. To fix notation, we define

  • (prediction risk) $R^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta)\equiv\lVert\Sigma^{1/2}(\widehat{\mu}_{\eta}-\mu_{0})\rVert^{2}$,

  • (estimation risk) $R^{\mathsf{est}}_{(\Sigma,\mu_{0})}(\eta)\equiv\lVert\widehat{\mu}_{\eta}-\mu_{0}\rVert^{2}$,

  • (in-sample risk) $R^{\mathsf{in}}_{(\Sigma,\mu_{0})}(\eta)\equiv n^{-1}\lVert X(\widehat{\mu}_{\eta}-\mu_{0})\rVert^{2}$.

Using our uniform distributional characterization in (1.6), we show in Theorem 3.4 that for ‘most’ $\mu_{0}$’s and all $\#\in\{\mathsf{pred},\mathsf{est},\mathsf{in}\}$, with high probability,

$$\big\lvert R^{\#}_{(\Sigma,\mu_{0})}(0)-\min_{\eta\in[0,K]}R^{\#}_{(\Sigma,\mu_{0})}(\eta)\big\rvert\begin{cases}\gtrsim 1,&\sigma_{\xi}^{2}>0,\\ \approx 0,&\sigma_{\xi}^{2}=0.\end{cases} \tag{1.8}$$

In fact, we prove in Theorem 3.4 a much stronger statement: for ‘most’ $\mu_{0}$’s, the (random) global optimum of $\eta\mapsto R^{\#}_{(\Sigma,\mu_{0})}(\eta)$ for all $\#\in\{\mathsf{pred},\mathsf{est},\mathsf{in}\}$ will be achieved approximately at $\eta_{\ast}=\mathsf{SNR}_{\mu_{0}}^{-1}$ with high probability. Here $\mathsf{SNR}_{\mu_{0}}=\lVert\mu_{0}\rVert^{2}/\sigma_{\xi}^{2}$ is the usual notion of signal-to-noise ratio; when $\mu_{0}\neq 0$ and $\sigma_{\xi}^{2}=0$, we shall interpret $\mathsf{SNR}_{\mu_{0}}^{-1}=0$.
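A small simulation can illustrate the gap in (1.8) when $\sigma_{\xi}^{2}>0$. The sketch below is only an illustration under assumed settings that are not from the paper: a diagonal $\Sigma$ with eigenvalues spread over $[0.5,2]$, a Gaussian design, a signal drawn uniformly on the unit sphere, and sizes `m = 200`, `n = 300`. It compares the three risks at $\eta=0$ with those at $\eta_{\ast}=\mathsf{SNR}_{\mu_{0}}^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, sigma_xi = 200, 300, 1.0             # assumed sizes and noise level
lam = np.linspace(0.5, 2.0, n)             # eigenvalues of a diagonal Sigma (assumed)
Z = rng.standard_normal((m, n))
X = Z * np.sqrt(lam)                       # X = Z Sigma^{1/2} for diagonal Sigma
mu0 = rng.standard_normal(n)
mu0 /= np.linalg.norm(mu0)                 # signal with unit norm
Y = X @ mu0 + sigma_xi * rng.standard_normal(m)

def estimator(eta):
    """Ridge estimator (1.2) for eta > 0, Ridgeless estimator (1.3) for eta = 0."""
    if eta == 0:
        return np.linalg.pinv(X) @ Y
    return np.linalg.solve(X.T @ X / n + eta * np.eye(n), X.T @ Y / n)

def risks(eta):
    """Prediction, estimation and in-sample risks as defined above."""
    d = estimator(eta) - mu0
    return {
        "pred": float(d @ (lam * d)),
        "est": float(d @ d),
        "in": float(np.sum((X @ d) ** 2) / n),
    }

eta_star = sigma_xi**2 / np.linalg.norm(mu0) ** 2   # SNR^{-1}
risk_at_zero = risks(0.0)
risk_at_star = risks(eta_star)
```

Evaluating `risks` on a grid of `eta` values would trace out the empirical risk curves; only the two endpoints of interest are computed here.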

It must be stressed that, for different $\#\in\{\mathsf{pred},\mathsf{est},\mathsf{in}\}$, the empirical risk curves $\eta\mapsto R^{\#}_{(\Sigma,\mu_{0})}(\eta)$ concentrate on genuinely different deterministic counterparts $\eta\mapsto\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)$ with different mathematical expressions (cf. Theorem 3.1). As such, there are no a priori reasons to expect that these risk curves share approximately the same global minimizer. Remarkably, as a consequence of the approximate formulae for the deterministic risk curves $\eta\mapsto\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)$ (cf. Theorem 3.2), we show that the curves $\eta\mapsto\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)$ are qualitatively similar, in that they approximately behave locally like a quadratic function centered around $\eta_{\ast}=\mathsf{SNR}_{\mu_{0}}^{-1}$ (cf. Proposition 3.3), at least for ‘most’ signal $\mu_{0}$’s.

From a broader perspective, the phase transition (1.8) aligns closely with the conventional statistical wisdom: for ‘most’ signal $\mu_{0}$’s, a certain amount of regularization is necessary to achieve optimal performance for all of the prediction, estimation and in-sample risks, as long as the noise level $\sigma_{\xi}^{2}>0$.

1.4. Cross-validation: optimality beyond prediction

The phase transition in (1.8) naturally raises the question of how one can choose the optimal regularization in a data-driven manner. Here we study two widely used adaptive tuning methods, namely,

  (1) the generalized cross-validation scheme $\widehat{\eta}^{\mathsf{GCV}}$, and

  (2) the $k$-fold cross-validation scheme $\widehat{\eta}^{\mathsf{CV}}$.

The readers are referred to (4.3) and (4.5) for precise definitions of $\widehat{\eta}^{\mathsf{GCV}},\widehat{\eta}^{\mathsf{CV}}$ in the context of Ridge regression. Both methods have a long history in the literature; see, e.g., [Sto74, Sto77, CW79, GHW79, Li85, Li86, Li87, DvdL05] for some historical references. In essence, both methods $\widehat{\eta}^{\mathsf{GCV}},\widehat{\eta}^{\mathsf{CV}}$ are designed to estimate the prediction risk, so it is natural to expect that they perform well for the task of prediction. Rigorous theoretical justifications along this line in Ridge regression in high dimensional settings can be found in, e.g., [LD19, PWRT21, HMRT22].

Interestingly, the insight from (1.8) suggests a far broader utility of these adaptive tuning methods. As all the empirical risk curves $\eta\mapsto R^{\#}_{(\Sigma,\mu_{0})}(\eta)$ are approximately minimized at the same point $\eta_{\ast}=\mathsf{SNR}_{\mu_{0}}^{-1}$ with high probability, it is plausible to conjecture that the aforementioned cross-validation methods $\widehat{\eta}^{\mathsf{GCV}},\widehat{\eta}^{\mathsf{CV}}$ could also yield optimal performance for estimation and in-sample risks, at least for ‘most’ signal $\mu_{0}$’s. We show in Theorems 4.2 and 4.3 that this is indeed the case: for ‘most’ signal $\mu_{0}$’s and all $\#\in\{\mathsf{pred},\mathsf{est},\mathsf{in}\}$, with high probability,

$$R^{\#}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\mathsf{GCV}}),\,R^{\#}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\mathsf{CV}})\approx\min_{\eta\in[0,K]}R^{\#}_{(\Sigma,\mu_{0})}(\eta). \tag{1.9}$$

Even more surprisingly, the optimality of $\widehat{\eta}^{\mathsf{GCV}},\widehat{\eta}^{\mathsf{CV}}$ extends to the much more difficult task of statistical inference. In fact, we show in Theorem 4.4 that in the so-called debiased Ridge scheme, these two adaptive tuning methods $\widehat{\eta}^{\mathsf{GCV}},\widehat{\eta}^{\mathsf{CV}}$ yield an asymptotically valid construction of confidence intervals for the coordinates of $\mu_{0}$ with the shortest possible length.

To the best of our knowledge, theoretical optimality properties for cross-validation schemes beyond the realm of prediction accuracy have not been observed in the literature, either for Ridge regression or for other regularized regression estimators. Our findings here in the context of Ridge regression can therefore be viewed as a first step in understanding the potential broader merits of cross-validation schemes for a larger array of statistical inference problems.

1.5. Proof techniques

As previously mentioned, a significant body of recent research (cited above) has concentrated on various facets of the precise asymptotics of the prediction risk for the Ridgeless estimator $\widehat{\mu}_{0}$. These studies extensively utilize the explicit form of (1.3), and rely almost exclusively on techniques from random matrix theory (RMT) to relate the behavior of the bias and variance terms in the prediction risk to the spectral properties of the sample covariance.

Here, as (1.6) lacks a direct connection to the spectrum of the sample covariance, we adopt a different, two-step strategy for its proof:

  (1) In the first step, we establish (1.6) under a Gaussian design $X$ via the so-called convex Gaussian min-max theorem (CGMT) approach [Gor85, Gor88].

  (2) In the second step, we prove universality that lifts the Gaussianity of $X$ by leveraging the recently developed comparison inequalities in [HS22].

Under a Gaussian design, the proof method for establishing distributional properties of regularized regression estimators via the CGMT has been executed for the closely related Lasso estimator with strict non-vanishing regularization, cf. [MM21, CMW22] (more literature on other statistical applications of the CGMT can be found in Section 6). Here in our setting, the major technical hurdle is to handle the vanishing strong convexity in the optimization problem (1.2) as $\eta\downarrow 0$. We overcome this technical issue by establishing uniform localization of Gordon’s min-max optimization, valid down to $\eta=0$. The localization property, in a certain sense, allows us to conclude that the distributional properties of $\widehat{\mu}_{(\Sigma,\mu_{0})}^{\mathsf{seq}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})$ are stable as $\eta\downarrow 0$. The readers are referred to Section 6.3 for a proof outline.

For the universality problem, the key step of our arguments is to reduce, in a proper sense, the difficult problem on the universality of $\widehat{\mu}_{0}$ to an easier problem on the universality of $\widehat{\mu}_{\eta}$ for small $\eta>0$, so that the comparison inequalities in [HS22] can be applied. This reduction is achieved by proving that (i) $\widehat{\mu}_{\eta}$ and related quantities are uniformly delocalized down to the interpolation regime $\eta=0$, and (ii) both the primal and Gordon optimization problems are ‘stable’ in suitable senses as $\eta\downarrow 0$. A more detailed proof outline is contained in Section 6.4.

1.6. Organization

The rest of the paper is organized as follows. In Section 2, we present our main results on the distributional characterizations (1.6) of the Ridge(less) estimator $\widehat{\mu}_{\eta}$, and the approximate $\ell_{q}$ risk formulae (1.7). In Section 3, we provide a number of approximate $\ell_{2}$ risk formulae via RMT, and establish the phase transition (1.8). In Section 4, we give a formal validation for the two cross-validation schemes mentioned above, both in terms of (1.9) and statistical inference via the debiased Ridge estimator. A set of illustrative simulation results is presented in Section 5 to corroborate (some of) the theoretical results. Due to the technical complexity of the proof of (1.6), a proof outline is given in Section 6. All the proof details are then presented in Sections 7-12.

1.7. Notation

For any positive integer $n$, let $[n]=[1:n]$ denote the set $\{1,\ldots,n\}$. For $a,b\in\mathbb{R}$, $a\vee b\equiv\max\{a,b\}$ and $a\wedge b\equiv\min\{a,b\}$. For $a\in\mathbb{R}$, let $a_{\pm}\equiv(\pm a)\vee 0$. For $x\in\mathbb{R}^{n}$, let $\lVert x\rVert_{p}$ denote its $p$-norm $(0\leq p\leq\infty)$, and $B_{n;p}(R)\equiv\{x\in\mathbb{R}^{n}:\lVert x\rVert_{p}\leq R\}$. We simply write $\lVert x\rVert\equiv\lVert x\rVert_{2}$ and $B_{n}(R)\equiv B_{n;2}(R)$. For a matrix $M\in\mathbb{R}^{m\times n}$, let $\lVert M\rVert_{\operatorname{op}},\lVert M\rVert_{F}$ denote the spectral and Frobenius norm of $M$, respectively. $I_{n}$ is reserved for the $n\times n$ identity matrix, written simply as $I$ (in the proofs) if no confusion arises. For a square matrix $M\in\mathbb{R}^{n\times n}$, we let $\mathrm{diag}(M)\equiv(M_{ii})_{i=1}^{n}\in\mathbb{R}^{n}$.

We use $C_{x}$ to denote a generic constant that depends only on $x$, whose numeric value may change from line to line unless otherwise specified. $a\lesssim_{x}b$ and $a\gtrsim_{x}b$ mean $a\leq C_{x}b$ and $a\geq C_{x}b$, abbreviated as $a=\mathcal{O}_{x}(b)$ and $a=\Omega_{x}(b)$ respectively; $a\asymp_{x}b$ means $a\lesssim_{x}b$ and $a\gtrsim_{x}b$, abbreviated as $a=\Theta_{x}(b)$. $\mathcal{O}$ and $\mathfrak{o}$ (resp. $\mathcal{O}_{\mathbf{P}}$ and $\mathfrak{o}_{\mathbf{P}}$) denote the usual big and small O notation (resp. in probability). For a random variable $X$, we use $\mathbb{P}_{X},\mathbb{E}_{X}$ (resp. $\mathbb{P}^{X},\mathbb{E}^{X}$) to indicate that the probability and expectation are taken with respect to $X$ (resp. conditional on $X$).

For a measurable map $f:\mathbb{R}^{n}\to\mathbb{R}$, let $\lVert f\rVert_{\mathrm{Lip}}\equiv\sup_{x\neq y}\lvert f(x)-f(y)\rvert/\lVert x-y\rVert$. $f$ is called $L$-Lipschitz iff $\lVert f\rVert_{\mathrm{Lip}}\leq L$. For a proper, closed convex function $f$ defined on $\mathbb{R}^{n}$, its Moreau envelope $\mathsf{e}_{f}(\cdot;\tau)$ and proximal operator $\mathsf{prox}_{f}(\cdot;\tau)$ for any $\tau>0$ are defined by $\mathsf{e}_{f}(x;\tau)\equiv\min_{z\in\mathbb{R}^{n}}\big\{\frac{1}{2\tau}\lVert x-z\rVert^{2}+f(z)\big\}$ and $\mathsf{prox}_{f}(x;\tau)\equiv\operatorname*{arg\,min}_{z\in\mathbb{R}^{n}}\big\{\frac{1}{2\tau}\lVert x-z\rVert^{2}+f(z)\big\}$.

Throughout this paper, for an invertible covariance matrix $\Sigma\in\mathbb{R}^{n\times n}$, we write $\mathcal{H}_{\Sigma}\equiv\operatorname{tr}(\Sigma^{-1})/n$, so that $\mathcal{H}_{\Sigma}^{-1}$ is the harmonic mean of the eigenvalues of $\Sigma$.

2. Distribution of Ridge(less) estimators

2.1. Some definitions

For $K>1$, let

$$\Xi_{K}\equiv[\bm{1}_{\phi^{-1}<1+1/K}K^{-1},K].$$

This notation will be used throughout the paper for uniform-in-$\eta$ statements. In particular, in the overparametrized regime $\phi^{-1}\geq 1+1/K$, we have $\Xi_{K}=[0,K]$.

Next, for $\gamma,\tau\geq 0$, recall $\widehat{\mu}_{(\Sigma,\mu_{0})}^{\mathsf{seq}}(\gamma;\tau)$ in (1.5), and we define its associated estimation error $\mathsf{err}_{(\Sigma,\mu_{0})}(\gamma;\tau)$ and the degrees-of-freedom $\mathsf{dof}_{(\Sigma,\mu_{0})}(\gamma;\tau)$ as

$$\begin{cases}\mathsf{err}_{(\Sigma,\mu_{0})}(\gamma;\tau)\equiv\lVert\Sigma^{1/2}\big(\widehat{\mu}_{(\Sigma,\mu_{0})}^{\mathsf{seq}}(\gamma;\tau)-\mu_{0}\big)\rVert^{2},\\ \mathsf{dof}_{(\Sigma,\mu_{0})}(\gamma;\tau)\equiv\big\langle\frac{\gamma g}{\sqrt{n}},\Sigma^{1/2}\big(\widehat{\mu}_{(\Sigma,\mu_{0})}^{\mathsf{seq}}(\gamma;\tau)-\mu_{0}\big)\big\rangle.\end{cases}$$

We note that the 𝖽𝗈𝖿(Σ,μ0)(γ;τ)\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma;\tau) defined above is naturally related to the usual notion of degrees-of-freedom (cf. [Ste81, Efr04]) for μ^(Σ,μ0)𝗌𝖾𝗊(γ;τ)\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma;\tau), in the sense that df(μ^(Σ,μ0)𝗌𝖾𝗊(γ;τ))j=1n1γ2/nCov((Σ1/2μ^(Σ,μ0)𝗌𝖾𝗊)j,y(Σ,μ0),j𝗌𝖾𝗊)=nγ2𝔼𝖽𝗈𝖿(Σ,μ0)(γ;τ)\mathrm{df}\big{(}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma;\tau)\big{)}\equiv\sum_{j=1}^{n}\frac{1}{\gamma^{2}/n}\operatorname{Cov}\big{(}(\Sigma^{1/2}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}})_{j},y_{(\Sigma,\mu_{0}),j}^{\operatorname{\mathsf{seq}}}\big{)}=\frac{n}{\gamma^{2}}\operatorname{\mathbb{E}}\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma;\tau).

2.2. Working assumptions

Assumption A.

$X=Z\Sigma^{1/2}$, where (i) $Z\in\mathbb{R}^{m\times n}$ has independent, mean-zero, unit variance, uniformly sub-gaussian entries, and (ii) $\Sigma\in\mathbb{R}^{n\times n}$ is an invertible covariance matrix with eigenvalues $\lambda_{1}\geq\cdots\geq\lambda_{n}>0$.

Here ‘uniform sub-gaussianity’ means $\sup_{i\in[m],j\in[n]}\lVert Z_{ij}\rVert_{\psi_{2}}\leq C$ for some universal $C>0$, where $\psi_{2}$ is the Orlicz 2-norm (cf. [vdVW96, Section 2.2, pp. 95]).

We shall often write the Gaussian design as $Z=G$, where $G\in\mathbb{R}^{m\times n}$ consists of i.i.d. $\mathcal{N}(0,1)$ entries.

Assumption B.

$\xi=\sigma_{\xi}\cdot\xi_{0}$ for some $\xi_{0}$ with i.i.d. mean zero, unit variance and uniform sub-gaussian entries.

The requirement on the noise level $\sigma_{\xi}^{2}$ will be specified in concrete results below.

2.3. The fixed point equation

Fix $\eta\geq 0$. Consider the following fixed point equation in $(\gamma,\tau)$:

\[ \begin{cases}\phi\gamma^{2}=\sigma_{\xi}^{2}+\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma;\tau),\\ \phi-\frac{\eta}{\tau}=\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau I_{n})^{-1}\Sigma\big)=\frac{1}{\gamma^{2}}\operatorname{\mathbb{E}}\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma;\tau).\end{cases}\tag{2.1} \]

We first establish some qualitative properties for the solution of (2.1).
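The system (2.1) decouples: the second equation involves $\tau$ only, and once $\tau$ is known the first equation is linear in $\gamma^{2}$. The sketch below exploits this. It assumes that $\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma;\tau)$ equals the bias-plus-variance expression that Section 3.1 uses for $\bar{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}$, and it works in the eigenbasis of $\Sigma$. The function names, the bisection scheme and its tolerances are hypothetical choices, not the paper's.

```python
import numpy as np

def solve_tau(lam, phi, eta, tol=1e-12):
    """Bisection for tau in  phi - eta/tau = mean(lam / (lam + tau)).

    lam: eigenvalues of Sigma. The left-minus-right side is increasing in tau.
    """
    if eta == 0.0 and phi >= 1.0:
        raise ValueError("eta = 0 requires the overparametrized regime phi < 1")
    F = lambda t: phi - eta / t - np.mean(lam / (lam + t))
    lo, hi = 1e-12, 1.0
    while F(hi) < 0:            # enlarge the bracket until the sign changes
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if F(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def solve_gamma2(lam, mu0, phi, sigma2, tau):
    """gamma^2 from the first equation; mu0 is given in the eigenbasis of Sigma.

    Assumes E err = tau^2 * sum(lam * mu0^2 / (lam + tau)^2)
                    + gamma^2 * mean(lam^2 / (lam + tau)^2).
    """
    bias = tau**2 * np.sum(lam * mu0**2 / (lam + tau) ** 2)
    var_coef = np.mean(lam**2 / (lam + tau) ** 2)
    return (sigma2 + bias) / (phi - var_coef)

# Isotropic example: Sigma = I_n, phi = 1/2, interpolation (eta = 0).
n = 50
lam = np.ones(n)
mu0 = np.zeros(n)
mu0[0] = 1.0
tau0 = solve_tau(lam, phi=0.5, eta=0.0)
gamma2_0 = solve_gamma2(lam, mu0, phi=0.5, sigma2=1.0, tau=tau0)
```

In this isotropic example the second equation reads $\phi=1/(1+\tau)$ at $\eta=0$, so the solver should return $\tau_{0,\ast}=1/\phi-1=1$, which illustrates that the implicit regularization is strictly positive in the overparametrized regime.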

Proposition 2.1.

Recall $\mathcal{H}_{\Sigma}=\operatorname{tr}(\Sigma^{-1})/n$. The following hold.

  (1) The fixed point equation (2.1) admits a unique solution $(\gamma_{\eta,\ast},\tau_{\eta,\ast})\in(0,\infty)^{2}$, for all $(m,n)\in\mathbb{N}^{2}$ when $\eta>0$ and $m<n$ when $\eta=0$.

  (2) Suppose $1/K\leq\phi^{-1}\leq K$ and $\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$ for some $K>1$. Then there exists some $C=C(K)>1$ such that uniformly in $\eta\in\Xi_{K}$,

    \[ 1/C\leq\tau_{\eta,\ast}\leq C,\quad 1/C\leq(-1)^{q+1}\partial_{\eta}^{q}\tau_{\eta,\ast}\leq C,\quad q\in\{1,2\}. \]

    If furthermore $1/K\leq\sigma_{\xi}^{2}\leq K$ and $\lVert\mu_{0}\rVert\leq K$, then uniformly in $\eta\in\Xi_{K}$,

    \[ 1/C\leq\gamma_{\eta,\ast}\leq C,\quad\lvert\partial_{\eta}\gamma_{\eta,\ast}\rvert\leq C. \]
  (3) Suppose $1/K\leq\phi^{-1}\leq K$ and $\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$ for some $K>1$. Then there exists some $C=C(K)>1$ such that the following hold. For any $\varepsilon\in(0,1/2]$, we may find some $\mathcal{U}_{\varepsilon}\subset B_{n}(1)$ with $\mathrm{vol}(\mathcal{U}_{\varepsilon})/\mathrm{vol}(B_{n}(1))\geq 1-C\varepsilon^{-1}e^{-n\varepsilon^{2}/C}$, such that

    \[ \sup_{\mu_{0}\in\mathcal{U}_{\varepsilon}}\sup_{\eta\in\Xi_{K}}\big\lvert\gamma_{\eta,\ast}^{2}-\widetilde{\gamma}_{\eta,\ast}^{2}(\lVert\mu_{0}\rVert)\big\rvert\leq\varepsilon, \]

    where $\widetilde{\gamma}_{\eta,\ast}^{2}(\lVert\mu_{0}\rVert)\equiv\sigma_{\xi}^{2}\partial_{\eta}\tau_{\eta,\ast}+\lVert\mu_{0}\rVert^{2}(\tau_{\eta,\ast}-\eta\partial_{\eta}\tau_{\eta,\ast})>0$. When $\Sigma=I_{n}$, we may take $\mathcal{U}_{\varepsilon}=B_{n}(1)$ and the above inequality holds with $\varepsilon=0$.

The above proposition combines parts of Propositions 8.1 and 11.2.

As an important qualitative consequence of (2), under the condition $\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$, the effective regularization $\eta\mapsto\tau_{\eta,\ast}$ is a strictly increasing and concave function of $\eta$. Moreover, in the overparametrized regime $\phi^{-1}>1$, the quantity $\tau_{0,\ast}$, also known as ‘implicit regularization’ in the literature [BLLT20, BMR21, CM22, HMRT22, TB22], is strictly bounded away from zero.

The claim in (3) offers a useful approximate representation of the effective noise $\gamma_{\eta,\ast}^{2}$ in terms of the original noise $\sigma_{\xi}^{2}$, the effective regularization $\tau_{\eta,\ast}$ and the signal energy $\lVert\mu_{0}\rVert$, without explicit dependence on $\Sigma$. This representation will prove useful in the approximate $\ell_{q}$ risk formulae in Theorem 2.5, as well as in understanding some qualitative aspects of the risk curves in Section 3 ahead.
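In the isotropic case $\Sigma=I_{n}$, Proposition 2.1-(3) asserts that the representation is exact, and this can be checked numerically. The sketch below uses the fact that for $\Sigma=I_{n}$ the second equation of (2.1) reduces to the quadratic $\phi\tau^{2}+(\phi-\eta-1)\tau-\eta=0$, and it again assumes the bias-plus-variance form of $\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}$ used in Section 3.1. The derivative $\partial_{\eta}\tau_{\eta,\ast}$ is approximated by a central finite difference; all function names and parameter values are illustrative.

```python
import numpy as np

def tau_iso(eta, phi):
    """Positive root of phi*tau^2 + (phi - eta - 1)*tau - eta = 0 (Sigma = I_n)."""
    b = phi - eta - 1.0
    return (-b + np.sqrt(b * b + 4.0 * phi * eta)) / (2.0 * phi)

def gamma2_iso(eta, phi, sigma2, mu_norm2):
    """gamma^2 from the first equation of (2.1) when Sigma = I_n."""
    t = tau_iso(eta, phi)
    s2 = 1.0 / (1.0 + t) ** 2
    return (sigma2 + t * t * s2 * mu_norm2) / (phi - s2)

def gamma2_tilde(eta, phi, sigma2, mu_norm2, h=1e-5):
    """sigma^2 * dtau/deta + ||mu0||^2 * (tau - eta * dtau/deta)."""
    dtau = (tau_iso(eta + h, phi) - tau_iso(eta - h, phi)) / (2.0 * h)
    t = tau_iso(eta, phi)
    return sigma2 * dtau + mu_norm2 * (t - eta * dtau)

exact = gamma2_iso(0.3, 0.5, 1.0, 0.8)
approx = gamma2_tilde(0.3, 0.5, 1.0, 0.8)
```

Up to the finite-difference error, `exact` and `approx` should agree, consistent with the claim that the inequality holds with $\varepsilon=0$ when $\Sigma=I_{n}$.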

2.4. Some connections of (2.1) to RMT

The second equation of (2.1) has a natural connection to random matrix theory (RMT). To detail this connection, let $\widehat{\Sigma}\equiv\Sigma^{1/2}G^{\top}G\Sigma^{1/2}/m\in\mathbb{R}^{n\times n}$ and $\check{\Sigma}\equiv G\Sigma G^{\top}/m\in\mathbb{R}^{m\times m}$ be the sample covariance matrix and its dimension-flipped companion matrix, respectively. For $z\in\mathbb{C}^{+}\equiv\{z\in\mathbb{C}:\Im z>0\}$, let $\mathfrak{m}_{n}(z)\equiv m^{-1}\operatorname{tr}\big(\check{\Sigma}-zI_{m}\big)^{-1}$ and $\mathfrak{m}(z)$ be the Stieltjes transforms of the empirical spectral distribution and the asymptotic eigenvalue density (cf. [KY17, Definition 2.3]) of $\check{\Sigma}$, respectively. It is well-known that $\mathfrak{m}(z)$ can be determined uniquely via the fixed point equation

\[ z=-\frac{1}{\mathfrak{m}(z)}+\frac{1}{\phi}\cdot\frac{1}{n}\operatorname{tr}\Big(\big(I_{n}+\Sigma\mathfrak{m}(z)\big)^{-1}\Sigma\Big).\tag{2.2} \]

See, e.g., [KY17, Lemma 2.2] for more technical details and historical references. We also note that while the above equation is initially defined for $z\in\mathbb{C}^{+}$, it can be straightforwardly extended to the real axis provided that $z$ lies outside the support of the asymptotic spectrum of $\check{\Sigma}$.

The following proposition provides a precise connection between the effective regularization $\tau_{\eta,\ast}$ defined via the second equation of (2.1), and the Stieltjes transform $\mathfrak{m}$. This connection will prove important in some of the results ahead.

Proposition 2.2.

For any $\eta>0$ and $\eta=0$ with $\phi^{-1}>1$,

\[ n^{-1}\operatorname{tr}\big((\Sigma+\tau_{\eta,\ast}I_{n})^{-1}\Sigma\big)=\phi-\eta\cdot\mathfrak{m}(-\eta/\phi).\tag{2.3} \]

Proof.

By comparing (2.2) and the second equation of (2.1), we may identify the two equations by setting $\tau_{\eta,\ast}\equiv 1/\mathfrak{m}(-z_{\eta})$ with $z_{\eta}\equiv\eta/\phi$, as claimed. ∎

While (2.3) may appear purely algebraic, it actually admits a natural statistical interpretation. Suppose $\xi$ is also Gaussian. We may then compute

\[ \mathrm{df}(\widehat{\mu}_{\eta})=\sum_{j=1}^{m}\frac{\operatorname{Cov}^{X}\big((X\widehat{\mu}_{\eta})_{j},Y_{j}\big)}{\sigma_{\xi}^{2}}=\operatorname{tr}\big((\widehat{\Sigma}+z_{\eta}I_{n})^{-1}\widehat{\Sigma}\big)=n\big(\phi-\eta\cdot\mathfrak{m}_{n}(-z_{\eta})\big).\tag{2.4} \]

Now comparing the above display with (2.3), we arrive at the following intriguing equivalence between the averaged law in RMT, and the proximity of $\widehat{\mu}_{\eta}$ and $\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})$ in terms of “degrees-of-freedom”:

\[ \mathfrak{m}_{n}(-z_{\eta})\stackrel{\mathbb{P}}{\approx}\mathfrak{m}(-z_{\eta})\,\Leftrightarrow\,\mathrm{df}(\widehat{\mu}_{\eta})\stackrel{\mathbb{P}}{\approx}\mathrm{df}\big(\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})\big). \]

Below we will show that the above proximity of $\widehat{\mu}_{\eta}$ and $\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})$ extends to the level of the distributions of the two estimators themselves.
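The averaged law on the left of this equivalence is easy to observe in simulation. The sketch below takes $\Sigma=I_{n}$ and a Gaussian design, computes the empirical Stieltjes transform $\mathfrak{m}_{n}(-z_{\eta})$ from the eigenvalues of $\check{\Sigma}$, and compares it with $1/\tau_{\eta,\ast}$, which equals $\mathfrak{m}(-z_{\eta})$ by the identification in the proof of Proposition 2.2. It assumes the aspect-ratio convention $\phi=m/n$, which is the one consistent with (2.4); the dimensions, seed and value of $\eta$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 400                     # overparametrized: phi = m/n = 1/2
phi, eta = m / n, 0.5
z_eta = eta / phi

# Companion matrix G Sigma G^T / m with Sigma = I_n.
G = rng.standard_normal((m, n))
evals = np.linalg.eigvalsh(G @ G.T / m)

# Empirical Stieltjes transform m_n(-z_eta) = tr(Sigma_check + z_eta I)^{-1} / m.
m_n = np.mean(1.0 / (evals + z_eta))

# tau_{eta,*} for Sigma = I_n: positive root of phi*t^2 + (phi - eta - 1)*t - eta = 0.
b = phi - eta - 1.0
tau = (-b + np.sqrt(b * b + 4.0 * phi * eta)) / (2.0 * phi)
m_limit = 1.0 / tau

# Both sides of (2.3)/(2.4), normalized by n.
df_empirical = phi - eta * m_n
df_theory = 1.0 / (1.0 + tau)       # tr((I + tau I)^{-1}) / n
```

At these dimensions `m_n` and `m_limit` should already be close, and correspondingly `df_empirical` should be close to `df_theory`.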

2.5. Distribution of Ridge(less) estimators

In addition to $\widehat{\mu}_{\eta}$, we will also consider the distribution of the (scaled) residual $\widehat{r}_{\eta}$, defined by

\[ \widehat{r}_{\eta}\equiv\frac{1}{\sqrt{n}}\big(Y-X\widehat{\mu}_{\eta}\big).\tag{2.5} \]

We define the ‘population’ version of $\widehat{r}_{\eta}$ as

\[ r_{\eta,\ast}\equiv\frac{\eta}{\phi\tau_{\eta,\ast}}\bigg(-\sqrt{\phi\gamma_{\eta,\ast}^{2}-\sigma_{\xi}^{2}}\cdot\frac{h}{\sqrt{n}}+\frac{\xi}{\sqrt{n}}\bigg).\tag{2.6} \]

Here $h\sim\mathcal{N}(0,I_{m})$ is independent of $\xi$.
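As a quick consistency check on (2.6), one may compute the expected squared norm of $r_{\eta,\ast}$. The following short calculation is a sketch that takes expectation over both $h$ and $\xi$, uses their independence together with $\operatorname{\mathbb{E}}\lVert h\rVert^{2}=m$ and $\operatorname{\mathbb{E}}\lVert\xi\rVert^{2}=m\sigma_{\xi}^{2}$ (Assumption B), and assumes the aspect-ratio convention $\phi=m/n$:

```latex
\[
\operatorname{\mathbb{E}}\lVert r_{\eta,\ast}\rVert^{2}
=\Big(\frac{\eta}{\phi\tau_{\eta,\ast}}\Big)^{2}
 \Big\{\big(\phi\gamma_{\eta,\ast}^{2}-\sigma_{\xi}^{2}\big)\frac{m}{n}
       +\sigma_{\xi}^{2}\frac{m}{n}\Big\}
=\Big(\frac{\eta}{\phi\tau_{\eta,\ast}}\Big)^{2}\phi^{2}\gamma_{\eta,\ast}^{2}
=\Big(\frac{\eta\gamma_{\eta,\ast}}{\tau_{\eta,\ast}}\Big)^{2}.
\]
```

The right hand side is exactly the theoretical residual $\bar{R}^{\operatorname{\mathsf{res}}}_{(\Sigma,\mu_{0})}(\eta)$ introduced in Section 3.1, and it vanishes at $\eta=0$, in agreement with interpolation.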

We are now in a position to state our main distributional results for the Ridge(less) estimator $\widehat{\mu}_{\eta}$ and the residual $\widehat{r}_{\eta}$.

First we work under the Gaussian design $Z=G$, and we write $\widehat{\mu}_{\eta}=\widehat{\mu}_{\eta;G}$, $\widehat{r}_{\eta}=\widehat{r}_{\eta;G}$. Recall $\mathcal{H}_{\Sigma}=\operatorname{tr}(\Sigma^{-1})/n$.

Theorem 2.3.

Suppose Assumption A holds with $Z=G$ and the following hold for some $K>0$.

  • $1/K\leq\phi^{-1}\leq K$, $\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$.

  • Assumption B holds with $\sigma_{\xi}^{2}\in[1/K,K]$.

Then there exists some constant $C=C(K)>0$ such that the following hold.

  (1) For any $1$-Lipschitz function $\mathsf{g}:\mathbb{R}^{n}\to\mathbb{R}$ and $\varepsilon\in(0,1/2]$,

    \[ \sup_{\mu_{0}\in B_{n}(1)}\operatorname{\mathbb{P}}\Big(\sup_{\eta\in\Xi_{K}}\big\lvert\mathsf{g}(\widehat{\mu}_{\eta;G})-\operatorname{\mathbb{E}}\mathsf{g}\big(\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})\big)\big\rvert\geq\varepsilon\Big)\leq Cne^{-n\varepsilon^{4}/C}. \]
  (2) For any $\varepsilon\in(0,1/2]$, $\xi\in\mathbb{R}^{m}$ satisfying $\lvert\,\lVert\xi\rVert^{2}/m-\sigma_{\xi}^{2}\rvert\leq\varepsilon^{2}/C$, and $1$-Lipschitz function $\mathsf{h}:\mathbb{R}^{m}\to\mathbb{R}$ (which may depend on $\xi$),

    \[ \sup_{\mu_{0}\in B_{n}(1)}\operatorname{\mathbb{P}}^{\xi}\Big(\sup_{\eta\in[1/K,K]}\lvert\mathsf{h}(\widehat{r}_{\eta;G})-\operatorname{\mathbb{E}}^{\xi}\mathsf{h}(r_{\eta,\ast})\rvert\geq\varepsilon\Big)\leq Cne^{-n\varepsilon^{4}/C}. \]

The choice $\mu_{0}\in B_{n}(1)$ is made merely for simplicity of presentation; it can be replaced by $\mu_{0}\in B_{n}(R)$ with another constant $C$ that depends further on $R$. The assumption $\mathcal{H}_{\Sigma}\lesssim 1$ is quite common in the literature of Ridge(less) regression; see, e.g., [BMR21, Assumption 4.12] or a slight variant in [MRSY23, Assumption 1]. The major assumption in the above theorem is the Gaussianity of the design $X$. This may be lifted at the cost of a set of slightly stronger conditions.

Theorem 2.4.

Suppose Assumption A holds and the following hold for some $K>0$.

  • $1/K\leq\phi^{-1}\leq K$, $\lVert\Sigma\rVert_{\operatorname{op}}\vee\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\leq K$.

  • Assumption B holds with $\sigma_{\xi}^{2}\in[1/K,K]$.

Fix $\vartheta\in(0,1/18)$. There exist some $C=C(K,\vartheta)>0$ and two measurable sets $\mathcal{U}_{\vartheta}\subset B_{n}(1)$, $\mathcal{E}_{\vartheta}\subset\mathbb{R}^{m}$ with $\min\{\mathrm{vol}(\mathcal{U}_{\vartheta})/\mathrm{vol}(B_{n}(1)),\operatorname{\mathbb{P}}(\xi\in\mathcal{E}_{\vartheta})\}\geq 1-Ce^{-n^{2\vartheta}/C}$, such that the following hold.

  (1) For any $1$-Lipschitz function $\mathsf{g}:\mathbb{R}^{n}\to\mathbb{R}$, and $\varepsilon\in(0,1/2]$,

    \[ \sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}\operatorname{\mathbb{P}}\Big(\sup_{\eta\in\Xi_{K}}\big\lvert\mathsf{g}(\widehat{\mu}_{\eta})-\operatorname{\mathbb{E}}\mathsf{g}\big(\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})\big)\big\rvert\geq\varepsilon\Big)\leq C\varepsilon^{-13}n^{-1/6+3\vartheta}. \]
  (2) For any $\varepsilon\in(0,1/2]$, $\xi\in\mathcal{E}_{\vartheta}$ and $1$-Lipschitz function $\mathsf{h}:\mathbb{R}^{m}\to\mathbb{R}$ (which may depend on $\xi$),

    \[ \sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}\operatorname{\mathbb{P}}^{\xi}\Big(\sup_{\eta\in[1/K,K]}\lvert\mathsf{h}(\widehat{r}_{\eta})-\operatorname{\mathbb{E}}^{\xi}\mathsf{h}(r_{\eta,\ast})\rvert\geq\varepsilon\Big)\leq C\varepsilon^{-7}n^{-1/6+3\vartheta}. \]

Concrete forms of $\mathcal{U}_{\vartheta},\mathcal{E}_{\vartheta}$ are specified in Proposition 10.3.

Theorems 2.3 and 2.4 are proved in Section 9 and Section 10, respectively. As the proofs are highly technical, a sketch is outlined in Section 6.

We mention two particularly important features of the theorems above:

  (1) The distributional characterizations for $\widehat{\mu}_{\eta}$ in both theorems above are uniformly valid down to the interpolation regime $\eta=0$ for $\phi^{-1}>1$. This uniform control will play a crucial role in our non-asymptotic analysis of cross-validation methods to be studied in Section 4 ahead.

  (2) The distribution of the residual $\widehat{r}_{\eta}$ in (2) is formulated conditional on the noise $\xi$. A fundamental reason for adopting this formulation is that the distribution of $\widehat{r}_{\eta}$ is not universal with respect to the law of $\xi$. In other words, one cannot simply assume Gaussianity of $\xi$ in Theorem 2.3 in the hope of proving universality of $\widehat{r}_{\eta}$ in Theorem 2.4.

In the context of distributional characterizations for regularized regression estimators in the proportional regime, results in a similar vein to Theorem 2.3 have been obtained in the closely related Lasso setting for isotropic $\Sigma=I_{n}$ in [MM21], and for general $\Sigma$ in [CMW22], both under Gaussian designs and with strictly non-vanishing regularization. A substantially simpler, isotropic ($\Sigma=I_{n}$) version of Theorem 2.4 is obtained in [HS22] that holds pointwise in non-vanishing regularization level $\eta>0$. As will be clear from the proof sketch in Section 6, in addition to the complications due to the implicit nature of the solution to the fixed point equation (2.1) for general $\Sigma$, the major difficulty in proving Theorems 2.3 and 2.4 rests in handling the singularity of the optimization problem (1.2) as $\eta\downarrow 0$.

2.6. Weighted q\ell_{q} risks of μ^η\widehat{\mu}_{\eta}

As a demonstration of the analytic power of the above Theorems 2.3 and 2.4, we compute below the weighted $\ell_{q}$ risk $\lVert\mathsf{A}(\widehat{\mu}_{\eta}-\mu_{0})\rVert_{q}$ for a well-behaved matrix $\mathsf{A}\in\mathbb{R}^{n\times n}$ and $q\in(0,\infty)$.

Theorem 2.5.

Suppose the same conditions in Theorem 2.4 hold for some $K>0$. Fix $q\in(0,\infty)$ and a p.s.d. matrix $\mathsf{A}\in\mathbb{R}^{n\times n}$ with $\lVert\mathsf{A}\rVert_{\operatorname{op}}\vee\lVert\mathsf{A}^{-1}\rVert_{\operatorname{op}}\leq K$. Then there exist constants $C>1$, $\vartheta\in(0,1/50)$ depending on $K,q$, and a measurable set $\mathcal{U}_{\vartheta}\subset B_{n}(1)$ with $\mathrm{vol}(\mathcal{U}_{\vartheta})/\mathrm{vol}(B_{n}(1))\geq 1-Ce^{-n^{\vartheta}/C}$, such that

\[ \sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}\operatorname{\mathbb{P}}\bigg(\sup_{\eta\in\Xi_{K}}\bigg\lvert\frac{\lVert\mathsf{A}(\widehat{\mu}_{\eta}-\mu_{0})\rVert_{q}}{\bar{R}_{(\Sigma,\mu_{0});q}^{\mathsf{A}}(\eta)}-1\bigg\rvert\geq n^{-\vartheta}\bigg)\leq Cn^{-1/7}. \]

Here $\bar{R}_{(\Sigma,\mu_{0});q}^{\mathsf{A}}(\eta)\in\big\{\operatorname{\mathbb{E}}\lVert\mathsf{A}\big(\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})-\mu_{0}\big)\rVert_{q},\,n^{-1/2}\lVert\mathrm{diag}\big(\Gamma_{\eta;(\Sigma,\lVert\mu_{0}\rVert)}^{\mathsf{A}}\big)\rVert_{q/2}^{1/2}M_{q}\big\}$, where $M_{q}\equiv\operatorname{\mathbb{E}}^{1/q}\lvert\mathcal{N}(0,1)\rvert^{q}=2^{1/2}\big\{\Gamma\big((q+1)/2\big)/\sqrt{\pi}\big\}^{1/q}$,

\[ \Gamma_{\eta;(\Sigma,\lVert\mu_{0}\rVert)}^{\mathsf{A}}\equiv\mathsf{A}(\Sigma+\tau_{\eta,\ast}I_{n})^{-1}\Big(\widetilde{\gamma}_{\eta,\ast}^{2}(\lVert\mu_{0}\rVert)\Sigma+\tau_{\eta,\ast}^{2}\lVert\mu_{0}\rVert^{2}I_{n}\Big)(\Sigma+\tau_{\eta,\ast}I_{n})^{-1}\mathsf{A},\tag{2.7} \]

and $\widetilde{\gamma}_{\eta,\ast}^{2}(\lVert\mu_{0}\rVert)$ is defined in Proposition 2.1-(3).
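The constant $M_{q}$ in the theorem is the $q$-th absolute moment of a standard normal, raised to the power $1/q$. A minimal sketch of its evaluation via the Gamma-function formula above is as follows; the function name `M_q` is an illustrative choice.

```python
import math

def M_q(q):
    """M_q = (E|N(0,1)|^q)^(1/q) = sqrt(2) * (Gamma((q+1)/2) / sqrt(pi))^(1/q)."""
    if q <= 0:
        raise ValueError("q must be positive")
    return math.sqrt(2.0) * (math.gamma((q + 1.0) / 2.0) / math.sqrt(math.pi)) ** (1.0 / q)

m1 = M_q(1.0)   # E|N(0,1)| = sqrt(2/pi)
m2 = M_q(2.0)   # (E N(0,1)^2)^(1/2) = 1
m4 = M_q(4.0)   # (E N(0,1)^4)^(1/4) = 3^(1/4)
```

For $q=2$ the constant equals $1$, so the second expression for $\bar{R}_{(\Sigma,\mu_{0});2}^{\mathsf{A}}(\eta)$ reduces to $\{n^{-1}\operatorname{tr}(\Gamma_{\eta;(\Sigma,\lVert\mu_{0}\rVert)}^{\mathsf{A}})\}^{1/2}$.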

The proof of the above theorem can be found in Section 10.7. To the best of our knowledge, general weighted $\ell_{q}$ risks for the Ridge(less) estimator $\widehat{\mu}_{\eta}$ have not been available in the literature except for the special case $q=2$, for which $\lVert\mathsf{A}(\widehat{\mu}_{\eta}-\mu_{0})\rVert_{2}$ admits a closed-form expression in terms of the spectral statistics of $X$ that facilitates direct applications of RMT techniques, cf. [TV+04, EK13, Dic16, DW18, EK18, ASS20, WX20, BMR21, RMR21, HMRT22, CM22].

Here, obtaining $\ell_{q}$ risks for $q\in(0,2)$ via our Theorems 2.3 and 2.4 is relatively easy, as $x\mapsto\lVert x\rVert_{q}/n^{1/q-1/2}$ is $1$-Lipschitz with respect to $\lVert\cdot\rVert$ for $q\in(0,2)$. The stronger norm case $q\in(2,\infty)$ is significantly harder. In fact, we need additionally the following delocalization result for $\widehat{\mu}_{\eta}$.

Proposition 2.6.

Suppose the same conditions as in Theorem 2.5 hold for some $K>0$. Fix $\vartheta\in(0,1/2]$. Then there exist some constant $C=C(K,\vartheta)>0$ and a measurable set $\mathcal{U}_{\vartheta}\subset B_{n}(1)$ with $\mathrm{vol}(\mathcal{U}_{\vartheta})/\mathrm{vol}(B_{n}(1))\geq 1-Ce^{-n^{2\vartheta}/C}$, such that

\[ \sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}\operatorname{\mathbb{P}}\Big(\sup_{\eta\in\Xi_{K}}\lVert\mathsf{A}(\widehat{\mu}_{\eta}-\mu_{0})\rVert_{\infty}\geq Cn^{-1/2+\vartheta}\Big)\leq Cn^{-100}. \]

The above proposition is a simplified version of Proposition 10.3, proved via the anisotropic local laws developed in [KY17]. In essence, delocalization allows us to apply Theorems 2.3 and 2.4 with a truncated version of the $\ell_{q}$ norm ($q>2$) with a well-controlled Lipschitz constant with respect to $\ell_{2}$. Moreover, delocalization of $\widehat{\mu}_{\eta}$ also serves as a key technical ingredient in proving the universality Theorem 2.4; the readers are referred to Section 6 for a detailed account of the technical connection between delocalization and universality.

Convention on probability estimates:

  (1) When $Z=G$, $n^{-1/7}$ in Theorem 2.5 can be replaced by $n^{-D}$ for any $D>0$.

  (2) $n^{-100}$ in Proposition 2.6 can be replaced by $n^{-D}$ for any $D>0$.

The cost will be a possibly enlarged constant $C>0$ that depends further on $D$. This convention applies to other statements in the following sections in which the probability estimates $n^{-1/7},n^{-100}$ appear.

3. 2\ell_{2} risk formulae and phase transitions

In this section, we will study in some detail the behavior of various $\ell_{2}$ risks associated with $\widehat{\mu}_{\eta}$. Compared to the general $\ell_{q}$ risks as in Theorem 2.5, the major additional analytic advantage of working with $\ell_{2}$ risks is their close connection to techniques from random matrix theory (RMT). As will be clear from below, this allows us to characterize certain global, qualitative behavior of these $\ell_{2}$ risk curves with respect to the regularization level $\eta$.

3.1. Definitions of various 2\ell_{2} risks

Recall the notation $R^{\#}_{(\Sigma,\mu_{0})}(\eta)$ defined in Section 1.3. Let their ‘theoretical’ versions be defined as follows:

  • $\bar{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)\equiv\tau_{\eta,\ast}^{2}\lVert(\Sigma+\tau_{\eta,\ast}I_{n})^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}+\frac{\gamma_{\eta,\ast}^{2}}{n}\operatorname{tr}\big(\Sigma^{2}(\Sigma+\tau_{\eta,\ast}I_{n})^{-2}\big)$.

  • $\bar{R}^{\operatorname{\mathsf{est}}}_{(\Sigma,\mu_{0})}(\eta)\equiv\tau_{\eta,\ast}^{2}\lVert(\Sigma+\tau_{\eta,\ast}I_{n})^{-1}\mu_{0}\rVert^{2}+\frac{\gamma_{\eta,\ast}^{2}}{n}\operatorname{tr}\big(\Sigma(\Sigma+\tau_{\eta,\ast}I_{n})^{-2}\big)$.

  • $\bar{R}^{\operatorname{\mathsf{in}}}_{(\Sigma,\mu_{0})}(\eta)\equiv\big(\frac{\eta\gamma_{\eta,\ast}}{\tau_{\eta,\ast}}\big)^{2}+\phi\sigma_{\xi}^{2}\cdot\big(1-\frac{2\eta}{\phi\tau_{\eta,\ast}}\big)$.

We also define the residual and its theoretical version as

  • $R^{\operatorname{\mathsf{res}}}_{(\Sigma,\mu_{0})}(\eta)\equiv n^{-1}\lVert Y-X\widehat{\mu}_{\eta}\rVert^{2}$, $\bar{R}^{\operatorname{\mathsf{res}}}_{(\Sigma,\mu_{0})}(\eta)\equiv\big(\frac{\eta\gamma_{\eta,\ast}}{\tau_{\eta,\ast}}\big)^{2}$.

The following theorem follows easily from Theorems 2.3 and 2.4. The proof of this and all other results in this section can be found in Section 11.

Theorem 3.1.

Suppose Assumption A holds and the following hold for some $K>0$.

  • $1/K\leq\phi^{-1}\leq K$, $\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\vee\lVert\Sigma\rVert_{\operatorname{op}}\leq K$.

  • Assumption B holds with $\sigma_{\xi}^{2}\in[1/K,K]$.

Fix a small enough $\vartheta\in(0,1/50)$. Then there exist a constant $C=C(K,\vartheta)>1$, and a measurable set $\mathcal{U}_{\vartheta}\subset B_{n}(1)$ with $\mathrm{vol}(\mathcal{U}_{\vartheta})/\mathrm{vol}(B_{n}(1))\geq 1-Ce^{-n^{\vartheta}/C}$, such that for any $\varepsilon\in(0,1/2]$, and $\#\in\{\operatorname{\mathsf{pred}},\operatorname{\mathsf{est}},\operatorname{\mathsf{in}},\operatorname{\mathsf{res}}\}$,

\[ \sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}\operatorname{\mathbb{P}}\Big(\sup_{\eta\in\Xi^{\#}}\lvert R^{\#}_{(\Sigma,\mu_{0})}(\eta)-\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)\rvert\geq\varepsilon\Big)\leq C\cdot\begin{cases}ne^{-n\varepsilon^{4}/C},&Z=G;\\ \varepsilon^{-c_{0}}n^{-1/6.5},&\text{otherwise}.\end{cases} \]

Here $\Xi^{\#}=\Xi_{K}$ for $\#\in\{\operatorname{\mathsf{pred}},\operatorname{\mathsf{est}}\}$ and $\Xi^{\#}=[1/K,K]$ for $\#\in\{\operatorname{\mathsf{in}},\operatorname{\mathsf{res}}\}$, and $c_{0}>0$ is universal. Moreover, when $Z=G$, the supremum in the above display extends to $\mu_{0}\in B_{n}(1)$, and the constant $C>0$ does not depend on $\vartheta$.
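The content of Theorem 3.1 can be illustrated by a small simulation in the isotropic Gaussian case. The sketch below makes several assumptions that are not restated in this section: the aspect ratio is $\phi=m/n$; the Ridge estimator has the closed form $\widehat{\mu}_{\eta}=(X^{\top}X/n+\eta I_{n})^{-1}X^{\top}Y/n$, which is the normalization consistent with the trace identity in (2.4); and the prediction risk is $R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)=\lVert\Sigma^{1/2}(\widehat{\mu}_{\eta}-\mu_{0})\rVert^{2}$. Dimensions, seed and parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 400, 800
phi, eta, sigma2 = m / n, 0.5, 0.5

# Signal on the unit sphere, Gaussian design with Sigma = I_n, Gaussian noise.
mu0 = rng.standard_normal(n)
mu0 /= np.linalg.norm(mu0)
X = rng.standard_normal((m, n))
Y = X @ mu0 + np.sqrt(sigma2) * rng.standard_normal(m)

# Ridge estimator (assumed normalization, see the lead-in).
mu_hat = np.linalg.solve(X.T @ X / n + eta * np.eye(n), X.T @ Y / n)
R_pred = np.sum((mu_hat - mu0) ** 2)          # Sigma = I_n
R_res = np.sum((Y - X @ mu_hat) ** 2) / n

# Fixed point (2.1) for Sigma = I_n and ||mu0|| = 1.
b = phi - eta - 1.0
tau = (-b + np.sqrt(b * b + 4.0 * phi * eta)) / (2.0 * phi)
s2 = 1.0 / (1.0 + tau) ** 2
gamma2 = (sigma2 + tau**2 * s2) / (phi - s2)

# Theoretical risks from Section 3.1.
Rbar_pred = tau**2 * s2 + gamma2 * s2
Rbar_res = eta**2 * gamma2 / tau**2
```

With these parameter values the theoretical quantities are $\bar{R}^{\operatorname{\mathsf{pred}}}\approx 0.707$ and $\bar{R}^{\operatorname{\mathsf{res}}}\approx 0.104$, and the simulated `R_pred` and `R_res` should fall close to them.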

Remark 1.

For $\#\in\{\operatorname{\mathsf{in}},\operatorname{\mathsf{res}}\}$, we may take $\Xi^{\#}=\Xi_{K}$ at the cost of a worsened probability estimate $C(ne^{-n\varepsilon^{c_{0}}/C}+\varepsilon^{-c_{0}}n^{-1/6.5}\bm{1}_{Z\neq G})$, cf. Lemma 11.4.

A substantial recent line of research has focused on understanding the behavior of $R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)$, with a special emphasis on the interpolating regime $\eta\approx 0$ when $\phi^{-1}>1$, cf. [BLLT20, KLS20, WX20, BMR21, KZSS21, RMR21, CM22, HMRT22, TB22, ZKS+22]. We refer the readers to [TB22, Section 1.2 and Section 9] for a thorough account and a state-of-the-art review of various results on $R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(0)$ and their comparisons.

Here, the closest non-asymptotic results on exact risk characterizations related to our Theorem 3.1 appear to be those presented in (i) [HMRT22, Theorems 2 and 5], which proved non-asymptotic additive approximations $R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)=\bar{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)+\mathfrak{o}_{\mathbf{P}}(1)$, and (ii) [CM22, Theorems 1 and 2], which provided substantially refined, multiplicative approximations $R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)/\bar{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)=1+\mathfrak{o}_{\mathbf{P}}(1)$ that hold beyond the proportional regime. Both works [HMRT22, CM22] leverage the closed form of the Ridge(less) estimator $\widehat{\mu}_{\eta}$ to analyze the bias and variance terms in $R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)$, by means of calculus for the resolvent of the sample covariance. Their analysis works under $\eta\gg n^{-c_{0}}$ for some suitable $c_{0}>0$. For the case $\#=\operatorname{\mathsf{pred}}$, Theorem 3.1 above complements the results in [HMRT22, CM22] by providing uniform control in $\eta$ when $\phi^{-1}>1$ (under a set of different conditions).

3.2. Approximate representation of R¯(Σ,μ0)#\bar{R}^{\#}_{(\Sigma,\mu_{0})} via RMT

Recall the Stieltjes transform $\mathfrak{m}$ defined via (2.2), and the signal-to-noise ratio $\operatorname{\mathsf{SNR}}_{\mu_{0}}=\lVert\mu_{0}\rVert^{2}/\sigma_{\xi}^{2}$ (for $\sigma_{\xi}^{2}=0$, we interpret $\sigma_{\xi}^{2}\cdot\operatorname{\mathsf{SNR}}_{\mu_{0}}=\lVert\mu_{0}\rVert^{2}$). The following theorem provides an efficient RMT representation of $\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)$ that holds for ‘most’ $\mu_{0}$’s.

Theorem 3.2.

Suppose $1/K\leq\phi^{-1}\leq K$, $\sigma_{\xi}^{2}\in[0,K]$ and $\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$ for some $K>0$. There exists some constant $C=C(K)>0$ such that for any $\varepsilon\in(0,1/2]$, we may find a measurable set $\mathcal{U}_{\varepsilon}\subset B_{n}(1)$ with $\mathrm{vol}(\mathcal{U}_{\varepsilon})/\mathrm{vol}(B_{n}(1))\geq 1-C\varepsilon^{-1}e^{-n\varepsilon^{2}/C}$, such that

\[ \sup_{\mu_{0}\in\mathcal{U}_{\varepsilon}}\sup_{\eta\in\Xi_{K}}\lvert\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)-\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)\rvert\leq\varepsilon.\tag{3.1} \]

Here with $\mathfrak{m}_{\eta}\equiv\mathfrak{m}(-\eta/\phi)$ and $\mathfrak{m}_{\eta}^{\prime}\equiv\mathfrak{m}^{\prime}(-\eta/\phi)$,

  • $\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)\equiv\sigma_{\xi}^{2}\cdot\Big\{\frac{1}{\mathfrak{m}_{\eta}^{2}}\Big(\phi\cdot\operatorname{\mathsf{SNR}}_{\mu_{0}}\mathfrak{m}_{\eta}-\big(\eta\cdot\operatorname{\mathsf{SNR}}_{\mu_{0}}-1\big)\mathfrak{m}_{\eta}^{\prime}\Big)-1\Big\}$,

  • $\mathscr{R}^{\operatorname{\mathsf{est}}}_{(\Sigma,\mu_{0})}(\eta)\equiv\sigma_{\xi}^{2}\cdot\Big\{\operatorname{\mathsf{SNR}}_{\mu_{0}}(1-\phi)+\mathfrak{m}_{\eta}+\frac{\eta}{\phi}\big(\eta\cdot\operatorname{\mathsf{SNR}}_{\mu_{0}}-1\big)\mathfrak{m}_{\eta}^{\prime}\Big\}$,

  • $\mathscr{R}^{\operatorname{\mathsf{in}}}_{(\Sigma,\mu_{0})}(\eta)\equiv\sigma_{\xi}^{2}\cdot\frac{\eta^{2}}{\phi}\Big(\phi\cdot\operatorname{\mathsf{SNR}}_{\mu_{0}}\mathfrak{m}_{\eta}-\big(\eta\cdot\operatorname{\mathsf{SNR}}_{\mu_{0}}-1\big)\mathfrak{m}_{\eta}^{\prime}\Big)+\sigma_{\xi}^{2}\cdot(\phi-2\eta\mathfrak{m}_{\eta})$,

  • $\mathscr{R}^{\operatorname{\mathsf{res}}}_{(\Sigma,\mu_{0})}(\eta)\equiv\sigma_{\xi}^{2}\cdot\frac{\eta^{2}}{\phi}\Big(\phi\cdot\operatorname{\mathsf{SNR}}_{\mu_{0}}\mathfrak{m}_{\eta}-\big(\eta\cdot\operatorname{\mathsf{SNR}}_{\mu_{0}}-1\big)\mathfrak{m}_{\eta}^{\prime}\Big)$.

When $\Sigma=I_{n}$, we may take $\mathcal{U}_{\varepsilon}=B_{n}(1)$ and (3.1) holds with $\varepsilon=0$.

The RMT representation above yields a crucial insight into the extremal behavior of the risk maps $\eta\mapsto\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)$. In fact, the following derivative formulae hold.

Proposition 3.3.

For $\#\in\{\operatorname{\mathsf{pred}},\operatorname{\mathsf{est}},\operatorname{\mathsf{in}}\}$,

\[ \partial_{\eta}\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)=\sigma_{\xi}^{2}\cdot\mathfrak{M}^{\#}(\eta)\cdot\big(\eta\cdot\operatorname{\mathsf{SNR}}_{\mu_{0}}-1\big). \]

Here with $\rho$ denoting the asymptotic eigenvalue density of $\check{\Sigma}$, and $z_{\eta}\equiv\eta/\phi$,

\[ \mathfrak{M}^{\#}(\eta)=2\cdot\begin{cases}(\phi\mathfrak{m}^{3}_{\eta})^{-1}\Big\{\int\frac{\rho(\mathrm{d}x)}{(x+z_{\eta})}\int\frac{\rho(\mathrm{d}x)}{(x+z_{\eta})^{3}}-\Big(\int\frac{\rho(\mathrm{d}x)}{(x+z_{\eta})^{2}}\Big)^{2}\Big\},&\#=\operatorname{\mathsf{pred}};\\ \int\frac{x}{(x+z_{\eta})^{3}}\,\rho(\mathrm{d}x),&\#=\operatorname{\mathsf{est}};\\ \int\frac{x^{2}}{(x+z_{\eta})^{3}}\,\rho(\mathrm{d}x),&\#=\operatorname{\mathsf{in}}.\end{cases} \]

As $\mathrm{supp}(\rho)\subset[0,\infty)$ (cf. [KY17, Lemma 2.2]), it is easy to see $\mathfrak{M}^{\#}\geq 0$ using the above integral representation via $\rho$ (for $\#=\operatorname{\mathsf{pred}}$, by the Cauchy-Schwarz inequality). In Proposition 11.3 ahead, we will give a different representation of $\mathfrak{M}^{\#}$ via the effective regularization $\tau_{\eta,\ast}$. With the help of the stability estimates of $\tau_{\eta,\ast}$ in Proposition 2.1, this representation allows us to derive a much stronger estimate

\[ 1/C\leq\mathfrak{M}^{\#}(\eta)\leq C,\quad\forall\eta\in\Xi_{K},\quad\#\in\{\operatorname{\mathsf{pred}},\operatorname{\mathsf{est}},\operatorname{\mathsf{in}}\},\tag{3.2} \]

under the condition $\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$.

A particularly important consequence of (3.2) is that the maps $\eta\mapsto\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)$ are approximately (locally) quadratic functions centered at the same point $\eta_{\ast}=\operatorname{\mathsf{SNR}}_{\mu_{0}}^{-1}$ for all $\#\in\{\operatorname{\mathsf{pred}},\operatorname{\mathsf{est}},\operatorname{\mathsf{in}}\}$. Therefore, in view of Theorems 3.1 and 3.2, the same holds approximately for the risk maps $\eta\mapsto R^{\#}_{(\Sigma,\mu_{0})}(\eta),\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)$ for ‘most’ $\mu_{0}$’s. Moreover, the approximate local quadraticity of $\eta\mapsto\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)$ due to (3.2) allows one to relate tightly the change of value in $\eta$ and that of the actual risks $R^{\#}_{(\Sigma,\mu_{0})}(\eta)$. This will play an important technical role in validating the optimality of cross-validation schemes beyond prediction errors in Section 4 ahead.
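The location of the minimizer can be seen on a grid in the isotropic case, where Theorem 3.2 gives $\bar{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}=\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}$ exactly. The sketch below evaluates $\bar{R}^{\operatorname{\mathsf{pred}}}_{(I_{n},\mu_{0})}(\eta)=\phi\gamma_{\eta,\ast}^{2}-\sigma_{\xi}^{2}$, which follows from the first equation of (2.1) under the assumption that $\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}$ coincides with the expression defining $\bar{R}^{\operatorname{\mathsf{pred}}}$, and locates its minimizer over $\eta$. The function name, the grid and the parameter values are illustrative.

```python
import numpy as np

def rbar_pred_iso(eta, phi, sigma2, mu_norm2):
    """Theoretical prediction risk for Sigma = I_n: phi * gamma^2 - sigma^2."""
    b = phi - eta - 1.0
    tau = (-b + np.sqrt(b * b + 4.0 * phi * eta)) / (2.0 * phi)
    s2 = 1.0 / (1.0 + tau) ** 2
    gamma2 = (sigma2 + tau**2 * s2 * mu_norm2) / (phi - s2)
    return phi * gamma2 - sigma2

phi, sigma2, mu_norm2 = 0.5, 0.5, 1.0
etas = np.linspace(0.0, 3.0, 301)            # grid step 0.01, includes eta = 0
risks = rbar_pred_iso(etas, phi, sigma2, mu_norm2)

eta_hat = etas[np.argmin(risks)]             # grid minimizer
eta_star = sigma2 / mu_norm2                 # 1 / SNR
risk_at_interpolation = risks[0]
```

Here the grid minimizer should sit at $\eta_{\ast}=\sigma_{\xi}^{2}/\lVert\mu_{0}\rVert^{2}=0.5$, with risk strictly below the value at $\eta=0$, so that in this noisy example interpolation is suboptimal.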

Remark 2.

Proposition 3.3 can also be used to study certain qualitative features of the optimally tuned (theoretical) risks 𝖮𝖯𝖳(Σ,μ0)#minη0(Σ,μ0)#(η)/σξ2=(Σ,μ0)#(η)/σξ2\operatorname{\mathsf{OPT}}^{\#}_{(\Sigma,\mu_{0})}\equiv\min_{\eta\geq 0}\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)/\sigma_{\xi}^{2}=\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta_{\ast})/\sigma_{\xi}^{2} for #{𝗉𝗋𝖾𝖽,𝖾𝗌𝗍,𝗂𝗇}\#\in\{\operatorname{\mathsf{pred}},\operatorname{\mathsf{est}},\operatorname{\mathsf{in}}\}. Some further results in this direction are detailed in Appendix A. In particular, Proposition A.1 therein proves the monotonicity of 𝖮𝖯𝖳(Σ,μ0)#\operatorname{\mathsf{OPT}}^{\#}_{(\Sigma,\mu_{0})} with respect to the aspect ratio ϕ\phi for #{𝗉𝗋𝖾𝖽,𝖾𝗌𝗍,𝗂𝗇}\#\in\{\operatorname{\mathsf{pred}},\operatorname{\mathsf{est}},\operatorname{\mathsf{in}}\}.

3.3. Phase transitions on the optimality of interpolation

The fact that the maps η(Σ,μ0)#(η)\eta\mapsto\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta) admit the same global minimizer η=𝖲𝖭𝖱μ01=σξ2/μ02\eta_{\ast}=\operatorname{\mathsf{SNR}}_{\mu_{0}}^{-1}=\sigma_{\xi}^{2}/\lVert\mu_{0}\rVert^{2} due to Proposition 3.3 may be viewed from a different perspective: η=0\eta_{\ast}=0 if and only if σξ2=0\sigma_{\xi}^{2}=0 for ‘most’ signal μ0\mu_{0}’s. Combined with Theorems 3.1 and 3.2, it naturally suggests that for ‘most’ μ0\mu_{0}’s, interpolation is optimal simultaneously for prediction, estimation and in-sample risks, if and only if the noise level is (nearly) zero. The following theorem makes this precise.

Theorem 3.4.

Suppose Assumptions A-B hold, and $\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\vee\lVert\Sigma\rVert_{\operatorname{op}}\leq K$ for some $K>0$. Fix a small enough $\vartheta\in(0,1/50)$. The following hold for all $\#\in\{\mathsf{pred},\mathsf{est},\mathsf{in}\}$.

  (1) (Noisy case). Suppose $1/K\leq\phi^{-1}\leq K$ and $1/K\leq\sigma_{\xi}^{2}\leq K$. Fix $\delta\in(0,1/2]$ and $L\geq K/\delta^{2}$. There exist a constant $C=C(K,L,\delta,\vartheta)>0$ and a measurable set $\mathcal{U}_{\delta,\vartheta}\subset B_{n}(1)\setminus B_{n}(\delta)$ with $\mathrm{vol}(\mathcal{U}_{\delta,\vartheta})/\mathrm{vol}(B_{n}(1)\setminus B_{n}(\delta))\geq 1-Ce^{-n^{\vartheta}/C}$, such that

    \[
    \sup_{\mu_{0}\in\mathcal{U}_{\delta,\vartheta}}\mathbb{P}\bigg(\inf_{\eta^{\prime}\in\Xi_{L}:\lvert\eta^{\prime}-\mathsf{SNR}_{\mu_{0}}^{-1}\rvert\geq\delta}\big\lvert R^{\#}_{(\Sigma,\mu_{0})}(\eta^{\prime})-\min_{\eta\in\Xi_{L}}R^{\#}_{(\Sigma,\mu_{0})}(\eta)\big\rvert<\frac{1}{C}\bigg)\leq Cn^{-1/7}.
    \]

  (2) (Noiseless case). Suppose $1+1/K\leq\phi^{-1}\leq K$ and $\sigma_{\xi}^{2}=0$. There exist a constant $C=C(K,\vartheta)>0$ and a measurable set $\mathcal{U}_{\vartheta}\subset B_{n}(1)$ with $\mathrm{vol}(\mathcal{U}_{\vartheta})/\mathrm{vol}(B_{n}(1))\geq 1-Ce^{-n^{\vartheta}/C}$, such that

    \[
    \sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}\mathbb{P}\Big(R^{\#}_{(\Sigma,\mu_{0})}(0)\geq\min_{\eta\in[0,K]}R^{\#}_{(\Sigma,\mu_{0})}(\eta)+n^{-\vartheta}\Big)\leq Cn^{-1/7}.
    \]

We note that the sub-optimality of interpolation in the noisy setting σξ2>0\sigma_{\xi}^{2}>0, at least for ‘most’ μ0\mu_{0}’s as described in the above theorem, does not contradict the possible benign behavior of the prediction risk R(Σ,μ0)𝗉𝗋𝖾𝖽(0)R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(0) extensively studied in the literature (cited after Theorem 3.1). In fact, from a conceptual standpoint, the observation that interpolation may fall short while a certain amount of regularization remains advantageous in ‘typical’ scenarios aligns closely with the conventional statistical wisdom, which emphasizes the vital role of regularization in striking a balance between the bias and variance [JWHT21].

On the other hand, the optimality of interpolation in the noiseless setting σξ2=0\sigma_{\xi}^{2}=0 stated in the above theorem is not as intuitively obvious. This phenomenon can be elucidated from our distributional characterizations in Theorems 2.3 and 2.4. As the effective regularization τη,\tau_{\eta,\ast} decreases as η0\eta\downarrow 0, and for σξ2=0\sigma_{\xi}^{2}=0, the effective noise γη,2μ02(τη,ηητη,)\gamma_{\eta,\ast}^{2}\approx\lVert\mu_{0}\rVert^{2}(\tau_{\eta,\ast}-\eta\partial_{\eta}\tau_{\eta,\ast}) (cf. Proposition 2.1-(3)) also decreases as η0\eta\downarrow 0, both the bias and variance of μ^η\widehat{\mu}_{\eta} are also expected to decrease as η0\eta\downarrow 0, at least for #=𝖾𝗌𝗍\#=\operatorname{\mathsf{est}} and for ‘most’ μ0\mu_{0}’s. The above theorem rigorously establishes this heuristic for all the prediction, estimation and in-sample risks.

Remark 3.

Some remarks on the connection of Theorem 3.4 to the literature:

  1. (1)

    Theorem 3.4-(1) is related to some results in [CDK22] which considers a Bayesian setting with an isotropic prior on μ0\mu_{0} in that 𝔼μ0=0\operatorname{\mathbb{E}}\mu_{0}=0 and Cov(μ0)=In/n\operatorname{Cov}(\mu_{0})=I_{n}/n. In particular, [CDK22, Theorem 3-(ii)] shows that, using our notation, in the proportional asymptotics m/nϕ(0,)m/n\to\phi\in(0,\infty), if σξ2>0\sigma_{\xi}^{2}>0 and the empirical spectral distribution of Σ\Sigma converges, then with X\operatorname{\mathbb{P}}^{X}-probability 11, limn|𝔼[R(Σ,μ0)𝗉𝗋𝖾𝖽(η)|X]minη>0𝔼[R(Σ,μ0)𝗉𝗋𝖾𝖽(η)|X]|=0\lim_{n\to\infty}\lvert\operatorname{\mathbb{E}}[R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta_{\ast})|X]-\min_{\eta^{\prime}>0}\operatorname{\mathbb{E}}[R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta^{\prime})|X]\rvert=0. Here for the case #=𝗉𝗋𝖾𝖽\#=\operatorname{\mathsf{pred}}, our Theorem 3.4-(1) above provides a non-asymptotic, and more importantly, a non-Bayesian version that holds for ‘most’ μ0\mu_{0}’s.

  2. (2)

    As mentioned before, the recent work [CM22] proves an important, multiplicative characterization R(Σ,μ0)𝗉𝗋𝖾𝖽(η)/R¯(Σ,μ0)𝗉𝗋𝖾𝖽(η)=1+𝔬𝐏(1)R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)/\bar{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)=1+\mathfrak{o}_{\mathbf{P}}(1) beyond the proportional regime. In particular, the multiplicative formulation encompasses several important cases of data covariance Σ\Sigma with decaying eigenvalues that lead to benign overfitting R¯(Σ,μ0)𝗉𝗋𝖾𝖽(0)1\bar{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(0)\ll 1, cf. [CM22, Section 4.2]. We conjecture that Theorem 3.4 also remains valid for such irregular data covariance, in a similar multiplicative formulation. However, formally establishing this validity remains an interesting open question.

4. Cross-validation: optimality beyond prediction

This section is devoted to the validation of the broad optimality of two widely used cross-validation schemes beyond the prediction risk. Some consequences to statistical inference via debiased Ridge(less) estimators will also be discussed.

4.1. Estimation of effective noise and regularization

We shall first take a slight detour, by considering estimation of the effective regularization τη,\tau_{\eta,\ast} and the effective noise γη,\gamma_{\eta,\ast}. We propose the following estimators:

\[
\begin{cases}
\widehat{\tau}_{\eta}\equiv\Big\{\frac{1}{m}\operatorname{tr}\big(\frac{1}{m}XX^{\top}+\frac{\eta}{\phi}I_{m}\big)^{-1}\Big\}^{-1}=\big\{\operatorname{tr}(XX^{\top}+\eta\cdot nI_{m})^{-1}\big\}^{-1},\\
\widehat{\gamma}_{\eta}\equiv\frac{\widehat{\tau}_{\eta}}{\sqrt{n}}\Big(\eta^{-1}\lVert Y-X\widehat{\mu}_{\eta}\rVert\bm{1}_{\phi^{-1}<1}+\lVert(XX^{\top}/n)^{-1}X\widehat{\mu}_{\eta}\rVert\bm{1}_{\phi^{-1}\geq 1}\Big).
\end{cases} \tag{4.1}
\]

These estimators are not only useful in their own right; they will also play an important role in understanding the generalized cross-validation scheme in the next subsection.
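The estimators in (4.1) can be computed directly from $(X,Y)$. The following is a minimal NumPy sketch, not the authors' implementation: the function names `ridge` and `tau_gamma_hat` are hypothetical, the Ridge estimator is taken to minimize $\frac{1}{2n}\lVert Y-X\mu\rVert^{2}+\frac{\eta}{2}\lVert\mu\rVert^{2}$ (with the minimum-norm solution at $\eta=0$), and $XX^{\top}+\eta nI_{m}$ is assumed invertible.

```python
import numpy as np

def ridge(X, Y, eta):
    """Ridge(less) estimator for (1/2n)||Y - X mu||^2 + (eta/2)||mu||^2.

    Uses the SVD so that eta = 0 returns the minimum-norm least squares solution.
    """
    n = X.shape[1]
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > 1e-10 * s.max()          # drop numerically zero singular values
    U, s, Vt = U[:, keep], s[keep], Vt[keep]
    return Vt.T @ ((s / (s**2 + n * eta)) * (U.T @ Y))

def tau_gamma_hat(X, Y, eta):
    """Sketch of (tau_hat, gamma_hat) in (4.1).

    Assumes X X^T + n*eta*I is invertible; the branch m > n needs eta > 0.
    """
    m, n = X.shape
    A_inv = np.linalg.inv(X @ X.T + n * eta * np.eye(m))
    tau_hat = 1.0 / np.trace(A_inv)
    mu_hat = ridge(X, Y, eta)
    if m > n:   # underparametrized branch, phi^{-1} < 1
        norm_term = np.linalg.norm(Y - X @ mu_hat) / eta
    else:       # overparametrized branch, phi^{-1} >= 1
        norm_term = np.linalg.norm(np.linalg.solve(X @ X.T / n, X @ mu_hat))
    return tau_hat, tau_hat / np.sqrt(n) * norm_term
```

When $\eta>0$ and $XX^{\top}$ is invertible, both branches reduce to $\widehat{\tau}_{\eta}\sqrt{n}\,\lVert(XX^{\top}+\eta nI_{m})^{-1}Y\rVert$, which gives a convenient numerical check.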

The following theorem provides a formal justification for $\widehat{\tau}_{\eta},\widehat{\gamma}_{\eta}$ in (4.1). The proofs of this and all other results in this section can be found in Section 12.

Theorem 4.1.

Suppose Assumption A holds, and $1/K\leq\phi^{-1}\leq K$, $\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\vee\lVert\Sigma\rVert_{\operatorname{op}}\leq K$ hold for some $K>0$.

  (1) For any small $\varepsilon>0$, there exists some $C_{1}=C_{1}(K,\varepsilon)>0$ such that

    \[
    \mathbb{P}\Big(\sup_{\eta\in\Xi_{K}}\lvert\widehat{\tau}_{\eta}-\tau_{\eta,\ast}\rvert\geq n^{-1/2+\varepsilon}\Big)\leq C_{1}n^{-100}.
    \]

  (2) Suppose further Assumption B holds with either (i) $\sigma_{\xi}^{2}\in[1/K,K]$ or (ii) $\sigma_{\xi}^{2}\in[0,K]$ with $1+1/K\leq\phi^{-1}\leq K$. Fix a small enough constant $\vartheta\in(0,1/50)$. Then there exist a constant $C_{2}=C_{2}(K,\vartheta)>1$ and a measurable set $\mathcal{U}_{\vartheta}\subset B_{n}(1)$ with $\mathrm{vol}(\mathcal{U}_{\vartheta})/\mathrm{vol}(B_{n}(1))\geq 1-C_{2}e^{-n^{\vartheta}/C_{2}}$, such that

    \[
    \sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}\mathbb{P}\Big(\sup_{\eta\in\Xi_{K}}\lvert\widehat{\gamma}_{\eta}-\gamma_{\eta,\ast}\rvert\geq n^{-\vartheta}\Big)\leq C_{2}n^{-1/7}.
    \]
Remark 4.

The original noise level $\sigma_{\xi}^{2}$ can be consistently estimated when $\Sigma$ is known. In particular, we may use

\[
\widehat{\sigma}_{\eta}^{2}\equiv\widehat{\gamma}_{\eta}^{2}\big(1-\phi+2\eta\widehat{\tau}_{\eta}^{-1}\big)-\widehat{\tau}_{\eta}^{2}\cdot\lVert\Sigma^{-1/2}\widehat{\mu}_{\eta}\rVert^{2}. \tag{4.2}
\]

Estimators for $\sigma_{\xi}^{2}$ of a similar flavor for other convex regularized estimators under Gaussian designs can be found in [BEM13, Bel20, MM21]. An advantage of $\widehat{\sigma}_{\eta}^{2}$ in (4.2) is its validity in the interpolation regime when $\phi^{-1}>1$. We may also replace $\widehat{\tau}_{\eta}$ in the above display by $\tau_{\eta,\ast}$, as it can be solved exactly with known $\Sigma$ using the second equation of (2.1). A formal proof of $\widehat{\sigma}_{\eta}^{2}\stackrel{\mathbb{P}}{\approx}\sigma_{\xi}^{2}$ can be carried out similarly to that of Theorem 4.1-(2) above, so we omit the details.

4.2. Validation of cross-validation schemes

4.2.1. Generalized cross-validation

Consider choosing $\eta$ by minimizing the estimated effective noise $\widehat{\gamma}_{\eta}$ given in (4.1): for any $L>0$,

\[
\widehat{\eta}^{\mathsf{GCV}}_{L}\in\operatorname*{arg\,min}_{\eta\in\Xi_{L}}\widehat{\gamma}_{\eta}. \tag{4.3}
\]

The subscript $L$ in $\widehat{\eta}^{\mathsf{GCV}}_{L}$ will usually be suppressed for notational simplicity.

The above proposal (4.3) is known in the literature as the generalized cross validation [CW79, GHW79] in the underparametrized regime ϕ1<1\phi^{-1}<1, with the same form of modification [HMRT22, Eqn. (48)] in the overparametrized regime ϕ1>1\phi^{-1}>1. The connection there is strongly tied to the so-called shortcut formula for leave-one-out cross validation that exists uniquely for Ridge regression, cf. [HMRT22, Eqn. (46)].
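To make the link with the classical criterion explicit, the following short calculation for $\eta>0$ suggests that $\widehat{\gamma}_{\eta}^{2}$ is a rescaling of the usual generalized cross validation score. This is a sketch: the symbols $A_{\eta}$, $S_{\eta}$ and $\mathrm{GCV}(\eta)$ are notation introduced here for illustration, with $\mathrm{GCV}$ taken in its textbook form, and $\phi=m/n$.

\[
\begin{aligned}
A_{\eta}&\equiv XX^{\top}+n\eta I_{m},\qquad S_{\eta}\equiv XX^{\top}A_{\eta}^{-1},\qquad \widehat{\mu}_{\eta}=X^{\top}A_{\eta}^{-1}Y,\\
Y-X\widehat{\mu}_{\eta}&=(I_{m}-S_{\eta})Y=n\eta A_{\eta}^{-1}Y,\qquad \tfrac{1}{m}\operatorname{tr}(I_{m}-S_{\eta})=\tfrac{n\eta}{m}\operatorname{tr}(A_{\eta}^{-1})=\tfrac{n\eta}{m\widehat{\tau}_{\eta}},\\
\mathrm{GCV}(\eta)&\equiv\frac{m^{-1}\lVert(I_{m}-S_{\eta})Y\rVert^{2}}{\{m^{-1}\operatorname{tr}(I_{m}-S_{\eta})\}^{2}}=m\,\widehat{\tau}_{\eta}^{2}\,\lVert A_{\eta}^{-1}Y\rVert^{2},\\
\widehat{\gamma}_{\eta}^{2}&=\frac{\widehat{\tau}_{\eta}^{2}}{n}\cdot\eta^{-2}\lVert Y-X\widehat{\mu}_{\eta}\rVert^{2}=n\,\widehat{\tau}_{\eta}^{2}\,\lVert A_{\eta}^{-1}Y\rVert^{2}=\phi^{-1}\,\mathrm{GCV}(\eta).
\end{aligned}
\]

When $XX^{\top}$ is invertible, $(XX^{\top}/n)^{-1}X\widehat{\mu}_{\eta}=nA_{\eta}^{-1}Y$ as well, so the second branch of $\widehat{\gamma}_{\eta}$ in (4.1) yields the same expression and remains well defined at $\eta=0$.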

Here we take a different perspective on (4.3). From our developed theory, it is natural to believe that this tuning scheme "works", since

\[
\widehat{\gamma}_{\eta}^{2}\stackrel{\mathbb{P}}{\approx}\gamma_{\eta,\ast}^{2}=\phi^{-1}\big(\sigma_{\xi}^{2}+\bar{R}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta)\big)\stackrel{\mathbb{P}}{\approx}\phi^{-1}\big(\sigma_{\xi}^{2}+R^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta)\big). \tag{4.4}
\]

We therefore expect that minimization of $\eta\mapsto\widehat{\gamma}_{\eta}$ is approximately the same as that of $\eta\mapsto R^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta)$. Moreover, in view of Theorem 3.4, the minimizer of the latter problem should be roughly the same as that of $\eta\mapsto R^{\#}_{(\Sigma,\mu_{0})}(\eta)$ for $\#\in\{\mathsf{est},\mathsf{in}\}$ for 'most' $\mu_{0}$'s. Now it is natural to expect that the tuning method $\widehat{\eta}^{\mathsf{GCV}}$ in (4.3), which aims at minimizing the prediction risk, should simultaneously give optimal performance for all prediction, estimation and in-sample risks for 'most' signal $\mu_{0}$'s. We make precise the foregoing heuristics in the following theorem.

Theorem 4.2.

Suppose Assumption A holds, and the following hold for some $K>0$.

  • $1/K\leq\phi^{-1}\leq K$, $\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\vee\lVert\Sigma\rVert_{\operatorname{op}}\leq K$.

  • Assumption B holds with either (i) $\sigma_{\xi}^{2}\in[1/K,K]$ or (ii) $\sigma_{\xi}^{2}\in[0,K]$ with $\phi^{-1}\geq 1+1/K$.

Fix $\delta\in(0,1/2]$, $L\geq K/\delta^{2}$ and a small enough $\vartheta\in(0,1/50)$. There exist a constant $C=C(K,L,\delta,\vartheta)>0$ and a measurable set $\mathcal{U}_{\delta,\vartheta}\subset B_{n}(1)\setminus B_{n}(\delta)$ with $\mathrm{vol}(\mathcal{U}_{\delta,\vartheta})/\mathrm{vol}(B_{n}(1)\setminus B_{n}(\delta))\geq 1-Ce^{-n^{\vartheta}/C}$, such that for $\#\in\{\mathsf{pred},\mathsf{est},\mathsf{in}\}$,

\[
\sup_{\mu_{0}\in\mathcal{U}_{\delta,\vartheta}}\mathbb{P}\Big(R^{\#}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\mathsf{GCV}}_{L})\geq\min_{\eta\in\Xi_{L}}R^{\#}_{(\Sigma,\mu_{0})}(\eta)+n^{-\vartheta}\Big)\leq Cn^{-1/7}.
\]

Earlier results for generalized cross validation in Ridge regression in low-dimensional settings include [Sto74, Sto77, CW79, Li85, Li86, Li87, DvdL05]. In the proportional high-dimensional regime, [HMRT22, Theorem 7] provides an asymptotic justification for the generalized cross validation under isotropic Σ=In\Sigma=I_{n} and an isotropic prior on μ0\mu_{0}. Subsequent improvement by [PWRT21, Theorem 4.1] allows for general Σ\Sigma, deterministic μ0\mu_{0}’s and a much larger range of regularization levels that include η=0\eta=0 and possibly even negative η\eta’s. Both works consider the optimality of η^𝖦𝖢𝖵\widehat{\eta}^{\operatorname{\mathsf{GCV}}} with respect to the prediction risk R(Σ,μ0)𝗉𝗋𝖾𝖽R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})} in an asymptotic framework; the proofs therein rely intrinsically on the asymptotics.

Here in Theorem 4.2 above, we provide a non-asymptotic justification for the optimality of η^𝖦𝖢𝖵\widehat{\eta}^{\operatorname{\mathsf{GCV}}} that surprisingly holds simultaneously for all the three indicated risks. To the best of our knowledge, the optimality of the generalized cross validation in (4.3) beyond prediction risk has not been observed in prior literature.

4.2.2. kk-fold cross-validation

Next we consider the widely used $k$-fold cross-validation. Before detailing the procedure, we need some further notation. Let $m_{\ell}$ be the sample size of batch $\ell\in[k]$, so $\sum_{\ell\in[k]}m_{\ell}=m$. In the standard $k$-fold cross validation, we choose equal-sized batches with $m_{\ell}=m/k$ (assumed to be an integer without loss of generality). Let $X^{(\ell)}\in\mathbb{R}^{m_{\ell}\times n}$ (resp. $Y^{(\ell)}\in\mathbb{R}^{m_{\ell}}$) be the submatrix of $X$ (resp. subvector of $Y$) that contains all rows corresponding to the training data in batch $\ell$. In a similar fashion, let $X^{(-\ell)}\in\mathbb{R}^{(m-m_{\ell})\times n}$ (resp. $Y^{(-\ell)}\in\mathbb{R}^{m-m_{\ell}}$) be the submatrix of $X$ (resp. subvector of $Y$) that removes all rows corresponding to $X^{(\ell)}$ (resp. $Y^{(\ell)}$).

The $k$-fold cross-validation works as follows. For $\ell\in[k]$, let $\widehat{\mu}^{(\ell)}_{\eta}\equiv\operatorname*{arg\,min}_{\mu\in\mathbb{R}^{n}}\big\{\frac{1}{2n}\lVert Y^{(-\ell)}-X^{(-\ell)}\mu\rVert^{2}+\frac{\eta}{2}\lVert\mu\rVert^{2}\big\}$ be the Ridge estimator over $(X^{(-\ell)},Y^{(-\ell)})$ with regularization $\eta\geq 0$. We then pick the tuning parameter that minimizes the averaged test errors of $\widehat{\mu}^{(\ell)}_{\eta}$ over $(X^{(\ell)},Y^{(\ell)})$: for any $L>0$,

\[
\widehat{\eta}^{\mathsf{CV}}_{L}\in\operatorname*{arg\,min}_{\eta\in\Xi_{L}}\bigg\{\frac{1}{k}\sum_{\ell\in[k]}\frac{1}{m_{\ell}}\lVert Y^{(\ell)}-X^{(\ell)}\widehat{\mu}^{(\ell)}_{\eta}\rVert^{2}\bigg\}\equiv\operatorname*{arg\,min}_{\eta\in\Xi_{L}}R^{\mathsf{CV},k}_{(\Sigma,\mu_{0})}(\eta). \tag{4.5}
\]

We shall often omit the subscript $L$ in $\widehat{\eta}^{\mathsf{CV}}_{L}$. Intuitively, due to the independence between $\widehat{\mu}^{(\ell)}_{\eta}$ and $(X^{(\ell)},Y^{(\ell)})$, $R^{\mathsf{CV},k}_{(\Sigma,\mu_{0})}(\eta)$ can be viewed as an estimator of the generalization error $R^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta)+\sigma_{\xi}^{2}$. So it is natural to expect that $\widehat{\eta}^{\mathsf{CV}}$ approximately minimizes $\eta\mapsto R^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta)$. Based on the same heuristics as for $\widehat{\eta}^{\mathsf{GCV}}$ in (4.3), we may therefore expect that $\widehat{\eta}^{\mathsf{CV}}$ in (4.5) simultaneously provides optimal prediction, estimation and in-sample risks for 'most' signal $\mu_{0}$'s. This is the content of the following theorem.
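The tuning rule (4.5) can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions rather than the code used in the paper: the function names `ridge`, `kfold_cv_risk` and `eta_cv` are hypothetical, the folds are contiguous blocks of rows, the search set $\Xi_{L}$ is replaced by a finite grid, and each fold is refit by a fresh SVD (simple but not the most efficient choice).

```python
import numpy as np

def ridge(X, Y, eta):
    """Ridge(less) fit for (1/2n)||Y - X mu||^2 + (eta/2)||mu||^2, n = X.shape[1]."""
    n = X.shape[1]
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > 1e-10 * s.max()          # eta = 0 then gives the min-norm solution
    U, s, Vt = U[:, keep], s[keep], Vt[keep]
    return Vt.T @ ((s / (s**2 + n * eta)) * (U.T @ Y))

def kfold_cv_risk(X, Y, eta, k):
    """Averaged held-out squared error, i.e. the objective inside (4.5)."""
    m = X.shape[0]
    folds = np.array_split(np.arange(m), k)   # contiguous batches (an assumption)
    total = 0.0
    for idx in folds:
        train = np.ones(m, dtype=bool)
        train[idx] = False                    # remove batch rows from the training set
        mu = ridge(X[train], Y[train], eta)
        total += np.sum((Y[idx] - X[idx] @ mu) ** 2) / len(idx)
    return total / k

def eta_cv(X, Y, eta_grid, k=5):
    """Grid minimizer of the k-fold criterion, together with the whole risk curve."""
    risks = np.array([kfold_cv_risk(X, Y, eta, k) for eta in eta_grid])
    return eta_grid[int(np.argmin(risks))], risks
```

A typical call would be `eta_cv(X, Y, np.linspace(0.0, 1.5, 31), k=5)`, mirroring the grid and the default $k=5$ used in the simulations of Section 5.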

Theorem 4.3.

Suppose the same conditions as in Theorem 4.2 and $\max_{\ell\in[k]}m_{\ell}/n\leq 1/(2K)$ hold for some $K>0$. Fix $\delta\in(0,1/2]$, $L\geq K/\delta^{2}$ and a small enough $\vartheta\in(0,1/50)$. Further assume $\min_{\ell\in[k]}m_{\ell}\geq\log^{2/\delta}m$. There exist a constant $C=C(K,L,\delta,\vartheta)>0$ and a measurable set $\mathcal{U}_{\delta,\vartheta}\subset B_{n}(1)\setminus B_{n}(\delta)$ with $\mathrm{vol}(\mathcal{U}_{\delta,\vartheta})/\mathrm{vol}(B_{n}(1)\setminus B_{n}(\delta))\geq 1-Ce^{-n^{\vartheta}/C}$, such that for $\#\in\{\mathsf{pred},\mathsf{est},\mathsf{in}\}$,

\begin{align*}
&\sup_{\mu_{0}\in\mathcal{U}_{\delta,\vartheta}}\mathbb{P}\bigg(R^{\#}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\mathsf{CV}}_{L})\geq\min_{\eta\in\Xi_{L}}R^{\#}_{(\Sigma,\mu_{0})}(\eta)+C\cdot\bigg\{\frac{1}{k}\sum_{\ell\in[k]}\frac{1}{m_{\ell}^{(1-\delta)/2}}+\frac{1}{k}+n^{-\vartheta}\bigg\}\bigg)\\
&\qquad\leq C(1+\mathcal{L}_{\{m_{\ell}\}})\cdot n^{-1/7}.
\end{align*}

Here $\mathcal{L}_{\{m_{\ell}\}}\equiv\sum_{\ell\in[k]}(m_{\ell}/m)^{-1}$.

Non-asymptotic results of this type for $k$-fold cross validation were previously obtained for $R^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\mathsf{CV}})$ in the Lasso setting [MM21, Proposition 4.3] under isotropic $\Sigma=I_{n}$, where the range of the regularization must be strictly away from the interpolation regime. In contrast, our results above are valid down to $\eta=0$ when $\phi^{-1}>1$, and allow for general anisotropic $\Sigma$.

Interestingly, the error bound in the above theorem reflects the folklore tension between the bias and variance in the selection of kk in the cross validation scheme (cf. [JWHT21, Chapter 5]):

  • For small $k$, $R^{\mathsf{CV},k}_{(\Sigma,\mu_{0})}(\eta)$ is biased for estimating $R^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta)$; this corresponds to the term $\mathcal{O}(1/k)$ in the error bound, which is known to be of the optimal order in Ridge regression (cf. [LD19]).

  • For large $k$, $R^{\mathsf{CV},k}_{(\Sigma,\mu_{0})}(\eta)$ has large fluctuations; this corresponds to the term $\mathcal{O}\big(k^{-1}\sum_{\ell\in[k]}m_{\ell}^{-(1-\delta)/2}\big)=\mathcal{O}\big((k/m)^{(1-\delta)/2}\big)$ in the equal-sized case. By a central limit heuristic (cf. [AZ20, KL22]), we also expect this term to be of a near optimal order.
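As a worked illustration of this tension in the equal-sized case $m_{\ell}=m/k$, the quantities in Theorem 4.3 simplify as follows. The balancing choice of $k$ below is only a back-of-the-envelope calculation on the upper bound, and is not a claim about the optimal $k$ in practice.

\[
\mathcal{L}_{\{m_{\ell}\}}=\sum_{\ell\in[k]}(m_{\ell}/m)^{-1}=k\cdot k=k^{2},\qquad \frac{1}{k}\sum_{\ell\in[k]}m_{\ell}^{-(1-\delta)/2}=(k/m)^{(1-\delta)/2}.
\]

Equating the two $k$-dependent terms of the error bound, $(k/m)^{(1-\delta)/2}=k^{-1}$, gives

\[
k=m^{(1-\delta)/(3-\delta)},\qquad\text{with resulting error of order } m^{-(1-\delta)/(3-\delta)}+n^{-\vartheta},
\]

while the probability estimate becomes $C(1+k^{2})\cdot n^{-1/7}$, which is informative only when $k^{2}\ll n^{1/7}$.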

4.3. Implications to statistical inference via μ^η\widehat{\mu}_{\eta}

As Ridge(less) estimators $\widehat{\mu}_{\eta}$ are in general biased, debiasing is necessary for statistical inference of $\mu_{0}$, cf. [BZ23]. Here the debiasing scheme for $\widehat{\mu}_{\eta}$ can be readily read off from the distributional characterizations in Theorems 2.3 and 2.4. Assuming known covariance $\Sigma$, let the debiased Ridge(less) estimator be defined as

\[
\widehat{\mu}_{\eta}^{\mathsf{dR}}\equiv(\Sigma+\tau_{\eta,\ast}I)\Sigma^{-1}\widehat{\mu}_{\eta}. \tag{4.6}
\]

Similar to Remark 4, $\tau_{\eta,\ast}$ and $\widehat{\tau}_{\eta}$ are interchangeable in the above display due to known $\Sigma$. Using Theorems 2.3 and 2.4, we expect that $\widehat{\mu}_{\eta}^{\mathsf{dR}}\stackrel{d}{\approx}\mu_{0}+\gamma_{\eta,\ast}\Sigma^{-1/2}g/\sqrt{n}$. This motivates the following confidence intervals for $\{\mu_{0,j}\}$:

\[
\mathrm{CI}_{j}(\eta)\equiv\Big[\widehat{\mu}_{\eta,j}^{\mathsf{dR}}\pm\widehat{\gamma}_{\eta}\cdot(\Sigma^{-1})_{jj}^{1/2}\cdot\frac{z_{\alpha/2}}{\sqrt{n}}\Big],\quad j\in[n]. \tag{4.7}
\]

Here $z_{\alpha}$ is the normal upper-$\alpha$ quantile defined via $\mathbb{P}(\mathcal{N}(0,1)>z_{\alpha})=\alpha$. It is easy to see from the above definition that minimization of $\eta\mapsto\widehat{\gamma}_{\eta}$ is equivalent to that of the CI length. As the former minimization procedure corresponds exactly to the proposal $\widehat{\eta}^{\mathsf{GCV}}$ in (4.3), we expect that $\{\mathrm{CI}_{j}(\widehat{\eta}^{\mathsf{GCV}})\}$ provide the shortest (asymptotic) $(1-\alpha)$-CIs along the regularization path, and so do $\{\mathrm{CI}_{j}(\widehat{\eta}^{\mathsf{CV}})\}$.
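The construction (4.6)-(4.7) translates directly into code. The sketch below is illustrative only: the function names `debiased_ridge_ci` and `averaged_coverage` are hypothetical, the inputs `mu_hat`, `tau` (either $\tau_{\eta,\ast}$ or $\widehat{\tau}_{\eta}$) and `gamma_hat` are assumed to have been computed beforehand, and $\Sigma$ is assumed known and invertible.

```python
import numpy as np
from statistics import NormalDist

def debiased_ridge_ci(mu_hat, Sigma, tau, gamma_hat, alpha=0.05):
    """Lower and upper endpoints of CI_j in (4.7), for all coordinates j."""
    n = mu_hat.shape[0]
    Sigma_inv = np.linalg.inv(Sigma)
    # (Sigma + tau I) Sigma^{-1} mu_hat = mu_hat + tau * Sigma^{-1} mu_hat, cf. (4.6)
    mu_dr = mu_hat + tau * (Sigma_inv @ mu_hat)
    # upper-(alpha/2) standard normal quantile z_{alpha/2}
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = gamma_hat * np.sqrt(np.diag(Sigma_inv)) * z / np.sqrt(n)
    return mu_dr - half, mu_dr + half

def averaged_coverage(lower, upper, mu0):
    """Fraction of coordinates of mu0 that fall inside their intervals."""
    return float(np.mean((lower <= mu0) & (mu0 <= upper)))
```

Since the half-width depends on $\eta$ only through `gamma_hat`, scanning a grid of $\eta$ values and keeping the smallest `gamma_hat` yields the shortest intervals in this family.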

Below we give a rigorous statement on the above informal discussion. Let 𝒞𝖽𝖱(η)n1j=1n𝟏(μ0,jCIj(η))\mathscr{C}^{\operatorname{\mathsf{dR}}}(\eta)\equiv n^{-1}\sum_{j=1}^{n}\bm{1}(\mu_{0,j}\in\mathrm{CI}_{j}(\eta)) denote the averaged coverage of {CIj(η)}\{\mathrm{CI}_{j}(\eta)\} for {μ0,j}\{\mu_{0,j}\}. We have the following.

Theorem 4.4.

Suppose the same conditions as in Theorem 4.2 (resp. Theorem 4.3) for $\widehat{\eta}^{\mathsf{GCV}}$ (resp. $\widehat{\eta}^{\mathsf{CV}}$) hold for some $K>0$. Fix $\alpha\in(0,1/4]$, $\delta\in(0,1/2]$, $L\geq K/\delta^{2}$ and a small enough $\vartheta\in(0,1/50)$. There exist a constant $C=C(K,L,\delta,\vartheta)>0$ and a measurable set $\mathcal{U}_{\delta,\vartheta}\subset B_{n}(1)\setminus B_{n}(\delta)$ with $\mathrm{vol}(\mathcal{U}_{\delta,\vartheta})/\mathrm{vol}(B_{n}(1)\setminus B_{n}(\delta))\geq 1-Ce^{-n^{\vartheta}/C}$, such that the CI length and the averaged coverage satisfy

\begin{align*}
\sup_{\mu_{0}\in\mathcal{U}_{\delta,\vartheta}}\Big\{&\mathbb{P}\Big(\sqrt{n}z_{\alpha/2}^{-1}\cdot\max_{j\in[n]}\big\lvert|\mathrm{CI}_{j}(\widehat{\eta}_{L}^{\#})|-\min_{\eta\in\Xi_{L}}|\mathrm{CI}_{j}(\eta)|\big\rvert\geq C\mathcal{E}^{\#}_{n}\Big)\\
&\qquad\vee\mathbb{P}\Big(\lvert\mathscr{C}^{\mathsf{dR}}(\widehat{\eta}^{\#}_{L})-(1-\alpha)\rvert\geq C(\mathcal{E}^{\#}_{n})^{1/4}\Big)\Big\}\leq C\mathfrak{p}_{n}^{\#}.
\end{align*}

Here for $\#\in\{\mathsf{GCV},\mathsf{CV}\}$, the quantities $\mathcal{E}^{\#}_{n},\mathfrak{p}_{n}^{\#}$ are defined via

\[
\begin{array}{l|l|l}
 & \mathcal{E}^{\#}_{n} & \mathfrak{p}_{n}^{\#}\\ \hline
\#=\mathsf{GCV} & n^{-\vartheta} & n^{-1/7}\\
\#=\mathsf{CV} & k^{-1}\sum_{\ell\in[k]}m_{\ell}^{-(1-\delta)/2}+k^{-1}+n^{-\vartheta} & (1+\mathcal{L}_{\{m_{\ell}\}})\cdot n^{-1/7}
\end{array}
\]

An interesting and somewhat non-standard special case of the above theorem is the noiseless setting σξ2=0\sigma_{\xi}^{2}=0 in the overparametrized regime ϕ1>1\phi^{-1}>1. In this case, exact recovery of μ0\mu_{0} is impossible and our CI’s above provide a precise scheme for partial recovery of μ0\mu_{0}. Moreover, as the effective noise ϕγη,2(0)=R¯(Σ,μ0)𝗉𝗋𝖾𝖽(η)\phi\gamma_{\eta,\ast}^{2}(0)=\bar{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta), Theorem 3.2 and Proposition 3.3 suggest that ηγη,2(0)\eta\mapsto\gamma^{2}_{\eta,\ast}(0) is approximately minimized at η=0\eta=0 for ‘most’ μ0\mu_{0}’s. This means that, in this noiseless case, the length of the adaptively tuned CIs is also approximately minimized at the interpolation regime for ‘most’ μ0\mu_{0}’s.

Remark 5.

The debiased Ridge estimator $\widehat{\mu}_{\eta}^{\mathsf{dR}}$ in (4.6) takes a slightly non-standard form, which allows for interpolation $\eta=0$ when $\phi^{-1}>1$. To see its equivalence to the standard debiased form, for any $\eta>0$, using the KKT condition $\widehat{\mu}_{\eta}=X^{\top}(Y-X\widehat{\mu}_{\eta})/(n\eta)$ and the calculation in (2.4),

\begin{align*}
\widehat{\mu}_{\eta}^{\mathsf{dR}}=\widehat{\mu}_{\eta}+\frac{\tau_{\eta,\ast}}{n\eta}\cdot\Sigma^{-1}X^{\top}(Y-X\widehat{\mu}_{\eta})&\stackrel{\mathbb{P}}{\approx}\widehat{\mu}_{\eta}+\frac{\widehat{\tau}_{\eta}}{n\eta}\cdot\Sigma^{-1}X^{\top}(Y-X\widehat{\mu}_{\eta})\\
&=\widehat{\mu}_{\eta}+\frac{\Sigma^{-1}X^{\top}(Y-X\widehat{\mu}_{\eta})}{m-\mathrm{df}(\widehat{\mu}_{\eta})}.
\end{align*}

The form in the right hand side of the above display matches the standard form of the debiased Ridge estimator, cf. [BZ23, Eqn. (3.15)].

5. Some illustrative simulations

Figure 1. Validation of the phase transitions in Theorem 3.4. The theoretical risks $\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)$ are computed by solving (2.1); the empirical risks $R^{\#}_{(\Sigma,\mu_{0})}(\eta)$ are computed via Monte Carlo simulation over $200$ repetitions. Left panel: Noisy case with minimal empirical risks attained at $\eta_{\ast}=\mathsf{SNR}_{\mu_{0}}^{-1}=1$ (marked with $\ast$). Middle panel: Noiseless case with all risks minimized at the interpolation regime $\eta_{\ast}=\mathsf{SNR}_{\mu_{0}}^{-1}=0$. Right panel: Differences between the global minimizer of the risk curves and the oracle $\eta_{\ast}$ are concentrated around $0$ over $500$ different $\mu_{0}$'s.

In this section, we provide some illustrative numerical simulations to validate some of the developed theoretical results. In particular, we focus on:

  (1) the phase transitions on the optimality of interpolation in Section 3;

  (2) the effectiveness of cross-validation methods in Section 4.

5.1. Common numerical settings

We set $\Sigma=1.99\cdot I_{n}+0.01\cdot\mathsf{1}_{n}\mathsf{1}_{n}^{\top}$, with $\mathsf{1}_{n}$ denoting the $n$-dimensional all-ones vector. The entries of the random design matrix $Z$ and of the error $\xi$ are both generated from the $t$-distribution with $10$ degrees of freedom, scaled by $\sqrt{0.8}$. Since the $t$-distribution with $10$ degrees of freedom has variance $10/8=1.25$, this scaling ensures that $Z_{ij}$ and $\xi_{i}$ have mean zero and variance one. The concrete choice of the signal dimension $n$, the sample size $m$, and $\mu_{0}$ will be specified later.
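The data-generating step described above can be sketched as follows. This is a hedged reconstruction, not the authors' simulation script: the function names `make_sigma`, `scaled_t` and `simulate` are hypothetical, the design is assumed to be formed as $X=Z\Sigma^{1/2}$, and the response is assumed to follow $Y=X\mu_{0}+\sigma_{\xi}\,\xi$ with $\xi$ having unit-variance entries.

```python
import numpy as np

def make_sigma(n):
    """Sigma = 1.99 * I_n + 0.01 * 1_n 1_n^T."""
    return 1.99 * np.eye(n) + 0.01 * np.ones((n, n))

def scaled_t(rng, size, df=10):
    """t-distributed draws rescaled to unit variance.

    Var(t_df) = df / (df - 2), so the factor is sqrt(0.8) when df = 10.
    """
    return np.sqrt((df - 2) / df) * rng.standard_t(df, size=size)

def simulate(m, n, mu0, sigma_xi, rng):
    """Draw (X, Y) under the assumed model X = Z Sigma^{1/2}, Y = X mu0 + sigma_xi * xi."""
    Sigma = make_sigma(n)
    w, V = np.linalg.eigh(Sigma)
    Sigma_half = (V * np.sqrt(w)) @ V.T      # symmetric square root of Sigma
    Z = scaled_t(rng, (m, n))
    X = Z @ Sigma_half
    Y = X @ mu0 + sigma_xi * scaled_t(rng, m)
    return X, Y, Sigma
```

With this choice, $\Sigma$ has eigenvalue $1.99$ with multiplicity $n-1$ and a single eigenvalue $1.99+0.01n$ along the all-ones direction.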

5.2. Validation of the phase transitions in Theorem 3.4

First, we use $m=100$, $n=200$, and a unit $\mu_{0}$ chosen randomly (but fixed afterwards) from the sphere $\partial B_{n}(1)$, and we plot both the theoretical risk curve $\eta\mapsto\bar{R}^{\#}_{(\Sigma,\mu_{0})}$ and the empirical risk curve $\eta\mapsto R^{\#}_{(\Sigma,\mu_{0})}$ for all $\#\in\{\mathsf{pred},\mathsf{est},\mathsf{in}\}$. The left panel of Figure 1 reports the outcome of this simulation with noise level $\sigma_{\xi}^{2}=1$ and $\mathsf{SNR}_{\mu_{0}}^{-1}=1$, while the middle panel of the same figure reports the noiseless case $\sigma_{\xi}^{2}=0$ with $\mathsf{SNR}_{\mu_{0}}^{-1}=0$. These plots show excellent agreement with the theory in Theorem 3.4, in that the global minima of both the theoretical and empirical risk curves are attained roughly at $\eta_{\ast}=\mathsf{SNR}_{\mu_{0}}^{-1}$.

In order to demonstrate the validity of the aforementioned phenomenon for 'most' $\mu_{0}$'s as claimed in Theorem 3.4, we generate $500$ different $\mu_{0}$'s uniformly over $\partial B_{n}(1)$. Next, we discretize $\eta\in[0,1.5]$ into $160$ grid points. We then select the optimal value $\eta^{\#}$ by minimizing the empirical prediction, estimation, and in-sample risks over this grid. The difference between the chosen empirical optimal $\eta^{\#}$ and the theoretically optimal tuning is depicted in the right panel of Figure 1 through a boxplot of $\eta^{\#}-\eta_{\ast}$. The differences for all three risks are highly concentrated around $0$.

Figure 2. Validation of $\mathsf{GCV},\mathsf{CV}$ in Theorems 4.2-4.4. The empirical risks are computed via the Monte Carlo average over $100$ repetitions. Left panel: Comparison between empirical risks and theoretical risks for $\ast=\mathsf{GCV}$ and $\bullet=\mathsf{CV}$ with $k=5$. Middle panel: Averaged coverage $\mathscr{C}^{\mathsf{dR}}(\widehat{\eta}^{\#})$ for $\#\in\{\mathsf{GCV},\mathsf{CV}\}$ and the oracle $\mathscr{C}^{\mathsf{dR}}(\eta_{\ast})$. Right panel: Length of the confidence intervals $\mathrm{CI}_{1}(\widehat{\eta}^{\#})$ for $\#\in\{\mathsf{GCV},\mathsf{CV}\}$ and the oracle $\mathrm{CI}_{1}(\eta_{\ast})$.

5.3. Optimality of (generalized) cross-validation schemes

Next, we investigate the efficacy of two cross validation schemes in Section 4, namely η^𝖦𝖢𝖵\widehat{\eta}^{\operatorname{\mathsf{GCV}}} in (4.3) and η^𝖢𝖵\widehat{\eta}^{\operatorname{\mathsf{CV}}} in (4.5). We keep the sample size fixed at m=500m=500, and allow the signal dimension nn to vary so that the aspect ratio ϕ=m/n\phi=m/n ranges from [0.5,1.5][0.5,1.5]. To facilitate the tuning process, we employ 3131 equidistant η\eta’s within the range of [0,1.5][0,1.5]. Moreover, the kk-fold cross validation scheme η^𝖢𝖵\widehat{\eta}^{\operatorname{\mathsf{CV}}} is carried out with the default choice k=5k=5.

To empirically verify Theorem 4.2 and 4.3, we report in the left panel of Figure 2 the empirical risks R(Σ,μ0)#(η^𝖦𝖢𝖵),R(Σ,μ0)#(η^𝖢𝖵)R^{\#}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}}),R^{\#}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\operatorname{\mathsf{CV}}}) for all #{𝗉𝗋𝖾𝖽,𝖾𝗌𝗍,𝗂𝗇}\#\in\{\operatorname{\mathsf{pred}},\operatorname{\mathsf{est}},\operatorname{\mathsf{in}}\}. All the empirical risk curves are found to concentrate around their theoretical optimal counterparts (Σ,μ0)#(η)\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta_{\ast}). We note again that as η^𝖦𝖢𝖵\widehat{\eta}^{\operatorname{\mathsf{GCV}}} and η^𝖢𝖵\widehat{\eta}^{\operatorname{\mathsf{CV}}} are designed to tune the prediction risk, it is not surprising that R(Σ,μ0)𝗉𝗋𝖾𝖽(η^𝖦𝖢𝖵),R(Σ,μ0)𝗉𝗋𝖾𝖽(η^𝖢𝖵)R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}}),R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\operatorname{\mathsf{CV}}}) concentrate around (Σ,μ0)𝗉𝗋𝖾𝖽(η)\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta_{\ast}). The major surprise appears to be that η^𝖦𝖢𝖵\widehat{\eta}^{\operatorname{\mathsf{GCV}}} and η^𝖢𝖵\widehat{\eta}^{\operatorname{\mathsf{CV}}} also provide optimal tuning for estimation and in-sample risks, both theoretically validated in our Theorems 4.2 and 4.3 and empirically confirmed here.

To empirically verify Theorem 4.4, we report in the middle and right panels of Figure 2 the averaged coverage and length for the $95\%$ debiased Ridge CIs with cross-validation, namely $\{\mathrm{CI}_{j}(\widehat{\eta}^{\#})\}$ for $\#\in\{\operatorname{\mathsf{GCV}},\operatorname{\mathsf{CV}}\}$, and with oracle tuning $\eta_{\ast}=\operatorname{\mathsf{SNR}}_{\mu_{0}}^{-1}$. For the middle panel, we observe that adaptive tuning via $\widehat{\eta}^{\operatorname{\mathsf{GCV}}}$ and $\widehat{\eta}^{\operatorname{\mathsf{CV}}}$ both provide approximate nominal coverage for a moderate sample size $m$ and signal dimension $n$. For the right panel, as the lengths of $\{\mathrm{CI}_{j}(\widehat{\eta}^{\#})\}$ are solely determined by $\widehat{\gamma}_{\widehat{\eta}^{\#}}$, we report here only the length of $\mathrm{CI}_{1}(\widehat{\eta}^{\#})$. We observe that the lengths of both $\mathrm{CI}_{1}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}})$ and $\mathrm{CI}_{1}(\widehat{\eta}^{\operatorname{\mathsf{CV}}})$ are also in excellent agreement with the oracle length across different aspect ratios.

6. Proof outlines

6.1. Technical tools

The main technical tool we use for the proof of Theorem 2.3 is the following version of the convex Gaussian min-max theorem, taken from [MM21, Corollary G.1].

Theorem 6.1 (Convex Gaussian Min-Max Theorem).

Suppose $D_{u}\subset\mathbb{R}^{n_{1}+n_{2}},D_{v}\subset\mathbb{R}^{m_{1}+m_{2}}$ are compact sets, and $Q:D_{u}\times D_{v}\to\mathbb{R}$ is continuous. Let $G=(G_{ij})_{i\in[n_{1}],j\in[m_{1}]}$ with $G_{ij}$'s i.i.d. $\mathcal{N}(0,1)$, and $g\sim\mathcal{N}(0,I_{n_{1}})$, $h\sim\mathcal{N}(0,I_{m_{1}})$ be independent Gaussian vectors. For $u\in\mathbb{R}^{n_{1}+n_{2}},v\in\mathbb{R}^{m_{1}+m_{2}}$, write $u_{1}\equiv u_{[n_{1}]}\in\mathbb{R}^{n_{1}},v_{1}\equiv v_{[m_{1}]}\in\mathbb{R}^{m_{1}}$. Define

\begin{align*}
\Phi^{\textrm{p}}(G) &=\min_{u\in D_{u}}\max_{v\in D_{v}}\Big(u_{1}^{\top}Gv_{1}+Q(u,v)\Big),\\
\Phi^{\textrm{a}}(g,h) &=\min_{u\in D_{u}}\max_{v\in D_{v}}\Big(\lVert v_{1}\rVert g^{\top}u_{1}+\lVert u_{1}\rVert h^{\top}v_{1}+Q(u,v)\Big).
\end{align*}

Then the following hold.

  (1) For all $t\in\mathbb{R}$, $\operatorname{\mathbb{P}}\big(\Phi^{\textrm{p}}(G)\leq t\big)\leq 2\operatorname{\mathbb{P}}\big(\Phi^{\textrm{a}}(g,h)\leq t\big)$.

  (2) If $(u,v)\mapsto u_{1}^{\top}Gv_{1}+Q(u,v)$ satisfies the conditions of Sion's min-max theorem for the pair $(D_{u},D_{v})$ a.s. (for instance, $D_{u},D_{v}$ are convex, and $Q$ is convex-concave), then for any $t\in\mathbb{R}$, $\operatorname{\mathbb{P}}\big(\Phi^{\textrm{p}}(G)\geq t\big)\leq 2\operatorname{\mathbb{P}}\big(\Phi^{\textrm{a}}(g,h)\geq t\big)$.

Clearly, $\leq$ (resp. $\geq$) in (1) (resp. (2)) can be replaced with $<$ (resp. $>$). In the proofs below, we shall assume without loss of generality that $G,g,h$ are independent Gaussian matrix/vectors defined on the same probability space.

As mentioned above, the CGMT has been utilized for deriving precise risk/distributional asymptotics for a number of canonical statistical estimators across various important models; we refer the readers to [TOH15, TAH18, SAH19, LGC+21, CMW22, DKT22, Han22, LS22, WWM22, ZZY22, MRSY23] for some selected references.

6.2. Reparametrization and further notation

Consider the reparametrization

\begin{align*}
w=\Sigma^{1/2}(\mu-\mu_{0}),\quad\widehat{w}_{\eta;Z}\equiv\Sigma^{1/2}(\widehat{\mu}_{\eta;Z}-\mu_{0}).
\end{align*}

Then with

\begin{align*}
F(w)\equiv F_{(\Sigma,\mu_{0})}(w)=\frac{1}{2}\lVert\mu_{0}+\Sigma^{-1/2}w\rVert^{2}, \tag{6.1}
\end{align*}

we have the following reparametrized version of $\widehat{\mu}_{\eta;Z}$:

\begin{align*}
\widehat{w}_{\eta;Z}=\begin{cases}\operatorname*{arg\,min\,}_{w\in\mathbb{R}^{n}}\big\{F(w):Zw=\xi\big\},&\eta=0;\\ \operatorname*{arg\,min\,}_{w\in\mathbb{R}^{n}}\big\{F(w)+\frac{1}{\eta}\cdot\frac{1}{2n}\lVert Zw-\xi\rVert^{2}\big\},&\eta>0.\end{cases}
\end{align*}
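The reparametrization can be checked numerically. The sketch below assumes the model $Y=X\mu_{0}+\xi$ with $X=Z\Sigma^{1/2}$ and the Ridge normalization $\widehat{\mu}_{\eta;Z}=(X^{\top}X/n+\eta I)^{-1}X^{\top}Y/n$ (minimum-norm interpolator at $\eta=0$); both are inferred from the displays above rather than quoted from the model definition. It verifies that $\widehat{w}_{0;Z}$ satisfies the constraint $Z\widehat{w}_{0;Z}=\xi$, and that for $\eta>0$ the gradient of $w\mapsto F(w)+\frac{1}{2n\eta}\lVert Zw-\xi\rVert^{2}$ vanishes at $\widehat{w}_{\eta;Z}$.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eta = 30, 60, 0.3

# Covariance with a known eigen-decomposition, so both matrix square roots are exact.
lam = np.linspace(0.5, 2.0, n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
S_half = (Q * np.sqrt(lam)) @ Q.T       # Sigma^{1/2}
S_half_inv = (Q / np.sqrt(lam)) @ Q.T   # Sigma^{-1/2}

Z = rng.standard_normal((m, n))
X = Z @ S_half                          # assumed design X = Z Sigma^{1/2}
mu0 = rng.standard_normal(n) / np.sqrt(n)
xi = rng.standard_normal(m)
Y = X @ mu0 + xi                        # assumed model Y = X mu0 + xi

# eta = 0: minimum-norm interpolator, mapped to w-coordinates.
mu_hat0 = np.linalg.pinv(X) @ Y
w_hat0 = S_half @ (mu_hat0 - mu0)
constraint_gap = np.max(np.abs(Z @ w_hat0 - xi))

# eta > 0: Ridge estimator, mapped to w-coordinates.
mu_hat = np.linalg.solve(X.T @ X / n + eta * np.eye(n), X.T @ Y / n)
w_hat = S_half @ (mu_hat - mu0)

# Gradient of w -> F(w) + ||Zw - xi||^2 / (2 n eta), evaluated at w_hat.
grad = S_half_inv @ (mu0 + S_half_inv @ w_hat) + Z.T @ (Z @ w_hat - xi) / (n * eta)
grad_norm = np.linalg.norm(grad)
```

Both `constraint_gap` and `grad_norm` should be zero up to floating-point error.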

Next we give some further notation for cost functions. For $\eta\geq 0$, let

\begin{align*}
h_{\eta;Z}(w,v) &\equiv\frac{1}{\sqrt{n}}\langle v,Zw-\xi\rangle+F(w)-\frac{\eta\lVert v\rVert^{2}}{2},\\
\ell_{\eta}(w,v) &\equiv\frac{1}{\sqrt{n}}\Big(-\lVert v\rVert\langle g,w\rangle+\lVert w\rVert\langle h,v\rangle-\langle v,\xi\rangle\Big)+F(w)-\frac{\eta\lVert v\rVert^{2}}{2}, \tag{6.2}
\end{align*}

and for $L_{v}\in[0,\infty]$,

\begin{align*}
H_{\eta;Z}(w;L_{v}) &\equiv\max_{v\in B_{n}(L_{v})}h_{\eta;Z}(w,v)\equiv\max_{v\in B_{n}(L_{v})}\bigg\{\frac{\langle v,Zw-\xi\rangle}{\sqrt{n}}+F(w)-\frac{\eta\lVert v\rVert^{2}}{2}\bigg\}, \tag{6.3}\\
L_{\eta}(w;L_{v}) &\equiv\max_{v\in B_{n}(L_{v})}\ell_{\eta}(w,v)=\max_{\beta\in[0,L_{v}]}\bigg\{\frac{\beta}{\sqrt{n}}\Big(\big\lVert\lVert w\rVert h-\xi\big\rVert-\langle g,w\rangle\Big)+F(w)-\frac{\eta\beta^{2}}{2}\bigg\}.
\end{align*}

We shall simply write $H_{\eta;Z}(\cdot)=H_{\eta;Z}(\cdot;\infty)$ and $L_{\eta}(\cdot)=L_{\eta}(\cdot;\infty)$. When $Z=G$, we sometimes write $h_{\eta;G}=h_{\eta}$ and $H_{\eta;G}=H_{\eta}$ for simplicity of notation.
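For the reader's convenience, the reduction of the maximum over $v$ to a one-dimensional maximum over $\beta=\lVert v\rVert$ in the definition of $L_{\eta}(w;L_{v})$ can be spelled out as follows. This is a short worked derivation using only the definition of $\ell_{\eta}$ in (6.2); the unit vector $u$ is auxiliary notation introduced here.

```latex
% Polar decomposition v = \beta u with \beta = \lVert v\rVert \in [0, L_v] and \lVert u\rVert = 1.
\begin{align*}
\ell_{\eta}(w,\beta u)
 &= \frac{\beta}{\sqrt{n}}\Big(\big\langle u,\lVert w\rVert h-\xi\big\rangle-\langle g,w\rangle\Big)
    +F(w)-\frac{\eta\beta^{2}}{2},\\
\max_{\lVert u\rVert=1}\big\langle u,\lVert w\rVert h-\xi\big\rangle
 &= \big\lVert\lVert w\rVert h-\xi\big\rVert
    \quad\text{(Cauchy--Schwarz, attained at } u\propto\lVert w\rVert h-\xi\text{)},\\
\max_{\lVert v\rVert\leq L_{v}}\ell_{\eta}(w,v)
 &= \max_{\beta\in[0,L_{v}]}\bigg\{\frac{\beta}{\sqrt{n}}\Big(\big\lVert\lVert w\rVert h-\xi\big\rVert
    -\langle g,w\rangle\Big)+F(w)-\frac{\eta\beta^{2}}{2}\bigg\}.
\end{align*}
```

Since $\beta\geq 0$, the inner maximization over the direction $u$ can be carried out first for each fixed $\beta$, which gives the last line.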

Let the empirical noise $\sigma_{m}^{2}$ and its modified version be

\begin{align*}
\sigma_{m}^{2}\equiv\frac{\lVert\xi\rVert^{2}}{\lVert h\rVert^{2}},\quad\sigma_{\pm}^{2}(L_{w})\equiv\bigg(\sigma_{m}^{2}\pm 2L_{w}\frac{\lvert\langle h,\xi\rangle\rvert}{\lVert h\rVert^{2}}\bigg)_{+}. \tag{6.4}
\end{align*}

Finally we define $\mathsf{D}_{\eta,\pm}$ and its deterministic version $\overline{\mathsf{D}}_{\eta}$ as follows:

\begin{align*}
\mathsf{D}_{\eta,\pm}(\beta,\gamma) &\equiv\frac{\beta}{2}\bigg(\gamma\big(\phi e_{h}^{2}-e_{g}^{2}\big)+\frac{\sigma_{\pm}^{2}}{\gamma}\bigg)-\frac{\eta\beta^{2}}{2}+\mathsf{e}_{F}\bigg(\frac{\gamma}{\sqrt{n}}g;\frac{\gamma}{\beta}\bigg),\\
\overline{\mathsf{D}}_{\eta}(\beta,\gamma) &=\frac{\beta}{2}\bigg(\gamma\big(\phi-1\big)+\frac{\sigma_{\xi}^{2}}{\gamma}\bigg)-\frac{\eta\beta^{2}}{2}+\operatorname{\mathbb{E}}\mathsf{e}_{F}\bigg(\frac{\gamma}{\sqrt{n}}g;\frac{\gamma}{\beta}\bigg). \tag{6.5}
\end{align*}

Here recall that $\mathsf{e}_{F}$ is the Moreau envelope of $F$ in (6.1), and $e_{h}^{2},e_{g}^{2}$ are defined in (7.3). Note that $\mathsf{D}_{\eta,\pm}$ depends on the choice of $L_{w}$, but for notational convenience we drop this dependence here.

6.3. Proof outline for Theorem 2.3 for η=0\eta=0

We shall outline below the main steps for the proof of Theorem 2.3 for $\eta=0$ in the regime $\phi^{-1}>1$ under a stronger condition $\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\lesssim 1$. The high level strategy of the proof shares conceptual similarities to [MM21, CMW22], but the details differ significantly.

(Step 1: Localization of the primal optimization). In this step, we show that for $L_{w},L_{v}>0$ such that $L_{w}\wedge L_{v}\gtrsim 1$, with high probability (w.h.p.),

\begin{align*}
\min_{w\in B_{n}(L_{w})}H_{0}(w;L_{v})=\min_{w\in\mathbb{R}^{n}}H_{0}(w). \tag{6.6}
\end{align*}

A formal statement of the above localization can be found in Proposition 9.1. The key point here is that although $\min_{w}H_{0}(w)$ optimizes a deterministic function under a random constraint, it can be efficiently rewritten (in a probabilistic sense) in a minimax form indexed by compact sets, which facilitates the application of the convex Gaussian min-max Theorem 6.1.

(Step 2: Characterization of the Gordon cost optimum). In this step, we show that a suitably localized version of $\min_{w}L_{0}(w)$ concentrates around some deterministic quantity involving the function $\overline{\mathsf{D}}_{0}$ in (6.5). In particular, we show in Theorem 9.2 that for $L_{w},L_{v}\asymp 1$ chosen large enough, w.h.p.,

\begin{align*}
\min_{w\in B_{n}(L_{w})}L_{0}(w;L_{v})\approx\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{0}(\beta,\gamma). \tag{6.7}
\end{align*}

The proof of (6.7) is fairly involved, as the minimax problem $\min_{w}L_{0}(w)=\min_{w}\max_{v}\ell_{0}(w,v)$ (and its suitably localized versions) cannot be computed exactly. We get around this technical issue by the following bracketing strategy:

  • (Step 2.1). We show in Proposition 9.3 that for the prescribed choice of $L_{w},L_{v}$, w.h.p., both

    \begin{align*}
    \max_{\beta>0}\min_{\gamma>0}\mathsf{D}_{0,-}(\beta,\gamma)\leq\min_{w\in B_{n}(L_{w})}L_{0}(w;L_{v})\leq\max_{\beta>0}\min_{\gamma>0}\mathsf{D}_{0,+}(\beta,\gamma),
    \end{align*}

    and the localization

    \begin{align*}
    \max_{\beta>0}\min_{\gamma>0}\mathsf{D}_{0,\pm}(\beta,\gamma)=\max_{1/C\leq\beta\leq C}\min_{1/C\leq\gamma\leq C}\mathsf{D}_{0,\pm}(\beta,\gamma)
    \end{align*}

    hold for some large $C>0$.

  • (Step 2.2). We show in Proposition 9.4 that for localized minimax problems, we may replace $\mathsf{D}_{0,\pm}$ by $\overline{\mathsf{D}}_{0}$: w.h.p.,

    \begin{align*}
    \max_{1/C\leq\beta\leq C}\min_{1/C\leq\gamma\leq C}\mathsf{D}_{0,\pm}(\beta,\gamma)\approx\max_{1/C\leq\beta\leq C}\min_{1/C\leq\gamma\leq C}\overline{\mathsf{D}}_{0}(\beta,\gamma).
    \end{align*}

  • (Step 2.3). We show in Proposition 9.5 that (de)localization holds for the (deterministic) max-min optimization problem with $\overline{\mathsf{D}}_{0}$:

    \begin{align*}
    \max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{0}(\beta,\gamma)=\max_{1/C\leq\beta\leq C}\min_{1/C\leq\gamma\leq C}\overline{\mathsf{D}}_{0}(\beta,\gamma).
    \end{align*}

Combining the above Steps 2.1-2.3 yields (6.7). An important step in proving the (de)localization claims above is to derive a priori estimates for the solutions of the fixed point equation (2.1) and its sample version, to be defined in (8.12). These estimates will be detailed in Section 8.

(Step 3: Locating the global minimizer of the Gordon objective). In this step, we show that a suitably localized version of the Gordon objective $w\mapsto L_{0}(w)$ attains its global minimum approximately at $w_{0,\ast}\equiv\Sigma^{1/2}\big(\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{0,\ast};\tau_{0,\ast})-\mu_{0}\big)$ in the following sense. For any $\varepsilon>0$ and any $\mathsf{g}:\mathbb{R}^{n}\to\mathbb{R}$ that is $1$-Lipschitz with respect to $\lVert\cdot\rVert_{\Sigma^{-1}}$, let $D_{0;\varepsilon}(\mathsf{g})\equiv\big\{w\in\mathbb{R}^{n}:\lvert\mathsf{g}(w)-\operatorname{\mathbb{E}}\mathsf{g}(w_{0,\ast})\rvert\geq\varepsilon\big\}$ be the 'exceptional set'. We show in Theorem 9.6 that again for $L_{w},L_{v}\asymp 1$ chosen large enough, w.h.p.,

\begin{align*}
\min_{w\in D_{0;\varepsilon}(\mathsf{g})\cap B_{n}(L_{w})}L_{0}(w;L_{v})\geq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{0}(\beta,\gamma)+\Omega_{\varepsilon}(1). \tag{6.8}
\end{align*}

The main challenge in proving (6.8) is partly attributed to the possible violation of strong convexity of the map $w\mapsto L_{0}(w;L_{v})$, due to the necessity of working with non-Gaussian $\xi$'s. We will get around this technical issue, in a similar spirit to Step 2, by another bracketing strategy. In particular:

  • (Step 3.1). In Lemma 9.7, we will use surrogate, strongly convex functions $L_{0,\pm}(\cdot;L_{v})$, formally defined in (9.16), to provide a sufficiently tight bracket for $L_{0}(\cdot;L_{v})$ over large enough compact sets.

  • (Step 3.2). In Proposition 9.8, we show that the minimizers of $L_{0,\pm}(\cdot;L_{v})$ can be computed exactly and are close enough to $w_{0,\ast}$.

  • (Step 3.3). In Proposition 9.9, combining the tight bracketing with certain a priori estimates, we then conclude that all minimizers of $L_{0}(\cdot;L_{v})$ must be close to $w_{0,\ast}$.

With all the above steps, finally we prove (6.8) by (i) using the proximity of $L_{0}$ and its surrogates $L_{0,\pm}$ and (ii) exploiting the strong convexity of $L_{0,\pm}$.

(Step 4: Putting pieces together and establishing uniform guarantees). In this final step, we shall use the convex Gaussian min-max theorem to translate the estimates (6.7) in Step 2 and (6.8) in Step 3 to their counterparts with the primal cost function $H_{0}$. For the global cost optimum, with the help of the localization in (6.6), by choosing $L_{w},L_{v}\asymp 1$, we have w.h.p.,

\begin{align*}
\min_{w\in\mathbb{R}^{n}}H_{0}(w)\stackrel{(6.6)}{=}\min_{w\in B_{n}(L_{w})}H_{0}(w;L_{v})\stackrel{\operatorname{\mathbb{P}}}{\approx}\min_{w\in B_{n}(L_{w})}L_{0}(w;L_{v})\stackrel{(6.7)}{\approx}\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{0}(\beta,\gamma).
\end{align*}

For the cost over the exceptional set, we have w.h.p.,

\begin{align*}
\min_{w\in D_{0;\varepsilon}(\mathsf{g})\cap B_{n}(L_{w})}H_{0}(w) &\geq\min_{w\in D_{0;\varepsilon}(\mathsf{g})\cap B_{n}(L_{w})}H_{0}(w;L_{v})\\
&\stackrel{\operatorname{\mathbb{P}}}{\geq}\min_{w\in D_{0;\varepsilon}(\mathsf{g})\cap B_{n}(L_{w})}L_{0}(w;L_{v})\stackrel{(6.8)}{\geq}\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{0}(\beta,\gamma)+\Omega_{\varepsilon}(1).
\end{align*}

Combining the above two displays, we then conclude that w.h.p., $\widehat{w}_{0}\notin D_{0;\varepsilon}(\mathsf{g})\cap B_{n}(L_{w})$. Finally, using an a priori estimate on $\lVert\widehat{w}_{0}\rVert$, we may conclude that w.h.p., $\widehat{w}_{0}\notin D_{0;\varepsilon}(\mathsf{g})$, i.e., $\lvert\mathsf{g}(\widehat{w}_{0})-\operatorname{\mathbb{E}}\mathsf{g}(w_{0,\ast})\rvert\leq\varepsilon$.

The uniform guarantee in $\eta$ is then proved by (i) extending the above arguments to include any $\eta>0$, and (ii) establishing (high probability) Lipschitz continuity (w.r.t. $\lVert\cdot\rVert_{\Sigma^{-1}}$) of the maps $\eta\mapsto\widehat{w}_{\eta}$ and $\eta\mapsto w_{\eta,\ast}$.

Details of the above outline are implemented in Section 9.

6.4. Proof outline for Theorem 2.4 for η=0\eta=0

The main tool we will use to prove the universality Theorem 2.4 is the following set of comparison inequalities developed in [HS22]: Suppose $Z$ matches the first two moments of $G$, and possesses enough high moments. Then for any measurable sets $\mathcal{S}_{w}\subset[-L_{n}/\sqrt{n},L_{n}/\sqrt{n}]^{n},\mathcal{S}_{v}\subset[-L_{n}/\sqrt{n},L_{n}/\sqrt{n}]^{m}$, and any smooth test function $\mathsf{T}:\mathbb{R}\to\mathbb{R}$ (standardized with derivatives of order $1$ in $\lVert\cdot\rVert_{\infty}$),

\begin{align*}
\Big|\operatorname{\mathbb{E}}\mathsf{T}\Big(\min_{w\in\mathcal{S}_{w}}\max_{v\in\mathcal{S}_{v}}h_{\eta;Z}(w,v)\Big)-\operatorname{\mathbb{E}}\mathsf{T}\Big(\min_{w\in\mathcal{S}_{w}}\max_{v\in\mathcal{S}_{v}}h_{\eta;G}(w,v)\Big)\Big| &\leq\mathsf{r}_{n}(L_{n}),\\
\Big|\operatorname{\mathbb{E}}\mathsf{T}\Big(\min_{w\in\mathcal{S}_{w}}H_{\eta;Z}(w)\Big)-\operatorname{\mathbb{E}}\mathsf{T}\Big(\min_{w\in\mathcal{S}_{w}}H_{\eta;G}(w)\Big)\Big| &\leq\mathsf{r}_{n}(L_{n}). \tag{6.9}
\end{align*}

Here $\mathsf{r}_{n}(L_{n})\to 0$ for $L_{n}=n^{\vartheta}$ with sufficiently small $\vartheta>0$. The readers are referred to Theorems 10.1 and 10.2 for a precise statement of (6.9).

An important technical subtlety here is that while the first inequality in (6.9) holds down to $\eta=0$, the second inequality does not. This is so because $\min_{w}H_{0;Z}(w)$, which minimizes a deterministic function under a random constraint due to the unbounded constraint in the maximization over $v$, is qualitatively different from $\min_{w}H_{\eta;Z}(w)$ for any $\eta>0$.

Now we shall sketch how the comparison inequalities (6.9) lead to universality.

(Step 1: Universality of the global cost optimum). In this step, we shall use the first inequality in (6.9) to establish the universality of the global Gordon cost:

\begin{align*}
\min_{w\in\mathbb{R}^{n}}H_{0;Z}(w)=\min_{w\in\mathbb{R}^{n}}\max_{v\in\mathbb{R}^{m}}h_{0;Z}(w,v)\stackrel{\operatorname{\mathbb{P}}}{\approx}\min_{w\in\mathbb{R}^{n}}\max_{v\in\mathbb{R}^{m}}h_{0;G}(w,v). \tag{6.10}
\end{align*}

See Theorem 10.4 for a formal statement of (6.10).

The crux of establishing (6.10) via the first inequality of (6.9) is to show that the ranges of the minimum and the maximum of $\min_{w}\max_{v}h_{0;Z}(w,v)$ can be localized into an $L_{\infty}$ ball of order close to $\mathcal{O}(1/\sqrt{n})$. This amounts to showing that the stationary points $(\widehat{w}_{0;Z},\widehat{v}_{0;Z})$, where $\widehat{w}_{\eta;Z}=\Sigma^{1/2}(\widehat{\mu}_{\eta;Z}-\mu_{0})$ and $\widehat{v}_{\eta;Z}=-n^{-1/2}(XX^{\top}/n+\eta I_{m})^{-1}Y$ (cf. Eqn. (10.3)), are delocalized. We prove such delocalization properties in Proposition 10.3 for 'most' $\mu_{0}\in B_{n}(1)$.

(Step 2: Universality of the cost over exceptional sets). In this step, we shall use the second inequality in (6.9) to establish the universality of the Gordon cost over exceptional sets $D_{0;\varepsilon}(\mathsf{g})$. In particular, we show in Theorem 10.5 that with $L_{n}=Cn^{\vartheta}$ for sufficiently small $\vartheta>0$ and a large enough $C_{0}>0$, w.h.p.,

\begin{align*}
\min_{w\in D_{0;\varepsilon}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{0;Z}(w)\geq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{0}(\beta,\gamma)+\Omega_{\varepsilon}(1). \tag{6.11}
\end{align*}

Here $B_{(2,\infty)}(C_{0},L_{n}/\sqrt{n})=B_{n}(C_{0})\cap L_{\infty}(L_{n}/\sqrt{n})$. As mentioned above, a technical difficulty in applying the second inequality of (6.9) rests in its singular behavior near the interpolation regime $\eta=0$. Also, we note that for a general exceptional set $D_{0;\varepsilon}(\mathsf{g})$, the maximum over $v$ in $\min_{w\in D_{0;\varepsilon}(\mathsf{g})}H_{0;Z}(w)=\min_{w\in D_{0;\varepsilon}(\mathsf{g})}\max_{v}h_{0;Z}(w,v)$ need not be delocalized, so the first inequality of (6.9) cannot be applied. This singularity issue will be resolved in two steps:

  • (Step 2.1). First, we use the second inequality of (6.9) to show that (6.11) is valid for a version with small enough $\eta>0$:

    \begin{align*}
    \operatorname{\mathbb{P}}\bigg(\min_{w\in D_{\eta;\varepsilon}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{\eta;Z}(w)\geq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+\Omega_{\varepsilon}(1)\bigg)\geq 1-c_{\eta}\cdot\mathfrak{o}(1).
    \end{align*}

    See (10.4) for a precise statement. As expected, $c_{\eta}$ blows up as $\eta\downarrow 0$.

  • (Step 2.2). Next, by using the 'stability' of the set $D_{\eta;\varepsilon}(\mathsf{g})$ (cf. Lemma 10.6) and of $\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)$ (cf. Eqn. (9.13)) with respect to $\eta$, for a small enough $\eta>0$, we have the following series of inequalities:

    \begin{align*}
    &\min_{w\in D_{0;\varepsilon}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{0;Z}(w)\\
    &\geq\min_{w\in D_{0;\varepsilon}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{\eta;Z}(w)\quad\hbox{(by definition of $H_{\eta;Z}$)}\\
    &\geq\min_{w\in D_{\eta;\varepsilon_{\eta}}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{\eta;Z}(w)\quad\hbox{($\varepsilon_{\eta}\approx\varepsilon$ by Lemma 10.6)}\\
    &\stackrel{\operatorname{\mathbb{P}}}{\geq}\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+\Omega_{\varepsilon}(1)\quad\hbox{(by Step 2.1 above)}\\
    &\geq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{0}(\beta,\gamma)-\mathcal{O}(\eta)+\Omega_{\varepsilon}(1)\quad\hbox{(by Eqn. (9.13))}.
    \end{align*}

    Now for a given $\varepsilon>0$, we may choose $\eta>0$ small enough so that the term $-\mathcal{O}(\eta)$ is absorbed into $\Omega_{\varepsilon}(1)$, thereby concluding (6.11).

A complete proof of the above outline is detailed in Section 10.

7. Proof preliminaries

7.1. Some properties of 𝖾F\mathsf{e}_{F} and 𝗉𝗋𝗈𝗑F\operatorname{\mathsf{prox}}_{F}

We write $g_{n}\equiv g/\sqrt{n}$ in this subsection. First we give an explicit expression for $\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma;\tau)$ and $\operatorname{\mathbb{E}}\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma;\tau)$.

Lemma 7.1.

For any $(\gamma,\tau)\in(0,\infty)^{2}$,

\begin{align*}
\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma;\tau) &=\tau^{2}\lVert(\Sigma+\tau I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}+\gamma^{2}\cdot n^{-1}\operatorname{tr}\big(\Sigma^{2}(\Sigma+\tau I)^{-2}\big),\\
\operatorname{\mathbb{E}}\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma;\tau) &=\gamma^{2}\cdot n^{-1}\operatorname{tr}\big(\Sigma(\Sigma+\tau I)^{-1}\big).
\end{align*}
Proof.

Using the closed form of $\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}$, we may compute

\begin{align*}
\Sigma^{1/2}\big(\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma;\tau)-\mu_{0}\big)=(\Sigma+\tau I)^{-1}\Sigma^{1/2}\big(-\tau\mu_{0}+\gamma\Sigma^{1/2}g_{n}\big). \tag{7.1}
\end{align*}

The claims follow from direct calculations. ∎

Next we give explicit expressions for $\operatorname{\mathsf{prox}}_{F}(\gamma g_{n};\tau)$ and $\mathsf{e}_{F}(\gamma g_{n};\tau)$.

Lemma 7.2.

It holds that

\begin{align*}
\operatorname{\mathsf{prox}}_{F}(\gamma g_{n};\tau) &=\Sigma^{1/2}\big(\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma;\tau)-\mu_{0}\big),\\
\mathsf{e}_{F}(\gamma g_{n};\tau) &=\frac{1}{2\tau}\lVert\Sigma^{1/2}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma;\tau)-y_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma)\rVert^{2}+\frac{1}{2}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma;\tau)\rVert^{2}.
\end{align*}

Furthermore,

\begin{align*}
\operatorname{\mathbb{E}}\mathsf{e}_{F}(\gamma g_{n};\tau) &=\frac{1}{2\tau}\big(\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma;\tau)-2\operatorname{\mathbb{E}}\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma;\tau)+\gamma^{2}\big)\\
&\qquad+\frac{1}{2}\Big(\lVert(\Sigma+\tau I)^{-1}\Sigma\mu_{0}\rVert^{2}+\gamma^{2}\cdot\frac{1}{n}\operatorname{tr}(\Sigma(\Sigma+\tau I)^{-2})\Big).
\end{align*}
Proof.

The two identities in the first display follow from the definition of $F$. For the second display, note that $\operatorname{\mathbb{E}}\mathsf{e}_{F}(\gamma g_{n};\tau)$ is equal to

\begin{align*}
\frac{1}{2\tau}\big(\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma;\tau)-2\operatorname{\mathbb{E}}\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma;\tau)+\gamma^{2}\big)+\frac{1}{2}\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma;\tau)\rVert^{2}.
\end{align*}

We conclude by using $\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma;\tau)\rVert^{2}=\lVert(\Sigma+\tau I)^{-1}\Sigma\mu_{0}\rVert^{2}+\gamma^{2}\cdot n^{-1}\operatorname{tr}\big(\Sigma(\Sigma+\tau I)^{-2}\big)$. ∎
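The identities in Lemma 7.2 and (7.1) can be sanity-checked numerically. The sketch below assumes the sequence model $y^{\operatorname{\mathsf{seq}}}(\gamma)=\Sigma^{1/2}\mu_{0}+\gamma g_{n}$ with Ridge estimator $\widehat{\mu}^{\operatorname{\mathsf{seq}}}(\gamma;\tau)=(\Sigma+\tau I)^{-1}\Sigma^{1/2}y^{\operatorname{\mathsf{seq}}}(\gamma)$, and the Moreau envelope convention $\mathsf{e}_{F}(x;\tau)=\min_{z}\{\lVert x-z\rVert^{2}/(2\tau)+F(z)\}$; these conventions are inferred from the formulas in this subsection rather than quoted from their original definitions. The proximal point is computed directly from the first-order condition and compared with the closed forms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, tau = 40, 0.8, 0.5
lam = np.linspace(0.5, 2.0, n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
Sigma = (Q * lam) @ Q.T
S_half = (Q * np.sqrt(lam)) @ Q.T       # Sigma^{1/2}
S_half_inv = (Q / np.sqrt(lam)) @ Q.T   # Sigma^{-1/2}
Sigma_inv = (Q / lam) @ Q.T
mu0 = rng.standard_normal(n) / np.sqrt(n)
x = gamma * rng.standard_normal(n) / np.sqrt(n)   # x = gamma * g_n

def F(w):
    return 0.5 * np.sum((mu0 + S_half_inv @ w) ** 2)

# Assumed sequence model and its Ridge estimator.
y_seq = S_half @ mu0 + x
mu_seq = np.linalg.solve(Sigma + tau * np.eye(n), S_half @ y_seq)

# Closed forms from Lemma 7.2.
prox_closed = S_half @ (mu_seq - mu0)
env_closed = np.sum((S_half @ mu_seq - y_seq) ** 2) / (2 * tau) + 0.5 * np.sum(mu_seq ** 2)

# Direct computation: stationarity (z - x)/tau + Sigma^{-1/2}(mu0 + Sigma^{-1/2} z) = 0.
prox_direct = np.linalg.solve(np.eye(n) / tau + Sigma_inv, x / tau - S_half_inv @ mu0)
env_direct = np.sum((x - prox_direct) ** 2) / (2 * tau) + F(prox_direct)

# Right-hand side of (7.1).
rhs_71 = np.linalg.solve(Sigma + tau * np.eye(n), S_half @ (-tau * mu0 + S_half @ x))
```

All three pairs (`prox_closed` vs `prox_direct`, `env_closed` vs `env_direct`, `prox_closed` vs `rhs_71`) should agree up to floating-point error.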

The derivative formula below for $\mathsf{e}_{F}$ will be useful.

Lemma 7.3.

It holds that

\begin{align*}
\nabla_{x}\mathsf{e}_{F}(x;\tau)=\frac{1}{\tau}\big(x-\operatorname{\mathsf{prox}}_{F}(x;\tau)\big),\quad\partial_{\tau}\mathsf{e}_{F}(x;\tau)=-\frac{1}{2\tau^{2}}\lVert x-\operatorname{\mathsf{prox}}_{F}(x;\tau)\rVert^{2}.
\end{align*}
Proof.

See e.g., [TAH18, Lemmas B.5 and D.1]. ∎
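As an illustration of Lemma 7.3, the two derivative formulas can be compared against central finite differences for the quadratic $F$ in (6.1). The sketch below is not part of the proof; it assumes the Moreau envelope convention $\mathsf{e}_{F}(x;\tau)=\min_{z}\{\lVert x-z\rVert^{2}/(2\tau)+F(z)\}$, and the helper names `prox` and `env` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 12, 0.7
lam = np.linspace(0.5, 2.0, n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
S_half_inv = (Q / np.sqrt(lam)) @ Q.T   # Sigma^{-1/2}
Sigma_inv = (Q / lam) @ Q.T
mu0 = rng.standard_normal(n) / np.sqrt(n)

def prox(x, tau):
    # Minimizer of z -> ||x - z||^2 / (2 tau) + F(z), from the first-order condition.
    return np.linalg.solve(np.eye(n) / tau + Sigma_inv, x / tau - S_half_inv @ mu0)

def env(x, tau):
    # Moreau envelope of F(w) = 0.5 * ||mu0 + Sigma^{-1/2} w||^2.
    z = prox(x, tau)
    return np.sum((x - z) ** 2) / (2 * tau) + 0.5 * np.sum((mu0 + S_half_inv @ z) ** 2)

x = rng.standard_normal(n)
eps = 1e-5

# Central finite differences in x (coordinate-wise) and in tau.
grad_fd = np.array([(env(x + eps * e, tau) - env(x - eps * e, tau)) / (2 * eps)
                    for e in np.eye(n)])
dtau_fd = (env(x, tau + eps) - env(x, tau - eps)) / (2 * eps)

# Formulas from Lemma 7.3.
grad_formula = (x - prox(x, tau)) / tau
dtau_formula = -np.sum((x - prox(x, tau)) ** 2) / (2 * tau ** 2)
```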

Finally we provide a concentration inequality for $\mathsf{e}_{F}(\gamma g_{n};\tau)$.

Proposition 7.4.

There exists some universal constant $C>0$ such that

\begin{align*}
\operatorname{\mathbb{P}}\Big(\big\lvert\mathsf{e}_{F}(\gamma g_{n};\tau)-\operatorname{\mathbb{E}}\mathsf{e}_{F}(\gamma g_{n};\tau)\big\rvert\geq C\Big\{v\operatorname{\mathbb{E}}^{1/2}\mathsf{e}_{F}(\gamma g_{n};\tau)\sqrt{\frac{t}{n}}+v^{2}\cdot\frac{t}{n}\Big\}\Big)\leq Ce^{-t/C}
\end{align*}

holds for any $t\geq 0$. Here $v^{2}\equiv v^{2}(\gamma,\tau)\equiv\gamma^{2}\big(\tau\lVert(\Sigma+\tau I)^{-1}\rVert_{\operatorname{op}}^{2}+\lVert(\Sigma+\tau I)^{-1}\Sigma^{1/2}\rVert_{\operatorname{op}}^{2}\big)$.

Proof.

Using that $\nabla_{g}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma;\tau)=\frac{\gamma}{\sqrt{n}}(\Sigma+\tau I)^{-1}\Sigma^{1/2}$ and $\nabla_{g}y_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma)=\frac{\gamma}{\sqrt{n}}I$,

\begin{align*}
\nabla_{g}\mathsf{e}_{F}(\gamma g_{n};\tau) &=\frac{1}{\tau}\cdot\frac{\gamma}{\sqrt{n}}\big((\Sigma+\tau I)^{-1}\Sigma-I\big)\big(\Sigma^{1/2}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma;\tau)-y_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma)\big)\\
&\qquad+\frac{\gamma}{\sqrt{n}}(\Sigma+\tau I)^{-1}\Sigma^{1/2}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma;\tau).
\end{align*}

Since $(\Sigma+\tau I)^{-1}\Sigma-I=-\tau(\Sigma+\tau I)^{-1}$, this means

\begin{align*}
&\lVert\nabla_{g}\mathsf{e}_{F}(\gamma g_{n};\tau)\rVert^{2}\\
&\leq 2\gamma^{2}\cdot n^{-1}\Big\{\lVert(\Sigma+\tau I)^{-1}\rVert_{\operatorname{op}}^{2}\lVert\Sigma^{1/2}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma;\tau)-y_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma)\rVert^{2}\\
&\qquad\qquad\qquad\qquad+\lVert(\Sigma+\tau I)^{-1}\Sigma^{1/2}\rVert_{\operatorname{op}}^{2}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma;\tau)\rVert^{2}\Big\}\\
&\leq 4\gamma^{2}\cdot n^{-1}\Big(\tau\lVert(\Sigma+\tau I)^{-1}\rVert_{\operatorname{op}}^{2}+\lVert(\Sigma+\tau I)^{-1}\Sigma^{1/2}\rVert_{\operatorname{op}}^{2}\Big)\cdot\mathsf{e}_{F}(\gamma g_{n};\tau). \tag{7.2}
\end{align*}

Here the last inequality uses the expression for $\mathsf{e}_{F}(\gamma g_{n};\tau)$ in Lemma 7.2. From here we may conclude by setting $H(g)\equiv\mathsf{e}_{F}\big(\gamma g_{n};\tau\big)$ and $\Gamma^{2}\equiv 4\gamma^{2}n^{-1}\big(\tau\lVert(\Sigma+\tau I)^{-1}\rVert_{\operatorname{op}}^{2}+\lVert(\Sigma+\tau I)^{-1}\Sigma^{1/2}\rVert_{\operatorname{op}}^{2}\big)$ in Proposition B.1. ∎

7.2. Some high probability events

Let

\begin{align*}
e_{h}^{2}\equiv{\lVert h\rVert^{2}}/{m},\quad e_{g}^{2}\equiv{\lVert g\rVert^{2}}/{n}. \tag{7.3}
\end{align*}

For $M,\delta>0$, consider the events

\begin{align*}
\mathscr{E}_{0}(M) &\equiv\big\{({\lVert G\rVert_{\operatorname{op}}}/{\sqrt{n}})\vee\big[\lVert({GG^{\top}}/{n})^{-1}\rVert_{\operatorname{op}}\bm{1}_{\eta=0}\big]\leq M\big\},\\
\mathscr{E}_{1,0}(\delta) &\equiv\big\{\lvert e_{g}^{2}-1\rvert\vee\lvert e_{h}^{2}-1\rvert\vee\lvert n^{-1/2}\langle\Sigma^{1/2}g,\mu_{0}\rangle\rvert\vee\lvert n^{-1}\langle h,\xi\rangle\rvert\leq\delta\big\},\\
\mathscr{E}_{1,\xi}(\delta) &\equiv\big\{\lvert(\lVert\xi\rVert^{2}/m)-\sigma_{\xi}^{2}\rvert\leq\delta\big\},\\
\mathscr{E}_{1}(\delta) &\equiv\mathscr{E}_{1,0}(\delta)\cap\mathscr{E}_{1,\xi}(\delta).
\end{align*}

Here in the definition of $\mathscr{E}_{0}(M)$, we interpret $\infty\cdot 0=0$. Typically we think of $M\asymp 1$ and $\delta\asymp 1/\sqrt{n}$.

Lemma 7.5.

Fix $\delta\in(0,1/2)$ and $L_{w}>0$. Then $\mathscr{E}_{1}(\delta)\subset\mathscr{E}_{2}\big(4(\sigma_{\xi}^{2}+1+\phi^{-1}L_{w})\delta,L_{w}\big)$, where $\mathscr{E}_{2}(\delta,L_{w})\equiv\big\{\lvert\sigma_{\pm}^{2}(L_{w})-\sigma_{\xi}^{2}\rvert\leq\delta\big\}$.

Proof.

Using the definition of $\sigma_{\pm}^{2}(L_{w})$ in (6.4), on $\mathscr{E}_{1}(\delta)$, we have

\begin{align*}
\big\lvert\sigma_{\pm}^{2}(L_{w})-\sigma_{\xi}^{2}\big\rvert\leq\frac{\lVert\xi\rVert^{2}}{\lVert h\rVert^{2}}\lvert e_{h}^{2}-1\rvert+\bigg\lvert\frac{\lVert\xi\rVert^{2}}{m}-\sigma_{\xi}^{2}\bigg\rvert+\frac{2L_{w}\lvert\langle h,\xi\rangle\rvert}{\lVert h\rVert^{2}}\leq 4(\sigma_{\xi}^{2}+1+\phi^{-1}L_{w})\delta.
\end{align*}

The claim follows. ∎

Lemma 7.6.

Suppose $1/K\leq\phi^{-1}-\bm{1}_{\eta=0}\leq K$. Then there exists some $C=C(K)>0$ such that $\operatorname{\mathbb{P}}(\mathscr{E}_{0}(C))\geq 1-Ce^{-n/C}$.

Proof.

The claim for $\lVert G\rVert_{\operatorname{op}}/\sqrt{n}$ follows from standard concentration estimates. The claim for $\lVert(GG^{\top}/n)^{-1}\rVert_{\operatorname{op}}$ follows from, e.g., [RV09, Theorem 1.1]. ∎
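A quick simulation illustrates the event $\mathscr{E}_{0}(M)$ in Lemma 7.6. The sketch below draws one Gaussian matrix $G\in\mathbb{R}^{m\times n}$ with $\phi=m/n=1/2$ and compares the two operator norms with the classical asymptotic edge values $1+\sqrt{\phi}$ and $(1-\sqrt{\phi})^{-2}$ for Gaussian matrices. The comparison values are standard random matrix facts used here only as a reference point, and the sizes and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 400                      # aspect ratio phi = m / n = 0.5 < 1
phi = m / n
G = rng.standard_normal((m, n))

# Singular values of G, sorted in decreasing order (there are min(m, n) = m of them).
s = np.linalg.svd(G, compute_uv=False)

op_norm = s[0] / np.sqrt(n)          # ||G||_op / sqrt(n)
inv_op_norm = n / s[-1] ** 2         # ||(G G^T / n)^{-1}||_op

# Asymptotic reference values for comparison.
op_limit = 1 + np.sqrt(phi)
inv_limit = (1 - np.sqrt(phi)) ** -2
```

Both `op_norm` and `inv_op_norm` remain of order one here, in line with the choice $M\asymp 1$; the second quantity degenerates as $\phi\uparrow 1$, which is why the lemma requires $\phi^{-1}-1$ to be bounded away from zero when $\eta=0$.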

Lemma 7.7.

Suppose $1/K\leq\phi^{-1}\leq K$ and $\lVert\mu_{0}\rVert\vee\lVert\Sigma\rVert_{\operatorname{op}}\leq K$ for some $K>0$, and that Assumption B holds with $\sigma_{\xi}^{2}>0$. There exists some constant $C=C(K,\sigma_{\xi})>0$ such that for all $t\geq 0$, with $\delta(t,n)\equiv C(\sqrt{t/n}+t/n)$, for $\xi\in\mathscr{E}_{1,\xi}(\delta(t,n))$, we have $\operatorname{\mathbb{P}}^{\xi}(\mathscr{E}_{1}(\delta(t,n)))\geq 1-e^{-t}$.

Proof.

The claim follows by standard concentration inequalities. ∎

8. Properties of the fixed point equations

8.1. The fixed point equation (2.1)

Proposition 8.1.

The following hold.

  1. (1)

    The fixed point equation (2.1) admits a unique solution $(\gamma_{\eta,\ast},\tau_{\eta,\ast})\in(0,\infty)^{2}$, for all $(m,n)\in\mathbb{N}^{2}$ when $\eta>0$, and for all $m<n$ when $\eta=0$.

  2. (2)

    The following a priori bounds hold:

    \begin{align*}
    \frac{1-\phi+\sqrt{(1-\phi)^{2}+4\mathcal{H}_{\Sigma}\eta}}{2\mathcal{H}_{\Sigma}} &\leq\tau_{\eta,\ast}\leq\inf_{k\in[0:\min\{m-1,n\}]}\bigg\{\frac{\sum_{j>k}\lambda_{j}}{m-k}+\frac{n}{m-k}\cdot\eta\bigg\},\\
    \frac{\sigma_{\xi}^{2}}{\phi} &\leq\gamma_{\eta,\ast}^{2}\leq\frac{\sigma_{\xi}^{2}+\lVert\Sigma\rVert_{\operatorname{op}}\lVert\mu_{0}\rVert^{2}}{\phi}\bigg(1+\frac{\lVert\Sigma\rVert_{\operatorname{op}}}{\tau_{\eta,\ast}}\bigg).
    \end{align*}

  3. (3)

    If $1/K\leq\phi^{-1}\leq K$ and $\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$ for some $K>1$, then there exists some $C=C(K)>1$ such that uniformly in $\eta\in\Xi_{K}$,

    \[
    1/C\leq\tau_{\eta,\ast}\leq C,\quad 1/C\leq(-1)^{q+1}\partial_{\eta}^{q}\tau_{\eta,\ast}\leq C,\quad q\in\{1,2\}.
    \]

    If furthermore $1/K\leq\sigma_{\xi}^{2}\leq K$ and $\lVert\mu_{0}\rVert\leq K$, then uniformly in $\eta\in\Xi_{K}$,

    \[
    1/C\leq\gamma_{\eta,\ast}\leq C,\quad\lvert\partial_{\eta}\gamma_{\eta,\ast}\rvert\leq C.
    \]

Proof.

We shall write $(\gamma_{\eta,\ast},\tau_{\eta,\ast})=(\gamma_{\ast},\tau_{\ast})$ for notational simplicity. All the constants in $\lesssim,\gtrsim,\asymp$ below may depend on $K$.

(1). First we prove the existence and uniqueness of $\tau_{\ast}$. We rewrite the second equation of (2.1) as

\[
\phi=\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{\ast}I)^{-1}\Sigma\big)+\frac{\eta}{\tau_{\ast}}=\frac{1}{n}\sum_{j=1}^{n}\frac{\lambda_{j}}{\lambda_{j}+\tau_{\ast}}+\frac{\eta}{\tau_{\ast}}\equiv\mathsf{f}(\tau_{\ast}). \tag{8.1}
\]

Clearly $\mathsf{f}(\tau)$ is smooth and strictly decreasing on $(0,\infty)$, with $\mathsf{f}(0)=1>\phi$ for $\eta=0$, $\mathsf{f}(0)=\infty$ for $\eta>0$, and $\mathsf{f}(\infty)=0$, so $\tau\mapsto\mathsf{f}(\tau)-\phi$ must admit a unique zero $\tau_{\ast}\in(0,\infty)$.

Next we prove the existence and uniqueness of $\gamma_{\ast}$. Using Lemma 7.1, the equation $\phi\gamma_{\ast}^{2}=\sigma_{\xi}^{2}+\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma_{\ast};\tau_{\ast})$ reads

\[
\phi=\frac{1}{\gamma_{\ast}^{2}}\big(\sigma_{\xi}^{2}+\tau_{\ast}^{2}\lVert(\Sigma+\tau_{\ast}I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}\big)+\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{\ast}I)^{-2}\Sigma^{2}\big). \tag{8.2}
\]

As $n^{-1}\operatorname{tr}\big((\Sigma+\tau_{\ast}I)^{-2}\Sigma^{2}\big)<n^{-1}\operatorname{tr}\big((\Sigma+\tau_{\ast}I)^{-1}\Sigma\big)\leq\phi$ by (8.1) and the fact $\tau_{\ast}>0$, the above equation admits a unique solution $\gamma_{\ast}\in(0,\infty)$, analytically given by

\[
\gamma_{\ast}^{2}=\frac{\sigma_{\xi}^{2}+\tau_{\ast}^{2}\lVert(\Sigma+\tau_{\ast}I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}}{\phi-\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{\ast}I)^{-2}\Sigma^{2}\big)}=\frac{\sigma_{\xi}^{2}+\tau_{\ast}^{2}\lVert(\Sigma+\tau_{\ast}I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}}{\frac{\eta}{\tau_{\ast}}+\frac{\tau_{\ast}}{n}\operatorname{tr}\big((\Sigma+\tau_{\ast}I)^{-2}\Sigma\big)}. \tag{8.3}
\]
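The existence argument above is constructive enough to be turned into a small numerical routine. The following is a minimal sketch, assuming a diagonal $\Sigma$ given by a list of eigenvalues `lam`; the function names (`f`, `solve_tau`, `solve_gamma`), the bisection bracket and the toy inputs are illustrative assumptions and not part of the paper. It solves (8.1) for $\tau_{\ast}$ by bisection, using the monotonicity of $\mathsf{f}$, and then evaluates $\gamma_{\ast}$ through the first expression in (8.3).

```python
import math

def f(tau, lam, eta):
    """f(tau) from (8.1) for a diagonal Sigma with eigenvalues lam."""
    n = len(lam)
    return sum(l / (l + tau) for l in lam) / n + eta / tau

def solve_tau(lam, phi, eta, iters=200):
    """Bisection for the zero of tau -> f(tau) - phi on (0, inf).

    Assumes the setting of Proposition 8.1-(1): eta > 0, or eta = 0 and phi < 1.
    """
    lo, hi = 1e-12, 1.0
    while f(hi, lam, eta) > phi:  # grow the bracket until f(hi) <= phi
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid, lam, eta) > phi:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def solve_gamma(tau, lam, mu0, phi, sigma2):
    """gamma_* from the first expression in (8.3)."""
    n = len(lam)
    bias = tau ** 2 * sum(l * u * u / (l + tau) ** 2 for l, u in zip(lam, mu0))
    denom = phi - sum((l / (l + tau)) ** 2 for l in lam) / n
    return math.sqrt((sigma2 + bias) / denom)

# Isotropic toy case: Sigma = I_4, m = 2, n = 4 (phi = 1/2), eta = 0, mu_0 = 0.
lam, mu0 = [1.0] * 4, [0.0] * 4
tau_star = solve_tau(lam, phi=0.5, eta=0.0)
gamma_star = solve_gamma(tau_star, lam, mu0, phi=0.5, sigma2=1.0)
print(tau_star, gamma_star)
```

In this isotropic toy case (8.1) reduces to $1/(1+\tau_{\ast})=\phi$, so $\tau_{\ast}=1$, and (8.3) gives $\gamma_{\ast}^{2}=1/(1/2-1/4)=4$; the printed values should therefore be close to 1 and 2.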

(2). For the upper bound for $\tau_{\ast}$, using the equation (8.1), and bounding $\lambda_{j}/(\lambda_{j}+\tau_{\ast})$ by $1$ for $j\leq k$ and by $\lambda_{j}/\tau_{\ast}$ for $j>k$, we have

\[
m=n\phi\leq k+\frac{1}{\tau_{\ast}}\sum_{j>k}\lambda_{j}+\frac{n\eta}{\tau_{\ast}},\quad\forall k\in[0:n],\,k\leq m-1.
\]

Solving for $\tau_{\ast}$ yields the desired upper bound. For the lower bound for $\tau_{\ast}$, note that (8.1) leads to

\[
\phi=1-\tau_{\ast}\cdot\frac{1}{n}\sum_{j=1}^{n}\frac{1}{\lambda_{j}+\tau_{\ast}}+\frac{\eta}{\tau_{\ast}}\geq 1-\tau_{\ast}\mathcal{H}_{\Sigma}+\frac{\eta}{\tau_{\ast}},
\]

or equivalently $\mathcal{H}_{\Sigma}\tau_{\ast}^{2}+(\phi-1)\tau_{\ast}-\eta\geq 0$. Solving this quadratic inequality yields the lower bound for $\tau_{\ast}$.

On the other hand, the lower bound $\gamma_{\ast}^{2}\geq\sigma_{\xi}^{2}/\phi$ is trivial by (8.3). For the upper bound for $\gamma_{\ast}$, using that

\begin{align*}
\phi-\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{\ast}I)^{-2}\Sigma^{2}\big) &\geq\phi-\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{\ast}I)^{-1}\Sigma\big)\cdot\max_{j\in[n]}\frac{\lambda_{j}}{\lambda_{j}+\tau_{\ast}}\\
&\geq\phi\cdot\frac{\tau_{\ast}}{\lVert\Sigma\rVert_{\operatorname{op}}+\tau_{\ast}}, \tag{8.4}
\end{align*}

and the first identity in (8.3), we have

\[
\gamma_{\ast}^{2}\leq\phi^{-1}\big(\sigma_{\xi}^{2}+\lVert\Sigma\rVert_{\operatorname{op}}\lVert\mu_{0}\rVert^{2}\big)\big(1+\lVert\Sigma\rVert_{\operatorname{op}}/\tau_{\ast}\big).
\]

Collecting the bounds proves the claim.
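The two-sided bounds in part (2) can be sanity-checked numerically on a small example. The sketch below is illustrative only: the eigenvalues, $\mu_{0}$, $(m,n)$, $\eta$ and $\sigma_{\xi}^{2}$ are hypothetical toy values, $\Sigma$ is assumed diagonal with eigenvalues sorted decreasingly, and `solve_tau` is an assumed bisection helper rather than anything from the paper. It uses $\mathcal{H}_{\Sigma}=\operatorname{tr}(\Sigma^{-1})/n$.

```python
import math

def solve_tau(lam, phi, eta, iters=200):
    """Bisection for the zero of f(tau) - phi, with f as in (8.1)."""
    n = len(lam)
    f = lambda t: sum(l / (l + t) for l in lam) / n + eta / t
    lo, hi = 1e-12, 1.0
    while f(hi) > phi:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > phi else (lo, mid)
    return 0.5 * (lo + hi)

# Hypothetical toy instance (all numbers are illustrative assumptions).
lam = [2.0, 1.0, 0.5, 0.25]          # eigenvalues of Sigma, sorted decreasingly
mu0 = [0.5, -0.5, 0.5, -0.5]
m, n, eta, sigma2 = 2, 4, 0.1, 1.0
phi = m / n

tau = solve_tau(lam, phi, eta)
bias = tau ** 2 * sum(l * u * u / (l + tau) ** 2 for l, u in zip(lam, mu0))
gamma2 = (sigma2 + bias) / (phi - sum((l / (l + tau)) ** 2 for l in lam) / n)

H = sum(1.0 / l for l in lam) / n    # H_Sigma = tr(Sigma^{-1}) / n
op = max(lam)                        # operator norm of Sigma
mu_norm2 = sum(u * u for u in mu0)

# Bounds of Proposition 8.1-(2); lam[k:] collects the eigenvalues with index j > k.
tau_lo = (1 - phi + math.sqrt((1 - phi) ** 2 + 4 * H * eta)) / (2 * H)
tau_hi = min((sum(lam[k:]) + n * eta) / (m - k) for k in range(min(m - 1, n) + 1))
gamma2_lo = sigma2 / phi
gamma2_hi = (sigma2 + op * mu_norm2) / phi * (1 + op / tau)

print(tau_lo <= tau <= tau_hi, gamma2_lo <= gamma2 <= gamma2_hi)
```

Both comparisons should evaluate to true for this instance; the upper bound for $\tau_{\ast}$ is typically far from tight here, which is consistent with its role as an a priori estimate.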

(3). The claim on $\gamma_{\ast},\tau_{\ast}$ is a simple consequence of (2). We shall prove the other claim on their derivatives. Viewing $\tau_{\ast}=\tau_{\ast}(\eta)$ and differentiating both sides of (8.1) with respect to $\eta$ yields, with $T_{-p,q}(\eta)\equiv n^{-1}\operatorname{tr}\big((\Sigma+\tau_{\ast}(\eta)I)^{-p}\Sigma^{q}\big)$ for $p,q\in\mathbb{N}$,

\[
0=-T_{-2,1}(\eta)\cdot\tau_{\ast}^{\prime}(\eta)+\frac{1}{\tau_{\ast}(\eta)}-\frac{\eta}{\tau_{\ast}^{2}(\eta)}\cdot\tau_{\ast}^{\prime}(\eta).
\]

Solving for $\tau_{\ast}^{\prime}(\eta)$ yields that

\[
\tau_{\ast}^{\prime}(\eta)=\frac{\tau_{\ast}(\eta)}{\eta+\tau_{\ast}^{2}(\eta)\cdot T_{-2,1}(\eta)}\equiv\frac{\tau_{\ast}(\eta)}{G_{0}(\eta)}. \tag{8.5}
\]

Differentiating both sides of (8.5) once more with respect to $\eta$, we have

\begin{align*}
\tau_{\ast}^{\prime\prime}(\eta) &=\frac{1}{G_{0}^{2}(\eta)}\big(\tau_{\ast}^{\prime}(\eta)G_{0}(\eta)-\tau_{\ast}(\eta)G_{0}^{\prime}(\eta)\big)\\
&=\frac{1}{G_{0}^{2}(\eta)}\Big\{\tau_{\ast}(\eta)-\tau_{\ast}(\eta)\Big(1+2\tau_{\ast}(\eta)\tau_{\ast}^{\prime}(\eta)T_{-2,1}(\eta)-2\tau_{\ast}^{2}(\eta)\tau_{\ast}^{\prime}(\eta)T_{-3,1}(\eta)\Big)\Big\}\\
&=\frac{2\tau_{\ast}^{2}(\eta)\tau_{\ast}^{\prime}(\eta)}{G_{0}^{2}(\eta)}\Big(\tau_{\ast}(\eta)T_{-3,1}(\eta)-T_{-2,1}(\eta)\Big)=-\frac{2\tau_{\ast}^{2}(\eta)\tau_{\ast}^{\prime}(\eta)}{G_{0}^{2}(\eta)}T_{-3,2}(\eta). \tag{8.6}
\end{align*}

Using the a priori estimate for $\tau_{\ast}(\eta)$ proved in (2), it follows that for $q\in\{1,2\}$,

\[
1\lesssim\inf_{\eta\in\Xi_{K}}(-1)^{q+1}\tau_{\ast}^{(q)}(\eta)\leq\sup_{\eta\in\Xi_{K}}(-1)^{q+1}\tau_{\ast}^{(q)}(\eta)\lesssim 1. \tag{8.7}
\]
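The closed forms (8.5) and (8.6) can be compared against finite differences of the numerically computed map $\eta\mapsto\tau_{\ast}(\eta)$. The following sketch is only a plausibility check under assumptions: $\Sigma$ is diagonal with hypothetical eigenvalues `lam`, `solve_tau` is an assumed bisection helper for (8.1), and the step size `h` is an arbitrary choice.

```python
def solve_tau(lam, phi, eta, iters=200):
    """Bisection for the zero of f(tau) - phi, with f as in (8.1)."""
    n = len(lam)
    f = lambda t: sum(l / (l + t) for l in lam) / n + eta / t
    lo, hi = 1e-12, 1.0
    while f(hi) > phi:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > phi else (lo, mid)
    return 0.5 * (lo + hi)

def T(p, q, tau, lam):
    """T_{-p,q} = n^{-1} tr((Sigma + tau I)^{-p} Sigma^q) for diagonal Sigma."""
    return sum(l ** q / (l + tau) ** p for l in lam) / len(lam)

# Hypothetical toy instance.
lam, phi, eta = [2.0, 1.0, 0.5, 0.25], 0.5, 0.5
tau = solve_tau(lam, phi, eta)

G0 = eta + tau ** 2 * T(2, 1, tau, lam)
d1 = tau / G0                                           # formula (8.5)
d2 = -2 * tau ** 2 * d1 * T(3, 2, tau, lam) / G0 ** 2   # formula (8.6)

# Central finite differences in eta.
h = 1e-3
tp, tm = solve_tau(lam, phi, eta + h), solve_tau(lam, phi, eta - h)
fd1 = (tp - tm) / (2 * h)
fd2 = (tp - 2 * tau + tm) / h ** 2

print(d1, fd1)
print(d2, fd2)
```

Each printed pair should agree up to discretization error, with a positive first derivative and a negative second derivative, matching the sign pattern $(-1)^{q+1}\tau_{\ast}^{(q)}>0$ in (8.7).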

For $\gamma_{\ast}^{\prime}(\eta)$, let us define

\begin{align*}
G_{1}(\eta) &\equiv\sigma_{\xi}^{2}+\tau_{\ast}^{2}(\eta)\lVert(\Sigma+\tau_{\ast}(\eta)I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2},\\
G_{2}(\eta) &\equiv\phi-n^{-1}\operatorname{tr}\big((\Sigma+\tau_{\ast}(\eta)I)^{-2}\Sigma^{2}\big).
\end{align*}

Then

\[
\gamma_{\ast}^{\prime}(\eta)=\frac{G_{1}^{\prime}(\eta)G_{2}(\eta)-G_{1}(\eta)G_{2}^{\prime}(\eta)}{2\gamma_{\ast}(\eta)G_{2}^{2}(\eta)}. \tag{8.8}
\]

We shall now prove bounds for $G_{1},G_{1}^{\prime},G_{2},G_{2}^{\prime}$. First, using (8.1), we have

\[
\sigma_{\xi}^{2}\leq G_{1}(\eta)\leq\sigma_{\xi}^{2}+\frac{\tau_{\ast}(\eta)}{2}\lVert\mu_{0}\rVert^{2},\quad\phi\cdot\frac{\tau_{\ast}(\eta)}{\lVert\Sigma\rVert_{\operatorname{op}}+\tau_{\ast}(\eta)}\leq G_{2}(\eta)\leq\phi.
\]

In particular, uniformly in $\eta\in\Xi_{K}$,

\[
G_{1}(\eta),G_{2}(\eta)\asymp 1. \tag{8.9}
\]

The derivatives $G_{1}^{\prime},G_{2}^{\prime}$ are

\begin{align*}
G_{1}^{\prime}(\eta) &=2\tau_{\ast}(\eta)\tau_{\ast}^{\prime}(\eta)\lVert(\Sigma+\tau_{\ast}(\eta)I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}\\
&\qquad-2\tau_{\ast}^{2}(\eta)\lVert(\Sigma+\tau_{\ast}(\eta)I)^{-3/2}\Sigma^{1/2}\mu_{0}\rVert^{2}\cdot\tau_{\ast}^{\prime}(\eta),\\
G_{2}^{\prime}(\eta) &=2\cdot n^{-1}\operatorname{tr}\big((\Sigma+\tau_{\ast}(\eta)I)^{-3}\Sigma^{2}\big)\cdot\tau_{\ast}^{\prime}(\eta).
\end{align*}

Using the a priori estimates on $\tau_{\ast}(\eta)$ and (8.7), it now follows that

\[
\sup_{\eta\in\Xi_{K}}\big\{\lvert G_{1}^{\prime}(\eta)\rvert\vee\lvert G_{2}^{\prime}(\eta)\rvert\big\}\lesssim 1. \tag{8.10}
\]

Combining (8.8)-(8.10) and using the a priori estimates on $\gamma_{\ast}(\eta)$, we arrive at

\[
\sup_{\eta\in\Xi_{K}}\lvert\gamma_{\ast}^{\prime}(\eta)\rvert\lesssim 1. \tag{8.11}
\]

The claim follows by collecting (8.7) and (8.11). ∎

8.2. Sample version of (2.1)

Let the sample version of (2.1) be defined by

\[
\begin{cases}
\phi e_{h}^{2}\gamma^{2}=\sigma_{\pm}^{2}(L_{w})+\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma;\tau),\\
\big(\phi e_{h}^{2}-\frac{\eta}{\tau}\big)\cdot\gamma^{2}=\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma;\tau).
\end{cases} \tag{8.12}
\]

Here recall that $e_{h}^{2}$ is defined in (7.3), and $\sigma_{\pm}^{2}(L_{w})$ is defined in (6.4).

Proposition 8.2.

Suppose $1/K\leq\phi^{-1},\sigma_{\xi}^{2}\leq K$ and $\lVert\mu_{0}\rVert\vee\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$ for some $K>0$. There exist some $C,C_{0}>1$ depending on $K$, such that with $\delta\in(0,1/C^{100})$, $1\leq M\leq\sqrt{n}/C$ and $L_{w}\leq C$, on the event $\mathscr{E}_{1}(\delta)\cap\mathscr{E}_{\Delta,\Xi}(M)$, where

\[
\mathscr{E}_{\Delta,\Xi}(M)\equiv\Big\{\max_{\ell=1,2}\sup_{\tau\geq 0}\lvert\Delta_{\ell}(\tau)\rvert\vee\max_{\ell=1,2}\sup_{\tau\geq 0}n^{-1/2}\lvert\Xi_{\ell}(\tau)-\operatorname{\mathbb{E}}\Xi_{\ell}(\tau)\rvert\leq M\Big\}
\]

with $\Delta_{\ell},\Xi_{\ell}$ defined in Lemmas 8.3 and 8.4 ahead, the following hold.

  1. (1)

    All solutions $(\gamma_{n,\eta,\pm},\tau_{n,\eta,\pm})$ to the system of equations in (8.12) satisfy

    \[
    1/C_{0}\leq\tau_{n,\eta,\pm}\leq C_{0},\quad 1/C_{0}\leq\gamma_{n,\eta,\pm}\leq C_{0}
    \]

    uniformly in $\eta\in\Xi_{K}$.

  2. (2)

    Moreover,

    \[
    \sup_{\eta\in\Xi_{K}}\big\{\lvert\tau_{n,\eta,\pm}-\tau_{\eta,\ast}\rvert\vee\lvert\gamma_{n,\eta,\pm}-\gamma_{\eta,\ast}\rvert\big\}\leq C_{0}\cdot\big(M/\sqrt{n}+\delta\big).
    \]

We need two concentration lemmas before the proof of Proposition 8.2.

Lemma 8.3.

Let $\Delta_{\ell}(\tau)\equiv-\ell\cdot\tau\big\langle(\Sigma+\tau I)^{-\ell}\Sigma^{\ell-1/2}\mu_{0},g\big\rangle$ for $\ell=1,2$. Suppose that $\lVert\mu_{0}\rVert\vee\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$ for some $K>0$. Then there exists some constant $C=C(K)>1$ such that for $t\geq C\log(en)$,

\[
\operatorname{\mathbb{P}}\Big(\max_{\ell=1,2}\sup_{\tau\geq 0}\lvert\Delta_{\ell}(\tau)\rvert\geq C\sqrt{t}\Big)\leq e^{-t}.
\]

Lemma 8.4.

Let $\Xi_{\ell}(\tau)\equiv\lVert(\Sigma+\tau I)^{-\ell/2}\Sigma^{\ell/2}g\rVert^{2}$ for $\ell=1,2$. Suppose that $\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$ for some $K>0$. Then there exists some constant $C=C(K)>1$ such that for $t\geq C\log(en)$,

\[
\operatorname{\mathbb{P}}\Big(\max_{\ell=1,2}\sup_{\tau\geq 0}\big\lvert\Xi_{\ell}(\tau)-\operatorname{\mathbb{E}}\Xi_{\ell}(\tau)\big\rvert\geq C(\sqrt{nt}+t)\Big)\leq e^{-t}.
\]

The proofs of these lemmas are deferred to the next subsection.

Proof of Proposition 8.2.

All the constants in $\lesssim$, $\gtrsim$, $\asymp$ and $\mathcal{O}$ below may possibly depend on $K$. We often suppress the dependence of $\sigma_{\pm}^{2}(L_{w})$ on $L_{w}$ for simplicity.

(1). We shall write $(\gamma_{n,\eta,\pm},\tau_{n,\eta,\pm})$ as $(\gamma_{n},\tau_{n})$ and $(\gamma_{\eta,\ast},\tau_{\eta,\ast})=(\gamma_{\ast},\tau_{\ast})$ for notational simplicity. Using (7.1), any solution $(\gamma_{n},\tau_{n})$ to the equations in (8.12) satisfies

\[
\begin{cases}
\phi e_{h}^{2}-\frac{\eta}{\tau_{n}}+\frac{\Delta_{1}(\tau_{n})}{\sqrt{n}\gamma_{n}}=\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{n}I)^{-1}\Sigma\big)+\frac{1}{n}(\mathrm{id}-\operatorname{\mathbb{E}})\Xi_{1}(\tau_{n}),\\
\phi e_{h}^{2}+\frac{\Delta_{2}(\tau_{n})}{\sqrt{n}\gamma_{n}}=\frac{1}{\gamma_{n}^{2}}\big(\sigma_{\pm}^{2}+\tau_{n}^{2}\lVert(\Sigma+\tau_{n}I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}\big)\\
\qquad\qquad\qquad\qquad+\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{n}I)^{-2}\Sigma^{2}\big)+\frac{1}{n}(\mathrm{id}-\operatorname{\mathbb{E}})\Xi_{2}(\tau_{n}).
\end{cases} \tag{8.13}
\]

On the event $\mathscr{E}_{1}(\delta)\cap\mathscr{E}_{\Delta,\Xi}(M)$ with $\delta\in(0,1/C^{100})$, $1\leq M\leq\sqrt{n}/C$ and $L_{w}\leq C$, using Lemma 7.5, the second equation in (8.13) becomes

\begin{align*}
&\phi+\mathcal{O}\big(M\big(1\vee\gamma_{n}^{-1}\big)/\sqrt{n}+\delta\big)\\
&=\frac{1}{\gamma_{n}^{2}}\Big(\sigma_{\pm}^{2}+\tau_{n}^{2}\lVert(\Sigma+\tau_{n}I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}\Big)+\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{n}I)^{-2}\Sigma^{2}\big)\gtrsim\frac{1}{\gamma_{n}^{2}}.
\end{align*}

Rearranging terms we obtain the inequality

\[
\frac{1}{\gamma_{n}^{2}}\lesssim 1+\frac{M}{\sqrt{n}}+\frac{M}{\sqrt{n}\gamma_{n}}\,\Rightarrow\,\gamma_{n}\gtrsim\frac{1}{1+M/\sqrt{n}}\gtrsim 1.
\]

So with $\varepsilon_{n}\equiv\varepsilon_{n}(M,\delta)\equiv M/\sqrt{n}+\delta$, the equations in (8.13) reduce to

\[
\begin{cases}
\phi-\frac{\eta}{\tau_{n}}+\mathcal{O}(\varepsilon_{n})=\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{n}I)^{-1}\Sigma\big),\\
\phi+\mathcal{O}(\varepsilon_{n})=\frac{1}{\gamma_{n}^{2}}\big(\sigma_{\xi}^{2}+\tau_{n}^{2}\lVert(\Sigma+\tau_{n}I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}\big)+\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{n}I)^{-2}\Sigma^{2}\big).
\end{cases} \tag{8.14}
\]

The above equations match (2.1) up to the small perturbation $\mathcal{O}(\varepsilon_{n})=\mathcal{O}(\varepsilon_{n}(M,\delta))$, which can be assimilated into the leading term $\phi$ provided $M\leq c_{0}\sqrt{n}$ for a small enough $c_{0}>0$. From here the existence (but not uniqueness) and a priori bounds for $\gamma_{n},\tau_{n}$ can be established similarly to the proof of Proposition 8.1.

(2). Now we shall prove the claimed error bounds. By using (8.1) and the first equation of (8.14), we have

\[
\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{n}I)^{-1}\Sigma\big)+\frac{\eta}{\tau_{n}}=\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{\ast}I)^{-1}\Sigma\big)+\frac{\eta}{\tau_{\ast}}+\mathcal{O}(\varepsilon_{n}).
\]

Let $\mathsf{f}(\tau)\equiv\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau I)^{-1}\Sigma\big)+\frac{\eta}{\tau}$. Then it is easy to calculate $\mathsf{f}^{\prime}(\tau)=-\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau I)^{-2}\Sigma\big)-\frac{\eta}{\tau^{2}}\leq 0$, and for any $C_{0}>1$,

\[
\inf_{1/C_{0}\leq\tau\leq C_{0}}\lvert\mathsf{f}^{\prime}(\tau)\rvert\geq\inf_{1/C_{0}\leq\tau\leq C_{0}}\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau I)^{-2}\Sigma\big)\geq(\lVert\Sigma\rVert_{\operatorname{op}}+C_{0})^{-2}\mathcal{H}_{\Sigma}^{-1}.
\]

Now using the a priori estimates on $\tau_{\ast},\tau_{n}$, we may conclude

\[
\sup_{\eta\in\Xi_{K}}\lvert\tau_{n}-\tau_{\ast}\rvert\lesssim\varepsilon_{n}. \tag{8.15}
\]

On the other hand, using (8.2) and the second equation of (8.14), we have

\begin{align*}
&\frac{1}{\gamma_{n}^{2}}\Big(\sigma_{\xi}^{2}+\tau_{n}^{2}\lVert(\Sigma+\tau_{n}I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}\Big)+\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{n}I)^{-2}\Sigma^{2}\big)\\
&=\frac{1}{\gamma_{\ast}^{2}}\Big(\sigma_{\xi}^{2}+\tau_{\ast}^{2}\lVert(\Sigma+\tau_{\ast}I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}\Big)+\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{\ast}I)^{-2}\Sigma^{2}\big)+\mathcal{O}(\varepsilon_{n}). \tag{8.16}
\end{align*}

Using the error bound in (8.15) and the a priori estimates for $\tau_{n},\tau_{\ast}$, and the fact that $\mathcal{H}_{\Sigma}\lesssim 1$, by an easy derivative estimate we have

  • $\big\lvert\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{n}I)^{-2}\Sigma^{2}\big)-\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{\ast}I)^{-2}\Sigma^{2}\big)\big\rvert\lesssim\varepsilon_{n}$, and

  • $\big\lvert\tau_{n}^{2}\lVert(\Sigma+\tau_{n}I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}-\tau_{\ast}^{2}\lVert(\Sigma+\tau_{\ast}I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}\big\rvert\lesssim\varepsilon_{n}$.

Now plugging these estimates into (8.2), with $\mathscr{C}_{0}\equiv\sigma_{\xi}^{2}+\tau_{\ast}^{2}\lVert(\Sigma+\tau_{\ast}I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}$ satisfying $\mathscr{C}_{0}\asymp 1$, we arrive at

\[
\frac{\mathscr{C}_{0}+\mathcal{O}(\varepsilon_{n})}{\gamma_{n}^{2}}=\frac{\mathscr{C}_{0}}{\gamma_{\ast}^{2}}+\mathcal{O}(\varepsilon_{n}).
\]

Using the a priori estimates on $\gamma_{n},\gamma_{\ast}$, we may then invert the above estimate into

\[
\sup_{\eta\in\Xi_{K}}\lvert\gamma_{n}-\gamma_{\ast}\rvert\lesssim\varepsilon_{n}. \tag{8.17}
\]

The claimed error bounds follow by combining (8.15) and (8.17). ∎

8.3. Proofs of Lemmas 8.3 and 8.4

Proof of Lemma 8.3.

We only handle the case $\ell=1$. The case $\ell=2$ is similar. Note that the assumption on $\mu_{0}$ is invariant under orthogonal transforms, so for notational simplicity we assume without loss of generality that $\Sigma$ is diagonal. As $\sup_{\tau\geq Kn}\lvert\Delta_{1}(\tau)\rvert\leq\big\lvert\sum_{j=1}^{n}\lambda_{j}^{1/2}\mu_{0,j}g_{j}\big\rvert+Ce_{g}\cdot n^{-1/2}$, a standard concentration for the first term shows that for $t\geq 1$, with probability at least $1-e^{-t}$,

\[
\sup_{\tau\geq Kn}\lvert\Delta_{1}(\tau)\rvert\leq C_{0}\sqrt{t}. \tag{8.18}
\]

On the other hand, for $\varepsilon>0$ to be chosen later, by taking an $\varepsilon$-net $\mathcal{S}_{\varepsilon}$ of $[0,Kn]$, a union bound shows that with probability at least $1-(Kn/\varepsilon+1)e^{-t}$,

\begin{align*}
\sup_{\tau\in[0,Kn]}\lvert\Delta_{1}(\tau)\rvert &\leq\max_{\tau\in\mathcal{S}_{\varepsilon}}\lvert\Delta_{1}(\tau)\rvert+\sup_{\tau,\tau^{\prime}\in[0,Kn]:\lvert\tau-\tau^{\prime}\rvert\leq\varepsilon}\lvert\Delta_{1}(\tau)-\Delta_{1}(\tau^{\prime})\rvert\\
&\leq C_{1}\cdot\Big(\sqrt{t}+\sqrt{n}(\sqrt{\log n}+\sqrt{t})\varepsilon\Big).
\end{align*}

Here in the last inequality we used the simple estimate $\sup_{\tau\in[0,Kn]}\lvert\partial_{\tau}\Delta_{1}(\tau)\rvert\leq C\sqrt{n}\lVert\mu_{0}\rVert\lVert g\rVert_{\infty}$. Finally by choosing $\varepsilon\equiv\sqrt{t}/\big\{\sqrt{n}(\sqrt{\log n}+\sqrt{t})\big\}$, we conclude that for $t\geq C_{2}\log(en)$, with probability at least $1-e^{-t}$,

\[
\sup_{\tau\in[0,Kn]}\lvert\Delta_{1}(\tau)\rvert\leq C_{2}\sqrt{t}. \tag{8.19}
\]

The claim follows by combining (8.18) and (8.19). ∎

Proof of Lemma 8.4.

We focus on the case $\ell=1$ and follow an idea similar to the one used in the proof of Lemma 8.3 above. As before, we assume without loss of generality that $\Sigma$ is diagonal. All the constants in $\lesssim,\gtrsim,\asymp$ below may depend on $K$.

First note that by a standard concentration, for any $t\geq 1$, with probability at least $1-e^{-t}$, $\sup_{\tau>Kn}\lvert\Xi_{1}(\tau)\rvert\lesssim e_{g}^{2}\lesssim 1+t/n$. Similarly we have $\sup_{\tau>Kn}\operatorname{\mathbb{E}}\lvert\Xi_{1}(\tau)\rvert\lesssim 1$. This means for any $t\geq 1$, with probability at least $1-e^{-t}$,

\[
\sup_{\tau>Kn}\Big(\lvert\Xi_{1}(\tau)\rvert\vee\operatorname{\mathbb{E}}\lvert\Xi_{1}(\tau)\rvert\Big)\lesssim 1+t/n. \tag{8.20}
\]

Next we handle the suprema over $[0,Kn]$ by discretization over an $\varepsilon$-net $\mathcal{S}_{\varepsilon}$. To this end, we shall establish a pointwise concentration. Note that $\lVert\nabla\Xi_{1}(\tau)\rVert^{2}=4\lVert(\Sigma+\tau I)^{-1}\Sigma g\rVert^{2}\leq 4\Xi_{1}(\tau)$, where the gradient is taken with respect to $g$. An application of Proposition B.1 then yields that, for each $\tau\geq 0$ and $t\geq 1$, with probability at least $1-e^{-t}$,

\[
\lvert\Xi_{1}(\tau)-\operatorname{\mathbb{E}}\Xi_{1}(\tau)\rvert\leq C\big(\operatorname{\mathbb{E}}^{1/2}\Xi_{1}(\tau)\cdot\sqrt{t}+t\big)\lesssim(\sqrt{nt}+t).
\]

On the other hand, as $\sup_{\tau\in[0,Kn]}\lvert\partial_{\tau}\Xi_{1}(\tau)\rvert\lesssim n\lVert g\rVert_{\infty}^{2}$ and $\sup_{\tau\in[0,Kn]}\lvert\partial_{\tau}\operatorname{\mathbb{E}}\Xi_{1}(\tau)\rvert\lesssim n\log n$, we deduce that with probability at least $1-(Kn/\varepsilon+1)e^{-t}$,

\begin{align*}
\sup_{\tau\in[0,Kn]}\lvert\Xi_{1}(\tau)-\operatorname{\mathbb{E}}\Xi_{1}(\tau)\rvert &\leq\max_{\tau\in\mathcal{S}_{\varepsilon}}\lvert\Xi_{1}(\tau)-\operatorname{\mathbb{E}}\Xi_{1}(\tau)\rvert+\sup_{\tau,\tau^{\prime}\in[0,Kn]:\lvert\tau-\tau^{\prime}\rvert\leq\varepsilon}\lvert\Xi_{1}(\tau)-\Xi_{1}(\tau^{\prime})\rvert\\
&\qquad\qquad+\sup_{\tau,\tau^{\prime}\in[0,Kn]:\lvert\tau-\tau^{\prime}\rvert\leq\varepsilon}\lvert\operatorname{\mathbb{E}}\Xi_{1}(\tau)-\operatorname{\mathbb{E}}\Xi_{1}(\tau^{\prime})\rvert\\
&\lesssim\sqrt{nt}+t+n(\log n+t)\varepsilon.
\end{align*}

From here the claim follows by the same arguments used in the proof of Lemma 8.3 above. ∎

9. Gaussian designs: Proof of Theorem 2.3

We assume without loss of generality that $\Sigma=\mathrm{diag}(\lambda_{1},\ldots,\lambda_{n})$, so $\mathsf{V}=I$ unless otherwise specified. Recall $\mathcal{H}_{\Sigma}=\operatorname{tr}(\Sigma^{-1})/n$.

9.1. Localization of the primal problem

Proposition 9.1.

Suppose $1/K\leq\phi^{-1}-\bm{1}_{\eta=0},\sigma_{\xi}^{2}\leq K$, and $\lVert\mu_{0}\rVert\vee\lVert\Sigma\rVert_{\operatorname{op}}\leq K$ for some $K>0$. Fix $M>1$, $\delta\in(0,1/2)$ and $\eta\geq 0$. On the event $\mathscr{E}_{0}(M)\cap\mathscr{E}_{1}(\delta)$, there exists some $C=C(K)>0$ such that for any deterministic choice of $(L_{w},L_{v})$ with

\[
L_{w}\wedge L_{v}\geq C\big\{1+\big(\lVert\Sigma^{-1}\rVert_{\operatorname{op}}M\bm{1}_{\phi^{-1}\geq 1+1/K}^{-1}\wedge\eta^{-1}\big)\cdot M^{2}\big\},
\]

we have $\min_{w\in B_{n}(L_{w})}H_{\eta}(w;L_{v})=\min_{w\in\mathbb{R}^{n}}H_{\eta}(w)$.

Proof.

Using the first-order optimality condition for the minimax problem

\[
\min_{w\in\mathbb{R}^{n}}H_{\eta}(w)=\min_{w\in\mathbb{R}^{n}}\max_{v\in\mathbb{R}^{m}}\bigg\{\frac{1}{\sqrt{n}}\langle v,Gw-\xi\rangle+F(w)-\frac{\eta\lVert v\rVert^{2}}{2}\bigg\}, \tag{9.1}
\]

any saddle point $(w_{\ast},v_{\ast})$ of (9.1) must satisfy $\nabla F(w_{\ast})=-\frac{1}{\sqrt{n}}G^{\top}v_{\ast}$ and $\frac{1}{\sqrt{n}}(Gw_{\ast}-\xi)=\eta v_{\ast}$, or equivalently,

\[
\begin{cases}
w_{\ast}=-\Sigma^{1/2}\mu_{0}+\frac{1}{n}\Sigma G^{\top}\big(\phi\check{\Sigma}+\eta I\big)^{-1}(G\Sigma^{1/2}\mu_{0}+\xi),\\
v_{\ast}=-\frac{1}{\sqrt{n}}\big(\phi\check{\Sigma}+\eta I\big)^{-1}\big(G\Sigma^{1/2}\mu_{0}+\xi\big).
\end{cases}
\]

Here recall $\check{\Sigma}=m^{-1}G\Sigma G^{\top}$. On the event $\mathscr{E}_{0}(M)$,

\[
\lVert(\phi\check{\Sigma}+\eta I)^{-1}\rVert_{\operatorname{op}}\lesssim_{K}\lVert\Sigma^{-1}\rVert_{\operatorname{op}}M\bm{1}_{\phi^{-1}\geq 1+1/K}^{-1}\wedge\eta^{-1}.
\]

So on $\mathscr{E}_{0}(M)\cap\mathscr{E}_{1}(\delta)$,

\[
\lVert w_{\ast}\rVert\vee\lVert v_{\ast}\rVert\lesssim_{K}1+(\lVert\Sigma^{-1}\rVert_{\operatorname{op}}M\bm{1}_{\phi^{-1}\geq 1+1/K}^{-1}\wedge\eta^{-1})M^{2}.
\]

This means that on the event $\mathscr{E}_{0}(M)\cap\mathscr{E}_{1}(\delta)$, for any $L_{w},L_{v}$ chosen as in the statement of the proposition,

\[
\min_{w\in\mathbb{R}^{n}}H_{\eta}(w)=\min_{w\in B_{n}(L_{w})}\max_{v\in B_{m}(L_{v})}\bigg\{\frac{1}{\sqrt{n}}\langle v,Gw-\xi\rangle+F(w)-\frac{\eta\lVert v\rVert^{2}}{2}\bigg\}.
\]

The proof is complete by recalling the definition of $H_{\eta}(\cdot;L_{v})$. ∎

9.2. Characterization of the Gordon cost optimum

Theorem 9.2.

Suppose the following hold for some $K>0$.

  • $1/K\leq\phi^{-1}\leq K$, $\lVert\mu_{0}\rVert\vee\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$.

  • Assumption B with $\sigma_{\xi}^{2}\in[1/K,K]$.

There exist some $C,C^{\prime}>1$ depending on $K$ such that for any deterministic choice of $L_{w},L_{v}\in[C,C^{2}]$, it holds for any $C^{\prime}\log(en)\leq t\leq n/C^{\prime}$, $\eta\in\Xi_{K}$ and $\xi\in\mathscr{E}_{1,\xi}(\sqrt{t/n})$,

\[
\operatorname{\mathbb{P}}^{\xi}\Big(\big\lvert\min_{w\in B_{n}(L_{w})}L_{\eta}(w;L_{v})-\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)\big\rvert\geq\sqrt{t/n}\Big)\leq Ce^{-t/C}.
\]

In the next subsection we will show that for large $L_{v}>0$, the map $w\mapsto L_{\eta}(w;L_{v})$ attains its global minimum in an $\ell_{2}$ ball of constant order radius (under $\mathcal{H}_{\Sigma}\lesssim 1$) with high probability. This means that although the initial localization radius for the primal optimization, which involves $\lVert\Sigma^{-1}\rVert_{\operatorname{op}}$, may be highly suboptimal, the Gordon objective can be further localized into an $\ell_{2}$ ball with constant order radius.

To prove Theorem 9.2, we shall first relate $\min_{w\in B_{n}(L_{w})}L_{\eta}(w;L_{v})$ to $\max_{\beta>0}\min_{\gamma>0}\mathsf{D}_{\eta,\pm}(\beta,\gamma)$ and its localized versions.

Proposition 9.3.

Suppose $1/K\leq\phi^{-1},\sigma_{\xi}^{2}\leq K$, and $\lVert\mu_{0}\rVert\vee\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$ for some $K>0$. There exists some constant $C=C(K)>1$ such that for any deterministic choice of $L_{w},L_{v}\in[C,C^{2}]$, on the event $\mathscr{E}_{1}(\delta)\cap\mathscr{E}_{\Delta,\Xi}(M)$ (defined in Proposition 8.2) with $\delta\in(0,1/C^{100})$ and $M\leq\sqrt{n}/C$, we have for any $\eta\in\Xi_{K}$,

\[
\max_{\beta>0}\min_{\gamma>0}\mathsf{D}_{\eta,-}(\beta,\gamma)\leq\min_{w\in B_{n}(L_{w})}L_{\eta}(w;L_{v})\leq\max_{\beta>0}\min_{\gamma>0}\mathsf{D}_{\eta,+}(\beta,\gamma),
\]

and the following localization holds:

\[
\max_{\beta>0}\min_{\gamma>0}\mathsf{D}_{\eta,\pm}(\beta,\gamma)=\max_{1/C\leq\beta\leq C}\min_{1/C\leq\gamma\leq C}\mathsf{D}_{\eta,\pm}(\beta,\gamma).
\]

Proof.

We write $g_{n}\equiv g/\sqrt{n}$ in the proof.

(Step 1). Fix any $L_{w},L_{v}>0$. We may compute

\begin{align*}
&\min_{w\in B_{n}(L_{w})}L_{\eta}(w;L_{v}) \tag{9.2}\\
&=\min_{w\in B_{n}(L_{w})}\max_{\beta\in[0,L_{v}]}\bigg\{\frac{\beta}{\sqrt{n}}\Big(\big\lVert\lVert w\rVert h-\xi\big\rVert-\langle g,w\rangle\Big)+F(w)-\frac{\eta\beta^{2}}{2}\bigg\}\\
&=\max_{\beta\in[0,L_{v}]}\min_{\gamma>0}\bigg\{\frac{\beta\gamma\lVert h\rVert^{2}}{2n}-\frac{\eta\beta^{2}}{2}+\min_{w\in B_{n}(L_{w})}\bigg(\frac{\beta}{2\gamma}\frac{\lVert\lVert w\rVert h-\xi\rVert^{2}}{\lVert h\rVert^{2}}-\langle w,\beta g_{n}\rangle+F(w)\bigg)\bigg\}.
\end{align*}

Here in the last line we used Sion’s min-max theorem to flip the order of minimum and maximum in $\min_{w\in B_{n}(L_{w})}\max_{\beta\in[0,L_{v}]}$. The minimum over $\gamma$ is achieved exactly at $\frac{\lVert\lVert w\rVert h-\xi\rVert/\lVert h\rVert}{\lVert h\rVert/\sqrt{n}}$, so when $\sigma_{-}^{2}\neq 0$, using the simple inequality

\[
\lVert w\rVert^{2}+\sigma_{-}^{2}\leq\lVert\lVert w\rVert h-\xi\rVert^{2}/\lVert h\rVert^{2}\leq\lVert w\rVert^{2}+\sigma_{+}^{2}, \tag{9.3}
\]

on the event $\mathscr{E}_{1}(\delta)$, we may further bound (9.2) as follows:

\begin{align*}
&\pm\min_{w\in B_{n}(L_{w})}L_{\eta}(w;L_{v})\leq\pm\max_{\beta\in[0,L_{v}]}\min_{\gamma>0}\\
&\qquad\bigg\{\frac{\beta\gamma\lVert h\rVert^{2}}{2n}-\frac{\eta\beta^{2}}{2}+\min_{w\in B_{n}(L_{w})}\bigg(\frac{\beta}{2\gamma}\big(\lVert w\rVert^{2}+\sigma_{\pm}^{2}(L_{w})\big)-\langle w,\beta g_{n}\rangle+F(w)\bigg)\bigg\}. \tag{9.4}
\end{align*}

We note that $\sigma_{\pm}^{2}$ depends on $L_{w}$, but this notational dependence will be dropped from now on for convenience.
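The $\gamma$-linearization used in Step 1 rests on the elementary fact that $\gamma\mapsto\beta\gamma c/2+\beta a^{2}/(2\gamma)$ is minimized at $\gamma=a/\sqrt{c}$ with value $\beta a\sqrt{c}$. The following sketch checks this numerically for $c=\lVert h\rVert^{2}/n$ and $a=\lVert\lVert w\rVert h-\xi\rVert/\lVert h\rVert$; the vectors `h`, `xi`, `w` and the value of `beta` are hypothetical toy inputs chosen only for illustration.

```python
import math

def norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in x))

# Hypothetical small instance: h, xi in R^m and w in R^n with m = 3, n = 4.
h = [1.0, -2.0, 0.5]
xi = [0.3, 0.1, -0.4]
w = [1.0, 0.0, -1.0, 2.0]
n, beta = len(w), 0.7

resid = norm([norm(w) * hj - xj for hj, xj in zip(h, xi)])  # || ||w|| h - xi ||
a2 = resid ** 2 / norm(h) ** 2                              # resid^2 / ||h||^2

def obj(gamma):
    """gamma-dependent part of the last line of (9.2) at fixed (beta, w)."""
    return beta * gamma * norm(h) ** 2 / (2 * n) + beta * a2 / (2 * gamma)

gamma_min = (resid / norm(h)) / (norm(h) / math.sqrt(n))    # claimed minimizer
value = obj(gamma_min)
target = beta * resid / math.sqrt(n)                        # term in the middle line of (9.2)

print(value, target)
print(obj(0.9 * gamma_min) > value, obj(1.1 * gamma_min) > value)
```

The two printed numbers should coincide up to rounding, and perturbing $\gamma$ in either direction should increase the objective, which is the identity that lets the norm $\lVert\lVert w\rVert h-\xi\rVert$ be replaced by a quadratic expression in $w$.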

(Step 2). Consider the minimax optimization problem in (9.2):

\begin{align*}
&\max_{\beta>0}\min_{\gamma>0,w\in\mathbb{R}^{n}}\bigg\{\frac{\beta\gamma\lVert h\rVert^{2}}{2n}-\frac{\eta\beta^{2}}{2}+\bigg(\frac{\beta}{2\gamma}\big(\lVert w\rVert^{2}+\sigma_{\pm}^{2}\big)-\langle w,\beta g_{n}\rangle+F(w)\bigg)\bigg\}\\
&=\max_{\beta>0}\min_{\gamma>0}\bigg\{\frac{\beta}{2}\bigg(\gamma\big(\phi e_{h}^{2}-e_{g}^{2}\big)+\frac{\sigma_{\pm}^{2}}{\gamma}\bigg)-\frac{\eta\beta^{2}}{2}+\mathsf{e}_{F}(\gamma g_{n};\gamma/\beta)\bigg\}. \tag{9.5}
\end{align*}

Any saddle point $(\beta_{n,\eta,\pm},\gamma_{n,\eta,\pm},w_{n,\eta,\pm})=(\beta_{n,\pm},\gamma_{n,\pm},w_{n,\pm})$ of the above program must satisfy the first-order optimality condition

\[
\begin{cases}
0=\frac{1}{2}\big(\gamma_{n,\pm}(\phi e_{h}^{2}-e_{g}^{2})+\frac{\sigma_{\pm}^{2}}{\gamma_{n,\pm}}\big)-\eta\beta_{n,\pm}+\partial_{\beta}\mathsf{e}_{F}\big(\gamma_{n,\pm}g_{n};\gamma_{n,\pm}/\beta_{n,\pm}\big),\\
0=\frac{\beta_{n,\pm}}{2}\big((\phi e_{h}^{2}-e_{g}^{2})-\frac{\sigma_{\pm}^{2}}{\gamma_{n,\pm}^{2}}\big)+\partial_{\gamma}\mathsf{e}_{F}\big(\gamma_{n,\pm}g_{n};\gamma_{n,\pm}/\beta_{n,\pm}\big),\\
w_{n,\pm}=\operatorname{\mathsf{prox}}_{F}\big(\gamma_{n,\pm}g_{n};\gamma_{n,\pm}/\beta_{n,\pm}\big).
\end{cases} \tag{9.6}
\]

Using the derivative formula in Lemma 7.3 and the form of $\operatorname{\mathsf{prox}}_{F}$ in Lemma 7.2, we may compute

\[
\begin{cases}
\partial_{\beta}\mathsf{e}_{F}(\gamma g_{n};\gamma/\beta)=\frac{1}{2\gamma}\Big(\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma;\gamma/\beta)-2\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma;\gamma/\beta)+\gamma^{2}e_{g}^{2}\Big),\\
\partial_{\gamma}\mathsf{e}_{F}(\gamma g_{n};\gamma/\beta)=\frac{\beta}{2\gamma^{2}}\Big(\gamma^{2}e_{g}^{2}-\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma;\gamma/\beta)\Big).
\end{cases} \tag{9.7}
\]

Plugging (9.7) into (9.6), the first-order optimality condition for $(\beta_{n,\pm},\gamma_{n,\pm})$ in the minimax program (9.2) is given by

\[
\begin{cases}
\big(\phi e_{h}^{2}-e_{g}^{2}\big)\gamma_{n,\pm}^{2}+\sigma_{\pm}^{2}=2\eta\cdot\gamma_{n,\pm}\beta_{n,\pm}-\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma_{n,\pm};\gamma_{n,\pm}/\beta_{n,\pm})\\
\qquad\qquad\qquad\qquad\qquad+2\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma_{n,\pm};\gamma_{n,\pm}/\beta_{n,\pm})-e_{g}^{2}\gamma_{n,\pm}^{2},\\
\big(\phi e_{h}^{2}-e_{g}^{2}\big)\gamma_{n,\pm}^{2}-\sigma_{\pm}^{2}=-e_{g}^{2}\gamma_{n,\pm}^{2}+\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma_{n,\pm};\gamma_{n,\pm}/\beta_{n,\pm}).
\end{cases}
\]

Equivalently,

{ϕeh2γn,±2=σ±2+𝖾𝗋𝗋(Σ,μ0)(γn,±;γn,±/βn,±),(ϕeh2ηγn,±/βn,±)γn,±2=𝖽𝗈𝖿(Σ,μ0)(γn,±;γn,±/βn,±).\displaystyle\begin{cases}\phi e_{h}^{2}\gamma_{n,\pm}^{2}=\sigma_{\pm}^{2}+\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma_{n,\pm};\gamma_{n,\pm}/\beta_{n,\pm}),\\ \big{(}\phi e_{h}^{2}-\frac{\eta}{\gamma_{n,\pm}/\beta_{n,\pm}}\big{)}\gamma_{n,\pm}^{2}=\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma_{n,\pm};\gamma_{n,\pm}/\beta_{n,\pm}).\end{cases} (9.8)
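For completeness, the elimination leading to (9.8) can be written out as follows. The shorthand $\gamma,\beta,\mathsf{err},\mathsf{dof}$ (suppressing subscripts and arguments) is introduced only for this sketch.

```latex
% Shorthand for this sketch only: \gamma=\gamma_{n,\pm}, \beta=\beta_{n,\pm},
% and \mathsf{err}, \mathsf{dof} are evaluated at (\gamma;\gamma/\beta).
\begin{align*}
\text{second equation:}\quad
&\phi e_{h}^{2}\gamma^{2}-\sigma_{\pm}^{2}=\mathsf{err}
\ \Longleftrightarrow\
\phi e_{h}^{2}\gamma^{2}=\sigma_{\pm}^{2}+\mathsf{err},\\
\text{first equation:}\quad
&\phi e_{h}^{2}\gamma^{2}+\sigma_{\pm}^{2}
=2\eta\gamma\beta-\mathsf{err}+2\,\mathsf{dof},\\
\text{sum of the two:}\quad
&2\phi e_{h}^{2}\gamma^{2}=2\eta\gamma\beta+2\,\mathsf{dof}
\ \Longleftrightarrow\
\Big(\phi e_{h}^{2}-\frac{\eta}{\gamma/\beta}\Big)\gamma^{2}=\mathsf{dof}.
\end{align*}
```

In both equations the terms $\mp e_{g}^{2}\gamma^{2}$ cancel between the two sides, which is why $e_{g}^{2}$ no longer appears in (9.8).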

Using the a priori estimates in Proposition 8.2, on the event $\mathscr{E}_{1}(\delta)\cap\mathscr{E}_{\Delta,\Xi}(M)$ we have $\gamma_{n,\pm}/\beta_{n,\pm}\asymp_{K}1$ and $\gamma_{n,\pm}^{2}\asymp_{K}1$. This implies that, on the same event,

γn,±K1,βn,±K1.\displaystyle\gamma_{n,\pm}\asymp_{K}1,\quad\beta_{n,\pm}\asymp_{K}1. (9.9)

Using the last equation of (9.6), we have

wn,±\displaystyle\lVert w_{n,\pm}\rVert =Σ1/2(Σ+γn,±βn,±I)1(γn,±βn,±μ0+γn,±Σ1/2gn)K1.\displaystyle=\bigg{\lVert}\Sigma^{1/2}\Big{(}\Sigma+\frac{\gamma_{n,\pm}}{\beta_{n,\pm}}I\Big{)}^{-1}\Big{(}-\frac{\gamma_{n,\pm}}{\beta_{n,\pm}}\mu_{0}+\gamma_{n,\pm}\Sigma^{1/2}g_{n}\Big{)}\bigg{\rVert}\lesssim_{K}1. (9.10)

In view of (9.9)-(9.10), by choosing Lw,Lv[C,C2]L_{w},L_{v}\in[C,C^{2}] for large enough C>0C>0, the constraints in the optimization in (9.2) can be dropped for free. ∎

Next we replace the random function 𝖣η,±\mathsf{D}_{\eta,\pm} in the above proposition by its deterministic counterpart 𝖣¯η\overline{\mathsf{D}}_{\eta} in their localized versions.

Proposition 9.4.

Suppose 1/Kϕ1,σξ2K1/K\leq\phi^{-1},\sigma_{\xi}^{2}\leq K, and μ0ΣopΣK\lVert\mu_{0}\rVert\vee\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K for some K>0K>0. There exist some C,C>1C,C^{\prime}>1 depending on KK such that for Lw[C,C2]L_{w}\in[C,C^{2}], δ(0,1/C100)\delta\in(0,1/C^{100}), ξ1,ξ(δ)\xi\in\mathscr{E}_{1,\xi}(\delta) and tClog(en)t\geq C^{\prime}\log(en),

ξ[supηΞK|max1/CβCmin1/CγC𝖣η,±(β,γ)max1/CβCmin1/CγC𝖣¯η(β,γ)|\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{[}\sup_{\eta\in\Xi_{K}}\big{|}\max_{1/C\leq\beta\leq C}\min_{1/C\leq\gamma\leq C}\mathsf{D}_{\eta,\pm}(\beta,\gamma)-\max_{1/C\leq\beta\leq C}\min_{1/C\leq\gamma\leq C}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)\big{|}
C(t/n+t/n+δ)]Cet/C+ξ(1,0(δ)c).\displaystyle\qquad\geq C\big{(}\sqrt{t/n}+t/n+\delta\big{)}\Big{]}\leq Ce^{-t/C}+\operatorname{\mathbb{P}}^{\xi}\big{(}\mathscr{E}_{1,0}(\delta)^{c}\big{)}.
Proof.

In the proof, we write gng/ng_{n}\equiv g/\sqrt{n}. All the constants in ,,\lesssim,\gtrsim,\asymp and 𝒪\mathcal{O} below may depend on KK.

(Step 1). We first prove the following: on the event $\mathscr{E}_{1}(\delta)$, for any $C_{0}>1$ and $\#\in\{\gamma,\tau\}$,

\begin{align}
\sup_{(\gamma,\tau)\in[1/C_{0},C_{0}]^{2}}\lvert\partial_{\#}\mathsf{e}_{F}(\gamma g_{n};\tau)\rvert\vee\lvert\partial_{\#}\operatorname{\mathbb{E}}\mathsf{e}_{F}(\gamma g_{n};\tau)\rvert\lesssim 1. \tag{9.11}
\end{align}

To this end, with μ^(Σ,μ0)𝗌𝖾𝗊,y(Σ,μ0)𝗌𝖾𝗊\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}},y_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}} written as μ^,y\widehat{\mu},y, and using γμ^=(Σ+τI)1Σ1/2gn\partial_{\gamma}\widehat{\mu}=(\Sigma+\tau I)^{-1}\Sigma^{1/2}g_{n}, γy=gn\partial_{\gamma}y=g_{n}, τμ^=(Σ+τI)2Σ1/2(Σ1/2μ0+γgn)\partial_{\tau}\widehat{\mu}=-\big{(}\Sigma+\tau I\big{)}^{-2}\Sigma^{1/2}\big{(}\Sigma^{1/2}\mu_{0}+\gamma g_{n}\big{)},

γ𝖾F(γgn;τ)\displaystyle\partial_{\gamma}\mathsf{e}_{F}(\gamma g_{n};\tau) =τ1Σ1/2μ^y,Σ1/2γμ^γy+μ^,γμ^,\displaystyle=\tau^{-1}\big{\langle}\Sigma^{1/2}\widehat{\mu}-y,\Sigma^{1/2}\partial_{\gamma}\widehat{\mu}-\partial_{\gamma}y\big{\rangle}+\big{\langle}\widehat{\mu},\partial_{\gamma}\widehat{\mu}\big{\rangle},
τ𝖾F(γgn;τ)\displaystyle\partial_{\tau}\mathsf{e}_{F}(\gamma g_{n};\tau) =12τ2Σ1/2μ^y2+1τΣ1/2μ^y,Σ1/2τμ^+μ^,τμ^,\displaystyle=-\frac{1}{2\tau^{2}}\lVert\Sigma^{1/2}\widehat{\mu}-y\rVert^{2}+\frac{1}{\tau}\big{\langle}\Sigma^{1/2}\widehat{\mu}-y,\Sigma^{1/2}\partial_{\tau}\widehat{\mu}\big{\rangle}+\langle\widehat{\mu},\partial_{\tau}\widehat{\mu}\rangle,

on the event 1(δ)\mathscr{E}_{1}(\delta), we may estimate |γ𝖾F(γgn;τ)||τ𝖾F(γgn;τ)|1\lvert\partial_{\gamma}\mathsf{e}_{F}(\gamma g_{n};\tau)\rvert\vee\lvert\partial_{\tau}\mathsf{e}_{F}(\gamma g_{n};\tau)\rvert\lesssim 1. A similar estimate applies to the expectation versions, proving (9.11).

(Step 2). Next we show that for any C0>1C_{0}>1, there exists C1>0C_{1}>0 such that for tC1log(en)t\geq C_{1}\log(en),

(sup(γ,τ)[C01,C0]2|(id𝔼)𝖾F(γgn;τ)|C1(t/n+t/n),1(δ))C1et/C1.\displaystyle\operatorname{\mathbb{P}}\Big{(}\sup_{(\gamma,\tau)\in[C_{0}^{-1},C_{0}]^{2}}\big{\lvert}(\mathrm{id}-\operatorname{\mathbb{E}})\mathsf{e}_{F}(\gamma g_{n};\tau)\big{\rvert}\geq C_{1}\big{(}\sqrt{t/n}+t/n\big{)},\mathscr{E}_{1}(\delta)\Big{)}\leq C_{1}e^{-t/C_{1}}. (9.12)

To prove the claim, we fix ε>0\varepsilon>0 to be chosen later, and take an ε\varepsilon-net 𝒮(ε)\mathcal{S}(\varepsilon) for [1/C0,C0][1/C_{0},C_{0}]. Then |𝒮(ε)|C0/ε+1\lvert\mathcal{S}(\varepsilon)\rvert\leq C_{0}/\varepsilon+1. So on the event 1(δ)\mathscr{E}_{1}(\delta), using the estimate in (9.11) and a union bound via the pointwise concentration inequality in Proposition 7.4, for t1t\geq 1, with probability at least 1Cε2et/C1-C\varepsilon^{-2}e^{-t/C},

sup(γ,τ)[C01,C0]2|(id𝔼)𝖾F(γgn;τ)|\displaystyle\sup_{(\gamma,\tau)\in[C_{0}^{-1},C_{0}]^{2}}\big{\lvert}(\mathrm{id}-\operatorname{\mathbb{E}})\mathsf{e}_{F}(\gamma g_{n};\tau)\big{\rvert} supγ,τ𝒮(ε)|(id𝔼)𝖾F(γgn;τ)|+Cεtn+tn+ε.\displaystyle\lesssim\sup_{\gamma,\tau\in\mathcal{S}(\varepsilon)}\big{\lvert}(\mathrm{id}-\operatorname{\mathbb{E}})\mathsf{e}_{F}(\gamma g_{n};\tau)\big{\rvert}+C\varepsilon\lesssim\sqrt{\frac{t}{n}}+\frac{t}{n}+\varepsilon.

Here in the last inequality we used Lemma 7.2 to estimate sup(γ,τ)v2(γ,τ)sup(γ,τ)v2(γ,τ)𝔼𝖾F(γgn;τ)1\sup_{(\gamma,\tau)}v^{2}(\gamma,\tau)\vee\sup_{(\gamma,\tau)}v^{2}(\gamma,\tau)\operatorname{\mathbb{E}}\mathsf{e}_{F}(\gamma g_{n};\tau)\lesssim 1, where v2(γ,τ)v^{2}(\gamma,\tau) is defined in Proposition 7.4. The claim (9.12) follows by choosing εt/n+t/n\varepsilon\equiv\sqrt{t/n}+t/n and some calculations.
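The "some calculations" behind the choice of $\varepsilon$ are elementary; a sketch is given below. The constant $C$ is the one appearing in the probability bound $1-C\varepsilon^{-2}e^{-t/C}$ above, and the threshold $2C\log n$ is one convenient choice, not necessarily the one intended by the original argument.

```latex
\begin{align*}
\varepsilon=\sqrt{t/n}+t/n\ \geq\ n^{-1/2}\quad(t\geq 1)
&\quad\Longrightarrow\quad \varepsilon^{-2}\leq n,\\
C\varepsilon^{-2}e^{-t/C}\ \leq\ C\exp\big(\log n-t/C\big)
&\ \leq\ Ce^{-t/(2C)}
\qquad\text{whenever } t\geq 2C\log n,\\
\sqrt{t/n}+t/n+\varepsilon
&\ =\ 2\big(\sqrt{t/n}+t/n\big).
\end{align*}
```

Taking $C_{1}$ larger than $2C$ and than the implied constant in the supremum bound then gives (9.12) for $t\geq C_{1}\log(en)$.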

(Step 3). By (9.12), for tClog(en)t\geq C\log(en), on the event 1(δ)\mathscr{E}_{1}(\delta), it holds with probability at least 1C2et/C21-C_{2}e^{-t/C_{2}} that

max1/CβCmin1/CγC𝖣η,±(β,γ)\displaystyle\max_{1/C\leq\beta\leq C}\min_{1/C\leq\gamma\leq C}\mathsf{D}_{\eta,\pm}(\beta,\gamma)
=max1/CβCmin1/CγC{β2(γ(ϕeh2eg2)+σ±2γ)ηβ22+𝖾F(γgn;γ/β)}\displaystyle=\max_{1/C\leq\beta\leq C}\min_{1/C\leq\gamma\leq C}\bigg{\{}\frac{\beta}{2}\bigg{(}\gamma\big{(}\phi e_{h}^{2}-e_{g}^{2}\big{)}+\frac{\sigma_{\pm}^{2}}{\gamma}\bigg{)}-\frac{\eta\beta^{2}}{2}+\mathsf{e}_{F}(\gamma g_{n};\gamma/\beta)\bigg{\}}
=max1/CβCmin1/CγC𝖣¯η(β,γ)+𝒪(t/n+t/n+δ).\displaystyle=\max_{1/C\leq\beta\leq C}\min_{1/C\leq\gamma\leq C}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+\mathcal{O}\big{(}\sqrt{t/n}+t/n+\delta\big{)}.

The estimate in 𝒪\mathcal{O} is uniform in ηΞK\eta\in\Xi_{K}, so the claim follows. ∎

Finally we delocalize the range constraints for β,γ\beta,\gamma in the deterministic minimax problem with 𝖣¯η\overline{\mathsf{D}}_{\eta} in the above proposition.

Proposition 9.5.

Suppose 1/Kϕ1,σξ2K1/K\leq\phi^{-1},\sigma_{\xi}^{2}\leq K, and μ0ΣopΣK\lVert\mu_{0}\rVert\vee\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K for some K>0K>0. There exists some C=C(K)>1C=C(K)>1 such that for any ηΞK\eta\in\Xi_{K},

maxβ>0minγ>0𝖣¯η(β,γ)=max1/CβCmin1/CγC𝖣¯η(β,γ).\displaystyle\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)=\max_{1/C\leq\beta\leq C}\min_{1/C\leq\gamma\leq C}\overline{\mathsf{D}}_{\eta}(\beta,\gamma).

Consequently,

|maxβ>0minγ>0𝖣¯η(β,γ)maxβ>0minγ>0𝖣¯0(β,γ)|Cη.\displaystyle\big{\lvert}\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)-\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{0}(\beta,\gamma)\big{\rvert}\leq C\eta. (9.13)
Proof.

The proof is essentially a deterministic version of Step 2 in the proof of Proposition 9.3; we give some details below. We write $g_{n}\equiv g/\sqrt{n}$. First, using calculations similar to those leading to (9.7),

{β𝔼𝖾F(γgn;γ/β)=12γ(𝔼𝖾𝗋𝗋(Σ,μ0)(γ;γ/β)2𝔼𝖽𝗈𝖿(Σ,μ0)(γ;γ/β)+γ2),γ𝔼𝖾F(γgn;γ/β)=β2γ2(γ2𝔼𝖾𝗋𝗋(Σ,μ0)(γ;γ/β)).\displaystyle\begin{cases}\partial_{\beta}\operatorname{\mathbb{E}}\mathsf{e}_{F}(\gamma g_{n};{\gamma}/{\beta})=\frac{1}{2\gamma}\Big{(}\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma;\gamma/\beta)-2\operatorname{\mathbb{E}}\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma;\gamma/\beta)+\gamma^{2}\Big{)},\\ \partial_{\gamma}\operatorname{\mathbb{E}}\mathsf{e}_{F}(\gamma g_{n};{\gamma}/{\beta})=\frac{\beta}{2\gamma^{2}}\Big{(}\gamma^{2}-\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma;\gamma/\beta)\Big{)}.\end{cases}

Then the first-order optimality condition for (β,γ)(\beta_{\ast},\gamma_{\ast}) to be the saddle point of maxβ>0minγ>0𝖣¯η(β,γ)\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma), i.e., a deterministic version of (9.8), is given by

{ϕγ2=σξ2+𝔼𝖾𝗋𝗋(Σ,μ0)(γ;γ/β),(ϕηγ/β)γ2=𝔼𝖽𝗈𝖿(Σ,μ0)(γ;γ/β).\displaystyle\begin{cases}\phi\gamma_{\ast}^{2}=\sigma_{\xi}^{2}+\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma_{\ast};\gamma_{\ast}/\beta_{\ast}),\\ \big{(}\phi-\frac{\eta}{\gamma_{\ast}/\beta_{\ast}}\big{)}\gamma_{\ast}^{2}=\operatorname{\mathbb{E}}\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma_{\ast};\gamma_{\ast}/\beta_{\ast}).\end{cases}

Finally, using the a priori estimates in Proposition 8.1, we obtain a deterministic analogue of (9.9), namely $\gamma_{\ast}\asymp_{K}1$ and $\beta_{\ast}\asymp_{K}1$. The claimed localization follows. The continuity estimate (9.13) follows from the definition of $\overline{\mathsf{D}}_{\eta}$ and the proven localization. ∎
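The continuity estimate (9.13) can be sketched as follows. Two assumptions are made here and are not verified against the definitions: that $\overline{\mathsf{D}}_{\eta}$ depends on $\eta$ only through the term $-\eta\beta^{2}/2$ (as is the case for $\mathsf{D}_{\eta,\pm}$ in the display of Step 3 in the proof of Proposition 9.4), and that $\eta\geq 0$ with the localization valid at both $\eta$ and $0$.

```latex
\begin{align*}
\sup_{1/C\leq\beta,\gamma\leq C}
\big\lvert\overline{\mathsf{D}}_{\eta}(\beta,\gamma)-\overline{\mathsf{D}}_{0}(\beta,\gamma)\big\rvert
&=\sup_{1/C\leq\beta\leq C}\frac{\eta\beta^{2}}{2}
\ \leq\ \frac{C^{2}}{2}\,\eta,\\
\Big\lvert\max_{1/C\leq\beta\leq C}\min_{1/C\leq\gamma\leq C}\overline{\mathsf{D}}_{\eta}
-\max_{1/C\leq\beta\leq C}\min_{1/C\leq\gamma\leq C}\overline{\mathsf{D}}_{0}\Big\rvert
&\leq\sup_{1/C\leq\beta,\gamma\leq C}
\big\lvert\overline{\mathsf{D}}_{\eta}-\overline{\mathsf{D}}_{0}\big\rvert
\ \leq\ \frac{C^{2}}{2}\,\eta.
\end{align*}
```

The second line uses that a max-min over a fixed domain is $1$-Lipschitz with respect to the supremum norm of the objective; the factor $C^{2}/2$ is absorbed into the constant in (9.13).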

Proof of Theorem 9.2.

By Propositions 9.3, 9.4 and 9.5, there exist C,C>0C,C^{\prime}>0 such that for any δ(0,1/C100)\delta\in(0,1/C^{100}), Mn/CM\leq\sqrt{n}/C, tClog(en)t\geq C^{\prime}\log(en), ξ1,ξ(δ)\xi\in\mathscr{E}_{1,\xi}(\delta) and ηΞK\eta\in\Xi_{K},

ξ[|minwBn(Lw)Lη(w;Lv)maxβ>0minγ>0𝖣¯η(β,γ)|C(t/n+t/n+δ)]\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{[}\big{|}\min_{w\in B_{n}(L_{w})}L_{\eta}(w;L_{v})-\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)\big{|}\geq C\big{(}\sqrt{t/n}+t/n+\delta\big{)}\Big{]}
Cet/C+ξ(1,0(δ)c)+(Δ,Ξ(M)c).\displaystyle\leq Ce^{-t/C}+\operatorname{\mathbb{P}}^{\xi}\big{(}\mathscr{E}_{1,0}(\delta)^{c}\big{)}+\operatorname{\mathbb{P}}(\mathscr{E}_{\Delta,\Xi}(M)^{c}).

The claim now follows from the concentration estimates in Lemmas 7.7, 8.3 and 8.4, by choosing Mn/CM\equiv\sqrt{n}/C and δC(t/n+t/n)\delta\equiv C(\sqrt{t/n}+t/n), which is valid in the regime tn/C0t\leq n/C_{0} for large C0C_{0}. ∎

9.3. Locating the global minimizer of the Gordon objective

With (γη,,τη,)(\gamma_{\eta,\ast},\tau_{\eta,\ast}) denoting the unique solution to the system of equations (2.1), let

wη,𝗉𝗋𝗈𝗑F(γη,g/n;τη,)=Σ1/2(μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,)μ0).\displaystyle w_{\eta,\ast}\equiv\operatorname{\mathsf{prox}}_{F}\big{(}\gamma_{\eta,\ast}g/\sqrt{n};\tau_{\eta,\ast}\big{)}=\Sigma^{1/2}\big{(}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})-\mu_{0}\big{)}. (9.14)

For any ε>0\varepsilon>0, let the exceptional set be defined as

Dη;ε(𝗀){wn:|𝗀(w)𝔼𝗀(wη,)|ε}.\displaystyle D_{\eta;\varepsilon}(\mathsf{g})\equiv\big{\{}w\in\mathbb{R}^{n}:\lvert\mathsf{g}(w)-\operatorname{\mathbb{E}}\mathsf{g}(w_{\eta,\ast})\rvert\geq\varepsilon\big{\}}. (9.15)
Theorem 9.6.

Suppose the following hold for some K>0K>0.

  • 1/Kϕ1K1/K\leq\phi^{-1}\leq K, μ0ΣopΣK\lVert\mu_{0}\rVert\vee\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K.

  • Assumption B holds with σξ2[1/K,K]\sigma_{\xi}^{2}\in[1/K,K].

Fix any 𝗀:n\mathsf{g}:\mathbb{R}^{n}\to\mathbb{R} that is 11-Lipschitz with respect to Σ1\lVert\cdot\rVert_{\Sigma^{-1}}. There exist constants C,C>10C,C^{\prime}>10 depending on KK such that for Lw,Lv[C,C2]L_{w},L_{v}\in[C,C^{2}], Clog(en)tn/CC^{\prime}\log(en)\leq t\leq n/C^{\prime}, ξ1,ξ(t/n)\xi\in\mathscr{E}_{1,\xi}(\sqrt{t/n}) and ηΞK\eta\in\Xi_{K},

ξ(minwDη;C(t/n)1/4(𝗀)Bn(Lw)Lη(w;Lv)maxβ>0minγ>0𝖣¯η(β,γ)+t/n)Cet/C.\displaystyle\operatorname{\mathbb{P}}^{\xi}\bigg{(}\min_{w\in D_{\eta;C(t/n)^{1/4}}(\mathsf{g})\cap B_{n}(L_{w})}L_{\eta}(w;L_{v})\leq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+\sqrt{t/n}\bigg{)}\leq Ce^{-t/C}.

Roughly speaking, the above theorem will be proved by approximating $L_{\eta}$ both from above and below by nicer, strongly convex surrogate functions whose minimizers can be directly located. We may then relate the minimizer of $L_{\eta}$ to those of the surrogate functions.

We first formally define these surrogate functions. For Lw>0,Lv>0L_{w}>0,L_{v}>0, let

Lη,±(w;Lv)\displaystyle L_{\eta,\pm}(w;L_{v}) maxβ[0,Lv]{βn(hw2+σ±2(Lw)g,w)ηβ22+F(w)}.\displaystyle\equiv\max_{\beta\in[0,L_{v}]}\bigg{\{}\frac{\beta}{\sqrt{n}}\Big{(}\lVert h\rVert\sqrt{\lVert w\rVert^{2}+\sigma_{\pm}^{2}(L_{w})}-\langle g,w\rangle\Big{)}-\frac{\eta\beta^{2}}{2}+F(w)\bigg{\}}. (9.16)

Again we omit notational dependence of Lη,±L_{\eta,\pm} on LwL_{w} for simplicity.

The following lemma provides uniform (bracketing) approximation of LηL_{\eta} via Lη,±L_{\eta,\pm} on compact sets.

Lemma 9.7.

Fix Lv>0L_{v}>0. The following hold when σ2(Lw)0\sigma_{-}^{2}(L_{w})\neq 0.

  1. (1)

    For any wBn(Lw)w\in B_{n}(L_{w}), Lη,(w;Lv)Lη(w;Lv)Lη,+(w;Lv)L_{\eta,-}(w;L_{v})\leq L_{\eta}(w;L_{v})\leq L_{\eta,+}(w;L_{v}).

  2. (2)

    For any Lw>0L_{w}>0,

    supwn|Lη,+(w;Lv)Lη,(w;Lv)|4ehσmLvLw|h,ξ|h2.\displaystyle\sup_{w\in\mathbb{R}^{n}}\lvert L_{\eta,+}(w;L_{v})-L_{\eta,-}(w;L_{v})\rvert\leq\frac{4e_{h}}{\sigma_{m}}\cdot L_{v}L_{w}\frac{\lvert\langle h,\xi\rangle\rvert}{\lVert h\rVert^{2}}.
Proof.

The first claim (1) follows by the definition of σ±2(Lw)\sigma_{\pm}^{2}(L_{w}) in (6.4) and the simple inequality (9.3). For (2), note that

|Lη,+(w;Lv)Lη,(w;Lv)|\displaystyle\big{\lvert}L_{\eta,+}(w;L_{v})-L_{\eta,-}(w;L_{v})\big{\rvert} Lveh|w2+σ+2(Lw)w2+σ2(Lw)|\displaystyle\leq L_{v}e_{h}\cdot\big{\lvert}\sqrt{\lVert w\rVert^{2}+\sigma_{+}^{2}(L_{w})}-\sqrt{\lVert w\rVert^{2}+\sigma_{-}^{2}(L_{w})}\big{\rvert}
Lveh|σ+2(Lw)σ2(Lw)|σ+(Lw)+σ(Lw)4ehσmLvLw|h,ξ|h2,\displaystyle\leq L_{v}e_{h}\cdot\frac{\lvert\sigma_{+}^{2}(L_{w})-\sigma_{-}^{2}(L_{w})\rvert}{\sigma_{+}(L_{w})+\sigma_{-}(L_{w})}\leq\frac{4e_{h}}{\sigma_{m}}\cdot L_{v}L_{w}\frac{\lvert\langle h,\xi\rangle\rvert}{\lVert h\rVert^{2}},

as desired. ∎
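The middle inequality in the proof of (2) is the usual rationalization of a difference of square roots. A short sketch, with the shorthand $a=\lVert w\rVert^{2}$ and $s_{\pm}=\sigma_{\pm}^{2}(L_{w})$ introduced only here and assuming $s_{+}\geq s_{-}>0$:

```latex
\begin{align*}
\sqrt{a+s_{+}}-\sqrt{a+s_{-}}
&=\frac{s_{+}-s_{-}}{\sqrt{a+s_{+}}+\sqrt{a+s_{-}}}
\ \leq\ \frac{s_{+}-s_{-}}{\sqrt{s_{+}}+\sqrt{s_{-}}}
\qquad(a\geq 0).
\end{align*}
% The final bound in the proof then requires, from the definitions in (6.4)
% (not reproduced in this section), that
%   s_{+}-s_{-} \le 4L_{w}\lvert\langle h,\xi\rangle\rvert/\lVert h\rVert^{2}
% and \sqrt{s_{+}}+\sqrt{s_{-}} \ge \sigma_{m}.
```

Since the right-hand side does not depend on $a$, the bound holds uniformly over $w\in\mathbb{R}^{n}$, as stated in (2).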

Next, we will study the properties of the global minimizers for Lη,±L_{\eta,\pm}.

Proposition 9.8.

Suppose $1/K\leq\phi^{-1},\sigma_{\xi}^{2}\leq K$, and $\lVert\mu_{0}\rVert\vee\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$ for some $K>0$. There exists some constant $C=C(K)>1$ such that for any deterministic choice of $L_{w},L_{v}\in[C,C^{2}]$, on the event $\mathscr{E}_{1}(\delta)\cap\mathscr{E}_{\Delta,\Xi}(M)$ (defined in Proposition 8.2) with $\delta\in(0,1/C^{100})$ and $M\leq\sqrt{n}/C$, for any $\eta\in\Xi_{K}$, the maps $w\mapsto L_{\eta,\pm}(w;L_{v})$ attain their global minima at $w_{n,\eta,\pm}$ with $\lVert w_{n,\eta,\pm}\rVert_{\Sigma^{-1}}\leq C$. Moreover, $\lVert w_{n,\eta,\pm}-w_{\eta,\ast}\rVert_{\Sigma^{-1}}\leq C(M/\sqrt{n}+\delta)^{1/2}$.

Proof.

Note that the optimization problem

minwnLη,±(w;Lv)\displaystyle\min_{w\in\mathbb{R}^{n}}L_{\eta,\pm}(w;L_{v})
=minwnmaxβ[0,Lv]{βn(hw2+σ±2(Lw)g,w)ηβ22+F(w)}\displaystyle=\min_{w\in\mathbb{R}^{n}}\max_{\beta\in[0,L_{v}]}\bigg{\{}\frac{\beta}{\sqrt{n}}\Big{(}\lVert h\rVert\sqrt{\lVert w\rVert^{2}+\sigma_{\pm}^{2}(L_{w})}-\langle g,w\rangle\Big{)}-\frac{\eta\beta^{2}}{2}+F(w)\bigg{\}}
=()maxβ[0,Lv]minwn{βn(hw2+σ±2(Lw)g,w)ηβ22+F(w)}\displaystyle\stackrel{{\scriptstyle(\ast)}}{{=}}\max_{\beta\in[0,L_{v}]}\min_{w\in\mathbb{R}^{n}}\bigg{\{}\frac{\beta}{\sqrt{n}}\Big{(}\lVert h\rVert\sqrt{\lVert w\rVert^{2}+\sigma_{\pm}^{2}(L_{w})}-\langle g,w\rangle\Big{)}-\frac{\eta\beta^{2}}{2}+F(w)\bigg{\}}
=maxβ[0,Lv]minγ>0,wn{βγh22nηβ22+(β2γ(w2+σ±2(Lw))w,βng+F(w))}.\displaystyle=\max_{\beta\in[0,L_{v}]}\min_{\gamma>0,w\in\mathbb{R}^{n}}\bigg{\{}\frac{\beta\gamma\lVert h\rVert^{2}}{2n}-\frac{\eta\beta^{2}}{2}+\bigg{(}\frac{\beta}{2\gamma}\big{(}\lVert w\rVert^{2}+\sigma_{\pm}^{2}(L_{w})\big{)}-\bigg{\langle}w,\frac{\beta}{\sqrt{n}}g\bigg{\rangle}+F(w)\bigg{)}\bigg{\}}.

Here in $(\ast)$ we used Sion’s min-max theorem to exchange minimum and maximum, as the maximum is taken over a compact set. The above minimax problem differs from (9.2) only in its range constraint on $\beta$. As proven in (9.9), all solutions $\beta_{n,\eta,\pm}$ to the unconstrained minimax problem (9.2) must satisfy $\beta_{n,\eta,\pm}\leq C$ on the event $\mathscr{E}_{1}(\delta)\cap\mathscr{E}_{\Delta,\Xi}(M)$. So on this event, for the choice $L_{w},L_{v}\in[C,C^{2}]$ for some large $C>0$, $\min_{w\in\mathbb{R}^{n}}L_{\eta,\pm}(w;L_{v})$ exactly corresponds to (9.2), whose minimizers $w_{n,\eta,\pm}$ admit the a priori estimate (9.10) (with minor modifications that change $\lVert\cdot\rVert$ to the stronger estimate in $\lVert\cdot\rVert_{\Sigma^{-1}}$).

Next, for the error bound, using the last equation in (9.6) and the definition of wη,w_{\eta,\ast} in (9.14), along with the estimates in Proposition 8.2, we have

wn,η,±wη,Σ12\displaystyle\lVert w_{n,\eta,\pm}-w_{\eta,\ast}\rVert_{\Sigma^{-1}}^{2} =μ^(Σ,μ0)𝗌𝖾𝗊(γn,η,±;τn,η,±)μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,)2\displaystyle=\big{\lVert}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{n,\eta,\pm};\tau_{n,\eta,\pm})-\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})\big{\rVert}^{2}
K|γn,η,±γη,||τn,η,±τη,|C(M/n+δ),\displaystyle\lesssim_{K}\lvert\gamma_{n,\eta,\pm}-\gamma_{\eta,\ast}\rvert\vee\lvert\tau_{n,\eta,\pm}-\tau_{\eta,\ast}\rvert\leq C(M/\sqrt{n}+\delta),

as desired. ∎

Finally we shall relate back to the global minimizer of LηL_{\eta}. We note that the proposition below by itself is not formally used in the proof of Theorem 9.6, but will turn out to be useful in the proof of Theorem 2.3 ahead.

Proposition 9.9.

Suppose the conditions in Theorem 9.6 hold for some K>0K>0. There exist constants C,C>1C,C^{\prime}>1 depending on KK such that for Lw,Lv[C,C2]L_{w},L_{v}\in[C,C^{2}], Clog(en)tn/CC^{\prime}\log(en)\leq t\leq n/C^{\prime}, ξ1,ξ(t/n)\xi\in\mathscr{E}_{1,\xi}(\sqrt{t/n}) and ηΞK\eta\in\Xi_{K},

ξ(The map wLη(w;Lv) attains its global minimum at wn,η with wn,ηΣ1C,\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\hbox{The map $w\mapsto L_{\eta}(w;L_{v})$ attains its global minimum at $w_{n,\eta}$ with $\lVert w_{n,\eta}\rVert_{\Sigma^{-1}}\leq C$,}
and wn,ηwη,Σ1C(t/n)1/4)1Cet/C.\displaystyle\qquad\hbox{and $\lVert w_{n,\eta}-w_{\eta,\ast}\rVert_{\Sigma^{-1}}\leq C(t/n)^{1/4}$}\Big{)}\geq 1-Ce^{-t/C}.
Proof.

Let us fix ξ1,ξ(t/n)\xi\in\mathscr{E}_{1,\xi}(\sqrt{t/n}).

(Step 1). We first prove the a priori estimate for $\lVert w_{n,\eta}\rVert_{\Sigma^{-1}}$. To this end, for large enough $C_{0},C_{0}^{\prime}>0$ depending on $K$, choosing $L_{w}\equiv C_{0}$, $\delta\equiv 1/C_{0}^{100}$ and $M\equiv\delta\sqrt{n}$ in Proposition 9.8, it follows that

\begin{align}
\operatorname{\mathbb{P}}^{\xi}\Big(E_{1}\equiv\Big\{&\lVert w_{n,\eta,\pm}\rVert_{\Sigma^{-1}}\vee\lVert w_{n,\eta,\pm}\rVert\leq C_{0}/2,\nonumber\\
&L_{\eta,\pm}(w_{n,\eta,\pm};L_{v})=\hbox{(9.2)}\Big\}\Big)\geq 1-C_{0}e^{-n/C_{0}}. \tag{9.17}
\end{align}

On the other hand, choosing δt/n\delta\equiv\sqrt{t/n} with C0log(en)tn/C0C_{0}^{\prime}\log(en)\leq t\leq n/C_{0}^{\prime} leads to

ξ(E2(t){wn,η,±wη,Σ1C0(t/n)1/4})1C0et/C0.\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}E_{2}(t)\equiv\Big{\{}\lVert w_{n,\eta,\pm}-w_{\eta,\ast}\rVert_{\Sigma^{-1}}\leq C_{0}(t/n)^{1/4}\Big{\}}\Big{)}\geq 1-C_{0}e^{-t/C_{0}}. (9.18)

On E1E_{1}, we may characterize the value of Lη,±(wn,η,±;Lv)L_{\eta,\pm}(w_{n,\eta,\pm};L_{v}) by applying Propositions 9.3-9.5: for C0log(en)tn/C0C_{0}^{\prime}\log(en)\leq t\leq n/C_{0}^{\prime},

ξ(E3(t){|Lη,±(wn,η,±;Lv)maxβ>0minγ>0𝖣¯η(β,γ)|C0t/n})\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}E_{3}(t)\equiv\Big{\{}\big{\lvert}L_{\eta,\pm}(w_{n,\eta,\pm};L_{v})-\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)\big{\rvert}\leq C_{0}\sqrt{t/n}\Big{\}}\Big{)}
1C0et/C0.\displaystyle\geq 1-C_{0}e^{-t/C_{0}}. (9.19)

Note by the strong convexity of Lη,±(;Lv)L_{\eta,\pm}(\cdot;L_{v}) with respect to Σ1\lVert\cdot\rVert_{\Sigma^{-1}}, we have

infwn:wwn,η,±Σ16C0(t/n)1/4Lη,±(w;Lv)Lη,±(wn,η,±;Lv)3C0t/n.\displaystyle\inf_{w\in\mathbb{R}^{n}:\lVert w-w_{n,\eta,\pm}\rVert_{\Sigma^{-1}}\geq\sqrt{6C_{0}}(t/n)^{1/4}}L_{\eta,\pm}(w;L_{v})-L_{\eta,\pm}(w_{n,\eta,\pm};L_{v})\geq 3C_{0}\sqrt{t/n}.
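The constants in the last display are consistent with a quadratic growth bound of modulus one. The sketch below assumes, as the constants suggest but as is not stated explicitly in this section, that $L_{\eta,\pm}(\cdot;L_{v})$ is $1$-strongly convex with respect to $\lVert\cdot\rVert_{\Sigma^{-1}}$, so that the first line holds at its global minimizer $w_{n,\eta,\pm}$.

```latex
\begin{align*}
L_{\eta,\pm}(w;L_{v})-L_{\eta,\pm}(w_{n,\eta,\pm};L_{v})
&\geq\frac{1}{2}\lVert w-w_{n,\eta,\pm}\rVert_{\Sigma^{-1}}^{2}
\qquad\text{for all } w\in\mathbb{R}^{n},\\
\lVert w-w_{n,\eta,\pm}\rVert_{\Sigma^{-1}}\geq\sqrt{6C_{0}}\,(t/n)^{1/4}
&\quad\Longrightarrow\quad
\frac{1}{2}\lVert w-w_{n,\eta,\pm}\rVert_{\Sigma^{-1}}^{2}
\geq\frac{6C_{0}}{2}\sqrt{t/n}=3C_{0}\sqrt{t/n}.
\end{align*}
```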

This means on E3(t)E_{3}(t),

infwn:wwn,η,±Σ16C0(t/n)1/4Lη,±(w;Lv)\displaystyle\inf_{w\in\mathbb{R}^{n}:\lVert w-w_{n,\eta,\pm}\rVert_{\Sigma^{-1}}\geq\sqrt{6C_{0}}(t/n)^{1/4}}L_{\eta,\pm}(w;L_{v}) maxβ>0minγ>0𝖣¯η(β,γ)+2C0t/n,\displaystyle\geq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+2C_{0}\sqrt{t/n},
Lη,±(wn,η,±;Lv)\displaystyle L_{\eta,\pm}(w_{n,\eta,\pm};L_{v}) maxβ>0minγ>0𝖣¯η(β,γ)+C0t/n.\displaystyle\leq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+C_{0}\sqrt{t/n}.

This in particular means on E1E3(t)E_{1}\cap E_{3}(t),

wn,η,±\displaystyle w_{n,\eta,\pm} {wn:Lη,±(w;Lv)maxβ>0minγ>0𝖣¯η(β,γ)+C0t/n}\displaystyle\in\Big{\{}w\in\mathbb{R}^{n}:L_{\eta,\pm}(w;L_{v})\leq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+C_{0}\sqrt{t/n}\Big{\}}
{wn:wΣ16C0(t/n)1/4+C0/2}.\displaystyle\subset\big{\{}w\in\mathbb{R}^{n}:\lVert w\rVert_{\Sigma^{-1}}\leq\sqrt{6C_{0}}(t/n)^{1/4}+C_{0}/2\big{\}}.

Consequently, by enlarging $C_{0}>0$ if necessary, using Lemma 9.7-(1), on $E_{1}\cap E_{3}(t)$,

\begin{align*}
&\Big\{w\in\mathbb{R}^{n}:L_{\eta}(w;L_{v})\leq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+C_{0}\sqrt{t/n}\Big\}\\
&\qquad\subset\big\{w\in\mathbb{R}^{n}:\lVert w\rVert_{\Sigma^{-1}}\leq 3C_{0}/5\big\}\subset B_{n}(3C_{0}/4)\subsetneq B_{n}(C_{0})=B_{n}(L_{w}).
\end{align*}

This implies, on E1E3(t)E_{1}\cap E_{3}(t), we have wn,ηΣ1wn,η3C0/4\lVert w_{n,\eta}\rVert_{\Sigma^{-1}}\vee\lVert w_{n,\eta}\rVert\leq 3C_{0}/4, proving the apriori bound.

(Step 2). Next we establish the announced error bound. On the event 1,0(t/n)\mathscr{E}_{1,0}(\sqrt{t/n}), by Lemma 9.7-(2),

supwBn(C0)|Lη(w;Lv)Lη,±(w;Lv)|C1t/n.\displaystyle\sup_{w\in B_{n}(C_{0})}\big{\lvert}L_{\eta}(w;L_{v})-L_{\eta,\pm}(w;L_{v})\big{\rvert}\leq C_{1}\sqrt{t/n}. (9.20)

Consequently, on E1E3(t)1,0(t/n)E_{1}\cap E_{3}(t)\cap\mathscr{E}_{1,0}(\sqrt{t/n}),

|minwnLη(w;Lv)maxβ>0minγ>0𝖣¯η(β,γ)|C2t/n.\displaystyle\big{\lvert}\min_{w\in\mathbb{R}^{n}}L_{\eta}(w;L_{v})-\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)\big{\rvert}\leq C_{2}\sqrt{t/n}. (9.21)

On this event, combining (9.20)-(9.21) with (9.3), and using again the strong convexity of $L_{\eta,+}(\cdot;L_{v})$ with respect to $\lVert\cdot\rVert_{\Sigma^{-1}}$, we have for $C_{3}=2\sqrt{C_{0}+C_{1}+C_{2}}$,

\begin{align*}
&\inf_{w\in B_{n}(C_{0}):\lVert w-w_{n,\eta,+}\rVert_{\Sigma^{-1}}\geq C_{3}(t/n)^{1/4}}L_{\eta}(w;L_{v})-\min_{w\in\mathbb{R}^{n}}L_{\eta}(w;L_{v})\\
&\geq\inf_{w\in B_{n}(C_{0}):\lVert w-w_{n,\eta,+}\rVert_{\Sigma^{-1}}\geq C_{3}(t/n)^{1/4}}L_{\eta,+}(w;L_{v})-\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)-(C_{1}+C_{2})\sqrt{t/n}\\
&\geq\inf_{w\in B_{n}(C_{0}):\lVert w-w_{n,\eta,+}\rVert_{\Sigma^{-1}}\geq C_{3}(t/n)^{1/4}}L_{\eta,+}(w;L_{v})-L_{\eta,+}(w_{n,\eta,+};L_{v})-(C_{0}+C_{1}+C_{2})\sqrt{t/n}\\
&\geq(C_{3}^{2}/2)\sqrt{t/n}-(C_{0}+C_{1}+C_{2})\sqrt{t/n}=(C_{0}+C_{1}+C_{2})\sqrt{t/n}.
\end{align*}

This means that wn,ηwn,η,+Σ1C3(t/n)1/4\lVert w_{n,\eta}-w_{n,\eta,+}\rVert_{\Sigma^{-1}}\leq C_{3}(t/n)^{1/4} on E1E3(t)1,0(t/n)E_{1}\cap E_{3}(t)\cap\mathscr{E}_{1,0}(\sqrt{t/n}). The claim follows by intersecting the prescribed event with E2(t)E_{2}(t) in (9.18) that controls the ξ\operatorname{\mathbb{P}}^{\xi}-probability of wn,η,+wη,Σ1C0(t/n)1/4\lVert w_{n,\eta,+}-w_{\eta,\ast}\rVert_{\Sigma^{-1}}\leq C_{0}(t/n)^{1/4}. ∎

Proof of Theorem 9.6.

Fix ξ1,ξ(t/n)\xi\in\mathscr{E}_{1,\xi}(\sqrt{t/n}), and ε>0\varepsilon>0 to be chosen later on. First, as 𝗀\mathsf{g} is Lipschitz with respect to Σ1\lVert\cdot\rVert_{\Sigma^{-1}}, by the Gaussian concentration inequality, there exists C0=C0(K)>0C_{0}=C_{0}(K)>0 such that for t1t\geq 1, on an event E0(t)E_{0}(t) with ξ\operatorname{\mathbb{P}}^{\xi}-probability at least 1et1-e^{-t},

|𝗀(wη,)𝔼𝗀(wη,)|C0t/n.\displaystyle\lvert\mathsf{g}(w_{\eta,\ast})-\operatorname{\mathbb{E}}\mathsf{g}(w_{\eta,\ast})\rvert\leq C_{0}\sqrt{t/n}.

Moreover, by Proposition 9.8 and Propositions 9.3-9.5, there exist some $C_{1},C_{1}^{\prime}>0$ depending on $K$ such that for $C_{1}^{\prime}\log(en)\leq t\leq n/C_{1}^{\prime}$, on an event $E_{1}(t)$ with $\operatorname{\mathbb{P}}^{\xi}$-probability at least $1-C_{1}e^{-t/C_{1}}$, we have

  1. (1)

    wn,η,Σ1wn,η,C1\lVert w_{n,\eta,-}\rVert_{\Sigma^{-1}}\vee\lVert w_{n,\eta,-}\rVert\leq C_{1}, wn,η,wη,Σ1C1(t/n)1/4\lVert w_{n,\eta,-}-w_{\eta,\ast}\rVert_{\Sigma^{-1}}\leq C_{1}(t/n)^{1/4}, and

  2. (2)

    |Lη,(wn,η,;Lv)maxβ>0minγ>0𝖣¯η(β,γ)|C1t/n.\lvert L_{\eta,-}(w_{n,\eta,-};L_{v})-\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)\rvert\leq C_{1}\sqrt{t/n}.

Consequently, for $C_{1}^{\prime}\log(en)\leq t\leq n/C_{1}^{\prime}$, on the event $E_{0}(t)\cap E_{1}(t)$, uniformly in $w\in D_{\eta;\varepsilon}(\mathsf{g})\cap B_{n}(L_{w})$,

\begin{align*}
\varepsilon&\leq\lvert\mathsf{g}(w)-\operatorname{\mathbb{E}}\mathsf{g}(w_{\eta,\ast})\rvert\\
&\leq\lvert\mathsf{g}(w)-\mathsf{g}(w_{\eta,\ast})\rvert+\lvert\mathsf{g}(w_{\eta,\ast})-\operatorname{\mathbb{E}}\mathsf{g}(w_{\eta,\ast})\rvert\\
&\leq\lVert w-w_{n,\eta,-}\rVert_{\Sigma^{-1}}+\lVert w_{n,\eta,-}-w_{\eta,\ast}\rVert_{\Sigma^{-1}}+C_{0}\sqrt{t/n}\\
&\leq\lVert w-w_{n,\eta,-}\rVert_{\Sigma^{-1}}+(C_{0}+C_{1})(t/n)^{1/4}.
\end{align*}

This implies that, for the prescribed range of $t$ and on the event $E_{0}(t)\cap E_{1}(t)$,

\begin{align*}
\min_{w\in D_{\eta;\varepsilon}(\mathsf{g})\cap B_{n}(L_{w})}\lVert w-w_{n,\eta,-}\rVert_{\Sigma^{-1}}\geq\big(\varepsilon-(C_{0}+C_{1})(t/n)^{1/4}\big)_{+}.
\end{align*}

Using the strong convexity of $L_{\eta,-}(\cdot;L_{v})$ with respect to $\lVert\cdot\rVert_{\Sigma^{-1}}$, we have for $C_{1}^{\prime}\log(en)\leq t\leq n/C_{1}^{\prime}$, on the event $E_{0}(t)\cap E_{1}(t)$,

\begin{align*}
\min_{w\in D_{\eta;\varepsilon}(\mathsf{g})\cap B_{n}(L_{w})}L_{\eta}(w;L_{v})&\geq\min_{w\in D_{\eta;\varepsilon}(\mathsf{g})\cap B_{n}(L_{w})}L_{\eta,-}(w;L_{v})\\
&\geq L_{\eta,-}(w_{n,\eta,-};L_{v})+\frac{1}{2}\big(\varepsilon-(C_{0}+C_{1})(t/n)^{1/4}\big)_{+}^{2}\\
&\geq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+\frac{1}{2}\big(\varepsilon-(C_{0}+C_{1})(t/n)^{1/4}\big)_{+}^{2}-C_{1}\sqrt{t/n}.
\end{align*}

Now we may choose εε(t,n)(C0+C1+2C1)(t/n)1/4\varepsilon\equiv\varepsilon(t,n)\equiv(C_{0}+C_{1}+2\sqrt{C_{1}})(t/n)^{1/4} to conclude by adjusting constants. ∎
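The arithmetic behind this choice of $\varepsilon$ is as follows; the normalization $C_{1}>1$ is assumed here without loss of generality.

```latex
\begin{align*}
\varepsilon-(C_{0}+C_{1})(t/n)^{1/4}
&=2\sqrt{C_{1}}\,(t/n)^{1/4},\\
\frac{1}{2}\big(\varepsilon-(C_{0}+C_{1})(t/n)^{1/4}\big)_{+}^{2}-C_{1}\sqrt{t/n}
&=2C_{1}\sqrt{t/n}-C_{1}\sqrt{t/n}
=C_{1}\sqrt{t/n}\ >\ \sqrt{t/n}.
\end{align*}
```

Hence, on $E_{0}(t)\cap E_{1}(t)$, the minimum of $L_{\eta}(\cdot;L_{v})$ over $D_{\eta;\varepsilon}(\mathsf{g})\cap B_{n}(L_{w})$ exceeds $\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+\sqrt{t/n}$, so the event in Theorem 9.6 is contained in $E_{0}(t)^{c}\cup E_{1}(t)^{c}$.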

9.4. Proof of Theorem 2.3 for μ^η;G\widehat{\mu}_{\eta;G}

Fix ξ1,ξ(t/n)\xi\in\mathscr{E}_{1,\xi}(\sqrt{t/n}). All the constants in ,,\lesssim,\gtrsim,\asymp below may depend on KK.

(Step 1). In this step, we will obtain an upper bound for $\min_{w\in\mathbb{R}^{n}}H_{\eta}(w)$. By Proposition 9.1 and the concentration estimate in Lemma 7.6, there exists some $C_{0}=C_{0}(K)>0$ such that on an event $E_{0}$ with $\operatorname{\mathbb{P}}^{\xi}(E_{0})\geq 1-C_{0}e^{-n/C_{0}}$,

\begin{align}
\min_{w\in\mathbb{R}^{n}}H_{\eta}(w)=\min_{w\in\mathbb{R}^{n}}H_{\eta}(w;L_{0})=\min_{w\in B_{n}(L_{0})}H_{\eta}(w)=\min_{w\in B_{n}(L_{0})}H_{\eta}(w;L_{0}), \tag{9.22}
\end{align}

where

L0C0{1+(Σ1op𝟏ϕ11+1/K1η1)}.\displaystyle L_{0}\equiv C_{0}\Big{\{}1+\Big{(}\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\bm{1}_{\phi^{-1}\geq 1+1/K}^{-1}\wedge\eta^{-1}\Big{)}\Big{\}}. (9.23)

Now we shall apply the convex(-side) Gaussian min-max theorem to obtain an upper bound for the right hand side of (9.22). Recall the definition of hη=hη;Gh_{\eta}=h_{\eta;G} and η\ell_{\eta} in (6.2). Using Theorem 6.1-(2), for any zz\in\mathbb{R},

ξ(minwnHη(w)z)\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in\mathbb{R}^{n}}H_{\eta}(w)\geq z\Big{)} ξ(minwBn(L0)Hη(w;L0)z)+ξ(E0c)\displaystyle\leq\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in B_{n}(L_{0})}H_{\eta}(w;L_{0})\geq z\Big{)}+\operatorname{\mathbb{P}}^{\xi}(E_{0}^{c})
=ξ(minwBn(L0)maxvBm(L0)hη(w,v)z)+ξ(E0c)\displaystyle=\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in B_{n}(L_{0})}\max_{v\in B_{m}(L_{0})}h_{\eta}(w,v)\geq z\Big{)}+\operatorname{\mathbb{P}}^{\xi}(E_{0}^{c})
2ξ(minwBn(L0)maxvBm(L0)η(w,v)z)+ξ(E0c)\displaystyle\leq 2\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in B_{n}(L_{0})}\max_{v\in B_{m}(L_{0})}\ell_{\eta}(w,v)\geq z\Big{)}+\operatorname{\mathbb{P}}^{\xi}(E_{0}^{c})
=2ξ(minwBn(L0)Lη(w;L0)z)+ξ(E0c).\displaystyle=2\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in B_{n}(L_{0})}L_{\eta}(w;L_{0})\geq z\Big{)}+\operatorname{\mathbb{P}}^{\xi}(E_{0}^{c}). (9.24)

By Proposition 9.9, there exist some $C_{1},C_{1}^{\prime}>0$ depending on $K$ (we assume without loss of generality that $L_{0}>C_{1}$ and that $C_{1}$ exceeds the constants in Theorems 9.2 and 9.6), such that on an event $E_{1}$ with $\operatorname{\mathbb{P}}^{\xi}$-probability at least $1-C_{1}e^{-n/C_{1}}$, the map $w\mapsto L_{\eta}(w;L_{0})$ attains its global minimum in $B_{n}(C_{1})$. We may now apply Theorem 9.2: with $z\equiv\bar{z}(t)=\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+\sqrt{t/n}$, for $C_{1}^{\prime}\log(en)\leq t\leq n/C_{1}^{\prime}$,

ξ(minwBn(L0)Lη(w;L0)z¯(t))\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in B_{n}(L_{0})}L_{\eta}(w;L_{0})\geq\bar{z}(t)\Big{)}
ξ(minwBn(C1)Lη(w;L0)z¯(t))+ξ(E1c)C1et/C1+ξ(E1c).\displaystyle\leq\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in B_{n}(C_{1})}L_{\eta}(w;L_{0})\geq\bar{z}(t)\Big{)}+\operatorname{\mathbb{P}}^{\xi}(E_{1}^{c})\leq C_{1}e^{-t/C_{1}}+\operatorname{\mathbb{P}}^{\xi}(E_{1}^{c}). (9.25)

Combining (9.24)-(9.25), by enlarging $C_{1}$ if necessary, for $C_{1}^{\prime}\log(en)\leq t\leq n/C_{1}^{\prime}$ and $\eta\in\Xi_{K}$,

ξ(minwnHη(w)maxβ>0minγ>0𝖣¯η(β,γ)+t/n)C1et/C1.\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in\mathbb{R}^{n}}H_{\eta}(w)\geq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+\sqrt{t/n}\Big{)}\leq C_{1}e^{-t/C_{1}}. (9.26)

An entirely similar argument leads to a lower bound (which will be used later on):

ξ(minwnHη(w)maxβ>0minγ>0𝖣¯η(β,γ)t/n)C1et/C1.\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in\mathbb{R}^{n}}H_{\eta}(w)\leq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)-\sqrt{t/n}\Big{)}\leq C_{1}e^{-t/C_{1}}. (9.27)

(Step 2). In this step, we will obtain a lower bound on minwDη;ε(𝗀)Hη(w)\min_{w\in D_{\eta;\varepsilon}(\mathsf{g})}H_{\eta}(w) for the exceptional set Dε(𝗀)D_{\varepsilon}(\mathsf{g}) defined in (9.15), with a suitable choice of ε\varepsilon. Let us take C2,C2>0C_{2},C_{2}^{\prime}>0 to be the constants in Theorem 9.6, and let ε(t,n)C2(t/n)1/4\varepsilon(t,n)\equiv C_{2}(t/n)^{1/4} for C2log(en)tn/C2C_{2}^{\prime}\log(en)\leq t\leq n/C_{2}^{\prime}. To this end, using Theorem 6.1-(1) (that holds without convexity), for any zz\in\mathbb{R} and Lv>0L_{v}>0

ξ(minwBn(L0)Dη;ε(𝗀)Hη(w)z)\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in B_{n}(L_{0})\cap D_{\eta;\varepsilon}(\mathsf{g})}H_{\eta}(w)\leq z\Big{)} ξ(minwBn(L0)Dη;ε(𝗀)maxvBm(Lv)hη(w,v)z)\displaystyle\leq\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in B_{n}(L_{0})\cap D_{\eta;\varepsilon}(\mathsf{g})}\max_{v\in B_{m}(L_{v})}h_{\eta}(w,v)\leq z\Big{)}
2ξ(minwBn(L0)Dη;ε(𝗀)maxvBm(Lv)η(w,v)z)\displaystyle\leq 2\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in B_{n}(L_{0})\cap D_{\eta;\varepsilon}(\mathsf{g})}\max_{v\in B_{m}(L_{v})}\ell_{\eta}(w,v)\leq z\Big{)}
=2ξ(minwBn(L0)Dη;ε(𝗀)Lη(w;Lv)z).\displaystyle=2\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in B_{n}(L_{0})\cap D_{\eta;\varepsilon}(\mathsf{g})}L_{\eta}(w;L_{v})\leq z\Big{)}.

By choosing Lv1L_{v}\asymp 1 of constant order but large enough, εε(t,n)\varepsilon\equiv\varepsilon(t,n) and zz¯(t)=maxβ>0minγ>0𝖣¯η(β,γ)+2t/nz\equiv\bar{z}(t)=\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+2\sqrt{t/n}, we have for C2log(en)tn/C2C_{2}^{\prime}\log(en)\leq t\leq n/C_{2}^{\prime},

ξ(minwBn(L0)Dη;ε(t,n)(𝗀)Hη(w)maxβ>0minγ>0𝖣¯η(β,γ)+2tn)2C2et/C2.\displaystyle\operatorname{\mathbb{P}}^{\xi}\bigg{(}\min_{w\in B_{n}(L_{0})\cap D_{\eta;\varepsilon(t,n)}(\mathsf{g})}H_{\eta}(w)\leq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+2\sqrt{\frac{t}{n}}\bigg{)}\leq 2C_{2}e^{-t/C_{2}}. (9.28)

(Step 3). Combining the upper bound (9.26), the lower bound (9.28) and the localization in (9.22), there exist some $C_{3},C_{3}^{\prime}>0$ depending on $K$ such that for $C_{3}^{\prime}\log(en)\leq t\leq n/C_{3}^{\prime}$, on an event $E_{3}(t)$ with $\operatorname{\mathbb{P}}^{\xi}(E_{3}(t))\geq 1-C_{3}e^{-t/C_{3}}$,

minwBn(L0)Dη;ε(t,n)(𝗀)Hη(w)\displaystyle\min_{w\in B_{n}(L_{0})\cap D_{\eta;\varepsilon(t,n)}(\mathsf{g})}H_{\eta}(w) maxβ>0minγ>0𝖣¯η(β,γ)+2t/n\displaystyle\geq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+2\sqrt{t/n}
>maxβ>0minγ>0𝖣¯η(β,γ)+t/nminwnHη(w)=minwBn(L0)Hη(w).\displaystyle>\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+\sqrt{t/n}\geq\min_{w\in\mathbb{R}^{n}}H_{\eta}(w)=\min_{w\in B_{n}(L_{0})}H_{\eta}(w).

So on E3(t)E_{3}(t), w^ηDη;ε(t,n)(𝗀)Bn(L0)\widehat{w}_{\eta}\notin D_{\eta;\varepsilon(t,n)}(\mathsf{g})\cap B_{n}(L_{0}), i.e., for C3log(en)tn/C3C_{3}^{\prime}\log(en)\leq t\leq n/C_{3}^{\prime},

ξ(|𝗀(w^η)𝔼𝗀(wη,)|C3(t/n)1/4)C3et/C3.\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\lvert\mathsf{g}(\widehat{w}_{\eta})-\operatorname{\mathbb{E}}\mathsf{g}(w_{\eta,\ast})\rvert\geq C_{3}(t/n)^{1/4}\Big{)}\leq C_{3}e^{-t/C_{3}}.

Using a change of variable and suitably adjusting the constant C3C_{3}, for any 11-Lipschitz function 𝗀0:n\mathsf{g}_{0}:\mathbb{R}^{n}\to\mathbb{R}, ηΞK\eta\in\Xi_{K} and ε(0,1/2]\varepsilon\in(0,1/2],

ξ(|𝗀0(μ^η)𝔼𝗀0(μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,))|ε)C3nenε4/C3.\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\big{\lvert}\mathsf{g}_{0}(\widehat{\mu}_{\eta})-\operatorname{\mathbb{E}}\mathsf{g}_{0}\big{(}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})\big{)}\big{\rvert}\geq\varepsilon\Big{)}\leq C_{3}ne^{-n\varepsilon^{4}/C_{3}}.

(Step 4). In this step we shall establish uniform guarantees. We write μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,)=μ^η;(Σ,μ0)𝗌𝖾𝗊,\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})=\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast} in this part of the proof. First, in the case ϕ11+1/K\phi^{-1}\geq 1+1/K, using μ^η=n1X(XX/n+ηI)1Y\widehat{\mu}_{\eta}=n^{-1}X^{\top}\big{(}XX^{\top}/n+\eta I\big{)}^{-1}Y, for η1,η2[0,K]\eta_{1},\eta_{2}\in[0,K],

μ^η1μ^η2\displaystyle\lVert\widehat{\mu}_{\eta_{1}}-\widehat{\mu}_{\eta_{2}}\rVert n1Gop(Gop+ξ)(XX/n+η1I)1(XX/n+η2I)1op\displaystyle\lesssim n^{-1}\lVert G\rVert_{\operatorname{op}}(\lVert G\rVert_{\operatorname{op}}+\lVert\xi\rVert)\cdot\lVert\big{(}XX^{\top}/n+\eta_{1}I\big{)}^{-1}-\big{(}XX^{\top}/n+\eta_{2}I\big{)}^{-1}\rVert_{\operatorname{op}}
Σ1op2(1+Gop+ξn)2(GG/n)1op2|η1η2|.\displaystyle\lesssim\lVert\Sigma^{-1}\rVert_{\operatorname{op}}^{2}\cdot\Big{(}1+\frac{\lVert G\rVert_{\operatorname{op}}+\lVert\xi\rVert}{\sqrt{n}}\Big{)}^{2}\cdot\lVert(GG^{\top}/n)^{-1}\rVert_{\operatorname{op}}^{2}\cdot\lvert\eta_{1}-\eta_{2}\rvert. (9.29)

Here the last inequality follows from the fact that for any p.s.d. matrix $A$, $\lVert(A+\eta_{1}I)^{-1}-(A+\eta_{2}I)^{-1}\rVert_{\operatorname{op}}\leq\lambda_{\min}^{-2}(A)\lvert\eta_{1}-\eta_{2}\rvert$. As $\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\lesssim n$ under $\mathcal{H}_{\Sigma}\leq K$, there exists $C_{4}=C_{4}(K)>0$ such that on an event $E_{4}$ with $\operatorname{\mathbb{P}}^{\xi}(E_{4})\geq 1-C_{4}e^{-n/C_{4}}$,

μ^η1μ^η2C4n2|η1η2|.\displaystyle\lVert\widehat{\mu}_{\eta_{1}}-\widehat{\mu}_{\eta_{2}}\rVert\leq C_{4}n^{2}\lvert\eta_{1}-\eta_{2}\rvert. (9.30)
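The resolvent bound used for the last inequality in (9.29) can be checked directly. The sketch below assumes $A$ is symmetric positive definite (so that $\lambda_{\min}(A)>0$) and $\eta_{1},\eta_{2}\geq 0$, which is the situation in which the bound is applied here.

```latex
\begin{align*}
(A+\eta_{1}I)^{-1}-(A+\eta_{2}I)^{-1}
&=(A+\eta_{1}I)^{-1}\big[(A+\eta_{2}I)-(A+\eta_{1}I)\big](A+\eta_{2}I)^{-1}\\
&=(\eta_{2}-\eta_{1})\,(A+\eta_{1}I)^{-1}(A+\eta_{2}I)^{-1},\\
\big\lVert(A+\eta_{1}I)^{-1}-(A+\eta_{2}I)^{-1}\big\rVert_{\operatorname{op}}
&\leq\lvert\eta_{1}-\eta_{2}\rvert\cdot\frac{1}{\lambda_{\min}(A)+\eta_{1}}\cdot\frac{1}{\lambda_{\min}(A)+\eta_{2}}
\leq\lambda_{\min}^{-2}(A)\,\lvert\eta_{1}-\eta_{2}\rvert.
\end{align*}
```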

On the other hand, note that for η1,η2[0,K]\eta_{1},\eta_{2}\in[0,K], using Proposition 8.1-(3),

μ^η1;(Σ,μ0)𝗌𝖾𝗊,μ^η2;(Σ,μ0)𝗌𝖾𝗊,(1eg)Σ1op2|η1η2|.\displaystyle\lVert\widehat{\mu}_{\eta_{1};(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\widehat{\mu}_{\eta_{2};(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}\rVert\lesssim(1\vee e_{g})\lVert\Sigma^{-1}\rVert_{\operatorname{op}}^{2}\lvert\eta_{1}-\eta_{2}\rvert. (9.31)

So we have

|𝔼𝗀0(μ^η1;(Σ,μ0)𝗌𝖾𝗊,)𝔼𝗀0(μ^η2;(Σ,μ0)𝗌𝖾𝗊,)|C4n2|η1η2|.\displaystyle\big{\lvert}\operatorname{\mathbb{E}}\mathsf{g}_{0}\big{(}\widehat{\mu}_{\eta_{1};(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}\big{)}-\operatorname{\mathbb{E}}\mathsf{g}_{0}\big{(}\widehat{\mu}_{\eta_{2};(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}\big{)}\big{\rvert}\leq C_{4}n^{2}\lvert\eta_{1}-\eta_{2}\rvert. (9.32)

Now by taking an ε/(2C4n2)\varepsilon/(2C_{4}n^{2})-net Λε\Lambda_{\varepsilon} of [0,K][0,K] and a union bound,

ξ(supη[0,K]|𝗀0(μ^η)𝔼𝗀0(μ^η;(Σ,μ0)𝗌𝖾𝗊,)|2ε)\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\sup_{\eta\in[0,K]}\big{\lvert}\mathsf{g}_{0}(\widehat{\mu}_{\eta})-\operatorname{\mathbb{E}}\mathsf{g}_{0}\big{(}\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}\big{)}\big{\rvert}\geq 2\varepsilon\Big{)}
ξ(maxηΛε|𝗀0(μ^η)𝔼𝗀0(μ^η;(Σ,μ0)𝗌𝖾𝗊,)|ε)+(E4c)\displaystyle\leq\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{\eta\in\Lambda_{\varepsilon}}\big{\lvert}\mathsf{g}_{0}(\widehat{\mu}_{\eta})-\operatorname{\mathbb{E}}\mathsf{g}_{0}\big{(}\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}\big{)}\big{\rvert}\geq\varepsilon\Big{)}+\operatorname{\mathbb{P}}(E_{4}^{c})
(1+2C4Kn2/ε)C3nenε4/C3+C4en/C4Cε1n3enε4/C.\displaystyle\leq(1+2C_{4}Kn^{2}/\varepsilon)\cdot C_{3}ne^{-n\varepsilon^{4}/C_{3}}+C_{4}e^{-n/C_{4}}\leq C\cdot\varepsilon^{-1}n^{3}e^{-n\varepsilon^{4}/C}. (9.33)

By adjusting constants, we may replace n3/εn^{3}/\varepsilon by nn. We then conclude by further taking expectation with respect to ξ\xi, and noting that (ξ1,ξ(t/n))1Cet/C\operatorname{\mathbb{P}}(\xi\in\mathscr{E}_{1,\xi}(\sqrt{t/n}))\geq 1-Ce^{-t/C}.
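The reduction from the supremum over $[0,K]$ to the maximum over the net $\Lambda_{\varepsilon}$ in (9.33) can be sketched as follows. The sketch works on the event $E_{4}$ and uses that $\mathsf{g}_{0}$ is $1$-Lipschitz; the nearest net point $\eta^{\prime}$ is notation introduced here for illustration only.

```latex
\begin{align*}
&\text{For } \eta\in[0,K] \text{ pick } \eta^{\prime}\in\Lambda_{\varepsilon}
\text{ with } \lvert\eta-\eta^{\prime}\rvert\leq\varepsilon/(2C_{4}n^{2}). \text{ Then by (9.30) and (9.32),}\\
&\big\lvert\mathsf{g}_{0}(\widehat{\mu}_{\eta})-\operatorname{\mathbb{E}}\mathsf{g}_{0}\big(\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}\big)\big\rvert\\
&\quad\leq\lVert\widehat{\mu}_{\eta}-\widehat{\mu}_{\eta^{\prime}}\rVert
+\big\lvert\mathsf{g}_{0}(\widehat{\mu}_{\eta^{\prime}})-\operatorname{\mathbb{E}}\mathsf{g}_{0}\big(\widehat{\mu}_{\eta^{\prime};(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}\big)\big\rvert
+\big\lvert\operatorname{\mathbb{E}}\mathsf{g}_{0}\big(\widehat{\mu}_{\eta^{\prime};(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}\big)-\operatorname{\mathbb{E}}\mathsf{g}_{0}\big(\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}\big)\big\rvert\\
&\quad\leq\frac{\varepsilon}{2}
+\big\lvert\mathsf{g}_{0}(\widehat{\mu}_{\eta^{\prime}})-\operatorname{\mathbb{E}}\mathsf{g}_{0}\big(\widehat{\mu}_{\eta^{\prime};(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}\big)\big\rvert
+\frac{\varepsilon}{2}.
\end{align*}
```

Hence a deviation of at least $2\varepsilon$ somewhere on $[0,K]$ forces a deviation of at least $\varepsilon$ at some point of $\Lambda_{\varepsilon}$. An equally spaced grid with spacing $\varepsilon/(2C_{4}n^{2})$ has at most $1+2C_{4}Kn^{2}/\varepsilon$ points, which is the factor in front of the pointwise bound in (9.33).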

Next, in the case $\phi^{-1}<1+1/K$, we work with $\eta\in[1/K,K]$ and use the standard form $\widehat{\mu}_{\eta}=n^{-1}\big(X^{\top}X/n+\eta I\big)^{-1}X^{\top}Y$. As $\eta\geq 1/K$, the spectrum of the middle inverse matrix is bounded by $1/\eta\leq K$, so we may replicate the above calculations in (9.30) and (9.32) to reach an estimate similar to (9.33). ∎

9.5. Proof of Theorem 2.3 for r^η;G\widehat{r}_{\eta;G}

Recall the cost functions $h_{\eta}=h_{\eta;G}$ and $\ell_{\eta}$ defined in (6.2). It is easy to see that

v^ηargmaxvmminwnhη(w,v)=1nη(Gw^ηξ)=r^ηη.\displaystyle\widehat{v}_{\eta}\equiv\operatorname*{arg\,max\,}_{v\in\mathbb{R}^{m}}\min_{w\in\mathbb{R}^{n}}h_{\eta}(w,v)=\frac{1}{\sqrt{n}\eta}(G\widehat{w}_{\eta}-\xi)=-\frac{\widehat{r}_{\eta}}{\eta}. (9.34)

We shall define the ‘population’ version of v^η\widehat{v}_{\eta} as

vη,1ϕτη,(ϕγη,2σξ2hnξn)\displaystyle v_{\eta,\ast}\equiv\frac{1}{\phi\tau_{\eta,\ast}}\bigg{(}\sqrt{\phi\gamma_{\eta,\ast}^{2}-\sigma_{\xi}^{2}}\cdot\frac{h}{\sqrt{n}}-\frac{\xi}{\sqrt{n}}\bigg{)} (9.35)

in the Gordon problem.

Proposition 9.10.

Suppose the following hold for some K>0K>0.

  • 1/Kϕ1,ηK1/K\leq\phi^{-1},\eta\leq K, μ0ΣΣK\lVert\mu_{0}\rVert\vee\lVert\Sigma\rVert\vee\mathcal{H}_{\Sigma}\leq K.

  • Assumption B holds with σξ2[1/K,K]\sigma_{\xi}^{2}\in[1/K,K].

There exist constants C,C>0C,C^{\prime}>0 depending on KK such that for Clog(en)tn/CC^{\prime}\log(en)\leq t\leq n/C^{\prime} and ξ1,ξ(t/n)\xi\in\mathscr{E}_{1,\xi}(\sqrt{t/n}),

\begin{align*}
&\operatorname{\mathbb{P}}^{\xi}\bigg(\hbox{The map $v\mapsto\ell_{\eta}(w_{\eta,\ast},v)$ is $\eta$-strongly concave with unique maximizer $v_{\eta,n}$}\\
&\qquad\hbox{satisfying $\|v_{\eta,n}\|\leq C$ and $\|v_{\eta,n}-v_{\eta,\ast}\|\leq C\sqrt{t/n}$.}\\
&\qquad\hbox{Furthermore, $\big\lvert\max_{v}\ell_{\eta}(w_{\eta,\ast},v)-\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)\big\rvert\leq C\sqrt{t/n}$.}\bigg)\\
&\geq 1-Ce^{-t/C}.
\end{align*}

We need the following before the proof of Proposition 9.10.

Lemma 9.11.

Suppose 1/Kϕ1,σξ2K1/K\leq\phi^{-1},\sigma_{\xi}^{2}\leq K, and μ0ΣopΣK\lVert\mu_{0}\rVert\vee\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K for some K>0K>0. Recall wη,w_{\eta,\ast} defined in (9.14). Then there exist constants C,C>0C,C^{\prime}>0 depending on KK such that for Clog(en)tn/CC^{\prime}\log(en)\leq t\leq n/C^{\prime}, ηΞK\eta\in\Xi_{K} and ξ1,ξ(t/n)\xi\in\mathscr{E}_{1,\xi}(\sqrt{t/n}),

ξ(max{|(id𝔼)g/n,wη,|,|(id𝔼)wη,2|,|(id𝔼)F(wη,)|,\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\max\Big{\{}\lvert\big{(}\mathrm{id}-\operatorname{\mathbb{E}}\big{)}\langle g/\sqrt{n},w_{\eta,\ast}\rangle\rvert,\,\lvert\big{(}\mathrm{id}-\operatorname{\mathbb{E}}\big{)}\|w_{\eta,\ast}\|^{2}\rvert,\,\lvert\big{(}\mathrm{id}-\operatorname{\mathbb{E}}\big{)}F(w_{\eta,\ast})\rvert,
n1|(id𝔼)wη,hξ2|}t/n)Cet/C.\displaystyle\qquad\qquad n^{-1}\lvert\big{(}\mathrm{id}-\operatorname{\mathbb{E}}\big{)}\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert^{2}\rvert\Big{\}}\geq\sqrt{t/n}\Big{)}\leq Ce^{-t/C}.
Proof.

All the constants in ,,\lesssim,\gtrsim,\asymp below may depend on KK. Recall wη,=(Σ+τη,I)1Σ1/2(τη,μ0+γη,Σ1/2g/n)w_{\eta,\ast}=(\Sigma+\tau_{\eta,\ast}I)^{-1}\Sigma^{1/2}(-\tau_{\eta,\ast}\mu_{0}+\gamma_{\eta,\ast}\Sigma^{1/2}g/\sqrt{n}). Under the assumed conditions, γη,,τη,1\gamma_{\eta,\ast},\tau_{\eta,\ast}\asymp 1. We shall consider the four terms separately below.

For the first term, we have

n1/2|g,wη,𝔼g,wη,|τη,n1/2|(Σ+τη,I)1Σ1/2μ0,g|\displaystyle n^{-1/2}\lvert\langle g,w_{\eta,\ast}\rangle-\operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle\rvert\leq\tau_{\eta,\ast}\cdot n^{-1/2}\lvert\langle(\Sigma+\tau_{\eta,\ast}I)^{-1}\Sigma^{1/2}\mu_{0},g\rangle\rvert
+γη,n1(id𝔼)(Σ+τη,I)1/2Σ1/2g2A1,1+A1,2.\displaystyle\qquad+\gamma_{\eta,\ast}\cdot n^{-1}(\mathrm{id}-\operatorname{\mathbb{E}})\lVert(\Sigma+\tau_{\eta,\ast}I)^{-1/2}\Sigma^{1/2}g\rVert^{2}\equiv A_{1,1}+A_{1,2}.

The concentration of the term A1,1A_{1,1} can be handled using Gaussian tails and the fact that (Σ+τη,I)1Σ1/2μ021\lVert(\Sigma+\tau_{\eta,\ast}I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}\lesssim 1. For the term A1,2A_{1,2}, with H1(g)(Σ+τη,I)1/2Σ1/2g2H_{1}(g)\equiv\lVert(\Sigma+\tau_{\eta,\ast}I)^{-1/2}\Sigma^{1/2}g\rVert^{2}, it is easy to evaluate H1(g)2=4(Σ+τη,I)1Σg24H1(g)\lVert\nabla H_{1}(g)\rVert^{2}=4\lVert(\Sigma+\tau_{\eta,\ast}I)^{-1}\Sigma g\rVert^{2}\leq 4H_{1}(g) and 𝔼H1(g)n\operatorname{\mathbb{E}}H_{1}(g)\leq n, so Proposition B.1 applies to conclude the concentration of A1,2A_{1,2}.

For the second term, we may decompose

|wη,2𝔼wη,2|τη,γη,n1/2|(Σ+τη,I)2Σ3/2μ0,g|\displaystyle\big{|}\|w_{\eta,\ast}\|^{2}-\operatorname{\mathbb{E}}\|w_{\eta,\ast}\|^{2}\big{|}\lesssim{\tau_{\eta,\ast}\gamma_{\eta,\ast}}\cdot n^{-1/2}\lvert\langle(\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma^{3/2}\mu_{0},g\rangle\rvert
+γη,2n1(id𝔼)(Σ+τη,I)1Σg2.\displaystyle\qquad+\gamma_{\eta,\ast}^{2}\cdot n^{-1}(\mathrm{id}-\operatorname{\mathbb{E}})\lVert(\Sigma+\tau_{\eta,\ast}I)^{-1}\Sigma g\rVert^{2}.

From here we may handle the concentration of the above two terms in a completely similar fashion to A1,1A_{1,1} and A1,2A_{1,2} above.

For the third term, recall that μ^(Σ,μ0)𝗌𝖾𝗊(γ;τ)=(Σ+τI)1Σ1/2(Σ1/2μ0+γg/n)\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma;\tau)=(\Sigma+\tau I)^{-1}\Sigma^{1/2}\big{(}\Sigma^{1/2}\mu_{0}+\gamma g/\sqrt{n}\big{)}, so

|F(wη,)𝔼F(wη,)|=12|μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,)2𝔼μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,)2|\displaystyle\big{\lvert}F(w_{\eta,\ast})-\operatorname{\mathbb{E}}F(w_{\eta,\ast})\big{\rvert}=\frac{1}{2}\big{\lvert}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})\rVert^{2}-\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})\rVert^{2}\big{\rvert}
γη,n1/2|(Σ+τη,I)2Σ3/2μ0,g|+γη,2n1(id𝔼)(Σ+τη,I)1Σ1/2g2.\displaystyle\lesssim\gamma_{\eta,\ast}\cdot n^{-1/2}\lvert\langle(\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma^{3/2}\mu_{0},g\rangle\rvert+\gamma_{\eta,\ast}^{2}\cdot n^{-1}(\mathrm{id}-\operatorname{\mathbb{E}})\lVert(\Sigma+\tau_{\eta,\ast}I)^{-1}\Sigma^{1/2}g\rVert^{2}.

The concentration properties of the two terms on the right hand side above can be handled similarly to the case for the second term.

For the last term, we have

n1|wη,hξ2𝔼wη,hξ2|\displaystyle n^{-1}\big{\lvert}\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert^{2}-\operatorname{\mathbb{E}}\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert^{2}\big{\rvert}
n1|wη,2h2𝔼wη,2h2|+n1wη,|h,ξ|A4,1+A4,2.\displaystyle\lesssim n^{-1}\big{\lvert}\|w_{\eta,\ast}\|^{2}\lVert h\rVert^{2}-\operatorname{\mathbb{E}}\|w_{\eta,\ast}\|^{2}\lVert h\rVert^{2}\big{\rvert}+n^{-1}\|w_{\eta,\ast}\|\lvert\langle h,\xi\rangle\rvert\equiv A_{4,1}+A_{4,2}.

On the other hand, on the event 1(t/n)\mathscr{E}_{1}(\sqrt{t/n}),

A4,1\displaystyle A_{4,1} (h2/n)|wη,2𝔼wη,2|+n1𝔼wη,2|h2m|t/n,\displaystyle\lesssim({\lVert h\rVert^{2}}/{n})\big{|}\|w_{\eta,\ast}\|^{2}-\operatorname{\mathbb{E}}\|w_{\eta,\ast}\|^{2}\big{|}+n^{-1}\operatorname{\mathbb{E}}\|w_{\eta,\ast}\|^{2}\cdot\lvert\lVert h\rVert^{2}-m\rvert\lesssim\sqrt{t/n},

and A4,2(1eg)n1|h,ξ|t/nA_{4,2}\lesssim(1\vee e_{g})\cdot n^{-1}\lvert\langle h,\xi\rangle\rvert\lesssim\sqrt{t/n}. Combining the above estimates concludes the concentration claim for the last term. ∎

Proof of Proposition 9.10.

Fix ξ1,ξ(t/n)\xi\in\mathscr{E}_{1,\xi}(\sqrt{t/n}). All the constants in ,,\lesssim,\gtrsim,\asymp below may depend on KK.

(Step 1). In this step, we establish both the uniqueness and the a priori estimates for $v_{\eta,n}$. Using Lemma 9.11, we may choose sufficiently large $C,C^{\prime}>0$ depending on $K$ such that for $C^{\prime}\log(en)\leq t\leq n/C^{\prime}$,

ξ(E0(t){max{|(id𝔼)g/n,wη,|,|(id𝔼)wη,2|,|(id𝔼)F(wη,)|,\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}E_{0}(t)\equiv\Big{\{}\max\Big{\{}\lvert(\mathrm{id}-\operatorname{\mathbb{E}})\langle g/\sqrt{n},w_{\eta,\ast}\rangle\rvert,\,\lvert(\mathrm{id}-\operatorname{\mathbb{E}})\|w_{\eta,\ast}\|^{2}\rvert,\,\lvert(\mathrm{id}-\operatorname{\mathbb{E}})F(w_{\eta,\ast})\rvert,
n1|(id𝔼)wη,hξ2|}t/n})1Cet/C.\displaystyle\qquad\qquad\qquad n^{-1}\lvert(\mathrm{id}-\operatorname{\mathbb{E}})\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert^{2}\rvert\Big{\}}\leq\sqrt{t/n}\Big{\}}\Big{)}\geq 1-Ce^{-t/C}.

Therefore, on the event E0(t)E_{0}(t),

g/n,wη,𝔼g/n,wη,t/n=γη,n1tr((Σ+τη,I)1Σ)t/n.\displaystyle\langle g/\sqrt{n},w_{\eta,\ast}\rangle\geq\operatorname{\mathbb{E}}\langle g/\sqrt{n},w_{\eta,\ast}\rangle-\sqrt{t/n}=\gamma_{\eta,\ast}\cdot n^{-1}\mathrm{tr}\big{(}(\Sigma+\tau_{\eta,\ast}I)^{-1}\Sigma\big{)}-\sqrt{t/n}.

Since $n^{-1}\mathrm{tr}\big((\Sigma+\tau_{\eta,\ast}I)^{-1}\Sigma\big)\gtrsim\mathcal{H}_{\Sigma}^{-1}\gtrsim 1$, by choosing $C$ sufficiently large we conclude that $\langle g/\sqrt{n},w_{\eta,\ast}\rangle>0$ on the event $E_{0}(t)$. This implies that $v\mapsto\ell_{\eta}(w_{\eta,\ast},v)$ is $\eta$-strongly concave with respect to $\lVert\cdot\rVert$, so $v_{\eta,n}$ exists uniquely on $E_{0}(t)$.

Next we derive a priori estimates. We claim that on $E_{0}(t)$, $v_{\eta,n}=\operatorname*{arg\,max\,}_{v\in\mathbb{R}^{m}}\ell_{\eta}(w_{\eta,\ast},v)$ takes the following form:

vη,n=1nη(1g,wη,wη,hξ)+(wη,hξ).\displaystyle v_{\eta,n}=\frac{1}{\sqrt{n}\eta}\bigg{(}1-\frac{\langle g,w_{\eta,\ast}\rangle}{\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert}\bigg{)}_{+}\cdot\big{(}\|w_{\eta,\ast}\|h-\xi\big{)}. (9.36)

To see this, using the definition

vη,n\displaystyle v_{\eta,n} =argmaxvm{1n(vg,wη,+wη,h,vv,ξ)ηv22}\displaystyle=\operatorname*{arg\,max\,}_{v\in\mathbb{R}^{m}}\bigg{\{}\frac{1}{\sqrt{n}}\Big{(}-\lVert v\rVert\langle g,w_{\eta,\ast}\rangle+\lVert w_{\eta,\ast}\rVert\langle h,v\rangle-\langle v,\xi\rangle\Big{)}-\frac{\eta\lVert v\rVert^{2}}{2}\bigg{\}}
=argmaxα0{αn(g,wη,+wη,hξ)ηα22}wη,hξwη,hξ\displaystyle=\operatorname*{arg\,max\,}_{\alpha\geq 0}\bigg{\{}\frac{\alpha}{\sqrt{n}}\bigg{(}-\langle g,w_{\eta,\ast}\rangle+\lVert\,\lVert w_{\eta,\ast}\rVert h-\xi\rVert\bigg{)}-\frac{\eta\alpha^{2}}{2}\bigg{\}}\cdot\frac{\|w_{\eta,\ast}\|h-\xi}{\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert}
=1nη(g,wη,+wη,hξ)+wη,hξwη,hξ.\displaystyle=\frac{1}{\sqrt{n}\eta}\bigg{(}-\langle g,w_{\eta,\ast}\rangle+\lVert\,\lVert w_{\eta,\ast}\rVert h-\xi\rVert\bigg{)}_{+}\cdot\frac{\|w_{\eta,\ast}\|h-\xi}{\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert}.

Some simple algebra leads to the expression in (9.36). The boundedness of vη,n\|v_{\eta,n}\| then follows from the boundedness of wη,\|w_{\eta,\ast}\|.
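The reduction to a one-dimensional problem used above can be sketched as follows. The shorthands $u\equiv\|w_{\eta,\ast}\|h-\xi$ and $c\equiv\langle g,w_{\eta,\ast}\rangle$ are introduced here for illustration only, and the sketch assumes $u\neq 0$.

```latex
\begin{align*}
&\text{Write } v=\alpha d \text{ with } \alpha\geq 0,\ \lVert d\rVert=1. \text{ By Cauchy--Schwarz, } \langle u,d\rangle\leq\lVert u\rVert
\text{ with equality at } d=u/\lVert u\rVert, \text{ so}\\
&\max_{v\in\mathbb{R}^{m}}\Big\{\frac{1}{\sqrt{n}}\big(-\lVert v\rVert c+\langle u,v\rangle\big)-\frac{\eta\lVert v\rVert^{2}}{2}\Big\}
=\max_{\alpha\geq 0}\Big\{\frac{\alpha}{\sqrt{n}}\big(\lVert u\rVert-c\big)-\frac{\eta\alpha^{2}}{2}\Big\}.\\
&\text{The concave quadratic in } \alpha \text{ is maximized over } \alpha\geq 0 \text{ at }
\alpha_{\ast}=\frac{(\lVert u\rVert-c)_{+}}{\sqrt{n}\eta}, \text{ hence}\\
&v_{\eta,n}=\alpha_{\ast}\frac{u}{\lVert u\rVert}
=\frac{1}{\sqrt{n}\eta}\Big(1-\frac{c}{\lVert u\rVert}\Big)_{+}u.
\end{align*}
```

The last expression is (9.36) written in the shorthand above.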

(Step 2). In this step, we establish the bound on vη,nvη,\|v_{\eta,n}-v_{\eta,\ast}\|. The key observation is that we may rewrite vη,v_{\eta,\ast} defined via (9.35) into the following form

vη,=1nη(1𝔼g,wη,𝔼1/2wη,hξ2)(𝔼1/2wη,2hξ).\displaystyle v_{\eta,\ast}=\frac{1}{\sqrt{n}\eta}\bigg{(}1-\frac{\operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle}{\operatorname{\mathbb{E}}^{1/2}\lVert\,\|w_{\eta,\ast}\|\cdot h-\xi\rVert^{2}}\bigg{)}\cdot\big{(}\operatorname{\mathbb{E}}^{1/2}\|w_{\eta,\ast}\|^{2}\cdot h-\xi\big{)}. (9.37)

This can be seen by observing

{𝔼wη,2=𝔼𝖾𝗋𝗋(Σ,μ0)(γη,;τη,)=ϕγη,2σξ2,𝔼g,wη,=nγη,𝔼𝖽𝗈𝖿(Σ,μ0)(γη,;τη,)=nγη,(ϕητη,),𝔼1/2wη,hξ2=m(𝔼wη,2+σξ2)1/2=mϕγη,,\displaystyle\begin{cases}\operatorname{\mathbb{E}}\|w_{\eta,\ast}\|^{2}=\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma_{\eta,\ast};\tau_{\eta,\ast})=\phi\gamma_{\eta,\ast}^{2}-\sigma_{\xi}^{2},\\ \operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle=\frac{\sqrt{n}}{\gamma_{\eta,\ast}}\cdot\operatorname{\mathbb{E}}\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma_{\eta,\ast};\tau_{\eta,\ast})=\sqrt{n}\gamma_{\eta,\ast}\cdot\big{(}\phi-\frac{\eta}{\tau_{\eta,\ast}}\big{)},\\ \operatorname{\mathbb{E}}^{1/2}\lVert\,\|w_{\eta,\ast}\|\cdot h-\xi\rVert^{2}=\sqrt{m}\big{(}\operatorname{\mathbb{E}}\|w_{\eta,\ast}\|^{2}+\sigma_{\xi}^{2}\big{)}^{1/2}=\sqrt{m\phi}\gamma_{\eta,\ast},\end{cases} (9.38)

and therefore 1𝔼g,wη,𝔼1/2wη,hξ2=ηϕτη,1-\frac{\operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle}{\operatorname{\mathbb{E}}^{1/2}\lVert\,\|w_{\eta,\ast}\|\cdot h-\xi\rVert^{2}}=\frac{\eta}{\phi\tau_{\eta,\ast}}. Now with (9.36)-(9.37), we may use Lemma 9.11 to estimate

vη,nvη,\displaystyle\lVert v_{\eta,n}-v_{\eta,\ast}\rVert 1nη|wη,𝔼1/2wη,2|h\displaystyle\leq\frac{1}{\sqrt{n}\eta}\big{\lvert}\|w_{\eta,\ast}\|-\operatorname{\mathbb{E}}^{1/2}\|w_{\eta,\ast}\|^{2}\big{\rvert}\cdot\lVert h\rVert
+1nη|g,wη,wη,hξ𝔼g,wη,𝔼1/2wη,hξ2|𝔼1/2wη,2hξ\displaystyle\quad+\frac{1}{\sqrt{n}\eta}\bigg{\lvert}\frac{\langle g,w_{\eta,\ast}\rangle}{\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert}-\frac{\operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle}{\operatorname{\mathbb{E}}^{1/2}\lVert\,\|w_{\eta,\ast}\|\cdot h-\xi\rVert^{2}}\bigg{\rvert}\lVert\operatorname{\mathbb{E}}^{1/2}\|w_{\eta,\ast}\|^{2}\cdot h-\xi\rVert
V1+V2.\displaystyle\equiv V_{1}+V_{2}. (9.39)
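The identity $1-\operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle/\operatorname{\mathbb{E}}^{1/2}\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert^{2}=\eta/(\phi\tau_{\eta,\ast})$ stated before (9.39), and the agreement of (9.37) with (9.35), can be checked from (9.38). The sketch below assumes the aspect ratio convention $\phi=m/n$, so that $\sqrt{m\phi}=\sqrt{n}\phi$; this convention is what makes the second and third lines of (9.38) consistent with the stated identity.

```latex
\begin{align*}
1-\frac{\operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle}{\operatorname{\mathbb{E}}^{1/2}\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert^{2}}
&=1-\frac{\sqrt{n}\gamma_{\eta,\ast}\big(\phi-\eta/\tau_{\eta,\ast}\big)}{\sqrt{n}\phi\gamma_{\eta,\ast}}
=1-\frac{1}{\phi}\Big(\phi-\frac{\eta}{\tau_{\eta,\ast}}\Big)
=\frac{\eta}{\phi\tau_{\eta,\ast}},\\
\frac{1}{\sqrt{n}\eta}\cdot\frac{\eta}{\phi\tau_{\eta,\ast}}\cdot\Big(\sqrt{\phi\gamma_{\eta,\ast}^{2}-\sigma_{\xi}^{2}}\,h-\xi\Big)
&=\frac{1}{\phi\tau_{\eta,\ast}}\bigg(\sqrt{\phi\gamma_{\eta,\ast}^{2}-\sigma_{\xi}^{2}}\cdot\frac{h}{\sqrt{n}}-\frac{\xi}{\sqrt{n}}\bigg).
\end{align*}
```

The second line uses $\operatorname{\mathbb{E}}\|w_{\eta,\ast}\|^{2}=\phi\gamma_{\eta,\ast}^{2}-\sigma_{\xi}^{2}$ from (9.38), and its right hand side is exactly the definition (9.35).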

We first handle the term V1V_{1}. As 𝔼wη,2γη,2tr(Σ2(Σ+τη,)2)/n1\operatorname{\mathbb{E}}\lVert w_{\eta,\ast}\rVert^{2}\geq\gamma_{\eta,\ast}^{2}\mathrm{tr}\big{(}\Sigma^{2}(\Sigma+\tau_{\eta,\ast})^{-2}\big{)}/n\gtrsim 1, on the event E0(t)1,0(t/n)E_{0}(t)\cap\mathscr{E}_{1,0}(\sqrt{t/n}),

V1hn|wη,2𝔼wη,2|𝔼1/2wη,2t/n.\displaystyle V_{1}\lesssim\frac{\lVert h\rVert}{\sqrt{n}}\cdot\frac{\lvert\,\|w_{\eta,\ast}\|^{2}-\operatorname{\mathbb{E}}\|w_{\eta,\ast}\|^{2}\rvert}{\operatorname{\mathbb{E}}^{1/2}\|w_{\eta,\ast}\|^{2}}\lesssim\sqrt{t/n}. (9.40)

Next we handle V2V_{2}. On the event E0(t)1,0(t/n)E_{0}(t)\cap\mathscr{E}_{1,0}(\sqrt{t/n}),

V2\displaystyle V_{2} wη,hξ1|g,wη,𝔼g,wη,|\displaystyle\lesssim\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert^{-1}\cdot\big{\lvert}\langle g,w_{\eta,\ast}\rangle-\operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle\big{\rvert}
+𝔼g,wη,|wη,hξ1𝔼1/2wη,hξ2|\displaystyle\qquad+\operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle\cdot\big{\lvert}\,\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert^{-1}-\operatorname{\mathbb{E}}^{-1/2}\lVert\,\|w_{\eta,\ast}\|\cdot h-\xi\rVert^{2}\big{\rvert}
n1/2|g,wη,𝔼g,wη,|+n1/2|wη,hξ𝔼1/2wη,hξ2|\displaystyle\lesssim n^{-1/2}\big{\lvert}\langle g,w_{\eta,\ast}\rangle-\operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle\big{\rvert}+n^{-1/2}\big{\lvert}\,\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert-\operatorname{\mathbb{E}}^{1/2}\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert^{2}\big{\rvert}
n1/2|g,wη,𝔼g,wη,|+n1|wη,hξ2𝔼wη,hξ2|\displaystyle\lesssim n^{-1/2}\big{\lvert}\langle g,w_{\eta,\ast}\rangle-\operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle\big{\rvert}+n^{-1}\big{\lvert}\,\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert^{2}-\operatorname{\mathbb{E}}\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert^{2}\big{\rvert}
t/n.\displaystyle\lesssim\sqrt{t/n}. (9.41)

The desired estimate for $\|v_{\eta,n}-v_{\eta,\ast}\|$ follows from (9.39)-(9.41).

(Step 3). In this step, we prove the claimed bound on |maxvη(wη,,v)𝖣¯η(βη,,γη,)||\max_{v}\ell_{\eta}(w_{\eta,\ast},v)-\overline{\mathsf{D}}_{\eta}(\beta_{\eta,\ast},\gamma_{\eta,\ast})|. First note that

maxvmη(wη,,v)\displaystyle\max_{v\in\mathbb{R}^{m}}\ell_{\eta}(w_{\eta,\ast},v)
maxvm{1n(vg,wη,+wη,h,vv,ξ)+F(wη,)ηv22}\displaystyle\equiv\max_{v\in\mathbb{R}^{m}}\bigg{\{}\frac{1}{\sqrt{n}}\Big{(}-\lVert v\rVert\langle g,w_{\eta,\ast}\rangle+\lVert w_{\eta,\ast}\rVert\langle h,v\rangle-\langle v,\xi\rangle\Big{)}+F(w_{\eta,\ast})-\frac{\eta\lVert v\rVert^{2}}{2}\bigg{\}}
=12nη(wη,hξg,wη,)+2+F(wη,).\displaystyle=\frac{1}{2n\eta}\big{(}\|\,\|w_{\eta,\ast}\|h-\xi\|-\langle g,w_{\eta,\ast}\rangle\big{)}_{+}^{2}+F(w_{\eta,\ast}). (9.42)
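The closed form in (9.42) follows from a scalar maximization. In the sketch below, $a\equiv\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert-\langle g,w_{\eta,\ast}\rangle$ is a shorthand introduced here for illustration, and $\alpha=\lVert v\rVert$ after aligning $v$ with $\|w_{\eta,\ast}\|h-\xi$.

```latex
\begin{align*}
\max_{\alpha\geq 0}\Big\{\frac{\alpha a}{\sqrt{n}}-\frac{\eta\alpha^{2}}{2}\Big\}
&=\frac{\alpha_{\ast}a}{\sqrt{n}}-\frac{\eta\alpha_{\ast}^{2}}{2}\bigg|_{\alpha_{\ast}=a_{+}/(\sqrt{n}\eta)}
=\frac{a_{+}^{2}}{n\eta}-\frac{a_{+}^{2}}{2n\eta}
=\frac{a_{+}^{2}}{2n\eta}.
\end{align*}
```

Adding the $v$-independent term $F(w_{\eta,\ast})$ gives the right hand side of (9.42).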

On the other hand, with #η;(Σ,μ0)#(Σ,μ0)(γη,;τη,)\#_{\eta;(\Sigma,\mu_{0})}^{\ast}\equiv\#_{(\Sigma,\mu_{0})}(\gamma_{\eta,\ast};\tau_{\eta,\ast}), #{𝖾𝗋𝗋,𝖽𝗈𝖿}\#\in\{\operatorname{\mathsf{err}},\operatorname{\mathsf{dof}}\},

𝔼𝖾F(γη,ng;γη,βη,)\displaystyle\operatorname{\mathbb{E}}\mathsf{e}_{F}\bigg{(}\frac{\gamma_{\eta,\ast}}{\sqrt{n}}g;\frac{\gamma_{\eta,\ast}}{\beta_{\eta,\ast}}\bigg{)} =βη,2γη,(𝔼𝖾𝗋𝗋η;(Σ,μ0)2𝔼𝖽𝗈𝖿η;(Σ,μ0)+γη,2)+𝔼F(wη,),\displaystyle=\frac{\beta_{\eta,\ast}}{2\gamma_{\eta,\ast}}\big{(}\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{\eta;(\Sigma,\mu_{0})}^{\ast}-2\operatorname{\mathbb{E}}\operatorname{\mathsf{dof}}_{\eta;(\Sigma,\mu_{0})}^{\ast}+\gamma_{\eta,\ast}^{2}\big{)}+\operatorname{\mathbb{E}}F(w_{\eta,\ast}),

so we may rewrite maxβ>0minγ>0𝖣¯η(β,γ)=𝖣¯η(βη,,γη,)\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)=\overline{\mathsf{D}}_{\eta}(\beta_{\eta,\ast},\gamma_{\eta,\ast}) as follows:

𝖣¯η(βη,,γη,)\displaystyle\overline{\mathsf{D}}_{\eta}(\beta_{\eta,\ast},\gamma_{\eta,\ast}) =βη,2(γη,(ϕ1)+σξ2γη,)ηβη,22+𝔼𝖾F(γη,ng;γη,βη,)\displaystyle=\frac{\beta_{\eta,\ast}}{2}\bigg{(}\gamma_{\eta,\ast}\big{(}\phi-1\big{)}+\frac{\sigma_{\xi}^{2}}{\gamma_{\eta,\ast}}\bigg{)}-\frac{\eta\beta_{\eta,\ast}^{2}}{2}+\operatorname{\mathbb{E}}\mathsf{e}_{F}\bigg{(}\frac{\gamma_{\eta,\ast}}{\sqrt{n}}g;\frac{\gamma_{\eta,\ast}}{\beta_{\eta,\ast}}\bigg{)}
=βη,2γη,(ϕγη,2+σξ2+𝔼𝖾𝗋𝗋η;(Σ,μ0)2𝔼𝖽𝗈𝖿η;(Σ,μ0))ηβη,22+𝔼F(wη,)\displaystyle=\frac{\beta_{\eta,\ast}}{2\gamma_{\eta,\ast}}\Big{(}\phi\gamma_{\eta,\ast}^{2}+\sigma_{\xi}^{2}+\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{\eta;(\Sigma,\mu_{0})}^{\ast}-2\operatorname{\mathbb{E}}\operatorname{\mathsf{dof}}_{\eta;(\Sigma,\mu_{0})}^{\ast}\Big{)}-\frac{\eta\beta_{\eta,\ast}^{2}}{2}+\operatorname{\mathbb{E}}F(w_{\eta,\ast})
=βη,γη,(ϕγη,2𝔼𝖽𝗈𝖿η;(Σ,μ0))ηβη,22+𝔼F(wη,).\displaystyle=\frac{\beta_{\eta,\ast}}{\gamma_{\eta,\ast}}\Big{(}\phi\gamma_{\eta,\ast}^{2}-\operatorname{\mathbb{E}}\operatorname{\mathsf{dof}}_{\eta;(\Sigma,\mu_{0})}^{\ast}\Big{)}-\frac{\eta\beta_{\eta,\ast}^{2}}{2}+\operatorname{\mathbb{E}}F(w_{\eta,\ast}).

Further using the second and third equations in (9.38), it now follows that

maxβ>0minγ>0𝖣¯η(β,γ)\displaystyle\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma) =βη,n(𝔼1/2wη,hξ2𝔼g,wη,)ηβη,22+𝔼F(wη,)\displaystyle=\frac{\beta_{\eta,\ast}}{\sqrt{n}}\Big{(}\operatorname{\mathbb{E}}^{1/2}\|\,\|w_{\eta,\ast}\|h-\xi\|^{2}-\operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle\Big{)}-\frac{\eta\beta_{\eta,\ast}^{2}}{2}+\operatorname{\mathbb{E}}F(w_{\eta,\ast})
=12nη(𝔼1/2wη,hξ2𝔼g,wη,)2+𝔼F(wη,).\displaystyle=\frac{1}{2n\eta}\Big{(}\operatorname{\mathbb{E}}^{1/2}\|\,\|w_{\eta,\ast}\|h-\xi\|^{2}-\operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle\Big{)}^{2}+\operatorname{\mathbb{E}}F(w_{\eta,\ast}). (9.43)
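The passage from the first to the second line of (9.43) can be sketched as follows. Here $a_{\ast}\equiv\operatorname{\mathbb{E}}^{1/2}\|\,\|w_{\eta,\ast}\|h-\xi\|^{2}-\operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle$ is a shorthand introduced for illustration. The sketch assumes $\phi=m/n$ and the relation $\beta_{\eta,\ast}=\gamma_{\eta,\ast}/\tau_{\eta,\ast}$, which is suggested by the envelope scale $\gamma_{\eta,\ast}/\beta_{\eta,\ast}$ appearing in $\mathsf{e}_{F}$ but is fixed in an earlier part of the paper not reproduced here.

```latex
\begin{align*}
a_{\ast}&=\sqrt{n}\phi\gamma_{\eta,\ast}-\sqrt{n}\gamma_{\eta,\ast}\Big(\phi-\frac{\eta}{\tau_{\eta,\ast}}\Big)
=\frac{\sqrt{n}\gamma_{\eta,\ast}\eta}{\tau_{\eta,\ast}},
\qquad\text{so}\qquad
\frac{a_{\ast}}{\sqrt{n}\eta}=\frac{\gamma_{\eta,\ast}}{\tau_{\eta,\ast}}=\beta_{\eta,\ast},\\
\frac{\beta_{\eta,\ast}a_{\ast}}{\sqrt{n}}-\frac{\eta\beta_{\eta,\ast}^{2}}{2}
&=\frac{a_{\ast}^{2}}{n\eta}-\frac{a_{\ast}^{2}}{2n\eta}
=\frac{a_{\ast}^{2}}{2n\eta}.
\end{align*}
```

In other words, under the stated assumption $\beta_{\eta,\ast}$ is the maximizer of the concave quadratic $\beta\mapsto\beta a_{\ast}/\sqrt{n}-\eta\beta^{2}/2$, and the maximum value is the first term on the second line of (9.43).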

Now combining (9.42) and (9.43), on the event $E_{0}(t)\cap\mathscr{E}_{1,0}(\sqrt{t/n})$, we may estimate

|maxvmη(wη,,v)maxβ>0minγ>0𝖣¯η(β,γ)|\displaystyle\big{\lvert}\max_{v\in\mathbb{R}^{m}}\ell_{\eta}(w_{\eta,\ast},v)-\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)\big{\rvert}
n1/2|g,wη,𝔼g,wη,|+n1/2|wη,hξ𝔼1/2wη,hξ2|\displaystyle\lesssim n^{-1/2}\big{\lvert}\langle g,w_{\eta,\ast}\rangle-\operatorname{\mathbb{E}}\langle g,w_{\eta,\ast}\rangle\big{\rvert}+n^{-1/2}\big{\lvert}\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert-\operatorname{\mathbb{E}}^{1/2}\lVert\,\|w_{\eta,\ast}\|h-\xi\rVert^{2}\big{\rvert}
+|F(wη,)𝔼F(wη,)|t/n,\displaystyle\qquad\qquad+\lvert F(w_{\eta,\ast})-\operatorname{\mathbb{E}}F(w_{\eta,\ast})\rvert\lesssim\sqrt{t/n},

completing the proof. ∎

Proof of Theorem 2.3 for r^η\widehat{r}_{\eta}.

Fix ξ1,ξ(t/n)\xi\in\mathscr{E}_{1,\xi}(\sqrt{t/n}). All the constants in ,,\lesssim,\gtrsim,\asymp below may depend on KK. We sometimes write 𝒟¯ηmaxβ>0minγ>0𝖣¯η(β,γ)\overline{\mathscr{D}}_{\eta}\equiv\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma).

As $\widehat{r}_{\eta}=-\eta\widehat{v}_{\eta}$, we only need to study $\widehat{v}_{\eta}$. Fix $\varepsilon>0$ and any $1$-Lipschitz function $\mathsf{h}:\mathbb{R}^{m}\to\mathbb{R}$, and let

Dη;ε(𝗁){vm:|𝗁(v)𝔼ξ𝗁(vη,)|ε}.\displaystyle D_{\eta;\varepsilon}(\mathsf{h})\equiv\big{\{}v\in\mathbb{R}^{m}:|\mathsf{h}(v)-\operatorname{\mathbb{E}}^{\xi}\mathsf{h}({v}_{\eta,\ast})|\geq\varepsilon\big{\}}.

(Step 1). In this step we establish the Gordon cost cap: there exist constants C1,C1>0C_{1},C_{1}^{\prime}>0 depending on KK such that for C1log(en)tn/C1C_{1}^{\prime}\log(en)\leq t\leq n/C_{1}^{\prime},

ξ(E1(t)c{maxvDη;C1(t/n)1/4(𝗁)η(wη,,v)𝒟¯ηC11t/n})C1et/C1.\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}E_{1}(t)^{c}\equiv\Big{\{}\max_{v\in D_{\eta;C_{1}(t/n)^{1/4}}(\mathsf{h})}\ell_{\eta}(w_{\eta,\ast},v)\geq\overline{\mathscr{D}}_{\eta}-C_{1}^{-1}\sqrt{t/n}\Big{\}}\Big{)}\leq C_{1}e^{-t/C_{1}}. (9.44)

To this end, first note that by the Lipschitz property of $\mathsf{h}$, Gaussian concentration and Proposition 9.10, there exist some $C_{0},C_{0}^{\prime}>0$ depending on $K$ such that for $C_{0}^{\prime}\log(en)\leq t\leq n/C_{0}^{\prime}$, on an event $E_{1,0}(t)$ with probability at least $1-C_{0}e^{-t/C_{0}}$, we have uniformly in $v\in D_{\eta;\varepsilon}(\mathsf{h})$,

ε\displaystyle\varepsilon\leq |𝗁(v)𝔼ξ𝗁(vη,)||𝗁(v)𝗁(vη,)|+|𝗁(vη,)𝔼ξ𝗁(vη,)|\displaystyle|\mathsf{h}(v)-\operatorname{\mathbb{E}}^{\xi}\mathsf{h}({v}_{\eta,\ast})|\leq|\mathsf{h}(v)-\mathsf{h}({v}_{\eta,\ast})|+|\mathsf{h}(v_{\eta,\ast})-\operatorname{\mathbb{E}}^{\xi}\mathsf{h}({v}_{\eta,\ast})|
\displaystyle\leq vvη,n+vη,vη,n+Ct/nvvη,n+C0t/n,\displaystyle\|v-{v}_{\eta,n}\|+\|v_{\eta,\ast}-v_{\eta,n}\|+C\sqrt{t/n}\leq\|v-{v}_{\eta,n}\|+C_{0}\sqrt{t/n},

and all the properties in Proposition 9.10 hold. In other words, on $E_{1,0}(t)$ with the prescribed range of $t$,

\displaystyle\inf_{v\in D_{\eta;\varepsilon}(\mathsf{h})}\lVert v-v_{\eta,n}\rVert\geq\big(\varepsilon-C_{0}\sqrt{t/n}\big)_{+}.

Using the η\eta-strong concavity of vη(wη,,v)v\mapsto\ell_{\eta}(w_{\eta,\ast},v) on E1,0(t)E_{1,0}(t), we have

\displaystyle\max_{v\in D_{\eta;\varepsilon}(\mathsf{h})}\ell_{\eta}(w_{\eta,\ast},v)\leq\ell_{\eta}(w_{\eta,\ast},v_{\eta,n})-\frac{\eta}{2}\inf_{v\in D_{\eta;\varepsilon}(\mathsf{h})}\lVert v-v_{\eta,n}\rVert^{2}
\displaystyle\leq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)-\frac{\eta}{2}\big(\varepsilon-C_{0}\sqrt{t/n}\big)_{+}^{2}+C_{1}\sqrt{t/n}.

By choosing εεη;v(t,n)C0t/n+2C1/η(t/n)1/4\varepsilon\equiv\varepsilon_{\eta;v}(t,n)\equiv C_{0}\sqrt{t/n}+2\sqrt{C_{1}/\eta}\cdot(t/n)^{1/4}, we have on E1,0(t)E_{1,0}(t),

maxvDη;εη;v(t,n)(𝗁)η(wη,,v)maxβ>0minγ>0𝖣¯η(β,γ)C1t/n.\displaystyle\max_{v\in D_{\eta;\varepsilon_{\eta;v}(t,n)}(\mathsf{h})}\ell_{\eta}(w_{\eta,\ast},v)\leq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)-C_{1}\sqrt{t/n}. (9.45)

Adjusting constants proves the claim in (9.44).
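For convenience, the arithmetic behind the choice of $\varepsilon$ can be spelled out as follows. This is only a routine verification that uses the quantities in the two preceding displays; no new constants are introduced.

```latex
\begin{align*}
\varepsilon_{\eta;v}(t,n)-C_{0}\sqrt{t/n}
  &= 2\sqrt{C_{1}/\eta}\cdot(t/n)^{1/4},\\
\frac{\eta}{2}\big(\varepsilon_{\eta;v}(t,n)-C_{0}\sqrt{t/n}\big)_{+}^{2}
  &= \frac{\eta}{2}\cdot\frac{4C_{1}}{\eta}\cdot\sqrt{t/n}
   = 2C_{1}\sqrt{t/n},\\
-\frac{\eta}{2}\big(\varepsilon_{\eta;v}(t,n)-C_{0}\sqrt{t/n}\big)_{+}^{2}+C_{1}\sqrt{t/n}
  &= -C_{1}\sqrt{t/n}.
\end{align*}
```

Since $t\leq n$ gives $\sqrt{t/n}\leq(t/n)^{1/4}$, and $\eta$ is bounded away from $0$ and $\infty$ in terms of $K$, the radius $\varepsilon_{\eta;v}(t,n)$ is at most a constant multiple of $(t/n)^{1/4}$. As the set $D_{\eta;\varepsilon}(\mathsf{h})$ shrinks when $\varepsilon$ grows, the bound (9.45) transfers to the radius $C_{1}(t/n)^{1/4}$ in (9.44) after enlarging the constant.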

(Step 2). In this step, we provide an upper bound for the original cost over the exceptional set. More concretely, we will prove that there exist constants $C_{2},C_{2}^{\prime}>0$ depending on $K$ such that for any $L_{v}>0$ and $C_{2}^{\prime}\log(en)\leq t\leq n/C_{2}^{\prime}$,

ξ(E2(t)c{maxvDη;C2(t/n)1/4(𝗁)Bm(Lv)minwnhη(w,v)\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}E_{2}(t)^{c}\equiv\Big{\{}\max_{v\in D_{\eta;C_{2}(t/n)^{1/4}}(\mathsf{h})\cap B_{m}(L_{v})}\min_{w\in\mathbb{R}^{n}}h_{\eta}(w,v)
𝒟¯ηC21t/n})C2et/C2.\displaystyle\qquad\qquad\geq\overline{\mathscr{D}}_{\eta}-C_{2}^{-1}\sqrt{t/n}\Big{\}}\Big{)}\leq C_{2}e^{-t/C_{2}}. (9.46)

To see this, first note that by Proposition 9.9, there exists some $C_{2}=C_{2}(K)>0$ such that on an event $E_{2,0}$ with $\operatorname{\mathbb{P}}^{\xi}(E_{2,0})\geq 1-C_{2}e^{-n/C_{2}}$, we have $\lVert w_{\eta,\ast}\rVert\leq C_{2}$. So with $\bar{z}_{\eta;v}(t,n)\equiv\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)-C_{1}^{-1}\sqrt{t/n}$, for any $L_{v}>0$, an application of Theorem 6.1-(1) yields that for $C_{1}^{\prime}\log(en)\leq t\leq n/C_{1}^{\prime}$,

ξ(maxvDη;C1(t/n)1/4(𝗁)Bm(Lv)minwnhη(w,v)z¯η;v(t,n))\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in D_{\eta;C_{1}(t/n)^{1/4}}(\mathsf{h})\cap B_{m}(L_{v})}\min_{w\in\mathbb{R}^{n}}h_{\eta}(w,v)\geq\bar{z}_{\eta;v}(t,n)\Big{)}
ξ(maxvDη;C1(t/n)1/4(𝗁)Bm(Lv)minwBn(C2)hη(w,v)z¯η;v(t,n))\displaystyle\leq\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in D_{\eta;C_{1}(t/n)^{1/4}}(\mathsf{h})\cap B_{m}(L_{v})}\min_{w\in B_{n}(C_{2})}h_{\eta}(w,v)\geq\bar{z}_{\eta;v}(t,n)\Big{)}
2ξ(maxvDη;C1(t/n)1/4(𝗁)Bm(Lv)minwBn(C2)η(w,v)z¯η;v(t,n))\displaystyle\leq 2\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in D_{\eta;C_{1}(t/n)^{1/4}}(\mathsf{h})\cap B_{m}(L_{v})}\min_{w\in B_{n}(C_{2})}\ell_{\eta}(w,v)\geq\bar{z}_{\eta;v}(t,n)\Big{)}
2ξ(maxvDη;C1(t/n)1/4(𝗁)Bm(Lv)η(wη,,v)z¯η;v(t,n))+2ξ(E2,0c)\displaystyle\leq 2\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in D_{\eta;C_{1}(t/n)^{1/4}}(\mathsf{h})\cap B_{m}(L_{v})}\ell_{\eta}(w_{\eta,\ast},v)\geq\bar{z}_{\eta;v}(t,n)\Big{)}+2\operatorname{\mathbb{P}}^{\xi}(E_{2,0}^{c})
2ξ(maxvDη;C1(t/n)1/4(𝗁)η(wη,,v)z¯η;v(t,n))+2ξ(E2,0c)Cet/C,\displaystyle\leq 2\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in D_{\eta;C_{1}(t/n)^{1/4}}(\mathsf{h})}\ell_{\eta}(w_{\eta,\ast},v)\geq\bar{z}_{\eta;v}(t,n)\Big{)}+2\operatorname{\mathbb{P}}^{\xi}(E_{2,0}^{c})\leq Ce^{-t/C},

proving the claim (9.46) by possibly adjusting constants.

(Step 3). In this step, we recall a lower bound for the original cost optimum, essentially established in Step 1 of the proof of Theorem 2.3. In particular, using (9.22), (9.23) and (9.27), there exist $C_{3},C_{3}^{\prime},C_{3}^{\prime\prime}>0$ depending on $K$ such that for $C_{3}^{\prime}\log(en)\leq t\leq n/C_{3}^{\prime}$,

ξ(E3,0(t)c{maxvBm(C3)minwnhη(w,v)𝒟¯ηC31t/n})C3et/C3,\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}E_{3,0}(t)^{c}\equiv\Big{\{}\max_{v\in B_{m}(C_{3}^{\prime\prime})}\min_{w\in\mathbb{R}^{n}}h_{\eta}(w,v)\leq\overline{\mathscr{D}}_{\eta}-C_{3}^{-1}\sqrt{t/n}\Big{\}}\Big{)}\leq C_{3}e^{-t/C_{3}}, (9.47)

and

\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big(E_{3,1}^{c}\equiv\Big\{\max_{v\in B_{m}(C_{3}^{\prime\prime})}\min_{w\in\mathbb{R}^{n}}h_{\eta}(w,v)\neq\max_{v\in\mathbb{R}^{m}}\min_{w\in\mathbb{R}^{n}}h_{\eta}(w,v)\Big\}\Big)\leq C_{3}e^{-n/C_{3}}. (9.48)

(Step 4). By choosing without loss of generality $C_{3}>C_{2}$, on the event $E_{2}(t)\cap E_{3,0}(t)\cap E_{3,1}$, (9.46) with $L_{v}\equiv C_{3}^{\prime\prime}$ together with (9.47)-(9.48) yield that for any $C^{\prime}\log(en)\leq t\leq n/C^{\prime}$,

\displaystyle\max_{v\in D_{\eta;C_{2}(t/n)^{1/4}}(\mathsf{h})\cap B_{m}(C_{3}^{\prime\prime})}\min_{w\in\mathbb{R}^{n}}h_{\eta}(w,v)\leq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)-C_{2}^{-1}\sqrt{t/n}
\displaystyle<\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)-C_{3}^{-1}\sqrt{t/n}
\displaystyle\leq\max_{v\in B_{m}(C_{3}^{\prime\prime})}\min_{w\in\mathbb{R}^{n}}h_{\eta}(w,v)=\max_{v\in\mathbb{R}^{m}}\min_{w\in\mathbb{R}^{n}}h_{\eta}(w,v).

This means that on the event $E_{2}(t)\cap E_{3,0}(t)\cap E_{3,1}$, $\widehat{v}_{\eta}\notin D_{\eta;C_{2}(t/n)^{1/4}}(\mathsf{h})$. In other words, there exist some $C_{4},C_{4}^{\prime}>0$ depending on $K$ such that for $C_{4}^{\prime}\log(en)\leq t\leq n/C_{4}^{\prime}$ and $1/K\leq\eta\leq K$,

ξ(|𝗁(v^η)𝔼ξ𝗁(vη,)|C4(t/n)1/4)C4et/C4.\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\lvert\mathsf{h}(\widehat{v}_{\eta})-\operatorname{\mathbb{E}}^{\xi}\mathsf{h}(v_{\eta,\ast})\rvert\geq C_{4}(t/n)^{1/4}\Big{)}\leq C_{4}e^{-t/C_{4}}. (9.49)

(Step 5). In this final step, we prove a uniform version of the estimate (9.49). For $\eta_{1},\eta_{2}\in[1/K,K]$, using the definition of $\widehat{v}_{\eta}$ in (9.34),

|𝗁(v^η1)𝗁(v^η2)|v^η1v^η2\displaystyle\lvert\mathsf{h}(\widehat{v}_{\eta_{1}})-\mathsf{h}(\widehat{v}_{\eta_{2}})\rvert\leq\lVert\widehat{v}_{\eta_{1}}-\widehat{v}_{\eta_{2}}\rVert
n1/2η11Gw^η1η21Gw^η2+(ξ/n)|η11η21|\displaystyle\leq n^{-1/2}\big{\lVert}\eta_{1}^{-1}G\widehat{w}_{\eta_{1}}-\eta_{2}^{-1}G\widehat{w}_{\eta_{2}}\big{\rVert}+({\lVert\xi\rVert}/{\sqrt{n}})\cdot\lvert\eta_{1}^{-1}-\eta_{2}^{-1}\rvert
Gw^η1+ξn|η11η21|+1nη2G(w^η1w^η2)\displaystyle\leq\frac{\lVert G\widehat{w}_{\eta_{1}}\rVert+\lVert\xi\rVert}{\sqrt{n}}\cdot\lvert\eta_{1}^{-1}-\eta_{2}^{-1}\rvert+\frac{1}{\sqrt{n}\eta_{2}}\big{\lVert}G(\widehat{w}_{\eta_{1}}-\widehat{w}_{\eta_{2}})\big{\rVert}
(1+μ^η1Gopn)|η1η2|+Gopnμ^η1μ^η2.\displaystyle\lesssim\Big{(}1+\lVert\widehat{\mu}_{\eta_{1}}\rVert\frac{\lVert G\rVert_{\operatorname{op}}}{\sqrt{n}}\Big{)}\cdot\lvert\eta_{1}-\eta_{2}\rvert+\frac{\lVert G\rVert_{\operatorname{op}}}{\sqrt{n}}\cdot\lVert\widehat{\mu}_{\eta_{1}}-\widehat{\mu}_{\eta_{2}}\rVert.

Using that μ^η=n1(XX/n+ηI)1XYXY/(nη)(1+Gop/n)2\lVert\widehat{\mu}_{\eta}\rVert=\lVert n^{-1}\big{(}X^{\top}X/n+\eta I\big{)}^{-1}X^{\top}Y\rVert\leq\lVert X^{\top}Y\rVert/(n\eta)\lesssim\big{(}1+{\lVert G\rVert_{\operatorname{op}}}/{\sqrt{n}}\big{)}^{2}, we have

|𝗁(v^η1)𝗁(v^η2)|(1+Gop/n)3(|η1η2|μ^η1μ^η2).\displaystyle\lvert\mathsf{h}(\widehat{v}_{\eta_{1}})-\mathsf{h}(\widehat{v}_{\eta_{2}})\rvert\lesssim\big{(}1+{\lVert G\rVert_{\operatorname{op}}}/{\sqrt{n}}\big{)}^{3}\cdot\big{(}\lvert\eta_{1}-\eta_{2}\rvert\vee\lVert\widehat{\mu}_{\eta_{1}}-\widehat{\mu}_{\eta_{2}}\rVert\big{)}. (9.50)
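The dependence of the Ridge estimator on $\eta$ that enters (9.50) can be checked numerically through the resolvent identity $\widehat{\mu}_{\eta_{1}}-\widehat{\mu}_{\eta_{2}}=(\eta_{2}-\eta_{1})(X^{\top}X/n+\eta_{1}I)^{-1}\widehat{\mu}_{\eta_{2}}$. The sketch below is an illustration only: the identity is a standard fact that is not quoted from the paper (the estimate (9.30) is not reproduced here), and the function names `ridge` and `resolvent_gap` are hypothetical.

```python
import numpy as np

def ridge(X, Y, eta):
    """Ridge estimator n^{-1} (X^T X / n + eta I_n)^{-1} X^T Y for X of shape (m, n)."""
    m, n = X.shape
    S = X.T @ X / n
    return np.linalg.solve(S + eta * np.eye(n), X.T @ Y / n)

def resolvent_gap(X, Y, eta1, eta2):
    """Return both sides of the resolvent identity for the Ridge path.

    lhs = mu_hat(eta1) - mu_hat(eta2)
    rhs = (eta2 - eta1) * (S + eta1 I)^{-1} mu_hat(eta2),  S = X^T X / n
    """
    m, n = X.shape
    S = X.T @ X / n
    lhs = ridge(X, Y, eta1) - ridge(X, Y, eta2)
    rhs = (eta2 - eta1) * np.linalg.solve(S + eta1 * np.eye(n), ridge(X, Y, eta2))
    return lhs, rhs
```

Because $X^{\top}X/n$ is positive semidefinite, the identity gives $\lVert\widehat{\mu}_{\eta_{1}}-\widehat{\mu}_{\eta_{2}}\rVert\leq\eta_{1}^{-1}\lvert\eta_{1}-\eta_{2}\rvert\cdot\lVert\widehat{\mu}_{\eta_{2}}\rVert$, which is one elementary way to see that the Ridge path is Lipschitz in $\eta$ on $[1/K,K]$.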

In view of (9.30), there exists some C5>0C_{5}>0 depending on KK, such that on an event E5,1E_{5,1} with ξ(E5,1)1C5en/C5\operatorname{\mathbb{P}}^{\xi}(E_{5,1})\geq 1-C_{5}e^{-n/C_{5}},

|𝗁(v^η1)𝗁(v^η2)|C5n2|η1η2|.\displaystyle\lvert\mathsf{h}(\widehat{v}_{\eta_{1}})-\mathsf{h}(\widehat{v}_{\eta_{2}})\rvert\leq C_{5}n^{2}\lvert\eta_{1}-\eta_{2}\rvert. (9.51)

On the other hand, using the definition of vη,v_{\eta,\ast} in (9.35), Proposition 8.1-(3) and the fact that ϕγη,2σξ2=𝔼𝖾𝗋𝗋(Σ,μ0)(γη,;τη,)tr((Σ+τη,I)2Σ2)1\phi\gamma_{\eta,\ast}^{2}-\sigma_{\xi}^{2}=\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma_{\eta,\ast};\tau_{\eta,\ast})\geq\operatorname{tr}\big{(}(\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma^{2}\big{)}\gtrsim 1, we have

|𝔼ξ𝗁(vη1,)𝔼ξ𝗁(vη2,)|𝔼1/2,ξvη1,vη2,2\displaystyle\big{\lvert}\operatorname{\mathbb{E}}^{\xi}\mathsf{h}(v_{\eta_{1},\ast})-\operatorname{\mathbb{E}}^{\xi}\mathsf{h}(v_{\eta_{2},\ast})\big{\rvert}\leq\operatorname{\mathbb{E}}^{1/2,\xi}\lVert v_{\eta_{1},\ast}-v_{\eta_{2},\ast}\rVert^{2}
|τη1,1ϕγη1,2σξ2τη2,1ϕγη2,2σξ2|+|τη1,1τη2,1|\displaystyle\lesssim\big{\lvert}{\tau_{\eta_{1},\ast}^{-1}}\sqrt{\phi\gamma_{\eta_{1},\ast}^{2}-\sigma_{\xi}^{2}}-{\tau_{\eta_{2},\ast}^{-1}}\sqrt{\phi\gamma_{\eta_{2},\ast}^{2}-\sigma_{\xi}^{2}}\big{\rvert}+\big{\lvert}\tau_{\eta_{1},\ast}^{-1}-\tau_{\eta_{2},\ast}^{-1}\big{\rvert}
|γη1,2γη2,2|+|τη1,1τη2,1|C5|η1η2|.\displaystyle\lesssim\big{\lvert}\gamma_{\eta_{1},\ast}^{2}-\gamma_{\eta_{2},\ast}^{2}\big{\rvert}+\big{\lvert}\tau_{\eta_{1},\ast}^{-1}-\tau_{\eta_{2},\ast}^{-1}\big{\rvert}\leq C_{5}\lvert\eta_{1}-\eta_{2}\rvert. (9.52)

Now we may mimic the proof in (9.4) to conclude that, by possibly enlarging C5>0C_{5}>0, for any ε(0,1/2]\varepsilon\in(0,1/2] and ξ1,ξ(ε2/C5)\xi\in\mathscr{E}_{1,\xi}(\varepsilon^{2}/C_{5}),

ξ(supη[1/K,K]|𝗁(v^η)𝔼ξ𝗁(vη,)|ε)C5nenε4/C5,\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\sup_{\eta\in[1/K,K]}\lvert\mathsf{h}(\widehat{v}_{\eta})-\operatorname{\mathbb{E}}^{\xi}\mathsf{h}(v_{\eta,\ast})\rvert\geq\varepsilon\Big{)}\leq C_{5}ne^{-n\varepsilon^{4}/C_{5}},

as desired. ∎

10. Universality: Proof of Theorem 2.4

10.1. Comparison inequalities

For 𝖿:n\mathsf{f}:\mathbb{R}^{n}\to\mathbb{R}, let

𝖿(w,A)12nAwξ2+𝖿(w).\displaystyle\mathcal{H}_{\mathsf{f}}(w,A)\equiv\frac{1}{2n}\lVert Aw-\xi\rVert^{2}+\mathsf{f}(w).

The following theorem is proved in [HS22, Theorem 2.3].

Theorem 10.1.

Suppose 1/Kϕ1K1/K\leq\phi^{-1}\leq K for some K>1K>1. Let A0,B0m×nA_{0},B_{0}\in\mathbb{R}^{m\times n} be two random matrices with independent components, such that 𝔼A0;ij=𝔼B0;ij=0\operatorname{\mathbb{E}}A_{0;ij}=\operatorname{\mathbb{E}}B_{0;ij}=0 and 𝔼A0;ij2=𝔼B0;ij2\operatorname{\mathbb{E}}A_{0;ij}^{2}=\operatorname{\mathbb{E}}B_{0;ij}^{2} for all i[m],j[n]i\in[m],j\in[n]. Further assume that

Mmaxi[m],j[n](𝔼|A0;ij|6+𝔼|B0;ij|6)<.\displaystyle M\equiv\max_{i\in[m],j\in[n]}\big{(}\operatorname{\mathbb{E}}\lvert A_{0;ij}\rvert^{6}+\operatorname{\mathbb{E}}\lvert B_{0;ij}\rvert^{6}\big{)}<\infty.

Let AA0/nA\equiv A_{0}/\sqrt{n} and BB0/nB\equiv B_{0}/\sqrt{n}. Then there exists some C0=C0(K,M)>0C_{0}=C_{0}(K,M)>0 such that the following hold: For any 𝒮n[Ln,Ln]n\mathcal{S}_{n}\subset[-L_{n},L_{n}]^{n} with Ln1L_{n}\geq 1, and any 𝖳C3()\mathsf{T}\in C^{3}(\mathbb{R}), we have

|𝔼𝖳(minw𝒮n𝖿(w,A))𝔼𝖳(minw𝒮n𝖿(w,B))|C0K𝖳𝗋𝖿(Ln).\displaystyle\Big{|}\operatorname{\mathbb{E}}\mathsf{T}\Big{(}\min_{w\in\mathcal{S}_{n}}\mathcal{H}_{\mathsf{f}}(w,A)\Big{)}-\operatorname{\mathbb{E}}\mathsf{T}\Big{(}\min_{w\in\mathcal{S}_{n}}\mathcal{H}_{\mathsf{f}}(w,B)\Big{)}\Big{|}\leq C_{0}\cdot K_{\mathsf{T}}\cdot\mathsf{r}_{\mathsf{f}}(L_{n}).

Here K𝖳1+max[0:3]𝖳()K_{\mathsf{T}}\equiv 1+\max_{\ell\in[0:3]}\lVert\mathsf{T}^{(\ell)}\rVert_{\infty}, and 𝗋𝖿(Ln)\mathsf{r}_{\mathsf{f}}(L_{n}) is defined by

𝗋𝖿(Ln)\displaystyle\mathsf{r}_{\mathsf{f}}(L_{n}) infδ(0,n5/2){𝒩𝖿(Ln,δ)+(1+1mi=1m𝔼|ξi|3)1/3Ln2log+2/3(Ln/δ)n1/6},\displaystyle\equiv\inf_{\delta\in(0,n^{-5/2})}\bigg{\{}\mathscr{N}_{\mathsf{f}}(L_{n},\delta)+\bigg{(}1+\frac{1}{m}\sum_{i=1}^{m}\operatorname{\mathbb{E}}\lvert\xi_{i}\rvert^{3}\bigg{)}^{1/3}\cdot\frac{L_{n}^{2}\log_{+}^{2/3}(L_{n}/\delta)}{n^{1/6}}\bigg{\}},

where 𝒩𝖿(Ln,δ)sup|𝖿(w)𝖿(w)|\mathscr{N}_{\mathsf{f}}(L_{n},\delta)\equiv\sup\,\lvert\mathsf{f}(w)-\mathsf{f}(w^{\prime})\rvert with the supremum taken over all w,w[Ln,Ln]nw,w^{\prime}\in[-L_{n},L_{n}]^{n} such that wwδ\lVert w-w^{\prime}\rVert_{\infty}\leq\delta. Consequently, for any z,ε>0z\in\mathbb{R},\varepsilon>0,

(minw𝒮n𝖿(w,A)>z+3ε)(minw𝒮n𝖿(w,B)>z+ε)+C1(1ε3)𝗋𝖿(Ln).\displaystyle\operatorname{\mathbb{P}}\Big{(}\min_{w\in\mathcal{S}_{n}}\mathcal{H}_{\mathsf{f}}(w,A)>z+3\varepsilon\Big{)}\leq\operatorname{\mathbb{P}}\Big{(}\min_{w\in\mathcal{S}_{n}}\mathcal{H}_{\mathsf{f}}(w,B)>z+\varepsilon\Big{)}+C_{1}(1\vee\varepsilon^{-3})\mathsf{r}_{\mathsf{f}}(L_{n}).

Here C1>0C_{1}>0 is an absolute multiple of C0C_{0}.
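The passage from the expectation bound to the probability bound can be sketched as follows. This assumes that the argument follows the usual smoothing route; the specific test function $g$ below is an illustrative choice and is not taken from [HS22].

```latex
% Fix a nondecreasing g \in C^3(\mathbb{R}) with g = 0 on (-\infty,0] and g = 1 on [1,\infty),
% and set \mathsf{T}_{z,\varepsilon}(x) \equiv g\big((x-z-\varepsilon)/(2\varepsilon)\big).
\begin{align*}
&\bm{1}_{\{x>z+3\varepsilon\}}\leq \mathsf{T}_{z,\varepsilon}(x)\leq \bm{1}_{\{x>z+\varepsilon\}},
\qquad
K_{\mathsf{T}_{z,\varepsilon}}\lesssim 1\vee\varepsilon^{-3},\\
&\operatorname{\mathbb{P}}\Big(\min_{w\in\mathcal{S}_{n}}\mathcal{H}_{\mathsf{f}}(w,A)>z+3\varepsilon\Big)
\leq \operatorname{\mathbb{E}}\mathsf{T}_{z,\varepsilon}\Big(\min_{w\in\mathcal{S}_{n}}\mathcal{H}_{\mathsf{f}}(w,A)\Big)\\
&\qquad\leq \operatorname{\mathbb{E}}\mathsf{T}_{z,\varepsilon}\Big(\min_{w\in\mathcal{S}_{n}}\mathcal{H}_{\mathsf{f}}(w,B)\Big)
+C_{0}\cdot K_{\mathsf{T}_{z,\varepsilon}}\cdot\mathsf{r}_{\mathsf{f}}(L_{n})\\
&\qquad\leq \operatorname{\mathbb{P}}\Big(\min_{w\in\mathcal{S}_{n}}\mathcal{H}_{\mathsf{f}}(w,B)>z+\varepsilon\Big)
+C_{1}(1\vee\varepsilon^{-3})\,\mathsf{r}_{\mathsf{f}}(L_{n}).
\end{align*}
```

The factor $\varepsilon^{-3}$ arises because the $\ell$-th derivative of $\mathsf{T}_{z,\varepsilon}$ scales as $\varepsilon^{-\ell}$ for $\ell\in[0:3]$.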

For $u\in\mathbb{R}^{m}$, $w\in\mathbb{R}^{n}$, $A\in\mathbb{R}^{m\times n}$ and a measurable function $Q:\mathbb{R}^{m}\times\mathbb{R}^{n}\to\mathbb{R}$, let

X(u,w;A)uAw+Q(u,w).\displaystyle X(u,w;A)\equiv u^{\top}Aw+Q(u,w). (10.1)

The following theorem is proved in [HS22, Theorem 2.5].

Theorem 10.2.

Let A,Bm×nA,B\in\mathbb{R}^{m\times n} be two random matrices with independent entries and matching first two moments, i.e., 𝔼Aij=𝔼Bij\operatorname{\mathbb{E}}A_{ij}^{\ell}=\operatorname{\mathbb{E}}B_{ij}^{\ell} for all i[m],j[n],=1,2i\in[m],j\in[n],\ell=1,2. There exists a universal constant C0>0C_{0}>0 such that the following hold. For any measurable subsets 𝒮u[Lu,Lu]m\mathcal{S}_{u}\subset[-L_{u},L_{u}]^{m}, 𝒮w[Lw,Lw]n\mathcal{S}_{w}\subset[-L_{w},L_{w}]^{n} with Lu,Lw1L_{u},L_{w}\geq 1, and any 𝖳C3()\mathsf{T}\in C^{3}(\mathbb{R}), we have

|𝔼𝖳(maxu𝒮uminw𝒮wX(u,w;A))𝔼𝖳(maxu𝒮uminw𝒮wX(u,w;B))|\displaystyle\Big{|}\operatorname{\mathbb{E}}\mathsf{T}\Big{(}\max_{u\in\mathcal{S}_{u}}\min_{w\in\mathcal{S}_{w}}X(u,w;A)\Big{)}-\operatorname{\mathbb{E}}\mathsf{T}\Big{(}\max_{u\in\mathcal{S}_{u}}\min_{w\in\mathcal{S}_{w}}X(u,w;B)\Big{)}\Big{|}
C0K𝖳infδ(0,1){M1Lδ+𝒩Q(L,δ)+log+2/3(L/δ)(m+n)2/3M31/3L2}.\displaystyle\leq C_{0}\cdot K_{\mathsf{T}}\cdot\inf_{\delta\in(0,1)}\Big{\{}M_{1}L\delta+\mathscr{N}_{Q}(L,\delta)+\log_{+}^{2/3}(L/\delta)\cdot(m+n)^{2/3}M_{3}^{1/3}L^{2}\Big{\}}.

Here K𝖳1+max[0:3]𝖳()K_{\mathsf{T}}\equiv 1+\max_{\ell\in[0:3]}\lVert\mathsf{T}^{(\ell)}\rVert_{\infty}, LLu+LwL\equiv L_{u}+L_{w}, Mi[m],j[n](𝔼|Aij|+𝔼|Bij|)M_{\ell}\equiv\sum_{i\in[m],j\in[n]}\big{(}\operatorname{\mathbb{E}}\lvert A_{ij}\rvert^{\ell}+\operatorname{\mathbb{E}}\lvert B_{ij}\rvert^{\ell}\big{)}, and 𝒩Q(L,δ)sup|Q(u,w)Q(u,w)|\mathscr{N}_{Q}(L,\delta)\equiv\sup\,\lvert Q(u,w)-Q(u^{\prime},w^{\prime})\rvert with the supremum taken over all u,u[L,L]m,w,w[L,L]nu,u^{\prime}\in[-L,L]^{m},w,w^{\prime}\in[-L,L]^{n} such that uuwwδ\lVert u-u^{\prime}\rVert_{\infty}\vee\lVert w-w^{\prime}\rVert_{\infty}\leq\delta. The conclusion continues to hold when max-min is flipped to min-max.

10.2. Delocalization

Recall that μ^η\widehat{\mu}_{\eta} defined in (1.3) can be rewritten as

μ^η=argminμnmaxvm{12μ2+1nv,XμYη2v2}.\displaystyle\widehat{\mu}_{\eta}=\operatorname*{arg\,min\,}_{\mu\in\mathbb{R}^{n}}\max_{v\in\mathbb{R}^{m}}\bigg{\{}\frac{1}{2}\lVert\mu\rVert^{2}+\frac{1}{\sqrt{n}}\langle v,X\mu-Y\rangle-\frac{\eta}{2}\lVert v\rVert^{2}\bigg{\}}.

For any η>0\eta>0, we have the following closed form for μ^η\widehat{\mu}_{\eta}:

μ^η=n1(XX/n+ηIn)1XY,v^η=(nη)1(YXμ^η).\displaystyle\widehat{\mu}_{\eta}=n^{-1}\big{(}{X^{\top}X}/{n}+\eta I_{n}\big{)}^{-1}X^{\top}Y,\quad\widehat{v}_{\eta}=-(\sqrt{n}\eta)^{-1}(Y-X\widehat{\mu}_{\eta}). (10.2)

The above formula does not include the interpolating case $\eta=0$ when $n>m$. To give an alternative expression, note that the first-order condition for the above minimax optimization is $\widehat{\mu}_{\eta}=-X^{\top}\widehat{v}_{\eta}/\sqrt{n}$, $Y-X\widehat{\mu}_{\eta}=-\sqrt{n}\eta\widehat{v}_{\eta}$, or equivalently,

μ^η=n1X(XX/n+ηIm)1Y,v^η=n1/2(XX/n+ηIm)1Y.\displaystyle\widehat{\mu}_{\eta}=n^{-1}X^{\top}\big{(}{XX^{\top}}/{n}+\eta I_{m}\big{)}^{-1}Y,\quad\widehat{v}_{\eta}=-n^{-1/2}\big{(}{XX^{\top}}/{n}+\eta I_{m}\big{)}^{-1}Y. (10.3)
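The agreement between the two closed forms (10.2) and (10.3), and the fact that only (10.3) extends to $\eta=0$ when $n>m$, can be checked numerically. The following is a minimal sketch with hypothetical function names (`primal_form`, `dual_form`); it assumes $X\in\mathbb{R}^{m\times n}$ has full row rank so that $XX^{\top}$ is invertible.

```python
import numpy as np

def primal_form(X, Y, eta):
    """Closed form (10.2): an n x n linear solve; requires eta > 0 when n > m."""
    m, n = X.shape
    mu = np.linalg.solve(X.T @ X / n + eta * np.eye(n), X.T @ Y) / n
    v = -(Y - X @ mu) / (np.sqrt(n) * eta)
    return mu, v

def dual_form(X, Y, eta):
    """Closed form (10.3): an m x m linear solve; stays well defined at eta = 0
    as long as X X^T is invertible."""
    m, n = X.shape
    a = np.linalg.solve(X @ X.T / n + eta * np.eye(m), Y)
    mu = X.T @ a / n
    v = -a / np.sqrt(n)
    return mu, v
```

At $\eta=0$ the second form returns $X^{\top}(XX^{\top})^{-1}Y$, the minimum $\ell_{2}$-norm interpolator, which satisfies $X\widehat{\mu}_{0}=Y$.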

The following proposition proves delocalization for w^ηΣ1/2(μ^ημ0)\widehat{w}_{\eta}\equiv\Sigma^{1/2}(\widehat{\mu}_{\eta}-\mu_{0}) and v^η\widehat{v}_{\eta}.

Proposition 10.3.

Suppose Assumption A holds and the following hold for some K>0K>0.

  • 1/Kϕ1K1/K\leq\phi^{-1}\leq K, Σ1opΣopK\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\vee\lVert\Sigma\rVert_{\operatorname{op}}\leq K.

  • Assumption B holds with σξ2[1/K,K]\sigma_{\xi}^{2}\in[1/K,K].

Fix ϑ(0,1/2]\vartheta\in(0,1/2]. Then there exist some constant C=C(K,ϑ)>0C=C(K,\vartheta)>0, two measurable sets 𝒰ϑBn(1),ϑm\mathcal{U}_{\vartheta}\subset B_{n}(1),\mathcal{E}_{\vartheta}\subset\mathbb{R}^{m} with min{vol(𝒰ϑ)/vol(Bn(1)),(ξϑ)}1Cen2ϑ/C\min\{\mathrm{vol}(\mathcal{U}_{\vartheta})/\mathrm{vol}(B_{n}(1)),\operatorname{\mathbb{P}}(\xi\in\mathcal{E}_{\vartheta})\}\geq 1-Ce^{-n^{2\vartheta}/C}, such that

supμ0𝒰ϑ,ξϑξ(supηΞK{w^ηv^η}Cn1/2+ϑ)Cn100.\displaystyle\sup_{\mu_{0}\in\mathcal{U}_{\vartheta},\xi\in\mathcal{E}_{\vartheta}}\operatorname{\mathbb{P}}^{\xi}\Big{(}\sup_{\eta\in\Xi_{K}}\Big{\{}\lVert\widehat{w}_{\eta}\rVert_{\infty}\vee\lVert\widehat{v}_{\eta}\rVert_{\infty}\Big{\}}\geq Cn^{-1/2+\vartheta}\Big{)}\leq Cn^{-100}.

The sets 𝒰ϑ,ϑ\mathcal{U}_{\vartheta},\mathcal{E}_{\vartheta} can be taken as

𝒰ϑ\displaystyle\mathcal{U}_{\vartheta} {μ0Bn(1):supηΞKΣ1/2(𝔼μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,)μ0)C0n1/2+ϑ},\displaystyle\equiv\Big{\{}\mu_{0}\in B_{n}(1):\sup_{\eta\in\Xi_{K}}\big{\lVert}\Sigma^{1/2}\big{(}\operatorname{\mathbb{E}}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})-\mu_{0}\big{)}\big{\rVert}_{\infty}\leq C_{0}n^{-1/2+\vartheta}\Big{\}},
ϑ\displaystyle\mathcal{E}_{\vartheta} {ξm:ξC0nϑ,|ξ2/mσξ2|C0n1/2+ϑ}\displaystyle\equiv\Big{\{}\xi\in\mathbb{R}^{m}:\lVert\xi\rVert_{\infty}\leq C_{0}n^{\vartheta},\big{\lvert}\lVert\xi\rVert^{2}/m-\sigma_{\xi}^{2}\big{\rvert}\leq C_{0}n^{-1/2+\vartheta}\Big{\}}

for some large enough C0=C0(K)>0C_{0}=C_{0}(K)>0.

Remark 6.

Delocalization in the same sense of the above proposition holds for 𝖯μ^η+𝗊\lVert\mathsf{P}\widehat{\mu}_{\eta}+\mathsf{q}\rVert_{\infty} with any deterministic matrix 𝖯n×n\mathsf{P}\in\mathbb{R}^{n\times n} and vector 𝗊n\mathsf{q}\in\mathbb{R}^{n} satisfying 𝖯op𝗊1\lVert\mathsf{P}\rVert_{\operatorname{op}}\vee\lVert\mathsf{q}\rVert\leq 1, with a (slightly) different construction of 𝒰ϑ\mathcal{U}_{\vartheta}.

Proof of Proposition 10.3.

All the constants in ,,\lesssim,\gtrsim,\asymp below may depend on KK.

(1). Let us consider delocalization for w^η\widehat{w}_{\eta}. Using (10.3), for any s[n]s\in[n],

es,w^η\displaystyle\langle e_{s},\widehat{w}_{\eta}\rangle =n1Σ1/2es,X(ϕΣˇ+ηIm)1Xμ0Σ1/2es,μ0\displaystyle=n^{-1}\langle\Sigma^{1/2}e_{s},X^{\top}(\phi\check{\Sigma}+\eta I_{m})^{-1}X\mu_{0}\rangle-\langle\Sigma^{1/2}e_{s},\mu_{0}\rangle
+n1Σ1/2es,X(ϕΣˇ+ηIm)1ξA1;s+A2;s.\displaystyle\qquad+n^{-1}\langle\Sigma^{1/2}e_{s},X^{\top}(\phi\check{\Sigma}+\eta I_{m})^{-1}\xi\rangle\equiv A_{1;s}+A_{2;s}. (10.4)

We first handle A1;sA_{1;s}. Let ρ\rho be the asymptotic eigenvalue density of Σˇ=XX/m\check{\Sigma}=XX^{\top}/m and fix c>0c>0. By [KY17, Theorem 3.16-(i), Remark 3.17 and Lemma 4.4-(i)], for any small ϑ>0\vartheta>0 and large D>0D>0,

ξ(|m1Σ1/2es,X(ΣˇzIm)1Xμ0\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\big{\lvert}m^{-1}\langle\Sigma^{1/2}e_{s},X^{\top}(\check{\Sigma}-zI_{m})^{-1}X\mu_{0}\rangle
Σ1/2es,𝔪(z)Σ(In+𝔪(z)Σ)1μ0|n1/2+ϑ𝔪(z)/z)CnD\displaystyle\qquad-\langle\Sigma^{1/2}e_{s},\mathfrak{m}(z)\Sigma(I_{n}+\mathfrak{m}(z)\Sigma)^{-1}\mu_{0}\rangle\big{\rvert}\geq n^{-1/2+\vartheta}\sqrt{\Im\mathfrak{m}(z)/\Im z}\Big{)}\leq Cn^{-D}

holds for all $z\in[-1/c,1/c]\times(0,1/c]$. With $\kappa\equiv\kappa(z)\equiv\mathrm{dist}(\Re z,\mathrm{supp}\;\rho)\geq n^{-2/3+c}$, by further using the simple relation $\Im\mathfrak{m}(z)/\Im z=\int\frac{\rho(\mathrm{d}x)}{(\Re z-x)^{2}+(\Im z)^{2}}\leq\kappa^{-2}$, the error bound $n^{-1/2+\vartheta}\sqrt{\Im\mathfrak{m}(z)/\Im z}$ in the above display can be replaced by $\kappa^{-1}n^{-1/2+\vartheta}$.
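The relation between the imaginary part of the Stieltjes transform and the distance to the support can be illustrated on a discrete measure. The sketch below assumes the convention $\mathfrak{m}(z)=\int\rho(\mathrm{d}x)/(x-z)$ and replaces $\rho$ by finitely many atoms, so that the distance to the support is the distance to the nearest atom; the function names are hypothetical.

```python
import numpy as np

def stieltjes(z, atoms, weights):
    """m(z) = sum_i weights_i / (atoms_i - z) for a discrete probability measure."""
    return np.sum(weights / (atoms - z))

def ratio_and_bound(z, atoms, weights):
    """Return (Im m(z) / Im z, the integral of 1/|x - z|^2, kappa^{-2})."""
    ratio = stieltjes(z, atoms, weights).imag / z.imag
    integral = np.sum(weights / ((z.real - atoms) ** 2 + z.imag ** 2))
    kappa = np.min(np.abs(z.real - atoms))  # distance from Re z to the atoms
    return ratio, integral, kappa ** -2
```

For instance, with atoms spread over $[0.5,2]$ and $z=-0.3+0.01\sqrt{-1}$, the first two returned values coincide and both are bounded by the third, in line with the inequality used above.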

When $\phi^{-1}\geq 1+1/K$, according to [BS10, Theorem 6.3-(2)], $\mathrm{supp}\;\rho\subset(C_{0}^{-1},C_{0})$ for some constant $C_{0}>1$. Hence, for $z\equiv z(b)\equiv-\eta/\phi+\sqrt{-1}b$ with a small enough $b>0$ to be chosen later, it is easy to see that $\kappa\geq\kappa_{0}\equiv(\eta/\phi)\vee C_{0}^{-1}\bm{1}_{\phi^{-1}\geq 1+1/K}$. Therefore, on an event $E_{1,0;s}(b)$ with $\operatorname{\mathbb{P}}^{\xi}(E_{1,0;s}(b))\geq 1-Cn^{-D}$,

|m1Σ1/2es,X(Σˇz(0)Im)1Xμ0\displaystyle\big{\lvert}m^{-1}\langle\Sigma^{1/2}e_{s},X^{\top}(\check{\Sigma}-z(0)I_{m})^{-1}X\mu_{0}\rangle
Σ1/2es,𝔪(z(0))Σ(In+𝔪(z(0))Σ)1μ0|(I)+(II)+κ01n1/2+ϑ,\displaystyle\qquad-\langle\Sigma^{1/2}e_{s},\mathfrak{m}(z(0))\Sigma(I_{n}+\mathfrak{m}(z(0))\Sigma)^{-1}\mu_{0}\rangle\big{\rvert}\leq(I)+(II)+\kappa_{0}^{-1}n^{-1/2+\vartheta}, (10.5)

where

  • (I)=|m1Σ1/2es,X(Σˇz(b)Im)1Xμ0m1Σ1/2es,X(Σˇz(0)Im)1Xμ0|(I)=\lvert m^{-1}\langle\Sigma^{1/2}e_{s},X^{\top}(\check{\Sigma}-z(b)I_{m})^{-1}X\mu_{0}\rangle-m^{-1}\langle\Sigma^{1/2}e_{s},X^{\top}(\check{\Sigma}-z(0)I_{m})^{-1}X\mu_{0}\rangle\rvert,

  • (II)=|Σ1/2es,𝔪(z(b))Σ(In+𝔪(z(b))Σ)1μ0Σ1/2es,𝔪(z(0))Σ(In+𝔪(z(0))Σ)1μ0|(II)=\lvert\langle\Sigma^{1/2}e_{s},\mathfrak{m}(z(b))\Sigma(I_{n}+\mathfrak{m}(z(b))\Sigma)^{-1}\mu_{0}\rangle-\langle\Sigma^{1/2}e_{s},\mathfrak{m}(z(0))\Sigma(I_{n}+\mathfrak{m}(z(0))\Sigma)^{-1}\mu_{0}\rangle\rvert.

By a derivative calculation, it is easy to derive

(I)\displaystyle(I) (Zop/n)2((ZZ/n)1op𝟏ϕ11+1/Kη1)2b.\displaystyle\lesssim\big{(}\lVert Z\rVert_{\operatorname{op}}/\sqrt{n}\big{)}^{2}\cdot\big{(}\lVert(ZZ^{\top}/n)^{-1}\rVert_{\operatorname{op}}\bm{1}_{\phi^{-1}\geq 1+1/K}\wedge\eta^{-1}\big{)}^{2}\cdot b.

Now by using the concentration result in [RV09, Theorem 1.1], on an event E1,1;sE_{1,1;s} with ξ(E1,1;s)1en/C\operatorname{\mathbb{P}}^{\xi}(E_{1,1;s})\geq 1-e^{-n/C}, we have (I)Cb(I)\leq Cb.

For (II)(II), using the boundedness of 𝔪(z(b))\mathfrak{m}(z(b)) around 0 for ϕ11+1/K\phi^{-1}\geq 1+1/K, we may estimate

(II)\displaystyle(II) (𝟏ϕ11+1/Kη1)|𝔪(z(b))𝔪(z(0))|\displaystyle\lesssim\big{(}\bm{1}_{\phi^{-1}\geq 1+1/K}\wedge\eta^{-1}\big{)}\cdot\lvert\mathfrak{m}(z(b))-\mathfrak{m}(z(0))\rvert
(𝟏ϕ11+1/Kη1)C01𝟏ϕ11+1/Kb|xz(b)||xz(0)|ρ(dx)\displaystyle\leq\big{(}\bm{1}_{\phi^{-1}\geq 1+1/K}\wedge\eta^{-1}\big{)}\cdot\int_{C_{0}^{-1}\mathbf{1}_{\phi^{-1}\geq 1+1/K}}^{\infty}\frac{b}{|x-z(b)||x-z(0)|}\,\rho(\mathrm{d}x)
{C02𝟏ϕ11+1/K1η3}b.\displaystyle\leq\Big{\{}C_{0}^{2}\mathbf{1}_{\phi^{-1}\geq 1+1/K}^{-1}\wedge\eta^{-3}\Big{\}}\cdot b.
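The second inequality in the above display rests on an elementary identity for the Stieltjes transform. A brief derivation is given below, assuming the convention $\mathfrak{m}(z)=\int\rho(\mathrm{d}x)/(x-z)$ and using that $\rho$ is supported in $[0,\infty)$.

```latex
\begin{align*}
\frac{1}{x-z(b)}-\frac{1}{x-z(0)}
  &=\frac{z(b)-z(0)}{(x-z(b))(x-z(0))}
   =\frac{\sqrt{-1}\,b}{(x-z(b))(x-z(0))},\\
\lvert\mathfrak{m}(z(b))-\mathfrak{m}(z(0))\rvert
  &\leq\int\frac{b}{\lvert x-z(b)\rvert\,\lvert x-z(0)\rvert}\,\rho(\mathrm{d}x),\\
\lvert x-z(b)\rvert\wedge\lvert x-z(0)\rvert
  &\geq x+\eta/\phi\quad\text{for all }x\geq 0.
\end{align*}
```

Consequently the integrand is at most $b(\phi/\eta)^{2}$ in general, and at most $bC_{0}^{2}$ when $\phi^{-1}\geq 1+1/K$, since then $x\geq C_{0}^{-1}$ on the support of $\rho$.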

Combining the above estimates, for bb chosen small enough, say, b=n100b=n^{-100}, on the event E1,0;s(n100)E1,1;sE_{1,0;s}(n^{-100})\cap E_{1,1;s},

|A1;sΣ1/2es,𝔪(η/ϕ)Σ(In+𝔪(η/ϕ)Σ)1μ0μ0|n1/2+ϑ.\displaystyle\lvert A_{1;s}-\langle\Sigma^{1/2}e_{s},\mathfrak{m}(-\eta/\phi)\Sigma(I_{n}+\mathfrak{m}(-\eta/\phi)\Sigma)^{-1}\mu_{0}-\mu_{0}\rangle\rvert\lesssim n^{-1/2+\vartheta}.

Recall $w_{\eta,\ast}=\Sigma^{1/2}\big(\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})-\mu_{0}\big)$ defined in (9.14). Using $\tau_{\eta,\ast}^{-1}=\mathfrak{m}(-\eta/\phi)$ and the definition of $\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})$, we then have

supμ0Bn(1)ξ(maxs[n]|A1;ses,𝔼wη,|Cn1/2+ϑ)CnD.\displaystyle\sup_{\mu_{0}\in B_{n}(1)}\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{s\in[n]}\lvert A_{1;s}-\big{\langle}e_{s},\operatorname{\mathbb{E}}w_{\eta,\ast}\big{\rangle}\rvert\geq Cn^{-1/2+\vartheta}\Big{)}\leq Cn^{-D}. (10.6)

The term A2;sA_{2;s} can be handled similarly, now reading off the (1,2)(1,2) element in [KY17, Eqn. (3.10)], which shows that for any ξm\xi\in\mathbb{R}^{m},

ξ(maxs[n]|A2;s|C(ξ/m)n1/2+ϑ)CnD.\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{s\in[n]}\lvert A_{2;s}\rvert\geq C(\lVert\xi\rVert/\sqrt{m})\cdot n^{-1/2+\vartheta}\Big{)}\leq Cn^{-D}. (10.7)

Combining (10.4), (10.6) and (10.7), we have

supμ0Bn(1),ξϑξ(w^η𝔼wη,+Cn1/2+ϑ)CnD.\displaystyle\sup_{\mu_{0}\in B_{n}(1),\xi\in\mathcal{E}_{\vartheta}}\operatorname{\mathbb{P}}^{\xi}\Big{(}\lVert\widehat{w}_{\eta}\rVert_{\infty}\geq\lVert\operatorname{\mathbb{E}}w_{\eta,\ast}\rVert_{\infty}+Cn^{-1/2+\vartheta}\Big{)}\leq Cn^{-D}. (10.8)

Now we will construct $\mathcal{U}_{\vartheta}\subset B_{n}(1)$ with the desired volume estimate and $\sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}\sup_{\eta\in\Xi_{K}}\lVert\operatorname{\mathbb{E}}w_{\eta,\ast}\rVert_{\infty}\leq Cn^{-1/2+\vartheta}$. To this end, we place a uniform prior on $\mu_{0}\sim U_{0}g_{0}/\lVert g_{0}\rVert$, where $U_{0}\sim\mathrm{Unif}[0,1]$ and $g_{0}\sim\mathcal{N}(0,I_{n})$ are independent of all other random variables. Then $\sup_{\eta\in\Xi_{K}}\lVert\operatorname{\mathbb{E}}w_{\eta,\ast}\rVert_{\infty}\leq\sup_{\eta\in\Xi_{K}}\tau_{\eta,\ast}\lVert(\Sigma+\tau_{\eta,\ast}I_{n})^{-1}\Sigma^{1/2}g_{0}\rVert_{\infty}/\lVert g_{0}\rVert$. Using Proposition 8.1-(3) and a standard Gaussian tail bound, $\operatorname{\mathbb{P}}_{\mu_{0}}\big(\mathcal{U}_{\vartheta}^{c}\equiv\big\{\sup_{\eta\in\Xi_{K}}\lVert\operatorname{\mathbb{E}}w_{\eta,\ast}\rVert_{\infty}>C_{1}n^{-1/2+\vartheta}\big\}\big)\leq Ce^{-n^{2\vartheta}/C}$. Moreover, $\operatorname{\mathbb{P}}(\xi\notin\mathcal{E}_{\vartheta})\leq e^{-n^{2\vartheta}/C}$. The pointwise-in-$\eta$ delocalization claim on $\widehat{w}_{\eta}$ follows. As $\eta\mapsto\lVert\widehat{w}_{\eta}\rVert_{\infty}$ is $C$-Lipschitz with exponentially high probability, the uniform version follows by a standard discretization and union bound argument.
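The delocalization of $\operatorname{\mathbb{E}}w_{\eta,\ast}$ under the random choice of $\mu_{0}$ can be illustrated numerically. The sketch below is not part of the proof: it assumes a diagonal $\Sigma$ and treats $\tau$ as a free input rather than the solution $\tau_{\eta,\ast}$ of the fixed point equation, and the function names are hypothetical.

```python
import numpy as np

def sample_mu0(n, rng):
    """Draw mu_0 = U_0 g_0 / ||g_0|| with U_0 ~ Unif[0,1] and g_0 ~ N(0, I_n)."""
    g0 = rng.standard_normal(n)
    return rng.uniform() * g0 / np.linalg.norm(g0)

def mean_error_sup_norm(mu0, sigma_diag, tau):
    """Sup norm of tau (Sigma + tau I)^{-1} Sigma^{1/2} mu0 for diagonal Sigma.

    sigma_diag holds the diagonal of Sigma; tau > 0 is a placeholder for the
    effective regularization (an assumption of this sketch).
    """
    return np.max(np.abs(tau * np.sqrt(sigma_diag) / (sigma_diag + tau) * mu0))
```

With $n$ in the thousands and eigenvalues of $\Sigma$ in a fixed compact interval, the returned value is typically of order $\sqrt{\log n/n}$, well below the threshold $n^{-1/2+\vartheta}$ for moderate $\vartheta$.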

(2). Let us consider delocalization for v^η\widehat{v}_{\eta}. Using again (10.3), for any t[m]t\in[m],

et,v^η\displaystyle-\langle e_{t},\widehat{v}_{\eta}\rangle =n1/2et,(ϕΣˇ+ηIm)1Xμ0+n1/2et,(ϕΣˇ+ηIm)1ξ\displaystyle=n^{-1/2}\langle e_{t},(\phi\check{\Sigma}+\eta I_{m})^{-1}X\mu_{0}\rangle+n^{-1/2}\langle e_{t},(\phi\check{\Sigma}+\eta I_{m})^{-1}\xi\rangle
B1;t+B2;t.\displaystyle\equiv B_{1;t}+B_{2;t}.

The term $B_{1;t}$ can be handled by reading off the $(2,1)$ element in [KY17, Eqn. (3.10)], which shows that

supμ0Bn(1)ξ(maxt[m]|B1;t|Cn1/2+ϑ)CnD.\displaystyle\sup_{\mu_{0}\in B_{n}(1)}\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{t\in[m]}\lvert B_{1;t}\rvert\geq Cn^{-1/2+\vartheta}\Big{)}\leq Cn^{-D}. (10.9)

The term B2;tB_{2;t} relies on the local law described by the (2,2)(2,2) element in [KY17, Eqn. (3.10)]: for any ξm\xi\in\mathbb{R}^{m},

ξ(maxt[m]|B2;tϕ1𝔪(η/ϕ)ξt|C(ξ/m)n1/2+ϑ)CnD.\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{t\in[m]}\lvert B_{2;t}-\phi^{-1}\mathfrak{m}(-\eta/\phi)\xi_{t}\rvert\geq C(\lVert\xi\rVert/\sqrt{m})\cdot n^{-1/2+\vartheta}\Big{)}\leq Cn^{-D}. (10.10)

Consequently, combining (10.9)-(10.10), we have

supμ0Bn(1),ξϑξ(v^ηCn1/2+ϑ)CnD.\displaystyle\sup_{\mu_{0}\in B_{n}(1),\xi\in\mathcal{E}_{\vartheta}}\operatorname{\mathbb{P}}^{\xi}\Big{(}\lVert\widehat{v}_{\eta}\rVert_{\infty}\geq Cn^{-1/2+\vartheta}\Big{)}\leq Cn^{-D}.

The claim follows. ∎

10.3. Universality of the global cost optimum

Theorem 10.4.

Suppose Assumption A holds and the following hold for some K>0K>0.

  • 1/Kϕ1K1/K\leq\phi^{-1}\leq K, ΣopΣ1opK\lVert\Sigma\rVert_{\operatorname{op}}\vee\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\leq K.

  • Assumption B holds with σξ2[1/K,K]\sigma_{\xi}^{2}\in[1/K,K].

Fix ϑ(0,1/18)\vartheta\in(0,1/18). There exists some C=C(K,ϑ)>0C=C(K,\vartheta)>0 such that for ρ01/C\rho_{0}\leq 1/C, ηΞK\eta\in\Xi_{K} and ξϑ\xi\in\mathcal{E}_{\vartheta},

supμ0𝒰ϑξ(|minwnHη;Z(w)maxβ>0minγ>0𝖣¯η(β,γ)|ρ0)Cρ03n1/6+3ϑ.\displaystyle\sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}\operatorname{\mathbb{P}}^{\xi}\Big{(}\big{\lvert}\min_{w\in\mathbb{R}^{n}}H_{\eta;Z}(w)-\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)\big{\rvert}\geq\rho_{0}\Big{)}\leq C\rho_{0}^{-3}\cdot n^{-1/6+3\vartheta}.

Here 𝒰ϑ\mathcal{U}_{\vartheta} is specified as in Proposition 10.3.

Proof.

Fix ϑ>0\vartheta>0, μ0𝒰ϑ\mu_{0}\in\mathcal{U}_{\vartheta} and ξϑ\xi\in\mathcal{E}_{\vartheta} as specified in Proposition 10.3. Let LnC0nϑL_{n}\equiv C_{0}n^{\vartheta}. By the same proposition, with ξ\operatorname{\mathbb{P}}^{\xi}-probability at least 1C0n1001-C_{0}n^{-100},

minwnHη;Z(w)=minwLn/nmaxvLn/n{1nv,Zw1nv,ξη2v2+F(w)}\displaystyle\min_{w\in\mathbb{R}^{n}}H_{\eta;Z}(w)=\min_{\lVert w\rVert_{\infty}\leq L_{n}/\sqrt{n}}\max_{\lVert v\rVert_{\infty}\leq L_{n}/\sqrt{n}}\bigg{\{}\frac{1}{\sqrt{n}}\langle v,Zw\rangle-\frac{1}{\sqrt{n}}\langle v,\xi\rangle-\frac{\eta}{2}\lVert v\rVert^{2}+F(w)\bigg{\}}
=minw~Lnmaxv~Ln{1n3/2v~,Zw~1nv~,ξη2nv~2+F(w~/n)},\displaystyle=\min_{\lVert\widetilde{w}\rVert_{\infty}\leq L_{n}}\max_{\lVert\widetilde{v}\rVert_{\infty}\leq L_{n}}\bigg{\{}\frac{1}{n^{3/2}}\langle\widetilde{v},Z\widetilde{w}\rangle-\frac{1}{n}\langle\widetilde{v},\xi\rangle-\frac{\eta}{2n}\lVert\widetilde{v}\rVert^{2}+F(\widetilde{w}/\sqrt{n})\bigg{\}}, (10.11)

and

minwnHη;G(w)\displaystyle\min_{w\in\mathbb{R}^{n}}H_{\eta;G}(w)
=minw~Lnmaxv~Ln{1n3/2v~,Gw~1nv~,ξη2nv~2+F(w~/n)}.\displaystyle=\min_{\lVert\widetilde{w}\rVert_{\infty}\leq L_{n}}\max_{\lVert\widetilde{v}\rVert_{\infty}\leq L_{n}}\bigg{\{}\frac{1}{n^{3/2}}\langle\widetilde{v},G\widetilde{w}\rangle-\frac{1}{n}\langle\widetilde{v},\xi\rangle-\frac{\eta}{2n}\lVert\widetilde{v}\rVert^{2}+F(\widetilde{w}/\sqrt{n})\bigg{\}}. (10.12)

By writing Q(v~,w~)1nv~,ξη2nv~2+F(w~/n)Q(\widetilde{v},\widetilde{w})\equiv-\frac{1}{n}\langle\widetilde{v},\xi\rangle-\frac{\eta}{2n}\lVert\widetilde{v}\rVert^{2}+F(\widetilde{w}/\sqrt{n}), we have

𝒩Q(L,δ)\displaystyle\mathscr{N}_{Q}(L,\delta) supv~v~L,v~v~δ,w~w~L,w~w~δ|Q(v~,w~)Q(v~,w~)|K(1L)δ(1+ξ1n).\displaystyle\equiv\sup_{\begin{subarray}{c}\lVert\widetilde{v}\rVert_{\infty}\vee\lVert\widetilde{v}^{\prime}\rVert_{\infty}\leq L,\lVert\widetilde{v}-\widetilde{v}^{\prime}\rVert_{\infty}\leq\delta,\\ \lVert\widetilde{w}\rVert_{\infty}\vee\lVert\widetilde{w}^{\prime}\rVert_{\infty}\leq L,\lVert\widetilde{w}-\widetilde{w}^{\prime}\rVert_{\infty}\leq\delta\end{subarray}}\big{\lvert}Q(\widetilde{v},\widetilde{w})-Q(\widetilde{v}^{\prime},\widetilde{w}^{\prime})\big{\rvert}\lesssim_{K}(1\vee L)\delta\cdot\bigg{(}1+\frac{\lVert\xi\rVert_{1}}{n}\bigg{)}.

Now with XQ(v~,w~;Z)n3/2v~,Zw~+Q(v~,w~)X_{Q}(\widetilde{v},\widetilde{w};Z)\equiv n^{-3/2}\langle\widetilde{v},Z\widetilde{w}\rangle+Q(\widetilde{v},\widetilde{w}), for ξϑ\xi\in\mathcal{E}_{\vartheta}, by applying Theorem 10.2, we have for any 𝖳C3()\mathsf{T}\in C^{3}(\mathbb{R}),

|𝔼ξ𝖳(minw~Lnmaxv~LnXQ(v~,w~;Z))𝔼ξ𝖳(minw~Lnmaxv~LnXQ(v~,w~;G))|\displaystyle\Big{|}\operatorname{\mathbb{E}}^{\xi}\mathsf{T}\Big{(}\min_{\lVert\widetilde{w}\rVert_{\infty}\leq L_{n}}\max_{\lVert\widetilde{v}\rVert_{\infty}\leq L_{n}}X_{Q}(\widetilde{v},\widetilde{w};Z)\Big{)}-\operatorname{\mathbb{E}}^{\xi}\mathsf{T}\Big{(}\min_{\lVert\widetilde{w}\rVert_{\infty}\leq L_{n}}\max_{\lVert\widetilde{v}\rVert_{\infty}\leq L_{n}}X_{Q}(\widetilde{v},\widetilde{w};G)\Big{)}\Big{|}
KK𝖳infδ(0,1){nLnδ+Lnδ+log+2/3(Ln/δ)n1/6Ln2}\displaystyle\lesssim_{K}K_{\mathsf{T}}\cdot\inf_{\delta\in(0,1)}\Big{\{}\sqrt{n}L_{n}\delta+L_{n}\delta+\log_{+}^{2/3}(L_{n}/\delta)\cdot n^{-1/6}L_{n}^{2}\Big{\}}
C1K𝖳n1/6+3ϑ.\displaystyle\leq C_{1}\cdot K_{\mathsf{T}}\cdot n^{-1/6+3\vartheta}. (10.13)

Replicating the last paragraph of proof of [HS22, Theorem 2.3] (right above Section 4.3 therein), for any z>0,ρ0>0z>0,\rho_{0}>0,

ξ(minw~Lnmaxv~LnXQ(v~,w~;Z)>z+3ρ0)\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{\lVert\widetilde{w}\rVert_{\infty}\leq L_{n}}\max_{\lVert\widetilde{v}\rVert_{\infty}\leq L_{n}}X_{Q}(\widetilde{v},\widetilde{w};Z)>z+3\rho_{0}\Big{)}
ξ(minw~Lnmaxv~LnXQ(v~,w~;G)>z+ρ0)+Cρ03n1/6+3ϑ.\displaystyle\leq\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{\lVert\widetilde{w}\rVert_{\infty}\leq L_{n}}\max_{\lVert\widetilde{v}\rVert_{\infty}\leq L_{n}}X_{Q}(\widetilde{v},\widetilde{w};G)>z+\rho_{0}\Big{)}+C\rho_{0}^{-3}n^{-1/6+3\vartheta}.

Combined with (10.3)-(10.3), we have

ξ(minwnHη;Z(w)>z+3ρ0)\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in\mathbb{R}^{n}}H_{\eta;Z}(w)>z+3\rho_{0}\Big{)} ξ(minwnHη;G(w)>z+ρ0)+C2ρ03n1/6+3ϑ.\displaystyle\leq\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in\mathbb{R}^{n}}H_{\eta;G}(w)>z+\rho_{0}\Big{)}+C_{2}\rho_{0}^{-3}n^{-1/6+3\vartheta}.

In view of (9.26) (in Step 1 of the final proof of Theorem 2.3), for ρ0(C3n1/2+ϑ,1/C3)\rho_{0}\in(C_{3}n^{-1/2+\vartheta},1/C_{3}), we take zzηmaxβ>0minγ>0𝖣¯η(β,γ)z\equiv z_{\eta}\equiv\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma) and tρ02n/C3t\equiv\rho_{0}^{2}n/C_{3} therein, so that for ξϑ1,ξ(ρ0/C31/2)\xi\in\mathcal{E}_{\vartheta}\subset\mathscr{E}_{1,\xi}(\rho_{0}/C_{3}^{1/2}),

ξ(minwnHη;G(w)>zη+ρ0)C3eρ02n/C3.\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in\mathbb{R}^{n}}H_{\eta;G}(w)>z_{\eta}+\rho_{0}\Big{)}\leq C_{3}e^{-\rho_{0}^{2}n/C_{3}}.

Combining the estimates, for ξϑ\xi\in\mathcal{E}_{\vartheta}, ρ0(C3n1/2+ϑ,1/C3)\rho_{0}\in(C_{3}n^{-1/2+\vartheta},1/C_{3}),

ξ(minwnHη;Z(w)>zη+3ρ0)\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in\mathbb{R}^{n}}H_{\eta;Z}(w)>z_{\eta}+3\rho_{0}\Big{)} C4{eρ02n/C4+ρ03n1/6+3ϑ}.\displaystyle\leq C_{4}\big{\{}e^{-\rho_{0}^{2}n/C_{4}}+\rho_{0}^{-3}n^{-1/6+3\vartheta}\big{\}}.

The first term above can be assimilated into the second one, and the constraint $\rho_{0}\geq C_{3}n^{-1/2+\vartheta}$ can be dropped. The lower bound follows similarly by utilizing (9.27). ∎

10.4. Universality of the cost over exceptional sets

Theorem 10.5.

Suppose Assumption A holds and the following hold for some K>0K>0.

  • 1/Kϕ1K1/K\leq\phi^{-1}\leq K, ΣopΣ1opK\lVert\Sigma\rVert_{\operatorname{op}}\vee\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\leq K.

  • Assumption B with variance σξ2[1/K,K]\sigma_{\xi}^{2}\in[1/K,K].

Fix ϑ(0,1/18)\vartheta\in(0,1/18). Then there exists some C=C(K,ϑ)>0C=C(K,\vartheta)>0 such that for 𝗀:n\mathsf{g}:\mathbb{R}^{n}\to\mathbb{R} being 11-Lipschitz with respect to Σ1\lVert\cdot\rVert_{\Sigma^{-1}}, ρ01/C\rho_{0}\leq 1/C, ηΞK\eta\in\Xi_{K} and ξϑ\xi\in\mathcal{E}_{\vartheta},

supμ0𝒰ϑξ(minwDη;Cρ01/2(𝗀)B(2,)(C,Lnn)Hη;Z(w)maxβ>0minγ>0𝖣¯η(β,γ)+ρ0)Cρ06n1/6+3ϑ.\displaystyle\sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in D_{\eta;C\rho_{0}^{1/2}}(\mathsf{g})\cap B_{(2,\infty)}(C,\frac{L_{n}}{\sqrt{n}})}H_{\eta;Z}(w)\leq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+\rho_{0}\Big{)}\leq C\rho_{0}^{-6}\cdot n^{-1/6+3\vartheta}.

Here B(2,)(C,Ln/n)Bn(C)L(Ln/n)B_{(2,\infty)}(C,L_{n}/\sqrt{n})\equiv B_{n}(C)\cap L_{\infty}(L_{n}/\sqrt{n}) with LnCnϑL_{n}\equiv Cn^{\vartheta}, and 𝒰ϑ\mathcal{U}_{\vartheta} is specified as in Proposition 10.3.

Proof.

Fix ε,ϑ>0\varepsilon,\vartheta>0, μ0𝒰ϑ\mu_{0}\in\mathcal{U}_{\vartheta} and ξϑ\xi\in\mathcal{E}_{\vartheta} as specified in Proposition 10.3. We define a renormalized version of Dε;η(𝗀)D_{\varepsilon;\eta}(\mathsf{g}) as

D~ε;η(𝗀){w~n:|𝗀(w~/n)𝔼𝗀(w~η,/n)|ε},\displaystyle\widetilde{D}_{\varepsilon;\eta}(\mathsf{g})\equiv\big{\{}\widetilde{w}\in\mathbb{R}^{n}:\lvert\mathsf{g}(\widetilde{w}/\sqrt{n})-\operatorname{\mathbb{E}}\mathsf{g}(\widetilde{w}_{\eta,\ast}/\sqrt{n})\rvert\geq\varepsilon\big{\}},

where w~η,=nwη,\widetilde{w}_{\eta,\ast}=\sqrt{n}{w}_{\eta,\ast}.

(Step 1). Let LnC0nϑL_{n}\equiv C_{0}n^{\vartheta}. For any zz\in\mathbb{R} and ρ0>0\rho_{0}>0, with ZnZ/nZ_{n}\equiv Z/\sqrt{n},

ξ(minwDη;ε(𝗀)B(2,)(C0,Lnn)Hη;Z(w)z+ρ0)\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in D_{\eta;\varepsilon}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{\eta;Z}(w)\leq z+\rho_{0}\Big{)} (10.14)
=ξ(minwDη;ε(𝗀)B(2,)(C0,Lnn){F(w)+12nηZwξ2}z+ρ0)\displaystyle=\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in D_{\eta;\varepsilon}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}\Big{\{}F(w)+\frac{1}{2n\eta}\lVert Zw-\xi\rVert^{2}\Big{\}}\leq z+\rho_{0}\Big{)}
=ξ(minw~D~η;ε(𝗀)B(2,)(nC0,Ln){ηF(w~/n)+12nZnw~ξ2}η(z+ρ0)).\displaystyle=\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{\widetilde{w}\in\widetilde{D}_{\eta;\varepsilon}(\mathsf{g})\cap B_{(2,\infty)}(\sqrt{n}C_{0},L_{n})}\Big{\{}\eta F(\widetilde{w}/\sqrt{n})+\frac{1}{2n}\lVert Z_{n}\widetilde{w}-\xi\rVert^{2}\Big{\}}\leq\eta(z+\rho_{0})\Big{)}.

Now we may apply Theorem 10.1. To do so, let us write 𝖿(w~)ηF(w~/n)\mathsf{f}(\widetilde{w})\equiv\eta F(\widetilde{w}/\sqrt{n}) to match the notation. Then a simple calculation leads to

𝒩𝖿(L,δ)\displaystyle\mathscr{N}_{\mathsf{f}}(L,\delta) supw~w~L,w~w~δ|𝖿(w~)𝖿(w~)|K(1L)δ,\displaystyle\equiv\sup_{\lVert\widetilde{w}\rVert_{\infty}\vee\lVert\widetilde{w}^{\prime}\rVert_{\infty}\leq L,\lVert\widetilde{w}-\widetilde{w}^{\prime}\rVert_{\infty}\leq\delta}\lvert\mathsf{f}(\widetilde{w})-\mathsf{f}(\widetilde{w}^{\prime})\rvert\lesssim_{K}(1\vee L)\delta,

Consequently, an application of Theorem 10.1 leads to

\begin{align*}
&\text{RHS of (10.14)}-C_{1}\big(1\vee(\eta\rho_{0})^{-3}\big)L_{n}^{2}n^{-1/6}\log^{2/3}(L_{n}n)\\
&\leq\mathbb{P}^{\xi}\Big(\min_{\widetilde{w}\in\widetilde{D}_{\eta;\varepsilon}(\mathsf{g})\cap B_{(2,\infty)}(\sqrt{n}C_{0},L_{n})}\Big\{\eta F(\widetilde{w}/\sqrt{n})+\frac{1}{2n}\lVert G_{n}\widetilde{w}-\xi\rVert^{2}\Big\}\leq\eta(z+3\rho_{0})\Big)\\
&\leq\mathbb{P}^{\xi}\Big(\min_{w\in D_{\eta;\varepsilon}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{\eta;G}(w)\leq z+3\rho_{0}\Big)\\
&\leq\mathbb{P}^{\xi}\Big(\min_{w\in D_{\eta;\varepsilon}(\mathsf{g})\cap B_{n}(C_{0})}H_{\eta;G}(w)\leq z+3\rho_{0}\Big).
\end{align*}

Here in the last inequality we simply drop the LL_{\infty} constraint. Now for C2n1/2+ϑρ01/C2C_{2}n^{-1/2+\vartheta}\leq\rho_{0}\leq 1/C_{2}, by choosing zzηmaxβ>0minγ>0𝖣¯η(β,γ)z\equiv z_{\eta}\equiv\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma) and t2ρ02n/C3t\equiv 2\rho_{0}^{2}n/C_{3} in Theorem 9.6, where C3C_{3} is the constant therein, we have

ξ(minwDη;C4ρ01/2(𝗀)B(2,)(C,Lnn)Hη;Z(w)maxβ>0minγ>0𝖣¯η(β,γ)+ρ0)\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in D_{\eta;C_{4}\rho_{0}^{1/2}}(\mathsf{g})\cap B_{(2,\infty)}(C,\frac{L_{n}}{\sqrt{n}})}H_{\eta;Z}(w)\leq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)+\rho_{0}\Big{)}
C{eρ02n/C3+(ηρ0)3n1/6+3ϑ}C4(ηρ0)3n1/6+3ϑ.\displaystyle\leq C\Big{\{}e^{-\rho_{0}^{2}n/C_{3}}+(\eta\rho_{0})^{-3}\cdot n^{-1/6+3\vartheta}\Big{\}}\leq C_{4}\cdot(\eta\rho_{0})^{-3}\cdot n^{-1/6+3\vartheta}. (10.15)

The constraint $\rho_{0}\geq C_{2}n^{-1/2+\vartheta}$ can be removed by enlarging $C_{4}$ if necessary.

(Step 2). In this step we trade the dependence of the above bound on $\eta>0$ for a possibly worse dependence on $\rho_{0}$, primarily in the regime $\phi^{-1}\geq 1+1/K$. Fix $\eta_{0}\in\Xi_{K}$. Let $\eta>0$ be chosen later and $\eta_{1}\equiv\eta_{0}+\eta$. Without loss of generality we assume $\eta_{0},\eta_{1}\in\Xi_{K}$, so by (9.13) in Proposition 9.5, $\lvert z_{\eta_{1}}-z_{\eta_{0}}\rvert\leq C_{5}\eta$. By enlarging $C_{5}$ if necessary we assume that $C_{5}$ exceeds the constant in Lemma 10.6. Using Lemma 10.6, for $\varepsilon=2C_{4}\rho_{0}^{1/2}$, with the choice $\eta=C_{4}\rho_{0}/C_{5}\leq C_{4}\rho_{0}^{1/2}/C_{5}$ (we assume without loss of generality $\rho_{0}\leq 1$),

\begin{align*}
&\mathbb{P}^{\xi}\Big(\min_{w\in D_{\eta_{0};\varepsilon}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{\eta_{0};Z}(w)\leq z_{\eta_{0}}+\rho_{0}\Big)\\
&\leq\mathbb{P}^{\xi}\Big(\min_{w\in D_{\eta_{0};\varepsilon}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{\eta_{1};Z}(w)\leq z_{\eta_{0}}+\rho_{0}\Big)\quad\text{(since $H_{\eta_{1};Z}\leq H_{\eta_{0};Z}$)}\\
&\leq\mathbb{P}^{\xi}\Big(\min_{w\in D_{\eta_{1};(\varepsilon-C_{5}\eta)_{+}}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{\eta_{1};Z}(w)\leq z_{\eta_{0}}+\rho_{0}\Big)\quad\text{(by Lemma 10.6)}\\
&\leq\mathbb{P}^{\xi}\Big(\min_{w\in D_{\eta_{1};(\varepsilon-C_{5}\eta)_{+}}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{\eta_{1};Z}(w)\leq z_{\eta_{1}}+C_{5}\eta+\rho_{0}\Big)\\
&\leq\mathbb{P}^{\xi}\Big(\min_{w\in D_{\eta;C_{4}\rho_{0}^{1/2}}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{\eta_{1};Z}(w)\leq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta_{1}}(\beta,\gamma)+C\rho_{0}\Big)\\
&\leq C\cdot(\eta_{0}+\rho_{0})^{-3}\rho_{0}^{-3}\cdot n^{-1/6+3\vartheta}\leq C\cdot\rho_{0}^{-6}n^{-1/6+3\vartheta}.
\end{align*}

The proof is complete by adjusting constants. ∎
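The parameter bookkeeping in Step 2 can be written out explicitly. The following is only a sketch of the arithmetic, assuming $\rho_{0}\leq 1$, $\eta_{0}\geq 0$, and the choices $\varepsilon=2C_{4}\rho_{0}^{1/2}$, $\eta=C_{4}\rho_{0}/C_{5}$ made in the proof; the implicit constant in $\lesssim$ depends on $C_{4},C_{5}$ and is not tracked.

```latex
\begin{align*}
\varepsilon-C_{5}\eta &= 2C_{4}\rho_{0}^{1/2}-C_{4}\rho_{0}\;\geq\; C_{4}\rho_{0}^{1/2}
  && (\text{since } \rho_{0}\leq\rho_{0}^{1/2} \text{ for } \rho_{0}\leq 1),\\
z_{\eta_{0}}+\rho_{0} &\leq z_{\eta_{1}}+C_{5}\eta+\rho_{0} = z_{\eta_{1}}+(C_{4}+1)\rho_{0}
  && (\text{by } \lvert z_{\eta_{1}}-z_{\eta_{0}}\rvert\leq C_{5}\eta),\\
(\eta_{1}\rho_{0})^{-3} &= \big(\eta_{0}+C_{4}\rho_{0}/C_{5}\big)^{-3}\rho_{0}^{-3}
  \;\lesssim\;(\eta_{0}+\rho_{0})^{-3}\rho_{0}^{-3}\;\leq\;\rho_{0}^{-6}
  && (\text{since } \eta_{0}\geq 0).
\end{align*}
```

The first line gives the inclusion of the exceptional sets used in the chain of inequalities, the second controls the shift in the cost level, and the third explains how the $\eta$-dependence of (10.15) turns into the extra factor $\rho_{0}^{-3}$, up to adjusting constants.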

Lemma 10.6.

Suppose μ0ΣopΣ1opK\lVert\mu_{0}\rVert\vee\lVert\Sigma\rVert_{\operatorname{op}}\vee\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\leq K. Let 𝗀:n\mathsf{g}:\mathbb{R}^{n}\to\mathbb{R} be 11-Lipschitz with respect to Σ1\lVert\cdot\rVert_{\Sigma^{-1}}. Then there exists some constant C=C(K)>0C=C(K)>0 such that for any ε>0,η0,η1ΞK\varepsilon>0,\eta_{0},\eta_{1}\in\Xi_{K} with η1η0\eta_{1}\geq\eta_{0}, we have Dη0;ε(𝗀)Dη1;(εC(η1η0))+(𝗀)D_{\eta_{0};\varepsilon}(\mathsf{g})\subset D_{\eta_{1};(\varepsilon-C(\eta_{1}-\eta_{0}))_{+}}(\mathsf{g}).

Proof.

Using the definition of wη,w_{\eta,\ast} in (9.14), we have

\begin{align*}
\lvert\mathbb{E}\mathsf{g}(w_{\eta_{1},\ast})-\mathbb{E}\mathsf{g}(w_{\eta_{0},\ast})\rvert&\leq\mathbb{E}\lVert w_{\eta_{1},\ast}-w_{\eta_{0},\ast}\rVert_{\Sigma^{-1}}\\
&=\mathbb{E}\big\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\mathsf{seq}}(\gamma_{\eta_{1},\ast};\tau_{\eta_{1},\ast})-\widehat{\mu}_{(\Sigma,\mu_{0})}^{\mathsf{seq}}(\gamma_{\eta_{0},\ast};\tau_{\eta_{0},\ast})\big\rVert\leq C\cdot(\eta_{1}-\eta_{0}).
\end{align*}

Here the last inequality follows from the calculations in (9.31). So for any wDη0;ε(𝗀)w\in D_{\eta_{0};\varepsilon}(\mathsf{g}), we have

\[
\varepsilon\leq\lvert\mathsf{g}(w)-\mathbb{E}\mathsf{g}(w_{\eta_{0},\ast})\rvert\leq\lvert\mathsf{g}(w)-\mathbb{E}\mathsf{g}(w_{\eta_{1},\ast})\rvert+C(\eta_{1}-\eta_{0}).
\]

This means wDη1;(εC(η1η0))+(𝗀)w\in D_{\eta_{1};(\varepsilon-C(\eta_{1}-\eta_{0}))_{+}}(\mathsf{g}), as desired. ∎

10.5. Proof of the universality Theorem 2.4 for μ^η;Z\widehat{\mu}_{\eta;Z}

Fix ϑ>0\vartheta>0, μ0𝒰ϑ\mu_{0}\in\mathcal{U}_{\vartheta} and ξϑ\xi\in\mathcal{E}_{\vartheta}. Let LnC0nϑL_{n}\equiv C_{0}n^{\vartheta}, and E0{w^n;ZB(2,)(C0,Ln/n)=Bn(C0)L(Ln/n)}E_{0}\equiv\{\widehat{w}_{n;Z}\in B_{(2,\infty)}(C_{0},L_{n}/\sqrt{n})=B_{n}(C_{0})\cap L_{\infty}(L_{n}/\sqrt{n})\}. We assume that C0C_{0} exceeds the constants in Proposition 10.3 and Theorem 10.5. By Proposition 10.3 and a simple 2\ell_{2} estimate, ξ(E0c)C0n100\operatorname{\mathbb{P}}^{\xi}(E_{0}^{c})\leq C_{0}n^{-100}. We further let zηmaxβ>0minγ>0𝖣¯η(β,γ)z_{\eta}\equiv\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma) for η0\eta\geq 0.

Let 𝗀:n\mathsf{g}:\mathbb{R}^{n}\to\mathbb{R} be 11-Lipschitz with respect to Σ1\lVert\cdot\rVert_{\Sigma^{-1}}. Then for ρ01/C0\rho_{0}\leq 1/C_{0} and ηΞK\eta\in\Xi_{K}, we have

ξ(w^η;ZDη;C0ρ01/2(𝗀))\displaystyle\operatorname{\mathbb{P}}^{\xi}\big{(}\widehat{w}_{\eta;Z}\in D_{\eta;C_{0}\rho_{0}^{1/2}}(\mathsf{g})\big{)}
ξ(w^η;ZDη;C0ρ01/2(𝗀)B(2,)(C0,Ln/n))+ξ(E0c)\displaystyle\leq\operatorname{\mathbb{P}}^{\xi}\big{(}\widehat{w}_{\eta;Z}\in D_{\eta;C_{0}\rho_{0}^{1/2}}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},L_{n}/\sqrt{n})\big{)}+\operatorname{\mathbb{P}}^{\xi}(E_{0}^{c})
ξ(minwB(2,)(C0,Lnn)Hη;Z(w)zη+ρ0)\displaystyle\leq\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{\eta;Z}(w)\geq z_{\eta}+\rho_{0}\Big{)}
+ξ(minwDη;C0ρ01/2(𝗀)B(2,)(C0,Lnn)Hη;Z(w)zη+2ρ0)+C0n100.\displaystyle\qquad+\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in D_{\eta;C_{0}\rho_{0}^{1/2}}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{\eta;Z}(w)\leq z_{\eta}+2\rho_{0}\Big{)}+C_{0}n^{-100}.

Here in the last inequality we used the simple fact that

{minwB(2,)(C0,Lnn)Hη;Z(w)<zη+ρ0}{minwDη;C0ρ01/2(𝗀)B(2,)(C0,Lnn)Hη;Z(w)>zη+2ρ0}\displaystyle\Big{\{}\min_{w\in B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{\eta;Z}(w)<z_{\eta}+\rho_{0}\Big{\}}\cap\Big{\{}\min_{w\in D_{\eta;C_{0}\rho_{0}^{1/2}}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},\frac{L_{n}}{\sqrt{n}})}H_{\eta;Z}(w)>z_{\eta}+2\rho_{0}\Big{\}}
{w^η;ZDη;C0ρ01/2(𝗀)B(2,)(C0,Ln/n)}.\displaystyle\subset\big{\{}\widehat{w}_{\eta;Z}\notin D_{\eta;C_{0}\rho_{0}^{1/2}}(\mathsf{g})\cap B_{(2,\infty)}(C_{0},{L_{n}}/{\sqrt{n}})\big{\}}.

Invoking Theorems 10.4 and 10.5, by enlarging C0C_{0} if necessary, we have for ρ01/C0\rho_{0}\leq 1/C_{0} and ηΞK\eta\in\Xi_{K},

ξ(|𝗀(w^η;Z)𝔼𝗀(wη,)|ρ01/2)C0ρ06n1/6+3ϑ,\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\lvert\mathsf{g}(\widehat{w}_{\eta;Z})-\operatorname{\mathbb{E}}\mathsf{g}(w_{\eta,\ast})\rvert\geq\rho_{0}^{1/2}\Big{)}\leq C_{0}\cdot\rho_{0}^{-6}n^{-1/6+3\vartheta},

or equivalently, for 𝗀0:n\mathsf{g}_{0}:\mathbb{R}^{n}\to\mathbb{R} being 11-Lipschitz with respect to \lVert\cdot\rVert,

ξ(|𝗀0(μ^η;Z)𝔼𝗀0(μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,))|ρ0)C0ρ012n1/6+3ϑ.\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\big{\lvert}\mathsf{g}_{0}(\widehat{\mu}_{\eta;Z})-\operatorname{\mathbb{E}}\mathsf{g}_{0}\big{(}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})\big{)}\big{\rvert}\geq\rho_{0}\Big{)}\leq C_{0}\cdot\rho_{0}^{-12}n^{-1/6+3\vartheta}.

Now we may follow Step 4 in the proof of Theorem 2.3 to strengthen the above statement to a uniform one in η\eta; we only sketch the differences below. Using (9.4) with GG therein replaced by ZZ, and the assumption Σ1opK\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\leq K, we arrive at a modified form of (9.30): on an event E1E_{1} with ξ(E1)1C1en/C1\operatorname{\mathbb{P}}^{\xi}(E_{1})\geq 1-C_{1}e^{-n/C_{1}}, for any η1,η2ΞK\eta_{1},\eta_{2}\in\Xi_{K},

μ^η1;Zμ^η2;ZC1|η1η2|.\displaystyle\lVert\widehat{\mu}_{\eta_{1};Z}-\widehat{\mu}_{\eta_{2};Z}\rVert\leq C_{1}\lvert\eta_{1}-\eta_{2}\rvert. (10.16)

Using (9.31) with Σ1opK\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\leq K, we arrive at a modified form of (9.32): for any η1,η2ΞK\eta_{1},\eta_{2}\in\Xi_{K},

|𝔼𝗀0(μ^(Σ,μ0)𝗌𝖾𝗊(γη1,;τη1,))𝔼𝗀0(μ^(Σ,μ0)𝗌𝖾𝗊(γη2,;τη2,))|C1|η1η2|.\displaystyle\big{\lvert}\operatorname{\mathbb{E}}\mathsf{g}_{0}\big{(}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta_{1},\ast};\tau_{\eta_{1},\ast})\big{)}-\operatorname{\mathbb{E}}\mathsf{g}_{0}\big{(}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta_{2},\ast};\tau_{\eta_{2},\ast})\big{)}\big{\rvert}\leq C_{1}\lvert\eta_{1}-\eta_{2}\rvert.

Now using a standard discretization and a union bound, we have

ξ(supηΞK|𝗀0(μ^η;Z)𝔼𝗀0(μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,))|ρ0)C2ρ013n1/6+3ϑ.\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\sup_{\eta\in\Xi_{K}}\big{\lvert}\mathsf{g}_{0}(\widehat{\mu}_{\eta;Z})-\operatorname{\mathbb{E}}\mathsf{g}_{0}\big{(}\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})\big{)}\big{\rvert}\geq\rho_{0}\Big{)}\leq C_{2}\cdot\rho_{0}^{-13}n^{-1/6+3\vartheta}.

The proof is complete by taking expectation with respect to $\xi$ and noting that $\mathbb{P}(\xi\in\mathcal{E}_{\vartheta})\geq 1-Ce^{-n^{2\vartheta}/C}$ as in Proposition 10.3. ∎
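The discretization and union bound step in the proof above can be sketched as follows. This is an illustrative reconstruction rather than the authors' exact argument: it assumes $\Xi_{K}$ is a bounded interval of length at most $K$, and the symbols $\Delta(\eta)$, $\delta$ and $\mathcal{N}_{\delta}$ (a $\delta$-net of $\Xi_{K}$) are introduced here only for the illustration. Constants are not tracked.

```latex
% Hypothetical notation: \Delta(\eta), \delta, \mathcal{N}_\delta are not from the paper.
\begin{align*}
\Delta(\eta) &\equiv \big\lvert \mathsf{g}_{0}(\widehat{\mu}_{\eta;Z})
  -\mathbb{E}\,\mathsf{g}_{0}\big(\widehat{\mu}^{\mathsf{seq}}_{(\Sigma,\mu_{0})}(\gamma_{\eta,\ast};\tau_{\eta,\ast})\big)\big\rvert,
  \qquad \delta\equiv\rho_{0}/(4C_{1}),\\
\text{on } E_{1}:\quad \sup_{\eta\in\Xi_{K}}\Delta(\eta)
  &\leq \max_{\eta'\in\mathcal{N}_{\delta}}\Delta(\eta')+2C_{1}\delta
  = \max_{\eta'\in\mathcal{N}_{\delta}}\Delta(\eta')+\rho_{0}/2,
  \qquad \lvert\mathcal{N}_{\delta}\rvert\lesssim K/\delta\asymp\rho_{0}^{-1},\\
\mathbb{P}^{\xi}\Big(\sup_{\eta\in\Xi_{K}}\Delta(\eta)\geq\rho_{0}\Big)
  &\leq \sum_{\eta'\in\mathcal{N}_{\delta}}\mathbb{P}^{\xi}\big(\Delta(\eta')\geq\rho_{0}/2\big)+\mathbb{P}^{\xi}(E_{1}^{c})
  \lesssim \rho_{0}^{-1}\cdot\rho_{0}^{-12}n^{-1/6+3\vartheta}+e^{-n/C_{1}}.
\end{align*}
```

The second line combines (10.16) with the Lipschitz property of $\mathsf{g}_{0}$ and the Lipschitz bound on the expectation in $\eta$. The net has cardinality of order $\rho_{0}^{-1}$, which accounts for the passage from $\rho_{0}^{-12}$ to $\rho_{0}^{-13}$.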

10.6. Proof of the universality Theorem 2.4 for r^η;Z\widehat{r}_{\eta;Z}

Proposition 10.7.

Suppose Assumption A holds and the following hold for some K>0K>0.

  • 1/Kϕ1,ηK1/K\leq\phi^{-1},\eta\leq K, ΣopΣ1opK\lVert\Sigma\rVert_{\operatorname{op}}\vee\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\leq K.

  • Assumption B holds with σξ2[1/K,K]\sigma_{\xi}^{2}\in[1/K,K].

Fix ϑ(0,1/18)\vartheta\in(0,1/18). Then there exists some C=C(K,ϑ)>0C=C(K,\vartheta)>0 such that for any 11-Lipschitz function 𝗁:m\mathsf{h}:\mathbb{R}^{m}\to\mathbb{R}, ρ01/C\rho_{0}\leq 1/C, ηΞK\eta\in\Xi_{K} and ξϑ\xi\in\mathcal{E}_{\vartheta},

supμ0𝒰ϑξ(maxvDη;Cρ01/2(𝗁)L(Lnn)minwnhη;Z(w,v)maxβ>0minγ>0𝖣¯η(β,γ)ρ0)Cρ03n1/6+3ϑ.\displaystyle\sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in D_{\eta;C\rho_{0}^{1/2}}(\mathsf{h})\cap L_{\infty}(\frac{L_{n}}{\sqrt{n}})}\min_{w\in\mathbb{R}^{n}}h_{\eta;Z}(w,v)\geq\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma)-\rho_{0}\Big{)}\leq C\rho_{0}^{-3}n^{-1/6+3\vartheta}.

Here LnCnϑL_{n}\equiv Cn^{\vartheta}, and 𝒰ϑ\mathcal{U}_{\vartheta} is specified as in Proposition 10.3.

Proof.

Fix ε,ϑ>0\varepsilon,\vartheta>0, μ0𝒰ϑ\mu_{0}\in\mathcal{U}_{\vartheta} and ξϑ\xi\in\mathcal{E}_{\vartheta} as specified in Proposition 10.3. We define a renormalized version of Dε;η(𝗁)D_{\varepsilon;\eta}(\mathsf{h}) as

D~ε;η(𝗁){r~m:|𝗁(r~/n)𝔼ξ𝗁(r~η,/n)|ε},\displaystyle\widetilde{D}_{\varepsilon;\eta}(\mathsf{h})\equiv\big{\{}\widetilde{r}\in\mathbb{R}^{m}:\lvert\mathsf{h}(\widetilde{r}/\sqrt{n})-\operatorname{\mathbb{E}}^{\xi}\mathsf{h}(\widetilde{r}_{\eta,\ast}/\sqrt{n})\rvert\geq\varepsilon\big{\}},

where r~η,=nrη,\widetilde{r}_{\eta,\ast}=\sqrt{n}{r}_{\eta,\ast}. Let Ln=C0nϑL_{n}=C_{0}n^{\vartheta} and Q(v~,w~)Q(\widetilde{v},\widetilde{w}) be defined as in the proof of Theorem 10.4. Then we have,

maxvDη;ε(𝗁)L(Ln/n)minwL(Ln/n)hη;Z(w,v)\displaystyle\max_{v\in D_{\eta;\varepsilon}(\mathsf{h})\cap L_{\infty}(L_{n}/\sqrt{n})}\min_{w\in L_{\infty}(L_{n}/\sqrt{n})}h_{\eta;Z}(w,v)
=maxv~D~η;ε(𝗁)L(Ln)minw~L(Ln)hη;Z(w~/n,v~/n)\displaystyle=\max_{\widetilde{v}\in\widetilde{D}_{\eta;\varepsilon}(\mathsf{h})\cap L_{\infty}(L_{n})}\min_{\widetilde{w}\in L_{\infty}(L_{n})}h_{\eta;Z}(\widetilde{w}/\sqrt{n},\widetilde{v}/\sqrt{n})
=maxv~D~η;ε(𝗁)L(Ln)minw~L(Ln){1n3/2v~,Zw~1nv~,ξ+F(w~/n)ηv~22n}\displaystyle=\max_{\widetilde{v}\in\widetilde{D}_{\eta;\varepsilon}(\mathsf{h})\cap L_{\infty}(L_{n})}\min_{\widetilde{w}\in L_{\infty}(L_{n})}\bigg{\{}\frac{1}{n^{3/2}}\langle\widetilde{v},Z\widetilde{w}\rangle-\frac{1}{n}\langle\widetilde{v},\xi\rangle+F(\widetilde{w}/\sqrt{n})-\frac{\eta\lVert\widetilde{v}\rVert^{2}}{2n}\bigg{\}}
=maxv~D~η;ε(𝗁)L(Ln)minw~L(Ln){1n3/2v~,Zw~+Q(v~,w~)}.\displaystyle=\max_{\widetilde{v}\in\widetilde{D}_{\eta;\varepsilon}(\mathsf{h})\cap L_{\infty}(L_{n})}\min_{\widetilde{w}\in L_{\infty}(L_{n})}\bigg{\{}\frac{1}{n^{3/2}}\langle\widetilde{v},Z\widetilde{w}\rangle+Q(\widetilde{v},\widetilde{w})\bigg{\}}.

Using the comparison inequality in Theorem 10.2 and a similar calculation as in (10.13), with $s_{n}(\rho_{0})\equiv\rho_{0}^{-3}n^{-1/6+3\vartheta}$,

ξ(maxvDη;ε(𝗁)L(Ln/n)minwL(Ln/n)hη;Z(w,v)zρ0)\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in D_{\eta;\varepsilon}(\mathsf{h})\cap L_{\infty}(L_{n}/\sqrt{n})}\min_{w\in L_{\infty}(L_{n}/\sqrt{n})}h_{\eta;Z}(w,v)\geq z-\rho_{0}\Big{)}
=ξ(maxv~D~η;ε(𝗁)L(Ln)minw~L(Ln){1n3/2v~,Zw~+Q(v~,w~)}zρ0)\displaystyle=\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{\widetilde{v}\in\widetilde{D}_{\eta;\varepsilon}(\mathsf{h})\cap L_{\infty}(L_{n})}\min_{\widetilde{w}\in L_{\infty}(L_{n})}\bigg{\{}\frac{1}{n^{3/2}}\langle\widetilde{v},Z\widetilde{w}\rangle+Q(\widetilde{v},\widetilde{w})\bigg{\}}\geq z-\rho_{0}\Big{)}
ξ(maxv~D~η;ε(𝗁)L(Ln)minw~L(Ln){1n3/2v~,Gw~+Q(v~,w~)}z3ρ0)+Csn(ρ0)\displaystyle\leq\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{\widetilde{v}\in\widetilde{D}_{\eta;\varepsilon}(\mathsf{h})\cap L_{\infty}(L_{n})}\min_{\widetilde{w}\in L_{\infty}(L_{n})}\bigg{\{}\frac{1}{n^{3/2}}\langle\widetilde{v},G\widetilde{w}\rangle+Q(\widetilde{v},\widetilde{w})\bigg{\}}\geq z-3\rho_{0}\Big{)}+Cs_{n}(\rho_{0})
=ξ(maxvDη;ε(𝗁)L(Ln/n)minwL(Ln/n)hη;G(w,v)z3ρ0)+C1sn(ρ0).\displaystyle=\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in D_{\eta;\varepsilon}(\mathsf{h})\cap L_{\infty}(L_{n}/\sqrt{n})}\min_{w\in L_{\infty}(L_{n}/\sqrt{n})}h_{\eta;G}(w,v)\geq z-3\rho_{0}\Big{)}+C_{1}s_{n}(\rho_{0}).

Using the convex Gaussian min-max theorem (cf. Theorem 6.1),

ξ(maxvDη;ε(𝗁)L(Ln/n)minwL(Ln/n)hη;Z(w,v)zρ0)\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in D_{\eta;\varepsilon}(\mathsf{h})\cap L_{\infty}(L_{n}/\sqrt{n})}\min_{w\in L_{\infty}(L_{n}/\sqrt{n})}h_{\eta;Z}(w,v)\geq z-\rho_{0}\Big{)}
2(maxvDη;ε(𝗁)L(Ln/n)minwL(Ln/n)η(w,v)z3ρ0)+C1sn(ρ0).\displaystyle\leq 2\operatorname{\mathbb{P}}\Big{(}\max_{v\in D_{\eta;\varepsilon}(\mathsf{h})\cap L_{\infty}(L_{n}/\sqrt{n})}\min_{w\in L_{\infty}(L_{n}/\sqrt{n})}\ell_{\eta}(w,v)\geq z-3\rho_{0}\Big{)}+C_{1}s_{n}(\rho_{0}). (10.17)

On the other hand, using the definition of $w_{\eta,\ast}$ in (9.14), and the fact that for any $\mu_{0}\in\mathcal{U}_{\vartheta}$, $\lVert\mathbb{E}w_{\eta,\ast}\rVert_{\infty}\leq L_{n}/\sqrt{n}$, we have $\mathbb{P}\big(\lVert w_{\eta,\ast}\rVert_{\infty}\geq L_{n}/\sqrt{n}\big)\leq Ce^{-n^{2\vartheta}/C}$. Combined with (10.17), we have

ξ(maxvDη;ε(𝗁)L(Ln/n)minwL(Ln/n)hη;Z(w,v)zρ0)\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in D_{\eta;\varepsilon}(\mathsf{h})\cap L_{\infty}(L_{n}/\sqrt{n})}\min_{w\in L_{\infty}(L_{n}/\sqrt{n})}h_{\eta;Z}(w,v)\geq z-\rho_{0}\Big{)}
2(maxvDη;ε(𝗁)L(Ln/n)η(wη,,v)z3ρ0)+C2sn(ρ0).\displaystyle\leq 2\operatorname{\mathbb{P}}\Big{(}\max_{v\in D_{\eta;\varepsilon}(\mathsf{h})\cap L_{\infty}(L_{n}/\sqrt{n})}\ell_{\eta}(w_{\eta,\ast},v)\geq z-3\rho_{0}\Big{)}+C_{2}s_{n}(\rho_{0}).

In view of (9.45), now by choosing zzηmaxβ>0minγ>0𝖣¯η(β,γ)z\equiv z_{\eta}\equiv\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma) and εC3ρ1/20\varepsilon\equiv C_{3}\rho^{1/2}_{0}, for ρ0C4n1/2+ϑ\rho_{0}\geq C_{4}n^{-1/2+\vartheta}, ξϑ1,ξ(ρ0/C)\xi\in\mathcal{E}_{\vartheta}\subset\mathscr{E}_{1,\xi}(\rho_{0}/C), it follows that

ξ(maxvDη;C3ρ01/2(𝗁)L(Ln/n)minwnhη;Z(w,v)zηρ0)\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in D_{\eta;C_{3}\rho_{0}^{1/2}}(\mathsf{h})\cap L_{\infty}(L_{n}/\sqrt{n})}\min_{w\in\mathbb{R}^{n}}h_{\eta;Z}(w,v)\geq z_{\eta}-\rho_{0}\Big{)}
ξ(maxvDη;C3ρ01/2(𝗁)L(Ln/n)minwL(Ln/n)hη;Z(w,v)zηρ0)C4sn(ρ0).\displaystyle\leq\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in D_{\eta;C_{3}\rho_{0}^{1/2}}(\mathsf{h})\cap L_{\infty}(L_{n}/\sqrt{n})}\min_{w\in L_{\infty}(L_{n}/\sqrt{n})}h_{\eta;Z}(w,v)\geq z_{\eta}-\rho_{0}\Big{)}\leq C_{4}s_{n}(\rho_{0}).

The claim follows by adjusting constants. ∎

Proof of Theorem 2.4 for r^η;Z\widehat{r}_{\eta;Z}.

Fix ϑ>0\vartheta>0, μ0𝒰ϑ\mu_{0}\in\mathcal{U}_{\vartheta} and ξϑ\xi\in\mathcal{E}_{\vartheta} as specified in Proposition 10.3. We continue writing zηmaxβ>0minγ>0𝖣¯η(β,γ)z_{\eta}\equiv\max_{\beta>0}\min_{\gamma>0}\overline{\mathsf{D}}_{\eta}(\beta,\gamma) in the proof. Using the delocalization results in Proposition 10.3, on an event E0E_{0} with ξ(E0)1C0n100\operatorname{\mathbb{P}}^{\xi}(E_{0})\geq 1-C_{0}n^{-100}, we have w^η;Zr^η;ZLn/n\lVert\widehat{w}_{\eta;Z}\rVert_{\infty}\vee\lVert\widehat{r}_{\eta;Z}\rVert_{\infty}\leq L_{n}/\sqrt{n} with Ln=C0nϑL_{n}=C_{0}n^{\vartheta}. Using Theorem 10.4, for ρ01/C\rho_{0}\leq 1/C, and ηΞK\eta\in\Xi_{K}, by possibly adjusting C0>0C_{0}>0,

ξ(maxvL(Ln/n)minwmhη;Z(w,v)zηρ0/2)\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in L_{\infty}(L_{n}/\sqrt{n})}\min_{w\in\mathbb{R}^{m}}h_{\eta;Z}(w,v)\leq z_{\eta}-\rho_{0}/2\Big{)}
ξ(minwmHη;Z(w)zηρ0/2)+ξ(E0c)C0ρ03n1/6+3ϑ.\displaystyle\leq\operatorname{\mathbb{P}}^{\xi}\Big{(}\min_{w\in\mathbb{R}^{m}}H_{\eta;Z}(w)\leq z_{\eta}-\rho_{0}/2\Big{)}+\operatorname{\mathbb{P}}^{\xi}(E_{0}^{c})\leq C_{0}\rho_{0}^{-3}\cdot n^{-1/6+3\vartheta}. (10.18)

Let us take C1>0C_{1}>0 to be the constant in Proposition 10.7. By noting that

{maxvL(Ln/n)minwmhη;Z(w,v)>zηρ0/2}\displaystyle\Big{\{}\max_{v\in L_{\infty}(L_{n}/\sqrt{n})}\min_{w\in\mathbb{R}^{m}}h_{\eta;Z}(w,v)>z_{\eta}-\rho_{0}/2\Big{\}}
{maxvDη;C1ρ01/2(𝗁)L(Ln/n)minwnhη;Z(w,v)<zηρ0}\displaystyle\quad\cap\Big{\{}\max_{v\in D_{\eta;C_{1}\rho_{0}^{1/2}}(\mathsf{h})\cap L_{\infty}(L_{n}/\sqrt{n})}\min_{w\in\mathbb{R}^{n}}h_{\eta;Z}(w,v)<z_{\eta}-\rho_{0}\Big{\}}
{v^η;ZDη;C1ρ01/2(𝗁)L(Ln/n)},\displaystyle\subset\big{\{}\widehat{v}_{\eta;Z}\notin D_{\eta;C_{1}\rho_{0}^{1/2}}(\mathsf{h})\cap L_{\infty}(L_{n}/\sqrt{n})\big{\}},

it follows from (10.18) and Proposition 10.7 that

ξ(v^η;ZDη;C1ρ01/2(𝗁))\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\widehat{v}_{\eta;Z}\in D_{\eta;C_{1}\rho_{0}^{1/2}}(\mathsf{h})\Big{)}
ξ(v^η;ZDη;C1ρ01/2(𝗁)L(Ln/n))+ξ(v^η;ZL(Ln/n))\displaystyle\leq\operatorname{\mathbb{P}}^{\xi}\Big{(}\widehat{v}_{\eta;Z}\in D_{\eta;C_{1}\rho_{0}^{1/2}}(\mathsf{h})\cap L_{\infty}(L_{n}/\sqrt{n})\Big{)}+\operatorname{\mathbb{P}}^{\xi}\Big{(}\widehat{v}_{\eta;Z}\notin L_{\infty}(L_{n}/\sqrt{n})\Big{)}
ξ(maxvL(Ln/n)minwmhη;Z(w,v)zηρ0/2)\displaystyle\leq\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in L_{\infty}(L_{n}/\sqrt{n})}\min_{w\in\mathbb{R}^{m}}h_{\eta;Z}(w,v)\leq z_{\eta}-\rho_{0}/2\Big{)}
+ξ(maxvDη;C1ρ01/2(𝗁)L(Ln/n)minwnhη;Z(w,v)zηρ0)+ξ(E0c)\displaystyle\qquad+\operatorname{\mathbb{P}}^{\xi}\Big{(}\max_{v\in D_{\eta;C_{1}\rho_{0}^{1/2}}(\mathsf{h})\cap L_{\infty}(L_{n}/\sqrt{n})}\min_{w\in\mathbb{R}^{n}}h_{\eta;Z}(w,v)\geq z_{\eta}-\rho_{0}\Big{)}+\operatorname{\mathbb{P}}^{\xi}(E_{0}^{c})
Cρ03n1/6+3ϑ.\displaystyle\leq C\rho_{0}^{-3}\cdot n^{-1/6+3\vartheta}.

Finally we only need to extend the above display to a uniform control over η[1/K,K]\eta\in[1/K,K] by continuity arguments similar to Step 5 of the proof of Theorem 2.3 for r^η;G\widehat{r}_{\eta;G}. By (9.50) (where GG therein is replaced by ZZ) and (10.16), on an event E1E_{1} with ξ(E1)1Cen/C\operatorname{\mathbb{P}}^{\xi}(E_{1})\geq 1-Ce^{-n/C}, for any η1,η2[1/K,K]\eta_{1},\eta_{2}\in[1/K,K],

r^η1;Zr^η2;ZC|η1η2|.\displaystyle\lVert\widehat{r}_{\eta_{1};Z}-\widehat{r}_{\eta_{2};Z}\rVert\leq C\lvert\eta_{1}-\eta_{2}\rvert.

On the other hand, (9.5) remains valid, so we may proceed with an ε\varepsilon-net argument over [1/K,K][1/K,K] to conclude. ∎

10.7. Proof of Theorem 2.5

To keep notation simple, we work with 𝖠=In\mathsf{A}=I_{n} and write Γη;(Σ,μ0)In=Γη;(Σ,μ0)\Gamma_{\eta;(\Sigma,\lVert\mu_{0}\rVert)}^{I_{n}}=\Gamma_{\eta;(\Sigma,\lVert\mu_{0}\rVert)}. The general case follows from minor modifications.

Lemma 10.8.

Suppose the conditions in Theorem 2.5 hold for some K>0K>0. Fix q(0,)q\in(0,\infty). There exists some constant c=c(K,q)>0c=c(K,q)>0 such that n121q𝔼μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,)μ0qcn^{\frac{1}{2}-\frac{1}{q}}\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})-\mu_{0}\rVert_{q}\geq c uniformly in ηΞK\eta\in\Xi_{K}.

Proof.

We may write 𝔼μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,)μ0q=𝔼(j=1n|aj+bjgj|q)1/q\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})-\mu_{0}\rVert_{q}=\operatorname{\mathbb{E}}\big{(}\sum_{j=1}^{n}\lvert a_{j}+b_{j}g_{j}\rvert^{q}\big{)}^{1/q} for some aj,bja_{j},b_{j}\in\mathbb{R} with bj1b_{j}\asymp 1, and gj𝒩(0,1/n)g_{j}\sim\mathcal{N}(0,1/n) not necessarily independent of each other. So for some cjc_{j}\in\mathbb{R},

𝔼μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,)μ0q𝔼(j=1n|cj+gj|q)1/q.\displaystyle\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})-\mu_{0}\rVert_{q}\gtrsim\operatorname{\mathbb{E}}\Big{(}\sum_{j=1}^{n}\lvert c_{j}+g_{j}\rvert^{q}\Big{)}^{1/q}.

If j=1n|cj|qC0j=1n𝔼|gj|q\sum_{j=1}^{n}\lvert c_{j}\rvert^{q}\geq C_{0}\sum_{j=1}^{n}\operatorname{\mathbb{E}}\lvert g_{j}\rvert^{q} for a large enough C0>0C_{0}>0, the lower bound follows trivially. Otherwise, with Zj=1n|cj+gj|qZ\equiv\sum_{j=1}^{n}\lvert c_{j}+g_{j}\rvert^{q}, we have 𝔼Zj=1ninfc𝔼|c+gj|qn1q/2\operatorname{\mathbb{E}}Z\geq\sum_{j=1}^{n}\inf_{c\in\mathbb{R}}\operatorname{\mathbb{E}}\lvert c+g_{j}\rvert^{q}\gtrsim n^{1-q/2} and 𝔼Z2𝔼(j=1n(|gj|q+𝔼|gj|q))2(n1q/2)2\operatorname{\mathbb{E}}Z^{2}\lesssim\operatorname{\mathbb{E}}\big{(}\sum_{j=1}^{n}(\lvert g_{j}\rvert^{q}+\operatorname{\mathbb{E}}\lvert g_{j}\rvert^{q})\big{)}^{2}\lesssim(n^{1-q/2})^{2}, so by Paley-Zygmund inequality, (Z𝔼Z/2)(𝔼Z)2/(4𝔼Z2)c0\operatorname{\mathbb{P}}(Z\geq\operatorname{\mathbb{E}}Z/2)\geq(\operatorname{\mathbb{E}}Z)^{2}/(4\operatorname{\mathbb{E}}Z^{2})\geq c_{0} for some c0>0c_{0}>0. In other words, on an event E0E_{0} with (E0)c0\operatorname{\mathbb{P}}(E_{0})\geq c_{0}, Zc0n1q/2Z\geq c_{0}n^{1-q/2}. Using the above display, this means that 𝔼μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,)μ0q𝔼Z1/q𝔼Z1/q𝟏E0n1/q1/2\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})-\mu_{0}\rVert_{q}\gtrsim\operatorname{\mathbb{E}}Z^{1/q}\geq\operatorname{\mathbb{E}}Z^{1/q}\bm{1}_{E_{0}}\gtrsim n^{1/q-1/2}. ∎
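The Paley-Zygmund step in the proof above can be checked numerically. The sketch below is illustrative only: it assumes independent $g_{j}\sim\mathcal{N}(0,1/n)$ (the lemma does not require independence), uses hypothetical shifts `c` of order $n^{-1/2}$, and the function name `paley_zygmund_check` is ours.

```python
import numpy as np

def paley_zygmund_check(n=200, q=1.5, reps=2000, seed=0):
    """Monte Carlo check of P(Z >= EZ/2) >= (EZ)^2 / (4 EZ^2) for
    Z = sum_j |c_j + g_j|^q, with g_j ~ N(0, 1/n) drawn independently
    (an assumption made only for this illustration)."""
    rng = np.random.default_rng(seed)
    # Hypothetical deterministic shifts c_j, of the same order as g_j.
    c = rng.uniform(-1.0, 1.0, size=n) / np.sqrt(n)
    # Each row holds one draw of (g_1, ..., g_n); std is n^{-1/2}.
    g = rng.normal(0.0, 1.0 / np.sqrt(n), size=(reps, n))
    z = np.sum(np.abs(c + g) ** q, axis=1)   # one sample of Z per row
    mean_z = z.mean()                         # empirical E Z
    second_moment = np.mean(z ** 2)           # empirical E Z^2
    prob = np.mean(z >= mean_z / 2.0)         # empirical P(Z >= EZ/2)
    lower = mean_z ** 2 / (4.0 * second_moment)
    return prob, lower, mean_z

prob, lower, mean_z = paley_zygmund_check()
```

Here `prob` is the empirical frequency of the event $\{Z\geq\mathbb{E}Z/2\}$ and `lower` is the Paley-Zygmund lower bound. In this independent setting $Z$ concentrates, so `prob` is typically much larger than `lower`; the lemma only needs a constant lower bound.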

Lemma 10.9.

Suppose the conditions in Theorem 2.5 hold for some K>0K>0. Fix q(0,)q\in(0,\infty). Then there exist constants C=C(K,q)>1C=C(K,q)>1, ϑ=ϑ(q)(0,1/50)\vartheta=\vartheta(q)\in(0,1/50), and a measurable set 𝒰ϑBn(1)\mathcal{U}_{\vartheta}\subset B_{n}(1) with vol(𝒰ϑ)/vol(Bn(1))1Cenϑ/C\mathrm{vol}(\mathcal{U}_{\vartheta})/\mathrm{vol}(B_{n}(1))\geq 1-Ce^{-n^{\vartheta}/C}, such that

supμ0𝒰ϑn121q|𝔼μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,)μ0qn12diag(Γη;(Σ,μ0))q/212Mq|Cnϑ.\displaystyle\sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}n^{\frac{1}{2}-\frac{1}{q}}\big{\lvert}\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})-\mu_{0}\rVert_{q}-n^{-\frac{1}{2}}\lVert\mathrm{diag}\big{(}\Gamma_{\eta;(\Sigma,\lVert\mu_{0}\rVert)}\big{)}\rVert_{q/2}^{\frac{1}{2}}M_{q}\big{\rvert}\leq Cn^{-\vartheta}.

Here Mq=𝔼1/q|𝒩(0,1)|qM_{q}=\operatorname{\mathbb{E}}^{1/q}\lvert\mathcal{N}(0,1)\rvert^{q}.

Proof.

We write τη,=τη\tau_{\eta,\ast}=\tau_{\eta}, γη,=γη\gamma_{\eta,\ast}=\gamma_{\eta} for notational simplicity in the proof. All the constants in ,,\lesssim,\gtrsim,\asymp below may depend on K,qK,q. Recall the general fact xqn12+1q2x2q2x12q2\lVert x\rVert_{q}\leq n^{-\frac{1}{2}+\frac{1}{q\wedge 2}}\lVert x\rVert^{\frac{2}{q\vee 2}}\lVert x\rVert_{\infty}^{1-\frac{2}{q\vee 2}} for xnx\in\mathbb{R}^{n} and q(0,)q\in(0,\infty).
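The general $\ell_{q}$ inequality recalled above can be sanity-checked numerically. The following is a minimal sketch on a random Gaussian vector; the helper names `lq_norm` and `interpolation_bound` are ours and the test vector is arbitrary.

```python
import numpy as np

def lq_norm(x, q):
    """(sum_i |x_i|^q)^(1/q); only a quasi-norm when 0 < q < 1."""
    return float(np.sum(np.abs(x) ** q) ** (1.0 / q))

def interpolation_bound(x, q):
    """Right-hand side of the inequality:
    n^(-1/2 + 1/min(q,2)) * ||x||_2^(2/max(q,2)) * ||x||_inf^(1 - 2/max(q,2))."""
    n = x.size
    l2 = float(np.linalg.norm(x))
    linf = float(np.max(np.abs(x)))
    q_lo, q_hi = min(q, 2.0), max(q, 2.0)
    return n ** (-0.5 + 1.0 / q_lo) * l2 ** (2.0 / q_hi) * linf ** (1.0 - 2.0 / q_hi)

rng = np.random.default_rng(0)
x = rng.standard_normal(500)
# Map each q to the pair (left-hand side, right-hand side).
checks = {q: (lq_norm(x, q), interpolation_bound(x, q)) for q in (0.5, 1.0, 2.0, 3.0, 10.0)}
```

For $q\leq 2$ the bound reduces to Hölder's inequality $\lVert x\rVert_{q}\leq n^{1/q-1/2}\lVert x\rVert$, and for $q>2$ it follows from $\sum_{i}\lvert x_{i}\rvert^{q}\leq\lVert x\rVert_{\infty}^{q-2}\lVert x\rVert^{2}$; at $q=2$ both sides coincide.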

By Proposition 11.2 below, for any ϑ(0,1/2)\vartheta\in(0,1/2), there exists some 𝒰ϑBn(1)\mathcal{U}_{\vartheta}\subset B_{n}(1) with vol(𝒰ϑ)/vol(Bn(1))1Cen12ϑ/C\mathrm{vol}(\mathcal{U}_{\vartheta})/\mathrm{vol}(B_{n}(1))\geq 1-Ce^{-n^{1-2\vartheta}/C}, such that supμ0𝒰ϑsupηΞK|γη2γ~η2(μ0)|nϑ\sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}\sup_{\eta\in\Xi_{K}}\lvert\gamma_{\eta}^{2}-\widetilde{\gamma}_{\eta}^{2}(\lVert\mu_{0}\rVert)\rvert\leq n^{-\vartheta}. Consequently, uniformly in μ0𝒰ϑ\mu_{0}\in\mathcal{U}_{\vartheta} and ηΞK\eta\in\Xi_{K},

|𝔼μ^(Σ,μ0)𝗌𝖾𝗊(γη;τη)μ0q𝔼μ^(Σ,μ0)𝗌𝖾𝗊(γ~η(μ0);τη)μ0q|\displaystyle\big{\lvert}\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta};\tau_{\eta})-\mu_{0}\rVert_{q}-\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\widetilde{\gamma}_{\eta}(\lVert\mu_{0}\rVert);\tau_{\eta})-\mu_{0}\rVert_{q}\big{\rvert}
|γηγ~η(μ0)|n1/2𝔼(Σ+τηI)1Σ1/2gq\displaystyle\lesssim\lvert\gamma_{\eta}-\widetilde{\gamma}_{\eta}(\lVert\mu_{0}\rVert)\rvert\cdot n^{-1/2}\operatorname{\mathbb{E}}\lVert(\Sigma+\tau_{\eta}I)^{-1}\Sigma^{1/2}g\rVert_{q}
n12ϑn12+1q2𝔼{(Σ+τηI)1Σ1/2g2q2(Σ+τηI)1Σ1/2g12q2}\displaystyle\leq n^{-\frac{1}{2}-\vartheta}\cdot n^{-\frac{1}{2}+\frac{1}{q\wedge 2}}\cdot\operatorname{\mathbb{E}}\Big{\{}\lVert(\Sigma+\tau_{\eta}I)^{-1}\Sigma^{1/2}g\rVert^{\frac{2}{q\vee 2}}\cdot\lVert(\Sigma+\tau_{\eta}I)^{-1}\Sigma^{1/2}g\rVert_{\infty}^{1-\frac{2}{q\vee 2}}\Big{\}}
n12+1qϑ(logn)121q2.\displaystyle\lesssim n^{-\frac{1}{2}+\frac{1}{q}-\vartheta}\big{(}\log n\big{)}^{\frac{1}{2}-\frac{1}{q\vee 2}}. (10.19)

For gng^{\prime}\in\mathbb{R}^{n}, let 𝖿(g)n1/2(Σ+τηI)1gq\mathsf{f}(g^{\prime})\equiv n^{-1/2}\lVert(\Sigma+\tau_{\eta}I)^{-1}g^{\prime}\rVert_{q}, and

𝖥μ0(g)\displaystyle\mathsf{F}_{\lVert\mu_{0}\rVert}(g^{\prime}) n1/2(Σ+τηI)1(τημ0g+γ~η(μ0)Σ1/2g)q,\displaystyle\equiv n^{-1/2}\big{\lVert}(\Sigma+\tau_{\eta}I)^{-1}\big{(}-\tau_{\eta}\lVert\mu_{0}\rVert g^{\prime}+\widetilde{\gamma}_{\eta}(\lVert\mu_{0}\rVert)\Sigma^{1/2}g\big{)}\big{\rVert}_{q},
𝖥μ0,0(g)\displaystyle\mathsf{F}_{\lVert\mu_{0}\rVert,0}(g^{\prime}) (Σ+τηI)1(τημ0gg+γ~η(μ0)Σ1/2gn)q.\displaystyle\equiv\bigg{\lVert}(\Sigma+\tau_{\eta}I)^{-1}\bigg{(}-\tau_{\eta}\lVert\mu_{0}\rVert\frac{g^{\prime}}{\lVert g^{\prime}\rVert}+\widetilde{\gamma}_{\eta}(\lVert\mu_{0}\rVert)\Sigma^{1/2}\frac{g}{\sqrt{n}}\bigg{)}\bigg{\rVert}_{q}.

Then for g1,g2ng_{1}^{\prime},g_{2}^{\prime}\in\mathbb{R}^{n},

|𝖥μ0(g1)𝖥μ0(g2)||𝖿(g1)𝖿(g2)|n1+1q2g1g2.\displaystyle\big{\lvert}\mathsf{F}_{\lVert\mu_{0}\rVert}(g_{1}^{\prime})-\mathsf{F}_{\lVert\mu_{0}\rVert}(g_{2}^{\prime})\big{\rvert}\vee\big{\lvert}\mathsf{f}(g_{1}^{\prime})-\mathsf{f}(g_{2}^{\prime})\big{\rvert}\lesssim n^{-1+\frac{1}{q\wedge 2}}\lVert g_{1}^{\prime}-g_{2}^{\prime}\rVert.

By Gaussian concentration inequality, for any ϑ(0,1/2)\vartheta\in(0,1/2), we may find some 𝒢ϑ,μ0n\mathcal{G}_{\vartheta,\lVert\mu_{0}\rVert}\subset\mathbb{R}^{n} with (g0𝒢ϑ,μ0)1Cen2ϑ/C\operatorname{\mathbb{P}}(g_{0}\in\mathcal{G}_{\vartheta,\lVert\mu_{0}\rVert})\geq 1-Ce^{-n^{2\vartheta}/C}, g0𝒩(0,In)g_{0}\sim\mathcal{N}(0,I_{n}), such that uniformly in g𝒢ϑ,μ0g^{\prime}\in\mathcal{G}_{\vartheta,\lVert\mu_{0}\rVert},

max{|gn|,n11q2|𝖥μ0(g)𝔼g0𝖥μ0(g0)|,\displaystyle\max\Big{\{}\big{\lvert}\lVert g^{\prime}\rVert-\sqrt{n}\big{\rvert},n^{1-\frac{1}{q\wedge 2}}\big{\lvert}\mathsf{F}_{\lVert\mu_{0}\rVert}(g^{\prime})-\operatorname{\mathbb{E}}_{g_{0}}\mathsf{F}_{\lVert\mu_{0}\rVert}(g_{0})\big{\rvert},
n11q2|𝖿(g)𝔼𝖿(g0)|}nϑ.\displaystyle\qquad\qquad n^{1-\frac{1}{q\wedge 2}}\big{\lvert}\mathsf{f}(g^{\prime})-\operatorname{\mathbb{E}}\mathsf{f}(g_{0})\big{\rvert}\Big{\}}\leq n^{\vartheta}. (10.20)

As $\mathbb{E}\mathsf{f}(g_{0})=n^{-1/2}\mathbb{E}\lVert(\Sigma+\tau_{\eta}I)^{-1}g_{0}\rVert_{q}\lesssim n^{-\frac{1}{2}+\frac{1}{q}}(\log n)^{\frac{1}{2}-\frac{1}{q\vee 2}}$, for $\vartheta$ small enough, uniformly in $g^{\prime}\in\mathcal{G}_{\vartheta,\lVert\mu_{0}\rVert}$,

|𝖥μ0(g)𝖥μ0,0(g)||𝖿(g)||1n/g|\displaystyle\big{\lvert}\mathsf{F}_{\lVert\mu_{0}\rVert}(g^{\prime})-\mathsf{F}_{\lVert\mu_{0}\rVert,0}(g^{\prime})\big{\rvert}\lesssim\lvert\mathsf{f}(g^{\prime})\rvert\cdot\lvert 1-\sqrt{n}/\lVert g^{\prime}\rVert\rvert
(𝔼𝖿(g0)+n1+1q2+ϑ)n12+ϑn1+1q+ϑ(logn)121q2.\displaystyle\lesssim\big{(}\operatorname{\mathbb{E}}\mathsf{f}(g_{0})+n^{-1+\frac{1}{q\wedge 2}+\vartheta}\big{)}\cdot n^{-\frac{1}{2}+\vartheta}\lesssim n^{-1+\frac{1}{q}+\vartheta}(\log n)^{\frac{1}{2}-\frac{1}{q\vee 2}}. (10.21)

Combining (10.20)-(10.21), for $\vartheta$ small enough,

supg𝒢ϑ,μ0n11q2|𝖥μ0,0(g)𝔼g0𝖥μ0(g0)|nϑ.\displaystyle\sup_{g^{\prime}\in\mathcal{G}_{\vartheta,\lVert\mu_{0}\rVert}}n^{1-\frac{1}{q\wedge 2}}\big{\lvert}\mathsf{F}_{\lVert\mu_{0}\rVert,0}(g^{\prime})-\operatorname{\mathbb{E}}_{g_{0}}\mathsf{F}_{\lVert\mu_{0}\rVert}(g_{0})\big{\rvert}\lesssim n^{\vartheta}. (10.22)

Now let 𝒢ϑ,μ0{g/g:g𝒢ϑ,μ0}Bn(1)\partial\mathcal{G}_{\vartheta,\lVert\mu_{0}\rVert}\equiv\{g^{\prime}/\lVert g^{\prime}\rVert:g^{\prime}\in\mathcal{G}_{\vartheta,\lVert\mu_{0}\rVert}\}\subset\partial B_{n}(1). Using that {g0𝒢ϑ,μ0}{g0/g0𝒢ϑ,μ0}\{g_{0}\in\mathcal{G}_{\vartheta,\lVert\mu_{0}\rVert}\}\subset\big{\{}g_{0}/\lVert g_{0}\rVert\in\partial\mathcal{G}_{\vartheta,\lVert\mu_{0}\rVert}\}, we have (g0/g0𝒢ϑ,μ0)(g0𝒢ϑ,μ0)1Cen2ϑ/C\operatorname{\mathbb{P}}\big{(}g_{0}/\lVert g_{0}\rVert\in\partial\mathcal{G}_{\vartheta,\lVert\mu_{0}\rVert}\big{)}\geq\operatorname{\mathbb{P}}(g_{0}\in\mathcal{G}_{\vartheta,\lVert\mu_{0}\rVert})\geq 1-Ce^{-n^{2\vartheta}/C}. So with

𝒱ϑ{μ0=U0g:U0[0,1],g𝒢ϑ,U0}Bn(1),\displaystyle\mathcal{V}_{\vartheta}\equiv\big{\{}\mu_{0}=U_{0}g^{\prime}:U_{0}\in[0,1],g^{\prime}\in\partial\mathcal{G}_{\vartheta,U_{0}}\big{\}}\subset B_{n}(1),

we have (Unif(Bn(1))𝒱ϑ)=𝔼U0g0(g0/g0𝒢ϑ,U0)1Cen2ϑ/C\operatorname{\mathbb{P}}\big{(}\mathrm{Unif}(B_{n}(1))\in\mathcal{V}_{\vartheta}\big{)}=\operatorname{\mathbb{E}}_{U_{0}}\operatorname{\mathbb{P}}_{g_{0}}\big{(}g_{0}/\lVert g_{0}\rVert\in\partial\mathcal{G}_{\vartheta,U_{0}}\big{)}\geq 1-Ce^{-n^{2\vartheta}/C}. In other words, for this constructed set 𝒱ϑ\mathcal{V}_{\vartheta}, we have the desired volume estimate vol(𝒱ϑ)/vol(Bn(1))1Cen2ϑ/C\mathrm{vol}(\mathcal{V}_{\vartheta})/\mathrm{vol}(B_{n}(1))\geq 1-Ce^{-n^{2\vartheta}/C}, and by (10.22),

n11q2supμ0𝒱ϑ|𝔼μ^(Σ,μ0)𝗌𝖾𝗊(γ~η(μ0);τη)μ0q𝔼g0𝖥μ0(g0)|nϑ.\displaystyle n^{1-\frac{1}{q\wedge 2}}\sup_{\mu_{0}\in\mathcal{V}_{\vartheta}}\big{\lvert}\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\widetilde{\gamma}_{\eta}(\lVert\mu_{0}\rVert);\tau_{\eta})-\mu_{0}\rVert_{q}-\operatorname{\mathbb{E}}_{g_{0}}\mathsf{F}_{\lVert\mu_{0}\rVert}(g_{0})\big{\rvert}\lesssim n^{\vartheta}. (10.23)

On the other hand, using the definition of Γη;(Σ,μ0)\Gamma_{\eta;(\Sigma,\lVert\mu_{0}\rVert)} in (2.7), we may compute

𝔼g0𝖥μ0(g0)=𝔼Γη;(Σ,μ0)1/2g/nq.\displaystyle\operatorname{\mathbb{E}}_{g_{0}}\mathsf{F}_{\lVert\mu_{0}\rVert}(g_{0})=\operatorname{\mathbb{E}}\big{\lVert}\Gamma_{\eta;(\Sigma,\lVert\mu_{0}\rVert)}^{1/2}g/\sqrt{n}\big{\rVert}_{q}. (10.24)

Combining (10.19), (10.23) and (10.24), for $\vartheta$ chosen small enough,

supμ0𝒰ϑ𝒱ϑn1/21/q|𝔼μ^(Σ,μ0)𝗌𝖾𝗊(γη;τη)μ0q𝔼Γη;(Σ,μ0)1/2g/nq|\displaystyle\sup_{\mu_{0}\in\mathcal{U}_{\vartheta}\cap\mathcal{V}_{\vartheta}}n^{1/2-1/q}\big{\lvert}\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta};\tau_{\eta})-\mu_{0}\rVert_{q}-\operatorname{\mathbb{E}}\big{\lVert}\Gamma_{\eta;(\Sigma,\lVert\mu_{0}\rVert)}^{1/2}g/\sqrt{n}\big{\rVert}_{q}\big{\rvert}
n121q(n12+1qϑ(logn)121q2+n1+1q2+ϑ)\displaystyle\lesssim n^{\frac{1}{2}-\frac{1}{q}}\cdot\Big{(}n^{-\frac{1}{2}+\frac{1}{q}-\vartheta}\big{(}\log n\big{)}^{\frac{1}{2}-\frac{1}{q\vee 2}}+n^{-1+\frac{1}{q\wedge 2}+\vartheta}\Big{)}
=nϑ(logn)121q2+n121q+1q2+ϑnϑ/2.\displaystyle=n^{-\vartheta}\big{(}\log n\big{)}^{\frac{1}{2}-\frac{1}{q\vee 2}}+n^{-\frac{1}{2}-\frac{1}{q}+\frac{1}{q\wedge 2}+\vartheta}\lesssim n^{-\vartheta/2}.

The claim follows from Lemma B.2. ∎

Proof of Theorem 2.5.

We write μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,)=μ^η;(Σ,μ0)𝗌𝖾𝗊,\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})=\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast} in the proof.

First we consider $0<q\leq 2$. This is the easy case, as $\mathsf{g}_{q}(x)\equiv\lVert x-\mu_{0}\rVert_{q}/n^{1/q-1/2}$ is $1$-Lipschitz with respect to $\lVert\cdot\rVert$. So applying Theorems 2.3 and 2.4 verifies the existence of some small $\vartheta>0$ such that for some $\mathcal{U}_{\vartheta}\subset B_{n}(1)$ with $\mathrm{vol}(\mathcal{U}_{\vartheta})/\mathrm{vol}(B_{n}(1))\geq 1-Ce^{-n^{\vartheta}/C}$,

supμ0𝒰ϑ(supηΞKn121q|μ^ημ0q𝔼μ^η;(Σ,μ0)𝗌𝖾𝗊,μ0q|nϑ)Cn1/7.\displaystyle\sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}\operatorname{\mathbb{P}}\Big{(}\sup_{\eta\in\Xi_{K}}n^{\frac{1}{2}-\frac{1}{q}}\big{\lvert}\lVert\widehat{\mu}_{\eta}-\mu_{0}\rVert_{q}-\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0}\rVert_{q}\big{\rvert}\geq n^{-\vartheta}\Big{)}\leq Cn^{-1/7}.

The ratio formulation follows from Lemmas 10.8 and 10.9 by further intersecting 𝒰ϑ\mathcal{U}_{\vartheta} and the set therein.

Next we consider q(2,)q\in(2,\infty). Let Lnnϑ1L_{n}\equiv n^{\vartheta_{1}} for some ϑ1\vartheta_{1} to be chosen later. Using Proposition 10.3 and its proofs below (10.8), for ϑ1>0\vartheta_{1}>0 chosen small enough, we may find some 𝒰ϑ1Bn(1)\mathcal{U}_{\vartheta_{1}}\subset B_{n}(1) with the desired volume estimate, such that supμ0𝒰ϑ1supηΞK𝔼μ^η;(Σ,μ0)𝗌𝖾𝗊,μ0Ln/n\sup_{\mu_{0}\in\mathcal{U}_{\vartheta_{1}}}\sup_{\eta\in\Xi_{K}}\lVert\operatorname{\mathbb{E}}\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0}\rVert_{\infty}\leq L_{n}/\sqrt{n}, and

supμ0𝒰ϑ1(supηΞK{μ^ημ0μ^η;(Σ,μ0)𝗌𝖾𝗊,μ0}Lnn)Cn2D,\displaystyle\sup_{\mu_{0}\in\mathcal{U}_{\vartheta_{1}}}\operatorname{\mathbb{P}}\bigg{(}\sup_{\eta\in\Xi_{K}}\Big{\{}\lVert\widehat{\mu}_{\eta}-\mu_{0}\rVert_{\infty}\vee\lVert\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0}\rVert_{\infty}\Big{\}}\geq\frac{L_{n}}{\sqrt{n}}\bigg{)}\leq Cn^{-2D}, (10.25)

where we choose D>0D>0 sufficiently large. Recall for xnx\in\mathbb{R}^{n} and q>2q>2, xqx2/qx12/q\lVert x\rVert_{q}\leq\lVert x\rVert^{2/q}\lVert x\rVert_{\infty}^{1-2/q}. This motivates the choice

𝗀q(x)[(Lnn)2q1(xμ0)(Lnn)2q1{(Lnn)2q1}q]q2,\displaystyle\mathsf{g}_{q}(x)\equiv\bigg{[}\bigg{(}\frac{L_{n}}{\sqrt{n}}\bigg{)}^{\frac{2}{q}-1}\bigg{\lVert}(x-\mu_{0})\wedge\bigg{(}\frac{L_{n}}{\sqrt{n}}\bigg{)}^{\frac{2}{q}-1}\vee\bigg{\{}-\bigg{(}\frac{L_{n}}{\sqrt{n}}\bigg{)}^{\frac{2}{q}-1}\bigg{\}}\bigg{\rVert}_{q}\bigg{]}^{\frac{q}{2}},

which verifies that $\mathsf{g}_{q}$ is $1$-Lipschitz with respect to $\lVert\cdot\rVert$. Using (10.25),

infμ0𝒰ϑ1(𝗀q(μ^η)=n(1q2)ϑ1{n121qμ^ημ0q}q2,ηΞK)1CnD,\displaystyle\inf_{\mu_{0}\in\mathcal{U}_{\vartheta_{1}}}\operatorname{\mathbb{P}}\Big{(}\mathsf{g}_{q}(\widehat{\mu}_{\eta})=n^{(1-\frac{q}{2})\vartheta_{1}}\cdot\big{\{}n^{\frac{1}{2}-\frac{1}{q}}\lVert\widehat{\mu}_{\eta}-\mu_{0}\rVert_{q}\big{\}}^{\frac{q}{2}},\forall\eta\in\Xi_{K}\Big{)}\geq 1-Cn^{-D}, (10.26)

and with Eμ0{supηΞKμ^η;(Σ,μ0)𝗌𝖾𝗊,μ0Ln/n}E_{\mu_{0}}\equiv\big{\{}\sup_{\eta\in\Xi_{K}}\lVert\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0}\rVert_{\infty}\leq L_{n}/\sqrt{n}\big{\}},

supμ0𝒰ϑ1supηΞK|𝔼𝗀q(μ^η;(Σ,μ0)𝗌𝖾𝗊,)n(1q2)ϑ1𝔼{n121qμ^η;(Σ,μ0)𝗌𝖾𝗊,μ0q}q2|\displaystyle\sup_{\mu_{0}\in\mathcal{U}_{\vartheta_{1}}}\sup_{\eta\in\Xi_{K}}\Big{|}\operatorname{\mathbb{E}}\mathsf{g}_{q}\big{(}\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}\big{)}-n^{(1-\frac{q}{2})\vartheta_{1}}\operatorname{\mathbb{E}}\big{\{}n^{\frac{1}{2}-\frac{1}{q}}\lVert\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0}\rVert_{q}\Big{\}}^{\frac{q}{2}}\Big{|}
=n(1q2)ϑ1supμ0𝒰ϑ1supηΞK𝔼{n121qμ^η;(Σ,μ0)𝗌𝖾𝗊,μ0q}q2𝟏Eμ0c\displaystyle=n^{(1-\frac{q}{2})\vartheta_{1}}\sup_{\mu_{0}\in\mathcal{U}_{\vartheta_{1}}}\sup_{\eta\in\Xi_{K}}\operatorname{\mathbb{E}}\big{\{}n^{\frac{1}{2}-\frac{1}{q}}\lVert\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0}\rVert_{q}\Big{\}}^{\frac{q}{2}}\bm{1}_{E_{\mu_{0}}^{c}}
+supμ0𝒰ϑ1supηΞK𝔼𝗀q(μ^η;(Σ,μ0)𝗌𝖾𝗊,)𝟏Eμ0cnD.\displaystyle\qquad+\sup_{\mu_{0}\in\mathcal{U}_{\vartheta_{1}}}\sup_{\eta\in\Xi_{K}}\operatorname{\mathbb{E}}\mathsf{g}_{q}\big{(}\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}\big{)}\bm{1}_{E_{\mu_{0}}^{c}}\lesssim n^{-D}. (10.27)

As the map $g\mapsto\lVert\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\mathsf{seq},\ast}-\mu_{0}\rVert_{q}=\lVert(\Sigma+\tau_{\eta,\ast}I)^{-1}\big(-\tau_{\eta,\ast}\mu_{0}+\gamma_{\eta,\ast}\Sigma^{1/2}g/\sqrt{n}\big)\rVert_{q}$ is $Cn^{-1/2}$-Lipschitz with respect to $\lVert\cdot\rVert$, Gaussian concentration yields

(n1/2|μ^η;(Σ,μ0)𝗌𝖾𝗊,μ0q𝔼μ^η;(Σ,μ0)𝗌𝖾𝗊,μ0q|nϑ1)Cn2D.\displaystyle\operatorname{\mathbb{P}}\Big{(}n^{1/2}\big{\lvert}\lVert\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0}\rVert_{q}-\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0}\rVert_{q}\big{\rvert}\geq n^{\vartheta_{1}}\Big{)}\leq Cn^{-2D}.

Using the Lipschitz property of the maps, we may strengthen the above inequality to a uniform control over ηΞK\eta\in\Xi_{K}. This means uniformly in μ0𝒰ϑ1,ηΞK\mu_{0}\in\mathcal{U}_{\vartheta_{1}},\eta\in\Xi_{K},

|𝔼{n121qμ^η;(Σ,μ0)𝗌𝖾𝗊,μ0q}q2{n121q𝔼μ^η;(Σ,μ0)𝗌𝖾𝗊,μ0q}q2|n1q+ϑ1.\displaystyle\Big{|}\operatorname{\mathbb{E}}\Big{\{}n^{\frac{1}{2}-\frac{1}{q}}\lVert\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0}\rVert_{q}\Big{\}}^{\frac{q}{2}}-\Big{\{}n^{\frac{1}{2}-\frac{1}{q}}\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0}\rVert_{q}\Big{\}}^{\frac{q}{2}}\Big{|}\lesssim n^{-\frac{1}{q}+\vartheta_{1}}. (10.28)

Combining (10.27)-(10.28), we have uniformly in $\mu_{0}\in\mathcal{U}_{\vartheta_{1}},\eta\in\Xi_{K}$,

|𝔼𝗀q(μ^η;(Σ,μ0)𝗌𝖾𝗊,)n(1q2)ϑ1{n121q𝔼μ^η;(Σ,μ0)𝗌𝖾𝗊,μ0q}q/2|Cn(2q2)ϑ11q.\displaystyle\Big{|}\operatorname{\mathbb{E}}\mathsf{g}_{q}\big{(}\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}\big{)}-n^{(1-\frac{q}{2})\vartheta_{1}}\big{\{}n^{\frac{1}{2}-\frac{1}{q}}\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0}\rVert_{q}\Big{\}}^{q/2}\Big{|}\leq Cn^{(2-\frac{q}{2})\vartheta_{1}-\frac{1}{q}}. (10.29)

Combining (10.26) and (10.29) proves the existence of some small ϑ2\vartheta_{2} and some 𝒰ϑ2Bn(1)\mathcal{U}_{\vartheta_{2}}\subset B_{n}(1) with the desired volume estimate, such that

supμ0𝒰ϑ2(n121qsupηΞK|μ^ημ0q𝔼μ^η;(Σ,μ0)𝗌𝖾𝗊,μ0q|nϑ2)Cn1/7.\displaystyle\sup_{\mu_{0}\in\mathcal{U}_{\vartheta_{2}}}\operatorname{\mathbb{P}}\Big{(}n^{\frac{1}{2}-\frac{1}{q}}\sup_{\eta\in\Xi_{K}}\big{\lvert}\lVert\widehat{\mu}_{\eta}-\mu_{0}\rVert_{q}-\operatorname{\mathbb{E}}\lVert\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0}\rVert_{q}\big{\rvert}\geq n^{-\vartheta_{2}}\Big{)}\leq Cn^{-1/7}.

The ratio formulation follows again from Lemmas 10.8 and 10.9. ∎

11. Proofs for Section 3

11.1. Proof of Theorem 3.1

For ϑ\vartheta chosen small enough, we fix μ0𝒰ϑ\mu_{0}\in\mathcal{U}_{\vartheta}, where 𝒰ϑ\mathcal{U}_{\vartheta} is specified in Theorem 2.4. We omit the subscripts in R#(Σ,μ0)(η)=R#(η)R^{\#}_{(\Sigma,\mu_{0})}(\eta)=R^{\#}(\eta), R¯#(Σ,μ0)(η)=R¯#(η)\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)=\bar{R}^{\#}(\eta), and write μ^(Σ,μ0)𝗌𝖾𝗊(γη,;τη,)=μ^η;(Σ,μ0)𝗌𝖾𝗊,\widehat{\mu}_{(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}}}(\gamma_{\eta,\ast};\tau_{\eta,\ast})=\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast} in the proof. All the constants in ,,\lesssim,\gtrsim,\asymp and 𝒪\mathcal{O} below may possibly depend on KK.

(1). Consider the case #=𝗉𝗋𝖾𝖽\#=\operatorname{\mathsf{pred}}. We omit the superscript 𝗉𝗋𝖾𝖽\operatorname{\mathsf{pred}} as well. Using Theorem 2.4-(1) with 𝗀(x)=Σ1/2(xμ0)\mathsf{g}(x)=\lVert\Sigma^{1/2}(x-\mu_{0})\rVert, on an event E0E_{0} with (E0c)C0εc0n1/6.5\operatorname{\mathbb{P}}(E_{0}^{c})\leq C_{0}\varepsilon^{-c_{0}}n^{-1/6.5},

supηΞK|R(η)𝔼Σ1/2(μ^η;(Σ,μ0)𝗌𝖾𝗊,μ0)|ε.\displaystyle\sup_{\eta\in\Xi_{K}}\big{\lvert}\sqrt{R(\eta)}-\operatorname{\mathbb{E}}\lVert\Sigma^{1/2}\big{(}\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0}\big{)}\rVert\big{\rvert}\leq\varepsilon.

By Gaussian-Poincaré inequality, 0R¯(η)(𝔼Σ1/2(μ^η;(Σ,μ0)𝗌𝖾𝗊,μ0))2=Var(Σ1/2(μ^η;(Σ,μ0)𝗌𝖾𝗊,μ0))n10\leq\bar{R}(\eta)-\big{(}\operatorname{\mathbb{E}}\lVert\Sigma^{1/2}\big{(}\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0}\big{)}\rVert\big{)}^{2}=\operatorname{Var}\big{(}\lVert\Sigma^{1/2}(\widehat{\mu}_{\eta;(\Sigma,\mu_{0})}^{\operatorname{\mathsf{seq}},\ast}-\mu_{0})\rVert\big{)}\lesssim n^{-1}. As R¯(η)1\bar{R}(\eta)\asymp 1 uniformly in ηΞK\eta\in\Xi_{K}, on E0E_{0},

supηΞK|R1/2(η)R¯1/2(η)|ε+C0n1.\displaystyle\sup_{\eta\in\Xi_{K}}\big{\lvert}R^{1/2}(\eta)-\bar{R}^{1/2}(\eta)\big{\rvert}\leq\varepsilon+C_{0}^{\prime}n^{-1}. (11.1)

On the other hand, using both the standard form μ^η=n1(XX/n+ηIn)1XY\widehat{\mu}_{\eta}=n^{-1}\big{(}X^{\top}X/n+\eta I_{n}\big{)}^{-1}X^{\top}Y and the alternative form μ^η=n1X(XX/n+ηIm)1Y\widehat{\mu}_{\eta}=n^{-1}X^{\top}\big{(}XX^{\top}/n+\eta I_{m}\big{)}^{-1}Y, we have

supηΞKμ^η((ZZ/n)1op𝟏ϕ11+1/K11)(1+Zop+ξn)2.\displaystyle\sup_{\eta\in\Xi_{K}}\lVert\widehat{\mu}_{\eta}\rVert\lesssim\Big{(}\lVert(ZZ^{\top}/n)^{-1}\rVert_{\operatorname{op}}\bm{1}_{\phi^{-1}\geq 1+1/K}^{-1}\wedge 1\Big{)}\cdot\Big{(}1+\frac{\lVert Z\rVert_{\operatorname{op}}+\lVert\xi\rVert}{\sqrt{n}}\Big{)}^{2}. (11.2)

Consequently, on an event E1E_{1} with (E1c)C1en/C1\operatorname{\mathbb{P}}(E_{1}^{c})\leq C_{1}e^{-n/C_{1}},

supηΞKμ^ηC1.\displaystyle\sup_{\eta\in\Xi_{K}}\lVert\widehat{\mu}_{\eta}\rVert\leq C_{1}. (11.3)

Finally, using (11.1) and (11.3), on E0E1E_{0}\cap E_{1},

supηΞK|R(η)R¯(η)|supηΞK|R1/2(η)R¯1/2(η)|(1+supηΞKμ^η)ε+n1.\displaystyle\sup_{\eta\in\Xi_{K}}\lvert R(\eta)-\bar{R}(\eta)\rvert\lesssim\sup_{\eta\in\Xi_{K}}\lvert R^{1/2}(\eta)-\bar{R}^{1/2}(\eta)\rvert\Big{(}1+\sup_{\eta\in\Xi_{K}}\lVert\widehat{\mu}_{\eta}\rVert\Big{)}\lesssim\varepsilon+n^{-1}.

The claim follows. The case $\#=\mathsf{est}$ follows with minor modifications, so the details are omitted.
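The two expressions for $\widehat{\mu}_{\eta}$ used in part (1) agree by the push-through identity $(X^{\top}X/n+\eta I_{n})^{-1}X^{\top}=X^{\top}(XX^{\top}/n+\eta I_{m})^{-1}$. The sketch below checks this numerically for a Gaussian design; the dimensions, noise level and function names are illustrative assumptions.

```python
import numpy as np

def ridge_standard_form(X, Y, eta):
    """n^{-1} (X^T X / n + eta I_n)^{-1} X^T Y, with X of shape (m, n)."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X / n + eta * np.eye(n), X.T @ Y) / n

def ridge_alternative_form(X, Y, eta):
    """n^{-1} X^T (X X^T / n + eta I_m)^{-1} Y, with X of shape (m, n)."""
    m, n = X.shape
    return X.T @ np.linalg.solve(X @ X.T / n + eta * np.eye(m), Y) / n

rng = np.random.default_rng(0)
m, n = 30, 80                                   # overparametrized: m samples, n features
X = rng.standard_normal((m, n))
Y = X @ (rng.standard_normal(n) / np.sqrt(n)) + 0.5 * rng.standard_normal(m)
gap = np.max(np.abs(ridge_standard_form(X, Y, 0.3) - ridge_alternative_form(X, Y, 0.3)))
print("max abs difference between the two forms:", gap)
```

The alternative form remains well defined at $\eta=0$ when $XX^{\top}$ is invertible, in which case it returns the minimum $\ell_{2}$-norm interpolator $X^{\top}(XX^{\top})^{-1}Y$.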

(2). Consider the case #=𝗋𝖾𝗌\#=\operatorname{\mathsf{res}}. We omit the superscript 𝗋𝖾𝗌\operatorname{\mathsf{res}} as well. Further fix ξϑ\xi\in\mathcal{E}_{\vartheta} as specified in Theorem 2.4 (the concrete form of ϑ\mathcal{E}_{\vartheta} is given in Proposition 10.3). Using the same Theorem 2.4-(2) with 𝗁(x)=x\mathsf{h}(x)=\lVert x\rVert,

ξ(supη[1/K,K]|r^η𝔼ξrη,|ε)Cεc0n1/6.5.\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\sup_{\eta\in[1/K,K]}\big{\lvert}\lVert\widehat{r}_{\eta}\rVert-\operatorname{\mathbb{E}}^{\xi}\lVert r_{\eta,\ast}\rVert\big{\rvert}\geq\varepsilon\Big{)}\leq C\varepsilon^{-c_{0}}\cdot n^{-1/6.5}.

By Gaussian-Poincaré inequality, 0𝔼ξrη,2(𝔼ξrη,)2=Varξ(rη,)1/n0\leq\operatorname{\mathbb{E}}^{\xi}\lVert r_{\eta,\ast}\rVert^{2}-\big{(}\operatorname{\mathbb{E}}^{\xi}\lVert r_{\eta,\ast}\rVert\big{)}^{2}=\operatorname{Var}^{\xi}\big{(}\lVert r_{\eta,\ast}\rVert\big{)}\lesssim 1/n. Combined with the fact that 𝔼ξrη,2=(ηγη,/τη,)2+𝒪(|ξ2/mσξ2|)\operatorname{\mathbb{E}}^{\xi}\lVert r_{\eta,\ast}\rVert^{2}=(\eta\gamma_{\eta,\ast}/\tau_{\eta,\ast})^{2}+\mathcal{O}(\lvert\lVert\xi\rVert^{2}/m-\sigma_{\xi}^{2}\rvert), for η[1/K,K]\eta\in[1/K,K], using the stability estimate in Proposition 8.1-(3),

|𝔼ξrη,ηγη,/τη,||(𝔼ξrη,)2(ηγη,/τη,)2|n1/2+ϑ.\displaystyle\big{\lvert}\operatorname{\mathbb{E}}^{\xi}\lVert r_{\eta,\ast}\rVert-\eta\gamma_{\eta,\ast}/\tau_{\eta,\ast}\big{\rvert}\lesssim\big{\lvert}\big{(}\operatorname{\mathbb{E}}^{\xi}\lVert r_{\eta,\ast}\rVert\big{)}^{2}-(\eta\gamma_{\eta,\ast}/\tau_{\eta,\ast})^{2}\big{\rvert}\lesssim n^{-1/2+\vartheta}.

So for ε(Cn1/2+ϑ,1/C]\varepsilon\in(Cn^{-1/2+\vartheta},1/C],

ξ(supη[1/K,K]|r^ηηγη,/τη,|ε)Cεc0n1/6.5.\displaystyle\operatorname{\mathbb{P}}^{\xi}\Big{(}\sup_{\eta\in[1/K,K]}\big{\lvert}\lVert\widehat{r}_{\eta}\rVert-\eta\gamma_{\eta,\ast}/\tau_{\eta,\ast}\big{\rvert}\geq\varepsilon\Big{)}\leq C\varepsilon^{-c_{0}}\cdot n^{-1/6.5}.

Now taking expectation over ξ\xi, for the same range of ε\varepsilon,

(supη[1/K,K]|r^ηηγη,/τη,|ε)Cεc0n1/6.5.\displaystyle\operatorname{\mathbb{P}}\Big{(}\sup_{\eta\in[1/K,K]}\big{\lvert}\lVert\widehat{r}_{\eta}\rVert-\eta\gamma_{\eta,\ast}/\tau_{\eta,\ast}\big{\rvert}\geq\varepsilon\Big{)}\leq C\varepsilon^{-c_{0}}\cdot n^{-1/6.5}. (11.4)

On the other hand, using (11.2),

supη[1/K,K]r^η\displaystyle\sup_{\eta\in[1/K,K]}\lVert\widehat{r}_{\eta}\rVert ((ZZ/n)1op𝟏ϕ11+1/K11)(1+Zop+ξn)3.\displaystyle\lesssim\Big{(}\lVert(ZZ^{\top}/n)^{-1}\rVert_{\operatorname{op}}\bm{1}_{\phi^{-1}\geq 1+1/K}^{-1}\wedge 1\Big{)}\cdot\Big{(}1+\frac{\lVert Z\rVert_{\operatorname{op}}+\lVert\xi\rVert}{\sqrt{n}}\Big{)}^{3}.

Consequently, on an event E3E_{3} with (E3c)C3en/C3\operatorname{\mathbb{P}}(E_{3}^{c})\leq C_{3}e^{-n/C_{3}}, supη[1/K,K]r^ηC3\sup_{\eta\in[1/K,K]}\lVert\widehat{r}_{\eta}\rVert\leq C_{3}, and therefore

supη[1/K,K]|r^η2(ηγη,/τη,)2|\displaystyle\sup_{\eta\in[1/K,K]}\big{\lvert}\lVert\widehat{r}_{\eta}\rVert^{2}-\big{(}\eta\gamma_{\eta,\ast}/\tau_{\eta,\ast}\big{)}^{2}\big{\rvert} C3supη[1/K,K]|r^ηηγη,/τη,|.\displaystyle\leq C_{3}\cdot\sup_{\eta\in[1/K,K]}\big{\lvert}\lVert\widehat{r}_{\eta}\rVert-\eta\gamma_{\eta,\ast}/\tau_{\eta,\ast}\big{\rvert}.

The claim follows. The case #=𝗂𝗇\#=\operatorname{\mathsf{in}} proceeds similarly, but with the function now taken as 𝗁(x)=xξ/n\mathsf{h}(x)=\lVert x-\xi/\sqrt{n}\rVert, and the claim follows by computing that

𝔼ξrη,ξ/n2=ϕ{(ηϕτη,)2(ϕγη,2σξ2)+ξ2m(ηϕτη,1)2}\displaystyle\operatorname{\mathbb{E}}^{\xi}\lVert r_{\eta,\ast}-\xi/\sqrt{n}\rVert^{2}=\phi\cdot\bigg{\{}\bigg{(}\frac{\eta}{\phi\tau_{\eta,\ast}}\bigg{)}^{2}\big{(}\phi\gamma_{\eta,\ast}^{2}-\sigma_{\xi}^{2}\big{)}+\frac{\lVert\xi\rVert^{2}}{m}\cdot\bigg{(}\frac{\eta}{\phi\tau_{\eta,\ast}}-1\bigg{)}^{2}\bigg{\}}
=(ηγη,τη,)2+ϕσξ2[(ηϕτη,1)2(ηϕτη,)2]+𝒪(|ξ2/mσξ2|).\displaystyle=\bigg{(}\frac{\eta\gamma_{\eta,\ast}}{\tau_{\eta,\ast}}\bigg{)}^{2}+\phi\sigma_{\xi}^{2}\cdot\bigg{[}\bigg{(}\frac{\eta}{\phi\tau_{\eta,\ast}}-1\bigg{)}^{2}-\bigg{(}\frac{\eta}{\phi\tau_{\eta,\ast}}\bigg{)}^{2}\bigg{]}+\mathcal{O}\big{(}\lvert\lVert\xi\rVert^{2}/m-\sigma_{\xi}^{2}\rvert\big{)}.

The proof is complete. ∎

11.2. Proof of Theorem 3.2

Lemma 11.1.

Suppose 1/Kϕ1K1/K\leq\phi^{-1}\leq K, and ΣopΣK\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K for some K>0K>0. Then with g𝒩(0,In)g\sim\mathcal{N}(0,I_{n}), there exists some C=C(K)>0C=C(K)>0 such that for ε(0,1)\varepsilon\in(0,1), and q{0,1/2}q\in\{0,1/2\},

(supηΞK|(Σ+τη,I)1Σqg/g2n1tr((Σ+τη,I)2Σ2q)|>ε)Cε1enε2/C.\displaystyle\operatorname{\mathbb{P}}\Big{(}\sup_{\eta\in\Xi_{K}}\big{\lvert}\lVert(\Sigma+\tau_{\eta,\ast}I)^{-1}\Sigma^{q}g/\lVert g\rVert\rVert^{2}-n^{-1}\operatorname{tr}\big{(}(\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma^{2q}\big{)}\big{\rvert}>\varepsilon\Big{)}\leq C\varepsilon^{-1}e^{-n\varepsilon^{2}/C}.
Proof.

We only prove the case q=1/2q=1/2. All the constants in ,,\lesssim,\gtrsim,\asymp below may depend on KK. We write Aη(Σ+τη,I)2ΣA_{\eta}\equiv(\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma for notational simplicity. Note that

|(Σ+τη,I)1Σ1/2g/g2n1tr((Σ+τη,I)2Σ)|\displaystyle\big{\lvert}\lVert(\Sigma+\tau_{\eta,\ast}I)^{-1}\Sigma^{1/2}g/\lVert g\rVert\rVert^{2}-n^{-1}\operatorname{tr}\big{(}(\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma\big{)}\big{\rvert}
=n1|eg2Aη1/2g2𝔼Aη1/2g2|\displaystyle=n^{-1}\big{\lvert}e_{g}^{-2}\lVert A_{\eta}^{1/2}g\rVert^{2}-\operatorname{\mathbb{E}}\lVert A_{\eta}^{1/2}g\rVert^{2}\big{\rvert}
eg2n1|Aη1/2g2𝔼Aη1/2g2|+|eg21|.\displaystyle\lesssim e_{g}^{-2}\cdot n^{-1}\big{\lvert}\lVert A_{\eta}^{1/2}g\rVert^{2}-\operatorname{\mathbb{E}}\lVert A_{\eta}^{1/2}g\rVert^{2}\big{\rvert}+\lvert e_{g}^{-2}-1\rvert.

Here $e_{g}\equiv\lVert g\rVert/\sqrt{n}$, and in the last inequality we used $\mathbb{E}\lVert A_{\eta}^{1/2}g\rVert^{2}\lesssim n$. As

  • AηF2=tr((Σ+τη,I)4Σ2)n(1τη,)4n\lVert A_{\eta}\rVert_{F}^{2}=\operatorname{tr}\big{(}(\Sigma+\tau_{\eta,\ast}I)^{-4}\Sigma^{2}\big{)}\lesssim n(1\wedge\tau_{\eta,\ast})^{-4}\asymp n, and

  • AηF2tr(Σ2)(1τη,)4n\lVert A_{\eta}\rVert_{F}^{2}\gtrsim\operatorname{tr}(\Sigma^{2})\cdot(1\vee\tau_{\eta,\ast})^{-4}\gtrsim n,

we have uniformly in $\eta\in[0,K]$, $\lVert A_{\eta}\rVert_{F}\asymp\sqrt{n}$. It is easy to see that $\lVert A_{\eta}\rVert_{\operatorname{op}}\asymp 1$. So by the Hanson-Wright inequality, there exists some constant $C_{1}=C_{1}(K)$ such that for $\varepsilon\in(0,1)$,

(|(Σ+τη,I)1Σ1/2g/g2n1tr((Σ+τη,I)2Σ)|>ε)\displaystyle\operatorname{\mathbb{P}}\Big{(}\big{\lvert}\lVert(\Sigma+\tau_{\eta,\ast}I)^{-1}\Sigma^{1/2}g/\lVert g\rVert\rVert^{2}-n^{-1}\operatorname{tr}\big{(}(\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma\big{)}\big{\rvert}>\varepsilon\Big{)}
(|n1(Aη1/2g2𝔼Aη1/2g2)|>ε/4)+(|eg21|>ε/2)+(eg21/2)\displaystyle\leq\operatorname{\mathbb{P}}\Big{(}\big{\lvert}n^{-1}\big{(}\lVert A_{\eta}^{1/2}g\rVert^{2}-\operatorname{\mathbb{E}}\lVert A_{\eta}^{1/2}g\rVert^{2}\big{)}\big{\rvert}>\varepsilon/4\Big{)}+\operatorname{\mathbb{P}}\big{(}\lvert e_{g}^{-2}-1\rvert>\varepsilon/2\big{)}+\operatorname{\mathbb{P}}(e_{g}^{2}\leq 1/2)
C1enε2/C1.\displaystyle\leq C_{1}e^{-n\varepsilon^{2}/C_{1}}.

On the other hand, for any η1,η2ΞK\eta_{1},\eta_{2}\in\Xi_{K}, using Proposition 8.1-(3),

|(Σ+τη1,I)1Σ1/2g/g2(Σ+τη2,I)1Σ1/2g/g2|\displaystyle\big{\lvert}\lVert(\Sigma+\tau_{\eta_{1},\ast}I)^{-1}\Sigma^{1/2}g/\lVert g\rVert\rVert^{2}-\lVert(\Sigma+\tau_{\eta_{2},\ast}I)^{-1}\Sigma^{1/2}g/\lVert g\rVert\rVert^{2}\big{\rvert} |η1η2|,\displaystyle\lesssim\lvert\eta_{1}-\eta_{2}\rvert,
n1|tr((Σ+τη1,I)2Σ)tr((Σ+τη2,I)2Σ)|\displaystyle n^{-1}\big{\lvert}\operatorname{tr}\big{(}(\Sigma+\tau_{\eta_{1},\ast}I)^{-2}\Sigma\big{)}-\operatorname{tr}\big{(}(\Sigma+\tau_{\eta_{2},\ast}I)^{-2}\Sigma\big{)}\big{\rvert} |η1η2|,\displaystyle\lesssim\lvert\eta_{1}-\eta_{2}\rvert,

so we may conclude by a standard discretization and union bound argument. ∎
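The concentration in Lemma 11.1 can be illustrated by simulation. The sketch below works in an eigenbasis of $\Sigma$, which is harmless because $g$ is rotation invariant. The spectrum, the fixed value of `tau` (used in place of $\tau_{\eta,\ast}$) and the sample sizes are illustrative assumptions, and the supremum over $\eta$ is not simulated.

```python
import numpy as np

def normalized_quadratic_form(eigs, tau, power, g):
    """||(Sigma + tau I)^{-1} Sigma^power g/||g|| ||^2 for diagonal Sigma = diag(eigs)."""
    u = g / np.linalg.norm(g)
    return float(np.sum((eigs ** power / (eigs + tau) * u) ** 2))

def trace_target(eigs, tau, power):
    """n^{-1} tr((Sigma + tau I)^{-2} Sigma^{2*power})."""
    return float(np.mean(eigs ** (2.0 * power) / (eigs + tau) ** 2))

def max_deviation(n, tau=0.7, power=0.5, reps=200, seed=0):
    """Largest |quadratic form - trace target| over `reps` independent Gaussian draws."""
    rng = np.random.default_rng(seed)
    eigs = np.linspace(0.5, 2.0, n)              # hypothetical spectrum of Sigma
    target = trace_target(eigs, tau, power)
    devs = [abs(normalized_quadratic_form(eigs, tau, power, rng.standard_normal(n)) - target)
            for _ in range(reps)]
    return max(devs)

for n in (50, 500, 2000):
    print(n, max_deviation(n))
```

Setting `power=0` gives the $q=0$ case of the lemma. The printed deviations should shrink roughly at the rate $n^{-1/2}$, consistent with the tail bound $e^{-n\varepsilon^{2}/C}$.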

Proposition 11.2.

The following hold with $\mathfrak{m}_{\eta}\equiv\mathfrak{m}(-\eta/\phi)$, $\mathfrak{m}_{\eta}^{\prime}\equiv\mathfrak{m}^{\prime}(-\eta/\phi)$.

  (1) $\tau_{\eta,\ast}=1/\mathfrak{m}_{\eta}$ and $\partial_{\eta}\tau_{\eta,\ast}=\mathfrak{m}_{\eta}^{\prime}/(\phi\mathfrak{m}_{\eta}^{2})$.

  (2) It holds that

    \begin{align*}
    \frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma\big) &=\frac{\phi\mathfrak{m}_{\eta}^{2}}{\mathfrak{m}_{\eta}^{\prime}}\big(\mathfrak{m}_{\eta}-(\eta/\phi)\mathfrak{m}_{\eta}^{\prime}\big),\\
    \frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{\eta,\ast}I)^{-2}\big) &=\frac{\phi\mathfrak{m}_{\eta}^{2}}{\mathfrak{m}_{\eta}^{\prime}}\big((\phi^{-1}-1)\mathfrak{m}_{\eta}^{\prime}+2(\eta/\phi)\cdot\mathfrak{m}_{\eta}\mathfrak{m}_{\eta}^{\prime}-\mathfrak{m}_{\eta}^{2}\big).
    \end{align*}
  (3) Suppose $1/K\leq\phi^{-1}\leq K$, and $\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$ for some $K>0$. There exists some constant $C=C(K)>0$ such that the following holds. For any $\varepsilon\in(0,1/2]$, there is some $\mathcal{U}_{\varepsilon}\subset B_{n}(1)$ with $\mathrm{vol}(\mathcal{U}_{\varepsilon})/\mathrm{vol}(B_{n}(1))\geq 1-C\varepsilon^{-1}e^{-n\varepsilon^{2}/C}$ such that

    \[
    \sup_{\mu_{0}\in\mathcal{U}_{\varepsilon}}\sup_{\eta\in\Xi_{K}}\bigg\lvert\gamma_{\eta,\ast}^{2}-\frac{\sigma_{\xi}^{2}\mathfrak{m}_{\eta}^{\prime}+\lVert\mu_{0}\rVert^{2}\big(\phi\mathfrak{m}_{\eta}-\eta\mathfrak{m}_{\eta}^{\prime}\big)}{\phi\mathfrak{m}_{\eta}^{2}}\bigg\rvert\leq\varepsilon.
    \]

    When $\Sigma=I_{n}$, we may take $\mathcal{U}_{\varepsilon}=B_{n}(1)$ and the above inequality holds with $\varepsilon=0$.

Proof.

(1) follows from definition so we focus on (2)-(3).

(2). Differentiating both sides of (2.3) with respect to η\eta yields that

n1tr((Σ+τη,I)2Σ)ητη,=(𝔪η(η/ϕ)𝔪η).\displaystyle-n^{-1}\operatorname{tr}\big{(}(\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma\big{)}\cdot\partial_{\eta}\tau_{\eta,\ast}=-\big{(}\mathfrak{m}_{\eta}-({\eta}/{\phi})\mathfrak{m}_{\eta}^{\prime}\big{)}.

Now use $\partial_{\eta}\tau_{\eta,\ast}=\mathfrak{m}_{\eta}^{\prime}/(\phi\mathfrak{m}_{\eta}^{2})$ to obtain the formula for $n^{-1}\operatorname{tr}\big((\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma\big)$.

Next, using that ϕητη,=n1tr((Σ+τη,I)1Σ)=1τη,n1tr((Σ+τη,I)1)\phi-\frac{\eta}{\tau_{\eta,\ast}}=n^{-1}\operatorname{tr}\big{(}(\Sigma+\tau_{\eta,\ast}I)^{-1}\Sigma\big{)}=1-\tau_{\eta,\ast}\cdot n^{-1}\operatorname{tr}\big{(}(\Sigma+\tau_{\eta,\ast}I)^{-1}\big{)}, we may solve

n1tr((Σ+τη,I)1)=𝔪η(1ϕ+η𝔪η).\displaystyle n^{-1}\operatorname{tr}\big{(}(\Sigma+\tau_{\eta,\ast}I)^{-1}\big{)}=\mathfrak{m}_{\eta}\big{(}1-\phi+\eta\cdot\mathfrak{m}_{\eta}\big{)}.

Differentiating with respect to η\eta on both sides of the above display, we obtain

n1tr((Σ+τη,I)2)ητη,\displaystyle-n^{-1}\operatorname{tr}\big{(}(\Sigma+\tau_{\eta,\ast}I)^{-2}\big{)}\cdot\partial_{\eta}\tau_{\eta,\ast} =ϕ1𝔪η(1ϕ+η𝔪η)+𝔪η(𝔪η(η/ϕ)𝔪η)\displaystyle=-\phi^{-1}{\mathfrak{m}_{\eta}^{\prime}}\big{(}1-\phi+\eta\cdot\mathfrak{m}_{\eta}\big{)}+\mathfrak{m}_{\eta}\cdot\big{(}\mathfrak{m}_{\eta}-({\eta}/{\phi})\mathfrak{m}_{\eta}^{\prime}\big{)}
=(ϕ11)𝔪η2(η/ϕ)𝔪η𝔪η+𝔪η2,\displaystyle=-(\phi^{-1}-1)\mathfrak{m}_{\eta}^{\prime}-2({\eta}/{\phi})\cdot\mathfrak{m}_{\eta}\mathfrak{m}_{\eta}^{\prime}+\mathfrak{m}_{\eta}^{2},

proving the second identity.
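The two trace identities in (2) can be sanity-checked numerically. The sketch below assumes a diagonal $\Sigma$ with a hypothetical spectrum, solves the relation $\phi-\eta/\tau=n^{-1}\operatorname{tr}((\Sigma+\tau I)^{-1}\Sigma)$ used above for $\tau=\tau_{\eta,\ast}$ by bisection (with $\eta>0$), approximates $\partial_{\eta}\tau_{\eta,\ast}$ by a central finite difference, and recovers $\mathfrak{m}_{\eta}^{\prime}$ from item (1). The solver, step size and parameter values are our own illustrative choices.

```python
import numpy as np

def solve_tau(eigs, eta, phi, iters=200):
    """Bisection on a log scale for eta/tau + mean(eigs/(eigs+tau)) = phi, assuming eta > 0."""
    def gap(tau):
        return eta / tau + np.mean(eigs / (eigs + tau)) - phi   # decreasing in tau
    lo, hi = 1e-12, 1e12
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return float(np.sqrt(lo * hi))

def trace_identities(eigs, eta, phi, h=1e-5):
    """Return ((lhs1, rhs1), (lhs2, rhs2)) for the two identities in item (2)."""
    tau = solve_tau(eigs, eta, phi)
    m = 1.0 / tau                                               # item (1): tau = 1/m
    dtau = (solve_tau(eigs, eta + h, phi) - solve_tau(eigs, eta - h, phi)) / (2.0 * h)
    m_prime = phi * m ** 2 * dtau                               # item (1): dtau = m'/(phi m^2)
    factor = phi * m ** 2 / m_prime
    lhs1 = float(np.mean(eigs / (eigs + tau) ** 2))
    rhs1 = float(factor * (m - (eta / phi) * m_prime))
    lhs2 = float(np.mean(1.0 / (eigs + tau) ** 2))
    rhs2 = float(factor * ((1.0 / phi - 1.0) * m_prime + 2.0 * (eta / phi) * m * m_prime - m ** 2))
    return (lhs1, rhs1), (lhs2, rhs2)

eigs = np.linspace(0.5, 2.0, 50)     # hypothetical spectrum of Sigma
print(trace_identities(eigs, eta=0.3, phi=0.5))
```

Each returned pair should agree up to the finite-difference error, which is of order $h^{2}$.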

(3). Let $\mu_{0}\equiv U_{0}g_{0}/\lVert g_{0}\rVert$, where $U_{0}\in[0,1]$ has density $u\mapsto nu^{n-1}$ (equivalently, $U_{0}=U^{1/n}$ with $U\sim\mathrm{Unif}[0,1]$) and $g_{0}\sim\mathcal{N}(0,I_{n})$ are independent variables. Then $\mu_{0}$ is uniformly distributed on $B_{n}(1)$. For some $\varepsilon>0$ to be chosen later, let

𝒢ε\displaystyle\mathcal{G}_{\varepsilon} {gn:supηΞK|(Σ+τη,I)1Σ1/2gg21ntr((Σ+τη,I)2Σ)|ε}.\displaystyle\equiv\Big{\{}g\in\mathbb{R}^{n}:\sup_{\eta\in\Xi_{K}}\Big{|}\big{\lVert}(\Sigma+\tau_{\eta,\ast}I)^{-1}\Sigma^{1/2}\frac{g}{\lVert g\rVert}\big{\rVert}^{2}-\frac{1}{n}\operatorname{tr}\big{(}(\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma\big{)}\Big{|}\leq\varepsilon\Big{\}}. (11.5)

Let 𝒰ε{Ug/g:U[0,1],g𝒢ε}Bn(1)\mathcal{U}_{\varepsilon}\equiv\{Ug/\lVert g\rVert:U\in[0,1],g\in\mathcal{G}_{\varepsilon}\}\subset B_{n}(1). Using Lemma 11.1, there exists some constant C0=C0(K)>0C_{0}=C_{0}(K)>0 such that vol(𝒰ε)/vol(Bn(1))=μ0(μ0𝒰ε)1C0ε1enε2/C0{\mathrm{vol}(\mathcal{U}_{\varepsilon})}/{\mathrm{vol}(B_{n}(1))}=\operatorname{\mathbb{P}}_{\mu_{0}}(\mu_{0}\in\mathcal{U}_{\varepsilon})\geq 1-C_{0}\varepsilon^{-1}e^{-n\varepsilon^{2}/C_{0}}, and moreover,

supμ0𝒰εsupηΞK|(Σ+τη,I)1Σ1/2μ02μ02n1tr((Σ+τη,I)2Σ)|ε.\displaystyle\sup_{\mu_{0}\in\mathcal{U}_{\varepsilon}}\sup_{\eta\in\Xi_{K}}\big{\lvert}\lVert(\Sigma+\tau_{\eta,\ast}I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}-\lVert\mu_{0}\rVert^{2}\cdot n^{-1}\operatorname{tr}\big{(}(\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma\big{)}\big{\rvert}\leq\varepsilon.

Note that when Σ=In\Sigma=I_{n}, the above estimate holds for all μ0Bn(1)\mu_{0}\in B_{n}(1) with ε=0\varepsilon=0.

Combining the above display with the formula (8.3) for $\gamma_{\eta,\ast}^{2}$, and the fact that the denominator therein is of order $1$ (depending on $K$), we have

\[
\sup_{\mu_{0}\in\mathcal{U}_{\varepsilon}}\sup_{\eta\in\Xi_{K}}\bigg\lvert\gamma_{\eta,\ast}^{2}-\frac{\sigma_{\xi}^{2}+\lVert\mu_{0}\rVert^{2}\tau_{\eta,\ast}^{2}\cdot\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma\big)}{\frac{\eta}{\tau_{\eta,\ast}}+\tau_{\eta,\ast}\cdot\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma\big)}\bigg\rvert\leq C_{1}\varepsilon.
\]

Now using (2), the second term in the above display equals

\[
\frac{\sigma_{\xi}^{2}+\lVert\mu_{0}\rVert^{2}\cdot\frac{\phi}{\mathfrak{m}_{\eta}^{\prime}}\big(\mathfrak{m}_{\eta}-\frac{\eta}{\phi}\mathfrak{m}_{\eta}^{\prime}\big)}{\eta\mathfrak{m}_{\eta}+\frac{\phi\mathfrak{m}_{\eta}}{\mathfrak{m}_{\eta}^{\prime}}\big(\mathfrak{m}_{\eta}-\frac{\eta}{\phi}\mathfrak{m}_{\eta}^{\prime}\big)}=\frac{\sigma_{\xi}^{2}\mathfrak{m}_{\eta}^{\prime}+\lVert\mu_{0}\rVert^{2}\big(\phi\mathfrak{m}_{\eta}-\eta\mathfrak{m}_{\eta}^{\prime}\big)}{\phi\mathfrak{m}_{\eta}^{2}}.
\]

The claim follows by adjusting constants. ∎
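For convenience, the algebra behind the last equality in the proof above can be spelled out as follows. This is a routine verification added for the reader, not part of the original argument: the $\eta\mathfrak{m}_{\eta}$ terms in the denominator cancel, and the numerator is put over the common factor $\mathfrak{m}_{\eta}^{\prime}$.

```latex
\begin{align*}
\eta\mathfrak{m}_{\eta}+\frac{\phi\mathfrak{m}_{\eta}}{\mathfrak{m}_{\eta}^{\prime}}
  \Big(\mathfrak{m}_{\eta}-\frac{\eta}{\phi}\mathfrak{m}_{\eta}^{\prime}\Big)
 &=\eta\mathfrak{m}_{\eta}+\frac{\phi\mathfrak{m}_{\eta}^{2}}{\mathfrak{m}_{\eta}^{\prime}}-\eta\mathfrak{m}_{\eta}
  =\frac{\phi\mathfrak{m}_{\eta}^{2}}{\mathfrak{m}_{\eta}^{\prime}},\\
\sigma_{\xi}^{2}+\lVert\mu_{0}\rVert^{2}\cdot\frac{\phi}{\mathfrak{m}_{\eta}^{\prime}}
  \Big(\mathfrak{m}_{\eta}-\frac{\eta}{\phi}\mathfrak{m}_{\eta}^{\prime}\Big)
 &=\frac{\sigma_{\xi}^{2}\mathfrak{m}_{\eta}^{\prime}
  +\lVert\mu_{0}\rVert^{2}\big(\phi\mathfrak{m}_{\eta}-\eta\mathfrak{m}_{\eta}^{\prime}\big)}{\mathfrak{m}_{\eta}^{\prime}}.
\end{align*}
```

Dividing the second line by the first yields the stated ratio with denominator $\phi\mathfrak{m}_{\eta}^{2}$.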

Proof of Theorem 3.2.

As $\bar{R}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta)=\phi\gamma_{\eta,\ast}^{2}-\sigma_{\xi}^{2}$, directly invoking Proposition 11.2-(3) yields the claim for $\bar{R}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta)$.

Next we handle $\bar{R}^{\mathsf{est}}_{(\Sigma,\mu_{0})}(\eta)$. Note that

\[
\bar{R}^{\mathsf{est}}_{(\Sigma,\mu_{0})}(\eta)=\tau_{\eta,\ast}^{2}\lVert(\Sigma+\tau_{\eta,\ast}I)^{-1}\mu_{0}\rVert^{2}+\gamma_{\eta,\ast}^{2}\cdot n^{-1}\operatorname{tr}\big((\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma\big). \tag{11.6}
\]

Using a similar construction as in the proof of Proposition 11.2 with the help of Lemma 11.1, this time with $q=0$ therein, we may find some $\mathcal{U}_{\varepsilon}\subset B_{n}(1)$ with the desired volume estimate, such that both Proposition 11.2-(3) and

\[
\sup_{\mu_{0}\in\mathcal{U}_{\varepsilon}}\sup_{\eta\in\Xi_{K}}\big\lvert\lVert(\Sigma+\tau_{\eta,\ast}I)^{-1}\mu_{0}\rVert^{2}-\lVert\mu_{0}\rVert^{2}\cdot n^{-1}\operatorname{tr}\big((\Sigma+\tau_{\eta,\ast}I)^{-2}\big)\big\rvert\leq\varepsilon \tag{11.7}
\]

hold. Combining (11.6)-(11.7), we may set

\begin{align*}
\mathscr{R}^{\mathsf{est}}_{(\Sigma,\mu_{0})}(\eta)&\equiv\tau_{\eta,\ast}^{2}\lVert\mu_{0}\rVert^{2}\cdot n^{-1}\operatorname{tr}\big((\Sigma+\tau_{\eta,\ast}I)^{-2}\big)\\
&\qquad+(\phi\mathfrak{m}_{\eta}^{2})^{-1}\Big(\sigma_{\xi}^{2}\mathfrak{m}_{\eta}^{\prime}+\lVert\mu_{0}\rVert^{2}\big(\phi\mathfrak{m}_{\eta}-\eta\mathfrak{m}_{\eta}^{\prime}\big)\Big)\cdot n^{-1}\operatorname{tr}\big((\Sigma+\tau_{\eta,\ast}I)^{-2}\Sigma\big)\\
&\equiv R_{2,1}+R_{2,2}.
\end{align*}

By Proposition 11.2-(2), we may compute $R_{2,1},R_{2,2}$ separately:

\begin{align*}
R_{2,1}&=\lVert\mu_{0}\rVert^{2}\cdot\frac{\phi}{\mathfrak{m}_{\eta}^{\prime}}\cdot\Big((\phi^{-1}-1)\mathfrak{m}_{\eta}^{\prime}+2(\eta/\phi)\cdot\mathfrak{m}_{\eta}\mathfrak{m}_{\eta}^{\prime}-\mathfrak{m}_{\eta}^{2}\Big)\\
&=\lVert\mu_{0}\rVert^{2}(1-\phi)+\Big\{2\lVert\mu_{0}\rVert^{2}\eta\mathfrak{m}_{\eta}-\lVert\mu_{0}\rVert^{2}\phi\cdot\frac{\mathfrak{m}_{\eta}^{2}}{\mathfrak{m}_{\eta}^{\prime}}\Big\},\\
R_{2,2}&=\frac{1}{\mathfrak{m}_{\eta}^{\prime}}\Big(\sigma_{\xi}^{2}\mathfrak{m}_{\eta}^{\prime}+\lVert\mu_{0}\rVert^{2}\big(\phi\mathfrak{m}_{\eta}-\eta\mathfrak{m}_{\eta}^{\prime}\big)\Big)\cdot\big(\mathfrak{m}_{\eta}-(\eta/\phi)\mathfrak{m}_{\eta}^{\prime}\big)\\
&=\sigma_{\xi}^{2}\big(\mathfrak{m}_{\eta}-(\eta/\phi)\mathfrak{m}_{\eta}^{\prime}\big)+\phi^{-1}\lVert\mu_{0}\rVert^{2}\eta^{2}\mathfrak{m}_{\eta}^{\prime}-\Big\{2\lVert\mu_{0}\rVert^{2}\eta\mathfrak{m}_{\eta}-\lVert\mu_{0}\rVert^{2}\phi\cdot\frac{\mathfrak{m}_{\eta}^{2}}{\mathfrak{m}_{\eta}^{\prime}}\Big\}.
\end{align*}

Consequently, the bracketed terms cancel and

\begin{align*}
\mathscr{R}^{\mathsf{est}}_{(\Sigma,\mu_{0})}(\eta)&=\lVert\mu_{0}\rVert^{2}(1-\phi)+\sigma_{\xi}^{2}\big(\mathfrak{m}_{\eta}-(\eta/\phi)\mathfrak{m}_{\eta}^{\prime}\big)+\phi^{-1}\lVert\mu_{0}\rVert^{2}\eta^{2}\mathfrak{m}_{\eta}^{\prime}\\
&=\sigma_{\xi}^{2}\cdot\Big\{\mathsf{SNR}_{\mu_{0}}(1-\phi)+\mathfrak{m}_{\eta}+(\eta/\phi)\big(\eta\cdot\mathsf{SNR}_{\mu_{0}}-1\big)\mathfrak{m}_{\eta}^{\prime}\Big\}.
\end{align*}

The claims for $\mathscr{R}^{\mathsf{in}}_{(\Sigma,\mu_{0})}(\eta)$ and $\mathscr{R}^{\mathsf{res}}_{(\Sigma,\mu_{0})}(\eta)$ follow from Proposition 11.2-(3). ∎

11.3. Proof of Proposition 3.3

We will prove the following version of Proposition 3.3, where $\mathfrak{M}^{\#}$ is represented via $\tau_{\eta,\ast}$ instead of $\mathfrak{m}$. In the proof below, we will also verify the representation of $\mathfrak{M}^{\#}$ via $\mathfrak{m}$ as stated in Proposition 3.3.

Proposition 11.3.

Recall $\mathsf{SNR}_{\mu_{0}}=\lVert\mu_{0}\rVert^{2}/\sigma_{\xi}^{2}$. Then for $\#\in\{\mathsf{pred},\mathsf{est},\mathsf{in}\}$,

\[
\partial_{\eta}\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)=\sigma_{\xi}^{2}\cdot\mathfrak{M}^{\#}(\eta)\cdot\big(\eta\cdot\mathsf{SNR}_{\mu_{0}}-1\big).
\]

Here with $T_{-p,q}(\eta)\equiv n^{-1}\operatorname{tr}\big((\Sigma+\tau_{\ast}(\eta)I)^{-p}\Sigma^{q}\big)$ for $p,q\in\mathbb{N}$,

\[
\mathfrak{M}^{\#}(\eta)\equiv\begin{cases}
\phi\big(-\tau_{\ast}^{\prime\prime}(\eta)\big),&\#=\mathsf{pred};\\
2(\tau_{\ast}^{\prime}(\eta))^{2}\big(T_{-3,1}(\eta)+\tau_{\ast}^{\prime}(\eta)T_{-2,1}(\eta)T_{-3,2}(\eta)\big),&\#=\mathsf{est};\\
\frac{2(\tau_{\ast}^{\prime}(\eta))^{2}}{\tau_{\ast}^{2}(\eta)}\Big(\eta^{2}\tau_{\ast}^{\prime}(\eta)T_{-3,2}(\eta)+\tau_{\ast}^{3}(\eta)T_{-2,1}^{2}(\eta)\Big),&\#=\mathsf{in}.
\end{cases}
\]

Suppose further $1/K\leq\phi^{-1}\leq K$ and $\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$ for some $K>0$. Then there exists some $C=C(K)>0$ such that uniformly in $\eta\in\Xi_{K}$ and for all $\#\in\{\mathsf{pred},\mathsf{est},\mathsf{in}\}$,

  (1) $1/C\leq\mathfrak{M}^{\#}(\eta)\leq C$, and

  (2) if additionally $\eta_{\ast}\equiv\mathsf{SNR}_{\mu_{0}}^{-1}\in\Xi_{K}$,

\[
1/C\leq\frac{\lvert\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)-\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta_{\ast})\rvert}{\lVert\mu_{0}\rVert^{2}(\eta-\eta_{\ast})^{2}}\leq C.
\]
Proof.

In the proof we write $\tau_{\eta,\ast}=\tau_{\eta}$. Recall the notation $\mathfrak{m}_{\eta}=\mathfrak{m}(-\eta/\phi)$, $\mathfrak{m}_{\eta}^{\prime}=\mathfrak{m}^{\prime}(-\eta/\phi)$, and we naturally write $\mathfrak{m}_{\eta}^{\prime\prime}\equiv\mathfrak{m}^{\prime\prime}(-\eta/\phi)$. By differentiating both sides of $\mathfrak{m}_{\eta}=1/\tau_{\eta}$ with respect to $\eta$, with some calculations we have

\[
\mathfrak{m}_{\eta}=\tau_{\eta}^{-1},\quad\mathfrak{m}_{\eta}^{\prime}=\phi\tau_{\eta}^{\prime}/\tau_{\eta}^{2},\quad\mathfrak{m}_{\eta}^{\prime\prime}=-\phi^{2}\big(\tau_{\eta}^{\prime\prime}\tau_{\eta}-2(\tau_{\eta}^{\prime})^{2}\big)/\tau_{\eta}^{3}. \tag{11.8}
\]

Using $\rho$, we may also write $\mathfrak{m}_{\eta}^{(q)}=q!\int\frac{\rho(\mathrm{d}x)}{(x+\eta/\phi)^{q+1}}$ for $q\in\mathbb{N}$. Here by convention $0!=1$.
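The calculation leading to (11.8) is a direct application of the chain rule. A short worked version, added here for convenience, reads as follows; each line differentiates the previous identity in $\eta$, using $\partial_{\eta}\{\mathfrak{m}^{(q)}(-\eta/\phi)\}=-\phi^{-1}\mathfrak{m}^{(q+1)}(-\eta/\phi)$.

```latex
\begin{align*}
-\phi^{-1}\mathfrak{m}_{\eta}^{\prime}
 &=\partial_{\eta}\big(\tau_{\eta}^{-1}\big)
  =-\frac{\tau_{\eta}^{\prime}}{\tau_{\eta}^{2}}
 &&\Longrightarrow\quad
 \mathfrak{m}_{\eta}^{\prime}=\frac{\phi\tau_{\eta}^{\prime}}{\tau_{\eta}^{2}},\\
-\phi^{-1}\mathfrak{m}_{\eta}^{\prime\prime}
 &=\partial_{\eta}\Big(\frac{\phi\tau_{\eta}^{\prime}}{\tau_{\eta}^{2}}\Big)
  =\frac{\phi\big(\tau_{\eta}^{\prime\prime}\tau_{\eta}-2(\tau_{\eta}^{\prime})^{2}\big)}{\tau_{\eta}^{3}}
 &&\Longrightarrow\quad
 \mathfrak{m}_{\eta}^{\prime\prime}
  =-\frac{\phi^{2}\big(\tau_{\eta}^{\prime\prime}\tau_{\eta}-2(\tau_{\eta}^{\prime})^{2}\big)}{\tau_{\eta}^{3}}.
\end{align*}
```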

(1). Using the formula for $\mathscr{R}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}$,

\begin{align*}
\partial_{\eta}\mathscr{R}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta)&=\sigma_{\xi}^{2}\cdot\partial_{\eta}\Big\{\mathfrak{m}_{\eta}^{-2}\big(\phi\cdot\mathsf{SNR}_{\mu_{0}}\mathfrak{m}_{\eta}-\big(\eta\cdot\mathsf{SNR}_{\mu_{0}}-1\big)\mathfrak{m}_{\eta}^{\prime}\big)\Big\}\\
&=\phi^{-1}\sigma_{\xi}^{2}\cdot\mathfrak{m}_{\eta}^{-3}\big(\mathfrak{m}_{\eta}\mathfrak{m}_{\eta}^{\prime\prime}-2(\mathfrak{m}_{\eta}^{\prime})^{2}\big)\cdot\big(\eta\cdot\mathsf{SNR}_{\mu_{0}}-1\big).
\end{align*}

Some calculations show that

\[
\mathfrak{m}_{\eta}\mathfrak{m}_{\eta}^{\prime\prime}-2(\mathfrak{m}_{\eta}^{\prime})^{2}=\tau_{\eta}^{-3}\phi^{2}(-\tau_{\eta}^{\prime\prime})=2\bigg\{\int\frac{\rho(\mathrm{d}x)}{(x+\eta/\phi)}\int\frac{\rho(\mathrm{d}x)}{(x+\eta/\phi)^{3}}-\bigg(\int\frac{\rho(\mathrm{d}x)}{(x+\eta/\phi)^{2}}\bigg)^{2}\bigg\},
\]

so the identity follows.

(2). Using the formula for $\mathscr{R}^{\mathsf{est}}_{(\Sigma,\mu_{0})}$,

\begin{align*}
\partial_{\eta}\mathscr{R}^{\mathsf{est}}_{(\Sigma,\mu_{0})}(\eta)&=\sigma_{\xi}^{2}\cdot\partial_{\eta}\big(\mathsf{SNR}_{\mu_{0}}(1-\phi)+\mathfrak{m}_{\eta}+(\eta/\phi)(\eta\cdot\mathsf{SNR}_{\mu_{0}}-1)\mathfrak{m}_{\eta}^{\prime}\big)\\
&=\phi^{-1}\sigma_{\xi}^{2}\cdot\big(2\mathfrak{m}_{\eta}^{\prime}-(\eta/\phi)\mathfrak{m}_{\eta}^{\prime\prime}\big)\cdot(\eta\cdot\mathsf{SNR}_{\mu_{0}}-1). \tag{11.9}
\end{align*}

To compute the second term in the above display, recall the identity for $\tau_{\eta}^{\prime},\tau_{\eta}^{\prime\prime}$ in (8.5)-(8.1). Also recall $G_{0}(\eta)=\eta+\tau_{\eta}^{2}T_{-2,1}(\eta)=\tau_{\eta}/\tau_{\eta}^{\prime}$ defined in (8.5). Then

\begin{align*}
2\mathfrak{m}_{\eta}^{\prime}-\frac{\eta}{\phi}\mathfrak{m}_{\eta}^{\prime\prime}&=\frac{\phi}{\tau_{\eta}}\bigg\{\frac{2\tau_{\eta}^{\prime}}{\tau_{\eta}}\bigg(1-\frac{\eta\tau_{\eta}^{\prime}}{\tau_{\eta}}\bigg)+\frac{\eta\tau_{\eta}^{\prime\prime}}{\tau_{\eta}}\bigg\}\\
&=\frac{\phi}{\tau_{\eta}}\bigg\{\frac{2\tau_{\eta}^{\prime}}{\tau_{\eta}}\frac{\tau_{\eta}^{2}T_{-2,1}(\eta)}{G_{0}(\eta)}-\frac{2\eta\tau_{\eta}\tau_{\eta}^{\prime}}{G_{0}^{2}(\eta)}T_{-3,2}(\eta)\bigg\}\\
&=\frac{2\phi\tau_{\eta}^{\prime}}{G_{0}(\eta)}\Big(T_{-2,1}(\eta)-\frac{\eta}{G_{0}(\eta)}T_{-3,2}(\eta)\Big)\\
&\stackrel{(\ast)}{=}\frac{2\phi\tau_{\eta}^{\prime}}{G_{0}(\eta)}\bigg\{\tau_{\eta}T_{-3,1}(\eta)+\bigg(1-\frac{\eta}{G_{0}(\eta)}\bigg)T_{-3,2}(\eta)\bigg\}\\
&=2\phi(\tau_{\eta}^{\prime})^{2}\big(T_{-3,1}(\eta)+\tau_{\eta}^{\prime}T_{-2,1}(\eta)T_{-3,2}(\eta)\big).
\end{align*}

Here in $(\ast)$ we used $T_{-2,1}(\eta)-T_{-3,2}(\eta)=\tau_{\eta}T_{-3,1}(\eta)$. The claimed identity follows by combining the above display and (11.9). Using $\rho$, we may write

\[
2\mathfrak{m}_{\eta}^{\prime}-(\eta/\phi)\mathfrak{m}_{\eta}^{\prime\prime}=2\int\frac{x}{(x+\eta/\phi)^{3}}\,\rho(\mathrm{d}x).
\]

(3). Using the formula for $\mathscr{R}^{\mathsf{in}}_{(\Sigma,\mu_{0})}$,

\begin{align*}
\partial_{\eta}\mathscr{R}^{\mathsf{in}}_{(\Sigma,\mu_{0})}(\eta)&=\sigma_{\xi}^{2}\cdot\partial_{\eta}\Big\{\phi^{-1}\eta^{2}\big(\phi\cdot\mathsf{SNR}_{\mu_{0}}\mathfrak{m}_{\eta}-(\eta\cdot\mathsf{SNR}_{\mu_{0}}-1)\mathfrak{m}_{\eta}^{\prime}\big)+(\phi-2\eta\mathfrak{m}_{\eta})\Big\}\\
&=\sigma_{\xi}^{2}\cdot\big(2\mathfrak{m}_{\eta}-4(\eta/\phi)\mathfrak{m}_{\eta}^{\prime}+\phi^{-2}\eta^{2}\mathfrak{m}_{\eta}^{\prime\prime}\big)\cdot(\eta\cdot\mathsf{SNR}_{\mu_{0}}-1). \tag{11.10}
\end{align*}

The second term in the above display requires some non-trivial calculations:

\begin{align*}
2\mathfrak{m}_{\eta}-\frac{4\eta}{\phi}\mathfrak{m}_{\eta}^{\prime}+\frac{\eta^{2}}{\phi^{2}}\mathfrak{m}_{\eta}^{\prime\prime}&=\frac{1}{\tau_{\eta}}\bigg\{2-4\eta\frac{\tau_{\eta}^{\prime}}{\tau_{\eta}}-\eta^{2}\frac{\tau_{\eta}^{\prime\prime}}{\tau_{\eta}}+2\eta^{2}\bigg(\frac{\tau_{\eta}^{\prime}}{\tau_{\eta}}\bigg)^{2}\bigg\}\\
&=\frac{1}{\tau_{\eta}}\bigg\{2-\frac{4\eta}{G_{0}(\eta)}+\eta^{2}\frac{2\tau_{\eta}\tau_{\eta}^{\prime}T_{-3,2}(\eta)}{G_{0}^{2}(\eta)}+\frac{2\eta^{2}}{G_{0}^{2}(\eta)}\bigg\}\\
&=\frac{2}{\tau_{\eta}G_{0}^{2}(\eta)}\big\{G_{0}^{2}(\eta)-2\eta G_{0}(\eta)+\eta^{2}\tau_{\eta}\tau_{\eta}^{\prime}T_{-3,2}(\eta)+\eta^{2}\big\}.
\end{align*}

Expanding the $G_{0}(\eta)$ terms in the bracket using $G_{0}(\eta)=\eta+\tau_{\eta}^{2}T_{-2,1}(\eta)$, so that $G_{0}^{2}(\eta)-2\eta G_{0}(\eta)+\eta^{2}=\tau_{\eta}^{4}T_{-2,1}^{2}(\eta)$, we arrive at

\[
2\mathfrak{m}_{\eta}-\frac{4\eta}{\phi}\mathfrak{m}_{\eta}^{\prime}+\frac{\eta^{2}}{\phi^{2}}\mathfrak{m}_{\eta}^{\prime\prime}=\frac{2}{G_{0}^{2}(\eta)}\Big(\eta^{2}\tau_{\eta}^{\prime}T_{-3,2}(\eta)+\tau_{\eta}^{3}T_{-2,1}^{2}(\eta)\Big).
\]

The claimed identity follows by combining the above display and (11.10). Using $\rho$, we may write

\[
2\mathfrak{m}_{\eta}-\frac{4\eta}{\phi}\mathfrak{m}_{\eta}^{\prime}+\frac{\eta^{2}}{\phi^{2}}\mathfrak{m}_{\eta}^{\prime\prime}=2\int\frac{x^{2}}{(x+\eta/\phi)^{3}}\,\rho(\mathrm{d}x).
\]
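The three representations via $\rho$ obtained in parts (1)-(3) are easy to sanity-check numerically. Below is a minimal sketch, assuming a purely hypothetical discrete measure $\rho$ (the atoms and masses are illustrative and not taken from the paper), together with the formula $\mathfrak{m}_{\eta}^{(q)}=q!\int\rho(\mathrm{d}x)/(x+\eta/\phi)^{q+1}$ recalled above.

```python
import numpy as np

def stieltjes_derivs(atoms, masses, a):
    """Return (m, m', m'') with m^{(q)} = q! * sum_j w_j / (x_j + a)^{q+1},
    where a = eta / phi and rho = sum_j w_j * delta_{x_j}."""
    m0 = np.sum(masses / (atoms + a))
    m1 = np.sum(masses / (atoms + a) ** 2)
    m2 = 2.0 * np.sum(masses / (atoms + a) ** 3)
    return m0, m1, m2

atoms = np.array([0.2, 0.6, 1.0, 2.5])    # hypothetical support of rho
masses = np.array([0.2, 0.3, 0.4, 0.1])   # hypothetical masses (sum to 1)
eta, phi = 0.7, 0.5
a = eta / phi
m0, m1, m2 = stieltjes_derivs(atoms, masses, a)

# Part (1): m m'' - 2 (m')^2 is positive by the Cauchy-Schwarz inequality.
pred_gap = m0 * m2 - 2.0 * m1 ** 2

# Part (2): 2 m' - (eta/phi) m'' = 2 * int x / (x + eta/phi)^3 rho(dx).
est_lhs = 2.0 * m1 - a * m2
est_rhs = 2.0 * np.sum(masses * atoms / (atoms + a) ** 3)

# Part (3): 2 m - 4 (eta/phi) m' + (eta/phi)^2 m'' = 2 * int x^2 / (x + eta/phi)^3 rho(dx).
in_lhs = 2.0 * m0 - 4.0 * a * m1 + a ** 2 * m2
in_rhs = 2.0 * np.sum(masses * atoms ** 2 / (atoms + a) ** 3)

print(pred_gap, est_lhs - est_rhs, in_lhs - in_rhs)
```

All three quantities are manifestly positive in the integral form, which is the source of the positivity of $\mathfrak{M}^{\#}$.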

Finally, the claimed first two-sided bound on $\mathfrak{M}^{\#}$ follows from Proposition 8.1, and the second bound follows by using the fundamental theorem of calculus. ∎
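To illustrate the sign structure $\partial_{\eta}\mathscr{R}^{\#}\propto(\eta\cdot\mathsf{SNR}_{\mu_{0}}-1)$, the following sketch evaluates the closed form of $\mathscr{R}^{\mathsf{est}}_{(\Sigma,\mu_{0})}$ on a grid and locates its minimizer. It assumes the isotropic case $\Sigma=I_{n}$, for which the fixed point equation $\phi-\eta/\tau=1/(1+\tau)$ reduces to a quadratic in $\tau$; the numerical values of $\phi$, $\mathsf{SNR}_{\mu_{0}}$ and the grid are hypothetical choices.

```python
import numpy as np

def tau_star_isotropic(phi, eta):
    """Positive root of phi*tau^2 + (phi - eta - 1)*tau - eta = 0, i.e. the
    solution of phi - eta/tau = 1/(1 + tau) when Sigma = I."""
    b = phi - eta - 1.0
    return (-b + np.sqrt(b * b + 4.0 * phi * eta)) / (2.0 * phi)

def est_risk_isotropic(phi, eta, snr, sigma2=1.0):
    """sigma^2 * { SNR (1 - phi) + m + (eta/phi)(eta SNR - 1) m' } with
    m = 1/tau and m' = phi tau'/tau^2, where tau' = tau / G_0."""
    tau = tau_star_isotropic(phi, eta)
    t21 = 1.0 / (1.0 + tau) ** 2          # T_{-2,1}(eta) for Sigma = I
    g0 = eta + tau ** 2 * t21             # G_0(eta) = tau / tau'
    m = 1.0 / tau
    m_prime = phi / (tau * g0)
    return sigma2 * (snr * (1.0 - phi) + m + (eta / phi) * (eta * snr - 1.0) * m_prime)

phi, snr = 0.5, 2.0                       # hypothetical aspect ratio and SNR
eta_grid = np.linspace(0.1, 2.0, 20)      # step 0.1, contains 1/snr = 0.5
risks = np.array([est_risk_isotropic(phi, e, snr) for e in eta_grid])
eta_best = eta_grid[np.argmin(risks)]
print(eta_best, 1.0 / snr)
```

On this grid the minimizer coincides with $\eta_{\ast}=\mathsf{SNR}_{\mu_{0}}^{-1}$, as the derivative formula predicts.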

11.4. Proof of Theorem 3.4

The following lemma gives a technical extension of Theorem 3.1 for $\#\in\{\mathsf{pred},\mathsf{est},\mathsf{in}\}$ under $\sigma_{\xi}^{2}\approx 0$ when $\phi^{-1}>1$. For $\#=\mathsf{in}$, the extension also allows uniform control over $\eta\approx 0$ under both the above small variance scenario with $\phi^{-1}>1$, and under the original conditions.

Lemma 11.4.

Suppose Assumption A holds and the following hold for some $K>0$.

  • $1+1/K\leq\phi^{-1}\leq K$, $\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\vee\lVert\Sigma\rVert_{\operatorname{op}}\leq K$.

  • Assumption B with $\sigma_{\xi}^{2}\in[0,K]$.

Fix a small enough $\vartheta\in(0,1/50)$. Then there exist a constant $C=C(K,\vartheta)>1$, and a measurable set $\mathcal{U}_{\vartheta}\subset B_{n}(1)$ with $\mathrm{vol}(\mathcal{U}_{\vartheta})/\mathrm{vol}(B_{n}(1))\geq 1-Ce^{-n^{\vartheta}/C}$, such that for any $\varepsilon\in(0,1/2]$, and $\#\in\{\mathsf{pred},\mathsf{est},\mathsf{in},\mathsf{res}\}$,

\[
\sup_{\mu_{0}\in\mathcal{U}_{\vartheta}}\mathbb{P}\bigg(\sup_{\eta\in\Xi_{K}}\lvert R^{\#}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi})-\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi})\rvert\geq\varepsilon\bigg)\leq C\cdot\begin{cases}ne^{-n\varepsilon^{c_{0}}/C},&Z=G;\\ \varepsilon^{-c_{0}}n^{-1/6.5},&\text{otherwise}.\end{cases}
\]
Proof.

All the constants in $\lesssim,\gtrsim,\asymp$ below may possibly depend on $K$.

(Part 1). We shall first extend the claim of Theorem 3.1 for $\#=\mathsf{pred}$ to $\sigma_{\xi}^{2}\in[0,K]$ in the case $\phi^{-1}\geq 1+1/K$. Note that uniformly in $\eta\in[0,K]$, for $\sigma_{\xi},\sigma_{\xi}^{\prime}\in[0,K]$,

\[
\lVert\widehat{\mu}_{\eta}(\sigma_{\xi})-\widehat{\mu}_{\eta}(\sigma_{\xi}^{\prime})\rVert\lesssim\lvert\sigma_{\xi}-\sigma_{\xi}^{\prime}\rvert\cdot n^{-1}\lVert Z\rVert_{\operatorname{op}}\lVert\xi_{0}\rVert\cdot\lVert(ZZ^{\top}/n)^{-1}\rVert_{\operatorname{op}}. \tag{11.11}
\]

Using the estimate (11.2), uniformly in $\eta\in[0,K]$, for all $\sigma_{\xi},\sigma_{\xi}^{\prime}\in[0,K]$,

\begin{align*}
&\lvert R^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi})-R^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi}^{\prime})\rvert\\
&\lesssim\lVert\widehat{\mu}_{\eta}(\sigma_{\xi})-\widehat{\mu}_{\eta}(\sigma_{\xi}^{\prime})\rVert\cdot\big(\lVert\widehat{\mu}_{\eta}(\sigma_{\xi})\rVert+\lVert\widehat{\mu}_{\eta}(\sigma_{\xi}^{\prime})\rVert+\lVert\mu_{0}\rVert\big)\\
&\lesssim\lvert\sigma_{\xi}-\sigma_{\xi}^{\prime}\rvert\cdot\lVert(ZZ^{\top}/n)^{-1}\rVert_{\operatorname{op}}^{2}\cdot\Big(1+\frac{\lVert Z\rVert_{\operatorname{op}}+\lVert\xi_{0}\rVert}{\sqrt{n}}\Big)^{4}.
\end{align*}

So on an event $E_{1}$ with $\mathbb{P}(E_{1})\geq 1-C_{1}e^{-n/C_{1}}$, for $\sigma_{\xi},\sigma_{\xi}^{\prime}\in[0,K]$,

\[
\sup_{\eta\in[0,K]}\lvert R^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi})-R^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi}^{\prime})\rvert\leq C_{1}\cdot\lvert\sigma_{\xi}-\sigma_{\xi}^{\prime}\rvert.
\]

On the other hand, using Lemma 11.5-(2),

\[
\sup_{\eta\in[0,K]}\lvert\bar{R}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi})-\bar{R}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi}^{\prime})\rvert\leq C_{1}\cdot\lvert\sigma_{\xi}-\sigma_{\xi}^{\prime}\rvert.
\]

Using the above two displays, for any $\varepsilon>0$, by choosing $\sigma_{\xi}^{\prime}\equiv\varepsilon/(2C_{1})$, we have for any $\sigma_{\xi}\leq\sigma_{\xi}^{\prime}$,

\begin{align*}
&\mathbb{P}\Big(\sup_{\eta\in[0,K]}\lvert R^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi})-\bar{R}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi})\rvert\geq 2\varepsilon\Big)\\
&\leq\mathbb{P}\Big(\sup_{\eta\in[0,K]}\lvert R^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi}^{\prime})-\bar{R}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi}^{\prime})\rvert\geq\varepsilon\Big)+C_{1}e^{-n/C_{1}}. \tag{11.12}
\end{align*}

The first term on the right hand side of the above display can be handled by the proven claim in Theorem 3.1, upon noting that (i) the constant $C$ therein depends on $K$ polynomially, and here we choose $K$ to be larger than $2C_{1}/\varepsilon$; (ii) $(n/\varepsilon)^{C^{\prime}}e^{-n\varepsilon^{C^{\prime}}}\wedge 1\leq ne^{-n\varepsilon^{C^{\prime\prime}}}$ holds for $C^{\prime\prime}$ chosen much larger than $C^{\prime}$.

The extension of the claim of Theorem 3.1 for $\#=\mathsf{est}$ to $\sigma_{\xi}^{2}\in[0,K]$ follows a similar proof with minor modifications, so we omit the details.

(Part 2). Next we consider the case $\#=\mathsf{in}$. We need to extend the corresponding claim of Theorem 3.1 to both $\sigma_{\xi}^{2}\in[0,K]$ and $\eta\in[0,K]$.

We first verify the (high probability) Lipschitz continuity of the maps $\sigma_{\xi}\mapsto R^{\mathsf{in}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi}),\bar{R}^{\mathsf{in}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi})$. Note that uniformly in $\eta\in[0,K]$, by virtue of (11.11), for any $\sigma_{\xi},\sigma_{\xi}^{\prime}\in[0,K]$,

\begin{align*}
&\lvert R^{\mathsf{in}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi})-R^{\mathsf{in}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi}^{\prime})\rvert\\
&\lesssim\Big(1+\frac{\lVert Z\rVert_{\operatorname{op}}}{\sqrt{n}}\Big)^{2}\cdot\lVert\widehat{\mu}_{\eta}(\sigma_{\xi})-\widehat{\mu}_{\eta}(\sigma_{\xi}^{\prime})\rVert\cdot\big(\lVert\widehat{\mu}_{\eta}(\sigma_{\xi})\rVert+\lVert\widehat{\mu}_{\eta}(\sigma_{\xi}^{\prime})\rVert+\lVert\mu_{0}\rVert\big)\\
&\lesssim\lvert\sigma_{\xi}-\sigma_{\xi}^{\prime}\rvert\cdot\lVert(ZZ^{\top}/n)^{-1}\rVert_{\operatorname{op}}^{2}\cdot\Big(1+\frac{\lVert Z\rVert_{\operatorname{op}}+\lVert\xi_{0}\rVert}{\sqrt{n}}\Big)^{6}.
\end{align*}

This verifies the high probability Lipschitz property of $\sigma_{\xi}\mapsto R^{\mathsf{in}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi})$. The Lipschitz property of $\sigma_{\xi}\mapsto\bar{R}^{\mathsf{in}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi})$ is easily verified. From here we may use a similar argument to (11.12) to conclude the extension of the claim of Theorem 3.1 for $\#=\mathsf{in}$ to $\sigma_{\xi}^{2}\in[0,K]$.

Finally we verify the (high probability) Lipschitz continuity of the maps $\eta\mapsto R^{\mathsf{in}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi}),\bar{R}^{\mathsf{in}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi})$. Using the estimates (9.4) (with $G$ replaced by $Z$) and (11.2), uniformly in $\sigma_{\xi}\in[0,K]$ and $\eta_{1},\eta_{2}\in[0,K]$,

\begin{align*}
&\lvert R^{\mathsf{in}}_{(\Sigma,\mu_{0})}(\eta_{1},\sigma_{\xi})-R^{\mathsf{in}}_{(\Sigma,\mu_{0})}(\eta_{2},\sigma_{\xi})\rvert\\
&\lesssim\Big(1+\frac{\lVert Z\rVert_{\operatorname{op}}}{\sqrt{n}}\Big)^{2}\cdot\lVert\widehat{\mu}_{\eta_{1}}(\sigma_{\xi})-\widehat{\mu}_{\eta_{2}}(\sigma_{\xi})\rVert\cdot\big(\lVert\widehat{\mu}_{\eta_{1}}(\sigma_{\xi})\rVert+\lVert\widehat{\mu}_{\eta_{2}}(\sigma_{\xi})\rVert+\lVert\mu_{0}\rVert\big)\\
&\lesssim\Big(1+\frac{\lVert Z\rVert_{\operatorname{op}}+\lVert\xi_{0}\rVert}{\sqrt{n}}\Big)^{6}\cdot\lVert(ZZ^{\top}/n)^{-1}\rVert_{\operatorname{op}}^{3}\cdot\lvert\eta_{1}-\eta_{2}\rvert.
\end{align*}

The Lipschitz property of $\eta\mapsto\bar{R}^{\mathsf{in}}_{(\Sigma,\mu_{0})}(\eta,\sigma_{\xi})$ is again easily verified. Again from here we may argue similarly to (11.12) to extend the claim of Theorem 3.1 for $\#=\mathsf{in}$ to $\eta\in[0,K]$. The case for $\#=\mathsf{res}$ is similar so we omit repetitive details. ∎

Lemma 11.5.

Suppose $\phi^{-1}>1$. The following hold.

  (1) The system of equations

\[
\begin{cases}
\phi\gamma^{2}=\mathbb{E}\,\mathsf{err}_{(\Sigma,\mu_{0})}(\gamma;\tau),\\
\phi-\frac{\eta}{\tau}=\gamma^{-2}\,\mathbb{E}\,\mathsf{dof}_{(\Sigma,\mu_{0})}(\gamma;\tau)=\frac{1}{n}\operatorname{tr}\big((\Sigma+\tau I)^{-1}\Sigma\big)
\end{cases}
\]

  admits a unique solution $(\gamma_{\eta,\ast}(0),\tau_{\eta,\ast}(0))\in[0,\infty)\times(0,\infty)$.

  (2) It holds that $\tau_{\eta,\ast}(0)=\tau_{\eta,\ast}(\sigma_{\xi})$. If furthermore $1+1/K\leq\phi^{-1}\leq K$ and $\lVert\Sigma\rVert_{\operatorname{op}}\vee\mathcal{H}_{\Sigma}\leq K$ for some $K>0$, then there exists some $C=C(K)>0$ such that $\lvert\gamma_{\eta,\ast}^{2}(\sigma_{\xi})-\gamma_{\eta,\ast}^{2}(0)\rvert\leq C\sigma_{\xi}^{2}$.

Proof.

The claim (1) follows verbatim from the proof of Proposition 8.1-(1) by setting $\sigma_{\xi}^{2}=0$ therein. The claim (2) follows by using the formula (8.3). ∎
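The second equation of the system in Lemma 11.5-(1) does not involve $\sigma_{\xi}$ or $\gamma$, which is why $\tau_{\eta,\ast}(0)=\tau_{\eta,\ast}(\sigma_{\xi})$. A minimal numerical sketch of solving it is given below, assuming a diagonal $\Sigma$ given through its eigenvalues; the bisection scheme and the helper name `solve_tau` are illustrative choices, not the paper's method. The map $\tau\mapsto\phi-\eta/\tau-n^{-1}\operatorname{tr}((\Sigma+\tau I)^{-1}\Sigma)$ is increasing, negative near $0$ when $\phi<1$, and tends to $\phi>0$ at infinity, so bisection applies, including at $\eta=0$.

```python
import numpy as np

def solve_tau(eigs, phi, eta, lo=1e-12, hi=1e12, iters=200):
    """Bisection (on a log scale) for the unique tau > 0 solving
    phi - eta/tau = n^{-1} tr((Sigma + tau I)^{-1} Sigma), Sigma = diag(eigs).
    Assumes phi < 1 and eta >= 0."""
    def f(tau):
        return phi - eta / tau - np.mean(eigs / (eigs + tau))
    for _ in range(iters):
        mid = np.sqrt(lo * hi)            # geometric midpoint
        if f(mid) > 0:                    # f is increasing: root lies below mid
            hi = mid
        else:
            lo = mid
    return np.sqrt(lo * hi)

# Isotropic checks: at eta = 0 the equation reads phi = 1/(1 + tau).
tau_ridgeless = solve_tau(np.ones(50), phi=0.5, eta=0.0)
tau_ridge = solve_tau(np.ones(50), phi=0.5, eta=0.5)
print(tau_ridgeless, tau_ridge)
```

For $\Sigma=I_{n}$ and $\phi=1/2$ the two values can be compared with the closed forms $1/\phi-1=1$ and $1+\sqrt{2}$.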

Proof of Theorem 3.4.

Let $\mathcal{U}_{\vartheta}\subset B_{n}(1)$ be as specified in Theorem 3.1 or 2.4. In view of its explicit form given in Proposition 10.3, with $\mathcal{U}_{\delta,\vartheta}\equiv\mathcal{U}_{\vartheta}\cap\big(B_{n}(1)\setminus B_{n}(\delta)\big)$, the volume estimates $\min\big\{\mathrm{vol}(\mathcal{U}_{\vartheta})/\mathrm{vol}(B_{n}(1)),\mathrm{vol}(\mathcal{U}_{\delta,\vartheta})/\mathrm{vol}(B_{n}(1)\setminus B_{n}(\delta))\big\}\geq 1-Ce^{-n^{\vartheta}/C}$ hold.

On the other hand, using the construction around (11.5), we may find some $\mathcal{V}_{\varepsilon}\subset B_{n}(1),\mathcal{V}_{\varepsilon,\delta}\subset B_{n}(1)\setminus B_{n}(\delta)$ (for the latter, we take $U_{0}\sim\mathrm{Unif}(\delta,1)$ therein) with $\min\big\{\mathrm{vol}(\mathcal{V}_{\varepsilon})/\mathrm{vol}(B_{n}(1)),\mathrm{vol}(\mathcal{V}_{\varepsilon,\delta})/\mathrm{vol}(B_{n}(1)\setminus B_{n}(\delta))\big\}\geq 1-C\varepsilon^{-1}e^{-n\varepsilon^{2}/C}$, such that for $\#\in\{\mathsf{pred},\mathsf{est},\mathsf{in}\}$,

\[
\sup_{\mu_{0}\in\{\mathcal{V}_{\varepsilon},\mathcal{V}_{\varepsilon,\delta}\}}\sup_{\eta\in\Xi_{L}}\lvert\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)-\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)\rvert\leq\varepsilon. \tag{11.13}
\]

Now let

\[
\mathcal{W}_{\varepsilon,\vartheta}\equiv\mathcal{U}_{\vartheta}\cap\mathcal{V}_{\varepsilon},\quad\mathcal{W}_{\varepsilon,\delta,\vartheta}\equiv\mathcal{U}_{\delta,\vartheta}\cap\mathcal{V}_{\varepsilon,\delta}. \tag{11.14}
\]

Then we have the volume estimates $\min\big\{\mathrm{vol}(\mathcal{W}_{\varepsilon,\vartheta})/\mathrm{vol}(B_{n}(1)),\mathrm{vol}(\mathcal{W}_{\varepsilon,\delta,\vartheta})/\mathrm{vol}(B_{n}(1)\setminus B_{n}(\delta))\big\}\geq 1-C\varepsilon^{-1}e^{-n\varepsilon^{2}/C}-Ce^{-n^{\vartheta}/C}$.

Moreover, by Proposition 11.3, provided $\eta_{\ast}\equiv\sigma_{\xi}^{2}/\lVert\mu_{0}\rVert^{2}=\mathsf{SNR}_{\mu_{0}}^{-1}\in\Xi_{L}$,

\[
\lVert\mu_{0}\rVert^{2}/C_{0}\leq\frac{\lvert\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)-\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta_{\ast})\rvert}{(\eta-\eta_{\ast})^{2}}\leq C_{0}\lVert\mu_{0}\rVert^{2} \tag{11.15}
\]

holds uniformly in $\eta\in\Xi_{L}$ for some $C_{0}>0$.

(Noisy case $\sigma_{\xi}^{2}\in[1/K,K]$). Fix $\mu_{0}\in\mathcal{W}_{\varepsilon,\delta,\vartheta}$. Under the assumed conditions, $\eta_{\ast}\in\Xi_{L}$. So using the estimates (11.13) and (11.15), for any $\eta^{\prime}\geq\eta_{\ast}$,

\[
\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta^{\prime})-\inf_{\eta\in\Xi_{L}}\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)\geq\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta^{\prime})-\inf_{\eta\in\Xi_{L}}\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)-2\varepsilon\geq\frac{\delta^{2}(\eta^{\prime}-\eta_{\ast})^{2}}{C_{0}}-2\varepsilon.
\]

Combined with a similar inequality for $\eta^{\prime}\leq\eta_{\ast}$, we conclude that for any $\mu_{0}\in\mathcal{W}_{\varepsilon,\delta,\vartheta}$ and $\eta^{\prime}\in\Xi_{L}$,

\[
\big\lvert\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta^{\prime})-\inf_{\eta\in\Xi_{L}}\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)\big\rvert\geq\frac{\delta^{2}(\eta^{\prime}-\eta_{\ast})^{2}}{C_{0}}-2\varepsilon.
\]

Now for $\lvert\eta^{\prime}-\eta_{\ast}\rvert\geq\Delta$, choosing $\varepsilon\equiv\varepsilon_{0}\equiv\delta^{2}\Delta^{2}/(4C_{0})$, we have

\[
\inf_{\mu_{0}\in\mathcal{W}_{\varepsilon,\delta,\vartheta}}\inf_{\eta^{\prime}\in\Xi_{L}:\lvert\eta^{\prime}-\eta_{\ast}\rvert\geq\Delta}\big\lvert\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta^{\prime})-\inf_{\eta\in\Xi_{L}}\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)\big\rvert\geq\frac{\delta^{2}\Delta^{2}}{2C_{0}}.
\]

From here the claim follows from Theorem 3.1.
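The constant in the last lower bound of the noisy case comes from a one-line computation, recorded here for convenience: with $\lvert\eta^{\prime}-\eta_{\ast}\rvert\geq\Delta$ and $\varepsilon=\varepsilon_{0}=\delta^{2}\Delta^{2}/(4C_{0})$,

```latex
\[
\frac{\delta^{2}(\eta^{\prime}-\eta_{\ast})^{2}}{C_{0}}-2\varepsilon_{0}
\;\geq\;\frac{\delta^{2}\Delta^{2}}{C_{0}}-\frac{\delta^{2}\Delta^{2}}{2C_{0}}
\;=\;\frac{\delta^{2}\Delta^{2}}{2C_{0}}.
\]
```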

(Noiseless case $\sigma_{\xi}^{2}=0$). In this case, (11.15) implies that the map $\eta\mapsto\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)$ attains its global minimum at $\eta=0$. So together with (11.13), it implies that uniformly in $\mu_{0}\in\mathcal{W}_{\varepsilon,\vartheta}$,

\[
\big\lvert\min_{\eta\in[0,K]}\bar{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)-\bar{R}^{\#}_{(\Sigma,\mu_{0})}(0)\big\rvert\leq\varepsilon.
\]

From here the claim follows from Lemma 11.4, which holds for $\sigma_{\xi}=0$. ∎

12. Proofs for Section 4

12.1. Proof of Theorem 4.1

All the constants in $\lesssim,\gtrsim,\asymp$ may depend on $K$.

Proof of Theorem 4.1 for $\widehat{\tau}_{\eta}$.

Let $\kappa_{0}$ be defined in the same way as in the proof of Proposition 10.3. Using a similar local law and continuity argument as in the proof of that proposition, on an event $E_{0}$ with $\mathbb{P}(E_{0})\geq 1-Cn^{-D}$,

\[
\sup_{\eta\in\Xi_{K}}\big\lvert m^{-1}\operatorname{tr}\big(\check{\Sigma}+(\eta/\phi)I\big)^{-1}-\mathfrak{m}\big(-\eta/\phi\big)\big\rvert\lesssim\kappa_{0}^{-1}n^{-1/2+\varepsilon}.
\]

So on $E_{0}\cap\mathscr{E}(C_{1})$, where $\mathscr{E}(C_{1})\equiv\{\lVert Z\rVert_{\operatorname{op}}/\sqrt{n}\leq C_{1}\}$ with $\mathbb{P}(\mathscr{E}(C_{1}))\geq 1-Ce^{-n/C}$, uniformly in $\eta\in\Xi_{K}$,

\[
\lvert\widehat{\tau}_{\eta}-\tau_{\eta,\ast}\rvert\leq\frac{\big\lvert\frac{1}{m}\operatorname{tr}\big(\check{\Sigma}+\frac{\eta}{\phi}I\big)^{-1}-\mathfrak{m}\big(-\frac{\eta}{\phi}\big)\big\rvert}{\frac{1}{m}\operatorname{tr}\big(\check{\Sigma}+\frac{\eta}{\phi}I\big)^{-1}\cdot\mathfrak{m}\big(-\frac{\eta}{\phi}\big)}\lesssim\Big\{C_{1}^{2}\mathbf{1}_{\phi^{-1}\geq 1+1/K}^{-1}\wedge\eta^{-1}\Big\}\cdot\kappa_{0}^{-1}n^{-1/2+\varepsilon}.
\]

Here in the last inequality, we use the following estimate for $\mathfrak{m}(z)$: As $\mathfrak{m}$ is the Stieltjes transform of $\rho$ (cf. [KY17, Lemma 2.2]), $\mathfrak{m}(z)\geq 0$ for $z\leq 0$, and

\[
\frac{1}{\mathfrak{m}(z)}=(-z)+\frac{1}{m}\operatorname{tr}\Big(\big(I+\Sigma\mathfrak{m}(z)\big)^{-1}\Sigma\Big)\lesssim 1+\lvert z\rvert.
\]

The claim follows. ∎
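The estimate above can be visualized with a small simulation. The sketch below rests on several assumptions that are inferred from the displays rather than stated in this section: $\widehat{\tau}_{\eta}$ is taken to be the reciprocal of $m^{-1}\operatorname{tr}(\check{\Sigma}+(\eta/\phi)I)^{-1}$, $\check{\Sigma}$ is taken to be $XX^{\top}/m$ for an $m\times n$ design $X$, and $\phi=m/n$. It uses a Gaussian design with $\Sigma=I_{n}$, for which $\tau_{\eta,\ast}$ has a closed form.

```python
import numpy as np

def tau_star_isotropic(phi, eta):
    """Positive root of phi*tau^2 + (phi - eta - 1)*tau - eta = 0, i.e. the
    solution of phi - eta/tau = 1/(1 + tau) when Sigma = I."""
    b = phi - eta - 1.0
    return (-b + np.sqrt(b * b + 4.0 * phi * eta)) / (2.0 * phi)

def tau_hat(X, eta):
    """Plug-in estimate 1 / (m^{-1} tr((Sigma_check + (eta/phi) I)^{-1})),
    with Sigma_check = X X^T / m (an assumed definition) and phi = m / n."""
    m, n = X.shape
    phi = m / n
    evals = np.linalg.eigvalsh(X @ X.T / m)
    stieltjes = np.mean(1.0 / (evals + eta / phi))
    return 1.0 / stieltjes

rng = np.random.default_rng(1)
m, n = 300, 600                           # hypothetical sizes, phi = 1/2
X = rng.standard_normal((m, n))           # Gaussian design with Sigma = I
eta = 0.5
est = tau_hat(X, eta)
ref = tau_star_isotropic(m / n, eta)
print(est, ref)
```

With these sizes the two numbers are expected to agree up to a small random fluctuation; the reference value is $1+\sqrt{2}$.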

Proof of Theorem 4.1 for γ^η\widehat{\gamma}_{\eta}.

Using Theorem 3.1, the stability of $\tau_{\eta,\ast}$ in Proposition 8.1, and the already proven claim (1) on $\widehat{\tau}_{\eta}$, it holds for $\varepsilon\in(0,1/2]$ that

\[
\operatorname{\mathbb{P}}\Big(\sup_{\eta\in[1/K,K]}\big\lvert\eta^{-1}\widehat{\tau}_{\eta}\lVert\widehat{r}_{\eta}(\sigma_{\xi})\rVert-\gamma_{\eta,\ast}(\sigma_{\xi})\big\rvert\geq\varepsilon\Big)\leq C_{1}\varepsilon^{-c_{0}}n^{-1/6.5}. \tag{12.1}
\]

Next we consider the extension to $\eta\in[0,K]$ in the regime $\phi^{-1}\geq 1+1/K$. By the KKT condition, $n^{-1}X^{\top}(Y-X\widehat{\mu}_{\eta})=\eta\widehat{\mu}_{\eta}$, so a.s. $\widehat{r}_{\eta}/\eta=(Y-X\widehat{\mu}_{\eta})/(\sqrt{n}\eta)=\sqrt{n}(XX^{\top})^{-1}X\widehat{\mu}_{\eta}$ for any $\eta>0$. So we only need to verify the high probability Lipschitz continuity of $\eta\mapsto\sqrt{n}\widehat{\tau}_{\eta}(XX^{\top})^{-1}X\widehat{\mu}_{\eta}$: for any $\eta_{1},\eta_{2}\in[0,K]$, using the estimate (9.4) (with $G$ replaced by $Z$) we obtain, for some universal $c_{0}>1$,

|nτ^η1(XX)1Xμ^η1nτ^η2(XX)1Xμ^η2|\displaystyle\big{\lvert}\sqrt{n}\widehat{\tau}_{\eta_{1}}\big{\lVert}(XX^{\top})^{-1}X\widehat{\mu}_{\eta_{1}}\big{\rVert}-\sqrt{n}\widehat{\tau}_{\eta_{2}}\big{\lVert}(XX^{\top})^{-1}X\widehat{\mu}_{\eta_{2}}\big{\rVert}\big{\rvert}
(ZZ/n)1op(Zop/n)(|τ^η1τ^η2|μ^η1+|τ^η2|μ^η1μ^η2)\displaystyle\lesssim\lVert(ZZ^{\top}/n)^{-1}\rVert_{\operatorname{op}}\cdot({\lVert Z\rVert_{\operatorname{op}}}/{\sqrt{n}})\cdot\Big{(}\lvert\widehat{\tau}_{\eta_{1}}-\widehat{\tau}_{\eta_{2}}\rvert\cdot\lVert\widehat{\mu}_{\eta_{1}}\rVert+\lvert\widehat{\tau}_{\eta_{2}}\rvert\cdot\lVert\widehat{\mu}_{\eta_{1}}-\widehat{\mu}_{\eta_{2}}\rVert\Big{)}
(1+Zop+ξ0n+(ZZ/n)1op)c0|η2η1|.\displaystyle\lesssim\Big{(}1+\frac{\lVert Z\rVert_{\operatorname{op}}+\lVert\xi_{0}\rVert}{\sqrt{n}}+\lVert(ZZ^{\top}/n)^{-1}\rVert_{\operatorname{op}}\Big{)}^{c_{0}}\cdot\lvert\eta_{2}-\eta_{1}\rvert.

Finally we consider extension to σξ2[0,K]\sigma_{\xi}^{2}\in[0,K] in the same regime ϕ11+1/K\phi^{-1}\geq 1+1/K by verifying a similar high probability uniform-in-η\eta Lipschitz continuity property for σξn(XX)1Xμ^η(σξ)\sigma_{\xi}\mapsto\sqrt{n}(XX^{\top})^{-1}X\widehat{\mu}_{\eta}(\sigma_{\xi}): for any σξ,σξ[0,K]\sigma_{\xi},\sigma_{\xi}^{\prime}\in[0,K], using the estimate (11.11),

supη[0,K]|nτ^η(XX)1Xμ^η(σξ)nτ^η(XX)1Xμ^η(σξ)|\displaystyle\sup_{\eta\in[0,K]}\big{\lvert}\sqrt{n}\widehat{\tau}_{\eta}\big{\lVert}(XX^{\top})^{-1}X\widehat{\mu}_{\eta}(\sigma_{\xi})\big{\rVert}-\sqrt{n}\widehat{\tau}_{\eta}\big{\lVert}(XX^{\top})^{-1}X\widehat{\mu}_{\eta}(\sigma_{\xi}^{\prime})\big{\rVert}\big{\rvert}
(ZZ/n)1op2(Zop/n)supη[0,K]μ^η(σξ)μ^η(σξ)\displaystyle\lesssim\lVert(ZZ^{\top}/n)^{-1}\rVert_{\operatorname{op}}^{2}\cdot({\lVert Z\rVert_{\operatorname{op}}}/{\sqrt{n}})\cdot\sup_{\eta\in[0,K]}\lVert\widehat{\mu}_{\eta}(\sigma_{\xi})-\widehat{\mu}_{\eta}(\sigma_{\xi}^{\prime})\rVert
(1+Zop+ξ0n+(ZZ/n)1op)c0|σξσξ|.\displaystyle\lesssim\Big{(}1+\frac{\lVert Z\rVert_{\operatorname{op}}+\lVert\xi_{0}\rVert}{\sqrt{n}}+\lVert(ZZ^{\top}/n)^{-1}\rVert_{\operatorname{op}}\Big{)}^{c_{0}}\cdot\lvert\sigma_{\xi}-\sigma_{\xi}^{\prime}\rvert.

The claimed bound follows. ∎

12.2. Proof of Theorem 4.2

Recall that $\gamma_{\eta,\ast}^{2}=\phi^{-1}\big(\sigma_{\xi}^{2}+\bar{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)\big)$. In both cases, namely $\sigma_{\xi}^{2}\in[1/K,K]$, and $\sigma_{\xi}^{2}\in[0,K]$ with $\phi^{-1}\geq 1+1/K$, we take $\mathcal{W}_{\varepsilon,\delta,\vartheta}\subset B_{n}(1)\setminus B_{n}(\delta)$ as constructed in (11.14) in the proof of Theorem 3.4, with $\varepsilon\equiv\varepsilon_{n}\equiv n^{-\vartheta}$. Fix $\mu_{0}\in\mathcal{W}_{\varepsilon,\delta,\vartheta}$; then $\eta_{\ast}=\operatorname{\mathsf{SNR}}_{\mu_{0}}^{-1}\in\Xi_{L}$. Using Theorems 3.2 and 4.1, on an event $E_{0}$ with $\operatorname{\mathbb{P}}(E_{0}^{c})\leq Cn^{-1/7}$,

supηΞL|γ^η2ϕ1(σξ2+𝗉𝗋𝖾𝖽(Σ,μ0)(η))|ε.\displaystyle\sup_{\eta\in\Xi_{L}}\big{\lvert}\widehat{\gamma}_{\eta}^{2}-\phi^{-1}\big{(}\sigma_{\xi}^{2}+\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)\big{)}\big{\rvert}\leq\varepsilon. (12.2)

This in particular implies that on E0E_{0}, both the following inequalities hold:

ϕγ^η^𝖦𝖢𝖵2σξ2ϕε\displaystyle\phi\widehat{\gamma}_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}}}^{2}-\sigma_{\xi}^{2}-\phi\varepsilon 𝗉𝗋𝖾𝖽(Σ,μ0)(η^𝖦𝖢𝖵)ϕγ^η^𝖦𝖢𝖵2σξ2+ϕε,\displaystyle\leq\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}})\leq\phi\widehat{\gamma}_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}}}^{2}-\sigma_{\xi}^{2}+\phi\varepsilon,
ϕminηΞLγ^η2σξ2ϕε\displaystyle\phi\min_{\eta\in\Xi_{L}}\widehat{\gamma}_{\eta}^{2}-\sigma_{\xi}^{2}-\phi\varepsilon minηΞL𝗉𝗋𝖾𝖽(Σ,μ0)(η)ϕminηΞLγ^η2σξ2+ϕε.\displaystyle\leq\min_{\eta\in\Xi_{L}}\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)\leq\phi\min_{\eta\in\Xi_{L}}\widehat{\gamma}_{\eta}^{2}-\sigma_{\xi}^{2}+\phi\varepsilon. (12.3)

Using the definition of η^𝖦𝖢𝖵\widehat{\eta}^{\operatorname{\mathsf{GCV}}} which gives γ^η^𝖦𝖢𝖵2=minηΞLγ^η2\widehat{\gamma}_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}}}^{2}=\min_{\eta\in\Xi_{L}}\widehat{\gamma}_{\eta}^{2}, the above two displays can be used to relate 𝗉𝗋𝖾𝖽(Σ,μ0)(η^𝖦𝖢𝖵)\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}}) and minηΞL𝗉𝗋𝖾𝖽(Σ,μ0)(η)\min_{\eta\in\Xi_{L}}\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta): on the event E0E_{0},

|𝗉𝗋𝖾𝖽(Σ,μ0)(η^𝖦𝖢𝖵)minηΞL𝗉𝗋𝖾𝖽(Σ,μ0)(η)|2ϕε.\displaystyle\big{\lvert}\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}})-\min_{\eta\in\Xi_{L}}\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)\big{\rvert}\leq 2\phi\varepsilon. (12.4)

As ηΞL\eta_{\ast}\in\Xi_{L}, minηΞL#(Σ,μ0)(η)=#(Σ,μ0)(η)\min_{\eta\in\Xi_{L}}\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)=\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta_{\ast}) for #{𝗉𝗋𝖾𝖽,𝖾𝗌𝗍,𝗂𝗇}\#\in\{\operatorname{\mathsf{pred}},\operatorname{\mathsf{est}},\operatorname{\mathsf{in}}\}. Consequently, by the second inequality in Proposition 11.3, we have on the event E0E_{0},

|η^𝖦𝖢𝖵η|Cμ0|𝗉𝗋𝖾𝖽(Σ,μ0)(η^𝖦𝖢𝖵)𝗉𝗋𝖾𝖽(Σ,μ0)(η)|1/2C1ε1/2.\displaystyle\lvert\widehat{\eta}^{\operatorname{\mathsf{GCV}}}-\eta_{\ast}\rvert\leq\frac{C}{\lVert\mu_{0}\rVert}\big{\lvert}\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}})-\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta_{\ast})\big{\rvert}^{1/2}\leq C_{1}\varepsilon^{1/2}. (12.5)

This means on E0E_{0}, for both #{𝖾𝗌𝗍,𝗂𝗇}\#\in\{\operatorname{\mathsf{est}},\operatorname{\mathsf{in}}\},

|#(Σ,μ0)(η^𝖦𝖢𝖵)minηΞL#(Σ,μ0)(η)|C2ε.\displaystyle\big{\lvert}\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}})-\min_{\eta\in\Xi_{L}}\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)\big{\rvert}\leq C_{2}\varepsilon.

We may conclude from here by virtue of Theorems 3.1 and 3.2, together with Lemma 11.4. ∎
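The sandwich step (12.2)-(12.4) in the proof above is a generic fact: if a surrogate is uniformly within $\varepsilon$ of $\phi^{-1}(\sigma_{\xi}^{2}+\text{risk})$, then minimizing the surrogate loses at most $2\phi\varepsilon$ in risk. The following is a minimal numerical sketch of that step only. The grid, the quadratic risk curve, and the values of `phi`, `sigma2` and `eps` are hypothetical choices made for illustration and are not taken from the paper.

```python
import random

random.seed(0)

# Hypothetical aspect ratio, noise level and uniform tolerance.
phi, sigma2, eps = 0.8, 1.0, 0.05

# A finite grid standing in for the tuning range Xi_L.
grid = [0.1 + 0.01 * i for i in range(200)]

# Hypothetical risk curve eta -> risk(eta); any function on the grid would do.
risk = [(eta - 1.0) ** 2 + 0.5 for eta in grid]

# Surrogate playing the role of gamma_hat^2: uniformly within eps of
# (sigma2 + risk) / phi, mimicking the uniform control in (12.2).
gamma_hat_sq = [(sigma2 + r) / phi + random.uniform(-eps, eps) for r in risk]

# GCV-type tuning: minimize the surrogate over the grid.
i_hat = min(range(len(grid)), key=gamma_hat_sq.__getitem__)
eta_hat = grid[i_hat]

# Excess risk of the tuned parameter relative to the best one on the grid.
excess = risk[i_hat] - min(risk)
print(f"eta_hat = {eta_hat:.2f}, excess = {excess:.4f}, bound = {2 * phi * eps:.4f}")
```

Whatever perturbation is drawn, the excess risk stays below `2 * phi * eps`, which is the content of (12.4).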

12.3. Proof of Theorem 4.3

Lemma 12.1.

Consider the following version of (2.1) with sample size $m-m_{\ell}$:

\[
\begin{cases}\frac{m-m_{\ell}}{n}\cdot\gamma^{2}=\sigma_{\xi}^{2}+\operatorname{\mathbb{E}}\operatorname{\mathsf{err}}_{(\Sigma,\mu_{0})}(\gamma;\tau),\\ \big(\frac{m-m_{\ell}}{n}-\frac{\eta}{\tau}\big)\cdot\gamma^{2}=\operatorname{\mathbb{E}}\operatorname{\mathsf{dof}}_{(\Sigma,\mu_{0})}(\gamma;\tau).\end{cases} \tag{12.6}
\]

  (1) The fixed point equation (12.6) admits a unique solution $(\gamma_{\eta,\ast}^{(\ell)},\tau_{\eta,\ast}^{(\ell)})\in(0,\infty)^{2}$, for all $(m,n)\in\mathbb{N}^{2}$ when $\eta>0$ and $m<n$ when $\eta=0$.

  (2) Further suppose $1/K\leq\phi^{-1},\sigma_{\xi}^{2}\leq K$, $m_{\ell}/n\leq 1/(2K)$ and $\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\vee\lVert\Sigma\rVert_{\operatorname{op}}\leq K$ for some $K>10$. Then there exists some $C=C(K)>1$ such that uniformly in $\eta\in\Xi_{K}$, $1/C\leq\gamma_{\eta,\ast}^{(\ell)},\tau_{\eta,\ast}^{(\ell)}\leq C$. Moreover,

\[
\lvert\gamma_{\eta,\ast}^{(\ell)}-\gamma_{\eta,\ast}\rvert\vee\lvert\tau_{\eta,\ast}^{(\ell)}-\tau_{\eta,\ast}\rvert\leq\frac{Cm_{\ell}}{n}.
\]
Proof.

All the constants in ,,\lesssim,\gtrsim,\asymp below may depend on KK. We only need to prove (2). The method of proof is similar to that of Proposition 8.1-(3). Instead of considering (12.6), we shall consider the system of equations

{ϕα=1γ2(σξ2+τ2(Σ+τI)1Σ1/2μ02)+1ntr((Σ+τI)2Σ2),ϕα=1ntr((Σ+τI)1Σ)+ητ,\displaystyle\begin{cases}\phi-\alpha=\frac{1}{\gamma^{2}}\big{(}\sigma_{\xi}^{2}+\tau^{2}\lVert(\Sigma+\tau I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}\big{)}+\frac{1}{n}\operatorname{tr}\big{(}(\Sigma+\tau I)^{-2}\Sigma^{2}\big{)},\\ \phi-\alpha=\frac{1}{n}\operatorname{tr}\big{(}(\Sigma+\tau I)^{-1}\Sigma\big{)}+\frac{\eta}{\tau},\end{cases} (12.7)

indexed by $\alpha\geq 0$. For $\alpha\in[0,1/(2K)]$, the solution $(\gamma_{\eta,\ast}(\alpha),\tau_{\eta,\ast}(\alpha))$ exists uniquely for $\eta>0$, and also for $\eta=0$ if additionally $m<n$. Moreover, using the a priori estimate in Proposition 8.1-(2), we have $\gamma_{\eta,\ast}(\alpha),\tau_{\eta,\ast}(\alpha)\asymp 1$ uniformly in $\eta\in\Xi_{K}$ and $\alpha\in[0,1/(2K)]$. Differentiating both sides of the second equation in (12.7) with respect to $\alpha$, we obtain

1=(n1tr((Σ+τη,(α)I)2Σ)+ητη,2(α))τη,(α).\displaystyle 1=\Big{(}n^{-1}\operatorname{tr}\big{(}(\Sigma+\tau_{\eta,\ast}(\alpha)I)^{-2}\Sigma\big{)}+{\eta}{\tau_{\eta,\ast}^{-2}(\alpha)}\Big{)}\cdot\tau_{\eta,\ast}^{\prime}(\alpha).

This means uniformly in ηΞK\eta\in\Xi_{K} and α[0,1/(2K)]\alpha\in[0,1/(2K)], τη,(α)1\tau_{\eta,\ast}^{\prime}(\alpha)\asymp 1. Next, using the first equation in (12.7), we obtain

γη,2(α)=σξ2+τη,2(α)(Σ+τη,(α)I)1Σ1/2μ02ϕα1ntr((Σ+τη,(α)I)2Σ2)G1,η(α)G2,η(α).\displaystyle\gamma_{\eta,\ast}^{2}(\alpha)=\frac{\sigma_{\xi}^{2}+\tau_{\eta,\ast}^{2}(\alpha)\lVert(\Sigma+\tau_{\eta,\ast}(\alpha)I)^{-1}\Sigma^{1/2}\mu_{0}\rVert^{2}}{\phi-\alpha-\frac{1}{n}\operatorname{tr}\big{(}(\Sigma+\tau_{\eta,\ast}(\alpha)I)^{-2}\Sigma^{2}\big{)}}\equiv\frac{G_{1,\eta}(\alpha)}{G_{2,\eta}(\alpha)}.

Using similar calculations as in (8.9)-(8.10), we have uniformly in ηΞK\eta\in\Xi_{K} and α[0,1/(2K)]\alpha\in[0,1/(2K)], G1,η(α),G2,η(α)1G_{1,\eta}(\alpha),G_{2,\eta}(\alpha)\asymp 1, and |G1,η(α)||G2,η(α)|1\lvert G_{1,\eta}^{\prime}(\alpha)\rvert\vee\lvert G_{2,\eta}^{\prime}(\alpha)\rvert\lesssim 1. This concludes the claim. ∎
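The derivative identity for $\tau_{\eta,\ast}(\alpha)$ used in the proof above can be sanity-checked numerically. The sketch below solves the second equation of (12.7) by bisection and compares a central finite difference in $\alpha$ with the closed-form expression for $\tau_{\eta,\ast}^{\prime}(\alpha)$. It assumes a diagonal $\Sigma$; the eigenvalues `eigs` and the values of `phi`, `eta`, `alpha` are hypothetical, and the helper names `trace_term` and `solve_tau` are introduced only for this illustration.

```python
import math

def trace_term(tau, eigs, power):
    """n^{-1} tr((Sigma + tau I)^{-power} Sigma) for a diagonal Sigma."""
    return sum(lam / (lam + tau) ** power for lam in eigs) / len(eigs)

def solve_tau(phi, alpha, eta, eigs, lo=1e-10, hi=1e10, iters=200):
    """Solve phi - alpha = n^{-1} tr((Sigma + tau I)^{-1} Sigma) + eta / tau.

    The right-hand side is strictly decreasing in tau, so geometric
    bisection on [lo, hi] finds the unique root (for eta > 0).
    """
    target = phi - alpha
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if trace_term(mid, eigs, 1) + eta / mid > target:
            lo = mid  # right-hand side too large: tau must increase
        else:
            hi = mid
    return math.sqrt(lo * hi)

# Hypothetical spectrum and parameters.
eigs = [0.5, 1.0, 2.0, 4.0]
phi, eta, alpha = 0.8, 0.5, 0.05

tau = solve_tau(phi, alpha, eta, eigs)

# Closed form: 1 = (n^{-1} tr((Sigma + tau I)^{-2} Sigma) + eta / tau^2) * tau'(alpha).
formula = 1.0 / (trace_term(tau, eigs, 2) + eta / tau ** 2)

# Central finite difference in alpha.
h = 1e-5
fd = (solve_tau(phi, alpha + h, eta, eigs) - solve_tau(phi, alpha - h, eta, eigs)) / (2 * h)

print(f"tau = {tau:.6f}, tau'(alpha): formula = {formula:.6f}, finite difference = {fd:.6f}")
```

Both quantities are positive and of order one in this example, in line with the claim $\tau_{\eta,\ast}^{\prime}(\alpha)\asymp 1$.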

Proof of Theorem 4.3.

All the constants in ,,\lesssim,\gtrsim,\asymp below may depend on K,LK,L.

As Y()X()μ^()η2=Z()Σ1/2(μ0μ^()η)+ξ()2\lVert Y^{(\ell)}-X^{(\ell)}\widehat{\mu}^{(\ell)}_{\eta}\rVert^{2}=\lVert Z^{(\ell)}\Sigma^{1/2}(\mu_{0}-\widehat{\mu}^{(\ell)}_{\eta})+\xi^{(\ell)}\rVert^{2} and μ^()η\widehat{\mu}^{(\ell)}_{\eta} is independent of (Z(),ξ())(Z^{(\ell)},\xi^{(\ell)}), by using Lemma B.3 first conditionally on (Z(),ξ())(Z^{(-\ell)},\xi^{(-\ell)}) and then further taking expectation over (Z(),ξ())(Z^{(-\ell)},\xi^{(-\ell)}), we have for 0<ϱ10<\varrho\leq 1,

(E0,c(η){|m1Y()X()μ^()η2(Σ1/2(μ^()ημ0)2+σξ2)|\displaystyle\operatorname{\mathbb{P}}\Big{(}E_{0,\ell}^{c}(\eta)\equiv\Big{\{}\big{\lvert}m_{\ell}^{-1}\lVert Y^{(\ell)}-X^{(\ell)}\widehat{\mu}^{(\ell)}_{\eta}\rVert^{2}-\big{(}\lVert\Sigma^{1/2}(\widehat{\mu}^{(\ell)}_{\eta}-\mu_{0})\rVert^{2}+\sigma_{\xi}^{2}\big{)}\big{\rvert}
C0(σξ2Σ1/2(μ^()ημ0)2)m(1ϱ)/2})C0emϱ/C0.\displaystyle\qquad\qquad\geq C_{0}\big{(}\sigma_{\xi}^{2}\vee\lVert\Sigma^{1/2}(\widehat{\mu}^{(\ell)}_{\eta}-\mu_{0})\rVert^{2}\big{)}m_{\ell}^{-(1-\varrho)/2}\Big{\}}\Big{)}\leq C_{0}e^{-m_{\ell}^{\varrho}/C_{0}}.

Here C0>0C_{0}>0 is a universal constant. Using similar arguments as in (11.3) (by noting that the normalization in μ^η()\widehat{\mu}_{\eta}^{(\ell)} is still nn), there exists some constant C1>0C_{1}>0 such that for any [k]\ell\in[k], on an event E1,E_{1,\ell} with (E1,c)C1em/C1\operatorname{\mathbb{P}}(E_{1,\ell}^{c})\leq C_{1}e^{-m_{\ell}/C_{1}}, supηΞLμ^()ηC1\sup_{\eta\in\Xi_{L}}\lVert\widehat{\mu}^{(\ell)}_{\eta}\rVert\leq C_{1}. This means that for any ηΞL\eta\in\Xi_{L}, on the event [k](E0,(η)E1,)\cap_{\ell\in[k]}(E_{0,\ell}(\eta)\cap E_{1,\ell}),

max[k]m(1ϱ)/2|m1Y()X()μ^()η2(Σ1/2(μ^()ημ0)2+σξ2)|C1.\displaystyle\max_{\ell\in[k]}m_{\ell}^{(1-\varrho)/2}\cdot\big{\lvert}m_{\ell}^{-1}\lVert Y^{(\ell)}-X^{(\ell)}\widehat{\mu}^{(\ell)}_{\eta}\rVert^{2}-\big{(}\lVert\Sigma^{1/2}(\widehat{\mu}^{(\ell)}_{\eta}-\mu_{0})\rVert^{2}+\sigma_{\xi}^{2}\big{)}\big{\rvert}\leq C_{1}^{\prime}. (12.8)

On the other hand, using Theorem 3.1, we may find some 𝒰ϑ;Bn(1)\mathcal{U}_{\vartheta;\ell}\subset B_{n}(1) with vol(𝒰ϑ;)/vol(Bn(1))1C2enϑ/C2\mathrm{vol}(\mathcal{U}_{\vartheta;\ell})/\mathrm{vol}(B_{n}(1))\geq 1-C_{2}e^{-n^{\vartheta}/C_{2}}, such that for ε(0,1/2]\varepsilon\in(0,1/2], on an event E2,(ε)E_{2,\ell}(\varepsilon) with (E2,c(ε))C2(nenε4/C2+εc0n1/6.5𝟏ZG)\operatorname{\mathbb{P}}(E_{2,\ell}^{c}(\varepsilon))\leq C_{2}(ne^{-n\varepsilon^{4}/C_{2}}+\varepsilon^{-c_{0}}n^{-1/6.5}\bm{1}_{Z\neq G}), for μ0𝒰ϑ;\mu_{0}\in\mathcal{U}_{\vartheta;\ell},

supηΞL|Σ1/2(μ^()ημ0)2{mmn(γη,())2σξ2}|ε.\displaystyle\sup_{\eta\in\Xi_{L}}\bigg{\lvert}\lVert\Sigma^{1/2}(\widehat{\mu}^{(\ell)}_{\eta}-\mu_{0})\rVert^{2}-\bigg{\{}\frac{m-m_{\ell}}{n}(\gamma_{\eta,\ast}^{(\ell)})^{2}-\sigma_{\xi}^{2}\bigg{\}}\bigg{\rvert}\leq\varepsilon.

Here γη,()\gamma_{\eta,\ast}^{(\ell)} is taken from Lemma 12.1, and we extend the definition to =0\ell=0 with μ^η(0)μ^η\widehat{\mu}_{\eta}^{(0)}\equiv\widehat{\mu}_{\eta} and γη,(0)γη,\gamma_{\eta,\ast}^{(0)}\equiv\gamma_{\eta,\ast}. Using the statement (2) of the same Lemma 12.1, on the event E2,(ε)E_{2,\ell}(\varepsilon), we then have

supηΞL|Σ1/2(μ^()ημ0)2{ϕγη,2σξ2}|ε+C2mn.\displaystyle\sup_{\eta\in\Xi_{L}}\big{\lvert}\lVert\Sigma^{1/2}(\widehat{\mu}^{(\ell)}_{\eta}-\mu_{0})\rVert^{2}-\big{\{}\phi\gamma_{\eta,\ast}^{2}-\sigma_{\xi}^{2}\big{\}}\big{\rvert}\leq\varepsilon+\frac{C_{2}m_{\ell}}{n}.

Replacing ϕγη,2σξ2\phi\gamma_{\eta,\ast}^{2}-\sigma_{\xi}^{2} by R𝗉𝗋𝖾𝖽(Σ,μ0)(η)=Σ1/2(μ^ημ0)2R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)=\lVert\Sigma^{1/2}(\widehat{\mu}_{\eta}-\mu_{0})\rVert^{2} yields that, on [0:k]E2,(ε)\cap_{\ell\in[0:k]}E_{2,\ell}(\varepsilon),

supηΞL|Σ1/2(μ^()ημ0)2R𝗉𝗋𝖾𝖽(Σ,μ0)(η)|2ε+C2mn.\displaystyle\sup_{\eta\in\Xi_{L}}\big{\lvert}\lVert\Sigma^{1/2}(\widehat{\mu}^{(\ell)}_{\eta}-\mu_{0})\rVert^{2}-R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)\big{\rvert}\leq 2\varepsilon+\frac{C_{2}m_{\ell}}{n}. (12.9)

Combining (12.8)-(12.9), for μ0𝒰ϑ[0:k]𝒰ϑ;\mu_{0}\in\mathcal{U}_{\vartheta}\equiv\cap_{\ell\in[0:k]}\mathcal{U}_{\vartheta;\ell}, ε(0,1/2]\varepsilon\in(0,1/2] and ηΞL\eta\in\Xi_{L},

(|R𝖢𝖵,k(Σ,μ0)(η)(R𝗉𝗋𝖾𝖽(Σ,μ0)(η)+σξ2)|C2{1k[k]1m(1ϱ)/2+1k+ε})\displaystyle\operatorname{\mathbb{P}}\bigg{(}\big{\lvert}R^{\operatorname{\mathsf{CV}},k}_{(\Sigma,\mu_{0})}(\eta)-\big{(}R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)+\sigma_{\xi}^{2}\big{)}\big{\rvert}\geq C_{2}^{\prime}\cdot\bigg{\{}\frac{1}{k}\sum_{\ell\in[k]}\frac{1}{m_{\ell}^{(1-\varrho)/2}}+\frac{1}{k}+\varepsilon\bigg{\}}\bigg{)}
C2{[k]emϱ/C0+knenε4/C2,Z=G;[k]emϱ/C0+εc0kn1/6.5,otherwise.\displaystyle\leq C_{2}^{\prime}\cdot\begin{cases}\sum_{\ell\in[k]}e^{-m_{\ell}^{\varrho}/C_{0}}+kne^{-n\varepsilon^{4}/C_{2}},&Z=G;\\ \sum_{\ell\in[k]}e^{-m_{\ell}^{\varrho}/C_{0}}+\varepsilon^{-c_{0}}\cdot kn^{-1/6.5},&\hbox{otherwise}.\end{cases} (12.10)

Now we strengthen the estimate (12.10) into a uniform version. It is easy to verify that on an event $E_{3,\ell}$ with $\operatorname{\mathbb{P}}(E_{3,\ell}^{c})\leq C_{3}e^{-m_{\ell}/C_{3}}$, $\lVert Z^{(\ell)}\rVert_{\operatorname{op}}\leq C_{3}(\sqrt{m_{\ell}}+\sqrt{n})$, $\lVert\xi^{(\ell)}\rVert\leq C_{3}\sqrt{m_{\ell}}$, and for $\eta_{1},\eta_{2}\in\Xi_{L}$, $\lVert\widehat{\mu}_{\eta_{1}}^{(\ell)}-\widehat{\mu}_{\eta_{2}}^{(\ell)}\rVert\leq C_{3}\lvert\eta_{1}-\eta_{2}\rvert$. So on $\cap_{\ell\in[k]}(E_{1,\ell}\cap E_{3,\ell})$, for $\eta_{1},\eta_{2}\in\Xi_{L}$,

|R𝖢𝖵,k(Σ,μ0)(η1)R𝖢𝖵,k(Σ,μ0)(η2)|\displaystyle\big{\lvert}R^{\operatorname{\mathsf{CV}},k}_{(\Sigma,\mu_{0})}(\eta_{1})-R^{\operatorname{\mathsf{CV}},k}_{(\Sigma,\mu_{0})}(\eta_{2})\big{\rvert} 1k[k]1m|Z()opμ^η1()μ^η2()(Z()op+ξ())|\displaystyle\lesssim\frac{1}{k}\sum_{\ell\in[k]}\frac{1}{m_{\ell}}\big{\lvert}\lVert Z^{(\ell)}\rVert_{\operatorname{op}}\lVert\widehat{\mu}_{\eta_{1}}^{(\ell)}-\widehat{\mu}_{\eta_{2}}^{(\ell)}\rVert\cdot\big{(}\lVert Z^{(\ell)}\rVert_{\operatorname{op}}+\lVert\xi^{(\ell)}\rVert\big{)}\big{\rvert}
1k[k]m+nm|η1η2|C31k[k]nm|η1η2|,\displaystyle\lesssim\frac{1}{k}\sum_{\ell\in[k]}\frac{m_{\ell}+n}{m_{\ell}}\cdot\lvert\eta_{1}-\eta_{2}\rvert\leq C_{3}^{\prime}\cdot\frac{1}{k}\sum_{\ell\in[k]}\frac{n}{m_{\ell}}\cdot\lvert\eta_{1}-\eta_{2}\rvert,

and

|R𝗉𝗋𝖾𝖽(Σ,μ0)(η1)R𝗉𝗋𝖾𝖽(Σ,μ0)(η2)|C3|η1η2|.\displaystyle\big{\lvert}R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta_{1})-R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta_{2})\big{\rvert}\leq C_{3}^{\prime}\lvert\eta_{1}-\eta_{2}\rvert.

From here, using (i) the pointwise estimate (12.10) along with a discretization and union bound that strengthens it to a uniform control, and (ii) Theorem 3.1, which replaces $R^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)$ by $\bar{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)$, we obtain for $\mu_{0}\in\mathcal{U}_{\vartheta}$ and $\varepsilon\in(0,1/2]$,

(supηΞL|R𝖢𝖵,k(Σ,μ0)(η)(R¯𝗉𝗋𝖾𝖽(Σ,μ0)(η)+σξ2)|C3{1k[k]1m(1ϱ)/2+1k+ε}ε{m})\displaystyle\operatorname{\mathbb{P}}\bigg{(}\sup_{\eta\in\Xi_{L}}\big{\lvert}R^{\operatorname{\mathsf{CV}},k}_{(\Sigma,\mu_{0})}(\eta)-\big{(}\bar{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)+\sigma_{\xi}^{2}\big{)}\big{\rvert}\geq C_{3}^{\prime\prime}\cdot\bigg{\{}\frac{1}{k}\sum_{\ell\in[k]}\frac{1}{m_{\ell}^{(1-\varrho)/2}}+\frac{1}{k}+\varepsilon\bigg{\}}\equiv\varepsilon_{\{m_{\ell}\}}\bigg{)}
𝔭0C3εk[k]nm{[k]emϱ/C0+knenε4/C2,Z=G;[k]emϱ/C0+εc0kn1/6.5,otherwise.\displaystyle\leq\mathfrak{p}_{0}\equiv\frac{C_{3}^{\prime\prime}}{\varepsilon k}\sum_{\ell\in[k]}\frac{n}{m_{\ell}}\cdot\begin{cases}\sum_{\ell\in[k]}e^{-m_{\ell}^{\varrho}/C_{0}}+kne^{-n\varepsilon^{4}/C_{2}},&Z=G;\\ \sum_{\ell\in[k]}e^{-m_{\ell}^{\varrho}/C_{0}}+\varepsilon^{-c_{0}}\cdot kn^{-1/6.5},&\hbox{otherwise}.\end{cases}

Now with the same 𝒲ε,δ,ϑ\mathcal{W}_{\varepsilon,\delta,\vartheta} as in (11.14) using εεnnϑ\varepsilon\equiv\varepsilon_{n}\equiv n^{-\vartheta}, for any μ0𝒰ϑ𝒲ε,δ,ϑ\mu_{0}\in\mathcal{U}_{\vartheta}\cap\mathcal{W}_{\varepsilon,\delta,\vartheta}, we may further replace R¯𝗉𝗋𝖾𝖽(Σ,μ0)(η)\bar{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta) by 𝗉𝗋𝖾𝖽(Σ,μ0)(η)\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta) in the above display (with a possibly slightly larger ε{m}\varepsilon_{\{m_{\ell}\}}, but for notational simplicity we abuse this notation). In summary, for any μ0𝒰ϑ𝒲ε,δ,ϑ\mu_{0}\in\mathcal{U}_{\vartheta}\cap\mathcal{W}_{\varepsilon,\delta,\vartheta}, on an event E4E_{4} with (E4c)𝔭0\operatorname{\mathbb{P}}(E_{4}^{c})\leq\mathfrak{p}_{0},

supηΞL|R𝖢𝖵,k(Σ,μ0)(η)(𝗉𝗋𝖾𝖽(Σ,μ0)(η)+σξ2)|ε{m}.\displaystyle\sup_{\eta\in\Xi_{L}}\big{\lvert}R^{\operatorname{\mathsf{CV}},k}_{(\Sigma,\mu_{0})}(\eta)-\big{(}\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)+\sigma_{\xi}^{2}\big{)}\big{\rvert}\leq\varepsilon_{\{m_{\ell}\}}.

From here, using similar arguments as in (12.2)-(12.4), on the event E4E_{4},

|𝗉𝗋𝖾𝖽(Σ,μ0)(η^𝖢𝖵)minηΞL𝗉𝗋𝖾𝖽(Σ,μ0)(η)|2ε{m}.\displaystyle\big{\lvert}\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\operatorname{\mathsf{CV}}})-\min_{\eta\in\Xi_{L}}\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)\big{\rvert}\leq 2\varepsilon_{\{m_{\ell}\}}.

Similar to (12.5), on the event E4E_{4}, we have

|η^𝖢𝖵η|C4ε{m}1/2.\displaystyle\lvert\widehat{\eta}^{\operatorname{\mathsf{CV}}}-\eta_{\ast}\rvert\leq C_{4}\cdot\varepsilon_{\{m_{\ell}\}}^{1/2}. (12.11)

From here we may argue along the same lines as those following (12.5) in the proof of Theorem 4.2 to conclude with probability estimated at 𝔭0\mathfrak{p}_{0}, by further noting that 𝒰ϑ𝒲ε,δ,ϑ\mathcal{U}_{\vartheta}\cap\mathcal{W}_{\varepsilon,\delta,\vartheta} satisfies the desired volume estimate. Under the further condition min[k]mlog2/δm\min_{\ell\in[k]}m_{\ell}\geq\log^{2/\delta}m, by taking ϱ=δ\varrho=\delta, 𝔭0\mathfrak{p}_{0} simplifies as indicated in the statement of the theorem for nn large. ∎

12.4. Proof of Theorem 4.4

We only prove the case for #=𝖦𝖢𝖵\#=\operatorname{\mathsf{GCV}}; the other case is similar. All constants in ,,\lesssim,\gtrsim,\asymp and 𝒪\mathcal{O} may possibly depend on K,LK,L. Let 𝒲ε,δ,ϑBn(1)Bn(δ)\mathcal{W}_{\varepsilon,\delta,\vartheta}\subset B_{n}(1)\setminus B_{n}(\delta) be as constructed in (11.14) with εεnnϑ\varepsilon\equiv\varepsilon_{n}\equiv n^{-\vartheta}.

(1). We first prove the statement for the length of the CI. Note that |CIj(η)|=2γ^η(Σ1)jj1/2zα/2/n|\mathrm{CI}_{j}(\eta)|=2\widehat{\gamma}_{\eta}(\Sigma^{-1})_{jj}^{1/2}z_{\alpha/2}/\sqrt{n}. By Theorem 4.1-(2), on an event E0E_{0} with the probability indicated therein,

maxj[n]supηΞL||CIj(η)|2γη,(Σ1)jj1/2zα/2n|2Σ1op1/2zα/2nsupηΞL|γ^ηγη,|zα/2nε.\displaystyle\max_{j\in[n]}\sup_{\eta\in\Xi_{L}}\bigg{\lvert}|\mathrm{CI}_{j}(\eta)|-2\gamma_{\eta,\ast}(\Sigma^{-1})_{jj}^{1/2}\frac{z_{\alpha/2}}{\sqrt{n}}\bigg{\rvert}\leq\frac{2\lVert\Sigma^{-1}\rVert_{\operatorname{op}}^{1/2}z_{\alpha/2}}{\sqrt{n}}\sup_{\eta\in\Xi_{L}}\lvert\widehat{\gamma}_{\eta}-\gamma_{\eta,\ast}\rvert\lesssim\frac{z_{\alpha/2}}{\sqrt{n}}\cdot\varepsilon.

Consequently, on the event $E_{0}$, for any $\mu_{0}\in\mathcal{W}_{\varepsilon,\delta,\vartheta}$,

\begin{align*}
&\sqrt{n}z_{\alpha/2}^{-1}\cdot\max_{j\in[n]}\big\lvert|\mathrm{CI}_{j}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}})|-\min_{\eta\in\Xi_{L}}|\mathrm{CI}_{j}(\eta)|\big\rvert\\
&\lesssim\lvert\gamma_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}},\ast}-\min_{\eta\in\Xi_{L}}\gamma_{\eta,\ast}\rvert+\varepsilon\\
&\lesssim\lvert\gamma_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}},\ast}^{2}-\min_{\eta\in\Xi_{L}}\gamma_{\eta,\ast}^{2}\rvert+\varepsilon\quad(\hbox{using Proposition 8.1-(3)})\\
&\asymp\big\lvert\bar{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}})-\min_{\eta\in\Xi_{L}}\bar{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)\big\rvert+\varepsilon\quad(\hbox{using the definition of $\gamma_{\eta,\ast}^{2}$})\\
&\lesssim\big\lvert\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}})-\min_{\eta\in\Xi_{L}}\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta)\big\rvert+\varepsilon\quad(\hbox{using Theorem 3.2}).
\end{align*}

As in the proof of Theorem 4.2, for σξ2K\sigma_{\xi}^{2}\leq K, η=𝖲𝖭𝖱μ01ΞL\eta_{\ast}=\operatorname{\mathsf{SNR}}_{\mu_{0}}^{-1}\in\Xi_{L}, so by using Proposition 11.3-(2), on the event E0E_{0}, for any μ0𝒲ε,δ,ϑ\mu_{0}\in\mathcal{W}_{\varepsilon,\delta,\vartheta},

nzα/21maxj[n]||CIj(η^𝖦𝖢𝖵)|minηΞL|CIj(η)||\displaystyle\sqrt{n}z_{\alpha/2}^{-1}\cdot\max_{j\in[n]}\big{\lvert}|\mathrm{CI}_{j}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}})|-\min_{\eta\in\Xi_{L}}|\mathrm{CI}_{j}(\eta)|\big{\rvert}
|𝗉𝗋𝖾𝖽(Σ,μ0)(η^𝖦𝖢𝖵)𝗉𝗋𝖾𝖽(Σ,μ0)(η)|+ε|η^𝖦𝖢𝖵η|2+ε.\displaystyle\lesssim\lvert\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}})-\mathscr{R}^{\operatorname{\mathsf{pred}}}_{(\Sigma,\mu_{0})}(\eta_{\ast})\rvert+\varepsilon\lesssim\lvert\widehat{\eta}^{\operatorname{\mathsf{GCV}}}-\eta_{\ast}\rvert^{2}+\varepsilon.

The above reasoning also proves that on the same event E0E_{0}, for any μ0𝒲ε,δ,ϑ\mu_{0}\in\mathcal{W}_{\varepsilon,\delta,\vartheta},

|γη^𝖦𝖢𝖵,γη,||η^𝖦𝖢𝖵η|2+ε.\displaystyle\lvert\gamma_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}},\ast}-\gamma_{\eta_{\ast},\ast}\rvert\lesssim\lvert\widehat{\eta}^{\operatorname{\mathsf{GCV}}}-\eta_{\ast}\rvert^{2}+\varepsilon.

From here, in view of (12.5), by adjusting constants, on an event E1E_{1} with (E1c)C1n1/7\operatorname{\mathbb{P}}(E_{1}^{c})\leq C_{1}n^{-1/7}, it holds that

nzα/21maxj[n]||CIj(η^𝖦𝖢𝖵)|minηΞL|CIj(η)||\displaystyle\sqrt{n}z_{\alpha/2}^{-1}\cdot\max_{j\in[n]}\big{\lvert}|\mathrm{CI}_{j}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}})|-\min_{\eta\in\Xi_{L}}|\mathrm{CI}_{j}(\eta)|\big{\rvert}
|γη^𝖦𝖢𝖵,γη,||η^𝖦𝖢𝖵η|2ε.\displaystyle\qquad\qquad\vee\lvert\gamma_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}},\ast}-\gamma_{\eta_{\ast},\ast}\rvert\vee\lvert\widehat{\eta}^{\operatorname{\mathsf{GCV}}}-\eta_{\ast}\rvert^{2}\leq\varepsilon. (12.12)

This proves the claim for the length of the CI.

(2). Next we prove the statement for the coverage. We note that a similar Lipschitz continuity argument as in the proof of Lemma 11.4 shows that for any 11-Lipschitz 𝗀:n\mathsf{g}:\mathbb{R}^{n}\to\mathbb{R}, on an event E2(𝗀)E_{2}(\mathsf{g}) with (E2(𝗀)c)Cn1/7\operatorname{\mathbb{P}}(E_{2}(\mathsf{g})^{c})\leq Cn^{-1/7},

supηΞL|𝗀(μ^𝖽𝖱η)𝔼𝗀(μ0+γη,Σ1/2g/n)|ε.\displaystyle\sup_{\eta\in\Xi_{L}}\big{\lvert}\mathsf{g}(\widehat{\mu}^{\operatorname{\mathsf{dR}}}_{\eta})-\operatorname{\mathbb{E}}\mathsf{g}\big{(}\mu_{0}+\gamma_{\eta,\ast}\Sigma^{-1/2}g/\sqrt{n}\big{)}\big{\rvert}\leq\varepsilon. (12.13)

On the other hand, using the Lipschitz continuity of $\eta\mapsto\tau_{\eta,\ast}$ in Proposition 8.1-(3),

\begin{align*}
\big\lvert\mathsf{g}(\widehat{\mu}^{\operatorname{\mathsf{dR}}}_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}}})-\mathsf{g}(\widehat{\mu}^{\operatorname{\mathsf{dR}}}_{\eta_{\ast}})\big\rvert&\leq\lVert\widehat{\mu}^{\operatorname{\mathsf{dR}}}_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}}}-\widehat{\mu}^{\operatorname{\mathsf{dR}}}_{\eta_{\ast}}\rVert\\
&\lesssim\lvert\tau_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}},\ast}-\tau_{\eta_{\ast},\ast}\rvert\sup_{\eta\in\Xi_{L}}\lVert\widehat{\mu}_{\eta}\rVert+\lVert\widehat{\mu}_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}}}-\widehat{\mu}_{\eta_{\ast}}\rVert\\
&\lesssim\lvert\widehat{\eta}^{\operatorname{\mathsf{GCV}}}-\eta_{\ast}\rvert\sup_{\eta\in\Xi_{L}}\lVert\widehat{\mu}_{\eta}\rVert+\lVert\widehat{\mu}_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}}}-\widehat{\mu}_{\eta_{\ast}}\rVert.
\end{align*}

So by enlarging C1C_{1} if necessary, we may assume without loss of generality that on E1E2(𝗀)E_{1}\cap E_{2}(\mathsf{g}), we have

|𝗀(μ^𝖽𝖱η^𝖦𝖢𝖵)𝗀(μ^𝖽𝖱η)|C1ε1/2.\displaystyle\big{\lvert}\mathsf{g}(\widehat{\mu}^{\operatorname{\mathsf{dR}}}_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}}})-\mathsf{g}(\widehat{\mu}^{\operatorname{\mathsf{dR}}}_{\eta_{\ast}})\big{\rvert}\leq C_{1}\varepsilon^{1/2}. (12.14)

Now we make a good choice of $\mathsf{g}$ in (12.13). Let $\Delta\in(0,1)$ and let $\mathsf{g}_{0,\Delta}:\mathbb{R}\to[0,1]$ be a function such that $\mathsf{g}_{0,\Delta}=1$ on $[-1,1]$, $\mathsf{g}_{0,\Delta}=0$ on $\mathbb{R}\setminus(-1-\Delta,1+\Delta)$, and $\mathsf{g}_{0,\Delta}$ is linearly interpolated on $(-1-\Delta,-1)\cup(1,1+\Delta)$. Let

𝗀(u)Δnj=1n𝗀0,Δ(ujμ0,j(γη,+ε)(Σ1)jj1/2zα/2/n).\displaystyle\mathsf{g}(u)\equiv\frac{\Delta}{n}\sum_{j=1}^{n}\mathsf{g}_{0,\Delta}\bigg{(}\frac{u_{j}-\mu_{0,j}}{(\gamma_{\eta_{\ast},\ast}+\varepsilon)(\Sigma^{-1})_{jj}^{1/2}z_{\alpha/2}/\sqrt{n}}\bigg{)}. (12.15)

It is easy to verify the Lipschitz property of $\mathsf{g}$: for any $u_{1},u_{2}\in\mathbb{R}^{n}$, $\lvert\mathsf{g}(u_{1})-\mathsf{g}(u_{2})\rvert\lesssim n^{-1/2}\Delta\lVert\mathsf{g}_{0,\Delta}\rVert_{\mathrm{Lip}}\sum_{j=1}^{n}\lvert u_{1,j}-u_{2,j}\rvert\lesssim\lVert u_{1}-u_{2}\rVert$. Consequently, we may apply (12.13) with $\mathsf{g}$ defined in (12.15) to obtain that on the event $E_{1}\cap E_{2}(\mathsf{g})$,

\begin{align*}
\mathscr{C}^{\operatorname{\mathsf{dR}}}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}})&=\frac{1}{n}\sum_{j=1}^{n}\bm{1}\Big(\widehat{\mu}^{\operatorname{\mathsf{dR}}}_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}},j}\in\Big[\mu_{0,j}\pm\widehat{\gamma}_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}}}(\Sigma^{-1})^{1/2}_{jj}\frac{z_{\alpha/2}}{\sqrt{n}}\Big]\Big)\\
&\leq\frac{1}{n}\sum_{j=1}^{n}\bm{1}\Big(\widehat{\mu}^{\operatorname{\mathsf{dR}}}_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}},j}\in\Big[\mu_{0,j}\pm(\gamma_{\eta_{\ast},\ast}+\varepsilon)(\Sigma^{-1})^{1/2}_{jj}\frac{z_{\alpha/2}}{\sqrt{n}}\Big]\Big)\\
&\leq\Delta^{-1}\cdot\mathsf{g}\big(\widehat{\mu}^{\operatorname{\mathsf{dR}}}_{\widehat{\eta}^{\operatorname{\mathsf{GCV}}}}\big)\quad\hbox{(using $\bm{1}_{[-1,1]}\leq\mathsf{g}_{0,\Delta}$)}\\
&\leq\Delta^{-1}\cdot\mathsf{g}\big(\widehat{\mu}^{\operatorname{\mathsf{dR}}}_{\eta_{\ast}}\big)+\mathcal{O}(\varepsilon^{1/2}/\Delta)\quad\hbox{(by (12.14))}\\
&\leq\Delta^{-1}\cdot\operatorname{\mathbb{E}}\mathsf{g}\big(\mu_{0}+\gamma_{\eta_{\ast},\ast}\Sigma^{-1/2}g/\sqrt{n}\big)+\mathcal{O}(\varepsilon^{1/2}/\Delta). \tag{12.16}
\end{align*}

Now using 𝗀0,Δ𝟏[1Δ,1+Δ]\mathsf{g}_{0,\Delta}\leq\bm{1}_{[-1-\Delta,1+\Delta]} and the anti-concentration of the standard normal random variable, we may compute

Δ1𝔼𝗀(μ0+γη,Σ1/2g/n)=𝔼𝗀0,Δ(γη,γη,+εgzα/2)\displaystyle\Delta^{-1}\cdot\operatorname{\mathbb{E}}\mathsf{g}\big{(}\mu_{0}+\gamma_{\eta_{\ast},\ast}\Sigma^{-1/2}g/\sqrt{n}\big{)}=\operatorname{\mathbb{E}}\mathsf{g}_{0,\Delta}\bigg{(}\frac{\gamma_{\eta_{\ast},\ast}}{\gamma_{\eta_{\ast},\ast}+\varepsilon}\cdot\frac{g}{z_{\alpha/2}}\bigg{)}
(𝒩(0,1)[±zα/2(1+ε/γη,)(1+Δ)])1α+𝒪(ε+Δ).\displaystyle\leq\operatorname{\mathbb{P}}\Big{(}\mathcal{N}(0,1)\in\Big{[}\pm z_{\alpha/2}\cdot\big{(}1+{\varepsilon}/{\gamma_{\eta_{\ast},\ast}}\big{)}\cdot(1+\Delta)\Big{]}\Big{)}\leq 1-\alpha+\mathcal{O}(\varepsilon+\Delta). (12.17)

Combining the above two displays (12.16)-(12.17), on the event $E_{1}\cap E_{2}(\mathsf{g})$,

𝒞𝖽𝖱(η^𝖦𝖢𝖵)1α+𝒪(ε+Δ+ε1/2/Δ).\displaystyle\mathscr{C}^{\operatorname{\mathsf{dR}}}(\widehat{\eta}^{\operatorname{\mathsf{GCV}}})\leq 1-\alpha+\mathcal{O}(\varepsilon+\Delta+\varepsilon^{1/2}/\Delta).

Finally, choosing $\Delta=\varepsilon^{1/4}$ yields the upper control. The lower control can be proved similarly, so we omit the details. ∎
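Two elementary ingredients of the coverage argument above can be checked numerically: the trapezoidal function $\mathsf{g}_{0,\Delta}$ is sandwiched between the indicators of $[-1,1]$ and $[-1-\Delta,1+\Delta]$ and is $\Delta^{-1}$-Lipschitz, and the choice $\Delta=\varepsilon^{1/4}$ balances the error terms $\Delta$ and $\varepsilon^{1/2}/\Delta$. The sketch below is illustrative only; the function names `g0` and `error_terms`, the grid, and the value of `eps` are hypothetical.

```python
def g0(x, delta):
    """Trapezoid: 1 on [-1, 1], 0 outside (-1-delta, 1+delta), linear in between."""
    a = abs(x)
    if a <= 1.0:
        return 1.0
    if a >= 1.0 + delta:
        return 0.0
    return (1.0 + delta - a) / delta

def error_terms(eps, delta):
    """Total error eps + delta + eps^{1/2} / delta from the coverage bound."""
    return eps + delta + eps ** 0.5 / delta

eps = 1e-4
delta = eps ** 0.25  # the balancing choice Delta = eps^{1/4}

# Check the sandwich 1_{[-1,1]} <= g0 <= 1_{[-1-delta, 1+delta]} on a grid.
xs = [-2.0 + 0.001 * i for i in range(4001)]
sandwich_ok = all(
    (1.0 if abs(x) <= 1.0 else 0.0)
    <= g0(x, delta)
    <= (1.0 if abs(x) <= 1.0 + delta else 0.0)
    for x in xs
)

# Check the 1/delta Lipschitz bound on neighbouring grid points.
lipschitz_ok = all(
    abs(g0(x, delta) - g0(y, delta)) <= abs(x - y) / delta + 1e-12
    for x, y in zip(xs, xs[1:])
)

# Compare the balanced choice against a smaller and a larger Delta.
candidates = {name: error_terms(eps, eps ** p)
              for name, p in [("eps^(1/2)", 0.5), ("eps^(1/4)", 0.25), ("eps^(1/10)", 0.1)]}
print(sandwich_ok, lipschitz_ok, candidates)
```

With $\Delta=\varepsilon^{1/4}$ the two competing terms coincide, so the total error is of order $\varepsilon^{1/4}$; the smaller and larger candidates in the comparison both give a larger total.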

Appendix A Further results on $\operatorname{\mathsf{OPT}}^{\#}_{(\Sigma,\mu_{0})}$

For #{𝗉𝗋𝖾𝖽,𝖾𝗌𝗍,𝗂𝗇}\#\in\{\operatorname{\mathsf{pred}},\operatorname{\mathsf{est}},\operatorname{\mathsf{in}}\}, recall #(Σ,μ0)(η)\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta) defined in Theorem 3.2, and the optimally tuned risks 𝖮𝖯𝖳#(Σ,μ0)\operatorname{\mathsf{OPT}}^{\#}_{(\Sigma,\mu_{0})} defined as 𝖮𝖯𝖳#(Σ,μ0)=minη0#(Σ,μ0)(η)/σξ2=#(Σ,μ0)(η)/σξ2\operatorname{\mathsf{OPT}}^{\#}_{(\Sigma,\mu_{0})}=\min_{\eta\geq 0}\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta)/\sigma_{\xi}^{2}=\mathscr{R}^{\#}_{(\Sigma,\mu_{0})}(\eta_{\ast})/\sigma_{\xi}^{2} with η=𝖲𝖭𝖱μ01\eta_{\ast}=\operatorname{\mathsf{SNR}}_{\mu_{0}}^{-1}.

Proposition A.1.

We have

\[
\mathsf{OPT}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}=\frac{\phi\tau_{\eta_{\ast},\ast}}{\eta_{\ast}}-1,\quad\mathsf{OPT}^{\mathsf{est}}_{(\Sigma,\mu_{0})}=\frac{1-\phi}{\eta_{\ast}}+\frac{1}{\tau_{\eta_{\ast},\ast}},\quad\mathsf{OPT}^{\mathsf{in}}_{(\Sigma,\mu_{0})}=-\frac{\eta_{\ast}}{\tau_{\eta_{\ast},\ast}}+\phi.
\]

Consequently,

  (1) $\mathsf{OPT}^{\mathsf{est}}_{(\Sigma,\mu_{0})}=\mathsf{SNR}_{\mu_{0}}\Big(1-\phi+\frac{\phi}{\mathsf{OPT}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}+1}\Big)$, and $\mathsf{OPT}^{\mathsf{in}}_{(\Sigma,\mu_{0})}=\phi\cdot\frac{\mathsf{OPT}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}}{\mathsf{OPT}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}+1}$;

  (2) $\partial_{\phi}\mathsf{OPT}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}\leq 0$, $\partial_{\phi}\mathsf{OPT}^{\mathsf{est}}_{(\Sigma,\mu_{0})}\leq 0$, and $\partial_{\phi}\mathsf{OPT}^{\mathsf{in}}_{(\Sigma,\mu_{0})}\geq 0$.

Proof.

We only need to verify (2). This follows from the formula for $\mathsf{OPT}^{\#}_{(\Sigma,\mu_{0})}$ and the following simple consequences of the second equation of (2.1):

  • $\partial_{\phi}\tau_{\eta_{\ast},\ast}\leq 0$,

  • $\partial_{\phi}(\phi\tau_{\eta_{\ast},\ast})=n^{-1}\operatorname{tr}\big((\Sigma+\tau_{\eta_{\ast},\ast}I)^{-2}\Sigma^{2}\big)\cdot\partial_{\phi}\tau_{\eta_{\ast},\ast}\leq 0$,

  • $-\eta_{\ast}^{-1}+\partial_{\phi}\tau_{\eta_{\ast},\ast}^{-1}=\eta_{\ast}^{-1}n^{-1}\operatorname{tr}\big((\Sigma+\tau_{\eta_{\ast},\ast}I)^{-2}\Sigma\big)\cdot\partial_{\phi}\tau_{\eta_{\ast},\ast}\leq 0$.

The proof is complete. ∎

The first claim in (1) above provides a non-asymptotic version of [DW18, Corollary 2.2], while the first claim in (2) gives a non-asymptotic analogue of the monotonicity result obtained in [PD23, Theorem 6]. Interestingly, (2) also asserts the monotonicity of $\mathsf{OPT}^{\mathsf{est}}_{(\Sigma,\mu_{0})}$ and $\mathsf{OPT}^{\mathsf{in}}_{(\Sigma,\mu_{0})}$ with respect to $\phi^{-1}$.
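For completeness, the two identities in (1) can be checked by direct substitution. The following is a short verification sketch that uses only the three formulas displayed in Proposition A.1 together with $\eta_{\ast}=\mathsf{SNR}_{\mu_{0}}^{-1}$; it introduces no new notation.

```latex
\begin{align*}
\mathsf{OPT}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}+1
  &= \frac{\phi\tau_{\eta_{\ast},\ast}}{\eta_{\ast}}
  \quad\Longrightarrow\quad
  \frac{\phi}{\mathsf{OPT}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}+1}
  = \frac{\eta_{\ast}}{\tau_{\eta_{\ast},\ast}},\\
\mathsf{OPT}^{\mathsf{est}}_{(\Sigma,\mu_{0})}
  &= \frac{1}{\eta_{\ast}}\Big(1-\phi+\frac{\eta_{\ast}}{\tau_{\eta_{\ast},\ast}}\Big)
  = \mathsf{SNR}_{\mu_{0}}\Big(1-\phi+\frac{\phi}{\mathsf{OPT}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}+1}\Big),\\
\mathsf{OPT}^{\mathsf{in}}_{(\Sigma,\mu_{0})}
  &= \phi-\frac{\eta_{\ast}}{\tau_{\eta_{\ast},\ast}}
  = \phi-\frac{\phi}{\mathsf{OPT}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}+1}
  = \phi\cdot\frac{\mathsf{OPT}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}}{\mathsf{OPT}^{\mathsf{pred}}_{(\Sigma,\mu_{0})}+1}.
\end{align*}
```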

Appendix B Auxiliary results

Proposition B.1.

Let $H:\mathbb{R}^{n}\to\mathbb{R}_{\geq 0}$ be a non-negative, differentiable function. Suppose there exists some deterministic $\Gamma>0$ such that $\lVert\nabla H(g)\rVert^{2}\leq\Gamma^{2}H(g)$ almost surely for $g\sim\mathcal{N}(0,I_{n})$. Then there exists some universal constant $C>0$ such that for all $t\geq 0$,

\[
\mathbb{P}\Big(\lvert H(g)-\mathbb{E}H(g)\rvert/C\geq\Gamma\,\mathbb{E}^{1/2}H(g)\cdot\sqrt{t}+\Gamma^{2}\cdot t\Big)\leq Ce^{-t/C}.
\]
Proof.

The method of proof via the Gaussian log-Sobolev inequality and Herbst's argument is well known. We give some details for the convenience of the reader. Let $Z\equiv H(g)-\mathbb{E}H(g)$ be the centered version of $H$, and $G(g)\equiv\lambda Z=\lambda(H(g)-\mathbb{E}H(g))$. Then $\lVert\nabla G(g)\rVert^{2}=\lambda^{2}\lVert\nabla H(g)\rVert^{2}\leq\lambda^{2}\Gamma^{2}\cdot H(g)=\lambda^{2}\Gamma^{2}\cdot\big(Z+\mathbb{E}H(g)\big)$. By the Gaussian log-Sobolev inequality (see e.g., [BLM13, Theorem 5.4], or [GN16, Theorem 2.5.6]),

\[
\mathrm{Ent}(e^{\lambda Z})=\mathbb{E}[\lambda Ze^{\lambda Z}]-\mathbb{E}e^{\lambda Z}\log\mathbb{E}e^{\lambda Z}\leq\frac{1}{2}\mathbb{E}\big[\lambda^{2}\Gamma^{2}\big(Z+\mathbb{E}H(g)\big)e^{\lambda Z}\big].
\]

With $m_{Z}(\lambda)\equiv\mathbb{E}e^{\lambda Z}$ denoting the moment generating function of $Z$, the above inequality is equivalent to

\[
\lambda m_{Z}^{\prime}(\lambda)-m_{Z}(\lambda)\log m_{Z}(\lambda)\leq\frac{\Gamma^{2}\lambda^{2}}{2}\Big(m_{Z}^{\prime}(\lambda)+\mathbb{E}H(g)\cdot m_{Z}(\lambda)\Big).
\]

Now dividing both sides of the above display by $\lambda^{2}m_{Z}(\lambda)$, we have $\big(\log m_{Z}(\lambda)/\lambda\big)^{\prime}\leq\frac{\Gamma^{2}}{2}\big(\log m_{Z}(\lambda)+\lambda\,\mathbb{E}H(g)\big)^{\prime}$. Integrating both sides with the conditions $\lim_{\lambda\downarrow 0}(\log m_{Z}(\lambda)/\lambda)=0$ and $\log m_{Z}(0)=0$, we arrive at $\log m_{Z}(\lambda)\leq\frac{\Gamma^{2}}{2}\big(\lambda\log m_{Z}(\lambda)+\lambda^{2}\,\mathbb{E}H(g)\big)$. Solving for $\log m_{Z}(\lambda)$ and using the standard method to convert to a tail bound yields the claimed inequality. ∎
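To make the last step of the proof concrete, here is one way the "standard method" can be carried out for the upper tail. This is only a sketch under the assumption $0<\lambda<2/\Gamma^{2}$; the shorthand $v$ (variance factor) and $c$ (scale) is ours and does not appear in the paper, and the second line relies on the usual tail bound for sub-gamma random variables (see e.g. the discussion of sub-gamma variables in [BLM13, Chapter 2]). The lower tail requires an analogous argument with $\lambda<0$ and is not covered here.

```latex
\begin{align*}
\Big(1-\tfrac{\Gamma^{2}\lambda}{2}\Big)\log m_{Z}(\lambda)
  \leq \tfrac{\Gamma^{2}\lambda^{2}}{2}\,\mathbb{E}H(g)
  \quad&\Longrightarrow\quad
  \log m_{Z}(\lambda)\leq\frac{v\lambda^{2}}{2(1-c\lambda)},
  \qquad v\equiv\Gamma^{2}\,\mathbb{E}H(g),\quad c\equiv\Gamma^{2}/2,\\
\mathbb{P}\big(Z\geq\sqrt{2vt}+ct\big)\leq e^{-t}
  \quad&\text{i.e.}\quad
  \mathbb{P}\Big(H(g)-\mathbb{E}H(g)\geq\Gamma\,\mathbb{E}^{1/2}H(g)\sqrt{2t}+\tfrac{\Gamma^{2}}{2}\,t\Big)\leq e^{-t}.
\end{align*}
```

The resulting bound has the same shape as the one claimed in Proposition B.1, with explicit constants on the upper tail.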

Lemma B.2.

Let $\Sigma\in\mathbb{R}^{n\times n}$ be an invertible covariance matrix with $\lVert\Sigma\rVert_{\operatorname{op}}\vee\lVert\Sigma^{-1}\rVert_{\operatorname{op}}\leq K$ for some $K>0$. Then for any $q\in(0,\infty)$, there exists some $C=C(K,q)>0$ such that

\[
\bigg\lvert\frac{\mathbb{E}\lVert\mathcal{N}(0,\Sigma)\rVert_{q}}{\lVert\mathrm{diag}(\Sigma)\rVert_{q/2}^{1/2}M_{q}}-1\bigg\rvert\leq Cn^{-\frac{1}{q\vee 2}}\sqrt{\log n},
\]

where $M_{q}\equiv\mathbb{E}^{1/q}\lvert\mathcal{N}(0,1)\rvert^{q}=2^{1/2}\big\{\Gamma\big((q+1)/2\big)/\sqrt{\pi}\big\}^{1/q}$.

Proof.

Let $g\sim\mathcal{N}(0,I_{n})$. We first prove that for some $C_{0}>1$,

\[
n^{\frac{1}{q\vee 2}}/C_{0}\leq\mathbb{E}\lVert\Sigma^{1/2}g\rVert_{q}\leq C_{0}n^{\frac{1}{q}}. \tag{B.1}
\]

The upper bound in the above display is trivial. For the lower bound, using $\lVert x\rVert\leq n^{\frac{1}{2}-\frac{1}{q\vee 2}}\lVert x\rVert_{q}$, we find $\mathbb{E}\lVert\Sigma^{1/2}g\rVert_{q}\geq n^{-\frac{1}{2}+\frac{1}{q\vee 2}}\mathbb{E}\lVert\Sigma^{1/2}g\rVert\gtrsim n^{\frac{1}{q\vee 2}}$. This proves (B.1).

As $\lVert x\rVert_{q}\leq n^{-\frac{1}{2}+\frac{1}{q\wedge 2}}\lVert x\rVert$, the map $g\mapsto\lVert\Sigma^{1/2}g\rVert_{q}$ is $\lVert\Sigma\rVert_{\operatorname{op}}^{1/2}n^{-\frac{1}{2}+\frac{1}{q\wedge 2}}$-Lipschitz with respect to $\lVert\cdot\rVert$. So by Gaussian concentration, for any $t\geq 0$,

\[
\mathbb{P}\Big(E(t)^{c}\equiv\Big\{n^{\frac{1}{2}-\frac{1}{q\wedge 2}}\big\lvert\lVert\Sigma^{1/2}g\rVert_{q}-\mathbb{E}\lVert\Sigma^{1/2}g\rVert_{q}\big\rvert\geq C\sqrt{t}\Big\}\Big)\leq Ce^{-t/C}.
\]

Consequently, using the above concentration and (B.1),

\[
\begin{aligned}
\mathbb{E}\lVert\Sigma^{1/2}g\rVert_{q}^{q}
&\leq\mathbb{E}\lVert\Sigma^{1/2}g\rVert_{q}^{q}\bm{1}_{E(t)}+\mathbb{E}^{1/2}\lVert\Sigma^{1/2}g\rVert_{q}^{2q}\cdot\mathbb{P}^{1/2}(E(t)^{c})\\
&\leq\big(\mathbb{E}\lVert\Sigma^{1/2}g\rVert_{q}+C\sqrt{t}\big)^{q}+C\cdot n^{1/q}\,\mathbb{P}^{1/2}(E(t)^{c})\\
&\leq\big(\mathbb{E}\lVert\Sigma^{1/2}g\rVert_{q}\big)^{q}\cdot\big\{\big(1+Cn^{-\frac{1}{q\vee 2}}\sqrt{t}\big)^{q}+C\cdot n^{\frac{1}{q}-\frac{1}{q\vee 2}}\,\mathbb{P}^{1/2}(E(t)^{c})\big\}.
\end{aligned}
\]

By choosing $t=C_{1}\log n$ for some sufficiently large $C_{1}>0$, we have

\[
\frac{\mathbb{E}\lVert\mathcal{N}(0,\Sigma)\rVert_{q}}{\lVert\mathrm{diag}(\Sigma)\rVert_{q/2}^{1/2}M_{q}}=\frac{\mathbb{E}\lVert\Sigma^{1/2}g\rVert_{q}}{\mathbb{E}^{1/q}\lVert\Sigma^{1/2}g\rVert_{q}^{q}}\geq\big(1-Cn^{-\frac{1}{q\vee 2}}\sqrt{\log n}\big)_{+}.
\]

The upper bound follows similarly. ∎
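As a sanity check on Lemma B.2, the ratio $\mathbb{E}\lVert\mathcal{N}(0,\Sigma)\rVert_{q}/(\lVert\mathrm{diag}(\Sigma)\rVert_{q/2}^{1/2}M_{q})$ can be estimated by Monte Carlo. The sketch below is illustrative only: the function names `gaussian_abs_moment_root` and `lq_norm_ratio`, the sample size, and the AR(1)-type covariance in the usage note are our own assumptions and are not taken from the paper.

```python
import math
import numpy as np


def gaussian_abs_moment_root(q):
    """M_q = E^{1/q}|N(0,1)|^q = sqrt(2) * (Gamma((q+1)/2) / sqrt(pi))^{1/q}."""
    return math.sqrt(2.0) * (math.gamma((q + 1.0) / 2.0) / math.sqrt(math.pi)) ** (1.0 / q)


def lq_norm_ratio(Sigma, q, n_samples=2000, seed=0):
    """Monte Carlo estimate of E||N(0,Sigma)||_q / (||diag(Sigma)||_{q/2}^{1/2} M_q)."""
    rng = np.random.default_rng(seed)
    n = Sigma.shape[0]
    # Draw rows X_i ~ N(0, Sigma) via a Cholesky factor (hypothetical choice;
    # any square root of Sigma gives the same distribution).
    L = np.linalg.cholesky(Sigma)
    X = rng.standard_normal((n_samples, n)) @ L.T
    # ell_q (quasi-)norm of each sample.
    norms = (np.abs(X) ** q).sum(axis=1) ** (1.0 / q)
    # ||diag(Sigma)||_{q/2}^{1/2} = (sum_j Sigma_jj^{q/2})^{1/q}.
    diag_term = (np.diag(Sigma) ** (q / 2.0)).sum() ** (1.0 / q)
    return norms.mean() / (diag_term * gaussian_abs_moment_root(q))
```

For instance, with `idx = np.arange(300)` and `Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])`, calling `lq_norm_ratio(Sigma, 3.0)` should return a value close to 1, in line with the lemma; the Monte Carlo error is of course separate from the deterministic bound proved above.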

Lemma B.3.

Let $Z\in\mathbb{R}^{m\times n}$ be a random matrix with independent, mean-zero, unit-variance, uniformly sub-gaussian components. Suppose the coordinates of $\xi\in\mathbb{R}^{m}$ are i.i.d. mean-zero and uniformly sub-gaussian with variance $\sigma_{\xi}^{2}>0$, and are independent of $Z$. Then there exists some universal constant $C>0$ such that for any $b\in\mathbb{R}^{n}$ and $0<\varrho\leq 1$, with probability at least $1-Ce^{-m^{\varrho}/C}$,

\[
\big\lvert m^{-1}\lVert Zb+\xi\rVert^{2}-\big(\lVert b\rVert^{2}+\sigma_{\xi}^{2}\big)\big\rvert\leq C\cdot(\sigma_{\xi}^{2}\vee\lVert b\rVert^{2})\cdot m^{-(1-\varrho)/2}.
\]
Proof.

Let $Z_{1},\ldots,Z_{m}\in\mathbb{R}^{n}$ be the rows of $Z$. Then

\[
\frac{1}{m}\lVert Zb+\xi\rVert^{2}=\lVert b\rVert^{2}\frac{1}{m}\sum_{i=1}^{m}\bigg\langle Z_{i},\frac{b}{\lVert b\rVert}\bigg\rangle^{2}+\frac{2\sigma_{\xi}\lVert b\rVert}{m}\sum_{i=1}^{m}\frac{\xi_{i}}{\sigma_{\xi}}\bigg\langle Z_{i},\frac{b}{\lVert b\rVert}\bigg\rangle+\sigma_{\xi}^{2}\frac{\lVert\xi/\sigma_{\xi}\rVert^{2}}{m}.
\]

Using standard concentration estimates, with probability at least $1-Ce^{-m^{\varrho}/C}$,

  • $\big\lvert\lVert b\rVert^{2}\frac{1}{m}\sum_{i=1}^{m}\big\langle Z_{i},\frac{b}{\lVert b\rVert}\big\rangle^{2}-\lVert b\rVert^{2}\big\rvert\leq C\lVert b\rVert^{2}\cdot m^{-(1-\varrho)/2}$,

  • $\big\lvert\frac{2\sigma_{\xi}\lVert b\rVert}{m}\sum_{i=1}^{m}\frac{\xi_{i}}{\sigma_{\xi}}\big\langle Z_{i},\frac{b}{\lVert b\rVert}\big\rangle\big\rvert\leq C\sigma_{\xi}\lVert b\rVert\cdot m^{-(1-\varrho)/2}$,

  • $\big\lvert\sigma_{\xi}^{2}\frac{\lVert\xi/\sigma_{\xi}\rVert^{2}}{m}-\sigma_{\xi}^{2}\big\rvert\leq C\sigma_{\xi}^{2}\cdot m^{-(1-\varrho)/2}$.

Collecting these bounds concludes the proof. ∎
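The concentration in Lemma B.3 is easy to probe numerically. The following is a minimal simulation sketch, assuming Rademacher entries for $Z$ and Gaussian noise for $\xi$ (both sub-gaussian choices made here for illustration); the function name `residual_norm_gap` and all default parameters are hypothetical and not from the paper.

```python
import numpy as np


def residual_norm_gap(m, n, sigma_xi=0.5, seed=0):
    """Relative gap |m^{-1}||Zb+xi||^2 - (||b||^2 + sigma_xi^2)| / (sigma_xi^2 v ||b||^2)."""
    rng = np.random.default_rng(seed)
    # Z: independent, mean-zero, unit-variance entries (Rademacher, an assumption).
    Z = rng.choice([-1.0, 1.0], size=(m, n))
    # b: an arbitrary fixed vector with ||b||^2 of order one.
    b = rng.standard_normal(n) / np.sqrt(n)
    # xi: i.i.d. mean-zero noise with variance sigma_xi^2, independent of Z.
    xi = sigma_xi * rng.standard_normal(m)
    lhs = np.sum((Z @ b + xi) ** 2) / m
    target = b @ b + sigma_xi ** 2
    scale = max(sigma_xi ** 2, b @ b)
    return abs(lhs - target) / scale
```

Running `residual_norm_gap(m, 50)` for increasing `m` gives a rough feel for the $m^{-(1-\varrho)/2}$ rate; a single draw only illustrates the lemma and does not verify its probability bound.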

References

  • [ASS20] Madhu S Advani, Andrew M Saxe, and Haim Sompolinsky, High-dimensional dynamics of generalization error in neural networks, Neural Networks 132 (2020), 428–446.
  • [AZ20] Morgane Austern and Wenda Zhou, Asymptotics of cross-validation, arXiv preprint arXiv:2001.11111 (2020).
  • [AZLS19] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song, A convergence theory for deep learning via over-parameterization, International Conference on Machine Learning, PMLR, 2019, pp. 242–252.
  • [Bel20] Pierre C. Bellec, Out-of-sample error estimate for robust m-estimators with convex penalty, arXiv preprint arXiv:2008.11840 (2020).
  • [BEM13] Mohsen Bayati, Murat A Erdogdu, and Andrea Montanari, Estimating lasso risk and noise level, Advances in Neural Information Processing Systems 26 (2013).
  • [BHMM19] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal, Reconciling modern machine-learning practice and the classical bias-variance trade-off, Proc. Natl. Acad. Sci. USA 116 (2019), no. 32, 15849–15854.
  • [BHX20] Mikhail Belkin, Daniel Hsu, and Ji Xu, Two models of double descent for weak features, SIAM J. Math. Data Sci. 2 (2020), no. 4, 1167–1180.
  • [BLLT20] Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler, Benign overfitting in linear regression, Proc. Natl. Acad. Sci. USA 117 (2020), no. 48, 30063–30070.
  • [BLM13] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart, Concentration inequalities: A nonasymptotic theory of independence, Oxford University Press, Oxford, 2013.
  • [BMR21] Peter L. Bartlett, Andrea Montanari, and Alexander Rakhlin, Deep learning: a statistical viewpoint, Acta Numer. 30 (2021), 87–201.
  • [BS10] Zhidong Bai and Jack W. Silverstein, Spectral analysis of large dimensional random matrices, second ed., Springer Series in Statistics, Springer, New York, 2010.
  • [BZ23] Pierre C. Bellec and Cun-Hui Zhang, Debiasing convex regularized estimators and interval estimation in linear models, Ann. Statist. 51 (2023), no. 2, 391–436.
  • [CDK22] Chen Cheng, John Duchi, and Rohith Kuditipudi, Memorize to generalize: on the necessity of interpolation in high dimensional linear regression, Conference on Learning Theory, PMLR, 2022, pp. 5528–5560.
  • [CM22] Chen Cheng and Andrea Montanari, Dimension free ridge regression, arXiv preprint arXiv:2210.08571 (2022).
  • [CMW22] Michael Celentano, Andrea Montanari, and Yuting Wei, The lasso with general gaussian designs with applications to hypothesis testing, arXiv preprint arXiv:2007.13716v2 (2022).
  • [COB19] Lenaic Chizat, Edouard Oyallon, and Francis Bach, On lazy training in differentiable programming, Advances in Neural Information Processing Systems 32 (2019).
  • [CW79] Peter Craven and Grace Wahba, Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation, Numer. Math. 31 (1978/79), no. 4, 377–403.
  • [Dic16] Lee H. Dicker, Ridge regression and asymptotic minimax estimation over spheres of growing dimension, Bernoulli 22 (2016), no. 1, 1–37.
  • [DKT22] Zeyu Deng, Abla Kammoun, and Christos Thrampoulidis, A model of double descent for high-dimensional binary linear classification, Inf. Inference 11 (2022), no. 2, 435–495.
  • [DSH23] Alexis Derumigny and Johannes Schmidt-Hieber, On lower bounds for the bias-variance trade-off, Ann. Statist., to appear. Available at arXiv:2006.00278 (2023+).
  • [DvdL05] Sandrine Dudoit and Mark J. van der Laan, Asymptotics of cross-validated risk estimation in estimator selection and performance assessment, Stat. Methodol. 2 (2005), no. 2, 131–154.
  • [DW18] Edgar Dobriban and Stefan Wager, High-dimensional asymptotics of prediction: ridge regression and classification, Ann. Statist. 46 (2018), no. 1, 247–279.
  • [DZPS18] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh, Gradient descent provably optimizes over-parameterized neural networks, arXiv preprint arXiv:1810.02054 (2018).
  • [Efr04] Bradley Efron, The estimation of prediction error: covariance penalties and cross-validation, J. Amer. Statist. Assoc. 99 (2004), no. 467, 619–642, With comments and a rejoinder by the author.
  • [EK13] Noureddine El Karoui, Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results, arXiv preprint arXiv:1311.2445 (2013).
  • [EK18] Noureddine El Karoui, On the impact of predictor geometry on the performance on high-dimensional ridge-regularized generalized robust regression estimators, Probab. Theory Related Fields 170 (2018), no. 1-2, 95–175.
  • [GHW79] Gene H. Golub, Michael Heath, and Grace Wahba, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics 21 (1979), no. 2, 215–223.
  • [GKKW02] László Györfi, Michael Kohler, Adam Krzyżak, and Harro Walk, A Distribution-Free Theory of Nonparametric Regression, Springer Series in Statistics, Springer-Verlag, New York, 2002.
  • [GN16] Evarist Giné and Richard Nickl, Mathematical foundations of infinite-dimensional statistical models, Cambridge Series in Statistical and Probabilistic Mathematics, [40], Cambridge University Press, New York, 2016.
  • [Gor85] Yehoram Gordon, Some inequalities for Gaussian processes and applications, Israel J. Math. 50 (1985), no. 4, 265–289.
  • [Gor88] Yehoram Gordon, On Milman’s inequality and random subspaces which escape through a mesh in $\mathbf{R}^{n}$, Geometric aspects of functional analysis (1986/87), Lecture Notes in Math., vol. 1317, Springer, Berlin, 1988, pp. 84–106.
  • [Han22] Qiyang Han, Noisy linear inverse problems under convex constraints: Exact risk asymptotics in high dimensions, arXiv preprint arXiv:2201.08435 (2022).
  • [HK70] Arthur E. Hoerl and Robert W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970), no. 1, 55–67.
  • [HKZ14] Daniel Hsu, Sham M. Kakade, and Tong Zhang, Random design analysis of ridge regression, Found. Comput. Math. 14 (2014), no. 3, 569–600.
  • [HMRT22] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J. Tibshirani, Surprises in high-dimensional ridgeless least squares interpolation, Ann. Statist. 50 (2022), no. 2, 949–986.
  • [HS22] Qiyang Han and Yandi Shen, Universality of regularized regression estimators in high dimensions, arXiv preprint arXiv:2206.07936 (2022).
  • [JGH18] Arthur Jacot, Franck Gabriel, and Clément Hongler, Neural tangent kernel: Convergence and generalization in neural networks, Advances in Neural Information Processing Systems 31 (2018).
  • [JWHT21] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An introduction to statistical learning: with applications in R, second ed., Springer Texts in Statistics, Springer, New York, 2021.
  • [KL22] Nicholas Kissel and Jing Lei, On high-dimensional gaussian comparisons for cross-validation, arXiv preprint arXiv:2211.04958 (2022).
  • [KLS20] Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez, The optimal ridge penalty for real-world high-dimensional data can be zero or negative due to the implicit ridge regularization, J. Mach. Learn. Res. 21 (2020), Paper No. 169, 16.
  • [KY17] Antti Knowles and Jun Yin, Anisotropic local laws for random matrices, Probab. Theory Related Fields 169 (2017), no. 1-2, 257–352.
  • [KZSS21] Frederic Koehler, Lijia Zhou, Danica J Sutherland, and Nathan Srebro, Uniform convergence of interpolators: Gaussian width, norm bounds and benign overfitting, Advances in Neural Information Processing Systems 34 (2021), 20657–20668.
  • [LD19] Sifan Liu and Edgar Dobriban, Ridge regression: Structure, cross-validation, and sketching, arXiv preprint arXiv:1910.02373 (2019).
  • [LGC+21] Bruno Loureiro, Cedric Gerbelot, Hugo Cui, Sebastian Goldt, Florent Krzakala, Marc Mezard, and Lenka Zdeborová, Learning curves of generic features maps for realistic datasets with a teacher-student model, Advances in Neural Information Processing Systems 34 (2021), 18137–18151.
  • [Li85] Ker-Chau Li, From Stein’s unbiased risk estimates to the method of generalized cross validation, Ann. Statist. 13 (1985), no. 4, 1352–1377.
  • [Li86] Ker-Chau Li, Asymptotic optimality of $C_{L}$ and generalized cross-validation in ridge regression with application to spline smoothing, Ann. Statist. 14 (1986), no. 3, 1101–1112.
  • [Li87] Ker-Chau Li, Asymptotic optimality for $C_{p}$, $C_{L}$, cross-validation and generalized cross-validation: discrete index set, Ann. Statist. 15 (1987), no. 3, 958–975.
  • [LR20] Tengyuan Liang and Alexander Rakhlin, Just interpolate: kernel “ridgeless” regression can generalize, Ann. Statist. 48 (2020), no. 3, 1329–1347.
  • [LS22] Tengyuan Liang and Pragya Sur, A precise high-dimensional asymptotic theory for boosting and minimum-$\ell_{1}$-norm interpolated classifiers, Ann. Statist. 50 (2022), no. 3, 1669–1695.
  • [MM21] Léo Miolane and Andrea Montanari, The distribution of the Lasso: uniform control over sparse balls and adaptive parameter tuning, Ann. Statist. 49 (2021), no. 4, 2313–2335.
  • [MM22] Song Mei and Andrea Montanari, The generalization error of random features regression: precise asymptotics and the double descent curve, Comm. Pure Appl. Math. 75 (2022), no. 4, 667–766.
  • [MMM22] Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Generalization error of random feature and kernel methods: hypercontractivity and kernel matrix concentration, Appl. Comput. Harmon. Anal. 59 (2022), 3–84.
  • [MRSY23] Andrea Montanari, Feng Ruan, Youngtak Sohn, and Jun Yan, The generalization error of max-margin linear classifiers: Benign overfitting and high-dimensional asymptotics in the overparametrized regime, arXiv preprint arXiv:1911.01544v3 (2023).
  • [MZ22] Andrea Montanari and Yiqiao Zhong, The interpolation phase transition in neural networks: memorization and generalization under lazy training, Ann. Statist. 50 (2022), no. 5, 2816–2847.
  • [PD23] Pratik Patil and Jin-Hong Du, Generalized equivalences between subsampling and ridge regularization, arXiv preprint arXiv:2305.18496 (2023).
  • [PWRT21] Pratik Patil, Yuting Wei, Alessandro Rinaldo, and Ryan Tibshirani, Uniform consistency of cross-validation estimators for high-dimensional ridge regression, International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 3178–3186.
  • [RMR21] Dominic Richards, Jaouad Mourtada, and Lorenzo Rosasco, Asymptotics of ridge (less) regression under general source condition, International Conference on Artificial Intelligence and Statistics, PMLR, 2021, pp. 3889–3897.
  • [RV09] Mark Rudelson and Roman Vershynin, Smallest singular value of a random rectangular matrix, Comm. Pure Appl. Math. 62 (2009), no. 12, 1707–1739.
  • [SAH19] Fariborz Salehi, Ehsan Abbasi, and Babak Hassibi, The impact of regularization on high-dimensional logistic regression, Advances in Neural Information Processing Systems 32 (2019).
  • [Ste81] Charles M. Stein, Estimation of the mean of a multivariate normal distribution, Ann. Statist. 9 (1981), no. 6, 1135–1151.
  • [Sto74] M. Stone, Cross-validatory choice and assessment of statistical predictions, J. Roy. Statist. Soc. Ser. B 36 (1974), 111–147.
  • [Sto77] M. Stone, Asymptotics for and against cross-validation, Biometrika 64 (1977), no. 1, 29–35.
  • [TAH18] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi, Precise error analysis of regularized $M$-estimators in high dimensions, IEEE Trans. Inform. Theory 64 (2018), no. 8, 5592–5628.
  • [TB22] Alexander Tsigler and Peter L. Bartlett, Benign overfitting in ridge regression, arXiv preprint arXiv:2009.14286v2 (2022).
  • [TOH15] Christos Thrampoulidis, Samet Oymak, and Babak Hassibi, Regularized linear regression: A precise analysis of the estimation error, Conference on Learning Theory, PMLR, 2015, pp. 1683–1709.
  • [TV+04] Antonia M Tulino, Sergio Verdú, et al., Random matrix theory and wireless communications, Foundations and Trends® in Communications and Information Theory 1 (2004), no. 1, 1–182.
  • [vdVW96] Aad van der Vaart and Jon A. Wellner, Weak Convergence and Empirical Processes, Springer Series in Statistics, Springer-Verlag, New York, 1996.
  • [WWM22] Shuaiwen Wang, Haolei Weng, and Arian Maleki, Does SLOPE outperform bridge regression?, Inf. Inference 11 (2022), no. 1, 1–54.
  • [WX20] Denny Wu and Ji Xu, On the optimal weighted $\ell_{2}$ regularization in overparameterized linear regression, Advances in Neural Information Processing Systems 33 (2020), 10112–10123.
  • [ZBH+21] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, Understanding deep learning (still) requires rethinking generalization, Communications of the ACM 64 (2021), no. 3, 107–115.
  • [ZKS+22] Lijia Zhou, Frederic Koehler, Pragya Sur, Danica J Sutherland, and Nathan Srebro, A non-asymptotic moreau envelope theory for high-dimensional generalized linear models, arXiv preprint arXiv:2210.12082 (2022).
  • [ZZY22] Xianyang Zhang, Huijuan Zhou, and Hanxuan Ye, A modern theory for high-dimensional Cox regression models, arXiv preprint arXiv:2204.01161 (2022).