
HEDE: Heritability estimation in high dimensions
by Ensembling Debiased Estimators

Yanke Song, Xihong Lin, and Pragya Sur  
Harvard University
Yanke Song is a graduate student ([email protected]) in the Department of Statistics at Harvard University. Xihong Lin ([email protected]) is Professor of Biostatistics at Harvard T.H. Chan School of Public Health and Professor of Statistics at Harvard University. Pragya Sur is Assistant Professor ([email protected]) in the Department of Statistics at Harvard University. Lin’s work was supported by the National Institutes of Health grants R35-CA197449, R01-HL163560, U01-HG012064, U19-CA203654, and P30-ES000002. Song and Sur’s work was supported by NSF DMS-2113426 and a Dean’s Competitive Fund for Promising Scholarship. Corresponding authors are Song and Sur.

Abstract

Estimating heritability remains a significant challenge in statistical genetics. Diverse approaches have emerged over the years that are broadly categorized as either random effects or fixed effects heritability methods. In this work, we focus on the latter. We propose HEDE, an ensemble approach to estimate heritability or the signal-to-noise ratio in high-dimensional linear models where the sample size and the dimension grow proportionally. Our method ensembles post-processed versions of the debiased lasso and debiased ridge estimators, and incorporates a data-driven strategy for hyperparameter selection that significantly boosts estimation performance. We establish rigorous consistency guarantees that hold despite adaptive tuning. Extensive simulations demonstrate our method's superiority over existing state-of-the-art methods across various signal structures and genetic architectures, ranging from sparse to relatively dense and from evenly to unevenly distributed signals. Furthermore, we discuss the advantages of fixed effects heritability estimation compared to random effects estimation. Our theoretical guarantees hold for realistic genotype distributions observed in genetic studies, where genotypes typically take on discrete values and are often well-modeled by sub-Gaussian distributed random variables. We establish our theoretical results by deriving uniform bounds, built upon the convex Gaussian min-max theorem, and leveraging universality results. Finally, we showcase the efficacy of our approach in estimating height and BMI heritability using the UK Biobank.



Keywords: Signal-to-noise ratio estimation, Debiased Lasso estimator, Debiased ridge estimator, Convex Gaussian min-max theorem, Ensemble estimation, Proportional asymptotics, Universality.

1 Introduction

Accurate heritability estimation presents a significant challenge in statistical genetics. Distinguishing the contributions of genetic versus environmental factors is crucial for understanding how our genetic makeup influences the development of complex diseases. Importantly, genetic research over the past decade has highlighted a significant discrepancy between heritability estimates from twin studies and those from Genome-Wide Association Studies (GWAS) (Manolio et al., 2009; Yang et al., 2015). This discrepancy, known as the “missing heritability problem”, underscores the importance of developing reliable heritability estimation methods to understand the extent of missing heritability. In this paper, we propose an ensemble method for heritability estimation, using high-dimensional Single Nucleotide Polymorphisms (SNPs) commonly collected in GWAS, that exhibits superior performance across a wide variety of settings and significantly outperforms existing methods in practical scenarios of interest. We focus on the narrow-sense heritability (yang2011gcta), which refers to the proportion of phenotypic variance explained by additive genotypic effects. For conciseness, we henceforth refer to this simply as the heritability.

GWAS data typically contains more SNPs than individuals. This presents a challenge when estimating heritability. Addressing this challenge necessitates modeling assumptions. Previous work in this area can be broadly categorized into two classes based on these assumptions: random effects models and fixed effects models.

Random effects models treat the underlying regression coefficients as random variables and have been widely used for heritability estimation. Arguably, the most popular approach in this vein is GCTA, introduced by yang2011gcta. GCTA employs the classical restricted maximum likelihood method (REML) (also called genomic REML/GREML in their work) within a linear mixed effects model. It assumes that the regression coefficients are i.i.d. draws from a Gaussian distribution with zero mean and a constant variance. Since the genetic effect sizes/regression coefficients might depend on the design matrix (the genotypes), subsequent work provides various improvements that relax this assumption, including GREML-MS (lee2013estimation), GREML-LDMS (yang2015genetic), and BOLT-LMM (loh2015efficient), among others. Specifically, GREML-MS allows the signal variance to depend on minor allele frequencies (MAFs), while GREML-LDMS allows it to depend on linkage disequilibrium (LD) levels/quantiles. Prior statistical literature has studied the theoretical properties of several random effects methods (jiang2016high; bonnet2015heritability; dicker2016maximum; hu2022misspecification).

Despite this rich literature, a major limitation of random effects based heritability estimation methods is that their consistency often relies on correct parametric assumptions on the regression coefficients. Specifically, these methods assume that the regression coefficients are either i.i.d., or depend on the design matrix in specific ways. When these assumptions are violated, random effects methods can produce significantly biased heritability estimates. The issue is magnified when SNPs are in linkage disequilibrium (LD) or when there is additional heterogeneity across LD blocks not captured by the random effects assumptions. We elaborate on this issue further in Figure 1, where we demonstrate that GREML, GREML-MS and GREML-LDMS all incur significant biases when their respective random effects assumptions are violated.

These challenges associated with random effects methods have spurred a distinct strand of methods that posit the regression coefficients to be fixed while treating the design matrix as random. Referred to as fixed effects methods, these approaches often exhibit greater robustness to diverse realizations of the underlying signal without the need for parametric assumptions on the regression coefficients (Figure 1). Consequently, our focus lies on estimating the heritability within this fixed effects model framework.

Several fixed effects methods have emerged for heritability estimation. However, a substantial body of this work requires the underlying signal to be sparse (fan2012variance; sun2012scaled; tony2020semisupervised; guo2019optimal), and these methods suffer a loss of efficiency when the sparsity assumptions are violated (dicker2014variance; janson2017eigenprism; bayati2013estimating); see Sections 5.2, 3.3, and 2, respectively. The methods proposed by several authors (chen2022statistical; dicker2014variance; janson2017eigenprism) operate under less stringent assumptions on the signal, but they rely on ridge-regression-analogous ideas. They are therefore expected to be efficient when the signal is dense, but suffer a loss of efficiency when the signal is sparse. We illustrate these issues further in Section 5.2. In practice, the underlying genetic architecture for a given trait is unknown, i.e., the signal could be sparse or dense. Hence, it is desirable to develop a method that is robust across a broad spectrum of signal regimes, from sparse to dense. In addition, fixed effects methods often rely on assuming the covariate distribution to be Gaussian (bayati2013estimating; dicker2014variance; janson2017eigenprism; verzelen2018adaptive). This assumption fails to capture genetic data, where genotypes are discrete, taking values in \{0,1,2\}.

To address these issues, we propose a robust ensemble heritability estimation method, HEDE (Heritability Estimation via Debiased Ensemble), which ensembles the debiased Lasso and ridge estimators with degrees-of-freedom corrections (bellec2019biasing; javanmard2014hypothesis). Debiased estimators, originally proposed to correct the bias of Lasso/ridge estimators (zhang2014confidence; van2014asymptotically; javanmard2014confidence), have been employed for tasks such as confidence interval construction and hypothesis testing, and enjoy attractive properties such as optimal testing guarantees (javanmard2014confidence). Unlike these methods, HEDE utilizes debiasing for heritability estimation. To develop HEDE, we first demonstrate that a reliable estimator for heritability can be formulated from any linear combination of these debiased estimators, for which we devise a consistent bias correction strategy. Subsequently, we refine the linear combination based on this bias correction strategy. This methodology enables the consistent estimation of heritability using any linear combination of the debiased Lasso and the debiased ridge. HEDE then uses a data-driven approach to select an optimal ensemble from the array of possible linear combinations. Finally, we introduce an adaptive method for determining the underlying regularization parameters. We develop both adaptive strategies, ensemble selection and tuning parameter selection for the Lasso and the ridge, by minimizing a certain mean square error. This results in enhanced statistical performance for heritability estimation.

Underpinning our method is a rigorous statistical theory that ensures consistency under minimal assumptions. Specifically, we operate under sub-Gaussian assumptions, which cover a broad class of designs observed in the GWAS that motivate this paper. We show that HEDE's consistency guarantees remain intact despite the adaptive strategies employed for selecting the tuning parameters and the debiased Lasso and ridge ensemble, all while accommodating sub-Gaussian designs.

We perform extensive simulation studies to evaluate the finite sample performance of HEDE compared with existing methods in a range of settings mimicking GWAS data. We demonstrate that HEDE overcomes the limitations of random effects methods, remaining unbiased across a wide variety of signal structures, specifically signals with different amplitudes across various combinations of MAF and LD levels, as well as LD blocks. Our simulations also illustrate that HEDE consistently achieves the lowest mean square error when compared to popular fixed effects methods across a wide spectrum of sparse and dense signals, as well as low and high heritability values. We compare HEDE's performance with both random and fixed effects methods on the problems of estimating height and BMI heritability using the UK Biobank data.

The rest of this paper is organized as follows. Section 2 introduces the problem setup. Section 3 presents our method HEDE, while Section 4 discusses its theoretical properties. Section 5 presents extensive simulation studies, while Section 6 complements these numerical investigations with a real-data application to the UK Biobank data (sudlow2015uk). Finally, Section 7 concludes with discussions.

2 Problem Setup

In this section, we formally introduce our problem setup. We consider a high-dimensional regime where the covariate dimension grows with the sample size. To elaborate, we consider a sequence of problem instances \{\bm{y}(n),\bm{X}(n),\bm{\beta}(n),\bm{\epsilon}(n)\}_{n\geq 1} that satisfies a linear model

\bm{y}(n)=\bm{X}(n)\bm{\beta}(n)+\bm{\epsilon}(n), \qquad (1)

where \bm{y}(n)=(y_{1},\ldots,y_{n})^{\top}\in\mathbb{R}^{n} represents observations of a real-valued phenotype of interest, such as height, BMI, or blood pressure, for n unrelated individuals from a population. The unknown regression coefficient vector, \bm{\beta}(n)=(\beta_{1},\ldots,\beta_{p(n)})^{\top}\in\mathbb{R}^{p(n)}, captures the effects of p(n) SNPs and is treated as a vector of fixed parameters. The environmental noise, \bm{\epsilon}(n)=(\epsilon_{1},\ldots,\epsilon_{n})^{\top}\in\mathbb{R}^{n}, is independent of \bm{X}(n), with mutually independent entries. Lastly, \bm{X}(n)\in\mathbb{R}^{n\times p(n)} denotes the matrix of observed genotypes with i.i.d. rows \bm{x}_{i\bullet}, which are normalized as follows

X_{ij}=\frac{G_{ij}-2\bar{G}_{j}}{\sqrt{2\bar{G}_{j}(1-\bar{G}_{j})}}, \qquad (2)

where G_{ij}\in\{0,1,2\} is the genotype containing the minor allele count (the number of copies of the less frequent allele) at SNP j for individual i and \bar{G}_{j}=\sum_{i=1}^{n}G_{ij}/(2n) is the minor allele frequency of SNP j in the sample.
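As a concrete illustration of (2), the following minimal sketch standardizes a genotype matrix of minor allele counts; the function and variable names are ours, and the code assumes no missing genotypes.

```python
import numpy as np

def standardize_genotypes(G):
    """Standardize a genotype matrix per equation (2).

    G : (n, p) array of minor allele counts in {0, 1, 2}.
    Each column is centered at 2 * MAF and scaled by sqrt(2 * MAF * (1 - MAF)),
    where MAF is the sample minor allele frequency of that SNP.
    """
    G = np.asarray(G, dtype=float)
    n = G.shape[0]
    maf = G.sum(axis=0) / (2.0 * n)      # sample minor allele frequency per SNP
    return (G - 2.0 * maf) / np.sqrt(2.0 * maf * (1.0 - maf))

# toy usage on a 4 x 2 genotype matrix
X = standardize_genotypes(np.array([[0, 1], [2, 1], [1, 0], [0, 2]]))
```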

We assume the covariates have sub-Gaussian distributions. Since \bm{X} is random and \bm{\beta} is deterministic, we have placed ourselves in the fixed effects framework. Moving forward, we suppress the dependence on n when the meaning is clear from the context. Our parameter of interest is the heritability, the proportion of variance in y_{i} explained by the features in \bm{x}_{i\bullet}, defined as

h^{2}=\lim_{n,p\rightarrow\infty,\;p/n\rightarrow\delta}\frac{\operatorname{Var}(\bm{x}_{i\bullet}^{\top}\bm{\beta})}{\operatorname{Var}(y_{i})}. \qquad (3)

We emphasize the following remark regarding the aforementioned definition. We assume that n and p diverge with p/n\rightarrow\delta>0. This setting, also known as the proportional asymptotics regime, has gained significant recent attention in high-dimensional statistics. A major advantage of this regime is that theoretical results derived under such asymptotics demonstrate attractive finite sample performance (sur2019likelihood; sur2019modern; candes2020phase; zhao2022asymptotic; liang2022precise; jiang2022new). Furthermore, prior research in heritability estimation (jiang2016high; dicker2016maximum; janson2017eigenprism) utilized this framework to characterize theoretical properties of GREML and its variants. In practical situations, we observe a single value of n and p, so we may substitute the corresponding ratio p/n into our theory in place of \delta to execute data analyses. Formal assumptions on the covariate and noise distributions, and the signal, are deferred to Section 4.

3 Methodology of HEDE: Heritability Estimation by Ensembling Debiased Estimators

We begin by describing the overarching theme underlying our method. For simplicity, we discuss the case where \bm{X} has independent columns, deferring the extension to correlated columns to Section 3.4. The denominator of h^{2} in (3) can be consistently estimated using the sample variance of \bm{y}. The numerator of h^{2} in (3) simplifies to \|\bm{\beta}\|_{2}^{2} in the setting of independent columns. Consequently, we seek to develop an accurate estimator for this norm.

To this end, we turn to the debiased Lasso estimator and the debiased ridge estimator. Denote the Lasso and the ridge by \hat{\bm{\beta}}_{\textrm{L}} and \hat{\bm{\beta}}_{\textrm{R}} respectively. These penalized estimators are known to shrink estimated coefficients toward 0, therefore inducing bias. The existing literature provides debiased versions of these estimators, denoted as \hat{\bm{\beta}}_{\textrm{L}}^{d} and \hat{\bm{\beta}}_{\textrm{R}}^{d}. These debiased estimators roughly satisfy the following property: \hat{\bm{\beta}}_{\textrm{L}}^{d}\approx\bm{\beta}+\sigma_{\textrm{L}}Z_{1}, \hat{\bm{\beta}}_{\textrm{R}}^{d}\approx\bm{\beta}+\sigma_{\textrm{R}}Z_{2}, meaning that each entry of \hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d} is approximately centered around the corresponding true signal coefficient with Gaussian fluctuations and standard deviations given by \sigma_{\textrm{L}},\sigma_{\textrm{R}} respectively. By taking norms on both sides, we obtain that \|\hat{\bm{\beta}}_{k}^{d}\|_{2}^{2}-p\cdot\sigma_{k}^{2}\approx\|\bm{\beta}\|_{2}^{2}, where k=\textrm{L}\ \text{or}\ \textrm{R} depending on whether we consider the Lasso or the ridge. Thus, precise estimates of the variances \sigma_{k}^{2} produce accurate estimates of h^{2} starting from each debiased estimator.
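To make the bias correction explicit, treat the Gaussian approximation as exact for a moment: if \hat{\bm{\beta}}_{k}^{d}=\bm{\beta}+\sigma_{k}\bm{Z} with \bm{Z} having independent standard Gaussian entries, then

\mathbb{E}\|\hat{\bm{\beta}}_{k}^{d}\|_{2}^{2}=\|\bm{\beta}\|_{2}^{2}+2\sigma_{k}\,\mathbb{E}\langle\bm{\beta},\bm{Z}\rangle+\sigma_{k}^{2}\,\mathbb{E}\|\bm{Z}\|_{2}^{2}=\|\bm{\beta}\|_{2}^{2}+p\,\sigma_{k}^{2},

so subtracting p\sigma_{k}^{2} from \|\hat{\bm{\beta}}_{k}^{d}\|_{2}^{2} removes, in expectation, the inflation due to the Gaussian fluctuations. This heuristic is what the display above summarizes; the rigorous version appears in Section 4.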

This intuition extends to any linear combination of the debiased Lasso and ridge of the form \tilde{\bm{\beta}}_{C}^{d}:=\alpha_{\textrm{L}}\hat{\bm{\beta}}_{\textrm{L}}^{d}+(1-\alpha_{\textrm{L}})\hat{\bm{\beta}}_{\textrm{R}}^{d}. A similar relationship holds for the norm: \|\tilde{\bm{\beta}}_{C}^{d}\|_{2}^{2}-p\cdot\tilde{\tau}_{C}^{2}\approx\|\bm{\beta}\|_{2}^{2}. We establish that for any \alpha_{\textrm{L}}\in[0,1], we can derive a consistent estimator for the ensemble variance \tilde{\tau}_{C}^{2}. Among all possible ensemble choices with different ensembling parameter \alpha_{\textrm{L}} and different tuning parameters \lambda_{\textrm{L}},\lambda_{\textrm{R}} for calculating the Lasso and the ridge, we adaptively choose the combination that minimizes the mean square error of the ensemble estimator. This method, dubbed HEDE, harnesses the strengths of L_{1} and L_{2} regression and demonstrates strong practical performance, since we design our adaptive tuning process specifically to enhance statistical accuracy. We provide further elaboration on HEDE in the subsequent subsections: Section 3.1 discusses the specific forms of the debiased estimators utilized in our framework; Section 3.2 describes our ensembling approach; and Section 3.3 discusses the final HEDE method with our adaptive tuning strategy.

3.1 Debiasing Regularized Estimators

Define the Lasso and ridge estimators as follows

\hat{\bm{\beta}}_{\textrm{L}}:=\operatorname*{arg\,min}_{\bm{b}}\frac{1}{2n}\|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}+\frac{\lambda_{\textrm{L}}}{\sqrt{n}}\|\bm{b}\|_{1},\quad\hat{\bm{\beta}}_{\textrm{R}}:=\operatorname*{arg\,min}_{\bm{b}}\frac{1}{2n}\|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}+\frac{\lambda_{\textrm{R}}}{2}\|\bm{b}\|_{2}^{2}, \qquad (4)

where \lambda_{\textrm{L}} and \lambda_{\textrm{R}} denote the respective tuning parameters. We remark that the different scalings of the regularization terms ensure comparable scales for the solutions in our high-dimensional regime (this is easy to see by noting that if each entry of \bm{b} satisfies b_{j}\sim 1/\sqrt{n}, then \|\bm{b}\|_{1}\sim\sqrt{n} whereas \|\bm{b}\|_{2}^{2}\sim 1). These estimators incur a regularization bias. To tackle this issue, several authors proposed debiased versions of these estimators that remain asymptotically unbiased with Gaussian fluctuations (zhang2014confidence; van2014asymptotically; javanmard2014confidence; bellec2019biasing; bellec2023debiasing).
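Before turning to the debiasing step, the following minimal sketch makes the penalty scaling in (4) concrete by mapping \lambda_{\textrm{L}},\lambda_{\textrm{R}} onto scikit-learn's penalty conventions; this is our illustrative substitute for the glmnet-based implementation referenced later, and the mapping follows from matching the objectives (noted in the comments).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def fit_lasso_ridge(X, y, lam_L, lam_R):
    """Solve the two programs in (4).

    sklearn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1,
    so alpha = lam_L / sqrt(n) reproduces the (lam_L/sqrt(n))*||b||_1 penalty.
    sklearn's Ridge minimizes ||y - Xb||^2 + alpha*||b||^2; multiplying the
    ridge objective in (4) by 2n shows alpha = n * lam_R gives the same minimizer.
    """
    n = X.shape[0]
    lasso = Lasso(alpha=lam_L / np.sqrt(n), fit_intercept=False, max_iter=50_000)
    ridge = Ridge(alpha=n * lam_R, fit_intercept=False)
    return lasso.fit(X, y).coef_, ridge.fit(X, y).coef_
```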

Specifically, we work with the following debiased estimators:

\hat{\bm{\beta}}_{\textrm{L}}^{d}=\hat{\bm{\beta}}_{\textrm{L}}+\frac{\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}})}{n-\|\hat{\bm{\beta}}_{\textrm{L}}\|_{0}},\quad\hat{\bm{\beta}}_{\textrm{R}}^{d}=\hat{\bm{\beta}}_{\textrm{R}}+\frac{\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{R}})}{n-\operatorname{Tr}((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda_{\textrm{R}}\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X})}. \qquad (5)

As detailed in bellec2019biasing, these versions are known as debiasing with degrees-of-freedom (DOF) correction, since the second term in the denominator is precisely the degrees of freedom of the regularized estimator. Prior debiasing theory (bellec2019second; celentano2021cad; bellec2022observable) roughly established that under suitable conditions, for each coordinate j, the debiased estimators jointly satisfy (asymptotically)

\begin{pmatrix}\hat{\beta}_{\textrm{L},j}^{d}-\beta_{j}\\ \hat{\beta}_{\textrm{R},j}^{d}-\beta_{j}\end{pmatrix}\approx\mathcal{N}\left(\begin{pmatrix}0\\ 0\end{pmatrix},\begin{pmatrix}\tau_{\textrm{L}}^{2}&\tau_{\textrm{L}\textrm{R}}\\ \tau_{\textrm{L}\textrm{R}}&\tau_{\textrm{R}}^{2}\end{pmatrix}\right), \qquad (6)

for appropriate variance and covariance parameters \tau_{\textrm{L}}^{2},\tau_{\textrm{R}}^{2},\tau_{\textrm{L}\textrm{R}}. We remark that this is a non-rigorous statement (hence the “\approx” sign); a precise statement is presented in Theorem 4.3. Additionally, these parameters can be consistently estimated using the following (celentano2021cad):

\hat{\tau}_{\textrm{L}}^{2}=\frac{\|\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}}\|_{2}^{2}}{(n-\|\hat{\bm{\beta}}_{\textrm{L}}\|_{0})^{2}},\qquad\hat{\tau}_{\textrm{R}}^{2}=\frac{\|\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{R}}\|_{2}^{2}}{(n-\operatorname{Tr}((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda_{\textrm{R}}\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}))^{2}}, \qquad (7)
\hat{\tau}_{\textrm{L}\textrm{R}}=\frac{\langle\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}},\,\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{R}}\rangle}{(n-\|\hat{\bm{\beta}}_{\textrm{L}}\|_{0})(n-\operatorname{Tr}((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda_{\textrm{R}}\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}))}.

This serves as the basis of our HEDE heritability estimator.
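The following minimal sketch transcribes the debiasing formulas (5) and the variance estimates (7); it assumes \hat{\bm{\beta}}_{\textrm{L}}, \hat{\bm{\beta}}_{\textrm{R}} were computed as in (4), e.g., via fit_lasso_ridge above, and also returns the degrees of freedom since they are reused later. Function and variable names are ours.

```python
import numpy as np

def debias_and_estimate_variances(X, y, beta_L, beta_R, lam_R):
    """Degrees-of-freedom debiasing (5) and variance estimates (7)."""
    n, p = X.shape
    # Degrees of freedom: Lasso support size; ridge trace formula.
    df_L = np.count_nonzero(beta_L)
    G = X.T @ X / n
    df_R = np.trace(np.linalg.solve(G + lam_R * np.eye(p), G))
    r_L = y - X @ beta_L                       # Lasso residuals
    r_R = y - X @ beta_R                       # ridge residuals
    beta_L_d = beta_L + X.T @ r_L / (n - df_L)
    beta_R_d = beta_R + X.T @ r_R / (n - df_R)
    # Consistent estimates of tau_L^2, tau_R^2, tau_LR from (7).
    tau2_L = np.sum(r_L ** 2) / (n - df_L) ** 2
    tau2_R = np.sum(r_R ** 2) / (n - df_R) ** 2
    tau_LR = float(r_L @ r_R) / ((n - df_L) * (n - df_R))
    return beta_L_d, beta_R_d, tau2_L, tau2_R, tau_LR, df_L, df_R
```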

3.2 Ensembling and a Primitive Heritability Estimator

We seek to combine the strengths of L_{1} and L_{2} regression estimators. To achieve this, we consider linear combinations of the debiased Lasso and ridge, taking the form

\tilde{\bm{\beta}}_{C}^{d}=\alpha_{\textrm{L}}\hat{\bm{\beta}}_{\textrm{L}}^{d}+(1-\alpha_{\textrm{L}})\hat{\bm{\beta}}_{\textrm{R}}^{d}, \qquad (8)

where \alpha_{\textrm{L}}\in[0,1] is held fixed for the time being. We remark that the well-known ElasticNet estimator (zou2005regularization) also combines the strengths of the Lasso and the ridge. Though our theoretical framework allows potential extensions to the ElasticNet, we do not pursue this direction due to computational concerns, discussed further at the end of Section 3.4.

Combining (6) and (7), we obtain that each coordinate of \tilde{\bm{\beta}}_{C}^{d} satisfies, roughly,

\frac{\tilde{\beta}_{C,j}^{d}-\beta_{j}}{\sqrt{\tilde{\tau}_{C}^{2}}}\approx\mathcal{N}(0,1)\quad\text{with}\quad\tilde{\tau}_{C}^{2}:=\alpha_{\textrm{L}}^{2}\hat{\tau}_{\textrm{L}}^{2}+(1-\alpha_{\textrm{L}})^{2}\hat{\tau}_{\textrm{R}}^{2}+2\alpha_{\textrm{L}}(1-\alpha_{\textrm{L}})\hat{\tau}_{\textrm{L}\textrm{R}}. \qquad (9)

Furthermore, the Gaussianity property (9) holds in an “average” sense as well, where the result can effectively be summed across all coordinates. We describe the precise sense next. Specifically, for any Lipschitz function \psi, the difference \sum_{j=1}^{p}\psi(\tilde{\beta}_{C,j}^{d})-\sum_{j=1}^{p}\psi(\beta_{j}) behaves roughly as pE\{\psi(\sqrt{\tilde{\tau}_{C}^{2}}Z)\}, where Z\sim\mathcal{N}(0,1). On setting \psi(x)=x^{2} with x bounded, we obtain

\|\tilde{\bm{\beta}}_{C}^{d}\|_{2}^{2}-\|\bm{\beta}\|_{2}^{2}-p\cdot\tilde{\tau}_{C}^{2}\approx 0, \qquad (10)

where from (8), \|\tilde{\bm{\beta}}_{C}^{d}\|_{2}^{2}=\alpha_{\textrm{L}}^{2}\|\hat{\bm{\beta}}_{\textrm{L}}^{d}\|_{2}^{2}+(1-\alpha_{\textrm{L}})^{2}\|\hat{\bm{\beta}}_{\textrm{R}}^{d}\|_{2}^{2}+2\alpha_{\textrm{L}}(1-\alpha_{\textrm{L}})\langle\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d}\rangle. Synthesizing these arguments, we observe that

\hat{h}^{2}_{\alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}}}=\min\left[1,\max\left\{0,\frac{\|\tilde{\bm{\beta}}_{C}^{d}\|_{2}^{2}-p\cdot\tilde{\tau}_{C}^{2}}{\widehat{\text{Var}}(\bm{y})}\right\}\right] \qquad (11)

yields a consistent estimator of h^{2}. The clipping is crucial for ensuring that the estimate falls within the range [0,1], which is necessary considering h^{2} represents a proportion. Given a fixed \alpha_{\textrm{L}}\in[0,1] and \lambda_{\textrm{L}},\lambda_{\textrm{R}} within a suitable range [\lambda_{\min},\lambda_{\max}], we can establish that \hat{h}^{2}_{\alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}}} is consistent for h^{2}.

For our final algorithm, we set a range [t_{\min},t_{\max}] on the degrees of freedom (the denominators in (5)), and filter out all (\lambda_{\textrm{L}},\lambda_{\textrm{R}}) values whose resulting degrees of freedom fall outside this range. Details on how we set this range are described in Supplementary Materials A.2. This results in a class of consistent estimators for the heritability, each corresponding to a different choice of \alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}}. Among these, we propose to select the estimator that minimizes the mean square error in estimating the signal \bm{\beta}. This leads to our adaptive tuning strategy, which we describe in the next subsection.

3.3 The method HEDE

In order to develop our method, HEDE, we take advantage of the flexibility provided by \alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}} as adjustable hyperparameters. Initially, we fix \lambda_{\textrm{L}},\lambda_{\textrm{R}} and adaptively select \alpha_{\textrm{L}}. To this end, we minimize the estimated variability of the ensemble estimator, given by \tilde{\tau}_{C}^{2} in (9), over all possible choices of \alpha_{\textrm{L}}. This results in a choice that consistently estimates each coordinate of \bm{\beta} with the least possible variability among all ensemble estimators of the form (8). As this holds true for every coordinate and \tilde{\beta}_{C,j}^{d} is asymptotically unbiased for \beta_{j}, minimizing this variance concurrently minimizes the mean squared error. Formally, minimizing (9) as a function of \alpha_{\textrm{L}} leads to the following data-dependent choice:

\hat{\alpha}_{\textrm{L}}=\max\left\{0,\min\left\{1,\frac{\hat{\tau}_{\textrm{R}}^{2}-\hat{\tau}_{\textrm{L}\textrm{R}}}{\hat{\tau}_{\textrm{R}}^{2}-2\hat{\tau}_{\textrm{L}\textrm{R}}+\hat{\tau}_{\textrm{L}}^{2}}\right\}\right\}. \qquad (12)

Once again, the maximum and minimum operations ensure that \hat{\alpha}_{\textrm{L}}\in[0,1]. Subsequently, by plugging \hat{\alpha}_{\textrm{L}} into the ensembling formula (8), we obtain the estimator

\hat{\bm{\beta}}_{C}^{d}=\hat{\alpha}_{\textrm{L}}\hat{\bm{\beta}}_{\textrm{L}}^{d}+(1-\hat{\alpha}_{\textrm{L}})\hat{\bm{\beta}}_{\textrm{R}}^{d}. \qquad (13)

For any fixed tuning parameter pair (\lambda_{\textrm{L}},\lambda_{\textrm{R}}), \hat{\bm{\beta}}_{C}^{d} achieves the lowest mean squared error (MSE) for estimating each coordinate among all possible ensembled estimators of the form (8). Further, the mean square error of this estimator is given by

\hat{\tau}_{C}^{2}=\hat{\alpha}_{\textrm{L}}^{2}\hat{\tau}_{\textrm{L}}^{2}+2\hat{\alpha}_{\textrm{L}}(1-\hat{\alpha}_{\textrm{L}})\hat{\tau}_{\textrm{L}\textrm{R}}+(1-\hat{\alpha}_{\textrm{L}})^{2}\hat{\tau}_{\textrm{R}}^{2}. \qquad (14)

Note that \hat{\tau}_{C}^{2} is \tilde{\tau}_{C}^{2} evaluated at \alpha_{\textrm{L}}=\hat{\alpha}_{\textrm{L}}. Observe that both \hat{\bm{\beta}}_{C}^{d} and \hat{\tau}^{2}_{C} are functions of the initially selected tuning parameter pair (\lambda_{\textrm{L}},\lambda_{\textrm{R}}). Therefore, to construct a competitive heritability estimator, we minimize \hat{\tau}_{C}^{2} over a range of tuning parameter values, compute the corresponding \hat{\bm{\beta}}_{C}^{d}, and construct a heritability estimator according to (11) using this \hat{\bm{\beta}}_{C}^{d} and the minimum \hat{\tau}_{C}^{2} (in place of \tilde{\bm{\beta}}_{C}^{d} and \tilde{\tau}_{C}^{2}, respectively). This yields our final estimator, HEDE.
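For completeness, here is a direct transcription of (12) and (14) as a small helper; the guard against a degenerate denominator is our addition, not part of the formulas.

```python
import numpy as np

def optimal_ensemble_weight(tau2_L, tau2_R, tau_LR):
    """Ensemble weight (12), clipped to [0, 1], and its estimated MSE (14)."""
    denom = tau2_R - 2.0 * tau_LR + tau2_L
    alpha_L = 0.5 if denom <= 0 else (tau2_R - tau_LR) / denom  # guard: our addition
    alpha_L = float(np.clip(alpha_L, 0.0, 1.0))
    tau2_C = (alpha_L ** 2 * tau2_L
              + 2.0 * alpha_L * (1.0 - alpha_L) * tau_LR
              + (1.0 - alpha_L) ** 2 * tau2_R)
    return alpha_L, tau2_C
```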

3.4 Correlated Designs

In the previous subsections, we discussed our method in the setting of independent covariates. In practical situations, covariates are typically correlated; in GWAS, SNPs are in LD. Thus it is more realistic to assume that each row \bm{x}_{i\bullet} of our design matrix takes the form \bm{z}_{i\bullet}\bm{\Sigma}^{1/2}, where \bm{z}_{i\bullet} has independent entries and \bm{\Sigma} denotes the covariance of \bm{x}_{i\bullet}. Hence, if we can develop an accurate estimate \hat{\bm{\Sigma}} of the covariance \bm{\Sigma}, then \bm{x}_{i\bullet}\hat{\bm{\Sigma}}^{-1/2} should have roughly independent entries. Alternatively, the whitened design matrix \bm{X}\hat{\bm{\Sigma}}^{-1/2} should have roughly identity covariance, and our methodology from the prior section would apply directly.

Estimation of high-dimensional covariance matrices poses challenges and has been a rich area of research. Here, we consider a special class of covariance matrices. Our motivation stems from GWAS data, where chromosomes can be partitioned into a large number of approximately independent LD blocks (berisa2016approximately), and the number of SNPs in each LD block is much smaller than the total number of SNPs. To reflect this, we assume the population covariance matrix is block-diagonal. If the size of each block is negligible compared to the total sample size, the block-wise sample covariance matrix estimates the corresponding block of the population covariance matrix well. We establish in Supplementary Materials, Proposition A.1, that stitching together these block-wise sample covariances provides an accurate estimate of the population covariance matrix in the operator norm, i.e., \|\hat{\bm{\Sigma}}-\bm{\Sigma}\|_{\textnormal{op}}\rightarrow 0. The block diagonal approximation suffices for our purposes (see Section 6).

Other scenarios exist where accurate estimates of the population covariance matrix are available, but we will not discuss these further. Our final procedure, therefore, first uses the observed data to calculate \hat{\bm{\Sigma}}, then uses \hat{\bm{\Sigma}} to whiten the covariates, and finally applies HEDE to the whitened data.
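A minimal sketch of this whitening step is given below, assuming the LD block boundaries are supplied (e.g., taken from berisa2016approximately); the small eigenvalue floor is a numerical safeguard we add and is not part of the method's description.

```python
import numpy as np

def whiten_block_diagonal(X, blocks):
    """Whiten X under an assumed block-diagonal population covariance.

    blocks : list of integer index arrays, one per LD block.
    Each block's covariance is estimated by its sample covariance, and
    the corresponding columns of X are multiplied by Sigma_hat^{-1/2}.
    """
    X = np.asarray(X, dtype=float)
    X_white = np.empty_like(X)
    for idx in blocks:
        S = np.atleast_2d(np.cov(X[:, idx], rowvar=False))  # block-wise sample covariance
        w, V = np.linalg.eigh(S)
        w = np.clip(w, 1e-8, None)                           # eigenvalue floor (our addition)
        inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
        X_white[:, idx] = X[:, idx] @ inv_sqrt
    return X_white
```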

Table 1: Heritability Estimation via Debiased Ensembles (HEDE)
Input: \bm{y},\bm{X},t
Output: \hat{h}^{2}
0. If necessary, estimate \hat{\bm{\Sigma}} as block-diagonal and whiten the data.
1. Fix a range [t_{\min},t_{\max}]\subset(0,1). For all \lambda_{\textrm{L}},\lambda_{\textrm{R}}:
  Calculate \hat{\bm{\beta}}_{\textrm{L}},\hat{\bm{\beta}}_{\textrm{R}} in (4).
  If \frac{\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{\textrm{L}})\|_{0}}{n}\notin[t_{\min},t_{\max}] or \frac{1}{n}\operatorname{Tr}((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda_{\textrm{R}}\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X})\notin[t_{\min},t_{\max}], drop (\lambda_{\textrm{L}},\lambda_{\textrm{R}}).
  Calculate \hat{\tau}_{\textrm{L}}^{2},\hat{\tau}_{\textrm{R}}^{2},\hat{\tau}_{\textrm{L}\textrm{R}} in (7).
  Calculate \hat{\alpha}_{\textrm{L}} in (12), and \hat{\tau}_{C}^{2} in (14).
2. Select (\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}):=\operatorname*{arg\,min}_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}}\hat{\tau}_{C}^{2}(\lambda_{\textrm{L}},\lambda_{\textrm{R}}). Denote \hat{\tau}_{C,\min}^{2}:=\hat{\tau}_{C}^{2}(\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}).
3. Calculate \hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d} from (5), and \hat{\bm{\beta}}_{C}^{d} from (13), corresponding to \hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}.
4. Calculate the heritability estimator using the formula
\hat{h}^{2}_{\hat{\alpha}_{\textrm{L}},\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}}=\min\left\{1,\max\left\{0,\frac{\|\hat{\bm{\beta}}_{C}^{d}\|_{2}^{2}-p\cdot\hat{\tau}_{C,\min}^{2}}{\widehat{\text{Var}}(y_{i})}\right\}\right\}, \qquad (15)
where \widehat{\text{Var}}(y_{i}) denotes the sample variance of the outcome.

We algorithmically summarize HEDE in Table 1. As is evident from step 1, assuming our hyperparameter grids have m_{\textrm{L}},m_{\textrm{R}} different values of \lambda_{\textrm{L}},\lambda_{\textrm{R}}, respectively, we need m_{\textrm{L}}+m_{\textrm{R}} glmnet calls. For the well-known ElasticNet estimator, the same grids would require m_{\textrm{L}}m_{\textrm{R}} glmnet calls, making it much more expensive than the current ensembling method. We also remark that we have not fully optimized the computational/algorithmic aspects. With more advanced techniques such as snpnet (li2022fast), HEDE could potentially be scaled to very large dimensions.
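Putting the pieces together, the sketch below mirrors Table 1 for independent (or pre-whitened) columns, reusing the helpers fit_lasso_ridge, debias_and_estimate_variances, and optimal_ensemble_weight defined earlier in this section. The grid values and the default DOF range are illustrative choices, not the authors' defaults.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def hede(X, y, lam_L_grid, lam_R_grid, t_min=0.05, t_max=0.95):
    """Sketch of the HEDE procedure summarized in Table 1."""
    n, p = X.shape
    # Step 1: only m_L + m_R regularized fits are needed in total.
    lasso_coefs = {l: Lasso(alpha=l / np.sqrt(n), fit_intercept=False,
                            max_iter=50_000).fit(X, y).coef_ for l in lam_L_grid}
    ridge_coefs = {l: Ridge(alpha=n * l, fit_intercept=False).fit(X, y).coef_
                   for l in lam_R_grid}
    best = None
    for lam_L, beta_L in lasso_coefs.items():
        for lam_R, beta_R in ridge_coefs.items():
            (beta_L_d, beta_R_d, tau2_L, tau2_R, tau_LR,
             df_L, df_R) = debias_and_estimate_variances(X, y, beta_L, beta_R, lam_R)
            # Drop pairs whose degrees of freedom (scaled by n) leave [t_min, t_max].
            if not (t_min <= df_L / n <= t_max and t_min <= df_R / n <= t_max):
                continue
            # Optimal ensemble weight (12) and its estimated MSE (14).
            alpha_L, tau2_C = optimal_ensemble_weight(tau2_L, tau2_R, tau_LR)
            if best is None or tau2_C < best[0]:
                best = (tau2_C, alpha_L * beta_L_d + (1.0 - alpha_L) * beta_R_d)
    if best is None:
        raise ValueError("no (lam_L, lam_R) pair satisfied the DOF constraint")
    tau2_C_min, beta_C_d = best
    # Step 4: clipped heritability estimate (15).
    numerator = np.sum(beta_C_d ** 2) - p * tau2_C_min
    return float(np.clip(numerator / np.var(y, ddof=1), 0.0, 1.0))
```

A call such as hede(X, y, np.logspace(-2, 0, 10), np.logspace(-2, 1, 10)) then returns a point estimate; in practice the grids and the DOF range should be set following Supplementary Materials A.2.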

4 Theoretical Properties

We turn to discuss the key theoretical properties of HEDE. We first focus on the case of independent covariates. Recall that throughout, we assume an asymptotic framework where p,n\rightarrow\infty with p/n\rightarrow\delta\in(0,\infty). We work with the following assumptions on the covariates, signal, and noise.

Assumption 1.
  (i) The entries X_{ij} of \bm{X} are independent (but not necessarily identically distributed), have mean 0 and variance 1, and are uniformly sub-Gaussian.

  (ii) The norm of the signal \bm{\beta} satisfies: 0\leq\|\bm{\beta}\|_{2}^{2}\overset{P}{\rightarrow}\sigma_{\beta}^{2}\leq\sigma_{\max}^{2}<\infty.

  (iii) The noise \bm{\epsilon} has independent entries (not necessarily identically distributed) with mean 0 and bounded variance \sigma^{2} (\sigma^{2}\leq\sigma_{\max}^{2}), and the entries are uniformly sub-Gaussian. Further, \bm{\epsilon} and \bm{X} are independent.

It is worth noting that the uniform sub-Gaussianity in Assumption 1(i) implies that the \bar{G}_{j}'s from (2) are bounded. This is satisfied in GWAS, as \bar{G}_{j} is the minor allele frequency (MAF) (psychiatric2009genomewide). The remaining Assumptions 1(ii) and (iii) are weak and natural to impose. With Assumption 1 in mind, our primary goal is to establish that HEDE consistently estimates the heritability. The following theorem establishes this result under mild additional conditions on the minimum singular values of submatrices of \bm{X} (Assumption 2, Supplementary Materials A.1). The proof is provided in Supplementary Materials F.3.

Theorem 4.1.

Recall \hat{h}_{\alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}}}^{2} from (11) and \hat{h}_{\hat{\alpha}_{\textrm{L}},\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}}^{2} from (15). Under Assumptions 1 and 2, the following conclusions hold:

  (i) \sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}],\,\alpha_{\textrm{L}}\in[0,1]}\left|\hat{h}_{\alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}}}^{2}-h^{2}\right|\overset{P}{\rightarrow}0;

  (ii) \left|\hat{h}_{\hat{\alpha}_{\textrm{L}},\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}}^{2}-h^{2}\right|\overset{P}{\rightarrow}0.

Theorem 4.1 states that HEDE is consistent when \lambda_{\textrm{L}},\lambda_{\textrm{R}} and \alpha_{\textrm{L}} are arbitrarily chosen in [\lambda_{\min},\lambda_{\max}] and [0,1], respectively. It follows that HEDE is consistent using our proposed choices \hat{\alpha}_{\textrm{L}},\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}. Here [\lambda_{\min},\lambda_{\max}] denotes the range of \lambda_{\textrm{L}},\lambda_{\textrm{R}} that we use; see details in Supplementary Materials A.2.

Proving consistency for given fixed choices of these hyperparameters is relatively straightforward. However, establishing the uniform guarantee is non-trivial. It requires analyzing the debiased Lasso and ridge estimators jointly, and furthermore, pinning down this joint behavior uniformly over all hyperparameter choices. We achieve both these goals under relatively mild assumptions on the covariate and noise distributions. Once we establish the uniform control, Part (ii) follows immediately since the uniform result provides consistency for the particular choices \hat{\alpha}_{\textrm{L}},\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}} produced by our algorithm. Below, we present the key result necessary for proving Theorem 4.1.

Proposition 4.2.

Recall \tilde{\bm{\beta}}_{C}^{d} from (8) and \tilde{\tau}_{C}^{2} from (9). Under Assumptions 1 and 2,

\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\ \sup_{\alpha_{\textrm{L}}\in[0,1]}\left|\|\tilde{\bm{\beta}}_{C}^{d}\|_{2}^{2}-\|\bm{\beta}\|_{2}^{2}-p\cdot\tilde{\tau}_{C}^{2}\right|\overset{P}{\rightarrow}0.

We now discuss how Proposition 4.2 leads to Theorem 4.1. Proposition 4.2 equivalently provides a consistent estimator for \|\bm{\beta}\|_{2}^{2}, which is directly related to h^{2} by a factor of the population variance of y_{i}. Since the population variance of y_{i} can be consistently estimated by the sample variance of the observed outcomes, we can safely replace the former by the latter in the denominator. Theorem 4.1(i) then follows upon using the fact that clipping does not affect consistency.

Proposition 4.2 is proved in Supplementary Materials F.2. Here we briefly outline our main idea. Recall from (8) that \tilde{\bm{\beta}}_{C}^{d} is a linear combination of \hat{\bm{\beta}}_{\textrm{L}}^{d} and \hat{\bm{\beta}}_{\textrm{R}}^{d}. To develop uniform guarantees for \tilde{\bm{\beta}}_{C}^{d}, it therefore suffices to establish the corresponding uniform guarantees for the debiased Lasso and ridge jointly, which we establish in the following theorem.

Theorem 4.3.

Recall the estimators (5). If Assumptions 1 and 2 hold, then for any 1-Lipschitz function \phi_{\beta}:(\mathbb{R}^{p})^{3}\rightarrow\mathbb{R}, we have

\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})]|\overset{P}{\rightarrow}0,

where (\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d})=(\bm{\beta}+\bm{g}_{\textrm{L}}^{f},\bm{\beta}+\bm{g}_{\textrm{R}}^{f}) with (\bm{g}_{\textrm{L}}^{f},\bm{g}_{\textrm{R}}^{f})\sim\mathcal{N}(0,\bm{S}\otimes\bm{I}_{p}), where \bm{S}=\begin{pmatrix}\tau_{\textrm{L}}^{2}&\tau_{\textrm{L}\textrm{R}}\\ \tau_{\textrm{L}\textrm{R}}&\tau_{\textrm{R}}^{2}\end{pmatrix} is formally defined via the system of equations (24).

Theorem 4.3 is proved in Supplementary Materials D. It states that the joint “Gaussianity” described in (6) holds at the population level for Lipschitz functions of the estimators, in addition to holding on a per-coordinate basis. While results of this nature have appeared in the recent literature on exact high-dimensional asymptotics, the majority of these works focus on a single regularized estimator (or its debiased version) with a fixed tuning parameter value. Theorem 4.3 provides the first joint characterization of multiple debiased estimators that is uniform over a range of tuning parameter values.

Among related works, celentano2021cad established a joint characterization of two estimators and their debiased versions for given fixed tuning parameter choices. Meanwhile, miolane2021distribution characterized certain properties of the Lasso and the debiased Lasso uniformly across tuning parameter values. However, developing a heritability estimator that performs well in practice across sparse to dense signals requires the stronger result of the form in Theorem 4.3. Dealing with the debiased Lasso and ridge jointly, while requiring uniform characterization over tuning parameters as well as allowing sub-Gaussian covariates, poses unique challenges. We overcome these challenges in our work, building upon miolane2021distribution, celentano2020lasso, celentano2021cad, and universality techniques proposed in han2022universality, which in turn rely on the Convex Gaussian Min-max Theorem (CGMT) (thrampoulidis2015regularized). We detail the connections and differences in the Supplementary Materials.

Note that Theorem 4.3 involves the variance-covariance parameters \tau_{\textrm{L}}^{2},\tau_{\textrm{L}\textrm{R}},\tau_{\textrm{R}}^{2} (formal definitions are deferred to (23) and (24) in the interest of space). We establish in Supplementary Materials, Theorem F.1, that \hat{\tau}_{\textrm{L}}^{2},\hat{\tau}_{\textrm{R}}^{2},\hat{\tau}_{\textrm{L}\textrm{R}}, defined in (7), consistently estimate these parameters. Furthermore, we show that this consistency holds uniformly over the required range of tuning parameter values. This result, coupled with Theorem 4.3, yields our key Proposition 4.2.

Finally, we turn to the case where the observed features are correlated. Recall from Section 3.4 that here we consider the design matrix to take the form \bm{X}=\bm{Z}\bm{\Sigma}^{1/2}, where \bm{Z} now has independent entries and satisfies the conditions in Assumption 1(i). Thus, our previous theorems apply to \bm{Z} (which is unobserved), but not to the observed \bm{X}. Applying whitening with the estimate \hat{\bm{\Sigma}} results in final features of the form \bm{X}\hat{\bm{\Sigma}}^{-1/2}=\bm{Z}\bm{\Sigma}^{1/2}\hat{\bm{\Sigma}}^{-1/2}. When \hat{\bm{\Sigma}} approximates \bm{\Sigma} accurately, \bm{\Sigma}^{1/2}\hat{\bm{\Sigma}}^{-1/2} behaves as the identity matrix \bm{I}. In this case, \bm{X}\hat{\bm{\Sigma}}^{-1/2} is close to \bm{Z}, which satisfies our previous Assumption 1. HEDE therefore continues to enjoy uniform consistency guarantees, as formalized below.

Theorem 4.4.

Consider the setting of Theorem 4.1, except that the observed features now take the form \bm{X}=\bm{Z}\bm{\Sigma}^{1/2}, where \bm{Z} satisfies Assumptions 1(i) and 2(i). Let \bm{\Sigma} be a sequence of deterministic symmetric matrices whose eigenvalues are bounded in [1/M,M] for some M>1. Let \hat{\bm{\Sigma}} be a sequence of random matrices such that \|\hat{\bm{\Sigma}}-\bm{\Sigma}\|_{\textnormal{op}}\overset{P}{\rightarrow}0. If \bm{A}=\bm{\Sigma}^{1/2}\hat{\bm{\Sigma}}^{-1/2} satisfies Assumption 3 and \hat{h}^{2}_{\alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}}} denotes the estimator (11) calculated using the whitened data (\bm{y},\bm{X}\hat{\bm{\Sigma}}^{-1/2}), then

\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}],\,\alpha_{\textrm{L}}\in[0,1]}\left|\hat{h}_{\alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}}}^{2}-h^{2}\right|\overset{P}{\rightarrow}0. \qquad (16)

Thus, the corresponding HEDE estimator \hat{h}^{2}_{\hat{\alpha}_{\textrm{L}},\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}} is also consistent for h^{2}.

Although we present this theorem for a general sequence of covariance matrices \bm{\Sigma} that allow accurate estimation in operator norm, we use it only for the block-diagonal covariances relevant to our application, where the block sizes are relatively small compared to the sample size. In Supplementary Materials, Proposition A.1, we establish that for such covariance matrices, the block-wise sample covariance matrix provides such an accurate estimator. Theorem 4.4 is proved in Supplementary Materials G. The key idea lies in establishing that the estimation error incurred by replacing \bm{\Sigma} with \hat{\bm{\Sigma}} does not affect the uniform consistency properties of HEDE.

5 Simulation Results

We evaluate the performance of HEDE through simulation studies. In Section 5.1, we compare HEDE to some representative random effects methods for estimating the heritability under different signal structures. In Section 5.2, we compare HEDE against several fixed effect based methods for estimating the heritability.

5.1 Comparison with random effects methods

Given the widespread adoption of the random effects assumption in genetics, it is important to benchmark HEDE against various random-effects-based methods. A diverse array of such methods exists, each based on a different random effects assumption. We show that a random-effects-based method fails to provide unbiased heritability estimates when its corresponding assumptions are violated (i.e., when signals are not generated as assumed). In contrast, HEDE consistently offers unbiased estimates (albeit with a slightly higher variance).

For conciseness, we focus on three representative random effects methods: GREML-SC (yang2011gcta), which is effectively GCTA and assumes no stratification; GREML-MS (lee2013estimation), a modified version of GREML that accounts for minor allele frequency (MAF) stratification; and GREML-LDMS (yang2015genetic), a modified version of GREML that accounts for both linkage disequilibrium (LD) level and MAF stratification. Their corresponding random effects assumptions state that within the respective stratifications, the signal entries are i.i.d. draws from a zero-inflated normal distribution (equivalently, the non-zero entries are randomly/evenly located and normally distributed).

For our simulations, we use genotype data from unrelated white British individuals in the UK Biobank dataset (sudlow2015uk). We select common variants with a minor allele frequency (MAF) greater than 1\% from the UK Biobank Axiom array using PLINK. Additionally, we exclude SNPs with genotype missingness over 1\% and Hardy-Weinberg disequilibrium p-values greater than 10^{-6}. We impute the remaining small number of missing values as 0. This results in a design matrix \tilde{\bm{X}} with n=332,430 individuals and p=533,169 variants across 22 chromosomes, which is then normalized so each column has mean 0 and variance 1. To expedite the process, our simulation analysis focuses on chromosome 22. We use 100 randomly sampled disjoint subsets of 3000 individuals each to simulate 100 independent random draws of \bm{X}. We then generate a random \bm{\beta} based on different assumptions and generate \bm{y} following the linear model in (1). This results in 100 heritability estimates, obtained using the GREML methods and HEDE. We then compare the average of these estimates with the true population heritability value.

While our formal mathematical guarantees are established under the fixed effects assumption, it is desirable to stress-test the performance of our method in the random effects simulation settings, especially since we wish to benchmark our method against popular methods from statistical genetics. We further discuss the connection between the random effect assumption and the fixed effect assumption in terms of the true population parameter (Supplementary Materials B.1), with the discussion aimed at explaining why the empirical comparison we perform here is reasonable, given the methods were developed under different assumptions.

For the GREML methods, we use 4 LD bins generated from the quantiles of individual LD values (evans2018comparison), and 3 MAF bins: [0.01,0.05], [0.05,0.1], and [0.1,0.5]. For HEDE, we perform the whitening step by estimating the population covariance \bm{\Sigma} as block diagonal, with blocks specified in (berisa2016approximately). We estimate the block-wise population covariances using sample covariances. Given that n=332,430 and p<1000 for each LD block, this approach results in minimal high-dimensional error.

Figure 1: Box plots comparing three random effects methods, namely GREML with different stratifications, with our fixed effects method HEDE. Method abbreviations: "SC": single-component (yang2011gcta); "MS": MAF stratified; "LDMS": LD and MAF stratified (yang2015genetic). Different rows use different true \|\bm{\beta}\|_{2} values. Each column indicates a different distribution from which the true signal is drawn: (a) zero-inflated normal, where the nonzero entries are uniformly distributed (no stratification); (b) zero-inflated normal, where the nonzero entries have different densities in each LDMS stratification; (c) zero-inflated mixture of two normals, where the nonzero entries have different densities in each (consecutive) LD block; (d) zero-inflated mixture of two normals, where the nonzero entries have different densities in each (consecutive) LD block and in each LDMS stratification. The red dotted horizontal line represents the true heritability value (which varies depending on the signal distribution). Boxes represent distributions of different methods' estimates from 100 independent draws of \bm{X} (from UK Biobank chromosome 22, 3000 individuals each) and \bm{\beta} (from the aforementioned distributions). GREML methods show biases under model misspecification while HEDE remains unbiased. See details in Section 5.1.

To assess the robustness of different methods against misspecification of typical random effects assumptions on the regression coefficients, we created 4 types of random signals. These signals vary in the locations and distributions of non-zero entries (causal variants), as shown in Figure 1. In Figure 1a, the non-zero locations of \bm{\beta} are uniformly distributed, with non-zero components normally distributed. Here the assumptions of all three GREML methods are satisfied. In Figure 1b, the non-zero locations of \bm{\beta} vary in concentration across LDMS strata, with non-zero components normally distributed. Here only the assumption of GREML-LDMS is satisfied. In Figure 1c, the non-zero locations of \bm{\beta} are more concentrated in the last LD block, with non-zero components distributed as a mixture of two normals (centered symmetrically around zero). Here none of the assumptions of the three GREML methods is satisfied. In Figure 1d, the non-zero locations of \bm{\beta} vary in concentration across both LD blocks and LDMS strata, with non-zero components distributed as a mixture of two normals. Here none of the assumptions of the three GREML methods is satisfied, with varying degrees of violation. The full details of the signal generation process are described in Supplementary Materials B.2.

Figure 1 clearly demonstrates that the GREML methods are susceptible to signal distribution misspecification, exhibiting increased bias as the extent of assumption violation rises. In contrast, HEDE is robust and provides unbiased estimates in all cases, despite exhibiting higher variance. This pattern remains robust across various levels of true population heritability (0.1, 0.25, and 0.5 shown).

We believe these findings are likely to apply to other random-effects-based methods. A random effects method can only guarantee unbiased estimates when its underlying random effects assumptions are satisfied, a condition that is often difficult to verify with real data. In contrast, HEDE is not susceptible to such biases, demonstrating that the fixed effects based HEDE method applies under far more diverse signal structures.

5.2 Comparison of HEDE with other fixed effects methods

In this section, we compare the performance of HEDE with other fixed effects heritability estimation methods. Recall we showed in Section 3.4 and Theorem 4.4 that HEDE remains consistent under accurate covariance estimation strategies. Also, fixed effects heritability estimation methods all assume \bm{\Sigma} is known or can be accurately estimated. In light of these facts, we assume without loss of generality that \bm{\Sigma}=\bm{I}, in order to directly compare the relative performance of fixed effects methods. To mimic discrete covariates in genetic scenarios, we generate the design matrix \bm{X}\in\mathbb{R}^{n\times p} via (2), where G_{ij}\sim\text{Binomial}(2,\pi_{j}),\forall i,j and \pi_{j}\sim\text{Unif}[0.01,0.5],\forall j, representing common variants with a minor allele frequency of at least 1\%. We generate the true signal \bm{\beta}\in\mathbb{R}^{p} as i.i.d. zero-inflated normals with non-zero coordinate weight \kappa and heritability h^{2}: \beta_{j}\overset{i.i.d.}{\sim}\kappa\cdot\mathcal{N}(0,h^{2}/(p\kappa(1-h^{2})))+(1-\kappa)\cdot\delta_{0}. We generate the noise \bm{\epsilon}\in\mathbb{R}^{n} with i.i.d. \mathcal{N}(0,1) entries. We vary n,p,\kappa,h^{2} across our simulation settings.
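A minimal data-generation sketch following the design just described is given below; it conveys the flavor of this setup rather than reproducing the authors' exact scripts, and the function name, seeding, and the MAF clipping guard are ours.

```python
import numpy as np

def simulate_fixed_effects(n, p, kappa, h2, rng=None):
    """Generate (X, y, beta) following the design of Section 5.2.

    Genotypes G_ij ~ Binomial(2, pi_j) with pi_j ~ Unif[0.01, 0.5],
    standardized as in (2); zero-inflated normal signal with non-zero
    weight kappa and target heritability h2; N(0, 1) noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    pi = rng.uniform(0.01, 0.5, size=p)
    G = rng.binomial(2, pi, size=(n, p))
    maf = G.sum(axis=0) / (2.0 * n)
    maf = np.clip(maf, 1e-6, 1.0 - 1e-6)     # guard against monomorphic columns (our addition)
    X = (G - 2.0 * maf) / np.sqrt(2.0 * maf * (1.0 - maf))
    beta = np.zeros(p)
    nonzero = rng.random(p) < kappa
    beta[nonzero] = rng.normal(0.0, np.sqrt(h2 / (p * kappa * (1.0 - h2))),
                               size=int(nonzero.sum()))
    y = X @ beta + rng.normal(0.0, 1.0, size=n)
    return X, y, beta
```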

For each simulation setting, we compare HEDE with a Lasso-based method, AMP (bayati2013estimating) (we label this method AMP since it was built on approximate message passing algorithms), and three ridge-based methods: Dicker's moment-based method (dicker2014variance), Eigenprism (janson2017eigenprism), and Estimating Equation (chen2022statistical). For AMP, bayati2013estimating does not provide a strategy for tuning the regularization parameter. Therefore, we choose the tuning parameter based on cross validation (specifically, the Approximate Leave-One-Out (ALO) approach of rad2020scalable) performed over 10 smaller datasets with the same values of \delta,h^{2},\kappa. All aforementioned methods operate under similar fixed effects assumptions as HEDE, except that Eigenprism only handles n\leq p in its current form. We first compared these methods in terms of bias and found that they all remain unbiased across a wide variety of settings. In Supplementary Materials B.3, we demonstrate their unbiasedness for a specific choice of n,p,h^{2},\kappa. In light of this, here we present comparisons between these estimators in terms of their mean square errors (MSEs).

Figure 2: Ratio of the MSEs (called relative MSE) of different fixed effects methods (HEDE (ours), AMP (bayati2013estimating), EigenPrism (janson2017eigenprism), MM (dicker2014variance), EstEqn (chen2022statistical)) with respect to HEDE. Thus the baseline is 1, represented by the orange line. The design matrices satisfy (2), where G_{ij}\sim\text{Binomial}(2,\pi_{j}),\forall i,j and \pi_{j}\sim\text{Unif}[0.01,0.5],\forall j, with dimensions n,p. Recall that signals are drawn from i.i.d. zero-inflated normals with non-zero coordinate weight \kappa and true heritability h^{2}: \beta_{j}\overset{i.i.d.}{\sim}\kappa\cdot\mathcal{N}(0,h^{2}/(p\kappa(1-h^{2})))+(1-\kappa)\cdot\delta_{0}. We generate 10 draws of signals with 100 draws of design matrices and display average MSEs. Parameter values: n=1000, p=10000; \kappa varies across panels as indicated in the subtitles, and h^{2} varies along the x-axis as indicated by the label. See full details in Section 5.2.

In Figure 2, we display the relative MSEs of heritability estimates in relation to \kappa and h^{2}, while fixing the sample size n=1000 and the number of variants p=10000. We vary the sparsity from 0.001 to 0.3 on a roughly evenly spaced log scale, and vary the heritability from 0.01 to 0.7. This range mimics the potential true heritability values for traits such as autoimmune disorders (lower heritability) and height (higher heritability) (hou2019accurate). Although the methods under comparison operate under fixed effects assumptions, we still simulate 10 signals to mitigate the influence of any particular signal choice. For each signal, we generate 100 random samples and calculate the MSE for estimating heritability. We then average the error over the 10 signals.

Figure 3: Settings same as Figure 2. Parameter values: h^{2}=0.5, \kappa=0.1; n varies across panels as indicated in the subtitles, and p varies along the x-axis as indicated by the label.

We observe that HEDE outperforms ridge-type methods particularly when the true signal is sparse and relatively strong, while performing comparably in other scenarios. Since HEDE utilizes the strengths of both the Lasso and ridge, the performance gain over pure ridge-based methods, such as EigenPrism, MM and EstEqn, is natural. The pure Lasso-based method (AMP, green) performs sub-optimally compared to all the others, and its higher MSE values dominate the y-axis range. For a closer view, we present a zoomed-in version of the plot in Figure 5, where we exclude AMP (we defer this figure to Supplementary Materials B.3 for space). Figures 2 and 5 clearly demonstrate the efficacy of our ensembling strategy: it leads to a superior heritability estimator across signals ranging from less sparse to highly sparse.

To better investigate our relative performance, in Figure 3 we consider a specific setting (\kappa=0.1, h^{2}=0.5) where HEDE has a moderate advantage in Figure 2. We vary n and p so that the ratio \delta=p/n ranges from 0.25 to 20. We clearly observe HEDE's advantage across a broad spectrum of configurations. Note that Eigenprism's results are missing for n>p as it currently only supports n\leq p. Also observe that AMP's relative performance degrades sharply with larger \delta values, as shown in the first panel of Figure 3. This explains its sub-optimal performance in Figure 2, where \delta=10. Figure 3 demonstrates that HEDE leads to superior performance over a wide range of sample sizes, dimensions and dimension-to-sample ratios. It consistently outperforms the other methods for the setting examined.

To determine whether HEDE's superior performance extends to broader contexts, we conducted additional investigations. We examined settings where the designs have larger sub-Gaussian norms, as well as settings where the non-zero entries of the signal were drawn from non-normal distributions and/or distributed with stratifications (see Section 5.1). Results for the former investigation are presented in Supplementary Materials B.4; we omit those for the latter since the performances look similar. Across the board, we observed that HEDE maintains its competitive advantage over other methods. We also investigated a Lasso-based method, CHIVE (tony2020semisupervised), in our experiments. However, CHIVE exhibited considerable bias in many settings, particularly when its underlying sparsity assumptions were violated. For this reason, we omit CHIVE from our result displays.

6 Real data application

We next describe our real data experiments. We apply HEDE to the unrelated white British individuals from the UK Biobank GWAS data, which contain common variants (sudlow2015uk), to estimate the heritability of two commonly studied phenotypes: height and BMI. For the design matrix \bm{X}, we follow the preprocessing steps outlined in Section 5.1, but include all 22 chromosomes, unlike Section 5.1 which considers only chromosome 22. For each phenotype, we exclude individuals with missing values. We center the outcomes and denote the resultant vector by \tilde{\bm{y}}. We account for non-genetic factors by including covariates for age, age^2, sex, as well as 40 ancestral principal components. We derive the final response vector \bm{y} by regressing out the influence of these non-genetic covariates from \tilde{\bm{y}} using Ordinary Least Squares (OLS). The large sample size of approximately 330,000 individuals compared to only 43 non-genetic covariates ensures that OLS can be confidently applied.
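A minimal sketch of this covariate-adjustment step follows; the names are illustrative, and the matrix of non-genetic covariates (age, age^2, sex, and the 40 principal components) is assumed to have been assembled already.

import numpy as np

def residualize(y_centered, covariates):
    # Append an intercept column to the non-genetic covariates
    W = np.column_stack([np.ones(len(y_centered)), covariates])
    # OLS fit of the centered phenotype on the covariates
    coef, *_ = np.linalg.lstsq(W, y_centered, rcond=None)
    # The residuals serve as the final response vector y passed to HEDE
    return y_centered - W @ coef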

We apply HEDE to the preprocessed \bm{X} and \bm{y}. For comparison, we also apply GREML-SC and GREML-LDMS (see Section 5.1 for details) from the random effects methods family, as well as MM and AMP from the fixed effects methods family. We did not include Eigenprism and EstEqn since their runtimes are both O(n^{3}) (assuming p/n\rightarrow\delta), so they could not be completed within a reasonable time frame. We calculate standard errors using the standard deviation of point estimates from 15 disjoint random subsets, each containing 20,000 individuals. We use disjoint subsets because, in our high-dimensional setting, traditional resampling techniques for estimating standard errors, such as the bootstrap and subsampling, are known to fail (el2018can; clarte2024analysis; bellec2024asymptotics); corrections have been proposed for estimating standard errors when inferring individual regression coefficients and in closely related problems, but such strategies for heritability estimation are yet to be developed, so disjoint subsets of the observed data serve to mimic independent training draws from the population. Additionally, due to memory constraints, many methods (including HEDE) could not process all 22 chromosomes simultaneously. Therefore, for all methods, we estimate the total heritability by summing the heritabilities calculated for each chromosome separately. This approach is expected to introduce minimal error, given the widely accepted assumption of independence across chromosomes.
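The two aggregation steps just described are simple; a minimal sketch follows with illustrative names, assuming the per-chromosome estimates and per-subset point estimates have already been computed.

import numpy as np

def total_heritability(per_chromosome_estimates):
    # Sum per-chromosome heritabilities, relying on the assumed
    # independence across chromosomes
    return float(np.sum(per_chromosome_estimates))

def subset_standard_error(subset_point_estimates):
    # Standard error taken as the standard deviation of point estimates
    # computed on disjoint subsets of individuals (15 subsets of 20,000 here)
    return float(np.std(subset_point_estimates, ddof=1))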

Trait HEDE MM AMP GREML-SC GREML-LDMS
Height 0.681 0.813 0.436 0.741 0.605
(SE) (7.47E-02) (9.30E-02) (3.01E-01) (2.87E-02) (3.19E-02)
BMI 0.279 0.369 0.510 0.316 0.265
(SE) (5.20E-02) (6.66E-02) (3.52E-01) (2.85E-02) (3.22E-02)
Table 2: Estimates of heritability from HEDE (ours), MM (dicker2014variance), AMP (bayati2013estimating), GREML-SC (yang2011gcta) and GREML-LDMS (yang2015genetic) for certain traits in the UKB dataset. 533,169 SNPs are present across 22 chromosomes. For fixed effects methods, we assume independence across LD blocks from berisa2016approximately for covariance estimation. Standard errors are computed from 15 disjoint random subsets of 20,000 individuals each. See full details in Section 6.

We summarize the results in Table 2. First, we observe that MM and AMP estimates differ more from HEDE than the other methods' estimates do. This is consistent with Figure 2, where AMP often has larger MSE than the other fixed effects methods; we therefore expect AMP estimates to differ noticeably from those of HEDE and MM. For MM, the situation is more nuanced. Prior work suggests the true height heritability lies at the higher end of the range shown in Figure 2. In this region, MM has higher MSE than HEDE across all sparsity levels, so MM estimates of height heritability may be less accurate than HEDE's. For BMI, prior work provides heritability estimates within the range 0.285 to 0.436 using various methods. Referring to Figure 2, we see that MM's relative MSE compared to HEDE depends on the sparsity level in this BMI heritability range. Without knowing the exact sparsity, it is difficult to draw conclusions. However, we still observe the general trend of MM producing larger estimates than other methods for BMI, similar to height.

Comparing HEDE to the two random effects methods, we observe that for both height and BMI, HEDE point estimates lie between the GREML-SC and GREML-LDMS estimates, with the latter always undershooting. This is consistent with the trend observed in the last column of Figure 1. This difference may indicate a heterogeneous underlying genetic signal, particularly of a kind that differentiates the SC and LDMS approaches within the GREML framework.

7 Discussion

In this paper, we introduce HEDE, a new method for heritability estimation that harnesses the strengths of \ell_{1} and \ell_{2} regression in high-dimensional settings through a sophisticated ensemble approach. We develop data-driven techniques for selecting an optimal ensemble and fine-tuning key hyperparameters, aiming to enhance statistical performance in heritability estimation. As a result, we present a competitive estimator that surpasses existing fixed effects heritability estimators across diverse signal structures and heritability values. Notably, our approach circumvents bias issues inherent in random effects methods. In summary, our contribution offers a dependable heritability estimator tailored for high-dimensional genetic problems, effectively accommodating underlying signal variations. Importantly, our method maintains consistency guarantees even with adaptive tuning and under minimal assumptions on the covariate distribution. We validate the efficacy of our approach through comprehensive simulations and real-data analysis.

HEDE's computational complexity is primarily driven by the Lasso/ridge solutions, making it more efficient than many existing heritability estimators (e.g., Eigenprism (janson2017eigenprism), which requires an SVD, and EstEqn (chen2022statistical), which involves chained matrix inversions). HEDE is potentially scalable to much larger dimensions via snpnet (rad2020scalable). Furthermore, with proper Lasso/ridge solving techniques, HEDE only needs knowledge of the summary statistics \bm{X}^{\top}\bm{X} and \bm{X}^{\top}\bm{y}. This makes it an attractive option in situations where sharing individual-level data is difficult due to privacy concerns. However, it is important to note that these discussions assume the availability of an accurate estimate of the LD blocks and their strengths. This is a common limitation that other fixed effects methods also face. Any potential errors in covariance estimation could naturally lead to additional errors in heritability estimation. Therefore, improving and scaling covariance estimation in high dimensions could independently benefit heritability estimation. We manage to circumvent this issue by leveraging the hypothesis of independence across chromosomes, which significantly aids us computationally.
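To illustrate the summary-statistics point, the sketch below computes the ridge component and its degrees-of-freedom from X^T X and X^T y alone; this is only one ingredient of HEDE, shown here because its summary-statistics form is explicit, and the function name is illustrative.

import numpy as np

def ridge_from_summary_stats(XtX, Xty, n, lam):
    # Ridge estimate from summary statistics: solve (X^T X / n + lam I) beta = X^T y / n
    p = XtX.shape[0]
    A = XtX / n + lam * np.eye(p)
    beta_ridge = np.linalg.solve(A, Xty / n)
    # Ridge degrees-of-freedom: Tr((X^T X / n + lam I)^{-1} X^T X / n)
    df_ridge = np.trace(np.linalg.solve(A, XtX / n))
    return beta_ridge, df_ridge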

Our theoretical framework relied on two pivotal assumptions: the independence of observed samples and a sub-Gaussian tailed distribution for covariates. Natural settings occur where these assumptions are violated. Within the genetics context, familial relationships and repeated measures in longitudinal studies among observed samples can introduce dependence. For the broader problem of signal-to-noise ratio estimation in various scientific fields, both assumptions might be violated. For instance, applications such as wireless communications (eldar2003asymptotic) and magnetic resonance imaging (benjamini2013shuffle) may introduce challenges such as temporal dependence or heavier-tailed covariate distributions. Exploring analogues of HEDE under such conditions presents an intriguing avenue for future study. Recent advancements in high-dimensional regression, which extend debiasing methodology and proportional asymptotics theory for regularized estimators to these complex scenarios (li2023spectrum; lahiry2023universality), offer valuable insights for further exploration. We defer these investigations to future research.

We focus on continuous traits in this paper. Discrete disease traits occur commonly in genetics. Such situations are typically modeled using logistic, probit, or binomial regression models, depending on the number of possible categorical values of the response. Adapting our current methodology to these scenarios is a crucial avenue for future research. Debiasing methodologies for generalized linear models exist in the literature. A compelling question would be to understand how various debiasing techniques can be ensembled to optimize their benefits for heritability estimation, similar to our approach in this work for the linear model.

References

  • Bayati et al., (2013) Bayati, M., Erdogdu, M. A., and Montanari, A. (2013). Estimating lasso risk and noise level. Advances in Neural Information Processing Systems, 26.
  • Bellec, (2022) Bellec, P. C. (2022). Observable adjustments in single-index models for regularized m-estimators. arXiv preprint arXiv:2204.06990.
  • Bellec and Koriyama, (2024) Bellec, P. C. and Koriyama, T. (2024). Asymptotics of resampling without replacement in robust and logistic regression. arXiv preprint arXiv:2404.02070.
  • Bellec and Zhang, (2019) Bellec, P. C. and Zhang, C.-H. (2019). Second order poincaré inequalities and de-biasing arbitrary convex regularizers when p/n → γ. arXiv preprint arXiv:1912.11943, 2:15–34.
  • Bellec and Zhang, (2022) Bellec, P. C. and Zhang, C.-H. (2022). De-biasing the lasso with degrees-of-freedom adjustment. Bernoulli, 28(2):713–743.
  • Bellec and Zhang, (2023) Bellec, P. C. and Zhang, C.-H. (2023). Debiasing convex regularized estimators and interval estimation in linear models. The Annals of Statistics, 51(2):391–436.
  • Benjamini and Yu, (2013) Benjamini, Y. and Yu, B. (2013). The shuffle estimator for explainable variance in fmri experiments. The Annals of Applied Statistics, pages 2007–2033.
  • Berisa and Pickrell, (2016) Berisa, T. and Pickrell, J. K. (2016). Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics, 32(2):283.
  • Bonnet et al., (2015) Bonnet, A., Gassiat, E., and Lévy-Leduc, C. (2015). Heritability estimation in high dimensional sparse linear mixed models.
  • Cai and Guo, (2020) Cai, T. and Guo, Z. (2020). Semisupervised inference for explained variance in high dimensional linear regression and its applications. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(2):391–419.
  • Candès and Sur, (2020) Candès, E. J. and Sur, P. (2020). The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression.
  • Celentano and Montanari, (2021) Celentano, M. and Montanari, A. (2021). Cad: Debiasing the lasso with inaccurate covariate model. arXiv preprint arXiv:2107.14172.
  • Celentano et al., (2020) Celentano, M., Montanari, A., and Wei, Y. (2020). The lasso with general gaussian designs with applications to hypothesis testing. arXiv preprint arXiv:2007.13716.
  • Chen, (2022) Chen, H. Y. (2022). Statistical inference on explained variation in high-dimensional linear model with dense effects. arXiv preprint arXiv:2201.08723.
  • Clarté et al., (2024) Clarté, L., Vandenbroucque, A., Dalle, G., Loureiro, B., Krzakala, F., and Zdeborová, L. (2024). Analysis of bootstrap and subsampling in high-dimensional regularized regression. arXiv preprint arXiv:2402.13622.
  • Dicker, (2014) Dicker, L. H. (2014). Variance estimation in high-dimensional linear models. Biometrika, 101(2):269–284.
  • Dicker and Erdogdu, (2016) Dicker, L. H. and Erdogdu, M. A. (2016). Maximum likelihood for variance estimation in high-dimensional linear models. In Artificial Intelligence and Statistics, pages 159–167. PMLR.
  • El Karoui and Purdom, (2018) El Karoui, N. and Purdom, E. (2018). Can we trust the bootstrap in high-dimensions? the case of linear models. Journal of Machine Learning Research, 19(5):1–66.
  • Eldar and Chan, (2003) Eldar, Y. C. and Chan, A. M. (2003). On the asymptotic performance of the decorrelator. IEEE Transactions on Information Theory, 49(9):2309–2313.
  • Evans et al., (2018) Evans, L. M., Tahmasbi, R., Vrieze, S. I., Abecasis, G. R., Das, S., Gazal, S., Bjelland, D. W., De Candia, T. R., Consortium, H. R., Goddard, M. E., et al. (2018). Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nature genetics, 50(5):737–745.
  • Fan et al., (2012) Fan, J., Guo, S., and Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):37–65.
  • Guo et al., (2019) Guo, Z., Wang, W., Cai, T. T., and Li, H. (2019). Optimal estimation of genetic relatedness in high-dimensional linear models. Journal of the American Statistical Association, 114(525):358–369.
  • Han and Shen, (2022) Han, Q. and Shen, Y. (2022). Universality of regularized regression estimators in high dimensions. arXiv preprint arXiv:2206.07936.
  • Hou et al., (2019) Hou, K., Burch, K. S., Majumdar, A., Shi, H., Mancuso, N., Wu, Y., Sankararaman, S., and Pasaniuc, B. (2019). Accurate estimation of snp-heritability from biobank-scale data irrespective of genetic architecture. Nature genetics, 51(8):1244–1251.
  • Hu and Li, (2022) Hu, X. and Li, X. (2022). Misspecification analysis of high-dimensional random effects models for estimation of signal-to-noise ratios. arXiv preprint arXiv:2202.06400.
  • Janson et al., (2017) Janson, L., Barber, R. F., and Candes, E. (2017). Eigenprism: inference for high dimensional signal-to-noise ratios. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):1037–1065.
  • (27) Javanmard, A. and Montanari, A. (2014a). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909.
  • (28) Javanmard, A. and Montanari, A. (2014b). Hypothesis testing in high-dimensional regression under the gaussian random design model: Asymptotic theory. IEEE Transactions on Information Theory, 60(10):6522–6554.
  • Jiang et al., (2016) Jiang, J., Li, C., Paul, D., Yang, C., and Zhao, H. (2016). On high-dimensional misspecified mixed model analysis in genome-wide association study. The Annals of Statistics, 44(5):2127–2160.
  • Jiang et al., (2022) Jiang, K., Mukherjee, R., Sen, S., and Sur, P. (2022). A new central limit theorem for the augmented ipw estimator: Variance inflation, cross-fit covariance and beyond. arXiv preprint arXiv:2205.10198.
  • Knowles and Yin, (2017) Knowles, A. and Yin, J. (2017). Anisotropic local laws for random matrices. Probability Theory and Related Fields, 169(1):257–352.
  • Lahiry and Sur, (2023) Lahiry, S. and Sur, P. (2023). Universality in block dependent linear models with applications to nonparametric regression. arXiv preprint arXiv:2401.00344.
  • Lee et al., (2013) Lee, S. H., Yang, J., Chen, G.-B., Ripke, S., Stahl, E. A., Hultman, C. M., Sklar, P., Visscher, P. M., Sullivan, P. F., Goddard, M. E., et al. (2013). Estimation of snp heritability from dense genotype data. The American Journal of Human Genetics, 93(6):1151–1155.
  • Li et al., (2022) Li, R., Chang, C., Justesen, J. M., Tanigawa, Y., Qian, J., Hastie, T., Rivas, M. A., and Tibshirani, R. (2022). Fast lasso method for large-scale and ultrahigh-dimensional cox model with applications to uk biobank. Biostatistics, 23(2):522–540.
  • Li and Sur, (2023) Li, Y. and Sur, P. (2023). Spectrum-aware adjustment: A new debiasing framework with applications to principal components regression. arXiv preprint arXiv:2309.07810.
  • Liang and Sur, (2022) Liang, T. and Sur, P. (2022). A precise high-dimensional asymptotic theory for boosting and minimum-l1-norm interpolated classifiers. The Annals of Statistics, 50(3):1669–1695.
  • Loh et al., (2015) Loh, P.-R., Tucker, G., Bulik-Sullivan, B. K., Vilhjalmsson, B. J., Finucane, H. K., Salem, R. M., Chasman, D. I., Ridker, P. M., Neale, B. M., Berger, B., et al. (2015). Efficient bayesian mixed-model analysis increases association power in large cohorts. Nature genetics, 47(3):284–290.
  • Manolio et al., (2009) Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., McCarthy, M. I., Ramos, E. M., Cardon, L. R., Chakravarti, A., et al. (2009). Finding the missing heritability of complex diseases. Nature, 461(7265):747–753.
  • Miolane and Montanari, (2021) Miolane, L. and Montanari, A. (2021). The distribution of the lasso: Uniform control over sparse balls and adaptive parameter tuning. The Annals of Statistics, 49(4):2313–2335.
  • Psychiatric GWAS Consortium Coordinating Committee, (2009) Psychiatric GWAS Consortium Coordinating Committee (2009). Genomewide association studies: history, rationale, and prospects for psychiatric disorders. American Journal of Psychiatry, 166(5):540–556.
  • Rad and Maleki, (2020) Rad, K. R. and Maleki, A. (2020). A scalable estimate of the out-of-sample prediction error via approximate leave-one-out cross-validation. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(4):965–996.
  • Rockafellar, (2015) Rockafellar, R. T. (2015). Convex analysis.
  • Sudlow et al., (2015) Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M., et al. (2015). Uk biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine, 12(3):e1001779.
  • Sun and Zhang, (2012) Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika, 99(4):879–898.
  • Sur and Candès, (2019) Sur, P. and Candès, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29):14516–14525.
  • Sur et al., (2019) Sur, P., Chen, Y., and Candès, E. J. (2019). The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square. Probability theory and related fields, 175:487–558.
  • Thrampoulidis et al., (2015) Thrampoulidis, C., Oymak, S., and Hassibi, B. (2015). Regularized linear regression: A precise analysis of the estimation error. In Conference on Learning Theory, pages 1683–1709. PMLR.
  • Van de Geer et al., (2014) Van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models.
  • Vershynin, (2010) Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.
  • Vershynin, (2018) Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press.
  • Verzelen and Gassiat, (2018) Verzelen, N. and Gassiat, E. (2018). Adaptive estimation of high-dimensional signal-to-noise ratios. Bernoulli, 24(4B):3683–3710.
  • Wainwright, (2019) Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press.
  • Wellner et al., (2013) Wellner, J. et al. (2013). Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media.
  • Yang et al., (2015) Yang, J., Bakshi, A., Zhu, Z., Hemani, G., Vinkhuyzen, A. A., Lee, S. H., Robinson, M. R., Perry, J. R., Nolte, I. M., van Vliet-Ostaptchouk, J. V., et al. (2015). Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nature genetics, 47(10):1114–1120.
  • Yang et al., (2011) Yang, J., Lee, S. H., Goddard, M. E., and Visscher, P. M. (2011). Gcta: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1):76–82.
  • Zhang and Zhang, (2014) Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B: Statistical Methodology, pages 217–242.
  • Zhao et al., (2022) Zhao, Q., Sur, P., and Candes, E. J. (2022). The asymptotic distribution of the mle in high-dimensional logistic models: Arbitrary covariance. Bernoulli, 28(3):1835–1861.
  • Zou and Hastie, (2005) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2):301–320.

Appendix A Additional Methodology Details

A.1 Additional Technical Assumptions

We present additional technical assumptions needed for our mathematical guarantees in Section 4. The following assumption is needed in the case of independent covariates:

Assumption 2.
  1. 1.

    The design matrix \bm{X} satisfies the following: for any \gamma<1, there exists a constant \kappa_{\min}>0 such that \mathbb{P}(\frac{1}{\sqrt{n}}\kappa_{-}(\bm{X},\min\{p,\gamma n\})\leq\kappa_{\min})\rightarrow 0 as n\rightarrow\infty, where \kappa_{-}(\bm{X},s) denotes the minimum singular value of \bm{X}_{S} over all subsets S of the columns with |S|\leq s.

  2. 2.

    The tuning parameters are bounded, that is, there exists λmin,λmax\lambda_{\min},\lambda_{\max} such that 0<λminλL,λRλmax<0<\lambda_{\min}\leq\lambda_{\textrm{L}},\lambda_{\textrm{R}}\leq\lambda_{\max}<\infty.

Assumption 2(1) states that the minimum singular value of \bm{X}, across all column subsets of a certain size, is lower bounded by some positive constant with high probability. This assumption is required for our universality proof. In fact, we conjecture that this assumption holds given Assumption 1(1) and Assumption 1(2) (cf. Section B.5.4 in celentano2020lasso for a proof in the Gaussian case). We leave the proof of this assumption to future work.

Assumption 2(2) restricts \lambda_{\textrm{L}},\lambda_{\textrm{R}} to a predetermined range. We require a specific lower bound \lambda_{\min}=\lambda_{\min}(\sigma_{\max}^{2},\delta). Its functional form is complicated, but it plays a crucial role in our proof in Section I (which builds upon results from han2022universality). On the other hand, \lambda_{\max} is not restricted and can take any proper value. See Section A.2 for the heuristic method that we use to determine (\lambda_{\min},\lambda_{\max}) in practice.

The following assumption is needed when the covariance is estimated in the correlated case, which effectively perturbs HEDE by a matrix \bm{A}\approx\bm{I}:

Assumption 3.

Let 𝐀p×p\bm{A}\in\mathbb{R}^{p\times p} be a sequence of random matrices. Denote 𝐀(𝐛):=12n𝐲𝐗𝐀𝐛22+λLn𝐛1\mathcal{L}_{\bm{A}}(\bm{b}):=\frac{1}{2n}\|\bm{y}-\bm{X}\bm{A}\bm{b}\|_{2}^{2}+\frac{\lambda_{\textrm{L}}}{\sqrt{n}}\|\bm{b}\|_{1} to be the perturbed Lasso cost function when we replace 𝐗\bm{X} by 𝐗𝐀\bm{X}\bm{A}, and denote 𝛃^(𝐀):=argmin𝐛𝐀(𝐛)\hat{\bm{\beta}}(\bm{A}):=\operatorname*{arg\,min}_{\bm{b}}\mathcal{L}_{\bm{A}}(\bm{b}) to be the perturbed Lasso solution. We say 𝐀\bm{A} satisfies Assumption 3 if

  1. 1.

    \sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}(\bm{A})\|\leq M for some constant M with probability converging to 1 as n\rightarrow\infty

  2. 2.

    λL,λL[λmin,λmax],λL,𝑨(𝜷^λL(𝑨))λL,𝑨(𝜷^λL(𝑨))+K|λLλL|\forall\lambda_{\textrm{L}},\lambda_{\textrm{L}}^{\prime}\in[\lambda_{\min},\lambda_{\max}],\mathcal{L}_{\lambda_{\textrm{L}}^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda_{\textrm{L}}}(\bm{A}))\leq\mathcal{L}_{\lambda_{\textrm{L}}^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda_{\textrm{L}}^{\prime}}(\bm{A}))+K|\lambda_{\textrm{L}}-\lambda_{\textrm{L}}^{\prime}| with probability converging to 11 as nn\rightarrow\infty.

The conditions in Assumption 3 are required as technical steps in our proof under sub-Gaussian designs. They are automatically satisfied when the design \bm{X} is Gaussian (see proofs in Section G).

A.2 Hyperparameter Selection

Our algorithm necessitates a tuning parameter range [\lambda_{\min},\lambda_{\max}]. Assumption 2(2) defines \lambda_{\min}(\sigma_{\max}^{2},\delta) as a function of the unknown \sigma_{\max}^{2}, together with an unconstrained \lambda_{\max}, for purely technical reasons; neither gives much practical guidance. Here we propose to determine the range via the empirical degrees-of-freedom of the Lasso and Ridge, defined in (19).

For \lambda_{\max}, there is no point increasing it indefinitely, since \hat{\bm{\beta}}_{\textrm{L}},\hat{\bm{\beta}}_{\textrm{R}} will approach \bm{0}, yielding very similar \hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d} values. Thus, we choose t_{\min}=0.01 as a lower bound on the empirical degrees-of-freedom ratio, thereby filtering out excessively large \lambda_{\textrm{L}},\lambda_{\textrm{R}} values.

For \lambda_{\min}, we similarly pick an upper bound for \frac{\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{\textrm{L}})\|_{0}}{n} and \frac{1}{n}\operatorname{Tr}((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda_{\textrm{R}}\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}). Due to numerical precision and stability issues, glmnet (and possibly other numerical solvers) yields incorrect solutions for tiny values of \lambda. Also, the lower bound \lambda_{\min}(\sigma_{\max}^{2},\delta), although unknown, prohibits tiny values of \lambda. Therefore, simply choosing an upper bound such as t_{\max}=0.99 does not suffice. Heuristically, we find any value from 0.3 to 0.7 acceptable, and we take t_{\max}=0.5 for simplicity.

Once the ranges for \lambda_{\textrm{L}} and \lambda_{\textrm{R}} are determined, we discretize them on the log scale with grid width 0.1. This yields the final discrete collections of \lambda_{\textrm{L}},\lambda_{\textrm{R}}, over which we minimize \hat{\tau}_{c}^{2}.
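A minimal sketch of this grid construction follows; the initial bounds lam_lo, lam_hi and the function df_fraction (returning the relevant empirical degrees-of-freedom ratio for a given penalty) are placeholders, not part of the method's specification.

import numpy as np

def candidate_grid(lam_lo, lam_hi, df_fraction, t_min=0.01, t_max=0.5, width=0.1):
    # Discretize [lam_lo, lam_hi] on the log scale with grid width 0.1
    log_grid = np.arange(np.log(lam_lo), np.log(lam_hi) + width, width)
    lambdas = np.exp(log_grid)
    # Keep only penalties whose empirical degrees-of-freedom ratio lies in [t_min, t_max]
    kept = [lam for lam in lambdas if t_min <= df_fraction(lam) <= t_max]
    return np.array(kept)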

A.3 Covariance Estimation

In Section 3.4, we discussed estimating the blocked population covariance matrix by block-wise sample covariance matrices. The following proposition supports its consistency under mild conditions:

Proposition A.1.

Let 𝚺p×p\bm{\Sigma}\in\mathbb{R}^{p\times p} be a real symmetric matrix whose spectral distribution has bounded support μ𝚺[κmin,κmax]\mu_{\bm{\Sigma}}\subset[\kappa_{\min},\kappa_{\max}] with 0<κminκmax<0<\kappa_{\min}\leq\kappa_{\max}<\infty. Further suppose that 𝚺\bm{\Sigma} is block-diagonal with blocks 𝚺1,,𝚺k\bm{\Sigma}_{1},...,\bm{\Sigma}_{k} such that the size of each block 𝚺i\bm{\Sigma}_{i} is bounded by some constant mm. Let 𝐗n×p\bm{X}\in\mathbb{R}^{n\times p} with each row 𝐱i\bm{x}_{i\bullet} satisfying 𝐱i=𝐳i𝚺1/2\bm{x}_{i\bullet}=\bm{z}_{i\bullet}\bm{\Sigma}^{1/2} where 𝐳i\bm{z}_{i\bullet} has independent sub-Gaussian entries with bounded sub-Gaussian norm. If we estimate 𝚺^\hat{\bm{\Sigma}} as also block-diagonal with blocks 𝚺^1,,𝚺^k\hat{\bm{\Sigma}}_{1},...,\hat{\bm{\Sigma}}_{k} where 𝚺^i=1n𝐗i𝐗i\hat{\bm{\Sigma}}_{i}=\frac{1}{n}\bm{X}_{i}^{\top}\bm{X}_{i} are corresponding sample covariances, then as p/nδ(0,)p/n\rightarrow\delta\in(0,\infty), we have 𝚺^𝚺op𝑃0\|\hat{\bm{\Sigma}}-\bm{\Sigma}\|_{\textnormal{op}}\overset{P}{\rightarrow}0.

Proof.

The proof is straightforward by applying (wainwright2019high, Theorem 6.5) on each block component, in conjunction with a union bound. Thus we omit it here. ∎
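A minimal sketch of the block-wise plug-in estimator in Proposition A.1 follows; the column blocks (e.g., LD blocks) are assumed to be given, and all off-block entries are set to zero. The names are illustrative.

import numpy as np

def blockwise_covariance(X, block_slices):
    # block_slices: list of slice objects marking the given column blocks,
    # e.g. [slice(0, 50), slice(50, 120), ...]
    n, p = X.shape
    Sigma_hat = np.zeros((p, p))
    for blk in block_slices:
        Xb = X[:, blk]
        # Sample covariance of each block; off-block entries stay zero
        Sigma_hat[blk, blk] = Xb.T @ Xb / n
    return Sigma_hat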

Appendix B Additional Simulation Details

B.1 Random effects comparison: Heritability definition

The common setting in Section 5.1 assumes both \bm{X} and \bm{\beta} are zero-mean random and that \bm{X} has population covariance \bm{\Sigma}. Under the random effects setting, a random \bm{X} is allowed, and the denominator in the heritability definition (3) reads

Var(𝒙i𝜷)=𝔼[Var[𝒙i𝜷|𝜷]]+Var[𝔼[𝒙i𝜷|𝜷]]=𝔼[𝜷𝚺𝜷],\text{Var}(\bm{x}_{i\bullet}^{\top}\bm{\beta})=\mathbb{E}[\text{Var}[\bm{x}_{i\bullet}^{\top}\bm{\beta}|\bm{\beta}]]+\text{Var}[\mathbb{E}[\bm{x}_{i\bullet}^{\top}\bm{\beta}|\bm{\beta}]]=\mathbb{E}[\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}],

which is a well-defined population quantity. Yet, \mathbb{E}[\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}] does not have a closed-form formula when \bm{\beta} is generated with stratifications as in Section 5.1, so we approximated it with 100 Monte Carlo iterations.

Under the fixed effects setting, conditional on a given \bm{\beta}, the same definition reads

Var(𝒙i𝜷)=𝜷𝚺𝜷,\text{Var}(\bm{x}_{i\bullet}^{\top}\bm{\beta})=\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta},

which is an empirical quantity that fluctuates around the population quantity \mathbb{E}[\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}]. Now, with 100 random draws of \bm{\beta}, estimating the corresponding 100 individual values of \bm{\beta}^{\top}\bm{\Sigma}\bm{\beta} approximately amounts to estimating this expectation. This, therefore, facilitates fair comparisons between random and fixed effects methods.

B.2 Random effects comparison: Signal generation details

To generate \bm{\beta} with varying non-zero entry locations and distributions, we employed the following generation process. 1) We fixed two levels of concentration: c_{l}=0.05 and c_{h}=0.5. 2) We determine k_{s}, the number of stratifications needed. For uniformly distributed non-zero entries, k_{s}=1. For LDMS stratification, k_{s}=12: the cross product of the 4 LD stratifications and 3 MAF stratifications mentioned in Section 5.1. For LDMS plus block stratification, k_{s}=24: the cross product of the previous 12 LDMS stratifications and 2 blocks (the last LD block versus the rest, as specified in berisa2016approximately). 3) For each of the k_{s} stratifications, we alternately assign c_{l} and c_{h} as the concentration, and select non-zero entry locations uniformly at random with the assigned concentration. In the special case k_{s}=1, c_{l} is assigned. 4) After selecting the non-zero entry locations, we count the number of non-zero entries K and calculate the entrywise variance \sigma_{+}^{2}=h^{2}/K, where h^{2} is the desired heritability value. 5) Lastly, we pick the non-zero entry distribution. For random normal entries, the distribution is \mathcal{N}(0,\sigma_{+}^{2}). For mixture-of-normal entries, the distribution is \mathcal{N}(\pm\frac{\sigma_{+}}{\sqrt{10}},\frac{9\sigma_{+}^{2}}{10}) (an equal mixture of two symmetric normals).
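The following sketch illustrates steps 3) through 5) for a generic stratified layout; the stratum labels (LD/MAF/block assignments) are assumed to be supplied externally, and all names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def stratified_signal(strata_labels, h2, c_l=0.05, c_h=0.5, mixture=False):
    # strata_labels: length-p numpy array assigning each coordinate to one of k_s strata
    p = len(strata_labels)
    nonzero = np.zeros(p, dtype=bool)
    for idx, s in enumerate(np.unique(strata_labels)):
        conc = c_l if idx % 2 == 0 else c_h      # alternate c_l and c_h across strata
        mask = strata_labels == s
        nonzero[mask] = rng.random(mask.sum()) < conc
    K = max(int(nonzero.sum()), 1)
    sigma2 = h2 / K                              # entrywise variance h^2 / K
    beta = np.zeros(p)
    if mixture:
        # equal mixture of N(+sigma/sqrt(10), 9 sigma^2/10) and N(-sigma/sqrt(10), 9 sigma^2/10)
        signs = rng.choice([-1.0, 1.0], size=K)
        beta[nonzero] = signs * np.sqrt(sigma2 / 10) + rng.normal(0.0, np.sqrt(0.9 * sigma2), size=K)
    else:
        beta[nonzero] = rng.normal(0.0, np.sqrt(sigma2), size=K)
    return beta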

B.3 Fixed effects comparisons: Unbiasedness and a closer look

In this section, we present two plots. The first investigates the bias properties of different fixed effects methods. For a representative example, we chose a setting with n=1000, p=10000, h^{2}=0.1, and \kappa=0.003, though this selection is not particularly unique. Using a specific \bm{\beta}, we generated 100 random instances of \bm{X}. The box plots in Figure 4 depict the heritability estimates from all the methods under consideration. We observe that every method is (approximately) unbiased, which we confirmed in additional settings. We also note that truncating negative estimates to 0 would yield lower MSEs but more bias, so we opted to keep the negative estimates.

Figure 4: Same setting as Figure 2, showing box plots of estimates for a certain realized signal with 100 draws of design matrices. All methods are unbiased.

As a second line of investigation, we recall Figure 2 from the main manuscript, which showed that the AMP MSEs are often much higher than those of the remaining methods. In this light, we zoom into Figure 2 to obtain a clearer picture of the performance of our method relative to the others. Figure 5 shows this zoomed-in version. Note that our relative superiority is mostly maintained across diverse settings.

Figure 5: Zoomed-in version of Figure 2, excluding AMP.

B.4 Fixed effects comparison: Designs with larger sub-Gaussian norms

To generate \bm{X} with larger sub-Gaussian norms, we followed the exact same setup as in Section 5.2, except that \pi_{j}\sim\text{Unif}[0.005,0.01] instead of \text{Unif}[0.01,0.5] in (2). This mimics lower-frequency variants, leading to larger sub-Gaussian norms in the normalized \bm{X}. Comparing Figures 6 and 7 with Figures 2 and 3, we observe similar trends across the board, with HEDE showing either similar or dominating performance.

Refer to caption
Figure 6: Same setting as Figure 2, except that πjUnif[0.005,0.01]\pi_{j}\sim\text{Unif}[0.005,0.01].
Refer to caption
Figure 7: Same setting as Figure 3, except that πjUnif[0.005,0.01]\pi_{j}\sim\text{Unif}[0.005,0.01].

Appendix C Proof Notations and Conventions

This section introduces notations that we will use through the rest of the proof. We consider both the Lasso and ridge estimators in the context of linear models. Several important quantities in our calculations appear in two versions—one computed based on the ridge and the other based on the Lasso. We use subscripts R,L\textrm{R},\textrm{L} respectively to denote the versions of these quantities corresponding to the ridge and the Lasso. Recall that the setting we study may be expressed via the linear model 𝒚=𝑿𝜷+σ𝒛\bm{y}=\bm{X}\bm{\beta}+\sigma\bm{z} where 𝒚n\bm{y}\in\mathbb{R}^{n} denote the responses, 𝑿n×p\bm{X}\in\mathbb{R}^{n\times p} the design matrix, 𝜷p\bm{\beta}\in\mathbb{R}^{p} the unknown regression coefficients, and 𝒛\bm{z} the noise. Then for k=L,Rk=\textrm{L},\textrm{R}, the corresponding Lasso and Ridge estimators 𝜷^kp\hat{\bm{\beta}}_{k}\in\mathbb{R}^{p} are defined as

𝜷^k=argmin𝒃{12n𝒚𝑿𝒃22+Ωk(𝒃)},\hat{\bm{\beta}}_{k}=\operatorname*{arg\,min}_{\bm{b}}\left\{\frac{1}{2n}\|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}+\Omega_{k}(\bm{b})\right\}, (17)

where ΩL(𝒃)=λLn𝒃1,ΩR(𝒃)=λR2𝒃22\Omega_{\textrm{L}}(\bm{b})=\frac{\lambda_{\textrm{L}}}{\sqrt{n}}\|\bm{b}\|_{1},\Omega_{\textrm{R}}(\bm{b})=\frac{\lambda_{\textrm{R}}}{2}\|\bm{b}\|_{2}^{2}.

Our methodology relies on debiased versions of these estimators, defined as

𝜷^kd=𝜷^k+𝑿(𝒚𝑿𝜷^k)ndf^k,fork=L,R,\hat{\bm{\beta}}_{k}^{d}=\hat{\bm{\beta}}_{k}+\frac{\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{k})}{n-\hat{\textnormal{df}}_{k}},\,\,\text{for}\,\,k=\textrm{L},\textrm{R}, (18)

where df^k\hat{\textnormal{df}}_{k}\in\mathbb{R} are terms that may be interpreted as degrees-of-freedom, and are defined in each of the aforementioned cases as follows,

df^L\displaystyle\hat{\text{df}}_{\textrm{L}} =𝜷^L0\displaystyle=\|\hat{\bm{\beta}}_{\textrm{L}}\|_{0} (19)
df^R\displaystyle\hat{\text{df}}_{\textrm{R}} =Tr((1n𝑿𝑿+λR𝑰)11n𝑿𝑿).\displaystyle=\text{Tr}((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda_{\textrm{R}}\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}).
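For reference, a minimal Python sketch of (17)–(19) follows; the use of scikit-learn's Lasso is an implementation choice for illustration only (its objective \|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}/(2n)+\alpha\|\bm{b}\|_{1} matches (17) with \alpha=\lambda_{\textrm{L}}/\sqrt{n}), and the function names are illustrative.

import numpy as np
from sklearn.linear_model import Lasso

def debiased_lasso(X, y, lam_L):
    n, p = X.shape
    # Lasso fit matching (17): alpha = lam_L / sqrt(n)
    fit = Lasso(alpha=lam_L / np.sqrt(n), fit_intercept=False, max_iter=10000).fit(X, y)
    beta_hat = fit.coef_
    df_hat = np.count_nonzero(beta_hat)                   # df-hat_L of (19)
    return beta_hat + X.T @ (y - X @ beta_hat) / (n - df_hat)

def debiased_ridge(X, y, lam_R):
    n, p = X.shape
    A = X.T @ X / n + lam_R * np.eye(p)
    beta_hat = np.linalg.solve(A, X.T @ y / n)            # ridge solution of (17)
    df_hat = np.trace(np.linalg.solve(A, X.T @ X / n))    # df-hat_R of (19)
    return beta_hat + X.T @ (y - X @ beta_hat) / (n - df_hat)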

Our proofs use some intermediate quantities that we introduce next. First, we will require intermediate quantities that replace df^k\hat{\textrm{df}}_{k} by a set of parameters dfk\textrm{df}_{k} that rely on underlying problem parameters. For k=L,Rk=\textrm{L},\textrm{R}, we define these to be

𝜷~kd=𝜷^k+𝑿(𝒚𝑿𝜷^k)ndfk.\tilde{\bm{\beta}}_{k}^{d}=\hat{\bm{\beta}}_{k}+\frac{\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{k})}{n-\textnormal{df}_{k}}. (20)

The exact definition of dfk\textnormal{df}_{k} is involved, so we defer its presentation to 24. Second, we need another set of intermediate quantities 𝜷^kf\hat{\bm{\beta}}_{k}^{f} that form solutions to optimization problems of the form (17), but where the observed data {𝒚,𝑿}\{\bm{y},\bm{X}\} is replaced by random variables 𝜷^kf,d\hat{\bm{\beta}}_{k}^{f,d} that can be expressed as gaussian perturbations of the true signal, with appropriate adjustments to the penalty function. For k=L,Rk=\textrm{L},\textrm{R}, these are defined as follows:

𝜷^kf:=ηk(𝜷^kf,d;ζk):=argmin𝒃{12𝜷^kf,d𝒃22+1ζkΩk(𝒃)},\hat{\bm{\beta}}_{k}^{f}:=\eta_{k}(\hat{\bm{\beta}}_{k}^{f,d};\zeta_{k}):=\operatorname*{arg\,min}_{\bm{b}}\left\{\frac{1}{2}\|\hat{\bm{\beta}}_{k}^{f,d}-\bm{b}\|_{2}^{2}+\frac{1}{\zeta_{k}}\Omega_{k}(\bm{b})\right\}, (21)
where𝜷^kf,d=𝜷+𝒈kf,(𝒈Lf,𝒈Rf)𝒩(𝟎,𝑺𝑰p),\text{where}\quad\hat{\bm{\beta}}_{k}^{f,d}=\bm{\beta}+\bm{g}_{k}^{f},\quad(\bm{g}_{\textrm{L}}^{f},\bm{g}_{\textrm{R}}^{f})\sim\mathcal{N}(\bm{0},\bm{S}\otimes\bm{I}_{p}), (22)

with 𝑺\bm{S} of the form

𝑺:=(τL2ρτLτRρτLτRτR2),\bm{S}:=\begin{pmatrix}\tau_{\textrm{L}}^{2}&\rho\tau_{\textrm{L}}\tau_{\textrm{R}}\\ \rho\tau_{\textrm{L}}\tau_{\textrm{R}}&\tau_{\textrm{R}}^{2}\end{pmatrix}, (23)

for suitable choices of τL,τR,ρ,ζk\tau_{\textrm{L}},\tau_{\textrm{R}},\rho,\zeta_{k} that we define later in (24). Our exposition so far refrains from providing additional insights regarding the necessity of these intermediate quantities–however, the role of these quantities will unravel in due course through the proof. As an aside, note that the randomness in 𝜷^kf,d,𝜷^kf\hat{\bm{\beta}}_{k}^{f,d},\hat{\bm{\beta}}_{k}^{f} comes from 𝒈kf\bm{g}_{\textrm{k}}^{f}, which is independent of the observed data. We use the superscript ‘f’ (standing for fixed) to denote that these do not depend on our observed data.
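For later reference, the minimizations in (21) admit standard closed forms (a routine computation from the definitions of \Omega_{\textrm{L}} and \Omega_{\textrm{R}}, recorded here for convenience): applied coordinate-wise,

\eta_{\textrm{L}}(\bm{x};\zeta)_{j}=\operatorname{sign}(x_{j})\left(|x_{j}|-\frac{\lambda_{\textrm{L}}}{\zeta\sqrt{n}}\right)_{+},\qquad\eta_{\textrm{R}}(\bm{x};\zeta)=\frac{\bm{x}}{1+\lambda_{\textrm{R}}/\zeta}.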

Our aforementioned notations are complete once we define the parameters dfk\textrm{df}_{k} from (20), ζk\zeta_{k} from (21), and 𝑺\bm{S} from (23). We define these as solutions to the following system of equations in the variables {𝑺ˇ,ζˇk,dfˇk,k=L,R}\{\check{\bm{S}},\check{\zeta}_{k},\check{\textnormal{df}}_{k},k=\textrm{L},\textrm{R}\}.

𝑺ˇ\displaystyle\check{\bm{S}} =1n(σ2𝑰+𝔼[(ηL(𝜷^Lf,d;ζˇL)𝜷,ηR(𝜷^Rf,d;ζˇR)𝜷)(ηL(𝜷^Lf,d;ζˇL)𝜷,ηR(𝜷^Rf,d;ζˇR)𝜷)])\displaystyle=\frac{1}{n}(\sigma^{2}\bm{I}+\mathbb{E}[(\eta_{\textrm{L}}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d};\check{\zeta}_{\textrm{L}})-\bm{\beta},\eta_{\textrm{R}}(\hat{\bm{\beta}}_{\textrm{R}}^{f,d};\check{\zeta}_{\textrm{R}})-\bm{\beta})^{\top}(\eta_{\textrm{L}}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d};\check{\zeta}_{\textrm{L}})-\bm{\beta},\eta_{\textrm{R}}(\hat{\bm{\beta}}_{\textrm{R}}^{f,d};\check{\zeta}_{\textrm{R}})-\bm{\beta})]) (24)
dfˇk\displaystyle\check{\textnormal{df}}_{k} =𝔼[divηk(𝜷^kf,d;ζˇk)],k=L,R\displaystyle=\mathbb{E}[\text{div}\eta_{k}(\hat{\bm{\beta}}_{k}^{f,d};\check{\zeta}_{k})],k=\textrm{L},\textrm{R}
ζˇk\displaystyle\check{\zeta}_{k} =1dfˇkn,k=L,R,\displaystyle=1-\frac{\check{\textnormal{df}}_{k}}{n},k=\textrm{L},\textrm{R},

where divηk(𝜷^kf,d;ζˇk):=j=1p𝜷^k,jf,dηk(𝜷^kf,d;ζˇk)\text{div}\eta_{k}(\hat{\bm{\beta}}_{k}^{f,d};\check{\zeta}_{k}):=\sum_{j=1}^{p}\frac{\partial}{\partial\hat{\bm{\beta}}_{k,j}^{f,d}}\eta_{k}(\hat{\bm{\beta}}_{k}^{f,d};\check{\zeta}_{k}) is defined as the divergence of ηk(𝜷^kf,d;ζˇk)\eta_{k}(\hat{\bm{\beta}}_{k}^{f,d};\check{\zeta}_{k}) with respect to its first argument and recall that we defined ηk(,)\eta_{k}(\cdot,\cdot) in (21).

Extracting the first and the last entries of the first equation in (24), we observe that \tau_{\textrm{L}},\zeta_{\textrm{L}},\textnormal{df}_{\textrm{L}} depend on \lambda_{\textrm{L}} and not on \lambda_{\textrm{R}}, and similarly for the corresponding Ridge parameters. We also note that \tau_{\textrm{L}\textrm{R}} depends on both \lambda_{\textrm{L}} and \lambda_{\textrm{R}}. We will use this observation multiple times in our proofs later.

The system of equations (24) first arose in the context of the problem studied in celentano2021cad, and Lemma C.1 of that paper establishes the uniqueness of the solutions. Furthermore, it follows that there exist positive constants (\tau_{\min},\tau_{\max},\zeta_{\min},\rho_{\max}) such that

τmin2+σ2<nτk2<τmax2+σ2,ζmin<ζk1,|ρ|<ρmax<1,\tau_{\min}^{2}+\sigma^{2}<n\tau_{k}^{2}<\tau_{\max}^{2}+\sigma^{2},\zeta_{\min}<\zeta_{k}\leq 1,|\rho|<\rho_{\max}<1, (25)

for k=L,Rk=L,R where (τL,τR,ζL,ζR,ρ)(\tau_{L},\tau_{R},\zeta_{L},\zeta_{R},\rho) denotes the unique solution to equation (24).

Whenever we mention constants in our proof below, we mean values that only depend on the model parameters {κmin,κmax,δ,σmax,λmin,λmax}\{\kappa_{\min},\kappa_{\max},\delta,\sigma_{\max},\lambda_{\min},\lambda_{\max}\} laid out in Assumptions 1 and 2, and do not depend on any other variable (especially n,pn,p, and realization of any random quantities).

Finally, in Sections D–H.4, we prove our results under the stylized setting where the design matrix entries are i.i.d. \mathcal{N}(0,1) and the noise vector entries are i.i.d. \mathcal{N}(0,1). In Section I we establish universality results that show that the same conclusions hold under the setting of Assumptions 1 and 2, where the covariate and error distributions are more general. To avoid confusion, we define below a stylized version of Assumptions 1 and 2, where everything remains the same except that the design and error distributions are taken to be i.i.d. Gaussian. So, Sections D–H.4 work under Assumption 4, while Section I works under Assumptions 1 and 2.

Assumption 4.
  1. 1.

    Same as Assumption 1(1).

  2. 2.

    Each entry of 𝑿\bm{X} satisfies Xiji.i.d.𝒩(0,1)X_{ij}\overset{i.i.d.}{\sim}\mathcal{N}(0,1).

  3. 3.

    Same as Assumption 1(3).

  4. 4.

    The noise ϵ\bm{\epsilon} satisfies ϵii.i.d.𝒩(0,σ2)\epsilon_{i}\overset{i.i.d.}{\sim}\mathcal{N}(0,\sigma^{2}), with σ2σmax2\sigma^{2}\leq\sigma_{\max}^{2}.

  5. 5.

    Same as Assumption 2(2).

Note that we don’t need Assumption 2(1) since it is true for Gaussians (see B.5.4 in celentano2020lasso).

Appendix D Proof of Theorem 4.3

Theorem 4.3 forms the backbone of our results, so we begin by presenting its proof here. Towards this goal, we first prove the following slightly different version.

Theorem D.1.

Suppose that Assumption 4 holds. Then for any 11-Lipschitz function ϕβ:(p)3\phi_{\beta}:(\mathbb{R}^{p})^{3}\rightarrow\mathbb{R}, we have

supλL,λR[λmin,λmax]|ϕβ(𝜷~Ld,𝜷~Rd,𝜷)𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)]|𝑃0,\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\tilde{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})]|\overset{P}{\rightarrow}0,

where 𝛃~kd,𝛃^kf,d\tilde{\bm{\beta}}_{k}^{d},\hat{\bm{\beta}}_{k}^{f,d} are as defined in (20), (22) for k=L,Rk=\textrm{L},\textrm{R} respectively.

Theorem D.1 differs from Theorem 4.3 in that 𝜷^kd\hat{\bm{\beta}}_{k}^{d} is now replaced by 𝜷~kd\tilde{\bm{\beta}}_{k}^{d}. The difference between these lies in the fact that df^k\hat{\textrm{df}}_{k} is replaced by dfk\textrm{df}_{k} from the former to the latter (c.f. Eqns 18 and 20).

Given Theorem D.1, the proof of Theorem 4.3 follows a two-step procedure: we first establish that the empirical quantities \hat{\textrm{df}}_{k} and the parameters \textrm{df}_{k} are uniformly close asymptotically. This allows us to establish that \tilde{\bm{\beta}}_{k}^{d} and \hat{\bm{\beta}}_{k}^{d} are asymptotically close. Theorem 4.3 then follows by combining these facts with Theorem D.1 and the Lipschitz property of \phi_{\beta}. We formalize these arguments below.

Lemma D.2.

Recall the definitions of df^k\hat{\textnormal{df}}_{k} and dfk\textnormal{df}_{k} from (19), (24). Under Assumption 4, for k=L,Rk=\textrm{L},\textrm{R},

supλk[λmin,λmax]|df^kpdfkp|𝑃0.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\hat{\textnormal{df}}_{k}}{p}-\frac{\textnormal{df}_{k}}{p}\right|\overset{P}{\rightarrow}0.
Proof.

(miolane2021distribution, Theorem F.1) established the aforementioned result for k=\textrm{L}.

For k=\textrm{R}, (celentano2021cad, Lemma H.1) (which further cites (knowles2017anisotropic, Theorem 3.7)) showed that \hat{\textnormal{df}}_{\textrm{R}}/p converges to \textnormal{df}_{\textrm{R}}/p for any fixed \lambda_{\textrm{R}}. So for our purposes it suffices to extend this argument to hold over all \lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}] simultaneously. We achieve this below.

Recall from Eqn. 24 that df^R/p=1λRTr(1p(1n𝑿𝑿+λR𝑰)1)\hat{\textnormal{df}}_{\textrm{R}}/p=1-\lambda_{\textrm{R}}\text{Tr}(\frac{1}{p}(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda_{\textrm{R}}\bm{I})^{-1}), where the trace on the right hand side is the negative resolvent of 1n𝑿𝑿\frac{1}{n}\bm{X}^{\top}\bm{X} evaluated at λR-\lambda_{\textrm{R}} and normalized by 1/p1/p. (knowles2017anisotropic, Theorem 3.7) (with some notation-transforming algebra) states that this negative normalized resolvent converges in probability to (1dfR/p)/λR(1-\textnormal{df}_{\textrm{R}}/p)/\lambda_{\textrm{R}} with fluctuation of level O(N1/2κ1/4)O(N^{-1/2}\kappa^{-1/4}), where κ=λR+1\kappa=\lambda_{\textrm{R}}+1 is the distance from λR-\lambda_{\textrm{R}} to the spectrum of 𝑰\bm{I}. Therefore, df^R/p\hat{\textnormal{df}}_{\textrm{R}}/p concentrates around dfR/p\textnormal{df}_{\textrm{R}}/p with fluctuation of level O(N1/2λR(1+λR)1/4)=O(N1/2(1+λmax)3/4)O(N^{-1/2}\lambda_{\textrm{R}}(1+\lambda_{\textrm{R}})^{-1/4})=O(N^{-1/2}(1+\lambda_{\max})^{3/4}), for all λR[λmin,λmax]\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]. ∎

As a direct consequence, we have

Corollary D.1.

Under Assumption 4, we have for k=L,Rk=\textrm{L},\textrm{R},

supλk[λmin,λmax]|1dfkn|\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|1-\frac{\textnormal{df}_{k}}{n}\right| =Θp(1),\displaystyle=\Theta_{p}(1),
supλk[λmin,λmax]|1df^kn|\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|1-\frac{\hat{\textnormal{df}}_{k}}{n}\right| =Θp(1).\displaystyle=\Theta_{p}(1).
Proof.

The first line follows directly from (24) and the fact that \zeta_{k} is bounded as in (25). For the second line, we have

supλk[λmin,λmax]|1df^kn|supλk[λmin,λmax]|1dfkn|+supλk[λmin,λmax]|df^kndfkn|.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|1-\frac{\hat{\textnormal{df}}_{k}}{n}\right|\leq\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|1-\frac{{\textnormal{df}}_{k}}{n}\right|+\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\hat{\textnormal{df}}_{k}}{n}-\frac{\textnormal{df}_{k}}{n}\right|.

where the first term is Θp(1)\Theta_{p}(1) and the second term is oP(1)o_{P}(1) by Lemma D.2. ∎

From here, one can derive the following Lemma using the definitions of 𝜷^kd\hat{\bm{\beta}}_{k}^{d}, 𝜷~kd\tilde{\bm{\beta}}_{k}^{d}. One should compare the following to (celentano2021cad, Lemma H.1.ii) that proved a pointwise version of this result without any supremum over the tuning parameters.

Lemma D.3.

Under Assumption 4, for k=L,Rk=\textrm{L},\textrm{R},

supλk[λmin,λmax]𝜷^kd𝜷~kd2𝑃0.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}_{k}^{d}-\tilde{\bm{\beta}}_{k}^{d}\|_{2}\overset{P}{\rightarrow}0.
Proof.

By definition,

supλk[λmin,λmax]𝜷^kd𝜷~kd2\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}_{k}^{d}-\tilde{\bm{\beta}}_{k}^{d}\|_{2}
=\displaystyle= supλk[λmin,λmax]𝑿(𝒚𝑿𝜷^k)ndf^k𝑿(𝒚𝑿𝜷^k)ndfk2\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left\|\frac{\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{k})}{n-\hat{\textnormal{df}}_{k}}-\frac{\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{k})}{n-\textnormal{df}_{k}}\right\|_{2}
\displaystyle\leq 𝑿opnsupλk[λmin,λmax]𝒚𝑿𝜷^k2nsupλk[λmin,λmax]|11df^k/n11dfk/n|.\displaystyle\frac{\|\bm{X}\|_{\textnormal{op}}}{\sqrt{n}}\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\frac{\|\bm{y}-\bm{X}\hat{\bm{\beta}}_{k}\|_{2}}{\sqrt{n}}\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{1}{{1-\hat{\textnormal{df}}_{k}/n}}-\frac{1}{{1-\textnormal{df}_{k}/n}}\right|.

By Corollary H.1, 𝑿op/n\|\bm{X}\|_{\textnormal{op}}/\sqrt{n} is bounded with probability 1o(1)1-o(1). By Lemma D.7, supλk[λmin,λmax]𝒚𝑿𝜷^k/n\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\bm{y}-\bm{X}\hat{\bm{\beta}}_{k}\|/\sqrt{n} is bounded with probability 1o(1)1-o(1). Lastly,

supλk[λmin,λmax]|11df^k/n11dfk/n|=supλk[λmin,λmax]|dfk/ndf^k/n(1df^k/n)(1dfk/n)|𝑃0,\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{1}{{1-\hat{\textnormal{df}}_{k}/n}}-\frac{1}{{1-\textnormal{df}_{k}/n}}\right|=\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\textnormal{df}_{k}/n-\hat{\textnormal{df}}_{k}/n}{(1-\hat{\textnormal{df}}_{k}/n)(1-\textnormal{df}_{k}/n)}\right|\overset{P}{\rightarrow}0,

where the numerator is oP(1)o_{P}(1) by Lemma D.2 and the denominator is ΘP(1)\Theta_{P}(1) by Corollary D.1. ∎

We next turn to prove Theorem 4.3.

Proof of Theorem 4.3.

By triangle inequality,

supλL,λR[λmin,λmax]|ϕβ(𝜷^Ld,𝜷^Rd,𝜷)𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)]|\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})]|
\displaystyle\leq supλL,λR[λmin,λmax]|ϕβ(𝜷~Ld,𝜷~Rd,𝜷)𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)]|+supλL,λR[λmin,λmax]|ϕβ(𝜷^Ld,𝜷^Rd,𝜷)ϕβ(𝜷~Ld,𝜷~Rd,𝜷)|\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\tilde{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})]|+\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})-\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\tilde{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})|
\displaystyle\leq oP(1)+supλL,λR[λmin,λmax]𝜷^Ld𝜷~Ld2+supλL,λR[λmin,λmax]𝜷^Rd𝜷~Rd2\displaystyle o_{P}(1)+\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}_{\textrm{L}}^{d}-\tilde{\bm{\beta}}_{\textrm{L}}^{d}\|_{2}+\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}_{\textrm{R}}^{d}-\tilde{\bm{\beta}}_{\textrm{R}}^{d}\|_{2}
𝑃\displaystyle\overset{P}{\rightarrow} 0,\displaystyle\ 0,

where we used Theorem D.1 and Lemma D.3, as well as the fact that ϕβ\phi_{\beta} is 11-Lipschitz. ∎

Thus, our proof of Theorem 4.3 is complete if we prove Theorem D.1. We present this in the next sub-section.

D.1 Proof of Theorem D.1

The overarching structure of our proof of Theorem D.1 is inspired by (celentano2021cad, Section D). However, unlike in our setting below, the results in the aforementioned paper do not require uniform convergence over a range of values of the tuning parameter. This leads to novel technical challenges in our setting that we handle as we proceed.

To prove Theorem D.1, we introduce an intermediate quantity 𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)|𝒈Lf=𝒈^L]\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}] that conditions on 𝒈Lf\bm{g}_{\textrm{L}}^{f} (recall the definition from (22)) taking on a specific value 𝒈^L\hat{\bm{g}}_{\textrm{L}} defined as follows:

𝒈^L=nτLζLnτL2σ2𝒚𝑿𝜷^L2𝜷^L𝜷2(𝜷^L𝜷)+τL𝒚𝑿𝜷^L2𝑿(𝒚𝑿𝜷^L),\begin{gathered}\hat{\bm{g}}_{\textrm{L}}=\frac{n\tau_{\textrm{L}}\zeta_{\textrm{L}}\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}}}{\|\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}}\|_{2}\|\hat{\bm{\beta}}_{\textrm{L}}-\bm{\beta}\|_{2}}(\hat{\bm{\beta}}_{\textrm{L}}-\bm{\beta})+\frac{\tau_{\textrm{L}}}{\|\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}}\|_{2}}\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}}),\end{gathered} (26)

where τL,ζL\tau_{\textrm{L}},\zeta_{\textrm{L}} are defined as in (24). Recall from (25) that nτL2σ2n\tau_{\textrm{L}}^{2}\geq\sigma^{2} so that the square root above is well-defined.

Remark.

The definition of 𝐠^L\hat{\bm{g}}_{\textrm{L}} is non-trivial. However, the takeaway is that the realization 𝐠Lf=𝐠^L\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}} should be understood as the coupling of 𝐗\bm{X} and 𝐠Lf\bm{g}_{\textrm{L}}^{f} such that 𝛃^Lf,𝛃^Lf,d\hat{\bm{\beta}}_{\textrm{L}}^{f},\hat{\bm{\beta}}_{\textrm{L}}^{f,d} equals 𝛃^L,𝛃~Ld\hat{\bm{\beta}}_{\textrm{L}},\tilde{\bm{\beta}}_{\textrm{L}}^{d}, respectively. We refer readers to (celentano2021cad, Section F.1 and Section L) for the underlying intuition as well as the proof of this equivalence.

Since 𝜷^Lf,d\hat{\bm{\beta}}_{\textrm{L}}^{f,d} becomes 𝜷~Ld\tilde{\bm{\beta}}_{\textrm{L}}^{d} conditioning on 𝒈Lf=𝒈^L\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}, we have 𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)|𝒈Lf=𝒈^L]=𝔼[ϕβ(𝜷~Ld,𝜷^Rf,d,𝜷)|𝒈Lf=𝒈^L]\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]=\mathbb{E}[\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}], where the randomness inside the expectation comes from 𝜷^Rf,d\hat{\bm{\beta}}_{\textrm{R}}^{f,d}, and 𝜷~Ld\tilde{\bm{\beta}}_{\textrm{L}}^{d} is fixed. In fact, analogous to 𝒈Lf=𝜷^Lf,d𝜷\bm{g}_{\textrm{L}}^{f}=\hat{\bm{\beta}}_{\textrm{L}}^{f,d}-\bm{\beta}, 𝒈^L\hat{\bm{g}}_{\textrm{L}} approximates 𝜷~Ld𝜷\tilde{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta}, as formalized in the result below.

Lemma D.4.

Under Assumption 4,

supλL[λmin,λmax]𝒈^L(𝜷~Ld𝜷)2𝑃0.\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{g}}_{\textrm{L}}-(\tilde{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta})\|_{2}\overset{P}{\rightarrow}0.

With these definitions in hand, the proof is complete on combining the following two lemmas.

Lemma D.5.

Recall the definitions of 𝛃~kd\tilde{\bm{\beta}}_{k}^{d} from Eqn. 20 and 𝛃^kf,d\hat{\bm{\beta}}_{k}^{f,d} from Eqn. 22. Under Assumption 4, for any 11-Lipschitz function ϕβ:(p)3\phi_{\beta}:(\mathbb{R}^{p})^{3}\rightarrow\mathbb{R},

supλL,λR[λmin,λmax]|𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)|𝒈Lf=𝒈^L]𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)]|𝑃0.\begin{gathered}\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})]|\overset{P}{\rightarrow}0.\end{gathered} (27)
Lemma D.6.

Under Assumption 4, for any 11-Lipschitz function ϕβ:(p)3\phi_{\beta}:(\mathbb{R}^{p})^{3}\rightarrow\mathbb{R},

supλL,λR[λmin,λmax]|𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)|𝒈Lf=𝒈^L]ϕβ(𝜷~Ld,𝜷~Rd,𝜷)|𝑃0.\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]-\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\tilde{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})|\overset{P}{\rightarrow}0.

We present the proofs of these supporting lemmas in the following subsection.

D.2 Proof of Lemmas D.4D.6

We start with the proof of Lemma D.4 since it plays a crucial role in the proofs of the others. The key step is the following lemma, which establishes the boundedness and limiting behavior of several quantities that recur throughout the argument.

Lemma D.7.

For k=L,Rk=\textrm{L},\textrm{R}, define 𝐰^k,𝐮^k,𝐯^k\hat{\bm{w}}_{k},\hat{\bm{u}}_{k},\hat{\bm{v}}_{k} as

𝒘^k=𝜷^k𝜷,𝒖^k=𝒚𝑿𝜷^k,𝒗^k=𝑿(𝒚𝑿𝜷^k).\hat{\bm{w}}_{k}=\hat{\bm{\beta}}_{k}-\bm{\beta},\hat{\bm{u}}_{k}=\bm{y}-\bm{X}\hat{\bm{\beta}}_{k},\hat{\bm{v}}_{k}=\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{k}). (28)

Under Assumption 4,

  • Convergence:

    supλk[λmin,λmax]|𝒘^k2nτk2σ2|𝑃0.\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\|\hat{\bm{w}}_{k}\|_{2}-\sqrt{n\tau_{k}^{2}-\sigma^{2}}\right|\overset{P}{\rightarrow}0.
    supλk[λmin,λmax]|𝒖^k2/nnτkζk|𝑃0.\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\|\hat{\bm{u}}_{k}\|_{2}/\sqrt{n}-\sqrt{n}\tau_{k}\zeta_{k}\right|\overset{P}{\rightarrow}0.
  • Boundedness: there exist positive constants c,Cc,C such that, with probability 1o(1)1-o(1),

    csupλk[λmin,λmax]𝒘^k2,supλk[λmin,λmax]𝒖^k2/n,supλk[λmin,λmax]𝒗^k2/nC.c\leq\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{w}}_{k}\|_{2},\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{u}}_{k}\|_{2}/\sqrt{n},\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{v}}_{k}\|_{2}/n\leq C.
Proof.

Lemma D.7 is proved in Section E.1. ∎

We will see below that Lemma D.7 forms the crux of the proof of Lemma D.4.

D.2.1 Proof of Lemma D.4

Proof of Lemma D.4.

(celentano2021cad, Section J.1) proved a pointwise version of this lemma, without any supremum over the tuning parameters. Thus, with Lemma D.7 at our disposal, the proof of Lemma D.4 follows by a suitable combination with the strategy in celentano2021cad. Recalling Eqns. 26 and 28, we have

𝒈^L=nτLζLnτL2σ2𝒖^L2𝒘^L2𝒘^L+τL𝒖^L2𝒗^L.\hat{\bm{g}}_{\textrm{L}}=\frac{n\tau_{\textrm{L}}\zeta_{\textrm{L}}\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}}}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}\|\hat{\bm{w}}_{\textrm{L}}\|_{2}}\hat{\bm{w}}_{\textrm{L}}+\frac{\tau_{\textrm{L}}}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}}\hat{\bm{v}}_{\textrm{L}}.

Next recall the definition of 𝜷~Ld\tilde{\bm{\beta}}_{\textrm{L}}^{d} from (20) and that ndfL=nζLn-\textnormal{df}_{\textrm{L}}=n\zeta_{\textrm{L}} from our system of equations (24). On applying triangle inequality and Lemma H.5, we obtain that

supλL[λmin,λmax]𝒈^L(𝜷~Ld𝜷)2\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{g}}_{\textrm{L}}-(\tilde{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta})\|_{2} supλL[λmin,λmax]|nζLτLnτL2σ2𝒖^L2𝒘^L21|supλL[λmin,λmax]𝒘^L2\displaystyle\leq\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{n\zeta_{\textrm{L}}\tau_{\textrm{L}}\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}}}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}\|\hat{\bm{w}}_{\textrm{L}}\|_{2}}-1\right|\cdot\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{w}}_{\textrm{L}}\|_{2} (29)
+supλL[λmin,λmax]|nτL𝒖^L21ζL|supλL[λmin,λmax]𝒗^L2n.\displaystyle+\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{n\tau_{\textrm{L}}}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}}-\frac{1}{\zeta_{\textrm{L}}}\right|\cdot\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\frac{\|\hat{\bm{v}}_{\textrm{L}}\|_{2}}{n}.

For the first summand, note that by Lemma D.7, supλL[λmin,λmax]𝒘^L2=OP(1)\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{w}}_{\textrm{L}}\|_{2}=O_{P}(1). Thus it suffices to show that the first term in the first summand in (29) is oP(1)o_{P}(1). To see this, observe that we can simplify this as follows

supλL[λmin,λmax]|nτLζLnτL2σ2𝒖^L2𝒘^L21|\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{n\tau_{\textrm{L}}\zeta_{\textrm{L}}\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}}}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}\|\hat{\bm{w}}_{\textrm{L}}\|_{2}}-1\right|
=\displaystyle= supλL[λmin,λmax]|(nτLζL)nτL2σ2(𝒖^L2/n)𝒘^L21|\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{(\sqrt{n}\tau_{\textrm{L}}\zeta_{\textrm{L}})\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}}}{(\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n})\|\hat{\bm{w}}_{\textrm{L}}\|_{2}}-1\right|
=\displaystyle= supλL[λmin,λmax]|(nτLζL)(nτL2σ2𝒘^L2)+𝒘^L2(nτLζL𝒖^L2/n)(𝒖^L2/n)𝒘^L2|\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{(\sqrt{n}\tau_{\textrm{L}}\zeta_{\textrm{L}})(\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}}-\|\hat{\bm{w}}_{\textrm{L}}\|_{2})+\|\hat{\bm{w}}_{\textrm{L}}\|_{2}(\sqrt{n}\tau_{\textrm{L}}\zeta_{\textrm{L}}-\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n})}{(\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n})\|\hat{\bm{w}}_{\textrm{L}}\|_{2}}\right|
\displaystyle\leq supλL[λmin,λmax]|nτLζL(𝒖^L2/n)𝒘^L2|supλL[λmin,λmax]|nτL2σ2𝒘^L2|\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\sqrt{n}\tau_{\textrm{L}}\zeta_{\textrm{L}}}{(\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n})\|\hat{\bm{w}}_{\textrm{L}}\|_{2}}\right|\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}}-\|\hat{\bm{w}}_{\textrm{L}}\|_{2}\right|
+supλL[λmin,λmax]|1𝒖^L2/n|supλL[λmin,λmax]|(nτLζL𝒖^L2/n)|𝑃0by Lemma D.7 .\displaystyle+\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{1}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n}}\right|\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|(\sqrt{n}\tau_{\textrm{L}}\zeta_{\textrm{L}}-\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n})\right|\overset{P}{\rightarrow}0\,\,\text{by Lemma \ref{lemma:uvw_convergence} }.

We turn to the second summand in (29). It suffices to show that the first term is oP(1)o_{P}(1) since the second is OP(1)O_{P}(1) by Lemma D.7. But this follows from the same result on observing that

supλL[λmin,λmax]|nτL𝒖^L21ζL|=supλL[λmin,λmax]|nτLζL𝒖^L2/n𝒖^L2/nζL|\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{n\tau_{\textrm{L}}}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}}-\frac{1}{\zeta_{\textrm{L}}}\right|=\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\sqrt{n}\tau_{\textrm{L}}\zeta_{\textrm{L}}-\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n}}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n}\cdot\zeta_{\textrm{L}}}\right|

and that ζL\zeta_{L} is bounded below by a positive constant by (25). This completes the proof. ∎

D.2.2 Proof of Lemma D.5

For clarity, we sometimes make the dependence of estimators on λL\lambda_{\textrm{L}} and/or λR\lambda_{\textrm{R}} explicit (by writing 𝜷^L(λL)\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{\textrm{L}}), for example). We first introduce a supporting lemma that states a convergence result separately for the Lasso and the Ridge.

Lemma D.8.

Under Assumption 4, for k=L,R,k=\textrm{L},\textrm{R}, and any 11-Lipschitz function ϕβ:(p)3\phi_{\beta}:(\mathbb{R}^{p})^{3}\rightarrow\mathbb{R},

supλk[λmin,λmax]|ϕβ(𝜷^k,𝜷~kd,𝜷)𝔼[ϕβ(𝜷^kf,𝜷^kf,d,𝜷)]|𝑃0.\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\phi_{\beta}(\hat{\bm{\beta}}_{k},\tilde{\bm{\beta}}_{k}^{d},\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{k}^{f},\hat{\bm{\beta}}_{k}^{f,d},\bm{\beta})]\right|\overset{P}{\rightarrow}0.
Proof.

First we consider k=Rk=\textrm{R}. From Lemma E.1, we know

supλR[λmin,λmax]|ϕβ(𝜷^R)𝔼[ϕβ(𝜷^Rf)]|𝑃0\sup_{\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{R}})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{R}}^{f})]|\overset{P}{\rightarrow}0

for any 11-Lipschitz function ϕβ\phi_{\beta}. Since 𝜷~Rd\tilde{\bm{\beta}}_{\textrm{R}}^{d} is a Lipschitz function of 𝜷^R\hat{\bm{\beta}}_{\textrm{R}} by Lemma D.14, the claim follows from the corresponding definitions in (22).

For k=Lk=\textrm{L}, however, 𝜷~Ld\tilde{\bm{\beta}}_{\textrm{L}}^{d} is not necessarily a Lipschitz function of 𝜷^L\hat{\bm{\beta}}_{\textrm{L}}. Thus, we instead turn to the α\alpha-smoothed Lasso (α>0\alpha>0) of celentano2020lasso, defined as

𝜷^α=argmin𝒃12n𝒚𝑿𝒃22+λLninf𝜽{n2α𝒃𝜽22+𝜽1}.\hat{\bm{\beta}}_{\alpha}=\operatorname*{arg\,min}_{\bm{b}}\frac{1}{2n}\|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}+\frac{\lambda_{L}}{\sqrt{n}}\inf_{\bm{\theta}}\left\{\frac{\sqrt{n}}{2\alpha}\|\bm{b}-\bm{\theta}\|_{2}^{2}+\|\bm{\theta}\|_{1}\right\}.

Based on this, 𝜷~αd,𝜷^αf,𝜷^αf,d\tilde{\bm{\beta}}_{\alpha}^{d},\hat{\bm{\beta}}_{\alpha}^{f},\hat{\bm{\beta}}_{\alpha}^{f,d} can be defined analogously. We omit the full details for simplicity since they parallel (18)-(24), but we note that, by the KKT conditions, they satisfy the following relations:

𝜷~αd\displaystyle\tilde{\bm{\beta}}_{\alpha}^{d} =𝜷^α+λLMα(𝜷^α)nξα\displaystyle=\hat{\bm{\beta}}_{\alpha}+\frac{\lambda_{\textrm{L}}\nabla M_{\alpha}(\hat{\bm{\beta}}_{\alpha})}{\sqrt{n}\xi_{\alpha}}
𝜷^αf,d\displaystyle\hat{\bm{\beta}}_{\alpha}^{f,d} =𝜷^αf+λLMα(𝜷^αf)nξα,\displaystyle=\hat{\bm{\beta}}_{\alpha}^{f}+\frac{\lambda_{\textrm{L}}\nabla M_{\alpha}(\hat{\bm{\beta}}_{\alpha}^{f})}{\sqrt{n}\xi_{\alpha}},

where Mα(𝒃):=inf𝜽{n2α𝒃𝜽22+𝜽1}M_{\alpha}(\bm{b}):=\inf_{\bm{\theta}}\left\{\frac{\sqrt{n}}{2\alpha}\|\bm{b}-\bm{\theta}\|_{2}^{2}+\|\bm{\theta}\|_{1}\right\} and ξα\xi_{\alpha} is defined similarly to (24). (celentano2020lasso, Theorem B.1) establishes a pointwise version of Lemma E.1 for the α\alpha-smoothed Lasso, i.e., without a supremum over the tuning parameter range, but uniform control can be obtained following our techniques for Lemma E.1. Furthermore, (celentano2020lasso, Section B.5.2) shows that 𝜷~αd\tilde{\bm{\beta}}_{\alpha}^{d} is a C/αC/\alpha-Lipschitz function of 𝜷^α\hat{\bm{\beta}}_{\alpha} for some positive constant CC (by checking the Lipschitzness of Mα\nabla M_{\alpha}). Combining these, we obtain that for a fixed α>0\alpha>0

supλL[λmin,λmax]|ϕβ(𝜷^α,𝜷~αd,𝜷)𝔼[ϕβ(𝜷^αf,𝜷^αf,d,𝜷)]|𝑃0.\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\phi_{\beta}(\hat{\bm{\beta}}_{\alpha},\tilde{\bm{\beta}}_{\alpha}^{d},\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\alpha}^{f},\hat{\bm{\beta}}_{\alpha}^{f,d},\bm{\beta})]\right|\overset{P}{\rightarrow}0. (30)

In addition, (celentano2020lasso, Lemma B.8 and Section B.5.1) establishes the closeness of (debiased) Lasso and (debiased) α\alpha-smoothed Lasso as follows: there exists a constant αmax>0\alpha_{\max}>0 such that

(𝜷^α𝜷^L2C1α,ααmax)=1o(1)(𝜷^αd𝜷^Ld2C1α,ααmax)=1o(1),\begin{gathered}\mathbb{P}\left(\|\hat{\bm{\beta}}_{\alpha}-\hat{\bm{\beta}}_{\textrm{L}}\|_{2}\leq C_{1}\sqrt{\alpha},\forall\alpha\leq\alpha_{\max}\right)=1-o(1)\\ \mathbb{P}\left(\|\hat{\bm{\beta}}_{\alpha}^{d}-\hat{\bm{\beta}}_{\textrm{L}}^{d}\|_{2}\leq C_{1}\sqrt{\alpha},\forall\alpha\leq\alpha_{\max}\right)=1-o(1),\end{gathered} (31)

which can be easily extended to the uniform version since C1,αmaxC_{1},\alpha_{\max} and constants hiding in o(1)o(1) do not depend on λ\lambda, and estimators for all λ\lambda share the same source of randomness.

Finally, (celentano2020lasso, Lemma A.5) shows that there exist constants C2C_{2} and αmax\alpha_{\max} such that ααmax\forall\alpha\leq\alpha_{\max}:

|τLτα|C2α,|ξLξα|C2α,|\tau_{\textrm{L}}-\tau_{\alpha}|\leq C_{2}\sqrt{\alpha},\ \ |\xi_{\textrm{L}}-\xi_{\alpha}|\leq C_{2}\sqrt{\alpha}, (32)

which can similarly be extended to the uniform version.

Once uniform control results are extended, the rest of the proof stays the same as in (celentano2020lasso, Section B.5), on combining (30)-(32). ∎
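For concreteness, the smoothed penalty M_α used above is, up to the √n scaling, the Moreau envelope of the ℓ1 norm, so the inner infimum is attained at a soft-thresholding of 𝒃 and ∇M_α is the corresponding clipped (Huber-type) score; this is the standard Moreau-envelope identity rather than anything specific to our setting. A minimal sketch, with illustrative function names:

```python
import numpy as np

def soft_threshold(b, t):
    """Componentwise soft-thresholding with threshold t >= 0."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def M_alpha(b, alpha, n):
    """M_alpha(b) = inf_theta { sqrt(n)/(2 alpha) ||b - theta||_2^2 + ||theta||_1 }."""
    gamma = alpha / np.sqrt(n)              # Moreau-envelope parameter
    theta = soft_threshold(b, gamma)        # minimizer of the inner problem
    return np.sum((b - theta) ** 2) / (2 * gamma) + np.sum(np.abs(theta))

def grad_M_alpha(b, alpha, n):
    """Gradient of M_alpha: (b - soft_threshold(b, alpha/sqrt(n))) * sqrt(n)/alpha."""
    gamma = alpha / np.sqrt(n)
    return (b - soft_threshold(b, gamma)) / gamma
```

In particular, ∇M_α is (√n/α)-Lipschitz, which is the source of the C/α-Lipschitz behavior of the smoothed debiased estimator noted above.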

As a consequence of Lemmas D.4 and D.8, we obtain the following corollary. An analogous result was established in (celentano2021cad, Corollary J.2), but unlike their case, our guarantee here is uniform over the tuning parameter space. Thus, the result below relies crucially on our earlier results that provide such uniform guarantees.

Corollary D.2.

Recall the definition of 𝐠^L\hat{\bm{g}}_{\textrm{L}} from (26) and that 𝐠Lf=𝛃^Lf,d𝛃\bm{g}_{L}^{f}=\hat{\bm{\beta}}_{L}^{f,d}-\bm{\beta} from (21). Under Assumption 4, for any 11-Lipschitz function ϕβ:(p)2\phi_{\beta}:(\mathbb{R}^{p})^{2}\rightarrow\mathbb{R},

supλL[λmin,λmax]|ϕβ(𝜷^L,𝒈^L)𝔼[ϕβ(𝜷^Lf,𝒈Lf)]|𝑃0.\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}},\hat{\bm{g}}_{\textrm{L}})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f},\bm{g}_{\textrm{L}}^{f})]|\overset{P}{\rightarrow}0.
Proof.

By triangle inequality,

supλL[λmin,λmax]|ϕβ(𝜷^L,𝒈^L)𝔼[ϕβ(𝜷^Lf,𝒈Lf)]|\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}},\hat{\bm{g}}_{\textrm{L}})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f},\bm{g}_{\textrm{L}}^{f})]|
\displaystyle\leq supλL[λmin,λmax]|ϕβ(𝜷^L,𝒈^L)ϕβ(𝜷^L,𝜷~Ld𝜷)|+supλL[λmin,λmax]|ϕβ(𝜷^L,𝜷~Ld𝜷)𝔼[ϕβ(𝜷^Lf,𝒈Lf)]|\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}},\hat{\bm{g}}_{\textrm{L}})-\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}},\tilde{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta})|+\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}},\tilde{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f},\bm{g}_{\textrm{L}}^{f})]|
\displaystyle\leq supλL[λmin,λmax]𝒈^L(𝜷~Ld𝜷)2+supλL[λmin,λmax]|ϕβ(𝜷^L,𝜷~Ld𝜷)𝔼[ϕβ(𝜷^Lf,𝜷^Lf,d𝜷)]|𝑃0,\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{g}}_{\textrm{L}}-(\tilde{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta})\|_{2}+\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}},\tilde{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f},\hat{\bm{\beta}}_{L}^{f,d}-\bm{\beta})]|\overset{P}{\rightarrow}0,

since the first summand vanishes due to Lemma D.4 and the second summand vanishes due to Lemma D.8. ∎

To prove Lemma D.5, that is, (27), we need to establish a convergence result that is uniform over λL\lambda_{L} and λR\lambda_{R}. We proceed in two steps: we first establish that (27) holds with the supremum taken only over λL\lambda_{L} (Lemma D.9). Then, by appropriate Lipschitzness arguments (Lemma D.10), we extend the result to hold with the supremum taken simultaneously over λL\lambda_{L} and λR\lambda_{R}.

Lemma D.9.

Denote the quantity 𝔼[ϕβ(𝛃^Lf,d,𝛃^Rf,d,𝛃)|𝐠Lf=𝐠^L]\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}] as ϕβ|L(𝐠^L)\phi_{\beta|L}(\hat{\bm{g}}_{\textrm{L}}). Then under Assumption 4,

supλL[λmin,λmax]|ϕβ|L(𝒈^L)𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)]|𝑃0.\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta|L}(\hat{\bm{g}}_{\textrm{L}})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})]|\overset{P}{\rightarrow}0. (33)
Proof.

From the definition, we know 𝔼[ϕβ|L(𝒈Lf)]=𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)]\mathbb{E}[\phi_{\beta|L}(\bm{g}_{\textrm{L}}^{f})]=\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})], since, per the remark below (26), 𝜷~Ld\tilde{\bm{\beta}}_{\textrm{L}}^{d} equals 𝜷^Lf,d\hat{\bm{\beta}}_{\textrm{L}}^{f,d} conditional on 𝒈Lf=𝒈^L\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}.

Therefore, we just need to show

supλL[λmin,λmax]|ϕβ|L(𝒈^L)𝔼[ϕβ|L(𝒈Lf)]|𝑃0,\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta|L}(\hat{\bm{g}}_{\textrm{L}})-\mathbb{E}[\phi_{\beta|L}(\bm{g}_{\textrm{L}}^{f})]|\overset{P}{\rightarrow}0,

which follows from Corollary D.2, if we can show that ϕβ|L\phi_{\beta|L} is a Lipschitz function. The Lipschitzness follows by an argument similar to (celentano2021cad, Section J.2), so we omit the details here. ∎

Note that Corollary D.2, which in turn relies on Lemmas D.4 and D.8, forms a crucial ingredient for the preceding proof.

Lemma D.10.

Under Assumption 4, with probability 1o(1)1-o(1), Ψ(λR)\Psi(\lambda_{\textrm{R}}) is an M-Lipschitz function of λR\lambda_{\textrm{R}} for some positive constant MM, where

Ψ(λR):=supλL[λmin,λmax]|ϕβ|L(𝒈^L)𝔼[ϕβ(𝜷^Lf,d(λL),𝜷^Rf,d(λR),𝜷)]|.\Psi(\lambda_{\textrm{R}}):=\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta|L}(\hat{\bm{g}}_{\textrm{L}})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d}(\lambda_{\textrm{L}}),\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{\textrm{R}}),\bm{\beta})]|.
Proof of Lemma D.10.

We define (with a slight overload of notations) an auxiliary function:

ψ(λL,λR):=|ϕβ|L(𝒈^L)𝔼[ϕβ(𝜷^Lf,d(λL),𝜷^Rf,d(λR),𝜷)]|.\psi(\lambda_{\textrm{L}},\lambda_{\textrm{R}}):=|\phi_{\beta|L}(\hat{\bm{g}}_{\textrm{L}})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d}(\lambda_{\textrm{L}}),\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{\textrm{R}}),\bm{\beta})]|.

First, τL\tau_{\textrm{L}}, and therefore 𝜷^Lf,d\hat{\bm{\beta}}_{\textrm{L}}^{f,d}, does not depend on λR\lambda_{\textrm{R}}. Also, 𝜷^Rf,d=𝜷+𝒈Rf=𝜷+τR𝝃~\hat{\bm{\beta}}_{\textrm{R}}^{f,d}=\bm{\beta}+\bm{g}_{\textrm{R}}^{f}=\bm{\beta}+\tau_{R}\tilde{\bm{\xi}}, where 𝝃~𝒩(𝟎,𝑰p)\tilde{\bm{\xi}}\sim\mathcal{N}(\bm{0},\bm{I}_{p}) and τR\tau_{R} is C1/nC_{1}/\sqrt{n}-Lipschitz in λR\lambda_{\textrm{R}} by Lemma H.2. Thus, we have

|𝔼[ϕβ(𝜷^Lf,d(λL),𝜷^Rf,d(λ1),𝜷)]𝔼[ϕβ(𝜷^Lf,d(λL),𝜷^Rf,d(λ2),𝜷)]|\displaystyle\left|\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d}(\lambda_{\textrm{L}}),\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{1}),\bm{\beta})]-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d}(\lambda_{\textrm{L}}),\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{2}),\bm{\beta})]\right|
\displaystyle\leq 𝔼[𝜷^Rf,d(λ1)𝜷^Rf,d(λ2)2]\displaystyle\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{1})-\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{2})\|_{2}]
\displaystyle\leq 𝔼[(τR(λ1)τR(λ2))𝝃~22]\displaystyle\sqrt{\mathbb{E}[\|(\tau_{\textrm{R}}(\lambda_{1})-\tau_{\textrm{R}}(\lambda_{2}))\tilde{\bm{\xi}}\|_{2}^{2}]}
\displaystyle\leq C1|λ1λ2|,\displaystyle C_{1}|\lambda_{1}-\lambda_{2}|,

which yields that 𝔼[ϕβ(𝜷^Lf,d(λL),𝜷^Rf,d(λR),𝜷)]\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d}(\lambda_{\textrm{L}}),\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{\textrm{R}}),\bm{\beta})] is CC-Lipschitz in λR\lambda_{\textrm{R}}. Next, note that conditioning on 𝒈Lf=𝒈^L\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}} forces 𝒈Rf\bm{g}_{\textrm{R}}^{f} to take the following value so that their joint covariance structure matches (23),

𝒈Rf=τRρ/τL𝒈^L+τR1ρ2𝝃,where𝝃𝒩(0,𝑰p).\bm{g}_{\textrm{R}}^{f}=\tau_{\textrm{R}}\rho/\tau_{\textrm{L}}\cdot\hat{\bm{g}}_{\textrm{L}}+\tau_{\textrm{R}}\sqrt{1-\rho^{2}}\cdot\bm{\xi},\,\,\text{where}\,\,\bm{\xi}\sim\mathcal{N}(0,\bm{I}_{p}). (34)

This means that conditional on 𝒈Lf=𝒈^L\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}, 𝜷^Rf,d=𝜷+τRρ/τL𝒈^L+τR1ρ2𝝃\hat{\bm{\beta}}_{\textrm{R}}^{f,d}=\bm{\beta}+\tau_{R}\rho/\tau_{L}\hat{\bm{g}}_{\textrm{L}}+\tau_{R}\sqrt{1-\rho^{2}}\bm{\xi}, whereas 𝜷^Ld\hat{\bm{\beta}}_{\textrm{L}}^{d} never depends on λR\lambda_{\textrm{R}}. Also, from 25 and Lemma H.2, τR\tau_{R} is bounded in [τmin2/n,(τmax2+σmax2)/n][\sqrt{\tau_{\min}^{2}/n},\sqrt{(\tau_{\max}^{2}+\sigma_{\max}^{2})/n}] and is CC-Lipschitz in λR\lambda_{\textrm{R}} while both ρ\rho and 1ρ2\sqrt{1-\rho^{2}} are bounded in [0,1][0,1] and CC^{\prime}-Lipschitz in λR\lambda_{\textrm{R}}, and 1/τLn/τmin1/\tau_{L}\leq\sqrt{n}/\tau_{\min}. Combining Lemma D.4 and D.3 yields that 𝒈^L22𝜷^Ld𝜷2\|\hat{\bm{g}}_{\textrm{L}}\|_{2}\leq 2\|\hat{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta}\|_{2} with probability 1o(1)1-o(1). In conjunction with Lemma D.8, this yields that 𝒈^L24𝔼[𝜷^Lf,d𝜷2]4𝔼[𝜷^Lf,d𝜷22]=4nτL24τmax2+σmax2\|\hat{\bm{g}}_{\textrm{L}}\|_{2}\leq 4\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{L}}^{f,d}-\bm{\beta}\|_{2}]\leq 4\sqrt{\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{L}}^{f,d}-\bm{\beta}\|_{2}^{2}]}=4n\tau_{{\textrm{L}}}^{2}\leq 4\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}} with probability 1o(1)1-o(1). Thus,

|𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d(λ1),𝜷)|𝒈Lf=𝒈^L]𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d(λ2),𝜷)|𝒈Lf=𝒈^L]|\displaystyle\left|\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{1}),\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{2}),\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]\right|
\displaystyle\leq 𝔼[|ϕβ(𝜷^Lf,d,𝜷^Rf,d(λ1),𝜷)ϕβ(𝜷^Lf,d,𝜷^Rf,d(λ2),𝜷)||𝒈Lf=𝒈^L]\displaystyle\mathbb{E}\left[\left|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{1}),\bm{\beta})-\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{2}),\bm{\beta})\right||\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}\right]
\displaystyle\leq 𝔼[𝜷^Rf,d(λ1)𝜷^Rf,d(λ2)2|𝒈Lf=𝒈^L]\displaystyle\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{1})-\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{2})\|_{2}|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]
\displaystyle\leq 𝔼[𝜷^Rf,d(λ1)𝜷^Rf,d(λ2)22|𝒈Lf=𝒈^L]\displaystyle\sqrt{\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{1})-\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{2})\|_{2}^{2}|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]}
\displaystyle\leq (4(C+τmax2+σmax2C)τmax2+σmax2/τmin+(C+τmax2+σmax2C))|λ1λ2|.\displaystyle\left(4(C^{\prime}+\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}C)\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}/\tau_{\min}+(C^{\prime}+\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}C)\right)|\lambda_{1}-\lambda_{2}|.

Thus, we conclude that ϕβ|L(𝒈^L)\phi_{\beta|L}(\hat{\bm{g}}_{\textrm{L}}) is (4(C+τmax2+σmax2C)τmax2+σmax2/τmin+(C+τmax2+σmax2C))(4(C^{\prime}+\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}C)\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}/\tau_{\min}+(C^{\prime}+\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}C))-Lipschitz in λR\lambda_{\textrm{R}} with probability 1o(1)1-o(1).

Therefore, with probability 1o(1)1-o(1), ψ(λL,λR)\psi(\lambda_{\textrm{L}},\lambda_{\textrm{R}}) is an MM-Lipschitz function of λR\lambda_{\textrm{R}} for M=4(C+τmax2+σmax2C)τmax2+σmax2/τmin+(C+τmax2+σmax2C)+CM=4(C^{\prime}+\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}C)\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}/\tau_{\min}+(C^{\prime}+\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}C)+C. Notice that MM does not depend on λL\lambda_{\textrm{L}}. Therefore, by Lemma H.9, Ψ(λR):=supλLψ(λL,λR)\Psi(\lambda_{\textrm{R}}):=\sup_{\lambda_{\textrm{L}}}\psi(\lambda_{\textrm{L}},\lambda_{\textrm{R}}) is also an MM-Lipschitz function of λR\lambda_{\textrm{R}}, hence completing the proof. ∎

Now we are ready to prove Lemma D.5.

Proof of Lemma D.5.

Consider the high probability event in Lemma D.10. For any ϵ>0\epsilon>0, define ϵ=ϵ/2M\epsilon^{\prime}=\epsilon/2M. Let k=(λmaxλmin)/ϵk=\lceil(\lambda_{\max}-\lambda_{\min})/\epsilon^{\prime}\rceil. Define, for i=0,,ki=0,...,k: λi=λmin+iϵ\lambda_{i}=\lambda_{\min}+i\epsilon^{\prime}. Then by Lemma D.10, we know supλR[λi1,λi]Ψ(λR)Ψ(λi)+Mϵ\sup_{\lambda_{\textrm{R}}\in[\lambda_{i-1},\lambda_{i}]}\Psi(\lambda_{\textrm{R}})\leq\Psi(\lambda_{i})+M\epsilon^{\prime}.

By union bound, we have

(supλR[λmin,λmax]Ψ(λR)ϵ)\displaystyle\mathbb{P}\left(\sup_{\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\Psi(\lambda_{\textrm{R}})\geq\epsilon\right)
\displaystyle\leq i=1k(supλR[λi1,λi]Ψ(λR)ϵ)\displaystyle\sum_{i=1}^{k}\mathbb{P}\left(\sup_{\lambda_{\textrm{R}}\in[\lambda_{i-1},\lambda_{i}]}\Psi(\lambda_{\textrm{R}})\geq\epsilon\right)
\displaystyle\leq i=1k(Ψ(λi)ϵMϵ)\displaystyle\sum_{i=1}^{k}\mathbb{P}\left(\Psi(\lambda_{i})\geq\epsilon-M\epsilon^{\prime}\right)
=\displaystyle= i=1k(supλL[λmin,λmax]|ϕβ|L(𝒈^L(λL))𝔼[ϕβ(𝜷^Lf,d(λL),𝜷^Rf,d(λi),𝜷)]|ϵ/2)\displaystyle\sum_{i=1}^{k}\mathbb{P}\left(\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta|L}(\hat{\bm{g}}_{\textrm{L}}(\lambda_{\textrm{L}}))-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d}(\lambda_{\textrm{L}}),\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{i}),\bm{\beta})]|\geq\epsilon/2\right)
=\displaystyle= o(1),\displaystyle o(1),

on using Lemma D.9. ∎
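The discretization step above is the usual covering argument: uniform control on a grid of mesh ϵ′ plus an M-Lipschitz modulus between grid points yields uniform control over the whole interval. A minimal numerical sketch of this principle, with an arbitrary Lipschitz surrogate in place of Ψ:

```python
import numpy as np

# Covering-argument sanity check: for an M-Lipschitz Psi, the supremum over
# [lam_min, lam_max] is at most the maximum over a grid of mesh eps_prime plus M * eps_prime.
M, lam_min, lam_max = 2.0, 0.1, 5.0
Psi = lambda lam: np.sin(M * lam)              # arbitrary M-Lipschitz surrogate for Psi

eps_prime = 1e-3
grid = np.arange(lam_min, lam_max + eps_prime, eps_prime)
fine = np.linspace(lam_min, lam_max, 200_000)  # proxy for the continuum supremum

assert Psi(fine).max() <= Psi(grid).max() + M * eps_prime
```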

D.2.3 Proof of Lemma D.6

Recall the definition of 𝒘^k\hat{\bm{w}}_{k} from (28). We define a loss function that yields 𝒘^R\hat{\bm{w}}_{\textrm{R}} as the minimizer:

𝒞λR(𝒘):=12n𝑿𝒘σ𝒛22+λR2𝒘+𝜷22.\mathcal{C}_{\lambda_{\textrm{R}}}(\bm{w}):=\frac{1}{2n}\|\bm{X}\bm{w}-\sigma\bm{z}\|_{2}^{2}+\frac{\lambda_{\textrm{R}}}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}.
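Since 𝒚 = 𝑿𝜷 + σ𝒛, this loss is just the ridge objective re-centered at 𝜷, so its minimizer is indeed ŵ_R = β̂_R − 𝜷. A minimal numerical sketch of this change of variables on synthetic Gaussian data (the dimensions and tuning value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam_R, sigma = 300, 60, 0.5, 1.0
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
z = rng.standard_normal(n)
y = X @ beta + sigma * z

# Ridge fit: argmin_b ||y - X b||^2 / (2n) + lam_R ||b||^2 / 2
beta_R = np.linalg.solve(X.T @ X / n + lam_R * np.eye(p), X.T @ y / n)

# Minimizer of C_{lam_R}(w) = ||X w - sigma z||^2 / (2n) + lam_R ||w + beta||^2 / 2,
# computed from its normal equations; it coincides with w_hat_R = beta_R - beta.
w_hat = np.linalg.solve(X.T @ X / n + lam_R * np.eye(p),
                        X.T @ (sigma * z) / n - lam_R * beta)
assert np.allclose(w_hat, beta_R - beta)
```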

We next introduce some supporting lemmas.

Lemma D.11.

Recall the definition of 𝐠^L\hat{\bm{g}}_{\textrm{L}} from (26). Under Assumption 4, for any ϵ>0\epsilon>0, there exists a constant CC such that for any 11-Lipschitz function ϕw:p\phi_{w}:\mathbb{R}^{p}\rightarrow\mathbb{R},

supλL,λR[λmin,λmax](𝒘p,|ϕw(𝒘)𝔼[ϕw(𝜷^Rf𝜷)|𝒈Lf=𝒈^L]|ϵand𝒞λR(𝒘)min𝒞λR+Cϵ2)=o(ϵ2).\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\exists\bm{w}\in\mathbb{R}^{p},\left|\phi_{w}(\bm{w})-\mathbb{E}[\phi_{w}(\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]\right|\geq\epsilon\ \text{and}\ \mathcal{C}_{\lambda_{\textrm{R}}}(\bm{w})\leq\min\mathcal{C}_{\lambda_{\textrm{R}}}+C\epsilon^{2}\right)=o(\epsilon^{2}).
Proof.

From part of (celentano2021cad, Lemma F.4), we know that

(𝒘p,|ϕw(𝒘)𝔼[ϕw(𝜷^Rf𝜷)|𝒈Lf=𝒈^L]|ϵand𝒞λR(𝒘)min𝒞λR+Cϵ2)=o(ϵ2).\mathbb{P}\left(\exists\bm{w}\in\mathbb{R}^{p},\left|\phi_{w}(\bm{w})-\mathbb{E}[\phi_{w}(\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]\right|\geq\epsilon\ \text{and}\ \mathcal{C}_{\lambda_{\textrm{R}}}(\bm{w})\leq\min\mathcal{C}_{\lambda_{\textrm{R}}}+C\epsilon^{2}\right)=o(\epsilon^{2}).

The proof is then completed by the fact that as in (celentano2021cad, Lemma F.4), o(ϵ2)o(\epsilon^{2}) only hides constants that do not depend on λL,λR\lambda_{\textrm{L}},\lambda_{\textrm{R}}. ∎

Lemma D.12.

Under Assumption 4, there exists a positive constant KK such that

(λ,λ[λmin,λmax],𝒞λ(𝒘^R(λ))𝒞λ(𝒘^R(λ))+K|λλ|)=1o(1).\mathbb{P}\left(\forall\lambda,\lambda^{\prime}\in[\lambda_{\min},\lambda_{\max}],\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda))\leq\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda^{\prime}))+K|\lambda-\lambda^{\prime}|\right)=1-o(1).
Proof.

We have

𝒞λ(𝒘^R(λ))𝒞λ(𝒘^R(λ))\displaystyle\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda))-\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda^{\prime}))
=\displaystyle= 𝒞λ(𝒘^R(λ))𝒞λ(𝒘^R(λ))+𝒞λ(𝒘^R(λ))𝒞λ(𝒘^R(λ))+𝒞λ(𝒘^R(λ))𝒞λ(𝒘^R(λ))\displaystyle\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda))-\mathcal{C}_{\lambda}(\hat{\bm{w}}_{\textrm{R}}(\lambda))+\mathcal{C}_{\lambda}(\hat{\bm{w}}_{\textrm{R}}(\lambda))-\mathcal{C}_{\lambda}(\hat{\bm{w}}_{\textrm{R}}(\lambda^{\prime}))+\mathcal{C}_{\lambda}(\hat{\bm{w}}_{\textrm{R}}(\lambda^{\prime}))-\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda^{\prime}))
\displaystyle\leq 𝒞λ(𝒘^R(λ))𝒞λ(𝒘^R(λ))+𝒞λ(𝒘^R(λ))𝒞λ(𝒘^R(λ))\displaystyle\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda))-\mathcal{C}_{\lambda}(\hat{\bm{w}}_{\textrm{R}}(\lambda))+\mathcal{C}_{\lambda}(\hat{\bm{w}}_{\textrm{R}}(\lambda^{\prime}))-\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda^{\prime}))
=\displaystyle= λλ2(𝜷^R(λ)22𝜷^R(λ)22)\displaystyle\frac{\lambda^{\prime}-\lambda}{2}(\|\hat{\bm{\beta}}_{\textrm{R}}(\lambda)\|_{2}^{2}-\|\hat{\bm{\beta}}_{\textrm{R}}(\lambda^{\prime})\|_{2}^{2})
\displaystyle\leq |λλ|2(𝜷^R(λ)22+𝜷^R(λ)22).\displaystyle\frac{|\lambda^{\prime}-\lambda|}{2}(\|\hat{\bm{\beta}}_{\textrm{R}}(\lambda)\|_{2}^{2}+\|\hat{\bm{\beta}}_{\textrm{R}}(\lambda^{\prime})\|_{2}^{2}).

Further, with probability at least 1en/21-e^{-n/2} we have 𝒛22n\|\bm{z}\|_{2}\leq 2\sqrt{n}, and therefore

λ2𝜷^R(λ)22𝒞λ(𝒘^R(λ))=min𝒃𝒞λ(𝒃)𝒞λ(𝟎)=12nσ𝒛22+λ2𝜷222σ2+λ2𝜷22.\frac{\lambda}{2}\|\hat{\bm{\beta}}_{\textrm{R}}(\lambda)\|_{2}^{2}\leq\mathcal{C}_{\lambda}(\hat{\bm{w}}_{\textrm{R}}(\lambda))=\min_{\bm{b}}\mathcal{C}_{\lambda}(\bm{b})\leq\mathcal{C}_{\lambda}(\bm{0})=\frac{1}{2n}\|\sigma\bm{z}\|_{2}^{2}+\frac{\lambda}{2}\|\bm{\beta}\|_{2}^{2}\leq 2\sigma^{2}+\frac{\lambda}{2}\|\bm{\beta}\|_{2}^{2}.

Hence 𝜷^R(λ)22\|\hat{\bm{\beta}}_{\textrm{R}}(\lambda)\|_{2}^{2} is bounded with high probability and so is 𝜷^R(λ)22\|\hat{\bm{\beta}}_{\textrm{R}}(\lambda^{\prime})\|_{2}^{2}, which completes the proof. ∎

Lemma D.13.

Under Assumption 4, for any ϵ>0\epsilon>0, consider any λ1,λ2[λmin,λmax]\lambda_{1},\lambda_{2}\in[\lambda_{\min},\lambda_{\max}] such that |λ1λ2|ϵ|\lambda_{1}-\lambda_{2}|\geq\epsilon. Then with probability 1o(1)1-o(1), we have

|ψ(λ1,λR)ψ(λ2,λR)|M|λ1λ2|\displaystyle|\psi(\lambda_{1},\lambda_{\textrm{R}})-\psi(\lambda_{2},\lambda_{\textrm{R}})|\leq M|\lambda_{1}-\lambda_{2}|
|ψ(λL,λ1)ψ(λL,λ2)|M|λ1λ2|\displaystyle|\psi(\lambda_{\textrm{L}},\lambda_{1})-\psi(\lambda_{\textrm{L}},\lambda_{2})|\leq M|\lambda_{1}-\lambda_{2}|

for some constant MM (that does not depend on ϵ\epsilon), where ψ(λL,λR)=𝔼[ϕw(𝛃^Rf𝛃)|𝐠Lf=𝐠^L]\psi(\lambda_{\textrm{L}},\lambda_{\textrm{R}})=\mathbb{E}[\phi_{w}(\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}].

We defer the proof to Section E.2. We remark that this is a slightly weaker condition than Lipschitzness of ψ\psi in λk\lambda_{k}, but it suffices for our purpose. For convenience, we call this the “weak-Lipschitz” condition.

Lemma D.14.

Under Assumption 4, 𝛃~Rd\tilde{\bm{\beta}}_{\textrm{R}}^{d} is an MM-Lipschitz function of 𝛃^R\hat{\bm{\beta}}_{\textrm{R}} for some constant MM.

Proof.

Recall from (20) and (24) that 𝜷~Rd=𝜷^R+𝑿(𝒚𝑿𝜷^R)nζR\tilde{\bm{\beta}}_{\textrm{R}}^{d}=\hat{\bm{\beta}}_{\textrm{R}}+\frac{\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{R}})}{n\zeta_{\textrm{R}}}. Further, the KKT condition for ridge regression implies that 1n𝑿(𝒚𝑿𝜷^R)=λR𝜷^R\frac{1}{n}\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{R}})=\lambda_{R}\hat{\bm{\beta}}_{\textrm{R}}. Since from (25) we know that ζR\zeta_{\textrm{R}} is bounded below by a positive constant ζmin\zeta_{\min}, it follows that 𝜷~Rd\tilde{\bm{\beta}}_{\textrm{R}}^{d} is an MM-Lipschitz function of 𝜷^R\hat{\bm{\beta}}_{\textrm{R}} with M=1+λmax/ζminM=1+\lambda_{\max}/\zeta_{\min}. ∎
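A minimal numerical sketch of the two identities used in this proof, namely the ridge KKT condition and the resulting rescaling form of the debiased ridge estimator; the dimensions and the value of ζ_R are illustrative (in the paper ζ_R solves (24)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam_R = 200, 50, 0.7
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Ridge solution of min_b ||y - X b||^2 / (2n) + lam_R ||b||^2 / 2
beta_R = np.linalg.solve(X.T @ X / n + lam_R * np.eye(p), X.T @ y / n)

# KKT condition: X^T (y - X beta_R) / n = lam_R * beta_R
assert np.allclose(X.T @ (y - X @ beta_R) / n, lam_R * beta_R)

# Hence the debiased ridge of (20) is a rescaling of beta_R for any zeta_R > 0
zeta_R = 0.4
beta_R_d = beta_R + X.T @ (y - X @ beta_R) / (n * zeta_R)
assert np.allclose(beta_R_d, (1 + lam_R / zeta_R) * beta_R)
```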

Proof of Lemma D.6.

Let C>0C>0 be as given by Lemma D.11, K>0K>0 as given by Lemma D.12, and M>0M>0 as given by Lemma D.13. For any ϵ>0\epsilon>0, define ϵ=min(Cϵ2K,ϵM)\epsilon^{\prime}=\min\left(\frac{C\epsilon^{2}}{K},\frac{\epsilon}{M}\right). Let k=(λmaxλmin)/ϵk=\lceil(\lambda_{\max}-\lambda_{\min})/\epsilon^{\prime}\rceil and define λi=λmin+iϵ\lambda_{i}=\lambda_{\min}+i\epsilon^{\prime} for i=0,,ki=0,...,k. By Lemma D.11, the event

{iL,iR=1,,k,𝒘p,𝒞λiR(𝒘)min𝒞λiR+Cϵ2|ϕw(𝒘)ψ(λiL,λiR)|ϵ}\left\{\forall i_{\textrm{L}},i_{\textrm{R}}=1,...,k,\forall\bm{w}\in\mathbb{R}^{p},\mathcal{C}_{\lambda_{i_{\textrm{R}}}}(\bm{w})\leq\min\mathcal{C}_{\lambda_{i_{\textrm{R}}}}+C\epsilon^{2}\Rightarrow\left|\phi_{w}(\bm{w})-\psi(\lambda_{i_{\textrm{L}}},\lambda_{i_{\textrm{R}}})\right|\leq\epsilon\right\} (35)

has probability 1k2o(ϵ2)=1o(1)1-k^{2}o(\epsilon^{2})=1-o(1). Now, for λiλRλi+1\lambda_{i}\leq\lambda_{\textrm{R}}\leq\lambda_{i+1}, let λiR\lambda_{i_{\textrm{R}}} denote whichever of λi,λi+1\lambda_{i},\lambda_{i+1} is farther from λR\lambda_{\textrm{R}} (so that |λiRλR|ϵ/2|\lambda_{i_{\textrm{R}}}-\lambda_{\textrm{R}}|\geq\epsilon^{\prime}/2), and define iLi_{\textrm{L}} similarly. On the intersection of event (35) and the event in Lemma D.12, which has probability 1o(1)1-o(1), we have that

𝒞λiR(𝒘^R(λR))min𝒞λiR+Kϵmin𝒞λiR+Cϵ2.\mathcal{C}_{\lambda_{i_{\textrm{R}}}}(\hat{\bm{w}}_{\textrm{R}}(\lambda_{\textrm{R}}))\leq\min\mathcal{C}_{\lambda_{i_{\textrm{R}}}}+K\epsilon^{\prime}\leq\min\mathcal{C}_{\lambda_{i_{\textrm{R}}}}+C\epsilon^{2}.

This implies (since we are on event (35)) that |ϕw(𝒘^R(λR))ψ(λiL,λiR)|ϵ|\phi_{w}(\hat{\bm{w}}_{\textrm{R}}(\lambda_{\textrm{R}}))-\psi(\lambda_{i_{L}},\lambda_{i_{R}})|\leq\epsilon, where 1iRk1\leq i_{\textrm{R}}\leq k. Thus, we have

|ϕw(𝒘^R(λR))ψ(λL,λR)|\displaystyle|\phi_{w}(\hat{\bm{w}}_{\textrm{R}}(\lambda_{\textrm{R}}))-\psi(\lambda_{\textrm{L}},\lambda_{\textrm{R}})|
\displaystyle\leq |ϕw(𝒘^R(λR))ψ(λiL,λiR)|+|ψ(λiL,λiR)ψ(λL,λiR)|+|ψ(λL,λiR)ψ(λL,λR)||\phi_{w}(\hat{\bm{w}}_{\textrm{R}}(\lambda_{\textrm{R}}))-\psi(\lambda_{i_{L}},\lambda_{i_{R}})|+|\psi(\lambda_{i_{L}},\lambda_{i_{R}})-\psi(\lambda_{L},\lambda_{i_{R}})|+|\psi(\lambda_{L},\lambda_{i_{R}})-\psi(\lambda_{\textrm{L}},\lambda_{\textrm{R}})|
\displaystyle\leq ϵ+2Mϵ\displaystyle\epsilon+2M\epsilon^{\prime}
\displaystyle\leq 3ϵ\displaystyle 3\epsilon

with probability 1o(1)1-o(1), where the second-to-last inequality follows from Lemma D.13 and the fact that |λiLλL|,|λiRλR|ϵ|\lambda_{i_{\textrm{L}}}-\lambda_{\textrm{L}}|,|\lambda_{i_{\textrm{R}}}-\lambda_{\textrm{R}}|\leq\epsilon^{\prime}. Finally, we note that ϕβ(𝜷~Ld,𝜷~Rd,𝜷)\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\tilde{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta}) is a Lipschitz function of 𝜷~Rd\tilde{\bm{\beta}}_{\textrm{R}}^{d}; in fact, ϕβ(𝜷~Ld,𝜷~Rd,𝜷)=ϕβ(𝜷~Ld,𝜷^R(1+λR/ζR),𝜷)\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\tilde{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})=\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}(1+\lambda_{R}/\zeta_{R}),\bm{\beta}) by definition. This in turn equals ϕβ(𝜷~Ld,(𝒘^R+𝜷)(1+λR/ζR),𝜷)\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},(\hat{\bm{w}}_{R}+\bm{\beta})(1+\lambda_{R}/\zeta_{R}),\bm{\beta}). If we define this to be ϕw(𝒘^R)\phi_{w}(\hat{\bm{w}}_{R}), then ψ(λL,λR)=𝔼[ϕβ(𝜷~Ld,𝜷^Rf(1+λR/ζR),𝜷)|𝒈Lf=𝒈^L]=𝔼[ϕβ(𝜷~Ld,𝜷^Rf,d,𝜷)|𝒈Lf=𝒈^L]\psi(\lambda_{L},\lambda_{R})=\mathbb{E}[\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{f}(1+\lambda_{R}/\zeta_{R}),\bm{\beta})|\bm{g}^{f}_{L}=\hat{\bm{g}}_{L}]=\mathbb{E}[\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})|\bm{g}^{f}_{L}=\hat{\bm{g}}_{L}], once again by definition. The preceding display then yields the desired result. ∎

Appendix E Proof of supporting lemmas for Section D.2

E.1 Proof of Lemma D.7

We introduce two supporting Lemmas:

Lemma E.1.

Under Assumption 4, for k=L,Rk=L,R and any 11-Lipschitz function ϕβ:p\phi_{\beta}:\mathbb{R}^{p}\rightarrow\mathbb{R},

supλk[λmin,λmax]|ϕβ(𝜷^k)𝔼[ϕβ(𝜷^kf)]|𝑃0.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{k})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{k}^{f})]|\overset{P}{\rightarrow}0.\\
Lemma E.2.

Recall 𝐮^k=𝐲𝐗𝛃^k\hat{\bm{u}}_{k}=\bm{y}-\bm{X}\hat{\bm{\beta}}_{k}. Further define 𝐮^kf=nζk𝐡kf\hat{\bm{u}}_{k}^{f}=\sqrt{n}\zeta_{k}\bm{h}_{k}^{f}, where (𝐡Lf,𝐡Rf)𝒩(𝟎,𝐒𝐈n)(\bm{h}_{\textrm{L}}^{f},\bm{h}_{\textrm{R}}^{f})\sim\mathcal{N}(\bm{0},\bm{S}\otimes\bm{I}_{n}) for 𝐒\bm{S} defined in Eqn. 24. Under Assumption 4, for any 11-Lipschitz function ϕu:n\phi_{u}:\mathbb{R}^{n}\rightarrow\mathbb{R},

supλk[λmin,λmax]|ϕu(𝒖^kn)𝔼[ϕu(𝒖^kfn)]|𝑃0.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\phi_{u}\left(\frac{\hat{\bm{u}}_{k}}{\sqrt{n}}\right)-\mathbb{E}\left[\phi_{u}\left(\frac{\hat{\bm{u}}_{k}^{f}}{\sqrt{n}}\right)\right]\right|\overset{P}{\rightarrow}0.

Lemmas E.1 and E.2 are proved in Section E.3.

Proof of Lemma D.7.

First consider 𝒘^k=𝜷^k𝜷\hat{\bm{w}}_{k}=\hat{\bm{\beta}}_{k}-\bm{\beta}. From Lemma E.1, we know that for any 11-Lipschitz function ϕ\phi,

supλk[λmin,λmax]|ϕ(𝒘^k)𝔼[ϕ(𝜷^kf𝜷)]|𝑃0,\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}|\phi(\hat{\bm{w}}_{k})-\mathbb{E}[\phi(\hat{\bm{\beta}}_{k}^{f}-\bm{\beta})]|\overset{P}{\rightarrow}0,

where recall from Eqn. 21 that 𝜷^kf𝜷=ηk(𝜷+𝒈kf,ζk)𝜷\hat{\bm{\beta}}_{k}^{f}-\bm{\beta}=\eta_{k}(\bm{\beta}+\bm{g}_{k}^{f},\zeta_{k})-\bm{\beta} is a 1/p1/\sqrt{p}-Lipschitz function of p𝒈kf\sqrt{p}\bm{g}_{k}^{f} due to the Lipschitzness of the proximal mapping operator, where p𝒈kf𝒩(𝟎,pτk2𝑰p)\sqrt{p}\bm{g}_{k}^{f}\sim\mathcal{N}(\bm{0},p\tau_{k}^{2}\bm{I}_{p}) with pτk2δτmax2+δσmax2p\tau_{k}^{2}\leq\delta\tau_{\max}^{2}+\delta\sigma_{\max}^{2} and δ=p/n\delta=p/n. Also, 𝔼[𝒈kf22]=pτk2\mathbb{E}[\|\bm{g}_{k}^{f}\|_{2}^{2}]=p\tau_{k}^{2}. Moreover, from (21) and (24), 𝔼[𝜷^kf𝜷22]=nτk2σ2τmax2\mathbb{E}[\|\hat{\bm{\beta}}_{k}^{f}-\bm{\beta}\|_{2}^{2}]=n\tau_{k}^{2}-\sigma^{2}\leq\tau_{\max}^{2}, which is bounded by (25). Therefore, by Lemma H.3, we know

supλL[λmin,λmax]|𝒘^k22(nτk2σ2)|𝑃0.\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\|\hat{\bm{w}}_{k}\|_{2}^{2}-(n\tau_{k}^{2}-\sigma^{2})\right|\overset{P}{\rightarrow}0.

We also know that nτk2σ2τmin2n\tau_{k}^{2}-\sigma^{2}\geq\tau_{\min}^{2} by (25). Thus, by Lemma H.7,

supλL[λmin,λmax]|𝒘^k2nτk2σ2|𝑃0.\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\|\hat{\bm{w}}_{k}\|_{2}-\sqrt{n\tau_{k}^{2}-\sigma^{2}}\right|\overset{P}{\rightarrow}0.

As a direct corollary, with probability 1o(1)1-o(1), 𝒘^k22nτk2σ22τmax\|\hat{\bm{w}}_{k}\|_{2}\leq 2\sqrt{n\tau_{k}^{2}-\sigma^{2}}\leq 2\tau_{\max} and 𝒘^k2nτk2σ2/2τmin/2\|\hat{\bm{w}}_{k}\|_{2}\geq\sqrt{n\tau_{k}^{2}-\sigma^{2}}/2\geq\tau_{\min}/2 for all λk\lambda_{k}, so 𝒘^k2\|\hat{\bm{w}}_{k}\|_{2} is bounded both above and below for all λk\lambda_{k} with probability 1o(1)1-o(1). Similarly, the convergence of 𝒖^k2/n\|\hat{\bm{u}}_{k}\|_{2}/\sqrt{n} follows by starting from Lemma E.2 and combining with (25) and Lemmas H.3 and H.7.

Finally we consider 𝒗^k\hat{\bm{v}}_{k}. We know 𝒗^k=𝑿𝒖^k\hat{\bm{v}}_{k}=\bm{X}^{\top}\hat{\bm{u}}_{k} and that 𝑿op/n\|\bm{X}\|_{\textnormal{op}}/\sqrt{n} is bounded with probability 1o(1)1-o(1) (Corollary H.1). Thus,

supλk[λmin,λmax]𝒗^k2/n𝑿op/nsupλk[λmin,λmax]𝒖^k2/n,\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{v}}_{k}\|_{2}/n\leq\|\bm{X}\|_{\textnormal{op}}/\sqrt{n}\cdot\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{u}}_{k}\|_{2}/\sqrt{n},

which is bounded above with probability 1o(1)1-o(1). This completes the proof. ∎

E.2 Proof of Lemma D.13

We introduce another Lemma:

Lemma E.3.

Under Assumption 4, for any ϵ>0\epsilon>0, consider any λ1,λ2[λmin,λmax]\lambda_{1},\lambda_{2}\in[\lambda_{\min},\lambda_{\max}] such that |λ1λ2|ϵ|\lambda_{1}-\lambda_{2}|\geq\epsilon. Then, with probability 1o(1)1-o(1), we have

𝜷^L(λ)2M,λ[λmin,λmax],𝜷^L(λ1)𝜷^L(λ2)2M|λ1λ2|.\begin{gathered}\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda)\|_{2}\leq M,\,\,\forall\lambda\in[\lambda_{\min},\lambda_{\max}],\\ \|\hat{\bm{\beta}}_{L}(\lambda_{1})-\hat{\bm{\beta}}_{L}(\lambda_{2})\|_{2}\leq M|\lambda_{1}-\lambda_{2}|.\end{gathered}

for some positive constant MM that does not depend on ϵ\epsilon.

Proof.

The first line follows directly from Lemma D.7 and the fact that 𝜷^L=𝒘^L+𝜷\hat{\bm{\beta}}_{\textrm{L}}=\hat{\bm{w}}_{\textrm{L}}+\bm{\beta}. For the second line, consider any ϵ>0\epsilon>0. It is a direct consequence of Lemma E.1 that

(supλL[λmin,λmax]|𝜷^L(λL)2𝔼𝜷^Lf(λL)2|ϵ)=o(1).\mathbb{P}\left(\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{\textrm{L}})\|_{2}-\mathbb{E}\|\hat{\bm{\beta}}_{\textrm{L}}^{f}(\lambda_{\textrm{L}})\|_{2}\right|\geq\epsilon\right)=o(1). (36)

For any λ1,λ2\lambda_{1},\lambda_{2}, by triangle inequality, we have

𝜷^L(λ1)𝜷^L(λ2)2𝜷^L(λ1)𝜷^Lf(λ1)2+𝜷^L(λ2)𝜷^Lf(λ2)2+𝜷^Lf(λ1)𝜷^Lf(λ2)2.\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{1})-\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{2})\|_{2}\leq\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{1})-\hat{\bm{\beta}}_{\textrm{L}}^{f}(\lambda_{1})\|_{2}+\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{2})-\hat{\bm{\beta}}_{\textrm{L}}^{f}(\lambda_{2})\|_{2}+\|\hat{\bm{\beta}}_{\textrm{L}}^{f}(\lambda_{1})-\hat{\bm{\beta}}_{\textrm{L}}^{f}(\lambda_{2})\|_{2}.

Now, 𝜷^L(λ1)𝜷^Lf(λ1)2ϵ\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{1})-\hat{\bm{\beta}}_{\textrm{L}}^{f}(\lambda_{1})\|_{2}\leq\epsilon and 𝜷^L(λ2)𝜷^Lf(λ2)2ϵ\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{2})-\hat{\bm{\beta}}_{\textrm{L}}^{f}(\lambda_{2})\|_{2}\leq\epsilon with probability 1o(1)1-o(1) by (36). Further, recalling the definition of 𝜷^Lf\hat{\bm{\beta}}_{\textrm{L}}^{f} in (21) and noticing that the minimization problem is separable, we see that the ii-th entry of 𝜷^Lf\hat{\bm{\beta}}_{\textrm{L}}^{f} satisfies

𝜷^Lf,(i)=η(βi+τLZi,λLnζL),\hat{\bm{\beta}}_{\textrm{L}}^{f,(i)}=\eta(\beta_{i}+\tau_{L}Z_{i},\frac{\lambda_{\textrm{L}}}{\sqrt{n}\zeta_{\textrm{L}}}),

where Zii.i.d.𝒩(0,1)Z_{i}\overset{i.i.d.}{\sim}\mathcal{N}(0,1) and

η(x,b)={x+b,x<b0,bxbxb,x>b,\eta(x,b)=\begin{cases}x+b,\ x<-b\\ 0,\ -b\leq x\leq b\\ x-b,\ x>b\end{cases},

the soft-thresholding operator, is 11-Lipschitz in both xx and bb. Thus,

𝜷^Lf(λ1)𝜷^Lf(λ2)22\displaystyle\|\hat{\bm{\beta}}_{L}^{f}(\lambda_{1})-\hat{\bm{\beta}}_{L}^{f}(\lambda_{2})\|_{2}^{2}
=\displaystyle= i=1p(η(βi+τL(λ1)Zi,λ1nζL(λ1))η(βi+τL(λ2)Zi,λ2nζL(λ2)))2\displaystyle\sum_{i=1}^{p}\left(\eta(\beta_{i}+\tau_{L}(\lambda_{1})Z_{i},\frac{\lambda_{1}}{\sqrt{n}\zeta_{L}(\lambda_{1})})-\eta(\beta_{i}+\tau_{L}(\lambda_{2})Z_{i},\frac{\lambda_{2}}{\sqrt{n}\zeta_{L}(\lambda_{2})})\right)^{2}
\displaystyle\leq i=1p2((τL(λ1)ZiτL(λ2)Zi)2+(λ1nζL(λ1)λ2nζL(λ2))2)\displaystyle\sum_{i=1}^{p}2\left((\tau_{L}(\lambda_{1})Z_{i}-\tau_{L}(\lambda_{2})Z_{i})^{2}+(\frac{\lambda_{1}}{\sqrt{n}\zeta_{L}(\lambda_{1})}-\frac{\lambda_{2}}{\sqrt{n}\zeta_{L}(\lambda_{2})})^{2}\right)
=\displaystyle= 2((τL(λ1)τL(λ2))2i=1pZi2+δ(λ1ζL(λ1)λ2ζL(λ2))2)\displaystyle 2\left((\tau_{L}(\lambda_{1})-\tau_{L}(\lambda_{2}))^{2}\sum_{i=1}^{p}Z_{i}^{2}+\delta(\frac{\lambda_{1}}{\zeta_{L}(\lambda_{1})}-\frac{\lambda_{2}}{\zeta_{L}(\lambda_{2})})^{2}\right)
\displaystyle\leq M2|λ1λ2|2\displaystyle M^{2}|\lambda_{1}-\lambda_{2}|^{2}

for some constant MM with probability 1o(1)1-o(1), where we used (25), Lemma H.8, and the facts that nτL(λ),ζL(λ)\sqrt{n}\tau_{L}(\lambda),\zeta_{L}(\lambda) are bounded Lipschitz functions of λL\lambda_{L} (from Lemma H.2) and i=1pZi2/p\sum_{i=1}^{p}Z_{i}^{2}/p is bounded with probability 1o(1)1-o(1). Combining the above, we know that with probability 1o(1)1-o(1),

𝜷^L(λ1)𝜷^L(λ2)22ϵ+M|λ1λ2|(M+2)|λ1λ2|,\|\hat{\bm{\beta}}_{L}(\lambda_{1})-\hat{\bm{\beta}}_{L}(\lambda_{2})\|_{2}\leq 2\epsilon+M|\lambda_{1}-\lambda_{2}|\leq(M+2)|\lambda_{1}-\lambda_{2}|,

which concludes the proof. ∎
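The only analytic input to the displayed chain above is that the soft-thresholding operator η is 1-Lipschitz in each of its two arguments, so that (η(x₁,b₁)−η(x₂,b₂))² ≤ 2(x₁−x₂)² + 2(b₁−b₂)². A minimal numerical sanity check of this inequality:

```python
import numpy as np

def eta(x, b):
    """Soft-thresholding operator; 1-Lipschitz in both x and b (for b >= 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - b, 0.0)

rng = np.random.default_rng(2)
x1, x2 = rng.standard_normal(10**5), rng.standard_normal(10**5)
b1, b2 = np.abs(rng.standard_normal(10**5)), np.abs(rng.standard_normal(10**5))

# (eta(x1, b1) - eta(x2, b2))^2 <= 2 (x1 - x2)^2 + 2 (b1 - b2)^2, componentwise
lhs = (eta(x1, b1) - eta(x2, b2)) ** 2
rhs = 2 * (x1 - x2) ** 2 + 2 * (b1 - b2) ** 2
assert np.all(lhs <= rhs + 1e-12)
```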

As a corollary, we have the following lemma:

Lemma E.4.

Under Assumption 4, for any ϵ>0\epsilon>0, consider any λ1,λ2[λmin,λmax]\lambda_{1},\lambda_{2}\in[\lambda_{\min},\lambda_{\max}] such that |λ1λ2|ϵ|\lambda_{1}-\lambda_{2}|\geq\epsilon, then with probability 1o(1)1-o(1), we have

𝒈^L(λ1)𝒈^L(λ2)2M|λ1λ2|,\|\hat{\bm{g}}_{L}(\lambda_{1})-\hat{\bm{g}}_{L}(\lambda_{2})\|_{2}\leq M|\lambda_{1}-\lambda_{2}|,

for some constant MM (that does not depend on ϵ\epsilon).

Proof.

Recall Eqn. 26, and note the following:

  • ζL,nτL\zeta_{\textrm{L}},\sqrt{n}\tau_{L} are bounded Lipschitz functions of λL\lambda_{\textrm{L}}, and as a simple corollary, nτL2σ2\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}} is also a bounded Lipschitz function of λL\lambda_{\textrm{L}}.

  • |𝒘^L(λ1)2𝒘^L(λ2)2|𝒘^L(λ1)𝒘^L(λ2)2=𝜷^L(λ1)𝜷^L(λ2)2M|λ1λ2|\left|\|\hat{\bm{w}}_{L}(\lambda_{1})\|_{2}-\|\hat{\bm{w}}_{L}(\lambda_{2})\|_{2}\right|\leq\|\hat{\bm{w}}_{L}(\lambda_{1})-\hat{\bm{w}}_{L}(\lambda_{2})\|_{2}=\|\hat{\bm{\beta}}_{L}(\lambda_{1})-\hat{\bm{\beta}}_{L}(\lambda_{2})\|_{2}\leq M|\lambda_{1}-\lambda_{2}| with probability 1o(1)1-o(1) by Lemma E.3. Also, 𝒘^L2\|\hat{\bm{w}}_{\textrm{L}}\|_{2} is bounded with probability 1o(1)1-o(1) by Lemma D.7.

  • 𝒖^L=σ𝒛𝑿𝒘^L\hat{\bm{u}}_{\textrm{L}}=\sigma\bm{z}-\bm{X}\hat{\bm{w}}_{\textrm{L}}, so 1n𝒖^L(λ1)𝒖^L(λ2)21nσmax(𝑿)𝒘^L(λ1)𝒘^L(λ2)2(2+δ)M|λ1λ2|\frac{1}{\sqrt{n}}\|\hat{\bm{u}}_{L}(\lambda_{1})-\hat{\bm{u}}_{L}(\lambda_{2})\|_{2}\leq\frac{1}{\sqrt{n}}\sigma_{\max}(\bm{X})\|\hat{\bm{w}}_{L}(\lambda_{1})-\hat{\bm{w}}_{L}(\lambda_{2})\|_{2}\leq(2+\sqrt{\delta})M|\lambda_{1}-\lambda_{2}| with probability 1o(1)1-o(1), where we used Corollary H.1. Also 1n𝒖^L21n(σ𝒛2+𝑿𝒘^L2)\frac{1}{\sqrt{n}}\|\hat{\bm{u}}_{\textrm{L}}\|_{2}\leq\frac{1}{\sqrt{n}}(\sigma\|\bm{z}\|_{2}+\|\bm{X}\hat{\bm{w}}_{\textrm{L}}\|_{2}) is bounded with probability 1o(1)1-o(1) for the same reason.

  • By the same argument, 1n𝒗^L(λ1)𝒗^L(λ2)2(2+δ)2M|λ1λ2|\frac{1}{n}\|\hat{\bm{v}}_{L}(\lambda_{1})-\hat{\bm{v}}_{L}(\lambda_{2})\|_{2}\leq(2+\sqrt{\delta})^{2}M|\lambda_{1}-\lambda_{2}| with probability 1o(1)1-o(1) and 1n𝒗^2\frac{1}{n}\|\hat{\bm{v}}\|_{2} is bounded with probability 1o(1)1-o(1).

Then the proof is complete on iteratively applying Lemma H.8 on the above displays. ∎

We are now ready to prove Lemma D.13.

Proof of Lemma D.13.

By Jensen’s inequality and the fact that ϕw\phi_{w} is 1-Lipschitz, we only need to show that conditional on 𝒈Lf=𝒈^L\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}, 𝒘^Rf=𝜷^Rf𝜷\hat{\bm{w}}_{\textrm{R}}^{f}=\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta} satisfies the “weak-Lipschitz” condition in λL\lambda_{\textrm{L}} with probability 1o(1)1-o(1), where by Eqn. 40,

𝜷^Rf𝜷\displaystyle\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta} =1ζR+λR(ζR𝒈RfλR𝜷)\displaystyle=\frac{1}{\zeta_{\textrm{R}}+\lambda_{\textrm{R}}}(\zeta_{\textrm{R}}\bm{g}_{\textrm{R}}^{f}-\lambda_{R}\bm{\beta})
=1ζR+λR(ζRτRρ/τL𝒈^L+ζRτR1ρ2𝝃λR𝜷).\displaystyle=\frac{1}{\zeta_{\textrm{R}}+\lambda_{\textrm{R}}}(\zeta_{\textrm{R}}\tau_{R}\rho/\tau_{\textrm{L}}\cdot\hat{\bm{g}}_{\textrm{L}}+\zeta_{\textrm{R}}\tau_{R}\sqrt{1-\rho^{2}}\cdot\bm{\xi}-\lambda_{\textrm{R}}\bm{\beta}).

Now we note the following observations.

  • ζR,λR,𝜷\zeta_{\textrm{R}},\lambda_{\textrm{R}},\bm{\beta} do not depend on λL\lambda_{\textrm{L}}, and by Lemma H.2, ζR,λR\zeta_{\textrm{R}},\lambda_{\textrm{R}} are both bounded Lipschitz functions of λR\lambda_{\textrm{R}}. Further, 𝜷2\|\bm{\beta}\|_{2} is bounded. Thus, by Lemma H.8, λRζR+λR𝜷\frac{\lambda_{\textrm{R}}}{\zeta_{\textrm{R}}+\lambda_{\textrm{R}}}\bm{\beta} is Lipschitz in both λL,λR\lambda_{\textrm{L}},\lambda_{\textrm{R}} with some constant M1M_{1}.

  • In addition, by Lemma H.2, nτR\sqrt{n}\tau_{R} is a bounded Lipschitz function of λR\lambda_{\textrm{R}}, and 1ρ2\sqrt{1-\rho^{2}} is bounded and Lipschitz in both λL\lambda_{\textrm{L}} and λR\lambda_{\textrm{R}}. Further, 𝝃/n𝒩(𝟎,𝑰p/n)\bm{\xi}/\sqrt{n}\sim\mathcal{N}(\bm{0},\bm{I}_{p}/n), so 𝝃2/n\|\bm{\xi}\|_{2}/\sqrt{n} is bounded with probability 1o(1)1-o(1). Therefore, by Lemma H.8, ζRτR1ρ2ζR+λR𝝃\frac{\zeta_{\textrm{R}}\tau_{R}\sqrt{1-\rho^{2}}}{\zeta_{\textrm{R}}+\lambda_{\textrm{R}}}\bm{\xi} is Lipschitz in both λL,λR\lambda_{\textrm{L}},\lambda_{\textrm{R}} with some constant M2M_{2} with probability 1o(1)1-o(1).

  • By Lemma H.2, ρ\rho is bounded and Lipschitz in both λL\lambda_{\textrm{L}} and λR\lambda_{\textrm{R}}, and nτk\sqrt{n}\tau_{k} is bounded and Lipschitz in λk\lambda_{k}.

  • By Lemma D.4 and Lemma D.8, 𝒈^L22𝜷^Ld𝜷24𝔼[𝜷^Lf,d𝜷2]4𝔼[𝜷^Lf,d𝜷22]=4nτL24(τmax2+σmax2)\|\hat{\bm{g}}_{\textrm{L}}\|_{2}\leq 2\|\hat{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta}\|_{2}\leq 4\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{L}}^{f,d}-\bm{\beta}\|_{2}]\leq 4\sqrt{\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{L}}^{f,d}-\bm{\beta}\|_{2}^{2}]}=4n\tau_{\textrm{L}}^{2}\leq 4(\tau_{\max}^{2}+\sigma_{\max}^{2}) with probability 1o(1)1-o(1), where the equality follows from (21) and (24).

  • Further, 𝒈^L\hat{\bm{g}}_{\textrm{L}} does not depend on λR\lambda_{\textrm{R}}, and by Lemma E.4, 𝒈^L\hat{\bm{g}}_{\textrm{L}} satisfies the “weak-Lipschitz” condition w.r.t. λL\lambda_{\textrm{L}} with probability 1o(1)1-o(1).

  • Hence, by Lemma H.8, ζRτRρτL(ζR+λR)𝒈^L\frac{\zeta_{\textrm{R}}\tau_{R}\rho}{\tau_{L}(\zeta_{\textrm{R}}+\lambda_{\textrm{R}})}\hat{\bm{g}}_{\textrm{L}} satisfies the aforementioned condition w.r.t. both λL,λR\lambda_{\textrm{L}},\lambda_{\textrm{R}} with some constant M3M_{3} with probability 1o(1)1-o(1).

Combining the above steps completes the proof. ∎

E.3 Proof of Lemma E.1

For the case of the Lasso, (miolane2021distribution, Theorem 3.1) proved an analogous result for W2W_{2} convergence. Although this does not directly yield our current lemma, the ideas therein sometimes prove to be useful. Below we present the proof of Lemma E.1 for the case of the ridge. Our approach towards handling 1\ell_{1} convergence can be adapted to extend (miolane2021distribution, Theorem 3.1) for the lasso to an 1\ell_{1} convergence result as well. For simplicity, we drop the subscript RR for this section. For convenience, we also make the λ\lambda dependence explicit for certain expressions in this section (via writing 𝜷^(λ)\hat{\bm{\beta}}(\lambda), for instance).

E.3.1 Converting the optimization problem

First, although the original optimization problem is

𝜷^(λ)=argmin𝒃{12n𝒚𝑿𝒃22+λ2𝒃22}:=argmin𝒃λ(𝒃),\hat{\bm{\beta}}(\lambda)=\operatorname*{arg\,min}_{\bm{b}}\Big{\{}\frac{1}{2n}\|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}+\frac{\lambda}{2}\|\bm{b}\|_{2}^{2}\Big{\}}:=\operatorname*{arg\,min}_{\bm{b}}\mathcal{L}_{\lambda}(\bm{b}),

it is more convenient to work with 𝒘^(λ)=𝜷^(λ)𝜷\hat{\bm{w}}(\lambda)=\hat{\bm{\beta}}(\lambda)-\bm{\beta}, which satisfies

𝒘^(λ)\displaystyle\hat{\bm{w}}(\lambda) =argmin𝒘12n𝑿𝒘σ𝒛22+λ2𝒘+𝜷22:=argmin𝒘𝒞λ(𝒘)\displaystyle=\operatorname*{arg\,min}_{\bm{w}}\frac{1}{2n}\|\bm{X}\bm{w}-\sigma\bm{z}\|_{2}^{2}+\frac{\lambda}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}:=\operatorname*{arg\,min}_{\bm{w}}\mathcal{C}_{\lambda}(\bm{w}) (37)
=argmin𝒘max𝒖1n𝒖𝑿𝒘σn𝒖𝒛12n𝒖22+λ2𝒘+𝜷22:=argmin𝒘max𝒖cλ(𝒘,𝒖),\displaystyle=\operatorname*{arg\,min}_{\bm{w}}\max_{\bm{u}}\frac{1}{n}\bm{u}^{\top}\bm{X}\bm{w}-\frac{\sigma}{n}\bm{u}^{\top}\bm{z}-\frac{1}{2n}\|\bm{u}\|_{2}^{2}+\frac{\lambda}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}:=\operatorname*{arg\,min}_{\bm{w}}\max_{\bm{u}}c_{\lambda}(\bm{w},\bm{u}),

where we used the fact that 𝒙22=max𝒖2𝒖𝒙𝒖22\|\bm{x}\|_{2}^{2}=\max_{\bm{u}}2\bm{u}^{\top}\bm{x}-\|\bm{u}\|_{2}^{2}. We call this the Primary Optimization (PO) problem, which involves a random matrix 𝑿\bm{X}. We then define the following Auxiliary Optimization (AO) that involves only independent random vectors 𝝃g𝒩(0,𝑰p),𝝃h𝒩(0,𝑰n)\bm{\xi}_{g}\sim\mathcal{N}(0,\bm{I}_{p}),\bm{\xi}_{h}\sim\mathcal{N}(0,\bm{I}_{n}). We call its solution 𝒘(λ)\bm{w}^{*}(\lambda).

𝒘(λ)\displaystyle\bm{w}^{*}(\lambda) =argmin𝒘max𝒖1n𝒖2𝝃g𝒘+𝒘22+σ2𝝃h𝒖n12n𝒖22+λ2𝒘+𝜷22:=argmin𝒘max𝒖lλ(𝒘,𝒖)\displaystyle=\operatorname*{arg\,min}_{\bm{w}}\max_{\bm{u}}\frac{1}{n}\|\bm{u}\|_{2}\bm{\xi}_{g}^{\top}\bm{w}+\sqrt{\|\bm{w}\|_{2}^{2}+\sigma^{2}}\frac{\bm{\xi}_{h}^{\top}\bm{u}}{n}-\frac{1}{2n}\|\bm{u}\|_{2}^{2}+\frac{\lambda}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}:=\operatorname*{arg\,min}_{\bm{w}}\max_{\bm{u}}l_{\lambda}(\bm{w},\bm{u}) (38)
=argmin𝒘maxα0αn(𝝃g𝒘+𝒘22+σ2𝝃h2)α22+λ2𝒘+𝜷22:=argmin𝒘maxα0λ(𝒘,α)\displaystyle=\operatorname*{arg\,min}_{\bm{w}}\max_{\alpha\geq 0}\frac{\alpha}{\sqrt{n}}(\bm{\xi}_{g}^{\top}\bm{w}+\sqrt{\|\bm{w}\|_{2}^{2}+\sigma^{2}}\|\bm{\xi}_{h}\|_{2})-\frac{\alpha^{2}}{2}+\frac{\lambda}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}:=\operatorname*{arg\,min}_{\bm{w}}\max_{\alpha\geq 0}\ell_{\lambda}(\bm{w},\alpha)
=argmin𝒘12n(𝝃g𝒘+𝒘22+σ2𝝃h2)+2+λ2𝒘+𝜷22:=argmin𝒘Lλ(𝒘).\displaystyle=\operatorname*{arg\,min}_{\bm{w}}\frac{1}{2n}(\bm{\xi}_{g}^{\top}\bm{w}+\sqrt{\|\bm{w}\|_{2}^{2}+\sigma^{2}}\|\bm{\xi}_{h}\|_{2})_{+}^{2}+\frac{\lambda}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}:=\operatorname*{arg\,min}_{\bm{w}}L_{\lambda}(\bm{w}).

Note that to obtain the third equality we set α:=𝒖2/n\alpha:=\|\bm{u}\|_{2}/\sqrt{n}. Section E.3.3 will establish a formal connection between the PO and the AO.

Next, we perform some non-rigorous calculations using the AO to gain insight and show that it should asymptotically behave as a much simpler scalar optimization problem. To this end, we use the fact that 𝒘22+σ2=minτ0𝒘22+σ22τ+τ2,\sqrt{\|\bm{w}\|_{2}^{2}+\sigma^{2}}=\min_{\tau\geq 0}\frac{\|\bm{w}\|_{2}^{2}+\sigma^{2}}{2\tau}+\frac{\tau}{2}, and obtain

min𝒘maxα0λ(𝒘,α)\displaystyle\min_{\bm{w}}\max_{\alpha\geq 0}\ell_{\lambda}(\bm{w},\alpha)
=\displaystyle= min𝒘maxα0minτ0αn(𝝃g𝒘+(𝒘22+σ22τ+τ2)𝝃h2)α22+λ2𝒘+𝜷22.\displaystyle\min_{\bm{w}}\max_{\alpha\geq 0}\min_{\tau\geq 0}\frac{\alpha}{\sqrt{n}}\left(\bm{\xi}_{g}^{\top}\bm{w}+\left(\frac{\|\bm{w}\|_{2}^{2}+\sigma^{2}}{2\tau}+\frac{\tau}{2}\right)\|\bm{\xi}_{h}\|_{2}\right)-\frac{\alpha^{2}}{2}+\frac{\lambda}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}.

We know in the asymptotic limit 𝝃h2/n1\|\bm{\xi}_{h}\|_{2}/\sqrt{n}\rightarrow 1. Substituting in and optimizing 𝒘\bm{w} first, we have

argmin𝒘αn𝝃g𝒘+α(𝒘22+σ2)2τ+ατ2α22+λ2𝒘+𝜷22=1α/τ+λ(αn𝝃g+λ𝜷).\operatorname*{arg\,min}_{\bm{w}}\frac{\alpha}{\sqrt{n}}\bm{\xi}_{g}^{\top}\bm{w}+\frac{\alpha(\|\bm{w}\|_{2}^{2}+\sigma^{2})}{2\tau}+\frac{\alpha\tau}{2}-\frac{\alpha^{2}}{2}+\frac{\lambda}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}=-\frac{1}{\alpha/\tau+\lambda}\left(\frac{\alpha}{\sqrt{n}}\bm{\xi}_{g}+\lambda\bm{\beta}\right).

Plugging in and using the fact that 𝝃g22/p1\|\bm{\xi}_{g}\|_{2}^{2}/p\rightarrow 1, 𝜷22σβ2\|\bm{\beta}\|_{2}^{2}\rightarrow\sigma_{\beta}^{2}, and 𝝃g𝜷/n0\bm{\xi}_{g}^{\top}\bm{\beta}/\sqrt{n}\rightarrow 0, we arrive at a Scalar Optimization (SO) problem in the asymptotic limit:

(α,τ)\displaystyle(\alpha_{*},\tau_{*}) =argmaxα0minτ0ψ(α,τ)\displaystyle=\operatorname*{arg\,max}_{\alpha\geq 0}\min_{\tau\geq 0}\psi(\alpha,\tau) (39)
:=argmaxα0minτ0ασ22τ+ατ2α22α2τδ2(α+τλ)+αλσβ22(α+τλ).\displaystyle:=\operatorname*{arg\,max}_{\alpha\geq 0}\min_{\tau\geq 0}\frac{\alpha\sigma^{2}}{2\tau}+\frac{\alpha\tau}{2}-\frac{\alpha^{2}}{2}-\frac{\alpha^{2}\tau\delta}{2(\alpha+\tau\lambda)}+\frac{\alpha\lambda\sigma_{\beta}^{2}}{2(\alpha+\tau\lambda)}.

By some algebra, it turns out that the solution to (39) and the Ridge part of (24) can be related as τ=nτR\tau_{\star}=\sqrt{n}\tau_{R} and α=τζR\alpha_{\star}=\tau_{\star}\zeta_{\textrm{R}}. So by the discussion preceding (25), we know that (α,τ)(\alpha_{\star},\tau_{\star}) is unique and bounded. We in turn define

𝒘(λ)=1α/τ+λ(αn𝝃g+λ𝜷).\bm{w}(\lambda)=-\frac{1}{\alpha_{*}/\tau_{*}+\lambda}\left(\frac{\alpha_{*}}{\sqrt{n}}\bm{\xi}_{g}+\lambda\bm{\beta}\right). (40)

By some algebra, this satisfies

𝒘(λ)=𝜷^f(λ)𝜷.\bm{w}(\lambda)=\hat{\bm{\beta}}^{f}(\lambda)-\bm{\beta}. (41)
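To make the scalar reduction concrete, the saddle point (α⋆, τ⋆) of (39) can be located numerically by nested one-dimensional optimization, after which (40) gives 𝒘(λ). The sketch below assumes illustrative values of σ, σ_β, δ, and λ and an illustrative bracketing interval; it is not part of the proof.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def psi(alpha, tau, sigma, sigma_beta, delta, lam):
    """Objective of the scalar optimization problem (39)."""
    return (alpha * sigma**2 / (2 * tau) + alpha * tau / 2 - alpha**2 / 2
            - alpha**2 * tau * delta / (2 * (alpha + tau * lam))
            + alpha * lam * sigma_beta**2 / (2 * (alpha + tau * lam)))

def solve_SO(sigma, sigma_beta, delta, lam, upper=50.0):
    """Approximate max_{alpha >= 0} min_{tau >= 0} psi via nested scalar optimization."""
    def inner_min(alpha):
        return minimize_scalar(lambda tau: psi(alpha, tau, sigma, sigma_beta, delta, lam),
                               bounds=(1e-6, upper), method="bounded").fun
    alpha_star = minimize_scalar(lambda alpha: -inner_min(alpha),
                                 bounds=(1e-6, upper), method="bounded").x
    tau_star = minimize_scalar(lambda tau: psi(alpha_star, tau, sigma, sigma_beta, delta, lam),
                               bounds=(1e-6, upper), method="bounded").x
    return alpha_star, tau_star

# Example usage with illustrative parameters:
# alpha_star, tau_star = solve_SO(sigma=1.0, sigma_beta=1.0, delta=0.5, lam=0.3)
# The relations tau_star = sqrt(n) * tau_R and alpha_star = tau_star * zeta_R
# then recover the Ridge parameters of (24).
```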

We will rigorously prove the above conversion of the AO to the SO in Section E.3.2.

E.3.2 Connecting AO with SO

First we show that AO has a minimizer.

Proposition E.5.

LλL_{\lambda} admits almost surely a unique minimizer 𝐰(λ)\bm{w}^{*}(\lambda) on p\mathbb{R}^{p}.

Proof.

First, LλL_{\lambda} is a convex function that goes to \infty at \infty, so it has a minimizer.

Case 1: there exists a minimizer 𝒘\bm{w} such that 𝝃g𝒘+𝒘22+σ2𝝃h2>0\bm{\xi}_{g}^{\top}\bm{w}+\sqrt{\|\bm{w}\|_{2}^{2}+\sigma^{2}}\|\bm{\xi}_{h}\|_{2}>0.

In that case, there exists a neighborhood O𝒘O_{\bm{w}} of 𝒘\bm{w} such that for all 𝒘O𝒘\bm{w}^{\prime}\in O_{\bm{w}}, a(𝒘):=𝝃g𝒘+𝒘22+σ2𝝃h2>0a(\bm{w}^{\prime}):=\bm{\xi}_{g}^{\top}\bm{w}^{\prime}+\sqrt{\|\bm{w}^{\prime}\|_{2}^{2}+\sigma^{2}}\|\bm{\xi}_{h}\|_{2}>0. Thus for all 𝒘O𝒘,Lλ(𝒘)=12na(𝒘)2+λ2𝒘+𝜷22\bm{w}^{\prime}\in O_{\bm{w}},L_{\lambda}(\bm{w}^{\prime})=\frac{1}{2n}a(\bm{w}^{\prime})^{2}+\frac{\lambda}{2}\|\bm{w}^{\prime}+\bm{\beta}\|_{2}^{2} is strictly convex, because a(𝒘)a(\bm{w}^{\prime}) is strictly convex (its first term is linear and its second term is strictly convex since σ>0\sigma>0) and remains positive on O𝒘O_{\bm{w}}, and xx2x\mapsto x^{2} is strictly increasing and convex on x>0x>0. Hence 𝒘\bm{w} is the only minimizer of LλL_{\lambda}.

Case 2: for all minimizer 𝒘\bm{w} we have 𝝃g𝒘+𝒘2+σ2𝝃h20\bm{\xi}_{g}^{\top}\bm{w}+\sqrt{\|\bm{w}\|^{2}+\sigma^{2}}\|\bm{\xi}_{h}\|_{2}\leq 0.

In this case, from the first-order optimality condition, any minimizer 𝒘\bm{w} must satisfy λ(𝒘+𝜷)=0\lambda(\bm{w}+\bm{\beta})=0, which implies 𝒘=𝜷\bm{w}=-\bm{\beta}, so LλL_{\lambda} has a unique minimizer. ∎

Then we show that the minimizer 𝒘(λ)\bm{w}^{*}(\lambda) is close to 𝒘(λ)\bm{w}(\lambda):

Lemma E.6.

There exists a constant γ\gamma such that for all ϵ(0,1]\epsilon\in(0,1],

supλ[λmin,λmax](𝒘p,𝒘𝒘(λ)22>ϵandLλ(𝒘)min𝒗pLλ(𝒗)+γϵ)=o(ϵ),\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\exists\bm{w}\in\mathbb{R}^{p},\|\bm{w}-\bm{w}(\lambda)\|_{2}^{2}>\epsilon\ \ \text{and}\ \ L_{\lambda}(\bm{w})\leq\min_{\bm{v}\in\mathbb{R}^{p}}L_{\lambda}(\bm{v})+\gamma\epsilon\right)=o(\epsilon),

where 𝐰(λ)\bm{w}(\lambda) is defined as in (40).

Proof.

The proof is analogous to that of (miolane2021distribution, Theorem B.1), which handles the case of the Lasso. For conciseness, we refer readers to their proof. Note that we can safely add the supremum in front, since the o(ϵ)o(\epsilon) term hides constants that do not depend on λ\lambda. ∎

E.3.3 Connecting PO with AO

We introduce a useful corollary that follows from the Convex Gaussian Min-Max Theorem (Theorem H.1) and connects PO with AO.

Corollary E.1.
  • Let DpD\subset\mathbb{R}^{p} be a closed set. We have for all tt\in\mathbb{R},

    (min𝒘D𝒞λ(𝒘)t)2(min𝒘DLλ(𝒘)t).\mathbb{P}(\min_{\bm{w}\in D}\mathcal{C}_{\lambda}(\bm{w})\leq t)\leq 2\mathbb{P}(\min_{\bm{w}\in D}L_{\lambda}(\bm{w})\leq t).
  • Let DpD\subset\mathbb{R}^{p} be a convex closed set. We have for all tt\in\mathbb{R},

    (min𝒘D𝒞λ(𝒘)t)2(min𝒘DLλ(𝒘)t).\mathbb{P}(\min_{\bm{w}\in D}\mathcal{C}_{\lambda}(\bm{w})\geq t)\leq 2\mathbb{P}(\min_{\bm{w}\in D}L_{\lambda}(\bm{w})\geq t).
Proof.

We will only prove the first point, since the second follows similarly. Suppose that 𝑿,𝒛,𝝃g,𝝃h\bm{X},\bm{z},\bm{\xi}_{g},\bm{\xi}_{h} live on the same probability space and are independent. Let ϵ(0,1]\epsilon\in(0,1]. Let σmax(𝑿)\sigma_{\max}(\bm{X}) denote the largest singular value of 𝑿\bm{X}. By tightness we can find a constant C>0C>0 such that the event

{σmax(𝑿)Cn,𝒛2Cn,𝝃g2Cn,𝝃h2Cn}\{\sigma_{\max}(\bm{X})\leq C\sqrt{n},\|\bm{z}\|_{2}\leq C\sqrt{n},\|\bm{\xi}_{g}\|_{2}\leq C\sqrt{n},\|\bm{\xi}_{h}\|_{2}\leq C\sqrt{n}\} (42)

has probability at least 1ϵ1-\epsilon. Let DpD\subset\mathbb{R}^{p} be a non-empty closed set. Fix some 𝒘0D\bm{w}_{0}\in D; on event (42), both 𝒞λ(𝒘0)\mathcal{C}_{\lambda}(\bm{w}_{0}) and Lλ(𝒘0)L_{\lambda}(\bm{w}_{0}) are bounded by some constant RR. Now for any 𝒘\bm{w} such that 𝒞λ(𝒘)R\mathcal{C}_{\lambda}(\bm{w})\leq R, we have 𝒘+𝜷22𝒞λ(𝒘)R\|\bm{w}+\bm{\beta}\|_{2}^{2}\leq\mathcal{C}_{\lambda}(\bm{w})\leq R.

This means that there exists R1R_{1} such that 𝒘2R1\|\bm{w}\|^{2}\leq R_{1} on event (42). Since this is true for all such 𝒘\bm{w}, the minimum of 𝒞λ\mathcal{C}_{\lambda} over DD is achieved on DB(0,R1)D\cap B(0,R_{1}). Similarly, the minimum of LλL_{\lambda} over DD is achieved on DB(0,R2)D\cap B(0,R_{2}). WLOG, we can assume R1=R2R_{1}=R_{2}. On event (42) we have

min𝒘D𝒞λ(𝒘)=min𝒘DB(0,R1)𝒞λ(𝒘)=min𝒘DB(0,R1)max𝒖B(0,R3)cλ(𝒘,𝒖)\min_{\bm{w}\in D}\mathcal{C}_{\lambda}(\bm{w})=\min_{\bm{w}\in D\cap B(0,R_{1})}\mathcal{C}_{\lambda}(\bm{w})=\min_{\bm{w}\in D\cap B(0,R_{1})}\max_{\bm{u}\in B(0,R_{3})}c_{\lambda}(\bm{w},\bm{u})

for some non-random R3>0R_{3}>0. Thus, for all tt\in\mathbb{R}, we have

(min𝒘D𝒞λ(𝒘)t)(min𝒘DB(0,R1)max𝒖B(0,R3)cλ(𝒘,𝒖)t)+ϵ,\mathbb{P}(\min_{\bm{w}\in D}\mathcal{C}_{\lambda}(\bm{w})\leq t)\leq\mathbb{P}(\min_{\bm{w}\in D\cap B(0,R_{1})}\max_{\bm{u}\in B(0,R_{3})}c_{\lambda}(\bm{w},\bm{u})\leq t)+\epsilon,

and similarly

(min𝒘DB(0,R1)max𝒖B(0,R3)lλ(𝒘,𝒖)t)(min𝒘DLλ(𝒘)t)+ϵ.\mathbb{P}(\min_{\bm{w}\in D\cap B(0,R_{1})}\max_{\bm{u}\in B(0,R_{3})}l_{\lambda}(\bm{w},\bm{u})\leq t)\leq\mathbb{P}(\min_{\bm{w}\in D}L_{\lambda}(\bm{w})\leq t)+\epsilon.

Now we know that DB(0,R1)D\cap B(0,R_{1}) and B(0,R3)B(0,R_{3}) are compact, we can apply Theorem H.1 to get

(min𝒘D𝒞λ(𝒘)t)2(min𝒘DLλ(𝒘)t)+2ϵ.\mathbb{P}(\min_{\bm{w}\in D}\mathcal{C}_{\lambda}(\bm{w})\leq t)\leq 2\mathbb{P}(\min_{\bm{w}\in D}L_{\lambda}(\bm{w})\leq t)+2\epsilon.

The corollary then follows from the fact that one can take ϵ\epsilon arbitrarily small. ∎

Now for a fixed λ\lambda, consider the set Dλϵ={𝒘p|𝒘𝒘(λ)22ϵ}D_{\lambda}^{\epsilon}=\{\bm{w}\in\mathbb{R}^{p}|\|\bm{w}-\bm{w}(\lambda)\|_{2}^{2}\geq\epsilon\}. This is clearly a closed set. By first applying parts 1 and 2 from Corollary E.1 to the convex closed domain p\mathbb{R}^{p}, then applying part 1 to the closed domain DλϵD_{\lambda}^{\epsilon}, we can show the following lemma:

Lemma E.7.

For all ϵ(0,1]\epsilon\in(0,1],

(min𝒘Dλϵ𝒞λ(𝒘)min𝒘p𝒞λ(𝒘)+ϵ)2(min𝒘DλϵLλ(𝒘)min𝒘pLλ(𝒘)+3ϵ)+o(ϵ).\mathbb{P}\left(\min_{\bm{w}\in D_{\lambda}^{\epsilon}}\mathcal{C}_{\lambda}(\bm{w})\leq\min_{\bm{w}\in\mathbb{R}^{p}}\mathcal{C}_{\lambda}(\bm{w})+\epsilon\right)\leq 2\mathbb{P}\left(\min_{\bm{w}\in D_{\lambda}^{\epsilon}}L_{\lambda}(\bm{w})\leq\min_{\bm{w}\in\mathbb{R}^{p}}L_{\lambda}(\bm{w})+3\epsilon\right)+o(\epsilon).
Proof.

The proof of Lemma E.7 is similar to Section C.1.1 in miolane2021distribution, where the above was established for the Lasso. ∎

Combining Lemmas E.7 and E.6, we arrive at the following lemma, where we return back to the cost λ(𝒃)\mathcal{L}_{\lambda}(\bm{b}):

Lemma E.8.

There exists a constant γ>0\gamma>0 such that for all ϵ(0,1]\epsilon\in(0,1],

supλ[λmin,λmax](𝒃p,𝒃𝜷^f(λ)22ϵandλ(𝒃)minλ+γϵ)=o(ϵ).\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\exists\bm{b}\in\mathbb{R}^{p},\ \|\bm{b}-\hat{\bm{\beta}}^{f}(\lambda)\|_{2}^{2}\geq\epsilon\ \ \text{and}\ \ \mathcal{L}_{\lambda}(\bm{b})\leq\min\mathcal{L}_{\lambda}+\gamma\epsilon\right)=o(\epsilon).
Proof.

Let γ>0\gamma>0 be from Lemma E.6. We have

supλ[λmin,λmax](𝒃p,𝒃𝜷^f(λ)22ϵandλ(𝒃)minλ+γϵ/3)\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\exists\bm{b}\in\mathbb{R}^{p},\ \|\bm{b}-\hat{\bm{\beta}}^{f}(\lambda)\|_{2}^{2}\geq\epsilon\ \ \text{and}\ \ \mathcal{L}_{\lambda}(\bm{b})\leq\min\mathcal{L}_{\lambda}+\gamma\epsilon/3\right)
=\displaystyle= supλ[λmin,λmax](min𝒘Dλϵ𝒞λ(𝒘)min𝒞λ+γϵ/3)\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\min_{\bm{w}\in D_{\lambda}^{\epsilon}}\mathcal{C}_{\lambda}(\bm{w})\leq\min\mathcal{C}_{\lambda}+\gamma\epsilon/3\right)
\displaystyle\leq supλ[λmin,λmax]2(min𝒘DλϵLλ(𝒘)min𝒘pLλ(𝒘)+γϵ)+o(ϵ)\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}2\mathbb{P}\left(\min_{\bm{w}\in D_{\lambda}^{\epsilon}}L_{\lambda}(\bm{w})\leq\min_{\bm{w}\in\mathbb{R}^{p}}L_{\lambda}(\bm{w})+\gamma\epsilon\right)+o(\epsilon)
=\displaystyle= o(ϵ),\displaystyle o(\epsilon),

where the first equality comes from the definition of 𝒞λ\mathcal{C}_{\lambda} in (37), the inequality comes from Lemma E.7, and the last equality comes from Lemma E.6. This completes the proof. ∎

E.3.4 Uniform Control over λ\lambda

We now establish the result with uniform control over λ\lambda (i.e., we bring the sup\sup in Lemma E.8 inside the probability).

We first prove the following Lemma:

Lemma E.9.

There exists a constant KK such that

(λ,λ[λmin,λmax],λ(𝜷^(λ))λ(𝜷^(λ))+K|λλ|)=1o(1).\mathbb{P}\left(\forall\lambda,\lambda^{\prime}\in[\lambda_{\min},\lambda_{\max}],\mathcal{L}_{\lambda^{\prime}}(\hat{\bm{\beta}}(\lambda))\leq\mathcal{L}_{\lambda^{\prime}}(\hat{\bm{\beta}}(\lambda^{\prime}))+K|\lambda-\lambda^{\prime}|\right)=1-o(1).
Proof.

The proof follows directly from Lemma D.12. ∎

Then we prove the following lemma:

Lemma E.10.

For λ1,λ2[λmin,λmax]\lambda_{1},\lambda_{2}\in[\lambda_{\min},\lambda_{\max}], we have 𝔼[𝛃^f(λ1)𝛃^f(λ2)2]M|λ1λ2|\mathbb{E}[\|\hat{\bm{\beta}}^{f}(\lambda_{1})-\hat{\bm{\beta}}^{f}(\lambda_{2})\|_{2}]\leq M|\lambda_{1}-\lambda_{2}| for some constant MM.

Proof.

First, by noting that τ=nτR\tau_{*}=\sqrt{n}\tau_{\textrm{R}} and α=λτ/ζR\alpha_{*}=\lambda\tau_{*}/\zeta_{\textrm{R}}, and by applying Lemma H.2 and Lemma H.8, we know that α(λ)\alpha_{*}(\lambda) and τ(λ)\tau_{*}(\lambda) are bounded in some [αmin,αmax],[τmin,τmax][\alpha_{\min},\alpha_{\max}],[\tau_{\min},\tau_{\max}] respectively, and that both are continuous functions of λ\lambda. Let λ1,λ2[λmin,λmax]\lambda_{1},\lambda_{2}\in[\lambda_{\min},\lambda_{\max}]. Next, recall from (41) that 𝒘(λ)=𝜷^f(λ)𝜷\bm{w}(\lambda)=\hat{\bm{\beta}}^{f}(\lambda)-\bm{\beta}. Thus,

𝔼[𝜷^f(λ1)𝜷^f(λ2)2]2𝔼[𝜷^f(λ1)𝜷^f(λ2)22]=𝔼[𝒘(λ1)𝒘(λ2)22]\displaystyle\mathbb{E}[\|\hat{\bm{\beta}}^{f}(\lambda_{1})-\hat{\bm{\beta}}^{f}(\lambda_{2})\|_{2}]^{2}\leq\mathbb{E}[\|\hat{\bm{\beta}}^{f}(\lambda_{1})-\hat{\bm{\beta}}^{f}(\lambda_{2})\|_{2}^{2}]=\mathbb{E}[\|\bm{w}(\lambda_{1})-\bm{w}(\lambda_{2})\|_{2}^{2}]
=\displaystyle= i=1p𝔼[(1α(λ1)/τ(λ1)+λ1(α(λ1)nξg,i+λ1βi)1α(λ2)/τ(λ2)+λ2(α(λ2)nξg,i+λ2βi))2]\displaystyle\sum_{i=1}^{p}\mathbb{E}\left[\left(\frac{1}{\alpha_{*}(\lambda_{1})/\tau_{*}(\lambda_{1})+\lambda_{1}}\left(\frac{\alpha_{*}(\lambda_{1})}{\sqrt{n}}\xi_{g,i}+\lambda_{1}\beta_{i}\right)-\frac{1}{\alpha_{*}(\lambda_{2})/\tau_{*}(\lambda_{2})+\lambda_{2}}\left(\frac{\alpha_{*}(\lambda_{2})}{\sqrt{n}}\xi_{g,i}+\lambda_{2}\beta_{i}\right)\right)^{2}\right]
\displaystyle\leq 2𝔼[δ(α(λ1)α(λ1)/τ(λ1)+λ1ξgα(λ2)α(λ2)/τ(λ2)+λ2ξg)2]\displaystyle 2\mathbb{E}\left[\delta\left(\frac{\alpha_{*}(\lambda_{1})}{\alpha_{*}(\lambda_{1})/\tau_{*}(\lambda_{1})+\lambda_{1}}\xi_{g}-\frac{\alpha_{*}(\lambda_{2})}{\alpha_{*}(\lambda_{2})/\tau_{*}(\lambda_{2})+\lambda_{2}}\xi_{g}\right)^{2}\right]
+2𝜷22[(λ1α(λ1)/τ(λ1)+λ1λ2α(λ2)/τ(λ2)+λ2)2],\displaystyle\indent+2\|\bm{\beta}\|_{2}^{2}\left[\left(\frac{\lambda_{1}}{\alpha_{*}(\lambda_{1})/\tau_{*}(\lambda_{1})+\lambda_{1}}-\frac{\lambda_{2}}{\alpha_{*}(\lambda_{2})/\tau_{*}(\lambda_{2})+\lambda_{2}}\right)^{2}\right],

where ξg,i\xi_{g,i} is the iith entry of 𝝃g𝒩(0,𝑰p)\bm{\xi}_{g}\sim\mathcal{N}(0,\bm{I}_{p}) and ξg𝒩(0,1)\xi_{g}\sim\mathcal{N}(0,1). For the first summand,

2𝔼[δ(α(λ1)α(λ1)/τ(λ1)+λ1ξgα(λ2)α(λ2)/τ(λ2)+λ2ξg)2]\displaystyle 2\mathbb{E}\left[\delta\left(\frac{\alpha_{*}(\lambda_{1})}{\alpha_{*}(\lambda_{1})/\tau_{*}(\lambda_{1})+\lambda_{1}}\xi_{g}-\frac{\alpha_{*}(\lambda_{2})}{\alpha_{*}(\lambda_{2})/\tau_{*}(\lambda_{2})+\lambda_{2}}\xi_{g}\right)^{2}\right]
=\displaystyle= 2δ(α(λ1)α(λ1)/τ(λ1)+λ1α(λ2)α(λ2)/τ(λ2)+λ2)2\displaystyle 2\delta\left(\frac{\alpha_{*}(\lambda_{1})}{\alpha_{*}(\lambda_{1})/\tau_{*}(\lambda_{1})+\lambda_{1}}-\frac{\alpha_{*}(\lambda_{2})}{\alpha_{*}(\lambda_{2})/\tau_{*}(\lambda_{2})+\lambda_{2}}\right)^{2}
=\displaystyle= 2δ(α(λ1)α(λ2)τ(λ1)τ(λ2)(τ(λ1)τ(λ2))+(λ2α(λ1)λ1α(λ2))(α(λ1)/τ(λ1)+λ1)(α(λ2)/τ(λ2)+λ2))2.\displaystyle 2\delta\left(\frac{\frac{\alpha_{*}(\lambda_{1})\alpha_{*}(\lambda_{2})}{\tau_{*}(\lambda_{1})\tau_{*}(\lambda_{2})}(\tau_{*}(\lambda_{1})-\tau_{*}(\lambda_{2}))+(\lambda_{2}\alpha_{*}(\lambda_{1})-\lambda_{1}\alpha_{*}(\lambda_{2}))}{(\alpha_{*}(\lambda_{1})/\tau_{*}(\lambda_{1})+\lambda_{1})(\alpha_{*}(\lambda_{2})/\tau_{*}(\lambda_{2})+\lambda_{2})}\right)^{2}.

We know both α(λ1)α(λ2)τ(λ1)τ(λ2)\frac{\alpha_{*}(\lambda_{1})\alpha_{*}(\lambda_{2})}{\tau_{*}(\lambda_{1})\tau_{*}(\lambda_{2})} and (α(λ1)/τ(λ1)+λ1)(α(λ2)/τ(λ2)+λ2)(\alpha_{*}(\lambda_{1})/\tau_{*}(\lambda_{1})+\lambda_{1})(\alpha_{*}(\lambda_{2})/\tau_{*}(\lambda_{2})+\lambda_{2}) are bounded, τ\tau_{*} is Lipschitz, and

λ2α(λ1)λ1α(λ2)=\displaystyle\lambda_{2}\alpha_{*}(\lambda_{1})-\lambda_{1}\alpha_{*}(\lambda_{2})= α(λ1)(λ2λ1)+λ1(α(λ1)α(λ2))\displaystyle\alpha_{*}(\lambda_{1})(\lambda_{2}-\lambda_{1})+\lambda_{1}(\alpha_{*}(\lambda_{1})-\alpha_{*}(\lambda_{2}))
\displaystyle\leq αmax|λ2λ1|+λmax|α(λ1)α(λ2)|.\displaystyle\alpha_{\max}|\lambda_{2}-\lambda_{1}|+\lambda_{\max}|\alpha_{*}(\lambda_{1})-\alpha_{*}(\lambda_{2})|.

Thus, we know the first part is bounded by some M1(λ1λ2)2M_{1}(\lambda_{1}-\lambda_{2})^{2}. Similarly, the second part is bounded by some M2(λ1λ2)2M_{2}(\lambda_{1}-\lambda_{2})^{2}. Combining them completes the proof. ∎

We are now ready to prove Lemma E.1.

Proof of Lemma E.1.

Let γ>0\gamma>0 be as given by Lemma E.8, let K>0K>0 be as given by Lemma E.9, and let M>0M>0 be as given by Lemma E.10. Fix ϵ(0,1]\epsilon\in(0,1] and define ϵ=min(γϵK,ϵM)\epsilon^{\prime}=\min\left(\frac{\gamma\epsilon}{K},\frac{\sqrt{\epsilon}}{M}\right). Let k=(λmaxλmin)/ϵk=\lceil(\lambda_{\max}-\lambda_{\min})/\epsilon^{\prime}\rceil. Further define λi=λmin+iϵ\lambda_{i}=\lambda_{\min}+i\epsilon^{\prime} for i=0,,ki=0,...,k. By Lemma E.8, the event

{i{1,,k},𝒃p,λi(𝒃)minλi+γϵ𝒃𝜷^f(λi)22ϵ}\left\{\forall i\in\{1,...,k\},\forall\bm{b}\in\mathbb{R}^{p},\mathcal{L}_{\lambda_{i}}(\bm{b})\leq\min\mathcal{L}_{\lambda_{i}}+\gamma\epsilon\Rightarrow\|\bm{b}-\hat{\bm{\beta}}^{f}(\lambda_{i})\|_{2}^{2}\leq\epsilon\right\} (43)

has probability at least 1ko(ϵ)=1o(1)1-ko(\epsilon)=1-o(1). Therefore, on the intersection of event (43) and the event in Lemma E.9, which has probability 1o(1)1-o(1), we have for all λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}],

λi(𝜷^(λ))minλi+K|λλi|minλi+γϵ,\mathcal{L}_{\lambda_{i}}(\hat{\bm{\beta}}(\lambda))\leq\min\mathcal{L}_{\lambda_{i}}+K|\lambda-\lambda_{i}|\leq\min\mathcal{L}_{\lambda_{i}}+\gamma\epsilon,

where 1ik1\leq i\leq k is such that λ[λi1,λi]\lambda\in[\lambda_{i-1},\lambda_{i}]. This implies (since we are on event (43)) that 𝜷^(λ)𝜷^f(λi)22ϵ\|\hat{\bm{\beta}}(\lambda)-\hat{\bm{\beta}}^{f}(\lambda_{i})\|_{2}^{2}\leq\epsilon.

Consider any Lipschitz function ϕβ\phi_{\beta}. Since 𝜷^f(λ)\hat{\bm{\beta}}^{f}(\lambda) is an α/(n(α/τ+λ))C/n\alpha_{*}/(\sqrt{n}(\alpha_{*}/\tau_{*}+\lambda))\leq C/\sqrt{n}-Lipschitz function of 𝝃g𝒩(𝟎,𝑰)\bm{\xi}_{g}\sim\mathcal{N}(\bm{0},\bm{I}) for some constant CC (recall (40)), by concentration of Lipschitz functions of Gaussian random variables we know that the event

{|ϕβ(𝜷^f(λ))𝔼[ϕβ(𝜷^f(λ))]|ϵ}\left\{|\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda))-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda))]|\leq\sqrt{\epsilon}\right\} (44)

has probability 1o(ϵ)1-o(\epsilon). Therefore, on the intersection of event (43) and event (44), we have

|ϕβ(𝜷^(λ))𝔼[ϕβ(𝜷^f(λ))]|\displaystyle|\phi_{\beta}(\hat{\bm{\beta}}(\lambda))-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda))]|
\displaystyle\leq |ϕβ(𝜷^(λ))ϕβ(𝜷^f(λi))|+|ϕβ(𝜷^f(λi))𝔼[ϕβ(𝜷^f(λi))]|+|𝔼[ϕβ(𝜷^f(λi))]𝔼[ϕβ(𝜷^f(λ))]|\displaystyle|\phi_{\beta}(\hat{\bm{\beta}}(\lambda))-\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda_{i}))|+|\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda_{i}))-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda_{i}))]|+|\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda_{i}))]-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda))]|
\displaystyle\leq 𝜷^(λ)𝜷^f(λi)2+ϵ+𝔼[𝜷^f(λ)𝜷^f(λi)2]\displaystyle\|\hat{\bm{\beta}}(\lambda)-\hat{\bm{\beta}}^{f}(\lambda_{i})\|_{2}+\sqrt{\epsilon}+\mathbb{E}[\|\hat{\bm{\beta}}^{f}(\lambda)-\hat{\bm{\beta}}^{f}(\lambda_{i})\|_{2}]
\displaystyle\leq 2ϵ+M|λλi|\displaystyle 2\sqrt{\epsilon}+M|\lambda-\lambda_{i}|
\displaystyle\leq 3ϵ.\displaystyle 3\sqrt{\epsilon}.

The proof is then completed by noting that this holds for all λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}]. ∎

E.3.5 Proof of Lemma E.2

The proof is almost verbatim that of Lemma E.1, where we instead study the optimization over 𝒖\bm{u} in (37). We omit the details for conciseness.

Appendix F Proof of main results

In this section we utilize Theorem D.1 and some preceding supporting Lemmas to prove the remaining results in the main text, that is, Proposition 4.2 and Theorem 4.1.

F.1 Variance Parameter Estimation

We present the theorem that makes rigorous the consistency of (7).

Theorem F.1.

Consider τ^L2,τ^R2,τ^LR\hat{\tau}_{\textrm{L}}^{2},\hat{\tau}_{\textrm{R}}^{2},\hat{\tau}_{\textrm{L}\textrm{R}} in (7). If Assumptions 1 and 2 hold, then we have

supλL[λmin,λmax]n|τ^L2τL2|𝑃0,\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}n\cdot|\hat{\tau}_{\textrm{L}}^{2}-\tau_{\textrm{L}}^{2}|\overset{P}{\rightarrow}0,
supλR[λmin,λmax]n|τ^R2τR2|𝑃0,\displaystyle\sup_{\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}n\cdot|\hat{\tau}_{\textrm{R}}^{2}-\tau_{\textrm{R}}^{2}|\overset{P}{\rightarrow}0,
supλL,λR[λmin,λmax]n|τ^LRτLR|𝑃0.\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}n\cdot|\hat{\tau}_{\textrm{L}\textrm{R}}-\tau_{\textrm{L}\textrm{R}}|\overset{P}{\rightarrow}0.

The proof of Theorem F.1 crucially relies on the following Lemma, which is the counterpart to Theorem 4.3 in the context of residuals:

Lemma F.2.

Define 𝐮^kd=𝐲𝐗𝛃^k1df^kn,𝐮^kf,d=n𝐡kf\hat{\bm{u}}_{k}^{d}=\frac{\bm{y}-\bm{X}\hat{\bm{\beta}}_{k}}{1-\frac{\hat{\textnormal{df}}_{k}}{n}},\hat{\bm{u}}_{k}^{f,d}=\sqrt{n}\bm{h}_{k}^{f}, where (𝐡Lf,𝐡Rf)𝒩(𝟎,𝐒𝐈n)(\bm{h}_{\textrm{L}}^{f},\bm{h}_{\textrm{R}}^{f})\sim\mathcal{N}(\bm{0},\bm{S}\otimes\bm{I}_{n}) for 𝐒\bm{S} defined as in 24. Under Assumption 4, for any 11-Lipschitz function ϕu:(n)2\phi_{u}:(\mathbb{R}^{n})^{2}\rightarrow\mathbb{R},

\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\left|\phi_{u}\left(\frac{\hat{\bm{u}}_{\textrm{L}}^{d}}{\sqrt{n}},\frac{\hat{\bm{u}}_{\textrm{R}}^{d}}{\sqrt{n}}\right)-\mathbb{E}\left[\phi_{u}\left(\frac{\hat{\bm{u}}_{\textrm{L}}^{f,d}}{\sqrt{n}},\frac{\hat{\bm{u}}_{\textrm{R}}^{f,d}}{\sqrt{n}}\right)\right]\right|\overset{P}{\rightarrow}0.

The proof of Lemma F.2 is similar to the proof of Theorem 4.3; we thus omit it for conciseness.

Proof of Theorem F.1.

We know 𝒉kf\bm{h}_{k}^{f} is a 1/n1/\sqrt{n}-Lipschitz function of n𝒉kf\sqrt{n}\bm{h}_{k}^{f}, where (n𝒉Lf,n𝒉Rf)𝒩(𝟎,n𝑺𝑰n)(\sqrt{n}\bm{h}_{\textrm{L}}^{f},\sqrt{n}\bm{h}_{\textrm{R}}^{f})\sim\mathcal{N}(\bm{0},n\bm{S}\otimes\bm{I}_{n}) and eigenvalues of n𝑺n\bm{S} are all bounded (since entries of n𝑺n\bm{S} are bounded from 25). Also, 𝔼[𝒉kf22]=nτk2τmax2+σmax2\mathbb{E}[\|\bm{h}_{k}^{f}\|_{2}^{2}]=n\tau_{k}^{2}\leq\tau_{\max}^{2}+\sigma_{\max}^{2}. Thus, by Lemma H.3, we know

supλL,λR[λmin,λmax]𝑻(𝒖^Ldn,𝒖^Rdn)𝔼[𝑻(𝒖^Lf,dn,𝒖^Rf,dn)]F𝑃0,\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\left\|\bm{T}\left(\frac{\hat{\bm{u}}_{\textrm{L}}^{d}}{\sqrt{n}},\frac{\hat{\bm{u}}_{\textrm{R}}^{d}}{\sqrt{n}}\right)-\mathbb{E}\left[\bm{T}\left(\frac{\hat{\bm{u}}_{\textrm{L}}^{f,d}}{\sqrt{n}},\frac{\hat{\bm{u}}_{\textrm{R}}^{f,d}}{\sqrt{n}}\right)\right]\right\|_{F}\overset{P}{\rightarrow}0,

where

𝑻(𝒂,𝒃):=(𝒂𝒂𝒂𝒃𝒂𝒃𝒃𝒃).\bm{T}(\bm{a},\bm{b}):=\begin{pmatrix}\bm{a}^{\top}\bm{a}&\bm{a}^{\top}\bm{b}\\ \bm{a}^{\top}\bm{b}&\bm{b}^{\top}\bm{b}\end{pmatrix}. (45)

Extracting the entries of the above equation completes the proof. ∎
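For concreteness, the following Python sketch shows how the scaled debiased residuals from Lemma F.2 and the empirical second-moment matrix 𝑻\bm{T} in (45) would be computed in practice. The inputs beta_hat_L, beta_hat_R, df_L, df_R are hypothetical placeholders standing in for the Lasso/Ridge fits and their estimated degrees of freedom; the entries of the resulting matrix concentrate around the quantities controlled in Theorem F.1.

import numpy as np

def T_matrix(a, b):
    # Empirical second-moment matrix T(a, b) defined in (45).
    return np.array([[a @ a, a @ b],
                     [a @ b, b @ b]])

def debiased_residual(y, X, beta_hat, df_hat):
    # Scaled debiased residual u_hat^d = (y - X beta_hat) / (1 - df_hat / n), as in Lemma F.2.
    n = y.shape[0]
    return (y - X @ beta_hat) / (1.0 - df_hat / n)

# Example usage with hypothetical fits (beta_hat_L, df_L) and (beta_hat_R, df_R):
# u_L = debiased_residual(y, X, beta_hat_L, df_L)
# u_R = debiased_residual(y, X, beta_hat_R, df_R)
# T_hat = T_matrix(u_L / np.sqrt(len(y)), u_R / np.sqrt(len(y)))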

F.2 Ensembling: Proof of Proposition 4.2

Proof of Proposition 4.2.

By arguments similar to the proof of Theorem F.1, on using Theorem 4.3 in conjunction with (25) and Lemma H.3, we have that

supλL,λR[λmin,λmax]𝑻(𝜷^Ld,𝜷^Rd)𝔼[𝑻(𝜷^Lf,d,𝜷^Rf,d)]F𝑃0,\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\|\bm{T}(\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d})-\mathbb{E}[\bm{T}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d})]\|_{F}\overset{P}{\rightarrow}0,

where 𝑻\bm{T} is given by (45). Therefore, we have

supλL,λR[λmin,λmax]supαL[0,1]|𝜷~Cd22𝜷22pτC2|\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\sup_{\alpha_{\textrm{L}}\in[0,1]}\left|\|\tilde{\bm{\beta}}_{C}^{d}\|_{2}^{2}-\|\bm{\beta}\|_{2}^{2}-p\cdot\tau_{C}^{2}\right|
=\displaystyle= supλL,λR[λmin,λmax]supαL[0,1]|αL𝜷^Ld+(1αL)𝜷^Rd22𝜷22αL2pτL2(1αL)2pτR22αL(1αL)pτLR|\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\sup_{\alpha_{\textrm{L}}\in[0,1]}\left|\|\alpha_{\textrm{L}}\hat{\bm{\beta}}_{\textrm{L}}^{d}+(1-\alpha_{\textrm{L}})\hat{\bm{\beta}}_{\textrm{R}}^{d}\|_{2}^{2}-\|\bm{\beta}\|_{2}^{2}-\alpha_{\textrm{L}}^{2}\cdot p\tau_{\textrm{L}}^{2}-(1-\alpha_{\textrm{L}})^{2}\cdot p\tau_{\textrm{R}}^{2}-2\alpha_{\textrm{L}}(1-\alpha_{\textrm{L}})\cdot p\tau_{\textrm{L}\textrm{R}}\right|
\displaystyle\leq supλL,λR[λmin,λmax]supαL[0,1][αL2|𝜷^Ld22𝜷22pτL2|+(1αL)2|𝜷^Rd22𝜷22pτR2|\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\sup_{\alpha_{\textrm{L}}\in[0,1]}\bigg{[}\alpha_{\textrm{L}}^{2}\left|\|\hat{\bm{\beta}}_{\textrm{L}}^{d}\|_{2}^{2}-\|\bm{\beta}\|_{2}^{2}-p\tau_{\textrm{L}}^{2}\right|+(1-\alpha_{\textrm{L}})^{2}\left|\|\hat{\bm{\beta}}_{\textrm{R}}^{d}\|_{2}^{2}-\|\bm{\beta}\|_{2}^{2}-p\tau_{\textrm{R}}^{2}\right|
+2αL(1αL)|𝜷^Ld,𝜷^Rd𝜷22pτLR|]\displaystyle+2\alpha_{\textrm{L}}(1-\alpha_{\textrm{L}})\left|\langle\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d}\rangle-\|\bm{\beta}\|_{2}^{2}-p\tau_{\textrm{L}\textrm{R}}\right|\bigg{]}
\displaystyle\leq supλL,λR[λmin,λmax]supαL[0,1](αL2+(1αL)2+2αL(1αL))𝑻(𝜷^Ld,𝜷^Rd)𝔼[𝑻(𝜷^Lf,d,𝜷^Rf,d)]F\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\sup_{\alpha_{\textrm{L}}\in[0,1]}\left(\alpha_{\textrm{L}}^{2}+(1-\alpha_{\textrm{L}})^{2}+2\alpha_{\textrm{L}}(1-\alpha_{\textrm{L}})\right)\left\|\bm{T}(\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d})-\mathbb{E}[\bm{T}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d})]\right\|_{F}
\displaystyle\leq supλL,λR[λmin,λmax]𝑻(𝜷^Ld,𝜷^Rd)𝔼[𝑻(𝜷^Lf,d,𝜷^Rf,d)]F𝑃0,\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\left\|\bm{T}(\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d})-\mathbb{E}[\bm{T}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d})]\right\|_{F}\overset{P}{\rightarrow}0,

where we define

τC2=αL2τL2+(1αL)2τR2+2αL(1αL)τLR.\tau_{C}^{2}=\alpha_{\textrm{L}}^{2}\cdot\tau_{\textrm{L}}^{2}+(1-\alpha_{\textrm{L}})^{2}\cdot\tau_{\textrm{R}}^{2}+2\alpha_{\textrm{L}}(1-\alpha_{\textrm{L}})\cdot\tau_{\textrm{L}\textrm{R}}.

Furthermore, from Theorem F.1,

supλL,λR[λmin,λmax]supαL[0,1]|pτ~C2pτC2|\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\sup_{\alpha_{\textrm{L}}\in[0,1]}|p\cdot\tilde{\tau}_{C}^{2}-p\cdot\tau_{C}^{2}|
\displaystyle\leq supλL,λR[λmin,λmax]supαL[0,1](αL2+(1αL)2+2αL(1αL))|p|τ^L2τL2|+p|τ^R2τR2|+p|τ^LRτLR||\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\sup_{\alpha_{\textrm{L}}\in[0,1]}\left(\alpha_{\textrm{L}}^{2}+(1-\alpha_{\textrm{L}})^{2}+2\alpha_{\textrm{L}}(1-\alpha_{\textrm{L}})\right)\left|p\cdot|\hat{\tau}_{\textrm{L}}^{2}-\tau_{\textrm{L}}^{2}|+p\cdot|\hat{\tau}_{\textrm{R}}^{2}-\tau_{\textrm{R}}^{2}|+p\cdot|\hat{\tau}_{LR}-\tau_{LR}|\right|
\displaystyle\leq supλL,λR[λmin,λmax]|p|τ^L2τL2|+p|τ^R2τR2|+p|τ^LRτLR||𝑃0.\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\left|p\cdot|\hat{\tau}_{\textrm{L}}^{2}-\tau_{\textrm{L}}^{2}|+p\cdot|\hat{\tau}_{\textrm{R}}^{2}-\tau_{\textrm{R}}^{2}|+p\cdot|\hat{\tau}_{LR}-\tau_{LR}|\right|\overset{P}{\rightarrow}0.

Combining the above displays completes the proof. ∎

F.3 Heritability Estimation: Proof of Theorem 4.1

Proof of Theorem 4.1.

From (12), we know that α^L[0,1]\hat{\alpha}_{\textrm{L}}\in[0,1], so Proposition 4.2 implies that

supλL,λR[λmin,λmax]|𝜷^Cd22pτ^C2σβ2|𝑃0,\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\left|\|\hat{\bm{\beta}}_{C}^{d}\|_{2}^{2}-p\cdot\hat{\tau}_{C}^{2}-\sigma_{\beta}^{2}\right|\overset{P}{\rightarrow}0, (46)

which further implies that

supλL,λR[λmin,λmax]|𝜷^Cd22pτ^C2σβ2+σ2σβ2σβ2+σ2|=supλL,λR[λmin,λmax]|𝜷^Cd22pτ^C2σβ2+σ2h|𝑃0.\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\|\hat{\bm{\beta}}_{C}^{d}\|_{2}^{2}-p\cdot\hat{\tau}_{C}^{2}}{\sigma_{\beta}^{2}+\sigma^{2}}-\frac{\sigma_{\beta}^{2}}{\sigma_{\beta}^{2}+\sigma^{2}}\right|=\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\|\hat{\bm{\beta}}_{C}^{d}\|_{2}^{2}-p\cdot\hat{\tau}_{C}^{2}}{\sigma_{\beta}^{2}+\sigma^{2}}-h\right|\overset{P}{\rightarrow}0. (47)

Recalling the definitions of σβ2,σ2\sigma_{\beta}^{2},\sigma^{2} from Assumption 4, and the fact that XijX_{ij}’s are independent with mean 0 and variance 11, we know that

Var^(𝒚)𝑃σβ2+σ2,\widehat{\text{Var}}(\bm{y})\overset{P}{\rightarrow}\sigma_{\beta}^{2}+\sigma^{2},

where Var^(𝒚)\widehat{\text{Var}}(\bm{y}) denotes the sample variance of 𝒚\bm{y}. Moreover, σβ2+σ2\sigma_{\beta}^{2}+\sigma^{2} is bounded by 2σmax22\sigma_{\max}^{2}, and is positive (otherwise the problem becomes trivial). Thus we can safely replace the denominator in (47) with Var^(𝒚)\widehat{\text{Var}}(\bm{y}). The proof is then completed by the fact that clipping does not affect consistency. ∎
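To make the quantities appearing in this proof concrete, the following Python sketch assembles the clipped heritability estimate from the ensembled debiased estimator, the combined variance level τC2\tau_{C}^{2}, and the sample variance of 𝒚\bm{y}. The inputs are hypothetical placeholders: beta_L_d and beta_R_d stand for the debiased Lasso and Ridge estimates, tau2_L, tau2_R, tau_LR for the variance estimates in (7), and alpha_L for the ensemble weight from (12).

import numpy as np

def heritability_estimate(beta_L_d, beta_R_d, tau2_L, tau2_R, tau_LR, alpha_L, y):
    # Ensemble the debiased estimates with weight alpha_L in [0, 1].
    p = beta_L_d.shape[0]
    beta_C_d = alpha_L * beta_L_d + (1.0 - alpha_L) * beta_R_d
    # Combined variance level: alpha_L^2 tau_L^2 + (1 - alpha_L)^2 tau_R^2 + 2 alpha_L (1 - alpha_L) tau_LR.
    tau2_C = (alpha_L ** 2 * tau2_L + (1.0 - alpha_L) ** 2 * tau2_R
              + 2.0 * alpha_L * (1.0 - alpha_L) * tau_LR)
    # Bias-corrected signal strength over the sample variance of y, clipped to [0, 1].
    h_hat = (np.sum(beta_C_d ** 2) - p * tau2_C) / np.var(y, ddof=1)
    return float(np.clip(h_hat, 0.0, 1.0))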

Appendix G Robustness under Accurate Covariance Estimation: Proof of Theorem 4.4

In this section we prove Theorem 4.4. To maintain notational consistency with the rest of the proofs, we refrain from using 𝒁\bm{Z} and use 𝑿\bm{X} instead. A simple derivation reveals that assuming 𝑿\bm{X} has covariance 𝑨𝑨\bm{A}^{\top}\bm{A} is equivalent to assuming 𝑿\bm{X} has independent columns and using 𝑿𝑨\bm{X}\bm{A} as the design matrix. For all quantities in this section, we include 𝑨\bm{A} in parentheses or as a subscript to denote the corresponding perturbed quantity obtained by replacing the design matrix 𝑿\bm{X} with 𝑿𝑨\bm{X}\bm{A}. We first introduce a sufficient condition for Theorem 4.4.

Lemma G.1.

Let 𝐀\bm{A} be a sequence of random matrices such that 𝐀𝐈op𝑃0\|\bm{A}-\bm{I}\|_{\textnormal{op}}\overset{P}{\rightarrow}0. Under Assumption 4, we have

supλk[λmin,λmax]𝜷^k(𝑨)𝜷^k2\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}_{k}(\bm{A})-\hat{\bm{\beta}}_{k}\|_{2} 𝑃0\displaystyle\overset{P}{\rightarrow}0 (48)
supλk[λmin,λmax]|df^k(𝑨)pdf^kp|\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\hat{\textnormal{df}}_{k}(\bm{A})}{p}-\frac{\hat{\textnormal{df}}_{k}}{p}\right| 𝑃0.\displaystyle\overset{P}{\rightarrow}0.

Lemma G.1 is proved in Sections G.1 and G.2 for Ridge and Lasso, respectively. Compared to the assumptions in Theorem 4.4, besides reducing Assumption 1 to Assumption 4 under Gaussian design and replacing the assumptions about 𝚺\bm{\Sigma}, 𝚺^\hat{\bm{\Sigma}} with those about 𝑨\bm{A}, we also drop Assumption 3, as it can be proved under Gaussian design. Taking Lemma G.1 as given momentarily, we prove Theorem 4.4 below.

Proof of Theorem 4.4.

First we show that 𝑨𝑰op𝑃0\|\bm{A}-\bm{I}\|_{\textnormal{op}}\overset{P}{\rightarrow}0 is a corollary of the assumptions in Theorem 4.4. Notice that Proposition A.1 guarantees 𝚺^𝚺op0\|\hat{\bm{\Sigma}}-\bm{\Sigma}\|_{\textnormal{op}}\rightarrow 0. We have 𝑨1𝑰op=𝚺^1/2𝚺1/2𝑰op𝚺^1/2𝚺1/2op𝚺1/2op𝑃0\|\bm{A}^{-1}-\bm{I}\|_{\textnormal{op}}=\|\hat{\bm{\Sigma}}^{1/2}\bm{\Sigma}^{-1/2}-\bm{I}\|_{\textnormal{op}}\leq\|\hat{\bm{\Sigma}}^{1/2}-\bm{\Sigma}^{1/2}\|_{\textnormal{op}}\|\bm{\Sigma}^{-1/2}\|_{\textnormal{op}}\overset{P}{\rightarrow}0 since 𝚺^1/2𝚺1/2op𝑃0\|\hat{\bm{\Sigma}}^{1/2}-\bm{\Sigma}^{1/2}\|_{\textnormal{op}}\overset{P}{\rightarrow}0 and the eigenvalues of 𝚺\bm{\Sigma} are bounded. Let 𝑩=𝑰𝑨1\bm{B}=\bm{I}-\bm{A}^{-1}. Since 𝑩op𝑃0\|\bm{B}\|_{\textnormal{op}}\overset{P}{\rightarrow}0, the Neumann series \sum_{k=0}^{\infty}\bm{B}^{k} converges in operator norm and equals (\bm{I}-\bm{B})^{-1}=\bm{A}. Hence \|\bm{A}-\bm{I}\|_{\textnormal{op}}=\|\sum_{k=1}^{\infty}\bm{B}^{k}\|_{\textnormal{op}}\leq\|\bm{B}\|_{\textnormal{op}}/(1-\|\bm{B}\|_{\textnormal{op}})\overset{P}{\rightarrow}0.

We are now left with arguing that under the assumptions of Lemma G.1, (48) yields (16). Recall the definition of debiased estimators in (18). We have the following:

  • supλk[λmin,λmax]𝜷^k(𝑨)𝜷^k2𝑃0\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}_{k}(\bm{A})-\hat{\bm{\beta}}_{k}\|_{2}\overset{P}{\rightarrow}0 from the first line of (48).

  • 1n𝑿𝑿𝑨op1n𝑿op𝑨𝑰op𝑃0\frac{1}{\sqrt{n}}\|\bm{X}-\bm{X}\bm{A}\|_{\textnormal{op}}\leq\frac{1}{\sqrt{n}}\|\bm{X}\|_{\textnormal{op}}\|\bm{A}-\bm{I}\|_{\textnormal{op}}\overset{P}{\rightarrow}0 from Corollary H.1.

  • 1n𝒚2\frac{1}{\sqrt{n}}\|\bm{y}\|_{2} is bounded with probability 1o(1)1-o(1).

  • supλk[λmin,λmax]|nndf^knndf^k(𝑨)|𝑃0\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{n}{n-\hat{\textnormal{df}}_{k}}-\frac{n}{n-\hat{\textnormal{df}}_{k}(\bm{A})}\right|\overset{P}{\rightarrow}0 since supλk[λmin,λmax]|df^k(𝑨)ndf^kn|𝑃0\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\hat{\textnormal{df}}_{k}(\bm{A})}{n}-\frac{\hat{\textnormal{df}}_{k}}{n}\right|\overset{P}{\rightarrow}0, together with the fact that 1df^kn1-\frac{\hat{\textnormal{df}}_{k}}{n} is bounded below with probability 1o(1)1-o(1) from Corollary D.1.

Combining the above items with Lemma H.5 and Lemma H.6, we obtain

supλk[λmin,λmax]𝜷^kd𝜷^kd(𝑨)2𝑃0.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}_{k}^{d}-\hat{\bm{\beta}}_{k}^{d}(\bm{A})\|_{2}\overset{P}{\rightarrow}0.

A similar proof yields

supλk[λmin,λmax]|τ^k2τ^k2(𝑨)|𝑃0\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}|\hat{\tau}_{k}^{2}-\hat{\tau}_{k}^{2}(\bm{A})|\overset{P}{\rightarrow}0
supλL,λR[λmin,λmax]|τ^LR2τ^LR2(𝑨)|𝑃0.\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\hat{\tau}_{\textrm{L}\textrm{R}}^{2}-\hat{\tau}_{\textrm{L}\textrm{R}}^{2}(\bm{A})|\overset{P}{\rightarrow}0.

We are then left with articulating the expression of h^2\hat{h}^{2} and taking the supremum over αL[0,1]\alpha_{\textrm{L}}\in[0,1], both of which are straightforward using Lemma H.5, Lemma H.6, and the proof techniques in Sections F.2 and F.3. We omit the details here. ∎

G.1 Proof of Lemma G.1, Ridge case

In this section, we drop the subscript R for simplicity. First we show that

supλk[λmin,λmax]𝜷^(𝑨)𝜷^2𝑃0.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}(\bm{A})-\hat{\bm{\beta}}\|_{2}\overset{P}{\rightarrow}0.

Recall the Ridge estimator and its perturbed version:

𝜷^\displaystyle\hat{\bm{\beta}} =(1n𝑿𝑿+λ𝑰)11n𝑿𝒚\displaystyle=(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{y}
𝜷^(𝑨)\displaystyle\hat{\bm{\beta}}(\bm{A}) =(1n𝑨𝑿𝑿𝑨+λ𝑰)11n𝑨𝑿𝒚.=(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{y}.

We know both (1n𝑿𝑿+λ𝑰)1op\left\|(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}\right\|_{\textnormal{op}} and (1n𝑨𝑿𝑿𝑨+λ𝑰)1op\left\|(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}\right\|_{\textnormal{op}} are bounded by 1/λmin1/\lambda_{\min}, uniformly for all λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}]. Further, their difference satisfies

(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}-(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}=(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}\left(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}-\frac{1}{n}\bm{X}^{\top}\bm{X}\right)(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}.

Then we know supλ[λmin,λmax](1n𝑿𝑿+λ𝑰)1(1n𝑨𝑿𝑿𝑨+λ𝑰)1op𝑃0\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\left\|(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}-(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}\right\|_{\textnormal{op}}\overset{P}{\rightarrow}0 by observing that 1n𝑿op\frac{1}{\sqrt{n}}\|\bm{X}\|_{\textnormal{op}} is bounded with probability 1o(1)1-o(1), 𝑰𝑨op𝑃0\|\bm{I}-\bm{A}\|_{\textnormal{op}}\overset{P}{\rightarrow}0, and applying Lemma H.6. Another application of Lemma H.6, together with the fact that 1n𝒚2\frac{1}{\sqrt{n}}\|\bm{y}\|_{2} is bounded, concludes that supλk[λmin,λmax]𝜷^(𝑨)𝜷^2𝑃0\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}(\bm{A})-\hat{\bm{\beta}}\|_{2}\overset{P}{\rightarrow}0.

Next, for df^/p\hat{\textnormal{df}}/p and its perturbed counterpart, we have

supλ[λmin,λmax]|df^(𝑨)pdf^p|\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\hat{\textnormal{df}}(\bm{A})}{p}-\frac{\hat{\textnormal{df}}}{p}\right|
=\displaystyle= supλ[λmin,λmax]1p|(nTr((1n𝑿𝑿+λ𝑰)11n𝑿𝑿))(nTr((1n𝑨𝑿𝑿𝑨+λ𝑰)11n𝑨𝑿𝑿𝑨))|\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1}{p}\left|\left(n-\text{Tr}\left((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}\right)\right)-\left(n-\text{Tr}\left((\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}\right)\right)\right|
=\displaystyle= supλ[λmin,λmax]1pTr((1n𝑿𝑿+λ𝑰)11n𝑿𝑿(1n𝑨𝑿𝑿𝑨+λ𝑰)11n𝑨𝑿𝑿𝑨)\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1}{p}\text{Tr}\left((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}-(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}\right)
=\displaystyle= supλ[λmin,λmax]λpTr((1n𝑿𝑿+λ𝑰)1(1n𝑨𝑿𝑿𝑨+λ𝑰)1)\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{\lambda}{p}\text{Tr}\left((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}-(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}\right)
\displaystyle\leq supλ[λmin,λmax]λmaxpp(1n𝑿𝑿+λ𝑰)1(1n𝑨𝑿𝑿𝑨+λ𝑰)1op𝑃0.\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{\lambda_{\max}}{p}\cdot p\left\|(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}-(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}\right\|_{\textnormal{op}}\overset{P}{\rightarrow}0.

where the third equality follows from (1n𝑿𝑿+λ𝑰)11n𝑿𝑿=𝑰λ(1n𝑿𝑿+λ𝑰)1(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}=\bm{I}-\lambda(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1} and its perturbed counterpart. Hence the proof is complete.
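As a quick numerical illustration of the matrix identity invoked in the third equality, the sketch below checks that (\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}=\bm{I}-\lambda(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1} on a randomly generated design; the dimensions and the value of λ\lambda are arbitrary placeholders.

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 50, 0.3  # placeholder dimensions and penalty
X = rng.standard_normal((n, p))
G = X.T @ X / n  # sample Gram matrix X'X / n
lhs = np.linalg.solve(G + lam * np.eye(p), G)               # (G + lam I)^{-1} G
rhs = np.eye(p) - lam * np.linalg.inv(G + lam * np.eye(p))  # I - lam (G + lam I)^{-1}
print(np.max(np.abs(lhs - rhs)))  # zero up to floating-point error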

G.2 Proof of Lemma G.1, Lasso case

We drop the subscript L in this section for simplicity. The proof of the Lasso case is more involved. Denote

(𝒃)\displaystyle\mathcal{L}(\bm{b}) :=12n𝒚𝑿𝒃22+λn𝒃1\displaystyle:=\frac{1}{2n}\|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}+\frac{\lambda}{\sqrt{n}}\|\bm{b}\|_{1}
𝑨(𝒃)\displaystyle\mathcal{L}_{\bm{A}}(\bm{b}) :=12n𝒚𝑿𝑨𝒃22+λn𝒃1\displaystyle:=\frac{1}{2n}\|\bm{y}-\bm{X}\bm{A}\bm{b}\|_{2}^{2}+\frac{\lambda}{\sqrt{n}}\|\bm{b}\|_{1}

the Lasso objective function and its perturbed counterpart. We first argue that (𝜷^)\mathcal{L}(\hat{\bm{\beta}}) is close to (𝜷^(𝑨))\mathcal{L}(\hat{\bm{\beta}}(\bm{A})), where 𝜷^(𝑨):=argmin𝑨\hat{\bm{\beta}}(\bm{A}):=\operatorname*{arg\,min}\mathcal{L}_{\bm{A}} is the perturbed Lasso solution (Lemma G.2); we then utilize techniques concerning the local stability of the Lasso objective (cf. miolane2021distribution, celentano2020lasso) to prove the desired results in Sections G.2.1 and G.2.2.

Lemma G.2 (Closeness of Lasso objectives).

Under the assumptions in Lemma G.1, we have

supλ[λmin,λmax]|(𝜷^(𝑨))(𝜷^)|𝑃0.\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}|\mathcal{L}(\hat{\bm{\beta}}(\bm{A}))-\mathcal{L}(\hat{\bm{\beta}})|\overset{P}{\rightarrow}0.
Proof.

By optimality, (𝜷^(𝑨))(𝜷^)\mathcal{L}(\hat{\bm{\beta}}(\bm{A}))\geq\mathcal{L}(\hat{\bm{\beta}}) and 𝑨(𝜷^(𝑨))𝑨(𝜷^)\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}}(\bm{A}))\leq\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}}). Further, we have

supλ[λmin,λmax]((𝜷^(𝑨))(𝜷^))\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\left(\mathcal{L}(\hat{\bm{\beta}}(\bm{A}))-\mathcal{L}(\hat{\bm{\beta}})\right)
=\displaystyle= supλ[λmin,λmax]((𝜷^(𝑨))𝑨(𝜷^(𝑨))+𝑨(𝜷^(𝑨))𝑨(𝜷^)+𝑨(𝜷^)(𝜷^))\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\left(\mathcal{L}(\hat{\bm{\beta}}(\bm{A}))-\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}}(\bm{A}))+\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}}(\bm{A}))-\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}})+\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}})-\mathcal{L}(\hat{\bm{\beta}})\right)
\displaystyle\leq supλ[λmin,λmax]((𝜷^(𝑨))𝑨(𝜷^(𝑨))+𝑨(𝜷^)(𝜷^))\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\left(\mathcal{L}(\hat{\bm{\beta}}(\bm{A}))-\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}}(\bm{A}))+\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}})-\mathcal{L}(\hat{\bm{\beta}})\right)
\displaystyle\leq supλ[λmin,λmax](12n|𝒚𝑿𝜷^(𝑨)22𝒚𝑿𝑨𝜷^(𝑨)22|+12n|𝒚𝑿𝜷^22𝒚𝑿𝑨𝜷^22|).\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\left(\frac{1}{2n}\left|\|\bm{y}-\bm{X}\hat{\bm{\beta}}(\bm{A})\|_{2}^{2}-\|\bm{y}-\bm{X}\bm{A}\hat{\bm{\beta}}(\bm{A})\|_{2}^{2}\right|+\frac{1}{2n}\left|\|\bm{y}-\bm{X}\hat{\bm{\beta}}\|_{2}^{2}-\|\bm{y}-\bm{X}\bm{A}\hat{\bm{\beta}}\|_{2}^{2}\right|\right).

We know from Lemma D.7 that 𝜷^\hat{\bm{\beta}} is uniformly bounded with probability 1o(1)1-o(1). Further, replacing Lemma E.1 with (celentano2020lasso, Theorem 6), we can use a proof similar to that of Lemma D.7 to show that 𝜷^(𝑨)\hat{\bm{\beta}}(\bm{A}) is also uniformly bounded with probability 1o(1)1-o(1). Therefore, utilizing Lemma H.6,

supλ[λmin,λmax]12n|𝒚𝑿𝜷^22𝒚𝑿𝑨𝜷^22|\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1}{2n}\left|\|\bm{y}-\bm{X}\hat{\bm{\beta}}\|_{2}^{2}-\|\bm{y}-\bm{X}\bm{A}\hat{\bm{\beta}}\|_{2}^{2}\right|
\displaystyle\leq supλ[λmin,λmax]121n2𝒚+𝑿𝜷^+𝑿𝑨𝜷^21n𝑿𝜷^𝑿𝑨𝜷^2𝑃0,\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1}{2}\cdot\frac{1}{\sqrt{n}}\|2\bm{y}+\bm{X}\hat{\bm{\beta}}+\bm{X}\bm{A}\hat{\bm{\beta}}\|_{2}\cdot\frac{1}{\sqrt{n}}\|\bm{X}\hat{\bm{\beta}}-\bm{X}\bm{A}\hat{\bm{\beta}}\|_{2}\overset{P}{\rightarrow}0,

and similarly supλ[λmin,λmax]12n|𝒚𝑿𝜷^(𝑨)22𝒚𝑿𝑨𝜷^(𝑨)22|𝑃0\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1}{2n}\left|\|\bm{y}-\bm{X}\hat{\bm{\beta}}(\bm{A})\|_{2}^{2}-\|\bm{y}-\bm{X}\bm{A}\hat{\bm{\beta}}(\bm{A})\|_{2}^{2}\right|\overset{P}{\rightarrow}0. Hence,

supλ[λmin,λmax]|(𝜷^(𝑨))(𝜷^)|𝑃0\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}|\mathcal{L}(\hat{\bm{\beta}}(\bm{A}))-\mathcal{L}(\hat{\bm{\beta}})|\overset{P}{\rightarrow}0

as desired. ∎
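The content of Lemma G.2 can also be checked numerically: when 𝑨\bm{A} is close to the identity, the Lasso objective evaluated at the perturbed solution exceeds its minimum only slightly. The Python sketch below illustrates this using scikit-learn's Lasso, whose objective \frac{1}{2n}\|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}+\alpha\|\bm{b}\|_{1} matches \mathcal{L} upon setting \alpha=\lambda/\sqrt{n}; the dimensions, the penalty level, and the perturbation size are hypothetical placeholders.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, lam = 400, 200, 0.1  # placeholder dimensions and penalty
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:10] = 1.0
y = X @ beta + rng.standard_normal(n)
A = np.eye(p) + 0.01 * rng.standard_normal((p, p)) / np.sqrt(p)  # a small perturbation of the identity

def objective(b):
    # Lasso objective L(b) = (1/2n)||y - Xb||_2^2 + (lam/sqrt(n))||b||_1 from Section G.2.
    return 0.5 * np.mean((y - X @ b) ** 2) + lam / np.sqrt(n) * np.abs(b).sum()

def lasso_fit(design):
    # scikit-learn's Lasso minimizes (1/2n)||y - design @ b||_2^2 + alpha * ||b||_1.
    return Lasso(alpha=lam / np.sqrt(n), fit_intercept=False, max_iter=100000).fit(design, y).coef_

b_hat, b_hat_A = lasso_fit(X), lasso_fit(X @ A)
print(objective(b_hat_A) - objective(b_hat))  # small and (up to solver tolerance) nonnegative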

G.2.1 Closeness of Lasso solutions

We introduce a lemma that is sufficient for the first line of (48):

Lemma G.3 (Uniform local strong convexity of the Lasso objective).

Under the assumptions in Lemma G.1, there exists a constant CC such that for any λ\lambda-dependent vector 𝛃ˇ\check{\bm{\beta}},

supλ[λmin,λmax](𝜷ˇ)(𝜷^)supλ[λmin,λmax]min{C𝜷ˇ𝜷^22,C𝜷ˇ𝜷^2}.\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathcal{L}(\check{\bm{\beta}})-\mathcal{L}(\hat{\bm{\beta}})\geq\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\min\{C\|\check{\bm{\beta}}-\hat{\bm{\beta}}\|_{2}^{2},C\|\check{\bm{\beta}}-\hat{\bm{\beta}}\|_{2}\}.

The first line of (48) then follows by plugging 𝜷ˇ=𝜷^(𝑨)\check{\bm{\beta}}=\hat{\bm{\beta}}(\bm{A}) into Lemma G.3 and combining with Lemma G.2.

Proof of Lemma G.3.

Notice that this lemma is the uniform extension of (celentano2020lasso, Lemma B.9) (their statement only involves 𝜷ˇ𝜷^22\|\check{\bm{\beta}}-\hat{\bm{\beta}}\|_{2}^{2} with an additional constraint, but our statement follows naturally from their proof procedure). Thus we only elaborate on the necessary extensions of their proof here:

Consider the critical event in the proof:

𝒜:={1nκ(𝑿,n(1ζ/4))κmin}{1n𝑿opC}{1n#{j:|t^j|1Δ/2}1ζ/2}\mathcal{A}:=\left\{\frac{1}{\sqrt{n}}\kappa_{-}(\bm{X},n(1-\zeta/4))\geq\kappa_{\min}\right\}\cap\left\{\frac{1}{\sqrt{n}}\|\bm{X}\|_{\textnormal{op}}\leq C\right\}\cap\left\{\frac{1}{n}\#\{j:|\hat{t}_{j}|\geq 1-\Delta/2\}\leq 1-\zeta/2\right\}

where ζ:=1df/n\zeta:=1-\textnormal{df}/n as in (24) and 𝒕^:=1nλ𝑿(𝒚𝑿𝜷^)\hat{\bm{t}}:=\frac{1}{\sqrt{n}\lambda}\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}) is the Lasso subgradient. We want to show that there exist constants C,κmin,ΔC,\kappa_{\min},\Delta such that the event 𝒜\mathcal{A} happens with probability 1o(1)1-o(1), uniformly for λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}].

The event {1n𝑿opC}\{\frac{1}{\sqrt{n}}\|\bm{X}\|_{\textnormal{op}}\leq C\} does not depend on λ\lambda, so Corollary H.1 guarantees that it holds with probability 1o(1)1-o(1).

For {1nκ(𝑿,n(1ζ/4))κmin}\left\{\frac{1}{\sqrt{n}}\kappa_{-}(\bm{X},n(1-\zeta/4))\geq\kappa_{\min}\right\}, we know 1ζ/41ζmin/41-\zeta/4\leq 1-\zeta_{\min}/4 from (25), and we also know κ(𝑿,k)κ(𝑿,k)\kappa_{-}(\bm{X},k)\geq\kappa_{-}(\bm{X},k^{\prime}) for any kkk\leq k^{\prime} by the definition of κ\kappa_{-}. Therefore the high-probability guarantee (proved in celentano2020lasso, Section B.5.4) extends uniformly.

Finally, {1n#{j:|t^j|1Δ/2}1ζ/2}\left\{\frac{1}{n}\#\{j:|\hat{t}_{j}|\geq 1-\Delta/2\}\leq 1-\zeta/2\right\} follows from (miolane2021distribution, Theorem E.5) by an ϵ\epsilon-net argument (c.f. (miolane2021distribution, Section E.3.4) and Section E.3.4), which we omit here for conciseness.

Now that we have established 𝒜\mathcal{A} happens uniformly with probability 1o(1)1-o(1), the rest of the proof extends almost verbatim the proof in (celentano2020lasso, Section B.5.4) by taking the supremum in all inequalities. ∎

G.2.2 Closeness of Lasso sparsities

Taking inspiration from miolane2021distribution, we prove the convergence from above and from below using different strategies. We state and prove two supporting lemmas, Lemmas G.4 and G.5. Taking those lemmas as given momentarily, and noticing that a nonzero entry in the Lasso solution forces the corresponding subgradient entry to equal ±1\pm 1, we immediately have

supλ[λmin,λmax]|df^(𝑨)pdfp|𝑃0.\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\hat{\textnormal{df}}(\bm{A})}{p}-\frac{\textnormal{df}}{p}\right|\overset{P}{\rightarrow}0.

The second line of (48) then follows by further applying Lemma D.2.

Lemma G.4.

Under the assumptions in Lemma G.1, for any ϵ>0\epsilon>0,

(λ[λmin,λmax],1p𝜷^(𝑨)0dfpϵ)=1o(1).\mathbb{P}\left(\forall\lambda\in[\lambda_{\min},\lambda_{\max}],\frac{1}{p}\|\hat{\bm{\beta}}(\bm{A})\|_{0}\geq\frac{\textnormal{df}}{p}-\epsilon\right)=1-o(1).
Proof.

As part of the proof in Lemma G.2, we know

(λ[λmin,λmax],(𝒃)𝑨(𝒃)+ϵ)=1o(1)\mathbb{P}\left(\forall\lambda\in[\lambda_{\min},\lambda_{\max}],\mathcal{L}(\bm{b})\leq\mathcal{L}_{\bm{A}}(\bm{b})+\epsilon\right)=1-o(1) (49)

for any bounded 𝒃\bm{b}.

From (miolane2021distribution, Lemma F.4), we know

\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\forall\bm{b},\mathcal{L}(\bm{b})\leq\mathcal{L}(\hat{\bm{\beta}})+\gamma\epsilon^{3}\Rightarrow\frac{1}{p}\|\bm{b}\|_{0}\geq\frac{\textnormal{df}}{p}-\epsilon\right)=1-o(1) (50)

for some constant γ\gamma.

From (celentano2020lasso, Lemma B.12), we know

(λ,λ[λmin,λmax],λ,𝑨(𝜷^λ(𝑨))λ,𝑨(𝜷^λ(𝑨))+K|λλ|)=1o(1)\mathbb{P}\left(\forall\lambda,\lambda^{\prime}\in[\lambda_{\min},\lambda_{\max}],\mathcal{L}_{\lambda^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda}(\bm{A}))\leq\mathcal{L}_{\lambda^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda^{\prime}}(\bm{A}))+K|\lambda-\lambda^{\prime}|\right)=1-o(1) (51)

for some constant KK, where we write subscript to make the dependence on λ\lambda explicit when necessary.

Now let KK and γ\gamma be the constants mentioned above, and let CC be the constant in Lemma H.2(b). Consider any ϵ>0\epsilon>0. Define ϵ=min{γϵ33K,ϵC+1}\epsilon^{\prime}=\min\left\{\frac{\gamma\epsilon^{3}}{3K},\frac{\epsilon}{C+1}\right\}. Let k=λmaxλminϵk=\lceil\frac{\lambda_{\max}-\lambda_{\min}}{\epsilon^{\prime}}\rceil, and let λi=λmin+iϵ\lambda_{i}=\lambda_{\min}+i\epsilon^{\prime}. Applying a union bound on (50) yields

\mathbb{P}\left(\forall i,\forall\bm{b},\mathcal{L}_{\lambda_{i}}(\bm{b})\leq\mathcal{L}_{\lambda_{i}}(\hat{\bm{\beta}}_{\lambda_{i}})+\gamma\epsilon^{3}\Rightarrow\frac{1}{p}\|\bm{b}\|_{0}\geq\frac{\textnormal{df}_{\lambda_{i}}}{p}-\epsilon\right)=1-o(1). (52)

Consider the intersection of events in (49), (51) and (52), which has probability 1o(1)1-o(1). For any λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}], let ii be such that λ[λi,λi+1]\lambda\in[\lambda_{i},\lambda_{i+1}]. We have

\displaystyle\mathcal{L}_{\lambda_{i}}(\hat{\bm{\beta}}_{\lambda}(\bm{A})) \displaystyle\leq\mathcal{L}_{\lambda_{i},\bm{A}}(\hat{\bm{\beta}}_{\lambda}(\bm{A}))+\frac{1}{3}\gamma\epsilon^{3}
\displaystyle\leq\mathcal{L}_{\lambda_{i},\bm{A}}(\hat{\bm{\beta}}_{\lambda_{i}}(\bm{A}))+\frac{1}{3}\gamma\epsilon^{3}+K|\lambda-\lambda_{i}|
\displaystyle\leq\mathcal{L}_{\lambda_{i},\bm{A}}(\hat{\bm{\beta}}_{\lambda_{i}}(\bm{A}))+\frac{2}{3}\gamma\epsilon^{3}
\displaystyle\leq\mathcal{L}_{\lambda_{i},\bm{A}}(\hat{\bm{\beta}}_{\lambda_{i}})+\frac{2}{3}\gamma\epsilon^{3}
\displaystyle\leq\mathcal{L}_{\lambda_{i}}(\hat{\bm{\beta}}_{\lambda_{i}})+\gamma\epsilon^{3}

where the first and last inequalities come from (49), the second comes from (51), the third holds since K|\lambda-\lambda_{i}|\leq K\epsilon^{\prime}\leq\gamma\epsilon^{3}/3 by the choice of \epsilon^{\prime}, and the fourth follows from the optimality of \hat{\bm{\beta}}_{\lambda_{i}}(\bm{A}). Now since we are on event (52), we have

1p𝜷^λ(𝑨)0\displaystyle\frac{1}{p}\|\hat{\bm{\beta}}_{\lambda}(\bm{A})\|_{0} dfλipϵ\displaystyle\geq\frac{\textnormal{df}_{\lambda_{i}}}{p}-\epsilon
=dfλpϵ+(dfλipdfλp)\displaystyle=\frac{\textnormal{df}_{\lambda}}{p}-\epsilon+(\frac{\textnormal{df}_{\lambda_{i}}}{p}-\frac{\textnormal{df}_{\lambda}}{p})
\displaystyle\geq\frac{\textnormal{df}_{\lambda}}{p}-\epsilon-C|\lambda-\lambda_{i}|
dfλp2ϵ.\displaystyle\geq\frac{\textnormal{df}_{\lambda}}{p}-2\epsilon.

Hence the proof is complete upon observing that both ϵ\epsilon and λ\lambda are arbitrary. ∎

Lemma G.5.

Denote

𝒕^\displaystyle\hat{\bm{t}} :=1nλ𝑿(𝒚𝑿𝜷^)\displaystyle:=\frac{1}{\sqrt{n}\lambda}\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}})
𝒕^(𝑨)\displaystyle\hat{\bm{t}}(\bm{A}) :=1nλ𝑨𝑿(𝒚𝑿𝑨𝜷^(𝑨))\displaystyle:=\frac{1}{\sqrt{n}\lambda}\bm{A}^{\top}\bm{X}^{\top}(\bm{y}-\bm{X}\bm{A}\hat{\bm{\beta}}(\bm{A}))

the Lasso subgradient and its perturbed counterpart. Under the assumptions in Lemma G.1, for any ϵ>0\epsilon>0,

\mathbb{P}\left(\forall\lambda\in[\lambda_{\min},\lambda_{\max}],\frac{1}{n}\#\{j:|\hat{t}_{j}(\bm{A})|=1\}\leq\frac{\textnormal{df}}{n}+\epsilon\right)=1-o(1).
Proof.

We define an auxiliary loss function 𝒱\mathcal{V} and its perturbed counterpart 𝒱𝑨\mathcal{V}_{\bm{A}}:

𝒱(𝒕)\displaystyle\mathcal{V}(\bm{t}) =min𝒃2M{12n𝑿𝒃𝒚22+λn𝒕𝒃}:=min𝒃2Mw(𝒃,𝒕)\displaystyle=\min_{\|\bm{b}\|_{2}\leq M}\left\{\frac{1}{2n}\|\bm{X}\bm{b}-\bm{y}\|_{2}^{2}+\frac{\lambda}{\sqrt{n}}\bm{t}^{\top}\bm{b}\right\}:=\min_{\|\bm{b}\|_{2}\leq M}w(\bm{b},\bm{t})
𝒱𝑨(𝒕)\displaystyle\mathcal{V}_{\bm{A}}(\bm{t}) =min𝒃2M{12n𝑿𝑨𝒃𝒚22+λn𝒕𝒃}:=min𝒃2Mw𝑨(𝒃,𝒕)\displaystyle=\min_{\|\bm{b}\|_{2}\leq M}\left\{\frac{1}{2n}\|\bm{X}\bm{A}\bm{b}-\bm{y}\|_{2}^{2}+\frac{\lambda}{\sqrt{n}}\bm{t}^{\top}\bm{b}\right\}:=\min_{\|\bm{b}\|_{2}\leq M}w_{\bm{A}}(\bm{b},\bm{t})

where MM is some large enough constant. For any ϵ>0\epsilon>0, consider the following three high probability event statements:

(λ[λmin,λmax],𝒕1,𝒱(𝒕)𝒱𝑨(𝒕)ϵ)=1o(1);\mathbb{P}\left(\forall\lambda\in[\lambda_{\min},\lambda_{\max}],\forall\|\bm{t}\|_{\infty}\leq 1,\mathcal{V}(\bm{t})\geq\mathcal{V}_{\bm{A}}(\bm{t})-\epsilon\right)=1-o(1); (53)
\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\forall\|\bm{t}\|_{\infty}\leq 1,\mathcal{V}(\bm{t})\geq\mathcal{V}(\hat{\bm{t}})-3\gamma\epsilon^{3}\Rightarrow\frac{1}{p}\#\{j:|t_{j}|\geq 1-\epsilon\}\leq\frac{\textnormal{df}}{p}+K_{1}\epsilon\right)=1-o(1) (54)

for some constants K1K_{1} and γ\gamma;

(λ,λ[λmin,λmax],𝒱λ,𝑨(𝒕^λ(𝑨))𝒱λ,𝑨(𝒕^λ(𝑨))K2|λλ|)=1o(1)\mathbb{P}\left(\forall\lambda,\lambda^{\prime}\in[\lambda_{\min},\lambda_{\max}],\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))\geq\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda^{\prime}}(\bm{A}))-K_{2}|\lambda-\lambda^{\prime}|\right)=1-o(1) (55)

for some constant K2K_{2}.

Observe that statements (53), (54), (55) have exactly the same forms as statements (49), (50), (51), respectively. Hence, upon proving statements (53), (54), (55), we can use an ϵ\epsilon-net argument similar to that in the proof of Lemma G.4 to complete the proof of Lemma G.5.

First we prove (53). Fixing any 𝒕1\|\bm{t}\|_{\infty}\leq 1 and denoting 𝒃:=argmin𝒃2Mw(𝒃,𝒕),𝒃𝑨:=argmin𝒃2Mw𝑨(𝒃,𝒕)\bm{b}^{*}:=\operatorname*{arg\,min}_{\|\bm{b}\|_{2}\leq M}w(\bm{b},\bm{t}),\bm{b}_{\bm{A}}^{*}:=\operatorname*{arg\,min}_{\|\bm{b}\|_{2}\leq M}w_{\bm{A}}(\bm{b},\bm{t}), we have

\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}(\mathcal{V}(\bm{t})-\mathcal{V}_{\bm{A}}(\bm{t}))
\displaystyle= \sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}(w(\bm{b}^{*},\bm{t})-w_{\bm{A}}(\bm{b}_{\bm{A}}^{*},\bm{t}))
\displaystyle\leq \sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}(w(\bm{b}_{\bm{A}}^{*},\bm{t})-w_{\bm{A}}(\bm{b}_{\bm{A}}^{*},\bm{t}))
\displaystyle\leq \sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1}{2n}\left|\|\bm{X}\bm{b}_{\bm{A}}^{*}-\bm{y}\|_{2}^{2}-\|\bm{X}\bm{A}\bm{b}_{\bm{A}}^{*}-\bm{y}\|_{2}^{2}\right|\overset{P}{\rightarrow}0

where the first inequality follows from the optimality of 𝒃\bm{b}^{*}, and the convergence follows from Lemma H.6 and the fact that 𝒃𝑨2M\|\bm{b}_{\bm{A}}^{*}\|_{2}\leq M.

Writing out a symmetric argument yields the other direction (details omitted):

supλ[λmin,λmax](𝒱𝑨(𝒕)𝒱(𝒕))supλ[λmin,λmax]12n|𝑿𝒃𝒚22𝑿𝑨𝒃𝒚22|𝑃0.\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}(\mathcal{V}_{\bm{A}}(\bm{t})-\mathcal{V}(\bm{t}))\leq\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1}{2n}\left|\|\bm{X}\bm{b}^{*}-\bm{y}\|_{2}^{2}-\|\bm{X}\bm{A}\bm{b}^{*}-\bm{y}\|_{2}^{2}\right|\overset{P}{\rightarrow}0.

Therefore, we know supλ[λmin,λmax]|𝒱𝑨(𝒕)𝒱(𝒕)|𝑃0\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}|\mathcal{V}_{\bm{A}}(\bm{t})-\mathcal{V}(\bm{t})|\overset{P}{\rightarrow}0, which leads to (53).

Argument (54) is a direct corollary of (miolane2021distribution, Lemma E.9).

We are now left with (55). From the KKT conditions, on the event {λ[λmin,λmax],𝜷^(𝑨)2M}\{\forall\lambda\in[\lambda_{\min},\lambda_{\max}],\|\hat{\bm{\beta}}(\bm{A})\|_{2}\leq M\}, which happens with probability 1o(1)1-o(1) (see the proof of Lemma G.2), we have

𝑨(𝜷^(𝑨))=min𝒃2M𝑨(𝒃):=min𝒃2Mmax𝒕1w𝑨(𝒃,𝒕)=max𝒕1min𝒃2Mw𝑨(𝒃,𝒕):=max𝒕1𝒱𝑨(𝒕):=𝒱𝑨(𝒕^(𝑨))\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}}(\bm{A}))=\min_{\|\bm{b}\|_{2}\leq M}\mathcal{L}_{\bm{A}}(\bm{b}):=\min_{\|\bm{b}\|_{2}\leq M}\max_{\|\bm{t}\|_{\infty}\leq 1}w_{\bm{A}}(\bm{b},\bm{t})=\max_{\|\bm{t}\|_{\infty}\leq 1}\min_{\|\bm{b}\|_{2}\leq M}w_{\bm{A}}(\bm{b},\bm{t}):=\max_{\|\bm{t}\|_{\infty}\leq 1}\mathcal{V}_{\bm{A}}(\bm{t}):=\mathcal{V}_{\bm{A}}(\hat{\bm{t}}(\bm{A}))

where the exchange of the min and the max is justified by (rockafellar2015convex, Corollary 37.3.2).

As a result, with probability 1o(1)1-o(1), λ,λ[λmin,λmax]\forall\lambda,\lambda^{\prime}\in[\lambda_{\min},\lambda_{\max}],

𝒱λ,𝑨(𝒕^λ(𝑨))𝒱λ,𝑨(𝒕^λ(𝑨))\displaystyle\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda^{\prime}}(\bm{A}))-\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))
=\displaystyle= λ,𝑨(𝜷^λ(𝑨))𝒱λ,𝑨(𝒕^λ(𝑨))\displaystyle\mathcal{L}_{\lambda^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda^{\prime}}(\bm{A}))-\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))
\displaystyle\leq λ,𝑨(𝜷^λ(𝑨))𝒱λ,𝑨(𝒕^λ(𝑨))\displaystyle\mathcal{L}_{\lambda^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda}(\bm{A}))-\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))
=\displaystyle= λ,𝑨(𝜷^λ(𝑨))λ,𝑨(𝜷^λ(𝑨))+𝒱λ,𝑨(𝒕^λ(𝑨))𝒱λ,𝑨(𝒕^λ(𝑨))\displaystyle\mathcal{L}_{\lambda^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda}(\bm{A}))-\mathcal{L}_{\lambda,\bm{A}}(\hat{\bm{\beta}}_{\lambda}(\bm{A}))+\mathcal{V}_{\lambda,\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))-\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))

where the inequality follows from optimality of 𝜷^λ(𝑨)\hat{\bm{\beta}}_{\lambda^{\prime}}(\bm{A}). Statement (51) already guarantees a high probability bound for λ,𝑨(𝜷^λ(𝑨))λ,𝑨(𝜷^λ(𝑨))\mathcal{L}_{\lambda^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda}(\bm{A}))-\mathcal{L}_{\lambda,\bm{A}}(\hat{\bm{\beta}}_{\lambda}(\bm{A})). Finally, (re)denoting 𝒃:=argmin𝒃2Mwλ,𝑨(𝒃,𝒕^λ(𝑨))\bm{b}^{*}:=\operatorname*{arg\,min}_{\|\bm{b}\|_{2}\leq M}w_{\lambda^{\prime},\bm{A}}(\bm{b},\hat{\bm{t}}_{\lambda}(\bm{A})), we have

𝒱λ,𝑨(𝒕^λ(𝑨))𝒱λ,𝑨(𝒕^λ(𝑨))\displaystyle\mathcal{V}_{\lambda,\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))-\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))
:=\displaystyle:= min𝒃2Mwλ,𝑨(𝒃,𝒕^λ(𝑨))wλ,𝑨(𝒃,𝒕^λ(𝑨))\displaystyle\min_{\|\bm{b}\|_{2}\leq M}w_{\lambda,\bm{A}}(\bm{b},\hat{\bm{t}}_{\lambda}(\bm{A}))-w_{\lambda^{\prime},\bm{A}}(\bm{b}^{*},\hat{\bm{t}}_{\lambda}(\bm{A}))
\displaystyle\leq wλ,𝑨(𝒃,𝒕^λ(𝑨))wλ,𝑨(𝒃,𝒕^λ(𝑨))\displaystyle w_{\lambda,\bm{A}}(\bm{b}^{*},\hat{\bm{t}}_{\lambda}(\bm{A}))-w_{\lambda^{\prime},\bm{A}}(\bm{b}^{*},\hat{\bm{t}}_{\lambda}(\bm{A}))
=\displaystyle= |λλ|n𝒃𝒕^λ(𝑨)\displaystyle\frac{|\lambda-\lambda^{\prime}|}{\sqrt{n}}\bm{b}^{*\top}\hat{\bm{t}}_{\lambda}(\bm{A})
\displaystyle\leq |λλ|𝒃21n𝒕^λ(𝑨)2\displaystyle|\lambda-\lambda^{\prime}|\cdot\|\bm{b}^{*}\|_{2}\cdot\frac{1}{\sqrt{n}}\|\hat{\bm{t}}_{\lambda}(\bm{A})\|_{2}
\displaystyle\leq M|λλ|1n1λ𝑨𝑿(𝒚𝑿𝑨𝜷^λ(𝑨))2\displaystyle M|\lambda-\lambda^{\prime}|\cdot\frac{1}{n}\left\|\frac{1}{\lambda}\bm{A}^{\top}\bm{X}^{\top}(\bm{y}-\bm{X}\bm{A}\hat{\bm{\beta}}_{\lambda}(\bm{A}))\right\|_{2}
\displaystyle\leq K3|λλ|\displaystyle K_{3}|\lambda-\lambda^{\prime}|

with probability 1o(1)1-o(1) for some constant K3K_{3}, where the first inequality follows from optimality of 𝒃\bm{b}^{*} and the last inequality follows from the fact that 1λ,1n𝑿op,1n𝒚2,𝑨op\frac{1}{\lambda},\frac{1}{\sqrt{n}}\|\bm{X}\|_{\textnormal{op}},\frac{1}{\sqrt{n}}\|\bm{y}\|_{2},\|\bm{A}\|_{\textnormal{op}} are all bounded with high probability. The proof is thus complete. ∎

Appendix H Auxiliary Lemmas

This section introduces several auxiliary lemmas that we use repeatedly throughout our proofs.

H.1 Convex Gaussian Min-Max Theorem

Theorem H.1.

(thrampoulidis2015regularized, Theorem 3) Let SwpS_{w}\subset\mathbb{R}^{p} and SunS_{u}\subset\mathbb{R}^{n} be two compact sets and let f:Sw×Suf:S_{w}\times S_{u}\rightarrow\mathbb{R} be a continuous function. Let 𝐗=(𝐗i,j)n×p\bm{X}=(\bm{X}_{i,j})\in\mathbb{R}^{n\times p} have i.i.d. 𝒩(0,1)\mathcal{N}(0,1) entries, and let 𝛏g𝒩(0,𝐈p),𝛏h𝒩(0,𝐈n)\bm{\xi}_{g}\sim\mathcal{N}(0,\bm{I}_{p}),\bm{\xi}_{h}\sim\mathcal{N}(0,\bm{I}_{n}) be standard Gaussian vectors, with 𝐗,𝛏g,𝛏h\bm{X},\bm{\xi}_{g},\bm{\xi}_{h} mutually independent. Define

𝒞(𝑿)\displaystyle\mathcal{C}^{*}(\bm{X}) =min𝒘Swmax𝒖Su𝒖𝑿𝒘+f(𝒘,𝒖),\displaystyle=\min_{\bm{w}\in S_{w}}\max_{\bm{u}\in S_{u}}\bm{u}^{\top}\bm{X}\bm{w}+f(\bm{w},\bm{u}),
L(𝝃g,𝝃h)\displaystyle L^{*}(\bm{\xi}_{g},\bm{\xi}_{h}) =min𝒘Swmax𝒖Su𝒖2𝝃g𝒘+𝒘2𝝃h𝒖+f(𝒘,𝒖).\displaystyle=\min_{\bm{w}\in S_{w}}\max_{\bm{u}\in S_{u}}\|\bm{u}\|_{2}\bm{\xi}_{g}^{\top}\bm{w}+\|\bm{w}\|_{2}\bm{\xi}_{h}^{\top}\bm{u}+f(\bm{w},\bm{u}).

Then we have:

  • For all tt\in\mathbb{R},

    (𝒞(𝑿)t)2(L(𝝃g,𝝃h)t).\mathbb{P}(\mathcal{C}^{*}(\bm{X})\leq t)\leq 2\mathbb{P}(L^{*}(\bm{\xi}_{g},\bm{\xi}_{h})\leq t).
  • If SwS_{w} and SuS_{u} are convex and ff is convex-concave, then for all tt\in\mathbb{R},

    (𝒞(𝑿)t)2(L(𝝃g,𝝃h)t).\mathbb{P}(\mathcal{C}^{*}(\bm{X})\geq t)\leq 2\mathbb{P}(L^{*}(\bm{\xi}_{g},\bm{\xi}_{h})\geq t).
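As a simple sanity check (not a proof), consider the special case f\equiv 0 with SwS_{w} and SuS_{u} taken to be the unit spheres of p\mathbb{R}^{p} and n\mathbb{R}^{n}: then \mathcal{C}^{*}(\bm{X}) is the smallest singular value of 𝑿\bm{X} and L^{*}(\bm{\xi}_{g},\bm{\xi}_{h})=\|\bm{\xi}_{h}\|_{2}-\|\bm{\xi}_{g}\|_{2}. The Monte Carlo sketch below compares the two sides of the first inequality at a single threshold; the dimensions, the threshold, and the number of replications are arbitrary placeholders.

import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 40, 20, 5000  # placeholder dimensions and replication count
# C*(X) = sigma_min(X) when f = 0 and S_w, S_u are unit spheres.
C_vals = np.array([np.linalg.svd(rng.standard_normal((n, p)), compute_uv=False).min()
                   for _ in range(reps)])
# L*(xi_g, xi_h) = ||xi_h||_2 - ||xi_g||_2 with xi_g in R^p and xi_h in R^n.
L_vals = np.array([np.linalg.norm(rng.standard_normal(n)) - np.linalg.norm(rng.standard_normal(p))
                   for _ in range(reps)])
t = np.quantile(C_vals, 0.10)  # an arbitrary threshold in the lower tail
print(np.mean(C_vals <= t), 2 * np.mean(L_vals <= t))  # the first value should not exceed the second, up to Monte Carlo error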

H.2 Properties of Equation System Solution

Lemma H.2 (Lipschitzness of fixed point parameters).

Denote by (τL,τR,ζL,ζR,ρ)(\tau_{L},\tau_{R},\zeta_{L},\zeta_{R},\rho) the unique solution to the fixed point equation system (24). There exists a constant CC such that

  • (a) the mapping λkτk\lambda_{k}\mapsto\tau_{k} is C/nC/\sqrt{n}-Lipschitz;

  • (b) the mapping λkζk\lambda_{k}\mapsto\zeta_{k} is CC-Lipschitz;

  • (c) the mappings (λL,λR)ρ,(λL,λR)ρ(\lambda_{\textrm{L}},\lambda_{\textrm{R}})\mapsto\rho,(\lambda_{\textrm{L}},\lambda_{\textrm{R}})\mapsto\rho^{\perp} are both CC-Lipschitz in both arguments, where ρ=1ρ2.\rho^{\perp}=\sqrt{1-\rho^{2}}.

Proof.

Note that (a) is a direct consequence of miolane2021distribution Proposition A.3. (b) also follows from miolane2021distribution on noting that ζk=β/τ\zeta_{k}=\beta_{*}/\tau_{*} in their notation, and on applying Lemma H.8. We remark that both results from miolane2021distribution Proposition A.3 are stated for the case of the Lasso, but the Ridge case follows similarly.

We prove (c) here. We show that both ρ\rho and ρ\rho^{\perp} are Lipschitz in λR\lambda_{\textrm{R}}. The proof for λL\lambda_{\textrm{L}} follows similarly and is therefore omitted. From (23) and (24), we know that

ρ=1nτLτR(σ2+𝔼[𝜷^Lf𝜷,𝜷^Rf𝜷]).\rho=\frac{1}{n\tau_{L}\tau_{R}}\left(\sigma^{2}+\mathbb{E}[\langle\hat{\bm{\beta}}_{\textrm{L}}^{f}-\bm{\beta},\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta}\rangle]\right).

First, we know σ2\sigma^{2} and nτL\sqrt{n}\tau_{L} do not depend on λR\lambda_{\textrm{R}}, while nτR\sqrt{n}\tau_{R} is bounded and Lipschitz by (25) and part (a) of this lemma. Next, by Cauchy-Schwarz, we have

𝔼[𝜷^Lf𝜷,𝜷^Rf𝜷]𝔼[𝜷^Lf𝜷2]𝔼[𝜷^Rf𝜷2],\mathbb{E}[\langle\hat{\bm{\beta}}_{\textrm{L}}^{f}-\bm{\beta},\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta}\rangle]\leq\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{L}}^{f}-\bm{\beta}\|_{2}]\cdot\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta}\|_{2}],

where \mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{L}}^{f}-\bm{\beta}\|_{2}]\leq\sqrt{\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{L}}^{f}-\bm{\beta}\|_{2}^{2}]}=\sqrt{n\tau_{L}^{2}-\sigma^{2}}\leq\tau_{\max} is bounded and does not depend on λR\lambda_{\textrm{R}} (the equality follows from (23) and (24)). Now,

\left|\mathbb{E}[\|\hat{\bm{\beta}}_{R}^{f}(\lambda_{1})-\bm{\beta}\|_{2}]-\mathbb{E}[\|\hat{\bm{\beta}}_{R}^{f}(\lambda_{2})-\bm{\beta}\|_{2}]\right|\leq \mathbb{E}\left[\left|\|\hat{\bm{\beta}}_{R}^{f}(\lambda_{1})-\bm{\beta}\|_{2}-\|\hat{\bm{\beta}}_{R}^{f}(\lambda_{2})-\bm{\beta}\|_{2}\right|\right]
\displaystyle\leq 𝔼[𝜷^Rf(λ1)𝜷^Rf(λ2)2]\displaystyle\mathbb{E}[\|\hat{\bm{\beta}}_{R}^{f}(\lambda_{1})-\hat{\bm{\beta}}_{R}^{f}(\lambda_{2})\|_{2}]
\displaystyle\leq 𝔼[𝜷^Rf(λ1)𝜷^Rf(λ2)22]\displaystyle\sqrt{\mathbb{E}[\|\hat{\bm{\beta}}_{R}^{f}(\lambda_{1})-\hat{\bm{\beta}}_{R}^{f}(\lambda_{2})\|_{2}^{2}]}
\displaystyle\leq M(λ1λ2)\displaystyle M(\lambda_{1}-\lambda_{2})

for some constant MM, where the last line follows verbatim the proof of Lemma E.10. Hence 𝔼[𝜷^Lf𝜷L,𝜷^Rf𝜷R]\mathbb{E}[\langle\hat{\bm{\beta}}_{\textrm{L}}^{f}-\bm{\beta}_{\textrm{L}},\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta}_{\textrm{R}}\rangle] is Lipschitz in λR\lambda_{\textrm{R}}. Combining the two terms with Lemma H.8, we conclude that ρ\rho is Lipschitz in λR\lambda_{\textrm{R}}. For ρ\rho^{\perp}, note the fact that ρ=1ρ2\rho^{\perp}=\sqrt{1-\rho^{2}}, so

|ρ(λ1)ρ(λ2)|\displaystyle|\rho^{\perp}(\lambda_{1})-\rho^{\perp}(\lambda_{2})| =|ρ(λ1)+ρ(λ2)||ρ(λ1)+ρ(λ2)||ρ(λ1)ρ(λ2)|\displaystyle=\frac{|\rho(\lambda_{1})+\rho(\lambda_{2})|}{|\rho^{\perp}(\lambda_{1})+\rho^{\perp}(\lambda_{2})|}\cdot|\rho(\lambda_{1})-\rho(\lambda_{2})|
221ρmax2|ρ(λ1)ρ(λ2)|\displaystyle\leq\frac{2}{2\sqrt{1-\rho_{\max}^{2}}}|\rho(\lambda_{1})-\rho(\lambda_{2})|
11ρmax2M|λ1λ2|.\displaystyle\leq\frac{1}{\sqrt{1-\rho_{\max}^{2}}}\cdot M|\lambda_{1}-\lambda_{2}|.

Thus ρ\rho^{\perp} is also Lipschitz in λR\lambda_{\textrm{R}}, concluding the proof. ∎
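The only external ingredient borrowed above is the Lipschitz dependence of the ridge/Lasso quantities on the tuning parameter (Lemma E.10). The Python sketch below is a purely numerical illustration of this behavior for the finite-sample ridge path; the parametrization of the ridge objective and all dimensions are our own illustrative choices and need not match the exact scaling used elsewhere in the paper.

```python
# Numerical sketch (not part of any proof): the ridge solution path is Lipschitz in lambda.
# Assumed parametrization: beta_hat(lam) = argmin_b ||y - X b||^2 / (2n) + lam * ||b||^2 / 2.
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 150
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + rng.standard_normal(n)

def ridge(lam):
    # closed form: (X'X/n + lam * I)^{-1} X'y / n
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

lams = np.linspace(0.1, 2.0, 40)
path = np.array([ridge(l) for l in lams])
ratios = [np.linalg.norm(path[i + 1] - path[i]) / (lams[i + 1] - lams[i])
          for i in range(len(lams) - 1)]
print("max ||beta_hat(l1) - beta_hat(l2)||_2 / |l1 - l2| over the grid:", max(ratios))
```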

H.3 Concentration of empirical second moments

Most of our results are established for Lipschitz functions of random vectors. However, we will frequently need to show the concentration of second-order statistics of the form 𝒂(1),𝒂(2)\langle\bm{a}^{(1)},\bm{a}^{(2)}\rangle. The following lemma provides the connection:

Lemma H.3.

Consider two random vectors 𝐚L,𝐚Rp\bm{a}_{\textrm{L}},\bm{a}_{\textrm{R}}\in\mathbb{R}^{p}, and a positive semi-definite matrix 𝐒2×2\bm{S}\in\mathbb{R}^{2\times 2} (all possibly dependent on λL,λR\lambda_{\textrm{L}},\lambda_{\textrm{R}}) that satisfy the following concentration guarantee:

  • There exist functions ϕL,ϕR\bm{\phi}_{\textrm{L}},\bm{\phi}_{\textrm{R}} (dependent on λL,λR\lambda_{\textrm{L}},\lambda_{\textrm{R}}) that are M/pM/\sqrt{p}-Lipschitz such that for Gaussian vectors 𝝃L,𝝃R𝒩(0,𝑺𝑰p)\bm{\xi}_{\textrm{L}},\bm{\xi}_{\textrm{R}}\sim\mathcal{N}(0,\bm{S}\otimes\bm{I}_{p}) and any 11-Lipschitz functions ϕ:(p)2\phi:(\mathbb{R}^{p})^{2}\rightarrow\mathbb{R},

    (supλL,λR[λmin,λmax]|ϕ(𝒂L,𝒂R)𝔼[ϕ(ϕL(𝝃L),ϕR(𝝃R))]|>Kϵ)=o(ϵ)\displaystyle\mathbb{P}\left(\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\phi(\bm{a}_{\textrm{L}},\bm{a}_{\textrm{R}})-\mathbb{E}[\phi(\bm{\phi}_{\textrm{L}}(\bm{\xi}_{\textrm{L}}),\bm{\phi}_{\textrm{R}}(\bm{\xi}_{\textrm{R}}))]|>K\epsilon\right)=o(\epsilon)

    and

    𝔼[supλk[λmin,λmax]ϕk(𝝃k)22]C,k=L,R.\mathbb{E}[\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\bm{\phi}_{k}(\bm{\xi}_{k})\|_{2}^{2}]\leq C,\ k=L,R.
  • The parameters satisfy MCKM\leq CK and the singular values of 𝑺\bm{S} are bounded by CC for all λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}].

Then we have

(supλL,λR[λmin,λmax]|𝒂L,𝒂R𝔼[ϕL(𝝃L),ϕR(𝝃R)]|>Kϵ)=o(ϵ).\mathbb{P}\left(\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\langle\bm{a}_{\textrm{L}},\bm{a}_{\textrm{R}}\rangle-\mathbb{E}[\langle\bm{\phi}_{\textrm{L}}(\bm{\xi}_{\textrm{L}}),\bm{\phi}_{\textrm{R}}(\bm{\xi}_{\textrm{R}})\rangle]|>K\epsilon\right)=o(\epsilon).

The proof of this lemma follows almost verbatim the proof of Lemma G.1 in celentano2021cad, with Proposition H.5 applied in appropriate places to ensure that the relevant equalities and inequalities hold uniformly over the range of tuning parameters.
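For intuition, one standard route from concentration of Lipschitz functions to concentration of such bilinear statistics (recorded here only as a heuristic; the argument in celentano2021cad may be organized differently) combines the polarization identity with the fact that the square map is Lipschitz on bounded sets:

\langle\bm{a}_{\textrm{L}},\bm{a}_{\textrm{R}}\rangle=\tfrac{1}{4}\left(\|\bm{a}_{\textrm{L}}+\bm{a}_{\textrm{R}}\|_{2}^{2}-\|\bm{a}_{\textrm{L}}-\bm{a}_{\textrm{R}}\|_{2}^{2}\right),\qquad|x^{2}-y^{2}|\leq 2R\,|x-y|\ \text{ for }|x|,|y|\leq R.

The maps (\bm{a}_{\textrm{L}},\bm{a}_{\textrm{R}})\mapsto\|\bm{a}_{\textrm{L}}\pm\bm{a}_{\textrm{R}}\|_{2} are \sqrt{2}-Lipschitz, and the second-moment bound \mathbb{E}[\sup_{\lambda_{k}}\|\bm{\phi}_{k}(\bm{\xi}_{k})\|_{2}^{2}]\leq C supplies, after a high-probability truncation, the radius R at which the square map is applied.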

H.4 Largest singular value of random matrices

Lemma H.4.

(vershynin2010introduction, Corollary 5.35) Let σmax(𝐗)\sigma_{\max}(\bm{X}) be the largest singular value of the matrix 𝐗n×p\bm{X}\in\mathbb{R}^{n\times p} with i.i.d. 𝒩(0,1)\mathcal{N}(0,1) entries, then

(σmax(𝑿)>n+p+t)et2/2.\mathbb{P}(\sigma_{\max}(\bm{X})>\sqrt{n}+\sqrt{p}+t)\leq e^{-t^{2}/2}.
Corollary H.1.

In the setting of Lemma H.4,

(1nσmax(𝑿)>2+δ)en/2=o(1).\mathbb{P}\left(\frac{1}{\sqrt{n}}\sigma_{\max}(\bm{X})>2+\sqrt{\delta}\right)\leq e^{-n/2}=o(1).
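As a quick numerical illustration (not used in any proof; the dimensions and number of repetitions below are arbitrary illustrative choices), the scaled largest singular value of a standard Gaussian design concentrates near 1+√δ, comfortably below the threshold 2+√δ in Corollary H.1:

```python
# Monte Carlo illustration of Lemma H.4 / Corollary H.1 for a Gaussian design (sketch only).
import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 500, 250, 100
delta = p / n

smax = [np.linalg.norm(rng.standard_normal((n, p)), ord=2) for _ in range(reps)]
print("max over reps of sigma_max / sqrt(n):", max(smax) / np.sqrt(n))
print("reference 1 + sqrt(delta)           :", 1 + np.sqrt(delta))
print("threshold 2 + sqrt(delta)           :", 2 + np.sqrt(delta))
```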

H.5 Inequalities for supremum

Our results often involve establishing that certain statements hold uniformly over the range of tuning parameter values. To this end, we frequently use the following results related to the properties of the supremum.

Lemma H.5.
  • For a random variable XX and functions fλ(X)=k=1Kfλ(k)(X)f_{\lambda}(X)=\sum_{k=1}^{K}f_{\lambda}^{(k)}(X) parametrized by λ\lambda, with probability 11,

    supλfλ(X)k=1Ksupλfλ(k)(X).\sup_{\lambda}f_{\lambda}(X)\leq\sum_{k=1}^{K}\sup_{\lambda}f_{\lambda}^{(k)}(X).
  • For a random variable XX and non-negative functions fλ(X)=k=1Kfλ(k)(X)f_{\lambda}(X)=\prod_{k=1}^{K}f_{\lambda}^{(k)}(X) parametrized by λ\lambda, with probability 11,

    supλfλ(X)k=1Ksupλfλ(k)(X).\sup_{\lambda}f_{\lambda}(X)\leq\prod_{k=1}^{K}\sup_{\lambda}f_{\lambda}^{(k)}(X).
  • For a random variable XX and a function fλ(X)f_{\lambda}(X) parametrized by λ\lambda,

    supλ𝔼[fλ(X)]𝔼[supλfλ(X)].\sup_{\lambda}\mathbb{E}[f_{\lambda}(X)]\leq\mathbb{E}[\sup_{\lambda}f_{\lambda}(X)].

The proof is straightforward and is therefore omitted.

Lemma H.6.

For sequences of bounded positive random functions {fn(1)(λ)},{fn(2)(λ)},{gn(1)(λ)},{gn(2)(λ)}\{f_{n}^{(1)}(\lambda)\},\{f_{n}^{(2)}(\lambda)\},\{g_{n}^{(1)}(\lambda)\},\{g_{n}^{(2)}(\lambda)\} such that supλ|fn(1)(λ)fn(2)(λ)|𝑃0\sup_{\lambda}|f_{n}^{(1)}(\lambda)-f_{n}^{(2)}(\lambda)|\overset{P}{\rightarrow}0 and supλ|gn(1)(λ)gn(2)(λ)|𝑃0\sup_{\lambda}|g_{n}^{(1)}(\lambda)-g_{n}^{(2)}(\lambda)|\overset{P}{\rightarrow}0, we have

supλ|fn(1)(λ)gn(1)(λ)fn(2)(λ)gn(2)(λ)|𝑃0\sup_{\lambda}|f_{n}^{(1)}(\lambda)g_{n}^{(1)}(\lambda)-f_{n}^{(2)}(\lambda)g_{n}^{(2)}(\lambda)|\overset{P}{\rightarrow}0
Proof.

The proof follows directly from Lemma H.5 together with the fact that

|fn(1)gn(1)fn(2)gn(2)|\displaystyle|f_{n}^{(1)}g_{n}^{(1)}-f_{n}^{(2)}g_{n}^{(2)}|
=\displaystyle= |fn(1)gn(1)fn(1)gn(2)+fn(1)gn(2)fn(2)gn(2)|\displaystyle|f_{n}^{(1)}g_{n}^{(1)}-f_{n}^{(1)}g_{n}^{(2)}+f_{n}^{(1)}g_{n}^{(2)}-f_{n}^{(2)}g_{n}^{(2)}|
\displaystyle\leq |fn(1)||gn(1)gn(2)|+|gn(2)||fn(1)fn(2)|\displaystyle|f_{n}^{(1)}|\cdot|g_{n}^{(1)}-g_{n}^{(2)}|+|g_{n}^{(2)}|\cdot|f_{n}^{(1)}-f_{n}^{(2)}|

Lemma H.7.

For sequences of positive random functions {fn(λ)}\{f_{n}(\lambda)\},{gn(λ)}\{g_{n}(\lambda)\} such that \inf_{\lambda}f_{n}(\lambda) and \inf_{\lambda}g_{n}(\lambda) are bounded away from zero with probability 1o(1)1-o(1), and supλ|fn(λ)gn(λ)|𝑃0\sup_{\lambda}|f_{n}(\lambda)-g_{n}(\lambda)|\overset{P}{\rightarrow}0, we have

supλ|fn(λ)gn(λ)|𝑃0.\sup_{\lambda}|\sqrt{f_{n}(\lambda)}-\sqrt{g_{n}(\lambda)}|\overset{P}{\rightarrow}0.
Proof.
supλ|fn(λ)gn(λ)|\displaystyle\sup_{\lambda}|\sqrt{f_{n}(\lambda)}-\sqrt{g_{n}(\lambda)}|
=\displaystyle= supλ|fn(λ)gn(λ)fn(λ)+gn(λ)|\displaystyle\sup_{\lambda}\left|\frac{f_{n}(\lambda)-g_{n}(\lambda)}{\sqrt{f_{n}(\lambda)}+\sqrt{g_{n}(\lambda)}}\right|
\displaystyle\leq supλ|fn(λ)gn(λ)|supλ|1fn(λ)+gn(λ)|\displaystyle\sup_{\lambda}|f_{n}(\lambda)-g_{n}(\lambda)|\cdot\sup_{\lambda}\left|\frac{1}{\sqrt{f_{n}(\lambda)}+\sqrt{g_{n}(\lambda)}}\right|
𝑃\displaystyle\overset{P}{\rightarrow} 0.\displaystyle\ 0.

H.6 Properties on Lipschitz functions

Lemma H.8.

Suppose f(x)f(x) is Lipschitz with constant CfC_{f} and takes values in [fmin,fmax][f_{\min},f_{\max}] with f_{\min}>0, and g(x)g(x) is Lipschitz with constant CgC_{g} and takes values in [gmin,gmax][g_{\min},g_{\max}] with g_{\min}>0. Then

  • h(x)=f(x)g(x)h(x)=f(x)g(x) takes values in [fmingmin,fmaxgmax][f_{\min}g_{\min},f_{\max}g_{\max}], and is Lipschitz with constant Cfgmax+CgfmaxC_{f}g_{\max}+C_{g}f_{\max}.

  • 1f(x)\frac{1}{f(x)} takes values in [1fmax,1fmin][\frac{1}{f_{\max}},\frac{1}{f_{\min}}], and is Lipschitz in xx with constant Cf/fmin2C_{f}/f_{\min}^{2}.

Proof.

The boundedness part is trivial. For Lipschitzness part, we have

h(x1)h(x2)\displaystyle\|h(x_{1})-h(x_{2})\| =f(x1)g(x1)f(x2)g(x2)\displaystyle=\|f(x_{1})g(x_{1})-f(x_{2})g(x_{2})\|
=(f(x1)g(x1)f(x2)g(x1))+(f(x2)g(x1)f(x2)g(x2))\displaystyle=\|(f(x_{1})g(x_{1})-f(x_{2})g(x_{1}))+(f(x_{2})g(x_{1})-f(x_{2})g(x_{2}))\|
g(x1)f(x1)f(x2)+f(x2)g(x1)g(x2)\displaystyle\leq\|g(x_{1})\|\|f(x_{1})-f(x_{2})\|+\|f(x_{2})\|\|g(x_{1})-g(x_{2})\|
gmaxCfx1x2+fmaxCgx1x2,\displaystyle\leq g_{\max}C_{f}\|x_{1}-x_{2}\|+f_{\max}C_{g}\|x_{1}-x_{2}\|,

and further,

1f(x1)1f(x2)\displaystyle\|\frac{1}{f(x_{1})}-\frac{1}{f(x_{2})}\| =f(x2)f(x1)f(x1)f(x2)\displaystyle=\|\frac{f(x_{2})-f(x_{1})}{f(x_{1})f(x_{2})}\|
f(x2)f(x1)fmin2\displaystyle\leq\frac{\|f(x_{2})-f(x_{1})\|}{f_{\min}^{2}}
Cfx1x2fmin2.\displaystyle\leq\frac{C_{f}\|x_{1}-x_{2}\|}{f_{\min}^{2}}.

Lemma H.9.

Suppose f(x,y)f(x,y) is MM-Lipschitz in yy for all xx. Then \sup_{x}f(x,y) is also MM-Lipschitz in yy.

Proof.

Consider any y1,y2y_{1},y_{2}. Let x1=argmaxxf(x,y1),x2=argmaxxf(x,y2)x_{1}=\operatorname*{arg\,max}_{x}f(x,y_{1}),x_{2}=\operatorname*{arg\,max}_{x}f(x,y_{2}), then

f(x1,y1)f(x2,y2)=f(x1,y1)f(x1,y2)+f(x1,y2)f(x2,y2)|f(x1,y1)f(x1,y2)|M|y1y2|,\displaystyle f(x_{1},y_{1})-f(x_{2},y_{2})=f(x_{1},y_{1})-f(x_{1},y_{2})+f(x_{1},y_{2})-f(x_{2},y_{2})\leq|f(x_{1},y_{1})-f(x_{1},y_{2})|\leq M|y_{1}-y_{2}|,
f(x2,y2)f(x1,y1)=f(x2,y2)f(x2,y1)+f(x2,y1)f(x1,y1)|f(x2,y2)f(x2,y1)|M|y1y2|,\displaystyle f(x_{2},y_{2})-f(x_{1},y_{1})=f(x_{2},y_{2})-f(x_{2},y_{1})+f(x_{2},y_{1})-f(x_{1},y_{1})\leq|f(x_{2},y_{2})-f(x_{2},y_{1})|\leq M|y_{1}-y_{2}|,

where we have used the fact that f(x1,y2)f(x2,y2)0f(x_{1},y_{2})-f(x_{2},y_{2})\leq 0 and f(x2,y1)f(x1,y1)0f(x_{2},y_{1})-f(x_{1},y_{1})\leq 0 by optimality of x1x_{1} and x2x_{2}. Thus,

|supxf(x,y1)supxf(x,y2)|=|f(x1,y1)f(x2,y2)|M|y1y2|,|\sup_{x}f(x,y_{1})-\sup_{x}f(x,y_{2})|=|f(x_{1},y_{1})-f(x_{2},y_{2})|\leq M|y_{1}-y_{2}|,

which shows supxf(x,y)\sup_{x}f(x,y) is also MM-Lipschitz in yy. ∎

Appendix I Universality Proof

Here we extend our results from prior sections to the case of the general covariate and noise distributions mentioned under Assumptions 1 and 2.

To differentiate from the notations above, we use 𝑮\bm{G} to denote a design matrix with i.i.d. 𝒩(0,1)\mathcal{N}(0,1) entries, and ϵG\bm{\epsilon}^{G} to denote the noise vector with i.i.d. 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) entries. Unless otherwise noted, for all other quantities, we add a superscript GG to denote corresponding quantities that depend on 𝑮\bm{G} and/or ϵG\bm{\epsilon}^{G}. Also, our original notations from the preceding sections now refer to quantities with the general sub-Gaussian tailed design matrix and noise distribution specified by Assumptions 1 and 2.

Several results from prior sections assumed Gaussianity of the design or errors. A complete list of these results is as follows: Lemmas D.2-D.13, Corollary D.2, Lemmas E.1-E.4 (along with several supporting lemmas that enter the proof of Lemma E.1), Lemma F.2, Lemmas G.1-G.5, Lemma H.4, and Corollary H.1. However, for many of these results, the proof uses Gaussianity only through the use of certain other lemmas/results in the aforementioned list. Thus, to prove the validity of such results beyond Gaussianity, it suffices to prove the validity of the other lemmas used in their proofs. We thus identify a smaller set of results such that showing these hold under the non-Gaussian setting of Assumptions 1 and 2 suffices for showing that all of the aforementioned results satisfy the same universality property. This smaller set is as follows: Lemmas D.2, D.8, D.11, D.12, Lemma E.1 (along with some supporting results in its proof, as outlined in Section I.2 below), Lemma E.2, Lemmas G.2-G.5, Lemma H.4, and Corollary H.1. In this section, we show that these results continue to hold under Assumptions 1 and 2 (and Assumption 3 for Lemmas G.2-G.5).

I.1 Replacing Lemma H.4 and Corollary H.1

This follows from prior results in the literature that we quote below. Recall that supnmaxijXijψ2<\sup_{n}\max_{ij}\|X_{ij}\|_{\psi_{2}}<\infty, where ψ2\|\cdot\|_{\psi_{2}} is the Orlicz-2 norm, also known as the sub-Gaussian norm (see (wellner2013weak, Section 2.1) for the precise definition).

Lemma I.1.

(vershynin2018high, Theorem 4.4.5) Let σmax(𝐗)\sigma_{\max}(\bm{X}) be the largest singular value of the matrix 𝐗n×p\bm{X}\in\mathbb{R}^{n\times p} with independent, mean 0, variance 11, uniformly sub-Gaussian entries, then

(σmax(𝑿)>CK(n+p+t))2et2,\mathbb{P}(\sigma_{\max}(\bm{X})>CK(\sqrt{n}+\sqrt{p}+t))\leq 2e^{-t^{2}},

where CC is an absolute constant and K=maxijXijψ2K=\max_{ij}\|X_{ij}\|_{\psi_{2}}.

Corollary I.1.

Under Assumption 1,

(1nσmax(𝑿)>CK(2+δ))2en=o(1),\mathbb{P}\left(\frac{1}{\sqrt{n}}\sigma_{\max}(\bm{X})>CK(2+\sqrt{\delta})\right)\leq 2e^{-n}=o(1),

where C,KC,K are as in Lemma I.1.
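This is the bound that lets the operator-norm control from the Gaussian proofs carry over to designs such as standardized genotype matrices, whose entries are bounded and hence uniformly sub-Gaussian. The Python sketch below is an illustration only: the Binomial(2, maf) allele-count model and its standardization are our own simplified assumptions, not the paper's data-processing pipeline.

```python
# Illustration of Corollary I.1 for a genotype-like sub-Gaussian design (sketch only).
import numpy as np

rng = np.random.default_rng(3)
n, p, maf = 500, 250, 0.3
delta = p / n

G = rng.binomial(2, maf, size=(n, p)).astype(float)
G = (G - 2 * maf) / np.sqrt(2 * maf * (1 - maf))     # entries have mean 0, variance 1

print("sigma_max(G) / sqrt(n)            :", np.linalg.norm(G, ord=2) / np.sqrt(n))
print("Gaussian reference 1 + sqrt(delta):", 1 + np.sqrt(delta))
```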

I.2 Replacing Lemmas E.1 and E.2

In this section, we will establish that the following more general version of Lemma E.1 holds. For Lemma E.2 a similar extension can be proved by using similar proof techniques as in this section, thus we omit writing the details for the latter.

Lemma I.2 (Replacing Lemma E.1).

Under Assumptions 1 and 2, for k=L,Rk=L,R and any 11-Lipschitz function ϕβ:p\phi_{\beta}:\mathbb{R}^{p}\rightarrow\mathbb{R},

supλk[λmin,λmax]|ϕβ(𝜷^k)𝔼[ϕβ(𝜷^kf)]|𝑃0.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{k})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{k}^{f})]|\overset{P}{\rightarrow}0.\\

The proofs of the Lasso and Ridge cases are similar. We present only the Ridge case so that readers can easily compare with Section E.3. We first introduce some useful supporting lemmas. The following lemma is a consequence of recent results in han2022universality.

Lemma I.3.

Suppose we are under Assumptions 1 and 2, and further assume 𝐆n×p\bm{G}\in\mathbb{R}^{n\times p} has i.i.d. 𝒩(0,1)\mathcal{N}(0,1) entries. For any set Dp[Lp/n,Lp/n]pD_{p}\subset[-L_{p}/\sqrt{n},L_{p}/\sqrt{n}]^{p} with Lp=K(logp+2n𝛃)L_{p}=K(\sqrt{\log p}+2\sqrt{n}\|\bm{\beta}\|_{\infty}) where KK is a constant that only depends on λmax\lambda_{\max}, any t,ϵ>0t\in\mathbb{R},\epsilon>0, we have

(min𝒘Dp𝒞λ(𝒘)>t+3ϵ)(min𝒘Dp𝒞λ(𝒘;𝑮,ϵ)>t+ϵ)+o(1),\displaystyle\mathbb{P}\left(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w})>t+3\epsilon\right)\leq\mathbb{P}\left(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w};\bm{G},\bm{\epsilon})>t+\epsilon\right)+o(1),
(min𝒘Dp𝒞λ(𝒘)<t3ϵ)(min𝒘Dp𝒞λ(𝒘;𝑮,ϵ)<tϵ)+o(1),\displaystyle\mathbb{P}\left(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w})<t-3\epsilon\right)\leq\mathbb{P}\left(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w};\bm{G},\bm{\epsilon})<t-\epsilon\right)+o(1),

where 𝒞λ(𝐰)\mathcal{C}_{\lambda}(\bm{w}) is defined as in (37) and 𝒞λ(𝐰;𝐆,ϵ)\mathcal{C}_{\lambda}(\bm{w};\bm{G},\bm{\epsilon}) represents 𝒞λ(𝐰)\mathcal{C}_{\lambda}(\bm{w}) where we replace 𝐗\bm{X} by 𝐆\bm{G} but keep the (possibly non-Gaussian) ϵ\bm{\epsilon}.

Proof.

The first inequality is a direct consequence of (han2022universality, Theorem 2.3), obtained by plugging in the formula for 𝒞λ(𝒘)\mathcal{C}_{\lambda}(\bm{w}) from (37). The second inequality can be proved similarly by modifying the last lines of (han2022universality, Section 4.2). We skip the details here for conciseness. Notice that the n\sqrt{n} in DpD_{p} comes from the difference between their scaling and ours. ∎

Further, (han2022universality, Proposition 3.3(2)), with the appropriate n\sqrt{n}-rescaling in our case yields the following.

Lemma I.4.

Under Assumptions 1 and 2, with probability 1o(1)1-o(1),

n𝜷^RK(logp+2n𝜷),\sqrt{n}\|\hat{\bm{\beta}}_{\textrm{R}}\|_{\infty}\leq K\left(\sqrt{\log p}+2\sqrt{n}\|\bm{\beta}\|_{\infty}\right),

where KK is a constant that only depends on λmax\lambda_{\max}.

We note that Lemma I.3 also holds for the Lasso case (with a different 𝒞λ(𝒘)\mathcal{C}_{\lambda}(\bm{w})), and the Lasso counterpart of Lemma I.4 is given by (han2022universality, Proposition 3.7(2)).

Now we prove Lemma I.2. We follow the same structure as Section E.3.

I.2.1 Converting the optimization problem

This subsection requires no change.

I.2.2 Connecting AO with SO

Lemma I.5 (Replacing Proposition E.6).

Recall the definition of 𝐰(λ)\bm{w}(\lambda) from (40). There exists a constant γ>0\gamma>0 such that for all ϵ(0,1]\epsilon\in(0,1],

supλ[λmin,λmax](𝒘p,𝒘𝒘(λ)22>ϵ and 𝒘Lp/nandLλ(𝒘)min𝒗pLλ(𝒗)+γϵ)=o(ϵ).\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\exists\bm{w}\in\mathbb{R}^{p},\|\bm{w}-\bm{w}(\lambda)\|_{2}^{2}>\epsilon\text{ and }\|\bm{w}\|_{\infty}\leq L_{p}/\sqrt{n}\ \ \text{and}\ \ L_{\lambda}(\bm{w})\leq\min_{\bm{v}\in\mathbb{R}^{p}}L_{\lambda}(\bm{v})+\gamma\epsilon\right)=o(\epsilon).
Proof.

The statement is a direct corollary of Proposition E.6, since adding another condition to the event does not increase the probability. ∎

I.2.3 Connecting PO with AO

Lemma I.6 (Replacing Corollary E.1).

Consider the same setting as in Lemma I.3.

  • Further assuming DpD_{p} is a closed set, we have for all t,ϵ>0t\in\mathbb{R},\epsilon>0,

    (min𝒘Dp𝒞λ(𝒘)t)2(min𝒘DpLλ(𝒘)t+2ϵ)+o(1),\mathbb{P}(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w})\leq t)\leq 2\mathbb{P}(\min_{\bm{w}\in D_{p}}L_{\lambda}(\bm{w})\leq t+2\epsilon)+o(1),

    where LλL_{\lambda} is defined as in (38).

  • Further assuming DpD_{p} is a closed convex set, we have for all t,ϵ>0t\in\mathbb{R},\epsilon>0,

    (min𝒘Dp𝒞λ(𝒘)t)2(min𝒘DpLλ(𝒘)t2ϵ)+o(1).\mathbb{P}(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w})\geq t)\leq 2\mathbb{P}(\min_{\bm{w}\in D_{p}}L_{\lambda}(\bm{w})\geq t-2\epsilon)+o(1).
Proof.

We only prove the first statement, as the second follows similarly. As a direct consequence of the second inequality in Lemma I.3 (applied with t\leftarrow t+3\epsilon), we have

(min𝒘Dp𝒞λ(𝒘)t)(min𝒘Dp𝒞λ(𝒘;𝑮,ϵ)t+2ϵ)+o(1).\mathbb{P}(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w})\leq t)\leq\mathbb{P}(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w};\bm{G},\bm{\epsilon})\leq t+2\epsilon)+o(1).

Now consider the high-probability event (42). The fact that 𝒛=ϵ/σ\bm{z}=\bm{\epsilon}/\sigma is now sub-Gaussian instead of Gaussian does not change the fact that we can find another KK that ensures this event occurs with high probability. Therefore, the proof of Corollary E.1 goes through, and we obtain the following inequality: for any tt\in\mathbb{R},

(min𝒘Dp𝒞λ(𝒘;𝑮,ϵ)t)2(min𝒘DpLλ(𝒘)t).\mathbb{P}(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w};\bm{G},\bm{\epsilon})\leq t)\leq 2\mathbb{P}(\min_{\bm{w}\in D_{p}}L_{\lambda}(\bm{w})\leq t).

Combining the two displays above finishes the proof. ∎

Now for a fixed λ\lambda, we instead consider the set D=DpDλϵD=D_{p}\bigcap D_{\lambda}^{\epsilon}, where Dp[Lp/n,Lp/n]pD_{p}\subset[-L_{p}/\sqrt{n},L_{p}/\sqrt{n}]^{p} is defined in Lemma I.3 and Dλϵ={𝒘p|𝒘𝒘(λ)22>ϵ}D_{\lambda}^{\epsilon}=\{\bm{w}\in\mathbb{R}^{p}|\|\bm{w}-\bm{w}(\lambda)\|_{2}^{2}>\epsilon\} is defined in Section E.3.3.

Lemma I.7 (Replacing Lemma E.7).

For all ϵ(0,1]\epsilon\in(0,1],

(min𝒘D𝒞λ(𝒘)min𝒘p𝒞λ(𝒘)+ϵ)2(min𝒘DLλ(𝒘)min𝒘pLλ(𝒘)+5ϵ)+o(1).\mathbb{P}\left(\min_{\bm{w}\in D}\mathcal{C}_{\lambda}(\bm{w})\leq\min_{\bm{w}\in\mathbb{R}^{p}}\mathcal{C}_{\lambda}(\bm{w})+\epsilon\right)\leq 2\mathbb{P}\left(\min_{\bm{w}\in D}L_{\lambda}(\bm{w})\leq\min_{\bm{w}\in\mathbb{R}^{p}}L_{\lambda}(\bm{w})+5\epsilon\right)+o(1).
Proof.

Recall from our remark in the proof of Lemma E.7 that (miolane2021distribution, Section C.1.1) established this result for the case of the Lasso and Gaussian designs/error distributions. In their proof, their Corollaries 5.1 and B.1 played a crucial role. Now note that (miolane2021distribution, Corollary B.1) requires concentration of ϵ\bm{\epsilon}, which also follows if ϵ\bm{\epsilon} is sub-Gaussian instead of Gaussian. Thus we are left to generalize their Corollary 5.1 (the analogue of this is Corollary E.1 in our paper) to the case of the ridge under our non-Gaussian design/error setting of Assumptions 1 and 2. We achieve this via Lemma I.6. With these modifications in place, a proof analogous to (miolane2021distribution, Section C.1.1) still works here. Note that the extra 2ϵ2\epsilon on the RHS comes from the extra 2ϵ2\epsilon in Lemma I.6. ∎

Lemma I.8 (Replacing Lemma E.8).

There exists a constant γ>0\gamma>0 such that for all ϵ(0,1]\epsilon\in(0,1],

\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\exists\bm{b}\in\mathbb{R}^{p},\ \|\bm{b}-\hat{\bm{\beta}}^{f}(\lambda)\|_{2}^{2}>\epsilon\text{ and }\|\bm{b}-\bm{\beta}\|_{\infty}\leq L_{p}/\sqrt{n}\ \ \text{and}\ \ \mathcal{L}_{\lambda}(\bm{b})\leq\min\mathcal{L}_{\lambda}+\gamma\epsilon\right)=o(\epsilon).
Proof.

This is a direct corollary of Lemma I.5 and Lemma I.7, with 𝒘(λ)=𝜷^f(λ)𝜷\bm{w}(\lambda)=\hat{\bm{\beta}}^{f}(\lambda)-\bm{\beta} from (40). ∎

I.2.4 Uniform Control over λ\lambda

Lemma I.9 (Re-stating Lemma E.9).

There exists constant KK such that

(λ,λ[λmin,λmax],λ(𝜷^(λ))λ(𝜷^(λ))+K|λλ|)=1o(1).\mathbb{P}\left(\forall\lambda,\lambda^{\prime}\in[\lambda_{\min},\lambda_{\max}],\mathcal{L}_{\lambda^{\prime}}(\hat{\bm{\beta}}(\lambda))\leq\mathcal{L}_{\lambda^{\prime}}(\hat{\bm{\beta}}(\lambda^{\prime}))+K|\lambda-\lambda^{\prime}|\right)=1-o(1).
Proof.

The only difference in the proofs of Lemma I.9 and E.9 lies in the fact that 𝒛\bm{z} is now independent sub-Gaussian instead of i.i.d. Gaussian. However, since the entries of 𝒛\bm{z} still have mean 0 and variance 11, we still have 𝒛22n\|\bm{z}\|_{2}\leq 2\sqrt{n} with high probability. The rest of the proof follows directly. ∎

Note that Lemma D.12 is equivalent to Lemma E.9 so universality follows directly from the above. Lemma E.10 remains as is since it does not involve 𝑿\bm{X} or ϵ\bm{\epsilon}. We are now ready to prove Lemma I.2.

Proof of Lemma I.2.

Let γ>0\gamma>0 be as given by Lemma I.8, let K>0K>0 be as given by Lemma I.9, and let M>0M>0 be as given by Lemma E.10.

Fix ϵ(0,1]\epsilon\in(0,1] and define ϵ=min(γϵK,ϵM)\epsilon^{\prime}=\min\left(\frac{\gamma\epsilon}{K},\frac{\sqrt{\epsilon}}{M}\right). Let k=(λmaxλmin)/ϵk=\lceil(\lambda_{\max}-\lambda_{\min})/\epsilon^{\prime}\rceil. Further define λi=λmin+iϵ\lambda_{i}=\lambda_{\min}+i\epsilon^{\prime} for i=0,,ki=0,...,k. By Lemma I.8, the event

{i{1,,k},𝒃p,λi(𝒃)minλi+γϵ𝒃𝜷^f(λi)22ϵ or 𝒃𝜷>Lp/n}\left\{\forall i\in\{1,...,k\},\forall\bm{b}\in\mathbb{R}^{p},\mathcal{L}_{\lambda_{i}}(\bm{b})\leq\min\mathcal{L}_{\lambda_{i}}+\gamma\epsilon\Rightarrow\|\bm{b}-\hat{\bm{\beta}}^{f}(\lambda_{i})\|_{2}^{2}\leq\epsilon\text{ or }\|\bm{b}-\bm{\beta}\|_{\infty}>L_{p}/\sqrt{n}\right\} (56)

has probability at least 1ko(ϵ)=1o(1)1-ko(\epsilon)=1-o(1). Therefore, on the intersection of the event (56) and the event in Lemma I.9, which has probability 1o(1)1-o(1), we have for all λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}],

λi(𝜷^(λ))minλi+K|λλi|minλi+γϵ,\mathcal{L}_{\lambda_{i}}(\hat{\bm{\beta}}(\lambda))\leq\min\mathcal{L}_{\lambda_{i}}+K|\lambda-\lambda_{i}|\leq\min\mathcal{L}_{\lambda_{i}}+\gamma\epsilon,

where 1ik1\leq i\leq k is such that λ[λi1,λi]\lambda\in[\lambda_{i-1},\lambda_{i}]. This implies (since we are on the event (56)) that either 𝜷^(λ)𝜷^f(λi)22ϵ\|\hat{\bm{\beta}}(\lambda)-\hat{\bm{\beta}}^{f}(\lambda_{i})\|_{2}^{2}\leq\epsilon or 𝜷^(λ)𝜷>Lp/n\|\hat{\bm{\beta}}(\lambda)-\bm{\beta}\|_{\infty}>L_{p}/\sqrt{n}.

However, by Lemma I.4, we know that

𝜷^(λ)𝜷\displaystyle\|\hat{\bm{\beta}}(\lambda)-\bm{\beta}\|_{\infty}
\displaystyle\leq 𝜷^(λ)+𝜷\displaystyle\|\hat{\bm{\beta}}(\lambda)\|_{\infty}+\|\bm{\beta}\|_{\infty}
\displaystyle\leq K((logp)/n+2𝜷)+𝜷\displaystyle K\left(\sqrt{(\log p)/n}+2\|\bm{\beta}\|_{\infty}\right)+\|\bm{\beta}\|_{\infty}
\displaystyle\leq (K+1/2)((logp)/n+2𝜷)\displaystyle(K+1/2)\left(\sqrt{(\log p)/n}+2\|\bm{\beta}\|_{\infty}\right)
=\displaystyle= Lp/n\displaystyle L_{p}/\sqrt{n}

with probability 1o(1)1-o(1). Since the probability is over the randomness in 𝑿\bm{X} and ϵ\bm{\epsilon}, and KK depends only on λmax\lambda_{\max}, the argument holds for all λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}].

Therefore, with probability o(1)o(1) there exists λ\lambda such that 𝜷^(λ)𝜷>Lp/n\|\hat{\bm{\beta}}(\lambda)-\bm{\beta}\|_{\infty}>L_{p}/\sqrt{n}. Hence, still with probability 1o(1)1-o(1), we have 𝜷^(λ)𝜷^f(λi)22ϵ\|\hat{\bm{\beta}}(\lambda)-\hat{\bm{\beta}}^{f}(\lambda_{i})\|_{2}^{2}\leq\epsilon for all λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}]. The rest of the proof remains the same as at the end of Section E.3.4. ∎
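The discretization device used in this proof (control the objective on a finite grid of tuning parameters and interpolate via Lipschitzness in λ) is easy to check numerically. The sketch below is an illustration only: the toy ridge objective, its dimensions, and the grid spacing are our own choices, not quantities from the paper.

```python
# Numerical sketch of the grid argument from Section I.2.4: if lambda -> L_lambda(b) is
# K-Lipschitz for a fixed b, then controlling L on a grid of spacing eps' controls it at
# every lambda up to an additive K * eps'.
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 100
X = rng.standard_normal((n, p))
y = X @ (rng.standard_normal(p) / np.sqrt(p)) + rng.standard_normal(n)
b = rng.standard_normal(p) / np.sqrt(p)              # an arbitrary fixed point b

def loss(bb, lam):
    # toy ridge objective; for fixed bb it is (||bb||^2 / 2)-Lipschitz in lambda
    return np.sum((y - X @ bb) ** 2) / (2 * n) + lam * np.sum(bb ** 2) / 2

lam_min, lam_max, eps_prime = 0.1, 2.0, 0.05
grid = np.arange(lam_min, lam_max + eps_prime, eps_prime)
K = np.sum(b ** 2) / 2

fine = np.linspace(lam_min, lam_max, 2000)
gaps = [abs(loss(b, lam) - loss(b, grid[np.argmin(np.abs(grid - lam))])) for lam in fine]
print("max gap over a fine lambda grid:", max(gaps))
print("K * eps' bound                 :", K * eps_prime)
```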

I.3 Replacing Lemma D.2

The ridge case is rather straightforward: (knowles2017anisotropic, Theorem 3.7) only requires mean zero, variance one, and independence, together with uniformly bounded moments of all orders for the entries of 𝑿\bm{X}, all of which are satisfied under Assumptions 1 and 2 (in particular, uniform sub-Gaussianity implies uniformly bounded moments of all orders). As for the Lasso case, we can modify the proof of (miolane2021distribution, Theorem F.1) in the same way as we modified Section E.3 in Section I.2. We omit the details here for conciseness.

I.4 Replacing Lemma D.8

The supporting Lemma E.1 has already been replaced by Lemma I.2. For the ridge case, the Lipschitz argument needs no change. For the Lasso case, the version of Lemma I.2 for the α\alpha-smoothed Lasso extends similarly, and the subsequent Lipschitz argument requires no change. Finally, (31) relies critically on (celentano2020lasso, Lemma B.9). The corresponding proof in (celentano2020lasso, Lemma B.5.4) has been elaborated in our proof of Lemma G.3, for which we articulate the replacement in Section I.6.

I.5 Replacing Lemma D.11

Lemma I.10 (Replacing Lemma D.11).

Recall the definition of 𝐠^L\hat{\bm{g}}_{\textrm{L}} from (26), and recall that Lp=K(logp+2n𝛃)L_{p}=K(\sqrt{\log p}+2\sqrt{n}\|\bm{\beta}\|_{\infty}) where KK is a constant that depends only on λmax\lambda_{\max}. Under Assumptions 1 and 2, for any ϵ>0\epsilon>0, there exists a constant CC such that for any 11-Lipschitz function ϕw:p\phi_{w}:\mathbb{R}^{p}\rightarrow\mathbb{R},

supλL,λR[λmin,λmax]\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]} (𝒘p,|ϕw(𝒘)𝔼[ϕw(𝜷^Rf𝜷)|𝒈Lf=𝒈^L]|ϵ\displaystyle\mathbb{P}\biggl{(}\exists\bm{w}\in\mathbb{R}^{p},\left|\phi_{w}(\bm{w})-\mathbb{E}[\phi_{w}(\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]\right|\geq\epsilon
and 𝒘Lp/nand𝒞λR(𝒘)min𝒞λR+Cϵ2)=o(ϵ2).\displaystyle\text{ and }\|\bm{w}\|_{\infty}\leq L_{p}/\sqrt{n}\ \text{and}\ \mathcal{C}_{\lambda_{\textrm{R}}}(\bm{w})\leq\min\mathcal{C}_{\lambda_{\textrm{R}}}+C\epsilon^{2}\biggr{)}=o(\epsilon^{2}).

We introduce another supporting Lemma.

Lemma I.11.

Consider Assumptions 1 and 2, and further assume 𝐆n×p\bm{G}\in\mathbb{R}^{n\times p} has i.i.d. 𝒩(0,1)\mathcal{N}(0,1) entries. Denote C(𝐰,𝐮)=1n𝐰𝐗𝐮+f(𝐰,𝐮)C(\bm{w},\bm{u})=\frac{1}{n}\bm{w}^{\top}\bm{X}\bm{u}+f(\bm{w},\bm{u}) where ff is convex-concave. For any set Dp[Lp/n,Lp/n]pD_{p}\subset[-L_{p}/\sqrt{n},L_{p}/\sqrt{n}]^{p} with Lp=K(logp+2n𝛃)L_{p}=K(\sqrt{\log p}+2\sqrt{n}\|\bm{\beta}\|_{\infty}) where KK is a constant that only depends on λmax\lambda_{\max}, any t,ϵ>0t\in\mathbb{R},\epsilon>0, we have

(max𝒖nDpmin𝒘DpC(𝒘,𝒖)>t+3ϵ)(max𝒖nDpmin𝒘DpC(𝒘,𝒖;𝑮)>t+ϵ)+o(1),\displaystyle\mathbb{P}\left(\max_{\bm{u}\in\sqrt{n}D_{p}}\min_{\bm{w}\in D_{p}}C(\bm{w},\bm{u})>t+3\epsilon\right)\leq\mathbb{P}\left(\max_{\bm{u}\in\sqrt{n}D_{p}}\min_{\bm{w}\in D_{p}}C(\bm{w},\bm{u};\bm{G})>t+\epsilon\right)+o(1),
(max𝒖nDpmin𝒘DpC(𝒘,𝒖)<t3ϵ)(max𝒖nDpmin𝒘DpC(𝒘,𝒖;𝑮)<tϵ)+o(1),\displaystyle\mathbb{P}\left(\max_{\bm{u}\in\sqrt{n}D_{p}}\min_{\bm{w}\in D_{p}}C(\bm{w},\bm{u})<t-3\epsilon\right)\leq\mathbb{P}\left(\max_{\bm{u}\in\sqrt{n}D_{p}}\min_{\bm{w}\in D_{p}}C(\bm{w},\bm{u};\bm{G})<t-\epsilon\right)+o(1),

where C(𝐰,𝐮;𝐆)C(\bm{w},\bm{u};\bm{G}) represents C(𝐰,𝐮)C(\bm{w},\bm{u}) with 𝐗\bm{X} replaced by 𝐆\bm{G}.

Proof.

This is a consequence of (han2022universality, Corollary 2.6), with necessary modifications similar to what we performed in the proof of Lemma I.3. Again notice the adjusted scaling in our setting. ∎

Now we prove Lemma I.10.

Proof of Lemma I.10.

Most of the proof follows (celentano2021cad, Lemma F.4). In their proof, they first showed their Lemma F.2 (Conditional Gordon's Inequality), which states that \min_{\bm{w}\in E_{w}}\max_{\bm{u}\in E_{u}}c(\bm{w},\bm{u}) (as defined in (37), leaving the dependence on λ\lambda implicit) concentrates around \min_{\bm{w}\in E_{w}}\max_{\bm{u}\in E_{u}}l_{\textrm{R}|\textrm{L}}(\bm{w},\bm{u}), the “conditional Auxiliary Optimization Problem” defined as l(𝒘,𝒖)l(\bm{w},\bm{u}) in (38) conditioned on 𝒈Lf=𝒈^L\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{L}, which requires 𝝃g\bm{\xi}_{g} and 𝝃h\bm{\xi}_{h} to take the following values (see the derivation in (celentano2021cad, Section L)):

𝝃^g\displaystyle\hat{\bm{\xi}}_{g} =𝒈^L/τL\displaystyle=\hat{\bm{g}}_{\textrm{L}}/\tau_{\textrm{L}}
𝝃^h\displaystyle\hat{\bm{\xi}}_{h} =𝑿(𝜷^L𝜷)𝜷^L𝜷2+𝝃^g,𝜷^L𝜷𝒚𝑿𝜷^L2𝜷^L𝜷2(𝒚𝑿𝜷^L).\displaystyle=-\frac{\bm{X}(\hat{\bm{\beta}}_{\textrm{L}}-\bm{\beta})}{\|\hat{\bm{\beta}}_{\textrm{L}}-\bm{\beta}\|_{2}}+\frac{\langle\hat{\bm{\xi}}_{g},\hat{\bm{\beta}}_{\textrm{L}}-\bm{\beta}\rangle}{\|\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}}\|_{2}\|\hat{\bm{\beta}}_{\textrm{L}}-\bm{\beta}\|_{2}}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}}).

Then, they showed the concentration of \min_{\bm{w}\in E_{w}}\max_{\bm{u}\in E_{u}}l_{\textrm{R}|\textrm{L}}(\bm{w},\bm{u}) around its asymptotic limit. Here Eu,EwE_{u},E_{w} denote arbitrary closed convex sets.

We make modifications to their proof as follows:

  1. 1.

    Similar to how we proved Lemma I.6 by invoking Lemma I.3, we can also modify their Lemma F.2 by invoking Lemma I.11, and thus establish that it holds under our general Assumptions 1 and 2.

  2. 2.

    For proving that \min_{\bm{w}\in E_{w}}\max_{\bm{u}\in E_{u}}l_{\textrm{R}|\textrm{L}}(\bm{w},\bm{u}) concentrates around its asymptotic limit, no modification is necessary. This is because lR|L(𝒘,𝒖)l_{\textrm{R}|\textrm{L}}(\bm{w},\bm{u}) no longer involves 𝑿\bm{X}, and because replacing the i.i.d. Gaussianity of ϵ\bm{\epsilon} by independent uniform sub-Gaussianity preserves its concentration properties (with possibly different constants).

Now with Lemma I.10 replacing Lemma D.11, the rest of the proof of Lemma D.6 in Section D.2.3 naturally extends to the universality version under Assumptions 1 and 2 (similar to how we modified Section E.3.4 to Section I.2.4), since the probability that there exists λ\lambda such that 𝜷^(λ)𝜷>Lp/n\|\hat{\bm{\beta}}(\lambda)-\bm{\beta}\|_{\infty}>L_{p}/\sqrt{n} is o(1)o(1).

I.6 Replacing Lemmas G.2-G.5

For Lemma G.2, we critically used the fact that 𝜷^2\|\hat{\bm{\beta}}\|_{2} and 𝜷^(𝑨)2\|\hat{\bm{\beta}}(\bm{A})\|_{2} are bounded with probability 1o(1)1-o(1). The former is guaranteed by applying Lemma H.3 together with Lemma I.2. The latter is stated in Assumption 3(1).

For Lemma G.3, consider the critical event

𝒜:={1nκ(𝑿,n(1ζ/4))κmin}{1n𝑿opC}{1n#{j:|t^j|1Δ/2}1ζ/2}.\mathcal{A}:=\left\{\frac{1}{\sqrt{n}}\kappa_{-}(\bm{X},n(1-\zeta/4))\geq\kappa_{\min}\right\}\cap\left\{\frac{1}{\sqrt{n}}\|\bm{X}\|_{\textnormal{op}}\leq C\right\}\cap\left\{\frac{1}{n}\#\{j:|\hat{t}_{j}|\geq 1-\Delta/2\}\leq 1-\zeta/2\right\}.

The first event holds with high probability by Assumption 1(2), and the second event holds with high probability by Corollary I.1. Finally, (miolane2021distribution, Theorem E.5) can be extended to the universality version in the same way as we extended Section E.3 in Section I.2; as a direct corollary, the third event also holds with high probability.

For Lemma G.4, we need to extend the critical events (49), (50), and (51). Event (49) follows naturally from the proof of the extended Lemma G.2. Event (50) is a corollary of (miolane2021distribution, Theorem F.4), which again can be extended to the universality version in the same way as we extended Section E.3 in Section I.2. Lastly, (51) is stated in Assumption 3(2).

For Lemma G.5, consider the critical events (53), (54), and (55). The proof of (53) remains unchanged (given Corollary I.1). Event (54) is a corollary of (miolane2021distribution, Lemma E.9), which can be similarly extended. The proof of (55) remains unchanged as well.