
HEDE: Heritability estimation in high dimensions
by Ensembling Debiased Estimators

Yanke Song, Xihong Lin, and Pragya Sur  
Harvard University
Yanke Song is a graduate student ([email protected]) in the Department of Statistics at Harvard University. Xihong Lin ([email protected]) is Professor of Biostatistics at Harvard T.H. Chan School of Public Health and Professor of Statistics at Harvard University. Pragya Sur is Assistant Professor ([email protected]) in the Department of Statistics at Harvard University. Lin’s work was supported by the National Institutes of Health grants R35-CA197449, R01-HL163560, U01-HG012064, U19-CA203654, and P30-ES000002. Song and Sur’s work was supported by NSF DMS-2113426 and a Dean’s Competitive Fund for Promising Scholarship. Corresponding authors are Song and Sur.

Abstract

Estimating heritability remains a significant challenge in statistical genetics. Diverse approaches have emerged over the years that are broadly categorized as either random effects or fixed effects heritability methods. In this work, we focus on the latter. We propose HEDE, an ensemble approach to estimate heritability or the signal-to-noise ratio in high-dimensional linear models where the sample size and the dimension grow proportionally. Our method ensembles post-processed versions of the debiased lasso and debiased ridge estimators, and incorporates a data-driven strategy for hyperparameter selection that significantly boosts estimation performance. We establish rigorous consistency guarantees that hold despite adaptive tuning. Extensive simulations demonstrate our method's superiority over existing state-of-the-art methods across various signal structures and genetic architectures, ranging from sparse to relatively dense and from evenly to unevenly distributed signals. Furthermore, we discuss the advantages of fixed effects heritability estimation compared to random effects estimation. Our theoretical guarantees hold for realistic genotype distributions observed in genetic studies, where genotypes typically take on discrete values and are often well-modeled by sub-Gaussian distributed random variables. We establish our theoretical results by deriving uniform bounds, built upon the convex Gaussian min-max theorem, and leveraging universality results. Finally, we showcase the efficacy of our approach in estimating height and BMI heritability using the UK Biobank.



Keywords: Signal-to-noise ratio estimation, Debiased Lasso estimator, Debiased ridge estimator, Convex Gaussian min-max theorem, Ensemble estimation, Proportional asymptotics, Universality.

1 Introduction

Accurate heritability estimation presents a significant challenge in statistical genetics. Distinguishing the contributions of genetic versus environmental factors is crucial for understanding how our genetic makeup influences the development of complex diseases. Importantly, genetic research over the past decade has highlighted a significant discrepancy between heritability estimates from twin studies and those from Genome-Wide Association Studies (GWAS) (Manolio et al., 2009; Yang et al., 2015). This discrepancy, known as the “missing heritability problem”, underscores the importance of developing reliable heritability estimation methods to understand the extent of missing heritability. In this paper, we propose an ensemble method for heritability estimation, using high-dimensional Single Nucleotide Polymorphisms (SNPs) commonly collected in GWAS, that exhibits superior performance across a wide variety of settings and significantly outperforms existing methods in practical scenarios of interest. We focus on the narrow-sense heritability (yang2011gcta), which refers to the proportion of phenotypic variance explained by additive genotypic effects. For conciseness, we henceforth refer to this simply as the heritability.

GWAS data typically contains more SNPs than individuals. This presents a challenge when estimating heritability. Addressing this challenge necessitates modeling assumptions. Previous work in this area can be broadly categorized into two classes based on these assumptions: random effects models and fixed effects models.

Random effects models treat the underlying regression coefficients as random variables and have been widely used for heritability estimation. Arguably, the most popular approach in this vein is GCTA, introduced by yang2011gcta. GCTA employs the classical restricted maximum likelihood method (REML) (also called genomic REML/GREML in their work) within a linear mixed effects model. It assumes that the regression coefficients are i.i.d. draws from a Gaussian distribution with zero mean and a constant variance. Since the genetic effect sizes/regression coefficients might depend on the design matrix (the genotypes), subsequent work provides various improvements that relax this assumption, including GREML-MS (lee2013estimation), GREML-LDMS (yang2015genetic), and BOLT-LMM (loh2015efficient), among others. Specifically, GREML-MS allows the signal variance to depend on minor allele frequencies (MAFs), while GREML-LDMS allows it to depend on linkage disequilibrium (LD) levels/quantiles. Prior statistical literature has studied the theoretical properties of several random effects methods (jiang2016high; bonnet2015heritability; dicker2016maximum; hu2022misspecification).

Despite this rich literature, a major limitation of random effects based heritability estimation methods is that their consistency often relies on correct parametric assumptions on the regression coefficients. Specifically, these methods assume that the regression coefficients are either i.i.d., or depend on the design matrix in specific ways. When these assumptions are violated, random effects methods can produce significantly biased heritability estimates. The issue is magnified when SNPs are in linkage disequilibrium (LD) or when there is additional heterogeneity across LD blocks not captured by the random effects assumptions. We elaborate on this issue further in Figure 1, where we demonstrate that GREML, GREML-MS and GREML-LDMS all incur significant biases when their respective random effects assumptions are violated.

These challenges associated with random effects methods have spurred a distinct strand of methods that posit the regression coefficients to be fixed while treating the design matrix as random. Referred to as fixed effects methods, these approaches often exhibit greater robustness to diverse realizations of the underlying signal without the need for parametric assumptions on the regression coefficients (Figure 1). Consequently, our focus lies on estimating the heritability within this fixed effects model framework.

Several fixed effects methods have emerged for heritability estimation. However, a substantial body of this work requires the underlying signal to be sparse (fan2012variance; sun2012scaled; tony2020semisupervised; guo2019optimal), and these methods suffer a loss of efficiency when the sparsity assumptions are violated (dicker2014variance; janson2017eigenprism; bayati2013estimating); see Sections 5.2, 3.3, and 2, respectively. The methods proposed by several authors (chen2022statistical; dicker2014variance; janson2017eigenprism) operate under less stringent assumptions on the signal, but they rely on ridge-regression-analogous ideas. They are therefore expected to be efficient when the signal is dense, but suffer a loss of efficiency when the signal is sparse. We illustrate these issues further in Section 5.2. In practice, the underlying genetic architecture for a given trait is unknown, i.e., the signal could be sparse or dense. Hence, it is desirable to develop a method that is robust across a broad spectrum of signal regimes, from sparse to dense. In addition, fixed effects methods often rely on assuming the covariate distribution to be Gaussian (bayati2013estimating; dicker2014variance; janson2017eigenprism; verzelen2018adaptive). This assumption fails to capture genetic data, where genotypes are discrete, taking values in \{0,1,2\}.

To address these issues, we propose a robust ensemble heritability estimation method, HEDE (Heritability Estimation via Debiased Ensemble), which ensembles the debiased Lasso and ridge estimators with degrees-of-freedom corrections (bellec2019biasing; javanmard2014hypothesis). Debiased estimators, originally proposed to correct the bias of Lasso/ridge estimators (zhang2014confidence; van2014asymptotically; javanmard2014confidence), have been employed for tasks such as confidence interval construction and hypothesis testing, and enjoy attractive properties such as optimal testing guarantees (javanmard2014confidence). Unlike these methods, HEDE utilizes debiasing for heritability estimation. To develop HEDE, we first demonstrate that a reliable estimator for heritability can be formulated from any linear combination of these debiased estimators, for which we devise a consistent bias correction strategy. Subsequently, we refine the linear combination based on this bias correction strategy. This methodology enables the consistent estimation of heritability using any linear combination of the debiased Lasso and the debiased ridge. HEDE then uses a data-driven approach to select an optimal ensemble from the array of possible linear combinations. Finally, we introduce an adaptive method for determining the underlying regularization parameters. We develop both adaptive strategies, ensemble selection and tuning parameter selection for the Lasso and the ridge, by minimizing a certain mean square error. This results in enhanced statistical performance for heritability estimation.

Underpinning our method is a rigorous statistical theory that ensures consistency under minimal assumptions. Specifically, we operate under sub-Gaussian assumptions, which cover a broad class of designs observed in the GWAS that motivate this paper. We show that HEDE's consistency guarantees remain intact despite the adaptive strategies employed for selecting the tuning parameters and the debiased Lasso and ridge ensemble, all while accommodating sub-Gaussian designs.

We perform extensive simulation studies to evaluate the finite sample performance of HEDE compared with existing methods in a range of settings mimicking GWAS data. We demonstrate that HEDE overcomes the limitations of random effects methods, remaining unbiased across a wide variety of signal structures, specifically signals with different amplitudes across various combinations of MAF and LD levels, as well as LD blocks. Our simulations also illustrate that HEDE consistently achieves the lowest mean square error when compared to popular fixed effects methods across a wide spectrum of sparse and dense signals, as well as low and high heritability values. We compare HEDE's performance with both random and fixed effects methods on the problems of estimating height and BMI heritability using the UK Biobank data.

The rest of this paper is organized as follows. Section 2 introduces the problem setup. Section 3 presents our method HEDE, while Section 4 discusses its theoretical properties. Section 5 presents extensive simulation studies, while Section 6 complements these numerical investigations with a real-data application to the UK Biobank data (sudlow2015uk). Finally, Section 7 concludes with discussions.

2 Problem Setup

In this section, we formally introduce our problem setup. We consider a high-dimensional regime where the covariate dimension grows with the sample size. To elaborate, we consider a sequence of problem instances \{\bm{y}(n),\bm{X}(n),\bm{\beta}(n),\bm{\epsilon}(n)\}_{n\geq 1} that satisfies a linear model

\bm{y}(n)=\bm{X}(n)\bm{\beta}(n)+\bm{\epsilon}(n), \qquad (1)

where \bm{y}(n)=(y_{1},\ldots,y_{n})^{\top}\in\mathbb{R}^{n} represents observations of a real-valued phenotype of interest, such as height, BMI, or blood pressure, for n unrelated individuals from a population. The unknown regression coefficient vector, \bm{\beta}(n)=(\beta_{1},\ldots,\beta_{p(n)})^{\top}\in\mathbb{R}^{p(n)}, captures the effects of p(n) SNPs and is treated as a vector of fixed parameters. The environmental noise, \bm{\epsilon}(n)=(\epsilon_{1},\ldots,\epsilon_{n})^{\top}\in\mathbb{R}^{n}, is independent of \bm{X}(n), with mutually independent entries. Lastly, \bm{X}(n)\in\mathbb{R}^{n\times p(n)} denotes the matrix of observed genotypes with i.i.d. rows \bm{x}_{i\bullet}, which are normalized as follows

X_{ij}=\frac{G_{ij}-2\bar{G}_{j}}{\sqrt{2\bar{G}_{j}(1-\bar{G}_{j})}}, \qquad (2)

where G_{ij}\in\{0,1,2\} is the genotype containing the minor allele count (the number of copies of the less frequent allele) at SNP j for individual i and \bar{G}_{j}=\sum_{i=1}^{n}G_{ij}/(2n) is the minor allele frequency of SNP j in the sample.
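As a concrete illustration of (2), the following minimal sketch standardizes a genotype matrix of minor allele counts; the function and variable names are ours, and the code assumes no missing genotypes.

```python
import numpy as np

def standardize_genotypes(G):
    """Standardize a genotype matrix per equation (2).

    G : (n, p) array of minor allele counts in {0, 1, 2}.
    Each column is centered at 2 * MAF and scaled by sqrt(2 * MAF * (1 - MAF)),
    where MAF is the sample minor allele frequency of that SNP.
    """
    G = np.asarray(G, dtype=float)
    n = G.shape[0]
    maf = G.sum(axis=0) / (2.0 * n)      # sample minor allele frequency per SNP
    return (G - 2.0 * maf) / np.sqrt(2.0 * maf * (1.0 - maf))

# toy usage on a 4 x 2 genotype matrix
X = standardize_genotypes(np.array([[0, 1], [2, 1], [1, 0], [0, 2]]))
```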

We assume the covariates have sub-Gaussian distributions. Since \bm{X} is random and \bm{\beta} is deterministic, we have placed ourselves in the fixed effects framework. Moving forward, we suppress the dependence on n when the meaning is clear from the context. Our parameter of interest is the heritability, the proportion of variance in y_{i} explained by the features in \bm{x}_{i\bullet}, defined as

h^{2}=\lim_{n,p\rightarrow\infty,\;p/n\rightarrow\delta}\frac{\operatorname{Var}(\bm{x}_{i\bullet}^{\top}\bm{\beta})}{\operatorname{Var}(y_{i})}. \qquad (3)

We emphasize the following remark regarding the aforementioned definition. We assume that n and p diverge with p/n\rightarrow\delta>0. This setting, also known as the proportional asymptotics regime, has gained significant recent attention in high-dimensional statistics. A major advantage of this regime is that theoretical results derived under such asymptotics demonstrate attractive finite sample performance (sur2019likelihood; sur2019modern; candes2020phase; zhao2022asymptotic; liang2022precise; jiang2022new). Furthermore, prior research in heritability estimation (jiang2016high; dicker2016maximum; janson2017eigenprism) utilized this framework to characterize theoretical properties of GREML and its variants. In practical situations, we observe a single value of n and p, so we may substitute the corresponding ratio p/n into our theory in place of \delta to execute data analyses. Formal assumptions on the covariate and noise distributions, and the signal, are deferred to Section 4.

3 Methodology of HEDE: Heritability Estimation by Ensembling Debiased Estimators

We begin by describing the overarching theme underlying our method. For simplicity, we discuss the case where \bm{X} has independent columns, deferring the extension to correlated columns to Section 3.4. The denominator of h^{2} in (3) can be consistently estimated using the sample variance of \bm{y}. The numerator of h^{2} in (3) simplifies to \|\bm{\beta}\|_{2}^{2} in the setting of independent columns. Consequently, we seek to develop an accurate estimator for this norm.

To this end, we turn to the debiased Lasso estimator and the debiased ridge estimator. Denote the Lasso and the ridge by \hat{\bm{\beta}}_{\textrm{L}} and \hat{\bm{\beta}}_{\textrm{R}} respectively. These penalized estimators are known to shrink estimated coefficients toward 0, therefore inducing bias. The existing literature provides debiased versions of these estimators, denoted as \hat{\bm{\beta}}_{\textrm{L}}^{d} and \hat{\bm{\beta}}_{\textrm{R}}^{d}. These debiased estimators roughly satisfy the following property: \hat{\bm{\beta}}_{\textrm{L}}^{d}\approx\bm{\beta}+\sigma_{\textrm{L}}Z_{1}, \hat{\bm{\beta}}_{\textrm{R}}^{d}\approx\bm{\beta}+\sigma_{\textrm{R}}Z_{2}, meaning that each entry of \hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d} is approximately centered around the corresponding true signal coefficient with Gaussian fluctuations and standard deviations given by \sigma_{\textrm{L}},\sigma_{\textrm{R}} respectively. By taking norms on both sides, we obtain that \|\hat{\bm{\beta}}_{k}^{d}\|_{2}^{2}-p\cdot\sigma_{k}^{2}\approx\|\bm{\beta}\|_{2}^{2}, where k=\textrm{L}\ \text{or}\ \textrm{R} depending on whether we consider the Lasso or the ridge. Thus, precise estimates of the variances \sigma_{k}^{2} produce accurate estimates of h^{2} starting from each debiased estimator.
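To make the bias correction explicit, treat the Gaussian approximation as exact for a moment: if \hat{\bm{\beta}}_{k}^{d}=\bm{\beta}+\sigma_{k}\bm{Z} with \bm{Z} having independent standard Gaussian entries, then

\mathbb{E}\|\hat{\bm{\beta}}_{k}^{d}\|_{2}^{2}=\|\bm{\beta}\|_{2}^{2}+2\sigma_{k}\,\mathbb{E}\langle\bm{\beta},\bm{Z}\rangle+\sigma_{k}^{2}\,\mathbb{E}\|\bm{Z}\|_{2}^{2}=\|\bm{\beta}\|_{2}^{2}+p\,\sigma_{k}^{2},

so subtracting p\sigma_{k}^{2} from \|\hat{\bm{\beta}}_{k}^{d}\|_{2}^{2} removes, in expectation, the inflation due to the Gaussian fluctuations. This heuristic is what the display above summarizes; the rigorous version appears in Section 4.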

This intuition extends to any linear combination of the debiased Lasso and ridge of the form \tilde{\bm{\beta}}_{C}^{d}:=\alpha_{\textrm{L}}\hat{\bm{\beta}}_{\textrm{L}}^{d}+(1-\alpha_{\textrm{L}})\hat{\bm{\beta}}_{\textrm{R}}^{d}. A similar relationship holds for the norm: \|\tilde{\bm{\beta}}_{C}^{d}\|_{2}^{2}-p\cdot\tilde{\tau}_{C}^{2}\approx\|\bm{\beta}\|_{2}^{2}. We establish that for any \alpha_{\textrm{L}}\in[0,1], we can derive a consistent estimator for the ensemble variance \tilde{\tau}_{C}^{2}. Among all possible ensemble choices with different ensembling parameter \alpha_{\textrm{L}} and different tuning parameters \lambda_{\textrm{L}},\lambda_{\textrm{R}} for calculating the Lasso and the ridge, we adaptively choose the combination that minimizes the mean square error of the ensemble estimator. This method, dubbed HEDE, harnesses the strengths of L_{1} and L_{2} regression and demonstrates strong practical performance, since we design our adaptive tuning process specifically to enhance statistical accuracy. We provide further elaboration on HEDE in the subsequent subsections: Section 3.1 discusses the specific forms of the debiased estimators utilized in our framework; Section 3.2 describes our ensembling approach; and Section 3.3 discusses the final HEDE method with our adaptive tuning strategy.

3.1 Debiasing Regularized Estimators

Define the Lasso and ridge estimators as follows

\hat{\bm{\beta}}_{\textrm{L}}:=\operatorname*{arg\,min}_{\bm{b}}\frac{1}{2n}\|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}+\frac{\lambda_{\textrm{L}}}{\sqrt{n}}\|\bm{b}\|_{1},\quad\hat{\bm{\beta}}_{\textrm{R}}:=\operatorname*{arg\,min}_{\bm{b}}\frac{1}{2n}\|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}+\frac{\lambda_{\textrm{R}}}{2}\|\bm{b}\|_{2}^{2}, \qquad (4)

where \lambda_{\textrm{L}} and \lambda_{\textrm{R}} denote the respective tuning parameters. We remark that the different scalings of the regularization terms ensure comparable scales for the solutions in our high-dimensional regime (this is easy to see by noting that if each entry of \bm{b} satisfies b_{j}\sim 1/\sqrt{n}, then \|\bm{b}\|_{1}\sim\sqrt{n} whereas \|\bm{b}\|_{2}^{2}\sim 1). These estimators incur a regularization bias. To tackle this issue, several authors proposed debiased versions of these estimators that remain asymptotically unbiased with Gaussian fluctuations (zhang2014confidence; van2014asymptotically; javanmard2014confidence; bellec2019biasing; bellec2023debiasing).
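Before turning to the debiasing step, the following minimal sketch makes the penalty scaling in (4) concrete by mapping \lambda_{\textrm{L}},\lambda_{\textrm{R}} onto scikit-learn's penalty conventions; this is our illustrative substitute for the glmnet-based implementation referenced later, and the mapping follows from matching the objectives (noted in the comments).

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def fit_lasso_ridge(X, y, lam_L, lam_R):
    """Solve the two programs in (4).

    sklearn's Lasso minimizes (1/(2n))||y - Xb||^2 + alpha*||b||_1,
    so alpha = lam_L / sqrt(n) reproduces the (lam_L/sqrt(n))*||b||_1 penalty.
    sklearn's Ridge minimizes ||y - Xb||^2 + alpha*||b||^2; multiplying the
    ridge objective in (4) by 2n shows alpha = n * lam_R gives the same minimizer.
    """
    n = X.shape[0]
    lasso = Lasso(alpha=lam_L / np.sqrt(n), fit_intercept=False, max_iter=50_000)
    ridge = Ridge(alpha=n * lam_R, fit_intercept=False)
    return lasso.fit(X, y).coef_, ridge.fit(X, y).coef_
```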

Specifically, we work with the following debiased estimators:

\hat{\bm{\beta}}_{\textrm{L}}^{d}=\hat{\bm{\beta}}_{\textrm{L}}+\frac{\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}})}{n-\|\hat{\bm{\beta}}_{\textrm{L}}\|_{0}},\quad\hat{\bm{\beta}}_{\textrm{R}}^{d}=\hat{\bm{\beta}}_{\textrm{R}}+\frac{\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{R}})}{n-\operatorname{Tr}((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda_{\textrm{R}}\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X})}. \qquad (5)

As detailed in bellec2019biasing, these versions are known as debiasing with degrees-of-freedom (DOF) correction, since the second term in the denominator is precisely the degrees of freedom of the regularized estimator. Prior debiasing theory (bellec2019second; celentano2021cad; bellec2022observable) roughly established that under suitable conditions, for each coordinate j, the debiased estimators jointly satisfy (asymptotically)

\begin{pmatrix}\hat{\beta}_{\textrm{L},j}^{d}-\beta_{j}\\ \hat{\beta}_{\textrm{R},j}^{d}-\beta_{j}\end{pmatrix}\approx\mathcal{N}\left(\begin{pmatrix}0\\ 0\end{pmatrix},\begin{pmatrix}\tau_{\textrm{L}}^{2}&\tau_{\textrm{L}\textrm{R}}\\ \tau_{\textrm{L}\textrm{R}}&\tau_{\textrm{R}}^{2}\end{pmatrix}\right), \qquad (6)

for appropriate variance and covariance parameters \tau_{\textrm{L}}^{2},\tau_{\textrm{R}}^{2},\tau_{\textrm{L}\textrm{R}}. We remark that this is a non-rigorous statement (hence the “\approx” sign); a precise statement is presented in Theorem 4.3. Additionally, these parameters can be consistently estimated using the following (celentano2021cad):

\hat{\tau}_{\textrm{L}}^{2}=\frac{\|\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}}\|_{2}^{2}}{(n-\|\hat{\bm{\beta}}_{\textrm{L}}\|_{0})^{2}},\qquad\hat{\tau}_{\textrm{R}}^{2}=\frac{\|\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{R}}\|_{2}^{2}}{(n-\operatorname{Tr}((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda_{\textrm{R}}\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}))^{2}}, \qquad (7)
\hat{\tau}_{\textrm{L}\textrm{R}}=\frac{\langle\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}},\,\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{R}}\rangle}{(n-\|\hat{\bm{\beta}}_{\textrm{L}}\|_{0})(n-\operatorname{Tr}((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda_{\textrm{R}}\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}))}.

This serves as the basis of our HEDE heritability estimator.
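The following minimal sketch transcribes the debiasing formulas (5) and the variance estimates (7); it assumes \hat{\bm{\beta}}_{\textrm{L}}, \hat{\bm{\beta}}_{\textrm{R}} were computed as in (4), e.g., via fit_lasso_ridge above, and also returns the degrees of freedom since they are reused later. Function and variable names are ours.

```python
import numpy as np

def debias_and_estimate_variances(X, y, beta_L, beta_R, lam_R):
    """Degrees-of-freedom debiasing (5) and variance estimates (7)."""
    n, p = X.shape
    # Degrees of freedom: Lasso support size; ridge trace formula.
    df_L = np.count_nonzero(beta_L)
    G = X.T @ X / n
    df_R = np.trace(np.linalg.solve(G + lam_R * np.eye(p), G))
    r_L = y - X @ beta_L                       # Lasso residuals
    r_R = y - X @ beta_R                       # ridge residuals
    beta_L_d = beta_L + X.T @ r_L / (n - df_L)
    beta_R_d = beta_R + X.T @ r_R / (n - df_R)
    # Consistent estimates of tau_L^2, tau_R^2, tau_LR from (7).
    tau2_L = np.sum(r_L ** 2) / (n - df_L) ** 2
    tau2_R = np.sum(r_R ** 2) / (n - df_R) ** 2
    tau_LR = float(r_L @ r_R) / ((n - df_L) * (n - df_R))
    return beta_L_d, beta_R_d, tau2_L, tau2_R, tau_LR, df_L, df_R
```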

3.2 Ensembling and a Primitive Heritability Estimator

We seek to combine the strengths of L_{1} and L_{2} regression estimators. To achieve this, we consider linear combinations of the debiased Lasso and ridge, taking the form

\tilde{\bm{\beta}}_{C}^{d}=\alpha_{\textrm{L}}\hat{\bm{\beta}}_{\textrm{L}}^{d}+(1-\alpha_{\textrm{L}})\hat{\bm{\beta}}_{\textrm{R}}^{d}, \qquad (8)

where \alpha_{\textrm{L}}\in[0,1] is held fixed for the time being. We remark that the well-known ElasticNet estimator (zou2005regularization) also combines the strengths of the Lasso and the ridge. Though our theoretical framework allows potential extensions to the ElasticNet, we do not pursue this direction due to computational concerns, discussed further at the end of Section 3.4.

Combining (6) and (7), we obtain that each coordinate of \tilde{\bm{\beta}}_{C}^{d} satisfies, roughly,

\frac{\tilde{\beta}_{C,j}^{d}-\beta_{j}}{\sqrt{\tilde{\tau}_{C}^{2}}}\approx\mathcal{N}(0,1)\quad\text{with}\quad\tilde{\tau}_{C}^{2}:=\alpha_{\textrm{L}}^{2}\hat{\tau}_{\textrm{L}}^{2}+(1-\alpha_{\textrm{L}})^{2}\hat{\tau}_{\textrm{R}}^{2}+2\alpha_{\textrm{L}}(1-\alpha_{\textrm{L}})\hat{\tau}_{\textrm{L}\textrm{R}}. \qquad (9)

Furthermore, the Gaussianity property (9) holds in an “average” sense as well, where the result can effectively be summed across all coordinates. We describe the precise sense next. Specifically, for any Lipschitz function \psi, the difference \sum_{j=1}^{p}\psi(\tilde{\beta}_{C,j}^{d})-\sum_{j=1}^{p}\psi(\beta_{j}) behaves roughly as pE\{\psi(\sqrt{\tilde{\tau}_{C}^{2}}Z)\}, where Z\sim\mathcal{N}(0,1). On setting \psi(x)=x^{2} with x bounded, we obtain

\|\tilde{\bm{\beta}}_{C}^{d}\|_{2}^{2}-\|\bm{\beta}\|_{2}^{2}-p\cdot\tilde{\tau}_{C}^{2}\approx 0, \qquad (10)

where from (8), \|\tilde{\bm{\beta}}_{C}^{d}\|_{2}^{2}=\alpha_{\textrm{L}}^{2}\|\hat{\bm{\beta}}_{\textrm{L}}^{d}\|_{2}^{2}+(1-\alpha_{\textrm{L}})^{2}\|\hat{\bm{\beta}}_{\textrm{R}}^{d}\|_{2}^{2}+2\alpha_{\textrm{L}}(1-\alpha_{\textrm{L}})\langle\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d}\rangle. Synthesizing these arguments, we observe that

\hat{h}^{2}_{\alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}}}=\min\left[1,\max\left\{0,\frac{\|\tilde{\bm{\beta}}_{C}^{d}\|_{2}^{2}-p\cdot\tilde{\tau}_{C}^{2}}{\widehat{\text{Var}}(\bm{y})}\right\}\right] \qquad (11)

yields a consistent estimator of h^{2}. The clipping is crucial for ensuring that the estimate falls within the range [0,1], which is necessary considering h^{2} represents a proportion. Given a fixed \alpha_{\textrm{L}}\in[0,1] and \lambda_{\textrm{L}},\lambda_{\textrm{R}} within a suitable range [\lambda_{\min},\lambda_{\max}], we can establish that \hat{h}^{2}_{\alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}}} is consistent for h^{2}.

For our final algorithm, we set a range [t_{\min},t_{\max}] on the degrees of freedom (the denominators in (5)), and filter out all (\lambda_{\textrm{L}},\lambda_{\textrm{R}}) values whose resulting degrees of freedom fall outside this range. Details on how we set this range are described in Supplementary Materials A.2. This results in a class of consistent estimators for the heritability, each corresponding to a different choice of \alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}}. Among these, we propose to select the estimator that minimizes the mean square error in estimating the signal \bm{\beta}. This leads to our adaptive tuning strategy, which we describe in the next subsection.

3.3 The method HEDE

In order to develop our method, HEDE, we take advantage of the flexibility provided by \alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}} as adjustable hyperparameters. Initially, we fix \lambda_{\textrm{L}},\lambda_{\textrm{R}} and adaptively select \alpha_{\textrm{L}}. To this end, we minimize the estimated variability of the ensemble estimator, given by \tilde{\tau}_{C}^{2} in (9), over all possible choices of \alpha_{\textrm{L}}. This results in a choice that consistently estimates each coordinate of \bm{\beta} with the least possible variability among all ensemble estimators of the form (8). As this holds true for every coordinate and \tilde{\beta}_{C,j}^{d} is asymptotically unbiased for \beta_{j}, minimizing this variance concurrently minimizes the mean squared error. Formally, minimizing (9) as a function of \alpha_{\textrm{L}} leads to the following data-dependent choice:

\hat{\alpha}_{\textrm{L}}=\max\left\{0,\min\left\{1,\frac{\hat{\tau}_{\textrm{R}}^{2}-\hat{\tau}_{\textrm{L}\textrm{R}}}{\hat{\tau}_{\textrm{R}}^{2}-2\hat{\tau}_{\textrm{L}\textrm{R}}+\hat{\tau}_{\textrm{L}}^{2}}\right\}\right\}. \qquad (12)

Once again, the maximum and minimum operations ensure that \hat{\alpha}_{\textrm{L}}\in[0,1]. Subsequently, by plugging \hat{\alpha}_{\textrm{L}} into the ensembling formula (8), we obtain the estimator

\hat{\bm{\beta}}_{C}^{d}=\hat{\alpha}_{\textrm{L}}\hat{\bm{\beta}}_{\textrm{L}}^{d}+(1-\hat{\alpha}_{\textrm{L}})\hat{\bm{\beta}}_{\textrm{R}}^{d}. \qquad (13)

For any fixed tuning parameter pair (\lambda_{\textrm{L}},\lambda_{\textrm{R}}), \hat{\bm{\beta}}_{C}^{d} achieves the lowest mean squared error (MSE) for estimating each coordinate among all possible ensembled estimators of the form (8). Further, the mean square error of this estimator is given by

\hat{\tau}_{C}^{2}=\hat{\alpha}_{\textrm{L}}^{2}\hat{\tau}_{\textrm{L}}^{2}+2\hat{\alpha}_{\textrm{L}}(1-\hat{\alpha}_{\textrm{L}})\hat{\tau}_{\textrm{L}\textrm{R}}+(1-\hat{\alpha}_{\textrm{L}})^{2}\hat{\tau}_{\textrm{R}}^{2}. \qquad (14)

Note that \hat{\tau}_{C}^{2} is \tilde{\tau}_{C}^{2} evaluated at \alpha_{\textrm{L}}=\hat{\alpha}_{\textrm{L}}. Observe that both \hat{\bm{\beta}}_{C}^{d} and \hat{\tau}^{2}_{C} are functions of the initially selected tuning parameter pair (\lambda_{\textrm{L}},\lambda_{\textrm{R}}). Therefore, to construct a competitive heritability estimator, we minimize \hat{\tau}_{C}^{2} over a range of tuning parameter values, compute the corresponding \hat{\bm{\beta}}_{C}^{d}, and construct a heritability estimator according to (11) using this \hat{\bm{\beta}}_{C}^{d} and the minimum \hat{\tau}_{C}^{2} (in place of \tilde{\bm{\beta}}_{C}^{d} and \tilde{\tau}_{C}^{2}, respectively). This yields our final estimator, HEDE.
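For completeness, here is a direct transcription of (12) and (14) as a small helper; the guard against a degenerate denominator is our addition, not part of the formulas.

```python
import numpy as np

def optimal_ensemble_weight(tau2_L, tau2_R, tau_LR):
    """Ensemble weight (12), clipped to [0, 1], and its estimated MSE (14)."""
    denom = tau2_R - 2.0 * tau_LR + tau2_L
    alpha_L = 0.5 if denom <= 0 else (tau2_R - tau_LR) / denom  # guard: our addition
    alpha_L = float(np.clip(alpha_L, 0.0, 1.0))
    tau2_C = (alpha_L ** 2 * tau2_L
              + 2.0 * alpha_L * (1.0 - alpha_L) * tau_LR
              + (1.0 - alpha_L) ** 2 * tau2_R)
    return alpha_L, tau2_C
```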

3.4 Correlated Designs

In the previous subsections, we discussed our method in the setting of independent covariates. In practical situations, covariates are typically correlated; in GWAS, SNPs are in LD. Thus it is more realistic to assume that each row \bm{x}_{i\bullet} of our design matrix takes the form \bm{z}_{i\bullet}\bm{\Sigma}^{1/2}, where \bm{z}_{i\bullet} has independent entries and \bm{\Sigma} denotes the covariance of \bm{x}_{i\bullet}. Hence, if we can develop an accurate estimate \hat{\bm{\Sigma}} of the covariance \bm{\Sigma}, then \bm{x}_{i\bullet}\hat{\bm{\Sigma}}^{-1/2} should have roughly independent entries. Alternatively, the whitened design matrix \bm{X}\hat{\bm{\Sigma}}^{-1/2} should have roughly identity covariance, and our methodology from the prior section would apply directly.

Estimation of high-dimensional covariance matrices poses challenges and has been a rich area of research. Here, we consider a special class of covariance matrices. Our motivation stems from GWAS data, where chromosomes can be partitioned into a large number of approximately independent LD blocks (berisa2016approximately), and the number of SNPs in each LD block is much smaller than the total number of SNPs. To reflect this, we assume the population covariance matrix is block-diagonal. If the size of each block is negligible compared to the total sample size, the block-wise sample covariance matrix estimates the corresponding block of the population covariance matrix well. We establish in Supplementary Materials, Proposition A.1, that stitching together these block-wise sample covariances provides an accurate estimate of the population covariance matrix in the operator norm, i.e., \|\hat{\bm{\Sigma}}-\bm{\Sigma}\|_{\textnormal{op}}\rightarrow 0. The block diagonal approximation suffices for our purposes (see Section 6).

Other scenarios exist where accurate estimates of the population covariance matrix are available, but we will not discuss these further. Our final procedure, therefore, first uses the observed data to calculate \hat{\bm{\Sigma}}, then uses \hat{\bm{\Sigma}} to whiten the covariates, and finally applies HEDE to the whitened data.
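A minimal sketch of this whitening step is given below, assuming the LD block boundaries are supplied (e.g., taken from berisa2016approximately); the small eigenvalue floor is a numerical safeguard we add and is not part of the method's description.

```python
import numpy as np

def whiten_block_diagonal(X, blocks):
    """Whiten X under an assumed block-diagonal population covariance.

    blocks : list of integer index arrays, one per LD block.
    Each block's covariance is estimated by its sample covariance, and
    the corresponding columns of X are multiplied by Sigma_hat^{-1/2}.
    """
    X = np.asarray(X, dtype=float)
    X_white = np.empty_like(X)
    for idx in blocks:
        S = np.atleast_2d(np.cov(X[:, idx], rowvar=False))  # block-wise sample covariance
        w, V = np.linalg.eigh(S)
        w = np.clip(w, 1e-8, None)                           # eigenvalue floor (our addition)
        inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
        X_white[:, idx] = X[:, idx] @ inv_sqrt
    return X_white
```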

Table 1: Heritability Estimation via Debiased Ensembles (HEDE)
Input: \bm{y},\bm{X},t
Output: \hat{h}^{2}
0. If necessary, estimate \hat{\bm{\Sigma}} as block-diagonal and whiten the data.
1. Fix a range [t_{\min},t_{\max}]\subset(0,1). For all \lambda_{\textrm{L}},\lambda_{\textrm{R}}:
  Calculate \hat{\bm{\beta}}_{\textrm{L}},\hat{\bm{\beta}}_{\textrm{R}} in (4).
  If \frac{\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{\textrm{L}})\|_{0}}{n}\notin[t_{\min},t_{\max}] or \frac{1}{n}\operatorname{Tr}((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda_{\textrm{R}}\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X})\notin[t_{\min},t_{\max}], drop (\lambda_{\textrm{L}},\lambda_{\textrm{R}}).
  Calculate \hat{\tau}_{\textrm{L}}^{2},\hat{\tau}_{\textrm{R}}^{2},\hat{\tau}_{\textrm{L}\textrm{R}} in (7).
  Calculate \hat{\alpha}_{\textrm{L}} in (12), and \hat{\tau}_{C}^{2} in (14).
2. Select (\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}):=\operatorname*{arg\,min}_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}}\hat{\tau}_{C}^{2}(\lambda_{\textrm{L}},\lambda_{\textrm{R}}). Denote \hat{\tau}_{C,\min}^{2}:=\hat{\tau}_{C}^{2}(\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}).
3. Calculate \hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d} from (5), and \hat{\bm{\beta}}_{C}^{d} from (13), corresponding to \hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}.
4. Calculate the heritability estimator using the formula
\hat{h}^{2}_{\hat{\alpha}_{\textrm{L}},\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}}=\min\left\{1,\max\left\{0,\frac{\|\hat{\bm{\beta}}_{C}^{d}\|_{2}^{2}-p\cdot\hat{\tau}_{C,\min}^{2}}{\widehat{\text{Var}}(y_{i})}\right\}\right\}, \qquad (15)
where \widehat{\text{Var}}(y_{i}) denotes the sample variance of the outcome.

We algorithmically summarize HEDE in Table 1. As is evident from step 1, assuming our hyperparameter grids have m_{\textrm{L}},m_{\textrm{R}} different values of \lambda_{\textrm{L}},\lambda_{\textrm{R}}, respectively, we need m_{\textrm{L}}+m_{\textrm{R}} glmnet calls. For the well-known ElasticNet estimator, the same grids would require m_{\textrm{L}}m_{\textrm{R}} glmnet calls, making it much more expensive than the current ensembling method. We also remark that we have not fully optimized the computational/algorithmic aspects. With more advanced techniques such as snpnet (li2022fast), HEDE could potentially be scaled to very large dimensions.
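Putting the pieces together, the sketch below mirrors Table 1 for independent (or pre-whitened) columns, reusing the helpers fit_lasso_ridge, debias_and_estimate_variances, and optimal_ensemble_weight defined earlier in this section. The grid values and the default DOF range are illustrative choices, not the authors' defaults.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

def hede(X, y, lam_L_grid, lam_R_grid, t_min=0.05, t_max=0.95):
    """Sketch of the HEDE procedure summarized in Table 1."""
    n, p = X.shape
    # Step 1: only m_L + m_R regularized fits are needed in total.
    lasso_coefs = {l: Lasso(alpha=l / np.sqrt(n), fit_intercept=False,
                            max_iter=50_000).fit(X, y).coef_ for l in lam_L_grid}
    ridge_coefs = {l: Ridge(alpha=n * l, fit_intercept=False).fit(X, y).coef_
                   for l in lam_R_grid}
    best = None
    for lam_L, beta_L in lasso_coefs.items():
        for lam_R, beta_R in ridge_coefs.items():
            (beta_L_d, beta_R_d, tau2_L, tau2_R, tau_LR,
             df_L, df_R) = debias_and_estimate_variances(X, y, beta_L, beta_R, lam_R)
            # Drop pairs whose degrees of freedom (scaled by n) leave [t_min, t_max].
            if not (t_min <= df_L / n <= t_max and t_min <= df_R / n <= t_max):
                continue
            # Optimal ensemble weight (12) and its estimated MSE (14).
            alpha_L, tau2_C = optimal_ensemble_weight(tau2_L, tau2_R, tau_LR)
            if best is None or tau2_C < best[0]:
                best = (tau2_C, alpha_L * beta_L_d + (1.0 - alpha_L) * beta_R_d)
    if best is None:
        raise ValueError("no (lam_L, lam_R) pair satisfied the DOF constraint")
    tau2_C_min, beta_C_d = best
    # Step 4: clipped heritability estimate (15).
    numerator = np.sum(beta_C_d ** 2) - p * tau2_C_min
    return float(np.clip(numerator / np.var(y, ddof=1), 0.0, 1.0))
```

A call such as hede(X, y, np.logspace(-2, 0, 10), np.logspace(-2, 1, 10)) then returns a point estimate; in practice the grids and the DOF range should be set following Supplementary Materials A.2.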

4 Theoretical Properties

We turn to discuss the key theoretical properties of HEDE. We first focus on the case of independent covariates. Recall that throughout, we assume an asymptotic framework where p,n\rightarrow\infty with p/n\rightarrow\delta\in(0,\infty). We work with the following assumptions on the covariates, signal, and noise.

Assumption 1.
  (i) The entries X_{ij} of \bm{X} are independent (but not necessarily identically distributed), have mean 0 and variance 1, and are uniformly sub-Gaussian.

  (ii) The norm of the signal \bm{\beta} satisfies: 0\leq\|\bm{\beta}\|_{2}^{2}\overset{P}{\rightarrow}\sigma_{\beta}^{2}\leq\sigma_{\max}^{2}<\infty.

  (iii) The noise \bm{\epsilon} has independent entries (not necessarily identically distributed) with mean 0 and bounded variance \sigma^{2} (\sigma^{2}\leq\sigma_{\max}^{2}), and the entries are uniformly sub-Gaussian. Further, \bm{\epsilon} and \bm{X} are independent.

It is worth noting that the uniform sub-Gaussianity in Assumption 1(i) implies that the \bar{G}_{j}'s from (2) are bounded. This is satisfied in GWAS, as \bar{G}_{j} is the minor allele frequency (MAF) (psychiatric2009genomewide). The remaining Assumptions 1(ii) and (iii) are weak and natural to impose. With Assumption 1 in mind, our primary goal is to establish that HEDE consistently estimates the heritability. The following theorem establishes this result under mild additional conditions on the minimum singular values of submatrices of \bm{X} (Assumption 2, Supplementary Materials A.1). The proof is provided in Supplementary Materials F.3.

Theorem 4.1.

Recall \hat{h}_{\alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}}}^{2} from (11) and \hat{h}_{\hat{\alpha}_{\textrm{L}},\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}}^{2} from (15). Under Assumptions 1 and 2, the following conclusions hold:

  (i) \sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}],\,\alpha_{\textrm{L}}\in[0,1]}\left|\hat{h}_{\alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}}}^{2}-h^{2}\right|\overset{P}{\rightarrow}0;

  (ii) \left|\hat{h}_{\hat{\alpha}_{\textrm{L}},\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}}^{2}-h^{2}\right|\overset{P}{\rightarrow}0.

Theorem 4.1 states that HEDE is consistent when \lambda_{\textrm{L}},\lambda_{\textrm{R}} and \alpha_{\textrm{L}} are arbitrarily chosen in [\lambda_{\min},\lambda_{\max}] and [0,1], respectively. It follows that HEDE is consistent using our proposed choices \hat{\alpha}_{\textrm{L}},\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}. Here [\lambda_{\min},\lambda_{\max}] denotes the range of \lambda_{\textrm{L}},\lambda_{\textrm{R}} that we use; see details in Supplementary Materials A.2.

Proving consistency for given fixed choices of these hyperparameters is relatively straightforward. However, establishing the uniform guarantee is non-trivial. It requires analyzing the debiased Lasso and ridge estimators jointly, and furthermore, pinning down this joint behavior uniformly over all hyperparameter choices. We achieve both these goals under relatively mild assumptions on the covariate and noise distributions. Once we establish the uniform control, Part (ii) follows immediately since the uniform result provides consistency for the particular choices \hat{\alpha}_{\textrm{L}},\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}} produced by our algorithm. Below, we present the key result necessary for proving Theorem 4.1.

Proposition 4.2.

Recall \tilde{\bm{\beta}}_{C}^{d} from (8) and \tilde{\tau}_{C}^{2} from (9). Under Assumptions 1 and 2,

\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\ \sup_{\alpha_{\textrm{L}}\in[0,1]}\left|\|\tilde{\bm{\beta}}_{C}^{d}\|_{2}^{2}-\|\bm{\beta}\|_{2}^{2}-p\cdot\tilde{\tau}_{C}^{2}\right|\overset{P}{\rightarrow}0.

We now discuss how Proposition 4.2 leads to Theorem 4.1. Proposition 4.2 equivalently provides a consistent estimator for \|\bm{\beta}\|_{2}^{2}, which is directly related to h^{2} by a factor of the population variance of y_{i}. Since the population variance of y_{i} can be consistently estimated by the sample variance of the observed outcomes, we can safely replace the former by the latter in the denominator. Theorem 4.1(i) then follows upon using the fact that clipping does not affect consistency.

Proposition 4.2 is proved in Supplementary Materials F.2. Here we briefly outline our main idea. Recall from (8) that \tilde{\bm{\beta}}_{C}^{d} is a linear combination of \hat{\bm{\beta}}_{\textrm{L}}^{d} and \hat{\bm{\beta}}_{\textrm{R}}^{d}. To develop uniform guarantees for \tilde{\bm{\beta}}_{C}^{d}, it therefore suffices to establish the corresponding uniform guarantees for the debiased Lasso and ridge jointly, which we establish in the following theorem.

Theorem 4.3.

Recall the estimators (5). If Assumptions 1 and 2 hold, then for any 1-Lipschitz function \phi_{\beta}:(\mathbb{R}^{p})^{3}\rightarrow\mathbb{R}, we have

\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})]|\overset{P}{\rightarrow}0,

where (\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d})=(\bm{\beta}+\bm{g}_{\textrm{L}}^{f},\bm{\beta}+\bm{g}_{\textrm{R}}^{f}) with (\bm{g}_{\textrm{L}}^{f},\bm{g}_{\textrm{R}}^{f})\sim\mathcal{N}(0,\bm{S}\otimes\bm{I}_{p}), where \bm{S}=\begin{pmatrix}\tau_{\textrm{L}}^{2}&\tau_{\textrm{L}\textrm{R}}\\ \tau_{\textrm{L}\textrm{R}}&\tau_{\textrm{R}}^{2}\end{pmatrix} is formally defined via the system of equations (24).

Theorem 4.3 is proved in Supplementary Materials D. It states that the joint “Gaussianity” described in (6) holds at the population level for Lipschitz functions of the estimators, in addition to holding on a per-coordinate basis. While results of this nature have appeared in the recent literature on exact high-dimensional asymptotics, the majority of these works focus on a single regularized estimator (or its debiased version) with a fixed tuning parameter value. Theorem 4.3 provides the first joint characterization of multiple debiased estimators that is uniform over a range of tuning parameter values.

Among related works, celentano2021cad established a joint characterization of two estimators and their debiased versions for given fixed tuning parameter choices. Meanwhile, miolane2021distribution characterized certain properties of the Lasso and the debiased Lasso uniformly across tuning parameter values. However, developing a heritability estimator that performs well in practice across sparse to dense signals requires the stronger result of the form in Theorem 4.3. Dealing with the debiased Lasso and ridge jointly, while requiring uniform characterization over tuning parameters as well as allowing sub-Gaussian covariates, poses unique challenges. We overcome these challenges in our work, building upon miolane2021distribution, celentano2020lasso, celentano2021cad, and universality techniques proposed in han2022universality, which in turn rely on the Convex Gaussian Min-max Theorem (CGMT) (thrampoulidis2015regularized). We detail the connections and differences in the Supplementary Materials.

Note that Theorem 4.3 involves the variance-covariance parameters \tau_{\textrm{L}}^{2},\tau_{\textrm{L}\textrm{R}},\tau_{\textrm{R}}^{2} (formal definitions are deferred to (23) and (24) in the interest of space). We establish in Supplementary Materials, Theorem F.1, that \hat{\tau}_{\textrm{L}}^{2},\hat{\tau}_{\textrm{R}}^{2},\hat{\tau}_{\textrm{L}\textrm{R}}, defined in (7), consistently estimate these parameters. Furthermore, we show that this consistency holds uniformly over the required range of tuning parameter values. This result, coupled with Theorem 4.3, yields our key Proposition 4.2.

Finally, we turn to the case where the observed features are correlated. Recall from Section 3.4 that here we consider the design matrix to take the form \bm{X}=\bm{Z}\bm{\Sigma}^{1/2}, where \bm{Z} now has independent entries and satisfies the conditions in Assumption 1(i). Thus, our previous theorems apply to \bm{Z} (which is unobserved), but not to the observed \bm{X}. Applying whitening with the estimate \hat{\bm{\Sigma}} results in final features of the form \bm{X}\hat{\bm{\Sigma}}^{-1/2}=\bm{Z}\bm{\Sigma}^{1/2}\hat{\bm{\Sigma}}^{-1/2}. When \hat{\bm{\Sigma}} approximates \bm{\Sigma} accurately, \bm{\Sigma}^{1/2}\hat{\bm{\Sigma}}^{-1/2} behaves as the identity matrix \bm{I}. In this case, \bm{X}\hat{\bm{\Sigma}}^{-1/2} is close to \bm{Z}, which satisfies our previous Assumption 1. HEDE therefore continues to enjoy uniform consistency guarantees, as formalized below.

Theorem 4.4.

Consider the setting of Theorem 4.1, except that the observed features now take the form \bm{X}=\bm{Z}\bm{\Sigma}^{1/2}, where \bm{Z} satisfies Assumptions 1(i) and 2(i). Let \bm{\Sigma} be a sequence of deterministic symmetric matrices whose eigenvalues are bounded in [1/M,M] for some M>1. Let \hat{\bm{\Sigma}} be a sequence of random matrices such that \|\hat{\bm{\Sigma}}-\bm{\Sigma}\|_{\textnormal{op}}\overset{P}{\rightarrow}0. If \bm{A}=\bm{\Sigma}^{1/2}\hat{\bm{\Sigma}}^{-1/2} satisfies Assumption 3 and \hat{h}^{2}_{\alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}}} denotes the estimator (11) calculated using the whitened data (\bm{y},\bm{X}\hat{\bm{\Sigma}}^{-1/2}), then

\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}],\,\alpha_{\textrm{L}}\in[0,1]}\left|\hat{h}_{\alpha_{\textrm{L}},\lambda_{\textrm{L}},\lambda_{\textrm{R}}}^{2}-h^{2}\right|\overset{P}{\rightarrow}0. \qquad (16)

Thus, the corresponding HEDE estimator \hat{h}^{2}_{\hat{\alpha}_{\textrm{L}},\hat{\lambda}_{\textrm{L}},\hat{\lambda}_{\textrm{R}}} is also consistent for h^{2}.

Although we present this theorem for a general sequence of covariance matrices \bm{\Sigma} that allow accurate estimation in operator norm, we use it only for the block-diagonal covariances relevant to our application, where the block sizes are relatively small compared to the sample size. In Supplementary Materials, Proposition A.1, we establish that for such covariance matrices, the block-wise sample covariance matrix provides such an accurate estimator. Theorem 4.4 is proved in Supplementary Materials G. The key idea lies in establishing that the estimation error incurred by replacing \bm{\Sigma} with \hat{\bm{\Sigma}} does not affect the uniform consistency properties of HEDE.

5 Simulation Results

We evaluate the performance of HEDE through simulation studies. In Section 5.1, we compare HEDE to some representative random effects methods for estimating the heritability under different signal structures. In Section 5.2, we compare HEDE against several fixed effect based methods for estimating the heritability.

5.1 Comparison with random effects methods

Given the widespread adoption of the random effects assumption in genetics, it is important to benchmark HEDE against various random-effects-based methods. A diverse array of such methods exists, each based on a different random effects assumption. We show that a random-effects-based method fails to provide unbiased heritability estimates when its corresponding assumptions are violated (i.e., when signals are not generated as assumed). In contrast, HEDE consistently offers unbiased estimates (albeit with a slightly higher variance).

For conciseness, we focus on three representative random effects methods: GREML-SC (yang2011gcta), which is effectively GCTA and assumes no stratification; GREML-MS (lee2013estimation), a modified version of GREML that accounts for minor allele frequency (MAF) stratification; and GREML-LDMS (yang2015genetic), a modified version of GREML that accounts for both linkage disequilibrium (LD) level and MAF stratification. Their corresponding random effects assumptions state that within the respective stratifications, the signal entries are i.i.d. draws from a zero-inflated normal distribution (equivalently, the non-zero entries are randomly/evenly located and normally distributed).

For our simulations, we use genotype data from unrelated white British individuals in the UK Biobank dataset (sudlow2015uk). We select common variants with a minor allele frequency (MAF) greater than 1\% from the UK Biobank Axiom array using PLINK. Additionally, we exclude SNPs with genotype missingness over 1\% and Hardy-Weinberg disequilibrium p-values greater than 10^{-6}. We impute the remaining small number of missing values as 0. This results in a design matrix \tilde{\bm{X}} with n=332,430 individuals and p=533,169 variants across 22 chromosomes, which is then normalized so each column has mean 0 and variance 1. To expedite the process, our simulation analysis focuses on chromosome 22. We use 100 randomly sampled disjoint subsets of 3000 individuals each to simulate 100 independent random draws of \bm{X}. We then generate a random \bm{\beta} based on different assumptions and generate \bm{y} following the linear model in (1). This results in 100 heritability estimates, obtained using the GREML methods and HEDE. We then compare the average of these estimates with the true population heritability value.

While our formal mathematical guarantees are established under the fixed effects assumption, it is desirable to stress-test the performance of our method in the random effects simulation settings, especially since we wish to benchmark our method against popular methods from statistical genetics. We further discuss the connection between the random effect assumption and the fixed effect assumption in terms of the true population parameter (Supplementary Materials B.1), with the discussion aimed at explaining why the empirical comparison we perform here is reasonable, given the methods were developed under different assumptions.

For the GREML methods, we use 4 LD bins generated from the quantiles of individual LD values (evans2018comparison), and 3 MAF bins: [0.01,0.05], [0.05,0.1], and [0.1,0.5]. For HEDE, we perform the whitening step by estimating the population covariance \bm{\Sigma} as block diagonal, with blocks specified in (berisa2016approximately). We estimate the block-wise population covariances using sample covariances. Given that n=332,430 and p<1000 for each LD block, this approach results in minimal high-dimensional error.

Figure 1: Box plots comparing three random effects methods, namely GREML with different stratifications, with our fixed effects method HEDE. Method abbreviations: "SC": single-component (yang2011gcta); "MS": MAF stratified; "LDMS": LD and MAF stratified (yang2015genetic). Different rows use different true \|\bm{\beta}\|_{2} values. Each column indicates a different distribution from which the true signal is drawn: (a) zero-inflated normal, where the nonzero entries are uniformly distributed (no stratification); (b) zero-inflated normal, where the nonzero entries have different densities in each LDMS stratification; (c) zero-inflated mixture of two normals, where the nonzero entries have different densities in each (consecutive) LD block; (d) zero-inflated mixture of two normals, where the nonzero entries have different densities in each (consecutive) LD block and in each LDMS stratification. The red dotted horizontal line represents the true heritability value (which varies depending on the signal distribution). Boxes represent distributions of different methods' estimates from 100 independent draws of \bm{X} (from UK Biobank chromosome 22, 3000 individuals each) and \bm{\beta} (from the aforementioned distributions). GREML methods show biases under model misspecification while HEDE remains unbiased. See details in Section 5.1.

To assess the robustness of different methods against misspecification of typical random effects assumptions on the regression coefficients, we created 4 types of random signals. These signals vary in the locations and distributions of non-zero entries (causal variants), as shown in Figure 1. In Figure 1a, the non-zero locations of \bm{\beta} are uniformly distributed, with non-zero components normally distributed. Here the assumptions of all three GREML methods are satisfied. In Figure 1b, the non-zero locations of \bm{\beta} vary in concentration across LDMS strata, with non-zero components normally distributed. Here only the assumption of GREML-LDMS is satisfied. In Figure 1c, the non-zero locations of \bm{\beta} are more concentrated in the last LD block, with non-zero components distributed as a mixture of two normals (centered symmetrically around zero). Here none of the assumptions of the three GREML methods is satisfied. In Figure 1d, the non-zero locations of \bm{\beta} vary in concentration across both LD blocks and LDMS strata, with non-zero components distributed as a mixture of two normals. Here none of the assumptions of the three GREML methods is satisfied, with varying degrees of violation. The full details of the signal generation process are described in Supplementary Materials B.2.

Figure 1 clearly demonstrates that the GREML methods are susceptible to signal distribution misspecification, exhibiting increased bias as the extent of assumption violation rises. In contrast, HEDE is robust and provides unbiased estimates in all cases, despite exhibiting higher variance. This pattern remains robust across various levels of true population heritability (0.1, 0.25, and 0.5 shown).

We believe these findings are likely to apply to other random-effects-based methods. A random effects method can only guarantee unbiased estimates when its underlying random effects assumptions are satisfied, a condition that is often difficult to verify with real data. In contrast, HEDE is not susceptible to such biases, demonstrating that the fixed effects based HEDE method applies under far more diverse signal structures.

5.2 Comparison of HEDE with other fixed effects methods

In this section, we compare the performance of HEDE with other fixed effects heritability estimation methods. Recall we showed in Section 3.4 and Theorem 4.4 that HEDE remains consistent under accurate covariance estimation strategies. Also, fixed effects heritability estimation methods all assume \bm{\Sigma} is known or can be accurately estimated. In light of these facts, we assume without loss of generality that \bm{\Sigma}=\bm{I}, in order to directly compare the relative performance of fixed effects methods. To mimic discrete covariates in genetic scenarios, we generate the design matrix \bm{X}\in\mathbb{R}^{n\times p} via (2), where G_{ij}\sim\text{Binomial}(2,\pi_{j}),\forall i,j and \pi_{j}\sim\text{Unif}[0.01,0.5],\forall j, representing common variants with a minor allele frequency of at least 1\%. We generate the true signal \bm{\beta}\in\mathbb{R}^{p} as i.i.d. zero-inflated normals with non-zero coordinate weight \kappa and heritability h^{2}: \beta_{j}\overset{i.i.d.}{\sim}\kappa\cdot\mathcal{N}(0,h^{2}/(p\kappa(1-h^{2})))+(1-\kappa)\cdot\delta_{0}. We generate the noise \bm{\epsilon}\in\mathbb{R}^{n} with i.i.d. \mathcal{N}(0,1) entries. We vary n,p,\kappa,h^{2} across our simulation settings.
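A minimal data-generation sketch following the design just described is given below; it conveys the flavor of this setup rather than reproducing the authors' exact scripts, and the function name, seeding, and the MAF clipping guard are ours.

```python
import numpy as np

def simulate_fixed_effects(n, p, kappa, h2, rng=None):
    """Generate (X, y, beta) following the design of Section 5.2.

    Genotypes G_ij ~ Binomial(2, pi_j) with pi_j ~ Unif[0.01, 0.5],
    standardized as in (2); zero-inflated normal signal with non-zero
    weight kappa and target heritability h2; N(0, 1) noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    pi = rng.uniform(0.01, 0.5, size=p)
    G = rng.binomial(2, pi, size=(n, p))
    maf = G.sum(axis=0) / (2.0 * n)
    maf = np.clip(maf, 1e-6, 1.0 - 1e-6)     # guard against monomorphic columns (our addition)
    X = (G - 2.0 * maf) / np.sqrt(2.0 * maf * (1.0 - maf))
    beta = np.zeros(p)
    nonzero = rng.random(p) < kappa
    beta[nonzero] = rng.normal(0.0, np.sqrt(h2 / (p * kappa * (1.0 - h2))),
                               size=int(nonzero.sum()))
    y = X @ beta + rng.normal(0.0, 1.0, size=n)
    return X, y, beta
```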

For each simulation setting, we compare HEDE with a Lasso-based method, AMP (bayati2013estimating) (we label this method AMP since it was built on approximate message passing algorithms), and three ridge-based methods: Dicker's moment-based method (dicker2014variance), Eigenprism (janson2017eigenprism), and Estimating Equation (chen2022statistical). For AMP, bayati2013estimating does not provide a strategy for tuning the regularization parameter. Therefore, we choose the tuning parameter based on cross validation (specifically, the Approximate Leave-One-Out (ALO) approach of rad2020scalable) performed over 10 smaller datasets with the same values of \delta,h^{2},\kappa. All aforementioned methods operate under similar fixed effects assumptions as HEDE, except that Eigenprism only handles n\leq p in its current form. We first compared these methods in terms of bias and found that they all remain unbiased across a wide variety of settings. In Supplementary Materials B.3, we demonstrate their unbiasedness for a specific choice of n,p,h^{2},\kappa. In light of this, here we present comparisons between these estimators in terms of their mean square errors (MSEs).

Figure 2: Ratio of the MSEs (called relative MSE) of different fixed effects methods (HEDE (ours), AMP (bayati2013estimating), EigenPrism (janson2017eigenprism), MM (dicker2014variance), EstEqn (chen2022statistical)) with respect to HEDE. Thus the baseline is 1, represented by the orange line. The design matrices satisfy (2), where G_{ij}\sim\text{Binomial}(2,\pi_{j}),\forall i,j and \pi_{j}\sim\text{Unif}[0.01,0.5],\forall j, with dimensions n,p. Recall that signals are drawn from i.i.d. zero-inflated normals with non-zero coordinate weight \kappa and true heritability h^{2}: \beta_{j}\overset{i.i.d.}{\sim}\kappa\cdot\mathcal{N}(0,h^{2}/(p\kappa(1-h^{2})))+(1-\kappa)\cdot\delta_{0}. We generate 10 draws of signals with 100 draws of design matrices and display average MSEs. Parameter values: n=1000, p=10000; \kappa varies across panels as indicated in the subtitles, and h^{2} varies along the x-axis as indicated by the label. See full details in Section 5.2.

In Figure 2, we display the relative MSEs of heritability estimates in relation to \kappa and h^{2}, while fixing the sample size n=1000 and the number of variants p=10000. We vary the sparsity from 0.001 to 0.3 on a roughly evenly spaced log scale, and vary the heritability from 0.01 to 0.7. This range mimics the potential true heritability values for traits such as autoimmune disorders (lower heritability) and height (higher heritability) (hou2019accurate). Although the methods under comparison operate under fixed effects assumptions, we still simulate 10 signals to mitigate the influence of any particular signal choice. For each signal, we generate 100 random samples and calculate the MSE for estimating heritability. We then average the error over the 10 signals.

Figure 3: Settings same as Figure 2. Parameter values: h^{2}=0.5, \kappa=0.1; n varies across panels as indicated in the subtitles, and p varies along the x-axis as indicated by the label.

We observe that HEDE outperforms ridge-type methods particularly when the true signal is sparse and relatively strong, while performing comparably in other scenarios. Since HEDE utilizes the strengths of both the Lasso and ridge, the performance gain over pure ridge-based methods, such as EigenPrism, MM and EstEqn, is natural. The pure Lasso-based method (AMP, green) performs sub-optimally compared to all the others, and its higher MSE values dominate the y-axis range. For a closer view, we present a zoomed-in version of the plot in Figure 5, where we exclude AMP (we defer this figure to Supplementary Materials B.3 for space). Figures 2 and 5 clearly demonstrate the efficacy of our ensembling strategy: it leads to a superior heritability estimator across signals ranging from less sparse to highly sparse.

To better investigate our relative performance, in Figure 3 we consider a specific setting (\kappa=0.1, h^{2}=0.5) where HEDE has a moderate advantage in Figure 2. We vary n and p so that the ratio \delta=p/n ranges from 0.25 to 20. We clearly observe HEDE's advantage across a broad spectrum of configurations. Note that Eigenprism's results are missing for n>p as it currently only supports n\leq p. Also observe that AMP's relative performance degrades sharply with larger \delta values, as shown in the first panel of Figure 3. This explains its sub-optimal performance in Figure 2, where \delta=10. Figure 3 demonstrates that HEDE leads to superior performance over a wide range of sample sizes, dimensions and dimension-to-sample ratios. It consistently outperforms the other methods for the setting examined.

To determine whether HEDE's superior performance extends to broader contexts, we conducted additional investigations. We examined settings where the designs have larger sub-Gaussian norms, as well as settings where the non-zero entries of the signal were drawn from non-normal distributions and/or distributed with stratifications (see Section 5.1). Results for the former investigation are presented in Supplementary Materials B.4; we omit those for the latter since the performances look similar. Across the board, we observed that HEDE maintains its competitive advantage over other methods. We also investigated a Lasso-based method, CHIVE (tony2020semisupervised), in our experiments. However, CHIVE exhibited considerable bias in many settings, particularly when its underlying sparsity assumptions were violated. For this reason, we omit CHIVE from our result displays.

6 Real data application

We next describe our real data experiments. We apply HEDE to the unrelated white British individuals from the UK Biobank GWAS data, which contain common variants (sudlow2015uk), to estimate the heritability of two commonly studied phenotypes: height and BMI. For the design matrix \bm{X}, we follow the preprocessing steps outlined in Section 5.1, but include all 22 chromosomes, unlike Section 5.1 which considers only chromosome 22. For each phenotype, we exclude individuals with missing values. We center the outcomes and denote the resultant vector by \tilde{\bm{y}}. We account for non-genetic factors by including covariates for age, age^2, sex, as well as 40 ancestral principal components. We derive the final response vector \bm{y} by regressing out the influence of these non-genetic covariates from \tilde{\bm{y}} using Ordinary Least Squares (OLS). The large sample size of approximately 330,000 individuals compared to only 43 non-genetic covariates ensures that OLS can be confidently applied.
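A minimal sketch of this covariate-adjustment step follows; the names are illustrative, and the matrix of non-genetic covariates (age, age^2, sex, and the 40 principal components) is assumed to have been assembled already.

import numpy as np

def residualize(y_centered, covariates):
    # Append an intercept column to the non-genetic covariates
    W = np.column_stack([np.ones(len(y_centered)), covariates])
    # OLS fit of the centered phenotype on the covariates
    coef, *_ = np.linalg.lstsq(W, y_centered, rcond=None)
    # The residuals serve as the final response vector y passed to HEDE
    return y_centered - W @ coef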

We apply HEDE to the preprocessed \bm{X} and \bm{y}. For comparison, we also apply GREML-SC and GREML-LDMS (see Section 5.1 for details) from the random effects methods family, as well as MM and AMP from the fixed effects methods family. We did not include Eigenprism and EstEqn since their runtimes are both O(n^{3}) (assuming p/n\rightarrow\delta), so they could not be completed within a reasonable time frame. We calculate standard errors using the standard deviation of point estimates from 15 disjoint random subsets, each containing 20,000 individuals. We use disjoint subsets because, in our high-dimensional setting, traditional resampling techniques for estimating standard errors, such as the bootstrap and subsampling, are known to fail (el2018can; clarte2024analysis; bellec2024asymptotics); corrections have been proposed for estimating standard errors when inferring individual regression coefficients and in closely related problems, but such strategies for heritability estimation are yet to be developed, so disjoint subsets of the observed data serve to mimic independent training draws from the population. Additionally, due to memory constraints, many methods (including HEDE) could not process all 22 chromosomes simultaneously. Therefore, for all methods, we estimate the total heritability by summing the heritabilities calculated for each chromosome separately. This approach is expected to introduce minimal error, given the widely accepted assumption of independence across chromosomes.
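The two aggregation steps just described are simple; a minimal sketch follows with illustrative names, assuming the per-chromosome estimates and per-subset point estimates have already been computed.

import numpy as np

def total_heritability(per_chromosome_estimates):
    # Sum per-chromosome heritabilities, relying on the assumed
    # independence across chromosomes
    return float(np.sum(per_chromosome_estimates))

def subset_standard_error(subset_point_estimates):
    # Standard error taken as the standard deviation of point estimates
    # computed on disjoint subsets of individuals (15 subsets of 20,000 here)
    return float(np.std(subset_point_estimates, ddof=1))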

Trait HEDE MM AMP GREML-SC GREML-LDMS
Height 0.681 0.813 0.436 0.741 0.605
(SE) (7.47E-02) (9.30E-02) (3.01E-01) (2.87E-02) (3.19E-02)
BMI 0.279 0.369 0.510 0.316 0.265
(SE) (5.20E-02) (6.66E-02) (3.52E-01) (2.85E-02) (3.22E-02)
Table 2: Estimates of heritability from HEDE (ours), MM (dicker2014variance), AMP (bayati2013estimating), GREML-SC (yang2011gcta) and GREML-LDMS (yang2015genetic) for certain traits in the UKB dataset. 533,169 SNPs are present across 22 chromosomes. For fixed effects methods, we assume independence across LD blocks from berisa2016approximately for covariance estimation. Standard errors are computed from 15 disjoint random subsets of 20,000 individuals each. See full details in Section 6.

We summarize the results in Table 2. First, we observe that MM and AMP estimates differ more from HEDE than the other methods' estimates do. This is consistent with Figure 2, where AMP often has larger MSE than the other fixed effects methods; we therefore expect AMP estimates to differ noticeably from those of HEDE and MM. For MM, the situation is more nuanced. Prior work suggests the true height heritability lies at the higher end of the range shown in Figure 2. In this region, MM has higher MSE than HEDE across all sparsity levels, so MM estimates of height heritability may be less accurate than HEDE's. For BMI, prior work provides heritability estimates within the range 0.285 to 0.436 using various methods. Referring to Figure 2, we see that MM's relative MSE compared to HEDE depends on the sparsity level in this BMI heritability range. Without knowing the exact sparsity, it is difficult to draw conclusions. However, we still observe the general trend of MM producing larger estimates than other methods for BMI, similar to height.

Comparing HEDE to the two random effects methods, we observe that for both height and BMI, HEDE point estimates lie between the GREML-SC and GREML-LDMS estimates, with the latter always undershooting. This is consistent with the trend observed in the last column of Figure 1. This difference may indicate a heterogeneous underlying genetic signal, particularly of a kind that differentiates the SC and LDMS approaches within the GREML framework.

7 Discussion

In this paper, we introduce HEDE, a new method for heritability estimation that harnesses the strengths of \ell_{1} and \ell_{2} regression in high-dimensional settings through a sophisticated ensemble approach. We develop data-driven techniques for selecting an optimal ensemble and fine-tuning key hyperparameters, aiming to enhance statistical performance in heritability estimation. As a result, we present a competitive estimator that surpasses existing fixed effects heritability estimators across diverse signal structures and heritability values. Notably, our approach circumvents bias issues inherent in random effects methods. In summary, our contribution offers a dependable heritability estimator tailored for high-dimensional genetic problems, effectively accommodating underlying signal variations. Importantly, our method maintains consistency guarantees even with adaptive tuning and under minimal assumptions on the covariate distribution. We validate the efficacy of our approach through comprehensive simulations and real-data analysis.

HEDE's computational complexity is primarily driven by the Lasso/ridge solutions, making it more efficient than many existing heritability estimators (e.g., Eigenprism (janson2017eigenprism), which requires an SVD, and EstEqn (chen2022statistical), which involves chained matrix inversions). HEDE is potentially scalable to much larger dimensions via snpnet (rad2020scalable). Furthermore, with proper Lasso/ridge solving techniques, HEDE only needs knowledge of the summary statistics \bm{X}^{\top}\bm{X} and \bm{X}^{\top}\bm{y}. This makes it an attractive option in situations where sharing individual-level data is difficult due to privacy concerns. However, it is important to note that these discussions assume the availability of an accurate estimate of the LD blocks and their strengths. This is a common limitation that other fixed effects methods also face. Any potential errors in covariance estimation could naturally lead to additional errors in heritability estimation. Therefore, improving and scaling covariance estimation in high dimensions could independently benefit heritability estimation. We manage to circumvent this issue by leveraging the hypothesis of independence across chromosomes, which significantly aids us computationally.
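To illustrate the summary-statistics point, the sketch below computes the ridge component and its degrees-of-freedom from X^T X and X^T y alone; this is only one ingredient of HEDE, shown here because its summary-statistics form is explicit, and the function name is illustrative.

import numpy as np

def ridge_from_summary_stats(XtX, Xty, n, lam):
    # Ridge estimate from summary statistics: solve (X^T X / n + lam I) beta = X^T y / n
    p = XtX.shape[0]
    A = XtX / n + lam * np.eye(p)
    beta_ridge = np.linalg.solve(A, Xty / n)
    # Ridge degrees-of-freedom: Tr((X^T X / n + lam I)^{-1} X^T X / n)
    df_ridge = np.trace(np.linalg.solve(A, XtX / n))
    return beta_ridge, df_ridge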

Our theoretical framework relied on two pivotal assumptions: the independence of observed samples and a sub-Gaussian tailed distribution for covariates. Natural settings occur where these assumptions are violated. Within the genetics context, familial relationships and repeated measures in longitudinal studies among observed samples can introduce dependence. For the broader problem of signal-to-noise ratio estimation in various scientific fields, both assumptions might be violated. For instance, applications such as wireless communications (eldar2003asymptotic) and magnetic resonance imaging (benjamini2013shuffle) may introduce challenges such as temporal dependence or heavier-tailed covariate distributions. Exploring analogues of HEDE under such conditions presents an intriguing avenue for future study. Recent advancements in high-dimensional regression, which extend debiasing methodology and proportional asymptotics theory for regularized estimators to these complex scenarios (li2023spectrum; lahiry2023universality), offer valuable insights for further exploration. We defer these investigations to future research.

We focus on continuous traits in this paper. Discrete disease traits occur commonly in genetics. Such situations are typically modeled using logistic, probit, or binomial regression models, depending on the number of possible categorical values of the response. Adapting our current methodology to these scenarios is a crucial avenue for future research. Debiasing methodologies for generalized linear models exist in the literature. A compelling question would be to understand how various debiasing techniques can be ensembled to optimize their benefits for heritability estimation, similar to our approach in this work for the linear model.

References

  • Bayati et al., (2013) Bayati, M., Erdogdu, M. A., and Montanari, A. (2013). Estimating lasso risk and noise level. Advances in Neural Information Processing Systems, 26.
  • Bellec, (2022) Bellec, P. C. (2022). Observable adjustments in single-index models for regularized m-estimators. arXiv preprint arXiv:2204.06990.
  • Bellec and Koriyama, (2024) Bellec, P. C. and Koriyama, T. (2024). Asymptotics of resampling without replacement in robust and logistic regression. arXiv preprint arXiv:2404.02070.
  • Bellec and Zhang, (2019) Bellec, P. C. and Zhang, C.-H. (2019). Second order poincaré inequalities and de-biasing arbitrary convex regularizers when p/n → γ. arXiv preprint arXiv:1912.11943, 2:15–34.
  • Bellec and Zhang, (2022) Bellec, P. C. and Zhang, C.-H. (2022). De-biasing the lasso with degrees-of-freedom adjustment. Bernoulli, 28(2):713–743.
  • Bellec and Zhang, (2023) Bellec, P. C. and Zhang, C.-H. (2023). Debiasing convex regularized estimators and interval estimation in linear models. The Annals of Statistics, 51(2):391–436.
  • Benjamini and Yu, (2013) Benjamini, Y. and Yu, B. (2013). The shuffle estimator for explainable variance in fmri experiments. The Annals of Applied Statistics, pages 2007–2033.
  • Berisa and Pickrell, (2016) Berisa, T. and Pickrell, J. K. (2016). Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics, 32(2):283.
  • Bonnet et al., (2015) Bonnet, A., Gassiat, E., and Lévy-Leduc, C. (2015). Heritability estimation in high dimensional sparse linear mixed models.
  • Cai and Guo, (2020) Cai, T. and Guo, Z. (2020). Semisupervised inference for explained variance in high dimensional linear regression and its applications. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(2):391–419.
  • Candès and Sur, (2020) Candès, E. J. and Sur, P. (2020). The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression.
  • Celentano and Montanari, (2021) Celentano, M. and Montanari, A. (2021). Cad: Debiasing the lasso with inaccurate covariate model. arXiv preprint arXiv:2107.14172.
  • Celentano et al., (2020) Celentano, M., Montanari, A., and Wei, Y. (2020). The lasso with general gaussian designs with applications to hypothesis testing. arXiv preprint arXiv:2007.13716.
  • Chen, (2022) Chen, H. Y. (2022). Statistical inference on explained variation in high-dimensional linear model with dense effects. arXiv preprint arXiv:2201.08723.
  • Clarté et al., (2024) Clarté, L., Vandenbroucque, A., Dalle, G., Loureiro, B., Krzakala, F., and Zdeborová, L. (2024). Analysis of bootstrap and subsampling in high-dimensional regularized regression. arXiv preprint arXiv:2402.13622.
  • Dicker, (2014) Dicker, L. H. (2014). Variance estimation in high-dimensional linear models. Biometrika, 101(2):269–284.
  • Dicker and Erdogdu, (2016) Dicker, L. H. and Erdogdu, M. A. (2016). Maximum likelihood for variance estimation in high-dimensional linear models. In Artificial Intelligence and Statistics, pages 159–167. PMLR.
  • El Karoui and Purdom, (2018) El Karoui, N. and Purdom, E. (2018). Can we trust the bootstrap in high-dimensions? the case of linear models. Journal of Machine Learning Research, 19(5):1–66.
  • Eldar and Chan, (2003) Eldar, Y. C. and Chan, A. M. (2003). On the asymptotic performance of the decorrelator. IEEE Transactions on Information Theory, 49(9):2309–2313.
  • Evans et al., (2018) Evans, L. M., Tahmasbi, R., Vrieze, S. I., Abecasis, G. R., Das, S., Gazal, S., Bjelland, D. W., De Candia, T. R., Consortium, H. R., Goddard, M. E., et al. (2018). Comparison of methods that use whole genome data to estimate the heritability and genetic architecture of complex traits. Nature genetics, 50(5):737–745.
  • Fan et al., (2012) Fan, J., Guo, S., and Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(1):37–65.
  • Guo et al., (2019) Guo, Z., Wang, W., Cai, T. T., and Li, H. (2019). Optimal estimation of genetic relatedness in high-dimensional linear models. Journal of the American Statistical Association, 114(525):358–369.
  • Han and Shen, (2022) Han, Q. and Shen, Y. (2022). Universality of regularized regression estimators in high dimensions. arXiv preprint arXiv:2206.07936.
  • Hou et al., (2019) Hou, K., Burch, K. S., Majumdar, A., Shi, H., Mancuso, N., Wu, Y., Sankararaman, S., and Pasaniuc, B. (2019). Accurate estimation of snp-heritability from biobank-scale data irrespective of genetic architecture. Nature genetics, 51(8):1244–1251.
  • Hu and Li, (2022) Hu, X. and Li, X. (2022). Misspecification analysis of high-dimensional random effects models for estimation of signal-to-noise ratios. arXiv preprint arXiv:2202.06400.
  • Janson et al., (2017) Janson, L., Barber, R. F., and Candes, E. (2017). Eigenprism: inference for high dimensional signal-to-noise ratios. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):1037–1065.
  • (27) Javanmard, A. and Montanari, A. (2014a). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909.
  • (28) Javanmard, A. and Montanari, A. (2014b). Hypothesis testing in high-dimensional regression under the gaussian random design model: Asymptotic theory. IEEE Transactions on Information Theory, 60(10):6522–6554.
  • Jiang et al., (2016) Jiang, J., Li, C., Paul, D., Yang, C., and Zhao, H. (2016). On high-dimensional misspecified mixed model analysis in genome-wide association study. The Annals of Statistics, 44(5):2127–2160.
  • Jiang et al., (2022) Jiang, K., Mukherjee, R., Sen, S., and Sur, P. (2022). A new central limit theorem for the augmented ipw estimator: Variance inflation, cross-fit covariance and beyond. arXiv preprint arXiv:2205.10198.
  • Knowles and Yin, (2017) Knowles, A. and Yin, J. (2017). Anisotropic local laws for random matrices. Probability Theory and Related Fields, 169(1):257–352.
  • Lahiry and Sur, (2023) Lahiry, S. and Sur, P. (2023). Universality in block dependent linear models with applications to nonparametric regression. arXiv preprint arXiv:2401.00344.
  • Lee et al., (2013) Lee, S. H., Yang, J., Chen, G.-B., Ripke, S., Stahl, E. A., Hultman, C. M., Sklar, P., Visscher, P. M., Sullivan, P. F., Goddard, M. E., et al. (2013). Estimation of snp heritability from dense genotype data. The American Journal of Human Genetics, 93(6):1151–1155.
  • Li et al., (2022) Li, R., Chang, C., Justesen, J. M., Tanigawa, Y., Qian, J., Hastie, T., Rivas, M. A., and Tibshirani, R. (2022). Fast lasso method for large-scale and ultrahigh-dimensional cox model with applications to uk biobank. Biostatistics, 23(2):522–540.
  • Li and Sur, (2023) Li, Y. and Sur, P. (2023). Spectrum-aware adjustment: A new debiasing framework with applications to principal components regression. arXiv preprint arXiv:2309.07810.
  • Liang and Sur, (2022) Liang, T. and Sur, P. (2022). A precise high-dimensional asymptotic theory for boosting and minimum-l1-norm interpolated classifiers. The Annals of Statistics, 50(3):1669–1695.
  • Loh et al., (2015) Loh, P.-R., Tucker, G., Bulik-Sullivan, B. K., Vilhjalmsson, B. J., Finucane, H. K., Salem, R. M., Chasman, D. I., Ridker, P. M., Neale, B. M., Berger, B., et al. (2015). Efficient bayesian mixed-model analysis increases association power in large cohorts. Nature genetics, 47(3):284–290.
  • Manolio et al., (2009) Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., McCarthy, M. I., Ramos, E. M., Cardon, L. R., Chakravarti, A., et al. (2009). Finding the missing heritability of complex diseases. Nature, 461(7265):747–753.
  • Miolane and Montanari, (2021) Miolane, L. and Montanari, A. (2021). The distribution of the lasso: Uniform control over sparse balls and adaptive parameter tuning. The Annals of Statistics, 49(4):2313–2335.
  • Psychiatric GWAS Consortium Coordinating Committee, (2009) Psychiatric GWAS Consortium Coordinating Committee (2009). Genomewide association studies: history, rationale, and prospects for psychiatric disorders. American Journal of Psychiatry, 166(5):540–556.
  • Rad and Maleki, (2020) Rad, K. R. and Maleki, A. (2020). A scalable estimate of the out-of-sample prediction error via approximate leave-one-out cross-validation. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(4):965–996.
  • Rockafellar, (2015) Rockafellar, R. T. (2015). Convex analysis.
  • Sudlow et al., (2015) Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M., et al. (2015). Uk biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine, 12(3):e1001779.
  • Sun and Zhang, (2012) Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika, 99(4):879–898.
  • Sur and Candès, (2019) Sur, P. and Candès, E. J. (2019). A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences, 116(29):14516–14525.
  • Sur et al., (2019) Sur, P., Chen, Y., and Candès, E. J. (2019). The likelihood ratio test in high-dimensional logistic regression is asymptotically a rescaled chi-square. Probability theory and related fields, 175:487–558.
  • Thrampoulidis et al., (2015) Thrampoulidis, C., Oymak, S., and Hassibi, B. (2015). Regularized linear regression: A precise analysis of the estimation error. In Conference on Learning Theory, pages 1683–1709. PMLR.
  • Van de Geer et al., (2014) Van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models.
  • Vershynin, (2010) Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.
  • Vershynin, (2018) Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press.
  • Verzelen and Gassiat, (2018) Verzelen, N. and Gassiat, E. (2018). Adaptive estimation of high-dimensional signal-to-noise ratios. Bernoulli, 24(4B):3683–3710.
  • Wainwright, (2019) Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press.
  • Wellner et al., (2013) Wellner, J. et al. (2013). Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media.
  • Yang et al., (2015) Yang, J., Bakshi, A., Zhu, Z., Hemani, G., Vinkhuyzen, A. A., Lee, S. H., Robinson, M. R., Perry, J. R., Nolte, I. M., van Vliet-Ostaptchouk, J. V., et al. (2015). Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nature genetics, 47(10):1114–1120.
  • Yang et al., (2011) Yang, J., Lee, S. H., Goddard, M. E., and Visscher, P. M. (2011). Gcta: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1):76–82.
  • Zhang and Zhang, (2014) Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B: Statistical Methodology, pages 217–242.
  • Zhao et al., (2022) Zhao, Q., Sur, P., and Candes, E. J. (2022). The asymptotic distribution of the mle in high-dimensional logistic models: Arbitrary covariance. Bernoulli, 28(3):1835–1861.
  • Zou and Hastie, (2005) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2):301–320.

Appendix A Additional Methodology Details

A.1 Additional Technical Assumptions

We present additional technical assumptions needed for our mathematical guarantees in Section 4. The following assumption is needed in the case of independent covariates:

Assumption 2.
  1. 1.

    The design matrix \bm{X} satisfies the following: for any \gamma<1, there exists a constant \kappa_{\min}>0 such that \mathbb{P}(\frac{1}{\sqrt{n}}\kappa_{-}(\bm{X},\min\{p,\gamma n\})\leq\kappa_{\min})\rightarrow 0 as n\rightarrow\infty, where \kappa_{-}(\bm{X},s) denotes the minimum singular value of \bm{X}_{S} over all subsets S of the columns with |S|\leq s.

  2. 2.

    The tuning parameters are bounded, that is, there exists λmin,λmax\lambda_{\min},\lambda_{\max} such that 0<λminλL,λRλmax<0<\lambda_{\min}\leq\lambda_{\textrm{L}},\lambda_{\textrm{R}}\leq\lambda_{\max}<\infty.

Assumption 2(1) states that the minimum singular value of \bm{X}, across all column subsets of a certain size, is lower bounded by some positive constant with high probability. This assumption is required for our universality proof. In fact, we conjecture that this assumption holds given Assumption 1(1) and Assumption 1(2) (cf. Section B.5.4 in celentano2020lasso for a proof in the Gaussian case). We leave the proof of this assumption to future work.

Assumption 2(2) restricts \lambda_{\textrm{L}},\lambda_{\textrm{R}} to a predetermined range. We require a specific lower bound \lambda_{\min}=\lambda_{\min}(\sigma_{\max}^{2},\delta). Its functional form is complicated, but it plays a crucial role in our proof in Section I (which builds upon results from han2022universality). On the other hand, \lambda_{\max} is not restricted and can take any proper value. See Section A.2 for the heuristic method that we use to determine (\lambda_{\min},\lambda_{\max}) in practice.

The following assumption is needed when the covariance is estimated in the correlated case, which effectively perturbs HEDE by a matrix \bm{A}\approx\bm{I}:

Assumption 3.

Let 𝐀p×p\bm{A}\in\mathbb{R}^{p\times p} be a sequence of random matrices. Denote 𝐀(𝐛):=12n𝐲𝐗𝐀𝐛22+λLn𝐛1\mathcal{L}_{\bm{A}}(\bm{b}):=\frac{1}{2n}\|\bm{y}-\bm{X}\bm{A}\bm{b}\|_{2}^{2}+\frac{\lambda_{\textrm{L}}}{\sqrt{n}}\|\bm{b}\|_{1} to be the perturbed Lasso cost function when we replace 𝐗\bm{X} by 𝐗𝐀\bm{X}\bm{A}, and denote 𝛃^(𝐀):=argmin𝐛𝐀(𝐛)\hat{\bm{\beta}}(\bm{A}):=\operatorname*{arg\,min}_{\bm{b}}\mathcal{L}_{\bm{A}}(\bm{b}) to be the perturbed Lasso solution. We say 𝐀\bm{A} satisfies Assumption 3 if

  1. 1.

    \sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}(\bm{A})\|\leq M for some constant M with probability converging to 1 as n\rightarrow\infty

  2. 2.

    λL,λL[λmin,λmax],λL,𝑨(𝜷^λL(𝑨))λL,𝑨(𝜷^λL(𝑨))+K|λLλL|\forall\lambda_{\textrm{L}},\lambda_{\textrm{L}}^{\prime}\in[\lambda_{\min},\lambda_{\max}],\mathcal{L}_{\lambda_{\textrm{L}}^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda_{\textrm{L}}}(\bm{A}))\leq\mathcal{L}_{\lambda_{\textrm{L}}^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda_{\textrm{L}}^{\prime}}(\bm{A}))+K|\lambda_{\textrm{L}}-\lambda_{\textrm{L}}^{\prime}| with probability converging to 11 as nn\rightarrow\infty.

The conditions in Assumption 3 are required as technical steps in our proof under sub-Gaussian designs. They are automatically satisfied when the design \bm{X} is Gaussian (see proofs in Section G).

A.2 Hyperparameter Selection

Our algorithm necessitates a tuning parameter range [\lambda_{\min},\lambda_{\max}]. Assumption 2(2) defines \lambda_{\min}(\sigma_{\max}^{2},\delta) as a function of the unknown \sigma_{\max}^{2}, together with an unconstrained \lambda_{\max}, for purely technical reasons; neither gives much practical guidance. Here we propose to determine the range via the empirical degrees-of-freedom of the Lasso and Ridge, defined in (19).

For \lambda_{\max}, there is no point increasing it indefinitely, since \hat{\bm{\beta}}_{\textrm{L}},\hat{\bm{\beta}}_{\textrm{R}} will approach \bm{0}, yielding very similar \hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d} values. Thus, we choose t_{\min}=0.01 as a lower bound on the empirical degrees-of-freedom ratio, thereby filtering out excessively large \lambda_{\textrm{L}},\lambda_{\textrm{R}} values.

For \lambda_{\min}, we similarly pick an upper bound for \frac{\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{\textrm{L}})\|_{0}}{n} and \frac{1}{n}\operatorname{Tr}((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda_{\textrm{R}}\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}). Due to numerical precision and stability issues, glmnet (and possibly other numerical solvers) yields incorrect solutions for tiny values of \lambda. Also, the lower bound \lambda_{\min}(\sigma_{\max}^{2},\delta), although unknown, prohibits tiny values of \lambda. Therefore, simply choosing an upper bound such as t_{\max}=0.99 does not suffice. Heuristically, we find any value from 0.3 to 0.7 acceptable, and we take t_{\max}=0.5 for simplicity.

Once the ranges for \lambda_{\textrm{L}} and \lambda_{\textrm{R}} are determined, we discretize them on the log scale with grid width 0.1. This yields the final discrete collections of \lambda_{\textrm{L}},\lambda_{\textrm{R}}, over which we minimize \hat{\tau}_{c}^{2}.
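A minimal sketch of this grid construction follows; the initial bounds lam_lo, lam_hi and the function df_fraction (returning the relevant empirical degrees-of-freedom ratio for a given penalty) are placeholders, not part of the method's specification.

import numpy as np

def candidate_grid(lam_lo, lam_hi, df_fraction, t_min=0.01, t_max=0.5, width=0.1):
    # Discretize [lam_lo, lam_hi] on the log scale with grid width 0.1
    log_grid = np.arange(np.log(lam_lo), np.log(lam_hi) + width, width)
    lambdas = np.exp(log_grid)
    # Keep only penalties whose empirical degrees-of-freedom ratio lies in [t_min, t_max]
    kept = [lam for lam in lambdas if t_min <= df_fraction(lam) <= t_max]
    return np.array(kept)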

A.3 Covariance Estimation

In Section 3.4, we discussed estimating the blocked population covariance matrix by block-wise sample covariance matrices. The following proposition supports its consistency under mild conditions:

Proposition A.1.

Let 𝚺p×p\bm{\Sigma}\in\mathbb{R}^{p\times p} be a real symmetric matrix whose spectral distribution has bounded support μ𝚺[κmin,κmax]\mu_{\bm{\Sigma}}\subset[\kappa_{\min},\kappa_{\max}] with 0<κminκmax<0<\kappa_{\min}\leq\kappa_{\max}<\infty. Further suppose that 𝚺\bm{\Sigma} is block-diagonal with blocks 𝚺1,,𝚺k\bm{\Sigma}_{1},...,\bm{\Sigma}_{k} such that the size of each block 𝚺i\bm{\Sigma}_{i} is bounded by some constant mm. Let 𝐗n×p\bm{X}\in\mathbb{R}^{n\times p} with each row 𝐱i\bm{x}_{i\bullet} satisfying 𝐱i=𝐳i𝚺1/2\bm{x}_{i\bullet}=\bm{z}_{i\bullet}\bm{\Sigma}^{1/2} where 𝐳i\bm{z}_{i\bullet} has independent sub-Gaussian entries with bounded sub-Gaussian norm. If we estimate 𝚺^\hat{\bm{\Sigma}} as also block-diagonal with blocks 𝚺^1,,𝚺^k\hat{\bm{\Sigma}}_{1},...,\hat{\bm{\Sigma}}_{k} where 𝚺^i=1n𝐗i𝐗i\hat{\bm{\Sigma}}_{i}=\frac{1}{n}\bm{X}_{i}^{\top}\bm{X}_{i} are corresponding sample covariances, then as p/nδ(0,)p/n\rightarrow\delta\in(0,\infty), we have 𝚺^𝚺op𝑃0\|\hat{\bm{\Sigma}}-\bm{\Sigma}\|_{\textnormal{op}}\overset{P}{\rightarrow}0.

Proof.

The proof is straightforward by applying (wainwright2019high, Theorem 6.5) on each block component, in conjunction with a union bound. Thus we omit it here. ∎
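A minimal sketch of the block-wise plug-in estimator in Proposition A.1 follows; the column blocks (e.g., LD blocks) are assumed to be given, and all off-block entries are set to zero. The names are illustrative.

import numpy as np

def blockwise_covariance(X, block_slices):
    # block_slices: list of slice objects marking the given column blocks,
    # e.g. [slice(0, 50), slice(50, 120), ...]
    n, p = X.shape
    Sigma_hat = np.zeros((p, p))
    for blk in block_slices:
        Xb = X[:, blk]
        # Sample covariance of each block; off-block entries stay zero
        Sigma_hat[blk, blk] = Xb.T @ Xb / n
    return Sigma_hat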

Appendix B Additional Simulation Details

B.1 Random effects comparison: Heritability definition

The common setting in Section 5.1 assumes both \bm{X} and \bm{\beta} are zero-mean random and that \bm{X} has population covariance \bm{\Sigma}. Under the random effects setting, a random \bm{X} is allowed, and the denominator in the heritability definition (3) reads

Var(𝒙i𝜷)=𝔼[Var[𝒙i𝜷|𝜷]]+Var[𝔼[𝒙i𝜷|𝜷]]=𝔼[𝜷𝚺𝜷],\text{Var}(\bm{x}_{i\bullet}^{\top}\bm{\beta})=\mathbb{E}[\text{Var}[\bm{x}_{i\bullet}^{\top}\bm{\beta}|\bm{\beta}]]+\text{Var}[\mathbb{E}[\bm{x}_{i\bullet}^{\top}\bm{\beta}|\bm{\beta}]]=\mathbb{E}[\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}],

which is a well-defined population quantity. Yet, \mathbb{E}[\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}] does not have a closed-form formula when \bm{\beta} is generated with stratifications as in Section 5.1, so we approximated it with 100 Monte Carlo iterations.

Under the fixed effects setting, conditional on a given \bm{\beta}, the same definition reads

Var(𝒙i𝜷)=𝜷𝚺𝜷,\text{Var}(\bm{x}_{i\bullet}^{\top}\bm{\beta})=\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta},

which is an empirical quantity that fluctuates around the population quantity \mathbb{E}[\bm{\beta}^{\top}\bm{\Sigma}\bm{\beta}]. Now, with 100 random draws of \bm{\beta}, estimating the corresponding 100 individual values of \bm{\beta}^{\top}\bm{\Sigma}\bm{\beta} approximately amounts to estimating this expectation. This, therefore, facilitates fair comparisons between random and fixed effects methods.

B.2 Random effects comparison: Signal generation details

To generate \bm{\beta} with varying non-zero entry locations and distributions, we employed the following generation process. 1) We fixed two levels of concentration: c_{l}=0.05 and c_{h}=0.5. 2) We determine k_{s}, the number of stratifications needed. For uniformly distributed non-zero entries, k_{s}=1. For LDMS stratification, k_{s}=12: the cross product of the 4 LD stratifications and 3 MAF stratifications mentioned in Section 5.1. For LDMS plus block stratification, k_{s}=24: the cross product of the previous 12 LDMS stratifications and 2 blocks (the last LD block versus the rest, as specified in berisa2016approximately). 3) For each of the k_{s} stratifications, we alternately assign c_{l} and c_{h} as the concentration, and select non-zero entry locations uniformly at random with the assigned concentration. In the special case k_{s}=1, c_{l} is assigned. 4) After selecting the non-zero entry locations, we count the number of non-zero entries K and calculate the entrywise variance \sigma_{+}^{2}=h^{2}/K, where h^{2} is the desired heritability value. 5) Lastly, we pick the non-zero entry distribution. For random normal entries, the distribution is \mathcal{N}(0,\sigma_{+}^{2}). For mixture-of-normal entries, the distribution is \mathcal{N}(\pm\frac{\sigma_{+}}{\sqrt{10}},\frac{9\sigma_{+}^{2}}{10}) (an equal mixture of two symmetric normals).
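The following sketch illustrates steps 3) through 5) for a generic stratified layout; the stratum labels (LD/MAF/block assignments) are assumed to be supplied externally, and all names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def stratified_signal(strata_labels, h2, c_l=0.05, c_h=0.5, mixture=False):
    # strata_labels: length-p numpy array assigning each coordinate to one of k_s strata
    p = len(strata_labels)
    nonzero = np.zeros(p, dtype=bool)
    for idx, s in enumerate(np.unique(strata_labels)):
        conc = c_l if idx % 2 == 0 else c_h      # alternate c_l and c_h across strata
        mask = strata_labels == s
        nonzero[mask] = rng.random(mask.sum()) < conc
    K = max(int(nonzero.sum()), 1)
    sigma2 = h2 / K                              # entrywise variance h^2 / K
    beta = np.zeros(p)
    if mixture:
        # equal mixture of N(+sigma/sqrt(10), 9 sigma^2/10) and N(-sigma/sqrt(10), 9 sigma^2/10)
        signs = rng.choice([-1.0, 1.0], size=K)
        beta[nonzero] = signs * np.sqrt(sigma2 / 10) + rng.normal(0.0, np.sqrt(0.9 * sigma2), size=K)
    else:
        beta[nonzero] = rng.normal(0.0, np.sqrt(sigma2), size=K)
    return beta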

B.3 Fixed effects comparisons: Unbiasedness and a closer look

In this section, we present two plots. The first investigates the bias properties of different fixed effects methods. For a representative example, we chose a setting with n=1000, p=10000, h^{2}=0.1, and \kappa=0.003, though this selection is not particularly unique. Using a specific \bm{\beta}, we generated 100 random instances of \bm{X}. The box plots in Figure 4 depict the heritability estimates from all the methods under consideration. We observe that every method is (approximately) unbiased, which we confirmed in additional settings. We also note that truncating negative estimates to 0 would yield lower MSEs but more bias, so we opted to keep the negative estimates.

Figure 4: Same setting as Figure 2, showing box plots of estimates for a certain realized signal with 100 draws of design matrices. All methods are unbiased.

As a second line of investigation, we recall Figure 2 from the main manuscript, which showed that the AMP MSEs are often much higher than those of the remaining methods. In this light, we zoom into Figure 2 to obtain a clearer picture of the performance of our method relative to the others. Figure 5 shows this zoomed-in version. Note that our relative superiority is mostly maintained across diverse settings.

Figure 5: Zoomed-in version of Figure 2, excluding AMP.

B.4 Fixed effects comparison: Designs with larger sub-Gaussian norms

To generate \bm{X} with larger sub-Gaussian norms, we followed the exact same setup as in Section 5.2, except that \pi_{j}\sim\text{Unif}[0.005,0.01] instead of \text{Unif}[0.01,0.5] in (2). This mimics lower-frequency variants, leading to larger sub-Gaussian norms in the normalized \bm{X}. Comparing Figures 6 and 7 with Figures 2 and 3, we observe similar trends across the board, with HEDE showing either similar or dominating performance.

Refer to caption
Figure 6: Same setting as Figure 2, except that πjUnif[0.005,0.01]\pi_{j}\sim\text{Unif}[0.005,0.01].
Refer to caption
Figure 7: Same setting as Figure 3, except that πjUnif[0.005,0.01]\pi_{j}\sim\text{Unif}[0.005,0.01].

Appendix C Proof Notations and Conventions

This section introduces notations that we will use through the rest of the proof. We consider both the Lasso and ridge estimators in the context of linear models. Several important quantities in our calculations appear in two versions—one computed based on the ridge and the other based on the Lasso. We use subscripts R,L\textrm{R},\textrm{L} respectively to denote the versions of these quantities corresponding to the ridge and the Lasso. Recall that the setting we study may be expressed via the linear model 𝒚=𝑿𝜷+σ𝒛\bm{y}=\bm{X}\bm{\beta}+\sigma\bm{z} where 𝒚n\bm{y}\in\mathbb{R}^{n} denote the responses, 𝑿n×p\bm{X}\in\mathbb{R}^{n\times p} the design matrix, 𝜷p\bm{\beta}\in\mathbb{R}^{p} the unknown regression coefficients, and 𝒛\bm{z} the noise. Then for k=L,Rk=\textrm{L},\textrm{R}, the corresponding Lasso and Ridge estimators 𝜷^kp\hat{\bm{\beta}}_{k}\in\mathbb{R}^{p} are defined as

𝜷^k=argmin𝒃{12n𝒚𝑿𝒃22+Ωk(𝒃)},\hat{\bm{\beta}}_{k}=\operatorname*{arg\,min}_{\bm{b}}\left\{\frac{1}{2n}\|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}+\Omega_{k}(\bm{b})\right\}, (17)

where ΩL(𝒃)=λLn𝒃1,ΩR(𝒃)=λR2𝒃22\Omega_{\textrm{L}}(\bm{b})=\frac{\lambda_{\textrm{L}}}{\sqrt{n}}\|\bm{b}\|_{1},\Omega_{\textrm{R}}(\bm{b})=\frac{\lambda_{\textrm{R}}}{2}\|\bm{b}\|_{2}^{2}.

Our methodology relies on debiased versions of these estimators, defined as

𝜷^kd=𝜷^k+𝑿(𝒚𝑿𝜷^k)ndf^k,fork=L,R,\hat{\bm{\beta}}_{k}^{d}=\hat{\bm{\beta}}_{k}+\frac{\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{k})}{n-\hat{\textnormal{df}}_{k}},\,\,\text{for}\,\,k=\textrm{L},\textrm{R}, (18)

where df^k\hat{\textnormal{df}}_{k}\in\mathbb{R} are terms that may be interpreted as degrees-of-freedom, and are defined in each of the aforementioned cases as follows,

df^L\displaystyle\hat{\text{df}}_{\textrm{L}} =𝜷^L0\displaystyle=\|\hat{\bm{\beta}}_{\textrm{L}}\|_{0} (19)
df^R\displaystyle\hat{\text{df}}_{\textrm{R}} =Tr((1n𝑿𝑿+λR𝑰)11n𝑿𝑿).\displaystyle=\text{Tr}((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda_{\textrm{R}}\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}).
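For reference, a minimal Python sketch of (17)–(19) follows; the use of scikit-learn's Lasso is an implementation choice for illustration only (its objective \|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}/(2n)+\alpha\|\bm{b}\|_{1} matches (17) with \alpha=\lambda_{\textrm{L}}/\sqrt{n}), and the function names are illustrative.

import numpy as np
from sklearn.linear_model import Lasso

def debiased_lasso(X, y, lam_L):
    n, p = X.shape
    # Lasso fit matching (17): alpha = lam_L / sqrt(n)
    fit = Lasso(alpha=lam_L / np.sqrt(n), fit_intercept=False, max_iter=10000).fit(X, y)
    beta_hat = fit.coef_
    df_hat = np.count_nonzero(beta_hat)                   # df-hat_L of (19)
    return beta_hat + X.T @ (y - X @ beta_hat) / (n - df_hat)

def debiased_ridge(X, y, lam_R):
    n, p = X.shape
    A = X.T @ X / n + lam_R * np.eye(p)
    beta_hat = np.linalg.solve(A, X.T @ y / n)            # ridge solution of (17)
    df_hat = np.trace(np.linalg.solve(A, X.T @ X / n))    # df-hat_R of (19)
    return beta_hat + X.T @ (y - X @ beta_hat) / (n - df_hat)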

Our proofs use some intermediate quantities that we introduce next. First, we will require intermediate quantities that replace df^k\hat{\textrm{df}}_{k} by a set of parameters dfk\textrm{df}_{k} that rely on underlying problem parameters. For k=L,Rk=\textrm{L},\textrm{R}, we define these to be

𝜷~kd=𝜷^k+𝑿(𝒚𝑿𝜷^k)ndfk.\tilde{\bm{\beta}}_{k}^{d}=\hat{\bm{\beta}}_{k}+\frac{\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{k})}{n-\textnormal{df}_{k}}. (20)

The exact definition of dfk\textnormal{df}_{k} is involved, so we defer its presentation to 24. Second, we need another set of intermediate quantities 𝜷^kf\hat{\bm{\beta}}_{k}^{f} that form solutions to optimization problems of the form (17), but where the observed data {𝒚,𝑿}\{\bm{y},\bm{X}\} is replaced by random variables 𝜷^kf,d\hat{\bm{\beta}}_{k}^{f,d} that can be expressed as gaussian perturbations of the true signal, with appropriate adjustments to the penalty function. For k=L,Rk=\textrm{L},\textrm{R}, these are defined as follows:

𝜷^kf:=ηk(𝜷^kf,d;ζk):=argmin𝒃{12𝜷^kf,d𝒃22+1ζkΩk(𝒃)},\hat{\bm{\beta}}_{k}^{f}:=\eta_{k}(\hat{\bm{\beta}}_{k}^{f,d};\zeta_{k}):=\operatorname*{arg\,min}_{\bm{b}}\left\{\frac{1}{2}\|\hat{\bm{\beta}}_{k}^{f,d}-\bm{b}\|_{2}^{2}+\frac{1}{\zeta_{k}}\Omega_{k}(\bm{b})\right\}, (21)
where𝜷^kf,d=𝜷+𝒈kf,(𝒈Lf,𝒈Rf)𝒩(𝟎,𝑺𝑰p),\text{where}\quad\hat{\bm{\beta}}_{k}^{f,d}=\bm{\beta}+\bm{g}_{k}^{f},\quad(\bm{g}_{\textrm{L}}^{f},\bm{g}_{\textrm{R}}^{f})\sim\mathcal{N}(\bm{0},\bm{S}\otimes\bm{I}_{p}), (22)

with 𝑺\bm{S} of the form

𝑺:=(τL2ρτLτRρτLτRτR2),\bm{S}:=\begin{pmatrix}\tau_{\textrm{L}}^{2}&\rho\tau_{\textrm{L}}\tau_{\textrm{R}}\\ \rho\tau_{\textrm{L}}\tau_{\textrm{R}}&\tau_{\textrm{R}}^{2}\end{pmatrix}, (23)

for suitable choices of τL,τR,ρ,ζk\tau_{\textrm{L}},\tau_{\textrm{R}},\rho,\zeta_{k} that we define later in (24). Our exposition so far refrains from providing additional insights regarding the necessity of these intermediate quantities–however, the role of these quantities will unravel in due course through the proof. As an aside, note that the randomness in 𝜷^kf,d,𝜷^kf\hat{\bm{\beta}}_{k}^{f,d},\hat{\bm{\beta}}_{k}^{f} comes from 𝒈kf\bm{g}_{\textrm{k}}^{f}, which is independent of the observed data. We use the superscript ‘f’ (standing for fixed) to denote that these do not depend on our observed data.
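For later reference, the minimizations in (21) admit standard closed forms (a routine computation from the definitions of \Omega_{\textrm{L}} and \Omega_{\textrm{R}}, recorded here for convenience): applied coordinate-wise,

\eta_{\textrm{L}}(\bm{x};\zeta)_{j}=\operatorname{sign}(x_{j})\left(|x_{j}|-\frac{\lambda_{\textrm{L}}}{\zeta\sqrt{n}}\right)_{+},\qquad\eta_{\textrm{R}}(\bm{x};\zeta)=\frac{\bm{x}}{1+\lambda_{\textrm{R}}/\zeta}.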

Our aforementioned notations are complete once we define the parameters dfk\textrm{df}_{k} from (20), ζk\zeta_{k} from (21), and 𝑺\bm{S} from (23). We define these as solutions to the following system of equations in the variables {𝑺ˇ,ζˇk,dfˇk,k=L,R}\{\check{\bm{S}},\check{\zeta}_{k},\check{\textnormal{df}}_{k},k=\textrm{L},\textrm{R}\}.

𝑺ˇ\displaystyle\check{\bm{S}} =1n(σ2𝑰+𝔼[(ηL(𝜷^Lf,d;ζˇL)𝜷,ηR(𝜷^Rf,d;ζˇR)𝜷)(ηL(𝜷^Lf,d;ζˇL)𝜷,ηR(𝜷^Rf,d;ζˇR)𝜷)])\displaystyle=\frac{1}{n}(\sigma^{2}\bm{I}+\mathbb{E}[(\eta_{\textrm{L}}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d};\check{\zeta}_{\textrm{L}})-\bm{\beta},\eta_{\textrm{R}}(\hat{\bm{\beta}}_{\textrm{R}}^{f,d};\check{\zeta}_{\textrm{R}})-\bm{\beta})^{\top}(\eta_{\textrm{L}}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d};\check{\zeta}_{\textrm{L}})-\bm{\beta},\eta_{\textrm{R}}(\hat{\bm{\beta}}_{\textrm{R}}^{f,d};\check{\zeta}_{\textrm{R}})-\bm{\beta})]) (24)
dfˇk\displaystyle\check{\textnormal{df}}_{k} =𝔼[divηk(𝜷^kf,d;ζˇk)],k=L,R\displaystyle=\mathbb{E}[\text{div}\eta_{k}(\hat{\bm{\beta}}_{k}^{f,d};\check{\zeta}_{k})],k=\textrm{L},\textrm{R}
ζˇk\displaystyle\check{\zeta}_{k} =1dfˇkn,k=L,R,\displaystyle=1-\frac{\check{\textnormal{df}}_{k}}{n},k=\textrm{L},\textrm{R},

where divηk(𝜷^kf,d;ζˇk):=j=1p𝜷^k,jf,dηk(𝜷^kf,d;ζˇk)\text{div}\eta_{k}(\hat{\bm{\beta}}_{k}^{f,d};\check{\zeta}_{k}):=\sum_{j=1}^{p}\frac{\partial}{\partial\hat{\bm{\beta}}_{k,j}^{f,d}}\eta_{k}(\hat{\bm{\beta}}_{k}^{f,d};\check{\zeta}_{k}) is defined as the divergence of ηk(𝜷^kf,d;ζˇk)\eta_{k}(\hat{\bm{\beta}}_{k}^{f,d};\check{\zeta}_{k}) with respect to its first argument and recall that we defined ηk(,)\eta_{k}(\cdot,\cdot) in (21).

Extracting the first and the last entries of the first equation in (24), we observe that \tau_{\textrm{L}},\zeta_{\textrm{L}},\textnormal{df}_{\textrm{L}} depend on \lambda_{\textrm{L}} and not on \lambda_{\textrm{R}}, and similarly for the corresponding Ridge parameters. We also note that \tau_{\textrm{L}\textrm{R}} depends on both \lambda_{\textrm{L}} and \lambda_{\textrm{R}}. We will use this observation multiple times in our proofs later.

The system of equations (24) first arose in the context of the problem studied in celentano2021cad, and Lemma C.1 of that paper establishes the uniqueness of the solutions. Furthermore, it follows that there exist positive constants (\tau_{\min},\tau_{\max},\zeta_{\min},\rho_{\max}) such that

τmin2+σ2<nτk2<τmax2+σ2,ζmin<ζk1,|ρ|<ρmax<1,\tau_{\min}^{2}+\sigma^{2}<n\tau_{k}^{2}<\tau_{\max}^{2}+\sigma^{2},\zeta_{\min}<\zeta_{k}\leq 1,|\rho|<\rho_{\max}<1, (25)

for k=L,Rk=L,R where (τL,τR,ζL,ζR,ρ)(\tau_{L},\tau_{R},\zeta_{L},\zeta_{R},\rho) denotes the unique solution to equation (24).

Whenever we mention constants in our proof below, we mean values that only depend on the model parameters {κmin,κmax,δ,σmax,λmin,λmax}\{\kappa_{\min},\kappa_{\max},\delta,\sigma_{\max},\lambda_{\min},\lambda_{\max}\} laid out in Assumptions 1 and 2, and do not depend on any other variable (especially n,pn,p, and realization of any random quantities).

Finally, in Sections D–H.4, we prove our results under the stylized setting where the design matrix entries are i.i.d. \mathcal{N}(0,1) and the noise vector entries are i.i.d. \mathcal{N}(0,1). In Section I we establish universality results that show that the same conclusions hold under the setting of Assumptions 1 and 2, where the covariate and error distributions are more general. To avoid confusion, we define below a stylized version of Assumptions 1 and 2, where everything remains the same except that the design and error distributions are taken to be i.i.d. Gaussian. So, Sections D–H.4 work under Assumption 4, while Section I works under Assumptions 1 and 2.

Assumption 4.
  1. 1.

    Same as Assumption 1(1).

  2. 2.

    Each entry of 𝑿\bm{X} satisfies Xiji.i.d.𝒩(0,1)X_{ij}\overset{i.i.d.}{\sim}\mathcal{N}(0,1).

  3. 3.

    Same as Assumption 1(3).

  4. 4.

    The noise ϵ\bm{\epsilon} satisfies ϵii.i.d.𝒩(0,σ2)\epsilon_{i}\overset{i.i.d.}{\sim}\mathcal{N}(0,\sigma^{2}), with σ2σmax2\sigma^{2}\leq\sigma_{\max}^{2}.

  5. 5.

    Same as Assumption 2(2).

Note that we don’t need Assumption 2(1) since it is true for Gaussians (see B.5.4 in celentano2020lasso).

Appendix D Proof of Theorem 4.3

Theorem 4.3 forms the backbone of our results, so we begin by presenting its proof here. Towards this goal, we first prove the following slightly different version.

Theorem D.1.

Suppose that Assumption 4 holds. Then for any 11-Lipschitz function ϕβ:(p)3\phi_{\beta}:(\mathbb{R}^{p})^{3}\rightarrow\mathbb{R}, we have

supλL,λR[λmin,λmax]|ϕβ(𝜷~Ld,𝜷~Rd,𝜷)𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)]|𝑃0,\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\tilde{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})]|\overset{P}{\rightarrow}0,

where 𝛃~kd,𝛃^kf,d\tilde{\bm{\beta}}_{k}^{d},\hat{\bm{\beta}}_{k}^{f,d} are as defined in (20), (22) for k=L,Rk=\textrm{L},\textrm{R} respectively.

Theorem D.1 differs from Theorem 4.3 in that 𝜷^kd\hat{\bm{\beta}}_{k}^{d} is now replaced by 𝜷~kd\tilde{\bm{\beta}}_{k}^{d}. The difference between these lies in the fact that df^k\hat{\textrm{df}}_{k} is replaced by dfk\textrm{df}_{k} from the former to the latter (c.f. Eqns 18 and 20).

Given Theorem D.1, the proof of Theorem 4.3 follows a two-step procedure: we first establish that the empirical quantities \hat{\textrm{df}}_{k} and the parameters \textrm{df}_{k} are uniformly close asymptotically. This allows us to establish that \tilde{\bm{\beta}}_{k}^{d} and \hat{\bm{\beta}}_{k}^{d} are asymptotically close. Theorem 4.3 then follows by combining these facts with Theorem D.1 and the Lipschitz property of \phi_{\beta}. We formalize these arguments below.

Lemma D.2.

Recall the definitions of df^k\hat{\textnormal{df}}_{k} and dfk\textnormal{df}_{k} from (19), (24). Under Assumption 4, for k=L,Rk=\textrm{L},\textrm{R},

supλk[λmin,λmax]|df^kpdfkp|𝑃0.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\hat{\textnormal{df}}_{k}}{p}-\frac{\textnormal{df}_{k}}{p}\right|\overset{P}{\rightarrow}0.
Proof.

(miolane2021distribution, Theorem F.1) established the aforementioned result for k=\textrm{L}.

For k=\textrm{R}, (celentano2021cad, Lemma H.1) (which further cites (knowles2017anisotropic, Theorem 3.7)) showed that \hat{\textnormal{df}}_{\textrm{R}}/p converges to \textnormal{df}_{\textrm{R}}/p for any fixed \lambda_{\textrm{R}}. So for our purposes it suffices to extend this argument to hold over all \lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}] simultaneously. We achieve this below.

Recall from Eqn. 24 that df^R/p=1λRTr(1p(1n𝑿𝑿+λR𝑰)1)\hat{\textnormal{df}}_{\textrm{R}}/p=1-\lambda_{\textrm{R}}\text{Tr}(\frac{1}{p}(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda_{\textrm{R}}\bm{I})^{-1}), where the trace on the right hand side is the negative resolvent of 1n𝑿𝑿\frac{1}{n}\bm{X}^{\top}\bm{X} evaluated at λR-\lambda_{\textrm{R}} and normalized by 1/p1/p. (knowles2017anisotropic, Theorem 3.7) (with some notation-transforming algebra) states that this negative normalized resolvent converges in probability to (1dfR/p)/λR(1-\textnormal{df}_{\textrm{R}}/p)/\lambda_{\textrm{R}} with fluctuation of level O(N1/2κ1/4)O(N^{-1/2}\kappa^{-1/4}), where κ=λR+1\kappa=\lambda_{\textrm{R}}+1 is the distance from λR-\lambda_{\textrm{R}} to the spectrum of 𝑰\bm{I}. Therefore, df^R/p\hat{\textnormal{df}}_{\textrm{R}}/p concentrates around dfR/p\textnormal{df}_{\textrm{R}}/p with fluctuation of level O(N1/2λR(1+λR)1/4)=O(N1/2(1+λmax)3/4)O(N^{-1/2}\lambda_{\textrm{R}}(1+\lambda_{\textrm{R}})^{-1/4})=O(N^{-1/2}(1+\lambda_{\max})^{3/4}), for all λR[λmin,λmax]\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]. ∎

As a direct consequence, we have

Corollary D.1.

Under Assumption 4, we have for k=L,Rk=\textrm{L},\textrm{R},

supλk[λmin,λmax]|1dfkn|\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|1-\frac{\textnormal{df}_{k}}{n}\right| =Θp(1),\displaystyle=\Theta_{p}(1),
supλk[λmin,λmax]|1df^kn|\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|1-\frac{\hat{\textnormal{df}}_{k}}{n}\right| =Θp(1).\displaystyle=\Theta_{p}(1).
Proof.

The first line follows directly from (24) and the fact that \zeta_{k} is bounded as in (25). For the second line, we have

supλk[λmin,λmax]|1df^kn|supλk[λmin,λmax]|1dfkn|+supλk[λmin,λmax]|df^kndfkn|.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|1-\frac{\hat{\textnormal{df}}_{k}}{n}\right|\leq\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|1-\frac{{\textnormal{df}}_{k}}{n}\right|+\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\hat{\textnormal{df}}_{k}}{n}-\frac{\textnormal{df}_{k}}{n}\right|.

where the first term is Θp(1)\Theta_{p}(1) and the second term is oP(1)o_{P}(1) by Lemma D.2. ∎

From here, one can derive the following Lemma using the definitions of 𝜷^kd\hat{\bm{\beta}}_{k}^{d}, 𝜷~kd\tilde{\bm{\beta}}_{k}^{d}. One should compare the following to (celentano2021cad, Lemma H.1.ii) that proved a pointwise version of this result without any supremum over the tuning parameters.

Lemma D.3.

Under Assumption 4, for k=L,Rk=\textrm{L},\textrm{R},

supλk[λmin,λmax]𝜷^kd𝜷~kd2𝑃0.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}_{k}^{d}-\tilde{\bm{\beta}}_{k}^{d}\|_{2}\overset{P}{\rightarrow}0.
Proof.

By definition,

supλk[λmin,λmax]𝜷^kd𝜷~kd2\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}_{k}^{d}-\tilde{\bm{\beta}}_{k}^{d}\|_{2}
=\displaystyle= supλk[λmin,λmax]𝑿(𝒚𝑿𝜷^k)ndf^k𝑿(𝒚𝑿𝜷^k)ndfk2\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left\|\frac{\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{k})}{n-\hat{\textnormal{df}}_{k}}-\frac{\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{k})}{n-\textnormal{df}_{k}}\right\|_{2}
\displaystyle\leq 𝑿opnsupλk[λmin,λmax]𝒚𝑿𝜷^k2nsupλk[λmin,λmax]|11df^k/n11dfk/n|.\displaystyle\frac{\|\bm{X}\|_{\textnormal{op}}}{\sqrt{n}}\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\frac{\|\bm{y}-\bm{X}\hat{\bm{\beta}}_{k}\|_{2}}{\sqrt{n}}\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{1}{{1-\hat{\textnormal{df}}_{k}/n}}-\frac{1}{{1-\textnormal{df}_{k}/n}}\right|.

By Corollary H.1, 𝑿op/n\|\bm{X}\|_{\textnormal{op}}/\sqrt{n} is bounded with probability 1o(1)1-o(1). By Lemma D.7, supλk[λmin,λmax]𝒚𝑿𝜷^k/n\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\bm{y}-\bm{X}\hat{\bm{\beta}}_{k}\|/\sqrt{n} is bounded with probability 1o(1)1-o(1). Lastly,

supλk[λmin,λmax]|11df^k/n11dfk/n|=supλk[λmin,λmax]|dfk/ndf^k/n(1df^k/n)(1dfk/n)|𝑃0,\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{1}{{1-\hat{\textnormal{df}}_{k}/n}}-\frac{1}{{1-\textnormal{df}_{k}/n}}\right|=\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\textnormal{df}_{k}/n-\hat{\textnormal{df}}_{k}/n}{(1-\hat{\textnormal{df}}_{k}/n)(1-\textnormal{df}_{k}/n)}\right|\overset{P}{\rightarrow}0,

where the numerator is oP(1)o_{P}(1) by Lemma D.2 and the denominator is ΘP(1)\Theta_{P}(1) by Corollary D.1. ∎

We next turn to prove Theorem 4.3.

Proof of Theorem 4.3.

By triangle inequality,

supλL,λR[λmin,λmax]|ϕβ(𝜷^Ld,𝜷^Rd,𝜷)𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)]|\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})]|
\displaystyle\leq supλL,λR[λmin,λmax]|ϕβ(𝜷~Ld,𝜷~Rd,𝜷)𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)]|+supλL,λR[λmin,λmax]|ϕβ(𝜷^Ld,𝜷^Rd,𝜷)ϕβ(𝜷~Ld,𝜷~Rd,𝜷)|\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\tilde{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})]|+\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})-\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\tilde{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})|
\displaystyle\leq oP(1)+supλL,λR[λmin,λmax]𝜷^Ld𝜷~Ld2+supλL,λR[λmin,λmax]𝜷^Rd𝜷~Rd2\displaystyle o_{P}(1)+\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}_{\textrm{L}}^{d}-\tilde{\bm{\beta}}_{\textrm{L}}^{d}\|_{2}+\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}_{\textrm{R}}^{d}-\tilde{\bm{\beta}}_{\textrm{R}}^{d}\|_{2}
𝑃\displaystyle\overset{P}{\rightarrow} 0,\displaystyle\ 0,

where we used Theorem D.1 and Lemma D.3, as well as the fact that ϕβ\phi_{\beta} is 11-Lipschitz. ∎

Thus, our proof of Theorem 4.3 is complete if we prove Theorem D.1. We present this in the next sub-section.

D.1 Proof of Theorem D.1

The overarching structure of our proof of Theorem D.1 is inspired by (celentano2021cad, Section D). However, unlike in our setting below, the results in the aforementioned paper do not require uniform convergence over a range of values of the tuning parameter. This leads to novel technical challenges in our setting that we handle as we proceed.

To prove Theorem D.1, we introduce an intermediate quantity 𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)|𝒈Lf=𝒈^L]\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}] that conditions on 𝒈Lf\bm{g}_{\textrm{L}}^{f} (recall the definition from (22)) taking on a specific value 𝒈^L\hat{\bm{g}}_{\textrm{L}} defined as follows:

𝒈^L=nτLζLnτL2σ2𝒚𝑿𝜷^L2𝜷^L𝜷2(𝜷^L𝜷)+τL𝒚𝑿𝜷^L2𝑿(𝒚𝑿𝜷^L),\begin{gathered}\hat{\bm{g}}_{\textrm{L}}=\frac{n\tau_{\textrm{L}}\zeta_{\textrm{L}}\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}}}{\|\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}}\|_{2}\|\hat{\bm{\beta}}_{\textrm{L}}-\bm{\beta}\|_{2}}(\hat{\bm{\beta}}_{\textrm{L}}-\bm{\beta})+\frac{\tau_{\textrm{L}}}{\|\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}}\|_{2}}\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}}),\end{gathered} (26)

where τL,ζL\tau_{\textrm{L}},\zeta_{\textrm{L}} are defined as in (24). Recall from (25) that nτL2σ2n\tau_{\textrm{L}}^{2}\geq\sigma^{2} so that the square root above is well-defined.

Remark.

The definition of 𝐠^L\hat{\bm{g}}_{\textrm{L}} is non-trivial. However, the takeaway is that the realization 𝐠Lf=𝐠^L\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}} should be understood as the coupling of 𝐗\bm{X} and 𝐠Lf\bm{g}_{\textrm{L}}^{f} such that 𝛃^Lf,𝛃^Lf,d\hat{\bm{\beta}}_{\textrm{L}}^{f},\hat{\bm{\beta}}_{\textrm{L}}^{f,d} equals 𝛃^L,𝛃~Ld\hat{\bm{\beta}}_{\textrm{L}},\tilde{\bm{\beta}}_{\textrm{L}}^{d}, respectively. We refer readers to (celentano2021cad, Section F.1 and Section L) for the underlying intuition as well as the proof of this equivalence.

Since 𝜷^Lf,d\hat{\bm{\beta}}_{\textrm{L}}^{f,d} becomes 𝜷~Ld\tilde{\bm{\beta}}_{\textrm{L}}^{d} conditioning on 𝒈Lf=𝒈^L\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}, we have 𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)|𝒈Lf=𝒈^L]=𝔼[ϕβ(𝜷~Ld,𝜷^Rf,d,𝜷)|𝒈Lf=𝒈^L]\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]=\mathbb{E}[\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}], where the randomness inside the expectation comes from 𝜷^Rf,d\hat{\bm{\beta}}_{\textrm{R}}^{f,d}, and 𝜷~Ld\tilde{\bm{\beta}}_{\textrm{L}}^{d} is fixed. In fact, analogous to 𝒈Lf=𝜷^Lf,d𝜷\bm{g}_{\textrm{L}}^{f}=\hat{\bm{\beta}}_{\textrm{L}}^{f,d}-\bm{\beta}, 𝒈^L\hat{\bm{g}}_{\textrm{L}} approximates 𝜷~Ld𝜷\tilde{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta}, as formalized in the result below.

Lemma D.4.

Under Assumption 4,

supλL[λmin,λmax]𝒈^L(𝜷~Ld𝜷)2𝑃0.\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{g}}_{\textrm{L}}-(\tilde{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta})\|_{2}\overset{P}{\rightarrow}0.

With these definitions in hand, the proof is complete on combining the following two lemmas.

Lemma D.5.

Recall the definitions of 𝛃~kd\tilde{\bm{\beta}}_{k}^{d} from Eqn. 20 and 𝛃^kf,d\hat{\bm{\beta}}_{k}^{f,d} from Eqn. 22. Under Assumption 4, for any 11-Lipschitz function ϕβ:(p)3\phi_{\beta}:(\mathbb{R}^{p})^{3}\rightarrow\mathbb{R},

supλL,λR[λmin,λmax]|𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)|𝒈Lf=𝒈^L]𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)]|𝑃0.\begin{gathered}\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})]|\overset{P}{\rightarrow}0.\end{gathered} (27)
Lemma D.6.

Under Assumption 4, for any 11-Lipschitz function ϕβ:(p)3\phi_{\beta}:(\mathbb{R}^{p})^{3}\rightarrow\mathbb{R},

supλL,λR[λmin,λmax]|𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)|𝒈Lf=𝒈^L]ϕβ(𝜷~Ld,𝜷~Rd,𝜷)|𝑃0.\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]-\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\tilde{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})|\overset{P}{\rightarrow}0.

We present the proofs of these supporting lemmas in the following subsection.

D.2 Proof of Lemmas D.4D.6

We start with the proof of Lemma D.4 since it plays a crucial role in the proofs of the others. The key step is the following lemma, which establishes the boundedness and limiting behavior of several quantities that recur throughout the argument.

Lemma D.7.

For k=L,Rk=\textrm{L},\textrm{R}, define 𝐰^k,𝐮^k,𝐯^k\hat{\bm{w}}_{k},\hat{\bm{u}}_{k},\hat{\bm{v}}_{k} as

𝒘^k=𝜷^k𝜷,𝒖^k=𝒚𝑿𝜷^k,𝒗^k=𝑿(𝒚𝑿𝜷^k).\hat{\bm{w}}_{k}=\hat{\bm{\beta}}_{k}-\bm{\beta},\hat{\bm{u}}_{k}=\bm{y}-\bm{X}\hat{\bm{\beta}}_{k},\hat{\bm{v}}_{k}=\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{k}). (28)

Under Assumption 4,

  • Convergence:

    supλk[λmin,λmax]|𝒘^k2nτk2σ2|𝑃0.\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\|\hat{\bm{w}}_{k}\|_{2}-\sqrt{n\tau_{k}^{2}-\sigma^{2}}\right|\overset{P}{\rightarrow}0.
    supλk[λmin,λmax]|𝒖^k2/nnτkζk|𝑃0.\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\|\hat{\bm{u}}_{k}\|_{2}/\sqrt{n}-\sqrt{n}\tau_{k}\zeta_{k}\right|\overset{P}{\rightarrow}0.
  • Boundedness: there exist positive constants c,Cc,C such that, with probability 1o(1)1-o(1),

    csupλk[λmin,λmax]𝒘^k2,supλk[λmin,λmax]𝒖^k2/n,supλk[λmin,λmax]𝒗^k2/nC.c\leq\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{w}}_{k}\|_{2},\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{u}}_{k}\|_{2}/\sqrt{n},\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{v}}_{k}\|_{2}/n\leq C.
Proof.

Lemma D.7 is proved in Section E.1. ∎

We will see below that Lemma D.7 forms the crux of the proof of Lemma D.4.

D.2.1 Proof of Lemma D.4

Proof of Lemma D.4.

(celentano2021cad, Section J.1) proved a pointwise version of this lemma, without any supremum over the tuning parameters. Thus, with Lemma D.7 at our disposal, the proof of Lemma D.4 follows by a suitable combination with the strategy in celentano2021cad. Recalling Eqns. 26 and 28, we have

𝒈^L=nτLζLnτL2σ2𝒖^L2𝒘^L2𝒘^L+τL𝒖^L2𝒗^L.\hat{\bm{g}}_{\textrm{L}}=\frac{n\tau_{\textrm{L}}\zeta_{\textrm{L}}\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}}}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}\|\hat{\bm{w}}_{\textrm{L}}\|_{2}}\hat{\bm{w}}_{\textrm{L}}+\frac{\tau_{\textrm{L}}}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}}\hat{\bm{v}}_{\textrm{L}}.

Next recall the definition of 𝜷~Ld\tilde{\bm{\beta}}_{\textrm{L}}^{d} from (20) and that ndfL=nζLn-\textnormal{df}_{\textrm{L}}=n\zeta_{\textrm{L}} from our system of equations (24). On applying triangle inequality and Lemma H.5, we obtain that

supλL[λmin,λmax]𝒈^L(𝜷~Ld𝜷)2\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{g}}_{\textrm{L}}-(\tilde{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta})\|_{2} supλL[λmin,λmax]|nζLτLnτL2σ2𝒖^L2𝒘^L21|supλL[λmin,λmax]𝒘^L2\displaystyle\leq\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{n\zeta_{\textrm{L}}\tau_{\textrm{L}}\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}}}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}\|\hat{\bm{w}}_{\textrm{L}}\|_{2}}-1\right|\cdot\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{w}}_{\textrm{L}}\|_{2} (29)
+supλL[λmin,λmax]|nτL𝒖^L21ζL|supλL[λmin,λmax]𝒗^L2n.\displaystyle+\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{n\tau_{\textrm{L}}}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}}-\frac{1}{\zeta_{\textrm{L}}}\right|\cdot\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\frac{\|\hat{\bm{v}}_{\textrm{L}}\|_{2}}{n}.

For the first summand, note that by Lemma D.7, supλL[λmin,λmax]𝒘^L2=OP(1)\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{w}}_{\textrm{L}}\|_{2}=O_{P}(1). Thus it suffices to show that the first term in the first summand in (29) is oP(1)o_{P}(1). To see this, observe that we can simplify this as follows

supλL[λmin,λmax]|nτLζLnτL2σ2𝒖^L2𝒘^L21|\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{n\tau_{\textrm{L}}\zeta_{\textrm{L}}\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}}}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}\|\hat{\bm{w}}_{\textrm{L}}\|_{2}}-1\right|
=\displaystyle= supλL[λmin,λmax]|(nτLζL)nτL2σ2(𝒖^L2/n)𝒘^L21|\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{(\sqrt{n}\tau_{\textrm{L}}\zeta_{\textrm{L}})\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}}}{(\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n})\|\hat{\bm{w}}_{\textrm{L}}\|_{2}}-1\right|
=\displaystyle= supλL[λmin,λmax]|(nτLζL)(nτL2σ2𝒘^L2)+𝒘^L2(nτLζL𝒖^L2/n)(𝒖^L2/n)𝒘^L2|\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{(\sqrt{n}\tau_{\textrm{L}}\zeta_{\textrm{L}})(\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}}-\|\hat{\bm{w}}_{\textrm{L}}\|_{2})+\|\hat{\bm{w}}_{\textrm{L}}\|_{2}(\sqrt{n}\tau_{\textrm{L}}\zeta_{\textrm{L}}-\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n})}{(\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n})\|\hat{\bm{w}}_{\textrm{L}}\|_{2}}\right|
\displaystyle\leq supλL[λmin,λmax]|nτLζL(𝒖^L2/n)𝒘^L2|supλL[λmin,λmax]|nτL2σ2𝒘^L2|\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\sqrt{n}\tau_{\textrm{L}}\zeta_{\textrm{L}}}{(\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n})\|\hat{\bm{w}}_{\textrm{L}}\|_{2}}\right|\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}}-\|\hat{\bm{w}}_{\textrm{L}}\|_{2}\right|
+supλL[λmin,λmax]|1𝒖^L2/n|supλL[λmin,λmax]|(nτLζL𝒖^L2/n)|𝑃0by Lemma D.7 .\displaystyle+\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{1}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n}}\right|\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|(\sqrt{n}\tau_{\textrm{L}}\zeta_{\textrm{L}}-\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n})\right|\overset{P}{\rightarrow}0\,\,\text{by Lemma \ref{lemma:uvw_convergence} }.

We turn to the second summand in (29). It suffices to show that the first term is oP(1)o_{P}(1) since the second is OP(1)O_{P}(1) by Lemma D.7. But this follows from the same result on observing that

supλL[λmin,λmax]|nτL𝒖^L21ζL|=supλL[λmin,λmax]|nτLζL𝒖^L2/n𝒖^L2/nζL|\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{n\tau_{\textrm{L}}}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}}-\frac{1}{\zeta_{\textrm{L}}}\right|=\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\sqrt{n}\tau_{\textrm{L}}\zeta_{\textrm{L}}-\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n}}{\|\hat{\bm{u}}_{\textrm{L}}\|_{2}/\sqrt{n}\cdot\zeta_{\textrm{L}}}\right|

and that ζL\zeta_{L} is bounded below by a positive constant by (25). This completes the proof. ∎

D.2.2 Proof of Lemma D.5

For clarity, we sometimes make the dependence of estimators on λL\lambda_{\textrm{L}} and/or λR\lambda_{\textrm{R}} explicit (by writing 𝜷^L(λL)\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{\textrm{L}}), for example). We first introduce a supporting lemma that states a convergence result separately for the Lasso and the Ridge.

Lemma D.8.

Under Assumption 4, for k=L,R,k=\textrm{L},\textrm{R}, and any 11-Lipschitz function ϕβ:(p)3\phi_{\beta}:(\mathbb{R}^{p})^{3}\rightarrow\mathbb{R},

supλk[λmin,λmax]|ϕβ(𝜷^k,𝜷~kd,𝜷)𝔼[ϕβ(𝜷^kf,𝜷^kf,d,𝜷)]|𝑃0.\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\phi_{\beta}(\hat{\bm{\beta}}_{k},\tilde{\bm{\beta}}_{k}^{d},\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{k}^{f},\hat{\bm{\beta}}_{k}^{f,d},\bm{\beta})]\right|\overset{P}{\rightarrow}0.
Proof.

First we consider k=Rk=\textrm{R}. From Lemma E.1, we know

supλR[λmin,λmax]|ϕβ(𝜷^R)𝔼[ϕβ(𝜷^Rf)]|𝑃0\sup_{\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{R}})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{R}}^{f})]|\overset{P}{\rightarrow}0

for any 11-Lipschitz function ϕβ\phi_{\beta}. Since 𝜷~Rd\tilde{\bm{\beta}}_{\textrm{R}}^{d} is a Lipschitz function of 𝜷^R\hat{\bm{\beta}}_{\textrm{R}} by Lemma D.14, the claim follows from the corresponding definitions in (22).

For k=Lk=\textrm{L}, however, 𝜷~Ld\tilde{\bm{\beta}}_{\textrm{L}}^{d} is not necessarily a Lipschitz function of 𝜷^L\hat{\bm{\beta}}_{\textrm{L}}. Thus, we instead turn to the α\alpha-smoothed Lasso (α>0\alpha>0) of celentano2020lasso, defined as

𝜷^α=argmin𝒃12n𝒚𝑿𝒃22+λLninf𝜽{n2α𝒃𝜽22+𝜽1}.\hat{\bm{\beta}}_{\alpha}=\operatorname*{arg\,min}_{\bm{b}}\frac{1}{2n}\|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}+\frac{\lambda_{L}}{\sqrt{n}}\inf_{\bm{\theta}}\left\{\frac{\sqrt{n}}{2\alpha}\|\bm{b}-\bm{\theta}\|_{2}^{2}+\|\bm{\theta}\|_{1}\right\}.

Based on this, 𝜷~αd,𝜷^αf,𝜷^αf,d\tilde{\bm{\beta}}_{\alpha}^{d},\hat{\bm{\beta}}_{\alpha}^{f},\hat{\bm{\beta}}_{\alpha}^{f,d} can be defined analogously. We omit the full details for simplicity since they parallel (18)-(24), but we note that, by the KKT conditions, they satisfy the following relations:

𝜷~αd\displaystyle\tilde{\bm{\beta}}_{\alpha}^{d} =𝜷^α+λLMα(𝜷^α)nξα\displaystyle=\hat{\bm{\beta}}_{\alpha}+\frac{\lambda_{\textrm{L}}\nabla M_{\alpha}(\hat{\bm{\beta}}_{\alpha})}{\sqrt{n}\xi_{\alpha}}
𝜷^αf,d\displaystyle\hat{\bm{\beta}}_{\alpha}^{f,d} =𝜷^αf+λLMα(𝜷^αf)nξα,\displaystyle=\hat{\bm{\beta}}_{\alpha}^{f}+\frac{\lambda_{\textrm{L}}\nabla M_{\alpha}(\hat{\bm{\beta}}_{\alpha}^{f})}{\sqrt{n}\xi_{\alpha}},

where Mα(𝒃):=inf𝜽{n2α𝒃𝜽22+𝜽1}M_{\alpha}(\bm{b}):=\inf_{\bm{\theta}}\left\{\frac{\sqrt{n}}{2\alpha}\|\bm{b}-\bm{\theta}\|_{2}^{2}+\|\bm{\theta}\|_{1}\right\} and ξα\xi_{\alpha} is defined similarly to (24). (celentano2020lasso, Theorem B.1) establishes a pointwise version of Lemma E.1 for the α\alpha-smoothed Lasso, i.e., without a supremum over the tuning parameter range, but uniform control can be obtained following our techniques for Lemma E.1. Furthermore, (celentano2020lasso, Section B.5.2) shows that 𝜷~αd\tilde{\bm{\beta}}_{\alpha}^{d} is a C/αC/\alpha-Lipschitz function of 𝜷^α\hat{\bm{\beta}}_{\alpha} for some positive constant CC (by checking the Lipschitzness of Mα\nabla M_{\alpha}). Combining these, we obtain that for a fixed α>0\alpha>0

supλL[λmin,λmax]|ϕβ(𝜷^α,𝜷~αd,𝜷)𝔼[ϕβ(𝜷^αf,𝜷^αf,d,𝜷)]|𝑃0.\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\phi_{\beta}(\hat{\bm{\beta}}_{\alpha},\tilde{\bm{\beta}}_{\alpha}^{d},\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\alpha}^{f},\hat{\bm{\beta}}_{\alpha}^{f,d},\bm{\beta})]\right|\overset{P}{\rightarrow}0. (30)

In addition, (celentano2020lasso, Lemma B.8 and Section B.5.1) establishes the closeness of (debiased) Lasso and (debiased) α\alpha-smoothed Lasso as follows: there exists a constant αmax>0\alpha_{\max}>0 such that

(𝜷^α𝜷^L2C1α,ααmax)=1o(1)(𝜷^αd𝜷^Ld2C1α,ααmax)=1o(1),\begin{gathered}\mathbb{P}\left(\|\hat{\bm{\beta}}_{\alpha}-\hat{\bm{\beta}}_{\textrm{L}}\|_{2}\leq C_{1}\sqrt{\alpha},\forall\alpha\leq\alpha_{\max}\right)=1-o(1)\\ \mathbb{P}\left(\|\hat{\bm{\beta}}_{\alpha}^{d}-\hat{\bm{\beta}}_{\textrm{L}}^{d}\|_{2}\leq C_{1}\sqrt{\alpha},\forall\alpha\leq\alpha_{\max}\right)=1-o(1),\end{gathered} (31)

which can be easily extended to the uniform version since C1,αmaxC_{1},\alpha_{\max} and constants hiding in o(1)o(1) do not depend on λ\lambda, and estimators for all λ\lambda share the same source of randomness.

Finally, (celentano2020lasso, Lemma A.5) shows that there exist constants C2C_{2} and αmax\alpha_{\max} such that ααmax\forall\alpha\leq\alpha_{\max}:

|τLτα|C2α,|ξLξα|C2α,|\tau_{\textrm{L}}-\tau_{\alpha}|\leq C_{2}\sqrt{\alpha},\ \ |\xi_{\textrm{L}}-\xi_{\alpha}|\leq C_{2}\sqrt{\alpha}, (32)

which can similarly be extended to the uniform version.

Once uniform control results are extended, the rest of the proof stays the same as in (celentano2020lasso, Section B.5), on combining (30)-(32). ∎
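For concreteness, the smoothed penalty M_α used above is, up to the √n scaling, the Moreau envelope of the ℓ1 norm, so the inner infimum is attained at a soft-thresholding of 𝒃 and ∇M_α is the corresponding clipped (Huber-type) score; this is the standard Moreau-envelope identity rather than anything specific to our setting. A minimal sketch, with illustrative function names:

```python
import numpy as np

def soft_threshold(b, t):
    """Componentwise soft-thresholding with threshold t >= 0."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def M_alpha(b, alpha, n):
    """M_alpha(b) = inf_theta { sqrt(n)/(2 alpha) ||b - theta||_2^2 + ||theta||_1 }."""
    gamma = alpha / np.sqrt(n)              # Moreau-envelope parameter
    theta = soft_threshold(b, gamma)        # minimizer of the inner problem
    return np.sum((b - theta) ** 2) / (2 * gamma) + np.sum(np.abs(theta))

def grad_M_alpha(b, alpha, n):
    """Gradient of M_alpha: (b - soft_threshold(b, alpha/sqrt(n))) * sqrt(n)/alpha."""
    gamma = alpha / np.sqrt(n)
    return (b - soft_threshold(b, gamma)) / gamma
```

In particular, ∇M_α is (√n/α)-Lipschitz, which is the source of the C/α-Lipschitz behavior of the smoothed debiased estimator noted above.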

As a consequence of Lemmas D.4 and D.8, we obtain the following corollary. An analogous result was established in (celentano2021cad, Corollary J.2), but unlike their case, our guarantee here is uniform over the tuning parameter space. Thus, the result below relies crucially on our earlier results that provide such uniform guarantees.

Corollary D.2.

Recall the definition of 𝐠^L\hat{\bm{g}}_{\textrm{L}} from (26) and that 𝐠Lf=𝛃^Lf,d𝛃\bm{g}_{L}^{f}=\hat{\bm{\beta}}_{L}^{f,d}-\bm{\beta} from (21). Under Assumption 4, for any 11-Lipschitz function ϕβ:(p)2\phi_{\beta}:(\mathbb{R}^{p})^{2}\rightarrow\mathbb{R},

supλL[λmin,λmax]|ϕβ(𝜷^L,𝒈^L)𝔼[ϕβ(𝜷^Lf,𝒈Lf)]|𝑃0.\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}},\hat{\bm{g}}_{\textrm{L}})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f},\bm{g}_{\textrm{L}}^{f})]|\overset{P}{\rightarrow}0.
Proof.

By triangle inequality,

supλL[λmin,λmax]|ϕβ(𝜷^L,𝒈^L)𝔼[ϕβ(𝜷^Lf,𝒈Lf)]|\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}},\hat{\bm{g}}_{\textrm{L}})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f},\bm{g}_{\textrm{L}}^{f})]|
\displaystyle\leq supλL[λmin,λmax]|ϕβ(𝜷^L,𝒈^L)ϕβ(𝜷^L,𝜷~Ld𝜷)|+supλL[λmin,λmax]|ϕβ(𝜷^L,𝜷~Ld𝜷)𝔼[ϕβ(𝜷^Lf,𝒈Lf)]|\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}},\hat{\bm{g}}_{\textrm{L}})-\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}},\tilde{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta})|+\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}},\tilde{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f},\bm{g}_{\textrm{L}}^{f})]|
\displaystyle\leq supλL[λmin,λmax]𝒈^L(𝜷~Ld𝜷)2+supλL[λmin,λmax]|ϕβ(𝜷^L,𝜷~Ld𝜷)𝔼[ϕβ(𝜷^Lf,𝜷^Lf,d𝜷)]|𝑃0,\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{g}}_{\textrm{L}}-(\tilde{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta})\|_{2}+\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}},\tilde{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f},\hat{\bm{\beta}}_{L}^{f,d}-\bm{\beta})]|\overset{P}{\rightarrow}0,

since the first summand vanishes due to Lemma D.4 and the second summand vanishes due to Lemma D.8. ∎

To prove Lemma D.5, that is, (27), we need to establish a convergence result that is uniform over λL\lambda_{L} and λR\lambda_{R}. We proceed in two steps: we first establish that (27) holds with the supremum taken only over λL\lambda_{L} (Lemma D.9). Then, by appropriate Lipschitzness arguments (Lemma D.10), we extend the result to hold with the supremum taken simultaneously over λL\lambda_{L} and λR\lambda_{R}.

Lemma D.9.

Denote the quantity 𝔼[ϕβ(𝛃^Lf,d,𝛃^Rf,d,𝛃)|𝐠Lf=𝐠^L]\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}] as ϕβ|L(𝐠^L)\phi_{\beta|L}(\hat{\bm{g}}_{\textrm{L}}). Then under Assumption 4,

supλL[λmin,λmax]|ϕβ|L(𝒈^L)𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)]|𝑃0.\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta|L}(\hat{\bm{g}}_{\textrm{L}})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})]|\overset{P}{\rightarrow}0. (33)
Proof.

From the definition, we know 𝔼[ϕβ|L(𝒈Lf)]=𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d,𝜷)]\mathbb{E}[\phi_{\beta|L}(\bm{g}_{\textrm{L}}^{f})]=\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})], since, per the remark below (26), 𝜷~Ld\tilde{\bm{\beta}}_{\textrm{L}}^{d} equals 𝜷^Lf,d\hat{\bm{\beta}}_{\textrm{L}}^{f,d} conditional on 𝒈Lf=𝒈^L\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}.

Therefore, we just need to show

supλL[λmin,λmax]|ϕβ|L(𝒈^L)𝔼[ϕβ|L(𝒈Lf)]|𝑃0,\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta|L}(\hat{\bm{g}}_{\textrm{L}})-\mathbb{E}[\phi_{\beta|L}(\bm{g}_{\textrm{L}}^{f})]|\overset{P}{\rightarrow}0,

which follows from Corollary D.2, if we can show that ϕβ|L\phi_{\beta|L} is a Lipschitz function. The Lipschitzness follows by an argument similar to (celentano2021cad, Section J.2), so we omit the details here. ∎

Note that Corollary D.2, which in turn relies on Lemmas D.4 and D.8, forms a crucial ingredient for the preceding proof.

Lemma D.10.

Under Assumption 4, with probability 1o(1)1-o(1), Ψ(λR)\Psi(\lambda_{\textrm{R}}) is an M-Lipschitz function of λR\lambda_{\textrm{R}} for some positive constant MM, where

Ψ(λR):=supλL[λmin,λmax]|ϕβ|L(𝒈^L)𝔼[ϕβ(𝜷^Lf,d(λL),𝜷^Rf,d(λR),𝜷)]|.\Psi(\lambda_{\textrm{R}}):=\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta|L}(\hat{\bm{g}}_{\textrm{L}})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d}(\lambda_{\textrm{L}}),\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{\textrm{R}}),\bm{\beta})]|.
Proof of Lemma D.10.

We define (with a slight overload of notations) an auxiliary function:

ψ(λL,λR):=|ϕβ|L(𝒈^L)𝔼[ϕβ(𝜷^Lf,d(λL),𝜷^Rf,d(λR),𝜷)]|.\psi(\lambda_{\textrm{L}},\lambda_{\textrm{R}}):=|\phi_{\beta|L}(\hat{\bm{g}}_{\textrm{L}})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d}(\lambda_{\textrm{L}}),\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{\textrm{R}}),\bm{\beta})]|.

First, τL\tau_{\textrm{L}}, and therefore 𝜷^Lf,d\hat{\bm{\beta}}_{\textrm{L}}^{f,d}, does not depend on λR\lambda_{\textrm{R}}. Also, 𝜷^Rf,d=𝜷+𝒈Rf=𝜷+τR𝝃~\hat{\bm{\beta}}_{\textrm{R}}^{f,d}=\bm{\beta}+\bm{g}_{\textrm{R}}^{f}=\bm{\beta}+\tau_{R}\tilde{\bm{\xi}}, where 𝝃~𝒩(𝟎,𝑰p)\tilde{\bm{\xi}}\sim\mathcal{N}(\bm{0},\bm{I}_{p}) and τR\tau_{R} is C1/nC_{1}/\sqrt{n}-Lipschitz in λR\lambda_{\textrm{R}} by Lemma H.2. Thus, we have

|𝔼[ϕβ(𝜷^Lf,d(λL),𝜷^Rf,d(λ1),𝜷)]𝔼[ϕβ(𝜷^Lf,d(λL),𝜷^Rf,d(λ2),𝜷)]|\displaystyle\left|\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d}(\lambda_{\textrm{L}}),\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{1}),\bm{\beta})]-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d}(\lambda_{\textrm{L}}),\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{2}),\bm{\beta})]\right|
\displaystyle\leq 𝔼[𝜷^Rf,d(λ1)𝜷^Rf,d(λ2)2]\displaystyle\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{1})-\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{2})\|_{2}]
\displaystyle\leq 𝔼[(τR(λ1)τR(λ2))𝝃~22]\displaystyle\sqrt{\mathbb{E}[\|(\tau_{\textrm{R}}(\lambda_{1})-\tau_{\textrm{R}}(\lambda_{2}))\tilde{\bm{\xi}}\|_{2}^{2}]}
\displaystyle\leq C1|λ1λ2|,\displaystyle C_{1}|\lambda_{1}-\lambda_{2}|,

which yields that 𝔼[ϕβ(𝜷^Lf,d(λL),𝜷^Rf,d(λR),𝜷)]\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d}(\lambda_{\textrm{L}}),\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{\textrm{R}}),\bm{\beta})] is CC-Lipschitz in λR\lambda_{\textrm{R}}. Next, note that conditioning on 𝒈Lf=𝒈^L\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}} forces 𝒈Rf\bm{g}_{\textrm{R}}^{f} to take the following value so that their joint covariance structure matches (23),

𝒈Rf=τRρ/τL𝒈^L+τR1ρ2𝝃,where𝝃𝒩(0,𝑰p).\bm{g}_{\textrm{R}}^{f}=\tau_{\textrm{R}}\rho/\tau_{\textrm{L}}\cdot\hat{\bm{g}}_{\textrm{L}}+\tau_{\textrm{R}}\sqrt{1-\rho^{2}}\cdot\bm{\xi},\,\,\text{where}\,\,\bm{\xi}\sim\mathcal{N}(0,\bm{I}_{p}). (34)

This means that conditional on 𝒈Lf=𝒈^L\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}, 𝜷^Rf,d=𝜷+τRρ/τL𝒈^L+τR1ρ2𝝃\hat{\bm{\beta}}_{\textrm{R}}^{f,d}=\bm{\beta}+\tau_{R}\rho/\tau_{L}\hat{\bm{g}}_{\textrm{L}}+\tau_{R}\sqrt{1-\rho^{2}}\bm{\xi}, whereas 𝜷^Ld\hat{\bm{\beta}}_{\textrm{L}}^{d} never depends on λR\lambda_{\textrm{R}}. Also, from 25 and Lemma H.2, τR\tau_{R} is bounded in [τmin2/n,(τmax2+σmax2)/n][\sqrt{\tau_{\min}^{2}/n},\sqrt{(\tau_{\max}^{2}+\sigma_{\max}^{2})/n}] and is CC-Lipschitz in λR\lambda_{\textrm{R}} while both ρ\rho and 1ρ2\sqrt{1-\rho^{2}} are bounded in [0,1][0,1] and CC^{\prime}-Lipschitz in λR\lambda_{\textrm{R}}, and 1/τLn/τmin1/\tau_{L}\leq\sqrt{n}/\tau_{\min}. Combining Lemma D.4 and D.3 yields that 𝒈^L22𝜷^Ld𝜷2\|\hat{\bm{g}}_{\textrm{L}}\|_{2}\leq 2\|\hat{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta}\|_{2} with probability 1o(1)1-o(1). In conjunction with Lemma D.8, this yields that 𝒈^L24𝔼[𝜷^Lf,d𝜷2]4𝔼[𝜷^Lf,d𝜷22]=4nτL24τmax2+σmax2\|\hat{\bm{g}}_{\textrm{L}}\|_{2}\leq 4\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{L}}^{f,d}-\bm{\beta}\|_{2}]\leq 4\sqrt{\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{L}}^{f,d}-\bm{\beta}\|_{2}^{2}]}=4n\tau_{{\textrm{L}}}^{2}\leq 4\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}} with probability 1o(1)1-o(1). Thus,

|𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d(λ1),𝜷)|𝒈Lf=𝒈^L]𝔼[ϕβ(𝜷^Lf,d,𝜷^Rf,d(λ2),𝜷)|𝒈Lf=𝒈^L]|\displaystyle\left|\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{1}),\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{2}),\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]\right|
\displaystyle\leq 𝔼[|ϕβ(𝜷^Lf,d,𝜷^Rf,d(λ1),𝜷)ϕβ(𝜷^Lf,d,𝜷^Rf,d(λ2),𝜷)||𝒈Lf=𝒈^L]\displaystyle\mathbb{E}\left[\left|\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{1}),\bm{\beta})-\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{2}),\bm{\beta})\right||\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}\right]
\displaystyle\leq 𝔼[𝜷^Rf,d(λ1)𝜷^Rf,d(λ2)2|𝒈Lf=𝒈^L]\displaystyle\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{1})-\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{2})\|_{2}|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]
\displaystyle\leq 𝔼[𝜷^Rf,d(λ1)𝜷^Rf,d(λ2)22|𝒈Lf=𝒈^L]\displaystyle\sqrt{\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{1})-\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{2})\|_{2}^{2}|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]}
\displaystyle\leq (4(C+τmax2+σmax2C)τmax2+σmax2/τmin+(C+τmax2+σmax2C))|λ1λ2|.\displaystyle\left(4(C^{\prime}+\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}C)\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}/\tau_{\min}+(C^{\prime}+\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}C)\right)|\lambda_{1}-\lambda_{2}|.

Thus, we conclude that ϕβ|L(𝒈^L)\phi_{\beta|L}(\hat{\bm{g}}_{\textrm{L}}) is (4(C+τmax2+σmax2C)τmax2+σmax2/τmin+(C+τmax2+σmax2C))(4(C^{\prime}+\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}C)\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}/\tau_{\min}+(C^{\prime}+\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}C))-Lipschitz in λR\lambda_{\textrm{R}} with probability 1o(1)1-o(1).

Therefore, with probability 1o(1)1-o(1), ψ(λL,λR)\psi(\lambda_{\textrm{L}},\lambda_{\textrm{R}}) is an MM-Lipschitz function of λR\lambda_{\textrm{R}} for M=4(C+τmax2+σmax2C)τmax2+σmax2/τmin+(C+τmax2+σmax2C)+CM=4(C^{\prime}+\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}C)\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}/\tau_{\min}+(C^{\prime}+\sqrt{\tau_{\max}^{2}+\sigma_{\max}^{2}}C)+C. Notice that MM does not depend on λL\lambda_{\textrm{L}}. Therefore, by Lemma H.9, Ψ(λR):=supλLψ(λL,λR)\Psi(\lambda_{\textrm{R}}):=\sup_{\lambda_{\textrm{L}}}\psi(\lambda_{\textrm{L}},\lambda_{\textrm{R}}) is also an MM-Lipschitz function of λR\lambda_{\textrm{R}}, hence completing the proof. ∎

Now we are ready to prove Lemma D.5.

Proof of Lemma D.5.

Consider the high probability event in Lemma D.10. For any ϵ>0\epsilon>0, define ϵ=ϵ/2M\epsilon^{\prime}=\epsilon/2M. Let k=(λmaxλmin)/ϵk=\lceil(\lambda_{\max}-\lambda_{\min})/\epsilon^{\prime}\rceil. Define, for i=0,,ki=0,...,k: λi=λmin+iϵ\lambda_{i}=\lambda_{\min}+i\epsilon^{\prime}. Then by Lemma D.10, we know supλR[λi1,λi]Ψ(λR)Ψ(λi)+Mϵ\sup_{\lambda_{\textrm{R}}\in[\lambda_{i-1},\lambda_{i}]}\Psi(\lambda_{\textrm{R}})\leq\Psi(\lambda_{i})+M\epsilon^{\prime}.

By union bound, we have

(supλR[λmin,λmax]Ψ(λR)ϵ)\displaystyle\mathbb{P}\left(\sup_{\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\Psi(\lambda_{\textrm{R}})\geq\epsilon\right)
\displaystyle\leq i=1k(supλR[λi1,λi]Ψ(λR)ϵ)\displaystyle\sum_{i=1}^{k}\mathbb{P}\left(\sup_{\lambda_{\textrm{R}}\in[\lambda_{i-1},\lambda_{i}]}\Psi(\lambda_{\textrm{R}})\geq\epsilon\right)
\displaystyle\leq i=1k(Ψ(λi)ϵMϵ)\displaystyle\sum_{i=1}^{k}\mathbb{P}\left(\Psi(\lambda_{i})\geq\epsilon-M\epsilon^{\prime}\right)
=\displaystyle= i=1k(supλL[λmin,λmax]|ϕβ|L(𝒈^L(λL))𝔼[ϕβ(𝜷^Lf,d(λL),𝜷^Rf,d(λi),𝜷)]|ϵ/2)\displaystyle\sum_{i=1}^{k}\mathbb{P}\left(\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta|L}(\hat{\bm{g}}_{\textrm{L}}(\lambda_{\textrm{L}}))-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d}(\lambda_{\textrm{L}}),\hat{\bm{\beta}}_{\textrm{R}}^{f,d}(\lambda_{i}),\bm{\beta})]|\geq\epsilon/2\right)
=\displaystyle= o(1),\displaystyle o(1),

on using Lemma D.9. ∎
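The discretization step above is the usual covering argument: uniform control on a grid of mesh ϵ′ plus an M-Lipschitz modulus between grid points yields uniform control over the whole interval. A minimal numerical sketch of this principle, with an arbitrary Lipschitz surrogate in place of Ψ:

```python
import numpy as np

# Covering-argument sanity check: for an M-Lipschitz Psi, the supremum over
# [lam_min, lam_max] is at most the maximum over a grid of mesh eps_prime plus M * eps_prime.
M, lam_min, lam_max = 2.0, 0.1, 5.0
Psi = lambda lam: np.sin(M * lam)              # arbitrary M-Lipschitz surrogate for Psi

eps_prime = 1e-3
grid = np.arange(lam_min, lam_max + eps_prime, eps_prime)
fine = np.linspace(lam_min, lam_max, 200_000)  # proxy for the continuum supremum

assert Psi(fine).max() <= Psi(grid).max() + M * eps_prime
```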

D.2.3 Proof of Lemma D.6

Recall the definition of 𝒘^k\hat{\bm{w}}_{k} from (28). We define a loss function that yields 𝒘^R\hat{\bm{w}}_{\textrm{R}} as the minimizer:

𝒞λR(𝒘):=12n𝑿𝒘σ𝒛22+λR2𝒘+𝜷22.\mathcal{C}_{\lambda_{\textrm{R}}}(\bm{w}):=\frac{1}{2n}\|\bm{X}\bm{w}-\sigma\bm{z}\|_{2}^{2}+\frac{\lambda_{\textrm{R}}}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}.
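Since 𝒚 = 𝑿𝜷 + σ𝒛, this loss is just the ridge objective re-centered at 𝜷, so its minimizer is indeed ŵ_R = β̂_R − 𝜷. A minimal numerical sketch of this change of variables on synthetic Gaussian data (the dimensions and tuning value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam_R, sigma = 300, 60, 0.5, 1.0
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
z = rng.standard_normal(n)
y = X @ beta + sigma * z

# Ridge fit: argmin_b ||y - X b||^2 / (2n) + lam_R ||b||^2 / 2
beta_R = np.linalg.solve(X.T @ X / n + lam_R * np.eye(p), X.T @ y / n)

# Minimizer of C_{lam_R}(w) = ||X w - sigma z||^2 / (2n) + lam_R ||w + beta||^2 / 2,
# computed from its normal equations; it coincides with w_hat_R = beta_R - beta.
w_hat = np.linalg.solve(X.T @ X / n + lam_R * np.eye(p),
                        X.T @ (sigma * z) / n - lam_R * beta)
assert np.allclose(w_hat, beta_R - beta)
```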

We next introduce some supporting lemmas.

Lemma D.11.

Recall the definition of 𝐠^L\hat{\bm{g}}_{\textrm{L}} from (26). Under Assumption 4, for any ϵ>0\epsilon>0, there exists a constant CC such that for any 11-Lipschitz function ϕw:p\phi_{w}:\mathbb{R}^{p}\rightarrow\mathbb{R},

supλL,λR[λmin,λmax](𝒘p,|ϕw(𝒘)𝔼[ϕw(𝜷^Rf𝜷)|𝒈Lf=𝒈^L]|ϵand𝒞λR(𝒘)min𝒞λR+Cϵ2)=o(ϵ2).\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\exists\bm{w}\in\mathbb{R}^{p},\left|\phi_{w}(\bm{w})-\mathbb{E}[\phi_{w}(\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]\right|\geq\epsilon\ \text{and}\ \mathcal{C}_{\lambda_{\textrm{R}}}(\bm{w})\leq\min\mathcal{C}_{\lambda_{\textrm{R}}}+C\epsilon^{2}\right)=o(\epsilon^{2}).
Proof.

From part of (celentano2021cad, Lemma F.4), we know that

(𝒘p,|ϕw(𝒘)𝔼[ϕw(𝜷^Rf𝜷)|𝒈Lf=𝒈^L]|ϵand𝒞λR(𝒘)min𝒞λR+Cϵ2)=o(ϵ2).\mathbb{P}\left(\exists\bm{w}\in\mathbb{R}^{p},\left|\phi_{w}(\bm{w})-\mathbb{E}[\phi_{w}(\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]\right|\geq\epsilon\ \text{and}\ \mathcal{C}_{\lambda_{\textrm{R}}}(\bm{w})\leq\min\mathcal{C}_{\lambda_{\textrm{R}}}+C\epsilon^{2}\right)=o(\epsilon^{2}).

The proof is then completed by the fact that as in (celentano2021cad, Lemma F.4), o(ϵ2)o(\epsilon^{2}) only hides constants that do not depend on λL,λR\lambda_{\textrm{L}},\lambda_{\textrm{R}}. ∎

Lemma D.12.

Under Assumption 4, there exists a positive constant KK such that

(λ,λ[λmin,λmax],𝒞λ(𝒘^R(λ))𝒞λ(𝒘^R(λ))+K|λλ|)=1o(1).\mathbb{P}\left(\forall\lambda,\lambda^{\prime}\in[\lambda_{\min},\lambda_{\max}],\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda))\leq\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda^{\prime}))+K|\lambda-\lambda^{\prime}|\right)=1-o(1).
Proof.

We have

𝒞λ(𝒘^R(λ))𝒞λ(𝒘^R(λ))\displaystyle\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda))-\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda^{\prime}))
=\displaystyle= 𝒞λ(𝒘^R(λ))𝒞λ(𝒘^R(λ))+𝒞λ(𝒘^R(λ))𝒞λ(𝒘^R(λ))+𝒞λ(𝒘^R(λ))𝒞λ(𝒘^R(λ))\displaystyle\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda))-\mathcal{C}_{\lambda}(\hat{\bm{w}}_{\textrm{R}}(\lambda))+\mathcal{C}_{\lambda}(\hat{\bm{w}}_{\textrm{R}}(\lambda))-\mathcal{C}_{\lambda}(\hat{\bm{w}}_{\textrm{R}}(\lambda^{\prime}))+\mathcal{C}_{\lambda}(\hat{\bm{w}}_{\textrm{R}}(\lambda^{\prime}))-\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda^{\prime}))
\displaystyle\leq 𝒞λ(𝒘^R(λ))𝒞λ(𝒘^R(λ))+𝒞λ(𝒘^R(λ))𝒞λ(𝒘^R(λ))\displaystyle\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda))-\mathcal{C}_{\lambda}(\hat{\bm{w}}_{\textrm{R}}(\lambda))+\mathcal{C}_{\lambda}(\hat{\bm{w}}_{\textrm{R}}(\lambda^{\prime}))-\mathcal{C}_{\lambda^{\prime}}(\hat{\bm{w}}_{\textrm{R}}(\lambda^{\prime}))
=\displaystyle= λλ2(𝜷^R(λ)22𝜷^R(λ)22)\displaystyle\frac{\lambda^{\prime}-\lambda}{2}(\|\hat{\bm{\beta}}_{\textrm{R}}(\lambda)\|_{2}^{2}-\|\hat{\bm{\beta}}_{\textrm{R}}(\lambda^{\prime})\|_{2}^{2})
\displaystyle\leq |λλ|2(𝜷^R(λ)22+𝜷^R(λ)22).\displaystyle\frac{|\lambda^{\prime}-\lambda|}{2}(\|\hat{\bm{\beta}}_{\textrm{R}}(\lambda)\|_{2}^{2}+\|\hat{\bm{\beta}}_{\textrm{R}}(\lambda^{\prime})\|_{2}^{2}).

Further, with probability at least 1en/21-e^{-n/2} we have 𝒛22n\|\bm{z}\|_{2}\leq 2\sqrt{n}, and therefore

λ2𝜷^R(λ)22𝒞λ(𝒘^R(λ))=min𝒃𝒞λ(𝒃)𝒞λ(𝟎)=12nσ𝒛22+λ2𝜷222σ2+λ2𝜷22.\frac{\lambda}{2}\|\hat{\bm{\beta}}_{\textrm{R}}(\lambda)\|_{2}^{2}\leq\mathcal{C}_{\lambda}(\hat{\bm{w}}_{\textrm{R}}(\lambda))=\min_{\bm{b}}\mathcal{C}_{\lambda}(\bm{b})\leq\mathcal{C}_{\lambda}(\bm{0})=\frac{1}{2n}\|\sigma\bm{z}\|_{2}^{2}+\frac{\lambda}{2}\|\bm{\beta}\|_{2}^{2}\leq 2\sigma^{2}+\frac{\lambda}{2}\|\bm{\beta}\|_{2}^{2}.

Hence 𝜷^R(λ)22\|\hat{\bm{\beta}}_{\textrm{R}}(\lambda)\|_{2}^{2} is bounded with high probability and so is 𝜷^R(λ)22\|\hat{\bm{\beta}}_{\textrm{R}}(\lambda^{\prime})\|_{2}^{2}, which completes the proof. ∎

Lemma D.13.

Under Assumption 4, for any ϵ>0\epsilon>0, consider any λ1,λ2[λmin,λmax]\lambda_{1},\lambda_{2}\in[\lambda_{\min},\lambda_{\max}] such that |λ1λ2|ϵ|\lambda_{1}-\lambda_{2}|\geq\epsilon. Then with probability 1o(1)1-o(1), we have

|ψ(λ1,λR)ψ(λ2,λR)|M|λ1λ2|\displaystyle|\psi(\lambda_{1},\lambda_{\textrm{R}})-\psi(\lambda_{2},\lambda_{\textrm{R}})|\leq M|\lambda_{1}-\lambda_{2}|
|ψ(λL,λ1)ψ(λL,λ2)|M|λ1λ2|\displaystyle|\psi(\lambda_{\textrm{L}},\lambda_{1})-\psi(\lambda_{\textrm{L}},\lambda_{2})|\leq M|\lambda_{1}-\lambda_{2}|

for some constant MM (that does not depend on ϵ\epsilon), where ψ(λL,λR)=𝔼[ϕw(𝛃^Rf𝛃)|𝐠Lf=𝐠^L]\psi(\lambda_{\textrm{L}},\lambda_{\textrm{R}})=\mathbb{E}[\phi_{w}(\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}].

We defer the proof to Section E.2. We remark that this is a slightly weaker condition than Lipschitzness of ψ\psi in λk\lambda_{k}, but it suffices for our purpose. For convenience, we call this the “weak-Lipschitz” condition.

Lemma D.14.

Under Assumption 4, 𝛃~Rd\tilde{\bm{\beta}}_{\textrm{R}}^{d} is an MM-Lipschitz function of 𝛃^R\hat{\bm{\beta}}_{\textrm{R}} for some constant MM.

Proof.

Recall from (20) and (24) that 𝜷~Rd=𝜷^R+𝑿(𝒚𝑿𝜷^R)nζR\tilde{\bm{\beta}}_{\textrm{R}}^{d}=\hat{\bm{\beta}}_{\textrm{R}}+\frac{\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{R}})}{n\zeta_{\textrm{R}}}. Further, the KKT condition for ridge regression implies that 1n𝑿(𝒚𝑿𝜷^R)=λR𝜷^R\frac{1}{n}\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{R}})=\lambda_{R}\hat{\bm{\beta}}_{\textrm{R}}. Since from (25) we know that ζR\zeta_{\textrm{R}} is bounded below by a positive constant ζmin\zeta_{\min}, it follows that 𝜷~Rd\tilde{\bm{\beta}}_{\textrm{R}}^{d} is an MM-Lipschitz function of 𝜷^R\hat{\bm{\beta}}_{\textrm{R}} with M=1+λmax/ζminM=1+\lambda_{\max}/\zeta_{\min}. ∎
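A minimal numerical sketch of the two identities used in this proof, namely the ridge KKT condition and the resulting rescaling form of the debiased ridge estimator; the dimensions and the value of ζ_R are illustrative (in the paper ζ_R solves (24)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam_R = 200, 50, 0.7
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Ridge solution of min_b ||y - X b||^2 / (2n) + lam_R ||b||^2 / 2
beta_R = np.linalg.solve(X.T @ X / n + lam_R * np.eye(p), X.T @ y / n)

# KKT condition: X^T (y - X beta_R) / n = lam_R * beta_R
assert np.allclose(X.T @ (y - X @ beta_R) / n, lam_R * beta_R)

# Hence the debiased ridge of (20) is a rescaling of beta_R for any zeta_R > 0
zeta_R = 0.4
beta_R_d = beta_R + X.T @ (y - X @ beta_R) / (n * zeta_R)
assert np.allclose(beta_R_d, (1 + lam_R / zeta_R) * beta_R)
```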

Proof of Lemma D.6.

Let C>0C>0 be as given by Lemma D.11, K>0K>0 as given by Lemma D.12, and M>0M>0 as given by Lemma D.13. For any ϵ>0\epsilon>0, define ϵ=min(Cϵ2K,ϵM)\epsilon^{\prime}=\min\left(\frac{C\epsilon^{2}}{K},\frac{\epsilon}{M}\right). Let k=(λmaxλmin)/ϵk=\lceil(\lambda_{\max}-\lambda_{\min})/\epsilon^{\prime}\rceil and define λi=λmin+iϵ\lambda_{i}=\lambda_{\min}+i\epsilon^{\prime} for i=0,,ki=0,...,k. By Lemma D.11, the event

{iL,iR=1,,k,𝒘p,𝒞λiR(𝒘)min𝒞λiR+Cϵ2|ϕw(𝒘)ψ(λiL,λiR)|ϵ}\left\{\forall i_{\textrm{L}},i_{\textrm{R}}=1,...,k,\forall\bm{w}\in\mathbb{R}^{p},\mathcal{C}_{\lambda_{i_{\textrm{R}}}}(\bm{w})\leq\min\mathcal{C}_{\lambda_{i_{\textrm{R}}}}+C\epsilon^{2}\Rightarrow\left|\phi_{w}(\bm{w})-\psi(\lambda_{i_{\textrm{L}}},\lambda_{i_{\textrm{R}}})\right|\leq\epsilon\right\} (35)

has probability 1k2o(ϵ2)=1o(1)1-k^{2}o(\epsilon^{2})=1-o(1). Now, for λiλRλi+1\lambda_{i}\leq\lambda_{\textrm{R}}\leq\lambda_{i+1}, let λiR\lambda_{i_{\textrm{R}}} denote whichever of λi,λi+1\lambda_{i},\lambda_{i+1} is farther from λR\lambda_{\textrm{R}} (so that |λiRλR|ϵ/2|\lambda_{i_{\textrm{R}}}-\lambda_{\textrm{R}}|\geq\epsilon^{\prime}/2), and define iLi_{\textrm{L}} similarly. On the intersection of event (35) and the event in Lemma D.12, which has probability 1o(1)1-o(1), we have that

𝒞λiR(𝒘^R(λR))min𝒞λiR+Kϵmin𝒞λiR+Cϵ2.\mathcal{C}_{\lambda_{i_{\textrm{R}}}}(\hat{\bm{w}}_{\textrm{R}}(\lambda_{\textrm{R}}))\leq\min\mathcal{C}_{\lambda_{i_{\textrm{R}}}}+K\epsilon^{\prime}\leq\min\mathcal{C}_{\lambda_{i_{\textrm{R}}}}+C\epsilon^{2}.

This implies (since we are on event (35)) that |ϕw(𝒘^R(λR))ψ(λiL,λiR)|ϵ|\phi_{w}(\hat{\bm{w}}_{\textrm{R}}(\lambda_{\textrm{R}}))-\psi(\lambda_{i_{L}},\lambda_{i_{R}})|\leq\epsilon, where 1iRk1\leq i_{\textrm{R}}\leq k. Thus, we have

|ϕw(𝒘^R(λR))ψ(λL,λR)|\displaystyle|\phi_{w}(\hat{\bm{w}}_{\textrm{R}}(\lambda_{\textrm{R}}))-\psi(\lambda_{\textrm{L}},\lambda_{\textrm{R}})|
\displaystyle\leq |ϕw(𝒘^R(λR))ψ(λiL,λiR)|+|ψ(λiL,λiR)ψ(λL,λiR)|+|ψ(λL,λiR)ψ(λL,λR)||\phi_{w}(\hat{\bm{w}}_{\textrm{R}}(\lambda_{\textrm{R}}))-\psi(\lambda_{i_{L}},\lambda_{i_{R}})|+|\psi(\lambda_{i_{L}},\lambda_{i_{R}})-\psi(\lambda_{L},\lambda_{i_{R}})|+|\psi(\lambda_{L},\lambda_{i_{R}})-\psi(\lambda_{\textrm{L}},\lambda_{\textrm{R}})|
\displaystyle\leq ϵ+2Mϵ\displaystyle\epsilon+2M\epsilon^{\prime}
\displaystyle\leq 3ϵ\displaystyle 3\epsilon

with probability 1o(1)1-o(1), where the second-to-last inequality follows from Lemma D.13 and the fact that |λiLλL|,|λiRλR|ϵ|\lambda_{i_{\textrm{L}}}-\lambda_{\textrm{L}}|,|\lambda_{i_{\textrm{R}}}-\lambda_{\textrm{R}}|\leq\epsilon^{\prime}. Finally, we note that ϕβ(𝜷~Ld,𝜷~Rd,𝜷)\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\tilde{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta}) is a Lipschitz function of 𝜷~Rd\tilde{\bm{\beta}}_{\textrm{R}}^{d}; in fact, ϕβ(𝜷~Ld,𝜷~Rd,𝜷)=ϕβ(𝜷~Ld,𝜷^R(1+λR/ζR),𝜷)\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\tilde{\bm{\beta}}_{\textrm{R}}^{d},\bm{\beta})=\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}(1+\lambda_{R}/\zeta_{R}),\bm{\beta}) by definition. This in turn equals ϕβ(𝜷~Ld,(𝒘^R+𝜷)(1+λR/ζR),𝜷)\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},(\hat{\bm{w}}_{R}+\bm{\beta})(1+\lambda_{R}/\zeta_{R}),\bm{\beta}). If we define this to be ϕw(𝒘^R)\phi_{w}(\hat{\bm{w}}_{R}), then ψ(λL,λR)=𝔼[ϕβ(𝜷~Ld,𝜷^Rf(1+λR/ζR),𝜷)|𝒈Lf=𝒈^L]=𝔼[ϕβ(𝜷~Ld,𝜷^Rf,d,𝜷)|𝒈Lf=𝒈^L]\psi(\lambda_{L},\lambda_{R})=\mathbb{E}[\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{f}(1+\lambda_{R}/\zeta_{R}),\bm{\beta})|\bm{g}^{f}_{L}=\hat{\bm{g}}_{L}]=\mathbb{E}[\phi_{\beta}(\tilde{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d},\bm{\beta})|\bm{g}^{f}_{L}=\hat{\bm{g}}_{L}], once again by definition. The preceding display then yields the desired result. ∎

Appendix E Proof of supporting lemmas for Section D.2

E.1 Proof of Lemma D.7

We introduce two supporting Lemmas:

Lemma E.1.

Under Assumption 4, for k=L,Rk=L,R and any 11-Lipschitz function ϕβ:p\phi_{\beta}:\mathbb{R}^{p}\rightarrow\mathbb{R},

supλk[λmin,λmax]|ϕβ(𝜷^k)𝔼[ϕβ(𝜷^kf)]|𝑃0.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{k})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{k}^{f})]|\overset{P}{\rightarrow}0.\\
Lemma E.2.

Recall 𝐮^k=𝐲𝐗𝛃^k\hat{\bm{u}}_{k}=\bm{y}-\bm{X}\hat{\bm{\beta}}_{k}. Further define 𝐮^kf=nζk𝐡kf\hat{\bm{u}}_{k}^{f}=\sqrt{n}\zeta_{k}\bm{h}_{k}^{f}, where (𝐡Lf,𝐡Rf)𝒩(𝟎,𝐒𝐈n)(\bm{h}_{\textrm{L}}^{f},\bm{h}_{\textrm{R}}^{f})\sim\mathcal{N}(\bm{0},\bm{S}\otimes\bm{I}_{n}) for 𝐒\bm{S} defined in Eqn. 24. Under Assumption 4, for any 11-Lipschitz function ϕu:n\phi_{u}:\mathbb{R}^{n}\rightarrow\mathbb{R},

supλk[λmin,λmax]|ϕu(𝒖^kn)𝔼[ϕu(𝒖^kfn)]|𝑃0.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\phi_{u}\left(\frac{\hat{\bm{u}}_{k}}{\sqrt{n}}\right)-\mathbb{E}\left[\phi_{u}\left(\frac{\hat{\bm{u}}_{k}^{f}}{\sqrt{n}}\right)\right]\right|\overset{P}{\rightarrow}0.

Lemmas E.1 and E.2 are proved in Section E.3.

Proof of Lemma D.7.

First consider 𝒘^k=𝜷^k𝜷\hat{\bm{w}}_{k}=\hat{\bm{\beta}}_{k}-\bm{\beta}. From Lemma E.1, we know that for any 11-Lipschitz function ϕ\phi,

supλk[λmin,λmax]|ϕ(𝒘^k)𝔼[ϕ(𝜷^kf𝜷)]|𝑃0,\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}|\phi(\hat{\bm{w}}_{k})-\mathbb{E}[\phi(\hat{\bm{\beta}}_{k}^{f}-\bm{\beta})]|\overset{P}{\rightarrow}0,

where recall from Eqn. 21 that 𝜷^kf𝜷=ηk(𝜷+𝒈kf,ζk)𝜷\hat{\bm{\beta}}_{k}^{f}-\bm{\beta}=\eta_{k}(\bm{\beta}+\bm{g}_{k}^{f},\zeta_{k})-\bm{\beta} is a 1/p1/\sqrt{p}-Lipschitz function of p𝒈kf\sqrt{p}\bm{g}_{k}^{f} due to the Lipschitzness of the proximal mapping operator, where p𝒈kf𝒩(𝟎,pτk2𝑰p)\sqrt{p}\bm{g}_{k}^{f}\sim\mathcal{N}(\bm{0},p\tau_{k}^{2}\bm{I}_{p}) with pτk2δτmax2+δσmax2p\tau_{k}^{2}\leq\delta\tau_{\max}^{2}+\delta\sigma_{\max}^{2} and δ=p/n\delta=p/n. Also, 𝔼[𝒈kf22]=pτk2\mathbb{E}[\|\bm{g}_{k}^{f}\|_{2}^{2}]=p\tau_{k}^{2}. Moreover, from (21) and (24), 𝔼[𝜷^kf𝜷22]=nτk2σ2τmax2\mathbb{E}[\|\hat{\bm{\beta}}_{k}^{f}-\bm{\beta}\|_{2}^{2}]=n\tau_{k}^{2}-\sigma^{2}\leq\tau_{\max}^{2}, which is bounded by (25). Therefore, by Lemma H.3, we know

supλL[λmin,λmax]|𝒘^k22(nτk2σ2)|𝑃0.\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\|\hat{\bm{w}}_{k}\|_{2}^{2}-(n\tau_{k}^{2}-\sigma^{2})\right|\overset{P}{\rightarrow}0.

We also know that nτk2σ2τmin2n\tau_{k}^{2}-\sigma^{2}\geq\tau_{\min}^{2} by (25). Thus, by Lemma H.7,

supλL[λmin,λmax]|𝒘^k2nτk2σ2|𝑃0.\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\|\hat{\bm{w}}_{k}\|_{2}-\sqrt{n\tau_{k}^{2}-\sigma^{2}}\right|\overset{P}{\rightarrow}0.

As a direct corollary, with probability 1o(1)1-o(1), 𝒘^k22nτk2σ22τmax\|\hat{\bm{w}}_{k}\|_{2}\leq 2\sqrt{n\tau_{k}^{2}-\sigma^{2}}\leq 2\tau_{\max} and 𝒘^k2nτk2σ2/2τmin/2\|\hat{\bm{w}}_{k}\|_{2}\geq\sqrt{n\tau_{k}^{2}-\sigma^{2}}/2\geq\tau_{\min}/2 for all λk\lambda_{k}, so 𝒘^k2\|\hat{\bm{w}}_{k}\|_{2} is bounded both above and below for all λk\lambda_{k} with probability 1o(1)1-o(1). Similarly, the convergence of 𝒖^k2/n\|\hat{\bm{u}}_{k}\|_{2}/\sqrt{n} follows by starting from Lemma E.2 and combining with (25) and Lemmas H.3 and H.7.

Finally we consider 𝒗^k\hat{\bm{v}}_{k}. We know 𝒗^k=𝑿𝒖^k\hat{\bm{v}}_{k}=\bm{X}^{\top}\hat{\bm{u}}_{k} and that 𝑿op/n\|\bm{X}\|_{\textnormal{op}}/\sqrt{n} is bounded with probability 1o(1)1-o(1) (Corollary H.1). Thus,

supλk[λmin,λmax]𝒗^k2/n𝑿op/nsupλk[λmin,λmax]𝒖^k2/n,\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{v}}_{k}\|_{2}/n\leq\|\bm{X}\|_{\textnormal{op}}/\sqrt{n}\cdot\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{u}}_{k}\|_{2}/\sqrt{n},

which is bounded above with probability 1o(1)1-o(1). This completes the proof. ∎

E.2 Proof of Lemma D.13

We introduce another Lemma:

Lemma E.3.

Under Assumption 4, for any ϵ>0\epsilon>0, consider any λ1,λ2[λmin,λmax]\lambda_{1},\lambda_{2}\in[\lambda_{\min},\lambda_{\max}] such that |λ1λ2|ϵ|\lambda_{1}-\lambda_{2}|\geq\epsilon. Then, with probability 1o(1)1-o(1), we have

𝜷^L(λ)2M,λ[λmin,λmax],𝜷^L(λ1)𝜷^L(λ2)2M|λ1λ2|.\begin{gathered}\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda)\|_{2}\leq M,\,\,\forall\lambda\in[\lambda_{\min},\lambda_{\max}],\\ \|\hat{\bm{\beta}}_{L}(\lambda_{1})-\hat{\bm{\beta}}_{L}(\lambda_{2})\|_{2}\leq M|\lambda_{1}-\lambda_{2}|.\end{gathered}

for some positive constant MM that does not depend on ϵ\epsilon.

Proof.

The first line follows directly from Lemma D.7 and the fact that 𝜷^L=𝒘^L+𝜷\hat{\bm{\beta}}_{\textrm{L}}=\hat{\bm{w}}_{\textrm{L}}+\bm{\beta}. For the second line, consider any ϵ>0\epsilon>0. It is a direct consequence of Lemma E.1 that

(supλL[λmin,λmax]|𝜷^L(λL)2𝔼𝜷^Lf(λL)2|ϵ)=o(1).\mathbb{P}\left(\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}\left|\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{\textrm{L}})\|_{2}-\mathbb{E}\|\hat{\bm{\beta}}_{\textrm{L}}^{f}(\lambda_{\textrm{L}})\|_{2}\right|\geq\epsilon\right)=o(1). (36)

For any λ1,λ2\lambda_{1},\lambda_{2}, by triangle inequality, we have

𝜷^L(λ1)𝜷^L(λ2)2𝜷^L(λ1)𝜷^Lf(λ1)2+𝜷^L(λ2)𝜷^Lf(λ2)2+𝜷^Lf(λ1)𝜷^Lf(λ2)2.\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{1})-\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{2})\|_{2}\leq\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{1})-\hat{\bm{\beta}}_{\textrm{L}}^{f}(\lambda_{1})\|_{2}+\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{2})-\hat{\bm{\beta}}_{\textrm{L}}^{f}(\lambda_{2})\|_{2}+\|\hat{\bm{\beta}}_{\textrm{L}}^{f}(\lambda_{1})-\hat{\bm{\beta}}_{\textrm{L}}^{f}(\lambda_{2})\|_{2}.

Now, 𝜷^L(λ1)𝜷^Lf(λ1)2ϵ\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{1})-\hat{\bm{\beta}}_{\textrm{L}}^{f}(\lambda_{1})\|_{2}\leq\epsilon and 𝜷^L(λ2)𝜷^Lf(λ2)2ϵ\|\hat{\bm{\beta}}_{\textrm{L}}(\lambda_{2})-\hat{\bm{\beta}}_{\textrm{L}}^{f}(\lambda_{2})\|_{2}\leq\epsilon with probability 1o(1)1-o(1) by (36). Further, recalling the definition of 𝜷^Lf\hat{\bm{\beta}}_{\textrm{L}}^{f} in (21) and noticing that the minimization problem is separable, we see that the ii-th entry of 𝜷^Lf\hat{\bm{\beta}}_{\textrm{L}}^{f} satisfies

𝜷^Lf,(i)=η(βi+τLZi,λLnζL),\hat{\bm{\beta}}_{\textrm{L}}^{f,(i)}=\eta(\beta_{i}+\tau_{L}Z_{i},\frac{\lambda_{\textrm{L}}}{\sqrt{n}\zeta_{\textrm{L}}}),

where Zii.i.d.𝒩(0,1)Z_{i}\overset{i.i.d.}{\sim}\mathcal{N}(0,1) and

η(x,b)={x+b,x<b0,bxbxb,x>b,\eta(x,b)=\begin{cases}x+b,\ x<-b\\ 0,\ -b\leq x\leq b\\ x-b,\ x>b\end{cases},

the soft-thresholding operator, is 11-Lipschitz in both xx and bb. Thus,

𝜷^Lf(λ1)𝜷^Lf(λ2)22\displaystyle\|\hat{\bm{\beta}}_{L}^{f}(\lambda_{1})-\hat{\bm{\beta}}_{L}^{f}(\lambda_{2})\|_{2}^{2}
=\displaystyle= i=1p(η(βi+τL(λ1)Zi,λ1nζL(λ1))η(βi+τL(λ2)Zi,λ2nζL(λ2)))2\displaystyle\sum_{i=1}^{p}\left(\eta(\beta_{i}+\tau_{L}(\lambda_{1})Z_{i},\frac{\lambda_{1}}{\sqrt{n}\zeta_{L}(\lambda_{1})})-\eta(\beta_{i}+\tau_{L}(\lambda_{2})Z_{i},\frac{\lambda_{2}}{\sqrt{n}\zeta_{L}(\lambda_{2})})\right)^{2}
\displaystyle\leq i=1p2((τL(λ1)ZiτL(λ2)Zi)2+(λ1nζL(λ1)λ2nζL(λ2))2)\displaystyle\sum_{i=1}^{p}2\left((\tau_{L}(\lambda_{1})Z_{i}-\tau_{L}(\lambda_{2})Z_{i})^{2}+(\frac{\lambda_{1}}{\sqrt{n}\zeta_{L}(\lambda_{1})}-\frac{\lambda_{2}}{\sqrt{n}\zeta_{L}(\lambda_{2})})^{2}\right)
=\displaystyle= 2((τL(λ1)τL(λ2))2i=1pZi2+δ(λ1ζL(λ1)λ2ζL(λ2))2)\displaystyle 2\left((\tau_{L}(\lambda_{1})-\tau_{L}(\lambda_{2}))^{2}\sum_{i=1}^{p}Z_{i}^{2}+\delta(\frac{\lambda_{1}}{\zeta_{L}(\lambda_{1})}-\frac{\lambda_{2}}{\zeta_{L}(\lambda_{2})})^{2}\right)
\displaystyle\leq M2|λ1λ2|2\displaystyle M^{2}|\lambda_{1}-\lambda_{2}|^{2}

for some constant MM with probability 1o(1)1-o(1), where we used (25), Lemma H.8, and the facts that nτL(λ),ζL(λ)\sqrt{n}\tau_{L}(\lambda),\zeta_{L}(\lambda) are bounded Lipschitz functions of λL\lambda_{L} (from Lemma H.2) and i=1pZi2/p\sum_{i=1}^{p}Z_{i}^{2}/p is bounded with probability 1o(1)1-o(1). Combining the above, we know that with probability 1o(1)1-o(1),

𝜷^L(λ1)𝜷^L(λ2)22ϵ+M|λ1λ2|(M+2)|λ1λ2|,\|\hat{\bm{\beta}}_{L}(\lambda_{1})-\hat{\bm{\beta}}_{L}(\lambda_{2})\|_{2}\leq 2\epsilon+M|\lambda_{1}-\lambda_{2}|\leq(M+2)|\lambda_{1}-\lambda_{2}|,

which concludes the proof. ∎
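The only analytic input to the displayed chain above is that the soft-thresholding operator η is 1-Lipschitz in each of its two arguments, so that (η(x₁,b₁)−η(x₂,b₂))² ≤ 2(x₁−x₂)² + 2(b₁−b₂)². A minimal numerical sanity check of this inequality:

```python
import numpy as np

def eta(x, b):
    """Soft-thresholding operator; 1-Lipschitz in both x and b (for b >= 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - b, 0.0)

rng = np.random.default_rng(2)
x1, x2 = rng.standard_normal(10**5), rng.standard_normal(10**5)
b1, b2 = np.abs(rng.standard_normal(10**5)), np.abs(rng.standard_normal(10**5))

# (eta(x1, b1) - eta(x2, b2))^2 <= 2 (x1 - x2)^2 + 2 (b1 - b2)^2, componentwise
lhs = (eta(x1, b1) - eta(x2, b2)) ** 2
rhs = 2 * (x1 - x2) ** 2 + 2 * (b1 - b2) ** 2
assert np.all(lhs <= rhs + 1e-12)
```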

As a corollary, we have the following lemma:

Lemma E.4.

Under Assumption 4, for any ϵ>0\epsilon>0, consider any λ1,λ2[λmin,λmax]\lambda_{1},\lambda_{2}\in[\lambda_{\min},\lambda_{\max}] such that |λ1λ2|ϵ|\lambda_{1}-\lambda_{2}|\geq\epsilon, then with probability 1o(1)1-o(1), we have

𝒈^L(λ1)𝒈^L(λ2)2M|λ1λ2|,\|\hat{\bm{g}}_{L}(\lambda_{1})-\hat{\bm{g}}_{L}(\lambda_{2})\|_{2}\leq M|\lambda_{1}-\lambda_{2}|,

for some constant MM (that does not depend on ϵ\epsilon).

Proof.

Recall Eqn. 26, and note the following:

  • ζL,nτL\zeta_{\textrm{L}},\sqrt{n}\tau_{L} are bounded Lipschitz functions of λL\lambda_{\textrm{L}}, and as a simple corollary, nτL2σ2\sqrt{n\tau_{\textrm{L}}^{2}-\sigma^{2}} is also a bounded Lipschitz function of λL\lambda_{\textrm{L}}.

  • |𝒘^L(λ1)2𝒘^L(λ2)2|𝒘^L(λ1)𝒘^L(λ2)2=𝜷^L(λ1)𝜷^L(λ2)2M|λ1λ2|\left|\|\hat{\bm{w}}_{L}(\lambda_{1})\|_{2}-\|\hat{\bm{w}}_{L}(\lambda_{2})\|_{2}\right|\leq\|\hat{\bm{w}}_{L}(\lambda_{1})-\hat{\bm{w}}_{L}(\lambda_{2})\|_{2}=\|\hat{\bm{\beta}}_{L}(\lambda_{1})-\hat{\bm{\beta}}_{L}(\lambda_{2})\|_{2}\leq M|\lambda_{1}-\lambda_{2}| with probability 1o(1)1-o(1) by Lemma E.3. Also, 𝒘^L2\|\hat{\bm{w}}_{\textrm{L}}\|_{2} is bounded with probability 1o(1)1-o(1) by Lemma D.7.

  • 𝒖^L=σ𝒛𝑿𝒘^L\hat{\bm{u}}_{\textrm{L}}=\sigma\bm{z}-\bm{X}\hat{\bm{w}}_{\textrm{L}}, so 1n𝒖^L(λ1)𝒖^L(λ2)21nσmax(𝑿)𝒘^L(λ1)𝒘^L(λ2)2(2+δ)M|λ1λ2|\frac{1}{\sqrt{n}}\|\hat{\bm{u}}_{L}(\lambda_{1})-\hat{\bm{u}}_{L}(\lambda_{2})\|_{2}\leq\frac{1}{\sqrt{n}}\sigma_{\max}(\bm{X})\|\hat{\bm{w}}_{L}(\lambda_{1})-\hat{\bm{w}}_{L}(\lambda_{2})\|_{2}\leq(2+\sqrt{\delta})M|\lambda_{1}-\lambda_{2}| with probability 1o(1)1-o(1), where we used Corollary H.1. Also 1n𝒖^L21n(σ𝒛2+𝑿𝒘^L2)\frac{1}{\sqrt{n}}\|\hat{\bm{u}}_{\textrm{L}}\|_{2}\leq\frac{1}{\sqrt{n}}(\sigma\|\bm{z}\|_{2}+\|\bm{X}\hat{\bm{w}}_{\textrm{L}}\|_{2}) is bounded with probability 1o(1)1-o(1) for the same reason.

  • By the same argument, 1n𝒗^L(λ1)𝒗^L(λ2)2(2+δ)2M|λ1λ2|\frac{1}{n}\|\hat{\bm{v}}_{L}(\lambda_{1})-\hat{\bm{v}}_{L}(\lambda_{2})\|_{2}\leq(2+\sqrt{\delta})^{2}M|\lambda_{1}-\lambda_{2}| with probability 1o(1)1-o(1) and 1n𝒗^2\frac{1}{n}\|\hat{\bm{v}}\|_{2} is bounded with probability 1o(1)1-o(1).

Then the proof is complete on iteratively applying Lemma H.8 on the above displays. ∎

We are now ready to prove Lemma D.13.

Proof of Lemma D.13.

By Jensen’s inequality and the fact that ϕw\phi_{w} is 1-Lipschitz, we only need to show that conditional on 𝒈Lf=𝒈^L\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}, 𝒘^Rf=𝜷^Rf𝜷\hat{\bm{w}}_{\textrm{R}}^{f}=\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta} satisfies the “weak-Lipschitz” condition in λL\lambda_{\textrm{L}} with probability 1o(1)1-o(1), where by Eqn. 40,

𝜷^Rf𝜷\displaystyle\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta} =1ζR+λR(ζR𝒈RfλR𝜷)\displaystyle=\frac{1}{\zeta_{\textrm{R}}+\lambda_{\textrm{R}}}(\zeta_{\textrm{R}}\bm{g}_{\textrm{R}}^{f}-\lambda_{R}\bm{\beta})
=1ζR+λR(ζRτRρ/τL𝒈^L+ζRτR1ρ2𝝃λR𝜷).\displaystyle=\frac{1}{\zeta_{\textrm{R}}+\lambda_{\textrm{R}}}(\zeta_{\textrm{R}}\tau_{R}\rho/\tau_{\textrm{L}}\cdot\hat{\bm{g}}_{\textrm{L}}+\zeta_{\textrm{R}}\tau_{R}\sqrt{1-\rho^{2}}\cdot\bm{\xi}-\lambda_{\textrm{R}}\bm{\beta}).

Now we note the following observations.

  • ζR,λR,𝜷\zeta_{\textrm{R}},\lambda_{\textrm{R}},\bm{\beta} do not depend on λL\lambda_{\textrm{L}}, and by Lemma H.2, ζR,λR\zeta_{\textrm{R}},\lambda_{\textrm{R}} are both bounded Lipschitz functions of λR\lambda_{\textrm{R}}. Further, 𝜷2\|\bm{\beta}\|_{2} is bounded. Thus, by Lemma H.8, λRζR+λR𝜷\frac{\lambda_{\textrm{R}}}{\zeta_{\textrm{R}}+\lambda_{\textrm{R}}}\bm{\beta} is Lipschitz in both λL,λR\lambda_{\textrm{L}},\lambda_{\textrm{R}} with some constant M1M_{1}.

  • In addition, by Lemma H.2, nτR\sqrt{n}\tau_{R} is a bounded Lipschitz function of λR\lambda_{\textrm{R}}, and 1ρ2\sqrt{1-\rho^{2}} is bounded and Lipschitz in both λL\lambda_{\textrm{L}} and λR\lambda_{\textrm{R}}. Further, 𝝃/n𝒩(𝟎,𝑰p/n)\bm{\xi}/\sqrt{n}\sim\mathcal{N}(\bm{0},\bm{I}_{p}/n), so 𝝃2/n\|\bm{\xi}\|_{2}/\sqrt{n} is bounded with probability 1o(1)1-o(1). Therefore, by Lemma H.8, ζRτR1ρ2ζR+λR𝝃\frac{\zeta_{\textrm{R}}\tau_{R}\sqrt{1-\rho^{2}}}{\zeta_{\textrm{R}}+\lambda_{\textrm{R}}}\bm{\xi} is Lipschitz in both λL,λR\lambda_{\textrm{L}},\lambda_{\textrm{R}} with some constant M2M_{2} with probability 1o(1)1-o(1).

  • By Lemma H.2, ρ\rho is bounded and Lipschitz in both λL\lambda_{\textrm{L}} and λR\lambda_{\textrm{R}}, and nτk\sqrt{n}\tau_{k} is bounded and Lipschitz in λk\lambda_{k}.

  • By Lemma D.4 and Lemma D.8, 𝒈^L22𝜷^Ld𝜷24𝔼[𝜷^Lf,d𝜷2]4𝔼[𝜷^Lf,d𝜷22]=4nτL24(τmax2+σmax2)\|\hat{\bm{g}}_{\textrm{L}}\|_{2}\leq 2\|\hat{\bm{\beta}}_{\textrm{L}}^{d}-\bm{\beta}\|_{2}\leq 4\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{L}}^{f,d}-\bm{\beta}\|_{2}]\leq 4\sqrt{\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{L}}^{f,d}-\bm{\beta}\|_{2}^{2}]}=4n\tau_{\textrm{L}}^{2}\leq 4(\tau_{\max}^{2}+\sigma_{\max}^{2}) with probability 1o(1)1-o(1), where the equality follows from (21) and (24).

  • Further, 𝒈^L\hat{\bm{g}}_{\textrm{L}} does not depend on λR\lambda_{\textrm{R}}, and by Lemma E.4, 𝒈^L\hat{\bm{g}}_{\textrm{L}} satisfies the “weak-Lipschitz” condition w.r.t. λL\lambda_{\textrm{L}} with probability 1o(1)1-o(1).

  • Hence, by Lemma H.8, ζRτRρτL(ζR+λR)𝒈^L\frac{\zeta_{\textrm{R}}\tau_{R}\rho}{\tau_{L}(\zeta_{\textrm{R}}+\lambda_{\textrm{R}})}\hat{\bm{g}}_{\textrm{L}} satisfies the aforementioned condition w.r.t. both λL,λR\lambda_{\textrm{L}},\lambda_{\textrm{R}} with some constant M3M_{3} with probability 1o(1)1-o(1).

Combining the above steps completes the proof. ∎

E.3 Proof of Lemma E.1

For the case of the Lasso, (miolane2021distribution, Theorem 3.1) proved an analogous result for W2W_{2} convergence. Although this does not directly yield our current lemma, the ideas therein sometimes prove to be useful. Below we present the proof of Lemma E.1 for the case of the ridge. Our approach towards handling 1\ell_{1} convergence can be adapted to extend (miolane2021distribution, Theorem 3.1) for the lasso to an 1\ell_{1} convergence result as well. For simplicity, we drop the subscript RR for this section. For convenience, we also make the λ\lambda dependence explicit for certain expressions in this section (via writing 𝜷^(λ)\hat{\bm{\beta}}(\lambda), for instance).

E.3.1 Converting the optimization problem

First, although the original optimization problem is

𝜷^(λ)=argmin𝒃{12n𝒚𝑿𝒃22+λ2𝒃22}:=argmin𝒃λ(𝒃),\hat{\bm{\beta}}(\lambda)=\operatorname*{arg\,min}_{\bm{b}}\Big{\{}\frac{1}{2n}\|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}+\frac{\lambda}{2}\|\bm{b}\|_{2}^{2}\Big{\}}:=\operatorname*{arg\,min}_{\bm{b}}\mathcal{L}_{\lambda}(\bm{b}),

it is more convenient to work with 𝒘^(λ)=𝜷^(λ)𝜷\hat{\bm{w}}(\lambda)=\hat{\bm{\beta}}(\lambda)-\bm{\beta}, which satisfies

𝒘^(λ)\displaystyle\hat{\bm{w}}(\lambda) =argmin𝒘12n𝑿𝒘σ𝒛22+λ2𝒘+𝜷22:=argmin𝒘𝒞λ(𝒘)\displaystyle=\operatorname*{arg\,min}_{\bm{w}}\frac{1}{2n}\|\bm{X}\bm{w}-\sigma\bm{z}\|_{2}^{2}+\frac{\lambda}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}:=\operatorname*{arg\,min}_{\bm{w}}\mathcal{C}_{\lambda}(\bm{w}) (37)
=argmin𝒘max𝒖1n𝒖𝑿𝒘σn𝒖𝒛12n𝒖22+λ2𝒘+𝜷22:=argmin𝒘max𝒖cλ(𝒘,𝒖),\displaystyle=\operatorname*{arg\,min}_{\bm{w}}\max_{\bm{u}}\frac{1}{n}\bm{u}^{\top}\bm{X}\bm{w}-\frac{\sigma}{n}\bm{u}^{\top}\bm{z}-\frac{1}{2n}\|\bm{u}\|_{2}^{2}+\frac{\lambda}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}:=\operatorname*{arg\,min}_{\bm{w}}\max_{\bm{u}}c_{\lambda}(\bm{w},\bm{u}),

where we used the fact that 𝒙22=max𝒖2𝒖𝒙𝒖22\|\bm{x}\|_{2}^{2}=\max_{\bm{u}}2\bm{u}^{\top}\bm{x}-\|\bm{u}\|_{2}^{2}. We call this the Primary Optimization (PO) problem, which involves a random matrix 𝑿\bm{X}. We then define the following Auxiliary Optimization (AO) that involves only independent random vectors 𝝃g𝒩(0,𝑰p),𝝃h𝒩(0,𝑰n)\bm{\xi}_{g}\sim\mathcal{N}(0,\bm{I}_{p}),\bm{\xi}_{h}\sim\mathcal{N}(0,\bm{I}_{n}). We call its solution 𝒘(λ)\bm{w}^{*}(\lambda).

𝒘(λ)\displaystyle\bm{w}^{*}(\lambda) =argmin𝒘max𝒖1n𝒖2𝝃g𝒘+𝒘22+σ2𝝃h𝒖n12n𝒖22+λ2𝒘+𝜷22:=argmin𝒘max𝒖lλ(𝒘,𝒖)\displaystyle=\operatorname*{arg\,min}_{\bm{w}}\max_{\bm{u}}\frac{1}{n}\|\bm{u}\|_{2}\bm{\xi}_{g}^{\top}\bm{w}+\sqrt{\|\bm{w}\|_{2}^{2}+\sigma^{2}}\frac{\bm{\xi}_{h}^{\top}\bm{u}}{n}-\frac{1}{2n}\|\bm{u}\|_{2}^{2}+\frac{\lambda}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}:=\operatorname*{arg\,min}_{\bm{w}}\max_{\bm{u}}l_{\lambda}(\bm{w},\bm{u}) (38)
=argmin𝒘maxα0αn(𝝃g𝒘+𝒘22+σ2𝝃h2)α22+λ2𝒘+𝜷22:=argmin𝒘maxα0λ(𝒘,α)\displaystyle=\operatorname*{arg\,min}_{\bm{w}}\max_{\alpha\geq 0}\frac{\alpha}{\sqrt{n}}(\bm{\xi}_{g}^{\top}\bm{w}+\sqrt{\|\bm{w}\|_{2}^{2}+\sigma^{2}}\|\bm{\xi}_{h}\|_{2})-\frac{\alpha^{2}}{2}+\frac{\lambda}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}:=\operatorname*{arg\,min}_{\bm{w}}\max_{\alpha\geq 0}\ell_{\lambda}(\bm{w},\alpha)
=argmin𝒘12n(𝝃g𝒘+𝒘22+σ2𝝃h2)+2+λ2𝒘+𝜷22:=argmin𝒘Lλ(𝒘).\displaystyle=\operatorname*{arg\,min}_{\bm{w}}\frac{1}{2n}(\bm{\xi}_{g}^{\top}\bm{w}+\sqrt{\|\bm{w}\|_{2}^{2}+\sigma^{2}}\|\bm{\xi}_{h}\|_{2})_{+}^{2}+\frac{\lambda}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}:=\operatorname*{arg\,min}_{\bm{w}}L_{\lambda}(\bm{w}).

Note that to obtain the third equality we set α:=𝒖2/n\alpha:=\|\bm{u}\|_{2}/\sqrt{n}. Section E.3.3 will establish a formal connection between the PO and the AO.

Next, we perform some non-rigorous calculations using the AO to gain insight and show that it should asymptotically behave as a much simpler scalar optimization problem. To this end, we use the fact that 𝒘22+σ2=minτ0𝒘22+σ22τ+τ2,\sqrt{\|\bm{w}\|_{2}^{2}+\sigma^{2}}=\min_{\tau\geq 0}\frac{\|\bm{w}\|_{2}^{2}+\sigma^{2}}{2\tau}+\frac{\tau}{2}, and obtain

min𝒘maxα0λ(𝒘,α)\displaystyle\min_{\bm{w}}\max_{\alpha\geq 0}\ell_{\lambda}(\bm{w},\alpha)
=\displaystyle= min𝒘maxα0minτ0αn(𝝃g𝒘+(𝒘22+σ22τ+τ2)𝝃h2)α22+λ2𝒘+𝜷22.\displaystyle\min_{\bm{w}}\max_{\alpha\geq 0}\min_{\tau\geq 0}\frac{\alpha}{\sqrt{n}}\left(\bm{\xi}_{g}^{\top}\bm{w}+\left(\frac{\|\bm{w}\|_{2}^{2}+\sigma^{2}}{2\tau}+\frac{\tau}{2}\right)\|\bm{\xi}_{h}\|_{2}\right)-\frac{\alpha^{2}}{2}+\frac{\lambda}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}.

We know in the asymptotic limit 𝝃h2/n1\|\bm{\xi}_{h}\|_{2}/\sqrt{n}\rightarrow 1. Substituting in and optimizing 𝒘\bm{w} first, we have

argmin𝒘αn𝝃g𝒘+α(𝒘22+σ2)2τ+ατ2α22+λ2𝒘+𝜷22=1α/τ+λ(αn𝝃g+λ𝜷).\operatorname*{arg\,min}_{\bm{w}}\frac{\alpha}{\sqrt{n}}\bm{\xi}_{g}^{\top}\bm{w}+\frac{\alpha(\|\bm{w}\|_{2}^{2}+\sigma^{2})}{2\tau}+\frac{\alpha\tau}{2}-\frac{\alpha^{2}}{2}+\frac{\lambda}{2}\|\bm{w}+\bm{\beta}\|_{2}^{2}=-\frac{1}{\alpha/\tau+\lambda}\left(\frac{\alpha}{\sqrt{n}}\bm{\xi}_{g}+\lambda\bm{\beta}\right).

Plugging in and using the fact that 𝝃g22/p1\|\bm{\xi}_{g}\|_{2}^{2}/p\rightarrow 1, 𝜷22σβ2\|\bm{\beta}\|_{2}^{2}\rightarrow\sigma_{\beta}^{2}, and 𝝃g𝜷/n0\bm{\xi}_{g}^{\top}\bm{\beta}/\sqrt{n}\rightarrow 0, we arrive at a Scalar Optimization (SO) problem in the asymptotic limit:

(α,τ)\displaystyle(\alpha_{*},\tau_{*}) =argmaxα0minτ0ψ(α,τ)\displaystyle=\operatorname*{arg\,max}_{\alpha\geq 0}\min_{\tau\geq 0}\psi(\alpha,\tau) (39)
:=argmaxα0minτ0ασ22τ+ατ2α22α2τδ2(α+τλ)+αλσβ22(α+τλ).\displaystyle:=\operatorname*{arg\,max}_{\alpha\geq 0}\min_{\tau\geq 0}\frac{\alpha\sigma^{2}}{2\tau}+\frac{\alpha\tau}{2}-\frac{\alpha^{2}}{2}-\frac{\alpha^{2}\tau\delta}{2(\alpha+\tau\lambda)}+\frac{\alpha\lambda\sigma_{\beta}^{2}}{2(\alpha+\tau\lambda)}.

By some algebra, it turns out that the solution to (39) and the Ridge part of (24) can be related as τ=nτR\tau_{\star}=\sqrt{n}\tau_{R} and α=τζR\alpha_{\star}=\tau_{\star}\zeta_{\textrm{R}}. So by the discussion preceding (25), we know that (α,τ)(\alpha_{\star},\tau_{\star}) is unique and bounded. We in turn define

𝒘(λ)=1α/τ+λ(αn𝝃g+λ𝜷).\bm{w}(\lambda)=-\frac{1}{\alpha_{*}/\tau_{*}+\lambda}\left(\frac{\alpha_{*}}{\sqrt{n}}\bm{\xi}_{g}+\lambda\bm{\beta}\right). (40)

By some algebra, this satisfies

𝒘(λ)=𝜷^f(λ)𝜷.\bm{w}(\lambda)=\hat{\bm{\beta}}^{f}(\lambda)-\bm{\beta}. (41)
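To make the scalar reduction concrete, the saddle point (α⋆, τ⋆) of (39) can be located numerically by nested one-dimensional optimization, after which (40) gives 𝒘(λ). The sketch below assumes illustrative values of σ, σ_β, δ, and λ and an illustrative bracketing interval; it is not part of the proof.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def psi(alpha, tau, sigma, sigma_beta, delta, lam):
    """Objective of the scalar optimization problem (39)."""
    return (alpha * sigma**2 / (2 * tau) + alpha * tau / 2 - alpha**2 / 2
            - alpha**2 * tau * delta / (2 * (alpha + tau * lam))
            + alpha * lam * sigma_beta**2 / (2 * (alpha + tau * lam)))

def solve_SO(sigma, sigma_beta, delta, lam, upper=50.0):
    """Approximate max_{alpha >= 0} min_{tau >= 0} psi via nested scalar optimization."""
    def inner_min(alpha):
        return minimize_scalar(lambda tau: psi(alpha, tau, sigma, sigma_beta, delta, lam),
                               bounds=(1e-6, upper), method="bounded").fun
    alpha_star = minimize_scalar(lambda alpha: -inner_min(alpha),
                                 bounds=(1e-6, upper), method="bounded").x
    tau_star = minimize_scalar(lambda tau: psi(alpha_star, tau, sigma, sigma_beta, delta, lam),
                               bounds=(1e-6, upper), method="bounded").x
    return alpha_star, tau_star

# Example usage with illustrative parameters:
# alpha_star, tau_star = solve_SO(sigma=1.0, sigma_beta=1.0, delta=0.5, lam=0.3)
# The relations tau_star = sqrt(n) * tau_R and alpha_star = tau_star * zeta_R
# then recover the Ridge parameters of (24).
```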

We will rigorously prove the above conversion of the AO to the SO in Section E.3.2.

E.3.2 Connecting AO with SO

First we show that AO has a minimizer.

Proposition E.5.

LλL_{\lambda} admits almost surely a unique minimizer 𝐰(λ)\bm{w}^{*}(\lambda) on p\mathbb{R}^{p}.

Proof.

First, LλL_{\lambda} is a convex function that goes to \infty at \infty, so it has a minimizer.

Case 1: there exists a minimizer 𝒘\bm{w} such that 𝝃g𝒘+𝒘22+σ2𝝃h2>0\bm{\xi}_{g}^{\top}\bm{w}+\sqrt{\|\bm{w}\|_{2}^{2}+\sigma^{2}}\|\bm{\xi}_{h}\|_{2}>0.

In that case, there exists a neighborhood O𝒘O_{\bm{w}} of 𝒘\bm{w} such that for all 𝒘O𝒘\bm{w}^{\prime}\in O_{\bm{w}}, a(𝒘):=𝝃g𝒘+𝒘22+σ2𝝃h2>0a(\bm{w}^{\prime}):=\bm{\xi}_{g}^{\top}\bm{w}^{\prime}+\sqrt{\|\bm{w}^{\prime}\|_{2}^{2}+\sigma^{2}}\|\bm{\xi}_{h}\|_{2}>0. Thus for all 𝒘O𝒘,Lλ(𝒘)=12na(𝒘)2+λ2𝒘+𝜷22\bm{w}^{\prime}\in O_{\bm{w}},L_{\lambda}(\bm{w}^{\prime})=\frac{1}{2n}a(\bm{w}^{\prime})^{2}+\frac{\lambda}{2}\|\bm{w}^{\prime}+\bm{\beta}\|_{2}^{2} is strictly convex, because a(𝒘)a(\bm{w}^{\prime}) is strictly convex (its first term is linear and its second term is strictly convex since σ>0\sigma>0) and remains positive on O𝒘O_{\bm{w}}, and xx2x\mapsto x^{2} is strictly increasing and convex on x>0x>0. Hence 𝒘\bm{w} is the only minimizer of LλL_{\lambda}.

Case 2: for all minimizer 𝒘\bm{w} we have 𝝃g𝒘+𝒘2+σ2𝝃h20\bm{\xi}_{g}^{\top}\bm{w}+\sqrt{\|\bm{w}\|^{2}+\sigma^{2}}\|\bm{\xi}_{h}\|_{2}\leq 0.

In this case, from the first-order optimality condition, any minimizer 𝒘\bm{w} must satisfy λ(𝒘+𝜷)=0\lambda(\bm{w}+\bm{\beta})=0, which implies 𝒘=𝜷\bm{w}=-\bm{\beta}, so LλL_{\lambda} has a unique minimizer. ∎

Then we show that the minimizer 𝒘(λ)\bm{w}^{*}(\lambda) is close to 𝒘(λ)\bm{w}(\lambda):

Lemma E.6.

There exists a constant γ\gamma such that for all ϵ(0,1]\epsilon\in(0,1],

supλ[λmin,λmax](𝒘p,𝒘𝒘(λ)22>ϵandLλ(𝒘)min𝒗pLλ(𝒗)+γϵ)=o(ϵ),\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\exists\bm{w}\in\mathbb{R}^{p},\|\bm{w}-\bm{w}(\lambda)\|_{2}^{2}>\epsilon\ \ \text{and}\ \ L_{\lambda}(\bm{w})\leq\min_{\bm{v}\in\mathbb{R}^{p}}L_{\lambda}(\bm{v})+\gamma\epsilon\right)=o(\epsilon),

where 𝐰(λ)\bm{w}(\lambda) is defined as in (40).

Proof.

The proof is analogous to that of (miolane2021distribution, Theorem B.1), which handles the case of the Lasso. For conciseness, we refer readers to their proof. Note that we can safely add the supremum in front, since the o(ϵ)o(\epsilon) term hides constants that do not depend on λ\lambda. ∎

E.3.3 Connecting PO with AO

We introduce a useful corollary that follows from the Convex Gaussian Min-Max Theorem (Theorem H.1) and connects PO with AO.

Corollary E.1.
  • Let DpD\subset\mathbb{R}^{p} be a closed set. We have for all tt\in\mathbb{R},

    (min𝒘D𝒞λ(𝒘)t)2(min𝒘DLλ(𝒘)t).\mathbb{P}(\min_{\bm{w}\in D}\mathcal{C}_{\lambda}(\bm{w})\leq t)\leq 2\mathbb{P}(\min_{\bm{w}\in D}L_{\lambda}(\bm{w})\leq t).
  • Let DpD\subset\mathbb{R}^{p} be a convex closed set. We have for all tt\in\mathbb{R},

    (min𝒘D𝒞λ(𝒘)t)2(min𝒘DLλ(𝒘)t).\mathbb{P}(\min_{\bm{w}\in D}\mathcal{C}_{\lambda}(\bm{w})\geq t)\leq 2\mathbb{P}(\min_{\bm{w}\in D}L_{\lambda}(\bm{w})\geq t).
Proof.

We will only prove the first point, since the second follows similarly. Suppose that 𝑿,𝒛,𝝃g,𝝃h\bm{X},\bm{z},\bm{\xi}_{g},\bm{\xi}_{h} live on the same probability space and are independent. Let ϵ(0,1]\epsilon\in(0,1]. Let σmax(𝑿)\sigma_{\max}(\bm{X}) denote the largest singular value of 𝑿\bm{X}. By tightness we can find a constant C>0C>0 such that the event

{σmax(𝑿)Cn,𝒛2Cn,𝝃g2Cn,𝝃h2Cn}\{\sigma_{\max}(\bm{X})\leq C\sqrt{n},\|\bm{z}\|_{2}\leq C\sqrt{n},\|\bm{\xi}_{g}\|_{2}\leq C\sqrt{n},\|\bm{\xi}_{h}\|_{2}\leq C\sqrt{n}\} (42)

has probability at least 1ϵ1-\epsilon. Let DpD\subset\mathbb{R}^{p} be a non-empty closed set. Fix some 𝒘0D\bm{w}_{0}\in D; on event (42), both 𝒞λ(𝒘0)\mathcal{C}_{\lambda}(\bm{w}_{0}) and Lλ(𝒘0)L_{\lambda}(\bm{w}_{0}) are bounded by some constant RR. Now for any 𝒘\bm{w} such that 𝒞λ(𝒘)R\mathcal{C}_{\lambda}(\bm{w})\leq R, we have 𝒘+𝜷22𝒞λ(𝒘)R\|\bm{w}+\bm{\beta}\|_{2}^{2}\leq\mathcal{C}_{\lambda}(\bm{w})\leq R.

This means that there exists R1R_{1} such that 𝒘2R1\|\bm{w}\|^{2}\leq R_{1} on event (42). Since this is true for all such 𝒘\bm{w}, the minimum of 𝒞λ\mathcal{C}_{\lambda} over DD is achieved on DB(0,R1)D\cap B(0,R_{1}). Similarly, the minimum of LλL_{\lambda} over DD is achieved on DB(0,R2)D\cap B(0,R_{2}). WLOG, we can assume R1=R2R_{1}=R_{2}. On event (42) we have

min𝒘D𝒞λ(𝒘)=min𝒘DB(0,R1)𝒞λ(𝒘)=min𝒘DB(0,R1)max𝒖B(0,R3)cλ(𝒘,𝒖)\min_{\bm{w}\in D}\mathcal{C}_{\lambda}(\bm{w})=\min_{\bm{w}\in D\cap B(0,R_{1})}\mathcal{C}_{\lambda}(\bm{w})=\min_{\bm{w}\in D\cap B(0,R_{1})}\max_{\bm{u}\in B(0,R_{3})}c_{\lambda}(\bm{w},\bm{u})

for some non-random R3>0R_{3}>0. Thus, for all tt\in\mathbb{R}, we have

(min𝒘D𝒞λ(𝒘)t)(min𝒘DB(0,R1)max𝒖B(0,R3)cλ(𝒘,𝒖)t)+ϵ,\mathbb{P}(\min_{\bm{w}\in D}\mathcal{C}_{\lambda}(\bm{w})\leq t)\leq\mathbb{P}(\min_{\bm{w}\in D\cap B(0,R_{1})}\max_{\bm{u}\in B(0,R_{3})}c_{\lambda}(\bm{w},\bm{u})\leq t)+\epsilon,

and similarly

(min𝒘DB(0,R1)max𝒖B(0,R3)lλ(𝒘,𝒖)t)(min𝒘DLλ(𝒘)t)+ϵ.\mathbb{P}(\min_{\bm{w}\in D\cap B(0,R_{1})}\max_{\bm{u}\in B(0,R_{3})}l_{\lambda}(\bm{w},\bm{u})\leq t)\leq\mathbb{P}(\min_{\bm{w}\in D}L_{\lambda}(\bm{w})\leq t)+\epsilon.

Now we know that DB(0,R1)D\cap B(0,R_{1}) and B(0,R3)B(0,R_{3}) are compact, we can apply Theorem H.1 to get

(min𝒘D𝒞λ(𝒘)t)2(min𝒘DLλ(𝒘)t)+2ϵ.\mathbb{P}(\min_{\bm{w}\in D}\mathcal{C}_{\lambda}(\bm{w})\leq t)\leq 2\mathbb{P}(\min_{\bm{w}\in D}L_{\lambda}(\bm{w})\leq t)+2\epsilon.

The corollary then follows from the fact that one can take ϵ\epsilon arbitrarily small. ∎

Now for a fixed λ\lambda, consider the set Dλϵ={𝒘p|𝒘𝒘(λ)22ϵ}D_{\lambda}^{\epsilon}=\{\bm{w}\in\mathbb{R}^{p}|\|\bm{w}-\bm{w}(\lambda)\|_{2}^{2}\geq\epsilon\}. This is clearly a closed set. By first applying parts 1 and 2 from Corollary E.1 to the convex closed domain p\mathbb{R}^{p}, then applying part 1 to the closed domain DλϵD_{\lambda}^{\epsilon}, we can show the following lemma:

Lemma E.7.

For all ϵ(0,1]\epsilon\in(0,1],

(min𝒘Dλϵ𝒞λ(𝒘)min𝒘p𝒞λ(𝒘)+ϵ)2(min𝒘DλϵLλ(𝒘)min𝒘pLλ(𝒘)+3ϵ)+o(ϵ).\mathbb{P}\left(\min_{\bm{w}\in D_{\lambda}^{\epsilon}}\mathcal{C}_{\lambda}(\bm{w})\leq\min_{\bm{w}\in\mathbb{R}^{p}}\mathcal{C}_{\lambda}(\bm{w})+\epsilon\right)\leq 2\mathbb{P}\left(\min_{\bm{w}\in D_{\lambda}^{\epsilon}}L_{\lambda}(\bm{w})\leq\min_{\bm{w}\in\mathbb{R}^{p}}L_{\lambda}(\bm{w})+3\epsilon\right)+o(\epsilon).
Proof.

The proof of Lemma E.7 is similar to Section C.1.1 in miolane2021distribution, where the above was established for the Lasso. ∎

Combining Lemmas E.7 and E.6, we arrive at the following lemma, where we return back to the cost λ(𝒃)\mathcal{L}_{\lambda}(\bm{b}):

Lemma E.8.

There exists a constant γ>0\gamma>0 such that for all ϵ(0,1]\epsilon\in(0,1],

supλ[λmin,λmax](𝒃p,𝒃𝜷^f(λ)22ϵandλ(𝒃)minλ+γϵ)=o(ϵ).\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\exists\bm{b}\in\mathbb{R}^{p},\ \|\bm{b}-\hat{\bm{\beta}}^{f}(\lambda)\|_{2}^{2}\geq\epsilon\ \ \text{and}\ \ \mathcal{L}_{\lambda}(\bm{b})\leq\min\mathcal{L}_{\lambda}+\gamma\epsilon\right)=o(\epsilon).
Proof.

Let γ>0\gamma>0 be from Lemma E.6. We have

supλ[λmin,λmax](𝒃p,𝒃𝜷^f(λ)22ϵandλ(𝒃)minλ+γϵ/3)\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\exists\bm{b}\in\mathbb{R}^{p},\ \|\bm{b}-\hat{\bm{\beta}}^{f}(\lambda)\|_{2}^{2}\geq\epsilon\ \ \text{and}\ \ \mathcal{L}_{\lambda}(\bm{b})\leq\min\mathcal{L}_{\lambda}+\gamma\epsilon/3\right)
=\displaystyle= supλ[λmin,λmax](min𝒘Dλϵ𝒞λ(𝒘)min𝒞λ+γϵ/3)\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\min_{\bm{w}\in D_{\lambda}^{\epsilon}}\mathcal{C}_{\lambda}(\bm{w})\leq\min\mathcal{C}_{\lambda}+\gamma\epsilon/3\right)
\displaystyle\leq supλ[λmin,λmax]2(min𝒘DλϵLλ(𝒘)min𝒘pLλ(𝒘)+γϵ)+o(ϵ)\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}2\mathbb{P}\left(\min_{\bm{w}\in D_{\lambda}^{\epsilon}}L_{\lambda}(\bm{w})\leq\min_{\bm{w}\in\mathbb{R}^{p}}L_{\lambda}(\bm{w})+\gamma\epsilon\right)+o(\epsilon)
=\displaystyle= o(ϵ),\displaystyle o(\epsilon),

where the first equality comes from the definition of 𝒞λ\mathcal{C}_{\lambda} in (37), the inequality comes from Lemma E.7, and the last equality comes from Lemma E.6. This completes the proof. ∎

E.3.4 Uniform Control over λ\lambda

We now establish the result with uniform control over λ\lambda (i.e., we bring the sup\sup in Lemma E.8 inside the probability).

We first prove the following Lemma:

Lemma E.9.

There exists a constant KK such that

(λ,λ[λmin,λmax],λ(𝜷^(λ))λ(𝜷^(λ))+K|λλ|)=1o(1).\mathbb{P}\left(\forall\lambda,\lambda^{\prime}\in[\lambda_{\min},\lambda_{\max}],\mathcal{L}_{\lambda^{\prime}}(\hat{\bm{\beta}}(\lambda))\leq\mathcal{L}_{\lambda^{\prime}}(\hat{\bm{\beta}}(\lambda^{\prime}))+K|\lambda-\lambda^{\prime}|\right)=1-o(1).
Proof.

The proof follows directly from Lemma D.12. ∎

Then we prove the following lemma:

Lemma E.10.

For λ1,λ2[λmin,λmax]\lambda_{1},\lambda_{2}\in[\lambda_{\min},\lambda_{\max}], we have 𝔼[𝛃^f(λ1)𝛃^f(λ2)2]M|λ1λ2|\mathbb{E}[\|\hat{\bm{\beta}}^{f}(\lambda_{1})-\hat{\bm{\beta}}^{f}(\lambda_{2})\|_{2}]\leq M|\lambda_{1}-\lambda_{2}| for some constant MM.

Proof.

First, by noting that τ=nτR\tau_{*}=\sqrt{n}\tau_{\textrm{R}} and α=λτ/ζR\alpha_{*}=\lambda\tau_{*}/\zeta_{\textrm{R}}, and by applying Lemma H.2 and Lemma H.8, we know that α(λ)\alpha_{*}(\lambda) and τ(λ)\tau_{*}(\lambda) are bounded in some [αmin,αmax],[τmin,τmax][\alpha_{\min},\alpha_{\max}],[\tau_{\min},\tau_{\max}] respectively, and that both are continuous functions of λ\lambda. Let λ1,λ2[λmin,λmax]\lambda_{1},\lambda_{2}\in[\lambda_{\min},\lambda_{\max}]. Next, recall from (41) that 𝒘(λ)=𝜷^f(λ)𝜷\bm{w}(\lambda)=\hat{\bm{\beta}}^{f}(\lambda)-\bm{\beta}. Thus,

𝔼[𝜷^f(λ1)𝜷^f(λ2)2]2𝔼[𝜷^f(λ1)𝜷^f(λ2)22]=𝔼[𝒘(λ1)𝒘(λ2)22]\displaystyle\mathbb{E}[\|\hat{\bm{\beta}}^{f}(\lambda_{1})-\hat{\bm{\beta}}^{f}(\lambda_{2})\|_{2}]^{2}\leq\mathbb{E}[\|\hat{\bm{\beta}}^{f}(\lambda_{1})-\hat{\bm{\beta}}^{f}(\lambda_{2})\|_{2}^{2}]=\mathbb{E}[\|\bm{w}(\lambda_{1})-\bm{w}(\lambda_{2})\|_{2}^{2}]
=\displaystyle= i=1p𝔼[(1α(λ1)/τ(λ1)+λ1(α(λ1)nξg,i+λ1βi)1α(λ2)/τ(λ2)+λ2(α(λ2)nξg,i+λ2βi))2]\displaystyle\sum_{i=1}^{p}\mathbb{E}\left[\left(\frac{1}{\alpha_{*}(\lambda_{1})/\tau_{*}(\lambda_{1})+\lambda_{1}}\left(\frac{\alpha_{*}(\lambda_{1})}{\sqrt{n}}\xi_{g,i}+\lambda_{1}\beta_{i}\right)-\frac{1}{\alpha_{*}(\lambda_{2})/\tau_{*}(\lambda_{2})+\lambda_{2}}\left(\frac{\alpha_{*}(\lambda_{2})}{\sqrt{n}}\xi_{g,i}+\lambda_{2}\beta_{i}\right)\right)^{2}\right]
\displaystyle\leq 2𝔼[δ(α(λ1)α(λ1)/τ(λ1)+λ1ξgα(λ2)α(λ2)/τ(λ2)+λ2ξg)2]\displaystyle 2\mathbb{E}\left[\delta\left(\frac{\alpha_{*}(\lambda_{1})}{\alpha_{*}(\lambda_{1})/\tau_{*}(\lambda_{1})+\lambda_{1}}\xi_{g}-\frac{\alpha_{*}(\lambda_{2})}{\alpha_{*}(\lambda_{2})/\tau_{*}(\lambda_{2})+\lambda_{2}}\xi_{g}\right)^{2}\right]
+2𝜷22[(λ1α(λ1)/τ(λ1)+λ1λ2α(λ2)/τ(λ2)+λ2)2],\displaystyle\indent+2\|\bm{\beta}\|_{2}^{2}\left[\left(\frac{\lambda_{1}}{\alpha_{*}(\lambda_{1})/\tau_{*}(\lambda_{1})+\lambda_{1}}-\frac{\lambda_{2}}{\alpha_{*}(\lambda_{2})/\tau_{*}(\lambda_{2})+\lambda_{2}}\right)^{2}\right],

where ξg,i\xi_{g,i} is the iith entry of 𝝃g𝒩(0,𝑰p)\bm{\xi}_{g}\sim\mathcal{N}(0,\bm{I}_{p}) and ξg𝒩(0,1)\xi_{g}\sim\mathcal{N}(0,1). For the first summand,

2𝔼[δ(α(λ1)α(λ1)/τ(λ1)+λ1ξgα(λ2)α(λ2)/τ(λ2)+λ2ξg)2]\displaystyle 2\mathbb{E}\left[\delta\left(\frac{\alpha_{*}(\lambda_{1})}{\alpha_{*}(\lambda_{1})/\tau_{*}(\lambda_{1})+\lambda_{1}}\xi_{g}-\frac{\alpha_{*}(\lambda_{2})}{\alpha_{*}(\lambda_{2})/\tau_{*}(\lambda_{2})+\lambda_{2}}\xi_{g}\right)^{2}\right]
=\displaystyle= 2δ(α(λ1)α(λ1)/τ(λ1)+λ1α(λ2)α(λ2)/τ(λ2)+λ2)2\displaystyle 2\delta\left(\frac{\alpha_{*}(\lambda_{1})}{\alpha_{*}(\lambda_{1})/\tau_{*}(\lambda_{1})+\lambda_{1}}-\frac{\alpha_{*}(\lambda_{2})}{\alpha_{*}(\lambda_{2})/\tau_{*}(\lambda_{2})+\lambda_{2}}\right)^{2}
=\displaystyle= 2δ(α(λ1)α(λ2)τ(λ1)τ(λ2)(τ(λ1)τ(λ2))+(λ2α(λ1)λ1α(λ2))(α(λ1)/τ(λ1)+λ1)(α(λ2)/τ(λ2)+λ2))2.\displaystyle 2\delta\left(\frac{\frac{\alpha_{*}(\lambda_{1})\alpha_{*}(\lambda_{2})}{\tau_{*}(\lambda_{1})\tau_{*}(\lambda_{2})}(\tau_{*}(\lambda_{1})-\tau_{*}(\lambda_{2}))+(\lambda_{2}\alpha_{*}(\lambda_{1})-\lambda_{1}\alpha_{*}(\lambda_{2}))}{(\alpha_{*}(\lambda_{1})/\tau_{*}(\lambda_{1})+\lambda_{1})(\alpha_{*}(\lambda_{2})/\tau_{*}(\lambda_{2})+\lambda_{2})}\right)^{2}.

We know both α(λ1)α(λ2)τ(λ1)τ(λ2)\frac{\alpha_{*}(\lambda_{1})\alpha_{*}(\lambda_{2})}{\tau_{*}(\lambda_{1})\tau_{*}(\lambda_{2})} and (α(λ1)/τ(λ1)+λ1)(α(λ2)/τ(λ2)+λ2)(\alpha_{*}(\lambda_{1})/\tau_{*}(\lambda_{1})+\lambda_{1})(\alpha_{*}(\lambda_{2})/\tau_{*}(\lambda_{2})+\lambda_{2}) are bounded, τ\tau_{*} is Lipschitz, and

λ2α(λ1)λ1α(λ2)=\displaystyle\lambda_{2}\alpha_{*}(\lambda_{1})-\lambda_{1}\alpha_{*}(\lambda_{2})= α(λ1)(λ2λ1)+λ1(α(λ1)α(λ2))\displaystyle\alpha_{*}(\lambda_{1})(\lambda_{2}-\lambda_{1})+\lambda_{1}(\alpha_{*}(\lambda_{1})-\alpha_{*}(\lambda_{2}))
\displaystyle\leq αmax|λ2λ1|+λmax|α(λ1)α(λ2)|.\displaystyle\alpha_{\max}|\lambda_{2}-\lambda_{1}|+\lambda_{\max}|\alpha_{*}(\lambda_{1})-\alpha_{*}(\lambda_{2})|.

Thus, we know the first part is bounded by some M1(λ1λ2)2M_{1}(\lambda_{1}-\lambda_{2})^{2}. Similarly, the second part is bounded by some M2(λ1λ2)2M_{2}(\lambda_{1}-\lambda_{2})^{2}. Combining them completes the proof. ∎

We are now ready to prove Lemma E.1.

Proof of Lemma E.1.

Let γ>0\gamma>0 be as given by Lemma E.8, let K>0K>0 be as given by Lemma E.9, and let M>0M>0 be as given by Lemma E.10. Fix ϵ(0,1]\epsilon\in(0,1] and define ϵ=min(γϵK,ϵM)\epsilon^{\prime}=\min\left(\frac{\gamma\epsilon}{K},\frac{\sqrt{\epsilon}}{M}\right). Let k=(λmaxλmin)/ϵk=\lceil(\lambda_{\max}-\lambda_{\min})/\epsilon^{\prime}\rceil. Further define λi=λmin+iϵ\lambda_{i}=\lambda_{\min}+i\epsilon^{\prime} for i=0,,ki=0,...,k. By Lemma E.8, the event

{i{1,,k},𝒃p,λi(𝒃)minλi+γϵ𝒃𝜷^f(λi)22ϵ}\left\{\forall i\in\{1,...,k\},\forall\bm{b}\in\mathbb{R}^{p},\mathcal{L}_{\lambda_{i}}(\bm{b})\leq\min\mathcal{L}_{\lambda_{i}}+\gamma\epsilon\Rightarrow\|\bm{b}-\hat{\bm{\beta}}^{f}(\lambda_{i})\|_{2}^{2}\leq\epsilon\right\} (43)

has probability at least 1ko(ϵ)=1o(1)1-ko(\epsilon)=1-o(1). Therefore, on the intersection of event (43) and the event in Lemma E.9, which has probability 1o(1)1-o(1), we have for all λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}],

λi(𝜷^(λ))minλi+K|λλi|minλi+γϵ,\mathcal{L}_{\lambda_{i}}(\hat{\bm{\beta}}(\lambda))\leq\min\mathcal{L}_{\lambda_{i}}+K|\lambda-\lambda_{i}|\leq\min\mathcal{L}_{\lambda_{i}}+\gamma\epsilon,

where 1ik1\leq i\leq k is such that λ[λi1,λi]\lambda\in[\lambda_{i-1},\lambda_{i}]. This implies (since we are on event (43)) that 𝜷^(λ)𝜷^f(λi)22ϵ\|\hat{\bm{\beta}}(\lambda)-\hat{\bm{\beta}}^{f}(\lambda_{i})\|_{2}^{2}\leq\epsilon.

Consider any Lipschitz function ϕβ\phi_{\beta}. Since 𝜷^f(λ)\hat{\bm{\beta}}^{f}(\lambda) is an α/(n(α/τ+λ))C/n\alpha_{*}/(\sqrt{n}(\alpha_{*}/\tau_{*}+\lambda))\leq C/\sqrt{n}-Lipschitz function of 𝝃g𝒩(𝟎,𝑰)\bm{\xi}_{g}\sim\mathcal{N}(\bm{0},\bm{I}) for some constant CC (recall (40)), by concentration of Lipschitz functions of Gaussian random variables we know that the event

{|ϕβ(𝜷^f(λ))𝔼[ϕβ(𝜷^f(λ))]|ϵ}\left\{|\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda))-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda))]|\leq\sqrt{\epsilon}\right\} (44)

has probability 1o(ϵ)1-o(\epsilon). Therefore, on the intersection of event (43) and event (44), we have

|ϕβ(𝜷^(λ))𝔼[ϕβ(𝜷^f(λ))]|\displaystyle|\phi_{\beta}(\hat{\bm{\beta}}(\lambda))-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda))]|
\displaystyle\leq |ϕβ(𝜷^(λ))ϕβ(𝜷^f(λi))|+|ϕβ(𝜷^f(λi))𝔼[ϕβ(𝜷^f(λi))]|+|𝔼[ϕβ(𝜷^f(λi))]𝔼[ϕβ(𝜷^f(λ))]|\displaystyle|\phi_{\beta}(\hat{\bm{\beta}}(\lambda))-\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda_{i}))|+|\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda_{i}))-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda_{i}))]|+|\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda_{i}))]-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}^{f}(\lambda))]|
\displaystyle\leq 𝜷^(λ)𝜷^f(λi)2+ϵ+𝔼[𝜷^f(λ)𝜷^f(λi)2]\displaystyle\|\hat{\bm{\beta}}(\lambda)-\hat{\bm{\beta}}^{f}(\lambda_{i})\|_{2}+\sqrt{\epsilon}+\mathbb{E}[\|\hat{\bm{\beta}}^{f}(\lambda)-\hat{\bm{\beta}}^{f}(\lambda_{i})\|_{2}]
\displaystyle\leq 2ϵ+M|λλi|\displaystyle 2\sqrt{\epsilon}+M|\lambda-\lambda_{i}|
\displaystyle\leq 3ϵ.\displaystyle 3\sqrt{\epsilon}.

The proof is then completed by noting that this holds for all λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}]. ∎

E.3.5 Proof of Lemma E.2

The proof is almost verbatim that of Lemma E.1, where we instead study the optimization over 𝒖\bm{u} in (37). We omit the details for conciseness.

Appendix F Proof of main results

In this section we utilize Theorem D.1 and some preceding supporting Lemmas to prove the remaining results in the main text, that is, Proposition 4.2 and Theorem 4.1.

F.1 Variance Parameter Estimation

We present the theorem that makes rigorous the consistency of (7).

Theorem F.1.

Consider τ^L2,τ^R2,τ^LR\hat{\tau}_{\textrm{L}}^{2},\hat{\tau}_{\textrm{R}}^{2},\hat{\tau}_{\textrm{L}\textrm{R}} in (7). If Assumptions 1 and 2 hold, then we have

supλL[λmin,λmax]n|τ^L2τL2|𝑃0,\displaystyle\sup_{\lambda_{\textrm{L}}\in[\lambda_{\min},\lambda_{\max}]}n\cdot|\hat{\tau}_{\textrm{L}}^{2}-\tau_{\textrm{L}}^{2}|\overset{P}{\rightarrow}0,
supλR[λmin,λmax]n|τ^R2τR2|𝑃0,\displaystyle\sup_{\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}n\cdot|\hat{\tau}_{\textrm{R}}^{2}-\tau_{\textrm{R}}^{2}|\overset{P}{\rightarrow}0,
supλL,λR[λmin,λmax]n|τ^LRτLR|𝑃0.\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}n\cdot|\hat{\tau}_{\textrm{L}\textrm{R}}-\tau_{\textrm{L}\textrm{R}}|\overset{P}{\rightarrow}0.

The proof of Theorem F.1 crucially relies on the following Lemma, which is the counterpart to Theorem 4.3 in the context of residuals:

Lemma F.2.

Define 𝐮^kd=𝐲𝐗𝛃^k1df^kn,𝐮^kf,d=n𝐡kf\hat{\bm{u}}_{k}^{d}=\frac{\bm{y}-\bm{X}\hat{\bm{\beta}}_{k}}{1-\frac{\hat{\textnormal{df}}_{k}}{n}},\hat{\bm{u}}_{k}^{f,d}=\sqrt{n}\bm{h}_{k}^{f}, where (𝐡Lf,𝐡Rf)𝒩(𝟎,𝐒𝐈n)(\bm{h}_{\textrm{L}}^{f},\bm{h}_{\textrm{R}}^{f})\sim\mathcal{N}(\bm{0},\bm{S}\otimes\bm{I}_{n}) for 𝐒\bm{S} defined as in 24. Under Assumption 4, for any 11-Lipschitz function ϕu:(n)2\phi_{u}:(\mathbb{R}^{n})^{2}\rightarrow\mathbb{R},

\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\left|\phi_{u}\left(\frac{\hat{\bm{u}}_{\textrm{L}}^{d}}{\sqrt{n}},\frac{\hat{\bm{u}}_{\textrm{R}}^{d}}{\sqrt{n}}\right)-\mathbb{E}\left[\phi_{u}\left(\frac{\hat{\bm{u}}_{\textrm{L}}^{f,d}}{\sqrt{n}},\frac{\hat{\bm{u}}_{\textrm{R}}^{f,d}}{\sqrt{n}}\right)\right]\right|\overset{P}{\rightarrow}0.

The proof of Lemma F.2 is similar to the proof of Theorem 4.3; we thus omit it for conciseness.

Proof of Theorem F.1.

We know 𝒉kf\bm{h}_{k}^{f} is a 1/n1/\sqrt{n}-Lipschitz function of n𝒉kf\sqrt{n}\bm{h}_{k}^{f}, where (n𝒉Lf,n𝒉Rf)𝒩(𝟎,n𝑺𝑰n)(\sqrt{n}\bm{h}_{\textrm{L}}^{f},\sqrt{n}\bm{h}_{\textrm{R}}^{f})\sim\mathcal{N}(\bm{0},n\bm{S}\otimes\bm{I}_{n}) and eigenvalues of n𝑺n\bm{S} are all bounded (since entries of n𝑺n\bm{S} are bounded from 25). Also, 𝔼[𝒉kf22]=nτk2τmax2+σmax2\mathbb{E}[\|\bm{h}_{k}^{f}\|_{2}^{2}]=n\tau_{k}^{2}\leq\tau_{\max}^{2}+\sigma_{\max}^{2}. Thus, by Lemma H.3, we know

supλL,λR[λmin,λmax]𝑻(𝒖^Ldn,𝒖^Rdn)𝔼[𝑻(𝒖^Lf,dn,𝒖^Rf,dn)]F𝑃0,\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\left\|\bm{T}\left(\frac{\hat{\bm{u}}_{\textrm{L}}^{d}}{\sqrt{n}},\frac{\hat{\bm{u}}_{\textrm{R}}^{d}}{\sqrt{n}}\right)-\mathbb{E}\left[\bm{T}\left(\frac{\hat{\bm{u}}_{\textrm{L}}^{f,d}}{\sqrt{n}},\frac{\hat{\bm{u}}_{\textrm{R}}^{f,d}}{\sqrt{n}}\right)\right]\right\|_{F}\overset{P}{\rightarrow}0,

where

𝑻(𝒂,𝒃):=(𝒂𝒂𝒂𝒃𝒂𝒃𝒃𝒃).\bm{T}(\bm{a},\bm{b}):=\begin{pmatrix}\bm{a}^{\top}\bm{a}&\bm{a}^{\top}\bm{b}\\ \bm{a}^{\top}\bm{b}&\bm{b}^{\top}\bm{b}\end{pmatrix}. (45)

Extracting the entries of the above equation completes the proof. ∎
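For concreteness, the following Python sketch shows how the scaled debiased residuals from Lemma F.2 and the empirical second-moment matrix 𝑻\bm{T} in (45) would be computed in practice. The inputs beta_hat_L, beta_hat_R, df_L, df_R are hypothetical placeholders standing in for the Lasso/Ridge fits and their estimated degrees of freedom; the entries of the resulting matrix concentrate around the quantities controlled in Theorem F.1.

import numpy as np

def T_matrix(a, b):
    # Empirical second-moment matrix T(a, b) defined in (45).
    return np.array([[a @ a, a @ b],
                     [a @ b, b @ b]])

def debiased_residual(y, X, beta_hat, df_hat):
    # Scaled debiased residual u_hat^d = (y - X beta_hat) / (1 - df_hat / n), as in Lemma F.2.
    n = y.shape[0]
    return (y - X @ beta_hat) / (1.0 - df_hat / n)

# Example usage with hypothetical fits (beta_hat_L, df_L) and (beta_hat_R, df_R):
# u_L = debiased_residual(y, X, beta_hat_L, df_L)
# u_R = debiased_residual(y, X, beta_hat_R, df_R)
# T_hat = T_matrix(u_L / np.sqrt(len(y)), u_R / np.sqrt(len(y)))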

F.2 Ensembling: Proof of Proposition 4.2

Proof of Proposition 4.2.

By arguments similar to the proof of Theorem F.1, on using Theorem 4.3 in conjunction with (25) and Lemma H.3, we have that

supλL,λR[λmin,λmax]𝑻(𝜷^Ld,𝜷^Rd)𝔼[𝑻(𝜷^Lf,d,𝜷^Rf,d)]F𝑃0,\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\|\bm{T}(\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d})-\mathbb{E}[\bm{T}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d})]\|_{F}\overset{P}{\rightarrow}0,

where 𝑻\bm{T} is given by (45). Therefore, we have

supλL,λR[λmin,λmax]supαL[0,1]|𝜷~Cd22𝜷22pτC2|\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\sup_{\alpha_{\textrm{L}}\in[0,1]}\left|\|\tilde{\bm{\beta}}_{C}^{d}\|_{2}^{2}-\|\bm{\beta}\|_{2}^{2}-p\cdot\tau_{C}^{2}\right|
=\displaystyle= supλL,λR[λmin,λmax]supαL[0,1]|αL𝜷^Ld+(1αL)𝜷^Rd22𝜷22αL2pτL2(1αL)2pτR22αL(1αL)pτLR|\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\sup_{\alpha_{\textrm{L}}\in[0,1]}\left|\|\alpha_{\textrm{L}}\hat{\bm{\beta}}_{\textrm{L}}^{d}+(1-\alpha_{\textrm{L}})\hat{\bm{\beta}}_{\textrm{R}}^{d}\|_{2}^{2}-\|\bm{\beta}\|_{2}^{2}-\alpha_{\textrm{L}}^{2}\cdot p\tau_{\textrm{L}}^{2}-(1-\alpha_{\textrm{L}})^{2}\cdot p\tau_{\textrm{R}}^{2}-2\alpha_{\textrm{L}}(1-\alpha_{\textrm{L}})\cdot p\tau_{\textrm{L}\textrm{R}}\right|
\displaystyle\leq supλL,λR[λmin,λmax]supαL[0,1][αL2|𝜷^Ld22𝜷22pτL2|+(1αL)2|𝜷^Rd22𝜷22pτR2|\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\sup_{\alpha_{\textrm{L}}\in[0,1]}\bigg{[}\alpha_{\textrm{L}}^{2}\left|\|\hat{\bm{\beta}}_{\textrm{L}}^{d}\|_{2}^{2}-\|\bm{\beta}\|_{2}^{2}-p\tau_{\textrm{L}}^{2}\right|+(1-\alpha_{\textrm{L}})^{2}\left|\|\hat{\bm{\beta}}_{\textrm{R}}^{d}\|_{2}^{2}-\|\bm{\beta}\|_{2}^{2}-p\tau_{\textrm{R}}^{2}\right|
+2αL(1αL)|𝜷^Ld,𝜷^Rd𝜷22pτLR|]\displaystyle+2\alpha_{\textrm{L}}(1-\alpha_{\textrm{L}})\left|\langle\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d}\rangle-\|\bm{\beta}\|_{2}^{2}-p\tau_{\textrm{L}\textrm{R}}\right|\bigg{]}
\displaystyle\leq supλL,λR[λmin,λmax]supαL[0,1](αL2+(1αL)2+2αL(1αL))𝑻(𝜷^Ld,𝜷^Rd)𝔼[𝑻(𝜷^Lf,d,𝜷^Rf,d)]F\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\sup_{\alpha_{\textrm{L}}\in[0,1]}\left(\alpha_{\textrm{L}}^{2}+(1-\alpha_{\textrm{L}})^{2}+2\alpha_{\textrm{L}}(1-\alpha_{\textrm{L}})\right)\left\|\bm{T}(\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d})-\mathbb{E}[\bm{T}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d})]\right\|_{F}
\displaystyle\leq supλL,λR[λmin,λmax]𝑻(𝜷^Ld,𝜷^Rd)𝔼[𝑻(𝜷^Lf,d,𝜷^Rf,d)]F𝑃0,\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\left\|\bm{T}(\hat{\bm{\beta}}_{\textrm{L}}^{d},\hat{\bm{\beta}}_{\textrm{R}}^{d})-\mathbb{E}[\bm{T}(\hat{\bm{\beta}}_{\textrm{L}}^{f,d},\hat{\bm{\beta}}_{\textrm{R}}^{f,d})]\right\|_{F}\overset{P}{\rightarrow}0,

where we define

τC2=αL2τL2+(1αL)2τR2+2αL(1αL)τLR.\tau_{C}^{2}=\alpha_{\textrm{L}}^{2}\cdot\tau_{\textrm{L}}^{2}+(1-\alpha_{\textrm{L}})^{2}\cdot\tau_{\textrm{R}}^{2}+2\alpha_{\textrm{L}}(1-\alpha_{\textrm{L}})\cdot\tau_{\textrm{L}\textrm{R}}.

Furthermore, from Theorem F.1,

supλL,λR[λmin,λmax]supαL[0,1]|pτ~C2pτC2|\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\sup_{\alpha_{\textrm{L}}\in[0,1]}|p\cdot\tilde{\tau}_{C}^{2}-p\cdot\tau_{C}^{2}|
\displaystyle\leq supλL,λR[λmin,λmax]supαL[0,1](αL2+(1αL)2+2αL(1αL))|p|τ^L2τL2|+p|τ^R2τR2|+p|τ^LRτLR||\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\sup_{\alpha_{\textrm{L}}\in[0,1]}\left(\alpha_{\textrm{L}}^{2}+(1-\alpha_{\textrm{L}})^{2}+2\alpha_{\textrm{L}}(1-\alpha_{\textrm{L}})\right)\left|p\cdot|\hat{\tau}_{\textrm{L}}^{2}-\tau_{\textrm{L}}^{2}|+p\cdot|\hat{\tau}_{\textrm{R}}^{2}-\tau_{\textrm{R}}^{2}|+p\cdot|\hat{\tau}_{LR}-\tau_{LR}|\right|
\displaystyle\leq supλL,λR[λmin,λmax]|p|τ^L2τL2|+p|τ^R2τR2|+p|τ^LRτLR||𝑃0.\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\left|p\cdot|\hat{\tau}_{\textrm{L}}^{2}-\tau_{\textrm{L}}^{2}|+p\cdot|\hat{\tau}_{\textrm{R}}^{2}-\tau_{\textrm{R}}^{2}|+p\cdot|\hat{\tau}_{LR}-\tau_{LR}|\right|\overset{P}{\rightarrow}0.

Combining the above displays completes the proof. ∎

F.3 Heritability Estimation: Proof of Theorem 4.1

Proof of Theorem 4.1.

From (12), we know that α^L[0,1]\hat{\alpha}_{\textrm{L}}\in[0,1], so Proposition 4.2 implies that

supλL,λR[λmin,λmax]|𝜷^Cd22pτ^C2σβ2|𝑃0,\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\left|\|\hat{\bm{\beta}}_{C}^{d}\|_{2}^{2}-p\cdot\hat{\tau}_{C}^{2}-\sigma_{\beta}^{2}\right|\overset{P}{\rightarrow}0, (46)

which further implies that

supλL,λR[λmin,λmax]|𝜷^Cd22pτ^C2σβ2+σ2σβ2σβ2+σ2|=supλL,λR[λmin,λmax]|𝜷^Cd22pτ^C2σβ2+σ2h|𝑃0.\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\|\hat{\bm{\beta}}_{C}^{d}\|_{2}^{2}-p\cdot\hat{\tau}_{C}^{2}}{\sigma_{\beta}^{2}+\sigma^{2}}-\frac{\sigma_{\beta}^{2}}{\sigma_{\beta}^{2}+\sigma^{2}}\right|=\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\|\hat{\bm{\beta}}_{C}^{d}\|_{2}^{2}-p\cdot\hat{\tau}_{C}^{2}}{\sigma_{\beta}^{2}+\sigma^{2}}-h\right|\overset{P}{\rightarrow}0. (47)

Recalling the definitions of σβ2,σ2\sigma_{\beta}^{2},\sigma^{2} from Assumption 4, and the fact that XijX_{ij}’s are independent with mean 0 and variance 11, we know that

Var^(𝒚)𝑃σβ2+σ2,\widehat{\text{Var}}(\bm{y})\overset{P}{\rightarrow}\sigma_{\beta}^{2}+\sigma^{2},

where Var^(𝒚)\widehat{\text{Var}}(\bm{y}) denotes the sample variance of 𝒚\bm{y}. Moreover, σβ2+σ2\sigma_{\beta}^{2}+\sigma^{2} is bounded by 2σmax22\sigma_{\max}^{2}, and is positive (otherwise the problem becomes trivial). Thus we can safely replace the denominator in (47) with Var^(𝒚)\widehat{\text{Var}}(\bm{y}). The proof is then completed by the fact that clipping does not affect consistency. ∎
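To make the quantities appearing in this proof concrete, the following Python sketch assembles the clipped heritability estimate from the ensembled debiased estimator, the combined variance level τC2\tau_{C}^{2}, and the sample variance of 𝒚\bm{y}. The inputs are hypothetical placeholders: beta_L_d and beta_R_d stand for the debiased Lasso and Ridge estimates, tau2_L, tau2_R, tau_LR for the variance estimates in (7), and alpha_L for the ensemble weight from (12).

import numpy as np

def heritability_estimate(beta_L_d, beta_R_d, tau2_L, tau2_R, tau_LR, alpha_L, y):
    # Ensemble the debiased estimates with weight alpha_L in [0, 1].
    p = beta_L_d.shape[0]
    beta_C_d = alpha_L * beta_L_d + (1.0 - alpha_L) * beta_R_d
    # Combined variance level: alpha_L^2 tau_L^2 + (1 - alpha_L)^2 tau_R^2 + 2 alpha_L (1 - alpha_L) tau_LR.
    tau2_C = (alpha_L ** 2 * tau2_L + (1.0 - alpha_L) ** 2 * tau2_R
              + 2.0 * alpha_L * (1.0 - alpha_L) * tau_LR)
    # Bias-corrected signal strength over the sample variance of y, clipped to [0, 1].
    h_hat = (np.sum(beta_C_d ** 2) - p * tau2_C) / np.var(y, ddof=1)
    return float(np.clip(h_hat, 0.0, 1.0))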

Appendix G Robustness under Accurate Covariance Estimation: Proof of Theorem 4.4

In this section we prove Theorem 4.4. To maintain notational consistency with the rest of the proofs, we refrain from using 𝒁\bm{Z} and use 𝑿\bm{X} instead. A simple derivation reveals that assuming 𝑿\bm{X} has covariance 𝑨𝑨\bm{A}^{\top}\bm{A} is equivalent to assuming 𝑿\bm{X} has independent columns and using 𝑿𝑨\bm{X}\bm{A} as the design matrix. For all quantities in this section, we include 𝑨\bm{A} in parentheses or as a subscript to denote the corresponding perturbed quantity obtained by replacing the design matrix 𝑿\bm{X} with 𝑿𝑨\bm{X}\bm{A}. We first introduce a sufficient condition for Theorem 4.4.

Lemma G.1.

Let 𝐀\bm{A} be a sequence of random matrices such that 𝐀𝐈op𝑃0\|\bm{A}-\bm{I}\|_{\textnormal{op}}\overset{P}{\rightarrow}0. Under Assumption 4, we have

supλk[λmin,λmax]𝜷^k(𝑨)𝜷^k2\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}_{k}(\bm{A})-\hat{\bm{\beta}}_{k}\|_{2} 𝑃0\displaystyle\overset{P}{\rightarrow}0 (48)
supλk[λmin,λmax]|df^k(𝑨)pdf^kp|\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\hat{\textnormal{df}}_{k}(\bm{A})}{p}-\frac{\hat{\textnormal{df}}_{k}}{p}\right| 𝑃0.\displaystyle\overset{P}{\rightarrow}0.

Lemma G.1 is proved in Sections G.1 and G.2 for Ridge and Lasso, respectively. Compared to the assumptions in Theorem 4.4, besides reducing Assumption 1 to Assumption 4 under Gaussian design and replacing the assumptions about 𝚺\bm{\Sigma}, 𝚺^\hat{\bm{\Sigma}} with those about 𝑨\bm{A}, we also drop Assumption 3, as it can be proved under Gaussian design. Taking Lemma G.1 as given momentarily, we prove Theorem 4.4 below.

Proof of Theorem 4.4.

First we show that 𝑨𝑰op𝑃0\|\bm{A}-\bm{I}\|_{\textnormal{op}}\overset{P}{\rightarrow}0 is a corollary of the assumptions in Theorem 4.4. Notice that Proposition A.1 guarantees 𝚺^𝚺op0\|\hat{\bm{\Sigma}}-\bm{\Sigma}\|_{\textnormal{op}}\rightarrow 0. We have 𝑨1𝑰op=𝚺^1/2𝚺1/2𝑰op𝚺^1/2𝚺1/2op𝚺1/2op𝑃0\|\bm{A}^{-1}-\bm{I}\|_{\textnormal{op}}=\|\hat{\bm{\Sigma}}^{1/2}\bm{\Sigma}^{-1/2}-\bm{I}\|_{\textnormal{op}}\leq\|\hat{\bm{\Sigma}}^{1/2}-\bm{\Sigma}^{1/2}\|_{\textnormal{op}}\|\bm{\Sigma}^{-1/2}\|_{\textnormal{op}}\overset{P}{\rightarrow}0 since 𝚺^1/2𝚺1/2op𝑃0\|\hat{\bm{\Sigma}}^{1/2}-\bm{\Sigma}^{1/2}\|_{\textnormal{op}}\overset{P}{\rightarrow}0 and the eigenvalues of 𝚺\bm{\Sigma} are bounded. Let 𝑩=𝑰𝑨1\bm{B}=\bm{I}-\bm{A}^{-1}. Since 𝑩op𝑃0\|\bm{B}\|_{\textnormal{op}}\overset{P}{\rightarrow}0, the Neumann series \sum_{k=0}^{\infty}\bm{B}^{k} converges in operator norm and equals (\bm{I}-\bm{B})^{-1}=\bm{A}. Hence \|\bm{A}-\bm{I}\|_{\textnormal{op}}=\|\sum_{k=1}^{\infty}\bm{B}^{k}\|_{\textnormal{op}}\leq\|\bm{B}\|_{\textnormal{op}}/(1-\|\bm{B}\|_{\textnormal{op}})\overset{P}{\rightarrow}0.

We are now left with arguing that under the assumptions of Lemma G.1, (48) yields (16). Recall the definition of debiased estimators in (18). We have the following:

  • supλk[λmin,λmax]𝜷^k(𝑨)𝜷^k2𝑃0\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}_{k}(\bm{A})-\hat{\bm{\beta}}_{k}\|_{2}\overset{P}{\rightarrow}0 from the first line of (48).

  • 1n𝑿𝑿𝑨op1n𝑿op𝑨𝑰op𝑃0\frac{1}{\sqrt{n}}\|\bm{X}-\bm{X}\bm{A}\|_{\textnormal{op}}\leq\frac{1}{\sqrt{n}}\|\bm{X}\|_{\textnormal{op}}\|\bm{A}-\bm{I}\|_{\textnormal{op}}\overset{P}{\rightarrow}0 from Corollary H.1.

  • 1n𝒚2\frac{1}{\sqrt{n}}\|\bm{y}\|_{2} is bounded with probability 1o(1)1-o(1).

  • supλk[λmin,λmax]|nndf^knndf^k(𝑨)|𝑃0\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{n}{n-\hat{\textnormal{df}}_{k}}-\frac{n}{n-\hat{\textnormal{df}}_{k}(\bm{A})}\right|\overset{P}{\rightarrow}0 since supλk[λmin,λmax]|df^k(𝑨)ndf^kn|𝑃0\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\hat{\textnormal{df}}_{k}(\bm{A})}{n}-\frac{\hat{\textnormal{df}}_{k}}{n}\right|\overset{P}{\rightarrow}0, together with the fact that 1df^kn1-\frac{\hat{\textnormal{df}}_{k}}{n} is bounded below with probability 1o(1)1-o(1) from Corollary D.1.

Combining the above items with Lemma H.5 and Lemma H.6, we obtain

supλk[λmin,λmax]𝜷^kd𝜷^kd(𝑨)2𝑃0.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}_{k}^{d}-\hat{\bm{\beta}}_{k}^{d}(\bm{A})\|_{2}\overset{P}{\rightarrow}0.

A similar proof yields

supλk[λmin,λmax]|τ^k2τ^k2(𝑨)|𝑃0\displaystyle\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}|\hat{\tau}_{k}^{2}-\hat{\tau}_{k}^{2}(\bm{A})|\overset{P}{\rightarrow}0
supλL,λR[λmin,λmax]|τ^LR2τ^LR2(𝑨)|𝑃0.\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\hat{\tau}_{\textrm{L}\textrm{R}}^{2}-\hat{\tau}_{\textrm{L}\textrm{R}}^{2}(\bm{A})|\overset{P}{\rightarrow}0.

We are then left with articulating the expression of h^2\hat{h}^{2} and taking the supremum over αL[0,1]\alpha_{\textrm{L}}\in[0,1], both of which are straightforward using Lemma H.5, Lemma H.6, and the proof techniques in Sections F.2 and F.3. We omit the details here. ∎

G.1 Proof of Lemma G.1, Ridge case

In this section, we drop the subscript R for simplicity. First we show that

supλk[λmin,λmax]𝜷^(𝑨)𝜷^2𝑃0.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}(\bm{A})-\hat{\bm{\beta}}\|_{2}\overset{P}{\rightarrow}0.

Recall the Ridge estimator and its perturbed version:

𝜷^\displaystyle\hat{\bm{\beta}} =(1n𝑿𝑿+λ𝑰)11n𝑿𝒚\displaystyle=(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{y}
𝜷^(𝑨)\displaystyle\hat{\bm{\beta}}(\bm{A}) =(1n𝑨𝑿𝑿𝑨+λ𝑰)11n𝑨𝑿𝒚.=(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{y}.

We know both (1n𝑿𝑿+λ𝑰)1op\left\|(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}\right\|_{\textnormal{op}} and (1n𝑨𝑿𝑿𝑨+λ𝑰)1op\left\|(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}\right\|_{\textnormal{op}} are bounded by 1/λmin1/\lambda_{\min}, uniformly for all λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}]. Further, their difference satisfies

(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}-(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}=(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}\left(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}-\frac{1}{n}\bm{X}^{\top}\bm{X}\right)(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}.

Then we know supλ[λmin,λmax](1n𝑿𝑿+λ𝑰)1(1n𝑨𝑿𝑿𝑨+λ𝑰)1op𝑃0\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\left\|(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}-(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}\right\|_{\textnormal{op}}\overset{P}{\rightarrow}0 by observing that 1n𝑿op\frac{1}{\sqrt{n}}\|\bm{X}\|_{\textnormal{op}} is bounded with probability 1o(1)1-o(1), 𝑰𝑨op𝑃0\|\bm{I}-\bm{A}\|_{\textnormal{op}}\overset{P}{\rightarrow}0, and applying Lemma H.6. Another application of Lemma H.6, together with the fact that 1n𝒚2\frac{1}{\sqrt{n}}\|\bm{y}\|_{2} is bounded, concludes that supλk[λmin,λmax]𝜷^(𝑨)𝜷^2𝑃0\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\hat{\bm{\beta}}(\bm{A})-\hat{\bm{\beta}}\|_{2}\overset{P}{\rightarrow}0.

Next, for df^/p\hat{\textnormal{df}}/p and its perturbed counterpart, we have

supλ[λmin,λmax]|df^(𝑨)pdf^p|\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\hat{\textnormal{df}}(\bm{A})}{p}-\frac{\hat{\textnormal{df}}}{p}\right|
=\displaystyle= supλ[λmin,λmax]1p|(nTr((1n𝑿𝑿+λ𝑰)11n𝑿𝑿))(nTr((1n𝑨𝑿𝑿𝑨+λ𝑰)11n𝑨𝑿𝑿𝑨))|\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1}{p}\left|\left(n-\text{Tr}\left((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}\right)\right)-\left(n-\text{Tr}\left((\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}\right)\right)\right|
=\displaystyle= supλ[λmin,λmax]1pTr((1n𝑿𝑿+λ𝑰)11n𝑿𝑿(1n𝑨𝑿𝑿𝑨+λ𝑰)11n𝑨𝑿𝑿𝑨)\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1}{p}\text{Tr}\left((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}-(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}\right)
=\displaystyle= supλ[λmin,λmax]λpTr((1n𝑿𝑿+λ𝑰)1(1n𝑨𝑿𝑿𝑨+λ𝑰)1)\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{\lambda}{p}\text{Tr}\left((\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}-(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}\right)
\displaystyle\leq supλ[λmin,λmax]λmaxpp(1n𝑿𝑿+λ𝑰)1(1n𝑨𝑿𝑿𝑨+λ𝑰)1op𝑃0.\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{\lambda_{\max}}{p}\cdot p\left\|(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}-(\frac{1}{n}\bm{A}^{\top}\bm{X}^{\top}\bm{X}\bm{A}+\lambda\bm{I})^{-1}\right\|_{\textnormal{op}}\overset{P}{\rightarrow}0.

where the third equality follows from (1n𝑿𝑿+λ𝑰)11n𝑿𝑿=𝑰λ(1n𝑿𝑿+λ𝑰)1(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}=\bm{I}-\lambda(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1} and its perturbed counterpart. Hence the proof is complete.
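As a quick numerical illustration of the matrix identity invoked in the third equality, the sketch below checks that (\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1}\frac{1}{n}\bm{X}^{\top}\bm{X}=\bm{I}-\lambda(\frac{1}{n}\bm{X}^{\top}\bm{X}+\lambda\bm{I})^{-1} on a randomly generated design; the dimensions and the value of λ\lambda are arbitrary placeholders.

import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 200, 50, 0.3  # placeholder dimensions and penalty
X = rng.standard_normal((n, p))
G = X.T @ X / n  # sample Gram matrix X'X / n
lhs = np.linalg.solve(G + lam * np.eye(p), G)               # (G + lam I)^{-1} G
rhs = np.eye(p) - lam * np.linalg.inv(G + lam * np.eye(p))  # I - lam (G + lam I)^{-1}
print(np.max(np.abs(lhs - rhs)))  # zero up to floating-point error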

G.2 Proof of Lemma G.1, Lasso case

We drop the subscript L in this section for simplicity. The proof of the Lasso case is more involved. Denote

(𝒃)\displaystyle\mathcal{L}(\bm{b}) :=12n𝒚𝑿𝒃22+λn𝒃1\displaystyle:=\frac{1}{2n}\|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}+\frac{\lambda}{\sqrt{n}}\|\bm{b}\|_{1}
𝑨(𝒃)\displaystyle\mathcal{L}_{\bm{A}}(\bm{b}) :=12n𝒚𝑿𝑨𝒃22+λn𝒃1\displaystyle:=\frac{1}{2n}\|\bm{y}-\bm{X}\bm{A}\bm{b}\|_{2}^{2}+\frac{\lambda}{\sqrt{n}}\|\bm{b}\|_{1}

the Lasso objective function and its perturbed counterpart. We first argue that (𝜷^)\mathcal{L}(\hat{\bm{\beta}}) is close to (𝜷^(𝑨))\mathcal{L}(\hat{\bm{\beta}}(\bm{A})), where 𝜷^(𝑨):=argmin𝑨\hat{\bm{\beta}}(\bm{A}):=\operatorname*{arg\,min}\mathcal{L}_{\bm{A}} is the perturbed Lasso solution (Lemma G.2); we then utilize techniques concerning the local stability of the Lasso objective (cf. miolane2021distribution, celentano2020lasso) to prove the desired results in Sections G.2.1 and G.2.2.

Lemma G.2 (Closeness of Lasso objectives).

Under the assumptions in Lemma G.1, we have

supλ[λmin,λmax]|(𝜷^(𝑨))(𝜷^)|𝑃0.\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}|\mathcal{L}(\hat{\bm{\beta}}(\bm{A}))-\mathcal{L}(\hat{\bm{\beta}})|\overset{P}{\rightarrow}0.
Proof.

By optimality, (𝜷^(𝑨))(𝜷^)\mathcal{L}(\hat{\bm{\beta}}(\bm{A}))\geq\mathcal{L}(\hat{\bm{\beta}}) and 𝑨(𝜷^(𝑨))𝑨(𝜷^)\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}}(\bm{A}))\leq\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}}). Further, we have

supλ[λmin,λmax]((𝜷^(𝑨))(𝜷^))\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\left(\mathcal{L}(\hat{\bm{\beta}}(\bm{A}))-\mathcal{L}(\hat{\bm{\beta}})\right)
=\displaystyle= supλ[λmin,λmax]((𝜷^(𝑨))𝑨(𝜷^(𝑨))+𝑨(𝜷^(𝑨))𝑨(𝜷^)+𝑨(𝜷^)(𝜷^))\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\left(\mathcal{L}(\hat{\bm{\beta}}(\bm{A}))-\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}}(\bm{A}))+\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}}(\bm{A}))-\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}})+\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}})-\mathcal{L}(\hat{\bm{\beta}})\right)
\displaystyle\leq supλ[λmin,λmax]((𝜷^(𝑨))𝑨(𝜷^(𝑨))+𝑨(𝜷^)(𝜷^))\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\left(\mathcal{L}(\hat{\bm{\beta}}(\bm{A}))-\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}}(\bm{A}))+\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}})-\mathcal{L}(\hat{\bm{\beta}})\right)
\displaystyle\leq supλ[λmin,λmax](12n|𝒚𝑿𝜷^(𝑨)22𝒚𝑿𝑨𝜷^(𝑨)22|+12n|𝒚𝑿𝜷^22𝒚𝑿𝑨𝜷^22|).\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\left(\frac{1}{2n}\left|\|\bm{y}-\bm{X}\hat{\bm{\beta}}(\bm{A})\|_{2}^{2}-\|\bm{y}-\bm{X}\bm{A}\hat{\bm{\beta}}(\bm{A})\|_{2}^{2}\right|+\frac{1}{2n}\left|\|\bm{y}-\bm{X}\hat{\bm{\beta}}\|_{2}^{2}-\|\bm{y}-\bm{X}\bm{A}\hat{\bm{\beta}}\|_{2}^{2}\right|\right).

We know from Lemma D.7 that 𝜷^\hat{\bm{\beta}} is uniformly bounded with probability 1o(1)1-o(1). Further, replacing Lemma E.1 with (celentano2020lasso, Theorem 6), we can use a proof similar to that of Lemma D.7 to show that 𝜷^(𝑨)\hat{\bm{\beta}}(\bm{A}) is also uniformly bounded with probability 1o(1)1-o(1). Therefore, utilizing Lemma H.6,

supλ[λmin,λmax]12n|𝒚𝑿𝜷^22𝒚𝑿𝑨𝜷^22|\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1}{2n}\left|\|\bm{y}-\bm{X}\hat{\bm{\beta}}\|_{2}^{2}-\|\bm{y}-\bm{X}\bm{A}\hat{\bm{\beta}}\|_{2}^{2}\right|
\displaystyle\leq supλ[λmin,λmax]121n2𝒚+𝑿𝜷^+𝑿𝑨𝜷^21n𝑿𝜷^𝑿𝑨𝜷^2𝑃0,\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1}{2}\cdot\frac{1}{\sqrt{n}}\|2\bm{y}+\bm{X}\hat{\bm{\beta}}+\bm{X}\bm{A}\hat{\bm{\beta}}\|_{2}\cdot\frac{1}{\sqrt{n}}\|\bm{X}\hat{\bm{\beta}}-\bm{X}\bm{A}\hat{\bm{\beta}}\|_{2}\overset{P}{\rightarrow}0,

and similarly supλ[λmin,λmax]12n|𝒚𝑿𝜷^(𝑨)22𝒚𝑿𝑨𝜷^(𝑨)22|𝑃0\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1}{2n}\left|\|\bm{y}-\bm{X}\hat{\bm{\beta}}(\bm{A})\|_{2}^{2}-\|\bm{y}-\bm{X}\bm{A}\hat{\bm{\beta}}(\bm{A})\|_{2}^{2}\right|\overset{P}{\rightarrow}0. Hence,

supλ[λmin,λmax]|(𝜷^(𝑨))(𝜷^)|𝑃0\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}|\mathcal{L}(\hat{\bm{\beta}}(\bm{A}))-\mathcal{L}(\hat{\bm{\beta}})|\overset{P}{\rightarrow}0

as desired. ∎
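The content of Lemma G.2 can also be checked numerically: when 𝑨\bm{A} is close to the identity, the Lasso objective evaluated at the perturbed solution exceeds its minimum only slightly. The Python sketch below illustrates this using scikit-learn's Lasso, whose objective \frac{1}{2n}\|\bm{y}-\bm{X}\bm{b}\|_{2}^{2}+\alpha\|\bm{b}\|_{1} matches \mathcal{L} upon setting \alpha=\lambda/\sqrt{n}; the dimensions, the penalty level, and the perturbation size are hypothetical placeholders.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, lam = 400, 200, 0.1  # placeholder dimensions and penalty
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:10] = 1.0
y = X @ beta + rng.standard_normal(n)
A = np.eye(p) + 0.01 * rng.standard_normal((p, p)) / np.sqrt(p)  # a small perturbation of the identity

def objective(b):
    # Lasso objective L(b) = (1/2n)||y - Xb||_2^2 + (lam/sqrt(n))||b||_1 from Section G.2.
    return 0.5 * np.mean((y - X @ b) ** 2) + lam / np.sqrt(n) * np.abs(b).sum()

def lasso_fit(design):
    # scikit-learn's Lasso minimizes (1/2n)||y - design @ b||_2^2 + alpha * ||b||_1.
    return Lasso(alpha=lam / np.sqrt(n), fit_intercept=False, max_iter=100000).fit(design, y).coef_

b_hat, b_hat_A = lasso_fit(X), lasso_fit(X @ A)
print(objective(b_hat_A) - objective(b_hat))  # small and (up to solver tolerance) nonnegative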

G.2.1 Closeness of Lasso solutions

We introduce a lemma that is sufficient for the first line of (48):

Lemma G.3 (Uniform local strong convexity of the Lasso objective).

Under the assumptions in Lemma G.1, there exists a constant CC such that for any λ\lambda-dependent vector 𝛃ˇ\check{\bm{\beta}},

supλ[λmin,λmax](𝜷ˇ)(𝜷^)supλ[λmin,λmax]min{C𝜷ˇ𝜷^22,C𝜷ˇ𝜷^2}.\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathcal{L}(\check{\bm{\beta}})-\mathcal{L}(\hat{\bm{\beta}})\geq\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\min\{C\|\check{\bm{\beta}}-\hat{\bm{\beta}}\|_{2}^{2},C\|\check{\bm{\beta}}-\hat{\bm{\beta}}\|_{2}\}.

The first line of (48) then follows by plugging 𝜷ˇ=𝜷^(𝑨)\check{\bm{\beta}}=\hat{\bm{\beta}}(\bm{A}) into Lemma G.3 and combining with Lemma G.2.

Proof of Lemma G.3.

Notice that this lemma is the uniform extension of (celentano2020lasso, Lemma B.9) (their statement only involves 𝜷ˇ𝜷^22\|\check{\bm{\beta}}-\hat{\bm{\beta}}\|_{2}^{2} with an additional constraint, but our statement follows naturally from their proof procedure). Thus we only elaborate on the necessary extensions of their proof here:

Consider the critical event in the proof:

𝒜:={1nκ(𝑿,n(1ζ/4))κmin}{1n𝑿opC}{1n#{j:|t^j|1Δ/2}1ζ/2}\mathcal{A}:=\left\{\frac{1}{\sqrt{n}}\kappa_{-}(\bm{X},n(1-\zeta/4))\geq\kappa_{\min}\right\}\cap\left\{\frac{1}{\sqrt{n}}\|\bm{X}\|_{\textnormal{op}}\leq C\right\}\cap\left\{\frac{1}{n}\#\{j:|\hat{t}_{j}|\geq 1-\Delta/2\}\leq 1-\zeta/2\right\}

where ζ:=1df/n\zeta:=1-\textnormal{df}/n as in (24) and 𝒕^:=1nλ𝑿(𝒚𝑿𝜷^)\hat{\bm{t}}:=\frac{1}{\sqrt{n}\lambda}\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}}) is the Lasso subgradient. We want to show that there exist constants C,κmin,ΔC,\kappa_{\min},\Delta such that the event 𝒜\mathcal{A} happens with probability 1o(1)1-o(1), uniformly for λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}].

The event {1n𝑿opC}\{\frac{1}{\sqrt{n}}\|\bm{X}\|_{\textnormal{op}}\leq C\} does not depend on λ\lambda, so Corollary H.1 guarantees that it holds with probability 1o(1)1-o(1).

For {1nκ(𝑿,n(1ζ/4))κmin}\left\{\frac{1}{\sqrt{n}}\kappa_{-}(\bm{X},n(1-\zeta/4))\geq\kappa_{\min}\right\}, we know 1ζ/41ζmin/41-\zeta/4\leq 1-\zeta_{\min}/4 from (25), and we also know κ(𝑿,k)κ(𝑿,k)\kappa_{-}(\bm{X},k)\geq\kappa_{-}(\bm{X},k^{\prime}) for any kkk\leq k^{\prime} by the definition of κ\kappa_{-}. Therefore the high-probability guarantee (proved in celentano2020lasso, Section B.5.4) extends uniformly.

Finally, {1n#{j:|t^j|1Δ/2}1ζ/2}\left\{\frac{1}{n}\#\{j:|\hat{t}_{j}|\geq 1-\Delta/2\}\leq 1-\zeta/2\right\} follows from (miolane2021distribution, Theorem E.5) by an ϵ\epsilon-net argument (c.f. (miolane2021distribution, Section E.3.4) and Section E.3.4), which we omit here for conciseness.

Now that we have established 𝒜\mathcal{A} happens uniformly with probability 1o(1)1-o(1), the rest of the proof extends almost verbatim the proof in (celentano2020lasso, Section B.5.4) by taking the supremum in all inequalities. ∎

G.2.2 Closeness of Lasso sparsities

Taking inspiration from miolane2021distribution, we prove the convergence from above and from below using different strategies. We state and prove two supporting lemmas, Lemmas G.4 and G.5. Taking those lemmas as given momentarily, and noticing that a nonzero entry in the Lasso solution forces the corresponding subgradient entry to equal ±1\pm 1, we immediately have

supλ[λmin,λmax]|df^(𝑨)pdfp|𝑃0.\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\left|\frac{\hat{\textnormal{df}}(\bm{A})}{p}-\frac{\textnormal{df}}{p}\right|\overset{P}{\rightarrow}0.

The second line of (48) then follows by further applying Lemma D.2.

Lemma G.4.

Under the assumptions in Lemma G.1, for any ϵ>0\epsilon>0,

(λ[λmin,λmax],1p𝜷^(𝑨)0dfpϵ)=1o(1).\mathbb{P}\left(\forall\lambda\in[\lambda_{\min},\lambda_{\max}],\frac{1}{p}\|\hat{\bm{\beta}}(\bm{A})\|_{0}\geq\frac{\textnormal{df}}{p}-\epsilon\right)=1-o(1).
Proof.

As part of the proof in Lemma G.2, we know

(λ[λmin,λmax],(𝒃)𝑨(𝒃)+ϵ)=1o(1)\mathbb{P}\left(\forall\lambda\in[\lambda_{\min},\lambda_{\max}],\mathcal{L}(\bm{b})\leq\mathcal{L}_{\bm{A}}(\bm{b})+\epsilon\right)=1-o(1) (49)

for any bounded 𝒃\bm{b}.

From (miolane2021distribution, Lemma F.4), we know

\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\forall\bm{b},\mathcal{L}(\bm{b})\leq\mathcal{L}(\hat{\bm{\beta}})+\gamma\epsilon^{3}\Rightarrow\frac{1}{p}\|\bm{b}\|_{0}\geq\frac{\textnormal{df}}{p}-\epsilon\right)=1-o(1) (50)

for some constant γ\gamma.

From (celentano2020lasso, Lemma B.12), we know

(λ,λ[λmin,λmax],λ,𝑨(𝜷^λ(𝑨))λ,𝑨(𝜷^λ(𝑨))+K|λλ|)=1o(1)\mathbb{P}\left(\forall\lambda,\lambda^{\prime}\in[\lambda_{\min},\lambda_{\max}],\mathcal{L}_{\lambda^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda}(\bm{A}))\leq\mathcal{L}_{\lambda^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda^{\prime}}(\bm{A}))+K|\lambda-\lambda^{\prime}|\right)=1-o(1) (51)

for some constant KK, where we write subscript to make the dependence on λ\lambda explicit when necessary.

Now let KK and γ\gamma be the constants mentioned above, and let CC be the constant in Lemma H.2(b). Consider any ϵ>0\epsilon>0. Define ϵ=min{γϵ33K,ϵC+1}\epsilon^{\prime}=\min\left\{\frac{\gamma\epsilon^{3}}{3K},\frac{\epsilon}{C+1}\right\}. Let k=λmaxλminϵk=\lceil\frac{\lambda_{\max}-\lambda_{\min}}{\epsilon^{\prime}}\rceil, and let λi=λmin+iϵ\lambda_{i}=\lambda_{\min}+i\epsilon^{\prime}. Applying a union bound on (50) yields

\mathbb{P}\left(\forall i,\forall\bm{b},\mathcal{L}_{\lambda_{i}}(\bm{b})\leq\mathcal{L}_{\lambda_{i}}(\hat{\bm{\beta}}_{\lambda_{i}})+\gamma\epsilon^{3}\Rightarrow\frac{1}{p}\|\bm{b}\|_{0}\geq\frac{\textnormal{df}_{\lambda_{i}}}{p}-\epsilon\right)=1-o(1). (52)

Consider the intersection of events in (49), (51) and (52), which has probability 1o(1)1-o(1). For any λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}], let ii be such that λ[λi,λi+1]\lambda\in[\lambda_{i},\lambda_{i+1}]. We have

\displaystyle\mathcal{L}_{\lambda_{i}}(\hat{\bm{\beta}}_{\lambda}(\bm{A})) \displaystyle\leq\mathcal{L}_{\lambda_{i},\bm{A}}(\hat{\bm{\beta}}_{\lambda}(\bm{A}))+\frac{1}{3}\gamma\epsilon^{3}
\displaystyle\leq\mathcal{L}_{\lambda_{i},\bm{A}}(\hat{\bm{\beta}}_{\lambda_{i}}(\bm{A}))+\frac{1}{3}\gamma\epsilon^{3}+K|\lambda-\lambda_{i}|
\displaystyle\leq\mathcal{L}_{\lambda_{i},\bm{A}}(\hat{\bm{\beta}}_{\lambda_{i}}(\bm{A}))+\frac{2}{3}\gamma\epsilon^{3}
\displaystyle\leq\mathcal{L}_{\lambda_{i},\bm{A}}(\hat{\bm{\beta}}_{\lambda_{i}})+\frac{2}{3}\gamma\epsilon^{3}
\displaystyle\leq\mathcal{L}_{\lambda_{i}}(\hat{\bm{\beta}}_{\lambda_{i}})+\gamma\epsilon^{3}

where the first and last inequalities come from (49), the second comes from (51), the third holds since K|\lambda-\lambda_{i}|\leq K\epsilon^{\prime}\leq\gamma\epsilon^{3}/3 by the choice of \epsilon^{\prime}, and the fourth follows from the optimality of \hat{\bm{\beta}}_{\lambda_{i}}(\bm{A}). Now since we are on event (52), we have

1p𝜷^λ(𝑨)0\displaystyle\frac{1}{p}\|\hat{\bm{\beta}}_{\lambda}(\bm{A})\|_{0} dfλipϵ\displaystyle\geq\frac{\textnormal{df}_{\lambda_{i}}}{p}-\epsilon
=dfλpϵ+(dfλipdfλp)\displaystyle=\frac{\textnormal{df}_{\lambda}}{p}-\epsilon+(\frac{\textnormal{df}_{\lambda_{i}}}{p}-\frac{\textnormal{df}_{\lambda}}{p})
\displaystyle\geq\frac{\textnormal{df}_{\lambda}}{p}-\epsilon-C|\lambda-\lambda_{i}|
dfλp2ϵ.\displaystyle\geq\frac{\textnormal{df}_{\lambda}}{p}-2\epsilon.

Hence the proof is complete upon observing that both ϵ\epsilon and λ\lambda are arbitrary. ∎

Lemma G.5.

Denote

𝒕^\displaystyle\hat{\bm{t}} :=1nλ𝑿(𝒚𝑿𝜷^)\displaystyle:=\frac{1}{\sqrt{n}\lambda}\bm{X}^{\top}(\bm{y}-\bm{X}\hat{\bm{\beta}})
𝒕^(𝑨)\displaystyle\hat{\bm{t}}(\bm{A}) :=1nλ𝑨𝑿(𝒚𝑿𝑨𝜷^(𝑨))\displaystyle:=\frac{1}{\sqrt{n}\lambda}\bm{A}^{\top}\bm{X}^{\top}(\bm{y}-\bm{X}\bm{A}\hat{\bm{\beta}}(\bm{A}))

the Lasso subgradient and its perturbed counterpart. Under the assumptions in Lemma G.1, for any ϵ>0\epsilon>0,

\mathbb{P}\left(\forall\lambda\in[\lambda_{\min},\lambda_{\max}],\frac{1}{n}\#\{j:|\hat{t}_{j}(\bm{A})|=1\}\leq\frac{\textnormal{df}}{n}+\epsilon\right)=1-o(1).
Proof.

We define an auxiliary loss function 𝒱\mathcal{V} and its perturbed counterpart 𝒱𝑨\mathcal{V}_{\bm{A}}:

𝒱(𝒕)\displaystyle\mathcal{V}(\bm{t}) =min𝒃2M{12n𝑿𝒃𝒚22+λn𝒕𝒃}:=min𝒃2Mw(𝒃,𝒕)\displaystyle=\min_{\|\bm{b}\|_{2}\leq M}\left\{\frac{1}{2n}\|\bm{X}\bm{b}-\bm{y}\|_{2}^{2}+\frac{\lambda}{\sqrt{n}}\bm{t}^{\top}\bm{b}\right\}:=\min_{\|\bm{b}\|_{2}\leq M}w(\bm{b},\bm{t})
𝒱𝑨(𝒕)\displaystyle\mathcal{V}_{\bm{A}}(\bm{t}) =min𝒃2M{12n𝑿𝑨𝒃𝒚22+λn𝒕𝒃}:=min𝒃2Mw𝑨(𝒃,𝒕)\displaystyle=\min_{\|\bm{b}\|_{2}\leq M}\left\{\frac{1}{2n}\|\bm{X}\bm{A}\bm{b}-\bm{y}\|_{2}^{2}+\frac{\lambda}{\sqrt{n}}\bm{t}^{\top}\bm{b}\right\}:=\min_{\|\bm{b}\|_{2}\leq M}w_{\bm{A}}(\bm{b},\bm{t})

where MM is some large enough constant. For any ϵ>0\epsilon>0, consider the following three high probability event statements:

(λ[λmin,λmax],𝒕1,𝒱(𝒕)𝒱𝑨(𝒕)ϵ)=1o(1);\mathbb{P}\left(\forall\lambda\in[\lambda_{\min},\lambda_{\max}],\forall\|\bm{t}\|_{\infty}\leq 1,\mathcal{V}(\bm{t})\geq\mathcal{V}_{\bm{A}}(\bm{t})-\epsilon\right)=1-o(1); (53)
\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\forall\|\bm{t}\|_{\infty}\leq 1,\mathcal{V}(\bm{t})\geq\mathcal{V}(\hat{\bm{t}})-3\gamma\epsilon^{3}\Rightarrow\frac{1}{p}\#\{j:|t_{j}|\geq 1-\epsilon\}\leq\frac{\textnormal{df}}{p}+K_{1}\epsilon\right)=1-o(1) (54)

for some constants K1K_{1} and γ\gamma;

(λ,λ[λmin,λmax],𝒱λ,𝑨(𝒕^λ(𝑨))𝒱λ,𝑨(𝒕^λ(𝑨))K2|λλ|)=1o(1)\mathbb{P}\left(\forall\lambda,\lambda^{\prime}\in[\lambda_{\min},\lambda_{\max}],\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))\geq\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda^{\prime}}(\bm{A}))-K_{2}|\lambda-\lambda^{\prime}|\right)=1-o(1) (55)

for some constant K2K_{2}.

Observe that statements (53), (54), (55) have exactly the same forms as statements (49), (50), (51), respectively. Hence, upon proving statements (53), (54), (55), we can use an ϵ\epsilon-net argument similar to that in the proof of Lemma G.4 to complete the proof of Lemma G.5.

First we prove (53). Fixing any 𝒕1\|\bm{t}\|_{\infty}\leq 1 and denoting 𝒃:=argmin𝒃2Mw(𝒃,𝒕),𝒃𝑨:=argmin𝒃2Mw𝑨(𝒃,𝒕)\bm{b}^{*}:=\operatorname*{arg\,min}_{\|\bm{b}\|_{2}\leq M}w(\bm{b},\bm{t}),\bm{b}_{\bm{A}}^{*}:=\operatorname*{arg\,min}_{\|\bm{b}\|_{2}\leq M}w_{\bm{A}}(\bm{b},\bm{t}), we have

\displaystyle\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}(\mathcal{V}(\bm{t})-\mathcal{V}_{\bm{A}}(\bm{t}))
\displaystyle= \sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}(w(\bm{b}^{*},\bm{t})-w_{\bm{A}}(\bm{b}_{\bm{A}}^{*},\bm{t}))
\displaystyle\leq \sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}(w(\bm{b}_{\bm{A}}^{*},\bm{t})-w_{\bm{A}}(\bm{b}_{\bm{A}}^{*},\bm{t}))
\displaystyle\leq \sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1}{2n}\left|\|\bm{X}\bm{b}_{\bm{A}}^{*}-\bm{y}\|_{2}^{2}-\|\bm{X}\bm{A}\bm{b}_{\bm{A}}^{*}-\bm{y}\|_{2}^{2}\right|\overset{P}{\rightarrow}0

where the first inequality follows from the optimality of 𝒃\bm{b}^{*}, and the convergence follows from Lemma H.6 and the fact that 𝒃𝑨2M\|\bm{b}_{\bm{A}}^{*}\|_{2}\leq M.

Writing out a symmetric argument yields the other direction (details omitted):

supλ[λmin,λmax](𝒱𝑨(𝒕)𝒱(𝒕))supλ[λmin,λmax]12n|𝑿𝒃𝒚22𝑿𝑨𝒃𝒚22|𝑃0.\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}(\mathcal{V}_{\bm{A}}(\bm{t})-\mathcal{V}(\bm{t}))\leq\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\frac{1}{2n}\left|\|\bm{X}\bm{b}^{*}-\bm{y}\|_{2}^{2}-\|\bm{X}\bm{A}\bm{b}^{*}-\bm{y}\|_{2}^{2}\right|\overset{P}{\rightarrow}0.

Therefore, we know supλ[λmin,λmax]|𝒱𝑨(𝒕)𝒱(𝒕)|𝑃0\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}|\mathcal{V}_{\bm{A}}(\bm{t})-\mathcal{V}(\bm{t})|\overset{P}{\rightarrow}0, which leads to (53).

Argument (54) is a direct corollary of (miolane2021distribution, Lemma E.9).

We are now left with (55). From the KKT conditions, on the event {λ[λmin,λmax],𝜷^(𝑨)2M}\{\forall\lambda\in[\lambda_{\min},\lambda_{\max}],\|\hat{\bm{\beta}}(\bm{A})\|_{2}\leq M\}, which happens with probability 1o(1)1-o(1) (see the proof of Lemma G.2), we have

𝑨(𝜷^(𝑨))=min𝒃2M𝑨(𝒃):=min𝒃2Mmax𝒕1w𝑨(𝒃,𝒕)=max𝒕1min𝒃2Mw𝑨(𝒃,𝒕):=max𝒕1𝒱𝑨(𝒕):=𝒱𝑨(𝒕^(𝑨))\mathcal{L}_{\bm{A}}(\hat{\bm{\beta}}(\bm{A}))=\min_{\|\bm{b}\|_{2}\leq M}\mathcal{L}_{\bm{A}}(\bm{b}):=\min_{\|\bm{b}\|_{2}\leq M}\max_{\|\bm{t}\|_{\infty}\leq 1}w_{\bm{A}}(\bm{b},\bm{t})=\max_{\|\bm{t}\|_{\infty}\leq 1}\min_{\|\bm{b}\|_{2}\leq M}w_{\bm{A}}(\bm{b},\bm{t}):=\max_{\|\bm{t}\|_{\infty}\leq 1}\mathcal{V}_{\bm{A}}(\bm{t}):=\mathcal{V}_{\bm{A}}(\hat{\bm{t}}(\bm{A}))

where the exchange of the min and the max is justified by (rockafellar2015convex, Corollary 37.3.2).

As a result, with probability 1o(1)1-o(1), λ,λ[λmin,λmax]\forall\lambda,\lambda^{\prime}\in[\lambda_{\min},\lambda_{\max}],

𝒱λ,𝑨(𝒕^λ(𝑨))𝒱λ,𝑨(𝒕^λ(𝑨))\displaystyle\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda^{\prime}}(\bm{A}))-\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))
=\displaystyle= λ,𝑨(𝜷^λ(𝑨))𝒱λ,𝑨(𝒕^λ(𝑨))\displaystyle\mathcal{L}_{\lambda^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda^{\prime}}(\bm{A}))-\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))
\displaystyle\leq λ,𝑨(𝜷^λ(𝑨))𝒱λ,𝑨(𝒕^λ(𝑨))\displaystyle\mathcal{L}_{\lambda^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda}(\bm{A}))-\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))
=\displaystyle= λ,𝑨(𝜷^λ(𝑨))λ,𝑨(𝜷^λ(𝑨))+𝒱λ,𝑨(𝒕^λ(𝑨))𝒱λ,𝑨(𝒕^λ(𝑨))\displaystyle\mathcal{L}_{\lambda^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda}(\bm{A}))-\mathcal{L}_{\lambda,\bm{A}}(\hat{\bm{\beta}}_{\lambda}(\bm{A}))+\mathcal{V}_{\lambda,\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))-\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))

where the inequality follows from optimality of 𝜷^λ(𝑨)\hat{\bm{\beta}}_{\lambda^{\prime}}(\bm{A}). Statement (51) already guarantees a high probability bound for λ,𝑨(𝜷^λ(𝑨))λ,𝑨(𝜷^λ(𝑨))\mathcal{L}_{\lambda^{\prime},\bm{A}}(\hat{\bm{\beta}}_{\lambda}(\bm{A}))-\mathcal{L}_{\lambda,\bm{A}}(\hat{\bm{\beta}}_{\lambda}(\bm{A})). Finally, (re)denoting 𝒃:=argmin𝒃2Mwλ,𝑨(𝒃,𝒕^λ(𝑨))\bm{b}^{*}:=\operatorname*{arg\,min}_{\|\bm{b}\|_{2}\leq M}w_{\lambda^{\prime},\bm{A}}(\bm{b},\hat{\bm{t}}_{\lambda}(\bm{A})), we have

𝒱λ,𝑨(𝒕^λ(𝑨))𝒱λ,𝑨(𝒕^λ(𝑨))\displaystyle\mathcal{V}_{\lambda,\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))-\mathcal{V}_{\lambda^{\prime},\bm{A}}(\hat{\bm{t}}_{\lambda}(\bm{A}))
:=\displaystyle:= min𝒃2Mwλ,𝑨(𝒃,𝒕^λ(𝑨))wλ,𝑨(𝒃,𝒕^λ(𝑨))\displaystyle\min_{\|\bm{b}\|_{2}\leq M}w_{\lambda,\bm{A}}(\bm{b},\hat{\bm{t}}_{\lambda}(\bm{A}))-w_{\lambda^{\prime},\bm{A}}(\bm{b}^{*},\hat{\bm{t}}_{\lambda}(\bm{A}))
\displaystyle\leq wλ,𝑨(𝒃,𝒕^λ(𝑨))wλ,𝑨(𝒃,𝒕^λ(𝑨))\displaystyle w_{\lambda,\bm{A}}(\bm{b}^{*},\hat{\bm{t}}_{\lambda}(\bm{A}))-w_{\lambda^{\prime},\bm{A}}(\bm{b}^{*},\hat{\bm{t}}_{\lambda}(\bm{A}))
=\displaystyle= |λλ|n𝒃𝒕^λ(𝑨)\displaystyle\frac{|\lambda-\lambda^{\prime}|}{\sqrt{n}}\bm{b}^{*\top}\hat{\bm{t}}_{\lambda}(\bm{A})
\displaystyle\leq |λλ|𝒃21n𝒕^λ(𝑨)2\displaystyle|\lambda-\lambda^{\prime}|\cdot\|\bm{b}^{*}\|_{2}\cdot\frac{1}{\sqrt{n}}\|\hat{\bm{t}}_{\lambda}(\bm{A})\|_{2}
\displaystyle\leq M|λλ|1n1λ𝑨𝑿(𝒚𝑿𝑨𝜷^λ(𝑨))2\displaystyle M|\lambda-\lambda^{\prime}|\cdot\frac{1}{n}\left\|\frac{1}{\lambda}\bm{A}^{\top}\bm{X}^{\top}(\bm{y}-\bm{X}\bm{A}\hat{\bm{\beta}}_{\lambda}(\bm{A}))\right\|_{2}
\displaystyle\leq K3|λλ|\displaystyle K_{3}|\lambda-\lambda^{\prime}|

with probability 1o(1)1-o(1) for some constant K3K_{3}, where the first inequality follows from optimality of 𝒃\bm{b}^{*} and the last inequality follows from the fact that 1λ,1n𝑿op,1n𝒚2,𝑨op\frac{1}{\lambda},\frac{1}{\sqrt{n}}\|\bm{X}\|_{\textnormal{op}},\frac{1}{\sqrt{n}}\|\bm{y}\|_{2},\|\bm{A}\|_{\textnormal{op}} are all bounded with high probability. The proof is thus complete. ∎

Appendix H Auxiliary Lemmas

This section introduces several auxiliary lemmas that we use repeatedly throughout our proofs.

H.1 Convex Gaussian Min-Max Theorem

Theorem H.1.

(thrampoulidis2015regularized, Theorem 3) Let SwpS_{w}\subset\mathbb{R}^{p} and SunS_{u}\subset\mathbb{R}^{n} be two compact sets and let f:Sw×Suf:S_{w}\times S_{u}\rightarrow\mathbb{R} be a continuous function. Let 𝐗=(𝐗i,j)n×p\bm{X}=(\bm{X}_{i,j})\in\mathbb{R}^{n\times p} have i.i.d. 𝒩(0,1)\mathcal{N}(0,1) entries, and let 𝛏g𝒩(0,𝐈p),𝛏h𝒩(0,𝐈n)\bm{\xi}_{g}\sim\mathcal{N}(0,\bm{I}_{p}),\bm{\xi}_{h}\sim\mathcal{N}(0,\bm{I}_{n}) be standard Gaussian vectors, with 𝐗,𝛏g,𝛏h\bm{X},\bm{\xi}_{g},\bm{\xi}_{h} mutually independent. Define

𝒞(𝑿)\displaystyle\mathcal{C}^{*}(\bm{X}) =min𝒘Swmax𝒖Su𝒖𝑿𝒘+f(𝒘,𝒖),\displaystyle=\min_{\bm{w}\in S_{w}}\max_{\bm{u}\in S_{u}}\bm{u}^{\top}\bm{X}\bm{w}+f(\bm{w},\bm{u}),
L(𝝃g,𝝃h)\displaystyle L^{*}(\bm{\xi}_{g},\bm{\xi}_{h}) =min𝒘Swmax𝒖Su𝒖2𝝃g𝒘+𝒘2𝝃h𝒖+f(𝒘,𝒖).\displaystyle=\min_{\bm{w}\in S_{w}}\max_{\bm{u}\in S_{u}}\|\bm{u}\|_{2}\bm{\xi}_{g}^{\top}\bm{w}+\|\bm{w}\|_{2}\bm{\xi}_{h}^{\top}\bm{u}+f(\bm{w},\bm{u}).

Then we have:

  • For all tt\in\mathbb{R},

    (𝒞(𝑿)t)2(L(𝝃g,𝝃h)t).\mathbb{P}(\mathcal{C}^{*}(\bm{X})\leq t)\leq 2\mathbb{P}(L^{*}(\bm{\xi}_{g},\bm{\xi}_{h})\leq t).
  • If SwS_{w} and SuS_{u} are convex and ff is convex-concave, then for all tt\in\mathbb{R},

    (𝒞(𝑿)t)2(L(𝝃g,𝝃h)t).\mathbb{P}(\mathcal{C}^{*}(\bm{X})\geq t)\leq 2\mathbb{P}(L^{*}(\bm{\xi}_{g},\bm{\xi}_{h})\geq t).
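As a simple sanity check (not a proof), consider the special case f\equiv 0 with SwS_{w} and SuS_{u} taken to be the unit spheres of p\mathbb{R}^{p} and n\mathbb{R}^{n}: then \mathcal{C}^{*}(\bm{X}) is the smallest singular value of 𝑿\bm{X} and L^{*}(\bm{\xi}_{g},\bm{\xi}_{h})=\|\bm{\xi}_{h}\|_{2}-\|\bm{\xi}_{g}\|_{2}. The Monte Carlo sketch below compares the two sides of the first inequality at a single threshold; the dimensions, the threshold, and the number of replications are arbitrary placeholders.

import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 40, 20, 5000  # placeholder dimensions and replication count
# C*(X) = sigma_min(X) when f = 0 and S_w, S_u are unit spheres.
C_vals = np.array([np.linalg.svd(rng.standard_normal((n, p)), compute_uv=False).min()
                   for _ in range(reps)])
# L*(xi_g, xi_h) = ||xi_h||_2 - ||xi_g||_2 with xi_g in R^p and xi_h in R^n.
L_vals = np.array([np.linalg.norm(rng.standard_normal(n)) - np.linalg.norm(rng.standard_normal(p))
                   for _ in range(reps)])
t = np.quantile(C_vals, 0.10)  # an arbitrary threshold in the lower tail
print(np.mean(C_vals <= t), 2 * np.mean(L_vals <= t))  # the first value should not exceed the second, up to Monte Carlo error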

H.2 Properties of Equation System Solution

Lemma H.2 (Lipschitzness of fixed point parameters).

Denote by (τL,τR,ζL,ζR,ρ)(\tau_{L},\tau_{R},\zeta_{L},\zeta_{R},\rho) the unique solution to the fixed point equation system (24). There exists a constant CC such that

  • (a) the mapping λkτk\lambda_{k}\mapsto\tau_{k} is C/nC/\sqrt{n}-Lipschitz;

  • (b) the mapping λkζk\lambda_{k}\mapsto\zeta_{k} is CC-Lipschitz;

  • (c) the mappings (λL,λR)ρ,(λL,λR)ρ(\lambda_{\textrm{L}},\lambda_{\textrm{R}})\mapsto\rho,(\lambda_{\textrm{L}},\lambda_{\textrm{R}})\mapsto\rho^{\perp} are both CC-Lipschitz in both arguments, where ρ=1ρ2.\rho^{\perp}=\sqrt{1-\rho^{2}}.

Proof.

Note that (a) is a direct consequence of miolane2021distribution Proposition A.3. (b) also follows from miolane2021distribution on noting that ζk=β/τ\zeta_{k}=\beta_{*}/\tau_{*} in their notation, and on applying Lemma H.8. We remark that both results from miolane2021distribution Proposition A.3 are stated for the case of the Lasso, but the Ridge case follows similarly.

We prove (c) here. We show that both ρ\rho and ρ\rho^{\perp} are Lipschitz in λR\lambda_{\textrm{R}}. The proof for λL\lambda_{\textrm{L}} follows similarly and is therefore omitted. From (23) and (24), we know that

ρ=1nτLτR(σ2+𝔼[𝜷^Lf𝜷,𝜷^Rf𝜷]).\rho=\frac{1}{n\tau_{L}\tau_{R}}\left(\sigma^{2}+\mathbb{E}[\langle\hat{\bm{\beta}}_{\textrm{L}}^{f}-\bm{\beta},\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta}\rangle]\right).

First, we know σ2\sigma^{2} and nτL\sqrt{n}\tau_{L} do not depend on λR\lambda_{\textrm{R}}, while nτR\sqrt{n}\tau_{R} is bounded and Lipschitz by (25) and part (a) of this lemma. Next, by Cauchy-Schwarz, we have

𝔼[𝜷^Lf𝜷,𝜷^Rf𝜷]𝔼[𝜷^Lf𝜷2]𝔼[𝜷^Rf𝜷2],\mathbb{E}[\langle\hat{\bm{\beta}}_{\textrm{L}}^{f}-\bm{\beta},\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta}\rangle]\leq\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{L}}^{f}-\bm{\beta}\|_{2}]\cdot\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta}\|_{2}],

where \mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{L}}^{f}-\bm{\beta}\|_{2}]\leq\sqrt{\mathbb{E}[\|\hat{\bm{\beta}}_{\textrm{L}}^{f}-\bm{\beta}\|_{2}^{2}]}=\sqrt{n\tau_{L}^{2}-\sigma^{2}}\leq\tau_{\max} is bounded and does not depend on λR\lambda_{\textrm{R}} (the equality follows from (23) and (24)). Now,

\left|\mathbb{E}[\|\hat{\bm{\beta}}_{R}^{f}(\lambda_{1})-\bm{\beta}\|_{2}]-\mathbb{E}[\|\hat{\bm{\beta}}_{R}^{f}(\lambda_{2})-\bm{\beta}\|_{2}]\right|\leq \mathbb{E}\left[\left|\|\hat{\bm{\beta}}_{R}^{f}(\lambda_{1})-\bm{\beta}\|_{2}-\|\hat{\bm{\beta}}_{R}^{f}(\lambda_{2})-\bm{\beta}\|_{2}\right|\right]
\displaystyle\leq 𝔼[𝜷^Rf(λ1)𝜷^Rf(λ2)2]\displaystyle\mathbb{E}[\|\hat{\bm{\beta}}_{R}^{f}(\lambda_{1})-\hat{\bm{\beta}}_{R}^{f}(\lambda_{2})\|_{2}]
\displaystyle\leq 𝔼[𝜷^Rf(λ1)𝜷^Rf(λ2)22]\displaystyle\sqrt{\mathbb{E}[\|\hat{\bm{\beta}}_{R}^{f}(\lambda_{1})-\hat{\bm{\beta}}_{R}^{f}(\lambda_{2})\|_{2}^{2}]}
\displaystyle\leq M(λ1λ2)\displaystyle M(\lambda_{1}-\lambda_{2})

for some constant MM, where the last line follows verbatim the proof of Lemma E.10. Hence 𝔼[𝜷^Lf𝜷L,𝜷^Rf𝜷R]\mathbb{E}[\langle\hat{\bm{\beta}}_{\textrm{L}}^{f}-\bm{\beta}_{\textrm{L}},\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta}_{\textrm{R}}\rangle] is Lipschitz in λR\lambda_{\textrm{R}}. Combining the two terms with Lemma H.8, we conclude that ρ\rho is Lipschitz in λR\lambda_{\textrm{R}}. For ρ\rho^{\perp}, note the fact that ρ=1ρ2\rho^{\perp}=\sqrt{1-\rho^{2}}, so

|ρ(λ1)ρ(λ2)|\displaystyle|\rho^{\perp}(\lambda_{1})-\rho^{\perp}(\lambda_{2})| =|ρ(λ1)+ρ(λ2)||ρ(λ1)+ρ(λ2)||ρ(λ1)ρ(λ2)|\displaystyle=\frac{|\rho(\lambda_{1})+\rho(\lambda_{2})|}{|\rho^{\perp}(\lambda_{1})+\rho^{\perp}(\lambda_{2})|}\cdot|\rho(\lambda_{1})-\rho(\lambda_{2})|
221ρmax2|ρ(λ1)ρ(λ2)|\displaystyle\leq\frac{2}{2\sqrt{1-\rho_{\max}^{2}}}|\rho(\lambda_{1})-\rho(\lambda_{2})|
11ρmax2M|λ1λ2|.\displaystyle\leq\frac{1}{\sqrt{1-\rho_{\max}^{2}}}\cdot M|\lambda_{1}-\lambda_{2}|.

Thus ρ\rho^{\perp} is also Lipschitz in λR\lambda_{\textrm{R}}, concluding the proof. ∎
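The only external ingredient borrowed above is the Lipschitz dependence of the ridge/Lasso quantities on the tuning parameter (Lemma E.10). The Python sketch below is a purely numerical illustration of this behavior for the finite-sample ridge path; the parametrization of the ridge objective and all dimensions are our own illustrative choices and need not match the exact scaling used elsewhere in the paper.

```python
# Numerical sketch (not part of any proof): the ridge solution path is Lipschitz in lambda.
# Assumed parametrization: beta_hat(lam) = argmin_b ||y - X b||^2 / (2n) + lam * ||b||^2 / 2.
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 150
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + rng.standard_normal(n)

def ridge(lam):
    # closed form: (X'X/n + lam * I)^{-1} X'y / n
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

lams = np.linspace(0.1, 2.0, 40)
path = np.array([ridge(l) for l in lams])
ratios = [np.linalg.norm(path[i + 1] - path[i]) / (lams[i + 1] - lams[i])
          for i in range(len(lams) - 1)]
print("max ||beta_hat(l1) - beta_hat(l2)||_2 / |l1 - l2| over the grid:", max(ratios))
```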

H.3 Concentration of empirical second moments

Most of our results are established for Lipschitz functions of random vectors. However, we will frequently need to show the concentration of second-order statistics of the form 𝒂(1),𝒂(2)\langle\bm{a}^{(1)},\bm{a}^{(2)}\rangle. The following lemma provides the connection:

Lemma H.3.

Consider two random vectors 𝐚L,𝐚Rp\bm{a}_{\textrm{L}},\bm{a}_{\textrm{R}}\in\mathbb{R}^{p}, and a positive semi-definite matrix 𝐒2×2\bm{S}\in\mathbb{R}^{2\times 2} (all possibly dependent on λL,λR\lambda_{\textrm{L}},\lambda_{\textrm{R}}) that satisfy the following concentration guarantee:

  • There exist functions ϕL,ϕR\bm{\phi}_{\textrm{L}},\bm{\phi}_{\textrm{R}} (dependent on λL,λR\lambda_{\textrm{L}},\lambda_{\textrm{R}}) that are M/pM/\sqrt{p}-Lipschitz such that for Gaussian vectors 𝝃L,𝝃R𝒩(0,𝑺𝑰p)\bm{\xi}_{\textrm{L}},\bm{\xi}_{\textrm{R}}\sim\mathcal{N}(0,\bm{S}\otimes\bm{I}_{p}) and any 11-Lipschitz functions ϕ:(p)2\phi:(\mathbb{R}^{p})^{2}\rightarrow\mathbb{R},

    (supλL,λR[λmin,λmax]|ϕ(𝒂L,𝒂R)𝔼[ϕ(ϕL(𝝃L),ϕR(𝝃R))]|>Kϵ)=o(ϵ)\displaystyle\mathbb{P}\left(\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\phi(\bm{a}_{\textrm{L}},\bm{a}_{\textrm{R}})-\mathbb{E}[\phi(\bm{\phi}_{\textrm{L}}(\bm{\xi}_{\textrm{L}}),\bm{\phi}_{\textrm{R}}(\bm{\xi}_{\textrm{R}}))]|>K\epsilon\right)=o(\epsilon)

    and

    𝔼[supλk[λmin,λmax]ϕk(𝝃k)22]C,k=L,R.\mathbb{E}[\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}\|\bm{\phi}_{k}(\bm{\xi}_{k})\|_{2}^{2}]\leq C,\ k=L,R.
  • The parameters satisfy MCKM\leq CK and the singular values of 𝑺\bm{S} are bounded by CC for all λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}].

Then we have

(supλL,λR[λmin,λmax]|𝒂L,𝒂R𝔼[ϕL(𝝃L),ϕR(𝝃R)]|>Kϵ)=o(ϵ).\mathbb{P}\left(\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]}|\langle\bm{a}_{\textrm{L}},\bm{a}_{\textrm{R}}\rangle-\mathbb{E}[\langle\bm{\phi}_{\textrm{L}}(\bm{\xi}_{\textrm{L}}),\bm{\phi}_{\textrm{R}}(\bm{\xi}_{\textrm{R}})\rangle]|>K\epsilon\right)=o(\epsilon).

The proof of this lemma follows almost verbatim the proof of Lemma G.1 in celentano2021cad, with Proposition H.5 applied in appropriate places to ensure that the relevant equalities and inequalities hold uniformly over the range of tuning parameters.
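For intuition, one standard route from concentration of Lipschitz functions to concentration of such bilinear statistics (recorded here only as a heuristic; the argument in celentano2021cad may be organized differently) combines the polarization identity with the fact that the square map is Lipschitz on bounded sets:

\langle\bm{a}_{\textrm{L}},\bm{a}_{\textrm{R}}\rangle=\tfrac{1}{4}\left(\|\bm{a}_{\textrm{L}}+\bm{a}_{\textrm{R}}\|_{2}^{2}-\|\bm{a}_{\textrm{L}}-\bm{a}_{\textrm{R}}\|_{2}^{2}\right),\qquad|x^{2}-y^{2}|\leq 2R\,|x-y|\ \text{ for }|x|,|y|\leq R.

The maps (\bm{a}_{\textrm{L}},\bm{a}_{\textrm{R}})\mapsto\|\bm{a}_{\textrm{L}}\pm\bm{a}_{\textrm{R}}\|_{2} are \sqrt{2}-Lipschitz, and the second-moment bound \mathbb{E}[\sup_{\lambda_{k}}\|\bm{\phi}_{k}(\bm{\xi}_{k})\|_{2}^{2}]\leq C supplies, after a high-probability truncation, the radius R at which the square map is applied.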

H.4 Largest singular value of random matrices

Lemma H.4.

(vershynin2010introduction, Corollary 5.35) Let σmax(𝐗)\sigma_{\max}(\bm{X}) be the largest singular value of the matrix 𝐗n×p\bm{X}\in\mathbb{R}^{n\times p} with i.i.d. 𝒩(0,1)\mathcal{N}(0,1) entries, then

(σmax(𝑿)>n+p+t)et2/2.\mathbb{P}(\sigma_{\max}(\bm{X})>\sqrt{n}+\sqrt{p}+t)\leq e^{-t^{2}/2}.
Corollary H.1.

In the setting of Lemma H.4,

(1nσmax(𝑿)>2+δ)en/2=o(1).\mathbb{P}\left(\frac{1}{\sqrt{n}}\sigma_{\max}(\bm{X})>2+\sqrt{\delta}\right)\leq e^{-n/2}=o(1).
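As a quick numerical illustration (not used in any proof; the dimensions and number of repetitions below are arbitrary illustrative choices), the scaled largest singular value of a standard Gaussian design concentrates near 1+√δ, comfortably below the threshold 2+√δ in Corollary H.1:

```python
# Monte Carlo illustration of Lemma H.4 / Corollary H.1 for a Gaussian design (sketch only).
import numpy as np

rng = np.random.default_rng(2)
n, p, reps = 500, 250, 100
delta = p / n

smax = [np.linalg.norm(rng.standard_normal((n, p)), ord=2) for _ in range(reps)]
print("max over reps of sigma_max / sqrt(n):", max(smax) / np.sqrt(n))
print("reference 1 + sqrt(delta)           :", 1 + np.sqrt(delta))
print("threshold 2 + sqrt(delta)           :", 2 + np.sqrt(delta))
```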

H.5 Inequalities for supremum

Our results often involve establishing that certain statements hold uniformly over the range of tuning parameter values. To this end, we frequently use the following results related to the properties of the supremum.

Lemma H.5.
  • For a random variable XX and functions fλ(X)=k=1Kfλ(k)(X)f_{\lambda}(X)=\sum_{k=1}^{K}f_{\lambda}^{(k)}(X) parametrized by λ\lambda, with probability 11,

    supλfλ(X)k=1Ksupλfλ(k)(X).\sup_{\lambda}f_{\lambda}(X)\leq\sum_{k=1}^{K}\sup_{\lambda}f_{\lambda}^{(k)}(X).
  • For a random variable XX and non-negative functions fλ(X)=k=1Kfλ(k)(X)f_{\lambda}(X)=\prod_{k=1}^{K}f_{\lambda}^{(k)}(X) parametrized by λ\lambda, with probability 11,

    supλfλ(X)k=1Ksupλfλ(k)(X).\sup_{\lambda}f_{\lambda}(X)\leq\prod_{k=1}^{K}\sup_{\lambda}f_{\lambda}^{(k)}(X).
  • For a random variable XX and a function fλ(X)f_{\lambda}(X) parametrized by λ\lambda,

    supλ𝔼[fλ(X)]𝔼[supλfλ(X)].\sup_{\lambda}\mathbb{E}[f_{\lambda}(X)]\leq\mathbb{E}[\sup_{\lambda}f_{\lambda}(X)].

The proof is straightforward and is therefore omitted.

Lemma H.6.

For sequences of bounded positive random functions {fn(1)(λ)},{fn(2)(λ)},{gn(1)(λ)},{gn(2)(λ)}\{f_{n}^{(1)}(\lambda)\},\{f_{n}^{(2)}(\lambda)\},\{g_{n}^{(1)}(\lambda)\},\{g_{n}^{(2)}(\lambda)\} such that supλ|fn(1)(λ)fn(2)(λ)|𝑃0\sup_{\lambda}|f_{n}^{(1)}(\lambda)-f_{n}^{(2)}(\lambda)|\overset{P}{\rightarrow}0 and supλ|gn(1)(λ)gn(2)(λ)|𝑃0\sup_{\lambda}|g_{n}^{(1)}(\lambda)-g_{n}^{(2)}(\lambda)|\overset{P}{\rightarrow}0, we have

supλ|fn(1)(λ)gn(1)(λ)fn(2)(λ)gn(2)(λ)|𝑃0\sup_{\lambda}|f_{n}^{(1)}(\lambda)g_{n}^{(1)}(\lambda)-f_{n}^{(2)}(\lambda)g_{n}^{(2)}(\lambda)|\overset{P}{\rightarrow}0
Proof.

The proof follows directly from Lemma H.5 together with the fact that

|fn(1)gn(1)fn(2)gn(2)|\displaystyle|f_{n}^{(1)}g_{n}^{(1)}-f_{n}^{(2)}g_{n}^{(2)}|
=\displaystyle= |fn(1)gn(1)fn(1)gn(2)+fn(1)gn(2)fn(2)gn(2)|\displaystyle|f_{n}^{(1)}g_{n}^{(1)}-f_{n}^{(1)}g_{n}^{(2)}+f_{n}^{(1)}g_{n}^{(2)}-f_{n}^{(2)}g_{n}^{(2)}|
\displaystyle\leq |fn(1)||gn(1)gn(2)|+|gn(2)||fn(1)fn(2)|\displaystyle|f_{n}^{(1)}|\cdot|g_{n}^{(1)}-g_{n}^{(2)}|+|g_{n}^{(2)}|\cdot|f_{n}^{(1)}-f_{n}^{(2)}|

Lemma H.7.

For sequences of positive random functions {fn(λ)}\{f_{n}(\lambda)\},{gn(λ)}\{g_{n}(\lambda)\} such that \inf_{\lambda}f_{n}(\lambda) and \inf_{\lambda}g_{n}(\lambda) are bounded away from zero with probability 1o(1)1-o(1), and supλ|fn(λ)gn(λ)|𝑃0\sup_{\lambda}|f_{n}(\lambda)-g_{n}(\lambda)|\overset{P}{\rightarrow}0, we have

supλ|fn(λ)gn(λ)|𝑃0.\sup_{\lambda}|\sqrt{f_{n}(\lambda)}-\sqrt{g_{n}(\lambda)}|\overset{P}{\rightarrow}0.
Proof.
supλ|fn(λ)gn(λ)|\displaystyle\sup_{\lambda}|\sqrt{f_{n}(\lambda)}-\sqrt{g_{n}(\lambda)}|
=\displaystyle= supλ|fn(λ)gn(λ)fn(λ)+gn(λ)|\displaystyle\sup_{\lambda}\left|\frac{f_{n}(\lambda)-g_{n}(\lambda)}{\sqrt{f_{n}(\lambda)}+\sqrt{g_{n}(\lambda)}}\right|
\displaystyle\leq supλ|fn(λ)gn(λ)|supλ|1fn(λ)+gn(λ)|\displaystyle\sup_{\lambda}|f_{n}(\lambda)-g_{n}(\lambda)|\cdot\sup_{\lambda}\left|\frac{1}{\sqrt{f_{n}(\lambda)}+\sqrt{g_{n}(\lambda)}}\right|
𝑃\displaystyle\overset{P}{\rightarrow} 0.\displaystyle\ 0.

H.6 Properties on Lipschitz functions

Lemma H.8.

Suppose f(x)f(x) is Lipschitz with constant CfC_{f} and takes values in [fmin,fmax][f_{\min},f_{\max}] with f_{\min}>0, and g(x)g(x) is Lipschitz with constant CgC_{g} and takes values in [gmin,gmax][g_{\min},g_{\max}] with g_{\min}>0. Then

  • h(x)=f(x)g(x)h(x)=f(x)g(x) takes values in [fmingmin,fmaxgmax][f_{\min}g_{\min},f_{\max}g_{\max}], and is Lipschitz with constant Cfgmax+CgfmaxC_{f}g_{\max}+C_{g}f_{\max}.

  • 1f(x)\frac{1}{f(x)} takes values in [1fmax,1fmin][\frac{1}{f_{\max}},\frac{1}{f_{\min}}], and is Lipschitz in xx with constant Cf/fmin2C_{f}/f_{\min}^{2}.

Proof.

The boundedness part is trivial. For Lipschitzness part, we have

h(x1)h(x2)\displaystyle\|h(x_{1})-h(x_{2})\| =f(x1)g(x1)f(x2)g(x2)\displaystyle=\|f(x_{1})g(x_{1})-f(x_{2})g(x_{2})\|
=(f(x1)g(x1)f(x2)g(x1))+(f(x2)g(x1)f(x2)g(x2))\displaystyle=\|(f(x_{1})g(x_{1})-f(x_{2})g(x_{1}))+(f(x_{2})g(x_{1})-f(x_{2})g(x_{2}))\|
g(x1)f(x1)f(x2)+f(x2)g(x1)g(x2)\displaystyle\leq\|g(x_{1})\|\|f(x_{1})-f(x_{2})\|+\|f(x_{2})\|\|g(x_{1})-g(x_{2})\|
gmaxCfx1x2+fmaxCgx1x2,\displaystyle\leq g_{\max}C_{f}\|x_{1}-x_{2}\|+f_{\max}C_{g}\|x_{1}-x_{2}\|,

and further,

1f(x1)1f(x2)\displaystyle\|\frac{1}{f(x_{1})}-\frac{1}{f(x_{2})}\| =f(x2)f(x1)f(x1)f(x2)\displaystyle=\|\frac{f(x_{2})-f(x_{1})}{f(x_{1})f(x_{2})}\|
f(x2)f(x1)fmin2\displaystyle\leq\frac{\|f(x_{2})-f(x_{1})\|}{f_{\min}^{2}}
Cfx1x2fmin2.\displaystyle\leq\frac{C_{f}\|x_{1}-x_{2}\|}{f_{\min}^{2}}.

Lemma H.9.

Suppose f(x,y)f(x,y) is MM-Lipschitz in yy for all xx. Then \sup_{x}f(x,y) is also MM-Lipschitz in yy.

Proof.

Consider any y1,y2y_{1},y_{2}. Let x1=argmaxxf(x,y1),x2=argmaxxf(x,y2)x_{1}=\operatorname*{arg\,max}_{x}f(x,y_{1}),x_{2}=\operatorname*{arg\,max}_{x}f(x,y_{2}), then

f(x1,y1)f(x2,y2)=f(x1,y1)f(x1,y2)+f(x1,y2)f(x2,y2)|f(x1,y1)f(x1,y2)|M|y1y2|,\displaystyle f(x_{1},y_{1})-f(x_{2},y_{2})=f(x_{1},y_{1})-f(x_{1},y_{2})+f(x_{1},y_{2})-f(x_{2},y_{2})\leq|f(x_{1},y_{1})-f(x_{1},y_{2})|\leq M|y_{1}-y_{2}|,
f(x2,y2)f(x1,y1)=f(x2,y2)f(x2,y1)+f(x2,y1)f(x1,y1)|f(x2,y2)f(x2,y1)|M|y1y2|,\displaystyle f(x_{2},y_{2})-f(x_{1},y_{1})=f(x_{2},y_{2})-f(x_{2},y_{1})+f(x_{2},y_{1})-f(x_{1},y_{1})\leq|f(x_{2},y_{2})-f(x_{2},y_{1})|\leq M|y_{1}-y_{2}|,

where we have used the fact that f(x1,y2)f(x2,y2)0f(x_{1},y_{2})-f(x_{2},y_{2})\leq 0 and f(x2,y1)f(x1,y1)0f(x_{2},y_{1})-f(x_{1},y_{1})\leq 0 by optimality of x1x_{1} and x2x_{2}. Thus,

|supxf(x,y1)supxf(x,y2)|=|f(x1,y1)f(x2,y2)|M|y1y2|,|\sup_{x}f(x,y_{1})-\sup_{x}f(x,y_{2})|=|f(x_{1},y_{1})-f(x_{2},y_{2})|\leq M|y_{1}-y_{2}|,

which shows supxf(x,y)\sup_{x}f(x,y) is also MM-Lipschitz in yy. ∎

Appendix I Universality Proof

Here we extend our results from prior sections to the case of the general covariate and noise distributions mentioned under Assumptions 1 and 2.

To differentiate from the notations above, we use 𝑮\bm{G} to denote a design matrix with i.i.d. 𝒩(0,1)\mathcal{N}(0,1) entries, and ϵG\bm{\epsilon}^{G} to denote the noise vector with i.i.d. 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}) entries. Unless otherwise noted, for all other quantities, we add a superscript GG to denote corresponding quantities that depend on 𝑮\bm{G} and/or ϵG\bm{\epsilon}^{G}. Also, our original notations from the preceding sections now refer to quantities with the general sub-Gaussian tailed design matrix and noise distribution specified by Assumptions 1 and 2.

Several results from prior sections assumed Gaussianity of the design or errors. A complete list of these results is as follows: Lemmas D.2-D.13, Corollary D.2, Lemmas E.1-E.4 (along with several supporting lemmas that enter the proof of Lemma E.1), Lemma F.2, Lemmas G.1-G.5, Lemma H.4, and Corollary H.1. However, for many of these results, the proof uses Gaussianity only through the use of certain other lemmas/results in the aforementioned list. Thus, to prove the validity of such results beyond Gaussianity, it suffices to prove the validity of the other lemmas used in their proofs. We thus identify a smaller set of results such that showing these hold under the non-Gaussian setting of Assumptions 1 and 2 suffices for showing that all of the aforementioned results satisfy the same universality property. This smaller set is as follows: Lemmas D.2, D.8, D.11, D.12, Lemma E.1 (along with some supporting results in its proof, as outlined in Section I.2 below), Lemma E.2, Lemmas G.2-G.5, Lemma H.4, and Corollary H.1. In this section, we show that these results continue to hold under Assumptions 1 and 2 (and Assumption 3 for Lemmas G.2-G.5).

I.1 Replacing Lemma H.4 and Corollary H.1

This follows from prior results in the literature that we quote below. Recall that supnmaxijXijψ2<\sup_{n}\max_{ij}\|X_{ij}\|_{\psi_{2}}<\infty, where ψ2\|\cdot\|_{\psi_{2}} is the Orlicz-2 norm, also known as the sub-Gaussian norm (see (wellner2013weak, Section 2.1) for the precise definition).

Lemma I.1.

(vershynin2018high, Theorem 4.4.5) Let σmax(𝐗)\sigma_{\max}(\bm{X}) be the largest singular value of the matrix 𝐗n×p\bm{X}\in\mathbb{R}^{n\times p} with independent, mean 0, variance 11, uniformly sub-Gaussian entries, then

(σmax(𝑿)>CK(n+p+t))2et2,\mathbb{P}(\sigma_{\max}(\bm{X})>CK(\sqrt{n}+\sqrt{p}+t))\leq 2e^{-t^{2}},

where CC is an absolute constant and K=maxijXijψ2K=\max_{ij}\|X_{ij}\|_{\psi_{2}}.

Corollary I.1.

Under Assumption 1,

(1nσmax(𝑿)>CK(2+δ))2en=o(1),\mathbb{P}\left(\frac{1}{\sqrt{n}}\sigma_{\max}(\bm{X})>CK(2+\sqrt{\delta})\right)\leq 2e^{-n}=o(1),

where C,KC,K are as in Lemma I.1.
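This is the bound that lets the operator-norm control from the Gaussian proofs carry over to designs such as standardized genotype matrices, whose entries are bounded and hence uniformly sub-Gaussian. The Python sketch below is an illustration only: the Binomial(2, maf) allele-count model and its standardization are our own simplified assumptions, not the paper's data-processing pipeline.

```python
# Illustration of Corollary I.1 for a genotype-like sub-Gaussian design (sketch only).
import numpy as np

rng = np.random.default_rng(3)
n, p, maf = 500, 250, 0.3
delta = p / n

G = rng.binomial(2, maf, size=(n, p)).astype(float)
G = (G - 2 * maf) / np.sqrt(2 * maf * (1 - maf))     # entries have mean 0, variance 1

print("sigma_max(G) / sqrt(n)            :", np.linalg.norm(G, ord=2) / np.sqrt(n))
print("Gaussian reference 1 + sqrt(delta):", 1 + np.sqrt(delta))
```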

I.2 Replacing Lemmas E.1 and E.2

In this section, we will establish that the following more general version of Lemma E.1 holds. For Lemma E.2 a similar extension can be proved by using similar proof techniques as in this section, thus we omit writing the details for the latter.

Lemma I.2 (Replacing Lemma E.1).

Under Assumptions 1 and 2, for k=L,Rk=L,R and any 11-Lipschitz function ϕβ:p\phi_{\beta}:\mathbb{R}^{p}\rightarrow\mathbb{R},

supλk[λmin,λmax]|ϕβ(𝜷^k)𝔼[ϕβ(𝜷^kf)]|𝑃0.\sup_{\lambda_{k}\in[\lambda_{\min},\lambda_{\max}]}|\phi_{\beta}(\hat{\bm{\beta}}_{k})-\mathbb{E}[\phi_{\beta}(\hat{\bm{\beta}}_{k}^{f})]|\overset{P}{\rightarrow}0.\\

The proofs of the Lasso and Ridge cases are similar. We present only the Ridge case so that readers can easily compare with Section E.3. We first introduce some useful supporting lemmas. The following lemma is a consequence of recent results in han2022universality.

Lemma I.3.

Suppose we are under Assumptions 1 and 2, and further assume 𝐆n×p\bm{G}\in\mathbb{R}^{n\times p} has i.i.d. 𝒩(0,1)\mathcal{N}(0,1) entries. For any set Dp[Lp/n,Lp/n]pD_{p}\subset[-L_{p}/\sqrt{n},L_{p}/\sqrt{n}]^{p} with Lp=K(logp+2n𝛃)L_{p}=K(\sqrt{\log p}+2\sqrt{n}\|\bm{\beta}\|_{\infty}) where KK is a constant that only depends on λmax\lambda_{\max}, any t,ϵ>0t\in\mathbb{R},\epsilon>0, we have

(min𝒘Dp𝒞λ(𝒘)>t+3ϵ)(min𝒘Dp𝒞λ(𝒘;𝑮,ϵ)>t+ϵ)+o(1),\displaystyle\mathbb{P}\left(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w})>t+3\epsilon\right)\leq\mathbb{P}\left(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w};\bm{G},\bm{\epsilon})>t+\epsilon\right)+o(1),
(min𝒘Dp𝒞λ(𝒘)<t3ϵ)(min𝒘Dp𝒞λ(𝒘;𝑮,ϵ)<tϵ)+o(1),\displaystyle\mathbb{P}\left(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w})<t-3\epsilon\right)\leq\mathbb{P}\left(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w};\bm{G},\bm{\epsilon})<t-\epsilon\right)+o(1),

where 𝒞λ(𝐰)\mathcal{C}_{\lambda}(\bm{w}) is defined as in (37) and 𝒞λ(𝐰;𝐆,ϵ)\mathcal{C}_{\lambda}(\bm{w};\bm{G},\bm{\epsilon}) represents 𝒞λ(𝐰)\mathcal{C}_{\lambda}(\bm{w}) where we replace 𝐗\bm{X} by 𝐆\bm{G} but keep the (possibly non-Gaussian) ϵ\bm{\epsilon}.

Proof.

The first inequality is a direct consequence of (han2022universality, Theorem 2.3), obtained by plugging in the formula for 𝒞λ(𝒘)\mathcal{C}_{\lambda}(\bm{w}) from (37). The second inequality can be proved similarly by modifying the last lines of (han2022universality, Section 4.2). We skip the details here for conciseness. Notice that the n\sqrt{n} in DpD_{p} comes from the difference between their scaling and ours. ∎

Further, (han2022universality, Proposition 3.3(2)), with the appropriate n\sqrt{n}-rescaling in our case yields the following.

Lemma I.4.

Under Assumptions 1 and 2, with probability 1o(1)1-o(1),

n𝜷^RK(logp+2n𝜷),\sqrt{n}\|\hat{\bm{\beta}}_{\textrm{R}}\|_{\infty}\leq K\left(\sqrt{\log p}+2\sqrt{n}\|\bm{\beta}\|_{\infty}\right),

where KK is a constant that only depends on λmax\lambda_{\max}.

We note that Lemma I.3 also holds for the Lasso case (with a different 𝒞λ(𝒘)\mathcal{C}_{\lambda}(\bm{w})), and the Lasso counterpart of Lemma I.4 is given by (han2022universality, Proposition 3.7(2)).

Now we prove Lemma I.2. We follow the same structure as Section E.3.

I.2.1 Converting the optimization problem

This subsection requires no change.

I.2.2 Connecting AO with SO

Lemma I.5 (Replacing Proposition E.6).

Recall the definition of 𝐰(λ)\bm{w}(\lambda) from (40). There exists a constant γ>0\gamma>0 such that for all ϵ(0,1]\epsilon\in(0,1],

supλ[λmin,λmax](𝒘p,𝒘𝒘(λ)22>ϵ and 𝒘Lp/nandLλ(𝒘)min𝒗pLλ(𝒗)+γϵ)=o(ϵ).\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\exists\bm{w}\in\mathbb{R}^{p},\|\bm{w}-\bm{w}(\lambda)\|_{2}^{2}>\epsilon\text{ and }\|\bm{w}\|_{\infty}\leq L_{p}/\sqrt{n}\ \ \text{and}\ \ L_{\lambda}(\bm{w})\leq\min_{\bm{v}\in\mathbb{R}^{p}}L_{\lambda}(\bm{v})+\gamma\epsilon\right)=o(\epsilon).
Proof.

The statement is a direct corollary of Proposition E.6, since adding another condition to the event does not increase the probability. ∎

I.2.3 Connecting PO with AO

Lemma I.6 (Replacing Corollary E.1).

Consider the same setting as in Lemma I.3.

  • Further assuming DpD_{p} is a closed set, we have for all t,ϵ>0t\in\mathbb{R},\epsilon>0,

    (min𝒘Dp𝒞λ(𝒘)t)2(min𝒘DpLλ(𝒘)t+2ϵ)+o(1),\mathbb{P}(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w})\leq t)\leq 2\mathbb{P}(\min_{\bm{w}\in D_{p}}L_{\lambda}(\bm{w})\leq t+2\epsilon)+o(1),

    where LλL_{\lambda} is defined as in (38).

  • Further assuming DpD_{p} is a closed convex set, we have for all t,ϵ>0t\in\mathbb{R},\epsilon>0,

    (min𝒘Dp𝒞λ(𝒘)t)2(min𝒘DpLλ(𝒘)t2ϵ)+o(1).\mathbb{P}(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w})\geq t)\leq 2\mathbb{P}(\min_{\bm{w}\in D_{p}}L_{\lambda}(\bm{w})\geq t-2\epsilon)+o(1).
Proof.

We only prove the first statement, as the second follows similarly. As a direct consequence of the second inequality in Lemma I.3 (applied with t\leftarrow t+3\epsilon), we have

(min𝒘Dp𝒞λ(𝒘)t)(min𝒘Dp𝒞λ(𝒘;𝑮,ϵ)t+2ϵ)+o(1).\mathbb{P}(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w})\leq t)\leq\mathbb{P}(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w};\bm{G},\bm{\epsilon})\leq t+2\epsilon)+o(1).

Now consider the high-probability event (42). The fact that 𝒛=ϵ/σ\bm{z}=\bm{\epsilon}/\sigma is now sub-Gaussian instead of Gaussian does not change the fact that we can find another KK that ensures this event occurs with high probability. Therefore, the proof of Corollary E.1 goes through, and we obtain the following inequality: for any tt\in\mathbb{R},

(min𝒘Dp𝒞λ(𝒘;𝑮,ϵ)t)2(min𝒘DpLλ(𝒘)t).\mathbb{P}(\min_{\bm{w}\in D_{p}}\mathcal{C}_{\lambda}(\bm{w};\bm{G},\bm{\epsilon})\leq t)\leq 2\mathbb{P}(\min_{\bm{w}\in D_{p}}L_{\lambda}(\bm{w})\leq t).

Combining the two displays above finishes the proof. ∎

Now for a fixed λ\lambda, we instead consider the set D=DpDλϵD=D_{p}\bigcap D_{\lambda}^{\epsilon}, where Dp[Lp/n,Lp/n]pD_{p}\subset[-L_{p}/\sqrt{n},L_{p}/\sqrt{n}]^{p} is defined in Lemma I.3 and Dλϵ={𝒘p|𝒘𝒘(λ)22>ϵ}D_{\lambda}^{\epsilon}=\{\bm{w}\in\mathbb{R}^{p}|\|\bm{w}-\bm{w}(\lambda)\|_{2}^{2}>\epsilon\} is defined in Section E.3.3.

Lemma I.7 (Replacing Lemma E.7).

For all ϵ(0,1]\epsilon\in(0,1],

(min𝒘D𝒞λ(𝒘)min𝒘p𝒞λ(𝒘)+ϵ)2(min𝒘DLλ(𝒘)min𝒘pLλ(𝒘)+5ϵ)+o(1).\mathbb{P}\left(\min_{\bm{w}\in D}\mathcal{C}_{\lambda}(\bm{w})\leq\min_{\bm{w}\in\mathbb{R}^{p}}\mathcal{C}_{\lambda}(\bm{w})+\epsilon\right)\leq 2\mathbb{P}\left(\min_{\bm{w}\in D}L_{\lambda}(\bm{w})\leq\min_{\bm{w}\in\mathbb{R}^{p}}L_{\lambda}(\bm{w})+5\epsilon\right)+o(1).
Proof.

Recall from our remark in the proof of Lemma E.7 that (miolane2021distribution, Section C.1.1) established this result for the case of the Lasso and Gaussian designs/error distributions. In their proof, their Corollaries 5.1 and B.1 played a crucial role. Now note that (miolane2021distribution, Corollary B.1) requires concentration of ϵ\bm{\epsilon}, which also follows if ϵ\bm{\epsilon} is sub-Gaussian instead of Gaussian. Thus we are left to generalize their Corollary 5.1 (the analogue of this is Corollary E.1 in our paper) to the case of the ridge under our non-Gaussian design/error setting of Assumptions 1 and 2. We achieve this via Lemma I.6. With these modifications in place, a proof analogous to (miolane2021distribution, Section C.1.1) still works here. Note that the extra 2ϵ2\epsilon on the RHS comes from the extra 2ϵ2\epsilon in Lemma I.6. ∎

Lemma I.8 (Replacing Lemma E.8).

There exists a constant γ>0\gamma>0 such that for all ϵ(0,1]\epsilon\in(0,1],

\sup_{\lambda\in[\lambda_{\min},\lambda_{\max}]}\mathbb{P}\left(\exists\bm{b}\in\mathbb{R}^{p},\ \|\bm{b}-\hat{\bm{\beta}}^{f}(\lambda)\|_{2}^{2}>\epsilon\text{ and }\|\bm{b}-\bm{\beta}\|_{\infty}\leq L_{p}/\sqrt{n}\ \ \text{and}\ \ \mathcal{L}_{\lambda}(\bm{b})\leq\min\mathcal{L}_{\lambda}+\gamma\epsilon\right)=o(\epsilon).
Proof.

This is a direct corollary of Lemma I.5 and Lemma I.7, with 𝒘(λ)=𝜷^f(λ)𝜷\bm{w}(\lambda)=\hat{\bm{\beta}}^{f}(\lambda)-\bm{\beta} from (40). ∎

I.2.4 Uniform Control over λ\lambda

Lemma I.9 (Re-stating Lemma E.9).

There exists constant KK such that

(λ,λ[λmin,λmax],λ(𝜷^(λ))λ(𝜷^(λ))+K|λλ|)=1o(1).\mathbb{P}\left(\forall\lambda,\lambda^{\prime}\in[\lambda_{\min},\lambda_{\max}],\mathcal{L}_{\lambda^{\prime}}(\hat{\bm{\beta}}(\lambda))\leq\mathcal{L}_{\lambda^{\prime}}(\hat{\bm{\beta}}(\lambda^{\prime}))+K|\lambda-\lambda^{\prime}|\right)=1-o(1).
Proof.

The only difference in the proofs of Lemma I.9 and E.9 lies in the fact that 𝒛\bm{z} is now independent sub-Gaussian instead of i.i.d. Gaussian. However, since the entries of 𝒛\bm{z} still have mean 0 and variance 11, we still have 𝒛22n\|\bm{z}\|_{2}\leq 2\sqrt{n} with high probability. The rest of the proof follows directly. ∎

Note that Lemma D.12 is equivalent to Lemma E.9 so universality follows directly from the above. Lemma E.10 remains as is since it does not involve 𝑿\bm{X} or ϵ\bm{\epsilon}. We are now ready to prove Lemma I.2.

Proof of Lemma I.2.

Let γ>0\gamma>0 be as given by Lemma I.8, let K>0K>0 be as given by Lemma I.9, and let M>0M>0 be as given by Lemma E.10.

Fix ϵ(0,1]\epsilon\in(0,1] and define ϵ=min(γϵK,ϵM)\epsilon^{\prime}=\min\left(\frac{\gamma\epsilon}{K},\frac{\sqrt{\epsilon}}{M}\right). Let k=(λmaxλmin)/ϵk=\lceil(\lambda_{\max}-\lambda_{\min})/\epsilon^{\prime}\rceil. Further define λi=λmin+iϵ\lambda_{i}=\lambda_{\min}+i\epsilon^{\prime} for i=0,,ki=0,...,k. By Lemma I.8, the event

{i{1,,k},𝒃p,λi(𝒃)minλi+γϵ𝒃𝜷^f(λi)22ϵ or 𝒃𝜷>Lp/n}\left\{\forall i\in\{1,...,k\},\forall\bm{b}\in\mathbb{R}^{p},\mathcal{L}_{\lambda_{i}}(\bm{b})\leq\min\mathcal{L}_{\lambda_{i}}+\gamma\epsilon\Rightarrow\|\bm{b}-\hat{\bm{\beta}}^{f}(\lambda_{i})\|_{2}^{2}\leq\epsilon\text{ or }\|\bm{b}-\bm{\beta}\|_{\infty}>L_{p}/\sqrt{n}\right\} (56)

has probability at least 1ko(ϵ)=1o(1)1-ko(\epsilon)=1-o(1). Therefore, on the intersection of the event (56) and the event in Lemma I.9, which has probability 1o(1)1-o(1), we have for all λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}],

λi(𝜷^(λ))minλi+K|λλi|minλi+γϵ,\mathcal{L}_{\lambda_{i}}(\hat{\bm{\beta}}(\lambda))\leq\min\mathcal{L}_{\lambda_{i}}+K|\lambda-\lambda_{i}|\leq\min\mathcal{L}_{\lambda_{i}}+\gamma\epsilon,

where 1ik1\leq i\leq k is such that λ[λi1,λi]\lambda\in[\lambda_{i-1},\lambda_{i}]. This implies (since we are on the event (56)) that either 𝜷^(λ)𝜷^f(λi)22ϵ\|\hat{\bm{\beta}}(\lambda)-\hat{\bm{\beta}}^{f}(\lambda_{i})\|_{2}^{2}\leq\epsilon or 𝜷^(λ)𝜷>Lp/n\|\hat{\bm{\beta}}(\lambda)-\bm{\beta}\|_{\infty}>L_{p}/\sqrt{n}.

However, by Lemma I.4, we know that

𝜷^(λ)𝜷\displaystyle\|\hat{\bm{\beta}}(\lambda)-\bm{\beta}\|_{\infty}
\displaystyle\leq 𝜷^(λ)+𝜷\displaystyle\|\hat{\bm{\beta}}(\lambda)\|_{\infty}+\|\bm{\beta}\|_{\infty}
\displaystyle\leq K((logp)/n+2𝜷)+𝜷\displaystyle K\left(\sqrt{(\log p)/n}+2\|\bm{\beta}\|_{\infty}\right)+\|\bm{\beta}\|_{\infty}
\displaystyle\leq (K+1/2)((logp)/n+2𝜷)\displaystyle(K+1/2)\left(\sqrt{(\log p)/n}+2\|\bm{\beta}\|_{\infty}\right)
=\displaystyle= Lp/n\displaystyle L_{p}/\sqrt{n}

with probability 1o(1)1-o(1). Since the probability is over the randomness in 𝑿\bm{X} and ϵ\bm{\epsilon}, and KK depends only on λmax\lambda_{\max}, the argument holds for all λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}].

Therefore, with probability o(1)o(1) there exists λ\lambda such that 𝜷^(λ)𝜷>Lp/n\|\hat{\bm{\beta}}(\lambda)-\bm{\beta}\|_{\infty}>L_{p}/\sqrt{n}. Hence, still with probability 1o(1)1-o(1), we have 𝜷^(λ)𝜷^f(λi)22ϵ\|\hat{\bm{\beta}}(\lambda)-\hat{\bm{\beta}}^{f}(\lambda_{i})\|_{2}^{2}\leq\epsilon for all λ[λmin,λmax]\lambda\in[\lambda_{\min},\lambda_{\max}]. The rest of the proof remains the same as at the end of Section E.3.4. ∎
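The discretization device used in this proof (control the objective on a finite grid of tuning parameters and interpolate via Lipschitzness in λ) is easy to check numerically. The sketch below is an illustration only: the toy ridge objective, its dimensions, and the grid spacing are our own choices, not quantities from the paper.

```python
# Numerical sketch of the grid argument from Section I.2.4: if lambda -> L_lambda(b) is
# K-Lipschitz for a fixed b, then controlling L on a grid of spacing eps' controls it at
# every lambda up to an additive K * eps'.
import numpy as np

rng = np.random.default_rng(4)
n, p = 200, 100
X = rng.standard_normal((n, p))
y = X @ (rng.standard_normal(p) / np.sqrt(p)) + rng.standard_normal(n)
b = rng.standard_normal(p) / np.sqrt(p)              # an arbitrary fixed point b

def loss(bb, lam):
    # toy ridge objective; for fixed bb it is (||bb||^2 / 2)-Lipschitz in lambda
    return np.sum((y - X @ bb) ** 2) / (2 * n) + lam * np.sum(bb ** 2) / 2

lam_min, lam_max, eps_prime = 0.1, 2.0, 0.05
grid = np.arange(lam_min, lam_max + eps_prime, eps_prime)
K = np.sum(b ** 2) / 2

fine = np.linspace(lam_min, lam_max, 2000)
gaps = [abs(loss(b, lam) - loss(b, grid[np.argmin(np.abs(grid - lam))])) for lam in fine]
print("max gap over a fine lambda grid:", max(gaps))
print("K * eps' bound                 :", K * eps_prime)
```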

I.3 Replacing Lemma D.2

The ridge case is rather straightforward: (knowles2017anisotropic, Theorem 3.7) only requires mean zero, variance one, and independence, together with uniformly bounded moments of all orders for the entries of 𝑿\bm{X}, all of which are satisfied under Assumptions 1 and 2 (in particular, uniform sub-Gaussianity implies uniformly bounded moments of all orders). As for the Lasso case, we can modify the proof of (miolane2021distribution, Theorem F.1) in the same way as we modified Section E.3 in Section I.2. We omit the details here for conciseness.

I.4 Replacing Lemma D.8

The supporting Lemma E.1 has already been replaced by Lemma I.2. For the ridge case, the Lipschitz argument needs no change. For the Lasso case, the version of Lemma I.2 for the α\alpha-smoothed Lasso extends similarly, and the subsequent Lipschitz argument requires no change. Finally, (31) relies critically on (celentano2020lasso, Lemma B.9). The corresponding proof in (celentano2020lasso, Lemma B.5.4) has been elaborated in our proof of Lemma G.3, for which we articulate the replacement in Section I.6.

I.5 Replacing Lemma D.11

Lemma I.10 (Replacing Lemma D.11).

Recall the definition of 𝐠^L\hat{\bm{g}}_{\textrm{L}} from (26), and recall that Lp=K(logp+2n𝛃)L_{p}=K(\sqrt{\log p}+2\sqrt{n}\|\bm{\beta}\|_{\infty}) where KK is a constant that depends only on λmax\lambda_{\max}. Under Assumptions 1 and 2, for any ϵ>0\epsilon>0, there exists a constant CC such that for any 11-Lipschitz function ϕw:p\phi_{w}:\mathbb{R}^{p}\rightarrow\mathbb{R},

supλL,λR[λmin,λmax]\displaystyle\sup_{\lambda_{\textrm{L}},\lambda_{\textrm{R}}\in[\lambda_{\min},\lambda_{\max}]} (𝒘p,|ϕw(𝒘)𝔼[ϕw(𝜷^Rf𝜷)|𝒈Lf=𝒈^L]|ϵ\displaystyle\mathbb{P}\biggl{(}\exists\bm{w}\in\mathbb{R}^{p},\left|\phi_{w}(\bm{w})-\mathbb{E}[\phi_{w}(\hat{\bm{\beta}}_{\textrm{R}}^{f}-\bm{\beta})|\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{\textrm{L}}]\right|\geq\epsilon
and 𝒘Lp/nand𝒞λR(𝒘)min𝒞λR+Cϵ2)=o(ϵ2).\displaystyle\text{ and }\|\bm{w}\|_{\infty}\leq L_{p}/\sqrt{n}\ \text{and}\ \mathcal{C}_{\lambda_{\textrm{R}}}(\bm{w})\leq\min\mathcal{C}_{\lambda_{\textrm{R}}}+C\epsilon^{2}\biggr{)}=o(\epsilon^{2}).

We introduce another supporting Lemma.

Lemma I.11.

Consider Assumptions 1 and 2, and further assume 𝐆n×p\bm{G}\in\mathbb{R}^{n\times p} has i.i.d. 𝒩(0,1)\mathcal{N}(0,1) entries. Denote C(𝐰,𝐮)=1n𝐰𝐗𝐮+f(𝐰,𝐮)C(\bm{w},\bm{u})=\frac{1}{n}\bm{w}^{\top}\bm{X}\bm{u}+f(\bm{w},\bm{u}) where ff is convex-concave. For any set Dp[Lp/n,Lp/n]pD_{p}\subset[-L_{p}/\sqrt{n},L_{p}/\sqrt{n}]^{p} with Lp=K(logp+2n𝛃)L_{p}=K(\sqrt{\log p}+2\sqrt{n}\|\bm{\beta}\|_{\infty}) where KK is a constant that only depends on λmax\lambda_{\max}, any t,ϵ>0t\in\mathbb{R},\epsilon>0, we have

(max𝒖nDpmin𝒘DpC(𝒘,𝒖)>t+3ϵ)(max𝒖nDpmin𝒘DpC(𝒘,𝒖;𝑮)>t+ϵ)+o(1),\displaystyle\mathbb{P}\left(\max_{\bm{u}\in\sqrt{n}D_{p}}\min_{\bm{w}\in D_{p}}C(\bm{w},\bm{u})>t+3\epsilon\right)\leq\mathbb{P}\left(\max_{\bm{u}\in\sqrt{n}D_{p}}\min_{\bm{w}\in D_{p}}C(\bm{w},\bm{u};\bm{G})>t+\epsilon\right)+o(1),
(max𝒖nDpmin𝒘DpC(𝒘,𝒖)<t3ϵ)(max𝒖nDpmin𝒘DpC(𝒘,𝒖;𝑮)<tϵ)+o(1),\displaystyle\mathbb{P}\left(\max_{\bm{u}\in\sqrt{n}D_{p}}\min_{\bm{w}\in D_{p}}C(\bm{w},\bm{u})<t-3\epsilon\right)\leq\mathbb{P}\left(\max_{\bm{u}\in\sqrt{n}D_{p}}\min_{\bm{w}\in D_{p}}C(\bm{w},\bm{u};\bm{G})<t-\epsilon\right)+o(1),

where C(𝐰,𝐮;𝐆)C(\bm{w},\bm{u};\bm{G}) represents C(𝐰,𝐮)C(\bm{w},\bm{u}) with 𝐗\bm{X} replaced by 𝐆\bm{G}.

Proof.

This is a consequence of (han2022universality, Corollary 2.6), with necessary modifications similar to what we performed in the proof of Lemma I.3. Again notice the adjusted scaling in our setting. ∎

Now we prove Lemma I.10.

Proof of Lemma I.10.

Most of the proof follows (celentano2021cad, Lemma F.4). In their proof, they first showed their Lemma F.2 (Conditional Gordon's Inequality), which states that \min_{\bm{w}\in E_{w}}\max_{\bm{u}\in E_{u}}c(\bm{w},\bm{u}) (as defined in (37), leaving the dependence on λ\lambda implicit) concentrates around \min_{\bm{w}\in E_{w}}\max_{\bm{u}\in E_{u}}l_{\textrm{R}|\textrm{L}}(\bm{w},\bm{u}), the “conditional Auxiliary Optimization Problem” defined as l(𝒘,𝒖)l(\bm{w},\bm{u}) in (38) conditioned on 𝒈Lf=𝒈^L\bm{g}_{\textrm{L}}^{f}=\hat{\bm{g}}_{L}, which requires 𝝃g\bm{\xi}_{g} and 𝝃h\bm{\xi}_{h} to take the following values (see the derivation in (celentano2021cad, Section L)):

𝝃^g\displaystyle\hat{\bm{\xi}}_{g} =𝒈^L/τL\displaystyle=\hat{\bm{g}}_{\textrm{L}}/\tau_{\textrm{L}}
𝝃^h\displaystyle\hat{\bm{\xi}}_{h} =𝑿(𝜷^L𝜷)𝜷^L𝜷2+𝝃^g,𝜷^L𝜷𝒚𝑿𝜷^L2𝜷^L𝜷2(𝒚𝑿𝜷^L).\displaystyle=-\frac{\bm{X}(\hat{\bm{\beta}}_{\textrm{L}}-\bm{\beta})}{\|\hat{\bm{\beta}}_{\textrm{L}}-\bm{\beta}\|_{2}}+\frac{\langle\hat{\bm{\xi}}_{g},\hat{\bm{\beta}}_{\textrm{L}}-\bm{\beta}\rangle}{\|\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}}\|_{2}\|\hat{\bm{\beta}}_{\textrm{L}}-\bm{\beta}\|_{2}}(\bm{y}-\bm{X}\hat{\bm{\beta}}_{\textrm{L}}).

Then, they showed the concentration of \min_{\bm{w}\in E_{w}}\max_{\bm{u}\in E_{u}}l_{\textrm{R}|\textrm{L}}(\bm{w},\bm{u}) around its asymptotic limit. Here Eu,EwE_{u},E_{w} denote arbitrary closed convex sets.

We make modifications to their proof as follows:

  1. 1.

    Similar to how we proved Lemma I.6 by invoking Lemma I.3, we can also modify their Lemma F.2 by invoking Lemma I.11, and thus establish that it holds under our general Assumptions 1 and 2.

  2. 2.

    For proving that \min_{\bm{w}\in E_{w}}\max_{\bm{u}\in E_{u}}l_{\textrm{R}|\textrm{L}}(\bm{w},\bm{u}) concentrates around its asymptotic limit, no modification is necessary. This is because lR|L(𝒘,𝒖)l_{\textrm{R}|\textrm{L}}(\bm{w},\bm{u}) no longer involves 𝑿\bm{X}, and because replacing the i.i.d. Gaussianity of ϵ\bm{\epsilon} by independent uniform sub-Gaussianity preserves its concentration properties (with possibly different constants).

Now with Lemma I.10 replacing Lemma D.11, the rest of the proof of Lemma D.6 in Section D.2.3 naturally extends to the universality version under Assumptions 1 and 2 (similar to how we modified Section E.3.4 to Section I.2.4), since the probability that there exists λ\lambda such that 𝜷^(λ)𝜷>Lp/n\|\hat{\bm{\beta}}(\lambda)-\bm{\beta}\|_{\infty}>L_{p}/\sqrt{n} is o(1)o(1).

I.6 Replacing Lemmas G.2-G.5

For Lemma G.2, we critically used the fact that 𝜷^2\|\hat{\bm{\beta}}\|_{2} and 𝜷^(𝑨)2\|\hat{\bm{\beta}}(\bm{A})\|_{2} are bounded with probability 1o(1)1-o(1). The former is guaranteed by applying Lemma H.3 together with Lemma I.2. The latter is stated in Assumption 3(1).

For Lemma G.3, consider the critical event

𝒜:={1nκ(𝑿,n(1ζ/4))κmin}{1n𝑿opC}{1n#{j:|t^j|1Δ/2}1ζ/2}.\mathcal{A}:=\left\{\frac{1}{\sqrt{n}}\kappa_{-}(\bm{X},n(1-\zeta/4))\geq\kappa_{\min}\right\}\cap\left\{\frac{1}{\sqrt{n}}\|\bm{X}\|_{\textnormal{op}}\leq C\right\}\cap\left\{\frac{1}{n}\#\{j:|\hat{t}_{j}|\geq 1-\Delta/2\}\leq 1-\zeta/2\right\}.

The first event holds with high probability by Assumption 1(2), and the second event holds with high probability by Corollary I.1. Finally, (miolane2021distribution, Theorem E.5) can be extended to the universality version in the same way as we extended Section E.3 in Section I.2; as a direct corollary, the third event also holds with high probability.

For Lemma G.4, we need to extend the critical events (49), (50), and (51). Event (49) follows naturally from the proof of the extended Lemma G.2. Event (50) is a corollary of (miolane2021distribution, Theorem F.4), which again can be extended to the universality version in the same way as we extended Section E.3 in Section I.2. Lastly, (51) is stated in Assumption 3(2).

For Lemma G.5, consider the critical events (53), (54), and (55). The proof of (53) remains unchanged (given Corollary I.1). Event (54) is a corollary of (miolane2021distribution, Lemma E.9), which can be similarly extended. The proof of (55) remains unchanged as well.