
Sign Consistency of the Generalized Elastic Net Estimator

Wencan Zhu, Eric Adjakossa, Céline Lévy-Leduc, Nils Ternès
Abstract

In this paper, we propose a novel variable selection approach in the framework of high-dimensional linear models where the columns of the design matrix are highly correlated. It consists of rewriting the initial high-dimensional linear model in order to remove the correlation between the columns of the design matrix, and of applying a generalized Elastic Net criterion, which can be seen as an extension of the generalized Lasso. The properties of our approach, called gEN (generalized Elastic Net), are investigated both from a theoretical and a numerical point of view. More precisely, we provide a new condition, called GIC (Generalized Irrepresentable Condition), which generalizes the EIC (Elastic Net Irrepresentable Condition) of [1], and under which we prove that our estimator can recover the positions of the null and non null entries of the coefficients when the sample size tends to infinity. We also assess the performance of our methodology using synthetic data and compare it with alternative approaches. Our numerical experiments show that our approach improves the variable selection performance in many cases.

Key words: Lasso; Model selection consistency; Irrepresentable Condition; Generalized Lasso; Elastic Net.

1 Introduction

Variable selection has become an important and widely used tool for understanding or predicting an outcome of interest in many fields such as medicine [2, 3, 4, 5], social media [6, 7, 8], or finance [9, 10, 11]. Over the past decades, numerous variable selection methods have been developed, such as subset selection [12] or regularization techniques [13].

Subset selection methods achieve sparsity by selecting the best subset of relevant variables using the Akaike information criterion [14] or the Bayesian information criterion [15], but they are known to be NP-hard and can be unstable in practice [16, 17].

Regularized variable selection techniques have become popular for their ability to overcome these difficulties [18, 19, 20, 21]. Among them, the Lasso [18] is one of the most popular and can be defined as follows. Let $\mathbf{y}$ satisfy the following linear model:

\mathbf{y}=\mathbf{X}\boldsymbol{\beta}^{\star}+\boldsymbol{\epsilon}, (1)

where $\mathbf{y}=(y_{1},\ldots,y_{n})^{\prime}\in\mathbb{R}^{n}$ is the response variable, $^{\prime}$ denoting transposition, $\mathbf{X}=(\mathbf{X}_{1},\ldots,\mathbf{X}_{p})$ is the design matrix with $n$ rows of observations on $p$ covariates, $\boldsymbol{\beta}^{\star}=(\beta_{1}^{\star},\ldots,\beta_{p}^{\star})^{\prime}\in\mathbb{R}^{p}$ is a sparse vector, namely a vector containing many null components, and $\boldsymbol{\epsilon}$ is a zero-mean Gaussian vector with covariance matrix equal to $\sigma^{2}\mathbb{I}_{n}$, $\mathbb{I}_{n}$ denoting the identity matrix of $\mathbb{R}^{n}$. The Lasso estimates $\boldsymbol{\beta}^{\star}$ with a sparsity-enforcing constraint by minimizing the following penalized least-squares criterion:

L^{Lasso}_{\lambda}(\boldsymbol{\beta})=\left\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\right\rVert_{2}^{2}+\lambda\left\lVert\boldsymbol{\beta}\right\rVert_{1}, (2)

where $\left\lVert a\right\rVert_{1}=\sum_{k=1}^{p}|a_{k}|$ denotes the $\ell_{1}$ norm of the vector $a=(a_{1},\dots,a_{p})^{\prime}$, $\left\lVert b\right\rVert_{2}^{2}=\sum_{k=1}^{n}b_{k}^{2}$ denotes the squared $\ell_{2}$ norm of the vector $b=(b_{1},\dots,b_{n})^{\prime}$, and $\lambda$ is a positive constant corresponding to the regularization parameter. The popularity of the Lasso largely comes from the fact that the resulting estimator

\widehat{\boldsymbol{\beta}}^{Lasso}(\lambda)=\operatorname*{Argmin}_{\boldsymbol{\beta}\in\mathbb{R}^{p}}L^{Lasso}_{\lambda}(\boldsymbol{\beta})

is sparse (it has only a few nonzero entries), and sparse models are often preferred for their interpretability [22]. Moreover, $\widehat{\boldsymbol{\beta}}^{Lasso}(\lambda)$ can be proved to be sign consistent under some assumptions, namely there exists $\lambda$ such that

\lim_{n\to\infty}\mathbb{P}\left(\textrm{sign}\left(\widehat{\boldsymbol{\beta}}^{Lasso}(\lambda)\right)=\textrm{sign}(\boldsymbol{\beta}^{\star})\right)=1,

where $\textrm{sign}(x)=1$ if $x>0$, $-1$ if $x<0$ and $0$ if $x=0$. Before giving the conditions under which [22] prove the sign consistency of $\widehat{\boldsymbol{\beta}}^{Lasso}$, we first introduce some notation. Without loss of generality, we shall assume, as in [22], that the first $q$ components of $\boldsymbol{\beta}^{\star}$ are non null (i.e. the components associated to the active variables, denoted by $\boldsymbol{\beta}_{1}^{\star}$) and that the last $p-q$ components of $\boldsymbol{\beta}^{\star}$ are null (i.e. the components associated to the non active variables, denoted by $\boldsymbol{\beta}_{2}^{\star}$). Moreover, we shall denote by $\mathbf{X}_{1}$ (resp. $\mathbf{X}_{2}$) the first $q$ (resp. the last $p-q$) columns of $\mathbf{X}$. Hence, $C_{n}=n^{-1}\mathbf{X}^{\prime}\mathbf{X}$, the empirical covariance matrix of the covariates, can be rewritten as follows:

C_{n}=\begin{bmatrix}C_{11}^{n}&C_{12}^{n}\\ C_{21}^{n}&C_{22}^{n}\end{bmatrix},

with $C_{11}^{n}=n^{-1}\mathbf{X}_{1}^{\prime}\mathbf{X}_{1}$, $C_{12}^{n}=n^{-1}\mathbf{X}_{1}^{\prime}\mathbf{X}_{2}$, $C_{21}^{n}=n^{-1}\mathbf{X}_{2}^{\prime}\mathbf{X}_{1}$ and $C_{22}^{n}=n^{-1}\mathbf{X}_{2}^{\prime}\mathbf{X}_{2}$. Zhao and Yu prove in [22] that $\widehat{\boldsymbol{\beta}}^{Lasso}(\lambda)$ is sign consistent when the following Irrepresentable Condition (IC) is satisfied:

\left|\left(C_{21}^{n}(C_{11}^{n})^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right|\leq 1-\alpha,\textrm{ for all }j, (3)

where $\alpha$ is a positive constant. In the case where $p\gg n$, Wainwright establishes in [23] necessary and sufficient conditions on $p$, $q$ and $n$, for both deterministic and random designs, under which it is possible to recover the positions of the null and non null components of $\boldsymbol{\beta}^{\star}$, namely its support, using the Lasso.
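As a quick empirical illustration of this property, the following Python sketch fits the Lasso on a simulated well-conditioned design and checks whether the sign pattern of $\boldsymbol{\beta}^{\star}$ is recovered (a minimal sketch: scikit-learn's Lasso minimizes $(2n)^{-1}\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert_{2}^{2}+\alpha\lVert\boldsymbol{\beta}\rVert_{1}$, so $\alpha$ corresponds to $\lambda/(2n)$ in (2), and the chosen $\lambda$ is only a classical order of magnitude, not a tuned value):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, q = 500, 50, 5
X = rng.standard_normal((n, p))                    # uncorrelated columns: IC holds
beta_star = np.concatenate([np.ones(q), np.zeros(p - q)])
y = X @ beta_star + rng.standard_normal(n)         # Model (1) with sigma = 1

lam = 2 * np.sqrt(n * np.log(p))                   # classical order for lambda
fit = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(X, y)
print(np.array_equal(np.sign(fit.coef_), np.sign(beta_star)))  # sign recovery?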

When there are high correlations between the covariates, especially between the active ones, the matrix $C_{11}^{n}$ may not be invertible and the Lasso estimator fails to be sign consistent. To circumvent this issue, Zou and Hastie [20] introduced the Elastic Net estimator defined by:

\widehat{\boldsymbol{\beta}}^{EN}(\lambda,\eta)=\operatorname*{Argmin}_{\boldsymbol{\beta}\in\mathbb{R}^{p}}L^{EN}_{\lambda,\eta}(\boldsymbol{\beta}), (4)

where

L^{EN}_{\lambda,\eta}(\boldsymbol{\beta})=\left\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\right\rVert_{2}^{2}+\lambda\left\lVert\boldsymbol{\beta}\right\rVert_{1}+\eta\left\lVert\boldsymbol{\beta}\right\rVert_{2}^{2}\textrm{ with }\lambda,\eta>0.

Yuan and Lin prove in [24] that, when $p$ and $q$ are fixed, the Elastic Net estimator defined by (4) is sign consistent if the following Elastic Net Irrepresentable Condition (EIC) is satisfied: there exist positive $\lambda$ and $\eta$ such that

\left|\left(C_{21}^{n}\left(C_{11}^{n}+\dfrac{\eta}{n}\mathbb{I}_{q}\right)^{-1}\left(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\dfrac{2\eta}{\lambda}\boldsymbol{\beta}_{1}^{\star}\right)\right)_{j}\right|\leq 1-\alpha,\textrm{ for all }j. (5)

Moreover, when $p$, $q$ and $n$ go to infinity with $p\gg n$, Jia and Yu prove in [1] that the sign consistency of the Elastic Net estimator holds if, in addition to Condition (5), $n$ goes to infinity at a rate faster than $q\log(p-q)$.

In the case where the active and non active covariates are highly correlated, IC (3) and EIC (5) may be violated. Several approaches have been proposed to overcome this issue, among which the Standard PArtial Covariance (SPAC) method [25] and preconditioning approaches. Xue and Qu [25] developed the so-called SPAC-Lasso, which enjoys strong sign consistency in both the finite-dimensional ($p<n$) and the high-dimensional ($p\gg n$) settings. However, the authors mention that SPAC-Lasso only selects the active variables that are not highly correlated with the non active ones, which may be a weakness of this approach. Preconditioning approaches consist in transforming the data $\mathbf{X}$ and $\mathbf{y}$ before applying the Lasso criterion. For example, [26] and [27] proposed to left-multiply $\mathbf{X}$, $\mathbf{y}$, and thus $\boldsymbol{\epsilon}$, in Model (1) by specific matrices in order to remove the correlations between the columns of $\mathbf{X}$. A major drawback of the latter approach, called HOLP (High dimensional Ordinary Least squares Projection), is that the preconditioning step may increase the variance of the error term and thus alter the variable selection performance.

Recently, [5] proposed another strategy under the following assumption:

  1. (A1)

    $\mathbf{X}$ is assumed to be a random design matrix whose rows $(\boldsymbol{x}_{i})_{1\leq i\leq n}$ are i.i.d. zero-mean Gaussian random vectors with covariance matrix equal to $\boldsymbol{\Sigma}$.

More precisely, they propose to rewrite Model (1) in order to remove the correlation existing between the columns of $\mathbf{X}$. Let $\boldsymbol{\Sigma}^{-1/2}:=\boldsymbol{U}\boldsymbol{D}^{-1/2}\boldsymbol{U}^{T}$, where $\boldsymbol{U}$ and $\boldsymbol{D}$ are the matrices involved in the spectral decomposition of the symmetric matrix $\boldsymbol{\Sigma}$ given by $\boldsymbol{\Sigma}=\boldsymbol{U}\boldsymbol{D}\boldsymbol{U}^{T}$. Then, denoting $\widetilde{\mathbf{X}}=\mathbf{X}\boldsymbol{\Sigma}^{-1/2}$, (1) can be rewritten as follows:

\mathbf{y}=\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\beta}}^{\star}+\boldsymbol{\epsilon}, (6)

where $\widetilde{\boldsymbol{\beta}}^{\star}=\boldsymbol{\Sigma}^{1/2}\boldsymbol{\beta}^{\star}:=\boldsymbol{U}\boldsymbol{D}^{1/2}\boldsymbol{U}^{T}\boldsymbol{\beta}^{\star}$. With such a transformation, the covariance matrix of the $n$ rows of $\widetilde{\mathbf{X}}$ is equal to the identity and the columns of $\widetilde{\mathbf{X}}$ are thus uncorrelated. The advantage of this transformation over the preconditioning approach proposed by [27] is that the error term $\boldsymbol{\epsilon}$ is not modified, thus avoiding an increase of the noise which could overwhelm the benefits of a well-conditioned design matrix. Their approach then consists in minimizing the following criterion with respect to $\widetilde{\boldsymbol{\beta}}$:

\left\lVert\mathbf{y}-\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}+\lambda\left\lVert\boldsymbol{\Sigma}^{-1/2}\widetilde{\boldsymbol{\beta}}\right\rVert_{1}, (7)

where $\widetilde{\mathbf{X}}=\mathbf{X}\boldsymbol{\Sigma}^{-1/2}$, in order to ensure a sparse estimation of $\boldsymbol{\beta}^{\star}$ thanks to the penalization by the $\ell_{1}$ norm. This criterion actually boils down to the generalized Lasso proposed by [28]:

L_{\lambda}^{genlasso}(\widetilde{\boldsymbol{\beta}})=\left\lVert\mathbf{y}-\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}+\lambda\left\lVert D\widetilde{\boldsymbol{\beta}}\right\rVert_{1},\textrm{ with }\lambda>0, (8)

and $D=\boldsymbol{\Sigma}^{-1/2}$.
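A minimal Python sketch of this whitening step (assuming $\boldsymbol{\Sigma}$ is known, as in (A1); the function name is ours) is given below; it builds $\boldsymbol{\Sigma}^{-1/2}$ from the spectral decomposition of $\boldsymbol{\Sigma}$:

import numpy as np

def whiten(X, Sigma):
    # Sigma = U D U^T (spectral decomposition of the symmetric matrix Sigma)
    eigval, U = np.linalg.eigh(Sigma)
    Sigma_inv_half = U @ np.diag(eigval ** -0.5) @ U.T   # Sigma^{-1/2}
    return X @ Sigma_inv_half                            # X_tilde = X Sigma^{-1/2}

# Each row of X_tilde is N(0, Sigma^{-1/2} Sigma Sigma^{-1/2}) = N(0, I_p),
# so the columns of X_tilde are uncorrelated, while epsilon is left untouched.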

Since, as explained in [28], some problems may occur when the design matrix is not of full rank, we will consider in this paper the following criterion:

L_{\lambda,\eta}^{gEN}(\widetilde{\boldsymbol{\beta}})=\left\lVert\mathbf{y}-\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}+\lambda\left\lVert\boldsymbol{\Sigma}^{-1/2}\widetilde{\boldsymbol{\beta}}\right\rVert_{1}+\eta\left\lVert\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2},\textrm{ with }\lambda,\eta>0. (9)

Since it consists in adding an $\ell_{2}$ penalty to the generalized Lasso, as in the Elastic Net, we will call it the generalized Elastic Net (gEN). We prove in Section 2 that, under Assumption (A1) and, among others, the Generalized Irrepresentable Condition (GIC) (12) given below, $\widehat{\boldsymbol{\beta}}$ is a sign-consistent estimator of $\boldsymbol{\beta}^{\star}$, where $\widehat{\boldsymbol{\beta}}$ is defined by

\widehat{\boldsymbol{\beta}}=\boldsymbol{\Sigma}^{-1/2}\widehat{\widetilde{\boldsymbol{\beta}}}, (10)

with

\widehat{\widetilde{\boldsymbol{\beta}}}=\operatorname*{Argmin}_{\widetilde{\boldsymbol{\beta}}}L_{\lambda,\eta}^{gEN}\left(\widetilde{\boldsymbol{\beta}}\right), (11)

$L_{\lambda,\eta}^{gEN}(\widetilde{\boldsymbol{\beta}})$ being defined in Equation (9). The Generalized Irrepresentable Condition (GIC) can be stated as follows: there exist $\lambda,\eta,\alpha,\delta_{4}>0$ such that, for all $j$,

\mathbb{P}\left(\left|\left(\left(C_{21}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{21}\right)\left(C_{11}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{11}\right)^{-1}\left(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right|\leq 1-\alpha\right)=1-o\left(e^{-n^{\delta_{4}}}\right). (12)

Note that GIC coincides with EIC when $\mathbf{X}$ is not random and $\boldsymbol{\Sigma}=\mathbb{I}_{p}$. Moreover, GIC does not require $C_{11}^{n}$ to be invertible. Since IC and EIC are both particular cases of GIC, if IC or EIC holds, then there exist $\lambda$ and $\eta$ such that GIC holds.
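Although the optimization of (9) is not discussed at this stage, note that, writing $\boldsymbol{\beta}=\boldsymbol{\Sigma}^{-1/2}\widetilde{\boldsymbol{\beta}}$, criterion (9) becomes $\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\rVert_{2}^{2}+\lambda\lVert\boldsymbol{\beta}\rVert_{1}+\eta\,\boldsymbol{\beta}^{\prime}\boldsymbol{\Sigma}\boldsymbol{\beta}$, i.e. a Lasso on ridge-augmented data, so that $\widehat{\boldsymbol{\beta}}$ in (10) can be computed with off-the-shelf solvers. The following Python sketch illustrates this (a minimal illustration, not the authors' implementation; scikit-learn's Lasso minimizes $(2m)^{-1}\lVert\cdot\rVert_{2}^{2}+\alpha\lVert\cdot\rVert_{1}$ over $m$ samples, hence the rescaling of $\lambda$):

import numpy as np
from sklearn.linear_model import Lasso

def gen_elastic_net(X, y, Sigma, lam, eta):
    n, p = X.shape
    eigval, U = np.linalg.eigh(Sigma)
    Sigma_half = U @ np.diag(np.sqrt(eigval)) @ U.T      # Sigma^{1/2}
    X_aug = np.vstack([X, np.sqrt(eta) * Sigma_half])    # (n + p) x p design
    y_aug = np.concatenate([y, np.zeros(p)])             # ridge rows have target 0
    # ||y_aug - X_aug b||_2^2 + lam ||b||_1 matches criterion (9) in the
    # beta = Sigma^{-1/2} beta_tilde parametrization
    fit = Lasso(alpha=lam / (2 * (n + p)), fit_intercept=False).fit(X_aug, y_aug)
    return fit.coef_                                     # beta_hat of (10)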

The rest of the paper is organized as follows. Section 2 is devoted to the theoretical results of the paper. More precisely, we prove that, under some mild conditions, $\widehat{\boldsymbol{\beta}}$ defined in (10) is a sign-consistent estimator of $\boldsymbol{\beta}^{\star}$. To support our theoretical results, some numerical experiments are presented in Section 3. The proofs of our theoretical results can be found in Section 5.

2 Theoretical results

The goal of this section is to establish the sign consistency of the Generalized Elastic Net estimator defined in (10). To prove this result, we shall use the following lemma.

Lemma 2.1.

Let $\mathbf{y}$ satisfy Model (1) under Assumption (A1) and let $\widehat{\boldsymbol{\beta}}$ be defined by (10). Then,

\mathbb{P}\left(\textrm{sign}\left(\widehat{\boldsymbol{\beta}}\right)=\textrm{sign}(\boldsymbol{\beta}^{\star})\right)\geq\mathbb{P}\left(A_{n}\cap B_{n}\right), (13)

where

A_{n}:=\left\{\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)\right|<\sqrt{n}\left(\left|\boldsymbol{\beta}_{1}^{\star}\right|-\frac{\lambda}{2n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right|-\frac{\eta}{n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right|\right)\right\},

B_{n}:=\left\{\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)-W_{n}(2)\right|\leq\frac{\lambda}{2\sqrt{n}}-\frac{\lambda}{2\sqrt{n}}\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right|\right\},

and

\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}=C_{11}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{11},\quad\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}=C_{21}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{21},\quad W_{n}=\dfrac{1}{\sqrt{n}}\mathbf{X}^{\prime}\boldsymbol{\epsilon}=\begin{bmatrix}W_{n}(1)\\ W_{n}(2)\end{bmatrix}, (14)

with

W_{n}(1)=\dfrac{1}{\sqrt{n}}\mathbf{X}^{\prime}_{1}\boldsymbol{\epsilon}\textrm{ and }W_{n}(2)=\dfrac{1}{\sqrt{n}}\mathbf{X}^{\prime}_{2}\boldsymbol{\epsilon}.

The proof of Lemma 2.1 is given in Section 5.

The following theorem gives the conditions under which the sign consistency of the generalized Elastic Net estimator $\widehat{\boldsymbol{\beta}}$ defined in (10) holds.

Theorem 2.2.

Assume that $\mathbf{y}$ satisfies Model (1) under Assumption (A1), where $p=p_{n}$ is such that $p_{n}\exp\left(-n^{\delta}\right)$ tends to 0 as $n$ tends to infinity for all positive $\delta$. Assume also that there exist some positive constants $M_{1}$, $M_{2}$, $M_{3}$ and $\alpha$ satisfying

M_{1}<\frac{\boldsymbol{\beta}_{\min}^{2}}{9\sigma^{2}}\quad\textrm{ and }\quad\dfrac{\sqrt{2+\sqrt{2}}\sqrt{M_{3}}\,\sigma}{\alpha}<\dfrac{\boldsymbol{\beta}_{\min}}{3M_{2}\sqrt{q}}, (15)

and that there exist $\lambda>0$ and $\eta>0$ such that (12) and

\dfrac{\lambda}{n}<\dfrac{2\boldsymbol{\beta}_{\min}}{3M_{2}\sqrt{q}}, (16)
\dfrac{\lambda}{n}\geq\dfrac{2\sqrt{2+\sqrt{2}}\sqrt{M_{3}}\,\sigma}{\alpha}, (17)
\dfrac{\eta}{n}<\dfrac{1}{3M_{2}\lambda_{\max}\left(\boldsymbol{\Sigma}_{11}\right)}\times\dfrac{\boldsymbol{\beta}_{\min}}{\left\|\boldsymbol{\beta}_{1}^{\star}\right\|_{2}}, (18)

hold as $n$ tends to infinity, where $\boldsymbol{\beta}_{\min}=\min_{1\leq j\leq q}\left|\left(\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right|$. Suppose also that there exist some positive constants $\delta_{1}$, $\delta_{2}$, $\delta_{3}$ such that, as $n\rightarrow\infty$,

\mathbb{P}\left(\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)\leq M_{1}\right)=1-o\left(e^{-n^{\delta_{1}}}\right), (19)
\mathbb{P}\left(\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\leq M_{2}\right)=1-o\left(e^{-n^{\delta_{2}}}\right), (20)
\mathbb{P}\left(\lambda_{\max}\left(H_{B}H^{\prime}_{B}\right)\leq M_{3}\right)=1-o\left(e^{-n^{\delta_{3}}}\right), (21)

where $\lambda_{\max}(A)$ denotes the largest eigenvalue of the matrix $A$,

H_{A}=\frac{1}{\sqrt{n}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\mathbf{X}^{\prime}_{1}\textrm{ and }H_{B}=\dfrac{1}{\sqrt{n}}\left(\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\mathbf{X}^{\prime}_{1}-\mathbf{X}^{\prime}_{2}\right),

$\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}$ and $\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}$ being defined in (14) and $\mathbf{X}_{1}$ (resp. $\mathbf{X}_{2}$) denoting the first $q$ (resp. the last $p-q$) columns of $\mathbf{X}$. Then,

\mathbb{P}\left(\textrm{sign}\left(\widehat{\boldsymbol{\beta}}\right)=\textrm{sign}(\boldsymbol{\beta}^{\star})\right)\rightarrow 1,\textrm{ as }n\rightarrow\infty,

where $\widehat{\boldsymbol{\beta}}$ is defined in (10).

Note that Conditions (16) and (17) are compatible with each other thanks to (15).

The proof of Theorem 2.2 is given in Section 5 and a discussion on the assumptions of Theorem 2.2 is provided in Section 3.

3 Numerical experiments

The goal of this section is to discuss the assumptions and to illustrate the results of Theorem 2.2. For this, we generated datasets from Model (1), where the matrix $\boldsymbol{\Sigma}$ appearing in (A1) is defined by

\boldsymbol{\Sigma}=\begin{bmatrix}\boldsymbol{\Sigma}_{11}&\boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{12}^{\prime}&\boldsymbol{\Sigma}_{22}\end{bmatrix}. (22)

In (22), $\boldsymbol{\Sigma}_{11}$ is the correlation matrix of the active variables, with off-diagonal entries equal to $\alpha_{1}$; $\boldsymbol{\Sigma}_{22}$ is the correlation matrix of the non active variables, with off-diagonal entries equal to $\alpha_{3}$; and $\boldsymbol{\Sigma}_{12}$ is the correlation matrix between the active and the non active variables, with entries equal to $\alpha_{2}$. In the numerical experiments, $(\alpha_{1},\alpha_{2},\alpha_{3})=(0.3,0.5,0.7)$. Moreover, $\boldsymbol{\beta}^{\star}$ appearing in Model (1) has $q$ non zero components, all equal to $b$, and $\sigma=1$. The number of predictors $p$ is equal to 200, 400, or 600, and the sample size $n$ takes the same values for each value of $p$.
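For reproducibility, a minimal Python sketch of this simulation design (the function and variable names below are ours) is:

import numpy as np

def simulate(n, p, q, b, a1=0.3, a2=0.5, a3=0.7, seed=0):
    # Build Sigma as in (22): blocks with constant correlations alpha_1, alpha_2, alpha_3
    rng = np.random.default_rng(seed)
    Sigma = np.empty((p, p))
    Sigma[:q, :q] = a1                 # active block (off-diagonal alpha_1)
    Sigma[q:, q:] = a3                 # non active block (off-diagonal alpha_3)
    Sigma[:q, q:] = a2                 # active / non active correlations alpha_2
    Sigma[q:, :q] = a2
    np.fill_diagonal(Sigma, 1.0)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # Assumption (A1)
    beta_star = np.concatenate([b * np.ones(q), np.zeros(p - q)])
    y = X @ beta_star + rng.standard_normal(n)                # sigma = 1
    return X, y, beta_star, Sigma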

3.1 Discussion on the assumptions of Theorem 2.2

We first show that the GIC defined in (12) can be satisfied even when the EIC and IC, defined in (5) and (3) respectively, are not fulfilled. For this, we computed, for different values of $\lambda$ and $\eta$, the following quantities:

\textrm{IC}=\max_{j}\left(\left|\left(C_{21}^{n}(C_{11}^{n})^{-1}\,\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right|\right)
\textrm{EIC}=\min_{\lambda,\eta}\max_{j}\left(\left|\left(C_{21}^{n}\left(C_{11}^{n}+\frac{\eta}{n}\mathbb{I}_{q}\right)^{-1}\left(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\beta}_{1}^{\star}\right)\right)_{j}\right|\right)
\textrm{GIC}=\min_{\lambda,\eta}\max_{j}\left(\left|\left(\left(C_{21}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{21}\right)\left(C_{11}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{11}\right)^{-1}\left(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right|\right) (23)

Figure 1 displays the boxplots of these criteria obtained from 100 replications. We can see from these boxplots that, in all the considered cases, the GIC is satisfied (i.e. all values are smaller than 1), whereas the EIC and IC are not. The values of $p$ and $n$ do not seem to have a big impact on the EIC and IC. However, contrary to $p$, $n$ seems to have an influence on the GIC, which increases with $n$ when $b=1$ and decreases with $n$ when $b=10$.
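A minimal Python sketch of these computations, for a single replication and a fixed pair $(\lambda,\eta)$ (the grid search over $(\lambda,\eta)$ achieving the min in (23) is omitted), could read:

import numpy as np

def ic_eic_gic(X, Sigma, beta_star, q, lam, eta):
    # Evaluate the three quantities of (23) for one (lambda, eta)
    n = X.shape[0]
    X1, X2 = X[:, :q], X[:, q:]
    C11, C21 = X1.T @ X1 / n, X2.T @ X1 / n
    s, b1 = np.sign(beta_star[:q]), beta_star[:q]
    S11, S21 = Sigma[:q, :q], Sigma[q:, :q]
    ic = np.max(np.abs(C21 @ np.linalg.solve(C11, s)))
    eic = np.max(np.abs(C21 @ np.linalg.solve(C11 + eta / n * np.eye(q),
                                              s + 2 * eta / lam * b1)))
    gic = np.max(np.abs((C21 + eta / n * S21)
                        @ np.linalg.solve(C11 + eta / n * S11,
                                          s + 2 * eta / lam * b1)
                        - 2 * eta / lam * S21 @ b1))
    return ic, eic, gic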

Figure 1: Boxplots of the values defined in (23), obtained from 100 replications.

Figures 2 and 3 show the behavior of $\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)$, $\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)$ and $\lambda_{\max}\left(H_{B}H^{\prime}_{B}\right)$ appearing in (19), (20) and (21) as functions of $\eta$, for different values of $n$ and $p$ and for $q=5$ or 10. These plots thus provide lower bounds for the constants $M_{1}$, $M_{2}$ and $M_{3}$ appearing in the previous equations. Observe that (18) can be rewritten as:

\eta M_{2}<\dfrac{n}{3\lambda_{\max}\left(\boldsymbol{\Sigma}_{11}\right)}\times\dfrac{\boldsymbol{\beta}_{\min}}{\left\|\boldsymbol{\beta}_{1}^{\star}\right\|_{2}}. (24)

Based on the plots at the bottom right of Figures 2 and 3, we can see that there exist values of $\eta$ satisfying Condition (24), and thus (18), and that the interval in which such $\eta$'s lie is larger when $q=5$ than when $q=10$.
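The quantities displayed in Figures 2 and 3 can be computed as follows (a sketch using our own function name; $H_{A}$ and $H_{B}$ are formed explicitly, which is only sensible for moderate $p$):

import numpy as np

def eigen_bounds(X, Sigma, q, eta):
    # lambda_max of H_A H_A', (C_11^{n,Sigma})^{-1} and H_B H_B', cf. (19)-(21)
    n = X.shape[0]
    X1, X2 = X[:, :q], X[:, q:]
    C11 = X1.T @ X1 / n + eta / n * Sigma[:q, :q]    # script C_11^{n,Sigma}
    C21 = X2.T @ X1 / n + eta / n * Sigma[q:, :q]    # script C_21^{n,Sigma}
    C11_inv = np.linalg.inv(C11)
    HA = C11_inv @ X1.T / np.sqrt(n)
    HB = (C21 @ C11_inv @ X1.T - X2.T) / np.sqrt(n)
    return (np.linalg.eigvalsh(HA @ HA.T).max(),
            np.linalg.eigvalsh(C11_inv).max(),
            np.linalg.eigvalsh(HB @ HB.T).max())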

Figure 2: Top left: Average of $\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)$ in (19) as a function of $\eta$. Top right: Average of $\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)$ in (20) as a function of $\eta$. Bottom left: Average of $\lambda_{\max}\left(H_{B}H^{\prime}_{B}\right)$ in (21) as a function of $\eta$. Bottom right: Average of the left (resp. right) part of (24) in plain (resp. dashed) line. The averages are obtained from 10 replications. Here $q=5$.
Figure 3: Top left: Average of $\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)$ in (19) as a function of $\eta$. Top right: Average of $\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)$ in (20) as a function of $\eta$. Bottom left: Average of $\lambda_{\max}\left(H_{B}H^{\prime}_{B}\right)$ in (21) as a function of $\eta$. Bottom right: Average of the left (resp. right) part of (24) in plain (resp. dashed) line. The averages are obtained from 10 replications. Here $q=10$.

Based on the average of $M_{1}$ previously obtained, the left part of (15) is always satisfied as soon as $b>\sqrt{18}$. Based on the averages of $M_{2}$ and $M_{3}$ previously obtained, the averages of the left-hand side and of the right-hand side of the second part of Equation (15) are displayed in Figures 4 and 5. We can see from these figures that this part of (15) is only satisfied for large values of $b$. Moreover, it is more often satisfied when $q=5$ than when $q=10$.

Figure 4: Average of the left-hand (resp. right-hand) side of the second part of (15) in red (resp. blue) for $q=5$.
Figure 5: Average of the left-hand (resp. right-hand) side of the second part of (15) in red (resp. blue) for $q=10$.

We will show in the next section that, even in the cases where not all the conditions of the theorem are fulfilled, our method is robust enough to outperform the Elastic Net defined in (4).

3.2 Comparison with other methods

To assess the performance of our approach (gEN) in terms of sign consistency with respect to other methods, and to illustrate the results of Theorem 2.2, we computed the True Positive Rate (TPR), namely the proportion of active variables selected, and the False Positive Rate (FPR), namely the proportion of non active variables selected, for the Elastic Net and gEN estimators defined in (4) and (10), respectively.

Figures 6 to 8 display the empirical mean, over the replications, of the largest difference between the True Positive Rate and the False Positive Rate. It is obtained by selecting, for each replication, the values of $\lambda$ and $\eta$ achieving the largest difference between the TPR and the FPR, and by averaging these differences. The figures also display the corresponding TPR and FPR of gEN and Elastic Net for different values of $n$ and $p$. We can see that the gEN and Elastic Net estimators both have a TPR equal to 1, but that the FPR of gEN is smaller than that of Elastic Net, and that the difference between the performance of gEN and Elastic Net is larger for high signal-to-noise ratios ($b=10$). It has to be noticed that, when the TPR is equal to 1 for our approach, the signs of the non null $\beta_{i}^{\star}$ are also properly retrieved.
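The TPR and FPR computations used above are straightforward; a minimal Python sketch (our notation) is:

import numpy as np

def tpr_fpr(beta_hat, beta_star):
    selected = beta_hat != 0
    active = beta_star != 0
    tpr = selected[active].mean()      # proportion of active variables selected
    fpr = selected[~active].mean()     # proportion of non active variables selected
    return tpr, fpr

# For each replication, max(TPR - FPR) is then taken over a grid of (lambda, eta).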

Figure 6: Average of max(TPR$-$FPR) and the corresponding TPR and FPR for gEN (in red) and Elastic Net (in blue) with $p=200$.
Figure 7: Average of max(TPR$-$FPR) and the corresponding TPR and FPR for gEN (in red) and Elastic Net (in blue) with $p=400$.
Figure 8: Average of max(TPR$-$FPR) and the corresponding TPR and FPR for gEN (in red) and Elastic Net (in blue) with $p=600$.

4 Discussion

In this paper, we proposed a novel variable selection approach called gEN (generalized Elastic Net) in the framework of linear models where the columns of the design matrix are highly correlated and where, consequently, the standard Lasso criterion usually fails. We proved that, under mild conditions, among which the GIC, which remains valid when other standard conditions such as the EIC or IC are not fulfilled, our method provides a sign-consistent estimator of $\boldsymbol{\beta}^{\star}$. For a more thorough discussion regarding the application of our approach in practical situations, we refer the reader to [5].

5 Proofs

5.1 Proof of Lemma 2.1

Note that (9), given by:

L^{gEN}(\widetilde{\boldsymbol{\beta}})=\left\lVert\mathbf{y}-\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}+\lambda\left\lVert\boldsymbol{\Sigma}^{-1/2}\widetilde{\boldsymbol{\beta}}\right\rVert_{1}+\eta\left\lVert\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}

can be rewritten as

L^{gEN}(\widetilde{\boldsymbol{\beta}})=\left\lVert\mathbf{y}^{*}-\widetilde{\mathbf{X}}^{*}\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}+\lambda\left\lVert\boldsymbol{\Sigma}^{-1/2}\widetilde{\boldsymbol{\beta}}\right\rVert_{1},

where

\mathbf{y}^{*}=\begin{pmatrix}\mathbf{y}\\ \mathbf{0}\end{pmatrix},\quad\widetilde{\mathbf{X}}^{*}=\begin{pmatrix}\widetilde{\mathbf{X}}\\ \sqrt{\eta}\,\mathbb{I}_{p}\end{pmatrix}.
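Indeed, expanding the augmented residual recovers the $\ell_{2}$ part of (9):

\left\lVert\mathbf{y}^{*}-\widetilde{\mathbf{X}}^{*}\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}=\left\lVert\mathbf{y}-\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}+\left\lVert\sqrt{\eta}\,\mathbb{I}_{p}\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}=\left\lVert\mathbf{y}-\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}+\eta\left\lVert\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}.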

Then, $\widehat{\widetilde{\boldsymbol{\beta}}}$ satisfies

\widetilde{\mathbf{X}}^{*\prime}\left(\mathbf{y}^{*}-\widetilde{\mathbf{X}}^{*}\widehat{\widetilde{\boldsymbol{\beta}}}\right)=\frac{\lambda}{2}(\boldsymbol{\Sigma}^{-1/2})^{\prime}z, (25)

where $A^{\prime}$ denotes the transpose of the matrix $A$, and

\begin{cases}z_{j}=\textrm{sign}\left((\boldsymbol{\Sigma}^{-1/2}\widehat{\widetilde{\boldsymbol{\beta}}})_{j}\right),&\text{if }(\boldsymbol{\Sigma}^{-1/2}\widehat{\widetilde{\boldsymbol{\beta}}})_{j}\neq 0,\\ z_{j}\in[-1,1],&\text{if }(\boldsymbol{\Sigma}^{-1/2}\widehat{\widetilde{\boldsymbol{\beta}}})_{j}=0.\end{cases}

Equation (25) can be rewritten as:

\mathbf{X}^{\prime}\mathbf{y}-\left(\mathbf{X}^{\prime}\mathbf{X}+\eta\boldsymbol{\Sigma}\right)\widehat{\boldsymbol{\beta}}=\frac{\lambda}{2}z,

which leads to

\mathbf{X}^{\prime}\mathbf{X}(\boldsymbol{\beta}^{\star}-\widehat{\boldsymbol{\beta}})+\mathbf{X}^{\prime}\boldsymbol{\epsilon}-\eta\boldsymbol{\Sigma}\widehat{\boldsymbol{\beta}}=\frac{\lambda}{2}z,

by using that $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}^{\star}+\boldsymbol{\epsilon}$. By using the following notations: $\widehat{\mathbf{u}}=\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}^{\star}$,

C_{n}=\frac{1}{n}\mathbf{X}^{\prime}\mathbf{X}\textrm{ and }W_{n}=\frac{1}{\sqrt{n}}\mathbf{X}^{\prime}\boldsymbol{\epsilon},

Equation (25) becomes

\left(C_{n}+\frac{\eta}{n}\boldsymbol{\Sigma}\right)\sqrt{n}\,\widehat{\mathbf{u}}+\frac{\eta}{\sqrt{n}}\boldsymbol{\Sigma}\boldsymbol{\beta}^{\star}-W_{n}=-\frac{\lambda}{2\sqrt{n}}z. (26)

With the following notations:

C_{n}=\begin{pmatrix}C_{11}^{n}&C_{12}^{n}\\ C_{21}^{n}&C_{22}^{n}\end{pmatrix},\;\boldsymbol{\Sigma}=\begin{pmatrix}\boldsymbol{\Sigma}_{11}&\boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{21}&\boldsymbol{\Sigma}_{22}\end{pmatrix},\;\widehat{\mathbf{u}}=\begin{pmatrix}\widehat{\mathbf{u}}_{1}\\ \widehat{\mathbf{u}}_{2}\end{pmatrix},\;W_{n}=\begin{pmatrix}W_{n}(1)\\ W_{n}(2)\end{pmatrix},\;\boldsymbol{\beta}^{\star}=\begin{pmatrix}\boldsymbol{\beta}_{1}^{\star}\\ \mathbf{0}\end{pmatrix},

the first $q$ components of Equation (26) are:

\left(C_{11}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{11}\right)\sqrt{n}\,\widehat{\mathbf{u}}_{1}+\left(C_{12}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{12}\right)\sqrt{n}\,\widehat{\mathbf{u}}_{2}+\frac{\eta}{\sqrt{n}}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}-W_{n}(1)=-\frac{\lambda}{2\sqrt{n}}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star}). (27)

If $\widehat{\mathbf{u}}=\begin{pmatrix}\widehat{\mathbf{u}}_{1}\\ \mathbf{0}\end{pmatrix}$, it can be seen as a solution of the generalized Elastic Net criterion where, by Equation (27), $\widehat{\mathbf{u}}_{1}$ is defined by:

\sqrt{n}\,\widehat{\mathbf{u}}_{1}=\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)-\frac{\eta}{\sqrt{n}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}-\frac{\lambda}{2\sqrt{n}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star}), (28)

where we used (14).

Note that the event $A_{n}$ can be rewritten as follows:

\sqrt{n}\left(-\left|\boldsymbol{\beta}_{1}^{\star}\right|+\frac{\lambda}{2n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right|+\frac{\eta}{n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right|\right)<\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)<\sqrt{n}\left(\left|\boldsymbol{\beta}_{1}^{\star}\right|-\frac{\lambda}{2n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right|-\frac{\eta}{n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right|\right), (29)

which implies

\sqrt{n}\left(-\left|\boldsymbol{\beta}_{1}^{\star}\right|+\frac{\lambda}{2n}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{\eta}{n}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)<\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)<\sqrt{n}\left(\left|\boldsymbol{\beta}_{1}^{\star}\right|+\frac{\lambda}{2n}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{\eta}{n}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right), (30)

using that $-|x|\leq x\leq|x|$ for all $x\in\mathbb{R}$. Then, by using (28), we get that $\sqrt{n}\,|\widehat{\mathbf{u}}_{1}|<\sqrt{n}\,|\boldsymbol{\beta}_{1}^{\star}|$ and thus $|\widehat{\mathbf{u}}_{1}|<|\boldsymbol{\beta}_{1}^{\star}|$. Notice that $|\widehat{\mathbf{u}}_{1}|<|\boldsymbol{\beta}_{1}^{\star}|$ implies that $\widehat{\boldsymbol{\beta}}_{1}\neq 0$ and that $\textrm{sign}(\widehat{\boldsymbol{\beta}}_{1})=\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})$. Moreover, since $\widehat{\mathbf{u}}_{2}=0$, we get that $\textrm{sign}(\widehat{\boldsymbol{\beta}})=\textrm{sign}(\boldsymbol{\beta}^{\star})$.

The last $p-q$ components of (26) satisfy:

\left(C_{21}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{21}\right)\sqrt{n}\,\widehat{\mathbf{u}}_{1}+\left(C_{22}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{22}\right)\sqrt{n}\,\widehat{\mathbf{u}}_{2}+\frac{\eta}{\sqrt{n}}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}-W_{n}(2)=-\frac{\lambda}{2\sqrt{n}}z_{2},

where, by (25), $|z_{2}|\leq 1$. Hence, since $\widehat{\mathbf{u}}_{2}=0$,

\left|\left(C_{21}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{21}\right)\sqrt{n}\,\widehat{\mathbf{u}}_{1}+\frac{\eta}{\sqrt{n}}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}-W_{n}(2)\right|\leq\frac{\lambda}{2\sqrt{n}},

which can be rewritten as follows by using (28):

\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(W_{n}(1)-\frac{\eta}{\sqrt{n}}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}-\frac{\lambda}{2\sqrt{n}}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right)+\frac{\eta}{\sqrt{n}}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}-W_{n}(2)\right|\leq\frac{\lambda}{2\sqrt{n}}. (31)

When the event Bn\displaystyle B_{n} is satisfied:

-\frac{\lambda}{2\sqrt{n}}+\frac{\lambda}{2\sqrt{n}}\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right|\leq\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)-W_{n}(2)\leq\frac{\lambda}{2\sqrt{n}}-\frac{\lambda}{2\sqrt{n}}\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right|. (32)

By using that $-|x|\leq x\leq|x|$ for all $x$ in $\mathbb{R}$, we get that (32) implies

\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)-W_{n}(2)-\frac{\lambda}{2\sqrt{n}}\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)+\frac{\eta}{\sqrt{n}}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right|\leq\frac{\lambda}{2\sqrt{n}},

which corresponds to (31). Thus, if $A_{n}$ and $B_{n}$ are satisfied, we get that $\textrm{sign}(\widehat{\boldsymbol{\beta}})=\textrm{sign}(\boldsymbol{\beta}^{\star})$, which concludes the proof.

5.2 Proof of Theorem 2.2

By Lemma 2.1,

\mathbb{P}\left(\textrm{sign}\left(\widehat{\boldsymbol{\beta}}\right)=\textrm{sign}(\boldsymbol{\beta}^{\star})\right)\geq\mathbb{P}\left(A_{n}\cap B_{n}\right)\geq 1-\mathbb{P}\left(A_{n}^{c}\right)-\mathbb{P}\left(B_{n}^{c}\right),

where $A_{n}^{c}$ and $B_{n}^{c}$ denote the complements of $A_{n}$ and $B_{n}$, respectively. Thus, to prove the theorem, it is enough to prove that $\mathbb{P}\left(A_{n}^{c}\right)\to 0$ and $\mathbb{P}\left(B_{n}^{c}\right)\to 0$ as $n\to\infty$.

Recall that

A_{n}:=\left\{\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)\right|<\sqrt{n}\left(\left|\boldsymbol{\beta}_{1}^{\star}\right|-\frac{\lambda}{2n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right|-\frac{\eta}{n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right|\right)\right\}.

Let $\boldsymbol{\zeta}$ and $\boldsymbol{\tau}$ be defined by

\boldsymbol{\zeta}=\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)\textrm{ and }\boldsymbol{\tau}=\sqrt{n}\left(\left|\boldsymbol{\beta}_{1}^{\star}\right|-\frac{\lambda}{2n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right|-\frac{\eta}{n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right|\right).

Then,

\mathbb{P}(A_{n})=\mathbb{P}\left(\forall j,\ |\zeta_{j}|<\tau_{j}\right).

Thus,

\mathbb{P}(A_{n}^{c})=\mathbb{P}\left(\exists j,\ |\zeta_{j}|\geq\tau_{j}\right)\leq\sum_{j=1}^{q}\mathbb{P}\left(|\zeta_{j}|\geq\tau_{j}\right).

Note that

\mathbb{P}(|\zeta_{j}|\geq\tau_{j})=\mathbb{P}\left(|\zeta_{j}|\geq\sqrt{n}\left(\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|-\frac{\lambda}{2n}\left|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right|-\frac{\eta}{n}\left|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right|\right)\right)
=\mathbb{P}\left(|\zeta_{j}|+\frac{\lambda}{2\sqrt{n}}\left|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right|+\frac{\eta}{\sqrt{n}}\left|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right|\geq\sqrt{n}\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|\right)
\leq\mathbb{P}\left(|\zeta_{j}|\geq\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right)+\mathbb{P}\left(\frac{\lambda}{2\sqrt{n}}\left|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right|\geq\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right)+\mathbb{P}\left(\frac{\eta}{\sqrt{n}}\left|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right|\geq\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right). (33)

Observe that

\boldsymbol{\zeta}=\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)=\frac{1}{\sqrt{n}}\left(C_{11}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{11}\right)^{-1}\mathbf{X}^{\prime}_{1}\boldsymbol{\epsilon}=H_{A}\boldsymbol{\epsilon},

where

H_{A}=\frac{1}{\sqrt{n}}\left(C_{11}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{11}\right)^{-1}\mathbf{X}^{\prime}_{1},

$\mathbf{X}_{1}$ denoting the columns of the design matrix $\mathbf{X}$ associated to the $q$ active covariates. Thus, for all $j$ in $\{1,\dots,q\}$,

\zeta_{j}=\sum_{k=1}^{n}(H_{A})_{jk}\boldsymbol{\epsilon}_{k}.

By using the Cauchy-Schwarz inequality,

|\zeta_{j}|=\left|\sum_{k=1}^{n}(H_{A})_{jk}\boldsymbol{\epsilon}_{k}\right|\leq\left(\sum_{k=1}^{n}(H_{A})_{jk}^{2}\right)^{1/2}\left(\sum_{k=1}^{n}\boldsymbol{\epsilon}_{k}^{2}\right)^{1/2}=\sqrt{\left(H_{A}H^{\prime}_{A}\right)_{jj}}\times\|\boldsymbol{\epsilon}\|_{2}\leq\sqrt{\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)}\times\|\boldsymbol{\epsilon}\|_{2}.

Hence, the first term in the r.h.s. of (33) satisfies the following inequalities:

\mathbb{P}\left(|\zeta_{j}|\geq\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right)\leq\mathbb{P}\left(\sqrt{\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)}\times\|\boldsymbol{\epsilon}\|_{2}\geq\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right)\leq\mathbb{P}\left(\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)\times\|\boldsymbol{\epsilon}\|_{2}^{2}\geq n\frac{(\boldsymbol{\beta}_{1}^{\star})_{j}^{2}}{9}\right). (34)

Since, by (19), there exist $M_{1}>0$ and $\delta_{1}>0$ such that

\mathbb{P}\left(\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)\leq M_{1}\right)=1-o\left(e^{-n^{\delta_{1}}}\right),\textrm{ as }n\rightarrow\infty,

we have:

\mathbb{P}\left(\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)\times\|\boldsymbol{\epsilon}\|_{2}^{2}\geq n\frac{(\boldsymbol{\beta}_{1}^{\star})_{j}^{2}}{9}\right)\leq\mathbb{P}\left(\|\boldsymbol{\epsilon}\|_{2}^{2}\geq n\frac{(\boldsymbol{\beta}_{1}^{\star})_{j}^{2}}{9M_{1}}\right)+\mathbb{P}\left(\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)>M_{1}\right)\leq\mathbb{P}\left(\dfrac{\|\boldsymbol{\epsilon}\|_{2}^{2}}{\sigma^{2}}\geq\dfrac{n\boldsymbol{\beta}_{\min}^{2}}{9M_{1}\sigma^{2}}\right)+o\left(e^{-n^{\delta_{1}}}\right).

Using that $\dfrac{\|\boldsymbol{\epsilon}\|_{2}^{2}}{\sigma^{2}}\sim\chi^{2}(n)$, we get, by Lemma 1 of [29], that

\mathbb{P}\left(\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)\times\|\boldsymbol{\epsilon}\|_{2}^{2}\geq n\frac{(\boldsymbol{\beta}_{1}^{\star})_{j}^{2}}{9}\right)\leq\exp\left(-\dfrac{t}{2}+\dfrac{1}{2}\sqrt{n\left(2t-n\right)}\right)+o\left(e^{-n^{\delta_{1}}}\right), (35)

since $t=\frac{n\boldsymbol{\beta}_{\min}^{2}}{9M_{1}\sigma^{2}}>\dfrac{n}{2}$, using that $\frac{2\boldsymbol{\beta}_{\min}^{2}}{9\sigma^{2}}>M_{1}$, which follows from (15).
By putting together Equations (34) and (35), we get

\mathbb{P}\left(|\zeta_{j}|>\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right)\leq\exp\left(-\dfrac{t}{2}+\dfrac{1}{2}\sqrt{n\left(2t-n\right)}\right)+o\left(e^{-n^{\delta_{1}}}\right), (36)

with $t=\frac{n\boldsymbol{\beta}_{\min}^{2}}{9M_{1}\sigma^{2}}>\dfrac{n}{2}$.
Let us now derive an upper bound for the second term in the r.h.s. of (33):

\mathbb{P}\left(\frac{\lambda}{2\sqrt{n}}\left|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right|\geq\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right).

By using the Cauchy-Schwarz inequality, we get that:

\left|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right|=\left|\sum_{k=1}^{q}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)_{jk}\left(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right)_{k}\right|\leq\sqrt{\sum_{k=1}^{q}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)_{jk}^{2}}\times\|\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\|_{2}\leq\sqrt{\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-2}\right)_{jj}}\times\sqrt{q}\leq\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\times\sqrt{q}.

Then,

\mathbb{P}\left(\frac{\lambda}{2\sqrt{n}}\left|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right|\geq\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right)\leq\mathbb{P}\left(\dfrac{\lambda}{2}\sqrt{\dfrac{q}{n}}\,\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\geq\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right)\leq\mathbb{P}\left(\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\geq\dfrac{2n}{3\lambda\sqrt{q}}\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|\right)\leq\mathbb{P}\left(\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\geq\dfrac{2n}{3\lambda\sqrt{q}}\boldsymbol{\beta}_{\min}\right)=o\left(e^{-n^{\delta_{2}}}\right),\textrm{ as }n\to\infty, (37)

since $\dfrac{2n}{3\lambda\sqrt{q}}\boldsymbol{\beta}_{\min}>M_{2}$ by (16), combined with (20). Let us now derive an upper bound for the third term in the r.h.s. of (33):

\mathbb{P}\left(\frac{\eta}{\sqrt{n}}\left|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right|\geq\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right).

We have

\left|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right|=\left|\sum_{k=1}^{q}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\right)_{jk}\left(\boldsymbol{\beta}_{1}^{\star}\right)_{k}\right|\leq\sqrt{\sum_{k=1}^{q}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\right)_{jk}^{2}}\times\left\|\boldsymbol{\beta}_{1}^{\star}\right\|_{2}\leq\sqrt{\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}^{2}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)}\times\left\|\boldsymbol{\beta}_{1}^{\star}\right\|_{2}\leq\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\lambda_{\max}\left(\boldsymbol{\Sigma}_{11}\right)\times\left\|\boldsymbol{\beta}_{1}^{\star}\right\|_{2}.

Thus,

\mathbb{P}\left(\frac{\eta}{\sqrt{n}}\left|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right|\geq\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right)\leq\mathbb{P}\left(\frac{\eta}{\sqrt{n}}\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\lambda_{\max}\left(\boldsymbol{\Sigma}_{11}\right)\left\|\boldsymbol{\beta}_{1}^{\star}\right\|_{2}\geq\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right)\leq\mathbb{P}\left(\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\geq\dfrac{n\boldsymbol{\beta}_{\min}}{3\eta\left\|\boldsymbol{\beta}_{1}^{\star}\right\|_{2}\lambda_{\max}\left(\boldsymbol{\Sigma}_{11}\right)}\right)=o\left(e^{-n^{\delta_{2}}}\right),\textrm{ as }n\to\infty, (38)

since $\dfrac{n\boldsymbol{\beta}_{\min}}{3\eta\left\|\boldsymbol{\beta}_{1}^{\star}\right\|_{2}\lambda_{\max}\left(\boldsymbol{\Sigma}_{11}\right)}>M_{2}$ by (18), combined with (20).
By putting together Equations (36), (37) and (38), we get:

\mathbb{P}\left(A_{n}^{c}\right)\leq q\exp\left[-\dfrac{n}{2}\left(\kappa-\sqrt{2\kappa-1}\right)\right]+q\times o\left(e^{-n^{\delta_{1}}}\right)+2q\times o\left(e^{-n^{\delta_{2}}}\right), (39)

with $\kappa=\frac{\boldsymbol{\beta}_{\min}^{2}}{9M_{1}\sigma^{2}}$. Note that $\kappa-\sqrt{2\kappa-1}>0$ since $\kappa=\frac{\boldsymbol{\beta}_{\min}^{2}}{9M_{1}\sigma^{2}}>1$ by (15). Equation (39) then implies that

\mathbb{P}\left(A_{n}^{c}\right)\rightarrow 0\textrm{ as }n\rightarrow\infty.

Let us now prove that $\mathbb{P}\left(B_{n}^{c}\right)\rightarrow 0$ as $n\rightarrow\infty$.
Recall that

B_{n}:=\left\{\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)-W_{n}(2)\right|\leq\frac{\lambda}{2\sqrt{n}}-\frac{\lambda}{2\sqrt{n}}\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right|\right\}.

Let

\boldsymbol{\psi}=\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)-W_{n}(2)=\frac{1}{\sqrt{n}}\left(\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\mathbf{X}^{\prime}_{1}-\mathbf{X}^{\prime}_{2}\right)\boldsymbol{\epsilon}=:H_{B}\boldsymbol{\epsilon}

and

\boldsymbol{\mu}=\frac{\lambda}{2\sqrt{n}}-\frac{\lambda}{2\sqrt{n}}\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(\mathrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right|.

Then,

\mathbb{P}(B_{n}^{c})=\mathbb{P}\left(\exists j,\,|\psi_{j}|>\mu_{j}\right)\leq\sum_{j=q+1}^{p}\mathbb{P}\left(|\psi_{j}|>\mu_{j}\right).

By the Cauchy–Schwarz inequality, we get:

|\psi_{j}|=\left|\sum_{k=1}^{n}(H_{B})_{jk}\boldsymbol{\epsilon}_{k}\right|\leq\left(\sum_{k=1}^{n}(H_{B})_{jk}^{2}\right)^{1/2}\times\|\boldsymbol{\epsilon}\|_{2}=\sqrt{\left(H_{B}H^{\prime}_{B}\right)_{jj}}\times\|\boldsymbol{\epsilon}\|_{2}\leq\sqrt{\lambda_{\max}(H_{B}H^{\prime}_{B})}\times\|\boldsymbol{\epsilon}\|_{2}, \quad (40)

where

H_{B}H^{\prime}_{B}=\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}C_{11}^{n}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\mathscr{C}_{12}^{n,\boldsymbol{\Sigma}}-\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}C_{12}^{n}-C_{21}^{n}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\mathscr{C}_{12}^{n,\boldsymbol{\Sigma}}+C_{22}^{n}.
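Two remarks may help the reader here. First, the last inequality in (40) holds because a diagonal entry of the symmetric positive semidefinite matrix $H_{B}H^{\prime}_{B}$ is bounded by its largest eigenvalue: denoting by $e_{j}$ the $j$th canonical basis vector,

\left(H_{B}H^{\prime}_{B}\right)_{jj}=e_{j}^{\prime}H_{B}H^{\prime}_{B}e_{j}\leq\lambda_{\max}\left(H_{B}H^{\prime}_{B}\right)\left\|e_{j}\right\|_{2}^{2}=\lambda_{\max}\left(H_{B}H^{\prime}_{B}\right).

Second, the displayed expression for $H_{B}H^{\prime}_{B}$ follows by expanding the product termwise (assuming, consistently with the notation above, that $C_{ij}^{n}=\mathbf{X}^{\prime}_{i}\mathbf{X}_{j}/n$, that $\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}$ is symmetric and that $\mathscr{C}_{12}^{n,\boldsymbol{\Sigma}}=\left(\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\right)^{\prime}$):

H_{B}H^{\prime}_{B}=\frac{1}{n}\left(\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\mathbf{X}^{\prime}_{1}-\mathbf{X}^{\prime}_{2}\right)\left(\mathbf{X}_{1}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\mathscr{C}_{12}^{n,\boldsymbol{\Sigma}}-\mathbf{X}_{2}\right).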

By (21), there exist $M_{3}>0$ and $\delta_{3}>0$ such that

\mathbb{P}\left(\lambda_{\max}\left(H_{B}H^{\prime}_{B}\right)\leq M_{3}\right)=1-o\left(e^{-n^{\delta_{3}}}\right),\textrm{ as }n\to\infty.

By the GIC condition (12), there exist $\alpha>0$ and $\delta_{4}>0$ such that, for all $j$,

\mathbb{P}\left(\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(\mathrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right|\leq 1-\alpha\right)=1-o\left(e^{-n^{\delta_{4}}}\right).

Thus, we get that:

\mathbb{P}(B_{n}^{c})\leq\sum_{j=q+1}^{p}\mathbb{P}\left(|\psi_{j}|>\mu_{j}\right) \quad (41)
\leq\sum_{j=q+1}^{p}\mathbb{P}\left(|\psi_{j}|>\dfrac{\lambda\alpha}{2\sqrt{n}}\right)+(p-q)o\left(e^{-n^{\delta_{4}}}\right)
\leq\sum_{j=q+1}^{p}\mathbb{P}\left(\sqrt{\lambda_{\max}(H_{B}H^{\prime}_{B})}\times\|\boldsymbol{\epsilon}\|_{2}>\dfrac{\lambda\alpha}{2\sqrt{n}}\right)+(p-q)o\left(e^{-n^{\delta_{4}}}\right),\ \textrm{using Equation (40)}
\leq\sum_{j=q+1}^{p}\mathbb{P}\left(\lambda_{\max}(H_{B}H^{\prime}_{B})\times\|\boldsymbol{\epsilon}\|_{2}^{2}>\dfrac{\lambda^{2}\alpha^{2}}{4n}\right)+(p-q)o\left(e^{-n^{\delta_{4}}}\right)
\leq\sum_{j=q+1}^{p}\mathbb{P}\left(\dfrac{\|\boldsymbol{\epsilon}\|_{2}^{2}}{\sigma^{2}}>\dfrac{\lambda^{2}\alpha^{2}}{4nM_{3}\sigma^{2}}\right)+(p-q)o\left(e^{-n^{\delta_{3}}}\right)+(p-q)o\left(e^{-n^{\delta_{4}}}\right)
\leq(p-q)\exp\left(-\dfrac{s}{2}+\dfrac{1}{2}\sqrt{n(2s-n)}\right)+(p-q)o\left(e^{-n^{\delta_{3}}}\right)+(p-q)o\left(e^{-n^{\delta_{4}}}\right)
\leq(p-q)\exp\left(-\dfrac{n}{2}\left(\dfrac{s}{n}-\sqrt{2\dfrac{s}{n}-1}\right)\right)+(p-q)o\left(e^{-n^{\delta_{3}}}\right)+(p-q)o\left(e^{-n^{\delta_{4}}}\right)
\leq(p-q)\exp\left(-\dfrac{n}{2}\right)+(p-q)o\left(e^{-n^{\delta_{3}}}\right)+(p-q)o\left(e^{-n^{\delta_{4}}}\right),

where $\dfrac{s}{n}=\dfrac{\lambda^{2}\alpha^{2}}{4n^{2}M_{3}\sigma^{2}}$: the bound involving $\exp\left(-\frac{s}{2}+\frac{1}{2}\sqrt{n(2s-n)}\right)$ follows from the $\chi^{2}$ tail bound of [29] applied to $\|\boldsymbol{\epsilon}\|_{2}^{2}/\sigma^{2}\sim\chi_{n}^{2}$, and the last inequality holds since $\dfrac{\lambda^{2}\alpha^{2}}{4n^{2}M_{3}\sigma^{2}}\geq 2+\sqrt{2}$ by (17).
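To make these last steps explicit: by the deviation inequality of [29], if $Z\sim\chi_{n}^{2}$ then $\mathbb{P}\left(Z\geq n+2\sqrt{nx}+2x\right)\leq e^{-x}$ for all $x>0$. Solving $s=n+2\sqrt{nx}+2x$ for $x$ when $s>n$ gives $x=\frac{s}{2}-\frac{1}{2}\sqrt{n(2s-n)}$, so that

\mathbb{P}\left(\frac{\|\boldsymbol{\epsilon}\|_{2}^{2}}{\sigma^{2}}\geq s\right)\leq\exp\left(-\frac{s}{2}+\frac{1}{2}\sqrt{n(2s-n)}\right)=\exp\left(-\frac{n}{2}\left(\frac{s}{n}-\sqrt{2\frac{s}{n}-1}\right)\right).

Moreover, for $t=s/n\geq1$,

t-\sqrt{2t-1}\geq1\iff(t-1)^{2}\geq2t-1\iff t^{2}-4t+2\geq0\iff t\geq2+\sqrt{2},

which justifies the final factor $\exp(-n/2)$. Finally, Equation (41) implies that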

(Bnc)0, as n,\displaystyle\mathbb{P}(B_{n}^{c})\rightarrow 0,\textrm{ as }n\rightarrow\infty,

which concludes the proof.
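As a quick numerical illustration, outside the formal argument, the following Python snippet (a sketch of our own; the variable names and the values of $n$ and $t=s/n$ are arbitrary choices) checks the $\chi^{2}$ tail bound used in Equation (41):

import numpy as np
from scipy import stats

# Sanity check of the chi-squared tail bound used in Equation (41):
# for Z ~ chi2(n) and s = t*n with t >= 2 + sqrt(2), one should have
#   P(Z >= s) <= exp(-s/2 + sqrt(n*(2s - n))/2) <= exp(-n/2),
# the first inequality following from the deviation bound of [29].
n = 50
for t in [3.5, 4.0, 8.0]:  # values above the threshold 2 + sqrt(2) of (17)
    s = t * n
    tail = stats.chi2.sf(s, df=n)                          # exact P(Z >= s)
    bound = np.exp(-s / 2 + np.sqrt(n * (2 * s - n)) / 2)
    final = np.exp(-n / 2)                                 # the last bound in (41)
    assert tail <= bound <= final
    print(f"t = {t:.1f}: {tail:.3e} <= {bound:.3e} <= {final:.3e}")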

References

  • [1] Jinzhu Jia and B. Yu. On model selection consistency of the elastic net when $p\gg n$. Statistica Sinica, 20, 2010.
  • [2] Wenbin Lu, Hao Zhang, and Donglin Zeng. Variable selection for optimal treatment decision. Statistical Methods in Medical Research, 22, 2011.
  • [3] Lacey Gunter, Ji Zhu, and Susan Murphy. Variable selection for qualitative interactions in personalized medicine while controlling the family-wise error rate. Journal of Biopharmaceutical Statistics, 21(6):1063–1078, 2011.
  • [4] Xuemin Gu, Guosheng Yin, and J. Jack Lee. Bayesian two-step lasso strategy for biomarker selection in personalized medicine development for time-to-event endpoints. Contemporary Clinical Trials, 36(2):642–650, 2013.
  • [5] Wencan Zhu, Céline Lévy-Leduc, and Nils Ternès. A variable selection approach for highly correlated predictors in high-dimensional genomic data. Bioinformatics, 2021. doi: 10.1093/bioinformatics/btab114.
  • [6] Zeynep Tufekci. Big questions for social media big data: Representativeness, validity and other methodological pitfalls. In Proceedings of the 8th International Conference on Weblogs and Social Media (ICWSM 2014), 2014.
  • [7] Huijie Lin, Jia Jia, Liqiang Nie, Guangyao Shen, and Tat-Seng Chua. What does social media say about your stress? In IJCAI, pages 3775–3781, 2016.
  • [8] Theodore S. Tomeny, Christopher J. Vargo, and Sherine El-Toukhy. Geographic and demographic correlates of autism-related anti-vaccine beliefs on Twitter, 2009–15. Social Science & Medicine, 191:168–175, 2017.
  • [9] Georgios Sermpinis, Serafeim Tsoukas, and Ping Zhang. Modelling market implied ratings using lasso variable selection techniques. Journal of Empirical Finance, 48:19–35, 2018.
  • [10] Alessandra Amendola, Francesco Giordano, Maria Parrella, and Marialuisa Restaino. Variable selection in high-dimensional regression: A nonparametric procedure for business failure prediction. Applied Stochastic Models in Business and Industry, 33, 2017.
  • [11] Bartosz Uniejewski, Grzegorz Marcjasz, and Rafał Weron. Understanding intraday electricity markets: Variable selection and very short-term price forecasting using lasso. International Journal of Forecasting, 35(4):1533–1547, 2019.
  • [12] Norman R. Draper and Harry Smith. Applied Regression Analysis, volume 326. John Wiley & Sons, 1998.
  • [13] Peter J. Bickel, Bo Li, Alexandre B. Tsybakov, Sara A. van de Geer, Bin Yu, Teófilo Valdés, Carlos Rivero, Jianqing Fan, and Aad van der Vaart. Regularization in statistics. Test, 15(2):271–344, 2006.
  • [14] Hirotogu Akaike. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer, 1998.
  • [15] Gideon Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.
  • [16] William J. Welch. Algorithmic complexity: three NP-hard problems in computational statistics. Journal of Statistical Computation and Simulation, 15(1):17–25, 1982.
  • [17] Leo Breiman. Heuristics of instability and stabilization in model selection. Annals of Statistics, 24(6):2350–2383, 1996.
  • [18] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288, 1996.
  • [19] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton, 2015.
  • [20] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
  • [21] Tong Tong Wu, Yi Fang Chen, Trevor Hastie, Eric Sobel, and Kenneth Lange. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714–721, 2009.
  • [22] Peng Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
  • [23] Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_{1}$-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
  • [24] Ming Yuan and Yi Lin. On the nonnegative garrote estimator. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69:143–161, 2007.
  • [25] Fei Xue and Annie Qu. Variable selection for highly correlated predictors. arXiv preprint arXiv:1709.04840, 2017.
  • [26] Jinzhu Jia and Karl Rohe. Preconditioning the lasso for sign consistency. Electronic Journal of Statistics, 9(1):1150–1172, 2015.
  • [27] Xiangyu Wang and Chenlei Leng. High dimensional ordinary least squares projection for screening variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(3):589–611, 2016.
  • [28] Ryan J. Tibshirani and Jonathan Taylor. The solution path of the generalized lasso. Annals of Statistics, 39(3):1335–1371, 2011.
  • [29] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28(5):1302–1338, 2000.