Sign Consistency of the Generalized Elastic Net Estimator

Wencan Zhu, Eric Adjakossa, Céline Lévy-Leduc, Nils Ternès

Abstract

In this paper, we propose a novel variable selection approach in the framework of high-dimensional linear models where the columns of the design matrix are highly correlated. It consists in rewriting the initial high-dimensional linear model to remove the correlation between the columns of the design matrix and in applying a generalized Elastic Net criterion since it can be seen as an extension of the generalized Lasso. The properties of our approach called gEN (generalized Elastic Net) are investigated both from a theoretical and a numerical point of view. More precisely, we provide a new condition called GIC (Generalized Irrepresentable Condition) which generalizes the EIC (Elastic Net Irrepresentable Condition) of [1] under which we prove that our estimator can recover the positions of the null and non null entries of the coefficients when the sample size tends to infinity. We also assess the performance of our methodology using synthetic data and compare it with alternative approaches. Our numerical experiments show that our approach improves the variable selection performance in many cases.

Key words: Lasso; Model selection consistency; Irrepresentable Condition; Generalized Lasso; Elastic Net.

1 Introduction

Variable selection has become an important and actively used task for understanding or predicting an outcome of interest in many fields such as medicine [2, 3, 4, 5], social media [6, 7, 8], or finance [9, 10, 11]. Through decades, numerous variable selection methods have been developed such as subset selection [12] or regularization techniques [13].

Subset selection methods achieve sparsity by selecting the best subset of relevant variables using the Akaike information criterion [14] or the Bayesian information criterion [15] but are shown to be NP-hard and could be unstable in practice [16, 17].

The regularized variable selection techniques have become popular for their capability to overcome the above difficulties [18, 19, 20, 21]. Among them, the Lasso approach [18] is one of the most popular and can be defined as follows. Let $\displaystyle\mathbf{y}$ satisfy the following linear model

\mathbf{y}=\mathbf{X}\boldsymbol{\beta}^{\star}+\boldsymbol{\epsilon},

(1)

where $\displaystyle\mathbf{y}=(y_{1},\ldots,y_{n})^{{}^{\prime}}\in\mathbb{R}^{n}$ is the response variable, ^′ denoting the transposition, $\displaystyle\mathbf{X}=(\mathbf{X}_{1},\ldots,\mathbf{X}_{p})$ is the design matrix with $\displaystyle n$ rows of observations on $\displaystyle p$ covariates, $\displaystyle\boldsymbol{\beta}^{\star}=(\beta_{1}^{\star},\ldots,\beta_{p}^{\star})^{{}^{\prime}}\in\mathbb{R}^{p}$ is a sparse vector, namely contains a lot of null components, and $\displaystyle\boldsymbol{\epsilon}$ is a Gaussian vector with zero-mean and a covariance matrix equal to $\displaystyle\sigma^{2}\mathbb{I}_{n}$ , $\displaystyle\mathbb{I}_{n}$ denoting the identity matrix in $\displaystyle\mathbb{R}^{n}$ . The Lasso approach estimates $\displaystyle\boldsymbol{\beta}^{\star}$ with a sparsity enforcing constraint by minimizing the following penalized least-squares criterion:

L^{Lasso}_{\lambda}(\boldsymbol{\beta})=\left\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\right\rVert_{2}^{2}+\lambda\left\lVert\boldsymbol{\beta}\right\rVert_{1},

(2)

where $\displaystyle\left\lVert a\right\rVert_{1}=\sum_{k=1}^{p}|a_{k}|$ denotes the $\displaystyle\ell_{1}$ norm of the vector $\displaystyle(a_{1},\dots,a_{p})^{\prime}$ , $\displaystyle\left\lVert b\right\rVert_{2}^{2}=\sum_{k=1}^{n}b_{k}^{2}$ denotes the $\displaystyle\ell_{2}$ norm of the vector $\displaystyle(b_{1},\dots,b_{n})^{\prime}$ , and $\displaystyle\lambda$ is a positive constant corresponding to the regularization parameter. The Lasso popularity largely comes from the fact that the resulting estimator

\displaystyle\widehat{\boldsymbol{\beta}}^{Lasso}(\lambda)=\operatorname*{Argmin}_{\boldsymbol{\beta}\in\mathbb{R}^{p}}L^{Lasso}_{\lambda}(\boldsymbol{\beta})

is sparse (has only a few nonzero entries), and sparse models are often preferred for their interpretability [22]. Moreover, $\displaystyle\widehat{\boldsymbol{\beta}}^{Lasso}(\lambda)$ can be proved to be sign consistent under some assumptions, namely there exists $\displaystyle\lambda$ such that

\displaystyle\lim_{n\to\infty}\mathbb{P}\left(sign\left(\widehat{\boldsymbol{\beta}}^{Lasso}(\lambda)\right)=sign(\boldsymbol{\beta}^{\star})\right)=1,

where $\displaystyle\textrm{sign}(x)=1$ if $\displaystyle x>0$ , -1 if $\displaystyle x<0$ and 0 if $\displaystyle x=0$ . Before giving the conditions under which [22] prove the sign consistency of $\displaystyle\widehat{\boldsymbol{\beta}}^{Lasso}$ , we first introduce some notations. Without loss of generality, we shall assume as in [22] that the first $\displaystyle q$ components of $\displaystyle\boldsymbol{\beta}^{\star}$ are non null (i.e. the components that are associated to the active variables, and denoted as $\displaystyle\boldsymbol{\beta}_{1}^{\star}$ ) and the last $\displaystyle p-q$ components of $\displaystyle\boldsymbol{\beta}^{\star}$ are null (i.e. the components that are associated to the non active variables, and denoted as $\displaystyle\boldsymbol{\beta}_{2}^{\star}$ ). Moreover, we shall denote by $\displaystyle\mathbf{X}_{1}$ (resp. $\displaystyle\mathbf{X}_{2}$ ) the first $\displaystyle q$ (resp. the last $\displaystyle p-q$ ) columns of $\displaystyle\mathbf{X}$ . Hence, $\displaystyle C_{n}=n^{-1}\mathbf{X}^{\prime}\mathbf{X}$ , which is the empirical covariance matrix of the covariates, can be rewritten as follows:

\displaystyle C_{n}=\begin{bmatrix}C_{11}^{n}&C_{12}^{n}\\ C_{21}^{n}&C_{22}^{n}\end{bmatrix},

with $\displaystyle C_{11}^{n}=n^{-1}\mathbf{X}_{1}^{{}^{\prime}}\mathbf{X}_{1}$ , $\displaystyle C_{12}^{n}=n^{-1}\mathbf{X}_{1}^{{}^{\prime}}\mathbf{X}_{2}$ , $\displaystyle C_{21}^{n}=n^{-1}\mathbf{X}_{2}^{{}^{\prime}}\mathbf{X}_{1}$ , $\displaystyle C_{22}^{n}=n^{-1}\mathbf{X}_{2}^{{}^{\prime}}\mathbf{X}_{2}$ . It is proved by Zhao and Yu in [22] that $\displaystyle\widehat{\boldsymbol{\beta}}^{Lasso}(\lambda)$ is sign consistent when the following Irrepresentable Condition (IC) is satisfied:

\left|\left(C_{21}^{n}(C_{11}^{n})^{-1}\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right|\leq 1-\alpha,\textrm{ for all }j,

(3)

where $\displaystyle\alpha$ is a positive constant. In the case where $\displaystyle p\gg n$ , Wainwright develops in [23] the necessary and sufficient conditions, for both deterministic and random designs, on $\displaystyle p$ , $\displaystyle q$ , and $\displaystyle n$ for which it is possible to recover the positions of the null and non null components of $\displaystyle\boldsymbol{\beta}^{\star}$ , namely its support, using the Lasso.

When there are high correlations between covariates, especially the active ones, the $\displaystyle C_{11}^{n}$ matrix may not be invertible, and the Lasso estimator fails to be sign consistent. To circumvent this issue, Zou and Hastie [20] introduced the Elastic Net estimator defined by:

\widehat{\boldsymbol{\beta}}^{EN}(\lambda,\eta)=\operatorname*{Argmin}_{\boldsymbol{\beta}\in\mathbb{R}^{p}}L^{EN}_{\lambda,\eta}(\boldsymbol{\beta}),

(4)

where

\displaystyle L^{EN}_{\lambda,\eta}(\boldsymbol{\beta})=\left\lVert\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\right\rVert_{2}^{2}+\lambda\left\lVert\boldsymbol{\beta}\right\rVert_{1}+\eta\left\lVert\boldsymbol{\beta}\right\rVert_{2}\textrm{ with }\lambda,\eta>0.

Yuan and Lin prove in [24] that when the following Elastic Net Condition (EIC) is satisfied the Elastic Net estimator defined by (4) is sign consistent when $\displaystyle p$ and $\displaystyle q$ are fixed: there exist positive $\displaystyle\lambda$ and $\displaystyle\eta$ such that

\left|\left(C_{21}^{n}\left(C_{11}^{n}+\dfrac{\eta}{n}\mathbb{I}_{q}\right)^{-1}\left(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\dfrac{2\eta}{\lambda}\boldsymbol{\beta}_{1}^{\star}\right)\right)_{j}\right|\leq 1-\alpha,\textrm{ for all }j.

(5)

Moreover, when $\displaystyle p$ , $\displaystyle q$ , and $\displaystyle n$ go to infinity with $\displaystyle p\gg n$ , Jia and Yu prove in [1] that the sign consistency of the Elastic Net estimator holds if additionally to Condition (5) $\displaystyle n$ goes to infinity at a rate faster than $\displaystyle q\log(p-q)$ .

In the case where the active and non active covariates are highly correlated, IC (3) and EIC (5) may be violated. To overcome this issue several approaches were proposed: the Standard PArtial Covariance (SPAC) method [25] and preconditioning approaches among others. Xue and Qu [25] developed the so-called SPAC-Lasso which enjoys strong sign consistency in both finite-dimensional ( $\displaystyle p<n$ ) and high-dimensional ( $\displaystyle p\gg n$ ) settings. However, the authors mentioned that the SPAC-Lasso method only selects the active variables that are not highly correlated to the non active ones, which may be a weakness of this approach. The preconditioning approaches consist in transforming the given data $\displaystyle\mathbf{X}$ and $\displaystyle\mathbf{y}$ before applying the Lasso criterion. For example, [26] and [27] proposed to left-multiply $\displaystyle\mathbf{X}$ , $\displaystyle\mathbf{y}$ and thus $\displaystyle\boldsymbol{\epsilon}$ in Model (1) by specific matrices to remove the correlations between the columns of $\displaystyle\mathbf{X}$ . A major drawback of the latter approach, called HOLP (High dimensional Ordinary Least squares Projection), is that the preconditioning step may increase the variance of the error term and thus may alter the variable selection performance.

Recently, [5] proposed another strategy under the following assumption:

(A1)

$\displaystyle\mathbf{X}$ is assumed to be a random design matrix such that its rows $\displaystyle(\boldsymbol{x}_{i})_{1\leq i\leq n}$ are i.i.d. zero-mean Gaussian random vectors having a covariance matrix equal to $\displaystyle\boldsymbol{\Sigma}$ .

More precisely, they propose to rewrite Model (1) in order to remove the correlation existing between the columns of $\displaystyle\mathbf{X}$ . Let $\displaystyle\boldsymbol{\Sigma}^{-1/2}:=\boldsymbol{U}\boldsymbol{D}^{-1/2}\boldsymbol{U}^{T}$ where $\displaystyle\boldsymbol{U}$ and $\displaystyle\boldsymbol{D}$ are the matrices involved in the spectral decomposition of the symmetric matrix $\displaystyle\boldsymbol{\Sigma}$ given by: $\displaystyle\boldsymbol{\Sigma}=\boldsymbol{U}\boldsymbol{D}\boldsymbol{U}^{T}$ , then, denoting $\displaystyle\widetilde{\mathbf{X}}=\mathbf{X}\boldsymbol{\Sigma}^{-1/2}$ , (1) can be rewritten as follows:

\mathbf{y}=\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\beta}}^{\star}+\boldsymbol{\epsilon},

(6)

where $\displaystyle\widetilde{\boldsymbol{\beta}}^{\star}=\boldsymbol{\Sigma}^{1/2}\boldsymbol{\beta}^{\star}:=\boldsymbol{U}\boldsymbol{D}^{1/2}\boldsymbol{U}^{T}\boldsymbol{\beta}^{\star}$ . With such a transformation, the covariance matrix of the $\displaystyle n$ rows of $\displaystyle\widetilde{\mathbf{X}}$ is equal to identity and the columns of $\displaystyle\widetilde{\mathbf{X}}$ are thus uncorrelated. The advantage of such a transformation with respect to the preconditioning approach proposed by [27] is that the error term $\displaystyle\boldsymbol{\epsilon}$ is not modified thus avoiding an increase of the noise which can overwhelm the benefits of a well conditioned design matrix. Their approach then consists in minimizing the following criterion with respect to $\displaystyle\widetilde{\boldsymbol{\beta}}$ :

\left\lVert\mathbf{y}-\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}+\lambda\left\lVert\Sigma^{-1/2}\widetilde{\boldsymbol{\beta}}\right\rVert_{1},

(7)

where $\displaystyle\widetilde{\mathbf{X}}=\mathbf{X}\boldsymbol{\Sigma}^{-1/2}$ in order to ensure a sparse estimation of $\displaystyle\boldsymbol{\beta}^{\star}$ thanks to the penalization by the $\displaystyle\ell_{1}$ norm. This criterion actually boils down to the Generalized Lasso proposed by [28]:

L_{\lambda}^{genlasso}(\widetilde{\boldsymbol{\beta}})=\left\lVert\mathbf{y}-\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}+\lambda\left\lVert D\widetilde{\boldsymbol{\beta}}\right\rVert_{1},\textrm{ with }\lambda>0

(8)

and $\displaystyle D=\Sigma^{-1/2}$ .

Since, as explained in [28], some problems may occur when the rank of the design matrix is not full, we will consider in this paper the following criterion:

L_{\lambda,\eta}^{gEN}(\widetilde{\boldsymbol{\beta}})=\left\lVert\mathbf{y}-\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}+\lambda\left\lVert\Sigma^{-1/2}\widetilde{\boldsymbol{\beta}}\right\rVert_{1}+\eta\left\lVert\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2},\textrm{ with }\lambda,\eta>0.

(9)

Since it consists in adding an $\displaystyle L_{2}$ penalty part to the Generalized Lasso as in the Elastic Net, we will call it generalized Elastic Net (gEN). We prove in Section 2 that under Assumption (A1) and the Generalized Irrepresentable Condition (GIC) (12) given below among others, $\displaystyle\widehat{\boldsymbol{\beta}}$ is a sign-consistent estimator of $\displaystyle\boldsymbol{\beta}^{\star}$ where $\displaystyle\widehat{\boldsymbol{\beta}}$ is defined by

\widehat{\boldsymbol{\beta}}=\boldsymbol{\Sigma}^{-1/2}\widehat{\widetilde{\boldsymbol{\beta}}},

(10)

with

\widehat{\widetilde{\boldsymbol{\beta}}}=\operatorname*{Argmin}_{\widetilde{\boldsymbol{\beta}}}L_{\lambda,\eta}^{gEN}\left(\widetilde{\boldsymbol{\beta}}\right),

(11)

$\displaystyle L_{\lambda,\eta}^{gEN}\left(\widetilde{\boldsymbol{\beta}}\right)$ being defined in Equation (9). The Generalized Irrepresentable Condition (GIC) can be stated as follows: There exist $\displaystyle\lambda,\eta,\alpha,\delta_{4}>0$ such that for all $\displaystyle j$ ,

\mathbb{P}\left(\left|\left((C_{21}^{n}+\frac{\eta}{n}\Sigma_{21})(C_{11}^{n}+\frac{\eta}{n}\Sigma_{11})^{-1}\left(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\Sigma_{21}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right|\leq 1-\alpha\right)=1-o\left(e^{-n^{\delta_{4}}}\right).

(12)

Note that GIC coincides with EIC when $\displaystyle\mathbf{X}$ is not random and $\displaystyle\Sigma=\mathbb{I}_{p}$ . Moreover, GIC does not require $\displaystyle C_{11}^{n}$ to be invertible. Since EIC and IC are both particular cases of GIC, if the IC or EIC holds, then there exist $\displaystyle\lambda$ or $\displaystyle\eta$ such that the GIC holds.

The rest of the paper is organized as follows. Section 2 is devoted to the theoretical results of the paper. More precisely, we prove that under some mild conditions $\displaystyle\widehat{\boldsymbol{\beta}}$ defined in (10) is a sign-consistent estimator of $\displaystyle\boldsymbol{\beta}^{\star}$ . To support our theoretical results, some numerical experiments are presented in Section 3. The proofs of our theoretical results can be found in Section 5.

2 Theoretical results

The goal of this section is to establish the sign consistency of the Generalized Elastic Net estimator defined in (10). To prove this result, we shall use the following lemma.

Lemma 2.1.

Let $\displaystyle\mathbf{y}$ satisfying Model (1) under Assumption (A1) and $\displaystyle\widehat{\boldsymbol{\beta}}$ be defined in (10). Then,

\mathbb{P}\left(sign\left(\widehat{\boldsymbol{\beta}}\right)=sign(\boldsymbol{\beta}^{\star})\right)\geq\mathbb{P}\left(A_{n}\cap B_{n}\right),

(13)

where

\displaystyle A_{n}:=\left\{\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)\right|<\sqrt{n}\left(\left|\boldsymbol{\beta}_{1}^{\star}\right|-\frac{\lambda}{2n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})\right|-\frac{\eta}{n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right|\right)\right\},

B_{n}:=\left\{\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)-W_{n}(2)\right|\leq\frac{\lambda}{2\sqrt{n}}\right.\\ \left.-\frac{\lambda}{2\sqrt{n}}\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(sign(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right|\right\},

and

\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}=C_{11}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{11},\;\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}=C_{21}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{21},\;W_{n}=\dfrac{1}{\sqrt{n}}\mathbf{X}^{\prime}\boldsymbol{\epsilon}=\begin{bmatrix}W_{n}(1)\\ W_{n}(2)\end{bmatrix},

(14)

with

\displaystyle W_{n}(1)=\dfrac{1}{\sqrt{n}}\mathbf{X}^{\prime}_{1}\boldsymbol{\epsilon}\textrm{ and }W_{n}(2)=\dfrac{1}{\sqrt{n}}\mathbf{X}^{\prime}_{2}\boldsymbol{\epsilon}.

The proof of Lemma 2.1 is given in Section 5.

The following theorem gives the conditions under which the sign consistency of the generalized Elastic Net estimator $\displaystyle\widehat{\boldsymbol{\beta}}$ defined in (10) holds.

Theorem 2.2.

Assume that $\displaystyle\mathbf{y}$ satisfies Model (1) under Assumption (A1) with $\displaystyle p=p_{n}$ is such that $\displaystyle p_{n}\exp\left(n^{-\delta}\right)$ tends to 0 as $\displaystyle n$ tends to infinity for all positive $\displaystyle\delta$ . Assume also that there exist some positive constants $\displaystyle M_{1}$ , $\displaystyle M_{2}$ , $\displaystyle M_{3}$ and $\displaystyle\alpha$ satisfying

M_{1}<\frac{\boldsymbol{\beta}_{\min}^{2}}{9\sigma^{2}}\quad\textrm{ and }\quad\dfrac{\sqrt{2+\sqrt{2}}\sqrt{M_{3}}\sigma}{\alpha}<\dfrac{\boldsymbol{\beta}_{\min}}{3M_{2}\sqrt{q}},

(15)

and that there exist $\displaystyle\lambda>0$ and $\displaystyle\eta>0$ such that (12) and

\dfrac{\lambda}{n}<\dfrac{2\boldsymbol{\beta}_{\min}}{3M_{2}\sqrt{q}},

(16)

\dfrac{\lambda}{n}\geq\dfrac{2\sqrt{2+\sqrt{2}}\sqrt{M_{3}}\sigma}{\alpha},

(17)

\dfrac{\eta}{n}<\dfrac{1}{3M_{2}\lambda_{\max}\left(\boldsymbol{\Sigma}_{11}\right)}\times\dfrac{\boldsymbol{\beta}_{\min}}{\left\|\boldsymbol{\beta}_{1}^{\star}\right\|_{2}},

(18)

hold as $\displaystyle n$ tends to infinity, where $\displaystyle\boldsymbol{\beta}_{\min}=\min_{1\leq j\leq q}\left|\left(\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right|$ . Suppose also that there exist some positive constants $\displaystyle\delta_{1}$ , $\displaystyle\delta_{2}$ , $\displaystyle\delta_{3}$ such that, as $\displaystyle n\rightarrow\infty$ ,

\mathbb{P}\left(\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)\leq M_{1}\right)=1-o\left(e^{-n^{\delta_{1}}}\right),

(19)

\mathbb{P}\left(\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\leq M_{2}\right)=1-o\left(e^{-n^{\delta_{2}}}\right),

(20)

\mathbb{P}\left(\lambda_{\max}\left(H_{B}H^{\prime}_{B}\right)\leq M_{3}\right)=1-o\left(e^{-n^{\delta_{3}}}\right),

(21)

where $\displaystyle\lambda_{\max}(A)$ denotes the largest eigenvalue of $\displaystyle A$ ,

\displaystyle H_{A}=\frac{1}{\sqrt{n}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\mathbf{X}^{\prime}_{1}\textrm{ and }H_{B}=\dfrac{1}{\sqrt{n}}\left(\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\mathbf{X}^{\prime}_{1}-\mathbf{X}^{\prime}_{2}\right),

$\displaystyle\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}$ and $\displaystyle\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}$ being defined in (14) and $\displaystyle\mathbf{X}_{1}$ (resp. $\displaystyle\mathbf{X}_{2}$ ) denoting the first $\displaystyle q$ (resp. the last $\displaystyle p-q$ ) columns of $\displaystyle\mathbf{X}$ . Then,

\displaystyle\mathbb{P}\left(sign\left(\widehat{\boldsymbol{\beta}}\right)=sign(\boldsymbol{\beta}^{\star})\right)\rightarrow 1,\textrm{ as }n\rightarrow\infty,

where $\displaystyle\widehat{\boldsymbol{\beta}}$ is defined in (10).

Note that Conditions (16) and (17) are consistent thanks to (15).

The proof of Theorem 2.2 is given in Section 5 and a discussion on the assumptions of Theorem 2.2 is provided in Section 3.

3 Numerical experiments

The goal of this section is to discuss the assumptions and illustrate the results of Theorem 2.2. For this, we generated datasets from Model (1) where the matrix $\displaystyle\boldsymbol{\Sigma}$ appearing in (A1) is defined by

\boldsymbol{\Sigma}=\begin{bmatrix}\boldsymbol{\Sigma}_{11}&\boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{12}^{\prime}&\boldsymbol{\Sigma}_{22}\end{bmatrix}.\vspace{-1mm}

(22)

In (22), $\displaystyle\boldsymbol{\Sigma}_{11}$ is the correlation matrix of the active variables having its off-diagonal entries equal to $\displaystyle\alpha_{1}$ , $\displaystyle\boldsymbol{\Sigma}_{22}$ is the correlation matrix of the non active variables having its off-diagonal entries equal to $\displaystyle\alpha_{3}$ and $\displaystyle\boldsymbol{\Sigma}_{12}$ is the correlation matrix between the active and the non active variables with entries equal to $\displaystyle\alpha_{2}$ . In the numerical experiments, $\displaystyle(\alpha_{1},\alpha_{2},\alpha_{3})=(0.3,0.5,0.7)$ . Moreover, $\displaystyle\boldsymbol{\beta}^{\star}$ appearing in Model (1) has $\displaystyle q$ non zero components which are equal to $\displaystyle b$ and $\displaystyle\sigma=1$ . The number of predictors $\displaystyle p$ is equal to 200, 400, or 600 and the sample size $\displaystyle n$ takes the same values for each value of $\displaystyle p$ .

3.1 Discussion on the assumptions of Theorem 2.2

We first show that GIC defined in (12) can be satisfied even when EIC and IC, defined in (5) and (3) respectively, are not fulfilled. For this, we computed for different values of $\displaystyle\lambda$ and $\displaystyle\eta$ the following values:

		$\displaystyle\displaystyle\textrm{IC}=\max_{j}\left(\left\|\left(C_{21}^{n}(C_{11}^{n})^{-1}(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right\|\right)$
		$\displaystyle\displaystyle\textrm{EIC}=\min_{\lambda,\eta}\max_{j}\left(\left\|\left(C_{21}^{n}(C_{11}^{n}+\frac{\eta}{n}\mathbb{I}_{q})^{-1}(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right\|\right)$
		$\displaystyle\displaystyle\textrm{GIC}=\min_{\lambda,\eta}\max_{j}\left(\left\|\left((C_{21}^{n}+\frac{\eta}{n}\Sigma_{21})(C_{11}^{n}+\frac{\eta}{n}\Sigma_{11})^{-1}\left(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\Sigma_{21}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right\|\right)$		(23)

and Figure 1 displays the boxplots of these criteria obtained from 100 replications. We can see from these figures that in all the considered cases GIC is satisfied (i.e. all values are smaller than 1) whereas EIC and IC are not. The values of $\displaystyle p$ and $\displaystyle n$ do not seem to have a big impact on EIC and IC. However, contrary to $\displaystyle p$ , $\displaystyle n$ seems to have an influence on GIC which increases with $\displaystyle n$ when $\displaystyle b=1$ and decreases when $\displaystyle n$ increases when $\displaystyle b=10$ .

Refer to caption — Figure 1: Boxplot of values defined in (3.1) and obtained from 100 replications.

Figures 2 and 3 show the behavior of $\displaystyle\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)$ , $\displaystyle\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)$ and $\displaystyle\lambda_{\max}\left(H_{B}H^{\prime}_{B}\right)$ appearing in (19), (20) and (21) with respect to $\displaystyle\eta$ for different values of $\displaystyle n$ , $\displaystyle p$ and for $\displaystyle q=5$ or 10. These plots thus provide lower bounds for $\displaystyle M_{1}$ , $\displaystyle M_{2}$ and $\displaystyle M_{3}$ appearing in the previous equations. Observe that (18) can be rewritten as:

\eta M_{2}<\dfrac{n}{3\lambda_{\max}\left(\boldsymbol{\Sigma}_{11}\right)}\times\dfrac{\boldsymbol{\beta}_{\min}}{\left\|\boldsymbol{\beta}_{1}^{\star}\right\|_{2}}.

(24)

Based on the plots at the bottom right of Figures 2 and 3, we can see that there exist $\displaystyle\eta$ ’s satisfying Condition 24 and thus (18) and that the interval in which the adapted $\displaystyle\eta$ ’s lie is larger when $\displaystyle q=5$ than when $\displaystyle q=10$ .

Based on the average of $\displaystyle M_{1}$ previously obtained, the left part of (15) is always satisfied as soon as $\displaystyle b>\sqrt{18}$ . Based on the average of $\displaystyle M_{2}$ and $\displaystyle M_{3}$ previously obtained, the average of left-hand side and of the right-hand side of the right part of Equation (15) are displayed in Figures 4 and 5. We can see from these figures that it is only satisfied for large values of $\displaystyle b$ . Moreover, it is more often satisfied when $\displaystyle q=5$ than for $\displaystyle q=10$ .

We will show in the next section that even if the cases where all the conditions of the theorem are not fulfilled our method is robust enough to outperform the Elastic Net defined in (4) even in these cases.

3.2 Comparison with other methods

To assess the performance of our approach (gEN) in terms of sign-consistency with respect to other methods and to illustrate the results of Theorem 2.2, we computed the True Positive Rate (TPR), namely the proportion of active variables selected, and the False Positive Rate (FPR), namely the proportion of non active variables selected, of the Elastic Net and gEN estimators defined in (4) and (10), respectively.

Figures 6 and 8 display the empirical mean of the largest difference between the True Positive Rate and False Positive Rate over the replications. It is obtained by selecting for each replication the value of $\displaystyle\lambda$ and $\displaystyle\eta$ achieving the largest difference between the TPR and FPR and by averaging these differences. They also display the corresponding TPR and FPR for gEN and Elastic Net for different values of $\displaystyle n$ and $\displaystyle p$ . We can see from these figures that the gEN and the Elastic Net estimators have a TPR equal to 1 but that the FPR of gEN is smaller than Elastic Net. We can see from these figures that the difference between the performance of gEN and Elastic Net is larger for high signal-to-noise ratios ( $\displaystyle b=10$ ). It has to be noticed that when TPR=1 for our approach it also means that the signs of the non null $\displaystyle\beta_{i}^{\star}$ are also properly retrieved.

4 Discussion

In this paper, we proposed a novel variable selection approach called gEN (generalized Elastic Net) in the framework of linear models where the columns of the design matrix are highly correlated and thus when the standard Lasso criterion usually fails. We proved that under mild conditions, among which the GIC, which is valid when other standard conditions like EIC or IC are not fulfilled, our method provides a sign-consistent estimator of $\displaystyle\boldsymbol{\beta}^{\star}$ . For a more thorough discussion regarding the application of our approach in practical situations, we refer the reader to [5].

5 Proofs

5.1 Proof of Lemma 2.1

Note that (9) given by:

\displaystyle L^{gEN}(\widetilde{\boldsymbol{\beta}})=\left\lVert\mathbf{y}-\widetilde{\mathbf{X}}\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}+\lambda\left\lVert\Sigma^{-1/2}\widetilde{\boldsymbol{\beta}}\right\rVert_{1}+\eta\left\lVert\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}

can be rewritten as

L^{gEN}(\widetilde{\boldsymbol{\beta}})=\left\lVert\mathbf{y}^{*}-\widetilde{\mathbf{X}}{*}\widetilde{\boldsymbol{\beta}}\right\rVert_{2}^{2}+\lambda\left\lVert\boldsymbol{\Sigma}^{-1/2}\widetilde{\boldsymbol{\beta}}\right\rVert_{1},

where

\mathbf{y}^{*}=\begin{pmatrix}\mathbf{y}\\ 0\end{pmatrix},\quad\widetilde{\mathbf{X}}^{*}=\begin{pmatrix}\widetilde{\mathbf{X}}\\ \sqrt{\eta}\mathbb{I}_{p}\end{pmatrix}.

Then, $\displaystyle\widehat{\widetilde{\boldsymbol{\beta}}}$ satisfies

\widetilde{\mathbf{X}}^{*^{\prime}}\left(\mathbf{y}^{*}-\widetilde{\mathbf{X}}{*^{\prime}}\widehat{\widetilde{\boldsymbol{\beta}}}\right)=\frac{\lambda}{2}(\boldsymbol{\Sigma}^{-1/2})^{\prime}z,

(25)

where $\displaystyle A^{\prime}$ denotes the transpose of the matrix $\displaystyle A$ , and

\begin{cases}z_{j}=sign\left((\boldsymbol{\Sigma}^{-1/2}\widehat{\widetilde{\boldsymbol{\beta}}})_{j}\right),&\text{if}\ (\boldsymbol{\Sigma}^{-1/2}\widehat{\widetilde{\boldsymbol{\beta}}})_{j}\neq 0\\ z_{j}\in[-1,1],&\text{if}\ (\boldsymbol{\Sigma}^{-1/2}\widehat{\widetilde{\boldsymbol{\beta}}})_{j}=0\end{cases}.

Equation (25) can be rewritten as:

\displaystyle\mathbf{X}^{\prime}\mathbf{y}-\left(\mathbf{X}^{\prime}\mathbf{X}+\eta\boldsymbol{\Sigma}\right)\widehat{\boldsymbol{\beta}}=\frac{\lambda}{2}z

which leads to

\displaystyle\mathbf{X}^{\prime}\mathbf{X}(\boldsymbol{\beta}^{\star}-\widehat{\boldsymbol{\beta}})+\mathbf{X}^{\prime}\boldsymbol{\epsilon}-\eta\boldsymbol{\Sigma}\widehat{\boldsymbol{\beta}}=\frac{\lambda}{2}z,

by using that $\displaystyle\mathbf{y}=\mathbf{X}\boldsymbol{\beta}^{\star}+\boldsymbol{\epsilon}$ . By using the following notations: $\displaystyle\widehat{\mathbf{u}}=\widehat{\boldsymbol{\beta}}-\boldsymbol{\beta}^{\star}$ ,

\displaystyle C_{n}=\frac{1}{n}\mathbf{X}^{\prime}\mathbf{X}\textrm{ and }W_{n}=\frac{1}{\sqrt{n}}\mathbf{X}^{\prime}\boldsymbol{\epsilon},

Equation (25) becomes

\left(C_{n}+\frac{\eta}{n}\boldsymbol{\Sigma}\right)\sqrt{n}\widehat{\mathbf{u}}+\frac{\eta}{\sqrt{n}}\boldsymbol{\Sigma}\boldsymbol{\beta}^{\star}-W_{n}=-\frac{\lambda}{2\sqrt{n}}z.

(26)

With the following notations:

\displaystyle C_{n}=\begin{pmatrix}C_{11}^{n}&C_{12}^{n}\\ C_{21}^{n}&C_{22}^{n}\end{pmatrix},\;\boldsymbol{\Sigma}=\begin{pmatrix}\boldsymbol{\Sigma}_{11}&\boldsymbol{\Sigma}_{12}\\ \boldsymbol{\Sigma}_{21}&\boldsymbol{\Sigma}_{22}\end{pmatrix},\;\widehat{\mathbf{u}}=\begin{pmatrix}\widehat{\mathbf{u}}_{1}\\ \widehat{\mathbf{u}}_{2}\end{pmatrix},\;W_{n}=\begin{pmatrix}W_{n}(1)\\ W_{n}(2)\end{pmatrix},\;\boldsymbol{\beta}^{\star}=\begin{pmatrix}\boldsymbol{\beta}_{1}^{\star}\\ \textbf{0}\end{pmatrix},

the first components of Equation (26) are:

\left(C_{11}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{11}\right)\sqrt{n}\widehat{\mathbf{u}}_{1}+\left(C_{12}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{12}\right)\sqrt{n}\widehat{\mathbf{u}}_{2}+\frac{\eta}{\sqrt{n}}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}-W_{n}(1)=-\frac{\lambda}{2\sqrt{n}}sign(\boldsymbol{\beta}_{1}^{\star}).

(27)

If $\displaystyle\widehat{\mathbf{u}}=\begin{pmatrix}\widehat{\mathbf{u}}_{1}\\ 0\end{pmatrix}$ , it can be seen as a solution of the generalized Elastic Net criterion where, by Equation (27), $\displaystyle\widehat{\mathbf{u}}_{1}$ is defined by:

\sqrt{n}\widehat{\mathbf{u}}_{1}=\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)-\frac{\eta}{\sqrt{n}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}{\boldsymbol{\beta}_{1}}^{\star}-\frac{\lambda}{2\sqrt{n}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign({\boldsymbol{\beta}_{1}}^{\star}),

(28)

where we used (14).

Note that the event $\displaystyle A_{n}$ can be rewritten as follows:

\sqrt{n}\left(-\left|\boldsymbol{\beta}_{1}^{\star}\right|+\frac{\lambda}{2n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})\right|+\frac{\eta}{n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right|\right)\\ <\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)<\\ \sqrt{n}\left(\left|\boldsymbol{\beta}_{1}^{\star}\right|-\frac{\lambda}{2n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})\right|-\frac{\eta}{n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right|\right)

(29)

which implies

\sqrt{n}\left(-\left|\boldsymbol{\beta}_{1}^{\star}\right|+\frac{\lambda}{2n}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})+\frac{\eta}{n}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)\\ <\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)<\\ \sqrt{n}\left(\left|\boldsymbol{\beta}_{1}^{\star}\right|+\frac{\lambda}{2n}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})+\frac{\eta}{n}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right),

(30)

using that $\displaystyle-|x|\leq x\leq|x|,\,\forall x\in\mathbb{R}$ . Then, by using (28), we get that $\displaystyle\sqrt{n}|\widehat{\mathbf{u}}_{1}|<\sqrt{n}|\boldsymbol{\beta}_{1}^{\star}|$ and thus $\displaystyle|\widehat{\mathbf{u}}_{1}|<|\boldsymbol{\beta}_{1}^{\star}|$ . Notice that $\displaystyle|\widehat{\mathbf{u}}_{1}|<|\boldsymbol{\beta}_{1}^{\star}|$ implies that $\displaystyle\widehat{\boldsymbol{\beta}}_{1}\neq 0$ and that $\displaystyle sign(\widehat{\boldsymbol{\beta}}_{1})=sign(\boldsymbol{\beta}_{1}^{\star})$ . Moreover, since $\displaystyle\widehat{\mathbf{u}}_{2}=0$ , we get that $\displaystyle sign(\widehat{\boldsymbol{\beta}})=sign(\boldsymbol{\beta}^{\star})$ .

The last components of (26) satisfy:

\left(C_{21}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{21}\right)\sqrt{n}\widehat{\mathbf{u}}_{1}+\left(C_{22}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{22}\right)\sqrt{n}\widehat{\mathbf{u}}_{2}+\frac{\eta}{\sqrt{n}}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}-W_{n}(2)=-\frac{\lambda}{2\sqrt{n}}z_{2},

where by (25), $\displaystyle|z_{2}|\leq 1$ . Hence,

\left|\left(C_{21}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{21}\right)\sqrt{n}\widehat{\mathbf{u}}_{1}+\frac{\eta}{\sqrt{n}}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}-W_{n}(2)\right|\leq\frac{\lambda}{2\sqrt{n}},

which can be rewritten as follows by using (28):

\left|\mathscr{C}_{21}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(W_{n}(1)-\frac{\eta}{\sqrt{n}}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}-\frac{\lambda}{2\sqrt{n}}sign(\boldsymbol{\beta}_{1}^{\star})\right)+\frac{\eta}{\sqrt{n}}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}-W_{n}(2)\right|\leq\frac{\lambda}{2\sqrt{n}}.

(31)

When the event $\displaystyle B_{n}$ is satisfied:

-\frac{\lambda}{2\sqrt{n}}+\frac{\lambda}{2\sqrt{n}}\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(sign(\boldsymbol{\beta}_{1}^{\star})-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right|\\ \leq\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)-W_{n}(2)\\ \leq\frac{\lambda}{2\sqrt{n}}-\frac{\lambda}{2\sqrt{n}}\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(sign(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right|.

(32)

By using that $\displaystyle-|x|\leq x\leq|x|$ for all $\displaystyle x$ in $\displaystyle\mathbb{R}$ , we get that it implies that

\displaystyle\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)-W_{n}(2)-\frac{\lambda}{2\sqrt{n}}\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(sign(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)+\frac{\eta}{\sqrt{n}}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right|\leq\frac{\lambda}{2\sqrt{n}},

which corresponds to (31). Thus, if $\displaystyle A_{n}$ and $\displaystyle B_{n}$ are satisfied, we get that $\displaystyle sign\left(\widehat{\boldsymbol{\beta}}\right)=sign\left(\boldsymbol{\beta}\right)$ , which concludes the proof.

5.2 Proof of Theorem 2.2

By Lemma 2.1,

\displaystyle\mathbb{P}\left(sign\left(\widehat{\boldsymbol{\beta}}\right)=sign(\boldsymbol{\beta}^{\star})\right)\geq\mathbb{P}\left(A_{n}\cap B_{n}\right)\geq 1-\mathbb{P}\left(A_{n}^{c}\right)-\mathbb{P}\left(B_{n}^{c}\right),

where $\displaystyle A_{n}^{c}$ and $\displaystyle B_{n}^{c}$ denote the complementary of $\displaystyle A_{n}$ and $\displaystyle B_{n}$ , respectively. Thus, to prove the theorem it is enough to prove that $\displaystyle\mathbb{P}\left(A_{n}^{c}\right)\to 0$ and $\displaystyle\mathbb{P}\left(B_{n}^{c}\right)\to 0$ as $\displaystyle n\to\infty$ .

Recall that

\displaystyle A_{n}:=\left\{\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)\right|<\sqrt{n}\left(\left|\boldsymbol{\beta}_{1}^{\star}\right|-\frac{\lambda}{2n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})\right|-\frac{\eta}{n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right|\right)\right\}.

Let $\displaystyle\boldsymbol{\zeta}$ and $\displaystyle\boldsymbol{\tau}$ be defined by

\displaystyle\boldsymbol{\zeta}=\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)\textrm{ and }\boldsymbol{\tau}=\sqrt{n}\left(\left|\boldsymbol{\beta}_{1}^{\star}\right|-\frac{\lambda}{2n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})\right|-\frac{\eta}{n}\left|\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right|\right).

Then,

\displaystyle\mathbb{P}(A_{n})=\mathbb{P}\left(\forall j,|\zeta_{j}|<\tau_{j}\right).

Thus,

\displaystyle\mathbb{P}(A_{n}^{c})=\mathbb{P}\left(\exists j,|\zeta_{j}|\geq\tau_{j}\right)\leq\sum_{j=1}^{q}\mathbb{P}\left(|\zeta_{j}|\geq\tau_{j}\right).

Note that

$\displaystyle\displaystyle\mathbb{P}(\|\zeta_{j}\|\geq\tau_{j})$	$\displaystyle\displaystyle=\mathbb{P}\left(\|\zeta_{j}\|\geq\sqrt{n}\left(\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|-\frac{\lambda}{2n}\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right\|-\frac{\eta}{n}\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right\|\right)\right)$
	$\displaystyle\displaystyle=\mathbb{P}\left(\|\zeta_{j}\|+\frac{\lambda}{2\sqrt{n}}\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right\|+\frac{\eta}{\sqrt{n}}\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right\|\geq\sqrt{n}\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|\right)$
	$\displaystyle\displaystyle\leq\mathbb{P}\left(\|\zeta_{j}\|\geq\sqrt{n}\frac{\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|}{3}\right)+\mathbb{P}\left(\frac{\lambda}{2\sqrt{n}}\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right\|\geq\sqrt{n}\frac{\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|}{3}\right)$
	$\displaystyle\displaystyle+\mathbb{P}\left(\frac{\eta}{\sqrt{n}}\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right\|\geq\sqrt{n}\frac{\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|}{3}\right).$	(33)

Observe that

\displaystyle\boldsymbol{\zeta}=\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)=\frac{1}{\sqrt{n}}\left(C_{11}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{11}\right)^{-1}\mathbf{X}^{\prime}_{1}\boldsymbol{\epsilon}=H_{A}\boldsymbol{\epsilon},

where

\displaystyle H_{A}=\frac{1}{\sqrt{n}}\left(C_{11}^{n}+\frac{\eta}{n}\boldsymbol{\Sigma}_{11}\right)^{-1}\mathbf{X}^{\prime}_{1},

$\displaystyle\mathbf{X}_{1}$ denoting the columns of the design matrix $\displaystyle\mathbf{X}$ associated to the $\displaystyle q$ active covariates. Thus, for all $\displaystyle j$ in $\displaystyle\{1,\dots,q\}$ ,

\displaystyle\zeta_{j}=\sum_{k=1}^{n}(H_{A})_{jk}\boldsymbol{\epsilon}_{k}.

By using the Cauchy-Schwarz inequality,

$\displaystyle\displaystyle\|\zeta_{j}\|=\left\|\sum_{k=1}^{n}(H_{A})_{jk}\boldsymbol{\epsilon}_{k}\right\|$	$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\left(\sum_{k=1}^{n}(H_{A})_{jk}^{2}\right)^{1/2}\left(\sum_{k=1}^{n}\boldsymbol{\epsilon}_{k}^{2}\right)^{1/2}$
	$\displaystyle\displaystyle=$	$\displaystyle\displaystyle\sqrt{\left(H_{A}H^{\prime}_{A}\right)_{jj}}\times\\|\boldsymbol{\epsilon}\\|_{2}$
	$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\sqrt{\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)}\times\\|\boldsymbol{\epsilon}\\|_{2}.$

Hence, the first term in the r.h.s. of (5.2) satisfies the following inequalities:

	$\displaystyle\displaystyle\mathbb{P}\left(\|\zeta_{j}\|\geq\sqrt{n}\frac{\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|}{3}\right)$	$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\mathbb{P}\left(\sqrt{\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)}\times\\|\boldsymbol{\epsilon}\\|_{2}\geq\sqrt{n}\frac{\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|}{3}\right)$		(34)
		$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\mathbb{P}\left(\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)\times\\|\boldsymbol{\epsilon}\\|_{2}^{2}\geq n\frac{(\boldsymbol{\beta}_{1}^{\star})_{j}^{2}}{9}\right).$		(34)

Since by (19), there exist $\displaystyle M_{1}>0$ and $\displaystyle\delta_{1}>0$ such that

\displaystyle\mathbb{P}\left(\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)\leq M_{1}\right)=1-o\left(e^{-n^{\delta_{1}}}\right),\textrm{ as }n\rightarrow\infty,

we have:

	$\displaystyle\displaystyle\mathbb{P}\left(\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)\times\\|\boldsymbol{\epsilon}\\|_{2}^{2}\geq n\frac{(\boldsymbol{\beta}_{1}^{\star})_{j}^{2}}{9}\right)$	$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\mathbb{P}\left(\\|\boldsymbol{\epsilon}\\|_{2}^{2}\geq n\frac{(\boldsymbol{\beta}_{1}^{\star})_{j}^{2}}{9M_{1}}\right)+\mathbb{P}\left(\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)>M_{1}\right)$
		$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\mathbb{P}\left(\dfrac{\\|\boldsymbol{\epsilon}\\|_{2}^{2}}{\sigma^{2}}\geq\dfrac{n\boldsymbol{\beta}_{\min}^{2}}{9M_{1}\sigma^{2}}\right)+o\left(e^{-n^{\delta_{1}}}\right).$

Using that $\displaystyle\dfrac{\|\boldsymbol{\epsilon}\|_{2}^{2}}{\sigma^{2}}\sim\chi^{2}(n)$ , we get, by Lemma 1 of [29], that

\mathbb{P}\left(\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)\times\|\boldsymbol{\epsilon}\|_{2}^{2}\geq n\frac{(\boldsymbol{\beta}_{1}^{\star})_{j}^{2}}{9}\right)\leq\exp\left(-\dfrac{t}{2}+\dfrac{1}{2}\sqrt{n\left(2t-n\right)}\right)+o\left(e^{-n^{\delta_{1}}}\right),

(35)

since $\displaystyle t=\frac{n\boldsymbol{\beta}_{\min}^{2}}{9M_{1}\sigma^{2}}>\dfrac{n}{2}$ using that $\displaystyle\frac{2\boldsymbol{\beta}_{\min}^{2}}{9\sigma^{2}}>M_{1}$ by (15).
By putting together Equations (34) and (35) we get

\mathbb{P}\left(|\zeta_{j}|>\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right)\leq\exp\left(-\dfrac{t}{2}+\dfrac{1}{2}\sqrt{n\left(2t-n\right)}\right)+o\left(e^{-n^{\delta_{1}}}\right),

(36)

with $\displaystyle t=\frac{n\boldsymbol{\beta}_{\min}^{2}}{9M_{1}\sigma^{2}}>\dfrac{n}{2}$ .
Let us now derive an upper bound for the second term in the r.h.s. of (5.2):

\displaystyle\mathbb{P}\left(\frac{\lambda}{2\sqrt{n}}\left|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right|\geq\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right).

By using the Cauchy-Schwarz inequality, we get that:

			$\displaystyle\displaystyle\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right\|=\left\|\sum_{k=1}^{q}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)_{jk}\left(sign(\boldsymbol{\beta}_{1}^{\star})\right)_{k}\right\|$
		$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\sqrt{\sum_{k=1}^{q}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)_{jk}^{2}}\times\\|sign(\boldsymbol{\beta}_{1}^{\star})\\|_{2}\leq\sqrt{\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)_{jj}^{-2}}\times\sqrt{q}$
		$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\times\sqrt{q}.$

Then,

	$\displaystyle\displaystyle\mathbb{P}\left(\frac{\lambda}{2\sqrt{n}}\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right\|\geq\sqrt{n}\frac{\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|}{3}\right)$	(37)
$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\mathbb{P}\left(\dfrac{\lambda}{2}\sqrt{\dfrac{q}{n}}\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\geq\sqrt{n}\frac{\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|}{3}\right)$
$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\mathbb{P}\left(\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\geq\dfrac{2n}{3\lambda\sqrt{q}}\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|\right)$
$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\mathbb{P}\left(\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\geq\dfrac{2n}{3\lambda\sqrt{q}}\boldsymbol{\beta}_{\min}\right)=o\left(e^{-n^{\delta_{2}}}\right),\textrm{ as }n\to\infty,$

since $\displaystyle\dfrac{2n}{3\lambda\sqrt{q}}\boldsymbol{\beta}_{\min}>M_{2}$ by (16). Let us now derive an upper bound for the third term in the r.h.s. of (5.2):

\displaystyle\mathbb{P}\left(\frac{\eta}{\sqrt{n}}\left|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right|>\sqrt{n}\frac{\left|(\boldsymbol{\beta}_{1}^{\star})_{j}\right|}{3}\right).

We have

			$\displaystyle\displaystyle\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right\|=\left\|\sum_{k=1}^{q}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\right)_{jk}\left(\boldsymbol{\beta}_{1}^{\star}\right)_{k}\right\|\leq\sqrt{\sum_{k=1}^{q}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\right)_{jk}^{2}}\times\left\\|\boldsymbol{\beta}_{1}^{\star}\right\\|_{2}$
		$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\sqrt{\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}^{2}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)}\times\left\\|\boldsymbol{\beta}_{1}^{\star}\right\\|_{2}\leq\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\lambda_{\max}\left(\boldsymbol{\Sigma}_{11}\right)\times\left\\|\boldsymbol{\beta}_{1}^{\star}\right\\|_{2}.$

Thus,

	$\displaystyle\displaystyle\mathbb{P}\left(\frac{\eta}{\sqrt{n}}\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right\|\geq\sqrt{n}\frac{\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|}{3}\right)$	(38)
$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\mathbb{P}\left(\frac{\eta}{\sqrt{n}}\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\lambda_{\max}\left(\boldsymbol{\Sigma}_{11}\right)\left\\|\boldsymbol{\beta}_{1}^{\star}\right\\|_{2}\geq\sqrt{n}\frac{\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|}{3}\right)$
$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\mathbb{P}\left(\lambda_{\max}\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\right)\geq\dfrac{n\boldsymbol{\beta}_{\min}}{3\eta\left\\|\boldsymbol{\beta}_{1}^{\star}\right\\|_{2}\lambda_{\max}\left(\boldsymbol{\Sigma}_{11}\right)}\right)=o\left(e^{-n^{\delta_{2}}}\right),\textrm{ as }n\to\infty,$

since $\displaystyle\dfrac{n\boldsymbol{\beta}_{\min}}{3\eta\left\|\boldsymbol{\beta}_{1}^{\star}\right\|_{2}\lambda_{\max}\left(\boldsymbol{\Sigma}_{11}\right)}>M_{2}$ by (18).
By putting together Equations (36), (37) and (38), we get:

\displaystyle\displaystyle\mathbb{P}\left(A_{n}^{c}\right)

\displaystyle\displaystyle\leq

\displaystyle\displaystyle q\exp\left[-\dfrac{n}{2}\left(\kappa-\sqrt{2\kappa-1}\right)\right]+q\times o\left(e^{-n^{\delta_{1}}}\right)+2q\times o\left(e^{-n^{\delta_{2}}}\right),

(39)

with $\displaystyle\kappa=\frac{\boldsymbol{\beta}_{\min}^{2}}{9M_{1}\sigma^{2}}$ . Note that $\displaystyle\kappa-\sqrt{2\kappa-1}>0$ since $\displaystyle\kappa=\frac{\boldsymbol{\beta}_{\min}^{2}}{9M_{1}\sigma^{2}}>1$ by (15). Equation (39) then implies that

\displaystyle\mathbb{P}\left(A_{n}^{c}\right)\rightarrow 0\textrm{ as }n\rightarrow\infty.

Let us now prove that $\displaystyle\mathbb{P}\left(B_{n}^{c}\right)\rightarrow 0\textrm{ as }n\rightarrow\infty.$
Recall that

B_{n}:=\left\{\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)-W_{n}(2)\right|\leq\frac{\lambda}{2\sqrt{n}}\right.\\ \left.-\frac{\lambda}{2\sqrt{n}}\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(sign(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right|\right\}.

Let

\displaystyle\boldsymbol{\psi}=\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}W_{n}(1)-W_{n}(2)=\frac{1}{\sqrt{n}}\left(\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\mathbf{X}^{\prime}_{1}-\mathbf{X}^{\prime}_{2}\right)\boldsymbol{\epsilon}=:H_{B}\boldsymbol{\epsilon}

and

\displaystyle\boldsymbol{\mu}=\frac{\lambda}{2\sqrt{n}}-\frac{\lambda}{2\sqrt{n}}\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(sign(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right|.

Then,

\displaystyle\mathbb{P}(B_{n}^{c})=\mathbb{P}(\exists j,|\psi_{j}|>\mu_{j})\leq\sum_{j=q+1}^{p}\mathbb{P}(|\psi_{j}|>\mu_{j}).

By using the Cauchy-Schwarz inequality, we get that:

|\psi_{j}|=\left|\sum_{k=1}^{n}(H_{B})_{jk}\boldsymbol{\epsilon}_{k}\right|\leq\left(\sum_{k=1}^{n}(H_{B})_{jk}^{2}\right)^{1/2}\times\|\boldsymbol{\epsilon}\|_{2}=\sqrt{\left(H_{B}H^{\prime}_{B}\right)_{jj}}\times\|\boldsymbol{\epsilon}\|_{2}\leq\sqrt{\lambda_{\max}(H_{B}H^{\prime}_{B})}\times\|\boldsymbol{\epsilon}\|_{2},

(40)

where

\displaystyle H_{B}H^{\prime}_{B}=\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}C_{11}^{n}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\mathscr{C}_{12}^{n,\boldsymbol{\Sigma}}-\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}C_{12}^{n}-C_{21}^{n}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\mathscr{C}_{12}^{n,\boldsymbol{\Sigma}}+C_{22}^{n}.

By (21), there exist $\displaystyle M_{3}>0$ and $\displaystyle\delta_{3}>0$ such that

\displaystyle\mathbb{P}\left(\lambda_{\max}\left(H_{B}H^{\prime}_{B}\right)\leq M_{3}\right)=1-o\left(e^{-n^{\delta_{3}}}\right),\textrm{ as }n\to\infty.

By the GIC condition (12), there exist $\displaystyle\alpha>0$ and $\displaystyle\delta_{4}>0$ such that for all $\displaystyle j$ ,

\displaystyle\mathbb{P}\left(\left|\mathscr{C}_{21}^{n,\boldsymbol{\Sigma}}\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\left(sign(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\boldsymbol{\Sigma}_{21}\boldsymbol{\beta}_{1}^{\star}\right|\leq 1-\alpha\right)=1-o\left(e^{-n^{\delta_{4}}}\right).

Thus, we get that:

$\displaystyle\displaystyle\mathbb{P}(B_{n}^{c})$	$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\sum_{j=q+1}^{p}\mathbb{P}\left(\|\psi_{j}\|>\mu_{j}\right)$	(41)
	$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\sum_{j=q+1}^{p}\mathbb{P}\left(\|\psi_{j}\|>\dfrac{\lambda\alpha}{2\sqrt{n}}\right)+(p-q)o\left(e^{-n^{\delta_{4}}}\right)$
	$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\sum_{j=q+1}^{p}\mathbb{P}\left(\sqrt{\lambda_{\max}(H_{B}H^{\prime}_{B})}\times\\|\boldsymbol{\epsilon}\\|_{2}>\dfrac{\lambda\alpha}{2\sqrt{n}}\right)+(p-q)o\left(e^{-n^{\delta_{4}}}\right),\,\text{using Equation~{}(\ref{eq:Bc1})}$
	$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\sum_{j=q+1}^{p}\mathbb{P}\left(\lambda_{\max}(H_{B}H^{\prime}_{B})\times\\|\boldsymbol{\epsilon}\\|_{2}^{2}>\dfrac{\lambda^{2}\alpha^{2}}{4n}\right)+(p-q)o\left(e^{-n^{\delta_{4}}}\right)$
	$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\sum_{j=q+1}^{p}\mathbb{P}\left(\dfrac{\\|\boldsymbol{\epsilon}\\|_{2}^{2}}{\sigma^{2}}>\dfrac{\lambda^{2}\alpha^{2}}{4nM_{3}\sigma^{2}}\right)+(p-q)o\left(e^{-n^{\delta_{3}}}\right)+(p-q)o\left(e^{-n^{\delta_{4}}}\right)$
	$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle(p-q)\exp\left(-\dfrac{s}{2}+\dfrac{1}{2}\sqrt{n(2s-n)}\right)+(p-q)o\left(e^{-n^{\delta_{3}}}\right)+(p-q)o\left(e^{-n^{\delta_{4}}}\right)$
	$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle(p-q)\exp\left(-\dfrac{n}{2}\left(\dfrac{s}{n}-\sqrt{2\dfrac{s}{n}-1}\right)\right)+(p-q)o\left(e^{-n^{\delta_{3}}}\right)+(p-q)o\left(e^{-n^{\delta_{4}}}\right)$
	$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle(p-q)\exp\left(-\dfrac{n}{2}\right)+(p-q)o\left(e^{-n^{\delta_{3}}}\right)+(p-q)o\left(e^{-n^{\delta_{4}}}\right)$

with $\displaystyle\frac{s}{n}=\dfrac{\lambda^{2}\alpha^{2}}{4n^{2}M_{3}\sigma^{2}}$ since $\displaystyle\dfrac{\lambda^{2}\alpha^{2}}{4n^{2}M_{3}\sigma^{2}}\geq 2+\sqrt{2}$ by (17). Finally, Equation (41) implies that

\displaystyle\mathbb{P}(B_{n}^{c})\rightarrow 0,\textrm{ as }n\rightarrow\infty,

which concludes the proof.

References

[1] Jinzhu Jia and B. Yu. On model selection consistency of the elastic net when $\displaystyle p\gg n$ . Statistica Sinica, 20, 04 2010.
[2] Wenbin Lu, Hao Zhang, and Donglin Zeng. Variable selection for optimal treatment decision. Statistical methods in medical research, 22, 11 2011.
[3] Lacey Gunter, Ji Zhu, and Susan Murphy. Variable selection for qualitative interactions in personalized medicine while controlling the family-wise error rate. Journal of biopharmaceutical statistics, 21(6):1063–1078, 2011.
[4] Xuemin Gu, Guosheng Yin, and J Jack Lee. Bayesian two-step lasso strategy for biomarker selection in personalized medicine development for time-to-event endpoints. Contemporary clinical trials, 36(2):642–650, 2013.
[5] Wencan Zhu, Céline Lévy-Leduc, and Nils Ternès. A variable selection approach for highly correlated predictors in high-dimensional genomic data. Bioinformatics, 2021. doi: 10.1093/bioinformatics/btab114. Epub ahead of print. PMID: 33617644.
[6] Zeynep Tufekci. Big questions for social media big data: Representativeness, validity and other methodological pitfalls. Proceedings of the 8th International Conference on Weblogs and Social Media, ICWSM 2014, 03 2014.
[7] Huijie Lin, Jia Jia, Liqiang Nie, Guangyao Shen, and Tat-Seng Chua. What does social media say about your stress?. In IJCAI, pages 3775–3781, 2016.
[8] Theodore S Tomeny, Christopher J Vargo, and Sherine El-Toukhy. Geographic and demographic correlates of autism-related anti-vaccine beliefs on twitter, 2009-15. Social science & medicine, 191:168–175, 2017.
[9] Georgios Sermpinis, Serafeim Tsoukas, and Ping Zhang. Modelling market implied ratings using lasso variable selection techniques. Journal of Empirical Finance, 48:19–35, 2018.
[10] Alessandra Amendola, Francesco Giordano, Maria Parrella, and Marialuisa Restaino. Variable selection in high-dimensional regression: A nonparametric procedure for business failure prediction. Applied Stochastic Models in Business and Industry, 33, 02 2017.
[11] Bartosz Uniejewski, Grzegorz Marcjasz, and Rafał Weron. Understanding intraday electricity markets: Variable selection and very short-term price forecasting using lasso. International Journal of Forecasting, 35(4):1533–1547, 2019.
[12] Norman R Draper and Harry Smith. Applied regression analysis, volume 326. John Wiley & Sons, 1998.
[13] Peter J Bickel, Bo Li, Alexandre B Tsybakov, Sara A van de Geer, Bin Yu, Teófilo Valdés, Carlos Rivero, Jianqing Fan, and Aad van der Vaart. Regularization in statistics. Test, 15(2):271–344, 2006.
[14] Hirotogu Akaike. Information theory and an extension of the maximum likelihood principle. In Selected papers of hirotugu akaike, pages 199–213. Springer, 1998.
[15] Gideon Schwarz et al. Estimating the dimension of a model. Annals of statistics, 6(2):461–464, 1978.
[16] William J Welch. Algorithmic complexity: three np-hard problems in computational statistics. Journal of Statistical Computation and Simulation, 15(1):17–25, 1982.
[17] Leo Breiman et al. Heuristics of instability and stabilization in model selection. Annals of Statistics, 24(6):2350–2383, 1996.
[18] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288, 01 1996.
[19] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton, 05 2015.
[20] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B, 67:768–768, 02 2005.
[21] Tong Tong Wu, Yi Fang Chen, Trevor Hastie, Eric Sobel, and Kenneth Lange. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714–721, 01 2009.
[22] Peng Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 12 2006.
[23] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (lasso). IEEE transactions on information theory, 55(5):2183–2202, 2009.
[24] Ming Yuan and Yi Lin. On the nonnegative garrote estimator. Journal of the Royal Statistical Society Series B, 69:143–161, 04 2007.
[25] Fei Xue and Annie Qu. Variable selection for highly correlated predictors. arXiv preprint arXiv:1709.04840, 2017.
[26] Jinzhu Jia and Karl Rohe. Preconditioning the lasso for sign consistency. Electron. J. Statist., 9(1):1150–1172, 2015.
[27] Xiangyu Wang and Chenlei Leng. High dimensional ordinary least squares projection for screening variables. J. R. Stat. Soc. Ser. B (Stat. Methodol.), 78(3):589–611, 2016.
[28] Ryan J. Tibshirani and Jonathan Taylor. The solution path of the generalized lasso. Ann. Statist., 39(3):1335–1371, 06 2011.
[29] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.

		$\displaystyle\displaystyle\textrm{IC}=\max_{j}\left(\left\|\left(C_{21}^{n}(C_{11}^{n})^{-1}(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right\|\right)$
		$\displaystyle\displaystyle\textrm{EIC}=\min_{\lambda,\eta}\max_{j}\left(\left\|\left(C_{21}^{n}(C_{11}^{n}+\frac{\eta}{n}\mathbb{I}_{q})^{-1}(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right\|\right)$
		$\displaystyle\displaystyle\textrm{GIC}=\min_{\lambda,\eta}\max_{j}\left(\left\|\left((C_{21}^{n}+\frac{\eta}{n}\Sigma_{21})(C_{11}^{n}+\frac{\eta}{n}\Sigma_{11})^{-1}\left(\textrm{sign}(\boldsymbol{\beta}_{1}^{\star})+\frac{2\eta}{\lambda}\boldsymbol{\beta}_{1}^{\star}\right)-\frac{2\eta}{\lambda}\Sigma_{21}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right\|\right)$		(23)

$\displaystyle\displaystyle\mathbb{P}(\|\zeta_{j}\|\geq\tau_{j})$	$\displaystyle\displaystyle=\mathbb{P}\left(\|\zeta_{j}\|\geq\sqrt{n}\left(\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|-\frac{\lambda}{2n}\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right\|-\frac{\eta}{n}\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right\|\right)\right)$
	$\displaystyle\displaystyle=\mathbb{P}\left(\|\zeta_{j}\|+\frac{\lambda}{2\sqrt{n}}\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right\|+\frac{\eta}{\sqrt{n}}\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right\|\geq\sqrt{n}\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|\right)$
	$\displaystyle\displaystyle\leq\mathbb{P}\left(\|\zeta_{j}\|\geq\sqrt{n}\frac{\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|}{3}\right)+\mathbb{P}\left(\frac{\lambda}{2\sqrt{n}}\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}sign(\boldsymbol{\beta}_{1}^{\star})\right)_{j}\right\|\geq\sqrt{n}\frac{\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|}{3}\right)$
	$\displaystyle\displaystyle+\mathbb{P}\left(\frac{\eta}{\sqrt{n}}\left\|\left(\left(\mathscr{C}_{11}^{n,\boldsymbol{\Sigma}}\right)^{-1}\boldsymbol{\Sigma}_{11}\boldsymbol{\beta}_{1}^{\star}\right)_{j}\right\|\geq\sqrt{n}\frac{\left\|(\boldsymbol{\beta}_{1}^{\star})_{j}\right\|}{3}\right).$	(33)

	$\displaystyle\displaystyle\mathbb{P}\left(\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)\times\\|\boldsymbol{\epsilon}\\|_{2}^{2}\geq n\frac{(\boldsymbol{\beta}_{1}^{\star})_{j}^{2}}{9}\right)$	$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\mathbb{P}\left(\\|\boldsymbol{\epsilon}\\|_{2}^{2}\geq n\frac{(\boldsymbol{\beta}_{1}^{\star})_{j}^{2}}{9M_{1}}\right)+\mathbb{P}\left(\lambda_{\max}\left(H_{A}H^{\prime}_{A}\right)>M_{1}\right)$
		$\displaystyle\displaystyle\leq$	$\displaystyle\displaystyle\mathbb{P}\left(\dfrac{\\|\boldsymbol{\epsilon}\\|_{2}^{2}}{\sigma^{2}}\geq\dfrac{n\boldsymbol{\beta}_{\min}^{2}}{9M_{1}\sigma^{2}}\right)+o\left(e^{-n^{\delta_{1}}}\right).$