Sign Consistency of the Generalized Elastic Net Estimator
Abstract
In this paper, we propose a novel variable selection approach in the framework of high-dimensional linear models where the columns of the design matrix are highly correlated. It consists in rewriting the initial high-dimensional linear model to remove the correlation between the columns of the design matrix and in applying a generalized Elastic Net criterion, which can be seen as an extension of the generalized Lasso. The properties of our approach, called gEN (generalized Elastic Net), are investigated both from a theoretical and a numerical point of view. More precisely, we provide a new condition called GIC (Generalized Irrepresentable Condition), which generalizes the EIC (Elastic Net Irrepresentable Condition) of [1], under which we prove that our estimator can recover the positions of the null and non null entries of the coefficients when the sample size tends to infinity. We also assess the performance of our methodology using synthetic data and compare it with alternative approaches. Our numerical experiments show that our approach improves the variable selection performance in many cases.
Key words: Lasso; Model selection consistency; Irrepresentable Condition; Generalized Lasso; Elastic Net.
1 Introduction
Variable selection has become an important and widely used technique for understanding or predicting an outcome of interest in many fields such as medicine [2, 3, 4, 5], social media [6, 7, 8], or finance [9, 10, 11]. Over the past decades, numerous variable selection methods have been developed, such as subset selection [12] or regularization techniques [13].
Subset selection methods achieve sparsity by selecting the best subset of relevant variables using the Akaike information criterion [14] or the Bayesian information criterion [15], but the underlying optimization problem is NP-hard and the selection can be unstable in practice [16, 17].
The regularized variable selection techniques have become popular for their capability to overcome the above difficulties [18, 19, 20, 21]. Among them, the Lasso approach [18] is one of the most popular and can be defined as follows. Let $y = (y_1, \dots, y_n)'$ satisfy the following linear model

$$y = X\beta + \varepsilon, \qquad (1)$$

where $y$ is the response variable, $'$ denoting the transposition, $X$ is the $n \times p$ design matrix with $n$ rows of observations on $p$ covariates, $\beta = (\beta_1, \dots, \beta_p)'$ is a sparse vector, namely it contains a lot of null components, and $\varepsilon$ is a Gaussian vector with zero mean and a covariance matrix equal to $\sigma^2 \mathrm{Id}_n$, $\mathrm{Id}_n$ denoting the identity matrix in $\mathbb{R}^{n \times n}$. The Lasso approach estimates $\beta$ with a sparsity enforcing constraint by minimizing the following penalized least-squares criterion:

$$\hat{\beta}(\lambda) = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \left\{ \|y - X\beta\|_2^2 + \lambda \|\beta\|_1 \right\}, \qquad (2)$$

where $\|u\|_2^2 = \sum_i u_i^2$ denotes the squared $\ell_2$ norm of the vector $u$, $\|u\|_1 = \sum_i |u_i|$ denotes the $\ell_1$ norm of the vector $u$, and $\lambda$ is a positive constant corresponding to the regularization parameter. The Lasso popularity largely comes from the fact that the resulting estimator $\hat{\beta}(\lambda)$
is sparse (has only a few nonzero entries), and sparse models are often preferred for their interpretability [22]. Moreover, $\hat{\beta}(\lambda)$ can be proved to be sign consistent under some assumptions, namely there exists $\lambda = \lambda_n$ such that

$$\mathbb{P}\big(\operatorname{sign}(\hat{\beta}(\lambda)) = \operatorname{sign}(\beta)\big) \to 1, \quad \text{as } n \to \infty,$$

where $\operatorname{sign}(x) = 1$ if $x > 0$, $-1$ if $x < 0$ and $0$ if $x = 0$. Before giving the conditions under which [22] prove the sign consistency of $\hat{\beta}(\lambda)$, we first introduce some notations. Without loss of generality, we shall assume as in [22] that the first $q$ components of $\beta$ are non null (i.e. the components that are associated to the active variables, denoted $\beta_1$) and the last $p - q$ components of $\beta$ are null (i.e. the components that are associated to the non active variables, denoted $\beta_2$). Moreover, we shall denote by $X_1$ (resp. $X_2$) the first $q$ (resp. the last $p - q$) columns of $X$. Hence, $C = X'X/n$, which is the empirical covariance matrix of the covariates, can be rewritten as follows:
$$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix},$$

with $C_{11} = X_1'X_1/n$, $C_{12} = X_1'X_2/n$, $C_{21} = X_2'X_1/n$ and $C_{22} = X_2'X_2/n$. It is proved by Zhao and Yu in [22] that $\hat{\beta}(\lambda)$ is sign consistent when the following Irrepresentable Condition (IC) is satisfied:

$$\left| C_{21} C_{11}^{-1} \operatorname{sign}(\beta_1) \right| \le \mathbf{1} - \eta, \qquad (3)$$
where $\eta$ is a positive constant, the inequality holds componentwise and $\mathbf{1}$ denotes the vector of ones. In the case where $p > n$, Wainwright develops in [23] the necessary and sufficient conditions, for both deterministic and random designs, on $n$, $p$ and $q$ for which it is possible to recover the positions of the null and non null components of $\beta$, namely its support, using the Lasso.
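As a quick numerical illustration, the left-hand side of (3) can be evaluated from the data. The following sketch (our illustration, assuming Python with NumPy and that the active covariates come first, as above) returns its largest component, so that the IC holds for some $\eta > 0$ when the returned value is smaller than 1.

```python
import numpy as np

def ic_value(X, beta, q):
    """Largest component of |C21 C11^{-1} sign(beta_1)| from condition (3),
    assuming the first q coefficients of beta are the non-null ones."""
    n = X.shape[0]
    C = X.T @ X / n
    C11, C21 = C[:q, :q], C[q:, :q]
    s1 = np.sign(beta[:q])
    return np.max(np.abs(C21 @ np.linalg.solve(C11, s1)))
```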
When there are high correlations between covariates, especially the active ones, the matrix $C_{11}$ may not be invertible, and the Lasso estimator fails to be sign consistent. To circumvent this issue, Zou and Hastie [20] introduced the Elastic Net estimator defined by:

$$\hat{\beta}^{EN}(\lambda_1, \lambda_2) = \operatorname*{argmin}_{\beta \in \mathbb{R}^p} \left\{ \|y - X\beta\|_2^2 + \lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2 \right\}, \qquad (4)$$

where $\lambda_1$ and $\lambda_2$ are positive regularization parameters.
Yuan and Lin prove in [24] that, when the following Elastic Net Irrepresentable Condition (EIC) is satisfied, the Elastic Net estimator defined by (4) is sign consistent when $p$ and $q$ are fixed: there exist positive $\lambda_1$ and $\lambda_2$ such that

$$\left| C_{21} \left( C_{11} + \frac{\lambda_2}{n} \mathrm{Id}_q \right)^{-1} \left( \operatorname{sign}(\beta_1) + \frac{2\lambda_2}{\lambda_1} \beta_1 \right) \right| \le \mathbf{1} - \eta. \qquad (5)$$
Moreover, when $n$, $p$ and $q$ go to infinity with $n$, Jia and Yu prove in [1] that the sign consistency of the Elastic Net estimator holds if, additionally to Condition (5), the regularization parameters go to infinity at a suitable rate.
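For concreteness, criterion (4) can be minimized with standard solvers. The wrapper below (our illustration, not part of the paper) maps the $(\lambda_1, \lambda_2)$ of (4) onto scikit-learn's ElasticNet parameterization.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def elastic_net(X, y, lam1, lam2):
    """Elastic Net of (4). scikit-learn minimizes
    ||y - Xb||^2/(2n) + alpha*l1_ratio*||b||_1
    + 0.5*alpha*(1 - l1_ratio)*||b||_2^2,
    so we convert (lam1, lam2) into (alpha, l1_ratio)."""
    n = X.shape[0]
    alpha = lam1 / (2 * n) + lam2 / n
    l1_ratio = (lam1 / (2 * n)) / alpha
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False)
    model.fit(X, y)
    return model.coef_
```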
In the case where the active and non active covariates are highly correlated, IC (3) and EIC (5) may be violated. To overcome this issue, several approaches were proposed, among others the Standard PArtial Covariance (SPAC) method [25] and preconditioning approaches. Xue and Qu [25] developed the so-called SPAC-Lasso, which enjoys strong sign consistency in both finite-dimensional and high-dimensional settings. However, the authors mention that the SPAC-Lasso only selects the active variables that are not highly correlated to the non active ones, which may be a weakness of this approach. The preconditioning approaches consist in transforming the data $y$ and $X$ before applying the Lasso criterion. For example, [26] and [27] proposed to left-multiply $y$, and thus $X$ and $\varepsilon$ in Model (1), by specific matrices to remove the correlations between the columns of $X$. A major drawback of the latter approach, called HOLP (High dimensional Ordinary Least squares Projection), is that the preconditioning step may increase the variance of the error term and thus may alter the variable selection performance.
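As an illustration of the preconditioning idea, the sketch below implements a Puffer-style transformation in the spirit of [26] (our sketch, assuming $n \ge p$ and a full-rank design): it left-multiplies $y$ and $X$ by $F = U D^{-1} U'$ built from the SVD $X = U D V'$, which orthogonalizes the columns of $X$ but also transforms the noise.

```python
import numpy as np

def puffer_precondition(X, y):
    """Left-multiply y and X by F = U D^{-1} U', where X = U D V' (SVD).
    Then F X = U V' has orthonormal columns when n >= p and X has full rank,
    but F also acts on the noise, which may inflate its variance."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    F = U @ np.diag(1.0 / d) @ U.T
    return F @ X, F @ y
```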
Recently, [5] proposed another strategy under the following assumption:

- (A1) $X$ is assumed to be a random design matrix such that its rows are i.i.d. zero-mean Gaussian random vectors having a covariance matrix equal to $\Sigma$.
More precisely, they propose to rewrite Model (1) in order to remove the correlation existing between the columns of $X$. Let $\Sigma^{1/2} = U D^{1/2} U'$, where $U$ and $D$ are the matrices involved in the spectral decomposition of the symmetric matrix $\Sigma$ given by $\Sigma = U D U'$. Then, denoting $\widetilde{X} = X \Sigma^{-1/2}$, (1) can be rewritten as follows:

$$y = \widetilde{X} \widetilde{\beta} + \varepsilon, \qquad (6)$$

where $\widetilde{\beta} = \Sigma^{1/2} \beta$. With such a transformation, the covariance matrix of the rows of $\widetilde{X}$ is equal to the identity and the columns of $\widetilde{X}$ are thus uncorrelated. The advantage of such a transformation with respect to the preconditioning approach proposed by [27] is that the error term is not modified, thus avoiding an increase of the noise which can overwhelm the benefits of a well conditioned design matrix. Their approach then consists in minimizing the following criterion with respect to $\widetilde{\beta}$:
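In code, the transformation used in (6) can be written as follows (a minimal sketch assuming $\Sigma$ is known and positive definite):

```python
import numpy as np

def whiten_design(X, Sigma):
    """Decorrelate the columns of X, whose rows have covariance Sigma:
    returns X @ Sigma^{-1/2}, computed from the spectral decomposition
    Sigma = U D U' (Sigma must be positive definite)."""
    eigval, U = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = U @ np.diag(eigval ** -0.5) @ U.T
    return X @ Sigma_inv_sqrt, Sigma_inv_sqrt
```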
$$\|y - \widetilde{X}\widetilde{\beta}\|_2^2 + \lambda_1 \|\Sigma^{-1/2}\widetilde{\beta}\|_1, \qquad (7)$$

where $\lambda_1 > 0$, in order to ensure a sparse estimation of $\beta = \Sigma^{-1/2}\widetilde{\beta}$ thanks to the penalization by the $\ell_1$ norm. This criterion actually boils down to the Generalized Lasso proposed by [28]:

$$\operatorname*{argmin}_{\theta} \left\{ \frac{1}{2}\|y - X\theta\|_2^2 + \lambda \|D\theta\|_1 \right\}, \qquad (8)$$

with $X = \widetilde{X}$ and $D = \Sigma^{-1/2}$.
Since, as explained in [28], some problems may occur when the design matrix does not have full rank, we will consider in this paper the following criterion:

$$L_{gEN}(\widetilde{\beta}) = \|y - \widetilde{X}\widetilde{\beta}\|_2^2 + \lambda_1 \|\Sigma^{-1/2}\widetilde{\beta}\|_1 + \lambda_2 \|\widetilde{\beta}\|_2^2. \qquad (9)$$

Since it consists in adding an $\ell_2$ penalty part to the Generalized Lasso, as in the Elastic Net, we will call it generalized Elastic Net (gEN). We prove in Section 2 that, under Assumption (A1) and the Generalized Irrepresentable Condition (GIC) (12) given below, among others, $\hat{\beta}$ is a sign-consistent estimator of $\beta$, where $\hat{\beta}$ is defined by
$$\hat{\beta} = \Sigma^{-1/2}\, \widehat{\widetilde{\beta}}(\lambda_1, \lambda_2), \qquad (10)$$

with

$$\widehat{\widetilde{\beta}}(\lambda_1, \lambda_2) = \operatorname*{argmin}_{\widetilde{\beta}} L_{gEN}(\widetilde{\beta}), \qquad (11)$$

$L_{gEN}$ being defined in Equation (9). The Generalized Irrepresentable Condition (GIC) can be stated as follows: there exist positive $\lambda_1$, $\lambda_2$ and $\eta$ such that, componentwise,

$$\left| \left(C_{21} + \frac{\lambda_2}{n}\Sigma_{21}\right)\left(C_{11} + \frac{\lambda_2}{n}\Sigma_{11}\right)^{-1}\left(\operatorname{sign}(\beta_1) + \frac{2\lambda_2}{\lambda_1}\Sigma_{11}\beta_1\right) - \frac{2\lambda_2}{\lambda_1}\Sigma_{21}\beta_1 \right| \le \mathbf{1} - \eta, \qquad (12)$$

where $\Sigma_{11}$, $\Sigma_{12}$, $\Sigma_{21}$ and $\Sigma_{22}$ denote the blocks of $\Sigma$ partitioned as $C$.
Note that GIC coincides with EIC when $X$ is not random and $\Sigma = \mathrm{Id}_p$. Moreover, GIC does not require $C_{11}$ to be invertible. Since EIC and IC are both particular cases of GIC, if the IC or EIC holds, then there exist $\lambda_1$ and $\lambda_2$ such that the GIC holds.
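Under the form of criterion (9) reconstructed above, computing the gEN estimator reduces to a Lasso on augmented data, in the same spirit as the data-augmentation trick used for the Elastic Net: in the coordinates $\beta = \Sigma^{-1/2}\widetilde{\beta}$, (9) reads $\|y - X\beta\|_2^2 + \lambda_1\|\beta\|_1 + \lambda_2\|\Sigma^{1/2}\beta\|_2^2$. The sketch below illustrates this reduction with scikit-learn; it is an illustration under that assumption, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def gen_estimator(X, y, Sigma, lam1, lam2):
    """gEN sketch: in the beta coordinates, criterion (9) becomes
    ||y - X b||^2 + lam1*||b||_1 + lam2*||Sigma^{1/2} b||^2, i.e. a plain
    Lasso on the augmented data (X_aug, y_aug) built below."""
    n, p = X.shape
    eigval, U = np.linalg.eigh(Sigma)
    Sigma_sqrt = U @ np.diag(np.sqrt(eigval)) @ U.T
    X_aug = np.vstack([X, np.sqrt(lam2) * Sigma_sqrt])
    y_aug = np.concatenate([y, np.zeros(p)])
    # scikit-learn's Lasso minimizes ||y - Xw||^2/(2*n_samples) + alpha*||w||_1.
    lasso = Lasso(alpha=lam1 / (2 * (n + p)), fit_intercept=False)
    lasso.fit(X_aug, y_aug)
    return lasso.coef_  # estimate of beta = Sigma^{-1/2} beta_tilde
```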
The rest of the paper is organized as follows. Section 2 is devoted to the theoretical results of the paper. More precisely, we prove that, under some mild conditions, $\hat{\beta}$ defined in (10) is a sign-consistent estimator of $\beta$. To support our theoretical results, some numerical experiments are presented in Section 3. The proofs of our theoretical results can be found in Section 5.
2 Theoretical results
The goal of this section is to establish the sign consistency of the generalized Elastic Net estimator $\hat{\beta}$ defined in (10). To prove this result, we shall use the following lemma.
Lemma 2.1.
Let $y$ satisfy Model (1) and let $\hat{\beta}$ be defined by (10). Consider the events $A_n$ and $B_n$ given by

(13)

(14)

Then

$$\mathbb{P}\big(\operatorname{sign}(\hat{\beta}) = \operatorname{sign}(\beta)\big) \ge \mathbb{P}(A_n \cap B_n).$$
The following theorem gives the conditions under which the sign consistency of the generalized Elastic Net estimator defined in (10) holds.
Theorem 2.2.
Assume that $y$ satisfies Model (1) under Assumption (A1), in a regime where the model dimensions and parameters may depend on $n$. Assume also that there exist some positive constants satisfying

(15)

and that there exist $\lambda_1$ and $\lambda_2$ such that (12) and

(16)

(17)

(18)

hold as $n$ tends to infinity. Suppose also that there exist some positive constants such that, as $n \to \infty$,

(19)

(20)

(21)

where $\lambda_{\max}(\cdot)$ denotes the largest eigenvalue of a matrix, the other quantities being defined in (14), and $\widetilde{X}_1$ (resp. $\widetilde{X}_2$) denotes the first $q$ (resp. the last $p - q$) columns of $\widetilde{X}$. Then,

$$\mathbb{P}\big(\operatorname{sign}(\hat{\beta}) = \operatorname{sign}(\beta)\big) \to 1, \quad \text{as } n \to \infty,$$

where $\hat{\beta}$ is defined in (10).
3 Numerical experiments
The goal of this section is to discuss the assumptions and illustrate the results of Theorem 2.2. For this, we generated datasets from Model (1), where the matrix $\Sigma$ appearing in (A1) is defined by

$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}. \qquad (22)$$

In (22), $\Sigma_{11}$ is the correlation matrix of the active variables, having its off-diagonal entries equal to $a_1$, $\Sigma_{22}$ is the correlation matrix of the non active variables, having its off-diagonal entries equal to $a_2$, and $\Sigma_{12} = \Sigma_{21}'$ is the correlation matrix between the active and the non active variables, with entries equal to $a_3$. Moreover, $\beta$ appearing in Model (1) has $q$ non zero components. The number $p$ of predictors is equal to 200, 400, or 600, and the sample size $n$ takes the same values for each value of $p$.
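A sketch of this simulation design is given below; the default values of the correlations, of the non zero coefficients and of the noise level are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def make_sigma(q, p, a1, a2, a3):
    """Block correlation matrix of (22): off-diagonal a1 within the q active
    variables, a2 within the p - q non active ones, a3 across the two blocks.
    The chosen values must make Sigma positive definite."""
    Sigma = np.full((p, p), a3)
    Sigma[:q, :q] = a1
    Sigma[q:, q:] = a2
    np.fill_diagonal(Sigma, 1.0)
    return Sigma

def simulate(n, p, q, a1, a2, a3, beta_value=1.0, sigma=1.0, seed=None):
    """Draw (X, y, beta) from Model (1) under (A1); beta_value and sigma
    are illustrative choices."""
    rng = np.random.default_rng(seed)
    Sigma = make_sigma(q, p, a1, a2, a3)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros(p)
    beta[:q] = beta_value
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, beta
```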
3.1 Discussion on the assumptions of Theorem 2.2
We first show that GIC defined in (12) can be satisfied even when EIC and IC, defined in (5) and (3) respectively, are not fulfilled. For this, we computed, for different values of the simulation parameters, the left-hand sides of (3), (5) and (12):

(23)

Figure 1 displays the boxplots of these criteria obtained from 100 replications. We can see from these figures that, in all the considered cases, GIC is satisfied (i.e. all values are smaller than 1) whereas EIC and IC are not. The correlation values do not seem to have a big impact on EIC and IC, whereas GIC appears more sensitive to the simulation setting, increasing with some parameters in certain configurations and decreasing in others.
[Figure 1: Boxplots, over 100 replications, of the IC, EIC and GIC criteria defined in (23).]
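The empirical criteria in (23) can be computed along the following lines; the GIC expression follows the form given in (12), which is our reconstruction of the condition.

```python
import numpy as np

def eic_gic_values(X, beta, Sigma, q, lam1, lam2):
    """Empirical left-hand sides of EIC (5) and of GIC (12) (as reconstructed);
    a value below 1 means the condition is met for these (lam1, lam2)."""
    n = X.shape[0]
    C = X.T @ X / n
    C11, C21 = C[:q, :q], C[q:, :q]
    S11, S21 = Sigma[:q, :q], Sigma[q:, :q]
    s1, b1 = np.sign(beta[:q]), beta[:q]
    r = 2 * lam2 / lam1
    eic = np.max(np.abs(
        C21 @ np.linalg.solve(C11 + (lam2 / n) * np.eye(q), s1 + r * b1)))
    gic = np.max(np.abs(
        (C21 + (lam2 / n) * S21)
        @ np.linalg.solve(C11 + (lam2 / n) * S11, s1 + r * (S11 @ b1))
        - r * (S21 @ b1)))
    return eic, gic
```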
Figures 2 and 3 show the behavior of the quantities appearing in (19), (20) and (21) with respect to $n$, for different values of the remaining simulation parameters. These plots thus provide lower bounds for the constants appearing in the previous equations. Observe that (18) can be rewritten as:

(24)

Based on the plots at the bottom right of Figures 2 and 3, we can see that there exist $\lambda_2$'s satisfying Condition (24), and thus (18), and that the interval in which the adapted $\lambda_2$'s lie is wider in some of the considered settings than in others.
[Figures 2 and 3: Behavior of the quantities appearing in (19), (20) and (21) as functions of $n$ for the different simulation settings.]
Based on the averages previously obtained, the left part of (15) is always satisfied as soon as $n$ is large enough. Based on these averages, the mean of the left-hand side and of the right-hand side of the right part of Equation (15) are displayed in Figures 4 and 5. We can see from these figures that it is only satisfied for large values of $n$. Moreover, it is more often satisfied in some of the considered settings than in others.
[Figures 4 and 5: Averages of the left-hand side and of the right-hand side of the right part of (15).]
We will show in the next section that, even in the cases where all the conditions of the theorem are not fulfilled, our method is robust enough to outperform the Elastic Net defined in (4).
3.2 Comparison with other methods
To assess the performance of our approach (gEN) in terms of sign-consistency with respect to other methods and to illustrate the results of Theorem 2.2, we computed the True Positive Rate (TPR), namely the proportion of active variables selected, and the False Positive Rate (FPR), namely the proportion of non active variables selected, of the Elastic Net and gEN estimators defined in (4) and (10), respectively.
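The TPR and FPR can be computed from an estimated coefficient vector as follows (a small helper, assuming variables are declared selected when their estimated coefficient is non zero):

```python
import numpy as np

def tpr_fpr(beta_hat, beta):
    """TPR: fraction of truly active variables selected (beta_hat != 0);
    FPR: fraction of non active variables wrongly selected."""
    active = beta != 0
    selected = beta_hat != 0
    tpr = np.mean(selected[active])
    fpr = np.mean(selected[~active])
    return tpr, fpr
```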
Figures 6 and 8 display the empirical mean of the largest difference between the True Positive Rate and the False Positive Rate over the replications. It is obtained by selecting, for each replication, the values of $\lambda_1$ and $\lambda_2$ achieving the largest difference between the TPR and the FPR, and by averaging these differences. They also display the corresponding TPR and FPR of gEN and Elastic Net for different values of $n$ and $p$. We can see from these figures that the gEN and the Elastic Net estimators have a TPR equal to 1, but that the FPR of gEN is smaller than that of Elastic Net, the difference between the performance of gEN and Elastic Net being larger for high signal-to-noise ratios. It has to be noticed that, when TPR = 1 for our approach, the signs of the non null components of $\beta$ are also properly retrieved.
[Figures 6 to 8: Empirical mean of the largest difference between TPR and FPR, together with the corresponding TPR and FPR, for gEN and Elastic Net.]
4 Discussion
In this paper, we proposed a novel variable selection approach called gEN (generalized Elastic Net) in the framework of linear models where the columns of the design matrix are highly correlated and where, consequently, the standard Lasso criterion usually fails. We proved that, under mild conditions, among which the GIC, which remains valid when other standard conditions like EIC or IC are not fulfilled, our method provides a sign-consistent estimator of $\beta$. For a more thorough discussion regarding the application of our approach in practical situations, we refer the reader to [5].
5 Proofs
5.1 Proof of Lemma 2.1
Note that (9), given by

$$L_{gEN}(\widetilde{\beta}) = \|y - \widetilde{X}\widetilde{\beta}\|_2^2 + \lambda_1 \|\Sigma^{-1/2}\widetilde{\beta}\|_1 + \lambda_2 \|\widetilde{\beta}\|_2^2,$$

can be rewritten as

$$L_{gEN}(\widetilde{\beta}) = \|y^\star - X^\star \widetilde{\beta}\|_2^2 + \lambda_1 \|\Sigma^{-1/2}\widetilde{\beta}\|_1,$$

where

$$y^\star = \begin{pmatrix} y \\ 0 \end{pmatrix} \in \mathbb{R}^{n+p} \quad \text{and} \quad X^\star = \begin{pmatrix} \widetilde{X} \\ \sqrt{\lambda_2}\, \mathrm{Id}_p \end{pmatrix}.$$

Then, $\widehat{\widetilde{\beta}} = \widehat{\widetilde{\beta}}(\lambda_1, \lambda_2)$ satisfies

(25)

where $M'$ denotes the transpose of the matrix $M$. By using that $X^{\star\prime} X^\star = \widetilde{X}'\widetilde{X} + \lambda_2 \mathrm{Id}_p$ and $X^{\star\prime} y^\star = \widetilde{X}' y$, Equation (25) becomes

(26)

The first $q$ components of Equation (26) are:

(27)

In that case, by Equation (27), $\widehat{\widetilde{\beta}}$ can be seen as a solution of the generalized Elastic Net criterion with

(28)

where we used (14).
Note that the event $\{\operatorname{sign}(\hat{\beta}) = \operatorname{sign}(\beta)\}$ can be rewritten as follows:

(29)

which implies

(30)

Then, by using (28), we get that the event $A_n \cap B_n$ is included in the event appearing in (30), and thus in $\{\operatorname{sign}(\hat{\beta}) = \operatorname{sign}(\beta)\}$, which concludes the proof.
5.2 Proof of Theorem 2.2
By Lemma 2.1,

$$\mathbb{P}\big(\operatorname{sign}(\hat{\beta}) = \operatorname{sign}(\beta)\big) \ge \mathbb{P}(A_n \cap B_n) \ge 1 - \mathbb{P}(A_n^c) - \mathbb{P}(B_n^c),$$

where $A_n^c$ and $B_n^c$ denote the complementary events of $A_n$ and $B_n$, respectively. Thus, to prove the theorem, it is enough to prove that $\mathbb{P}(A_n^c) \to 0$ and $\mathbb{P}(B_n^c) \to 0$ as $n \to \infty$.
Recall that the event $A_n$ is defined in (13). Decomposing $\mathbb{P}(A_n^c)$ gives

(32)

and we bound each term in the r.h.s. of (32) in turn. Note that

(33)

Observe also that the entries of $\widetilde{X}_1$, the columns of the design matrix associated to the active covariates, are i.i.d. standard Gaussian random variables under Assumption (A1). Thus, by using the Cauchy-Schwarz inequality, the first term in the r.h.s. of (32) satisfies the following inequalities:

(34)
Since, by (19), the largest eigenvalue of $\widetilde{X}_1'\widetilde{X}_1/n$ is bounded with probability tending to one, and since $\|\varepsilon\|_2^2/\sigma^2$ is a $\chi^2$ random variable with $n$ degrees of freedom, we get, by Lemma 1 of [29], that

(35)

where we also used (15). By putting together Equations (34) and (35), we get

(36)

for some positive constant.
Let us now derive an upper bound for the second term in the r.h.s. of (32). By using the Cauchy-Schwarz inequality, we get that

(37)

where the last bound tends to zero by (16). Let us now derive an upper bound for the third term in the r.h.s. of (32). We have

(38)

where the last bound tends to zero by (18). By putting together Equations (36), (37) and (38), we get:

(39)

for some positive constant, which is positive for $n$ large enough by (15). Equation (39) then implies that $\mathbb{P}(A_n^c) \to 0$, as $n \to \infty$.
Let us now prove that $\mathbb{P}(B_n^c) \to 0$, as $n \to \infty$. Recall that the event $B_n$ is defined in (14). By using the Cauchy-Schwarz inequality, we get that

(40)

By (21), the largest eigenvalue appearing in the r.h.s. of (40) is bounded with probability tending to one. Moreover, by the GIC condition (12), there exist $\lambda_1$, $\lambda_2$ and $\eta > 0$ such that (12) holds componentwise. Thus, we get that

(41)

where the last bound tends to zero by (17). Finally, Equation (41) implies that $\mathbb{P}(B_n^c) \to 0$, as $n \to \infty$, which concludes the proof.
References
- [1] Jinzhu Jia and Bin Yu. On model selection consistency of the elastic net when $p \gg n$. Statistica Sinica, 20, 2010.
- [2] Wenbin Lu, Hao Zhang, and Donglin Zeng. Variable selection for optimal treatment decision. Statistical Methods in Medical Research, 22, 2011.
- [3] Lacey Gunter, Ji Zhu, and Susan Murphy. Variable selection for qualitative interactions in personalized medicine while controlling the family-wise error rate. Journal of Biopharmaceutical Statistics, 21(6):1063–1078, 2011.
- [4] Xuemin Gu, Guosheng Yin, and J. Jack Lee. Bayesian two-step lasso strategy for biomarker selection in personalized medicine development for time-to-event endpoints. Contemporary Clinical Trials, 36(2):642–650, 2013.
- [5] Wencan Zhu, Céline Lévy-Leduc, and Nils Ternès. A variable selection approach for highly correlated predictors in high-dimensional genomic data. Bioinformatics, 2021. doi: 10.1093/bioinformatics/btab114.
- [6] Zeynep Tufekci. Big questions for social media big data: Representativeness, validity and other methodological pitfalls. In Proceedings of the 8th International Conference on Weblogs and Social Media, ICWSM 2014, 2014.
- [7] Huijie Lin, Jia Jia, Liqiang Nie, Guangyao Shen, and Tat-Seng Chua. What does social media say about your stress? In IJCAI, pages 3775–3781, 2016.
- [8] Theodore S. Tomeny, Christopher J. Vargo, and Sherine El-Toukhy. Geographic and demographic correlates of autism-related anti-vaccine beliefs on Twitter, 2009–15. Social Science & Medicine, 191:168–175, 2017.
- [9] Georgios Sermpinis, Serafeim Tsoukas, and Ping Zhang. Modelling market implied ratings using lasso variable selection techniques. Journal of Empirical Finance, 48:19–35, 2018.
- [10] Alessandra Amendola, Francesco Giordano, Maria Parrella, and Marialuisa Restaino. Variable selection in high-dimensional regression: A nonparametric procedure for business failure prediction. Applied Stochastic Models in Business and Industry, 33, 2017.
- [11] Bartosz Uniejewski, Grzegorz Marcjasz, and Rafał Weron. Understanding intraday electricity markets: Variable selection and very short-term price forecasting using lasso. International Journal of Forecasting, 35(4):1533–1547, 2019.
- [12] Norman R. Draper and Harry Smith. Applied Regression Analysis, volume 326. John Wiley & Sons, 1998.
- [13] Peter J. Bickel, Bo Li, Alexandre B. Tsybakov, Sara A. van de Geer, Bin Yu, Teófilo Valdés, Carlos Rivero, Jianqing Fan, and Aad van der Vaart. Regularization in statistics. Test, 15(2):271–344, 2006.
- [14] Hirotugu Akaike. Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, pages 199–213. Springer, 1998.
- [15] Gideon Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.
- [16] William J. Welch. Algorithmic complexity: three NP-hard problems in computational statistics. Journal of Statistical Computation and Simulation, 15(1):17–25, 1982.
- [17] Leo Breiman. Heuristics of instability and stabilization in model selection. Annals of Statistics, 24(6):2350–2383, 1996.
- [18] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58:267–288, 1996.
- [19] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton, 2015.
- [20] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67:301–320, 2005.
- [21] Tong Tong Wu, Yi Fang Chen, Trevor Hastie, Eric Sobel, and Kenneth Lange. Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714–721, 2009.
- [22] Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
- [23] Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
- [24] Ming Yuan and Yi Lin. On the non-negative garrotte estimator. Journal of the Royal Statistical Society: Series B, 69:143–161, 2007.
- [25] Fei Xue and Annie Qu. Variable selection for highly correlated predictors. arXiv preprint arXiv:1709.04840, 2017.
- [26] Jinzhu Jia and Karl Rohe. Preconditioning the lasso for sign consistency. Electronic Journal of Statistics, 9(1):1150–1172, 2015.
- [27] Xiangyu Wang and Chenlei Leng. High dimensional ordinary least squares projection for screening variables. Journal of the Royal Statistical Society: Series B, 78(3):589–611, 2016.
- [28] Ryan J. Tibshirani and Jonathan Taylor. The solution path of the generalized lasso. Annals of Statistics, 39(3):1335–1371, 2011.
- [29] Béatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28(5):1302–1338, 2000.