Feature Selection in Generalized Linear Models via the Lasso: To Scale or Not to Scale?
Abstract
Lasso regression is a popular regularization method for feature selection in statistics. Prior to computing the Lasso estimator in both linear and generalized linear models, it is common to conduct a preliminary rescaling of the feature matrix to ensure that all the features are standardized. Without this standardization, it is argued, the Lasso estimate will unfortunately depend on the units used to measure the features. We propose a new type of iterative rescaling of the features in the context of generalized linear models. Whilst existing Lasso algorithms perform a single scaling as a preprocessing step, the proposed rescaling is applied iteratively throughout the Lasso computation until convergence. We provide numerical examples, with both real and simulated data, illustrating that the proposed iterative rescaling can significantly improve the statistical performance of the Lasso estimator without incurring any significant additional computational cost.
Keywords:
Lasso · standardization · feature scaling · Logistic regression · Poisson regression

1 Introduction and Background
We begin by providing the background notation and facts for the linear regression model and its extension to generalized linear models. With this notation and background material in place, we can explain the main contribution of our paper without unnecessary verbosity.
We denote the regression matrix (or feature matrix) containing the $p$ features as
$$ \mathbf{X} = [\boldsymbol{x}_1,\dots,\boldsymbol{x}_p] \in \mathbb{R}^{n\times p}, $$
and the corresponding regression response vector as $\boldsymbol{y}\in\mathbb{R}^n$ (with $\boldsymbol{y}$ being the realization of the random vector $\boldsymbol{Y}$). We make the standard assumptions (see, for example, [6]) that the regression function, or the mean of $\boldsymbol{Y}$ conditional on $\mathbf{X}$, satisfies $\boldsymbol{Y} = \mathbb{E}[\boldsymbol{Y}\mid\mathbf{X}] + \boldsymbol{\varepsilon}$ for some noise vector $\boldsymbol{\varepsilon}$ with conditional expectation $\mathbb{E}[\boldsymbol{\varepsilon}\mid\mathbf{X}] = \boldsymbol{0}$ and conditional variance $\mathbb{V}\mathrm{ar}[\boldsymbol{\varepsilon}\mid\mathbf{X}] = \sigma^2\mathbf{I}_n$, where $\sigma^2$ is an unknown parameter. Recall that in a linear model, we assume that
$\mathbb{E}[\boldsymbol{Y}\mid\mathbf{X}]$ is a linear function of some model coefficients $\boldsymbol{\beta}\in\mathbb{R}^p$ and $\beta_0\in\mathbb{R}$, the last one being called the intercept and corresponding to the constant feature $\boldsymbol{1} := [1,\dots,1]^\top$. Thus, the model for $\boldsymbol{Y}$ is
$$ \boldsymbol{Y} = \beta_0\boldsymbol{1} + \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}. $$
Define the projection (idempotent) matrix $\mathbf{C} := \mathbf{I}_n - \frac{1}{n}\boldsymbol{1}\boldsymbol{1}^\top$ and let
$$ \sigma_j^2 := \frac{1}{n}\,\|\mathbf{C}\boldsymbol{x}_j\|^2 $$
be the empirical variance of the components of the $j$-th feature vector $\boldsymbol{x}_j$. We define the Lasso estimate [7] of $(\beta_0,\boldsymbol{\beta})$ as the solution to the penalized least squares:
$$ (\widehat{\beta}_0,\widehat{\boldsymbol{\beta}}) \;:=\; \operatorname*{argmin}_{\beta_0,\,\boldsymbol{\beta}}\ \frac{1}{2n}\,\bigl\|\boldsymbol{y}-\beta_0\boldsymbol{1}-\mathbf{X}\boldsymbol{\beta}\bigr\|^2 \;+\; \lambda\sum_{j=1}^{p}\sigma_j\,|\beta_j|, \tag{1} $$
where $\lambda>0$ is a suitably chosen regularization parameter and the intercept $\beta_0$ is not penalized. It will shortly become clear why the definition (1) of the Lasso estimator appears prima facie to be different from the one used in textbooks [5] and statistical packages [2]. We remind the reader that the main advantage of Lasso regularization is that many of the components of $\widehat{\boldsymbol{\beta}}$ are estimated as zeros, making it very easy to tell which features are important and which are not. This phenomenon is behind the well-known ability of the Lasso estimator to perform simultaneous shrinkage and model selection [5, 7].
Feature Standardization.
As mentioned in the abstract, it is common practice to standardize the features so that the variance of each is unity [5, 7]. In other words, the columns of $\mathbf{X}$ are all rescaled in such a way that $\sigma_j = 1$ for all $j$. This standardization ensures that the Lasso estimate is not affected by the units in which the features are measured and, in general, improves the performance of the estimator [3]. The standardization can be accomplished by working with the matrix $\mathbf{X}\mathbf{S}^{-1}$, rather than $\mathbf{X}$, where $\mathbf{S}$ is the rescaling matrix
$$ \mathbf{S} := \mathrm{diag}(\sigma_1,\dots,\sigma_p). $$
The solution to (1) (without preliminary standardization of the features) is equivalent to the solution of
$$ (\widehat{\beta}_0,\,\mathbf{S}\widehat{\boldsymbol{\beta}}) \;=\; \operatorname*{argmin}_{\beta_0,\,\boldsymbol{\theta}}\ \frac{1}{2n}\,\bigl\|\boldsymbol{y}-\beta_0\boldsymbol{1}-\mathbf{X}\mathbf{S}^{-1}\boldsymbol{\theta}\bigr\|^2 + \lambda\|\boldsymbol{\theta}\|_1, \tag{2} $$
so that (2) is now in agreement with the definition of the Lasso estimator given in textbooks and statistical packages [2].
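To illustrate the back-transformation in code, here is a minimal Python sketch using NumPy and scikit-learn's Lasso (whose objective uses the same $\tfrac{1}{2n}$ scaling as (2)). The simulated data, the value of the regularization parameter, and the variable names are illustrative choices only.

```python
import numpy as np
from sklearn.linear_model import Lasso  # minimizes ||y - Xw||^2 / (2n) + alpha * ||w||_1

rng = np.random.default_rng(1)
n, p = 100, 8
# Features measured on very different scales (different "units")
X = rng.standard_normal((n, p)) * rng.uniform(0.5, 20.0, size=p)
beta_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0])
y = 1.0 + X @ beta_true + rng.standard_normal(n)

# Diagonal of the rescaling matrix S: the empirical standard deviations sigma_j
sigma = X.std(axis=0)

# Solve the standardized problem (2) on X S^{-1} ...
fit = Lasso(alpha=0.1, fit_intercept=True).fit(X / sigma, y)

# ... and reverse the scaling to obtain the solution of (1) in the original units
beta_hat = fit.coef_ / sigma      # beta_hat = S^{-1} theta_hat
beta0_hat = fit.intercept_
print(beta0_hat, np.round(beta_hat, 3))
```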
Extensions to GLMs.
Suppose that the joint density of the response vector $\boldsymbol{Y}$ given $\mathbf{X}$ is $g(\boldsymbol{y}\mid\beta_0,\boldsymbol{\beta},\mathbf{X})$, where the dependence on $(\beta_0,\boldsymbol{\beta})$ is through the linear map $(\beta_0,\boldsymbol{\beta})\mapsto\beta_0\boldsymbol{1}+\mathbf{X}\boldsymbol{\beta}$. Here, the cross-entropy training loss [6] (negative average log-likelihood) is $-\frac{1}{n}\ln g(\boldsymbol{y}\mid\beta_0,\boldsymbol{\beta},\mathbf{X})$, and the extension of the Lasso estimator (2) to the setting of generalized linear models is then given by:
$$ (\widehat{\beta}_0,\,\mathbf{S}\widehat{\boldsymbol{\beta}}) \;=\; \operatorname*{argmin}_{\beta_0,\,\boldsymbol{\theta}}\ -\frac{1}{n}\ln g\bigl(\boldsymbol{y}\mid\beta_0,\mathbf{S}^{-1}\boldsymbol{\theta},\mathbf{X}\bigr) + \lambda\|\boldsymbol{\theta}\|_1. \tag{3} $$
Observe that, just like in the linear Lasso estimator (2), we scale the features so that their variance is unity [5]. This scaling need only be applied once to $\mathbf{X}$, possibly as a preprocessing step prior to the main optimization, and then reversed at the end of the optimization (in order to obtain the regression coefficients in the original units of measurement). The Lasso solution (3) can also be rewritten in a form equivalent to (1), namely,
$$ (\widehat{\beta}_0,\widehat{\boldsymbol{\beta}}) \;=\; \operatorname*{argmin}_{\beta_0,\,\boldsymbol{\beta}}\ -\frac{1}{n}\ln g(\boldsymbol{y}\mid\beta_0,\boldsymbol{\beta},\mathbf{X}) + \lambda\sum_{j=1}^{p}\sigma_j\,|\beta_j|. \tag{4} $$
We are now ready to describe our proposed methodology of rescaling.
New Rescaling Method for GLMs.
Let
$$ r(\beta_0,\boldsymbol{\beta}) := -\frac{1}{n}\ln g(\boldsymbol{y}\mid\beta_0,\boldsymbol{\beta},\mathbf{X}) $$
be our shorthand notation for the cross-entropy loss. We define $h_j^2$
to be the $j$-th diagonal element of the Hessian matrix of second derivatives:
$$ h_j^2 := \Bigl[\nabla^2_{\boldsymbol{\beta}}\, r\bigl(\widehat{\beta}_0(\boldsymbol{\beta}),\boldsymbol{\beta}\bigr)\Big|_{\boldsymbol{\beta}=\boldsymbol{\beta}^*}\Bigr]_{jj},\qquad j=1,\dots,p, \qquad \text{where } \widehat{\beta}_0(\boldsymbol{\beta}) := \operatorname*{argmin}_{\beta_0}\, r(\beta_0,\boldsymbol{\beta}). $$
This is the Hessian matrix of the cross-entropy loss, evaluated at the true parameter $\boldsymbol{\beta}^*$, after the nuisance parameter $\beta_0$ has been eliminated (profiled out) from the optimization. Then, instead of the usual rescaled Lasso estimator (4), in this article we propose the following alternative iteratively rescaled Lasso (IRL) estimator:
$$ (\widehat{\beta}_0,\widehat{\boldsymbol{\beta}}) \;=\; \operatorname*{argmin}_{\beta_0,\,\boldsymbol{\beta}}\ r(\beta_0,\boldsymbol{\beta}) + \lambda\sum_{j=1}^{p}h_j\,|\beta_j|. \tag{5} $$
We now make three observations.
First, since each $h_j$ depends on the unknown $\boldsymbol{\beta}^*$, the approximate computation of (5) will be iterative (this is explained carefully in the next section), which is the main reason for naming the method IRL.
Second, the linear Lasso estimator (1) is a special case of (5) when $\boldsymbol{Y}$ given $\mathbf{X}$ is multivariate Gaussian with mean $\beta_0\boldsymbol{1}+\mathbf{X}\boldsymbol{\beta}$ and variance $\sigma^2\mathbf{I}_n$, because then each $h_j$ is proportional to $\sigma_j$ (with $h_j=\sigma_j$ when $\sigma^2=1$), so that (5) reduces to (1) after a rescaling of $\lambda$.
Third, one may ask what the motivation is for the proposed IRL estimator. The answer is that the IRL estimator coincides with the traditional linear Lasso estimator (1), provided that the cross-entropy loss is replaced by its quadratic approximation in the neighborhood of the true coefficients $\boldsymbol{\beta}^*$. In other words, our proposed IRL estimator uses exactly the same scaling as the linear Lasso estimator (1) when the generalized linear model is linearized at the true solution. Note that there is no such agreement in scaling between the currently accepted linear estimator (1) and its GLM counterpart (4); that is, the current widely used scaling is not consistent across linear and nonlinear models. Our proposal is thus motivated by the desire for a consistent scaling across linear and nonlinear models.
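To see the agreement more explicitly, expand the profiled cross-entropy loss to second order around $\boldsymbol{\beta}^*$:
$$ r\bigl(\widehat{\beta}_0(\boldsymbol{\beta}),\boldsymbol{\beta}\bigr) \;\approx\; \mathrm{const} + \boldsymbol{g}^\top(\boldsymbol{\beta}-\boldsymbol{\beta}^*) + \tfrac12(\boldsymbol{\beta}-\boldsymbol{\beta}^*)^\top\mathbf{H}\,(\boldsymbol{\beta}-\boldsymbol{\beta}^*), \qquad \mathbf{H}_{jj}=h_j^2, $$
where $\boldsymbol{g}$ and $\mathbf{H}$ denote the gradient and Hessian at $\boldsymbol{\beta}^*$. For the linear model with least-squares loss $\frac{1}{2n}\|\boldsymbol{y}-\beta_0\boldsymbol{1}-\mathbf{X}\boldsymbol{\beta}\|^2$, the same construction gives $\mathbf{H}=\frac{1}{n}\mathbf{X}^\top\mathbf{C}\mathbf{X}$, whose $j$-th diagonal element is exactly $\sigma_j^2$. The rule "penalize $|\beta_j|$ in proportion to the square root of the $j$-th diagonal curvature" therefore reproduces the scaling $\sigma_j$ of (1) in the linear case and yields the weights $h_j$ of (5) in the general case.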
The rest of the paper is organized as follows. In Section 2 we explain how to approximately compute the Lasso estimator in (5), given that we do not have a priori knowledge of the Hessian matrix at the true parameter $\boldsymbol{\beta}^*$. Then, in Section 3 we provide a number of numerical examples with both simulated and real data, illustrating the scope of improvement in the statistical performance of the Lasso estimator. In the final section, we give concluding remarks.
2 Computation via Iterative Reweighted Least Squares
Since the true $\boldsymbol{\beta}^*$ is not known a priori, in this section we explain how to approximately compute (5) using an iteratively reweighted least squares (IRLS) method. We begin by reviewing the well-known IRLS scheme for computing the estimator (4), and then explain how it is easily modified to approximately compute our proposed estimator (5). To this end, recall the shorthand $r(\beta_0,\boldsymbol{\beta})$ for the cross-entropy loss. Then, computing (4) is equivalent to minimizing
$$ r(\beta_0,\boldsymbol{\beta}) + \lambda\sum_{j=1}^{p}\sigma_j\,|\beta_j| $$
with respect to $(\beta_0,\boldsymbol{\beta})$. This problem is nonlinear, but it can be solved by successive and repeated linearizations of $r$. Suppose that at iteration $t$ we have a current best guess $(\beta_0^{(t)},\boldsymbol{\beta}^{(t)})$ for the minimizer of (4). Given this guess, we consider the quadratic multivariate Taylor approximation $r_t$ to the cross-entropy loss at the point $(\beta_0^{(t)},\boldsymbol{\beta}^{(t)})$:
$$ r_t(\beta_0,\boldsymbol{\beta}) := r\bigl(\beta_0^{(t)},\boldsymbol{\beta}^{(t)}\bigr) + \nabla r\bigl(\beta_0^{(t)},\boldsymbol{\beta}^{(t)}\bigr)^{\!\top}\!\begin{bmatrix}\beta_0-\beta_0^{(t)}\\ \boldsymbol{\beta}-\boldsymbol{\beta}^{(t)}\end{bmatrix} + \frac12\begin{bmatrix}\beta_0-\beta_0^{(t)}\\ \boldsymbol{\beta}-\boldsymbol{\beta}^{(t)}\end{bmatrix}^{\!\top}\!\nabla^2 r\bigl(\beta_0^{(t)},\boldsymbol{\beta}^{(t)}\bigr)\begin{bmatrix}\beta_0-\beta_0^{(t)}\\ \boldsymbol{\beta}-\boldsymbol{\beta}^{(t)}\end{bmatrix}. $$
Then, we update $(\beta_0^{(t)},\boldsymbol{\beta}^{(t)})$ to $(\beta_0^{(t+1)},\boldsymbol{\beta}^{(t+1)})$ by computing the linear Lasso estimator:
$$ (\beta_0^{(t+1)},\boldsymbol{\beta}^{(t+1)}) \;:=\; \operatorname*{argmin}_{\beta_0,\,\boldsymbol{\beta}}\ r_t(\beta_0,\boldsymbol{\beta}) + \lambda\sum_{j=1}^{p}\sigma_j\,|\beta_j|. \tag{6} $$
This computation is then iterated until convergence [2].
To keep the mathematical detail simple, we henceforth use the Logit model to illustrate the computations that are typically required for all GLMs. In the Logit model, the binary responses $Y_1,\dots,Y_n$ are assumed to be independent Bernoulli random variables with conditional mean
$$ \mu_i := \mathbb{E}[Y_i\mid\boldsymbol{x}_i] = \frac{1}{1+\exp\bigl(-(\beta_0+\boldsymbol{x}_i^\top\boldsymbol{\beta})\bigr)},\qquad i=1,\dots,n, $$
yielding the cross-entropy loss:
$$ r(\beta_0,\boldsymbol{\beta}) = -\frac{1}{n}\sum_{i=1}^{n}\Bigl[y_i\ln\mu_i + (1-y_i)\ln(1-\mu_i)\Bigr] = \frac{1}{n}\sum_{i=1}^{n}\Bigl[\ln\bigl(1+\mathrm{e}^{\beta_0+\boldsymbol{x}_i^\top\boldsymbol{\beta}}\bigr) - y_i\bigl(\beta_0+\boldsymbol{x}_i^\top\boldsymbol{\beta}\bigr)\Bigr]. \tag{7} $$
Then, the gradient and Hessian of $r$ are given by the formulas:
$$ \nabla r(\beta_0,\boldsymbol{\beta}) = \frac{1}{n}\,[\boldsymbol{1},\mathbf{X}]^\top(\boldsymbol{\mu}-\boldsymbol{y}), \qquad \nabla^2 r(\beta_0,\boldsymbol{\beta}) = \frac{1}{n}\,[\boldsymbol{1},\mathbf{X}]^\top\mathbf{D}\,[\boldsymbol{1},\mathbf{X}], $$
where $\boldsymbol{\mu} := (\mu_1,\dots,\mu_n)^\top$ and $\mathbf{D} := \mathrm{diag}\bigl(\mu_1(1-\mu_1),\dots,\mu_n(1-\mu_n)\bigr)$.
If we define the quantities
$$ w_i^{(t)} := \mu_i^{(t)}\bigl(1-\mu_i^{(t)}\bigr), \qquad z_i^{(t)} := \beta_0^{(t)}+\boldsymbol{x}_i^\top\boldsymbol{\beta}^{(t)} + \frac{y_i-\mu_i^{(t)}}{w_i^{(t)}}, \qquad i=1,\dots,n, $$
where $\mu_i^{(t)}$ denotes the conditional mean evaluated at the current iterate $(\beta_0^{(t)},\boldsymbol{\beta}^{(t)})$, then straightforward algebra shows that the estimator in (6) can also be written in terms of the following (weighted) regularized least squares:
$$ (\beta_0^{(t+1)},\boldsymbol{\beta}^{(t+1)}) = \operatorname*{argmin}_{\beta_0,\,\boldsymbol{\beta}}\ \frac{1}{2n}\sum_{i=1}^{n} w_i^{(t)}\bigl(z_i^{(t)}-\beta_0-\boldsymbol{x}_i^\top\boldsymbol{\beta}\bigr)^2 + \lambda\sum_{j=1}^{p}\sigma_j\,|\beta_j|. $$
We can eliminate the intercept term $\beta_0$ from the optimization by applying the weighted centering (projection) matrix
$$ \mathbf{C}_{\boldsymbol{w}} := \mathbf{I}_n - \frac{\boldsymbol{1}\,\boldsymbol{w}^{(t)\top}}{\boldsymbol{1}^\top\boldsymbol{w}^{(t)}}, \qquad \boldsymbol{w}^{(t)} := \bigl(w_1^{(t)},\dots,w_n^{(t)}\bigr)^\top, \qquad \mathbf{D}^{(t)} := \mathrm{diag}\bigl(\boldsymbol{w}^{(t)}\bigr), $$
to both $\boldsymbol{z}^{(t)}$ and $\mathbf{X}$. Once $\boldsymbol{\beta}^{(t+1)}$ is computed, we then recover $\beta_0^{(t+1)}$ as the weighted average of the residuals. This gives the following formulas for updating $(\beta_0^{(t)},\boldsymbol{\beta}^{(t)})$ to $(\beta_0^{(t+1)},\boldsymbol{\beta}^{(t+1)})$:
$$ \boldsymbol{\beta}^{(t+1)} = \operatorname*{argmin}_{\boldsymbol{\beta}}\ \frac{1}{2n}\Bigl\|\bigl(\mathbf{D}^{(t)}\bigr)^{1/2}\mathbf{C}_{\boldsymbol{w}}\bigl(\boldsymbol{z}^{(t)}-\mathbf{X}\boldsymbol{\beta}\bigr)\Bigr\|^2 + \lambda\sum_{j=1}^{p}\sigma_j\,|\beta_j|, \qquad \beta_0^{(t+1)} = \frac{\boldsymbol{w}^{(t)\top}\bigl(\boldsymbol{z}^{(t)}-\mathbf{X}\boldsymbol{\beta}^{(t+1)}\bigr)}{\boldsymbol{1}^\top\boldsymbol{w}^{(t)}}. $$
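For illustration, the following NumPy sketch computes the conditional means, weights, and working responses of the Logit model at a given iterate; the function name and the toy data are our own choices.

```python
import numpy as np

def logit_irls_quantities(X, y, beta0, beta):
    """Conditional means, IRLS weights and working responses of the Logit
    model at the iterate (beta0, beta)."""
    eta = beta0 + X @ beta              # linear predictor
    mu = 1.0 / (1.0 + np.exp(-eta))     # conditional means mu_i
    w = mu * (1.0 - mu)                 # weights w_i = mu_i * (1 - mu_i)
    z = eta + (y - mu) / w              # working responses z_i
    return mu, w, z

# Toy example at the starting point (beta0, beta) = (0, 0)
rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
y = rng.integers(0, 2, size=50).astype(float)
mu, w, z = logit_irls_quantities(X, y, beta0=0.0, beta=np.zeros(3))
```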
We iterate for $t=0,1,2,\dots$ until a convergence criterion is met. This iteratively reweighted penalized least-squares scheme is summarized in Algorithm 1, following [2], which assumes that we compute the Lasso estimate on a decreasing grid of values $\lambda_{\max}=\lambda_1>\lambda_2>\dots>\lambda_{100}$ (with 100 grid values being typical):

Algorithm 1: Iteratively reweighted penalized least squares for computing the Lasso estimate (4) over a grid of $\lambda$ values.

Here $\lambda_{\max}$ is large enough so that $\widehat{\boldsymbol{\beta}}(\lambda_{\max})=\boldsymbol{0}$; see [2].
Line 1 in Algorithm 1 calls the subroutine to compute the Lasso estimate
$$ \widehat{\boldsymbol{\beta}} := \operatorname*{argmin}_{\boldsymbol{\beta}}\ \frac{1}{2n}\,\|\boldsymbol{y}-\mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\sum_{j=1}^{p}\sigma_j\,|\beta_j| \tag{8} $$
to within a prescribed error tolerance via the method of coordinate descent. For completeness, we include the pseudocode of this subroutine in the Appendix and refer the reader to [2] for more details of this well-known algorithm.
We now describe our proposed method for approximately computing the Lasso estimate (5). The basic idea is to modify the standard linearization (6) by replacing the penalty term $\lambda\sum_{j}\sigma_j|\beta_j|$ with $\lambda\sum_{j}h_j^{(t)}|\beta_j|$, where each $h_j^{(t)}$ is determined from the columns of the Hessian of the cross-entropy loss evaluated at the current estimate $(\beta_0^{(t)},\boldsymbol{\beta}^{(t)})$. This gives an iteratively reweighted least squares in which the scaling in the Lasso penalty term is modified from one iteration to the next:
$$ (\beta_0^{(t+1)},\boldsymbol{\beta}^{(t+1)}) = \operatorname*{argmin}_{\beta_0,\,\boldsymbol{\beta}}\ r_t(\beta_0,\boldsymbol{\beta}) + \lambda\sum_{j=1}^{p}h_j^{(t)}\,|\beta_j|. $$
We implement this iterative rescaling with only one minor modification of Algorithm 1. In particular, we replace line 1 in Algorithm 1 with the variances of the features in $\mathbf{X}$ weighted by $\boldsymbol{w}^{(t)}$:
$$ \bigl(h_j^{(t)}\bigr)^2 := \frac{1}{n}\Bigl[(\mathbf{C}_{\boldsymbol{w}}\mathbf{X})^\top\mathbf{D}^{(t)}(\mathbf{C}_{\boldsymbol{w}}\mathbf{X})\Bigr]_{jj},\qquad j=1,\dots,p. $$
In other words, rather than computing the square root of the diagonal elements of the matrix $\frac{1}{n}(\mathbf{C}\mathbf{X})^\top(\mathbf{C}\mathbf{X})$ (as currently done in line 1 of Algorithm 1), we instead compute the square root of the diagonal elements of the weighted matrix $\frac{1}{n}(\mathbf{C}_{\boldsymbol{w}}\mathbf{X})^\top\mathbf{D}^{(t)}(\mathbf{C}_{\boldsymbol{w}}\mathbf{X})$. This is the only significant difference between the implementations of the current widely used constant scaling and our proposed iteratively rescaled Lasso (IRL) algorithm.
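For illustration, here is a minimal NumPy sketch of the two scaling computations, written under the weighted-variance form displayed above; the function names and the example values are ours and not part of any existing package.

```python
import numpy as np

def lasso_scalings(X):
    """Constant scaling: sigma_j = sqrt of the j-th diagonal element of (C X)^T (C X) / n,
    i.e. the (population) empirical standard deviation of each feature."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                 # apply the centering matrix C
    return np.sqrt(np.sum(Xc**2, axis=0) / n)

def irl_scalings(X, w):
    """IRL scaling: h_j = sqrt of the j-th diagonal element of (C_w X)^T D (C_w X) / n,
    where D = diag(w) and C_w centers each column by its w-weighted mean."""
    n = X.shape[0]
    Xc = X - (w @ X) / w.sum()              # apply the weighted centering C_w
    return np.sqrt(np.sum(w[:, None] * Xc**2, axis=0) / n)

# Example: Logit weights at a current iterate (beta0, beta)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
beta0, beta = 0.0, np.array([1.0, -1.0, 0.0, 0.0, 0.0])
mu = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta)))
w = mu * (1.0 - mu)

print(lasso_scalings(X))    # constant scaling used by the vanilla GLM Lasso
print(irl_scalings(X, w))   # iteration-dependent scaling used by IRL
```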
Remark 1 (An Alternative Computation via a Pilot Estimate.)
Recall that the adaptive Lasso estimator [9] is a variation of the classical or vanilla Lasso estimator (1), where the scaling is not constant, but depends on a root-$n$ consistent pilot estimator $\widetilde{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}^*$ (for example, the least-squares estimator when $p<n$):
$$ (\widehat{\beta}_0,\widehat{\boldsymbol{\beta}}) := \operatorname*{argmin}_{\beta_0,\,\boldsymbol{\beta}}\ \frac{1}{2n}\,\|\boldsymbol{y}-\beta_0\boldsymbol{1}-\mathbf{X}\boldsymbol{\beta}\|^2 + \lambda\sum_{j=1}^{p}\frac{|\beta_j|}{|\widetilde{\beta}_j|^{\nu}}, \qquad \nu>0. $$
This type of scaling successfully achieves model selection consistency under weak assumptions; see [9].
As in the adaptive Lasso case, given a good pilot estimate $\widetilde{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}^*$, we can compute $h_1^2,\dots,h_p^2$ as the diagonal elements of the Hessian matrix of the cross-entropy loss at $\widetilde{\boldsymbol{\beta}}$ and then deliver the Lasso estimator:
$$ (\widehat{\beta}_0,\widehat{\boldsymbol{\beta}}) = \operatorname*{argmin}_{\beta_0,\,\boldsymbol{\beta}}\ r(\beta_0,\boldsymbol{\beta}) + \lambda\sum_{j=1}^{p}h_j\,|\beta_j|, $$
with each $h_j$ evaluated at $\widetilde{\boldsymbol{\beta}}$ rather than at the unknown $\boldsymbol{\beta}^*$. In this implementation, the scaling parameters $h_1,\dots,h_p$ are fixed at the beginning of the Lasso optimization and do not change during the course of the iteratively reweighted least squares. This approach is an alternative to the iteratively rescaled Lasso described above, which we do not pursue further in this work.
3 Numerical Results
In this section, we provide empirical comparisons between the IRL method and the state-of-the-art Glmnet implementation [2] for Logistic and Poisson regression. We provide numerical results for experiments conducted on both simulated and real data.
3.1 Simulations
In our simulations, we generate synthetic data from the model:
$$ \boldsymbol{Y}\mid\mathbf{X} \;\sim\; g\bigl(\boldsymbol{y}\mid\gamma\beta_0,\,\gamma\boldsymbol{\beta},\,\mathbf{X}\bigr), \tag{9} $$
where the parameter $\gamma>0$ is a scaling parameter that determines the strength of the signal. When implementing Logistic regression, we sample a random vector $\boldsymbol{y}$ by drawing components $Y_i$ from Bernoulli($p_i$), where $p_i = 1/\bigl(1+\exp(-\gamma(\beta_0+\boldsymbol{x}_i^\top\boldsymbol{\beta}))\bigr)$ for all $i$. Similarly, for Poisson regression we sample $\boldsymbol{y}$ by drawing $Y_i$ from Poisson($\mu_i$), where $\mu_i = \exp\bigl(\gamma(\beta_0+\boldsymbol{x}_i^\top\boldsymbol{\beta})\bigr)$. Each row of the predictor matrix $\mathbf{X}$ is generated from a multivariate normal distribution with zero mean and covariance $\boldsymbol{\Sigma}$ with diagonal elements $\sigma^2$ and off-diagonal elements $\rho\sigma^2$, for parameters $\sigma^2>0$ and $\rho\in[0,1)$; see, e.g., [4, 1]. To induce sparsity in $\boldsymbol{\beta}$, we control the number of its non-zero components. In the sparse setting, we fix the intercept $\beta_0$, assign non-zero values to the first 10 components of $\boldsymbol{\beta}$, and set the remaining components to 0.
We run IRL and compare its statistical performance against the state-of-the-art implementation for generalized linear Lasso regression, Glmnet [2]. To tune the regularization parameter $\lambda$, we generate an independent validation set from the generating process (9) with identical parameter values for $\gamma$, $\rho$, and $\sigma^2$. We then minimize the expected cross-entropy error on the validation set over a $\lambda$ grid with 100 values.
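For concreteness, the following NumPy sketch generates data along the lines described above. The equicorrelation covariance, the placement of $\gamma$ in the linear predictor, the non-zero coefficient values, and the default arguments are illustrative assumptions rather than the exact settings of our experiments.

```python
import numpy as np

def simulate_glm_data(n=200, p=50, rho=0.9, sigma2=1.0, gamma=1.0,
                      family="logistic", n_nonzero=10, seed=0):
    """Simulate (X, y) with equicorrelated Gaussian features, a sparse
    coefficient vector and a GLM response; gamma scales the signal.
    Non-zero coefficient values and defaults are illustrative only."""
    rng = np.random.default_rng(seed)
    # Covariance with diagonal sigma2 and off-diagonal rho * sigma2
    cov = sigma2 * (rho * np.ones((p, p)) + (1.0 - rho) * np.eye(p))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    beta = np.zeros(p)
    beta[:n_nonzero] = 1.0                   # illustrative non-zero values
    beta0 = 0.0
    eta = gamma * (beta0 + X @ beta)         # gamma controls the signal strength
    if family == "logistic":
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))
    elif family == "poisson":
        # use a small gamma here to keep the simulated counts moderate
        y = rng.poisson(np.exp(eta))
    else:
        raise ValueError("family must be 'logistic' or 'poisson'")
    return X, y, beta0, beta

X, y, beta0, beta = simulate_glm_data(family="logistic")
```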
Since the true coefficient vector $\boldsymbol{\beta}$ is known, the conditional means $p_i := \mathbb{E}[Y_i\mid\boldsymbol{x}_i]$ are available and, when implementing Logistic regression, we can use the conditional expected cross-entropy error,
$$ \frac{1}{n}\sum_{i=1}^{n}\Bigl[\ln\bigl(1+\mathrm{e}^{\widehat{\beta}_0+\boldsymbol{x}_i^\top\widehat{\boldsymbol{\beta}}}\bigr) - p_i\bigl(\widehat{\beta}_0+\boldsymbol{x}_i^\top\widehat{\boldsymbol{\beta}}\bigr)\Bigr], \tag{10} $$
where $(\widehat{\beta}_0,\widehat{\boldsymbol{\beta}})$ denotes the fitted model and the sum runs over the validation set.
For Poisson regression, we use the following conditional expected cross-entropy error,
$$ \frac{1}{n}\sum_{i=1}^{n}\Bigl[\exp\bigl(\widehat{\beta}_0+\boldsymbol{x}_i^\top\widehat{\boldsymbol{\beta}}\bigr) - \mu_i\bigl(\widehat{\beta}_0+\boldsymbol{x}_i^\top\widehat{\boldsymbol{\beta}}\bigr)\Bigr], \tag{11} $$
where $\mu_i := \mathbb{E}[Y_i\mid\boldsymbol{x}_i]$ and the sum again runs over the validation set. The expected loss conditioned on $\mathbf{X}$ allows one to estimate the true expected generalization risk.
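In code, the two validation criteria can be evaluated as follows (a minimal sketch written under our notation above; the function names are illustrative):

```python
import numpy as np

def expected_logistic_loss(X_val, p_true, beta0_hat, beta_hat):
    """Conditional expected logistic cross-entropy (10): the usual loss with
    the observed responses y_i replaced by the true means p_i = E[Y_i | x_i]."""
    eta = beta0_hat + X_val @ beta_hat
    return np.mean(np.logaddexp(0.0, eta) - p_true * eta)

def expected_poisson_loss(X_val, mu_true, beta0_hat, beta_hat):
    """Conditional expected Poisson cross-entropy (11), up to an additive
    constant that does not depend on the fitted coefficients."""
    eta = beta0_hat + X_val @ beta_hat
    return np.mean(np.exp(eta) - mu_true * eta)
```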
In each simulation, after generating a training set and a validation set, we run IRL and Glmnet and determine the $\lambda$ that minimizes either (10) or (11) on the validation set. We denote this minimizer as $\widehat{\lambda}$ and the corresponding model coefficient estimate as $\widehat{\boldsymbol{\beta}}(\widehat{\lambda})$. We measure the following:
1. Bias: the deviation of the estimate $\widehat{\boldsymbol{\beta}}(\widehat{\lambda})$ from the true coefficient vector.
2. True Positives (TP): the number of correctly identified non-zero features in $\widehat{\boldsymbol{\beta}}(\widehat{\lambda})$.
3. False Positives (FP): the number of falsely identified non-zero features in $\widehat{\boldsymbol{\beta}}(\widehat{\lambda})$.
4. Test Loss: the conditional expected cross-entropy error, (10) or (11), evaluated at $\widehat{\boldsymbol{\beta}}(\widehat{\lambda})$ on an independent test set.
For each value of the pair $(\rho,\sigma^2)$ we replicate the simulation 100 times and report the mean value of each performance measure in Tables 1 to 4. Standard errors of the mean are provided in parentheses.
Table 1: Logistic regression. Average Variance = 0.047.

| Method | $(\rho,\sigma^2)$ | Bias | TP | FP | Test Loss |
|---|---|---|---|---|---|
| IRL | (0.1, 0.1) | 73.72 | 8.58 (0.13) | 19.37 (0.43) | 0.12 |
| Glmnet | (0.1, 0.1) | 125.31 | 7.28 (0.11) | 48.87 (0.74) | 0.17 |
| IRL | (0.1, 1.0) | 89.73 | 9.97 (0.02) | 24.71 (0.53) | 0.04 |
| Glmnet | (0.1, 1.0) | 126.59 | 9.86 (0.04) | 47.45 (0.71) | 0.08 |
| IRL | (0.9, 0.1) | 61.09 | 6.65 (0.13) | 8.12 (0.29) | 0.29 |
| Glmnet | (0.9, 0.1) | 111.46 | 4.84 (0.08) | 36.36 (0.68) | 0.32 |
| IRL | (0.9, 1.0) | 67.47 | 7.49 (0.13) | 14.04 (0.38) | 0.18 |
| Glmnet | (0.9, 1.0) | 121.52 | 6.33 (0.12) | 50.73 (0.73) | 0.22 |
Table 2: Logistic regression. Average Variance = 0.156.

| Method | $(\rho,\sigma^2)$ | Bias | TP | FP | Test Loss |
|---|---|---|---|---|---|
| IRL | (0.1, 0.1) | 40.89 | 6.09 (0.12) | 8.30 (0.26) | 0.48 |
| Glmnet | (0.1, 0.1) | 42.52 | 6.16 (0.12) | 9.65 (0.26) | 0.48 |
| IRL | (0.1, 1.0) | 36.46 | 5.40 (0.10) | 14.96 (0.42) | 0.56 |
| Glmnet | (0.1, 1.0) | 37.31 | 5.45 (0.11) | 15.8 (0.43) | 0.56 |
| IRL | (0.9, 0.1) | 123.43 | 4.39 (0.11) | 3.09 (0.18) | 0.41 |
| Glmnet | (0.9, 0.1) | 122.15 | 4.42 (0.11) | 3.85 (0.20) | 0.41 |
| IRL | (0.9, 1.0) | 57.78 | 6.16 (0.13) | 5.62 (0.22) | 0.45 |
| Glmnet | (0.9, 1.0) | 59.30 | 6.29 (0.12) | 7.13 (0.24) | 0.45 |
When conducting experiments with generalized linear models it is useful to estimate the noise inherent in the model. In Logistic regression, the variance of $Y_i$ given $\boldsymbol{x}_i$ is
$$ \mathbb{V}\mathrm{ar}[Y_i\mid\boldsymbol{x}_i] = p_i(1-p_i). $$
To quantify the signal strength we report the value $p_i(1-p_i)$,
averaged over all the simulated responses in each table. The average variance has a maximum value of 0.25, which indicates high noise and a weak signal, whereas an average variance closer to the minimum value of 0 indicates low noise and a strong signal.
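As a quick illustration, assuming the true conditional means $p_i$ of the simulated responses are stored in an array, this quantity is a one-line computation:

```python
import numpy as np

# p contains the true conditional means p_i of the simulated Bernoulli responses
p = np.array([0.05, 0.10, 0.50, 0.80, 0.95])
average_variance = np.mean(p * (1.0 - p))  # near 0.25 => weak signal; near 0 => strong signal
print(average_variance)
```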
In Table 1 we observe simulations with a strong signal. It is in this scenario that the IRL method significantly outperforms Glmnet. With IRL, we obtain a sparser estimate that contains significantly fewer false positives and achieves an improved test loss. IRL outperforms Glmnet the most when the correlation among the predictors is highest. In Table 2, the simulated $\boldsymbol{\beta}$ is non-sparse and $\gamma$ is decreased, thereby creating a weaker signal in the generated responses. In this high-noise setting, IRL achieves only marginally fewer false positives. We provide further numerical tables in the Appendix, which contain results from simulations with additional values of $\gamma$.
In Figure 1 we compare the bias and test loss of $\widehat{\boldsymbol{\beta}}(\lambda)$ computed with IRL and Glmnet over a grid of $\lambda$ values. Since the regularization parameter $\lambda$ is not directly comparable between IRL and Glmnet, we plot the bias and test loss as functions of a common dependent variable. Figure 1 shows that in the simulation setting where $\boldsymbol{\beta}$ is sparse, IRL attains a lower test loss over a range of values and the lowest bias.
Figure 1: Bias and test loss of $\widehat{\boldsymbol{\beta}}(\lambda)$ for IRL and Glmnet in the sparse Logistic regression setting.
Table 3: Poisson regression. Average Variance = 0.289.

| Method | $(\rho,\sigma^2)$ | Bias | TP | FP | Test Loss |
|---|---|---|---|---|---|
| IRL | (0.1, 1.0) | 35.98 | 9.14 (0.09) | 20.76 (0.51) | 0.16 |
| Glmnet | (0.1, 1.0) | 73.62 | 9.62 (0.06) | 47.56 (0.63) | 0.19 |
| IRL | (0.1, 0.1) | 36.13 | 5.69 (0.11) | 11.03 (0.33) | 0.28 |
| Glmnet | (0.1, 0.1) | 71.19 | 6.30 (0.12) | 43.78 (0.58) | 0.31 |
| IRL | (0.9, 1.0) | 35.65 | 5.80 (0.11) | 8.66 (0.27) | 0.36 |
| Glmnet | (0.9, 1.0) | 69.55 | 6.09 (0.12) | 41.54 (0.54) | 0.39 |
| IRL | (0.9, 0.1) | 41.86 | 6.09 (0.13) | 4.80 (0.23) | 0.49 |
| Glmnet | (0.9, 0.1) | 61.78 | 5.36 (0.10) | 23.26 (0.42) | 0.51 |
Table 4: Poisson regression. Average Variance = 8.808.

| Method | $(\rho,\sigma^2)$ | Bias | TP | FP | Test Loss |
|---|---|---|---|---|---|
| IRL | (0.1, 1.0) | 11.6 | 8.22 (0.13) | 19.02 (0.66) | -2.24 |
| Glmnet | (0.1, 1.0) | 11.62 | 8.21 (0.13) | 19.37 (0.66) | -2.24 |
| IRL | (0.1, 0.1) | 11.64 | 6.44 (0.09) | 8.14 (0.47) | -14.77 |
| Glmnet | (0.1, 0.1) | 11.73 | 6.43 (0.10) | 8.95 (0.49) | -14.77 |
| IRL | (0.9, 1.0) | 11.98 | 6.58 (0.10) | 6.72 (0.39) | -36.17 |
| Glmnet | (0.9, 1.0) | 12.18 | 6.54 (0.10) | 8.30 (0.45) | -36.17 |
| IRL | (0.9, 0.1) | 19.91 | 6.32 (0.13) | 2.81 (0.18) | -102.07 |
| Glmnet | (0.9, 0.1) | 20.13 | 6.45 (0.13) | 4.49 (0.30) | -102.07 |
In Poisson regression the variance of $Y_i$ is equal to its mean, that is, $\mathbb{V}\mathrm{ar}[Y_i\mid\boldsymbol{x}_i] = \mu_i$. To quantify the noise we report the value $\mu_i$ averaged over all the simulated responses in each table. In Table 3 we observe simulations with low noise. As with Logistic regression, it is in this scenario that the IRL method significantly outperforms Glmnet. The IRL method picks up significantly fewer false positives, thus exhibiting improved variable selection. The improvement is most significant in the settings with the highest correlation among the predictors. In Table 4, the simulated $\boldsymbol{\beta}$ is non-sparse and $\gamma$ is decreased to 0.01, thereby increasing the average noise to 8.808. In this scenario, IRL outperforms Glmnet only in the high-correlation settings. Figure 2 shows that in the sparse setting, IRL outperforms Glmnet over a range of values.
Figure 2: Bias and test loss of $\widehat{\boldsymbol{\beta}}(\lambda)$ for IRL and Glmnet in the sparse Poisson regression setting.
3.2 Real Data
We study the performance of Glmnet and IRL on the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset [8]. The dataset consists of 569 observations and 30 predictors. The ordering of the 569 observations is randomized, and we use 399 observations for training and 170 for testing. On this dataset, we fit Glmnet and IRL over a range of 100 $\lambda$ values and plot the test loss (10) in Figure 3. The results indicate that IRL achieves the lowest test loss. Moreover, at their corresponding minimizers, IRL selects 13 non-zero features whereas Glmnet selects 14. In other words, IRL can achieve a smaller test loss using a sparser model.
Figure 3: Test loss of IRL and Glmnet on the Breast Cancer Wisconsin (Diagnostic) dataset over a grid of 100 $\lambda$ values.
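The data handling can be reproduced, for example, with scikit-learn, which ships a copy of this dataset. The sketch below uses an $\ell_1$-penalized logistic regression as a stand-in for the Glmnet baseline (the IRL fit itself is not available in any public package); the random seed and the grid of regularization values are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# UCI Breast Cancer Wisconsin (Diagnostic): 569 observations, 30 predictors
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=399, test_size=170, random_state=0, shuffle=True)

# l1-penalized logistic regression as a stand-in baseline; C is the inverse
# regularization strength, so a grid over C plays the role of the lambda grid
for C in np.logspace(-3, 2, 10):
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C, max_iter=1000)
    model.fit(X_train, y_train)
    n_nonzero = int(np.sum(model.coef_ != 0))
    p_hat = model.predict_proba(X_test)[:, 1]
    eps = 1e-12  # guard against log(0)
    test_loss = -np.mean(y_test * np.log(p_hat + eps) + (1 - y_test) * np.log(1 - p_hat + eps))
    print(f"C={C:9.4f}  non-zero={n_nonzero:2d}  test loss={test_loss:.4f}")
```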
In other numerical experiments with real-world data (not shown here), we similarly observed that the IRL estimator, whilst never inferior to the classical (vanilla) Lasso estimator, can achieve a smaller test loss using a sparser model in problems with feature structure like the ones considered in the examples above.
4 Concluding Remarks
In this paper, we proposed a novel scaling of the features for Lasso-penalized nonlinear regression. The current widely used standardization of the feature matrix is performed only once, typically as a preprocessing step, and is independent of the true regression coefficient $\boldsymbol{\beta}^*$. In contrast, our rescaling proposal uses a standardization that depends on the current best estimate of $\boldsymbol{\beta}^*$ and is thus updated from one iteration to the next during the course of the nonlinear optimization.
Numerical experiments suggest that the proposed iteratively rescaled Lasso estimator, while never harming the model-selection performance of the estimator in any of the examples considered, has the potential to significantly improve its bias and model-selection performance when the vector of model coefficients is sparse and the signal is strong. In addition, the proposed estimator is just as easy to compute as the vanilla Lasso estimator and requires only a minor modification of one line of the original algorithm.
Future areas of research include providing a theoretical framework with precise conditions under which a significant improvement in the statistical accuracy of the IRL estimator is guaranteed.
Appendix 0.A Coordinate Descent for Linear Lasso Computations
The following pseudocode provides a simple coordinate descent algorithm [2] for computing (8). Note the absence here of the intercept term $\beta_0$, which is assumed to have been eliminated from the optimization. In the pseudocode below we use the Lasso shrinkage (soft-thresholding) operator, defined as:
$$ \mathcal{S}_\kappa(x) := \mathrm{sign}(x)\,\max\{|x|-\kappa,\,0\}, $$
where $\kappa\geq 0$.
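For concreteness, a Python sketch of this coordinate descent subroutine is given below; the function names, the stopping rule, and the example call are illustrative choices rather than part of [2].

```python
import numpy as np

def soft_threshold(x, kappa):
    """Lasso shrinkage operator S_kappa(x) = sign(x) * max(|x| - kappa, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - kappa, 0.0)

def lasso_coordinate_descent(X, y, lam, scale, beta_init=None,
                             tol=1e-8, max_iter=1000):
    """Coordinate descent for
        min_beta  ||y - X beta||^2 / (2n) + lam * sum_j scale[j] * |beta_j| ,
    i.e. problem (8) with the intercept already eliminated."""
    n, p = X.shape
    beta = np.zeros(p) if beta_init is None else beta_init.copy()
    col_sq = np.sum(X**2, axis=0) / n            # ||x_j||^2 / n
    r = y - X @ beta                             # full residual
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            r_j = r + X[:, j] * beta[j]          # partial residual without feature j
            new_bj = soft_threshold(X[:, j] @ r_j / n, lam * scale[j]) / col_sq[j]
            max_change = max(max_change, abs(new_bj - beta[j]))
            r = r_j - X[:, j] * new_bj           # restore the full residual
            beta[j] = new_bj
        if max_change < tol:
            break
    return beta

# Example call with the constant scaling sigma_j of (1)
rng = np.random.default_rng(3)
X = rng.standard_normal((100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)
beta_hat = lasso_coordinate_descent(X, y, lam=0.05, scale=X.std(axis=0))
print(np.round(beta_hat, 3))
```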
Appendix 0.B Additional Numerical Results
In Tables 5 and 6 we provide results for numerical simulations with additional values of $\gamma$. In Table 5, $\boldsymbol{\beta}$ is simulated to be sparse and $\gamma$ is chosen so that the signal is weaker than that of Table 1. In Table 6, $\boldsymbol{\beta}$ is simulated to be non-sparse and $\gamma$ is chosen so that the signal is stronger than that of Table 2. In comparison to the weak-signal setting of Table 2, there is a greater difference in the mean number of false positives in favor of the IRL method in Table 6.
Table 5: Logistic regression. Average Variance = 0.191.

| Method | $(\rho,\sigma^2)$ | Bias | TP | FP | Test Loss |
|---|---|---|---|---|---|
| IRL | (0.9, 0.1) | 146.29 | 3.55 (0.13) | 3.61 (0.18) | 0.55 |
| Glmnet | (0.9, 0.1) | 143.38 | 3.66 (0.12) | 4.56 (0.20) | 0.55 |
| IRL | (0.1, 1.0) | 55.2 | 4.49 (0.11) | 12.27 (0.39) | 0.57 |
| Glmnet | (0.1, 1.0) | 58.07 | 4.64 (0.12) | 13.85 (0.39) | 0.57 |
| IRL | (0.1, 0.1) | 63.74 | 5.79 (0.13) | 7.37 (0.29) | 0.56 |
| Glmnet | (0.1, 0.1) | 66.26 | 5.87 (0.12) | 9.29 (0.30) | 0.56 |
| IRL | (0.9, 1.0) | 80.68 | 5.62 (0.12) | 5.94 (0.25) | 0.56 |
| Glmnet | (0.9, 1.0) | 81.78 | 5.72 (0.12) | 7.90 (0.28) | 0.56 |
Table 6: Logistic regression. Average Variance = 0.021.

| Method | $(\rho,\sigma^2)$ | Bias | TP | FP | Test Loss |
|---|---|---|---|---|---|
| IRL | (0.1, 0.1) | 49.29 | 6.97 (0.11) | 23.59 (0.46) | 0.08 |
| Glmnet | (0.1, 0.1) | 53.75 | 6.85 (0.10) | 30.79 (0.48) | 0.09 |
| IRL | (0.1, 1.0) | 48.54 | 9.34 (0.07) | 43.09 (0.54) | 0.12 |
| Glmnet | (0.1, 1.0) | 50.32 | 9.40 (0.07) | 47.15 (0.54) | 0.12 |
| IRL | (0.9, 0.1) | 58.08 | 6.34 (0.12) | 4.65 (0.24) | 0.06 |
| Glmnet | (0.9, 0.1) | 66.95 | 6.42 (0.13) | 10.96 (0.28) | 0.06 |
| IRL | (0.9, 1.0) | 45.34 | 6.77 (0.11) | 15.47 (0.37) | 0.07 |
| Glmnet | (0.9, 1.0) | 53.09 | 6.53 (0.11) | 23.93 (0.42) | 0.07 |
References
- [1] Bertsimas, D., King, A., Mazumder, R.: Best subset selection via a modern optimization lens. The Annals of Statistics 44(2), 813–852 (2016)
- [2] Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1), 1–22 (2010)
- [3] Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer (2009)
- [4] Hastie, T., Tibshirani, R., Tibshirani, R.: Best subset, forward stepwise or lasso? Analysis and recommendations based on extensive comparisons. Statistical Science 35(4), 579–592 (2020)
- [5] Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press (2015)
- [6] Kroese, D.P., Botev, Z., Taimre, T., Vaisman, R.: Data Science and Machine Learning: Mathematical and Statistical Methods. CRC Press (2019)
- [7] Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288 (1996)
- [8] Wolberg, W., Mangasarian, O., Street, N., Street, W.: Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository (1995). https://doi.org/10.24432/C5DW2B
- [9] Zou, H.: The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101(476), 1418–1429 (2006)