
University of New South Wales, Kensington, Sydney, NSW 2052, Australia. Email: anant.mathur@unsw.edu.au

Feature Selection in Generalized Linear models via the Lasso: To Scale or Not to Scale?

Anant Mathur, Sarat Moka, Zdravko Botev
Abstract

Lasso regression is a popular regularization method for feature selection in statistics. Prior to computing the Lasso estimator in both linear and generalized linear models, it is common to conduct a preliminary rescaling of the feature matrix so that all the features are standardized. Without this standardization, it is argued, the Lasso estimate will depend on the units used to measure the features. We propose a new type of iterative rescaling of the features in the context of generalized linear models. Whilst existing Lasso algorithms perform a single scaling as a preprocessing step, the proposed rescaling is applied iteratively throughout the Lasso computation until convergence. We provide numerical examples, with both real and simulated data, illustrating that the proposed iterative rescaling can significantly improve the statistical performance of the Lasso estimator without incurring any significant additional computational cost.

Keywords:
Lasso · standardization · feature scaling · logistic regression · Poisson regression

1 Introduction and Background

We begin by providing the background notation and facts for the linear regression model and its extensions to generalized linear models. With this requisite notation and background material in place, we can explain the main contribution of our paper without unnecessary verbosity.

We denote the $n\times p$ regression matrix (or feature matrix) containing the $p$ features $\boldsymbol{v}_1,\ldots,\boldsymbol{v}_p$ as

\[
\mathbf{X}=[\boldsymbol{v}_1,\ldots,\boldsymbol{v}_p]=\begin{bmatrix}\boldsymbol{x}_1^{\top}\\ \vdots\\ \boldsymbol{x}_n^{\top}\end{bmatrix},
\]

and the corresponding regression response vector as $\boldsymbol{Y}\in\mathbb{R}^n$ (with $\boldsymbol{y}$ being the realization of the random vector $\boldsymbol{Y}$). We make the standard assumptions (see, for example, [6]) that the regression function, or the mean of $\boldsymbol{Y}$ conditional on $\mathbf{X}$, satisfies $\boldsymbol{Y}=\mathbb{E}_{\mathbf{X}}[\boldsymbol{Y}]+\boldsymbol{\epsilon}$ for some noise vector $\boldsymbol{\epsilon}$ with conditional expectation $\mathbb{E}_{\mathbf{X}}[\boldsymbol{\epsilon}]=\boldsymbol{0}$ and conditional variance $\mathbb{V}\mathrm{ar}_{\mathbf{X}}(\boldsymbol{\epsilon})=\sigma^2\mathbf{I}_n$, where $\sigma$ is an unknown parameter. Recall that in a linear model, we assume that

\[
\mathbb{E}_{\mathbf{X}}[\boldsymbol{Y}]=\boldsymbol{1}\beta_0+\mathbf{X}\boldsymbol{\beta}
\]

is a linear function of some model coefficients $\boldsymbol{\beta}\in\mathbb{R}^p$ and $\beta_0\in\mathbb{R}$, the last one being called the intercept and corresponding to the constant feature $\boldsymbol{1}\in\mathbb{R}^n$. Thus, the model for $\boldsymbol{Y}$ is

\[
\boldsymbol{Y}=\boldsymbol{1}\beta_0+\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}.
\]

Define the projection (idempotent) matrix $\mathbf{C}:=\mathbf{I}_n-\boldsymbol{1}\boldsymbol{1}^{\top}/n$ and let

\[
\eta_i^2:=\frac{\|\mathbf{C}\boldsymbol{v}_i\|^2}{n}=\frac{\boldsymbol{v}_i^{\top}\mathbf{C}\boldsymbol{v}_i}{n},\qquad i=1,\ldots,p,
\]

be the empirical variance of the components of the $i$-th feature vector. We define the Lasso estimate [7] of $\boldsymbol{\beta}$ as the solution to the penalized least squares:

\[
(\hat{\beta}_{0,\lambda},\hat{\boldsymbol{\beta}}_{\lambda})=\operatorname*{argmin}_{b_0,\boldsymbol{b}}\;\frac{\|\boldsymbol{y}-\boldsymbol{1}b_0-\mathbf{X}\boldsymbol{b}\|^2}{2n}+\lambda\sum_{i=1}^{p}\eta_i\times|b_i|,\tag{1}
\]

where $\lambda>0$ is a suitably chosen regularization parameter and the intercept $b_0$ is not penalized. It will shortly become clear why the definition (1) of the Lasso estimator appears prima facie to be different from the one used in textbooks [5] and statistical packages [2]. We remind the reader that the main advantage of Lasso regularization is that many of the components of $\hat{\boldsymbol{\beta}}_{\lambda}$ are estimated as zeros, making it very easy to tell which features are important and which are not. This phenomenon is behind the well-known ability of the Lasso estimator to perform simultaneous shrinkage and model selection [5, 7].

Feature Standardization.

As mentioned in the abstract, it is common practice to standardize the features $\boldsymbol{v}_1,\ldots,\boldsymbol{v}_p$ so that the variance of each $\boldsymbol{v}_i$ is unity [5, 7]. In other words, the columns $\boldsymbol{v}_1,\ldots,\boldsymbol{v}_p$ of $\mathbf{X}$ are all rescaled in such a way that $\|\mathbf{C}\boldsymbol{v}_i\|^2=n$ for all $i$. This standardization ensures that the Lasso estimate $\hat{\boldsymbol{\beta}}_{\lambda}$ is not affected by the units in which the features are measured, and in general improves the performance of the estimator [3]. The standardization can be accomplished by working with the matrix $\mathbf{X}\mathbf{\Upsilon}$, rather than $\mathbf{X}$, where $\mathbf{\Upsilon}$ is the rescaling matrix

\[
\mathbf{\Upsilon}:=\mathrm{diag}(\eta_1^{-1},\ldots,\eta_p^{-1}).
\]

The solution to (1) (computed without preliminary standardization of the features) is equivalent to the solution of

\[
(\hat{\beta}_{0,\lambda},\mathbf{\Upsilon}^{-1}\hat{\boldsymbol{\beta}}_{\lambda})=\operatorname*{argmin}_{b_0,\boldsymbol{b}}\;\frac{\|\boldsymbol{y}-\boldsymbol{1}b_0-[\mathbf{X}\mathbf{\Upsilon}]\boldsymbol{b}\|^2}{2n}+\lambda\sum_{i=1}^{p}|b_i|,\tag{2}
\]

so that $\hat{\boldsymbol{\beta}}_{\lambda}$ is now in agreement with the definition of the Lasso estimator given in textbooks and statistical packages [2].
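To make this equivalence concrete, the following minimal Python sketch (our illustration, not code from the paper) fits a plain Lasso on the standardized features $\mathbf{X}\mathbf{\Upsilon}$ and maps the coefficients back to the original units via $\mathbf{\Upsilon}$; scikit-learn's Lasso objective matches (2), with the intercept left unpenalized.

import numpy as np
from sklearn.linear_model import Lasso  # objective: ||y - Xb||^2/(2n) + alpha*||b||_1

rng = np.random.default_rng(0)
n, p = 200, 10
# Features deliberately measured on very different scales.
X = rng.normal(size=(n, p)) * rng.uniform(0.1, 10.0, size=p)
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = 1.0 + X @ beta_true + rng.normal(size=n)

eta = X.std(axis=0)            # eta_i = sqrt(||C v_i||^2 / n)
Upsilon = np.diag(1.0 / eta)

lam = 0.1
# Solve (2): a plain Lasso on the standardized features X @ Upsilon.
fit = Lasso(alpha=lam, fit_intercept=True).fit(X @ Upsilon, y)
# Undo the scaling to obtain the solution of (1) in the original units.
beta_hat = Upsilon @ fit.coef_
print(np.round(beta_hat, 3))

The vector beta_hat printed at the end is the estimate expressed in the original units of measurement.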

Extensions to GLMs.

Suppose that the joint density of the response vector $\boldsymbol{Y}$ given $\beta_0,\boldsymbol{\beta},\mathbf{X}$ is $g(\boldsymbol{y}\,|\,\beta_0,\boldsymbol{\beta},\mathbf{X})$, where the dependence on $\beta_0,\boldsymbol{\beta},\mathbf{X}$ is through the linear map $(\beta_0,\boldsymbol{\beta})\mapsto\boldsymbol{1}\beta_0+\mathbf{X}\boldsymbol{\beta}$. Here, the cross-entropy training loss [6] (negative average log-likelihood) is $-\frac{1}{n}\ln g(\boldsymbol{y}\,|\,\beta_0,\boldsymbol{\beta},\mathbf{X})$, and the extension of the Lasso estimator (2) to the setting of generalized linear models is then given by:

\[
(\hat{\beta}_{0,\lambda},\mathbf{\Upsilon}^{-1}\hat{\boldsymbol{\beta}}_{\lambda})=\operatorname*{argmin}_{b_0,\boldsymbol{b}}\;\frac{-\ln g(\boldsymbol{y}\,|\,b_0,\boldsymbol{b},\mathbf{X}\mathbf{\Upsilon})}{n}+\lambda\sum_{i=1}^{p}|b_i|.\tag{3}
\]

Observe that, just like in the linear Lasso estimator (2), we scale the features so that their variance is unity [5]. This scaling need only be applied once to $\mathbf{X}$, possibly as a preprocessing step prior to the main optimization, and then reversed at the end of the optimization (in order to obtain the regression coefficients in the original units of measurement). The Lasso solution (3) can also be rewritten in a form equivalent to (1), namely,

\[
(\hat{\beta}_{0,\lambda},\hat{\boldsymbol{\beta}}_{\lambda})=\operatorname*{argmin}_{b_0,\boldsymbol{b}}\;\frac{-\ln g(\boldsymbol{y}\,|\,b_0,\boldsymbol{b},\mathbf{X})}{n}+\lambda\sum_{i=1}^{p}\eta_i\times|b_i|.\tag{4}
\]

We are now ready to describe our proposed methodology of rescaling.

New Rescaling Method for GLMs.

Let

\[
r(b_0,\boldsymbol{b}):=\frac{-\ln g(\boldsymbol{y}\,|\,b_0,\boldsymbol{b},\mathbf{X})}{n}
\]

be our shorthand notation for the cross-entropy loss. We define

\[
\eta_i^2(\boldsymbol{\beta}),\qquad i=1,\ldots,p,
\]

to be the $i$-th diagonal element of the $p\times p$ Hessian matrix of second derivatives:

\[
\frac{\partial^2}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}^{\top}}\,\min_{b_0}r(b_0,\boldsymbol{\beta}).
\]

This is the Hessian matrix of the cross-entropy loss, evaluated at the true parameter $\boldsymbol{\beta}$, and after the nuisance parameter $\beta_0$ is eliminated from the optimization. Then, instead of the usual rescaled Lasso estimator (4), in this article we propose the following alternative iteratively rescaled Lasso (IRL):

\[
\operatorname*{argmin}_{\boldsymbol{b},b_0}\;\frac{-\ln g(\boldsymbol{y}\,|\,b_0,\boldsymbol{b},\mathbf{X})}{n}+\lambda\sum_{i=1}^{p}\eta_i(\boldsymbol{\beta})\times|b_i|.\tag{5}
\]

We now make three observations.

First, since each $\eta_i(\boldsymbol{\beta})$ depends on the unknown $\boldsymbol{\beta}$, the approximate computation of (5) will be iterative (this is explained carefully in the next section), which is the main reason for naming the method IRL.

Second, the linear Lasso estimator (1) is a special case of (5) when $\boldsymbol{Y}$ is a multivariate Gaussian with mean $\mathbb{E}_{\mathbf{X}}[\boldsymbol{Y}]=\boldsymbol{1}\beta_0+\mathbf{X}\boldsymbol{\beta}$ and variance $\mathbb{V}\mathrm{ar}_{\mathbf{X}}(\boldsymbol{Y})=\mathbf{I}$, because then $\eta_i^2(\boldsymbol{\beta})=\|\mathbf{C}\boldsymbol{v}_i\|^2/n$ (a short verification is given after the third observation below).

Third, one may ask what the motivation is for the proposed IRL estimator. The answer is that the IRL estimator coincides with the traditional linear regression estimator (1), provided that the cross-entropy loss $r(b_0,\boldsymbol{b})$ is replaced by its quadratic approximation in the neighborhood of the true coefficients $\beta_0,\boldsymbol{\beta}$. In other words, our proposed IRL estimator uses exactly the same scaling as the linear Lasso estimator (1) when the generalized linear model is linearized at the true solution. Note that there is no such agreement in the scaling between the currently accepted linear estimator (1) and its GLM counterpart (4); that is, the current widely-used scaling is not consistent across linear and nonlinear models. Our proposal is thus motivated by the desire for consistency in the scaling applied to linear and nonlinear models.
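Returning to the second observation, here is a brief verification (ours, assuming $\sigma=1$). For the Gaussian model the cross-entropy loss is, up to an additive constant,

\[
r(b_0,\boldsymbol{b})=\frac{\|\boldsymbol{y}-\boldsymbol{1}b_0-\mathbf{X}\boldsymbol{b}\|^2}{2n},
\qquad
\min_{b_0}r(b_0,\boldsymbol{b})=\frac{\|\mathbf{C}(\boldsymbol{y}-\mathbf{X}\boldsymbol{b})\|^2}{2n},
\]

since the minimizing intercept is $b_0=\boldsymbol{1}^{\top}(\boldsymbol{y}-\mathbf{X}\boldsymbol{b})/n$. The Hessian of the profiled loss with respect to $\boldsymbol{b}$ is $\mathbf{X}^{\top}\mathbf{C}\mathbf{X}/n$, whose $i$-th diagonal element is $\boldsymbol{v}_i^{\top}\mathbf{C}\boldsymbol{v}_i/n=\|\mathbf{C}\boldsymbol{v}_i\|^2/n$, as claimed.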

The rest of the paper is organized as follows. In Section 2 we explain how to approximately compute the Lasso estimator in (5), given that we do not have a priori knowledge of the Hessian matrix $\mathbf{H}$ at the true parameter $\boldsymbol{\beta}$. Then, in Section 3 we provide a number of numerical examples with both simulated and real data, illustrating the scope for improvement in the statistical performance of the Lasso estimator. In the final section, we give concluding remarks.

2 Computation via Iterative Reweighted Least Squares

Since the true $\boldsymbol{\beta}$ is not known a priori, in this section we explain how to approximately compute (5) using an iteratively reweighted least squares (IRLS) method. We begin by reviewing the well-known IRLS for computing the estimator (4) and then explain how it is easily modified to approximately compute our proposed estimator (5). To this end, we introduce the notation $\breve{\boldsymbol{b}}:=[b_0,\boldsymbol{b}^{\top}]^{\top}$ and $\breve{\mathbf{X}}:=[\boldsymbol{1},\mathbf{X}]$, so that $\breve{\mathbf{X}}\breve{\boldsymbol{b}}=\boldsymbol{1}b_0+\mathbf{X}\boldsymbol{b}$. Then, computing (4) is equivalent to minimizing

\[
r(\breve{\boldsymbol{b}})+\lambda\sum_{i=1}^{p}\eta_i\times|b_i|
\]

with respect to $\breve{\boldsymbol{b}}$. This problem is nonlinear, but it can be solved by successive linearizations of $r(\breve{\boldsymbol{b}})$. Suppose that at iteration $t$ we have a current best guess $\breve{\boldsymbol{b}}_t$ for the minimizer of (4). Given this $\breve{\boldsymbol{b}}_t$, we consider the quadratic multivariate Taylor approximation to the cross-entropy loss at the point $\breve{\boldsymbol{b}}_t$:

\[
r(\breve{\boldsymbol{b}}_t)+(\breve{\boldsymbol{b}}-\breve{\boldsymbol{b}}_t)^{\top}\nabla r(\breve{\boldsymbol{b}}_t)+\frac{1}{2}(\breve{\boldsymbol{b}}-\breve{\boldsymbol{b}}_t)^{\top}\mathbf{H}(\breve{\boldsymbol{b}}_t)(\breve{\boldsymbol{b}}-\breve{\boldsymbol{b}}_t).
\]

Then, we update $\breve{\boldsymbol{b}}_t$ to $\breve{\boldsymbol{b}}_{t+1}$ by computing the linear Lasso estimator:

\[
\breve{\boldsymbol{b}}_{t+1}:=\operatorname*{argmin}_{\breve{\boldsymbol{b}}}\;(\breve{\boldsymbol{b}}-\breve{\boldsymbol{b}}_t)^{\top}\nabla r(\breve{\boldsymbol{b}}_t)+\frac{1}{2}(\breve{\boldsymbol{b}}-\breve{\boldsymbol{b}}_t)^{\top}\mathbf{H}(\breve{\boldsymbol{b}}_t)(\breve{\boldsymbol{b}}-\breve{\boldsymbol{b}}_t)+\lambda\sum_{i=1}^{p}\eta_i\times|b_i|.\tag{6}
\]

This computation is then iterated until convergence [2].

To keep the mathematical detail simple, we henceforth use the Logit model to illustrate the computations that are typically required for all GLMs. In the Logit model the binary responses $Y_1,\ldots,Y_n$ are assumed to be independent Bernoulli random variables with conditional means

\[
\mathbb{E}_{\mathbf{X}}[Y_i]=\frac{1}{1+\exp(-\beta_0-\boldsymbol{x}_i^{\top}\boldsymbol{\beta})},\qquad i=1,\ldots,n,
\]

yielding the cross-entropy loss:

\[
r(\breve{\boldsymbol{b}})=\frac{1}{n}\sum_{i=1}^{n}\Big[(1-y_i)\,\breve{\boldsymbol{x}}_i^{\top}\breve{\boldsymbol{b}}+\ln\big(1+\exp(-\breve{\boldsymbol{x}}_i^{\top}\breve{\boldsymbol{b}})\big)\Big].\tag{7}
\]

Then, the gradient $\nabla r(\breve{\boldsymbol{b}}_t)$ and Hessian $\mathbf{H}(\breve{\boldsymbol{b}}_t)$ are given by the formulas:

\[
\begin{split}
\mu_{t,i}&:=(1+\exp(-\breve{\boldsymbol{x}}_i^{\top}\breve{\boldsymbol{b}}_t))^{-1},\\
\boldsymbol{\mu}_t&:=[\mu_{t,1},\ldots,\mu_{t,n}]^{\top},\\
\nabla r(\breve{\boldsymbol{b}}_t)&=\frac{1}{n}\breve{\mathbf{X}}^{\top}(\boldsymbol{\mu}_t-\boldsymbol{y}),\\
\boldsymbol{w}_t&:=\sqrt{\boldsymbol{\mu}_t\odot(\boldsymbol{1}-\boldsymbol{\mu}_t)},\qquad\big(w_{t,i}=\sqrt{\mu_{t,i}(1-\mu_{t,i})}\ \text{ for all }i\big),\\
\mathbf{D}_t&:=\mathrm{diag}(\boldsymbol{w}_t),\\
\mathbf{H}(\breve{\boldsymbol{b}}_t)&=\frac{1}{n}\breve{\mathbf{X}}^{\top}\mathbf{D}_t^2\breve{\mathbf{X}}=\frac{1}{n}\sum_{i=1}^{n}w_{t,i}^2\,\breve{\boldsymbol{x}}_i\breve{\boldsymbol{x}}_i^{\top}.
\end{split}
\]

If we define the quantities:

\[
\begin{split}
\mathbf{X}_t&:=\mathbf{D}_t\mathbf{X},\\
\boldsymbol{y}_t&:=\boldsymbol{w}_t b_{t,0}+\mathbf{X}_t\boldsymbol{b}_t+\mathbf{D}_t^{-1}(\boldsymbol{y}-\boldsymbol{\mu}_t),
\end{split}
\]

then straightforward algebra shows that the estimator in (6) can also be written in terms of the following weighted regularized least-squares problem:

\[
(b_{t+1,0},\boldsymbol{b}_{t+1}):=\operatorname*{argmin}_{b_0,\boldsymbol{b}}\;\|\boldsymbol{y}_t-\boldsymbol{w}_t b_0-\mathbf{X}_t\boldsymbol{b}\|^2+\lambda\sum_{i=1}^{p}\eta_i\times|b_i|.
\]

We can eliminate the intercept term $b_0$ from the optimization by applying the centering (projection) matrix

\[
\mathbf{C}_t:=\mathbf{I}_n-\boldsymbol{w}_t\boldsymbol{w}_t^{\top}/\|\boldsymbol{w}_t\|^2
\]

to both $\mathbf{X}_t$ and $\boldsymbol{y}_t$. Once $\boldsymbol{b}_{t+1}$ is computed, we recover $b_{0,t+1}=\boldsymbol{w}_t^{\top}(\boldsymbol{y}_t-\mathbf{X}_t\boldsymbol{b}_{t+1})/\|\boldsymbol{w}_t\|^2$. This gives the following formulas for updating $(b_{0,t},\boldsymbol{b}_t)$ to $(b_{0,t+1},\boldsymbol{b}_{t+1})$:

\[
\begin{split}
\boldsymbol{b}_{t+1}&:=\operatorname*{argmin}_{\boldsymbol{b}}\;\|\mathbf{C}_t\boldsymbol{y}_t-\mathbf{C}_t\mathbf{X}_t\boldsymbol{b}\|^2+\lambda\sum_{i=1}^{p}\eta_i\times|b_i|,\\
b_{0,t+1}&:=\frac{\boldsymbol{w}_t^{\top}(\boldsymbol{y}_t-\mathbf{X}_t\boldsymbol{b}_{t+1})}{\|\boldsymbol{w}_t\|^2}.
\end{split}
\]
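For concreteness, here is a minimal Python sketch of a single such update for the Logit model (our illustration; lasso_cd is an assumed helper that solves the weighted-penalty problem (8), for instance the coordinate-descent routine sketched in Appendix 0.A):

import numpy as np

def irls_step(y, X, b0, b, lam, eta, lasso_cd):
    # One reweighted least-squares update for the Lasso-Logistic model.
    mu = 1.0 / (1.0 + np.exp(-(b0 + X @ b)))       # fitted means mu_t
    w = np.sqrt(mu * (1.0 - mu))                   # square-root weights w_t
    Xt = X * w[:, None]                            # X_t = D_t X
    yt = b0 * w + Xt @ b + (y - mu) / w            # adjusted, weighted response y_t
    cy = (yt @ w) / (w @ w)                        # intercept adjustments
    cx = (Xt.T @ w) / (w @ w)
    # Centered (intercept-free) weighted Lasso with penalty scalings eta.
    b_new = lasso_cd(b, yt - cy * w, Xt - np.outer(w, cx), lam, eta)
    b0_new = cy - cx @ b_new                       # recover the intercept
    return b0_new, b_new

Iterating this update until $\|\boldsymbol{b}_{\mathrm{old}}-\boldsymbol{b}\|<\epsilon$ corresponds to the middle loop of Algorithm 1 below.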

We iterate for $t=1,2,\ldots$ until a convergence criterion is met. This iteratively reweighted penalized least squares is summarized in the following pseudo-algorithm [2], which assumes that we compute the Lasso estimate $(\hat{\beta}_{0,\lambda},\hat{\boldsymbol{\beta}}_{\lambda})$ on a grid of $m$ values (with $m=200$ being a typical value):

\[
\lambda_1<\lambda_2<\cdots<\lambda_m,
\]

where $\lambda_m$ is large enough so that $\hat{\boldsymbol{\beta}}_{\lambda_m}=\boldsymbol{0}$; see [2].

input: $\boldsymbol{y}$, $\mathbf{X}$, error tolerance $\epsilon>0$, and grid $\lambda_1<\lambda_2<\cdots<\lambda_m$
output: regularized solution $(b_{0,j},\boldsymbol{b}_j)$ for each $\lambda_j$, $j=1,\ldots,m$
1   $\boldsymbol{b}\leftarrow\boldsymbol{0}$, $b_0\leftarrow-\ln(1/\bar{y}-1)$ // initializing values
2   for $j=m,m-1,\ldots,2,1$ do // outer loop over $\{\lambda_j\}$
3       $t\leftarrow 0$
4       repeat // middle loop over the quadratic approximations
5           $\boldsymbol{b}_{\mathrm{old}}\leftarrow\boldsymbol{b}$
6           $\boldsymbol{\mu}\leftarrow(\boldsymbol{1}+\exp(-\boldsymbol{1}b_0-\mathbf{X}\boldsymbol{b}))^{-1}$ // mean response
7           $\boldsymbol{w}\leftarrow\sqrt{\boldsymbol{\mu}\odot(\boldsymbol{1}-\boldsymbol{\mu})}$ // weights
8           $\mathbf{X}_t\leftarrow\mathrm{diag}(\boldsymbol{w})\mathbf{X}$
9           $\boldsymbol{y}_t\leftarrow b_0\boldsymbol{w}+\mathbf{X}_t\boldsymbol{b}+(\boldsymbol{y}-\boldsymbol{\mu})\div\boldsymbol{w}$ // adjusted and weighted response
10          $c_y\leftarrow\boldsymbol{y}_t^{\top}\boldsymbol{w}/\|\boldsymbol{w}\|^2$, $\boldsymbol{c}_x\leftarrow\mathbf{X}_t^{\top}\boldsymbol{w}/\|\boldsymbol{w}\|^2$ // adjustments for the intercept
11          $\eta_i\leftarrow\sqrt{\|\mathbf{C}\boldsymbol{v}_i\|^2/n}$ for $i=1,\ldots,p$ // square root of each feature variance
12          $\boldsymbol{b}\leftarrow\mathrm{lassoCD}(\boldsymbol{b},\,\boldsymbol{y}_t-\boldsymbol{w}c_y,\,\mathbf{X}_t-\boldsymbol{w}\boldsymbol{c}_x^{\top},\,\epsilon,\,\lambda_j,\,\boldsymbol{\eta})$ // inner loop
13          $b_0\leftarrow c_y-\boldsymbol{c}_x^{\top}\boldsymbol{b}$ // intercept update given $\boldsymbol{b}$
14          $t\leftarrow t+1$
15      until $\|\boldsymbol{b}_{\mathrm{old}}-\boldsymbol{b}\|<\epsilon$
16      $(b_{0,j},\boldsymbol{b}_j)\leftarrow(b_0,\boldsymbol{b})$ // store values for each grid point
return $b_{0,j},\boldsymbol{b}_j$, $j=1,\ldots,m$
Algorithm 1: IRLS for the Lasso-Logistic model [2].

Line 12 in Algorithm 1 calls the subroutine $\mathrm{lassoCD}(\boldsymbol{b},\boldsymbol{y},\mathbf{X},\epsilon,\lambda,\boldsymbol{\eta})$ to compute the Lasso estimate

\[
\operatorname*{argmin}_{\boldsymbol{b}}\;\frac{\|\boldsymbol{y}-\mathbf{X}\boldsymbol{b}\|^2}{2n}+\lambda\sum_{i=1}^{p}\eta_i\times|b_i|\tag{8}
\]

to within an error tolerance of $\epsilon$ via the method of coordinate descent. For completeness, we include the pseudocode of this subroutine in the Appendix and refer the reader to [2] for more details of this well-known algorithm.

We now describe our proposed method for approximately computing the Lasso estimate (5). The basic idea is to modify the standard linearization (6) by replacing the penalty term $\sum_{i=1}^{p}\eta_i\times|b_i|$ with $\sum_{i=1}^{p}\eta_{t,i}\times|b_i|$, where $\boldsymbol{\eta}_t$ is determined from the columns of the Hessian of the cross-entropy loss evaluated at the current estimate $\breve{\boldsymbol{b}}_t$. This gives an iteratively reweighted least squares in which the scaling in the Lasso penalty term is modified from one iteration to the next:

\[
\breve{\boldsymbol{b}}_{t+1}:=\operatorname*{argmin}_{\breve{\boldsymbol{b}}}\;(\breve{\boldsymbol{b}}-\breve{\boldsymbol{b}}_t)^{\top}\nabla r(\breve{\boldsymbol{b}}_t)+\frac{(\breve{\boldsymbol{b}}-\breve{\boldsymbol{b}}_t)^{\top}\mathbf{H}(\breve{\boldsymbol{b}}_t)(\breve{\boldsymbol{b}}-\breve{\boldsymbol{b}}_t)}{2}+\lambda\sum_{i=1}^{p}\eta_{t,i}\times|b_i|.
\]

We implement this iterative rescaling with only one minor modification of Algorithm 1. In particular, we replace line 11 of Algorithm 1 with the (square roots of the) variances of the features in $\mathbf{X}_t$, that is, of the features weighted by $\sqrt{\mu_{t,i}(1-\mu_{t,i})}$ for each $i$:

\[
\boldsymbol{\eta}_t\leftarrow\sqrt{\mathrm{diag}\big((\mathbf{X}_t-\boldsymbol{w}\boldsymbol{c}_x^{\top})^{\top}(\mathbf{X}_t-\boldsymbol{w}\boldsymbol{c}_x^{\top})\big)/n}.
\]

In other words, rather than computing the square roots of the diagonal elements of the matrix $\mathbf{X}^{\top}\mathbf{C}\mathbf{X}/n$ (as currently done in line 11 of Algorithm 1), we instead compute the square roots of the diagonal elements of the matrix $\mathbf{X}_t^{\top}\mathbf{C}_t\mathbf{X}_t/n$. This is the only significant difference between the implementations of the current widely-used constant scaling and our proposed iteratively rescaled Lasso (IRL) algorithm.
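In code, the change amounts to a couple of lines; the sketch below (ours) reuses the arrays X, y, w, Xt and cx from the irls_step sketch above and contrasts the two scalings.

import numpy as np

# Constant scaling (line 11 of Algorithm 1): computed once from X.
eta_const = np.sqrt(np.mean((X - X.mean(axis=0)) ** 2, axis=0))

# IRL scaling: recomputed at every middle-loop iteration from the weighted, centered X_t.
Xt_c = Xt - np.outer(w, cx)                        # C_t X_t
eta_t = np.sqrt(np.diag(Xt_c.T @ Xt_c) / len(y))   # sqrt of diag(X_t^T C_t X_t / n)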

Remark 1 (An Alternative Computation via a Pilot Estimate).

Recall that the adaptive Lasso estimator [9] is a variation of the classical or vanilla Lasso estimator (1), where the scaling is not constant, but depends on a root-$n$ consistent pilot estimator $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$ (for example, the least-squares estimator when $n>p$):

\[
\eta_k(\hat{\boldsymbol{\beta}})=\frac{\|\mathbf{C}\boldsymbol{v}_k\|^2/n}{|\hat{\beta}_k|}.
\]

This type of scaling successfully achieves model selection consistency under weak assumptions; see [9].

As in the adaptive Lasso case, given a good pilot estimate $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$, we can compute $\eta_k^2(\hat{\boldsymbol{\beta}})$, $k=1,\ldots,p$, as the diagonal elements of the Hessian matrix of the cross-entropy loss at $\hat{\boldsymbol{\beta}}$ and then deliver the Lasso estimator:

\[
\operatorname*{argmin}_{\boldsymbol{b},b_0}\;r(b_0,\boldsymbol{b})+\lambda\sum_{i=1}^{p}\eta_i(\hat{\boldsymbol{\beta}})\times|b_i|.
\]

In this implementation, the scaling parameters $\eta_i(\hat{\boldsymbol{\beta}})$, $i=1,\ldots,p$, are fixed at the beginning of the Lasso optimization and do not change during the course of the iteratively reweighted least squares. This approach is an alternative to the iteratively rescaled Lasso described above, and we do not pursue it further in this work.
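For the Logit model, a small Python sketch of this alternative (ours; it assumes $n>p$, uses an essentially unpenalized scikit-learn logistic fit as the pilot, and requires scikit-learn 1.2 or later for penalty=None) computes the fixed scalings from the Hessian at the pilot estimate, with the intercept profiled out exactly as in the IRL scaling:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Pilot fit (essentially unpenalized; assumes n > p).
pilot = LogisticRegression(penalty=None, max_iter=1000).fit(X, y)
z = pilot.intercept_[0] + X @ pilot.coef_.ravel()
mu = 1.0 / (1.0 + np.exp(-z))
w = np.sqrt(mu * (1.0 - mu))

Xw = X * w[:, None]                                # weighted features at the pilot
cx = (Xw.T @ w) / (w @ w)
Xc = Xw - np.outer(w, cx)                          # weighted and centered
eta_pilot = np.sqrt(np.diag(Xc.T @ Xc) / len(y))   # fixed scalings eta_k(beta_hat)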

3 Numerical Results

In this section, we provide empirical comparisons between the IRL method and the state-of-the-art Glmnet implementation [2] for Logistic and Poisson regression. We report results for experiments conducted on both simulated and real data.

3.1 Simulations

In our simulations, we generate synthetic data from the model:

\[
\boldsymbol{h}=\tau\left(\boldsymbol{1}\beta_0+\mathbf{X}\boldsymbol{\beta}\right),\tag{9}
\]

where the parameter $\tau>0$ is a scaling parameter that determines the strength of the signal. When implementing Logistic regression, we sample a random vector $\boldsymbol{Y}$ by drawing components $y_i$ from $\mathrm{Bernoulli}(\mu_i)$, where $\mu_i=1/(1+\exp(-h_i))$ for all $i$. Similarly, for Poisson regression we sample $\boldsymbol{Y}$ by drawing $y_i$ from $\mathrm{Poisson}(\mu_i)$, where $\mu_i=\exp(h_i)$. Each row of the predictor matrix $\mathbf{X}$ is generated from a multivariate normal distribution with zero mean and covariance $\mathbf{\Sigma}$ with diagonal elements $\mathbf{\Sigma}_{j,j}=1$ and off-diagonal elements $\mathbf{\Sigma}_{i,j}=\rho^{\gamma|i-j|}$, for parameters $\rho\in(-1,1)$ and $\gamma>0$; see, e.g., [4, 1]. To induce sparsity in $\mathbf{X}$ we introduce the parameter $\xi>0$ and set $\mathbf{X}_{i,j}=0$ for all $\mathbf{X}_{i,j}<\xi$. We let the intercept $\beta_0=0$ and set the first 10 components of $\boldsymbol{\beta}$ as

\[
\boldsymbol{\beta}_{[1:10]}=\begin{bmatrix}25,&4,&-4,&50,&4,&-4,&75,&4,&-4,&100\end{bmatrix}^{\top},
\]

and the remaining components to 0.
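The following Python sketch (ours) generates one synthetic data set according to (9) and the description above; the non-sparse setting ($\xi=\infty$ in the tables) simply skips the thresholding step.

import numpy as np

def simulate(n=1000, p=100, tau=1.0, xi=0.1, rho=0.1, gamma=0.1,
             family="logistic", seed=0):
    rng = np.random.default_rng(seed)
    # Covariance Sigma_{i,j} = rho^(gamma*|i-j|) with unit diagonal.
    idx = np.arange(p)
    Sigma = rho ** (gamma * np.abs(idx[:, None] - idx[None, :]))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    if np.isfinite(xi):
        X[X < xi] = 0.0                    # induce sparsity as described above
    beta = np.zeros(p)
    beta[:10] = [25, 4, -4, 50, 4, -4, 75, 4, -4, 100]
    h = tau * (X @ beta)                   # intercept beta_0 = 0
    if family == "logistic":
        mu = 1.0 / (1.0 + np.exp(-h))
        y = rng.binomial(1, mu)
    else:                                  # Poisson responses
        mu = np.exp(h)
        y = rng.poisson(mu)
    return X, y, beta, mu

X, y, beta, mu = simulate(family="logistic", tau=1.0, xi=0.1)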

We run IRL and compare its statistical performance against the state-of-the-art implementation for generalized linear Lasso regression, Glmnet [2]. To tune the parameter $\lambda$ we generate an independent validation set from the generating process (9) with identical parameter values for $\tau$, $\xi$, $\rho$ and $\gamma$. We then minimize the expected cross-entropy error on the validation set over a grid with 100 values.

Since the vector $\boldsymbol{\mu}=\mathbb{E}_{\mathbf{X}}[\boldsymbol{Y}]$ is known, when implementing Logistic regression we can use the conditional expected cross-entropy error,

\[
\mathbb{E}_{\mathbf{X}}\big[r(\breve{\boldsymbol{b}})\big]=\frac{1}{n}\sum_{i=1}^{n}\Big[(1-\mu_i)\,\breve{\boldsymbol{x}}_i^{\top}\breve{\boldsymbol{b}}+\ln\big(1+\exp(-\breve{\boldsymbol{x}}_i^{\top}\breve{\boldsymbol{b}})\big)\Big].\tag{10}
\]

For Poisson regression, we use the following conditional expected cross-entropy error,

\[
\mathbb{E}_{\mathbf{X}}\big[r(\breve{\boldsymbol{b}})\big]=\frac{1}{n}\sum_{i=1}^{n}\Big[\exp(\breve{\boldsymbol{x}}_i^{\top}\breve{\boldsymbol{b}})-\mu_i\,\breve{\boldsymbol{x}}_i^{\top}\breve{\boldsymbol{b}}\Big].\tag{11}
\]

The expected loss conditioned on $\mathbf{X}$ allows one to estimate the true expected generalization risk.
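A small Python sketch (ours) of the two validation criteria, writing $\breve{\boldsymbol{x}}_i^{\top}\breve{\boldsymbol{b}}=b_0+\boldsymbol{x}_i^{\top}\boldsymbol{b}$:

import numpy as np

def expected_logistic_loss(mu, X, b0, b):   # criterion (10)
    z = b0 + X @ b
    return np.mean((1.0 - mu) * z + np.log1p(np.exp(-z)))

def expected_poisson_loss(mu, X, b0, b):    # criterion (11)
    z = b0 + X @ b
    return np.mean(np.exp(z) - mu * z)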

In each simulation, after generating a training and a validation set ($n=1000$, $p=100$), we run IRL and Glmnet to find the $\lambda$ that minimizes either (10) or (11) on the validation set. We denote this minimizer by $\lambda^*$ and the corresponding model coefficient estimate by $\hat{\boldsymbol{\beta}}_{\lambda^*}$. We measure the following:

1. Bias: defined as $\|\boldsymbol{\beta}-\hat{\boldsymbol{\beta}}_{\lambda^*}\|_2$.

2. True Positives (TP): the number of correctly identified non-zero features in $\hat{\boldsymbol{\beta}}_{\lambda^*}$.

3. False Positives (FP): the number of falsely identified non-zero features in $\hat{\boldsymbol{\beta}}_{\lambda^*}$.

4. Test Loss: the error (10) (Logistic) or (11) (Poisson) evaluated on an independently generated test set of size $n$ from model (9).

For each value of the pair $(\gamma,\rho)$ we replicate the simulation 100 times and report the mean value of each performance measure in Tables 1 to 4. Standard errors of the mean are provided in parentheses.

Table 1: Logistic - $\mathbf{X}$ is sparse ($\xi=0.1$, $\tau=1$).
Average Variance = 0.047

Method | $(\rho,\gamma)$ | Bias | TP | FP | Test Loss
IRL | (0.1, 0.1) | 73.72 | 8.58 (0.13) | 19.37 (0.43) | 0.12
Glmnet | (0.1, 0.1) | 125.31 | 7.28 (0.11) | 48.87 (0.74) | 0.17
IRL | (0.1, 1.0) | 89.73 | 9.97 (0.02) | 24.71 (0.53) | 0.04
Glmnet | (0.1, 1.0) | 126.59 | 9.86 (0.04) | 47.45 (0.71) | 0.08
IRL | (0.9, 0.1) | 61.09 | 6.65 (0.13) | 8.12 (0.29) | 0.29
Glmnet | (0.9, 0.1) | 111.46 | 4.84 (0.08) | 36.36 (0.68) | 0.32
IRL | (0.9, 1.0) | 67.47 | 7.49 (0.13) | 14.04 (0.38) | 0.18
Glmnet | (0.9, 1.0) | 121.52 | 6.33 (0.12) | 50.73 (0.73) | 0.22

Table 2: Logistic - $\mathbf{X}$ is not sparse ($\xi=\infty$, $\tau=0.01$).
Average Variance = 0.156

Method | $(\rho,\gamma)$ | Bias | TP | FP | Test Loss
IRL | (0.1, 0.1) | 40.89 | 6.09 (0.12) | 8.30 (0.26) | 0.48
Glmnet | (0.1, 0.1) | 42.52 | 6.16 (0.12) | 9.65 (0.26) | 0.48
IRL | (0.1, 1.0) | 36.46 | 5.40 (0.10) | 14.96 (0.42) | 0.56
Glmnet | (0.1, 1.0) | 37.31 | 5.45 (0.11) | 15.8 (0.43) | 0.56
IRL | (0.9, 0.1) | 123.43 | 4.39 (0.11) | 3.09 (0.18) | 0.41
Glmnet | (0.9, 0.1) | 122.15 | 4.42 (0.11) | 3.85 (0.20) | 0.41
IRL | (0.9, 1.0) | 57.78 | 6.16 (0.13) | 5.62 (0.22) | 0.45
Glmnet | (0.9, 1.0) | 59.30 | 6.29 (0.12) | 7.13 (0.24) | 0.45

When conducting experiments with generalized linear models it is useful to estimate the noise inherent in the model. In Logistic regression, the variance of $y_i$ is given by

\[
\mathbb{V}\mathrm{ar}[y_i]=\mu_i(1-\mu_i).
\]

To quantify the signal strength we report the value

\[
\frac{1}{n}\sum_{i=1}^{n}\mu_i(1-\mu_i)
\]

averaged over all the simulated responses in each table. The average variance has a maximum value of 0.25, which indicates high noise and a weak signal, whereas an average variance closer to the minimum value of 0 indicates low noise and a strong signal.

In Table 1 we observe simulations with a strong signal. It is in this scenario that the IRL method significantly outperforms Glmnet. With IRL, we are able to obtain a sparser estimate $\hat{\boldsymbol{\beta}}_{\lambda^*}$ that contains significantly fewer false positives and achieves an improved test loss. IRL outperforms Glmnet the most when the correlation among the predictors is highest ($\rho=0.9$, $\gamma=0.1$). In Table 2, the simulated $\mathbf{X}$ is non-sparse and $\tau$ is decreased, thereby creating a weaker signal in the generated responses. In this high-noise setting, IRL achieves only marginally fewer false positives. We provide further numerical tables in the Appendix, which contain results from simulations with additional values of $\tau$.

In Figure 1 we compare the bias and test loss of $\hat{\boldsymbol{\beta}}_{\lambda}$ computed with IRL and Glmnet over a grid of $\lambda$ values. Since the parameter $\lambda$ does not coincide between IRL and Glmnet, we plot the bias and test loss as functions of $\|\hat{\boldsymbol{\beta}}_{\lambda}\|_1$. Figure 1 shows that in the simulation setting where $\mathbf{X}$ is sparse, IRL attains lower test loss over a range of $\lambda$ values and the lowest bias.

Figure 1: Bias and test loss of $\hat{\boldsymbol{\beta}}_{\lambda}$ in a Logistic regression simulation. The estimate $\hat{\boldsymbol{\beta}}_{\lambda}$ is computed over a grid of 100 $\lambda$ values; the simulation parameters are $\tau=0.01$, $\xi=0.1$, $\rho=0.1$ and $\gamma=1$.
Table 3: Poisson - $\mathbf{X}$ is sparse ($\xi=0.1$, $\tau=0.1$).
Average Variance = 0.289

Method | $(\rho,\gamma)$ | Bias | TP | FP | Test Loss
IRL | (0.1, 1.0) | 35.98 | 9.14 (0.09) | 20.76 (0.51) | 0.16
Glmnet | (0.1, 1.0) | 73.62 | 9.62 (0.06) | 47.56 (0.63) | 0.19
IRL | (0.1, 0.1) | 36.13 | 5.69 (0.11) | 11.03 (0.33) | 0.28
Glmnet | (0.1, 0.1) | 71.19 | 6.30 (0.12) | 43.78 (0.58) | 0.31
IRL | (0.9, 1.0) | 35.65 | 5.80 (0.11) | 8.66 (0.27) | 0.36
Glmnet | (0.9, 1.0) | 69.55 | 6.09 (0.12) | 41.54 (0.54) | 0.39
IRL | (0.9, 0.1) | 41.86 | 6.09 (0.13) | 4.80 (0.23) | 0.49
Glmnet | (0.9, 0.1) | 61.78 | 5.36 (0.10) | 23.26 (0.42) | 0.51

Table 4: Poisson - $\mathbf{X}$ is not sparse ($\xi=\infty$, $\tau=0.01$).
Average Variance = 8.808

Method | $(\rho,\gamma)$ | Bias | TP | FP | Test Loss
IRL | (0.1, 1.0) | 11.6 | 8.22 (0.13) | 19.02 (0.66) | -2.24
Glmnet | (0.1, 1.0) | 11.62 | 8.21 (0.13) | 19.37 (0.66) | -2.24
IRL | (0.1, 0.1) | 11.64 | 6.44 (0.09) | 8.14 (0.47) | -14.77
Glmnet | (0.1, 0.1) | 11.73 | 6.43 (0.10) | 8.95 (0.49) | -14.77
IRL | (0.9, 1.0) | 11.98 | 6.58 (0.10) | 6.72 (0.39) | -36.17
Glmnet | (0.9, 1.0) | 12.18 | 6.54 (0.10) | 8.30 (0.45) | -36.17
IRL | (0.9, 0.1) | 19.91 | 6.32 (0.13) | 2.81 (0.18) | -102.07
Glmnet | (0.9, 0.1) | 20.13 | 6.45 (0.13) | 4.49 (0.30) | -102.07

In Poisson regression the variance of $y_i$ is equal to its mean, that is, $\mathbb{V}\mathrm{ar}[y_i]=\mu_i$. To quantify the noise we report the value $\frac{1}{n}\sum_{i=1}^{n}\mu_i$ averaged over all the simulated responses in each table. In Table 3 we observe simulations with low noise. As with Logistic regression, it is in this scenario that the IRL method significantly outperforms Glmnet. The IRL method picks up significantly fewer false positives, thus exhibiting improved variable selection. The improvement is most significant when the correlation parameters are ($\rho=0.9$, $\gamma=1.0$). In Table 4, the simulated $\mathbf{X}$ is non-sparse and $\tau$ is decreased to 0.01, thereby increasing the average variance to 8.808. In this scenario, IRL outperforms Glmnet only in the high-correlation settings. Figure 2 shows that in the sparse setting, IRL outperforms Glmnet over a range of $\lambda$ values.

Figure 2: Bias and test loss of $\hat{\boldsymbol{\beta}}_{\lambda}$ in a Poisson regression simulation. The estimate $\hat{\boldsymbol{\beta}}_{\lambda}$ is computed over a grid of 100 $\lambda$ values; the simulation parameters are $\tau=0.1$, $\xi=0.1$, $\rho=0.1$ and $\gamma=1$.

3.2 Real Data

We study the performance of Glmnet and IRL on the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset [8]. The dataset consists of 569 observations and 30 predictors. The ordering of the 569 observations is randomized, and we use 399 and 170 observations for training and testing, respectively. On this dataset, we fit Glmnet and IRL over a range of 100 $\lambda$ values and plot the test loss (10) in Figure 3. The results indicate that IRL achieves the lowest test loss, which occurs at $\|\hat{\boldsymbol{\beta}}_{\lambda}\|_1=223.58$, whereas for Glmnet the minimum test loss occurs at $\|\hat{\boldsymbol{\beta}}_{\lambda}\|_1=346.44$. Moreover, at their corresponding minimizers, IRL selects 13 non-zero features whereas Glmnet selects 14. In other words, IRL can achieve a smaller test loss using a sparser model.

Figure 3: Test loss on the Breast Cancer Wisconsin (Diagnostic) dataset ($n=569$, $p=30$).

In other numerical experiments with real-world data (not shown here), we similarly observed that the IRL estimator, whilst never inferior to the classical (vanilla) Lasso estimator, can achieve a smaller test loss using a sparser model in problems with feature structure like the ones considered in the examples above.

4 Concluding Remarks

In this paper, we proposed a novel scaling of the features for Lasso penalized nonlinear regression. The current widely-used standardization of the feature matrix is performed only once, typically as a preprocessing step, and is independent of the true regression coefficient $\boldsymbol{\beta}$. In contrast, our rescaling proposal uses a standardization which depends on the current best estimate of $\boldsymbol{\beta}$ and is thus updated from one iteration to the next during the course of the nonlinear optimization.

Numerical experiments suggest that the proposed iteratively rescaled Lasso estimator, while never harming performance in any of the examples considered, has the potential to significantly improve the bias and model-selection performance of the estimator when the feature matrix is sparse and the signal is strong. In addition, the proposed estimator is just as easy to compute as the vanilla Lasso estimator and only requires a minor modification of one line of the original algorithm.

Future areas of research include developing a theoretical framework that gives precise conditions under which a significant improvement in the statistical accuracy of the IRL estimator is guaranteed.

Appendix 0.A Coordinate Descent for Linear Lasso Computations

The following code provides a simple coordinate descent algorithm [2] for computing (8). Note the absence here of the intercept term $\beta_0$, which is assumed to have been eliminated from the optimization. In the pseudocode below we use the Lasso shrinkage operator, defined as:

\[
S_{\lambda}^{\mathrm{Lasso}}(x):=x\,[1-\lambda/|x|]_{+},
\]

where $[x]_{+}=\max\{0,x\}$.

input: initial $\boldsymbol{b}$, scaling parameters $\boldsymbol{\eta}$, and $\boldsymbol{y},\mathbf{X},\epsilon,\lambda$
output: (global) minimizer $\boldsymbol{b}$
1   $\boldsymbol{r}\leftarrow\boldsymbol{y}-\mathbf{X}\boldsymbol{b}$ // initial residual
2   repeat // iterate over CD cycles
3       $\boldsymbol{b}_{\mathrm{old}}\leftarrow\boldsymbol{b}$
4       for $k=1,\ldots,p$ do // coordinate-descent cycle
5           $b_{\mathrm{new}}\leftarrow S_{\lambda/\eta_k}^{\mathrm{Lasso}}(b_k+\boldsymbol{v}_k^{\top}\boldsymbol{r}/\|\boldsymbol{v}_k\|^2)$ // this step costs $\mathcal{O}(n)$
6           if $b_{\mathrm{new}}\neq b_k$ then // update only if necessary
7               $\boldsymbol{r}\leftarrow\boldsymbol{r}+(b_k-b_{\mathrm{new}})\boldsymbol{v}_k$ // this update costs $\mathcal{O}(n)$
8               $b_k\leftarrow b_{\mathrm{new}}$
9   until $\|\boldsymbol{b}-\boldsymbol{b}_{\mathrm{old}}\|<\epsilon$
return $\boldsymbol{b}$
Algorithm 2: Coordinate descent for $\frac{1}{2n}\|\boldsymbol{y}-\mathbf{X}\boldsymbol{b}\|^2+\lambda\sum_i\eta_i\times|b_i|$.
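For reference, here is a direct Python transcription of this routine (our sketch). The soft-threshold level is written in the general form $n\lambda\eta_k/\|\boldsymbol{v}_k\|^2$, which reduces to the $\lambda/\eta_k$ used above whenever the $k$-th column satisfies $\|\boldsymbol{v}_k\|^2=n\eta_k^2$, as it does after the weighted centering in Algorithm 1.

import numpy as np

def soft_threshold(x, t):
    # Lasso shrinkage operator: S_t(x) = x * [1 - t/|x|]_+
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(b, y, X, lam, eta, eps=1e-8, max_cycles=1000):
    # Coordinate descent for ||y - X b||^2 / (2n) + lam * sum_k eta_k * |b_k|.
    n, p = X.shape
    b = np.asarray(b, dtype=float).copy()
    r = y - X @ b                                  # initial residual
    col_norm2 = (X ** 2).sum(axis=0)               # ||v_k||^2 for each column
    for _ in range(max_cycles):
        b_old = b.copy()
        for k in range(p):
            if col_norm2[k] == 0.0:
                continue                           # skip all-zero columns
            b_tilde = b[k] + X[:, k] @ r / col_norm2[k]           # unpenalized minimizer
            b_new = soft_threshold(b_tilde, n * lam * eta[k] / col_norm2[k])
            if b_new != b[k]:
                r += (b[k] - b_new) * X[:, k]      # O(n) residual update
                b[k] = b_new
        if np.linalg.norm(b - b_old) < eps:
            break
    return b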

Appendix 0.B Additional Numerical Results

In Tables 5 and 6 we provide results for numerical simulations with additional values of $\tau$. In Table 5, $\mathbf{X}$ is simulated to be sparse and $\tau=0.01$, which results in a weaker signal than that in Table 1. In Table 6, $\mathbf{X}$ is simulated to be non-sparse and $\tau=0.1$, which results in a stronger signal than that in Table 2. In comparison to the weak-signal setting in Table 2, there is a greater difference in the mean number of false positives in favor of the IRL method in Table 6.

Table 5: Logistic - $\mathbf{X}$ is sparse ($\xi=0.1$, $\tau=0.01$).
Average Variance = 0.191

Method | $(\rho,\gamma)$ | Bias | TP | FP | Test Loss
IRL | (0.9, 0.1) | 146.29 | 3.55 (0.13) | 3.61 (0.18) | 0.55
Glmnet | (0.9, 0.1) | 143.38 | 3.66 (0.12) | 4.56 (0.20) | 0.55
IRL | (0.1, 1.0) | 55.2 | 4.49 (0.11) | 12.27 (0.39) | 0.57
Glmnet | (0.1, 1.0) | 58.07 | 4.64 (0.12) | 13.85 (0.39) | 0.57
IRL | (0.1, 0.1) | 63.74 | 5.79 (0.13) | 7.37 (0.29) | 0.56
Glmnet | (0.1, 0.1) | 66.26 | 5.87 (0.12) | 9.29 (0.30) | 0.56
IRL | (0.9, 1.0) | 80.68 | 5.62 (0.12) | 5.94 (0.25) | 0.56
Glmnet | (0.9, 1.0) | 81.78 | 5.72 (0.12) | 7.90 (0.28) | 0.56

Table 6: Logistic - $\mathbf{X}$ is not sparse ($\xi=\infty$, $\tau=0.1$).
Average Variance = 0.021

Method | $(\rho,\gamma)$ | Bias | TP | FP | Test Loss
IRL | (0.1, 0.1) | 49.29 | 6.97 (0.11) | 23.59 (0.46) | 0.08
Glmnet | (0.1, 0.1) | 53.75 | 6.85 (0.10) | 30.79 (0.48) | 0.09
IRL | (0.1, 1.0) | 48.54 | 9.34 (0.07) | 43.09 (0.54) | 0.12
Glmnet | (0.1, 1.0) | 50.32 | 9.40 (0.07) | 47.15 (0.54) | 0.12
IRL | (0.9, 0.1) | 58.08 | 6.34 (0.12) | 4.65 (0.24) | 0.06
Glmnet | (0.9, 0.1) | 66.95 | 6.42 (0.13) | 10.96 (0.28) | 0.06
IRL | (0.9, 1.0) | 45.34 | 6.77 (0.11) | 15.47 (0.37) | 0.07
Glmnet | (0.9, 1.0) | 53.09 | 6.53 (0.11) | 23.93 (0.42) | 0.07

References

[1] Bertsimas, D., King, A., Mazumder, R.: Best subset selection via a modern optimization lens (2016)
[2] Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33(1), 1 (2010)
[3] Hastie, T., Tibshirani, R., Friedman, J.H.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, vol. 2. Springer (2009)
[4] Hastie, T., Tibshirani, R., Tibshirani, R.: Best subset, forward stepwise or lasso? Analysis and recommendations based on extensive comparisons (2020)
[5] Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press (2015)
[6] Kroese, D.P., Botev, Z., Taimre, T., Vaisman, R.: Data Science and Machine Learning: Mathematical and Statistical Methods. CRC Press (2019)
[7] Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58(1), 267–288 (1996)
[8] Wolberg, W., Mangasarian, O., Street, N., Street, W.: Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository (1995). https://doi.org/10.24432/C5DW2B
[9] Zou, H.: The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101(476), 1418–1429 (2006)