Elastic Net Regularization Paths for All Generalized Linear Models
J. Kenneth Tay, Balasubramanian Narasimhan, Trevor Hastie
\PlaintitleElastic Net Paths for All GLMs
\ShorttitleElastic Net Paths for All GLMs
\Abstract
The lasso and elastic net are popular regularized regression models for supervised learning. Friedman2010 introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression and multinomial logistic regression, while Simon2011 extended this work to Cox models for right-censored data. We further extend the reach of the elastic net-regularized regression to all generalized linear model families, Cox models with (start, stop] data and strata, and a simplified version of the relaxed lasso. We also discuss convenient utility functions for measuring the performance of these fitted models.
\Keywordslasso, elastic net, penalty, regularization path, coordinate descent, generalized linear models, survival, Cox model
\Plainkeywordslasso, elastic net, l1 penalty, regularization path, coordinate descent, generalized linear models, survival, Cox model, relaxed lasso
\Address
J. Kenneth Tay
Department of Statistics
Stanford University
390 Jane Stanford Way
Stanford, California 94305
United States of America
E-mail:
URL: https://kjytay.github.io/
Balasubramanian Narasimhan
Department of Biomedical Data Sciences, and
Department of Statistics
Stanford University
390 Jane Stanford Way
Stanford, CA 94305
E-mail:
URL: https://web.stanford.edu/~naras/
Trevor Hastie
Department of Biomedical Data Sciences, and
Department of Statistics
Stanford University
390 Jane Stanford Way
Stanford, CA 94305
E-mail:
URL: https://web.stanford.edu/~hastie/
1 Introduction
Consider the standard supervised learning framework. We have data of the form $(x_1, y_1), \dots, (x_n, y_n)$, where $y_i \in \mathbb{R}$ is the target and $x_i = (x_{i1}, \dots, x_{ip})^\top \in \mathbb{R}^p$ is a vector of potential predictors. The ordinary least squares (OLS) model assumes that the response can be modeled as a linear combination of the covariates, i.e. $y_i \approx \beta_0 + x_i^\top \beta$ for some coefficient vector $\beta \in \mathbb{R}^p$ and intercept $\beta_0 \in \mathbb{R}$. The parameters are estimated by minimizing the residual sum of squares (RSS):
$$\hat\beta_0, \hat\beta = \underset{\beta_0, \beta}{\operatorname{argmin}} \; \frac{1}{2n} \sum_{i=1}^n \big( y_i - \beta_0 - x_i^\top \beta \big)^2. \tag{1}$$
There has been a lot of research on regularization methods in the last two decades. We focus on the elastic net (Zou2005), which minimizes the sum of the RSS and a regularization term that is a mixture of $\ell_1$ and $\ell_2$ penalties:
$$\hat\beta_0, \hat\beta = \underset{\beta_0, \beta}{\operatorname{argmin}} \; \frac{1}{2n} \sum_{i=1}^n \big( y_i - \beta_0 - x_i^\top \beta \big)^2 + \lambda \left[ \frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right]. \tag{2}$$
In the above, $\lambda \geq 0$ is a tuning parameter and $\alpha \in [0, 1]$ is a higher level hyperparameter.\footnote{If the square were removed from the $\ell_2$-norm penalty, it would be more natural to have $1 - \alpha$ instead of $(1 - \alpha)/2$ as its mixing parameter. The factor of $1/2$ compensates for the fact that a squared $\ell_2$-norm penalty is used, in the sense that the gradient of the penalty with respect to $\beta$ can be seen as a convex combination of the $\ell_1$ and $\ell_2$ penalty terms. We note also that there is a one-to-one correspondence between these two parameterizations for the penalty.} We always fit a path of models in $\lambda$, but set a value of $\alpha$ depending on the type of prediction model we want. For example, if we want ridge regression (Hoerl1970) we set $\alpha = 0$, and if we want the lasso (Tibshirani1996) we set $\alpha = 1$. If we want a sparse model but are worried about correlations between features, we might set $\alpha$ close to but not equal to 1. The final value of $\lambda$ is usually chosen via cross-validation: we select the coefficients corresponding to the $\lambda$ value giving the smallest cross-validated error as the final model.
The elastic net extends easily to generalized linear models (GLMs) (Nelder1972) and Cox proportional hazards models (Cox1972): in the minimization problem (2), the RSS term in the objective function is replaced with a negative log-likelihood term or a negative log partial likelihood term, respectively.
The \pkgglmnet \proglangR package (Friedman2010) contains efficient functions for computing the elastic net solution for an entire path of values $\lambda_1 > \lambda_2 > \dots > \lambda_m$. The minimization problems are solved via cyclic coordinate descent (vanderKooij2007), with the core routines programmed in \proglangFORTRAN for computational efficiency. Earlier versions of the package contained specialized \proglangFORTRAN subroutines for a handful of popular GLMs and the Cox model for right-censored survival data. The package includes functions for performing $K$-fold cross-validation (CV), plotting coefficient paths and CV errors, and predicting on future data. The package can also accept the predictor matrix in sparse matrix format: this is especially useful in certain applications where the predictor matrix is both large and sparse. In particular, this means that we can fit unpenalized GLMs with sparse predictor matrices, something the \codeglm function in the \pkgstats package cannot do.
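For example, a typical workflow with the package looks like the following minimal sketch (assuming a numeric predictor matrix \codex and response vector \codey are available; the choice \codealpha = 0.5 is purely illustrative):
{CodeChunk} {CodeInput}
R> library(glmnet)
R> fit <- glmnet(x, y, alpha = 0.5)       # elastic net path over a grid of lambda values
R> cvfit <- cv.glmnet(x, y, alpha = 0.5)  # 10-fold cross-validation (the default) over the path
R> coef(cvfit, s = "lambda.min")          # coefficients at the lambda with smallest CV error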
In version 4.1 and later, \pkgglmnet is able to compute the elastic net regularization path for all GLMs, Cox models with (start, stop] data and strata, and a simplified version of the relaxed lasso (Hastie2020).
Friedman2010 gives details on how the \pkgglmnet package computes the elastic net solution for ordinary least squares regression, logistic regression and multinomial logistic regression, while Simon2011 explains how the package fits regularized Cox models for right-censored data. This paper builds on these two earlier works. In Section 2, we explain how the elastic net penalty can be applied to all GLMs and how we implement it in software. In Section 3, we detail extensions to Cox models with (start, stop] data and strata. In Section 4, we describe an implementation of the relaxed lasso provided in the package, and in Section 5 we describe the package's functionality for assessing fitted models. We conclude with a summary and discussion.
2 Regularized generalized linear models
2.1 Overview of generalized linear models
Generalized linear models (GLMs) (Nelder1972) are a simple but powerful extension of OLS. A GLM consists of 3 parts:
• A linear predictor: $\eta_i = x_i^\top \beta$,
• A link function: $\eta_i = g(\mu_i)$, and
• A variance function as a function of the mean: $V = V(\mu_i)$.
The user gets to specify the link function $g$ and the variance function $V(\mu)$. For one-dimensional exponential families, the family determines the variance function, which, along with the link, is sufficient to specify a GLM. More generally, modeling can proceed once the link and variance functions are specified via a quasi-likelihood approach (see McCullagh1983 for details); this is the approach taken by the quasi-binomial and quasi-Poisson models. The OLS model is a special case, with identity link $g(\mu) = \mu$ and constant variance function $V(\mu) = \sigma^2$ for some constant $\sigma^2 > 0$. More examples of GLMs are listed in Table 1.
| GLM family / Regression type | Response type | Representation in R |
|---|---|---|
| Gaussian | $\mathbb{R}$ | \codegaussian() |
| Logistic | $\{0, 1\}$ | \codebinomial() |
| Probit | $\{0, 1\}$ | \codebinomial(link = "probit") |
| Quasi-Binomial | $\{0, 1\}$ | \codequasibinomial() |
| Poisson | $\{0, 1, 2, \dots\}$ | \codepoisson() |
| Quasi-Poisson | $\{0, 1, 2, \dots\}$ | \codequasipoisson() |
| Negative binomial | $\{0, 1, 2, \dots\}$ | \codeMASS::negative.binomial(theta = 3) |
| Gamma | $\mathbb{R}_{> 0}$ | \codeGamma() |
| Inverse Gaussian | $\mathbb{R}_{> 0}$ | \codeinverse.gaussian() |
| Tweedie | Depends on variance power parameter | \codestatmod::tweedie() |

Table 1: Examples of GLM families, the types of response they model, and their representations in R.
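To make the role of the variance function concrete, each \code"family" object in Table 1 carries its own \codevariance component; the short check below (plain \proglangR, no data required) evaluates a few of them at $\mu = 2$:
{CodeChunk} {CodeInput}
R> poisson()$variance(2)           # Poisson: V(mu) = mu
{CodeOutput}
[1] 2
{CodeInput}
R> Gamma()$variance(2)             # Gamma: V(mu) = mu^2
{CodeOutput}
[1] 4
{CodeInput}
R> inverse.gaussian()$variance(2)  # inverse Gaussian: V(mu) = mu^3
{CodeOutput}
[1] 8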
The GLM parameter $\beta$ is determined by maximum likelihood estimation. Unlike OLS, there is no closed form solution for $\hat\beta$. Rather, it is typically computed via an iteratively reweighted least squares (IRLS) algorithm known as Fisher scoring. In each iteration of the algorithm, we make a quadratic approximation to the negative log-likelihood (NLL), reducing the minimization problem to a weighted least squares (WLS) problem. For GLMs with canonical link functions, the negative log-likelihood is convex in $\beta$, Fisher scoring is equivalent to the Newton-Raphson method, and the algorithm is guaranteed to converge to a global minimum. For GLMs with non-canonical links, the negative log-likelihood is not guaranteed to be convex.\footnote{It is not true that the negative log-likelihood is always non-convex for non-canonical links. For example, it can be shown via direct computation that the negative log-likelihood for probit regression is convex in $\beta$.} Also, Fisher scoring is no longer equivalent to the Newton-Raphson method and is only guaranteed to converge to a local minimum.
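To make the Fisher scoring iteration concrete, here is a minimal, illustrative \proglangR sketch of IRLS for an unpenalized GLM, driven entirely by a \code"family" object (it assumes a numeric predictor matrix \codex and a response \codey suitable for the chosen family, and omits the safeguards such as step halving, deviance-based convergence checks, prior weights and offsets that \codeglm implements):
{CodeChunk} {CodeInput}
R> irls <- function(x, y, family, maxit = 25, tol = 1e-8) {
+    x1 <- cbind(1, x)                      # design matrix with intercept
+    beta <- rep(0, ncol(x1))
+    for (t in seq_len(maxit)) {
+      eta <- drop(x1 %*% beta)             # linear predictor
+      mu  <- family$linkinv(eta)           # fitted means
+      dmu <- family$mu.eta(eta)            # d mu / d eta
+      z   <- eta + (y - mu) / dmu          # working response
+      w   <- dmu^2 / family$variance(mu)   # working weights
+      beta_new <- coef(lm.wfit(x1, z, w))  # weighted least squares step
+      converged <- max(abs(beta_new - beta)) < tol
+      beta <- beta_new
+      if (converged) break
+    }
+    beta
+  }
For well-behaved data, \codeirls(x, y, binomial(link = "probit")) agrees with \codecoef(glm(y ~ x, family = binomial(link = "probit"))) up to convergence tolerance.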
It is easy to fit GLMs in \proglangR using the \codeglm function from the \pkgstats package; the user can specify the GLM to be fit using \codefamily objects. These objects capture details of the GLM such as the link function and the variance function. For example, the code below shows the \codefamily object associated with the probit regression model: {CodeChunk} {CodeInput}
R> class(binomial(link = "probit"))
{CodeOutput}
[1] "family"
{CodeInput}
R> str(binomial(link = "probit"))
{CodeOutput}
List of 12
 $ family    : chr "binomial"
 $ link      : chr "probit"
 $ linkfun   : function (mu)
 $ linkinv   : function (eta)
 $ variance  : function (mu)
 $ dev.resids: function (y, mu, wt)
 $ aic       : function (y, n, mu, wt, dev)
 $ mu.eta    : function (eta)
 $ initialize: expression(...)
 $ validmu   : function (mu)
 $ valideta  : function (eta)
 $ simulate  : function (object, nsim)
 - attr(*, "class")= chr "family"
The \codelinkfun, \codelinkinv, \codevariance and \codemu.eta functions are used in fitting the GLM, and the \codedev.resids function is used in computing the deviance of the resulting model. By passing a class \code"family" object to the \codefamily argument of a \codeglm call, \codeglm has all the information it needs to fit the model. Here is an example of how one can fit a probit regression model in \proglangR: {CodeChunk} {CodeInput}
R> library(glmnet)
R> data(BinomialExample)
R> glm(y ~ x, family = binomial(link = "probit"))
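As a small check of the role of \codedev.resids (reusing the \codeBinomialExample data loaded above; the object names \codefam, \codefit and \codemu below are our own), the deviance reported by \codeglm can be reproduced by summing the deviance contributions returned by the family object:
{CodeChunk} {CodeInput}
R> fam <- binomial(link = "probit")
R> fit <- glm(y ~ x, family = fam)
R> mu <- fitted(fit)
R> sum(fam$dev.resids(y, mu, wt = rep(1, length(y))))  # deviance computed by hand
R> deviance(fit)                                       # same value reported by glm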
2.2 Extending the elastic net to all GLM families
To extend the elastic net to GLMs, we replace the RSS term in (2) with a negative log-likelihood term:
$$\hat\beta_0, \hat\beta = \underset{\beta_0, \beta}{\operatorname{argmin}} \; -\frac{1}{n} \sum_{i=1}^n \ell_i\big(y_i, \beta_0 + x_i^\top \beta\big) + \lambda \left[ \frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right], \tag{3}$$
where $\ell_i\big(y_i, \beta_0 + x_i^\top \beta\big)$ is the log-likelihood term associated with observation $i$. We can apply the same IRLS strategy as for unpenalized GLMs to minimize this objective function. The key difference is that instead of solving a WLS problem in each iteration, we solve a penalized WLS problem.
The algorithm for solving (3) for a path of $\lambda$ values is described in Algorithm 1. Note that in Step 2(a), we initialize the solution for $\lambda = \lambda_k$ at the solution obtained for $\lambda = \lambda_{k-1}$. This is known as a warm start: since we expect the solutions at these two $\lambda$ values to be similar, the algorithm will likely require fewer iterations than if we initialized the solution at zero.
Algorithm 1: Fitting GLMs with the elastic net penalty.

1. Select a value of $\alpha \in [0, 1]$ and a sequence of $\lambda$ values $\lambda_1 > \dots > \lambda_m$.
2. For $k = 1, \dots, m$:
   (a) Initialize $\hat\beta_0^{(0)}(\lambda_k) = \hat\beta_0(\lambda_{k-1})$ and $\hat\beta^{(0)}(\lambda_k) = \hat\beta(\lambda_{k-1})$. For $k = 1$, initialize $\hat\beta_0^{(0)}(\lambda_k) = 0$ and $\hat\beta^{(0)}(\lambda_k) = \mathbf{0}$. (Here, $\hat\beta_0(\lambda_k)$ and $\hat\beta(\lambda_k)$ denote the elastic net solution at $\lambda = \lambda_k$.)
   (b) For $t = 0, 1, 2, \dots$ until convergence:
      i. For $i = 1, \dots, n$, compute $\eta_i^{(t)} = \hat\beta_0^{(t)}(\lambda_k) + x_i^\top \hat\beta^{(t)}(\lambda_k)$ and $\mu_i^{(t)} = g^{-1}\big(\eta_i^{(t)}\big)$.
      ii. For $i = 1, \dots, n$, compute working responses and weights
$$z_i^{(t)} = \eta_i^{(t)} + \frac{y_i - \mu_i^{(t)}}{d\mu_i^{(t)} / d\eta_i^{(t)}}, \qquad w_i^{(t)} = \frac{\big(d\mu_i^{(t)} / d\eta_i^{(t)}\big)^2}{V\big(\mu_i^{(t)}\big)}. \tag{4}$$
      iii. Solve the penalized WLS problem
$$\hat\beta_0^{(t+1)}(\lambda_k), \hat\beta^{(t+1)}(\lambda_k) = \underset{\beta_0, \beta}{\operatorname{argmin}} \; \frac{1}{2n} \sum_{i=1}^n w_i^{(t)} \big( z_i^{(t)} - \beta_0 - x_i^\top \beta \big)^2 + \lambda_k \left[ \frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right]. \tag{5}$$
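As a rough illustration of Steps 2(b)i–iii (not the package's internal routine), one outer iteration can be written by computing the working responses and weights from a \code"family" object and then handing a weighted Gaussian elastic net problem to a penalized WLS solver; the sketch below reuses \codeglmnet itself with \codefamily = "gaussian" as that solver, and ignores details such as standardization, internal weight rescaling and step-size halving:
{CodeChunk} {CodeInput}
R> one_pn_step <- function(x, y, family, beta0, beta, lambda, alpha = 1) {
+    eta <- beta0 + drop(x %*% beta)       # Step i: linear predictor
+    mu  <- family$linkinv(eta)            #         and fitted means
+    dmu <- family$mu.eta(eta)
+    z   <- eta + (y - mu) / dmu           # Step ii: working responses, as in (4)
+    w   <- dmu^2 / family$variance(mu)    #          and weights, as in (4)
+    glmnet(x, z, family = "gaussian", weights = w,   # Step iii: penalized WLS, as in (5)
+           lambda = lambda, alpha = alpha, standardize = FALSE)
+  }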
2.3 Implementation details
There are two main approaches we can take in implementing Algorithm 1. In the original implementation of \pkgglmnet, the entire algorithm was implemented in \proglangFORTRAN for specific GLM families. In version 4.0 and later, we added a second implementation which implements just the computational bottleneck, the penalized WLS problem in Step 2(b)iii, in \proglangFORTRAN, with the rest of the algorithm implemented in \proglangR. Here are the relative merits and disadvantages of the second approach compared to the first:
✓ Because the formulas for the working weights and responses in (4) are specific to each GLM, the first approach requires a new \proglangFORTRAN subroutine for each GLM family. This is tedious to manage, and also means that users cannot fit regularized models for their bespoke GLM families. The second approach allows the user to pass a class \code"family" object to \codeglmnet: the working weights and responses can then be computed in \proglangR before the \proglangFORTRAN subroutine solves the resulting penalized WLS problem.

✓ As written, Algorithm 1 is a proximal Newton algorithm with a constant step size of 1, and hence it may not converge in certain cases. To ensure convergence, we can implement step-size halving after Step 2(b)iii: as long as the objective function (3) is not decreasing, set $\hat\beta^{(t+1)} \leftarrow \big(\hat\beta^{(t+1)} + \hat\beta^{(t)}\big)/2$ (with a similar formula for the intercept). Since the objective function involves a log-likelihood term, the formula for the objective function differs across GLMs, and the first approach has to maintain different subroutines for step-size halving. For the second approach, we can write a single function that takes in the class \code"family" object (along with other necessary parameters) and returns the objective function value.

✗ It is computationally less efficient than the first approach because (i) \proglangR is generally slower than \proglangFORTRAN, and (ii) there is overhead associated with constant switching between \proglangR and \proglangFORTRAN. Some timing comparisons for Gaussian and logistic regression with the default parameters are presented in Figure 1: the second approach is 10 to 15 times as slow as the first approach.

✗ Since each GLM family has its own set of \proglangFORTRAN subroutines in the first approach, it allows for special computational tricks to be employed in each situation. For example, with \codefamily = "gaussian", the predictors can be centered once upfront to have zero mean and Algorithm 1 can be run ignoring the intercept term.
Figure 1: Timing comparisons of the two implementation approaches for Gaussian and logistic regression with default parameters.
We stress that both approaches have been implemented in \pkgglmnet. Users should use the first implementation for the most popular GLM families including OLS (Gaussian regression), logistic regression and Poisson regression (see \codeglmnet’s documentation for the full list of such families), and use the second implementation for all other GLM families. For example, the code below shows two equivalent ways to fit a regularized Poisson regression model: {CodeChunk} {CodeInput} R> data(PoissonExample) R> glmnet(x, y, family = "poisson") R> glmnet(x, y, family = poisson()) The first call specifies the GLM family as a character string to the \codefamily argument, invoking the first implementation. The second call passes a class \code"family" object to the \codefamily argument instead of a character string, invoking the second implementation. One would never run the second call in practice though, as it returns the same result as the first call but takes longer to fit. The example below fits a regularized quasi-Poisson model that allows for overdispersion, a family that is only available via the second approach: {CodeChunk} {CodeInput} R> glmnet(x, y, family = quasipoisson())
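Because any class \code"family" object is accepted, families and links without dedicated \proglangFORTRAN subroutines come essentially for free. For instance (illustrative calls on the same count data; the value \codetheta = 3 is arbitrary):
{CodeChunk} {CodeInput}
R> glmnet(x, y, family = MASS::negative.binomial(theta = 3))
R> glmnet(x, y, family = poisson(link = "sqrt"))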
2.4 Details on the penalized WLS subroutine
Since the penalized WLS problem in Step 2(b)iii of Algorithm 1 is the computational bottleneck, we elected to implement it in \proglangFORTRAN. Concretely, the subroutine solves the problem
$$\underset{\beta_0, \beta}{\operatorname{minimize}} \quad \frac{1}{2n} \sum_{i=1}^n w_i \big( z_i - \beta_0 - x_i^\top \beta \big)^2 + \lambda \sum_{j=1}^p \gamma_j \left[ \frac{1-\alpha}{2} \beta_j^2 + \alpha |\beta_j| \right] \tag{6}$$
$$\text{subject to} \quad L_j \leq \beta_j \leq U_j, \qquad j = 1, \dots, p. \tag{7}$$
This is the same problem as (5) except for two things. First, the penalty placed on each coefficient $\beta_j$ has its own multiplicative factor $\gamma_j$. ((6) reduces to the penalty in (5) if $\gamma_j = 1$ for all $j$, which is the default value for the \codeglmnet function.) This allows the user to place different penalty weights on the coefficients. An instance where this is especially useful is when the user always wants to include feature $j$ in the model: in that case the user could set $\gamma_j = 0$ so that $\beta_j$ is unpenalized. Second, the coefficient $\beta_j$ is constrained to lie in the interval $[L_j, U_j]$. (\codeglmnet's default is $L_j = -\infty$ and $U_j = \infty$ for all $j$, i.e. no constraints on the coefficients.) One example where these constraints are useful is when we want a certain $\beta_j$ to always be non-negative or always non-positive.
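In the \codeglmnet interface, the $\gamma_j$, $L_j$ and $U_j$ correspond to the \codepenalty.factor, \codelower.limits and \codeupper.limits arguments. For example (an illustrative choice that leaves the first coefficient unpenalized and forces all coefficients to be non-negative):
{CodeChunk} {CodeInput}
R> pf <- rep(1, ncol(x))
R> pf[1] <- 0                 # gamma_1 = 0: feature 1 is never penalized
R> glmnet(x, y, penalty.factor = pf, lower.limits = 0)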
The \proglangFORTRAN subroutine solves (6) by cyclic coordinate descent: see Friedman2010 for details. Here we describe one major computational trick that was not covered in that paper: the application of strong rules (Tibshirani2012).
In each iteration of cyclic coordinate descent, the solver has to loop through all $p$ features to update the corresponding model coefficients. This can be time-consuming if $p$ is large, and is potentially wasteful if the solution is sparse: most of the $\hat\beta_j$ would remain at zero. If we know a priori which predictors will be "active" at the solution (i.e. have $\hat\beta_j \neq 0$), we could perform cyclic coordinate descent on just those coefficients and leave the others untouched. The set of "active" predictors is known as the active set. Strong rules are a simple yet powerful heuristic for guessing what the active set is, and can be combined with the Karush-Kuhn-Tucker (KKT) conditions to ensure that we get the exact solution. (The set of predictors determined by the strong rules is known as the strong set.) We describe the use of strong rules in solving (6) fully in Algorithm 2.
Assume that we are trying to solve for $\hat\beta_0(\lambda_k)$ and $\hat\beta(\lambda_k)$ for some $k \in \{1, \dots, m\}$, and that we have already computed $\hat\beta(\lambda_{k-1})$. (If $k = 1$, set $\hat\beta(\lambda_0) = \mathbf{0}$.)
Algorithm 2: Solving the penalized WLS problem (6) with strong rules.

1. Initialize the strong set $\mathcal{S}_{\lambda_k} = \big\{ j : \hat\beta_j(\lambda_{k-1}) \neq 0 \big\}$.
2. Check the strong rules: for $j = 1, \dots, p$, include $j$ in $\mathcal{S}_{\lambda_k}$ if
$$\left| \frac{1}{n} \sum_{i=1}^n w_i x_{ij} \big( z_i - \hat\beta_0(\lambda_{k-1}) - x_i^\top \hat\beta(\lambda_{k-1}) \big) \right| \geq \alpha \gamma_j \big( 2\lambda_k - \lambda_{k-1} \big),$$
where $z_i$ and $w_i$ are the current working responses and weights.
3. Perform cyclic coordinate descent only for features in $\mathcal{S}_{\lambda_k}$.
4. Check that the KKT conditions hold for each $j \notin \mathcal{S}_{\lambda_k}$. If the conditions hold for all such $j$, we have the exact solution. If the conditions do not hold for some features, include them in $\mathcal{S}_{\lambda_k}$ and go back to Step 3.
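For intuition only, the screening check in Step 2 can be written out for the simplest case (unit weights, $\alpha = 1$ and $\gamma_j = 1$ for all $j$); this sketch is not the package's implementation and skips the KKT back-checking of Step 4:
{CodeChunk} {CodeInput}
R> strong_set <- function(x, resid_prev, lambda_k, lambda_km1, active_prev) {
+    score <- abs(crossprod(x, resid_prev)) / nrow(x)  # |x_j' r(lambda_{k-1})| / n for each j
+    union(active_prev, which(score >= 2 * lambda_k - lambda_km1))
+  }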
Finally, we note that in some applications, the design matrix $X$ is sparse. In these settings, computational savings can be reaped by representing $X$ in a sparse matrix format and performing matrix manipulations with this form. To leverage this property of the data, we have a separate \proglangFORTRAN subroutine that solves (6) when $X$ is in sparse matrix format.
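For example, a predictor matrix stored in the \pkgMatrix package's compressed sparse column format (\code"dgCMatrix") can be passed to \codeglmnet directly; the snippet below simulates a small sparse problem purely for illustration:
{CodeChunk} {CodeInput}
R> library(Matrix)
R> xs <- rsparsematrix(100, 20, density = 0.1)  # 100 x 20 matrix with 10% non-zero entries
R> ys <- rnorm(100)
R> fit_sparse <- glmnet(xs, ys)                 # same call as for a dense matrix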
2.5 Other useful functionality
In this section, we mention other useful functionality that the \pkgglmnet package provides for fitting elastic net models.
For fixed $\alpha$, \codeglmnet solves (3) for a path of $\lambda$ values. While the user has the option of specifying this path of values using the \codelambda option, it is recommended that the user let \codeglmnet compute the sequence on its own. \codeglmnet uses the arguments passed to it to determine the value of $\lambda_{max}$, defined to be the smallest value of $\lambda$ such that the estimated coefficients would be all equal to zero.\footnote{We note that when $\alpha = 0$, $\lambda_{max}$ is infinite, i.e. all coefficients will always be non-zero for finite $\lambda$. To avoid such extreme values of $\lambda_{max}$, if $\alpha < 0.001$ we return the $\lambda_{max}$ value for $\alpha = 0.001$.} The program then computes $\lambda_{min}$ such that the ratio $\lambda_{min} / \lambda_{max}$ is equal to \codelambda.min.ratio (default $0.01$ if the number of variables exceeds the number of observations, $10^{-4}$ otherwise). Model (3) is then fit for \codenlambda $\lambda$ values (default 100) starting at $\lambda_{max}$ and ending at $\lambda_{min}$ which are equally spaced on the log scale.
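The computed sequence is stored on the fitted object and can be inspected directly; for example (with \codex and \codey as before, and noting that the path may stop early if the deviance explained saturates):
{CodeChunk} {CodeInput}
R> fit <- glmnet(x, y)
R> length(fit$lambda)                    # at most nlambda (default 100)
R> max(fit$lambda)                       # lambda_max
R> min(fit$lambda) / max(fit$lambda)     # approximately lambda.min.ratio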
In practice, it is common to choose the value of $\lambda$ via cross-validation (CV). The \codecv.glmnet function is a convenience function that runs CV for the $\lambda$ tuning parameter. The returned object has class \code"cv.glmnet", which comes equipped with \codeplot, \codecoef and \codepredict methods. The \codeplot method produces a plot of CV error against $\lambda$ (see Figure 2 for an example). As mentioned earlier, we prefer to think of $\alpha$ as a higher level hyperparameter whose value depends on the type of prediction model we want. Nevertheless, the code below shows how the user can perform CV for $\alpha$ manually using a for loop. Care must be taken to ensure that the same CV folds are used across runs for the CV errors to be comparable. {CodeChunk} {CodeInput}
R> alphas <- c(1, 0.8, 0.5, 0.2, 0)
R> fits <- list()
R> fits[[1]] <- cv.glmnet(x, y, keep = TRUE)
R> foldid <- fits[[1]]$foldid