
Elastic Net Regularization Paths for All Generalized Linear Models

J. Kenneth Tay
Stanford University

Balasubramanian Narasimhan
Stanford University

Trevor Hastie
Stanford University
Abstract: The lasso and elastic net are popular regularized regression models for supervised learning. Friedman2010 introduced a computationally efficient algorithm for computing the elastic net regularization path for ordinary least squares regression, logistic regression and multinomial logistic regression, while Simon2011 extended this work to Cox models for right-censored data. We further extend the reach of the elastic net-regularized regression to all generalized linear model families, Cox models with (start, stop] data and strata, and a simplified version of the relaxed lasso. We also discuss convenient utility functions for measuring the performance of these fitted models.

Keywords: lasso, elastic net, $\ell_1$ penalty, regularization path, coordinate descent, generalized linear models, survival, Cox model, relaxed lasso

J. Kenneth Tay
Department of Statistics
Stanford University
390 Jane Stanford Way
Stanford, California 94305
United States of America
E-mail:
URL: https://kjytay.github.io/

Balasubramanian Narasimhan
Department of Biomedical Data Sciences, and
Department of Statistics
Stanford University
390 Jane Stanford Way
Stanford, CA 94305
E-mail:
URL: https://web.stanford.edu/~naras/

Trevor Hastie
Department of Biomedical Data Sciences, and
Department of Statistics
Stanford University
390 Jane Stanford Way
Stanford, CA 94305
E-mail:
URL: https://web.stanford.edu/~hastie/

1 Introduction

Consider the standard supervised learning framework. We have data of the form $(x_1, y_1), \dots, (x_n, y_n)$, where $y_i \in \mathbb{R}$ is the target and $x_i = (x_{i,1}, \dots, x_{i,p})^T \in \mathbb{R}^p$ is a vector of potential predictors. The ordinary least squares (OLS) model assumes that the response can be modeled as a linear combination of the covariates, i.e. $y_i = \beta_0 + x_i^T \beta$ for some coefficient vector $\beta \in \mathbb{R}^p$ and intercept $\beta_0 \in \mathbb{R}$. The parameters are estimated by minimizing the residual sum of squares (RSS):

$(\hat{\beta}_0, \hat{\beta}) = \underset{(\beta_0, \beta) \in \mathbb{R}^{p+1}}{\text{argmin}} \; \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - x_i^T \beta)^2.$ (1)

There has been a lot of research on regularization methods in the last two decades. We focus on the elastic net (Zou2005), which minimizes the sum of the RSS and a regularization term which is a mixture of $\ell_1$ and $\ell_2$ penalties:

$(\hat{\beta}_0, \hat{\beta}) = \underset{(\beta_0, \beta) \in \mathbb{R}^{p+1}}{\text{argmin}} \left[ \frac{1}{2n} \sum_{i=1}^{n} (y_i - \beta_0 - x_i^T \beta)^2 + \lambda \left( \frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right) \right].$ (2)

In the above, $\lambda \geq 0$ is a tuning parameter and $\alpha \in [0, 1]$ is a higher level hyperparameter.¹ We always fit a path of models in $\lambda$, but set a value of $\alpha$ depending on the type of prediction model we want. For example, if we want ridge regression (Hoerl1970) we set $\alpha = 0$, and if we want the lasso (Tibshirani1996) we set $\alpha = 1$. If we want a sparse model but are worried about correlations between features, we might set $\alpha$ close to but not equal to 1. The final value of $\lambda$ is usually chosen via cross-validation: we select the coefficients corresponding to the $\lambda$ value giving the smallest cross-validated error as the final model.

¹If the square were removed from the $\ell_2$-norm penalty, it would be more natural to have $1 - \alpha$ instead of $(1-\alpha)/2$ as its mixing parameter. The factor of $1/2$ compensates for the fact that a squared $\ell_2$-norm penalty is used, in the sense that the gradient of the penalty with respect to $\beta$ can be seen as a convex combination of the $\ell_1$ and $\ell_2$ penalty terms. We note also that there is a one-to-one correspondence between these two parameterizations for the penalty.

The elastic net can be extended easily to generalized linear models (GLMs) (Nelder1972) and Cox proportional hazards models (Cox1972): in the minimization problem (2), the RSS term in the objective function is replaced with a negative log-likelihood term or a negative log partial likelihood term, respectively.

The \pkgglmnet \proglangR package (Friedman2010) contains efficient functions for computing the elastic net solution for an entire path of values $\lambda_1 > \dots > \lambda_m$. The minimization problems are solved via cyclic coordinate descent (vanderKooij2007), with the core routines programmed in \proglangFORTRAN for computational efficiency. Earlier versions of the package contained specialized \proglangFORTRAN subroutines for a handful of popular GLMs and the Cox model for right-censored survival data. The package includes functions for performing $K$-fold cross-validation (CV), plotting coefficient paths and CV errors, and predicting on future data. The package can also accept the predictor matrix in sparse matrix format: this is especially useful in certain applications where the predictor matrix is both large and sparse. In particular, this means that we can fit unpenalized GLMs with sparse predictor matrices, something the \codeglm function in the \pkgstats package cannot do.
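For example, a typical workflow with the package (assuming a numeric predictor matrix \codex and a response vector \codey are already in the workspace) looks like this:

{CodeChunk} {CodeInput}
R> fit <- glmnet(x, y, alpha = 0.95)           # elastic net path; alpha = 0 gives ridge, alpha = 1 the lasso
R> cvfit <- cv.glmnet(x, y, alpha = 0.95)      # K-fold cross-validation over the lambda path
R> plot(cvfit)                                 # CV error as a function of lambda
R> coef(cvfit, s = "lambda.min")               # coefficients at the CV-selected lambda
R> predict(cvfit, newx = x, s = "lambda.min")  # predictions at that lambda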

From version 4.1 onwards, \pkgglmnet is able to compute the elastic net regularization path for all GLMs, for Cox models with (start, stop] data and strata, and for a simplified version of the relaxed lasso (Hastie2020).

Friedman2010 gives details on how the \pkgglmnet package computes the elastic net solution for ordinary least squares regression, logistic regression and multinomial logistic regression, while Simon2011 explains how the package fits regularized Cox models for right-censored data. This paper builds on these two earlier works. In Section 2, we explain how the elastic net penalty can be applied to all GLMs and how we implement it in software. In Section LABEL:sec:cox, we detail extensions to Cox models with (start, stop] data and strata. In Section LABEL:sec:relax, we describe the simplified version of the relaxed lasso implemented in the package, and in Section LABEL:sec:assess we describe the package's functionality for assessing fitted models. We conclude with a summary and discussion.

2 Regularized generalized linear models

2.1 Overview of generalized linear models

Generalized linear models (GLMs) (Nelder1972) are a simple but powerful extension of OLS. A GLM consists of 3 parts:

  • A linear predictor: $\eta_i = x_i^T \beta$,

  • A link function: $\eta_i = g(\mu_i)$, and

  • A variance function as a function of the mean: $V = V(\mu_i)$.

The user gets to specify the link function $g$ and the variance function $V$. For one-dimensional exponential families, the family determines the variance function, which, together with the link, is sufficient to specify a GLM. More generally, modeling can proceed once the link and variance functions are specified via a quasi-likelihood approach (see McCullagh1983 for details); this is the approach taken by the quasi-binomial and quasi-Poisson models. The OLS model is a special case, with link $g(x) = x$ and constant variance function $V(\mu) = \sigma^2$ for some constant $\sigma^2$. More examples of GLMs are listed in Table 1.

GLM family / Regression type    Response type                           Representation in R
Gaussian                        $\mathbb{R}$                            \codegaussian()
Logistic                        $\{0, 1\}$                              \codebinomial()
Probit                          $\{0, 1\}$                              \codebinomial(link = "probit")
Quasi-binomial                  $\{0, 1\}$                              \codequasibinomial()
Poisson                         $\mathbb{N}_0 = \{0, 1, \dots\}$        \codepoisson()
Quasi-Poisson                   $\mathbb{N}_0$                          \codequasipoisson()
Negative binomial               $\mathbb{N}_0$                          \codeMASS::negative.binomial(theta = 3)
Gamma                           $\mathbb{R}_+ = [0, \infty)$            \codeGamma()
Inverse Gaussian                $\mathbb{R}_+$                          \codeinverse.gaussian()
Tweedie                         Depends on variance power parameter     \codestatmod::tweedie()
Table 1: Examples of generalized linear models (GLMs) and their representations in \proglangR.

The GLM parameter $\beta$ is determined by maximum likelihood estimation. Unlike OLS, there is no closed-form solution for $\hat{\beta}$. Rather, it is typically computed via an iteratively reweighted least squares (IRLS) algorithm known as Fisher scoring. In each iteration of the algorithm, we make a quadratic approximation to the negative log-likelihood (NLL), reducing the minimization problem to a weighted least squares (WLS) problem. For GLMs with canonical link functions, the negative log-likelihood is convex in $\beta$, Fisher scoring is equivalent to the Newton-Raphson method, and the algorithm is guaranteed to converge to a global minimum. For GLMs with non-canonical links, the negative log-likelihood is not guaranteed to be convex.² In addition, Fisher scoring is no longer equivalent to the Newton-Raphson method and is only guaranteed to converge to a local minimum.

²It is not true that the negative log-likelihood is always non-convex for non-canonical links. For example, it can be shown via direct computation that the negative log-likelihood for probit regression is convex in $\beta$.

It is easy to fit GLMs in \proglangR using the \codeglm function from the \pkgstats package; the user can specify the GLM to be fit using \codefamily objects. These objects capture details of the GLM such as the link function and the variance function. For example, the code below shows the \codefamily object associated with the probit regression model:

{CodeChunk} {CodeInput}
R> class(binomial(link = "probit"))
{CodeOutput}
[1] "family"
{CodeInput}
R> str(binomial(link = "probit"))
{CodeOutput}
List of 12
 $ family    : chr "binomial"
 $ link      : chr "probit"
 $ linkfun   : function (mu)
 $ linkinv   : function (eta)
 $ variance  : function (mu)
 $ dev.resids: function (y, mu, wt)
 $ aic       : function (y, n, mu, wt, dev)
 $ mu.eta    : function (eta)
 $ initialize: language ...  # code to set up objects needed for the family
 $ validmu   : function (mu)
 $ valideta  : function (eta)
 $ simulate  : function (object, nsim)
 - attr(*, "class")= chr "family"

The \codelinkfun, \codelinkinv, \codevariance and \codemu.eta functions are used in fitting the GLM, and the \codedev.resids function is used in computing the deviance of the resulting model. By passing a class \code"family" object to the \codefamily argument of a \codeglm call, \codeglm has all the information it needs to fit the model. Here is an example of how one can fit a probit regression model in \proglangR:

{CodeChunk} {CodeInput}
R> library(glmnet)
R> data(BinomialExample)
R> glm(y ~ x, family = binomial(link = "probit"))
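To make the roles of these components concrete, the small check below evaluates them directly (an illustration only; the values follow from the definition of the probit family):

{CodeChunk} {CodeInput}
R> fam <- binomial(link = "probit")
R> fam$linkfun(0.5)    # eta = qnorm(0.5) = 0
R> fam$linkinv(0)      # mu = pnorm(0) = 0.5
R> fam$variance(0.5)   # binomial variance mu * (1 - mu) = 0.25
R> fam$mu.eta(0)       # d mu / d eta = dnorm(0), approximately 0.3989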

2.2 Extending the elastic net to all GLM families

To extend the elastic net to GLMs, we replace the RSS term in (2) with a negative log-likelihood term:

$(\hat{\beta}_0, \hat{\beta}) = \underset{(\beta_0, \beta) \in \mathbb{R}^{p+1}}{\text{argmin}} \left[ -\frac{1}{n} \sum_{i=1}^{n} \ell\left(y_i, \beta_0 + x_i^T \beta\right) + \lambda \left( \frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right) \right],$ (3)

where $\ell\left(y_i, \beta_0 + x_i^T \beta\right)$ is the log-likelihood term associated with observation $i$. We can apply the same strategy as for GLMs to minimize this objective function. The key difference is that instead of solving a WLS problem in each iteration, we solve a penalized WLS problem.

The algorithm for solving (3) for a path of $\lambda$ values is described in Algorithm 1. Note that in Step 2(a), we initialize the solution for $\lambda = \lambda_k$ at the solution obtained for $\lambda = \lambda_{k-1}$. This is known as a warm start: since we expect the solution at these two $\lambda$ values to be similar, the algorithm will likely require fewer iterations than if we initialized the solution at zero.

Algorithm 1 Fitting GLMs with elastic net penalty

  1. Select a value of $\alpha \in [0, 1]$ and a sequence of $\lambda$ values $\lambda_1 > \ldots > \lambda_m$.

  2. For $k = 1, \dots, m$:

    (a) Initialize $(\hat{\beta}_0^{(0)}(\lambda_k), \hat{\beta}^{(0)}(\lambda_k)) = (\hat{\beta}_0(\lambda_{k-1}), \hat{\beta}(\lambda_{k-1}))$. For $k = 1$, initialize $(\hat{\beta}_0^{(0)}(\lambda_k), \hat{\beta}^{(0)}(\lambda_k)) = (0, \mathbf{0})$. (Here, $(\hat{\beta}_0(\lambda_k), \hat{\beta}(\lambda_k))$ denotes the elastic net solution at $\lambda = \lambda_k$.)

    (b) For $t = 0, 1, \ldots$ until convergence:

      i. For $i = 1, \dots, n$, compute $\eta_i^{(t)} = \hat{\beta}_0^{(t)}(\lambda_k) + \hat{\beta}^{(t)}(\lambda_k)^T x_i$ and $\mu_i^{(t)} = g^{-1}\left(\eta_i^{(t)}\right)$.

      ii. For $i = 1, \dots, n$, compute working responses and weights

        $z_i^{(t)} = \eta_i^{(t)} + \left(y_i - \mu_i^{(t)}\right) \Big/ \dfrac{d\mu_i^{(t)}}{d\eta_i^{(t)}}, \quad w_i^{(t)} = \left(\dfrac{d\mu_i^{(t)}}{d\eta_i^{(t)}}\right)^2 \Big/ V\left(\mu_i^{(t)}\right).$ (4)

      iii. Solve the penalized WLS problem

        $(\hat{\beta}_0^{(t+1)}(\lambda_k), \hat{\beta}^{(t+1)}(\lambda_k)) = \underset{(\beta_0, \beta) \in \mathbb{R}^{p+1}}{\text{argmin}} \left[ \frac{1}{2n} \sum_{i=1}^{n} w_i^{(t)} \left(z_i^{(t)} - \beta_0 - x_i^T \beta\right)^2 + \lambda_k \left( \frac{1-\alpha}{2} \|\beta\|_2^2 + \alpha \|\beta\|_1 \right) \right].$ (5)
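The quantities in Step 2(b)ii map directly onto the components of an \proglangR \code"family" object. The sketch below is an illustration of this correspondence (not the package's internal code); it assumes a family object \codefam, a response vector \codey and a current linear predictor \codeeta, and returns the working responses and weights of (4):

{CodeChunk} {CodeInput}
R> working_response_weights <- function(y, eta, fam) {
+    mu       <- fam$linkinv(eta)        # mu = g^{-1}(eta)
+    dmu_deta <- fam$mu.eta(eta)         # d mu / d eta
+    z <- eta + (y - mu) / dmu_deta      # working response, as in (4)
+    w <- dmu_deta^2 / fam$variance(mu)  # working weight, as in (4)
+    list(z = z, w = w)
+  }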

2.3 Implementation details

There are two main approaches we can take in implementing Algorithm 1. In the original implementation of \pkgglmnet, the entire algorithm was implemented in \proglangFORTRAN for specific GLM families. In version 4.0 and later, we added a second implementation in which only the computational bottleneck, the penalized WLS problem in Step 2(b)iii, is implemented in \proglangFORTRAN, with the rest of the algorithm implemented in \proglangR. Here are the relative merits and disadvantages of the second approach compared to the first:

  • ✓ Because the formulas for the working weights and responses in (4) are specific to each GLM, the first approach requires a new \proglangFORTRAN subroutine for each GLM family. This is tedious to manage, and also means that users cannot fit regularized models for their bespoke GLM families. The second approach allows the user to pass a class \code"family" object to \codeglmnet: the working weights and responses can then be computed in \proglangR before the \proglangFORTRAN subroutine solves the resulting penalized WLS problem.

  • ✓ As written, Algorithm 1 is a proximal Newton algorithm with a constant step size of 1, and hence it may not converge in certain cases. To ensure convergence, we can implement step-size halving after Step 2(b)iii: as long as the objective function (3) is not decreasing, set $\hat{\beta}^{(t+1)}(\lambda_k) \leftarrow \hat{\beta}^{(t)}(\lambda_k) + \frac{1}{2}\left[\hat{\beta}^{(t+1)}(\lambda_k) - \hat{\beta}^{(t)}(\lambda_k)\right]$ (with a similar formula for the intercept). Since the objective function involves a log-likelihood term, the formula for the objective function differs across GLMs, and the first approach has to maintain different subroutines for step-size halving. For the second approach, we can write a single function that takes in the class \code"family" object (along with other necessary parameters) and returns the objective function value (a sketch of such a function is given after this list).

  • × It is computationally less efficient than the first approach because (i) \proglangR is generally slower than \proglangFORTRAN, and (ii) there is overhead associated with constant switching between \proglangR and \proglangFORTRAN. Some timing comparisons for Gaussian and logistic regression with the default parameters are presented in Figure 1. The second approach is 10 to 15 times as slow as the first approach.

  • × Since each GLM family has its own set of \proglangFORTRAN subroutines in the first approach, special computational tricks can be employed there that are not available to the second approach. For example, with \codefamily = "gaussian", the predictors can be centered once upfront to have zero mean and Algorithm 1 can be run ignoring the intercept term.
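The single objective function mentioned in the step-size halving point above can be sketched as follows (an illustration, not the package's code). It evaluates the objective (3), up to an additive constant, using only the \codedev.resids component of a family object, and assumes a predictor matrix \codex, response \codey and family object \codefam:

{CodeChunk} {CodeInput}
R> pen_obj <- function(beta0, beta, x, y, fam, lambda, alpha) {
+    eta <- beta0 + drop(x %*% beta)
+    mu  <- fam$linkinv(eta)
+    # deviance = 2 * (saturated log-likelihood - log-likelihood), so
+    # deviance / 2 equals the NLL up to an additive constant
+    nll <- sum(fam$dev.resids(y, mu, rep(1, length(y)))) / 2
+    nll / length(y) +
+      lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
+  }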

Figure 1: The top plot compares model fitting times for \codefamily = "gaussian" and \codefamily = gaussian() for a range of problem sizes, while the bottom plot makes the same comparison for \codefamily = "binomial" and \codefamily = binomial(). Each point is the mean of 5 simulation runs. Note that both the $x$ and $y$ axes are on the log scale.

We stress that both approaches have been implemented in \pkgglmnet. Users should use the first implementation for the most popular GLM families, including OLS (Gaussian regression), logistic regression and Poisson regression (see \pkgglmnet's documentation for the full list of such families), and use the second implementation for all other GLM families. For example, the code below shows two equivalent ways to fit a regularized Poisson regression model:

{CodeChunk} {CodeInput}
R> data(PoissonExample)
R> glmnet(x, y, family = "poisson")
R> glmnet(x, y, family = poisson())

The first call specifies the GLM family as a character string to the \codefamily argument, invoking the first implementation. The second call passes a class \code"family" object to the \codefamily argument instead of a character string, invoking the second implementation. One would never run the second call in practice, though, as it returns the same result as the first call but takes longer to fit. The example below fits a regularized quasi-Poisson model that allows for overdispersion, a family that is only available via the second approach:

{CodeChunk} {CodeInput}
R> glmnet(x, y, family = quasipoisson())

2.4 Details on the penalized WLS subroutine

Since the penalized WLS problem in Step 2(b)iii of Algorithm 1 is the computational bottleneck, we elected to implement it in \proglangFORTRAN. Concretely, the subroutine solves the problem

$\underset{(\beta_0, \beta) \in \mathbb{R}^{p+1}}{\text{minimize}} \quad \frac{1}{2n} \sum_{i=1}^{n} w_i \left(z_i - \beta_0 - x_i^T \beta\right)^2 + \lambda_k \sum_{j=1}^{p} \gamma_j \left( \frac{1-\alpha}{2} \beta_j^2 + \alpha |\beta_j| \right)$ (6)
$\text{subject to} \quad L_j \leq \beta_j \leq U_j, \quad j = 1, \dots, p.$ (7)

This is the same problem as (5) except for two things. First, the penalty placed on each coefficient $\beta_j$ has its own multiplicative factor $\gamma_j$. (The objective (6) reduces to that of (5) if $\gamma_j = 1$ for all $j$, which is the default value for the \codeglmnet function.) This allows the user to place different penalty weights on the coefficients. An instance where this is especially useful is when the user always wants to include feature $j$ in the model: in that case the user could set $\gamma_j = 0$ so that $\beta_j$ is unpenalized. Second, the coefficient $\beta_j$ is constrained by (7) to lie in the interval $[L_j, U_j]$. (\codeglmnet's default is $L_j = -\infty$ and $U_j = \infty$ for all $j$, i.e. no constraints on the coefficients.) One example where these constraints are useful is when we want a certain $\beta_j$ to always be non-negative or always non-positive.
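At the user level, these two generalizations are exposed through the \codepenalty.factor, \codelower.limits and \codeupper.limits arguments of \codeglmnet. A small illustration (again assuming \codex and \codey are in the workspace):

{CodeChunk} {CodeInput}
R> # leave the first feature unpenalized (gamma_1 = 0) and penalize the rest equally
R> fit1 <- glmnet(x, y, penalty.factor = c(0, rep(1, ncol(x) - 1)))
R> # constrain every coefficient to be non-negative (L_j = 0, U_j = Inf)
R> fit2 <- glmnet(x, y, lower.limits = 0)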

The \proglangFORTRAN subroutine solves (6)–(7) by cyclic coordinate descent: see Friedman2010 for details. Here we describe one major computational trick that was not covered in that paper: the application of strong rules (Tibshirani2012).

In each iteration of cyclic coordinate descent, the solver has to loop through all $p$ features to update the corresponding model coefficients. This can be time-consuming if $p$ is large, and is potentially wasteful if the solution is sparse: most of the $\beta_j$ would remain at zero. If we know a priori which predictors will be "active" at the solution (i.e. have $\beta_j \neq 0$), we could perform cyclic coordinate descent on just those coefficients and leave the others untouched. The set of "active" predictors is known as the active set. Strong rules are a simple yet powerful heuristic for guessing what the active set is, and can be combined with the Karush-Kuhn-Tucker (KKT) conditions to ensure that we get the exact solution. (The set of predictors determined by the strong rules is known as the strong set.) We describe the use of strong rules in solving (6)–(7) fully in Algorithm 2.

Algorithm 2 Solving penalized WLS (6)–(7) with strong rules

Assume that we are trying to solve for $\hat{\beta}(\lambda_k)$ for some $k = 1, \dots, m$, and that we have already computed $\hat{\beta}(\lambda_{k-1})$. (If $k = 1$, set $\hat{\beta}(\lambda_{k-1}) = \mathbf{0}$.)

  1. Initialize the strong set $\mathcal{S}_{\lambda_k} = \{j : \hat{\beta}(\lambda_{k-1})_j \neq 0\}$.

  2. Check the strong rules: for $j = 1, \dots, p$, include $j$ in $\mathcal{S}_{\lambda_k}$ if

     $\left|x_j^T \left\{ y - X\hat{\beta}(\lambda_{k-1}) \right\}\right| > \alpha \left[\lambda_k - (\lambda_{k-1} - \lambda_k)\right] \gamma_j.$

  3. Perform cyclic coordinate descent only for features in $\mathcal{S}_{\lambda_k}$.

  4. Check that the KKT conditions hold for each $j = 1, \dots, p$. If the conditions hold for all $j$, we have the exact solution. If the conditions do not hold for some features, include them in the strong set $\mathcal{S}_{\lambda_k}$ and go back to Step 3.
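For intuition, the screen in Steps 1–2 amounts to a single gradient computation. The sketch below is an illustration only (not the package's \proglangFORTRAN code), written for the unweighted case; the gradient is divided by $n$ to match the $1/(2n)$ scaling of the objective in (5):

{CodeChunk} {CodeInput}
R> strong_set <- function(X, y, beta_prev, lambda_k, lambda_km1, alpha, gamma) {
+    grad   <- abs(drop(crossprod(X, y - X %*% beta_prev))) / nrow(X)  # |x_j^T r| / n
+    active <- beta_prev != 0                                 # Step 1: warm-start active set
+    strong <- grad > alpha * (lambda_k - (lambda_km1 - lambda_k)) * gamma  # Step 2
+    which(active | strong)
+  }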

Finally, we note that in some applications, the design matrix $X$ is sparse. In these settings, computational savings can be reaped by representing $X$ in a sparse matrix format and performing matrix manipulations with this form. To leverage this property of the data, we have a separate \proglangFORTRAN subroutine that solves (6)–(7) when $X$ is in sparse matrix format.
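For example, one might convert the design matrix to the sparse format of the \pkgMatrix package before fitting; the \codeglmnet call itself is unchanged (an illustration, assuming \codex and \codey as before):

{CodeChunk} {CodeInput}
R> x_sparse <- Matrix::Matrix(x, sparse = TRUE)   # "dgCMatrix" sparse representation
R> fit <- glmnet(x_sparse, y, family = poisson())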

2.5 Other useful functionality

In this section, we mention other useful functionality that the \pkgglmnet package provides for fitting elastic net models.

For fixed $\alpha$, \codeglmnet solves (3) for a path of $\lambda$ values. While the user has the option of specifying this path of values using the \codelambda option, it is recommended that the user let \codeglmnet compute the sequence on its own. \codeglmnet uses the arguments passed to it to determine the value of $\lambda_{max}$, defined to be the smallest value of $\lambda$ such that the estimated coefficients would be all equal to zero.³ The program then computes $\lambda_{min}$ such that the ratio $\lambda_{min}/\lambda_{max}$ is equal to \codelambda.min.ratio (default $10^{-2}$ if the number of variables exceeds the number of observations, $10^{-4}$ otherwise). Model (3) is then fit for \codenlambda $\lambda$ values (default 100) starting at $\lambda_{max}$ and ending at $\lambda_{min}$, equally spaced on the log scale.

³We note that when $\alpha = 0$, $\lambda_{max}$ is infinite, i.e. all coefficients will always be non-zero for finite $\lambda$. To avoid such extreme values of $\lambda_{max}$, if $\alpha < 0.001$ we return the $\lambda_{max}$ value for $\alpha = 0.001$.
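The relevant arguments can also be set explicitly; the call below simply restates the defaults for the case where the number of observations exceeds the number of variables, to make them concrete:

{CodeChunk} {CodeInput}
R> fit <- glmnet(x, y, nlambda = 100, lambda.min.ratio = 1e-4)
R> range(fit$lambda)    # runs from lambda_min up to the data-determined lambda_max
R> length(fit$lambda)   # at most nlambda values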

In practice, it is common to choose the value of $\lambda$ via cross-validation (CV). The \codecv.glmnet function is a convenience function that runs CV for the $\lambda$ tuning parameter. The returned object has class \code"cv.glmnet", which comes equipped with \codeplot, \codecoef and \codepredict methods. The \codeplot method produces a plot of CV error against $\lambda$ (see Figure 2 for an example). As mentioned earlier, we prefer to think of $\alpha$ as a higher level hyperparameter whose value depends on the type of prediction model we want. Nevertheless, the code below shows how the user can perform CV for $\alpha$ manually using a for loop. Care must be taken to ensure that the same CV folds are used across runs for the CV errors to be comparable.

{CodeChunk} {CodeInput}
R> alphas <- c(1, 0.8, 0.5, 0.2, 0)
R> fits <- list()
R> fits[[1]] <- cv.glmnet(x, y, keep = TRUE)
R> foldid <- fits[[1]]$foldid
R> for (i in 2:length(alphas)) {
+    fits[[i]] <- cv.glmnet(x, y, alpha = alphas[i], foldid = foldid)
+  }

Figure 2: Example output for plotting a \codecv.glmnet object: a plot of CV error against $\lambda$.
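Once these \code"cv.glmnet" objects are available, the usual methods apply; for example (an illustration of the methods mentioned above):

{CodeChunk} {CodeInput}
R> plot(fits[[1]])                                   # CV error curve, as in Figure 2
R> coef(fits[[1]], s = "lambda.1se")                 # coefficients at the 1-standard-error lambda
R> predict(fits[[1]], newx = x, s = "lambda.min")    # predictions at the CV-minimizing lambda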