Parameter-Expanded ECME Algorithms for
Logistic and Penalized Logistic Regression
Abstract
Parameter estimation in logistic regression is a well-studied problem with the Newton-Raphson method being one of the most prominent optimization techniques used in practice. A number of monotone optimization methods including minorization-maximization (MM) algorithms, expectation-maximization (EM) algorithms and related variational Bayes approaches offer a family of useful alternatives guaranteed to increase the logistic regression likelihood at every iteration. In this article, we propose a modified version of a logistic regression EM algorithm which can substantially improve computational efficiency while preserving the monotonicity of EM and the simplicity of the EM parameter updates. By introducing an additional latent parameter and selecting this parameter to maximize the penalized observed-data log-likelihood at every iteration, our iterative algorithm can be interpreted as a parameter-expanded expectation-conditional maximization either (ECME) algorithm, and we demonstrate how to use the parameter-expanded ECME algorithm with an arbitrary choice of weights and penalty function. In addition, we describe a generalized version of our parameter-expanded ECME algorithm that can be tailored to the challenges encountered in specific high-dimensional problems, and we study several interesting connections between this generalized algorithm and other well-known methods. Performance comparisons between our method, the EM algorithm, and several other optimization methods are presented using a series of simulation studies based upon both real and synthetic datasets.
Keywords: convergence acceleration; EM algorithm; MM algorithm; Pólya-Gamma augmentation; parameter expansion; weighted maximum likelihood
1 Introduction
Logistic regression is one of the most well-known statistical methods for modeling the relationship between a binomial outcome and a vector of covariates. Popular iterative methods for optimizing the logistic regression objective function include the Newton-Raphson/Fisher scoring method and gradient descent methods. While typically converging very rapidly, Newton-Raphson is not guaranteed to converge without additional safeguards such as step-halving (Marschner (2011), Lange (2012)), and the behavior of Newton-Raphson can be very unstable for many starting values. Because of this, more stable, monotone algorithms which are guaranteed to improve the value of the objective function at each iteration can provide a useful alternative. While gradient descent and proximal gradient descent methods have seen a recent resurgence due to their scalability in problems with a large number of parameters, the convergence of gradient descent can be extremely sluggish, particularly if one does not use an adaptive steplength-finding procedure.
In this article, we propose and explore a new monotone iterative method for parameter estimation in logistic regression. Our method builds upon an EM algorithm derived from the Pólya-Gamma latent variable representation of the distribution of a binomial random variable with a logistic link function (Polson et al., 2013; Scott & Sun, 2013). Specifically, we use a parameter-expanded version of the Pólya-Gamma representation to build a parameter-expanded expectation-conditional maximization either (PX-ECME) algorithm (Lewandowski et al., 2010) for maximizing the weighted log-likelihood of interest. Our approach combines the parameter-expanded EM framework (Liu et al., 1998) and conditional maximization versions of the EM algorithm described in, for example, Meng & Rubin (1993) and Liu & Rubin (1994). More specifically, we perform two conditional maximization steps where the first step performs an “M-step” with a fixed value of a latent scale parameter and the second step finds the value of the latent scale parameter maximizing the observed-data log-likelihood. Combining these steps leads to a direct parameter-updating scheme that is essentially a scalar multiple of the EM parameter update in each iteration.
While Newton-Raphson and gradient descent procedures are highly effective and will continue to be widely used computational techniques for logistic regression, EM-type algorithms for logistic regression and accelerated versions thereof have a number of distinct advantages, and hence, these algorithms merit further study and development. First, a major benefit of EM-type algorithms is that each iteration increases the likelihood which results in a very stable, directly implementable algorithm that does not require one to implement additional safeguards or step length finding procedures. Another main advantage of EM-type procedures is that they often work well in more complex models where one has additional latent variables within a larger, overall probability model. For example, a common use of the EM algorithm is in settings where one has missing covariate values (e.g., Ibrahim (1990); Ibrahim et al. (1999)). In such cases, one can often obtain a straightforward EM update of the regression parameters of interest whereas the Newton-Raphson method or other procedures can have very different forms with potentially unknown performance.
While EM algorithms have several notable advantages, slow convergence of EM is a common issue that often hampers their more widespread adoption. Fortunately, there are numerous techniques which can accelerate the convergence of EM while maintaining both the monotonicity property of EM and the simplicity of the original EM parameter updates. One useful acceleration technique is based on the parameter-expansion framework where one embeds the probability model of interest within a larger probability model that is identifiable from the complete data (Liu et al. (1998)). The resulting EM algorithm for the larger probability model is referred to as a parameter-expanded EM (PX-EM) algorithm, and in certain contexts, PX-EM algorithms can be developed so that both the parameter updates of interest have a direct, closed form and the overall convergence speed is substantially improved. Another variation of the EM algorithm which can aid more efficient computation is the expectation-conditional maximization (ECM) algorithm (Meng & Rubin (1993)). While not necessarily improving the rate of convergence, the ECM algorithm is a technique for simplifying the form of the parameter updates by breaking the required maximization in the M-step into a series of conditional maximization steps. A useful modification of ECM which can accelerate convergence is the expectation/conditional maximization either (ECME) algorithm (Liu & Rubin (1994)) which replaces one of the conditional maximization steps with a “full” maximization step where one performs maximization with respect to the observed-data likelihood. For certain statistical models, parameter-expanded versions of ECME can be derived and these algorithms can be termed PX-ECME algorithms (Lewandowski et al. (2010)). PX-ECME algorithms have been used successfully in the context of parameter estimation for multivariate t-distributions (Liu (1997)) and in fitting mixed effects models (Van Dyk (2000)). While not usually converging as quickly as a pure PX-EM algorithm, PX-ECME algorithms will, as argued in Lewandowski et al. (2010), typically converge faster than the EM algorithm, and hence, can be effective in cases when finding the full PX-EM parameter update is cumbersome or infeasible.
The purpose of the current work is to explore the relative performance of a PX-ECME algorithm versus the associated EM algorithm for penalized weighted logistic regression with user-specified weights, and moreover, to show its connections with and relative performance compared to several other commonly used optimization methods. We show that PX-ECME is indeed consistently faster than the associated EM algorithm and that it can be somewhat competitive with the speed of Newton-Raphson in certain scenarios. In addition, we provide examples where Newton-Raphson often diverges whereas the PX-ECME algorithm provides quick, yet stable convergence to the maximizer of the weighted log-likelihood.
The organization of this article is as follows. Section 2 reviews the Pólya-Gamma representation of a binomially distributed random variable with logistic link function and describes the associated EM algorithm for weighted logistic regression. Next, Section 3 reviews the construction of parameter-expanded EM algorithms and related PX-ECME algorithms. Then, Section 3 outlines a parameter-expanded version of the Pólya-Gamma model and uses this parameter-expanded model to derive a PX-ECME algorithm for weighted logistic regression. Section 3 also includes results on the rates of convergence of the EM and PX-ECME algorithms. Section 4 outlines a more general PX-ECME algorithm and describes its connections to several other well-known procedures including proximal gradient descent (Parikh & Boyd, 2014) and an MM algorithm for logistic regression (Böhning & Lindsay, 1988). Section 5 briefly describes a PX-ECME-type coordinate descent algorithm for logistic regression. Through a series of simulation studies, Section 6 investigates the performance of our PX-ECME algorithms and compares their performance with several other procedures, and we then conclude with a brief discussion in Section 7.
2 Review of Pólya-Gamma Data Augmentation for Logistic Regression
Suppose we have responses where each is an integer satisfying , and for each , we also have an associated covariate vector . A logistic regression model assumes that the are independent and that the distribution of each is given by
(1)
where is the vector of regression coefficients. It is common in practice to estimate by maximizing the observed-data log-likelihood or a weighted observed log-likelihood function . Under the assumed distribution (1) and a nonnegative vector of weights , the weighted observed log-likelihood function is given by
(2)
As outlined in Scott & Sun (2013), an EM algorithm for maximizing can be constructed by exploiting the Pólya-Gamma latent variable representation of the distribution of described in Polson et al. (2013). A random variable is said to follow a Pólya-Gamma distribution with parameters and (denoted by ) if has the same distribution as an infinite convolution of independent Gamma-distributed random variables. Specifically,
where and denotes equality in distribution. The relevance of the Pólya-Gamma family of distributions to logistic regression follows from the following key identity
where denotes the density of a random variable and from the fact that the density of a random variable is given by
Consequently, if one assumes that the observed responses and latent random variables arise from the following model
(3)
then each has the correct marginal distribution, and the (weighted) complete-data log-likelihood corresponding to the complete data is
(4)
where . In other words, the complete-data log-likelihood is a quadratic form in the vector of regression coefficients . As noted in Scott & Sun (2013), this is very useful for constructing an EM algorithm as the M-step reduces to solving a weighted least-squares problem. The form of the complete-data log-likelihood is also useful in the context of Bayesian logistic regression (e.g., Polson et al. (2013), Choi et al. (2013)) as the conditional distribution of given the latent variables is multivariate Gaussian.
To utilize the latent variable representation (3) to construct an EM algorithm for estimating , one needs to find the expectation of given and current values of the regression coefficients. Curiously, the distribution of does not depend on in (3), and hence, the expectations in the “E-step” do not depend on the observed data. That is, for , and as shown in Polson et al. (2013), this expectation is given by
with the understanding that, whenever , is set to . It is useful to note the following connection between the and the mean function
(5)
In (5), is the vector whose component is given by , and is the vector .
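For reference, the Pólya-Gamma mean needed in the E-step has a simple closed form: if ω ∼ PG(b, c), then E[ω] = (b/(2c)) tanh(c/2), with the limiting value b/4 as c → 0 (Polson et al., 2013). A minimal R sketch of this weight function is given below; the name pg_mean is ours (not part of any package) and is reused in the later code snippets.

```r
# E[omega] for omega ~ PG(b, z): (b / (2 z)) * tanh(z / 2),
# with the limit b / 4 handled explicitly as z -> 0.
pg_mean <- function(b, z) {
  ifelse(abs(z) < 1e-8, b / 4, (b / (2 * z)) * tanh(z / 2))
}
```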
Parameter updates for the EM algorithm based on the latent variable model (3) are found by maximizing the “Q-function” which is defined as the expectation of given and the current vector of parameter estimates . From (4), the Q-function is found by replacing with which allows the Q-function to be written as
where , is as defined in (5), , and is a constant not depending on . The EM update of is found by maximizing with respect to , and hence we can represent as the solution of the following weighted least-squares problem
(6)
It is interesting to note that the EM algorithm parameter update (6) is identical to the parameter update defined in Jaakkola & Jordan (2000) which was based on variational Bayes arguments.
Based on (6), this EM algorithm for logistic regression can be viewed as an “iteratively reweighted least squares” (IRLS) algorithm with weights and “response vector” . In this sense, this EM algorithm closely resembles the common Newton-Raphson/Fisher scoring algorithm (Green (1984)) used for logistic regression where one performs iteratively reweighted least squares using the weights , where . Note that, in the case of logistic regression, Newton-Raphson and Fisher scoring are equivalent methods, and we will refer to this procedure as Newton-Raphson in the remainder of the article.
To better see the resemblance between the EM and Newton-Raphson updates, it is interesting to note that we can also express the EM update (6) of as
The above expression for the EM update is a consequence of the equality stated in (5). The main difference between the EM weight function and the Newton-Raphson weight function is that, for a fixed value of , has much heavier tails than . This is illustrated in Figure 1 which plots and when it is assumed that .
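To make the comparison concrete, the sketch below implements one EM iteration and one Newton-Raphson iteration for the simplest case: binary responses in {0,1}, unit observation weights, and no penalty. The function names em_update and nr_update are ours, not from the pxlogistic package, and pg_mean is the Pólya-Gamma weight function sketched above.

```r
# One EM iteration (binary y, unit weights, no penalty):
# E-step computes the PG weights, M-step solves weighted least squares.
em_update <- function(beta, X, y) {
  eta   <- drop(X %*% beta)
  omega <- pg_mean(1, eta)        # E-step weights: tanh(eta/2) / (2 * eta)
  kappa <- y - 0.5
  drop(solve(crossprod(X, omega * X), crossprod(X, kappa)))  # (X' W X)^{-1} X' kappa
}

# Newton-Raphson/IRLS step for comparison: weights mu * (1 - mu)
# instead of the heavier-tailed PG weights.
nr_update <- function(beta, X, y) {
  eta <- drop(X %*% beta)
  mu  <- plogis(eta)
  drop(beta + solve(crossprod(X, (mu * (1 - mu)) * X), crossprod(X, y - mu)))
}
```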

A main advantage of the EM algorithm over the Newton-Raphson algorithm is that the EM algorithm is monotone; namely, iterates from the EM algorithm are guaranteed to increase the weighted log-likelihood at every iteration and the iterates are guaranteed to converge to a fixed-point of the algorithm. It is interesting to note that the first iterations of the EM and Newton-Raphson are identical whenever all components of the initial iterate are set to zero. This is a consequence of the fact that . While the Newton-Raphson algorithm is not monotone and does not possess any global convergence guarantees, the fact that the first step of Newton-Raphson increases the value of the log-likelihood when setting is one reason for the relatively robust performance of Newton-Raphson for logistic regression. The monotonicity of the Newton-Raphson first step when setting was also noted in Böhning & Lindsay (1988).
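A quick numerical check of this equivalence at the zero starting value, using the em_update and nr_update sketches above on synthetic data: since pg_mean(1, 0) = 1/4 equals μ(1 − μ) at μ = 1/2, the two first steps solve the same weighted least-squares system.

```r
set.seed(1)
X <- cbind(1, matrix(rnorm(200), 100, 2))              # intercept + 2 covariates
y <- rbinom(100, 1, plogis(drop(X %*% c(0.5, -1, 1))))
beta0 <- rep(0, 3)
all.equal(em_update(beta0, X, y), nr_update(beta0, X, y))
# TRUE: both reduce to 4 (X'X)^{-1} X'(y - 1/2) at beta = 0
```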
3 A Parameter-Expanded ECME Algorithm for Logistic Regression
3.1 Review of Parameter-Expanded EM and ECME Algorithms
In a variety of estimation problems, the convergence speed of the EM algorithm can be improved by viewing the complete-data model as embedded within a larger model that has additional parameters and then performing parameter updates with respect to the larger model. An EM algorithm constructed from a larger, parameter-expanded model is typically referred to as a parameter-expanded EM algorithm (PX-EM) algorithm (Liu et al. (1998)).
To describe the steps involved in the PX-EM algorithm, we consider a general setup where one is interested in estimating a parameter vector and one has defined a complete-data vector with density function . The complete-data vector and observed-data vector are connected through where is the data-reducing mapping to be used for developing the EM algorithm. As described in Liu et al. (1998), to generate a PX-EM algorithm one must first consider an expanded probability model for the complete-data vector , and we let denote the probability density function for this expanded probability model where are the parameters of the expanded model and denotes the parameter space for . As outlined in Liu et al. (1998) and Lewandowski et al. (2010), the expanded probability model should satisfy the following conditions in order to generate a well-defined PX-EM algorithm
1. The observed-data model is preserved in the sense that there is a many-to-one reduction function such that
where .
2. There is a “null value” of such that the complete-data model is preserved at the null value in the sense that
Similar to the EM algorithm, the PX-EM algorithm consists of two steps: a “PX–E” step followed by a “PX–M” step. The PX–E step is similar to the usual “E–step” of an EM algorithm in that a “Q-function” is formed by taking the expectation of the (parameter-expanded) complete-data log likelihood conditional on the observed data and parameter values . The “PX–M” step then proceeds by finding the expanded parameter values which maximize this Q-function and computing the parameter update by applying the reduction function to . The two steps of the PX-EM algorithm can be summarized as:
1. PX–E step: Compute the parameter-expanded Q-function
(7)
2. PX–M step: Find
and update using
A chief advantage of the PX-EM algorithm is that it is guaranteed to converge at least as fast as the corresponding EM algorithm. When the parameter-expanded model is constructed so that all parameter updates have a direct, closed-form, the PX-EM algorithm can have substantially faster convergence speed than the EM algorithm while maintaining the stability and computational simplicity of the original EM algorithm.
The Expectation-Conditional Maximization (ECM) algorithm (Meng & Rubin (1993)) and the Expectation-Conditional Maximization Either (ECME) algorithm (Liu & Rubin (1994)) are variations of the EM algorithm where the M-step is replaced by a sequence of conditional maximization steps. In the ECM algorithm, one breaks up the maximization of the Q-function into several conditional maximization steps where, in each conditional maximization step, one only maximizes the Q-function with respect to a subset of the parameters while keeping the remaining parameter values fixed. In the ECME algorithm, one also performs a series of conditional maximization steps. However, in the ECME algorithm, some of the conditional maximization steps are performed with respect to the observed-data log-likelihood function while the remaining conditional maximization steps are performed with respect to the usual Q-function.
As in the original ECME algorithm, in a parameter-expanded version of ECME (PX-ECME), one performs a sequence of conditional maximization steps for the expanded parameters with some of the conditional maximization steps being performed with respect to the parameter-expanded Q-function while the other conditional maximization steps are performed with respect to the observed-data log-likelihood function. Specifically, when using the observed-data log-likelihood function for conditional maximization one will maximize with respect to the subset of parameters being maximized in that conditional maximization step. After all of the expanded parameters have been updated through the sequence of conditional maximization steps, the update for the parameter of interest is found by applying the reduction function to the updated expanded parameters.
One example of a PX-ECME algorithm is the procedure described in Liu (1997) for estimating the parameters of a multivariate-t distribution. Here, the EM algorithm exploits the representation of the multivariate-t distribution as a scale mixture of multivariate normal distributions, and the parameter-expanded model utilized by Liu (1997) includes an expanded scale parameter that plays a role in both the covariance matrix of the normal distribution and the scale of the latent Gamma distribution. This expanded parameter is updated in each iteration through a conditional maximization step maximizing the observed-data likelihood.
One point worth noting is that a PX-ECME algorithm can often be constructed even if the expanded parameter vector is not identifiable from the complete data. A PX-ECME algorithm can work if the subsets of parameters used in each conditional maximization step are conditionally identifiable from the observed data. Specifically, the parameters in each of the subsets should be identifiable from the observed data assuming that all the other parameters are fixed. For example, if we construct our PX-ECME algorithm so that is updated first followed by an update of , then should be identifiable for a fixed value of , and should be identifiable from the observed data for a fixed value of .
3.2 A Parameter-Expanded ECME Algorithm for Logistic and Penalized Logistic Regression
We consider the following parameter-expanded version of model (3) for the complete-data vector
(8)
where the parameters in this model are and . Under this expanded model, the observed-data model is preserved when using the reduction function , and the original complete-data model (3) is preserved whenever .
We now turn to describing how to use the parameter-expanded model (8) to construct a PX-ECME algorithm for maximizing a penalized weighted log-likelihood function with penalty function . A PX-ECME algorithm for maximizing a penalized log-likelihood is very similar to one for maximizing an unpenalized log-likelihood, the only difference being that a penalty function will be subtracted from either the parameter-expanded Q-function or the observed-data log-likelihood function. Throughout the remainder of this paper, we will assume the penalty function has the form , where is a vector of tuning parameters. The observed-data weighted penalized log-likelihood function that we seek to maximize is then
and the complete-data penalized weighted log-likelihood corresponding to model (8) is
where is a constant that does not depend on . The penalized, parameter-expanded Q-function is obtained by just subtracting from the usual parameter-expanded Q-function (7). For model (8), this is given by
(9)
where is the weight matrix defined in (5).
One feature of the parameter-expanded model (8) is that the parameter vector is not identifiable from the complete-data vector . This may be seen by noting that, for a positive scalar c, the parameters and both lead to the same complete-data penalized log-likelihood. Nevertheless, the parameters are conditionally identifiable, meaning that, for a fixed , is identifiable, and for a fixed , is identifiable. Hence, we can first update by fixing and maximizing with respect to
(10)
We then update by maximizing the observed-data penalized log-likelihood
The PX-ECME parameter update of is given by .
It is worth noting that one can express in terms of the EM parameter update as where the EM update is obtained from by maximizing with respect to . Hence, we can express the PX-ECME update as
where . In other words, the PX-ECME update is found by simply first computing the EM update of the regression coefficients and then multiplying by the scaling factor , where is the scalar maximizing the weighted observed-data penalized log-likelihood function among regression coefficients of the form . The complete PX-ECME procedure is summarized in Algorithm 1.
Finding only involves solving a one-dimensional root-finding problem, namely, , for which a wide range of numerical methods is available; in particular, robust root-finding methods such as Brent’s method (Brent (2013)) may be used directly to find .
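As an illustration, a minimal sketch of one PX-ECME iteration for the unpenalized binary case is given below. It reuses em_update from above and, for simplicity, replaces a root finder applied to the derivative in the scale parameter by a direct one-dimensional search with optimize() over an arbitrary bracket; the names loglik, px_ecme_update, and c_max are ours.

```r
# Observed-data log-likelihood for binary responses.
loglik <- function(beta, X, y) {
  eta <- drop(X %*% beta)
  sum(y * eta - log1p(exp(eta)))
}

# PX-ECME step: EM update, then rescale by the scalar maximizing the
# observed-data log-likelihood along the ray c * beta_EM.
px_ecme_update <- function(beta, X, y, c_max = 10) {
  beta_em <- em_update(beta, X, y)
  c_opt <- optimize(function(a) loglik(a * beta_em, X, y),
                    interval = c(0, c_max), maximum = TRUE)$maximum
  c_opt * beta_em
}
```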
An attractive feature of the PX-ECME algorithm (Algorithm 1) is that it is, like the EM algorithm, a monotone algorithm. The monotonicity property helps to ensure that the procedure is stable and robust in practice. The monotonicity of PX-ECME directly follows from the fact that the EM algorithm is monotone with respect to the penalized log-likelihood (i.e., ) and that .
3.3 Monotone acceleration of EM via order-1 Anderson acceleration
The PX-ECME algorithm accelerates convergence by multiplying the EM parameter update by a single scalar, and there are likely other straightforward modifications of the EM algorithm that can produce substantial speed-ups in convergence without sacrificing the monotonicity property.
An attractive class of methods with straightforward parameter updates are those that simply take a linear combination of the previous two EM parameter updates with the weights in the linear combination determined adaptively. One can ensure that this scheme is monotone by comparing the value of the objective function for a parameter update versus that of the current parameter value. One method for finding the weights in this linear combination is the order-1 Anderson acceleration (Walker & Ni (2011)). Applying the order-1 Anderson acceleration to the EM algorithm for penalized logistic regression leads to the following parameter updating scheme
where is the scalar and where and . Because the underlying EM algorithm is monotone, only accepting the linear combination if ensures that the above iterative scheme is monotone.
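A sketch of this scheme in the same setting as the earlier snippets is shown below. It uses one standard form of the order-1 Anderson extrapolation weight (cf. Walker & Ni (2011)) together with a monotonicity safeguard that falls back to the plain EM update; the function name aa1_update is ours.

```r
# Order-1 Anderson acceleration of the EM map with a monotonicity check.
aa1_update <- function(beta_prev, beta_curr, X, y) {
  f_prev <- em_update(beta_prev, X, y)
  f_curr <- em_update(beta_curr, X, y)
  r_prev <- f_prev - beta_prev            # fixed-point (EM) residuals
  r_curr <- f_curr - beta_curr
  dr     <- r_curr - r_prev
  if (sum(dr * dr) < 1e-12) return(f_curr)
  gamma  <- sum(r_curr * dr) / sum(dr * dr)
  cand   <- (1 - gamma) * f_curr + gamma * f_prev
  # accept the extrapolated point only if it does not decrease the objective
  if (loglik(cand, X, y) >= loglik(beta_curr, X, y)) cand else f_curr
}
```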
3.4 Rate of Convergence of EM and PX-ECME for Logistic Regression
The PX-ECME updating scheme outlined in Algorithm 1 implicitly defines a mapping which performs the PX-ECME update of , i.e., . Likewise, the EM algorithm also defines a mapping which, from (6), has the form . Because the PX-ECME parameter update is a scalar multiple of the EM update, we can express the mapping in terms of as
(11)
where is the function defined as .
The rate of convergence of a fixed-point iteration of the form is typically defined as the maximum eigenvalue of the Jacobian matrix of when the Jacobian is evaluated at the convergence point of the algorithm (see, e.g., McLachlan & Krishnan (2007)). As shown in Durante & Rigon (2018), the Jacobian matrix associated with the mapping is given by
(12)
where is the diagonal matrix whose diagonal element is and . As we show in Theorem 1 below, the maximum eigenvalue of the Jacobian matrix of at can be no larger than the maximum eigenvalue of , and hence the rate of convergence of PX-ECME is no slower than that of the EM algorithm.
Theorem 1
When assuming the penalty function equals zero for all (i.e., ) and , the Jacobian of the PX-ECME mapping at is given by
where is the Jacobian matrix of the EM mapping at , , , and . In addition, we have , where and are the maximal eigenvalues of and respectively.
4 A Generalized PX-ECME Algorithm for Logistic Regression
4.1 Description of the Approach
For many penalty functions, the maximization in (10) does not yield an easily-computable solution and may require the use of a time-consuming iterative procedure to compute . To better handle such cases, we consider a generalized PX-ECME algorithm where maximizes a more manageable, surrogate parameter-expanded Q-function. The more manageable surrogate Q-functions are assumed to have the form
(13)
where is a positive semi-definite matrix. As in the PX-ECME algorithm for penalized logistic regression, to update one first finds and by performing two conditional maximization steps with respect to and then sets . We will refer to an algorithm which updates in this manner as a generalized parameter-expanded ECME (GPX-ECME) algorithm for penalized weighted logistic regression.
The steps involved in a GPX-ECME are outlined in Algorithm 2. Notice in this Algorithm that, as in the case of a PX-ECME algorithm, the process of first computing and and then setting may be equivalently expressed as , where maximizes and maximizes with respect to . As stated in the following proposition, as long as is a positive semi-definite matrix for each , each iterate generated from a GPX-ECME algorithm will increase the value of the penalized log-likelihood.
Proposition 1
If is a positive semi-definite matrix for each , then iterates produced by a GPX-ECME algorithm (Algorithm 2) increase the penalized log-likelihood function at each iteration. That is, for any ,
The main purpose for considering surrogate Q-functions of the form (13) in a GPX-ECME algorithm is to allow for more easily-computable parameter updates. A choice of which typically ensures an easy-to-compute maximizer of is , where is a diagonal matrix with nonnegative diagonal entries . When using a matrix of this form, the surrogate Q-function (13) with arguments reduces to
(14)
where . Because is diagonal, maximizing (14) is often straightforward as one can maximize each component of separately by maximizing a quadratic function plus the component of the penalty function .
If choosing and setting all the diagonal elements of equal so that for some , will be positive semi-definite provided that , where is the maximum eigenvalue of . When using , it follows from (7) that the term computed in Step 4 of Algorithm 2 can be expressed as
(15)
Each component of can be updated independently of the other components via
(16)
where denotes the component of .
The updates (16) have a closed form for a number of common penalty functions. For example, in the case of the elastic net penalty function where , the update for the component of is given by
While other popular choices of the penalty function such as the smoothly clipped absolute deviation (SCAD) (Fan & Li (2001)) penalty may not necessarily yield direct, closed-form solutions, the parameter updates may be directly found by performing a univariate minimization separately for each component of the parameter vector.
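As a concrete illustration of the separable case, the following sketch solves the one-dimensional elastic-net subproblem in closed form via soft-thresholding. The parameterization used here (a quadratic term with curvature delta centered at z_j, plus lam1|b| + (lam2/2)b^2) and the function names are ours and may differ from the paper's exact scaling.

```r
# Soft-thresholding operator.
soft_threshold <- function(u, a) sign(u) * pmax(abs(u) - a, 0)

# Closed-form minimizer of (delta/2) * (b - z)^2 + lam1 * |b| + (lam2/2) * b^2.
enet_component_update <- function(z, delta, lam1, lam2) {
  soft_threshold(delta * z, lam1) / (delta + lam2)
}
```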
4.2 Connections to Proximal Gradient Descent
Proximal gradient methods (Parikh & Boyd (2014)) are commonly used to optimize a function which can be expressed as the sum of a continuously differentiable function and a non-smooth function. In the context of penalized logistic regression where the aim is to minimize the composite function , the proximal gradient descent update with steplength of a current parameter estimate is given by
(17)
where the proximal operator of the function is defined as
To elaborate on the connection between the proximal gradient update and the GPX-ECME procedure, we first note that the gradient of the observed-data log-likelihood function (2) is , where is an vector whose component is given by . Using the connection (5) between the mean function and the weight matrix , we can write the gradient of as
Thus, the proximal gradient update (17) for penalized logistic regression can be expressed as
(18)
It is interesting to note that the proximal gradient descent update (18) is exactly the same as the update in (15), which is part of a GPX-ECME algorithm with . In other words, the GPX-ECME algorithm with this choice of may be interpreted as a proximal gradient descent algorithm where one first takes a proximal gradient descent step with steplength and then multiplies this update by a scalar that is optimal in the sense that it leads to the largest increase in the observed-data penalized log-likelihood.
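A hedged sketch of this interpretation for a lasso penalty is given below: a single proximal (soft-thresholding) step with a conservative step length, valid because every Pólya-Gamma weight is at most 1/4 for binary data, followed by a one-dimensional search over the scaling factor. The function gpx_prox_update and the bracket c_max are ours; loglik and soft_threshold come from the earlier snippets.

```r
# Proximal gradient step followed by rescaling (GPX-ECME with D_t = (1/t) I).
gpx_prox_update <- function(beta, X, y, lam1, c_max = 10) {
  eta  <- drop(X %*% beta)
  grad <- drop(crossprod(X, y - plogis(eta)))           # gradient of the log-likelihood
  step <- 4 / max(eigen(crossprod(X), symmetric = TRUE,
                        only.values = TRUE)$values)     # safe since all PG weights <= 1/4
  z    <- soft_threshold(beta + step * grad, step * lam1)   # proximal step
  pen_obj <- function(a) loglik(a * z, X, y) - lam1 * sum(abs(a * z))
  c_opt <- optimize(pen_obj, interval = c(0, c_max), maximum = TRUE)$maximum
  c_opt * z
}
```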
4.3 Connection to an MM algorithm for Logistic Regression
If one chooses to be for a positive scalar in a GPX-ECME algorithm, the surrogate Q-function becomes
(19)
where and is a constant independent of . In this context, the matrix is guaranteed to be positive semi-definite if is greater than or equal to the largest diagonal element of . For example, setting where
guarantees that is positive semi-definite. When there is no penalty (i.e., ), maximizing with respect to leads to the following formula for
(20)
The update is then found by finding and setting .
The algorithm defined by the update (20) is closely related to the MM algorithm for logistic regression described in Böhning & Lindsay (1988) and also detailed in Hunter & Lange (2004). To see why, if we use the fact that (5) implies
then (20) can be rewritten as
(21)
If we were to set for each in (21), then in the case of , (21) would correspond exactly to the MM update for logistic regression described by Böhning & Lindsay (1988). Note that, in the case when all equal , setting guarantees that is positive semi-definite because for any value of . One advantage of using rather than is that one does not need to compute the maximum of the weights in each iteration; however, this could result in taking slightly smaller “steps” when compared with using .
In the case of an penalty where , one would, in (21), simply replace with the matrix . This approach for the -penalized logistic regression problem, which we label the “PX-MM” algorithm, is summarized in Algorithm 3.
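For the unweighted, unpenalized binary case, a sketch of the MM update and its parameter-expanded (PX-MM) counterpart is given below; mm_update and px_mm_update are our names, and the key computational point is that the matrix factorization can be done once outside the loop. The functions loglik and em_update from the earlier snippets are assumed to be in scope.

```r
# Bohning & Lindsay MM step: replace the PG weights by the constant bound 1/4,
# so the curvature matrix X'X / 4 is fixed across iterations.
mm_update <- function(beta, X, y, XtX_inv) {
  eta <- drop(X %*% beta)
  drop(beta + 4 * XtX_inv %*% crossprod(X, y - plogis(eta)))
}

# PX-MM step: rescale the MM update by the scalar maximizing the log-likelihood.
px_mm_update <- function(beta, X, y, XtX_inv, c_max = 10) {
  beta_mm <- mm_update(beta, X, y, XtX_inv)
  c_opt <- optimize(function(a) loglik(a * beta_mm, X, y),
                    interval = c(0, c_max), maximum = TRUE)$maximum
  c_opt * beta_mm
}
# XtX_inv <- solve(crossprod(X))   # computed once and reused across iterations
```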
5 PX-ECME and Coordinate Descent
Coordinate descent is often used in the context of -penalized regression (Friedman et al. (2007)) and -penalized generalized linear models (Friedman et al. (2010)). Coordinate descent proceeds by updating each regression coefficient one at a time while holding the remaining coefficients fixed. This is frequently an efficient approach as the parameter updates usually have a simple, closed form, and many of the parameters do not change after being mapped to zero.
One could also consider a coordinate descent version of the PX-ECME algorithm where one instead updates the elements of the vector one at a time rather than in a single step. To be more specific, consider the parameter-expanded Q-function defined in (9) with the “elastic net” penalty and where the first components of have already been updated
(22)
where denotes the element of , is a constant not depending on . Maximizing (22) with respect to leads to an update of which has the form
(23)
Expressions for the terms , , and are given in Appendix C. After completing one “cycle” of coordinate descent where all components of are updated using (23), one can update the value of by maximizing with respect to . After completing this cycle, the regression coefficients would be updated via .
Rather than waiting an entire cycle to update and the weight matrix , an alternative is to perform this updating after a “block” of of the components (for ) of have been updated rather than all components. This can often reduce the total number of coordinate cycles required to converge; however, this comes at the expense of a greater computational cost to perform each coordinate descent cycle. The more general PX-ECME algorithm which updates the value of and the regression weights after every coordinate updates is outlined in Algorithm 4 of Appendix C. Note that when , this algorithm reduces to the case where and the weight matrix are only updated after a full cycle of coordinate descent is completed.
Our experience with running the PX-ECME version of coordinate descent suggests that it typically offers very modest improvements in the total number of iterations compared to coordinate descent with no parameter expansion (i.e., for all ), and the extra computational effort involved in updating often nullifies any reduction in the total number of coordinate descent iterations. Nevertheless, experimenting with different values of the “block size” parameter can have a positive impact in some cases, and hence it may be worth exploring different values of to determine whether there are any gains to be had over a “classic” coordinate descent algorithm without any parameter expansion.
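For concreteness, a sketch of one full coordinate-descent cycle with the parameter-expansion step applied at the end of the cycle is given below. It assumes binary responses, unit weights, and a lasso penalty applied to every coefficient (including any intercept, which one would normally leave unpenalized), and reuses pg_mean, soft_threshold, and loglik from the earlier snippets; cd_px_cycle is our name.

```r
# One PX-ECME coordinate-descent cycle for lasso-penalized logistic regression.
cd_px_cycle <- function(beta, X, y, lam1, c_max = 10) {
  eta   <- drop(X %*% beta)
  omega <- pg_mean(1, eta)                         # PG weights frozen for the cycle
  kappa <- y - 0.5
  for (j in seq_along(beta)) {
    eta_mj  <- eta - X[, j] * beta[j]              # linear predictor without beta_j
    a_j     <- sum(omega * X[, j]^2)
    b_j     <- sum(X[, j] * (kappa - omega * eta_mj))
    beta[j] <- soft_threshold(b_j, lam1) / a_j     # coordinate-wise soft-threshold update
    eta     <- eta_mj + X[, j] * beta[j]
  }
  # parameter-expansion step: rescale the whole vector once per cycle
  pen_obj <- function(a) loglik(a * beta, X, y) - lam1 * sum(abs(a * beta))
  c_opt <- optimize(pen_obj, interval = c(0, c_max), maximum = TRUE)$maximum
  c_opt * beta
}
```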
6 Simulations
6.1 A Weighted Logistic Regression Example
In the standard logistic regression setting where all the weights are equal, the usual starting values used for the Newton-Raphson algorithm (i.e., ) are generally quite robust, and using these starting values with Newton-Raphson produces iterates which converge to the maximum likelihood estimates of the regression coefficients. However, in many weighted logistic regression problems where there is large variability in the weights, the Newton-Raphson algorithm can result in failure even when the starting values of are used. To illustrate this, we consider the following small numerical example with responses , scalar covariates , and regression weights :
(24)
The performance of Newton-Raphson, PX-ECME, and EM using the data and regression weights in (24) and with no penalty term is summarized in Table 1. We ran each method for 63 iterations because this was the number of iterations required for PX-ECME to converge when using the stopping criterion . As shown in Table 1, the Newton-Raphson procedure diverges even when using the starting values . While the first iteration of Newton-Raphson improves the value of the weighted log-likelihood, the subsequent iterates quickly begin to diverge. In contrast to Newton-Raphson, the PX-ECME algorithm provides more stable, monotone convergence to the weighted maximum likelihood estimates . While the EM algorithm also enjoys monotone convergence, its convergence is considerably slower than that of PX-ECME, and it takes iterations for EM to converge when using the same convergence tolerance as PX-ECME.
 | Newton-Raphson | | | PX-ECME | | | EM | | 
---|---|---|---|---|---|---|---|---|---
k | coef. 1 | coef. 2 | log-lik. | coef. 1 | coef. 2 | log-lik. | coef. 1 | coef. 2 | log-lik.
0 | 0.0 | 0.0 | -0.6931 | 0.0 | 0.0 | -0.6931 | 0.0 | 0.0 | -0.6931 |
1 | 4.26 | 1.97 | -0.2972 | 1.55 | 0.01 | -0.3611 | 1.55 | 0.01 | -0.3611 |
2 | -79.22 | -10.17 | -80.4768 | 2.09 | 0.02 | -0.3441 | 1.85 | 0.01 | -0.3471 |
3 | 2.10 | 0.03 | -0.3432 | 1.97 | 0.02 | -0.3441 | |||
4 | 2.10 | 0.03 | -0.3424 | 2.03 | 0.03 | -0.3429 | |||
5 | 2.10 | 0.05 | -0.3414 | 2.05 | 0.04 | -0.3420 | |||
6 | 2.11 | 0.06 | -0.3404 | 2.07 | 0.05 | -0.3410 | |||
7 | 2.12 | 0.07 | -0.3392 | 2.07 | 0.06 | -0.3400 | |||
8 | 2.13 | 0.09 | -0.3377 | 2.08 | 0.08 | -0.3388 | |||
9 | 2.14 | 0.11 | -0.3360 | 2.08 | 0.09 | -0.3373 | |||
10 | 2.15 | 0.14 | -0.3339 | 2.08 | 0.11 | -0.3357 | |||
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
63 | 4.39 | 5.30 | -0.1376 | 4.01 | 4.83 | -0.1386 |
6.2 The Kyphosis Data
In this simulation study, we utilized the Kyphosis data described in Chambers & Hastie (1992) (pg. 200) and also considered in Liu et al. (1998). This dataset contains 81 observations from children who had undergone a corrective spinal surgery. The binary outcome in this dataset indicates whether or not kyphosis was present after the operation, and the following additional 3 covariates were also recorded: the age of the child in months, the number of vertebrae, and the number of the topmost vertebra involved in the surgery. For this simulation study, we did not use the observed binary outcomes but instead only used the observed covariates and simulated binary responses. Similar to the simulation described in Liu et al. (1998), we generated pseudo-binary outcomes assuming , where and denote the number of vertebrae and the number of topmost vertebra for the child respectively. We generated datasets in this manner so that each dataset contained the 3 covariate vectors in the original Kyphosis data and the vector of simulated pseudo-outcomes. When estimating the parameters of this model, we did not include a penalty term, and we included an intercept term so that, in total, four regression coefficients needed to be estimated.
Method | Number of iterations | | | Timing | | | Final log-lik.
---|---|---|---|---|---|---|---
 | median | mean | std. dev. | median | mean | std. dev. | mean
EM | 424 | 756.96 | 1099.89 | 0.0215 | 0.0410 | 0.0671 | -10.524639 |
PX-ECME | 49 | 51.61 | 19.43 | 0.0050 | 0.0061 | 0.0058 | -10.524639 |
MM | 2242 | 7079.57 | 18123.52 | 0.0765 | 0.2577 | 0.6738 | -10.524639 |
PX-MM | 139 | 177.88 | 140.69 | 0.0130 | 0.0181 | 0.0170 | -10.524639 |
NR | 10 | 10.14 | 1.14 | 0.0010 | 0.0015 | 0.0014 | -10.524639 |
AA1 | 59 | 71.53 | 49.86 | 0.0040 | 0.0050 | 0.0036 | -10.524639 |
Using these 500 simulated datasets, we compared the following methods for computing the four regression coefficient estimates: the EM algorithm, Newton-Raphson, the PX-ECME algorithm (Algorithm 1), the MM algorithm of Section 4.3, the PX-MM algorithm (Algorithm 3), and the order-1 Anderson acceleration of EM (AA1). For each method, all regression coefficients were initialized at , and the stopping criterion of was used for each method.
The performance of these procedures is summarized in Table 2. As shown in Table 2, the Newton-Raphson procedure performed the best, requiring only a median of 10 iterations and a median of 0.0010 seconds to achieve convergence. PX-ECME was the second-fastest algorithm in terms of the number of iterations, requiring only a median of 49 iterations for convergence. Compared to the EM algorithm, this was a roughly ten-fold reduction in the median number of iterations. PX-ECME and AA1 were quite close in performance, with AA1 requiring a few more iterations than PX-ECME while having slightly better timing due to a mildly cheaper per-iteration cost. A notable result from the Kyphosis data simulations was that PX-MM was substantially faster than the MM algorithm. Compared to the MM algorithm, PX-MM had a nearly 17-fold reduction in the median number of iterations and a nearly six-fold reduction in the median time to converge.
6.3 Simulated Outcomes with Autocorrelated Covariates
For this simulation study, we generated the elements of an design matrix using an autoregressive process of order 1 with autocorrelation parameter . Specifically, for each , the were generated as , and
where and . This simulation design implies the correlation between and is for any . The regression coefficients were generated independently as , where and follows a t distribution with degrees of freedom. Given the generated covariate vectors and regression coefficient vector , the responses were then generated independently as .
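A data-generating sketch in the spirit of this design is shown below; the autocorrelation value, the coefficient scaling, and the t degrees of freedom used here are placeholder choices for illustration only, not the settings used in the paper.

```r
# Generate one synthetic dataset with AR(1)-correlated covariates.
gen_data <- function(n, p, rho = 0.5) {
  X <- matrix(0, n, p)
  X[, 1] <- rnorm(n)
  for (j in 2:p) X[, j] <- rho * X[, j - 1] + sqrt(1 - rho^2) * rnorm(n)
  beta <- rt(p, df = 2) / sqrt(p)                # heavy-tailed coefficients (placeholder scale)
  y <- rbinom(n, 1, plogis(drop(X %*% beta)))
  list(X = X, y = y, beta = beta)
}
```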
We considered two choices for the number of observations: , three settings for the number of covariates: , and three values for the autocorrelation parameter . For each of the simulation settings (i.e., each setting of , , and ), we ran each procedure for computing the maximum likelihood estimate of with no penalty term on simulated datasets. For all methods considered, we used a maximum of iterations.
Table 3 shows summary performance measures of methods for estimating the logistic regression coefficients in these simulation scenarios. With the exception of Newton-Raphson, each procedure displayed in Table 3 is a monotone algorithm, as the steplengths in gradient descent and in the GPX-ECME algorithm (Algorithm 2) were chosen to guarantee an increase in the log-likelihood at each iteration. The GPX-ECME algorithm used here chooses (where is the maximum eigenvalue of ) so that the GPX-ECME update is simply the gradient descent update multiplied by the scalar which maximizes the observed-data log-likelihood. The summary measures shown in Table 3 correspond to performance measures aggregated across all 18 simulation settings.
For the results shown in the top half of Table 3, the initial value was set to the zero vector. While Newton-Raphson has the best overall performance when setting , PX-ECME shows clear advantages over the EM algorithm and the MM and PX-MM algorithms. Indeed, PX-ECME shows a more than ten-fold reduction from EM in the median number of iterations required to converge, and PX-ECME shows a more than three-fold reduction in the median number of iterations when compared with PX-MM. When compared with EM, PX-ECME also reduced the number of cases with very long times to converge which can be observed by noting the even greater reduction in mean number of iterations for PX-ECME versus EM when compared to the reduction in median iterations for PX-ECME versus EM. While each iteration of PX-ECME requires moderately more computational time than EM, the timing comparisons show that the median time to convergence of PX-ECME is nearly 10 times less than that of the EM algorithm.
The lower half of Table 3 shows the performance of each method when the components of the initial vector are set by sampling from a standard normal distribution. With these random starting values, all of the methods except Newton-Raphson demonstrated robust performance with all of these methods having very similar performance to the simulation runs where . Newton-Raphson, however, showed very erratic performance when the initial values were not set to zero demonstrating this method’s considerable sensitivity to the choice of starting values, and in the majority of simulation runs, Newton-Raphson did not converge. As with the simulation runs with , PX-ECME with random starting values provides a roughly 10-fold improvement in convergence speed over the EM algorithm.
Method | Number of iterations | | | Timing | | | Final log-lik.
---|---|---|---|---|---|---|---
 | median | mean | std. dev. | median | mean | std. dev. | mean
initial values of zero: | | | | | | | 
EM | 625 | 4509.12 | 13224.93 | 0.2395 | 5.5697 | 23.0756 | -210.230466 |
PX-ECME | 48 | 130.98 | 315.16 | 0.0340 | 0.1980 | 0.9346 | -210.230148 |
MM | 3762 | 24038.59 | 36171.98 | 0.3440 | 3.3383 | 6.4779 | -210.277489 |
PX-MM | 161 | 1248.01 | 5258.13 | 0.0860 | 0.8922 | 6.1386 | -210.230148 |
NR | 10 | 10.51 | 2.92 | 0.0060 | 0.0129 | 0.0180 | -210.230148 |
GD | 33917 | 48948.32 | 45435.61 | 8.6060 | 44.8929 | 77.4781 | -213.906704 |
GPX-ECME | 13178 | 38945.89 | 43830.79 | 7.3200 | 54.3209 | 93.0919 | -211.519659 |
GDBT | 9766 | 39079.08 | 44644.20 | 4.0385 | 44.5298 | 81.2833 | -211.461741 |
AA1 | 42 | 89.09 | 370.13 | 0.0200 | 0.1109 | 0.8442 | -210.230148 |
random initial values: | |||||||
EM | 596 | 4431.32 | 12871.93 | 0.2480 | 5.6423 | 24.3456 | -203.681932 |
PX-ECME | 50 | 122.06 | 193.26 | 0.0365 | 0.1649 | 0.3405 | -203.681930 |
MM | 3680 | 24055.37 | 35992.27 | 0.4185 | 3.1970 | 6.0510 | -203.720918 |
PX-MM | 158 | 998.80 | 2754.27 | 0.0900 | 0.6194 | 2.1296 | -203.681930 |
NR∗ | 100000 | 87223.14 | 33400.38 | 39.4695 | 120.7834 | 146.0996 | -4.02 |
GD | 31308 | 48637.68 | 45354.11 | 8.8395 | 45.1825 | 78.0428 | -207.656757 |
GPX-ECME | 13901 | 39070.57 | 43749.47 | 7.6315 | 53.8382 | 91.0420 | -205.777709 |
GDBT | 9094 | 39178.89 | 44816.03 | 3.9850 | 44.9879 | 81.2179 | -204.791037 |
AA1 | 46 | 80.60 | 172.71 | 0.0210 | 0.0917 | 0.4262 | -203.681930 |
∗Newton-Raphson resulted in numerical errors in out of simulation runs; these runs were discarded when tabulating the summary measures of performance.
6.4 The Madelon Data
In this simulation study, we consider a weighted logistic regression example with both an penalty (i.e., ) and an penalty function (i.e., ). We used the Madelon dataset (Guyon et al. (2004)), which is available for download from the UCI machine learning repository. This dataset has observations and covariates. While the same Madelon dataset was used for all simulation replications, a different set of weights was drawn for each of 10 simulation replications. The weights were sampled independently from an exponential distribution with rate parameter 1. We generated 10 sets of weights for both the and penalized cases.
For each of the ten sets of weights, we obtained parameter estimates across a decreasing sequence of nine tuning parameters. The same values of were used for both the and -penalized simulation runs. These values were , , , , , , , , . For a given set of weights, the parameter estimate for the previous value of was used as the starting value for the subsequent value of . That is, was used as the starting value for each method when was equal to . The following methods were evaluated for the -penalized case: coordinate descent with EM-based weights, coordinate descent with Newton-Raphson-based weights, proximal gradient descent (PGD) with a fixed steplength, parameter-expanded proximal gradient descent (GPX-ECME), and gradient descent with backtracking. The following methods were evaluated for the -penalized case: EM, PX-ECME, MM, PX-MM, Newton-Raphson, fixed steplength gradient descent, GPX-ECME, gradient descent with backtracking, and order-1 Anderson acceleration. For the -penalized simulations, we set a maximum of iterations for the non-coordinate descent methods and a maximum of iterations for the coordinate descent methods (one iteration for this case is a full cycle that updates all the regression coefficients), and for the -penalized simulations, we set a maximum of iterations.
The Madelon data simulation results are shown in Table 4. Note that the summary measures for the number of iterations and timings are aggregated across all replications and all values of and are not separated by value of . In the -penalized simulations, the parameter-expanded version (GPX-ECME) of proximal gradient descent (PGD) performs well in terms of both timing and the final value of the objective function. Specifically, the timing required by GPX-ECME was not much greater than that of PGD, while the average value of the objective function achieved by GPX-ECME at the final iteration was much better than that achieved by PGD. While proximal gradient descent with backtracking achieved better log-likelihood values after iterations, this was at the cost of much slower speeds than either PGD or GPX-ECME.
In the -penalized simulations, Newton-Raphson had the smallest median number of iterations needed for convergence. However, of the 90 simulation runs, there were 14 runs where Newton-Raphson diverged which led to a very large mean time spent and mean number of iterations needed. In contrast to Newton-Raphson, both EM and MM had more robust performance without sacrificing much computational speed relative to Newton-Raphson. Overall, both PX-ECME and PX-MM delivered speed advantages over the EM and MM algorithms respectively, but the additional gains in computational speed were, overall, quite modest. This seems to be an example where both EM and MM are quite fast relative to Newton-Raphson (when Newton-Raphson converges), and hence, it is difficult for PX-ECME or PX-MM to make substantial improvements. The strong performance of order-1 Anderson acceleration (AA1) in the -penalized simulations is also interesting to note. In addition to demonstrating considerable robustness by converging to the maximizer of the objective function in every simulation run, AA1 was somewhat faster than PX-ECME and was even competitive with Newton-Raphson in median timing.
Method | Number of iterations | | | Timing | | Final obj. | 
---|---|---|---|---|---|---|---
 | median | mean | std. dev. | median | mean | mean | not converged
penalty: | |||||||
Coord Desc(EM) | 1000 | 1000 | 0.00 | 2212.20 | 2212.06 | -1610.69 | 90 |
Coord Desc(NR) | 1000 | 1000 | 0.00 | 2737.15 | 2737.40 | -1580.09 | 90 |
PGD | 100000 | 73635.87 | 35608.04 | 852.77 | 630.84 | -1573.44 | 49 |
GPX-ECME | 100000 | 75781.66 | 33794.31 | 1268.08 | 964.46 | -1515.11 | 51 |
PGDBT | 100000 | 72641.41 | 40250.80 | 28457.42 | 20723.50 | -1483.10 | 60 |
penalty: | |||||||
EM | 31 | 30.76 | 4.25 | 9.61 | 9.52 | -1200.41 | 0 |
PX-ECME | 24 | 23.50 | 3.27 | 7.58 | 7.39 | -1200.41 | 0 |
MM | 56 | 55.04 | 9.92 | 0.89 | 0.89 | -1200.41 | 0 |
PX-MM | 43 | 41.97 | 8.30 | 0.99 | 1.00 | -1200.41 | 0 |
NR | 9 | 1563.11 | 3641.38 | 5.03 | 874.79 | -8.05 | 14 |
GD | 10000 | 10000 | 0 | 2696.13 | 2697.28 | -1664.13 | 90 |
GPX-ECME | 10000 | 10000 | 0 | 2735.57 | 2737.46 | -1530.06 | 90 |
GDBT | 10000 | 10000 | 0 | 2773.56 | 2782.01 | -1597.34 | 90 |
AA1 | 18 | 18.06 | 2.30 | 5.66 | 5.63 | -1200.41 | 0 |
7 Discussion
In this article, we have proposed and investigated the performance of an accelerated version of a Pólya-Gamma-based EM algorithm for weighted penalized logistic regression. Our method provides a useful alternative algorithm that is stable, efficient, directly implementable, and provides monotone convergence regardless of the weights or starting values chosen by the user. We established that our PX-ECME algorithm has a convergence rate guaranteed to be as fast as the original Pólya-Gamma EM algorithm, and empirically, across a range of different simulation examples, our PX-ECME algorithm accelerates the convergence speed roughly by a factor of ten when compared to the EM algorithm. We also proposed a generalized version of the Pólya-Gamma EM algorithm and explored a PX-ECME version of this generalized EM algorithm. This generalization of EM can be useful in certain high dimensional contexts where performing the full M-step in every iteration of the EM algorithm can become computationally expensive. Our generalized EM algorithm includes both proximal gradient descent and the logistic regression MM algorithm as special cases, and in our simulations, we showed that the parameter-expanded version of the MM algorithm can have substantially faster convergence than the corresponding MM algorithm.
While Newton-Raphson typically converges very quickly in cases where Newton-Raphson is well-behaved, a chief motivation behind the development of the PX-ECME algorithm was to explore an alternative optimization procedure which has relatively fast convergence yet still has stable, robust performance regardless of the choice of starting values and choice of weights. Improving the robustness of Newton-Raphson becomes a more salient issue in cases where one is optimizing a weighted log-likelihood with highly variable weights. Indeed, the first example in Section 6 shows a simple case where Newton-Raphson regularly diverges even when the initial values are set to the typically robust starting value of zero for all parameters, and in this case, both EM and PX-ECME converge to the correct value. In addition to greater robustness and stability, another important issue is the fact that the EM algorithm framework can often be more easily adapted to generate stable, straightforward parameter updates in more complex variations of a logistic regression model where there are additional latent variables in the probability model of interest. A common example of this is when one posits an additional probability model for the covariates in order to perform parameter estimation in the context of missing data. In Appendix A, we describe a PX-ECME algorithm for one example of a missing-data model, and we demonstrate how the PX-ECME parameter updates for this missing-data model are very similar to the PX-ECME updates for fully observed covariates.
While PX-ECME is a straightforward modification of EM that substantially boosts convergence speed, it is certainly possible that applying other monotone acceleration schemes to the PX-ECME iteration could improve convergence speed even further. Specifically, using “off-the-shelf” acceleration schemes which directly accelerate the convergence of an iterative sequence without modifying the “base” optimization algorithm in any way could be applied directly to iterates of the PX-ECME algorithm. One such approach would be to apply the monotone order-1 Anderson acceleration to the PX-ECME updates rather than the EM updates as was done in Section 3.3. This would likely provide further speed gains in addition to those of PX-ECME over EM. It would also be of interest to explore the performance of applying other, higher-order off-the-shelf acceleration schemes to iterates of the PX-ECME algorithms such as the SQUAREM procedure (Varadhan & Roland (2008)) or the quasi-Newton acceleration scheme of Zhou et al. (2011).
The PX-ECME parameter update corresponds to multiplying the EM parameter update by a single scalar, where the scalar is chosen to maximize the observed log-likelihood. This can be thought of as finding the parameter update by searching across parameter updates which are a scalar multiple of the EM update. This type of “multiplicative search” was chosen because it fits naturally into the parameter-expansion framework, and because a similar type of multiplicative update has been found to work well in the context of accelerating EM algorithms for probit regression. Though not explored here, it is likely that alternative approaches based on using the EM algorithm search direction could deliver relatively similar performance to our PX-ECME algorithm. For example, searching across parameter updates that are linear combinations of the previous parameter value and the EM update could be an effective strategy that has relatively similar performance to our PX-ECME algorithm.
SUPPLEMENTARY MATERIAL
- R-package: An R package, pxlogistic, implementing the methods described in the article is available at https://github.com/nchenderson/pxlogistic
Appendix A A PX-ECME algorithm with Missing Covariates
As an example of a direct EM algorithm that can incorporate modeling of missing covariates, we consider a setting similar to that described in Ibrahim (1990). Specifically, we consider the case where all of the covariates are binary. To develop an EM algorithm in this context, we let denote the “complete version” of the vector , where we allow for the possibility that can contain missing values. We can express , where is the collection of observed values from and is the collection of missing values from .
The assumed joint distribution for the covariate vector is given by
(25)
where indexes the possible observed values of , where . In (25), if equals the possible observed value of and equals otherwise. Note that this missing data model is only useful in practice when is relatively small.
Using similar notation to that used in our main manuscript, we consider the following Pólya-Gamma representation of the logistic regression model
$$y_i \mid \mathbf{x}_i, \boldsymbol{\beta} \sim \textrm{Bernoulli}\big\{ 1/(1 + e^{-\mathbf{x}_i^{T}\boldsymbol{\beta}}) \big\}, \qquad \omega_i \sim \textrm{PG}(1, 0), \qquad \mathbf{x}_i \sim p(\mathbf{x}_i \mid \boldsymbol{\gamma}), \qquad (26)$$
where $p(\mathbf{x}_i \mid \boldsymbol{\gamma})$ refers to the same probability model stated in (25). We also assume that the data are missing at random (MAR) so that, if $\mathbf{r}_i$ denotes the vector of missingness indicators for $\mathbf{x}_i$, then
$$p(\mathbf{r}_i \mid \mathbf{x}_i, y_i) = p(\mathbf{r}_i \mid \mathbf{x}_i^{obs}, y_i) \quad \textrm{for all } \mathbf{x}_i \in \mathcal{M}(\mathbf{x}_i^{obs}),$$
where $\mathbf{x}_i^{obs}$ is the collection of observed values from $\mathbf{x}_i$, and $\mathcal{M}(\mathbf{x}_i^{obs})$ is the set of configurations of $\mathbf{x}_i$ that is consistent with the values in $\mathbf{x}_i^{obs}$.
Letting $\{(y_i, \omega_i, \mathbf{x}_i) : i = 1, \ldots, n\}$ denote the “complete data”, the complete-data log-likelihood associated with (26) is
$$\ell_{c}(\boldsymbol{\beta}, \boldsymbol{\gamma}) = \sum_{i=1}^{n}\Big\{ \big(y_i - \tfrac{1}{2}\big)\mathbf{x}_i^{T}\boldsymbol{\beta} - \frac{\omega_i(\mathbf{x}_i^{T}\boldsymbol{\beta})^{2}}{2} + \sum_{k=1}^{K} z_{ik}\log\gamma_k \Big\} + C,$$
where $C$ is a term that does not depend on $(\boldsymbol{\beta}, \boldsymbol{\gamma})$. The “Q-function” associated with $\ell_{c}(\boldsymbol{\beta}, \boldsymbol{\gamma})$ is then
$$Q\big(\boldsymbol{\beta}, \boldsymbol{\gamma} \mid \boldsymbol{\beta}^{(t)}, \boldsymbol{\gamma}^{(t)}\big) = \sum_{i=1}^{n}\Big\{ \big(y_i - \tfrac{1}{2}\big)\hat{\mathbf{x}}_i^{T}\boldsymbol{\beta} - \frac{\boldsymbol{\beta}^{T}\hat{\mathbf{S}}_i\boldsymbol{\beta}}{2} + \sum_{k=1}^{K} \hat{z}_{ik}\log\gamma_k \Big\}, \qquad (27)$$
where $\hat{\mathbf{x}}_i$, $\hat{\mathbf{S}}_i$, and $\hat{z}_{ik}$ are defined as
$$\hat{\mathbf{x}}_i = E\big(\mathbf{x}_i \mid y_i, \mathbf{x}_i^{obs}, \boldsymbol{\beta}^{(t)}, \boldsymbol{\gamma}^{(t)}\big), \quad \hat{\mathbf{S}}_i = E\big(\omega_i\mathbf{x}_i\mathbf{x}_i^{T} \mid y_i, \mathbf{x}_i^{obs}, \boldsymbol{\beta}^{(t)}, \boldsymbol{\gamma}^{(t)}\big), \quad \hat{z}_{ik} = E\big(z_{ik} \mid y_i, \mathbf{x}_i^{obs}, \boldsymbol{\beta}^{(t)}, \boldsymbol{\gamma}^{(t)}\big).$$
Thus, we can re-write the Q-function (27) as
$$Q\big(\boldsymbol{\beta}, \boldsymbol{\gamma} \mid \boldsymbol{\beta}^{(t)}, \boldsymbol{\gamma}^{(t)}\big) = \boldsymbol{\kappa}^{T}\hat{\mathbf{X}}\boldsymbol{\beta} - \frac{1}{2}\boldsymbol{\beta}^{T}\hat{\boldsymbol{\Omega}}\boldsymbol{\beta} + \sum_{i=1}^{n}\sum_{k=1}^{K}\hat{z}_{ik}\log\gamma_k, \qquad (28)$$
where $\boldsymbol{\kappa} = (y_1 - \tfrac{1}{2}, \ldots, y_n - \tfrac{1}{2})^{T}$, $\hat{\mathbf{X}}$ is the $n \times p$ matrix whose $i$th row is $\hat{\mathbf{x}}_i^{T}$, and $\hat{\boldsymbol{\Omega}}$ is the $p \times p$ matrix defined as $\hat{\boldsymbol{\Omega}} = \sum_{i=1}^{n}\hat{\mathbf{S}}_i$. Maximizing (28) subject to the constraint that $\gamma_k \geq 0$ and $\sum_{k=1}^{K}\gamma_k = 1$ yields the following parameter updates:
$$\boldsymbol{\beta}^{(t+1)} = \hat{\boldsymbol{\Omega}}^{-1}\hat{\mathbf{X}}^{T}\boldsymbol{\kappa}, \qquad \gamma_k^{(t+1)} = \frac{1}{n}\sum_{i=1}^{n}\hat{z}_{ik}, \quad k = 1, \ldots, K.$$
To complete the description of this EM algorithm, we just need to describe how to compute $\hat{\mathbf{x}}_i$, $\hat{\mathbf{S}}_i$, and $\hat{z}_{ik}$. To this end, note that
$$\hat{\mathbf{x}}_i = \sum_{\mathbf{a} \in \mathcal{M}(\mathbf{x}_i^{obs})} \mathbf{a}\, p_i(\mathbf{a}), \qquad \hat{z}_{ik} = p_i(\mathbf{a}_k)\,\mathbf{1}\{\mathbf{a}_k \in \mathcal{M}(\mathbf{x}_i^{obs})\},$$
where $\mathbf{a}$ is one of the possible values of $\mathbf{x}_i$ and $p_i(\mathbf{a})$ is defined as
$$p_i(\mathbf{a}) = P\big(\mathbf{x}_i = \mathbf{a} \mid y_i, \mathbf{x}_i^{obs}, \boldsymbol{\beta}^{(t)}, \boldsymbol{\gamma}^{(t)}\big) = \frac{ p\big(y_i \mid \mathbf{x}_i = \mathbf{a}, \boldsymbol{\beta}^{(t)}\big)\, p\big(\mathbf{x}_i = \mathbf{a} \mid \boldsymbol{\gamma}^{(t)}\big) }{ \sum_{\mathbf{a}' \in \mathcal{M}(\mathbf{x}_i^{obs})} p\big(y_i \mid \mathbf{x}_i = \mathbf{a}', \boldsymbol{\beta}^{(t)}\big)\, p\big(\mathbf{x}_i = \mathbf{a}' \mid \boldsymbol{\gamma}^{(t)}\big) }, \qquad \mathbf{a} \in \mathcal{M}(\mathbf{x}_i^{obs}). \qquad (29)$$
In (29), $\mathcal{M}(\mathbf{x}_i^{obs})$ is the set of configurations of $\mathbf{x}_i$ that is consistent with the observed covariate vector $\mathbf{x}_i^{obs}$. Regarding $\hat{\mathbf{S}}_i$, one can note that
$$\hat{\mathbf{S}}_i = \sum_{\mathbf{a} \in \mathcal{M}(\mathbf{x}_i^{obs})} E\big(\omega_i \mid y_i, \mathbf{x}_i = \mathbf{a}, \boldsymbol{\beta}^{(t)}\big)\, \mathbf{a}\mathbf{a}^{T} p_i(\mathbf{a}) = \sum_{\mathbf{a} \in \mathcal{M}(\mathbf{x}_i^{obs})} \frac{\tanh\big(\mathbf{a}^{T}\boldsymbol{\beta}^{(t)}/2\big)}{2\,\mathbf{a}^{T}\boldsymbol{\beta}^{(t)}}\, \mathbf{a}\mathbf{a}^{T} p_i(\mathbf{a}),$$
where $p_i(\mathbf{a})$ is as defined in (29).
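When the number of binary covariates is small, the E-step quantities above can be computed by brute-force enumeration of the configurations in $\mathcal{M}(\mathbf{x}_i^{obs})$. The R sketch below illustrates this for a single observation; the function name estep_one, the unweighted Bernoulli likelihood, and the layout of the configuration matrix are illustrative assumptions rather than details of the pxlogistic implementation.

```r
# E-step for one observation with binary covariates: enumerate all
# configurations consistent with the observed entries of x, compute the
# posterior probabilities p_i(a) in (29), and form xhat, Shat, and zhat.
estep_one <- function(x_obs, y, beta, gamma, configs) {
  # configs: K x p matrix of all possible binary covariate configurations,
  #          with rows ordered to match the entries of gamma.
  obs_idx <- which(!is.na(x_obs))
  consistent <- apply(configs, 1, function(a) all(a[obs_idx] == x_obs[obs_idx]))
  eta  <- drop(configs %*% beta)
  lik  <- dbinom(y, size = 1, prob = plogis(eta))   # p(y | x = a, beta)
  w    <- lik * gamma * consistent
  p_a  <- w / sum(w)                                # p_i(a) in (29)
  # E(omega | psi) = tanh(psi / 2) / (2 * psi) for a PG(1, psi) variable;
  # the limit at psi = 0 equals 1/4.
  om   <- ifelse(abs(eta) < 1e-8, 0.25, tanh(eta / 2) / (2 * eta))
  xhat <- drop(t(configs) %*% p_a)
  Shat <- t(configs) %*% (configs * (om * p_a))
  list(xhat = xhat, Shat = Shat, zhat = p_a)
}
```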
One option for a PX-ECME version of this EM algorithm is to define $\boldsymbol{\beta}^{(t+1)} = \alpha^{(t+1)}\,\hat{\boldsymbol{\Omega}}^{-1}\hat{\mathbf{X}}^{T}\boldsymbol{\kappa}$ and $\gamma_k^{(t+1)} = \frac{1}{n}\sum_{i=1}^{n}\hat{z}_{ik}$, where $\alpha^{(t+1)}$ is chosen to maximize the following “observed” log-likelihood
$$\ell_{obs}\big(\alpha\,\hat{\boldsymbol{\Omega}}^{-1}\hat{\mathbf{X}}^{T}\boldsymbol{\kappa},\; \boldsymbol{\gamma}^{(t+1)}\big) = \sum_{i=1}^{n}\log\Big\{ \sum_{\mathbf{a} \in \mathcal{M}(\mathbf{x}_i^{obs})} p\big(y_i \mid \mathbf{x}_i = \mathbf{a}, \alpha\,\hat{\boldsymbol{\Omega}}^{-1}\hat{\mathbf{X}}^{T}\boldsymbol{\kappa}\big)\, p\big(\mathbf{x}_i = \mathbf{a} \mid \boldsymbol{\gamma}^{(t+1)}\big) \Big\}.$$
Appendix B Proof of Proposition 1 and Theorem 1
Proof of Proposition 1. First, note that
where the first inequality follows from the fact that is positive semi-definite and the second inequality follows from the fact that maximizes . It then follows from Jensen’s inequality and the above inequality that
Then, since , we have that
Proof of Theorem 1. First, recall that the PX-ECME mapping $G(\boldsymbol{\beta})$ can be expressed in terms of the EM mapping $M(\boldsymbol{\beta})$ as
$$G(\boldsymbol{\beta}) = h(\boldsymbol{\beta})\, M(\boldsymbol{\beta}),$$
where $h(\boldsymbol{\beta}) = \arg\max_{a > 0}\,\ell\{a\,M(\boldsymbol{\beta})\}$ and $\ell(\cdot)$ denotes the observed-data log-likelihood. Hence, if $G_j(\boldsymbol{\beta})$ denotes the $j$th component of $G(\boldsymbol{\beta})$, then $G_j(\boldsymbol{\beta}) = h(\boldsymbol{\beta})\,M_j(\boldsymbol{\beta})$. This implies that the Jacobian matrix of $G(\boldsymbol{\beta})$ can be expressed as
$$J_{G}(\boldsymbol{\beta}) = M(\boldsymbol{\beta})\,\nabla h(\boldsymbol{\beta})^{T} + h(\boldsymbol{\beta})\, J_{M}(\boldsymbol{\beta}), \qquad (30)$$
where $\nabla h(\boldsymbol{\beta})$ denotes the gradient of $h(\boldsymbol{\beta})$ and $J_{M}(\boldsymbol{\beta})$ is the Jacobian matrix associated with the EM mapping $M(\boldsymbol{\beta})$.
Now, note that $h(\boldsymbol{\beta})$ is defined implicitly by the equation $S\{h(\boldsymbol{\beta}), \boldsymbol{\beta}\} = 0$, where the score function $S(h, \boldsymbol{\beta})$ is defined as
$$S(h, \boldsymbol{\beta}) = \frac{\partial}{\partial h}\,\ell\{h\,M(\boldsymbol{\beta})\} = M(\boldsymbol{\beta})^{T}\mathbf{X}^{T}\big[\mathbf{y} - \boldsymbol{\mu}\{h\,M(\boldsymbol{\beta})\}\big], \qquad (31)$$
where $\boldsymbol{\mu}(\boldsymbol{\theta})$ is the length-$n$ vector whose $i$th element is $\mu_i(\boldsymbol{\theta}) = 1/[1 + \exp\{-\mathbf{x}_i^{T}\boldsymbol{\theta}\}]$. If we differentiate $S(h, \boldsymbol{\beta})$ with respect to $h$, we obtain
$$\frac{\partial S(h, \boldsymbol{\beta})}{\partial h} = -M(\boldsymbol{\beta})^{T}\mathbf{X}^{T}\mathbf{W}\{h\,M(\boldsymbol{\beta})\}\mathbf{X}\,M(\boldsymbol{\beta}),$$
where $\mathbf{W}(\boldsymbol{\theta})$ is the diagonal matrix whose $i$th diagonal element is $\mu_i(\boldsymbol{\theta})\{1 - \mu_i(\boldsymbol{\theta})\}$. Now, if we differentiate $S(h, \boldsymbol{\beta})$ with respect to $\boldsymbol{\beta}$ (keeping $h$ fixed), we obtain the fixed-$h$ gradient
$$\frac{\partial S(h, \boldsymbol{\beta})}{\partial \boldsymbol{\beta}^{T}} = \big[\mathbf{y} - \boldsymbol{\mu}\{h\,M(\boldsymbol{\beta})\}\big]^{T}\mathbf{X}\,J_{M}(\boldsymbol{\beta}) - h\,M(\boldsymbol{\beta})^{T}\mathbf{X}^{T}\mathbf{W}\{h\,M(\boldsymbol{\beta})\}\mathbf{X}\,J_{M}(\boldsymbol{\beta}).$$
By the implicit function theorem for continuously differentiable functions, we can express the gradient $\nabla h(\boldsymbol{\beta})$ as
$$\nabla h(\boldsymbol{\beta})^{T} = -\Big[\frac{\partial S\{h(\boldsymbol{\beta}), \boldsymbol{\beta}\}}{\partial h}\Big]^{-1}\frac{\partial S\{h(\boldsymbol{\beta}), \boldsymbol{\beta}\}}{\partial \boldsymbol{\beta}^{T}}.$$
At the point of convergence $\boldsymbol{\beta}^{*}$, $M(\boldsymbol{\beta}^{*}) = \boldsymbol{\beta}^{*}$ and $h(\boldsymbol{\beta}^{*}) = 1$, and $\mathbf{X}^{T}\{\mathbf{y} - \boldsymbol{\mu}(\boldsymbol{\beta}^{*})\} = \mathbf{0}$, and hence
$$\nabla h(\boldsymbol{\beta}^{*})^{T} = -\frac{\boldsymbol{\beta}^{*T}\mathbf{X}^{T}\mathbf{W}(\boldsymbol{\beta}^{*})\mathbf{X}\,J_{M}(\boldsymbol{\beta}^{*})}{\boldsymbol{\beta}^{*T}\mathbf{X}^{T}\mathbf{W}(\boldsymbol{\beta}^{*})\mathbf{X}\,\boldsymbol{\beta}^{*}}.$$
Finally, returning to (30), we have that
$$J_{G}(\boldsymbol{\beta}^{*}) = \big(\mathbf{I} - \mathbf{P}^{*}\big)\,J_{M}(\boldsymbol{\beta}^{*}),$$
where $\mathbf{P}^{*}$ and $\mathbf{B}^{*}$ are defined as
$$\mathbf{P}^{*} = \frac{\boldsymbol{\beta}^{*}\boldsymbol{\beta}^{*T}\mathbf{B}^{*}}{\boldsymbol{\beta}^{*T}\mathbf{B}^{*}\boldsymbol{\beta}^{*}}, \qquad \mathbf{B}^{*} = \mathbf{X}^{T}\mathbf{W}(\boldsymbol{\beta}^{*})\mathbf{X}.$$
We now turn to the question of comparing the spectral radii of $J_{G}(\boldsymbol{\beta}^{*})$ and $J_{M}(\boldsymbol{\beta}^{*})$. To address this, we first consider the matrices $\mathbf{A}^{*}$ and $\mathbf{B}^{*}$ defined as
$$\mathbf{A}^{*} = \mathbf{X}^{T}\hat{\boldsymbol{\Omega}}(\boldsymbol{\beta}^{*})\mathbf{X}, \qquad \mathbf{B}^{*} = \mathbf{X}^{T}\mathbf{W}(\boldsymbol{\beta}^{*})\mathbf{X},$$
where $\hat{\boldsymbol{\Omega}}(\boldsymbol{\beta}^{*})$ is the diagonal matrix of conditional Pólya-Gamma expectations whose $i$th diagonal element is $\tanh(\mathbf{x}_i^{T}\boldsymbol{\beta}^{*}/2)/(2\,\mathbf{x}_i^{T}\boldsymbol{\beta}^{*})$, so that $J_{M}(\boldsymbol{\beta}^{*}) = \mathbf{I} - (\mathbf{A}^{*})^{-1}\mathbf{B}^{*}$. Because the diagonal elements of $\hat{\boldsymbol{\Omega}}(\boldsymbol{\beta}^{*})$ are greater than or equal to the corresponding diagonal elements of $\mathbf{W}(\boldsymbol{\beta}^{*})$ (i.e., $\tanh(\psi/2)/(2\psi) \geq \mu(\psi)\{1 - \mu(\psi)\}$ for every $\psi$), $\mathbf{A}^{*} - \mathbf{B}^{*}$ is a symmetric positive semi-definite matrix. This implies that $(\mathbf{B}^{*})^{1/2}(\mathbf{A}^{*})^{-1}(\mathbf{B}^{*})^{1/2}$ is a symmetric positive definite matrix whose eigenvalues lie in $(0, 1]$, and hence the eigenvalues of $(\mathbf{B}^{*})^{1/2}J_{M}(\boldsymbol{\beta}^{*})(\mathbf{B}^{*})^{-1/2} = \mathbf{I} - (\mathbf{B}^{*})^{1/2}(\mathbf{A}^{*})^{-1}(\mathbf{B}^{*})^{1/2}$ lie in $[0, 1)$. Because $(\mathbf{B}^{*})^{1/2}\mathbf{P}^{*}(\mathbf{B}^{*})^{-1/2}$ is a symmetric projection matrix with spectral norm at most one, we have
$$\lambda_{G} = \rho\{J_{G}(\boldsymbol{\beta}^{*})\} \leq \big\|\mathbf{I} - (\mathbf{B}^{*})^{1/2}\mathbf{P}^{*}(\mathbf{B}^{*})^{-1/2}\big\|_{2}\,\big\|(\mathbf{B}^{*})^{1/2}J_{M}(\boldsymbol{\beta}^{*})(\mathbf{B}^{*})^{-1/2}\big\|_{2} \leq \lambda_{M},$$
where $\lambda_{G}$ and $\lambda_{M}$ are the largest eigenvalues of $J_{G}(\boldsymbol{\beta}^{*})$ and $J_{M}(\boldsymbol{\beta}^{*})$ respectively.
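As an informal numerical check of this comparison, the R sketch below forms $J_{M}(\boldsymbol{\beta}^{*})$ and $J_{G}(\boldsymbol{\beta}^{*})$ at the fitted maximum likelihood estimate for a small simulated dataset, assuming the unweighted, unpenalized Pólya-Gamma EM map described above; it is intended only to illustrate the inequality $\lambda_G \leq \lambda_M$ and is not code from the pxlogistic package.

```r
# Compare spectral radii of the EM and PX-ECME Jacobians at the MLE, using
# J_M = I - A^{-1} B and J_G = (I - P) J_M, where A = X' Omega_hat X
# (Polya-Gamma weights) and B = X' W X (logistic variance weights).
set.seed(1)
n <- 200; p <- 4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta_true <- c(-0.5, 1, -1, 0.5)
y <- rbinom(n, 1, plogis(drop(X %*% beta_true)))

beta_hat <- coef(glm(y ~ X - 1, family = binomial()))   # MLE beta*
eta <- drop(X %*% beta_hat)
mu  <- plogis(eta)
w_pg <- ifelse(abs(eta) < 1e-8, 0.25, tanh(eta / 2) / (2 * eta))  # E(omega | eta)
A <- crossprod(X, w_pg * X)                 # X' Omega_hat X
B <- crossprod(X, (mu * (1 - mu)) * X)      # X' W X
J_M <- diag(p) - solve(A, B)
P   <- (beta_hat %*% t(beta_hat) %*% B) / drop(t(beta_hat) %*% B %*% beta_hat)
J_G <- (diag(p) - P) %*% J_M
c(rho_EM = max(abs(eigen(J_M)$values)), rho_PX = max(abs(eigen(J_G)$values)))
```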
Appendix C PX-ECME and Coordinate Descent
The terms , , and mentioned in Section 5 are given by
(32)
References
- Böhning & Lindsay (1988) Böhning, D. & Lindsay, B. G. (1988), ‘Monotonicity of quadratic-approximation algorithms’, Annals of the Institute of Statistical Mathematics 40(4), 641–663.
- Brent (2013) Brent, R. P. (2013), Algorithms for minimization without derivatives, Courier Corporation.
- Chambers & Hastie (1992) Chambers, J. M. & Hastie, T. J. (1992), Statistical models in S, Wadsworth and Brooks/Cole, Pacific Grove, CA.
- Choi et al. (2013) Choi, H. M., Hobert, J. P. et al. (2013), ‘The Pólya–Gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic’, Electronic Journal of Statistics 7, 2054–2064.
- Durante & Rigon (2018) Durante, D. & Rigon, T. (2018), ‘A note on quadratic approximations of logistic log-likelihoods’, ArXiv 1711.
- Fan & Li (2001) Fan, J. & Li, R. (2001), ‘Variable selection via nonconcave penalized likelihood and its oracle properties’, Journal of the American Statistical Association 96(456), 1348–1360.
- Friedman et al. (2007) Friedman, J., Hastie, T., Höfling, H. & Tibshirani, R. (2007), ‘Pathwise coordinate optimization’, The Annals of Applied Statistics 1(2), 302–332.
- Friedman et al. (2010) Friedman, J., Hastie, T. & Tibshirani, R. (2010), ‘Regularization paths for generalized linear models via coordinate descent’, Journal of Statistical Software 33(1), 1.
- Green (1984) Green, P. J. (1984), ‘Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives’, Journal of the Royal Statistical Society: Series B 46(2), 149–170.
- Guyon et al. (2004) Guyon, I., Gunn, S., Ben-Hur, A. & Dror, G. (2004), ‘Result analysis of the NIPS 2003 feature selection challenge’, Advances in Neural Information Processing Systems 17.
- Hunter & Lange (2004) Hunter, D. R. & Lange, K. (2004), ‘A tutorial on MM algorithms’, The American Statistician 58(1), 30–37.
- Ibrahim (1990) Ibrahim, J. G. (1990), ‘Incomplete data in generalized linear models’, Journal of the American Statistical Association 85(411), 765–769.
- Ibrahim et al. (1999) Ibrahim, J. G., Chen, M.-H. & Lipsitz, S. R. (1999), ‘Monte Carlo EM for missing covariates in parametric regression models’, Biometrics 55(2), 591–596.
- Jaakkola & Jordan (2000) Jaakkola, T. S. & Jordan, M. I. (2000), ‘Bayesian parameter estimation via variational methods’, Statistics and Computing 10, 25–37.
- Lange (2012) Lange, K. (2012), Numerical Analysis for Statisticians, 2nd edn, Springer Publishing Company, Incorporated.
- Lewandowski et al. (2010) Lewandowski, A., Liu, C. & Wiel, S. V. (2010), ‘Parameter expansion and efficient inference’, Statistical Science pp. 533–544.
- Liu (1997) Liu, C. (1997), ‘ML estimation of the multivariate t distribution and the EM algorithm’, Journal of Multivariate Analysis 63(2), 296–312.
- Liu & Rubin (1994) Liu, C. & Rubin, D. B. (1994), ‘The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence’, Biometrika 81(4), 633–648.
- Liu et al. (1998) Liu, C., Rubin, D. B. & Wu, Y. N. (1998), ‘Parameter expansion to accelerate EM: the PX-EM algorithm’, Biometrika 85(4), 755–770.
- Marschner (2011) Marschner, I. C. (2011), ‘glm2: Fitting generalized linear models with convergence problems’, The R Journal 3(2), 12–15.
- McLachlan & Krishnan (2007) McLachlan, G. J. & Krishnan, T. (2007), The EM algorithm and extensions, Vol. 382, John Wiley & Sons.
- Meng & Rubin (1993) Meng, X. L. & Rubin, D. B. (1993), ‘Maximum likelihood estimation via the ECM algorithm: a general framework’, Biometrika 80(2), 267–278.
- Parikh & Boyd (2014) Parikh, N. & Boyd, S. (2014), ‘Proximal algorithms’, Foundations and Trends in Optimization 1(3), 127–239.
- Polson et al. (2013) Polson, N. G., Scott, J. G. & Windle, J. (2013), ‘Bayesian inference for logistic models using Pólya–Gamma latent variables’, Journal of the American Statistical Association 108(504), 1339–1349.
- Scott & Sun (2013) Scott, J. G. & Sun, L. (2013), ‘Expectation-maximization for logistic regression’, arXiv preprint arXiv:1306.0040.
- Van Dyk (2000) Van Dyk, D. A. (2000), ‘Fitting mixed-effects models using efficient EM-type algorithms’, Journal of Computational and Graphical Statistics 9(1), 78–98.
- Varadhan & Roland (2008) Varadhan, R. & Roland, C. (2008), ‘Simple and globally convergent methods for accelerating the convergence of any EM algorithm’, Scandinavian Journal of Statistics 35(2), 335–353.
- Walker & Ni (2011) Walker, H. F. & Ni, P. (2011), ‘Anderson acceleration for fixed-point iterations’, SIAM Journal on Numerical Analysis 49(4), 1715–1735.
- Zhou et al. (2011) Zhou, H., Alexander, D. & Lange, K. (2011), ‘A quasi-Newton acceleration for high-dimensional optimization algorithms’, Statistics and Computing 21(2), 261–273.