
Parameter-Expanded ECME Algorithms for
Logistic and Penalized Logistic Regression

Nicholas C. Henderson, Department of Biostatistics, University of Michigan, Ann Arbor
Zhongzhe Ouyang, Department of Biostatistics, University of Michigan, Ann Arbor
Abstract

Parameter estimation in logistic regression is a well-studied problem with the Newton-Raphson method being one of the most prominent optimization techniques used in practice. A number of monotone optimization methods including minorization-maximization (MM) algorithms, expectation-maximization (EM) algorithms and related variational Bayes approaches offer a family of useful alternatives guaranteed to increase the logistic regression likelihood at every iteration. In this article, we propose a modified version of a logistic regression EM algorithm which can substantially improve computational efficiency while preserving the monotonicity of EM and the simplicity of the EM parameter updates. By introducing an additional latent parameter and selecting this parameter to maximize the penalized observed-data log-likelihood at every iteration, our iterative algorithm can be interpreted as a parameter-expanded expectation-conditional maximization either (ECME) algorithm, and we demonstrate how to use the parameter-expanded ECME algorithm with an arbitrary choice of weights and penalty function. In addition, we describe a generalized version of our parameter-expanded ECME algorithm that can be tailored to the challenges encountered in specific high-dimensional problems, and we study several interesting connections between this generalized algorithm and other well-known methods. Performance comparisons between our method, the EM algorithm, and several other optimization methods are presented using a series of simulation studies based upon both real and synthetic datasets.


Keywords: convergence acceleration; EM algorithm; MM algorithm; Pólya-Gamma augmentation; parameter expansion; weighted maximum likelihood

1 Introduction

Logistic regression is one of the most well-known statistical methods for modeling the relationship between a binomial outcome and a vector of covariates. Popular iterative methods for optimizing the logistic regression objective function include the Newton-Raphson/Fisher scoring method and gradient descent methods. While typically converging very rapidly, Newton-Raphson is not guaranteed to converge without additional safeguards such as step-halving (Marschner (2011), Lange (2012)), and its behavior can be very unstable for many starting values. Because of this, more stable, monotone algorithms, which are guaranteed to improve the value of the objective function at each iteration, can provide a useful alternative. While gradient descent and proximal gradient descent methods have seen a recent resurgence due to their scalability in problems with a large number of parameters, the convergence of gradient descent can be extremely sluggish, particularly if one does not use an adaptive steplength-finding procedure.

In this article, we propose and explore a new monotone iterative method for parameter estimation in logistic regression. Our method builds upon an EM algorithm derived from the Pólya-Gamma latent variable representation of the distribution of a binomial random variable with a logistic link function (Polson et al., 2013; Scott & Sun, 2013). Specifically, we use a parameter-expanded version of the Pólya-Gamma representation to build a parameter-expanded expectation-conditional maximization either (PX-ECME) algorithm (Lewandowski et al., 2010) for maximizing the weighted log-likelihood of interest. Our approach combines the parameter-expanded EM framework (Liu et al., 1998) and conditional maximization versions of the EM algorithm described in, for example, Meng & Rubin (1993) and Liu & Rubin (1994). More specifically, we perform two conditional maximization steps where the first step performs an “M-step” with a fixed value of a latent scale parameter and the second step finds the value of the latent scale parameter maximizing the observed-data log-likelihood. Combining these steps leads to a direct parameter-updating scheme that is essentially a scalar multiple of the EM parameter update in each iteration.

While Newton-Raphson and gradient descent procedures are highly effective and will continue to be widely used computational techniques for logistic regression, EM-type algorithms for logistic regression and accelerated versions thereof have a number of distinct advantages, and hence, these algorithms merit further study and development. First, a major benefit of EM-type algorithms is that each iteration increases the likelihood which results in a very stable, directly implementable algorithm that does not require one to implement additional safeguards or step length finding procedures. Another main advantage of EM-type procedures is that they often work well in more complex models where one has additional latent variables within a larger, overall probability model. For example, a common use of the EM algorithm is in settings where one has missing covariate values (e.g., Ibrahim (1990); Ibrahim et al. (1999)). In such cases, one can often obtain a straightforward EM update of the regression parameters of interest whereas the Newton-Raphson method or other procedures can have very different forms with potentially unknown performance.

While EM algorithms have several notable advantages, slow convergence of EM is a common issue that often hampers their more widespread adoption. Fortunately, there are numerous techniques which can accelerate the convergence of EM while maintaining both the monotonicity property of EM and the simplicity of the original EM parameter updates. One useful acceleration technique is based on the parameter-expansion framework where one embeds the probability model of interest within a larger probability model that is identifiable from the complete data (Liu et al. (1998)). The resulting EM algorithm for the larger probability model is referred to as a parameter-expanded EM (PX-EM) algorithm, and in certain contexts, PX-EM algorithms can be developed so that both the parameter updates of interest have a direct, closed form and the overall convergence speed is substantially improved. Another variation of the EM algorithm which can aid more efficient computation is the expectation-conditional maximization (ECM) algorithm (Meng & Rubin (1993)). While not necessarily improving the rate of convergence, the ECM algorithm is a technique for simplifying the form of the parameter updates by breaking the required maximization in the M-step into a series of conditional maximization steps. A useful modification of ECM which can accelerate convergence is the expectation/conditional maximization either (ECME) algorithm (Liu & Rubin (1994)) which replaces one of the conditional maximization steps with a “full” maximization step where one performs maximization with respect to the observed-data likelihood. For certain statistical models, parameter-expanded versions of ECME can be derived and these algorithms can be termed PX-ECME algorithms (Lewandowski et al. (2010)). PX-ECME algorithms have been used successfully in the context of parameter estimation for multivariate t-distributions (Liu (1997)) and in fitting mixed effects models (Van Dyk (2000)). While not usually converging as quickly as a pure PX-EM algorithm, PX-ECME algorithms will, as argued in Lewandowski et al. (2010), typically converge faster than the EM algorithm, and hence, can be effective in cases when finding the full PX-EM parameter update is cumbersome or infeasible.

The purpose of the current work is to explore the relative performance of a PX-ECME algorithm versus the associated EM algorithm for penalized weighted logistic regression with user-specified weights, and moreover, to show its connections with and relative performance compared to several other commonly used optimization methods. We show that PX-ECME is indeed consistently faster than the associated EM algorithm and that it can be somewhat competitive with the speed of Newton-Raphson in certain scenarios. In addition, we provide examples where Newton-Raphson often diverges whereas the PX-ECME algorithm provides quick, yet stable convergence to the maximizer of the weighted log-likelihood.

The organization of this article is as follows. Section 2 reviews the Pólya-Gamma representation of a binomially distributed random variable with logistic link function and describes the associated EM algorithm for weighted logistic regression. Next, Section 3 reviews the construction of parameter-expanded EM algorithms and related PX-ECME algorithms. Then, Section 3 outlines a parameter-expanded version of the Pólya-Gamma model and uses this parameter-expanded model to derive a PX-ECME algorithm for weighted logistic regression. Section 3 also includes results on the rates of convergence of the EM and PX-ECME algorithms. Section 4 outlines a more general PX-ECME algorithm and describes its connections to several other well-known procedures including proximal gradient descent (Parikh & Boyd, 2014) and an MM algorithm for logistic regression (Böhning & Lindsay, 1988). Section 5 briefly describes a PX-ECME-type coordinate descent algorithm for logistic regression. Through a series of simulation studies, Section 6 investigates the performance of our PX-ECME algorithms and compares them with several other procedures, and we then conclude with a brief discussion in Section 7.

2 Review of Pólya-Gamma Data Augmentation for Logistic Regression

Suppose we have $n$ responses $\mathbf{y}=(y_{1},\ldots,y_{n})$ where each $y_{i}$ is an integer satisfying $0\leq y_{i}\leq m_{i}$, and for each $y_{i}$, we also have an associated covariate vector $\mathbf{x}_{i}\in\mathbb{R}^{p}$. A logistic regression model assumes that the $y_{i}$ are independent and that the distribution of each $y_{i}$ is given by

y_{i}\sim\textrm{Binomial}\Big(m_{i},\{1+\exp(-\mathbf{x}_{i}^{T}\boldsymbol{\beta})\}^{-1}\Big), \qquad (1)

where $\boldsymbol{\beta}=(\beta_{1},\ldots,\beta_{p})^{T}$ is the vector of regression coefficients. It is common in practice to estimate $\boldsymbol{\beta}$ by maximizing the observed-data log-likelihood $\ell_{o}(\boldsymbol{\beta}|\mathbf{y})$ or a weighted observed-data log-likelihood function $\ell_{o,\mathbf{s}}(\boldsymbol{\beta}|\mathbf{y})$. Under the assumed distribution (1) and a nonnegative vector of weights $\mathbf{s}=(s_{1},\ldots,s_{n})$, the weighted observed-data log-likelihood function is given by

\ell_{o,\mathbf{s}}(\boldsymbol{\beta}|\mathbf{y})=\sum_{i=1}^{n}s_{i}\log\binom{m_{i}}{y_{i}}+\sum_{i=1}^{n}s_{i}y_{i}\mathbf{x}_{i}^{T}\boldsymbol{\beta}-\sum_{i=1}^{n}s_{i}m_{i}\log\Big(1+\exp\{\mathbf{x}_{i}^{T}\boldsymbol{\beta}\}\Big). \qquad (2)
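For concreteness, a minimal illustrative sketch of (2) in Python is given below; the function and variable names are ours and not from any package, and the $\log(1+\exp)$ term is evaluated in a numerically stable way.

```python
import numpy as np
from scipy.special import gammaln

def weighted_loglik(beta, X, y, m, s):
    """Weighted observed-data log-likelihood (2) for binomial logistic regression.
    X: (n, p) design matrix; y: (n,) success counts; m: (n,) trial counts;
    s: (n,) nonnegative observation weights."""
    eta = X @ beta                          # linear predictors x_i^T beta
    log_binom = gammaln(m + 1) - gammaln(y + 1) - gammaln(m - y + 1)
    # log(1 + exp(eta)) evaluated stably via logaddexp
    return np.sum(s * (log_binom + y * eta - m * np.logaddexp(0.0, eta)))
```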

As outlined in Scott & Sun (2013), an EM algorithm for maximizing $\ell_{o,\mathbf{s}}(\boldsymbol{\beta}|\mathbf{y})$ can be constructed by exploiting the Pólya-Gamma latent variable representation of the distribution of $y_{i}$ described in Polson et al. (2013). A random variable $W$ is said to follow a Pólya-Gamma distribution with parameters $b>0$ and $c$ (denoted by $W\sim PG(b,c)$) if $W$ has the same distribution as an infinite convolution of independent Gamma-distributed random variables. Specifically,

W\stackrel{D}{=}\frac{1}{2\pi^{2}}\sum_{k=1}^{\infty}\frac{G_{k}}{(k-1/2)^{2}+c^{2}/(4\pi^{2})},

where $G_{k}\sim\textrm{Gamma}(b,1)$ and $\stackrel{D}{=}$ denotes equality in distribution. The relevance of the Pólya-Gamma family of distributions to logistic regression follows from the following key identity

\frac{2^{m_{i}}\exp(m_{i}\mathbf{x}_{i}^{T}\boldsymbol{\beta}/2)}{\{1+\exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta})\}^{m_{i}}}=\int_{0}^{\infty}\exp\big\{-w(\mathbf{x}_{i}^{T}\boldsymbol{\beta})^{2}/2\big\}p_{PG}(w;m_{i},0)\,dw,

where $p_{PG}(w;b,c)$ denotes the density of a $PG(b,c)$ random variable, and from the fact that the density of a $\textrm{PG}(m_{i},\mathbf{x}_{i}^{T}\boldsymbol{\beta})$ random variable is given by

p_{PG}(w;m_{i},\mathbf{x}_{i}^{T}\boldsymbol{\beta})=2^{-m_{i}}\exp(-m_{i}\mathbf{x}_{i}^{T}\boldsymbol{\beta}/2)\{1+\exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta})\}^{m_{i}}\exp\{-w(\mathbf{x}_{i}^{T}\boldsymbol{\beta})^{2}/2\}p_{PG}(w;m_{i},0).

Consequently, if one assumes that the observed responses $y_{1},\ldots,y_{n}$ and latent random variables $W_{1},\ldots,W_{n}$ arise from the following model

y_{i}|W_{i} \sim \textrm{Binomial}\Big(m_{i},\{1+\exp(-\mathbf{x}_{i}^{T}\boldsymbol{\beta})\}^{-1}\Big)
W_{i} \sim PG(m_{i},\mathbf{x}_{i}^{T}\boldsymbol{\beta}), \qquad (3)

then each $y_{i}$ has the correct marginal distribution, and the (weighted) complete-data log-likelihood $\ell_{c,\mathbf{s}}(\boldsymbol{\beta}|\mathbf{v})$ corresponding to the complete data $\mathbf{v}=\{(y_{1},W_{1}),\ldots,(y_{n},W_{n})\}$ is

\ell_{c,\mathbf{s}}(\boldsymbol{\beta}|\mathbf{v}) = \sum_{i=1}^{n}s_{i}\log\binom{m_{i}}{y_{i}}+\sum_{i=1}^{n}s_{i}\log\Big(\frac{\exp(y_{i}\mathbf{x}_{i}^{T}\boldsymbol{\beta})}{\{1+\exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta})\}^{m_{i}}}\Big)+\sum_{i=1}^{n}s_{i}\log\Big(\exp(-m_{i}\mathbf{x}_{i}^{T}\boldsymbol{\beta}/2)\Big)
\qquad + \sum_{i=1}^{n}s_{i}m_{i}\log\Big(1+\exp(\mathbf{x}_{i}^{T}\boldsymbol{\beta})\Big)-\frac{1}{2}\sum_{i=1}^{n}s_{i}W_{i}(\mathbf{x}_{i}^{T}\boldsymbol{\beta})^{2}+\sum_{i=1}^{n}s_{i}\log\{p_{PG}(W_{i};m_{i},0)/2^{m_{i}}\}
= \sum_{i=1}^{n}s_{i}u_{i}\mathbf{x}_{i}^{T}\boldsymbol{\beta}-\frac{1}{2}\sum_{i=1}^{n}s_{i}W_{i}(\mathbf{x}_{i}^{T}\boldsymbol{\beta})^{2}+\sum_{i=1}^{n}s_{i}\Big(\log\binom{m_{i}}{y_{i}}+\log\big\{p_{PG}(W_{i};m_{i},0)/2^{m_{i}}\big\}\Big), \qquad (4)

where $u_{i}=y_{i}-m_{i}/2$. In other words, the complete-data log-likelihood is a quadratic form in the vector of regression coefficients $\boldsymbol{\beta}$. As noted in Scott & Sun (2013), this is very useful for constructing an EM algorithm as the M-step reduces to solving a weighted least-squares problem. The form of the complete-data log-likelihood is also useful in the context of Bayesian logistic regression (e.g., Polson et al. (2013), Choi et al. (2013)) as the conditional distribution of $\boldsymbol{\beta}$ given the latent variables $W_{1},\ldots,W_{n}$ is multivariate Gaussian.

To utilize the latent variable representation (3) to construct an EM algorithm for estimating $\boldsymbol{\beta}$, one needs to find the expectation of $W_{i}$ given $y_{i}$ and current values of the regression coefficients. Curiously, the distribution of $y_{i}$ does not depend on $W_{i}$ in (3), and hence, the expectations in the "E-step" do not depend on the observed data. That is, $E\{W_{i}|y_{i},\boldsymbol{\beta}^{(t)}\}=E\{W_{i}|\boldsymbol{\beta}^{(t)}\}$ for $i=1,\ldots,n$, and as shown in Polson et al. (2013), this expectation is given by

\omega(\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)},m_{i})=E\{W_{i}|\boldsymbol{\beta}^{(t)}\}=\frac{m_{i}}{2\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)}}\tanh\Big(\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)}/2\Big)=\frac{m_{i}}{\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)}}\Big(\frac{1}{1+\exp(-\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)})}-\frac{1}{2}\Big),

with the understanding that, whenever $\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)}=0$, $\omega(\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)},m_{i})$ is set to $\lim_{y\to 0}\tfrac{m_{i}}{2y}\tanh(\tfrac{y}{2})=m_{i}/4$. It is useful to note the following connection between the $\omega(\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)},m_{i})$ and the mean function $E(\mathbf{y})=\mu(\mathbf{X}\boldsymbol{\beta})$:

\mathbf{W}(\boldsymbol{\beta})\mathbf{X}\boldsymbol{\beta}=\mu(\mathbf{X}\boldsymbol{\beta})-\tfrac{1}{2}\mathbf{m},\quad\textrm{ where }\quad\mathbf{W}(\boldsymbol{\beta})=\textrm{diag}\big\{\omega(\mathbf{x}_{1}^{T}\boldsymbol{\beta},m_{1}),\ldots,\omega(\mathbf{x}_{n}^{T}\boldsymbol{\beta},m_{n})\big\}. \qquad (5)

In (5), $\mu(\mathbf{X}\boldsymbol{\beta})=E(\mathbf{y})$ is the $n\times 1$ vector whose $i^{th}$ component is $[\mu(\mathbf{X}\boldsymbol{\beta})]_{i}=m_{i}/\{1+\exp(-\mathbf{x}_{i}^{T}\boldsymbol{\beta})\}$, and $\mathbf{m}$ is the $n\times 1$ vector $\mathbf{m}=(m_{1},\ldots,m_{n})^{T}$.
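For illustration, the weight function $\omega$ and the identity (5) can be coded as follows. This is a minimal sketch with names of our own choosing (pg_weight is not from any package), and the limiting value $m_{i}/4$ is substituted whenever $\mathbf{x}_{i}^{T}\boldsymbol{\beta}=0$.

```python
import numpy as np

def pg_weight(eta, m):
    """EM weight omega(eta, m) = m * tanh(eta / 2) / (2 * eta),
    with the limiting value m / 4 used when eta = 0."""
    eta, m = np.asarray(eta, float), np.asarray(m, float)
    w = np.empty_like(eta)
    nz = eta != 0.0
    w[nz] = m[nz] * np.tanh(eta[nz] / 2.0) / (2.0 * eta[nz])
    w[~nz] = m[~nz] / 4.0
    return w

# quick numerical check of identity (5): W(beta) X beta = mu(X beta) - m/2
rng = np.random.default_rng(0)
X, beta, m = rng.normal(size=(5, 3)), rng.normal(size=3), np.full(5, 2.0)
eta = X @ beta
assert np.allclose(pg_weight(eta, m) * eta, m / (1 + np.exp(-eta)) - m / 2)
```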

Parameter updates for the EM algorithm based on the latent variable model (3) are found by maximizing the "Q-function" $Q(\boldsymbol{\beta}|\boldsymbol{\beta}^{(t)})$, which is defined as the expectation of $\ell_{c,\mathbf{s}}(\boldsymbol{\beta}|\mathbf{v})$ given $\mathbf{y}$ and the current vector of parameter estimates $\boldsymbol{\beta}^{(t)}$. From (4), the Q-function is found by replacing $W_{i}$ with $\omega(\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)},m_{i})$, which allows the Q-function to be written as

Q(\boldsymbol{\beta}|\boldsymbol{\beta}^{(t)})=E\{\ell_{c,\mathbf{s}}(\boldsymbol{\beta}|\mathbf{v})|\mathbf{y},\boldsymbol{\beta}^{(t)}\}=-\frac{1}{2}\boldsymbol{\beta}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\beta}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{u}+C,

where $\mathbf{u}=(u_{1},\ldots,u_{n})^{T}$, $\mathbf{W}(\boldsymbol{\beta}^{(t)})$ is as defined in (5), $\mathbf{S}=\textrm{diag}\{s_{1},\ldots,s_{n}\}$, and $C$ is a constant not depending on $\boldsymbol{\beta}$. The EM update $\boldsymbol{\beta}^{(t+1)}$ of $\boldsymbol{\beta}^{(t)}$ is found by maximizing $Q(\boldsymbol{\beta}|\boldsymbol{\beta}^{(t)})$ with respect to $\boldsymbol{\beta}$, and hence we can represent $\boldsymbol{\beta}^{(t+1)}$ as the solution of the following weighted least-squares problem

\boldsymbol{\beta}^{(t+1)} = \operatorname*{arg\,min}_{\boldsymbol{\beta}\in\mathbb{R}^{p}}\frac{1}{2}\Big(\mathbf{W}^{-1}(\boldsymbol{\beta}^{(t)})\mathbf{u}-\mathbf{X}\boldsymbol{\beta}\Big)^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\Big(\mathbf{W}^{-1}(\boldsymbol{\beta}^{(t)})\mathbf{u}-\mathbf{X}\boldsymbol{\beta}\Big)
= (\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{S}\mathbf{u}. \qquad (6)

It is interesting to note that the EM algorithm parameter update (6) is identical to the parameter update defined in Jaakkola & Jordan (2000) which was based on variational Bayes arguments.

Based on (6), this EM algorithm for logistic regression can be viewed as an "iteratively reweighted least squares" (IRLS) algorithm with weights $s_{1}\omega(\mathbf{x}_{1}^{T}\boldsymbol{\beta}^{(t)},m_{1}),\ldots,s_{n}\omega(\mathbf{x}_{n}^{T}\boldsymbol{\beta}^{(t)},m_{n})$ and "response vector" $\mathbf{W}^{-1}(\boldsymbol{\beta}^{(t)})\mathbf{u}$. In this sense, this EM algorithm closely resembles the common Newton-Raphson/Fisher scoring algorithm (Green (1984)) used for logistic regression where one performs iteratively reweighted least squares using the weights $s_{i}\omega^{NR}(\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)},m_{i})$, where $\omega^{NR}(\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)},m_{i})=m_{i}\exp(-\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)})/[\{1+\exp(-\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)})\}^{2}]$. Note that, in the case of logistic regression, Newton-Raphson and Fisher scoring are equivalent methods, and we will refer to this procedure as Newton-Raphson in the remainder of the article.

To better see the resemblance between the EM and Newton-Raphson updates, it is interesting to note that we can also express the EM update (6) of $\boldsymbol{\beta}^{(t)}$ as

\boldsymbol{\beta}^{(t+1)}=\boldsymbol{\beta}^{(t)}+(\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{S}\{\mathbf{y}-\mu(\mathbf{X}\boldsymbol{\beta}^{(t)})\}.

The above expression for the EM update is a consequence of the equality $\mathbf{W}(\boldsymbol{\beta})\mathbf{X}\boldsymbol{\beta}=\mu(\mathbf{X}\boldsymbol{\beta})-\tfrac{1}{2}\mathbf{m}$ stated in (5). The main difference between the EM weight function $\omega(\mathbf{x}^{T}\boldsymbol{\beta},m)$ and the Newton-Raphson weight function $\omega^{NR}(\mathbf{x}^{T}\boldsymbol{\beta},m)$ is that, for a fixed value of $m$, $\omega(\mathbf{x}^{T}\boldsymbol{\beta},m)$ has much heavier tails than $\omega^{NR}(\mathbf{x}^{T}\boldsymbol{\beta},m)$. This is illustrated in Figure 1, which plots $\omega(\mathbf{x}^{T}\boldsymbol{\beta},m)$ and $\omega^{NR}(\mathbf{x}^{T}\boldsymbol{\beta},m)$ when $m=1$.
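A single EM/IRLS update in this gradient-like form can be sketched as follows (illustrative code with our own naming; a linear solve is used rather than forming the matrix inverse explicitly).

```python
import numpy as np

def em_update(beta, X, y, m, s):
    """One EM update (6), written as beta + (X'SWX)^{-1} X'S(y - mu)."""
    eta = X @ beta
    w = np.empty_like(eta)
    nz = eta != 0.0
    w[nz] = m[nz] * np.tanh(eta[nz] / 2.0) / (2.0 * eta[nz])
    w[~nz] = m[~nz] / 4.0                   # limit of the EM weight at eta = 0
    mu = m / (1.0 + np.exp(-eta))           # E(y) at the current beta
    A = X.T @ ((s * w)[:, None] * X)        # X' S W(beta) X
    return beta + np.linalg.solve(A, X.T @ (s * (y - mu)))
```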

Figure 1: (Color Figure) Weight functions $\omega(\mathbf{x}^{T}\boldsymbol{\beta},m)$ and $\omega^{NR}(\mathbf{x}^{T}\boldsymbol{\beta},m)$ used in the EM and Newton-Raphson algorithms, respectively, for $m=1$. This figure shows the considerably heavier tails of the EM algorithm weight function.

A main advantage of the EM algorithm over the Newton-Raphson algorithm is that the EM algorithm is monotone; namely, iterates from the EM algorithm are guaranteed to increase the weighted log-likelihood at every iteration, and the iterates are guaranteed to converge to a fixed point of the algorithm. It is interesting to note that the first iterations of EM and Newton-Raphson are identical whenever all components of the initial iterate $\boldsymbol{\beta}^{(0)}$ are set to zero. This is a consequence of the fact that $\omega(0,m_{i})=\omega^{NR}(0,m_{i})=m_{i}/4$. While the Newton-Raphson algorithm is not monotone and does not possess any global convergence guarantees, the fact that the first step of Newton-Raphson increases the value of the log-likelihood when setting $\boldsymbol{\beta}^{(0)}=\mathbf{0}$ is one reason for the relatively robust performance of Newton-Raphson for logistic regression. The monotonicity of the Newton-Raphson first step when setting $\boldsymbol{\beta}^{(0)}=\mathbf{0}$ was also noted in Böhning & Lindsay (1988).

3 A Parameter-Expanded ECME Algorithm for Logistic Regression

3.1 Review of Parameter-Expanded EM and ECME Algorithms

In a variety of estimation problems, the convergence speed of the EM algorithm can be improved by viewing the complete-data model as embedded within a larger model that has additional parameters and then performing parameter updates with respect to the larger model. An EM algorithm constructed from a larger, parameter-expanded model is typically referred to as a parameter-expanded EM (PX-EM) algorithm (Liu et al. (1998)).

To describe the steps involved in the PX-EM algorithm, we consider a general setup where one is interested in estimating a parameter vector $\boldsymbol{\beta}\in\mathcal{B}$ and one has defined a complete-data vector $\mathbf{v}$ with density function $g_{c}(\mathbf{v}|\boldsymbol{\beta})$. The complete-data vector $\mathbf{v}\in\mathcal{V}$ and observed-data vector $\mathbf{y}\in\mathcal{Y}$ are connected through $\mathbf{y}=h(\mathbf{v})$, where $h:\mathcal{V}\longrightarrow\mathcal{Y}$ is the data-reducing mapping to be used for developing the EM algorithm. As described in Liu et al. (1998), to generate a PX-EM algorithm one must first consider an expanded probability model for the complete-data vector $\mathbf{v}$, and we let $g_{PX}(\mathbf{v}|\boldsymbol{\theta},\alpha)$ denote the probability density function for this expanded probability model, where $(\boldsymbol{\theta},\alpha)$ are the parameters of the expanded model and $\Theta$ denotes the parameter space for $(\boldsymbol{\theta},\alpha)$. As outlined in Liu et al. (1998) and Lewandowski et al. (2010), the expanded probability model should satisfy the following conditions in order to generate a well-defined PX-EM algorithm:

  1. The observed-data model is preserved in the sense that there is a many-to-one reduction function $R:\Theta\longrightarrow\mathcal{B}$ such that

     g_{O}\big\{\mathbf{y}|\boldsymbol{\beta}=R(\boldsymbol{\theta},\alpha)\big\}=\int_{\mathcal{V}(\mathbf{y})}g_{PX}(\mathbf{v}|\boldsymbol{\theta},\alpha)\,d\mathbf{v},\quad\textrm{ for all }(\boldsymbol{\theta},\alpha)\in\Theta,

     where $\mathcal{V}(\mathbf{y})=\{\mathbf{v}\in\mathcal{V}:h(\mathbf{v})=\mathbf{y}\}$.

  2. There is a "null value" $\alpha_{0}$ of $\alpha$ such that the complete-data model is preserved at the null value in the sense that

     g_{PX}(\mathbf{v}|\boldsymbol{\beta},\alpha_{0})=g_{c}(\mathbf{v}|\boldsymbol{\beta}),\qquad\textrm{for all }\boldsymbol{\beta}\in\mathcal{B}.

Similar to the EM algorithm, the PX-EM algorithm consists of two steps: a "PX-E" step followed by a "PX-M" step. The PX-E step is similar to the usual "E-step" of an EM algorithm in that a "Q-function" is formed by taking the expectation of the (parameter-expanded) complete-data log-likelihood conditional on the observed data and parameter values $(\boldsymbol{\theta},\alpha)=(\boldsymbol{\beta}^{(t)},\alpha_{0})$. The "PX-M" step then proceeds by finding the expanded parameter values $(\boldsymbol{\theta}^{(t+1)},\alpha^{(t+1)})$ which maximize this Q-function and computing the parameter update $\boldsymbol{\beta}^{(t+1)}$ by applying the reduction function to $(\boldsymbol{\theta}^{(t+1)},\alpha^{(t+1)})$. The two steps of the PX-EM algorithm can be summarized as:

  1. PX-E step: Compute the parameter-expanded Q-function

     Q_{PX}(\boldsymbol{\theta},\alpha|\boldsymbol{\beta}^{(t)},\alpha_{0})=E\Big\{\log g_{PX}(\mathbf{v}|\boldsymbol{\theta},\alpha)\Big|\mathbf{y},\boldsymbol{\theta}=\boldsymbol{\beta}^{(t)},\alpha=\alpha_{0}\Big\}. \qquad (7)

  2. PX-M step: Find

     (\boldsymbol{\theta}^{(t+1)},\alpha^{(t+1)})=\operatorname*{arg\,max}_{\boldsymbol{\theta},\alpha}Q_{PX}(\boldsymbol{\theta},\alpha|\boldsymbol{\beta}^{(t)},\alpha_{0}),

     and update $\boldsymbol{\beta}^{(t)}$ using

     \boldsymbol{\beta}^{(t+1)}=R(\boldsymbol{\theta}^{(t+1)},\alpha^{(t+1)}).

A chief advantage of the PX-EM algorithm is that it is guaranteed to converge at least as fast as the corresponding EM algorithm. When the parameter-expanded model is constructed so that all parameter updates have a direct, closed-form, the PX-EM algorithm can have substantially faster convergence speed than the EM algorithm while maintaining the stability and computational simplicity of the original EM algorithm.

The Expectation-Conditional Maximization (ECM) algorithm (Meng & Rubin (1993)) and the Expectation-Conditional Maximization Either (ECME) algorithm (Liu & Rubin (1994)) are variations of the EM algorithm where the M-step is replaced by a sequence of conditional maximization steps. In the ECM algorithm, one breaks up the maximization of the Q-function into several conditional maximization steps where, in each conditional maximization step, one only maximizes the Q-function with respect to a subset of the parameters while keeping the remaining parameter values fixed. In the ECME algorithm, one also performs a series of conditional maximization steps. However, in the ECME algorithm, some of the conditional maximization steps are performed with respect to the observed-data log-likelihood function while the remaining conditional maximization steps are performed with respect to the usual Q-function.

As in the original ECME algorithm, in a parameter-expanded version of ECME (PX-ECME), one performs a sequence of conditional maximization steps for the expanded parameters $(\boldsymbol{\theta},\alpha)$, with some of the conditional maximization steps being performed with respect to the parameter-expanded Q-function while the other conditional maximization steps are performed with respect to the observed-data log-likelihood function. Specifically, when using the observed-data log-likelihood function for conditional maximization, one will maximize $\ell_{o,\mathbf{s}}\{R(\boldsymbol{\theta},\alpha)|\mathbf{y}\}$ with respect to the subset of parameters being maximized in that conditional maximization step. After all of the expanded parameters have been updated through the sequence of conditional maximization steps, the update for the parameter of interest $\boldsymbol{\beta}$ is found by applying the reduction function $R(\cdot)$ to the updated expanded parameters.

One example of a PX-ECME algorithm is the procedure described in Liu (1997) for estimating the parameters of a multivariate-t distribution. Here, the EM algorithm exploits the representation of the multivariate-t distribution as a scale mixture of multivariate normal distributions, and the parameter-expanded model utilized by Liu (1997) includes an expanded scale parameter that plays a role in both the covariance matrix of the normal distribution and the scale of the latent Gamma distribution. This expanded parameter is updated in each iteration through a conditional maximization step maximizing the observed-data likelihood.

One point worth noting is that a PX-ECME algorithm can often be constructed even if the expanded parameter vector $(\boldsymbol{\theta},\alpha)$ is not identifiable from the complete data. A PX-ECME algorithm can work if the subsets of parameters used in each conditional maximization step are conditionally identifiable from the observed data. Specifically, the parameters in each of the subsets should be identifiable from the observed data assuming that all the other parameters are fixed. For example, if we construct our PX-ECME algorithm so that $\boldsymbol{\theta}$ is updated first followed by an update of $\alpha$, then $\boldsymbol{\theta}$ should be identifiable for a fixed value of $\alpha$, and $\alpha$ should be identifiable from the observed data for a fixed value of $\boldsymbol{\theta}$.

3.2 A Parameter-Expanded ECME Algorithm for Logistic and Penalized Logistic Regression

We consider the following parameter-expanded version of model (3) for the complete-data vector $\mathbf{v}=\{(y_{1},W_{1}),\ldots,(y_{n},W_{n})\}$:

y_{i}|W_{i} \sim \textrm{Binomial}\Big(m_{i},\{1+\exp(-\alpha\mathbf{x}_{i}^{T}\boldsymbol{\theta})\}^{-1}\Big)
W_{i} \sim PG(m_{i},\alpha\mathbf{x}_{i}^{T}\boldsymbol{\theta}), \qquad (8)

where the parameters in this model are $\boldsymbol{\theta}\in\mathbb{R}^{p}$ and $\alpha\in\mathbb{R}$. Under this expanded model, the observed-data model is preserved when using the reduction function $\boldsymbol{\beta}=R(\boldsymbol{\theta},\alpha)=\alpha\boldsymbol{\theta}$, and the original complete-data model (3) is preserved whenever $\alpha=\alpha_{0}=1$.

We now turn to describing how to use the parameter-expanded model (8) to construct a PX-ECME algorithm for maximizing a penalized weighted log-likelihood function with penalty function $P_{\boldsymbol{\eta}}(\boldsymbol{\beta})$. A PX-ECME algorithm for maximizing a penalized log-likelihood is very similar to one for maximizing an unpenalized log-likelihood function, the only difference being that a penalty function is subtracted from either the parameter-expanded Q-function or the observed-data log-likelihood function. Throughout the remainder of this paper, we will assume the penalty function has the form $P_{\boldsymbol{\eta}}(\boldsymbol{\beta})=\sum_{j=1}^{p}P_{\boldsymbol{\eta},j}(\beta_{j})$, where $\boldsymbol{\eta}\in\mathbb{R}^{q}$ is a vector of tuning parameters. The observed-data weighted penalized log-likelihood function $p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}$ that we seek to maximize is then

p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\boldsymbol{\beta}|\mathbf{y})=\sum_{i=1}^{n}s_{i}\log\binom{m_{i}}{y_{i}}+\sum_{i=1}^{n}s_{i}y_{i}\mathbf{x}_{i}^{T}\boldsymbol{\beta}-\sum_{i=1}^{n}s_{i}m_{i}\log\Big(1+\exp\{\mathbf{x}_{i}^{T}\boldsymbol{\beta}\}\Big)-P_{\boldsymbol{\eta}}(\boldsymbol{\beta}),

and the complete-data penalized weighted log-likelihood corresponding to model (8) is

p\ell_{c,PX}^{\boldsymbol{\eta}}(\boldsymbol{\theta},\alpha|\mathbf{v})=\alpha\sum_{i=1}^{n}s_{i}u_{i}\mathbf{x}_{i}^{T}\boldsymbol{\theta}-\frac{\alpha^{2}}{2}\sum_{i=1}^{n}s_{i}W_{i}(\mathbf{x}_{i}^{T}\boldsymbol{\theta})^{2}-P_{\boldsymbol{\eta}}(\alpha\boldsymbol{\theta})+C,

where $C$ is a constant that does not depend on $(\boldsymbol{\theta},\alpha)$. The penalized, parameter-expanded Q-function $Q_{PX}^{\boldsymbol{\eta}}(\boldsymbol{\theta},\alpha|\boldsymbol{\beta}^{(t)},\alpha_{0})$ is obtained by simply subtracting $P_{\boldsymbol{\eta}}(\alpha\boldsymbol{\theta})$ from the usual parameter-expanded Q-function (7). For model (8), this is given by

Q_{PX}^{\boldsymbol{\eta}}(\boldsymbol{\theta},\alpha|\boldsymbol{\beta}^{(t)},\alpha_{0})=\alpha\boldsymbol{\theta}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{u}-\frac{\alpha^{2}}{2}\boldsymbol{\theta}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\mathbf{X}\boldsymbol{\theta}-P_{\boldsymbol{\eta}}(\alpha\boldsymbol{\theta})+C, \qquad (9)

where $\mathbf{W}(\boldsymbol{\beta}^{(t)})$ is the $n\times n$ weight matrix defined in (5).

One feature of the parameter-expanded model (8) is that the parameter vector $(\boldsymbol{\theta},\alpha)$ is not identifiable from the complete-data vector $\mathbf{v}$. This may be seen by noting that, for a positive scalar $c$, the parameters $(\alpha,\boldsymbol{\theta})$ and $(\alpha/c,c\boldsymbol{\theta})$ both lead to the same complete-data penalized log-likelihood. Nevertheless, $(\boldsymbol{\theta},\alpha)$ are conditionally identifiable, meaning that, for a fixed $\boldsymbol{\theta}$, $\alpha$ is identifiable, and for a fixed $\alpha$, $\boldsymbol{\theta}$ is identifiable. Hence, we can first update $\boldsymbol{\theta}^{(t)}$ by fixing $\alpha=\alpha^{(t)}$ and maximizing $Q_{PX}^{\boldsymbol{\eta}}(\boldsymbol{\theta},\alpha|\boldsymbol{\beta}^{(t)},\alpha_{0})$ with respect to $\boldsymbol{\theta}$:

\boldsymbol{\theta}^{(t+1)} = \operatorname*{arg\,max}_{\boldsymbol{\theta}\in\mathbb{R}^{p}}Q_{PX}^{\boldsymbol{\eta}}(\boldsymbol{\theta},\alpha^{(t)}|\boldsymbol{\beta}^{(t)},\alpha_{0})
= \operatorname*{arg\,min}_{\boldsymbol{\theta}\in\mathbb{R}^{p}}\Bigg[-\alpha^{(t)}\boldsymbol{\theta}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{u}+\frac{(\alpha^{(t)})^{2}}{2}\boldsymbol{\theta}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\mathbf{X}\boldsymbol{\theta}+P_{\boldsymbol{\eta}}(\alpha^{(t)}\boldsymbol{\theta})\Bigg]. \qquad (10)

We then update $\alpha^{(t)}$ by maximizing the observed-data penalized log-likelihood

\alpha^{(t+1)}=\operatorname*{arg\,max}_{\alpha\in\mathbb{R}}p\ell_{o}^{\boldsymbol{\eta}}(\alpha\boldsymbol{\theta}^{(t+1)}|\mathbf{y}).

The PX-ECME parameter update of $\boldsymbol{\beta}^{(t)}$ is given by $\boldsymbol{\beta}^{(t+1)}=R(\boldsymbol{\theta}^{(t+1)},\alpha^{(t+1)})=\alpha^{(t+1)}\boldsymbol{\theta}^{(t+1)}$.

It is worth noting that one can express $\boldsymbol{\theta}^{(t+1)}$ in terms of the EM parameter update as $\boldsymbol{\theta}^{(t+1)}=\boldsymbol{\beta}^{(t+1),EM}/\alpha^{(t)}$, where the EM update $\boldsymbol{\beta}^{(t+1),EM}$ is obtained from $\boldsymbol{\beta}^{(t)}$ by maximizing $Q_{PX}^{\boldsymbol{\eta}}(\boldsymbol{\theta},1|\boldsymbol{\beta}^{(t)},1)$ with respect to $\boldsymbol{\theta}$. Hence, we can express the PX-ECME update as

\boldsymbol{\beta}^{(t+1)}=\frac{1}{\alpha^{(t)}}\boldsymbol{\beta}^{(t+1),EM}\Big[\operatorname*{arg\,max}_{\alpha\in\mathbb{R}}p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\alpha\boldsymbol{\beta}^{(t+1),EM}/\alpha^{(t)}|\mathbf{y})\Big]=\rho^{(t+1)}\boldsymbol{\beta}^{(t+1),EM},

where $\rho^{(t+1)}=\operatorname*{arg\,max}_{\rho\in\mathbb{R}}p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\rho\boldsymbol{\beta}^{(t+1),EM}|\mathbf{y})$. In other words, the PX-ECME update is found by first computing the EM update $\boldsymbol{\beta}^{(t+1),EM}$ of the regression coefficients and then multiplying $\boldsymbol{\beta}^{(t+1),EM}$ by the scaling factor $\rho^{(t+1)}$, where $\rho^{(t+1)}$ is the scalar maximizing the weighted observed-data penalized log-likelihood function among regression coefficients of the form $\rho\boldsymbol{\beta}^{(t+1),EM}$. The complete PX-ECME procedure is summarized in Algorithm 1.

Finding $\rho^{(t+1)}$ only involves solving a one-dimensional root-finding problem, namely $\partial p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\rho\boldsymbol{\beta}^{(t+1),EM}|\mathbf{y})/\partial\rho=0$, for which a wide range of numerical methods are available; in particular, robust root-finding methods such as Brent's method (Brent (2013)) may be used directly to find $\rho^{(t+1)}$.

Algorithm 1 (PX-ECME for Penalized Logistic Regression). In the description of the algorithm, $\mathbf{S}=\textrm{diag}\{s_{1},\ldots,s_{n}\}$ and $\mathbf{u}=(u_{1},\ldots,u_{n})^{T}$, where $u_{i}=y_{i}-m_{i}/2$.
1: Given $\boldsymbol{\beta}^{(0)}\in\mathbb{R}^{p}$.
2: for t = 0, 1, 2, … until convergence do
3:     Compute the $n\times n$ diagonal weight matrix $\mathbf{W}(\boldsymbol{\beta}^{(t)})$ whose $i^{th}$ diagonal element is
\omega(\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)},m_{i})=\frac{m_{i}}{2\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)}}\tanh\big(\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)}/2\big).
4:     Compute $\boldsymbol{\beta}^{(t+1),EM}$:
\boldsymbol{\beta}^{(t+1),EM}=\operatorname*{arg\,min}_{\boldsymbol{\beta}\in\mathbb{R}^{p}}\Big[\frac{1}{2}\boldsymbol{\beta}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\mathbf{X}\boldsymbol{\beta}-\boldsymbol{\beta}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{u}+P_{\boldsymbol{\eta}}(\boldsymbol{\beta})\Big].
5:     Compute
\rho^{(t+1)}=\operatorname*{arg\,max}_{\rho\in\mathbb{R}}\Big[p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\rho\boldsymbol{\beta}^{(t+1),EM}|\mathbf{y})\Big].
6:     Set $\boldsymbol{\beta}^{(t+1)}=\rho^{(t+1)}\boldsymbol{\beta}^{(t+1),EM}$.
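To make the steps concrete, the following is a minimal sketch of Algorithm 1 for the unpenalized case ($P_{\boldsymbol{\eta}}\equiv 0$); the code and its names are illustrative only, and scipy.optimize.minimize_scalar (Brent's method) is used for the one-dimensional maximization over $\rho$ in Step 5. The constant binomial-coefficient term is dropped from the log-likelihood since it does not affect the maximizer.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def loglik(beta, X, y, m, s):
    """Weighted log-likelihood (2) up to the additive binomial-coefficient constant."""
    eta = X @ beta
    return np.sum(s * (y * eta - m * np.logaddexp(0.0, eta)))

def px_ecme(X, y, m, s, beta0=None, max_iter=500, tol=1e-10):
    """PX-ECME (Algorithm 1) for unpenalized weighted logistic regression."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else np.asarray(beta0, float).copy()
    u = y - m / 2.0
    for _ in range(max_iter):
        eta = X @ beta
        w = np.empty(n)
        nz = eta != 0.0
        w[nz] = m[nz] * np.tanh(eta[nz] / 2.0) / (2.0 * eta[nz])
        w[~nz] = m[~nz] / 4.0
        # Step 4: EM update, a weighted least-squares solve
        beta_em = np.linalg.solve(X.T @ ((s * w)[:, None] * X), X.T @ (s * u))
        # Step 5: one-dimensional search over rho via Brent's method
        res = minimize_scalar(lambda r: -loglik(r * beta_em, X, y, m, s),
                              bracket=(0.5, 2.0))
        beta_new = res.x * beta_em           # Step 6
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```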

An attractive feature of the PX-ECME algorithm (Algorithm 1) is that it is, like the EM algorithm, a monotone algorithm. The monotonicity property helps to ensure that the procedure is stable and robust in practice. The monotonicity of PX-ECME follows directly from the fact that the EM algorithm is monotone with respect to the penalized log-likelihood (i.e., $p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\boldsymbol{\beta}^{(t+1),EM}|\mathbf{y})\geq p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\boldsymbol{\beta}^{(t)}|\mathbf{y})$) and the fact that $p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\rho^{(t+1)}\boldsymbol{\beta}^{(t+1),EM}|\mathbf{y})\geq p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\boldsymbol{\beta}^{(t+1),EM}|\mathbf{y})$.

3.3 Monotone acceleration of EM via order-1 Anderson acceleration

The PX-ECME algorithm accelerates convergence by multiplying the EM parameter update by a single scalar, and there are likely other straightforward modifications of the EM algorithm that can produce substantial speed-ups in convergence without sacrificing the monotonicity property.

An attractive class of methods with straightforward parameter updates comprises those that simply take a linear combination of the previous two EM parameter updates, with the weights in the linear combination determined adaptively. One can ensure that such a scheme is monotone by comparing the value of the objective function at the proposed update with that of the plain EM update. One method for finding the weights in this linear combination is the order-1 Anderson acceleration (Walker & Ni (2011)). Applying the order-1 Anderson acceleration to the EM algorithm for penalized logistic regression leads to the following parameter updating scheme

\boldsymbol{\beta}^{(t+1)}=\begin{cases}(1-\gamma^{(t)})\boldsymbol{\beta}^{(t+1),EM}+\gamma^{(t)}\boldsymbol{\beta}^{(t),EM}&\text{ if }p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\boldsymbol{\beta}^{(t+1)}|\mathbf{y})\geq p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\boldsymbol{\beta}^{(t+1),EM}|\mathbf{y})\\ \boldsymbol{\beta}^{(t+1),EM}&\text{ otherwise,}\end{cases}

where $\gamma^{(t)}$ is the scalar $\gamma^{(t)}=\mathbf{v}_{t}^{T}\mathbf{r}_{t}/\mathbf{v}_{t}^{T}\mathbf{v}_{t}$, with $\mathbf{r}_{t}=\boldsymbol{\beta}^{(t+1),EM}-\boldsymbol{\beta}^{(t)}$ and $\mathbf{v}_{t}=\boldsymbol{\beta}^{(t+1),EM}-\boldsymbol{\beta}^{(t)}+\boldsymbol{\beta}^{(t-1)}-\boldsymbol{\beta}^{(t),EM}$. Because the underlying EM algorithm is monotone, accepting the linear combination $\boldsymbol{\beta}^{(t+1)}$ only if $p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\boldsymbol{\beta}^{(t+1)}|\mathbf{y})\geq p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\boldsymbol{\beta}^{(t+1),EM}|\mathbf{y})$ ensures that the above iterative scheme is monotone.
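A sketch of this monotone order-1 Anderson scheme for the unpenalized case is given below; it is illustrative only and assumes the em_update and loglik helpers sketched earlier in this article.

```python
import numpy as np

def anderson_em(X, y, m, s, beta0, max_iter=500, tol=1e-10):
    """Order-1 Anderson-accelerated EM with a monotonicity safeguard: the
    extrapolated iterate is accepted only if its log-likelihood is at least
    that of the plain EM update."""
    beta_prev = np.asarray(beta0, float).copy()       # beta^{(t-1)}
    beta_em_prev = em_update(beta_prev, X, y, m, s)    # beta^{(t),EM}
    beta = beta_em_prev.copy()                         # beta^{(t)}: first step is plain EM
    for _ in range(max_iter):
        beta_em = em_update(beta, X, y, m, s)          # beta^{(t+1),EM}
        r = beta_em - beta                             # r_t
        v = beta_em - beta + beta_prev - beta_em_prev  # v_t
        gamma = (v @ r) / (v @ v) if (v @ v) > 0 else 0.0
        cand = (1.0 - gamma) * beta_em + gamma * beta_em_prev
        beta_new = cand if loglik(cand, X, y, m, s) >= loglik(beta_em, X, y, m, s) \
            else beta_em
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta_prev, beta_em_prev, beta = beta, beta_em, beta_new
    return beta
```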

3.4 Rate of Convergence of EM and PX-ECME for Logistic Regression

The PX-ECME updating scheme outlined in Algorithm 1 implicitly defines a mapping $G_{PX}:\mathbb{R}^{p}\longrightarrow\mathbb{R}^{p}$ which performs the PX-ECME update of $\boldsymbol{\beta}^{(t)}$, i.e., $\boldsymbol{\beta}^{(t+1)}=G_{PX}(\boldsymbol{\beta}^{(t)})$. Likewise, the EM algorithm also defines a mapping $\boldsymbol{\beta}^{(t+1),EM}=G_{EM}(\boldsymbol{\beta}^{(t),EM})$ which, from (6), has the form $G_{EM}(\boldsymbol{\beta})=(\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta})\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{S}\mathbf{u}$. Because the PX-ECME parameter update is a scalar multiple of the EM update, we can express the mapping $G_{PX}$ in terms of $G_{EM}$ as

G_{PX}(\boldsymbol{\beta})=h\{G_{EM}(\boldsymbol{\beta})\}G_{EM}(\boldsymbol{\beta}), \qquad (11)

where $h:\mathbb{R}^{p}\longrightarrow\mathbb{R}$ is the function defined as $h(\boldsymbol{\beta})=\operatorname*{arg\,max}_{\rho}[p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\rho\boldsymbol{\beta}|\mathbf{y})]$.

The rate of convergence of a fixed-point iteration of the form $\boldsymbol{\beta}^{(t+1)}=G(\boldsymbol{\beta}^{(t)})$ is typically defined as the maximum eigenvalue of the Jacobian matrix of $G$ when the Jacobian is evaluated at the convergence point $\boldsymbol{\beta}^{*}$ of the algorithm (see, e.g., McLachlan & Krishnan (2007)). As shown in Durante & Rigon (2018), the Jacobian matrix $\mathbf{J}_{EM}(\boldsymbol{\beta})$ associated with the mapping $G_{EM}(\boldsymbol{\beta})=(\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta})\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{S}\mathbf{u}$ is given by

\mathbf{J}_{EM}(\boldsymbol{\beta})=\mathbf{I}_{p}-(\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta})\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{S}\mathbf{R}(\boldsymbol{\beta})\mathbf{X}, \qquad (12)

where $\mathbf{R}(\boldsymbol{\beta})$ is the $n\times n$ diagonal matrix whose $i^{th}$ diagonal element is $m_{i}\pi(\mathbf{x}_{i}^{T}\boldsymbol{\beta})[1-\pi(\mathbf{x}_{i}^{T}\boldsymbol{\beta})]$ and $\pi(u)=1/\{1+\exp(-u)\}$. As we show in Theorem 1 below, the maximum eigenvalue of the Jacobian matrix of $G_{PX}$ at $\boldsymbol{\beta}^{*}$ can be no larger than the maximum eigenvalue of $\mathbf{J}_{EM}(\boldsymbol{\beta}^{*})$, and hence the rate of convergence of PX-ECME is no slower than that of the EM algorithm.
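As a small numerical illustration (our own sketch, not part of any package), the EM rate of convergence can be evaluated directly from (12) at a fitted value $\boldsymbol{\beta}^{*}$:

```python
import numpy as np

def em_convergence_rate(beta_star, X, m, s):
    """Largest eigenvalue of J_EM in (12): I - (X'SWX)^{-1} X'SRX, where R has
    diagonal m_i * pi_i * (1 - pi_i) and W holds the Polya-Gamma EM weights."""
    eta = X @ beta_star
    pi = 1.0 / (1.0 + np.exp(-eta))
    w = np.empty_like(eta)
    nz = eta != 0.0
    w[nz] = m[nz] * np.tanh(eta[nz] / 2.0) / (2.0 * eta[nz])
    w[~nz] = m[~nz] / 4.0
    E = X.T @ ((s * w)[:, None] * X)                    # X' S W X
    V = X.T @ ((s * m * pi * (1.0 - pi))[:, None] * X)  # X' S R X
    J = np.eye(X.shape[1]) - np.linalg.solve(E, V)
    return np.max(np.linalg.eigvals(J).real)
```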

Theorem 1

When assuming the penalty function equals zero for all $\boldsymbol{\beta}$ (i.e., $P_{\boldsymbol{\eta}}(\boldsymbol{\beta})\equiv 0$) and $\boldsymbol{\beta}^{*}\neq\mathbf{0}$, the Jacobian $\mathbf{J}_{PX}(\boldsymbol{\beta})=\partial G_{PX}(\boldsymbol{\beta})/\partial\boldsymbol{\beta}$ of the PX-ECME mapping at $\boldsymbol{\beta}=\boldsymbol{\beta}^{*}$ is given by

\mathbf{J}_{PX}(\boldsymbol{\beta}^{*}) = \mathbf{I}_{p}-\mathbf{E}^{-1}(\boldsymbol{\beta}^{*})\mathbf{V}(\boldsymbol{\beta}^{*})-[c(\boldsymbol{\beta}^{*})]^{-1}\boldsymbol{\beta}^{*}(\boldsymbol{\beta}^{*})^{T}\mathbf{V}(\boldsymbol{\beta}^{*})\big\{\mathbf{V}^{-1}(\boldsymbol{\beta}^{*})-\mathbf{E}^{-1}(\boldsymbol{\beta}^{*})\big\}\mathbf{V}(\boldsymbol{\beta}^{*})^{T}
= \mathbf{J}_{EM}(\boldsymbol{\beta}^{*})-[c(\boldsymbol{\beta}^{*})]^{-1}\boldsymbol{\beta}^{*}(\boldsymbol{\beta}^{*})^{T}\mathbf{V}(\boldsymbol{\beta}^{*})\big\{\mathbf{V}^{-1}(\boldsymbol{\beta}^{*})-\mathbf{E}^{-1}(\boldsymbol{\beta}^{*})\big\}\mathbf{V}(\boldsymbol{\beta}^{*})^{T},

where $\mathbf{J}_{EM}(\boldsymbol{\beta}^{*})$ is the Jacobian matrix of the EM mapping at $\boldsymbol{\beta}=\boldsymbol{\beta}^{*}$, $c(\boldsymbol{\beta}^{*})=(\boldsymbol{\beta}^{*})^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{R}(\boldsymbol{\beta}^{*})\mathbf{X}\boldsymbol{\beta}^{*}$, $\mathbf{E}(\boldsymbol{\beta}^{*})=\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{*})\mathbf{X}$, and $\mathbf{V}(\boldsymbol{\beta}^{*})=\mathbf{X}^{T}\mathbf{S}\mathbf{R}(\boldsymbol{\beta}^{*})\mathbf{X}$. In addition, we have $r_{PX}\leq r_{EM}$, where $r_{PX}$ and $r_{EM}$ are the maximal eigenvalues of $\mathbf{J}_{PX}(\boldsymbol{\beta}^{*})$ and $\mathbf{J}_{EM}(\boldsymbol{\beta}^{*})$ respectively.

4 A Generalized PX-ECME Algorithm for Logistic Regression

4.1 Description of the Approach

For many penalty functions, the maximization in (10) does not yield an easily-computable solution and may require the use of a time-consuming iterative procedure to compute $\boldsymbol{\theta}^{(t+1)}$. To better handle such cases, we consider a generalized PX-ECME algorithm where $\boldsymbol{\theta}^{(t+1)}$ maximizes a more manageable, surrogate parameter-expanded Q-function. The surrogate Q-functions $\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\boldsymbol{\theta},\alpha|\boldsymbol{\beta}^{(t)},\alpha_{0})$ are assumed to have the form

\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\boldsymbol{\theta},\alpha|\boldsymbol{\beta}^{(t)},\alpha_{0})=Q_{PX}^{\boldsymbol{\eta}}(\boldsymbol{\theta},\alpha|\boldsymbol{\beta}^{(t)},\alpha_{0})-\frac{1}{2}(\alpha\boldsymbol{\theta}-\boldsymbol{\beta}^{(t)})^{T}\mathbf{H}^{(t)}(\alpha\boldsymbol{\theta}-\boldsymbol{\beta}^{(t)})
=\alpha\boldsymbol{\theta}^{T}[\mathbf{X}^{T}\mathbf{S}\mathbf{u}+\mathbf{H}^{(t)}\boldsymbol{\beta}^{(t)}]-\frac{\alpha^{2}}{2}\boldsymbol{\theta}^{T}\big[\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\mathbf{X}+\mathbf{H}^{(t)}\big]\boldsymbol{\theta}-P_{\boldsymbol{\eta}}(\alpha\boldsymbol{\theta})+C, \qquad (13)

where $\mathbf{H}^{(t)}$ is a positive semi-definite $p\times p$ matrix. As in the PX-ECME algorithm for penalized logistic regression, to update $\boldsymbol{\beta}^{(t)}$ one first finds $\boldsymbol{\theta}^{(t+1)}$ and $\alpha^{(t+1)}$ by performing two conditional maximization steps with respect to $\tilde{Q}_{PX}^{\boldsymbol{\eta}}$ and then sets $\boldsymbol{\beta}^{(t+1)}=\alpha^{(t+1)}\boldsymbol{\theta}^{(t+1)}$. We will refer to an algorithm which updates $\boldsymbol{\beta}^{(t)}$ in this manner as a generalized parameter-expanded ECME (GPX-ECME) algorithm for penalized weighted logistic regression.

The steps involved in a GPX-ECME algorithm are outlined in Algorithm 2. Notice in this algorithm that, as in the case of a PX-ECME algorithm, the process of first computing $\boldsymbol{\theta}^{(t+1)}$ and $\alpha^{(t+1)}$ and then setting $\boldsymbol{\beta}^{(t+1)}=\alpha^{(t+1)}\boldsymbol{\theta}^{(t+1)}$ may be equivalently expressed as $\boldsymbol{\beta}^{(t+1)}=\rho^{(t+1)}\boldsymbol{\beta}^{(t+1),GEM}$, where $\boldsymbol{\beta}^{(t+1),GEM}$ maximizes $\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\boldsymbol{\beta},1|\boldsymbol{\beta}^{(t)},\alpha_{0})$ and $\rho^{(t+1)}$ maximizes $p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\rho\boldsymbol{\beta}^{(t+1),GEM}|\mathbf{y})$ with respect to $\rho\in\mathbb{R}$. As stated in the following proposition, as long as $\mathbf{H}^{(t)}$ is a positive semi-definite matrix for each $t$, each iterate generated from a GPX-ECME algorithm will increase the value of the penalized log-likelihood.

Proposition 1

If $\mathbf{H}^{(t)}$ is a positive semi-definite matrix for each $t$, then the iterates $\boldsymbol{\beta}^{(0)},\boldsymbol{\beta}^{(1)},\boldsymbol{\beta}^{(2)},\ldots$ produced by a GPX-ECME algorithm (Algorithm 2) increase the penalized log-likelihood function at each iteration. That is, for any $t\geq 0$,

p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\boldsymbol{\beta}^{(t+1)}|\mathbf{y})\geq p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\boldsymbol{\beta}^{(t)}|\mathbf{y}).

The main purpose for considering surrogate Q-functions of the form (13) in a GPX-ECME algorithm is to allow for more easily-computable parameter updates. A choice of $\mathbf{H}^{(t)}$ which typically ensures an easy-to-compute maximizer of $\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\boldsymbol{\beta},1|\boldsymbol{\beta}^{(t)},\alpha_{0})$ is $\mathbf{H}^{(t)}=\mathbf{D}^{(t)}-\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\mathbf{X}$, where $\mathbf{D}^{(t)}$ is a $p\times p$ diagonal matrix with nonnegative diagonal entries $d_{jj}^{(t)}$. When using a matrix $\mathbf{H}^{(t)}$ of this form, the surrogate Q-function (13) with arguments $(\boldsymbol{\beta},1)$ reduces to

\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\boldsymbol{\beta},1|\boldsymbol{\beta}^{(t)},\alpha_{0})=\boldsymbol{\beta}^{T}[\mathbf{X}^{T}\mathbf{S}\mathbf{u}+\mathbf{a}_{t}]-\frac{1}{2}\boldsymbol{\beta}^{T}\mathbf{D}^{(t)}\boldsymbol{\beta}-P_{\boldsymbol{\eta}}(\boldsymbol{\beta})+C, \qquad (14)

where $\mathbf{a}_{t}=[\mathbf{D}^{(t)}-\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\mathbf{X}]\boldsymbol{\beta}^{(t)}$. Because $\mathbf{D}^{(t)}$ is diagonal, maximizing (14) is often straightforward, as one can maximize $\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\boldsymbol{\beta},1|\boldsymbol{\beta}^{(t)},\alpha_{0})$ separately in each component $\beta_{j}$, with each univariate problem involving only a quadratic function of $\beta_{j}$ and the $j^{th}$ component of the penalty function $P_{\boldsymbol{\eta},j}(\beta_{j})$.

Algorithm 2 (GPX-ECME for Penalized Logistic Regression). In the description of the algorithm, $\mathbf{u}=(u_{1},\ldots,u_{n})^{T}$, where $u_{i}=y_{i}-m_{i}/2$.
1: Given $\boldsymbol{\beta}^{(0)}\in\mathbb{R}^{p}$.
2: for t = 0, 1, 2, … until convergence do
3:     Compute the $n\times n$ diagonal weight matrix $\mathbf{W}(\boldsymbol{\beta}^{(t)})$ whose $i^{th}$ diagonal element is
\omega(\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)},m_{i})=\frac{m_{i}}{2\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)}}\tanh\big(\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t)}/2\big).
4:     Compute $\boldsymbol{\beta}^{(t+1),GEM}$:
\boldsymbol{\beta}^{(t+1),GEM}=\operatorname*{arg\,min}_{\boldsymbol{\beta}\in\mathbb{R}^{p}}\Big[\frac{1}{2}\boldsymbol{\beta}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\mathbf{X}\boldsymbol{\beta}-\boldsymbol{\beta}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{u}+\frac{1}{2}(\boldsymbol{\beta}-\boldsymbol{\beta}^{(t)})^{T}\mathbf{H}^{(t)}(\boldsymbol{\beta}-\boldsymbol{\beta}^{(t)})+P_{\boldsymbol{\eta}}(\boldsymbol{\beta})\Big].
5:     Compute
\rho^{(t+1)}=\operatorname*{arg\,max}_{\rho\in\mathbb{R}}\Big[p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\rho\boldsymbol{\beta}^{(t+1),GEM}|\mathbf{y})\Big].
6:     Set $\boldsymbol{\beta}^{(t+1)}=\rho^{(t+1)}\boldsymbol{\beta}^{(t+1),GEM}$.

If one chooses $\mathbf{H}^{(t)}=\mathbf{D}^{(t)}-\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\mathbf{X}$ and sets all the diagonal elements of $\mathbf{D}^{(t)}$ equal so that $\mathbf{D}^{(t)}=\kappa_{t}^{-1}\mathbf{I}_{p}$ for some $\kappa_{t}>0$, then $\mathbf{H}^{(t)}$ will be positive semi-definite provided that $\kappa_{t}^{-1}\geq\lambda_{max}\big(\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\mathbf{X}\big)$, where $\lambda_{max}\big(\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\mathbf{X}\big)$ is the maximum eigenvalue of $\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{(t)})\mathbf{X}$. When using $\mathbf{D}^{(t)}=\kappa_{t}^{-1}\mathbf{I}_{p}$, it follows from (14) that the term $\boldsymbol{\beta}^{(t+1),GEM}$ computed in Step 4 of Algorithm 2 can be expressed as

\boldsymbol{\beta}^{(t+1),GEM} = \operatorname*{arg\,min}_{\boldsymbol{\beta}\in\mathbb{R}^{p}}\Big[\frac{1}{2}\boldsymbol{\beta}^{T}\boldsymbol{\beta}-\kappa_{t}\boldsymbol{\beta}^{T}[\mathbf{X}^{T}\mathbf{S}\mathbf{u}+\mathbf{a}_{t}]+\kappa_{t}P_{\boldsymbol{\eta}}(\boldsymbol{\beta})\Big]. \qquad (15)

Each component βj(t+1),GEM\beta_{j}^{(t+1),GEM} of 𝜷(t+1),GEM\mbox{\boldmath$\beta$}^{(t+1),GEM} can be updated independently of the other components via

βj(t+1),GEM=argminβj[βj2κt2(i=1nxijsiui+aj,t)βj+2P𝜼,j(βj)],\beta_{j}^{(t+1),GEM}=\operatorname*{arg\,min}_{\beta_{j}\in\mathbb{R}}\Bigg{[}\frac{\beta_{j}^{2}}{\kappa_{t}}-2\Big{(}\sum_{i=1}^{n}x_{ij}s_{i}u_{i}+a_{j,t}\Big{)}\beta_{j}+2P_{\boldsymbol{\eta},j}(\beta_{j})\Bigg{]}, (16)

where aj,ta_{j,t} denotes the jthj^{th} component of 𝐚t\mathbf{a}_{t}.

The updates (16) have a closed form for a number of common penalty functions. For example, in the case of the elastic net penalty function where P𝜼,j(βj)=λ1|βj|+λ22βj2P_{\boldsymbol{\eta},j}(\beta_{j})=\lambda_{1}|\beta_{j}|+\tfrac{\lambda_{2}}{2}\beta_{j}^{2}, the update for the jthj^{th} component of 𝜷(t)\mbox{\boldmath$\beta$}^{(t)} is given by

βj(t+1),GEM={11/κt+λ2(i=1nxijsiui+aj,tλ1) if i=1nxijsiui+aj,t>λ111/κt+λ2(i=1nxijsiui+aj,t+λ1) if i=1nxijsiui+aj,t<λ10 if |i=1nxijsiui+aj,t|λ1\beta_{j}^{(t+1),GEM}=\begin{cases}\frac{1}{1/\kappa_{t}+\lambda_{2}}\Big{(}\sum_{i=1}^{n}x_{ij}s_{i}u_{i}+a_{j,t}-\lambda_{1}\Big{)}&\textrm{ if }\sum_{i=1}^{n}x_{ij}s_{i}u_{i}+a_{j,t}>\lambda_{1}\\ \frac{1}{1/\kappa_{t}+\lambda_{2}}\Big{(}\sum_{i=1}^{n}x_{ij}s_{i}u_{i}+a_{j,t}+\lambda_{1}\Big{)}&\textrm{ if }\sum_{i=1}^{n}x_{ij}s_{i}u_{i}+a_{j,t}<-\lambda_{1}\\ 0&\text{ if }\Big{|}\sum_{i=1}^{n}x_{ij}s_{i}u_{i}+a_{j,t}\Big{|}\leq\lambda_{1}\end{cases}

While other popular choices of the penalty function such as the smoothly clipped absolute deviation (SCAD) (Fan & Li (2001)) penalty may not necessarily yield direct, closed-form solutions, the parameter updates may be directly found by performing a univariate minimization separately for each component of the parameter vector.
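To make the componentwise update concrete, the following minimal Python/NumPy sketch implements the closed-form elastic-net solution of (16); the function name and the numerical values in the usage line are ours and purely illustrative, and the inputs zj=∑ixijsiui+aj,tz_{j}=\sum_{i}x_{ij}s_{i}u_{i}+a_{j,t} and κt\kappa_{t} are assumed to have been computed as described above.

```python
import numpy as np

def elastic_net_component_update(z, kappa_t, lam1, lam2):
    """Closed-form minimizer of (16) under the elastic-net penalty.

    z       : array with entries z_j = sum_i x_ij * s_i * u_i + a_{j,t}
    kappa_t : positive scalar kappa_t, so that D^(t) = kappa_t^{-1} I_p
    lam1    : lasso penalty weight lambda_1
    lam2    : ridge penalty weight lambda_2
    """
    # Soft-threshold z at lambda_1, then shrink by (1/kappa_t + lambda_2).
    return np.sign(z) * np.maximum(np.abs(z) - lam1, 0.0) / (1.0 / kappa_t + lam2)

# Hypothetical values, for illustration only:
z = np.array([2.3, -0.4, 0.9])
print(elastic_net_component_update(z, kappa_t=0.25, lam1=1.0, lam2=0.5))   # [0.2889, 0., 0.]
```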

4.2 Connections to Proximal Gradient Descent

Proximal gradient methods (Parikh & Boyd (2014)) are commonly used to optimize a function which can be expressed as the sum of a continuously differentiable function and a non-smooth function. In the context of penalized logistic regression, where the aim is to maximize po,𝐬𝜼(𝜷|𝐲)p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}|\mathbf{y}), or equivalently to minimize the composite function −ℓo,𝐬(𝜷|𝐲)+P𝜼(𝜷)-\ell_{o,\mathbf{s}}(\mbox{\boldmath$\beta$}|\mathbf{y})+P_{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}), the proximal gradient descent update with steplength κt\kappa_{t} of a current parameter estimate 𝜷(t)\mbox{\boldmath$\beta$}^{(t)} is given by

𝜷(t+1)=proxκtP𝜼(𝜷(t)+κto,𝐬(𝜷(t)|𝐲)),\mbox{\boldmath$\beta$}^{(t+1)}=\textrm{prox}_{\kappa_{t}P_{\boldsymbol{\eta}}}\Big{(}\mbox{\boldmath$\beta$}^{(t)}+\kappa_{t}\nabla\ell_{o,\mathbf{s}}(\mbox{\boldmath$\beta$}^{(t)}|\mathbf{y})\Big{)}, (17)

where the proximal operator proxκtP𝜼:pp\textrm{prox}_{\kappa_{t}P_{\boldsymbol{\eta}}}:\mathbb{R}^{p}\longrightarrow\mathbb{R}^{p} of the function κtP𝜼\kappa_{t}P_{\boldsymbol{\eta}} is defined as

proxκtP𝜼(𝜸)=argmin𝜷[12𝜷T𝜷𝜷T𝜸+κtPη(𝜷)].\textrm{prox}_{\kappa_{t}P_{\boldsymbol{\eta}}}(\mbox{\boldmath$\gamma$})=\operatorname*{arg\,min}_{\boldsymbol{\beta}}\Big{[}\frac{1}{2}\mbox{\boldmath$\beta$}^{T}\mbox{\boldmath$\beta$}-\mbox{\boldmath$\beta$}^{T}\mbox{\boldmath$\gamma$}+\kappa_{t}P_{\eta}(\mbox{\boldmath$\beta$})\Big{]}.

To elaborate on the connection between the proximal gradient update and the GPX-ECME procedure, we first note that the gradient of the observed-data log-likelihood function (2) is o,𝐬(𝜷|𝐲)=𝐗T𝐒{𝐲μ(𝐗𝜷)}\nabla\ell_{o,\mathbf{s}}(\mbox{\boldmath$\beta$}|\mathbf{y})=\mathbf{X}^{T}\mathbf{S}\{\mathbf{y}-\mu(\mathbf{X}\mbox{\boldmath$\beta$})\}, where μ(𝐗𝜷)\mu(\mathbf{X}\mbox{\boldmath$\beta$}) is an n×1n\times 1 vector whose ithi^{th} component [μ(𝐗𝜷)]i[\mu(\mathbf{X}\mbox{\boldmath$\beta$})]_{i} is given by [μ(𝐗𝜷)]i=miexpit(𝐱iT𝜷)[\mu(\mathbf{X}\mbox{\boldmath$\beta$})]_{i}=m_{i}\textrm{expit}(\mathbf{x}_{i}^{T}\mbox{\boldmath$\beta$}). Using the connection (5) between the mean function μ(𝐗𝜷)\mu(\mathbf{X}\mbox{\boldmath$\beta$}) and the weight matrix 𝐖(𝜷)\mathbf{W}(\mbox{\boldmath$\beta$}), we can write the gradient of o,𝐬(𝜷|𝐲)\ell_{o,\mathbf{s}}(\mbox{\boldmath$\beta$}|\mathbf{y}) as

o,𝐬(𝜷|𝐲)=𝐗T𝐒{𝐮𝐖(𝜷)𝐗𝜷}.\nabla\ell_{o,\mathbf{s}}(\mbox{\boldmath$\beta$}|\mathbf{y})=\mathbf{X}^{T}\mathbf{S}\{\mathbf{u}-\mathbf{W}(\mbox{\boldmath$\beta$})\mathbf{X}\mbox{\boldmath$\beta$}\}.

Thus, the proximal gradient update (17) for penalized logistic regression can be expressed as

𝜷(t+1),PGD\displaystyle\mbox{\boldmath$\beta$}^{(t+1),PGD} =\displaystyle= proxκtP𝜼(𝜷(t)+κt𝐗T𝐒{𝐮𝐖(𝜷(t))𝐗𝜷(t)})\displaystyle\textrm{prox}_{\kappa_{t}P_{\boldsymbol{\eta}}}\Big{(}\mbox{\boldmath$\beta$}^{(t)}+\kappa_{t}\mathbf{X}^{T}\mathbf{S}\{\mathbf{u}-\mathbf{W}(\mbox{\boldmath$\beta$}^{(t)})\mathbf{X}\mbox{\boldmath$\beta$}^{(t)}\}\Big{)} (18)
=\displaystyle= argmin𝜷p[12𝜷T𝜷𝜷T(𝜷(t)+κt𝐗T𝐒{𝐮𝐖(𝜷(t))𝐗𝜷(t)})+κtP𝜼(𝜷)].\displaystyle\operatorname*{arg\,min}_{\boldsymbol{\beta}\in\mathbb{R}^{p}}\Big{[}\frac{1}{2}\mbox{\boldmath$\beta$}^{T}\mbox{\boldmath$\beta$}-\mbox{\boldmath$\beta$}^{T}\Big{(}\mbox{\boldmath$\beta$}^{(t)}+\kappa_{t}\mathbf{X}^{T}\mathbf{S}\{\mathbf{u}-\mathbf{W}(\mbox{\boldmath$\beta$}^{(t)})\mathbf{X}\mbox{\boldmath$\beta$}^{(t)}\}\Big{)}+\kappa_{t}P_{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$})\Big{]}.

It is interesting to note that the proximal gradient descent update (18) is exactly the same as the update 𝜷(t+1),GEM\mbox{\boldmath$\beta$}^{(t+1),GEM} in (15), which is part of a GPX-ECME algorithm with 𝐇(t)=κt−1𝐈p−𝐗T𝐒𝐖(𝜷(t))𝐗\mathbf{H}^{(t)}=\kappa_{t}^{-1}\mathbf{I}_{p}-\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\mbox{\boldmath$\beta$}^{(t)})\mathbf{X}. In other words, the GPX-ECME algorithm with this choice of 𝐇(t)\mathbf{H}^{(t)} may be interpreted as a proximal gradient descent algorithm where one first takes a proximal gradient descent step with steplength κt\kappa_{t} and then multiplies this update by the scalar ρ(t+1)\rho^{(t+1)}, which is optimal in the sense that it yields the largest increase in the penalized observed-data log-likelihood.
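The following sketch illustrates one iteration of this scheme for the lasso penalty P𝜼(𝜷)=λ1∑j|βj|P_{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$})=\lambda_{1}\sum_{j}|\beta_{j}|: a proximal gradient step with κt\kappa_{t} set to the reciprocal of the largest eigenvalue of 𝐗T𝐒𝐖(𝜷(t))𝐗\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\mbox{\boldmath$\beta$}^{(t)})\mathbf{X}, followed by the scalar rescaling. The function names and the use of scipy.optimize.minimize_scalar for the one-dimensional maximization are our implementation choices, not part of the algorithm's specification.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def pg_weights(eta, m):
    """Polya-Gamma EM weights omega(eta_i, m_i) = m_i * tanh(eta_i / 2) / (2 * eta_i)."""
    eta_safe = np.where(np.abs(eta) < 1e-8, 1.0, eta)
    return np.where(np.abs(eta) < 1e-8, m / 4.0, m * np.tanh(eta_safe / 2.0) / (2.0 * eta_safe))

def penalized_loglik(beta, X, y, m, s, lam1):
    """Weighted log-likelihood minus the L1 penalty (the quantity maximized over rho)."""
    eta = X @ beta
    return np.sum(s * (y * eta - m * np.logaddexp(0.0, eta))) - lam1 * np.sum(np.abs(beta))

def gpx_ecme_step(beta, X, y, m, s, lam1):
    """One GPX-ECME iteration: a proximal gradient step followed by an optimal rescaling."""
    u = y - m / 2.0
    w = pg_weights(X @ beta, m)
    # steplength kappa_t = 1 / lambda_max(X' S W X), which makes H^(t) positive semi-definite
    kappa = 1.0 / np.linalg.eigvalsh(X.T @ ((s * w)[:, None] * X)).max()
    grad = X.T @ (s * (u - w * (X @ beta)))          # gradient of the weighted log-likelihood
    gamma = beta + kappa * grad
    beta_gem = np.sign(gamma) * np.maximum(np.abs(gamma) - kappa * lam1, 0.0)  # prox of kappa*lam1*||.||_1
    # ECME step: rescale by the scalar maximizing the penalized observed-data log-likelihood
    rho = minimize_scalar(lambda r: -penalized_loglik(r * beta_gem, X, y, m, s, lam1)).x
    return rho * beta_gem
```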

4.3 Connection to an MM algorithm for Logistic Regression

If one chooses 𝐇(t)\mathbf{H}^{(t)} to be 𝐇(t)=𝐗T𝐒[κt𝐈n−𝐖(𝜷(t))]𝐗\mathbf{H}^{(t)}=\mathbf{X}^{T}\mathbf{S}[\kappa_{t}\mathbf{I}_{n}-\mathbf{W}(\mbox{\boldmath$\beta$}^{(t)})]\mathbf{X} for a positive scalar κt>0\kappa_{t}>0 in a GPX-ECME algorithm, the surrogate Q-function becomes

Q~PX𝜼(𝜷,1|𝜷(t),α0)=𝜷T𝐗T𝐒𝐛(t)(κt)κt2𝜷T𝐗T𝐒𝐗𝜷P𝜼(𝜷)+C,\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$},1|\mbox{\boldmath$\beta$}^{(t)},\alpha_{0})=\mbox{\boldmath$\beta$}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{b}^{(t)}(\kappa_{t})-\frac{\kappa_{t}}{2}\mbox{\boldmath$\beta$}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{X}\mbox{\boldmath$\beta$}-P_{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$})+C, (19)

where 𝐛(t)(κt)=𝐮+[κt𝐈n−𝐖(𝜷(t))]𝐗𝜷(t)\mathbf{b}^{(t)}(\kappa_{t})=\mathbf{u}+[\kappa_{t}\mathbf{I}_{n}-\mathbf{W}(\mbox{\boldmath$\beta$}^{(t)})]\mathbf{X}\mbox{\boldmath$\beta$}^{(t)} and CC is a constant independent of 𝜷\beta. In this context, the matrix 𝐇(t)\mathbf{H}^{(t)} is guaranteed to be positive semi-definite if κt\kappa_{t} is greater than or equal to the largest diagonal element of 𝐖(𝜷(t))\mathbf{W}(\mbox{\boldmath$\beta$}^{(t)}). For example, setting κt=κt∗\kappa_{t}=\kappa_{t}^{*} where

κt=max{ω(𝐱1T𝜷(t),m1),,ω(𝐱nT𝜷(t),mn)}\kappa_{t}^{*}=\max\Big{\{}\omega(\mathbf{x}_{1}^{T}\mbox{\boldmath$\beta$}^{(t)},m_{1}),\ldots,\omega(\mathbf{x}_{n}^{T}\mbox{\boldmath$\beta$}^{(t)},m_{n})\Big{\}}

guarantees that 𝐇(t)\mathbf{H}^{(t)} is positive semi-definite. When there is no penalty (i.e., P𝜼(𝜷)=0P_{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$})=0), maximizing Q~PX𝜼(𝜷,1|𝜷(t),α0)\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$},1|\mbox{\boldmath$\beta$}^{(t)},\alpha_{0}) with respect to 𝜷\beta leads to the following formula for 𝜷(t+1),GEM\mbox{\boldmath$\beta$}^{(t+1),GEM}

𝜷(t+1),GEM=argmax𝜷pQ~PX𝜼(𝜷,1|𝜷(t),α0)=1κt(𝐗T𝐒𝐗)1𝐗T𝐒𝐛(t)(κt).\mbox{\boldmath$\beta$}^{(t+1),GEM}=\operatorname*{arg\,max}_{\boldsymbol{\beta}\in\mathbb{R}^{p}}\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$},1|\mbox{\boldmath$\beta$}^{(t)},\alpha_{0})=\frac{1}{\kappa_{t}}(\mathbf{X}^{T}\mathbf{S}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{S}\mathbf{b}^{(t)}(\kappa_{t}). (20)

The update 𝜷(t+1)\mbox{\boldmath$\beta$}^{(t+1)} is then obtained by finding ρ(t+1)=argmaxρpℓo,𝐬(ρ𝜷(t+1),GEM|𝐲)\rho^{(t+1)}=\operatorname*{arg\,max}_{\rho\in\mathbb{R}}p\ell_{o,\mathbf{s}}(\rho\mbox{\boldmath$\beta$}^{(t+1),GEM}|\mathbf{y}) and setting 𝜷(t+1)=ρ(t+1)𝜷(t+1),GEM\mbox{\boldmath$\beta$}^{(t+1)}=\rho^{(t+1)}\mbox{\boldmath$\beta$}^{(t+1),GEM}.

The algorithm defined by the update (20) is closely related to the MM algorithm for logistic regression described in Böhning & Lindsay (1988) and also detailed in Hunter & Lange (2004). To see why, if we use the fact that (5) implies

𝐛(t)(κt)=𝐮+[κt𝐈n−𝐖(𝜷(t))]𝐗𝜷(t)=κt𝐗𝜷(t)+𝐲−μ(𝐗𝜷(t)),\mathbf{b}^{(t)}(\kappa_{t})=\mathbf{u}+\big{[}\kappa_{t}\mathbf{I}_{n}-\mathbf{W}(\mbox{\boldmath$\beta$}^{(t)})\big{]}\mathbf{X}\mbox{\boldmath$\beta$}^{(t)}=\kappa_{t}\mathbf{X}\mbox{\boldmath$\beta$}^{(t)}+\mathbf{y}-\mu(\mathbf{X}\mbox{\boldmath$\beta$}^{(t)}),

then (20) can be rewritten as

𝜷(t+1),GEM\displaystyle\mbox{\boldmath$\beta$}^{(t+1),GEM} =\displaystyle= 𝜷(t)+1κt(𝐗T𝐒𝐗)−1𝐗T𝐒{𝐲−μ(𝐗𝜷(t))}.\displaystyle\mbox{\boldmath$\beta$}^{(t)}+\frac{1}{\kappa_{t}}(\mathbf{X}^{T}\mathbf{S}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{S}\big{\{}\mathbf{y}-\mu(\mathbf{X}\mbox{\boldmath$\beta$}^{(t)})\big{\}}. (21)

If we were to set κt=1/4\kappa_{t}=1/4 for each tt in (21), then, in the case where m1=m2=…=mn=1m_{1}=m_{2}=\ldots=m_{n}=1, (21) would correspond exactly to the MM algorithm update for logistic regression described by Böhning & Lindsay (1988). Note that, when all mim_{i} equal 11, setting κt=1/4\kappa_{t}=1/4 guarantees that 𝐇(t)\mathbf{H}^{(t)} is positive semi-definite because ω(𝐱T𝜷,1)≤1/4\omega(\mathbf{x}^{T}\mbox{\boldmath$\beta$},1)\leq 1/4 for any value of 𝐱T𝜷\mathbf{x}^{T}\mbox{\boldmath$\beta$}. One advantage of using κt=1/4\kappa_{t}=1/4 rather than κt=κt∗\kappa_{t}=\kappa_{t}^{*} is that one does not need to compute the maximum of the weights in each iteration; however, this can result in taking slightly smaller “steps” than with κt∗\kappa_{t}^{*}.

In the case of an L2L_{2} penalty where P𝜼(𝜷)=λ2∑j=1pβj2P_{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$})=\tfrac{\lambda}{2}\sum_{j=1}^{p}\beta_{j}^{2}, one would, in (21), simply replace (𝐗T𝐒𝐗)−1(\mathbf{X}^{T}\mathbf{S}\mathbf{X})^{-1} with the matrix (𝐗T𝐒𝐗+λκt𝐈p)−1(\mathbf{X}^{T}\mathbf{S}\mathbf{X}+\tfrac{\lambda}{\kappa_{t}}\mathbf{I}_{p})^{-1}. This approach for the L2L_{2}-penalized logistic regression problem, which we label the “PX-MM” algorithm, is summarized in Algorithm 3.

Algorithm 3 (PX-MM algorithm for Logistic Regression with L2L_{2} penalty P𝜼(𝜷)=λ2𝜷T𝜷P_{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$})=\frac{\lambda}{2}\mbox{\boldmath$\beta$}^{T}\mbox{\boldmath$\beta$}). In the description of the algorithm, 𝐮=(u1,,un)T\mathbf{u}=(u_{1},\ldots,u_{n})^{T}, where ui=yimi/2u_{i}=y_{i}-m_{i}/2.
1:Given 𝜷(0)p\mbox{\boldmath$\beta$}^{(0)}\in\mathbb{R}^{p}.
2:for t=0,1,2,… until convergence do
3:     Compute the n×nn\times n diagonal weight matrix 𝐖(𝜷(t))\mathbf{W}(\mbox{\boldmath$\beta$}^{(t)}) whose ithi^{th} diagonal element is
ω(𝐱iT𝜷(t),mi)=mi2𝐱iT𝜷(t)tanh(𝐱iT𝜷(t)/2).\omega(\mathbf{x}_{i}^{T}\mbox{\boldmath$\beta$}^{(t)},m_{i})=\frac{m_{i}}{2\mathbf{x}_{i}^{T}\mbox{\boldmath$\beta$}^{(t)}}\tanh\big{(}\mathbf{x}_{i}^{T}\mbox{\boldmath$\beta$}^{(t)}/2\big{)}.
4:     Set κt=max{ω(𝐱1T𝜷(t),m1),,ω(𝐱nT𝜷(t),mn)}\kappa_{t}^{*}=\max\big{\{}\omega(\mathbf{x}_{1}^{T}\mbox{\boldmath$\beta$}^{(t)},m_{1}),\ldots,\omega(\mathbf{x}_{n}^{T}\mbox{\boldmath$\beta$}^{(t)},m_{n})\big{\}}.
5:     Compute 𝜷(t+1),GEM\mbox{\boldmath$\beta$}^{(t+1),GEM}:
𝜷(t+1),GEM=(𝐗T𝐒𝐗+λκt∗𝐈p)−1𝐗T𝐒𝐗𝜷(t)+1κt∗(𝐗T𝐒𝐗+λκt∗𝐈p)−1𝐗T𝐒{𝐲−μ(𝐗𝜷(t))}.\mbox{\boldmath$\beta$}^{(t+1),GEM}=\big{(}\mathbf{X}^{T}\mathbf{S}\mathbf{X}+\tfrac{\lambda}{\kappa_{t}^{*}}\mathbf{I}_{p}\big{)}^{-1}\mathbf{X}^{T}\mathbf{S}\mathbf{X}\mbox{\boldmath$\beta$}^{(t)}+\frac{1}{\kappa_{t}^{*}}\big{(}\mathbf{X}^{T}\mathbf{S}\mathbf{X}+\tfrac{\lambda}{\kappa_{t}^{*}}\mathbf{I}_{p}\big{)}^{-1}\mathbf{X}^{T}\mathbf{S}\big{\{}\mathbf{y}-\mu(\mathbf{X}\mbox{\boldmath$\beta$}^{(t)})\big{\}}.\vspace{-.2cm}
6:     Compute
ρ(t+1)=argmaxρ[po,𝐬𝜼(ρ𝜷(t+1),GEM|𝐲)].\rho^{(t+1)}=\operatorname*{arg\,max}_{\rho\in\mathbb{R}}\Big{[}p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\rho\mbox{\boldmath$\beta$}^{(t+1),GEM}|\mathbf{y})\Big{]}.
7:     Set 𝜷(t+1)=ρ(t+1)𝜷(t+1),GEM\mbox{\boldmath$\beta$}^{(t+1)}=\rho^{(t+1)}\mbox{\boldmath$\beta$}^{(t+1),GEM}.
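A compact Python/NumPy sketch of Algorithm 3 is given below. The convergence check, the use of scipy for the one-dimensional search over ρ\rho, the optional warm start, and the handling of the 0/0 limit of the weights at 𝐱iT𝜷(t)=0\mathbf{x}_{i}^{T}\mbox{\boldmath$\beta$}^{(t)}=0 are our own implementation choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit

def px_mm_ridge(X, y, m, s, lam, beta0=None, max_iter=500, tol=1e-9):
    """PX-MM iterations (Algorithm 3) for L2-penalized weighted logistic regression."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else np.asarray(beta0, dtype=float).copy()
    XtSX = X.T @ (s[:, None] * X)
    def pen_loglik(b):
        eta = X @ b
        return np.sum(s * (y * eta - m * np.logaddexp(0.0, eta))) - 0.5 * lam * (b @ b)
    for _ in range(max_iter):
        eta = X @ beta
        eta_safe = np.where(np.abs(eta) < 1e-8, 1.0, eta)
        w = np.where(np.abs(eta) < 1e-8, m / 4.0,
                     m * np.tanh(eta_safe / 2.0) / (2.0 * eta_safe))
        kappa = w.max()                                  # Step 4: kappa_t^* = max_i omega(x_i' beta, m_i)
        A = XtSX + (lam / kappa) * np.eye(p)
        resid = s * (y - m * expit(eta))                 # S {y - mu(X beta)}
        beta_gem = np.linalg.solve(A, XtSX @ beta + (X.T @ resid) / kappa)   # Step 5
        rho = minimize_scalar(lambda r: -pen_loglik(r * beta_gem)).x         # Step 6
        beta_new = rho * beta_gem                                            # Step 7
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```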

5 PX-ECME and Coordinate Descent

Coordinate descent is often used in the context of L1L_{1}-penalized regression (Friedman et al. (2007)) and L1L_{1}-penalized generalized linear models (Friedman et al. (2010)). Coordinate descent proceeds by updating each regression coefficient one at a time while holding the remaining coefficients fixed. This is frequently an efficient approach because the parameter updates usually have a simple, closed form and because many coefficients, once set to zero, remain at zero in subsequent cycles.

One could also consider a coordinate descent version of the PX-ECME algorithm where one instead updates the elements of the vector 𝜽\theta one at a time rather than in a single step. To be more specific, consider the parameter-expanded Q-function defined in (9) with the “elastic net” penalty P𝜼,j(βj)=λ1|βj|+λ22βj2P_{\boldsymbol{\eta},j}(\beta_{j})=\lambda_{1}|\beta_{j}|+\frac{\lambda_{2}}{2}\beta_{j}^{2} and where the first j1j-1 components of 𝜽\theta have already been updated

QPXλ(θ1(t+1),,θj1(t+1),θj,θj+1(t),,θp(t),α(t)|𝜽(t),α(t))=α(t)θj(i=1nxijsiui)\displaystyle Q_{PX}^{\lambda}(\theta_{1}^{(t+1)},\ldots,\theta_{j-1}^{(t+1)},\theta_{j},\theta_{j+1}^{(t)},\ldots,\theta_{p}^{(t)},\alpha^{(t)}|\mbox{\boldmath$\theta$}^{(t)},\alpha^{(t)})=\alpha^{(t)}\theta_{j}\Big{(}\sum_{i=1}^{n}x_{ij}s_{i}u_{i}\Big{)}- (22)
\displaystyle- (α(t))22(θj2{i=1n(Aij(t))2+λ2}+2θji=1nAij(t)kjθk(t+j/p)Aik(t))|α(t)|λ1|θj|+C~,\displaystyle\frac{(\alpha^{(t)})^{2}}{2}\Big{(}\theta_{j}^{2}\Big{\{}\sum_{i=1}^{n}(A_{ij}^{(t)})^{2}+\lambda_{2}\Big{\}}+2\theta_{j}\sum_{i=1}^{n}A_{ij}^{(t)}\sum_{k\neq j}\theta_{k}^{(t+j/p)}A_{ik}^{(t)}\Big{)}-|\alpha^{(t)}|\lambda_{1}|\theta_{j}|+\tilde{C},

where Aij(t)A_{ij}^{(t)} denotes the (i,j)(i,j) element of 𝐒1/2𝐖1/2(𝜷(t))𝐗\mathbf{S}^{1/2}\mathbf{W}^{1/2}(\mbox{\boldmath$\beta$}^{(t)})\mathbf{X}, and C~\tilde{C} is a constant not depending on θj\theta_{j}. Maximizing (22) with respect to θj\theta_{j} leads to an update of θj(t)\theta_{j}^{(t)} of the form

θj(t+1)=sign(1α(t)Vj(𝜷(t))Uj(𝜷(t),𝜽j(t+j/p)))max{|1α(t)Vj(𝜷(t))Uj(𝜷(t),𝜽j(t+j/p))|1α(t)λ~(𝜷(t)),0}.\theta_{j}^{(t+1)}=\textrm{sign}\Big{(}\tfrac{1}{\alpha^{(t)}}V_{j}(\mbox{\boldmath$\beta$}^{(t)})-U_{j}(\mbox{\boldmath$\beta$}^{(t)},\mbox{\boldmath$\theta$}_{-j}^{(t+j/p)})\Big{)}\max\Big{\{}\Big{|}\tfrac{1}{\alpha^{(t)}}V_{j}(\mbox{\boldmath$\beta$}^{(t)})-U_{j}(\mbox{\boldmath$\beta$}^{(t)},\mbox{\boldmath$\theta$}_{-j}^{(t+j/p)})\Big{|}-\tfrac{1}{\alpha^{(t)}}\tilde{\lambda}(\mbox{\boldmath$\beta$}^{(t)}),0\Big{\}}. (23)

Expressions for the terms Vj(𝜷(t))V_{j}(\mbox{\boldmath$\beta$}^{(t)}), Uj(𝜷(t),𝜽j(t+j/p))U_{j}(\mbox{\boldmath$\beta$}^{(t)},\mbox{\boldmath$\theta$}_{-j}^{(t+j/p)}), and λ~(𝜷(t))\tilde{\lambda}(\mbox{\boldmath$\beta$}^{(t)}) are given in Appendix C. After completing one “cycle” of coordinate descent where all pp components of 𝜽\theta are updated using (23), one can update the value of α(t)\alpha^{(t)} by maximizing po,𝐬𝜼(α𝜽|𝐲)p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\alpha\mbox{\boldmath$\theta$}|\mathbf{y}) with respect to α\alpha. After completing this cycle, the regression coefficients would be updated via 𝜷(t+1)=α(t+1)𝜽(t+1)\mbox{\boldmath$\beta$}^{(t+1)}=\alpha^{(t+1)}\mbox{\boldmath$\theta$}^{(t+1)}.

Rather than waiting for an entire cycle to update α(t)\alpha^{(t)} and the weight matrix 𝐖(𝜷(t))\mathbf{W}(\mbox{\boldmath$\beta$}^{(t)}), an alternative is to perform this update after a “block” of kk of the pp components of 𝜽(t)\mbox{\boldmath$\theta$}^{(t)} (for k≤pk\leq p) has been updated. This can often reduce the total number of coordinate descent cycles required to converge, though at the expense of a greater computational cost per cycle. The more general PX-ECME algorithm which updates the value of α\alpha and the regression weights after every kk coordinate updates is outlined in Algorithm 4 of Appendix C. Note that when k=pk=p, this algorithm reduces to the case where α\alpha and the weight matrix are only updated after a full cycle of coordinate descent is completed.

Our experience with running the PX-ECME version of coordinate descent suggests that it typically offers only very modest reductions in the total number of coordinate descent cycles compared to coordinate descent with no parameter expansion (i.e., α(t)=1\alpha^{(t)}=1 for all tt), and the extra computational effort involved in updating α(t)\alpha^{(t)} often nullifies any such reduction. Nevertheless, the “block size” parameter kk can have a positive impact in some settings, so it may be worth exploring a few values of kk to determine whether there are any gains over a “classic” coordinate descent algorithm with no parameter expansion.
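For illustration, the sketch below implements a simplified version of this idea: one full cycle of soft-thresholded coordinate updates on the Pólya-Gamma surrogate (with the weight matrix held fixed over the cycle), followed by a single rescaling of the whole coefficient vector. It does not use the exact expressions for VjV_{j}, UjU_{j}, and λ~\tilde{\lambda} given in Appendix C, and all function and variable names are ours.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cd_px_cycle(beta, X, y, m, s, lam1, lam2):
    """One coordinate-descent cycle on the PG surrogate, then a global rescaling step."""
    u = y - m / 2.0
    eta = X @ beta
    eta_safe = np.where(np.abs(eta) < 1e-8, 1.0, eta)
    w = np.where(np.abs(eta) < 1e-8, m / 4.0, m * np.tanh(eta_safe / 2.0) / (2.0 * eta_safe))
    H = X.T @ ((s * w)[:, None] * X)            # X' S W(beta) X, held fixed over the cycle
    b = X.T @ (s * u)                           # X' S u
    theta = beta.copy()
    for j in range(X.shape[1]):
        # partial residual for coordinate j under the quadratic surrogate
        r_j = b[j] - H[j] @ theta + H[j, j] * theta[j]
        theta[j] = np.sign(r_j) * max(abs(r_j) - lam1, 0.0) / (H[j, j] + lam2)
    def pen_loglik(bvec):                       # elastic-net penalized weighted log-likelihood
        lin = X @ bvec
        return (np.sum(s * (y * lin - m * np.logaddexp(0.0, lin)))
                - lam1 * np.sum(np.abs(bvec)) - 0.5 * lam2 * (bvec @ bvec))
    # rescale the whole vector after the cycle (the "alpha" update of Section 5)
    alpha = minimize_scalar(lambda a: -pen_loglik(a * theta)).x
    return alpha * theta
```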

6 Simulations

6.1 A Weighted Logistic Regression Example

In the standard logistic regression setting where all the weights sis_{i} are equal, the usual starting value for the Newton-Raphson algorithm (i.e., 𝜷(0)=𝟎\mbox{\boldmath$\beta$}^{(0)}=\mathbf{0}) is generally quite robust, and Newton-Raphson started from this value typically produces iterates which converge to the maximum likelihood estimates of the regression coefficients. However, in many weighted logistic regression problems where there is large variability in the weights, the Newton-Raphson algorithm can fail even when started from 𝜷(0)=𝟎\mbox{\boldmath$\beta$}^{(0)}=\mathbf{0}. To illustrate this, we consider the following small numerical example with responses yiy_{i}, scalar covariates xix_{i}, and regression weights sis_{i}:

(y1,,y7)\displaystyle(y_{1},\ldots,y_{7}) =\displaystyle= (1,0,1,1,1,0,1)\displaystyle(1,0,1,1,1,0,1)
(x1,,x7)\displaystyle(x_{1},\ldots,x_{7}) =\displaystyle= (0,0,0.001,100,1,1,0.5)\displaystyle(0,0,0.001,100,-1,-1,0.5)
(s1,,s7)\displaystyle(s_{1},\ldots,s_{7}) =\displaystyle= (0.4,0.01,0.4,0.01,0.04,0.1,0.04).\displaystyle(0.4,0.01,0.4,0.01,0.04,0.1,0.04). (24)

The performance of Newton-Raphson, PX-ECME, and EM using the data and regression weights in (24) and with no penalty term is summarized in Table 1. We ran each method for 63 iterations because this was the number of iterations required for PX-ECME to converge when using the stopping criterion j=1p(βj(t+1)−βj(t))2<10−9\sqrt{\sum_{j=1}^{p}(\beta_{j}^{(t+1)}-\beta_{j}^{(t)})^{2}}<10^{-9}. As shown in Table 1, the Newton-Raphson procedure diverges even when using the starting values 𝜷(0)=𝟎\mbox{\boldmath$\beta$}^{(0)}=\mathbf{0}. While the first iteration of Newton-Raphson improves the value of the weighted log-likelihood, the subsequent iterates quickly diverge. In contrast to Newton-Raphson, the PX-ECME algorithm provides stable, monotone convergence to the weighted maximum likelihood estimates (β^0,β^1)=(4.39,5.30)(\hat{\beta}_{0},\hat{\beta}_{1})=(4.39,5.30). While the EM algorithm also enjoys monotone convergence, its convergence is considerably slower than that of PX-ECME, and it takes 419 iterations for EM to converge when using the same convergence tolerance as PX-ECME.

Newton-Raphson PX-ECME EM
k β0(k)\beta_{0}^{(k)} β1(k)\beta_{1}^{(k)} o,𝐬(k)\ell_{o,\mathbf{s}}^{(k)} β0(k)\beta_{0}^{(k)} β1(k)\beta_{1}^{(k)} o,𝐬(k)\ell_{o,\mathbf{s}}^{(k)} β0(k)\beta_{0}^{(k)} β1(k)\beta_{1}^{(k)} o,𝐬(k)\ell_{o,\mathbf{s}}^{(k)}
0 0.0 0.0 -0.6931 0.0 0.0 -0.6931 0.0 0.0 -0.6931
1 4.26 1.97 -0.2972 1.55 0.01 -0.3611 1.55 0.01 -0.3611
2 -79.22 -10.17 -80.4768 2.09 0.02 -0.3441 1.85 0.01 -0.3471
3 1.87×10161.87\times 10^{16} 4.31×10154.31\times 10^{15} 1.62×1015-1.62\times 10^{15} 2.10 0.03 -0.3432 1.97 0.02 -0.3441
4 9.60×10159.60\times 10^{15} 4.43×10154.43\times 10^{15} 6.13×1014-6.13\times 10^{14} 2.10 0.03 -0.3424 2.03 0.03 -0.3429
5 4.88×10144.88\times 10^{14} 4.56×10154.56\times 10^{15} 1.67×1014-1.67\times 10^{14} 2.10 0.05 -0.3414 2.05 0.04 -0.3420
6 5.28×10145.28\times 10^{14} 4.51×10154.51\times 10^{15} 1.64×1014-1.64\times 10^{14} 2.11 0.06 -0.3404 2.07 0.05 -0.3410
7 5.68×10145.68\times 10^{14} 4.46×10154.46\times 10^{15} 1.61×1014-1.61\times 10^{14} 2.12 0.07 -0.3392 2.07 0.06 -0.3400
8 6.08×10146.08\times 10^{14} 4.42×10154.42\times 10^{15} 1.58×1014-1.58\times 10^{14} 2.13 0.09 -0.3377 2.08 0.08 -0.3388
9 6.48×10146.48\times 10^{14} 4.37×10154.37\times 10^{15} 1.55×1014-1.55\times 10^{14} 2.14 0.11 -0.3360 2.08 0.09 -0.3373
10 6.88×10146.88\times 10^{14} 4.33×10154.33\times 10^{15} 1.52×1014-1.52\times 10^{14} 2.15 0.14 -0.3339 2.08 0.11 -0.3357
63 6.71×1015-6.71\times 10^{15} 2.51×10152.51\times 10^{15} 5.95×1015-5.95\times 10^{15} 4.39 5.30 -0.1376 4.01 4.83 -0.1386
Table 1: Comparison of Newton-Raphson, PX-ECME, and EM using the data and regression weights provided in (24) and with initial values of (β0(0),β1(0))=(0.0,0.0)(\beta_{0}^{(0)},\beta_{1}^{(0)})=(0.0,0.0) for all methods. Values of the regression coefficients β0(k),β1(k)\beta_{0}^{(k)},\beta_{1}^{(k)} and the associated weighted log-likelihood values o,𝐬(k)\ell_{o,\mathbf{s}}^{(k)} are shown for each method and iterations k=0,1,,10k=0,1,...,10 and k=63k=63. Iterates from the Newton-Raphson algorithm do not converge while both PX-ECME and EM converge to the maximum likelihood estimates of the regression coefficients albeit with notably slower convergence from the EM algorithm.
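The comparison in Table 1 can be reproduced with a few lines of code. The sketch below codes the data in (24), the Pólya-Gamma EM update 𝜷EM=(𝐗T𝐒𝐖(𝜷)𝐗)−1𝐗T𝐒𝐮\mbox{\boldmath$\beta$}^{EM}=(\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\mbox{\boldmath$\beta$})\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{S}\mathbf{u} for the unpenalized weighted log-likelihood, and the PX-ECME rescaling step; the use of scipy's minimize_scalar for the one-dimensional search is our implementation choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.array([1, 0, 1, 1, 1, 0, 1], dtype=float)
x = np.array([0.0, 0.0, 0.001, 100.0, -1.0, -1.0, 0.5])
s = np.array([0.4, 0.01, 0.4, 0.01, 0.04, 0.1, 0.04])
X = np.column_stack([np.ones_like(x), x])            # intercept plus the scalar covariate
m = np.ones_like(y)
u = y - m / 2.0

def wloglik(beta):
    eta = X @ beta
    return np.sum(s * (y * eta - m * np.logaddexp(0.0, eta)))

def em_update(beta):
    eta = X @ beta
    eta_safe = np.where(np.abs(eta) < 1e-8, 1.0, eta)
    w = np.where(np.abs(eta) < 1e-8, m / 4.0, m * np.tanh(eta_safe / 2.0) / (2.0 * eta_safe))
    return np.linalg.solve(X.T @ ((s * w)[:, None] * X), X.T @ (s * u))

beta_em, beta_px = np.zeros(2), np.zeros(2)
for t in range(63):
    beta_em = em_update(beta_em)                      # plain EM
    step = em_update(beta_px)                         # PX-ECME: EM step ...
    rho = minimize_scalar(lambda r: -wloglik(r * step)).x
    beta_px = rho * step                              # ... followed by the optimal rescaling
print(beta_px, wloglik(beta_px))   # approaches (4.39, 5.30) with weighted log-likelihood near -0.138
print(beta_em, wloglik(beta_em))   # EM after 63 iterations, roughly (4.01, 4.83) as in Table 1
```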

6.2 The Kyphosis Data

In this simulation study, we utilized the Kyphosis data described in Chambers & Hastie (1992) (pg. 200) and also considered in Liu et al. (1998). This dataset contains 81 observations from children who had undergone corrective spinal surgery. The binary outcome indicates whether or not kyphosis was present after the operation, and the following three additional covariates were also recorded: the age of the child in months, the number of vertebrae, and the number of the topmost vertebra involved in the surgery. For this simulation study, we did not use the observed binary outcomes but instead only used the observed covariates and simulated binary responses. Similar to the simulation described in Liu et al. (1998), we generated pseudo-binary outcomes yiy_{i} assuming P(yi=1)=1/[1+exp(−3xi,num+xi,start)]P(y_{i}=1)=1/[1+\exp(-3x_{i,num}+x_{i,start})], where xi,numx_{i,num} and xi,startx_{i,start} denote the number of vertebrae and the number of the topmost vertebra for the ithi^{th} child respectively. We generated 500 datasets in this manner so that each dataset contained the 3 covariate vectors from the original Kyphosis data and the vector of simulated pseudo-outcomes. When estimating the parameters of this model, we did not include a penalty term, and we included an intercept term so that, in total, four regression coefficients needed to be estimated.

Method Number of iterations Timing logL(θ^)\log L(\hat{\theta})
median mean std. dev. median mean std. dev. mean
EM 424 756.96 1099.89 0.0215 0.0410 0.0671 -10.524639
PX-ECME 49 51.61 19.43 0.0050 0.0061 0.0058 -10.524639
MM 2242 7079.57 18123.52 0.0765 0.2577 0.6738 -10.524639
PX-MM 139 177.88 140.69 0.0130 0.0181 0.0170 -10.524639
NR 10 10.14 1.14 0.0010 0.0015 0.0014 -10.524639
AA1 59 71.53 49.86 0.0040 0.0050 0.0036 -10.524639
Table 2: Simulation results based on the Kyphosis data with simulated outcomes. The median, mean, and standard deviation of the number of iterations required to converge and of the timing (in seconds) across simulation replications are presented. The mean value of the log-likelihood at convergence is also shown. Methods evaluated include the EM algorithm, the PX-ECME algorithm, the MM algorithm described in Section 4.3, the parameter-expanded MM (PX-MM) algorithm (Algorithm 3), the Newton-Raphson (NR) algorithm, and the order-1 Anderson acceleration of EM described in Section 3.3.

Using these 500 simulated datasets, we compared the following methods for computing the four regression coefficient estimates: the EM algorithm, Newton-Raphson, the PX-ECME algorithm (Algorithm 1), the MM algorithm of Section 4.3, the PX-MM algorithm (Algorithm 3), and the order-1 Anderson acceleration of EM (AA1). For each method, all regression coefficients were initialized at 0, and the stopping criterion j=1p(βj(t+1)−βj(t))2<10−7\sqrt{\sum_{j=1}^{p}(\beta_{j}^{(t+1)}-\beta_{j}^{(t)})^{2}}<10^{-7} was used.

The performance of these procedures is summarized in Table 2. As shown there, the Newton-Raphson procedure performed best, requiring a median of only 10 iterations and a median of 0.001 seconds to converge. PX-ECME was the second-fastest algorithm in terms of the number of iterations, requiring a median of 49 iterations for convergence. Compared to the EM algorithm, this was a roughly ten-fold reduction in the median number of iterations. PX-ECME and AA1 were quite close in performance: AA1 required a few more iterations than PX-ECME but had somewhat better timing due to its mildly cheaper per-iteration cost. A notable result from the Kyphosis data simulations was that PX-MM was substantially faster than the MM algorithm. Compared to the MM algorithm, PX-MM had a roughly 16-fold reduction in the median number of iterations and a nearly six-fold reduction in the median time to converge.

6.3 Simulated Outcomes with Autocorrelated Covariates

For this simulation study, we generated the elements xi,jx_{i,j} of an n×pn\times p design matrix 𝐗\mathbf{X} using an autoregressive process of order 1 with autocorrelation parameter ρ0\rho\geq 0. Specifically, for each ii, the xi,jx_{i,j} were generated as xi,1=1x_{i,1}=1, and

xi,j=ρxi,j1+εj, for j=3,,p,x_{i,j}=\rho x_{i,j-1}+\varepsilon_{j},\qquad\textrm{ for }j=3,\ldots,p,

where xi,2∼Normal(0,1)x_{i,2}\sim\textrm{Normal}(0,1) and εj∼Normal(0,1)\varepsilon_{j}\sim\textrm{Normal}(0,1). This simulation design implies the correlation between xi,jx_{i,j} and xi,j−1x_{i,j-1} is approximately ρ\rho for any j≥3j\geq 3. The regression coefficients β1,…,βp\beta_{1},\ldots,\beta_{p} were generated independently as βj=ZjTj\beta_{j}=Z_{j}T_{j}, where Zj∼Bernoulli(0.75)Z_{j}\sim\textrm{Bernoulli}(0.75) and TjT_{j} follows a t distribution with 33 degrees of freedom. Given the generated covariate vectors 𝐱1,…,𝐱n\mathbf{x}_{1},\ldots,\mathbf{x}_{n} and regression coefficient vector 𝜷\beta, the responses yiy_{i} were then generated independently as yi∼Bernoulli({1+exp(−𝐱iT𝜷)}−1)y_{i}\sim\textrm{Bernoulli}\big{(}\{1+\exp(-\mathbf{x}_{i}^{T}\mbox{\boldmath$\beta$})\}^{-1}\big{)}.
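A minimal NumPy sketch of this data-generating process is given below; the random seed and the particular (n,p,ρ)(n,p,\rho) setting shown are arbitrary choices for illustration.

```python
import numpy as np

def simulate_dataset(n, p, rho, rng):
    """Generate one dataset under the autocorrelated-covariate design of Section 6.3."""
    X = np.ones((n, p))
    X[:, 1] = rng.standard_normal(n)                 # x_{i,2} ~ Normal(0, 1)
    for j in range(2, p):                            # columns x_{i,3}, ..., x_{i,p}
        X[:, j] = rho * X[:, j - 1] + rng.standard_normal(n)
    beta = rng.binomial(1, 0.75, size=p) * rng.standard_t(df=3, size=p)   # beta_j = Z_j * T_j
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta))).astype(float)
    return X, y, beta

rng = np.random.default_rng(1)                       # seed chosen arbitrarily
X, y, beta_true = simulate_dataset(n=500, p=50, rho=0.9, rng=rng)
```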

We considered two choices for the number of observations: n{500,2000}n\in\{500,2000\}, three settings for the number of covariates: p{5,50,200}p\in\{5,50,200\}, and three values for the autocorrelation parameter ρ{0,0.9,0.99}\rho\in\{0,0.9,0.99\}. For each of the 1818 simulation settings (i.e., each setting of nn, pp, and ρ\rho), we ran each procedure for computing the maximum likelihood estimate of 𝜷\beta with no penalty term on 5050 simulated datasets. For all methods considered, we used a maximum of 100,000100,000 iterations.

Table 3 shows summary performance measures for the methods used to estimate the logistic regression coefficients in these simulation scenarios. With the exception of Newton-Raphson, each procedure displayed in Table 3 is a monotone algorithm, as the steplengths in gradient descent and in the GPX-ECME algorithm (Algorithm 2) were chosen to guarantee an increase in the log-likelihood at each iteration. The GPX-ECME algorithm used here chooses 𝐇(t)=κt−1𝐈p−𝐗T𝐒𝐖(𝜷(t))𝐗\mathbf{H}^{(t)}=\kappa_{t}^{-1}\mathbf{I}_{p}-\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\mbox{\boldmath$\beta$}^{(t)})\mathbf{X} (where κt−1\kappa_{t}^{-1} is the maximum eigenvalue of 𝐗T𝐒𝐖(𝜷(t))𝐗\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\mbox{\boldmath$\beta$}^{(t)})\mathbf{X}) so that the GPX-ECME update is simply the gradient descent update multiplied by the scalar which maximizes the observed-data log-likelihood. The summary measures shown in Table 3 are aggregated across all 18 simulation settings.

For the results shown in the top half of Table 3, the initial value 𝜷(0)\mbox{\boldmath$\beta$}^{(0)} was set to the zero vector. While Newton-Raphson has the best overall performance when setting 𝜷(0)=𝟎\mbox{\boldmath$\beta$}^{(0)}=\mathbf{0}, PX-ECME shows clear advantages over the EM algorithm and the MM and PX-MM algorithms. Indeed, PX-ECME shows a more than ten-fold reduction relative to EM in the median number of iterations required to converge and a more than three-fold reduction relative to PX-MM. When compared with EM, PX-ECME also reduced the number of cases with very long convergence times, which can be seen from the fact that the reduction in the mean number of iterations is even larger than the reduction in the median. While each iteration of PX-ECME requires moderately more computational time than EM, the timing comparisons show that the median time to convergence of PX-ECME is roughly one-seventh that of the EM algorithm.

The lower half of Table 3 shows the performance of each method when the components of the initial vector 𝜷(0)\mbox{\boldmath$\beta$}^{(0)} are drawn from a standard normal distribution. With these random starting values, all of the methods except Newton-Raphson remained robust, performing very similarly to the simulation runs where 𝜷(0)=𝟎\mbox{\boldmath$\beta$}^{(0)}=\mathbf{0}. Newton-Raphson, however, was very erratic when the initial values were not set to zero, failing to converge in the majority of simulation runs and demonstrating its considerable sensitivity to the choice of starting values. As with the simulation runs with 𝜷(0)=𝟎\mbox{\boldmath$\beta$}^{(0)}=\mathbf{0}, PX-ECME with random starting values provides a roughly ten-fold improvement in convergence speed over the EM algorithm.

Method Number of iterations Timing logL(θ^)\log L(\hat{\theta})
median mean std. dev. median mean std. dev. mean
initial values of zero:
EM 625 4509.12 13224.93 0.2395 5.5697 23.0756 -210.230466
PX-ECME 48 130.98 315.16 0.0340 0.1980 0.9346 -210.230148
MM 3762 24038.59 36171.98 0.3440 3.3383 6.4779 -210.277489
PX-MM 161 1248.01 5258.13 0.0860 0.8922 6.1386 -210.230148
NR 10 10.51 2.92 0.0060 0.0129 0.0180 -210.230148
GD 33917 48948.32 45435.61 8.6060 44.8929 77.4781 -213.906704
GPX-ECME 13178 38945.89 43830.79 7.3200 54.3209 93.0919 -211.519659
GDBT 9766 39079.08 44644.20 4.0385 44.5298 81.2833 -211.461741
AA1 42 89.09 370.13 0.0200 0.1109 0.8442 -210.230148
random initial values:
EM 596 4431.32 12871.93 0.2480 5.6423 24.3456 -203.681932
PX-ECME 50 122.06 193.26 0.0365 0.1649 0.3405 -203.681930
MM 3680 24055.37 35992.27 0.4185 3.1970 6.0510 -203.720918
PX-MM 158 998.80 2754.27 0.0900 0.6194 2.1296 -203.681930
NR 100000 87223.14 33400.38 39.4695 120.7834 146.0996 -4.02 ×1016\times 10^{16}
GD 31308 48637.68 45354.11 8.8395 45.1825 78.0428 -207.656757
GPX-ECME 13901 39070.57 43749.47 7.6315 53.8382 91.0420 -205.777709
GDBT 9094 39178.89 44816.03 3.9850 44.9879 81.2179 -204.791037
AA1 46 80.60 172.71 0.0210 0.0917 0.4262 -203.681930
Table 3: Results for simulated outcomes with autocorrelated covariates. PX-MM, NR, and AA1 denote the parameter-expanded MM algorithm (Algorithm 3), Newton-Raphson, and order-1 Anderson acceleration respectively. GD denotes gradient descent, and GDBT denotes gradient descent with backtracking. The GPX-ECME algorithm (Algorithm 2) used here chooses 𝐇(t)\mathbf{H}^{(t)} so that the parameter update in each iteration is found by multiplying the gradient descent update by an optimal scalar. Each method was stopped after 100,000100,000 iterations.
Newton-Raphson resulted in numerical errors in 1616 out of 900900 simulation runs; these runs were discarded when tabulating the summary measures of performance.

6.4 The Madelon Data

In this simulation study, we consider a weighted logistic regression example with both an L1L_{1} penalty (i.e., Pη(𝜷)=η∑j=1p|βj|P_{\eta}(\mbox{\boldmath$\beta$})=\eta\sum_{j=1}^{p}|\beta_{j}|) and an L2L_{2} penalty function (i.e., Pη(𝜷)=η2∑j=1pβj2P_{\eta}(\mbox{\boldmath$\beta$})=\frac{\eta}{2}\sum_{j=1}^{p}\beta_{j}^{2}). We used the Madelon dataset (Guyon et al. (2004)), which is available for download from the UCI machine learning repository. This dataset has n=2600n=2600 observations and p=500p=500 covariates. While the same Madelon dataset was used for all simulation replications, a different set of weights s1,…,s2600s_{1},\ldots,s_{2600} was drawn for each of 10 simulation replications. The weights sis_{i} were sampled independently from an exponential distribution with rate parameter 1. We generated 10 sets of weights for both the L1L_{1}- and L2L_{2}-penalized cases.

For each of the ten sets of weights, we obtained parameter estimates 𝜷^(ηk)\hat{\mbox{\boldmath$\beta$}}(\eta_{k}) across a decreasing sequence η1>η2>…>η9\eta_{1}>\eta_{2}>\ldots>\eta_{9} of nine tuning parameters. The same values of ηk\eta_{k} were used for both the L1L_{1}- and L2L_{2}-penalized simulation runs. These values were η1=5000\eta_{1}=5000, η2=1000\eta_{2}=1000, η3=500\eta_{3}=500, η4=200\eta_{4}=200, η5=50\eta_{5}=50, η6=10\eta_{6}=10, η7=2\eta_{7}=2, η8\eta_{8}, and η9=0.1\eta_{9}=0.1. For a given set of weights, the parameter estimate for the previous value of η\eta was used as the starting value for the subsequent value of η\eta; that is, 𝜷^(ηk)\hat{\mbox{\boldmath$\beta$}}(\eta_{k}) was used as the starting value for each method when η\eta was equal to ηk+1\eta_{k+1}. The following methods were evaluated for the L1L_{1}-penalized case: coordinate descent with EM-based weights, coordinate descent with Newton-Raphson-based weights, proximal gradient descent (PGD) with a fixed steplength, parameter-expanded proximal gradient descent (GPX-ECME), and proximal gradient descent with backtracking (PGDBT). The following methods were evaluated for the L2L_{2}-penalized case: EM, PX-ECME, MM, PX-MM, Newton-Raphson, fixed-steplength gradient descent, GPX-ECME, gradient descent with backtracking, and order-1 Anderson acceleration. For the L1L_{1}-penalized simulations, we set a maximum of 100,000 iterations for the non-coordinate descent methods and a maximum of 1,000 iterations for the coordinate descent methods (one iteration in this case is a full cycle that updates all of the regression coefficients), and for the L2L_{2}-penalized simulations, we set a maximum of 10,000 iterations.
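The warm-starting scheme can be sketched as a simple loop over the penalty sequence. In the snippet below, the simulated (X, y) from the Section 6.3 sketch stand in for the Madelon data, the px_mm_ridge function from the sketch following Algorithm 3 plays the role of the L2L_{2}-penalized solver, and η8=0.5\eta_{8}=0.5 is a placeholder for the unlisted eighth tuning value; none of these stand-ins are part of the actual study.

```python
import numpy as np

rng = np.random.default_rng(2)
s = rng.exponential(scale=1.0, size=X.shape[0])      # rate-1 exponential observation weights
m = np.ones(X.shape[0])
etas = [5000.0, 1000.0, 500.0, 200.0, 50.0, 10.0, 2.0, 0.5, 0.1]   # eta_8 = 0.5 is a placeholder
beta_hat, path = np.zeros(X.shape[1]), {}
for eta in etas:
    # warm start: the estimate at the previous eta initializes the fit at the next eta
    beta_hat = px_mm_ridge(X, y, m, s, lam=eta, beta0=beta_hat)
    path[eta] = beta_hat.copy()
```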

The Madelon data simulation results are shown in Table 4. Note that the summary measures for the number of iterations and timings are aggregated across all 10 replications and all values of η\eta and are not separated by value of η\eta. In the L1L_{1}-penalized simulations, the parameter-expanded version (GPX-ECME) of proximal gradient descent (PGD) performs well in terms of both timing and the final value of the objective function. Specifically, the timing required by GPX-ECME was not much greater than that of PGD, while the average value of the objective function achieved by GPX-ECME at the final iteration was much better than that achieved by PGD. While proximal gradient descent with backtracking achieved better log-likelihood values after 100,000 iterations, this came at the cost of much slower speed than either PGD or GPX-ECME.

In the L2L_{2}-penalized simulations, Newton-Raphson had the smallest median number of iterations needed for convergence. However, of the 90 simulation runs, there were 14 runs where Newton-Raphson diverged, which inflated its mean timing and mean number of iterations. In contrast to Newton-Raphson, both EM and MM had more robust performance without sacrificing much computational speed relative to Newton-Raphson. Both PX-ECME and PX-MM delivered speed advantages over the EM and MM algorithms respectively, but the additional gains in computational speed were, overall, quite modest. This seems to be an example where both EM and MM are quite fast relative to Newton-Raphson (when Newton-Raphson converges), and hence, it is difficult for PX-ECME or PX-MM to make substantial improvements. The strong performance of order-1 Anderson acceleration (AA1) in the L2L_{2}-penalized simulations is also worth noting. In addition to demonstrating considerable robustness by converging to the maximizer of the objective function in every simulation run, AA1 was somewhat faster than PX-ECME and was even competitive with Newton-Raphson in median timing.

Method Number of iterations Timing logL(θ^)\log L(\hat{\theta})
median mean std. dev. median mean mean not converged
L1L_{1} penalty:
Coord Desc(EM) 1000 1000 0.00 2212.20 2212.06 -1610.69 90
Coord Desc(NR) 1000 1000 0.00 2737.15 2737.40 -1580.09 90
PGD 100000 73635.87 35608.04 852.77 630.84 -1573.44 49
GPX-ECME 100000 75781.66 33794.31 1268.08 964.46 -1515.11 51
PGDBT 100000 72641.41 40250.80 28457.42 20723.50 -1483.10 60
L2L_{2} penalty:
EM 31 30.76 4.25 9.61 9.52 -1200.41 0
PX-ECME 24 23.50 3.27 7.58 7.39 -1200.41 0
MM 56 55.04 9.92 0.89 0.89 -1200.41 0
PX-MM 43 41.97 8.30 0.99 1.00 -1200.41 0
NR 9 1563.11 3641.38 5.03 874.79 -8.05 ×1013\times 10^{13} 14
GD 10000 10000 0 2696.13 2697.28 -1664.13 90
GPX-ECME 10000 10000 0 2735.57 2737.46 -1530.06 90
GDBT 10000 10000 0 2773.56 2782.01 -1597.34 90
AA1 18 18.06 2.30 5.66 5.63 -1200.41 0
Table 4: Results from the madelon simulation study. Ten different weighted log-likelihood functions were tested using 1010 different sets of exponentially distributed weights, and, for each set of weights, each method was run for 99 different choices of a penalty parameter. Methods shown include coordinate descent with EM or Newton-Raphson (NR) weights, proximal gradient descent (PGD), proximal gradient descent with backtracking (PGDBT), and order-1 Anderson acceleration(AA1). For each method, the number of simulation runs (out of 90 simulation runs in total) where the method did not converge is also reported.

7 Discussion

In this article, we have proposed and investigated the performance of an accelerated version of a Pólya-Gamma-based EM algorithm for weighted penalized logistic regression. Our method provides a useful alternative algorithm that is stable, efficient, directly implementable, and provides monotone convergence regardless of the weights or starting values chosen by the user. We established that our PX-ECME algorithm has a convergence rate guaranteed to be at least as fast as that of the original Pólya-Gamma EM algorithm, and empirically, across a range of different simulation examples, our PX-ECME algorithm accelerates convergence roughly by a factor of ten when compared to the EM algorithm. We also proposed a generalized version of the Pólya-Gamma EM algorithm and explored a PX-ECME version of this generalized EM algorithm. This generalization of EM can be useful in certain high-dimensional contexts where performing the full M-step in every iteration of the EM algorithm can become computationally expensive. Our generalized EM algorithm includes both proximal gradient descent and the logistic regression MM algorithm as special cases, and in our simulations, we showed that the parameter-expanded version of the MM algorithm can have substantially faster convergence than the corresponding MM algorithm.

While Newton-Raphson typically converges very quickly in cases where it is well-behaved, a chief motivation behind the development of the PX-ECME algorithm was to explore an alternative optimization procedure which has relatively fast convergence yet still has stable, robust performance regardless of the choice of starting values and weights. Improving upon the robustness of Newton-Raphson becomes a more salient issue in cases where one is optimizing a weighted log-likelihood with highly variable weights. Indeed, the first example in Section 6 shows a simple case where Newton-Raphson diverges even when the initial values are set to the typically robust starting value of zero for all parameters, whereas both EM and PX-ECME converge to the correct value. In addition to greater robustness and stability, another advantage is that the EM algorithm framework can often be more easily adapted to generate stable, straightforward parameter updates in more complex variations of a logistic regression model where there are additional latent variables in the probability model of interest. A common example of this is when one posits an additional probability model for the covariates in order to perform parameter estimation in the context of missing data. In Appendix A, we describe a PX-ECME algorithm for one example of a missing-data model, and we demonstrate how the PX-ECME parameter updates for this missing-data model are very similar to the PX-ECME updates for fully observed covariates.

While PX-ECME is a straightforward modification of EM that substantially boosts convergence speed, it is certainly possible that applying other monotone acceleration schemes to the PX-ECME iteration could improve convergence speed even further. Specifically, “off-the-shelf” acceleration schemes, which directly accelerate the convergence of an iterative sequence without modifying the “base” optimization algorithm in any way, could be applied directly to the iterates of the PX-ECME algorithm. One such approach would be to apply the monotone order-1 Anderson acceleration to the PX-ECME updates rather than the EM updates as was done in Section 3.3. This would likely provide further speed gains in addition to those of PX-ECME over EM. It would also be of interest to explore the performance of applying other, higher-order off-the-shelf acceleration schemes to iterates of the PX-ECME algorithm, such as the SQUAREM procedure (Varadhan & Roland (2008)) or the quasi-Newton acceleration scheme of Zhou et al. (2011).

The PX-ECME parameter update corresponds to multiplying the EM parameter update by a single scalar, where the scalar is chosen to maximize the observed-data log-likelihood. This can be thought of as finding the parameter update by searching across parameter updates which are a scalar multiple of the EM update. This type of “multiplicative search” was chosen because it fits naturally into the parameter-expansion framework and because a similar type of multiplicative update has been found to work well in the context of accelerating EM algorithms for probit regression. Though not explored here, it is likely that alternative approaches based on using the EM algorithm search direction could deliver relatively similar performance to our PX-ECME algorithm. For example, searching across parameter updates that are linear combinations of the previous parameter value and the EM update could be an effective alternative strategy.

SUPPLEMENTARY MATERIAL

R-package:

An R-package pxlogistic implementing the methods described in the article is available at https://github.com/nchenderson/pxlogistic

Appendix A A PX-ECME algorithm with Missing Covariates

As an example of a direct EM algorithm that can incorporate modeling of missing covariates, we consider a setting similar to that described in Ibrahim (1990). Specifically, we consider the case where all of the covariates are binary. To develop an EM algorithm in this context, we let 𝐱ic\mathbf{x}_{ic} denote the “complete version” of the vector 𝐱i\mathbf{x}_{i}, where we allow for the possibility that 𝐱i\mathbf{x}_{i} can contain missing values. We can express 𝐱ic=(𝐱i,obs,𝐱i,mis)\mathbf{x}_{ic}=(\mathbf{x}_{i,obs},\mathbf{x}_{i,mis}), where 𝐱i,obs\mathbf{x}_{i,obs} is the collection of observed values from 𝐱ic\mathbf{x}_{ic} and 𝐱i,mis\mathbf{x}_{i,mis} is the collection of missing values from 𝐱ic\mathbf{x}_{ic}.


The assumed joint distribution for the covariate vector 𝐱ic\mathbf{x}_{ic} is given by

p(𝐱ic|𝜸)=k=12p1γkIk(𝐱ic),p(\mathbf{x}_{ic}|\mbox{\boldmath$\gamma$})=\prod_{k=1}^{2^{p-1}}\gamma_{k}^{I_{k}(\mathbf{x}_{ic})}, (25)

where k=1,,2p1k=1,\ldots,2^{p-1} indexes the 2p12^{p-1} possible observed values of 𝐱ic=(1,𝐳icT)T\mathbf{x}_{ic}=(1,\mathbf{z}_{ic}^{T})^{T}, where 𝐳ic{0,1}p1\mathbf{z}_{ic}\in\{0,1\}^{p-1}. In (25), Ik(𝐱ic)=1I_{k}(\mathbf{x}_{ic})=1 if 𝐱ic\mathbf{x}_{ic} equals the kthk^{th} possible observed value of 𝐱ic\mathbf{x}_{ic} and equals 0 otherwise. Note that this missing data model is only useful in practice when pp is relatively small.


Using similar notation to that used in our main manuscript, we consider the following Pólya-Gamma representation of the logistic regression model

yi|Wi,𝐱ic\displaystyle y_{i}|W_{i},\mathbf{x}_{ic} \displaystyle\sim Binomial(mi,{1+exp(𝐱icT𝜷)}1)\displaystyle\textrm{Binomial}\Big{(}m_{i},\{1+\exp(-\mathbf{x}_{ic}^{T}\mbox{\boldmath$\beta$})\}^{-1}\Big{)}
Wi|𝐱ic\displaystyle W_{i}|\mathbf{x}_{ic} \displaystyle\sim PG(mi,𝐱icT𝜷)\displaystyle PG(m_{i},\mathbf{x}_{ic}^{T}\mbox{\boldmath$\beta$})
𝐱ic\displaystyle\mathbf{x}_{ic} \displaystyle\sim p(|𝜸),\displaystyle p(\cdot|\mbox{\boldmath$\gamma$}), (26)

where p(|𝜸)p(\cdot|\mbox{\boldmath$\gamma$}) refers to the same probability model stated in (25). We would also assume that the data are missing at random (MAR) so that if 𝐫i\mathbf{r}_{i} denotes the vector of missingness indicators, then

p(𝐱ic|𝐱i,obs,𝐫i,𝜸)=p(𝐱ic|𝐱i,obs,𝜸)=p(𝐱ic|𝜸)∑kp(𝐱ic=𝐝k|𝜸)I(k∈𝒜i),p(\mathbf{x}_{ic}|\mathbf{x}_{i,obs},\mathbf{r}_{i},\mbox{\boldmath$\gamma$})=p(\mathbf{x}_{ic}|\mathbf{x}_{i,obs},\mbox{\boldmath$\gamma$})=\frac{p(\mathbf{x}_{ic}|\mbox{\boldmath$\gamma$})}{\sum_{k}p(\mathbf{x}_{ic}=\mathbf{d}_{k}|\mbox{\boldmath$\gamma$})I(k\in\mathcal{A}_{i})},

where 𝐱i,obs\mathbf{x}_{i,obs} is the collection of observed values from 𝐱i\mathbf{x}_{i}, and 𝒜i\mathcal{A}_{i} is the set of configurations of 𝐱ic\mathbf{x}_{ic} that are consistent with the values in 𝐱i,obs\mathbf{x}_{i,obs}.

Letting 𝐯~={(y1,W1,𝐱1c),,(yn,Wn,𝐱nc)}\tilde{\mathbf{v}}=\{(y_{1},W_{1},\mathbf{x}_{1c}),\ldots,(y_{n},W_{n},\mathbf{x}_{nc})\} denote the “complete data”, the complete-data log-likelihood associated with (26) is

c(𝜷,𝜸|𝐯~)=C+i=1nui𝐱icT𝜷12i=1nWi(𝐱icT𝜷)2+i=1nk=12p1Ik(𝐱i)log(γk),\displaystyle\ell_{c}(\mbox{\boldmath$\beta$},\mbox{\boldmath$\gamma$}|\tilde{\mathbf{v}})=C+\sum_{i=1}^{n}u_{i}\mathbf{x}_{ic}^{T}\mbox{\boldmath$\beta$}-\frac{1}{2}\sum_{i=1}^{n}W_{i}(\mathbf{x}_{ic}^{T}\mbox{\boldmath$\beta$})^{2}+\sum_{i=1}^{n}\sum_{k=1}^{2^{p-1}}I_{k}(\mathbf{x}_{i})\log(\gamma_{k}),

where CC is a term that does not depend on (𝜷,𝜸)(\mbox{\boldmath$\beta$},\mbox{\boldmath$\gamma$}). The “Q-function” associated with c(𝜷,𝜸|𝐯~)\ell_{c}(\mbox{\boldmath$\beta$},\mbox{\boldmath$\gamma$}|\tilde{\mathbf{v}}) is then

Q(𝜷,𝜸|𝜷(t),𝜸(t))\displaystyle Q(\mbox{\boldmath$\beta$},\mbox{\boldmath$\gamma$}|\mbox{\boldmath$\beta$}^{(t)},\mbox{\boldmath$\gamma$}^{(t)}) =\displaystyle= E{c(𝜷,𝜸|𝐯~)𝐲,𝜷(t),𝜸(t)}\displaystyle E\{\ell_{c}(\mbox{\boldmath$\beta$},\mbox{\boldmath$\gamma$}|\tilde{\mathbf{v}})\mid\mathbf{y},\mbox{\boldmath$\beta$}^{(t)},\mbox{\boldmath$\gamma$}^{(t)}\} (27)
=\displaystyle= C+i=1nuiE{𝐱icT𝐲,𝜷(t),𝜸(t)}𝜷\displaystyle C+\sum_{i=1}^{n}u_{i}E\{\mathbf{x}_{ic}^{T}\mid\mathbf{y},\mbox{\boldmath$\beta$}^{(t)},\mbox{\boldmath$\gamma$}^{(t)}\}\mbox{\boldmath$\beta$}
\displaystyle- 12i=1nE{Wi𝜷T𝐱ic𝐱icT𝜷𝐲,𝜷(t),𝜸(t)}+k=12p1log(γk)i=1nIk(𝐱ic)\displaystyle\frac{1}{2}\sum_{i=1}^{n}E\big{\{}W_{i}\mbox{\boldmath$\beta$}^{T}\mathbf{x}_{ic}\mathbf{x}_{ic}^{T}\mbox{\boldmath$\beta$}\mid\mathbf{y},\mbox{\boldmath$\beta$}^{(t)},\mbox{\boldmath$\gamma$}^{(t)}\}+\sum_{k=1}^{2^{p-1}}\log(\gamma_{k})\sum_{i=1}^{n}I_{k}(\mathbf{x}_{ic})
=\displaystyle= C+i=1nui(𝐚i(t))T𝜷12i=1n𝜷T𝐁i(t)𝜷+k=12p1Gk.(t)log(γk),\displaystyle C+\sum_{i=1}^{n}u_{i}(\mathbf{a}_{i}^{(t)})^{T}\mbox{\boldmath$\beta$}-\frac{1}{2}\sum_{i=1}^{n}\mbox{\boldmath$\beta$}^{T}\mathbf{B}_{i}^{(t)}\mbox{\boldmath$\beta$}+\sum_{k=1}^{2^{p-1}}G_{k.}^{(t)}\log(\gamma_{k}),

where 𝐚i(t)\mathbf{a}_{i}^{(t)}, 𝐁i(t)\mathbf{B}_{i}^{(t)}, and Gk.(t)G_{k.}^{(t)} are defined as

𝐚i(t)\displaystyle\mathbf{a}_{i}^{(t)} =\displaystyle= E{𝐱ic𝐲,𝐱i,obs,𝜷(t),𝜸(t)}\displaystyle E\{\mathbf{x}_{ic}\mid\mathbf{y},\mathbf{x}_{i,obs},\mbox{\boldmath$\beta$}^{(t)},\mbox{\boldmath$\gamma$}^{(t)}\}
𝐁i(t)\displaystyle\mathbf{B}_{i}^{(t)} =\displaystyle= E{Wi𝐱ic𝐱icT𝐲,𝐱i,obs,𝜷(t),𝜸(t)}\displaystyle E\{W_{i}\mathbf{x}_{ic}\mathbf{x}_{ic}^{T}\mid\mathbf{y},\mathbf{x}_{i,obs},\mbox{\boldmath$\beta$}^{(t)},\mbox{\boldmath$\gamma$}^{(t)}\}
Gk.(t)\displaystyle G_{k.}^{(t)} =\displaystyle= i=1nE{Ik(𝐱i)𝐲,𝐱i,obs,𝜷(t),𝜸(t)}.\displaystyle\sum_{i=1}^{n}E\{I_{k}(\mathbf{x}_{i})\mid\mathbf{y},\mathbf{x}_{i,obs},\mbox{\boldmath$\beta$}^{(t)},\mbox{\boldmath$\gamma$}^{(t)}\}.

Thus, we can re-write the Q-function (27) as

Q(𝜷,𝜸|𝜷(t),𝜸(t))=C+𝜷T(𝐀(t))T𝐮12𝜷T𝐁(t)𝜷+k=12p1Gk.(t)log(γk),Q(\mbox{\boldmath$\beta$},\mbox{\boldmath$\gamma$}|\mbox{\boldmath$\beta$}^{(t)},\mbox{\boldmath$\gamma$}^{(t)})=C+\mbox{\boldmath$\beta$}^{T}(\mathbf{A}^{(t)})^{T}\mathbf{u}-\frac{1}{2}\mbox{\boldmath$\beta$}^{T}\mathbf{B}^{(t)}\mbox{\boldmath$\beta$}+\sum_{k=1}^{2^{p-1}}G_{k.}^{(t)}\log(\gamma_{k}), (28)

where 𝐀(t)\mathbf{A}^{(t)} is the n×pn\times p matrix whose ithi^{th} row is 𝐚i(t)\mathbf{a}_{i}^{(t)} and 𝐁(t)\mathbf{B}^{(t)} is the p×pp\times p matrix defined as

𝐁(t)=i=1nE{Wi𝐱ic𝐱icT|𝐲,𝜷(t),𝜸(t)}.\mathbf{B}^{(t)}=\sum_{i=1}^{n}E\{W_{i}\mathbf{x}_{ic}\mathbf{x}_{ic}^{T}|\mathbf{y},\mbox{\boldmath$\beta$}^{(t)},\mbox{\boldmath$\gamma$}^{(t)}\}.

Maximizing (28) subject to the constraint that γk0\gamma_{k}\geq 0 and k=12p1γk=1\sum_{k=1}^{2^{p-1}}\gamma_{k}=1 yields the following parameter updates:

𝜷(t+1),EM\displaystyle\mbox{\boldmath$\beta$}^{(t+1),EM} =\displaystyle= [𝐁(t)]1(𝐀(t))T𝐮\displaystyle\big{[}\mathbf{B}^{(t)}\big{]}^{-1}(\mathbf{A}^{(t)})^{T}\mathbf{u}
γk(t+1),EM\displaystyle\gamma_{k}^{(t+1),EM} =\displaystyle= Gk.(t)k=12p1Gk.(t)\displaystyle\frac{G_{k.}^{(t)}}{\sum_{k=1}^{2^{p-1}}G_{k.}^{(t)}}

To complete the description of this EM algorithm, we just need to describe how to compute 𝐀(t)\mathbf{A}^{(t)} and 𝐁(t)\mathbf{B}^{(t)}. To this end, note that

𝐚i(t)=∑k=12p−1pik(t)𝐝k,\mathbf{a}_{i}^{(t)}=\sum_{k=1}^{2^{p-1}}p_{ik}^{(t)}\mathbf{d}_{k},

where 𝐝k\mathbf{d}_{k} is one of the 2p−12^{p-1} possible values of 𝐱ic\mathbf{x}_{ic} and pik(t)p_{ik}^{(t)} is defined as

pik(t)=P(𝐱ic=𝐝k|yi,𝐱i,obs,𝜷(t),𝜸(t))=p(yi|𝐱ic=𝐝k,𝜷(t))γk(t)I(k𝒜i)h𝒜ip(yi|𝐱ic=𝐝h,𝜷(t))γh(t).p_{ik}^{(t)}=P(\mathbf{x}_{ic}=\mathbf{d}_{k}|y_{i},\mathbf{x}_{i,obs},\mbox{\boldmath$\beta$}^{(t)},\mbox{\boldmath$\gamma$}^{(t)})=\frac{p(y_{i}|\mathbf{x}_{ic}=\mathbf{d}_{k},\mbox{\boldmath$\beta$}^{(t)})\gamma_{k}^{(t)}I(k\in\mathcal{A}_{i})}{\sum_{h\in\mathcal{A}_{i}}p(y_{i}|\mathbf{x}_{ic}=\mathbf{d}_{h},\mbox{\boldmath$\beta$}^{(t)})\gamma_{h}^{(t)}}. (29)

In (29), 𝒜i\mathcal{A}_{i} is the set of configurations of 𝐱ic\mathbf{x}_{ic} that is consistent with the observed covariate vector 𝐱i,obs\mathbf{x}_{i,obs}. Regarding 𝐁(t)\mathbf{B}^{(t)}, one can note that

𝐁(t)=∑i=1nE{Wi𝐱ic𝐱icT∣𝐲,𝐱i,obs,𝜷(t),𝜸(t)}=∑i=1n∑k=12p−1pik(t)ω(𝐝kT𝜷(t),mi)𝐝k𝐝kT,\mathbf{B}^{(t)}=\sum_{i=1}^{n}E\{W_{i}\mathbf{x}_{ic}\mathbf{x}_{ic}^{T}\mid\mathbf{y},\mathbf{x}_{i,obs},\mbox{\boldmath$\beta$}^{(t)},\mbox{\boldmath$\gamma$}^{(t)}\}=\sum_{i=1}^{n}\sum_{k=1}^{2^{p-1}}p_{ik}^{(t)}\,\omega(\mathbf{d}_{k}^{T}\mbox{\boldmath$\beta$}^{(t)},m_{i})\,\mathbf{d}_{k}\mathbf{d}_{k}^{T},

where pik(t)p_{ik}^{(t)} is as defined in (29).

One option for a PX-ECME version of this EM algorithm is to define γk(t+1)=γk(t+1),EM\gamma_{k}^{(t+1)}=\gamma_{k}^{(t+1),EM} and 𝜷(t+1)=ρ(t+1)𝜷(t+1),EM\mbox{\boldmath$\beta$}^{(t+1)}=\rho^{(t+1)}\mbox{\boldmath$\beta$}^{(t+1),EM} where ρ(t+1)\rho^{(t+1)} is chosen to maximize the following “observed” log-likelihood

o(ρ𝜷(t+1),EM)\displaystyle\ell_{o}(\rho\mbox{\boldmath$\beta$}^{(t+1),EM}) =\displaystyle= i=1nlog(p(yi|𝐱i,obs))\displaystyle\sum_{i=1}^{n}\log\Big{(}p(y_{i}|\mathbf{x}_{i,obs})\Big{)}
=\displaystyle= ∑i=1nlog(∑k=12p−1p(yi|𝐱ic=𝐝k,𝜷=ρ𝜷(t+1),EM)p(𝐱ic=𝐝k|𝐱i,obs,𝜸(t+1),EM)I(k∈𝒜i)).\sum_{i=1}^{n}\log\Bigg{(}\sum_{k=1}^{2^{p-1}}p(y_{i}|\mathbf{x}_{ic}=\mathbf{d}_{k},\mbox{\boldmath$\beta$}=\rho\mbox{\boldmath$\beta$}^{(t+1),EM})p(\mathbf{x}_{ic}=\mathbf{d}_{k}|\mathbf{x}_{i,obs},\mbox{\boldmath$\gamma$}^{(t+1),EM})I(k\in\mathcal{A}_{i})\Bigg{)}.
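To make the Appendix A updates concrete, the sketch below implements the E-step probabilities (29), the quantities 𝐀(t)\mathbf{A}^{(t)} and 𝐁(t)\mathbf{B}^{(t)} (with the Pólya-Gamma weight evaluated separately for each candidate configuration 𝐝k\mathbf{d}_{k}), the EM updates for 𝜷\beta and 𝜸\gamma, and the ECME rescaling of 𝜷\beta. Unit observation weights, no penalty, and the use of scipy for the one-dimensional search are assumed, and the configuration-prior normalizing constant, which does not depend on ρ\rho, is dropped in the ρ\rho step; all function names are ours.

```python
import numpy as np
from itertools import product
from scipy.optimize import minimize_scalar

def pg_weight_factor(eta):
    """tanh(eta/2) / (2*eta), with its limit 1/4 at eta = 0."""
    eta_safe = np.where(np.abs(eta) < 1e-8, 1.0, eta)
    return np.where(np.abs(eta) < 1e-8, 0.25, np.tanh(eta_safe / 2.0) / (2.0 * eta_safe))

def px_ecme_missing_binary(X, y, m, n_iter=100):
    """Sketch of the Appendix A PX-ECME iteration for binary covariates with missing values.

    X has a leading column of ones; missing entries of the remaining binary covariates are
    coded as np.nan.
    """
    n, p = X.shape
    u = y - m / 2.0
    D = np.array([(1.0,) + z for z in product((0.0, 1.0), repeat=p - 1)])  # 2^(p-1) configurations
    K = D.shape[0]
    gamma, beta = np.full(K, 1.0 / K), np.zeros(p)

    def cond_loglik(b):                       # log p(y_i | x_ic = d_k, b), up to constants
        eta = D @ b
        return y[:, None] * eta[None, :] - m[:, None] * np.logaddexp(0.0, eta)[None, :]

    consistent = np.array([np.all(D[:, ~np.isnan(X[i])] == X[i, ~np.isnan(X[i])], axis=1)
                           for i in range(n)])            # indicator that k is in A_i

    def obs_loglik(b, g):                     # observed-data log-likelihood, up to constants
        terms = np.where(consistent, cond_loglik(b) + np.log(np.maximum(g, 1e-300)), -np.inf)
        return np.sum(np.logaddexp.reduce(terms, axis=1))

    for _ in range(n_iter):
        # E-step: posterior configuration probabilities p_ik of (29)
        logw = np.where(consistent, cond_loglik(beta) + np.log(np.maximum(gamma, 1e-300)), -np.inf)
        P = np.exp(logw - logw.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        A = P @ D                                                  # rows are a_i^(t)
        Omega = m[:, None] * pg_weight_factor(D @ beta)[None, :]   # E[W_i | x_ic = d_k, y_i]
        B = D.T @ (((P * Omega).sum(axis=0))[:, None] * D)         # B^(t)
        # M-step (EM updates for beta and gamma), then the ECME rescaling of beta
        beta_em = np.linalg.solve(B, A.T @ u)
        gamma = P.sum(axis=0) / n
        rho = minimize_scalar(lambda r: -obs_loglik(r * beta_em, gamma)).x
        if obs_loglik(rho * beta_em, gamma) < obs_loglik(beta_em, gamma):
            rho = 1.0                          # safeguard: never do worse than the plain EM update
        beta = rho * beta_em
    return beta, gamma
```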

Appendix B Proof of Proposition 1 and Theorem 1

Proof of Proposition 1. First, note that

QPX𝜼(𝜷(t+1),GEM,1|𝜷(t),α0)QPX𝜼(𝜷(t),1|𝜷(t),α0)\displaystyle Q_{PX}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}^{(t+1),GEM},1|\mbox{\boldmath$\beta$}^{(t)},\alpha_{0})-Q_{PX}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}^{(t)},1|\mbox{\boldmath$\beta$}^{(t)},\alpha_{0})
=\displaystyle= Q~PX𝜼(𝜷(t+1),GEM,1|𝜷(t),α0)Q~PX𝜼(𝜷(t),1|𝜷(t),α0)+12(𝜷(t+1),GEM𝜷(t))T𝐇(t)(𝜷(t+1),GEM𝜷(t))\displaystyle\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}^{(t+1),GEM},1|\mbox{\boldmath$\beta$}^{(t)},\alpha_{0})-\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}^{(t)},1|\mbox{\boldmath$\beta$}^{(t)},\alpha_{0})+\frac{1}{2}(\mbox{\boldmath$\beta$}^{(t+1),GEM}-\mbox{\boldmath$\beta$}^{(t)})^{T}\mathbf{H}^{(t)}(\mbox{\boldmath$\beta$}^{(t+1),GEM}-\mbox{\boldmath$\beta$}^{(t)})
\displaystyle\geq Q~PX𝜼(𝜷(t+1),GEM,1|𝜷(t),α0)Q~PX𝜼(𝜷(t),1|𝜷(t),α0)\displaystyle\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}^{(t+1),GEM},1|\mbox{\boldmath$\beta$}^{(t)},\alpha_{0})-\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}^{(t)},1|\mbox{\boldmath$\beta$}^{(t)},\alpha_{0})
\displaystyle\geq 0,\displaystyle 0,

where the first inequality follows from the fact that 𝐇(t)\mathbf{H}^{(t)} is positive semi-definite and the second inequality follows from the fact that 𝜷(t+1),GEM\mbox{\boldmath$\beta$}^{(t+1),GEM} maximizes Q~PX𝜼(𝜷,1|𝜷(t),α0)\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$},1|\mbox{\boldmath$\beta$}^{(t)},\alpha_{0}). It then follows from Jensen’s inequality and the above inequality that

po,𝐬𝜼(𝜷(t+1),GEM|𝐲)po,𝐬𝜼(𝜷(t)|𝐲)Q~PX𝜼(𝜷(t+1),GEM,1|𝜷(t),α0)Q~PX𝜼(𝜷(t),1|𝜷(t),α0)0p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}^{(t+1),GEM}|\mathbf{y})-p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}^{(t)}|\mathbf{y})\geq\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}^{(t+1),GEM},1|\mbox{\boldmath$\beta$}^{(t)},\alpha_{0})-\tilde{Q}_{PX}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}^{(t)},1|\mbox{\boldmath$\beta$}^{(t)},\alpha_{0})\geq 0

Then, since ρ(t+1)=argmaxρpo,𝐬𝜼(ρ𝜷(t+1),GEM|𝐲)\rho^{(t+1)}=\operatorname*{arg\,max}_{\rho\in\mathbb{R}}p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\rho\mbox{\boldmath$\beta$}^{(t+1),GEM}|\mathbf{y}), we have that

po,𝐬𝜼(𝜷(t+1)|𝐲)\displaystyle p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}^{(t+1)}|\mathbf{y}) =\displaystyle= po,𝐬𝜼(ρ(t+1)𝜷(t+1),GEM|𝐲)\displaystyle p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\rho^{(t+1)}\mbox{\boldmath$\beta$}^{(t+1),GEM}|\mathbf{y})
\displaystyle\geq po,𝐬𝜼(𝜷(t+1),GEM|𝐲)\displaystyle p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}^{(t+1),GEM}|\mathbf{y})
\displaystyle\geq po,𝐬𝜼(𝜷(t)|𝐲).\displaystyle p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\mbox{\boldmath$\beta$}^{(t)}|\mathbf{y}).

Proof of Theorem 1. First, recall that the PX-ECME mapping GPXG_{PX} can be expressed in terms of the EM mapping GEMG_{EM} as

GPX(𝜷)=h(GEM(𝜷))GEM(𝜷)=a(𝜷)GEM(𝜷),G_{PX}(\mbox{\boldmath$\beta$})=h(G_{EM}(\mbox{\boldmath$\beta$}))G_{EM}(\mbox{\boldmath$\beta$})=a(\mbox{\boldmath$\beta$})G_{EM}(\mbox{\boldmath$\beta$}),

where h(𝜷)=argmaxρℓo,𝐬(ρ𝜷|𝐲)h(\mbox{\boldmath$\beta$})=\operatorname*{arg\,max}_{\rho\in\mathbb{R}}\ell_{o,\mathbf{s}}(\rho\mbox{\boldmath$\beta$}|\mathbf{y}) and a(𝜷)=h(GEM(𝜷))a(\mbox{\boldmath$\beta$})=h(G_{EM}(\mbox{\boldmath$\beta$})). Hence, if [GPX(𝜷)]j[G_{PX}(\mbox{\boldmath$\beta$})]_{j} denotes the jthj^{th} component of GPX(𝜷)G_{PX}(\mbox{\boldmath$\beta$}), then ∂[GPX(𝜷)]j/∂βk=a(𝜷)∂[GEM(𝜷)]j/∂βk+[GEM(𝜷)]j∂a(𝜷)/∂βk\partial[G_{PX}(\mbox{\boldmath$\beta$})]_{j}/\partial\beta_{k}=a(\mbox{\boldmath$\beta$})\partial[G_{EM}(\mbox{\boldmath$\beta$})]_{j}/\partial\beta_{k}+[G_{EM}(\mbox{\boldmath$\beta$})]_{j}\partial a(\mbox{\boldmath$\beta$})/\partial\beta_{k}. This implies that the Jacobian matrix JPX(𝜷)J_{PX}(\mbox{\boldmath$\beta$}) of GPX(𝜷)G_{PX}(\mbox{\boldmath$\beta$}) can be expressed as

J_{PX}(\boldsymbol{\beta})=\frac{\partial G_{PX}(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}}=a(\boldsymbol{\beta})J_{EM}(\boldsymbol{\beta})+G_{EM}(\boldsymbol{\beta})\nabla a(\boldsymbol{\beta})^{T}, \qquad (30)

where $\nabla a(\boldsymbol{\beta})$ denotes the gradient of $a(\boldsymbol{\beta})$ and $J_{EM}(\boldsymbol{\beta})$ is the Jacobian matrix associated with the EM mapping $G_{EM}(\boldsymbol{\beta})$.

Now, note that we can express the gradient of $a(\boldsymbol{\beta})$ in terms of the gradient of $h(\boldsymbol{\beta})$ evaluated at $G_{EM}(\boldsymbol{\beta})$ as

\nabla a(\boldsymbol{\beta})^{T}=\nabla h(G_{EM}(\boldsymbol{\beta}))^{T}J_{EM}(\boldsymbol{\beta}). \qquad (31)

To derive an explicit formula for $\nabla h(\boldsymbol{\beta})$, we first note that $h(\boldsymbol{\beta})$ is defined implicitly by the equation $s(\boldsymbol{\beta},h(\boldsymbol{\beta}))=0$, where the score function $s(\boldsymbol{\beta},h(\boldsymbol{\beta}))$ is defined as

s(\boldsymbol{\beta},h(\boldsymbol{\beta})) = \sum_{i=1}^{n}s_{i}y_{i}\mathbf{x}_{i}^{T}\boldsymbol{\beta}-\sum_{i=1}^{n}s_{i}m_{i}\mathbf{x}_{i}^{T}\boldsymbol{\beta}\,\pi\{h(\boldsymbol{\beta})\mathbf{x}_{i}^{T}\boldsymbol{\beta}\}
= \boldsymbol{\beta}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{y}-\boldsymbol{\beta}^{T}\mathbf{X}^{T}\mathbf{S}\,\mu\{h(\boldsymbol{\beta})\mathbf{X}\boldsymbol{\beta}\},

where $\mu\{h(\boldsymbol{\beta})\mathbf{X}\boldsymbol{\beta}\}$ is the length-$n$ vector whose $i$th element is $m_{i}\pi\{h(\boldsymbol{\beta})\mathbf{x}_{i}^{T}\boldsymbol{\beta}\}$. If we differentiate $s(\boldsymbol{\beta},h(\boldsymbol{\beta}))$ with respect to $h(\boldsymbol{\beta})$, we obtain

\frac{\partial s(\boldsymbol{\beta},h(\boldsymbol{\beta}))}{\partial h} = -\sum_{i=1}^{n}m_{i}s_{i}(\mathbf{x}_{i}^{T}\boldsymbol{\beta})^{2}\,\pi\{h(\boldsymbol{\beta})\mathbf{x}_{i}^{T}\boldsymbol{\beta}\}\big[1-\pi\{h(\boldsymbol{\beta})\mathbf{x}_{i}^{T}\boldsymbol{\beta}\}\big]
= -\boldsymbol{\beta}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{R}(\boldsymbol{\beta})\mathbf{X}\boldsymbol{\beta}
= -c(\boldsymbol{\beta}),

where $\mathbf{R}(\boldsymbol{\beta})$ is the diagonal matrix whose $i$th diagonal element is $m_{i}\pi\{h(\boldsymbol{\beta})\mathbf{x}_{i}^{T}\boldsymbol{\beta}\}[1-\pi\{h(\boldsymbol{\beta})\mathbf{x}_{i}^{T}\boldsymbol{\beta}\}]$, $\pi(u)=\{1+\exp(-u)\}^{-1}$, and $c(\boldsymbol{\beta})=\boldsymbol{\beta}^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{R}(\boldsymbol{\beta})\mathbf{X}\boldsymbol{\beta}$. Now, if we differentiate $s(\boldsymbol{\beta},h(\boldsymbol{\beta}))$ with respect to $\beta_{k}$ (keeping $h(\boldsymbol{\beta})$ fixed), we obtain

\frac{\partial s(\boldsymbol{\beta},h(\boldsymbol{\beta}))}{\partial\beta_{k}} = \sum_{i=1}^{n}s_{i}y_{i}x_{ik}-\sum_{i=1}^{n}s_{i}m_{i}x_{ik}\,\pi\{h(\boldsymbol{\beta})\mathbf{x}_{i}^{T}\boldsymbol{\beta}\}
\quad - h(\boldsymbol{\beta})\sum_{i=1}^{n}s_{i}m_{i}x_{ik}\mathbf{x}_{i}^{T}\boldsymbol{\beta}\,\pi\{h(\boldsymbol{\beta})\mathbf{x}_{i}^{T}\boldsymbol{\beta}\}\big[1-\pi\{h(\boldsymbol{\beta})\mathbf{x}_{i}^{T}\boldsymbol{\beta}\}\big].

So, we can write the $p\times 1$ fixed-$h$ gradient $\nabla_{\beta}s(\boldsymbol{\beta},h(\boldsymbol{\beta}))=\big(\partial s(\boldsymbol{\beta},h(\boldsymbol{\beta}))/\partial\beta_{1},\ldots,\partial s(\boldsymbol{\beta},h(\boldsymbol{\beta}))/\partial\beta_{p}\big)^{T}$ as

\nabla_{\beta}s(\boldsymbol{\beta},h(\boldsymbol{\beta}))=\mathbf{X}^{T}\mathbf{S}\big[\mathbf{y}-\mu\{h(\boldsymbol{\beta})\mathbf{X}\boldsymbol{\beta}\}\big]-h(\boldsymbol{\beta})\mathbf{X}^{T}\mathbf{S}\mathbf{R}(\boldsymbol{\beta})\mathbf{X}\boldsymbol{\beta}.

By the implicit function theorem for continuously differentiable functions, we can express the gradient $\nabla h(\boldsymbol{\beta})$ as

\nabla h(\boldsymbol{\beta}) = -\Big[\frac{\partial s(\boldsymbol{\beta},h(\boldsymbol{\beta}))}{\partial h}\Big]^{-1}\nabla_{\beta}s(\boldsymbol{\beta},h(\boldsymbol{\beta}))
= \frac{1}{c(\boldsymbol{\beta})}\Big[\mathbf{X}^{T}\mathbf{S}\big[\mathbf{y}-\mu\{h(\boldsymbol{\beta})\mathbf{X}\boldsymbol{\beta}\}\big]-h(\boldsymbol{\beta})\mathbf{X}^{T}\mathbf{S}\mathbf{R}(\boldsymbol{\beta})\mathbf{X}\boldsymbol{\beta}\Big].

At the point of convergence $\boldsymbol{\beta}^{*}=G_{EM}(\boldsymbol{\beta}^{*})$, we have $h(\boldsymbol{\beta}^{*})=1$ and $\mathbf{X}^{T}\mathbf{S}\big[\mathbf{y}-\mu\{h(\boldsymbol{\beta}^{*})\mathbf{X}\boldsymbol{\beta}^{*}\}\big]=\mathbf{0}$, and hence

\nabla h(G_{EM}(\boldsymbol{\beta}^{*}))=\frac{-1}{c(\boldsymbol{\beta}^{*})}\,\mathbf{X}^{T}\mathbf{S}\mathbf{R}(\boldsymbol{\beta}^{*})\mathbf{X}\boldsymbol{\beta}^{*}.

Finally, returning to (30), we have that

J_{PX}(\boldsymbol{\beta}^{*}) = a(\boldsymbol{\beta}^{*})J_{EM}(\boldsymbol{\beta}^{*})+G_{EM}(\boldsymbol{\beta}^{*})\nabla a(\boldsymbol{\beta}^{*})^{T}
= a(\boldsymbol{\beta}^{*})J_{EM}(\boldsymbol{\beta}^{*})+G_{EM}(\boldsymbol{\beta}^{*})\nabla h(G_{EM}(\boldsymbol{\beta}^{*}))^{T}J_{EM}(\boldsymbol{\beta}^{*})
= J_{EM}(\boldsymbol{\beta}^{*})+\boldsymbol{\beta}^{*}\nabla h(G_{EM}(\boldsymbol{\beta}^{*}))^{T}J_{EM}(\boldsymbol{\beta}^{*})
= J_{EM}(\boldsymbol{\beta}^{*})-\frac{1}{c(\boldsymbol{\beta}^{*})}\boldsymbol{\beta}^{*}(\boldsymbol{\beta}^{*})^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{R}(\boldsymbol{\beta}^{*})\mathbf{X}\,J_{EM}(\boldsymbol{\beta}^{*})
= J_{EM}(\boldsymbol{\beta}^{*})-\frac{1}{c(\boldsymbol{\beta}^{*})}\boldsymbol{\beta}^{*}(\boldsymbol{\beta}^{*})^{T}\mathbf{X}^{T}\mathbf{S}\mathbf{R}(\boldsymbol{\beta}^{*})\mathbf{X}\Big[\mathbf{I}_{p}-(\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{*})\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{S}\mathbf{R}(\boldsymbol{\beta}^{*})\mathbf{X}\Big]
= J_{EM}(\boldsymbol{\beta}^{*})-\frac{1}{c(\boldsymbol{\beta}^{*})}\boldsymbol{\beta}^{*}(\boldsymbol{\beta}^{*})^{T}\mathbf{V}(\boldsymbol{\beta}^{*})\mathbf{A}(\boldsymbol{\beta}^{*})\mathbf{V}(\boldsymbol{\beta}^{*})^{T},

where $\mathbf{V}(\boldsymbol{\beta}^{*})$ and $\mathbf{A}(\boldsymbol{\beta}^{*})$ are defined as

\mathbf{V}(\boldsymbol{\beta}^{*}) = \mathbf{X}^{T}\mathbf{S}\mathbf{R}(\boldsymbol{\beta}^{*})\mathbf{X}
\mathbf{A}(\boldsymbol{\beta}^{*}) = (\mathbf{X}^{T}\mathbf{S}\mathbf{R}(\boldsymbol{\beta}^{*})\mathbf{X})^{-1}-(\mathbf{X}^{T}\mathbf{S}\mathbf{W}(\boldsymbol{\beta}^{*})\mathbf{X})^{-1}.

We now turn to the question of comparing the spectral radii of $J_{PX}(\boldsymbol{\beta}^{*})$ and $J_{EM}(\boldsymbol{\beta}^{*})$. To address this, we first consider the matrices $\mathbf{S}_{PX}$ and $\mathbf{S}_{EM}$ defined as

\mathbf{S}_{PX} = \mathbf{I}-J_{PX}(\boldsymbol{\beta}^{*}) = \mathbf{I}-J_{EM}(\boldsymbol{\beta}^{*})+[c(\boldsymbol{\beta}^{*})]^{-1}\boldsymbol{\beta}^{*}(\boldsymbol{\beta}^{*})^{T}\mathbf{V}(\boldsymbol{\beta}^{*})\mathbf{A}(\boldsymbol{\beta}^{*})\mathbf{V}(\boldsymbol{\beta}^{*})^{T}
\mathbf{S}_{EM} = \mathbf{I}-J_{EM}(\boldsymbol{\beta}^{*}).

Because the diagonal elements of $\mathbf{W}(\boldsymbol{\beta}^{*})$ are greater than the corresponding diagonal elements of $\mathbf{R}(\boldsymbol{\beta}^{*})$ (i.e., $\pi(\mathbf{x}_{i}^{T}\boldsymbol{\beta})\{1-\pi(\mathbf{x}_{i}^{T}\boldsymbol{\beta})\}\leq\tanh(\mathbf{x}_{i}^{T}\boldsymbol{\beta}/2)/(2\mathbf{x}_{i}^{T}\boldsymbol{\beta})$), $\mathbf{A}(\boldsymbol{\beta}^{*})$ is a symmetric positive definite matrix. This implies that $\mathbf{V}(\boldsymbol{\beta}^{*})\mathbf{A}(\boldsymbol{\beta}^{*})\mathbf{V}(\boldsymbol{\beta}^{*})^{T}$ is also a symmetric positive definite matrix. Hence, $\boldsymbol{\beta}^{*}(\boldsymbol{\beta}^{*})^{T}\mathbf{V}(\boldsymbol{\beta}^{*})\mathbf{A}(\boldsymbol{\beta}^{*})\mathbf{V}(\boldsymbol{\beta}^{*})^{T}$ is symmetric positive definite as it is the product of two symmetric positive definite matrices. Since, in addition, $c(\boldsymbol{\beta}^{*})>0$, we have $\mathbf{S}_{PX}\succeq\mathbf{S}_{EM}$, which then implies that $r_{PX}\leq r_{EM}$, where $r_{PX}$ and $r_{EM}$ are the largest eigenvalues of $J_{PX}(\boldsymbol{\beta}^{*})$ and $J_{EM}(\boldsymbol{\beta}^{*})$, respectively.
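As a numerical illustration of the conclusion $r_{PX}\leq r_{EM}$, one can approximate $J_{EM}(\boldsymbol{\beta}^{*})$ and $J_{PX}(\boldsymbol{\beta}^{*})$ by central finite differences of the two mappings at their common fixed point and compare the resulting spectral radii. The Python sketch below continues the earlier sketch (it assumes X, p, em_update, and px_ecme_update from that block are still in scope) and is again only an illustration under the same unit-weight, unpenalized assumptions; the tolerances and step size are arbitrary choices.

\begin{verbatim}
import numpy as np

def fixed_point(update, beta0, tol=1e-10, max_iter=5000):
    """Iterate a mapping until successive iterates agree to within tol."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        new_beta = update(beta)
        if np.max(np.abs(new_beta - beta)) < tol:
            return new_beta
        beta = new_beta
    return beta

def numeric_jacobian(update, beta_star, eps=1e-6):
    """Central-difference approximation to the Jacobian of a mapping at beta_star."""
    d = beta_star.size
    J = np.zeros((d, d))
    for k in range(d):
        e = np.zeros(d)
        e[k] = eps
        J[:, k] = (update(beta_star + e) - update(beta_star - e)) / (2.0 * eps)
    return J

beta_star = fixed_point(px_ecme_update, np.zeros(p))  # same limit point as the EM sequence
r_em = np.max(np.abs(np.linalg.eigvals(numeric_jacobian(em_update, beta_star))))
r_px = np.max(np.abs(np.linalg.eigvals(numeric_jacobian(px_ecme_update, beta_star))))
print(f"spectral radius: EM = {r_em:.4f}, PX-ECME = {r_px:.4f}")  # expect r_px <= r_em
\end{verbatim}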

Appendix C PX-ECME and Coordinate Descent

Algorithm 4 (PX-ECME Coordinate Descent for Logistic Regression with Elastic Net Penalty). In the description of the algorithm, $\mathbf{u}=(u_{1},\ldots,u_{n})^{T}$, where $u_{i}=y_{i}-m_{i}/2$.
1: Given an integer $1\leq k\leq p$, $\boldsymbol{\beta}^{(0)}\in\mathbb{R}^{p}$, and $\alpha^{(0)}\in\mathbb{R}$, set $\boldsymbol{\theta}^{(0)}=\alpha^{(0)}\boldsymbol{\beta}^{(0)}$.
2: Set $\mathbf{W}(\boldsymbol{\beta}^{(0)})=\mathrm{diag}\{m_{1}/4,\ldots,m_{n}/4\}$.
3: for $t=0,1,2,\ldots$ until convergence do
4: for $j=1,\ldots,p$ do: update the component $\theta_{j}^{(t+1)}$ of $\boldsymbol{\theta}^{(t+j/p)}$ via
\theta_{j}^{(t+1)} = \mathrm{sign}\Big(\tfrac{1}{\alpha^{(t+j/p)}}V_{j}(\boldsymbol{\beta}^{(t+j/p)})-U_{j}(\boldsymbol{\beta}^{(t+j/p)},\boldsymbol{\theta}_{-j}^{(t+j/p)})\Big)\times\max\Big\{\Big|\tfrac{1}{\alpha^{(t+j/p)}}V_{j}(\boldsymbol{\beta}^{(t+j/p)})-U_{j}(\boldsymbol{\beta}^{(t+j/p)},\boldsymbol{\theta}_{-j}^{(t+j/p)})\Big|-\tfrac{1}{\alpha^{(t+j/p)}}\tilde{\lambda}(\boldsymbol{\beta}^{(t+j/p)}),0\Big\},
where $V_{j}(\boldsymbol{\beta}^{(t+j/p)})$, $U_{j}(\boldsymbol{\beta}^{(t+j/p)},\boldsymbol{\theta}_{-j}^{(t+j/p)})$, and $\tilde{\lambda}(\boldsymbol{\beta}^{(t+j/p)})$ are as defined in (32).
5: if $j \bmod k=0$ or $j=p$ then
6: Compute
\alpha^{(t+j/p)}=\operatorname*{arg\,max}_{\alpha\in\mathbb{R}}\Big[p\ell_{o,\mathbf{s}}^{\boldsymbol{\eta}}(\alpha\boldsymbol{\theta}^{(t+j/p)}|\mathbf{y})\Big].
7: Set $\boldsymbol{\beta}^{(t+j/p)}=\alpha^{(t+j/p)}\boldsymbol{\theta}^{(t+j/p)}$.
8: Update the diagonal elements $\omega(\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t+j/p)},m_{i})$ of the weight matrix $\mathbf{W}(\boldsymbol{\beta}^{(t+j/p)})$:
\omega(\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t+j/p)},m_{i})=\frac{m_{i}}{2\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t+j/p)}}\tanh\big(\mathbf{x}_{i}^{T}\boldsymbol{\beta}^{(t+j/p)}/2\big).
9: else
10: Set $\boldsymbol{\beta}^{(t+j/p)}=\boldsymbol{\beta}^{(t+(j-1)/p)}$, $\alpha^{(t+j/p)}=\alpha^{(t+(j-1)/p)}$, and $\mathbf{W}(\boldsymbol{\beta}^{(t+j/p)})=\mathbf{W}(\boldsymbol{\beta}^{(t+(j-1)/p)})$.

The terms $V_{j}(\boldsymbol{\beta}^{(t)})$, $U_{j}(\boldsymbol{\beta}^{(t)},\boldsymbol{\theta}_{-j}^{(t+j/p)})$, and $\tilde{\lambda}(\boldsymbol{\beta}^{(t)})$ mentioned in Section 5 are given by

V_{j}(\boldsymbol{\beta}^{(t)}) = \sum_{i=1}^{n}x_{ij}s_{i}u_{i}\Big/\Big\{\sum_{i=1}^{n}(A_{ij}^{(t)})^{2}+\lambda_{2}\Big\}
U_{j}(\boldsymbol{\beta}^{(t)},\boldsymbol{\theta}_{-j}^{(t+j/p)}) = \sum_{i=1}^{n}\Big\{A_{ij}^{(t)}\sum_{k\neq j}\theta_{k}^{(t+j/p)}A_{ik}^{(t)}\Big\}\Big/\Big\{\sum_{i=1}^{n}(A_{ij}^{(t)})^{2}+\lambda_{2}\Big\}
\tilde{\lambda}(\boldsymbol{\beta}^{(t)}) = \lambda_{1}\Big/\Big\{\sum_{i=1}^{n}(A_{ij}^{(t)})^{2}+\lambda_{2}\Big\}. \qquad (32)
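For concreteness, the following Python sketch implements the single-coordinate soft-thresholding update of step 4 of Algorithm 4 using the quantities in (32). It is a sketch under stated assumptions rather than the paper's implementation: the matrix A (with entries $A_{ij}^{(t)}$ as defined in the main text), the weight vector s, the vector u with $u_{i}=y_{i}-m_{i}/2$, the expansion scalar alpha, and the penalty parameters lam1 and lam2 are assumed to be precomputed, and the function names are arbitrary.

\begin{verbatim}
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def coordinate_update(j, theta, X, A, s, u, alpha, lam1, lam2):
    """One elastic-net coordinate update of theta_j (step 4 of Algorithm 4).

    X is the design matrix, A holds the working quantities A_{ij}^{(t)} from the
    main text, s and u are the weight and pseudo-response vectors, alpha is the
    current expansion scalar, and (lam1, lam2) are the lasso/ridge penalties.
    """
    denom = np.sum(A[:, j] ** 2) + lam2
    V_j = np.sum(X[:, j] * s * u) / denom              # V_j(beta^{(t)}) in (32)
    partial_fit = A @ theta - theta[j] * A[:, j]       # sum_{k != j} theta_k A_{ik}
    U_j = np.sum(A[:, j] * partial_fit) / denom        # U_j(beta^{(t)}, theta_{-j}) in (32)
    lam_tilde = lam1 / denom                           # tilde{lambda}(beta^{(t)}) in (32)
    return soft_threshold(V_j / alpha - U_j, lam_tilde / alpha)
\end{verbatim}

In a full implementation one would typically keep the fitted values $\mathbf{A}\boldsymbol{\theta}$ updated incrementally after each coordinate move rather than recomputing the matrix-vector product from scratch, as is standard in pathwise coordinate descent.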
