
The data augmentation algorithm

Vivekananda Roy, Kshitij Khare and James P. Hobert
Abstract

Data augmentation (DA) algorithms are popular Markov chain Monte Carlo (MCMC) algorithms that are widely used for sampling from intractable probability distributions. This review article comprehensively surveys DA MCMC algorithms, highlighting their theoretical foundations, methodological implementations, and diverse applications in frequentist and Bayesian statistics. The article discusses tools for studying the convergence properties of DA algorithms. Furthermore, it describes various strategies for accelerating the convergence of DA algorithms, presents different extensions of DA algorithms, and outlines promising directions for future research. This paper aims to serve as a resource for researchers and practitioners seeking to leverage data augmentation techniques in MCMC algorithms by providing key insights and synthesizing recent developments.

1 Introduction

The data augmentation (DA) algorithm (Tanner and Wong, 1987; Swendsen and Wang, 1987) is a widely used class of Markov chain Monte Carlo (MCMC) algorithms. Suppose $f_X$ is the target probability density function (pdf) on $\mathbb{R}^p$. Generally, the goal is to evaluate the mean of some function $h:\mathbb{R}^p\rightarrow\mathbb{R}$ with respect to $f_X$, that is, to find $\mathrm{E}_{f_X}h=\int_{\mathbb{R}^p}h(x)f_X(x)\,dx$. Since $f_X$ is usually available only up to a normalizing constant, $\mathrm{E}_{f_X}h$ cannot be computed analytically. Furthermore, the target densities arising in machine learning, physics, statistics, and other areas are often so complex that direct simulation from $f_X$ is impossible, making the classical Monte Carlo approximation of $\mathrm{E}_{f_X}h$ based on independent and identically distributed (iid) samples from $f_X$ infeasible. In such situations, it is often possible, as described below, to construct a DA Markov chain $\{X_n\}_{n\geq 1}$ which has $f_X$ as its stationary density. If this DA chain $\{X_n\}_{n\geq 1}$ is well behaved, then $\mathrm{E}_{f_X}h$ can be consistently estimated by the sample average $\bar{h}_n:=\sum_{i=1}^n h(X_i)/n$.

For constructing a DA algorithm with stationary density $f_X$, one needs to find a joint density $f(x,y)$ on $\mathbb{R}^p\times\mathbb{R}^q$ with ‘augmented variables’ $y$ satisfying the following two properties:

  1. (i)

    the $x$-marginal of the joint density $f(x,y)$ is the target density $f_X$, that is, $\int_{\mathbb{R}^q}f(x,y)\,dy=f_X(x)$, and

  2. (ii)

    sampling from the two corresponding conditional pdfs, $f_{X|Y}(x|y)$ and $f_{Y|X}(y|x)$, is straightforward.

The first property ensures that the DA algorithm presented below has $f_X$ as its stationary density, while the second property allows for straightforward simulation of this DA Markov chain. As mentioned in Hobert (2011), the popularity of the DA algorithm is due in part to the fact that, given an intractable pdf $f_X$, there are some general techniques available for constructing a potentially useful joint density $f(x,y)$. For example, the augmented variables $y$ generally correspond to the missing data used in the EM algorithm (Dempster et al., 1977) for defining the complete data density, which in turn is used for finding the maximum likelihood estimates of some parameters. In Sections 2--5 we provide several examples of widely used nonlinear and/or high-dimensional statistical models with complex target densities for which popular and efficient DA algorithms are constructed using appropriate augmented data.

Each iteration of the DA algorithm consists of two steps: a draw from $f_{Y|X}$ followed by a draw from $f_{X|Y}$. Indeed, if the current state of the DA Markov chain is $X_n=x$, we simulate $X_{n+1}$ as follows.

Algorithm The $n$th iteration for the DA algorithm
1:  Given $x$, draw $Y\sim f_{Y|X}(\cdot|x)$, and call the observed value $y$.
2:  Draw $X_{n+1}\sim f_{X|Y}(\cdot|y)$.
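To make the two-step structure concrete, here is a minimal Python sketch of a generic DA sampler. It assumes user-supplied functions draw_y_given_x and draw_x_given_y (hypothetical names) that return exact draws from $f_{Y|X}$ and $f_{X|Y}$; everything else follows the algorithm above.

\begin{verbatim}
import numpy as np

def da_chain(x0, draw_y_given_x, draw_x_given_y, n_iter, rng=None):
    """Run a generic DA chain started at x0 and return all visited states.

    draw_y_given_x(x, rng) and draw_x_given_y(y, rng) are assumed to return
    exact draws from f_{Y|X}(.|x) and f_{X|Y}(.|y), respectively.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    states = np.empty((n_iter, x.size))
    for n in range(n_iter):
        y = draw_y_given_x(x, rng)   # step 1: impute the augmented variables
        x = draw_x_given_y(y, rng)   # step 2: update the variables of interest
        states[n] = x
    return states

# The ergodic average then estimates E_{f_X} h, e.g.:
# samples = da_chain(x0, draw_y_given_x, draw_x_given_y, n_iter=10_000)
# estimate = np.mean([h(x) for x in samples])
\end{verbatim}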

From the two steps in each iteration of the DA algorithm, it follows that the Markov transition density (Mtd) of the DA Markov chain $\{X_n\}_{n\geq 1}$ is given by

\[
k(x^{\prime}|x)=\int_{\mathbb{R}^{q}}f_{X|Y}(x^{\prime}|y)f_{Y|X}(y|x)\,dy. \tag{1.1}
\]

Let $f_Y(y)=\int_{\mathbb{R}^p}f(x,y)\,dx$ be the marginal density of the augmented variable $y$. Since

\[
k(x^{\prime}|x)f_{X}(x)=f_{X}(x)\int_{\mathbb{R}^{q}}f_{X|Y}(x^{\prime}|y)f_{Y|X}(y|x)\,dy=\int_{\mathbb{R}^{q}}\frac{f(x^{\prime},y)f(x,y)}{f_{Y}(y)}\,dy=k(x|x^{\prime})f_{X}(x^{\prime}) \tag{1.2}
\]

for all $x,x^{\prime}$, that is, the Mtd $k$ satisfies the detailed balance condition, we have

\[
\int_{\mathbb{R}^{p}}k(x^{\prime}|x)f_{X}(x)\,dx=f_{X}(x^{\prime}). \tag{1.3}
\]

Thus, $f_X$ is stationary for the DA Markov chain $\{X_n\}_{n\geq 1}$. Moreover, if $\{X_n\}_{n\geq 1}$ is appropriately irreducible, then $\bar{h}_n$ is a strongly consistent estimator of $\mathrm{E}_{f_X}h$, that is, $\bar{h}_n$ converges to $\mathrm{E}_{f_X}h$ almost surely as $n\rightarrow\infty$ (Asmussen and Glynn, 2011, Theorem 2). Since the DA chain has the Mtd (1.1), irreducibility also implies Harris recurrence of the DA Markov chain. Furthermore, if $\{X_n\}_{n\geq 1}$ is also aperiodic, then it is Harris ergodic and the marginal density of $X_n$ converges to the stationary density $f_X$, no matter what the initial distribution of $X_1$ is (Meyn and Tweedie, 1993, Chapter 13). A simple sufficient condition which guarantees both irreducibility and aperiodicity of $\{X_n\}_{n\geq 1}$ is that the Mtd $k$ is strictly positive everywhere. Note that, if $f(x,y)$ is strictly positive on $\mathbb{R}^p\times\mathbb{R}^q$, then $k$ is strictly positive everywhere and the DA Markov chain $\{X_n\}_{n\geq 1}$ is Harris ergodic.

In order to keep the notation and discussion simple, here we consider the target and augmented state spaces to be Euclidean. Similarly, the densities $f_X$ and $f(x,y)$ are considered with respect to the Lebesgue measure. However, all methods and theoretical results discussed in this article extend to discrete as well as other general settings. Indeed, there are popular DA algorithms where the target and/or the augmented state space is not continuous; for example, consider the widely used DA algorithm for mixture models (Diebolt and Robert, 1994).

2 Examples

In this section, we provide examples of some widely used high-dimensional linear models, generalized linear models, and generalized linear mixed models where the target distributions are intractable. We then describe some DA algorithms that are used to fit these models.

2.1 Data augmentation for Bayesian variable selection

In modern datasets arising from diverse applications such as genetics, medical science, and other scientific disciplines, the number of covariates is often larger than the number of observed samples. The ordinary least squares method for estimating the regression coefficients in a linear regression is not applicable in such situations. On the other hand, penalized regression methods such as the least absolute shrinkage and selection operator (lasso) (Tibshirani, 1996) are applicable as they can simultaneously induce shrinkage and sparsity in the estimation of the regression coefficients.

Consider the standard linear model

\[
Z|\mu,\beta,\sigma^{2}\sim N_{m}(\mu 1_{m}+W\beta,\sigma^{2}I_{m}),
\]

where $Z\in\mathbb{R}^m$ is the vector of responses, $\mu\in\mathbb{R}$ is the intercept, $1_m$ is the $m\times 1$ vector of 1's, $W$ is the $m\times p$ (standardized) covariate matrix, $\beta=(\beta_1,\dots,\beta_p)\in\mathbb{R}^p$ is the vector of regression coefficients, and $\sigma^2$ is the variance parameter. The lasso is based on $L_1$ norm regularization, and it estimates $\beta$ by solving

\[
\min_{\beta}\;(\tilde{z}-W\beta)^{\top}(\tilde{z}-W\beta)+\lambda\sum_{j=1}^{p}|\beta_{j}|, \tag{2.1}
\]

for some shrinkage parameter $\lambda>0$, where $z$ is the vector of observed responses, and $\tilde{z}=z-\bar{z}1_m$. The lasso estimate can be interpreted as the posterior mode when, conditional on $\sigma^2$, the regression parameters $\beta_j$'s have independent and identical Laplace priors (Tibshirani, 1996). Indeed, following Park and Casella (2008), we consider the hierarchical model

\begin{align*}
Z|\mu,\beta,\sigma^{2} &\sim N_{m}(\mu 1_{m}+W\beta,\sigma^{2}I_{m}),\\
\pi(\mu)\propto 1;\quad \beta|\sigma^{2} &\sim \prod_{j=1}^{p}\frac{\lambda}{2\sqrt{\sigma^{2}}}e^{-\lambda|\beta_{j}|/\sqrt{\sigma^{2}}},\\
\sigma^{2} &\sim \pi(\sigma^{2}), \tag{2.2}
\end{align*}

where $\pi(\sigma^2)$ is the prior density of $\sigma^2$. Since the columns of $W$ are centered, standard calculations show that

\[
f(z|\beta,\sigma^{2})\equiv\int_{\mathbb{R}}f(z|\mu,\beta,\sigma^{2})\pi(\mu)\,d\mu=\frac{1}{(2\pi)^{(m-1)/2}\sigma^{m-1}}\exp\Big[-\frac{(\tilde{z}-W\beta)^{\top}(\tilde{z}-W\beta)}{2\sigma^{2}}\Big]. \tag{2.3}
\]

Thus, from (2.2) and (2.3), the joint posterior density of $(\beta,\sigma^2)$ is

\[
\pi(\beta,\sigma^{2}|z)\propto\frac{1}{(\sigma^{2})^{(m-1+p)/2}}\exp\Bigg[-\frac{(\tilde{z}-W\beta)^{\top}(\tilde{z}-W\beta)+2\lambda\sigma\sum_{j=1}^{p}|\beta_{j}|}{2\sigma^{2}}\Bigg]\pi(\sigma^{2}). \tag{2.4}
\]

From (2.1) and (2.4) it follows that the mode of the (conditional on $\sigma^2$) posterior density of $\beta$ is the lasso estimate.

The posterior density (2.4) is intractable. Introducing augmented variables $y=(y_1,\dots,y_p)$ with $y_i>0$ for all $i$, Park and Casella (2008) consider the following joint density of $(\beta,\sigma^2,y)$:

\begin{align*}
\pi(\beta,\sigma^{2},y|z) &\propto\frac{1}{(\sigma^{2})^{(m-1+p)/2}}\exp\Bigg[-\frac{(\tilde{z}-W\beta)^{\top}(\tilde{z}-W\beta)}{2\sigma^{2}}\Bigg]\\
&\qquad\times\Bigg[\prod_{j=1}^{p}\frac{1}{\sqrt{y_{j}}}\exp\bigg\{-\frac{\beta_{j}^{2}}{2\sigma^{2}y_{j}}\bigg\}\exp\bigg\{-\frac{\lambda^{2}y_{j}}{2}\bigg\}\Bigg]\pi(\sigma^{2}). \tag{2.5}
\end{align*}

Replacing $t$ with $\beta_j/\sigma$, $s$ with $y_j$, and $a$ with $\lambda$ in the following representation of the Laplace density as a scale mixture of normals (Andrews and Mallows, 1974),

\[
\frac{a}{2}\exp(-a|t|)=\int_{0}^{\infty}\frac{1}{\sqrt{2\pi s}}\exp\Big(-\frac{t^{2}}{2s}\Big)\frac{a^{2}}{2}\exp\Big(-\frac{a^{2}}{2}s\Big)\,ds,\quad a>0, \tag{2.6}
\]

we see that

\[
\int_{\mathbb{R}^{p}}\pi(\beta,\sigma^{2},y|z)\,dy=\pi(\beta,\sigma^{2}|z),
\]

that is, the $(\beta,\sigma^2)$-marginal density of the joint posterior density $\pi(\beta,\sigma^2,y|z)$ given in (2.5) is the target posterior density (2.4). Thus, from Section 1, if sampling from the two conditional densities $\pi(\beta,\sigma^2|y,z)$ and $\pi(y|\beta,\sigma^2,z)$ is straightforward, then we can construct a valid DA algorithm for (2.4).
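To spell out the step behind this marginalization, apply (2.6) coordinatewise with $t=\beta_j/\sigma$, $s=y_j$ and $a=\lambda$; each $y_j$ in (2.5) integrates out as
\[
\int_{0}^{\infty}\frac{1}{\sqrt{y_{j}}}\exp\Big\{-\frac{\beta_{j}^{2}}{2\sigma^{2}y_{j}}\Big\}\exp\Big\{-\frac{\lambda^{2}y_{j}}{2}\Big\}\,dy_{j}
=\frac{\sqrt{2\pi}}{\lambda}\,\exp\Big(-\frac{\lambda|\beta_{j}|}{\sigma}\Big),
\]
so integrating (2.5) over $y$ recovers, up to a normalizing constant, exactly the Laplace factors $\exp(-\lambda|\beta_j|/\sigma)$ appearing in (2.4).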

From (2.5), we have

\[
\pi(y|\beta,\sigma^{2},z)\propto\prod_{j=1}^{p}\frac{1}{\sqrt{y_{j}}}\exp\bigg\{-\frac{\beta_{j}^{2}}{2\sigma^{2}y_{j}}\bigg\}\exp\Big\{-\frac{\lambda^{2}y_{j}}{2}\Big\}. \tag{2.7}
\]

Recall that the Inverse-Gaussian$(\kappa,\psi)$ density is given by

\[
f(u)=\sqrt{\frac{\psi}{2\pi}}\,u^{-\frac{3}{2}}\exp\Big\{-\frac{\psi(u-\kappa)^{2}}{2\kappa^{2}u}\Big\}.
\]

Thus from (2.7), it follows that

\[
\frac{1}{y_{j}}\Big|\beta,\sigma^{2},z\;\overset{\text{ind}}{\sim}\;\mbox{Inverse-Gaussian}\Big(\sqrt{\frac{\lambda^{2}\sigma^{2}}{\beta^{2}_{j}}},\lambda^{2}\Big)\quad\mbox{for}\;j=1,2,\dots,p. \tag{2.8}
\]

In order to derive the conditional density $\pi(\beta,\sigma^2|y,z)$, we assume that a priori $\sigma^2\sim\mbox{Inverse-Gamma}(\alpha,\xi)$ for some $\alpha\geq 0$ and $\xi\geq 0$. The improper prior $\pi(\sigma^2)=1/\sigma^2$ used in Park and Casella (2008) is obtained by setting $\alpha=0$ and $\xi=0$. Note that a draw from $\pi(\beta,\sigma^2|y,z)$ can be made by first drawing from $\pi(\sigma^2|y,z)$ followed by a draw from $\pi(\beta|\sigma^2,y,z)$ (Rajaratnam et al., 2019). Denoting the $p\times p$ diagonal matrix with diagonal elements $(y_1,\dots,y_p)$ by $D_y$, we have

\begin{align*}
\pi(\beta|\sigma^{2},y,z) &\propto\pi(\beta,\sigma^{2},y|z)\propto\exp\Bigg[-\frac{(\tilde{z}-W\beta)^{\top}(\tilde{z}-W\beta)+\beta^{\top}D_{y}^{-1}\beta}{2\sigma^{2}}\Bigg]\\
&\propto\exp\Bigg[-\frac{\beta^{\top}(W^{\top}W+D_{y}^{-1})\beta-2\beta^{\top}W^{\top}\tilde{z}}{2\sigma^{2}}\Bigg].
\end{align*}

Thus, $\beta|\sigma^2,y,z\sim N_p\big((W^{\top}W+D_y^{-1})^{-1}W^{\top}\tilde{z},\;\sigma^2(W^{\top}W+D_y^{-1})^{-1}\big)$. Also, from (2.5), we have

\begin{align*}
\pi(\sigma^{2}|y,z) &\propto\pi(\sigma^{2},y|z)=\int_{\mathbb{R}^{p}}\pi(\beta,\sigma^{2},y|z)\,d\beta\\
&\propto\int_{\mathbb{R}^{p}}\frac{1}{(\sigma^{2})^{(m-1+p)/2}}\exp\Bigg[-\frac{(\tilde{z}-W\beta)^{\top}(\tilde{z}-W\beta)+\beta^{\top}D_{y}^{-1}\beta}{2\sigma^{2}}\Bigg]\pi(\sigma^{2})\,d\beta\\
&\propto\frac{\pi(\sigma^{2})}{(\sigma^{2})^{(m-1)/2}}\exp\Big(-\frac{\tilde{z}^{\top}\tilde{z}}{2\sigma^{2}}\Big)\int_{\mathbb{R}^{p}}(\sigma^{2})^{-p/2}\exp\Bigg[-\frac{\beta^{\top}(W^{\top}W+D_{y}^{-1})\beta-2\beta^{\top}W^{\top}\tilde{z}}{2\sigma^{2}}\Bigg]\,d\beta\\
&\propto(\sigma^{2})^{-(m-1+2\alpha)/2-1}\exp\Bigg[-\frac{\tilde{z}^{\top}(I-W(W^{\top}W+D_{y}^{-1})^{-1}W^{\top})\tilde{z}+2\xi}{2\sigma^{2}}\Bigg].
\end{align*}

Thus, $\sigma^2|y,z\sim\mbox{Inverse-Gamma}\big((m-1+2\alpha)/2,\,(\tilde{z}^{\top}(I-W(W^{\top}W+D_y^{-1})^{-1}W^{\top})\tilde{z}+2\xi)/2\big)$. Hence, a single iteration of the DA algorithm uses the following two steps to move from $(\beta,\sigma^2)$ to $(\beta^{\prime},{\sigma^2}^{\prime})$.

Algorithm One iteration of the DA algorithm for the Bayesian lasso
1:  Given $(\beta,\sigma^2)$, draw $y_1,\dots,y_p$ independently with
\[
\frac{1}{y_{j}}\Big|\beta,\sigma^{2},z\sim\mbox{Inverse-Gaussian}\bigg(\sqrt{\frac{\lambda^{2}\sigma^{2}}{\beta^{2}_{j}}},\lambda^{2}\bigg)\quad\mbox{for}\;j=1,2,\dots,p.
\]
2:  Draw $(\beta^{\prime},{\sigma^2}^{\prime})$ by first drawing
\[
{\sigma^{2}}^{\prime}|y,z\sim\mbox{Inverse-Gamma}\Big(\frac{m-1}{2}+\alpha,\;\frac{\tilde{z}^{\top}(I-W(W^{\top}W+D_{y}^{-1})^{-1}W^{\top})\tilde{z}}{2}+\xi\Big),
\]
and then drawing
\[
\beta^{\prime}|{\sigma^{2}}^{\prime},y,z\sim N_{p}\big((W^{\top}W+D_{y}^{-1})^{-1}W^{\top}\tilde{z},\;{\sigma^{2}}^{\prime}(W^{\top}W+D_{y}^{-1})^{-1}\big).
\]
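As an illustration, a minimal NumPy sketch of one such iteration is given below, under the assumption that the centered response z_tilde, the standardized covariate matrix W, and the fixed values lam, alpha, xi are available (these variable names are ours, not from the source); the inverse-Gaussian draws use NumPy's Wald sampler, parameterized by mean and shape.

\begin{verbatim}
import numpy as np

def bayesian_lasso_da_step(beta, sigma2, W, z_tilde, lam,
                           alpha=0.0, xi=0.0, rng=None):
    """One DA iteration for the Bayesian lasso: (beta, sigma2) -> (beta', sigma2')."""
    rng = np.random.default_rng() if rng is None else rng
    m, p = W.shape

    # Step 1: draw 1/y_j ~ Inverse-Gaussian(mean = lam*sigma/|beta_j|, shape = lam^2),
    # then invert; rng.wald(mean, scale) samples the inverse-Gaussian distribution.
    inv_y = rng.wald(lam * np.sqrt(sigma2) / np.abs(beta), lam**2)
    D_y_inv = np.diag(inv_y)                        # D_y^{-1}

    # Step 2a: draw sigma2' from its inverse-gamma full conditional.
    A_inv = np.linalg.inv(W.T @ W + D_y_inv)
    quad = z_tilde @ (z_tilde - W @ (A_inv @ (W.T @ z_tilde)))
    sigma2_new = 1.0 / rng.gamma((m - 1) / 2 + alpha, 1.0 / (quad / 2 + xi))

    # Step 2b: draw beta' from its multivariate normal full conditional.
    mean_beta = A_inv @ (W.T @ z_tilde)
    beta_new = rng.multivariate_normal(mean_beta, sigma2_new * A_inv)
    return beta_new, sigma2_new
\end{verbatim}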

Although the lasso estimator has been extensively used in applications as diverse as agriculture, genetics, and finance, it may perform unsatisfactorily if the predictors are highly correlated. For example, if there is a group structure among the variables, lasso tends to select only one variable from each group. Zou and Hastie (2005) proposed the Elastic Net (EN) to achieve better performance in such situations. The EN estimator is obtained by solving

\[
\min_{\beta}\;(\tilde{z}-W\beta)^{\top}(\tilde{z}-W\beta)+\lambda_{1}\sum_{j=1}^{p}|\beta_{j}|+\lambda_{2}\sum_{j=1}^{p}\beta_{j}^{2}, \tag{2.9}
\]

where $\lambda_1$ and $\lambda_2$ are tuning parameters. From (2.9) we see that the elastic net uses both an $L_1$ penalty as in the lasso and an $L_2$ penalty as in ridge regression. Following Kyung et al. (2010) and Roy and Chakraborty (2017), we consider the hierarchical Bayesian EN model

\begin{align*}
Z|\mu,\beta,\sigma^{2} &\sim N_{m}(\mu 1_{m}+W\beta,\sigma^{2}I_{m}),\\
\pi(\mu)\propto 1;\quad \pi(\beta|\sigma^{2}) &\propto \prod_{j=1}^{p}\frac{1}{\sigma}\exp\Big\{-\frac{\lambda_{1}|\beta_{j}|}{\sigma}-\frac{\lambda_{2}\beta^{2}_{j}}{2\sigma^{2}}\Big\},\\
\sigma^{2} &\sim \pi(\sigma^{2}), \tag{2.10}
\end{align*}

where $\pi(\sigma^2)$ is the prior density of $\sigma^2$. From (2.10) and (2.3) it follows that the joint posterior density of $(\beta,\sigma^2)$ is

\[
\pi_{\text{EN}}(\beta,\sigma^{2}|z)\propto\frac{\pi(\sigma^{2})}{(\sigma^{2})^{(m-1+p)/2}}\exp\Bigg[-\frac{(\tilde{z}-W\beta)^{\top}(\tilde{z}-W\beta)+2\lambda_{1}\sigma\sum_{j=1}^{p}|\beta_{j}|+\lambda_{2}\sum_{j=1}^{p}\beta_{j}^{2}}{2\sigma^{2}}\Bigg]. \tag{2.11}
\]

From (2.9) and (2.11) we see that the mode of the (conditional on $\sigma^2$) posterior density of $\beta$ is the EN estimate. Since the density (2.11) is intractable, using augmented variables $y=(y_1,\dots,y_p)$ and following Kyung et al. (2010) and Roy and Chakraborty (2017), we consider the joint density of $(\beta,\sigma^2,y)$ given by

\begin{align*}
\pi_{\text{EN}}(\beta,\sigma^{2},y|z) &\propto\frac{\pi(\sigma^{2})}{(\sigma^{2})^{(m-1+p)/2}}\exp\Bigg[-\frac{(\tilde{z}-W\beta)^{\top}(\tilde{z}-W\beta)}{2\sigma^{2}}\Bigg]\\
&\qquad\times\Bigg[\prod_{j=1}^{p}\frac{1}{\sqrt{y_{j}}}\exp\bigg\{-\frac{(\lambda_{2}+1/y_{j})\beta_{j}^{2}}{2\sigma^{2}}\bigg\}\exp\bigg\{-\frac{\lambda_{1}^{2}y_{j}}{2}\bigg\}\Bigg]. \tag{2.12}
\end{align*}

From (2.6) we see that

\[
\int_{\mathbb{R}^{p}}\pi_{\text{EN}}(\beta,\sigma^{2},y|z)\,dy=\pi_{\text{EN}}(\beta,\sigma^{2}|z).
\]

From (2.12), using similar calculations as before, we see that the conditional distributions of the $1/y_j$'s are the same as (2.8) with $\lambda=\lambda_1$. As before, we assume that a priori $\sigma^2\sim\mbox{Inverse-Gamma}(\alpha,\xi)$ for some $\alpha\geq 0$ and $\xi\geq 0$. Denoting the $p\times p$ diagonal matrix with diagonal elements $(1/(\lambda_2+1/y_1),\dots,1/(\lambda_2+1/y_p))$ by $\tilde{D}_y$, we have

\[
\pi_{\text{EN}}(\beta|\sigma^{2},y,z)\propto\pi_{\text{EN}}(\beta,\sigma^{2},y|z)\propto\exp\Bigg[-\frac{(\tilde{z}-W\beta)^{\top}(\tilde{z}-W\beta)+\beta^{\top}\tilde{D}_{y}^{-1}\beta}{2\sigma^{2}}\Bigg].
\]

Thus, $\beta|\sigma^2,y,z\sim N_p\big((W^{\top}W+\tilde{D}_y^{-1})^{-1}W^{\top}\tilde{z},\;\sigma^2(W^{\top}W+\tilde{D}_y^{-1})^{-1}\big)$. Next, by calculations similar to those for the Bayesian lasso, we see that $\sigma^2|y,z\sim\mbox{Inverse-Gamma}\big((m-1+2\alpha)/2,\,(\tilde{z}^{\top}(I-W(W^{\top}W+\tilde{D}_y^{-1})^{-1}W^{\top})\tilde{z}+2\xi)/2\big)$. Hence, a single iteration of the DA algorithm for the Bayesian EN uses the following two steps to move from $(\beta,\sigma^2)$ to $(\beta^{\prime},{\sigma^2}^{\prime})$.

Algorithm One iteration of the DA algorithm for the Bayesian elastic net
1:  Given $(\beta,\sigma^2)$, draw $y_1,\dots,y_p$ independently with
\[
\frac{1}{y_{j}}\Big|\beta,\sigma^{2},z\sim\mbox{Inverse-Gaussian}\bigg(\sqrt{\frac{\lambda_{1}^{2}\sigma^{2}}{\beta^{2}_{j}}},\lambda_{1}^{2}\bigg)\quad\mbox{for}\;j=1,2,\dots,p.
\]
2:  Draw $(\beta^{\prime},{\sigma^2}^{\prime})$ by first drawing
\[
{\sigma^{2}}^{\prime}|y,z\sim\mbox{Inverse-Gamma}\Big(\frac{m-1}{2}+\alpha,\;\frac{\tilde{z}^{\top}(I-W(W^{\top}W+\tilde{D}_{y}^{-1})^{-1}W^{\top})\tilde{z}}{2}+\xi\Big),
\]
and then drawing
\[
\beta^{\prime}|{\sigma^{2}}^{\prime},y,z\sim N_{p}\big((W^{\top}W+\tilde{D}_{y}^{-1})^{-1}W^{\top}\tilde{z},\;{\sigma^{2}}^{\prime}(W^{\top}W+\tilde{D}_{y}^{-1})^{-1}\big).
\]
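In code, this iteration differs from the Bayesian lasso sketch above only in the diagonal matrix entering the conditional of $(\beta,\sigma^2)$. A hedged sketch of that modification, reusing the hypothetical variable names from the earlier block, is:

\begin{verbatim}
import numpy as np

def bayesian_en_da_step(beta, sigma2, W, z_tilde, lam1, lam2,
                        alpha=0.0, xi=0.0, rng=None):
    """One DA iteration for the Bayesian elastic net: (beta, sigma2) -> (beta', sigma2')."""
    rng = np.random.default_rng() if rng is None else rng
    m, p = W.shape

    # Step 1: as in the lasso, with lambda_1 in place of lambda.
    inv_y = rng.wald(lam1 * np.sqrt(sigma2) / np.abs(beta), lam1**2)

    # Step 2: replace D_y^{-1} by tilde{D}_y^{-1} = diag(lambda_2 + 1/y_j).
    D_tilde_inv = np.diag(lam2 + inv_y)
    A_inv = np.linalg.inv(W.T @ W + D_tilde_inv)
    quad = z_tilde @ (z_tilde - W @ (A_inv @ (W.T @ z_tilde)))
    sigma2_new = 1.0 / rng.gamma((m - 1) / 2 + alpha, 1.0 / (quad / 2 + xi))
    beta_new = rng.multivariate_normal(A_inv @ (W.T @ z_tilde), sigma2_new * A_inv)
    return beta_new, sigma2_new
\end{verbatim}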

Different Bayesian generalized lasso methods, such as the Bayesian group lasso, the Bayesian sparse group lasso, and the Bayesian fused lasso models, have been proposed in the literature to handle situations where the covariates are known to have some structure, for example, when they form groups or are ordered in some way (Kyung et al., 2010; Xu and Ghosh, 2015). Introducing appropriate augmented variables $y$, DA algorithms for these models can also be constructed (Kyung et al., 2010; Jin and Tan, 2021).

2.2 Data augmentation for Bayesian logistic models

The logistic regression model is perhaps the most popular model for analyzing binomial data. Let $(Z_1,Z_2,\dots,Z_m)$ be a vector of independent binomial random variables, and let $w_i$ be the $p\times 1$ vector of known covariates associated with $Z_i$ for $i=1,\dots,m$. Let $\beta\in\mathbb{R}^p$ be the vector of unknown regression coefficients. Assume that $Z_i\sim\mbox{Binomial}(\ell_i,F(w_i^{\top}\beta))$, where $F(t)=e^t/(1+e^t)$ is the cumulative distribution function of the standard logistic random variable. Denoting the observed responses by $z=(z_1,z_2,\dots,z_m)$, the likelihood function for the logistic regression model is

\[
L(\beta|z)=\prod_{i=1}^{m}\binom{\ell_{i}}{z_{i}}\frac{\big[\exp(w_{i}^{\top}\beta)\big]^{z_{i}}}{\big[1+\exp(w_{i}^{\top}\beta)\big]^{\ell_{i}}}.
\]

Consider a Bayesian analysis that employs the following Gaussian prior for $\beta$:

\[
\pi(\beta)\propto\exp\Big[-\frac{1}{2}(\beta-\mu_{0})^{\top}Q(\beta-\mu_{0})\Big], \tag{2.13}
\]

where $\mu_0\in\mathbb{R}^p$ and $Q$ is a $p\times p$ positive definite matrix. Then the intractable posterior density of $\beta$ is given by

\[
\pi(\beta\,|\,z)=\frac{L(\beta|z)\pi(\beta)}{c(z)}=\frac{1}{c(z)}\prod_{i=1}^{m}\binom{\ell_{i}}{z_{i}}\frac{\big[\exp(w_{i}^{\top}\beta)\big]^{z_{i}}}{\big[1+\exp(w_{i}^{\top}\beta)\big]^{\ell_{i}}}\pi(\beta), \tag{2.14}
\]

where

\[
c(z)=\int_{\mathbb{R}^{p}}\prod_{i=1}^{m}\binom{\ell_{i}}{z_{i}}\frac{\big[\exp(w_{i}^{\top}\beta)\big]^{z_{i}}}{\big[1+\exp(w_{i}^{\top}\beta)\big]^{\ell_{i}}}\pi(\beta)\,d\beta \tag{2.15}
\]

is the marginal density of $z$.

There have been many attempts (Holmes and Held, 2006; Frühwirth-Schnatter and Frühwirth, 2010) to produce an efficient DA algorithm for the logistic regression model, that is, for (2.14), without much success until recently, when Polson et al. (2013) (denoted PS&W hereafter) proposed a new DA algorithm based on the Pólya-Gamma (PG) distribution. A random variable $\theta$ follows a PG distribution with parameters $a>0$, $b\geq 0$ if

\[
\theta\overset{d}{=}\frac{1}{2\pi^{2}}\sum_{i=1}^{\infty}\frac{g_{i}}{(i-1/2)^{2}+b^{2}/(4\pi^{2})},
\]

where $g_i\overset{iid}{\sim}\mbox{Gamma}(a,1)$. From Wang and Roy (2018c), the pdf of the PG$(a,b)$, $a>0$, $b\geq 0$, distribution is

\[
p(\theta\mid a,b)=\Big[\cosh\big(\tfrac{b}{2}\big)\Big]^{a}\frac{2^{a-1}}{\Gamma(a)}\sum_{r=0}^{\infty}(-1)^{r}\frac{\Gamma(r+a)}{\Gamma(r+1)}\frac{(2r+a)}{\sqrt{2\pi\theta^{3}}}\exp\Big(-\frac{(2r+a)^{2}}{8\theta}-\frac{\theta b^{2}}{2}\Big), \tag{2.16}
\]

for $\theta>0$, where the hyperbolic cosine function is $\cosh(t)=(e^{t}+e^{-t})/2$. We denote this as $\theta\sim\mbox{PG}(a,b)$.

Let $y=(y_1,y_2,\dots,y_m)$, $k_i=z_i-\ell_i/2$ for $i=1,\dots,m$, and let $p(y_i\mid\ell_i,0)$ be the pdf of $y_i\sim\mbox{PG}(\ell_i,0)$. Define the joint posterior density of $\beta$ and $y$ given $z$ as

\[
\pi(\beta,y\mid z)=\frac{1}{c(z)}\bigg[\prod_{i=1}^{m}\frac{\exp\{k_{i}w_{i}^{\top}\beta-y_{i}(w_{i}^{\top}\beta)^{2}/2\}}{2}\,p(y_{i}|\ell_{i},0)\bigg]\pi(\beta). \tag{2.17}
\]

By Theorem 1 in Polson et al. (2013), it follows that the $\beta$-marginal density of the joint posterior density $\pi(\beta,y\mid z)$ is the target density $\pi(\beta\,|\,z)$ given in (2.14), that is,

\[
\pi(\beta\mid z)=\int_{\mathbb{R}_{+}^{m}}\bigg[\prod_{i=1}^{m}\frac{\exp\{k_{i}w_{i}^{\top}\beta-y_{i}(w_{i}^{\top}\beta)^{2}/2\}}{2}\,p(y_{i}\mid\ell_{i},0)\bigg]dy\times\frac{\pi(\beta)}{c(z)}.
\]

PS&W's DA algorithm for the logistic regression model is based on the joint density $\pi(\beta,y\mid z)$. We now derive the conditional densities $\pi(\beta|y,z)$ and $\pi(y|\beta,z)$. From (2.17) we see that

\[
\pi(y_{i}\mid\beta,z)\propto\exp\big(-y_{i}(w_{i}^{\top}\beta)^{2}/2\big)\,p(y_{i}|\ell_{i},0), \tag{2.18}
\]

and thus from (2.16) we have $y_i|\beta,z\overset{ind}{\sim}\mbox{PG}\big(\ell_i,|w_i^{\top}\beta|\big)$ for $i=1,\dots,m$. Let $W$ denote the $m\times p$ design matrix with $i$th row $w_i^{\top}$ and let $Y$ be the $m\times m$ diagonal matrix with $i$th diagonal element $y_i$. From (2.17) and (2.13), it follows that the conditional density of $\beta$ is

\[
\pi(\beta|y,z)\propto\exp\Big[-\frac{1}{2}\beta^{\top}(W^{\top}YW+Q)\beta+\beta^{\top}(W^{\top}\kappa+Q\mu_{0})\Big],
\]

where $\kappa=(k_1,\dots,k_m)^{\top}$. Thus the conditional distribution of $\beta$ is multivariate normal. In particular,

\[
\beta|y,z\sim N\big((W^{\top}YW+Q)^{-1}(W^{\top}\kappa+Q\mu_{0}),\;(W^{\top}YW+Q)^{-1}\big). \tag{2.19}
\]

Thus, a single iteration of PS&W's DA algorithm uses the following two steps to move from $\beta$ to $\beta^{\prime}$.

Algorithm One iteration of PS&W's DA algorithm
1:  Given $\beta$, draw $y_1,\dots,y_m$ independently with $y_i\sim\mbox{PG}\big(\ell_i,|w_i^{\top}\beta|\big)$.
2:  Draw $\beta^{\prime}\sim N\big((W^{\top}YW+Q)^{-1}(W^{\top}\kappa+Q\mu_{0}),\;(W^{\top}YW+Q)^{-1}\big)$.
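A compact Python sketch of this iteration is given below. It assumes a Pólya-Gamma sampler is available; here we use the function random_polyagamma from the third-party polyagamma package (an assumption on our part; if it is unavailable, any sampler for PG$(\ell_i,|w_i^{\top}\beta|)$ can be substituted). The variable names are ours.

\begin{verbatim}
import numpy as np
from polyagamma import random_polyagamma  # third-party PG sampler (assumed installed)

def psw_da_step(beta, W, z, ell, Q, mu0, rng=None):
    """One iteration of the PG data augmentation sampler for Bayesian logistic regression."""
    rng = np.random.default_rng() if rng is None else rng
    kappa = z - ell / 2.0                          # k_i = z_i - ell_i / 2

    # Step 1: y_i ~ PG(ell_i, |w_i' beta|), independently for i = 1, ..., m.
    y = random_polyagamma(ell, np.abs(W @ beta))

    # Step 2: beta' ~ N(B (W' kappa + Q mu0), B), with B = (W' Y W + Q)^{-1}.
    B = np.linalg.inv(W.T @ (y[:, None] * W) + Q)
    mean = B @ (W.T @ kappa + Q @ mu0)
    return rng.multivariate_normal(mean, B)
\end{verbatim}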

PS&W's DA algorithm is also applicable when $Q=0$ in (2.13), that is, when $\pi(\beta)\propto 1$ is the improper uniform prior on $\beta$, provided the posterior density $\pi(\beta|z)$ is proper, that is, $c(z)$ defined in (2.15) is finite. Polson et al. (2013) demonstrate superior empirical performance of their DA algorithm over some other DA and MCMC algorithms for the logistic regression model.

2.3 Data augmentation for probit mixed models

Generalized linear mixed models (GLMMs) are often used for analyzing correlated binary observations. The random effects in the linear predictor of a GLMM can accommodate overdispersion as well as dependence among correlated observations arising from longitudinal or repeated measures studies. Let $(Z_1,Z_2,\dots,Z_m)$ denote a vector of Bernoulli random variables. Let $w_i$ and $v_i$ be the $p\times 1$ and $q\times 1$ known covariate and random effect design vectors, respectively, associated with the $i$th observation $Z_i$ for $i=1,\dots,m$. Let $\beta\in\mathbb{R}^p$ be the unknown vector of regression coefficients and $u\in\mathbb{R}^q$ be the random effects vector. The probit GLMM connects the expectation of $Z_i$ with $w_i$ and $v_i$ using the probit link function $\Phi^{-1}(\cdot)$ as

\[
\Phi^{-1}(P(Z_{i}=1))=w_{i}^{\top}\beta+v_{i}^{\top}u, \tag{2.20}
\]

where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal random variable.

Assume that we have $r$ random effects with $u=(u_1^{\top},\dots,u_r^{\top})^{\top}$, where $u_j$ is a $q_j\times 1$ vector with $q_j>0$, $q_1+\cdots+q_r=q$, and $u_j\overset{\text{ind}}{\sim}N(0,\Lambda_j\otimes R_j)$, where the low-dimensional covariance matrix $\Lambda_j$ is unknown and must be estimated, and the structured matrix $R_j$ is usually known. Here, $\otimes$ denotes the Kronecker product. Denoting $\Lambda=(\Lambda_1,\dots,\Lambda_r)$, the probit GLMM is given by

\begin{align*}
Z_{i}|\beta,u,\Lambda &\overset{\text{ind}}{\sim} \text{Bern}\big(\Phi(w_{i}^{\top}\beta+v_{i}^{\top}u)\big)\;\text{ for }i=1,\dots,m,\;\text{ with}\\
u_{j}|\beta,\Lambda &\overset{\text{ind}}{\sim} N(0,\Lambda_{j}\otimes R_{j}),\;j=1,\dots,r.
\end{align*}

Let $z=(z_1,z_2,\dots,z_m)^{\top}$ be the vector of observed Bernoulli response variables. Note that the likelihood function for $(\beta,\Lambda)$ is

\[
L(\beta,\Lambda|z)=\int_{\mathbb{R}^{q}}\prod_{i=1}^{m}\big[\Phi(w_{i}^{\top}\beta+v_{i}^{\top}u)\big]^{z_{i}}\big[1-\Phi(w_{i}^{\top}\beta+v_{i}^{\top}u)\big]^{1-z_{i}}\phi_{q}(u;0,A(\Lambda))\,du, \tag{2.21}
\]

which is not available in closed form. Here, $A(\Lambda)=\oplus_{j=1}^{r}\Lambda_j\otimes R_j$, with $\oplus$ denoting the direct sum, and $\phi_q(s;a,B)$ denotes the probability density function of the $q$-dimensional normal distribution with mean vector $a$ and covariance matrix $B$, evaluated at $s$.

There are two widely used Monte Carlo approaches for approximating the likelihood function (2.21) and making inference on $(\beta,\Lambda)$, namely the Monte Carlo EM algorithm (Booth and Hobert, 1999) and the Monte Carlo maximum likelihood based on importance sampling (Geyer and Thompson, 1992; Geyer, 1994). As explained in Roy (2022), both the Monte Carlo EM and the Monte Carlo maximum likelihood methods for making inference on $(\beta,\Lambda)$ require effective methods for sampling from the conditional density of the random effect $u$,

\[
f(u|\beta,\Lambda,z)=\frac{f(z,u|\beta,\Lambda)}{L(\beta,\Lambda|z)}, \tag{2.22}
\]

where $f(z,u|\beta,\Lambda)$ is the joint density of $(z,u)$ given by

\[
f(z,u|\beta,\Lambda)=\Bigg[\prod_{i=1}^{m}\big[\Phi(w_{i}^{\top}\beta+v_{i}^{\top}u)\big]^{z_{i}}\big[1-\Phi(w_{i}^{\top}\beta+v_{i}^{\top}u)\big]^{1-z_{i}}\Bigg]\phi_{q}(u;0,A(\Lambda)). \tag{2.23}
\]

Since the likelihood function $L(\beta,\Lambda|z)$ is not available in closed form, neither is the density (2.22).

Albert and Chib's (1993) DA algorithm for the probit regression model is one of the most widely used DA algorithms, and it can be extended to construct DA samplers for (2.22). Following Albert and Chib (1993), let $y_i\in\mathbb{R}$ be the latent continuous normal variable corresponding to the $i$th binary observation $z_i$, that is, $z_i=I(y_i>0)$, where $y_i|\beta,u\overset{\text{ind}}{\sim}N(w_i^{\top}\beta+v_i^{\top}u,1)$ for $i=1,\dots,m$. Then

\[
P(Z_{i}=1)=P(Y_{i}>0)=\Phi(w_{i}^{\top}\beta+v_{i}^{\top}u), \tag{2.24}
\]

that is, $Z_i|\beta,u\overset{\text{ind}}{\sim}\text{Bern}\big(\Phi(w_i^{\top}\beta+v_i^{\top}u)\big)$ as in (2.20). Using the latent variables $y=(y_1,y_2,\dots,y_m)$, we now introduce the joint density

\begin{align*}
f(u,y|\beta,\Lambda,z) &=\Bigg[\prod_{i=1}^{m}\exp\Big\{-\frac{1}{2}\big(y_{i}-w_{i}^{\top}\beta-v_{i}^{\top}u\big)^{2}\Big\}\times\prod_{i=1}^{m}\big[1_{(0,\infty)}(y_{i})\big]^{z_{i}}\big[1_{(-\infty,0]}(y_{i})\big]^{1-z_{i}}\Bigg]\\
&\qquad\times\frac{\phi_{q}(u;0,A(\Lambda))}{L(\beta,\Lambda|z)}. \tag{2.25}
\end{align*}

From (2.21) and (2.24) it follows that

\[
\int_{\mathbb{R}^{m}}f(u,y|\beta,\Lambda,z)\,dy=f(u|\beta,\Lambda,z), \tag{2.26}
\]

that is, the $u$-marginal of the joint density $f(u,y|\beta,\Lambda,z)$ is the target density $f(u|\beta,\Lambda,z)$ given in (2.22). Thus, the augmented data $y$ and the joint density (2.25) satisfy the first property for constructing a valid DA algorithm for $f(u|\beta,\Lambda,z)$.

If the two conditionals of the joint density (2.25) are easy to sample from, then a DA algorithm can be constructed. From (2.25), we see that

\[
y_{i}|u,\beta,\Lambda,z\overset{\text{ind}}{\sim}\text{TN}(w_{i}^{\top}\beta+v_{i}^{\top}u,1,z_{i}),\quad i=1,\dots,m, \tag{2.27}
\]

where $\text{TN}(\mu,\sigma^2,e)$ denotes the distribution of a normal random variable with mean $\mu$ and variance $\sigma^2$ that is truncated to have only positive values if $e=1$, and only negative values if $e=0$. Let $W$ and $V$ denote the $m\times p$ and $m\times q$ matrices whose $i$th rows are $w_i^{\top}$ and $v_i^{\top}$, respectively. Next, using standard linear models-type calculations, Roy (2022) shows that the conditional distribution of $u$ is

\[
u|y,\beta,\Lambda,z\sim N_{q}\big((V^{\top}V+A(\Lambda)^{-1})^{-1}V^{\top}(y-W\beta),\;(V^{\top}V+A(\Lambda)^{-1})^{-1}\big). \tag{2.28}
\]

Thus, a single iteration of the DA algorithm uses the following two steps to move from $u$ to $u^{\prime}$.

Algorithm One iteration of the DA algorithm for the probit GLMM
1:  Given $u$, draw $y_i\overset{\text{ind}}{\sim}\text{TN}(w_i^{\top}\beta+v_i^{\top}u,1,z_i)$ for $i=1,\dots,m$.
2:  Draw $u^{\prime}\sim N_q\big((V^{\top}V+A(\Lambda)^{-1})^{-1}V^{\top}(y-W\beta),\;(V^{\top}V+A(\Lambda)^{-1})^{-1}\big)$.
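For concreteness, here is a minimal Python sketch of this iteration with $\beta$ and $A(\Lambda)$ held fixed. It uses scipy.stats.truncnorm for the truncated normal draws; the variable names (W, V, A_Lambda, etc.) are ours.

\begin{verbatim}
import numpy as np
from scipy.stats import truncnorm

def probit_glmm_da_step(u, beta, W, V, A_Lambda, z, rng=None):
    """One DA iteration for f(u | beta, Lambda, z) in the probit GLMM: u -> u'."""
    rng = np.random.default_rng() if rng is None else rng
    mean_y = W @ beta + V @ u

    # Step 1: y_i ~ N(w_i'beta + v_i'u, 1), truncated to (0, inf) if z_i = 1
    # and to (-inf, 0] if z_i = 0; truncnorm takes standardized bounds.
    lower = np.where(z == 1, -mean_y, -np.inf)
    upper = np.where(z == 1, np.inf, -mean_y)
    y = truncnorm.rvs(lower, upper, loc=mean_y, scale=1.0, random_state=rng)

    # Step 2: u' ~ N_q(B V'(y - W beta), B), with B = (V'V + A(Lambda)^{-1})^{-1}.
    B = np.linalg.inv(V.T @ V + np.linalg.inv(A_Lambda))
    return rng.multivariate_normal(B @ (V.T @ (y - W @ beta)), B)
\end{verbatim}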

One can consider a Bayesian analysis of the probit GLMM with appropriate priors on $(\beta,\Lambda)$, and the DA and Haar PX-DA algorithms discussed in this section can be extended to sample from the corresponding posterior densities (see Wang and Roy, 2018b; Roy, 2022). Similarly, PS&W's DA algorithm for the logistic model discussed in Section 2.2 can be extended to fit logistic mixed models (see Wang and Roy, 2018a; Rao and Roy, 2021; Roy, 2022).

3 Improving the DA algorithm

DA algorithms, although popular MCMC schemes, can suffer from slow convergence to stationarity (Roy and Hobert, 2007; Hobert et al., 2011; Duan et al., 2018). In the literature, a lot of effort has gone into developing methods for accelerating the speed of convergence of DA algorithms. These methods are mainly based on techniques like appropriate reparameterizations and parameter expansions, such as the centered and noncentered parameterizations (Papaspiliopoulos and Roberts, 2008), interweaving the two parameterizations (Yu and Meng, 2011), parameter expanded-data augmentation (PX-DA) (Liu and Wu, 1999), marginal augmentation (Meng and van Dyk, 1999), and calibrated DA (Duan et al., 2018). Unlike the other methods mentioned here, Duan et al.'s (2018) calibrated DA requires an additional Metropolis-Hastings (MH) step to maintain the stationarity of the algorithm with respect to the target density $f_X$. PX-DA and marginal augmentation are among the most popular techniques for speeding up DA algorithms, and we describe them in more detail later in this section.

Graphically, the two steps of the DA algorithm can be viewed as $x\rightarrow y\rightarrow x^{\prime}$. It has been observed that the convergence behavior of a DA algorithm can be significantly improved by inserting a properly chosen extra step in between the two steps of the DA algorithm. The idea of introducing an extra step using appropriate auxiliary variables was developed independently by Liu and Wu (1999), who called the resulting MCMC algorithm PX-DA, and Meng and van Dyk (1999), who called it marginal augmentation. Following these works, Hobert and Marchev (2008) developed general versions of PX-DA type algorithms, and Yu and Meng (2011) referred to these generalized methods as sandwich algorithms since they involve an additional step that is sandwiched between the two steps of the original DA algorithm. Graphically, a sandwich algorithm can be represented as $x\rightarrow y\rightarrow y^{\prime}\rightarrow x^{\prime}$, where the first and third steps are the two steps of the DA algorithm. Suppose the middle step $y\rightarrow y^{\prime}$ is implemented by making a draw according to an Mtd $r(y^{\prime}|y)$. Then, denoting the current state by $x$, a generic sandwich algorithm uses the following three steps to move to the new state $x^{\prime}$.

Algorithm An iteration of a generic sandwich algorithm
1:  Given $x$, draw $Y\sim f_{Y|X}(\cdot|x)$, and call the observed value $y$.
2:  Draw $Y^{\prime}\sim r(\cdot|y)$, and call the result $y^{\prime}$.
3:  Draw $X^{\prime}\sim f_{X|Y}(\cdot|y^{\prime})$.
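Under the same assumptions as the generic DA sketch in Section 1 (user-supplied exact samplers with hypothetical names), the sandwich modification amounts to one extra line:

\begin{verbatim}
def sandwich_step(x, draw_y_given_x, sandwich_move, draw_x_given_y, rng):
    """One iteration of a generic sandwich algorithm: x -> y -> y' -> x'.

    sandwich_move(y, rng) is assumed to return a draw from an Mtd r(.|y)
    that leaves the marginal density f_Y invariant.
    """
    y = draw_y_given_x(x, rng)           # step 1: usual DA draw
    y_prime = sandwich_move(y, rng)      # step 2: extra move, invariant for f_Y
    return draw_x_given_y(y_prime, rng)  # step 3: usual DA draw
\end{verbatim}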

We now describe how to construct an Mtd $r$ for a valid middle step $y\rightarrow y^{\prime}$ in sandwich algorithms. Recall that $f_Y(y)$ is the marginal density of $y$. Suppose $r(y^{\prime}|y)$ leaves $f_Y$ invariant, that is,

\[
\int_{\mathbb{R}^{q}}r(y^{\prime}|y)f_{Y}(y)\,dy=f_{Y}(y^{\prime}). \tag{3.1}
\]

It turns out that for any $r$ satisfying this invariance condition, $f_X$ is stationary for the corresponding sandwich algorithm, which can consequently be used to obtain approximate samples from $f_X$. To see this, note that the Mtd of the sandwich algorithm is given by

\[
k_{\text{SA}}(x^{\prime}|x)=\int_{\mathbb{R}^{q}}\int_{\mathbb{R}^{q}}f_{X|Y}(x^{\prime}|y^{\prime})\,r(y^{\prime}|y)\,f_{Y|X}(y|x)\,dy\,dy^{\prime}, \tag{3.2}
\]

and

\begin{align*}
\int_{\mathbb{R}^{p}}k_{\text{SA}}(x^{\prime}|x)f_{X}(x)\,dx &=\int_{\mathbb{R}^{p}}\bigg[\int_{\mathbb{R}^{q}}\int_{\mathbb{R}^{q}}f_{X|Y}(x^{\prime}|y^{\prime})\,r(y^{\prime}|y)\,f_{Y|X}(y|x)\,dy\,dy^{\prime}\bigg]f_{X}(x)\,dx\\
&=\int_{\mathbb{R}^{q}}f_{X|Y}(x^{\prime}|y^{\prime})\bigg[\int_{\mathbb{R}^{q}}r(y^{\prime}|y)\,f_{Y}(y)\,dy\bigg]\,dy^{\prime}\\
&=\int_{\mathbb{R}^{q}}f_{X|Y}(x^{\prime}|y^{\prime})\,f_{Y}(y^{\prime})\,dy^{\prime}\\
&=f_{X}(x^{\prime}), \tag{3.3}
\end{align*}

where the third equality follows from (3.1). Furthermore, if $r(y^{\prime}|y)$ satisfies the detailed balance condition with respect to $f_Y(y)$, that is, if $r(y^{\prime}|y)f_Y(y)$ is symmetric in $(y,y^{\prime})$, then

\begin{align*}
k_{\text{SA}}(x^{\prime}|x)f_{X}(x) &=f_{X}(x)\int_{\mathbb{R}^{q}}\int_{\mathbb{R}^{q}}f_{X|Y}(x^{\prime}|y^{\prime})\,r(y^{\prime}|y)\,f_{Y|X}(y|x)\,dy\,dy^{\prime}\\
&=\int_{\mathbb{R}^{q}}\int_{\mathbb{R}^{q}}f_{X|Y}(x^{\prime}|y^{\prime})\,r(y^{\prime}|y)\,f(x,y)\,dy\,dy^{\prime}\\
&=\int_{\mathbb{R}^{q}}\int_{\mathbb{R}^{q}}f_{X|Y}(x^{\prime}|y^{\prime})\,r(y^{\prime}|y)f_{Y}(y)\,f_{X|Y}(x|y)\,dy\,dy^{\prime}\\
&=\int_{\mathbb{R}^{q}}\int_{\mathbb{R}^{q}}f_{X|Y}(x^{\prime}|y^{\prime})\,r(y|y^{\prime})f_{Y}(y^{\prime})\,f_{X|Y}(x|y)\,dy\,dy^{\prime}\\
&=\int_{\mathbb{R}^{q}}\int_{\mathbb{R}^{q}}f(x^{\prime},y^{\prime})\,r(y|y^{\prime})\,f_{X|Y}(x|y)\,dy\,dy^{\prime}\\
&=f_{X}(x^{\prime})\int_{\mathbb{R}^{q}}\int_{\mathbb{R}^{q}}f_{X|Y}(x|y)\,r(y|y^{\prime})\,f_{Y|X}(y^{\prime}|x^{\prime})\,dy\,dy^{\prime}\\
&=k_{\text{SA}}(x|x^{\prime})f_{X}(x^{\prime}), \tag{3.4}
\end{align*}

that is, the corresponding sandwich algorithm is reversible with respect to the target density $f_X$.

Let us denote the Markov chain underlying the sandwich algorithm by $\{X_n^*\}_{n\geq 1}$. Since it follows from (3.3) or (3.4) that the target density $f_X$ is stationary for $\{X_n^*\}_{n\geq 1}$, the sample average $\bar{h}^*_n:=\sum_{i=1}^n h(X^*_i)/n$ based on the sandwich Markov chain can also be used to estimate $\mathrm{E}_{f_X}h$. Indeed, as mentioned in Section 1, if $\{X_n^*\}_{n\geq 1}$ is appropriately irreducible, then $\bar{h}^*_n\rightarrow\mathrm{E}_{f_X}h$ almost surely as $n\rightarrow\infty$. Now, $\bar{h}^*_n$ will be preferred over $\bar{h}_n$ as an estimate of $\mathrm{E}_{f_X}h$ if the former leads to smaller (Monte Carlo) errors. In particular, if there is a Markov chain central limit theorem (CLT) for $\bar{h}_n$, that is, if $\sqrt{n}(\bar{h}_n-\mathrm{E}_{f_X}h)\stackrel{d}{\rightarrow}\mathcal{N}(0,\sigma_h^2)$ for some finite asymptotic variance $\sigma_h^2$, then the standard error of $\bar{h}_n$ is $\hat{\sigma}_h/\sqrt{n}$, where $\hat{\sigma}_h$ is a consistent estimator of $\sigma_h$ (see Flegal and Jones, 2010; Vats et al., 2019, for methods of constructing $\hat{\sigma}_h$). Thus, in practice, establishing CLTs for $\bar{h}_n$ and $\bar{h}^*_n$ is important for ascertaining and comparing the quality of these estimators. A sufficient condition extensively used in the literature for guaranteeing the existence of such a CLT for every square integrable function $h$ (that is, $\int_{\mathbb{R}^p}h^2(x)f_X(x)\,dx<\infty$) is that the Markov chain converges to its stationary distribution at a geometric rate (Roberts and Rosenthal, 1997; Chan and Geyer, 1994).
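As a hedged illustration of how the standard error $\hat{\sigma}_h/\sqrt{n}$ might be computed in practice, the sketch below implements the nonoverlapping batch means estimator, one standard approach among the methods surveyed in the references cited above; it is not presented as the only or necessarily best choice.

\begin{verbatim}
import numpy as np

def batch_means_se(h_values, n_batches=30):
    """Batch means standard error for the Markov chain average of h(X_1), ..., h(X_n).

    Splits the output into nonoverlapping batches, estimates sigma_h^2 by
    (batch size) * sample variance of the batch means, and returns
    hat{sigma}_h / sqrt(n).
    """
    h_values = np.asarray(h_values, dtype=float)
    n = h_values.size
    batch_size = n // n_batches
    usable = batch_size * n_batches                    # drop the remainder
    batch_means = h_values[:usable].reshape(n_batches, batch_size).mean(axis=1)
    sigma2_hat = batch_size * batch_means.var(ddof=1)  # estimate of sigma_h^2
    return np.sqrt(sigma2_hat / n)
\end{verbatim}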

Formally, the chain $\{X_n\}_{n\geq 1}$ is called geometrically ergodic if there exist a function $M:\mathbb{R}^p\rightarrow[0,\infty)$ and a constant $\rho\in[0,1)$ such that, for all $x\in\mathbb{R}^p$ and all $n=1,2,\dots$,

\[
\int_{\mathbb{R}^{p}}|k^{n}(x^{\prime}|x)-f_{X}(x^{\prime})|\,dx^{\prime}\leq M(x)\,\rho^{n},
\]

where $k^n(\cdot|\cdot)$ is the $n$-step Mtd of the chain $\{X_n\}_{n\geq 1}$. Unfortunately, the simple irreducibility and aperiodicity conditions mentioned in Section 1 do not imply geometric ergodicity. The most common method of proving geometric ergodicity of $\{X_n\}_{n\geq 1}$ is by establishing drift and minorization conditions (Rosenthal, 1995; Jones and Hobert, 2001). Indeed, several of the DA algorithms considered in this article have been shown to be geometrically ergodic by establishing appropriate drift and minorization conditions. For example, for the Bayesian lasso model discussed in Section 2.1, Rajaratnam et al. (2019) established geometric ergodicity of the DA Markov chain in the special case when $\alpha=\xi=0$. The geometric rate of convergence of PS&W's DA Markov chain for the binary logistic regression model discussed in Section 2.2 has been established by Choi and Hobert (2013) and Wang and Roy (2018c) under proper normal and improper uniform priors on $\beta$, respectively.

If the sandwich Markov chain $\{X_n^*\}_{n\geq 1}$ is reversible, then, denoting the asymptotic variance of $\bar{h}^*_n$ by $\sigma_{h,*}^2$, it turns out that $\sigma_{h,*}^2\leq\sigma_h^2$ for every square integrable (with respect to $f_X$) function $h$ (Hobert and Marchev, 2008). Thus, sandwich algorithms are asymptotically more efficient than the corresponding DA algorithms. So if both the DA and sandwich algorithms are run for the same number of iterations, then the errors in $\bar{h}^*_n$ are expected to be smaller than those in $\bar{h}_n$. In other words, to achieve the same level of precision, the sandwich algorithm is expected to require fewer iterations than the DA algorithm. In Section 4, we discuss other results established in the literature showing the superiority of sandwich algorithms over the corresponding DA algorithms.

Intuitively, the extra step $r$ reduces the correlation between $x$ and $x^{\prime}$ and thus improves the mixing of the DA algorithm. On the other hand, since the extra step in a sandwich algorithm involves more computational effort compared to the DA algorithm, any gain in mixing should be weighed against this increased computational burden. Fortunately, there are several examples where the sandwich algorithm provides a “free lunch” in that it converges much faster than the underlying DA algorithm while requiring a similar amount of computational effort per iteration (see, e.g., Laha et al., 2016; Pal et al., 2015; Roy and Hobert, 2007; Hobert et al., 2011; Roy, 2014). Following Liu and Wu (1999) and Meng and van Dyk (1999), the Mtd $r$ for these efficient sandwich algorithms is often constructed by introducing extra parameters $g\in{\mathsf{G}}\subset\mathbb{R}^d$ and a class of functions $t_g:\mathbb{R}^q\rightarrow\mathbb{R}^q$ indexed by $g$. The middle step $y\rightarrow y^{\prime}$ is implemented by drawing $g$ from a density depending on $y$ and then setting $y^{\prime}=t_g(y)$, maintaining invariance with respect to $f_Y$. Typically, $d$ is small, say 1 or 2, and $\{t_g(y):g\in{\mathsf{G}}\}$ is a small subspace of $\mathbb{R}^q$. Thus, drawing from such an $r$ is usually much less computationally expensive than the two steps of the DA algorithm. However, as mentioned before, even when $d=1$, this perturbation $y\rightarrow y^{\prime}$ often greatly improves the mixing of the DA algorithm.

We now describe Liu and Wu's (1999) Haar PX-DA algorithm, where the set ${\mathsf{G}}$ is assumed to possess a certain group structure (see also Hobert and Marchev, 2008). In particular, ${\mathsf{G}}$ is assumed to be a topological group, that is, a group such that the functions $(g_1,g_2)\mapsto g_1g_2$ and $g\mapsto g^{-1}$ are continuous. An example of such a ${\mathsf{G}}$ is the multiplicative group $\mathbb{R}_+$, where the binary operation defining the group is multiplication, the identity element is 1, and $g^{-1}=1/g$. Suppose $t_g(y)$ represents ${\mathsf{G}}$ acting topologically on the left of $\mathbb{R}^q$ (Eaton, 1989, Chapter 2), that is, $t_e(y)=y$ for all $y\in\mathbb{R}^q$, where $e$ is the identity element in ${\mathsf{G}}$, and $t_{g_1g_2}(y)=t_{g_1}\big(t_{g_2}(y)\big)$ for all $g_1,g_2\in{\mathsf{G}}$ and all $y\in\mathbb{R}^q$.

We assume the existence of a multiplier $\chi:{\mathsf{G}}\rightarrow\mathbb{R}_+$, that is, $\chi$ is continuous and $\chi(g_1g_2)=\chi(g_1)\chi(g_2)$ for all $g_1,g_2\in{\mathsf{G}}$. Assume that the Lebesgue measure on $\mathbb{R}^q$ is relatively (left) invariant with respect to the multiplier $\chi$; that is, assume that for any $g\in{\mathsf{G}}$ and any integrable function $\zeta:\mathbb{R}^q\rightarrow\mathbb{R}$, we have

\[
\chi(g)\int_{\mathbb{R}^{q}}\zeta\big(t_{g}(y)\big)\,dy=\int_{\mathbb{R}^{q}}\zeta(y)\,dy.
\]
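For instance, if ${\mathsf{G}}=\mathbb{R}_+$ acts on $\mathbb{R}^q$ by scaling, $t_g(y)=gy$, then the change of variables $u=gy$ gives
\[
\int_{\mathbb{R}^{q}}\zeta(gy)\,dy=g^{-q}\int_{\mathbb{R}^{q}}\zeta(u)\,du,
\]
so the displayed condition holds with the multiplier $\chi(g)=g^q$; this is exactly the multiplier used (with $q$ replaced by $m$) in the probit GLMM example below.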

Next, suppose the group ${\mathsf{G}}$ has a left-Haar measure of the form $\nu_l(g)\,dg$, where $dg$ denotes Lebesgue measure on ${\mathsf{G}}$. It is known that the left-Haar measure satisfies

\[
\int_{\mathsf{G}}h\big(\tilde{g}g\big)\,\nu_{l}(g)\,dg=\int_{\mathsf{G}}h(g)\,\nu_{l}(g)\,dg \tag{3.5}
\]

for all $\tilde{g}\in{\mathsf{G}}$ and all integrable functions $h:{\mathsf{G}}\rightarrow\mathbb{R}$. Finally, assume that

\[
q(y):=\int_{\mathsf{G}}f_{Y}\big(t_{g}(y)\big)\,\chi(g)\,\nu_{l}(g)\,dg
\]

is strictly positive for all $y\in\mathbb{R}^q$ and finite for (almost) all $y\in\mathbb{R}^q$.

We now state how the middle step $y\rightarrow y^{\prime}$ is implemented in Liu and Wu's (1999) Haar PX-DA algorithm. If the current state of the Haar PX-DA chain is $x$, an iteration of this chain uses the following steps to move to $x^{\prime}$.

Algorithm An iteration for the Haar PX-DA algorithm
1:  Given $x$, draw $Y\sim f_{Y|X}(\cdot|x)$, and call the observed value $y$.
2:  Draw $G$ from the density proportional to $f_Y\big(t_g(y)\big)\,\chi(g)\,\nu_l(g)$, call the result $g$, and set $y^{\prime}=t_g(y)$.
3:  Draw $X^{\prime}\sim f_{X|Y}(\cdot|y^{\prime})$.
Example 1 (continuation of the probit GLMM example from Section 2.3).

Improving the DA algorithm for the probit mixed model presented in Section 2.3, Roy (2022) constructs a Haar PX-DA algorithm for (2.22). For constructing the sandwich step $y\rightarrow y^{\prime}$, we need the marginal density of $Y$, $f_Y(y|\beta,\Lambda,z)$, obtained from the joint density $f(u,y|\beta,\Lambda,z)$ given in (2.25). It turns out that

\[
f_{Y}(y|\beta,\Lambda,z)\propto\prod_{i=1}^{m}\big[1_{(0,\infty)}(y_{i})\big]^{z_{i}}\big[1_{(-\infty,0]}(y_{i})\big]^{1-z_{i}}\exp\Big\{-\frac{1}{2}\big[y^{\top}V_{1}y-2y^{\top}V_{1}W\beta\big]\Big\}, \tag{3.6}
\]

where

\[
V_{1}=\big[I_{m}-V\big(V^{\top}V+(A(\Lambda))^{-1}\big)^{-1}V^{\top}\big].
\]

Let $\mathcal{Y}$ denote the subset of $\mathbb{R}^m$ where $y$ lives, that is, $\mathcal{Y}$ is the Cartesian product of $m$ half (positive or negative) lines, where the $i$th component is $(0,\infty)$ (if $z_i=1$) or $(-\infty,0]$ (if $z_i=0$).

As mentioned earlier in this section, take ${\mathsf{G}}=\mathbb{R}_+$ and let the multiplicative group ${\mathsf{G}}$ act on $\mathcal{Y}$ through $t_g(y)=gy=(gy_1,\dots,gy_m)$. Note that, for any $\tilde{g}\in{\mathsf{G}}$, we have

\[
\int_{0}^{\infty}h\big(\tilde{g}g\big)\,\frac{1}{g}\,dg=\int_{0}^{\infty}h(g)\,\frac{1}{g}\,dg,
\]

which shows from (3.5) that $\nu_l(dg)=dg/g$ is a left-Haar measure for the multiplicative group, where $dg$ is Lebesgue measure on $\mathbb{R}_+$. Also, with the group action defined this way, it is known that the Lebesgue measure on $\mathcal{Y}$ is relatively left invariant with the multiplier $\chi(g)=g^m$ (Hobert and Marchev, 2008). Thus,

\[
f_{Y}\big(gy|\beta,\Lambda,z\big)\,\chi(g)\,\nu_{l}(dg)\propto g^{m-1}\exp\Big\{-\frac{1}{2}\big[g^{2}y^{\top}V_{1}y-2gy^{\top}V_{1}W\beta\big]\Big\}\,dg, \tag{3.7}
\]

and since $V_1$ is a positive definite matrix,

\[
q(y)=\int_{0}^{\infty}g^{m-1}\exp\Big\{-\frac{1}{2}\big[g^{2}y^{\top}V_{1}y-2gy^{\top}V_{1}W\beta\big]\Big\}\,dg
\]

is strictly positive for all $y\in\mathcal{Y}$ and finite for (almost) all $y\in\mathcal{Y}$. Thus, a single iteration of the Haar PX-DA algorithm for the probit mixed model uses the following three steps to move from $u$ to $u^{\prime}$.

Algorithm One iteration of the Haar PX-DA algorithm for the probit GLMM
1:  Given $u$, draw $y_i\overset{\text{ind}}{\sim}\text{TN}(w_i^{\top}\beta+v_i^{\top}u,1,z_i)$ for $i=1,\dots,m$.
2:  Draw $g$ from (3.7).
3:  Calculate $y_i^{\prime}=gy_i$ for $i=1,\dots,m$, and draw $u^{\prime}$ from (2.28) conditional on $y^{\prime}=(y_1^{\prime},\dots,y_m^{\prime})^{\top}$, that is, draw
\[
u^{\prime}\sim N_{q}\big((V^{\top}V+(A(\Lambda))^{-1})^{-1}V^{\top}(y^{\prime}-W\beta),\;(V^{\top}V+A(\Lambda)^{-1})^{-1}\big).
\]

The density (3.7) is log-concave, and Roy (2022) suggests using Gilks and Wild's (1992) adaptive rejection sampling algorithm to sample from it. Since a single draw from (3.7) is the only difference between each iteration of the DA and the Haar PX-DA algorithms, the computational burden of the two algorithms is similar.
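As a hedged illustration of how this sandwich step could be bolted onto the probit GLMM DA sketch above, the following draws $g$ from the density proportional to (3.7) using a simple grid-based inverse-CDF approximation (not the adaptive rejection sampler suggested by Roy, 2022); a and b below denote $y^{\top}V_1y$ and $y^{\top}V_1W\beta$, and the function name is ours.

\begin{verbatim}
import numpy as np

def draw_g_haar_pxda(y, V1, W, beta, rng, n_grid=2000):
    """Approximate draw of g from the density proportional to
    g^{m-1} exp{-(g^2 * y'V1y - 2g * y'V1W beta)/2} on (0, infinity),
    via a crude discretized inverse-CDF (grid) approximation."""
    m = y.size
    a = y @ (V1 @ y)              # y' V1 y  (> 0 since V1 is positive definite)
    b = y @ (V1 @ (W @ beta))     # y' V1 W beta

    # Grid the effective support; the Gaussian-like factor concentrates near b/a.
    g_max = max(b / a, 0.0) + 10.0 / np.sqrt(a)
    grid = np.linspace(g_max / n_grid, g_max, n_grid)
    log_dens = (m - 1) * np.log(grid) - 0.5 * (a * grid**2 - 2.0 * b * grid)
    probs = np.exp(log_dens - log_dens.max())
    probs /= probs.sum()
    return rng.choice(grid, p=probs)  # then set y' = g * y in the sandwich step
\end{verbatim}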

Given the group ${\mathsf{G}}$, the Haar PX-DA algorithm is the best sandwich algorithm in terms of efficiency and operator norm (Hobert and Marchev, 2008). Khare and Hobert (2011) established necessary and sufficient conditions for the Haar PX-DA algorithm to be strictly better than the corresponding DA algorithm (see Section 4 for details). Also, the Haar PX-DA method with appropriate choices of ${\mathsf{G}}$ and $t_g$ has been empirically demonstrated to result in huge improvements in the mixing of DA algorithms, albeit with roughly the same computational cost, in various examples (see, e.g., Roy and Hobert, 2007; Liu and Wu, 1999; Meng and van Dyk, 1999; Roy, 2014; Pal et al., 2015). Finally, there are other strategies proposed in the literature to improve DA algorithms without introducing extra auxiliary variables as in the sandwich algorithms. For example, Roy (2016) replaces one of the two original steps of the DA algorithm with a draw from the MH step given in Liu (1996). One advantage of Roy's (2016) modified DA algorithm over the sandwich algorithms is that the Markov operator of the modified algorithm can have negative eigenvalues, whereas the DA algorithms and, generally, the sandwich algorithms used in practice are positive Markov chains with nonnegative eigenvalues (Khare and Hobert, 2011). Hence, Roy's (2016) modified DA algorithm can lead to superior performance over the DA and sandwich algorithms in terms of asymptotic variance.

4 Spectral properties of the DA Markov chain

The convergence of the Markov chains associated with MCMC algorithms to the desired stationary distribution is at the heart of the popularity of MCMC technology. In Section 1 we discussed the convergence of the sample averages $\bar{h}_n$ and of the marginal distributions of $X_n$. For a serious practitioner, understanding the quality of both types of convergence is important. As shown in Section 3, proving geometric ergodicity of the Markov chain is key to providing standard errors for $\bar{h}_n$. Establishing geometric ergodicity for general state space Markov chains arising in statistical practice is in general quite challenging. The drift and minorization technique that is often used to prove geometric ergodicity has two flavors/tracks. The first track leads to establishing qualitative geometric ergodicity, without providing quantitative convergence bounds for the distance to stationarity after finitely many Markov chain iterations. The other track leads to explicit convergence bounds, but these bounds are often too conservative to be useful, especially in modern high-dimensional settings (Jones and Hobert, 2004; Qin and Hobert, 2021). While several alternative and promising methods for obtaining convergence bounds have been developed in recent years (see Qin and Hobert, 2022, for a useful summary), it is safe to say that a successful geometric ergodicity analysis remains elusive for an overwhelming majority of MCMC algorithms used in statistical practice. However, as we will see below, the special structure/construction of the DA Markov chain often allows us, in many examples, to establish stronger properties which imply geometric ergodicity in particular, and lead to a richer and deeper understanding of the structure and convergence properties of the relevant DA Markov chain.

4.1 Leveraging operator theory for comparing DA and sandwich chains

Operator theory and associated tools play a key role in the convergence analysis and comparison of the DA Markov chain (with Mtd $k$) and the sandwich Markov chain (with Mtd $k_{SA}$). Let $L_0^2(f_X)$ denote the space of real-valued functions (with domain $\mathbb{R}^p$) that have mean zero and are square integrable with respect to $f_X$. In particular,

L02(fX)={h:p:ph(x)2fX(x)𝑑x< andph(x)fX(x)𝑑x=0}.L_{0}^{2}(f_{X})=\left\{h:\mathbb{R}^{p}\rightarrow\mathbb{R}:\;\int_{\mathbb{R}^{p}}h(x)^{2}f_{X}(x)dx<\infty\mbox{ and}\int_{\mathbb{R}^{p}}h(x)f_{X}(x)dx=0\right\}.

Let K:L02(fX)L02(fX)K:L_{0}^{2}(f_{X})\rightarrow L_{0}^{2}(f_{X}) and KSA:L02(fX)L02(fX)K_{SA}:L_{0}^{2}(f_{X})\rightarrow L_{0}^{2}(f_{X}) denote the Markov operators corresponding to the Markov transition densities kk and kSAk_{SA} respectively. Then for any hL02(fX)h\in L_{0}^{2}(f_{X}), we have

(Kh)(x)=ph(x)k(xx)𝑑x and (KSAh)(x)=ph(x)kSA(xx)𝑑x.(Kh)(x)=\int_{\mathbb{R}^{p}}h(x^{\prime})k(x^{\prime}\mid x)dx^{\prime}\mbox{ and }(K_{SA}h)(x)=\int_{\mathbb{R}^{p}}h(x^{\prime})k_{SA}(x^{\prime}\mid x)dx^{\prime}.

Note that L02(fX)L_{0}^{2}(f_{X}) is a Hilbert space equipped with the inner product

<h1,h2>=ph1(x)h2(x)fX(x)𝑑x<h_{1},h_{2}>=\int_{\mathbb{R}^{p}}h_{1}(x)h_{2}(x)f_{X}(x)dx

for every pair h1,h2L02(fX)h_{1},h_{2}\in L_{0}^{2}(f_{X}), and the corresponding norm defined by h=<h,h>\|h\|=\sqrt{<h,h>} for every hL02(fX)h\in L_{0}^{2}(f_{X}). The operator norms of KK and KSAK_{SA}, denoted respectively by K\|K\| and KSA\|K_{SA}\|, are defined as

K=suphL02(fX),h=1Kh and KSA=suphL02(fX),h=1KSAh.\|K\|=\sup_{h\in L_{0}^{2}(f_{X}),\|h\|=1}\|Kh\|\mbox{ and }\|K_{SA}\|=\sup_{h\in L_{0}^{2}(f_{X}),\|h\|=1}\|K_{SA}h\|.

The convergence properties of the DA and sandwich Markov chains are intricately linked to the behavior of the corresponding Markov operators KK and KSAK_{SA}. Using the structure of kk and kSAk_{SA}, it can easily be shown that the operator KK is a non-negative operator (an operator PP is non-negative if <Pg,g><Pg,g> is non-negative g\forall g), K[0,1]\|K\|\in[0,1] and KSA[0,1]\|K_{SA}\|\in[0,1]. Also, KK is a self-adjoint operator (an operator PP is self-adjoint if <Ph1,h2>=<Ph2,h1>h1,h2<Ph_{1},h_{2}>=<Ph_{2},h_{1}>\;\forall h_{1},h_{2}), and if r(yy)r(y^{\prime}\mid y) satisfies detailed balance with respect to fY(y)f_{Y}(y) (which will be assumed henceforth, unless specified otherwise) then so is KSAK_{SA}. Roberts and Rosenthal, (1997) show that for a self-adjoint Markov operator PP corresponding to a Harris ergodic chain, P<1\|P\|<1 if and only if the corresponding Markov chain is geometrically ergodic. In this case, P\|P\| in fact corresponds to the asymptotic rate of convergence of the relevant Markov chain to its stationary distribution. The above concepts enable us to state the first result that rigorously shows that the sandwich chain is at least as good as the DA chain in some key aspects.

Theorem 4.1 (Hobert and Marchev, (2008)).

If the DA chain with Mtd kk is Harris ergodic, and the sandwich chain (with Mtd kSAk_{SA}) can itself be interpreted as a DA chain, then KSAK\|K_{SA}\|\leq\|K\|. Hence, geometric Harris ergodicity of the DA chain implies geometric Harris ergodicity of the sandwich chain, and in such settings the asymptotic rate of convergence to stationarity for the sandwich chain is at least as good as that of the DA chain.

The structure and analysis of sandwich chains are more complicated (sometimes significantly so) than those of DA chains, due to the additional sandwich step. However, the above result allows us to completely avoid analyzing the sandwich chain if geometric ergodicity can be established for the corresponding DA chain.

Intuition suggests that, given the extra step, the sandwich chain should mix better and converge faster than the original DA chain under suitable regularity conditions. The above result provides rigorous justification for this intuition. See Section 4.4 for further discussion.
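
To make these operator comparisons concrete, the following small numerical sketch builds the DA and sandwich kernels on a finite state space and computes their norms on L_0^2(f_X). The joint pmf and the Metropolis-type move on the y-space that plays the role of the sandwich step are illustrative assumptions, not constructions from the literature discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete DA setup (an illustrative assumption): a joint pmf f(x, y) on a 4 x 3 grid.
F = rng.random((4, 3))
F /= F.sum()
fX, fY = F.sum(axis=1), F.sum(axis=0)

# Conditional kernels: rows of PYgX are f_{Y|X}(.|x); rows of PXgY are f_{X|Y}(.|y).
PYgX = F / fX[:, None]
PXgY = (F / fY[None, :]).T

# DA kernel k(x'|x) and a sandwich kernel k_SA(x'|x); here R is a Metropolis
# kernel on the y-space targeting f_Y, so it is reversible with respect to f_Y.
K = PYgX @ PXgY
nY = len(fY)
R = np.full((nY, nY), 1.0 / nY) * np.minimum(1.0, fY[None, :] / fY[:, None])
np.fill_diagonal(R, 0.0)
np.fill_diagonal(R, 1.0 - R.sum(axis=1))   # rejection mass stays put
K_SA = PYgX @ R @ PXgY

def norm_on_L20(P, pi):
    """Operator norm of a pi-reversible kernel P restricted to mean-zero functions."""
    d = np.sqrt(pi)
    S = (P * d[:, None]) / d[None, :]      # symmetric because P is pi-reversible
    lam = np.linalg.eigvalsh(S)            # real eigenvalues, ascending
    return max(abs(lam[0]), lam[-2])       # drop the eigenvalue 1 (constant functions)

print("||K||    =", norm_on_L20(K, fX))
print("||K_SA|| =", norm_on_L20(K_SA, fX))
```

Because any Markov operator has norm at most one, the printed values should satisfy ||K_SA|| <= ||K||, which is the comparison that Theorem 4.1 formalizes (see also Section 4.4).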

4.2 The trace class property for Markov operators

An operator PP on L02(fX)L^{2}_{0}(f_{X}) is defined to be Hilbert-Schmidt if for any orthonormal sequence {hn}n1\{h_{n}\}_{n\geq 1} in L02(fX)L^{2}_{0}(f_{X}), we have

n=1Phn2<.\sum_{n=1}^{\infty}\|Ph_{n}\|^{2}<\infty. (4.1)

Additionally, a positive self-adjoint operator (such as the DA operator KK) is defined to be trace class if

n=1<Phn,hn><\sum_{n=1}^{\infty}<Ph_{n},h_{n}><\infty (4.2)

for any orthonormal sequence {hn}n1\{h_{n}\}_{n\geq 1} in L02(fX)L^{2}_{0}(f_{X}). Both the above properties imply that the operator PP is compact, and has countably many singular values. The trace class property implies that these singular values are summable, while the Hilbert-Schmidt property implies that these singular values are square-summable. If the associated Markov chain is Harris ergodic, then compactness of a Markov operator implies that its spectral radius is less than 11. Existing results (see for example Proposition 2.1 and Remark 2.1 in Roberts and Rosenthal, (1997)) can now be leveraged to establish geometric ergodicity. To summarize, for Harris ergodic Markov chains, we have

Hilbert Schmidt Compact Geometrically ergodic,\mbox{Hilbert Schmidt }\Longrightarrow\mbox{Compact }\Longrightarrow\mbox{Geometrically ergodic},

and if the corresponding Markov operator is also positive and self-adjoint, then

Trace Class Hilbert Schmidt Compact Geometrically ergodic.\mbox{Trace Class }\Longrightarrow\mbox{Hilbert Schmidt }\Longrightarrow\mbox{Compact }\Longrightarrow\mbox{Geometrically ergodic}.

The above implications offer a potentially useful alternate route for establishing geometric ergodicity. However, the conditions in (4.1) and (4.2) seem hard to verify, especially for intractable Markov transition densities arising in statistical applications. Fortunately, there are equivalent characterizations for (4.1) and (4.2) that are easier to state and verify. In particular, results in Davies, (1983) can be leveraged to show that a Markov operator PP (with corresponding Mtd p(,)p(\cdot,\cdot)) on L02(fX)L^{2}_{0}(f_{X}) is Hilbert-Schmidt if and only if

ppp(x,y)2fX(x)fX(y)𝑑x𝑑y=pp(p(x,y)fX(y))2fX(x)fX(y)𝑑x𝑑y<,\int_{\mathbb{R}^{p}}\int_{\mathbb{R}^{p}}\frac{p(x,y)^{2}f_{X}(x)}{f_{X}(y)}dxdy=\int_{\mathbb{R}^{p}}\int_{\mathbb{R}^{p}}\left(\frac{p(x,y)}{f_{X}(y)}\right)^{2}f_{X}(x)f_{X}(y)dxdy<\infty, (4.3)

and in case PP is self-adjoint and positive, it is trace class if and only if

pp(x,x)𝑑x<.\int_{\mathbb{R}^{p}}p(x,x)dx<\infty. (4.4)

The conditions in (4.3) and (4.4), while simpler than those in (4.1) and (4.2), can still be quite challenging to verify for Markov chains arising in statistical practice. The corresponding Markov transition densities are themselves often expressed as intractable integrals, and the analysis of the expressions in (4.3) and (4.4) can be quite lengthy, laborious and intricate. As the authors state in Liu et al., (1995), the condition in (4.3) “is standard, but not easy to check and understand”. Yet, there have been many success stories in recent years; see Khare and Hobert, (2011); Pal et al., (2017); Chakraborty and Khare, (2017, 2019); Qin et al., (2019); Rajaratnam et al., (2019); Jin and Tan, (2021). Markov chains used by statistical practitioners overwhelmingly employ either the Gibbs sampling algorithm, or the Metropolis-Hastings algorithm, or a combination of both. Any continuous state space chain with a non-trivial Metropolis component cannot be compact (Chan and Geyer,, 1994) and hence cannot be trace class. The same is true for random scan Gibbs chains. (Systematic scan) Gibbs sampling Markov chains are in general not reversible. However, the convergence of a two-block (systematic scan) Gibbs chain is completely characterized by its “marginal chains” which correspond to positive self-adjoint Markov operators (Liu et al.,, 1994; Diaconis et al.,, 2008). In particular, the construction in Section 1 shows that the DA chain is a marginal chain of a two-block Gibbs sampler for fX,Yf_{X,Y}. Hence, it is not surprising that all Markov chains which have been shown to be trace class in these papers are DA Markov chains.
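
As a simple illustration of checking (4.4) in a case where the answer is known, consider a toy DA chain (an illustrative assumption, not one of the examples cited above) in which f_{X,Y} is bivariate normal with correlation rho; then f_{Y|X}(.|x) is N(rho x, 1 - rho^2), f_{X|Y}(.|y) is N(rho y, 1 - rho^2), and the DA kernel is k(x'|x) = N(x'; rho^2 x, 1 - rho^4) with eigenvalues rho^(2i), i >= 0. A short numerical check of the trace class integral, assuming NumPy and SciPy are available, is given below.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

rho = 0.8  # illustrative value

def k(x_new, x_old):
    # DA kernel of the bivariate normal toy model: k(x'|x) = N(x'; rho^2 x, 1 - rho^4)
    return norm.pdf(x_new, loc=rho**2 * x_old, scale=np.sqrt(1 - rho**4))

# Trace class check (4.4): integrate k(x, x) over the whole real line.
trace, _ = quad(lambda x: k(x, x), -np.inf, np.inf)

print("integral of k(x,x):", trace)            # ~ 1 / (1 - rho^2)
print("sum of rho^(2i)   :", 1 / (1 - rho**2))
```

The integral evaluates to 1/(1 - rho^2), which matches the sum of the eigenvalues rho^(2i); so this toy chain is trace class and, by the chain of implications above, geometrically ergodic.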

4.3 Trace class property as a tool to establish geometric ergodicity

We have already discussed how establishing the trace class property for a positive self-adjoint Markov operator corresponding to a Harris ergodic Markov chain establishes geometric ergodicity of that chain. In this section, we discuss the potential advantages and disadvantages of this alternative approach relative to the drift and minorization approach for establishing geometric ergodicity. A key advantage is that the trace class approach based on (4.4) is more straightforward and streamlined than the drift and minorization approach. Indeed, construction of effective drift functions depends heavily on the Markov chain at hand and has been described as “a matter of art” (Diaconis et al.,, 2008). On the other hand, the drift and minorization approach has broader applicability, and has been successfully applied to non-reversible and non-compact chains.

Suppose that for a DA Markov chain one is able to establish geometric ergodicity through a drift and minorization analysis, and is also able to establish the trace class condition in (4.4). Since the trace class property is much stronger than geometric ergodicity, one would expect that assumptions needed for establishing the trace class property are stronger than those needed for the drift and minorization analysis. Surprisingly, as demonstrated in Mukherjee et al., (2023), this is not always true. To understand why this might happen, we take a brief but careful look at the Markov chain analyzed in Mukherjee et al., (2023) below.

Similar to Sections 2.2 and 2.3, consider a binary regression setting with nn independent binary responses Z1,Z2,,ZnZ_{1},Z_{2},\cdots,Z_{n} and corresponding predictor vectors 𝐰1,𝐰2,,𝐰np{\bf w}_{1},{\bf w}_{2},\cdots,{\bf w}_{n}\in\mathbb{R}^{p}, such that

P(Zi=1𝜷)=Fν(𝐰iT𝜷),P(Z_{i}=1\mid{\boldsymbol{\beta}})=F_{\nu}({\bf w}_{i}^{T}{\boldsymbol{\beta}}), (4.5)

for 1in1\leq i\leq n. The goal is to estimate 𝜷p{\boldsymbol{\beta}}\in\mathbb{R}^{p}. Here FνF_{\nu} is the CDF of the Student’s tt-distribution with ν\nu degrees of freedom, and the corresponding model is referred to as the robit regression model (Liu,, 2004). In the binary regression setting, outliers are observations with unexpectedly large predictor values and a misclassified response. The robit model can more effectively down-weight outliers and produce a better fit than probit or logistic models (Pregibon,, 1982).

Following Roy, 2012a ; Albert and Chib, (1993), consider a Bayesian model with a multivariate normal prior for 𝜷{\boldsymbol{\beta}} with mean 𝜷a{\boldsymbol{\beta}}_{a} and covariance matrix Σa1\Sigma_{a}^{-1}. Let 𝐳{0,1}n{\bf z}\in\{0,1\}^{n} denote the vector of observed values for the response variables Z1,Z2,,ZnZ_{1},Z_{2},\cdots,Z_{n}. As demonstrated in Roy, 2012a , the posterior density π(𝜷𝐳)\pi({\boldsymbol{\beta}}\mid{\bf z}) is intractable, and Roy, 2012a develops a DA approach to construct a computationally tractable Markov chain with π(𝜷𝐳)\pi({\boldsymbol{\beta}}\mid{\bf z}) as its stationary density by introducing unobserved latent variables {(Ui,λi)}i=1n\{(U_{i},\lambda_{i})\}_{i=1}^{n} that are mutually independent and satisfy Uiλi𝒩(𝐰iT𝜷,1/λi)U_{i}\mid\lambda_{i}\sim\mathcal{N}({\bf w}_{i}^{T}{\boldsymbol{\beta}},1/{\bf\lambda}_{i}) and λiGamma(ν/2,ν/2)\lambda_{i}\sim\mbox{Gamma}(\nu/2,\nu/2). Straightforward calculations now show that Uitν(𝐰iT𝜷,1)U_{i}\sim t_{\nu}({\bf w}_{i}^{T}{\boldsymbol{\beta}},1), where tν(μ,σ)t_{\nu}(\mu,\sigma) denotes the Student’s tt-distribution with ν\nu degrees of freedom, location μ\mu and scale σ\sigma. Furthermore, if Zi=1{Ui>0}Z_{i}=1_{\{U_{i}>0\}}, then P(Zi=1𝜷)=P(Ui>0𝜷)=Fν(𝐰iT𝜷)P(Z_{i}=1\mid{\boldsymbol{\beta}})=P(U_{i}>0\mid{\boldsymbol{\beta}})=F_{\nu}({\bf w}_{i}^{T}{\boldsymbol{\beta}}), which precisely corresponds to the robit regression model specified above.
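
As a quick sanity check of the scale mixture representation just described, the following Monte Carlo sketch (with illustrative values of nu and of the linear predictor, and assuming the Gamma(nu/2, nu/2) distribution is parameterized by shape and rate) verifies that the normal/Gamma mixture indeed produces a t_nu distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative assumptions: degrees of freedom nu and a fixed linear predictor m = w^T beta.
nu, m = 5.0, 1.3

# lambda ~ Gamma(nu/2, rate nu/2) and U | lambda ~ N(m, 1/lambda)
lam = rng.gamma(shape=nu / 2, scale=2 / nu, size=10**6)   # rate nu/2 corresponds to scale 2/nu
U = rng.normal(loc=m, scale=1 / np.sqrt(lam))

# Compare empirical and theoretical t_nu(m, 1) CDFs at a few points.
for u in (-1.0, 0.0, 2.0):
    print(u, (U <= u).mean(), stats.t.cdf(u - m, df=nu))
```

The empirical and theoretical CDF values should agree up to Monte Carlo error.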

One can map this setting to the DA framework laid out in Section 1, by viewing 𝜷{\boldsymbol{\beta}} as “XX”, (𝐔,𝝀)({\bf U},{\boldsymbol{\lambda}}) as “YY”, and the joint posterior density of 𝜷,𝐔,𝝀{\boldsymbol{\beta}},{\bf U},{\boldsymbol{\lambda}} as “fX,Yf_{X,Y}”. Straightforward calculations (see Roy, 2012a ) show that samples from fXYf_{X\mid Y} and fYXf_{Y\mid X} are easy to generate, as they involve sampling from standard distributions such as multivariate normal, truncated-tt and Gamma. These observations allow Roy, 2012a to use the corresponding DA Markov chain on p\mathbb{R}^{p} to generate samples from fXf_{X} (the marginal posterior density of 𝜷{\boldsymbol{\beta}}). We will refer to this Markov chain as the robit DA Markov chain.

Roy, 2012a established Harris ergodicity of the robit DA chain, and further established geometric ergodicity using a drift and minorization analysis. However, this analysis requires the following assumptions.

  • The design matrix WW is full rank (which implies npn\geq p and rules out high-dimensional settings)

  • Σa=g1WTW\Sigma_{a}=g^{-1}W^{T}W

  • ng1ν(ν+1)(1+2𝜷aTWTW𝜷a).n\leq\frac{g^{-1}\nu}{(\nu+1)(1+2\sqrt{{\boldsymbol{\beta}}_{a}^{T}W^{T}W{\boldsymbol{\beta}}_{a}})}.

The last upper bound on nn, involving the design matrix, the prior mean and covariance, and the degrees of freedom ν\nu, is particularly restrictive. Through a tighter drift and minorization analysis, Mukherjee et al., (2023) relax the above restrictions to some extent, but not substantially (see Theorem S1 and Theorem S2 in Mukherjee et al., (2023)). The related probit DA chain is obtained by (a) using the standard normal CDF Φ\Phi instead of FνF_{\nu} in (4.5), (b) again using a multivariate normal prior for 𝜷{\boldsymbol{\beta}}, (c) introducing latent variables Ui𝒩(𝐰iT𝜷,1)U_{i}\sim\mathcal{N}({\bf w}_{i}^{T}{\boldsymbol{\beta}},1) with Zi=1{Ui>0}Z_{i}=1_{\{U_{i}>0\}} for 1in1\leq i\leq n, and (d) employing the DA machinery in Section 1 with (𝜷,𝐔)({\boldsymbol{\beta}},{\bf U}) as “(X,Y)(X,Y)”. The geometric ergodicity analysis of the probit DA chain in Roy and Hobert, (2007); Chakraborty and Khare, (2017) uses drift functions similar to those in Roy, 2012a ; Mukherjee et al., (2023) but requires minimal assumptions. Note that the latent variables {λi}i=1n\{\lambda_{i}\}_{i=1}^{n} used in the robit setting are not needed in the probit setting. While having this additional layer of latent variables definitely complicates the analysis of the robit DA chain, it is not clear if the restrictive conditions listed above for geometric ergodicity of the robit DA chain are really necessary or if they are an artifact of using the drift and minorization technique in the face of this added complication.

A trace class analysis for the probit and robit DA chains is available in Chakraborty and Khare, (2017) and Mukherjee et al., (2023), respectively. The trace class property for the probit DA chain was established in (Chakraborty and Khare,, 2017, Theorem 2) under some constraints on WW and the prior covariance matrix Σa\Sigma_{a}. This is consistent with the expectation that weaker assumptions should be needed for establishing drift and minorization based geometric ergodicity than for establishing the trace class property. However, for the robit DA chain the reverse phenomenon holds: Mukherjee et al., (2023) establish the trace class property for the robit DA chain for any n,ν>2,W,𝐳,𝛃a,Σan,\nu>2,W,{\bf z},{\boldsymbol{\beta}}_{a},\Sigma_{a}. Hence, the drift and minorization approach, with the drift functions chosen in Roy, 2012a ; Mukherjee et al., (2023), needs stronger conditions to succeed than the trace class approach. Essentially, the additional layer of latent variables λi\lambda_{i} introduced in the robit setting severely hampers the drift and minorization analysis, but does not cause additional complications for establishing the trace class condition.

The key takeaway from the above discussion is that even if a successful drift and minorization based analysis is available, it is worth investigating the finiteness of the trace class integral in (4.4); contrary to expectations, it might require comparatively weaker assumptions. Also, establishing the trace class property provides additional insights, which are discussed in the next subsection.

4.4 Trace class property as a tool to compare DA and sandwich chains

The results discussed previously show that, under certain regularity conditions, the sandwich chain is at least as good as the DA chain in certain respects. These results, while very useful, are not completely satisfactory, as they only establish that the sandwich chain is “as good as” the DA chain. The question is: are there any conditions under which the sandwich chain can be shown to be “strictly better” than the DA chain (in an appropriate sense)? This question has been investigated in Khare and Hobert, (2011); Roy, 2012b , and we will discuss their results below.

First, we provide a brief review of some ideas and results used in these analyses. Both papers leverage the fact (see Buja, (1990); Diaconis et al., (2008)) that the DA operator KK can be written as a product of two simple ‘projection’ operators. In particular, let PXP_{X} denote the operator from L02(fY)L^{2}_{0}(f_{Y}) to L02(fX)L^{2}_{0}(f_{X}) defined by

(PXh)(x)=qh(y)fYX(yx)𝑑y(P_{X}h)(x)=\int_{\mathbb{R}^{q}}h(y)f_{Y\mid X}(y\mid x)dy

for every hL02(fY)h\in L^{2}_{0}(f_{Y}), and PYP_{Y} denotes the operator from L02(fX)L^{2}_{0}(f_{X}) to L02(fY)L^{2}_{0}(f_{Y}) defined by

(PYg)(y)=pg(x)fXY(xy)𝑑x(P_{Y}g)(y)=\int_{\mathbb{R}^{p}}g(x)f_{X\mid Y}(x\mid y)dx

for every gL02(fX)g\in L^{2}_{0}(f_{X}). Also, let R:L02(fY)L02(fY)R:L^{2}_{0}(f_{Y})\rightarrow L^{2}_{0}(f_{Y}) denote the operator corresponding to the Mtd rr (the sandwich step Mtd). Then, it can be shown that

K=PXPY and KSA=PXRPY.K=P_{X}P_{Y}\quad\mbox{ and }\quad K_{SA}=P_{X}RP_{Y}.

For any gL02(fX)g\in L^{2}_{0}(f_{X}) and any hL02(fY)h\in L^{2}_{0}(f_{Y}), simple calculations show that <PXh,g>=<h,PYg><P_{X}h,g>=<h,P_{Y}g> (the first inner product corresponds to L02(fX)L^{2}_{0}(f_{X}), while the second one corresponds to L02(fY)L^{2}_{0}(f_{Y})). Hence, the operators PXP_{X} and PYP_{Y} are adjoints of each other. Using elementary operator theoretic arguments, this fact can be used to immediately conclude that K=PXPYK=P_{X}P_{Y} is a positive operator (as previously mentioned) and K=PX2=PY2\|K\|=\|P_{X}\|^{2}=\|P_{Y}\|^{2}. It follows that

KSA=PXRPYPXRPY=RK.\|K_{SA}\|=\|P_{X}RP_{Y}\|\leq\|P_{X}\|\|R\|\|P_{Y}\|=\|R\|\|K\|.

However, in almost all applications of interest, R=1\|R\|=1, since the sandwich step typically corresponds to a univariate movement in q\mathbb{R}^{q}. Hence, the above argument does not allow us to prove that the sandwich algorithm is strictly better (in the norm sense).

On the other hand, if the trace class condition in (4.4) is satisfied, then a very useful singular value decomposition of fX,Yf_{X,Y} becomes available (see Buja, (1990)). In particular, it can be shown that

fX,Y(x,y)fX(x)fY(y)=i=0βigi(x)hi(y),\frac{f_{X,Y}(x,y)}{f_{X}(x)f_{Y}(y)}=\sum_{i=0}^{\infty}\beta_{i}g_{i}(x)h_{i}(y),

where {gi}i=1\{g_{i}\}_{i=1}^{\infty} and {hj}j=1\{h_{j}\}_{j=1}^{\infty} form orthonormal bases of L02(fX)L^{2}_{0}(f_{X}) and L02(fY)L^{2}_{0}(f_{Y}) respectively (with g01,h01g_{0}\equiv 1,h_{0}\equiv 1), {βi}i=0\{\beta_{i}\}_{i=0}^{\infty} is a decreasing sequence of numbers in the unit interval with β0=1\beta_{0}=1, and gig_{i} and hjh_{j} are orthogonal for every iji\neq j in the sense that

pqgi(x)hj(y)fX,Y(x,y)𝑑y𝑑x=0.\int_{\mathbb{R}^{p}}\int_{\mathbb{R}^{q}}g_{i}(x)h_{j}(y)f_{X,Y}(x,y)dydx=0.

The singular value decomposition can be used to establish that

PYgi=βihi and PXhi=βigiP_{Y}g_{i}=\beta_{i}h_{i}\quad\mbox{ and }P_{X}h_{i}=\beta_{i}g_{i} (4.6)

for every i0i\geq 0. In particular, this implies that {βi2}i=1\{\beta_{i}^{2}\}_{i=1}^{\infty} is the spectrum of the DA operator KK and {gi}i=1\{g_{i}\}_{i=1}^{\infty} are the corresponding eigenfunctions. Given KSA=PXRPYK_{SA}=P_{X}RP_{Y} and (4.6), a comparative spectral analysis for the DA and sandwich algorithms now hinges on how the operator RR interacts with the functions {hi}i=1\{h_{i}\}_{i=1}^{\infty}. By investigating this carefully, Khare and Hobert, (2011) establish the following result.

Theorem 4.2.

Assume that (4.4) holds, and that the sandwich operator RR is idempotent (R2=RR^{2}=R) with R=1\|R\|=1. Then the sandwich operator KSAK_{SA} is positive and trace class. If {λSA,i}i=1\{\lambda_{SA,i}\}_{i=1}^{\infty} denotes the ordered sequence of eigenvalues for KSAK_{SA}, then λSA,iβi2\lambda_{SA,i}\leq\beta_{i}^{2} for every i1i\geq 1. Also, for every ii such that βi>0\beta_{i}>0, λSA,i=βi2\lambda_{SA,i}=\beta_{i}^{2} if and only if Rhi=hiRh_{i}=h_{i}.

In other words, the spectrum of the sandwich operator KSAK_{SA} is dominated pointwise by the spectrum of the DA operator KK, and strict domination for a specific pair of eigenvalues depends exclusively on whether or not the operator RR leaves the corresponding hh-function invariant. A stronger version of this result, which only requires KK to be compact, is established in Roy, 2012b .
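
The singular value structure above can be checked directly in a toy discrete setting (the joint pmf below is an illustrative assumption): the squared singular values of the matrix with entries f_{X,Y}(x,y)/sqrt(f_X(x) f_Y(y)) should coincide with the leading eigenvalues of the DA kernel K = P_X P_Y, including the trivial eigenvalue 1 corresponding to constant functions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy joint pmf (an illustrative assumption).
F = rng.random((5, 3))
F /= F.sum()
fX, fY = F.sum(axis=1), F.sum(axis=0)

# Singular values beta_i of f_{X,Y}(x,y)/(f_X(x) f_Y(y)), computed via the
# matrix with entries f(x,y)/sqrt(f_X(x) f_Y(y)).
M = F / np.sqrt(np.outer(fX, fY))
beta = np.linalg.svd(M, compute_uv=False)     # descending; beta_0 = 1

# Eigenvalues of the DA kernel K = P_X P_Y, viewed as a matrix acting on functions of x.
PYgX = F / fX[:, None]
PXgY = (F / fY[None, :]).T
K = PYgX @ PXgY
eig = np.sort(np.linalg.eigvals(K).real)[::-1]

print("beta_i^2         :", np.round(beta**2, 6))
print("eigenvalues of K :", np.round(eig[:len(beta)], 6))   # should match
```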

For a more specific comparison of the operator norms of the two operators, as expected the key factor is the interaction of the operator RR with the linear span of {hi}i=1\{h_{i}\}_{i=1}^{\ell}, where \ell is the multiplicity of the largest eigenvalue β1\beta_{1}, i.e., the number of βi\beta_{i}’s that are exactly equal to β1\beta_{1}. The next result from Khare and Hobert, (2011) leverages this idea to provide a necessary and sufficient condition for the operator norm of KSAK_{SA} to be strictly smaller than the operator norm of KK.

Theorem 4.3.

Under the setup in Theorem 4.2, KSA<K\|K_{SA}\|<\|K\| if and only if the only function in the linear span of {hi}i=1\{h_{i}\}_{i=1}^{\ell} which is left invariant by RR is the (identically) zero function.

While the two results above provide necessary and sufficient conditions for the equality of the norm (and other eigenvalues) for KK and KSAK_{SA}, their practical utility is somewhat limited by the requirement to identify the functions {hi}i=1\{h_{i}\}_{i=1}^{\infty}. However, if we restrict attention to the group action based Haar PX-DA chain, then the following result can be established.

Theorem 4.4.

Assume that (4.4) holds, and that the operator KSAK_{SA} corresponds to the Haar PX-DA sandwich chain outlined in Section 3. Then λSA,i=βi2\lambda_{SA,i}=\beta_{i}^{2} for all i1i\geq 1 if and only if

fXY(xy)=fXY(xgy)gG,xp,yq.f_{X\mid Y}(x\mid y)=f_{X\mid Y}(x\mid gy)\;\forall g\in G,x\in\mathbb{R}^{p},y\in\mathbb{R}^{q}.

If the invariance condition in Theorem 4.4 is satisfied, then it can be shown that the Mtds kk and kSAk_{SA} are exactly the same. Hence, outside of this triviality, the Haar PX-DA sandwich chain is strictly better than the DA chain in the sense that its eigenvalues are uniformly (pointwise) dominated by those of the DA chain, with at least one strict domination.

5 The “two-block” DA algorithm and its sandwich variants

The applicability of the DA algorithm depends on the ease of sampling from the conditional densities fYXf_{Y\mid X} and fXYf_{X\mid Y}. While samples from both of these densities can be generated in a straightforward way in various settings, there are many others where this is not true. In particular, in these settings the density fXYf_{X\mid Y} is often intractable and cannot be directly sampled from. However, it is often possible to partition XX into two components X=(U,V)X=(U,V) (with UuU\in\mathbb{R}^{u} and VpuV\in\mathbb{R}^{p-u}) such that samples from fUV,Yf_{U\mid V,Y} and fVU,Yf_{V\mid U,Y} can be easily generated. In such settings, a modified two-block DA algorithm is used to generate samples from fX=fU,Vf_{X}=f_{U,V}. Each iteration of this two-block DA algorithm consists of three steps: a draw from fY|Xf_{Y|X}, followed by a draw from fU|V,Yf_{U|V,Y}, and finally a draw from fV|U,Yf_{V|U,Y}. The transition of this two-block DA Markov chain from the current state x=(u,v)x=(u,v) to the next state x=(u,v)x^{\prime}=(u^{\prime},v^{\prime}) can be described as follows.

Algorithm One-step transition of the two-block DA Markov chain
1:  Given x=(u,v)x=(u,v), generate a draw from fY|X(|x)f_{Y|X}(\cdot|x), and call the observed value yy.
2:  Generate a draw from fU|V,Y(|v,y)f_{U|V,Y}(\cdot|v,y), and call the observed value uu^{\prime}.
3:  Generate a draw from fV|U,Y(|u,y)f_{V|U,Y}(\cdot|u^{\prime},y), and call the observed value vv^{\prime}.

From the three steps in each iteration of the two-block DA algorithm, it follows that the Markov transition density (Mtd) of the two-block DA Markov chain {X~n}n1\{\tilde{X}_{n}\}_{n\geq 1} (with X~n=(Un,Vn)\tilde{X}_{n}=(U_{n},V_{n})) is given by

kTB(x|x)=qfV|U,Y(v|u,y)fU|V,Y(u|v,y)fY|X(y|x)𝑑y.k_{TB}(x^{\prime}|x)=\int_{\mathbb{R}^{q}}f_{V|U,Y}(v^{\prime}|u^{\prime},y)f_{U|V,Y}(u^{\prime}|v,y)f_{Y|X}(y|x)\,dy\;. (5.1)

It is easy to show that the two-block DA chain has fXf_{X} as its stationary density. However, a key difference compared to the “single-block” Markov chain {Xn}n0\{X_{n}\}_{n\geq 0} of Section 1 is that the two-block DA chain does not satisfy the detailed balance condition.
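
The three-step transition above can be written as a short generic routine; in the sketch below the three conditional samplers are placeholders (assumptions) for the model-specific draws, and the toy trivariate normal used to exercise the routine is purely illustrative.

```python
import numpy as np

def two_block_da_step(u, v, sample_y_given_uv, sample_u_given_vy, sample_v_given_uy, rng):
    """One transition (u, v) -> (u', v') of the two-block DA chain in (5.1)."""
    y = sample_y_given_uv(u, v, rng)           # Step 1: y  ~ f_{Y|X}(. | u, v)
    u_new = sample_u_given_vy(v, y, rng)       # Step 2: u' ~ f_{U|V,Y}(. | v, y)
    v_new = sample_v_given_uy(u_new, y, rng)   # Step 3: v' ~ f_{V|U,Y}(. | u', y)
    return u_new, v_new

# Toy usage (an illustrative assumption): (U, V, Y) trivariate normal with unit
# variances and common correlation rho, so every full conditional is
# N(rho*(a + b)/(1 + rho), (1 - rho)*(1 + 2*rho)/(1 + rho)).
rho = 0.5
m = lambda a, b: rho * (a + b) / (1 + rho)
s = np.sqrt((1 - rho) * (1 + 2 * rho) / (1 + rho))
cond = lambda a, b, rng: rng.normal(m(a, b), s)

rng = np.random.default_rng(0)
u, v = 0.0, 0.0
draws = []
for _ in range(5000):
    u, v = two_block_da_step(u, v, cond, cond, cond, rng)
    draws.append((u, v))
print("sample corr(U, V):", np.corrcoef(np.array(draws).T)[0, 1])   # ~ rho
```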

Example (Bayesian quantile regression) Consider the linear model

Zi=𝐰iT𝜷+ϵi;i=1,2,,nZ_{i}={\bf w}_{i}^{T}{\boldsymbol{\beta}}+\epsilon_{i};i=1,2,...,n

subject to the constraint that the αth\alpha^{th} quantile of the error distribution is 0. Recall that {𝐰i}i=1n\{{\bf w}_{i}\}_{i=1}^{n} is the collection of pp-dimensional covariate vectors, and 𝜷p{\boldsymbol{\beta}}\in\mathbb{R}^{p} is the vector of regression coefficients. The standard (frequentist) method for estimating the regression parameter 𝜷{\boldsymbol{\beta}} (see Yu and Moyeed, (2001)) minimizes an objective function which is equivalent to the negative log-likelihood for 𝜷{\boldsymbol{\beta}} if the errors are assumed to be i.i.d. with the asymmetric Laplace density given by

gα(ϵ)=α(1α)eρα(ϵ),g_{\alpha}(\epsilon)=\alpha(1-\alpha)e^{-\rho_{\alpha}(\epsilon)},

where ρα(x)=x(αI(x<0))\rho_{\alpha}(x)=x\left(\alpha-I(x<0)\right) with I()I(\cdot) denoting the standard indicator function. Note that gαg_{\alpha} has αth\alpha^{th} quantile equal to zero, and corresponds to the standard Laplace density with location and scale equal to 0 and 1/21/2, respectively, when α=1/2\alpha=1/2. Leveraging this analogy, Yu and Moyeed, (2001); Kozumi and Kobayashi, (2011) pursue a Bayesian approach where, given an unknown scale parameter σ>0\sigma>0, the scaled errors ϵi/σ\epsilon_{i}/\sigma are assumed to be i.i.d. with common density gαg_{\alpha}. The following independent priors are assigned to 𝜷{\boldsymbol{\beta}} and σ\sigma: 𝜷𝒩p(β0,B0){\boldsymbol{\beta}}\sim\mathcal{N}_{p}(\beta_{0},B_{0}) and σIG(n02,t02)\sigma\sim IG\left(\frac{n_{0}}{2},\frac{t_{0}}{2}\right). The posterior density of (𝜷,σ)({\boldsymbol{\beta}},\sigma) is intractable, and Kozumi and Kobayashi, (2011) propose the following DA approach which exploits a normal/exponential mixture representation of the asymmetric Laplace distribution.

Define θ:=θ(α)=12αα(1α)\theta:=\theta(\alpha)=\frac{1-2\alpha}{\alpha(1-\alpha)} and τ2:=τ2(α)=2α(1α)\tau^{2}:=\tau^{2}(\alpha)=\frac{2}{\alpha(1-\alpha)}, and consider random pairs {(Zi,Ri)}i=1n\left\{(Z_{i},R_{i})\right\}_{i=1}^{n} such that ZiRi=ri,𝜷,σ𝒩(𝐰iT𝜷+θri,riστ2)Z_{i}\mid R_{i}=r_{i},{\boldsymbol{\beta}},\sigma\sim\mathcal{N}({\bf w}_{i}^{T}{\boldsymbol{\beta}}+\theta r_{i},r_{i}\sigma\tau^{2}) and Ri𝜷,σExp(σ)R_{i}\mid{\boldsymbol{\beta}},\sigma\sim Exp(\sigma). It can be easily verified that the marginal density of Zi𝐰iT𝜷σ\frac{Z_{i}-{\bf w}_{i}^{T}{\boldsymbol{\beta}}}{\sigma} given (𝜷,σ)({\boldsymbol{\beta}},\sigma) is indeed gαg_{\alpha}. However, direct sampling or closed form computations for the joint posterior density of 𝜷,𝐑,σ{\boldsymbol{\beta}},{\bf R},\sigma given 𝐙=𝐳{\bf Z}={\bf z} are not feasible. Nevertheless, it can be shown that (a) the full posterior conditional distribution of 𝜷{\boldsymbol{\beta}} is multivariate Gaussian, (b) the elements of 𝐑{\bf R} are conditionally independent given 𝜷,σ,𝐙=𝐳{\boldsymbol{\beta}},\sigma,{\bf Z}={\bf z}, and follow a generalized inverse Gaussian distribution, and (c) the full posterior conditional distribution of σ\sigma is inverse gamma.

A two-block DA algorithm with 𝜷{\boldsymbol{\beta}} as “UU”, 𝐑{\bf R} as “VV” and σ\sigma as “YY” can hence be used to generate approximate samples from the joint posterior density of (𝜷,𝐑)({\boldsymbol{\beta}},{\bf R}). This in particular yields samples from the marginal posterior density of 𝜷{\boldsymbol{\beta}}. Since the posterior density of σ\sigma given only 𝜷{\boldsymbol{\beta}} can be shown to be an inverse gamma density, this provides a mechanism to sample from the target posterior density of (𝜷,σ)({\boldsymbol{\beta}},\sigma).
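
Before moving on, here is a quick Monte Carlo check of the normal/exponential mixture representation used above (a sketch with illustrative parameter values, and assuming that Exp(sigma) denotes the exponential distribution with mean sigma).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values for the quantile level, scale and linear predictor.
alpha, sigma, mu = 0.25, 2.0, 1.0
theta = (1 - 2 * alpha) / (alpha * (1 - alpha))
tau2 = 2 / (alpha * (1 - alpha))

# R ~ Exp(mean sigma) and Z | R ~ N(mu + theta*R, R*sigma*tau^2).
R = rng.exponential(scale=sigma, size=10**6)
Z = rng.normal(loc=mu + theta * R, scale=np.sqrt(R * sigma * tau2))
eps = (Z - mu) / sigma                            # should follow g_alpha

# g_alpha has alpha-quantile 0 and mean theta, which we check empirically.
print("empirical alpha-quantile:", np.quantile(eps, alpha))       # ~ 0
print("empirical mean          :", eps.mean(), " theoretical:", theta)
```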

Coming back to the general setting, a direct application/insertion of the single-block Haar PX-DA step (Step 2) as a sandwich step in the two-block setting is not feasible, as the resulting Markov chain does not in general have fXf_{X} as its stationary density. Pal et al., (2015) develop feasible two-block adaptations of the single-block Haar PX-DA sandwich algorithm in Section 3, which we briefly describe below.

Consider a group GG acting topologically on the left of q\mathbb{R}^{q}, as in the discussion prior to the single-block Haar PX-DA algorithm. Given any xp,yqx\in\mathbb{R}^{p},y\in\mathbb{R}^{q}, define the densities

fx,y1(g)=fV,Y(v,gy)χ(g)C1(x,y)gGf_{x,y}^{1}(g)=\frac{f_{V,Y}(v,gy)\chi(g)}{C_{1}(x,y)}\;\forall g\in G

and

fx,y2(g)=fU,V,Y(u,v,gy)χ(g)C2(x,y)gG.f_{x,y}^{2}(g)=\frac{f_{U,V,Y}(u,v,gy)\chi(g)}{C_{2}(x,y)}\;\forall g\in G.

Here C1(x,y)C_{1}(x,y) and C2(x,y)C_{2}(x,y) (assumed to be finite) are normalizing constants. Either of these two densities can be used to construct a ‘valid’ Haar PX-DA sandwich step in the current setting. The transition of this two-block sandwich Markov chain from the current state x=(u,v)x=(u,v) to the next state x=(u,v)x^{\prime}=(u^{\prime},v^{\prime}) can be described as follows.

Algorithm One-step transition of the two-block Haar PX-DA sandwich Markov chain
1:  Given x=(u,v)x=(u,v), generate a draw from fY|X(|x)f_{Y|X}(\cdot|x), and call the observed value yy.
2:  Generate a draw from fx,yjf_{x,y}^{j} (for j=1j=1 or j=2j=2) and call the observed value gg.
3:  Generate a draw from fU|V,Y(|v,gy)f_{U|V,Y}(\cdot|v,gy), and call the observed value uu^{\prime}.
4:  Generate a draw from fV|U,Y(|u,gy)f_{V|U,Y}(\cdot|u^{\prime},gy), and call the observed value vv^{\prime}.

Let kTB,SA1k_{TB,SA1} and kTB,SA2k_{TB,SA2} denote the Mtds corresponding to the two-block sandwich chain described in the two-block Haar PX-DA algorithm above (with j=1j=1 and j=2j=2 respectively). Pal et al., (2015) show that both kTB,SA1k_{TB,SA1} and kTB,SA2k_{TB,SA2} have fXf_{X} as their stationary density.

Returning to the Bayesian quantile regression example, recall that the scale parameter σ\sigma plays the role of “YY”. Consider the action of the group G=+G=\mathbb{R}_{+} on +\mathbb{R}_{+}, the sample space of σ\sigma, through scalar multiplication. The left Haar measure for GG is given by νl(dg)=dgg\nu_{l}(dg)=\frac{dg}{g}, and it can be shown that the Lebesgue measure on +\mathbb{R}_{+} is relatively invariant with respect to the multiplier χ(g)=g\chi(g)=g. Using the above construction of two-block sandwich densities, it can be shown that fx,y1f^{1}_{x,y} is a non-standard univariate density. However, samples can be generated from this density using a rejection sampler with a dominating inverse gamma density. On the other hand, fx,y2f^{2}_{x,y} can be shown to be an inverse gamma density, and the corresponding sandwich chain is more suitable for practical use.

We conclude with a quick summary of associated theoretical results for the two-block DA setting. Using results in Asmussen and Glynn, (2011), it follows that the two-block DA chain and the sandwich chain are Harris ergodic if the corresponding Mtd is strictly positive everywhere (this is the case with most chains encountered in statistical applications). However, things get much more complicated (compared to the single-block setting) for a comparative study of deeper issues such as operator norms, geometric ergodicity, etc. A key reason for these challenges is that the two-block DA operator, based on the three-stage transition step, is a product of three relevant projection operators, and consequently loses its self-adjointness and positivity. A theoretical comparison of closely related “lifted” versions of the two-block DA and sandwich chains is available in Pal et al., (2015); however, several questions involving a direct comparison of the two-block DA and sandwich chains remain open.

  • Under what conditions does geometric ergodicity of the two block DA chain imply geometric ergodicity of the corresponding sandwich chain?

  • Although reversibility is lost, one could still aim to establish that the two-block DA operator is Hilbert-Schmidt by showing that the condition in (4.3) is satisfied. In such a setting, under what conditions does the Hilbert-Schmidt property of the two-block DA chain imply the same for the corresponding sandwich chain?

  • How do answers to the above questions depend on the choice of j=1j=1 vs. j=2j=2 in the sandwich step in the two-block Haar PX-DA algorithm?

6 Distributed DA adaptations for massive data settings

The sandwich DA algorithm can serve as a useful and computationally inexpensive tool to speed up the convergence of the DA algorithm. However, its utility is limited for the increasingly common “massive” datasets, which contain hundreds of thousands (if not millions) of samples and/or variables. For a Bayesian statistical analysis of such datasets, a key goal is to generate samples from the intractable posterior density fXf_{X} of the parameters “XX”. For instances where the DA algorithm is applicable, the dimensionality of “YY” (the latent variables) is often equal to the number of samples or the number of variables. For example, in the Bayesian variable selection example described in Section 2.1, the dimensionality of YY is pp, the number of variables. The corresponding methodology is meant to be used for high-dimensional settings where pp is very large. In such settings, the computational cost of sampling the latent variables in each iteration of the DA and sandwich chains can be exorbitant. Alternatively, consider the Bayesian logistic regression example in Section 2.2, where the number of latent variables is equal to the number of samples mm. For datasets such as the MovieLens data (Perry,, 2016; Srivastava et al.,, 2019), where the number of samples is in the millions, the computational cost of sampling the latent variables can again be prohibitive.

For settings where a massive number of samples is an issue, the following divide-and-conquer strategy can be considered: (a) divide the data into a number of smaller subsets of reasonable size, (b) run DA Markov chains on all subsets in parallel, and (c) suitably combine the output of the different subset-wise Markov chains for inference. See Scott et al., (2016); Jordan et al., (2019); Wang and Srivastava, (2021) and the references therein. The combined draws from the subset-wise Markov chains do not constitute a Markov chain, and quantification of the Monte Carlo error becomes challenging. While asymptotic statistical guarantees based on a normal approximation of the target posterior are available for some of these methods (as the subset sample size tends to infinity), no rigorous bounds for the distance between the distribution of the combined parameter draws and the target posterior distribution (based on the entire data) are available. In other words, the lack of a Markov chain structure for the combined draws deprives the user of the rich diagnostic and theoretical tools that are available in the literature for Markov chains.

To address the above issues, the authors in Zhou et al., (2023) develop an asynchronous and distributed version of the DA algorithm, called ADDA, which outputs a Markov chain with the desired posterior as its stationary distribution. Two key assumptions that are needed for ADDA to be useful are: (a) the latent variables can be partitioned into blocks which are conditionally independent (given the original parameters “XX” and the data), and (b) the conditional posterior distribution of each latent sub-block exclusively depends on a relevant subset of the observed data. This is, for example, true for the Bayesian variable selection example in Section 2.1, where the augmented variables {Y~j}j=1p\{\tilde{Y}_{j}\}_{j=1}^{p} are conditionally independent given the regression coefficients and the data. Also, the conditional posterior distribution of Y~j\tilde{Y}_{j} does not depend on the observed data (hence, requirement (b) above is trivially satisfied). Again, for the Bayesian logistic regression example in Section 2.2, the augmented variables {Y~i}i=1m\{\tilde{Y}_{i}\}_{i=1}^{m} are conditionally independent given the regression coefficients and the data, and the conditional posterior distribution of Y~i\tilde{Y}_{i} depends only on (i,wi)(\ell_{i},w_{i}) (quantities related to the ithi^{th} observation).

Consider the general DA setup of Section 1, where XX represents the parameter of interest, and YY represents the block of latent parameters. The goal is to sample from fXf_{X}, which represents the marginal posterior density of XX given the observed data (the conditioning on the observed data will be left implicit for continuity and ease of exposition). We assume that YY can be divided into kk sub-blocks Y1,Y2,,YkY^{1},Y^{2},\cdots,Y^{k} which are conditionally independent given XX. Also, as mentioned above, the observed data can be partitioned into kk subsets such that fYjXf_{Y^{j}\mid X} only depends on the jthj^{th} data subset for j=1,2,,kj=1,2,\cdots,k.

The ADDA algorithm employs 11 manager process and kk worker processes. At the level of sampling, the job of the manager process is to sample new parameter values from fXYf_{X\mid Y} when appropriate, and to send this sample to all the workers. The job of the jthj^{th} worker process is to sample new values for the jthj^{th} latent block from fYjXf_{Y^{j}\mid X} when appropriate, and to send these samples to the manager process. Hence, the job of sampling the latent parameter YY is distributed to the kk worker processes. Another key feature of the ADDA algorithm is asynchronous sampling of the kk latent parameter blocks. The degree of asynchrony is controlled by user-specified parameters r(0,1]r\in(0,1] and ϵ(0,1]\epsilon\in(0,1] as follows. To make its next conditional draw of the parameter XX, the manager process, with probability ϵ\epsilon, waits to receive updated draws of the relevant latent parameter blocks from all the workers. But with probability 1ϵ1-\epsilon, it only waits to receive an rr-fraction of updated draws, and proceeds to sample XX given the most recent draws for all latent parameter blocks. If a worker is in the midst of sampling its assigned latent parameter block, but receives a new XX-draw from the manager, it terminates its current sampling, and begins afresh using the latest XX-draw received from the manager. This entire process is summarized below.

  1. (1)

    At time t=0t=0, the manager starts with initial values (X~0,Y~0)(\tilde{X}_{0},\tilde{Y}_{0}) and sends X~0\tilde{X}_{0} to the workers.

  2. (2)

    For t=0,1,,t=0,1,\ldots,\infty, the manager

    1. (M-a)

      with probability 1ϵ1-\epsilon, waits to receive only an rr-fraction of updated Y~t+1j\tilde{Y}_{t+1}^{j}s (see below) from the workers, and with probability ϵ\epsilon, waits to receive all the updated Y~t+1j\tilde{Y}_{t+1}^{j}s from the workers;

    2. (M-b)

      creates Y~t+1\tilde{Y}_{t+1} by replacing the relevant Y~tj\tilde{Y}^{j}_{t}s with the newly received Y~t+1j\tilde{Y}^{j}_{t+1};

    3. (M-c)

      draws X~t+1\tilde{X}_{t+1} from fXY(Y~t+1)f_{X\mid Y}(\cdot\mid\tilde{Y}_{t+1}); and

    4. (M-d)

      sends X~t+1\tilde{X}_{t+1} to all the worker processes and resets t=t+1t=t+1.

  3. (3)

    For t=0,,t=0,\ldots,\infty, the worker jj (j=1,,kj=1,\ldots,k)

    1. (W-a)

      waits to receive X~t\tilde{X}_{t} from the manager process;

    2. (W-b)

      draws Y~t+1j\tilde{Y}^{j}_{t+1} from fYjX(X~t)f_{Y^{j}\mid X}(\cdot\mid\tilde{X}_{t}); and

    3. (W-c)

      sends Y~t+1j\tilde{Y}^{j}_{t+1} to the manager process, resets t=t+1t=t+1, and goes to (W-a) if X~t+1\tilde{X}_{t+1} is not received from the manager before the draw is complete; otherwise, it truncates the sampling process, resets t=t+1t=t+1, and goes to (W-b).

It is clear that when ϵ=1\epsilon=1 or r=1r=1, the ADDA algorithm is identical to the DA algorithm. However, things get interesting when ϵ<1\epsilon<1 and r<1r<1. In this setting, at each iteration of the ADDA algorithm, only an rr-fraction of the latent parameter blocks are updated (with probability 1ϵ1-\epsilon). The ADDA algorithm can then be viewed as a mix of a systematic scan and a random subset scan Gibbs sampler. The systematic part comes from always updating XX in any given iteration, and the random subset part comes from only updating a random subset of kr\lceil kr\rceil sub-blocks of YY at each iteration. Such algorithms have been previously considered in the literature. However, what gives the ADDA a novel and interesting flavor is that, unlike existing approaches, the comparative speed of the kk workers for sampling the respective YY sub-blocks can depend on the current value of XX. In other words, the choice of the kr\lceil kr\rceil sub-blocks of YY that are updated at any given iteration can depend on the current value of the parameter XX. Given this dependence, it turns out that the marginal XX-process {X~t}t0\{\tilde{X}_{t}\}_{t\geq 0} is not Markov, but the joint (X,Y)(X,Y)-process {X~t,Y~t}t0\{\tilde{X}_{t},\tilde{Y}_{t}\}_{t\geq 0} is Markov.
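
A minimal single-process sketch of this idealized kernel is given below. The conditional samplers, the block-selection rule, and the toy Gaussian model used to exercise the code are placeholders and illustrative assumptions; in the actual asynchronous implementation the refreshed subset is determined by which workers finish first, and its distribution may therefore depend on the current value of X.

```python
import numpy as np

def adda_step(x, y_blocks, sample_x_given_y, sample_yj_given_x, select_blocks, eps, rng):
    """One transition of the idealized (X, Y) ADDA chain with k latent blocks."""
    k = len(y_blocks)
    if rng.random() < eps:
        refresh = range(k)                       # wait for all workers
    else:
        refresh = select_blocks(x, rng)          # a subset of blocks, possibly x-dependent
    y_new = list(y_blocks)
    for j in refresh:
        y_new[j] = sample_yj_given_x(j, x, rng)  # y^j ~ f_{Y^j | X}(. | x)
    return sample_x_given_y(y_new, rng), y_new   # x ~ f_{X | Y}(. | y_new)

# Toy usage (illustrative assumptions): X ~ N(0, 1) a priori, Y^j | X ~ N(X, 1)
# independently for k = 4 blocks, so X | Y ~ N(sum(Y)/(k + 1), 1/(k + 1)).
k, r, eps = 4, 0.5, 0.1
rng = np.random.default_rng(0)
samp_x = lambda y, rng: rng.normal(sum(y) / (k + 1), np.sqrt(1 / (k + 1)))
samp_yj = lambda j, x, rng: rng.normal(x, 1.0)
select = lambda x, rng: rng.choice(k, size=int(np.ceil(k * r)), replace=False)

x, y = 0.0, [0.0] * k
xs = []
for _ in range(20000):
    x, y = adda_step(x, y, samp_x, samp_yj, select, eps, rng)
    xs.append(x)
print("marginal mean and var of X:", np.mean(xs), np.var(xs))   # ~ 0 and ~ 1
```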

In fact, it can be shown that the Markov chain {X~t,Y~t}t0\{\tilde{X}_{t},\tilde{Y}_{t}\}_{t\geq 0} has fX,Yf_{X,Y} as its stationary (invariant) density. For ease of exposition, we provide a proof of this assertion when YY and XX are supported on a discrete space and kk, the number of latent variable blocks, is equal to 22. We assume that r=0.5r=0.5 and ϵ=0\epsilon=0. The arguments below can be extended in a straightforward way to a general setting with more than two latent variable blocks, and to non-discrete settings with arbitrary rr and ϵ\epsilon values in [0,1][0,1].

Let (Y0,X0)(Y_{0},X_{0}) denote the starting value for the ADDA chain, and assume that it is drawn from the desired posterior fY,Xf_{Y,X}. Let (Y1,X1)(Y_{1},X_{1}) denote the next iterate generated by the ADDA chain. Our goal is to show that (Y1,X1)fY,X(Y_{1},X_{1})\sim f_{Y,X}. As we will see below, a key assumption which enables this is the conditional independence assumption which implies fYX(y~x~)=fY1X(y~1x~)fY2X(y~2x~)f_{Y\mid X}(\tilde{y}\mid\tilde{x})=f_{Y^{1}\mid X}(\tilde{y}_{1}\mid\tilde{x})f_{Y^{2}\mid X}(\tilde{y}_{2}\mid\tilde{x}) (recall that y~1\tilde{y}_{1} and y~2\tilde{y}_{2} denote the two blocks of y~\tilde{y}).

Since X1X_{1} given Y1Y_{1} is a draw from fXYf_{X\mid Y}, it follows that

P((Y1,X1)=(y~,x~))\displaystyle P((Y_{1},X_{1})=(\tilde{y},\tilde{x})) =\displaystyle= P(X1=x~Y1=y~)P(Y1=y~)\displaystyle P(X_{1}=\tilde{x}\mid Y_{1}=\tilde{y})P(Y_{1}=\tilde{y})
=\displaystyle= fXY(x~y~)P(Y1=y~).\displaystyle f_{X\mid Y}(\tilde{x}\mid\tilde{y})P(Y_{1}=\tilde{y}).

Hence, to prove the desired result, it is enough to show that P(Y1=y~)=fY(y~)P(Y_{1}=\tilde{y})=f_{Y}(\tilde{y}). Note that

P(Y1=y~)=y,xP(Y1=y~(Y0,X0)=(y,x))fY,X(y,x).P(Y_{1}=\tilde{y})=\sum_{y^{\prime},x^{\prime}}P(Y_{1}=\tilde{y}\mid(Y_{0},X_{0})=(y^{\prime},x^{\prime}))f_{Y,X}(y^{\prime},x^{\prime}).

Let us recall how Y1Y_{1} is sampled given (Y0,X0)=(y,x)(Y_{0},X_{0})=(y^{\prime},x^{\prime}). With probability, say, c1(x)c_{1}(x^{\prime}), only the first latent variable block Y11Y^{1}_{1} is obtained using a draw from fY1X(x)f_{Y^{1}\mid X}(\cdot\mid x^{\prime}) and the second block is left unchanged at y2y_{2}^{\prime}; and with probability c2(x)=1c1(x)c_{2}(x^{\prime})=1-c_{1}(x^{\prime}), only the second latent variable block Y12Y^{2}_{1} is obtained using fY2X(x)f_{Y^{2}\mid X}(\cdot\mid x^{\prime}) and the first block is left unchanged at y1y_{1}^{\prime}. It follows that

P(Y1=y~)\displaystyle P(Y_{1}=\tilde{y})
=\displaystyle= y,xc1(x)fY1X(y~1x)1{y2=y~2}fY,X(y,x)+y,xc2(x)fY2X(y~2x)1{y1=y~1}fY,X(y,x)\displaystyle\sum_{y^{\prime},x^{\prime}}c_{1}(x^{\prime})f_{Y^{1}\mid X}(\tilde{y}_{1}\mid x^{\prime})1_{\{y_{2}^{\prime}=\tilde{y}_{2}\}}f_{Y,X}(y^{\prime},x^{\prime})+\sum_{y^{\prime},x^{\prime}}c_{2}(x^{\prime})f_{Y^{2}\mid X}(\tilde{y}_{2}\mid x^{\prime})1_{\{y_{1}^{\prime}=\tilde{y}_{1}\}}f_{Y,X}(y^{\prime},x^{\prime})

Using fY,X(y,x)=fY1X(y1x)fY2X(y2x)fX(x)f_{Y,X}(y^{\prime},x^{\prime})=f_{Y^{1}\mid X}(y_{1}^{\prime}\mid x^{\prime})f_{Y^{2}\mid X}(y_{2}^{\prime}\mid x^{\prime})f_{X}(x^{\prime}) (by conditional independence of Y1Y^{1} and Y2Y^{2} given XX), we get

P(Y1=y~)\displaystyle P(Y_{1}=\tilde{y})
=\displaystyle= xy1y2c1(x)fY1X(y~1x)1{y2=y~2}fY1X(y1x)fY2X(y2x)fX(x)+\displaystyle\sum_{x^{\prime}}\sum_{y_{1}^{\prime}}\sum_{y_{2}^{\prime}}c_{1}(x^{\prime})f_{Y^{1}\mid X}(\tilde{y}_{1}\mid x^{\prime})1_{\{y_{2}^{\prime}=\tilde{y}_{2}\}}f_{Y^{1}\mid X}(y_{1}^{\prime}\mid x^{\prime})f_{Y^{2}\mid X}(y_{2}^{\prime}\mid x^{\prime})f_{X}(x^{\prime})+
xy1y2c2(x)fY2X(y~2x)1{y1=y~1}fY1X(y1x)fY2X(y2x)fX(x)\displaystyle\sum_{x^{\prime}}\sum_{y_{1}^{\prime}}\sum_{y_{2}^{\prime}}c_{2}(x^{\prime})f_{Y^{2}\mid X}(\tilde{y}_{2}\mid x^{\prime})1_{\{y_{1}^{\prime}=\tilde{y}_{1}\}}f_{Y^{1}\mid X}(y_{1}^{\prime}\mid x^{\prime})f_{Y^{2}\mid X}(y_{2}^{\prime}\mid x^{\prime})f_{X}(x^{\prime})
=\displaystyle= xc1(x)fY1X(y~1x)fX(x)(y1fY1X(y1x))(y2fY2X(y2x)1{y2=y~2})+\displaystyle\sum_{x^{\prime}}c_{1}(x^{\prime})f_{Y^{1}\mid X}(\tilde{y}_{1}\mid x^{\prime})f_{X}(x^{\prime})\left(\sum_{y_{1}^{\prime}}f_{Y^{1}\mid X}(y_{1}^{\prime}\mid x^{\prime})\right)\left(\sum_{y_{2}^{\prime}}f_{Y^{2}\mid X}(y_{2}^{\prime}\mid x^{\prime})1_{\{y_{2}^{\prime}=\tilde{y}_{2}\}}\right)+
xc2(x)fY2X(y~2x)fX(x)(y2fY2X(y2x))(y1fY1X(y1x)1{y1=y~1})\displaystyle\sum_{x^{\prime}}c_{2}(x^{\prime})f_{Y^{2}\mid X}(\tilde{y}_{2}\mid x^{\prime})f_{X}(x^{\prime})\left(\sum_{y_{2}^{\prime}}f_{Y^{2}\mid X}(y_{2}^{\prime}\mid x^{\prime})\right)\left(\sum_{y_{1}^{\prime}}f_{Y^{1}\mid X}(y_{1}^{\prime}\mid x^{\prime})1_{\{y_{1}^{\prime}=\tilde{y}_{1}\}}\right)
=\displaystyle= xc1(x)fY1X(y~1x)fX(x)fY2X(y~2x)+xc2(x)fY2X(y~2x)fX(x)fY1X(y~1x)\displaystyle\sum_{x^{\prime}}c_{1}(x^{\prime})f_{Y^{1}\mid X}(\tilde{y}_{1}\mid x^{\prime})f_{X}(x^{\prime})f_{Y^{2}\mid X}(\tilde{y}_{2}\mid x^{\prime})+\sum_{x^{\prime}}c_{2}(x^{\prime})f_{Y^{2}\mid X}(\tilde{y}_{2}\mid x^{\prime})f_{X}(x^{\prime})f_{Y^{1}\mid X}(\tilde{y}_{1}\mid x^{\prime})
=\displaystyle= x(c1(x)+c2(x))fY,X(y~,x).\displaystyle\sum_{x^{\prime}}(c_{1}(x^{\prime})+c_{2}(x^{\prime}))f_{Y,X}(\tilde{y},x^{\prime}).

The last step again uses conditional independence of Y1Y^{1} and Y2Y^{2} given XX. Since c1(x)+c2(x)=1c_{1}(x^{\prime})+c_{2}(x^{\prime})=1, it follows that

P(Y1=y~)=xfY,X(y~,x)=fY(y~).P(Y_{1}=\tilde{y})=\sum_{x^{\prime}}f_{Y,X}(\tilde{y},x^{\prime})=f_{Y}(\tilde{y}).

Note that the above argument considers the pure asynchronous ADDA (ϵ=0\epsilon=0). But the ADDA kernel with positive ϵ\epsilon is a mixture of the DA and pure asynchronous ADDA kernels, so the result immediately follows for such settings as well. For a setting with more than two blocks and a general value of r(0,1)r\in(0,1), we will have J=(kkr)J={{k}\choose{\lceil kr\rceil}} terms in the derivation above instead of two. The jthj^{th} term, with essentially the same manipulations as above, will simplify to cj(x)fY,X(y~,x)c_{j}(x^{\prime})f_{Y,X}(\tilde{y},x^{\prime}), where cj(x)c_{j}(x^{\prime}) denotes the probability of choosing the relevant subset of latent variable blocks. Since j=1Jcj(x)=1\sum_{j=1}^{J}c_{j}(x^{\prime})=1, the invariance result will follow. The authors in Zhou et al., (2023) also establish the geometric ergodicity of the ADDA chain in specific settings, including the Bayesian variable selection example considered in Section 2.1.
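
The two-block argument above can be verified exactly on a tiny discrete example (all numerical values below are illustrative assumptions) by building the transition matrix of the (X, Y)-chain with epsilon = 0, r = 1/2 and an x-dependent block-selection probability c_1(x), and checking that f_{X,Y} is left invariant.

```python
import numpy as np
from itertools import product

# Tiny discrete model: X, Y^1, Y^2 all binary, with Y^1 and Y^2 conditionally
# independent given X (the key assumption used in the proof above).
X_vals, Y_vals = [0, 1], [0, 1]
fX = np.array([0.4, 0.6])
fY1gX = np.array([[0.7, 0.3], [0.2, 0.8]])   # rows: f_{Y^1|X}(. | x)
fY2gX = np.array([[0.5, 0.5], [0.9, 0.1]])   # rows: f_{Y^2|X}(. | x)
c1 = np.array([0.3, 0.8])                    # x-dependent prob. of refreshing block 1

states = list(product(X_vals, Y_vals, Y_vals))
f = np.array([fX[x] * fY1gX[x, y1] * fY2gX[x, y2] for (x, y1, y2) in states])

def fXgY(x, y1, y2):
    """f_{X|Y}(x | y1, y2) by Bayes' rule."""
    w = np.array([fX[xx] * fY1gX[xx, y1] * fY2gX[xx, y2] for xx in X_vals])
    return w[x] / w.sum()

# Transition matrix of the (X, Y) ADDA chain: refresh one block from f_{Y^j|X}(.|x),
# keep the other block, then draw the new x from f_{X|Y}.
P = np.zeros((len(states), len(states)))
for i, (x, y1, y2) in enumerate(states):
    for j, (xn, y1n, y2n) in enumerate(states):
        p = 0.0
        if y2n == y2:
            p += c1[x] * fY1gX[x, y1n] * fXgY(xn, y1n, y2n)
        if y1n == y1:
            p += (1 - c1[x]) * fY2gX[x, y2n] * fXgY(xn, y1n, y2n)
        P[i, j] = p

print("rows sum to 1:", np.allclose(P.sum(axis=1), 1.0))
print("f P = f      :", np.allclose(f @ P, f))   # invariance of f_{X,Y}
```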

The parameter rr controls the degree of asynchrony. If rr is small, then the computational cost of each ADDA iteration is lower, but the ADDA Markov chain mixes at a slower pace. This tradeoff between slower mixing and lower computational cost per iteration is key to the choice of rr and the performance of the ADDA algorithm. The hope is that, for a range of values of rr, the joint effect of these two competing factors leads to a significant decrease in the overall wall clock time required for the ADDA chain (as compared to a pure distributed implementation of the original DA chain). This is indeed seen in the extensive experimental evaluations performed in Zhou et al., (2023), where the ADDA is shown to have a remarkable overall computational gain with a comparatively small loss of accuracy. For example, when analyzing the MovieLens data (Perry,, 2016; Srivastava et al.,, 2019) (with millions of samples) using Bayesian logistic regression (Section 2.2), the ADDA algorithm is three to five times faster with only 2% less accuracy compared to the (pure distributed) DA algorithm after 10,000 iterations.

7 Summary and discussion

DA algorithms introduce appropriate latent variables to facilitate sampling from intractable target distributions. These algorithms explore the parameter space by iteratively updating the parameters and the latent variables, usually drawing from some standard distributions. Indeed, DA algorithms are a popular choice for MCMC computing due to their simplicity and broad applicability. The simple structure of the transition density of the DA Markov chain allows tractable and detailed spectral analyses of its convergence behavior, in addition to drift and minorization based analyses. This article discusses the tools for analyzing and comparing the spectrum of DA Markov chains. The applicability of DA methods has been demonstrated by their successful implementation in various widely used examples from statistics and machine learning.

Despite their widespread use, DA algorithms can suffer from slow mixing. A great deal of effort has gone into developing techniques to tweak the DA chain’s transition density to improve its convergence speed. Among the different approaches, parameter expansion based methods, the PX-DA strategies, have been the most successful. As described in this article, several theoretical results have been established showing the superiority of the PX-DA algorithms over the DA methods. For a given reparameterization, the results in the literature identify the ‘best’ PX-DA algorithm as the Haar PX-DA algorithm, based on the Haar measure. On the other hand, there are few studies comparing different parameter expansion strategies. One exception is Roy, (2014), which, through an empirical study involving a Bayesian robit model, shows that a partially reparameterized Haar PX-DA algorithm can outperform a fully reparameterized Haar PX-DA algorithm. Using the same robit example, Roy, (2014) shows that some reparameterizations may provide little improvement over the DA algorithm. An important area of future research is to provide theoretical and methodological results for constructing and comparing different reparameterization strategies for improving DA algorithms.

In various fields and applications, it is common to encounter large data sets in which the number of observations or variables is large. The dimension of the latent vector yy in a DA scheme is usually the same as the number of observations or the number of variables. At the same time, as seen in all examples considered in this article, the latent variables are generally conditionally independent given the observed data and the current parameter value. Thus, the draw of yy, the first of the two steps in every iteration of the DA algorithm, can be parallelized, distributing the computational burden across multiple processors or nodes. Indeed, conditional independence of the latent blocks Y1,Y2,,YkY^{1},Y^{2},\cdots,Y^{k} is key for ensuring that the ADDA-process {(X~t,Y~t)}t=0\{(\tilde{X}_{t},\tilde{Y}_{t})\}_{t=0}^{\infty} described in Section 6 is a Markov chain with fX,Yf_{X,Y} as its stationary distribution. However, such conditional independence is lost in many applications, such as hidden Markov models and time series models. An important open research direction is the extension of ADDA-type ideas to settings with this loss of conditional independence, without compromising on some underlying Markov structure. Another open direction is a useful integration of the sandwich technique to develop parameter-expanded versions of the ADDA algorithm.

Acknowledgments

The first author’s work was partially supported by USDA NIFA Grant 2023-70412-41087.

References

  • Albert and Chib, (1993) Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88:669–679.
  • Andrews and Mallows, (1974) Andrews, D. F. and Mallows, C. F. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B, 36:99–102.
  • Asmussen and Glynn, (2011) Asmussen, S. and Glynn, P. W. (2011). A new proof of convergence of MCMC via the ergodic theorem. Statistics and Probability Letters, 81:1482–1485.
  • Booth and Hobert, (1999) Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society, Series B, 61:265–285.
  • Buja, (1990) Buja, A. (1990). Remarks on functional canonical variates, alternating least squares methods and ACE. The Annals of Statistics, 18:1032–1069.
  • Chakraborty and Khare, (2017) Chakraborty, S. and Khare, K. (2017). Convergence properties of Gibbs samplers for Bayesian probit regression with proper priors. Electronic Journal of Statistics, 11(1):177–210.
  • Chakraborty and Khare, (2019) Chakraborty, S. and Khare, K. (2019). Consistent estimation of the spectrum of trace class data augmentation algorithms. Bernoulli, 25:3832–3863.
  • Chan and Geyer, (1994) Chan, K. S. and Geyer, C. J. (1994). Comment on “Markov chains for exploring posterior distributions” by L. Tierney. The Annals of Statistics, 22:1747–1758.
  • Choi and Hobert, (2013) Choi, H. M. and Hobert, J. P. (2013). The Polya-Gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic. Electronic Journal of Statistics, 7:2054–2064.
  • Davies, (1983) Davies, E. (1983). Linear Integral Operators: Surveys and Reference Works in Mathematics, 7. Wiley Online Library.
  • Dempster et al., (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39:1–38.
  • Diaconis et al., (2008) Diaconis, P., Khare, K., and Saloff-Coste, L. (2008). Gibbs sampling, exponential families and orthogonal polynomials (with discussion). Statistical Science, 23:151–200.
  • Diebolt and Robert, (1994) Diebolt, J. and Robert, C. P. (1994). Estimation of finite mixture distributions by Bayesian sampling. Journal of the Royal Statistical Society, Series B, 56:363–375.
  • Duan et al., (2018) Duan, L. L., Johndrow, J. E., and Dunson, D. B. (2018). Scaling up data augmentation MCMC via calibration. The Journal of Machine Learning Research, 19(1):2575–2608.
  • Eaton, (1989) Eaton, M. L. (1989). Group Invariance Applications in Statistics. Institute of Mathematical Statistics and the American Statistical Association, Hayward, California and Alexandria, Virginia.
  • Flegal and Jones, (2010) Flegal, J. M. and Jones, G. L. (2010). Batch means and spectral variance estimators in Markov chain Monte Carlo. The Annals of Statistics, 38:1034–1070.
  • Frühwirth-Schnatter and Frühwirth, (2010) Frühwirth-Schnatter, S. and Frühwirth, R. (2010). Data augmentation and MCMC for binary and multinomial logit models. In Statistical Modelling and Regression Structures, pages 111–132. Springer.
  • Geyer, (1994) Geyer, C. J. (1994). On the convergence of Monte Carlo maximum likelihood calculations. Journal of the Royal Statistical Society, Series B, 56:261–274.
  • Geyer and Thompson, (1992) Geyer, C. J. and Thompson, E. A. (1992). Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society, Series B, 54:657–699.
  • Gilks and Wild, (1992) Gilks, W. R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 41(2):337–348.
  • Hobert, (2011) Hobert, J. P. (2011). Handbook of Markov chain Monte Carlo, chapter The data augmentation algorithm: theory and methodology, pages 253–293. CRC Press, Boca Raton, FL.
  • Hobert and Marchev, (2008) Hobert, J. P. and Marchev, D. (2008). A theoretical comparison of the data augmentation, marginal augmentation and PX-DA algorithms. The Annals of Statistics, 36:532–554.
  • Hobert et al., (2011) Hobert, J. P., Roy, V., and Robert, C. P. (2011). Improving the convergence properties of the data augmentation algorithm with an application to Bayesian mixture modelling. Statistical Science, 26:332–351.
  • Holmes and Held, (2006) Holmes, C. C. and Held, L. (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1:145–168.
  • Jin and Tan, (2021) Jin, R. and Tan, A. (2021). Fast Markov chain Monte Carlo for high-dimensional Bayesian regression models with shrinkage priors. Journal of Computational and Graphical Statistics, 30(3):632–646.
  • Jones and Hobert, (2001) Jones, G. L. and Hobert, J. P. (2001). Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Statistical Science, 16:312–334.
  • Jones and Hobert, (2004) Jones, G. L. and Hobert, J. P. (2004). Sufficient burn-in for Gibbs samplers for a hierarchical random effects model. The Annals of Statistics, 32:784–817.
  • Jordan et al., (2019) Jordan, M. I., Lee, J. D., and Yang, Y. (2019). Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668–681.
  • Khare and Hobert, (2011) Khare, K. and Hobert, J. P. (2011). A spectral analytic comparison of trace-class data augmentation algorithms and their sandwich variants. The Annals of Statistics, 39:2585–2606.
  • Kozumi and Kobayashi, (2011) Kozumi, H. and Kobayashi, G. (2011). Gibbs sampling methods for Bayesian quantile regression. Journal of Statistical Computation and Simulation, 81(11):1565–1578.
  • Kyung et al., (2010) Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2010). Penalized regression, standard errors, and Bayesian Lassos. Bayesian Analysis, 5:369–412.
  • Laha et al., (2016) Laha, A., Dutta, S., and Roy, V. (2016). A novel sandwich algorithm for empirical Bayes analysis of rank data. Statistics and its Interface, 10:543–556.
  • Liu, (2004) Liu, C. (2004). Robit regression: A simple robust alternative to logistic and probit regression. In Gelman, A. and Meng, X. L., editors, Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, pages 227–238. Wiley, London.
  • Liu, (1996) Liu, J. S. (1996). Peskun’s theorem and a modified discrete-state Gibbs sampler. Biometrika, 83:681–682.
  • Liu et al., (1994) Liu, J. S., Wong, W. H., and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to comparisons of estimators and augmentation schemes. Biometrika, 81:27–40.
  • Liu et al., (1995) Liu, J. S., Wong, W. H., and Kong, A. (1995). Covariance structure and convergence rate of the Gibbs sampler with various scans. Journal of the Royal Statistical Society, Series B, 57:157–169.
  • Liu and Wu, (1999) Liu, J. S. and Wu, Y. N. (1999). Parameter expansion for data augmentation. Journal of the American Statistical Association, 94:1264–1274.
  • Meng and van Dyk, (1999) Meng, X.-L. and van Dyk, D. A. (1999). Seeking efficient data augmentation schemes via conditional and marginal augmentation. Biometrika, 86:301–320.
  • Meyn and Tweedie, (1993) Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability. Springer-Verlag, London.
  • Mukherjee et al., (2023) Mukherjee, S., Khare, K., and Chakraborty, S. (2023). Convergence properties of data augmentation algorithms for high-dimensional robit regression. Electronic Journal of Statistics, 17(1):19–69.
  • Pal et al., (2015) Pal, S., Khare, K., and Hobert, J. P. (2015). Improving the data augmentation algorithm in the two-block setup. Journal of Computational and Graphical Statistics, 24(4):1114–1133.
  • Pal et al., (2017) Pal, S., Khare, K., and Hobert, J. P. (2017). Trace class Markov chains for Bayesian inference with generalized double Pareto shrinkage priors. Scandinavian Journal of Statistics, 44(2):307–323.
  • Papaspiliopoulos and Roberts, (2008) Papaspiliopoulos, O. and Roberts, G. (2008). Stability of the Gibbs sampler for Bayesian hierarchical models. The Annals of Statistics, 36:95–117.
  • Park and Casella, (2008) Park, T. and Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103:681–686.
  • Perry, (2016) Perry, P. O. (2016). Fast moment-based estimation for hierarchical models. Journal of the Royal Statistical Society, Series B, 79(1):267–291.
  • Polson et al., (2013) Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349.
  • Pregibon, (1982) Pregibon, D. (1982). Resistant fits for some commonly used logistic models with medical applications. Biometrics, 38:485–498.
  • Qin and Hobert, (2021) Qin, Q. and Hobert, J. P. (2021). On the limitations of single-step drift and minorization in Markov chain convergence analysis. The Annals of Applied Probability, 31(4):1633–1659.
  • Qin and Hobert, (2022) Qin, Q. and Hobert, J. P. (2022). Geometric convergence bounds for Markov chains in Wasserstein distance based on generalized drift and contraction conditions. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 58(2):872–889.
  • Qin et al., (2019) Qin, Q., Hobert, J. P., and Khare, K. (2019). Estimating the spectral gap of a trace-class Markov operator. Electronic Journal of Statistics, 13:1790–1822.
  • Rajaratnam et al., (2019) Rajaratnam, B., Sparks, D., Khare, K., and Zhang, L. (2019). Scalable Bayesian shrinkage and uncertainty quantification in high-dimensional regression. Journal of Computational and Graphical Statistics, 28(1):174–184.
  • Rao and Roy, (2021) Rao, Y. and Roy, V. (2021). Block Gibbs samplers for logistic mixed models: convergence properties and a comparison with full Gibbs samplers. Electronic Journal of Statistics, 15:5598–5625.
  • Roberts and Rosenthal, (1997) Roberts, G. O. and Rosenthal, J. S. (1997). Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability, 2:13–25.
  • Rosenthal, (1995) Rosenthal, J. S. (1995). Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association, 90:558–566.
  • Roy, (2012a) Roy, V. (2012a). Convergence rates for MCMC algorithms for a robust Bayesian binary regression model. Electronic Journal of Statistics, 6:2463–2485.
  • Roy, (2012b) Roy, V. (2012b). Spectral analytic comparisons for data augmentation. Statistics & Probability Letters, 82:103–108.
  • Roy, (2014) Roy, V. (2014). Efficient estimation of the link function parameter in a robust Bayesian binary regression model. Computational Statistics and Data Analysis, 73:87–102.
  • Roy, (2016) Roy, V. (2016). Improving efficiency of data augmentation algorithms using Peskun’s theorem. Computational Statistics, 31:709–728.
  • Roy, (2022) Roy, V. (2022). MCMC for GLMMs. In Srinivasa Rao, A. S. R., Young, G. A., and Rao, C. R., editors, Handbook of Statistics Volume 47: Advancements in Bayesian Methods and Implementation, pages 135–159. Elsevier.
  • Roy and Chakraborty, (2017) Roy, V. and Chakraborty, S. (2017). Selection of tuning parameters, solution paths and standard errors for Bayesian lassos. Bayesian Analysis, 12:753–778.
  • Roy and Hobert, (2007) Roy, V. and Hobert, J. P. (2007). Convergence rates and asymptotic standard errors for Markov chain Monte Carlo algorithms for Bayesian probit regression. Journal of the Royal Statistical Society, Series B, 69:607–623.
  • Scott et al., (2016) Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2016). Bayes and big data: the consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88.
  • Srivastava et al., (2019) Srivastava, S., DePalma, G., and Liu, C. (2019). An asynchronous distributed expectation maximization algorithm for massive data: The DEM algorithm. Journal of Computational and Graphical Statistics, 28(2):233–243.
  • Swendsen and Wang, (1987) Swendsen, R. H. and Wang, J.-S. (1987). Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58:86–88.
  • Tanner and Wong, (1987) Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82:528–550.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288.
  • Vats et al., (2019) Vats, D., Flegal, J. M., and Jones, G. L. (2019). Multivariate output analysis for Markov chain Monte Carlo. Biometrika, 106:321–337.
  • Wang and Srivastava, (2021) Wang, C. and Srivastava, S. (2021). Divide-and-conquer Bayesian inference in hidden Markov models. Electronic Journal of Statistics, 17:895–947.
  • Wang and Roy, (2018a) Wang, X. and Roy, V. (2018a). Analysis of the Pólya-Gamma block Gibbs sampler for Bayesian logistic linear mixed models. Statistics & Probability Letters, 137:251–256.
  • Wang and Roy, (2018b) Wang, X. and Roy, V. (2018b). Convergence analysis of the block Gibbs sampler for Bayesian probit linear mixed models with improper priors. Electronic Journal of Statistics, 12:4412–4439.
  • Wang and Roy, (2018c) Wang, X. and Roy, V. (2018c). Geometric ergodicity of Pólya-Gamma Gibbs sampler for Bayesian logistic regression with a flat prior. Electronic Journal of Statistics, 12:3295–3311.
  • Xu and Ghosh, (2015) Xu, X. and Ghosh, M. (2015). Bayesian variable selection and estimation for group lasso. Bayesian Analysis, 10(4):909–936.
  • Yu and Moyeed, (2001) Yu, K. and Moyeed, R. A. (2001). Bayesian quantile regression. Statistics & Probability Letters, 54(4):437–447.
  • Yu and Meng, (2011) Yu, Y. and Meng, X.-L. (2011). To center or not to center: that is not the question - An ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency (with discussion). Journal of Computational and Graphical Statistics, 20:531–570.
  • Zhou et al., (2023) Zhou, J., Khare, K., and Srivastava, S. (2023). Asynchronous and distributed data augmentation for massive data settings. Journal of Computational and Graphical Statistics, 32(3):895–907.
  • Zou and Hastie, (2005) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society, Series B, 67:301–320.