The data augmentation algorithm
Abstract
Data augmentation (DA) algorithms are popular Markov chain Monte Carlo (MCMC) algorithms often used for sampling from intractable probability distributions. This review article comprehensively surveys DA MCMC algorithms, highlighting their theoretical foundations, methodological implementations, and diverse applications in frequentist and Bayesian statistics. The article discusses tools for studying the convergence properties of DA algorithms. Furthermore, it presents various strategies for accelerating the speed of convergence of DA algorithms, describes different extensions of DA algorithms, and outlines promising directions for future research. This paper aims to serve as a resource for researchers and practitioners seeking to leverage data augmentation techniques in MCMC algorithms by providing key insights and synthesizing recent developments.
1 Introduction
The data augmentation (DA) algorithm (Tanner and Wong, 1987; Swendsen and Wang, 1987) is a widely used class of Markov chain Monte Carlo (MCMC) algorithms. Suppose $\pi$ is the target probability density function (pdf) on $\mathsf{X} \subseteq \mathbb{R}^p$. Generally, the goal is to evaluate the mean of some function $g$ with respect to $\pi$, that is, to find $E_\pi g = \int_{\mathsf{X}} g(x)\, \pi(x)\, dx$. Since $\pi$ is usually available only up to a normalizing constant, $E_\pi g$ cannot be computed analytically. Furthermore, these days, the target densities arising in machine learning, physics, statistics, and other areas are so complex that direct simulation from $\pi$ is impossible, making the classical Monte Carlo approximation of $E_\pi g$ based on independent and identically distributed (iid) samples from $\pi$ infeasible. In such situations, it is often possible, as described below, to construct a DA Markov chain $\{X_n\}_{n \ge 0}$ which has $\pi$ as its stationary density. If this DA chain is well behaved, then $E_\pi g$ can be consistently estimated by the sample average $\bar{g}_n := \sum_{i=0}^{n-1} g(X_i)/n$.
For constructing a DA algorithm with stationary density $\pi$, one needs to find a joint density $f(x, y)$ on $\mathsf{X} \times \mathsf{Y}$, with 'augmented variables' $y \in \mathsf{Y}$, satisfying the following two properties:
(i) the marginal of the joint density $f(x, y)$ is the target density $\pi$, that is, $\int_{\mathsf{Y}} f(x, y)\, dy = \pi(x)$, and

(ii) sampling from the two corresponding conditional pdfs, $f_{X|Y}(x|y)$ and $f_{Y|X}(y|x)$, is straightforward.
The first property ensures that the DA algorithm presented below has $\pi$ as its stationary density, while the second property allows for straightforward simulation of the DA Markov chain. As mentioned in Hobert (2011), the popularity of the DA algorithm is due in part to the fact that, given an intractable pdf $\pi$, there are some general techniques available for constructing a potentially useful joint density $f(x, y)$. For example, the augmented variables generally correspond to the missing data used in the EM algorithm (Dempster et al., 1977) for defining the complete data density, which in turn is used for finding the maximum likelihood estimates of some parameters. In Sections 2–5 we provide several examples of widely used nonlinear and/or high-dimensional statistical models with complex target densities where, using appropriate augmented data, popular and efficient DA algorithms have been constructed.
Each iteration of the DA algorithm consists of two steps — a draw from $f_{Y|X}(\cdot\,|x)$ followed by a draw from $f_{X|Y}(\cdot\,|y)$. Indeed, if the current state of the DA Markov chain is $X_n = x$, we simulate $X_{n+1}$ as follows.
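To make the two-step recipe concrete, here is a minimal sketch of ours (not taken from the source) for a toy target: the Student-t density, which admits the classical normal/gamma augmentation $x \mid y \sim N(0, 1/y)$ with $y \sim \text{Gamma}(\nu/2, \text{rate} = \nu/2)$, so that the $x$-marginal is the $t_\nu$ density. All function names are our own.

```python
import numpy as np

def da_student_t(n_iter, nu=5.0, x0=0.0, rng=None):
    """Toy DA sampler whose stationary density is Student-t with nu d.f.
    Augmentation: x | y ~ N(0, 1/y) and y | x ~ Gamma((nu+1)/2, rate=(nu+x^2)/2),
    so the x-marginal of the joint density is the t_nu density."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    out = np.empty(n_iter)
    for n in range(n_iter):
        # Step 1: draw the augmented variable from f(y | x)
        y = rng.gamma(shape=(nu + 1) / 2, scale=2.0 / (nu + x**2))
        # Step 2: draw the new state from f(x | y)
        x = rng.normal(0.0, np.sqrt(1.0 / y))
        out[n] = x
    return out

draws = da_student_t(50_000)
```

Because the joint density here is fully tractable, the output can be checked against known $t_\nu$ moments, e.g. the sample variance should approach $\nu/(\nu-2)$.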
From the two steps in each iteration of the DA algorithm, it follows that the Markov transition density (Mtd) of the DA Markov chain $\{X_n\}_{n \ge 0}$ is given by

$$k(x' \mid x) = \int_{\mathsf{Y}} f_{X|Y}(x' \mid y)\, f_{Y|X}(y \mid x)\, dy. \qquad (1.1)$$
Let $f_Y(y) = \int_{\mathsf{X}} f(x, y)\, dx$ be the marginal density of the augmented variable $y$. Since $f_{Y|X}(y|x)\pi(x) = f(x, y) = f_{X|Y}(x|y) f_Y(y)$, we have

$$k(x' \mid x)\, \pi(x) = \int_{\mathsf{Y}} f_{X|Y}(x' \mid y)\, f_{X|Y}(x \mid y)\, f_Y(y)\, dy = k(x \mid x')\, \pi(x') \qquad (1.2)$$

for all $x, x'$, that is, the Mtd satisfies the detailed balance condition, and hence we have

$$\int_{\mathsf{X}} k(x' \mid x)\, \pi(x)\, dx = \pi(x'). \qquad (1.3)$$
Thus, $\pi$ is stationary for the DA Markov chain $\{X_n\}_{n \ge 0}$. Moreover, if $\{X_n\}_{n \ge 0}$ is appropriately irreducible, then $\bar{g}_n$ is a strongly consistent estimator of $E_\pi g$, that is, $\bar{g}_n$ converges to $E_\pi g$ almost surely as $n \to \infty$ (Asmussen and Glynn, 2011, Theorem 2). Since the DA chain has the Mtd (1.1), irreducibility also implies Harris recurrence of the DA Markov chain. Furthermore, if $\{X_n\}_{n \ge 0}$ is also aperiodic, then it is Harris ergodic and the marginal density of $X_n$ converges to the stationary density $\pi$, no matter what the initial distribution of $X_0$ is (Meyn and Tweedie, 1993, Chapter 13). A simple sufficient condition which guarantees both irreducibility and aperiodicity of $\{X_n\}_{n \ge 0}$ is that the Mtd $k(x'|x)$ is strictly positive everywhere. Note that, if the joint density $f(x, y)$ is strictly positive on $\mathsf{X} \times \mathsf{Y}$, then $k(x'|x)$ is strictly positive everywhere and the DA Markov chain is Harris ergodic.
In order to keep the notation and discussion simple, here we consider the target and augmented state spaces as Euclidean. Similarly, the densities $\pi$ and $f(x, y)$, as well as the conditionals, are considered with respect to the Lebesgue measure. However, all methods and theoretical results discussed in this chapter extend to discrete as well as other general settings. Indeed, there are popular DA algorithms where the target and/or the augmented state space is not continuous; for example, consider the widely used DA algorithm for mixture models (Diebolt and Robert, 1994).
2 Examples
In this section, we provide examples of some widely used high-dimensional linear models, generalized linear models, and generalized linear mixed models where the target distributions are intractable. We then describe some DA algorithms that are used to fit these models.
2.1 Data augmentation for Bayesian variable selection
In modern datasets arising from diverse applications such as genetics, medical science, and other scientific disciplines, the number of covariates is often larger than the number of observed samples. The ordinary least squares method for estimating the regression coefficients in a linear regression is not applicable in such situations. On the other hand, penalized regression methods such as the least absolute shrinkage and selection operator (lasso) (Tibshirani, 1996) remain applicable, as they can simultaneously induce shrinkage and sparsity in the estimation of the regression coefficients.
Consider the standard linear model

$$y = \beta_0 \mathbf{1}_n + X\beta + \varepsilon, \qquad \varepsilon \sim N_n(0, \sigma^2 I_n),$$

where $y$ is the $n \times 1$ vector of responses, $\beta_0$ is the intercept, $\mathbf{1}_n$ is the $n \times 1$ vector of 1's, $X$ is the $n \times p$ (standardized) covariate matrix, $\beta = (\beta_1, \dots, \beta_p)^\top$ is the $p \times 1$ vector of regression coefficients, and $\sigma^2$ is the variance parameter. The lasso is based on $\ell_1$ norm regularization, and it estimates $\beta$ by solving

$$\hat{\beta}_{\text{lasso}} = \arg\min_{\beta}\; (\tilde{y} - X\beta)^\top (\tilde{y} - X\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \qquad (2.1)$$

for some shrinkage parameter $\lambda > 0$, where $y$ is the vector of observed responses and $\tilde{y} = y - \bar{y}\mathbf{1}_n$ is the centered response vector. The lasso estimate can be interpreted as the posterior mode when, conditional on $\sigma^2$, the regression parameters $\beta_j$'s have independent and identical Laplace priors (Tibshirani, 1996). Indeed, following Park and Casella (2008), we consider the hierarchical model
(2.2) |
where $\pi(\sigma^2)$ is the prior density of $\sigma^2$. Since the columns of $X$ are centered, standard calculations show that
(2.3) |
Thus, from (2.2) and (2.3), the joint posterior density of $(\beta, \sigma^2)$ is
(2.4) |
From (2.1) and (2.4) it follows that the mode of the (conditional on $\sigma^2$) posterior density of $\beta$ is the lasso estimate.
The posterior density (2.4) is intractable. Introducing augmented variables $\tau = (\tau_1^2, \dots, \tau_p^2)$ with $\tau_j^2 > 0$ for all $j = 1, \dots, p$, Park and Casella (2008) consider the following joint density of $(\beta, \sigma^2, \tau)$ given $y$:
(2.5) |
Replacing $z$ with $\beta_j/\sigma$, $a$ with $\lambda$, and $s$ with $\tau_j^2$ in the following representation of the Laplace density as a scale mixture of normals (Andrews and Mallows, 1974)

$$\frac{a}{2} e^{-a|z|} = \int_0^{\infty} \frac{1}{\sqrt{2\pi s}}\, e^{-z^2/(2s)}\; \frac{a^2}{2}\, e^{-a^2 s/2}\, ds, \qquad (2.6)$$

we see that
that is, the $(\beta, \sigma^2)$-marginal density of the joint posterior density given in (2.5) is the target posterior density (2.4). Thus, from Section 1, if sampling from the two conditional densities, of $\tau$ given $(\beta, \sigma^2, y)$ and of $(\beta, \sigma^2)$ given $(\tau, y)$, is straightforward, then we can construct a valid DA algorithm for (2.4).
From (2.5), we have
(2.7) |
Recall that the Inverse-Gaussian$(\mu, \rho)$ density is given by

$$f(u; \mu, \rho) = \sqrt{\frac{\rho}{2\pi u^3}}\, \exp\!\left\{-\frac{\rho (u - \mu)^2}{2\mu^2 u}\right\}, \qquad u > 0.$$

Thus from (2.7), it follows that
(2.8) |
In order to derive the conditional density of $(\sigma^2, \beta)$ given $(\tau, y)$, we assume that a priori $\sigma^2 \sim$ Inverse-Gamma$(a, b)$ for some $a \ge 0$ and $b \ge 0$. The improper prior used in Park and Casella (2008) is obtained by setting $a = b = 0$. Note that a draw from the conditional density of $(\sigma^2, \beta)$ given $(\tau, y)$ can be made by first drawing $\sigma^2$ from its conditional density given $(\tau, y)$, followed by a draw of $\beta$ from its conditional density given $(\sigma^2, \tau, y)$ (Rajaratnam et al., 2019). Denoting the diagonal matrix with diagonal elements $\tau_1^2, \dots, \tau_p^2$ by $D_\tau$, we have
Also, from (2.5), we have
Hence, a single iteration of the DA algorithm uses the following two steps to move from the current state $(\beta, \sigma^2)$ to the next.
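The two steps above can be sketched in code. The following is an illustration of ours, using one common version of the Bayesian lasso conditionals (inverse-Gaussian draws for $1/\tau_j^2$, an inverse-gamma draw for $\sigma^2$, and a multivariate normal draw for $\beta$); the exact shape/scale constants are assumptions of this sketch and are not verified against any particular paper, and all names are our own.

```python
import numpy as np

def bayes_lasso_da(X, y, lam=1.0, a=1.0, b=1.0, n_iter=500, rng=None):
    """Sketch of a Park-Casella-type DA sampler for the Bayesian lasso.
    Assumes the columns of X are centered/standardized."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    yt = y - y.mean()                       # centered response
    beta, sig2 = np.zeros(p), 1.0
    out = np.empty((n_iter, p))
    for it in range(n_iter):
        # Step 1: draw 1/tau_j^2 | beta, sigma^2 ~ Inverse-Gaussian
        mu = np.sqrt(lam**2 * sig2 / np.maximum(beta**2, 1e-12))  # guard beta=0
        inv_tau2 = rng.wald(mu, lam**2)
        D_inv = np.diag(inv_tau2)           # D_tau^{-1}
        A = X.T @ X + D_inv
        A_inv = np.linalg.inv(A)
        # Step 2a: draw sigma^2 | tau, y from an inverse gamma (assumed form)
        rss = yt @ yt - yt @ X @ A_inv @ X.T @ yt
        sig2 = 1.0 / rng.gamma((n - 1 + p) / 2 + a, 1.0 / (rss / 2 + b))
        # Step 2b: draw beta | sigma^2, tau, y ~ N(A^{-1} X' yt, sigma^2 A^{-1})
        beta = rng.multivariate_normal(A_inv @ X.T @ yt, sig2 * A_inv)
        out[it] = beta
    return out
```

Note that `numpy`'s `Generator.wald(mean, scale)` draws from the inverse-Gaussian distribution in exactly the mean/shape parameterization needed here.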
Although the lasso estimator has been extensively used in applications as diverse as agriculture, genetics, and finance, it may perform unsatisfactorily if the predictors are highly correlated. For example, if there is a group structure among the variables, the lasso tends to select only one variable from each group. Zou and Hastie (2005) proposed the Elastic Net (EN) to achieve better performance in such situations. The EN estimator is obtained by solving
$$\hat{\beta}_{\text{EN}} = \arg\min_{\beta}\; (\tilde{y} - X\beta)^\top (\tilde{y} - X\beta) + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2, \qquad (2.9)$$

where $\lambda_1$ and $\lambda_2$ are tuning parameters. From (2.9) we see that the elastic net uses both an $\ell_1$ penalty as in the lasso and an $\ell_2$ penalty as in ridge regression. Following Kyung et al. (2010) and Roy and Chakraborty (2017), we consider the hierarchical Bayesian EN model
(2.10) |
where $\pi(\sigma^2)$ is the prior density of $\sigma^2$. From (2.10) and (2.3) it follows that the joint posterior density of $(\beta, \sigma^2)$ is
(2.11) |
From (2.9) and (2.11) we see that the mode of the (conditional on $\sigma^2$) posterior density of $\beta$ is the EN estimate. Since the density (2.11) is intractable, using augmented variables $\tau = (\tau_1^2, \dots, \tau_p^2)$, following Kyung et al. (2010) and Roy and Chakraborty (2017), we consider the following joint density of $(\beta, \sigma^2, \tau)$ given $y$:
(2.12) |
From (2.6) we see that
From (2.12), using similar calculations as before, we see that the conditional distributions of the $\tau_j^2$'s are the same as (2.8) with $\lambda$ replaced by $\lambda_1$. As before, we assume that a priori $\sigma^2 \sim$ Inverse-Gamma$(a, b)$ for some $a \ge 0$ and $b \ge 0$. Denoting the diagonal matrix with diagonal elements $\tau_1^2, \dots, \tau_p^2$ by $D_\tau$, we have
Next, doing similar calculations as for the Bayesian lasso, we see that a single iteration of the DA algorithm for the Bayesian EN uses the following two steps to move from the current state $(\beta, \sigma^2)$ to the next.
Different Bayesian generalized lasso methods, such as the Bayesian group lasso, the Bayesian sparse group lasso, and the Bayesian fused lasso, have been proposed in the literature to handle situations where the covariates are known to have some structure, for example, when they form groups or are ordered in some way (Kyung et al., 2010; Xu and Ghosh, 2015). Introducing appropriate augmented variables, DA algorithms for these models can also be constructed (Kyung et al., 2010; Jin and Tan, 2021).
2.2 Data augmentation for Bayesian logistic models
The logistic regression model is arguably the most popular model for analyzing binomial data. Let $Y_1, \dots, Y_n$ be independent Binomial random variables, and let $x_i$ be the $p \times 1$ vector of known covariates associated with $Y_i$, for $i = 1, \dots, n$. Let $\beta$ be the $p \times 1$ vector of unknown regression coefficients. Assume that $Y_i \sim$ Binomial$(n_i, F(x_i^\top \beta))$, where $F(t) = e^t/(1 + e^t)$ is the cumulative distribution function of the standard logistic random variable. Denoting the observed responses by $y = (y_1, \dots, y_n)$, the likelihood function for the logistic regression model is

$$L(\beta \mid y) \propto \prod_{i=1}^{n} \frac{\left(e^{x_i^\top \beta}\right)^{y_i}}{\left(1 + e^{x_i^\top \beta}\right)^{n_i}}.$$
Consider a Bayesian analysis that employs the following Gaussian prior for $\beta$:

$$\beta \sim N_p(b, B), \qquad (2.13)$$

where $b \in \mathbb{R}^p$ and $B$ is a $p \times p$ positive definite matrix. Then the intractable posterior density of $\beta$ is given by
(2.14) |
where
(2.15) |
is the marginal density of $y$.
There have been many attempts (Holmes and Held, 2006; Frühwirth-Schnatter and Frühwirth, 2010) to produce an efficient DA algorithm for the logistic regression model, that is, for (2.14), without much success until recently, when Polson et al. (2013) (denoted as PS&W hereafter) proposed a new DA algorithm based on the Pólya-Gamma (PG) distribution. A random variable $\omega$ follows a PG distribution with parameters $b > 0$ and $c \in \mathbb{R}$ if

$$\omega \overset{d}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{g_k}{(k - 1/2)^2 + c^2/(4\pi^2)},$$

where the $g_k$'s are iid Gamma$(b, 1)$ random variables. From Wang and Roy (2018c), the pdf for the PG$(b, c)$ distribution is

$$f(x; b, c) = \cosh^b(c/2)\, \frac{2^{b-1}}{\Gamma(b)} \sum_{m=0}^{\infty} (-1)^m \frac{\Gamma(m+b)}{\Gamma(m+1)}\, \frac{(2m+b)}{\sqrt{2\pi x^3}} \exp\!\left\{-\frac{(2m+b)^2}{8x} - \frac{c^2 x}{2}\right\} \qquad (2.16)$$

for $x > 0$, where the hyperbolic cosine function is $\cosh(t) = (e^t + e^{-t})/2$. We denote this as $\omega \sim$ PG$(b, c)$.
Let $\omega = (\omega_1, \dots, \omega_n)$, and let $f(\cdot\,; n_i, 0)$ be the pdf of PG$(n_i, 0)$. Define the joint posterior density of $\beta$ and $\omega$ given $y$ as
(2.17) |
By Theorem 1 in Polson et al. (2013), it follows that the $\beta$-marginal density of the joint posterior density (2.17) is the target density given in (2.14), that is,
PS&W's DA algorithm for the logistic regression model is based on the joint density (2.17). We now derive the conditional densities of $\omega$ given $(\beta, y)$ and of $\beta$ given $(\omega, y)$. From (2.17) we see that
(2.18) |
and thus from (2.16) we have $\omega_i \mid \beta, y \sim$ PG$(n_i, x_i^\top \beta)$ for $i = 1, \dots, n$. Let $X$ denote the $n \times p$ design matrix with $i$th row $x_i^\top$, and let $\Omega$ be the $n \times n$ diagonal matrix with $i$th diagonal element $\omega_i$. From (2.17) and (2.13), it follows that the conditional distribution of $\beta$ is multivariate normal. In particular,

$$\beta \mid \omega, y \sim N_p\!\left(V_\omega \left(X^\top \kappa + B^{-1} b\right),\; V_\omega\right), \qquad V_\omega = \left(X^\top \Omega X + B^{-1}\right)^{-1}, \qquad (2.19)$$

where $\kappa = (y_1 - n_1/2, \dots, y_n - n_n/2)^\top$.
Thus, a single iteration of PS&W's DA algorithm uses the following two steps to move from the current state $\beta$ to the next.
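The two steps can be sketched as follows. This is an illustration of ours: the PG draw uses a naive truncated version of the infinite-sum-of-gammas representation, which is only approximate (exact PG samplers, e.g. from the BayesLogit package, should be preferred in practice), and all function and argument names are our own.

```python
import numpy as np

def rpg_approx(b, c, rng, K=200):
    """Approximate PG(b, c) draw by truncating the sum-of-gammas
    representation at K terms (an approximation, not an exact sampler)."""
    k = np.arange(1, K + 1)
    g = rng.gamma(b, 1.0, size=K)
    return (g / ((k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2))).sum() / (2 * np.pi ** 2)

def pg_da_logistic(X, y, n_trials, B_inv, b0, n_iter=500, rng=None):
    """Sketch of a PS&W-style DA sampler for Bayesian logistic regression
    with a N(b0, B) prior on beta; B_inv = B^{-1}."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    beta = np.zeros(p)
    kappa = y - n_trials / 2.0
    out = np.empty((n_iter, p))
    for it in range(n_iter):
        # Step 1: omega_i | beta, y ~ PG(n_i, x_i' beta)
        omega = np.array([rpg_approx(n_trials[i], X[i] @ beta, rng)
                          for i in range(n)])
        # Step 2: beta | omega, y ~ N(V m, V), V = (X' Omega X + B^{-1})^{-1}
        V = np.linalg.inv(X.T @ (omega[:, None] * X) + B_inv)
        m = V @ (X.T @ kappa + B_inv @ b0)
        beta = rng.multivariate_normal(m, V)
        out[it] = beta
    return out
```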
PS&W's DA algorithm is also applicable to the situation when $B^{-1} = 0$ in (2.13), that is, under the improper uniform prior on $\beta$, provided the posterior density is proper, that is, the marginal density of $y$ defined in (2.15) is finite. Polson et al. (2013) demonstrate superior empirical performance of their DA algorithm over some other DA and MCMC algorithms for the logistic regression model.
2.3 Data augmentation for probit mixed models
Generalized linear mixed models (GLMMs) are often used for analyzing correlated binary observations. The random effects in the linear predictor of a GLMM can accommodate overdispersion as well as dependence among correlated observations arising from longitudinal or repeated measures studies. Let $Y = (Y_1, \dots, Y_n)$ denote the vector of Bernoulli random variables. Let $x_i$ and $s_i$ be the $p \times 1$ and $q \times 1$ known covariate and random-effect design vectors, respectively, associated with the $i$th observation, for $i = 1, \dots, n$. Let $\beta$ be the unknown $p \times 1$ vector of regression coefficients and $u$ be the $q \times 1$ random effects vector. The probit GLMM connects the expectation of $Y_i$ with $x_i$ and $s_i$ using the probit link function, as
$$E(Y_i \mid \beta, u) = P(Y_i = 1 \mid \beta, u) = \Phi(x_i^\top \beta + s_i^\top u), \qquad (2.20)$$

where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal random variable.
Assume that we have random effects with , where is a vector with , , and where the low-dimensional covariance matrix is unknown and must be estimated, and the structured matrix is usually known. Here, $\otimes$ indicates the Kronecker product. Denoting , the probit GLMM is given by
Let be the observed Bernoulli response variables. Note that, the likelihood function for is
(2.21) |
which is not available in closed form. Here, , with $\oplus$ indicating the direct sum, and denotes the probability density function of the dimensional normal distribution with mean vector , covariance matrix and evaluated at .
There are two widely used Monte Carlo approaches for approximating the likelihood function (2.21) and making inference on , namely the Monte Carlo EM algorithm (Booth and Hobert,, 1999) and the Monte Carlo maximum likelihood based on importance sampling (Geyer and Thompson,, 1992; Geyer,, 1994). As explained in Roy, (2022), both the Monte Carlo EM and the Monte Carlo maximum likelihood methods for making inference on require effective methods for sampling from the conditional density of the random effect
(2.22) |
where is the joint density of given by
(2.23) |
Since the likelihood function is not available in closed form, neither is the density (2.22).
Albert and Chib's (1993) DA algorithm for the probit regression model is one of the most widely used DA algorithms, and it can be extended to construct DA samplers for (2.22). Following Albert and Chib (1993), let $z_i$ be the latent continuous normal variable corresponding to the $i$th binary observation $Y_i$, that is, $z_i = x_i^\top \beta + s_i^\top u + \epsilon_i$, where $\epsilon_i \sim N(0, 1)$ for $i = 1, \dots, n$. Then
(2.24) |
that is, $P(z_i > 0 \mid \beta, u) = P(Y_i = 1 \mid \beta, u)$ as in (2.20). Using the latent variables $z = (z_1, \dots, z_n)$, we now introduce the joint density
(2.25) |
From (2.21) and (2.24) it follows that
(2.26) |
that is, the $u$-marginal of the joint density (2.25) is the target density given in (2.22). Thus, the augmented data $z$ and the joint density (2.25) satisfy the first property for constructing a valid DA algorithm for (2.22).
If the two conditionals of the joint density (2.25) are easy to sample from, then a DA algorithm can be constructed. From (2.25), we see that
(2.27) |
where $\mathrm{TN}(\mu, \sigma^2; y_i)$ denotes the distribution of the normal random variable with mean $\mu$ and variance $\sigma^2$ that is truncated to have only positive values if $y_i = 1$, and only negative values if $y_i = 0$. Let $X$ and $S$ denote the $n \times p$ and $n \times q$ matrices whose $i$th rows are $x_i^\top$ and $s_i^\top$, respectively. Next, using standard linear models-type calculations, Roy (2022) shows that the conditional distribution of $u$ given $(z, y)$ is
(2.28) |
Thus, a single iteration of the DA algorithm uses the following two steps to move from the current state $u$ to the next.
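An Albert–Chib-type iteration can be sketched as follows. This is an illustration of ours for the simpler plain probit regression model with a flat prior on $\beta$ and no random effects (a simplification of the mixed model above); it assumes $X$ has full column rank, and all names are our own.

```python
import numpy as np
from scipy.stats import truncnorm

def probit_da(X, y, n_iter=500, rng=None):
    """Sketch of an Albert-Chib DA sampler for plain probit regression
    with a flat prior on beta. Assumes X has full column rank."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    proj = XtX_inv @ X.T
    beta = np.zeros(p)
    out = np.empty((n_iter, p))
    for it in range(n_iter):
        m = X @ beta
        # Step 1: z_i | beta, y ~ N(m_i, 1) truncated to (0, inf) if y_i = 1
        # and to (-inf, 0) if y_i = 0 (bounds standardized relative to m_i)
        lo = np.where(y == 1, -m, -np.inf)
        hi = np.where(y == 1, np.inf, -m)
        z = truncnorm.rvs(lo, hi, loc=m, scale=1.0, random_state=rng)
        # Step 2: beta | z, y ~ N((X'X)^{-1} X'z, (X'X)^{-1}) under the flat prior
        beta = rng.multivariate_normal(proj @ z, XtX_inv)
        out[it] = beta
    return out
```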
One can consider a Bayesian analysis of the probit GLMM with appropriate priors on the parameters, and the DA and Haar PX-DA algorithms discussed in this section can be extended to sample from the corresponding posterior densities (see Wang and Roy, 2018b; Roy, 2022). Similarly, PS&W's DA algorithm for the logistic model discussed in Section 2.2 can be extended to fit logistic mixed models (see Wang and Roy, 2018a; Rao and Roy, 2021; Roy, 2022).
3 Improving the DA algorithm
DA algorithms, although popular MCMC schemes, can suffer from slow convergence to stationarity (Roy and Hobert, 2007; Hobert et al., 2011; Duan et al., 2018). In the literature, a lot of effort has gone into developing methods for accelerating the speed of convergence of DA algorithms. These methods are mainly based on techniques like appropriate reparameterizations and parameter expansions, such as the centered and noncentered parameterizations (Papaspiliopoulos and Roberts, 2008), interweaving the two parameterizations (Yu and Meng, 2011), parameter expanded-data augmentation (PX-DA) (Liu and Wu, 1999), marginal augmentation (Meng and van Dyk, 1999), and calibrated DA (Duan et al., 2018). Unlike the other methods mentioned here, Duan et al.'s (2018) calibrated DA requires an additional Metropolis-Hastings (MH) step to maintain the stationarity of the algorithm with respect to the target density $\pi$. The PX-DA and marginal augmentation methods are among the most popular techniques for speeding up DA algorithms, and we describe them in more detail later in this section.
Graphically, the two steps of the DA algorithm can be viewed as $x \to y \to x'$. It has been observed that the convergence behavior of a DA algorithm can be significantly improved by inserting a properly chosen extra step in between the two steps of the DA algorithm. The idea of introducing an extra step using appropriate auxiliary variables was developed independently by Liu and Wu (1999), who called the resulting MCMC algorithm the PX-DA, and Meng and van Dyk (1999), who called it the marginal augmentation. Following these works, Hobert and Marchev (2008) developed general versions of PX-DA type algorithms, and Yu and Meng (2011) referred to these generalized methods as sandwich algorithms since they involve an additional step that is sandwiched between the two steps of the original DA algorithm. Graphically, a sandwich algorithm can be represented as $x \to y \to y' \to x'$, where the first and the third steps are the two steps of the DA algorithm. Suppose the middle step is implemented by making a draw according to a Mtd $r(y'|y)$. Thus, denoting the current state by $X_n = x$, a generic sandwich algorithm uses the following three steps to move to the new state $X_{n+1} = x'$.
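As a self-contained illustration of ours (not from the source), consider again a Student-t target under the classical normal/gamma augmentation $x \mid y \sim N(0, 1/y)$, $y \sim \text{Gamma}(\nu/2, \text{rate} = \nu/2)$. Here the $y$-marginal $f_Y$ is known, so an independent redraw of $y$ from $f_Y$ is a valid (if extreme) middle step trivially satisfying the invariance condition; in realistic problems $f_Y$ is intractable and the middle step is a cheap group move instead. All names are our own.

```python
import numpy as np

def sandwich_student_t(n_iter, nu=5.0, x0=0.0, rng=None):
    """Toy sandwich sampler for a Student-t target under the augmentation
    x | y ~ N(0, 1/y), y | x ~ Gamma((nu+1)/2, rate=(nu+x^2)/2).
    The middle step redraws y from f_Y = Gamma(nu/2, rate=nu/2), which
    leaves f_Y invariant (and, in this toy case, makes the draws iid)."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0
    out = np.empty(n_iter)
    for n in range(n_iter):
        y = rng.gamma((nu + 1) / 2, 2.0 / (nu + x**2))  # step 1: y ~ f(y|x)
        y = rng.gamma(nu / 2, 2.0 / nu)                 # step 2: y' leaving f_Y invariant
        x = rng.normal(0.0, np.sqrt(1.0 / y))           # step 3: x' ~ f(x|y')
        out[n] = x
    return out
```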
We now describe how to construct an Mtd $r(y'|y)$ for a valid middle step for sandwich algorithms. Recall that $f_Y$ is the marginal density of $y$. Suppose $r(y'|y)$ leaves $f_Y$ invariant, that is,

$$\int_{\mathsf{Y}} r(y' \mid y)\, f_Y(y)\, dy = f_Y(y'). \qquad (3.1)$$

It turns out that any $r(y'|y)$ satisfying this invariance condition implies that $\pi$ is stationary for the corresponding sandwich algorithm, and consequently the sandwich algorithm can be used to obtain approximate samples from $\pi$. To see this, note that the Mtd of the sandwich algorithm is given by

$$k_{SA}(x' \mid x) = \int_{\mathsf{Y}} \int_{\mathsf{Y}} f_{X|Y}(x' \mid y')\, r(y' \mid y)\, f_{Y|X}(y \mid x)\, dy\, dy', \qquad (3.2)$$

and

$$\int_{\mathsf{X}} k_{SA}(x' \mid x)\, \pi(x)\, dx = \int_{\mathsf{Y}} \int_{\mathsf{Y}} \int_{\mathsf{X}} f_{X|Y}(x' \mid y')\, r(y' \mid y)\, f_{Y|X}(y \mid x)\, \pi(x)\, dx\, dy\, dy' = \int_{\mathsf{Y}} \int_{\mathsf{Y}} f_{X|Y}(x' \mid y')\, r(y' \mid y)\, f_Y(y)\, dy\, dy' = \int_{\mathsf{Y}} f_{X|Y}(x' \mid y')\, f_Y(y')\, dy' = \pi(x'), \qquad (3.3)$$

where the third equality follows from (3.1). Furthermore, if $r(y'|y)$ satisfies the detailed balance condition with respect to $f_Y$, that is, if $r(y'|y) f_Y(y)$ is symmetric in $(y, y')$, then

$$k_{SA}(x' \mid x)\, \pi(x) = k_{SA}(x \mid x')\, \pi(x'), \qquad (3.4)$$

that is, the corresponding sandwich algorithm is reversible with respect to the target density $\pi$.
Let us denote the Markov chain underlying the sandwich algorithm by $\{\tilde{X}_n\}_{n \ge 0}$. Since from (3.3) or (3.4) it follows that the target density $\pi$ is stationary for $\{\tilde{X}_n\}_{n \ge 0}$, the sample averages based on the SA Markov chain can also be used to estimate $E_\pi g$. Indeed, as mentioned in Section 1, if $\{\tilde{X}_n\}_{n \ge 0}$ is appropriately irreducible, $\tilde{g}_n := \sum_{i=0}^{n-1} g(\tilde{X}_i)/n \to E_\pi g$ almost surely as $n \to \infty$. Now, $\tilde{g}_n$ will be preferred over $\bar{g}_n$ as an estimate of $E_\pi g$ if the former leads to smaller (Monte Carlo) errors. In particular, if there is a Markov chain central limit theorem (CLT) for $\bar{g}_n$, that is, if $\sqrt{n}(\bar{g}_n - E_\pi g)$ converges in distribution to $N(0, \sigma_g^2)$ for some finite asymptotic variance $\sigma_g^2$, the standard error of $\bar{g}_n$ is $\sqrt{\hat{\sigma}_g^2/n}$, where $\hat{\sigma}_g^2$ is a consistent estimator of $\sigma_g^2$ (see Flegal and Jones, 2010; Vats et al., 2019, for methods of constructing $\hat{\sigma}_g^2$). Thus, in practice, establishing CLTs for $\bar{g}_n$ and $\tilde{g}_n$ is important for ascertaining and comparing the quality of these estimators. A sufficient condition extensively used in the literature for guaranteeing the existence of such a CLT for every square integrable function $g$ is that the Markov chain converges to its stationary distribution at a geometric rate (Roberts and Rosenthal, 1997; Chan and Geyer, 1994).
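As an aside of ours (not in the source), the batch-means method is one simple way to construct the consistent asymptotic-variance estimator mentioned above; the function name and default batch count are our own choices.

```python
import numpy as np

def batch_means_se(chain, n_batches=30):
    """Batch-means estimate of the MCMC standard error of the sample mean:
    split the chain into batches, estimate the asymptotic variance from the
    spread of the batch means, and return sqrt(sigma2_hat / n)."""
    n = len(chain) // n_batches * n_batches   # drop the remainder
    m = n // n_batches                        # batch length
    batches = np.asarray(chain[:n]).reshape(n_batches, m).mean(axis=1)
    sigma2_hat = m * batches.var(ddof=1)      # estimates the asymptotic variance
    return np.sqrt(sigma2_hat / n)
```

For an iid chain the estimate should be close to the usual $\sqrt{\mathrm{Var}/n}$; for an autocorrelated chain it is inflated accordingly.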
Formally, the chain $\{X_n\}_{n \ge 0}$ is called geometrically ergodic if there exist a function $M: \mathsf{X} \to [0, \infty)$ and a constant $t \in (0, 1)$ such that, for all $x \in \mathsf{X}$ and all $n \ge 1$,

$$\int_{\mathsf{X}} \left| k^n(x' \mid x) - \pi(x') \right| dx' \le M(x)\, t^n,$$

where $k^n$ is the $n$-step Mtd of the chain $\{X_n\}_{n \ge 0}$. Unfortunately, the simple irreducibility and aperiodicity conditions mentioned in Section 1 do not imply geometric ergodicity. The most common method of proving geometric ergodicity is by establishing drift and minorization conditions (Rosenthal, 1995; Jones and Hobert, 2001). Indeed, several of the DA algorithms considered in this chapter have been shown to be geometrically ergodic by establishing appropriate drift and minorization conditions. For example, for the Bayesian lasso model discussed in Section 2.1, Rajaratnam et al. (2019) established geometric ergodicity of the DA Markov chain in a special case. The geometric rate of convergence of PS&W's DA Markov chain for the binary logistic regression model discussed in Section 2.2 has been established by Choi and Hobert (2013) and Wang and Roy (2018c) under proper normal and improper uniform priors on $\beta$, respectively.
If the SA Markov chain is reversible, denoting the asymptotic variances for $\bar{g}_n$ and $\tilde{g}_n$ by $\sigma_g^2$ and $\tilde{\sigma}_g^2$, respectively, it turns out that $\tilde{\sigma}_g^2 \le \sigma_g^2$ for every square integrable (with respect to $\pi$) function $g$ (Hobert and Marchev, 2008). Thus, sandwich algorithms are asymptotically more efficient than the corresponding DA algorithms. So if both the DA and sandwich algorithms are run for the same number of iterations, then the errors in $\tilde{g}_n$ are expected to be smaller than those of $\bar{g}_n$. In other words, to achieve the same level of precision, the sandwich algorithm is expected to require fewer iterations than the DA algorithm. In Section 4, we discuss other results established in the literature showing the superiority of the sandwich algorithms over the corresponding DA algorithms.
Intuitively, the extra step reduces the correlation between $X_n$ and $X_{n+1}$ and thus improves the mixing of the DA algorithm. On the other hand, since the extra step in a sandwich algorithm involves more computational effort than the DA algorithm, any gain in mixing should be weighed against this increased computational burden. Fortunately, there are several examples where the sandwich algorithm provides a "free lunch" in that it converges much faster than the underlying DA algorithm while requiring a similar amount of computational effort per iteration (see e.g. Laha et al., 2016; Pal et al., 2015; Roy and Hobert, 2007; Hobert et al., 2011; Roy, 2014). Following Liu and Wu (1999) and Meng and van Dyk (1999), the Mtd $r(y'|y)$ for these efficient sandwich algorithms is often constructed by introducing extra parameters $g$ lying in a set $G$ and a class of functions $\{t_g : \mathsf{Y} \to \mathsf{Y}\}$ indexed by $g \in G$. The middle step is implemented by drawing $g$ from a density depending on the current value $y$ and then setting $y' = t_g(y)$, maintaining invariance with respect to $f_Y$. Typically, the dimension of $G$ is small, say 1 or 2, much smaller than that of $\mathsf{Y}$. Thus, drawing from such an $r(y'|y)$ is usually much less computationally expensive than the two steps of the DA algorithm. However, as mentioned before, even such a low-dimensional perturbation often greatly improves the mixing of the DA algorithm.
We now describe Liu and Wu's (1999) Haar PX-DA algorithm, where the set $G$ is assumed to possess a certain group structure (see also Hobert and Marchev, 2008). In particular, $G$ is assumed to be a topological group; that is, a group such that the map $(g_1, g_2) \mapsto g_1 g_2$ and the inverse map $g \mapsto g^{-1}$ are continuous. An example of such a $G$ is the multiplicative group $\mathbb{R}_+ = (0, \infty)$, where the binary operation defining the group is multiplication, the identity element is 1, and $g^{-1} = 1/g$. Suppose $t_g(y)$ represents $G$ acting topologically on the left of $\mathsf{Y}$ (Eaton, 1989, Chapter 2), that is, $t_e(y) = y$ for all $y \in \mathsf{Y}$, where $e$ is the identity element in $G$, and $t_{g_1 g_2}(y) = t_{g_1}(t_{g_2}(y))$ for all $g_1, g_2 \in G$ and all $y \in \mathsf{Y}$.
We assume the existence of a multiplier $\chi: G \to \mathbb{R}_+$, that is, a continuous function satisfying $\chi(g_1 g_2) = \chi(g_1)\chi(g_2)$ for all $g_1, g_2 \in G$. Assume that the Lebesgue measure on $\mathsf{Y}$ is relatively (left) invariant with respect to the multiplier $\chi$; that is, assume that for any $g \in G$ and any integrable function $h: \mathsf{Y} \to \mathbb{R}$, we have

$$\int_{\mathsf{Y}} \chi(g)\, h(t_g(y))\, dy = \int_{\mathsf{Y}} h(y)\, dy.$$
Next, suppose the group $G$ has a left-Haar measure of the form $\nu_l(dg) = \rho(g)\, dg$, where $dg$ denotes Lebesgue measure on $G$. It is known that the left-Haar measure satisfies

$$\int_G h(\tilde{g} g)\, \nu_l(dg) = \int_G h(g)\, \nu_l(dg) \qquad (3.5)$$

for all $\tilde{g} \in G$ and all integrable functions $h: G \to \mathbb{R}$. Finally, assume that

$$m(y) = \int_G f_Y(t_g(y))\, \chi(g)\, \nu_l(dg)$$

is strictly positive for all $y \in \mathsf{Y}$ and finite for (almost) all $y$.
We now state how the middle step is implemented in Liu and Wu's (1999) Haar PX-DA algorithm. If the current state of the Haar PX-DA chain is $X_n = x$, an iteration of this chain uses the following steps to move to $X_{n+1} = x'$.
Example 1 (probit mixed model, continued from Section 2.3).
Improving the DA algorithm for the probit mixed model presented in Section 2.3, Roy (2022) constructs a Haar PX-DA algorithm for (2.22). For constructing the sandwich step, we need the marginal density of $z$ from the joint density given in (2.25). It turns out that
(3.6) |
where
Let $\mathsf{Z}$ denote the subset of $\mathbb{R}^n$ where $z$ lives, that is, $\mathsf{Z}$ is the Cartesian product of $n$ half (positive or negative) lines, where the $i$th component is $(0, \infty)$ (if $y_i = 1$) or $(-\infty, 0)$ (if $y_i = 0$).
As mentioned earlier in this section, take $G = \mathbb{R}_+$, and let the multiplicative group act on $\mathsf{Z}$ through scalar multiplication, $t_g(z) = g z$. Note that, for any $\tilde{g} \in G$, substituting $g' = \tilde{g} g$ we have

$$\int_0^\infty h(\tilde{g} g)\, \frac{dg}{g} = \int_0^\infty h(g')\, \frac{dg'}{g'},$$

which shows from (3.5) that $\nu_l(dg) = dg/g$ is a left-Haar measure for the multiplicative group, where $dg$ is Lebesgue measure on $\mathbb{R}_+$. Also, with the group action defined this way, it is known that the Lebesgue measure on $\mathbb{R}^n$ is relatively left invariant with the multiplier $\chi(g) = g^n$ (Hobert and Marchev, 2008). Thus,
(3.7) |
and since is a positive definite matrix,
is strictly positive for all $z \in \mathsf{Z}$ and finite for (almost) all $z$. Thus, a single iteration of the Haar PX-DA algorithm for the probit mixed model uses the following three steps to move from the current state $u$ to the next.
The density (3.7) is log-concave, and Roy (2022) suggests using Gilks and Wild's (1992) adaptive rejection sampling algorithm to sample from it. Since a single draw from (3.7) is the only difference between each iteration of the DA and the Haar PX-DA algorithms, the computational burden of the two algorithms is similar.
Given the group $G$, the Haar PX-DA algorithm is the best sandwich algorithm in terms of efficiency and operator norm (Hobert and Marchev, 2008). Khare and Hobert (2011) established necessary and sufficient conditions for the Haar PX-DA algorithm to be strictly better than the corresponding DA algorithm (see Section 4 for details). Also, the Haar PX-DA method, with appropriate choices of the group $G$ and its action, has been empirically demonstrated to result in a huge improvement in the mixing of DA algorithms, albeit with roughly the same computational cost, in various examples (see e.g. Roy and Hobert, 2007; Liu and Wu, 1999; Meng and van Dyk, 1999; Roy, 2014; Pal et al., 2015). Finally, there are other strategies proposed in the literature to improve DA algorithms without introducing extra auxiliary variables as in the sandwich algorithms. For example, Roy (2016) replaces one of the two original steps of the DA algorithm with a draw from the MH step given in Liu (1996). One advantage of Roy's (2016) modified DA algorithm over the sandwich algorithms is that the Markov transition matrix of the modified algorithm can have negative eigenvalues, whereas the DA algorithms, and generally also the sandwich algorithms used in practice, are positive Markov chains with non-negative eigenvalues (Khare and Hobert, 2011). Hence, Roy's (2016) modified DA algorithm can lead to superior performance over the DA and sandwich algorithms in terms of asymptotic variance.
4 Spectral properties of the DA Markov chain
The convergence of the Markov chains associated with MCMC algorithms to the desired stationary distribution is at the heart of the popularity of MCMC technology. In Section 1 we discussed the convergence of the sample averages and of the marginal distributions of $X_n$. For a serious practitioner, understanding the quality of both convergences is important. As shown in Section 3, proving geometric ergodicity of the Markov chain is key to providing standard errors for MCMC-based estimates. Establishing geometric ergodicity for general state space Markov chains arising in statistical practice is in general quite challenging. The drift and minorization technique that is often used to prove geometric ergodicity has two flavors/tracks. The first track leads to establishing qualitative geometric ergodicity, without providing quantitative convergence bounds for the distance to stationarity after finitely many Markov chain iterations. The other track leads to explicit convergence bounds, but these bounds are often too conservative to be useful, especially in modern high-dimensional settings (Jones and Hobert, 2004; Qin and Hobert, 2021). While several alternate and promising methods for obtaining convergence bounds have been developed in recent years (see Qin and Hobert, 2022, for a useful summary), it is safe to say that a successful geometric ergodicity analysis remains elusive for an overwhelming majority of MCMC algorithms used in statistical practice. However, as we will see below, the special structure of the DA Markov chain often allows us to establish stronger properties which imply geometric ergodicity in particular, and which lead to a richer and deeper understanding of the structure and convergence properties of the relevant DA Markov chain.
4.1 Leveraging operator theory for comparing DA and sandwich chains
Operator theory and associated tools play a key role in the convergence analysis and comparison of the DA Markov chain (with Mtd $k$) and the sandwich Markov chain (with Mtd $k_{SA}$). Let $L^2_0(\pi)$ denote the space of real-valued functions (with domain $\mathsf{X}$) that have mean zero and are square integrable with respect to $\pi$. In particular,

$$L^2_0(\pi) = \left\{ g: \mathsf{X} \to \mathbb{R} \;:\; \int_{\mathsf{X}} g^2(x)\, \pi(x)\, dx < \infty \text{ and } \int_{\mathsf{X}} g(x)\, \pi(x)\, dx = 0 \right\}.$$

Let $K$ and $K_{SA}$ denote the Markov operators on $L^2_0(\pi)$ corresponding to the Markov transition densities $k$ and $k_{SA}$, respectively. Then for any $g \in L^2_0(\pi)$, we have

$$(Kg)(x) = \int_{\mathsf{X}} g(x')\, k(x' \mid x)\, dx' \qquad \text{and} \qquad (K_{SA}g)(x) = \int_{\mathsf{X}} g(x')\, k_{SA}(x' \mid x)\, dx'.$$

Note that $L^2_0(\pi)$ is a Hilbert space equipped with the inner product

$$\langle g, h \rangle = \int_{\mathsf{X}} g(x)\, h(x)\, \pi(x)\, dx$$

for every pair $g, h \in L^2_0(\pi)$, and the corresponding norm defined by $\|g\| = \sqrt{\langle g, g \rangle}$ for every $g \in L^2_0(\pi)$. The operator norms of $K$ and $K_{SA}$, denoted respectively by $\|K\|$ and $\|K_{SA}\|$, are defined as

$$\|K\| = \sup_{g \in L^2_0(\pi),\, \|g\| = 1} \|Kg\| \qquad \text{and} \qquad \|K_{SA}\| = \sup_{g \in L^2_0(\pi),\, \|g\| = 1} \|K_{SA}g\|.$$
The convergence properties of the DA and sandwich Markov chains are intricately linked to the behavior of the corresponding Markov operators $K$ and $K_{SA}$. Using the structure of $k$ in (1.1), it can be shown easily that $K$ is a non-negative operator (an operator $T$ is non-negative if $\langle Tg, g \rangle \ge 0$ for all $g$), and that $\|K\| \le 1$. Also, $K$ is a self-adjoint operator (an operator $T$ is self-adjoint if $\langle Tg, h \rangle = \langle g, Th \rangle$ for all $g, h$), and if $r(y'|y)$ satisfies detailed balance with respect to $f_Y$ (which will be assumed henceforth, unless specified otherwise), then so is $K_{SA}$. Roberts and Rosenthal (1997) show that, for a self-adjoint Markov operator corresponding to a Harris ergodic chain, the operator norm is strictly less than 1 if and only if the corresponding Markov chain is geometrically ergodic. In this case, the operator norm in fact corresponds to the asymptotic rate of convergence of the relevant Markov chain to its stationary distribution. The above concepts enable us to state the first result that rigorously shows that the sandwich chain is at least as good as the DA chain in some key aspects.
Theorem 4.1 (Hobert and Marchev, 2008).
If the DA chain with Mtd $k$ is Harris ergodic, and the sandwich chain (with Mtd $k_{SA}$) can itself be interpreted as a DA chain, then $\|K_{SA}\| \le \|K\|$. Hence, geometric Harris ergodicity of the DA chain implies geometric Harris ergodicity of the sandwich chain, and in such settings the asymptotic rate of convergence to stationarity of the sandwich chain is at least as good as that of the DA chain.
The structure and analysis of sandwich chains is more complicated (sometimes significantly so) than that of DA chains due to the additional sandwich step. However, the above result allows us to completely avoid analyzing the sandwich chain if geometric ergodicity can be established for the corresponding DA chain.
Intuition suggests that, given the extra step, the sandwich chain should have better mixing and faster convergence than the original DA chain under suitable regularity conditions. The above result provides rigorous justification for this intuition. See Section 4.4 for further discussion.
4.2 The trace class property for Markov operators
An operator $K$ on $L^2(f_X)$ is defined to be Hilbert-Schmidt if for any orthonormal sequence $\{e_n\}_{n \geq 1}$ in $L^2(f_X)$, we have
$$\sum_{n=1}^{\infty} \|K e_n\|^2 < \infty \qquad \text{(4.1)}$$
Additionally, a positive self-adjoint operator $K$ (such as the DA operator) is defined to be trace class if
$$\sum_{n=1}^{\infty} \langle K e_n, e_n \rangle < \infty \qquad \text{(4.2)}$$
for any orthonormal sequence in . Both the above properties imply that the operator is compact and has countably many singular values. The trace class property implies that these singular values are summable, while the Hilbert-Schmidt property implies that they are square-summable. If the associated Markov chain is Harris ergodic, then compactness of a Markov operator implies that its spectral radius is less than 1. Existing results (see for example Proposition 2.1 and Remark 2.1 in Roberts and Rosenthal, (1997)) can now be leveraged to establish geometric ergodicity. To summarize, for Harris ergodic Markov chains, we have
and if the corresponding Markov operator is also positive and self-adjoint, then
The above implications potentially offer an alternate route for establishing geometric ergodicity. However, the conditions in (4.1) and (4.2) seem hard to verify, especially for intractable Markov transition densities arising in statistical applications. Fortunately, there are equivalent characterizations of (4.1) and (4.2) that are easier to state and verify. In particular, results in Davies, (1983) can be leveraged to show that a Markov operator $K$ (with corresponding Mtd $k$) on $L^2(f_X)$ is Hilbert-Schmidt if and only if
$$\int \int \frac{k(x' \mid x)^2}{f_X(x')} \, f_X(x) \, dx \, dx' < \infty, \qquad \text{(4.3)}$$
and in case $K$ is self-adjoint and positive, it is trace class if and only if
$$\int k(x \mid x) \, dx < \infty. \qquad \text{(4.4)}$$
The conditions in (4.3) and (4.4), while simpler than those in (4.1) and (4.2), can still be quite challenging to verify for Markov chains arising in statistical practice. The corresponding Markov transition densities are themselves often expressed as intractable integrals, and the analysis of the expressions in (4.3) and (4.4) can be quite lengthy, laborious and intricate. As the authors of Liu et al., (1995) state, the condition in (4.3) “is standard, but not easy to check and understand”. Yet, there have been many success stories in recent years; see Khare and Hobert, (2011); Pal et al., (2017); Chakraborty and Khare, (2017, 2019); Qin et al., (2019); Rajaratnam et al., (2019); Jin and Tan, (2021). Markov chains used by statistical practitioners overwhelmingly employ either the Gibbs sampling algorithm, or the Metropolis-Hastings algorithm, or a combination of both. Any continuous state space chain with a non-trivial Metropolis component cannot be compact (Chan and Geyer,, 1994) and hence cannot be trace class. The same is true for random scan Gibbs chains. (Systematic scan) Gibbs sampling Markov chains are in general not reversible. However, the convergence of a two-block (systematic scan) Gibbs chain is completely characterized by its “marginal chains”, which correspond to positive self-adjoint Markov operators (Liu et al.,, 1994; Diaconis et al.,, 2008). In particular, the construction in Section 1 shows that the DA chain is a marginal chain of a two-block Gibbs sampler for . Hence, it is not surprising that all Markov chains which have been shown to be trace class in these papers are DA Markov chains.
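As a concrete illustration of the trace class characterization in (4.4), consider a toy DA chain in which the target and the augmented variable are jointly bivariate normal with correlation rho. This example is an illustrative assumption for this sketch (it does not appear in the text above), but it has the virtue that the DA Mtd is available in closed form, so the trace class integral can be computed numerically and compared against the known eigenvalue sum.

```python
import numpy as np
from scipy import integrate, stats

# Toy DA chain (illustrative assumption): (X, Y) bivariate normal with
# correlation rho, so X | Y = y ~ N(rho*y, 1 - rho^2) and Y | X = x ~ N(rho*x, 1 - rho^2).
# Integrating out Y gives the DA Mtd: X' | X = x ~ N(rho^2 * x, 1 - rho^4).
rho = 0.8

def k(x_new, x):
    """DA Markov transition density k(x_new | x)."""
    return stats.norm.pdf(x_new, loc=rho**2 * x, scale=np.sqrt(1 - rho**4))

# Trace class condition (4.4): integrate the Mtd along the diagonal.
trace, _ = integrate.quad(lambda x: k(x, x), -np.inf, np.inf)

# For this chain the eigenvalues are rho^(2n), n >= 0, so the trace
# should equal sum_n rho^(2n) = 1 / (1 - rho^2).
print(trace)              # finite, so the chain is trace class
print(1 / (1 - rho**2))
```

The finiteness of the computed integral (here it matches the geometric series of eigenvalues) is exactly the certificate of the trace class property, and hence of geometric ergodicity, discussed above.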
4.3 Trace class property as a tool to establish geometric ergodicity
We have already discussed how establishing the trace class property for a positive self-adjoint Markov operator corresponding to a Harris ergodic Markov chain yields geometric ergodicity of that chain. In this section, we weigh the potential advantages and disadvantages of this alternative approach against those of the drift and minorization approach for establishing geometric ergodicity. A key advantage is that the trace class approach based on (4.4) is more straightforward and streamlined than the drift and minorization approach. Indeed, construction of effective drift functions depends heavily on the Markov chain at hand and has been described as “a matter of art” (Diaconis et al., (2008)). On the other hand, the drift and minorization approach has broader applicability, and has been successfully applied to non-reversible and non-compact chains.
Suppose that for a DA Markov chain one is able to establish geometric ergodicity through a drift and minorization analysis, and is also able to establish the trace class condition in (4.4). Since the trace class property is much stronger than geometric ergodicity, one would expect that assumptions needed for establishing the trace class property are stronger than those needed for the drift and minorization analysis. Surprisingly, as demonstrated in Mukherjee et al., (2023), this is not always true. To understand why this might happen, we take a brief but careful look at the Markov chain analyzed in Mukherjee et al., (2023) below.
Similar to Sections 2.2 and 2.3, consider a binary regression setting with independent binary responses and corresponding predictor vectors , such that
(4.5) |
for . The goal is to estimate . Here is the CDF of the Student’s -distribution with degrees of freedom, and the corresponding model is referred to as the robit regression model (Liu,, 2004). In the binary regression setting, outliers are observations with unexpectedly large predictor values and a misclassified response. The robit model can down-weight outliers more effectively and produce a better fit than probit or logistic models (Pregibon,, 1982).
Following Roy, 2012a ; Albert and Chib, (1993), consider a Bayesian model with a multivariate normal prior for with mean and covariance matrix . Let denote the vector of observed values for the response variables . As demonstrated in Roy, 2012a , the posterior density is intractable, and Roy, 2012a develops a DA approach to construct a computationally tractable Markov chain with as its stationary density by introducing unobserved latent variables that are mutually independent and satisfy and . Straightforward calculations now show that , where denotes the Student’s -distribution with degrees of freedom, location and scale . Furthermore, if , then , which precisely corresponds to the robit regression model specified above.
One can map this setting to the DA framework laid out in Section 1, by viewing as “”, as “”, and the joint posterior density of as “”. Straightforward calculations (see Roy, 2012a ) show that samples from and are easy to generate, as they involve sampling from standard distributions such as multivariate normal, truncated- and Gamma. These observations allow Roy, 2012a to use the corresponding DA Markov chain on to generate samples from (the marginal posterior density of ). We will refer to this Markov chain as the robit DA Markov chain.
Roy, 2012a established Harris ergodicity of the robit DA chain, and established geometric ergodicity using a drift and minorization analysis. However, this analysis requires the following assumptions:
• The design matrix is full rank (which implies and rules out high-dimensional settings)
•
•
The last upper bound, involving the design matrix, the prior mean and covariance and the degrees of freedom, is in particular very restrictive. Through a tighter drift and minorization analysis, Mukherjee et al., (2023) relax the above restrictions to some extent, but not substantially (see Theorem S1 and Theorem S2 in Mukherjee et al., (2023)). The related probit DA chain is obtained by (a) using the standard normal CDF instead of in (4.5), (b) again using a multivariate normal prior for , (c) introducing latent variables with for , and (d) employing the DA machinery in Section 1 with as “”. The geometric ergodicity analysis of the probit DA chain in Roy and Hobert, (2007); Chakraborty and Khare, (2017) uses drift functions similar to those in Roy, 2012a ; Mukherjee et al., (2023), but requires minimal assumptions. Note that the latent variables used in the robit setting are not needed in the probit setting. While having this additional layer of latent variables certainly complicates the analysis of the robit DA chain, it is not clear whether the restrictive conditions listed above for geometric ergodicity of the robit DA chain are really necessary or whether they are an artifact of using the drift and minorization technique in the face of this added complication.
A trace class analysis for the probit and robit DA chains is available in Chakraborty and Khare, (2017) and Mukherjee et al., (2023) respectively. The trace class property for the probit DA chain was established in (Chakraborty and Khare,, 2017, Theorem 2) under some constraints on and the prior covariance matrix . This is consistent with the expectation that weaker assumptions should be needed for establishing drift and minorization based geometric ergodicity, as compared to establishing the trace class property. However, for the robit DA chain the reverse phenomenon holds: Mukherjee et al., (2023) establish the trace class property for the robit DA chain for any . Hence, the drift and minorization approach, with the drift functions chosen in Roy, 2012a ; Mukherjee et al., (2023), needs stronger conditions to succeed than the trace class approach. Essentially, the additional layer of latent variables introduced in the robit setting severely hampers the drift and minorization analysis, but causes no additional complications for establishing the trace class condition.
The key takeaway from the above discussion is that even when a successful drift and minorization based analysis is available, it is worth investigating the finiteness of the trace class integral in (4.4), as it might require comparatively weaker assumptions, contrary to expectations. Also, establishing the trace class property provides additional insights, which are discussed in the next subsection.
4.4 Trace class property as a tool to compare DA and sandwich chains
The results discussed previously show that, under certain regularity conditions, the sandwich chain is at least as good as the DA chain in certain respects. These results, while very useful, are not completely satisfactory, as they only establish that the sandwich chain is “as good as” the DA chain. The natural question is: are there conditions under which the sandwich chain can be shown to be “strictly better” than the DA chain (in an appropriate sense)? This question has been investigated in Khare and Hobert, (2011); Roy, 2012b , and we discuss their results below.
First, we provide a brief review of some ideas and results used in these analyses. Both papers leverage the fact (see Buja, (1990); Diaconis et al., (2008)) that the DA operator can be written as a product of two simple ‘projection’ operators. In particular, let denote the operator from to defined by
for every , and let denote the operator from to defined by
for every . Also, let denote the operator corresponding to the Mtd (the sandwich step Mtd). Then, it can be shown that
For any and any , simple calculations show that (the first inner product corresponds to , while the second one corresponds to ). Hence, the operators and are adjoints of each other. Using elementary operator theoretic arguments, this fact can be used to immediately conclude that is a positive operator (as previously mentioned) and . It follows that
However, in almost all applications of interest, , since the sandwich step typically corresponds to a univariate movement in . Hence, the above argument does not allow us to prove that the sandwich algorithm is strictly better (in the norm sense).
On the other hand, if the trace class condition in (4.4) is satisfied, then a very useful singular value decomposition of becomes available (see Buja, (1990)). In particular, it can be shown that
where and form orthonormal bases of and respectively (with ), is a decreasing sequence of numbers in the unit interval with , and and are orthogonal for every in the sense that
The singular value decomposition can be used to establish that
(4.6) |
for every . In particular, this implies that is the spectrum of the DA operator and that are the corresponding eigenfunctions. Given that and (4.6), a comparative spectral analysis for the DA and sandwich algorithms now hinges on how the operator interacts with the functions . By investigating this carefully, Khare and Hobert, (2011) establish the following result.
Theorem 4.2.
Assume that (4.4) holds, and that the sandwich operator is idempotent () with . Then the sandwich operator is positive and trace class. If denotes the ordered sequence of eigenvalues for , then for every . Also, for every such that , if and only if .
In other words, the spectrum of the sandwich operator is dominated pointwise by the spectrum of the DA operator , and strict domination for a specific pair of eigenvalues depends exclusively on whether or not the operator leaves the corresponding -function invariant. A stronger version of this result, which only requires to be compact, is established in Roy, 2012b .
For a more specific comparison of the operator norms of the two operators, as expected, the key factor is the interaction of the operator with the linear span of , where is the multiplicity of the largest eigenvalue , i.e., the number of ’s that are exactly equal to . The next result from Khare and Hobert, (2011) leverages this idea to provide a necessary and sufficient condition for the operator norm of to be strictly smaller than that of .
Theorem 4.3.
Under the setup in Theorem 4.2, if and only if the only function in the linear span of which is left invariant by is the (identically) zero function.
While the two results above provide necessary and sufficient conditions for equality of the norms (and other eigenvalues) of and , their practical utility is somewhat limited by the need to identify the functions . However, if we restrict attention to the group action based Haar PX-DA chain, then the following result can be established.
Theorem 4.4.
Suppose the condition in (4.4) is satisfied. Then either the Mtds and are exactly the same, or, outside of this triviality, the Haar PX-DA sandwich chain is strictly better than the DA chain in the sense that its eigenvalues are uniformly (pointwise) dominated by those of the DA chain, with at least one strict domination.
5 The “two-block” DA algorithm and its sandwich variants
The applicability of the DA algorithm depends on the ease of sampling from the conditional densities and . While samples from both of these densities can be generated in a straightforward way in various settings, there are many others where this is not true. In particular, in these settings the density is often intractable and cannot be directly sampled from. However, it is often possible to partition into two components (with and ) such that samples from and can be easily generated. In such settings, a modified two-block DA algorithm is used to generate samples from . Each iteration of the two-block DA algorithm consists of three steps: a draw from , followed by a draw from , and finally a draw from . The transition of this two-block DA Markov chain from the current state to the next state can be described as follows.
From the three steps in each iteration of the two-block DA algorithm, it follows that the Markov transition density (Mtd) of the DA Markov chain (with ) is given by
(5.1) |
It is easy to show that the two-block DA chain has as its stationary density. However, a key difference from the “single-block” Markov chain in Section 1 is that the two-block DA chain does not satisfy the detailed balance condition.
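On a small discrete state space, the three-stage Mtd in (5.1) becomes a product of conditional probability tensors, and both the stationarity claim and the failure of detailed balance can be verified exactly. The joint density below is a randomly generated illustrative example, not one taken from the text.

```python
import numpy as np

rng = np.random.default_rng(7)

# Random joint density p(u, v, y) on a 3 x 3 x 4 state space; the target is
# f(u, v) and y is the augmented variable (illustrative example).
p = rng.random((3, 3, 4))
p /= p.sum()

f_uv = p.sum(axis=2)                                  # target f(u, v)
p_y_given_uv = p / f_uv[:, :, None]                   # p(y | u, v)
p_u_given_vy = p / p.sum(axis=0, keepdims=True)       # p(u | v, y)
p_v_given_uy = p / p.sum(axis=1, keepdims=True)       # p(v | u, y)

# Two-block DA transition as in (5.1): draw y' ~ p(y|u,v), then
# u' ~ p(u|v,y'), then v' ~ p(v|u',y'); sum y' out to get the Mtd.
T = np.einsum('uvy,avy,aby->uvab', p_y_given_uv, p_u_given_vy, p_v_given_uy)
T = T.reshape(9, 9)                                   # rows: (u,v), cols: (u',v')
f = f_uv.reshape(-1)

print(np.abs(f @ T - f).max())                        # ~0: f is stationary
D = f[:, None] * T
print(np.allclose(D, D.T))                            # False: no detailed balance
```

The first check confirms that the target marginal is left invariant by the three-stage transition, while the asymmetry of the flow matrix confirms the loss of reversibility noted above.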
Example (Bayesian quantile regression) Consider the linear model
subject to the constraint that the quantile of the error distribution of is . Recall that is the collection of -dimensional covariate vectors, and are the regression coefficients. The standard (frequentist) method for estimating the regression parameter (see Yu and Moyeed, (2001)) minimizes an objective function that is equivalent to the negative log-likelihood for if the errors are assumed to be i.i.d. with the asymmetric Laplace density given by
where with denoting the standard indicator function. Note that has quantile equal to zero, and corresponds to the standard Laplace density with location and scale equal to and , respectively, when . Leveraging this connection, Yu and Moyeed, (2001); Kozumi and Kobayashi, (2011) pursue a Bayesian approach where are assumed to be i.i.d. with common density , where is an unknown scale parameter. The following independent priors are assigned to and : and . The posterior density of is intractable, and Kozumi and Kobayashi, (2011) propose the following DA approach, which exploits a normal/exponential mixture representation of the asymmetric Laplace distribution.
Define and , and consider random pairs such that and . It can be easily verified that the marginal density of given is indeed . However, direct sampling from, or closed form computation of, the joint posterior density of given is not feasible. Nevertheless, it can be shown that (a) the full posterior conditional distribution of is multivariate Gaussian, (b) the elements of are conditionally independent given , and follow a generalized inverse Gaussian distribution, and (c) the full posterior conditional distribution of is inverse gamma.
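The normal/exponential mixture representation underlying this DA scheme can be checked numerically. In the standard parameterization from the asymmetric Laplace literature (stated here as an assumption, with unit scale), the error is distributed as theta*z + tau*sqrt(z)*u with z standard exponential and u standard normal, where theta = (1-2p)/(p(1-p)) and tau^2 = 2/(p(1-p)); marginalizing over z should recover the asymmetric Laplace density with quantile level p.

```python
import numpy as np
from scipy import integrate, stats

p = 0.3                                  # target quantile level (illustrative)
theta = (1 - 2 * p) / (p * (1 - p))
tau2 = 2 / (p * (1 - p))

def ald_pdf(e):
    """Asymmetric Laplace density with quantile level p and unit scale."""
    return p * (1 - p) * np.exp(-e * (p - (e < 0)))

def mixture_pdf(e):
    """Marginalize the mixture: eps | z ~ N(theta*z, tau2*z), z ~ Exp(1)."""
    integrand = lambda z: stats.norm.pdf(e, loc=theta * z,
                                         scale=np.sqrt(tau2 * z)) * np.exp(-z)
    val, _ = integrate.quad(integrand, 0, np.inf)
    return val

for e in [-2.0, -0.5, 0.5, 1.0, 3.0]:
    print(e, ald_pdf(e), mixture_pdf(e))   # the two columns should agree
```

The agreement of the two columns confirms that conditioning on the exponential latent variable turns the asymmetric Laplace likelihood into a Gaussian one, which is exactly what makes the full conditionals above tractable.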
A two-block DA algorithm with as “”, as “” and as “” can hence be used to generate approximate samples from the joint posterior density of . This in particular yields samples from the marginal posterior density of . Since the posterior density of given only can be shown to be an inverse gamma density, this provides a mechanism to sample from the target posterior density of .
Coming back to the general setting, a direct application/insertion of the single-block Haar PX-DA step (Step 2) as a sandwich step in the two-block setting is not feasible, as the resulting Markov chain does not in general have as its stationary density. Pal et al., (2015) develop feasible two-block adaptations of the single-block Haar PX-DA sandwich algorithm in Section 3, which we briefly describe below.
Consider a group acting topologically on the left of , as in the discussion prior to the single-block Haar PX-DA algorithm. Given any , define the densities
and
Here and (assumed to be finite) are normalizing constants. Either of these two densities can be used to construct a ‘valid’ Haar PX-DA sandwich step in the current setting. The transition of this two-block sandwich Markov chain from the current state to the next state can be described as follows.
Let and denote the Mtds corresponding to the two-block sandwich chain described in the two-block Haar PX-DA algorithm (with and respectively). Pal et al., (2015) show that both and have as their stationary distribution.
Returning to the Bayesian quantile regression example, recall that the scale parameter plays the role of “”. Consider the action of the group on , the sample space of , through scalar multiplication. The left Haar measure for is given by , and it can be shown that the Lebesgue measure on is relatively invariant with respect to the multiplier . Using the above construction of two-block sandwich densities, it can be shown that is a non-standard univariate density. However, samples can be generated from it using a rejection sampler with a dominating inverse gamma density. On the other hand, can be shown to be an inverse gamma density, and the corresponding sandwich chain is more suitable for practical use.
We conclude with a quick summary of associated theoretical results for the two-block DA setting. Using results in Asmussen and Glynn, (2011), it follows that the two-block DA chain and the sandwich chain are Harris ergodic if the corresponding Mtd is strictly positive everywhere (this is the case with most chains encountered in statistical applications). However, things get much more complicated (compared to the single-block setting) for a comparative study of deeper issues such as operator norms, geometric ergodicity, etc. A key reason for these challenges is that the two-block DA operator, based on the three-stage transition step, is a product of three relevant projection operators, and consequently loses its self-adjointness and positivity. A theoretical comparison of closely related “lifted” versions of the two-block DA and sandwich chains is available in Pal et al., (2015); however, several questions involving a direct comparison of the two-block DA and sandwich chains remain open:
• Under what conditions does geometric ergodicity of the two-block DA chain imply geometric ergodicity of the corresponding sandwich chain?
• Although reversibility is lost, one could still aim to establish that the two-block DA operator is Hilbert-Schmidt by showing that the condition in (4.3) is satisfied. In such a setting, under what conditions does the Hilbert-Schmidt property of the two-block DA chain imply the same for the corresponding sandwich chain?
• How do answers to the above questions depend on the choice of vs. in the sandwich step in the two-block Haar PX-DA algorithm?
6 Distributed DA adaptations for massive data settings
The sandwich DA algorithm can serve as a useful and computationally inexpensive tool for speeding up the convergence of the DA algorithm. However, its utility is limited for increasingly common “massive” datasets, which contain hundreds of thousands (if not millions) of samples and/or variables. For a Bayesian statistical analysis of such datasets, a key goal is to generate samples from the intractable posterior density of the parameters “”. For instances where the DA algorithm is applicable, the dimensionality of “” (the latent variables) is often equal to the number of samples or the number of variables. For example, in the Bayesian variable selection example described in Section 2.1, the dimensionality of is , the number of variables. The corresponding methodology is meant to be used in high-dimensional settings where is very large. In such settings, the computational cost of sampling the latent variables in each iteration of the DA and sandwich chains can be exorbitant. Alternatively, consider the Bayesian logistic regression example in Section 2.2, where the number of latent variables is equal to the number of samples . For datasets such as the MovieLens data (Perry,, 2016; Srivastava et al.,, 2019), where the number of samples is in the millions, the computational cost of sampling the latent variables can again be prohibitive.
For settings where a massive number of samples is an issue, the following divide-and-conquer strategy can be considered: (a) divide the data into a number of smaller subsets of reasonable size, (b) run DA Markov chains on all subsets in parallel, and (c) suitably combine the subset-wise Markov chains for inference. See Scott et al., (2016); Jordan et al., (2019); Wang and Srivastava, (2021) and the references therein. The combined draws from the subset-wise Markov chains do not themselves constitute a Markov chain, and quantification of the Monte Carlo error becomes challenging. While asymptotic statistical guarantees based on a normal approximation of the target posterior are available for some of these methods (as the subset sample size tends to infinity), no rigorous bounds are available for the distance between the distribution of the combined parameter draws and the target posterior distribution (based on the entire data). In other words, the lack of a Markov chain structure for the combined draws deprives the user of the rich diagnostic and theoretical tools available in the literature for Markov chains.
To address the above issues, the authors in Zhou et al., (2023) develop an asynchronous and distributed version of the DA algorithm, called ADDA, which outputs a Markov chain with the desired posterior as its stationary distribution. Two key assumptions are needed for ADDA to be useful: (a) the latent variables can be partitioned into blocks which are conditionally independent (given the original parameters “” and the data), and (b) the conditional posterior distribution of each latent sub-block depends exclusively on a relevant subset of the observed data. This is, for example, true for the Bayesian variable selection example in Section 2.1, where the augmented variables are conditionally independent given the regression coefficients and the data. Also, the conditional posterior distribution of does not depend on the observed data (hence, requirement (b) above is trivially satisfied). Again, for the Bayesian logistic regression example in Section 2.2, the augmented variables are conditionally independent given the regression coefficients and the data, and the conditional posterior distribution of depends only on (quantities related to the observation).
Consider the general DA setup of Section 1, where represents the parameter of interest, and represents the block of latent parameters. The goal is to sample from , which represents the marginal posterior density of given the observed data (the conditioning on the observed data will be left implicit for continuity and ease of exposition). We assume that can be divided into sub-blocks which are conditionally independent given . Also, as mentioned above, the observed data can be partitioned into subsets such that only depends on the data subset for .
The ADDA algorithm employs a manager process and several worker processes. At the level of sampling, the job of the manager process is to sample new parameter values from when appropriate, and to send this sample to all the workers. The job of the worker process is to sample new values for the latent block from when appropriate, and to send these samples to the manager process. Hence, the job of sampling the latent parameter is distributed among the worker processes. Another key feature of the ADDA algorithm is asynchronous sampling of the latent parameter blocks. The degree of asynchrony is controlled by user-specified parameters and as follows. To make its next conditional draw of the parameter , the manager process, with probability , waits to receive updated draws of the relevant latent parameter blocks from all the workers. With probability , however, it waits to receive only an -fraction of updated draws, and proceeds to sample given the most recent draws for all latent parameter blocks. If a worker is in the midst of sampling its assigned latent parameter block but receives a new -draw from the manager, it terminates its current sampling and begins afresh using the latest -draw received from the manager. This entire process is summarized below.
(1) At time , the manager starts with initial values at and sends to the workers.
(2) For , the manager
  (M-a) waits to receive only an -fraction of updated s (see below) from the workers with probability , and with probability , waits to receive all the updated s from the workers;
  (M-b) creates by replacing the relevant s with the newly received ;
  (M-c) draws from ; and
  (M-d) sends to all the worker processes and resets .
(3) For , the worker ()
  (W-a) waits to receive from the manager process;
  (W-b) draws from ; and
  (W-c) sends to the manager process, resets , and goes to (W-a) if is not received from the manager before the draw is complete; otherwise, it truncates the sampling process, resets , and goes to (W-b).
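The manager/worker protocol above can be emulated serially. The sketch below uses a deliberately simple conjugate toy model (theta ~ N(0,1) and latent blocks z_j | theta iid N(theta,1), so the marginal target for theta is N(0,1)); the model, the parameter values, and the uniform random choice of which workers report first are all illustrative assumptions, whereas in the real ADDA the reporting order is determined by actual worker speeds.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy conjugate model (illustrative): theta ~ N(0, 1), z_j | theta ~ N(theta, 1),
# so theta | z ~ N(m * zbar / (m + 1), 1 / (m + 1)); theta's marginal is N(0, 1).
m, r, q = 10, 0.3, 0.2          # workers, waiting fraction, full-sync probability
n_iter = 20000

theta = 0.0
z = rng.normal(size=m)          # worker j holds latent block z_j
draws = np.empty(n_iter)

for t in range(n_iter):
    if rng.random() < q:
        workers = np.arange(m)                          # wait for all workers
    else:
        k = int(np.ceil(r * m))                         # wait for an r-fraction
        workers = rng.choice(m, size=k, replace=False)
    z[workers] = rng.normal(theta, 1.0, size=len(workers))   # worker step
    post_var = 1.0 / (m + 1)                            # manager step: theta | z
    theta = rng.normal(m * z.mean() * post_var, np.sqrt(post_var))
    draws[t] = theta

print(draws.mean(), draws.var())   # should be close to 0 and 1
```

Even though only a fraction of the latent blocks is refreshed at most iterations, the sample moments of the theta-draws match the N(0,1) target, consistent with the invariance property discussed below.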
It is clear that when or , the ADDA algorithm is identical to the DA algorithm. However, things get interesting when and . In this setting, at each iteration of the ADDA algorithm, only an -fraction of the latent parameter blocks is updated (with probability ). The ADDA algorithm can then be viewed as a mix of a systematic and a random subset scan Gibbs sampler. The systematic part comes from always updating in any given iteration, and the random subset part comes from updating only a random subset of sub-blocks of at each iteration. Such algorithms have been considered previously in the literature. However, what gives the ADDA a novel and interesting flavor is that, unlike in existing approaches, the comparative speed of the workers for sampling the respective sub-blocks can depend on the current value of . In other words, the choice of the sub-blocks of that are updated at any given iteration can depend on the current value of the parameter . Given this dependence, it turns out that the marginal -process is not Markov, but the joint -process is Markov.
In fact, it can be shown that the Markov chain has as its stationary (invariant) density. For ease of exposition, we provide a proof of this assertion when and are supported on a discrete space and , the number of latent variable blocks, is equal to . We assume that and . The arguments below can be extended in a straightforward way to a general setting with more than two latent variable blocks, and to non-discrete settings with arbitrary and values in .
Let denote the starting value for the ADDA chain, and assume that it is drawn from the desired posterior . Let denote the next iterate generated by the ADDA chain. Our goal is to show that . As we will see below, a key assumption which enables this is the conditional independence assumption which implies (recall that and denote the two blocks of ).
Since given is a draw from , it follows that
Hence, to prove the desired result, it is enough to show that . Note that
Let us recall how is sampled given . With probability say , only the first latent variable block is obtained using a draw from and the second block is left unchanged at , and with probability , only the second latent variable block is obtained using and the first block is left unchanged at . It follows that
Using (by conditional independence of and given ), we get
The last step again uses conditional independence of and given . Since , it follows that
Note that the above argument considers the purely asynchronous ADDA (). But the ADDA kernel with positive is a mixture of the DA and purely asynchronous ADDA kernels, so the result immediately follows for such settings as well. For a setting with more than two blocks and a general value of , we will have terms in the derivation above instead of two. The term, with essentially the same manipulations as above, will simplify to , where denotes the probability of choosing the relevant subset of latent variable blocks. Since , the invariance result follows. The authors in Zhou et al., (2023) also establish geometric ergodicity of the ADDA chain in specific settings, including the Bayesian variable selection example considered in Section 2.1.
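The invariance argument above can be verified exactly on a toy discrete example with a binary parameter and two binary latent blocks (all probabilities below are arbitrary illustrative choices). The code builds the purely asynchronous ADDA transition matrix, in which exactly one latent block is refreshed from its conditional before the parameter is redrawn, and checks that the joint density is stationary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint density pi(theta, z1, z2) = p(theta) p(z1|theta) p(z2|theta),
# so the two latent blocks are conditionally independent given theta.
p_theta = np.array([0.4, 0.6])
p_z1 = rng.dirichlet(np.ones(2), size=2)    # rows: theta, columns: z1
p_z2 = rng.dirichlet(np.ones(2), size=2)    # rows: theta, columns: z2
pi = np.einsum('t,ta,tb->tab', p_theta, p_z1, p_z2)

# The manager's conditional p(theta | z1, z2)
p_theta_given_z = pi / pi.sum(axis=0, keepdims=True)

q1 = 0.7   # probability that the z1-block reports first (pure asynchrony)

# 8 x 8 transition matrix of the joint (theta, z1, z2) chain: refresh one
# latent block from its conditional, then draw theta given both blocks.
T = np.zeros((8, 8))
for t in range(2):
    for a in range(2):
        for b in range(2):
            i = 4 * t + 2 * a + b
            for a2 in range(2):             # branch 1: update z1 only
                for t2 in range(2):
                    T[i, 4 * t2 + 2 * a2 + b] += (
                        q1 * p_z1[t, a2] * p_theta_given_z[t2, a2, b])
            for b2 in range(2):             # branch 2: update z2 only
                for t2 in range(2):
                    T[i, 4 * t2 + 2 * a + b2] += (
                        (1 - q1) * p_z2[t, b2] * p_theta_given_z[t2, a, b2])

pi_flat = pi.reshape(-1)
print(np.abs(pi_flat @ T - pi_flat).max())   # numerically zero: pi is invariant
```

The check mirrors the proof step by step: refreshing one block from its conditional preserves the joint density precisely because the blocks are conditionally independent given theta, and the subsequent theta-draw preserves it as well.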
The parameter controls the degree of asynchrony. If is small, then the computational cost of each ADDA iteration is lower, but the ADDA Markov chain mixes more slowly. This tradeoff between slower mixing and lower computational cost per iteration is key to the choice of and to the performance of the ADDA algorithm. The hope is that there is a range of values of for which the joint effect of these two competing factors leads to a significant decrease in the overall wall clock time required for the ADDA chain (as compared to a pure distributed implementation of the original DA chain). This is indeed seen in the extensive experimental evaluations performed in Zhou et al., (2023), where the ADDA is shown to deliver a remarkable overall computational gain with a comparatively small loss of accuracy. For example, when analyzing the MovieLens data (Perry,, 2016; Srivastava et al.,, 2019) (with millions of samples) using Bayesian logistic regression (Section 2.2), the ADDA algorithm is three to five times faster with only 2% less accuracy compared to the (pure distributed) DA algorithm after 10,000 iterations.
7 Summary and discussion
DA algorithms introduce appropriate latent variables to facilitate sampling from intractable target distributions. These algorithms explore the parameter space by iteratively updating the parameters and the latent variables, usually drawing from standard distributions. Indeed, DA algorithms are a popular choice for MCMC computing due to their simplicity and broad applicability. The simple structure of the transition density of the DA Markov chain allows tractable, detailed spectral analyses of its convergence behavior in addition to drift and minorization based analyses. This chapter discusses tools for analyzing and comparing the spectra of DA Markov chains. The applicability of DA methods has been demonstrated by their successful implementation in various widely used examples from statistics and machine learning.
Despite their widespread use, DA algorithms can suffer from slow mixing. A great deal of effort has gone into developing techniques that tweak the DA chain’s transition density to improve its convergence speed. Among the different approaches, parameter expansion based methods, namely the PX-DA strategies, have been the most successful. As described in this article, several theoretical results have been established showing the superiority of PX-DA algorithms over the corresponding DA algorithms. For a given reparameterization, the results in the literature identify the ‘best’ PX-DA algorithm as the Haar PX-DA algorithm, based on the Haar measure. On the other hand, relatively little work compares different parameter expansion strategies. One exception is Roy, (2014), which, through an empirical study involving a Bayesian robit model, shows that a partially reparameterized Haar PX-DA algorithm can outperform a fully reparameterized Haar PX-DA algorithm. Using the same robit example, Roy, (2014) shows that some reparameterizations may provide little improvement over the DA algorithm. An important area of future research is to provide theoretical and methodological results for constructing and comparing different reparameterization strategies for improving DA algorithms.
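The benefit of a well-chosen extra step between the two conditional draws can be seen in a stylized Gaussian example. This is a hand-built illustration, not one of the PX-DA algorithms from the literature: take the parameter and latent variable jointly bivariate normal with correlation `rho`, so both DA conditionals are available in closed form, and insert the deterministic move z ↦ −z, which leaves the standard normal marginal of z invariant, as a sandwich step.

```python
import random

def run_chain(n, rho, sandwich, seed=1):
    # Joint: (theta, z) bivariate normal, mean zero, unit variances,
    # correlation rho; each conditional is N(rho * other, 1 - rho^2).
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5
    theta, out = 0.0, []
    for _ in range(n):
        z = rng.gauss(rho * theta, sd)   # DA step 1: draw z | theta
        if sandwich:
            z = -z                       # extra move; leaves N(0,1) invariant
        theta = rng.gauss(rho * z, sd)   # DA step 2: draw theta | z
        out.append(theta)
    return out

def lag1_autocorr(x):
    n, m = len(x), sum(x) / len(x)
    num = sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1))
    den = sum((v - m) ** 2 for v in x)
    return num / den

# The DA chain on theta is AR(1) with coefficient rho^2 = 0.81; the
# sandwich flips the sign, turning strong positive autocorrelation
# into negative autocorrelation.
da = lag1_autocorr(run_chain(20000, 0.9, sandwich=False))
sw = lag1_autocorr(run_chain(20000, 0.9, sandwich=True))
```

Both chains have the same N(0,1) stationary distribution for theta, but the sandwich version has negative lag-one autocorrelation, which reduces the asymptotic variance of ergodic averages in this toy model; this mirrors, in miniature, how the extra step in PX-DA and sandwich algorithms can improve on plain DA.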
In various fields and applications, it is common to encounter large data sets that need to be analyzed. For these data sets, the number of observations or variables is large. The dimension of the latent vector in a DA scheme is usually the same as the number of observations or the number of variables. On the other hand, as seen in all examples considered in this article, the latent variables are generally conditionally independent given the observed data and the current parameter value. Thus, the draw from , the first of the two steps in every iteration of the DA algorithm, can be parallelized, distributing the computational burden across multiple processors or nodes. Indeed, conditional independence of the latent blocks is key to ensuring that the ADDA process described in Section 6 is a Markov chain with as its stationary distribution. However, such conditional independence is lost in many applications, such as hidden Markov models and time series models. An important open research direction is the extension of ADDA-type ideas that adapt to this loss of conditional independence without compromising the underlying Markov structure. Another open direction is a useful integration of the sandwich technique to develop parameter-expanded versions of the ADDA algorithm.
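As a schematic, the conditionally independent block draws can be farmed out to a worker pool. In the sketch below, threads stand in for the processors or nodes, and the normal conditional inside `draw_block` is a placeholder; in a real DA step it would be the model's actual conditional for that block given the current parameter value and that block's data. All function names here are illustrative only.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def draw_block(seed, theta, data_block):
    # Draw the latent variables for one block. The draw depends only on
    # this block's data and the current theta (conditional independence),
    # so blocks can be processed in parallel. The normal draw below is a
    # placeholder for the model's true conditional distribution.
    rng = random.Random(seed)
    return [rng.gauss(theta, 1.0) for _ in data_block]

def parallel_latent_draw(theta, data_blocks, base_seed=0):
    # Step 1 of a DA iteration: draw all latent blocks concurrently.
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(draw_block, base_seed + j, theta, blk)
                   for j, blk in enumerate(data_blocks)]
        return [f.result() for f in futures]

blocks = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]
z = parallel_latent_draw(0.0, blocks)   # one latent draw per observation
```

Per-block seeds keep the parallel draws reproducible regardless of scheduling order; in hidden Markov or time series models this embarrassingly parallel structure breaks down, which is exactly the obstacle noted above.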
Acknowledgments
The first author’s work was partially supported by USDA NIFA Grant 2023-70412-41087.
References
- Albert and Chib, (1993) Albert, J. H. and Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88:669–679.
- Andrews and Mallows, (1974) Andrews, D. F. and Mallows, C. F. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B, 36:99–102.
- Asmussen and Glynn, (2011) Asmussen, S. and Glynn, P. W. (2011). A new proof of convergence of MCMC via the ergodic theorem. Statistics and Probability Letters, 81:1482–1485.
- Booth and Hobert, (1999) Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society, Series B, 61:265–285.
- Buja, (1990) Buja, A. (1990). Remarks on functional canonical variates, alternating least squares methods and ACE. The Annals of Statistics, 18:1032–1069.
- Chakraborty and Khare, (2017) Chakraborty, S. and Khare, K. (2017). Convergence properties of Gibbs samplers for Bayesian probit regression with proper priors. Electronic Journal of Statistics, 11(1):177–210.
- Chakraborty and Khare, (2019) Chakraborty, S. and Khare, K. (2019). Consistent estimation of the spectrum of trace class data augmentation algorithms. Bernoulli, 25:3832–3863.
- Chan and Geyer, (1994) Chan, K. S. and Geyer, C. J. (1994). Comment on “Markov chains for exploring posterior distributions” by L. Tierney. The Annals of Statistics, 22:1747–1758.
- Choi and Hobert, (2013) Choi, H. M. and Hobert, J. P. (2013). The Polya-Gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic. Electronic Journal of Statistics, 7:2054–2064.
- Davies, (1983) Davies, E. (1983). Linear Integral Operators: Surveys and Reference Works in Mathematics, 7. Wiley Online Library.
- Dempster et al., (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39:1–38.
- Diaconis et al., (2008) Diaconis, P., Khare, K., and Saloff-Coste, L. (2008). Gibbs sampling, exponential families and orthogonal polynomials (with discussion). Statistical Science, 23:151–200.
- Diebolt and Robert, (1994) Diebolt, J. and Robert, C. P. (1994). Estimation of finite mixture distributions by Bayesian sampling. Journal of the Royal Statistical Society, Series B, 56:363–375.
- Duan et al., (2018) Duan, L. L., Johndrow, J. E., and Dunson, D. B. (2018). Scaling up data augmentation MCMC via calibration. The Journal of Machine Learning Research, 19(1):2575–2608.
- Eaton, (1989) Eaton, M. L. (1989). Group Invariance Applications in Statistics. Institute of Mathematical Statistics and the American Statistical Association, Hayward, California and Alexandria, Virginia.
- Flegal and Jones, (2010) Flegal, J. M. and Jones, G. L. (2010). Batch means and spectral variance estimators in Markov chain Monte Carlo. The Annals of Statistics, 38:1034–1070.
- Frühwirth-Schnatter and Frühwirth, (2010) Frühwirth-Schnatter, S. and Frühwirth, R. (2010). Data augmentation and MCMC for binary and multinomial logit models. In Statistical Modelling and Regression Structures, pages 111–132. Springer.
- Geyer, (1994) Geyer, C. J. (1994). On the convergence of Monte Carlo maximum likelihood calculations. Journal of the Royal Statistical Society, Series B, 56:261–274.
- Geyer and Thompson, (1992) Geyer, C. J. and Thompson, E. A. (1992). Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society, Series B, 54:657–699.
- Gilks and Wild, (1992) Gilks, W. R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 41(2):337–348.
- Hobert, (2011) Hobert, J. P. (2011). Handbook of Markov chain Monte Carlo, chapter The data augmentation algorithm: theory and methodology, pages 253–293. CRC Press, Boca Raton, FL.
- Hobert and Marchev, (2008) Hobert, J. P. and Marchev, D. (2008). A theoretical comparison of the data augmentation, marginal augmentation and PX-DA algorithms. The Annals of Statistics, 36:532–554.
- Hobert et al., (2011) Hobert, J. P., Roy, V., and Robert, C. P. (2011). Improving the convergence properties of the data augmentation algorithm with an application to Bayesian mixture modelling. Statistical Science, 26:332–351.
- Holmes and Held, (2006) Holmes, C. C. and Held, L. (2006). Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1:145–168.
- Jin and Tan, (2021) Jin, R. and Tan, A. (2021). Fast Markov chain Monte Carlo for high-dimensional Bayesian regression models with shrinkage priors. Journal of Computational and Graphical Statistics, 30(3):632–646.
- Jones and Hobert, (2001) Jones, G. L. and Hobert, J. P. (2001). Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Statistical Science, 16:312–334.
- Jones and Hobert, (2004) Jones, G. L. and Hobert, J. P. (2004). Sufficient burn-in for Gibbs samplers for a hierarchical random effects model. The Annals of Statistics, 32:784–817.
- Jordan et al., (2019) Jordan, M. I., Lee, J. D., and Yang, Y. (2019). Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668–681.
- Khare and Hobert, (2011) Khare, K. and Hobert, J. P. (2011). A spectral analytic comparison of trace-class data augmentation algorithms and their sandwich variants. The Annals of Statistics, 39:2585–2606.
- Kozumi and Kobayashi, (2011) Kozumi, H. and Kobayashi, G. (2011). Gibbs sampling methods for Bayesian quantile regression. Journal of Statistical Computation and Simulation, 81(11):1565–1578.
- Kyung et al., (2010) Kyung, M., Gill, J., Ghosh, M., and Casella, G. (2010). Penalized regression, standard errors, and Bayesian Lassos. Bayesian Analysis, 5:369–412.
- Laha et al., (2016) Laha, A., Dutta, S., and Roy, V. (2016). A novel sandwich algorithm for empirical Bayes analysis of rank data. Statistics and its Interface, 10:543–556.
- Liu, (2004) Liu, C. (2004). Robit regression: A simple robust alternative to logistic and probit regression. In Gelman, A. and Meng, X. L., editors, Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, pages 227–238. Wiley, London.
- Liu, (1996) Liu, J. S. (1996). Peskun’s theorem and a modified discrete-state Gibbs sampler. Biometrika, 83:681–682.
- Liu et al., (1994) Liu, J. S., Wong, W. H., and Kong, A. (1994). Covariance structure of the Gibbs sampler with applications to comparisons of estimators and augmentation schemes. Biometrika, 81:27–40.
- Liu et al., (1995) Liu, J. S., Wong, W. H., and Kong, A. (1995). Covariance structure and convergence rate of the Gibbs sampler with various scans. Journal of the Royal Statistical Society, Series B, 57:157–169.
- Liu and Wu, (1999) Liu, J. S. and Wu, Y. N. (1999). Parameter expansion for data augmentation. Journal of the American Statistical Association, 94:1264–1274.
- Meng and van Dyk, (1999) Meng, X.-L. and van Dyk, D. A. (1999). Seeking efficient data augmentation schemes via conditional and marginal augmentation. Biometrika, 86:301–320.
- Meyn and Tweedie, (1993) Meyn, S. P. and Tweedie, R. L. (1993). Markov Chains and Stochastic Stability. Springer-Verlag, London.
- Mukherjee et al., (2023) Mukherjee, S., Khare, K., and Chakraborty, S. (2023). Convergence properties of data augmentation algorithms for high-dimensional robit regression. Electronic Journal of Statistics, 17(1):19–69.
- Pal et al., (2015) Pal, S., Khare, K., and Hobert, J. P. (2015). Improving the data augmentation algorithm in the two-block setup. Journal of Computational and Graphical Statistics, 24(4):1114–1133.
- Pal et al., (2017) Pal, S., Khare, K., and Hobert, J. P. (2017). Trace class Markov chains for Bayesian inference with generalized double Pareto shrinkage priors. Scandinavian Journal of Statistics, 44(2):307–323.
- Papaspiliopoulos and Roberts, (2008) Papaspiliopoulos, O. and Roberts, G. (2008). Stability of the Gibbs sampler for Bayesian hierarchical models. The Annals of Statistics, 36:95–117.
- Park and Casella, (2008) Park, T. and Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103:681–686.
- Perry, (2016) Perry, P. O. (2016). Fast moment-based estimation for hierarchical models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(1):267–291.
- Polson et al., (2013) Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic models using Pólya-Gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349.
- Pregibon, (1982) Pregibon, D. (1982). Resistant fits for some commonly used logistic models with medical applications. Biometrics, 38:485–498.
- Qin and Hobert, (2021) Qin, Q. and Hobert, J. P. (2021). On the limitations of single-step drift and minorization in Markov chain convergence analysis. The Annals of Applied Probability, 31(4):1633–1659.
- Qin and Hobert, (2022) Qin, Q. and Hobert, J. P. (2022). Geometric convergence bounds for Markov chains in Wasserstein distance based on generalized drift and contraction conditions. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 58(2):872–889.
- Qin et al., (2019) Qin, Q., Hobert, J. P., and Khare, K. (2019). Estimating the spectral gap of a trace-class Markov operator. Electronic Journal of Statistics, 13:1790–1822.
- Rajaratnam et al., (2019) Rajaratnam, B., Sparks, D., Khare, K., and Zhang, L. (2019). Scalable Bayesian shrinkage and uncertainty quantification in high-dimensional regression. Journal of Computational and Graphical Statistics, 28(1):174–184.
- Rao and Roy, (2021) Rao, Y. and Roy, V. (2021). Block Gibbs samplers for logistic mixed models: convergence properties and a comparison with full Gibbs samplers. Electronic Journal of Statistics, 15:5598–5625.
- Roberts and Rosenthal, (1997) Roberts, G. O. and Rosenthal, J. S. (1997). Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability, 2:13–25.
- Rosenthal, (1995) Rosenthal, J. S. (1995). Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association, 90:558–566.
- Roy, (2012a) Roy, V. (2012a). Convergence rates for MCMC algorithms for a robust Bayesian binary regression model. Electronic Journal of Statistics, 6:2463–2485.
- Roy, (2012b) Roy, V. (2012b). Spectral analytic comparisons for data augmentation. Statistics & Probability Letters, 82:103–108.
- Roy, (2014) Roy, V. (2014). Efficient estimation of the link function parameter in a robust Bayesian binary regression model. Computational Statistics and Data Analysis, 73:87–102.
- Roy, (2016) Roy, V. (2016). Improving efficiency of data augmentation algorithms using Peskun’s theorem. Computational Statistics, 31:709–728.
- Roy, (2022) Roy, V. (2022). MCMC for GLMMs. In C.R. Rao, A. S. R. and Young, A., editors, Handbook of Statistics Volume 47: Advancements in Bayesian Methods and Implementation, pages 135–159. Elsevier.
- Roy and Chakraborty, (2017) Roy, V. and Chakraborty, S. (2017). Selection of tuning parameters, solution paths and standard errors for Bayesian lassos. Bayesian Analysis, 12:753–778.
- Roy and Hobert, (2007) Roy, V. and Hobert, J. P. (2007). Convergence rates and asymptotic standard errors for Markov chain Monte Carlo algorithms for Bayesian probit regression. Journal of the Royal Statistical Society, Series B, 69:607–623.
- Scott et al., (2016) Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2016). Bayes and big data: the consensus Monte Carlo algorithm. International Journal of Management Science and Engineering Management, 11(2):78–88.
- Srivastava et al., (2019) Srivastava, S., DePalma, G., and Liu, C. (2019). An asynchronous distributed expectation maximization algorithm for massive data: The DEM algorithm. Journal of Computational and Graphical Statistics, 28(2):233–243.
- Swendsen and Wang, (1987) Swendsen, R. H. and Wang, J.-S. (1987). Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58:86–88.
- Tanner and Wong, (1987) Tanner, M. A. and Wong, W. H. (1987). The calculation of posterior distributions by data augmentation (with discussion). Journal of the American Statistical Association, 82:528–550.
- Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288.
- Vats et al., (2019) Vats, D., Flegal, J. M., and Jones, G. L. (2019). Multivariate output analysis for Markov chain Monte Carlo. Biometrika, 106:321–337.
- Wang and Srivastava, (2021) Wang, C. and Srivastava, S. (2021). Divide-and-conquer Bayesian inference in hidden Markov models. Electronic Journal of Statistics, 17:895–947.
- Wang and Roy, (2018a) Wang, X. and Roy, V. (2018a). Analysis of the Pólya-Gamma block Gibbs sampler for Bayesian logistic linear mixed models. Statistics & Probability Letters, 137:251–256.
- Wang and Roy, (2018b) Wang, X. and Roy, V. (2018b). Convergence analysis of the block Gibbs sampler for Bayesian probit linear mixed models with improper priors. Electronic Journal of Statistics, 12:4412–4439.
- Wang and Roy, (2018c) Wang, X. and Roy, V. (2018c). Geometric ergodicity of Pólya-Gamma Gibbs sampler for Bayesian logistic regression with a flat prior. Electronic Journal of Statistics, 12:3295–3311.
- Xu and Ghosh, (2015) Xu, X. and Ghosh, M. (2015). Bayesian variable selection and estimation for group lasso. Bayesian Analysis, 10(4):909–936.
- Yu and Moyeed, (2001) Yu, K. and Moyeed, R. A. (2001). Bayesian quantile regression. Statistics & Probability Letters, 54(4):437–447.
- Yu and Meng, (2011) Yu, Y. and Meng, X.-L. (2011). To center or not to center: that is not the question - An ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency (with discussion). Journal of Computational and Graphical Statistics, 20:531–570.
- Zhou et al., (2023) Zhou, J., Khare, K., and Srivastava, S. (2023). Asynchronous and distributed data augmentation for massive data settings. Journal of Computational and Graphical Statistics, 32(3):895–907.
- Zou and Hastie, (2005) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society, Series B, 67:301–320.