Convergence Analysis of Data Augmentation Algorithms for Bayesian Robust Multivariate Linear Regression with Incomplete Data
Abstract
Gaussian mixtures are commonly used for modeling heavy-tailed error distributions in robust linear regression. Combining the likelihood of a multivariate robust linear regression model with a standard improper prior distribution yields an analytically intractable posterior distribution that can be sampled using a data augmentation algorithm. When the response matrix has missing entries, implementing the algorithm and analyzing its convergence properties pose unique challenges. Conditions for geometric ergodicity are provided when the incomplete data have a “monotone” structure. In the absence of a monotone structure, an intermediate imputation step is necessary for implementing the algorithm. In this case, we provide sufficient conditions for the algorithm to be Harris ergodic. Finally, we show that, when there is a monotone structure and intermediate imputation is unnecessary, intermediate imputation slows the convergence of the underlying Markov chain, while post hoc imputation does not. An R package for the data augmentation algorithm is provided.
1 Introduction
Markov chain Monte Carlo (MCMC) algorithms are frequently used for sampling from intractable probability distributions in Bayesian statistics. We study the convergence properties of a class of MCMC algorithms designed for Bayesian multivariate linear regression with heavy-tailed errors. In contrast to most existing work on MCMC convergence analysis, we focus on scenarios where the data set is incomplete. The presence of missing data substantially complicates the analysis.
Gaussian mixtures are suitable for constructing heavy-tailed error distributions in robust linear regression models (Zellner, 1976; Lange and Sinsheimer, 1993; Fernandez and Steel, 2000). In a Bayesian setting where a simple improper prior is used, the mixture representation facilitates a data augmentation (DA) MCMC algorithm (Liu, 1996) that can be used to sample from the posterior distribution of the regression coefficients and error scatter matrix. When there are no missing data, this algorithm is geometrically ergodic under regularity conditions (Roy and Hobert, 2010; Hobert et al., 2018). Loosely speaking, geometric ergodicity means that the Markov chain converges to the posterior distribution at a geometric, or exponential rate. Geometric ergodicity is important because it guarantees the existence of a central limit theorem for ergodic averages, which is in turn crucial for assessing the accuracy of Monte Carlo estimators (Chan and Geyer, 1994; Jones and Hobert, 2001; Jones, 2004; Flegal et al., 2008; Vats et al., 2018, 2019; Jones and Qin, 2021).
The current work studies the algorithm when the response matrix has missing entries. An incomplete data set brings unique challenges to the implementation and analysis of the DA algorithm. For instance, establishing posterior propriety becomes much more difficult than when there are no missing values. If the posterior measure is not proper, the DA algorithm will produce nonsensical results (Hobert and Casella, 1996). However, if one can show that the algorithm is geometrically ergodic, then the posterior distribution is guaranteed to be proper.
When the missing data have a certain “monotone” structure, the DA algorithm can be carried out without an intermediate step to impute the missing data. In this case, we establish geometric ergodicity under conditions on the error distribution and the number of incomplete response components. Roughly speaking, when the mixing distribution in the Gaussian mixture representation of the error distribution places little mass near the origin and the number of incomplete components is not too large, the DA algorithm is geometrically ergodic. The conditions are satisfied by many mixing distributions, including distributions with finite support, log-normal, generalized inverse Gaussian, and inverse Gamma or Fréchet with shape parameter greater than , where is the dimension of the response variable. Some Gamma, Weibull, and F distributions also satisfy the conditions. A post hoc imputation step can be added to fill in the missing values, and the convergence properties of the DA algorithm are unaffected.
When the missing data do not possess a monotone structure, some missing entries need to be imputed to implement the DA algorithm. This results in a data augmentation algorithm with an intermediate (as opposed to post hoc) imputation step, which we call a DAI algorithm for short. We provide sufficient conditions for the DAI algorithm to be Harris ergodic. Harris ergodicity is weaker than geometric ergodicity, but it guarantees posterior propriety as well as the existence of a law of large numbers for ergodic averages (Meyn and Tweedie, 2005, Theorem 17.1.7).
When the missing data have a monotone structure, both the DA (with or without post hoc imputation) and DAI algorithms can be applied. However, we show that the DA algorithm converges at least as fast as the DAI algorithm.
Our key strategy is to draw a connection from cases where the data set is incomplete to the standard case where the data set is fully observed. This allows for an analysis of the former using tools built for the latter.
Finally, we provide an R package Bayesianrobust that implements the DA and DAI algorithms. The R package is available from https://github.com/haoxiangliumn/Bayesianrobust. While the algorithms are documented in Liu (1996), we do not know of previous software packages that implement them.
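The package can be installed directly from the GitHub repository above; the commands below use the standard devtools workflow and assume only that the repository path matches the URL given. The exported functions and their arguments are documented within the package and are not reproduced here.

```r
# Install and load the package from GitHub (requires the 'devtools' package).
# The repository path below matches the URL given in the text.
install.packages("devtools")                              # if not already installed
devtools::install_github("haoxiangliumn/Bayesianrobust")
library(Bayesianrobust)
```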
The rest of this article is organized as follows. Section 2 recounts the Bayesian robust linear regression model with incomplete data. In Section 3, we describe the DA and DAI algorithms. Our main results are in Section 4, where we provide conditions for geometric ergodicity of the DA algorithm and Harris ergodicity of the DAI algorithm. The section also contains a comparison between the DA and DAI algorithms in terms of convergence rate. Section 5 presents a numerical experiment.
2 Robust Linear Regression with Incomplete Data
Let be independent data points, where is a vector of known predictors, and is a random vector. Consider the multivariate linear regression model
where is a matrix of unknown regression coefficients, is a unknown positive definite scatter matrix, and , are iid random errors. Let be the response matrix whose th row is , and let be the design matrix whose th row is .
To allow for error distributions with potentially heavy tails, assume that the distribution of each is described by a scale mixture of multivariate normal densities, which takes the form
where is a probability measure on referred to as the mixing distribution. Gaussian mixtures constitute a variety of error distributions, and are widely used for robust regression. For instance, when corresponds to the distribution for some , i.e., has a density with respect to the Lebesgue measure given by
the errors follow the multivariate t distribution with degrees of freedom, which has density function
See Section 4.2 for more examples.
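To illustrate the mixture representation, the following R sketch draws errors from a multivariate t distribution by first drawing a Gamma mixing weight and then a conditionally normal error. The dimension, degrees of freedom, and scatter matrix used here are placeholder values chosen for illustration.

```r
# Draw n errors from a d-variate t distribution with nu degrees of freedom via
# the scale mixture: w ~ Gamma(nu/2, rate = nu/2), then epsilon | w ~ N(0, Sigma / w).
# The values of n, d, nu, and Sigma below are illustrative placeholders.
set.seed(1)
n <- 1000; d <- 2; nu <- 4
Sigma <- matrix(c(1, 0.5, 0.5, 2), d, d)
L <- t(chol(Sigma))                      # lower-triangular factor, Sigma = L %*% t(L)
w <- rgamma(n, shape = nu / 2, rate = nu / 2)
z <- matrix(rnorm(n * d), n, d)          # rows of independent standard normals
eps <- (z %*% t(L)) / sqrt(w)            # row i is N(0, Sigma / w_i), i.e., multivariate t
```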
Consider a Bayesian setting. Throughout, we use to denote realized values of . Assume that has the following prior density:
(1)
where , , and is the convex cone of (symmetric) positive semi-definite real matrices. Equation (1) defines a class of commonly used default priors (Liu, 1996; Fernández and Steel, 1999). For instance, the independence Jeffreys prior corresponds to and . Denote by the measure associated with . The model of interest can then be written in the hierarchical form:
where denotes -variate normal distributions. Let be a vector of latent random variables. The joint measure of is given by
where, for , , , and ,
(2)
Throughout, conditional distributions concerning are to be uniquely defined through the density .
We assume that is fully observed, but may contain missing values. The missing structure can be described by an random matrix , with indicating is observed, and indicating is missing. For a realized response matrix and a realized missing structure , let be the array , where . Then gives the observable portion of . In practice, instead of observing , one sees . Throughout, we assume that the missing data mechanism is ignorable, which means the following.
Definition 1.
The missing data mechanism is ignorable if, for any and almost every , the posterior distribution of given is the same as the conditional distribution of given .
If, given , the distribution of does not depend on , and the data are “realized missing at random” (Seaman et al., 2013), which holds, for instance, when is independent of , then the missing data mechanism is ignorable. From here on, fix a realized missing structure , and condition only on when studying posterior distributions. Without loss of generality, assume that each row of contains at least one nonzero element, i.e., for each , has at least one observed entry.
To write down the exact form of the posterior, we introduce some additional notation. Suppose that given , if and only if for some , where . Given , let be the matrix that satisfies the following: For , in the th row of , all elements except the th one are 0, while the th one is 1. For instance, if , , , , then
Then is a vector consisting of the observed components of if , and we can write as . For a realized value of , say , denote by the th row of transposed, let , and let . Based on Equation (2), the posterior density of given has the following form:
where, for ,
(3)
Remark 2.
Because the prior distribution is not proper, is not automatically a proper probability density. As a by-product of our convergence analysis, we give sufficient conditions for posterior propriety in Section 4.
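To make the observation-selection notation above concrete, the following R sketch builds, for a single row of the missing-structure matrix, the 0/1 matrix that extracts the observed components of the corresponding response vector. The indicator and response values are hypothetical, and the function name is ours.

```r
# Build the selection matrix for one observation: given a 0/1 indicator vector
# over the response components, the returned matrix has one row per observed
# component and picks those components out of the response vector.
selection_matrix <- function(r_i) {
  obs <- which(r_i == 1)
  D <- matrix(0, nrow = length(obs), ncol = length(r_i))
  D[cbind(seq_along(obs), obs)] <- 1
  D
}

r_i <- c(1, 0, 1, 1)                    # components 1, 3, and 4 observed (hypothetical)
y_i <- c(2.3, 0, -0.7, 1.5)             # full response vector (missing entry set to 0 here)
D_i <- selection_matrix(r_i)
D_i %*% y_i                             # returns the observed components 2.3, -0.7, 1.5
```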
The posterior density is almost always intractable in the sense that it is hard to compute its features, such as expectations and quantiles. Liu (1996) proposed a data augmentation (DA) algorithm, or two-component Gibbs sampler, that can be used to sample from this distribution. This is an MCMC algorithm that simulates a Markov chain that is reversible with respect to the posterior. In the next section, we describe the algorithm in detail.
3 The DA and DAI Algorithms
3.1 Missing Structures
To adequately describe the DA and DAI algorithms, we need to introduce some concepts regarding missing structure matrices.
A realized missing structure is said to be monotone if the following conditions hold:
- (i) If for some and , then whenever and .
- (ii) for .
Note that in practice, there are cases where the observed missing structure can be re-arranged to become monotone by permuting the rows and columns of the response matrix. If there are no missing data, the corresponding missing structure is monotone.
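The following R sketch checks whether a 0/1 missing-structure matrix is monotone. It assumes the usual convention for monotone patterns, namely that within each row the observed components come first and that rows with more observed components come before rows with fewer; this ordering convention, and the function name, are assumptions of the sketch.

```r
# Check monotonicity of a 0/1 missing-structure matrix r under the convention
# that, in each row, observed entries (1's) precede missing entries (0's), and
# rows are sorted by non-increasing number of observed entries.
is_monotone <- function(r) {
  counts <- rowSums(r)
  leading_ones <- apply(r, 1, function(row) {
    k <- sum(row)
    all(row == rep(c(1, 0), times = c(k, length(row) - k)))
  })
  all(leading_ones) && all(diff(counts) <= 0)
}

r <- rbind(c(1, 1, 1),
           c(1, 1, 0),
           c(1, 0, 0))
is_monotone(r)   # TRUE for this example
```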
Table 1: Arrangement of the observed response components under a monotone missing structure, by pattern (Pattern 1, Pattern 2, …).
Let be monotone. Then, for a realized response matrix , the elements in can be arranged as in Table 1. We say that the th observation belongs to pattern for some if for and for . There are possible patterns. For , denote by the number of observations that belong to pattern . Let be the matrix whose th row is and let be the submatrix of formed by the first rows of . Denote by the matrix formed by attaching to the right of . We say that Condition (H1) holds for the pair if
(H1)
Here, returns the rank of a matrix. This condition is crucial for the DA algorithm to be well-defined and implementable.
3.2 The DA algorithm
Fix a realized response and a realized missing structure . Given the current state , the DA algorithm for sampling from draws the next state using the following steps.
1. I step. Draw from the conditional distribution of given . Call the sampled value .
2. P step. Draw from the conditional distribution of given .
This algorithm simulates a Markov chain that is reversible with respect to .
For , let be the number of nonzero entries in the th row of . (Note: if is monotone, and the th observation belongs to some pattern , then .) One can derive that the conditional distribution of given is
(4)
where is defined in Equation (3) for . To ensure that this conditional density is always proper, we assume throughout that
(H2)
The conditional distribution corresponds to independent univariate random variables. For most commonly used mixing distributions , it is not difficult to sample from. For instance, if is a Gamma distribution, is a product of Gamma distributions.
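As a concrete instance, the sketch below carries out the I step for the multivariate t model, in which the mixing distribution is Gamma with both parameters equal to half the degrees of freedom. Given the current regression coefficients and scatter matrix, each weight is drawn from a Gamma distribution whose shape involves the number of observed components and whose rate involves the squared Mahalanobis distance of the observed residual. The variable names are ours, and the sketch is not the package's implementation.

```r
# I step for the multivariate t model (mixing distribution Gamma(nu/2, nu/2)).
# Y is the n x q response with NA for missing entries, X the n x p design,
# B the current p x q coefficient matrix, Sigma the current q x q scatter matrix.
draw_weights_t <- function(Y, X, B, Sigma, nu) {
  n <- nrow(Y)
  mu <- X %*% B                                   # n x q matrix of fitted values
  w <- numeric(n)
  for (i in 1:n) {
    obs <- which(!is.na(Y[i, ]))                  # observed components of row i
    e <- Y[i, obs] - mu[i, obs]                   # observed residual
    delta <- drop(t(e) %*% solve(Sigma[obs, obs, drop = FALSE], e))
    w[i] <- rgamma(1, shape = (nu + length(obs)) / 2,
                      rate  = (nu + delta) / 2)
  }
  w
}
```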
The conditional distribution of given is not always tractable. In fact, since an improper prior is used, this conditional is possibly improper. Liu (1996) provided a method for sampling from this distribution when is monotone. When is monotone, and Condition (H1) holds for , the conditional of given is proper for any . This conditional distribution can be sampled using chi-square and normal distributions. The method is intricate, so we relegate the details to Appendix A.
Let be the missing structure that corresponds to a completely observable response, i.e., all elements of are 1. Denote by the missing parts in the response. Sometimes, we are interested in the posterior distribution of given , which takes the following form:
where is given in Equation (2), and is a realized value of such that and . To sample from , one can add a post hoc imputation step at the end of each DA iteration.
3. Post hoc imputation step. Draw from the conditional distribution of given , where is the sampled value of .
Recall that, given , has a multivariate normal distribution. Then the conditional distribution of given is also (univariate or multivariate) normal. One can show that is a Markov chain whose stationary distribution is the posterior distribution of given .
We call the imputation step post hoc because it can be implemented at the end of the whole simulation, as long as the value of is recorded in the I step of each iteration. Post hoc imputation is possible because the I and P steps do not rely on the value of . Naturally, post hoc imputation does not affect the convergence properties of the Markov chain. Indeed, standard arguments (Roberts and Rosenthal, 2001) show that is geometrically ergodic if and only if is geometrically ergodic. Moreover, the convergence rates of the two chains, as defined in Section 4.1, are the same. Thus, when studying the convergence properties of the DA algorithm, we can restrict our attention to instead of even if there is post hoc imputation.
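Since, given the weights, each response vector is conditionally multivariate normal, the post hoc imputation step reduces to the usual conditional-normal calculation. The sketch below performs it for a single observation; the function and variable names are ours.

```r
# Impute the missing components of one response vector y_i, given its fitted
# mean mu_i = t(B) %*% x_i, the scatter matrix Sigma, and the weight w_i.
# Given w_i, y_i ~ N(mu_i, Sigma / w_i), so the missing block is drawn from
# its conditional normal distribution given the observed block.
impute_row <- function(y_i, mu_i, Sigma, w_i) {
  mis <- which(is.na(y_i)); obs <- which(!is.na(y_i))
  if (length(mis) == 0) return(y_i)
  V <- Sigma / w_i
  reg <- V[mis, obs, drop = FALSE] %*% solve(V[obs, obs, drop = FALSE])  # regression of missing on observed
  cond_mean <- mu_i[mis] + reg %*% (y_i[obs] - mu_i[obs])
  cond_var  <- V[mis, mis, drop = FALSE] - reg %*% V[obs, mis, drop = FALSE]
  y_i[mis] <- drop(cond_mean + t(chol(cond_var)) %*% rnorm(length(mis)))
  y_i
}
```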
3.3 The DAI algorithm
For two realized missing structures and , write if (i) and (ii) whenever . That is, if the observed response entries under structure form a proper subset of those under . If , then is a missing structure matrix that gives the entries that are missing under , but not under . In other words, when , is observed under , but not under .
As usual, fix a realized response and missing structure . Suppose that is not monotone, so that the DA algorithm cannot be implemented efficiently. The DAI algorithm, also proposed in Liu (1996), overcomes non-monotonicity by imputing the value of in the I step, where is a monotone realized missing structure, chosen by the user, such that . We now describe the details of this algorithm.
The DAI algorithm associated with the triplet simulates a Markov chain that is reversible with respect to . Given the current state , the DAI algorithm draws the next state using the following steps.
1. I1 step. Draw from the conditional distribution of given , whose exact form is given in Equation (4). Call the sampled value .
2. I2 step. Draw from the conditional distribution of given . Call the sampled value .
3. P step. Draw from the conditional distribution of given .
Recall that, given , has a multivariate normal distribution. Then in the I2 step, the conditional distribution of given is (univariate or multivariate) normal. In the P step, one is in fact sampling from the conditional distribution of given , where is a realized value of such that and . To ensure that this step can be implemented efficiently, we need two conditions: (i) is monotone as we have assumed, and (ii) Condition (H1) holds for . If these two conditions hold, then one can use methods in Appendix A to implement the P step. Of course, in each step of the DAI algorithm, changes. Despite this, due to the following result, which is easy to verify, we do not have to check (ii) in every step.
Proposition 3.
The DAI algorithm can be used to impute the missing values of . Indeed, is a Markov chain whose stationary distribution is the posterior distribution of given . This is similar to the DA algorithm with post hoc imputation described in Section 3.2, especially if all the entries of are 1. Compared to the DA algorithm with post hoc imputation, the DAI algorithm can be implemented even when is not monotone. However, the intermediate imputation step I2 cannot be performed after the whole simulation process since the P step of DAI relies on the value of .
Standard arguments show that and have the same convergence rate in terms of total variation and distances (Roberts and Rosenthal, 2001; Liu et al., 1994). Thus, when studying the convergence properties of the DAI algorithm, we can restrict our attention to instead of even if we care about imputing missing data.
4 Convergence Analysis
4.1 Preliminaries
We start by reviewing some general concepts regarding the convergence properties of Markov chains. Let be a measurable space. Consider a Markov chain whose state space is , and let be its transition kernel. For a signed measure and a measurable function on , let
and can act on and as follows:
assuming that the integrals are well-defined. Suppose that the chain has a stationary distribution , i.e., . The chain is said to be reversible if, for ,
For two probability measures and on , denote as the total variation (TV) distance between the two probability measures. For , let be the -step transition kernel of the chain, so that , and for any probability measure on . A -irreducible aperiodic Markov chain with stationary distribution is Harris ergodic if and only if , for all (Meyn and Tweedie, 2005; Roberts and Rosenthal, 2006). The chain is said to be geometrically ergodic if the chain is Harris ergodic and
(5)
for some and . As mentioned in the Introduction, Harris ergodicity guarantees a law of large numbers for ergodic averages (Meyn and Tweedie, 2005, Theorem 17.1.7), and geometric ergodicity guarantees a central limit theorem for ergodic averages (Jones and Hobert, 2001; Jones, 2004; Flegal et al., 2008).
Another commonly used distance between probability measures is the distance. Let be the set of measurable functions such that
The distance between two probability measures and on is
Denote by the set of probability measures on such that is absolutely continuous with respect to and that , the density of with respect to , is in . We say the chain is geometrically ergodic if
(6)
for some and . In some cases, e.g., when the state space is finite, the s in Equations (5) and (6) are interchangeable. For reversible Markov chains, geometric ergodicity implies geometric ergodicity (Roberts and Rosenthal, 1997, Theorem 2.1 and Remark 2.3).
The infimum of such that Equation (6) holds for some is called the convergence rate of the chain. The smaller this rate is, the faster the convergence.
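For concreteness, in standard notation (the symbols below are chosen here for illustration and are not tied to the displays above), the total variation condition in Equation (5) and the L2 condition in Equation (6) take the following forms.

```latex
% Standard forms of the geometric ergodicity conditions in Equations (5) and
% (6); K denotes the transition kernel, \Pi the stationary distribution, and
% M and \rho are the constants referred to in the text (notation chosen here).
\[
  \|K^n(x,\cdot) - \Pi\|_{\mathrm{TV}} \le M(x)\,\rho^n
  \quad \text{for all } x, \text{ some } M(\cdot) \ge 0 \text{ and } \rho \in [0,1),
\]
\[
  \|\nu K^n - \Pi\|_{L^2(\Pi)} \le M(\nu)\,\rho^n
  \quad \text{for every } \nu \text{ whose density with respect to } \Pi \text{ is in } L^2(\Pi),
\]
% with M(\nu) < \infty and \rho \in [0,1).
```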
4.2 Geometric ergodicity of the DA algorithm
To state our result, we define three classes of mixing distributions based on their behaviors near the origin. These classes were first examined by Hobert et al. (2018) who analyzed the DA algorithm when the response matrix is fully observed. We say that the mixing distribution is zero near the origin if there exists such that. Assume now that admits a density function with respect to the Lebesgue measure. If there exists , such that , we say that is polynomial near the origin with power . The mixing distribution is faster than polynomial near the origin if, for all , there exists such that the ratio is strictly increasing on .
Most commonly used mixing distributions fall into one of these three classes. Examples will be given after we state the main result of this section.
Again, fix an observed missing structure and observed response . Let be the number of nonzero elements in the th row of . Recall that is the number of observations, is the number of predictors, is the dimension of the responses, and is a parameter in the prior distribution Equation (1).
Theorem 4.
Consider the DA algorithm targeting , as described in Section 3.2. Suppose that Condition (H2) holds, and that the conditional distribution of given is proper for every . If any one of the following conditions holds, then the posterior is proper and the underlying Markov chain is geometrically ergodic.
1. is zero near the origin;
2. is faster than polynomial near the origin; or
3. is polynomial near the origin with power , where
Remark 5.
Recall that when is monotone, the conditional distribution of given is proper for every if Condition (H1) holds for .
Remark 6.
The proof of Theorem 4 is in Appendix B. In what follows, we list commonly used mixing distributions that fall into the three categories in Theorem 4. We also check whether each mixing distribution satisfies Condition (H2).
Zero near the origin
When the mixing distribution is discrete with finite support, it is zero near the origin. This is the case when the errors follow a finite mixture of Gaussians. Obviously, Condition (H2) holds in this case.
The Pareto distribution has density , where . It is zero near the origin as the support is . Condition (H2) holds if .
Faster than polynomial near the origin
A generalized inverse Gaussian distribution with density
where , is faster than polynomial near the origin. Condition (H2) holds for any GIG distribution. When the mixing distribution is GIG, the distribution of the error is called Generalized Hyperbolic (Barndorff-Nielsen et al., 1982).
The density of an inverse gamma distribution is , where . is faster than polynomial near the origin. Condition (H2) is satisfied if .
For and , the distribution has density
This distribution is faster than polynomial near the origin and Condition (H2) holds.
A Fréchet distribution with the shape and scale is given by
It is faster than polynomial near the origin. Moreover, Condition (H2) holds whenever .
Non-zero near the origin and polynomial near the origin
A distribution has density , where and . is polynomial near the origin with power . The power if
where is given in Theorem 4. Condition (H2) always holds for Gamma distributions. In particular, when , the error has a multivariate t distribution with degrees of freedom , and if
The has density , where and . When , the error is called multivariate slash distribution (Lange and Sinsheimer, 1993; Rogers and Tukey, 1972). is polynomial near the origin with power . Condition (H2) always holds for Beta distributions.
The distribution has density , where and . is polynomial near the origin with power . Condition (H2) always holds for Weibull distributions.
The distribution has density , where and . is polynomial near the origin with power . The power if
Condition (H2) is satisfied if .
4.3 Harris ergodicity of the DAI algorithm
Now we give sufficient conditions for the DAI algorithm to be Harris ergodic.
Theorem 7.
Assume that Condition (H2) holds. Let be a missing structure. Suppose that there is a missing structure and a realized value of denoted by such that , and that , the conditional density of given , is proper. Then, for Lebesgue almost every in the range of , the posterior density , where satisfies and , is proper, and any DAI chain targeting this posterior is Harris recurrent.
Proof.
The posterior density can be regarded as the posterior density of given when the prior density is . Since is assumed to be proper, this posterior is proper for almost every possible value of .
By Theorem 4, a posterior density is proper if each of the following conditions holds:
- is monotone, and Condition (H1) holds for ;
- satisfies Condition (H2); and
- is either zero near the origin, or faster than polynomial near the origin, or polynomial near the origin with power , where
and for , is the number of nonzero entries in the th row of .
By Theorem 7, under Condition (H2), to ensure Harris ergodicity of a DAI algorithm targeting in an almost sure sense, one only needs to find a missing structure that satisfies the conditions above. In theory, when such a exists, it is still possible that the posterior is improper, although this can only happen on a set of measure zero. See Fernández and Steel (1999) for a detailed discussion of this subtlety.
4.4 Comparison between the DA and DAI algorithms
Again, fix a missing structure and a realized response , and let be the number of nonzero elements in the th row of . Assume that there is at least one missing entry, i.e., for some . Assume also that is proper. In principle, one can either use the DA algorithm (with or without post hoc imputation) or the DAI algorithm associated with where to sample from the posterior. In this subsection, we compare the two algorithms in terms of their convergence rates. This comparison is important when is monotone, and Conditions (H1) and (H2) hold. In this case, both algorithms can be efficiently implemented.
Let and be two random elements, defined on the measurable spaces and respectively. Denote by the marginal distribution of . A generic data augmentation algorithm for sampling from simulates a Markov chain that is reversible with respect to . Given the current state , the next state is generated through the following procedure.
1. I step. Draw from the conditional distribution of given . Call the sampled value .
2. P step. Draw from the conditional distribution of given .
The DA and DAI algorithms for Bayesian robust linear regression are special cases of the above method. Indeed, let , , and be three random elements such that the joint distribution of , , and is the conditional joint distribution of , , and given . Then, taking and in the generic algorithm yields the DA algorithm; taking and yields the DAI algorithm associated with .
The convergence rate of the generic data augmentation chain is precisely the squared maximal correlation between and (Liu et al., 1994). The maximal correlation between and is
where corr means linear correlation, and the supremum is taken over real functions and such that the variances of and are finite. Evidently,
We then have the following result.
Theorem 8.
Suppose that is proper, and that the conditional distributions in the DA and DAI algorithms are well-defined. Then, the convergence rate of the DA chain targeting is at least as small as that of any DAI chain targeting .
Recall that a smaller convergence rate means faster convergence. Thus, when computation time is not considered and the observed missing structure is monotone, the DA algorithm converges at least as fast as the DAI algorithm. In this case, imputation of missing data, if needed, should be performed in a post hoc rather than intermediate manner. In Section 5 we use numerical experiments to show that this appears to be the case even after computation cost is taken into account.
5 Numerical Experiment
We compare the performance of the DA and DAI algorithms using simulated data. All simulations are implemented through the Bayesianrobust R package.
Suppose that we have observations in a study. Assume that for each observation, the response has components, while the predictor has the form where . We generate the s using independent normal distributions. The response matrix is generated according to the robust linear regression model, with the mixing distribution being . The simulated data set is fixed throughout the section.
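For readers who want to generate a data set of this general shape, the sketch below simulates predictors, mixing weights, and responses from the robust linear regression model and then blanks out some response entries in a monotone-style pattern. The sample size, dimensions, coefficient values, degrees of freedom, and missing pattern used here are placeholders, not the exact settings of our experiment.

```r
# Simulate data from the robust linear regression model with t-type errors and
# then introduce a simple monotone-style missing pattern.  All constants below
# (n, p, q, nu, B, Sigma, and the set of incomplete rows) are placeholder choices.
set.seed(42)
n <- 50; p <- 2; q <- 2; nu <- 4
X <- cbind(1, rnorm(n))                               # intercept plus one normal predictor
B <- matrix(c(1, -0.5, 0.3, 2), p, q)                 # placeholder coefficients
Sigma <- matrix(c(1, 0.4, 0.4, 1), q, q)              # placeholder scatter matrix
w <- rgamma(n, shape = nu / 2, rate = nu / 2)         # mixing weights
E <- (matrix(rnorm(n * q), n, q) %*% chol(Sigma)) / sqrt(w)
Y <- X %*% B + E                                      # complete n x q response matrix
Y_obs <- Y
Y_obs[41:n, 1] <- NA                                  # last rows have only the second entry observed
```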
On the modeling side, we consider an independence Jeffreys prior with and (see Equation (1)) and three mixing distributions:
- G: The mixing distribution is Gamma. The error follows a multivariate t distribution with 4 degrees of freedom. By Equation (4), in the I step of the DA algorithm, one draws independent Gamma random variables.
- GIG: The mixing distribution is GIG. The error is generalized hyperbolic. By Equation (4), in the I step of the DA algorithm, one draws independent generalized inverse Gaussian random variables.
- P: The mixing distribution is the point mass at 1. The error follows a multivariate normal distribution. In this case, the posterior can be sampled exactly by the DA algorithm.
We study three realized missing structures. Under these missing structures, the response matrix has, respectively, 45, 40, and 35 rows fully observed. The other rows all have only the second entry observed. It is clear that all three missing structures are monotone.
In total, we consider nine combinations of mixing distributions and missing structures. We apply both the DA and DAI algorithms in each scenario. In the I2 step of the DAI algorithm (see Section 3.3), is taken to be the matrix whose entries are all 1, i.e., corresponds to the case that the response matrix is fully observed. At the end of each iteration of the DA algorithm, a post hoc imputation step is performed, so both algorithms impute all missing response entries.
Consider estimating the regression coefficients and scatter matrix using the posterior mean computed via the DA or DAI algorithm. We compare the efficiency of the two algorithms based on the effective sample size (ESS) of each component separately and all components jointly. At a given simulation length , the ESS of an MCMC estimator is defined to be times the posterior variance divided by the asymptotic variance of the estimator (Vats et al., 2019). To account for computation cost, we also consider the ESS per minute, , where is the number of minutes needed to run iterations of the algorithm. The simulation length is set to be without burn-in, and the initial values are the ordinary least squares estimates using the observations that belong to pattern , i.e., the observations without missing elements.
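Effective sample sizes of this kind can be computed with the mcmcse R package, which implements the multivariate methodology of Vats et al. (2019). The sketch below assumes the MCMC output has been collected into a matrix with one column per component; the object names and the timing value are illustrative.

```r
# Compute componentwise and joint effective sample sizes for an MCMC output
# matrix 'chain' (one row per iteration, one column per component).
library(mcmcse)
chain <- matrix(rnorm(30000 * 7), ncol = 7)   # stand-in for real MCMC output
ess_each  <- ess(chain)                       # ESS of each component separately
ess_joint <- multiESS(chain)                  # ESS of all components jointly
minutes   <- 2.5                              # wall-clock minutes used by the run (illustrative)
ess_pm    <- ess_joint / minutes              # ESS per minute
```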
[Figure 1: ESS and ESSpm of all components jointly for each combination of mixing distribution, missing structure, and MCMC algorithm.]
Figure 1 gives the ESS and ESSpm of all components jointly for each of the combinations of mixing distributions, missing structures, and MCMC algorithms. The ESS and ESSpm of each component separately are included in Tables 2 and 3 in Appendix C. We see that the DA algorithm gives a larger ESS compared to the DAI algorithm. For the DAI algorithm, the ESS is lower when there are more missing data. Similar trends appear when we consider the ESSpm. In short, imputation of missing response values slows an algorithm down. This is consistent with our theoretical results. (Strictly speaking, our theoretical results concern convergence rate, not effective sample size, but for data augmentation chains, which are reversible and positive, these two concepts are closely related. See, e.g., Rosenthal (2003), Section 3.)
In addition to the experiment above, we consider another series of mixing distributions, namely with . The resultant error distributions are multivariate t distributions with the corresponding degrees of freedom. Applying the DA and DAI algorithms to these models, we obtain Figure 2, which shows the ESS and ESSpm of all components jointly. The ESS and ESSpm of each component separately are in Tables 4 and 5 in Appendix C. Our simulation shows that when the model assumes that the error distribution has a lighter tail, the MCMC algorithms tend to be more efficient.
[Figure 2: ESS and ESSpm of all components jointly under the multivariate t error models.]
6 Conclusion
We conducted a convergence analysis of data augmentation algorithms used in Bayesian robust multivariate linear regression with incomplete data. The algorithms were first proposed by Liu (1996), but little was previously known about their theoretical properties when the response matrix contains missing values. We consider two versions of the algorithm, DA and DAI. The DA algorithm can only be applied when the missing structure is monotone, whereas the DAI algorithm can be implemented for an arbitrary missing structure. We establish geometric ergodicity of the DA algorithm under simple conditions. For the DAI algorithm, we give conditions for Harris ergodicity. We also compare the convergence rates of the DA and DAI algorithms: the convergence rate of the DA algorithm is at least as small as that of a corresponding DAI algorithm. A numerical study is provided. Under monotone missing structures, the DA algorithm outperforms the DAI algorithm both computationally and theoretically.
Acknowledgments
The first and second authors were supported by the National Science Foundation [grant number DMS-2112887], and the third author was supported by the National Science Foundation [grant number DMS-2152746].
Appendix
Appendix A Details of the DA algorithm
The DA algorithm consists of two steps, the I step and the P step. The I step is shown in Section 3.2. In this section, we describe the P step when the missing structure is monotone and Condition (H1) holds.
Recall from Section 3.1 that the monotone missing structure has possible patterns. For , is the number of observations that belong to pattern . and are submatrices of and respectively defined in Section 3.1. Define an diagonal matrix . That is, the first diagonal entries of are , listed in order, and all the other entries are 0. Similarly, let . Denote as the weighted least squares estimates of the regression coefficient of on with weight , and as the corresponding weighted total sum of residual squares and cross-products matrix, i.e.,
Let be the lower right submatrix of the positive semi-definite matrix given in the prior (Equation (1)).
P step. Draw from the conditional distribution of given using the following procedure.
1. Draw given :
For , let , and let be the lower triangular Cholesky factor of (so that ). Draw a sequence of random vectors such that is dimensional, and that
- (a) for ,
- (b) for , where ,
- (c) are independent .
Denote the sampled values by . Let for , and let be a lower triangular matrix with its lower triangular non-zero part formed by columns . Then serves as a sampled value of .
2. Draw given :
Independently, draw -dimensional standard normal random vectors , and call the sampled values . For , let be the lower triangular Cholesky factor of . Then serves as a sampled value of .
Let us quickly consider a special case. Recall that is the missing structure that corresponds to a completely observable response, i.e., all elements of are 1. Then , , and is monotone. Let be the matrix obtained by attaching to the right of . Then Condition (H1) for is equivalent to
(H1.0)
Note that if there is some monotone such that Condition (H1) holds for , then Condition (H1.0) necessarily holds. Under Condition (H1.0), the conditional distribution of given is proper and rather simple. We say for , , and if is a random matrix associated with the probability density function
is called a matrix normal distribution. We say for and if is a random matrix associated with the probability density function
where returns the trace of a matrix, and is a multivariate gamma function. is called an inverse Wishart distribution, and it is well-known that a random matrix follows the distribution if and only if its inverse follows the Wishart distribution , which has density
For , let , i.e., the diagonal matrix whose diagonal elements are , listed in order. Let
and . Then it is easy to see from Equation (2) that
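As an illustration of this special case, the R sketch below performs the P step when the response is fully observed, under the independence Jeffreys prior (an assumption of this sketch; other choices of the prior parameters change the degrees of freedom and scale). Given the weights, it draws the scatter matrix from an inverse Wishart distribution and then the coefficient matrix from a matrix normal distribution. The function and variable names are ours.

```r
# P step in the fully observed case under the independence Jeffreys prior
# (an assumption of this sketch).  Y is n x q, X is n x p, and w is the
# length-n vector of current mixing weights.
draw_B_Sigma <- function(Y, X, w) {
  n <- nrow(Y); p <- ncol(X); q <- ncol(Y)
  XtW   <- t(X * w)                                    # t(X) %*% diag(w), computed row-wise
  XtWX  <- XtW %*% X
  B_hat <- solve(XtWX, XtW %*% Y)                      # weighted least squares fit
  R     <- Y - X %*% B_hat
  S     <- t(R * w) %*% R                              # weighted residual SSCP matrix
  # Sigma^{-1} | w, y ~ Wishart_q(n - p, S^{-1}), i.e., Sigma ~ inverse Wishart
  Sigma <- solve(rWishart(1, df = n - p, Sigma = solve(S))[, , 1])
  # B | Sigma, w, y ~ matrix normal with mean B_hat, row covariance (X'WX)^{-1},
  # and column covariance Sigma
  Lr <- t(chol(solve(XtWX)))                           # p x p lower factor
  Lc <- t(chol(Sigma))                                 # q x q lower factor
  B  <- B_hat + Lr %*% matrix(rnorm(p * q), p, q) %*% t(Lc)
  list(B = B, Sigma = Sigma)
}
```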
Appendix B Proof of Theorem 4
We prove posterior propriety and geometric ergodicity by establishing drift and minorization (d&m) conditions, which we now describe. Let be a measurable space. Consider a Markov chain whose state space is , and let be its transition kernel. We say that satisfies a d&m condition with drift function , minorization measure , and parameters if each of the following two conditions holds:
- Drift condition: For ,
where and . Note that can be interpreted as the conditional expectation of given , where can be any non-negative integer.
- Minorization condition: Let be a probability measure, , and . Moreover, whenever ,
It is well-known that if a d&m condition holds, then the Markov chain has a proper stationary distribution, and it is geometrically ergodic (Rosenthal, 1995). See also Jones and Hobert (2001), Roberts and Rosenthal (2004), and Meyn and Tweedie (2005).
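In standard notation (chosen here; cf. Rosenthal, 1995), with K the transition kernel and V the drift function, a drift and minorization condition reads as follows.

```latex
% Standard form of a drift and minorization condition; K is the transition
% kernel, V the drift function, and the constants play the roles of the
% parameters in the definition above (notation chosen for this illustration).
\begin{align*}
  (KV)(x) = \mathrm{E}\bigl[V(X_{1}) \mid X_{0} = x\bigr]
    &\le \lambda V(x) + L
    && \text{for all } x, \text{ with } \lambda < 1 \text{ and } L < \infty, \\
  K(x, \cdot) &\ge \epsilon\, \nu(\cdot)
    && \text{whenever } V(x) \le d,
\end{align*}
% where \nu is a probability measure, \epsilon > 0, and d > 2L/(1 - \lambda).
```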
We begin by establishing a minorization condition for the DA algorithm associated with a realized response and missing structure . As before, will be used to denote the number of nonzero elements in the th row of . Let be the underlying Markov chain, and denote its Markov transition kernel by . Set , and let be the usual Borel σ-algebra associated with . Define a drift function
where, for ,
Then the following holds.
Lemma 9.
For any , there exist a probability measure and such that whenever ,
Proof.
One can write
where is the conditional distribution of given , and is the conditional distribution of given , both derived from Equation (2). As stated in Section 3.2,
Assume that for some . Then for each . It follows that, for any measurable ,
where
Thus, for each ,
where is a probability measure given by
∎
To establish d&m, it remains to show the following.
Lemma 10.
Suppose that Condition (H2) holds, and that the conditional distribution of given is proper for every . If any one of the following conditions holds, then there exist and such that, for ,
1. is zero near the origin;
2. is faster than polynomial near the origin; or
3. is polynomial near the origin with power , where
Proof.
We will prove the result when contains at least one vanishing element. When there are no missing data under , the result is proved by Hobert et al. (2018). (To be absolutely precise, Hobert et al. assumed that is absolutely continuous with respect to the Lebesgue measure in the “zero near the origin” case, but their proof can be easily extended to the case where absolute continuity is absent.)
For and , let be the conditional distribution of given . For , let be the conditional distribution of given . Then
(7)
Let be a matrix such that all its entries are 1. In other words, if there are no data missing. denotes the unobservable entries of under the missing structure . By our assumptions, given , has a proper conditional distribution. For , denote by the conditional distribution of given . For and a realized value of , say, , let be the conditional distribution of given . We now describe this distribution (see Appendix A). Let be a realized value of such that . Let , where are the components of . Let
and . Then is given by
It is easy to verify that , , and are connected through the following tower property:
(8)
In light of Equations (7) and (8), let us first investigate the integral
(9)
where is the distribution, and is the distribution. It follows from basic properties of matrix normal distributions that
By Theorem 3.2.11 of Muirhead (2009), if is a random matrix that follows the distribution, then follows the distribution. Then by basic properties of Wishart distributions,
(10)
For , denote by the th row of , and note that . It follows from Lemma 11, which is stated right after this proof, that, for ,
and
Lemma 11.
Let be a positive definite matrix and be a vector such that the matrix is positive semidefinite. Then .
Lemma 12.
Suppose that Condition (H2) holds and that is either zero near the origin, or faster than polynomial near the origin, or polynomial near the origin with power , where
Then there exist and , such that for all and ,
Proof.
We will make use of results in Hobert et al. (2018).
When is zero near the origin, there exists , such that . Then, for all and ,
When is polynomial near the origin or faster than polynomial near the origin, recall that is the density function of with respect to the Lebesgue measure. Let be the set of such that there exists , satisfying
for all . If is non-empty, let ; otherwise, set .
Consider a Gamma mixing distribution with density
It is easy to see that if , for all ,
Therefore, if , .
Let be a mixing density proportional to . Note that
Therefore, it suffices to prove that .
When is polynomial near the origin with power , the distribution associated with is polynomial near the origin with power . Let be a mixing density following Gamma . We have
By Lemma 1 in Hobert et al. (2018), for all ,
When is faster than polynomial near the origin, again let be a mixing density proportional to . The distribution associated with is faster than polynomial near the origin. Let . Recall that is the density of the Gamma mixing distribution. By the definition of faster than polynomial near the origin, there exists , such that is strictly increasing on . Thus, is strictly increasing on . By Lemma 2 in Hobert et al. (2018),
Since can be any value larger than , , for all .
∎
Appendix C The ESS and ESSpm of Each Component in Section 5
Table 2: ESS of each component for the models and missing structures in Section 5.

Model | Data | |||||||
---|---|---|---|---|---|---|---|---|
DAG | 45 | 19332 | 17450 | 19098 | 16447 | 10290 | 11646 | 10528 |
DAG | 40 | 20001 | 18399 | 17549 | 19992 | 10315 | 12061 | 10714 |
DAG | 35 | 16960 | 18459 | 16786 | 16967 | 11605 | 12442 | 11551 |
DAIG | 45 | 16471 | 19242 | 17976 | 19443 | 9970 | 10723 | 9888 |
DAIG | 40 | 14292 | 15528 | 18045 | 17066 | 8335 | 11155 | 9897 |
DAIG | 35 | 13330 | 17612 | 17745 | 16203 | 9676 | 8988 | 10064 |
DAGIG | 45 | 16843 | 17429 | 14810 | 15401 | 13796 | 13232 | 13422 |
DAGIG | 40 | 20661 | 20371 | 21888 | 14722 | 13375 | 14279 | 12765 |
DAGIG | 35 | 17729 | 15738 | 17347 | 15574 | 15619 | 12788 | 12722 |
DAIGIG | 45 | 18267 | 15193 | 17571 | 15148 | 11803 | 12759 | 13323 |
DAIGIG | 40 | 16540 | 18311 | 16763 | 17924 | 14042 | 14860 | 14749 |
DAIGIG | 35 | 12246 | 14077 | 18134 | 15717 | 8973 | 9506 | 12939 |
DAP | 45 | 30000 | 30000 | 30000 | 30000 | 30000 | 29724 | 30000 |
DAP | 40 | 30000 | 30000 | 30000 | 30000 | 30000 | 30000 | 30438 |
DAP | 35 | 30000 | 30000 | 29216 | 30000 | 30000 | 30000 | 30000 |
DAIP | 45 | 22822 | 26402 | 30000 | 30000 | 25435 | 29601 | 30000 |
DAIP | 40 | 20145 | 23817 | 30000 | 30000 | 20160 | 26305 | 30000 |
DAIP | 35 | 13894 | 22094 | 30000 | 30000 | 16530 | 13943 | 30000 |
Table 3: ESSpm of each component for the models and missing structures in Section 5.

Model | Data | |||||||
---|---|---|---|---|---|---|---|---|
DAG | 45 | 4421 | 3990 | 4367 | 3761 | 2353 | 2663 | 2407 |
DAG | 40 | 4886 | 4495 | 4287 | 4884 | 2520 | 2947 | 2618 |
DAG | 35 | 4213 | 4585 | 4170 | 4215 | 2883 | 3090 | 2869 |
DAIG | 45 | 3286 | 3838 | 3586 | 3878 | 1989 | 2139 | 1972 |
DAIG | 40 | 2891 | 3141 | 3651 | 3452 | 1686 | 2257 | 2002 |
DAIG | 35 | 2850 | 3766 | 3794 | 3464 | 2069 | 1922 | 2152 |
DAGIG | 45 | 1611 | 1667 | 1417 | 1473 | 1320 | 1266 | 1284 |
DAGIG | 40 | 2072 | 2043 | 2195 | 1477 | 1341 | 1432 | 1280 |
DAGIG | 35 | 1803 | 1601 | 1765 | 1584 | 1589 | 1301 | 1294 |
DAIGIG | 45 | 1701 | 1415 | 1636 | 1410 | 1099 | 1188 | 1240 |
DAIGIG | 40 | 1543 | 1709 | 1564 | 1672 | 1310 | 1387 | 1376 |
DAIGIG | 35 | 1201 | 1381 | 1779 | 1542 | 880 | 932 | 1269 |
DAP | 45 | 6959 | 6959 | 6959 | 6959 | 6959 | 6895 | 6959 |
DAP | 40 | 7793 | 7793 | 7793 | 7793 | 7793 | 7793 | 7907 |
DAP | 35 | 7478 | 7478 | 7282 | 7478 | 7478 | 7478 | 7478 |
DAIP | 45 | 4929 | 5702 | 6479 | 6479 | 5493 | 6393 | 6479 |
DAIP | 40 | 4415 | 5219 | 6574 | 6574 | 4418 | 5765 | 6574 |
DAIP | 35 | 3183 | 5062 | 6873 | 6873 | 3787 | 3194 | 6873 |
Table 4: ESS of each component under the multivariate t error models.

Model | Data | |||||||
---|---|---|---|---|---|---|---|---|
DAG2 | 45 | 19332 | 17450 | 19098 | 16447 | 10290 | 11646 | 10528 |
DAG2 | 40 | 20001 | 18399 | 17549 | 19992 | 10315 | 12061 | 10714 |
DAG2 | 35 | 16960 | 18459 | 16786 | 16967 | 11605 | 12442 | 11551 |
DAIG2 | 45 | 16471 | 19242 | 17976 | 19443 | 9970 | 10723 | 9888 |
DAIG2 | 40 | 14292 | 15528 | 18045 | 17066 | 8335 | 11155 | 9897 |
DAIG2 | 35 | 13330 | 17612 | 17745 | 16203 | 9676 | 8988 | 10064 |
DAG8 | 45 | 26359 | 23824 | 25032 | 25504 | 16537 | 19326 | 16185 |
DAG8 | 40 | 26122 | 26707 | 25362 | 26588 | 14952 | 19118 | 15616 |
DAG8 | 35 | 25470 | 24347 | 24819 | 25501 | 18445 | 16390 | 16759 |
DAIG8 | 45 | 20631 | 23024 | 25552 | 26183 | 13685 | 15991 | 15078 |
DAIG8 | 40 | 18122 | 21684 | 23664 | 25824 | 13061 | 18082 | 16806 |
DAIG8 | 35 | 14837 | 22029 | 24967 | 26256 | 11553 | 11712 | 15336 |
DAG25 | 45 | 29041 | 29055 | 27919 | 28865 | 21494 | 22572 | 23345 |
DAG25 | 40 | 28018 | 29302 | 28548 | 29081 | 23303 | 21042 | 23013 |
DAG25 | 35 | 29310 | 29162 | 29052 | 30000 | 18884 | 16623 | 20118 |
DAIG25 | 45 | 23968 | 24454 | 27999 | 29108 | 17254 | 20737 | 22495 |
DAIG25 | 40 | 18613 | 23362 | 28295 | 28963 | 15049 | 18725 | 21078 |
DAIG25 | 35 | 17014 | 20338 | 28073 | 29150 | 10450 | 9397 | 22567 |
Table 5: ESSpm of each component under the multivariate t error models.

Model | Data | |||||||
---|---|---|---|---|---|---|---|---|
DAG2 | 45 | 4421 | 3990 | 4367 | 3761 | 2353 | 2663 | 2407 |
DAG2 | 40 | 4886 | 4495 | 4287 | 4884 | 2520 | 2947 | 2618 |
DAG2 | 35 | 4213 | 4585 | 4170 | 4215 | 2883 | 3090 | 2869 |
DAIG2 | 45 | 3286 | 3838 | 3586 | 3878 | 1989 | 2139 | 1972 |
DAIG2 | 40 | 2891 | 3141 | 3651 | 3452 | 1686 | 2257 | 2002 |
DAIG2 | 35 | 2850 | 3766 | 3794 | 3464 | 2069 | 1922 | 2152 |
DAG8 | 45 | 6129 | 5539 | 5820 | 5930 | 3845 | 4494 | 3763 |
DAG8 | 40 | 6495 | 6641 | 6306 | 6611 | 3718 | 4754 | 3883 |
DAG8 | 35 | 6632 | 6339 | 6462 | 6640 | 4803 | 4268 | 4364 |
DAIG8 | 45 | 4095 | 4570 | 5072 | 5197 | 2717 | 3174 | 2993 |
DAIG8 | 40 | 3894 | 4659 | 5085 | 5549 | 2806 | 3885 | 3611 |
DAIG8 | 35 | 3265 | 4847 | 5494 | 5777 | 2542 | 2577 | 3375 |
DAG25 | 45 | 7115 | 7118 | 6840 | 7072 | 5266 | 5530 | 5719 |
DAG25 | 40 | 7097 | 7423 | 7232 | 7367 | 5903 | 5330 | 5830 |
DAG25 | 35 | 7728 | 7689 | 7660 | 7910 | 4979 | 4383 | 5304 |
DAIG25 | 45 | 5067 | 5170 | 5919 | 6154 | 3648 | 4384 | 4755 |
DAIG25 | 40 | 4099 | 5144 | 6231 | 6378 | 3314 | 4123 | 4642 |
DAIG25 | 35 | 3898 | 4660 | 6432 | 6679 | 2394 | 2153 | 5170 |
References
- Barndorff-Nielsen et al. [1982] Ole Barndorff-Nielsen, John Kent, and Michael Sørensen. Normal variance-mean mixtures and z distributions. International Statistical Review/Revue Internationale de Statistique, 50(2):145–159, 1982.
- Chan and Geyer [1994] Kung Sik Chan and Charles J Geyer. Discussion: Markov chains for exploring posterior distributions. The Annals of Statistics, 22(4):1747–1758, 1994.
- Fernández and Steel [1999] Carmen Fernández and Mark FJ Steel. Multivariate Student-t regression models: Pitfalls and inference. Biometrika, 86(1):153–167, 1999.
- Fernandez and Steel [2000] Carmen Fernandez and Mark FJ Steel. Bayesian regression analysis with scale mixtures of normals. Econometric Theory, 16(1):80–101, 2000.
- Flegal et al. [2008] James M Flegal, Murali Haran, and Galin L Jones. Markov chain Monte Carlo: Can we trust the third significant figure? Statistical Science, 23(2):250–260, 2008.
- Hobert and Casella [1996] James P Hobert and George Casella. The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91(436):1461–1473, 1996.
- Hobert et al. [2018] James P Hobert, Yeun Ji Jung, Kshitij Khare, and Qian Qin. Convergence analysis of MCMC algorithms for Bayesian multivariate linear regression with non-Gaussian errors. Scandinavian Journal of Statistics, 45(3):513–533, 2018.
- Jones [2004] Galin L Jones. On the Markov chain central limit theorem. Probability Surveys, 1:299–320, 2004.
- Jones and Hobert [2001] Galin L Jones and James P Hobert. Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Statistical Science, 16(4):312–334, 2001.
- Jones and Qin [2021] Galin L Jones and Qian Qin. Markov chain Monte Carlo in practice. Annual Review of Statistics and Its Application, 9, 2021.
- Lange and Sinsheimer [1993] Kenneth Lange and Janet S Sinsheimer. Normal/independent distributions and their applications in robust regression. Journal of Computational and Graphical Statistics, 2(2):175–198, 1993.
- Liu [1996] Chuanhai Liu. Bayesian robust multivariate linear regression with incomplete data. Journal of the American Statistical Association, 91(435):1219–1227, 1996.
- Liu et al. [1994] Jun S Liu, Wing Hung Wong, and Augustine Kong. Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika, 81(1):27–40, 1994.
- Meyn and Tweedie [2005] Sean P Meyn and Richard L Tweedie. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2005.
- Muirhead [2009] Robb J Muirhead. Aspects of Multivariate Statistical Theory. John Wiley & Sons, 2009.
- Roberts and Rosenthal [1997] Gareth Roberts and Jeffrey Rosenthal. Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability, 2:13–25, 1997.
- Roberts and Rosenthal [2001] Gareth O Roberts and Jeffrey S Rosenthal. Markov chains and de-initializing processes. Scandinavian Journal of Statistics, 28(3):489–504, 2001.
- Roberts and Rosenthal [2004] Gareth O Roberts and Jeffrey S Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71, 2004.
- Roberts and Rosenthal [2006] Gareth O Roberts and Jeffrey S Rosenthal. Harris recurrence of Metropolis-within-Gibbs and trans-dimensional Markov chains. The Annals of Applied Probability, 16(4):2123–2139, 2006.
- Rogers and Tukey [1972] William H. Rogers and John W. Tukey. Understanding some long‐tailed symmetrical distributions. Statistica Neerlandica, 26(3):211–226, 1972.
- Rosenthal [1995] Jeffrey S Rosenthal. Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association, 90(430):558–566, 1995.
- Rosenthal [2003] Jeffrey S Rosenthal. Asymptotic variance and convergence rates of nearly-periodic Markov chain Monte Carlo algorithms. Journal of the American Statistical Association, 98(461):169–177, 2003.
- Roy and Hobert [2010] Vivekananda Roy and James P Hobert. On Monte Carlo methods for Bayesian multivariate regression models with heavy-tailed errors. Journal of Multivariate Analysis, 101(5):1190–1202, 2010.
- Seaman et al. [2013] Shaun Seaman, John Galati, Dan Jackson, and John Carlin. What is meant by “missing at random”? Statistical Science, 28(2):257–268, 2013.
- Vats et al. [2018] Dootika Vats, James M Flegal, and Galin L Jones. Strong consistency of multivariate spectral variance estimators in Markov chain Monte Carlo. Bernoulli, 24(3):1860–1909, 2018.
- Vats et al. [2019] Dootika Vats, James M Flegal, and Galin L Jones. Multivariate output analysis for Markov chain Monte Carlo. Biometrika, 106(2):321–337, 2019.
- Zellner [1976] Arnold Zellner. Bayesian and non-Bayesian analysis of the regression model with multivariate Student-t error terms. Journal of the American Statistical Association, 71(354):400–405, 1976.