
Convergence Analysis of Data Augmentation Algorithms for Bayesian Robust Multivariate Linear Regression with Incomplete Data

Haoxiang Li, School of Statistics, University of Minnesota
Qian Qin, School of Statistics, University of Minnesota
Galin L. Jones, School of Statistics, University of Minnesota
Abstract

Gaussian mixtures are commonly used for modeling heavy-tailed error distributions in robust linear regression. Combining the likelihood of a multivariate robust linear regression model with a standard improper prior distribution yields an analytically intractable posterior distribution that can be sampled using a data augmentation algorithm. When the response matrix has missing entries, there are unique challenges to implementing the algorithm and analyzing its convergence properties. Conditions for geometric ergodicity are provided when the incomplete data have a “monotone” structure. In the absence of a monotone structure, an intermediate imputation step is necessary for implementing the algorithm. In this case, we provide sufficient conditions for the algorithm to be Harris ergodic. Finally, we show that, when there is a monotone structure and intermediate imputation is unnecessary, intermediate imputation slows the convergence of the underlying Markov chain, while post hoc imputation does not. An R package for the data augmentation algorithms is provided.

1 Introduction

Markov chain Monte Carlo (MCMC) algorithms are frequently used for sampling from intractable probability distributions in Bayesian statistics. We study the convergence properties of a class of MCMC algorithms designed for Bayesian multivariate linear regression with heavy-tailed errors. Contrary to most existing works in the area of MCMC convergence analysis, we focus on scenarios where the data set is incomplete. The existence of missing data substantially complicates the analysis.

Gaussian mixtures are suitable for constructing heavy-tailed error distributions in robust linear regression models (Zellner, 1976; Lange and Sinsheimer, 1993; Fernandez and Steel, 2000). In a Bayesian setting where a simple improper prior is used, the mixture representation facilitates a data augmentation (DA) MCMC algorithm (Liu, 1996) that can be used to sample from the posterior distribution of the regression coefficients and error scatter matrix. When there are no missing data, this algorithm is geometrically ergodic under regularity conditions (Roy and Hobert, 2010; Hobert et al., 2018). Loosely speaking, geometric ergodicity means that the Markov chain converges to the posterior distribution at a geometric, or exponential rate. Geometric ergodicity is important because it guarantees the existence of a central limit theorem for ergodic averages, which is in turn crucial for assessing the accuracy of Monte Carlo estimators (Chan and Geyer, 1994; Jones and Hobert, 2001; Jones, 2004; Flegal et al., 2008; Vats et al., 2018, 2019; Jones and Qin, 2021).

The current work studies the algorithm when the response matrix has missing entries. An incomplete data set brings unique challenges to the implementation and analysis of the DA algorithm. For instance, establishing posterior propriety becomes much more difficult than when there are no missing values. If the posterior measure is not proper, the DA algorithm will produce nonsensical results (Hobert and Casella, 1996). However, if one can show that the algorithm is geometrically ergodic, then the posterior distribution is guaranteed to be proper.

When the missing data have a certain “monotone” structure, the DA algorithm can be carried out without an intermediate step to impute the missing data. In this case, we establish geometric ergodicity under conditions on the error distribution and the number of incomplete response components. Roughly speaking, when the mixing distribution in the Gaussian mixture representation of the error distribution places little mass near the origin and the number of incomplete components is not too large, the DA algorithm is geometrically ergodic. The conditions are satisfied by many mixing distributions, including distributions with finite support, the log-normal, the generalized inverse Gaussian, and the inverse Gamma or Fréchet distributions with shape parameter greater than d/2, where d is the dimension of the response variable. Some Gamma, Weibull, and F distributions also satisfy the conditions. A post hoc imputation step can be added to fill in the missing values, and the convergence properties of the DA algorithm will be unaffected.

When the missing data do not possess a monotone structure, some missing entries need to be imputed to implement the DA algorithm. This results in a data augmentation algorithm with an intermediate (as opposed to post hoc) imputation step, which we call a DAI algorithm for short. We provide sufficient conditions for the DAI algorithm to be Harris ergodic. Harris ergodicity is weaker than geometric ergodicity, but it guarantees posterior propriety as well as the existence of a law of large numbers for ergodic averages (Meyn and Tweedie, 2005, Theorem 17.1.7).

When the missing data have a monotone structure, both the DA (with or without post hoc imputation) and DAI algorithms can be applied. However, we show that the DA algorithm converges in L2L^{2} at least as fast as the DAI algorithm.

Our key strategy is to draw a connection from cases where the data set is incomplete to the standard case where the data set is fully observed. This allows for an analysis of the former using tools built for the latter.

Finally, we provide an R package Bayesianrobust that implements the DA and DAI algorithms. The R package is available from https://github.com/haoxiangliumn/Bayesianrobust. While the algorithms are documented in Liu (1996), we do not know of previous software packages that implement them.

The rest of the paper is organized as follows. Section 2 recounts the Bayesian robust linear regression model with incomplete data. In Section 3, we describe the DA and DAI algorithms. Our main results are in Section 4, where we provide conditions for geometric ergodicity of the DA algorithm and Harris ergodicity of the DAI algorithm. The section also contains a comparison between the DA and DAI algorithms in terms of L^{2} convergence rate. Section 5 presents a numerical experiment.

2 Robust Linear Regression with Incomplete Data

Let (𝒀i,𝒙i),i{1,,n}(\bm{Y}_{i},\bm{x}_{i}),\;i\in\{1,\dots,n\} be nn independent data points, where 𝒙i=(xi,1,,xi,p)\bm{x}_{i}=(x_{i,1},\dots,x_{i,p})^{\top} is a p×1p\times 1 vector of known predictors, and 𝒀i=(Yi,1,,Yi,d)\bm{Y}_{i}=(Y_{i,1},\dots,Y_{i,d})^{\top} is a d×1d\times 1 random vector. Consider the multivariate linear regression model

𝒀i=𝑩𝒙i+𝚺1/2𝜺i,i{1,,n},\bm{Y}_{i}=\bm{B}^{\top}\bm{x}_{i}+\bm{\Sigma}^{1/2}\bm{\varepsilon}_{i},\ i\in\{1,\dots,n\},

where 𝑩\bm{B} is a p×dp\times d matrix of unknown regression coefficients, 𝚺\bm{\Sigma} is a d×dd\times d unknown positive definite scatter matrix, and 𝜺i,i{1,,n}\bm{\varepsilon}_{i},\;i\in\{1,\dots,n\}, are d×1d\times 1 iid random errors. Let 𝒀\bm{Y} be the n×dn\times d response matrix whose iith row is 𝒀i\bm{Y}_{i}^{\top}, and let 𝒙\bm{x} be the n×pn\times p design matrix whose iith row is 𝒙i\bm{x}_{i}^{\top}.

To allow for error distributions with potentially heavy tails, assume that the distribution of each 𝜺i\bm{\varepsilon}_{i} is described by a scale mixture of multivariate normal densities, which takes the form

ferr(ϵ)=0wd/2(2π)d/2exp(w2ϵTϵ)Pmix(dw),ϵd,f_{\text{err}}(\bm{\epsilon})=\int_{0}^{\infty}\frac{w^{d/2}}{(2\pi)^{d/2}}\exp\left(-\frac{w}{2}\bm{\epsilon}^{T}\bm{\epsilon}\right)P_{\scriptsize\mbox{mix}}(\mbox{d}w),\quad\bm{\epsilon}\in\mathbb{R}^{d},

where Pmix()P_{\scriptsize\mbox{mix}}(\cdot) is a probability measure on (0,)(0,\infty) referred to as the mixing distribution. Gaussian mixtures constitute a variety of error distributions, and are widely used for robust regression. For instance, when Pmix()P_{\scriptsize\mbox{mix}}(\cdot) corresponds to the Gamma(v/2,v/2)\text{Gamma}(v/2,v/2) distribution for some v>0v>0, i.e., Pmix()P_{\scriptsize\mbox{mix}}(\cdot) has a density with respect to the Lebesgue measure given by

p_{\scriptsize\mbox{mix}}(w)\propto w^{v/2-1}\exp(-vw/2),\quad w>0,

the errors follow the multivariate tt distribution with vv degrees of freedom, which has density function

ferr(ϵ)(1+ϵϵ/v)(d+v)/2,ϵd.f_{\text{err}}(\bm{\epsilon})\propto(1+\bm{\epsilon}^{\top}\bm{\epsilon}/v)^{-(d+v)/2},\quad\bm{\epsilon}\in\mathbb{R}^{d}.

See Section 4.2 for more examples.
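
To make the mixture representation concrete, here is a small sketch in plain R (written for illustration; it is not taken from the Bayesianrobust package) that draws errors from the multivariate t distribution above by first drawing W_i from the Gamma(v/2, v/2) mixing distribution and then drawing the error as a normal vector scaled by W_i^{-1/2}.

    # Draw n scale-mixture errors: W_i ~ Gamma(v/2, v/2) and
    # eps_i | W_i ~ N_d(0, W_i^{-1} I_d), i.e., multivariate t errors
    # with v degrees of freedom.
    rmixture_errors <- function(n, d, v) {
      w <- rgamma(n, shape = v / 2, rate = v / 2)   # mixing variables
      z <- matrix(rnorm(n * d), nrow = n, ncol = d) # standard normal draws
      z / sqrt(w)                                   # row i is scaled by W_i^{-1/2}
    }

    set.seed(1)
    eps <- rmixture_errors(n = 5, d = 2, v = 4)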

Consider a Bayesian setting. Throughout, we use (𝜷,𝝇)(\bm{\beta},\bm{\varsigma}) to denote realized values of (𝑩,𝚺)(\bm{B},\bm{\Sigma}). Assume that (𝑩,𝚺)(\bm{B},\bm{\Sigma}) has the following prior density:

pprior(𝜷,𝝇)|𝝇|(m+1)/2exp{12tr(𝝇1𝒂)},𝜷p×d,𝝇S+d×d,p_{\scriptsize\mbox{prior}}(\bm{\beta},\bm{\varsigma})\propto|\bm{\varsigma}|^{-(m+1)/2}\exp\left\{-\frac{1}{2}\operatorname{tr}\left(\bm{\varsigma}^{-1}\bm{a}\right)\right\},\quad\bm{\beta}\in\mathbb{R}^{p\times d},\;\bm{\varsigma}\in S_{+}^{d\times d}, (1)

where mm\in\mathbb{R}, 𝒂S+d×d\bm{a}\in S_{+}^{d\times d}, and S+d×dS_{+}^{d\times d} is the convex cone of d×dd\times d (symmetric) positive semi-definite real matrices. Equation (1) defines a class of commonly used default priors (Liu, 1996; Fernández and Steel, 1999). For instance, the independence Jeffreys prior corresponds to m=dm=d and 𝒂=𝟎\bm{a}=\bm{0}. Denote by PpriorP_{\scriptsize\mbox{prior}} the measure associated with ppriorp_{\scriptsize\mbox{prior}}. The model of interest can then be written in the hierarchical form:

𝒀i𝑾,𝑩,𝚺\displaystyle\bm{Y}_{i}\mid\bm{W},\bm{B},\bm{\Sigma} indNd(𝑩𝒙i,Wi1𝚺),i{1,,n};\displaystyle\stackrel{{\scriptstyle\scriptsize\mbox{ind}}}{{\sim}}\mbox{N}_{d}\left(\bm{B}^{\top}\bm{x}_{i},W_{i}^{-1}\bm{\Sigma}\right),\;i\in\{1,\dots,n\};
Wi𝑩,𝚺\displaystyle W_{i}\mid\bm{B},\bm{\Sigma} indPmix(),i{1,,n};\displaystyle\stackrel{{\scriptstyle\scriptsize\mbox{ind}}}{{\sim}}P_{\scriptsize\mbox{mix}}(\cdot),\;i\in\{1,\dots,n\};
𝑩,𝚺\displaystyle\bm{B},\bm{\Sigma} Pprior(),\displaystyle\sim P_{\scriptsize\mbox{prior}}(\cdot),

where Nd\mbox{N}_{d} denotes dd-variate normal distributions. Let 𝑾:=(W1,,Wn)\bm{W}:=(W_{1},\dots,W_{n})^{\top} be a vector of latent random variables. The joint measure of (𝒀,𝑾,𝑩,𝚺)(\bm{Y},\bm{W},\bm{B},\bm{\Sigma}) is given by

Pjoint(d𝒚,d𝒘,d𝜷,d𝝇)=pjoint(𝒚,𝒘,𝜷,𝝇)d𝒚Pmix(d𝒘)d𝜷d𝝇,\displaystyle P_{\scriptsize\mbox{joint}}(\mbox{d}\bm{y},\mbox{d}\bm{w},\mbox{d}\bm{\beta},\mbox{d}\bm{\varsigma})=p_{\scriptsize\mbox{joint}}(\bm{y},\bm{w},\bm{\beta},\bm{\varsigma})\,\mbox{d}\bm{y}\,P_{\scriptsize\mbox{mix}}(\mbox{d}\bm{w})\,\mbox{d}\bm{\beta}\,\mbox{d}\bm{\varsigma},

where, for 𝒚=(𝒚1,,𝒚n)n×d\bm{y}=(\bm{y}_{1},\dots,\bm{y}_{n})^{\top}\in\mathbb{R}^{n\times d}, 𝒘=(w1,,wn)(0,)n\bm{w}=(w_{1},\dots,w_{n})^{\top}\in(0,\infty)^{n}, 𝜷p×d\bm{\beta}\in\mathbb{R}^{p\times d}, and 𝝇S+d×d\bm{\varsigma}\in S_{+}^{d\times d},

pjoint(𝒚,𝒘,𝜷,𝝇)pprior(𝜷,𝝇)i=1nwid/2|𝝇|1/2exp{12wi(𝒚i𝜷𝒙i)𝝇1(𝒚i𝜷𝒙i)}.p_{\scriptsize\mbox{joint}}(\bm{y},\bm{w},\bm{\beta},\bm{\varsigma})\propto\,p_{\scriptsize\mbox{prior}}(\bm{\beta},\bm{\varsigma})\prod_{i=1}^{n}\frac{w_{i}^{d/2}}{|\bm{\varsigma}|^{1/2}}\exp\left\{-\frac{1}{2}w_{i}\left(\bm{y}_{i}-\bm{\beta}^{\top}\bm{x}_{i}\right)^{\top}\bm{\varsigma}^{-1}\left(\bm{y}_{i}-\bm{\beta}^{\top}\bm{x}_{i}\right)\right\}. (2)

Throughout, conditional distributions concerning (𝒀,𝑾,𝑩,𝚺)(\bm{Y},\bm{W},\bm{B},\bm{\Sigma}) are to be uniquely defined through the density pjointp_{\scriptsize\mbox{joint}}.

We assume that 𝒙\bm{x} is fully observed, but 𝒀\bm{Y} may contain missing values. The missing structure can be described by an n×dn\times d random matrix 𝑲=(Ki,j)i=1nj=1d\bm{K}=(K_{i,j})_{i=1}^{n}{}_{j=1}^{d}, with Ki,j=1K_{i,j}=1 indicating Yi,jY_{i,j} is observed, and Ki,j=0K_{i,j}=0 indicating Yi,jY_{i,j} is missing. For a realized response matrix 𝒚=(yi,j)i=1nj=1dn×d\bm{y}=(y_{i,j})_{i=1}^{n}{}_{j=1}^{d}\in\mathbb{R}^{n\times d} and a realized missing structure 𝒌=(ki,j)i=1nj=1d{0,1}n×d\bm{k}=(k_{i,j})_{i=1}^{n}{}_{j=1}^{d}\in\{0,1\}^{n\times d}, let 𝒚(𝒌)\bm{y}_{(\bm{k})} be the array (yi,j)(i,j)A(𝒌)(y_{i,j})_{(i,j)\in A(\bm{k})}, where A(𝒌)={(i,j):ki,j=1}A(\bm{k})=\{(i,j):k_{i,j}=1\}. Then 𝒀(𝑲)\bm{Y}_{(\bm{K})} gives the observable portion of 𝒀\bm{Y}. In practice, instead of observing 𝒀\bm{Y}, one sees instead (𝑲,𝒀(𝑲))(\bm{K},\bm{Y}_{(\bm{K})}). Throughout, we assume that the missing data mechanism is ignorable, which means the following.

Definition 1.

The missing data mechanism is ignorable if, for any 𝐤{0,1}n×d\bm{k}\in\{0,1\}^{n\times d} and almost every 𝐲n×d\bm{y}\in\mathbb{R}^{n\times d}, the posterior distribution of (𝐁,𝚺)(\bm{B},\bm{\Sigma}) given (𝐊,𝐘(𝐊))=(𝐤,𝐲(𝐤))(\bm{K},\bm{Y}_{(\bm{K})})=(\bm{k},\bm{y}_{(\bm{k})}) is the same as the conditional distribution of (𝐁,𝚺)(\bm{B},\bm{\Sigma}) given 𝐘(𝐤)=𝐲(𝐤)\bm{Y}_{(\bm{k})}=\bm{y}_{(\bm{k})}.

If, given (𝑩,𝚺)=(𝜷,𝝇)(\bm{B},\bm{\Sigma})=(\bm{\beta},\bm{\varsigma}), the distribution of 𝑲\bm{K} does not depend on (𝜷,𝝇)(\bm{\beta},\bm{\varsigma}), and the data are “realized missing at random” (Seaman et al., 2013), which holds, for instance, when 𝑲\bm{K} is independent of 𝒀\bm{Y}, then the missing data mechanism is ignorable. From here on, fix a realized missing structure 𝒌=(ki,j)i=1nj=1d{0,1}n×d\bm{k}=(k_{i,j})_{i=1}^{n}{}_{j=1}^{d}\in\{0,1\}^{n\times d}, and only condition on 𝒀(𝒌)\bm{Y}_{(\bm{k})} when studying posterior distributions. Without loss of generality, assume that each row of 𝒌\bm{k} contains at least one nonzero element, i.e., for each ii, 𝒀i\bm{Y}_{i} has at least one observed entry.

To write down the exact form of the posterior, we introduce some additional notation. Suppose that given i{1,,n}i\in\{1,\dots,n\}, ki,j=1k_{i,j}=1 if and only if j{j1,,jdi}j\in\{j_{1},\dots,j_{d_{i}}\} for some di{1,,d}d_{i}\in\{1,\dots,d\}, where j1<<jdij_{1}<\dots<j_{d_{i}}. Given ii, let 𝒄i,(𝒌)\bm{c}_{i,(\bm{k})} be the di×dd_{i}\times d matrix that satisfies the following: For =1,,di\ell=1,\dots,d_{i}, in the \ellth row of 𝒄i,(𝒌)\bm{c}_{i,(\bm{k})}, all elements except the jj_{\ell}th one are 0, while the jj_{\ell}th one is 1. For instance, if d=4d=4, di=2d_{i}=2, j1=1j_{1}=1, j2=3j_{2}=3, then

𝒄i,(𝒌)=(10000010).\bm{c}_{i,(\bm{k})}=\left(\begin{array}[]{cccc}1&0&0&0\\ 0&0&1&0\end{array}\right).

Then 𝒀i,(𝒌):=𝒄i,(𝒌)𝒀i\bm{Y}_{i,(\bm{k})}:=\bm{c}_{i,(\bm{k})}\bm{Y}_{i} is a vector consisting of the observed components of 𝒀i\bm{Y}_{i} if 𝑲=𝒌\bm{K}=\bm{k}, and we can write 𝒀(𝒌)\bm{Y}_{(\bm{k})} as (𝒀i,(𝒌))i=1n(\bm{Y}_{i,(\bm{k})})_{i=1}^{n}. For a realized value of 𝒀\bm{Y}, say 𝒚=(yi,j)i=1nj=1dn×d\bm{y}=(y_{i,j})_{i=1}^{n}{}_{j=1}^{d}\in\mathbb{R}^{n\times d}, denote by 𝒚i\bm{y}_{i} the iith row of 𝒚\bm{y} transposed, let 𝒚i,(𝒌)=𝒄i,(𝒌)𝒚i\bm{y}_{i,(\bm{k})}=\bm{c}_{i,(\bm{k})}\bm{y}_{i}, and let 𝒚(𝒌)=(𝒚i,(𝒌))i=1n\bm{y}_{(\bm{k})}=(\bm{y}_{i,(\bm{k})})_{i=1}^{n}. Based on  Equation (2), the posterior density of (𝑩,𝚺)(\bm{B},\bm{\Sigma}) given 𝒀(𝒌)=𝒚(𝒌)\bm{Y}_{(\bm{k})}=\bm{y}_{(\bm{k})} has the following form:

π𝒌(𝜷,𝝇𝒚(𝒌))pprior(𝜷,𝝇)i=1n0wdi/2|𝒄i,(𝒌)𝝇𝒄i,(𝒌)|1/2exp(ri,(𝒌)w2)Pmix(dw),\pi_{\bm{k}}\left(\bm{\beta},\bm{\varsigma}\mid\bm{y}_{(\bm{k})}\right)\propto p_{\scriptsize\mbox{prior}}(\bm{\beta},\bm{\varsigma})\prod_{i=1}^{n}\int_{0}^{\infty}\frac{w^{d_{i}/2}}{\left|\bm{c}_{i,(\bm{k})}\bm{\varsigma}\bm{c}_{i,(\bm{k})}^{\top}\right|^{1/2}}\exp\left(-\frac{r_{i,(\bm{k})}w}{2}\right)\,P_{\scriptsize\mbox{mix}}(\mbox{d}w),

where, for i{1,,n}i\in\{1,\dots,n\},

ri,(𝒌)=(𝒚i,(𝒌)𝒄i,(𝒌)𝜷𝒙i)(𝒄i,(𝒌)𝝇𝒄i,(𝒌))1(𝒚i,(𝒌)𝒄i,(𝒌)𝜷𝒙i).r_{i,(\bm{k})}=\left(\bm{y}_{i,(\bm{k})}-\bm{c}_{i,(\bm{k})}\bm{\beta}^{\top}\bm{x}_{i}\right)^{\top}\left(\bm{c}_{i,(\bm{k})}\bm{\varsigma}\bm{c}_{i,(\bm{k})}^{\top}\right)^{-1}\left(\bm{y}_{i,(\bm{k})}-\bm{c}_{i,(\bm{k})}\bm{\beta}^{\top}\bm{x}_{i}\right). (3)
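
For illustration, r_{i,(k)} can be computed directly from the i-th row of k. The sketch below (plain R; the function name and inputs are ours, chosen for illustration) builds the selection matrix c_{i,(k)} and evaluates the quadratic form in Equation (3).

    # Build c_{i,(k)} from the i-th row of k and evaluate r_{i,(k)} in Equation (3).
    r_ik <- function(k_row, y_row, x_i, beta, sigma) {
      obs <- which(k_row == 1)                        # indices of observed components
      c_ik <- diag(length(k_row))[obs, , drop = FALSE]
      resid <- y_row[obs] - c_ik %*% t(beta) %*% x_i  # y_{i,(k)} - c_{i,(k)} beta^T x_i
      drop(t(resid) %*% solve(c_ik %*% sigma %*% t(c_ik), resid))
    }
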
Remark 2.

Because the prior distribution is not proper, π𝐤(𝐲(𝐤))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}) is not automatically a proper probability density. As a side product of our convergence analysis, we give sufficient conditions for posterior propriety in Section 4.

The posterior density π𝒌(𝒚(𝒌))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}) is almost always intractable in the sense that it is hard to calculate its features such as expectation and quantiles. Liu (1996) proposed a data augmentation (DA) algorithm, or two-component Gibbs sampler, that can be used to sample from this distribution. This is an MCMC algorithm that simulates a Markov chain (𝑩(t),𝚺(t))t=0(\bm{B}(t),\bm{\Sigma}(t))_{t=0}^{\infty} that is reversible with respect to the posterior. In the next section, we describe the algorithm in detail.

3 The DA and DAI Algorithms

3.1 Missing Structures

To adequately describe the DA and DAI algorithms, we need to introduce some concepts regarding missing structure matrices.

A realized missing structure 𝒌=(ki,j)i=1nj=1d{0,1}n×d\bm{k}=(k_{i,j})_{i=1}^{n}{}_{j=1}^{d}\in\{0,1\}^{n\times d} is said to be monotone if the following conditions hold:

  1. (i)

    If ki,j=1k_{i,j}=1 for some i{1,,n}i\in\{1,\dots,n\} and j{1,,d}j\in\{1,\dots,d\}, then ki,j=1k_{i^{\prime},j^{\prime}}=1 whenever iii^{\prime}\leq i and jjj^{\prime}\geq j.

  2. (ii)

    ki,d=1k_{i,d}=1 for i{1,,n}i\in\{1,\dots,n\}.

Note that in practice, there are cases where the observed missing structure can be re-arranged to become monotone by permuting the rows and columns of the response matrix. If there are no missing data, the corresponding missing structure is monotone.
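
As a quick illustration, the following sketch in plain R (the function name is ours and is not part of the Bayesianrobust package) checks conditions (i) and (ii) for a 0/1 missing-structure matrix.

    # Check whether a missing-structure matrix k is monotone in the sense above.
    is_monotone <- function(k) {
      n <- nrow(k); d <- ncol(k)
      if (any(k[, d] != 1)) return(FALSE)   # condition (ii): last column fully observed
      for (i in seq_len(n)) {
        for (j in seq_len(d)) {
          # condition (i): k[i, j] = 1 forces k[i', j'] = 1 for i' <= i and j' >= j
          if (k[i, j] == 1 && any(k[1:i, j:d] != 1)) return(FALSE)
        }
      }
      TRUE
    }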

Table 1: The observed response under a monotone structure
Pattern 1 (rows 1 to n_{1}): y_{1,1}, y_{1,2}, ..., y_{1,d}; ...; y_{n_{1},1}, y_{n_{1},2}, ..., y_{n_{1},d}
Pattern 2 (rows n_{1}+1 to n_{1}+n_{2}): y_{n_{1}+1,2}, ..., y_{n_{1}+1,d}; ...; y_{n_{1}+n_{2},2}, ..., y_{n_{1}+n_{2},d}
...
Pattern d (rows n-n_{d}+1 to n): y_{n-n_{d}+1,d}; ...; y_{n,d}

Let 𝒌=(ki,j)i=1nj=1d\bm{k}=(k_{i,j})_{i=1}^{n}{}_{j=1}^{d} be monotone. Then, for a realized response matrix 𝒚=(yi,j)i=1nj=1dn×d\bm{y}=(y_{i,j})_{i=1}^{n}{}_{j=1}^{d}\in\mathbb{R}^{n\times d}, the elements in 𝒚(𝒌)\bm{y}_{(\bm{k})} can be arranged as in Table 1. We say that the iith observation belongs to pattern \ell for some {1,,d}\ell\in\{1,\dots,d\} if ki,j=1k_{i,j}=1 for jj\geq\ell and ki,j=0k_{i,j}=0 for j<j<\ell. There are dd possible patterns. For {1,,d}\ell\in\{1,\dots,d\}, denote by n(𝒌)n_{\ell}(\bm{k}) the number of observations that belong to pattern \ell. Let 𝒚(𝒌,)\bm{y}_{(\bm{k},\ell)} be the [j=1nj(𝒌)]×(d+1)[\sum_{j=1}^{\ell}n_{j}(\bm{k})]\times(d-\ell+1) matrix whose iith row is (yi,,,yi,d),(y_{i,\ell},\dots,y_{i,d}), and let 𝒙(𝒌,)\bm{x}_{(\bm{k},\ell)} be the submatrix of 𝒙\bm{x} formed by the first j=1nj(𝒌)\sum_{j=1}^{\ell}n_{j}(\bm{k}) rows of 𝒙\bm{x}. Denote by 𝒚(𝒌,):𝒙(𝒌,)\bm{y}_{(\bm{k},\ell)}:\bm{x}_{(\bm{k},\ell)} the matrix formed by attaching 𝒙(𝒌,)\bm{x}_{(\bm{k},\ell)} to the right of 𝒚(𝒌,)\bm{y}_{(\bm{k},\ell)}. We say that Condition (H1) holds for the pair (𝒌,𝒚(𝒌))(\bm{k},\bm{y}_{(\bm{k})}) if

r(𝒚(𝒌,):𝒙(𝒌,))=p+d+1,j=1nj(𝒌)>p+dm+1 for {1,,d}.r\left(\bm{y}_{(\bm{k},\ell)}:\bm{x}_{(\bm{k},\ell)}\right)=p+d-\ell+1,\quad\sum_{j=1}^{\ell}n_{j}(\bm{k})>p+d-m+\ell-1\;\text{ for }\ell\in\{1,\dots,d\}. (H1)

Here, r()r(\cdot) returns the rank of a matrix. This condition is crucial for the DA algorithm to be well-defined and implementable.
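
A direct numerical check of Condition (H1) is straightforward. The sketch below (plain R, assuming the rows of y and x are already sorted by pattern as in Table 1 and that missing entries of y are stored as NA) is one way to carry it out.

    # Check Condition (H1) for a monotone structure k with observed data (y, x).
    check_H1 <- function(y, x, k, m) {
      d <- ncol(y); p <- ncol(x)
      pattern <- apply(k, 1, function(row) which(row == 1)[1])  # pattern of each row
      for (l in seq_len(d)) {
        rows <- which(pattern <= l)              # observations in patterns 1, ..., l
        yl <- y[rows, l:d, drop = FALSE]         # y_(k, l)
        xl <- x[rows, , drop = FALSE]            # x_(k, l)
        rank_ok  <- qr(cbind(yl, xl))$rank == p + d - l + 1
        count_ok <- length(rows) > p + d - m + l - 1
        if (!rank_ok || !count_ok) return(FALSE)
      }
      TRUE
    }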

3.2 The DA algorithm

Fix a realized response 𝒚n×d\bm{y}\in\mathbb{R}^{n\times d} and a realized missing structure 𝒌{0,1}n×d\bm{k}\in\{0,1\}^{n\times d}. Given the current state (𝑩(t),𝚺(t))=(𝜷,𝝇)(\bm{B}(t),\bm{\Sigma}(t))=(\bm{\beta},\bm{\varsigma}), the DA algorithm for sampling from π𝒌(𝒚(𝒌))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}) draws the next state (𝑩(t+1),𝚺(t+1))(\bm{B}(t+1),\bm{\Sigma}(t+1)) using the following steps.

  1. 1.

    I step. Draw 𝑾=(W1,,Wn)\bm{W}^{*}=(W_{1}^{*},\dots,W_{n}^{*}) from the conditional distribution of 𝑾\bm{W} given (𝑩,𝚺,𝒀(𝒌))=(𝜷,𝝇,𝒚(𝒌))(\bm{B},\bm{\Sigma},\bm{Y}_{(\bm{k})})=(\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})}). Call the sampled value 𝒘\bm{w}.

  2. 2.

    P step. Draw (𝑩(t+1),𝚺(t+1))(\bm{B}(t+1),\bm{\Sigma}(t+1)) from the conditional distribution of (𝑩,𝚺)(\bm{B},\bm{\Sigma}) given (𝑾,𝒀(𝒌))=(𝒘,𝒚(𝒌))(\bm{W},\bm{Y}_{(\bm{k})})=(\bm{w},\bm{y}_{(\bm{k})}).

This algorithm simulates a Markov chain (𝑩(t),𝚺(𝒕))t=0(\bm{B}(t),\bm{\Sigma(t)})_{t=0}^{\infty} that is reversible with respect to π𝒌(𝒚(𝒌))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}).

For i{1,,n}i\in\{1,\dots,n\}, let did_{i} be the number of nonzero entries in the iith row of 𝒌\bm{k}. (Note: if 𝒌\bm{k} is monotone and the iith observation belongs to pattern \ell, then d_{i}=d-\ell+1.) One can derive that the conditional distribution of 𝑾=(W1,,Wn)\bm{W}=(W_{1},\dots,W_{n}) given (𝑩,𝚺,𝒀(𝒌))=(𝜷,𝝇,𝒚(𝒌))(\bm{B},\bm{\Sigma},\bm{Y}_{(\bm{k})})=(\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})}) is

P𝒌(d𝒘𝜷,𝝇,𝒚(𝒌))i=1nwidi/2exp(ri,(𝒌)wi2)Pmix(dwi),P_{{\bm{k}}}(\mbox{d}\bm{w}\mid\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})})\propto\prod_{i=1}^{n}w_{i}^{d_{i}/2}\exp\left(-\frac{r_{i,(\bm{k})}w_{i}}{2}\right)\,P_{\scriptsize\mbox{mix}}(\mbox{d}w_{i}), (4)

where ri,(𝒌)r_{i,(\bm{k})} is defined in Equation (3) for i{1,,n}i\in\{1,\dots,n\}. To ensure that this conditional density is always proper, we assume throughout that

0wd/2Pmix(dw)<.\int_{0}^{\infty}w^{d/2}\,P_{\scriptsize\mbox{mix}}(\mbox{d}w)<\infty. (H2)

The conditional distribution P𝒌(𝜷,𝝇,𝒚(𝒌))P_{{}_{\bm{k}}}(\cdot\mid\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})}) corresponds to nn independent univariate random variables. For most commonly used mixing distributions Pmix()P_{\scriptsize\mbox{mix}}(\cdot), this is not difficult to sample. For instance, if Pmix()P_{\scriptsize\mbox{mix}}(\cdot) is a Gamma distribution, P𝒌(𝜷,𝝇,𝒚(𝒌))P_{{}_{\bm{k}}}(\cdot\mid\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})}) is the product of nn Gamma distributions.
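
For example, when the mixing distribution is Gamma(v/2, v/2), Equation (4) shows that the W_i are conditionally independent Gamma((v + d_i)/2, (v + r_{i,(k)})/2) random variables, so the I step reduces to the following sketch (plain R; d_obs and r denote the vectors of d_i and r_{i,(k)} values).

    # I step under Gamma(v/2, v/2) mixing: independent Gamma draws for W_1, ..., W_n.
    draw_w_gamma <- function(v, d_obs, r) {
      rgamma(length(r), shape = (v + d_obs) / 2, rate = (v + r) / 2)
    }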

The conditional distribution of (𝑩,𝚺)(\bm{B},\bm{\Sigma}) given (𝑾,𝒀(𝒌))=(𝒘,𝒚(𝒌))(\bm{W},\bm{Y}_{(\bm{k})})=(\bm{w},\bm{y}_{(\bm{k})}) is not always tractable. In fact, since an improper prior is used, this conditional is possibly improper. Liu (1996) provided a method for sampling from this distribution when 𝒌\bm{k} is monotone. When 𝒌\bm{k} is monotone and Condition (H1) holds for (\bm{k},\bm{y}_{(\bm{k})}), the conditional of (𝑩,𝚺)(\bm{B},\bm{\Sigma}) given (𝑾,𝒀(𝒌))=(𝒘,𝒚(𝒌))(\bm{W},\bm{Y}_{(\bm{k})})=(\bm{w},\bm{y}_{(\bm{k})}) is proper for any 𝒘=(w1,,wn)(0,)n\bm{w}=(w_{1},\dots,w_{n})\in(0,\infty)^{n}. This conditional distribution can be sampled using chi-square and normal distributions. The method is intricate, so we relegate the details to Appendix A.

Let 𝒌0{0,1}n×d\bm{k}_{0}\in\{0,1\}^{n\times d} be the missing structure that corresponds to a completely observable response, i.e., all elements of 𝒌0\bm{k}_{0} are 1. Denote by 𝒀(𝒌0𝒌)\bm{Y}_{(\bm{k}_{0}-\bm{k})} the missing parts in the response. Sometimes, we are interested in the posterior distribution of 𝒀(𝒌0𝒌)\bm{Y}_{(\bm{k}_{0}-\bm{k})} given 𝒀(𝒌)=𝒚(𝒌)\bm{Y}_{(\bm{k})}=\bm{y}_{(\bm{k})}, which takes the following form:

P𝒀(𝒌0𝒌)(d𝒛𝒌,𝒚(𝒌))d𝒛S+d×dp×d(0,)npjoint(𝒚𝒛,𝒘,𝜷,𝝇)Pmix(d𝒘)𝑑𝜷𝑑𝝇,P_{\bm{Y}_{(\bm{k}_{0}-\bm{k})}}\left(d\bm{z}\mid\bm{k},\bm{y}_{(\bm{k})}\right)\propto d\bm{z}\int_{S_{+}^{d\times d}}\int_{\mathbb{R}^{p\times d}}\int_{(0,\infty)^{n}}p_{\text{joint}}(\bm{y}^{\bm{z}},\bm{w},\bm{\beta},\bm{\varsigma})P_{\scriptsize\mbox{mix}}(d\bm{w})d\bm{\beta}d\bm{\varsigma},

where pjointp_{\text{joint}} is given in Equation (2), and 𝒚𝒛\bm{y}^{\bm{z}} is a realized value of 𝒀\bm{Y} such that 𝒚(𝒌)𝒛=𝒚(𝒌)\bm{y}^{\bm{z}}_{(\bm{k})}=\bm{y}_{(\bm{k})} and 𝒚(𝒌0𝒌)𝒛=𝒛\bm{y}^{\bm{z}}_{(\bm{k}_{0}-\bm{k})}=\bm{z}. To sample from P𝒀(𝒌0𝒌)(𝒌,𝒚(𝒌))P_{\bm{Y}_{(\bm{k}_{0}-\bm{k})}}\left(\cdot\mid\bm{k},\bm{y}_{(\bm{k})}\right), one can add a post hoc imputation step at the end of each DA iteration.

  1. 3.

    Post hoc imputation step. Draw 𝒁(t+1)\bm{Z}(t+1) from the conditional distribution of 𝒀(𝒌0𝒌)\bm{Y}_{(\bm{k}_{0}-\bm{k})} given (𝒀(𝒌),𝑾,𝑩,𝚺)=(𝒚(𝒌),𝒘,𝜷,𝝇)(\bm{Y}_{(\bm{k})},\bm{W},\bm{B},\bm{\Sigma})=(\bm{y}_{(\bm{k})},\bm{w},\bm{\beta}^{*},\bm{\varsigma}^{*}), where (𝜷,𝝇)(\bm{\beta}^{*},\bm{\varsigma}^{*}) is the sampled value of (𝑩(t+1),𝚺(t+1))(\bm{B}(t+1),\bm{\Sigma}(t+1)).

Recall that, given (𝑾,𝑩,𝚺)=(𝒘,𝜷,𝝇)(\bm{W},\bm{B},\bm{\Sigma})=(\bm{w},\bm{\beta}^{*},\bm{\varsigma}^{*}), 𝒀\bm{Y} has a multivariate normal distribution. Then the conditional distribution of 𝒀(𝒌0𝒌)\bm{Y}_{(\bm{k}_{0}-\bm{k})} given (𝒀(𝒌),𝑾,𝑩,𝚺)=(𝒚(𝒌),𝒘,𝜷,𝝇)(\bm{Y}_{(\bm{k})},\bm{W},\bm{B},\bm{\Sigma})=(\bm{y}_{(\bm{k})},\bm{w},\bm{\beta}^{*},\bm{\varsigma}^{*}) is also (univariate or multivariate) normal. One can show that (𝑩(t),𝚺(t),𝒁(t))t=0(\bm{B}(t),\bm{\Sigma}(t),\bm{Z}(t))_{t=0}^{\infty} is a Markov chain whose stationary distribution is the posterior distribution of (𝑩,𝚺,𝒀(𝒌0𝒌))(\bm{B},\bm{\Sigma},\bm{Y}_{(\bm{k}_{0}-\bm{k})}) given 𝒀(𝒌)=𝒚(𝒌)\bm{Y}_{(\bm{k})}=\bm{y}_{(\bm{k})}.
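
A sketch of this conditional normal draw for a single observation is given below (plain R; the function name is ours). Note that w_i cancels in the conditional mean but scales the conditional covariance.

    # Post hoc imputation for one row: given (W_i, B, Sigma), Y_i ~ N_d(B^T x_i, Sigma / w_i),
    # so the missing entries given the observed ones follow the usual conditional normal.
    impute_row <- function(y_row, k_row, x_i, w_i, beta, sigma) {
      mis <- which(k_row == 0); obs <- which(k_row == 1)
      if (length(mis) == 0) return(y_row)
      mu <- drop(t(beta) %*% x_i)
      V  <- sigma / w_i
      A  <- V[mis, obs, drop = FALSE] %*% solve(V[obs, obs, drop = FALSE])
      cond_mean <- mu[mis] + drop(A %*% (y_row[obs] - mu[obs]))
      cond_var  <- V[mis, mis, drop = FALSE] - A %*% V[obs, mis, drop = FALSE]
      cond_var  <- (cond_var + t(cond_var)) / 2          # symmetrize for numerical safety
      y_row[mis] <- cond_mean + drop(t(chol(cond_var)) %*% rnorm(length(mis)))
      y_row
    }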

We call the imputation step post hoc because it can be implemented at the end of the whole simulation, as long as the value of 𝒘\bm{w} is recorded in the I step of each iteration. Post hoc imputation is possible because the I and P steps do not rely on the value of 𝒁(t)\bm{Z}(t). Naturally, post hoc imputation does not affect the convergence properties of the Markov chain. Indeed, standard arguments (Roberts and Rosenthal, 2001) show that (𝑩(t),𝚺(t),𝒁(t))t=0(\bm{B}(t),\bm{\Sigma}(t),\bm{Z}(t))_{t=0}^{\infty} is geometrically ergodic if and only if (𝑩(t),𝚺(t))t=0(\bm{B}(t),\bm{\Sigma}(t))_{t=0}^{\infty} is geometrically ergodic. Moreover, the L2L^{2} convergence rates of the two chains, as defined in Section 4.1, are the same. Thus, when studying the convergence properties of the DA algorithm, we can restrict our attention to (𝑩(t),𝚺(t))t=0(\bm{B}(t),\bm{\Sigma}(t))_{t=0}^{\infty} instead of (𝑩(t),𝚺(t),𝒁(t))t=0(\bm{B}(t),\bm{\Sigma}(t),\bm{Z}(t))_{t=0}^{\infty} even if there is post hoc imputation.
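
Putting the pieces together, one iteration of the DA algorithm, with the I-step draws stored so that post hoc imputation can be run afterwards, has the following shape. This is a minimal sketch: draw_w() and draw_beta_sigma() are hypothetical stand-ins for the I step (Equation (4)) and the P step (Appendix A), not functions of the Bayesianrobust package.

    # Skeleton of the DA algorithm; the w draws are kept for post hoc imputation.
    run_da <- function(y, x, k, init, n_iter, draw_w, draw_beta_sigma) {
      beta <- init$beta; sigma <- init$sigma
      out <- vector("list", n_iter)
      for (t in seq_len(n_iter)) {
        w    <- draw_w(y, x, k, beta, sigma)    # I step
        draw <- draw_beta_sigma(y, x, k, w)     # P step
        beta <- draw$beta; sigma <- draw$sigma
        out[[t]] <- list(beta = beta, sigma = sigma, w = w)
      }
      out
    }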

3.3 The DAI algorithm

For two realized missing structures 𝒌=(ki,j){0,1}n×d\bm{k}=(k_{i,j})\in\{0,1\}^{n\times d} and 𝒌=(ki,j){0,1}n×d\bm{k}^{\prime}=(k_{i,j}^{\prime})\in\{0,1\}^{n\times d}, write 𝒌𝒌\bm{k}\prec\bm{k}^{\prime} if (i) 𝒌𝒌\bm{k}\neq\bm{k}^{\prime} and (ii) 𝒌i,j=1\bm{k}^{\prime}_{i,j}=1 whenever 𝒌i,j=1\bm{k}_{i,j}=1. That is, 𝒌𝒌\bm{k}\prec\bm{k}^{\prime} if the observed response entries under structure 𝒌\bm{k} form a proper subset of those under 𝒌\bm{k}^{\prime}. If 𝒌𝒌\bm{k}\prec\bm{k}^{\prime}, then 𝒌𝒌\bm{k}^{\prime}-\bm{k} is a missing structure matrix that gives the entries that are missing under 𝒌\bm{k}, but not under 𝒌\bm{k}^{\prime}. In other words, when 𝒌𝒌\bm{k}\prec\bm{k}^{\prime}, 𝒀(𝒌𝒌)\bm{Y}_{(\bm{k}^{\prime}-\bm{k})} is observed under 𝒌\bm{k}^{\prime}, but not under 𝒌\bm{k}.

As usual, fix a realized response 𝒚n×d\bm{y}\in\mathbb{R}^{n\times d} and missing structure 𝒌\bm{k}. Suppose that 𝒌\bm{k} is not monotone, so that the DA algorithm cannot be implemented efficiently. The DAI algorithm, also proposed in Liu (1996), overcomes non-monotonicity by imputing the value of 𝒀(𝒌𝒌)\bm{Y}_{(\bm{k}^{\prime}-\bm{k})} in the I step, where 𝒌\bm{k}^{\prime} is a user-chosen monotone realized missing structure such that 𝒌𝒌\bm{k}\prec\bm{k}^{\prime}. We now describe the details of this algorithm.

The DAI algorithm associated with the triplet (𝒚,𝒌,𝒌)(\bm{y},\bm{k},\bm{k}^{\prime}) simulates a Markov chain (𝑩(t),𝚺(t))t=0(\bm{B}(t),\bm{\Sigma}(t))_{t=0}^{\infty} that is reversible with respect to π𝒌(𝒚(𝒌))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}). Given the current state (𝑩(t),𝚺(t))=(𝜷,𝝇)(\bm{B}(t),\bm{\Sigma}(t))=(\bm{\beta},\bm{\varsigma}), the DAI algorithm draws the next state (𝑩(t+1),𝚺(t+1))(\bm{B}(t+1),\bm{\Sigma}(t+1)) using the following steps.

  1. 1.

    I1 step. Draw 𝑾=(W1,,Wn)\bm{W}^{*}=(W_{1}^{*},\dots,W_{n}^{*}) from the conditional distribution of 𝑾\bm{W} given (𝑩,𝚺,𝒀(𝒌))=(𝜷,𝝇,𝒚(𝒌))(\bm{B},\bm{\Sigma},\bm{Y}_{(\bm{k})})=(\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})}), whose exact form is given in  Equation (4). Call the sampled value 𝒘\bm{w}.

  2. 2.

    I2 step. Draw 𝒁(t+1)\bm{Z}(t+1) from the conditional distribution of 𝒀(𝒌𝒌)\bm{Y}_{(\bm{k}^{\prime}-\bm{k})} given
    (𝒀(𝒌),𝑾,𝑩,𝚺)=(𝒚(𝒌),𝒘,𝜷,𝝇)(\bm{Y}_{(\bm{k})},\bm{W},\bm{B},\bm{\Sigma})=(\bm{y}_{(\bm{k})},\bm{w},\bm{\beta},\bm{\varsigma}). Call the sampled value 𝒛\bm{z}.

  3. 3.

    P step. Draw (𝑩(t+1),𝚺(t+1))(\bm{B}(t+1),\bm{\Sigma}(t+1)) from the conditional distribution of (𝑩,𝚺)(\bm{B},\bm{\Sigma}) given (𝑾,𝒀(𝒌),𝒀(𝒌𝒌))=(𝒘,𝒚(𝒌),𝒛)(\bm{W},\bm{Y}_{(\bm{k})},\bm{Y}_{(\bm{k}^{\prime}-\bm{k})})=(\bm{w},\bm{y}_{(\bm{k})},\bm{z}).

Recall that, given (𝑾,𝑩,𝚺)=(𝒘,𝜷,𝝇)(\bm{W},\bm{B},\bm{\Sigma})=(\bm{w},\bm{\beta},\bm{\varsigma}), 𝒀\bm{Y} has a multivariate normal distribution. Then in the I2 step, the conditional distribution of 𝒀(𝒌𝒌)\bm{Y}_{(\bm{k}^{\prime}-\bm{k})} given (𝒀(𝒌),𝑾,𝑩,𝚺)=(𝒚(𝒌),𝒘,𝜷,𝝇)(\bm{Y}_{(\bm{k})},\bm{W},\bm{B},\bm{\Sigma})=(\bm{y}_{(\bm{k})},\bm{w},\bm{\beta},\bm{\varsigma}) is (univariate or multivariate) normal. In the P step, one is in fact sampling from the conditional distribution of (𝑩,𝚺)(\bm{B},\bm{\Sigma}) given (𝑾,𝒀(𝒌))=(𝒘,𝒚(𝒌))(\bm{W},\bm{Y}_{(\bm{k}^{\prime})})=(\bm{w},\bm{y}_{(\bm{k}^{\prime})}^{*}), where 𝒚\bm{y}^{*} is a realized value of 𝒀\bm{Y} such that 𝒚(𝒌)=𝒚(𝒌)\bm{y}^{*}_{(\bm{k})}=\bm{y}_{(\bm{k})} and 𝒚(𝒌𝒌)=𝒛\bm{y}^{*}_{(\bm{k}^{\prime}-\bm{k})}=\bm{z}. To ensure that this step can be implemented efficiently, we need two conditions: (i) 𝒌\bm{k}^{\prime} is monotone as we have assumed, and (ii) Condition (H1) holds for (𝒌,𝒚(𝒌))(\bm{k}^{\prime},\bm{y}_{(\bm{k}^{\prime})}^{*}). If these two conditions hold, then one can use methods in Appendix A to implement the P step. Of course, in each step of the DAI algorithm, 𝒛=𝒚(𝒌𝒌)\bm{z}=\bm{y}_{(\bm{k}^{\prime}-\bm{k})}^{*} changes. Despite this, due to the following result, which is easy to verify, we do not have to check (ii) in every step.

Proposition 3.

Let 𝐤\bm{k}^{\prime} be a monotone missing structure such that 𝐤𝐤\bm{k}\prec\bm{k}^{\prime}, and let 𝐲n×d\bm{y}^{*}\in\mathbb{R}^{n\times d} be such that 𝐲(𝐤)=𝐲(𝐤)\bm{y}_{(\bm{k})}^{*}=\bm{y}_{(\bm{k})}. Suppose that there exists a monotone structure 𝐤′′\bm{k}^{\prime\prime} such that 𝐤′′𝐤\bm{k}^{\prime\prime}\prec\bm{k}, and that Condition (H1) holds for (𝐤′′,𝐲(𝐤′′))(\bm{k}^{\prime\prime},\bm{y}_{(\bm{k}^{\prime\prime})}). Then Condition (H1) holds for (𝐤,𝐲(𝐤))(\bm{k}^{\prime},\bm{y}_{(\bm{k}^{\prime})}^{*}) regardless of the value of 𝐲(𝐤𝐤)\bm{y}_{(\bm{k}^{\prime}-\bm{k})}^{*}.

The DAI algorithm can be used to impute the missing values of 𝒀\bm{Y}. Indeed, (𝑩(t),𝚺(t),𝒁(t))t=0(\bm{B}(t),\bm{\Sigma}(t),\bm{Z}(t))_{t=0}^{\infty} is a Markov chain whose stationary distribution is the posterior distribution of (𝑩,𝚺,𝒀(𝒌𝒌))(\bm{B},\bm{\Sigma},\bm{Y}_{(\bm{k}^{\prime}-\bm{k})}) given 𝒀(𝒌)=𝒚(𝒌)\bm{Y}_{(\bm{k})}=\bm{y}_{(\bm{k})}. This is similar to the DA algorithm with post hoc imputation described in Section 3.2, especially if all the entries of 𝒌\bm{k}^{\prime} are 1. Compared to the DA algorithm with post hoc imputation, the DAI algorithm can be implemented even when 𝒌\bm{k} is not monotone. However, the intermediate imputation step I2 cannot be performed after the whole simulation process since the P step of DAI relies on the value of 𝒁(t+1)\bm{Z}(t+1).

Standard arguments show that (𝑩(t),𝚺(t),𝒁(t))t=0(\bm{B}(t),\bm{\Sigma}(t),\bm{Z}(t))_{t=0}^{\infty} and (𝑩(t),𝚺(t))t=0(\bm{B}(t),\bm{\Sigma}(t))_{t=0}^{\infty} have the same convergence rate in terms of total variation and L2L^{2} distances (Roberts and Rosenthal, 2001; Liu et al., 1994). Thus, when studying the convergence properties of the DAI algorithm, we can restrict our attention to (𝑩(t),𝚺(t))t=0(\bm{B}(t),\bm{\Sigma}(t))_{t=0}^{\infty} instead of (𝑩(t),𝚺(t),𝒁(t))t=0(\bm{B}(t),\bm{\Sigma}(t),\bm{Z}(t))_{t=0}^{\infty} even if we care about imputing missing data.
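
For comparison with the DA sketch in Section 3.2, one iteration of the DAI algorithm can be outlined as follows (plain R; draw_w(), impute_block(), and draw_beta_sigma() are hypothetical stand-ins). The key difference is the intermediate I2 step, whose output is required by the P step, which is why this imputation cannot be deferred.

    # Skeleton of the DAI algorithm associated with (y, k, k').
    run_dai <- function(y, x, k, k_prime, init, n_iter,
                        draw_w, impute_block, draw_beta_sigma) {
      beta <- init$beta; sigma <- init$sigma
      out <- vector("list", n_iter)
      for (t in seq_len(n_iter)) {
        w <- draw_w(y, x, k, beta, sigma)                   # I1 step
        z <- impute_block(y, x, k, k_prime, w, beta, sigma) # I2 step: draw Y_(k'-k)
        y_full <- y
        y_full[k_prime == 1 & k == 0] <- z                  # complete the response under k'
        draw <- draw_beta_sigma(y_full, x, k_prime, w)      # P step
        beta <- draw$beta; sigma <- draw$sigma
        out[[t]] <- list(beta = beta, sigma = sigma, z = z)
      }
      out
    }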

4 Convergence Analysis

4.1 Preliminaries

We start by reviewing some general concepts regarding the convergence properties of Markov chains. Let (𝒳,)(\mathcal{X},\mathcal{F}) be a measurable space. Consider a Markov chain (X(t))t=0(X(t))_{t=0}^{\infty} whose state space is (𝒳,)(\mathcal{X},\mathcal{F}), and let K:𝒳×[0,1]K:\mathcal{X}\times\mathcal{F}\to[0,1] be its transition kernel. For a signed measure μ\mu and a measurable function ff on (𝒳,)(\mathcal{X},\mathcal{F}), let

μf=𝒳f(x)μ(dx),\mu f=\int_{\mathcal{X}}f(x)\,\mu(\mbox{d}x),

and KK can act on μ\mu and ff as follows:

μK(A)=𝒳K(x,A)μ(dx),A,Kf(x)=𝒳f(y)K(x,dy),x𝒳,\mu K(A)=\int_{\mathcal{X}}K(x,A)\,\mu(dx),\;A\in\mathcal{F},\quad Kf(x)=\int_{\mathcal{X}}f(y)K(x,dy),\;x\in\mathcal{X},

assuming that the integrals are well-defined. Suppose that the chain has a stationary distribution π\pi, i.e., πK()=π()\pi K(\cdot)=\pi(\cdot). The chain is said to be reversible if, for A1,A2A_{1},A_{2}\in\mathcal{F},

A1×A2π(dx)K(x,dy)=A2×A1π(dx)K(x,dy).\int_{A_{1}\times A_{2}}\pi(\mbox{d}x)K(x,\mbox{d}y)=\int_{A_{2}\times A_{1}}\pi(\mbox{d}x)K(x,\mbox{d}y).

For two probability measures μ1\mu_{1} and μ2\mu_{2} on (𝒳,)(\mathcal{X},\mathcal{F}), denote μ1()μ2()TV||\mu_{1}(\cdot)-\mu_{2}(\cdot)||_{TV} as the total variation (TV) distance between the two probability measures. For t+t\in\mathbb{Z}_{+}, let Kt:𝒳×[0,1]K^{t}:\mathcal{X}\times\mathcal{F}\to[0,1] be the tt-step transition kernel of the chain, so that K1=KK^{1}=K, and μKt+1()=(μKt)K()\mu K^{t+1}(\cdot)=(\mu K^{t})K(\cdot) for any probability measure μ\mu on (𝒳,)(\mathcal{X},\mathcal{F}). A ϕ\phi-irreducible aperiodic Markov chain with stationary distribution π\pi is Harris ergodic if and only if limtKt(x,)π()TV=0\lim_{t\to\infty}||K^{t}(x,\cdot)-\pi(\cdot)||_{TV}=0, for all x𝒳x\in\mathcal{X} (Meyn and Tweedie, 2005; Roberts and Rosenthal, 2006). The chain is said to be geometrically ergodic if the chain is Harris ergodic and

Kt(x,)π()TVM(x)ρt,x𝒳,t+,||K^{t}(x,\cdot)-\pi(\cdot)||_{TV}\leq M(x)\rho^{t},\quad x\in\mathcal{X},\;t\in\mathbb{Z}^{+}, (5)

for some ρ[0,1)\rho\in[0,1) and M:𝒳[0,)M:\mathcal{X}\to[0,\infty). As mentioned in the Introduction, Harris ergodicity guarantees a law of large numbers for ergodic averages (Meyn and Tweedie, 2005, Theorem 17.1.7), and geometric ergodicity guarantees a central limit theorem for ergodic averages (Jones and Hobert, 2001; Jones, 2004; Flegal et al., 2008).

Another commonly used distance between probability measures is the L2L^{2} distance. Let L2(π)L^{2}(\pi) be the set of measurable functions f:𝒳f:\mathcal{X}\to\mathbb{R} such that

πf2:=𝒳f(x)2π(dx)<.\pi f^{2}:=\int_{\mathcal{X}}f(x)^{2}\,\pi(\mbox{d}x)<\infty.

The L2L^{2} distance between two probability measures μ1\mu_{1} and μ2\mu_{2} on (𝒳,)(\mathcal{X},\mathcal{F}) is

\|\mu_{1}(\cdot)-\mu_{2}(\cdot)\|_{2}=\sup_{f\in L^{2}(\pi):\,\pi f^{2}\leq 1}|\mu_{1}f-\mu_{2}f|.

Denote by L2(π)L_{*}^{2}(\pi) the set of probability measures μ\mu on (𝒳,)(\mathcal{X},\mathcal{F}) such that μ\mu is absolutely continuous with respect to π\pi and that dμ/dπ\mbox{d}\mu/\mbox{d}\pi, the density of μ\mu with respect to π\pi, is in L2(π)L^{2}(\pi). We say the chain is L2L^{2} geometrically ergodic if

μKt()π()2C(μ)ρt,μL2(π),t+,\|\mu K^{t}(\cdot)-\pi(\cdot)\|_{2}\leq C(\mu)\rho^{t},\quad\mu\in L_{*}^{2}(\pi),\;t\in\mathbb{Z}^{+}, (6)

for some ρ[0,1)\rho\in[0,1) and C:L2(π)[0,)C:L_{*}^{2}(\pi)\to[0,\infty). In some cases, e.g., when the state space is finite, the ρ\rhos in Equations (5) and (6) can be taken to be the same. For reversible Markov chains, geometric ergodicity implies L2L^{2} geometric ergodicity (Roberts and Rosenthal, 1997, Theorem 2.1 and Remark 2.3).

The infimum of ρ[0,1]\rho\in[0,1] such that Equation (6) holds for some C:L2(π)[0,)C:L_{*}^{2}(\pi)\to[0,\infty) is called the L2L^{2} convergence rate of the chain. The smaller this rate is, the faster the convergence.

4.2 Geometric ergodicity of the DA algorithm

To state our result, we define three classes of mixing distributions based on their behaviors near the origin. These classes were first examined by Hobert et al. (2018), who analyzed the DA algorithm when the response matrix is fully observed. We say that the mixing distribution Pmix()P_{\scriptsize\mbox{mix}}(\cdot) is zero near the origin if there exists θ>0\theta>0 such that \int_{0}^{\theta}P_{\scriptsize\mbox{mix}}(\mbox{d}w)=0. Assume now that Pmix()P_{\scriptsize\mbox{mix}}(\cdot) admits a density function pmix:(0,)[0,)p_{\scriptsize\mbox{mix}}:(0,\infty)\to[0,\infty) with respect to the Lebesgue measure. If there exists c>1c>-1 such that limw0pmix(w)/wc<\lim_{w\to 0}{p_{\scriptsize\mbox{mix}}(w)}/{w^{c}}<\infty, we say that Pmix()P_{\scriptsize\mbox{mix}}(\cdot) is polynomial near the origin with power cc. The mixing distribution Pmix()P_{\scriptsize\mbox{mix}}({\cdot}) is faster than polynomial near the origin if, for all c>0c>0, there exists κc>0\kappa_{c}>0 such that the ratio pmix(w)/wc{p_{\scriptsize\mbox{mix}}(w)}/{w^{c}} is strictly increasing on (0,κc)(0,\kappa_{c}).

Most commonly used mixing distributions fall into one of these three classes. Examples will be given after we state the main result of this section.

Again, fix an observed missing structure 𝒌\bm{k} and observed response 𝒚(𝒌)\bm{y}_{(\bm{k})}. Let did_{i} be the number of nonzero elements in the iith row of 𝒌\bm{k}. Recall that nn is the number of observations, pp is the number of predictors, dd is the dimension of the responses, and mm is a parameter in the prior distribution Equation (1).

Theorem 4.

Consider the DA algorithm targeting π𝐤(𝐲(𝐤))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}), as described in Section 3.2. Suppose that Condition (H2) holds, and that the conditional distribution of (𝐁,𝚺)(\bm{B},\bm{\Sigma}) given (𝐖,𝐘(𝐤))=(𝐰,𝐲(𝐤))(\bm{W},\bm{Y}_{(\bm{k})})=(\bm{w},\bm{y}_{(\bm{k})}) is proper for every 𝐰=(w1,,wn)(0,)n\bm{w}=(w_{1},\dots,w_{n})^{\top}\in(0,\infty)^{n}. If any one of the following conditions holds, then the posterior π𝐤(𝐲(𝐤))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}) is proper and the underlying Markov chain is geometrically ergodic.

  1. 1.

    Pmix()P_{\scriptsize\mbox{mix}}(\cdot) is zero near the origin;

  2. 2.

    Pmix()P_{\scriptsize\mbox{mix}}(\cdot) is faster than polynomial near the origin; or

  3. 3.

    Pmix()P_{\scriptsize\mbox{mix}}(\cdot) is polynomial near the origin with power c>c1c>c_{1}, where

    c1=np+mmin{d1,,dn}2.c_{1}=\frac{n-p+m-\min\{d_{1},\dots,d_{n}\}}{2}.
Remark 5.

Recall that when 𝐤\bm{k} is monotone, the conditional distribution of (𝐁,𝚺)(\bm{B},\bm{\Sigma}) given (𝐖,𝐘(𝐤))=(𝐰,𝐲(𝐤))(\bm{W},\bm{Y}_{(\bm{k})})=(\bm{w},\bm{y}_{(\bm{k})}) is proper for every 𝐰=(w1,,wn)(0,)n\bm{w}=(w_{1},\dots,w_{n})^{\top}\in(0,\infty)^{n} if Condition (H1) holds for (𝐤,𝐲(𝐤))(\bm{k},\bm{y}_{(\bm{k})}).

Remark 6.

When Pmix()P_{\scriptsize\mbox{mix}}(\cdot) has a density function with respect to the Lebesgue measure and 𝐘\bm{Y} is fully observed, Theorem 4 reduces to the main result in Hobert et al. (2018).

The proof of Theorem 4 is in Appendix B. In what follows, we list commonly used mixing distributions that fall into the three categories in Theorem 4. We also check whether each mixing distribution satisfies Condition (H2).

Zero near the origin

When the mixing distribution Pmix()P_{\scriptsize\mbox{mix}}(\cdot) is discrete with finite support, it is zero near the origin. This is the case when the errors follow a finite mixture of Gaussians. Obviously, Condition (H2) holds in this case.

The Pareto(a,b)(a,b) distribution has density p(wa,b)wb1,w[a,)p(w\mid a,b)\propto w^{-b-1},\;w\in[a,\infty), where a>0,b>0a>0,b>0. It is zero near the origin as the support is [a,)[a,\infty). Condition (H2) holds if b>d/2b>d/2.

Faster than polynomial near the origin

A generalized inverse Gaussian distribution GIG(a,b,q)\mbox{GIG}(a,b,q) with density

p(wa,b,q)wq1exp(aw+b/w2),p(w\mid a,b,q)\propto w^{q-1}\exp\left(-\frac{aw+b/w}{2}\right),

where a>0,b>0,qa>0,b>0,q\in\mathbb{R}, is faster than polynomial near the origin. Condition (H2) holds for any GIG distribution. When the mixing distribution is GIG, the distribution of the error is called Generalized Hyperbolic (Barndorff-Nielsen et al., 1982).

The density of an inverse gamma distribution IG(a,b)\mbox{IG}(a,b) is p(wa,b)wa1exp(b/w)p(w\mid a,b)\propto w^{-a-1}\exp({-b/w}), where a>0,b>0a>0,b>0. IG(a,b)\mbox{IG}(a,b) is faster than polynomial near the origin. Condition (H2) is satisfied if a>d/2a>d/2.

For μ\mu\in\mathbb{R} and v>0v>0, the Log-normal(μ,v)\mbox{Log-normal}(\mu,v) distribution has density

p(wμ,v)1wexp{(lnwμ)22v2}.p(w\mid\mu,v)\propto\frac{1}{w}\exp\left\{-\frac{(\ln w-\mu)^{2}}{2v^{2}}\right\}.

This distribution is faster than polynomial near the origin and Condition (H2) holds.

A Fréchet distribution with the shape α>0\alpha>0 and scale s>0s>0 is given by

p(wα,s)w(1+α)exp{(s/w)α}.p(w\mid\alpha,s)\propto w^{-(1+\alpha)}\exp\left\{-\left({s}/{w}\right)^{\alpha}\right\}.

It is faster than polynomial near the origin. Moreover, Condition (H2) holds whenever α>d/2\alpha>d/2.

Non-zero near the origin and polynomial near the origin

A Gamma(a,b)\mbox{Gamma}(a,b) distribution has density p(wa,b)wa1exp(bw)p(w\mid a,b)\propto w^{a-1}\exp({-bw}), where a>0a>0 and b>0b>0. Gamma(a,b)\mbox{Gamma}(a,b) is polynomial near the origin with power c=a1c=a-1. The power c>c1c>c_{1} if

a>np+mmin{d1,,dn}2+1,a>\frac{n-p+m-\min\{d_{1},\dots,d_{n}\}}{2}+1,

where c1c_{1} is given in Theorem 4. Condition (H2) always holds for Gamma distributions. In particular, when a=b=v/2a=b=v/2, the error has multivariate tt distribution with degrees of freedom vv, and c>c1c>c_{1} if

v>np+mmin{d1,,dn}+2.v>n-p+m-\min\{d_{1},\dots,d_{n}\}+2.

The Beta(a,b)\mbox{Beta}(a,b) distribution has density p(u\mid a,b)\propto u^{a-1}(1-u)^{b-1},\ u\in(0,1), where a>0a>0 and b>0b>0. When b=1b=1, the error follows a multivariate slash distribution (Lange and Sinsheimer, 1993; Rogers and Tukey, 1972). Beta(a,b)\mbox{Beta}(a,b) is polynomial near the origin with power c=a1c=a-1. Condition (H2) always holds for Beta distributions.

The Weibull(a,b)\mbox{Weibull}(a,b) distribution has density p(ua,b)ua1exp{(u/b)a}p(u\mid a,b)\propto u^{a-1}\exp\{-(u/b)^{a}\}, where a>0a>0 and b>0b>0. Weibull(a,b)\mbox{Weibull}(a,b) is polynomial near the origin with power c=a1c=a-1. Condition (H2) always holds for Weibull distributions.

The F(a,b)F(a,b) distribution has density p(wa,b)wa/21(aw+b)(a+b)/2p(w\mid a,b)\propto w^{a/2-1}(aw+b)^{-(a+b)/2}, where a>0a>0 and b>0b>0. F(a,b)F(a,b) is polynomial near the origin with power c=a/21c=a/2-1. The power c>c1c>c_{1} if

a>np+mmin{d1,,dn}+2.a>n-p+m-\min\{d_{1},\dots,d_{n}\}+2.

Condition (H2) is satisfied if b>db>d.
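
In practice, the power condition in Theorem 4 is easy to check numerically. The sketch below (plain R, with illustrative numbers that do not refer to any data set in this paper) does so for a Gamma(a,b) mixing distribution, whose power is c = a - 1.

    # c1 from Theorem 4 and the power condition for Gamma(a, b) mixing (power c = a - 1).
    c1_value <- function(n, p, m, d_obs) (n - p + m - min(d_obs)) / 2
    gamma_power_ok <- function(a, n, p, m, d_obs) (a - 1) > c1_value(n, p, m, d_obs)

    # Illustration: n = 30, p = 3, m = 2, all d_i = 2, so c1 = 13.5 and a = 16 works.
    gamma_power_ok(a = 16, n = 30, p = 3, m = 2, d_obs = rep(2, 30))  # TRUE: 15 > 13.5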

4.3 Harris ergodicity of the DAI algorithm

Now we give sufficient conditions for the DAI algorithm to be Harris ergodic.

Theorem 7.

Assume that Condition (H2) holds. Let 𝐤\bm{k} be a missing structure. Suppose that there is a missing structure 𝐤\bm{k}^{\prime} and a realized value of 𝐘(𝐤)\bm{Y}_{(\bm{k}^{\prime})} denoted by 𝐳\bm{z} such that 𝐤𝐤\bm{k}^{\prime}\prec\bm{k}, and that π𝐤(𝐳)\pi_{\bm{k}^{\prime}}(\cdot\mid\bm{z}), the conditional density of (𝐁,𝚺)(\bm{B},\bm{\Sigma}) given 𝐘(𝐤)=𝐳\bm{Y}_{(\bm{k}^{\prime})}=\bm{z}, is proper. Then, for Lebesgue almost every 𝐳\bm{z}^{\prime} in the range of 𝐘(𝐤𝐤)\bm{Y}_{(\bm{k}-\bm{k}^{\prime})}, the posterior density π𝐤(𝐲(𝐤))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}), where 𝐲\bm{y} satisfies 𝐲(𝐤)=𝐳\bm{y}_{(\bm{k}^{\prime})}=\bm{z} and 𝐲(𝐤𝐤)=𝐳\bm{y}_{(\bm{k}-\bm{k}^{\prime})}=\bm{z}^{\prime}, is proper, and any DAI chain targeting this posterior is Harris recurrent.

Proof.

The posterior density π𝒌(𝒚(𝒌))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}) can be regarded as the posterior density of (𝑩,𝚺)(\bm{B},\bm{\Sigma}) given 𝒀(𝒌𝒌)=𝒛\bm{Y}_{(\bm{k}-\bm{k}^{\prime})}=\bm{z}^{\prime} when the prior density is π𝒌(𝒛)\pi_{\bm{k}^{\prime}}(\cdot\mid\bm{z}). Since π𝒌(𝒛)\pi_{\bm{k}^{\prime}}(\cdot\mid\bm{z}) is assumed to be proper, this posterior is proper for almost every possible value of 𝒛\bm{z}^{\prime}.

Under Condition (H2) and posterior propriety, a simple application of Theorem 6(v) of Roberts and Rosenthal (2006) shows that a DAI chain targeting the posterior is Harris recurrent. ∎

By Theorem 4, a posterior density π𝒌(𝒚(𝒌))\pi_{\bm{k}^{\prime}}(\cdot\mid\bm{y}_{(\bm{k}^{\prime})}) is proper if each of the following conditions holds:

  • 𝒌\bm{k}^{\prime} is monotone, and Condition (H1) holds for (𝒌,𝒚(𝒌))(\bm{k}^{\prime},\bm{y}_{(\bm{k}^{\prime})});

  • Pmix()P_{\scriptsize\mbox{mix}}(\cdot) satisfies Condition (H2); and

  • Pmix()P_{\scriptsize\mbox{mix}}(\cdot) is either zero near the origin, or faster than polynomial near the origin, or polynomial near the origin with power c>c1c>c_{1}, where

    c1=np+mmin{d1,,dn}2,c_{1}=\frac{n-p+m-\min\{d^{\prime}_{1},\dots,d^{\prime}_{n}\}}{2},

    and for i{1,,n}i\in\{1,\dots,n\}, did^{\prime}_{i} is the number of nonzero entries in the iith row of 𝒌\bm{k}^{\prime}.

By Theorem 7, under Condition (H2), to ensure Harris ergodicity of a DAI algorithm targeting π𝒌(𝒚(𝒌))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}) in an almost sure sense, one only needs to find a missing structure 𝒌𝒌\bm{k}^{\prime}\prec\bm{k} that satisfies the conditions above. In theory, when such a 𝒌\bm{k}^{\prime} exists, it is still possible that the posterior π𝒌(𝒚(𝒌))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}) is improper – even though it only happens on a zero measure set. See Fernández and Steel (1999) for a detailed discussion on this subtlety.

4.4 Comparison between the DA and DAI algorithms

Again, fix a missing structure 𝒌\bm{k} and a realized response 𝒚\bm{y}, and let did_{i} be the number of nonzero elements in the iith row of 𝒌\bm{k}. Assume that there is at least one missing entry, i.e., di<dd_{i}<d for some i{1,,n}i\in\{1,\dots,n\}. Assume also that π𝒌(𝒚(𝒌))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}) is proper. In principle, one can either use the DA algorithm (with or without post hoc imputation) or the DAI algorithm associated with (𝒚,𝒌,𝒌)(\bm{y},\bm{k},\bm{k}^{\prime}) where 𝒌𝒌\bm{k}\prec\bm{k}^{\prime} to sample from the posterior. In this subsection, we compare the two algorithms in terms of their L2L^{2} convergence rates. This comparison is important when 𝒌\bm{k} is monotone, and Conditions (H1) and (H2) hold. In this case, both algorithms can be efficiently implemented.

Let SS and TT be two random elements, defined on the measurable spaces (𝒳1,1)(\mathcal{X}_{1},\mathcal{F}_{1}) and (𝒳2,2)(\mathcal{X}_{2},\mathcal{F}_{2}) respectively. Denote by π\pi the marginal distribution of SS. A generic data augmentation algorithm for sampling from π\pi simulates a Markov chain (S(t))t=0(S(t))_{t=0}^{\infty} that is reversible with respect to π\pi. Given the current state S(t)S(t), the next state S(t+1)S(t+1) is generated through the following procedure.

  1. 1.

    I step. Draw TT^{*} from the conditional distribution of TT given S=S(t)S=S(t). Call the observed value tt^{*}.

  2. 2.

    P step. Draw S(t+1)S(t+1) from the conditional distribution of SS given T=tT=t^{*}.

The DA and DAI algorithms for Bayesian robust linear regression are special cases of the above method. Indeed, let S_{1}, T_{1}, and T_{2} be three random elements such that the joint distribution of S_{1}, T_{1}, and T_{2} is the conditional joint distribution of (\bm{B},\bm{\Sigma}), \bm{W}, and \bm{Y}_{(\bm{k}^{\prime}-\bm{k})} given \bm{Y}_{(\bm{k})}=\bm{y}_{(\bm{k})}. Then, taking S=S_{1} and T=T_{1} in the generic algorithm yields the DA algorithm; taking S=S_{1} and T=(T_{1},T_{2}) yields the DAI algorithm associated with (\bm{y},\bm{k},\bm{k}^{\prime}).

The L2L^{2} convergence rate of the generic data augmentation chain is precisely the squared maximal correlation between SS and TT (Liu et al., 1994). The maximal correlation between SS and TT is

γ(S,T):=supcorr[f(S),g(T)],\gamma(S,T):=\sup\mbox{corr}[f(S),g(T)],

where corr means linear correlation, and the supremum is taken over real functions ff and gg such that the variances of f(S)f(S) and g(T)g(T) are finite. Since any function of T_{1} alone is also a function of (T_{1},T_{2}), the supremum defining γ(S_{1},(T_{1},T_{2})) is taken over a larger class of functions, and thus

γ(S1,T1)γ(S1,(T1,T2)).\gamma(S_{1},T_{1})\leq\gamma(S_{1},(T_{1},T_{2})).

We then have the following result.

Theorem 8.

Suppose that π𝐤(𝐲(𝐤))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}) is proper, and that the conditional distributions in the DA and DAI algorithms are well-defined. Then, the L2L^{2} convergence rate of the DA chain targeting π𝐤(𝐲(𝐤))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}) is at least as small as that of any DAI chain targeting π𝐤(𝐲(𝐤))\pi_{\bm{k}}(\cdot\mid\bm{y}_{(\bm{k})}).

Recall that a smaller convergence rate means faster convergence. Thus, when computation time is not considered and the observed missing structure is monotone, the DA algorithm is faster than the DAI algorithm. In this case, imputation of missing data, if needed, should be performed in a post hoc rather than intermediate manner. In Section 5 we use numerical experiments to show that this appears to be the case even after computation cost is taken into account.

5 Numerical Experiment

We compare the performance of the DA and DAI algorithms using simulated data. All simulations are implemented through the Bayesianrobust R package.

Suppose that we have n=50n=50 observations in a study. Assume that for each observation, the response has d=2d=2 components, while the predictor has the form 𝒙i=(1,xi)\bm{x}_{i}=(1,x_{i})^{\top} where xix_{i}\in\mathbb{R}. We generate the xix_{i}s using independent normal distributions. The response matrix 𝒚\bm{y} is generated according to the robust linear regression model, with the mixing distribution being Gamma(1,1)\mbox{Gamma}(1,1). The simulated data set is fixed throughout the section.

On the modeling side, we consider an independence Jeffreys prior with m=d=2m=d=2 and 𝒂=𝟎\bm{a}=\bm{0} (see Equation (1)) and three mixing distributions:

  • G: The mixing distribution is Gamma(2,2)(2,2). The error follows a multivariate tt distribution with 4 degrees of freedom. By Equation (4), in the I step of the DA algorithm, one draws nn independent Gamma random variables.

  • GIG: The mixing distribution is GIG(1,1,0.5)(1,1,-0.5). The error follows a generalized hyperbolic distribution. By Equation (4), in the I step of the DA algorithm, one draws nn independent generalized inverse Gaussian random variables.

  • P: The mixing distribution is the point mass at 1. The error distribution is multivariate normal. In this case, the posterior can be sampled exactly by the DA algorithm.

We study three realized missing structures. Under these missing structures, the response matrix 𝒚\bm{y} has, respectively, 45, 40, and 35 rows fully observed. The other rows all have only the second entry observed. It is clear that all three missing structures are monotone.

In total, we consider nine combinations of mixing distributions and missing structures. We apply both the DA and DAI algorithms in each scenario. In the I2 step of the DAI algorithm (see Section 3.3), 𝒌\bm{k}^{\prime} is taken to be the n×dn\times d matrix whose entries are all 1, i.e., 𝒌\bm{k}^{\prime} corresponds to the case that the response matrix is fully observed. At the end of each iteration of the DA algorithm, a post hoc imputation step is performed, so both algorithms impute all missing response entries.

Consider estimating the regression coefficients 𝑩\bm{B} and scatter matrix 𝚺\bm{\Sigma} using the posterior mean computed via the DA or DAI algorithm. We compare the efficiency of the two algorithms based on the effective sample size (ESS) of each component separately and all components jointly. At a given simulation length NN, the ESS of an MCMC estimator is defined to be NN times the posterior variance divided by the asymptotic variance of the estimator (Vats et al., 2019). To account for computation cost, we also consider the ESS per minute, ESSpm=ESS/tN\mbox{ESSpm}=\mbox{ESS}/t_{N}, where tNt_{N} is the number of minutes needed to run NN iterations of the algorithm. The simulation length is set to be N=30,000N=30,000 without burn-in, and the initial values are the ordinary least squares estimates using the observations that belong to pattern 11, i.e., the observations without missing elements.
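
A sketch of the efficiency summary is given below, assuming the mcmcse R package for the componentwise (ess) and multivariate (multiESS) effective sample size estimators, and a chain stored as a matrix with one column per component of (B, Σ); the wrapper function and its minutes argument are ours.

    # Efficiency summary for one sampler, assuming the mcmcse package.
    library(mcmcse)

    ess_summary <- function(chain, minutes) {
      m_ess <- multiESS(chain)              # joint ESS of all components
      list(
        ess_component = ess(chain),         # componentwise ESS
        ess_joint     = m_ess,
        esspm_joint   = m_ess / minutes     # ESS per minute
      )
    }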

Figure 1: The ESS and ESSpm of all components jointly in different scenarios. The x axis lists the mixing distributions and samplers used, and the legend labels the missing structures by the number of fully observed rows.

Figure 1 gives the ESS and ESSpm of all components jointly for each of the 3×3×23\times 3\times 2 combinations of mixing distributions, missing structures, and MCMC algorithms. The ESS and ESSpm of each component separately are included in Tables 2 and 3 in Appendix C. We see that the DA algorithm gives a larger ESS compared to the DAI algorithm. For the DAI algorithm, the ESS is lower when there are more missing data. Similar trends appear when we consider the ESSpm. In short, imputation of missing response values slows an algorithm down. This is consistent with our theoretical results. (Strictly speaking, our theoretical results concern convergence rate, not effective sample size, but for data augmentation chains, which are reversible and positive, these two concepts are closely related. See, e.g., Rosenthal (2003), Section 3.)

In addition to the experiment above, we consider another series of mixing distributions, namely Gamma(v,v)\mbox{Gamma}(v,v) with v=2,8,25v=2,8,25. The resultant error distributions are multivariate tt distributions with 2v2v degrees of freedom. Applying the DA and DAI algorithms to these models, we obtain Figure 2, which displays the ESS and ESSpm of all components jointly. The ESS and ESSpm of each component separately are in Tables 4 and 5 in Appendix C. Our simulation shows that when the model assumes that the error distribution has a lighter tail, the MCMC algorithms tend to be more efficient.

Figure 2: The ESS and ESSpm of all components jointly in different scenarios. The x axis lists the Gamma(v,v)(v,v) mixing distributions and samplers used, and the legend labels the missing structures by the number of fully observed rows.

6 Conclusion

We conducted a convergence analysis of data augmentation algorithms used in Bayesian robust multivariate linear regression with incomplete data. The algorithms were first proposed by Liu (1996), but little was previously known about their theoretical properties when the response matrix contains missing values. We consider two versions of the algorithm, DA and DAI. The DA algorithm can only be applied when the missing structure is monotone, whereas the DAI algorithm can be implemented for an arbitrary missing structure. We establish geometric ergodicity of the DA algorithm under simple conditions. For the DAI algorithm, we give conditions for Harris ergodicity. We also compare the L^{2} convergence rates of the DA and DAI algorithms: the L^{2} convergence rate of the DA algorithm is at least as small as that of a corresponding DAI algorithm. A numerical study is provided. Under monotone missing structures, the DA algorithm outperforms the DAI algorithm both computationally and theoretically.

Acknowledgments

The first and second authors were supported by the National Science Foundation [grant number DMS-2112887], and the third author was supported by the National Science Foundation [grant number DMS-2152746].

Appendix

Appendix A Details of the DA algorithm

The DA algorithm consists of two steps, the I step and the P step. The I step is shown in Section 3.2. In this section, we describe the P step when the missing structure 𝒌\bm{k} is monotone and Condition (H1) holds.

Recall from Section 3.1 that the monotone missing structure \bm{k} has d possible patterns. For \ell\in\{1,\dots,d\}, n_{\ell}(\bm{k}) is the number of observations that belong to pattern \ell, and \bm{x}_{(\bm{k},\ell)} and \bm{y}_{(\bm{k},\ell)} are the submatrices of \bm{x} and \bm{y}, respectively, defined in Section 3.1. Define an n\times n diagonal matrix \bm{\lambda}_{\ell}=\mbox{diag}(w_{1},\dots,w_{\sum_{j=1}^{\ell}n_{j}(\bm{k})},0,\dots,0). That is, the first \sum_{j=1}^{\ell}n_{j}(\bm{k}) diagonal entries of \bm{\lambda}_{\ell} are w_{1},\dots,w_{\sum_{j=1}^{\ell}n_{j}(\bm{k})}, listed in order, and all the other entries are 0. Similarly, let \bm{\lambda}^{\prime}_{\ell}=\mbox{diag}(w_{1},\dots,w_{\sum_{j=1}^{\ell}n_{j}(\bm{k})}). Denote by \hat{\bm{\beta}}_{\ell} the weighted least squares estimate of the regression coefficient of \bm{y}_{(\bm{k},\ell)} on \bm{x}_{(\bm{k},\ell)} with weights (w_{1},\dots,w_{\sum_{j=1}^{\ell}n_{j}(\bm{k})}), and by \bm{s}_{\ell} the corresponding weighted total sum of residual squares and cross-products matrix, i.e.,

\hat{\bm{\beta}}_{\ell}=\left(\bm{x}^{\top}\bm{\lambda}_{\ell}\bm{x}\right)^{-1}\bm{x}_{(\bm{k},\ell)}^{\top}\bm{\lambda}^{\prime}_{\ell}\bm{y}_{(\bm{k},\ell)},\quad\bm{s}_{\ell}=\left(\bm{y}_{(\bm{k},\ell)}-\bm{x}_{(\bm{k},\ell)}\hat{\bm{\beta}}_{\ell}\right)^{\top}\bm{\lambda}^{\prime}_{\ell}\left(\bm{y}_{(\bm{k},\ell)}-\bm{x}_{(\bm{k},\ell)}\hat{\bm{\beta}}_{\ell}\right).

Let \bm{a}_{\ell} be the lower right (d-\ell+1)\times(d-\ell+1) submatrix of the positive semi-definite matrix \bm{a} given in the prior (Equation (1)).

P step. Draw (\bm{B}(t+1),\bm{\Sigma}(t+1)) from the conditional distribution of (\bm{B},\bm{\Sigma}) given (\bm{W},\bm{Y}_{(\bm{k})})=(\bm{w},\bm{y}_{(\bm{k})}) using the following procedure.

  1. Draw \bm{\Sigma}(t+1) given (\bm{W},\bm{Y}_{(\bm{k})})=(\bm{w},\bm{y}_{(\bm{k})}):

    For \ell\in\{1,\dots,d\}, let \bm{c}_{\ell}=\bm{a}_{\ell}+\bm{s}_{\ell}, and let \bm{e}_{\ell} be the lower triangular Cholesky factor of \bm{c}_{\ell}^{-1} (so that \bm{c}_{\ell}^{-1}=\bm{e}_{\ell}\bm{e}_{\ell}^{\top}). Draw a sequence of random vectors \bm{F}_{1},\dots,\bm{F}_{d} such that \bm{F}_{\ell}=(F_{\ell,\ell},\dots,F_{d,\ell})^{\top} is (d-\ell+1)-dimensional, and that

    (a) F_{ij}\sim N(0,1) for 1\leq j<i\leq d,

    (b) F_{ii}^{2}\sim\chi^{2}(df_{i}) for i\in\{1,\dots,d\}, where df_{i}=(\sum_{j=1}^{i}n_{j})-i+m-p-d+1,

    (c) F_{ij} are independent for all 1\leq j\leq i\leq d.

    Denote the sampled values by \bm{f}_{1},\dots,\bm{f}_{d}.

    Let \bm{h}_{\ell}=\bm{e}_{\ell}\bm{f}_{\ell} for \ell\in\{1,\dots,d\}, and let \bm{h} be a d\times d lower triangular matrix whose lower triangular non-zero part is formed by the columns \bm{h}_{1},\dots,\bm{h}_{d}. Then \bm{\varsigma}=(\bm{hh}^{\top})^{-1} serves as a sampled value of \bm{\Sigma}(t+1). (A code sketch of this step is given after the procedure.)

  2. Draw \bm{B}(t+1) given (\bm{\Sigma}(t+1),\bm{W},\bm{Y}_{(\bm{k})})=(\bm{\varsigma},\bm{w},\bm{y}_{(\bm{k})}):

    Independently, draw p-dimensional standard normal random vectors \bm{Z}_{1},\dots,\bm{Z}_{d}, and call the sampled values \bm{z}_{1},\dots,\bm{z}_{d}. For \ell\in\{1,\dots,d\}, let \bm{u}_{\ell} be the lower triangular Cholesky factor of (\bm{x}^{\top}\bm{\lambda}_{\ell}\bm{x})^{-1}. Then

    (\bm{u}_{1}\bm{z}_{1},\dots,\bm{u}_{d}\bm{z}_{d})\bm{h}^{-1}+(\hat{\bm{\beta}}_{1}\bm{h}_{1},\dots,\hat{\bm{\beta}}_{d}\bm{h}_{d})\bm{h}^{-1}

    serves as a sampled value of \bm{B}(t+1).
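The following is a minimal R sketch of Step 1 of the procedure, assuming the matrices \bm{c}_{1},\dots,\bm{c}_{d} and the degrees of freedom df_{1},\dots,df_{d} have already been formed; the function and argument names are hypothetical and are not taken from the accompanying R package.

  # Draw Sigma(t+1) given c_l = a_l + s_l (a list of length d, where c_l is
  # (d - l + 1) x (d - l + 1)) and the degrees of freedom df_1, ..., df_d.
  draw_sigma_monotone <- function(c_list, df) {
    d <- length(c_list)
    h <- matrix(0, d, d)
    for (l in 1:d) {
      e_l <- t(chol(solve(c_list[[l]])))   # lower Cholesky factor of c_l^{-1}
      f_l <- c(sqrt(rchisq(1, df[l])),     # F_{l,l}, with F_{l,l}^2 ~ chi^2(df_l)
               rnorm(d - l))               # F_{l+1,l}, ..., F_{d,l} ~ N(0, 1)
      h[l:d, l] <- e_l %*% f_l             # column l of the lower triangular h
    }
    solve(h %*% t(h))                      # varsigma = (h h^T)^{-1}
  }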

Let us quickly consider a special case. Recall that \bm{k}_{0}\in\{0,1\}^{n\times d} is the missing structure that corresponds to a completely observable response, i.e., all elements of \bm{k}_{0} are 1. Then \bm{Y}_{(\bm{k}_{0})}=\bm{Y}, \bm{y}_{(\bm{k}_{0})}=\bm{y}, and \bm{k}_{0} is monotone. Let \bm{y}:\bm{x} be the matrix obtained by attaching \bm{x} to the right of \bm{y}. Then Condition (H1) for (\bm{k}_{0},\bm{y}_{(\bm{k}_{0})}) is equivalent to

r(\bm{y}:\bm{x})=p+d,\quad n>p+2d-m-1. (H1.0)

Note that if there is some monotone \bm{k} such that Condition (H1) holds for (\bm{k},\bm{y}_{(\bm{k})}), then Condition (H1.0) necessarily holds. Under Condition (H1.0), the conditional distribution of (\bm{B},\bm{\Sigma}) given (\bm{W},\bm{Y})=(\bm{w},\bm{y}) is proper and rather simple. We say \bm{Z}\sim\mbox{N}_{p,d}(\bm{\mu},\bm{u},\bm{v}) for \bm{\mu}\in\mathbb{R}^{p\times d}, \bm{u}\in S_{+}^{p\times p}, and \bm{v}\in S_{+}^{d\times d} if \bm{Z} is a p\times d random matrix associated with the probability density function

\frac{\exp\left[-{\frac{1}{2}}\,\mathrm{tr}\left\{\bm{v}^{-1}(\bm{z}-\bm{\mu})^{T}\bm{u}^{-1}(\bm{z}-\bm{\mu})\right\}\right]}{(2\pi)^{pd/2}|\bm{v}|^{p/2}|\bm{u}|^{d/2}},\quad\bm{z}\in\mathbb{R}^{p\times d}.

\mbox{N}_{p,d} is called a matrix normal distribution. We say \bm{Z}\sim\mbox{IW}_{d}(\nu,\bm{\psi}) for \nu>d-1 and \bm{\psi}\in S_{+}^{d\times d} if \bm{Z} is a d\times d random matrix associated with the probability density function

\frac{|\bm{\psi}|^{\nu/2}}{2^{\nu d/2}\Gamma_{d}(\nu/2)}|\bm{z}|^{-(\nu+d+1)/2}\exp\left\{-\frac{1}{2}\mbox{tr}\left(\bm{\psi}\bm{z}^{-1}\right)\right\},\quad\bm{z}\in S_{+}^{d\times d},

where \mbox{tr}(\cdot) returns the trace of a matrix, and \Gamma_{d}(\cdot) is a multivariate gamma function. \mbox{IW}_{d} is called an inverse Wishart distribution, and it is well-known that a random matrix follows the \mbox{IW}_{d}(\nu,\bm{\psi}) distribution if and only if its inverse follows the Wishart distribution \mbox{W}_{d}(\nu,\bm{\psi}^{-1}), which has density

\frac{|\bm{\psi}|^{\nu/2}}{2^{\nu d/2}\Gamma_{d}(\nu/2)}|\bm{z}|^{(\nu-d-1)/2}\exp\left\{-\frac{1}{2}\mbox{tr}\left(\bm{\psi}\bm{z}\right)\right\},\quad\bm{z}\in S_{+}^{d\times d}.

For \bm{w}=(w_{1},\dots,w_{n})\in(0,\infty)^{n}, let \bm{\lambda}=\mbox{diag}(w_{1},\dots,w_{n}), i.e., the diagonal matrix whose diagonal elements are w_{1},\dots,w_{n}, listed in order. Let

\hat{\bm{\beta}}=\left(\bm{x}^{\top}\bm{\lambda}\bm{x}\right)^{-1}\bm{x}^{\top}\bm{\lambda}\bm{y},\quad\bm{s}=\left(\bm{y}-\bm{x}\hat{\bm{\beta}}\right)^{\top}\bm{\lambda}\left(\bm{y}-\bm{x}\hat{\bm{\beta}}\right),

and \bm{\xi}=(\bm{x}^{\top}\bm{\lambda}\bm{x})^{-1}. Then it is easy to see from Equation (2) that

\bm{\Sigma}\mid(\bm{W},\bm{Y})=(\bm{w},\bm{y}) \sim\mbox{IW}_{d}\left(n-p+m-d,\bm{s}+\bm{a}\right),
\bm{B}\mid(\bm{\Sigma},\bm{W},\bm{Y})=(\bm{\varsigma},\bm{w},\bm{y}) \sim\mbox{N}_{p,d}\left(\hat{\bm{\beta}},\bm{\xi},\bm{\varsigma}\right).
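As an illustration, here is a minimal R sketch of this complete-data P step, assuming Condition (H1.0) holds and that the design matrix x, the response y, the weights w, the prior matrix a, and the prior parameter m are available; the function name is hypothetical and the sketch is not taken from the accompanying R package.

  p_step_complete <- function(x, y, w, a, m) {
    n <- nrow(x); p <- ncol(x); d <- ncol(y)
    lambda   <- diag(w, n)
    xi       <- solve(t(x) %*% lambda %*% x)     # (x' lambda x)^{-1}
    beta_hat <- xi %*% t(x) %*% lambda %*% y     # weighted least squares estimate
    res      <- y - x %*% beta_hat
    s        <- t(res) %*% lambda %*% res        # weighted residual SSCP matrix

    # Sigma | (W, Y): inverse Wishart, drawn as the inverse of a Wishart draw.
    nu        <- n - p + m - d
    sigma_inv <- rWishart(1, df = nu, Sigma = solve(s + a))[, , 1]
    sigma     <- solve(sigma_inv)

    # B | (Sigma, W, Y): matrix normal N_{p,d}(beta_hat, xi, sigma).
    z <- matrix(rnorm(p * d), p, d)
    b <- beta_hat + t(chol(xi)) %*% z %*% chol(sigma)
    list(B = b, Sigma = sigma)
  }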

Appendix B Proof of Theorem 4

We prove posterior propriety and geometric ergodicity by establishing drift and minorization (d&m) conditions, which we now describe. Let (\mathcal{X},\mathcal{F}) be a measurable space. Consider a Markov chain (X(t))_{t=0}^{\infty} whose state space is (\mathcal{X},\mathcal{F}), and let K:\mathcal{X}\times\mathcal{F}\to[0,1] be its transition kernel. We say that (X(t)) satisfies a d&m condition with drift function V:\mathcal{X}\to[0,\infty), minorization measure \nu:\mathcal{F}\to[0,1], and parameters (\eta,\varrho,\delta,\epsilon)\in\mathbb{R}^{4} if each of the following two conditions holds:

Drift condition:

For x\in\mathcal{X},

KV(x):=\int_{\mathcal{X}}V(x^{\prime})K(x,\mbox{d}x^{\prime})\leq\eta V(x)+\varrho,

where \eta<1 and \varrho<\infty. Note that KV(x) can be interpreted as the conditional expectation of V(X(t+1)) given X(t)=x, where t can be any non-negative integer.

Minorization condition:

Let \nu be a probability measure, \epsilon>0, and \delta>2\varrho/(1-\eta). Moreover, whenever V(x)<\delta,

K(x,A)\geq\epsilon\nu(A)\;\text{ for each }A\in\mathcal{F}.

It is well-known that if a d&m condition holds, then the Markov chain has a proper stationary distribution, and it is geometrically ergodic (Rosenthal, 1995). See also Jones and Hobert (2001), Roberts and Rosenthal (2004), and Meyn and Tweedie (2005).

We begin by establishing a minorization condition for the DA algorithm associated with a realized response \bm{y} and missing structure \bm{k}. As before, d_{i} will be used to denote the number of nonzero elements in the ith row of \bm{k}. Let (\bm{B}(t),\bm{\Sigma}(t))_{t=0}^{\infty} be the underlying Markov chain, and denote its Markov transition kernel by K. Set \mathcal{X}=\mathbb{R}^{p\times d}\times S_{+}^{d\times d}, and let \mathcal{F} be the usual Borel \sigma-algebra associated with \mathcal{X}. Define a drift function

V(\bm{\beta},\bm{\varsigma})=\sum_{i=1}^{n}r_{i,(\bm{k})},

where, for i\in\{1,\dots,n\},

r_{i,(\bm{k})}=\left(\bm{y}_{i,(\bm{k})}-\bm{c}_{i,(\bm{k})}\bm{\beta}^{\top}\bm{x}_{i}\right)^{\top}\left(\bm{c}_{i,(\bm{k})}\bm{\varsigma}\bm{c}_{i,(\bm{k})}^{\top}\right)^{-1}\left(\bm{y}_{i,(\bm{k})}-\bm{c}_{i,(\bm{k})}\bm{\beta}^{\top}\bm{x}_{i}\right).
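For concreteness, the following is a minimal R sketch of this drift function, assuming \bm{c}_{i,(\bm{k})} acts by selecting the observed coordinates of row i, so that k below is the n\times d matrix of 0s and 1s with 1 indicating an observed entry; the function and argument names are hypothetical.

  # V(beta, sigma) = sum_i r_{i,(k)} for a realized response y and structure k.
  drift_V <- function(beta, sigma, x, y, k) {
    n <- nrow(y)
    r <- numeric(n)
    for (i in 1:n) {
      obs   <- which(k[i, ] == 1)                   # observed coordinates of row i
      resid <- y[i, obs] - drop(t(beta) %*% x[i, ])[obs]
      r[i]  <- drop(t(resid) %*% solve(sigma[obs, obs, drop = FALSE]) %*% resid)
    }
    sum(r)
  }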

Then the following holds.

Lemma 9.

For any \delta>0, there exist a probability measure \nu:\mathcal{F}\to[0,1] and \epsilon>0 such that whenever V(\bm{\beta},\bm{\varsigma})<\delta,

K((\bm{\beta},\bm{\varsigma}),A)>\epsilon\nu(A)\;\text{ for each }A\in\mathcal{F}.
Proof.

One can write

K((\bm{\beta},\bm{\varsigma}),A)=\int_{(0,\infty)^{n}}Q_{{\bm{k}}}(A\mid\bm{w},\bm{y}_{(\bm{k})})\,P_{{\bm{k}}}(\mbox{d}\bm{w}\mid\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})}),

where P_{{\bm{k}}}(\cdot\mid\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})}) is the conditional distribution of \bm{W} given (\bm{B},\bm{\Sigma},\bm{Y}_{(\bm{k})})=(\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})}), and Q_{{\bm{k}}}(\cdot\mid\bm{w},\bm{y}_{(\bm{k})}) is the conditional distribution of (\bm{B},\bm{\Sigma}) given (\bm{W},\bm{Y}_{(\bm{k})})=(\bm{w},\bm{y}_{(\bm{k})}), both derived from Equation (2). As stated in Section 3.2,

P_{{\bm{k}}}(\mbox{d}\bm{w}\mid\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})})=\prod_{i=1}^{n}\frac{w_{i}^{d_{i}/2}\exp\left(-r_{i,(\bm{k})}w_{i}/2\right)\,P_{\scriptsize\mbox{mix}}(\mbox{d}w_{i})}{\int_{0}^{\infty}w^{d_{i}/2}\exp\left(-r_{i,(\bm{k})}w/2\right)\,P_{\scriptsize\mbox{mix}}(\mbox{d}w)},
\quad\bm{w}=(w_{1},\dots,w_{n})^{\top}\in(0,\infty)^{n}.

Assume that V(\bm{\beta},\bm{\varsigma})<\delta for some \delta>0. Then r_{i,(\bm{k})}<\delta for each i. It follows that, for any measurable A^{\prime}\subseteq(0,\infty)^{n},

P_{{\bm{k}}}(A^{\prime}\mid\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})})\geq\epsilon\int_{A^{\prime}}\prod_{i=1}^{n}\frac{w_{i}^{d_{i}/2}\exp\left(-\delta w_{i}/2\right)\,P_{\scriptsize\mbox{mix}}(\mbox{d}w_{i})}{\int_{0}^{\infty}w^{d_{i}/2}\exp\left(-\delta w/2\right)\,P_{\scriptsize\mbox{mix}}(\mbox{d}w)},

where

\epsilon=\prod_{i=1}^{n}\frac{\int_{0}^{\infty}w^{d_{i}/2}\exp\left(-\delta w/2\right)\,P_{\scriptsize\mbox{mix}}(\mbox{d}w)}{\int_{0}^{\infty}w^{d_{i}/2}\,P_{\scriptsize\mbox{mix}}(\mbox{d}w)}.
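For instance, with the product form above and a Gamma(\nu/2,\nu/2) mixing distribution (so that the errors follow a multivariate t distribution with \nu degrees of freedom), the integrals have closed forms, and

\epsilon=\prod_{i=1}^{n}\left(\frac{\nu}{\nu+\delta}\right)^{(\nu+d_{i})/2}.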

Thus, for each A\in\mathcal{F},

K((\bm{\beta},\bm{\varsigma}),A)\geq\epsilon\nu(A),

where \nu(\cdot) is a probability measure given by

\nu(A)=\int_{(0,\infty)^{n}}Q_{{\bm{k}}}(A\mid\bm{w},\bm{y}_{(\bm{k})})\,\prod_{i=1}^{n}\frac{w_{i}^{d_{i}/2}\exp\left(-\delta w_{i}/2\right)\,P_{\scriptsize\mbox{mix}}(\mbox{d}w_{i})}{\int_{0}^{\infty}w^{d_{i}/2}\exp\left(-\delta w/2\right)\,P_{\scriptsize\mbox{mix}}(\mbox{d}w)}. ∎

To establish d&m, it remains to show the following.

Lemma 10.

Suppose that Condition (H2) holds, and that the conditional distribution of (\bm{B},\bm{\Sigma}) given (\bm{W},\bm{Y}_{(\bm{k})})=(\bm{w},\bm{y}_{(\bm{k})}) is proper for every \bm{w}=(w_{1},\dots,w_{n})^{\top}\in(0,\infty)^{n}. If any one of the following conditions holds, then there exist \eta<1 and \varrho<\infty such that, for (\bm{\beta},\bm{\varsigma})\in\mathcal{X},

KV(\bm{\beta},\bm{\varsigma})<\eta V(\bm{\beta},\bm{\varsigma})+\varrho.
  1. P_{\scriptsize\mbox{mix}}(\cdot) is zero near the origin;

  2. P_{\scriptsize\mbox{mix}}(\cdot) is faster than polynomial near the origin; or

  3. P_{\scriptsize\mbox{mix}}(\cdot) is polynomial near the origin with power c>c_{1}, where

    c_{1}=\frac{n-p+m-\min\{d_{1},\dots,d_{n}\}}{2}.
Proof.

We will prove the result when \bm{k} contains at least one vanishing element. When there are no missing data under \bm{k}, the result is proved by Hobert et al. (2018). (To be absolutely precise, Hobert et al. assumed that P_{\scriptsize\mbox{mix}}(\cdot) is absolutely continuous with respect to the Lebesgue measure in the “zero near the origin” case, but their proof can be easily extended to the case where absolute continuity is absent.)

For \bm{\beta}\in\mathbb{R}^{p\times d} and \bm{\varsigma}\in S_{+}^{d\times d}, let P_{{\bm{k}}}(\cdot\mid\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})}) be the conditional distribution of \bm{W} given (\bm{B},\bm{\Sigma},\bm{Y}_{(\bm{k})})=(\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})}). For \bm{w}\in(0,\infty)^{n}, let Q_{{\bm{k}}}(\cdot\mid\bm{w},\bm{y}_{(\bm{k})}) be the conditional distribution of (\bm{B},\bm{\Sigma}) given (\bm{W},\bm{Y}_{(\bm{k})})=(\bm{w},\bm{y}_{(\bm{k})}). Then

KV(\bm{\beta},\bm{\varsigma})=\int_{(0,\infty)^{n}}\int_{\mathbb{R}^{p\times d}\times S_{+}^{d\times d}}V(\bm{\beta}^{*},\bm{\varsigma}^{*})\,Q_{{\bm{k}}}(\mbox{d}\bm{\beta}^{*},\mbox{d}\bm{\varsigma}^{*}\mid\bm{w},\bm{y}_{(\bm{k})})\,P_{{\bm{k}}}(\mbox{d}\bm{w}\mid\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})}). (7)

Let \bm{k}_{0}\in\{0,1\}^{n\times d} be a matrix such that all its entries are 1. In other words, \bm{K}=\bm{k}_{0} if there are no data missing. \bm{Y}_{(\bm{k}_{0}-\bm{k})} denotes the unobservable entries of \bm{Y} under the missing structure \bm{k}. By our assumptions, given (\bm{W},\bm{Y}_{(\bm{k})})=(\bm{w},\bm{y}_{(\bm{k})}), (\bm{B},\bm{\Sigma},\bm{Y}_{(\bm{k}_{0}-\bm{k})}) has a proper conditional distribution. For \bm{w}\in(0,\infty)^{n}, denote by Q_{1,{\bm{k}}}(\cdot\mid\bm{w},\bm{y}_{(\bm{k})}) the conditional distribution of \bm{Y}_{(\bm{k}_{0}-\bm{k})} given (\bm{W},\bm{Y}_{(\bm{k})})=(\bm{w},\bm{y}_{(\bm{k})}). For \bm{w}\in(0,\infty)^{n} and a realized value of \bm{Y}_{(\bm{k}_{0}-\bm{k})}, say, \bm{z}\in\mathbb{R}^{\sum_{i=1}^{n}(d-d_{i})}, let Q_{2,{\bm{k}}}(\cdot\mid\bm{w},\bm{z},\bm{y}_{(\bm{k})}) be the conditional distribution of (\bm{B},\bm{\Sigma}) given (\bm{W},\bm{Y}_{(\bm{k}_{0}-\bm{k})},\bm{Y}_{(\bm{k})})=(\bm{w},\bm{z},\bm{y}_{(\bm{k})}). We now describe this distribution (see Appendix A). Let \bm{y}^{*} be a realized value of \bm{Y} such that \bm{y}_{(\bm{k})}^{*}=\bm{y}_{(\bm{k})}. Let \bm{\lambda}=\mbox{diag}(w_{1},\dots,w_{n}), where w_{1},\dots,w_{n} are the components of \bm{w}. Let

\hat{\bm{\beta}}=\left(\bm{x}^{\top}\bm{\lambda}\bm{x}\right)^{-1}\bm{x}^{\top}\bm{\lambda}\bm{y}^{*},\quad\bm{s}=\left(\bm{y}^{*}-\bm{x}\hat{\bm{\beta}}\right)^{\top}\bm{\lambda}\left(\bm{y}^{*}-\bm{x}\hat{\bm{\beta}}\right),

and \bm{\xi}=(\bm{x}^{\top}\bm{\lambda}\bm{x})^{-1}. Then Q_{2,\bm{k}}(\cdot\mid\bm{w},\bm{y}_{(\bm{k}_{0}-\bm{k})}^{*},\bm{y}_{(\bm{k})}) is given by

\bm{\Sigma}\mid(\bm{W},\bm{Y})=(\bm{w},\bm{y}^{*}) \sim\mbox{IW}_{d}\left(n-p+m-d,\bm{s}+\bm{a}\right),
\bm{B}\mid(\bm{\Sigma},\bm{W},\bm{Y})=(\bm{\varsigma},\bm{w},\bm{y}^{*}) \sim\mbox{N}_{p,d}\left(\hat{\bm{\beta}},\bm{\xi},\bm{\varsigma}\right).

It is easy to verify that Q_{{\bm{k}}}, Q_{1,{\bm{k}}}, and Q_{2,{\bm{k}}} are connected through the following tower property:

Q_{{\bm{k}}}\left(\cdot\mid\bm{w},\bm{y}_{(\bm{k})}\right)=\int_{\mathbb{R}^{\sum_{i=1}^{n}(d-d_{i})}}Q_{2,{\bm{k}}}\left(\cdot\mid\bm{w},\bm{y}_{(\bm{k}_{0}-\bm{k})}^{*},\bm{y}_{(\bm{k})}\right)\,Q_{1,{\bm{k}}}\left(\mbox{d}\bm{y}_{(\bm{k}_{0}-\bm{k})}^{*}\mid\bm{w},\bm{y}_{(\bm{k})}\right). (8)

In light of Equations (7) and (8), let us first investigate the integral

\int_{\mathbb{R}^{p\times d}\times S_{+}^{d\times d}}V(\bm{\beta}^{*},\bm{\varsigma}^{*})\,Q_{2,{\bm{k}}}\left(\mbox{d}\bm{\beta}^{*},\mbox{d}\bm{\varsigma}^{*}\mid\bm{w},\bm{y}_{(\bm{k}_{0}-\bm{k})}^{*},\bm{y}_{(\bm{k})}\right) (9)
=\int_{S_{+}^{d\times d}}\int_{\mathbb{R}^{p\times d}}V(\bm{\beta}^{*},\bm{\varsigma}^{*})\,Q_{4,{\bm{k}}}\left(\mbox{d}\bm{\beta}^{*}\mid\bm{\varsigma}^{*},\bm{w},\bm{y}_{(\bm{k}_{0}-\bm{k})}^{*},\bm{y}_{(\bm{k})}\right)\,Q_{3,{\bm{k}}}\left(\mbox{d}\bm{\varsigma}^{*}\mid\bm{w},\bm{y}_{(\bm{k}_{0}-\bm{k})}^{*},\bm{y}_{(\bm{k})}\right),

where Q_{3,{\bm{k}}}(\cdot\mid\bm{w},\bm{y}_{(\bm{k}_{0}-\bm{k})}^{*},\bm{y}_{(\bm{k})}) is the \mbox{IW}_{d}(n-p+m-d,\bm{s}+\bm{a}) distribution, and Q_{4,{\bm{k}}}(\cdot\mid\bm{\varsigma}^{*},\bm{w},\bm{y}_{(\bm{k}_{0}-\bm{k})}^{*},\bm{y}_{(\bm{k})}) is the \mbox{N}_{p,d}(\hat{\bm{\beta}},\bm{\xi},\bm{\varsigma}^{*}) distribution. It follows from basic properties of matrix normal distributions that

\int_{\mathbb{R}^{p\times d}}V(\bm{\beta}^{*},\bm{\varsigma}^{*})\,Q_{4,{\bm{k}}}\left(\mbox{d}\bm{\beta}^{*}\mid\bm{\varsigma}^{*},\bm{w},\bm{y}_{(\bm{k}_{0}-\bm{k})}^{*},\bm{y}_{(\bm{k})}\right)
=\sum_{i=1}^{n}\left\{\left(\bm{y}_{i,(\bm{k})}-\bm{c}_{i,(\bm{k})}\hat{\bm{\beta}}^{\top}\bm{x}_{i}\right)^{\top}\left(\bm{c}_{i,(\bm{k})}\bm{\varsigma}^{*}\bm{c}_{i,(\bm{k})}^{\top}\right)^{-1}\left(\bm{y}_{i,(\bm{k})}-\bm{c}_{i,(\bm{k})}\hat{\bm{\beta}}^{\top}\bm{x}_{i}\right)+d_{i}\bm{x}_{i}^{\top}\bm{\xi}\bm{x}_{i}\right\}.

By Theorem 3.2.11 of Muirhead (2009), if \bm{\Sigma}^{*} is a random matrix that follows the \mbox{IW}_{d}(n-p+m-d,\bm{s}+\bm{a}) distribution, then (\bm{c}_{i,(\bm{k})}\bm{\Sigma}^{*}\bm{c}_{i,(\bm{k})}^{\top})^{-1} follows the \mbox{W}_{d_{i}}(n-p+m-2d+d_{i},[\bm{c}_{i,(\bm{k})}(\bm{s}+\bm{a})\bm{c}_{i,(\bm{k})}^{\top}]^{-1}) distribution. Then by basic properties of Wishart distributions,

\int_{S_{+}^{d\times d}}\int_{\mathbb{R}^{p\times d}}V(\bm{\beta}^{*},\bm{\varsigma}^{*})\,Q_{4,{\bm{k}}}\left(\mbox{d}\bm{\beta}^{*}\mid\bm{\varsigma}^{*},\bm{w},\bm{y}_{(\bm{k}_{0}-\bm{k})}^{*},\bm{y}_{(\bm{k})}\right)\,Q_{3,{\bm{k}}}\left(\mbox{d}\bm{\varsigma}^{*}\mid\bm{w},\bm{y}_{(\bm{k}_{0}-\bm{k})}^{*},\bm{y}_{(\bm{k})}\right) (10)
=\sum_{i=1}^{n}(n-p+m-2d+d_{i})\left[\left(\bm{y}_{i,(\bm{k})}-\bm{c}_{i,(\bm{k})}\hat{\bm{\beta}}^{\top}\bm{x}_{i}\right)^{\top}\left\{\bm{c}_{i,(\bm{k})}(\bm{s}+\bm{a})\bm{c}_{i,(\bm{k})}^{\top}\right\}^{-1}\left(\bm{y}_{i,(\bm{k})}-\bm{c}_{i,(\bm{k})}\hat{\bm{\beta}}^{\top}\bm{x}_{i}\right)\right]+\sum_{i=1}^{n}d_{i}\bm{x}_{i}^{\top}\bm{\xi}\bm{x}_{i}.

For i=1,\dots,n, denote by \bm{y}_{i}^{*} the ith row of \bm{y}^{*}, and note that \bm{c}_{i,(\bm{k})}\bm{y}_{i}^{*}=\bm{y}_{i,(\bm{k})}. It follows from Lemma 11, which is stated right after this proof, that, for i=1,\dots,n,

\bm{x}_{i}^{\top}\bm{\xi}\bm{x}_{i}=\bm{x}_{i}^{\top}\left(\sum_{j=1}^{n}w_{j}\bm{x}_{j}\bm{x}_{j}^{\top}\right)^{-1}\bm{x}_{i}\leq\frac{1}{w_{i}},

and

\left(\bm{y}_{i,(\bm{k})}-\bm{c}_{i,(\bm{k})}\hat{\bm{\beta}}^{\top}\bm{x}_{i}\right)^{\top}\left\{\bm{c}_{i,(\bm{k})}(\bm{s}+\bm{a})\bm{c}_{i,(\bm{k})}^{\top}\right\}^{-1}\left(\bm{y}_{i,(\bm{k})}-\bm{c}_{i,(\bm{k})}\hat{\bm{\beta}}^{\top}\bm{x}_{i}\right)
=\left(\bm{y}_{i,(\bm{k})}-\bm{c}_{i,(\bm{k})}\hat{\bm{\beta}}^{\top}\bm{x}_{i}\right)^{\top}\left[\bm{c}_{i,(\bm{k})}\left\{\sum_{j=1}^{n}w_{j}\left(\bm{y}_{j}^{*}-\hat{\bm{\beta}}^{\top}\bm{x}_{j}\right)\left(\bm{y}_{j}^{*}-\hat{\bm{\beta}}^{\top}\bm{x}_{j}\right)^{\top}\right\}\bm{c}_{i,(\bm{k})}^{\top}+\bm{c}_{i,(\bm{k})}\bm{a}\bm{c}_{i,(\bm{k})}^{\top}\right]^{-1}\left(\bm{y}_{i,(\bm{k})}-\bm{c}_{i,(\bm{k})}\hat{\bm{\beta}}^{\top}\bm{x}_{i}\right)
=\left(\bm{y}_{i,(\bm{k})}-\bm{c}_{i,(\bm{k})}\hat{\bm{\beta}}^{\top}\bm{x}_{i}\right)^{\top}\left\{\sum_{j=1}^{n}w_{j}\left(\bm{y}_{j,(\bm{k})}-\bm{c}_{j,(\bm{k})}\hat{\bm{\beta}}^{\top}\bm{x}_{j}\right)\left(\bm{y}_{j,(\bm{k})}-\bm{c}_{j,(\bm{k})}\hat{\bm{\beta}}^{\top}\bm{x}_{j}\right)^{\top}+\bm{c}_{i,(\bm{k})}\bm{a}\bm{c}_{i,(\bm{k})}^{\top}\right\}^{-1}\left(\bm{y}_{i,(\bm{k})}-\bm{c}_{i,(\bm{k})}\hat{\bm{\beta}}^{\top}\bm{x}_{i}\right)
\leq\frac{1}{w_{i}}.

Notice that 1/w_{i} is a constant and does not depend on \bm{y}_{(\bm{k}_{0}-\bm{k})}^{*}. It then follows from Equations (7) to (10) that

KV(\bm{\beta},\bm{\varsigma})\leq\int_{(0,\infty)^{n}}\left(\sum_{i=1}^{n}\frac{n-p+m-2d+2d_{i}}{w_{i}}\right)P_{{\bm{k}}}(\mbox{d}\bm{w}\mid\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})})
\leq\int_{(0,\infty)^{n}}\left(\sum_{i=1}^{n}\frac{n-p+m}{w_{i}}\right)P_{{\bm{k}}}(\mbox{d}\bm{w}\mid\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})}),

where P_{{\bm{k}}}(\mbox{d}\bm{w}\mid\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})}) is given by Equation (4). By Lemma 12 below, there exist \eta^{\prime}<1/(n-p+m) and \varrho^{\prime}<\infty such that, for (\bm{\beta},\bm{\varsigma})\in\mathcal{X},

\int_{(0,\infty)^{n}}\left(\sum_{i=1}^{n}\frac{n-p+m}{w_{i}}\right)P_{{\bm{k}}}(\mbox{d}\bm{w}\mid\bm{\beta},\bm{\varsigma},\bm{y}_{(\bm{k})})\leq\sum_{i=1}^{n}(n-p+m)(\eta^{\prime}r_{i,(\bm{k})}+\varrho^{\prime}).

The desired result holds with \eta=(n-p+m)\eta^{\prime}<1 and \varrho=n(n-p+m)\varrho^{\prime}<\infty. ∎

Lemma 11.

Let \bm{\varphi} be a positive definite matrix and \bm{\upsilon} be a vector such that the matrix \bm{\varphi}-\bm{\upsilon}\bm{\upsilon}^{T} is positive semidefinite. Then \bm{\upsilon}^{T}\bm{\varphi}^{-1}\bm{\upsilon}\leq 1.

For the proof of Lemma 11, see, e.g., Roy and Hobert (2010), Lemma 3.

Lemma 12.

Suppose that Condition (H2) holds and that P_{\text{mix}}(\cdot) is either zero near the origin, faster than polynomial near the origin, or polynomial near the origin with power c>c_{1}, where

c_{1}=\frac{n-p+m-\min\{d_{1},\dots,d_{n}\}}{2}.

Then there exist \eta\in[0,1/(n-p+m)) and \varrho<\infty, such that for all \tilde{d}\in[\min\{d_{1},\dots,d_{n}\},d] and \tilde{r}\in[0,\infty),

\frac{\int_{0}^{\infty}w^{(\tilde{d}-2)/2}\exp({-\tilde{r}w/2})P_{\scriptsize\mbox{mix}}{(\mbox{d}w)}}{\int_{0}^{\infty}w^{\tilde{d}/2}\exp({-\tilde{r}w/2})P_{\scriptsize\mbox{mix}}{(\mbox{d}w)}}\leq\eta\tilde{r}+\varrho.
Proof.

We will make use of results in Hobert et al. (2018).

When P_{\text{mix}}(\cdot) is zero near the origin, there exists \theta>0 such that \int_{0}^{\theta}P_{\text{mix}}(\mbox{d}w)=0. Then, for all \tilde{d}\in[\min\{d_{1},\dots,d_{n}\},d] and \tilde{r}\in[0,\infty),

\frac{\int_{0}^{\infty}w^{(\tilde{d}-2)/2}\exp({-\tilde{r}w/2})P_{\text{mix}}{(\mbox{d}w)}}{\int_{0}^{\infty}w^{\tilde{d}/2}\exp({-\tilde{r}w/2})P_{\text{mix}}{(\mbox{d}w)}}=\frac{\int_{\theta}^{\infty}(1/w)w^{\tilde{d}/2}\exp({-\tilde{r}w/2})P_{\text{mix}}{(\mbox{d}w)}}{\int_{\theta}^{\infty}w^{\tilde{d}/2}\exp({-\tilde{r}w/2})P_{\text{mix}}{(\mbox{d}w)}}
\leq\frac{(1/\theta)\int_{\theta}^{\infty}w^{\tilde{d}/2}\exp({-\tilde{r}w/2})P_{\text{mix}}{(\mbox{d}w)}}{\int_{\theta}^{\infty}w^{\tilde{d}/2}\exp({-\tilde{r}w/2})P_{\text{mix}}{(\mbox{d}w)}}
=1/{\theta}.

When P_{\text{mix}}(\cdot) is polynomial near the origin or faster than polynomial near the origin, recall that p_{\scriptsize\mbox{mix}}(\cdot) is the density function of P_{\scriptsize\mbox{mix}}(\cdot) with respect to the Lebesgue measure. Let S(p_{\scriptsize\mbox{mix}}) be the set of \eta\in[0,\infty) such that there exists \varrho_{\eta}\in\mathbb{R} satisfying

\frac{\int_{0}^{\infty}w^{-1/2}\exp({-\tilde{r}w/2})P_{\text{mix}}{(\mbox{d}w)}}{\int_{0}^{\infty}w^{1/2}\exp({-\tilde{r}w/2})P_{\text{mix}}{(\mbox{d}w)}}\leq\eta\tilde{r}+\varrho_{\eta}

for all \tilde{r}\in[0,\infty). If S(p_{\scriptsize\mbox{mix}}) is non-empty, let \eta_{p_{\scriptsize\mbox{mix}}}=\inf S(p_{\scriptsize\mbox{mix}}); otherwise, set \eta_{p_{\scriptsize\mbox{mix}}}=\infty.

Consider a Gamma(\alpha,1) mixing distribution with density

p_{G(\alpha)}(w)=\{\Gamma_{1}(\alpha)\}^{-1}w^{\alpha-1}\exp({-w}).

It is easy to see that if \alpha>1/2, for all \tilde{r}\in[0,\infty),

\frac{\int_{0}^{\infty}w^{-1/2}\exp({-\tilde{r}w/2})p_{G(\alpha)}(w)dw}{\int_{0}^{\infty}w^{1/2}\exp({-\tilde{r}w/2})p_{G(\alpha)}(w)dw}=\frac{1}{2\alpha-1}\tilde{r}+\frac{2}{2\alpha-1}.
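Indeed, since \int_{0}^{\infty}w^{s-1}\exp\{-(\tilde{r}/2+1)w\}\,\mbox{d}w=\Gamma(s)\,(\tilde{r}/2+1)^{-s} for s>0, the left-hand side above reduces to

\frac{\Gamma(\alpha-1/2)\,(\tilde{r}/2+1)^{-(\alpha-1/2)}}{\Gamma(\alpha+1/2)\,(\tilde{r}/2+1)^{-(\alpha+1/2)}}=\frac{\tilde{r}/2+1}{\alpha-1/2}=\frac{\tilde{r}+2}{2\alpha-1}.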

Therefore, if \alpha>1/2, \eta_{p_{G(\alpha)}}=1/(2\alpha-1).

Let p_{\scriptsize\mbox{mix}}^{\tilde{d}}(\cdot) be a mixing density proportional to w^{(\tilde{d}-1)/2}p_{\scriptsize\mbox{mix}}(\cdot). Note that

\frac{\int_{0}^{\infty}w^{(\tilde{d}-2)/2}\exp({-\tilde{r}w/2})P_{\text{mix}}{(\mbox{d}w)}}{\int_{0}^{\infty}w^{\tilde{d}/2}\exp({-\tilde{r}w/2})P_{\text{mix}}{(\mbox{d}w)}}=\frac{\int_{0}^{\infty}w^{-1/2}\exp({-\tilde{r}w/2})p_{\scriptsize\mbox{mix}}^{\tilde{d}}(w)dw}{\int_{0}^{\infty}w^{1/2}\exp({-\tilde{r}w/2})p_{\scriptsize\mbox{mix}}^{\tilde{d}}(w)dw}.

Therefore, it suffices to prove that \eta_{p_{\scriptsize\mbox{mix}}^{\tilde{d}}}<1/(n-p+m).

When P_{\text{mix}}(\cdot) is polynomial near the origin with power c>c_{1}, the distribution associated with p_{\scriptsize\mbox{mix}}^{\tilde{d}}(\cdot) is polynomial near the origin with power c+(\tilde{d}-1)/2. Let p^{\tilde{d}}_{G}(\cdot) be a mixing density following Gamma(c+(\tilde{d}+1)/2,1). We have

\lim_{w\to 0}\frac{p_{\scriptsize\mbox{mix}}^{\tilde{d}}(w)}{p^{\tilde{d}}_{G}(w)}\in(0,\infty).

By Lemma 1 in Hobert et al. (2018), for all \tilde{d}\in[\min\{d_{1},\dots,d_{n}\},d],

\eta_{p_{\scriptsize\mbox{mix}}^{\tilde{d}}}=\eta_{p^{\tilde{d}}_{G}}=\frac{1}{2c+\tilde{d}}<\frac{1}{n-p+m}.

When P_{\text{mix}}(\cdot) is faster than polynomial near the origin, again let p_{\scriptsize\mbox{mix}}^{\tilde{d}}(\cdot) be a mixing density proportional to w^{(\tilde{d}-1)/2}p_{\scriptsize\mbox{mix}}(\cdot). The distribution associated with p_{\scriptsize\mbox{mix}}^{\tilde{d}}(\cdot) is faster than polynomial near the origin. Let \alpha>1/2. Recall that p_{G(\alpha)}(w) is the density of the Gamma(\alpha,1) mixing distribution. By the definition of faster than polynomial near the origin, there exists \kappa_{\alpha-1}>0 such that p_{\scriptsize\mbox{mix}}^{\tilde{d}}(w)/w^{\alpha-1} is strictly increasing on (0,\kappa_{\alpha-1}). Thus, p_{\scriptsize\mbox{mix}}^{\tilde{d}}(w)/p_{G(\alpha)}(w) is strictly increasing on (0,\kappa_{\alpha-1}). By Lemma 2 in Hobert et al. (2018),

\eta_{p_{\scriptsize\mbox{mix}}^{\tilde{d}}}\leq\eta_{p_{G(\alpha)}}=\frac{1}{2\alpha-1}.

Since \alpha can be any value larger than 1/2, \eta_{p_{\scriptsize\mbox{mix}}^{\tilde{d}}}=0 for all \tilde{d}\in[\min\{d_{1},\dots,d_{n}\},d]. ∎

Appendix C The ESS and ESSpm of Each Component in Section 5

Table 2: The ESS of each component in different scenarios. The model column lists the mixing distributions and samplers used, and the data column labels the missing structures by the number of fully observed rows.
Model Data \bm{B}_{11} \bm{B}_{21} \bm{B}_{12} \bm{B}_{22} \bm{\Sigma}_{11} \bm{\Sigma}_{21} \bm{\Sigma}_{22}
DAG 45 19332 17450 19098 16447 10290 11646 10528
DAG 40 20001 18399 17549 19992 10315 12061 10714
DAG 35 16960 18459 16786 16967 11605 12442 11551
DAIG 45 16471 19242 17976 19443 9970 10723 9888
DAIG 40 14292 15528 18045 17066 8335 11155 9897
DAIG 35 13330 17612 17745 16203 9676 8988 10064
DAGIG 45 16843 17429 14810 15401 13796 13232 13422
DAGIG 40 20661 20371 21888 14722 13375 14279 12765
DAGIG 35 17729 15738 17347 15574 15619 12788 12722
DAIGIG 45 18267 15193 17571 15148 11803 12759 13323
DAIGIG 40 16540 18311 16763 17924 14042 14860 14749
DAIGIG 35 12246 14077 18134 15717 8973 9506 12939
DAP 45 30000 30000 30000 30000 30000 29724 30000
DAP 40 30000 30000 30000 30000 30000 30000 30438
DAP 35 30000 30000 29216 30000 30000 30000 30000
DAIP 45 22822 26402 30000 30000 25435 29601 30000
DAIP 40 20145 23817 30000 30000 20160 26305 30000
DAIP 35 13894 22094 30000 30000 16530 13943 30000
Table 3: The ESSpm of each component in different scenarios. The model column lists the mixing distributions and samplers used, and the data column labels the missing structures by the number of fully observed rows.
Model Data \bm{B}_{11} \bm{B}_{21} \bm{B}_{12} \bm{B}_{22} \bm{\Sigma}_{11} \bm{\Sigma}_{21} \bm{\Sigma}_{22}
DAG 45 4421 3990 4367 3761 2353 2663 2407
DAG 40 4886 4495 4287 4884 2520 2947 2618
DAG 35 4213 4585 4170 4215 2883 3090 2869
DAIG 45 3286 3838 3586 3878 1989 2139 1972
DAIG 40 2891 3141 3651 3452 1686 2257 2002
DAIG 35 2850 3766 3794 3464 2069 1922 2152
DAGIG 45 1611 1667 1417 1473 1320 1266 1284
DAGIG 40 2072 2043 2195 1477 1341 1432 1280
DAGIG 35 1803 1601 1765 1584 1589 1301 1294
DAIGIG 45 1701 1415 1636 1410 1099 1188 1240
DAIGIG 40 1543 1709 1564 1672 1310 1387 1376
DAIGIG 35 1201 1381 1779 1542 880 932 1269
DAP 45 6959 6959 6959 6959 6959 6895 6959
DAP 40 7793 7793 7793 7793 7793 7793 7907
DAP 35 7478 7478 7282 7478 7478 7478 7478
DAIP 45 4929 5702 6479 6479 5493 6393 6479
DAIP 40 4415 5219 6574 6574 4418 5765 6574
DAIP 35 3183 5062 6873 6873 3787 3194 6873
Table 4: The ESS of each component in different scenarios. The model column lists the Gamma(v,v) mixing distributions and samplers used, and the data column labels the missing structures by the number of fully observed rows.
Model Data \bm{B}_{11} \bm{B}_{21} \bm{B}_{12} \bm{B}_{22} \bm{\Sigma}_{11} \bm{\Sigma}_{21} \bm{\Sigma}_{22}
DAG2 45 19332 17450 19098 16447 10290 11646 10528
DAG2 40 20001 18399 17549 19992 10315 12061 10714
DAG2 35 16960 18459 16786 16967 11605 12442 11551
DAIG2 45 16471 19242 17976 19443 9970 10723 9888
DAIG2 40 14292 15528 18045 17066 8335 11155 9897
DAIG2 35 13330 17612 17745 16203 9676 8988 10064
DAG8 45 26359 23824 25032 25504 16537 19326 16185
DAG8 40 26122 26707 25362 26588 14952 19118 15616
DAG8 35 25470 24347 24819 25501 18445 16390 16759
DAIG8 45 20631 23024 25552 26183 13685 15991 15078
DAIG8 40 18122 21684 23664 25824 13061 18082 16806
DAIG8 35 14837 22029 24967 26256 11553 11712 15336
DAG25 45 29041 29055 27919 28865 21494 22572 23345
DAG25 40 28018 29302 28548 29081 23303 21042 23013
DAG25 35 29310 29162 29052 30000 18884 16623 20118
DAIG25 45 23968 24454 27999 29108 17254 20737 22495
DAIG25 40 18613 23362 28295 28963 15049 18725 21078
DAIG25 35 17014 20338 28073 29150 10450 9397 22567
Table 5: The ESSpm of each component in different scenarios. The model column lists the Gamma(v,v) mixing distributions and samplers used, and the data column labels the missing structures by the number of fully observed rows.
Model Data \bm{B}_{11} \bm{B}_{21} \bm{B}_{12} \bm{B}_{22} \bm{\Sigma}_{11} \bm{\Sigma}_{21} \bm{\Sigma}_{22}
DAG2 45 4421 3990 4367 3761 2353 2663 2407
DAG2 40 4886 4495 4287 4884 2520 2947 2618
DAG2 35 4213 4585 4170 4215 2883 3090 2869
DAIG2 45 3286 3838 3586 3878 1989 2139 1972
DAIG2 40 2891 3141 3651 3452 1686 2257 2002
DAIG2 35 2850 3766 3794 3464 2069 1922 2152
DAG8 45 6129 5539 5820 5930 3845 4494 3763
DAG8 40 6495 6641 6306 6611 3718 4754 3883
DAG8 35 6632 6339 6462 6640 4803 4268 4364
DAIG8 45 4095 4570 5072 5197 2717 3174 2993
DAIG8 40 3894 4659 5085 5549 2806 3885 3611
DAIG8 35 3265 4847 5494 5777 2542 2577 3375
DAG25 45 7115 7118 6840 7072 5266 5530 5719
DAG25 40 7097 7423 7232 7367 5903 5330 5830
DAG25 35 7728 7689 7660 7910 4979 4383 5304
DAIG25 45 5067 5170 5919 6154 3648 4384 4755
DAIG25 40 4099 5144 6231 6378 3314 4123 4642
DAIG25 35 3898 4660 6432 6679 2394 2153 5170

References

  • Barndorff-Nielsen et al. [1982] Ole Barndorff-Nielsen, John Kent, and Michael Sørensen. Normal variance-mean mixtures and z distributions. International Statistical Review/Revue Internationale de Statistique, 50(2):145–159, 1982.
  • Chan and Geyer [1994] Kung Sik Chan and Charles J Geyer. Discussion: Markov chains for exploring posterior distributions. The Annals of Statistics, 22(4):1747–1758, 1994.
  • Fernández and Steel [1999] Carmen Fernández and Mark FJ Steel. Multivariate Student-t regression models: Pitfalls and inference. Biometrika, 86(1):153–167, 1999.
  • Fernandez and Steel [2000] Carmen Fernandez and Mark FJ Steel. Bayesian regression analysis with scale mixtures of normals. Econometric Theory, 16(1):80–101, 2000.
  • Flegal et al. [2008] James M Flegal, Murali Haran, and Galin L Jones. Markov chain Monte Carlo: Can we trust the third significant figure? Statistical Science, 23(2):250–260, 2008.
  • Hobert and Casella [1996] James P Hobert and George Casella. The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association, 91(436):1461–1473, 1996.
  • Hobert et al. [2018] James P Hobert, Yeun Ji Jung, Kshitij Khare, and Qian Qin. Convergence analysis of MCMC algorithms for Bayesian multivariate linear regression with non-Gaussian errors. Scandinavian Journal of Statistics, 45(3):513–533, 2018.
  • Jones [2004] Galin L Jones. On the Markov chain central limit theorem. Probability Surveys, 1:299–320, 2004.
  • Jones and Hobert [2001] Galin L Jones and James P Hobert. Honest exploration of intractable probability distributions via Markov chain Monte Carlo. Statistical Science, 16(4):312–334, 2001.
  • Jones and Qin [2021] Galin L Jones and Qian Qin. Markov chain Monte Carlo in practice. Annual Review of Statistics and Its Application, 9, 2021.
  • Lange and Sinsheimer [1993] Kenneth Lange and Janet S Sinsheimer. Normal/independent distributions and their applications in robust regression. Journal of Computational and Graphical Statistics, 2(2):175–198, 1993.
  • Liu [1996] Chuanhai Liu. Bayesian robust multivariate linear regression with incomplete data. Journal of the American Statistical Association, 91(435):1219–1227, 1996.
  • Liu et al. [1994] Jun S Liu, Wing Hung Wong, and Augustine Kong. Covariance structure of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika, 81(1):27–40, 1994.
  • Meyn and Tweedie [2005] Sean P Meyn and Richard L Tweedie. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2005.
  • Muirhead [2009] Robb J Muirhead. Aspects of Multivariate Statistical Theory. John Wiley & Sons, 2009.
  • Roberts and Rosenthal [1997] Gareth Roberts and Jeffrey Rosenthal. Geometric ergodicity and hybrid Markov chains. Electronic Communications in Probability, 2:13–25, 1997.
  • Roberts and Rosenthal [2001] Gareth O Roberts and Jeffrey S Rosenthal. Markov chains and de-initializing processes. Scandinavian Journal of Statistics, 28(3):489–504, 2001.
  • Roberts and Rosenthal [2004] Gareth O Roberts and Jeffrey S Rosenthal. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20–71, 2004.
  • Roberts and Rosenthal [2006] Gareth O Roberts and Jeffrey S Rosenthal. Harris recurrence of Metropolis-within-Gibbs and trans-dimensional Markov chains. The Annals of Applied Probability, 16(4):2123–2139, 2006.
  • Rogers and Tukey [1972] William H. Rogers and John W. Tukey. Understanding some long‐tailed symmetrical distributions. Statistica Neerlandica, 26(3):211–226, 1972.
  • Rosenthal [1995] Jeffrey S Rosenthal. Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association, 90(430):558–566, 1995.
  • Rosenthal [2003] Jeffrey S Rosenthal. Asymptotic variance and convergence rates of nearly-periodic Markov chain Monte Carlo algorithms. Journal of the American Statistical Association, 98(461):169–177, 2003.
  • Roy and Hobert [2010] Vivekananda Roy and James P Hobert. On Monte Carlo methods for Bayesian multivariate regression models with heavy-tailed errors. Journal of Multivariate Analysis, 101(5):1190–1202, 2010.
  • Seaman et al. [2013] Shaun Seaman, John Galati, Dan Jackson, and John Carlin. What is meant by “missing at random”? Statistical Science, 28(2):257–268, 2013.
  • Vats et al. [2018] Dootika Vats, James M Flegal, and Galin L Jones. Strong consistency of multivariate spectral variance estimators in Markov chain Monte Carlo. Bernoulli, 24(3):1860–1909, 2018.
  • Vats et al. [2019] Dootika Vats, James M Flegal, and Galin L Jones. Multivariate output analysis for Markov chain Monte Carlo. Biometrika, 106(2):321–337, 2019.
  • Zellner [1976] Arnold Zellner. Bayesian and non-Bayesian analysis of the regression model with multivariate Student-t error terms. Journal of the American Statistical Association, 71(354):400–405, 1976.