
Coupling-based convergence assessment of some Gibbs samplers for high-dimensional Bayesian regression with shrinkage priors

Niloy Biswas Department of Statistics, Harvard University. Email: [email protected]    Anirban Bhattacharya Department of Statistics, Texas A&M University. Email: [email protected]    Pierre E. Jacob Department of Statistics, Harvard University. Email: [email protected]    James E. Johndrow Department of Statistics, The Wharton School, University of Pennsylvania. Email: [email protected]
Abstract

We consider Markov chain Monte Carlo (MCMC) algorithms for Bayesian high-dimensional regression with continuous shrinkage priors. A common challenge with these algorithms is the choice of the number of iterations to perform. This is critical when each iteration is expensive, as is the case when dealing with modern data sets, such as genome-wide association studies with thousands of rows and up to hundreds of thousands of columns. We develop coupling techniques tailored to the setting of high-dimensional regression with shrinkage priors, which enable practical, non-asymptotic diagnostics of convergence without relying on traceplots or long-run asymptotics. By establishing geometric drift and minorization conditions for the algorithm under consideration, we prove that the proposed couplings have finite expected meeting time. Focusing on a class of shrinkage priors which includes the “Horseshoe”, we empirically demonstrate the scalability of the proposed couplings. A highlight of our findings is that less than 1000 iterations can be enough for a Gibbs sampler to reach stationarity in a regression on 100,000 covariates. The numerical results also illustrate the impact of the prior on the computational efficiency of the coupling, and suggest the use of priors where the local precisions are Half-t distributed with degrees of freedom larger than one.

1 Introduction

1.1 Iterative computation in high dimensions

We consider the setting of high-dimensional regression where the number of observations n is smaller than the number of covariates p and the true signal is sparse. This problem formulation is ubiquitous in modern applications ranging from genomics to the social sciences. Optimization-based methods, such as the LASSO (Tibshirani, 1994) or Elastic Net (Zou and Hastie, 2005), allow sparse point estimates to be obtained even when the number of covariates is on the order of hundreds of thousands. More specifically, iterative optimization procedures to obtain these estimates are practical because the following conditions are met: 1) the cost per iteration scales favorably with the size of the input (n and p), 2) the number of iterations to convergence also scales favorably, and 3) there are reliable stopping criteria to detect convergence.

In the Bayesian paradigm, Markov chain Monte Carlo (MCMC) methods are commonly used to sample from the posterior distribution. Default MCMC implementations, for example using general-purpose software such as Stan (Carpenter et al., 2017), can lead to high computational costs in high-dimensional settings, both per iteration and in terms of the number of iterations required to reach convergence. In the setting of high-dimensional regressions, tailored algorithms can provide substantial improvements on both fronts (e.g. Bhattacharya et al., 2016; Nishimura and Suchard, 2018; Johndrow et al., 2020; Bhattacharya and Johndrow, 2021). Comparatively less attention has been paid to the design of reliable stopping criteria. Stopping criteria for MCMC, such as univariate effective sample size (ESS, e.g. Vats et al., 2019), or the R̂ convergence diagnostic (e.g. Vats and Knudson, 2020), rely on the asymptotic behavior of the chains as time goes to infinity, and thus effectively ignore the non-asymptotic “burn-in” bias, which vanishes as time progresses. This is acceptable in situations where we have solid a priori estimates of the burn-in period; otherwise the lack of non-asymptotic stopping criteria poses an important practical problem. With Bayesian analysis of high-dimensional data sets, MCMC algorithms can require seconds or minutes per iteration, and thus any non-asymptotic insight on the number of iterations to perform is highly valuable.

Diagnostics of convergence are particularly useful when considering the number of factors that affect the performance of MCMC algorithms for Bayesian regressions. Beyond the size of the data set, the specification of the prior and the (much less explicit) signal-to-noise ratio in the data all have an impact on the convergence of MCMC algorithms. Across applications, the number of iterations required for the chain to converge varies by multiple orders of magnitude. This could lead users to either waste computational resources by running overly long chains, or, worrisomely, to base their analysis on chains that have not converged. This manuscript proposes a concrete method to avoid these pitfalls in the case of a Gibbs sampler for Bayesian regression with heavy-tailed shrinkage priors. Specifically, we follow the approach of Glynn and Rhee (2014); Jacob et al. (2020); Biswas et al. (2019) and use coupled lagged chains to monitor convergence.

1.2 Bayesian shrinkage regression with Half-t(ν) priors

Consider Gaussian linear regression with n observations and p covariates, with a focus on the high-dimensional setting where n ≪ p. The likelihood is given by

L(\beta,\sigma^{2};X,y)=\frac{1}{(2\pi\sigma^{2})^{n/2}}\exp\Big(-\frac{1}{2\sigma^{2}}\|y-X\beta\|^{2}\Big),    (1)

where ‖·‖ denotes the L₂ norm, X ∈ ℝ^{n×p} is the observed design matrix, y ∈ ℝ^n is the observed response vector, σ² > 0 is the unknown Gaussian noise variance, and β ∈ ℝ^p is the unknown signal vector that is assumed to be sparse. We study hierarchical Gaussian scale-mixture priors on (β, σ²) given by

\beta_{j}\mid\sigma^{2},\xi,\eta\underset{j=1,\ldots,p}{\overset{ind}{\sim}}\mathcal{N}\Big(0,\frac{\sigma^{2}}{\xi\eta_{j}}\Big),\quad \xi\sim\pi_{\xi}(\cdot),\quad \eta_{j}\underset{j=1,\ldots,p}{\overset{\text{i.i.d.}}{\sim}}\pi_{\eta}(\cdot),\quad \sigma^{2}\sim\mathrm{InvGamma}\Big(\frac{a_{0}}{2},\frac{b_{0}}{2}\Big),    (2)

where a₀, b₀ > 0, and π_ξ(·), π_η(·) are continuous densities on ℝ_{>0}. Such global-local mixture priors induce approximate sparsity, where the components of β can be arbitrarily close to zero, but never exactly zero. This is in contrast to point-mass mixture priors (e.g. Johnson and Rossell, 2012), where some components of β can be exactly zero a posteriori. The global precision parameter ξ relates to the number of signals, and we use ξ^{-1/2} ∼ Cauchy₊(0, 1) throughout. The local precision parameters η_j determine which components of β are null.

We focus on the Half-t(ν) prior family for the local scale parameter, η_j^{-1/2} ∼ t₊(ν) for ν ≥ 1, where a t₊(ν) distribution has density proportional to (1 + x²/ν)^{-(ν+1)/2} 𝟙_{(0,∞)}(x). The induced prior on the local precision η_j has density

\pi_{\eta}(\eta_{j})\propto\frac{1}{\eta_{j}^{\frac{2-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}\,\mathbbm{1}_{(0,\infty)}(\eta_{j}).    (3)

The case ν = 1 corresponds to the Horseshoe prior (Carvalho et al., 2009, 2010) and has received overwhelming attention in the literature among this prior class; see Bhadra et al. (2019) for a recent overview. The use of Half-t(ν) priors on scale parameters in hierarchical Gaussian models was popularized by Gelman (2006). However, the subsequent literature has largely gravitated towards the default choice (Polson and Scott, 2012) of ν = 1, in part because a convincing argument for preferring a different value of ν has been absent to date. Here we give empirical evidence that Half-t(ν) priors yield statistical estimation properties similar to the Horseshoe prior, whilst leading to significant computational gains when ν > 1. We note that Ghosh and Chakrabarti (2017) established optimal posterior concentration in the Normal means problem for a broad class of priors on the local scale that includes the Half-t(ν) priors, extending earlier work by van der Pas et al. (2014) for the Horseshoe; this provides theoretical support for the comparable performance we observe with ν = 1 and ν > 1 in simulations.
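As a concrete illustration (and since our experiments initialize chains from the prior distribution), a draw from the prior (2) with Half-t(ν) local scales and a half-Cauchy global scale can be sketched in R as follows; the function name and the default a₀ = b₀ = 1 are illustrative and not part of our released scripts.

# Sketch: one draw from prior (2) with Half-t(nu) local scales and a
# half-Cauchy global scale (xi^{-1/2} ~ Cauchy_+(0,1)); illustrative only.
rprior <- function(p, nu, a0 = 1, b0 = 1) {
  xi <- abs(rt(1, df = 1))^(-2)        # xi^{-1/2} ~ half-Cauchy
  eta <- abs(rt(p, df = nu))^(-2)      # eta_j^{-1/2} ~ half-t(nu)
  sigma2 <- 1 / rgamma(1, shape = a0 / 2, rate = b0 / 2)  # InvGamma(a0/2, b0/2)
  beta <- rnorm(p, mean = 0, sd = sqrt(sigma2 / (xi * eta)))
  list(beta = beta, eta = eta, sigma2 = sigma2, xi = xi)
}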

Our stopping criterion relies on couplings. Couplings have long been used to analyze the convergence of MCMC algorithms, while methodological implementations have been rare. The work of Johnson (1996) pioneered the use of couplings for practical diagnostics of convergence, but relied on the assumption of some similarity between the initial distribution and the target, which can be hard to check. Here we instead follow the approach of Glynn and Rhee (2014); Jacob et al. (2020) based on couplings of lagged chains, applied to the question of convergence in Biswas et al. (2019). That approach makes no assumptions on the closeness of the initial distribution to the target, and in our experiments we initialize all chains independently from the prior distribution.

The approach requires the ability to sample pairs of Markov chains such that 1) each chain marginally follows a prescribed MCMC algorithm and 2) the two chains “meet” after a random –but almost surely finite– number of iterations, called the meeting time. When the chains meet, i.e. when all of their components become identical, the user can stop the procedure. From the output the user can obtain unbiased estimates of expectations with respect to the target distribution, which can be generated in parallel and averaged, as well as estimates of the finite-time bias of the underlying MCMC algorithm. Crucially, this information is retrieved from the distribution of the meeting time, which is an integer-valued random variable irrespective of the dimension of the problem, and thus provides a convenient summary of the performance of the algorithm. A difficulty inherent to the approach is the design of an effective coupling strategy for the algorithm under consideration. High-dimensional parameter spaces add a substantial layer of complication as some of the simpler coupling strategies, that are solely based on maximal couplings, lead to prohibitively long meeting times.

1.3 Our contribution

In this paper we argue that couplings offer a practical way of assessing the number of iterations required by MCMC algorithms in high-dimensional regression settings. Our specific contributions are summarized below.

We introduce blocked Gibbs samplers (Section 2) for Bayesian linear regression focusing on Half-t(ν) local shrinkage priors, extending the algorithm of Johndrow et al. (2020) for the Horseshoe. Our algorithm has an overall computation cost of 𝒪(n²p) per iteration, which is state-of-the-art for n ≪ p. We then design a one-scale coupling strategy for these Gibbs samplers (Section 2.3) and show that it results in meeting times that have exponential tails, which in turn validates their use in convergence diagnostics (Biswas et al., 2019). Our proof of exponentially-tailed meeting times is based on identifying a geometric drift condition, as a corollary to which we also establish geometric ergodicity of the marginal Gibbs sampler. Despite a decade of widespread use, MCMC algorithms for the Horseshoe and Half-t(ν) priors have generally not previously been shown to be geometrically ergodic, in contrast to MCMC algorithms for global-local models with exponentially-tailed local scale priors (Khare and Hobert, 2013; Pal and Khare, 2014). Our proofs utilize a uniform bound on generalized ridge regression estimates, which enables us to avoid a technical assumption of Johndrow et al. (2020), and auxiliary results on moments of distributions related to the confluent hypergeometric function of the second kind. Shortly after an earlier version of this manuscript was posted, Bhattacharya et al. (2021) have also established geometric ergodicity of related Gibbs samplers for the Horseshoe prior.

The meeting times resulting from the one-scale coupling increase exponentially as dimension increases. To address that issue we design improved coupling strategies (Section 3). We develop a two-scale strategy, based on a carefully chosen metric, which informs the choice between synchronous and maximal couplings. In numerical experiments in high dimensions this strategy leads to orders of magnitude shorter meeting times compared to the naïve approach. We establish some preliminary results in a stylized setting, providing a partial understanding of why the two-scale strategy leads to the observed drastic reduction in meeting times. We also describe other couplings that do not require explicit calculation of a metric between the chains, have exponentially-tailed meeting times, and empirically result in similar meeting times as the two-scale coupling. For our two-scale coupling, we further note that priors with stronger shrinkage towards zero can give significantly shorter meeting times in high dimensions. This motivates usage of Half-t(ν) priors with greater degrees of freedom ν > 1, which can give similar statistical performance to the Horseshoe (ν = 1) whilst enjoying orders of magnitude computational improvements.

Finally, we demonstrate (Section 4) that the proposed coupled MCMC algorithm is applicable to big data settings. Figure 1 displays an application of our coupling strategy to monitor the convergence of the Gibbs sampler for Bayesian regression on a genome-wide association study (GWAS) dataset with (n, p) ≈ (2000, 100000). Our method suggests that a burn-in of just 700 iterations can suffice even for such a large problem, and confirms the applicability of coupled MCMC algorithms for practical high-dimensional inference.

Refer to caption
Figure 1: Upper bounds on the total variation distance between the marginal distributions of the chain and the target distribution for Bayesian regression on a GWAS dataset with (n, p) ≈ (2000, 100000).

Scripts in R (R Core Team, 2013) are available from https://github.com/niloyb/CoupledHalfT to reproduce the figures of the article.

2 Coupled MCMC for regression with Half-t(ν) priors

We develop MCMC techniques for Bayesian shrinkage regression with Half-t(ν) priors. The model is given by (1)–(2) with η_j^{-1/2} ∼ t₊(ν) for some ν ≥ 1 as in (3). Posterior distributions resulting from such heavy-tailed shrinkage priors have desirable statistical properties, but their features pose challenges to general-purpose MCMC algorithms. Characterization of the marginal prior and posterior densities of β on ℝ^p given ξ and σ², and further comments on these challenges, are presented in Appendix A. Specifically, the resulting posterior distributions present 1) multimodality, 2) heavy tails and 3) infinite density at zero. This hints at a trade-off between statistical accuracy and computational feasibility, since the very features that present computational difficulties are crucial for optimal statistical performance across sparsity levels and signal strengths.

2.1 A blocked Gibbs sampler for Half-t(ν) priors

Blocked Gibbs samplers are popularly used for Bayesian regression with global-local shrinkage priors, due to the analytical tractability of full conditional distributions (Park and Casella, 2008; Carvalho et al., 2010; Polson et al., 2014; Bhattacharya et al., 2015; Makalic and Schmidt, 2016). Several convenient blocking and marginalization strategies are possible, leading to conditionals that are easy to sample from. For the case of the Horseshoe prior (ν = 1), Johndrow et al. (2020) have developed exact and approximate MCMC algorithms for high-dimensional settings, building on the algorithms of Polson et al. (2014) and Bhattacharya et al. (2016). We extend that sampler to general Half-t(ν) priors, and summarize it in (4). Given some initial state (β₀, η₀, σ₀², ξ₀) ∈ ℝ^p × ℝ^p_{>0} × ℝ_{>0} × ℝ_{>0}, it generates a Markov chain (β_t, η_t, σ_t², ξ_t)_{t≥0} which targets the posterior corresponding to Half-t(ν) priors.

\begin{split}
\pi(\eta_{t+1}|\beta_{t},\sigma_{t}^{2},\xi_{t}) &\propto \prod_{j=1}^{p}\frac{e^{-m_{t,j}\eta_{t+1,j}}}{\eta_{t+1,j}^{\frac{1-\nu}{2}}(1+\nu\eta_{t+1,j})^{\frac{\nu+1}{2}}} \quad\text{for } m_{t,j}=\frac{\xi_{t}\beta_{t,j}^{2}}{2\sigma_{t}^{2}},\\
\pi(\xi_{t+1}|\eta_{t+1}) &\propto L(y|\xi_{t+1},\eta_{t+1})\,\pi_{\xi}(\xi_{t+1}),\\
\sigma_{t+1}^{2}|\eta_{t+1},\xi_{t+1} &\sim \mathrm{InvGamma}\Big(\frac{a_{0}+n}{2},\frac{y^{T}M^{-1}y+b_{0}}{2}\Big),\\
\beta_{t+1}|\eta_{t+1},\xi_{t+1},\sigma_{t+1}^{2} &\sim \mathcal{N}(\Sigma^{-1}X^{T}y,\ \sigma_{t+1}^{2}\Sigma^{-1}).
\end{split}    (4)

Above, L(y|ξ_{t+1}, η_{t+1}) denotes the marginal likelihood of y given ξ_{t+1} and η_{t+1}, and we use the notation M = I_n + ξ_{t+1}^{-1} X Diag(η_{t+1}^{-1}) X^T and Σ = X^T X + ξ_{t+1} Diag(η_{t+1}). We sample η_{t+1} | β_t, σ_t², ξ_t component-wise independently using slice sampling, and sample ξ_{t+1} | η_{t+1} using Metropolis–Rosenbluth–Teller–Hastings with step-size σ_MRTH (Hastings, 1970). Full details of our sampler are in Algorithms 3 and 4 of Appendix D, and derivations are in Appendix E.
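For intuition, one slice-sampling update of a single component η_j given m_j can be sketched in R as below; the inverse-CDF draw from the truncated Gamma is an implementation detail assumed here, not necessarily the one used in Algorithm 3.

# Sketch: one slice-sampling update for eta_j | m_j, targeting the density
# proportional to eta^{(nu-1)/2} (1 + nu*eta)^{-(nu+1)/2} exp(-m*eta) on (0, Inf).
slice_update_eta <- function(eta, m, nu) {
  u <- runif(1, 0, (1 + nu * eta)^(-(nu + 1) / 2))  # auxiliary slice variable
  upper <- (u^(-2 / (nu + 1)) - 1) / nu             # the slice is (0, upper)
  shape <- (nu + 1) / 2
  # inverse-CDF draw from Gamma(shape, rate = m) truncated to (0, upper)
  qgamma(runif(1) * pgamma(upper, shape, rate = m), shape, rate = m)
}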

Sampling η_{t+1} | β_t, σ_t², ξ_t component-wise independently is an 𝒪(p) cost operation. Sampling ξ_{t+1} | η_{t+1} requires evaluating L(y|ξ_{t+1}, η_{t+1}), which involves an 𝒪(n²p) cost operation (calculating the weighted matrix cross-product X Diag(η_{t+1})^{-1} X^T) and an 𝒪(n³) cost operation (calculating y^T M^{-1} y and |M| using a Cholesky decomposition). Sampling σ_{t+1}² | η_{t+1}, ξ_{t+1} is only an 𝒪(1) cost operation, as y^T M^{-1} y is pre-calculated. Similarly, as M^{-1} is pre-calculated, sampling β_{t+1} | η_{t+1}, ξ_{t+1}, σ_{t+1}² using the algorithm of Bhattacharya et al. (2016) for structured multivariate Gaussians only involves an 𝒪(np) and an 𝒪(n²) cost operation. In high-dimensional settings with p ≫ n, which is our focus, the weighted matrix cross-product calculation is the most expensive, and overall the sampler described in (4) has 𝒪(n²p) computation cost.
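To make this cost accounting concrete, a minimal R sketch of the structured multivariate Gaussian update for β (in the spirit of Bhattacharya et al., 2016), reusing a pre-computed M^{-1}, could look as follows; the function and argument names are illustrative.

# Sketch: draw beta | eta, xi, sigma2 ~ N(Sigma^{-1} X^T y, sigma2 * Sigma^{-1})
# with Sigma = X^T X + xi * Diag(eta), reusing M_inv = (I_n + X Diag(d) X^T)^{-1}
# where d = 1 / (xi * eta); costs O(np) + O(n^2) once M_inv is available.
sample_beta_structured <- function(X, y, eta, xi, sigma2, M_inv) {
  n <- nrow(X); p <- ncol(X)
  d <- 1 / (xi * eta)
  u <- rnorm(p, sd = sqrt(d))
  v <- as.numeric(X %*% u) + rnorm(n)
  w <- as.numeric(M_inv %*% (y / sqrt(sigma2) - v))
  sqrt(sigma2) * (u + d * as.numeric(crossprod(X, w)))
}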

2.2 Coupling the proposed Gibbs sampler

We develop couplings for the blocked Gibbs sampler presented in Section 2.1. We will use these couplings to generate coupled chains with a time lag L ≥ 1, in order to implement the diagnostics of convergence proposed in Biswas et al. (2019). Consider an L-lag coupled chain (C_t, C̃_{t-L})_{t≥L} with meeting time τ := inf{t ≥ L : C_t = C̃_{t-L}}. Here C_t and C̃_t denote the full states (β_t, η_t, σ_t², ξ_t) and (β̃_t, η̃_t, σ̃_t², ξ̃_t) respectively. For the coupling-based methods proposed in Biswas et al. (2019); Jacob et al. (2020) to be most practical, the meeting times should occur as early as possible, under the constraint that both marginal chains (C_t)_{t≥0} and (C̃_t)_{t≥0} start from the same initial distribution, and that they both marginally evolve according to the blocked Gibbs sampler in (4).

Briefly, we recall how L-lag coupled chains enable the estimation of upper bounds on TV(π_t, π). Since C_t = C̃_{t-L} for all t ≥ τ, the indicator 𝟙(C_t ≠ C̃_{t-L}) can be evaluated for all t as soon as we observe τ. Such indicators are estimators of ℙ(C_t ≠ C̃_{t-L}), which themselves are upper bounds on TV(π_t, π_{t-L}), since C_t ∼ π_t and C̃_{t-L} ∼ π_{t-L}. We can then obtain upper bounds on TV(π_t, π) by the triangle inequality: TV(π_t, π) ≤ 𝔼[∑_{j≥0} 𝟙(C_{t+(j+1)L} ≠ C̃_{t+jL})]. This can be justified more formally if the meeting times are exponentially-tailed (Biswas et al., 2019). Such a tail condition also ensures that the expected computation time of unbiased estimators as in Jacob et al. (2020), generated independently in parallel, scales at a logarithmic rate in the number of processors. Thus we will be particularly interested in couplings that result in exponentially-tailed meeting times.
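In practice the upper bound above reduces to a simple function of the meeting time: for a single L-lag coupled pair, ∑_{j≥0} 𝟙(C_{t+(j+1)L} ≠ C̃_{t+jL}) = max(0, ⌈(τ − L − t)/L⌉), and averaging this quantity over independently simulated pairs estimates the bound. A minimal R sketch, assuming a vector of independently simulated meeting times, is:

# Sketch: estimate the upper bound on TV(pi_t, pi) from independent L-lag
# meeting times, using max(0, ceiling((tau - L - t) / L)) for each coupled pair.
tv_upper_bound <- function(meeting_times, L, t) {
  mean(pmax(0, ceiling((meeting_times - L - t) / L)))
}
# Example usage over a grid of iterations t:
# sapply(0:1000, function(t) tv_upper_bound(meeting_times, L = 1, t))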

In Section 2.3, we consider an algorithm based on maximal couplings. We show that the ensuing meeting times have exponential tails, and discuss how our analysis directly implies that the marginal chain introduced in Section 2.1 is geometrically ergodic. We then illustrate difficulties encountered by such a scheme in high dimensions, which motivates the alternative strategies described in Section 3. For simplicity, we mostly omit the lag L from the notation.

2.3 One-scale coupling

For the blocked Gibbs sampler in Section 2.1, we first consider a coupled MCMC algorithm that attempts exact meetings at every step. We will apply a maximal coupling algorithm with independent residuals (see Thorisson (2000, Chapter 1 Section 4.4) and Johnson (1998)). It is included in Algorithm 5 of Appendix D, and has an expected computation cost of two units (Johnson, 1998).
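For reference, a generic maximal coupling with independent residuals of two univariate distributions can be sketched in R as below, given samplers and normalized densities of the two marginals; this mirrors the construction used in Algorithm 5 but is only an illustrative sketch, not the implementation in our released scripts.

# Sketch: maximal coupling with independent residuals (Johnson, 1998) of two
# distributions given by samplers rp, rq and (normalized) densities dp, dq.
coupled_max <- function(rp, dp, rq, dq) {
  x <- rp()
  if (runif(1) * dp(x) <= dq(x)) {
    return(list(x = x, y = x, met = TRUE))   # exact meeting
  }
  repeat {                                   # otherwise sample from the residual of q
    ystar <- rq()
    if (runif(1) * dq(ystar) > dp(ystar)) {
      return(list(x = x, y = ystar, met = FALSE))
    }
  }
}
# Example (Step 2(b) flavor): maximal coupling of two Inverse Gamma draws,
# using InvGamma(a, b) = 1 / Gamma(shape = a, rate = b).
# coupled_max(function() 1 / rgamma(1, 2, rate = 1),
#             function(x) dgamma(1 / x, 2, rate = 1) / x^2,
#             function() 1 / rgamma(1, 2, rate = 1.5),
#             function(x) dgamma(1 / x, 2, rate = 1.5) / x^2)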

Our initial coupled MCMC kernel is given in Algorithm 1, which we refer to as a one-scale coupling. That is, before the chains have met (when C_t ≠ C̃_t), the coupled kernel in Steps (1) and (2)(a-c) does not explicitly depend on the distance between the states C_t and C̃_t. After meeting, the coupled chains remain together by construction, such that C_t = C̃_t implies C_{t+1} = C̃_{t+1}. When C_t ≠ C̃_t, Step (1) uses the coupled slice sampler of Algorithm 2 component-wise for (η_{t+1}, η̃_{t+1}), which allows each pair of components (η_{t+1,j}, η̃_{t+1,j}) to meet exactly with positive probability. In Algorithm 2, we use a common random numbers or “synchronous” coupling of the auxiliary random variables (U_{j,*}, Ũ_{j,*}), and alternative couplings could be considered. Steps (2)(a-b) are maximal couplings of the conditional sampling steps for (ξ, ξ̃) and (σ², σ̃²), such that (σ_{t+1}², ξ_{t+1}, η_{t+1}) = (σ̃_{t+1}², ξ̃_{t+1}, η̃_{t+1}) occurs with positive probability for all t ≥ 0. Step (2)(c) uses common random numbers, such that (σ_{t+1}², ξ_{t+1}, η_{t+1}) = (σ̃_{t+1}², ξ̃_{t+1}, η̃_{t+1}) implies β_{t+1} = β̃_{t+1}. This allows the full chains to meet exactly with positive probability at every step.

Input: C_t := (β_t, η_t, σ_t², ξ_t) and C̃_t := (β̃_t, η̃_t, σ̃_t², ξ̃_t).
if C_t = C̃_t then sample C_{t+1} | C_t as in (4) and set C̃_{t+1} = C_{t+1}.
else

  1. Sample (η_{t+1}, η̃_{t+1}) | C_t, C̃_t component-wise independently using Algorithm 2.

  2. Sample ((ξ_{t+1}, σ_{t+1}², β_{t+1}), (ξ̃_{t+1}, σ̃_{t+1}², β̃_{t+1})) given η_{t+1}, η̃_{t+1} as follows:

    (a) Sample (ξ_{t+1}, ξ̃_{t+1}) | η_{t+1}, η̃_{t+1}, ξ_t, ξ̃_t using coupled Metropolis–Rosenbluth–Teller–Hastings (Algorithm 9 in Appendix D).

    (b) Sample (σ_{t+1}², σ̃_{t+1}²) | ξ_{t+1}, η_{t+1}, ξ̃_{t+1}, η̃_{t+1} from a maximal coupling of two Inverse Gamma distributions (Algorithm 5 in Appendix D).

    (c) Sample (β_{t+1}, β̃_{t+1}) | σ_{t+1}², ξ_{t+1}, η_{t+1}, σ̃_{t+1}², ξ̃_{t+1}, η̃_{t+1} using common random numbers for Gaussian scale mixture priors (Algorithm 8 in Appendix D).

return (C_{t+1}, C̃_{t+1}).
Algorithm 1: A one-scale coupled MCMC kernel for Half-t(ν) priors.
Result: A coupling of slice samplers marginally targeting p(·|m_j) and p(·|m̃_j), where p(η_j | m) ∝ (η_j^{(1-ν)/2} (1 + νη_j)^{(ν+1)/2})^{-1} e^{-mη_j} on (0, ∞).
Input: η_{t,j}, η̃_{t,j}, m_{t,j} := ξ_t β_{t,j}²/(2σ_t²), m̃_{t,j} := ξ̃_t β̃_{t,j}²/(2σ̃_t²) > 0.

  1. Sample U^{crn}_{j,*} ∼ Uniform(0, 1), and set U_{j,*} := U^{crn}_{j,*} (1 + νη_{t,j})^{-(ν+1)/2} and Ũ_{j,*} := U^{crn}_{j,*} (1 + νη̃_{t,j})^{-(ν+1)/2}.

  2. Sample (η_{t+1,j}, η̃_{t+1,j}) | U_{j,*}, Ũ_{j,*} from a maximal coupling of distributions P_j and P̃_j using Algorithm 5 in Appendix D, where P_j and P̃_j have unnormalized densities η ↦ η^{s-1} e^{-m_j η} and η ↦ η^{s-1} e^{-m̃_j η} on (0, T_j) and (0, T̃_j) respectively, with T_j = (U_{j,*}^{-2/(1+ν)} − 1)/ν, T̃_j = (Ũ_{j,*}^{-2/(1+ν)} − 1)/ν and s = (1+ν)/2.

return (η_{t+1,j}, η̃_{t+1,j}).
Algorithm 2: Coupled slice sampler for Half-t(ν) priors using maximal couplings.

Under the one-scale coupling, we prove that the meeting times have exponential tails and hence finite expectation. Our analysis is based on identifying a suitable drift function for a variant of the marginal chain in Section 2.1 and an application of Jacob et al. (2020, Proposition 4). We assume that the global shrinkage prior π_ξ(·) has a compact support on (0, ∞). Such compactly supported priors for ξ have been recommended by van der Pas et al. (2017) to achieve optimal posterior concentration for the Horseshoe in the Normal means model. For simplicity, we also assume that each component η_{t+1,j} | β_{t,j}, ξ_t, σ_t² and ξ_{t+1} | η_{t+1} in (4) is sampled perfectly, instead of the slice sampling and Metropolis–Rosenbluth–Teller–Hastings steps respectively. Perfect sampling algorithms for each component η_{t+1,j} | β_{t,j}, ξ_t, σ_t² and for ξ_{t+1} | η_{t+1} are provided in Algorithms 10 and 11 of Appendix D for completeness; see also Bhattacharya and Johndrow (2021) for perfect sampling of the components of η_{t+1,j} | β_{t,j}, ξ_t, σ_t² for the Horseshoe. Perfectly sampling ξ_{t+1} | η_{t+1} and η_{t+1,j} | β_{t,j}, ξ_t, σ_t² retains the 𝒪(n²p) computational cost for the full blocked Gibbs sampler, though in practice this implementation is more expensive per iteration due to the computation of eigenvalue decompositions in lieu of Cholesky decompositions, and the computation of the inverse of the confluent hypergeometric function of the second kind.

Proposition 2.1 gives a geometric drift condition for this variant of the blocked Gibbs sampler.

Proposition 2.1.

(Geometric drift). Consider the Markov chain (β_t, ξ_t, σ_t²)_{t≥0} generated from the blocked Gibbs sampler in (4), where each component η_{t+1,j} | β_{t,j}, ξ_t, σ_t² and ξ_{t+1} | η_{t+1} is sampled perfectly, such that (β_t, ξ_t, σ_t²) ↦ η_{t+1} ↦ (β_{t+1}, ξ_{t+1}, σ_{t+1}²) with η_{t+1} intermediary for each t ≥ 0. Assume the global shrinkage prior π_ξ(·) has a compact support. For m_j = ξβ_j²/(2σ²), c ∈ (0, 1/2) and d ∈ (0, 1), define

V(\beta,\xi,\sigma^{2})=\sum_{j=1}^{p}m_{j}^{-c}+m_{j}^{d}.    (5)

Then for each ν ≥ 1, there exist some c ∈ (0, 1/2), d ∈ (0, 1) such that V is a drift function, i.e., there exist γ ∈ (0, 1) and K ∈ (0, ∞) such that for all β_t ∈ ℝ^p, ξ_t > 0 and σ_t² > 0,

\mathbb{E}[V(\beta_{t+1},\xi_{t+1},\sigma_{t+1}^{2})|\beta_{t},\xi_{t},\sigma_{t}^{2}]\leq\gamma V(\beta_{t},\xi_{t},\sigma_{t}^{2})+K.    (6)

The drift (or “Lyapunov”) function in (5) approaches infinity when any β_j approaches the origin or infinity. This ensures that the corresponding sub-level sets, defined by S(R) = {(β, ξ, σ²) ∈ ℝ^p × ℝ_{>0} × ℝ_{>0} : V(β, ξ, σ²) < R} for R > 0, exclude the pole at the origin and are bounded. Such a geometric drift condition, in conjunction with Jacob et al. (2020, Proposition 4), helps verify that the meeting times under the one-scale coupling have exponential tails and hence finite expectation.
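As a small numerical aid, the drift function (5) can be evaluated directly; the exponent values in the R sketch below are illustrative choices within the admissible ranges of Proposition 2.1.

# Sketch: evaluate the drift function V in (5) for given (beta, xi, sigma2);
# the exponents c in (0, 1/2) and d in (0, 1) are illustrative choices.
drift_V <- function(beta, xi, sigma2, c = 0.25, d = 0.5) {
  m <- xi * beta^2 / (2 * sigma2)
  sum(m^(-c) + m^d)
}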

Proposition 2.2.

Consider the blocked Gibbs sampler in (4). In the setup of Proposition 2.1, assume that the global shrinkage prior π_ξ(·) has a compact support. Write Z_t = (β_t, ξ_t, σ_t²) and Z̃_t = (β̃_t, ξ̃_t, σ̃_t²). Consider the one-scale coupling given by (Z_t, Z̃_t) ↦ (η_{t+1}, η̃_{t+1}) ↦ (Z_{t+1}, Z̃_{t+1}), where (η_{t+1}, η̃_{t+1}) are maximally coupled component-wise and (Z_{t+1}, Z̃_{t+1}) are coupled using common random numbers. Denote the meeting time by τ := inf{t ≥ 0 : Z_t = Z̃_t}. Then ℙ(τ > t) ≤ A₀κ₀^t for some constants A₀ ∈ (0, ∞) and κ₀ ∈ (0, 1), and for all t ≥ 0.

Exponentially-tailed meeting times immediately imply that the blocked Gibbs sampler in Section 2.1 is geometrically ergodic. This follows from noting that Jacob et al. (2020, Proposition 4) is closely linked to a minorization condition (Rosenthal, 1995), and details are included in Proposition C.10 of Appendix C. For a class of Gibbs samplers targeting shrinkage priors including the Bayesian LASSO, the Normal-Gamma prior (Griffin and Brown, 2010), and the Dirichlet-Laplace prior (Bhattacharya et al., 2015), Khare and Hobert (2013) and Pal and Khare (2014) have proven geometric ergodicity based on drift and minorization arguments (Meyn and Tweedie, 1993; Roberts and Rosenthal, 2004). For the Horseshoe prior, Johndrow et al. (2020) have recently established geometric ergodicity. In the high-dimensional setting with p > n for the Horseshoe, the proof of Johndrow et al. (2020) required truncating the prior on each η_j below by a small ℓ > 0 to guarantee the uniform bound η_j ≥ ℓ. Our proof of geometric ergodicity for Half-t(ν) priors, including the Horseshoe (ν = 1), works in both low and high-dimensional settings, without any modification of the priors on the η_j. In parallel work, Bhattacharya et al. (2021) have also established geometric ergodicity for the Horseshoe prior without requiring such a truncation.

Remark 2.3.

Our proofs of exponentially-tailed meeting times and geometric ergodicity generalize to a larger class of priors satisfying some moment conditions on the full conditionals of η_j. Consider a compactly supported prior π_ξ on ξ, and a prior π_η on each η_j. Then the unnormalized density of the full conditional of each η_j is given by π(η_j | m_j = ξβ_j²/(2σ²)) ∝ η_j^{1/2} e^{-m_j η_j} π_η(η_j). Consider the following assumptions on π_η.

  1. For some 0 < c < 1/2, there exist 0 < ϵ < Γ(1/2 − c)^{-1}√π and K^{(1)}_{c,ϵ} < ∞ such that 𝔼[η_j^c | m_j] ≤ ϵ m_j^{-c} + K^{(1)}_{c,ϵ} for all m_j > 0.

  2. For some 0 < d < 1, there exist 0 < ϵ < 2^d and K^{(2)}_{d,ϵ} < ∞ such that 𝔼[η_j^{-d} | m_j] ≤ ϵ m_j^d + K^{(2)}_{d,ϵ} for all m_j > 0.

Then the blocked Gibbs sampler in (4) is geometrically ergodic.

We now consider the performance of Algorithm 1 on synthetic datasets. In particular, we empirically illustrate that the rate κ₀ in Proposition 2.2 can tend to 1 exponentially fast as the dimension p increases. The design matrix X ∈ ℝ^{n×p} is generated with entries [X]_{i,j} i.i.d. 𝒩(0, 1), and a response vector as y ∼ 𝒩(Xβ_*, σ_*² I_n), where β_* ∈ ℝ^p is the true signal and σ_* ≥ 0 is the true error standard deviation. We choose β_* to be sparse: given a sparsity parameter s, β_{*,j} = 2^{(9−j)/4} for 1 ≤ j ≤ s and β_{*,j} = 0 for all j > s. Figure 2 shows the meeting times τ of coupled Markov chains (with a lag L = 1) targeting the Horseshoe posterior (ν = 1) with a₀ = b₀ = 1 under Algorithm 1. We consider n = 100, s = 10, σ_* = 0.5 and vary the dimension p. For each dimension p, we simulate 100 meeting times under the one-scale coupling for independently generated synthetic datasets. We initialize chains independently from the prior distribution. For the ξ updates, we use a proposal step-size of σ_MRTH = 0.8 as in Johndrow et al. (2020). Figure 2, with a vertical axis on a logarithmic scale, shows that the meeting times increase exponentially and have heavier tails with increasing dimension.
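For reproducibility of this setup, one synthetic dataset can be generated along the following lines in R; the seed and the specific values of n, p, s below are illustrative.

# Sketch: generate one synthetic dataset as described above.
set.seed(1)
n <- 100; p <- 1000; s <- 10; sigma_star <- 0.5
X <- matrix(rnorm(n * p), nrow = n, ncol = p)              # [X]_{i,j} iid N(0,1)
beta_star <- c(2^((9 - (1:s)) / 4), rep(0, p - s))         # sparse true signal
y <- as.numeric(X %*% beta_star + sigma_star * rnorm(n))   # y ~ N(X beta_*, sigma_*^2 I_n)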

Refer to caption
Figure 2: Meeting times against dimension for posteriors associated with a Horseshoe prior under the one-scale coupling of Algorithm 1, with n = 100, s = 10, σ_* = 0.5.

3 Coupling strategies for high-dimensional settings

To address the scaling issues observed with the one-scale coupling in Section 2.3, we develop a two-scale coupling in Section 3.1 that delivers better empirical performance. We also offer some formal results supporting the empirical improvements. In Section 3.2 we develop other related coupling strategies which have exponentially-tailed meeting times and similar empirical performance to the two-scale coupling. In Section 3.3, we investigate the impact of the degrees of freedom ν of Half-t(ν) priors on our two-scale coupling. We find that Half-t(ν) priors with higher degrees of freedom ν > 1 can give similar statistical performance to the Horseshoe (ν = 1), whilst enjoying orders of magnitude computational improvements through shorter meeting times.

3.1 A two-scale coupling

We develop a two-scale coupling algorithm; our use of this term differs slightly from its usage in stochastic dynamics and in the analysis of pre-conditioned HMC (Bou-Rabee and Eberle, 2020). It is also closely linked to the delayed one-shot coupling of Roberts and Rosenthal (2002), and refers to coupled kernels that attempt exact meetings when the chains are close, and aim for a contraction when the chains are far. The motivation for this construction is that in Algorithm 1, the components of (η_t, η̃_t) that fail to meet instead evolve independently. As dimension grows, the number of components of (η_t, η̃_t) that fail to meet also tends to grow, making it increasingly difficult for all components to meet exactly on subsequent iterations. In the two-scale coupling, we only attempt to obtain exact meetings when the associated probability is large enough. This is done by constructing a metric d and a corresponding threshold parameter d₀ ≥ 0 such that when the current states are d-close, with d(C_t, C̃_t) ≤ d₀, we apply Algorithm 2 and try to obtain exact meetings. When d(C_t, C̃_t) > d₀, we instead employ common random numbers to sample (η_t, η̃_t).

Our chosen metric d on ℝ^p × ℝ^p_{>0} × ℝ_{>0} × ℝ_{>0} is

d(C_{t},\tilde{C}_{t})=\mathbb{P}\big(\eta_{t+1}\neq\tilde{\eta}_{t+1}\,\big|\,C_{t},\tilde{C}_{t}\big),    (7)

where recall that C_t = (β_t, η_t, σ_t², ξ_t), C̃_t = (β̃_t, η̃_t, σ̃_t², ξ̃_t), and (η_{t+1}, η̃_{t+1}) | C_t, C̃_t is sampled using the coupled slice sampler in Algorithm 2. As the coupling in Algorithm 2 is independent component-wise, we can obtain a simplified expression involving only univariate probabilities as

d(C_{t},\tilde{C}_{t})=1-\prod_{j=1}^{p}\mathbb{P}\big(\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\,\big|\,\eta_{t,j},\tilde{\eta}_{t,j},m_{t,j},\tilde{m}_{t,j}\big),    (8)

where m_{t,j} = ξ_t β_{t,j}²/(2σ_t²) and m̃_{t,j} = ξ̃_t β̃_{t,j}²/(2σ̃_t²). Under this metric d and a threshold d₀ ∈ [0, 1], an exact meeting is attempted when ℙ(η_{t+1} = η̃_{t+1} | C_t, C̃_t) ≥ 1 − d₀. Once η_{t+1} and η̃_{t+1} coincide, and if the scalars ξ_{t+1} and ξ̃_{t+1} also subsequently coincide, then the entire chains meet. When d₀ = 1, the two-scale coupling reduces to the one-scale coupling of Algorithm 1, since the maximal coupling slice sampler (Algorithm 2) is invoked at each step. On the other hand, when the threshold is d₀ = 0, we always apply common random numbers. Then the chains C_t, C̃_t may come arbitrarily close but will never exactly meet. Empirically we find that different threshold values d₀ away from 0 and 1 give similar meeting times. Simulations concerning these choices can be found in Appendix B.2.

We now discuss computation of the metric d(C_t, C̃_t). Since the probability in (7) is unavailable in closed form, we can resort to various approximations. One option is to evaluate the one-dimensional integrals in (8) using numerical integration. Alternatively, one can form the Monte Carlo estimate

\widehat{d_{R}}^{(1)}(C_{t},\tilde{C}_{t})=\frac{1}{R}\sum_{r=1}^{R}\mathbbm{1}\{\eta^{(r)}_{t+1}\neq\tilde{\eta}^{(r)}_{t+1}\},    (9)

where (η^{(r)}_{t+1}, η̃^{(r)}_{t+1}) are sampled independently for r = 1, …, R using Algorithm 2. We recommend a Rao–Blackwellized estimate obtained by combining Monte Carlo with analytical calculations of integrals as

\widehat{d_{R}}^{(2)}(C_{t},\tilde{C}_{t})=1-\prod_{j=1}^{p}\Big(\frac{1}{R}\sum_{r=1}^{R}\mathbb{P}\Big(\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\,\Big|\,U^{(r)}_{j},\tilde{U}^{(r)}_{j},m_{t,j},\tilde{m}_{t,j}\Big)\Big),    (10)

where (U^{(r)}_j, Ũ^{(r)}_j) are sampled independently for r = 1, …, R using Algorithm 2, and each ℙ(η_{t+1,j} = η̃_{t+1,j} | U^{(r)}_j, Ũ^{(r)}_j, m_{t,j}, m̃_{t,j}) is calculated analytically based on the meeting probabilities from Algorithm 2. Further metric calculation details are in Appendix B.3. For a number of samples R, the estimates (9) and (10) both have a computation cost of order pR. Compared to the estimate in (9), the estimate in (10) has lower variance and faster numerical run-times, as it only involves sampling uniformly distributed random numbers. It suffices to choose a small number of samples R, and often we take R = 1. Indeed it appears unnecessary to estimate d(C_t, C̃_t) accurately, as we are only interested in comparing that distance to a fixed threshold d₀ ∈ [0, 1]. Often the trajectories (d(C_t, C̃_t))_{t≥0} initially take values close to 1, and then sharply drop to values close to 0. This leads to the estimates in (9) and (10) having low variance. Henceforth, we will use the estimate in (10) in our experiments with two-scale couplings, unless specified otherwise.
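Schematically, the η-update of the two-scale coupling then reduces to comparing a distance estimate with the threshold d₀. In the R sketch below, coupled_eta_max and coupled_eta_crn stand for assumed implementations of Algorithm 2 and of its common random numbers counterpart, and the estimate corresponds to the simple Monte Carlo version (9).

# Sketch of the two-scale eta-update: estimate the non-meeting probability and
# choose between the maximal coupling (Algorithm 2) and common random numbers.
estimate_distance <- function(Ct, Ct_tilde, coupled_eta_max, R = 1) {
  mean(replicate(R, {
    pair <- coupled_eta_max(Ct, Ct_tilde)          # one coupled draw of (eta, eta~)
    any(pair$eta != pair$eta_tilde)                # 1 if the draws differ anywhere
  }))
}
two_scale_eta_update <- function(Ct, Ct_tilde, coupled_eta_max, coupled_eta_crn,
                                 d0 = 0.5, R = 1) {
  if (estimate_distance(Ct, Ct_tilde, coupled_eta_max, R) <= d0) {
    coupled_eta_max(Ct, Ct_tilde)                  # chains close: try to meet exactly
  } else {
    coupled_eta_crn(Ct, Ct_tilde)                  # chains far: contract with CRN
  }
}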

Overall the two-scale coupling kernel has twice the 𝒪(n²p) cost of the single-chain kernel in (4). When C_t ≠ C̃_t, we calculate a distance estimate and sample from a coupled slice sampling kernel. Calculating the distance estimate involves sampling 2pR uniforms and has 𝒪(p) cost. Coupled slice sampling with maximal couplings (Algorithm 2) or common random numbers both have expected or deterministic cost 𝒪(p). The remaining steps of the two-scale coupling match the one-scale coupling of Algorithm 1.

We now consider the performance of the two-scale coupling on synthetic datasets. The setup is identical to that introduced in Section 2.3, where for each dimension p we simulate 100 meeting times based on independently generated synthetic datasets. We target the Horseshoe posterior (ν = 1) with a₀ = b₀ = 1. Figure 3 shows the meeting times τ of coupled Markov chains (with a lag L = 1), for both the one-scale coupling and the two-scale coupling with R = 1 and threshold d₀ = 0.5. Figure 3 shows that the two-scale coupling can lead to orders of magnitude reductions in meeting times, compared to the one-scale coupling.

Refer to caption
Figure 3: Meeting times against dimension for posteriors associated with a Horseshoe prior under one-scale and two-scale couplings, with n = 100, s = 10, σ_* = 0.5.

We next present some preliminary results which hint as to why the two-scale coupling leads to the drastic reduction in meeting times. We focus on the coupling of the (η, η̃) full conditionals, as the one-scale and two-scale algorithms differ exactly at this step. First, we show that for any monotone function h : (0, ∞) → ℝ and all components j = 1, …, p, the expected distance 𝔼[|h(η_{t+1,j}) − h(η̃_{t+1,j})|] is the same under both common random numbers and the maximal coupling.

Proposition 3.1.

Consider the setup of Proposition 2.2, such that the η full conditionals are sampled perfectly. Then for any monotone function h : (0, ∞) → ℝ and all components j = 1, …, p,

\mathbb{E}_{\text{max}}\big[|h(\eta_{t+1,j})-h(\tilde{\eta}_{t+1,j})|\,\big|\,m_{t},\tilde{m}_{t}\big]=\mathbb{E}_{\text{crn}}\big[|h(\eta_{t+1,j})-h(\tilde{\eta}_{t+1,j})|\,\big|\,m_{t},\tilde{m}_{t}\big],    (11)

where 𝔼_max and 𝔼_crn correspond to expectations under the maximal coupling and the CRN coupling respectively of (η_{t+1}, η̃_{t+1}) given m_t and m̃_t.

Proposition 3.1 implies that such single-step expectations alone do not distinguish the behavior of CRN from maximal couplings. This compels us to investigate other distributional features of |h(η_{t+1,j}) − h(η̃_{t+1,j})| under either coupling to possibly tease apart differences in their behavior on high probability events. Focusing on the Horseshoe prior and making the choice h(x) = log(1 + x), we uncover a distinction between the tail behavior of the two couplings which can substantially accumulate over p independent coordinates. We offer an illustration in a stylized setting below.

Proposition 3.2.

For the Horseshoe prior (ν = 1), consider the case where m_t = μ𝟏_p ∈ (0, ∞)^p and m̃_t = μ̃𝟏_p ∈ (0, ∞)^p for some positive scalars μ and μ̃ with μ̃ ≠ μ. Then under a CRN coupling,

\mathbb{P}\big(\max_{j=1,\ldots,p}|\log(1+\eta_{t+1,j})-\log(1+\tilde{\eta}_{t+1,j})|>|\log\mu-\log\tilde{\mu}|\ \big|\ m_{t},\tilde{m}_{t}\big)=0    (12)

for all p ≥ 1. Define the positive constant L = log(1 + 1/max{μ, μ̃})/2. Under the maximal coupling, for any α ∈ (0, 1) and any α′ ∈ (α, 1), there exists a constant D > 0 which does not depend on p such that for all p > 1,

\mathbb{P}\big(\max_{j=1,\ldots,p}|\log(1+\eta_{t+1,j})-\log(1+\tilde{\eta}_{t+1,j})|>\log(\alpha L)+\log\log p\ \big|\ m_{t},\tilde{m}_{t}\big)>1-e^{-Dp^{1-\alpha^{\prime}}}.    (13)

Extensions of Proposition 3.2 which allow different components of m_t and m̃_t to take different values under some limiting assumptions are omitted here for simplicity. This stylized setting already captures why the one-scale algorithm may result in much larger meeting times in high dimensions compared to the CRN-based two-scale algorithm. We note that Proposition 3.2 remains informative whenever |log μ − log μ̃| ≪ log log p, even when log log p is not large. For example, consider μ = μ̃e^δ for some small δ > 0 with δ ≪ log log p. Under CRN, log(1 + η_{t+1,j}) and log(1 + η̃_{t+1,j}) remain δ-close almost surely for all components j. In contrast, under the maximal coupling, at least one component will have larger than log(αL) + log log p deviations with high probability. Since α and L = log(1 + 1/μ)/2 do not depend on p or δ, log(αL) + log log p ≫ δ for fixed μ, μ̃ > 0. These larger deviations between some components of η_{t+1} and η̃_{t+1} under maximal couplings push the two chains much further apart when sampling (m_{t+1}, m̃_{t+1}) | η_{t+1}, η̃_{t+1}, resulting in meeting probabilities that are much lower over multiple steps of the algorithm. Overall, Propositions 3.1 and 3.2 show that multi-step kernel calculations taking into account such high-probability events may be necessary to further distinguish the relative performances of one-scale and two-scale couplings. A full analytic understanding of this difference in performance remains an open problem.

Another way of understanding the benefits of the two-scale coupling would be to find a distance under which the coupled chains under common random numbers exhibit a contraction. The CRN coupling part of Proposition 3.2 hints that d̄(m_t, m̃_t) := ∑_{j=1}^p |log m_{t,j} − log m̃_{t,j}| may be a natural metric. Appendix B.1 contains discussions and related simulations comparing this metric with alternatives. Proposition 3.3 verifies that this metric bounds the total variation distance between one-step transition kernels.

Proposition 3.3.

Let 𝒫 denote the Markov transition kernel associated with the update from (β_t, ξ_t, σ_t²) to (β_{t+1}, ξ_{t+1}, σ_{t+1}²) for the Gibbs sampler in (4). Let Z = (β, ξ, σ²), Z̃ = (β̃, ξ̃, σ̃²), m_j = ξβ_j²/(2σ²), and m̃_j = ξ̃β̃_j²/(2σ̃²) for j = 1, …, p. Then,

\text{TV}\big(\mathcal{P}(Z,\cdot),\mathcal{P}(\tilde{Z},\cdot)\big)^{2}\leq\frac{(1+\nu)(e-1)}{4}\,\bar{d}(Z,\tilde{Z}).    (14)

By Proposition 3.3, it suffices to show that (d̄(Z_t, Z̃_t))_{t≥0} contracts under CRN to a sufficiently small value such that the upper bound in (14) is strictly less than 1. This would ensure that the CRN coupling brings the marginal chains close enough that the probability of a one-step maximal coupling resulting in a meeting is high. Note that (14) corresponds to the total variation distance of the full chain, so under this setup we would attempt to maximally couple the full vectors η_{t+1} and η̃_{t+1} (rather than maximally coupling component-wise) when the chains are close. Analytical calculations to establish such a contraction, and to understand how the contraction rate scales with dimension and other features of the problem such as the choice of prior, sparsity, and signal-to-noise ratio, require further work.

3.2 Alternative coupling strategies

The rapid meeting of the two-scale coupling in simulations recommends its use. However, we were unable to establish finite expected meeting times with this coupling, which prevents its use in constructing unbiased estimates. We now describe several small modifications of the two-scale coupling that have similar empirical performance but for which we can prove that the meeting time distribution has exponential tails.

Our basic approach is to maximally couple each component (η_{t,j}, η̃_{t,j}) until a failed meeting {η_{t,j} ≠ η̃_{t,j}} occurs, after which we switch to CRN for the remaining components. The ordering of the components j = 1, …, p can be deterministic or randomized at each iteration. Under this construction, only one component of (η_t, η̃_t), corresponding to the failed meeting, will evolve independently; the other components will either meet or evolve under CRN. Therefore, we expect to obtain benefits similar to the two-scale coupling in high dimensions. An apparent advantage is that we can bypass the choice of metric and associated threshold; a sketch of the resulting η-update is given below. While the switch-to-CRN coupling has the same 𝒪(n²p) cost per iteration as the two-scale coupling, it can give faster run-times in practice when the dimension p is large compared to the number of observations n, as it avoids repeated calculation of a metric.
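A minimal R sketch of this η-update follows; couple_component_max and couple_component_crn stand for assumed per-component couplings (the maximal coupling of Algorithm 2 and its common random numbers counterpart), each returning a pair of draws for component j.

# Sketch of the switch-to-CRN eta-update: maximally couple components in a
# random order until the first failed meeting, then use CRN for the rest.
switch_to_crn_eta_update <- function(p, couple_component_max, couple_component_crn) {
  eta <- numeric(p); eta_tilde <- numeric(p)
  switched <- FALSE
  for (j in sample.int(p)) {                     # randomized ordering of components
    out <- if (!switched) couple_component_max(j) else couple_component_crn(j)
    if (!switched && out$eta != out$eta_tilde) switched <- TRUE  # first failed meeting
    eta[j] <- out$eta; eta_tilde[j] <- out$eta_tilde
  }
  list(eta = eta, eta_tilde = eta_tilde)
}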

We report the performance of the switch-to-CRN coupling with the ordering of the components randomized at each iteration. The setup is identical to that in Section 3.1, where for each dimension p we simulate 100 meeting times based on independently generated synthetic datasets. Figure 4 shows that the switch-to-CRN coupling leads to similar meeting times as the two-scale coupling.

Refer to caption
Figure 4: Meeting times for posteriors from the Horseshoe prior under the two-scale and the switch-to-CRN coupling algorithms, with n = 100, s = 10, σ_* = 0.5.

We can establish that the meeting times under this switch-to-CRN coupling have exponential tails and hence finite expectation.

Proposition 3.4.

Consider the blocked Gibbs sampler in (4). We follow the setup in Proposition 2.1, and assume that the global shrinkage prior π_ξ(·) has a compact support. Denote Z_t = (β_t, ξ_t, σ_t²) and Z̃_t = (β̃_t, ξ̃_t, σ̃_t²). Consider the switch-to-CRN coupling given by (Z_t, Z̃_t) ↦ (η_{t+1}, η̃_{t+1}) ↦ (Z_{t+1}, Z̃_{t+1}), where (η_{t+1}, η̃_{t+1}) are maximally coupled component-wise (under any fixed or random ordering of the components) until the first failed meeting, after which common random numbers are employed, and (Z_{t+1}, Z̃_{t+1}) are coupled using common random numbers. Denote the meeting time by τ := inf{t ≥ 0 : Z_t = Z̃_t}. Then ℙ(τ > t) ≤ A₁κ₁^t for some constants A₁ ∈ (0, ∞) and κ₁ ∈ (0, 1), and for all t ≥ 0.

As in Proposition 2.2, our proof of Proposition 3.4 implies a rate κ₁ which tends to 1 exponentially as the dimension p increases. Obtaining more favorable rates with respect to the dimension, as hinted at by Figure 4, would require a better understanding of the CRN coupling, for example establishing a contraction in some parts of the state space as discussed above.

Overall, Figures 3 and 4 show significant empirical improvements from using the two-scale coupling and related strategies, compared to the one-scale coupling of Section 2.3. Therefore, we recommend the use of the two-scale coupling and variants such as the switch-to-CRN coupling for practical high-dimensional applications. Furthermore, in settings where estimating the metric d is computationally expensive or analytically intractable, variants such as the switch-to-CRN coupling may offer faster numerical run-times per iteration. Henceforth, we will use the two-scale coupling of Section 3.1 in our experiments and GWAS dataset applications, unless specified otherwise.

3.3 Computational and statistical impact of the degrees of freedom ν

We empirically investigate the impact of the degrees of freedom ν ≥ 1 of Half-t(ν) priors. Higher degrees of freedom ν correspond to priors on β with stronger shrinkage towards zero (Proposition A.1). We consider meeting times resulting from the two-scale coupling of Section 3.1, as well as the statistical performance of the posteriors. On the computation side, recent work on convergence analysis (Qin and Hobert, 2019a, b) has highlighted the impact of shrinkage on the convergence of simpler MCMC algorithms. Under a common random numbers coupling, Qin and Hobert (2019b) have studied the convergence of the Gibbs sampler of Albert and Chib (1993) for Bayesian probit regression, proving dimension-free convergence rates for priors with sufficiently strong shrinkage towards zero. On the statistical estimation side, Half-t(ν) priors have long been proposed (Gelman, 2006; Carvalho et al., 2009, 2010) for Bayesian hierarchical models. van der Pas et al. (2014, 2017) have established (near) minimax optimal estimation for the Horseshoe (ν = 1) in the Normal means model. Ghosh and Chakrabarti (2017) have extended the results of van der Pas et al. (2014) to show minimax optimality of a wider class of priors including Half-t(ν) priors. Recently, Song (2020) has established that the degrees of freedom ν can impact the multiplicative constant in the posterior contraction rate for the Normal means model. Posterior contraction rates of continuous shrinkage priors in the regression setting beyond the Normal means model deserve further investigation.

(a) Meeting times  (b) Mean Squared Error
Figure 5: Meeting times under the two-scale coupling algorithm and mean squared error with Half-t(ν) priors and LASSO estimates. In Figure 5(a), n=100, s=20 and σ_*=2; in Figure 5(b), n=100, p=500, s=20 and σ_*=2.

We consider the meeting times obtained with Half-t(ν) priors under the two-scale coupling (with R=1 and threshold d0=0.5) on synthetic datasets. The synthetic datasets are generated as in Figure 2 of Section 2.3. Figure 5(a) is based on 20 synthetic datasets generated independently for different degrees of freedom ν ∈ [1,2]; it plots the meeting times τ of 1-lag coupled Markov chains against the dimension. It indicates that even a small increase in the degree of freedom to ν > 1, compared to the Horseshoe (ν = 1), can lead to orders-of-magnitude computational improvements through shorter meeting times.

We also consider the statistical performance of Bayesian shrinkage regression with Half-t(ν) priors on synthetic datasets in Figure 5(b). The synthetic datasets are generated as in Figure 2 of Section 2.3. We use the blocked Gibbs sampler of Algorithm 3 to draw samples from the posteriors corresponding to different degrees of freedom ν, simulating chains of length 1000 with a burn-in period chosen using the coupling-based total variation distance upper bounds of Biswas et al. (2019). Specifically, we employ a burn-in of 300 steps for ν ≥ 1.2 and of 600 steps for ν = 1. Figure 5(b) shows similar mean squared error (MSE) across values of ν ∈ [1,2]. For a fixed synthetic dataset, the MSE in Figure 5(b) is calculated as ‖β_* − β̂‖₂²/p, where β̂ := (1/1000)∑_{t=1}^{1000} β_t is the MCMC estimator obtained from the Markov chain after burn-in. The error bars in Figure 5(b) are generated by simulating synthetic datasets independently 100 times and each time calculating the corresponding MSE from a Markov chain for different values of ν ∈ [1,2]. The MSE of the LASSO is also included, with regularization parameter chosen by cross-validation using the glmnet package (Friedman et al., 2010). For ν ∈ [1,2], we observe a lower MSE using Half-t(ν) priors compared to the LASSO. The grey plots in Figure 5(b) show that much larger values ν ∈ {4,6,8} can lead to higher MSE, as the corresponding posteriors are strongly concentrated around zero and less able to identify non-null signals. Overall, Figure 5(b) suggests that Half-t(ν) priors with degrees of freedom ν ∈ [1,2] result in comparable statistical performance, while much larger values of ν should be discouraged.
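For concreteness, here is a minimal sketch of the MSE computation just described, assuming (hypothetical variable names) beta_chain is an iterations-by-p array of MCMC draws, beta_star is the true coefficient vector used to generate the synthetic data, and burn_in is the burn-in length chosen from the coupling-based bounds.

import numpy as np

def mcmc_mse(beta_chain, beta_star, burn_in):
    """Mean squared error of the posterior-mean estimate of beta, as in Figure 5(b):
    average the chain after burn-in and compare with the true signal."""
    beta_hat = beta_chain[burn_in:].mean(axis=0)
    return np.sum((beta_star - beta_hat) ** 2) / beta_star.shape[0]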

4 Results on GWAS datasets

Section 3.3 suggests that Half-t(ν)t(\nu) priors with higher degrees of freedom ν>1\nu>1 can give similar statistical performance to the Horseshoe (ν=1\nu=1), whilst allowing orders of magnitude computational improvements.

Motivated by this observation, we apply our algorithms to genome-wide association study (GWAS) datasets using ν=2\nu=2. We consider a bacteria dataset (Bühlmann et al., 2014) with n=71n=71 observations (corresponding to production of the vitamin riboflavin), and p=4,088p=4,088 covariates (corresponding to single nucleotide polymorphisms (SNPs) in the genome), and a maize dataset (Romay et al., 2013; Liu et al., 2016; Zeng and Zhou, 2017; Johndrow et al., 2020) with n=2,266n=2,266 observations (corresponding to average number of days taken for silk emergence in different maize lines), and p=98,385p=98,385 covariates (corresponding to SNPs). Bayesian methods are well-suited to such GWAS datasets, as they provide interpretable notions of uncertainty through marginal posterior probabilities which allow for multimodality, enable the use of prior information, and account for the uncertainty in parameter estimates (e.g. Guan and Stephens, 2011; Zhou et al., 2013). Furthermore, heavy-tailed continuous shrinkage priors can be particularly effective in the GWAS setting, as the associations between SNPs and the phenotype are expected to be sparse, and such priors have shown competitive empirical performance in polygenic prediction (Ge et al., 2019).

For both datasets, we target the Half-t(2) posterior with a0 = b0 = 1. We use the two-scale coupling with R = 1 and threshold d0 = 0.5, and initialize chains independently from the prior. For the ξ updates, we use a Metropolis–Rosenbluth–Teller–Hastings step-size of σ_MRTH = 0.8. Based on 100 independent coupled chains, Figure 6 shows upper bounds on the total variation distance to stationarity, TV(π_t, π), as a function of t, using L-lag couplings with L = 750 and L = 200 for the maize and riboflavin datasets respectively. The results indicate that these Markov chains converge to the corresponding posterior distributions in less than 1000 and 500 iterations respectively. We note that the lags L = 750 and L = 200 were selected based on preliminary runs of the coupled chains so as to obtain informative total variation bounds, which incurs an additional preliminary cost. Any choice of lag results in valid upper bounds, but only large enough lags lead to upper bounds below one, and hence informative, at small iterations t.
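The upper bounds in Figure 6 follow the L-lag construction of Biswas et al. (2019), which bounds TV(π_t, π) by E[max(0, ⌈(τ − L − t)/L⌉)] for the L-lag meeting time τ. A minimal sketch of this computation from a collection of meeting times (the array meeting_times is a hypothetical input gathered from independent coupled chains):

import numpy as np

def tv_upper_bound(meeting_times, lag, t):
    """Monte Carlo estimate of the L-lag upper bound on TV(pi_t, pi):
    average of max(0, ceil((tau - lag - t) / lag)) over independent coupled chains."""
    tau = np.asarray(meeting_times, dtype=float)
    return np.mean(np.maximum(0.0, np.ceil((tau - lag - t) / lag)))

# Example usage: bounds on a grid of iterations for a lag-750 coupling.
# bounds = [tv_upper_bound(meeting_times, lag=750, t=t) for t in range(0, 1500, 25)]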

In these GWAS examples, the run-time per iteration may be significant. For the maize dataset, on a 2015 high-end laptop, each iteration of the coupled chain takes approximately 6060 seconds, and running one coupled chain until meeting can take more than one day. This run-time is dominated by the calculation of the weighted matrix cross-product XDiag(η)1XTX\text{Diag}(\eta)^{-1}X^{T}, with an 𝒪(n2p)\mathcal{O}(n^{2}p) cost. In this setting, a scientist wanting to run chains for 10410^{4} or 10510^{5} iterations may have to wait for weeks or months. Using the proposed couplings, a scientist with access to parallel computers can be confident that samples obtained after 10310^{3} iterations would be indistinguishable from perfect samples from the target; for a similar cost they can also obtain unbiased estimators as described in Jacob et al. (2020). For the riboflavin dataset, each iteration of the coupled chain takes approximately 0.10.1 seconds, and running one coupled chain until meeting takes only 1010 to 2020 seconds. The run-time there is dominated by the component-wise η\eta updates which have 𝒪(p)\mathcal{O}(p) cost. In the riboflavin example, the coupling-based diagnostics enable the scientist to perform reliable Bayesian computation directly from personal computers. Overall, the GWAS examples emphasize that couplings of MCMC algorithms can aid practitioners in high-dimensional settings.
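As an illustration of the dominant cost just mentioned, a minimal numpy sketch of the weighted cross-product X Diag(η)^{-1} X^T; with X of size n × p this product costs O(n²p), which dominates each iteration on the maize dataset.

import numpy as np

def weighted_cross_product(X, eta):
    """Compute X Diag(eta)^{-1} X^T by scaling the columns of X by 1/eta
    and multiplying by X^T; the (n x p) by (p x n) product costs O(n^2 p)."""
    return (X / eta) @ X.T    # X: (n, p) array, eta: length-p positive vector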

(a) Riboflavin dataset with n=71 and p=4,088  (b) Maize dataset with n=2,266 and p=98,385
Figure 6: L-lag coupling-based upper bounds on TV(π_t, π), the distance between the chain at time t and its limiting distribution, in Bayesian shrinkage regression with a Half-t(2) prior on GWAS datasets.

5 Discussion

We have introduced couplings of Gibbs samplers for Bayesian shrinkage regression with Half-t(ν)t(\nu) priors. The proposed two-scale coupling is operational in realistic settings, including a GWAS setting with n2,000,p100,000n\approx 2,000,p\approx 100,000.

Firstly, our work participates in a wider effort to apply MCMC algorithms and coupling techniques to ever more challenging settings (Middleton et al., 2020; Ruiz et al., 2020; Trippe et al., 2021; Lee et al., 2020b; Xu et al., 2021). In particular, the short meeting times obtained here vindicate the use of coupling techniques in multimodal, high-dimensional sampling problems. This relates to questions raised in comments of Lee et al. (2020a) and Paulin (2020) in the discussion of Jacob et al. (2020), and shows that the dimension of the state space may not be the most important driver of the performance of MCMC algorithms and of couplings thereof.

Secondly, we have applied LL-lag couplings (Biswas et al., 2019) to obtain upper bounds on the total variation distance between the Markov chain at a finite time and its stationary distribution. This allows empirical investigations of the mixing time and how it varies with the inputs of the problem, including number of observations, number of covariates, signal to noise ratio and sparsity. We find that coupling techniques constitute a convenient non-asymptotic tool to monitor the performance of MCMC algorithms.

Thirdly, we observe that Half-t(ν) priors with degrees of freedom ν higher than one give similar statistical estimation performance to the Horseshoe whilst providing significant computational advantages. This contributes to the discussion on the impact of the prior on the trade-off between statistical estimation and computational feasibility in the setting of high-dimensional Bayesian regression.

The following questions arise from our work.

  1. Convergence of the blocked Gibbs sampler. Short meeting times suggest that the blocked Gibbs sampler converges quickly, even in high dimensions. This motivates a more formal study of the convergence of that Markov chain, to understand how the convergence rate varies with features of the data-generating process and of the prior. The Lyapunov function in Proposition 2.1 and the two-scale coupling may prove useful for such analysis. Our initial work in this area suggests that a key step will be finding a convenient metric that gives sharp bounds on the metric d used here, while simultaneously being amenable to theoretical analysis.

  2. Alternative coupling algorithms. Section 3.2 indicates that many coupling strategies are available in the present setting, and more generally for various Gibbs samplers. For example, we could combine the switch-to-CRN coupling with the two-scale coupling to form other coupling strategies: under the two-scale setup, we could apply the switch-to-CRN coupling only when the chains are close with respect to a chosen metric, and employ CRN couplings when they are far apart. Alternatively, we could always apply the switch-to-CRN coupling and estimate component-wise meeting probabilities to select the order of the updates, for example so that components more likely to meet are updated first. Some couplings may result in shorter meeting times for the Horseshoe prior, which may allow the Horseshoe to remain competitive with Half-t(ν) priors with higher degrees of freedom. In our experiments we did not find ways to obtain shorter meetings for the Horseshoe, but we certainly cannot rule out that possibility. If one could obtain short meetings for the Horseshoe in high dimensions, the apparent trade-off between statistical performance and computing cost identified in the present work would disappear.

  3. Interplay between posterior concentration and MCMC convergence. For Bayesian regression with spike-and-slab priors, Yang et al. (2016) and Atchadé (2019) have shown that posterior contraction can aid the convergence of MCMC in high dimensions. The performance of the proposed couplings motivates similar investigations for continuous shrinkage priors.

Acknowledgement.

The authors thank Xiaolei Liu and Xiang Zhou for sharing the Maize GWAS data, and Yves F. Atchadé, Qian Qin and Neil Shephard for helpful comments. Niloy Biswas gratefully acknowledges support from a GSAS Merit Fellowship and a Two Sigma Fellowship Award. Anirban Bhattacharya gratefully acknowledges support by the National Science Foundation through an NSF CAREER award (DMS-1653404). Pierre E. Jacob gratefully acknowledges support by the National Science Foundation through grants DMS-1712872 and DMS-1844695. The maize dataset GWAS simulations in this article were run on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University.

References

  • Abramowitz [1974] M. Abramowitz. Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover Publications, Inc., USA, 1974. ISBN 0486612724.
  • Albert and Chib [1993] J. H. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679, 1993. ISSN 01621459. URL http://www.jstor.org/stable/2290350.
  • Atchadé [2019] Y. F. Atchadé. Approximate spectral gaps for Markov chains mixing times in high dimensions. arXiv preprint arXiv:1903.11745, 2019.
  • Trippe et al. [2021] B. L. Trippe, T. D. Nguyen, and T. Broderick. Optimal transport couplings of Gibbs samplers on partitions for unbiased estimation. 2021.
  • Bhadra et al. [2019] A. Bhadra, J. Datta, N. G. Polson, and B. Willard. Lasso meets Horseshoe: A survey. Statistical Science, 34(3):405–427, 2019.
  • Bhattacharya and Johndrow [2021] A. Bhattacharya and J. E. Johndrow. Sampling Local-Scale Parameters in High-Dimensional Regression Models, pages 1–14. American Cancer Society, 2021. ISBN 9781118445112. doi: https://doi.org/10.1002/9781118445112.stat08292. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118445112.stat08292.
  • Bhattacharya et al. [2015] A. Bhattacharya, D. Pati, N. S. Pillai, and D. B. Dunson. Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110(512):1479–1490, 2015. doi: 10.1080/01621459.2014.960967. URL https://doi.org/10.1080/01621459.2014.960967.
  • Bhattacharya et al. [2016] A. Bhattacharya, A. Chakraborty, and B. K. Mallick. Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika, 103(4):985–991, 10 2016. ISSN 0006-3444. doi: 10.1093/biomet/asw042. URL https://doi.org/10.1093/biomet/asw042.
  • Bhattacharya et al. [2021] S. K. Bhattacharya, K. Khare, and S. Pal. Geometric ergodicity of Gibbs samplers for the Horseshoe and its regularized variants. arXiv preprint arXiv:2101.00366, 2021.
  • Biswas et al. [2019] N. Biswas, P. E. Jacob, and P. Vanetti. Estimating convergence of Markov chains with L-lag couplings. In Advances in Neural Information Processing Systems, pages 7389–7399, 2019.
  • Bou-Rabee and Eberle [2020] N. Bou-Rabee and A. Eberle. Two-scale coupling for preconditioned Hamiltonian Monte Carlo in infinite dimensions. Stochastics and Partial Differential Equations: Analysis and Computations, 2020. doi: 10.1007/s40072-020-00175-6. URL https://doi.org/10.1007/s40072-020-00175-6.
  • Bou-Rabee and Sanz-Serna [2018] N. Bou-Rabee and J. M. Sanz-Serna. Geometric integrators and the Hamiltonian Monte Carlo method. Acta Numerica, 27:113–206, 2018. doi: 10.1017/S0962492917000101.
  • Bühlmann et al. [2014] P. Bühlmann, M. Kalisch, and L. Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1(1):255–278, 2014. doi: 10.1146/annurev-statistics-022513-115545. URL https://doi.org/10.1146/annurev-statistics-022513-115545.
  • Carpenter et al. [2017] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. A. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: a probabilistic programming language. Grantee Submission, 76(1):1–32, 2017.
  • Carvalho et al. [2009] C. M. Carvalho, N. G. Polson, and J. G. Scott. Handling sparsity via the Horseshoe. In Proceedings of Machine Learning Research, volume 5, pages 73–80. PMLR, 2009. URL http://proceedings.mlr.press/v5/carvalho09a.html.
  • Carvalho et al. [2010] C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010. ISSN 00063444. URL http://www.jstor.org/stable/25734098.
  • Friedman et al. [2010] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, Articles, 33(1):1–22, 2010. ISSN 1548-7660. doi: 10.18637/jss.v033.i01. URL https://www.jstatsoft.org/v033/i01.
  • Ge et al. [2019] T. Ge, C.-Y. Chen, Y. Ni, Y.-C. A. Feng, and J. W. Smoller. Polygenic prediction via bayesian regression and continuous shrinkage priors. Nature Communications, 10(1):1776, 2019. doi: 10.1038/s41467-019-09718-5. URL https://doi.org/10.1038/s41467-019-09718-5.
  • Gelman [2006] A. Gelman. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Anal., 1(3):515–534, 09 2006. doi: 10.1214/06-BA117A. URL https://doi.org/10.1214/06-BA117A.
  • Ghosh and Chakrabarti [2017] P. Ghosh and A. Chakrabarti. Asymptotic optimality of one-group shrinkage priors in sparse high-dimensional problems. Bayesian Anal., 12(4):1133–1161, 12 2017. doi: 10.1214/16-BA1029. URL https://doi.org/10.1214/16-BA1029.
  • Glynn and Rhee [2014] P. W. Glynn and C.-h. Rhee. Exact estimation for Markov chain equilibrium expectations. Journal of Applied Probability, 51(A):377–389, 2014.
  • Griffin and Brown [2010] J. E. Griffin and P. J. Brown. Inference with Normal-Gamma prior distributions in regression problems. Bayesian Anal., 5(1):171–188, 03 2010. doi: 10.1214/10-BA507. URL https://doi.org/10.1214/10-BA507.
  • Guan and Stephens [2011] Y. Guan and M. Stephens. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. The Annals of Applied Statistics, 5(3):1780 – 1815, 2011. doi: 10.1214/11-AOAS455. URL https://doi.org/10.1214/11-AOAS455.
  • Hager [1989] W. W. Hager. Updating the inverse of a matrix. SIAM Review, 31(2):221–239, 1989. ISSN 00361445. URL http://www.jstor.org/stable/2030425.
  • Hastings [1970] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.
  • Jacob et al. [2020] P. E. Jacob, J. O’Leary, and Y. F. Atchadé. Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):543–600, 2020. doi: 10.1111/rssb.12336. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12336.
  • Jarner and Roberts [2007] S. Jarner and G. Roberts. Convergence of heavy-tailed Monte Carlo Markov Chain algorithms. Scandinavian Journal of Statistics, 34:781–815, 02 2007. doi: 10.1111/j.1467-9469.2007.00557.x.
  • Jarner and Tweedie [2003] S. F. Jarner and R. L. Tweedie. Necessary conditions for geometric and polynomial ergodicity of random-walk-type. Bernoulli, 9(4):559–578, 08 2003. doi: 10.3150/bj/1066223269. URL https://doi.org/10.3150/bj/1066223269.
  • Johndrow et al. [2020] J. Johndrow, P. Orenstein, and A. Bhattacharya. Scalable approximate MCMC algorithms for the Horseshoe prior. Journal of Machine Learning Research, 21(73):1–61, 2020. URL http://jmlr.org/papers/v21/19-536.html.
  • Johnson and Geyer [2012] L. T. Johnson and C. J. Geyer. Variable transformation to obtain geometric ergodicity in the random-walk Metropolis algorithm. Ann. Statist., 40(6):3050–3076, 12 2012. doi: 10.1214/12-AOS1048. URL https://doi.org/10.1214/12-AOS1048.
  • Johnson [1996] V. E. Johnson. Studying convergence of Markov chain Monte Carlo algorithms using coupled sample paths. Journal of the American Statistical Association, 91(433):154–166, 1996.
  • Johnson [1998] V. E. Johnson. A coupling-regeneration scheme for diagnosing convergence in Markov chain Monte Carlo algorithms. Journal of the American Statistical Association, 93(441):238–248, 1998.
  • Johnson and Rossell [2012] V. E. Johnson and D. Rossell. Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association, 107(498):649–660, 2012. doi: 10.1080/01621459.2012.682536. URL https://doi.org/10.1080/01621459.2012.682536.
  • Khare and Hobert [2013] K. Khare and J. P. Hobert. Geometric ergodicity of the Bayesian lasso. Electron. J. Statist., 7:2150–2163, 2013. doi: 10.1214/13-EJS841. URL https://doi.org/10.1214/13-EJS841.
  • Lee et al. [2020a] A. Lee, S. S. Singh, and M. Vihola. Comment on ‘Unbiased Markov chain Monte Carlo with couplings’ by Jacob, O’leary and Atchadé. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2020a.
  • Lee et al. [2020b] A. Lee, S. S. Singh, and M. Vihola. Coupled conditional backward sampling particle filter. Annals of Statistics, 48(5):3066–3089, 2020b.
  • Liu et al. [2016] X. Liu, M. Huang, B. Fan, E. S. Buckler, and Z. Zhang. Iterative usage of fixed and random effect models for powerful and efficient genome-wide association studies. PLOS Genetics, 12(2):1–24, 02 2016. doi: 10.1371/journal.pgen.1005767. URL https://doi.org/10.1371/journal.pgen.1005767.
  • Livingstone et al. [2019a] S. Livingstone, M. Betancourt, S. Byrne, and M. Girolami. On the geometric ergodicity of Hamiltonian Monte Carlo. Bernoulli, 25(4A):3109–3138, 11 2019a. doi: 10.3150/18-BEJ1083. URL https://doi.org/10.3150/18-BEJ1083.
  • Livingstone et al. [2019b] S. Livingstone, M. F. Faulkner, and G. O. Roberts. Kinetic energy choice in Hamiltonian/hybrid Monte Carlo. Biometrika, 106(2):303–319, 2019b.
  • Makalic and Schmidt [2016] E. Makalic and D. F. Schmidt. A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1):179–182, 2016.
  • Meyn and Tweedie [1993] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Springer, 1993.
  • Middleton et al. [2020] L. Middleton, G. Deligiannidis, A. Doucet, and P. E. Jacob. Unbiased Markov chain Monte Carlo for intractable target distributions. Electron. J. Statist., 14(2):2842–2891, 2020. doi: 10.1214/20-EJS1727. URL https://doi.org/10.1214/20-EJS1727.
  • Nishimura and Suchard [2018] A. Nishimura and M. A. Suchard. Prior-preconditioned conjugate gradient for accelerated Gibbs sampling in “large n & large p” sparse Bayesian logistic regression models. arXiv preprint arXiv:1810.12437, 2018.
  • Pal and Khare [2014] S. Pal and K. Khare. Geometric ergodicity for Bayesian shrinkage models. Electron. J. Statist., 8(1):604–645, 2014. doi: 10.1214/14-EJS896. URL https://doi.org/10.1214/14-EJS896.
  • Park and Casella [2008] T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008. doi: 10.1198/016214508000000337. URL https://doi.org/10.1198/016214508000000337.
  • Paulin [2020] D. Paulin. Comment on ‘Unbiased Markov chain Monte Carlo with couplings’ by Jacob, O’leary and Atchadé. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2020.
  • Polson and Scott [2012] N. G. Polson and J. G. Scott. On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(4):887–902, 2012.
  • Polson et al. [2014] N. G. Polson, J. G. Scott, and J. Windle. The Bayesian bridge. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):713–733, 2014. doi: 10.1111/rssb.12042. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12042.
  • Qin and Hobert [2019a] Q. Qin and J. P. Hobert. Convergence complexity analysis of Albert and Chib’s algorithm for Bayesian probit regression. Ann. Statist., 47(4):2320–2347, 08 2019a. doi: 10.1214/18-AOS1749. URL https://doi.org/10.1214/18-AOS1749.
  • Qin and Hobert [2019b] Q. Qin and J. P. Hobert. Wasserstein-based methods for convergence complexity analysis of MCMC with applications. arXiv preprint arXiv:1810.08826, 2019b.
  • R Core Team [2013] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. URL http://www.R-project.org/.
  • Roberts and Rosenthal [2002] G. O. Roberts and J. S. Rosenthal. One-shot coupling for certain stochastic recursive sequences. Stochastic Processes and their Applications, 99(2):195–208, 2002. ISSN 0304-4149. doi: https://doi.org/10.1016/S0304-4149(02)00096-0. URL https://www.sciencedirect.com/science/article/pii/S0304414902000960.
  • Roberts and Rosenthal [2004] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Probability surveys, 1:20–71, 2004.
  • Roberts and Tweedie [1996] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996. ISSN 13507265. URL http://www.jstor.org/stable/3318418.
  • Romay et al. [2013] M. C. Romay, M. J. Millard, J. C. Glaubitz, J. A. Peiffer, K. L. Swarts, T. M. Casstevens, R. J. Elshire, C. B. Acharya, S. E. Mitchell, S. A. Flint-Garcia, M. D. McMullen, J. B. Holland, E. S. Buckler, and C. A. Gardner. Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biology, 14(6):R55, 2013. doi: 10.1186/gb-2013-14-6-r55. URL https://doi.org/10.1186/gb-2013-14-6-r55.
  • Rosenthal [1995] J. S. Rosenthal. Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association, 90(430):558–566, 1995. ISSN 01621459. URL http://www.jstor.org/stable/2291067.
  • Ruiz et al. [2020] F. J. R. Ruiz, M. K. Titsias, T. Cemgil, and A. Doucet. Unbiased gradient estimation for variational auto-encoders using coupled Markov chains. ArXiv, abs/2010.01845, 2020.
  • Sherlock et al. [2010] C. Sherlock, P. Fearnhead, and G. O. Roberts. The random walk Metropolis: Linking theory and practice through a case study. Statist. Sci., 25(2):172–190, 05 2010. doi: 10.1214/10-STS327. URL https://doi.org/10.1214/10-STS327.
  • Song [2020] Q. Song. Bayesian shrinkage towards sharp minimaxity. Electronic Journal of Statistics, 14(2):2714 – 2741, 2020. doi: 10.1214/20-EJS1732. URL https://doi.org/10.1214/20-EJS1732.
  • Thorisson [2000] H. Thorisson. Coupling, stationarity, and regeneration. Springer, 2000.
  • Tibshirani [1994] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
  • van der Pas et al. [2017] S. van der Pas, B. Szabó, and A. van der Vaart. Adaptive posterior contraction rates for the horseshoe. Electron. J. Statist., 11(2):3196–3225, 2017. doi: 10.1214/17-EJS1316. URL https://doi.org/10.1214/17-EJS1316.
  • van der Pas et al. [2014] S. L. van der Pas, B. J. K. Kleijn, and A. W. van der Vaart. The horseshoe estimator: Posterior concentration around nearly black vectors. Electron. J. Statist., 8(2):2585–2618, 2014. doi: 10.1214/14-EJS962. URL https://doi.org/10.1214/14-EJS962.
  • van Erven and Harremos [2014] T. van Erven and P. Harremos. Rényi divergence and Kullback–Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014. doi: 10.1109/TIT.2014.2320500.
  • Vats and Knudson [2020] D. Vats and C. Knudson. Revisiting the Gelman–Rubin diagnostic. Statistical Science (to appear), 2020.
  • Vats et al. [2019] D. Vats, J. M. Flegal, and G. L. Jones. Multivariate output analysis for Markov chain Monte Carlo. Biometrika, 106(2):321–337, 04 2019. ISSN 0006-3444. doi: 10.1093/biomet/asz002. URL https://doi.org/10.1093/biomet/asz002.
  • Xu et al. [2021] K. Xu, T. E. Fjelde, C. Sutton, and H. Ge. Couplings for multinomial Hamiltonian Monte Carlo. In International Conference on Artificial Intelligence and Statistics, pages 3646–3654. PMLR, 2021.
  • Yang et al. [2016] Y. Yang, M. J. Wainwright, and M. I. Jordan. On the computational complexity of high-dimensional Bayesian variable selection. Ann. Statist., 44(6):2497–2532, 12 2016. doi: 10.1214/15-AOS1417. URL https://doi.org/10.1214/15-AOS1417.
  • Zeng and Zhou [2017] P. Zeng and X. Zhou. Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models. Nature Communications, 8(1):456, 2017. doi: 10.1038/s41467-017-00470-2. URL https://doi.org/10.1038/s41467-017-00470-2.
  • Zhou et al. [2013] X. Zhou, P. Carbonetto, and M. Stephens. Polygenic modeling with Bayesian sparse linear mixed models. PLOS Genetics, 9(2):1–14, 02 2013. doi: 10.1371/journal.pgen.1003264. URL https://doi.org/10.1371/journal.pgen.1003264.
  • Zou and Hastie [2005] H. Zou and T. Hastie. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.

In Section A, we describe some features of the posterior distributions arising in high-dimensional regression with shrinkage priors. In Section B we describe metrics to monitor the distance between pairs of chains and provide details on the calculation of metrics to use in two-scale coupling strategies. In Section C we provide the proofs for all the theoretical results in this document. Section D provides pseudo-codes for various algorithms mentioned in this document. Section E provides some derivations of certain steps in the Gibbs sampler.

Appendix A Features of the target complicate use of generic MCMC

While posterior distributions resulting from heavy-tailed shrinkage priors have desirable statistical properties, their features pose challenges to generic MCMC algorithms. For Half-t(ν)t(\nu) priors, we show that the resulting posterior distributions present 1) multimodality, 2) heavy tails and 3) poles at zero. This hints at a trade-off between statistical accuracy and computational difficulty, since the very features that present computational challenges are crucial for optimal statistical performance across sparsity levels and signal strengths.

Proposition A.1 gives the marginal prior and posterior densities of β given ξ and σ², up to normalizing constants, i.e. with η integrated out. For any real-valued functions f and g, we write f(x) ≍ g(x) for x → x₀ when m < liminf_{x→x₀} |f(x)/g(x)| ≤ limsup_{x→x₀} |f(x)/g(x)| < M for some 0 < m ≤ M < ∞.

Proposition A.1.

The marginal prior of component βj\beta_{j} on \mathbb{R} given ξ\xi and σ2\sigma^{2} has density

π(βj|ξ,σ2)U(1+ν2,1,ξβj22σ2ν){log|βj| for |βj|0|βj|(1+ν) for |βj|+,\pi(\beta_{j}|\xi,\sigma^{2})\propto U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)}\asymp\begin{cases}-\log|\beta_{j}|&\text{ for }\;|\beta_{j}|\rightarrow 0\\ |\beta_{j}|^{-(1+\nu)}&\text{ for }\;|\beta_{j}|\rightarrow+\infty\end{cases}, (15)

where Γ()\Gamma(\cdot) is the Gamma function, and U(a,b,z):=Γ(a)10xa1(1+x)ba1ezx𝑑xU(a,b,z):=\Gamma(a)^{-1}\int_{0}^{\infty}x^{a-1}(1+x)^{b-a-1}e^{-zx}dx is the confluent hypergeometric function of the second kind [Abramowitz, 1974, Section 13] for any a,b,z>0a,b,z>0. The marginal posterior density of β\beta on p\mathbb{R}^{p} given ξ\xi and σ2\sigma^{2} is

π(β|σ2,ξ,y)\displaystyle\pi(\beta|\sigma^{2},\xi,y) 𝒩(y;Xβ,σ2In)j=1pU(1+ν2,1,ξβj22σ2ν),\displaystyle\propto\mathcal{N}(y;X\beta,\sigma^{2}I_{n})\prod_{j=1}^{p}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)}, (16)

thus π(β|σ2,ξ,y)β0j=1plog(|βj|)\pi(\beta|\sigma^{2},\xi,y)\overset{\|\beta\|\rightarrow 0}{\asymp}-\prod_{j=1}^{p}\log(|\beta_{j}|) and βjlogπ(β|σ2,ξ,y)βj0(βjlog|βj|)1-\frac{\partial}{\partial\beta_{j}}\log\pi(\beta|\sigma^{2},\xi,y)\overset{\beta_{j}\rightarrow 0}{\asymp}-(\beta_{j}\log|\beta_{j}|)^{-1}.
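For illustration, the unnormalized marginal prior density in (15) can be evaluated directly through the confluent hypergeometric function of the second kind; the following is a minimal sketch assuming scipy is available.

import numpy as np
from scipy.special import hyperu

def marginal_prior_unnormalized(beta_j, xi, sigma2, nu):
    """Unnormalized marginal prior density of beta_j given xi and sigma^2,
    proportional to U((1+nu)/2, 1, xi * beta_j^2 / (2 * sigma^2 * nu)) as in (15)."""
    z = xi * beta_j ** 2 / (2.0 * sigma2 * nu)
    return hyperu((1.0 + nu) / 2.0, 1.0, z)

# The logarithmic growth near zero and the |beta_j|^{-(1+nu)} tail decay
# can be checked numerically by evaluating this function on a grid.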

Figure 7 gives an example of such a marginal posterior in a stylized setting. Here

n=2,p=3,X=(110101),y=X(1 0 0)T=(1 1)T,n=2,\quad p=3,\quad X=\begin{pmatrix}1&1&0\\ 1&0&1\end{pmatrix},\quad y=X(1\;0\;0)^{T}=(1\;1)^{T},

with ν=2\nu=2 and ξ=σ2=0.25\xi=\sigma^{2}=0.25. Component β1\beta_{1} in Figure 7 illustrates the potential multimodality in the posterior. The posterior distribution also has polynomial tails in high dimensions, as stated in Corollary A.2.

Figure 7: Posterior densities π(β1,β2|σ²,ξ,y), π(β1,β3|σ²,ξ,y) and π(β2,β3|σ²,ξ,y).
Corollary A.2.

Let epe^{\perp}\in\mathbb{R}^{p} be a unit vector such that Xe=0Xe^{\perp}=0 and let λ>0\lambda>0. Assume entries eje^{\perp}_{j}\in\mathbb{R} are non-zero for all j=1,,pj=1,\ldots,p, which is necessary to ensure that π(λe|σ2,ξ,y)\pi(\lambda e^{\perp}|\sigma^{2},\xi,y) is finite for all λ>0\lambda>0. Then

π(λe|σ2,ξ,y)λp(1+ν)\pi(\lambda e^{\perp}|\sigma^{2},\xi,y)\asymp\lambda^{-p(1+\nu)} (17)

as λ+\lambda\rightarrow+\infty. That is, the marginal posterior of β\beta on p\mathbb{R}^{p} has polynomial tails along all directions in the null space ker(X)\ker(X) of XX.

In our stylized example, e⊥ = 3^{-1/2}(1, −1, −1)^T satisfies Xe⊥ = 0. It follows from Corollary A.2 that π(λe⊥|σ², ξ, y) ≍ λ^{−3(1+ν)} = λ^{−9} for large λ. More generally, whenever p > n and the marginal prior on β has polynomial tails, the subspace ker(X) of ℝ^p has dimension at least p − n, and therefore the posterior has polynomial tails over a non-trivial subspace. For such heavy-tailed targets, the most popular “generic” MCMC algorithms encounter significant theoretical and practical challenges. Here, we use the term generic for algorithms that are not explicitly designed to target this class of distributions, but instead rely on a general strategy to explore state spaces. Specifically, for heavy-tailed targets, lack of geometric ergodicity has been established for Random Walk Metropolis–Hastings (RWMH) [Jarner and Tweedie, 2003], the Metropolis-Adjusted Langevin Algorithm (MALA) [Roberts and Tweedie, 1996] and Hamiltonian Monte Carlo (HMC) [Livingstone et al., 2019a]. (These works consider targets with heavy tails in every direction; their arguments can be modified to cover targets which are heavy-tailed along any one direction, as in Corollary A.2, by considering a projection of the full chain along that direction.) Better performance can sometimes be obtained using heavy-tailed proposals [Jarner and Roberts, 2007, Sherlock et al., 2010, Livingstone et al., 2019b] or variable transformations when available [Johnson and Geyer, 2012].

The posterior density in (16) also approaches infinity when any component βj\beta_{j} approaches zero, as limβj0U((1+ν)/2,1,(ξβj2)/(2σ2ν))=\lim_{\beta_{j}\rightarrow 0}U((1+\nu)/2,1,(\xi\beta_{j}^{2})/(2\sigma^{2}\nu))=\infty for any ν,ξ,σ2>0\nu,\xi,\sigma^{2}>0. This is highlighted by the modes at the origin in Figure 7. The pole at zero corresponds to a discontinuity in the gradients of the negative log-density, as given in Proposition A.1. Such diverging gradients can lead to numerically unstable leapfrog integration for HMC samplers [Bou-Rabee and Sanz-Serna, 2018].

Appendix B Details on the choice and calculation of metric

B.1 Comparison with other metrics

Figure 8 plots d(C_t, C̃_t) alongside the L1 metric and the log-transformed L1 metric on the parameters m_{t,j} and m̃_{t,j} as in (8). It is based on one trajectory of our two-scale coupled chain under common random numbers (threshold d0 = 0), for a synthetic dataset generated as per Section 2.3, targeting the Horseshoe posterior (ν = 1) with a0 = b0 = 1. Figure 8 shows that d(C_t, C̃_t) is bounded above by both the L1 metric and the L1 metric between the logarithms of m_{t,j} and m̃_{t,j}. Therefore, we could consider the capped L1 metric min{∑_j |m_{t,j} − m̃_{t,j}|, 1} or the capped L1 metric on the logarithms min{∑_j |log(m_{t,j}/m̃_{t,j})|, 1} to obtain upper bounds on d. The search for metrics that are both easily computable and accurate could be pursued further, and may prove valuable in the theoretical analysis of the proposed algorithms.
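A minimal sketch of the two capped L1 surrogates mentioned above, which upper bound d(C_t, C̃_t) in Figure 8; here m and m_tilde stand for the length-p vectors of parameters m_{t,j} and m̃_{t,j}.

import numpy as np

def capped_l1(m, m_tilde):
    """Capped L1 metric: min{ sum_j |m_j - m_tilde_j|, 1 }."""
    return min(np.sum(np.abs(m - m_tilde)), 1.0)

def capped_log_l1(m, m_tilde):
    """Capped L1 metric between logarithms: min{ sum_j |log(m_j / m_tilde_j)|, 1 }."""
    return min(np.sum(np.abs(np.log(m / m_tilde))), 1.0)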

Figure 8: Metric trajectories for the posterior associated with a Horseshoe prior under common random numbers coupling, with n=100, p=1000, s=10 and σ_*=0.5.

B.2 Choice of metric threshold for the two-scale coupling

For different values of d0, Figure 9(a) shows averaged trajectories (d(C_t, C̃_t))_{t≥0}. When d0 = 1 (one-scale coupling), the metric d(C_t, C̃_t) remains near 1, and when d0 = 0 (common random numbers coupling) the metric gets very close to, but does not exactly reach, 0. For all values of the threshold shown, the chains exactly meet and d(C_t, C̃_t) reaches zero at similar times. Figure 9(b) considers higher-dimensional settings and plots histograms of (d(C_t, C̃_t))_{t=1}^{2000} when d0 = 0. The histograms show that d(C_t, C̃_t) takes values close to either 1 or 0 for the different dimensions p considered. This may explain why different threshold values sufficiently away from 0 and 1 give similar meeting times. Figure 9(c) confirms that different thresholds give similar meeting times for the Horseshoe.

Figure 9: Two-scale coupling algorithm performance for the Horseshoe prior under different values of d0 (panels (a), (b) and (c)). Unless specified otherwise, n=100, p=500, s=20 and σ_*=2.

B.3 Calculating metric dd

We present details of estimating the metric dd defined in (7). As the coupling in Algorithm 2 is independent component-wise, we have

d(Ct,C~t)\displaystyle d(C_{t},\tilde{C}_{t}) =1(ηt+1=η~t+1|Ct,C~t)=1j=1p(ηt+1,j=η~t+1,j|ηt,j,η~t,j,mt,j,m~t,j)\displaystyle=1-\mathbb{P}\big{(}\eta_{t+1}=\tilde{\eta}_{t+1}|C_{t},\tilde{C}_{t}\big{)}=1-\prod_{j=1}^{p}\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}|\eta_{t,j},\tilde{\eta}_{t,j},m_{t,j},\tilde{m}_{t,j}\big{)} (18)

for m_{t,j} = ξ_t β_{t,j}²/(2σ_t²) and m̃_{t,j} = ξ̃_t β̃_{t,j}²/(2σ̃_t²).

B.3.1 An integration based approximation

We consider the approximation

(ηt+1,jη~t+1,j|ηt,j,η~t,j,mt,j,m~t,j)TV(Mt,j,M~t,j),\mathbb{P}\big{(}\eta_{t+1,j}\neq\tilde{\eta}_{t+1,j}|\eta_{t,j},\tilde{\eta}_{t,j},m_{t,j},\tilde{m}_{t,j}\big{)}\approx\text{TV}(M_{t,j},\tilde{M}_{t,j}), (19)

where Mt,jM_{t,j} and M~t,j\tilde{M}_{t,j} correspond to target distributions of the marginal slice samplers in Algorithm 2. This corresponds to approximating the probability of not meeting under Algorithm 2 with the total variation distance (i.e. probability of not meeting under a maximal coupling) between the marginal target distributions Mt,jM_{t,j} and M~t,j\tilde{M}_{t,j} of the slice samplers. We can evaluate TV(Mt,j,M~t,j)\text{TV}(M_{t,j},\tilde{M}_{t,j}) analytically using Proposition B.1. This motivates the approximation

d~(Ct,C~t):=1j=1p(1TV(Mt,j,M~t,j))\displaystyle\tilde{d}(C_{t},\tilde{C}_{t}):=1-\prod_{j=1}^{p}\Big{(}1-\text{TV}(M_{t,j},\tilde{M}_{t,j})\Big{)} (20)

for d(Ct,C~t)d(C_{t},\tilde{C}_{t}).

Proposition B.1.

Consider distributions MM and M~\tilde{M} on [0,)[0,\infty) with densities

p(ηj|m)1ηj1ν2(1+νηj)ν+12emηj and p(ηj|m~)1ηj1ν2(1+νηj)ν+12em~ηj,p(\eta_{j}|m)\propto\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-m\eta_{j}}\text{ and }p(\eta_{j}|\tilde{m})\propto\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-\tilde{m}\eta_{j}}, (21)

respectively, such that M and M̃ are parameterized by fixed constants m > 0 and m̃ > 0. Then TV(M, M̃) = 0 when m = m̃. When m ≠ m̃,

TV(M,M~)=|U1+ν2,1,mν(νK)U(1+ν2,1,mν)U1+ν2,1,m~ν(νK)U(1+ν2,1,m~ν)|,\displaystyle\text{TV}(M,\tilde{M})=\bigg{|}\frac{U_{\frac{1+\nu}{2},1,\frac{m}{\nu}}(\nu K)}{U(\frac{1+\nu}{2},1,\frac{m}{\nu})}-\frac{U_{\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu}}(\nu K)}{U(\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu})}\bigg{|}, (22)

where Ua,b,z(t):=1Γ(a)0txa1(1+x)ba1ezx𝑑xU_{a,b,z}(t):=\frac{1}{\Gamma(a)}\int_{0}^{t}x^{a-1}(1+x)^{b-a-1}e^{-zx}dx is defined as the lower incomplete confluent hypergeometric function of the second kind, such that Ua,b,z()=U(a,b,z)U_{a,b,z}(\infty)=U(a,b,z) gives the confluent hypergeometric function of the second kind, and

K:=logU(1+ν2,1,mν)logU(1+ν2,1,m~ν)m~m.K:=\frac{\log U(\frac{1+\nu}{2},1,\frac{m}{\nu})-\log U(\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu})}{\tilde{m}-m}. (23)
Proof.

We have

p(ηj|m)=1ηj1ν2(1+νηj)ν+12emηj1Z and p(ηj|m~)=1ηj1ν2(1+νηj)ν+12em~ηj1Z~,p(\eta_{j}|m)=\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-m\eta_{j}}\frac{1}{Z}\ \text{ and }\ p(\eta_{j}|\tilde{m})=\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-\tilde{m}\eta_{j}}\frac{1}{\tilde{Z}}, (24)

where

Z\displaystyle Z =01ηj1ν2(1+νηj)ν+12emηj𝑑ηj=νν+12Γ(ν+12)U(ν+12,1,mν),\displaystyle=\int_{0}^{\infty}\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-m\eta_{j}}d\eta_{j}=\nu^{-\frac{\nu+1}{2}}\Gamma\Big{(}\frac{\nu+1}{2}\Big{)}U\Big{(}\frac{\nu+1}{2},1,\frac{m}{\nu}\Big{)}, (25)
Z~\displaystyle\tilde{Z} =01ηj1ν2(1+νηj)ν+12em~ηj𝑑ηj=νν+12Γ(ν+12)U(ν+12,1,m~ν).\displaystyle=\int_{0}^{\infty}\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-\tilde{m}\eta_{j}}d\eta_{j}=\nu^{-\frac{\nu+1}{2}}\Gamma\Big{(}\frac{\nu+1}{2}\Big{)}U\Big{(}\frac{\nu+1}{2},1,\frac{\tilde{m}}{\nu}\Big{)}. (26)
Case m=m~m=\tilde{m}.

TV(M,M~)=0\text{TV}(M,\tilde{M})=0 is immediate.

Case m<m~m<\tilde{m}.

Note that

p(ηj|m)p(ηj|m~)\displaystyle p(\eta_{j}|m)\leq p(\eta_{j}|\tilde{m}) exp((m~m)η)ZZ~\displaystyle\Leftrightarrow\exp\big{(}(\tilde{m}-m)\eta\big{)}\leq\frac{Z}{\tilde{Z}} (27)
ηlog(U(1+ν2,1,mν))log(U(1+ν2,1,m~ν))m~m=:K.\displaystyle\Leftrightarrow\eta\leq\frac{\log(U(\frac{1+\nu}{2},1,\frac{m}{\nu}))-\log(U(\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu}))}{\tilde{m}-m}=:K. (28)

Therefore,

1TV(M,M~)\displaystyle 1-\text{TV}(M,\tilde{M}) =0min(p(ηj|m),p(ηj|m~))𝑑ηj\displaystyle=\int_{0}^{\infty}\min(p(\eta_{j}|m),p(\eta_{j}|\tilde{m}))d\eta_{j} (29)
=0Kp(ηj|m)𝑑ηj+Kp(ηj|m~)𝑑ηj\displaystyle=\int_{0}^{K}p(\eta_{j}|m)d\eta_{j}+\int_{K}^{\infty}p(\eta_{j}|\tilde{m})d\eta_{j} (30)
=1Z0K1ηj1ν2(1+νηj)ν+12emηj𝑑ηj\displaystyle=\frac{1}{Z}\int_{0}^{K}\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-m\eta_{j}}d\eta_{j} (31)
+1Z~K1ηj1ν2(1+νηj)ν+12em~ηj𝑑ηj\displaystyle\qquad+\frac{1}{\tilde{Z}}\int_{K}^{\infty}\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-\tilde{m}\eta_{j}}d\eta_{j} (32)
=U1+ν2,1,mν(νK)U(1+ν2,1,mν)+1U1+ν2,1,m~ν(νK)U(1+ν2,1,m~ν)\displaystyle=\frac{U_{\frac{1+\nu}{2},1,\frac{m}{\nu}}(\nu K)}{U(\frac{1+\nu}{2},1,\frac{m}{\nu})}+1-\frac{U_{\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu}}(\nu K)}{U(\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu})} (33)
=1|U1+ν2,1,m~ν(νK)U(1+ν2,1,m~ν)U1+ν2,1,mν(νK)U(1+ν2,1,mν)|.\displaystyle=1-\Bigg{|}\frac{U_{\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu}}(\nu K)}{U(\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu})}-\frac{U_{\frac{1+\nu}{2},1,\frac{m}{\nu}}(\nu K)}{U(\frac{1+\nu}{2},1,\frac{m}{\nu})}\Bigg{|}. (34)

Equation (22) directly follows.

Case m>m~m>\tilde{m}.

Follows from the m<m~m<\tilde{m} case by symmetry. ∎
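As an illustration of the integration-based approximation (19)–(20), the following is a minimal Python sketch of TV(M, M̃) as given by Proposition B.1, assuming scipy is available; the lower incomplete confluent hypergeometric function is evaluated by numerical quadrature rather than in closed form.

import numpy as np
from math import gamma
from scipy.integrate import quad
from scipy.special import hyperu

def hyperu_lower(a, b, z, t):
    """Lower incomplete confluent hypergeometric function of the second kind,
    U_{a,b,z}(t) = Gamma(a)^{-1} * int_0^t x^{a-1} (1+x)^{b-a-1} exp(-z*x) dx."""
    integrand = lambda x: x ** (a - 1.0) * (1.0 + x) ** (b - a - 1.0) * np.exp(-z * x)
    val, _ = quad(integrand, 0.0, t)
    return val / gamma(a)

def tv_slice_targets(m, m_tilde, nu):
    """TV(M, M_tilde) between the marginal slice-sampler targets of Proposition B.1,
    used in the approximation (20) of the metric d."""
    if m == m_tilde:
        return 0.0
    s = (1.0 + nu) / 2.0
    U_m = hyperu(s, 1.0, m / nu)
    U_mt = hyperu(s, 1.0, m_tilde / nu)
    K = (np.log(U_m) - np.log(U_mt)) / (m_tilde - m)    # crossing point (23)
    return abs(hyperu_lower(s, 1.0, m / nu, nu * K) / U_m
               - hyperu_lower(s, 1.0, m_tilde / nu, nu * K) / U_mt)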

B.3.2 A Rao–Blackwellized estimator

We have

(ηt+1,j=η~t+1,j|ηt,j,η~t,j,mt,j,m~t,j)\displaystyle\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}|\eta_{t,j},\tilde{\eta}_{t,j},m_{t,j},\tilde{m}_{t,j}\big{)} (35)
=𝔼[(ηt+1,j=η~t+1,j|Uj,,U~j,,mt,j,m~t,j)|ηt,j,η~t,j,mt,j,m~t,j],\displaystyle=\mathbb{E}\Big{[}\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\Big{|}U_{j,*},\tilde{U}_{j,*},m_{t,j},\tilde{m}_{t,j}\big{)}\big{|}\eta_{t,j},\tilde{\eta}_{t,j},m_{t,j},\tilde{m}_{t,j}\Big{]}, (36)

where mt,j=ξtβt,j2/(2σt2)m_{t,j}=\xi_{t}\beta_{t,j}^{2}/(2\sigma_{t}^{2}) and m~t,j=ξ~tβ~t,j2/(2σ~t2)\tilde{m}_{t,j}=\tilde{\xi}_{t}\tilde{\beta}_{t,j}^{2}/(2\tilde{\sigma}_{t}^{2}), (Uj,,U~j,)|mt,j,m~t,j(U_{j,*},\tilde{U}_{j,*})|m_{t,j},\tilde{m}_{t,j} are sampled using common random numbers as in Algorithm 2, and (ηt+1,j,η~t+1,j)|Uj,,U~j,,mt,j,m~t,j(\eta_{t+1,j},\tilde{\eta}_{t+1,j})|U_{j,*},\tilde{U}_{j,*},m_{t,j},\tilde{m}_{t,j} are sampled as in Algorithm 2. We can analytically evaluate probability (ηt+1,j=η~t+1,j|Uj,,U~j,,mt,j,m~t,j)\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\Big{|}U_{j,*},\tilde{U}_{j,*},m_{t,j},\tilde{m}_{t,j}\big{)}, which motivates the Rao-Blackwellized estimator from (10),

dR^(2)(Ct,Ct~):=1j=1p(1Rr=1R(ηt+1,j=η~t+1,j|Uj,r,U~j,r,mt,j,m~t,j)),\widehat{d_{R}}^{(2)}(C_{t},\tilde{C_{t}}):=1-\prod_{j=1}^{p}\Big{(}\frac{1}{R}\sum_{r=1}^{R}\mathbb{P}\Big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\Big{|}U_{j,r},\tilde{U}_{j,r},m_{t,j},\tilde{m}_{t,j}\Big{)}\Big{)}, (37)

where RR is the number of samples. For each r=1,,Rr=1,\ldots,R, (Uj,r,U~j,r)(U_{j,r},\tilde{U}_{j,r}) is sampled independently as in Algorithm 2 and maximal coupling probability (ηt+1,j=η~t+1,j|Uj,r,U~j,r,mt,j,m~t,j)\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\big{|}U_{j,r},\tilde{U}_{j,r},m_{t,j},\tilde{m}_{t,j}\big{)} is calculated analytically using Proposition B.2.

Proposition B.2.

Suppose ηj|Uj,,mjPj\eta_{j}\big{|}U_{j,*},m_{j}\sim P_{j} and η~j|U~j,,m~jP~j\tilde{\eta}_{j}\big{|}\tilde{U}_{j,*},\tilde{m}_{j}\sim\tilde{P}_{j} where marginal distributions Pj{P}_{j} and P~j\tilde{P}_{j} correspond to the distribution in the second step of the slice sampler in Algorithm 4. Then,

\displaystyle\mathbb{P} (ηj=η~j|Uj,,U~j,,mj,m~j)={γs(mjK~j)γs(mjTj,)+γs(m~j(Tj,T~j,))γs(m~jK~j)γs(m~jT~j,)formj<m~jγs(mj(Tj,T~j,))γs(mj(Tj,T~j,))formj=m~jγs(m~jK~j)γs(m~jT~j,)+γs(mj(Tj,T~j,))γs(mjK~j)γs(mjTj,)formj>m~j\displaystyle\big{(}\eta_{j}=\tilde{\eta}_{j}\Big{|}U_{j,*},\tilde{U}_{j,*},m_{j},\tilde{m}_{j}\big{)}=\left\{\begin{matrix}\frac{\gamma_{s}(m_{j}\tilde{K}_{j})}{\gamma_{s}(m_{j}T_{j,*})}+\frac{\gamma_{s}(\tilde{m}_{j}(T_{j,*}\wedge\tilde{T}_{j,*}))-\gamma_{s}(\tilde{m}_{j}\tilde{K}_{j})}{\gamma_{s}(\tilde{m}_{j}\tilde{T}_{j,*})}&for\ m_{j}<\tilde{m}_{j}\\ \frac{\gamma_{s}(m_{j}(T_{j,*}\wedge\tilde{T}_{j,*}))}{\gamma_{s}(m_{j}(T_{j,*}\vee\tilde{T}_{j,*}))}&for\ m_{j}=\tilde{m}_{j}\\ \frac{\gamma_{s}(\tilde{m}_{j}\tilde{K}_{j})}{\gamma_{s}(\tilde{m}_{j}\tilde{T}_{j,*})}+\frac{\gamma_{s}(m_{j}(T_{j,*}\wedge\tilde{T}_{j,*}))-\gamma_{s}(m_{j}\tilde{K}_{j})}{\gamma_{s}(m_{j}T_{j,*})}&for\ m_{j}>\tilde{m}_{j}\end{matrix}\right. (38)

for m_j = ξβ_j²/(2σ²) and m̃_j = ξ̃β̃_j²/(2σ̃²), T_{j,*} := (U_{j,*}^{−2/(1+ν)} − 1)/ν and T̃_{j,*} := (Ũ_{j,*}^{−2/(1+ν)} − 1)/ν, s = (1+ν)/2, γ_s(x) = Γ(s)^{-1}∫_0^x t^{s−1}e^{−t}dt ∈ [0,1] the regularized lower incomplete Gamma function, a∧b := min{a,b}, a∨b := max{a,b}, and

\tilde{K}_{j}:=\bigg(0\vee\frac{\log\Big(\big(\frac{\tilde{m}_{j}}{m_{j}}\big)^{s}\frac{\gamma_{s}(m_{j}T_{j,*})}{\gamma_{s}(\tilde{m}_{j}\tilde{T}_{j,*})}\Big)}{\tilde{m}_{j}-m_{j}}\bigg)\wedge\Big(T_{j,*}\wedge\tilde{T}_{j,*}\Big). (39)
Proof.

We work component-wise and drop the subscripts for ease of notation. Then distributions PP and P~\tilde{P} have densities

p(η;m,T)\displaystyle p(\eta;m,T) =ηs1emηmsΓ(s)γs(mT) on (0,T]\displaystyle=\eta^{s-1}e^{-m\eta}\frac{m^{s}}{\Gamma(s)\gamma_{s}(mT)}\quad\text{ on }(0,T] (40)
p(η;m~,T~)\displaystyle p(\eta;\tilde{m},\tilde{T}) =ηs1em~ηm~sΓ(s)γs(m~T~) on (0,T~]\displaystyle=\eta^{s-1}e^{-\tilde{m}\eta}\frac{\tilde{m}^{s}}{\Gamma(s)\gamma_{s}(\tilde{m}\tilde{T})}\quad\text{ on }(0,\tilde{T}] (41)

respectively where T:=(U21+ν1)/νT:=(U_{*}^{-\frac{2}{1+\nu}}-1)/\nu and T~:=(U~21+ν1)/ν\tilde{T}:=(\tilde{U}_{*}^{-\frac{2}{1+\nu}}-1)/\nu, s=1+ν2s=\frac{1+\nu}{2} and γs(x)=1Γ(s)0xts1et𝑑t[0,1]\gamma_{s}(x)=\frac{1}{\Gamma(s)}\int_{0}^{x}t^{s-1}e^{-t}dt\in[0,1] is the regularized incomplete lower Gamma function. Note that

max(η=η~|T,T~,m,m~)=0TT~p(η;m,T)p(η;m~,T~)dη.\displaystyle\mathbb{P}_{max}\Big{(}\eta=\tilde{\eta}\Big{|}T,\tilde{T},m,\tilde{m}\Big{)}=\int_{0}^{T\wedge\tilde{T}}p(\eta;m,T)\wedge p(\eta;\tilde{m},\tilde{T})d\eta.
Case m=m~m=\tilde{m}.
\int_{0}^{T\wedge\tilde{T}}p(\eta;m,T)\wedge p(\eta;\tilde{m},\tilde{T})\,d\eta =\int_{0}^{T\wedge\tilde{T}}\eta^{s-1}e^{-m\eta}\frac{m^{s}}{\Gamma(s)}\Big(\frac{1}{\gamma_{s}(mT)}\wedge\frac{1}{\gamma_{s}(m\tilde{T})}\Big)d\eta
=\int_{0}^{T\wedge\tilde{T}}\eta^{s-1}e^{-m\eta}\frac{m^{s}}{\Gamma(s)}\frac{1}{\gamma_{s}(m(T\vee\tilde{T}))}\,d\eta
=\frac{\gamma_{s}(m(T\wedge\tilde{T}))}{\gamma_{s}(m(T\vee\tilde{T}))}

as required.

Case m<m~m<\tilde{m}.

For 0<ηTT~0<\eta\leq T\wedge\tilde{T},

p(\eta;m,T)\leq p(\eta;\tilde{m},\tilde{T}) \Leftrightarrow\exp((\tilde{m}-m)\eta)\leq\Big(\frac{\tilde{m}}{m}\Big)^{s}\frac{\gamma_{s}(mT)}{\gamma_{s}(\tilde{m}\tilde{T})} \Leftrightarrow\eta\leq\frac{\log\Big(\big(\frac{\tilde{m}}{m}\big)^{s}\frac{\gamma_{s}(mT)}{\gamma_{s}(\tilde{m}\tilde{T})}\Big)}{\tilde{m}-m}=:K.

Denote K~:=(0K)(TT~)\tilde{K}:=\big{(}0\vee K\big{)}\wedge\big{(}T\wedge\tilde{T}\big{)}, such that

K~:={0whenK0Kwhen 0<KTT~TT~whenTT~<K.\displaystyle\tilde{K}:=\left\{\begin{matrix}0&when\ K\leq 0\\ K&when\ 0<K\leq T\wedge\tilde{T}\\ T\wedge\tilde{T}&when\ T\wedge\tilde{T}<K.\end{matrix}\right.

This gives

\int_{0}^{T\wedge\tilde{T}}p(\eta;m,T)\wedge p(\eta;\tilde{m},\tilde{T})\,d\eta =\int_{0}^{\tilde{K}}p(\eta;m,T)\,d\eta+\int_{\tilde{K}}^{T\wedge\tilde{T}}p(\eta;\tilde{m},\tilde{T})\,d\eta
=\Big(\frac{\Gamma(s)\gamma_{s}(m\tilde{K})}{m^{s}}\Big)\Big(\frac{m^{s}}{\Gamma(s)\gamma_{s}(mT)}\Big)+\Big(\frac{\Gamma(s)\big(\gamma_{s}(\tilde{m}(T\wedge\tilde{T}))-\gamma_{s}(\tilde{m}\tilde{K})\big)}{\tilde{m}^{s}}\Big)\Big(\frac{\tilde{m}^{s}}{\Gamma(s)\gamma_{s}(\tilde{m}\tilde{T})}\Big)
=\frac{\gamma_{s}(m\tilde{K})}{\gamma_{s}(mT)}+\frac{\gamma_{s}(\tilde{m}(T\wedge\tilde{T}))-\gamma_{s}(\tilde{m}\tilde{K})}{\gamma_{s}(\tilde{m}\tilde{T})}.

as required.

Case m>m~m>\tilde{m}.

Follows from the m<m~m<\tilde{m} case by symmetry. ∎
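The following is a minimal Python sketch combining Proposition B.2 with the Rao–Blackwellized estimator (37), assuming scipy is available. The helper sample_slice_pair is a hypothetical placeholder returning one common-random-numbers draw (U_{j,r}, Ũ_{j,r}) of the slice variables as in Algorithm 2, and scipy.special.gammainc plays the role of the regularized lower incomplete Gamma function γ_s.

import numpy as np
from scipy.special import gammainc   # regularized lower incomplete Gamma function

def slice_threshold(u, nu):
    """Truncation point T = (u^{-2/(1+nu)} - 1) / nu induced by a slice variable u."""
    return (u ** (-2.0 / (1.0 + nu)) - 1.0) / nu

def meet_prob(u, u_tilde, m, m_tilde, nu):
    """Meeting probability of one eta-component under a maximal coupling of the
    truncated conditionals, following the case formulas (38)-(39)."""
    s = (1.0 + nu) / 2.0
    T, T_t = slice_threshold(u, nu), slice_threshold(u_tilde, nu)
    if m == m_tilde:
        return gammainc(s, m * min(T, T_t)) / gammainc(s, m * max(T, T_t))
    if m > m_tilde:                      # handled by symmetry: swap the two chains
        m, m_tilde, T, T_t = m_tilde, m, T_t, T
    K = np.log((m_tilde / m) ** s * gammainc(s, m * T) / gammainc(s, m_tilde * T_t)) / (m_tilde - m)
    K_t = min(max(0.0, K), min(T, T_t))  # tilde-K in (39)
    return (gammainc(s, m * K_t) / gammainc(s, m * T)
            + (gammainc(s, m_tilde * min(T, T_t)) - gammainc(s, m_tilde * K_t))
            / gammainc(s, m_tilde * T_t))

def rao_blackwellized_metric(m, m_tilde, nu, R, sample_slice_pair):
    """Rao-Blackwellized estimator (37) of d(C_t, C_tilde_t): average the analytic
    meeting probabilities over R draws of the slice variables, then combine across
    the p components."""
    p = len(m)
    probs = np.empty(p)
    for j in range(p):
        draws = [sample_slice_pair(j, r) for r in range(R)]
        probs[j] = np.mean([meet_prob(u, u_t, m[j], m_tilde[j], nu) for (u, u_t) in draws])
    return 1.0 - np.prod(probs)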

Appendix C Proofs

C.1 Marginal β\beta prior and posterior densities

Proof of Proposition A.1.
Marginal Prior on β\beta.

Let π_η denote the component-wise prior density of each η_j based on the Half-t(ν) distribution, and 𝒩(·; μ, Σ) the density of a multivariate Normal distribution with mean μ and covariance Σ. It suffices to calculate the marginal prior of each β_j given ξ and σ², by marginalizing over each η_j. We obtain

π(βj|ξ,σ2)\displaystyle\pi(\beta_{j}|\xi,\sigma^{2}) =0𝒩(βj;0,σ2ξηj)πη(ηj)𝑑ηj\displaystyle=\int_{0}^{\infty}\mathcal{N}\Big{(}\beta_{j};0,\frac{\sigma^{2}}{\xi\eta_{j}}\Big{)}\pi_{\eta}(\eta_{j})d\eta_{j} (42)
0ξηj2πσ2exp(ξβj22σ2ηj)1ηj2ν2(1+νηj)ν+12𝑑ηj\displaystyle\propto\int_{0}^{\infty}\frac{\sqrt{\xi\eta_{j}}}{\sqrt{2\pi\sigma^{2}}}\exp\Big{(}-\frac{\xi\beta_{j}^{2}}{2\sigma^{2}}\eta_{j}\Big{)}\frac{1}{\eta_{j}^{\frac{2-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}d\eta_{j} (43)
01ηj1ν2(1+νηj)ν+12exp(ξβj22σ2ηj)𝑑ηj\displaystyle\propto\int_{0}^{\infty}\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}\exp\Big{(}-\frac{\xi\beta_{j}^{2}}{2\sigma^{2}}\eta_{j}\Big{)}d\eta_{j} (44)
=01(tν)1ν2(1+t)ν+12exp(ξβj22σ2νt)𝑑t1ν\displaystyle=\int_{0}^{\infty}\frac{1}{\big{(}\frac{t}{\nu}\big{)}^{\frac{1-\nu}{2}}(1+t)^{\frac{\nu+1}{2}}}\exp\Big{(}-\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}t\Big{)}dt\frac{1}{\nu} (45)
=Γ(ν+12)νν+12U(ν+12,1,ξβj22σ2ν)\displaystyle=\frac{\Gamma(\frac{\nu+1}{2})}{\nu^{\frac{\nu+1}{2}}}U\Big{(}\frac{\nu+1}{2},1,\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)} (46)
U(ν+12,1,ξβj22σ2ν),\displaystyle\propto U\Big{(}\frac{\nu+1}{2},1,\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)}, (47)

where (47) follows from the definition of the confluent hypergeometric function of the second kind: U(a,b,z):=1Γ(a)0xa1(1+x)ba1ezx𝑑xU(a,b,z):=\frac{1}{\Gamma(a)}\int_{0}^{\infty}x^{a-1}(1+x)^{b-a-1}e^{-zx}dx for any a,b,z>0a,b,z>0. By the asymptotic expansion of the confluent hypergeometric function of the second kind [Abramowitz, 1974, Section 13.1], we obtain

π(βj|ξ,σ2)U(ν+12,1,ξβj22σ2ν){1Γ(1+ν2)log(ξβj22σ2ν)log(|βj|) for |βj|0(ξβj22σ2ν)1+ν2|βj|(1+ν) for |βj|+\displaystyle\pi(\beta_{j}|\xi,\sigma^{2})\propto U\Big{(}\frac{\nu+1}{2},1,\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)}\asymp\begin{cases}-\frac{1}{\Gamma(\frac{1+\nu}{2})}\log\Big{(}\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)}&\asymp-\log(|\beta_{j}|)\text{ for }|\beta_{j}|\rightarrow 0\\ \Big{(}\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)}^{-\frac{1+\nu}{2}}&\asymp|\beta_{j}|^{-(1+\nu)}\text{ for }|\beta_{j}|\rightarrow+\infty\end{cases} (48)

as required for (15).

Marginal Posterior of β\beta.

Let πξ\pi_{\xi} denote the prior density on ξ\xi and πσ2\pi_{\sigma^{2}} the InvGamma(a02,b02)\mathrm{InvGamma}(\frac{a_{0}}{2},\frac{b_{0}}{2}) prior density on σ2\sigma^{2}. Then,

π(β|ξ,σ2,y)𝒩(y;Xβ,σ2In)(j=1pπ(βj|ξ,σ2))𝒩(y;Xβ,σ2In)j=1pU(1+ν2,1,ξβj22σ2ν)\displaystyle\pi(\beta|\xi,\sigma^{2},y)\propto\mathcal{N}(y;X\beta,\sigma^{2}I_{n})\Big{(}\prod_{j=1}^{p}\pi(\beta_{j}|\xi,\sigma^{2})\Big{)}\propto\mathcal{N}(y;X\beta,\sigma^{2}I_{n})\prod_{j=1}^{p}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)} (49)

as required for (16). Equations (48) and (49) directly give

π(β|ξ,σ2,y)j=1pU(ν+12,1,ξβj22σ2ν)j=1plog|βj| for β0.\pi(\beta|\xi,\sigma^{2},y)\asymp\prod_{j=1}^{p}U\Big{(}\frac{\nu+1}{2},1,\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)}\asymp-\prod_{j=1}^{p}\log|\beta_{j}|\text{ for }\ \|\beta\|\rightarrow 0. (50)
Gradient of negative log-density of posterior of β\beta.

From (49), we obtain

βj(logπ(β|ξ,σ2,y))\displaystyle-\frac{\partial}{\partial\beta_{j}}\Big{(}\log\pi(\beta|\xi,\sigma^{2},y)\Big{)} =βj(12σ2yXβ22+j=1plogU(1+ν2,1,ξβj22σ2ν))\displaystyle=-\frac{\partial}{\partial\beta_{j}}\bigg{(}-\frac{1}{2\sigma^{2}}\|y-X\beta\|_{2}^{2}+\sum_{j=1}^{p}\log U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}\bigg{)} (51)
=[1σ2XT(yXβ)]jddβjU(1+ν2,1,ξβj22σ2ν)U(1+ν2,1,ξβj22σ2ν).\displaystyle=-\Big{[}\frac{1}{\sigma^{2}}X^{T}(y-X\beta)\Big{]}_{j}-\frac{\frac{d}{d\beta_{j}}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}}{U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}}. (52)

Recall [Abramowitz, 1974, Section 13.4] for a,b>0a,b>0,

ddzU(a,b,z)=z>0aU(a+1,b+1,z),U(a,2,z)z0+1Γ(a)1z,U(a,1,z)z0+1Γ(a)log(z).\displaystyle\frac{d}{dz}U(a,b,z)\overset{\forall z>0}{=}-aU(a+1,b+1,z),\ U(a,2,z)\overset{z\rightarrow 0^{+}}{\asymp}\frac{1}{\Gamma(a)}\frac{1}{z},\ U(a,1,z)\overset{z\rightarrow 0^{+}}{\asymp}-\frac{1}{\Gamma(a)}\log(z). (53)

This gives

ddβjU(1+ν2,1,ξβj22σ2ν)U(1+ν2,1,ξβj22σ2ν)\displaystyle\frac{\frac{d}{d\beta_{j}}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}}{U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}} =1+ν2U(1+ν2+1,1+1,ξβj22σ2ν)U(1+ν2,1,ξβj22σ2ν)ξβjσ2νβj01βjlog|βj|.\displaystyle=-\frac{1+\nu}{2}\frac{U\Big{(}\frac{1+\nu}{2}+1,1+1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}}{U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}}\frac{\xi\beta_{j}}{\sigma^{2}\nu}\overset{\beta_{j}\rightarrow 0}{\asymp}\frac{1}{\beta_{j}\log|\beta_{j}|}. (54)

Therefore by (52), βj(logπ(β|ξ,σ2,y))βj01βjlog|βj|-\frac{\partial}{\partial\beta_{j}}\Big{(}\log\pi(\beta|\xi,\sigma^{2},y)\Big{)}\overset{\beta_{j}\rightarrow 0}{\asymp}-\frac{1}{\beta_{j}\log|\beta_{j}|} as required. ∎

Proof of Corollary A.2.

By Proposition A.1,

π(β|ξ,σ2,y)=1Zξ,σ2,ν𝒩(y;Xβ,σ2In)j=1pU(1+ν2,1,ξβj22σ2ν),\displaystyle\pi(\beta|\xi,\sigma^{2},y)=\frac{1}{Z_{\xi,\sigma^{2},\nu}}\mathcal{N}(y;X\beta,\sigma^{2}I_{n})\prod_{j=1}^{p}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}, (55)

where Zξ,σ2,ν:=p𝒩(y;Xβ,σ2In)j=1pU(1+ν2,1,ξβj22σ2ν)dβZ_{\xi,\sigma^{2},\nu}:=\int_{\mathbb{R}^{p}}\mathcal{N}(y;X\beta,\sigma^{2}I_{n})\prod_{j=1}^{p}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}d\beta is the normalizing constant. As p>np>n, there exists some epe^{\perp}\in\mathbb{R}^{p} such that Xe=0Xe^{\perp}=0. When ej=0e^{\perp}_{j}=0 for some j=1,,pj=1,\ldots,p, the posterior density is infinite by Proposition A.1. Consider epe^{\perp}\in\mathbb{R}^{p} such that each eje^{\perp}_{j}\in\mathbb{R} is non-zero for j=1,,pj=1,\ldots,p. For any such fixed ee^{\perp} and large λ>0\lambda>0, we obtain

π(λe|ξ,σ2,y)\displaystyle\pi(\lambda e^{\perp}|\xi,\sigma^{2},y) =1Zξ,σ2,ν𝒩(y;0,σ2In)j=1pU(1+ν2,1,ξλ2(ej)22σ2ν)\displaystyle=\frac{1}{Z_{\xi,\sigma^{2},\nu}}\mathcal{N}(y;0,\sigma^{2}I_{n})\prod_{j=1}^{p}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\lambda^{2}(e^{\perp}_{j})^{2}}{2\sigma^{2}\nu}\Big{)} (56)
j=1pU(1+ν2,1,ξλ2(ej)22σ2ν)\displaystyle\propto\prod_{j=1}^{p}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\lambda^{2}(e^{\perp}_{j})^{2}}{2\sigma^{2}\nu}\Big{)} (57)
j=1p(ξλ2(ej)22σ2ν)1+ν2\displaystyle\asymp\prod_{j=1}^{p}\Big{(}\frac{\xi\lambda^{2}(e^{\perp}_{j})^{2}}{2\sigma^{2}\nu}\Big{)}^{-\frac{1+\nu}{2}} (58)
λp(1+ν),\displaystyle\asymp\lambda^{-p(1+\nu)}, (59)

where (58) follows from the asymptotic expansion of the confluent hypergeometric function of the second kind [Abramowitz, 1974, Section 13.1]. ∎

C.2 Geometric Ergodicity

Initial technical results.

We first record some technical results.

Lemma C.1.

Consider any matrix AA such that A=[BC]A=\begin{bmatrix}B\\ C\end{bmatrix} for submatrices BB and CC. Then A2B2+C2\|A\|_{2}\leq\|B\|_{2}+\|C\|_{2}, where 2\|\cdot\|_{2} is the spectral norm of a matrix.

Proof.

Take any vector vv of length equal to the number of columns of AA with v2=1\|v\|_{2}=1. Then

Av22=[BC]v22=Bv22+Cv22B22+C22.\displaystyle\|Av\|_{2}^{2}=\Big{\|}\begin{bmatrix}B\\ C\end{bmatrix}v\Big{\|}_{2}^{2}=\|Bv\|_{2}^{2}+\|Cv\|_{2}^{2}\leq\|B\|_{2}^{2}+\|C\|_{2}^{2}. (60)

Taking the supremum over all such vv with v2=1\|v\|_{2}=1 gives A22B22+C22(B2+C2)2\|A\|_{2}^{2}\leq\|B\|_{2}^{2}+\|C\|_{2}^{2}\leq(\|B\|_{2}+\|C\|_{2})^{2}. ∎

Lemma C.2.

(Sherman–Morrison–Woodbury formula [Hager, 1989]) For any p×pp\times p invertible matrix AA, p×np\times n matrix UU and n×pn\times p matrix VV,

(A+UV)1=A1A1U(In+VA1U)1VA1.(A+UV)^{-1}=A^{-1}-A^{-1}U(I_{n}+VA^{-1}U)^{-1}VA^{-1}. (61)

When U=VT=xpU=V^{T}=x\in\mathbb{R}^{p}, we obtain (A+xxT)1=A1A1xxTA1/(1+xTA1x)(A+xx^{T})^{-1}=A^{-1}-A^{-1}xx^{T}A^{-1}/(1+x^{T}A^{-1}x).

Lemma C.3.

(A uniform bound on the full conditional mean of β\beta). Let XX be an n×pn\times p matrix with columns xinx_{i}\in\mathbb{R}^{n} for i=1,,pi=1,\ldots,p, and let R=Diag(r1,,rp)R=\text{Diag}(r_{1},\ldots,r_{p}) be a p×pp\times p diagonal matrix with positive entries. Then

supr1,,rp>0(XTX+R)1XT2Cp,\sup_{r_{1},\ldots,r_{p}>0}\|(X^{T}X+R)^{-1}X^{T}\|_{2}\leq C_{p}, (62)

for some Cp<C_{p}<\infty, where 2\|\cdot\|_{2} is the spectral norm of a matrix. That is, the spectral norm of (XTX+R)1XT(X^{T}X+R)^{-1}X^{T} is uniformly bounded over all positive diagonal matrices RR.

In an earlier version of the present manuscript, we had explicitly characterized the uniform upper bound as \|X^{\dagger}y\|_{2}, for X^{\dagger} the Moore-Penrose pseudoinverse of X, which is not true in general. Existence of a uniform upper bound has also been established by Bhattacharya et al. [2021] using block matrix inversions and induction.

Proof.

Fix any n1n\geq 1. We first consider the case p=1p=1, such that X=x1nX=x_{1}\in\mathbb{R}^{n}, R=r1R=r_{1} for some scalar r1>0r_{1}>0. We obtain

\|(X^{T}X+R)^{-1}X^{T}\|_{2}=\frac{1}{r_{1}+x_{1}^{T}x_{1}}\|x_{1}^{T}\|_{2}=\frac{\|x_{1}\|_{2}}{r_{1}+\|x_{1}\|_{2}^{2}}\leq\frac{\mathrm{I}_{x_{1}\neq 0}}{\|x_{1}\|_{2}}, (63)

which is a uniform bound for all r_{1}>0.

We now consider p>1p>1. We have X=[x1xp]X=[x_{1}\ldots x_{p}] with xinx_{i}\in\mathbb{R}^{n} for i=1,,pi=1,\ldots,p and R=Diag(r1,,rp)R=\text{Diag}(r_{1},\ldots,r_{p}) with ri>0r_{i}>0 for i=1,,pi=1,\ldots,p. For any p×pp\times p permutation matrix PP, the matrices XPTXP^{T} and PRPTPRP^{T} correspond to a reordering of the columns of XX and the diagonal entries of RR. We have

((XPT)T(XPT)+PRPT)1(XPT)T2=P(XTX+R)1XT2=(XTX+R)1XT2.\|\big{(}(XP^{T})^{T}(XP^{T})+PRP^{T}\big{)}^{-1}(XP^{T})^{T}\|_{2}=\|P(X^{T}X+R)^{-1}X^{T}\|_{2}=\|(X^{T}X+R)^{-1}X^{T}\|_{2}. (64)

This implies that (XTX+R)1XT2\|(X^{T}X+R)^{-1}X^{T}\|_{2} is invariant under any reordering of the pp columns of XX and the pp diagonal entries of RR. By considering such reordering, we can assume the diagonal entries of RR are non-decreasing without loss of generality; that is R=Diag(r1,,rp)R=\text{Diag}(r_{1},\ldots,r_{p}) with 0<r1rp0<r_{1}\leq\ldots\leq r_{p}. Denote Ap:=(XTX+R)1XTA_{p}:=(X^{T}X+R)^{-1}X^{T}. Then,

Ap\displaystyle A_{p} =(R1R1XT(In+XR1XT)1XR1)XTby Lemma C.2\displaystyle=\big{(}R^{-1}-R^{-1}X^{T}(I_{n}+XR^{-1}X^{T})^{-1}XR^{-1}\big{)}X^{T}\ \text{by Lemma \ref{lemma:woodbury}} (65)
\displaystyle=R^{-1}X^{T}-R^{-1}X^{T}\big{(}I_{n}-(I_{n}+XR^{-1}X^{T})^{-1}\big{)} (66)
=R1XT(In+XR1XT)1.\displaystyle=R^{-1}X^{T}(I_{n}+XR^{-1}X^{T})^{-1}. (67)

It therefore suffices to show that

Cp:=sup0<r1rpAp2=sup0<r1rpR1XT(In+XR1XT)12C_{p}:=\sup_{0<r_{1}\leq\ldots\leq r_{p}}\|A_{p}\|_{2}=\sup_{0<r_{1}\leq\ldots\leq r_{p}}\|{R}^{-1}{X}^{T}(I_{n}+{X}{R}^{-1}{X}^{T})^{-1}\|_{2} (68)

is finite for any p>1p>1. For k=1,,p1k=1,\ldots,p-1, define

Ck:=sup0<r1rkAk2=sup0<r1rkRk1XkT(In+XkRk1XkT)12,C_{k}:=\sup_{0<r_{1}\leq\ldots\leq r_{k}}\|A_{k}\|_{2}=\sup_{0<r_{1}\leq\ldots\leq r_{k}}\|{R}_{k}^{-1}{X}_{k}^{T}(I_{n}+{X}_{k}{R}_{k}^{-1}{X}_{k}^{T})^{-1}\|_{2}, (69)

where Xk=[x1xk]X_{k}=[x_{1}\ldots x_{k}] is the n×kn\times k matrix corresponding to the first kk columns of XX and Rk=Diag(r1,,rk){R}_{k}=\text{Diag}(r_{1},\ldots,r_{k}) with 0<r1rk0<r_{1}\leq\ldots\leq r_{k} is the k×kk\times k diagonal matrix corresponding to the top kk diagonal entries of RR, and Ak:=Rk1XkT(In+XkRk1XkT)1A_{k}:=R_{k}^{-1}X_{k}^{T}(I_{n}+X_{k}R_{k}^{-1}X_{k}^{T})^{-1}. Then Xp1Rp11Xp1T=i=1p1xixiT/riX_{p-1}R_{p-1}^{-1}X_{p-1}^{T}=\sum_{i=1}^{p-1}x_{i}x_{i}^{T}/r_{i}. By Lemma C.2 with U=VT=xp/rp1/2U=V^{T}=x_{p}/r_{p}^{1/2},

(In+XR1XT)1\displaystyle(I_{n}+XR^{-1}X^{T})^{-1} =((In+Xp1Rp11Xp1T)+xpxpTrp)1\displaystyle=\Big{(}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})+\frac{x_{p}x_{p}^{T}}{r_{p}}\Big{)}^{-1} (70)
=(In+Xp1Rp11Xp1T)1\displaystyle=(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1} (71)
(In+Xp1Rp11Xp1T)1xpxpT(In+Xp1Rp11Xp1T)1rp+xpT(In+Xp1Rp11Xp1T)1xp.\displaystyle\quad-\frac{(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}x_{p}x_{p}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}}{r_{p}+x_{p}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}x_{p}}. (72)

Substituting this decomposition into (67), we obtain

Ap=[Rp11Xp1Trp1xpT](In+XR1XT)1=[Vp1vpT]\displaystyle A_{p}=\begin{bmatrix}R_{p-1}^{-1}X_{p-1}^{T}\\ r_{p}^{-1}x_{p}^{T}\end{bmatrix}(I_{n}+XR^{-1}X^{T})^{-1}=\begin{bmatrix}V_{p-1}\\ v_{p}^{T}\end{bmatrix} (73)

where, after some simplifications, vpnv_{p}\in\mathbb{R}^{n} and Vp1V_{p-1} is a p1p-1 by nn matrix such that

vpT\displaystyle v_{p}^{T} :=rp1xpT(In+XR1XT)1=xpT(In+Xp1Rp11Xp1T)1rp+xpT(In+Xp1Rp11Xp1T)1xp,\displaystyle:=r_{p}^{-1}x_{p}^{T}(I_{n}+XR^{-1}X^{T})^{-1}=\frac{x_{p}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}}{r_{p}+x_{p}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}x_{p}}, (74)
Vp1\displaystyle V_{p-1} :=Rp11Xp1T(In+XR1XT)1=Ap1(InxpvpT).\displaystyle:=R_{p-1}^{-1}X_{p-1}^{T}(I_{n}+XR^{-1}X^{T})^{-1}=A_{p-1}\Big{(}I_{n}-x_{p}v_{p}^{T}\Big{)}. (75)

We have Vp12Cp1(1+xp2vpT2)\|V_{p-1}\|_{2}\leq C_{p-1}(1+\|x_{p}\|_{2}\|v_{p}^{T}\|_{2}) by (69), subadditivity and submultiplicativity. Then by Lemma C.1,

Ap2\displaystyle\|A_{p}\|_{2} Vp12+vpT2Cp1(1+xp2vpT2)+vpT2.\displaystyle\leq\|V_{p-1}\|_{2}+\|v_{p}^{T}\|_{2}\leq C_{p-1}(1+\|x_{p}\|_{2}\|v_{p}^{T}\|_{2})+\|v_{p}^{T}\|_{2}. (76)

Therefore to show CpC_{p} in (68) is finite, it suffices to bound vpT2\|v_{p}^{T}\|_{2} uniformly over all R=Diag(r1,,rp)R=\text{Diag}(r_{1},\ldots,r_{p}) with 0<r1rp0<r_{1}\leq\ldots\leq r_{p}. Note that xp=Xp1ap+bpx_{p}=X_{p-1}a_{p}+b_{p} for some app1a_{p}\in\mathbb{R}^{p-1} and some bpnb_{p}\in\mathbb{R}^{n} such that Xp1Tbp=0X_{p-1}^{T}b_{p}=0. Here Xp1apX_{p-1}a_{p} corresponds to the projection of xpx_{p} to the column space of Xp1X_{p-1}, and bpb_{p} corresponds to the component of xpx_{p} that is perpendicular to the column space of Xp1X_{p-1}. Note that bp=(In+Xp1Rp11Xp1T)bpb_{p}=(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})b_{p}. This implies

(In+Xp1Rp11Xp1T)1bp\displaystyle(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}b_{p} =bp,\displaystyle=b_{p}, (77)
bpT(In+Xp1Rp11Xp1T)1Xp1ap\displaystyle b_{p}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}X_{p-1}a_{p} =bpTXp1ap=0.\displaystyle=b_{p}^{T}X_{p-1}a_{p}=0. (78)

Therefore by (74),

vpT2\displaystyle\|v_{p}^{T}\|_{2} =apTXp1T(In+Xp1Rp11Xp1T)1+bpTrp+apTXp1T(In+Xp1Rp11Xp1T)1Xp1ap+bpTbp2\displaystyle=\|\frac{a_{p}^{T}X_{p-1}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}+b_{p}^{T}}{r_{p}+a_{p}^{T}X_{p-1}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}X_{p-1}a_{p}+b_{p}^{T}b_{p}}\|_{2} (79)
apTXp1T(In+Xp1Rp11Xp1T)1rp+apTXp1T(In+Xp1Rp11Xp1T)1Xp1ap2+bpTrp+bpTbp2 by subadditivity\displaystyle\leq\big{\|}\frac{a_{p}^{T}X_{p-1}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}}{r_{p}+a_{p}^{T}X_{p-1}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}X_{p-1}a_{p}}\big{\|}_{2}+\big{\|}\frac{b_{p}^{T}}{r_{p}+b_{p}^{T}b_{p}}\big{\|}_{2}\ \text{ by subadditivity} (80)
apTXp1T(In+Xp1Rp11Xp1T)1rp2+Ibp0bp2 as (In+Xp1Rp11Xp1T)1 is p.s.d.\displaystyle\leq\|\frac{a_{p}^{T}X_{p-1}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}}{r_{p}}\|_{2}+\frac{\mathrm{I}_{b_{p}\neq 0}}{\|b_{p}\|_{2}}\ \text{ as }(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}\text{ is p.s.d.} (81)
=(Rp1ap)TRp11Xp1T(In+Xp1Rp11Xp1T)1rp2+Ibp0bp2\displaystyle=\|\frac{(R_{p-1}a_{p})^{T}R_{p-1}^{-1}X_{p-1}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}}{r_{p}}\|_{2}+\frac{\mathrm{I}_{b_{p}\neq 0}}{\|b_{p}\|_{2}} (82)
Rp1rp2ap2Cp1+Ibp0bp2 by submultiplicativity and (69)\displaystyle\leq\big{\|}\frac{R_{p-1}}{r_{p}}\big{\|}_{2}\|a_{p}\|_{2}C_{p-1}+\frac{\mathrm{I}_{b_{p}\neq 0}}{\|b_{p}\|_{2}}\ \text{ by submultiplicativity and \eqref{eq:Ck_bound}} (83)
ap2Cp1+Ibp0bp2 as rprj for all j=1,,p1.\displaystyle\leq\big{\|}a_{p}\big{\|}_{2}C_{p-1}+\frac{\mathrm{I}_{b_{p}\neq 0}}{\|b_{p}\|_{2}}\ \text{ as $r_{p}\geq r_{j}$ for all $j=1,\ldots,p-1$}. (84)

Note that xp,apx_{p},a_{p} and bpb_{p} do not depend on entries of the diagonal matrix RR. Taking the supremum over all R=Diag(r1,,rp)R=\text{Diag}(r_{1},\ldots,r_{p}) with 0<r1rp0<r_{1}\leq\ldots\leq r_{p} in (76), by (84) we obtain the recurrence relation

CpCp1(1+xp2(ap2Cp1+Ibp0bp2))+(ap2Cp1+Ibp0bp2)\displaystyle C_{p}\leq C_{p-1}\Big{(}1+\|x_{p}\|_{2}\big{(}\big{\|}a_{p}\big{\|}_{2}C_{p-1}+\frac{\mathrm{I}_{b_{p}\neq 0}}{\|b_{p}\|_{2}}\big{)}\Big{)}+\Big{(}\big{\|}a_{p}\big{\|}_{2}C_{p-1}+\frac{\mathrm{I}_{b_{p}\neq 0}}{\|b_{p}\|_{2}}\Big{)} (85)

for all p\geq 2. As C_{1} is finite by (63), the recurrence (85) implies by induction that C_{p} is finite for all p. ∎
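As an informal numerical illustration of Lemma C.3 (not part of the proof), one can draw a fixed design matrix and evaluate the spectral norm of (X^{T}X+R)^{-1}X^{T} for diagonal matrices R whose entries span many orders of magnitude; the dimensions, seed and scales in the sketch below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                                       # arbitrary n < p
X = rng.standard_normal((n, p))

norms = []
for _ in range(200):
    # diagonal entries of R spread over many orders of magnitude
    r = 10.0 ** rng.uniform(-8, 8, size=p)
    A = np.linalg.solve(X.T @ X + np.diag(r), X.T)   # (X^T X + R)^{-1} X^T
    norms.append(np.linalg.norm(A, 2))               # spectral norm
print(max(norms))                                    # remains bounded, consistent with Lemma C.3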

Lemma C.4.

(Moments of the full conditionals of each ηj\eta_{j}). For fixed constants m>0m>0 and ν>0\nu>0, define a distribution on (0,)(0,\infty) with probability density function

p(η|m)1η1ν2(1+νη)ν+12emη.p(\eta|m)\propto\frac{1}{\eta^{\frac{1-\nu}{2}}(1+\nu\eta)^{\frac{\nu+1}{2}}}e^{-m\eta}. (86)

Then for any c>max{ν+12,1}c>\max\{-\frac{\nu+1}{2},-1\} and ηp(|m)\eta\sim p(\cdot|m), we have

𝔼[ηc|m]=1νcΓ(1+ν2+c)Γ(1+ν2)U(1+ν2+c,1+c,mν)U(1+ν2,1,mν).\displaystyle\mathbb{E}[\eta^{c}|m]=\frac{1}{\nu^{c}}\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}\frac{U(\frac{1+\nu}{2}+c,1+c,\frac{m}{\nu})}{U(\frac{1+\nu}{2},1,\frac{m}{\nu})}. (87)
Proof.

Recall the definition of the confluent hypergeometric function of the second kind U(a,b,z):=1Γ(a)0xa1(1+x)ba1ezx𝑑xU(a,b,z):=\frac{1}{\Gamma(a)}\int_{0}^{\infty}x^{a-1}(1+x)^{b-a-1}e^{-zx}dx. We obtain

0ηcη1ν2(1+νη)ν+12emη𝑑η\displaystyle\int_{0}^{\infty}\frac{\eta^{c}}{\eta^{\frac{1-\nu}{2}}(1+\nu\eta)^{\frac{\nu+1}{2}}}e^{-m\eta}d\eta =νν+12c0xν+12+c1(1+x)ν+12emνx𝑑x\displaystyle=\nu^{-\frac{\nu+1}{2}-c}\int_{0}^{\infty}x^{\frac{\nu+1}{2}+c-1}(1+x)^{-\frac{\nu+1}{2}}e^{-\frac{m}{\nu}x}dx (88)
=νν+12cΓ(ν+12+c)U(ν+12+c,1+c,mν).\displaystyle=\nu^{-\frac{\nu+1}{2}-c}\Gamma\Big{(}\frac{\nu+1}{2}+c\Big{)}U\Big{(}\frac{\nu+1}{2}+c,1+c,\frac{m}{\nu}\Big{)}. (89)

Therefore,

𝔼[ηc|m]\displaystyle\mathbb{E}[\eta^{c}|m] =νν+12cΓ(ν+12+c)U(ν+12+c,1+c,mν)νν+12Γ(ν+12)U(ν+12,1,mν)\displaystyle=\frac{\nu^{-\frac{\nu+1}{2}-c}\Gamma\big{(}\frac{\nu+1}{2}+c\big{)}U\big{(}\frac{\nu+1}{2}+c,1+c,\frac{m}{\nu}\big{)}}{\nu^{-\frac{\nu+1}{2}}\Gamma\big{(}\frac{\nu+1}{2}\big{)}U\big{(}\frac{\nu+1}{2},1,\frac{m}{\nu}\big{)}} (90)
=1νcΓ(ν+12+c)U(ν+12+c,1+c,mν)Γ(ν+12)U(ν+12,1,mν).\displaystyle=\frac{1}{\nu^{c}}\frac{\Gamma\big{(}\frac{\nu+1}{2}+c\big{)}U\big{(}\frac{\nu+1}{2}+c,1+c,\frac{m}{\nu}\big{)}}{\Gamma\big{(}\frac{\nu+1}{2}\big{)}U\big{(}\frac{\nu+1}{2},1,\frac{m}{\nu}\big{)}}. (91)
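As an informal sanity check of the moment identity (87) (again, not part of the argument), the closed form can be compared against direct numerical integration of the unnormalized density; the values of \nu, m and c below are arbitrary.

import numpy as np
from scipy.integrate import quad
from scipy.special import gamma, hyperu

nu, m, c = 2.0, 0.7, 0.3     # arbitrary values with c > max{-(1+nu)/2, -1}

w = lambda eta: eta ** ((nu - 1) / 2) * (1 + nu * eta) ** (-(nu + 1) / 2) * np.exp(-m * eta)
numer, _ = quad(lambda eta: eta ** c * w(eta), 0, np.inf)
denom, _ = quad(w, 0, np.inf)
closed_form = (nu ** -c) * gamma((1 + nu) / 2 + c) / gamma((1 + nu) / 2) \
    * hyperu((1 + nu) / 2 + c, 1 + c, m / nu) / hyperu((1 + nu) / 2, 1, m / nu)
print(numer / denom, closed_form)   # the two values agree up to quadrature error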

Ratios of confluent hypergeometric function evaluations, such as the ones arising in Lemma C.4, can be upper-bounded using the following result.

Lemma C.5.

Let UU denote the confluent hypergeometric function of the second kind, and fix r>0r>0.

  1. 1.

    Let c>0c>0. Then

    U(r+c,1+c,x)U(r,1,x)<xcfor all x>0.\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}<x^{-c}\quad\text{for all }x>0. (92)

    Furthermore, for every ϵ>0\epsilon>0, there exists a Kϵ,c,r(1)<K^{(1)}_{\epsilon,c,r}<\infty such that

    U(r+c,1+c,x)U(r,1,x)<ϵxc+Kϵ,c,r(1)for all x>0.\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}<\epsilon x^{-c}+K^{(1)}_{\epsilon,c,r}\quad\text{for all }x>0. (93)
  2. 2.

    Let max{1,r}<c<0\max\{-1,-r\}<c<0. Then there exists some Kc,r(2)<K^{(2)}_{c,r}<\infty such that

    U(r+c,1+c,x)U(r,1,x)<xc+Kc,r(2)for all x>0.\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}<x^{-c}+K^{(2)}_{c,r}\quad\text{for all }x>0. (94)
Proof.
Case: c>0c>0.

By definition, we have

xcΓ(r+c)U(r+c,1+c,x)Γ(r)U(r,1,x)\displaystyle x^{c}\frac{\Gamma(r+c)U\big{(}r+c,1+c,x\big{)}}{\Gamma(r)U\big{(}r,1,x\big{)}} =xc0tr+c1(1+t)rext𝑑t0tr1(1+t)rext𝑑t\displaystyle=\frac{x^{c}\int_{0}^{\infty}t^{r+c-1}(1+t)^{-r}e^{-xt}dt}{\int_{0}^{\infty}t^{r-1}(1+t)^{-r}e^{-xt}dt}
=0sr+c1(x+s)res𝑑s0sr1(x+s)res𝑑s for s=xt\displaystyle=\frac{\int_{0}^{\infty}s^{r+c-1}(x+s)^{-r}e^{-s}ds}{\int_{0}^{\infty}s^{r-1}(x+s)^{-r}e^{-s}ds}\text{ for }s=xt
=Γ(r+c)Γ(r)𝔼[(x+S)r]𝔼[(x+S~)r] for SGamma(r+c,1) and S~Gamma(r,1)\displaystyle=\frac{\Gamma(r+c)}{\Gamma(r)}\frac{\mathbb{E}[(x+S)^{-r}]}{\mathbb{E}[(x+\tilde{S})^{-r}]}\text{ for }S\sim\mathrm{Gamma}(r+c,1)\text{ and }\tilde{S}\sim\mathrm{Gamma}(r,1)
<Γ(r+c)Γ(r)\displaystyle<\frac{\Gamma(r+c)}{\Gamma(r)}

where the final inequality follows because the Gamma(r+c,1) distribution stochastically dominates the Gamma(r,1) distribution, and hence \mathbb{E}[(x+S)^{-r}]<\mathbb{E}[(x+\tilde{S})^{-r}] for all x>0 since r>0 and c>0. This gives U(r+c,1+c,x)/U(r,1,x)<x^{-c} as required for (92).

We now prove (93). By definition, we have

Γ(r+c)U(r+c,1+c,x)Γ(r)U(r,1,x)=0tr+c1(1+t)rext𝑑t0tr1(1+t)rext𝑑t=𝔼[Txc]=xc𝔼[Yxc]\displaystyle\frac{\Gamma(r+c)U\big{(}r+c,1+c,x\big{)}}{\Gamma(r)U\big{(}r,1,x\big{)}}=\frac{\int_{0}^{\infty}t^{r+c-1}(1+t)^{-r}e^{-xt}dt}{\int_{0}^{\infty}t^{r-1}(1+t)^{-r}e^{-xt}dt}=\mathbb{E}[T_{x}^{c}]=x^{-c}\mathbb{E}[Y_{x}^{c}]

where T_{x} is a random variable with density p_{T_{x}}(t)\propto t^{r-1}(1+t)^{-r}e^{-xt} on [0,\infty), and Y_{x}:=xT_{x} is a random variable with density p_{Y_{x}}(y)\propto y^{r-1}(x+y)^{-r}e^{-y} on [0,\infty).

  1. 1.

    Consider 𝔼[Yxc]\mathbb{E}[Y_{x}^{c}] as x0+x\rightarrow 0^{+}. We obtain

    limx0+𝔼[Yxc]=limx0+0yc+r1(x+y)rey𝑑y0yr1(x+y)rey𝑑y=Γ(c)limx0+0(yx+y)ry1ey𝑑y=0.\displaystyle\lim_{x\rightarrow 0^{+}}\mathbb{E}[Y_{x}^{c}]=\lim_{x\rightarrow 0^{+}}\frac{\int_{0}^{\infty}y^{c+r-1}(x+y)^{-r}e^{-y}dy}{\int_{0}^{\infty}y^{r-1}(x+y)^{-r}e^{-y}dy}=\frac{\Gamma(c)}{\lim_{x\rightarrow 0^{+}}\int_{0}^{\infty}\big{(}\frac{y}{x+y}\big{)}^{r}y^{-1}e^{-y}dy}=0.
  2. 2.

    Consider 𝔼[Txc]\mathbb{E}[T_{x}^{c}] as a function of xx for x>0x>0. From the density of TxT_{x}, we note that for all 0<x1<x20<x_{1}<x_{2}, Tx1T_{x_{1}} stochastically dominates Tx2T_{x_{2}}, and as c>0c>0, Tx1cT^{c}_{x_{1}} also stochastically dominates Tx2cT^{c}_{x_{2}}. Therefore, 𝔼[Txc]\mathbb{E}[T_{x}^{c}] is a decreasing function of xx for x>0x>0.

As limx0+𝔼[Yxc]=0\lim_{x\rightarrow 0^{+}}\mathbb{E}[Y_{x}^{c}]=0, for any ϵ>0\epsilon>0, there exists δ>0\delta>0 such that 𝔼[Yxc]<Γ(r+c)Γ(r)ϵ\mathbb{E}[Y_{x}^{c}]<\frac{\Gamma(r+c)}{\Gamma(r)}\epsilon for all xδx\leq\delta. This gives xc𝔼[Yxc]<Γ(r+c)Γ(r)ϵxcx^{-c}\mathbb{E}[Y_{x}^{c}]<\frac{\Gamma(r+c)}{\Gamma(r)}\epsilon x^{-c} for all xδx\leq\delta. Also for all x>δx>\delta, we have xc𝔼[Yxc]=𝔼[Txc]𝔼[Tδc]<x^{-c}\mathbb{E}[Y_{x}^{c}]=\mathbb{E}[T_{x}^{c}]\leq\mathbb{E}[T_{\delta}^{c}]<\infty as 𝔼[Txc]\mathbb{E}[T_{x}^{c}] is a decreasing function of xx for x>0x>0. Overall, we obtain

Γ(r+c)U(r+c,1+c,x)Γ(r)U(r,1,x)\displaystyle\frac{\Gamma(r+c)U\big{(}r+c,1+c,x\big{)}}{\Gamma(r)U\big{(}r,1,x\big{)}} =xc𝔼[Yxc]<Γ(r+c)Γ(r)ϵxc+𝔼[Tδc]\displaystyle=x^{-c}\mathbb{E}[Y_{x}^{c}]<\frac{\Gamma(r+c)}{\Gamma(r)}\epsilon x^{-c}+\mathbb{E}[T_{\delta}^{c}]
U(r+c,1+c,x)U(r,1,x)\displaystyle\Rightarrow\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}} <ϵxc+Kϵ,c,r(1)\displaystyle<\epsilon x^{-c}+K^{(1)}_{\epsilon,c,r}

for Kϵ,c,r(1)=Γ(r)Γ(r+c)𝔼[Tδc]=U(r+c,1+c,δ)U(r,1,δ)<K^{(1)}_{\epsilon,c,r}=\frac{\Gamma(r)}{\Gamma(r+c)}\mathbb{E}[T_{\delta}^{c}]=\frac{U\big{(}r+c,1+c,\delta\big{)}}{U\big{(}r,1,\delta\big{)}}<\infty as required.

Case: max{1,r}<c<0\max\{-1,-r\}<c<0.

Note that U(r+c,1+c,x)U\big{(}r+c,1+c,x\big{)} and U(r,1,x)U\big{(}r,1,x\big{)} are well-defined as r,r+c,c+1>0r,r+c,c+1>0.

  1. 1.

    Consider U(r+c,1+c,x)U(r,1,x)xc\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c} as xx\rightarrow\infty. We first recall the asymptotic expansion of the confluent hypergeometric function of the second kind [Abramowitz, 1974, Section 13.5] at infinity,

    U(a,b,z)=zaa(ab+1)z(a+1)+𝒪(z(a+2))U(a,b,z)=z^{-a}-a(a-b+1)z^{-(a+1)}+\mathcal{O}(z^{-(a+2)}) (95)

    as zz\rightarrow\infty for any a,b>0a,b>0. For large x>0x>0, this gives

    xc+1(U(r+c,1+c,x)U(r,1,x)xc)\displaystyle x^{c+1}\Big{(}\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c}\Big{)} =xc+1(xrU(r+c,1+c,x)xrcU(r,1,x)xrU(r,1,x))\displaystyle=x^{c+1}\Big{(}\frac{x^{r}U\big{(}r+c,1+c,x\big{)}-x^{r-c}U\big{(}r,1,x\big{)}}{x^{r}U\big{(}r,1,x\big{)}}\Big{)} (96)
    =x(xr+cU(r+c,1+c,x)xrU(r,1,x)xrU(r,1,x))\displaystyle=x\Big{(}\frac{x^{r+c}U\big{(}r+c,1+c,x\big{)}-x^{r}U\big{(}r,1,x\big{)}}{x^{r}U\big{(}r,1,x\big{)}}\Big{)} (97)
    =x((1(r+c)rx)(1r2x)1+𝒪(x2))\displaystyle=x\Big{(}\frac{(1-\frac{(r+c)r}{x})-(1-\frac{r^{2}}{x})}{1}+\mathcal{O}(x^{-2})\Big{)} (98)
    =rc+𝒪(x1)\displaystyle=-rc+\mathcal{O}(x^{-1}) (99)

    where (98) follows from the asymptotic expansion in (95). As c+1>0c+1>0, we obtain

    limx(U(r+c,1+c,x)U(r,1,x)xc)=0.\displaystyle\lim_{x\rightarrow\infty}\Big{(}\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c}\Big{)}=0. (100)
  2. 2.

    Consider U(r+c,1+c,x)U(r,1,x)xc\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c} as x0+x\rightarrow 0^{+}. By definition of confluent hypergeometric function of the second kind, we have

    limx0+Γ(r+c)U(r+c,1+c,x)Γ(r)U(r,1,x)=limx0+0tr+c1(1+t)rext𝑑t0tr1(1+t)rext𝑑t.\displaystyle\lim_{x\rightarrow 0^{+}}\frac{\Gamma(r+c)U\big{(}r+c,1+c,x\big{)}}{\Gamma(r)U\big{(}r,1,x\big{)}}=\lim_{x\rightarrow 0^{+}}\frac{\int_{0}^{\infty}t^{r+c-1}(1+t)^{-r}e^{-xt}dt}{\int_{0}^{\infty}t^{r-1}(1+t)^{-r}e^{-xt}dt}. (101)

    Note that as r+c>0r+c>0 and c1<1c-1<-1,

    0tr+c1(1+t)r𝑑t=0(t1+t)rtc1𝑑t01tr+c1𝑑t+1tc1𝑑t=1r+c1c<.\int_{0}^{\infty}t^{r+c-1}(1+t)^{-r}dt=\int_{0}^{\infty}\big{(}\frac{t}{1+t}\big{)}^{r}t^{c-1}dt\leq\int_{0}^{1}t^{r+c-1}dt+\int_{1}^{\infty}t^{c-1}dt=\frac{1}{r+c}-\frac{1}{c}<\infty. (102)

    Therefore, by monotone convergence, \lim_{x\rightarrow 0^{+}}\int_{0}^{\infty}t^{r+c-1}(1+t)^{-r}e^{-xt}dt=\int_{0}^{\infty}t^{r+c-1}(1+t)^{-r}dt\leq\frac{1}{r+c}-\frac{1}{c}<\infty. However, we have

    limx0+0tr1(1+t)rext𝑑t=limx0+0(t1+t)rt1ext𝑑t=.\lim_{x\rightarrow 0^{+}}\int_{0}^{\infty}t^{r-1}(1+t)^{-r}e^{-xt}dt=\lim_{x\rightarrow 0^{+}}\int_{0}^{\infty}\big{(}\frac{t}{1+t}\big{)}^{r}t^{-1}e^{-xt}dt=\infty. (103)

    Therefore limx0+Γ(r+c)U(r+c,1+c,x)Γ(r)U(r,1,x)=0\lim_{x\rightarrow 0^{+}}\frac{\Gamma(r+c)U\big{(}r+c,1+c,x\big{)}}{\Gamma(r)U\big{(}r,1,x\big{)}}=0. As c<0c<0, we obtain

    limx0+U(r+c,1+c,x)U(r,1,x)xc=0.\lim_{x\rightarrow 0^{+}}\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c}=0. (104)

From (100) and (104), we have limxU(r+c,1+c,x)U(r,1,x)xc=0\lim_{x\rightarrow\infty}\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c}=0 and limx0+U(r+c,1+c,x)U(r,1,x)xc=0\lim_{x\rightarrow 0^{+}}\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c}=0. As xU(r+c,1+c,x)U(r,1,x)xcx\mapsto\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c} is a continuous function on (0,)(0,\infty), this implies that U(r+c,1+c,x)U(r,1,x)xc\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c} is bounded above for x(0,)x\in(0,\infty). In particular, there exists some Kc,r(2)<K^{(2)}_{c,r}<\infty such that

U(r+c,1+c,x)U(r,1,x)xc<Kc,r(2)\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c}<K^{(2)}_{c,r} (105)

for all x>0x>0 as required for (94). ∎

Corollary C.6.

(Bound on moments of the full conditionals of each ηj\eta_{j}). Consider a random variable η\eta on (0,)(0,\infty) with density as in (86) in Lemma C.4 for some ν>0\nu>0.

Consider the case c>0. Then for every \epsilon>0, there exists some K^{(1)}_{\epsilon,c,\nu}<\infty such that

𝔼[ηc|m]ϵmc+Kϵ,c,ν(1)for all m>0.\displaystyle\mathbb{E}[\eta^{c}|m]\leq\epsilon m^{-c}+K^{(1)}_{\epsilon,c,\nu}\quad\text{for all }m>0. (106)

Consider when max{1,1+ν2}<c<0\max\{-1,-\frac{1+\nu}{2}\}<c<0. Then, there exists some Kc,ν(2)<K^{(2)}_{c,\nu}<\infty such that

𝔼[ηc|m]Γ(1+ν2+c)Γ(1+ν2)mc+Kc,ν(2)for all m>0.\displaystyle\mathbb{E}[\eta^{c}|m]\leq\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}m^{-c}+K^{(2)}_{c,\nu}\quad\text{for all }m>0. (107)
Proof.

By Lemma C.4, for all max{1,1+ν2}<c\max\{-1,-\frac{1+\nu}{2}\}<c,

𝔼[ηc|m]=1νcΓ(1+ν2+c)Γ(1+ν2)U(1+ν2+c,1+c,mν)U(1+ν2,1,mν).\displaystyle\mathbb{E}[\eta^{c}|m]=\frac{1}{\nu^{c}}\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}\frac{U(\frac{1+\nu}{2}+c,1+c,\frac{m}{\nu})}{U(\frac{1+\nu}{2},1,\frac{m}{\nu})}. (108)

Consider when c>0c>0. We apply Lemma C.5 with r=1+ν2r=\frac{1+\nu}{2}, x=mνx=\frac{m}{\nu}, and ϵ~=ϵ(Γ(1+ν2+c)Γ(1+ν2))1\tilde{\epsilon}=\epsilon\Big{(}\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}\Big{)}^{-1}. Then there exists some K~ϵ,c,ν(1)<\tilde{K}^{(1)}_{\epsilon,c,\nu}<\infty such that

𝔼[ηc|m]1νcΓ(1+ν2+c)Γ(1+ν2)(ϵ~(mν)c+K~ϵ,c,ν(1))=ϵmc+Kϵ,c,ν(1)\displaystyle\mathbb{E}[\eta^{c}|m]\leq\frac{1}{\nu^{c}}\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}\Big{(}\tilde{\epsilon}\Big{(}\frac{m}{\nu}\Big{)}^{-c}+\tilde{K}^{(1)}_{\epsilon,c,\nu}\Big{)}=\epsilon m^{-c}+K^{(1)}_{{\epsilon},c,\nu} (109)

for Kϵ,c,ν(1)=1νcΓ(1+ν2+c)Γ(1+ν2)K~ϵ,c,ν(1)<K^{(1)}_{{\epsilon},c,\nu}=\frac{1}{\nu^{c}}\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}\tilde{K}^{(1)}_{\epsilon,c,\nu}<\infty.

Consider when max{1,1+ν2}<c<0\max\{-1,-\frac{1+\nu}{2}\}<c<0. By Lemma C.5 with r=1+ν2r=\frac{1+\nu}{2} and x=mνx=\frac{m}{\nu}, there exists some K~c,ν(2)<\tilde{K}^{(2)}_{c,\nu}<\infty such that

𝔼[ηc|m]1νcΓ(1+ν2+c)Γ(1+ν2)((mν)c+K~c,ν(2))=Γ(1+ν2+c)Γ(1+ν2)mc+Kc,ν(2)\displaystyle\mathbb{E}[\eta^{c}|m]\leq\frac{1}{\nu^{c}}\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}\Big{(}\Big{(}\frac{m}{\nu}\Big{)}^{-c}+\tilde{K}^{(2)}_{c,\nu}\Big{)}=\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}m^{-c}+K^{(2)}_{c,\nu} (110)

for Kc,ν(2)=1νcΓ(1+ν2+c)Γ(1+ν2)K~c,ν(2)<K^{(2)}_{c,\nu}=\frac{1}{\nu^{c}}\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}\tilde{K}^{(2)}_{c,\nu}<\infty. ∎

Lemma C.7.

Let X𝒩(μ,σ2)X\sim\mathcal{N}(\mu,\sigma^{2}) on \mathbb{R}. Then, for c(0,1)c\in(0,1),

𝔼[|X|c]1σc2c/2Γ(1c2)π\mathbb{E}[|X|^{-c}]\leq\frac{1}{\sigma^{c}2^{c/2}}\frac{\Gamma(\frac{1-c}{2})}{\sqrt{\pi}} (111)
Proof.

We follow the proof in Johndrow et al. [2020], which is included here for completeness. Recall the inverse moments of the univariate Normal distribution,

𝔼[|X|c]=1σc2c/2Γ(1c2)πexp(μ22σ2)M(1c2,12,μ22σ2)\mathbb{E}[|X|^{-c}]=\frac{1}{\sigma^{c}2^{c/2}}\frac{\Gamma(\frac{1-c}{2})}{\sqrt{\pi}}\exp(-\frac{\mu^{2}}{2\sigma^{2}})M\Big{(}\frac{1-c}{2},\frac{1}{2},\frac{\mu^{2}}{2\sigma^{2}}\Big{)} (112)

where M(a,b,z):=F11(a;b;z)M(a,b,z):={}_{1}{F}_{1}(a;b;z) is the confluent hypergeometric function of the first kind. For a<ba<b, we know M(a,b,z)=Γ(b)Γ(a)Γ(ba)01eztta1(1t)ba1𝑑tM(a,b,z)=\frac{\Gamma(b)}{\Gamma(a)\Gamma(b-a)}\int_{0}^{1}e^{zt}t^{a-1}(1-t)^{b-a-1}dt. For a=1c2<b=12a=\frac{1-c}{2}<b=\frac{1}{2} and z=μ22σ2z=\frac{\mu^{2}}{2\sigma^{2}}, this gives

exp(z)M(a,b,z)\displaystyle\exp(-z)M(a,b,z) =Γ(b)Γ(a)Γ(ba)01ez(1t)ta1(1t)ba1𝑑t\displaystyle=\frac{\Gamma(b)}{\Gamma(a)\Gamma(b-a)}\int_{0}^{1}e^{-z(1-t)}t^{a-1}(1-t)^{b-a-1}dt (113)
=Γ(b)Γ(a)Γ(ba)01ezt(1t)a1tba1𝑑t.\displaystyle=\frac{\Gamma(b)}{\Gamma(a)\Gamma(b-a)}\int_{0}^{1}e^{-zt}(1-t)^{a-1}t^{b-a-1}dt. (114)

Therefore, exp(z)M(a,b,z)\exp(-z)M(a,b,z) is decreasing for z0z\geq 0 and exp(z)M(a,b,z)exp(0)M(a,b,0)=1\exp(-z)M(a,b,z)\leq\exp(-0)M(a,b,0)=1 for all z0z\geq 0. This gives

𝔼[|X|c]=1σc2c/2Γ(1c2)πexp(μ22σ2)M(1c2,12,μ22σ2)1σc2c/2Γ(1c2)π.\mathbb{E}[|X|^{-c}]=\frac{1}{\sigma^{c}2^{c/2}}\frac{\Gamma(\frac{1-c}{2})}{\sqrt{\pi}}\exp(-\frac{\mu^{2}}{2\sigma^{2}})M\Big{(}\frac{1-c}{2},\frac{1}{2},\frac{\mu^{2}}{2\sigma^{2}}\Big{)}\leq\frac{1}{\sigma^{c}2^{c/2}}\frac{\Gamma(\frac{1-c}{2})}{\sqrt{\pi}}. (115)

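A quick Monte Carlo check of the bound (111) can be run as follows; it is purely illustrative, and the values of \mu, \sigma and c are arbitrary.

import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(1)
mu, sigma, c = 1.3, 0.8, 0.4        # arbitrary values with c in (0,1)

x = rng.normal(mu, sigma, size=10**6)
empirical = np.mean(np.abs(x) ** (-c))
bound = gamma((1 - c) / 2) / (sigma ** c * 2 ** (c / 2) * np.sqrt(np.pi))
print(empirical, bound)             # the Monte Carlo estimate should not exceed the bound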
Lemma C.8.

(A uniform bound on the full conditional mean of 1/σ21/\sigma^{2}). Consider the full conditional distribution of σ2\sigma^{2} given η,ξ\eta,\xi in Algorithm 3. Given η>0p,ξ>0\eta\in\mathbb{R}_{>0}^{p},\xi>0, we have

σ2|ξ,ηInvGamma(a0+n2,yTMξ,η1y+b02),\sigma^{2}|\xi,\eta\sim\mathrm{InvGamma}\bigg{(}\frac{a_{0}+n}{2},\frac{y^{T}M_{\xi,\eta}^{-1}y+b_{0}}{2}\bigg{)}, (116)

where Mξ,η=In+ξ1XDiag(η1)XTM_{\xi,\eta}=I_{n}+\xi^{-1}X\text{Diag}(\eta^{-1})X^{T}. Then

𝔼[1σ2|ξ,η]=a0+nyTMξ,η1y+b0<a0+nb0<.\mathbb{E}\Big{[}\frac{1}{\sigma^{2}}\Big{|}\xi,\eta\Big{]}=\frac{a_{0}+n}{y^{T}M_{\xi,\eta}^{-1}y+b_{0}}<\frac{a_{0}+n}{b_{0}}<\infty. (117)
Proof.

Recall that the mean of a Gamma(a,b)\mathrm{Gamma}(a,b) distribution (under shape and rate parameterization) is a/ba/b. Equation (117) now directly follows as Mξ,η1M_{\xi,\eta}^{-1} is positive semi-definite. ∎

Lemma C.9.

(Maximal coupling probability of each component ηj\eta_{j}). For fixed constants m>0m>0 and ν>0\nu>0, define a distribution on (0,)(0,\infty) with probability density function

p(η|m)1η1ν2(1+νη)ν+12emη.p(\eta|m)\propto\frac{1}{\eta^{\frac{1-\nu}{2}}(1+\nu\eta)^{\frac{\nu+1}{2}}}e^{-m\eta}. (118)

Consider random variables ηp(|m)\eta\sim p(\cdot|m), and η~p(|m~)\tilde{\eta}\sim p(\cdot|\tilde{m}) for some fixed ν>0\nu>0. Fix some 0<a<b<0<a<b<\infty, and take m,m~(a,b)m,\tilde{m}\in(a,b). Then under a maximal coupling,

(η=η~|m,m~)=0min{p(η|m),p(η|m~)}𝑑ηU(1+ν2,1,bν)/U(1+ν2,1,aν).\mathbb{P}(\eta=\tilde{\eta}|m,\tilde{m})=\int_{0}^{\infty}\min\{p(\eta^{*}|m),p(\eta^{*}|\tilde{m})\}d\eta^{*}\geq U\Big{(}\frac{1+\nu}{2},1,\frac{b}{\nu}\Big{)}\Big{/}U\Big{(}\frac{1+\nu}{2},1,\frac{a}{\nu}\Big{)}. (119)
Proof of Lemma C.9.

Fix some 0<a<b<0<a<b<\infty, and take m,m~(a,b)m,\tilde{m}\in(a,b). Then

(η=η~|m,m~)\displaystyle\mathbb{P}(\eta=\tilde{\eta}|m,\tilde{m}) =0min{p(η|m),p(η|m~)}𝑑η, as (η,η~)|m,m~ are maximally coupled\displaystyle=\int_{0}^{\infty}\min\{p(\eta^{*}|m),p(\eta^{*}|\tilde{m})\}d\eta^{*}\text{, as }(\eta,\tilde{\eta})|m,\tilde{m}\text{ are maximally coupled} (120)
=1(η)1ν2(1+νη)ν+12min{emηU(1+ν2,1,mν),em~ηU(1+ν2,1,m~ν)}𝑑η.\displaystyle=\int\frac{1}{(\eta^{*})^{\frac{1-\nu}{2}}(1+\nu\eta^{*})^{\frac{\nu+1}{2}}}\min\Big{\{}\frac{e^{-m\eta^{*}}}{U\big{(}\frac{1+\nu}{2},1,\frac{m}{\nu}\big{)}},\frac{e^{-\tilde{m}\eta^{*}}}{U\big{(}\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu}\big{)}}\Big{\}}d\eta^{*}. (121)

Recall m,m~(a,b)m,\tilde{m}\in(a,b). Note also that xU(1+ν2,1,x)x\mapsto U\big{(}\frac{1+\nu}{2},1,x\big{)} is a decreasing function for x>0x>0. This gives

min{emηU(1+ν2,1,mν),em~ηU(1+ν2,1,m~ν)}>ebηU(1+ν2,1,aν).\min\Big{\{}\frac{e^{-m\eta^{*}}}{U\big{(}\frac{1+\nu}{2},1,\frac{m}{\nu}\big{)}},\frac{e^{-\tilde{m}\eta^{*}}}{U\big{(}\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu}\big{)}}\Big{\}}>\frac{e^{-b\eta^{*}}}{U\big{(}\frac{1+\nu}{2},1,\frac{a}{\nu}\big{)}}. (122)

Therefore for all m,m~(a,b)m,\tilde{m}\in(a,b),

(η=η~|m,m~)\displaystyle\mathbb{P}(\eta=\tilde{\eta}|m,\tilde{m}) >1(η)1ν2(1+νη)ν+12ebηU(1+ν2,1,aν)𝑑η=U(1+ν2,1,bν)U(1+ν2,1,aν).\displaystyle>\int\frac{1}{(\eta^{*})^{\frac{1-\nu}{2}}(1+\nu\eta^{*})^{\frac{\nu+1}{2}}}\frac{e^{-b\eta^{*}}}{U\big{(}\frac{1+\nu}{2},1,\frac{a}{\nu}\big{)}}d\eta^{*}=\frac{U\big{(}\frac{1+\nu}{2},1,\frac{b}{\nu}\big{)}}{U\big{(}\frac{1+\nu}{2},1,\frac{a}{\nu}\big{)}}. (123)

Equation (119) follows from taking an infimum over all m,m~(a,b)m,\tilde{m}\in(a,b). ∎
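The lower bound (119) can also be checked numerically by computing the overlap \int\min\{p(\eta|m),p(\eta|\tilde{m})\}d\eta directly; the sketch below does so for arbitrary values of \nu, a, b, m and \tilde{m}, using the normalizing constant obtained from (88) and (89) with c=0.

import numpy as np
from scipy.integrate import quad
from scipy.special import gamma, hyperu

nu, a, b = 1.0, 0.5, 3.0            # arbitrary values; m and m_tilde lie in (a, b)
m, m_tilde = 1.0, 2.5

def density(eta, mm):
    # normalizing constant from (88)-(89) with c = 0
    Z = nu ** (-(nu + 1) / 2) * gamma((nu + 1) / 2) * hyperu((1 + nu) / 2, 1, mm / nu)
    return eta ** ((nu - 1) / 2) * (1 + nu * eta) ** (-(nu + 1) / 2) * np.exp(-mm * eta) / Z

overlap, _ = quad(lambda e: min(density(e, m), density(e, m_tilde)), 0, np.inf)
bound = hyperu((1 + nu) / 2, 1, b / nu) / hyperu((1 + nu) / 2, 1, a / nu)
print(overlap, bound)               # the overlap is at least the stated lower bound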

Drift and Minorization conditions.
Proof of Proposition 2.1.

For ease of notation, denote Σt+1:=XTX+ξt+1Diag(ηt+1)\Sigma_{t+1}:=X^{T}X+\xi_{t+1}\text{Diag}(\eta_{t+1}) and let eje_{j} be the jthj^{th} unit vector of the standard basis on p\mathbb{R}^{p}. We first consider j=1pmjd\sum_{j=1}^{p}m_{j}^{d}. Note that

𝔼[j=1pmt+1,jd|ηt+1,ξt+1,σt+12]\displaystyle\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{d}|\eta_{t+1},\xi_{t+1},\sigma_{t+1}^{2}\Big{]} =(ξt+12σt+12)dj=1p𝔼[|βt+1,j|2d|ηt+1,ξt+1,σt+12]\displaystyle=\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\sum_{j=1}^{p}\mathbb{E}\Big{[}\big{|}\beta_{t+1,j}\big{|}^{2d}\big{|}\eta_{t+1},\xi_{t+1},\sigma_{t+1}^{2}\Big{]} (124)
(ξt+12σt+12)dj=1p𝔼[βt+1,j2|ηt+1,ξt+1,σt+12]d\displaystyle\leq\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\sum_{j=1}^{p}\mathbb{E}\Big{[}\beta^{2}_{t+1,j}|\eta_{t+1},\xi_{t+1},\sigma_{t+1}^{2}\Big{]}^{d} (125)
=(ξt+12σt+12)dj=1p((ejTΣt+11XTy)2+σt+12(ejTΣt+11ej))d\displaystyle=\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\sum_{j=1}^{p}\bigg{(}\big{(}e_{j}^{T}\Sigma_{t+1}^{-1}X^{T}y\big{)}^{2}+\sigma_{t+1}^{2}\big{(}e_{j}^{T}\Sigma_{t+1}^{-1}e_{j}\big{)}\bigg{)}^{d} (126)
(ξt+12σt+12)dj=1p(|ejTΣt+11XTy|2d+σt+12d(ejTΣt+11ej)d)\displaystyle\leq\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\sum_{j=1}^{p}\Big{(}|e_{j}^{T}\Sigma_{t+1}^{-1}X^{T}y|^{2d}+\sigma_{t+1}^{2d}\big{(}e_{j}^{T}\Sigma_{t+1}^{-1}e_{j}\big{)}^{d}\Big{)} (127)
(ξt+12σt+12)d(j=1p(|ejTΣt+11XTy|2d)1/d)dp1d\displaystyle\leq\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\Big{(}\sum_{j=1}^{p}\big{(}|e_{j}^{T}\Sigma_{t+1}^{-1}X^{T}y|^{2d}\big{)}^{1/d}\Big{)}^{d}p^{1-d}
+σt+12d(ξt+12σt+12)dj=1p(ejT(ξt+1Diag(ηt+1))1ej)d\displaystyle\qquad+\sigma_{t+1}^{2d}\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\sum_{j=1}^{p}\big{(}e_{j}^{T}\big{(}\xi_{t+1}\text{Diag}(\eta_{t+1})\big{)}^{-1}e_{j}\big{)}^{d} (128)
=(ξt+12σt+12)d(Σt+11XTy22)dp1d+12dj=1p1ηt+1,jd\displaystyle=\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\Big{(}\|\Sigma_{t+1}^{-1}X^{T}y\|^{2}_{2}\Big{)}^{d}p^{1-d}+\frac{1}{2^{d}}\sum_{j=1}^{p}\frac{1}{\eta_{t+1,j}^{d}} (129)
(ξt+12σt+12)d(Cp2y22)dp1d+12dj=1p1ηt+1,jd,\displaystyle\leq\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\Big{(}C_{p}^{2}\|y\|_{2}^{2}\Big{)}^{d}p^{1-d}+\frac{1}{2^{d}}\sum_{j=1}^{p}\frac{1}{\eta_{t+1,j}^{d}}, (130)

where (125) follows from Jensen’s inequality, (127) follows as (x+y)^{d}\leq x^{d}+y^{d} for x,y\geq 0 and d\in(0,1), (128) follows from Hölder’s inequality and from e_{j}^{T}\Sigma_{t+1}^{-1}e_{j}\leq e_{j}^{T}\big{(}\xi_{t+1}\text{Diag}(\eta_{t+1})\big{)}^{-1}e_{j} as X^{T}X is a positive semi-definite matrix, and (130) follows from Lemma C.3 for some finite constant C_{p} which does not depend on \eta_{t+1} or \xi_{t+1}.

As the prior \pi_{\xi}(\cdot) on \xi has compact support, we have b\leq\xi_{t+1}\leq B for all t\geq 0, for some 0<b\leq B<\infty. This upper bound on \xi_{t+1}, Lemma C.8 and Jensen’s inequality give

𝔼[(ξt+12σt+12)d|ηt+1]Bd2d𝔼[(1σt+12)|ηt+1]dBd2d(a0+nb0)d=(B(a0+n)2b0)d.\displaystyle\mathbb{E}\Big{[}\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}|\eta_{t+1}\Big{]}\leq\frac{B^{d}}{2^{d}}\mathbb{E}\Big{[}\Big{(}\frac{1}{\sigma_{t+1}^{2}}\Big{)}|\eta_{t+1}\Big{]}^{d}\leq\frac{B^{d}}{2^{d}}\big{(}\frac{a_{0}+n}{b_{0}}\big{)}^{d}=\Big{(}\frac{B(a_{0}+n)}{2b_{0}}\Big{)}^{d}. (131)

Therefore, by (130) and (131), we obtain

𝔼[j=1pmt+1,jd|ηt+1]=𝔼[𝔼[j=1pmt+1,jd|ηt+1,ξt+1,σt+12]|ηt+1]12dj=1p1ηt+1,jd+K1\displaystyle\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{d}|\eta_{t+1}\Big{]}=\mathbb{E}\Big{[}\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{d}|\eta_{t+1},\xi_{t+1},\sigma_{t+1}^{2}\Big{]}|\eta_{t+1}\Big{]}\leq\frac{1}{2^{d}}\sum_{j=1}^{p}\frac{1}{\eta_{t+1,j}^{d}}+K_{1} (132)

for K1:=(B(a0+n)2b0)d(Cp2y22)dp1d<K_{1}:=\Big{(}\frac{B(a_{0}+n)}{2b_{0}}\Big{)}^{d}\Big{(}C_{p}^{2}\|y\|_{2}^{2}\Big{)}^{d}p^{1-d}<\infty. By (132) and Corollary C.6, we obtain

𝔼[j=1pmt+1,jd|βt,ξt,σt2]\displaystyle\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{d}|\beta_{t},\xi_{t},\sigma_{t}^{2}\Big{]} =𝔼[𝔼[j=1pmt+1,jd|ηt+1]|βt,ξt,σt2]\displaystyle=\mathbb{E}\Big{[}\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{d}|\eta_{t+1}\Big{]}|\beta_{t},\xi_{t},\sigma_{t}^{2}\Big{]} (133)
12dj=1p𝔼[ηt+1,jd|βt,ξt,σt2]+K1\displaystyle\leq\frac{1}{2^{d}}\sum_{j=1}^{p}\mathbb{E}\Big{[}\eta_{t+1,j}^{-d}|\beta_{t},\xi_{t},\sigma_{t}^{2}\Big{]}+K_{1} (134)
\displaystyle\leq\frac{1}{2^{d}}\sum_{j=1}^{p}\Big{(}\frac{\Gamma(\frac{1+\nu}{2}-d)}{\Gamma(\frac{1+\nu}{2})}m_{t,j}^{d}+K^{(2)}_{d,\nu}\Big{)}+K_{1} (135)
12dΓ(1+ν2d)Γ(1+ν2)j=1pmt,jd+K2\displaystyle\leq\frac{1}{2^{d}}\frac{\Gamma(\frac{1+\nu}{2}-d)}{\Gamma(\frac{1+\nu}{2})}\sum_{j=1}^{p}m_{t,j}^{d}+K_{2} (136)

for K2:=p2dKd,ν(2)+K1<K_{2}:=\frac{p}{2^{d}}K^{(2)}_{d,\nu}+K_{1}<\infty. Note that for every ν1\nu\geq 1, there exists d(0,1)d\in(0,1) such that 12dΓ(ν+12d)Γ(ν+12)<1\frac{1}{2^{d}}\frac{\Gamma(\frac{\nu+1}{2}-d)}{\Gamma(\frac{\nu+1}{2})}<1. We can verify this by considering the function f(d):=12dΓ(ν+12d)Γ(ν+12)f(d):=\frac{1}{2^{d}}\frac{\Gamma(\frac{\nu+1}{2}-d)}{\Gamma(\frac{\nu+1}{2})}. It suffices to note that f(0)=1f(0)=1 and f(0)=ψ(ν+12)log(2)<0f^{\prime}(0)=-\psi(\frac{\nu+1}{2})-\log(2)<0 for ν1\nu\geq 1, where ψ(x):=Γ(x)Γ(x)\psi(x):=\frac{\Gamma^{\prime}(x)}{\Gamma(x)} is the Digamma function.
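The existence of such a d can also be checked numerically (this is only an illustration of the argument above, not part of the proof); the value \nu=1 below is an arbitrary choice.

import numpy as np
from scipy.special import gamma

nu = 1.0                                     # arbitrary; any nu >= 1 behaves similarly
d_grid = np.linspace(0.01, 0.99, 99)
f = 2.0 ** (-d_grid) * gamma((nu + 1) / 2 - d_grid) / gamma((nu + 1) / 2)
print(d_grid[f < 1])                         # non-empty: small d already gives f(d) < 1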

We now consider j=1pmjc\sum_{j=1}^{p}m_{j}^{-c} for c(0,1/2)c\in(0,1/2). We obtain

𝔼[j=1p\displaystyle\mathbb{E}\Big{[}\sum_{j=1}^{p} mt+1,jc|σt+12,ξt+1,ηt+1]=(ξt+12σt+12)cj=1p𝔼[|βt+1,j|2c|σt+12,ξt+1,ηt+1]\displaystyle m_{t+1,j}^{-c}|\sigma^{2}_{t+1},\xi_{t+1},\eta_{t+1}\Big{]}=\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{-c}\sum_{j=1}^{p}\mathbb{E}\Big{[}\big{|}\beta_{t+1,j}\big{|}^{-2c}\big{|}\sigma^{2}_{t+1},\xi_{t+1},\eta_{t+1}\Big{]} (137)
(ξt+12σt+12)c12cΓ(12c2)πj=1p(σt+12ejTΣt+11ej)c\displaystyle\leq\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{-c}\frac{1}{2^{c}}\frac{\Gamma(\frac{1-2c}{2})}{\sqrt{\pi}}\sum_{j=1}^{p}\big{(}\sigma_{t+1}^{2}e_{j}^{T}\Sigma_{t+1}^{-1}e_{j}\big{)}^{-c} (138)
(ξt+12σt+12)c12cΓ(12c2)πj=1p(σt+12ejT(X22Ip+ξt+1Diag(ηt+1))1ej)c\displaystyle\leq\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{-c}\frac{1}{2^{c}}\frac{\Gamma(\frac{1-2c}{2})}{\sqrt{\pi}}\sum_{j=1}^{p}\big{(}\sigma_{t+1}^{2}e_{j}^{T}(\|X\|_{2}^{2}I_{p}+\xi_{t+1}\text{Diag}(\eta_{t+1}))^{-1}e_{j}\big{)}^{-c} (139)
=(ξt+12σt+12)c12cΓ(12c2)πj=1p(X22+ξt+1ηt+1,j)cσt+12c\displaystyle=\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{-c}\frac{1}{2^{c}}\frac{\Gamma(\frac{1-2c}{2})}{\sqrt{\pi}}\sum_{j=1}^{p}\frac{(\|X\|_{2}^{2}+\xi_{t+1}\eta_{t+1,j})^{c}}{\sigma_{t+1}^{2c}} (140)
Γ(12c)πj=1p((X22ξt+1)c+ηt+1,jc)\displaystyle\leq\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\sum_{j=1}^{p}\Big{(}\Big{(}\frac{\|X\|_{2}^{2}}{\xi_{t+1}}\Big{)}^{c}+\eta_{t+1,j}^{c}\Big{)} (141)
Γ(12c)πj=1pηt+1,jc+pΓ(12c)π(X22b)c\displaystyle\leq\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\sum_{j=1}^{p}\eta_{t+1,j}^{c}+p\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\Big{(}\frac{\|X\|_{2}^{2}}{b}\Big{)}^{c} (142)

for some b>0. Here (138) follows from Lemma C.7, (139) follows as e_{j}^{T}(\|X\|_{2}^{2}I_{p}+\xi_{t+1}\text{Diag}(\eta_{t+1}))^{-1}e_{j}\leq e_{j}^{T}\Sigma_{t+1}^{-1}e_{j} for \|\cdot\|_{2} the L_{2} operator norm of a matrix, (141) follows as (x+y)^{c}\leq x^{c}+y^{c} for all x,y>0 and 0<c<1/2, and (142) follows as the prior on \xi has compact support, so \xi_{t+1}\geq b for some b>0.

𝔼[j=1pmt+1,jc|ηt+1]=𝔼[𝔼[j=1pmt+1,jc|σt+12,ξt+1,ηt+1]|ηt+1]Γ(12c)πj=1pηt+1,jc+K3.\displaystyle\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{-c}|\eta_{t+1}\Big{]}=\mathbb{E}\Big{[}\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{-c}|\sigma^{2}_{t+1},\xi_{t+1},\eta_{t+1}\Big{]}|\eta_{t+1}\Big{]}\leq\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\sum_{j=1}^{p}\eta_{t+1,j}^{c}+K_{3}. (143)

for K3:=pΓ(12c)π(X22b)c<K_{3}:=p\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\Big{(}\frac{\|X\|_{2}^{2}}{b}\Big{)}^{c}<\infty. By (143) and Corollary C.6 with r=1+ν2r=\frac{1+\nu}{2} and any ϵ<πΓ(12c)\epsilon<\frac{\sqrt{\pi}}{\Gamma(\frac{1}{2}-c)}, there exists Kϵ,c,ν(1)<K^{(1)}_{\epsilon,c,\nu}<\infty such that

𝔼[j=1pmt+1,jc|σt2,ξt,βt]\displaystyle\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{-c}|\sigma^{2}_{t},\xi_{t},\beta_{t}\Big{]} =𝔼[𝔼[j=1pmt+1,jc|ηt+1]|σt2,ξt,βt]\displaystyle=\mathbb{E}\Big{[}\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{-c}|\eta_{t+1}\Big{]}|\sigma^{2}_{t},\xi_{t},\beta_{t}\Big{]} (144)
Γ(12c)πj=1p𝔼[ηt+1,jc|σt2,ξt,βt]+K3\displaystyle\leq\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\sum_{j=1}^{p}\mathbb{E}\Big{[}\eta_{t+1,j}^{c}|\sigma^{2}_{t},\xi_{t},\beta_{t}\Big{]}+K_{3} (145)
Γ(12c)πj=1p(ϵmt,jc+Kϵ,c,ν(1))+K3\displaystyle\leq\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\sum_{j=1}^{p}\Big{(}\epsilon m_{t,j}^{-c}+K^{(1)}_{\epsilon,c,\nu}\Big{)}+K_{3} (146)
=(Γ(12c)πϵ)j=1pmt,jc+K4,\displaystyle=\Big{(}\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\epsilon\Big{)}\sum_{j=1}^{p}m_{t,j}^{-c}+K_{4}, (147)

where Γ(12c)πϵ<1\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\epsilon<1 and K4=pΓ(12c)πKϵ,c,ν(1)+K3K_{4}=p\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}K^{(1)}_{\epsilon,c,\nu}+K_{3}.

We can now combine (136) and (147). Then, for some fixed d\in(0,1) such that \frac{1}{2^{d}}\frac{\Gamma(\frac{\nu+1}{2}-d)}{\Gamma(\frac{\nu+1}{2})}<1 and any c\in(0,1/2), we obtain

𝔼[j=1pmt+1,jc+mt+1,jd|mt]γdrift(j=1pmt,jc+mt,jd)+Kdrift\displaystyle\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{-c}+m_{t+1,j}^{d}|m_{t}\Big{]}\leq\gamma_{drift}\Big{(}\sum_{j=1}^{p}m_{t,j}^{-c}+m_{t,j}^{d}\Big{)}+K_{drift} (148)

for some

0<γdrift\displaystyle 0<\gamma_{drift} :=max{Γ(12c)πϵ,12dΓ(ν+12d)Γ(ν+12)}<1\displaystyle:=max\Big{\{}\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\epsilon,\frac{1}{2^{d}}\frac{\Gamma(\frac{\nu+1}{2}-d)}{\Gamma(\frac{\nu+1}{2})}\Big{\}}<1 (149)
0<Kdrift\displaystyle 0<K_{drift} :=K2+K4<.\displaystyle:=K_{2}+K_{4}<\infty. (150)

Therefore, for any such d(0,1)d\in(0,1) and c(0,1/2)c\in(0,1/2),

V(β)=j=1pmjc+mjdV(\beta)=\sum_{j=1}^{p}m_{j}^{-c}+m_{j}^{d} (151)

for mj=ξβj22σ2m_{j}=\frac{\xi\beta_{j}^{2}}{2\sigma^{2}} is a Lyapunov function. ∎

Proposition C.10.

(Minorization condition). For R>0, let S(R)=\{(\beta,\xi,\sigma^{2})\in\mathbb{R}^{p}\times\mathbb{R}_{>0}\times\mathbb{R}_{>0}:V(\beta,\xi,\sigma^{2})<R\} denote the sub-level sets of the Lyapunov function in Proposition 2.1. Let \mathcal{P} denote the Markov transition kernel associated with the update from (\beta_{t},\xi_{t},\sigma_{t}^{2}) to (\beta_{t+1},\xi_{t+1},\sigma_{t+1}^{2}) implied by the blocked Gibbs update rule. Let Z=(\beta,\xi,\sigma^{2}) and \tilde{Z}=(\tilde{\beta},\tilde{\xi},\tilde{\sigma}^{2}). Then for all R>0, there exists \epsilon\in(0,1) such that

TV(𝒫(Z,),𝒫(Z~,))<1ϵ for all Z,Z~S(R),\text{TV}\Big{(}\mathcal{P}(Z,\cdot),\mathcal{P}(\tilde{Z},\cdot)\Big{)}<1-\epsilon\quad\text{ for all }Z,\tilde{Z}\in S(R), (152)

where TV stands for the total variation distance between two probability distributions. In particular for R>1R>1, ϵ=(U(1+ν2,1,R1/dν)/U(1+ν2,1,R1/cν))p\epsilon=(U\big{(}\frac{1+\nu}{2},1,\frac{R^{1/d}}{\nu}\big{)}/U\big{(}\frac{1+\nu}{2},1,\frac{R^{-1/c}}{\nu}\big{)})^{p} suffices.

Proof of Proposition C.10.

Denote Z=(β,ξ,σ2)Z=(\beta,\xi,\sigma^{2}) and Z~=(β~,ξ~,σ~2)\tilde{Z}=(\tilde{\beta},\tilde{\xi},\tilde{\sigma}^{2}), and let Z,Z~S(R)Z,\tilde{Z}\in S(R) for some R>1R>1. We obtain

TV (𝒫(Z,),𝒫(Z~,))(Z1Z~1|Z,Z~)<1ϵ,\displaystyle\big{(}\mathcal{P}(Z,\cdot),\mathcal{P}(\tilde{Z},\cdot)\big{)}\leq\mathbb{P}\big{(}Z_{1}\neq\tilde{Z}_{1}\big{|}Z,\tilde{Z}\big{)}<1-\epsilon, (153)

where (Z1,Z~1)(Z_{1},\tilde{Z}_{1}) corresponds to one-scale coupling of 𝒫(Z,)\mathcal{P}(Z,\cdot) and 𝒫(Z~,)\mathcal{P}(\tilde{Z},\cdot) as in Proposition 2.2, the first inequality follows from the coupling inequality, and the second inequality follows from the proof of Proposition 2.2 for ϵ=(U(1+ν2,1,R1/dν)/U(1+ν2,1,R1/cν))p(0,1)\epsilon=\Big{(}U\big{(}\frac{1+\nu}{2},1,\frac{R^{1/d}}{\nu}\big{)}\Big{/}U\big{(}\frac{1+\nu}{2},1,\frac{R^{-1/c}}{\nu}\big{)}\Big{)}^{p}\in(0,1) as R>1R>1. Note that ϵ\epsilon does not depend on ZZ and Z~\tilde{Z} as required. ∎

Geometric ergodicity follows from Proposition 2.1 and Proposition C.10 with any R>max{1,2K/(1γ)}R>\max\{1,2K/(1-\gamma)\} [Rosenthal, 1995].

C.3 Coupled Kernel Proofs

Proof of Proposition 2.2.

For R>0, let S(R):=\{(\beta,\xi,\sigma^{2})\in\mathbb{R}^{p}\times\mathbb{R}_{>0}\times\mathbb{R}_{>0}:V(\beta,\xi,\sigma^{2})<R\} denote the sub-level sets of the Lyapunov function in Proposition 2.1, with corresponding contraction constant \gamma\in(0,1) and additive constant K>0. Let Z_{t}:=(\beta_{t},\xi_{t},\sigma_{t}^{2}) and \tilde{Z}_{t}=(\tilde{\beta}_{t},\tilde{\xi}_{t},\tilde{\sigma}_{t}^{2}) for all t\geq 0. By Jacob et al. [2020, Proposition 4], it suffices to prove that for any R>0 sufficiently large such that \gamma+K/(1+R)<1, there exists \epsilon\in(0,1) such that

infZt,Z~tS(R)(Zt+1=Z~t+1|Zt,Z~t)>ϵ,\inf_{Z_{t},\tilde{Z}_{t}\in S(R)}\mathbb{P}\big{(}Z_{t+1}=\tilde{Z}_{t+1}\big{|}Z_{t},\tilde{Z}_{t}\big{)}>\epsilon, (154)

under the one-scale coupling. We now prove (154) directly for the one-scale coupling in Proposition 2.2. Fix some R>1R>1 such that γ+K/(1+R)<1\gamma+K/(1+R)<1, and take any Zt,Z~tS(R)Z_{t},\tilde{Z}_{t}\in S(R). Then

\displaystyle\mathbb{P} (Zt+1=Z~t+1|Zt,Z~t)\displaystyle(Z_{t+1}=\tilde{Z}_{t+1}\big{|}Z_{t},\tilde{Z}_{t}) (155)
=𝔼[(Zt+1=Z~t+1|ηt+1,η~t+1,Zt,Z~t)|Zt,Z~t]\displaystyle=\mathbb{E}\Big{[}\mathbb{P}\Big{(}Z_{t+1}=\tilde{Z}_{t+1}\big{|}\eta_{t+1},\tilde{\eta}_{t+1},Z_{t},\tilde{Z}_{t}\Big{)}\Big{|}Z_{t},\tilde{Z}_{t}\Big{]} (156)
=𝔼[(Zt+1=Z~t+1|ηt+1,η~t+1)|Zt,Z~t]\displaystyle=\mathbb{E}\Big{[}\mathbb{P}\Big{(}Z_{t+1}=\tilde{Z}_{t+1}\big{|}\eta_{t+1},\tilde{\eta}_{t+1}\Big{)}\Big{|}Z_{t},\tilde{Z}_{t}\Big{]} (157)
=𝔼[1{ηt+1=η~t+1}|Zt,Z~t], as (Zt+1,Z~t+1)|(ηt+1,η~t+1) follows CRN coupling\displaystyle=\mathbb{E}\Big{[}\mathrm{1}_{\{\eta_{t+1}=\tilde{\eta}_{t+1}\}}\Big{|}Z_{t},\tilde{Z}_{t}\Big{]}\text{, as }(Z_{t+1},\tilde{Z}_{t+1})|(\eta_{t+1},\tilde{\eta}_{t+1})\text{ follows CRN coupling} (158)
=(ηt+1=η~t+1|Zt,Z~t)\displaystyle=\mathbb{P}\big{(}\eta_{t+1}=\tilde{\eta}_{t+1}\Big{|}Z_{t},\tilde{Z}_{t}\big{)} (159)
=j=1p(ηt+1,j=η~t+1,j|Zt,Z~t), as (ηt+1,η~t+1)|Zt,Z~t is maximally coupled componentwise\displaystyle=\prod_{j=1}^{p}\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\Big{|}Z_{t},\tilde{Z}_{t}\big{)}\text{, as }(\eta_{t+1},\tilde{\eta}_{t+1})|Z_{t},\tilde{Z}_{t}\text{ is maximally coupled componentwise} (160)
=j=1p(ηt+1,j=η~t+1,j|mj,m~j), formj=ξtβt,j22σt2 and m~j=ξ~tβ~t,j22σ~t2.\displaystyle=\prod_{j=1}^{p}\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\Big{|}m_{j},\tilde{m}_{j}\big{)}\text{, for}\ m_{j}=\frac{\xi_{t}\beta_{t,j}^{2}}{2\sigma_{t}^{2}}\text{ and }\tilde{m}_{j}=\frac{\tilde{\xi}_{t}\tilde{\beta}_{t,j}^{2}}{2\tilde{\sigma}_{t}^{2}}. (161)

As Z,Z~S(R)Z,\tilde{Z}\in S(R) for some R>1R>1, mj,m~j(R1/c,R1/d)m_{j},\tilde{m}_{j}\in(R^{-1/c},R^{1/d}) for all j=1,,pj=1,\ldots,p. By Lemma C.9,

(ηt+1,j=η~t+1,j|mj,m~j)U(1+ν2,1,R1/dν)/U(1+ν2,1,R1/cν)\displaystyle\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\Big{|}m_{j},\tilde{m}_{j}\big{)}\geq U\big{(}\frac{1+\nu}{2},1,\frac{R^{1/d}}{\nu}\big{)}\Big{/}U\big{(}\frac{1+\nu}{2},1,\frac{R^{-1/c}}{\nu}\big{)} (162)

for all j=1,,pj=1,\ldots,p. Therefore,

\displaystyle\mathbb{P} (Zt+1=Z~t+1|Zt,Z~t)j=1pU(1+ν2,1,R1/dν)U(1+ν2,1,R1/cν)=ϵ\displaystyle(Z_{t+1}=\tilde{Z}_{t+1}\big{|}Z_{t},\tilde{Z}_{t})\geq\prod_{j=1}^{p}\frac{U\big{(}\frac{1+\nu}{2},1,\frac{R^{1/d}}{\nu}\big{)}}{U\big{(}\frac{1+\nu}{2},1,\frac{R^{-1/c}}{\nu}\big{)}}=\epsilon (163)

for ϵ:=(U(1+ν2,1,R1/dν)/U(1+ν2,1,R1/cν))p(0,1)\epsilon:=(U(\frac{1+\nu}{2},1,\frac{R^{1/d}}{\nu})/U(\frac{1+\nu}{2},1,\frac{R^{-1/c}}{\nu}))^{p}\in(0,1) as R>1R>1. ∎
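For intuition only, the constant \epsilon above can be evaluated numerically for given R, c, d, \nu and p; the values in the sketch below are arbitrary and simply illustrate how the bound decays with p.

from scipy.special import hyperu

nu, d, c, R = 1.0, 0.5, 0.25, 10.0           # arbitrary choices with R > 1
a = (1 + nu) / 2
ratio = hyperu(a, 1, R ** (1 / d) / nu) / hyperu(a, 1, R ** (-1 / c) / nu)
for p in (10, 100, 1000):
    print(p, ratio ** p)                     # the lower bound decays geometrically in p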

Proof of Proposition 3.1.

As \eta_{t+1} given m_{t} and \tilde{\eta}_{t+1} given \tilde{m}_{t} are conditionally independent component-wise, it suffices to consider the univariate case where p=1. Henceforth we assume p=1 and drop the component subscripts for ease of notation. Without loss of generality, take \tilde{m}_{t}\leq m_{t} and h:(0,\infty)\rightarrow\mathbb{R} to be monotone increasing. Then the densities of \eta_{t+1}|m_{t} and \tilde{\eta}_{t+1}|\tilde{m}_{t} on (0,\infty) are given by

\displaystyle p(\eta|m_{t})\propto\frac{e^{-m_{t}\eta}}{\eta^{\frac{1-\nu}{2}}(1+\nu\eta)^{\frac{\nu+1}{2}}}\text{ and }p(\tilde{\eta}|\tilde{m}_{t})\propto\frac{e^{-\tilde{m}_{t}\tilde{\eta}}}{\tilde{\eta}^{\frac{1-\nu}{2}}(1+\nu\tilde{\eta})^{\frac{\nu+1}{2}}} (164)

respectively. As m~tmt\tilde{m}_{t}\leq m_{t}, the distribution of η~t+1\tilde{\eta}_{t+1} stochastically dominates the distribution of ηt+1\eta_{t+1}. Therefore under a CRN coupling of ηt+1{\eta}_{t+1} and η~t+1\tilde{\eta}_{t+1} (specifically a CRN coupling of Algorithm 11), η~t+1ηt+1\tilde{\eta}_{t+1}\geq\eta_{t+1} almost surely, and h(η~t+1)h(ηt+1)h(\tilde{\eta}_{t+1})\geq h(\eta_{t+1}) almost surely. We obtain

𝔼crn[|h(ηt+1)h(η~t+1)||mt,m~t]=𝔼[h(η~t+1)h(ηt+1)|mt,m~t].\mathbb{E}_{\text{crn}}\big{[}|h(\eta_{t+1})-h(\tilde{\eta}_{t+1})|\big{|}m_{t},\tilde{m}_{t}\big{]}=\mathbb{E}\big{[}h(\tilde{\eta}_{t+1})-h(\eta_{t+1})\big{|}m_{t},\tilde{m}_{t}\big{]}. (165)

Now consider a maximal coupling of ηt+1\eta_{t+1} and η~t+1\tilde{\eta}_{t+1} with independent residuals [Johnson, 1998]. Let q=TV(p(|mt),p(|m~t))q=\text{TV}(p(\cdot|m_{t}),p(\cdot|\tilde{m}_{t})). Then ηt+1=η~t+1\eta_{t+1}=\tilde{\eta}_{t+1} with probability 1q1-q. Otherwise with probability qq, (ηt+1,η~t+1)=(ηt+1,η~t+1)(\eta_{t+1},\tilde{\eta}_{t+1})=(\eta^{*}_{t+1},\tilde{\eta}^{*}_{t+1}) where ηt+1\eta^{*}_{t+1} and η~t+1\tilde{\eta}^{*}_{t+1} are independent random variables with densities

\displaystyle p(\eta^{*}_{t+1}|m_{t},\tilde{m}_{t}) \displaystyle=\max\{p(\eta^{*}_{t+1}|m_{t})-p(\eta^{*}_{t+1}|\tilde{m}_{t}),0\}/q\text{ and } (166)
\displaystyle p(\tilde{\eta}^{*}_{t+1}|m_{t},\tilde{m}_{t}) \displaystyle=\max\{p(\tilde{\eta}^{*}_{t+1}|\tilde{m}_{t})-p(\tilde{\eta}^{*}_{t+1}|{m}_{t}),0\}/q (167)

respectively on (0,)(0,\infty). This implies that the supports of ηt+1\eta^{*}_{t+1} and η~t+1\tilde{\eta}^{*}_{t+1} are disjoint. By (164), we further have ηt+1η~t+1\eta^{*}_{t+1}\leq\tilde{\eta}^{*}_{t+1} almost surely. Overall, we obtain η~t+1ηt+1\tilde{\eta}_{t+1}\geq\eta_{t+1} almost surely under a maximal coupling, and hence

𝔼max[|h(ηt+1)h(η~t+1)||mt,m~t]=𝔼[h(η~t+1)h(ηt+1)|mt,m~t]=𝔼crn[|h(ηt+1)h(η~t+1)||mt,m~t]\mathbb{E}_{\text{max}}\big{[}|h(\eta_{t+1})-h(\tilde{\eta}_{t+1})|\big{|}m_{t},\tilde{m}_{t}\big{]}=\mathbb{E}\big{[}h(\tilde{\eta}_{t+1})-h(\eta_{t+1})\big{|}m_{t},\tilde{m}_{t}\big{]}=\mathbb{E}_{\text{crn}}\big{[}|h(\eta_{t+1})-h(\tilde{\eta}_{t+1})|\big{|}m_{t},\tilde{m}_{t}\big{]} (168)

as required. ∎

Proof of Proposition 3.2.
CRN coupling result.

To prove (12), it suffices to show that under CRN,

|log(1+ηt+1,j)log(1+η~t+1,j)||logmt,jlogm~t,j|,|\log(1+\eta_{t+1,j})-\log(1+\tilde{\eta}_{t+1,j})|\leq|\log m_{t,j}-\log\tilde{m}_{t,j}|, (169)

almost surely for all j=1,\ldots,p. Henceforth, to prove (169) we work component-wise and drop the subscripts for ease of notation. For the Horseshoe prior, we then have p(\eta|m)\propto\exp(-m\eta)/(1+\eta) on (0,\infty). The cumulative distribution function is then given by F_{m}(t)=1-E_{1}(m(t+1))/E_{1}(m), where E_{1} is the exponential integral function. Therefore, under the CRN coupling, we obtain

1+η=1mE11(UE1(m)) and 1+η~=1m~E11(UE1(m~)),\displaystyle 1+\eta=\frac{1}{m}E_{1}^{-1}(U^{*}E_{1}(m))\text{ and }1+\tilde{\eta}=\frac{1}{\tilde{m}}E_{1}^{-1}(U^{*}E_{1}(\tilde{m})), (170)

for UUniform(0,1)U^{*}\sim\mathrm{Uniform}(0,1). Then

ddlogm(log(1+η))\displaystyle\frac{d}{d\log m}\big{(}\log(1+\eta)\big{)} =1+ddlogm(log(E11(UE1(m))))\displaystyle=-1+\frac{d}{d\log m}\big{(}\log\big{(}E_{1}^{-1}(U^{*}E_{1}(m))\big{)}\big{)} (171)
=1+mE11(UE1(m))(ddm(E11(UE1(m))))\displaystyle=-1+\frac{m}{E_{1}^{-1}(U^{*}E_{1}(m))}\Big{(}\frac{d}{dm}\big{(}E_{1}^{-1}(U^{*}E_{1}(m))\big{)}\Big{)} (172)
=1+mE11(UE1(m))UE1(m)E1(E11(UE1(m))), where E1(z)=ez/z\displaystyle=-1+\frac{m}{E_{1}^{-1}(U^{*}E_{1}(m))}\frac{U^{*}E_{1}^{\prime}(m)}{E_{1}^{\prime}\big{(}E_{1}^{-1}(U^{*}E_{1}(m))\big{)}}\text{, where }E^{\prime}_{1}(z)=-e^{-z}/z (173)
=1+mm(1+η)UE1(m)E1(m(1+η)), as m(1+η)=E11(UE1(m))\displaystyle=-1+\frac{m}{m(1+\eta)}\frac{U^{*}E_{1}^{\prime}(m)}{E_{1}^{\prime}\big{(}m(1+\eta)\big{)}}\text{, as }m(1+\eta)=E_{1}^{-1}(U^{*}E_{1}(m)) (174)
=1+Uemη, as E1(z)=ez/z\displaystyle=-1+U^{*}e^{m\eta}\text{, as }E^{\prime}_{1}(z)=-e^{-z}/z (175)
=1+E1(m(1+η))em(1+η)E1(m)em, as U=E1(m(1+η))/E1(m)\displaystyle=-1+\frac{E_{1}(m(1+\eta))e^{m(1+\eta)}}{E_{1}(m)e^{m}}\text{, as }U^{*}=E_{1}(m(1+\eta))/E_{1}(m) (176)
(1,0) as xexE1(x) is positive and decreasing on (0,).\displaystyle\in(-1,0)\text{ as }x\mapsto e^{x}E_{1}(x)\text{ is positive and decreasing on }(0,\infty). (177)

Therefore, as |\frac{d\log(1+\eta)}{d\log m}|\leq 1, the mean value theorem gives

|log(1+η)log(1+η~)||logmlogm~|,|\log(1+\eta)-\log(1+\tilde{\eta})|\leq|\log m-\log\tilde{m}|, (178)

almost surely as required for (169). Finally, (169) implies (12) with mt,j=μm_{t,j}=\mu and m~t,j=μ~\tilde{m}_{t,j}=\tilde{\mu} for all j=1,,pj=1,\ldots,p.
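A minimal sketch of the inverse-CDF construction in (170) is given below, assuming scalar draws that share a single uniform variable (common random numbers); the helper draw_eta, the bracketing strategy used to invert E_{1}, and all numerical values are our own illustrative choices, not part of the proof.

import numpy as np
from scipy.special import exp1
from scipy.optimize import brentq

def draw_eta(m, u):
    """Return eta with 1 + eta = E_1^{-1}(u * E_1(m)) / m, for u in (0, 1), as in (170)."""
    target = u * exp1(m)
    hi = m
    while exp1(hi) > target:      # E_1 is decreasing: expand until the bracket straddles the root
        hi *= 2.0
    x = brentq(lambda z: exp1(z) - target, m, hi)
    return x / m - 1.0

rng = np.random.default_rng(2)
u = rng.uniform()                 # shared uniform: common random numbers
m, m_tilde = 2.0, 0.5             # arbitrary values
eta, eta_tilde = draw_eta(m, u), draw_eta(m_tilde, u)
lhs = abs(np.log(1 + eta) - np.log(1 + eta_tilde))
rhs = abs(np.log(m) - np.log(m_tilde))
print(lhs <= rhs)                 # the path-wise contraction (169) holds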

Maximal coupling result.

To prove (13), take μ~<μ\tilde{\mu}<\mu without loss of generality. For c>0c>0, denote the event

Ec:={maxj=1p|log(1+ηt+1,j)log(1+η~t+1,j)|>c},E_{c}:=\big{\{}\max_{j=1}^{p}|\log(1+\eta_{t+1,j})-\log(1+\tilde{\eta}_{t+1,j})|>c\big{\}},

where the pairs (\eta_{t+1,j},\tilde{\eta}_{t+1,j}) are independent and identically distributed across j, with each pair distributed according to a maximal coupling. By component-wise independence,

(Ec|μ,μ~)=1j=1p(|log(1+ηt+1,j)log(1+η~t+1,j)|c|μ,μ~)=1xcp\displaystyle\mathbb{P}(E_{c}|\mu,\tilde{\mu})=1-\prod_{j=1}^{p}\mathbb{P}\big{(}|\log(1+\eta_{t+1,j})-\log(1+\tilde{\eta}_{t+1,j})|\leq c\big{|}\mu,\tilde{\mu}\big{)}=1-x_{c}^{p} (179)

for x_{c}=\mathbb{P}\big{(}|\log(1+\eta_{t+1,1})-\log(1+\tilde{\eta}_{t+1,1})|\leq c\big{|}\mu,\tilde{\mu}\big{)}. Henceforth, we focus on this quantity x_{c} and drop the subscripts for ease of notation. For the Horseshoe prior, the marginal densities of \eta|\mu and \tilde{\eta}|\tilde{\mu} are given by

p(η|μ)=eμη1+η1eμE1(μ) and p(η|μ~)=eμ~η1+η1eμ~E1(μ~) on (0,).\displaystyle p(\eta|\mu)=\frac{e^{-\mu\eta}}{1+\eta}\frac{1}{e^{\mu}E_{1}(\mu)}\text{ and }p(\eta|\tilde{\mu})=\frac{e^{-\tilde{\mu}\eta}}{1+\eta}\frac{1}{e^{\tilde{\mu}}E_{1}(\tilde{\mu})}\text{ on }(0,\infty). (180)

Consider a maximal coupling of \eta given \mu and \tilde{\eta} given \tilde{\mu} with independent residuals [Johnson, 1998]. Let q=\text{TV}(p(\cdot|\mu),p(\cdot|\tilde{\mu})). Then \eta=\tilde{\eta} with probability 1-q. Otherwise, with probability q, (\eta,\tilde{\eta})=(\eta^{*},\tilde{\eta}^{*}) where \eta^{*} and \tilde{\eta}^{*} are independent random variables on (0,\infty) with densities

p(η|μ,μ~)\displaystyle p(\eta^{*}|\mu,\tilde{\mu}) =max{p(η|μ)p(η|μ~),0}/q and\displaystyle=\max\{p(\eta^{*}|\mu)-p(\eta^{*}|\tilde{\mu}),0\}/q\text{ and } (181)
p(η~|μ,μ~)\displaystyle p(\tilde{\eta}^{*}|\mu,\tilde{\mu}) =max{p(η~|μ~)p(η~|μ),0}/q,\displaystyle=\max\{p(\tilde{\eta}^{*}|\tilde{\mu})-p(\tilde{\eta}^{*}|{\mu}),0\}/q, (182)

respectively. Note that p(η|μ)p(η|μ~)p(\eta^{*}|\mu)\geq p(\eta^{*}|\tilde{\mu}) if and only if 1+ηK1+\eta^{*}\leq K for constant K:=(logE1(μ)logE1(μ~))/(μ~μ)K:=(\log E_{1}(\mu)-\log E_{1}(\tilde{\mu}))/(\tilde{\mu}-\mu). Therefore, 1+η1+\eta^{*} has support [1,K)[1,K) and 1+η~1+\tilde{\eta}^{*} has support [K,)[K,\infty). We obtain,

xc\displaystyle x_{c} =(1q)+q(|log(1+η)log(1+η~)|c)\displaystyle=(1-q)+q\mathbb{P}\big{(}|\log(1+\eta^{*})-\log(1+\tilde{\eta}^{*})|\leq c\big{)} (183)
=1q(|log(1+η)log(1+η~)|>c)\displaystyle=1-q\mathbb{P}\big{(}|\log(1+\eta^{*})-\log(1+\tilde{\eta}^{*})|>c\big{)} (184)
=1q(log(1+η~)log(1+η)>c), as ηη~\displaystyle=1-q\mathbb{P}\big{(}\log(1+\tilde{\eta}^{*})-\log(1+\eta^{*})>c\big{)}\text{, as }\eta^{*}\leq\tilde{\eta}^{*} (185)
=1q(1+η~>(1+η)ec)\displaystyle=1-q\mathbb{P}\big{(}1+\tilde{\eta}^{*}>(1+\eta^{*})e^{c}\big{)} (186)
\displaystyle\leq 1-q\mathbb{P}\big{(}1+\tilde{\eta}^{*}>t\big{)}\text{ for }t=Ke^{c}\text{, as }1+\eta^{*}\leq K\text{ almost surely} (187)
=1t11+η~(eμ~η~eμ~E1(μ~)eμη~eμE1(μ))𝑑y, by (182)\displaystyle=1-\int_{t}^{\infty}\frac{1}{1+\tilde{\eta}^{*}}\Big{(}\frac{e^{-\tilde{\mu}\tilde{\eta}^{*}}}{e^{\tilde{\mu}}E_{1}(\tilde{\mu})}-\frac{e^{-\mu\tilde{\eta}^{*}}}{e^{\mu}E_{1}(\mu)}\Big{)}dy\text{, by }\eqref{eq:horseshoe_residuals} (188)
=1t1y(eμ~yE1(μ~)eμyE1(μ))𝑑y with y=1+η~\displaystyle=1-\int_{t}^{\infty}\frac{1}{y}\Big{(}\frac{e^{-\tilde{\mu}y}}{E_{1}(\tilde{\mu})}-\frac{e^{-\mu y}}{E_{1}(\mu)}\Big{)}dy\text{ with }y=1+\tilde{\eta}^{*} (189)
=1(E1(μ~t)E1(μ~)E1(μt)E1(μ)).\displaystyle=1-\Big{(}\frac{E_{1}(\tilde{\mu}t)}{E_{1}(\tilde{\mu})}-\frac{E_{1}(\mu t)}{E_{1}(\mu)}\Big{)}. (190)

We now consider this upper bound on x_{c} as t\rightarrow\infty. Note that for \mu,\tilde{\mu} fixed with \tilde{\mu}<\mu,

limt(E1(μ~)μ~teμ~t(E1(μ~t)E1(μ~)E1(μt)E1(μ)))\displaystyle\lim_{t\rightarrow\infty}\Big{(}E_{1}(\tilde{\mu})\tilde{\mu}te^{\tilde{\mu}t}\Big{(}\frac{E_{1}(\tilde{\mu}t)}{E_{1}(\tilde{\mu})}-\frac{E_{1}(\mu t)}{E_{1}(\mu)}\Big{)}\Big{)} =limt(μ~teμ~tE1(μ~t))limt(1E1(μt)E1(μ~)E1(μ~t)E1(μ))\displaystyle=\lim_{t\rightarrow\infty}\Big{(}\tilde{\mu}te^{\tilde{\mu}t}E_{1}(\tilde{\mu}t)\Big{)}\lim_{t\rightarrow\infty}\Big{(}1-\frac{E_{1}(\mu t)E_{1}(\tilde{\mu})}{E_{1}(\tilde{\mu}t)E_{1}(\mu)}\Big{)} (191)
=limt(μ~teμ~tE1(μ~t)) as limtE1(μt)E1(μ~t)=0\displaystyle=\lim_{t\rightarrow\infty}\Big{(}\tilde{\mu}te^{\tilde{\mu}t}E_{1}(\tilde{\mu}t)\Big{)}\text{ as }\lim_{t\rightarrow\infty}\frac{E_{1}(\mu t)}{E_{1}(\tilde{\mu}t)}=0 (192)
=1 as limx(xexE1(x))=1.\displaystyle=1\text{ as }\lim_{x\rightarrow\infty}\big{(}xe^{x}E_{1}(x)\big{)}=1. (193)

Therefore, there exists some constant D1>0D_{1}>0 such that

(E1(μ~t)E1(μ~)E1(μt)E1(μ))>D1μ~teμ~t\Big{(}\frac{E_{1}(\tilde{\mu}t)}{E_{1}(\tilde{\mu})}-\frac{E_{1}(\mu t)}{E_{1}(\mu)}\Big{)}>\frac{D_{1}}{\tilde{\mu}te^{\tilde{\mu}t}} (194)

for all t\geq K. By (190) and (194), we obtain

\displaystyle x_{c}^{p} \displaystyle<\Big{(}1-\frac{D_{1}}{\tilde{\mu}te^{\tilde{\mu}t}}\Big{)}^{p} (195)
=((11D11μ~teμ~t)D11μ~teμ~t)p/(D11μ~teμ~t)\displaystyle=\Big{(}\Big{(}1-\frac{1}{D_{1}^{-1}\tilde{\mu}te^{\tilde{\mu}t}}\Big{)}^{D_{1}^{-1}\tilde{\mu}te^{\tilde{\mu}t}}\Big{)}^{p/(D_{1}^{-1}\tilde{\mu}te^{\tilde{\mu}t})} (196)
<exp(p/(D11μ~teμ~t)) as (11/x)x<1/e for x1.\displaystyle<\exp\big{(}-p/(D_{1}^{-1}\tilde{\mu}te^{\tilde{\mu}t})\big{)}\text{ as }(1-1/x)^{x}<1/e\text{ for }x\geq 1. (197)

For p>1p>1, now take c=log(αlogpμ~K)c=\log\big{(}\frac{\alpha\log p}{\tilde{\mu}K}\big{)} for any fixed constant α(0,1)\alpha\in(0,1). Recall t=Kect=Ke^{c}, such that μ~t=αlogp\tilde{\mu}t=\alpha\log p. Then

xcp\displaystyle x_{c}^{p} <exp(p/(D11μ~teμ~t))=exp(D1p1α/(αlogp))<exp(Dp1α)\displaystyle<\exp\big{(}-p/(D_{1}^{-1}\tilde{\mu}te^{\tilde{\mu}t})\big{)}=\exp\big{(}-D_{1}p^{1-\alpha}/(\alpha\log p)\big{)}<\exp\big{(}-Dp^{1-\alpha^{\prime}}\big{)} (198)

for any α(α,1)\alpha^{\prime}\in(\alpha,1) and some constant DD which does not depend on pp. Therefore,

(Ec|mt,m~t)\displaystyle\mathbb{P}\big{(}E_{c}\big{|}\ m_{t},\tilde{m}_{t}\big{)} >1exp(Dp1α)\displaystyle>1-\exp\big{(}-Dp^{1-\alpha^{\prime}}\big{)} (199)

for c=log(α/(μ~K))+loglogpc=\log(\alpha/(\tilde{\mu}K))+\log\log p. Recall that K:=(logE1(μ)logE1(μ~))/(μ~μ)K:=(\log E_{1}(\mu)-\log E_{1}(\tilde{\mu}))/(\tilde{\mu}-\mu). We obtain

1μ~K\displaystyle\frac{1}{\tilde{\mu}K} =μμ~μ~(logE1(μ~)logE1(μ))\displaystyle=\frac{\mu-\tilde{\mu}}{\tilde{\mu}(\log E_{1}(\tilde{\mu})-\log E_{1}(\mu))} (200)
=μ/μ~1logE1(μ~)logE1(μ)\displaystyle=\frac{\mu/\tilde{\mu}-1}{\log E_{1}(\tilde{\mu})-\log E_{1}(\mu)} (201)
logμlogμ~logE1(μ~)logE1(μ) as x1log(x) for x>0\displaystyle\geq\frac{\log\mu-\log\tilde{\mu}}{\log E_{1}(\tilde{\mu})-\log E_{1}(\mu)}\text{ as }x-1\geq\log(x)\text{ for }x>0 (202)
=eμ^E1(μ^) for some μ^[μ~,μ] as dd(logμ)(logE1(μ))=1eμE1(μ)\displaystyle=e^{\hat{\mu}}E_{1}(\hat{\mu})\text{ for some }\hat{\mu}\in[\tilde{\mu},\mu]\text{ as }\frac{d}{d(\log\mu)}(-\log E_{1}(\mu))=\frac{1}{e^{\mu}E_{1}(\mu)} (203)
eμE1(μ) as xexE1(x) is decreasing for all x>0\displaystyle\geq e^{\mu}E_{1}(\mu)\text{ as }x\mapsto e^{x}E_{1}(x)\text{ is decreasing for all }x>0 (204)
12log(1+1/μ) as exE1(x)12log(1+1/x) for all x>0\displaystyle\geq\frac{1}{2}\log(1+1/\mu)\text{ as }e^{x}E_{1}(x)\geq\frac{1}{2}\log(1+1/x)\text{ for all }x>0 (205)
=12log(1+1/max{μ,μ~})\displaystyle=\frac{1}{2}\log(1+1/\max\{\mu,\tilde{\mu}\}) (206)
=:L.\displaystyle=:L. (207)

Therefore, clog(αL)+loglogpc\geq\log(\alpha L)+\log\log p. By (199),

(maxj=1p|log(1+ηt+1,j)log(1+η~t+1,j)|>log(αL)+loglogp|mt,m~t)>1exp(Dp1α).\displaystyle\mathbb{P}\big{(}\max_{j=1}^{p}|\log(1+\eta_{t+1,j})-\log(1+\tilde{\eta}_{t+1,j})|>\log(\alpha L)+\log\log p\ \big{|}\ m_{t},\tilde{m}_{t}\big{)}>1-\exp(-Dp^{1-\alpha^{\prime}}). (208)

The case μ~>μ\tilde{\mu}>\mu follows by symmetry to give (13). ∎
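The two facts about the exponential integral used in the last steps, namely lim_{x→∞} x e^x E_1(x) = 1 and e^x E_1(x) ≥ (1/2)log(1+1/x) for x > 0, can be checked numerically. The following Python snippet is a minimal sketch of such a check (it uses scipy.special.exp1 for E_1; the grid of test points is an arbitrary choice of ours).

import numpy as np
from scipy.special import exp1  # exp1(x) computes E_1(x)

x = np.array([1e-3, 1e-2, 0.1, 1.0, 5.0, 10.0, 50.0])   # arbitrary test points
f = np.exp(x) * exp1(x)                                  # e^x E_1(x)
assert np.all(f >= 0.5 * np.log1p(1.0 / x))              # lower bound used to define L
print(x * f)                                             # tends to 1 as x grows, cf. (193)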

Proof of Proposition 3.3.

Fix some ν1\nu\geq 1. For any m>0m>0, let p(|m)p(\cdot|m) denote the component-wise full conditional density of η\eta given mm. This is given by

p(η|m)=exp(mη)η1ν2(1+νη)ν+121Zmp(\eta|m)=\frac{\exp(-m\eta)}{\eta^{\frac{1-\nu}{2}}(1+\nu\eta)^{\frac{\nu+1}{2}}}\frac{1}{Z_{m}} (209)

on (0,)(0,\infty), where ZmZ_{m} is the normalization constant. We will first show

TV(p(|m),p(|m~))2(1+ν)(e1)4|log(m)log(m~)|.\text{TV}\big{(}p(\cdot|m),p(\cdot|\tilde{m})\big{)}^{2}\leq\frac{(1+\nu)(e-1)}{4}|\log(m)-\log(\tilde{m})|. (210)

By a generalization of Pinsker’s inequality [e.g. van Erven and Harremos, 2014], we obtain

TV(p(|m),p(|m~))212αDα(p(|m)p(|m~))\text{TV}\big{(}p(\cdot|m),p(\cdot|\tilde{m})\big{)}^{2}\leq\frac{1}{2\alpha}D_{\alpha}(p(\cdot|m)\|p(\cdot|\tilde{m})) (211)

for all α(0,1]\alpha\in(0,1], where

Dα(p(|m)p(|m~)):=1α1log(p(η|m)αp(η|m~)1αdη)D_{\alpha}(p(\cdot|m)\|p(\cdot|\tilde{m})):=\frac{1}{\alpha-1}\log\Big{(}\int p(\eta|m)^{\alpha}p(\eta|\tilde{m})^{1-\alpha}d\eta\Big{)}

is the Rényi divergence of order α between the probability densities p(⋅|m) and p(⋅|m̃) with respect to the Lebesgue measure. For α = 1/2, we obtain

Dα(p(|m)p(|m~))\displaystyle D_{\alpha}(p(\cdot|m)\|p(\cdot|\tilde{m})) =2log(0exp(m+m~2η)η1ν2(1+νη)ν+121ZmZm~𝑑η)=log(ZmZm~Z(m+m~)/22)\displaystyle=-2\log\Big{(}\int_{0}^{\infty}\frac{\exp(-\frac{m+\tilde{m}}{2}\eta)}{\eta^{\frac{1-\nu}{2}}(1+\nu\eta)^{\frac{\nu+1}{2}}}\frac{1}{\sqrt{Z_{m}Z_{\tilde{m}}}}d\eta\Big{)}=\log\Big{(}\frac{Z_{m}Z_{\tilde{m}}}{Z_{(m+\tilde{m})/2}^{2}}\Big{)} (212)

Let L(m)=logZmL(m)=\log Z_{m}. Then for η|mp(|m)\eta|m\sim p(\cdot|m), note that

dL(m)dm=dZmdmZm=𝔼[η|m]0, and d2L(m)dm2=d2Zmdm2Zm(dZmdm)2Zm2=𝔼[η2|m]𝔼[η|m]20,\displaystyle\frac{dL(m)}{dm}=\frac{\frac{dZ_{m}}{dm}}{Z_{m}}=-\mathbb{E}[\eta|m]\leq 0,\text{ and }\frac{d^{2}L(m)}{dm^{2}}=\frac{\frac{d^{2}Z_{m}}{dm^{2}}Z_{m}-(\frac{dZ_{m}}{dm})^{2}}{Z_{m}^{2}}=\mathbb{E}[\eta^{2}|m]-\mathbb{E}[\eta|m]^{2}\geq 0, (213)

such that L(m)L(m) is a decreasing convex function. For m~m\tilde{m}\geq m, this gives

L(m)+L(m~)2L(m+m~2)\displaystyle L(m)+L(\tilde{m})-2L\Big{(}\frac{m+\tilde{m}}{2}\Big{)} (L(m~)L(m))(m~m2)\displaystyle\leq\big{(}L^{\prime}(\tilde{m})-L^{\prime}(m)\Big{)}\Big{(}\frac{\tilde{m}-m}{2}\big{)} (214)
=(𝔼[η|m]𝔼[η|m~])(m~m2).\displaystyle=\big{(}\mathbb{E}[\eta|m]-\mathbb{E}[\eta|\tilde{m}]\big{)}\Big{(}\frac{\tilde{m}-m}{2}\Big{)}. (215)

Furthermore, note that by Lemmas C.4 and C.5,

𝔼[η|m]=1νΓ(1+ν2+1)Γ(1+ν2)U(1+ν2+1,1+1,mν)U(1+ν2,1,mν)1νΓ(1+ν2+1)Γ(1+ν2)(mν)1=1+ν2m1\displaystyle\mathbb{E}[\eta|m]=\frac{1}{\nu}\frac{\Gamma(\frac{1+\nu}{2}+1)}{\Gamma(\frac{1+\nu}{2})}\frac{U(\frac{1+\nu}{2}+1,1+1,\frac{m}{\nu})}{U(\frac{1+\nu}{2},1,\frac{m}{\nu})}\leq\frac{1}{\nu}\frac{\Gamma(\frac{1+\nu}{2}+1)}{\Gamma(\frac{1+\nu}{2})}\big{(}\frac{m}{\nu}\big{)}^{-1}=\frac{1+\nu}{2}m^{-1} (216)

where U(a,b,z)U(a,b,z) corresponds to the confluent hypergeometric function of the second kind. By (214), we therefore obtain

Dα(p(|m)p(|m~))\displaystyle D_{\alpha}(p(\cdot|m)\|p(\cdot|\tilde{m})) (𝔼[η|m]𝔼[η|m~])(m~m2)\displaystyle\leq\big{(}\mathbb{E}[\eta|m]-\mathbb{E}[\eta|\tilde{m}]\big{)}\Big{(}\frac{\tilde{m}-m}{2}\Big{)} (217)
1+ν2(m~m2m)\displaystyle\leq\frac{1+\nu}{2}\Big{(}\frac{\tilde{m}-m}{2m}\Big{)} (218)
\displaystyle=\frac{1+\nu}{4}\big{(}\exp(\log\tilde{m}-\log m)-1\big{)}, (219)

Since exp(x) − 1 ≤ (e−1)x for x ∈ [0,1], applying this with x = |log m̃ − log m| ≤ 1 gives

TV(p(|m),p(|m~))2(1+ν)(e1)4(log(m~)log(m)),\text{TV}\big{(}p(\cdot|m),p(\cdot|\tilde{m})\big{)}^{2}\leq\frac{(1+\nu)(e-1)}{4}\big{(}\log(\tilde{m})-\log(m)\big{)}, (220)

for m~m\tilde{m}\geq m. The case m~m\tilde{m}\leq m follows by symmetry, giving (210).

We now extend (210) to the multivariate setting. Let π(η|m):=j=1pp(ηj|mj)\pi(\eta|m):=\prod_{j=1}^{p}p(\eta_{j}|m_{j}) now denote the full conditional density of η(0,)p\eta\in(0,\infty)^{p} given m(0,)pm\in(0,\infty)^{p}. Then we obtain

TV(π(|m),π(|m~))2\displaystyle\text{TV}\big{(}\pi(\cdot|m),\pi(\cdot|\tilde{m})\big{)}^{2} D1/2(π(|m)π(|m~))\displaystyle\leq D_{1/2}(\pi(\cdot|m)\|\pi(\cdot|\tilde{m})) (221)
=j=1pD1/2(p(|mj)p(|m~j)) by conditional independence\displaystyle=\sum_{j=1}^{p}D_{1/2}(p(\cdot|m_{j})\|p(\cdot|\tilde{m}_{j}))\text{ by conditional independence } (222)
\displaystyle\leq\frac{(1+\nu)(e-1)}{4}\sum_{j=1}^{p}|\log(m_{j})-\log(\tilde{m}_{j})|\text{ by the componentwise bound leading to }\eqref{eq:renyi_div}. (223)

Finally, let Z=(β,ξ,σ2)Z=(\beta,\xi,\sigma^{2}) and Z~=(β~,ξ~,σ~2)\tilde{Z}=(\tilde{\beta},\tilde{\xi},\tilde{\sigma}^{2}), and mj=ξβj2/(2σ2)m_{j}=\xi\beta_{j}^{2}/(2\sigma^{2}) and m~j=ξ~β~j2/(2σ~2)\tilde{m}_{j}=\tilde{\xi}\tilde{\beta}_{j}^{2}/(2\tilde{\sigma}^{2}) for j=1,,pj=1,\ldots,p. For any such Z,Z~Z,\tilde{Z}, we obtain

\displaystyle\text{TV}\big{(}\mathcal{P}(Z,\cdot),\mathcal{P}(\tilde{Z},\cdot)\big{)} =\frac{1}{2}\int\big{|}p(\eta^{*}|Z)-p(\eta^{*}|\tilde{Z})\big{|}p(Z^{*}|\eta^{*})\,d\eta^{*}dZ^{*} (224)
=12|p(η|Z)p(η|Z~)|dη\displaystyle=\frac{1}{2}\int|p(\eta^{*}|Z)-p(\eta^{*}|\tilde{Z})|d\eta^{*} (225)
=12|π(η|m)π(η|m~)|dη\displaystyle=\frac{1}{2}\int|\pi(\eta^{*}|m)-\pi(\eta^{*}|\tilde{m})|d\eta^{*} (226)
=TV(π(|m),π(|m~)).\displaystyle=\text{TV}\big{(}\pi(\cdot|m),\pi(\cdot|\tilde{m})\big{)}. (227)

Therefore, by (221)–(223), we obtain

\displaystyle\text{TV}\big{(}\mathcal{P}(Z,\cdot),\mathcal{P}(\tilde{Z},\cdot)\big{)}^{2} \leq\frac{(1+\nu)(e-1)}{4}\sum_{j=1}^{p}|\log(m_{j})-\log(\tilde{m}_{j})|. (228)
∎
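The single-site bound underlying this proof, D_{1/2}(p(⋅|m) ‖ p(⋅|m̃)) = log(Z_m Z_m̃ / Z²_{(m+m̃)/2}) ≤ (1+ν)(m̃/m − 1)/4 for m̃ ≥ m, can be verified numerically for specific values of (ν, m, m̃). The Python sketch below computes the normalizing constants by quadrature; the particular test values are arbitrary choices of ours.

import numpy as np
from scipy.integrate import quad

def Z(m, nu):
    # normalizing constant of p(eta|m) on (0, infty), cf. (209)
    f = lambda e: np.exp(-m * e) * e ** ((nu - 1) / 2) * (1 + nu * e) ** (-(nu + 1) / 2)
    return quad(f, 0.0, np.inf)[0]

nu, m, m_tilde = 2.0, 0.5, 0.8       # arbitrary test values with m_tilde >= m
d_half = np.log(Z(m, nu) * Z(m_tilde, nu) / Z(0.5 * (m + m_tilde), nu) ** 2)  # cf. (212)
bound = (1 + nu) / 4 * (m_tilde / m - 1)                                       # cf. (217)-(219)
assert 0.0 <= d_half <= bound
print(d_half, bound)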

Proof of Proposition 3.4.

This proposition follows from a modification of the proof of Proposition 2.2. Following the setup and notation in the proof of Proposition 2.2, we have

\displaystyle\mathbb{P} (Zt+1=Z~t+1|Zt,Z~t)\displaystyle(Z_{t+1}=\tilde{Z}_{t+1}\big{|}Z_{t},\tilde{Z}_{t}) (229)
=𝔼[(Zt+1=Z~t+1|ηt+1,η~t+1,Zt,Z~t)|Zt,Z~t]\displaystyle=\mathbb{E}\Big{[}\mathbb{P}\Big{(}Z_{t+1}=\tilde{Z}_{t+1}\big{|}\eta_{t+1},\tilde{\eta}_{t+1},Z_{t},\tilde{Z}_{t}\Big{)}\Big{|}Z_{t},\tilde{Z}_{t}\Big{]} (230)
=𝔼[(Zt+1=Z~t+1|ηt+1,η~t+1)|Zt,Z~t]\displaystyle=\mathbb{E}\Big{[}\mathbb{P}\Big{(}Z_{t+1}=\tilde{Z}_{t+1}\big{|}\eta_{t+1},\tilde{\eta}_{t+1}\Big{)}\Big{|}Z_{t},\tilde{Z}_{t}\Big{]} (231)
\displaystyle=\mathbb{E}\Big{[}\mathrm{1}_{\{\eta_{t+1}=\tilde{\eta}_{t+1}\}}\Big{|}Z_{t},\tilde{Z}_{t}\Big{]}\text{, as }(Z_{t+1},\tilde{Z}_{t+1})\,|\,(\eta_{t+1},\tilde{\eta}_{t+1})\text{ follows the common random numbers coupling} (232)
=(ηt+1=η~t+1|Zt,Z~t)\displaystyle=\mathbb{P}\big{(}\eta_{t+1}=\tilde{\eta}_{t+1}\Big{|}Z_{t},\tilde{Z}_{t}\big{)} (233)
\displaystyle=\mathbb{P}\big{(}\eta_{t+1,1}=\tilde{\eta}_{t+1,1}\big{|}Z_{t},\tilde{Z}_{t}\big{)}\prod_{j=2}^{p}\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\big{|}\underset{1\leq i\leq j-1}{\bigcap}\{\eta_{t+1,i}=\tilde{\eta}_{t+1,i}\},Z_{t},\tilde{Z}_{t}\big{)} (234)
under the switch-to-CRN coupling (235)
j=1pU(1+ν2,1,R1/dν)U(1+ν2,1,R1/cν) by Lemma C.9, as in the proof of Proposition 2.2 for Zt,Z~tS(R),\displaystyle\geq\prod_{j=1}^{p}\frac{U\big{(}\frac{1+\nu}{2},1,\frac{R^{1/d}}{\nu}\big{)}}{U\big{(}\frac{1+\nu}{2},1,\frac{R^{-1/c}}{\nu}\big{)}}\text{ by Lemma \ref{lemma:eta_coupling_bound}, as in the proof of Proposition \ref{prop:one_scale_meeting} for }Z_{t},\tilde{Z}_{t}\in S(R), (236)
=(U(1+ν2,1,R1/dν)U(1+ν2,1,R1/cν))p=:ϵ,\displaystyle=\Big{(}\frac{U\big{(}\frac{1+\nu}{2},1,\frac{R^{1/d}}{\nu}\big{)}}{U\big{(}\frac{1+\nu}{2},1,\frac{R^{-1/c}}{\nu}\big{)}}\Big{)}^{p}=:\epsilon, (237)

where ϵ(0,1)\epsilon\in(0,1) since R>1R>1. ∎

Appendix D Additional Algorithms

Input: Ct:=(βt,ηt,σt2,ξt)p×>0p×>0×>0C_{t}:=(\beta_{t},\eta_{t},\sigma_{t}^{2},\xi_{t})\in\mathbb{R}^{p}\times\mathbb{R}^{p}_{>0}\times\mathbb{R}_{>0}\times\mathbb{R}_{>0}.
  1.

    Sample ηt+1|βt,σt2,ξt\eta_{t+1}|\beta_{t},\sigma^{2}_{t},\xi_{t} component-wise independently using Algorithm 4, targeting

    π(ηt+1|βt,σt2,ξt)j=1pπ(ηt+1,j|mt,j:=ξtβt,j22σt2)j=1pemt,jηt+1,jηt+1,j1ν2(1+νηt+1,j)ν+12.\pi(\eta_{t+1}|\beta_{t},\sigma^{2}_{t},\xi_{t})\propto\prod_{j=1}^{p}\pi\big{(}\eta_{t+1,j}|m_{t,j}:=\frac{\xi_{t}\beta_{t,j}^{2}}{2\sigma^{2}_{t}}\big{)}\propto\prod_{j=1}^{p}\frac{e^{-m_{t,j}\eta_{t+1,j}}}{\eta_{t+1,j}^{\frac{1-\nu}{2}}(1+\nu\eta_{t+1,j})^{\frac{\nu+1}{2}}}. (238)
  2.

    Sample ξt+1,σt+12,βt+1\xi_{t+1},\sigma^{2}_{t+1},\beta_{t+1} given ηt+1\eta_{t+1} as follows:

    (a)

      Sample ξt+1\xi_{t+1} using Metropolis–Rosenbluth–Teller–Hastings with step-size σMRTH\sigma_{\text{MRTH}}:

Given ξt\xi_{t}, propose log(ξ)𝒩(log(ξt),σMRTH2)\log(\xi^{*})\sim\mathcal{N}(\log(\xi_{t}),\sigma^{2}_{\text{MRTH}}).
Calculate acceptance probability
q=\frac{L(y|\xi^{*},\eta_{t+1})\pi_{\xi}(\xi^{*})}{L(y|\xi_{t},\eta_{t+1})\pi_{\xi}(\xi_{t})}\frac{\xi^{*}}{\xi_{t}},
where πξ()\pi_{\xi}(\cdot) is the prior for ξ\xi, Mξ,η:=In+ξ1XDiag(η1)XTM_{\xi,\eta}:=I_{n}+\xi^{-1}X\,\text{Diag}(\eta^{-1})\,X^{T} and
log(L(y|ξ,η))=12log(|Mξ,η|)a0+n2log(b0+yTMξ,η1y).\log(L(y|\xi,\eta))=-\frac{1}{2}\log(|M_{\xi,\eta}|)-\frac{a_{0}+n}{2}\log(b_{0}+y^{T}M_{\xi,\eta}^{-1}y). (239)
Set ξt+1:=ξ\xi_{t+1}:=\xi^{*} with probability min(1,q)\min(1,q), otherwise set ξt+1:=ξt\xi_{t+1}:=\xi_{t}.
    (b)

    Sample σt+12\sigma^{2}_{t+1} given ξt+1,ηt+1\xi_{t+1},\eta_{t+1}:

    σt+12|ξt+1,ηt+1InvGamma(a0+n2,yTMξt+1,ηt+11y+b02).\sigma^{2}_{t+1}|\xi_{t+1},\eta_{t+1}\sim\mathrm{InvGamma}\bigg{(}\frac{a_{0}+n}{2},\frac{y^{T}M_{\xi_{t+1},\eta_{t+1}}^{-1}y+b_{0}}{2}\bigg{)}. (240)
    (c)

    Sample βt+1\beta_{t+1} given σt+12,ξt+1,ηt+1\sigma^{2}_{t+1},\xi_{t+1},\eta_{t+1} using Algorithm 7 in Appendix D, targeting

    βt+1|σt+12,ξt+1,ηt+1𝒩(Σ1XTy,σt+12Σ1) for Σ=XTX+ξt+1Diag(ηt+1).\beta_{t+1}|\sigma^{2}_{t+1},\xi_{t+1},\eta_{t+1}\sim\mathcal{N}\big{(}\Sigma^{-1}X^{T}y,\sigma^{2}_{t+1}\Sigma^{-1}\big{)}\text{ for }\Sigma=X^{T}X+\xi_{t+1}\text{Diag}(\eta_{t+1}). (241)
  return C_{t+1}:=(\beta_{t+1},\eta_{t+1},\sigma^{2}_{t+1},\xi_{t+1}).

    Algorithm 3 MCMC for Bayesian shrinkage regression with Half-t(ν)t(\nu) priors.
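To illustrate Step 2(a) and Equation (239), the following Python sketch performs one Metropolis–Rosenbluth–Teller–Hastings update of ξ on the log scale. The simulated X and y, the step-size value, and the prior density pi_xi are placeholders of ours, not part of the algorithm specification.

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.standard_normal((n, p))                      # placeholder design matrix
y = rng.standard_normal(n)                           # placeholder response
a0, b0 = 1.0, 1.0
sigma_mrth = 0.8                                     # MRTH step-size (placeholder value)
pi_xi = lambda xi: 1.0 / (np.sqrt(xi) * (1.0 + xi))  # placeholder prior density for xi

def log_L(xi, eta):
    # log-likelihood of Equation (239), with M = I_n + xi^{-1} X Diag(eta^{-1}) X^T
    M = np.eye(n) + (X / (xi * eta)) @ X.T
    _, logdet = np.linalg.slogdet(M)
    quad_form = y @ np.linalg.solve(M, y)
    return -0.5 * logdet - 0.5 * (a0 + n) * np.log(b0 + quad_form)

def update_xi(xi_t, eta):
    log_xi_star = rng.normal(np.log(xi_t), sigma_mrth)     # log-normal random walk proposal
    xi_star = np.exp(log_xi_star)
    log_q = (log_L(xi_star, eta) - log_L(xi_t, eta)
             + np.log(pi_xi(xi_star)) - np.log(pi_xi(xi_t))
             + log_xi_star - np.log(xi_t))                  # Jacobian factor xi*/xi_t
    return xi_star if np.log(rng.uniform()) < log_q else xi_t

eta = rng.gamma(1.0, 1.0, size=p)
print(update_xi(1.0, eta))
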
    Result: Slice sampler targeting p(ηj|mj)(ηj1ν2(1+νηj)ν+12)1emjηjp(\eta_{j}|m_{j})\propto(\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}})^{-1}e^{-m_{j}\eta_{j}} on (0,)(0,\infty).
    Input: mj>0m_{j}>0, state ηt,j>0\eta_{t,j}>0.
    1.

      Sample Ut,j|ηt,jUniform(0,(1+νηt,j)ν+12)U_{t,j}|\eta_{t,j}\sim\mathrm{Uniform}(0,(1+\nu\eta_{t,j})^{-\frac{\nu+1}{2}}).

    2.

      Sample ηt+1,j|Ut,jPj\eta_{t+1,j}|U_{t,j}\sim P_{j}, where PjP_{j} has unnormalized density ηηs1emjη\eta\mapsto\eta^{s-1}e^{-m_{j}\eta} on (0,Tt,j)(0,T_{t,j}), with Tt,j=(Ut,j2/(1+ν)1)/νT_{t,j}=(U_{t,j}^{-2/(1+\nu)}-1)/\nu and s=(1+ν)/2s=(1+\nu)/2. We can sample ηt+1,j\eta_{t+1,j} perfectly by setting

      ηt+1,j=1mjγs1(γs(mjTt,j)U)whereUUniform(0,1),\eta_{t+1,j}=\frac{1}{m_{j}}\gamma_{s}^{-1}\Big{(}\gamma_{s}(m_{j}T_{t,j})U^{*}\Big{)}\quad\text{where}\quad U^{*}\sim\mathrm{Uniform}(0,1), (242)

      and where γs(x):=Γ(s)10xts1et𝑑t[0,1]\gamma_{s}(x):=\Gamma(s)^{-1}\int_{0}^{x}t^{s-1}e^{-t}dt\in[0,1] is the cumulative distribution function of a Gamma(s,1)\mathrm{Gamma}(s,1) distribution.

    return ηt+1,j\eta_{t+1,j}.
    Algorithm 4 Slice Sampling for Half-t(ν)t(\nu) prior.
    Input: Distributions PP and QQ with respective densities pp and qq
    Sample XPX\sim P, and WUniform(0,1)W\sim\mathrm{Uniform}(0,1).
    if  p(X)Wq(X)p(X)W\leq q(X)  then  set Y=XY=X and return (X,Y)(X,Y).
    else  sample Y~Q\tilde{Y}\sim Q and W~Uniform(0,1)\tilde{W}\sim\mathrm{Uniform}(0,1) until q(Y~)W~>p(Y~)q(\tilde{Y})\tilde{W}>p(\tilde{Y}). Set Y=Y~Y=\tilde{Y} and return (X,Y)(X,Y).
    Algorithm 5 Maximal coupling with independent residuals.
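A minimal Python sketch of Algorithm 5 for two univariate distributions; the choice of P and Q (two Gaussians from scipy.stats) is purely illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def maximal_coupling(P, Q):
    # Algorithm 5: returns (X, Y) with X ~ P, Y ~ Q and P(X = Y) = 1 - TV(P, Q)
    X = P.rvs(random_state=rng)
    if P.pdf(X) * rng.uniform() <= Q.pdf(X):
        return X, X
    while True:
        Y = Q.rvs(random_state=rng)
        if Q.pdf(Y) * rng.uniform() > P.pdf(Y):
            return X, Y

P, Q = stats.norm(0.0, 1.0), stats.norm(1.0, 1.0)     # placeholder distributions
draws = [maximal_coupling(P, Q) for _ in range(2000)]
print(np.mean([x == y for x, y in draws]))            # close to 1 - TV(P, Q)
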
    Result: Sample from 𝒩((XTX+ξDiag(η))1XTy,σ2(XTX+ξDiag(η))1)\mathcal{N}((X^{T}X+\xi\text{Diag}(\eta))^{-1}X^{T}y,\sigma^{2}(X^{T}X+\xi\text{Diag}(\eta))^{-1})
    1.

      Sample r𝒩(0,Ip)r\sim\mathcal{N}(0,I_{p}), δ𝒩(0,In)\delta\sim\mathcal{N}(0,I_{n})

    2.

Let u=\frac{1}{\sqrt{\xi\eta}}r and v=Xu+\delta. Calculate v^{*}=M^{-1}(\frac{y}{\sigma}-v) for M=I_{n}+\xi^{-1}X\text{Diag}(\eta^{-1})X^{T}. Let U=(\xi\eta)^{-1}X^{T}. Set \beta=\sigma(u+Uv^{*}).

    Algorithm 6 Bhattacharya’s algorithm with common random numbers.
    Algorithm 7 Fast Sampling of Normals for Gaussian scale mixture priors [Bhattacharya et al., 2016].
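A Python sketch of the sampler in Algorithm 7, drawing from N(Σ^{-1}X^Ty, σ²Σ^{-1}) with Σ = X^TX + ξ Diag(η) while only solving an n×n system; the simulated inputs are placeholders of ours.

import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 200
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def fast_normal(eta, xi, sigma2):
    # Algorithm 7 [Bhattacharya et al., 2016]
    sigma = np.sqrt(sigma2)
    r = rng.standard_normal(p)
    delta = rng.standard_normal(n)
    u = r / np.sqrt(xi * eta)                    # u ~ N(0, Diag(1/(xi*eta)))
    v = X @ u + delta
    M = np.eye(n) + (X / (xi * eta)) @ X.T       # M = I_n + xi^{-1} X Diag(eta^{-1}) X^T
    v_star = np.linalg.solve(M, y / sigma - v)
    return sigma * (u + (X.T @ v_star) / (xi * eta))

eta = rng.gamma(1.0, 1.0, size=p)
beta = fast_normal(eta, xi=1.0, sigma2=1.0)
print(beta.shape)
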
    Input: ηt+1,ξt+1,σt+12,η~t+1,ξ~t+1,σ~t+12\eta_{t+1},\xi_{t+1},\sigma_{t+1}^{2},\tilde{\eta}_{t+1},\tilde{\xi}_{t+1},\tilde{\sigma}_{t+1}^{2}.
    1.

      Sample r𝒩(0,Ip)r\sim\mathcal{N}(0,I_{p}), δ𝒩(0,In)\delta\sim\mathcal{N}(0,I_{n}).

    2.

      Let u=r/ξt+1ηt+1u=r\big{/}{\sqrt{\xi_{t+1}\eta_{t+1}}} and u~=r/ξ~t+1η~t+1\tilde{u}=r\big{/}{\sqrt{\tilde{\xi}_{t+1}\tilde{\eta}_{t+1}}}, and v=Xu+δv=Xu+\delta and v~=Xu~+δ\tilde{v}=X\tilde{u}+\delta.

    3.

      Calculate v=M1(yσt+1v)v^{*}=M^{-1}\big{(}\frac{y}{\sigma_{t+1}}-v\big{)} and v~=M~1(yσ~t+1v~)\tilde{v}^{*}=\tilde{M}^{-1}\big{(}\frac{y}{\tilde{\sigma}_{t+1}}-\tilde{v}\big{)} where

M=I_{n}+\xi_{t+1}^{-1}X\text{Diag}(\eta_{t+1})^{-1}X^{T} and \tilde{M}=I_{n}+\tilde{\xi}_{t+1}^{-1}X\text{Diag}(\tilde{\eta}_{t+1})^{-1}X^{T}.
    4.

    Let U=(ξt+1ηt+1)1XTU=(\xi_{t+1}\eta_{t+1})^{-1}X^{T} and U~=(ξ~t+1η~t+1)1XT\tilde{U}=(\tilde{\xi}_{t+1}\tilde{\eta}_{t+1})^{-1}X^{T}, and set βt+1=σt+1(u+Uv)\beta_{t+1}=\sigma_{t+1}(u+Uv^{*}) and β~t+1=σ~t+1(u~+U~v~)\tilde{\beta}_{t+1}=\tilde{\sigma}_{t+1}(\tilde{u}+\tilde{U}\tilde{v}^{*}).

    return (βt+1,β~t+1)(\beta_{t+1},\tilde{\beta}_{t+1})

    Algorithm 8 Coupled Normals for Gaussian scale mixture priors with common random numbers.
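A Python sketch of Algorithm 8, in which the two chains share the same draws (r, δ) so that their β-updates are coupled through common random numbers; the simulated inputs are placeholders of ours.

import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 200
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def coupled_fast_normal(eta, xi, sigma2, eta_tilde, xi_tilde, sigma2_tilde):
    # Algorithm 8: r and delta are drawn once and reused by both chains
    r = rng.standard_normal(p)
    delta = rng.standard_normal(n)
    def one_draw(eta, xi, sigma2):
        sigma = np.sqrt(sigma2)
        u = r / np.sqrt(xi * eta)
        v = X @ u + delta
        M = np.eye(n) + (X / (xi * eta)) @ X.T
        v_star = np.linalg.solve(M, y / sigma - v)
        return sigma * (u + (X.T @ v_star) / (xi * eta))
    return one_draw(eta, xi, sigma2), one_draw(eta_tilde, xi_tilde, sigma2_tilde)

eta = rng.gamma(1.0, 1.0, size=p)
beta, beta_tilde = coupled_fast_normal(eta, 1.0, 1.0, eta, 1.0, 1.0)
print(np.allclose(beta, beta_tilde))   # identical inputs give identical draws under CRN
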
    Input: ξt,ξ~t>0\xi_{t},\tilde{\xi}_{t}>0 and ηt+1,η~t+1>0p\eta_{t+1},\tilde{\eta}_{t+1}\in\mathbb{R}_{>0}^{p}.
    Sample joint proposal
    (log(ξ),log(ξ~))|ξt,ξ~tγmax(𝒩(log(ξt),σMRTH2),𝒩(log(ξ~t),σMRTH2)).\Big{(}\log(\xi^{*}),\log(\tilde{\xi}^{*})\Big{)}\Big{|}\xi_{t},\tilde{\xi}_{t}\sim\gamma_{max}\Big{(}\mathcal{N}(\log(\xi_{t}),\sigma_{\text{MRTH}}^{2}),\mathcal{N}(\log(\tilde{\xi}_{t}),\sigma_{\text{MRTH}}^{2})\Big{)}. (243)
    where σMRTH\sigma_{\text{MRTH}} is the Metropolis–Rosenbluth–Teller–Hastings step-size, and γmax\gamma_{max} is a maximal coupling with independent residuals (Algorithm 5).
    Calculate log-acceptance probabilities
    log(q)=log(L(y|ξ,ηt+1)πξ(ξ)L(y|ξt,ηt+1)πξ(ξt)ξξt)andlog(q~)=log(L(y|ξ~,η~t+1)πξ(ξ~)L(y|ξ~t,η~t+1)πξ(ξ~t)ξ~ξ~t).\log(q)=\log\Big{(}\frac{L(y|\xi^{*},\eta_{t+1})\pi_{\xi}(\xi^{*})}{L(y|\xi_{t},\eta_{t+1})\pi_{\xi}(\xi_{t})}\frac{\xi^{*}}{\xi_{t}}\Big{)}\ \text{and}\ \log(\tilde{q})=\log\Big{(}\frac{L(y|\tilde{\xi}^{*},\tilde{\eta}_{t+1})\pi_{\xi}(\tilde{\xi}^{*})}{L(y|\tilde{\xi}_{t},\tilde{\eta}_{t+1})\pi_{\xi}(\tilde{\xi}_{t})}\frac{\tilde{\xi}^{*}}{\tilde{\xi}_{t}}\Big{)}. (244)
    with log-likelihoods log(L(y|ξ,ηt+1))\log(L(y|\xi,\eta_{t+1})) and log(L(y|ξ~,η~t+1))\log(L(y|\tilde{\xi},\tilde{\eta}_{t+1})) as defined in Algorithm 3, Equation (239).
    Sample UUniform(0,1)U^{*}\sim\mathrm{Uniform}(0,1).
    if UqU^{*}\leq q then set ξt+1:=ξ\xi_{t+1}:=\xi^{*} else set ξt+1:=ξt\xi_{t+1}:=\xi_{t}.
      if Uq~U^{*}\leq\tilde{q} then set ξ~t+1:=ξ~\tilde{\xi}_{t+1}:=\tilde{\xi}^{*} else set ξ~t+1:=ξ~t\tilde{\xi}_{t+1}:=\tilde{\xi}_{t}.
      return (ξt+1,ξ~t+1)(\xi_{t+1},\tilde{\xi}_{t+1}).
    Algorithm 9 Coupled Metropolis–Rosenbluth–Teller–Hastings for ξ|η\xi|\eta updates with exact meetings.
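A Python sketch of the coupling mechanics of Algorithm 9: the log-normal proposals are drawn from a maximal coupling (Algorithm 5) and a single shared uniform drives both accept/reject decisions. The two target log-densities below (log_post and log_post_tilde) are placeholders of ours, standing in for log L(y|ξ,η_{t+1}) + log π_ξ(ξ) on the two chains.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sigma_mrth = 0.8                                                           # placeholder step-size
log_post = lambda xi: -0.5 * (np.log(xi) - 1.0) ** 2 - np.log(xi)          # placeholder target
log_post_tilde = lambda xi: -0.5 * (np.log(xi) + 0.5) ** 2 - np.log(xi)    # placeholder target

def max_coupling_normal(mu1, mu2, sd):
    # maximal coupling with independent residuals (Algorithm 5) of N(mu1, sd^2) and N(mu2, sd^2)
    P, Q = stats.norm(mu1, sd), stats.norm(mu2, sd)
    x = P.rvs(random_state=rng)
    if P.pdf(x) * rng.uniform() <= Q.pdf(x):
        return x, x
    while True:
        y = Q.rvs(random_state=rng)
        if Q.pdf(y) * rng.uniform() > P.pdf(y):
            return x, y

def coupled_update(xi_t, xi_tilde_t):
    l_star, l_tilde_star = max_coupling_normal(np.log(xi_t), np.log(xi_tilde_t), sigma_mrth)
    xi_star, xi_tilde_star = np.exp(l_star), np.exp(l_tilde_star)
    log_q = log_post(xi_star) - log_post(xi_t) + l_star - np.log(xi_t)
    log_q_tilde = (log_post_tilde(xi_tilde_star) - log_post_tilde(xi_tilde_t)
                   + l_tilde_star - np.log(xi_tilde_t))
    log_u = np.log(rng.uniform())          # shared uniform for both acceptance decisions
    xi_next = xi_star if log_u <= log_q else xi_t
    xi_tilde_next = xi_tilde_star if log_u <= log_q_tilde else xi_tilde_t
    return xi_next, xi_tilde_next

print(coupled_update(1.0, 2.0))
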
    Input: η ∈ ℝ^p_{>0}, discretized grid G on the compact support of π_ξ(⋅)
    Perform eigenvalue decomposition XDiag(η)1XT=QΛQTX\text{Diag}(\eta)^{-1}X^{T}=Q\Lambda Q^{T} for orthogonal matrix QQ and diagonal matrix Λ\Lambda.
    Calculate unnormalized probability mass function
    \displaystyle\widetilde{pmf}(\xi|\eta):=\Big{(}\prod_{i=1}^{n}(1+\xi^{-1}\Lambda_{i,i})\Big{)}^{-1/2}\Big{(}b_{0}+\sum_{i=1}^{n}[Q^{T}y]^{2}_{i}\frac{1}{1+\xi^{-1}\Lambda_{i,i}}\Big{)}^{-\frac{a_{0}+n}{2}}\pi_{\xi}(\xi) (245)
    for each ξ ∈ G, where [Q^T y]_i is the i-th entry of Q^T y ∈ ℝ^n.
    Normalize to obtain the probability mass function: pmf(\xi|\eta)=\frac{\widetilde{pmf}(\xi|\eta)}{\sum_{\xi^{\prime}\in G}\widetilde{pmf}(\xi^{\prime}|\eta)}.
    Sample ξpmf(|η)\xi\sim pmf(\cdot|\eta) on GG using probability inverse transform.
    return ξ.\xi.
    Algorithm 10 Perfect sampling of ξ|η\xi|\eta.
    Input: ν,mj:=ξβj2/(2σ2)>0\nu,m_{j}:=\xi\beta_{j}^{2}/(2\sigma^{2})>0.
    Sample WUniform(0,1)W^{*}\sim\mathrm{Uniform}(0,1)
    Set
    ηj=1νUν+12,1,mjν1(WUν+12,1,mjν())\eta_{j}=\frac{1}{\nu}U_{\frac{\nu+1}{2},1,\frac{m_{j}}{\nu}}^{-1}\Big{(}W^{*}U_{\frac{\nu+1}{2},1,\frac{m_{j}}{\nu}}(\infty)\Big{)} (246)
    where Ua,b,z(t):=1Γ(a)0txa1(1+x)ba1ezx𝑑xU_{a,b,z}(t):=\frac{1}{\Gamma(a)}\int_{0}^{t}x^{a-1}(1+x)^{b-a-1}e^{-zx}dx is defined as the lower incomplete confluent hypergeometric function of the second kind, such that Ua,b,z()=U(a,b,z)U_{a,b,z}(\infty)=U(a,b,z) gives the confluent hypergeometric function of the second kind.
    return ηj.\eta_{j}.
    Algorithm 11 Perfect sampling of ηj|βj,σ2,ξ\eta_{j}|\beta_{j},\sigma^{2},\xi.

    Appendix E Algorithm Derivations

    Gibbs sampler for Half-t(ν)t(\nu) priors (Algorithm 3).

    Derivation of the blocked Gibbs sampling algorithm (Algorithm 3) is given in Johndrow et al. [2020] for the Horseshoe prior. For Half-t(ν)t(\nu) priors, it remains to check the validity of the slice sampler in Algorithm 4. Working component-wise, let p(η|m)p(\eta|m) denote the conditional density of η\eta given m>0m>0. Then,

    p(η|m)\displaystyle p(\eta|m) ην12(1+νη)1+ν2emη𝟙(0,)(η)=u=0ην12emη𝟙(0,)(η)𝟙(0,(1+νη)1+ν2)(u)𝑑u.\displaystyle\propto\eta^{\frac{\nu-1}{2}}(1+\nu\eta)^{-\frac{1+\nu}{2}}e^{-m\eta}\mathbbm{1}_{(0,\infty)}(\eta)=\int_{u=0}^{\infty}\eta^{\frac{\nu-1}{2}}e^{-m\eta}\mathbbm{1}_{(0,\infty)}(\eta)\mathbbm{1}_{\big{(}0,(1+\nu\eta)^{-\frac{1+\nu}{2}}\big{)}}(u)du. (247)

    Let p¯(η,u|m)ην12emη𝟙(0,)(η)𝟙(0,(1+νη)1+ν2)(u)\bar{p}(\eta,u|m)\propto\eta^{\frac{\nu-1}{2}}e^{-m\eta}\mathbbm{1}_{(0,\infty)}(\eta)\mathbbm{1}_{\big{(}0,(1+\nu\eta)^{-\frac{1+\nu}{2}}\big{)}}(u) be the conditional density of the augmented random variable (η,u)(\eta,u) given mm, such that p(η|m)=u=0p¯(η,u|m)𝑑up(\eta|m)=\int_{u=0}^{\infty}\bar{p}(\eta,u|m)du. Then, the slice sampler in Algorithm 4 corresponds to a Gibbs sampler targeting p¯(η,u|m)\bar{p}(\eta,u|m):

    \displaystyle u\,|\,\eta,m \sim\mathrm{Uniform}\big{(}0,(1+\nu\eta)^{-\frac{1+\nu}{2}}\big{)}, (248)
    p¯(η|u,m)\displaystyle\bar{p}(\eta|u,m) ην12emη𝟙(0,u2/(1+ν)1ν)(η).\displaystyle\propto\eta^{\frac{\nu-1}{2}}e^{-m\eta}\mathbbm{1}_{\big{(}0,\frac{u^{-2/(1+\nu)}-1}{\nu}\big{)}}(\eta). (249)

    Note that \int_{0}^{x}\eta^{s-1}e^{-m\eta}d\eta=m^{-s}\Gamma(s)\gamma_{s}(mx), where \gamma_{s}(x):=\frac{1}{\Gamma(s)}\int_{0}^{x}t^{s-1}e^{-t}dt\in[0,1] is the regularized lower incomplete Gamma function. For T=(u^{-2/(1+\nu)}-1)/\nu and s=\frac{1+\nu}{2}, we can now obtain the cumulative distribution function of \bar{p}(\cdot|u,m), given by \int_{0}^{x}\bar{p}(\eta|u,m)d\eta=\gamma_{s}(mx)/\gamma_{s}(mT) for 0\leq x\leq T. By probability inverse transform, we can sample perfectly from \eta|u,m, by setting

    η:=1mγs1(γs(mT)U) for UUniform(0,1)\eta:=\frac{1}{m}\gamma_{s}^{-1}\big{(}\gamma_{s}(mT)U^{*}\big{)}\text{ for }U^{*}\sim\mathrm{Uniform}(0,1) (250)

    as in Step (2)(2) of Algorithm 4.
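The inverse-CDF step (250), equivalently (242) in Algorithm 4, can be implemented directly with the regularized lower incomplete Gamma function. The following Python sketch performs one full slice-sampling update of η_j given m_j; scipy.special.gammainc and gammaincinv compute γ_s and its inverse, and the function name slice_update is ours.

import numpy as np
from scipy.special import gammainc, gammaincinv

rng = np.random.default_rng(5)

def slice_update(eta, m, nu):
    # one iteration of Algorithm 4 targeting p(eta|m) for the Half-t(nu) prior
    s = (1.0 + nu) / 2.0
    u = rng.uniform(0.0, (1.0 + nu * eta) ** (-(nu + 1.0) / 2.0))   # step 1
    T = (u ** (-2.0 / (1.0 + nu)) - 1.0) / nu                       # slice end-point
    u_star = rng.uniform()
    return gammaincinv(s, u_star * gammainc(s, m * T)) / m          # step 2, Eq. (242)

eta = 1.0
for _ in range(10):
    eta = slice_update(eta, m=0.3, nu=2.0)
print(eta)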

    Perfect sampling of ξ|η\xi|\eta (Algorithm 10).

    In Algorithm 3, we have

    \displaystyle p(\xi|\eta)\propto|M_{\xi,\eta}|^{-1/2}\big{(}b_{0}+y^{T}M_{\xi,\eta}^{-1}y\big{)}^{-\frac{a_{0}+n}{2}}\pi_{\xi}(\xi) (251)

    for Mξ,η=In+ξ1XDiag(η)1XTM_{\xi,\eta}=I_{n}+\xi^{-1}X\text{Diag}(\eta)^{-1}X^{T} and πξ()\pi_{\xi}(\cdot) the prior density on ξ\xi. Consider the eigenvalue decomposition of XDiag(η)1XTX\text{Diag}(\eta)^{-1}X^{T}, such that XDiag(η)1XT=QΛQTX\text{Diag}(\eta)^{-1}X^{T}=Q\Lambda Q^{T} for QQ orthogonal and Λ\Lambda diagonal. Then,

    |Mξ,η|\displaystyle|M_{\xi,\eta}| =|Q(In+ξ1Λ)QT|=|In+ξ1Λ|=i=1n(1+ξ1Λi,i)\displaystyle=|Q(I_{n}+\xi^{-1}\Lambda)Q^{T}|=|I_{n}+\xi^{-1}\Lambda|=\prod_{i=1}^{n}\big{(}1+\xi^{-1}\Lambda_{i,i}\big{)} (252)
    yTMξ,η1y\displaystyle y^{T}M_{\xi,\eta}^{-1}y =yTQ(In+ξ1Λ)1QTy=i=1n[QTy]i2(1+ξ1Λi,i)1\displaystyle=y^{T}Q(I_{n}+\xi^{-1}\Lambda)^{-1}Q^{T}y=\sum_{i=1}^{n}[Q^{T}y]_{i}^{2}\big{(}1+\xi^{-1}\Lambda_{i,i}\big{)}^{-1} (253)

    This allows the discretized perfect sampling of ξ\xi given η\eta using probability inverse transform as in Algorithm 10, with 𝒪(n3)\mathcal{O}(n^{3}) computational cost arising from eigenvalue decomposition of XDiag(η)1XTX\text{Diag}(\eta)^{-1}X^{T}.
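A Python sketch of Algorithm 10: a single eigendecomposition of X Diag(η)^{-1} X^T followed by inverse-transform sampling of ξ on a grid. The toy data, the grid, and the prior density log_pi_xi are placeholders of ours, not part of the algorithm.

import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 100
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
eta = rng.gamma(1.0, 1.0, size=p)
a0, b0 = 1.0, 1.0
G = np.exp(np.linspace(np.log(1e-4), np.log(1e4), 2000))   # placeholder grid
log_pi_xi = lambda xi: -0.5 * np.log(xi) - np.log1p(xi)    # placeholder prior, log scale

# eigendecomposition X Diag(eta)^{-1} X^T = Q Lambda Q^T, done once at O(n^3) cost
lam, Q = np.linalg.eigh((X / eta) @ X.T)
z2 = (Q.T @ y) ** 2                                        # [Q^T y]_i^2

def log_pmf_unnorm(xi):
    d = 1.0 + lam / xi                                     # 1 + xi^{-1} Lambda_{i,i}
    return -0.5 * np.sum(np.log(d)) - 0.5 * (a0 + n) * np.log(b0 + np.sum(z2 / d)) + log_pi_xi(xi)

logw = np.array([log_pmf_unnorm(xi) for xi in G])
w = np.exp(logw - logw.max())
cdf = np.cumsum(w) / w.sum()
xi_sample = G[min(np.searchsorted(cdf, rng.uniform()), len(G) - 1)]   # inverse transform on G
print(xi_sample)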

    Perfect sampling of ηj|βj,σ2,ξ\eta_{j}|\beta_{j},\sigma^{2},\xi (Algorithm 11).

    Working component-wise, let

    p(η|m)ην12(1+νη)1+ν2emη\displaystyle p(\eta|m)\propto\eta^{\frac{\nu-1}{2}}(1+\nu\eta)^{-\frac{1+\nu}{2}}e^{-m\eta} (254)

    denote the conditional density of η(0,)\eta\in(0,\infty) given m>0m>0. We can calculate the corresponding cumulative density function (see proof of Proposition B.1), given by

    0Kp(η|m)𝑑η=Uν+12,1,mν(νK)U(ν+12,1,mν).\displaystyle\int_{0}^{K}p(\eta|m)d\eta=\frac{U_{\frac{\nu+1}{2},1,\frac{m}{\nu}}(\nu K)}{U(\frac{\nu+1}{2},1,\frac{m}{\nu})}. (255)

    where Ua,b,z(t):=1Γ(a)0txa1(1+x)ba1ezx𝑑xU_{a,b,z}(t):=\frac{1}{\Gamma(a)}\int_{0}^{t}x^{a-1}(1+x)^{b-a-1}e^{-zx}dx is defined as the lower incomplete confluent hypergeometric function of the second kind. Algorithm 11 and (246) directly follow from probability inverse transform.
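Since the lower incomplete confluent hypergeometric function is not available in standard numerical libraries, one way to implement Algorithm 11 is to evaluate the cumulative distribution function of (254) by quadrature and invert it by root-finding. The Python sketch below does this; the function name and the doubling strategy used to bracket the quantile are ours.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

rng = np.random.default_rng(7)

def sample_eta(m, nu):
    # perfect draw from p(eta|m) proportional to eta^{(nu-1)/2} (1+nu*eta)^{-(nu+1)/2} e^{-m*eta}, cf. (254)
    dens = lambda e: e ** ((nu - 1.0) / 2.0) * (1.0 + nu * e) ** (-(nu + 1.0) / 2.0) * np.exp(-m * e)
    Z = quad(dens, 0.0, np.inf)[0]
    cdf = lambda k: quad(dens, 0.0, k)[0] / Z
    w = rng.uniform()
    upper = 1.0
    while cdf(upper) < w:            # heuristic bracketing of the quantile
        upper *= 2.0
    return brentq(lambda k: cdf(k) - w, 0.0, upper)

print(sample_eta(m=0.3, nu=2.0))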