Coupling-based convergence assessment of some Gibbs samplers for high-dimensional Bayesian regression with shrinkage priors
Abstract
We consider Markov chain Monte Carlo (MCMC) algorithms for Bayesian high-dimensional regression with continuous shrinkage priors. A common challenge with these algorithms is the choice of the number of iterations to perform. This is critical when each iteration is expensive, as is the case when dealing with modern data sets, such as genome-wide association studies with thousands of rows and up to hundreds of thousands of columns. We develop coupling techniques tailored to the setting of high-dimensional regression with shrinkage priors, which enable practical, non-asymptotic diagnostics of convergence without relying on traceplots or long-run asymptotics. By establishing geometric drift and minorization conditions for the algorithm under consideration, we prove that the proposed couplings have finite expected meeting time. Focusing on a class of shrinkage priors which includes the “Horseshoe”, we empirically demonstrate the scalability of the proposed couplings. A highlight of our findings is that less than 1,000 iterations can be enough for a Gibbs sampler to reach stationarity in a regression on close to 100,000 covariates. The numerical results also illustrate the impact of the prior on the computational efficiency of the coupling, and suggest the use of priors where the local scales are Half-t distributed with degrees of freedom larger than one.
1 Introduction
1.1 Iterative computation in high dimensions
We consider the setting of high-dimensional regression where the number of observations n is smaller than the number of covariates p and the true signal is sparse. This problem formulation is ubiquitous in modern applications ranging from genomics to the social sciences. Optimization-based methods, such as the LASSO (Tibshirani, 1994) or the Elastic Net (Zou and Hastie, 2005), allow sparse point estimates to be obtained even when the number of covariates is on the order of hundreds of thousands. More specifically, iterative optimization procedures to obtain these estimates are practical because the following conditions are met: 1) the cost per iteration scales favorably with the size of the input (n and p), 2) the number of iterations to convergence also scales favorably, and 3) there are reliable stopping criteria to detect convergence.
In the Bayesian paradigm, Markov chain Monte Carlo (MCMC) methods are commonly used to sample from the posterior distribution. Default MCMC implementations, for example using general-purpose software such as Stan (Carpenter et al., 2017), can lead to high computational costs in high-dimensional settings, both per iteration and in terms of the number of iterations required to reach convergence. In the setting of high-dimensional regressions, tailored algorithms can provide substantial improvements on both fronts (e.g. Bhattacharya et al., 2016; Nishimura and Suchard, 2018; Johndrow et al., 2020; Bhattacharya and Johndrow, 2021). Comparatively less attention has been paid to the design of reliable stopping criteria. Stopping criteria for MCMC, such as the univariate effective sample size (ESS, e.g. Vats et al., 2019) or the R̂ convergence diagnostic (e.g. Vats and Knudson, 2020), rely on the asymptotic behavior of the chains as time goes to infinity, and thus effectively ignore the non-asymptotic “burn-in” bias, which vanishes as time progresses. This is acceptable in situations where we have solid a priori estimates of the burn-in period; otherwise the lack of non-asymptotic stopping criteria poses an important practical problem. With Bayesian analysis of high-dimensional data sets, MCMC algorithms can require seconds or minutes per iteration, and thus any non-asymptotic insight on the number of iterations to perform is highly valuable.
Diagnostics of convergence are particularly useful when considering the number of factors that affect the performance of MCMC algorithms for Bayesian regressions. Beyond the size of the data set, the specification of the prior and the (much less explicit) signal-to-noise ratio in the data all have an impact on the convergence of MCMC algorithms. Across applications, the number of iterations required for the chain to converge varies by multiple orders of magnitude. This could lead users to either waste computational resources by running overly long chains, or, worrisomely, to base their analysis on chains that have not converged. This manuscript proposes a concrete method to avoid these pitfalls in the case of a Gibbs sampler for Bayesian regression with heavy-tailed shrinkage priors. Specifically, we follow the approach of Glynn and Rhee (2014); Jacob et al. (2020); Biswas et al. (2019) and use coupled lagged chains to monitor convergence.
1.2 Bayesian shrinkage regression with Half-t priors
Consider Gaussian linear regression with n observations and p covariates, with a focus on the high-dimensional setting where p > n. The likelihood is given by
p(y | β, σ²) = (2πσ²)^(−n/2) exp( −‖y − Xβ‖² / (2σ²) ),    (1)
where ‖·‖ denotes the Euclidean norm, X ∈ R^(n×p) is the observed design matrix, y ∈ R^n is the observed response vector, σ² > 0 is the unknown Gaussian noise variance, and β ∈ R^p is the unknown signal vector that is assumed to be sparse. We study hierarchical Gaussian scale-mixture priors on β given by
β_j | η_j, ξ, σ² ~ N(0, σ² / (ξ η_j)) independently for j = 1, …, p,    η_j ~ π_η,    ξ ~ π_ξ,    σ² ~ π_{σ²},    (2)
where π_η, π_ξ and π_{σ²} are continuous densities on (0, ∞). Such global-local mixture priors induce approximate sparsity, where the components of β can be arbitrarily close to zero, but never exactly zero. This is in contrast to point-mass mixture priors (e.g. Johnson and Rossell, 2012), where some components of β can be exactly zero a posteriori. The global precision parameter ξ relates to the number of signals, and we keep the same prior π_ξ on it throughout. The local precision parameters η_1, …, η_p determine which components of β are null.
We focus on the Half-t prior family for the local scale parameter, η_j^(−1/2) ~ Half-t(ν) for some degree of freedom ν, where a Half-t(ν) distribution has density on (0, ∞) proportional to x ↦ (1 + x²/ν)^(−(ν+1)/2). The induced prior on the local precision η_j has density
π_η(η) ∝ η^(−3/2) (1 + 1/(ν η))^(−(ν+1)/2),    η > 0.    (3)
The case ν = 1 corresponds to the Horseshoe prior (Carvalho et al., 2009, 2010) and has received overwhelming attention in the literature among this prior class; see Bhadra et al. (2019) for a recent overview. The use of Half-t priors on scale parameters in hierarchical Gaussian models was popularized by Gelman (2006). However, the subsequent literature has largely gravitated towards the default choice (Polson and Scott, 2012) of ν = 1, in part because a convincing argument for preferring a different value of ν has been absent to date. Here we give empirical evidence that Half-t priors with ν > 1 yield statistical estimation properties similar to the Horseshoe prior, whilst leading to significant computational gains. We note that Ghosh and Chakrabarti (2017) established optimal posterior concentration in the Normal means problem for a broad class of priors on the local scale that includes the Half-t priors, extending earlier work by van der Pas et al. (2014) for the Horseshoe, providing theoretical support for the comparable performance we observe across values of ν in simulations.
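To make the parametrization concrete, the following R snippet draws local scales from a Half-t(ν) prior and transforms them into local precisions as in (3); this is a minimal illustration of the prior specification, not part of the samplers described below.

```r
# Draw p local precisions eta_j when the local scales follow a Half-t(nu)
# prior: lambda_j = |T_j| with T_j ~ t_nu, and eta_j = lambda_j^(-2) as in (3).
sample_local_precisions <- function(p, nu) {
  lambda <- abs(rt(p, df = nu))   # Half-t(nu) local scales
  lambda^(-2)                     # induced local precisions
}

# Example: the Horseshoe corresponds to nu = 1
# set.seed(1)
# eta <- sample_local_precisions(1e4, nu = 1)
```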
Our stopping criterion relies on couplings. Couplings have long been used to analyze the convergence of MCMC algorithms, while methodological implementations have been rare. The work of Johnson (1996) pioneered the use of couplings for practical diagnostics of convergence, but relied on the assumption of some similarity between the initial distribution and the target, which can be hard to check. Here we instead follow the approach of Glynn and Rhee (2014); Jacob et al. (2020) based on couplings of lagged chains, applied to the question of convergence in Biswas et al. (2019). That approach makes no assumptions on the closeness of the initial distribution to the target, and in our experiments we initialize all chains independently from the prior distribution.
The approach requires the ability to sample pairs of Markov chains such that 1) each chain marginally follows a prescribed MCMC algorithm and 2) the two chains “meet” after a random –but almost surely finite– number of iterations, called the meeting time. When the chains meet, i.e. when all of their components become identical, the user can stop the procedure. From the output the user can obtain unbiased estimates of expectations with respect to the target distribution, which can be generated in parallel and averaged, as well as estimates of the finite-time bias of the underlying MCMC algorithm. Crucially, this information is retrieved from the distribution of the meeting time, which is an integer-valued random variable irrespective of the dimension of the problem, and thus provides a convenient summary of the performance of the algorithm. A difficulty inherent to the approach is the design of an effective coupling strategy for the algorithm under consideration. High-dimensional parameter spaces add a substantial layer of complication, as some of the simpler coupling strategies that are solely based on maximal couplings lead to prohibitively long meeting times.
1.3 Our contribution
In this paper we argue that couplings offer a practical way of assessing the number of iterations required by MCMC algorithms in high-dimensional regression settings. Our specific contributions are summarized below.
We introduce blocked Gibbs samplers (Section 2) for Bayesian linear regression focusing on Half-t local shrinkage priors, extending the algorithm of Johndrow et al. (2020) for the Horseshoe. Our algorithm has an overall computation cost of O(n²p) per iteration, which is state-of-the-art when p is larger than n. We then design a one-scale coupling strategy for these Gibbs samplers (Section 2.3) and show that it results in meeting times that have exponential tails, which in turn validates their use in convergence diagnostics (Biswas et al., 2019). Our proof of exponentially-tailed meeting times is based on identifying a geometric drift condition, as a corollary to which we also establish geometric ergodicity of the marginal Gibbs sampler. Despite a decade of widespread use, MCMC algorithms for the Horseshoe and Half-t priors generally have not previously been shown to be geometrically ergodic, in contrast to MCMC algorithms for global-local models with exponentially-tailed local scale priors (Khare and Hobert, 2013; Pal and Khare, 2014). Our proofs utilize a uniform bound on generalized ridge regression estimates, which enables us to avoid a technical assumption of Johndrow et al. (2020), and auxiliary results on moments of distributions related to the confluent hypergeometric function of the second kind. Shortly after an earlier version of this manuscript was posted, Bhattacharya et al. (2021) also established geometric ergodicity of related Gibbs samplers for the Horseshoe prior.
The meeting times resulting from the one-scale coupling increase exponentially as the dimension increases. To address that issue we design improved coupling strategies (Section 3). We develop a two-scale strategy, based on a carefully chosen metric, which informs the choice between synchronous and maximal couplings. In numerical experiments in high dimensions this strategy leads to orders of magnitude shorter meeting times compared to the naïve approach. We establish some preliminary results in a stylized setting, providing a partial understanding of why the two-scale strategy leads to the observed drastic reduction in meeting times. We also describe other couplings that do not require explicit calculation of a metric between the chains, have exponentially-tailed meeting times, and empirically result in similar meeting times as the two-scale coupling. For our two-scale coupling, we further note that priors with stronger shrinkage towards zero can give significantly shorter meeting times in high dimensions. This motivates the use of Half-t priors with greater degrees of freedom ν, which can give similar statistical performance to the Horseshoe (ν = 1) whilst enjoying orders of magnitude computational improvements.
Finally, we demonstrate (Section 4) that the proposed coupled MCMC algorithm is applicable to big data settings. Figure 1 displays an application of our coupling strategy to monitor the convergence of the Gibbs sampler for Bayesian regression on a genome-wide association study (GWAS) dataset with close to 100,000 covariates. Our method suggests that a burn-in of fewer than 1,000 iterations can suffice even for such a large problem, and confirms the applicability of coupled MCMC algorithms for practical high-dimensional inference.
[Figure 1: coupling-based convergence monitoring of the Gibbs sampler for the GWAS application of Section 4.]
Scripts in R (R Core Team, 2013) are available from https://github.com/niloyb/CoupledHalfT to reproduce the figures of the article.
2 Coupled MCMC for regression with Half-t priors
We develop MCMC techniques for Bayesian shrinkage regression with Half-t priors. The model is given by (1)–(2), with the Half-t prior on the local precisions for some degree of freedom ν as in (3). Posterior distributions resulting from such heavy-tailed shrinkage priors have desirable statistical properties, but their features pose challenges to general-purpose MCMC algorithms. Characterizations of the marginal prior and posterior densities of the components of β given the global precision ξ and the noise variance σ², and further comments on these challenges, are presented in Appendix A. Specifically, the resulting posterior distributions present 1) multimodality, 2) heavy tails and 3) infinite density at zero. This hints at a trade-off between statistical accuracy and computational feasibility, since the very features that present computational difficulties are crucial for optimal statistical performance across sparsity levels and signal strengths.
2.1 A blocked Gibbs sampler for Half-t priors
Blocked Gibbs samplers are popularly used for Bayesian regression with global-local shrinkage priors, due to the analytical tractability of full conditional distributions (Park and Casella, 2008; Carvalho et al., 2010; Polson et al., 2014; Bhattacharya et al., 2015; Makalic and Schmidt, 2016). Several convenient blocking and marginalization strategies are possible, leading to conditionals that are easy to sample from. For the case of the Horseshoe prior (ν = 1), Johndrow et al. (2020) have developed exact and approximate MCMC algorithms for high-dimensional settings, building on the algorithms of Polson et al. (2014) and Bhattacharya et al. (2016). We extend that sampler to general Half-t priors, and summarize it in (4). Given some initial state (β₀, η₀, ξ₀, σ₀²), it generates a Markov chain (β_t, η_t, ξ_t, σ_t²)_{t≥1} which targets the posterior corresponding to Half-t priors.
η_t ~ p(η | β_{t−1}, ξ_{t−1}, σ²_{t−1}),
ξ_t ~ p(ξ | η_t, y),
σ²_t ~ p(σ² | ξ_t, η_t, y),
β_t ~ p(β | η_t, ξ_t, σ²_t, y).    (4)
Above, L(y | ξ, η) represents the marginal likelihood of y given ξ and η (with β and σ² integrated out), and we use the notation D_η = diag(η_1, …, η_p) and M_{ξ,η} = I_n + ξ^(−1) X D_η^(−1) Xᵀ. We sample η component-wise independently using slice sampling and sample ξ using a Metropolis–Rosenbluth–Teller–Hastings step with a fixed proposal step-size (Hastings, 1970). Full details of our sampler are in Algorithms 3 and 4 of Appendix D, and derivations are in Appendix E.
Sampling η component-wise independently is an O(p) cost operation. Sampling ξ requires evaluating L(y | ξ, η), which involves an O(n²p) cost operation (calculating the weighted matrix cross-product X D_η^(−1) Xᵀ) and an O(n³) cost operation (calculating the determinant and inverse-related terms of M_{ξ,η} using a Cholesky decomposition). Sampling σ² is then cheap, as the relevant terms involving M_{ξ,η}^(−1) y are pre-calculated. Similarly, as the Cholesky factor of M_{ξ,η} is pre-calculated, sampling β using the algorithm of Bhattacharya et al. (2016) for structured multivariate Gaussians only involves an O(np) cost operation (matrix–vector products with X) and an O(n²) cost operation (triangular solves). In high-dimensional settings with p > n, which is our focus, the weighted matrix cross-product calculation is the most expensive step, and overall the sampler described in (4) has an O(n²p) computation cost.
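As an illustration of the β update, the following R sketch implements the exact sampler of Bhattacharya, Chakraborty and Mallick (2016) for the Gaussian full conditional of β; the prior variances σ²/(ξ η_j) follow our reading of the model (1)–(2), and the n×n matrix is formed from scratch here, whereas the blocked Gibbs sampler would reuse quantities already computed during the ξ update.

```r
# Draw from the full conditional of beta, following Bhattacharya, Chakraborty
# and Mallick (2016):
#   beta | eta, xi, sigma2, y ~ N( (X'X + xi*D)^(-1) X'y,
#                                  sigma2 * (X'X + xi*D)^(-1) ),   D = diag(eta).
sample_beta <- function(X, y, eta, xi, sigma2) {
  n <- nrow(X); p <- ncol(X)
  d <- sigma2 / (xi * eta)            # prior variances of beta_1, ..., beta_p
  Phi <- X / sqrt(sigma2)             # standardized design matrix
  alpha <- y / sqrt(sigma2)           # standardized response
  u <- rnorm(p, mean = 0, sd = sqrt(d))
  delta <- rnorm(n)
  v <- as.vector(Phi %*% u) + delta
  # n x n system; in the Gibbs sampler this matrix (or its Cholesky factor)
  # is already available from the xi update
  M <- Phi %*% (d * t(Phi)) + diag(n)
  w <- solve(M, alpha - v)
  as.vector(u + d * as.vector(t(Phi) %*% w))
}
```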
2.2 Coupling the proposed Gibbs sampler
We develop couplings for the blocked Gibbs sampler presented in Section 2.1. We will use these couplings to generate coupled chains with a time lag L ≥ 1, in order to implement the diagnostics of convergence proposed in Biswas et al. (2019). Consider an L-lag coupled chain (Z_t, Z̃_{t−L})_{t≥L} with meeting time τ^(L) := inf{t ≥ L : Z_t = Z̃_{t−L}}. Here Z_t and Z̃_t denote the full states (β_t, η_t, ξ_t, σ_t²) and (β̃_t, η̃_t, ξ̃_t, σ̃_t²) respectively. For the coupling-based methods proposed in Biswas et al. (2019); Jacob et al. (2020) to be most practical, the meeting times should occur as early as possible, under the constraint that both marginal chains (Z_t) and (Z̃_t) start from the same initial distribution, and that they both marginally evolve according to the blocked Gibbs sampler in (4).
Briefly, we recall how L-lag coupled chains enable the estimation of upper bounds on the total variation distance d_TV(π_t, π) between the law π_t of Z_t and the stationary distribution π. Since Z_t = Z̃_{t−L} for all t ≥ τ^(L), the indicator 1(τ^(L) > t + jL) can be evaluated for all j ≥ 1 as soon as we observe τ^(L). Such indicators are estimators of the probabilities P(Z_{t+jL} ≠ Z̃_{t+(j−1)L}), which themselves are upper bounds on d_TV(π_{t+jL}, π_{t+(j−1)L}), since Z_{t+jL} ~ π_{t+jL} and Z̃_{t+(j−1)L} ~ π_{t+(j−1)L}. We can then obtain upper bounds on d_TV(π_t, π) by the triangle inequality: d_TV(π_t, π) ≤ Σ_{j≥1} d_TV(π_{t+(j−1)L}, π_{t+jL}) ≤ E[max(0, ⌈(τ^(L) − L − t)/L⌉)]. This can be justified more formally if the meeting times are exponentially-tailed (Biswas et al., 2019). Such a tail condition also ensures that the expected computation time of unbiased estimators as in Jacob et al. (2020), generated independently in parallel, scales at a logarithmic rate in the number of processors. Thus we will be particularly interested in couplings that result in exponentially-tailed meeting times.
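For reference, a minimal R sketch of the resulting estimator is given below; it only assumes that independent L-lag meeting times have been collected into a vector.

```r
# Estimate of the upper bound on d_TV(pi_t, pi) from independent L-lag meeting
# times, following Biswas, Jacob and Vanetti (2019):
#   d_TV(pi_t, pi) <= E[ max(0, ceiling((tau - L - t) / L)) ].
tv_upper_bound <- function(meeting_times, L, t) {
  mean(pmax(0, ceiling((meeting_times - L - t) / L)))
}

# Example usage: bound as a function of the iteration t, for a lag of 500
# ts <- 0:1500
# bounds <- sapply(ts, function(t) tv_upper_bound(meeting_times, L = 500, t = t))
```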
In Section 2.3, we consider an algorithm based on maximal couplings. We show that the ensuing meeting times have exponential tails, and discuss how our analysis directly implies that the marginal chain introduced in Section 2.1 is geometrically ergodic. We then illustrate difficulties encountered by such a scheme in high dimensions, which will motivate the alternative strategies described in Section 3. For simplicity, we mostly omit the lag L from the notation.
2.3 One-scale coupling
For the blocked Gibbs sampler in Section 2.1, we first consider a coupled MCMC algorithm that attempts exact meetings at every step. We will apply a maximal coupling algorithm with independent residuals (see Thorisson (2000, Chapter 1 Section 4.4) and Johnson (1998)). It is included in Algorithm 5 of Appendix D, and has an expected computation cost of two units (Johnson, 1998).
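The following R sketch shows one standard way to implement such a maximal coupling for two univariate distributions, given functions to sample from and evaluate their densities; Algorithm 5 in Appendix D may differ in its details, so this is a generic illustration rather than the exact routine used in our scripts.

```r
# Maximal coupling of two univariate distributions p and q with independent
# residuals (Thorisson, 2000; Johnson, 1998). rp/rq draw one sample from p/q;
# dp/dq evaluate the densities. Returns a pair (x, y) with x ~ p, y ~ q and
# P(x = y) equal to one minus the total variation distance between p and q.
maximal_coupling <- function(rp, dp, rq, dq) {
  x <- rp()
  if (runif(1) * dp(x) <= dq(x)) {
    return(list(x = x, y = x, met = TRUE))
  }
  repeat {  # sample the residual of q, independently of x
    y <- rq()
    if (runif(1) * dq(y) > dp(y)) {
      return(list(x = x, y = y, met = FALSE))
    }
  }
}

# Example: maximally couple N(0, 1) and N(1, 1)
# maximal_coupling(function() rnorm(1), dnorm,
#                  function() rnorm(1, mean = 1), function(z) dnorm(z, mean = 1))
```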
Our initial coupled MCMC kernel is given in Algorithm 1, which we refer to as a one-scale coupling. That is, before the chains have met, the coupled kernel in Steps 1 and 2 does not explicitly depend on the distance between the states of the two chains. After meeting, the coupled chains remain together by construction, so that once the chains meet they stay equal at all subsequent iterations. Before meeting, Step 1 uses the coupled slice sampler of Algorithm 2 component-wise for j = 1, …, p, which allows each pair of components (η_j, η̃_j) to meet exactly with positive probability. In Algorithm 2, we use a common random numbers (or “synchronous”) coupling of the auxiliary random variables of the slice sampler, and alternative couplings could be considered. Steps 2(a) and 2(b) are maximal couplings of the conditional sampling steps for ξ and σ², such that (ξ, σ²) = (ξ̃, σ̃²) occurs with positive probability for all current states. Step 2(c) uses common random numbers, such that (η, ξ, σ²) = (η̃, ξ̃, σ̃²) implies β = β̃. This allows the full chains to meet exactly with positive probability at every step.
Algorithm 1 (one-scale coupled kernel). Given the current pair of states (β, η, ξ, σ²) and (β̃, η̃, ξ̃, σ̃²):
1. Sample (η, η̃) component-wise independently using Algorithm 2.
2. Sample ((ξ, σ², β), (ξ̃, σ̃², β̃)) given (η, η̃) and y as follows:
(a) sample (ξ, ξ̃) using a maximal coupling of the two conditional updates of the global precision;
(b) sample (σ², σ̃²) using a maximal coupling of the two full conditionals;
(c) sample (β, β̃) from their full conditionals using common random numbers.
Under the one-scale coupling, we prove that the meeting times have exponential tails and hence finite expectation. Our analysis is based on identifying a suitable drift function for a variant of the marginal chain in Section 2.1 and an application of Jacob et al. (2020, Proposition 4). We assume that the global shrinkage prior π_ξ has a compact support on (0, ∞). Such compactly supported priors for the global parameter have been recommended by van der Pas et al. (2017) to achieve optimal posterior concentration for the Horseshoe in the Normal means model. For simplicity, we also assume that each component η_j and the global precision ξ in (4) are sampled perfectly, instead of using slice sampling and Metropolis–Rosenbluth–Teller–Hastings steps respectively. Perfect sampling algorithms for each component η_j and for ξ are provided in Algorithms 10 and 11 of Appendix D for completeness; see also Bhattacharya and Johndrow (2021) for perfect sampling of the components of η for the Horseshoe. Perfectly sampling η and ξ retains the O(n²p) computational cost of the full blocked Gibbs sampler, though in practice this implementation is more expensive per iteration due to the computation of eigenvalue decompositions in lieu of Cholesky decompositions, and the computation of inverses of the confluent hypergeometric function of the second kind.
Proposition 2.1 gives a geometric drift condition for this variant of the blocked Gibbs sampler.
Proposition 2.1.
(Geometric drift). Consider the Markov chain (β_t, η_t, ξ_t, σ_t²)_{t≥0} generated from the blocked Gibbs sampler in (4), where each component η_j and the global precision ξ are sampled perfectly from their respective conditionals. Assume the global shrinkage prior π_ξ has a compact support. For suitable positive constants, define the function V on the state space by
(5)
Then there exist γ ∈ (0, 1) and K < ∞ such that, for all states z = (β, η, ξ, σ²),
E[ V(Z_{t+1}) | Z_t = z ] ≤ γ V(z) + K.    (6)
The drift (or “Lyapunov”) function V in (5) approaches infinity when any local precision η_j approaches the origin or infinity. This ensures that the corresponding sub-level sets {z : V(z) ≤ R}, for R > 0, exclude the pole at the origin and are bounded. Such a geometric drift condition, in conjunction with Jacob et al. (2020, Proposition 4), helps verify that the meeting times under the one-scale coupling have exponential tails and hence finite expectation.
Proposition 2.2.
Consider the blocked Gibbs sampler in (4). In the setup of Proposition 2.1, assume that the global shrinkage prior π_ξ has a compact support. Write Z_t = (β_t, η_t, ξ_t, σ_t²) and Z̃_t = (β̃_t, η̃_t, ξ̃_t, σ̃_t²). Consider the one-scale coupling of Algorithm 1, where (η, η̃) are maximally coupled component-wise, (ξ, ξ̃) and (σ², σ̃²) are maximally coupled, and (β, β̃) are coupled using common random numbers. Denote the meeting time by τ. Then there exist constants C < ∞ and δ ∈ (0, 1) such that P(τ > t) ≤ C δ^t for all t ≥ 0.
Exponentially-tailed meeting times immediately imply that the blocked Gibbs sampler in Section 2.1 is geometrically ergodic. This follows from noting that Jacob et al. (2020, Proposition 4) is closely linked to a minorization condition (Rosenthal, 1995), and details are included in Proposition C.10 of Appendix C. For a class of Gibbs samplers targeting shrinkage priors including the Bayesian LASSO, the Normal-Gamma prior (Griffin and Brown, 2010), and the Dirichlet–Laplace prior (Bhattacharya et al., 2015), Khare and Hobert (2013) and Pal and Khare (2014) have proven geometric ergodicity based on drift and minorization arguments (Meyn and Tweedie, 1993; Roberts and Rosenthal, 2004). For the Horseshoe prior, Johndrow et al. (2020) have recently established geometric ergodicity. In the high-dimensional setting with p > n for the Horseshoe, the proof of Johndrow et al. (2020) required truncating the prior on each local precision below by a small positive constant, to guarantee a corresponding uniform bound. Our proof of geometric ergodicity for Half-t priors including the Horseshoe (ν = 1) works in both low and high-dimensional settings, without any modification of the priors on the η_j. In parallel work, Bhattacharya et al. (2021) have also established geometric ergodicity for the Horseshoe prior without requiring such truncation.
Remark 2.3.
Our proofs of exponentially-tailed meeting times and geometric ergodicity generalize to a larger class of priors satisfying some moment conditions on the full conditionals of the local precisions. Consider a compactly supported prior on ξ, and a prior with density π_η on each η_j. Then the unnormalized density of the full conditional of each η_j is given by η ↦ π_η(η) η^(1/2) exp(−η ξ β_j² / (2σ²)). Consider the following assumptions on π_η.
1. For some exponent, there exist finite constants such that the corresponding positive moments of these full conditionals are bounded uniformly over the conditioning variables.
2. For some exponent, there exist finite constants such that the corresponding negative moments of these full conditionals are bounded uniformly over the conditioning variables.
Then the blocked Gibbs sampler in (4) is geometrically ergodic.
We now consider the performance of Algorithm 1 on synthetic datasets. In particular, we empirically illustrate that the rate δ in Proposition 2.2 can tend to 1 exponentially fast as the dimension increases. The design matrix X ∈ R^(n×p) is generated with independent and identically distributed entries, and a response vector is generated as y = Xβ* + ε with Gaussian noise ε, where β* is the true signal and σ* is the true error standard deviation. We choose β* to be sparse: given a sparsity parameter s, the components β*_j are non-zero for j ≤ s and β*_j = 0 for all j > s. Figure 2 shows the meeting times of coupled Markov chains (with a lag L) targeting the Horseshoe posterior (ν = 1) under Algorithm 1. We fix the number of observations n and the sparsity s, and vary the dimension p. For each dimension p, we simulate meeting times under the one-scale coupling for independently generated synthetic datasets. We initialize chains independently from the prior distribution. For the ξ updates, we use a proposal step-size as in Johndrow et al. (2020). Figure 2, with a vertical axis on a logarithmic scale, shows that the meeting times increase exponentially and have heavier tails with increasing dimension.
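For reference, a sketch of such a synthetic-data generator in R is given below; the entry distribution of the design, the non-zero signal magnitudes and the noise level are illustrative placeholders, not necessarily the values used for Figure 2.

```r
# Generate a synthetic regression dataset with a sparse signal.
# The specific entry distribution, signal magnitudes and noise level below are
# placeholders for illustration.
generate_dataset <- function(n, p, s, signal = 4, sigma_star = 1) {
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)   # i.i.d. design entries
  beta_star <- c(rep(signal, s), rep(0, p - s))   # s non-zero coefficients
  y <- as.vector(X %*% beta_star) + sigma_star * rnorm(n)
  list(X = X, y = y, beta_star = beta_star)
}

# Example: n = 100 observations, p = 1000 covariates, s = 10 signals
# dat <- generate_dataset(n = 100, p = 1000, s = 10)
```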
[Figure 2: meeting times of the one-scale coupling against the dimension p, with the vertical axis on a logarithmic scale.]
3 Coupling strategies for high-dimensional settings
To address the scaling issues observed with the one-scale coupling in Section 2.3, we develop a two-scale coupling in Section 3.1 that delivers better empirical performance. We also offer some formal results supporting the empirical improvements. In Section 3.2 we develop other related coupling strategies which have exponentially-tailed meeting times and similar empirical performance to the two-scale coupling. In Section 3.3, we investigate the impact of the degree of freedom ν of Half-t priors on our two-scale coupling. We find that Half-t priors with higher degrees of freedom can give similar statistical performance to the Horseshoe (ν = 1), whilst enjoying orders of magnitude computational improvements through shorter meeting times.
3.1 A two-scale coupling
We develop a two-scale coupling algorithm; our usage of this term slightly differs from the terminology used in stochastic dynamics and in the analysis of pre-conditioned HMC (Bou-Rabee and Eberle, 2020). It is also closely linked to the delayed one-shot coupling of Roberts and Rosenthal (2002), and refers to coupled kernels that attempt exact meetings when the chains are close, and aim for a contraction when the chains are far. The motivation for this construction is that in Algorithm 1, the components of (η, η̃) that fail to meet instead evolve independently. As the dimension grows, the number of components of (η, η̃) that fail to meet also tends to grow, making it increasingly difficult for all components to meet exactly on subsequent iterations. In the two-scale coupling, we only attempt to obtain exact meetings when the associated probability is large enough. This is done by constructing a metric d and a corresponding threshold parameter d₀ such that when the current states are close, with d ≤ d₀, we apply Algorithm 2 and try to obtain exact meetings. When d > d₀, we instead employ common random numbers to sample (η, η̃).
Our chosen metric between the states of the two chains is
d(Z, Z̃) = 1 − P(η = η̃ | Z, Z̃),    (7)
where (η, η̃) is sampled using the coupled slice sampler in Algorithm 2 given the current states Z and Z̃. As the coupling in Algorithm 2 is independent component-wise, we can obtain a simplified expression involving only univariate probabilities as
d(Z, Z̃) = 1 − ∏_{j=1}^{p} P(η_j = η̃_j | Z, Z̃).    (8)
Under this metric and a threshold d₀ ∈ [0, 1], an exact meeting is attempted when d(Z, Z̃) ≤ d₀. Once η and η̃ coincide, and if the scalars (ξ, σ²) and (ξ̃, σ̃²) also subsequently coincide, then the entire chains meet. When d₀ = 1, the two-scale coupling reduces to the one-scale coupling of Algorithm 1 since the maximally coupled slice sampler (Algorithm 2) is invoked at each step. On the other hand, when the threshold d₀ = 0, we always apply the common random numbers coupling. Then the chains may come arbitrarily close but will never exactly meet. Empirically we find that different threshold values away from 0 and 1 give similar meeting times. Simulations concerning these choices can be found in Appendix B.2.
We now discuss the computation of the metric d. Since the probability in (7) is unavailable in closed form, we can resort to various approximations. One option is to evaluate the one-dimensional integrals in (8) using numerical integration. Alternatively, one can form the Monte Carlo based estimate
d̂(Z, Z̃) = 1 − (1/R) ∑_{r=1}^{R} ∏_{j=1}^{p} 1(η_j^{(r)} = η̃_j^{(r)}),    (9)
where the pairs (η^{(r)}, η̃^{(r)}) are sampled independently for r = 1, …, R using Algorithm 2. We recommend a Rao–Blackwellized estimate, which combines Monte Carlo with analytical calculations of integrals, as
d̂_RB(Z, Z̃) = 1 − (1/R) ∑_{r=1}^{R} ∏_{j=1}^{p} p_j^{(r)},    (10)
where the auxiliary random variables of Algorithm 2 are sampled independently for r = 1, …, R, and each conditional meeting probability p_j^{(r)} is calculated analytically from Algorithm 2. Further metric calculation details are in Appendix B.3. For a number of samples R, estimates (9) and (10) both have a computation cost of order O(Rp). Compared to the estimate in (9), the estimate in (10) has lower variance and faster numerical run-times, as it only involves sampling uniformly distributed random numbers. It suffices to choose a small number of samples R. Indeed it appears unnecessary to estimate d(Z, Z̃) accurately, as we are only interested in comparing that distance to a fixed threshold d₀. Often the trajectories of the distance estimates initially take values close to 1, and then sharply drop to values close to 0. This leads to the estimates in (9) and (10) having low variance. Henceforth, we will use the estimate in (10) in our experiments with two-scale couplings, unless specified otherwise.
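In code, the Monte Carlo estimate (9) and the threshold decision can be organized as in the sketch below; `coupled_eta_sampler` and `crn_eta_sampler` are hypothetical placeholders for the maximally coupled slice sampler of Algorithm 2 and its common random numbers counterpart, which are detailed in the appendices and the accompanying scripts.

```r
# Two-scale update of the local precisions (sketch). The functions
# coupled_eta_sampler and crn_eta_sampler are placeholders: both take the two
# current states and return a list with vectors eta1 and eta2.
two_scale_eta_update <- function(state1, state2, coupled_eta_sampler,
                                 crn_eta_sampler, d0, R = 1) {
  # Monte Carlo estimate of the metric in (9): one minus the proportion of
  # replicates in which all p components meet exactly.
  all_met <- replicate(R, {
    draw <- coupled_eta_sampler(state1, state2)
    all(draw$eta1 == draw$eta2)
  })
  d_hat <- 1 - mean(all_met)
  if (d_hat <= d0) {
    coupled_eta_sampler(state1, state2)   # close: attempt exact meetings
  } else {
    crn_eta_sampler(state1, state2)       # far: contract with common random numbers
  }
}
```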
Overall the two-scale coupling kernel has twice the cost of the single chain kernel in (4). Before the chains have met, we calculate a distance estimate and sample from a coupled slice sampling kernel. Calculating the distance estimate involves sampling uniform random variables and has O(Rp) cost. Coupled slice sampling with maximal couplings (Algorithm 2) or with common random numbers both have expected or deterministic cost of order p. The remaining steps of the two-scale coupling match those of the one-scale coupling in Algorithm 1.
We now consider the performance of the two-scale coupling on synthetic datasets. The setup is identical to that introduced in Section 2.3, where for each dimension p we simulate meeting times based on independently generated synthetic datasets. We target the Horseshoe posterior (ν = 1). Figure 3 shows the meeting times of coupled Markov chains (with a lag L), for both the one-scale coupling and the two-scale coupling with a fixed threshold d₀. Figure 3 shows that the two-scale coupling can lead to orders of magnitude reductions in meeting times, compared to the one-scale coupling.
[Figure 3: meeting times of the one-scale and two-scale couplings against the dimension p.]
We next present some preliminary results which hint as to why the two-scale coupling leads to the drastic reduction in meeting times. We focus on the coupling of the η full conditionals, as the one-scale and two-scale algorithms differ exactly at this step. First, we show that for any monotone function f and all components j, the expected distance between f(η_j) and f(η̃_j) is the same under both common random numbers and maximal coupling.
Proposition 3.1.
Consider the setup of Proposition 2.2, such that the full conditionals are sampled perfectly. Then for any monotone function f and all components j = 1, …, p,
E_max[ |f(η_j) − f(η̃_j)| ] = E_CRN[ |f(η_j) − f(η̃_j)| ],    (11)
where E_max and E_CRN correspond to expectations under the maximal coupling and the common random numbers coupling, respectively, of (η_j, η̃_j) given the current states Z and Z̃.
Proposition 3.1 implies that such single-step expectations alone do not distinguish the behavior of CRN from maximal couplings. This compels us to investigate other distributional features of (η_j, η̃_j) under either coupling to possibly tease apart differences in their behavior on high probability events. Focusing on the Horseshoe prior and making a particular choice of the function f, we uncover a distinction between the tail behavior of the two couplings which can substantially accumulate over independent coordinates. We offer an illustration in a stylized setting below.
Proposition 3.2.
For the Horseshoe prior (ν = 1), consider the stylized case where the components of the two chains take common values: all components of β share one positive value and all components of β̃ share another. Then, under a CRN coupling, the discrepancy between η_j and η̃_j satisfies an almost sure bound of the form
(12)
for all components j. Under the maximal coupling, in contrast, for any deviation level and any dimension p large enough, there exists a constant which does not depend on p such that at least one component exhibits a larger deviation with probability bounded below as in
(13)
Extensions of Proposition 3.2 which allow different components of β and β̃ to take different values, under some limiting assumptions, are omitted here for simplicity. This stylized setting already captures why the one-scale algorithm may result in much larger meeting times in high dimensions compared to the CRN-based two-scale algorithm. We note that Proposition 3.2 remains informative whenever the two chains differ, even when the discrepancy between them is not large. For example, consider two states whose components differ by some small amount. Under CRN, η_j and η̃_j remain correspondingly close almost surely for all components j. In contrast, under the maximal coupling, at least one component will exhibit larger deviations with high probability, since the constants in (13) do not depend on the dimension p. These larger deviations between some components of η and η̃ under maximal couplings push the two chains much further apart when sampling (β, β̃), resulting in meeting probabilities that are much lower over multiple steps of the algorithm. Overall, Propositions 3.1 and 3.2 show that multi-step kernel calculations taking into account such high-probability events may be necessary to further distinguish the relative performances of one-scale and two-scale couplings. A full analytic understanding of this difference in performance remains an open problem.
Another way of understanding the benefits of the two-scale coupling would be to find a distance under which the coupled chains under common random numbers exhibit a contraction. The CRN coupling part of Proposition 3.2 hints at a natural candidate metric. Appendix B.1 contains discussions and related simulations comparing this metric with alternatives. Proposition 3.3 verifies that such a metric bounds the total variation distance between one-step transition kernels.
Proposition 3.3.
Let K denote the Markov transition kernel associated with the update from (η_t, ξ_t, σ_t², β_t) to (η_{t+1}, ξ_{t+1}, σ_{t+1}², β_{t+1}) for the Gibbs sampler in (4), and let Z and Z̃ denote two states. Then the total variation distance between K(Z, ·) and K(Z̃, ·) is bounded above by an explicit function of the distance between Z and Z̃:
(14)
By Proposition 3.3, it suffices to show that the distance contracts under CRN to a sufficiently small value, such that the upper bound in (14) is strictly less than one. This would ensure that the CRN coupling brings the marginal chains close enough that the probability of a one-step maximal coupling resulting in a meeting is high. Note that (14) corresponds to the total variation distance between the kernels of the full chains, so under this setup we would attempt to maximally couple the full vectors η and η̃ (rather than maximally coupling component-wise) when the chains are close. Analytical calculations to establish such a contraction, and to understand how the contraction rate scales with the dimension and with other features of the problem such as the choice of prior, sparsity, and signal-to-noise ratio, require further work.
3.2 Alternative coupling strategies
The rapid meeting of the two-scale coupling in simulations recommends its use. However, we were unable to establish finite expected meeting times with this coupling, which prevents its use in constructing unbiased estimates. We now describe several small modifications of the two-scale coupling that have similar empirical performance but for which we can prove that the meeting time distribution has exponential tails.
Our basic approach is to maximally couple each component of (η, η̃) until a failed meeting occurs, after which we switch to CRN for the remaining components. The ordering of the components can be deterministic or randomized at each iteration. Under this construction, only one component of (η, η̃), corresponding to the failed meeting, will evolve independently; the other components will either meet or evolve under CRN. Therefore, we expect to obtain benefits similar to the two-scale coupling in high dimensions. An apparent advantage is that we can bypass the choice of metric and associated threshold. While the switch-to-CRN coupling has the same cost per iteration as the two-scale coupling, it can give faster run-times in practice when the dimension p is large compared to the number of observations n, as it avoids the repeated calculation of a metric.
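A sketch of this component ordering logic is given below; `max_couple_component` and `crn_couple_component` are hypothetical placeholders for the per-component maximal and common random numbers couplings of the slice sampler.

```r
# Switch-to-CRN update of the local precisions (sketch). Components are visited
# in a random order; each is maximally coupled until the first failed meeting,
# after which all remaining components are coupled with common random numbers.
# The placeholder functions return a list with scalars eta1 and eta2.
switch_to_crn_eta_update <- function(state1, state2, max_couple_component,
                                     crn_couple_component) {
  p <- length(state1$eta)
  eta1 <- numeric(p); eta2 <- numeric(p)
  failed <- FALSE
  for (j in sample.int(p)) {                 # random ordering of the components
    if (!failed) {
      draw <- max_couple_component(j, state1, state2)
      failed <- (draw$eta1 != draw$eta2)     # first failed meeting triggers the switch
    } else {
      draw <- crn_couple_component(j, state1, state2)
    }
    eta1[j] <- draw$eta1
    eta2[j] <- draw$eta2
  }
  list(eta1 = eta1, eta2 = eta2)
}
```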
We report the performance of the switch-to-CRN coupling with the ordering of the components randomized at each iteration. The setup is identical to that in Section 3.1, where for each dimension p we simulate meeting times based on independently generated synthetic datasets. Figure 4 shows that the switch-to-CRN coupling leads to similar meeting times as the two-scale coupling.
[Figure 4: meeting times of the switch-to-CRN and two-scale couplings against the dimension p.]
We can establish that the meeting times under this switch-to-CRN coupling have exponential tails and hence finite expectation.
Proposition 3.4.
Consider the blocked Gibbs sampler in (4). We follow the setup in Proposition 2.1, and assume that the global shrinkage prior π_ξ has a compact support. Denote Z_t = (β_t, η_t, ξ_t, σ_t²) and Z̃_t = (β̃_t, η̃_t, ξ̃_t, σ̃_t²). Consider the switch-to-CRN coupling, where the components of (η, η̃) are maximally coupled one at a time (under any fixed or random ordering of the components) until the first failed meeting, after which common random numbers are employed, (ξ, ξ̃) and (σ², σ̃²) are maximally coupled, and (β, β̃) are coupled using common random numbers. Denote the meeting time by τ. Then there exist constants C < ∞ and δ ∈ (0, 1) such that P(τ > t) ≤ C δ^t for all t ≥ 0.
As in Proposition 2.2, our proof of Proposition 3.4 implies a rate δ which tends to 1 exponentially fast as the dimension increases. Obtaining more favorable rates with respect to the dimension, as hinted at by Figure 4, would require a better understanding of the CRN coupling, for example establishing a contraction in some parts of the state space as discussed above.
Overall, Figures 3 and 4 show significant empirical improvements in using the two-scale coupling and related strategies, compared to the one-scale coupling of Section 2.3. Therefore, we recommend the use of two-scale couplings and variants such as the switch-to-CRN coupling for practical high-dimensional applications. Furthermore, in settings where estimating the metric is computationally expensive or analytically intractable, variants such as the switch-to-CRN coupling may offer faster numerical run-times per iteration. Henceforth, we will use the two-scale coupling of Section 3.1 in our experiments and GWAS dataset applications, unless specified otherwise.
3.3 Computational and statistical impact of the degree of freedom ν
We empirically investigate the impact of the degree of freedom ν of Half-t priors. Higher degrees of freedom correspond to priors on β which have stronger shrinkage towards zero (Proposition A.1). We consider meeting times resulting from the two-scale coupling of Section 3.1, as well as the statistical performance of the posteriors. On the computational side, recent work on convergence analysis (Qin and Hobert, 2019a, b) has highlighted the impact of shrinkage on the convergence of simpler MCMC algorithms. Under a common random numbers coupling, Qin and Hobert (2019b) have studied the convergence of the Gibbs sampler of Albert and Chib (1993) for Bayesian probit regression, proving dimension-free convergence rates for priors with sufficiently strong shrinkage towards zero. On the statistical estimation side, Half-t priors have long been proposed (Gelman, 2006; Carvalho et al., 2009, 2010) for Bayesian hierarchical models. van der Pas et al. (2014, 2017) have established (near) minimax optimal estimation for the Horseshoe (ν = 1) in the Normal means model. Ghosh and Chakrabarti (2017) have extended the results of van der Pas et al. (2014) to show minimax optimality of a wider class of priors including Half-t priors. Recently, Song (2020) has established that the degree of freedom can impact the multiplicative constant in the posterior contraction rate for the Normal means model. Posterior contraction rates of continuous shrinkage priors in the regression setting beyond the Normal means model deserve further investigation.
[Figure 5: (a) meeting times of the two-scale coupling against the dimension p for different degrees of freedom ν; (b) mean squared error of posterior mean estimates for different values of ν, with the LASSO as a benchmark.]
We consider the meeting times obtained with Half-t priors under the two-scale coupling (with a fixed threshold d₀) on synthetic datasets. The synthetic datasets are generated as for Figure 2 of Section 2.3. Figure 5(a) is then based on synthetic datasets generated independently for different degrees of freedom ν. Figure 5(a) plots meeting times of L-lag coupled Markov chains against the dimension. It indicates that even a small increase in the degree of freedom compared to the Horseshoe (ν = 1) can lead to orders of magnitude computational improvements through shorter meeting times.
We also consider the statistical performance of Bayesian shrinkage regression with Half-t priors on synthetic datasets in Figure 5(b). The synthetic datasets are generated as for Figure 2 of Section 2.3. We use the blocked Gibbs sampler of Algorithm 3 to draw samples from the posteriors corresponding to different degrees of freedom ν, by simulating chains of fixed length, with a burn-in period chosen separately for each value of ν using the coupling-based total variation distance upper bounds of Biswas et al. (2019). Figure 5(b) shows similar mean squared error (MSE) for different values of ν. For a fixed synthetic dataset, the MSE in Figure 5(b) is calculated as p^(−1) ∑_{j=1}^{p} (β̂_j − β*_j)², where β̂ is the MCMC estimate of the posterior mean of β obtained from the Markov chain after burn-in. The error bars in Figure 5(b) are generated by simulating synthetic datasets independently and each time calculating the corresponding MSE from a Markov chain for different values of ν. The MSE of the LASSO is also included, with the regularization parameter chosen by cross-validation using the glmnet package (Friedman et al., 2010). For moderate values of ν, we observe a lower MSE using Half-t priors compared to the LASSO. The grey plots in Figure 5(b) show that much larger values of ν can lead to higher MSE, as the corresponding posteriors are strongly concentrated about zero and less able to identify non-null signals. Overall, Figure 5(b) suggests that Half-t priors with moderately larger degrees of freedom can result in comparable statistical performance, while much larger values of ν should be discouraged.
4 Results on GWAS datasets
Section 3.3 suggests that Half-t priors with higher degrees of freedom ν can give similar statistical performance to the Horseshoe (ν = 1), whilst allowing orders of magnitude computational improvements.
Motivated by this observation, we apply our algorithms to genome-wide association study (GWAS) datasets using Half-t priors with a degree of freedom larger than one. We consider a bacteria dataset (Bühlmann et al., 2014) with n = 71 observations (corresponding to production of the vitamin riboflavin) and p = 4,088 covariates (corresponding to single nucleotide polymorphisms (SNPs) in the genome), and a maize dataset (Romay et al., 2013; Liu et al., 2016; Zeng and Zhou, 2017; Johndrow et al., 2020) with a few thousand observations (corresponding to the average number of days taken for silk emergence in different maize lines) and close to 100,000 covariates (corresponding to SNPs). Bayesian methods are well-suited to such GWAS datasets, as they provide interpretable notions of uncertainty through marginal posterior probabilities which allow for multimodality, enable the use of prior information, and account for the uncertainty in parameter estimates (e.g. Guan and Stephens, 2011; Zhou et al., 2013). Furthermore, heavy-tailed continuous shrinkage priors can be particularly effective in the GWAS setting, as the associations between SNPs and the phenotype are expected to be sparse, and such priors have shown competitive empirical performance in polygenic prediction (Ge et al., 2019).
For both datasets, we target the Half-t posterior with a degree of freedom larger than one. We use the two-scale coupling with a fixed threshold d₀, and initialize chains independently from the prior. For the ξ updates, we use a fixed Metropolis–Rosenbluth–Teller–Hastings step-size. Based on 100 independent coupled chains, Figure 6 shows upper bounds on the total variation distance to stationarity, as a function of the iteration t, using L-lag couplings with lags chosen separately for the maize and riboflavin datasets. The results indicate that these Markov chains converge to the corresponding posterior distributions quickly; for the maize dataset, fewer than 1,000 iterations suffice. We note that the lags were selected based on some preliminary runs of the coupled chains to obtain informative total variation bounds, which incurs an additional preliminary cost. Any choice of lag results in valid upper bounds, but only large enough lags lead to upper bounds that are in the interval [0, 1] even for small iterations t.
In these GWAS examples, the run-time per iteration may be significant. For the maize dataset, on a 2015 high-end laptop, each iteration of the coupled chain takes seconds rather than milliseconds, and running one coupled chain until meeting can take more than one day. This run-time is dominated by the calculation of the weighted matrix cross-product X D_η^(−1) Xᵀ, with an O(n²p) cost. In this setting, a scientist wanting to run long chains of that kind may have to wait for weeks or months. Using the proposed couplings, a scientist with access to parallel computers can be confident that samples obtained after a burn-in chosen from the estimated total variation bounds would be nearly indistinguishable from perfect samples from the target; for a similar cost they can also obtain unbiased estimators as described in Jacob et al. (2020). For the riboflavin dataset, each iteration of the coupled chain is much cheaper, and running one coupled chain until meeting takes only seconds. The run-time there is dominated by the component-wise η updates, which have O(p) cost. In the riboflavin example, the coupling-based diagnostics enable the scientist to perform reliable Bayesian computation directly from personal computers. Overall, the GWAS examples emphasize that couplings of MCMC algorithms can aid practitioners in high-dimensional settings.
[Figure 6: coupling-based upper bounds on the total variation distance to stationarity against iteration, for the maize and riboflavin GWAS datasets.]
5 Discussion
We have introduced couplings of Gibbs samplers for Bayesian shrinkage regression with Half-t priors. The proposed two-scale coupling is operational in realistic settings, including a GWAS setting with close to 100,000 covariates.
Firstly, our work participates in a wider effort to apply MCMC algorithms and coupling techniques to ever more challenging settings (Middleton et al., 2020; Ruiz et al., 2020; Trippe et al., 2021; Lee et al., 2020b; Xu et al., 2021). In particular, the short meeting times obtained here vindicate the use of coupling techniques in multimodal, high-dimensional sampling problems. This relates to questions raised in some comments of Lee et al. (2020a) and Paulin (2020) in the discussion of Jacob et al. (2020), and shows that the dimension of the state space may not necessarily be the most important driver of the performance of MCMC algorithms and couplings thereof.
Secondly, we have applied L-lag couplings (Biswas et al., 2019) to obtain upper bounds on the total variation distance between the Markov chain at a finite time and its stationary distribution. This allows empirical investigations of the mixing time and of how it varies with the inputs of the problem, including the number of observations, the number of covariates, the signal-to-noise ratio and the sparsity. We find that coupling techniques constitute a convenient non-asymptotic tool to monitor the performance of MCMC algorithms.
Thirdly, we observe that Half-t priors with degrees of freedom higher than one give statistical estimation performance similar to the Horseshoe whilst providing significant computational advantages. This contributes towards the discussion on the impact of the prior on the trade-off between statistical estimation and computational feasibility, in the setting of high-dimensional Bayesian regression.
The following questions arise from our work.
1. Convergence of the blocked Gibbs sampler. Short meeting times suggest that the blocked Gibbs sampler converges quickly, even in high dimensions. This motivates a more formal study of the convergence of that Markov chain, to understand how the convergence rate varies with features of the data generating process and of the prior. The Lyapunov function in Proposition 2.1 and the two-scale coupling may prove useful for such analysis. Our initial work in this area suggests that finding a convenient metric that gives sharp bounds on the metric used here, while simultaneously being amenable to theoretical analysis, will be a key step.
2. Alternative coupling algorithms. Section 3.2 indicates that many coupling strategies are available in the present setting, and more generally for various Gibbs samplers. For example, we could combine the switch-to-CRN coupling with the two-scale coupling to form other coupling strategies. Under the two-scale coupling setup, we could apply the switch-to-CRN coupling only when the chains are close with respect to a chosen metric, and employ CRN couplings when they are far away. Alternatively, we could always apply the switch-to-CRN coupling, and estimate component-wise meeting probabilities to select the order of the updates, for example such that components more likely to meet are updated first. Some couplings may result in shorter meeting times for the Horseshoe prior. This may allow the Horseshoe prior to remain competitive with Half-t priors with higher degrees of freedom. In our experiments we did not find ways to obtain shorter meetings for the Horseshoe, but we certainly cannot rule out that possibility. If one could obtain short meetings for the Horseshoe in high dimensions, the apparent trade-off between statistical performance and computing cost identified in the present work would disappear.
3. Interplay between posterior concentration and MCMC convergence. For Bayesian regression with spike-and-slab priors, Yang et al. (2016) and Atchadé (2019) have shown that posterior contraction can aid the convergence of MCMC in high dimensions. The performance of the proposed couplings motivates similar investigations for continuous shrinkage priors.
Acknowledgement.
The authors thank Xiaolei Liu and Xiang Zhou for sharing the Maize GWAS data, and Yves F. Atchadé, Qian Qin and Neil Shephard for helpful comments. Niloy Biswas gratefully acknowledges support from a GSAS Merit Fellowship and a Two Sigma Fellowship Award. Anirban Bhattacharya gratefully acknowledges support by the National Science Foundation through an NSF CAREER award (DMS-1653404). Pierre E. Jacob gratefully acknowledges support by the National Science Foundation through grants DMS-1712872 and DMS-1844695. The maize dataset GWAS simulations in this article were run on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University.
References
- Abramowitz [1974] M. Abramowitz. Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover Publications, Inc., USA, 1974. ISBN 0486612724.
- Albert and Chib [1993] J. H. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679, 1993. ISSN 01621459. URL http://www.jstor.org/stable/2290350.
- Atchadé [2019] Y. F. Atchadé. Approximate spectral gaps for Markov chains mixing times in high dimensions. arXiv preprint arXiv:1903.11745, 2019.
- Trippe et al. [2021] B. L. Trippe, T. D. Nguyen, and T. Broderick. Optimal transport couplings of Gibbs samplers on partitions for unbiased estimation. 2021.
- Bhadra et al. [2019] A. Bhadra, J. Datta, N. G. Polson, and B. Willard. Lasso meets Horseshoe: A survey. Statistical Science, 34(3):405–427, 2019.
- Bhattacharya and Johndrow [2021] A. Bhattacharya and J. E. Johndrow. Sampling Local-Scale Parameters in High-Dimensional Regression Models, pages 1–14. American Cancer Society, 2021. ISBN 9781118445112. doi: https://doi.org/10.1002/9781118445112.stat08292. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118445112.stat08292.
- Bhattacharya et al. [2015] A. Bhattacharya, D. Pati, N. S. Pillai, and D. B. Dunson. Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110(512):1479–1490, 2015. doi: 10.1080/01621459.2014.960967. URL https://doi.org/10.1080/01621459.2014.960967.
- Bhattacharya et al. [2016] A. Bhattacharya, A. Chakraborty, and B. K. Mallick. Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika, 103(4):985–991, 10 2016. ISSN 0006-3444. doi: 10.1093/biomet/asw042. URL https://doi.org/10.1093/biomet/asw042.
- Bhattacharya et al. [2021] S. K. Bhattacharya, K. Khare, and S. Pal. Geometric ergodicity of Gibbs samplers for the Horseshoe and its regularized variants. arXiv preprint arXiv:2101.00366, 2021.
- Biswas et al. [2019] N. Biswas, P. E. Jacob, and P. Vanetti. Estimating convergence of Markov chains with L-lag couplings. In Advances in Neural Information Processing Systems, pages 7389–7399, 2019.
- Bou-Rabee and Eberle [2020] N. Bou-Rabee and A. Eberle. Two-scale coupling for preconditioned Hamiltonian Monte Carlo in infinite dimensions. Stochastics and Partial Differential Equations: Analysis and Computations, 2020. doi: 10.1007/s40072-020-00175-6. URL https://doi.org/10.1007/s40072-020-00175-6.
- Bou-Rabee and Sanz-Serna [2018] N. Bou-Rabee and J. M. Sanz-Serna. Geometric integrators and the Hamiltonian Monte Carlo method. Acta Numerica, 27:113–206, 2018. doi: 10.1017/S0962492917000101.
- Bühlmann et al. [2014] P. Bühlmann, M. Kalisch, and L. Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1(1):255–278, 2014. doi: 10.1146/annurev-statistics-022513-115545. URL https://doi.org/10.1146/annurev-statistics-022513-115545.
- Carpenter et al. [2017] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. A. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: a probabilistic programming language. Journal of Statistical Software, 76(1):1–32, 2017.
- Carvalho et al. [2009] C. M. Carvalho, N. G. Polson, and J. G. Scott. Handling sparsity via the Horseshoe. In Proceedings of Machine Learning Research, volume 5, pages 73–80. PMLR, 2009. URL http://proceedings.mlr.press/v5/carvalho09a.html.
- Carvalho et al. [2010] C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010. ISSN 00063444. URL http://www.jstor.org/stable/25734098.
- Friedman et al. [2010] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, Articles, 33(1):1–22, 2010. ISSN 1548-7660. doi: 10.18637/jss.v033.i01. URL https://www.jstatsoft.org/v033/i01.
- Ge et al. [2019] T. Ge, C.-Y. Chen, Y. Ni, Y.-C. A. Feng, and J. W. Smoller. Polygenic prediction via bayesian regression and continuous shrinkage priors. Nature Communications, 10(1):1776, 2019. doi: 10.1038/s41467-019-09718-5. URL https://doi.org/10.1038/s41467-019-09718-5.
- Gelman [2006] A. Gelman. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Anal., 1(3):515–534, 09 2006. doi: 10.1214/06-BA117A. URL https://doi.org/10.1214/06-BA117A.
- Ghosh and Chakrabarti [2017] P. Ghosh and A. Chakrabarti. Asymptotic optimality of one-group shrinkage priors in sparse high-dimensional problems. Bayesian Anal., 12(4):1133–1161, 12 2017. doi: 10.1214/16-BA1029. URL https://doi.org/10.1214/16-BA1029.
- Glynn and Rhee [2014] P. W. Glynn and C.-h. Rhee. Exact estimation for Markov chain equilibrium expectations. Journal of Applied Probability, 51(A):377–389, 2014.
- Griffin and Brown [2010] J. E. Griffin and P. J. Brown. Inference with Normal-Gamma prior distributions in regression problems. Bayesian Anal., 5(1):171–188, 03 2010. doi: 10.1214/10-BA507. URL https://doi.org/10.1214/10-BA507.
- Guan and Stephens [2011] Y. Guan and M. Stephens. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. The Annals of Applied Statistics, 5(3):1780 – 1815, 2011. doi: 10.1214/11-AOAS455. URL https://doi.org/10.1214/11-AOAS455.
- Hager [1989] W. W. Hager. Updating the inverse of a matrix. SIAM Review, 31(2):221–239, 1989. ISSN 00361445. URL http://www.jstor.org/stable/2030425.
- Hastings [1970] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.
- Jacob et al. [2020] P. E. Jacob, J. O’Leary, and Y. F. Atchadé. Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):543–600, 2020. doi: 10.1111/rssb.12336. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12336.
- Jarner and Roberts [2007] S. Jarner and G. Roberts. Convergence of heavy-tailed Monte Carlo Markov Chain algorithms. Scandinavian Journal of Statistics, 34:781–815, 02 2007. doi: 10.1111/j.1467-9469.2007.00557.x.
- Jarner and Tweedie [2003] S. F. Jarner and R. L. Tweedie. Necessary conditions for geometric and polynomial ergodicity of random-walk-type Markov chains. Bernoulli, 9(4):559–578, 08 2003. doi: 10.3150/bj/1066223269. URL https://doi.org/10.3150/bj/1066223269.
- Johndrow et al. [2020] J. Johndrow, P. Orenstein, and A. Bhattacharya. Scalable approximate MCMC algorithms for the Horseshoe prior. Journal of Machine Learning Research, 21(73):1–61, 2020. URL http://jmlr.org/papers/v21/19-536.html.
- Johnson and Geyer [2012] L. T. Johnson and C. J. Geyer. Variable transformation to obtain geometric ergodicity in the random-walk Metropolis algorithm. Ann. Statist., 40(6):3050–3076, 12 2012. doi: 10.1214/12-AOS1048. URL https://doi.org/10.1214/12-AOS1048.
- Johnson [1996] V. E. Johnson. Studying convergence of Markov chain Monte Carlo algorithms using coupled sample paths. Journal of the American Statistical Association, 91(433):154–166, 1996.
- Johnson [1998] V. E. Johnson. A coupling-regeneration scheme for diagnosing convergence in Markov chain Monte Carlo algorithms. Journal of the American Statistical Association, 93(441):238–248, 1998.
- Johnson and Rossell [2012] V. E. Johnson and D. Rossell. Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association, 107(498):649–660, 2012. doi: 10.1080/01621459.2012.682536. URL https://doi.org/10.1080/01621459.2012.682536.
- Khare and Hobert [2013] K. Khare and J. P. Hobert. Geometric ergodicity of the Bayesian lasso. Electron. J. Statist., 7:2150–2163, 2013. doi: 10.1214/13-EJS841. URL https://doi.org/10.1214/13-EJS841.
- Lee et al. [2020a] A. Lee, S. S. Singh, and M. Vihola. Comment on ‘Unbiased Markov chain Monte Carlo with couplings’ by Jacob, O’Leary and Atchadé. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2020a.
- Lee et al. [2020b] A. Lee, S. S. Singh, and M. Vihola. Coupled conditional backward sampling particle filter. Annals of Statistics, 48(5):3066–3089, 2020b.
- Liu et al. [2016] X. Liu, M. Huang, B. Fan, E. S. Buckler, and Z. Zhang. Iterative usage of fixed and random effect models for powerful and efficient genome-wide association studies. PLOS Genetics, 12(2):1–24, 02 2016. doi: 10.1371/journal.pgen.1005767. URL https://doi.org/10.1371/journal.pgen.1005767.
- Livingstone et al. [2019a] S. Livingstone, M. Betancourt, S. Byrne, and M. Girolami. On the geometric ergodicity of Hamiltonian Monte Carlo. Bernoulli, 25(4A):3109–3138, 11 2019a. doi: 10.3150/18-BEJ1083. URL https://doi.org/10.3150/18-BEJ1083.
- Livingstone et al. [2019b] S. Livingstone, M. F. Faulkner, and G. O. Roberts. Kinetic energy choice in Hamiltonian/hybrid Monte Carlo. Biometrika, 106(2):303–319, 2019b.
- Makalic and Schmidt [2016] E. Makalic and D. F. Schmidt. A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1):179–182, 2016.
- Meyn and Tweedie [1993] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Springer, 1993.
- Middleton et al. [2020] L. Middleton, G. Deligiannidis, A. Doucet, and P. E. Jacob. Unbiased Markov chain Monte Carlo for intractable target distributions. Electron. J. Statist., 14(2):2842–2891, 2020. doi: 10.1214/20-EJS1727. URL https://doi.org/10.1214/20-EJS1727.
- Nishimura and Suchard [2018] A. Nishimura and M. A. Suchard. Prior-preconditioned conjugate gradient for accelerated Gibbs sampling in “large n & large p” sparse Bayesian logistic regression models. arXiv preprint arXiv:1810.12437, 2018.
- Pal and Khare [2014] S. Pal and K. Khare. Geometric ergodicity for Bayesian shrinkage models. Electron. J. Statist., 8(1):604–645, 2014. doi: 10.1214/14-EJS896. URL https://doi.org/10.1214/14-EJS896.
- Park and Casella [2008] T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008. doi: 10.1198/016214508000000337. URL https://doi.org/10.1198/016214508000000337.
- Paulin [2020] D. Paulin. Comment on ‘Unbiased Markov chain Monte Carlo with couplings’ by Jacob, O’Leary and Atchadé. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2020.
- Polson and Scott [2012] N. G. Polson and J. G. Scott. On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(4):887–902, 2012.
- Polson et al. [2014] N. G. Polson, J. G. Scott, and J. Windle. The Bayesian bridge. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):713–733, 2014. doi: 10.1111/rssb.12042. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12042.
- Qin and Hobert [2019a] Q. Qin and J. P. Hobert. Convergence complexity analysis of Albert and Chib’s algorithm for Bayesian probit regression. Ann. Statist., 47(4):2320–2347, 08 2019a. doi: 10.1214/18-AOS1749. URL https://doi.org/10.1214/18-AOS1749.
- Qin and Hobert [2019b] Q. Qin and J. P. Hobert. Wasserstein-based methods for convergence complexity analysis of MCMC with applications. arXiv preprint arXiv:1810.08826, 2019b.
- R Core Team [2013] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. URL http://www.R-project.org/.
- Roberts and Rosenthal [2002] G. O. Roberts and J. S. Rosenthal. One-shot coupling for certain stochastic recursive sequences. Stochastic Processes and their Applications, 99(2):195–208, 2002. ISSN 0304-4149. doi: https://doi.org/10.1016/S0304-4149(02)00096-0. URL https://www.sciencedirect.com/science/article/pii/S0304414902000960.
- Roberts and Rosenthal [2004] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Probability surveys, 1:20–71, 2004.
- Roberts and Tweedie [1996] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996. ISSN 13507265. URL http://www.jstor.org/stable/3318418.
- Romay et al. [2013] M. C. Romay, M. J. Millard, J. C. Glaubitz, J. A. Peiffer, K. L. Swarts, T. M. Casstevens, R. J. Elshire, C. B. Acharya, S. E. Mitchell, S. A. Flint-Garcia, M. D. McMullen, J. B. Holland, E. S. Buckler, and C. A. Gardner. Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biology, 14(6):R55, 2013. doi: 10.1186/gb-2013-14-6-r55. URL https://doi.org/10.1186/gb-2013-14-6-r55.
- Rosenthal [1995] J. S. Rosenthal. Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association, 90(430):558–566, 1995. ISSN 01621459. URL http://www.jstor.org/stable/2291067.
- Ruiz et al. [2020] F. J. R. Ruiz, M. K. Titsias, T. Cemgil, and A. Doucet. Unbiased gradient estimation for variational auto-encoders using coupled Markov chains. ArXiv, abs/2010.01845, 2020.
- Sherlock et al. [2010] C. Sherlock, P. Fearnhead, and G. O. Roberts. The random walk Metropolis: Linking theory and practice through a case study. Statist. Sci., 25(2):172–190, 05 2010. doi: 10.1214/10-STS327. URL https://doi.org/10.1214/10-STS327.
- Song [2020] Q. Song. Bayesian shrinkage towards sharp minimaxity. Electronic Journal of Statistics, 14(2):2714 – 2741, 2020. doi: 10.1214/20-EJS1732. URL https://doi.org/10.1214/20-EJS1732.
- Thorisson [2000] H. Thorisson. Coupling, stationarity, and regeneration. Springer, 2000.
- Tibshirani [1994] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
- van der Pas et al. [2017] S. van der Pas, B. Szabó, and A. van der Vaart. Adaptive posterior contraction rates for the horseshoe. Electron. J. Statist., 11(2):3196–3225, 2017. doi: 10.1214/17-EJS1316. URL https://doi.org/10.1214/17-EJS1316.
- van der Pas et al. [2014] S. L. van der Pas, B. J. K. Kleijn, and A. W. van der Vaart. The horseshoe estimator: Posterior concentration around nearly black vectors. Electron. J. Statist., 8(2):2585–2618, 2014. doi: 10.1214/14-EJS962. URL https://doi.org/10.1214/14-EJS962.
- van Erven and Harremos [2014] T. van Erven and P. Harremos. Rényi divergence and Kullback–Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014. doi: 10.1109/TIT.2014.2320500.
- Vats and Knudson [2020] D. Vats and C. Knudson. Revisiting the Gelman–Rubin diagnostic. Statistical Science (to appear), 2020.
- Vats et al. [2019] D. Vats, J. M. Flegal, and G. L. Jones. Multivariate output analysis for Markov chain Monte Carlo. Biometrika, 106(2):321–337, 04 2019. ISSN 0006-3444. doi: 10.1093/biomet/asz002. URL https://doi.org/10.1093/biomet/asz002.
- Xu et al. [2021] K. Xu, T. E. Fjelde, C. Sutton, and H. Ge. Couplings for multinomial Hamiltonian Monte Carlo. In International Conference on Artificial Intelligence and Statistics, pages 3646–3654. PMLR, 2021.
- Yang et al. [2016] Y. Yang, M. J. Wainwright, and M. I. Jordan. On the computational complexity of high-dimensional Bayesian variable selection. Ann. Statist., 44(6):2497–2532, 12 2016. doi: 10.1214/15-AOS1417. URL https://doi.org/10.1214/15-AOS1417.
- Zeng and Zhou [2017] P. Zeng and X. Zhou. Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models. Nature Communications, 8(1):456, 2017. doi: 10.1038/s41467-017-00470-2. URL https://doi.org/10.1038/s41467-017-00470-2.
- Zhou et al. [2013] X. Zhou, P. Carbonetto, and M. Stephens. Polygenic modeling with Bayesian sparse linear mixed models. PLOS Genetics, 9(2):1–14, 02 2013. doi: 10.1371/journal.pgen.1003264. URL https://doi.org/10.1371/journal.pgen.1003264.
- Zou and Hastie [2005] H. Zou and T. Hastie. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.
In Section A, we describe some features of the posterior distributions arising in high-dimensional regression with shrinkage priors. In Section B we describe metrics to monitor the distance between pairs of chains and provide details on the calculation of metrics to use in two-scale coupling strategies. In Section C we provide the proofs for all the theoretical results in this document. Section D provides pseudo-codes for various algorithms mentioned in this document. Section E provides some derivations of certain steps in the Gibbs sampler.
Appendix A Features of the target complicate the use of generic MCMC
While posterior distributions resulting from heavy-tailed shrinkage priors have desirable statistical properties, their features pose challenges to generic MCMC algorithms. For Half-t priors, we show that the resulting posterior distributions exhibit 1) multimodality, 2) heavy tails, and 3) poles at zero. This hints at a trade-off between statistical accuracy and computational difficulty, since the very features that create computational challenges are crucial for optimal statistical performance across sparsity levels and signal strengths.
Proposition A.1 gives the marginal prior and posterior densities of up to normalizing constants, i.e. integrating over . For any real-valued functions and , we write for when for some .
Proposition A.1.
The marginal prior of component on given and has density
(15) |
where is the Gamma function, and is the confluent hypergeometric function of the second kind [Abramowitz, 1974, Section 13] for any . The marginal posterior density of on given and is
(16) |
thus and .
Figure 7 gives an example of such a marginal posterior in a stylized setting. Here
with and . Component in Figure 7 illustrates the potential multimodality in the posterior. The posterior distribution also has polynomial tails in high dimensions, as stated in Corollary A.2.

Corollary A.2.
Let be a unit vector such that and let . Assume entries are non-zero for all , which is necessary to ensure that is finite for all . Then
(17) |
as . That is, the marginal posterior of on has polynomial tails along all directions in the null space of .
In our stylized example, satisfies . It follows from Corollary A.2 that for large . More generally, whenever and the marginal prior on has polynomial tails, the subspace of has vector space dimension at least , and therefore the posterior has polynomial tails over a non-trivial subspace. For such heavy-tailed targets, the most popular “generic” MCMC algorithms face significant theoretical and practical challenges. Here we use the term “generic” for algorithms that are not explicitly designed to target this class of distributions, but instead rely on a general strategy to explore state spaces. Specifically, for heavy-tailed targets, lack of geometric ergodicity has been established for Random Walk Metropolis–Hastings (RWMH) [Jarner and Tweedie, 2003], the Metropolis-adjusted Langevin algorithm (MALA) [Roberts and Tweedie, 1996] and Hamiltonian Monte Carlo (HMC) [Livingstone et al., 2019a]. (Jarner and Tweedie [2003], Roberts and Tweedie [1996] and Livingstone et al. [2019a] show lack of geometric ergodicity for targets with heavy tails in every direction; their arguments can be modified to cover targets that are heavy-tailed along any one direction, as in Corollary A.2, by considering a projection of the full chain along that direction.) Better performance can sometimes be obtained using heavy-tailed proposals [Jarner and Roberts, 2007, Sherlock et al., 2010, Livingstone et al., 2019b] or variable transformations when available [Johnson and Geyer, 2012].
The posterior density in (16) also approaches infinity when any component approaches zero, as for any . This is highlighted by the modes at the origin in Figure 7. The pole at zero corresponds to a discontinuity in the gradients of the negative log-density, as given in Proposition A.1. Such diverging gradients can lead to numerically unstable leapfrog integration for HMC samplers [Bou-Rabee and Sanz-Serna, 2018].
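The features above can be illustrated numerically. The following is a minimal sketch, not the paper's code, under the simplifying assumptions that a single coefficient satisfies beta | lambda ~ N(0, lambda^2) with a half-Cauchy prior on the local scale lambda (the Horseshoe case), and with the global scale and noise variance fixed to 1; it only illustrates the qualitative behavior (pole at zero, polynomial tails), not the exact posterior studied here.

```r
# Minimal numerical illustration (not the paper's code) of the features discussed
# above, assuming beta | lambda ~ N(0, lambda^2) and lambda ~ half-Cauchy(0, 1)
# (the Horseshoe case), with the global scale and noise variance fixed to 1.
# The marginal density of beta is obtained by integrating out the local scale
# on the log scale.
marginal_density <- function(beta) {
  integrand <- function(t) {                   # t = log(lambda), Jacobian exp(t)
    lam <- exp(t)
    dnorm(beta, mean = 0, sd = lam) * 2 * dcauchy(lam) * lam
  }
  integrate(integrand, lower = -30, upper = 30)$value
}

# The density grows without bound as beta approaches zero (pole at the origin)
# and decays polynomially in the tails, in line with Proposition A.1 and
# Corollary A.2.
sapply(c(1e-3, 1e-2, 1e-1, 1, 10, 100), marginal_density)
```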
Appendix B Details on the choice and calculation of metric
B.1 Comparison with other metrics
Figure 8 plots alongside the metric and the log-transformed metric on parameters and as in (8). It is based on one trajectory of our two-scale coupled chain under common random numbers (threshold ), for a synthetic dataset generated as per Section 2.3 targeting the Horseshoe posterior () with . Figure 8 shows that is bounded above by both the metric and the metric between the logarithms of . Therefore, we could consider the capped metric () or the capped metric on the logs () to obtain upper bounds on . Finding metrics that are both easy to calculate and accurate is a direction worth investigating further, and may prove valuable in the theoretical analysis of the proposed algorithms.

B.2 Choice of metric threshold for the two-scale coupling
For different values of , Figure 9(a) shows averaged trajectories . When (one-scale coupling), the metric remains near , and when (common random numbers coupling) the metric gets very close to but never exactly equals . For all threshold values shown, the chains exactly meet and reaches zero at similar times. Figure 9(b) considers higher-dimensional settings: it plots histograms of when . The histograms show that takes values close to either or across the different dimensions . This may explain why different threshold values sufficiently far from and give similar meeting times. Figure 9(c) shows that different thresholds also give similar meeting times for the Horseshoe.
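As a schematic illustration of the role of the threshold, a two-scale update can be sketched as below. This is a hedged sketch with placeholder arguments, not the paper's exact implementation; the convention that an infinite threshold recovers the one-scale coupling and a zero threshold recovers the common random numbers coupling is assumed here.

```r
# Schematic two-scale coupled update (placeholder arguments, not the paper's
# exact implementation): attempt a maximal coupling when the chains are close
# in the chosen metric, and use common random numbers (CRN) otherwise. Under
# this convention, threshold = Inf gives the one-scale (always-maximal)
# coupling and threshold = 0 gives the pure CRN coupling.
two_scale_step <- function(x, y, metric, threshold, maximal_kernel, crn_kernel) {
  if (metric(x, y) < threshold) {
    maximal_kernel(x, y)   # chains have a chance of meeting exactly at this step
  } else {
    crn_kernel(x, y)       # contractive move that brings the chains closer
  }
}
```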



B.3 Calculating metric
We present details of estimating the metric defined in (7). As the coupling in Algorithm 2 is independent component-wise, we have
(18) |
for and .
B.3.1 An integration based approximation
We consider the approximation
(19) |
where and correspond to target distributions of the marginal slice samplers in Algorithm 2. This corresponds to approximating the probability of not meeting under Algorithm 2 with the total variation distance (i.e. probability of not meeting under a maximal coupling) between the marginal target distributions and of the slice samplers. We can evaluate analytically using Proposition B.1. This motivates the approximation
(20) |
for .
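When the closed form of Proposition B.1 is unavailable, or as a sanity check on it, the total variation distance between two univariate densities can also be approximated numerically. The sketch below uses placeholder Gamma densities rather than the slice-sampler targets of Algorithm 2.

```r
# Numerical check (placeholder Gamma densities, not the slice-sampler targets of
# Algorithm 2): the probability of not meeting under a maximal coupling of two
# univariate distributions equals their total variation distance,
# TV(p, q) = 1 - integral of min(p, q).
tv_distance <- function(p, q, lower = 0, upper = Inf) {
  overlap <- integrate(function(x) pmin(p(x), q(x)), lower, upper)$value
  1 - overlap
}

p <- function(x) dgamma(x, shape = 2.0, rate = 1.0)
q <- function(x) dgamma(x, shape = 2.5, rate = 1.3)
tv_distance(p, q)
```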
Proposition B.1.
Consider distributions and on with densities
(21) |
respectively on , such that and are parameterized by the fixed constant and . Then, when . When ,
(22) |
where is defined as the lower incomplete confluent hypergeometric function of the second kind, such that gives the confluent hypergeometric function of the second kind, and
(23) |
Proof.
We have
(24) |
where
(25) | ||||
(26) |
Case .
is immediate.
Case .
Case .
Follows from the case by symmetry. ∎
B.3.2 A Rao–Blackwellized estimator
We have
(35) | |||
(36) |
where and , are sampled using common random numbers as in Algorithm 2, and are sampled as in Algorithm 2. We can analytically evaluate probability , which motivates the Rao-Blackwellized estimator from (10),
(37) |
where is the number of samples. For each , is sampled independently as in Algorithm 2 and maximal coupling probability is calculated analytically using Proposition B.2.
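The following sketch illustrates the Rao–Blackwellization idea with placeholder conditionals; the exponential conditionals and the Gamma draws of the conditioning variable are assumptions for illustration only, and the closed-form probabilities of Proposition B.2 are not reproduced here.

```r
# Sketch of the Rao-Blackwellization idea with placeholder conditionals: average,
# over draws of the conditioning variable u, the exact probability that the two
# components meet under a maximal coupling of their conditionals, i.e.
# 1 - TV of the two conditional densities given u.
set.seed(1)
meet_prob_given_u <- function(u, a = 1.0, b = 1.5) {
  p <- function(x) dexp(x, rate = a + u)   # illustrative conditional of chain 1
  q <- function(x) dexp(x, rate = b + u)   # illustrative conditional of chain 2
  integrate(function(x) pmin(p(x), q(x)), 0, Inf)$value
}

N <- 1000
u_draws <- rgamma(N, shape = 2, rate = 1)  # stand-in for the draws of Algorithm 2
mean(sapply(u_draws, meet_prob_given_u))   # Rao-Blackwellized estimate of P(meet)
```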
Proposition B.2.
Suppose and where marginal distributions and correspond to the distribution in the second step of the slice sampler in Algorithm 4. Then,
(38) |
for and , and , , the regularized incomplete lower Gamma function, , , and
(39) |
Proof.
We work component-wise and drop the subscripts for ease of notation. Then distributions and have densities
(40) | ||||
(41) |
respectively where and , and is the regularized incomplete lower Gamma function. Note that
Case .
as required.
Case .
For ,
Denote , such that
This gives
as required.
Case .
Follows from the case by symmetry. ∎
Appendix C Proofs
C.1 Marginal prior and posterior densities
Proof of Proposition A.1.
Marginal Prior on .
Let denote the component-wise prior density of each based on the Half-t distribution, and the density of a multivariate Normal distribution with mean and . It suffices to calculate the marginal prior for each given , by marginalizing over each . We obtain,
(42)–(47)
where (47) follows from the definition of the confluent hypergeometric function of the second kind: for any . By the asymptotic expansion of the confluent hypergeometric function of the second kind [Abramowitz, 1974, Section 13.1], we obtain
(48) |
as required for (15).
Marginal Posterior of .
Gradient of negative log-density of posterior of .
From (49), we obtain
(51) | ||||
(52) |
Recall [Abramowitz, 1974, Section 13.4] for ,
(53) |
This gives
(54) |
Therefore by (52), as required. ∎
Proof of Corollary A.2.
By Proposition A.1,
(55) |
where is the normalizing constant. As , there exists some such that . When for some , the posterior density is infinite by Proposition A.1. Consider such that each is non-zero for . For any such fixed and large , we obtain
(56)–(59)
where (58) follows from the asymptotic expansion of the confluent hypergeometric function of the second kind [Abramowitz, 1974, Section 13.1]. ∎
C.2 Geometric Ergodicity
Initial technical results.
We first record some technical results.
Lemma C.1.
Consider any matrix such that for submatrices and . Then , where is the spectral norm of a matrix.
Proof.
Take any vector of length equal to the number of columns of with . Then
(60) |
Taking the supremum over all such with gives . ∎
Lemma C.2.
(Sherman–Morrison–Woodbury formula [Hager, 1989]) For any invertible matrix , matrix and matrix ,
(61) |
When , we obtain .
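For reference, a standard statement of the identity is given below, with generic symbols that need not match the notation of the lemma.

```latex
% Sherman–Morrison–Woodbury identity (generic symbols, not necessarily the
% lemma's notation): A is n x n invertible, U is n x k, V is k x n, C is k x k
% invertible.
\[
  (A + U C V)^{-1} = A^{-1} - A^{-1} U \left( C^{-1} + V A^{-1} U \right)^{-1} V A^{-1}.
\]
```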
Lemma C.3.
(A uniform bound on the full conditional mean of ). Let be an matrix with columns for , and let be a diagonal matrix with positive entries. Then
(62) |
for some , where is the spectral norm of a matrix. That is, the spectral norm of is uniformly bounded over all positive diagonal matrices .
In an earlier version of the present manuscript, we had explicitly characterized the uniform upper bound as for the Moore–Penrose pseudoinverse of , which is not true in general. The existence of a uniform upper bound has also been established by Bhattacharya et al. [2021] using block matrix inversions and induction.
Proof.
Fix any . We first consider the case , such that , for some scalar . We obtain
(63) |
which is a uniform bound for all .
We now consider . We have with for and with for . For any permutation matrix , the matrices and correspond to a reordering of the columns of and the diagonal entries of . We have
(64) |
This implies that is invariant under any reordering of the columns of and the diagonal entries of . By considering such a reordering, we can assume without loss of generality that the diagonal entries of are non-decreasing; that is, with . Denote . Then,
(65)–(67)
It therefore suffices to show that
(68) |
is finite for any . For , define
(69) |
where is the matrix corresponding to the first columns of and with is the diagonal matrix corresponding to the top diagonal entries of , and . Then . By Lemma C.2 with ,
(70)–(72)
Substituting this decomposition into (67), we obtain
(73) |
where, after some simplifications, and is a by matrix such that
(74) | ||||
(75) |
Therefore to show in (68) is finite, it suffices to bound uniformly over all with . Note that for some and some such that . Here corresponds to the projection of to the column space of , and corresponds to the component of that is perpendicular to the column space of . Note that . This implies
(77) | ||||
(78) |
Therefore by (74),
(79)–(84)
Lemma C.4.
(Moments of the full conditionals of each ). For fixed constants and , define a distribution on with probability density function
(86) |
Then for any and , we have
(87) |
Proof.
Recall the definition of the confluent hypergeometric function of the second kind . We obtain
(88) | ||||
(89) |
Therefore,
(90) | ||||
(91) |
∎
Ratios of confluent hypergeometric function evaluations, such as the ones arising in Lemma C.4, can be upper-bounded using the following result.
Lemma C.5.
Let denote the confluent hypergeometric function of the second kind, and fix .
1. Let . Then
(92)
Furthermore, for every , there exists a such that
(93)
2. Let . Then there exists some such that
(94)
Proof.
Case: .
By definition, we have
where the final equality follows as the distribution stochastically dominates the distribution , and hence for all as and . This gives as required for (92).
We now prove (93). By definition, we have
where is a random variable with density on , and is a random variable with density on .
1. Consider as . We obtain
2. Consider as a function of for . From the density of , we note that for all , stochastically dominates , and as , also stochastically dominates . Therefore, is a decreasing function of for .
As , for any , there exists such that for all . This gives for all . Also for all , we have as is a decreasing function of for . Overall, we obtain
for as required.
Case: .
Note that and are well-defined as .
1. Consider as . We first recall the asymptotic expansion of the confluent hypergeometric function of the second kind [Abramowitz, 1974, Section 13.5] at infinity,
(95)
as for any . For large , this gives
(96)–(99)
where (98) follows from the asymptotic expansion in (95). As , we obtain
(100)
2. Consider as . By definition of the confluent hypergeometric function of the second kind, we have
(101)
Note that as and ,
(102)
Therefore, . However, we have
(103)
Therefore . As , we obtain
(104)
From (100) and (104), we have and . As is a continuous function on , this implies that is bounded above for . In particular, there exists some such that
(105) |
for all as required for (94). ∎
Corollary C.6.
(Bound on moments of the full conditionals of each ). Consider a random variable on with density as in (86) in Lemma C.4 for some .
Consider when . Then for every , there exists some such that
(106) |
Consider when . Then, there exists some such that
(107) |
Proof.
By Lemma C.4, for all ,
(108) |
Lemma C.7.
Let on . Then, for ,
(111) |
Proof.
We follow the proof in Johndrow et al. [2020], which is included here for completeness. Recall the inverse moments of the univariate Normal distribution,
(112) |
where is the confluent hypergeometric function of the first kind. For , we know . For and , this gives
(113) | ||||
(114) |
Therefore, is decreasing for and for all . This gives
(115) |
∎
Lemma C.8.
(A uniform bound on the full conditional mean of ). Consider the full conditional distribution of given in Algorithm 3. Given , we have
(116) |
where . Then
(117) |
Proof.
Recall that the mean of a distribution (under shape and rate parameterization) is . Equation (117) now directly follows as is positive semi-definite. ∎
Lemma C.9.
(Maximal coupling probability of each component ). For fixed constants and , define a distribution on with probability density function
(118) |
Consider random variables , and for some fixed . Fix some , and take . Then under a maximal coupling,
(119) |
Drift and Minorization conditions.
Proof of Proposition 2.1.
For ease of notation, denote and let be the unit vector of the standard basis on . We first consider . Note that
(124)–(130)
where (125) follows from Jensen’s inequality, (127) follows as for and , (128) follows from Hölder’s inequality and from as is a positive-definite matrix, and (130) follows from Lemma C.3 for some finite constant which does not depend on or .
As the prior on has compact support, we have for all for some . This upper bound on , together with Lemma C.8 and Jensen’s inequality, gives
(131) |
Therefore, by (130) and (131), we obtain
(132) |
for . By (132) and Corollary C.6, we obtain
(133)–(136)
for . Note that for every , there exists such that . We can verify this by considering the function . It suffices to note that and for , where is the Digamma function.
We now consider for . We obtain
(137)–(142)
for some . Here (138) follows from Lemma C.7, (139) follows as for the operator norm of a matrix, (141) follows as for all and , and (142) follows as the prior on has compact support, so for some . Therefore, by (142),
(143) |
for . By (143) and Corollary C.6 with and any , there exists such that
(144)–(147)
where and .
Proposition C.10.
(Minorization condition). For , let denote the sub-level sets of the Lyapunov function in Proposition 2.1. Let denote the Markov transition kernel associated with the update from to implied by the blocked Gibbs update rule. Let and . Then for all , there exists such that
(152) |
where TV stands for the total variation distance between two probability distributions. In particular for , suffices.
Proof of Proposition C.10.
Denote and , and let for some . We obtain
TV | (153) |
where corresponds to one-scale coupling of and as in Proposition 2.2, the first inequality follows from the coupling inequality, and the second inequality follows from the proof of Proposition 2.2 for as . Note that does not depend on and as required. ∎
C.3 Coupled Kernel Proofs
Proof of Proposition 2.2.
For , let denote the sub-level sets of the Lyapunov function in Proposition 2.1, with corresponding contraction constant and additive constant . Let and for all . By Jacob et al. [2020, Proposition 4], it suffices to prove that for any sufficiently large such that , there exists such that
(154) |
under the one-scale coupling. We now prove (154) directly for the one-scale coupling in Proposition 2.2. Fix some such that , and take any . Then
(155)–(161)
Proof of Proposition 3.1.
As given and given are conditionally independent component-wise, it suffices to consider the univariate case where . Henceforth we assume and we drop the component subscripts for ease of notation. Without loss of generality, take and to be monotone increasing. Then the densities of and on are given by
(164) |
respectively. As , the distribution of stochastically dominates the distribution of . Therefore under a CRN coupling of and (specifically a CRN coupling of Algorithm 11), almost surely, and almost surely. We obtain
(165) |
Now consider a maximal coupling of and with independent residuals [Johnson, 1998]. Let . Then with probability . Otherwise with probability , where and are independent random variables with densities
(166) | ||||
(167) |
respectively on . This implies that the supports of and are disjoint. By (164), we further have almost surely. Overall, we obtain almost surely under a maximal coupling, and hence
(168) |
as required. ∎
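To make the two couplings used in this proof concrete, here is a generic sketch with placeholder Gamma distributions, not the conditionals of Algorithm 11: the CRN coupling pushes a common uniform through both inverse CDFs, while the maximal coupling with independent residuals makes the two draws equal with probability one minus the total variation distance.

```r
# Generic sketch of the two couplings used in this proof, with placeholder Gamma
# distributions (not the conditionals of Algorithm 11).
set.seed(1)

# CRN coupling: a common uniform is pushed through both inverse CDFs, so the two
# draws are comonotone.
crn_coupling <- function(qp, qq) {
  u <- runif(1)
  c(x = qp(u), y = qq(u))
}

# Maximal coupling with independent residuals (Johnson, 1998): the draws coincide
# with probability 1 - TV(p, q).
maximal_coupling <- function(rp, dp, rq, dq) {
  x <- rp()
  if (runif(1) * dp(x) <= dq(x)) return(c(x = x, y = x))  # chains meet
  repeat {                                                # sample from the residual of q
    y <- rq()
    if (runif(1) * dq(y) > dp(y)) return(c(x = x, y = y))
  }
}

crn_coupling(function(u) qgamma(u, 2, 1), function(u) qgamma(u, 2.5, 1.3))
maximal_coupling(function() rgamma(1, 2, 1), function(x) dgamma(x, 2, 1),
                 function() rgamma(1, 2.5, 1.3), function(x) dgamma(x, 2.5, 1.3))
```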
Proof of Proposition 3.2.
CRN coupling result.
To prove (12), it suffices to show that under CRN,
(169) |
almost surely for all . Henceforth, to prove (169) we work component-wise and drop the subscripts for ease of notation. For the Horseshoe prior, we then have on . The cumulative distribution function is then given by , where is the exponential integral function. Therefore, under the CRN coupling, we obtain
(170) |
for . Then
(171)–(177)
Therefore, as and by the mean value theorem we obtain
(178) |
almost surely as required for (169). Finally, (169) implies (12) with and for all .
Maximal coupling result.
To prove (13), take without loss of generality. For , denote the event
where are i.i.d. random variables jointly distributed according to a maximal coupling. By component-wise independence,
(179) |
for . Henceforth, we focus on this quantity and drop the subscripts for ease of notation. For the Horseshoe prior, the marginal densities of and are given by
(180) |
Consider a maximal coupling of given and given with independent residuals [Johnson, 1998]. Let . Then with probability . Otherwise with probability , where and are independent random variables on with densities
(181) | ||||
(182) |
respectively. Note that if and only if for constant . Therefore, has support and has support . We obtain,
(183)–(190)
We now consider this upper bound on as . Note that for fixed with ,
(191)–(193)
Therefore, there exists some constant such that
(194) |
for all . By (190) and (194), we obtain
(195)–(197)
For , now take for any fixed constant . Recall , such that . Then
(198) |
for any and some constant which does not depend on . Therefore,
(199) |
for . Recall that . We obtain
(200)–(207)
Therefore, . By (199),
(208) |
The case follows by symmetry to give (13). ∎
Proof of Proposition 3.3.
Fix some . For any , let denote the component-wise full conditional density of given . This is given by
(209) |
on , where is the normalization constant. We will first show
(210) |
By a generalization of Pinsker’s inequality [e.g. van Erven and Harremos, 2014], we obtain
(211) |
for all , where
is the Rényi divergence between probability densities and with respect to the Lebesgue measure. For , we obtain
(212) |
Let . Then for , note that
(213) |
such that is a decreasing convex function. For , this gives
(214) | ||||
(215) |
Furthermore, note that by Lemmas C.4 and C.5,
(216) |
where corresponds to the confluent hypergeometric function of the second kind. By (214), we therefore obtain
(217)–(219)
where for . This gives
(220) |
for . The case follows by symmetry, giving (210).
Appendix D Additional Algorithms
1. Sample component-wise independently using Algorithm 4, targeting
(238)
2. Sample given as follows:
(a) Sample using Metropolis–Rosenbluth–Teller–Hastings with step-size (see the sketch after this list):

1. Sample .
2. Sample , where has unnormalized density on , with and . We can sample perfectly by setting
(242)
and where is the cumulative distribution function of a distribution.

1. Sample ,
2. Let , . Calculate for . Let . Set

1. Sample , .
2. Let and , and and .
3. Calculate and where
Let and , and set and .
return
(243)–(246)
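As referenced in the listing above, here is a hedged sketch of a random-walk Metropolis–Rosenbluth–Teller–Hastings update of the global parameter on the log scale; the conditional log-density is a placeholder, and the actual target and step-size used by the algorithm are not reproduced here.

```r
# Hedged sketch of the Metropolis-Rosenbluth-Teller-Hastings update of the global
# parameter referenced in the listing above, written as a random-walk proposal on
# the log scale. log_target is a placeholder for the (unnormalized) conditional
# log-density, which is not reproduced here.
mrth_update_log_scale <- function(xi, log_target, step_size) {
  xi_prop <- exp(log(xi) + step_size * rnorm(1))
  # acceptance ratio includes the log-Jacobian of the log transformation
  log_alpha <- log_target(xi_prop) - log_target(xi) + log(xi_prop) - log(xi)
  if (log(runif(1)) < log_alpha) xi_prop else xi
}
```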
Appendix E Algorithm Derivations
Gibbs sampler for Half-t priors (Algorithm 3).
Derivation of the blocked Gibbs sampling algorithm (Algorithm 3) is given in Johndrow et al. [2020] for the Horseshoe prior. For Half-t priors, it remains to check the validity of the slice sampler in Algorithm 4. Working component-wise, let denote the conditional density of given . Then,
(247) |
Let be the conditional density of the augmented random variable given , such that . Then, the slice sampler in Algorithm 4 corresponds to a Gibbs sampler targeting :
(248) | ||||
(249) |
Note that , where is the regularized incomplete lower Gamma function. For , we can now obtain the cumulative distribution function of , given by for . By the probability inverse transform, we can sample perfectly from , such that
(250) |
as in Step of Algorithm 4.
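As a generic illustration of this inverse-transform step, the sketch below draws exactly from a Gamma distribution truncated to an interval (0, upper); the shape, rate and truncation point are placeholders, not the exact quantities appearing in Algorithm 4.

```r
# Generic illustration of the inverse-transform step: exact sampling from a Gamma
# distribution truncated to (0, upper). The shape, rate and truncation point below
# are placeholders, not the exact quantities appearing in Algorithm 4.
rtrunc_gamma <- function(n, shape, rate, upper) {
  v <- runif(n)
  qgamma(v * pgamma(upper, shape, rate), shape, rate)  # F^{-1}(v * F(upper))
}

rtrunc_gamma(5, shape = 1.5, rate = 2.0, upper = 0.7)
```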
Perfect sampling of (Algorithm 10).
In Algorithm 3, we have
(251) |
for and the prior density on . Consider the eigenvalue decomposition of , such that for orthogonal and diagonal. Then,
(252) | ||||
(253) |
This allows the discretized perfect sampling of given using the probability inverse transform as in Algorithm 10, with the computational cost arising from the eigenvalue decomposition of .
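A generic sketch of such a discretized inverse-transform step is given below; the log-density is a placeholder, whereas Algorithm 10 evaluates the actual conditional of the global parameter via the eigenvalue decomposition described above.

```r
# Generic sketch of discretized inverse-transform sampling from a one-dimensional
# unnormalized density on a grid. The log-density below is a placeholder;
# Algorithm 10 evaluates the actual conditional of the global parameter via the
# eigenvalue decomposition described above.
sample_on_grid <- function(log_density_unnorm, grid) {
  logw <- log_density_unnorm(grid)
  w <- exp(logw - max(logw))                 # stabilized unnormalized weights
  probs <- w / sum(w)
  idx <- findInterval(runif(1), cumsum(probs)) + 1
  grid[min(idx, length(grid))]               # inverse transform on the grid
}

grid <- exp(seq(log(1e-6), log(1e6), length.out = 2000))  # log-spaced grid
sample_on_grid(function(x) dgamma(x, shape = 0.5, rate = 0.5, log = TRUE), grid)
```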
Perfect sampling of (Algorithm 11).
Working component-wise, let
(254) |
denote the conditional density of given . We can calculate the corresponding cumulative density function (see proof of Proposition B.1), given by
(255) |
where is defined as the lower incomplete confluent hypergeometric function of the second kind. Algorithm 11 and (246) directly follow from probability inverse transform.