
Coupling-based convergence assessment of some Gibbs samplers for high-dimensional Bayesian regression with shrinkage priors

Niloy Biswas Department of Statistics, Harvard University. Email: [email protected]    Anirban Bhattacharya Department of Statistics, Texas A&M University. Email: [email protected]    Pierre E. Jacob Department of Statistics, Harvard University. Email: [email protected]    James E. Johndrow Department of Statistics, The Wharton School, University of Pennsylvania. Email: [email protected]
Abstract

We consider Markov chain Monte Carlo (MCMC) algorithms for Bayesian high-dimensional regression with continuous shrinkage priors. A common challenge with these algorithms is the choice of the number of iterations to perform. This is critical when each iteration is expensive, as is the case when dealing with modern data sets, such as genome-wide association studies with thousands of rows and up to hundreds of thousands of columns. We develop coupling techniques tailored to the setting of high-dimensional regression with shrinkage priors, which enable practical, non-asymptotic diagnostics of convergence without relying on traceplots or long-run asymptotics. By establishing geometric drift and minorization conditions for the algorithm under consideration, we prove that the proposed couplings have finite expected meeting time. Focusing on a class of shrinkage priors which includes the “Horseshoe”, we empirically demonstrate the scalability of the proposed couplings. A highlight of our findings is that less than 1000 iterations can be enough for a Gibbs sampler to reach stationarity in a regression on 100,000 covariates. The numerical results also illustrate the impact of the prior on the computational efficiency of the coupling, and suggest the use of priors where the local precisions are Half-t distributed with degrees of freedom larger than one.

1 Introduction

1.1 Iterative computation in high dimensions

We consider the setting of high-dimensional regression where the number of observations n is smaller than the number of covariates p and the true signal is sparse. This problem formulation is ubiquitous in modern applications ranging from genomics to the social sciences. Optimization-based methods, such as the LASSO (Tibshirani, 1994) or Elastic Net (Zou and Hastie, 2005), allow sparse point estimates to be obtained even when the number of covariates is on the order of hundreds of thousands. More specifically, iterative optimization procedures to obtain these estimates are practical because the following conditions are met: 1) the cost per iteration scales favorably with the size of the input (n and p), 2) the number of iterations to convergence also scales favorably, and 3) there are reliable stopping criteria to detect convergence.

In the Bayesian paradigm, Markov chain Monte Carlo (MCMC) methods are commonly used to sample from the posterior distribution. Default MCMC implementations, for example using general-purpose software such as Stan (Carpenter et al., 2017), can lead to high computational costs in high-dimensional settings, both per iteration and in terms of the number of iterations required to reach convergence. In the setting of high-dimensional regressions, tailored algorithms can provide substantial improvements on both fronts (e.g. Bhattacharya et al., 2016; Nishimura and Suchard, 2018; Johndrow et al., 2020; Bhattacharya and Johndrow, 2021). Comparatively less attention has been paid to the design of reliable stopping criteria. Stopping criteria for MCMC, such as univariate effective sample size (ESS, e.g. Vats et al., 2019), or the R̂ convergence diagnostic (e.g. Vats and Knudson, 2020), rely on the asymptotic behavior of the chains as time goes to infinity, and thus effectively ignore the non-asymptotic “burn-in” bias, which vanishes as time progresses. This is acceptable in situations where we have solid a priori estimates of the burn-in period; otherwise the lack of non-asymptotic stopping criteria poses an important practical problem. With Bayesian analysis of high-dimensional data sets, MCMC algorithms can require seconds or minutes per iteration, and thus any non-asymptotic insight on the number of iterations to perform is highly valuable.

Diagnostics of convergence are particularly useful when considering the number of factors that affect the performance of MCMC algorithms for Bayesian regressions. Beyond the size of the data set, the specification of the prior and the (much less explicit) signal-to-noise ratio in the data all have an impact on the convergence of MCMC algorithms. Across applications, the number of iterations required for the chain to converge varies by multiple orders of magnitude. This could lead users to either waste computational resources by running overly long chains, or, worrisomely, to base their analysis on chains that have not converged. This manuscript proposes a concrete method to avoid these pitfalls in the case of a Gibbs sampler for Bayesian regression with heavy-tailed shrinkage priors. Specifically, we follow the approach of Glynn and Rhee (2014); Jacob et al. (2020); Biswas et al. (2019) and use coupled lagged chains to monitor convergence.

1.2 Bayesian shrinkage regression with Half-t(ν) priors

Consider Gaussian linear regression with n observations and p covariates, with a focus on the high-dimensional setting where n ≪ p. The likelihood is given by

L(\beta,\sigma^{2};X,y)=\frac{1}{(2\pi\sigma^{2})^{n/2}}\exp\Big(-\frac{1}{2\sigma^{2}}\|y-X\beta\|^{2}\Big),    (1)

where ‖·‖ denotes the L₂ norm, X ∈ ℝ^{n×p} is the observed design matrix, y ∈ ℝ^n is the observed response vector, σ² > 0 is the unknown Gaussian noise variance, and β ∈ ℝ^p is the unknown signal vector that is assumed to be sparse. We study hierarchical Gaussian scale-mixture priors on (β, σ²) given by

\beta_{j}\mid\sigma^{2},\xi,\eta\underset{j=1,\ldots,p}{\overset{ind}{\sim}}\mathcal{N}\Big(0,\frac{\sigma^{2}}{\xi\eta_{j}}\Big),\quad \xi\sim\pi_{\xi}(\cdot),\quad \eta_{j}\underset{j=1,\ldots,p}{\overset{\text{i.i.d.}}{\sim}}\pi_{\eta}(\cdot),\quad \sigma^{2}\sim\mathrm{InvGamma}\Big(\frac{a_{0}}{2},\frac{b_{0}}{2}\Big),    (2)

where a₀, b₀ > 0, and π_ξ(·), π_η(·) are continuous densities on ℝ_{>0}. Such global-local mixture priors induce approximate sparsity, where the components of β can be arbitrarily close to zero, but never exactly zero. This is in contrast to point-mass mixture priors (e.g. Johnson and Rossell, 2012), where some components of β can be exactly zero a posteriori. The global precision parameter ξ relates to the number of signals, and we use ξ^{-1/2} ∼ Cauchy₊(0, 1) throughout. The local precision parameters η_j determine which components of β are null.

We focus on the Half-t(ν) prior family for the local scale parameter, η_j^{-1/2} ∼ t₊(ν) for ν ≥ 1, where a t₊(ν) distribution has density proportional to (1 + x²/ν)^{-(ν+1)/2} 𝟙_{(0,∞)}(x). The induced prior on the local precision η_j has density

\pi_{\eta}(\eta_{j})\propto\frac{1}{\eta_{j}^{\frac{2-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}\,\mathbbm{1}_{(0,\infty)}(\eta_{j}).    (3)

The case ν = 1 corresponds to the Horseshoe prior (Carvalho et al., 2009, 2010) and has received overwhelming attention in the literature among this prior class; see Bhadra et al. (2019) for a recent overview. The use of Half-t(ν) priors on scale parameters in hierarchical Gaussian models was popularized by Gelman (2006). However, the subsequent literature has largely gravitated towards the default choice (Polson and Scott, 2012) of ν = 1, in part because a convincing argument for preferring a different value of ν has been absent to date. Here we give empirical evidence that Half-t(ν) priors yield statistical estimation properties similar to the Horseshoe prior, whilst leading to significant computational gains when ν > 1. We note that Ghosh and Chakrabarti (2017) established optimal posterior concentration in the Normal means problem for a broad class of priors on the local scale that includes the Half-t(ν) priors, extending earlier work by van der Pas et al. (2014) for the Horseshoe; this provides theoretical support for the comparable performance we observe with ν = 1 and ν > 1 in simulations.
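As a concrete illustration (and since our experiments initialize chains from the prior distribution), a draw from the prior (2) with Half-t(ν) local scales and a half-Cauchy global scale can be sketched in R as follows; the function name and the default a₀ = b₀ = 1 are illustrative and not part of our released scripts.

# Sketch: one draw from prior (2) with Half-t(nu) local scales and a
# half-Cauchy global scale (xi^{-1/2} ~ Cauchy_+(0,1)); illustrative only.
rprior <- function(p, nu, a0 = 1, b0 = 1) {
  xi <- abs(rt(1, df = 1))^(-2)        # xi^{-1/2} ~ half-Cauchy
  eta <- abs(rt(p, df = nu))^(-2)      # eta_j^{-1/2} ~ half-t(nu)
  sigma2 <- 1 / rgamma(1, shape = a0 / 2, rate = b0 / 2)  # InvGamma(a0/2, b0/2)
  beta <- rnorm(p, mean = 0, sd = sqrt(sigma2 / (xi * eta)))
  list(beta = beta, eta = eta, sigma2 = sigma2, xi = xi)
}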

Our stopping criterion relies on couplings. Couplings have long been used to analyze the convergence of MCMC algorithms, while methodological implementations have been rare. The work of Johnson (1996) pioneered the use of couplings for practical diagnostics of convergence, but relied on the assumption of some similarity between the initial distribution and the target, which can be hard to check. Here we instead follow the approach of Glynn and Rhee (2014); Jacob et al. (2020) based on couplings of lagged chains, applied to the question of convergence in Biswas et al. (2019). That approach makes no assumptions on the closeness of the initial distribution to the target, and in our experiments we initialize all chains independently from the prior distribution.

The approach requires the ability to sample pairs of Markov chains such that 1) each chain marginally follows a prescribed MCMC algorithm and 2) the two chains “meet” after a random –but almost surely finite– number of iterations, called the meeting time. When the chains meet, i.e. when all of their components become identical, the user can stop the procedure. From the output the user can obtain unbiased estimates of expectations with respect to the target distribution, which can be generated in parallel and averaged, as well as estimates of the finite-time bias of the underlying MCMC algorithm. Crucially, this information is retrieved from the distribution of the meeting time, which is an integer-valued random variable irrespective of the dimension of the problem, and thus provides a convenient summary of the performance of the algorithm. A difficulty inherent to the approach is the design of an effective coupling strategy for the algorithm under consideration. High-dimensional parameter spaces add a substantial layer of complication as some of the simpler coupling strategies, that are solely based on maximal couplings, lead to prohibitively long meeting times.

1.3 Our contribution

In this paper we argue that couplings offer a practical way of assessing the number of iterations required by MCMC algorithms in high-dimensional regression settings. Our specific contributions are summarized below.

We introduce blocked Gibbs samplers (Section 2) for Bayesian linear regression focusing on Half-t(ν) local shrinkage priors, extending the algorithm of Johndrow et al. (2020) for the Horseshoe. Our algorithm has an overall computation cost of 𝒪(n²p) per iteration, which is state-of-the-art for n ≪ p. We then design a one-scale coupling strategy for these Gibbs samplers (Section 2.3) and show that it results in meeting times that have exponential tails, which in turn validates their use in convergence diagnostics (Biswas et al., 2019). Our proof of exponentially-tailed meeting times is based on identifying a geometric drift condition, as a corollary to which we also establish geometric ergodicity of the marginal Gibbs sampler. Despite a decade of widespread use, MCMC algorithms for the Horseshoe and Half-t(ν) priors have generally not previously been shown to be geometrically ergodic, in contrast to MCMC algorithms for global-local models with exponentially-tailed local scale priors (Khare and Hobert, 2013; Pal and Khare, 2014). Our proofs utilize a uniform bound on generalized ridge regression estimates, which enables us to avoid a technical assumption of Johndrow et al. (2020), and auxiliary results on moments of distributions related to the confluent hypergeometric function of the second kind. Shortly after an earlier version of this manuscript was posted, Bhattacharya et al. (2021) have also established geometric ergodicity of related Gibbs samplers for the Horseshoe prior.

The meeting times resulting from the one-scale coupling increase exponentially as dimension increases. To address that issue we design improved coupling strategies (Section 3). We develop a two-scale strategy, based on a carefully chosen metric, which informs the choice between synchronous and maximal couplings. In numerical experiments in high dimensions this strategy leads to orders of magnitude shorter meeting times compared to the naïve approach. We establish some preliminary results in a stylized setting, providing a partial understanding of why the two-scale strategy leads to the observed drastic reduction in meeting times. We also describe other couplings that do not require explicit calculation of a metric between the chains, have exponentially-tailed meeting times, and empirically result in similar meeting times as the two-scale coupling. For our two-scale coupling, we further note that priors with stronger shrinkage towards zero can give significantly shorter meeting times in high dimensions. This motivates usage of Half-t(ν) priors with greater degrees of freedom ν > 1, which can give similar statistical performance to the Horseshoe (ν = 1) whilst enjoying orders of magnitude computational improvements.

Finally, we demonstrate (Section 4) that the proposed coupled MCMC algorithm is applicable to big data settings. Figure 1 displays an application of our coupling strategy to monitor the convergence of the Gibbs sampler for Bayesian regression on a genome-wide association study (GWAS) dataset with (n, p) ≈ (2000, 100000). Our method suggests that a burn-in of just 700 iterations can suffice even for such a large problem, and confirms the applicability of coupled MCMC algorithms for practical high-dimensional inference.

Refer to caption
Figure 1: Upper bounds on the total variation distance between the marginal distributions of the chain and the target distribution for Bayesian regression on a GWAS dataset with (n, p) ≈ (2000, 100000).

Scripts in R (R Core Team, 2013) are available from https://github.com/niloyb/CoupledHalfT to reproduce the figures of the article.

2 Coupled MCMC for regression with Half-t(ν) priors

We develop MCMC techniques for Bayesian shrinkage regression with Half-t(ν) priors. The model is given by (1)–(2) with η_j^{-1/2} ∼ t₊(ν) for some ν ≥ 1 as in (3). Posterior distributions resulting from such heavy-tailed shrinkage priors have desirable statistical properties, but their features pose challenges to general-purpose MCMC algorithms. Characterization of the marginal prior and posterior densities of β on ℝ^p given ξ and σ², and further comments on these challenges, are presented in Appendix A. Specifically, the resulting posterior distributions present 1) multimodality, 2) heavy tails and 3) infinite density at zero. This hints at a trade-off between statistical accuracy and computational feasibility, since the very features that present computational difficulties are crucial for optimal statistical performance across sparsity levels and signal strengths.

2.1 A blocked Gibbs sampler for Half-t(ν) priors

Blocked Gibbs samplers are popularly used for Bayesian regression with global-local shrinkage priors, due to the analytical tractability of full conditional distributions (Park and Casella, 2008; Carvalho et al., 2010; Polson et al., 2014; Bhattacharya et al., 2015; Makalic and Schmidt, 2016). Several convenient blocking and marginalization strategies are possible, leading to conditionals that are easy to sample from. For the case of the Horseshoe prior (ν = 1), Johndrow et al. (2020) have developed exact and approximate MCMC algorithms for high-dimensional settings, building on the algorithms of Polson et al. (2014) and Bhattacharya et al. (2016). We extend that sampler to general Half-t(ν) priors, and summarize it in (4). Given some initial state (β₀, η₀, σ₀², ξ₀) ∈ ℝ^p × ℝ^p_{>0} × ℝ_{>0} × ℝ_{>0}, it generates a Markov chain (β_t, η_t, σ_t², ξ_t)_{t≥0} which targets the posterior corresponding to Half-t(ν) priors.

\begin{split}
\pi(\eta_{t+1}|\beta_{t},\sigma_{t}^{2},\xi_{t}) &\propto \prod_{j=1}^{p}\frac{e^{-m_{t,j}\eta_{t+1,j}}}{\eta_{t+1,j}^{\frac{1-\nu}{2}}(1+\nu\eta_{t+1,j})^{\frac{\nu+1}{2}}} \quad\text{for } m_{t,j}=\frac{\xi_{t}\beta_{t,j}^{2}}{2\sigma_{t}^{2}},\\
\pi(\xi_{t+1}|\eta_{t+1}) &\propto L(y|\xi_{t+1},\eta_{t+1})\,\pi_{\xi}(\xi_{t+1}),\\
\sigma_{t+1}^{2}|\eta_{t+1},\xi_{t+1} &\sim \mathrm{InvGamma}\Big(\frac{a_{0}+n}{2},\frac{y^{T}M^{-1}y+b_{0}}{2}\Big),\\
\beta_{t+1}|\eta_{t+1},\xi_{t+1},\sigma_{t+1}^{2} &\sim \mathcal{N}(\Sigma^{-1}X^{T}y,\ \sigma_{t+1}^{2}\Sigma^{-1}).
\end{split}    (4)

Above, L(y|ξ_{t+1}, η_{t+1}) denotes the marginal likelihood of y given ξ_{t+1} and η_{t+1}, and we use the notation M = I_n + ξ_{t+1}^{-1} X Diag(η_{t+1}^{-1}) X^T and Σ = X^T X + ξ_{t+1} Diag(η_{t+1}). We sample η_{t+1} | β_t, σ_t², ξ_t component-wise independently using slice sampling, and sample ξ_{t+1} | η_{t+1} using Metropolis–Rosenbluth–Teller–Hastings with step-size σ_MRTH (Hastings, 1970). Full details of our sampler are in Algorithms 3 and 4 of Appendix D, and derivations are in Appendix E.
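For intuition, one slice-sampling update of a single component η_j given m_j can be sketched in R as below; the inverse-CDF draw from the truncated Gamma is an implementation detail assumed here, not necessarily the one used in Algorithm 3.

# Sketch: one slice-sampling update for eta_j | m_j, targeting the density
# proportional to eta^{(nu-1)/2} (1 + nu*eta)^{-(nu+1)/2} exp(-m*eta) on (0, Inf).
slice_update_eta <- function(eta, m, nu) {
  u <- runif(1, 0, (1 + nu * eta)^(-(nu + 1) / 2))  # auxiliary slice variable
  upper <- (u^(-2 / (nu + 1)) - 1) / nu             # the slice is (0, upper)
  shape <- (nu + 1) / 2
  # inverse-CDF draw from Gamma(shape, rate = m) truncated to (0, upper)
  qgamma(runif(1) * pgamma(upper, shape, rate = m), shape, rate = m)
}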

Sampling η_{t+1} | β_t, σ_t², ξ_t component-wise independently is an 𝒪(p) cost operation. Sampling ξ_{t+1} | η_{t+1} requires evaluating L(y|ξ_{t+1}, η_{t+1}), which involves an 𝒪(n²p) cost operation (calculating the weighted matrix cross-product X Diag(η_{t+1})^{-1} X^T) and an 𝒪(n³) cost operation (calculating y^T M^{-1} y and |M| using a Cholesky decomposition). Sampling σ_{t+1}² | η_{t+1}, ξ_{t+1} is only an 𝒪(1) cost operation, as y^T M^{-1} y is pre-calculated. Similarly, as M^{-1} is pre-calculated, sampling β_{t+1} | η_{t+1}, ξ_{t+1}, σ_{t+1}² using the algorithm of Bhattacharya et al. (2016) for structured multivariate Gaussians only involves an 𝒪(np) and an 𝒪(n²) cost operation. In high-dimensional settings with p ≫ n, which is our focus, the weighted matrix cross-product calculation is the most expensive, and overall the sampler described in (4) has 𝒪(n²p) computation cost.
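To make this cost accounting concrete, a minimal R sketch of the structured multivariate Gaussian update for β (in the spirit of Bhattacharya et al., 2016), reusing a pre-computed M^{-1}, could look as follows; the function and argument names are illustrative.

# Sketch: draw beta | eta, xi, sigma2 ~ N(Sigma^{-1} X^T y, sigma2 * Sigma^{-1})
# with Sigma = X^T X + xi * Diag(eta), reusing M_inv = (I_n + X Diag(d) X^T)^{-1}
# where d = 1 / (xi * eta); costs O(np) + O(n^2) once M_inv is available.
sample_beta_structured <- function(X, y, eta, xi, sigma2, M_inv) {
  n <- nrow(X); p <- ncol(X)
  d <- 1 / (xi * eta)
  u <- rnorm(p, sd = sqrt(d))
  v <- as.numeric(X %*% u) + rnorm(n)
  w <- as.numeric(M_inv %*% (y / sqrt(sigma2) - v))
  sqrt(sigma2) * (u + d * as.numeric(crossprod(X, w)))
}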

2.2 Coupling the proposed Gibbs sampler

We develop couplings for the blocked Gibbs sampler presented in Section 2.1. We will use these couplings to generate coupled chains with a time lag L ≥ 1, in order to implement the diagnostics of convergence proposed in Biswas et al. (2019). Consider an L-lag coupled chain (C_t, C̃_{t-L})_{t≥L} with meeting time τ := inf{t ≥ L : C_t = C̃_{t-L}}. Here C_t and C̃_t denote the full states (β_t, η_t, σ_t², ξ_t) and (β̃_t, η̃_t, σ̃_t², ξ̃_t) respectively. For the coupling-based methods proposed in Biswas et al. (2019); Jacob et al. (2020) to be most practical, the meeting times should occur as early as possible, under the constraint that both marginal chains (C_t)_{t≥0} and (C̃_t)_{t≥0} start from the same initial distribution, and that they both marginally evolve according to the blocked Gibbs sampler in (4).

Briefly, we recall how L-lag coupled chains enable the estimation of upper bounds on TV(π_t, π). Since C_t = C̃_{t-L} for all t ≥ τ, the indicator 𝟙(C_t ≠ C̃_{t-L}) can be evaluated for all t as soon as we observe τ. Such indicators are estimators of ℙ(C_t ≠ C̃_{t-L}), which themselves are upper bounds on TV(π_t, π_{t-L}), since C_t ∼ π_t and C̃_{t-L} ∼ π_{t-L}. We can then obtain upper bounds on TV(π_t, π) by the triangle inequality: TV(π_t, π) ≤ 𝔼[∑_{j≥0} 𝟙(C_{t+(j+1)L} ≠ C̃_{t+jL})]. This can be justified more formally if the meeting times are exponentially-tailed (Biswas et al., 2019). Such a tail condition also ensures that the expected computation time of unbiased estimators as in Jacob et al. (2020), generated independently in parallel, scales at a logarithmic rate in the number of processors. Thus we will be particularly interested in couplings that result in exponentially-tailed meeting times.
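In practice the upper bound above reduces to a simple function of the meeting time: for a single L-lag coupled pair, ∑_{j≥0} 𝟙(C_{t+(j+1)L} ≠ C̃_{t+jL}) = max(0, ⌈(τ − L − t)/L⌉), and averaging this quantity over independently simulated pairs estimates the bound. A minimal R sketch, assuming a vector of independently simulated meeting times, is:

# Sketch: estimate the upper bound on TV(pi_t, pi) from independent L-lag
# meeting times, using max(0, ceiling((tau - L - t) / L)) for each coupled pair.
tv_upper_bound <- function(meeting_times, L, t) {
  mean(pmax(0, ceiling((meeting_times - L - t) / L)))
}
# Example usage over a grid of iterations t:
# sapply(0:1000, function(t) tv_upper_bound(meeting_times, L = 1, t))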

In Section 2.3, we consider an algorithm based on maximal couplings. We show that the ensuing meeting times have exponential tails, and discuss how our analysis directly implies that the marginal chain introduced in Section 2.1 is geometrically ergodic. We then illustrate difficulties encountered by such a scheme in high dimensions, which motivates the alternative strategies described in Section 3. For simplicity, we mostly omit the lag L from the notation.

2.3 One-scale coupling

For the blocked Gibbs sampler in Section 2.1, we first consider a coupled MCMC algorithm that attempts exact meetings at every step. We will apply a maximal coupling algorithm with independent residuals (see Thorisson (2000, Chapter 1 Section 4.4) and Johnson (1998)). It is included in Algorithm 5 of Appendix D, and has an expected computation cost of two units (Johnson, 1998).
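For reference, a generic maximal coupling with independent residuals of two univariate distributions can be sketched in R as below, given samplers and normalized densities of the two marginals; this mirrors the construction used in Algorithm 5 but is only an illustrative sketch, not the implementation in our released scripts.

# Sketch: maximal coupling with independent residuals (Johnson, 1998) of two
# distributions given by samplers rp, rq and (normalized) densities dp, dq.
coupled_max <- function(rp, dp, rq, dq) {
  x <- rp()
  if (runif(1) * dp(x) <= dq(x)) {
    return(list(x = x, y = x, met = TRUE))   # exact meeting
  }
  repeat {                                   # otherwise sample from the residual of q
    ystar <- rq()
    if (runif(1) * dq(ystar) > dp(ystar)) {
      return(list(x = x, y = ystar, met = FALSE))
    }
  }
}
# Example (Step 2(b) flavor): maximal coupling of two Inverse Gamma draws,
# using InvGamma(a, b) = 1 / Gamma(shape = a, rate = b).
# coupled_max(function() 1 / rgamma(1, 2, rate = 1),
#             function(x) dgamma(1 / x, 2, rate = 1) / x^2,
#             function() 1 / rgamma(1, 2, rate = 1.5),
#             function(x) dgamma(1 / x, 2, rate = 1.5) / x^2)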

Our initial coupled MCMC kernel is given in Algorithm 1, which we refer to as a one-scale coupling. That is, before the chains have met (when C_t ≠ C̃_t), the coupled kernel in Steps (1) and (2)(a-c) does not explicitly depend on the distance between the states C_t and C̃_t. After meeting, the coupled chains remain together by construction, such that C_t = C̃_t implies C_{t+1} = C̃_{t+1}. When C_t ≠ C̃_t, Step (1) uses the coupled slice sampler of Algorithm 2 component-wise for (η_{t+1}, η̃_{t+1}), which allows each pair of components (η_{t+1,j}, η̃_{t+1,j}) to meet exactly with positive probability. In Algorithm 2, we use a common random numbers or “synchronous” coupling of the auxiliary random variables (U_{j,*}, Ũ_{j,*}), and alternative couplings could be considered. Steps (2)(a-b) are maximal couplings of the conditional sampling steps for (ξ, ξ̃) and (σ², σ̃²), such that (σ_{t+1}², ξ_{t+1}, η_{t+1}) = (σ̃_{t+1}², ξ̃_{t+1}, η̃_{t+1}) occurs with positive probability for all t ≥ 0. Step (2)(c) uses common random numbers, such that (σ_{t+1}², ξ_{t+1}, η_{t+1}) = (σ̃_{t+1}², ξ̃_{t+1}, η̃_{t+1}) implies β_{t+1} = β̃_{t+1}. This allows the full chains to meet exactly with positive probability at every step.

Input: C_t := (β_t, η_t, σ_t², ξ_t) and C̃_t := (β̃_t, η̃_t, σ̃_t², ξ̃_t).
if C_t = C̃_t then sample C_{t+1} | C_t as in (4) and set C̃_{t+1} = C_{t+1}.
else

  1. Sample (η_{t+1}, η̃_{t+1}) | C_t, C̃_t component-wise independently using Algorithm 2.

  2. Sample ((ξ_{t+1}, σ_{t+1}², β_{t+1}), (ξ̃_{t+1}, σ̃_{t+1}², β̃_{t+1})) given η_{t+1}, η̃_{t+1} as follows:

    (a) Sample (ξ_{t+1}, ξ̃_{t+1}) | η_{t+1}, η̃_{t+1}, ξ_t, ξ̃_t using coupled Metropolis–Rosenbluth–Teller–Hastings (Algorithm 9 in Appendix D).

    (b) Sample (σ_{t+1}², σ̃_{t+1}²) | ξ_{t+1}, η_{t+1}, ξ̃_{t+1}, η̃_{t+1} from a maximal coupling of two Inverse Gamma distributions (Algorithm 5 in Appendix D).

    (c) Sample (β_{t+1}, β̃_{t+1}) | σ_{t+1}², ξ_{t+1}, η_{t+1}, σ̃_{t+1}², ξ̃_{t+1}, η̃_{t+1} using common random numbers for Gaussian scale mixture priors (Algorithm 8 in Appendix D).

return (C_{t+1}, C̃_{t+1}).
Algorithm 1: A one-scale coupled MCMC kernel for Half-t(ν) priors.
Result: A coupling of slice samplers marginally targeting p(·|m_j) and p(·|m̃_j), where p(η_j | m) ∝ (η_j^{(1-ν)/2} (1 + νη_j)^{(ν+1)/2})^{-1} e^{-mη_j} on (0, ∞).
Input: η_{t,j}, η̃_{t,j}, m_{t,j} := ξ_t β_{t,j}²/(2σ_t²), m̃_{t,j} := ξ̃_t β̃_{t,j}²/(2σ̃_t²) > 0.

  1. Sample U^{crn}_{j,*} ∼ Uniform(0, 1), and set U_{j,*} := U^{crn}_{j,*} (1 + νη_{t,j})^{-(ν+1)/2} and Ũ_{j,*} := U^{crn}_{j,*} (1 + νη̃_{t,j})^{-(ν+1)/2}.

  2. Sample (η_{t+1,j}, η̃_{t+1,j}) | U_{j,*}, Ũ_{j,*} from a maximal coupling of distributions P_j and P̃_j using Algorithm 5 in Appendix D, where P_j and P̃_j have unnormalized densities η ↦ η^{s-1} e^{-m_j η} and η ↦ η^{s-1} e^{-m̃_j η} on (0, T_j) and (0, T̃_j) respectively, with T_j = (U_{j,*}^{-2/(1+ν)} − 1)/ν, T̃_j = (Ũ_{j,*}^{-2/(1+ν)} − 1)/ν and s = (1+ν)/2.

return (η_{t+1,j}, η̃_{t+1,j}).
Algorithm 2: Coupled slice sampler for Half-t(ν) priors using maximal couplings.

Under the one-scale coupling, we prove that the meeting times have exponential tails and hence finite expectation. Our analysis is based on identifying a suitable drift function for a variant of the marginal chain in Section 2.1 and an application of Jacob et al. (2020, Proposition 4). We assume that the global shrinkage prior π_ξ(·) has a compact support on (0, ∞). Such compactly supported priors for ξ have been recommended by van der Pas et al. (2017) to achieve optimal posterior concentration for the Horseshoe in the Normal means model. For simplicity, we also assume that each component η_{t+1,j} | β_{t,j}, ξ_t, σ_t² and ξ_{t+1} | η_{t+1} in (4) is sampled perfectly, instead of the slice sampling and Metropolis–Rosenbluth–Teller–Hastings steps respectively. Perfect sampling algorithms for each component η_{t+1,j} | β_{t,j}, ξ_t, σ_t² and for ξ_{t+1} | η_{t+1} are provided in Algorithms 10 and 11 of Appendix D for completeness; see also Bhattacharya and Johndrow (2021) for perfect sampling of the components of η_{t+1,j} | β_{t,j}, ξ_t, σ_t² for the Horseshoe. Perfectly sampling ξ_{t+1} | η_{t+1} and η_{t+1,j} | β_{t,j}, ξ_t, σ_t² retains the 𝒪(n²p) computational cost for the full blocked Gibbs sampler, though in practice this implementation is more expensive per iteration due to the computation of eigenvalue decompositions in lieu of Cholesky decompositions, and the computation of the inverse of the confluent hypergeometric function of the second kind.

Proposition 2.1 gives a geometric drift condition for this variant of the blocked Gibbs sampler.

Proposition 2.1.

(Geometric drift). Consider the Markov chain (β_t, ξ_t, σ_t²)_{t≥0} generated from the blocked Gibbs sampler in (4), where each component η_{t+1,j} | β_{t,j}, ξ_t, σ_t² and ξ_{t+1} | η_{t+1} is sampled perfectly, such that (β_t, ξ_t, σ_t²) ↦ η_{t+1} ↦ (β_{t+1}, ξ_{t+1}, σ_{t+1}²) with η_{t+1} intermediary for each t ≥ 0. Assume the global shrinkage prior π_ξ(·) has a compact support. For m_j = ξβ_j²/(2σ²), c ∈ (0, 1/2) and d ∈ (0, 1), define

V(\beta,\xi,\sigma^{2})=\sum_{j=1}^{p}m_{j}^{-c}+m_{j}^{d}.    (5)

Then for each ν ≥ 1, there exist some c ∈ (0, 1/2), d ∈ (0, 1) such that V is a drift function, i.e., there exist γ ∈ (0, 1) and K ∈ (0, ∞) such that for all β_t ∈ ℝ^p, ξ_t > 0 and σ_t² > 0,

\mathbb{E}[V(\beta_{t+1},\xi_{t+1},\sigma_{t+1}^{2})|\beta_{t},\xi_{t},\sigma_{t}^{2}]\leq\gamma V(\beta_{t},\xi_{t},\sigma_{t}^{2})+K.    (6)

The drift (or “Lyapunov”) function in (5) approaches infinity when any β_j approaches the origin or infinity. This ensures that the corresponding sub-level sets, defined by S(R) = {(β, ξ, σ²) ∈ ℝ^p × ℝ_{>0} × ℝ_{>0} : V(β, ξ, σ²) < R} for R > 0, exclude the pole at the origin and are bounded. Such a geometric drift condition, in conjunction with Jacob et al. (2020, Proposition 4), helps verify that the meeting times under the one-scale coupling have exponential tails and hence finite expectation.
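As a small numerical aid, the drift function (5) can be evaluated directly; the exponent values in the R sketch below are illustrative choices within the admissible ranges of Proposition 2.1.

# Sketch: evaluate the drift function V in (5) for given (beta, xi, sigma2);
# the exponents c in (0, 1/2) and d in (0, 1) are illustrative choices.
drift_V <- function(beta, xi, sigma2, c = 0.25, d = 0.5) {
  m <- xi * beta^2 / (2 * sigma2)
  sum(m^(-c) + m^d)
}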

Proposition 2.2.

Consider the blocked Gibbs sampler in (4). In the setup of Proposition 2.1, assume that the global shrinkage prior π_ξ(·) has a compact support. Write Z_t = (β_t, ξ_t, σ_t²) and Z̃_t = (β̃_t, ξ̃_t, σ̃_t²). Consider the one-scale coupling given by (Z_t, Z̃_t) ↦ (η_{t+1}, η̃_{t+1}) ↦ (Z_{t+1}, Z̃_{t+1}), where (η_{t+1}, η̃_{t+1}) are maximally coupled component-wise and (Z_{t+1}, Z̃_{t+1}) are coupled using common random numbers. Denote the meeting time by τ := inf{t ≥ 0 : Z_t = Z̃_t}. Then ℙ(τ > t) ≤ A₀κ₀^t for some constants A₀ ∈ (0, ∞) and κ₀ ∈ (0, 1), and for all t ≥ 0.

Exponentially-tailed meeting times immediately imply that the blocked Gibbs sampler in Section 2.1 is geometrically ergodic. This follows from noting that Jacob et al. (2020, Proposition 4) is closely linked to a minorization condition (Rosenthal, 1995), and details are included in Proposition C.10 of Appendix C. For a class of Gibbs samplers targeting shrinkage priors including the Bayesian LASSO, the Normal-Gamma prior (Griffin and Brown, 2010), and the Dirichlet-Laplace prior (Bhattacharya et al., 2015), Khare and Hobert (2013) and Pal and Khare (2014) have proven geometric ergodicity based on drift and minorization arguments (Meyn and Tweedie, 1993; Roberts and Rosenthal, 2004). For the Horseshoe prior, Johndrow et al. (2020) have recently established geometric ergodicity. In the high-dimensional setting with p > n for the Horseshoe, the proof of Johndrow et al. (2020) required truncating the prior on each η_j below by a small ℓ > 0 to guarantee the uniform bound η_j ≥ ℓ. Our proof of geometric ergodicity for Half-t(ν) priors, including the Horseshoe (ν = 1), works in both low and high-dimensional settings, without any modification of the priors on the η_j. In parallel work, Bhattacharya et al. (2021) have also established geometric ergodicity for the Horseshoe prior without requiring such a truncation.

Remark 2.3.

Our proofs of exponentially-tailed meeting times and geometric ergodicity generalize to a larger class of priors satisfying some moment conditions on the full conditionals of η_j. Consider a compactly supported prior π_ξ on ξ, and a prior π_η on each η_j. Then the unnormalized density of the full conditional of each η_j is given by π(η_j | m_j = ξβ_j²/(2σ²)) ∝ η_j^{1/2} e^{-m_j η_j} π_η(η_j). Consider the following assumptions on π_η.

  1. For some 0 < c < 1/2, there exist 0 < ϵ < Γ(1/2 − c)^{-1}√π and K^{(1)}_{c,ϵ} < ∞ such that 𝔼[η_j^c | m_j] ≤ ϵ m_j^{-c} + K^{(1)}_{c,ϵ} for all m_j > 0.

  2. For some 0 < d < 1, there exist 0 < ϵ < 2^d and K^{(2)}_{d,ϵ} < ∞ such that 𝔼[η_j^{-d} | m_j] ≤ ϵ m_j^d + K^{(2)}_{d,ϵ} for all m_j > 0.

Then the blocked Gibbs sampler in (4) is geometrically ergodic.

We now consider the performance of Algorithm 1 on synthetic datasets. In particular, we empirically illustrate that the rate κ₀ in Proposition 2.2 can tend to 1 exponentially fast as the dimension p increases. The design matrix X ∈ ℝ^{n×p} is generated with entries [X]_{i,j} i.i.d. 𝒩(0, 1), and a response vector as y ∼ 𝒩(Xβ_*, σ_*² I_n), where β_* ∈ ℝ^p is the true signal and σ_* ≥ 0 is the true error standard deviation. We choose β_* to be sparse: given a sparsity parameter s, β_{*,j} = 2^{(9−j)/4} for 1 ≤ j ≤ s and β_{*,j} = 0 for all j > s. Figure 2 shows the meeting times τ of coupled Markov chains (with a lag L = 1) targeting the Horseshoe posterior (ν = 1) with a₀ = b₀ = 1 under Algorithm 1. We consider n = 100, s = 10, σ_* = 0.5 and vary the dimension p. For each dimension p, we simulate 100 meeting times under the one-scale coupling for independently generated synthetic datasets. We initialize chains independently from the prior distribution. For the ξ updates, we use a proposal step-size of σ_MRTH = 0.8 as in Johndrow et al. (2020). Figure 2, with a vertical axis on a logarithmic scale, shows that the meeting times increase exponentially and have heavier tails with increasing dimension.
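For reproducibility of this setup, one synthetic dataset can be generated along the following lines in R; the seed and the specific values of n, p, s below are illustrative.

# Sketch: generate one synthetic dataset as described above.
set.seed(1)
n <- 100; p <- 1000; s <- 10; sigma_star <- 0.5
X <- matrix(rnorm(n * p), nrow = n, ncol = p)              # [X]_{i,j} iid N(0,1)
beta_star <- c(2^((9 - (1:s)) / 4), rep(0, p - s))         # sparse true signal
y <- as.numeric(X %*% beta_star + sigma_star * rnorm(n))   # y ~ N(X beta_*, sigma_*^2 I_n)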

Refer to caption
Figure 2: Meeting times against dimension for posteriors associated with a Horseshoe prior under the one-scale coupling of Algorithm 1, with n = 100, s = 10, σ_* = 0.5.

3 Coupling strategies for high-dimensional settings

To address the scaling issues observed with the one-scale coupling in Section 2.3, we develop a two-scale coupling in Section 3.1 that delivers better empirical performance. We also offer some formal results supporting the empirical improvements. In Section 3.2 we develop other related coupling strategies which have exponentially-tailed meeting times and similar empirical performance to the two-scale coupling. In Section 3.3, we investigate the impact of the degrees of freedom ν of Half-t(ν) priors on our two-scale coupling. We find that Half-t(ν) priors with higher degrees of freedom ν > 1 can give similar statistical performance to the Horseshoe (ν = 1), whilst enjoying orders of magnitude computational improvements through shorter meeting times.

3.1 A two-scale coupling

We develop a two-scale coupling algorithm; our use of this term differs slightly from its usage in stochastic dynamics and in the analysis of pre-conditioned HMC (Bou-Rabee and Eberle, 2020). It is also closely linked to the delayed one-shot coupling of Roberts and Rosenthal (2002), and refers to coupled kernels that attempt exact meetings when the chains are close, and aim for a contraction when the chains are far. The motivation for this construction is that in Algorithm 1, the components of (η_t, η̃_t) that fail to meet instead evolve independently. As dimension grows, the number of components of (η_t, η̃_t) that fail to meet also tends to grow, making it increasingly difficult for all components to meet exactly on subsequent iterations. In the two-scale coupling, we only attempt to obtain exact meetings when the associated probability is large enough. This is done by constructing a metric d and a corresponding threshold parameter d₀ ≥ 0 such that when the current states are d-close, with d(C_t, C̃_t) ≤ d₀, we apply Algorithm 2 and try to obtain exact meetings. When d(C_t, C̃_t) > d₀, we instead employ common random numbers to sample (η_t, η̃_t).

Our chosen metric d on ℝ^p × ℝ^p_{>0} × ℝ_{>0} × ℝ_{>0} is

d(C_{t},\tilde{C}_{t})=\mathbb{P}\big(\eta_{t+1}\neq\tilde{\eta}_{t+1}\,\big|\,C_{t},\tilde{C}_{t}\big),    (7)

where recall that C_t = (β_t, η_t, σ_t², ξ_t), C̃_t = (β̃_t, η̃_t, σ̃_t², ξ̃_t), and (η_{t+1}, η̃_{t+1}) | C_t, C̃_t is sampled using the coupled slice sampler in Algorithm 2. As the coupling in Algorithm 2 is independent component-wise, we can obtain a simplified expression involving only univariate probabilities as

d(C_{t},\tilde{C}_{t})=1-\prod_{j=1}^{p}\mathbb{P}\big(\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\,\big|\,\eta_{t,j},\tilde{\eta}_{t,j},m_{t,j},\tilde{m}_{t,j}\big),    (8)

where m_{t,j} = ξ_t β_{t,j}²/(2σ_t²) and m̃_{t,j} = ξ̃_t β̃_{t,j}²/(2σ̃_t²). Under this metric d and a threshold d₀ ∈ [0, 1], an exact meeting is attempted when ℙ(η_{t+1} = η̃_{t+1} | C_t, C̃_t) ≥ 1 − d₀. Once η_{t+1} and η̃_{t+1} coincide, and if the scalars ξ_{t+1} and ξ̃_{t+1} also subsequently coincide, then the entire chains meet. When d₀ = 1, the two-scale coupling reduces to the one-scale coupling of Algorithm 1, since the maximal coupling slice sampler (Algorithm 2) is invoked at each step. On the other hand, when the threshold is d₀ = 0, we always apply common random numbers. Then the chains C_t, C̃_t may come arbitrarily close but will never exactly meet. Empirically we find that different threshold values d₀ away from 0 and 1 give similar meeting times. Simulations concerning these choices can be found in Appendix B.2.

We now discuss computation of the metric d(C_t, C̃_t). Since the probability in (7) is unavailable in closed form, we can resort to various approximations. One option is to evaluate the one-dimensional integrals in (8) using numerical integration. Alternatively, one can form the Monte Carlo estimate

\widehat{d_{R}}^{(1)}(C_{t},\tilde{C}_{t})=\frac{1}{R}\sum_{r=1}^{R}\mathbbm{1}\{\eta^{(r)}_{t+1}\neq\tilde{\eta}^{(r)}_{t+1}\},    (9)

where (η^{(r)}_{t+1}, η̃^{(r)}_{t+1}) are sampled independently for r = 1, …, R using Algorithm 2. We recommend a Rao–Blackwellized estimate obtained by combining Monte Carlo with analytical calculations of integrals as

\widehat{d_{R}}^{(2)}(C_{t},\tilde{C}_{t})=1-\prod_{j=1}^{p}\Big(\frac{1}{R}\sum_{r=1}^{R}\mathbb{P}\Big(\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\,\Big|\,U^{(r)}_{j},\tilde{U}^{(r)}_{j},m_{t,j},\tilde{m}_{t,j}\Big)\Big),    (10)

where (U^{(r)}_j, Ũ^{(r)}_j) are sampled independently for r = 1, …, R using Algorithm 2, and each ℙ(η_{t+1,j} = η̃_{t+1,j} | U^{(r)}_j, Ũ^{(r)}_j, m_{t,j}, m̃_{t,j}) is calculated analytically based on the meeting probabilities from Algorithm 2. Further metric calculation details are in Appendix B.3. For a number of samples R, the estimates (9) and (10) both have a computation cost of order pR. Compared to the estimate in (9), the estimate in (10) has lower variance and faster numerical run-times, as it only involves sampling uniformly distributed random numbers. It suffices to choose a small number of samples R, and often we take R = 1. Indeed it appears unnecessary to estimate d(C_t, C̃_t) accurately, as we are only interested in comparing that distance to a fixed threshold d₀ ∈ [0, 1]. Often the trajectories (d(C_t, C̃_t))_{t≥0} initially take values close to 1, and then sharply drop to values close to 0. This leads to the estimates in (9) and (10) having low variance. Henceforth, we will use the estimate in (10) in our experiments with two-scale couplings, unless specified otherwise.
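Schematically, the η-update of the two-scale coupling then reduces to comparing a distance estimate with the threshold d₀. In the R sketch below, coupled_eta_max and coupled_eta_crn stand for assumed implementations of Algorithm 2 and of its common random numbers counterpart, and the estimate corresponds to the simple Monte Carlo version (9).

# Sketch of the two-scale eta-update: estimate the non-meeting probability and
# choose between the maximal coupling (Algorithm 2) and common random numbers.
estimate_distance <- function(Ct, Ct_tilde, coupled_eta_max, R = 1) {
  mean(replicate(R, {
    pair <- coupled_eta_max(Ct, Ct_tilde)          # one coupled draw of (eta, eta~)
    any(pair$eta != pair$eta_tilde)                # 1 if the draws differ anywhere
  }))
}
two_scale_eta_update <- function(Ct, Ct_tilde, coupled_eta_max, coupled_eta_crn,
                                 d0 = 0.5, R = 1) {
  if (estimate_distance(Ct, Ct_tilde, coupled_eta_max, R) <= d0) {
    coupled_eta_max(Ct, Ct_tilde)                  # chains close: try to meet exactly
  } else {
    coupled_eta_crn(Ct, Ct_tilde)                  # chains far: contract with CRN
  }
}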

Overall the two-scale coupling kernel has twice the 𝒪(n²p) cost of the single-chain kernel in (4). When C_t ≠ C̃_t, we calculate a distance estimate and sample from a coupled slice sampling kernel. Calculating the distance estimate involves sampling 2pR uniforms and has 𝒪(p) cost. Coupled slice sampling with maximal couplings (Algorithm 2) or common random numbers both have expected or deterministic cost 𝒪(p). The remaining steps of the two-scale coupling match the one-scale coupling of Algorithm 1.

We now consider the performance of the two-scale coupling on synthetic datasets. The setup is identical to that introduced in Section 2.3, where for each dimension p we simulate 100 meeting times based on independently generated synthetic datasets. We target the Horseshoe posterior (ν = 1) with a₀ = b₀ = 1. Figure 3 shows the meeting times τ of coupled Markov chains (with a lag L = 1), for both the one-scale coupling and the two-scale coupling with R = 1 and threshold d₀ = 0.5. Figure 3 shows that the two-scale coupling can lead to orders of magnitude reductions in meeting times, compared to the one-scale coupling.

Refer to caption
Figure 3: Meeting times against dimension for posteriors associated with a Horseshoe prior under one-scale and two-scale couplings, with n = 100, s = 10, σ_* = 0.5.

We next present some preliminary results which hint as to why the two-scale coupling leads to the drastic reduction in meeting times. We focus on the coupling of the (η, η̃) full conditionals, as the one-scale and two-scale algorithms differ exactly at this step. First, we show that for any monotone function h : (0, ∞) → ℝ and all components j = 1, …, p, the expected distance 𝔼[|h(η_{t+1,j}) − h(η̃_{t+1,j})|] is the same under both common random numbers and the maximal coupling.

Proposition 3.1.

Consider the setup of Proposition 2.2, such that the η full conditionals are sampled perfectly. Then for any monotone function h : (0, ∞) → ℝ and all components j = 1, …, p,

\mathbb{E}_{\text{max}}\big[|h(\eta_{t+1,j})-h(\tilde{\eta}_{t+1,j})|\,\big|\,m_{t},\tilde{m}_{t}\big]=\mathbb{E}_{\text{crn}}\big[|h(\eta_{t+1,j})-h(\tilde{\eta}_{t+1,j})|\,\big|\,m_{t},\tilde{m}_{t}\big],    (11)

where 𝔼_max and 𝔼_crn correspond to expectations under the maximal coupling and the CRN coupling respectively of (η_{t+1}, η̃_{t+1}) given m_t and m̃_t.

Proposition 3.1 implies that such single-step expectations alone do not distinguish the behavior of CRN from maximal couplings. This compels us to investigate other distributional features of |h(η_{t+1,j}) − h(η̃_{t+1,j})| under either coupling to possibly tease apart differences in their behavior on high probability events. Focusing on the Horseshoe prior and making the choice h(x) = log(1 + x), we uncover a distinction between the tail behavior of the two couplings which can substantially accumulate over p independent coordinates. We offer an illustration in a stylized setting below.

Proposition 3.2.

For the Horseshoe prior (ν = 1), consider the case where m_t = μ𝟏_p ∈ (0, ∞)^p and m̃_t = μ̃𝟏_p ∈ (0, ∞)^p for some positive scalars μ and μ̃ with μ̃ ≠ μ. Then under a CRN coupling,

\mathbb{P}\big(\max_{j=1,\ldots,p}|\log(1+\eta_{t+1,j})-\log(1+\tilde{\eta}_{t+1,j})|>|\log\mu-\log\tilde{\mu}|\ \big|\ m_{t},\tilde{m}_{t}\big)=0    (12)

for all p ≥ 1. Define the positive constant L = log(1 + 1/max{μ, μ̃})/2. Under the maximal coupling, for any α ∈ (0, 1) and any α′ ∈ (α, 1), there exists a constant D > 0 which does not depend on p such that for all p > 1,

\mathbb{P}\big(\max_{j=1,\ldots,p}|\log(1+\eta_{t+1,j})-\log(1+\tilde{\eta}_{t+1,j})|>\log(\alpha L)+\log\log p\ \big|\ m_{t},\tilde{m}_{t}\big)>1-e^{-Dp^{1-\alpha^{\prime}}}.    (13)

Extensions of Proposition 3.2 which allow different components of m_t and m̃_t to take different values under some limiting assumptions are omitted here for simplicity. This stylized setting already captures why the one-scale algorithm may result in much larger meeting times in high dimensions compared to the CRN-based two-scale algorithm. We note that Proposition 3.2 remains informative whenever |log μ − log μ̃| ≪ log log p, even when log log p is not large. For example, consider μ = μ̃e^δ for some small δ > 0 with δ ≪ log log p. Under CRN, log(1 + η_{t+1,j}) and log(1 + η̃_{t+1,j}) remain δ-close almost surely for all components j. In contrast, under the maximal coupling, at least one component will have larger than log(αL) + log log p deviations with high probability. Since α and L = log(1 + 1/μ)/2 do not depend on p or δ, log(αL) + log log p ≫ δ for fixed μ, μ̃ > 0. These larger deviations between some components of η_{t+1} and η̃_{t+1} under maximal couplings push the two chains much further apart when sampling (m_{t+1}, m̃_{t+1}) | η_{t+1}, η̃_{t+1}, resulting in meeting probabilities that are much lower over multiple steps of the algorithm. Overall, Propositions 3.1 and 3.2 show that multi-step kernel calculations taking into account such high-probability events may be necessary to further distinguish the relative performances of one-scale and two-scale couplings. A full analytic understanding of this difference in performance remains an open problem.

Another way of understanding the benefits of the two-scale coupling would be to find a distance under which the coupled chains under common random numbers exhibit a contraction. The CRN coupling part of Proposition 3.2 hints that d̄(m_t, m̃_t) := ∑_{j=1}^p |log m_{t,j} − log m̃_{t,j}| may be a natural metric. Appendix B.1 contains discussions and related simulations comparing this metric with alternatives. Proposition 3.3 verifies that this metric bounds the total variation distance between one-step transition kernels.

Proposition 3.3.

Let 𝒫 denote the Markov transition kernel associated with the update from (β_t, ξ_t, σ_t²) to (β_{t+1}, ξ_{t+1}, σ_{t+1}²) for the Gibbs sampler in (4). Let Z = (β, ξ, σ²), Z̃ = (β̃, ξ̃, σ̃²), m_j = ξβ_j²/(2σ²), and m̃_j = ξ̃β̃_j²/(2σ̃²) for j = 1, …, p. Then,

\text{TV}\big(\mathcal{P}(Z,\cdot),\mathcal{P}(\tilde{Z},\cdot)\big)^{2}\leq\frac{(1+\nu)(e-1)}{4}\,\bar{d}(Z,\tilde{Z}).    (14)

By Proposition 3.3, it suffices to show that (d̄(Z_t, Z̃_t))_{t≥0} contracts under CRN to a sufficiently small value such that the upper bound in (14) is strictly less than 1. This would ensure that the CRN coupling brings the marginal chains close enough that the probability of a one-step maximal coupling resulting in a meeting is high. Note that (14) corresponds to the total variation distance of the full chain, so under this setup we would attempt to maximally couple the full vectors η_{t+1} and η̃_{t+1} (rather than maximally coupling component-wise) when the chains are close. Analytical calculations to establish such a contraction, and to understand how the contraction rate scales with dimension and other features of the problem such as the choice of prior, sparsity, and signal-to-noise ratio, require further work.

3.2 Alternative coupling strategies

The rapid meeting of the two-scale coupling in simulations recommends its use. However, we were unable to establish finite expected meeting times with this coupling, which prevents its use in constructing unbiased estimates. We now describe several small modifications of the two-scale coupling that have similar empirical performance but for which we can prove that the meeting time distribution has exponential tails.

Our basic approach is to maximally couple each component (η_{t,j}, η̃_{t,j}) until a failed meeting {η_{t,j} ≠ η̃_{t,j}} occurs, after which we switch to CRN for the remaining components. The ordering of the components j = 1, …, p can be deterministic or randomized at each iteration. Under this construction, only one component of (η_t, η̃_t), corresponding to the failed meeting, will evolve independently; the other components will either meet or evolve under CRN. Therefore, we expect to obtain benefits similar to the two-scale coupling in high dimensions. An apparent advantage is that we can bypass the choice of metric and associated threshold; a sketch of the resulting η-update is given below. While the switch-to-CRN coupling has the same 𝒪(n²p) cost per iteration as the two-scale coupling, it can give faster run-times in practice when the dimension p is large compared to the number of observations n, as it avoids repeated calculation of a metric.
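A minimal R sketch of this η-update follows; couple_component_max and couple_component_crn stand for assumed per-component couplings (the maximal coupling of Algorithm 2 and its common random numbers counterpart), each returning a pair of draws for component j.

# Sketch of the switch-to-CRN eta-update: maximally couple components in a
# random order until the first failed meeting, then use CRN for the rest.
switch_to_crn_eta_update <- function(p, couple_component_max, couple_component_crn) {
  eta <- numeric(p); eta_tilde <- numeric(p)
  switched <- FALSE
  for (j in sample.int(p)) {                     # randomized ordering of components
    out <- if (!switched) couple_component_max(j) else couple_component_crn(j)
    if (!switched && out$eta != out$eta_tilde) switched <- TRUE  # first failed meeting
    eta[j] <- out$eta; eta_tilde[j] <- out$eta_tilde
  }
  list(eta = eta, eta_tilde = eta_tilde)
}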

We report the performance of the switch-to-CRN coupling with the ordering of the components randomized at each iteration. The setup is identical to that in Section 3.1, where for each dimension p we simulate 100 meeting times based on independently generated synthetic datasets. Figure 4 shows that the switch-to-CRN coupling leads to similar meeting times as the two-scale coupling.

Refer to caption
Figure 4: Meeting times for posteriors from the Horseshoe prior under the two-scale and the switch-to-CRN coupling algorithms, with n = 100, s = 10, σ_* = 0.5.

We can establish that the meeting times under this switch-to-CRN coupling have exponential tails and hence finite expectation.

Proposition 3.4.

Consider the blocked Gibbs sampler in (4). We follow the setup in Proposition 2.1, and assume that the global shrinkage prior π_ξ(·) has a compact support. Denote Z_t = (β_t, ξ_t, σ_t²) and Z̃_t = (β̃_t, ξ̃_t, σ̃_t²). Consider the switch-to-CRN coupling given by (Z_t, Z̃_t) ↦ (η_{t+1}, η̃_{t+1}) ↦ (Z_{t+1}, Z̃_{t+1}), where (η_{t+1}, η̃_{t+1}) are maximally coupled component-wise (under any fixed or random ordering of the components) until the first failed meeting, after which common random numbers are employed, and (Z_{t+1}, Z̃_{t+1}) are coupled using common random numbers. Denote the meeting time by τ := inf{t ≥ 0 : Z_t = Z̃_t}. Then ℙ(τ > t) ≤ A₁κ₁^t for some constants A₁ ∈ (0, ∞) and κ₁ ∈ (0, 1), and for all t ≥ 0.

As in Proposition 2.2, our proof of Proposition 3.4 implies a rate κ₁ which tends to 1 exponentially as the dimension p increases. Obtaining more favorable rates with respect to the dimension, as hinted at by Figure 4, would require a better understanding of the CRN coupling, for example establishing a contraction in some parts of the state space as discussed above.

Overall, Figures 3 and 4 show significant empirical improvements from using the two-scale coupling and related strategies, compared to the one-scale coupling of Section 2.3. Therefore, we recommend the use of the two-scale coupling and variants such as the switch-to-CRN coupling for practical high-dimensional applications. Furthermore, in settings where estimating the metric d is computationally expensive or analytically intractable, variants such as the switch-to-CRN coupling may offer faster numerical run-times per iteration. Henceforth, we will use the two-scale coupling of Section 3.1 in our experiments and GWAS dataset applications, unless specified otherwise.

3.3 Computational and statistical impact of the degrees of freedom ν

We empirically investigate the impact of the degrees of freedom ν ≥ 1 of Half-t(ν) priors. Higher degrees of freedom ν correspond to priors on β with stronger shrinkage towards zero (Proposition A.1). We consider meeting times resulting from the two-scale coupling of Section 3.1, as well as the statistical performance of the posteriors. On the computation side, recent work on convergence analysis (Qin and Hobert, 2019a, b) has highlighted the impact of shrinkage on the convergence of simpler MCMC algorithms. Under a common random numbers coupling, Qin and Hobert (2019b) have studied the convergence of the Gibbs sampler of Albert and Chib (1993) for Bayesian probit regression, proving dimension-free convergence rates for priors with sufficiently strong shrinkage towards zero. On the statistical estimation side, Half-t(ν) priors have long been proposed (Gelman, 2006; Carvalho et al., 2009, 2010) for Bayesian hierarchical models. van der Pas et al. (2014, 2017) have established (near) minimax optimal estimation for the Horseshoe (ν = 1) in the Normal means model. Ghosh and Chakrabarti (2017) have extended the results of van der Pas et al. (2014) to show minimax optimality of a wider class of priors including Half-t(ν) priors. Recently, Song (2020) has established that the degrees of freedom ν can impact the multiplicative constant in the posterior contraction rate for the Normal means model. Posterior contraction rates of continuous shrinkage priors in the regression setting beyond the Normal means model deserve further investigation.

(a) Meeting times  (b) Mean Squared Error
Figure 5: Meeting times under the two-scale coupling algorithm and mean squared error with Half-t(ν) priors and LASSO estimates. In Figure 5(a), n=100, s=20 and σ_*=2; in Figure 5(b), n=100, p=500, s=20 and σ_*=2.

We consider the meeting times obtained with Half-t(ν) priors under the two-scale coupling (with R=1 and threshold d0=0.5) on synthetic datasets. The synthetic datasets are generated as in Figure 2 of Section 2.3. Figure 5(a) is based on 20 synthetic datasets generated independently for different degrees of freedom ν ∈ [1,2]; it plots the meeting times τ of 1-lag coupled Markov chains against the dimension. It indicates that even a small increase in the degree of freedom to ν > 1, compared to the Horseshoe (ν = 1), can lead to orders-of-magnitude computational improvements through shorter meeting times.

We also consider the statistical performance of Bayesian shrinkage regression with Half-t(ν) priors on synthetic datasets in Figure 5(b). The synthetic datasets are generated as in Figure 2 of Section 2.3. We use the blocked Gibbs sampler of Algorithm 3 to draw samples from the posteriors corresponding to different degrees of freedom ν, simulating chains of length 1000 with a burn-in period chosen using the coupling-based total variation distance upper bounds of Biswas et al. (2019). Specifically, we employ a burn-in of 300 steps for ν ≥ 1.2 and of 600 steps for ν = 1. Figure 5(b) shows similar mean squared error (MSE) across values of ν ∈ [1,2]. For a fixed synthetic dataset, the MSE in Figure 5(b) is calculated as ‖β_* − β̂‖₂²/p, where β̂ := (1/1000)∑_{t=1}^{1000} β_t is the MCMC estimator obtained from the Markov chain after burn-in. The error bars in Figure 5(b) are generated by simulating synthetic datasets independently 100 times and each time calculating the corresponding MSE from a Markov chain for different values of ν ∈ [1,2]. The MSE of the LASSO is also included, with regularization parameter chosen by cross-validation using the glmnet package (Friedman et al., 2010). For ν ∈ [1,2], we observe a lower MSE using Half-t(ν) priors compared to the LASSO. The grey plots in Figure 5(b) show that much larger values ν ∈ {4,6,8} can lead to higher MSE, as the corresponding posteriors are strongly concentrated around zero and less able to identify non-null signals. Overall, Figure 5(b) suggests that Half-t(ν) priors with degrees of freedom ν ∈ [1,2] result in comparable statistical performance, while much larger values of ν should be discouraged.
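For concreteness, here is a minimal sketch of the MSE computation just described, assuming (hypothetical variable names) beta_chain is an iterations-by-p array of MCMC draws, beta_star is the true coefficient vector used to generate the synthetic data, and burn_in is the burn-in length chosen from the coupling-based bounds.

import numpy as np

def mcmc_mse(beta_chain, beta_star, burn_in):
    """Mean squared error of the posterior-mean estimate of beta, as in Figure 5(b):
    average the chain after burn-in and compare with the true signal."""
    beta_hat = beta_chain[burn_in:].mean(axis=0)
    return np.sum((beta_star - beta_hat) ** 2) / beta_star.shape[0]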

4 Results on GWAS datasets

Section 3.3 suggests that Half-t(ν)t(\nu) priors with higher degrees of freedom ν>1\nu>1 can give similar statistical performance to the Horseshoe (ν=1\nu=1), whilst allowing orders of magnitude computational improvements.

Motivated by this observation, we apply our algorithms to genome-wide association study (GWAS) datasets using ν=2\nu=2. We consider a bacteria dataset (Bühlmann et al., 2014) with n=71n=71 observations (corresponding to production of the vitamin riboflavin), and p=4,088p=4,088 covariates (corresponding to single nucleotide polymorphisms (SNPs) in the genome), and a maize dataset (Romay et al., 2013; Liu et al., 2016; Zeng and Zhou, 2017; Johndrow et al., 2020) with n=2,266n=2,266 observations (corresponding to average number of days taken for silk emergence in different maize lines), and p=98,385p=98,385 covariates (corresponding to SNPs). Bayesian methods are well-suited to such GWAS datasets, as they provide interpretable notions of uncertainty through marginal posterior probabilities which allow for multimodality, enable the use of prior information, and account for the uncertainty in parameter estimates (e.g. Guan and Stephens, 2011; Zhou et al., 2013). Furthermore, heavy-tailed continuous shrinkage priors can be particularly effective in the GWAS setting, as the associations between SNPs and the phenotype are expected to be sparse, and such priors have shown competitive empirical performance in polygenic prediction (Ge et al., 2019).

For both datasets, we target the Half-t(2) posterior with a0 = b0 = 1. We use the two-scale coupling with R = 1 and threshold d0 = 0.5, and initialize chains independently from the prior. For the ξ updates, we use a Metropolis–Rosenbluth–Teller–Hastings step-size of σ_MRTH = 0.8. Based on 100 independent coupled chains, Figure 6 shows upper bounds on the total variation distance to stationarity, TV(π_t, π), as a function of t, using L-lag couplings with L = 750 and L = 200 for the maize and riboflavin datasets respectively. The results indicate that these Markov chains converge to the corresponding posterior distributions in less than 1000 and 500 iterations respectively. We note that the lags L = 750 and L = 200 were selected based on preliminary runs of the coupled chains so as to obtain informative total variation bounds, which incurs an additional preliminary cost. Any choice of lag results in valid upper bounds, but only large enough lags lead to upper bounds below one, and hence informative, at small iterations t.
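The upper bounds in Figure 6 follow the L-lag construction of Biswas et al. (2019), which bounds TV(π_t, π) by E[max(0, ⌈(τ − L − t)/L⌉)] for the L-lag meeting time τ. A minimal sketch of this computation from a collection of meeting times (the array meeting_times is a hypothetical input gathered from independent coupled chains):

import numpy as np

def tv_upper_bound(meeting_times, lag, t):
    """Monte Carlo estimate of the L-lag upper bound on TV(pi_t, pi):
    average of max(0, ceil((tau - lag - t) / lag)) over independent coupled chains."""
    tau = np.asarray(meeting_times, dtype=float)
    return np.mean(np.maximum(0.0, np.ceil((tau - lag - t) / lag)))

# Example usage: bounds on a grid of iterations for a lag-750 coupling.
# bounds = [tv_upper_bound(meeting_times, lag=750, t=t) for t in range(0, 1500, 25)]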

In these GWAS examples, the run-time per iteration may be significant. For the maize dataset, on a 2015 high-end laptop, each iteration of the coupled chain takes approximately 6060 seconds, and running one coupled chain until meeting can take more than one day. This run-time is dominated by the calculation of the weighted matrix cross-product XDiag(η)1XTX\text{Diag}(\eta)^{-1}X^{T}, with an 𝒪(n2p)\mathcal{O}(n^{2}p) cost. In this setting, a scientist wanting to run chains for 10410^{4} or 10510^{5} iterations may have to wait for weeks or months. Using the proposed couplings, a scientist with access to parallel computers can be confident that samples obtained after 10310^{3} iterations would be indistinguishable from perfect samples from the target; for a similar cost they can also obtain unbiased estimators as described in Jacob et al. (2020). For the riboflavin dataset, each iteration of the coupled chain takes approximately 0.10.1 seconds, and running one coupled chain until meeting takes only 1010 to 2020 seconds. The run-time there is dominated by the component-wise η\eta updates which have 𝒪(p)\mathcal{O}(p) cost. In the riboflavin example, the coupling-based diagnostics enable the scientist to perform reliable Bayesian computation directly from personal computers. Overall, the GWAS examples emphasize that couplings of MCMC algorithms can aid practitioners in high-dimensional settings.
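As an illustration of the dominant cost just mentioned, a minimal numpy sketch of the weighted cross-product X Diag(η)^{-1} X^T; with X of size n × p this product costs O(n²p), which dominates each iteration on the maize dataset.

import numpy as np

def weighted_cross_product(X, eta):
    """Compute X Diag(eta)^{-1} X^T by scaling the columns of X by 1/eta
    and multiplying by X^T; the (n x p) by (p x n) product costs O(n^2 p)."""
    return (X / eta) @ X.T    # X: (n, p) array, eta: length-p positive vector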

(a) Riboflavin dataset with n=71 and p=4,088  (b) Maize dataset with n=2,266 and p=98,385
Figure 6: L-lag coupling-based upper bounds on TV(π_t, π), the distance between the chain at time t and its limiting distribution, in Bayesian shrinkage regression with a Half-t(2) prior on GWAS datasets.

5 Discussion

We have introduced couplings of Gibbs samplers for Bayesian shrinkage regression with Half-t(ν)t(\nu) priors. The proposed two-scale coupling is operational in realistic settings, including a GWAS setting with n2,000,p100,000n\approx 2,000,p\approx 100,000.

Firstly, our work participates in a wider effort to apply MCMC algorithms and coupling techniques to ever more challenging settings (Middleton et al., 2020; Ruiz et al., 2020; Trippe et al., 2021; Lee et al., 2020b; Xu et al., 2021). In particular, the short meeting times obtained here vindicate the use of coupling techniques in multimodal, high-dimensional sampling problems. This relates to questions raised in comments of Lee et al. (2020a) and Paulin (2020) in the discussion of Jacob et al. (2020), and shows that the dimension of the state space may not be the most important driver of the performance of MCMC algorithms and of couplings thereof.

Secondly, we have applied LL-lag couplings (Biswas et al., 2019) to obtain upper bounds on the total variation distance between the Markov chain at a finite time and its stationary distribution. This allows empirical investigations of the mixing time and how it varies with the inputs of the problem, including number of observations, number of covariates, signal to noise ratio and sparsity. We find that coupling techniques constitute a convenient non-asymptotic tool to monitor the performance of MCMC algorithms.

Thirdly, we observe that Half-t(ν) priors with degrees of freedom ν higher than one give similar statistical estimation performance to the Horseshoe whilst providing significant computational advantages. This contributes to the discussion on the impact of the prior on the trade-off between statistical estimation and computational feasibility in the setting of high-dimensional Bayesian regression.

The following questions arise from our work.

  1. Convergence of the blocked Gibbs sampler. Short meeting times suggest that the blocked Gibbs sampler converges quickly, even in high dimensions. This motivates a more formal study of the convergence of that Markov chain, to understand how the convergence rate varies with features of the data-generating process and of the prior. The Lyapunov function in Proposition 2.1 and the two-scale coupling may prove useful for such analysis. Our initial work in this area suggests that a key step will be finding a convenient metric that gives sharp bounds on the metric d used here, while simultaneously being amenable to theoretical analysis.

  2. Alternative coupling algorithms. Section 3.2 indicates that many coupling strategies are available in the present setting, and more generally for various Gibbs samplers. For example, we could combine the switch-to-CRN coupling with the two-scale coupling to form other coupling strategies: under the two-scale setup, we could apply the switch-to-CRN coupling only when the chains are close with respect to a chosen metric, and employ CRN couplings when they are far apart. Alternatively, we could always apply the switch-to-CRN coupling and estimate component-wise meeting probabilities to select the order of the updates, for example so that components more likely to meet are updated first. Some couplings may result in shorter meeting times for the Horseshoe prior, which may allow the Horseshoe to remain competitive with Half-t(ν) priors with higher degrees of freedom. In our experiments we did not find ways to obtain shorter meetings for the Horseshoe, but we certainly cannot rule out that possibility. If one could obtain short meetings for the Horseshoe in high dimensions, the apparent trade-off between statistical performance and computing cost identified in the present work would disappear.

  3. Interplay between posterior concentration and MCMC convergence. For Bayesian regression with spike-and-slab priors, Yang et al. (2016) and Atchadé (2019) have shown that posterior contraction can aid the convergence of MCMC in high dimensions. The performance of the proposed couplings motivates similar investigations for continuous shrinkage priors.

Acknowledgement.

The authors thank Xiaolei Liu and Xiang Zhou for sharing the Maize GWAS data, and Yves F. Atchadé, Qian Qin and Neil Shephard for helpful comments. Niloy Biswas gratefully acknowledges support from a GSAS Merit Fellowship and a Two Sigma Fellowship Award. Anirban Bhattacharya gratefully acknowledges support by the National Science Foundation through an NSF CAREER award (DMS-1653404). Pierre E. Jacob gratefully acknowledges support by the National Science Foundation through grants DMS-1712872 and DMS-1844695. The maize dataset GWAS simulations in this article were run on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University.

References

  • Abramowitz [1974] M. Abramowitz. Handbook of Mathematical Functions, With Formulas, Graphs, and Mathematical Tables. Dover Publications, Inc., USA, 1974. ISBN 0486612724.
  • Albert and Chib [1993] J. H. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88(422):669–679, 1993. ISSN 01621459. URL http://www.jstor.org/stable/2290350.
  • Atchadé [2019] Y. F. Atchadé. Approximate spectral gaps for Markov chains mixing times in high dimensions. arXiv preprint arXiv:1903.11745, 2019.
  • Trippe et al. [2021] B. L. Trippe, T. D. Nguyen, and T. Broderick. Optimal transport couplings of Gibbs samplers on partitions for unbiased estimation. 2021.
  • Bhadra et al. [2019] A. Bhadra, J. Datta, N. G. Polson, and B. Willard. Lasso meets Horseshoe: A survey. Statistical Science, 34(3):405–427, 2019.
  • Bhattacharya and Johndrow [2021] A. Bhattacharya and J. E. Johndrow. Sampling Local-Scale Parameters in High-Dimensional Regression Models, pages 1–14. American Cancer Society, 2021. ISBN 9781118445112. doi: https://doi.org/10.1002/9781118445112.stat08292. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118445112.stat08292.
  • Bhattacharya et al. [2015] A. Bhattacharya, D. Pati, N. S. Pillai, and D. B. Dunson. Dirichlet–Laplace priors for optimal shrinkage. Journal of the American Statistical Association, 110(512):1479–1490, 2015. doi: 10.1080/01621459.2014.960967. URL https://doi.org/10.1080/01621459.2014.960967.
  • Bhattacharya et al. [2016] A. Bhattacharya, A. Chakraborty, and B. K. Mallick. Fast sampling with Gaussian scale mixture priors in high-dimensional regression. Biometrika, 103(4):985–991, 10 2016. ISSN 0006-3444. doi: 10.1093/biomet/asw042. URL https://doi.org/10.1093/biomet/asw042.
  • Bhattacharya et al. [2021] S. K. Bhattacharya, K. Khare, and S. Pal. Geometric ergodicity of Gibbs samplers for the Horseshoe and its regularized variants. arXiv preprint arXiv:2101.00366, 2021.
  • Biswas et al. [2019] N. Biswas, P. E. Jacob, and P. Vanetti. Estimating convergence of Markov chains with L-lag couplings. In Advances in Neural Information Processing Systems, pages 7389–7399, 2019.
  • Bou-Rabee and Eberle [2020] N. Bou-Rabee and A. Eberle. Two-scale coupling for preconditioned Hamiltonian Monte Carlo in infinite dimensions. Stochastics and Partial Differential Equations: Analysis and Computations, 2020. doi: 10.1007/s40072-020-00175-6. URL https://doi.org/10.1007/s40072-020-00175-6.
  • Bou-Rabee and Sanz-Serna [2018] N. Bou-Rabee and J. M. Sanz-Serna. Geometric integrators and the Hamiltonian Monte Carlo method. Acta Numerica, 27:113–206, 2018. doi: 10.1017/S0962492917000101.
  • Bühlmann et al. [2014] P. Bühlmann, M. Kalisch, and L. Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1(1):255–278, 2014. doi: 10.1146/annurev-statistics-022513-115545. URL https://doi.org/10.1146/annurev-statistics-022513-115545.
  • Carpenter et al. [2017] B. Carpenter, A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. A. Brubaker, J. Guo, P. Li, and A. Riddell. Stan: a probabilistic programming language. Grantee Submission, 76(1):1–32, 2017.
  • Carvalho et al. [2009] C. M. Carvalho, N. G. Polson, and J. G. Scott. Handling sparsity via the Horseshoe. In Proceedings of Machine Learning Research, volume 5, pages 73–80. PMLR, 2009. URL http://proceedings.mlr.press/v5/carvalho09a.html.
  • Carvalho et al. [2010] C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–480, 2010. ISSN 00063444. URL http://www.jstor.org/stable/25734098.
  • Friedman et al. [2010] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, Articles, 33(1):1–22, 2010. ISSN 1548-7660. doi: 10.18637/jss.v033.i01. URL https://www.jstatsoft.org/v033/i01.
  • Ge et al. [2019] T. Ge, C.-Y. Chen, Y. Ni, Y.-C. A. Feng, and J. W. Smoller. Polygenic prediction via bayesian regression and continuous shrinkage priors. Nature Communications, 10(1):1776, 2019. doi: 10.1038/s41467-019-09718-5. URL https://doi.org/10.1038/s41467-019-09718-5.
  • Gelman [2006] A. Gelman. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Anal., 1(3):515–534, 09 2006. doi: 10.1214/06-BA117A. URL https://doi.org/10.1214/06-BA117A.
  • Ghosh and Chakrabarti [2017] P. Ghosh and A. Chakrabarti. Asymptotic optimality of one-group shrinkage priors in sparse high-dimensional problems. Bayesian Anal., 12(4):1133–1161, 12 2017. doi: 10.1214/16-BA1029. URL https://doi.org/10.1214/16-BA1029.
  • Glynn and Rhee [2014] P. W. Glynn and C.-h. Rhee. Exact estimation for Markov chain equilibrium expectations. Journal of Applied Probability, 51(A):377–389, 2014.
  • Griffin and Brown [2010] J. E. Griffin and P. J. Brown. Inference with Normal-Gamma prior distributions in regression problems. Bayesian Anal., 5(1):171–188, 03 2010. doi: 10.1214/10-BA507. URL https://doi.org/10.1214/10-BA507.
  • Guan and Stephens [2011] Y. Guan and M. Stephens. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. The Annals of Applied Statistics, 5(3):1780 – 1815, 2011. doi: 10.1214/11-AOAS455. URL https://doi.org/10.1214/11-AOAS455.
  • Hager [1989] W. W. Hager. Updating the inverse of a matrix. SIAM Review, 31(2):221–239, 1989. ISSN 00361445. URL http://www.jstor.org/stable/2030425.
  • Hastings [1970] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.
  • Jacob et al. [2020] P. E. Jacob, J. O’Leary, and Y. F. Atchadé. Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(3):543–600, 2020. doi: 10.1111/rssb.12336. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12336.
  • Jarner and Roberts [2007] S. Jarner and G. Roberts. Convergence of heavy-tailed Monte Carlo Markov Chain algorithms. Scandinavian Journal of Statistics, 34:781–815, 02 2007. doi: 10.1111/j.1467-9469.2007.00557.x.
  • Jarner and Tweedie [2003] S. F. Jarner and R. L. Tweedie. Necessary conditions for geometric and polynomial ergodicity of random-walk-type. Bernoulli, 9(4):559–578, 08 2003. doi: 10.3150/bj/1066223269. URL https://doi.org/10.3150/bj/1066223269.
  • Johndrow et al. [2020] J. Johndrow, P. Orenstein, and A. Bhattacharya. Scalable approximate MCMC algorithms for the Horseshoe prior. Journal of Machine Learning Research, 21(73):1–61, 2020. URL http://jmlr.org/papers/v21/19-536.html.
  • Johnson and Geyer [2012] L. T. Johnson and C. J. Geyer. Variable transformation to obtain geometric ergodicity in the random-walk Metropolis algorithm. Ann. Statist., 40(6):3050–3076, 12 2012. doi: 10.1214/12-AOS1048. URL https://doi.org/10.1214/12-AOS1048.
  • Johnson [1996] V. E. Johnson. Studying convergence of Markov chain Monte Carlo algorithms using coupled sample paths. Journal of the American Statistical Association, 91(433):154–166, 1996.
  • Johnson [1998] V. E. Johnson. A coupling-regeneration scheme for diagnosing convergence in Markov chain Monte Carlo algorithms. Journal of the American Statistical Association, 93(441):238–248, 1998.
  • Johnson and Rossell [2012] V. E. Johnson and D. Rossell. Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association, 107(498):649–660, 2012. doi: 10.1080/01621459.2012.682536. URL https://doi.org/10.1080/01621459.2012.682536.
  • Khare and Hobert [2013] K. Khare and J. P. Hobert. Geometric ergodicity of the Bayesian lasso. Electron. J. Statist., 7:2150–2163, 2013. doi: 10.1214/13-EJS841. URL https://doi.org/10.1214/13-EJS841.
  • Lee et al. [2020a] A. Lee, S. S. Singh, and M. Vihola. Comment on ‘Unbiased Markov chain Monte Carlo with couplings’ by Jacob, O’leary and Atchadé. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2020a.
  • Lee et al. [2020b] A. Lee, S. S. Singh, and M. Vihola. Coupled conditional backward sampling particle filter. Annals of Statistics, 48(5):3066–3089, 2020b.
  • Liu et al. [2016] X. Liu, M. Huang, B. Fan, E. S. Buckler, and Z. Zhang. Iterative usage of fixed and random effect models for powerful and efficient genome-wide association studies. PLOS Genetics, 12(2):1–24, 02 2016. doi: 10.1371/journal.pgen.1005767. URL https://doi.org/10.1371/journal.pgen.1005767.
  • Livingstone et al. [2019a] S. Livingstone, M. Betancourt, S. Byrne, and M. Girolami. On the geometric ergodicity of Hamiltonian Monte Carlo. Bernoulli, 25(4A):3109–3138, 11 2019a. doi: 10.3150/18-BEJ1083. URL https://doi.org/10.3150/18-BEJ1083.
  • Livingstone et al. [2019b] S. Livingstone, M. F. Faulkner, and G. O. Roberts. Kinetic energy choice in Hamiltonian/hybrid Monte Carlo. Biometrika, 106(2):303–319, 2019b.
  • Makalic and Schmidt [2016] E. Makalic and D. F. Schmidt. A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1):179–182, 2016.
  • Meyn and Tweedie [1993] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Springer, 1993.
  • Middleton et al. [2020] L. Middleton, G. Deligiannidis, A. Doucet, and P. E. Jacob. Unbiased Markov chain Monte Carlo for intractable target distributions. Electron. J. Statist., 14(2):2842–2891, 2020. doi: 10.1214/20-EJS1727. URL https://doi.org/10.1214/20-EJS1727.
  • Nishimura and Suchard [2018] A. Nishimura and M. A. Suchard. Prior-preconditioned conjugate gradient for accelerated Gibbs sampling in “large n & large p” sparse Bayesian logistic regression models. arXiv preprint arXiv:1810.12437, 2018.
  • Pal and Khare [2014] S. Pal and K. Khare. Geometric ergodicity for Bayesian shrinkage models. Electron. J. Statist., 8(1):604–645, 2014. doi: 10.1214/14-EJS896. URL https://doi.org/10.1214/14-EJS896.
  • Park and Casella [2008] T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008. doi: 10.1198/016214508000000337. URL https://doi.org/10.1198/016214508000000337.
  • Paulin [2020] D. Paulin. Comment on ‘Unbiased Markov chain Monte Carlo with couplings’ by Jacob, O’leary and Atchadé. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2020.
  • Polson and Scott [2012] N. G. Polson and J. G. Scott. On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(4):887–902, 2012.
  • Polson et al. [2014] N. G. Polson, J. G. Scott, and J. Windle. The Bayesian bridge. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(4):713–733, 2014. doi: 10.1111/rssb.12042. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssb.12042.
  • Qin and Hobert [2019a] Q. Qin and J. P. Hobert. Convergence complexity analysis of Albert and Chib’s algorithm for Bayesian probit regression. Ann. Statist., 47(4):2320–2347, 08 2019a. doi: 10.1214/18-AOS1749. URL https://doi.org/10.1214/18-AOS1749.
  • Qin and Hobert [2019b] Q. Qin and J. P. Hobert. Wasserstein-based methods for convergence complexity analysis of MCMC with applications. arXiv preprint arXiv:1810.08826, 2019b.
  • R Core Team [2013] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013. URL http://www.R-project.org/.
  • Roberts and Rosenthal [2002] G. O. Roberts and J. S. Rosenthal. One-shot coupling for certain stochastic recursive sequences. Stochastic Processes and their Applications, 99(2):195–208, 2002. ISSN 0304-4149. doi: https://doi.org/10.1016/S0304-4149(02)00096-0. URL https://www.sciencedirect.com/science/article/pii/S0304414902000960.
  • Roberts and Rosenthal [2004] G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Probability surveys, 1:20–71, 2004.
  • Roberts and Tweedie [1996] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996. ISSN 13507265. URL http://www.jstor.org/stable/3318418.
  • Romay et al. [2013] M. C. Romay, M. J. Millard, J. C. Glaubitz, J. A. Peiffer, K. L. Swarts, T. M. Casstevens, R. J. Elshire, C. B. Acharya, S. E. Mitchell, S. A. Flint-Garcia, M. D. McMullen, J. B. Holland, E. S. Buckler, and C. A. Gardner. Comprehensive genotyping of the USA national maize inbred seed bank. Genome Biology, 14(6):R55, 2013. doi: 10.1186/gb-2013-14-6-r55. URL https://doi.org/10.1186/gb-2013-14-6-r55.
  • Rosenthal [1995] J. S. Rosenthal. Minorization conditions and convergence rates for Markov chain Monte Carlo. Journal of the American Statistical Association, 90(430):558–566, 1995. ISSN 01621459. URL http://www.jstor.org/stable/2291067.
  • Ruiz et al. [2020] F. J. R. Ruiz, M. K. Titsias, T. Cemgil, and A. Doucet. Unbiased gradient estimation for variational auto-encoders using coupled Markov chains. ArXiv, abs/2010.01845, 2020.
  • Sherlock et al. [2010] C. Sherlock, P. Fearnhead, and G. O. Roberts. The random walk Metropolis: Linking theory and practice through a case study. Statist. Sci., 25(2):172–190, 05 2010. doi: 10.1214/10-STS327. URL https://doi.org/10.1214/10-STS327.
  • Song [2020] Q. Song. Bayesian shrinkage towards sharp minimaxity. Electronic Journal of Statistics, 14(2):2714 – 2741, 2020. doi: 10.1214/20-EJS1732. URL https://doi.org/10.1214/20-EJS1732.
  • Thorisson [2000] H. Thorisson. Coupling, stationarity, and regeneration. Springer, 2000.
  • Tibshirani [1994] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
  • van der Pas et al. [2017] S. van der Pas, B. Szabó, and A. van der Vaart. Adaptive posterior contraction rates for the horseshoe. Electron. J. Statist., 11(2):3196–3225, 2017. doi: 10.1214/17-EJS1316. URL https://doi.org/10.1214/17-EJS1316.
  • van der Pas et al. [2014] S. L. van der Pas, B. J. K. Kleijn, and A. W. van der Vaart. The horseshoe estimator: Posterior concentration around nearly black vectors. Electron. J. Statist., 8(2):2585–2618, 2014. doi: 10.1214/14-EJS962. URL https://doi.org/10.1214/14-EJS962.
  • van Erven and Harremos [2014] T. van Erven and P. Harremos. Rényi divergence and Kullback–Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014. doi: 10.1109/TIT.2014.2320500.
  • Vats and Knudson [2020] D. Vats and C. Knudson. Revisiting the Gelman–Rubin diagnostic. Statistical Science (to appear), 2020.
  • Vats et al. [2019] D. Vats, J. M. Flegal, and G. L. Jones. Multivariate output analysis for Markov chain Monte Carlo. Biometrika, 106(2):321–337, 04 2019. ISSN 0006-3444. doi: 10.1093/biomet/asz002. URL https://doi.org/10.1093/biomet/asz002.
  • Xu et al. [2021] K. Xu, T. E. Fjelde, C. Sutton, and H. Ge. Couplings for multinomial Hamiltonian Monte Carlo. In International Conference on Artificial Intelligence and Statistics, pages 3646–3654. PMLR, 2021.
  • Yang et al. [2016] Y. Yang, M. J. Wainwright, and M. I. Jordan. On the computational complexity of high-dimensional Bayesian variable selection. Ann. Statist., 44(6):2497–2532, 12 2016. doi: 10.1214/15-AOS1417. URL https://doi.org/10.1214/15-AOS1417.
  • Zeng and Zhou [2017] P. Zeng and X. Zhou. Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models. Nature Communications, 8(1):456, 2017. doi: 10.1038/s41467-017-00470-2. URL https://doi.org/10.1038/s41467-017-00470-2.
  • Zhou et al. [2013] X. Zhou, P. Carbonetto, and M. Stephens. Polygenic modeling with Bayesian sparse linear mixed models. PLOS Genetics, 9(2):1–14, 02 2013. doi: 10.1371/journal.pgen.1003264. URL https://doi.org/10.1371/journal.pgen.1003264.
  • Zou and Hastie [2005] H. Zou and T. Hastie. Regularization and variable selection via the Elastic Net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.

In Section A, we describe some features of the posterior distributions arising in high-dimensional regression with shrinkage priors. In Section B we describe metrics to monitor the distance between pairs of chains and provide details on the calculation of metrics to use in two-scale coupling strategies. In Section C we provide the proofs for all the theoretical results in this document. Section D provides pseudo-codes for various algorithms mentioned in this document. Section E provides some derivations of certain steps in the Gibbs sampler.

Appendix A Features of the target complicate use of generic MCMC

While posterior distributions resulting from heavy-tailed shrinkage priors have desirable statistical properties, their features pose challenges to generic MCMC algorithms. For Half-t(ν)t(\nu) priors, we show that the resulting posterior distributions present 1) multimodality, 2) heavy tails and 3) poles at zero. This hints at a trade-off between statistical accuracy and computational difficulty, since the very features that present computational challenges are crucial for optimal statistical performance across sparsity levels and signal strengths.

Proposition A.1 gives the marginal prior and posterior densities of β given ξ and σ², up to normalizing constants, i.e. with η integrated out. For any real-valued functions f and g, we write f(x) ≍ g(x) for x → x₀ when m < liminf_{x→x₀} |f(x)/g(x)| ≤ limsup_{x→x₀} |f(x)/g(x)| < M for some 0 < m ≤ M < ∞.

Proposition A.1.

The marginal prior of component βj\beta_{j} on \mathbb{R} given ξ\xi and σ2\sigma^{2} has density

π(βj|ξ,σ2)U(1+ν2,1,ξβj22σ2ν){log|βj| for |βj|0|βj|(1+ν) for |βj|+,\pi(\beta_{j}|\xi,\sigma^{2})\propto U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)}\asymp\begin{cases}-\log|\beta_{j}|&\text{ for }\;|\beta_{j}|\rightarrow 0\\ |\beta_{j}|^{-(1+\nu)}&\text{ for }\;|\beta_{j}|\rightarrow+\infty\end{cases}, (15)

where Γ()\Gamma(\cdot) is the Gamma function, and U(a,b,z):=Γ(a)10xa1(1+x)ba1ezx𝑑xU(a,b,z):=\Gamma(a)^{-1}\int_{0}^{\infty}x^{a-1}(1+x)^{b-a-1}e^{-zx}dx is the confluent hypergeometric function of the second kind [Abramowitz, 1974, Section 13] for any a,b,z>0a,b,z>0. The marginal posterior density of β\beta on p\mathbb{R}^{p} given ξ\xi and σ2\sigma^{2} is

π(β|σ2,ξ,y)\displaystyle\pi(\beta|\sigma^{2},\xi,y) 𝒩(y;Xβ,σ2In)j=1pU(1+ν2,1,ξβj22σ2ν),\displaystyle\propto\mathcal{N}(y;X\beta,\sigma^{2}I_{n})\prod_{j=1}^{p}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)}, (16)

thus π(β|σ2,ξ,y)β0j=1plog(|βj|)\pi(\beta|\sigma^{2},\xi,y)\overset{\|\beta\|\rightarrow 0}{\asymp}-\prod_{j=1}^{p}\log(|\beta_{j}|) and βjlogπ(β|σ2,ξ,y)βj0(βjlog|βj|)1-\frac{\partial}{\partial\beta_{j}}\log\pi(\beta|\sigma^{2},\xi,y)\overset{\beta_{j}\rightarrow 0}{\asymp}-(\beta_{j}\log|\beta_{j}|)^{-1}.
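For illustration, the unnormalized marginal prior density in (15) can be evaluated directly through the confluent hypergeometric function of the second kind; the following is a minimal sketch assuming scipy is available.

import numpy as np
from scipy.special import hyperu

def marginal_prior_unnormalized(beta_j, xi, sigma2, nu):
    """Unnormalized marginal prior density of beta_j given xi and sigma^2,
    proportional to U((1+nu)/2, 1, xi * beta_j^2 / (2 * sigma^2 * nu)) as in (15)."""
    z = xi * beta_j ** 2 / (2.0 * sigma2 * nu)
    return hyperu((1.0 + nu) / 2.0, 1.0, z)

# The logarithmic growth near zero and the |beta_j|^{-(1+nu)} tail decay
# can be checked numerically by evaluating this function on a grid.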

Figure 7 gives an example of such a marginal posterior in a stylized setting. Here

n=2,p=3,X=(110101),y=X(1 0 0)T=(1 1)T,n=2,\quad p=3,\quad X=\begin{pmatrix}1&1&0\\ 1&0&1\end{pmatrix},\quad y=X(1\;0\;0)^{T}=(1\;1)^{T},

with ν=2\nu=2 and ξ=σ2=0.25\xi=\sigma^{2}=0.25. Component β1\beta_{1} in Figure 7 illustrates the potential multimodality in the posterior. The posterior distribution also has polynomial tails in high dimensions, as stated in Corollary A.2.

Figure 7: Posterior densities π(β1,β2|σ²,ξ,y), π(β1,β3|σ²,ξ,y) and π(β2,β3|σ²,ξ,y).
Corollary A.2.

Let epe^{\perp}\in\mathbb{R}^{p} be a unit vector such that Xe=0Xe^{\perp}=0 and let λ>0\lambda>0. Assume entries eje^{\perp}_{j}\in\mathbb{R} are non-zero for all j=1,,pj=1,\ldots,p, which is necessary to ensure that π(λe|σ2,ξ,y)\pi(\lambda e^{\perp}|\sigma^{2},\xi,y) is finite for all λ>0\lambda>0. Then

π(λe|σ2,ξ,y)λp(1+ν)\pi(\lambda e^{\perp}|\sigma^{2},\xi,y)\asymp\lambda^{-p(1+\nu)} (17)

as λ+\lambda\rightarrow+\infty. That is, the marginal posterior of β\beta on p\mathbb{R}^{p} has polynomial tails along all directions in the null space ker(X)\ker(X) of XX.

In our stylized example, e⊥ = 3^{-1/2}(1, −1, −1)^T satisfies Xe⊥ = 0. It follows from Corollary A.2 that π(λe⊥|σ², ξ, y) ≍ λ^{−3(1+ν)} = λ^{−9} for large λ. More generally, whenever p > n and the marginal prior on β has polynomial tails, the subspace ker(X) of ℝ^p has dimension at least p − n, and therefore the posterior has polynomial tails over a non-trivial subspace. For such heavy-tailed targets, the most popular “generic” MCMC algorithms encounter significant theoretical and practical challenges. Here, we use the term generic for algorithms that are not explicitly designed to target this class of distributions, but instead rely on a general strategy to explore state spaces. Specifically, for heavy-tailed targets, lack of geometric ergodicity has been established for Random Walk Metropolis–Hastings (RWMH) [Jarner and Tweedie, 2003], the Metropolis-Adjusted Langevin Algorithm (MALA) [Roberts and Tweedie, 1996] and Hamiltonian Monte Carlo (HMC) [Livingstone et al., 2019a]. (These works consider targets with heavy tails in every direction; their arguments can be modified to cover targets which are heavy-tailed along any one direction, as in Corollary A.2, by considering a projection of the full chain along that direction.) Better performance can sometimes be obtained using heavy-tailed proposals [Jarner and Roberts, 2007, Sherlock et al., 2010, Livingstone et al., 2019b] or variable transformations when available [Johnson and Geyer, 2012].

The posterior density in (16) also approaches infinity when any component βj\beta_{j} approaches zero, as limβj0U((1+ν)/2,1,(ξβj2)/(2σ2ν))=\lim_{\beta_{j}\rightarrow 0}U((1+\nu)/2,1,(\xi\beta_{j}^{2})/(2\sigma^{2}\nu))=\infty for any ν,ξ,σ2>0\nu,\xi,\sigma^{2}>0. This is highlighted by the modes at the origin in Figure 7. The pole at zero corresponds to a discontinuity in the gradients of the negative log-density, as given in Proposition A.1. Such diverging gradients can lead to numerically unstable leapfrog integration for HMC samplers [Bou-Rabee and Sanz-Serna, 2018].

Appendix B Details on the choice and calculation of metric

B.1 Comparison with other metrics

Figure 8 plots d(C_t, C̃_t) alongside the L1 metric and the log-transformed L1 metric on the parameters m_{t,j} and m̃_{t,j} as in (8). It is based on one trajectory of our two-scale coupled chain under common random numbers (threshold d0 = 0), for a synthetic dataset generated as per Section 2.3, targeting the Horseshoe posterior (ν = 1) with a0 = b0 = 1. Figure 8 shows that d(C_t, C̃_t) is bounded above by both the L1 metric and the L1 metric between the logarithms of m_{t,j} and m̃_{t,j}. Therefore, we could consider the capped L1 metric min{∑_j |m_{t,j} − m̃_{t,j}|, 1} or the capped L1 metric on the logarithms min{∑_j |log(m_{t,j}/m̃_{t,j})|, 1} to obtain upper bounds on d. The search for metrics that are both easily computable and accurate could be pursued further, and may prove valuable in the theoretical analysis of the proposed algorithms.
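A minimal sketch of the two capped L1 surrogates mentioned above, which upper bound d(C_t, C̃_t) in Figure 8; here m and m_tilde stand for the length-p vectors of parameters m_{t,j} and m̃_{t,j}.

import numpy as np

def capped_l1(m, m_tilde):
    """Capped L1 metric: min{ sum_j |m_j - m_tilde_j|, 1 }."""
    return min(np.sum(np.abs(m - m_tilde)), 1.0)

def capped_log_l1(m, m_tilde):
    """Capped L1 metric between logarithms: min{ sum_j |log(m_j / m_tilde_j)|, 1 }."""
    return min(np.sum(np.abs(np.log(m / m_tilde))), 1.0)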

Figure 8: Metric trajectories for the posterior associated with a Horseshoe prior under common random numbers coupling, with n=100, p=1000, s=10 and σ_*=0.5.

B.2 Choice of metric threshold for the two-scale coupling

For different values of d0, Figure 9(a) shows averaged trajectories (d(C_t, C̃_t))_{t≥0}. When d0 = 1 (one-scale coupling), the metric d(C_t, C̃_t) remains near 1, and when d0 = 0 (common random numbers coupling) the metric gets very close to, but does not exactly reach, 0. For all values of the threshold shown, the chains exactly meet and d(C_t, C̃_t) reaches zero at similar times. Figure 9(b) considers higher-dimensional settings and plots histograms of (d(C_t, C̃_t))_{t=1}^{2000} when d0 = 0. The histograms show that d(C_t, C̃_t) takes values close to either 1 or 0 for the different dimensions p considered. This may explain why different threshold values sufficiently away from 0 and 1 give similar meeting times. Figure 9(c) confirms that different thresholds give similar meeting times for the Horseshoe.

Figure 9: Two-scale coupling algorithm performance for the Horseshoe prior under different values of d0 (panels (a), (b) and (c)). Unless specified otherwise, n=100, p=500, s=20 and σ_*=2.

B.3 Calculating metric dd

We present details of estimating the metric dd defined in (7). As the coupling in Algorithm 2 is independent component-wise, we have

d(Ct,C~t)\displaystyle d(C_{t},\tilde{C}_{t}) =1(ηt+1=η~t+1|Ct,C~t)=1j=1p(ηt+1,j=η~t+1,j|ηt,j,η~t,j,mt,j,m~t,j)\displaystyle=1-\mathbb{P}\big{(}\eta_{t+1}=\tilde{\eta}_{t+1}|C_{t},\tilde{C}_{t}\big{)}=1-\prod_{j=1}^{p}\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}|\eta_{t,j},\tilde{\eta}_{t,j},m_{t,j},\tilde{m}_{t,j}\big{)} (18)

for m_{t,j} = ξ_t β_{t,j}²/(2σ_t²) and m̃_{t,j} = ξ̃_t β̃_{t,j}²/(2σ̃_t²).

B.3.1 An integration based approximation

We consider the approximation

(ηt+1,jη~t+1,j|ηt,j,η~t,j,mt,j,m~t,j)TV(Mt,j,M~t,j),\mathbb{P}\big{(}\eta_{t+1,j}\neq\tilde{\eta}_{t+1,j}|\eta_{t,j},\tilde{\eta}_{t,j},m_{t,j},\tilde{m}_{t,j}\big{)}\approx\text{TV}(M_{t,j},\tilde{M}_{t,j}), (19)

where Mt,jM_{t,j} and M~t,j\tilde{M}_{t,j} correspond to target distributions of the marginal slice samplers in Algorithm 2. This corresponds to approximating the probability of not meeting under Algorithm 2 with the total variation distance (i.e. probability of not meeting under a maximal coupling) between the marginal target distributions Mt,jM_{t,j} and M~t,j\tilde{M}_{t,j} of the slice samplers. We can evaluate TV(Mt,j,M~t,j)\text{TV}(M_{t,j},\tilde{M}_{t,j}) analytically using Proposition B.1. This motivates the approximation

d~(Ct,C~t):=1j=1p(1TV(Mt,j,M~t,j))\displaystyle\tilde{d}(C_{t},\tilde{C}_{t}):=1-\prod_{j=1}^{p}\Big{(}1-\text{TV}(M_{t,j},\tilde{M}_{t,j})\Big{)} (20)

for d(Ct,C~t)d(C_{t},\tilde{C}_{t}).

Proposition B.1.

Consider distributions MM and M~\tilde{M} on [0,)[0,\infty) with densities

p(ηj|m)1ηj1ν2(1+νηj)ν+12emηj and p(ηj|m~)1ηj1ν2(1+νηj)ν+12em~ηj,p(\eta_{j}|m)\propto\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-m\eta_{j}}\text{ and }p(\eta_{j}|\tilde{m})\propto\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-\tilde{m}\eta_{j}}, (21)

respectively, such that M and M̃ are parameterized by fixed constants m > 0 and m̃ > 0. Then TV(M, M̃) = 0 when m = m̃. When m ≠ m̃,

TV(M,M~)=|U1+ν2,1,mν(νK)U(1+ν2,1,mν)U1+ν2,1,m~ν(νK)U(1+ν2,1,m~ν)|,\displaystyle\text{TV}(M,\tilde{M})=\bigg{|}\frac{U_{\frac{1+\nu}{2},1,\frac{m}{\nu}}(\nu K)}{U(\frac{1+\nu}{2},1,\frac{m}{\nu})}-\frac{U_{\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu}}(\nu K)}{U(\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu})}\bigg{|}, (22)

where Ua,b,z(t):=1Γ(a)0txa1(1+x)ba1ezx𝑑xU_{a,b,z}(t):=\frac{1}{\Gamma(a)}\int_{0}^{t}x^{a-1}(1+x)^{b-a-1}e^{-zx}dx is defined as the lower incomplete confluent hypergeometric function of the second kind, such that Ua,b,z()=U(a,b,z)U_{a,b,z}(\infty)=U(a,b,z) gives the confluent hypergeometric function of the second kind, and

K:=logU(1+ν2,1,mν)logU(1+ν2,1,m~ν)m~m.K:=\frac{\log U(\frac{1+\nu}{2},1,\frac{m}{\nu})-\log U(\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu})}{\tilde{m}-m}. (23)
Proof.

We have

p(ηj|m)=1ηj1ν2(1+νηj)ν+12emηj1Z and p(ηj|m~)=1ηj1ν2(1+νηj)ν+12em~ηj1Z~,p(\eta_{j}|m)=\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-m\eta_{j}}\frac{1}{Z}\ \text{ and }\ p(\eta_{j}|\tilde{m})=\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-\tilde{m}\eta_{j}}\frac{1}{\tilde{Z}}, (24)

where

Z\displaystyle Z =01ηj1ν2(1+νηj)ν+12emηj𝑑ηj=νν+12Γ(ν+12)U(ν+12,1,mν),\displaystyle=\int_{0}^{\infty}\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-m\eta_{j}}d\eta_{j}=\nu^{-\frac{\nu+1}{2}}\Gamma\Big{(}\frac{\nu+1}{2}\Big{)}U\Big{(}\frac{\nu+1}{2},1,\frac{m}{\nu}\Big{)}, (25)
Z~\displaystyle\tilde{Z} =01ηj1ν2(1+νηj)ν+12em~ηj𝑑ηj=νν+12Γ(ν+12)U(ν+12,1,m~ν).\displaystyle=\int_{0}^{\infty}\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-\tilde{m}\eta_{j}}d\eta_{j}=\nu^{-\frac{\nu+1}{2}}\Gamma\Big{(}\frac{\nu+1}{2}\Big{)}U\Big{(}\frac{\nu+1}{2},1,\frac{\tilde{m}}{\nu}\Big{)}. (26)
Case m=m~m=\tilde{m}.

TV(M,M~)=0\text{TV}(M,\tilde{M})=0 is immediate.

Case m<m~m<\tilde{m}.

Note that

p(ηj|m)p(ηj|m~)\displaystyle p(\eta_{j}|m)\leq p(\eta_{j}|\tilde{m}) exp((m~m)η)ZZ~\displaystyle\Leftrightarrow\exp\big{(}(\tilde{m}-m)\eta\big{)}\leq\frac{Z}{\tilde{Z}} (27)
ηlog(U(1+ν2,1,mν))log(U(1+ν2,1,m~ν))m~m=:K.\displaystyle\Leftrightarrow\eta\leq\frac{\log(U(\frac{1+\nu}{2},1,\frac{m}{\nu}))-\log(U(\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu}))}{\tilde{m}-m}=:K. (28)

Therefore,

1TV(M,M~)\displaystyle 1-\text{TV}(M,\tilde{M}) =0min(p(ηj|m),p(ηj|m~))𝑑ηj\displaystyle=\int_{0}^{\infty}\min(p(\eta_{j}|m),p(\eta_{j}|\tilde{m}))d\eta_{j} (29)
=0Kp(ηj|m)𝑑ηj+Kp(ηj|m~)𝑑ηj\displaystyle=\int_{0}^{K}p(\eta_{j}|m)d\eta_{j}+\int_{K}^{\infty}p(\eta_{j}|\tilde{m})d\eta_{j} (30)
=1Z0K1ηj1ν2(1+νηj)ν+12emηj𝑑ηj\displaystyle=\frac{1}{Z}\int_{0}^{K}\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-m\eta_{j}}d\eta_{j} (31)
+1Z~K1ηj1ν2(1+νηj)ν+12em~ηj𝑑ηj\displaystyle\qquad+\frac{1}{\tilde{Z}}\int_{K}^{\infty}\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}e^{-\tilde{m}\eta_{j}}d\eta_{j} (32)
=U1+ν2,1,mν(νK)U(1+ν2,1,mν)+1U1+ν2,1,m~ν(νK)U(1+ν2,1,m~ν)\displaystyle=\frac{U_{\frac{1+\nu}{2},1,\frac{m}{\nu}}(\nu K)}{U(\frac{1+\nu}{2},1,\frac{m}{\nu})}+1-\frac{U_{\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu}}(\nu K)}{U(\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu})} (33)
=1|U1+ν2,1,m~ν(νK)U(1+ν2,1,m~ν)U1+ν2,1,mν(νK)U(1+ν2,1,mν)|.\displaystyle=1-\Bigg{|}\frac{U_{\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu}}(\nu K)}{U(\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu})}-\frac{U_{\frac{1+\nu}{2},1,\frac{m}{\nu}}(\nu K)}{U(\frac{1+\nu}{2},1,\frac{m}{\nu})}\Bigg{|}. (34)

Equation (22) directly follows.

Case m>m~m>\tilde{m}.

Follows from the m<m~m<\tilde{m} case by symmetry. ∎
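As an illustration of the integration-based approximation (19)–(20), the following is a minimal Python sketch of TV(M, M̃) as given by Proposition B.1, assuming scipy is available; the lower incomplete confluent hypergeometric function is evaluated by numerical quadrature rather than in closed form.

import numpy as np
from math import gamma
from scipy.integrate import quad
from scipy.special import hyperu

def hyperu_lower(a, b, z, t):
    """Lower incomplete confluent hypergeometric function of the second kind,
    U_{a,b,z}(t) = Gamma(a)^{-1} * int_0^t x^{a-1} (1+x)^{b-a-1} exp(-z*x) dx."""
    integrand = lambda x: x ** (a - 1.0) * (1.0 + x) ** (b - a - 1.0) * np.exp(-z * x)
    val, _ = quad(integrand, 0.0, t)
    return val / gamma(a)

def tv_slice_targets(m, m_tilde, nu):
    """TV(M, M_tilde) between the marginal slice-sampler targets of Proposition B.1,
    used in the approximation (20) of the metric d."""
    if m == m_tilde:
        return 0.0
    s = (1.0 + nu) / 2.0
    U_m = hyperu(s, 1.0, m / nu)
    U_mt = hyperu(s, 1.0, m_tilde / nu)
    K = (np.log(U_m) - np.log(U_mt)) / (m_tilde - m)    # crossing point (23)
    return abs(hyperu_lower(s, 1.0, m / nu, nu * K) / U_m
               - hyperu_lower(s, 1.0, m_tilde / nu, nu * K) / U_mt)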

B.3.2 A Rao–Blackwellized estimator

We have

(ηt+1,j=η~t+1,j|ηt,j,η~t,j,mt,j,m~t,j)\displaystyle\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}|\eta_{t,j},\tilde{\eta}_{t,j},m_{t,j},\tilde{m}_{t,j}\big{)} (35)
=𝔼[(ηt+1,j=η~t+1,j|Uj,,U~j,,mt,j,m~t,j)|ηt,j,η~t,j,mt,j,m~t,j],\displaystyle=\mathbb{E}\Big{[}\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\Big{|}U_{j,*},\tilde{U}_{j,*},m_{t,j},\tilde{m}_{t,j}\big{)}\big{|}\eta_{t,j},\tilde{\eta}_{t,j},m_{t,j},\tilde{m}_{t,j}\Big{]}, (36)

where mt,j=ξtβt,j2/(2σt2)m_{t,j}=\xi_{t}\beta_{t,j}^{2}/(2\sigma_{t}^{2}) and m~t,j=ξ~tβ~t,j2/(2σ~t2)\tilde{m}_{t,j}=\tilde{\xi}_{t}\tilde{\beta}_{t,j}^{2}/(2\tilde{\sigma}_{t}^{2}), (Uj,,U~j,)|mt,j,m~t,j(U_{j,*},\tilde{U}_{j,*})|m_{t,j},\tilde{m}_{t,j} are sampled using common random numbers as in Algorithm 2, and (ηt+1,j,η~t+1,j)|Uj,,U~j,,mt,j,m~t,j(\eta_{t+1,j},\tilde{\eta}_{t+1,j})|U_{j,*},\tilde{U}_{j,*},m_{t,j},\tilde{m}_{t,j} are sampled as in Algorithm 2. We can analytically evaluate probability (ηt+1,j=η~t+1,j|Uj,,U~j,,mt,j,m~t,j)\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\Big{|}U_{j,*},\tilde{U}_{j,*},m_{t,j},\tilde{m}_{t,j}\big{)}, which motivates the Rao-Blackwellized estimator from (10),

dR^(2)(Ct,Ct~):=1j=1p(1Rr=1R(ηt+1,j=η~t+1,j|Uj,r,U~j,r,mt,j,m~t,j)),\widehat{d_{R}}^{(2)}(C_{t},\tilde{C_{t}}):=1-\prod_{j=1}^{p}\Big{(}\frac{1}{R}\sum_{r=1}^{R}\mathbb{P}\Big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\Big{|}U_{j,r},\tilde{U}_{j,r},m_{t,j},\tilde{m}_{t,j}\Big{)}\Big{)}, (37)

where RR is the number of samples. For each r=1,,Rr=1,\ldots,R, (Uj,r,U~j,r)(U_{j,r},\tilde{U}_{j,r}) is sampled independently as in Algorithm 2 and maximal coupling probability (ηt+1,j=η~t+1,j|Uj,r,U~j,r,mt,j,m~t,j)\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\big{|}U_{j,r},\tilde{U}_{j,r},m_{t,j},\tilde{m}_{t,j}\big{)} is calculated analytically using Proposition B.2.

Proposition B.2.

Suppose ηj|Uj,,mjPj\eta_{j}\big{|}U_{j,*},m_{j}\sim P_{j} and η~j|U~j,,m~jP~j\tilde{\eta}_{j}\big{|}\tilde{U}_{j,*},\tilde{m}_{j}\sim\tilde{P}_{j} where marginal distributions Pj{P}_{j} and P~j\tilde{P}_{j} correspond to the distribution in the second step of the slice sampler in Algorithm 4. Then,

\displaystyle\mathbb{P} (ηj=η~j|Uj,,U~j,,mj,m~j)={γs(mjK~j)γs(mjTj,)+γs(m~j(Tj,T~j,))γs(m~jK~j)γs(m~jT~j,)formj<m~jγs(mj(Tj,T~j,))γs(mj(Tj,T~j,))formj=m~jγs(m~jK~j)γs(m~jT~j,)+γs(mj(Tj,T~j,))γs(mjK~j)γs(mjTj,)formj>m~j\displaystyle\big{(}\eta_{j}=\tilde{\eta}_{j}\Big{|}U_{j,*},\tilde{U}_{j,*},m_{j},\tilde{m}_{j}\big{)}=\left\{\begin{matrix}\frac{\gamma_{s}(m_{j}\tilde{K}_{j})}{\gamma_{s}(m_{j}T_{j,*})}+\frac{\gamma_{s}(\tilde{m}_{j}(T_{j,*}\wedge\tilde{T}_{j,*}))-\gamma_{s}(\tilde{m}_{j}\tilde{K}_{j})}{\gamma_{s}(\tilde{m}_{j}\tilde{T}_{j,*})}&for\ m_{j}<\tilde{m}_{j}\\ \frac{\gamma_{s}(m_{j}(T_{j,*}\wedge\tilde{T}_{j,*}))}{\gamma_{s}(m_{j}(T_{j,*}\vee\tilde{T}_{j,*}))}&for\ m_{j}=\tilde{m}_{j}\\ \frac{\gamma_{s}(\tilde{m}_{j}\tilde{K}_{j})}{\gamma_{s}(\tilde{m}_{j}\tilde{T}_{j,*})}+\frac{\gamma_{s}(m_{j}(T_{j,*}\wedge\tilde{T}_{j,*}))-\gamma_{s}(m_{j}\tilde{K}_{j})}{\gamma_{s}(m_{j}T_{j,*})}&for\ m_{j}>\tilde{m}_{j}\end{matrix}\right. (38)

for m_j = ξβ_j²/(2σ²) and m̃_j = ξ̃β̃_j²/(2σ̃²), T_{j,*} := (U_{j,*}^{−2/(1+ν)} − 1)/ν and T̃_{j,*} := (Ũ_{j,*}^{−2/(1+ν)} − 1)/ν, s = (1+ν)/2, γ_s(x) = Γ(s)^{-1}∫_0^x t^{s−1}e^{−t}dt ∈ [0,1] the regularized lower incomplete Gamma function, a∧b := min{a,b}, a∨b := max{a,b}, and

\tilde{K}_{j}:=\bigg(0\vee\frac{\log\Big(\big(\frac{\tilde{m}_{j}}{m_{j}}\big)^{s}\frac{\gamma_{s}(m_{j}T_{j,*})}{\gamma_{s}(\tilde{m}_{j}\tilde{T}_{j,*})}\Big)}{\tilde{m}_{j}-m_{j}}\bigg)\wedge\Big(T_{j,*}\wedge\tilde{T}_{j,*}\Big). (39)
Proof.

We work component-wise and drop the subscripts for ease of notation. Then distributions PP and P~\tilde{P} have densities

p(η;m,T)\displaystyle p(\eta;m,T) =ηs1emηmsΓ(s)γs(mT) on (0,T]\displaystyle=\eta^{s-1}e^{-m\eta}\frac{m^{s}}{\Gamma(s)\gamma_{s}(mT)}\quad\text{ on }(0,T] (40)
p(η;m~,T~)\displaystyle p(\eta;\tilde{m},\tilde{T}) =ηs1em~ηm~sΓ(s)γs(m~T~) on (0,T~]\displaystyle=\eta^{s-1}e^{-\tilde{m}\eta}\frac{\tilde{m}^{s}}{\Gamma(s)\gamma_{s}(\tilde{m}\tilde{T})}\quad\text{ on }(0,\tilde{T}] (41)

respectively where T:=(U21+ν1)/νT:=(U_{*}^{-\frac{2}{1+\nu}}-1)/\nu and T~:=(U~21+ν1)/ν\tilde{T}:=(\tilde{U}_{*}^{-\frac{2}{1+\nu}}-1)/\nu, s=1+ν2s=\frac{1+\nu}{2} and γs(x)=1Γ(s)0xts1et𝑑t[0,1]\gamma_{s}(x)=\frac{1}{\Gamma(s)}\int_{0}^{x}t^{s-1}e^{-t}dt\in[0,1] is the regularized incomplete lower Gamma function. Note that

max(η=η~|T,T~,m,m~)=0TT~p(η;m,T)p(η;m~,T~)dη.\displaystyle\mathbb{P}_{max}\Big{(}\eta=\tilde{\eta}\Big{|}T,\tilde{T},m,\tilde{m}\Big{)}=\int_{0}^{T\wedge\tilde{T}}p(\eta;m,T)\wedge p(\eta;\tilde{m},\tilde{T})d\eta.
Case m=m~m=\tilde{m}.
\int_{0}^{T\wedge\tilde{T}}p(\eta;m,T)\wedge p(\eta;\tilde{m},\tilde{T})\,d\eta =\int_{0}^{T\wedge\tilde{T}}\eta^{s-1}e^{-m\eta}\frac{m^{s}}{\Gamma(s)}\Big(\frac{1}{\gamma_{s}(mT)}\wedge\frac{1}{\gamma_{s}(m\tilde{T})}\Big)d\eta
=\int_{0}^{T\wedge\tilde{T}}\eta^{s-1}e^{-m\eta}\frac{m^{s}}{\Gamma(s)}\frac{1}{\gamma_{s}(m(T\vee\tilde{T}))}\,d\eta
=\frac{\gamma_{s}(m(T\wedge\tilde{T}))}{\gamma_{s}(m(T\vee\tilde{T}))}

as required.

Case m<m~m<\tilde{m}.

For 0<ηTT~0<\eta\leq T\wedge\tilde{T},

p(\eta;m,T)\leq p(\eta;\tilde{m},\tilde{T}) \Leftrightarrow\exp((\tilde{m}-m)\eta)\leq\Big(\frac{\tilde{m}}{m}\Big)^{s}\frac{\gamma_{s}(mT)}{\gamma_{s}(\tilde{m}\tilde{T})} \Leftrightarrow\eta\leq\frac{\log\Big(\big(\frac{\tilde{m}}{m}\big)^{s}\frac{\gamma_{s}(mT)}{\gamma_{s}(\tilde{m}\tilde{T})}\Big)}{\tilde{m}-m}=:K.

Denote K~:=(0K)(TT~)\tilde{K}:=\big{(}0\vee K\big{)}\wedge\big{(}T\wedge\tilde{T}\big{)}, such that

K~:={0whenK0Kwhen 0<KTT~TT~whenTT~<K.\displaystyle\tilde{K}:=\left\{\begin{matrix}0&when\ K\leq 0\\ K&when\ 0<K\leq T\wedge\tilde{T}\\ T\wedge\tilde{T}&when\ T\wedge\tilde{T}<K.\end{matrix}\right.

This gives

\int_{0}^{T\wedge\tilde{T}}p(\eta;m,T)\wedge p(\eta;\tilde{m},\tilde{T})\,d\eta =\int_{0}^{\tilde{K}}p(\eta;m,T)\,d\eta+\int_{\tilde{K}}^{T\wedge\tilde{T}}p(\eta;\tilde{m},\tilde{T})\,d\eta
=\Big(\frac{\Gamma(s)\gamma_{s}(m\tilde{K})}{m^{s}}\Big)\Big(\frac{m^{s}}{\Gamma(s)\gamma_{s}(mT)}\Big)+\Big(\frac{\Gamma(s)\big(\gamma_{s}(\tilde{m}(T\wedge\tilde{T}))-\gamma_{s}(\tilde{m}\tilde{K})\big)}{\tilde{m}^{s}}\Big)\Big(\frac{\tilde{m}^{s}}{\Gamma(s)\gamma_{s}(\tilde{m}\tilde{T})}\Big)
=\frac{\gamma_{s}(m\tilde{K})}{\gamma_{s}(mT)}+\frac{\gamma_{s}(\tilde{m}(T\wedge\tilde{T}))-\gamma_{s}(\tilde{m}\tilde{K})}{\gamma_{s}(\tilde{m}\tilde{T})}.

as required.

Case m>m~m>\tilde{m}.

Follows from the m<m~m<\tilde{m} case by symmetry. ∎
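The following is a minimal Python sketch combining Proposition B.2 with the Rao–Blackwellized estimator (37), assuming scipy is available. The helper sample_slice_pair is a hypothetical placeholder returning one common-random-numbers draw (U_{j,r}, Ũ_{j,r}) of the slice variables as in Algorithm 2, and scipy.special.gammainc plays the role of the regularized lower incomplete Gamma function γ_s.

import numpy as np
from scipy.special import gammainc   # regularized lower incomplete Gamma function

def slice_threshold(u, nu):
    """Truncation point T = (u^{-2/(1+nu)} - 1) / nu induced by a slice variable u."""
    return (u ** (-2.0 / (1.0 + nu)) - 1.0) / nu

def meet_prob(u, u_tilde, m, m_tilde, nu):
    """Meeting probability of one eta-component under a maximal coupling of the
    truncated conditionals, following the case formulas (38)-(39)."""
    s = (1.0 + nu) / 2.0
    T, T_t = slice_threshold(u, nu), slice_threshold(u_tilde, nu)
    if m == m_tilde:
        return gammainc(s, m * min(T, T_t)) / gammainc(s, m * max(T, T_t))
    if m > m_tilde:                      # handled by symmetry: swap the two chains
        m, m_tilde, T, T_t = m_tilde, m, T_t, T
    K = np.log((m_tilde / m) ** s * gammainc(s, m * T) / gammainc(s, m_tilde * T_t)) / (m_tilde - m)
    K_t = min(max(0.0, K), min(T, T_t))  # tilde-K in (39)
    return (gammainc(s, m * K_t) / gammainc(s, m * T)
            + (gammainc(s, m_tilde * min(T, T_t)) - gammainc(s, m_tilde * K_t))
            / gammainc(s, m_tilde * T_t))

def rao_blackwellized_metric(m, m_tilde, nu, R, sample_slice_pair):
    """Rao-Blackwellized estimator (37) of d(C_t, C_tilde_t): average the analytic
    meeting probabilities over R draws of the slice variables, then combine across
    the p components."""
    p = len(m)
    probs = np.empty(p)
    for j in range(p):
        draws = [sample_slice_pair(j, r) for r in range(R)]
        probs[j] = np.mean([meet_prob(u, u_t, m[j], m_tilde[j], nu) for (u, u_t) in draws])
    return 1.0 - np.prod(probs)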

Appendix C Proofs

C.1 Marginal β\beta prior and posterior densities

Proof of Proposition A.1.
Marginal Prior on β\beta.

Let π_η denote the component-wise prior density of each η_j based on the Half-t(ν) distribution, and 𝒩(·; μ, Σ) the density of a multivariate Normal distribution with mean μ and covariance Σ. It suffices to calculate the marginal prior of each β_j given ξ and σ², by marginalizing over each η_j. We obtain

π(βj|ξ,σ2)\displaystyle\pi(\beta_{j}|\xi,\sigma^{2}) =0𝒩(βj;0,σ2ξηj)πη(ηj)𝑑ηj\displaystyle=\int_{0}^{\infty}\mathcal{N}\Big{(}\beta_{j};0,\frac{\sigma^{2}}{\xi\eta_{j}}\Big{)}\pi_{\eta}(\eta_{j})d\eta_{j} (42)
0ξηj2πσ2exp(ξβj22σ2ηj)1ηj2ν2(1+νηj)ν+12𝑑ηj\displaystyle\propto\int_{0}^{\infty}\frac{\sqrt{\xi\eta_{j}}}{\sqrt{2\pi\sigma^{2}}}\exp\Big{(}-\frac{\xi\beta_{j}^{2}}{2\sigma^{2}}\eta_{j}\Big{)}\frac{1}{\eta_{j}^{\frac{2-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}d\eta_{j} (43)
01ηj1ν2(1+νηj)ν+12exp(ξβj22σ2ηj)𝑑ηj\displaystyle\propto\int_{0}^{\infty}\frac{1}{\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}}}\exp\Big{(}-\frac{\xi\beta_{j}^{2}}{2\sigma^{2}}\eta_{j}\Big{)}d\eta_{j} (44)
=01(tν)1ν2(1+t)ν+12exp(ξβj22σ2νt)𝑑t1ν\displaystyle=\int_{0}^{\infty}\frac{1}{\big{(}\frac{t}{\nu}\big{)}^{\frac{1-\nu}{2}}(1+t)^{\frac{\nu+1}{2}}}\exp\Big{(}-\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}t\Big{)}dt\frac{1}{\nu} (45)
=Γ(ν+12)νν+12U(ν+12,1,ξβj22σ2ν)\displaystyle=\frac{\Gamma(\frac{\nu+1}{2})}{\nu^{\frac{\nu+1}{2}}}U\Big{(}\frac{\nu+1}{2},1,\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)} (46)
U(ν+12,1,ξβj22σ2ν),\displaystyle\propto U\Big{(}\frac{\nu+1}{2},1,\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)}, (47)

where (47) follows from the definition of the confluent hypergeometric function of the second kind: U(a,b,z):=1Γ(a)0xa1(1+x)ba1ezx𝑑xU(a,b,z):=\frac{1}{\Gamma(a)}\int_{0}^{\infty}x^{a-1}(1+x)^{b-a-1}e^{-zx}dx for any a,b,z>0a,b,z>0. By the asymptotic expansion of the confluent hypergeometric function of the second kind [Abramowitz, 1974, Section 13.1], we obtain

π(βj|ξ,σ2)U(ν+12,1,ξβj22σ2ν){1Γ(1+ν2)log(ξβj22σ2ν)log(|βj|) for |βj|0(ξβj22σ2ν)1+ν2|βj|(1+ν) for |βj|+\displaystyle\pi(\beta_{j}|\xi,\sigma^{2})\propto U\Big{(}\frac{\nu+1}{2},1,\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)}\asymp\begin{cases}-\frac{1}{\Gamma(\frac{1+\nu}{2})}\log\Big{(}\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)}&\asymp-\log(|\beta_{j}|)\text{ for }|\beta_{j}|\rightarrow 0\\ \Big{(}\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)}^{-\frac{1+\nu}{2}}&\asymp|\beta_{j}|^{-(1+\nu)}\text{ for }|\beta_{j}|\rightarrow+\infty\end{cases} (48)

as required for (15).

Marginal Posterior of β\beta.

Let πξ\pi_{\xi} denote the prior density on ξ\xi and πσ2\pi_{\sigma^{2}} the InvGamma(a02,b02)\mathrm{InvGamma}(\frac{a_{0}}{2},\frac{b_{0}}{2}) prior density on σ2\sigma^{2}. Then,

π(β|ξ,σ2,y)𝒩(y;Xβ,σ2In)(j=1pπ(βj|ξ,σ2))𝒩(y;Xβ,σ2In)j=1pU(1+ν2,1,ξβj22σ2ν)\displaystyle\pi(\beta|\xi,\sigma^{2},y)\propto\mathcal{N}(y;X\beta,\sigma^{2}I_{n})\Big{(}\prod_{j=1}^{p}\pi(\beta_{j}|\xi,\sigma^{2})\Big{)}\propto\mathcal{N}(y;X\beta,\sigma^{2}I_{n})\prod_{j=1}^{p}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)} (49)

as required for (16). Equations (48) and (49) directly give

π(β|ξ,σ2,y)j=1pU(ν+12,1,ξβj22σ2ν)j=1plog|βj| for β0.\pi(\beta|\xi,\sigma^{2},y)\asymp\prod_{j=1}^{p}U\Big{(}\frac{\nu+1}{2},1,\frac{\xi\beta_{j}^{2}}{2\sigma^{2}\nu}\Big{)}\asymp-\prod_{j=1}^{p}\log|\beta_{j}|\text{ for }\ \|\beta\|\rightarrow 0. (50)
Gradient of negative log-density of posterior of β\beta.

From (49), we obtain

βj(logπ(β|ξ,σ2,y))\displaystyle-\frac{\partial}{\partial\beta_{j}}\Big{(}\log\pi(\beta|\xi,\sigma^{2},y)\Big{)} =βj(12σ2yXβ22+j=1plogU(1+ν2,1,ξβj22σ2ν))\displaystyle=-\frac{\partial}{\partial\beta_{j}}\bigg{(}-\frac{1}{2\sigma^{2}}\|y-X\beta\|_{2}^{2}+\sum_{j=1}^{p}\log U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}\bigg{)} (51)
=[1σ2XT(yXβ)]jddβjU(1+ν2,1,ξβj22σ2ν)U(1+ν2,1,ξβj22σ2ν).\displaystyle=-\Big{[}\frac{1}{\sigma^{2}}X^{T}(y-X\beta)\Big{]}_{j}-\frac{\frac{d}{d\beta_{j}}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}}{U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}}. (52)

Recall [Abramowitz, 1974, Section 13.4] for a,b>0a,b>0,

ddzU(a,b,z)=z>0aU(a+1,b+1,z),U(a,2,z)z0+1Γ(a)1z,U(a,1,z)z0+1Γ(a)log(z).\displaystyle\frac{d}{dz}U(a,b,z)\overset{\forall z>0}{=}-aU(a+1,b+1,z),\ U(a,2,z)\overset{z\rightarrow 0^{+}}{\asymp}\frac{1}{\Gamma(a)}\frac{1}{z},\ U(a,1,z)\overset{z\rightarrow 0^{+}}{\asymp}-\frac{1}{\Gamma(a)}\log(z). (53)

This gives

ddβjU(1+ν2,1,ξβj22σ2ν)U(1+ν2,1,ξβj22σ2ν)\displaystyle\frac{\frac{d}{d\beta_{j}}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}}{U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}} =1+ν2U(1+ν2+1,1+1,ξβj22σ2ν)U(1+ν2,1,ξβj22σ2ν)ξβjσ2νβj01βjlog|βj|.\displaystyle=-\frac{1+\nu}{2}\frac{U\Big{(}\frac{1+\nu}{2}+1,1+1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}}{U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}}\frac{\xi\beta_{j}}{\sigma^{2}\nu}\overset{\beta_{j}\rightarrow 0}{\asymp}\frac{1}{\beta_{j}\log|\beta_{j}|}. (54)

Therefore by (52), βj(logπ(β|ξ,σ2,y))βj01βjlog|βj|-\frac{\partial}{\partial\beta_{j}}\Big{(}\log\pi(\beta|\xi,\sigma^{2},y)\Big{)}\overset{\beta_{j}\rightarrow 0}{\asymp}-\frac{1}{\beta_{j}\log|\beta_{j}|} as required. ∎

Proof of Corollary A.2.

By Proposition A.1,

π(β|ξ,σ2,y)=1Zξ,σ2,ν𝒩(y;Xβ,σ2In)j=1pU(1+ν2,1,ξβj22σ2ν),\displaystyle\pi(\beta|\xi,\sigma^{2},y)=\frac{1}{Z_{\xi,\sigma^{2},\nu}}\mathcal{N}(y;X\beta,\sigma^{2}I_{n})\prod_{j=1}^{p}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}, (55)

where Zξ,σ2,ν:=p𝒩(y;Xβ,σ2In)j=1pU(1+ν2,1,ξβj22σ2ν)dβZ_{\xi,\sigma^{2},\nu}:=\int_{\mathbb{R}^{p}}\mathcal{N}(y;X\beta,\sigma^{2}I_{n})\prod_{j=1}^{p}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\beta^{2}_{j}}{2\sigma^{2}\nu}\Big{)}d\beta is the normalizing constant. As p>np>n, there exists some epe^{\perp}\in\mathbb{R}^{p} such that Xe=0Xe^{\perp}=0. When ej=0e^{\perp}_{j}=0 for some j=1,,pj=1,\ldots,p, the posterior density is infinite by Proposition A.1. Consider epe^{\perp}\in\mathbb{R}^{p} such that each eje^{\perp}_{j}\in\mathbb{R} is non-zero for j=1,,pj=1,\ldots,p. For any such fixed ee^{\perp} and large λ>0\lambda>0, we obtain

π(λe|ξ,σ2,y)\displaystyle\pi(\lambda e^{\perp}|\xi,\sigma^{2},y) =1Zξ,σ2,ν𝒩(y;0,σ2In)j=1pU(1+ν2,1,ξλ2(ej)22σ2ν)\displaystyle=\frac{1}{Z_{\xi,\sigma^{2},\nu}}\mathcal{N}(y;0,\sigma^{2}I_{n})\prod_{j=1}^{p}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\lambda^{2}(e^{\perp}_{j})^{2}}{2\sigma^{2}\nu}\Big{)} (56)
j=1pU(1+ν2,1,ξλ2(ej)22σ2ν)\displaystyle\propto\prod_{j=1}^{p}U\Big{(}\frac{1+\nu}{2},1,\frac{\xi\lambda^{2}(e^{\perp}_{j})^{2}}{2\sigma^{2}\nu}\Big{)} (57)
j=1p(ξλ2(ej)22σ2ν)1+ν2\displaystyle\asymp\prod_{j=1}^{p}\Big{(}\frac{\xi\lambda^{2}(e^{\perp}_{j})^{2}}{2\sigma^{2}\nu}\Big{)}^{-\frac{1+\nu}{2}} (58)
λp(1+ν),\displaystyle\asymp\lambda^{-p(1+\nu)}, (59)

where (58) follows from the asymptotic expansion of the confluent hypergeometric function of the second kind [Abramowitz, 1974, Section 13.1]. ∎

C.2 Geometric Ergodicity

Initial technical results.

We first record some technical results.

Lemma C.1.

Consider any matrix AA such that A=[BC]A=\begin{bmatrix}B\\ C\end{bmatrix} for submatrices BB and CC. Then A2B2+C2\|A\|_{2}\leq\|B\|_{2}+\|C\|_{2}, where 2\|\cdot\|_{2} is the spectral norm of a matrix.

Proof.

Take any vector vv of length equal to the number of columns of AA with v2=1\|v\|_{2}=1. Then

Av22=[BC]v22=Bv22+Cv22B22+C22.\displaystyle\|Av\|_{2}^{2}=\Big{\|}\begin{bmatrix}B\\ C\end{bmatrix}v\Big{\|}_{2}^{2}=\|Bv\|_{2}^{2}+\|Cv\|_{2}^{2}\leq\|B\|_{2}^{2}+\|C\|_{2}^{2}. (60)

Taking the supremum over all such vv with v2=1\|v\|_{2}=1 gives A22B22+C22(B2+C2)2\|A\|_{2}^{2}\leq\|B\|_{2}^{2}+\|C\|_{2}^{2}\leq(\|B\|_{2}+\|C\|_{2})^{2}. ∎

Lemma C.2.

(Sherman–Morrison–Woodbury formula [Hager, 1989]) For any p×pp\times p invertible matrix AA, p×np\times n matrix UU and n×pn\times p matrix VV,

(A+UV)1=A1A1U(In+VA1U)1VA1.(A+UV)^{-1}=A^{-1}-A^{-1}U(I_{n}+VA^{-1}U)^{-1}VA^{-1}. (61)

When U=VT=xpU=V^{T}=x\in\mathbb{R}^{p}, we obtain (A+xxT)1=A1A1xxTA1/(1+xTA1x)(A+xx^{T})^{-1}=A^{-1}-A^{-1}xx^{T}A^{-1}/(1+x^{T}A^{-1}x).

Lemma C.3.

(A uniform bound on the full conditional mean of β\beta). Let XX be an n×pn\times p matrix with columns xinx_{i}\in\mathbb{R}^{n} for i=1,,pi=1,\ldots,p, and let R=Diag(r1,,rp)R=\text{Diag}(r_{1},\ldots,r_{p}) be a p×pp\times p diagonal matrix with positive entries. Then

supr1,,rp>0(XTX+R)1XT2Cp,\sup_{r_{1},\ldots,r_{p}>0}\|(X^{T}X+R)^{-1}X^{T}\|_{2}\leq C_{p}, (62)

for some Cp<C_{p}<\infty, where 2\|\cdot\|_{2} is the spectral norm of a matrix. That is, the spectral norm of (XTX+R)1XT(X^{T}X+R)^{-1}X^{T} is uniformly bounded over all positive diagonal matrices RR.

In an earlier version of the present manuscript, we had explicitly characterized the uniform upper bound as \|X^{\dagger}y\|_{2}, for X^{\dagger} the Moore-Penrose pseudoinverse of X, which is not true in general. Existence of a uniform upper bound has also been established by Bhattacharya et al. [2021] using block matrix inversions and induction.

Proof.

Fix any n1n\geq 1. We first consider the case p=1p=1, such that X=x1nX=x_{1}\in\mathbb{R}^{n}, R=r1R=r_{1} for some scalar r1>0r_{1}>0. We obtain

\|(X^{T}X+R)^{-1}X^{T}\|_{2}=\frac{1}{r_{1}+x_{1}^{T}x_{1}}\|x_{1}^{T}\|_{2}=\frac{\|x_{1}\|_{2}}{r_{1}+\|x_{1}\|_{2}^{2}}\leq\frac{\mathrm{I}_{x_{1}\neq 0}}{\|x_{1}\|_{2}}, (63)

which is a uniform bound for all r_{1}>0.

We now consider p>1p>1. We have X=[x1xp]X=[x_{1}\ldots x_{p}] with xinx_{i}\in\mathbb{R}^{n} for i=1,,pi=1,\ldots,p and R=Diag(r1,,rp)R=\text{Diag}(r_{1},\ldots,r_{p}) with ri>0r_{i}>0 for i=1,,pi=1,\ldots,p. For any p×pp\times p permutation matrix PP, the matrices XPTXP^{T} and PRPTPRP^{T} correspond to a reordering of the columns of XX and the diagonal entries of RR. We have

((XPT)T(XPT)+PRPT)1(XPT)T2=P(XTX+R)1XT2=(XTX+R)1XT2.\|\big{(}(XP^{T})^{T}(XP^{T})+PRP^{T}\big{)}^{-1}(XP^{T})^{T}\|_{2}=\|P(X^{T}X+R)^{-1}X^{T}\|_{2}=\|(X^{T}X+R)^{-1}X^{T}\|_{2}. (64)

This implies that (XTX+R)1XT2\|(X^{T}X+R)^{-1}X^{T}\|_{2} is invariant under any reordering of the pp columns of XX and the pp diagonal entries of RR. By considering such reordering, we can assume the diagonal entries of RR are non-decreasing without loss of generality; that is R=Diag(r1,,rp)R=\text{Diag}(r_{1},\ldots,r_{p}) with 0<r1rp0<r_{1}\leq\ldots\leq r_{p}. Denote Ap:=(XTX+R)1XTA_{p}:=(X^{T}X+R)^{-1}X^{T}. Then,

Ap\displaystyle A_{p} =(R1R1XT(In+XR1XT)1XR1)XTby Lemma C.2\displaystyle=\big{(}R^{-1}-R^{-1}X^{T}(I_{n}+XR^{-1}X^{T})^{-1}XR^{-1}\big{)}X^{T}\ \text{by Lemma \ref{lemma:woodbury}} (65)
\displaystyle=R^{-1}X^{T}-R^{-1}X^{T}\big{(}I_{n}-(I_{n}+XR^{-1}X^{T})^{-1}\big{)} (66)
=R1XT(In+XR1XT)1.\displaystyle=R^{-1}X^{T}(I_{n}+XR^{-1}X^{T})^{-1}. (67)

It therefore suffices to show that

Cp:=sup0<r1rpAp2=sup0<r1rpR1XT(In+XR1XT)12C_{p}:=\sup_{0<r_{1}\leq\ldots\leq r_{p}}\|A_{p}\|_{2}=\sup_{0<r_{1}\leq\ldots\leq r_{p}}\|{R}^{-1}{X}^{T}(I_{n}+{X}{R}^{-1}{X}^{T})^{-1}\|_{2} (68)

is finite for any p>1p>1. For k=1,,p1k=1,\ldots,p-1, define

Ck:=sup0<r1rkAk2=sup0<r1rkRk1XkT(In+XkRk1XkT)12,C_{k}:=\sup_{0<r_{1}\leq\ldots\leq r_{k}}\|A_{k}\|_{2}=\sup_{0<r_{1}\leq\ldots\leq r_{k}}\|{R}_{k}^{-1}{X}_{k}^{T}(I_{n}+{X}_{k}{R}_{k}^{-1}{X}_{k}^{T})^{-1}\|_{2}, (69)

where Xk=[x1xk]X_{k}=[x_{1}\ldots x_{k}] is the n×kn\times k matrix corresponding to the first kk columns of XX and Rk=Diag(r1,,rk){R}_{k}=\text{Diag}(r_{1},\ldots,r_{k}) with 0<r1rk0<r_{1}\leq\ldots\leq r_{k} is the k×kk\times k diagonal matrix corresponding to the top kk diagonal entries of RR, and Ak:=Rk1XkT(In+XkRk1XkT)1A_{k}:=R_{k}^{-1}X_{k}^{T}(I_{n}+X_{k}R_{k}^{-1}X_{k}^{T})^{-1}. Then Xp1Rp11Xp1T=i=1p1xixiT/riX_{p-1}R_{p-1}^{-1}X_{p-1}^{T}=\sum_{i=1}^{p-1}x_{i}x_{i}^{T}/r_{i}. By Lemma C.2 with U=VT=xp/rp1/2U=V^{T}=x_{p}/r_{p}^{1/2},

(In+XR1XT)1\displaystyle(I_{n}+XR^{-1}X^{T})^{-1} =((In+Xp1Rp11Xp1T)+xpxpTrp)1\displaystyle=\Big{(}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})+\frac{x_{p}x_{p}^{T}}{r_{p}}\Big{)}^{-1} (70)
=(In+Xp1Rp11Xp1T)1\displaystyle=(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1} (71)
(In+Xp1Rp11Xp1T)1xpxpT(In+Xp1Rp11Xp1T)1rp+xpT(In+Xp1Rp11Xp1T)1xp.\displaystyle\quad-\frac{(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}x_{p}x_{p}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}}{r_{p}+x_{p}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}x_{p}}. (72)

Substituting this decomposition into (67), we obtain

Ap=[Rp11Xp1Trp1xpT](In+XR1XT)1=[Vp1vpT]\displaystyle A_{p}=\begin{bmatrix}R_{p-1}^{-1}X_{p-1}^{T}\\ r_{p}^{-1}x_{p}^{T}\end{bmatrix}(I_{n}+XR^{-1}X^{T})^{-1}=\begin{bmatrix}V_{p-1}\\ v_{p}^{T}\end{bmatrix} (73)

where, after some simplifications, vpnv_{p}\in\mathbb{R}^{n} and Vp1V_{p-1} is a p1p-1 by nn matrix such that

vpT\displaystyle v_{p}^{T} :=rp1xpT(In+XR1XT)1=xpT(In+Xp1Rp11Xp1T)1rp+xpT(In+Xp1Rp11Xp1T)1xp,\displaystyle:=r_{p}^{-1}x_{p}^{T}(I_{n}+XR^{-1}X^{T})^{-1}=\frac{x_{p}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}}{r_{p}+x_{p}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}x_{p}}, (74)
Vp1\displaystyle V_{p-1} :=Rp11Xp1T(In+XR1XT)1=Ap1(InxpvpT).\displaystyle:=R_{p-1}^{-1}X_{p-1}^{T}(I_{n}+XR^{-1}X^{T})^{-1}=A_{p-1}\Big{(}I_{n}-x_{p}v_{p}^{T}\Big{)}. (75)

We have Vp12Cp1(1+xp2vpT2)\|V_{p-1}\|_{2}\leq C_{p-1}(1+\|x_{p}\|_{2}\|v_{p}^{T}\|_{2}) by (69), subadditivity and submultiplicativity. Then by Lemma C.1,

Ap2\displaystyle\|A_{p}\|_{2} Vp12+vpT2Cp1(1+xp2vpT2)+vpT2.\displaystyle\leq\|V_{p-1}\|_{2}+\|v_{p}^{T}\|_{2}\leq C_{p-1}(1+\|x_{p}\|_{2}\|v_{p}^{T}\|_{2})+\|v_{p}^{T}\|_{2}. (76)

Therefore to show CpC_{p} in (68) is finite, it suffices to bound vpT2\|v_{p}^{T}\|_{2} uniformly over all R=Diag(r1,,rp)R=\text{Diag}(r_{1},\ldots,r_{p}) with 0<r1rp0<r_{1}\leq\ldots\leq r_{p}. Note that xp=Xp1ap+bpx_{p}=X_{p-1}a_{p}+b_{p} for some app1a_{p}\in\mathbb{R}^{p-1} and some bpnb_{p}\in\mathbb{R}^{n} such that Xp1Tbp=0X_{p-1}^{T}b_{p}=0. Here Xp1apX_{p-1}a_{p} corresponds to the projection of xpx_{p} to the column space of Xp1X_{p-1}, and bpb_{p} corresponds to the component of xpx_{p} that is perpendicular to the column space of Xp1X_{p-1}. Note that bp=(In+Xp1Rp11Xp1T)bpb_{p}=(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})b_{p}. This implies

(In+Xp1Rp11Xp1T)1bp\displaystyle(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}b_{p} =bp,\displaystyle=b_{p}, (77)
bpT(In+Xp1Rp11Xp1T)1Xp1ap\displaystyle b_{p}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}X_{p-1}a_{p} =bpTXp1ap=0.\displaystyle=b_{p}^{T}X_{p-1}a_{p}=0. (78)

Therefore by (74),

vpT2\displaystyle\|v_{p}^{T}\|_{2} =apTXp1T(In+Xp1Rp11Xp1T)1+bpTrp+apTXp1T(In+Xp1Rp11Xp1T)1Xp1ap+bpTbp2\displaystyle=\|\frac{a_{p}^{T}X_{p-1}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}+b_{p}^{T}}{r_{p}+a_{p}^{T}X_{p-1}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}X_{p-1}a_{p}+b_{p}^{T}b_{p}}\|_{2} (79)
apTXp1T(In+Xp1Rp11Xp1T)1rp+apTXp1T(In+Xp1Rp11Xp1T)1Xp1ap2+bpTrp+bpTbp2 by subadditivity\displaystyle\leq\big{\|}\frac{a_{p}^{T}X_{p-1}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}}{r_{p}+a_{p}^{T}X_{p-1}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}X_{p-1}a_{p}}\big{\|}_{2}+\big{\|}\frac{b_{p}^{T}}{r_{p}+b_{p}^{T}b_{p}}\big{\|}_{2}\ \text{ by subadditivity} (80)
apTXp1T(In+Xp1Rp11Xp1T)1rp2+Ibp0bp2 as (In+Xp1Rp11Xp1T)1 is p.s.d.\displaystyle\leq\|\frac{a_{p}^{T}X_{p-1}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}}{r_{p}}\|_{2}+\frac{\mathrm{I}_{b_{p}\neq 0}}{\|b_{p}\|_{2}}\ \text{ as }(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}\text{ is p.s.d.} (81)
=(Rp1ap)TRp11Xp1T(In+Xp1Rp11Xp1T)1rp2+Ibp0bp2\displaystyle=\|\frac{(R_{p-1}a_{p})^{T}R_{p-1}^{-1}X_{p-1}^{T}(I_{n}+X_{p-1}R_{p-1}^{-1}X_{p-1}^{T})^{-1}}{r_{p}}\|_{2}+\frac{\mathrm{I}_{b_{p}\neq 0}}{\|b_{p}\|_{2}} (82)
Rp1rp2ap2Cp1+Ibp0bp2 by submultiplicativity and (69)\displaystyle\leq\big{\|}\frac{R_{p-1}}{r_{p}}\big{\|}_{2}\|a_{p}\|_{2}C_{p-1}+\frac{\mathrm{I}_{b_{p}\neq 0}}{\|b_{p}\|_{2}}\ \text{ by submultiplicativity and \eqref{eq:Ck_bound}} (83)
ap2Cp1+Ibp0bp2 as rprj for all j=1,,p1.\displaystyle\leq\big{\|}a_{p}\big{\|}_{2}C_{p-1}+\frac{\mathrm{I}_{b_{p}\neq 0}}{\|b_{p}\|_{2}}\ \text{ as $r_{p}\geq r_{j}$ for all $j=1,\ldots,p-1$}. (84)

Note that xp,apx_{p},a_{p} and bpb_{p} do not depend on entries of the diagonal matrix RR. Taking the supremum over all R=Diag(r1,,rp)R=\text{Diag}(r_{1},\ldots,r_{p}) with 0<r1rp0<r_{1}\leq\ldots\leq r_{p} in (76), by (84) we obtain the recurrence relation

CpCp1(1+xp2(ap2Cp1+Ibp0bp2))+(ap2Cp1+Ibp0bp2)\displaystyle C_{p}\leq C_{p-1}\Big{(}1+\|x_{p}\|_{2}\big{(}\big{\|}a_{p}\big{\|}_{2}C_{p-1}+\frac{\mathrm{I}_{b_{p}\neq 0}}{\|b_{p}\|_{2}}\big{)}\Big{)}+\Big{(}\big{\|}a_{p}\big{\|}_{2}C_{p-1}+\frac{\mathrm{I}_{b_{p}\neq 0}}{\|b_{p}\|_{2}}\Big{)} (85)

for all p\geq 2. As C_{1} is finite by (63), the recurrence (85) implies by induction that C_{p} is finite for all p. ∎
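As an informal numerical illustration of Lemma C.3 (not part of the proof), one can draw a fixed design matrix and evaluate the spectral norm of (X^{T}X+R)^{-1}X^{T} for diagonal matrices R whose entries span many orders of magnitude; the dimensions, seed and scales in the sketch below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                                       # arbitrary n < p
X = rng.standard_normal((n, p))

norms = []
for _ in range(200):
    # diagonal entries of R spread over many orders of magnitude
    r = 10.0 ** rng.uniform(-8, 8, size=p)
    A = np.linalg.solve(X.T @ X + np.diag(r), X.T)   # (X^T X + R)^{-1} X^T
    norms.append(np.linalg.norm(A, 2))               # spectral norm
print(max(norms))                                    # remains bounded, consistent with Lemma C.3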

Lemma C.4.

(Moments of the full conditionals of each ηj\eta_{j}). For fixed constants m>0m>0 and ν>0\nu>0, define a distribution on (0,)(0,\infty) with probability density function

p(η|m)1η1ν2(1+νη)ν+12emη.p(\eta|m)\propto\frac{1}{\eta^{\frac{1-\nu}{2}}(1+\nu\eta)^{\frac{\nu+1}{2}}}e^{-m\eta}. (86)

Then for any c>max{ν+12,1}c>\max\{-\frac{\nu+1}{2},-1\} and ηp(|m)\eta\sim p(\cdot|m), we have

𝔼[ηc|m]=1νcΓ(1+ν2+c)Γ(1+ν2)U(1+ν2+c,1+c,mν)U(1+ν2,1,mν).\displaystyle\mathbb{E}[\eta^{c}|m]=\frac{1}{\nu^{c}}\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}\frac{U(\frac{1+\nu}{2}+c,1+c,\frac{m}{\nu})}{U(\frac{1+\nu}{2},1,\frac{m}{\nu})}. (87)
Proof.

Recall the definition of the confluent hypergeometric function of the second kind U(a,b,z):=1Γ(a)0xa1(1+x)ba1ezx𝑑xU(a,b,z):=\frac{1}{\Gamma(a)}\int_{0}^{\infty}x^{a-1}(1+x)^{b-a-1}e^{-zx}dx. We obtain

0ηcη1ν2(1+νη)ν+12emη𝑑η\displaystyle\int_{0}^{\infty}\frac{\eta^{c}}{\eta^{\frac{1-\nu}{2}}(1+\nu\eta)^{\frac{\nu+1}{2}}}e^{-m\eta}d\eta =νν+12c0xν+12+c1(1+x)ν+12emνx𝑑x\displaystyle=\nu^{-\frac{\nu+1}{2}-c}\int_{0}^{\infty}x^{\frac{\nu+1}{2}+c-1}(1+x)^{-\frac{\nu+1}{2}}e^{-\frac{m}{\nu}x}dx (88)
=νν+12cΓ(ν+12+c)U(ν+12+c,1+c,mν).\displaystyle=\nu^{-\frac{\nu+1}{2}-c}\Gamma\Big{(}\frac{\nu+1}{2}+c\Big{)}U\Big{(}\frac{\nu+1}{2}+c,1+c,\frac{m}{\nu}\Big{)}. (89)

Therefore,

𝔼[ηc|m]\displaystyle\mathbb{E}[\eta^{c}|m] =νν+12cΓ(ν+12+c)U(ν+12+c,1+c,mν)νν+12Γ(ν+12)U(ν+12,1,mν)\displaystyle=\frac{\nu^{-\frac{\nu+1}{2}-c}\Gamma\big{(}\frac{\nu+1}{2}+c\big{)}U\big{(}\frac{\nu+1}{2}+c,1+c,\frac{m}{\nu}\big{)}}{\nu^{-\frac{\nu+1}{2}}\Gamma\big{(}\frac{\nu+1}{2}\big{)}U\big{(}\frac{\nu+1}{2},1,\frac{m}{\nu}\big{)}} (90)
=1νcΓ(ν+12+c)U(ν+12+c,1+c,mν)Γ(ν+12)U(ν+12,1,mν).\displaystyle=\frac{1}{\nu^{c}}\frac{\Gamma\big{(}\frac{\nu+1}{2}+c\big{)}U\big{(}\frac{\nu+1}{2}+c,1+c,\frac{m}{\nu}\big{)}}{\Gamma\big{(}\frac{\nu+1}{2}\big{)}U\big{(}\frac{\nu+1}{2},1,\frac{m}{\nu}\big{)}}. (91)
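As an informal sanity check of the moment identity (87) (again, not part of the argument), the closed form can be compared against direct numerical integration of the unnormalized density; the values of \nu, m and c below are arbitrary.

import numpy as np
from scipy.integrate import quad
from scipy.special import gamma, hyperu

nu, m, c = 2.0, 0.7, 0.3     # arbitrary values with c > max{-(1+nu)/2, -1}

w = lambda eta: eta ** ((nu - 1) / 2) * (1 + nu * eta) ** (-(nu + 1) / 2) * np.exp(-m * eta)
numer, _ = quad(lambda eta: eta ** c * w(eta), 0, np.inf)
denom, _ = quad(w, 0, np.inf)
closed_form = (nu ** -c) * gamma((1 + nu) / 2 + c) / gamma((1 + nu) / 2) \
    * hyperu((1 + nu) / 2 + c, 1 + c, m / nu) / hyperu((1 + nu) / 2, 1, m / nu)
print(numer / denom, closed_form)   # the two values agree up to quadrature error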

Ratios of confluent hypergeometric function evaluations, such as the ones arising in Lemma C.4, can be upper-bounded using the following result.

Lemma C.5.

Let UU denote the confluent hypergeometric function of the second kind, and fix r>0r>0.

  1. 1.

    Let c>0c>0. Then

    U(r+c,1+c,x)U(r,1,x)<xcfor all x>0.\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}<x^{-c}\quad\text{for all }x>0. (92)

    Furthermore, for every ϵ>0\epsilon>0, there exists a Kϵ,c,r(1)<K^{(1)}_{\epsilon,c,r}<\infty such that

    U(r+c,1+c,x)U(r,1,x)<ϵxc+Kϵ,c,r(1)for all x>0.\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}<\epsilon x^{-c}+K^{(1)}_{\epsilon,c,r}\quad\text{for all }x>0. (93)
  2. 2.

    Let max{1,r}<c<0\max\{-1,-r\}<c<0. Then there exists some Kc,r(2)<K^{(2)}_{c,r}<\infty such that

    U(r+c,1+c,x)U(r,1,x)<xc+Kc,r(2)for all x>0.\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}<x^{-c}+K^{(2)}_{c,r}\quad\text{for all }x>0. (94)
Proof.
Case: c>0c>0.

By definition, we have

xcΓ(r+c)U(r+c,1+c,x)Γ(r)U(r,1,x)\displaystyle x^{c}\frac{\Gamma(r+c)U\big{(}r+c,1+c,x\big{)}}{\Gamma(r)U\big{(}r,1,x\big{)}} =xc0tr+c1(1+t)rext𝑑t0tr1(1+t)rext𝑑t\displaystyle=\frac{x^{c}\int_{0}^{\infty}t^{r+c-1}(1+t)^{-r}e^{-xt}dt}{\int_{0}^{\infty}t^{r-1}(1+t)^{-r}e^{-xt}dt}
=0sr+c1(x+s)res𝑑s0sr1(x+s)res𝑑s for s=xt\displaystyle=\frac{\int_{0}^{\infty}s^{r+c-1}(x+s)^{-r}e^{-s}ds}{\int_{0}^{\infty}s^{r-1}(x+s)^{-r}e^{-s}ds}\text{ for }s=xt
=Γ(r+c)Γ(r)𝔼[(x+S)r]𝔼[(x+S~)r] for SGamma(r+c,1) and S~Gamma(r,1)\displaystyle=\frac{\Gamma(r+c)}{\Gamma(r)}\frac{\mathbb{E}[(x+S)^{-r}]}{\mathbb{E}[(x+\tilde{S})^{-r}]}\text{ for }S\sim\mathrm{Gamma}(r+c,1)\text{ and }\tilde{S}\sim\mathrm{Gamma}(r,1)
<Γ(r+c)Γ(r)\displaystyle<\frac{\Gamma(r+c)}{\Gamma(r)}

where the final inequality follows because the Gamma(r+c,1) distribution stochastically dominates the Gamma(r,1) distribution, and hence \mathbb{E}[(x+S)^{-r}]<\mathbb{E}[(x+\tilde{S})^{-r}] for all x>0 since r>0 and c>0. This gives U(r+c,1+c,x)/U(r,1,x)<x^{-c} as required for (92).

We now prove (93). By definition, we have

Γ(r+c)U(r+c,1+c,x)Γ(r)U(r,1,x)=0tr+c1(1+t)rext𝑑t0tr1(1+t)rext𝑑t=𝔼[Txc]=xc𝔼[Yxc]\displaystyle\frac{\Gamma(r+c)U\big{(}r+c,1+c,x\big{)}}{\Gamma(r)U\big{(}r,1,x\big{)}}=\frac{\int_{0}^{\infty}t^{r+c-1}(1+t)^{-r}e^{-xt}dt}{\int_{0}^{\infty}t^{r-1}(1+t)^{-r}e^{-xt}dt}=\mathbb{E}[T_{x}^{c}]=x^{-c}\mathbb{E}[Y_{x}^{c}]

where T_{x} is a random variable with density p_{T_{x}}(t)\propto t^{r-1}(1+t)^{-r}e^{-xt} on [0,\infty), and Y_{x}:=xT_{x} is a random variable with density p_{Y_{x}}(y)\propto y^{r-1}(x+y)^{-r}e^{-y} on [0,\infty).

  1. 1.

    Consider 𝔼[Yxc]\mathbb{E}[Y_{x}^{c}] as x0+x\rightarrow 0^{+}. We obtain

    limx0+𝔼[Yxc]=limx0+0yc+r1(x+y)rey𝑑y0yr1(x+y)rey𝑑y=Γ(c)limx0+0(yx+y)ry1ey𝑑y=0.\displaystyle\lim_{x\rightarrow 0^{+}}\mathbb{E}[Y_{x}^{c}]=\lim_{x\rightarrow 0^{+}}\frac{\int_{0}^{\infty}y^{c+r-1}(x+y)^{-r}e^{-y}dy}{\int_{0}^{\infty}y^{r-1}(x+y)^{-r}e^{-y}dy}=\frac{\Gamma(c)}{\lim_{x\rightarrow 0^{+}}\int_{0}^{\infty}\big{(}\frac{y}{x+y}\big{)}^{r}y^{-1}e^{-y}dy}=0.
  2. 2.

    Consider 𝔼[Txc]\mathbb{E}[T_{x}^{c}] as a function of xx for x>0x>0. From the density of TxT_{x}, we note that for all 0<x1<x20<x_{1}<x_{2}, Tx1T_{x_{1}} stochastically dominates Tx2T_{x_{2}}, and as c>0c>0, Tx1cT^{c}_{x_{1}} also stochastically dominates Tx2cT^{c}_{x_{2}}. Therefore, 𝔼[Txc]\mathbb{E}[T_{x}^{c}] is a decreasing function of xx for x>0x>0.

As limx0+𝔼[Yxc]=0\lim_{x\rightarrow 0^{+}}\mathbb{E}[Y_{x}^{c}]=0, for any ϵ>0\epsilon>0, there exists δ>0\delta>0 such that 𝔼[Yxc]<Γ(r+c)Γ(r)ϵ\mathbb{E}[Y_{x}^{c}]<\frac{\Gamma(r+c)}{\Gamma(r)}\epsilon for all xδx\leq\delta. This gives xc𝔼[Yxc]<Γ(r+c)Γ(r)ϵxcx^{-c}\mathbb{E}[Y_{x}^{c}]<\frac{\Gamma(r+c)}{\Gamma(r)}\epsilon x^{-c} for all xδx\leq\delta. Also for all x>δx>\delta, we have xc𝔼[Yxc]=𝔼[Txc]𝔼[Tδc]<x^{-c}\mathbb{E}[Y_{x}^{c}]=\mathbb{E}[T_{x}^{c}]\leq\mathbb{E}[T_{\delta}^{c}]<\infty as 𝔼[Txc]\mathbb{E}[T_{x}^{c}] is a decreasing function of xx for x>0x>0. Overall, we obtain

Γ(r+c)U(r+c,1+c,x)Γ(r)U(r,1,x)\displaystyle\frac{\Gamma(r+c)U\big{(}r+c,1+c,x\big{)}}{\Gamma(r)U\big{(}r,1,x\big{)}} =xc𝔼[Yxc]<Γ(r+c)Γ(r)ϵxc+𝔼[Tδc]\displaystyle=x^{-c}\mathbb{E}[Y_{x}^{c}]<\frac{\Gamma(r+c)}{\Gamma(r)}\epsilon x^{-c}+\mathbb{E}[T_{\delta}^{c}]
U(r+c,1+c,x)U(r,1,x)\displaystyle\Rightarrow\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}} <ϵxc+Kϵ,c,r(1)\displaystyle<\epsilon x^{-c}+K^{(1)}_{\epsilon,c,r}

for Kϵ,c,r(1)=Γ(r)Γ(r+c)𝔼[Tδc]=U(r+c,1+c,δ)U(r,1,δ)<K^{(1)}_{\epsilon,c,r}=\frac{\Gamma(r)}{\Gamma(r+c)}\mathbb{E}[T_{\delta}^{c}]=\frac{U\big{(}r+c,1+c,\delta\big{)}}{U\big{(}r,1,\delta\big{)}}<\infty as required.

Case: max{1,r}<c<0\max\{-1,-r\}<c<0.

Note that U(r+c,1+c,x)U\big{(}r+c,1+c,x\big{)} and U(r,1,x)U\big{(}r,1,x\big{)} are well-defined as r,r+c,c+1>0r,r+c,c+1>0.

  1. 1.

    Consider U(r+c,1+c,x)U(r,1,x)xc\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c} as xx\rightarrow\infty. We first recall the asymptotic expansion of the confluent hypergeometric function of the second kind [Abramowitz, 1974, Section 13.5] at infinity,

    U(a,b,z)=zaa(ab+1)z(a+1)+𝒪(z(a+2))U(a,b,z)=z^{-a}-a(a-b+1)z^{-(a+1)}+\mathcal{O}(z^{-(a+2)}) (95)

    as zz\rightarrow\infty for any a,b>0a,b>0. For large x>0x>0, this gives

    xc+1(U(r+c,1+c,x)U(r,1,x)xc)\displaystyle x^{c+1}\Big{(}\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c}\Big{)} =xc+1(xrU(r+c,1+c,x)xrcU(r,1,x)xrU(r,1,x))\displaystyle=x^{c+1}\Big{(}\frac{x^{r}U\big{(}r+c,1+c,x\big{)}-x^{r-c}U\big{(}r,1,x\big{)}}{x^{r}U\big{(}r,1,x\big{)}}\Big{)} (96)
    =x(xr+cU(r+c,1+c,x)xrU(r,1,x)xrU(r,1,x))\displaystyle=x\Big{(}\frac{x^{r+c}U\big{(}r+c,1+c,x\big{)}-x^{r}U\big{(}r,1,x\big{)}}{x^{r}U\big{(}r,1,x\big{)}}\Big{)} (97)
    =x((1(r+c)rx)(1r2x)1+𝒪(x2))\displaystyle=x\Big{(}\frac{(1-\frac{(r+c)r}{x})-(1-\frac{r^{2}}{x})}{1}+\mathcal{O}(x^{-2})\Big{)} (98)
    =rc+𝒪(x1)\displaystyle=-rc+\mathcal{O}(x^{-1}) (99)

    where (98) follows from the asymptotic expansion in (95). As c+1>0c+1>0, we obtain

    limx(U(r+c,1+c,x)U(r,1,x)xc)=0.\displaystyle\lim_{x\rightarrow\infty}\Big{(}\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c}\Big{)}=0. (100)
  2. 2.

    Consider U(r+c,1+c,x)U(r,1,x)xc\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c} as x0+x\rightarrow 0^{+}. By definition of confluent hypergeometric function of the second kind, we have

    limx0+Γ(r+c)U(r+c,1+c,x)Γ(r)U(r,1,x)=limx0+0tr+c1(1+t)rext𝑑t0tr1(1+t)rext𝑑t.\displaystyle\lim_{x\rightarrow 0^{+}}\frac{\Gamma(r+c)U\big{(}r+c,1+c,x\big{)}}{\Gamma(r)U\big{(}r,1,x\big{)}}=\lim_{x\rightarrow 0^{+}}\frac{\int_{0}^{\infty}t^{r+c-1}(1+t)^{-r}e^{-xt}dt}{\int_{0}^{\infty}t^{r-1}(1+t)^{-r}e^{-xt}dt}. (101)

    Note that as r+c>0r+c>0 and c1<1c-1<-1,

    0tr+c1(1+t)r𝑑t=0(t1+t)rtc1𝑑t01tr+c1𝑑t+1tc1𝑑t=1r+c1c<.\int_{0}^{\infty}t^{r+c-1}(1+t)^{-r}dt=\int_{0}^{\infty}\big{(}\frac{t}{1+t}\big{)}^{r}t^{c-1}dt\leq\int_{0}^{1}t^{r+c-1}dt+\int_{1}^{\infty}t^{c-1}dt=\frac{1}{r+c}-\frac{1}{c}<\infty. (102)

    Therefore, by monotone convergence, \lim_{x\rightarrow 0^{+}}\int_{0}^{\infty}t^{r+c-1}(1+t)^{-r}e^{-xt}dt=\int_{0}^{\infty}t^{r+c-1}(1+t)^{-r}dt\leq\frac{1}{r+c}-\frac{1}{c}<\infty. However, we have

    limx0+0tr1(1+t)rext𝑑t=limx0+0(t1+t)rt1ext𝑑t=.\lim_{x\rightarrow 0^{+}}\int_{0}^{\infty}t^{r-1}(1+t)^{-r}e^{-xt}dt=\lim_{x\rightarrow 0^{+}}\int_{0}^{\infty}\big{(}\frac{t}{1+t}\big{)}^{r}t^{-1}e^{-xt}dt=\infty. (103)

    Therefore limx0+Γ(r+c)U(r+c,1+c,x)Γ(r)U(r,1,x)=0\lim_{x\rightarrow 0^{+}}\frac{\Gamma(r+c)U\big{(}r+c,1+c,x\big{)}}{\Gamma(r)U\big{(}r,1,x\big{)}}=0. As c<0c<0, we obtain

    limx0+U(r+c,1+c,x)U(r,1,x)xc=0.\lim_{x\rightarrow 0^{+}}\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c}=0. (104)

From (100) and (104), we have limxU(r+c,1+c,x)U(r,1,x)xc=0\lim_{x\rightarrow\infty}\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c}=0 and limx0+U(r+c,1+c,x)U(r,1,x)xc=0\lim_{x\rightarrow 0^{+}}\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c}=0. As xU(r+c,1+c,x)U(r,1,x)xcx\mapsto\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c} is a continuous function on (0,)(0,\infty), this implies that U(r+c,1+c,x)U(r,1,x)xc\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c} is bounded above for x(0,)x\in(0,\infty). In particular, there exists some Kc,r(2)<K^{(2)}_{c,r}<\infty such that

U(r+c,1+c,x)U(r,1,x)xc<Kc,r(2)\frac{U\big{(}r+c,1+c,x\big{)}}{U\big{(}r,1,x\big{)}}-x^{-c}<K^{(2)}_{c,r} (105)

for all x>0x>0 as required for (94). ∎

Corollary C.6.

(Bound on moments of the full conditionals of each ηj\eta_{j}). Consider a random variable η\eta on (0,)(0,\infty) with density as in (86) in Lemma C.4 for some ν>0\nu>0.

Consider the case c>0. Then for every \epsilon>0, there exists some K^{(1)}_{\epsilon,c,\nu}<\infty such that

𝔼[ηc|m]ϵmc+Kϵ,c,ν(1)for all m>0.\displaystyle\mathbb{E}[\eta^{c}|m]\leq\epsilon m^{-c}+K^{(1)}_{\epsilon,c,\nu}\quad\text{for all }m>0. (106)

Consider when max{1,1+ν2}<c<0\max\{-1,-\frac{1+\nu}{2}\}<c<0. Then, there exists some Kc,ν(2)<K^{(2)}_{c,\nu}<\infty such that

𝔼[ηc|m]Γ(1+ν2+c)Γ(1+ν2)mc+Kc,ν(2)for all m>0.\displaystyle\mathbb{E}[\eta^{c}|m]\leq\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}m^{-c}+K^{(2)}_{c,\nu}\quad\text{for all }m>0. (107)
Proof.

By Lemma C.4, for all max{1,1+ν2}<c\max\{-1,-\frac{1+\nu}{2}\}<c,

𝔼[ηc|m]=1νcΓ(1+ν2+c)Γ(1+ν2)U(1+ν2+c,1+c,mν)U(1+ν2,1,mν).\displaystyle\mathbb{E}[\eta^{c}|m]=\frac{1}{\nu^{c}}\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}\frac{U(\frac{1+\nu}{2}+c,1+c,\frac{m}{\nu})}{U(\frac{1+\nu}{2},1,\frac{m}{\nu})}. (108)

Consider when c>0c>0. We apply Lemma C.5 with r=1+ν2r=\frac{1+\nu}{2}, x=mνx=\frac{m}{\nu}, and ϵ~=ϵ(Γ(1+ν2+c)Γ(1+ν2))1\tilde{\epsilon}=\epsilon\Big{(}\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}\Big{)}^{-1}. Then there exists some K~ϵ,c,ν(1)<\tilde{K}^{(1)}_{\epsilon,c,\nu}<\infty such that

𝔼[ηc|m]1νcΓ(1+ν2+c)Γ(1+ν2)(ϵ~(mν)c+K~ϵ,c,ν(1))=ϵmc+Kϵ,c,ν(1)\displaystyle\mathbb{E}[\eta^{c}|m]\leq\frac{1}{\nu^{c}}\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}\Big{(}\tilde{\epsilon}\Big{(}\frac{m}{\nu}\Big{)}^{-c}+\tilde{K}^{(1)}_{\epsilon,c,\nu}\Big{)}=\epsilon m^{-c}+K^{(1)}_{{\epsilon},c,\nu} (109)

for Kϵ,c,ν(1)=1νcΓ(1+ν2+c)Γ(1+ν2)K~ϵ,c,ν(1)<K^{(1)}_{{\epsilon},c,\nu}=\frac{1}{\nu^{c}}\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}\tilde{K}^{(1)}_{\epsilon,c,\nu}<\infty.

Consider when max{1,1+ν2}<c<0\max\{-1,-\frac{1+\nu}{2}\}<c<0. By Lemma C.5 with r=1+ν2r=\frac{1+\nu}{2} and x=mνx=\frac{m}{\nu}, there exists some K~c,ν(2)<\tilde{K}^{(2)}_{c,\nu}<\infty such that

𝔼[ηc|m]1νcΓ(1+ν2+c)Γ(1+ν2)((mν)c+K~c,ν(2))=Γ(1+ν2+c)Γ(1+ν2)mc+Kc,ν(2)\displaystyle\mathbb{E}[\eta^{c}|m]\leq\frac{1}{\nu^{c}}\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}\Big{(}\Big{(}\frac{m}{\nu}\Big{)}^{-c}+\tilde{K}^{(2)}_{c,\nu}\Big{)}=\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}m^{-c}+K^{(2)}_{c,\nu} (110)

for Kc,ν(2)=1νcΓ(1+ν2+c)Γ(1+ν2)K~c,ν(2)<K^{(2)}_{c,\nu}=\frac{1}{\nu^{c}}\frac{\Gamma(\frac{1+\nu}{2}+c)}{\Gamma(\frac{1+\nu}{2})}\tilde{K}^{(2)}_{c,\nu}<\infty. ∎

Lemma C.7.

Let X𝒩(μ,σ2)X\sim\mathcal{N}(\mu,\sigma^{2}) on \mathbb{R}. Then, for c(0,1)c\in(0,1),

𝔼[|X|c]1σc2c/2Γ(1c2)π\mathbb{E}[|X|^{-c}]\leq\frac{1}{\sigma^{c}2^{c/2}}\frac{\Gamma(\frac{1-c}{2})}{\sqrt{\pi}} (111)
Proof.

We follow the proof in Johndrow et al. [2020], which is included here for completeness. Recall the inverse moments of the univariate Normal distribution,

𝔼[|X|c]=1σc2c/2Γ(1c2)πexp(μ22σ2)M(1c2,12,μ22σ2)\mathbb{E}[|X|^{-c}]=\frac{1}{\sigma^{c}2^{c/2}}\frac{\Gamma(\frac{1-c}{2})}{\sqrt{\pi}}\exp(-\frac{\mu^{2}}{2\sigma^{2}})M\Big{(}\frac{1-c}{2},\frac{1}{2},\frac{\mu^{2}}{2\sigma^{2}}\Big{)} (112)

where M(a,b,z):=F11(a;b;z)M(a,b,z):={}_{1}{F}_{1}(a;b;z) is the confluent hypergeometric function of the first kind. For a<ba<b, we know M(a,b,z)=Γ(b)Γ(a)Γ(ba)01eztta1(1t)ba1𝑑tM(a,b,z)=\frac{\Gamma(b)}{\Gamma(a)\Gamma(b-a)}\int_{0}^{1}e^{zt}t^{a-1}(1-t)^{b-a-1}dt. For a=1c2<b=12a=\frac{1-c}{2}<b=\frac{1}{2} and z=μ22σ2z=\frac{\mu^{2}}{2\sigma^{2}}, this gives

exp(z)M(a,b,z)\displaystyle\exp(-z)M(a,b,z) =Γ(b)Γ(a)Γ(ba)01ez(1t)ta1(1t)ba1𝑑t\displaystyle=\frac{\Gamma(b)}{\Gamma(a)\Gamma(b-a)}\int_{0}^{1}e^{-z(1-t)}t^{a-1}(1-t)^{b-a-1}dt (113)
=Γ(b)Γ(a)Γ(ba)01ezt(1t)a1tba1𝑑t.\displaystyle=\frac{\Gamma(b)}{\Gamma(a)\Gamma(b-a)}\int_{0}^{1}e^{-zt}(1-t)^{a-1}t^{b-a-1}dt. (114)

Therefore, exp(z)M(a,b,z)\exp(-z)M(a,b,z) is decreasing for z0z\geq 0 and exp(z)M(a,b,z)exp(0)M(a,b,0)=1\exp(-z)M(a,b,z)\leq\exp(-0)M(a,b,0)=1 for all z0z\geq 0. This gives

𝔼[|X|c]=1σc2c/2Γ(1c2)πexp(μ22σ2)M(1c2,12,μ22σ2)1σc2c/2Γ(1c2)π.\mathbb{E}[|X|^{-c}]=\frac{1}{\sigma^{c}2^{c/2}}\frac{\Gamma(\frac{1-c}{2})}{\sqrt{\pi}}\exp(-\frac{\mu^{2}}{2\sigma^{2}})M\Big{(}\frac{1-c}{2},\frac{1}{2},\frac{\mu^{2}}{2\sigma^{2}}\Big{)}\leq\frac{1}{\sigma^{c}2^{c/2}}\frac{\Gamma(\frac{1-c}{2})}{\sqrt{\pi}}. (115)

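A quick Monte Carlo check of the bound (111) can be run as follows; it is purely illustrative, and the values of \mu, \sigma and c are arbitrary.

import numpy as np
from scipy.special import gamma

rng = np.random.default_rng(1)
mu, sigma, c = 1.3, 0.8, 0.4        # arbitrary values with c in (0,1)

x = rng.normal(mu, sigma, size=10**6)
empirical = np.mean(np.abs(x) ** (-c))
bound = gamma((1 - c) / 2) / (sigma ** c * 2 ** (c / 2) * np.sqrt(np.pi))
print(empirical, bound)             # the Monte Carlo estimate should not exceed the bound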
Lemma C.8.

(A uniform bound on the full conditional mean of 1/σ21/\sigma^{2}). Consider the full conditional distribution of σ2\sigma^{2} given η,ξ\eta,\xi in Algorithm 3. Given η>0p,ξ>0\eta\in\mathbb{R}_{>0}^{p},\xi>0, we have

σ2|ξ,ηInvGamma(a0+n2,yTMξ,η1y+b02),\sigma^{2}|\xi,\eta\sim\mathrm{InvGamma}\bigg{(}\frac{a_{0}+n}{2},\frac{y^{T}M_{\xi,\eta}^{-1}y+b_{0}}{2}\bigg{)}, (116)

where Mξ,η=In+ξ1XDiag(η1)XTM_{\xi,\eta}=I_{n}+\xi^{-1}X\text{Diag}(\eta^{-1})X^{T}. Then

𝔼[1σ2|ξ,η]=a0+nyTMξ,η1y+b0<a0+nb0<.\mathbb{E}\Big{[}\frac{1}{\sigma^{2}}\Big{|}\xi,\eta\Big{]}=\frac{a_{0}+n}{y^{T}M_{\xi,\eta}^{-1}y+b_{0}}<\frac{a_{0}+n}{b_{0}}<\infty. (117)
Proof.

Recall that the mean of a Gamma(a,b)\mathrm{Gamma}(a,b) distribution (under shape and rate parameterization) is a/ba/b. Equation (117) now directly follows as Mξ,η1M_{\xi,\eta}^{-1} is positive semi-definite. ∎

Lemma C.9.

(Maximal coupling probability of each component ηj\eta_{j}). For fixed constants m>0m>0 and ν>0\nu>0, define a distribution on (0,)(0,\infty) with probability density function

p(η|m)1η1ν2(1+νη)ν+12emη.p(\eta|m)\propto\frac{1}{\eta^{\frac{1-\nu}{2}}(1+\nu\eta)^{\frac{\nu+1}{2}}}e^{-m\eta}. (118)

Consider random variables ηp(|m)\eta\sim p(\cdot|m), and η~p(|m~)\tilde{\eta}\sim p(\cdot|\tilde{m}) for some fixed ν>0\nu>0. Fix some 0<a<b<0<a<b<\infty, and take m,m~(a,b)m,\tilde{m}\in(a,b). Then under a maximal coupling,

(η=η~|m,m~)=0min{p(η|m),p(η|m~)}𝑑ηU(1+ν2,1,bν)/U(1+ν2,1,aν).\mathbb{P}(\eta=\tilde{\eta}|m,\tilde{m})=\int_{0}^{\infty}\min\{p(\eta^{*}|m),p(\eta^{*}|\tilde{m})\}d\eta^{*}\geq U\Big{(}\frac{1+\nu}{2},1,\frac{b}{\nu}\Big{)}\Big{/}U\Big{(}\frac{1+\nu}{2},1,\frac{a}{\nu}\Big{)}. (119)
Proof of Lemma C.9.

Fix some 0<a<b<0<a<b<\infty, and take m,m~(a,b)m,\tilde{m}\in(a,b). Then

(η=η~|m,m~)\displaystyle\mathbb{P}(\eta=\tilde{\eta}|m,\tilde{m}) =0min{p(η|m),p(η|m~)}𝑑η, as (η,η~)|m,m~ are maximally coupled\displaystyle=\int_{0}^{\infty}\min\{p(\eta^{*}|m),p(\eta^{*}|\tilde{m})\}d\eta^{*}\text{, as }(\eta,\tilde{\eta})|m,\tilde{m}\text{ are maximally coupled} (120)
=1(η)1ν2(1+νη)ν+12min{emηU(1+ν2,1,mν),em~ηU(1+ν2,1,m~ν)}𝑑η.\displaystyle=\int\frac{1}{(\eta^{*})^{\frac{1-\nu}{2}}(1+\nu\eta^{*})^{\frac{\nu+1}{2}}}\min\Big{\{}\frac{e^{-m\eta^{*}}}{U\big{(}\frac{1+\nu}{2},1,\frac{m}{\nu}\big{)}},\frac{e^{-\tilde{m}\eta^{*}}}{U\big{(}\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu}\big{)}}\Big{\}}d\eta^{*}. (121)

Recall m,m~(a,b)m,\tilde{m}\in(a,b). Note also that xU(1+ν2,1,x)x\mapsto U\big{(}\frac{1+\nu}{2},1,x\big{)} is a decreasing function for x>0x>0. This gives

min{emηU(1+ν2,1,mν),em~ηU(1+ν2,1,m~ν)}>ebηU(1+ν2,1,aν).\min\Big{\{}\frac{e^{-m\eta^{*}}}{U\big{(}\frac{1+\nu}{2},1,\frac{m}{\nu}\big{)}},\frac{e^{-\tilde{m}\eta^{*}}}{U\big{(}\frac{1+\nu}{2},1,\frac{\tilde{m}}{\nu}\big{)}}\Big{\}}>\frac{e^{-b\eta^{*}}}{U\big{(}\frac{1+\nu}{2},1,\frac{a}{\nu}\big{)}}. (122)

Therefore for all m,m~(a,b)m,\tilde{m}\in(a,b),

(η=η~|m,m~)\displaystyle\mathbb{P}(\eta=\tilde{\eta}|m,\tilde{m}) >1(η)1ν2(1+νη)ν+12ebηU(1+ν2,1,aν)𝑑η=U(1+ν2,1,bν)U(1+ν2,1,aν).\displaystyle>\int\frac{1}{(\eta^{*})^{\frac{1-\nu}{2}}(1+\nu\eta^{*})^{\frac{\nu+1}{2}}}\frac{e^{-b\eta^{*}}}{U\big{(}\frac{1+\nu}{2},1,\frac{a}{\nu}\big{)}}d\eta^{*}=\frac{U\big{(}\frac{1+\nu}{2},1,\frac{b}{\nu}\big{)}}{U\big{(}\frac{1+\nu}{2},1,\frac{a}{\nu}\big{)}}. (123)

Equation (119) follows from taking an infimum over all m,m~(a,b)m,\tilde{m}\in(a,b). ∎
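The lower bound (119) can also be checked numerically by computing the overlap \int\min\{p(\eta|m),p(\eta|\tilde{m})\}d\eta directly; the sketch below does so for arbitrary values of \nu, a, b, m and \tilde{m}, using the normalizing constant obtained from (88) and (89) with c=0.

import numpy as np
from scipy.integrate import quad
from scipy.special import gamma, hyperu

nu, a, b = 1.0, 0.5, 3.0            # arbitrary values; m and m_tilde lie in (a, b)
m, m_tilde = 1.0, 2.5

def density(eta, mm):
    # normalizing constant from (88)-(89) with c = 0
    Z = nu ** (-(nu + 1) / 2) * gamma((nu + 1) / 2) * hyperu((1 + nu) / 2, 1, mm / nu)
    return eta ** ((nu - 1) / 2) * (1 + nu * eta) ** (-(nu + 1) / 2) * np.exp(-mm * eta) / Z

overlap, _ = quad(lambda e: min(density(e, m), density(e, m_tilde)), 0, np.inf)
bound = hyperu((1 + nu) / 2, 1, b / nu) / hyperu((1 + nu) / 2, 1, a / nu)
print(overlap, bound)               # the overlap is at least the stated lower bound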

Drift and Minorization conditions.
Proof of Proposition 2.1.

For ease of notation, denote Σt+1:=XTX+ξt+1Diag(ηt+1)\Sigma_{t+1}:=X^{T}X+\xi_{t+1}\text{Diag}(\eta_{t+1}) and let eje_{j} be the jthj^{th} unit vector of the standard basis on p\mathbb{R}^{p}. We first consider j=1pmjd\sum_{j=1}^{p}m_{j}^{d}. Note that

𝔼[j=1pmt+1,jd|ηt+1,ξt+1,σt+12]\displaystyle\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{d}|\eta_{t+1},\xi_{t+1},\sigma_{t+1}^{2}\Big{]} =(ξt+12σt+12)dj=1p𝔼[|βt+1,j|2d|ηt+1,ξt+1,σt+12]\displaystyle=\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\sum_{j=1}^{p}\mathbb{E}\Big{[}\big{|}\beta_{t+1,j}\big{|}^{2d}\big{|}\eta_{t+1},\xi_{t+1},\sigma_{t+1}^{2}\Big{]} (124)
(ξt+12σt+12)dj=1p𝔼[βt+1,j2|ηt+1,ξt+1,σt+12]d\displaystyle\leq\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\sum_{j=1}^{p}\mathbb{E}\Big{[}\beta^{2}_{t+1,j}|\eta_{t+1},\xi_{t+1},\sigma_{t+1}^{2}\Big{]}^{d} (125)
=(ξt+12σt+12)dj=1p((ejTΣt+11XTy)2+σt+12(ejTΣt+11ej))d\displaystyle=\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\sum_{j=1}^{p}\bigg{(}\big{(}e_{j}^{T}\Sigma_{t+1}^{-1}X^{T}y\big{)}^{2}+\sigma_{t+1}^{2}\big{(}e_{j}^{T}\Sigma_{t+1}^{-1}e_{j}\big{)}\bigg{)}^{d} (126)
(ξt+12σt+12)dj=1p(|ejTΣt+11XTy|2d+σt+12d(ejTΣt+11ej)d)\displaystyle\leq\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\sum_{j=1}^{p}\Big{(}|e_{j}^{T}\Sigma_{t+1}^{-1}X^{T}y|^{2d}+\sigma_{t+1}^{2d}\big{(}e_{j}^{T}\Sigma_{t+1}^{-1}e_{j}\big{)}^{d}\Big{)} (127)
(ξt+12σt+12)d(j=1p(|ejTΣt+11XTy|2d)1/d)dp1d\displaystyle\leq\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\Big{(}\sum_{j=1}^{p}\big{(}|e_{j}^{T}\Sigma_{t+1}^{-1}X^{T}y|^{2d}\big{)}^{1/d}\Big{)}^{d}p^{1-d}
+σt+12d(ξt+12σt+12)dj=1p(ejT(ξt+1Diag(ηt+1))1ej)d\displaystyle\qquad+\sigma_{t+1}^{2d}\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\sum_{j=1}^{p}\big{(}e_{j}^{T}\big{(}\xi_{t+1}\text{Diag}(\eta_{t+1})\big{)}^{-1}e_{j}\big{)}^{d} (128)
=(ξt+12σt+12)d(Σt+11XTy22)dp1d+12dj=1p1ηt+1,jd\displaystyle=\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\Big{(}\|\Sigma_{t+1}^{-1}X^{T}y\|^{2}_{2}\Big{)}^{d}p^{1-d}+\frac{1}{2^{d}}\sum_{j=1}^{p}\frac{1}{\eta_{t+1,j}^{d}} (129)
(ξt+12σt+12)d(Cp2y22)dp1d+12dj=1p1ηt+1,jd,\displaystyle\leq\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}\Big{(}C_{p}^{2}\|y\|_{2}^{2}\Big{)}^{d}p^{1-d}+\frac{1}{2^{d}}\sum_{j=1}^{p}\frac{1}{\eta_{t+1,j}^{d}}, (130)

where (125) follows from Jensen’s inequality, (127) follows as (x+y)^{d}\leq x^{d}+y^{d} for x,y\geq 0 and d\in(0,1), (128) follows from Hölder’s inequality and from e_{j}^{T}\Sigma_{t+1}^{-1}e_{j}\leq e_{j}^{T}\big{(}\xi_{t+1}\text{Diag}(\eta_{t+1})\big{)}^{-1}e_{j} as X^{T}X is a positive semi-definite matrix, and (130) follows from Lemma C.3 for some finite constant C_{p} which does not depend on \eta_{t+1} or \xi_{t+1}.

As the prior \pi_{\xi}(\cdot) on \xi has compact support, we have b\leq\xi_{t+1}\leq B for all t\geq 0, for some 0<b\leq B<\infty. This upper bound on \xi_{t+1}, Lemma C.8 and Jensen’s inequality give

𝔼[(ξt+12σt+12)d|ηt+1]Bd2d𝔼[(1σt+12)|ηt+1]dBd2d(a0+nb0)d=(B(a0+n)2b0)d.\displaystyle\mathbb{E}\Big{[}\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{d}|\eta_{t+1}\Big{]}\leq\frac{B^{d}}{2^{d}}\mathbb{E}\Big{[}\Big{(}\frac{1}{\sigma_{t+1}^{2}}\Big{)}|\eta_{t+1}\Big{]}^{d}\leq\frac{B^{d}}{2^{d}}\big{(}\frac{a_{0}+n}{b_{0}}\big{)}^{d}=\Big{(}\frac{B(a_{0}+n)}{2b_{0}}\Big{)}^{d}. (131)

Therefore, by (130) and (131), we obtain

𝔼[j=1pmt+1,jd|ηt+1]=𝔼[𝔼[j=1pmt+1,jd|ηt+1,ξt+1,σt+12]|ηt+1]12dj=1p1ηt+1,jd+K1\displaystyle\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{d}|\eta_{t+1}\Big{]}=\mathbb{E}\Big{[}\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{d}|\eta_{t+1},\xi_{t+1},\sigma_{t+1}^{2}\Big{]}|\eta_{t+1}\Big{]}\leq\frac{1}{2^{d}}\sum_{j=1}^{p}\frac{1}{\eta_{t+1,j}^{d}}+K_{1} (132)

for K1:=(B(a0+n)2b0)d(Cp2y22)dp1d<K_{1}:=\Big{(}\frac{B(a_{0}+n)}{2b_{0}}\Big{)}^{d}\Big{(}C_{p}^{2}\|y\|_{2}^{2}\Big{)}^{d}p^{1-d}<\infty. By (132) and Corollary C.6, we obtain

𝔼[j=1pmt+1,jd|βt,ξt,σt2]\displaystyle\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{d}|\beta_{t},\xi_{t},\sigma_{t}^{2}\Big{]} =𝔼[𝔼[j=1pmt+1,jd|ηt+1]|βt,ξt,σt2]\displaystyle=\mathbb{E}\Big{[}\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{d}|\eta_{t+1}\Big{]}|\beta_{t},\xi_{t},\sigma_{t}^{2}\Big{]} (133)
12dj=1p𝔼[ηt+1,jd|βt,ξt,σt2]+K1\displaystyle\leq\frac{1}{2^{d}}\sum_{j=1}^{p}\mathbb{E}\Big{[}\eta_{t+1,j}^{-d}|\beta_{t},\xi_{t},\sigma_{t}^{2}\Big{]}+K_{1} (134)
\displaystyle\leq\frac{1}{2^{d}}\sum_{j=1}^{p}\Big{(}\frac{\Gamma(\frac{1+\nu}{2}-d)}{\Gamma(\frac{1+\nu}{2})}m_{t,j}^{d}+K^{(2)}_{d,\nu}\Big{)}+K_{1} (135)
12dΓ(1+ν2d)Γ(1+ν2)j=1pmt,jd+K2\displaystyle\leq\frac{1}{2^{d}}\frac{\Gamma(\frac{1+\nu}{2}-d)}{\Gamma(\frac{1+\nu}{2})}\sum_{j=1}^{p}m_{t,j}^{d}+K_{2} (136)

for K2:=p2dKd,ν(2)+K1<K_{2}:=\frac{p}{2^{d}}K^{(2)}_{d,\nu}+K_{1}<\infty. Note that for every ν1\nu\geq 1, there exists d(0,1)d\in(0,1) such that 12dΓ(ν+12d)Γ(ν+12)<1\frac{1}{2^{d}}\frac{\Gamma(\frac{\nu+1}{2}-d)}{\Gamma(\frac{\nu+1}{2})}<1. We can verify this by considering the function f(d):=12dΓ(ν+12d)Γ(ν+12)f(d):=\frac{1}{2^{d}}\frac{\Gamma(\frac{\nu+1}{2}-d)}{\Gamma(\frac{\nu+1}{2})}. It suffices to note that f(0)=1f(0)=1 and f(0)=ψ(ν+12)log(2)<0f^{\prime}(0)=-\psi(\frac{\nu+1}{2})-\log(2)<0 for ν1\nu\geq 1, where ψ(x):=Γ(x)Γ(x)\psi(x):=\frac{\Gamma^{\prime}(x)}{\Gamma(x)} is the Digamma function.
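The existence of such a d can also be checked numerically (this is only an illustration of the argument above, not part of the proof); the value \nu=1 below is an arbitrary choice.

import numpy as np
from scipy.special import gamma

nu = 1.0                                     # arbitrary; any nu >= 1 behaves similarly
d_grid = np.linspace(0.01, 0.99, 99)
f = 2.0 ** (-d_grid) * gamma((nu + 1) / 2 - d_grid) / gamma((nu + 1) / 2)
print(d_grid[f < 1])                         # non-empty: small d already gives f(d) < 1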

We now consider j=1pmjc\sum_{j=1}^{p}m_{j}^{-c} for c(0,1/2)c\in(0,1/2). We obtain

𝔼[j=1p\displaystyle\mathbb{E}\Big{[}\sum_{j=1}^{p} mt+1,jc|σt+12,ξt+1,ηt+1]=(ξt+12σt+12)cj=1p𝔼[|βt+1,j|2c|σt+12,ξt+1,ηt+1]\displaystyle m_{t+1,j}^{-c}|\sigma^{2}_{t+1},\xi_{t+1},\eta_{t+1}\Big{]}=\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{-c}\sum_{j=1}^{p}\mathbb{E}\Big{[}\big{|}\beta_{t+1,j}\big{|}^{-2c}\big{|}\sigma^{2}_{t+1},\xi_{t+1},\eta_{t+1}\Big{]} (137)
(ξt+12σt+12)c12cΓ(12c2)πj=1p(σt+12ejTΣt+11ej)c\displaystyle\leq\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{-c}\frac{1}{2^{c}}\frac{\Gamma(\frac{1-2c}{2})}{\sqrt{\pi}}\sum_{j=1}^{p}\big{(}\sigma_{t+1}^{2}e_{j}^{T}\Sigma_{t+1}^{-1}e_{j}\big{)}^{-c} (138)
(ξt+12σt+12)c12cΓ(12c2)πj=1p(σt+12ejT(X22Ip+ξt+1Diag(ηt+1))1ej)c\displaystyle\leq\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{-c}\frac{1}{2^{c}}\frac{\Gamma(\frac{1-2c}{2})}{\sqrt{\pi}}\sum_{j=1}^{p}\big{(}\sigma_{t+1}^{2}e_{j}^{T}(\|X\|_{2}^{2}I_{p}+\xi_{t+1}\text{Diag}(\eta_{t+1}))^{-1}e_{j}\big{)}^{-c} (139)
=(ξt+12σt+12)c12cΓ(12c2)πj=1p(X22+ξt+1ηt+1,j)cσt+12c\displaystyle=\Big{(}\frac{\xi_{t+1}}{2\sigma_{t+1}^{2}}\Big{)}^{-c}\frac{1}{2^{c}}\frac{\Gamma(\frac{1-2c}{2})}{\sqrt{\pi}}\sum_{j=1}^{p}\frac{(\|X\|_{2}^{2}+\xi_{t+1}\eta_{t+1,j})^{c}}{\sigma_{t+1}^{2c}} (140)
Γ(12c)πj=1p((X22ξt+1)c+ηt+1,jc)\displaystyle\leq\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\sum_{j=1}^{p}\Big{(}\Big{(}\frac{\|X\|_{2}^{2}}{\xi_{t+1}}\Big{)}^{c}+\eta_{t+1,j}^{c}\Big{)} (141)
Γ(12c)πj=1pηt+1,jc+pΓ(12c)π(X22b)c\displaystyle\leq\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\sum_{j=1}^{p}\eta_{t+1,j}^{c}+p\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\Big{(}\frac{\|X\|_{2}^{2}}{b}\Big{)}^{c} (142)

for some b>0. Here (138) follows from Lemma C.7, (139) follows as e_{j}^{T}(\|X\|_{2}^{2}I_{p}+\xi_{t+1}\text{Diag}(\eta_{t+1}))^{-1}e_{j}\leq e_{j}^{T}\Sigma_{t+1}^{-1}e_{j} for \|\cdot\|_{2} the L_{2} operator norm of a matrix, (141) follows as (x+y)^{c}\leq x^{c}+y^{c} for all x,y>0 and 0<c<1/2, and (142) follows as the prior on \xi has compact support, so \xi_{t+1}\geq b for some b>0.

𝔼[j=1pmt+1,jc|ηt+1]=𝔼[𝔼[j=1pmt+1,jc|σt+12,ξt+1,ηt+1]|ηt+1]Γ(12c)πj=1pηt+1,jc+K3.\displaystyle\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{-c}|\eta_{t+1}\Big{]}=\mathbb{E}\Big{[}\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{-c}|\sigma^{2}_{t+1},\xi_{t+1},\eta_{t+1}\Big{]}|\eta_{t+1}\Big{]}\leq\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\sum_{j=1}^{p}\eta_{t+1,j}^{c}+K_{3}. (143)

for K3:=pΓ(12c)π(X22b)c<K_{3}:=p\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\Big{(}\frac{\|X\|_{2}^{2}}{b}\Big{)}^{c}<\infty. By (143) and Corollary C.6 with r=1+ν2r=\frac{1+\nu}{2} and any ϵ<πΓ(12c)\epsilon<\frac{\sqrt{\pi}}{\Gamma(\frac{1}{2}-c)}, there exists Kϵ,c,ν(1)<K^{(1)}_{\epsilon,c,\nu}<\infty such that

𝔼[j=1pmt+1,jc|σt2,ξt,βt]\displaystyle\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{-c}|\sigma^{2}_{t},\xi_{t},\beta_{t}\Big{]} =𝔼[𝔼[j=1pmt+1,jc|ηt+1]|σt2,ξt,βt]\displaystyle=\mathbb{E}\Big{[}\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{-c}|\eta_{t+1}\Big{]}|\sigma^{2}_{t},\xi_{t},\beta_{t}\Big{]} (144)
Γ(12c)πj=1p𝔼[ηt+1,jc|σt2,ξt,βt]+K3\displaystyle\leq\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\sum_{j=1}^{p}\mathbb{E}\Big{[}\eta_{t+1,j}^{c}|\sigma^{2}_{t},\xi_{t},\beta_{t}\Big{]}+K_{3} (145)
Γ(12c)πj=1p(ϵmt,jc+Kϵ,c,ν(1))+K3\displaystyle\leq\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\sum_{j=1}^{p}\Big{(}\epsilon m_{t,j}^{-c}+K^{(1)}_{\epsilon,c,\nu}\Big{)}+K_{3} (146)
=(Γ(12c)πϵ)j=1pmt,jc+K4,\displaystyle=\Big{(}\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\epsilon\Big{)}\sum_{j=1}^{p}m_{t,j}^{-c}+K_{4}, (147)

where Γ(12c)πϵ<1\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\epsilon<1 and K4=pΓ(12c)πKϵ,c,ν(1)+K3K_{4}=p\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}K^{(1)}_{\epsilon,c,\nu}+K_{3}.

We can now combine (136) and (147). Then, for some fixed d\in(0,1) such that \frac{1}{2^{d}}\frac{\Gamma(\frac{\nu+1}{2}-d)}{\Gamma(\frac{\nu+1}{2})}<1 and any c\in(0,1/2), we obtain

𝔼[j=1pmt+1,jc+mt+1,jd|mt]γdrift(j=1pmt,jc+mt,jd)+Kdrift\displaystyle\mathbb{E}\Big{[}\sum_{j=1}^{p}m_{t+1,j}^{-c}+m_{t+1,j}^{d}|m_{t}\Big{]}\leq\gamma_{drift}\Big{(}\sum_{j=1}^{p}m_{t,j}^{-c}+m_{t,j}^{d}\Big{)}+K_{drift} (148)

for some

0<γdrift\displaystyle 0<\gamma_{drift} :=max{Γ(12c)πϵ,12dΓ(ν+12d)Γ(ν+12)}<1\displaystyle:=max\Big{\{}\frac{\Gamma(\frac{1}{2}-c)}{\sqrt{\pi}}\epsilon,\frac{1}{2^{d}}\frac{\Gamma(\frac{\nu+1}{2}-d)}{\Gamma(\frac{\nu+1}{2})}\Big{\}}<1 (149)
0<Kdrift\displaystyle 0<K_{drift} :=K2+K4<.\displaystyle:=K_{2}+K_{4}<\infty. (150)

Therefore, for any such d(0,1)d\in(0,1) and c(0,1/2)c\in(0,1/2),

V(β)=j=1pmjc+mjdV(\beta)=\sum_{j=1}^{p}m_{j}^{-c}+m_{j}^{d} (151)

for mj=ξβj22σ2m_{j}=\frac{\xi\beta_{j}^{2}}{2\sigma^{2}} is a Lyapunov function. ∎

Proposition C.10.

(Minorization condition). For R>0, let S(R)=\{(\beta,\xi,\sigma^{2})\in\mathbb{R}^{p}\times\mathbb{R}_{>0}\times\mathbb{R}_{>0}:V(\beta,\xi,\sigma^{2})<R\} denote the sub-level sets of the Lyapunov function in Proposition 2.1. Let \mathcal{P} denote the Markov transition kernel associated with the update from (\beta_{t},\xi_{t},\sigma_{t}^{2}) to (\beta_{t+1},\xi_{t+1},\sigma_{t+1}^{2}) implied by the blocked Gibbs update rule. Let Z=(\beta,\xi,\sigma^{2}) and \tilde{Z}=(\tilde{\beta},\tilde{\xi},\tilde{\sigma}^{2}). Then for all R>0, there exists \epsilon\in(0,1) such that

TV(𝒫(Z,),𝒫(Z~,))<1ϵ for all Z,Z~S(R),\text{TV}\Big{(}\mathcal{P}(Z,\cdot),\mathcal{P}(\tilde{Z},\cdot)\Big{)}<1-\epsilon\quad\text{ for all }Z,\tilde{Z}\in S(R), (152)

where TV stands for the total variation distance between two probability distributions. In particular for R>1R>1, ϵ=(U(1+ν2,1,R1/dν)/U(1+ν2,1,R1/cν))p\epsilon=(U\big{(}\frac{1+\nu}{2},1,\frac{R^{1/d}}{\nu}\big{)}/U\big{(}\frac{1+\nu}{2},1,\frac{R^{-1/c}}{\nu}\big{)})^{p} suffices.

Proof of Proposition C.10.

Denote Z=(β,ξ,σ2)Z=(\beta,\xi,\sigma^{2}) and Z~=(β~,ξ~,σ~2)\tilde{Z}=(\tilde{\beta},\tilde{\xi},\tilde{\sigma}^{2}), and let Z,Z~S(R)Z,\tilde{Z}\in S(R) for some R>1R>1. We obtain

TV (𝒫(Z,),𝒫(Z~,))(Z1Z~1|Z,Z~)<1ϵ,\displaystyle\big{(}\mathcal{P}(Z,\cdot),\mathcal{P}(\tilde{Z},\cdot)\big{)}\leq\mathbb{P}\big{(}Z_{1}\neq\tilde{Z}_{1}\big{|}Z,\tilde{Z}\big{)}<1-\epsilon, (153)

where (Z1,Z~1)(Z_{1},\tilde{Z}_{1}) corresponds to one-scale coupling of 𝒫(Z,)\mathcal{P}(Z,\cdot) and 𝒫(Z~,)\mathcal{P}(\tilde{Z},\cdot) as in Proposition 2.2, the first inequality follows from the coupling inequality, and the second inequality follows from the proof of Proposition 2.2 for ϵ=(U(1+ν2,1,R1/dν)/U(1+ν2,1,R1/cν))p(0,1)\epsilon=\Big{(}U\big{(}\frac{1+\nu}{2},1,\frac{R^{1/d}}{\nu}\big{)}\Big{/}U\big{(}\frac{1+\nu}{2},1,\frac{R^{-1/c}}{\nu}\big{)}\Big{)}^{p}\in(0,1) as R>1R>1. Note that ϵ\epsilon does not depend on ZZ and Z~\tilde{Z} as required. ∎

Geometric ergodicity follows from Proposition 2.1 and Proposition C.10 with any R>max{1,2K/(1γ)}R>\max\{1,2K/(1-\gamma)\} [Rosenthal, 1995].

C.3 Coupled Kernel Proofs

Proof of Proposition 2.2.

For R>0, let S(R):=\{(\beta,\xi,\sigma^{2})\in\mathbb{R}^{p}\times\mathbb{R}_{>0}\times\mathbb{R}_{>0}:V(\beta,\xi,\sigma^{2})<R\} denote the sub-level sets of the Lyapunov function in Proposition 2.1, with corresponding contraction constant \gamma\in(0,1) and additive constant K>0. Let Z_{t}:=(\beta_{t},\xi_{t},\sigma_{t}^{2}) and \tilde{Z}_{t}=(\tilde{\beta}_{t},\tilde{\xi}_{t},\tilde{\sigma}_{t}^{2}) for all t\geq 0. By Jacob et al. [2020, Proposition 4], it suffices to prove that for any R>0 sufficiently large such that \gamma+K/(1+R)<1, there exists \epsilon\in(0,1) such that

infZt,Z~tS(R)(Zt+1=Z~t+1|Zt,Z~t)>ϵ,\inf_{Z_{t},\tilde{Z}_{t}\in S(R)}\mathbb{P}\big{(}Z_{t+1}=\tilde{Z}_{t+1}\big{|}Z_{t},\tilde{Z}_{t}\big{)}>\epsilon, (154)

under the one-scale coupling. We now prove (154) directly for the one-scale coupling in Proposition 2.2. Fix some R>1R>1 such that γ+K/(1+R)<1\gamma+K/(1+R)<1, and take any Zt,Z~tS(R)Z_{t},\tilde{Z}_{t}\in S(R). Then

\displaystyle\mathbb{P} (Zt+1=Z~t+1|Zt,Z~t)\displaystyle(Z_{t+1}=\tilde{Z}_{t+1}\big{|}Z_{t},\tilde{Z}_{t}) (155)
=𝔼[(Zt+1=Z~t+1|ηt+1,η~t+1,Zt,Z~t)|Zt,Z~t]\displaystyle=\mathbb{E}\Big{[}\mathbb{P}\Big{(}Z_{t+1}=\tilde{Z}_{t+1}\big{|}\eta_{t+1},\tilde{\eta}_{t+1},Z_{t},\tilde{Z}_{t}\Big{)}\Big{|}Z_{t},\tilde{Z}_{t}\Big{]} (156)
=𝔼[(Zt+1=Z~t+1|ηt+1,η~t+1)|Zt,Z~t]\displaystyle=\mathbb{E}\Big{[}\mathbb{P}\Big{(}Z_{t+1}=\tilde{Z}_{t+1}\big{|}\eta_{t+1},\tilde{\eta}_{t+1}\Big{)}\Big{|}Z_{t},\tilde{Z}_{t}\Big{]} (157)
=𝔼[1{ηt+1=η~t+1}|Zt,Z~t], as (Zt+1,Z~t+1)|(ηt+1,η~t+1) follows CRN coupling\displaystyle=\mathbb{E}\Big{[}\mathrm{1}_{\{\eta_{t+1}=\tilde{\eta}_{t+1}\}}\Big{|}Z_{t},\tilde{Z}_{t}\Big{]}\text{, as }(Z_{t+1},\tilde{Z}_{t+1})|(\eta_{t+1},\tilde{\eta}_{t+1})\text{ follows CRN coupling} (158)
=(ηt+1=η~t+1|Zt,Z~t)\displaystyle=\mathbb{P}\big{(}\eta_{t+1}=\tilde{\eta}_{t+1}\Big{|}Z_{t},\tilde{Z}_{t}\big{)} (159)
=j=1p(ηt+1,j=η~t+1,j|Zt,Z~t), as (ηt+1,η~t+1)|Zt,Z~t is maximally coupled componentwise\displaystyle=\prod_{j=1}^{p}\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\Big{|}Z_{t},\tilde{Z}_{t}\big{)}\text{, as }(\eta_{t+1},\tilde{\eta}_{t+1})|Z_{t},\tilde{Z}_{t}\text{ is maximally coupled componentwise} (160)
=j=1p(ηt+1,j=η~t+1,j|mj,m~j), formj=ξtβt,j22σt2 and m~j=ξ~tβ~t,j22σ~t2.\displaystyle=\prod_{j=1}^{p}\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\Big{|}m_{j},\tilde{m}_{j}\big{)}\text{, for}\ m_{j}=\frac{\xi_{t}\beta_{t,j}^{2}}{2\sigma_{t}^{2}}\text{ and }\tilde{m}_{j}=\frac{\tilde{\xi}_{t}\tilde{\beta}_{t,j}^{2}}{2\tilde{\sigma}_{t}^{2}}. (161)

As Z,Z~S(R)Z,\tilde{Z}\in S(R) for some R>1R>1, mj,m~j(R1/c,R1/d)m_{j},\tilde{m}_{j}\in(R^{-1/c},R^{1/d}) for all j=1,,pj=1,\ldots,p. By Lemma C.9,

(ηt+1,j=η~t+1,j|mj,m~j)U(1+ν2,1,R1/dν)/U(1+ν2,1,R1/cν)\displaystyle\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\Big{|}m_{j},\tilde{m}_{j}\big{)}\geq U\big{(}\frac{1+\nu}{2},1,\frac{R^{1/d}}{\nu}\big{)}\Big{/}U\big{(}\frac{1+\nu}{2},1,\frac{R^{-1/c}}{\nu}\big{)} (162)

for all j=1,,pj=1,\ldots,p. Therefore,

\displaystyle\mathbb{P} (Zt+1=Z~t+1|Zt,Z~t)j=1pU(1+ν2,1,R1/dν)U(1+ν2,1,R1/cν)=ϵ\displaystyle(Z_{t+1}=\tilde{Z}_{t+1}\big{|}Z_{t},\tilde{Z}_{t})\geq\prod_{j=1}^{p}\frac{U\big{(}\frac{1+\nu}{2},1,\frac{R^{1/d}}{\nu}\big{)}}{U\big{(}\frac{1+\nu}{2},1,\frac{R^{-1/c}}{\nu}\big{)}}=\epsilon (163)

for ϵ:=(U(1+ν2,1,R1/dν)/U(1+ν2,1,R1/cν))p(0,1)\epsilon:=(U(\frac{1+\nu}{2},1,\frac{R^{1/d}}{\nu})/U(\frac{1+\nu}{2},1,\frac{R^{-1/c}}{\nu}))^{p}\in(0,1) as R>1R>1. ∎
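For intuition only, the constant \epsilon above can be evaluated numerically for given R, c, d, \nu and p; the values in the sketch below are arbitrary and simply illustrate how the bound decays with p.

from scipy.special import hyperu

nu, d, c, R = 1.0, 0.5, 0.25, 10.0           # arbitrary choices with R > 1
a = (1 + nu) / 2
ratio = hyperu(a, 1, R ** (1 / d) / nu) / hyperu(a, 1, R ** (-1 / c) / nu)
for p in (10, 100, 1000):
    print(p, ratio ** p)                     # the lower bound decays geometrically in p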

Proof of Proposition 3.1.

As \eta_{t+1} given m_{t} and \tilde{\eta}_{t+1} given \tilde{m}_{t} are conditionally independent component-wise, it suffices to consider the univariate case where p=1. Henceforth we assume p=1 and drop the component subscripts for ease of notation. Without loss of generality, take \tilde{m}_{t}\leq m_{t} and h:(0,\infty)\rightarrow\mathbb{R} to be monotone increasing. Then the densities of \eta_{t+1}|m_{t} and \tilde{\eta}_{t+1}|\tilde{m}_{t} on (0,\infty) are given by

\displaystyle p(\eta|m_{t})\propto\frac{e^{-m_{t}\eta}}{\eta^{\frac{1-\nu}{2}}(1+\nu\eta)^{\frac{\nu+1}{2}}}\text{ and }p(\tilde{\eta}|\tilde{m}_{t})\propto\frac{e^{-\tilde{m}_{t}\tilde{\eta}}}{\tilde{\eta}^{\frac{1-\nu}{2}}(1+\nu\tilde{\eta})^{\frac{\nu+1}{2}}} (164)

respectively. As m~tmt\tilde{m}_{t}\leq m_{t}, the distribution of η~t+1\tilde{\eta}_{t+1} stochastically dominates the distribution of ηt+1\eta_{t+1}. Therefore under a CRN coupling of ηt+1{\eta}_{t+1} and η~t+1\tilde{\eta}_{t+1} (specifically a CRN coupling of Algorithm 11), η~t+1ηt+1\tilde{\eta}_{t+1}\geq\eta_{t+1} almost surely, and h(η~t+1)h(ηt+1)h(\tilde{\eta}_{t+1})\geq h(\eta_{t+1}) almost surely. We obtain

𝔼crn[|h(ηt+1)h(η~t+1)||mt,m~t]=𝔼[h(η~t+1)h(ηt+1)|mt,m~t].\mathbb{E}_{\text{crn}}\big{[}|h(\eta_{t+1})-h(\tilde{\eta}_{t+1})|\big{|}m_{t},\tilde{m}_{t}\big{]}=\mathbb{E}\big{[}h(\tilde{\eta}_{t+1})-h(\eta_{t+1})\big{|}m_{t},\tilde{m}_{t}\big{]}. (165)

Now consider a maximal coupling of ηt+1\eta_{t+1} and η~t+1\tilde{\eta}_{t+1} with independent residuals [Johnson, 1998]. Let q=TV(p(|mt),p(|m~t))q=\text{TV}(p(\cdot|m_{t}),p(\cdot|\tilde{m}_{t})). Then ηt+1=η~t+1\eta_{t+1}=\tilde{\eta}_{t+1} with probability 1q1-q. Otherwise with probability qq, (ηt+1,η~t+1)=(ηt+1,η~t+1)(\eta_{t+1},\tilde{\eta}_{t+1})=(\eta^{*}_{t+1},\tilde{\eta}^{*}_{t+1}) where ηt+1\eta^{*}_{t+1} and η~t+1\tilde{\eta}^{*}_{t+1} are independent random variables with densities

\displaystyle p(\eta^{*}_{t+1}|m_{t},\tilde{m}_{t}) \displaystyle=\max\{p(\eta^{*}_{t+1}|m_{t})-p(\eta^{*}_{t+1}|\tilde{m}_{t}),0\}/q\text{ and } (166)
\displaystyle p(\tilde{\eta}^{*}_{t+1}|m_{t},\tilde{m}_{t}) \displaystyle=\max\{p(\tilde{\eta}^{*}_{t+1}|\tilde{m}_{t})-p(\tilde{\eta}^{*}_{t+1}|{m}_{t}),0\}/q (167)

respectively on (0,)(0,\infty). This implies that the supports of ηt+1\eta^{*}_{t+1} and η~t+1\tilde{\eta}^{*}_{t+1} are disjoint. By (164), we further have ηt+1η~t+1\eta^{*}_{t+1}\leq\tilde{\eta}^{*}_{t+1} almost surely. Overall, we obtain η~t+1ηt+1\tilde{\eta}_{t+1}\geq\eta_{t+1} almost surely under a maximal coupling, and hence

𝔼max[|h(ηt+1)h(η~t+1)||mt,m~t]=𝔼[h(η~t+1)h(ηt+1)|mt,m~t]=𝔼crn[|h(ηt+1)h(η~t+1)||mt,m~t]\mathbb{E}_{\text{max}}\big{[}|h(\eta_{t+1})-h(\tilde{\eta}_{t+1})|\big{|}m_{t},\tilde{m}_{t}\big{]}=\mathbb{E}\big{[}h(\tilde{\eta}_{t+1})-h(\eta_{t+1})\big{|}m_{t},\tilde{m}_{t}\big{]}=\mathbb{E}_{\text{crn}}\big{[}|h(\eta_{t+1})-h(\tilde{\eta}_{t+1})|\big{|}m_{t},\tilde{m}_{t}\big{]} (168)

as required. ∎

Proof of Proposition 3.2.
CRN coupling result.

To prove (12), it suffices to show that under CRN,

|log(1+ηt+1,j)log(1+η~t+1,j)||logmt,jlogm~t,j|,|\log(1+\eta_{t+1,j})-\log(1+\tilde{\eta}_{t+1,j})|\leq|\log m_{t,j}-\log\tilde{m}_{t,j}|, (169)

almost surely for all j=1,\ldots,p. Henceforth, to prove (169) we work component-wise and drop the subscripts for ease of notation. For the Horseshoe prior, we then have p(\eta|m)\propto\exp(-m\eta)/(1+\eta) on (0,\infty). The cumulative distribution function is then given by F_{m}(t)=1-E_{1}(m(t+1))/E_{1}(m), where E_{1} is the exponential integral function. Therefore, under the CRN coupling, we obtain

1+η=1mE11(UE1(m)) and 1+η~=1m~E11(UE1(m~)),\displaystyle 1+\eta=\frac{1}{m}E_{1}^{-1}(U^{*}E_{1}(m))\text{ and }1+\tilde{\eta}=\frac{1}{\tilde{m}}E_{1}^{-1}(U^{*}E_{1}(\tilde{m})), (170)

for UUniform(0,1)U^{*}\sim\mathrm{Uniform}(0,1). Then

ddlogm(log(1+η))\displaystyle\frac{d}{d\log m}\big{(}\log(1+\eta)\big{)} =1+ddlogm(log(E11(UE1(m))))\displaystyle=-1+\frac{d}{d\log m}\big{(}\log\big{(}E_{1}^{-1}(U^{*}E_{1}(m))\big{)}\big{)} (171)
=1+mE11(UE1(m))(ddm(E11(UE1(m))))\displaystyle=-1+\frac{m}{E_{1}^{-1}(U^{*}E_{1}(m))}\Big{(}\frac{d}{dm}\big{(}E_{1}^{-1}(U^{*}E_{1}(m))\big{)}\Big{)} (172)
=1+mE11(UE1(m))UE1(m)E1(E11(UE1(m))), where E1(z)=ez/z\displaystyle=-1+\frac{m}{E_{1}^{-1}(U^{*}E_{1}(m))}\frac{U^{*}E_{1}^{\prime}(m)}{E_{1}^{\prime}\big{(}E_{1}^{-1}(U^{*}E_{1}(m))\big{)}}\text{, where }E^{\prime}_{1}(z)=-e^{-z}/z (173)
=1+mm(1+η)UE1(m)E1(m(1+η)), as m(1+η)=E11(UE1(m))\displaystyle=-1+\frac{m}{m(1+\eta)}\frac{U^{*}E_{1}^{\prime}(m)}{E_{1}^{\prime}\big{(}m(1+\eta)\big{)}}\text{, as }m(1+\eta)=E_{1}^{-1}(U^{*}E_{1}(m)) (174)
=1+Uemη, as E1(z)=ez/z\displaystyle=-1+U^{*}e^{m\eta}\text{, as }E^{\prime}_{1}(z)=-e^{-z}/z (175)
=1+E1(m(1+η))em(1+η)E1(m)em, as U=E1(m(1+η))/E1(m)\displaystyle=-1+\frac{E_{1}(m(1+\eta))e^{m(1+\eta)}}{E_{1}(m)e^{m}}\text{, as }U^{*}=E_{1}(m(1+\eta))/E_{1}(m) (176)
(1,0) as xexE1(x) is positive and decreasing on (0,).\displaystyle\in(-1,0)\text{ as }x\mapsto e^{x}E_{1}(x)\text{ is positive and decreasing on }(0,\infty). (177)

Therefore, as |\frac{d\log(1+\eta)}{d\log m}|\leq 1, the mean value theorem gives

|log(1+η)log(1+η~)||logmlogm~|,|\log(1+\eta)-\log(1+\tilde{\eta})|\leq|\log m-\log\tilde{m}|, (178)

almost surely as required for (169). Finally, (169) implies (12) with mt,j=μm_{t,j}=\mu and m~t,j=μ~\tilde{m}_{t,j}=\tilde{\mu} for all j=1,,pj=1,\ldots,p.
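A minimal sketch of the inverse-CDF construction in (170) is given below, assuming scalar draws that share a single uniform variable (common random numbers); the helper draw_eta, the bracketing strategy used to invert E_{1}, and all numerical values are our own illustrative choices, not part of the proof.

import numpy as np
from scipy.special import exp1
from scipy.optimize import brentq

def draw_eta(m, u):
    """Return eta with 1 + eta = E_1^{-1}(u * E_1(m)) / m, for u in (0, 1), as in (170)."""
    target = u * exp1(m)
    hi = m
    while exp1(hi) > target:      # E_1 is decreasing: expand until the bracket straddles the root
        hi *= 2.0
    x = brentq(lambda z: exp1(z) - target, m, hi)
    return x / m - 1.0

rng = np.random.default_rng(2)
u = rng.uniform()                 # shared uniform: common random numbers
m, m_tilde = 2.0, 0.5             # arbitrary values
eta, eta_tilde = draw_eta(m, u), draw_eta(m_tilde, u)
lhs = abs(np.log(1 + eta) - np.log(1 + eta_tilde))
rhs = abs(np.log(m) - np.log(m_tilde))
print(lhs <= rhs)                 # the path-wise contraction (169) holds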

Maximal coupling result.

To prove (13), take μ~<μ\tilde{\mu}<\mu without loss of generality. For c>0c>0, denote the event

Ec:={maxj=1p|log(1+ηt+1,j)log(1+η~t+1,j)|>c},E_{c}:=\big{\{}\max_{j=1}^{p}|\log(1+\eta_{t+1,j})-\log(1+\tilde{\eta}_{t+1,j})|>c\big{\}},

where the pairs (\eta_{t+1,j},\tilde{\eta}_{t+1,j}) are independent and identically distributed across j, with each pair distributed according to a maximal coupling. By component-wise independence,

(Ec|μ,μ~)=1j=1p(|log(1+ηt+1,j)log(1+η~t+1,j)|c|μ,μ~)=1xcp\displaystyle\mathbb{P}(E_{c}|\mu,\tilde{\mu})=1-\prod_{j=1}^{p}\mathbb{P}\big{(}|\log(1+\eta_{t+1,j})-\log(1+\tilde{\eta}_{t+1,j})|\leq c\big{|}\mu,\tilde{\mu}\big{)}=1-x_{c}^{p} (179)

for x_{c}=\mathbb{P}\big{(}|\log(1+\eta_{t+1,1})-\log(1+\tilde{\eta}_{t+1,1})|\leq c\big{|}\mu,\tilde{\mu}\big{)}. Henceforth, we focus on this quantity x_{c} and drop the subscripts for ease of notation. For the Horseshoe prior, the marginal densities of \eta|\mu and \tilde{\eta}|\tilde{\mu} are given by

p(η|μ)=eμη1+η1eμE1(μ) and p(η|μ~)=eμ~η1+η1eμ~E1(μ~) on (0,).\displaystyle p(\eta|\mu)=\frac{e^{-\mu\eta}}{1+\eta}\frac{1}{e^{\mu}E_{1}(\mu)}\text{ and }p(\eta|\tilde{\mu})=\frac{e^{-\tilde{\mu}\eta}}{1+\eta}\frac{1}{e^{\tilde{\mu}}E_{1}(\tilde{\mu})}\text{ on }(0,\infty). (180)

Consider a maximal coupling of \eta given \mu and \tilde{\eta} given \tilde{\mu} with independent residuals [Johnson, 1998]. Let q=\text{TV}(p(\cdot|\mu),p(\cdot|\tilde{\mu})). Then \eta=\tilde{\eta} with probability 1-q. Otherwise, with probability q, (\eta,\tilde{\eta})=(\eta^{*},\tilde{\eta}^{*}) where \eta^{*} and \tilde{\eta}^{*} are independent random variables on (0,\infty) with densities

p(η|μ,μ~)\displaystyle p(\eta^{*}|\mu,\tilde{\mu}) =max{p(η|μ)p(η|μ~),0}/q and\displaystyle=\max\{p(\eta^{*}|\mu)-p(\eta^{*}|\tilde{\mu}),0\}/q\text{ and } (181)
p(η~|μ,μ~)\displaystyle p(\tilde{\eta}^{*}|\mu,\tilde{\mu}) =max{p(η~|μ~)p(η~|μ),0}/q,\displaystyle=\max\{p(\tilde{\eta}^{*}|\tilde{\mu})-p(\tilde{\eta}^{*}|{\mu}),0\}/q, (182)

respectively. Note that p(η|μ)p(η|μ~)p(\eta^{*}|\mu)\geq p(\eta^{*}|\tilde{\mu}) if and only if 1+ηK1+\eta^{*}\leq K for constant K:=(logE1(μ)logE1(μ~))/(μ~μ)K:=(\log E_{1}(\mu)-\log E_{1}(\tilde{\mu}))/(\tilde{\mu}-\mu). Therefore, 1+η1+\eta^{*} has support [1,K)[1,K) and 1+η~1+\tilde{\eta}^{*} has support [K,)[K,\infty). We obtain,

xc\displaystyle x_{c} =(1q)+q(|log(1+η)log(1+η~)|c)\displaystyle=(1-q)+q\mathbb{P}\big{(}|\log(1+\eta^{*})-\log(1+\tilde{\eta}^{*})|\leq c\big{)} (183)
=1q(|log(1+η)log(1+η~)|>c)\displaystyle=1-q\mathbb{P}\big{(}|\log(1+\eta^{*})-\log(1+\tilde{\eta}^{*})|>c\big{)} (184)
=1q(log(1+η~)log(1+η)>c), as ηη~\displaystyle=1-q\mathbb{P}\big{(}\log(1+\tilde{\eta}^{*})-\log(1+\eta^{*})>c\big{)}\text{, as }\eta^{*}\leq\tilde{\eta}^{*} (185)
=1q(1+η~>(1+η)ec)\displaystyle=1-q\mathbb{P}\big{(}1+\tilde{\eta}^{*}>(1+\eta^{*})e^{c}\big{)} (186)
\displaystyle\leq 1-q\mathbb{P}\big{(}1+\tilde{\eta}^{*}>t\big{)}\text{ for }t=Ke^{c}\text{, as }1+\eta^{*}\leq K\text{ almost surely} (187)
=1t11+η~(eμ~η~eμ~E1(μ~)eμη~eμE1(μ))𝑑y, by (182)\displaystyle=1-\int_{t}^{\infty}\frac{1}{1+\tilde{\eta}^{*}}\Big{(}\frac{e^{-\tilde{\mu}\tilde{\eta}^{*}}}{e^{\tilde{\mu}}E_{1}(\tilde{\mu})}-\frac{e^{-\mu\tilde{\eta}^{*}}}{e^{\mu}E_{1}(\mu)}\Big{)}dy\text{, by }\eqref{eq:horseshoe_residuals} (188)
=1t1y(eμ~yE1(μ~)eμyE1(μ))𝑑y with y=1+η~\displaystyle=1-\int_{t}^{\infty}\frac{1}{y}\Big{(}\frac{e^{-\tilde{\mu}y}}{E_{1}(\tilde{\mu})}-\frac{e^{-\mu y}}{E_{1}(\mu)}\Big{)}dy\text{ with }y=1+\tilde{\eta}^{*} (189)
=1(E1(μ~t)E1(μ~)E1(μt)E1(μ)).\displaystyle=1-\Big{(}\frac{E_{1}(\tilde{\mu}t)}{E_{1}(\tilde{\mu})}-\frac{E_{1}(\mu t)}{E_{1}(\mu)}\Big{)}. (190)

We now consider this upper bound on x_{c} as t\rightarrow\infty. Note that for \mu,\tilde{\mu} fixed with \tilde{\mu}<\mu,

limt(E1(μ~)μ~teμ~t(E1(μ~t)E1(μ~)E1(μt)E1(μ)))\displaystyle\lim_{t\rightarrow\infty}\Big{(}E_{1}(\tilde{\mu})\tilde{\mu}te^{\tilde{\mu}t}\Big{(}\frac{E_{1}(\tilde{\mu}t)}{E_{1}(\tilde{\mu})}-\frac{E_{1}(\mu t)}{E_{1}(\mu)}\Big{)}\Big{)} =limt(μ~teμ~tE1(μ~t))limt(1E1(μt)E1(μ~)E1(μ~t)E1(μ))\displaystyle=\lim_{t\rightarrow\infty}\Big{(}\tilde{\mu}te^{\tilde{\mu}t}E_{1}(\tilde{\mu}t)\Big{)}\lim_{t\rightarrow\infty}\Big{(}1-\frac{E_{1}(\mu t)E_{1}(\tilde{\mu})}{E_{1}(\tilde{\mu}t)E_{1}(\mu)}\Big{)} (191)
=limt(μ~teμ~tE1(μ~t)) as limtE1(μt)E1(μ~t)=0\displaystyle=\lim_{t\rightarrow\infty}\Big{(}\tilde{\mu}te^{\tilde{\mu}t}E_{1}(\tilde{\mu}t)\Big{)}\text{ as }\lim_{t\rightarrow\infty}\frac{E_{1}(\mu t)}{E_{1}(\tilde{\mu}t)}=0 (192)
=1 as limx(xexE1(x))=1.\displaystyle=1\text{ as }\lim_{x\rightarrow\infty}\big{(}xe^{x}E_{1}(x)\big{)}=1. (193)

Therefore, there exists some constant D1>0D_{1}>0 such that

(E1(μ~t)E1(μ~)E1(μt)E1(μ))>D1μ~teμ~t\Big{(}\frac{E_{1}(\tilde{\mu}t)}{E_{1}(\tilde{\mu})}-\frac{E_{1}(\mu t)}{E_{1}(\mu)}\Big{)}>\frac{D_{1}}{\tilde{\mu}te^{\tilde{\mu}t}} (194)

for all t\geq K. By (190) and (194), we obtain

\displaystyle x_{c}^{p} \displaystyle<\Big{(}1-\frac{D_{1}}{\tilde{\mu}te^{\tilde{\mu}t}}\Big{)}^{p} (195)
=((11D11μ~teμ~t)D11μ~teμ~t)p/(D11μ~teμ~t)\displaystyle=\Big{(}\Big{(}1-\frac{1}{D_{1}^{-1}\tilde{\mu}te^{\tilde{\mu}t}}\Big{)}^{D_{1}^{-1}\tilde{\mu}te^{\tilde{\mu}t}}\Big{)}^{p/(D_{1}^{-1}\tilde{\mu}te^{\tilde{\mu}t})} (196)
<exp(p/(D11μ~teμ~t)) as (11/x)x<1/e for x1.\displaystyle<\exp\big{(}-p/(D_{1}^{-1}\tilde{\mu}te^{\tilde{\mu}t})\big{)}\text{ as }(1-1/x)^{x}<1/e\text{ for }x\geq 1. (197)

For p>1p>1, now take c=log(αlogpμ~K)c=\log\big{(}\frac{\alpha\log p}{\tilde{\mu}K}\big{)} for any fixed constant α(0,1)\alpha\in(0,1). Recall t=Kect=Ke^{c}, such that μ~t=αlogp\tilde{\mu}t=\alpha\log p. Then

xcp\displaystyle x_{c}^{p} <exp(p/(D11μ~teμ~t))=exp(D1p1α/(αlogp))<exp(Dp1α)\displaystyle<\exp\big{(}-p/(D_{1}^{-1}\tilde{\mu}te^{\tilde{\mu}t})\big{)}=\exp\big{(}-D_{1}p^{1-\alpha}/(\alpha\log p)\big{)}<\exp\big{(}-Dp^{1-\alpha^{\prime}}\big{)} (198)

for any α(α,1)\alpha^{\prime}\in(\alpha,1) and some constant DD which does not depend on pp. Therefore,

(Ec|mt,m~t)\displaystyle\mathbb{P}\big{(}E_{c}\big{|}\ m_{t},\tilde{m}_{t}\big{)} >1exp(Dp1α)\displaystyle>1-\exp\big{(}-Dp^{1-\alpha^{\prime}}\big{)} (199)

for c=log(α/(μ~K))+loglogpc=\log(\alpha/(\tilde{\mu}K))+\log\log p. Recall that K:=(logE1(μ)logE1(μ~))/(μ~μ)K:=(\log E_{1}(\mu)-\log E_{1}(\tilde{\mu}))/(\tilde{\mu}-\mu). We obtain

1μ~K\displaystyle\frac{1}{\tilde{\mu}K} =μμ~μ~(logE1(μ~)logE1(μ))\displaystyle=\frac{\mu-\tilde{\mu}}{\tilde{\mu}(\log E_{1}(\tilde{\mu})-\log E_{1}(\mu))} (200)
=μ/μ~1logE1(μ~)logE1(μ)\displaystyle=\frac{\mu/\tilde{\mu}-1}{\log E_{1}(\tilde{\mu})-\log E_{1}(\mu)} (201)
logμlogμ~logE1(μ~)logE1(μ) as x1log(x) for x>0\displaystyle\geq\frac{\log\mu-\log\tilde{\mu}}{\log E_{1}(\tilde{\mu})-\log E_{1}(\mu)}\text{ as }x-1\geq\log(x)\text{ for }x>0 (202)
=eμ^E1(μ^) for some μ^[μ~,μ] as dd(logμ)(logE1(μ))=1eμE1(μ)\displaystyle=e^{\hat{\mu}}E_{1}(\hat{\mu})\text{ for some }\hat{\mu}\in[\tilde{\mu},\mu]\text{ as }\frac{d}{d(\log\mu)}(-\log E_{1}(\mu))=\frac{1}{e^{\mu}E_{1}(\mu)} (203)
eμE1(μ) as xexE1(x) is decreasing for all x>0\displaystyle\geq e^{\mu}E_{1}(\mu)\text{ as }x\mapsto e^{x}E_{1}(x)\text{ is decreasing for all }x>0 (204)
12log(1+1/μ) as exE1(x)12log(1+1/x) for all x>0\displaystyle\geq\frac{1}{2}\log(1+1/\mu)\text{ as }e^{x}E_{1}(x)\geq\frac{1}{2}\log(1+1/x)\text{ for all }x>0 (205)
=12log(1+1/max{μ,μ~})\displaystyle=\frac{1}{2}\log(1+1/\max\{\mu,\tilde{\mu}\}) (206)
=:L.\displaystyle=:L. (207)

Therefore, clog(αL)+loglogpc\geq\log(\alpha L)+\log\log p. By (199),

(maxj=1p|log(1+ηt+1,j)log(1+η~t+1,j)|>log(αL)+loglogp|mt,m~t)>1exp(Dp1α).\displaystyle\mathbb{P}\big{(}\max_{j=1}^{p}|\log(1+\eta_{t+1,j})-\log(1+\tilde{\eta}_{t+1,j})|>\log(\alpha L)+\log\log p\ \big{|}\ m_{t},\tilde{m}_{t}\big{)}>1-\exp(-Dp^{1-\alpha^{\prime}}). (208)

The case μ~>μ\tilde{\mu}>\mu follows by symmetry to give (13). ∎
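The two facts about the exponential integral used in the last steps, namely lim_{x→∞} x e^x E_1(x) = 1 and e^x E_1(x) ≥ (1/2)log(1+1/x) for x > 0, can be checked numerically. The following Python snippet is a minimal sketch of such a check (it uses scipy.special.exp1 for E_1; the grid of test points is an arbitrary choice of ours).

import numpy as np
from scipy.special import exp1  # exp1(x) computes E_1(x)

x = np.array([1e-3, 1e-2, 0.1, 1.0, 5.0, 10.0, 50.0])   # arbitrary test points
f = np.exp(x) * exp1(x)                                  # e^x E_1(x)
assert np.all(f >= 0.5 * np.log1p(1.0 / x))              # lower bound used to define L
print(x * f)                                             # tends to 1 as x grows, cf. (193)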

Proof of Proposition 3.3.

Fix some ν1\nu\geq 1. For any m>0m>0, let p(|m)p(\cdot|m) denote the component-wise full conditional density of η\eta given mm. This is given by

p(η|m)=exp(mη)η1ν2(1+νη)ν+121Zmp(\eta|m)=\frac{\exp(-m\eta)}{\eta^{\frac{1-\nu}{2}}(1+\nu\eta)^{\frac{\nu+1}{2}}}\frac{1}{Z_{m}} (209)

on (0,)(0,\infty), where ZmZ_{m} is the normalization constant. We will first show

TV(p(|m),p(|m~))2(1+ν)(e1)4|log(m)log(m~)|.\text{TV}\big{(}p(\cdot|m),p(\cdot|\tilde{m})\big{)}^{2}\leq\frac{(1+\nu)(e-1)}{4}|\log(m)-\log(\tilde{m})|. (210)

By a generalization of Pinsker’s inequality [e.g. van Erven and Harremos, 2014], we obtain

TV(p(|m),p(|m~))212αDα(p(|m)p(|m~))\text{TV}\big{(}p(\cdot|m),p(\cdot|\tilde{m})\big{)}^{2}\leq\frac{1}{2\alpha}D_{\alpha}(p(\cdot|m)\|p(\cdot|\tilde{m})) (211)

for all α(0,1]\alpha\in(0,1], where

Dα(p(|m)p(|m~)):=1α1log(p(η|m)αp(η|m~)1αdη)D_{\alpha}(p(\cdot|m)\|p(\cdot|\tilde{m})):=\frac{1}{\alpha-1}\log\Big{(}\int p(\eta|m)^{\alpha}p(\eta|\tilde{m})^{1-\alpha}d\eta\Big{)}

is the Rényi divergence of order α between the probability densities p(⋅|m) and p(⋅|m̃) with respect to the Lebesgue measure. For α = 1/2, we obtain

Dα(p(|m)p(|m~))\displaystyle D_{\alpha}(p(\cdot|m)\|p(\cdot|\tilde{m})) =2log(0exp(m+m~2η)η1ν2(1+νη)ν+121ZmZm~𝑑η)=log(ZmZm~Z(m+m~)/22)\displaystyle=-2\log\Big{(}\int_{0}^{\infty}\frac{\exp(-\frac{m+\tilde{m}}{2}\eta)}{\eta^{\frac{1-\nu}{2}}(1+\nu\eta)^{\frac{\nu+1}{2}}}\frac{1}{\sqrt{Z_{m}Z_{\tilde{m}}}}d\eta\Big{)}=\log\Big{(}\frac{Z_{m}Z_{\tilde{m}}}{Z_{(m+\tilde{m})/2}^{2}}\Big{)} (212)

Let L(m)=logZmL(m)=\log Z_{m}. Then for η|mp(|m)\eta|m\sim p(\cdot|m), note that

dL(m)dm=dZmdmZm=𝔼[η|m]0, and d2L(m)dm2=d2Zmdm2Zm(dZmdm)2Zm2=𝔼[η2|m]𝔼[η|m]20,\displaystyle\frac{dL(m)}{dm}=\frac{\frac{dZ_{m}}{dm}}{Z_{m}}=-\mathbb{E}[\eta|m]\leq 0,\text{ and }\frac{d^{2}L(m)}{dm^{2}}=\frac{\frac{d^{2}Z_{m}}{dm^{2}}Z_{m}-(\frac{dZ_{m}}{dm})^{2}}{Z_{m}^{2}}=\mathbb{E}[\eta^{2}|m]-\mathbb{E}[\eta|m]^{2}\geq 0, (213)

such that L(m)L(m) is a decreasing convex function. For m~m\tilde{m}\geq m, this gives

L(m)+L(m~)2L(m+m~2)\displaystyle L(m)+L(\tilde{m})-2L\Big{(}\frac{m+\tilde{m}}{2}\Big{)} (L(m~)L(m))(m~m2)\displaystyle\leq\big{(}L^{\prime}(\tilde{m})-L^{\prime}(m)\Big{)}\Big{(}\frac{\tilde{m}-m}{2}\big{)} (214)
=(𝔼[η|m]𝔼[η|m~])(m~m2).\displaystyle=\big{(}\mathbb{E}[\eta|m]-\mathbb{E}[\eta|\tilde{m}]\big{)}\Big{(}\frac{\tilde{m}-m}{2}\Big{)}. (215)

Furthermore, note that by Lemmas C.4 and C.5,

𝔼[η|m]=1νΓ(1+ν2+1)Γ(1+ν2)U(1+ν2+1,1+1,mν)U(1+ν2,1,mν)1νΓ(1+ν2+1)Γ(1+ν2)(mν)1=1+ν2m1\displaystyle\mathbb{E}[\eta|m]=\frac{1}{\nu}\frac{\Gamma(\frac{1+\nu}{2}+1)}{\Gamma(\frac{1+\nu}{2})}\frac{U(\frac{1+\nu}{2}+1,1+1,\frac{m}{\nu})}{U(\frac{1+\nu}{2},1,\frac{m}{\nu})}\leq\frac{1}{\nu}\frac{\Gamma(\frac{1+\nu}{2}+1)}{\Gamma(\frac{1+\nu}{2})}\big{(}\frac{m}{\nu}\big{)}^{-1}=\frac{1+\nu}{2}m^{-1} (216)

where U(a,b,z)U(a,b,z) corresponds to the confluent hypergeometric function of the second kind. By (214), we therefore obtain

Dα(p(|m)p(|m~))\displaystyle D_{\alpha}(p(\cdot|m)\|p(\cdot|\tilde{m})) (𝔼[η|m]𝔼[η|m~])(m~m2)\displaystyle\leq\big{(}\mathbb{E}[\eta|m]-\mathbb{E}[\eta|\tilde{m}]\big{)}\Big{(}\frac{\tilde{m}-m}{2}\Big{)} (217)
1+ν2(m~m2m)\displaystyle\leq\frac{1+\nu}{2}\Big{(}\frac{\tilde{m}-m}{2m}\Big{)} (218)
\displaystyle=\frac{1+\nu}{4}\big{(}\exp(\log\tilde{m}-\log m)-1\big{)}, (219)

Since exp(x) − 1 ≤ (e−1)x for x ∈ [0,1], applying this with x = |log m̃ − log m| ≤ 1 gives

TV(p(|m),p(|m~))2(1+ν)(e1)4(log(m~)log(m)),\text{TV}\big{(}p(\cdot|m),p(\cdot|\tilde{m})\big{)}^{2}\leq\frac{(1+\nu)(e-1)}{4}\big{(}\log(\tilde{m})-\log(m)\big{)}, (220)

for m~m\tilde{m}\geq m. The case m~m\tilde{m}\leq m follows by symmetry, giving (210).

We now extend (210) to the multivariate setting. Let π(η|m):=j=1pp(ηj|mj)\pi(\eta|m):=\prod_{j=1}^{p}p(\eta_{j}|m_{j}) now denote the full conditional density of η(0,)p\eta\in(0,\infty)^{p} given m(0,)pm\in(0,\infty)^{p}. Then we obtain

TV(π(|m),π(|m~))2\displaystyle\text{TV}\big{(}\pi(\cdot|m),\pi(\cdot|\tilde{m})\big{)}^{2} D1/2(π(|m)π(|m~))\displaystyle\leq D_{1/2}(\pi(\cdot|m)\|\pi(\cdot|\tilde{m})) (221)
=j=1pD1/2(p(|mj)p(|m~j)) by conditional independence\displaystyle=\sum_{j=1}^{p}D_{1/2}(p(\cdot|m_{j})\|p(\cdot|\tilde{m}_{j}))\text{ by conditional independence } (222)
\displaystyle\leq\frac{(1+\nu)(e-1)}{4}\sum_{j=1}^{p}|\log(m_{j})-\log(\tilde{m}_{j})|\text{ by the componentwise bound leading to }\eqref{eq:renyi_div}. (223)

Finally, let Z=(β,ξ,σ2)Z=(\beta,\xi,\sigma^{2}) and Z~=(β~,ξ~,σ~2)\tilde{Z}=(\tilde{\beta},\tilde{\xi},\tilde{\sigma}^{2}), and mj=ξβj2/(2σ2)m_{j}=\xi\beta_{j}^{2}/(2\sigma^{2}) and m~j=ξ~β~j2/(2σ~2)\tilde{m}_{j}=\tilde{\xi}\tilde{\beta}_{j}^{2}/(2\tilde{\sigma}^{2}) for j=1,,pj=1,\ldots,p. For any such Z,Z~Z,\tilde{Z}, we obtain

\displaystyle\text{TV}\big{(}\mathcal{P}(Z,\cdot),\mathcal{P}(\tilde{Z},\cdot)\big{)} =\frac{1}{2}\int\big{|}p(\eta^{*}|Z)-p(\eta^{*}|\tilde{Z})\big{|}p(Z^{*}|\eta^{*})\,d\eta^{*}dZ^{*} (224)
=12|p(η|Z)p(η|Z~)|dη\displaystyle=\frac{1}{2}\int|p(\eta^{*}|Z)-p(\eta^{*}|\tilde{Z})|d\eta^{*} (225)
=12|π(η|m)π(η|m~)|dη\displaystyle=\frac{1}{2}\int|\pi(\eta^{*}|m)-\pi(\eta^{*}|\tilde{m})|d\eta^{*} (226)
=TV(π(|m),π(|m~)).\displaystyle=\text{TV}\big{(}\pi(\cdot|m),\pi(\cdot|\tilde{m})\big{)}. (227)

Therefore, by (221)–(223), we obtain

\displaystyle\text{TV}\big{(}\mathcal{P}(Z,\cdot),\mathcal{P}(\tilde{Z},\cdot)\big{)}^{2} \leq\frac{(1+\nu)(e-1)}{4}\sum_{j=1}^{p}|\log(m_{j})-\log(\tilde{m}_{j})|. (228)
∎
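The single-site bound underlying this proof, D_{1/2}(p(⋅|m) ‖ p(⋅|m̃)) = log(Z_m Z_m̃ / Z²_{(m+m̃)/2}) ≤ (1+ν)(m̃/m − 1)/4 for m̃ ≥ m, can be verified numerically for specific values of (ν, m, m̃). The Python sketch below computes the normalizing constants by quadrature; the particular test values are arbitrary choices of ours.

import numpy as np
from scipy.integrate import quad

def Z(m, nu):
    # normalizing constant of p(eta|m) on (0, infty), cf. (209)
    f = lambda e: np.exp(-m * e) * e ** ((nu - 1) / 2) * (1 + nu * e) ** (-(nu + 1) / 2)
    return quad(f, 0.0, np.inf)[0]

nu, m, m_tilde = 2.0, 0.5, 0.8       # arbitrary test values with m_tilde >= m
d_half = np.log(Z(m, nu) * Z(m_tilde, nu) / Z(0.5 * (m + m_tilde), nu) ** 2)  # cf. (212)
bound = (1 + nu) / 4 * (m_tilde / m - 1)                                       # cf. (217)-(219)
assert 0.0 <= d_half <= bound
print(d_half, bound)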

Proof of Proposition 3.4.

This proposition follows from a modification of the proof of Proposition 2.2. Following the setup and notation in the proof of Proposition 2.2, we have

\displaystyle\mathbb{P} (Zt+1=Z~t+1|Zt,Z~t)\displaystyle(Z_{t+1}=\tilde{Z}_{t+1}\big{|}Z_{t},\tilde{Z}_{t}) (229)
=𝔼[(Zt+1=Z~t+1|ηt+1,η~t+1,Zt,Z~t)|Zt,Z~t]\displaystyle=\mathbb{E}\Big{[}\mathbb{P}\Big{(}Z_{t+1}=\tilde{Z}_{t+1}\big{|}\eta_{t+1},\tilde{\eta}_{t+1},Z_{t},\tilde{Z}_{t}\Big{)}\Big{|}Z_{t},\tilde{Z}_{t}\Big{]} (230)
=𝔼[(Zt+1=Z~t+1|ηt+1,η~t+1)|Zt,Z~t]\displaystyle=\mathbb{E}\Big{[}\mathbb{P}\Big{(}Z_{t+1}=\tilde{Z}_{t+1}\big{|}\eta_{t+1},\tilde{\eta}_{t+1}\Big{)}\Big{|}Z_{t},\tilde{Z}_{t}\Big{]} (231)
\displaystyle=\mathbb{E}\Big{[}\mathrm{1}_{\{\eta_{t+1}=\tilde{\eta}_{t+1}\}}\Big{|}Z_{t},\tilde{Z}_{t}\Big{]}\text{, as }(Z_{t+1},\tilde{Z}_{t+1})\,|\,(\eta_{t+1},\tilde{\eta}_{t+1})\text{ follows the common random numbers coupling} (232)
=(ηt+1=η~t+1|Zt,Z~t)\displaystyle=\mathbb{P}\big{(}\eta_{t+1}=\tilde{\eta}_{t+1}\Big{|}Z_{t},\tilde{Z}_{t}\big{)} (233)
\displaystyle=\mathbb{P}\big{(}\eta_{t+1,1}=\tilde{\eta}_{t+1,1}\big{|}Z_{t},\tilde{Z}_{t}\big{)}\prod_{j=2}^{p}\mathbb{P}\big{(}\eta_{t+1,j}=\tilde{\eta}_{t+1,j}\big{|}\underset{1\leq i\leq j-1}{\bigcap}\{\eta_{t+1,i}=\tilde{\eta}_{t+1,i}\},Z_{t},\tilde{Z}_{t}\big{)} (234)
under the switch-to-CRN coupling (235)
j=1pU(1+ν2,1,R1/dν)U(1+ν2,1,R1/cν) by Lemma C.9, as in the proof of Proposition 2.2 for Zt,Z~tS(R),\displaystyle\geq\prod_{j=1}^{p}\frac{U\big{(}\frac{1+\nu}{2},1,\frac{R^{1/d}}{\nu}\big{)}}{U\big{(}\frac{1+\nu}{2},1,\frac{R^{-1/c}}{\nu}\big{)}}\text{ by Lemma \ref{lemma:eta_coupling_bound}, as in the proof of Proposition \ref{prop:one_scale_meeting} for }Z_{t},\tilde{Z}_{t}\in S(R), (236)
=(U(1+ν2,1,R1/dν)U(1+ν2,1,R1/cν))p=:ϵ,\displaystyle=\Big{(}\frac{U\big{(}\frac{1+\nu}{2},1,\frac{R^{1/d}}{\nu}\big{)}}{U\big{(}\frac{1+\nu}{2},1,\frac{R^{-1/c}}{\nu}\big{)}}\Big{)}^{p}=:\epsilon, (237)

where ϵ(0,1)\epsilon\in(0,1) since R>1R>1. ∎

Appendix D Additional Algorithms

Input: Ct:=(βt,ηt,σt2,ξt)p×>0p×>0×>0C_{t}:=(\beta_{t},\eta_{t},\sigma_{t}^{2},\xi_{t})\in\mathbb{R}^{p}\times\mathbb{R}^{p}_{>0}\times\mathbb{R}_{>0}\times\mathbb{R}_{>0}.
  1.

    Sample ηt+1|βt,σt2,ξt\eta_{t+1}|\beta_{t},\sigma^{2}_{t},\xi_{t} component-wise independently using Algorithm 4, targeting

    π(ηt+1|βt,σt2,ξt)j=1pπ(ηt+1,j|mt,j:=ξtβt,j22σt2)j=1pemt,jηt+1,jηt+1,j1ν2(1+νηt+1,j)ν+12.\pi(\eta_{t+1}|\beta_{t},\sigma^{2}_{t},\xi_{t})\propto\prod_{j=1}^{p}\pi\big{(}\eta_{t+1,j}|m_{t,j}:=\frac{\xi_{t}\beta_{t,j}^{2}}{2\sigma^{2}_{t}}\big{)}\propto\prod_{j=1}^{p}\frac{e^{-m_{t,j}\eta_{t+1,j}}}{\eta_{t+1,j}^{\frac{1-\nu}{2}}(1+\nu\eta_{t+1,j})^{\frac{\nu+1}{2}}}. (238)
  2.

    Sample ξt+1,σt+12,βt+1\xi_{t+1},\sigma^{2}_{t+1},\beta_{t+1} given ηt+1\eta_{t+1} as follows:

    (a)

      Sample ξt+1\xi_{t+1} using Metropolis–Rosenbluth–Teller–Hastings with step-size σMRTH\sigma_{\text{MRTH}}:

Given ξt\xi_{t}, propose log(ξ)𝒩(log(ξt),σMRTH2)\log(\xi^{*})\sim\mathcal{N}(\log(\xi_{t}),\sigma^{2}_{\text{MRTH}}).
Calculate acceptance probability
q=\frac{L(y|\xi^{*},\eta_{t+1})\pi_{\xi}(\xi^{*})}{L(y|\xi_{t},\eta_{t+1})\pi_{\xi}(\xi_{t})}\frac{\xi^{*}}{\xi_{t}},
where πξ()\pi_{\xi}(\cdot) is the prior for ξ\xi, Mξ,η:=In+ξ1XDiag(η1)XTM_{\xi,\eta}:=I_{n}+\xi^{-1}X\,\text{Diag}(\eta^{-1})\,X^{T} and
log(L(y|ξ,η))=12log(|Mξ,η|)a0+n2log(b0+yTMξ,η1y).\log(L(y|\xi,\eta))=-\frac{1}{2}\log(|M_{\xi,\eta}|)-\frac{a_{0}+n}{2}\log(b_{0}+y^{T}M_{\xi,\eta}^{-1}y). (239)
Set ξt+1:=ξ\xi_{t+1}:=\xi^{*} with probability min(1,q)\min(1,q), otherwise set ξt+1:=ξt\xi_{t+1}:=\xi_{t}.
    (b)

    Sample σt+12\sigma^{2}_{t+1} given ξt+1,ηt+1\xi_{t+1},\eta_{t+1}:

    σt+12|ξt+1,ηt+1InvGamma(a0+n2,yTMξt+1,ηt+11y+b02).\sigma^{2}_{t+1}|\xi_{t+1},\eta_{t+1}\sim\mathrm{InvGamma}\bigg{(}\frac{a_{0}+n}{2},\frac{y^{T}M_{\xi_{t+1},\eta_{t+1}}^{-1}y+b_{0}}{2}\bigg{)}. (240)
    (c)

    Sample βt+1\beta_{t+1} given σt+12,ξt+1,ηt+1\sigma^{2}_{t+1},\xi_{t+1},\eta_{t+1} using Algorithm 7 in Appendix D, targeting

    βt+1|σt+12,ξt+1,ηt+1𝒩(Σ1XTy,σt+12Σ1) for Σ=XTX+ξt+1Diag(ηt+1).\beta_{t+1}|\sigma^{2}_{t+1},\xi_{t+1},\eta_{t+1}\sim\mathcal{N}\big{(}\Sigma^{-1}X^{T}y,\sigma^{2}_{t+1}\Sigma^{-1}\big{)}\text{ for }\Sigma=X^{T}X+\xi_{t+1}\text{Diag}(\eta_{t+1}). (241)
  return C_{t+1}:=(\beta_{t+1},\eta_{t+1},\sigma^{2}_{t+1},\xi_{t+1}).

    Algorithm 3 MCMC for Bayesian shrinkage regression with Half-t(ν)t(\nu) priors.
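To illustrate Step 2(a) and Equation (239), the following Python sketch performs one Metropolis–Rosenbluth–Teller–Hastings update of ξ on the log scale. The simulated X and y, the step-size value, and the prior density pi_xi are placeholders of ours, not part of the algorithm specification.

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.standard_normal((n, p))                      # placeholder design matrix
y = rng.standard_normal(n)                           # placeholder response
a0, b0 = 1.0, 1.0
sigma_mrth = 0.8                                     # MRTH step-size (placeholder value)
pi_xi = lambda xi: 1.0 / (np.sqrt(xi) * (1.0 + xi))  # placeholder prior density for xi

def log_L(xi, eta):
    # log-likelihood of Equation (239), with M = I_n + xi^{-1} X Diag(eta^{-1}) X^T
    M = np.eye(n) + (X / (xi * eta)) @ X.T
    _, logdet = np.linalg.slogdet(M)
    quad_form = y @ np.linalg.solve(M, y)
    return -0.5 * logdet - 0.5 * (a0 + n) * np.log(b0 + quad_form)

def update_xi(xi_t, eta):
    log_xi_star = rng.normal(np.log(xi_t), sigma_mrth)     # log-normal random walk proposal
    xi_star = np.exp(log_xi_star)
    log_q = (log_L(xi_star, eta) - log_L(xi_t, eta)
             + np.log(pi_xi(xi_star)) - np.log(pi_xi(xi_t))
             + log_xi_star - np.log(xi_t))                  # Jacobian factor xi*/xi_t
    return xi_star if np.log(rng.uniform()) < log_q else xi_t

eta = rng.gamma(1.0, 1.0, size=p)
print(update_xi(1.0, eta))
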
    Result: Slice sampler targeting p(ηj|mj)(ηj1ν2(1+νηj)ν+12)1emjηjp(\eta_{j}|m_{j})\propto(\eta_{j}^{\frac{1-\nu}{2}}(1+\nu\eta_{j})^{\frac{\nu+1}{2}})^{-1}e^{-m_{j}\eta_{j}} on (0,)(0,\infty).
    Input: mj>0m_{j}>0, state ηt,j>0\eta_{t,j}>0.
    1.

      Sample Ut,j|ηt,jUniform(0,(1+νηt,j)ν+12)U_{t,j}|\eta_{t,j}\sim\mathrm{Uniform}(0,(1+\nu\eta_{t,j})^{-\frac{\nu+1}{2}}).

    2.

      Sample ηt+1,j|Ut,jPj\eta_{t+1,j}|U_{t,j}\sim P_{j}, where PjP_{j} has unnormalized density ηηs1emjη\eta\mapsto\eta^{s-1}e^{-m_{j}\eta} on (0,Tt,j)(0,T_{t,j}), with Tt,j=(Ut,j2/(1+ν)1)/νT_{t,j}=(U_{t,j}^{-2/(1+\nu)}-1)/\nu and s=(1+ν)/2s=(1+\nu)/2. We can sample ηt+1,j\eta_{t+1,j} perfectly by setting

      ηt+1,j=1mjγs1(γs(mjTt,j)U)whereUUniform(0,1),\eta_{t+1,j}=\frac{1}{m_{j}}\gamma_{s}^{-1}\Big{(}\gamma_{s}(m_{j}T_{t,j})U^{*}\Big{)}\quad\text{where}\quad U^{*}\sim\mathrm{Uniform}(0,1), (242)

      and where γs(x):=Γ(s)10xts1et𝑑t[0,1]\gamma_{s}(x):=\Gamma(s)^{-1}\int_{0}^{x}t^{s-1}e^{-t}dt\in[0,1] is the cumulative distribution function of a Gamma(s,1)\mathrm{Gamma}(s,1) distribution.

    return ηt+1,j\eta_{t+1,j}.
    Algorithm 4 Slice Sampling for Half-t(ν)t(\nu) prior.
    Input: Distributions PP and QQ with respective densities pp and qq
    Sample XPX\sim P, and WUniform(0,1)W\sim\mathrm{Uniform}(0,1).
    if  p(X)Wq(X)p(X)W\leq q(X)  then  set Y=XY=X and return (X,Y)(X,Y).
    else  sample Y~Q\tilde{Y}\sim Q and W~Uniform(0,1)\tilde{W}\sim\mathrm{Uniform}(0,1) until q(Y~)W~>p(Y~)q(\tilde{Y})\tilde{W}>p(\tilde{Y}). Set Y=Y~Y=\tilde{Y} and return (X,Y)(X,Y).
    Algorithm 5 Maximal coupling with independent residuals.
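A minimal Python sketch of Algorithm 5 for two univariate distributions; the choice of P and Q (two Gaussians from scipy.stats) is purely illustrative.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def maximal_coupling(P, Q):
    # Algorithm 5: returns (X, Y) with X ~ P, Y ~ Q and P(X = Y) = 1 - TV(P, Q)
    X = P.rvs(random_state=rng)
    if P.pdf(X) * rng.uniform() <= Q.pdf(X):
        return X, X
    while True:
        Y = Q.rvs(random_state=rng)
        if Q.pdf(Y) * rng.uniform() > P.pdf(Y):
            return X, Y

P, Q = stats.norm(0.0, 1.0), stats.norm(1.0, 1.0)     # placeholder distributions
draws = [maximal_coupling(P, Q) for _ in range(2000)]
print(np.mean([x == y for x, y in draws]))            # close to 1 - TV(P, Q)
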
    Result: Sample from 𝒩((XTX+ξDiag(η))1XTy,σ2(XTX+ξDiag(η))1)\mathcal{N}((X^{T}X+\xi\text{Diag}(\eta))^{-1}X^{T}y,\sigma^{2}(X^{T}X+\xi\text{Diag}(\eta))^{-1})
    1.

      Sample r𝒩(0,Ip)r\sim\mathcal{N}(0,I_{p}), δ𝒩(0,In)\delta\sim\mathcal{N}(0,I_{n})

    2.

Let u=\frac{1}{\sqrt{\xi\eta}}r and v=Xu+\delta. Calculate v^{*}=M^{-1}(\frac{y}{\sigma}-v) for M=I_{n}+\xi^{-1}X\text{Diag}(\eta^{-1})X^{T}. Let U=(\xi\eta)^{-1}X^{T}. Set \beta=\sigma(u+Uv^{*}).

    Algorithm 6 Bhattacharya’s algorithm with common random numbers.
    Algorithm 7 Fast Sampling of Normals for Gaussian scale mixture priors [Bhattacharya et al., 2016].
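A Python sketch of the sampler in Algorithm 7, drawing from N(Σ^{-1}X^Ty, σ²Σ^{-1}) with Σ = X^TX + ξ Diag(η) while only solving an n×n system; the simulated inputs are placeholders of ours.

import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 200
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def fast_normal(eta, xi, sigma2):
    # Algorithm 7 [Bhattacharya et al., 2016]
    sigma = np.sqrt(sigma2)
    r = rng.standard_normal(p)
    delta = rng.standard_normal(n)
    u = r / np.sqrt(xi * eta)                    # u ~ N(0, Diag(1/(xi*eta)))
    v = X @ u + delta
    M = np.eye(n) + (X / (xi * eta)) @ X.T       # M = I_n + xi^{-1} X Diag(eta^{-1}) X^T
    v_star = np.linalg.solve(M, y / sigma - v)
    return sigma * (u + (X.T @ v_star) / (xi * eta))

eta = rng.gamma(1.0, 1.0, size=p)
beta = fast_normal(eta, xi=1.0, sigma2=1.0)
print(beta.shape)
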
    Input: ηt+1,ξt+1,σt+12,η~t+1,ξ~t+1,σ~t+12\eta_{t+1},\xi_{t+1},\sigma_{t+1}^{2},\tilde{\eta}_{t+1},\tilde{\xi}_{t+1},\tilde{\sigma}_{t+1}^{2}.
    1.

      Sample r𝒩(0,Ip)r\sim\mathcal{N}(0,I_{p}), δ𝒩(0,In)\delta\sim\mathcal{N}(0,I_{n}).

    2.

      Let u=r/ξt+1ηt+1u=r\big{/}{\sqrt{\xi_{t+1}\eta_{t+1}}} and u~=r/ξ~t+1η~t+1\tilde{u}=r\big{/}{\sqrt{\tilde{\xi}_{t+1}\tilde{\eta}_{t+1}}}, and v=Xu+δv=Xu+\delta and v~=Xu~+δ\tilde{v}=X\tilde{u}+\delta.

    3.

      Calculate v=M1(yσt+1v)v^{*}=M^{-1}\big{(}\frac{y}{\sigma_{t+1}}-v\big{)} and v~=M~1(yσ~t+1v~)\tilde{v}^{*}=\tilde{M}^{-1}\big{(}\frac{y}{\tilde{\sigma}_{t+1}}-\tilde{v}\big{)} where

M=I_{n}+\xi_{t+1}^{-1}X\text{Diag}(\eta_{t+1})^{-1}X^{T} and \tilde{M}=I_{n}+\tilde{\xi}_{t+1}^{-1}X\text{Diag}(\tilde{\eta}_{t+1})^{-1}X^{T}.
    4.

    Let U=(ξt+1ηt+1)1XTU=(\xi_{t+1}\eta_{t+1})^{-1}X^{T} and U~=(ξ~t+1η~t+1)1XT\tilde{U}=(\tilde{\xi}_{t+1}\tilde{\eta}_{t+1})^{-1}X^{T}, and set βt+1=σt+1(u+Uv)\beta_{t+1}=\sigma_{t+1}(u+Uv^{*}) and β~t+1=σ~t+1(u~+U~v~)\tilde{\beta}_{t+1}=\tilde{\sigma}_{t+1}(\tilde{u}+\tilde{U}\tilde{v}^{*}).

    return (βt+1,β~t+1)(\beta_{t+1},\tilde{\beta}_{t+1})

    Algorithm 8 Coupled Normals for Gaussian scale mixture priors with common random numbers.
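A Python sketch of Algorithm 8, in which the two chains share the same draws (r, δ) so that their β-updates are coupled through common random numbers; the simulated inputs are placeholders of ours.

import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 200
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def coupled_fast_normal(eta, xi, sigma2, eta_tilde, xi_tilde, sigma2_tilde):
    # Algorithm 8: r and delta are drawn once and reused by both chains
    r = rng.standard_normal(p)
    delta = rng.standard_normal(n)
    def one_draw(eta, xi, sigma2):
        sigma = np.sqrt(sigma2)
        u = r / np.sqrt(xi * eta)
        v = X @ u + delta
        M = np.eye(n) + (X / (xi * eta)) @ X.T
        v_star = np.linalg.solve(M, y / sigma - v)
        return sigma * (u + (X.T @ v_star) / (xi * eta))
    return one_draw(eta, xi, sigma2), one_draw(eta_tilde, xi_tilde, sigma2_tilde)

eta = rng.gamma(1.0, 1.0, size=p)
beta, beta_tilde = coupled_fast_normal(eta, 1.0, 1.0, eta, 1.0, 1.0)
print(np.allclose(beta, beta_tilde))   # identical inputs give identical draws under CRN
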
    Input: ξt,ξ~t>0\xi_{t},\tilde{\xi}_{t}>0 and ηt+1,η~t+1>0p\eta_{t+1},\tilde{\eta}_{t+1}\in\mathbb{R}_{>0}^{p}.
    Sample joint proposal
    (log(ξ),log(ξ~))|ξt,ξ~tγmax(𝒩(log(ξt),σMRTH2),𝒩(log(ξ~t),σMRTH2)).\Big{(}\log(\xi^{*}),\log(\tilde{\xi}^{*})\Big{)}\Big{|}\xi_{t},\tilde{\xi}_{t}\sim\gamma_{max}\Big{(}\mathcal{N}(\log(\xi_{t}),\sigma_{\text{MRTH}}^{2}),\mathcal{N}(\log(\tilde{\xi}_{t}),\sigma_{\text{MRTH}}^{2})\Big{)}. (243)
    where σMRTH\sigma_{\text{MRTH}} is the Metropolis–Rosenbluth–Teller–Hastings step-size, and γmax\gamma_{max} is a maximal coupling with independent residuals (Algorithm 5).
    Calculate log-acceptance probabilities
    log(q)=log(L(y|ξ,ηt+1)πξ(ξ)L(y|ξt,ηt+1)πξ(ξt)ξξt)andlog(q~)=log(L(y|ξ~,η~t+1)πξ(ξ~)L(y|ξ~t,η~t+1)πξ(ξ~t)ξ~ξ~t).\log(q)=\log\Big{(}\frac{L(y|\xi^{*},\eta_{t+1})\pi_{\xi}(\xi^{*})}{L(y|\xi_{t},\eta_{t+1})\pi_{\xi}(\xi_{t})}\frac{\xi^{*}}{\xi_{t}}\Big{)}\ \text{and}\ \log(\tilde{q})=\log\Big{(}\frac{L(y|\tilde{\xi}^{*},\tilde{\eta}_{t+1})\pi_{\xi}(\tilde{\xi}^{*})}{L(y|\tilde{\xi}_{t},\tilde{\eta}_{t+1})\pi_{\xi}(\tilde{\xi}_{t})}\frac{\tilde{\xi}^{*}}{\tilde{\xi}_{t}}\Big{)}. (244)
    with log-likelihoods log(L(y|ξ,ηt+1))\log(L(y|\xi,\eta_{t+1})) and log(L(y|ξ~,η~t+1))\log(L(y|\tilde{\xi},\tilde{\eta}_{t+1})) as defined in Algorithm 3, Equation (239).
    Sample UUniform(0,1)U^{*}\sim\mathrm{Uniform}(0,1).
    if UqU^{*}\leq q then set ξt+1:=ξ\xi_{t+1}:=\xi^{*} else set ξt+1:=ξt\xi_{t+1}:=\xi_{t}.
      if Uq~U^{*}\leq\tilde{q} then set ξ~t+1:=ξ~\tilde{\xi}_{t+1}:=\tilde{\xi}^{*} else set ξ~t+1:=ξ~t\tilde{\xi}_{t+1}:=\tilde{\xi}_{t}.
      return (ξt+1,ξ~t+1)(\xi_{t+1},\tilde{\xi}_{t+1}).
    Algorithm 9 Coupled Metropolis–Rosenbluth–Teller–Hastings for ξ|η\xi|\eta updates with exact meetings.
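A Python sketch of the coupling mechanics of Algorithm 9: the log-normal proposals are drawn from a maximal coupling (Algorithm 5) and a single shared uniform drives both accept/reject decisions. The two target log-densities below (log_post and log_post_tilde) are placeholders of ours, standing in for log L(y|ξ,η_{t+1}) + log π_ξ(ξ) on the two chains.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
sigma_mrth = 0.8                                                           # placeholder step-size
log_post = lambda xi: -0.5 * (np.log(xi) - 1.0) ** 2 - np.log(xi)          # placeholder target
log_post_tilde = lambda xi: -0.5 * (np.log(xi) + 0.5) ** 2 - np.log(xi)    # placeholder target

def max_coupling_normal(mu1, mu2, sd):
    # maximal coupling with independent residuals (Algorithm 5) of N(mu1, sd^2) and N(mu2, sd^2)
    P, Q = stats.norm(mu1, sd), stats.norm(mu2, sd)
    x = P.rvs(random_state=rng)
    if P.pdf(x) * rng.uniform() <= Q.pdf(x):
        return x, x
    while True:
        y = Q.rvs(random_state=rng)
        if Q.pdf(y) * rng.uniform() > P.pdf(y):
            return x, y

def coupled_update(xi_t, xi_tilde_t):
    l_star, l_tilde_star = max_coupling_normal(np.log(xi_t), np.log(xi_tilde_t), sigma_mrth)
    xi_star, xi_tilde_star = np.exp(l_star), np.exp(l_tilde_star)
    log_q = log_post(xi_star) - log_post(xi_t) + l_star - np.log(xi_t)
    log_q_tilde = (log_post_tilde(xi_tilde_star) - log_post_tilde(xi_tilde_t)
                   + l_tilde_star - np.log(xi_tilde_t))
    log_u = np.log(rng.uniform())          # shared uniform for both acceptance decisions
    xi_next = xi_star if log_u <= log_q else xi_t
    xi_tilde_next = xi_tilde_star if log_u <= log_q_tilde else xi_tilde_t
    return xi_next, xi_tilde_next

print(coupled_update(1.0, 2.0))
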
    Input: η ∈ ℝ^p_{>0}, discretized grid G on the compact support of π_ξ(⋅)
    Perform eigenvalue decomposition XDiag(η)1XT=QΛQTX\text{Diag}(\eta)^{-1}X^{T}=Q\Lambda Q^{T} for orthogonal matrix QQ and diagonal matrix Λ\Lambda.
    Calculate unnormalized probability mass function
    \displaystyle\widetilde{pmf}(\xi|\eta):=\Big{(}\prod_{i=1}^{n}(1+\xi^{-1}\Lambda_{i,i})\Big{)}^{-1/2}\Big{(}b_{0}+\sum_{i=1}^{n}[Q^{T}y]^{2}_{i}\frac{1}{1+\xi^{-1}\Lambda_{i,i}}\Big{)}^{-\frac{a_{0}+n}{2}}\pi_{\xi}(\xi) (245)
    for each ξ ∈ G, where [Q^T y]_i is the i-th entry of Q^T y ∈ ℝ^n.
    Normalize to obtain the probability mass function: pmf(\xi|\eta)=\frac{\widetilde{pmf}(\xi|\eta)}{\sum_{\xi^{\prime}\in G}\widetilde{pmf}(\xi^{\prime}|\eta)}.
    Sample ξpmf(|η)\xi\sim pmf(\cdot|\eta) on GG using probability inverse transform.
    return ξ.\xi.
    Algorithm 10 Perfect sampling of ξ|η\xi|\eta.
    Input: ν,mj:=ξβj2/(2σ2)>0\nu,m_{j}:=\xi\beta_{j}^{2}/(2\sigma^{2})>0.
    Sample WUniform(0,1)W^{*}\sim\mathrm{Uniform}(0,1)
    Set
    ηj=1νUν+12,1,mjν1(WUν+12,1,mjν())\eta_{j}=\frac{1}{\nu}U_{\frac{\nu+1}{2},1,\frac{m_{j}}{\nu}}^{-1}\Big{(}W^{*}U_{\frac{\nu+1}{2},1,\frac{m_{j}}{\nu}}(\infty)\Big{)} (246)
    where Ua,b,z(t):=1Γ(a)0txa1(1+x)ba1ezx𝑑xU_{a,b,z}(t):=\frac{1}{\Gamma(a)}\int_{0}^{t}x^{a-1}(1+x)^{b-a-1}e^{-zx}dx is defined as the lower incomplete confluent hypergeometric function of the second kind, such that Ua,b,z()=U(a,b,z)U_{a,b,z}(\infty)=U(a,b,z) gives the confluent hypergeometric function of the second kind.
    return ηj.\eta_{j}.
    Algorithm 11 Perfect sampling of ηj|βj,σ2,ξ\eta_{j}|\beta_{j},\sigma^{2},\xi.

    Appendix E Algorithm Derivations

    Gibbs sampler for Half-t(ν)t(\nu) priors (Algorithm 3).

    Derivation of the blocked Gibbs sampling algorithm (Algorithm 3) is given in Johndrow et al. [2020] for the Horseshoe prior. For Half-t(ν)t(\nu) priors, it remains to check the validity of the slice sampler in Algorithm 4. Working component-wise, let p(η|m)p(\eta|m) denote the conditional density of η\eta given m>0m>0. Then,

    p(η|m)\displaystyle p(\eta|m) ην12(1+νη)1+ν2emη𝟙(0,)(η)=u=0ην12emη𝟙(0,)(η)𝟙(0,(1+νη)1+ν2)(u)𝑑u.\displaystyle\propto\eta^{\frac{\nu-1}{2}}(1+\nu\eta)^{-\frac{1+\nu}{2}}e^{-m\eta}\mathbbm{1}_{(0,\infty)}(\eta)=\int_{u=0}^{\infty}\eta^{\frac{\nu-1}{2}}e^{-m\eta}\mathbbm{1}_{(0,\infty)}(\eta)\mathbbm{1}_{\big{(}0,(1+\nu\eta)^{-\frac{1+\nu}{2}}\big{)}}(u)du. (247)

    Let p¯(η,u|m)ην12emη𝟙(0,)(η)𝟙(0,(1+νη)1+ν2)(u)\bar{p}(\eta,u|m)\propto\eta^{\frac{\nu-1}{2}}e^{-m\eta}\mathbbm{1}_{(0,\infty)}(\eta)\mathbbm{1}_{\big{(}0,(1+\nu\eta)^{-\frac{1+\nu}{2}}\big{)}}(u) be the conditional density of the augmented random variable (η,u)(\eta,u) given mm, such that p(η|m)=u=0p¯(η,u|m)𝑑up(\eta|m)=\int_{u=0}^{\infty}\bar{p}(\eta,u|m)du. Then, the slice sampler in Algorithm 4 corresponds to a Gibbs sampler targeting p¯(η,u|m)\bar{p}(\eta,u|m):

    \displaystyle u\,|\,\eta,m \sim\mathrm{Uniform}\big{(}0,(1+\nu\eta)^{-\frac{1+\nu}{2}}\big{)}, (248)
    p¯(η|u,m)\displaystyle\bar{p}(\eta|u,m) ην12emη𝟙(0,u2/(1+ν)1ν)(η).\displaystyle\propto\eta^{\frac{\nu-1}{2}}e^{-m\eta}\mathbbm{1}_{\big{(}0,\frac{u^{-2/(1+\nu)}-1}{\nu}\big{)}}(\eta). (249)

    Note that \int_{0}^{x}\eta^{s-1}e^{-m\eta}d\eta=m^{-s}\Gamma(s)\gamma_{s}(mx), where \gamma_{s}(x):=\frac{1}{\Gamma(s)}\int_{0}^{x}t^{s-1}e^{-t}dt\in[0,1] is the regularized lower incomplete Gamma function. For T=(u^{-2/(1+\nu)}-1)/\nu and s=\frac{1+\nu}{2}, we can now obtain the cumulative distribution function of \bar{p}(\cdot|u,m), given by \int_{0}^{x}\bar{p}(\eta|u,m)d\eta=\gamma_{s}(mx)/\gamma_{s}(mT) for 0\leq x\leq T. By probability inverse transform, we can sample perfectly from \eta|u,m, by setting

    η:=1mγs1(γs(mT)U) for UUniform(0,1)\eta:=\frac{1}{m}\gamma_{s}^{-1}\big{(}\gamma_{s}(mT)U^{*}\big{)}\text{ for }U^{*}\sim\mathrm{Uniform}(0,1) (250)

    as in Step (2)(2) of Algorithm 4.
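The inverse-CDF step (250), equivalently (242) in Algorithm 4, can be implemented directly with the regularized lower incomplete Gamma function. The following Python sketch performs one full slice-sampling update of η_j given m_j; scipy.special.gammainc and gammaincinv compute γ_s and its inverse, and the function name slice_update is ours.

import numpy as np
from scipy.special import gammainc, gammaincinv

rng = np.random.default_rng(5)

def slice_update(eta, m, nu):
    # one iteration of Algorithm 4 targeting p(eta|m) for the Half-t(nu) prior
    s = (1.0 + nu) / 2.0
    u = rng.uniform(0.0, (1.0 + nu * eta) ** (-(nu + 1.0) / 2.0))   # step 1
    T = (u ** (-2.0 / (1.0 + nu)) - 1.0) / nu                       # slice end-point
    u_star = rng.uniform()
    return gammaincinv(s, u_star * gammainc(s, m * T)) / m          # step 2, Eq. (242)

eta = 1.0
for _ in range(10):
    eta = slice_update(eta, m=0.3, nu=2.0)
print(eta)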

    Perfect sampling of ξ|η\xi|\eta (Algorithm 10).

    In Algorithm 3, we have

    \displaystyle p(\xi|\eta)\propto|M_{\xi,\eta}|^{-1/2}\big{(}b_{0}+y^{T}M_{\xi,\eta}^{-1}y\big{)}^{-\frac{a_{0}+n}{2}}\pi_{\xi}(\xi) (251)

    for Mξ,η=In+ξ1XDiag(η)1XTM_{\xi,\eta}=I_{n}+\xi^{-1}X\text{Diag}(\eta)^{-1}X^{T} and πξ()\pi_{\xi}(\cdot) the prior density on ξ\xi. Consider the eigenvalue decomposition of XDiag(η)1XTX\text{Diag}(\eta)^{-1}X^{T}, such that XDiag(η)1XT=QΛQTX\text{Diag}(\eta)^{-1}X^{T}=Q\Lambda Q^{T} for QQ orthogonal and Λ\Lambda diagonal. Then,

    |Mξ,η|\displaystyle|M_{\xi,\eta}| =|Q(In+ξ1Λ)QT|=|In+ξ1Λ|=i=1n(1+ξ1Λi,i)\displaystyle=|Q(I_{n}+\xi^{-1}\Lambda)Q^{T}|=|I_{n}+\xi^{-1}\Lambda|=\prod_{i=1}^{n}\big{(}1+\xi^{-1}\Lambda_{i,i}\big{)} (252)
    yTMξ,η1y\displaystyle y^{T}M_{\xi,\eta}^{-1}y =yTQ(In+ξ1Λ)1QTy=i=1n[QTy]i2(1+ξ1Λi,i)1\displaystyle=y^{T}Q(I_{n}+\xi^{-1}\Lambda)^{-1}Q^{T}y=\sum_{i=1}^{n}[Q^{T}y]_{i}^{2}\big{(}1+\xi^{-1}\Lambda_{i,i}\big{)}^{-1} (253)

    This allows the discretized perfect sampling of ξ\xi given η\eta using probability inverse transform as in Algorithm 10, with 𝒪(n3)\mathcal{O}(n^{3}) computational cost arising from eigenvalue decomposition of XDiag(η)1XTX\text{Diag}(\eta)^{-1}X^{T}.
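A Python sketch of Algorithm 10: a single eigendecomposition of X Diag(η)^{-1} X^T followed by inverse-transform sampling of ξ on a grid. The toy data, the grid, and the prior density log_pi_xi are placeholders of ours, not part of the algorithm.

import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 100
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
eta = rng.gamma(1.0, 1.0, size=p)
a0, b0 = 1.0, 1.0
G = np.exp(np.linspace(np.log(1e-4), np.log(1e4), 2000))   # placeholder grid
log_pi_xi = lambda xi: -0.5 * np.log(xi) - np.log1p(xi)    # placeholder prior, log scale

# eigendecomposition X Diag(eta)^{-1} X^T = Q Lambda Q^T, done once at O(n^3) cost
lam, Q = np.linalg.eigh((X / eta) @ X.T)
z2 = (Q.T @ y) ** 2                                        # [Q^T y]_i^2

def log_pmf_unnorm(xi):
    d = 1.0 + lam / xi                                     # 1 + xi^{-1} Lambda_{i,i}
    return -0.5 * np.sum(np.log(d)) - 0.5 * (a0 + n) * np.log(b0 + np.sum(z2 / d)) + log_pi_xi(xi)

logw = np.array([log_pmf_unnorm(xi) for xi in G])
w = np.exp(logw - logw.max())
cdf = np.cumsum(w) / w.sum()
xi_sample = G[min(np.searchsorted(cdf, rng.uniform()), len(G) - 1)]   # inverse transform on G
print(xi_sample)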

    Perfect sampling of ηj|βj,σ2,ξ\eta_{j}|\beta_{j},\sigma^{2},\xi (Algorithm 11).

    Working component-wise, let

    p(η|m)ην12(1+νη)1+ν2emη\displaystyle p(\eta|m)\propto\eta^{\frac{\nu-1}{2}}(1+\nu\eta)^{-\frac{1+\nu}{2}}e^{-m\eta} (254)

    denote the conditional density of η(0,)\eta\in(0,\infty) given m>0m>0. We can calculate the corresponding cumulative density function (see proof of Proposition B.1), given by

    0Kp(η|m)𝑑η=Uν+12,1,mν(νK)U(ν+12,1,mν).\displaystyle\int_{0}^{K}p(\eta|m)d\eta=\frac{U_{\frac{\nu+1}{2},1,\frac{m}{\nu}}(\nu K)}{U(\frac{\nu+1}{2},1,\frac{m}{\nu})}. (255)

    where Ua,b,z(t):=1Γ(a)0txa1(1+x)ba1ezx𝑑xU_{a,b,z}(t):=\frac{1}{\Gamma(a)}\int_{0}^{t}x^{a-1}(1+x)^{b-a-1}e^{-zx}dx is defined as the lower incomplete confluent hypergeometric function of the second kind. Algorithm 11 and (246) directly follow from probability inverse transform.
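Since the lower incomplete confluent hypergeometric function is not available in standard numerical libraries, one way to implement Algorithm 11 is to evaluate the cumulative distribution function of (254) by quadrature and invert it by root-finding. The Python sketch below does this; the function name and the doubling strategy used to bracket the quantile are ours.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

rng = np.random.default_rng(7)

def sample_eta(m, nu):
    # perfect draw from p(eta|m) proportional to eta^{(nu-1)/2} (1+nu*eta)^{-(nu+1)/2} e^{-m*eta}, cf. (254)
    dens = lambda e: e ** ((nu - 1.0) / 2.0) * (1.0 + nu * e) ** (-(nu + 1.0) / 2.0) * np.exp(-m * e)
    Z = quad(dens, 0.0, np.inf)[0]
    cdf = lambda k: quad(dens, 0.0, k)[0] / Z
    w = rng.uniform()
    upper = 1.0
    while cdf(upper) < w:            # heuristic bracketing of the quantile
        upper *= 2.0
    return brentq(lambda k: cdf(k) - w, 0.0, upper)

print(sample_eta(m=0.3, nu=2.0))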