This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

The Cox-Polya-Gamma Algorithm for Flexible Bayesian Inference of Multilevel Survival Models

Benny Ren
Department of Family, Population and Preventive Medicine
Renaissance School of Medicine at Stony Brook University, Stony Brook, NY

Jeffrey S. Morris and Ian Barnett
Department of Biostatistics, Epidemiology, and Informatics
Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA
Abstract

Bayesian Cox semiparametric regression is an important problem in many clinical settings. Bayesian procedures provide finite-sample inference and naturally incorporate prior information if MCMC algorithms and posteriors are well behaved. Survival analysis should also be able to incorporate multilevel modeling such as case weights, frailties and smoothing splines, in a straightforward manner. To tackle these modeling challenges, we propose the Cox-Polya-Gamma (Cox-PG) algorithm for Bayesian multilevel Cox semiparametric regression and survival functions. Our novel computational procedure succinctly addresses the difficult problem of monotonicity constrained modeling of the nonparametric baseline cumulative hazard along with multilevel regression. We develop two key strategies. First, we exploit an approximation between Cox models and negative binomial processes through the Poisson process to reduce Bayesian computation to iterative Gaussian sampling. Next, we appeal to sufficient dimension reduction to address the difficult computation of nonparametric baseline cumulative hazard, allowing for the collapse of the Markov transition within the Gibbs sampler based on beta sufficient statistics. In addition, we explore conditions for uniform ergodicity of the Cox-PG algorithm. We demonstrate our multilevel modeling approach using open source data and simulations. We provide software for our Bayesian procedure in the supplement.


Keywords: Kaplan-Meier, Cox model, Frailty model, Multilevel model, Bayesian inference, Survival analysis

1 Introduction

Semiparametric Cox proportional hazards (PH) regression is an important tool used for time to event data (Cox, 1972, 1975). Many extensions of the Cox model have been proposed, such as cure models (or zero-inflation) (Sy & Taylor, 2000), frailty models (or random effects) (Wienke, 2010), smoothing splines (Therneau et al., 2000; Cui et al., 2021), convex set constraints (McGregor et al., 2020), variable selection (Tibshirani, 1997), and other methods. However, nonparametric inference for semiparametric Cox regression involves difficult monotonicity constrained inference of the cumulative hazard (Hothorn et al., 2018). The Bayesian paradigm provides a convenient framework to naturally address many of these variations of the canonical Cox model (Ibrahim et al., 2001; Chen et al., 2006). Bayesian methods can naturally incorporate convex set constraints (Valeriano et al., 2023; Pakman & Paninski, 2014; Maatouk & Bay, 2016) that can be applied to estimation and inference of monotonic cumulative hazard functions. In addition, Bayesian hierarchical modeling can be extended to different types of regression, such as random effects (Wang & Roy, 2018), zero-inflation (Neelon, 2019), smoothing splines (Wand & Ormerod, 2008), and variable selection (Carvalho et al., 2009; Bhadra et al., 2019; Makalic & Schmidt, 2015) with straightforward Gibbs samplers for multilevel Gaussian models. In order to leverage this rich resource, we propose a novel Markov chain Monte Carlo (MCMC) algorithm that allows Cox model parameters to be modeled with hierarchical Gaussian models

The Cox model can be described as locally a Poisson process that is constrained to binary outcomes of recording death or right censoring (Ibrahim et al., 2001; Aalen et al., 2008). In recent years, Polson et al. 2013 proposed a Polya-gamma (PG) Gibbs sampler for logistic and negative binomial (NB) regression that is based in Gaussian sampling. Consequently, the negative binomial process is a useful analog to the Poisson process of Cox regression. We use a standard gamma process Bayesian modification to Cox regression that results in a NB kernel and enables Gaussian-based modeling to incorporate the necessary structure and constraints. By specifying a univariate gamma frailty or gamma prior on the cumulative hazard, we establish an analogous frailty model to locally negative binomial relationship (Aalen et al., 2008; Winkelmann, 2008). Typically, a vague gamma frailty is specified to bridge Poisson and NB models (Kalbfleisch, 1978; Sinha et al., 2003; Duan et al., 2018). The practicality of a vague gamma prior allows access to the PG sampler, benefiting from the use of Gaussian Gibbs samplers. In addition, maintaining Gibbs transitions is crucial as we can take advantage of the marginal transitions to derive succinct Metropolis-Hastings (MH) acceptance probabilities that can be be used to remove bias due to gamma frailty augmentation (Duan et al., 2018).

However, the NB form alleviates much of the algebra associated with the cumulative hazard but does not address the entirety of the baseline hazard rate which has positivity constraints. There are a broad class nonparametric estimators that rely on partitioning the time interval (Nieto-Barajas & Walker, 2002; Nelson, 1972; Aalen, 1978; Kaplan & Meier, 1958; Ibrahim et al., 2001). Evaluating these estimators without considering the computational algorithm often results in difficult MCMC procedures due to monotonic constraints on the baseline cumulative hazard with MH steps that require tuning. There are many tools available to enforce convex set constraints on Gaussian sampling (Valeriano et al., 2023; Pakman & Paninski, 2014; Maatouk & Bay, 2016; Cong et al., 2017; Li & Ghosh, 2015) which we will demonstrate can be used to enforce monotonicity and positivity constraints of Cox regression. Inspired by previous work in nonparametric analysis, we proposed the use of a sufficient statistic partition-based estimator with closed forms for Gibbs sampler conditional distributions. In order to do so, we introduce a novel monotonic spline system with a half-space constraint (Ramsay, 1988; Meyer et al., 2018) to model the baseline log cumulative hazard function that factorizes into beta sufficient statistics in the likelihood. Next, we introduce slice sampler auxiliary variables to remove the sufficient statistics associated with the baseline hazard rate from the algebra (Damien et al., 1999; Mira & Tierney, 2002; Neal, 2003). Our spline system collapses the linear operator of our Markov transitions on to beta sufficient statistics to improve the efficiency of Bayesian computation while retaining the Gaussian structure. This results in a flexible Bayesian procedure we refer to as the Cox-Polya-Gamma (Cox-PG) algorithm that is appealing for being able to adapt multilevel Gaussian inference to Cox regression.

Aside from being able to incorporate prior information into the Cox model, our Cox-PG approach is desirable for situations where inference on the baseline hazard is necessary such as Kaplan-Meier models (Kaplan & Meier, 1958). We also show that case weights such as inverse probability weighting and power priors, can be incorporated into the Cox-PG algorithm through fractal beta kernels (Shu et al., 2021; Ibrahim et al., 2015). In addition, we explore conditions under which the Cox-PG sampler is uniformly ergodic. Consequently, we are able to derive ergodic theory results for our Gibbs algorithm under mild assumptions. Hierarchical models from Gaussian Gibbs samplers are natural extensions of our approach with straightforward modifications to the Cox-PG algorithm. This includes cure models or zero-inflation (Sy & Taylor, 2000; Neelon, 2019), joint models (Wulfsohn & Tsiatis, 1997; Wang et al., 2022), copulas (Wienke, 2010), competing risk (Kalbfleisch & Prentice, 2011) and variable selection (Carvalho et al., 2009; Bhadra et al., 2019; Makalic & Schmidt, 2015) and other multilevel Gaussian models. As a result, we propose a unified Bayesian framework that envelopes many of the models found throughout survival analysis Ibrahim et al. 2001. In contrast to influence function based semiparametric theory (Tsiatis 2006 Chapter 5), the Cox-PG sampler provides a flexible inference procedure for finite sample Cox modeling, addressing a critical gap in many survival analysis settings such as clinical trials (Ildstad et al., 2001).

The rest of the article is organized as follows. In Section 2, we outline the Cox-PG algorithm. In Section 3, we present uniform ergodicity theoretical results. We present additional methods for accelerating MCMC algorithms, removing bias, and variations of Cox semiparametric regression in Section 4 and 5. Simulation studies and real data analysis involving multilevel survival models are presented in Section 6, 7 and 8.

2 Method: Cox-PG Gibbs sampler

Semiparametric Cox PH regression is closely related to the exponential family with an additional hazard rate term that needs to be accounted for with constrained inference. Exponential family Bayesian models are well studied and have appealing theoretical properties (Choi & Hobert, 2013). In order to emphasize Cox model constraints and its Poisson kernel, we show the derivation of the transformation model for Cox PH regression (Hothorn et al., 2018). The extreme value distribution function of PH models is given as FMEV{A(t)}=1exp[exp{A(t)}]F_{\mathrm{MEV}}\{A(t)\}=1-\exp[-\exp\{A(t)\}] with survival function SMEV{A(t)}=exp[exp{A(t)}]S_{\mathrm{MEV}}\{A(t)\}=\exp[-\exp\{A(t)\}]. In this case, study time TiT_{i} denotes the censoring or event time for subject i{1,,N}i\in\{1,\dots,N\}. Let A(ti)=log{Λ(ti)}+𝐱i𝜷=α(ti)+𝐱i𝜷=𝐳α(ti)𝐮α+𝐱i𝜷=𝐦i𝜼A(t_{i})=\log\{\Lambda(t_{i})\}+\mathbf{x}^{\top}_{i}\bm{\beta}=\alpha(t_{i})+\mathbf{x}^{\top}_{i}\bm{\beta}=\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}+\mathbf{x}^{\top}_{i}\bm{\beta}=\mathbf{m}_{i}^{\top}\bm{\eta}. Monotonic function α(ti)=j=1Jzα,j(ti)uα,j=𝐳α(ti)𝐮α\alpha(t_{i})=\sum^{J}_{j=1}z_{\alpha,j}(t_{i})u_{\alpha,j}=\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha} is the baseline log cumulative hazard approximated using additive monotonic splines with the constraints 𝐮α>𝟎\mathbf{u}_{\alpha}>\mathbf{0} represented through 𝒞0={𝜼:𝐮α>𝟎}\mathcal{C}_{0}=\left\{\bm{\eta}:\mathbf{u}_{\alpha}>\mathbf{0}\right\}. The baseline hazard is DtΛ(ti)=Dtexp{α0+𝐳α(ti)𝐮α}={Dt𝐳α(ti)𝐮α}exp{α0+𝐳α(ti)𝐮α}D_{t}\Lambda(t_{i})=D_{t}\exp\left\{\alpha_{0}+\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}\right\}=\left\{D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}\right\}\exp\left\{\alpha_{0}+\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}\right\} where the differential operator is represented as d/dt=Dtd/dt=D_{t} and α0\alpha_{0} is an intercept incorporated into 𝐱i𝜷\mathbf{x}^{\top}_{i}\bm{\beta}. The marginal likelihood contribution of the model (Kalbfleisch & Prentice 2011 Chapter 2) is

[DtFMEV{A(ti)}]yi[SMEV{A(ti)}]1yi={Dtα(ti)}yiexp{yiA(ti)}SMEV{A(ti)}\begin{array}[]{c}\left[D_{t}F_{\mathrm{MEV}}\{A(t_{i})\}\right]^{y_{i}}\left[S_{\mathrm{MEV}}\{A(t_{i})\}\right]^{1-y_{i}}=\left\{D_{t}\alpha(t_{i})\right\}^{y_{i}}\exp\{y_{i}A(t_{i})\}S_{\mathrm{MEV}}\{A(t_{i})\}\end{array}

with death (or event) variable yiy_{i} and right censoring 1yi1-y_{i}. The hazard function is given by {Dtα(ti)}yiexp{yiA(ti)}\left\{D_{t}\alpha(t_{i})\right\}^{y_{i}}\exp\{y_{i}A(t_{i})\} following the canonical factorization of hazard-survival functions. The flexibility of this framework is seen by noting that if α(t)=α0+log(t)\alpha(t)=\alpha_{0}+\log(t), the likelihood is an exponential PH regression while if α(t)=α0+γlog(t)\alpha(t)=\alpha_{0}+\gamma\log(t) we have the Weibull PH model and the parameterization of α(t)\alpha(t) determines the hazard family. Now, we set A(ti)=𝐦i𝜼A(t_{i})=\mathbf{m}^{\top}_{i}\bm{\eta}, where 𝜼=[𝐮α,𝜷]\bm{\eta}=\left[\mathbf{u}^{\top}_{\alpha},\bm{\beta}^{\top}\right]^{\top} and 𝐦i=[𝐳α(ti),𝐱i]\mathbf{m}_{i}=\left[\mathbf{z}^{\top}_{\alpha}(t_{i}),\mathbf{x}^{\top}_{i}\right]^{\top}, and the resulting generalized likelihood is

LPH(𝜼𝐲,𝐌)=[i=1N{Dt𝐳α(ti)𝐮α}yi]𝕀(𝜼𝒞0)exp(𝐲𝐌𝜼)exp{i=1Nexp(𝐦i𝜼)}.\begin{array}[]{c}L_{\mathrm{PH}}(\bm{\eta}\mid\mathbf{y},\mathbf{M})=\left[\prod_{i=1}^{N}\left\{D_{t}\mathbf{z}_{\alpha}^{\top}(t_{i})\mathbf{u}_{\alpha}\right\}^{y_{i}}\right]\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\exp\left(\mathbf{y}^{\top}\mathbf{M}\bm{\eta}\right)\exp\left\{-\sum_{i=1}^{N}\exp\left(\mathbf{m}_{i}^{\top}\bm{\eta}\right)\right\}.\end{array}

The derivative of monotonic splines, Dt𝐳α(ti)D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i}), is positive and the basis coefficients are constrained to be positive. The design matrix 𝐌N×(J+P)\mathbf{M}\in\mathbb{R}^{N\times(J+P)} is composed of rows 𝐦i\mathbf{m}_{i}. Under this construction, slice sampler auxiliary variables are suitable to maintain positive Dt𝐳α(ti)𝐮α+(0,)D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}\in\mathbb{R}_{+}\coloneqq(0,\infty) (Damien et al., 1999; Mira & Tierney, 2002).

Slice samplers naturally remove positive terms in the likelihood by enforcing constraints in conditional distributions. In order to effectively incorporate a slice sampler, we start by placing a gamma mixture variable zγ,iG(ϵ,ϵ)z_{\gamma,i}\sim\mathrm{G}(\epsilon,\epsilon) on the Poisson kernel of PH models to obtain a frailty model analogous to negative binomial regression. Note that the first two moments are given as 𝔼(zγ,i)=1\mathbb{E}(z_{\gamma,i})=1 and Var(zγ,i)=ϵ1\mathrm{Var}(z_{\gamma,i})=\epsilon^{-1} and a vague mixture ϵ0\epsilon\gg 0 can be used. As noted in Duan et al. (2018) this limit approximation works well for Poisson processes. In addition, we show how to remove this bias later on in the article through a simple MH step. A quick Laplace transform reveals the modification to the baseline hazard under frailty: DtΛfrailty(t)=DtΛ(t)/{1+ϵ1Λ(t)}D_{t}\Lambda_{\text{frailty}}(t)=D_{t}\Lambda(t)/\left\{1+\epsilon^{-1}\Lambda(t)\right\} (Aalen et al. 2008 Chapter 6) and by sending ϵ\epsilon\rightarrow\infty we have the Cox model hazard. We remove the hazard component of the likelihood with multiplication of slice sampling auxiliary uniform variable νi𝜼U{0,Dt𝐳α(ti)𝐮α}\nu_{i}\mid\bm{\eta}\sim\mathrm{U}\{0,D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}\} for uncensored events, yi=1y_{i}=1. The resulting likelihood is given as

LPH(𝜼,𝐳γ𝐲,𝐌)π(𝝂𝜼)π(𝐳γ)=[i=1N𝕀{νiDt𝐳α(ti)𝐮α}yi]𝕀(𝜼𝒞0)×[i=1N{exp(𝐦i𝜼)zγ,i}yi]π(𝐳γ)×exp{i=1Nexp(𝐦i𝜼)zγ,i}\begin{array}[]{rl}L_{\text{PH}}(\bm{\eta},\mathbf{z}_{\gamma}\mid\mathbf{y},\mathbf{M})\pi(\bm{\nu}\mid\bm{\eta})\pi(\mathbf{z}_{\gamma})=&\left[\prod_{i=1}^{N}\mathbb{I}\left\{\nu_{i}\leq D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}\right\}^{y_{i}}\right]\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\\ &\times\left[\prod_{i=1}^{N}\left\{\exp\left(\mathbf{m}_{i}^{\top}\bm{\eta}\right)z_{\gamma,i}\right\}^{y_{i}}\right]\pi(\mathbf{z}_{\gamma})\\ &\times\exp\left\{-\sum_{i=1}^{N}\exp\left(\mathbf{m}_{i}^{\top}\bm{\eta}\right)z_{\gamma,i}\right\}\end{array}

where 𝐳γ\mathbf{z}_{\gamma} is a vector of the zγ,iz_{\gamma,i} (Wienke 2010 Chapter 3) and νi0\nu_{i}\geq 0. We can marginalize out 𝐳γ\mathbf{z}_{\gamma} by carrying out the Mellin transform to obtain a negative binomial kernel

1yi!0exp(λizγ)λiyizγyiϵϵΓ(ϵ)zγϵ1eϵzγ𝑑zγ=Γ(yi+ϵ)Γ(yi+1)Γ(ϵ)(ϵϵ+λi)ϵ(λiϵ+λi)yi.\begin{array}[]{c}\frac{1}{y_{i}!}\int_{0}^{\infty}\exp(-\lambda_{i}{z_{\gamma}})\lambda_{i}^{y_{i}}z_{\gamma}^{y_{i}}\frac{\epsilon^{\epsilon}}{\Gamma(\epsilon)}z_{\gamma}^{\epsilon-1}e^{-\epsilon z_{\gamma}}dz_{\gamma}=\frac{\Gamma(y_{i}+\epsilon)}{\Gamma(y_{i}+1)\Gamma(\epsilon)}\left(\frac{\epsilon}{\epsilon+\lambda_{i}}\right)^{\epsilon}\left(\frac{\lambda_{i}}{\epsilon+\lambda_{i}}\right)^{y_{i}}.\end{array}

Let ψi=𝐦i𝜼log(ϵ)=𝐦i𝜼+ηϵ\psi_{i}=\mathbf{m}_{i}^{\top}\bm{\eta}-\log(\epsilon)=\mathbf{m}_{i}^{\top}\bm{\eta}+\eta_{\epsilon} and λi=exp(𝐦i𝜼)\lambda_{i}=\exp(\mathbf{m}_{i}^{\top}\bm{\eta}), then the NB likelihood is

LNB(𝜼𝐲,𝐌)[i=1N{Dt𝐳α(ti)𝐮α}yi]𝕀(𝜼𝒞0)[i=1N(pi)yi(1pi)ϵ]LNB(𝜼𝐲,𝐌)π(𝝂𝜼)[i=1N𝕀{νiDt𝐳α(ti)𝐮α}yi]𝕀(𝜼𝒞0)[i=1N(pi)yi(1pi)ϵ]\begin{array}[]{rl}L_{\text{NB}}(\bm{\eta}\mid\mathbf{y},\mathbf{M})\propto&\left[\prod_{i=1}^{N}\left\{D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}\right\}^{y_{i}}\right]\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\left[\prod_{i=1}^{N}(p_{i})^{y_{i}}(1-p_{i})^{\epsilon}\right]\\ L_{\text{NB}}(\bm{\eta}\mid\mathbf{y},\mathbf{M})\pi(\bm{\nu}\mid\bm{\eta})\propto&\left[\prod_{i=1}^{N}\mathbb{I}\left\{\nu_{i}\leq D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}\right\}^{y_{i}}\right]\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\left[\prod_{i=1}^{N}(p_{i})^{y_{i}}(1-p_{i})^{\epsilon}\right]\end{array} (1)

where pi=expit(ψi)=exp(ψi)/{1+exp(ψi)}p_{i}=\mathrm{expit}(\psi_{i})=\exp(\psi_{i})/\{1+\exp(\psi_{i})\} and we condition on ϵ\epsilon. We can integrate out νi0\nu_{i}\geq 0 to obtain the proportional frailty likelihood. The frailty likelihood is multiplied by constant i=1Nϵyi\prod_{i=1}^{N}\epsilon^{-y_{i}} to get the NB likelihood. After accounting for covariates and conditioning on ϵ\epsilon, through the Laplace transform, the hazard function is proportional to DtΛfrailty(ti){Dt𝐳α(ti)𝐮α}yipiyiD_{t}\Lambda_{\text{frailty}}(t_{i}\mid-)\propto\left\{D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}\right\}^{y_{i}}p_{i}^{y_{i}}, which allows for data augmentation (Wienke 2010 Chapter 3). The survival function under frailty is (1pi)ϵ(1-p_{i})^{\epsilon} resulting in the canonical hazard-survival factorization.

To account for the logit link, we can use a PG auxiliary variable ωb,cPG(b,c)\omega\mid b,c\sim\mathrm{PG}\left(b,c\right), where b>0b>0 (Polson et al., 2013). The PG density is

π(ωb,c)=12π2k=1gk(k1/2)2+c2/(4π2)\begin{array}[]{c}\pi(\omega\mid b,c)=\frac{1}{2\pi^{2}}\sum_{k=1}^{\infty}\frac{g_{k}}{(k-1/2)^{2}+c^{2}/\left(4\pi^{2}\right)}\end{array}

where gkg_{k} are independently distributed according to G(b,1)\mathrm{G}(b,1). We have the following exponential tilting or Esscher transform (Esscher, 1932) relationship for PG distributions

(eψ)a(1+eψ)b=2beκψ0eωψ2/2π(ωb,0)𝑑ω,π(ωb,c)=exp(c2ω/2)π(ωb,0)𝔼ωb,0[exp(c2ω/2)]\begin{array}[]{c}\frac{\left(e^{\psi}\right)^{a}}{\left(1+e^{\psi}\right)^{b}}=2^{-b}e^{\kappa\psi}\int_{0}^{\infty}e^{-\omega\psi^{2}/2}\pi(\omega\mid b,0)d\omega,\quad\pi(\omega\mid b,c)=\frac{\exp\left(-c^{2}\omega/2\right)\pi(\omega\mid b,0)}{\mathbb{E}_{\omega\mid b,0}\left[\exp\left(-c^{2}\omega/2\right)\right]}\end{array} (2)

where κ=ab/2\kappa=a-b/2 and π(ωb,0)\pi(\omega\mid b,0) denotes a PG(b,0)\mathrm{PG}(b,0) density. We use ωiPG(yi+ϵ,0)\omega_{i}\sim\mathrm{PG}\left(y_{i}+\epsilon,0\right) and get

(eψi)yi(1+eψi)yi+ϵ0eκiψieωiψi2/2π(ωiyi+ϵ,0)𝑑ωi\begin{array}[]{rl}\frac{\left(e^{\psi_{i}}\right)^{y_{i}}}{\left(1+e^{\psi_{i}}\right)^{y_{i}+\epsilon}}\propto\int_{0}^{\infty}e^{\kappa_{i}\psi_{i}}e^{-\omega_{i}\psi_{i}^{2}/2}\pi\left(\omega_{i}\mid y_{i}+\epsilon,0\right)d\omega_{i}\end{array}

to tease out a Gaussian kernel. Note that

π(ωiyi+ϵ,ψi)=coshyi+ϵ(|ψi/2|)exp(ωiψi2/2)π(ωiyi+ϵ,0),\begin{array}[]{c}\pi(\omega_{i}\mid y_{i}+\epsilon,\psi_{i})=\cosh^{y_{i}+\epsilon}(|\psi_{i}/2|)\exp\left(-\omega_{i}\psi_{i}^{2}/2\right)\pi(\omega_{i}\mid y_{i}+\epsilon,0),\end{array}

can be shown with a Laplace transform (Polson et al., 2013; Choi & Hobert, 2013). We get the NB posterior

LNB(𝜼𝐲,𝐌)π(𝝂,𝝎𝜼)[i=1N𝕀{νiDt𝐳α(ti)𝐮α}yi]𝕀(𝜼𝒞0)×i=1Neκiψieωiψi2/2π(ωiyi+ϵ,0)\begin{array}[]{rl}L_{\text{NB}}(\bm{\eta}\mid\mathbf{y},\mathbf{M})\pi(\bm{\nu},\bm{\omega}\mid\bm{\eta})\propto&\left[\prod_{i=1}^{N}\mathbb{I}\left\{\nu_{i}\leq D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}\right\}^{y_{i}}\right]\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\\ &\times\prod_{i=1}^{N}e^{\kappa_{i}\psi_{i}}e^{-\omega_{i}\psi_{i}^{2}/2}\pi\left(\omega_{i}\mid y_{i}+\epsilon,0\right)\end{array} (3)

where κi=(yiϵ)/2\kappa_{i}=(y_{i}-\epsilon)/2. Integrating out 𝝂\bm{\nu} and 𝝎\bm{\omega} results in the original NB likelihood. We can write our Gaussian density conditioned on ηϵ\eta_{\epsilon} through eκiψieωiψi2/2eκiφieηϵωiφieωiφi2/2\begin{array}[]{c}e^{\kappa_{i}\psi_{i}}e^{-\omega_{i}\psi_{i}^{2}/2}\propto e^{\kappa_{i}\varphi_{i}}e^{-\eta_{\epsilon}\omega_{i}\varphi_{i}}e^{-\omega_{i}\varphi_{i}^{2}/2}\end{array} with φi=𝐦i𝜼\varphi_{i}=\mathbf{m}_{i}^{\top}\bm{\eta}. The conditional distributions, after applying the Esscher transform to both the PG and Gaussian variables, are given as

(ωi)PG(yi+ϵ,ψi)(νi)U{0,Dt𝐳α(ti)𝐮α}(𝜼)TN[(𝐌𝛀𝐌)1𝐌(𝜿ηϵ𝝎),(𝐌𝛀𝐌)1,𝒞ν𝒞0]\begin{array}[]{rl}\left(\omega_{i}\mid-\right)&\sim\mathrm{PG}\left(y_{i}+\epsilon,\psi_{i}\right)\\ (\nu_{i}\mid-)&\sim\mathrm{U}\{0,D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}\}\\ (\bm{\eta}\mid-)&\sim\mathrm{TN}\left[\left(\mathbf{M}^{\top}\bm{\Omega}\mathbf{M}\right)^{-1}\mathbf{M}^{\top}(\bm{\kappa}-\eta_{\epsilon}\bm{\omega}),\left(\mathbf{M}^{\top}\bm{\Omega}\mathbf{M}\right)^{-1},\mathcal{C}_{\nu}\bigcap\mathcal{C}_{0}\right]\end{array} (4)

where the truncated normal is constrained to set: 𝒞ν𝒞0\mathcal{C}_{\nu}\bigcap\mathcal{C}_{0} and 𝒞ν={i:yi=1}N{𝜼:νiDt𝐳α(ti)𝐮α}\mathcal{C}_{\nu}=\bigcap_{\{i:y_{i}=1\}}^{N}\left\{\bm{\eta}:\nu_{i}\leq D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}\right\}. The truncated normal density is proportional to the Gaussian kernel multiplied by an indicator for set constraints: 𝜼TN(𝝁,𝚺,𝒞)\bm{\eta}\sim\mathrm{TN}(\bm{\mu},\bm{\Sigma},\mathcal{C}),

π(𝜼𝝁,𝚺,𝒞)=exp{12(𝜼𝝁)𝚺1(𝜼𝝁)}c1(𝝁,𝚺,𝒞)𝕀(𝜼𝒞)exp{12(𝜼𝝁)𝚺1(𝜼𝝁)}𝕀(𝜼𝒞)\begin{array}[]{c}\pi(\bm{\eta}\mid\bm{\mu},\bm{\Sigma},\mathcal{C})=\frac{\exp\left\{-\frac{1}{2}(\bm{\eta}-\bm{\mu})^{\top}\bm{\Sigma}^{-1}(\bm{\eta}-\bm{\mu})\right\}}{c_{1}(\bm{\mu},\bm{\Sigma},\mathcal{C})}\mathbb{I}(\bm{\eta}\in\mathcal{C})\propto{\exp\left\{-\frac{1}{2}(\bm{\eta}-\bm{\mu})^{\top}\bm{\Sigma}^{-1}(\bm{\eta}-\bm{\mu})\right\}}\mathbb{I}(\bm{\eta}\in\mathcal{C})\end{array}

where c1(𝝁,𝚺,𝒞)=𝜼𝒞exp{12(𝜼𝝁)𝚺1(𝜼𝝁)}𝑑𝜼c_{1}(\bm{\mu},\bm{\Sigma},\mathcal{C})=\oint_{\bm{\eta}\in\mathcal{C}}\exp\left\{-\frac{1}{2}(\bm{\eta}-\bm{\mu})^{\top}\bm{\Sigma}^{-1}(\bm{\eta}-\bm{\mu})\right\}d\bm{\eta}. Here 𝛀=diag(ω1,,ωN)\bm{\Omega}=\operatorname{diag}\left(\omega_{1},\cdots,\omega_{N}\right), ωi\omega_{i} make up the vector 𝝎\bm{\omega} and κi=(yiϵ)/2\kappa_{i}=(y_{i}-\epsilon)/2 and make up the vector 𝜿\bm{\kappa}. Note that convex set intersection preserves convexity (Boyd & Vandenberghe 2004 Chapter 2), allowing for the exploitation of convex set constrained Gaussian sampling (Maatouk & Bay, 2016). At this point, we have shown how to incorporate hazard rate constraints into the general Gibbs sampler. The priors for 𝜼\bm{\eta} are flat and we outline hierarchical Gaussian structures later in the article.

2.1 Sufficient statistics dimension reduction for efficient computation

A naive slice sampler enforces a prohibitive number of iyi\sum_{i}y_{i} inequality constraints in 𝒞ν\mathcal{C}_{\nu} during Gaussian sampling. The slice sampler for nonparametric partition estimators (Nieto-Barajas & Walker, 2002; Kaplan & Meier, 1958; Ibrahim et al., 2001) presents a unique opportunity to collapse the linear operator of Markov transitions based on sufficient statistics. By using Riemann sum integration to construct the baseline log cumulative hazard 𝐳α(ti)𝐮α\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}, we the simplify 𝒞ν\mathcal{C}_{\nu} and auxiliary transitions (νi)U{0,Dt𝐳α(ti)𝐮α}(\nu_{i}\mid-)\sim\mathrm{U}\{0,D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}\}. The new hazard rate is now constructed with non-overlapping histogram partitions, using delta functions Dtzα,j(ti)=δj(ti)=𝕀(sj1ti<sj)D_{t}z_{\alpha,j}(t_{i})=\delta_{j}(t_{i})=\mathbb{I}(s_{j-1}\leq t_{i}<s_{j}) as its additive bases with Dt𝐳α(ti)𝐮α=j=1JDtzα,j(ti)uα,jD_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}=\sum_{j=1}^{J}D_{t}z_{\alpha,j}(t_{i})u_{\alpha,j}. A example of the basis functions Dt𝐳α(ti)D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i}) used to construct the histogram in Riemann sum integration for the lung dataset from R survival is presented in Figure 1a (Loprinzi et al., 1994; Therneau, 2023). We can write matrix Dt𝐙αN×JD_{t}\mathbf{Z}_{\alpha}\in\mathbb{R}^{N\times J} with rows Dt𝐳α(ti)D_{t}\mathbf{z}_{\alpha}^{\top}(t_{i}). Now the matrix becomes Dt𝐙α{0,1}N×JD_{t}\mathbf{Z}_{\alpha}\in\{0,1\}^{N\times J} with induced norm Dt𝐙α=1\|D_{t}\mathbf{Z}_{\alpha}\|_{\infty}=1. Each row only has at most a single element that is 1, meaning each uncensored subject activates at most a single delta function. The uniform auxiliary space becomes the last order statistics of uniform random variables or beta random variables. An example of inequalities νiDt𝐳α(ti)𝐮α\nu_{i}\leq D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha} are given as

ν30+0+uα,3+0+0+0+0ν40+0+uα,3+0+0+0+0ν50+0+0+uα,4+0+0+0\begin{array}[]{rlllllll}\vdots&\leq\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ \nu_{3}&\leq 0&+0&+u_{\alpha,3}&+0&+0&+0&+0\\ \nu_{4}&\leq 0&+0&+u_{\alpha,3}&+0&+0&+0&+0\\ \nu_{5}&\leq 0&+0&+0&+u_{\alpha,4}&+0&+0&+0\\ \vdots&\leq\vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\end{array}

and we see that for the set {i:yi=1}\{i:y_{i}=1\} we have inequalities for each coefficient uα,ju_{\alpha,j} multiplied together as indicators, {i:yi=1}𝕀{δj(ti)νiuα,j}\begin{array}[]{c}\prod_{\{i:y_{i}=1\}}\mathbb{I}\left\{\delta_{j}(t_{i})\nu_{i}\leq{u}_{\alpha,j}\right\}\end{array} and our set 𝒞ν\mathcal{C}_{\nu} consist of
max{δj(ti)νi:yi=1}uα,j.\begin{array}[]{c}\max\left\{\delta_{j}(t_{i})\nu_{i}:y_{i}=1\right\}\leq{u}_{\alpha,j}\end{array}. We denote νi,j=δj(ti)νi\nu_{i,j}=\delta_{j}(t_{i})\nu_{i} and we only need to sample from the last order statistic vj=max{νi,j:yi=1,δj(ti)=1}.\begin{array}[]{c}v_{j}=\max\{\nu_{i,j}:y_{i}=1,\delta_{j}(t_{i})=1\}.\end{array} Each sample of the uniform variable becomes νi{δj(ti)=1}U(0,uα,j)\nu_{i}\mid\{\delta_{j}(t_{i})=1\}\sim\mathrm{U}(0,u_{\alpha,j}). Let the number of uncensored events in partition jj be nα,j=i=1Nyiδj(ti)n_{\alpha,j}=\sum_{i=1}^{N}y_{i}\delta_{j}(t_{i}) and Vj=ν(nα,j)V_{j}=\nu_{(n_{\alpha,j})} denote the last order statistic. We only need to use the last order statistic, the sufficient statistics of U(0,uα,j)\mathrm{U}(0,u_{\alpha,j}) to construct 𝒞ν\mathcal{C}_{\nu} with reduced dimensions. The conditional cumulative distributions and densities for the last order statistics are

(Vjvjuα,j,nα,j)=𝕀(vjuα,j)(vjuα,j)nα,j,π(vjuα,j,nα,j)=𝕀(vjuα,j)nα,jvjnα,j1/uα,jnα,j\begin{array}[]{c}\mathbb{P}\left(V_{j}\leq v_{j}\mid u_{\alpha,j},n_{\alpha,j}\right)=\mathbb{I}(v_{j}\leq u_{\alpha,j})\left(\frac{v_{j}}{u_{\alpha,j}}\right)^{n_{\alpha,j}},\pi\left(v_{j}\mid u_{\alpha,j},n_{\alpha,j}\right)=\mathbb{I}(v_{j}\leq u_{\alpha,j}){n_{\alpha,j}v_{j}^{n_{\alpha,j}-1}}/{u_{\alpha,j}^{n_{\alpha,j}}}\end{array}

with vjuα,j,nα,jB(uα,j,nα,j,1)v_{j}\mid u_{\alpha,j},n_{\alpha,j}\sim\mathrm{B}(u_{\alpha,j},n_{\alpha,j},1) and 𝒞v=j=1J{𝜼:vjuα,j}\mathcal{C}_{v}=\bigcap_{j=1}^{J}\left\{\bm{\eta}:v_{j}\leq{u}_{\alpha,j}\right\}. This is a simple half space 𝐯𝐮α\mathbf{v}\leq\mathbf{u}_{\alpha} and a important improvement over naive slice sampling strategies. Notably, we can rescale beta variables to sample from B(uα,j,nα,j,1)\mathrm{B}(u_{\alpha,j},n_{\alpha,j},1), where (vj/uα,j)Beta(nα,j,1)(v_{j}/u_{\alpha,j})\sim\mathrm{Beta}(n_{\alpha,j},1) can be verified using a Jacobian and has the property 0vj/uα,j10\leq v_{j}/u_{\alpha,j}\leq 1. The denominator terms (uα,j)nα,j({u_{\alpha,j}})^{n_{\alpha,j}} cancels our new sufficiently reduced hazard term i=1N{Dt𝐳α(ti)𝐮α}yi=i=1Nj=1Juα,jyiδj(ti)=j=1J(uα,j)nα,j\prod_{i=1}^{N}\left\{D_{t}\mathbf{z}_{\alpha}^{\top}(t_{i})\mathbf{u}_{\alpha}\right\}^{y_{i}}=\prod_{i=1}^{N}\prod_{j=1}^{J}{u}_{\alpha,j}^{y_{i}\delta_{j}(t_{i})}=\prod_{j=1}^{J}({u_{\alpha,j}})^{n_{\alpha,j}}. The new NB likelihood with beta kernels is given as

LNB(𝜼𝐲,𝐌)[i=1Nj=1Juα,jyiδj(ti)]𝕀(𝜼𝒞0)[i=1N(pi)yi(1pi)ϵ].\begin{array}[]{c}L_{\text{NB}}(\bm{\eta}\mid\mathbf{y},\mathbf{M})\propto\left[\prod_{i=1}^{N}\prod_{j=1}^{J}{u}_{\alpha,j}^{y_{i}\delta_{j}(t_{i})}\right]\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\left[\prod_{i=1}^{N}(p_{i})^{y_{i}}(1-p_{i})^{\epsilon}\right]\end{array}.

The conditional distributions replacing equations (4) are now given as

(ωi)PG(yi+ϵ,ψi)(vj)B(uα,j,nα,j,1)(𝜼)TN{(𝐌𝛀𝐌)1𝐌(𝜿ηϵ𝝎),(𝐌𝛀𝐌)1,𝐯𝐮α}.\begin{array}[]{rl}\left(\omega_{i}\mid-\right)&\sim\mathrm{PG}\left(y_{i}+\epsilon,\psi_{i}\right)\\ \left(v_{j}\mid-\right)&\sim\mathrm{B}(u_{\alpha,j},n_{\alpha,j},1)\\ (\bm{\eta}\mid-)&\sim\mathrm{TN}\left\{\left(\mathbf{M}^{\top}\bm{\Omega}\mathbf{M}\right)^{-1}\mathbf{M}^{\top}(\bm{\kappa}-\eta_{\epsilon}\bm{\omega}),\left(\mathbf{M}^{\top}\bm{\Omega}\mathbf{M}\right)^{-1},\mathbf{v}\leq\mathbf{u}_{\alpha}\right\}.\end{array} (5)

Note that {𝐯𝐮α}=𝒞v𝒞v𝒞0\{\mathbf{v}\leq\mathbf{u}_{\alpha}\}=\mathcal{C}_{v}\neq\mathcal{C}_{v}\bigcap\mathcal{C}_{0} if only censored events are found in a given partition. But we recommend to using a parametrization where {𝐯𝐮α}=𝒞v𝒞0\{\mathbf{v}\leq\mathbf{u}_{\alpha}\}=\mathcal{C}_{v}\bigcap\mathcal{C}_{0} by ensuring each partition has at least one uncensored event. Under this construction, the spline zα,j(t)z_{\alpha,j}(t) is a monotonic step function with a 45 degree slope in the partition sj1t<sjs_{j-1}\leq t<s_{j} and is visualized in Figure 1b. The slice constraints are now based on number of coefficients, instead of number of subjects. Many methods exist that can efficiently sample a Gaussian distribution constrained to {𝐯𝐮α}\{\mathbf{v}\leq\mathbf{u}_{\alpha}\}, constraints reduced from number of subject NN to number of coefficients in 𝐮α\mathbf{u}_{\alpha} (Valeriano et al., 2023). Note that 𝛀\bm{\Omega} is a diagonal matrix and matrices 𝐌𝛀𝐌\mathbf{M}^{\top}\bm{\Omega}\mathbf{M} and 𝐌(𝜿ηϵ𝝎)\mathbf{M}^{\top}(\bm{\kappa}-\eta_{\epsilon}\bm{\omega}) can be computed in a parallel manner without loading all of the data in memory (Polson et al., 2013). In addition, we can further simplify sampling of 𝜼=[𝐮α,𝜷]\bm{\eta}=\left[\mathbf{u}^{\top}_{\alpha},\bm{\beta}^{\top}\right]^{\top} through conditional Gaussian distributions,

(𝐮α)TN(𝝁uβ,Σuβ,vuα)baseline log cumulative hazard local slopes,(𝜷)N(𝝁βu,Σβu)PH coefficients & intercept.\begin{array}[]{c}\underbrace{\left(\mathbf{u}_{\alpha}\mid-\right)\sim\operatorname{TN}\left(\bm{\mu}_{u\mid\beta},\Sigma_{u\mid\beta},\textbf{v}\leq\textbf{u}_{\alpha}\right)}_{\text{baseline log cumulative hazard local slopes}},\quad\underbrace{\left(\bm{\beta}\mid-\right)\sim\operatorname{N}(\bm{\mu}_{\beta\mid u},\Sigma_{\beta\mid u})}_{\text{PH coefficients \& intercept}}.\end{array}

A multilevel Gaussian model can be specified for 𝜷\bm{\beta} while a flat prior can be maintained for 𝐮α\mathbf{u}_{\alpha}. The interpretation of spline coefficients 𝐮α\mathbf{u}_{\alpha} are positive slopes for local linear interpolation of the baseline log cumulative hazard function to ensure a final monotonic function.

As we increase number of partitions JJ, we achieve a more accurate estimate at the expense of computation. This improves on standard nonparametric estimators (Kaplan & Meier, 1958; Nelson, 1972; Aalen, 1978; Nieto-Barajas & Walker, 2002) by using a continuous monotonic local linear regression to approximate the log cumulative hazard, rather than an empirical jump process. These splines are centered at their midpoint to reduce attenuation due to priors that can be incorporated (Figure 1b). The intercept term α0\alpha_{0} allows location shifts and is unconstrained under Cox regression. The truncated normal Gibbs step is an iterative sampling of non-negative local slopes. Under Bayesian inference, we condition the slope of each partition on the remaining variables. This is a variant of sufficient dimension reduction techniques (Li, 2018; Sun et al., 2019) and these local slopes are an analog of the martingale increments (Tsiatis, 2006) used to construct the partial likelihood under stochastic calculus (Kaplan & Meier, 1958; Cox, 1972, 1975).

2.2 Case weights under sufficient statistics

By reducing computation to beta kernels, case weights wiw_{i} can easily be incorporated into the Gibbs sampler. The NB likelihood becomes

LNB(𝜼𝐰,𝐲,𝐌)[i=1Nj=1Juα,jyiδj(ti)wi]𝕀(𝜼𝒞0)[i=1N(pi)yiwi(1pi)ϵwi]\begin{array}[]{c}L_{\text{NB}}(\bm{\eta}\mid\mathbf{w},\mathbf{y},\mathbf{M})\propto\left[\prod_{i=1}^{N}\prod_{j=1}^{J}{u}_{\alpha,j}^{y_{i}\delta_{j}(t_{i})w_{i}}\right]\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\left[\prod_{i=1}^{N}(p_{i})^{y_{i}w_{i}}(1-p_{i})^{\epsilon w_{i}}\right]\end{array}

and beta distribution parameter is modified to be nα,j=i=1Nyiδj(ti)win^{\star}_{\alpha,j}=\sum_{i=1}^{N}y_{i}\delta_{j}(t_{i})w_{i}. The updated Gibbs sampler is

(ωi)PG{(yi+ϵ)wi,ψi}(vj)B(uα,j,nα,j,1)(𝜼)TN{(𝐌𝛀𝐌)1𝐌(𝜿ηϵ𝝎),(𝐌𝛀𝐌)1,𝐯𝐮α}\begin{array}[]{rl}\left(\omega_{i}\mid-\right)&\sim\mathrm{PG}\left\{(y_{i}+\epsilon)w_{i},\psi_{i}\right\}\\ \left(v_{j}\mid-\right)&\sim\mathrm{B}(u_{\alpha,j},n^{\star}_{\alpha,j},1)\\ (\bm{\eta}\mid-)&\sim\mathrm{TN}\left\{\left(\mathbf{M}^{\top}\bm{\Omega}\mathbf{M}\right)^{-1}\mathbf{M}^{\top}(\bm{\kappa}^{\star}-\eta_{\epsilon}\bm{\omega}),\left(\mathbf{M}^{\top}\bm{\Omega}\mathbf{M}\right)^{-1},\mathbf{v}\leq\mathbf{u}_{\alpha}\right\}\end{array} (6)

with vector 𝜿\bm{\kappa}^{\star} consisting of values (yiϵ)wi/2(y_{i}-\epsilon)w_{i}/2. This allows the use of Cox-PG in weighted analysis settings such as power priors (Ibrahim & Chen, 2000; Ibrahim et al., 2015) and inverse probability weighting (Shu et al., 2021).

2.3 General hierarchical Cox model: the Cox-PG algorithm

Our previous Gibbs algorithm can sample from the Cox model, while mixed models for clustered data are straightforward extensions. We allow for additional heterogeneity through Gaussian random effects also known as log-normal frailties. We can also incorporate prior information on the baseline hazard and coefficients by specifying the appropriate Gaussian priors and formulating a general Gibbs sampler. We update the notation for 𝜼\bm{\eta} to include random effects and specify truncated Gaussian prior 𝜼TN(𝐛,𝐀(τ)1,𝒞0)\bm{\eta}\sim\mathrm{TN}(\mathbf{b},\mathbf{A}(\tau_{\mathcal{B}})^{-1},\mathcal{C}_{0}) for coefficients given by the form 𝐦i𝜼=𝐳α(ti)𝐮α+𝐱i𝜷+𝐳i,𝐮\mathbf{m}_{i}^{\top}\bm{\eta}=\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha}+\mathbf{x}^{\top}_{i}\bm{\beta}+\mathbf{z}^{\top}_{i,\mathcal{B}}\mathbf{u}_{\mathcal{B}}. Let 𝐀(τ)1\mathbf{A}(\tau_{\mathcal{B}})^{-1} be the covariance matrix. We order 𝐌=[𝐙α,𝐗,𝐙]N×(J+P+M)\mathbf{M}=[\mathbf{Z}_{\alpha},\mathbf{X},\mathbf{Z}_{\mathcal{B}}]\in\mathbb{R}^{N\times(J+P+M)}, 𝜼=[𝐮α,𝜷,𝐮]\bm{\eta}=[\mathbf{u}^{\top}_{\alpha},\bm{\beta}^{\top},\mathbf{u}^{\top}_{\mathcal{B}}]^{\top}. Note that the intercept is combined into 𝐗\mathbf{X}. Then 𝐀(τ)=𝚺01τ𝐈M=𝚺α1𝚺x1τ𝐈M\mathbf{A}(\tau_{\mathcal{B}})=\bm{\Sigma}^{-1}_{0}\oplus\tau_{\mathcal{B}}\mathbf{I}_{M}=\bm{\Sigma}^{-1}_{\alpha}\oplus\bm{\Sigma}^{-1}_{x}\oplus\tau_{\mathcal{B}}\mathbf{I}_{M} is the prior precision represented with direct sum operator \oplus and 𝚺x1\bm{\Sigma}^{-1}_{x} is the precision for 𝜷\bm{\beta} and 𝚺α1\bm{\Sigma}^{-1}_{\alpha} is the precision of 𝐮α\mathbf{u}_{\alpha} and MM is the number of random effect clusters for 𝐙N×M\mathbf{Z}_{\mathcal{B}}\in\mathbb{R}^{N\times M}. The prior mean is 𝐛=[𝝁α,𝝁x,𝟎M]\mathbf{b}=\left[\bm{\mu}^{\top}_{\alpha},\bm{\mu}^{\top}_{x},\mathbf{0}_{M}^{\top}\right]^{\top} and [𝐙α,𝐗]N×(J+P)[\mathbf{Z}_{\alpha},\mathbf{X}]\in\mathbb{R}^{N\times(J+P)} are fixed effect covariates. Let 𝒞0\mathcal{C}_{0} be a set of convex constraints for fixed effects {𝐮α,𝜷}\{\mathbf{u}_{\alpha},\bm{\beta}\} only. The initial constraints are 𝒞0={𝜼:𝐮α>𝟎}\mathcal{C}_{0}=\left\{\bm{\eta}:\mathbf{u}_{\alpha}>\mathbf{0}\right\} and we condition on ϵ\epsilon through the offset term ηϵ\eta_{\epsilon} to model the baseline log cumulative hazard. We introduce the random effect structure by placing a gamma distribution on the precision: τG(a0,b0)\tau_{\mathcal{B}}\sim\mathrm{G}\left(a_{0},b_{0}\right), π(τa0,b0)τa01exp(b0τ)\pi(\tau_{\mathcal{B}}\mid a_{0},b_{0})\propto\tau_{\mathcal{B}}^{a_{0}-1}\exp\left(-b_{0}\tau_{\mathcal{B}}\right). The Gibbs sampler for (4), after applying sufficient dimension reduction and multiplying by the Gaussian and gamma priors, is given as

(ωi)PG(yi+ϵ,ψi)(vj)B(uα,j,nα,j,1)(τ)G(a0+M2,b0+𝐮𝐮2)(𝜼)TN(𝐐1𝝁,𝐐1,𝐯𝐮α)\begin{array}[]{rl}\left(\omega_{i}\mid-\right)&\sim\mathrm{PG}\left(y_{i}+\epsilon,\psi_{i}\right)\\ \left(v_{j}\mid-\right)&\sim\mathrm{B}(u_{\alpha,j},n_{\alpha,j},1)\\ \left(\tau_{\mathcal{B}}\mid-\right)&\sim\mathrm{G}\left(a_{0}+\frac{M}{2},b_{0}+\frac{\mathbf{u}^{\top}_{\mathcal{B}}\mathbf{u}_{\mathcal{B}}}{2}\right)\\ \left(\bm{\eta}\mid-\right)&\sim\mathrm{TN}\left(\mathbf{Q}^{-1}\bm{\mu},\mathbf{Q}^{-1},\mathbf{v}\leq\mathbf{u}_{\alpha}\right)\end{array} (7)

with 𝝁=𝐌(𝜿ηϵ𝝎)+𝐀(τ)𝐛=𝐌(𝜿ηϵ𝝎)+[(𝚺01𝝁0),𝟎M]\bm{\mu}=\mathbf{M}^{\top}(\bm{\kappa}-\eta_{\epsilon}\bm{\omega})+\mathbf{A}(\tau_{\mathcal{B}})\mathbf{b}=\mathbf{M}^{\top}(\bm{\kappa}-\eta_{\epsilon}\bm{\omega})+[(\bm{\Sigma}^{-1}_{0}\bm{\mu}_{0})^{\top},\mathbf{0}_{M}^{\top}]^{\top} and 𝐐=𝐌𝛀𝐌+𝐀(τ)\mathbf{Q}=\mathbf{M}^{\top}\bm{\Omega}\mathbf{M}+\mathbf{A}(\tau_{\mathcal{B}}). We may also write 𝝁=𝐌𝜿𝐌𝛀𝟏ηϵ+[(𝚺01𝝁0),𝟎M]=𝝁~𝐌𝛀𝟏ηϵ\bm{\mu}=\mathbf{M}^{\top}\bm{\kappa}-\mathbf{M}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}+[(\bm{\Sigma}^{-1}_{0}\bm{\mu}_{0})^{\top},\mathbf{0}_{M}^{\top}]^{\top}=\widetilde{\bm{\mu}}-\mathbf{M}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon} where 𝝁~=𝐌𝜿+[(𝚺01𝝁0),𝟎M]\widetilde{\bm{\mu}}=\mathbf{M}^{\top}\bm{\kappa}+[(\bm{\Sigma}^{-1}_{0}\bm{\mu}_{0})^{\top},\mathbf{0}_{M}^{\top}]^{\top}. For brevity, we abbreviate this modeling approach as the Cox-PG algorithm. As noted in Remark 3.3 of Choi & Hobert 2013, a proper posterior does not depend on a full rank design matrix. In addition, Bayesian inference using (7) is possible in settings with high dimensional covariates, J+P>NJ+P>N (Polson et al., 2013; Makalic & Schmidt, 2015).

3 Theoretical results: uniform ergodicity

We establish uniform ergodicity for our Gibbs sampler and show that the moments of posterior samples for 𝜼\bm{\eta} exists which guarantees central limit theorem (CLT) results for posterior MCMC averages and consistent estimators of the associated asymptotic variance (Jones, 2004). Our posterior inference of α(t)\alpha(t) involves matrix multiplication of coefficients and basis functions, 𝓩α𝐮α\bm{\mathcal{Z}}_{\alpha}\mathbf{u}_{\alpha} or a decomposition of additive functions and have the same CLT properties. Here 𝓩α\bm{\mathcal{Z}}_{\alpha} is the matrix representation of the monotonic spline bases over the range of the observed covariate data. Suppose 𝜼(n)\bm{\eta}^{(n)} is a draw from the posterior distribution, then α(n)(t)=α0(n)+𝓩α𝐮α(n)\alpha^{(n)}(t)=\alpha_{0}^{(n)}+\bm{\mathcal{Z}}_{\alpha}\mathbf{u}^{(n)}_{\alpha} are posterior samples of nonparametric effects with MCMC averages of α^(t)=(NMC)1n=1NMCα(n)(t)\widehat{\alpha}(t)=\left(N_{\text{MC}}\right)^{-1}\sum^{N_{\text{MC}}}_{n=1}\alpha^{(n)}(t) with an example provided in Figure 1c.

First, we show uniform ergodicity of 𝜼\bm{\eta} for our Gibbs sampler by taking advantage of the truncated gamma distribution (Choi & Hobert, 2013; Wang & Roy, 2018) and slice sampler properties (Mira & Tierney, 2002). A key feature of this strategy is truncating the gamma distribution at a small τ0\tau_{0} results in useful inequalities related to ergodicity while allowing for a vague prior to be specified for τ\tau_{\mathcal{B}}. We denote the 𝜼\bm{\eta}-marginal Markov chain as Ψ{𝜼(n)}n=0\Psi\equiv\{\bm{\eta}(n)\}^{\infty}_{n=0} and Markov transition density (Mtd) of Ψ\Psi as

k(𝜼𝜼)=+J++Nπ(𝜼𝝎,τ,𝐯,𝐲)π(𝝎,τ,𝐯𝜼,𝐲)𝑑𝝎𝑑τ𝑑𝐯\begin{array}[]{c}k\left(\bm{\eta}\mid\bm{\eta}^{\prime}\right)=\int_{\mathbb{R}_{+}^{J}}\int_{\mathbb{R}_{+}}\int_{\mathbb{R}_{+}^{N}}\pi(\bm{\eta}\mid\bm{\omega},{\tau}_{\mathcal{B}},\mathbf{v},\mathbf{y})\pi\left(\bm{\omega},{\tau}_{\mathcal{B}},\mathbf{v}\mid\bm{\eta}^{\prime},\mathbf{y}\right)d\bm{\omega}d{\tau}_{\mathcal{B}}d\mathbf{v}\end{array}

where 𝜼\bm{\eta}^{\prime} is the current state, 𝜼\bm{\eta} is the next state. Space +N\mathbb{R}_{+}^{N} is the PG support that contains 𝝎\bm{\omega}, τ+\tau_{\mathcal{B}}\in\mathbb{R}_{+} and 𝐯+J\mathbf{v}\in\mathbb{R}_{+}^{J}. In addition, placing upper bounds of the baseline hazard ensures finite integrals in the Markov transitions while accommodating vague priors. Under initial constraints 𝒞0\mathcal{C}_{0}, we see that k(𝜼𝜼)k\left(\bm{\eta}\mid\bm{\eta}^{\prime}\right) is strictly positive and Harris ergodic (Tierney, 1994; Hobert, 2011). We show the Mtd of satisfies the following minorization condition: k(𝜼𝜼)δh(𝜼)k\left(\bm{\eta}\mid\bm{\eta}^{\prime}\right)\geq\delta h(\bm{\eta}), where there exist a δ>0\delta>0 and density function hh, to prove uniform ergodicity (Roberts & Rosenthal 2004, Meyn & Tweedie 2012 Chapter 16). Uniform ergodicity is defined as bounded and geometrically decreasing bounds for total variation distance to the stationary distribution in number of Markov transitions nn, Kn(𝜼,)Π(𝐲)TV:=supA|Kn(𝜼,A)Π(A𝐲)|Vrn\left\|K^{n}(\bm{\eta},\cdot)-\Pi(\cdot\mid\mathbf{y})\right\|_{\text{TV}}:=\sup_{A\in\mathscr{B}}\left|K^{n}(\bm{\eta},A)-\Pi(A\mid\mathbf{y})\right|\leq Vr^{n} leading to CLT results (Van der Vaart 2000 Chapter 2). Here \mathscr{B} denotes the Borel σ\sigma-algebra of P+M\mathbb{R}^{P+M}, K(,)K(\cdot,\cdot) is the Markov transition function for the Mtd k(,)k(\cdot,\cdot)

K(𝜼,A)=(𝜼(r+1)A𝜼(r)=𝜼)=Ak(𝜼𝜼)𝑑𝜼,K\left(\bm{\eta}^{\prime},A\right)=\mathbb{P}\left(\bm{\eta}^{(r+1)}\in A\mid\bm{\eta}^{(r)}=\bm{\eta}^{\prime}\right)=\int_{A}k\left(\bm{\eta}\mid\bm{\eta}^{\prime}\right)d\bm{\eta},

and Kn(𝜼,A)=(𝜼(n+r)A𝜼(r)=𝜼)K^{n}\left(\bm{\eta}^{\prime},A\right)=\mathbb{P}\left(\bm{\eta}^{(n+r)}\in A\mid\bm{\eta}^{(r)}=\bm{\eta}^{\prime}\right). We denote Π(𝐲)\Pi(\cdot\mid\mathbf{y}) as the probability measure with density π(𝜼𝐲)\pi(\bm{\eta}\mid\mathbf{y}), VV is bounded above and r(0,1)r\in(0,1).

  1. (C1)

    The gamma prior parameters are constrained as a0+M/21a_{0}+M/2\geq 1, b0>0b_{0}>0 and gamma support is constrained to be ττ0\tau_{\mathcal{B}}\geq\tau_{0}.

  2. (C2)

    The monotonic splines system are comprised of Lipschitz continuous basis functions with coefficients constrained to 𝐮α(𝟎,𝐮α+)\mathbf{u}_{\alpha}\in\left(\mathbf{0},\mathbf{u}_{\alpha}^{+}\right), such that 𝒞0{𝜼:𝐮α(𝟎,𝐮α+)}\mathcal{C}_{0}\subseteq\{\bm{\eta}:\mathbf{u}_{\alpha}\in\left(\mathbf{0},\mathbf{u}_{\alpha}^{+}\right)\} and Dt𝐳α(ti)𝐮α<Dt𝐳α(ti)𝐮α+<D_{t}\mathbf{z}_{\alpha}^{\top}(t_{i})\mathbf{u}_{\alpha}<D_{t}\mathbf{z}_{\alpha}^{\top}(t_{i})\mathbf{u}_{\alpha}^{+}<\infty.

Theorem 1.

Under conditions (C1) and (C2), the Markov chain (7) is uniformly ergodic.

Theorem 2.

Under conditions (C1) and (C2), for any fixed 𝐭P+M\mathbf{t}\in\mathbb{R}^{P+M}, P+Me𝛈𝐭π(𝛈𝐲)𝑑𝛈<\int_{\mathbb{R}^{P+M}}e^{\bm{\eta}^{\top}\mathbf{t}}\pi(\bm{\eta}\mid\mathbf{y})d\bm{\eta}<\infty. Hence the moment generating function of the posterior associated with (7) exist.

We leave the proofs to the supplement. Condition (C1) can be disregarded when considering a Gibbs sampler without mixed effects; the Gibbs sampler is still uniformly ergodic under (C2). Condition (C1) bounds the second order variation of the random effects while allowing for vague priors through the truncated gamma support (Wang & Roy, 2018). Condition (C2) enforces distributional robustness (Blanchet et al., 2022) on how much the baseline hazard can vary, with a broad class of vague and informative priors that can be accommodated. Specifically, it is a common condition that bounds the slopes of the log cumulative hazard function. In large sample theory, condition (C2) or bounding the baseline cumulative hazard is also used to facilitate the dominated convergence theorem and achieve strong consistency (Wang, 1985; McLain & Ghosh, 2013). This proof also guarantees uniform ergodicity for Poisson regression through NB approximations and PG samplers (Zhou et al., 2012; Duan et al., 2018). The following results for coupling time Kn(𝜼,)Π(𝐲)TV(1δ)n\left\|K^{n}(\bm{\eta},\cdot)-\Pi(\cdot\mid\mathbf{y})\right\|_{\text{TV}}\leq(1-\delta)^{n} is derived from the minorization condition (Jones, 2004). Numerical integration can be used to calculate δ\delta prior to MCMC sampling to evaluate convergence rates and determine number of MCMC samples nn (Jones & Hobert, 2001; Geweke, 1989). More importantly, these results establish conditions for convergence in total variation distance and posterior expectations for constrained nonparametric objects under Bayesian inference.

4 Accelerated mixing and removing bias with calibration

From Gibbs samplers, we have the useful property of reversible marginal transitions. This allows us to construct a simple Metropolis-Hasting correction to debias our MCMC estimate. The pure Poisson process version of this problem was studied in Duan et al. (2018) with slow mixing but accurate inference using ϵ=1000\epsilon=1000. A large ϵ\epsilon artificially induces an imbalanced logistic regression problem (Zens et al., 2023). Data calibration can be used once a Gibbs sampler for approximate PH regression has been established (Duan et al., 2018). We can use Gibbs sampler (7) as a proposal distribution to sample from the target posterior constructed with the beta kernel Cox semiparametric regression posterior, LPH(𝜼𝐲,𝐌)π0(𝜼)L_{\mathrm{PH}}(\bm{\eta}\mid\mathbf{y},\mathbf{M})\pi_{0}(\bm{\eta}) where π0(𝜼)\pi_{0}(\bm{\eta}) is the prior. We accept the new Gibbs draw from proposal distribution q(𝜼𝜼)q(\bm{\eta}\mid\bm{\eta}^{\prime}) using a MH step with probability

min{1,LPH(𝜼𝐲,𝐌)π0(𝜼)q(𝜼𝜼)LPH(𝜼𝐲,𝐌)π0(𝜼)q(𝜼𝜼)}=min{1,i=1Nexp{λi}exp{λi}{1+exp(ψi)}yi+ϵ{1+exp(ψi)}yi+ϵ}.\begin{array}[]{c}\min\left\{1,\frac{L_{\mathrm{PH}}(\bm{\eta}\mid\mathbf{y},\mathbf{M})\pi_{0}(\bm{\eta})q\left(\bm{\eta}^{\prime}\mid\bm{\eta}\right)}{L_{\mathrm{PH}}(\bm{\eta}^{\prime}\mid\mathbf{y},\mathbf{M})\pi_{0}(\bm{\eta}^{\prime})q\left(\bm{\eta}\mid\bm{\eta}^{\prime}\right)}\right\}=\min\left\{1,\prod_{i=1}^{N}\frac{\exp\left\{\lambda_{i}^{\prime}\right\}}{\exp\left\{\lambda_{i}\right\}}\frac{\left\{1+\exp\left(\psi_{i}\right)\right\}^{y_{i}+\epsilon}}{\left\{1+\exp\left(\psi_{i}^{\prime}\right)\right\}^{y_{i}+\epsilon}}\right\}\end{array}. (8)

We denote current state as ψi=log(λi)log(ϵ)=𝐦i𝜼log(ϵ)\psi_{i}^{\prime}=\log(\lambda_{i}^{\prime})-\log(\epsilon)=\mathbf{m}_{i}^{\top}\bm{\eta}^{\prime}-\log(\epsilon) and next state as ψi=log(λi)log(ϵ)=𝐦i𝜼log(ϵ)\psi_{i}=\log(\lambda_{i})-\log(\epsilon)=\mathbf{m}_{i}^{\top}\bm{\eta}-\log(\epsilon). The derivation of this probability can be done quickly by noting that Gibbs samplers are MH algorithms with acceptance probability one. Therefore under the Cox-PG Gibbs sampler for target posterior LNB(𝜼𝐲,𝐌)π0(𝜼)L_{\mathrm{NB}}(\bm{\eta}\mid\mathbf{y},\mathbf{M})\pi_{0}(\bm{\eta}) we have

LNB(𝜼𝐲,𝐌)π0(𝜼)q(𝜼𝜼)LNB(𝜼𝐲,𝐌)π0(𝜼)q(𝜼𝜼)=1q(𝜼𝜼)q(𝜼𝜼)=LNB(𝜼𝐲,𝐌)π0(𝜼)LNB(𝜼𝐲,𝐌)π0(𝜼).\begin{array}[]{c}\frac{L_{\mathrm{NB}}(\bm{\eta}\mid\mathbf{y},\mathbf{M})\pi_{0}(\bm{\eta})q\left(\bm{\eta}^{\prime}\mid\bm{\eta}\right)}{L_{\mathrm{NB}}(\bm{\eta}^{\prime}\mid\mathbf{y},\mathbf{M})\pi_{0}(\bm{\eta}^{\prime})q\left(\bm{\eta}\mid\bm{\eta}^{\prime}\right)}=1\implies\frac{q\left(\bm{\eta}^{\prime}\mid\bm{\eta}\right)}{q\left(\bm{\eta}\mid\bm{\eta}^{\prime}\right)}=\frac{L_{\mathrm{NB}}(\bm{\eta}^{\prime}\mid\mathbf{y},\mathbf{M})\pi_{0}(\bm{\eta}^{\prime})}{L_{\mathrm{NB}}(\bm{\eta}\mid\mathbf{y},\mathbf{M})\pi_{0}(\bm{\eta})}\end{array}. (9)

Substituting (9) into (8), we have a ratio of likelihoods which simplifies to approximately a ratio of survival functions or a difference of cumulants on the log scale. This nested detailed balance structure removes the bias of G(ϵ,ϵ)\text{G}(\epsilon,\epsilon) convolution and allows moderate ϵ=100\epsilon=100 to be used to accelerate mixing. The frailty model with ϵ=100\epsilon=100 modifies the original Cox model to have more heterogeneity, making it an ideal proposal distribution because proposal distributions should have slightly more variation than their target distribution. Through our experiments, we found ϵ=100\epsilon=100 and the MH step to work well with over 90% acceptance rates in various settings. The bias due to frailty in Gibbs samplers (7) can be removed with MH acceptance probability defined in (8). For example, draws of 𝜼\bm{\eta} are sampled sequentially from (7), each new draw of 𝜼\bm{\eta} is accepted with probability (8). After removing the bias, we have draws from the PH likelihood based posterior and we can use the log–log link (i.e. exp[exp{A(ti)}]\exp[-\exp\{A(t_{i})\}]) to map to the survival probabilities.

For Cox-PG with weights (6), the ratios (8) and (9) are calculated using weighted likelihoods LNB(𝜼𝐰,𝐲,𝐌)L_{\text{NB}}(\bm{\eta}\mid\mathbf{w},\mathbf{y},\mathbf{M}) and LPH(𝜼𝐰,𝐲,𝐌)L_{\text{PH}}(\bm{\eta}\mid\mathbf{w},\mathbf{y},\mathbf{M}),

min{1,i=1N[exp{λi}exp{λi}{1+exp(ψi)}yi+ϵ{1+exp(ψi)}yi+ϵ]wi}.\begin{array}[]{c}\min\left\{1,\prod_{i=1}^{N}\left[\frac{\exp\left\{\lambda_{i}^{\prime}\right\}}{\exp\left\{\lambda_{i}\right\}}\frac{\left\{1+\exp\left(\psi_{i}\right)\right\}^{y_{i}+\epsilon}}{\left\{1+\exp\left(\psi_{i}^{\prime}\right)\right\}^{y_{i}+\epsilon}}\right]^{w_{i}}\right\}\end{array}.

Aside from mixed models, all hierarchical models that can be computed with Gibbs transitions simplifies to convenient acceptance probability (8). These results are due to the prior π0(𝜼)\pi_{0}(\bm{\eta}) canceling out in the ratios.

5 Other extensions of the Cox-PG algorithm

The framework we have established has wide applicability to numerous other contexts and problems. Any Gaussian Gibbs sampler model on the PH coefficients can be incorporated as direct extensions of our Cox-PG algorithm, including variable selection (Tipping, 2001; Makalic & Schmidt, 2015). We detail a few additional extensions here:

Stratified hazards. Two sets of baseline hazard spline systems can be specified based with interaction variables. For example, male and female interactions are given as 𝐦i𝜼=αmale+𝐳α(ti)𝐮α,male+αfemale+𝐳α(ti)𝐮α,female+𝐱i𝜷\mathbf{m}_{i}^{\top}\bm{\eta}=\alpha_{\text{male}}+\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha,\text{male}}+\alpha_{\text{female}}+\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha,\text{female}}+\mathbf{x}^{\top}_{i}\bm{\beta}.

Smoothing splines as random effects. Smoothing splines can also be modeled with random effects summarized in Wand & Ormerod 2008; O’Sullivan 1986; Lee et al. 2018. Similar to generalized additive models (GAMs), a nonparametric smooth effect of a continuous covariate can be obtain through penalized regression coefficients. The O’Sullivan spline approach found in Wand & Ormerod (2008) uses a linear fixed effect and L2 penalized coefficients on Demmler-Reinsch (DR) oscillation basis functions (Demmler & Reinsch, 1975). This conveniently corresponds with the second order derivative smoothness penalty and can be fitted with random effects using (7).

Power prior for incorporating historical clinical trial data. The power prior of Ibrahim & Chen (2000) represents L(𝜼𝐲,𝐌)L(\bm{\eta}\mid\mathbf{y}^{\dagger},\mathbf{M}^{\dagger}) as the historical data prior where a[0,1]a\in[0,1] is a hyper-parameter for including historical data. Under power prior form, we evaluate L(𝜼𝐲,𝐌)[L(𝜼𝐲,𝐌)]aL(\bm{\eta}\mid\mathbf{y},\mathbf{M})\left[L(\bm{\eta}\mid\mathbf{y}^{\dagger},\mathbf{M}^{\dagger})\right]^{a} (Ibrahim et al., 2015). This results in aa being multiplied with yiy_{i}^{\dagger} and ϵ\epsilon resulting in Cox-PG with weights, (6).

6 Illustrative example: KM curve

We apply Cox-PG to model survival curves using the lung dataset from the survival R package for illustrative purposes (Loprinzi et al., 1994; Therneau, 2023). Naturally, we do not need many partitions to fit a survival curve. Partitions boundaries sjs_{j} should be place near the few changes in the second derivatives of the log cumulative hazard. We found that J=5J=5 partitions works well with boundaries sjs_{j} selected as equally spaced quantiles of uncensored times {ti:yi=1}\{t_{i}:y_{i}=1\} in the same manner as knot selection in functional data analysis (Ruppert et al., 2003; Ramsay & Silverman, 2005). This also results in nα,1nα,2nα,Jn_{\alpha,1}\approx n_{\alpha,2}\approx\dots\approx n_{\alpha,J}, equal numbers of uncensored events in each partition (Figure 1a). This is an intuitive selection of partition boundaries as deaths drive the rate of survival processes. In addition, this spline-based approach enforces a horizontal line if the last set of study times tit_{i} are all censored events, similar to Kaplan-Meier (KM) estimates and is demonstrated in Figure 1c & d. This horizontal region coincides with the cessation of martingale calculus in KM and partial likelihoods (Figure 1d). Using ϵ=100\epsilon=100, J=5J=5 partitions and MH step (8) to remove bias, we achieve fast computation and accurate results. For the truncated Gaussian sampling, we use recently developed efficient Gibbs sampler package relliptical (Valeriano et al., 2023). We use the BayesLogit package for PG sampling (Polson et al., 2013) and set 𝚺0=106𝐈\bm{\Sigma}_{0}=10^{6}\mathbf{I}, 𝐛=𝟎\mathbf{b}=\mathbf{0} as default parameters. In the back-end, study times are scaled and registered on compact interval [0,0.5]\left[0,0.5\right] in order to mitigate numerical underflow due to numerous inner product calculations. The code is provided in the supplement with convenient R function posterior_draw() for posterior sampling. We drew 10001000 burn-in samples and then drew 100000100000 posterior samples and thinned to retain 10001000 samples. This process was very fast and on average, took 2–3 minutes on an Apple M2 Pro Mac mini, 16GB.

We map the Cox-PG posterior mean baseline log cumulative hazard α^(t)+α^0\widehat{\alpha}(t)+\widehat{\alpha}_{0} (Figure 1c) to survival curves, using exp{exp[α^(t)+α^0]}\exp\{-\exp[\widehat{\alpha}(t)+\widehat{\alpha}_{0}]\} and compared it to Kaplan-Meier estimates (Figure 1d). We plot our monotone basis functions 𝓩α\bm{\mathcal{Z}}_{\alpha} for J=5J=5 in Figure 1b. We note that the posterior mean of the baseline log cumulative hazard is monotonic because it is a sum of monotonic functions and every posterior draw is a monotonic function. We use the approach of Meyer et al. 2015 to construct 0.950.95 multiplicity adjusted joint bands for α(t)\alpha(t). Our bounds are mapped to survival probabilities and are plotted in Figure 1d.

A novel and key advantage of our approach is that Cox-PG enforces monotonicity and continuity at every MCMC draw as local linear regression of the log cumulative hazards. This also induces a degree of smoothness in the survival curve, a useful safeguard against over fitting. In addition, fit can be improved with better partition selection and larger ϵ\epsilon. The KM example is a special case of Cox-PG with intercept α0\alpha_{0} being the only unconstrained PH regression coefficient. We present PH regression models with different choices of JJ and ϵ\epsilon next.

7 Simulation study: Weibull PH model

We used the following configurations of Cox-PG. Cox-PG1: (7) without Metropolis-Hastings bias removal, J=5J=5, ϵ=1000\epsilon=1000. Cox-PG2: (7) with Metropolis-Hastings bias removal, J=5J=5, ϵ=100\epsilon=100. Cox-PG3: (7) with Metropolis-Hastings bias removal, J=10J=10, ϵ=100\epsilon=100. In addition, we compare our method with Bayesian integrated nested Laplace approximation (INLA) (Alvares et al., 2024; Rustand et al., 2024; Rue et al., 2009). As a gold standard method, we also fit the Weibull simulated data with a Bayesian Weibull PH model using a Hamiltonian Monte Carlo (HMC) algorithm built on Stan (Bürkner, 2017). Competing methods used default initialization conditions and parameters found in R package INLAjoint and R package brms. These methods are referred to as INLA and HMC-Weibull in the results.

As the basis of our simulation, we use the Weibull PH model α1(ti)+x1iβ1+x2iβ2\alpha_{1}(t_{i})+x_{1i}\beta_{1}+x_{2i}\beta_{2} where α1(ti)=log(0.1)+2log(ti)\alpha_{1}(t_{i})=\log(0.1)+2\log\left(t_{i}\right) is the baseline log cumulative hazard and X1iN(0,1)X_{1i}\sim\mathrm{N}(0,1), X2iU(1,1)X_{2i}\sim\operatorname{U}(-1,1) and {β1=0.5,β2=0.5}\{\beta_{1}=0.5,\beta_{2}=-0.5\}. We simulate 4 cases, different configurations of the Weibull PH model. Frailty: we simulate 25 balanced cluster random effects and add it to the Weibull process α1(ti)+x1iβ1+x2iβ2+bk\alpha_{1}(t_{i})+x_{1i}\beta_{1}+x_{2i}\beta_{2}+b_{k}, with bkN(0,1)b_{k}\sim\mathrm{N}(0,1). Weighting: we first simulate the Weibull event times from α1(ti)+x1iβ1+x2iβ2\alpha_{1}(t_{i})+x_{1i}\beta_{1}+x_{2i}\beta_{2} and then contaminate 10% of the data by replacing covariates with draws X1iU(10,10)X_{1i}\sim\operatorname{U}(-10,10), X2iU(10,10)X_{2i}\sim\operatorname{U}(-10,10); we assigned contaminated data a weight of 0.001 to mitigate contamination. GAM: we add a nonparametric effect to the Weibull process α1(ti)+x1iβ1+x2iβ2+sin(x3i)\alpha_{1}(t_{i})+x_{1i}\beta_{1}+x_{2i}\beta_{2}+\sin(x_{3i}), with x3iU(0,2π)x_{3i}\sim\mathrm{U}(0,2\pi). Stratified: we sample 75% of the data α1(ti)+x1iβ1+x2iβ2\alpha_{1}(t_{i})+x_{1i}\beta_{1}+x_{2i}\beta_{2} and 25% of the data from α2(ti)+x1iβ1+x2iβ2\alpha_{2}(t_{i})+x_{1i}\beta_{1}+x_{2i}\beta_{2} with α2(ti)=log(0.2)+log(ti)\alpha_{2}(t_{i})=\log(0.2)+\log\left(t_{i}\right); indicators for hazard groups are recorded. For each case, we sample N=200N=200 observations and we drew independent censoring times from Exp(0.1)\text{Exp}(0.1). For each case, we simulate 200 replicates and all were fitted with the 5 competing methods: Cox-PG1, Cox-PG2, Cox-PG3, INLA, and HMC-Weibull. HMC-Weibull can accommodate weights, frailties and GAMs but cannot fit stratified hazards or nonparametric baseline hazards. INLA can accommodate frailties, stratified hazards and nonparametric baseline hazards but cannot accommodate weights and GAMs. Cox-PG can accommodate the parameterization of all cases. All Cox-PG algorithms are initialized at the MLE and can be fitted with our posterior_draw() function. All MCMC methods sampled 1000 burn-in and 10000 draws that were thinned to 1000 samples.

For the baseline log cumulative hazard, we plot the integrated square error (ISE) (UL)1LU[α(t)α^(t)]2𝑑t(U-L)^{-1}\int^{U}_{L}\left[\alpha(t)-\widehat{\alpha}(t)\right]^{2}dt and integrated coverage rate (UL)1LU𝕀[α^(t)<α(t)<α^+(t)]𝑑t(U-L)^{-1}\int^{U}_{L}\mathbb{I}\left[\widehat{\alpha}^{-}(t)<\alpha(t)<\widehat{\alpha}^{+}(t)\right]dt in Figure 7, where α^+(t)\widehat{\alpha}^{+}(t) and α^(t)\widehat{\alpha}^{-}(t) are upper and lower bounds respectively. Integral domain LL and UU are set as the first and last death observed in the data. For HMC-Weibull and Cox-PG we were able to calculate the joint bands (Meyer et al., 2015) and we used the provided bounds from INLA. We plot estimates of α^(t)\widehat{\alpha}(t) based on posterior means in Figure 3. Cox-PG has comparable coverage to HMC-Weibull which is specified under oracle knowledge of the true hazard family. The piecewise slope of Cox-PG estimates the shape of the baseline log cumulative hazard function well. INLA struggles using numerical integration to approximate the hazard function near the boundary t0t\approx 0 (Figure 3). HMC-Weibull with knowledge of the true baseline hazard family outperforms Cox-PG in terms of ISE. However, knowledge of the true hazard family is not possible in real data analysis. For the Stratified case, INLA and Cox-PG outperforms HMC-Weibull by specifying two separate baseline hazards and the HMC-Weibull estimate bisects the two true hazard functions. Considering INLA uses 1515 partitions by default, Cox-PG outperforms INLA with fewer parameters in estimating α(t)\alpha(t). For the GAM case, results for estimating sin(x3)\sin(x_{3}) can be found in the supplement.

Squared errors, 0.95 credible interval length and 0.95 credible interval coverage rate for coefficient estimate of {β1=0.5,β2=0.5}\{\beta_{1}=0.5,\beta_{2}=-0.5\} are presented in Figure 4 & 5. All competing methods perform comparably in the Frailty case, because all methods can fit mixed models. For analysis of both α(t)\alpha(t) and 𝜷\bm{\beta}, INLA performance suffers in cases where the weights and GAMs are not used in estimation. However, in cases of Frailty and Stratified, estimation of 𝜷\bm{\beta} remains accurate when there is a reasonable approximation of α(t)\alpha(t); this robustness property is well documented in the literature (Tsiatis, 2006; Ibrahim et al., 2001). Cox-PG1, without bias removal, slightly suffers in performance due to its additional heterogeneity and bias. As sample size increases, nα,jn_{\alpha,j} increases leading to vj/uα,j1v_{j}/u_{\alpha,j}\rightarrow 1 from the left. This leads to slow mixing of the MCMC by causing uα,j(n1)uα,j(n)u_{\alpha,j}^{(n-1)}\approx u_{\alpha,j}^{(n)}. We can decrease nα,jn_{\alpha,j} by increasing the number of partitions JJ at the cost of computation. However, Cox-PG3 can be unstable when there is not enough uncensored data to fit J=10J=10 partitions, an identifiability problem, such as the Stratified case with only 50 subjects in the α2(t)\alpha_{2}(t) group. Due to its flexibility, fewer monotonic spline parameters, bias removal and its comparable performance to gold standard methods, we recommend Cox-PG2 as the default.

8 Leukemia example: stratified hazards, GAM, and frailties

We used the leukemia death/right-censored dataset from Henderson et al. (2002) with N=1043N=1043 and baseline covariates, found in R package INLA (Rue et al., 2017). We include age and sex as linear PH effects. There are 24 district random effects denoting different sites. In addition, we account for nonlinear effect of the Townsend score: a quantitative measure with a range of [7,10][-7,10] where higher values indicate less affluent areas. We believe that low scores 7\approx-7 have a greater association with increased survival that cannot be explained using a linear effect. The nonlinear GAM component induces an additional 7 random effects to model. We stratified the data into two baseline hazard groups based on white blood cell count to study the relationship between low count (leukopenia) and survival rates. We divided the data into two groups based on white blood cell count cutoff <3×109/L<3\times 10^{9}/L for leukopenia and considered it normal otherwise. Cox-PG enables us to study the nonlinear effect of Townsend score and stratified baseline hazards, two violations of the standard proportional hazard assumption. For comparison, we fit Cox-PG2 and HMC-Weibull, using 1000 burn-in and 200000 draws that were thinned to retain 1000 samples. Our model has M=31M=31 random effects, P=5P=5 fixed effects/intercepts and 2J=102J=10 monotonic splines. We use our posterior_draw() function for Bayesian computation.

HMC-Weibull can parameterize the site random effect and GAM, but not the stratified baseline hazard. We observed that the fitted baseline log cumulative hazard from HMC-Weibull is consistent with the overall shape of Cox-PG estimates (Figure 6a). However, we observed violation of proportional hazards (linear effect) due to leukopenia in the Cox-PG estimates. If the effect was linear, there would be an equal-distance difference between the two baseline log cumulative hazards (Figure 6a). Furthermore, the last death in the normal group was observed at day 704, whereas the leukopenia group recorded 87 deaths after day 704. This imbalanced distribution of event times suggests that stratified baseline hazards are more appropriate. We observed a decreased risk of mortality associated with the a low Townsend score reflecting affluent areas (Figure 6b). The nonparametric effect through the affine combination of DR basis and penalized coefficients elucidates the nonlinear relationship between Townsend score and risk. The DR basis associated with quadratic relationship yielded a coefficient with a significant Bayesian p-value of 0.014 (Figure 6c) indicating a nonlinear relationship. In both competing methods, the effect of age was significant, while sex effect was not (Table 1).

9 Discussion

The link between Cox models and logistic regression is well known, with relative risks being equal to odds ratios in rare event settings. Because Cox models admits a local Poisson representation, we can reformulated Cox models as local negative binomial processes with a vague gamma mixture. These local negative binomial processes are in fact rare event logistic regressions with an appropriate offset term. This allows Bayesian computational and inferential tools of Gaussian inference through logistic regression to be deployed in survival settings. Now, multilevel Gaussian models can readily be adapted to Cox models. These techniques serve as a foundation for future work in Bayesian survival models. Cox-PG can be implemented in base R rather than working with a software framework such as Stan or INLA, making it more accessible to modelers. We provide our initial implementation of the Cox-PG algorithm in the supplement files. We leave other multilevel modeling extensions of Cox-PG and additional strategies for computational acceleration to future work.

Refer to caption
Figure 1: Kaplan-Meier example with lung dataset. (a) Derivative of monotonic spline system Dt𝓩αD_{t}\bm{\mathcal{Z}}_{\alpha} resulting in non-overlaping delta functions with likelihood contribution as beta kernels: i=1N{Dt𝐳α(ti)𝐮α}yi=i=1Nj=1Juα,jyiδj(ti)\prod_{i=1}^{N}\left\{D_{t}\mathbf{z}_{\alpha}^{\top}(t_{i})\mathbf{u}_{\alpha}\right\}^{y_{i}}=\prod_{i=1}^{N}\prod_{j=1}^{J}{u}_{\alpha,j}^{y_{i}\delta_{j}(t_{i})}. (b) Monotonic spline system 𝓩α\bm{\mathcal{Z}}_{\alpha} centered at midpoints. (c) Cox-PG posterior mean and 0.95 joint bands for baseline log cumulative hazard function. (d) Cox-PG and KM estimates for survival function.
Refer to caption
Figure 2: Boxplot of integrated squared errors and integrated coverage rates for baseline log cumulative hazard. Means are denoted with diamond.
Refer to caption
Figure 3: Posterior estimates for Weibull log cumulative hazard α1(ti)=log(0.1)+2log(ti)\alpha_{1}(t_{i})=\log(0.1)+2\log\left(t_{i}\right). Estimates from N=200N=200 replicates are plotted in gray. The true α1(ti)\alpha_{1}(t_{i}) is plotted in solid red. For the Stratified case α2(ti)=log(0.2)+log(ti)\alpha_{2}(t_{i})=\log(0.2)+\log\left(t_{i}\right) is plotted in solid blue and estimates are plotted in transparent red and blue.
Refer to caption
Figure 4: Boxplot of squared errors, interval lengths and coverage rates for β1=0.5\beta_{1}=0.5. Means are denoted with diamond.
Refer to caption
Figure 5: Boxplot of squared errors, interval lengths and coverage rates for β2=0.5\beta_{2}=-0.5. Means are denoted with diamond.
Refer to caption
Figure 6: Cox-PG estimates of baseline log cumulative hazard and Townsend effect from the leukemia dataset. (a) Estimates and joint bands for the stratified baseline log cumulative hazard along with the combined hazard from HMC-Weibull fit. (b) Smoothed nonparametric effect of Townsend score and 0.95 joint band. (c) Demmler-Reinsch (DR) basis system used to fit GAMs are plotted in gray; DR basis function corresponding to the coefficient with significant Bayesian p-value (0.0140.014) is highlighted with a dashed blue line.
Table 1: Leukemia Regression Results for Competing Methods with 0.95 Credible Intervals
Variable Cox-PG2 HMC-Weibull
sex 0.0714 (-0.0680, 0.2187) 0.0573 (-0.0744, 0.1840)
age 0.0314 (0.0272, 0.0358) 0.0316 (0.0275, 0.0356)

References

  • (1)
  • Aalen (1978) Aalen, O. (1978), ‘Nonparametric inference for a family of counting processes’, The Annals of Statistics pp. 701–726.
  • Aalen et al. (2008) Aalen, O., Borgan, O. & Gjessing, H. (2008), Survival and event history analysis: a process point of view, Springer Science & Business Media.
  • Alvares et al. (2024) Alvares, D., Van Niekerk, J., Krainski, E. T., Rue, H. & Rustand, D. (2024), ‘Bayesian survival analysis with inla’, Statistics in Medicine 43(20), 3975–4010.
  • Bhadra et al. (2019) Bhadra, A., Datta, J., Polson, N. G. & Willard, B. (2019), ‘Lasso meets horseshoe’, Statistical Science 34(3), 405–427.
  • Blanchet et al. (2022) Blanchet, J., Murthy, K. & Si, N. (2022), ‘Confidence regions in wasserstein distributionally robust estimation’, Biometrika 109(2), 295–315.
  • Boyd & Vandenberghe (2004) Boyd, S. P. & Vandenberghe, L. (2004), Convex optimization, Cambridge university press.
  • Bürkner (2017) Bürkner, P.-C. (2017), ‘brms: An r package for bayesian multilevel models using stan’, Journal of statistical software 80, 1–28.
  • Carvalho et al. (2009) Carvalho, C. M., Polson, N. G. & Scott, J. G. (2009), Handling sparsity via the horseshoe, in ‘Artificial intelligence and statistics’, PMLR, pp. 73–80.
  • Chen et al. (2006) Chen, M.-H., Ibrahim, J. G. & Shao, Q.-M. (2006), ‘Posterior propriety and computation for the cox regression model with applications to missing covariates’, Biometrika 93(4), 791–807.
  • Choi & Hobert (2013) Choi, H. M. & Hobert, J. P. (2013), ‘The Polya-Gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic’, Electronic Journal of Statistics 7(none), 2054 – 2064.
    https://doi.org/10.1214/13-EJS837
  • Cong et al. (2017) Cong, Y., Chen, B. & Zhou, M. (2017), ‘Fast simulation of hyperplane-truncated multivariate normal distributions.’, Bayesian Analysis 12(4).
  • Cox (1972) Cox, D. R. (1972), ‘Regression models and life-tables’, Journal of the Royal Statistical Society: Series B (Methodological) 34(2), 187–202.
  • Cox (1975) Cox, D. R. (1975), ‘Partial likelihood’, Biometrika 62(2), 269–276.
  • Cui et al. (2021) Cui, E., Crainiceanu, C. M. & Leroux, A. (2021), ‘Additive functional cox model’, Journal of Computational and Graphical Statistics 30(3), 780–793.
  • Damien et al. (1999) Damien, P., Wakefield, J. & Walker, S. (1999), ‘Gibbs sampling for bayesian non-conjugate and hierarchical models by using auxiliary variables’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61(2), 331–344.
  • Demmler & Reinsch (1975) Demmler, A. & Reinsch, C. (1975), ‘Oscillation matrices with spline smoothing’, Numerische Mathematik 24(5), 375–382.
  • Duan et al. (2018) Duan, L. L., Johndrow, J. E. & Dunson, D. B. (2018), ‘Scaling up data augmentation mcmc via calibration’, Journal of Machine Learning Research 19(64), 1–34.
  • Esscher (1932) Esscher, F. (1932), ‘On the probability function in the collective theory of risk’, Skand. Aktuarie Tidskr. 15, 175–195.
  • Geweke (1989) Geweke, J. (1989), ‘Bayesian inference in econometric models using monte carlo integration’, Econometrica: Journal of the Econometric Society pp. 1317–1339.
  • Henderson et al. (2002) Henderson, R., Shimakura, S. & Gorst, D. (2002), ‘Modeling spatial variation in leukemia survival data’, Journal of the American Statistical Association 97(460), 965–972.
  • Hobert (2011) Hobert, J. P. (2011), ‘The data augmentation algorithm: Theory and methodology’, Handbook of Markov Chain Monte Carlo pp. 253–293.
  • Hothorn et al. (2018) Hothorn, T., Möst, L. & Bühlmann, P. (2018), ‘Most likely transformations’, Scandinavian Journal of Statistics 45(1), 110–134.
  • Ibrahim & Chen (2000) Ibrahim, J. G. & Chen, M.-H. (2000), ‘Power prior distributions for regression models’, Statistical Science pp. 46–60.
  • Ibrahim et al. (2015) Ibrahim, J. G., Chen, M.-H., Gwon, Y. & Chen, F. (2015), ‘The power prior: theory and applications’, Statistics in medicine 34(28), 3724–3749.
  • Ibrahim et al. (2001) Ibrahim, J. G., Chen, M.-H. & Sinha, D. (2001), Bayesian survival analysis, Vol. 2, Springer.
  • Ildstad et al. (2001) Ildstad, S. T., Evans Jr, C. H. et al. (2001), ‘Small clinical trials: Issues and challenges’.
  • Jones (2004) Jones, G. L. (2004), ‘On the markov chain central limit theorem’.
  • Jones & Hobert (2001) Jones, G. L. & Hobert, J. P. (2001), ‘Honest exploration of intractable probability distributions via markov chain monte carlo’, Statistical Science pp. 312–334.
  • Kalbfleisch (1978) Kalbfleisch, J. D. (1978), ‘Non-parametric bayesian analysis of survival time data’, Journal of the Royal Statistical Society: Series B (Methodological) 40(2), 214–221.
  • Kalbfleisch & Prentice (2011) Kalbfleisch, J. D. & Prentice, R. L. (2011), The statistical analysis of failure time data, John Wiley & Sons.
  • Kaplan & Meier (1958) Kaplan, E. L. & Meier, P. (1958), ‘Nonparametric estimation from incomplete observations’, Journal of the American statistical association 53(282), 457–481.
  • Lee et al. (2018) Lee, W., Miranda, M., Rausch, P., Baladandayuthapani, V., Fazio, M., Downs, C. & Morris, J. S. (2018), ‘Bayesian semiparametric functional mixed models for serially correlated functional data, with application to glaucoma data’, Journal of the American Statistical Association .
  • Li (2018) Li, B. (2018), Sufficient dimension reduction: Methods and applications with R, CRC Press.
  • Li & Ghosh (2015) Li, Y. & Ghosh, S. K. (2015), ‘Efficient sampling methods for truncated multivariate normal and student-t distributions subject to linear inequality constraints’, Journal of Statistical Theory and Practice 9, 712–732.
  • Loprinzi et al. (1994) Loprinzi, C. L., Laurie, J. A., Wieand, H. S., Krook, J. E., Novotny, P. J., Kugler, J. W., Bartel, J., Law, M., Bateman, M. & Klatt, N. E. (1994), ‘Prospective evaluation of prognostic variables from patient-completed questionnaires. north central cancer treatment group.’, Journal of Clinical Oncology 12(3), 601–607.
  • Maatouk & Bay (2016) Maatouk, H. & Bay, X. (2016), A new rejection sampling method for truncated multivariate gaussian random variables restricted to convex sets, in ‘Monte Carlo and Quasi-Monte Carlo Methods: MCQMC, Leuven, Belgium, April 2014’, Springer, pp. 521–530.
  • Makalic & Schmidt (2015) Makalic, E. & Schmidt, D. F. (2015), ‘A simple sampler for the horseshoe estimator’, IEEE Signal Processing Letters 23(1), 179–182.
  • McGregor et al. (2020) McGregor, D., Palarea-Albaladejo, J., Dall, P., Hron, K. & Chastin, S. (2020), ‘Cox regression survival analysis with compositional covariates: application to modelling mortality risk from 24-h physical activity patterns’, Statistical methods in medical research 29(5), 1447–1465.
  • McLain & Ghosh (2013) McLain, A. C. & Ghosh, S. K. (2013), ‘Efficient sieve maximum likelihood estimation of time-transformation models’, Journal of Statistical Theory and Practice 7, 285–303.
  • Meyer et al. (2018) Meyer, M. C., Kim, S.-Y. & Wang, H. (2018), ‘Convergence rates for constrained regression splines’, Journal of Statistical Planning and Inference 193, 179–188.
  • Meyer et al. (2015) Meyer, M. J., Coull, B. A., Versace, F., Cinciripini, P. & Morris, J. S. (2015), ‘Bayesian function-on-function regression for multilevel functional data’, Biometrics 71(3), 563–574.
  • Meyn & Tweedie (2012) Meyn, S. P. & Tweedie, R. L. (2012), Markov chains and stochastic stability, Springer Science & Business Media.
  • Mira & Tierney (2002) Mira, A. & Tierney, L. (2002), ‘Efficiency and convergence properties of slice samplers’, Scandinavian Journal of Statistics 29(1), 1–12.
  • Neal (2003) Neal, R. M. (2003), ‘Slice sampling’, The annals of statistics 31(3), 705–767.
  • Neelon (2019) Neelon, B. (2019), ‘Bayesian zero-inflated negative binomial regression based on pólya-gamma mixtures’, Bayesian analysis 14(3), 829.
  • Nelson (1972) Nelson, W. (1972), ‘Theory and applications of hazard plotting for censored failure data’, Technometrics 14(4), 945–966.
  • Nieto-Barajas & Walker (2002) Nieto-Barajas, L. E. & Walker, S. G. (2002), ‘Markov beta and gamma processes for modelling hazard rates’, Scandinavian Journal of Statistics 29(3), 413–424.
  • O’Sullivan (1986) O’Sullivan, F. (1986), ‘A statistical perspective on ill-posed inverse problems’, Statistical science pp. 502–518.
  • Pakman & Paninski (2014) Pakman, A. & Paninski, L. (2014), ‘Exact hamiltonian monte carlo for truncated multivariate gaussians’, Journal of Computational and Graphical Statistics 23(2), 518–542.
  • Polson et al. (2013) Polson, N. G., Scott, J. G. & Windle, J. (2013), ‘Bayesian inference for logistic models using pólya–gamma latent variables’, Journal of the American statistical Association 108(504), 1339–1349.
  • Ramsay (1988) Ramsay, J. O. (1988), ‘Monotone regression splines in action’, Statistical science pp. 425–441.
  • Ramsay & Silverman (2005) Ramsay, J. O. & Silverman, B. W. (2005), Fitting differential equations to functional data: Principal differential analysis, Springer.
  • Roberts & Rosenthal (2004) Roberts, G. O. & Rosenthal, J. S. (2004), ‘General state space markov chains and mcmc algorithms’.
  • Rue et al. (2009) Rue, H., Martino, S. & Chopin, N. (2009), ‘Approximate bayesian inference for latent gaussian models by using integrated nested laplace approximations’, Journal of the Royal Statistical Society Series B: Statistical Methodology 71(2), 319–392.
  • Rue et al. (2017) Rue, H., Riebler, A., Sørbye, S. H., Illian, J. B., Simpson, D. P. & Lindgren, F. K. (2017), ‘Bayesian computing with inla: a review’, Annual Review of Statistics and Its Application 4(1), 395–421.
  • Ruppert et al. (2003) Ruppert, D., Wand, M. P. & Carroll, R. J. (2003), Semiparametric regression, number 12, Cambridge university press.
  • Rustand et al. (2024) Rustand, D., Van Niekerk, J., Krainski, E. T., Rue, H. & Proust-Lima, C. (2024), ‘Fast and flexible inference for joint models of multivariate longitudinal and survival data using integrated nested laplace approximations’, Biostatistics 25(2), 429–448.
  • Shu et al. (2021) Shu, D., Young, J. G., Toh, S. & Wang, R. (2021), ‘Variance estimation in inverse probability weighted cox models’, Biometrics 77(3), 1101–1117.
  • Sinha et al. (2003) Sinha, D., Ibrahim, J. G. & Chen, M.-H. (2003), ‘A bayesian justification of cox’s partial likelihood’, Biometrika 90(3), 629–641.
  • Sun et al. (2019) Sun, Q., Zhu, R., Wang, T. & Zeng, D. (2019), ‘Counting process-based dimension reduction methods for censored outcomes’, Biometrika 106(1), 181–196.
  • Sy & Taylor (2000) Sy, J. P. & Taylor, J. M. (2000), ‘Estimation in a cox proportional hazards cure model’, Biometrics 56(1), 227–236.
  • Therneau (2023) Therneau, T. M. (2023), A Package for Survival Analysis in R. R package version 3.5-7.
    https://CRAN.R-project.org/package=survival
  • Therneau et al. (2000) Therneau, T. M., Grambsch, P. M., Therneau, T. M. & Grambsch, P. M. (2000), The cox model, Springer.
  • Tibshirani (1997) Tibshirani, R. (1997), ‘The lasso method for variable selection in the cox model’, Statistics in medicine 16(4), 385–395.
  • Tierney (1994) Tierney, L. (1994), ‘Markov chains for exploring posterior distributions’, the Annals of Statistics pp. 1701–1728.
  • Tipping (2001) Tipping, M. E. (2001), ‘Sparse bayesian learning and the relevance vector machine’, Journal of machine learning research 1(Jun), 211–244.
  • Tsiatis (2006) Tsiatis, A. A. (2006), ‘Semiparametric theory and missing data’.
  • Valeriano et al. (2023) Valeriano, K. A., Galarza, C. E. & Matos, L. A. (2023), ‘Moments and random number generation for the truncated elliptical family of distributions’, Statistics and Computing 33(1), 32.
  • Van der Vaart (2000) Van der Vaart, A. W. (2000), Asymptotic statistics, Vol. 3, Cambridge university press.
  • Wand & Ormerod (2008) Wand, M. P. & Ormerod, J. (2008), ‘On semiparametric regression with o’sullivan penalized splines’, Australian & New Zealand Journal of Statistics 50(2), 179–198.
  • Wang (1985) Wang, J.-L. (1985), ‘Strong consistency of approximate maximum likelihood estimators with applications in nonparametrics’, The Annals of Statistics pp. 932–946.
  • Wang et al. (2022) Wang, T., Ratcliffe, S. J. & Guo, W. (2022), ‘Time-to-event analysis with unknown time origins via longitudinal biomarker registration’, Journal of the American Statistical Association pp. 1–16.
  • Wang & Roy (2018) Wang, X. & Roy, V. (2018), ‘Analysis of the pólya-gamma block gibbs sampler for bayesian logistic linear mixed models’, Statistics & Probability Letters 137, 251–256.
  • Wienke (2010) Wienke, A. (2010), Frailty models in survival analysis, CRC press.
  • Winkelmann (2008) Winkelmann, R. (2008), Econometric analysis of count data, Springer Science & Business Media.
  • Wulfsohn & Tsiatis (1997) Wulfsohn, M. S. & Tsiatis, A. A. (1997), ‘A joint model for survival and longitudinal data measured with error’, Biometrics pp. 330–339.
  • Zens et al. (2023) Zens, G., Frühwirth-Schnatter, S. & Wagner, H. (2023), ‘Ultimate pólya gamma samplers–efficient mcmc for possibly imbalanced binary and categorical data’, Journal of the American Statistical Association pp. 1–12.
  • Zhou et al. (2012) Zhou, M., Li, L., Dunson, D. & Carin, L. (2012), Lognormal and gamma mixed negative binomial regression, in ‘Proceedings of the… International Conference on Machine Learning. International Conference on Machine Learning’, Vol. 2012, NIH Public Access, p. 1343.

Supplement

Simulation Studies: GAM results

Refer to caption
Figure 7: Boxplot of integrated squared errors and integrated coverage rates for sin(x3)\sin(x_{3}). Means are denoted with diamond.

Appendix: Proofs

In order to facilitate the proof Theorem 1 & 2, we first present the following definitions and lemma.

Definition 1.

Truncated gamma density: τa0,b0,τ0TG(a0,b0,τ0)\tau\mid a_{0},b_{0},\tau_{0}\sim\mathrm{TG}\left(a_{0},b_{0},\tau_{0}\right)

π(τa0,b0,τ0)=c2(τ0,a0,b0)1τa01exp(b0τ)𝕀(ττ0)\begin{array}[]{c}\pi\left(\tau\mid a_{0},b_{0},\tau_{0}\right)=c_{2}\left(\tau_{0},a_{0},b_{0}\right)^{-1}\tau^{a_{0}-1}\exp\left(-b_{0}\tau\right)\mathbb{I}\left(\tau\geq\tau_{0}\right)\end{array}

where c2(τ0,a0,b0)=τ0τa01exp(b0τ)𝑑τc_{2}\left(\tau_{0},a_{0},b_{0}\right)=\int_{\tau_{0}}^{\infty}\tau^{a_{0}-1}\exp\left(-b_{0}\tau\right)d\tau.

Definition 2.

Loewner order of matrices 𝐀𝐁\mathbf{A}\succeq\mathbf{B}: if 𝐀𝐁\mathbf{A}-\mathbf{B} is positive semi-definite or 𝐮𝐀𝐮𝐮𝐁𝐮\mathbf{u}^{\top}\mathbf{A}\mathbf{u}\geq\mathbf{u}^{\top}\mathbf{B}\mathbf{u} for all real 𝐮\mathbf{u} and the following relationships hold |𝐀||𝐁||\mathbf{A}|\geq|\mathbf{B}| and 𝐀1𝐁1\mathbf{A}^{-1}\preceq\mathbf{B}^{-1}.

Definition 3.

Hyperbolic cosine through Laplace transform:

+exp[(𝐦i𝜼)2+(𝐦i𝜼)22ωi]π(ωiyi+ϵ,0)𝑑ωi={cosh[(𝐦i𝜼)2+(𝐦i𝜼)22]}(yi+ϵ)\begin{array}[]{c}\int_{\mathbb{R}_{+}}\exp\left[-\frac{\left(\mathbf{m}_{i}^{\top}\bm{\eta}^{\prime}\right)^{2}+\left(\mathbf{m}_{i}^{\top}\bm{\eta}\right)^{2}}{2}\omega_{i}\right]\pi\left(\omega_{i}\mid y_{i}+\epsilon,0\right)d\omega_{i}=\left\{\cosh\left[\frac{\sqrt{\left(\mathbf{m}_{i}^{\top}\bm{\eta}^{\prime}\right)^{2}+\left(\mathbf{m}_{i}^{\top}\bm{\eta}\right)^{2}}}{2}\right]\right\}^{-(y_{i}+\epsilon)}\end{array}
Definition 4.

Hyperbolic cosine inequalities: for a,b,cosh(a+b)2cosh(a)cosh(b)a,b\in\mathbb{R},\cosh(a+b)\leq 2\cosh(a)\cosh(b), cosh(a)1\cosh(a)\geq 1 and cosh(a)cosh(b)\cosh(a)\leq\cosh(b) for |a||b||a|\leq|b|.

Lemma 1.

From Loewner ordering, we have

(𝜼𝐐1𝝁)𝐐(𝜼𝐐1𝝁)𝜼𝐐𝜼2𝜼𝝁~+2𝜼𝐌𝛀𝟏ηϵ+𝝁~𝐀(τ0)1𝝁~+C1𝟏𝛀𝟏+ηϵ𝟏𝛀𝟏ηϵ\begin{array}[]{rl}(\bm{\eta}-\mathbf{Q}^{-1}\bm{\mu})^{\top}\mathbf{Q}(\bm{\eta}-\mathbf{Q}^{-1}\bm{\mu})\leq&\bm{\eta}^{\top}\mathbf{Q}\bm{\eta}-2\bm{\eta}^{\top}\widetilde{\bm{\mu}}+2\bm{\eta}^{\top}\mathbf{M}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}\\ &+\widetilde{\bm{\mu}}^{\top}\mathbf{A}(\tau_{0})^{-1}\widetilde{\bm{\mu}}+C_{1}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}+\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}\end{array}

where C1=2ηϵ𝛍~𝐀(τ0)1𝐌C_{1}=\left\|2\eta_{\epsilon}\widetilde{\bm{\mu}}^{\top}\mathbf{A}(\tau_{0})^{-1}\mathbf{M}^{\top}\right\| and 𝐀(τ0)=𝚺01τ0𝐈M\mathbf{A}(\tau_{0})=\bm{\Sigma}^{-1}_{0}\oplus\tau_{0}\mathbf{I}_{M}.

Proof.

Proof of Lemma 1. Recall 𝝁=𝝁~𝐌𝛀𝟏ηϵ\bm{\mu}=\widetilde{\bm{\mu}}-\mathbf{M}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}. We note that 𝐐=𝐌𝛀𝐌+𝐀(τ)𝐌𝛀𝐌\mathbf{Q}=\mathbf{M}^{\top}\bm{\Omega}\mathbf{M}+\mathbf{A}(\tau_{\mathcal{B}})\succeq\mathbf{M}^{\top}\bm{\Omega}\mathbf{M} and 𝐐𝐀(τ)𝐀(τ0)\mathbf{Q}\succeq\mathbf{A}(\tau_{\mathcal{B}})\succeq\mathbf{A}(\tau_{0}). Given that

(𝜼𝐐1𝝁)𝐐(𝜼𝐐1𝝁)=𝜼𝐐𝜼2𝜼𝝁+𝝁𝐐1𝝁=𝜼𝐐𝜼2𝜼𝝁~+2𝜼𝐌𝛀𝟏ηϵ+𝝁𝐐1𝝁.\begin{array}[]{rl}(\bm{\eta}-\mathbf{Q}^{-1}\bm{\mu})^{\top}\mathbf{Q}(\bm{\eta}-\mathbf{Q}^{-1}\bm{\mu})=&\bm{\eta}^{\top}\mathbf{Q}\bm{\eta}-2\bm{\eta}^{\top}\bm{\mu}+\bm{\mu}^{\top}\mathbf{Q}^{-1}\bm{\mu}\\ =&\bm{\eta}^{\top}\mathbf{Q}\bm{\eta}-2\bm{\eta}^{\top}\widetilde{\bm{\mu}}+2\bm{\eta}^{\top}\mathbf{M}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}+\bm{\mu}^{\top}\mathbf{Q}^{-1}\bm{\mu}.\end{array}

We apply Loewner ordering to the projection matrix

𝐐1(𝐌𝛀𝐌)1𝛀1/2𝐌(𝐌𝛀𝐌)1𝐌𝛀1/2𝐈,\begin{array}[]{c}\mathbf{Q}^{-1}\preceq(\mathbf{M}^{\top}\bm{\Omega}\mathbf{M})^{-1}\\ \bm{\Omega}^{1/2}\mathbf{M}(\mathbf{M}^{\top}\bm{\Omega}\mathbf{M})^{-1}\mathbf{M}^{\top}\bm{\Omega}^{1/2}\preceq\mathbf{I},\end{array}

triangle and Cauchy–Schwarz inequality next,

𝝁𝐐1𝝁=𝝁~𝐐1𝝁~2𝝁~𝐐1𝐌𝛀𝟏ηϵ+ηϵ𝟏𝛀𝐌𝐐1𝐌𝛀𝟏ηϵ𝝁~𝐐1𝝁~2𝝁~𝐐1𝐌𝛀𝟏ηϵ+ηϵ𝟏𝛀𝟏ηϵ𝝁~𝐐1𝝁~+|2ηϵ𝝁~𝐐1𝐌𝛀𝟏|+ηϵ𝟏𝛀𝟏ηϵ𝝁~𝐐1𝝁~+2ηϵ𝝁~𝐐1𝐌𝛀𝟏+ηϵ𝟏𝛀𝟏ηϵ𝝁~𝐀(τ0)1𝝁~+C1𝟏𝛀𝟏+ηϵ𝟏𝛀𝟏ηϵ\begin{array}[]{rl}\bm{\mu}^{\top}\mathbf{Q}^{-1}\bm{\mu}=&\widetilde{\bm{\mu}}^{\top}\mathbf{Q}^{-1}\widetilde{\bm{\mu}}-2\widetilde{\bm{\mu}}^{\top}\mathbf{Q}^{-1}\mathbf{M}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}+\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Omega}\mathbf{M}\mathbf{Q}^{-1}\mathbf{M}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}\\ \leq&\widetilde{\bm{\mu}}^{\top}\mathbf{Q}^{-1}\widetilde{\bm{\mu}}-2\widetilde{\bm{\mu}}^{\top}\mathbf{Q}^{-1}\mathbf{M}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}+\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}\\ \leq&\widetilde{\bm{\mu}}^{\top}\mathbf{Q}^{-1}\widetilde{\bm{\mu}}+\left|2\eta_{\epsilon}\widetilde{\bm{\mu}}^{\top}\mathbf{Q}^{-1}\mathbf{M}^{\top}\bm{\Omega}\mathbf{1}\right|+\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}\\ \leq&\widetilde{\bm{\mu}}^{\top}\mathbf{Q}^{-1}\widetilde{\bm{\mu}}+\left\|2\eta_{\epsilon}\widetilde{\bm{\mu}}^{\top}\mathbf{Q}^{-1}\mathbf{M}^{\top}\right\|\left\|\bm{\Omega}\mathbf{1}\right\|+\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}\\ \leq&\widetilde{\bm{\mu}}^{\top}\mathbf{A}(\tau_{0})^{-1}\widetilde{\bm{\mu}}+C_{1}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}+\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}\end{array}

where 𝛀𝟏=iωi2iωi=𝟏𝛀𝟏\left\|\bm{\Omega}\mathbf{1}\right\|=\sqrt{\sum_{i}\omega_{i}^{2}}\leq\sum_{i}\omega_{i}=\mathbf{1}^{\top}\bm{\Omega}\mathbf{1} and C1=2ηϵ𝝁~𝐀(τ0)1𝐌2ηϵ𝝁~𝐐1𝐌C_{1}=\left\|2\eta_{\epsilon}\widetilde{\bm{\mu}}^{\top}\mathbf{A}(\tau_{0})^{-1}\mathbf{M}^{\top}\right\|\geq\left\|2\eta_{\epsilon}\widetilde{\bm{\mu}}^{\top}\mathbf{Q}^{-1}\mathbf{M}^{\top}\right\|. ∎

Proof.

Proof of Theorem 1. First we use the truncated gamma density to replace the gamma conditional distribution

(τ)TG(a0+M2,b0+𝐮𝐮2,τ0).\begin{array}[]{c}(\tau_{\mathcal{B}}\mid-)\sim\mathrm{TG}\left(a_{0}+\frac{M}{2},b_{0}+\frac{\mathbf{u}^{\top}_{\mathcal{B}}\mathbf{u}_{\mathcal{B}}}{2},\tau_{0}\right).\end{array}

We note that 𝐐=𝐌𝛀𝐌+𝐀(τ)𝐌𝛀𝐌+𝐀(τ0)𝐀(τ0)\mathbf{Q}=\mathbf{M}^{\top}\bm{\Omega}\mathbf{M}+\mathbf{A}(\tau_{\mathcal{B}})\succeq\mathbf{M}^{\top}\bm{\Omega}\mathbf{M}+\mathbf{A}(\tau_{0})\succeq\mathbf{A}(\tau_{0}) and

c1(𝐐1𝝁,𝐐1,𝒞0𝒞v)=𝜼𝒞0𝒞vexp{12(𝜼𝐐1𝝁)𝐐(𝜼𝐐1𝝁)}𝑑𝜼exp{12(𝜼𝐐1𝝁)𝐐(𝜼𝐐1𝝁)}𝑑𝜼=(2π)(P+M)/2|𝐐|1/2c1(𝐐1𝝁,𝐐1,𝒞0𝒞v)1(2π)(P+M)/2|𝐐|1/2\begin{array}[]{rl}c_{1}(\mathbf{Q}^{-1}\bm{\mu},\mathbf{Q}^{-1},\mathcal{C}_{0}\bigcap\mathcal{C}_{v})&=\oint_{\bm{\eta}\in\mathcal{C}_{0}\bigcap\mathcal{C}_{v}}\exp\left\{-\frac{1}{2}(\bm{\eta}-\mathbf{Q}^{-1}\bm{\mu})^{\top}\mathbf{Q}(\bm{\eta}-\mathbf{Q}^{-1}\bm{\mu})\right\}d\bm{\eta}\\ &\leq\int\exp\left\{-\frac{1}{2}(\bm{\eta}-\mathbf{Q}^{-1}\bm{\mu})^{\top}\mathbf{Q}(\bm{\eta}-\mathbf{Q}^{-1}\bm{\mu})\right\}d\bm{\eta}\\ &=(2\pi)^{(P+M)/2}|\mathbf{Q}|^{-1/2}\\ c_{1}(\mathbf{Q}^{-1}\bm{\mu},\mathbf{Q}^{-1},\mathcal{C}_{0}\bigcap\mathcal{C}_{v})^{-1}&\geq(2\pi)^{-(P+M)/2}|\mathbf{Q}|^{1/2}\end{array}

with |𝐐||𝐀(τ0)||\mathbf{Q}|\geq|\mathbf{A}(\tau_{0})|. In addition, 𝕀(𝜼𝒞v𝒞0)=𝕀(𝜼𝒞0)𝕀(𝜼𝒞v)\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v}\bigcap\mathcal{C}_{0})=\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v}). After applying Lemma 1, we have the inequality,

π(𝜼𝝎,τ,𝐯,𝐲)=c1(𝐐1𝝁,𝐐1,𝒞0𝒞v)1𝕀(𝜼𝒞v𝒞0)×exp[12(𝜼𝐐1𝝁)𝐐(𝜼𝐐1𝝁)](2π)P+M2|𝐀(τ0)|1/2𝕀(𝜼𝒞0)𝕀(𝜼𝒞v)×exp[12(𝜼𝐐𝜼+2𝜼𝐌𝛀𝟏ηϵ+ηϵ𝟏𝛀𝟏ηϵ)]×exp[12{2𝜼𝝁~+𝝁~𝐀(τ0)1𝝁~+C1𝟏𝛀𝟏}]=(2π)P+M2|𝐀(τ0)|1/2𝕀(𝜼𝒞0)𝕀(𝜼𝒞v)×exp[12𝜷𝚺01𝜷12τ𝐮𝐮]exp[12i=1Nωi(𝐦i𝜼)2]×exp[12(2𝜼𝐌𝛀𝟏ηϵ+ηϵ𝟏𝛀𝟏ηϵ+C1𝟏𝛀𝟏)]×exp[12{2𝜼𝝁~+𝝁~𝐀(τ0)1𝝁~}].\begin{array}[]{rl}\pi(\bm{\eta}\mid\bm{\omega},\tau_{\mathcal{B}},\mathbf{v},\mathbf{y})=&c_{1}(\mathbf{Q}^{-1}\bm{\mu},\mathbf{Q}^{-1},\mathcal{C}_{0}\bigcap\mathcal{C}_{v})^{-1}\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v}\bigcap\mathcal{C}_{0})\\ &\times\exp\left[-\frac{1}{2}\left(\bm{\eta}-\mathbf{Q}^{-1}\bm{\mu}\right)^{\top}\mathbf{Q}\left(\bm{\eta}-\mathbf{Q}^{-1}\bm{\mu}\right)\right]\\ \geq&(2\pi)^{-\frac{P+M}{2}}\left|\mathbf{A}\left(\tau_{0}\right)\right|^{1/2}\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v})\\ &\times\exp\left[-\frac{1}{2}\left(\bm{\eta}^{\top}\mathbf{Q}\bm{\eta}+2\bm{\eta}^{\top}\mathbf{M}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}+\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}\right)\right]\\ &\times\exp\left[-\frac{1}{2}\left\{-2\bm{\eta}^{\top}\widetilde{\bm{\mu}}+\widetilde{\bm{\mu}}^{\top}\mathbf{A}(\tau_{0})^{-1}\widetilde{\bm{\mu}}+C_{1}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}\right\}\right]\\ =&(2\pi)^{-\frac{P+M}{2}}\left|\mathbf{A}\left(\tau_{0}\right)\right|^{1/2}\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v})\\ &\times\exp\left[-\frac{1}{2}\bm{\beta}^{\top}\bm{\Sigma}^{-1}_{0}\bm{\beta}-\frac{1}{2}\tau_{\mathcal{B}}\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}\right]\exp\left[-\frac{1}{2}\sum_{i=1}^{N}\omega_{i}\left(\mathbf{m}_{i}^{\top}\bm{\eta}\right)^{2}\right]\\ &\times\exp\left[-\frac{1}{2}\left(2\bm{\eta}^{\top}\mathbf{M}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}+\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon}+C_{1}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}\right)\right]\\ &\times\exp\left[-\frac{1}{2}\left\{-2\bm{\eta}^{\top}\widetilde{\bm{\mu}}+\widetilde{\bm{\mu}}^{\top}\mathbf{A}(\tau_{0})^{-1}\widetilde{\bm{\mu}}\right\}\right].\end{array}

Note that we expand the quadratic form of 𝜼𝐐𝜼=𝜼{𝐌𝛀𝐌+𝐀(τ)}𝜼\bm{\eta}^{\top}\mathbf{Q}\bm{\eta}=\bm{\eta}^{\top}\left\{\mathbf{M}^{\top}\bm{\Omega}\mathbf{M}+\mathbf{A}(\tau_{\mathcal{B}})\right\}\bm{\eta}. With some algebra to combine i=1Nωi(𝐦i𝜼)2\sum_{i=1}^{N}\omega_{i}\left(\mathbf{m}_{i}^{\top}\bm{\eta}\right)^{2}, 2𝜼𝐌𝛀𝟏ηϵ2\bm{\eta}^{\top}\mathbf{M}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon} and ηϵ𝟏𝛀𝟏ηϵ\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}\eta_{\epsilon} to get i=1Nωi(𝐦i𝜼+ηϵ)2\sum_{i=1}^{N}\omega_{i}\left(\mathbf{m}_{i}^{\top}\bm{\eta}+{\eta}_{\epsilon}\right)^{2}. We have

π(𝜼𝝎,τ,𝐯,𝐲)(2π)P+M2|𝐀(τ0)|1/2𝕀(𝜼𝒞0)𝕀(𝜼𝒞v)×exp[12𝜷𝚺01𝜷12τ𝐮𝐮]exp[12i=1Nωi(𝐦i𝜼+ηϵ)2]×exp[12{2𝜼𝝁~+𝝁~𝐀(τ0)1𝝁~+C1𝟏𝛀𝟏}]=(2π)P+M2|𝐀(τ0)|1/2𝕀(𝜼𝒞0)𝕀(𝜼𝒞v)×exp[12𝜷𝚺01𝜷12τ𝐮𝐮]exp[12i=1Nωiψi2]×exp[12{2𝜼𝝁~+𝝁~𝐀(τ0)1𝝁~+C1𝟏𝛀𝟏}]\begin{array}[]{rl}\pi(\bm{\eta}\mid\bm{\omega},\tau_{\mathcal{B}},\mathbf{v},\mathbf{y})\geq&(2\pi)^{-\frac{P+M}{2}}\left|\mathbf{A}\left(\tau_{0}\right)\right|^{1/2}\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v})\\ &\times\exp\left[-\frac{1}{2}\bm{\beta}^{\top}\bm{\Sigma}^{-1}_{0}\bm{\beta}-\frac{1}{2}\tau_{\mathcal{B}}\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}\right]\exp\left[-\frac{1}{2}\sum_{i=1}^{N}\omega_{i}\left(\mathbf{m}_{i}^{\top}\bm{\eta}+{\eta}_{\epsilon}\right)^{2}\right]\\ &\times\exp\left[-\frac{1}{2}\left\{-2\bm{\eta}^{\top}\widetilde{\bm{\mu}}+\widetilde{\bm{\mu}}^{\top}\mathbf{A}(\tau_{0})^{-1}\widetilde{\bm{\mu}}+C_{1}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}\right\}\right]\\ =&(2\pi)^{-\frac{P+M}{2}}\left|\mathbf{A}\left(\tau_{0}\right)\right|^{1/2}\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v})\\ &\times\exp\left[-\frac{1}{2}\bm{\beta}^{\top}\bm{\Sigma}^{-1}_{0}\bm{\beta}-\frac{1}{2}\tau_{\mathcal{B}}\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}\right]\exp\left[-\frac{1}{2}\sum_{i=1}^{N}\omega_{i}\psi_{i}^{2}\right]\\ &\times\exp\left[-\frac{1}{2}\left\{-2\bm{\eta}^{\top}\widetilde{\bm{\mu}}+\widetilde{\bm{\mu}}^{\top}\mathbf{A}(\tau_{0})^{-1}\widetilde{\bm{\mu}}+C_{1}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}\right\}\right]\end{array}

with ψi=𝐦i𝜼+ηϵ\psi_{i}=\mathbf{m}_{i}^{\top}\bm{\eta}+\eta_{\epsilon}.

Replacing our gamma Markov chain transitions with truncated gamma π(τa0,b0,τ0)\pi\left(\tau_{\mathcal{B}}\mid a_{0},b_{0},\tau_{0}\right), following Theorem 1 of Wang & Roy 2018, we now bound

g(𝜼𝐯)=++Nπ(𝜼𝝎,τ,𝐯,𝐲)π(𝝎,τ𝜼,𝐲)𝑑𝝎𝑑τ=++Nπ(𝜼𝝎,τ,𝐯,𝐲)π(𝝎𝜼,𝐲)π(τ𝜼,𝐲)𝑑𝝎𝑑τ.\begin{array}[]{rl}g\left(\bm{\eta}\mid\mathbf{v}\right)=&\int_{\mathbb{R}_{+}}\int_{\mathbb{R}_{+}^{N}}\pi(\bm{\eta}\mid\bm{\omega},\tau_{\mathcal{B}},\mathbf{v},\mathbf{y})\pi\left(\bm{\omega},\tau_{\mathcal{B}}\mid\bm{\eta}^{\prime},\mathbf{y}\right)d\bm{\omega}d\tau_{\mathcal{B}}\\ =&\int_{\mathbb{R}_{+}}\int_{\mathbb{R}_{+}^{N}}\pi(\bm{\eta}\mid\bm{\omega},\tau_{\mathcal{B}},\mathbf{v},\mathbf{y})\pi\left(\bm{\omega}\mid\bm{\eta}^{\prime},\mathbf{y}\right)\pi\left(\tau_{\mathcal{B}}\mid\bm{\eta}^{\prime},\mathbf{y}\right)d\bm{\omega}d\tau_{\mathcal{B}}.\end{array}

From Theorem 1 of Wang & Roy 2018, under condition (C1), we have

+π(𝜼𝝎,τ,𝐯,𝐲)π(τ𝜼,𝐲)𝑑τ(2π)P+M2|𝐀(τ0)|1/2𝕀(𝜼𝒞0)𝕀(𝜼𝒞v)×exp[12𝜷𝚺01𝜷]exp[12i=1Nωiψi2]×exp[12{2𝜼𝝁~+𝝁~𝐀(τ0)1𝝁~+C1𝟏𝛀𝟏}]×exp[12τ0𝐮𝐮](b0b0+𝐮𝐮/2)a0+M/2.\begin{array}[]{rl}\int_{\mathbb{R}_{+}}\pi(\bm{\eta}\mid\bm{\omega},\tau_{\mathcal{B}},\mathbf{v},\mathbf{y})\pi\left(\tau_{\mathcal{B}}\mid\bm{\eta}^{\prime},\mathbf{y}\right)d\tau_{\mathcal{B}}\geq&(2\pi)^{-\frac{P+M}{2}}\left|\mathbf{A}\left(\tau_{0}\right)\right|^{1/2}\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v})\\ &\times\exp\left[-\frac{1}{2}\bm{\beta}^{\top}\bm{\Sigma}^{-1}_{0}\bm{\beta}\right]\exp\left[-\frac{1}{2}\sum_{i=1}^{N}\omega_{i}\psi_{i}^{2}\right]\\ &\times\exp\left[-\frac{1}{2}\left\{-2\bm{\eta}^{\top}\widetilde{\bm{\mu}}+\widetilde{\bm{\mu}}^{\top}\mathbf{A}(\tau_{0})^{-1}\widetilde{\bm{\mu}}+C_{1}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}\right\}\right]\\ &\times\exp\left[-\frac{1}{2}\tau_{0}\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}\right]\left(\frac{b_{0}}{b_{0}+\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}/2}\right)^{a_{0}+M/2}.\end{array}

The next step of the proof follows the spirit of Wang & Roy 2018, but it is less straight forward so we write these steps in detail. Note that the conditional distribution, π(𝝎𝜼,𝐲)\pi\left(\bm{\omega}\mid\bm{\eta}^{\prime},\mathbf{y}\right) is given as ωi𝜼,𝐲PG(yi+ϵ,ψi)\omega_{i}\mid\bm{\eta}^{\prime},\mathbf{y}\sim\mathrm{PG}\left(y_{i}+\epsilon,\psi_{i}^{\prime}\right) with ψi=𝐦i𝜼+ηϵ\psi^{\prime}_{i}=\mathbf{m}_{i}^{\top}\bm{\eta}^{\prime}+\eta_{\epsilon} and C1𝟏𝛀𝟏=i=1NC1ωiC_{1}\mathbf{1}^{\top}\bm{\Omega}\mathbf{1}=\sum_{i=1}^{N}C_{1}\omega_{i}. After multiplying by π(𝝎𝜼,𝐲)\pi\left(\bm{\omega}\mid\bm{\eta}^{\prime},\mathbf{y}\right), using Definition 3 & 4, we carry out the integral

+exp(12ωiψi212ωiC1)π(ωiyi+ϵ,ψi)𝑑ωi=+exp(12ωiψi212ωiC1)×coshyi+ϵ(|ψi/2|)exp[12ωi(ψi)2]×π(ωiyi+ϵ,0)dωi={cosh[(ψi)2+(ψi)2+C12]}(yi+ϵ)×coshyi+ϵ(|ψi/2|){cosh[|ψi|+|ψi|+C12]}(yi+ϵ)×coshyi+ϵ(|ψi/2|){4cosh(|ψi|/2)cosh(|ψi|/2)×cosh(C1/2)}(yi+ϵ)coshyi+ϵ(|ψi/2|).\begin{array}[]{rl}\int_{\mathbb{R}_{+}}\exp\left(-\frac{1}{2}\omega_{i}\psi_{i}^{2}-\frac{1}{2}\omega_{i}C_{1}\right)\pi(\omega_{i}\mid y_{i}+\epsilon,\psi_{i})d\omega_{i}=&\int_{\mathbb{R}_{+}}\exp\left(-\frac{1}{2}\omega_{i}\psi_{i}^{2}-\frac{1}{2}\omega_{i}C_{1}\right)\\ &\times\cosh^{y_{i}+\epsilon}(|\psi^{\prime}_{i}/2|)\exp\left[-\frac{1}{2}\omega_{i}(\psi_{i}^{\prime})^{2}\right]\\ &\times\pi(\omega_{i}\mid y_{i}+\epsilon,0)d\omega_{i}\\ =&\left\{\cosh\left[\frac{\sqrt{\left(\psi_{i}^{\prime}\right)^{2}+\left(\psi_{i}\right)^{2}+C_{1}}}{2}\right]\right\}^{-(y_{i}+\epsilon)}\\ &\times\cosh^{y_{i}+\epsilon}(|\psi^{\prime}_{i}/2|)\\ \geq&\left\{\cosh\left[\frac{\left|\psi_{i}^{\prime}\right|+\left|\psi_{i}\right|+\sqrt{C_{1}}}{2}\right]\right\}^{-(y_{i}+\epsilon)}\\ &\times\cosh^{y_{i}+\epsilon}(|\psi^{\prime}_{i}/2|)\\ \geq&\Bigg{\{}4\cosh\left(\left|\psi_{i}^{\prime}\right|/2\right)\cosh\left(\left|\psi_{i}\right|/2\right)\\ &\times\cosh\left(\sqrt{C_{1}}/2\right)\Bigg{\}}^{-(y_{i}+\epsilon)}\cosh^{y_{i}+\epsilon}(|\psi^{\prime}_{i}/2|).\end{array}

We have

+exp(12ωiψi212C1ωi)π(ωiyi+ϵ,ψi)𝑑ωi{4cosh(|ψi|/2)cosh(C1/2)}(yi+ϵ).\begin{array}[]{c}\int_{\mathbb{R}_{+}}\exp\left(-\frac{1}{2}\omega_{i}\psi_{i}^{2}-\frac{1}{2}C_{1}\omega_{i}\right)\pi(\omega_{i}\mid y_{i}+\epsilon,\psi_{i})d\omega_{i}\geq\Bigg{\{}4\cosh\left(\left|\psi_{i}\right|/2\right)\cosh\left(\sqrt{C_{1}}/2\right)\Bigg{\}}^{-(y_{i}+\epsilon)}.\end{array}

Note that

cosh(|ψi|/2)[exp(|ψi|2)](yi+ϵ)[exp(ψi2+14)](yi+ϵ)=e(yi+ϵ)/4exp[yi+ϵ4ψi2]\begin{array}[]{rl}\cosh\left(\left|\psi_{i}\right|/2\right)\geq\left[\exp\left(\frac{\left|\psi_{i}\right|}{2}\right)\right]^{-(y_{i}+\epsilon)}\geq\left[\exp\left(\frac{\psi_{i}^{2}+1}{4}\right)\right]^{-(y_{i}+\epsilon)}=e^{-(y_{i}+\epsilon)/4}\exp\left[-\frac{y_{i}+\epsilon}{4}\psi_{i}^{2}\right]\end{array}

with quadratic expansion i=1N(yi+ϵ)ψi2=[𝜼,ηϵ][𝐌,𝟏]𝚲[𝐌,𝟏][𝜼,ηϵ]\sum_{i=1}^{N}(y_{i}+\epsilon)\psi_{i}^{2}=\left[\bm{\eta},\eta_{\epsilon}\right]^{\top}\left[\mathbf{M},\mathbf{1}\right]^{\top}\bm{\Lambda}\left[\mathbf{M},\mathbf{1}\right]\left[\bm{\eta},\eta_{\epsilon}\right] where 𝚲\bm{\Lambda} is the N×NN\times N diagonal matrix with iith diagonal element yi+ϵy_{i}+\epsilon. We can simplify the expansion as i=1N(yi+ϵ)ψi2=𝜼𝐌𝚲𝐌𝜼+2𝜼𝐌𝚲𝟏ηϵ+ηϵ𝟏𝚲𝟏ηϵ\sum_{i=1}^{N}(y_{i}+\epsilon)\psi_{i}^{2}=\bm{\eta}^{\top}\mathbf{M}^{\top}\bm{\Lambda}\mathbf{M}\bm{\eta}+2\bm{\eta}^{\top}\mathbf{M}^{\top}\bm{\Lambda}\mathbf{1}\eta_{\epsilon}+\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Lambda}\mathbf{1}\eta_{\epsilon}.

g(𝜼𝐯)=++Nπ(𝜼𝝎,τ,𝐯,𝐲)π(𝝎,τ𝜼,𝐲)𝑑𝝎𝑑τ(2π)P+M2|𝐀(τ0)|1/2[4cosh(C1/2)]NϵeNϵ4𝕀(𝜼𝒞0)𝕀(𝜼𝒞v)×exp[12𝜷𝚺01𝜷+𝜼𝝁~12𝝁~𝐀(τ0)1𝝁~]exp(τ0𝐮𝐮/2)×exp[14𝜼𝐌𝚲𝐌𝜼12𝜼𝐌𝚲𝟏ηϵ14ηϵ𝟏𝚲𝟏ηϵ](b0b0+𝐮𝐮/2)a0+M/2\begin{array}[]{rl}g\left(\bm{\eta}\mid\mathbf{v}\right)=&\int_{\mathbb{R}_{+}}\int_{\mathbb{R}_{+}^{N}}\pi(\bm{\eta}\mid\bm{\omega},\tau_{\mathcal{B}},\mathbf{v},\mathbf{y})\pi\left(\bm{\omega},\tau_{\mathcal{B}}\mid\bm{\eta}^{\prime},\mathbf{y}\right)d\bm{\omega}d\tau_{\mathcal{B}}\\ \geq&(2\pi)^{-\frac{P+M}{2}}\left|\mathbf{A}\left(\tau_{0}\right)\right|^{1/2}\left[4\cosh(\sqrt{C_{1}}/2)\right]^{-N_{\epsilon}}e^{-\frac{N_{\epsilon}}{4}}\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v})\\ &\times\exp\left[-\frac{1}{2}\bm{\beta}^{\top}\bm{\Sigma}^{-1}_{0}\bm{\beta}+\bm{\eta}^{\top}\widetilde{\bm{\mu}}-\frac{1}{2}\widetilde{\bm{\mu}}^{\top}\mathbf{A}\left(\tau_{0}\right)^{-1}\widetilde{\bm{\mu}}\right]\exp\left(-\tau_{0}\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}/2\right)\\ &\times\exp\left[-\frac{1}{4}\bm{\eta}^{\top}\mathbf{M}^{\top}\bm{\Lambda}\mathbf{M}\bm{\eta}-\frac{1}{2}\bm{\eta}^{\top}\mathbf{M}^{\top}\bm{\Lambda}\mathbf{1}\eta_{\epsilon}-\frac{1}{4}\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Lambda}\mathbf{1}\eta_{\epsilon}\right]\left(\frac{b_{0}}{b_{0}+\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}/2}\right)^{a_{0}+M/2}\end{array}

where Nϵ=i=1Nyi+ϵN_{\epsilon}=\sum_{i=1}^{N}y_{i}+\epsilon. Let 𝝁¯=𝐌𝚲𝟏ηϵ/2\overline{\bm{\mu}}=-\mathbf{M}^{\top}\bm{\Lambda}\mathbf{1}\eta_{\epsilon}/2, 𝐇=𝐌𝚲𝐌/2\mathbf{H}=\mathbf{M}^{\top}\bm{\Lambda}\mathbf{M}/2 and we get

exp[14𝜼𝐌𝚲𝐌𝜼12𝜼𝐌𝚲𝟏ηϵ]=exp[12(𝜼𝐇1𝝁¯)𝐇(𝜼𝐇1𝝁¯)]×exp(12𝝁¯𝐇1𝝁¯).\begin{array}[]{rl}\exp\left[-\frac{1}{4}\bm{\eta}^{\top}\mathbf{M}^{\top}\bm{\Lambda}\mathbf{M}\bm{\eta}-\frac{1}{2}\bm{\eta}^{\top}\mathbf{M}^{\top}\bm{\Lambda}\mathbf{1}\eta_{\epsilon}\right]=&\exp\left[-\frac{1}{2}\left(\bm{\eta}-\mathbf{H}^{-1}\overline{\bm{\mu}}\right)^{\top}\mathbf{H}\left(\bm{\eta}-\mathbf{H}^{-1}\overline{\bm{\mu}}\right)\right]\\ &\times\exp\left(\frac{1}{2}\overline{\bm{\mu}}^{\top}\mathbf{H}^{-1}\overline{\bm{\mu}}\right).\end{array}

Note that 𝝁¯𝐇1𝝁¯ηϵ𝟏𝚲𝟏ηϵ/2\overline{\bm{\mu}}^{\top}\mathbf{H}^{-1}\overline{\bm{\mu}}\leq\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Lambda}\mathbf{1}\eta_{\epsilon}/2 by Loewner ordering (Definition 2). Let 𝝁Λ=𝝁~+𝝁¯\bm{\mu}_{\Lambda}=\widetilde{\bm{\mu}}+\overline{\bm{\mu}} and

R1=[4cosh(C1/2)]NϵeNϵ4exp(12𝝁~𝐀(τ0)1𝝁~14ηϵ𝟏𝚲𝟏ηϵ)<1\begin{array}[]{c}R_{1}=\left[4\cosh(\sqrt{C_{1}}/2)\right]^{-N_{\epsilon}}e^{-\frac{N_{\epsilon}}{4}}\exp\left(-\frac{1}{2}\widetilde{\bm{\mu}}^{\top}\mathbf{A}\left(\tau_{0}\right)^{-1}\widetilde{\bm{\mu}}-\frac{1}{4}\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Lambda}\mathbf{1}\eta_{\epsilon}\right)<1\end{array}

(see Definition 4 for cosh\cosh relationship). We can write the bounds of g(𝜼𝐯)g\left(\bm{\eta}\mid\mathbf{v}\right) as a Gaussian kernel,

g(𝜼𝐯)(2π)P+M2|𝐀(τ0)|1/2[4cosh(C1/2)]NϵeNϵ4(b0b0+𝐮𝐮/2)a0+M/2×exp[12{𝜼𝐀(τ0)1𝝁~}𝐀(τ0){𝜼𝐀(τ0)1𝝁~}]×exp[12(𝜼𝐇1𝝁¯)𝐇(𝜼𝐇1𝝁¯)]×exp(12𝝁¯𝐇1𝝁¯14ηϵ𝟏𝚲𝟏ηϵ)𝕀(𝜼𝒞0)𝕀(𝜼𝒞v)=(2π)P+M2|𝐀(τ0)|1/2[4cosh(C1/2)]NϵeNϵ4(b0b0+𝐮𝐮/2)a0+M/2×exp(12{𝜼𝐒1𝝁Λ}𝐒{𝜼𝐒1𝝁Λ})×exp[12{𝝁Λ𝐒1𝝁Λ𝝁~𝐀(τ0)1𝝁~𝝁¯𝐇1𝝁¯}]×exp(12𝝁¯𝐇1𝝁¯14ηϵ𝟏𝚲𝟏ηϵ)𝕀(𝜼𝒞0)𝕀(𝜼𝒞v)(2π)P+M2|𝐀(τ0)|1/2[4cosh(C1/2)]NϵeNϵ4(b0b0+𝐮𝐮/2)a0+M/2×exp(12{𝜼𝐒1𝝁Λ}𝐒{𝜼𝐒1𝝁Λ})×exp[12𝝁~𝐀(τ0)1𝝁~14ηϵ𝟏𝚲𝟏ηϵ]𝕀(𝜼𝒞0)𝕀(𝜼𝒞v)=(2π)P+M2|𝐀(τ0)|1/2R1(b0b0+𝐮𝐮/2)a0+M/2𝕀(𝜼𝒞0)𝕀(𝜼𝒞v)×exp(12{𝜼𝐒1𝝁Λ}𝐒{𝜼𝐒1𝝁Λ})\begin{array}[]{rl}g\left(\bm{\eta}\mid\mathbf{v}\right)\geq&(2\pi)^{-\frac{P+M}{2}}\left|\mathbf{A}\left(\tau_{0}\right)\right|^{1/2}\left[4\cosh(\sqrt{C_{1}}/2)\right]^{-N_{\epsilon}}e^{-\frac{N_{\epsilon}}{4}}\left(\frac{b_{0}}{b_{0}+\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}/2}\right)^{a_{0}+M/2}\\ &\times\exp\left[-\frac{1}{2}\left\{\bm{\eta}-\mathbf{A}(\tau_{0})^{-1}\widetilde{\bm{\mu}}\right\}^{\top}\mathbf{A}(\tau_{0})\left\{\bm{\eta}-\mathbf{A}(\tau_{0})^{-1}\widetilde{\bm{\mu}}\right\}\right]\\ &\times\exp\left[-\frac{1}{2}\left(\bm{\eta}-\mathbf{H}^{-1}\overline{\bm{\mu}}\right)^{\top}\mathbf{H}\left(\bm{\eta}-\mathbf{H}^{-1}\overline{\bm{\mu}}\right)\right]\\ &\times\exp\left(\frac{1}{2}\overline{\bm{\mu}}^{\top}\mathbf{H}^{-1}\overline{\bm{\mu}}-\frac{1}{4}\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Lambda}\mathbf{1}\eta_{\epsilon}\right)\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v})\\ =&(2\pi)^{-\frac{P+M}{2}}\left|\mathbf{A}\left(\tau_{0}\right)\right|^{1/2}\left[4\cosh(\sqrt{C_{1}}/2)\right]^{-N_{\epsilon}}e^{-\frac{N_{\epsilon}}{4}}\left(\frac{b_{0}}{b_{0}+\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}/2}\right)^{a_{0}+M/2}\\ &\times\exp\left(-\frac{1}{2}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}^{\top}\mathbf{S}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}\right)\\ &\times\exp\left[\frac{1}{2}\left\{\bm{\mu}_{\Lambda}^{\top}\mathbf{S}^{-1}\bm{\mu}_{\Lambda}-\widetilde{\bm{\mu}}^{\top}\mathbf{A}\left(\tau_{0}\right)^{-1}\widetilde{\bm{\mu}}-\overline{\bm{\mu}}^{\top}\mathbf{H}^{-1}\overline{\bm{\mu}}\right\}\right]\\ &\times\exp\left(\frac{1}{2}\overline{\bm{\mu}}^{\top}\mathbf{H}^{-1}\overline{\bm{\mu}}-\frac{1}{4}\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Lambda}\mathbf{1}\eta_{\epsilon}\right)\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v})\\ \geq&(2\pi)^{-\frac{P+M}{2}}\left|\mathbf{A}\left(\tau_{0}\right)\right|^{1/2}\left[4\cosh(\sqrt{C_{1}}/2)\right]^{-N_{\epsilon}}e^{-\frac{N_{\epsilon}}{4}}\left(\frac{b_{0}}{b_{0}+\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}/2}\right)^{a_{0}+M/2}\\ &\times\exp\left(-\frac{1}{2}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}^{\top}\mathbf{S}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}\right)\\ &\times\exp\left[-\frac{1}{2}\widetilde{\bm{\mu}}^{\top}\mathbf{A}\left(\tau_{0}\right)^{-1}\widetilde{\bm{\mu}}-\frac{1}{4}\eta_{\epsilon}\mathbf{1}^{\top}\bm{\Lambda}\mathbf{1}\eta_{\epsilon}\right]\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v})\\ =&(2\pi)^{-\frac{P+M}{2}}\left|\mathbf{A}\left(\tau_{0}\right)\right|^{1/2}R_{1}\left(\frac{b_{0}}{b_{0}+\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}/2}\right)^{a_{0}+M/2}\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v})\\ &\times\exp\left(-\frac{1}{2}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}^{\top}\mathbf{S}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}\right)\end{array}

where 𝐒=𝐌𝚲𝐌/2+𝐀(τ0)=𝐇+𝐀(τ0)𝐀(τ0)\mathbf{S}=\mathbf{M}^{\top}\bm{\Lambda}\mathbf{M}/2+\mathbf{A}(\tau_{0})=\mathbf{H}+\mathbf{A}(\tau_{0})\succeq\mathbf{A}\left(\tau_{0}\right).

Note that 𝕀(𝜼𝒞v)=j=1J𝕀(vjuα,j)\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v})=\prod_{j=1}^{J}\mathbb{I}(v_{j}\leq{u}_{\alpha,j}) and using (C2) with Theorem 7 from Mira & Tierney 2002, we get

k(𝜼𝜼)=+J++Nπ(𝜼𝝎,τ,𝐯,𝐲)π(𝝎,τ,𝐯𝜼,𝐲)𝑑𝝎𝑑τ𝑑𝐯(2π)P+M2|𝐀(τ0)|1/2R1(b0b0+𝐮𝐮/2)a0+M/2𝕀(𝜼𝒞0)×exp(12{𝜼𝐒1𝝁Λ}𝐒{𝜼𝐒1𝝁Λ})×+J𝕀(𝜼𝒞v)π(𝐯𝜼)d𝐯\begin{array}[]{rl}k\left(\bm{\eta}\mid\bm{\eta}^{\prime}\right)=&\int_{\mathbb{R}_{+}^{J}}\int_{\mathbb{R}_{+}}\int_{\mathbb{R}_{+}^{N}}\pi(\bm{\eta}\mid\bm{\omega},\tau_{\mathcal{B}},\mathbf{v},\mathbf{y})\pi\left(\bm{\omega},\tau_{\mathcal{B}},\mathbf{v}\mid\bm{\eta}^{\prime},\mathbf{y}\right)d\bm{\omega}d\tau_{\mathcal{B}}d\mathbf{v}\\ \geq&(2\pi)^{-\frac{P+M}{2}}\left|\mathbf{A}\left(\tau_{0}\right)\right|^{1/2}R_{1}\left(\frac{b_{0}}{b_{0}+\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}/2}\right)^{a_{0}+M/2}\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\\ &\times\exp\left(-\frac{1}{2}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}^{\top}\mathbf{S}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}\right)\\ &\times\int_{\mathbb{R}_{+}^{J}}\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v})\pi(\mathbf{v}\mid\bm{\eta}^{\prime})d\mathbf{v}\end{array}

where

+J𝕀(𝜼𝒞v)π(𝐯𝜼)𝑑𝐯=+Jj=1Jnα,jvjnα,j1(uα,j)nα,j𝕀(vjuα,j)𝕀(vjuα,j)d𝐯=+Jj=1Jnα,jvjnα,j1(uα,j)nα,j𝕀(vjuα,juα,j)d𝐯=j=1J0(uα,juα,j)nα,jvjnα,j1(uα,j)nα,j𝑑vj=j=1J{(uα,juα,j)uα,j}nα,jj=1J(uα,jsupuα,j)nα,j=j=1J(uα,juα,j+)nα,j\begin{array}[]{rl}\int_{\mathbb{R}_{+}^{J}}\mathbb{I}(\bm{\eta}\in\mathcal{C}_{v})\pi(\mathbf{v}\mid\bm{\eta}^{\prime})d\mathbf{v}=&\int_{\mathbb{R}_{+}^{J}}\prod_{j=1}^{J}\frac{n_{\alpha,j}v_{j}^{n_{\alpha,j}-1}}{\left(u^{\prime}_{\alpha,j}\right)^{n_{\alpha,j}}}\mathbb{I}(v_{j}\leq u_{\alpha,j}^{\prime})\mathbb{I}(v_{j}\leq u_{\alpha,j})d\mathbf{v}\\ =&\int_{\mathbb{R}_{+}^{J}}\prod_{j=1}^{J}\frac{n_{\alpha,j}v_{j}^{n_{\alpha,j}-1}}{\left(u^{\prime}_{\alpha,j}\right)^{n_{\alpha,j}}}\mathbb{I}(v_{j}\leq u_{\alpha,j}^{\prime}\wedge u_{\alpha,j})d\mathbf{v}\\ =&\prod_{j=1}^{J}\int_{0}^{(u_{\alpha,j}^{\prime}\wedge u_{\alpha,j})}\frac{n_{\alpha,j}v_{j}^{n_{\alpha,j}-1}}{\left(u^{\prime}_{\alpha,j}\right)^{n_{\alpha,j}}}d{v_{j}}\\ =&\prod_{j=1}^{J}\left\{\frac{(u_{\alpha,j}^{\prime}\wedge u_{\alpha,j})}{u_{\alpha,j}^{\prime}}\right\}^{n_{\alpha,j}}\\ \geq&\prod_{j=1}^{J}\left(\frac{u_{\alpha,j}}{\sup u_{\alpha,j}}\right)^{n_{\alpha,j}}\\ =&\prod_{j=1}^{J}\left(\frac{u_{\alpha,j}}{u_{\alpha,j}^{+}}\right)^{n_{\alpha,j}}\end{array}

where \wedge denotes minimum. We may also use the Dt𝐳α(ti)𝐮αD_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}_{\alpha} notation in our proof instead. For our Mtd, we have

k(𝜼𝜼)(2π)P+M2|𝐀(τ0)|1/2R1(b0b0+𝐮𝐮/2)a0+M/2𝕀(𝜼𝒞0){j=1J(uα,juα,j+)nα,j}×exp(12{𝜼𝐒1𝝁Λ}𝐒{𝜼𝐒1𝝁Λ})=δh(𝜼)\begin{array}[]{rl}k\left(\bm{\eta}\mid\bm{\eta}^{\prime}\right)\geq&(2\pi)^{-\frac{P+M}{2}}\left|\mathbf{A}\left(\tau_{0}\right)\right|^{1/2}R_{1}\left(\frac{b_{0}}{b_{0}+\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}/2}\right)^{a_{0}+M/2}\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\left\{\prod_{j=1}^{J}\left(\frac{u_{\alpha,j}}{u_{\alpha,j}^{+}}\right)^{n_{\alpha,j}}\right\}\\ &\times\exp\left(-\frac{1}{2}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}^{\top}\mathbf{S}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}\right)\\ =&\delta h(\bm{\eta})\end{array}

with

δ=c3(𝐌,𝐲)(2π)P+M2|𝐀(τ0)|1/2R1h(𝜼)=1c3(𝐌,𝐲)(b0b0+𝐮𝐮/2)a0+M/2𝕀(𝜼𝒞0){j=1J(uα,juα,j+)nα,j}×exp(12{𝜼𝐒1𝝁Λ}𝐒{𝜼𝐒1𝝁Λ})c3(𝐌,𝐲)=P+M[(b0b0+𝐮𝐮/2)a0+M/2𝕀(𝜼𝒞0){j=1J(uα,juα,j+)nα,j}×exp(12{𝜼𝐒1𝝁Λ}𝐒{𝜼𝐒1𝝁Λ})]d𝜼P+Mexp(12{𝜼𝐒1𝝁Λ}𝐒{𝜼𝐒1𝝁Λ})𝑑𝜼=(2π)P+M2|𝐒|1/2.\begin{array}[]{rl}\delta=&{c_{3}(\mathbf{M},\mathbf{y})}(2\pi)^{-\frac{P+M}{2}}\left|\mathbf{A}\left(\tau_{0}\right)\right|^{1/2}R_{1}\\ h(\bm{\eta})=&\frac{1}{c_{3}(\mathbf{M},\mathbf{y})}\left(\frac{b_{0}}{b_{0}+\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}/2}\right)^{a_{0}+M/2}\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\left\{\prod_{j=1}^{J}\left(\frac{u_{\alpha,j}}{u_{\alpha,j}^{+}}\right)^{n_{\alpha,j}}\right\}\\ &\times\exp\left(-\frac{1}{2}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}^{\top}\mathbf{S}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}\right)\\ c_{3}(\mathbf{M},\mathbf{y})=&\int_{\mathbb{R}^{P+M}}\Bigg{[}\left(\frac{b_{0}}{b_{0}+\mathbf{u}_{\mathcal{B}}^{\top}\mathbf{u}_{\mathcal{B}}/2}\right)^{a_{0}+M/2}\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\left\{\prod_{j=1}^{J}\left(\frac{u_{\alpha,j}}{u_{\alpha,j}^{+}}\right)^{n_{\alpha,j}}\right\}\\ &\times\exp\left(-\frac{1}{2}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}^{\top}\mathbf{S}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}\right)\Bigg{]}d\bm{\eta}\\ \leq&\int_{\mathbb{R}^{P+M}}\exp\left(-\frac{1}{2}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}^{\top}\mathbf{S}\left\{\bm{\eta}-\mathbf{S}^{-1}\bm{\mu}_{\Lambda}\right\}\right)d\bm{\eta}\\ =&(2\pi)^{\frac{P+M}{2}}|\mathbf{S}|^{-1/2}.\end{array}

Note that |𝐒|1/2|𝐀(τ0)|1/21|\mathbf{S}|^{-1/2}|\mathbf{A}\left(\tau_{0}\right)|^{1/2}\leq 1 and δ<1\delta<1.

Proof.

Proof of Theorem 2. First, we define normalizing constant c(𝐲)c(\mathbf{y}) and

c(𝐲)=+P+Mπ(τa0,b0,τ0)π(𝜷𝝁0,𝚺0,𝒞0)π(𝐮𝟎,τ1𝐈M)L(𝜼𝐲,𝐌)𝑑𝜼𝑑τπ(𝜼,τ,𝝎𝐲)=c(𝐲)1π(𝝎𝜼)L(𝜼𝐲,𝐌)π(𝜷𝝁0,𝚺0,𝒞0)π(𝐮𝟎,τ1𝐈M)π(τa0,b0,τ0)π(𝜼𝐲)=++Nπ(𝜼,τ,𝝎𝐲)𝑑𝝎𝑑τ.\begin{array}[]{c}c(\mathbf{y})=\int_{\mathbb{R}_{+}}\int_{\mathbb{R}^{P+M}}\pi\left(\tau_{\mathcal{B}}\mid a_{0},b_{0},\tau_{0}\right)\pi(\bm{\beta}\mid\bm{\mu}_{0},\bm{\Sigma}_{0},\mathcal{C}_{0})\pi(\mathbf{u}_{\mathcal{B}}\mid\mathbf{0},\tau_{\mathcal{B}}^{-1}\mathbf{I}_{M})L(\bm{\eta}\mid\mathbf{y},\mathbf{M})d\bm{\eta}d\tau_{\mathcal{B}}\\ \pi(\bm{\eta},\tau_{\mathcal{B}},\bm{\omega}\mid\mathbf{y})=c(\mathbf{y})^{-1}{\pi(\bm{\omega}\mid\bm{\eta})L(\bm{\eta}\mid\mathbf{y},\mathbf{M})\pi(\bm{\beta}\mid\bm{\mu}_{0},\bm{\Sigma}_{0},\mathcal{C}_{0})\pi(\mathbf{u}_{\mathcal{B}}\mid\mathbf{0},\tau_{\mathcal{B}}^{-1}\mathbf{I}_{M})\pi\left(\tau_{\mathcal{B}}\mid a_{0},b_{0},\tau_{0}\right)}\\ \pi(\bm{\eta}\mid\mathbf{y})=\int_{\mathbb{R}_{+}}\int_{\mathbb{R}_{+}^{N}}\pi(\bm{\eta},\tau_{\mathcal{B}},\bm{\omega}\mid\mathbf{y})d\bm{\omega}d\tau_{\mathcal{B}}.\end{array}

We can integrate and bound the the slice variable at 𝐮α+\mathbf{u}_{\alpha}^{+}. We can rewrite with kernels and bound the indicator function 𝕀(𝜼𝒞0)1\mathbb{I}(\bm{\eta}\in\mathcal{C}_{0})\leq 1,

π(𝜼,τ,𝝎𝐲)c(𝐲)1i=1N{Dt𝐳α(ti)𝐮α+}yi×i=1Neκiψieωiψi2/2π(ωiyi+ϵ,0)×exp{12(𝜷𝝁0)𝚺01(𝜷𝝁0)}×exp(τ2𝐮α𝐮α)τM/2τa01exp(b0τ)𝕀(ττ0)=c(𝐲)1i=1N{Dt𝐳α(ti)𝐮α+}yi×i=1Neκiηϵeκiφieωiψi2/2π(ωiyi+ϵ,0)×exp{12(𝜼𝐛)𝐀(τ)(𝜼𝐛)}×τM/2+a01exp(b0τ)𝕀(ττ0)c(𝐲)1i=1N{Dt𝐳α(ti)𝐮α+}yi×i=1Neκiηϵeκiφiπ(ωiyi+ϵ,0)×exp{12(𝜼𝐛)𝐀(τ)(𝜼𝐛)}×τM/2+a01exp(b0τ)𝕀(ττ0)=c(𝐲)1i=1N{Dt𝐳α(ti)𝐮α+}yi×i=1Neκiηϵπ(ωiyi+ϵ,0)×exp(𝜿𝐌𝜼)exp{12(𝜼𝐛)𝐀(τ)(𝜼𝐛)}×τM/2+a01exp(b0τ)𝕀(ττ0).\begin{array}[]{rl}\pi(\bm{\eta},\tau_{\mathcal{B}},\bm{\omega}\mid\mathbf{y})\leq&c(\mathbf{y})^{-1}\prod_{i=1}^{N}\left\{D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}^{+}_{\alpha}\right\}^{y_{i}}\\ &\times\prod_{i=1}^{N}e^{\kappa_{i}\psi_{i}}e^{-\omega_{i}\psi_{i}^{2}/2}\pi\left(\omega_{i}\mid y_{i}+\epsilon,0\right)\\ &\times\exp\left\{-\frac{1}{2}(\bm{\beta}-\bm{\mu}_{0})^{\top}\bm{\Sigma}_{0}^{-1}(\bm{\beta}-\bm{\mu}_{0})\right\}\\ &\times\exp\left(-\frac{\tau_{\mathcal{B}}}{2}\mathbf{u}_{\alpha}^{\top}\mathbf{u}_{\alpha}\right)\tau_{\mathcal{B}}^{M/2}\tau_{\mathcal{B}}^{a_{0}-1}\exp\left(-b_{0}\tau_{\mathcal{B}}\right)\mathbb{I}\left(\tau_{\mathcal{B}}\geq\tau_{0}\right)\\ =&c(\mathbf{y})^{-1}\prod_{i=1}^{N}\left\{D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}^{+}_{\alpha}\right\}^{y_{i}}\\ &\times\prod_{i=1}^{N}e^{\kappa_{i}\eta_{\epsilon}}e^{\kappa_{i}\varphi_{i}}e^{-\omega_{i}\psi_{i}^{2}/2}\pi\left(\omega_{i}\mid y_{i}+\epsilon,0\right)\\ &\times\exp\left\{-\frac{1}{2}(\bm{\eta}-\mathbf{b})^{\top}\mathbf{A}(\tau_{\mathcal{B}})(\bm{\eta}-\mathbf{b})\right\}\\ &\times\tau_{\mathcal{B}}^{M/2+a_{0}-1}\exp\left(-b_{0}\tau_{\mathcal{B}}\right)\mathbb{I}\left(\tau_{\mathcal{B}}\geq\tau_{0}\right)\\ \leq&c(\mathbf{y})^{-1}\prod_{i=1}^{N}\left\{D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}^{+}_{\alpha}\right\}^{y_{i}}\\ &\times\prod_{i=1}^{N}e^{\kappa_{i}\eta_{\epsilon}}e^{\kappa_{i}\varphi_{i}}\pi\left(\omega_{i}\mid y_{i}+\epsilon,0\right)\\ &\times\exp\left\{-\frac{1}{2}(\bm{\eta}-\mathbf{b})^{\top}\mathbf{A}(\tau_{\mathcal{B}})(\bm{\eta}-\mathbf{b})\right\}\\ &\times\tau_{\mathcal{B}}^{M/2+a_{0}-1}\exp\left(-b_{0}\tau_{\mathcal{B}}\right)\mathbb{I}\left(\tau_{\mathcal{B}}\geq\tau_{0}\right)\\ =&c(\mathbf{y})^{-1}\prod_{i=1}^{N}\left\{D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}^{+}_{\alpha}\right\}^{y_{i}}\\ &\times\prod_{i=1}^{N}e^{\kappa_{i}\eta_{\epsilon}}\pi\left(\omega_{i}\mid y_{i}+\epsilon,0\right)\\ &\times\exp\left(\bm{\kappa}^{\top}\mathbf{M}\bm{\eta}\right)\exp\left\{-\frac{1}{2}(\bm{\eta}-\mathbf{b})^{\top}\mathbf{A}(\tau_{\mathcal{B}})(\bm{\eta}-\mathbf{b})\right\}\\ &\times\tau_{\mathcal{B}}^{M/2+a_{0}-1}\exp\left(-b_{0}\tau_{\mathcal{B}}\right)\mathbb{I}\left(\tau_{\mathcal{B}}\geq\tau_{0}\right).\end{array}

Next,

π(𝜼,τ,𝝎𝐲)c(𝐲)1i=1N{Dt𝐳α(ti)𝐮α+}yi×i=1Neκiηϵπ(ωiyi+ϵ,0)×exp(𝜿𝐌𝜼)|𝐀(τ)|1/2|𝐀(τ)|1/2exp{12(𝜼𝐛)𝐀(τ)(𝜼𝐛)}×τM/2+a01exp(b0τ)𝕀(ττ0)c(𝐲)1(2π)(P+M)/2i=1N{Dt𝐳α(ti)𝐮α+}yi×i=1Neκiηϵπ(ωiyi+ϵ,0)×exp(𝜿𝐌𝜼)|𝐀(τ0)|1/2π(𝜼|𝐛,𝐀(τ)1)×τM/2+a01exp(b0τ)𝕀(ττ0).\begin{array}[]{rl}\pi(\bm{\eta},\tau_{\mathcal{B}},\bm{\omega}\mid\mathbf{y})\leq&c(\mathbf{y})^{-1}\prod_{i=1}^{N}\left\{D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}^{+}_{\alpha}\right\}^{y_{i}}\\ &\times\prod_{i=1}^{N}e^{\kappa_{i}\eta_{\epsilon}}\pi\left(\omega_{i}\mid y_{i}+\epsilon,0\right)\\ &\times\exp\left(\bm{\kappa}^{\top}\mathbf{M}\bm{\eta}\right)\frac{\left|\mathbf{A}(\tau_{\mathcal{B}})\right|^{1/2}}{\left|\mathbf{A}(\tau_{\mathcal{B}})\right|^{1/2}}\exp\left\{-\frac{1}{2}(\bm{\eta}-\mathbf{b})^{\top}\mathbf{A}(\tau_{\mathcal{B}})(\bm{\eta}-\mathbf{b})\right\}\\ &\times\tau_{\mathcal{B}}^{M/2+a_{0}-1}\exp\left(-b_{0}\tau_{\mathcal{B}}\right)\mathbb{I}\left(\tau_{\mathcal{B}}\geq\tau_{0}\right)\\ \leq&c(\mathbf{y})^{-1}\left(2\pi\right)^{(P+M)/2}\prod_{i=1}^{N}\left\{D_{t}\mathbf{z}^{\top}_{\alpha}(t_{i})\mathbf{u}^{+}_{\alpha}\right\}^{y_{i}}\\ &\times\prod_{i=1}^{N}e^{\kappa_{i}\eta_{\epsilon}}\pi\left(\omega_{i}\mid y_{i}+\epsilon,0\right)\\ &\times\exp\left(\bm{\kappa}^{\top}\mathbf{M}\bm{\eta}\right){\left|\mathbf{A}(\tau_{0})\right|^{-1/2}}\pi(\bm{\eta}|\mathbf{b},\mathbf{A}(\tau_{\mathcal{B}})^{-1})\\ &\times\tau_{\mathcal{B}}^{M/2+a_{0}-1}\exp\left(-b_{0}\tau_{\mathcal{B}}\right)\mathbb{I}\left(\tau_{\mathcal{B}}\geq\tau_{0}\right).\end{array}

Note that |𝐀(τ)|1/2|𝐀(τ0)|1/2|\mathbf{A}(\tau_{\mathcal{B}})|^{-1/2}\leq|\mathbf{A}(\tau_{0})|^{-1/2}. We set 𝐬=𝜿𝐌+𝐭\mathbf{s}^{\top}=\bm{\kappa}^{\top}\mathbf{M}+\mathbf{t}^{\top} and apply the MGF to normal distribution π(𝜼|𝐛,𝐀(τ)1)\pi\left(\bm{\eta}|\mathbf{b},\mathbf{A}(\tau_{\mathcal{B}})^{-1}\right) to get

P+Mexp(𝐭𝜼)exp(𝜿𝐌𝜼)π(𝜼|𝐛,𝐀(τ)1)𝑑𝜼=exp[𝐬𝐛+12𝐬𝐀(τ)1𝐬]exp[𝐬𝐛+12𝐬𝐀(τ0)1𝐬].\begin{array}[]{rl}\int_{\mathbb{R}^{P+M}}\exp(\mathbf{t}^{\top}\bm{\eta})\exp(\bm{\kappa}^{\top}\mathbf{M}\bm{\eta})\pi\left(\bm{\eta}|\mathbf{b},\mathbf{A}(\tau_{\mathcal{B}})^{-1}\right)d\bm{\eta}=&\exp\left[\mathbf{s}^{\top}\mathbf{b}+\frac{1}{2}\mathbf{s}^{\top}\mathbf{A}(\tau_{\mathcal{B}})^{-1}\mathbf{s}\right]\\ \leq&\exp\left[\mathbf{s}^{\top}\mathbf{b}+\frac{1}{2}\mathbf{s}^{\top}\mathbf{A}(\tau_{0})^{-1}\mathbf{s}\right].\end{array}

All that is left is to integrate the gamma kernel and integrate Polya-gamma kernels. An upper bound is achieved and we have

P+Me𝜼𝐭π(𝜼𝐲)𝑑𝜼=++NP+Me𝜼𝐭π(𝜼,τ,𝝎𝐲)𝑑𝜼𝑑𝝎𝑑τ<.\begin{array}[]{c}\int_{\mathbb{R}^{P+M}}e^{\bm{\eta}^{\top}\mathbf{t}}\pi(\bm{\eta}\mid\mathbf{y})d\bm{\eta}=\int_{\mathbb{R}_{+}}\int_{\mathbb{R}_{+}^{N}}\int_{\mathbb{R}^{P+M}}e^{\bm{\eta}^{\top}\mathbf{t}}\pi(\bm{\eta},\tau_{\mathcal{B}},\bm{\omega}\mid\mathbf{y})d\bm{\eta}d\bm{\omega}d\tau_{\mathcal{B}}<\infty.\end{array}