The Cox-Polya-Gamma Algorithm for Flexible Bayesian Inference of Multilevel Survival Models
Abstract
Bayesian Cox semiparametric regression is an important problem in many clinical settings. Bayesian procedures provide finite-sample inference and naturally incorporate prior information if MCMC algorithms and posteriors are well behaved. Survival analysis should also be able to incorporate multilevel modeling such as case weights, frailties and smoothing splines, in a straightforward manner. To tackle these modeling challenges, we propose the Cox-Polya-Gamma (Cox-PG) algorithm for Bayesian multilevel Cox semiparametric regression and survival functions. Our novel computational procedure succinctly addresses the difficult problem of monotonicity constrained modeling of the nonparametric baseline cumulative hazard along with multilevel regression. We develop two key strategies. First, we exploit an approximation between Cox models and negative binomial processes through the Poisson process to reduce Bayesian computation to iterative Gaussian sampling. Next, we appeal to sufficient dimension reduction to address the difficult computation of nonparametric baseline cumulative hazard, allowing for the collapse of the Markov transition within the Gibbs sampler based on beta sufficient statistics. In addition, we explore conditions for uniform ergodicity of the Cox-PG algorithm. We demonstrate our multilevel modeling approach using open source data and simulations. We provide software for our Bayesian procedure in the supplement.
Keywords: Kaplan-Meier, Cox model, Frailty model, Multilevel model, Bayesian inference, Survival analysis
1 Introduction
Semiparametric Cox proportional hazards (PH) regression is an important tool used for time to event data (Cox, 1972, 1975). Many extensions of the Cox model have been proposed, such as cure models (or zero-inflation) (Sy & Taylor, 2000), frailty models (or random effects) (Wienke, 2010), smoothing splines (Therneau et al., 2000; Cui et al., 2021), convex set constraints (McGregor et al., 2020), variable selection (Tibshirani, 1997), and other methods. However, nonparametric inference for semiparametric Cox regression involves difficult monotonicity constrained inference of the cumulative hazard (Hothorn et al., 2018). The Bayesian paradigm provides a convenient framework to naturally address many of these variations of the canonical Cox model (Ibrahim et al., 2001; Chen et al., 2006). Bayesian methods can naturally incorporate convex set constraints (Valeriano et al., 2023; Pakman & Paninski, 2014; Maatouk & Bay, 2016) that can be applied to estimation and inference of monotonic cumulative hazard functions. In addition, Bayesian hierarchical modeling can be extended to different types of regression, such as random effects (Wang & Roy, 2018), zero-inflation (Neelon, 2019), smoothing splines (Wand & Ormerod, 2008), and variable selection (Carvalho et al., 2009; Bhadra et al., 2019; Makalic & Schmidt, 2015) with straightforward Gibbs samplers for multilevel Gaussian models. In order to leverage this rich resource, we propose a novel Markov chain Monte Carlo (MCMC) algorithm that allows Cox model parameters to be modeled with hierarchical Gaussian models.
The Cox model can be described as locally a Poisson process that is constrained to binary outcomes recording death or right censoring (Ibrahim et al., 2001; Aalen et al., 2008). In recent years, Polson et al. (2013) proposed a Polya-gamma (PG) Gibbs sampler for logistic and negative binomial (NB) regression that is based on Gaussian sampling. Consequently, the negative binomial process is a useful analog to the Poisson process of Cox regression. We use a standard gamma process Bayesian modification to Cox regression that results in a NB kernel and enables Gaussian-based modeling to incorporate the necessary structure and constraints. By specifying a univariate gamma frailty or gamma prior on the cumulative hazard, we establish a frailty analog of the local negative binomial relationship (Aalen et al., 2008; Winkelmann, 2008). Typically, a vague gamma frailty is specified to bridge Poisson and NB models (Kalbfleisch, 1978; Sinha et al., 2003; Duan et al., 2018). The practicality of a vague gamma prior allows access to the PG sampler, benefiting from the use of Gaussian Gibbs samplers. In addition, maintaining Gibbs transitions is crucial, as we can take advantage of the marginal transitions to derive succinct Metropolis-Hastings (MH) acceptance probabilities that can be used to remove bias due to gamma frailty augmentation (Duan et al., 2018).
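To make the Poisson-gamma bridge concrete, the following minimal R sketch (our own illustration, not part of the paper's software; the frailty shape and Poisson mean are arbitrary values) checks by simulation that a Poisson likelihood mixed over a mean-one gamma frailty marginalizes to a negative binomial:

```r
# Minimal sketch: a Poisson likelihood with a Gamma(rho, rho) frailty marginalizes
# to a negative binomial, the bridge Cox-PG exploits (values below are illustrative).
set.seed(1)
rho <- 2            # frailty shape/rate; a vague choice (large rho) recovers the Poisson
mu  <- 1.5          # Poisson mean before multiplying by the frailty
n   <- 1e5
w   <- rgamma(n, shape = rho, rate = rho)   # gamma frailty with mean 1
y   <- rpois(n, lambda = w * mu)            # Poisson counts given the frailty

# Marginally, y ~ NegBinom(size = rho, mu = mu); compare empirical and NB pmfs.
round(rbind(empirical = table(factor(y, levels = 0:5)) / n,
            negbinom  = dnbinom(0:5, size = rho, mu = mu)), 3)
```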
However, while the NB form alleviates much of the algebra associated with the cumulative hazard, it does not address the entirety of the baseline hazard rate, which has positivity constraints. There is a broad class of nonparametric estimators that rely on partitioning the time interval (Nieto-Barajas & Walker, 2002; Nelson, 1972; Aalen, 1978; Kaplan & Meier, 1958; Ibrahim et al., 2001). Evaluating these estimators without considering the computational algorithm often results in difficult MCMC procedures, due to monotonic constraints on the baseline cumulative hazard, with MH steps that require tuning. There are many tools available to enforce convex set constraints on Gaussian sampling (Valeriano et al., 2023; Pakman & Paninski, 2014; Maatouk & Bay, 2016; Cong et al., 2017; Li & Ghosh, 2015), which we will demonstrate can be used to enforce the monotonicity and positivity constraints of Cox regression. Inspired by previous work in nonparametric analysis, we propose the use of a sufficient statistic partition-based estimator with closed forms for Gibbs sampler conditional distributions. In order to do so, we introduce a novel monotonic spline system with a half-space constraint (Ramsay, 1988; Meyer et al., 2018) to model the baseline log cumulative hazard function that factorizes into beta sufficient statistics in the likelihood. Next, we introduce slice sampler auxiliary variables to remove the sufficient statistics associated with the baseline hazard rate from the algebra (Damien et al., 1999; Mira & Tierney, 2002; Neal, 2003). Our spline system collapses the linear operator of our Markov transitions onto beta sufficient statistics to improve the efficiency of Bayesian computation while retaining the Gaussian structure. This results in a flexible Bayesian procedure we refer to as the Cox-Polya-Gamma (Cox-PG) algorithm, which is appealing for its ability to adapt multilevel Gaussian inference to Cox regression.
Aside from being able to incorporate prior information into the Cox model, our Cox-PG approach is desirable for situations where inference on the baseline hazard is necessary, such as Kaplan-Meier models (Kaplan & Meier, 1958). We also show that case weights, such as inverse probability weights and power priors, can be incorporated into the Cox-PG algorithm through fractional beta kernels (Shu et al., 2021; Ibrahim et al., 2015). In addition, we explore conditions under which the Cox-PG sampler is uniformly ergodic. Consequently, we are able to derive ergodic theory results for our Gibbs algorithm under mild assumptions. Hierarchical models from Gaussian Gibbs samplers are natural extensions of our approach with straightforward modifications to the Cox-PG algorithm. This includes cure models or zero-inflation (Sy & Taylor, 2000; Neelon, 2019), joint models (Wulfsohn & Tsiatis, 1997; Wang et al., 2022), copulas (Wienke, 2010), competing risks (Kalbfleisch & Prentice, 2011), variable selection (Carvalho et al., 2009; Bhadra et al., 2019; Makalic & Schmidt, 2015), and other multilevel Gaussian models. As a result, we propose a unified Bayesian framework that encompasses many of the models found throughout survival analysis (Ibrahim et al., 2001). In contrast to influence function based semiparametric theory (Tsiatis 2006 Chapter 5), the Cox-PG sampler provides a flexible inference procedure for finite sample Cox modeling, addressing a critical gap in many survival analysis settings such as clinical trials (Ildstad et al., 2001).
The rest of the article is organized as follows. In Section 2, we outline the Cox-PG algorithm. In Section 3, we present uniform ergodicity results. We present additional methods for accelerating MCMC, removing bias, and variations of Cox semiparametric regression in Sections 4 and 5. Simulation studies and real data analyses involving multilevel survival models are presented in Sections 6, 7 and 8.
2 Method: Cox-PG Gibbs sampler
Semiparametric Cox PH regression is closely related to the exponential family with an additional hazard rate term that needs to be accounted for with constrained inference. Exponential family Bayesian models are well studied and have appealing theoretical properties (Choi & Hobert, 2013). In order to emphasize Cox model constraints and its Poisson kernel, we show the derivation of the transformation model for Cox PH regression (Hothorn et al., 2018). The extreme value distribution function of PH models is given as with survival function . In this case, study time denotes the censoring or event time for subject . Let . Monotonic function is the baseline log cumulative hazard approximated using additive monotonic splines with the constraints represented through . The baseline hazard is where the differential operator is represented as and is an intercept incorporated into . The marginal likelihood contribution of the model (Kalbfleisch & Prentice 2011 Chapter 2) is
with death (or event) variable and right censoring . The hazard function is given by following the canonical factorization of hazard-survival functions. The flexibility of this framework is seen by noting that if , the likelihood is an exponential PH regression while if we have the Weibull PH model and the parameterization of determines the hazard family. Now, we set , where and , and the resulting generalized likelihood is
The derivative of monotonic splines, , is positive and the basis coefficients are constrained to be positive. The design matrix is composed of rows . Under this construction, slice sampler auxiliary variables are suitable to maintain positive (Damien et al., 1999; Mira & Tierney, 2002).
Slice samplers naturally remove positive terms in the likelihood by enforcing constraints in conditional distributions. In order to effectively incorporate a slice sampler, we start by placing a gamma mixture variable on the Poisson kernel of PH models to obtain a frailty model analogous to negative binomial regression. Note that the first two moments are given as and and a vague mixture can be used. As noted in Duan et al. (2018), this limit approximation works well for Poisson processes. In addition, we show how to remove this bias later in the article through a simple MH step. A quick Laplace transform reveals the modification to the baseline hazard under frailty: (Aalen et al. 2008 Chapter 6) and by sending we have the Cox model hazard. We remove the hazard component of the likelihood by multiplying by the slice sampling auxiliary uniform variable for uncensored events, . The resulting likelihood is given as
where is a vector of the (Wienke 2010 Chapter 3) and . We can marginalize out by carrying out the Mellin transform to obtain a negative binomial kernel
Let and , then the NB likelihood is
(1) |
where and we condition on . We can integrate out to obtain the proportional frailty likelihood. The frailty likelihood is multiplied by constant to get the NB likelihood. After accounting for covariates and conditioning on , through the Laplace transform, the hazard function is proportional to , which allows for data augmentation (Wienke 2010 Chapter 3). The survival function under frailty is resulting in the canonical hazard-survival factorization.
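The auxiliary-uniform device used above can be illustrated on a toy one-dimensional target. The sketch below is our own illustration, with an arbitrary positive likelihood factor g; it shows how a slice variable turns a positive factor in the likelihood into a simple constraint in the conditional draw:

```r
# Toy illustration of the slice-variable device: for a target proportional to
# dnorm(theta) * g(theta) with g > 0, introduce u ~ Unif(0, g(theta)); conditionally
# on u, the awkward factor g becomes the constraint g(theta) > u.
set.seed(2)
g <- function(theta) exp(-abs(theta))          # any positive likelihood factor (illustrative)
theta <- 0; draws <- numeric(5000)
for (s in seq_along(draws)) {
  u <- runif(1, 0, g(theta))                   # slice variable given theta
  # theta | u is N(0,1) truncated to {theta : g(theta) > u} = (-log(1/u), log(1/u)) here;
  # sample it by inverse-CDF truncated-normal sampling.
  bound <- -log(u)
  lo <- pnorm(-bound); hi <- pnorm(bound)
  theta <- qnorm(runif(1, lo, hi))
  draws[s] <- theta
}
c(mean = mean(draws), sd = sd(draws))          # symmetric target; sd shrinks below 1
```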
To account for the logit link, we can use a PG auxiliary variable , where (Polson et al., 2013). The PG density is
where are independently distributed according to . We have the following exponential tilting or Esscher transform (Esscher, 1932) relationship for PG distributions
(2) |
where and denotes a density. We use and get
to tease out a Gaussian kernel. Note that
can be shown with a Laplace transform (Polson et al., 2013; Choi & Hobert, 2013). We get the NB posterior
(3) |
where . Integrating out and results in the original NB likelihood. We can write our Gaussian density conditioned on through with . The conditional distributions, after applying the Esscher transform to both the PG and Gaussian variables, are given as
(4) |
where the truncated normal is constrained to set: and . The truncated normal density is proportional to the Gaussian kernel multiplied by an indicator for set constraints: ,
where . Here , make up the vector and and make up the vector . Note that convex set intersection preserves convexity (Boyd & Vandenberghe 2004 Chapter 2), allowing for the exploitation of convex set constrained Gaussian sampling (Maatouk & Bay, 2016). At this point, we have shown how to incorporate hazard rate constraints into the general Gibbs sampler. The priors for are flat and we outline hierarchical Gaussian structures later in the article.
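For the constrained Gaussian step, the paper later uses the relliptical package; as a simpler illustration of convex set constrained Gaussian sampling, the sketch below (our own stand-in, with illustrative mean, covariance, and lower bounds) draws from a multivariate normal restricted to a half-space set by coordinate-wise truncated-normal Gibbs updates:

```r
# Minimal sketch: Gibbs sampling of N(mu, Sigma) restricted to the convex half-space set
# {x_j >= lower_j}, the kind of constraint placed on the monotone-spline slopes. This is
# a simple stand-in for dedicated samplers such as relliptical; mu, Sigma, lower are
# illustrative.
rtmvnorm_gibbs <- function(n, mu, Sigma, lower, burn = 100) {
  p <- length(mu)
  Prec <- solve(Sigma)                          # precision matrix
  x <- pmax(mu, lower + 1e-6)                   # feasible starting point
  out <- matrix(NA_real_, n, p)
  for (s in seq_len(n + burn)) {
    for (j in seq_len(p)) {
      # full conditional of x_j given x_{-j} is univariate normal
      cond_var  <- 1 / Prec[j, j]
      cond_mean <- mu[j] - cond_var * sum(Prec[j, -j] * (x[-j] - mu[-j]))
      lo <- pnorm(lower[j], cond_mean, sqrt(cond_var))
      x[j] <- qnorm(runif(1, lo, 1), cond_mean, sqrt(cond_var))
    }
    if (s > burn) out[s - burn, ] <- x
  }
  out
}

set.seed(3)
draws <- rtmvnorm_gibbs(2000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2),
                        lower = c(0, 0))
colMeans(draws)   # both components pushed above zero by the constraint
```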
2.1 Sufficient statistics dimension reduction for efficient computation
A naive slice sampler enforces a prohibitive number of inequality constraints in during Gaussian sampling. The slice sampler for nonparametric partition estimators (Nieto-Barajas & Walker, 2002; Kaplan & Meier, 1958; Ibrahim et al., 2001) presents a unique opportunity to collapse the linear operator of Markov transitions based on sufficient statistics. By using Riemann sum integration to construct the baseline log cumulative hazard , we simplify the and auxiliary transitions . The new hazard rate is now constructed with non-overlapping histogram partitions, using delta functions as its additive bases with . An example of the basis functions used to construct the histogram in Riemann sum integration for the lung dataset from R survival is presented in Figure 1a (Loprinzi et al., 1994; Therneau, 2023). We can write matrix with rows . Now the matrix becomes with induced norm . Each row has at most a single element equal to 1, meaning each uncensored subject activates at most a single delta function. The uniform auxiliary space becomes the last order statistics of uniform random variables, or beta random variables. An example of the inequalities is given as
and we see that for the set we have inequalities for each coefficient multiplied together as indicators,
and our set consist of
We denote and we only need to sample from the last order statistic
Each sample of the uniform variable becomes . Let the number of uncensored events in partition be and denote the last order statistic. We only need to use the last order statistic, the sufficient statistics of to construct with reduced dimensions. The conditional cumulative distributions and densities for the last order statistics are
with and . This is a simple half space and an important improvement over naive slice sampling strategies. Notably, we can rescale beta variables to sample from , where can be verified using a Jacobian and has the property . The denominator term cancels our new sufficiently reduced hazard term . The new NB likelihood with beta kernels is given as
The conditional distributions replacing equations (4) are now given as
(5) |
Note that if only censored events are found in a given partition. However, we recommend using a parameterization where , ensuring each partition has at least one uncensored event. Under this construction, the spline is a monotonic step function with a 45 degree slope in the partition and is visualized in Figure 1b. The slice constraints are now based on the number of coefficients instead of the number of subjects. Many methods exist that can efficiently sample a Gaussian distribution constrained to , with constraints reduced from the number of subjects to the number of coefficients in (Valeriano et al., 2023). Note that is a diagonal matrix and matrices and can be computed in a parallel manner without loading all of the data in memory (Polson et al., 2013). In addition, we can further simplify sampling of through conditional Gaussian distributions,
A multilevel Gaussian model can be specified for while a flat prior can be maintained for . The interpretation of spline coefficients are positive slopes for local linear interpolation of the baseline log cumulative hazard function to ensure a final monotonic function.
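A quick numerical check of the order-statistic reduction used in this section: the largest of m iid Uniform(0, c) slice variables within a partition has the same law as c times a Beta(m, 1) draw, so a single beta draw per partition suffices. The values of m and c below are arbitrary illustrative choices.

```r
# Check of the order-statistic fact behind the dimension reduction: the maximum of m
# iid Uniform(0, c) slice variables matches a single scaled Beta(m, 1) draw.
set.seed(4)
m <- 7; c0 <- 2.3; n <- 1e5
u_max  <- apply(matrix(runif(n * m, 0, c0), n, m), 1, max)  # brute-force maximum
u_beta <- c0 * rbeta(n, m, 1)                               # single scaled beta draw
round(rbind(max_of_uniforms = quantile(u_max,  c(.1, .5, .9)),
            scaled_beta     = quantile(u_beta, c(.1, .5, .9))), 3)
```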
As we increase the number of partitions , we achieve a more accurate estimate at the expense of computation. This improves on standard nonparametric estimators (Kaplan & Meier, 1958; Nelson, 1972; Aalen, 1978; Nieto-Barajas & Walker, 2002) by using a continuous monotonic local linear regression to approximate the log cumulative hazard, rather than an empirical jump process. These splines are centered at their midpoint to reduce attenuation due to priors that can be incorporated (Figure 1b). The intercept term allows location shifts and is unconstrained under Cox regression. The truncated normal Gibbs step is an iterative sampling of non-negative local slopes. Under Bayesian inference, we condition the slope of each partition on the remaining variables. This is a variant of sufficient dimension reduction techniques (Li, 2018; Sun et al., 2019), and these local slopes are an analog of the martingale increments (Tsiatis, 2006) used to construct the partial likelihood under stochastic calculus (Kaplan & Meier, 1958; Cox, 1972, 1975).
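The sketch below (our own illustration on simulated data, not the supplied posterior_draw() code) assembles the ingredients of Section 2.1: partition boundaries from quantiles of uncensored times, a delta-basis design in which each uncensored subject activates one column, and the baseline log cumulative hazard as the monotone piecewise-linear function implied by non-negative slopes.

```r
# Sketch (not the supplied software): partition pieces of Section 2.1 for toy data.
set.seed(5)
time   <- rexp(200); status <- rbinom(200, 1, 0.7)        # toy right-censored data
J      <- 5                                               # illustrative number of partitions
breaks <- c(0, quantile(time[status == 1], probs = seq_len(J - 1) / J), max(time))

part <- cut(time, breaks, labels = FALSE, include.lowest = TRUE)
Z    <- outer(part, seq_len(J), "==") * status            # n x J; one 1 per uncensored row
m_j  <- colSums(Z)                                        # uncensored events per partition
m_j

gamma <- rexp(J)                                          # stand-in non-negative slopes
h0 <- function(t) {                                       # piecewise-linear log cum. hazard
  sapply(t, function(ti) {
    lo <- head(breaks, -1); hi <- breaks[-1]
    sum(gamma * pmin(pmax(ti - lo, 0), hi - lo))
  })
}
all(diff(h0(seq(0, max(time), length.out = 50))) >= 0)    # monotone by construction
```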
2.2 Case weights under sufficient statistics
By reducing computation to beta kernels, case weights can easily be incorporated into the Gibbs sampler. The NB likelihood becomes
and the beta distribution parameter is modified to be . The updated Gibbs sampler is
(6) |
with vector consisting of values . This allows the use of Cox-PG in weighted analysis settings such as power priors (Ibrahim & Chen, 2000; Ibrahim et al., 2015) and inverse probability weighting (Shu et al., 2021).
2.3 General hierarchical Cox model: the Cox-PG algorithm
Our previous Gibbs algorithm can sample from the Cox model, while mixed models for clustered data are straightforward extensions. We allow for additional heterogeneity through Gaussian random effects, also known as log-normal frailties. We can also incorporate prior information on the baseline hazard and coefficients by specifying the appropriate Gaussian priors and formulating a general Gibbs sampler. We update the notation for to include random effects and specify a truncated Gaussian prior for the coefficients, given by the form . Let be the covariance matrix. We order , . Note that the intercept is combined into . Then is the prior precision represented with the direct sum operator , where is the precision for , is the precision of , and is the number of random effect clusters for . The prior mean is and are fixed effect covariates. Let be a set of convex constraints for fixed effects only. The initial constraints are and we condition on through the offset term to model the baseline log cumulative hazard. We introduce the random effect structure by placing a gamma distribution on the precision: , . The Gibbs sampler for (4), after applying sufficient dimension reduction and multiplying by the Gaussian and gamma priors, is given as
(7) |
with and . We may also write where . For brevity, we abbreviate this modeling approach as the Cox-PG algorithm. As noted in Remark 3.3 of Choi & Hobert (2013), posterior propriety does not require a full rank design matrix. In addition, Bayesian inference using (7) is possible in settings with high dimensional covariates (Polson et al., 2013; Makalic & Schmidt, 2015).
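Cox-PG builds on the Gaussian update at the heart of the Polya-gamma sampler of Polson et al. (2013). As a reference point, the sketch below implements that building block for plain logistic regression with a flat prior (no offset, slice variables, truncation, or frailty); it is not the Cox-PG sampler itself, and it assumes the BayesLogit package is installed.

```r
# Minimal sketch of the Gaussian-PG building block (Polson et al., 2013) that Cox-PG
# extends with offsets, slice variables, and truncation: a PG Gibbs sampler for plain
# logistic regression with a flat prior on beta.
library(BayesLogit)

set.seed(6)
n <- 500; X <- cbind(1, rnorm(n)); beta_true <- c(-0.5, 1)
y <- rbinom(n, 1, plogis(X %*% beta_true))

kappa <- y - 1/2
beta  <- c(0, 0)
S     <- 2000
draws <- matrix(NA_real_, S, 2)
for (s in seq_len(S)) {
  omega <- rpg(n, h = 1, z = as.numeric(X %*% beta))       # omega_i ~ PG(1, x_i' beta)
  V     <- chol2inv(chol(crossprod(X * omega, X)))          # (X' Omega X)^{-1}
  mu    <- V %*% crossprod(X, kappa)                        # conditional posterior mean
  beta  <- as.numeric(mu + t(chol(V)) %*% rnorm(2))         # Gaussian draw
  draws[s, ] <- beta
}
colMeans(draws[-(1:200), ])                                 # close to beta_true
```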
3 Theoretical results: uniform ergodicity
We establish uniform ergodicity for our Gibbs sampler and show that the moments of the posterior samples of exist, which guarantees central limit theorem (CLT) results for posterior MCMC averages and consistent estimators of the associated asymptotic variance (Jones, 2004). Our posterior inference of involves matrix multiplication of coefficients and basis functions, or a decomposition of additive functions, and has the same CLT properties. Here is the matrix representation of the monotonic spline bases over the range of the observed covariate data. Suppose is a draw from the posterior distribution; then are posterior samples of nonparametric effects with MCMC averages , with an example provided in Figure 1c.
First, we show uniform ergodicity of for our Gibbs sampler by taking advantage of the truncated gamma distribution (Choi & Hobert, 2013; Wang & Roy, 2018) and slice sampler properties (Mira & Tierney, 2002). A key feature of this strategy is that truncating the gamma distribution at a small results in useful inequalities related to ergodicity while still allowing a vague prior to be specified for . We denote the -marginal Markov chain as and the Markov transition density (Mtd) of as
where is the current state and is the next state. Space is the PG support that contains , and . In addition, placing upper bounds on the baseline hazard ensures finite integrals in the Markov transitions while accommodating vague priors. Under the initial constraints , we see that is strictly positive and Harris ergodic (Tierney, 1994; Hobert, 2011). We show that the Mtd of satisfies the following minorization condition: , where there exist a and a density function , to prove uniform ergodicity (Roberts & Rosenthal 2004, Meyn & Tweedie 2012 Chapter 16). Uniform ergodicity means that the total variation distance to the stationary distribution is bounded by a geometrically decreasing function of the number of Markov transitions , leading to CLT results (Van der Vaart 2000 Chapter 2). Here denotes the Borel -algebra of , and is the Markov transition function for the Mtd
and . We denote as the probability measure with density , is bounded above and .
-
(C1)
The gamma prior parameters are constrained as , and gamma support is constrained to be .
-
(C2)
The monotonic spline system is comprised of Lipschitz continuous basis functions with coefficients constrained to , such that and .
Theorem 2.
We leave the proofs to the supplement. Condition (C1) can be disregarded when considering a Gibbs sampler without mixed effects; the Gibbs sampler is still uniformly ergodic under (C2). Condition (C1) bounds the second order variation of the random effects while allowing for vague priors through the truncated gamma support (Wang & Roy, 2018). Condition (C2) enforces distributional robustness (Blanchet et al., 2022) on how much the baseline hazard can vary, while accommodating a broad class of vague and informative priors. Specifically, it is a common condition that bounds the slopes of the log cumulative hazard function. In large sample theory, condition (C2), or bounding the baseline cumulative hazard, is also used to facilitate the dominated convergence theorem and achieve strong consistency (Wang, 1985; McLain & Ghosh, 2013). This proof also guarantees uniform ergodicity for Poisson regression through NB approximations and PG samplers (Zhou et al., 2012; Duan et al., 2018). The following result for coupling time is derived from the minorization condition (Jones, 2004). Numerical integration can be used to calculate prior to MCMC sampling, in order to evaluate convergence rates and determine the number of MCMC samples (Jones & Hobert, 2001; Geweke, 1989). More importantly, these results establish conditions for convergence in total variation distance and of posterior expectations for constrained nonparametric objects under Bayesian inference.
4 Accelerated mixing and removing bias with calibration
From Gibbs samplers, we have the useful property of reversible marginal transitions. This allows us to construct a simple Metropolis-Hastings correction to debias our MCMC estimate. The pure Poisson process version of this problem was studied in Duan et al. (2018), with slow mixing but accurate inference using . A large artificially induces an imbalanced logistic regression problem (Zens et al., 2023). Data calibration can be used once a Gibbs sampler for approximate PH regression has been established (Duan et al., 2018). We can use Gibbs sampler (7) as a proposal distribution to sample from the target posterior constructed with the beta kernel Cox semiparametric regression posterior, where is the prior. We accept the new Gibbs draw from the proposal distribution using an MH step with probability
(8) |
We denote current state as and next state as . The derivation of this probability can be done quickly by noting that Gibbs samplers are MH algorithms with acceptance probability one. Therefore under the Cox-PG Gibbs sampler for target posterior we have
(9) |
Substituting (9) into (8), we have a ratio of likelihoods which simplifies to approximately a ratio of survival functions or a difference of cumulants on the log scale. This nested detailed balance structure removes the bias of convolution and allows moderate to be used to accelerate mixing. The frailty model with modifies the original Cox model to have more heterogeneity, making it an ideal proposal distribution because proposal distributions should have slightly more variation than their target distribution. Through our experiments, we found and the MH step to work well with over 90% acceptance rates in various settings. The bias due to frailty in Gibbs samplers (7) can be removed with MH acceptance probability defined in (8). For example, draws of are sampled sequentially from (7), each new draw of is accepted with probability (8). After removing the bias, we have draws from the PH likelihood based posterior and we can use the log–log link (i.e. ) to map to the survival probabilities.
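The calibration step only requires the log ratio described above. A generic helper is sketched below; the two pairs of log-density arguments are placeholders to be filled with the model-specific target and proposal-posterior evaluations, which depend on the kernels defined earlier.

```r
# Generic sketch of the calibration step in (8): a Gibbs proposal is kept with
# probability min(1, exp(log-target ratio - log-proposal-posterior ratio)).
# In Cox-PG the ratio collapses to (approximately) a ratio of survival terms.
mh_keep <- function(log_target_new, log_target_old,
                    log_proposal_new, log_proposal_old) {
  log_alpha <- (log_target_new - log_target_old) -
               (log_proposal_new - log_proposal_old)
  log(runif(1)) < min(0, log_alpha)          # TRUE: accept the new Gibbs draw
}

# usage inside a sampler loop (schematic):
# if (mh_keep(lt_new, lt_old, lp_new, lp_old)) state <- state_new else state <- state_old
```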
For Cox-PG with weights (6), the ratios (8) and (9) are calculated using weighted likelihoods and ,
Aside from mixed models, all hierarchical models that can be computed with Gibbs transitions simplify to the convenient acceptance probability (8). These results are due to the prior canceling out in the ratios.
5 Other extensions of the Cox-PG algorithm
The framework we have established has wide applicability to numerous other contexts and problems. Any Gaussian Gibbs sampler model on the PH coefficients can be incorporated as direct extensions of our Cox-PG algorithm, including variable selection (Tipping, 2001; Makalic & Schmidt, 2015). We detail a few additional extensions here:
Stratified hazards. Two sets of baseline hazard spline systems can be specified based on interaction variables. For example, male and female interactions are given as .
Smoothing splines as random effects. Smoothing splines can also be modeled with random effects, as summarized in Wand & Ormerod (2008), O’Sullivan (1986), and Lee et al. (2018). Similar to generalized additive models (GAMs), a nonparametric smooth effect of a continuous covariate can be obtained through penalized regression coefficients. The O’Sullivan spline approach found in Wand & Ormerod (2008) uses a linear fixed effect and L2 penalized coefficients on Demmler-Reinsch (DR) oscillation basis functions (Demmler & Reinsch, 1975). This conveniently corresponds to the second order derivative smoothness penalty and can be fitted with random effects using (7).
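One common way to cast such a smoothing spline as random effects, in the spirit of Wand & Ormerod (2008), is sketched below. For simplicity it uses a B-spline basis with a second-difference (P-spline) penalty as a stand-in for the exact O'Sullivan penalty; the knot count and data are illustrative.

```r
# Sketch of the spline-as-random-effects device: columns of Z receive i.i.d. N(0, sigma_u^2)
# coefficients, exactly the random-effect structure Cox-PG already handles. A second-
# difference penalty stands in for the exact O'Sullivan penalty here.
library(splines)

set.seed(7)
x <- runif(300)
B <- bs(x, df = 12, intercept = TRUE)        # B-spline design, n x 12
D <- diff(diag(ncol(B)), differences = 2)    # second-difference operator
P <- crossprod(D)                            # roughness penalty on the B-spline coefficients
eg   <- eigen(P, symmetric = TRUE)
keep <- eg$values > 1e-8                     # drop the 2-dimensional (linear) null space
Z <- B %*% eg$vectors[, keep] %*% diag(1 / sqrt(eg$values[keep]))

# Fixed part: intercept and linear term; random part: columns of Z with a common variance.
X_fixed <- cbind(1, x)
dim(Z)                                        # n x (12 - 2) penalized columns
```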
Power prior for incorporating historical clinical trial data. The power prior of Ibrahim & Chen (2000) represents as the historical data prior, where is a hyper-parameter for including historical data. Under the power prior form, we evaluate (Ibrahim et al., 2015). This results in being multiplied with , yielding Cox-PG with weights (6).
6 Illustrative example: KM curve
We apply Cox-PG to model survival curves using the lung dataset from the survival R package for illustrative purposes (Loprinzi et al., 1994; Therneau, 2023). Naturally, we do not need many partitions to fit a survival curve. Partition boundaries should be placed near the few changes in the second derivative of the log cumulative hazard. We found that partitions work well with boundaries selected as equally spaced quantiles of uncensored times, in the same manner as knot selection in functional data analysis (Ruppert et al., 2003; Ramsay & Silverman, 2005). This also results in , equal numbers of uncensored events in each partition (Figure 1a). This is an intuitive selection of partition boundaries as deaths drive the rate of survival processes. In addition, this spline-based approach enforces a horizontal line if the last set of study times are all censored events, similar to Kaplan-Meier (KM) estimates, as demonstrated in Figure 1c & d. This horizontal region coincides with the cessation of martingale calculus in KM and partial likelihoods (Figure 1d). Using , partitions and MH step (8) to remove bias, we achieve fast computation and accurate results. For the truncated Gaussian sampling, we use the recently developed efficient Gibbs sampler from the relliptical package (Valeriano et al., 2023). We use the BayesLogit package for PG sampling (Polson et al., 2013) and set , as default parameters. In the back-end, study times are scaled and registered on a compact interval in order to mitigate numerical underflow due to numerous inner product calculations. The code is provided in the supplement with the convenient R function posterior_draw() for posterior sampling. We drew burn-in samples and then drew posterior samples, thinned to retain samples. This process was very fast, taking 2–3 minutes on average on an Apple M2 Pro Mac mini (16GB).
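The boundary-selection and reference-curve side of this analysis can be reproduced with standard tools, as sketched below; the number of partitions is an illustrative choice, and the Cox-PG sampler itself lives in posterior_draw() and is not reproduced here.

```r
# Partition boundaries as equally spaced quantiles of the uncensored times in the lung
# data, and the Kaplan-Meier reference curve used for comparison.
library(survival)

death_times <- lung$time[lung$status == 2]   # status: 1 = censored, 2 = dead
J <- 5                                       # illustrative number of partitions
boundaries <- quantile(death_times, probs = seq_len(J - 1) / J)
boundaries                                   # interior knots; each partition has ~equal deaths

km <- survfit(Surv(time, status) ~ 1, data = lung)
plot(km, xlab = "days", ylab = "survival probability",
     main = "Kaplan-Meier reference for the lung data")
abline(v = boundaries, lty = 3)              # partition boundaries
```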
We map the Cox-PG posterior mean baseline log cumulative hazard (Figure 1c) to survival curves using and compare it to Kaplan-Meier estimates (Figure 1d). We plot our monotone basis functions for in Figure 1b. We note that the posterior mean of the baseline log cumulative hazard is monotonic because it is a sum of monotonic functions and every posterior draw is a monotonic function. We use the approach of Meyer et al. (2015) to construct multiplicity adjusted joint bands for . Our bounds are mapped to survival probabilities and are plotted in Figure 1d.
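One standard construction of such multiplicity adjusted joint bands from posterior draws of a curve, in the spirit of Meyer et al. (2015), scales pointwise deviations by the pointwise standard deviation and takes a posterior quantile of the maximum standardized deviation; a sketch with simulated draws follows.

```r
# Sketch of a multiplicity-adjusted joint band in the spirit of Meyer et al. (2015):
# from an S x T matrix of posterior draws of a curve h(t), widen the pointwise bands by
# the 0.95 quantile of the maximum standardized deviation.
joint_band <- function(draws, level = 0.95) {
  m <- colMeans(draws)
  s <- apply(draws, 2, sd)
  z <- apply(abs(sweep(sweep(draws, 2, m), 2, s, "/")), 1, max)   # max |h - m| / s per draw
  q <- quantile(z, level)
  list(mean = m, lower = m - q * s, upper = m + q * s)
}

# toy usage with fake posterior draws of a curve on a grid
set.seed(8)
tgrid <- seq(0, 1, length.out = 50)
draws <- t(replicate(1000, sin(2 * pi * tgrid) + rnorm(1, 0, .1) + rnorm(50, 0, .05)))
band  <- joint_band(draws)
range(band$upper - band$lower)   # joint bands are wider than pointwise bands
```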
A novel and key advantage of our approach is that Cox-PG enforces monotonicity and continuity at every MCMC draw through local linear regression of the log cumulative hazard. This also induces a degree of smoothness in the survival curve, a useful safeguard against overfitting. In addition, fit can be improved with better partition selection and larger . The KM example is a special case of Cox-PG with the intercept being the only unconstrained PH regression coefficient. We present PH regression models with different choices of and next.
7 Simulation study: Weibull PH model
We used the following configurations of Cox-PG. Cox-PG1: (7) without Metropolis-Hastings bias removal, , . Cox-PG2: (7) with Metropolis-Hastings bias removal, , . Cox-PG3: (7) with Metropolis-Hastings bias removal, , . In addition, we compare our method with Bayesian integrated nested Laplace approximation (INLA) (Alvares et al., 2024; Rustand et al., 2024; Rue et al., 2009). As a gold standard method, we also fit the Weibull simulated data with a Bayesian Weibull PH model using a Hamiltonian Monte Carlo (HMC) algorithm built on Stan (Bürkner, 2017). Competing methods used default initialization conditions and parameters found in R package INLAjoint and R package brms. These methods are referred to as INLA and HMC-Weibull in the results.
As the basis of our simulation, we use the Weibull PH model where is the baseline log cumulative hazard and , and . We simulate 4 cases corresponding to different configurations of the Weibull PH model. Frailty: we simulate 25 balanced cluster random effects and add them to the Weibull process , with . Weighting: we first simulate the Weibull event times from and then contaminate 10% of the data by replacing covariates with draws , ; we assign contaminated data a weight of 0.001 to mitigate contamination. GAM: we add a nonparametric effect to the Weibull process , with . Stratified: we sample 75% of the data and 25% of the data from with ; indicators for hazard groups are recorded. For each case, we sample observations and draw independent censoring times from . For each case, we simulate 200 replicates, and all replicates were fitted with the 5 competing methods: Cox-PG1, Cox-PG2, Cox-PG3, INLA, and HMC-Weibull. HMC-Weibull can accommodate weights, frailties and GAMs but cannot fit stratified hazards or nonparametric baseline hazards. INLA can accommodate frailties, stratified hazards and nonparametric baseline hazards but cannot accommodate weights and GAMs. Cox-PG can accommodate the parameterization of all cases. All Cox-PG algorithms are initialized at the MLE and can be fitted with our posterior_draw() function. All MCMC methods sampled 1000 burn-in and 10000 draws that were thinned to 1000 samples.
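For readers who wish to generate similar data, the sketch below simulates from a Weibull PH model by inverse transform with independent uniform censoring; the shape, scale, coefficients, and censoring bound are assumed for illustration and are not the exact settings used above.

```r
# Sketch of Weibull PH data generation by inverse transform: with baseline cumulative
# hazard Lambda0(t) = lambda * t^k, S(t | x) = exp(-Lambda0(t) * exp(x' beta)), so
# T = (-log(U) / (lambda * exp(x' beta)))^(1/k). All parameter values are assumed.
set.seed(9)
n <- 1000; k <- 1.5; lambda <- 0.5; beta <- c(1, -0.5)
X <- cbind(rnorm(n), rbinom(n, 1, 0.5))
U <- runif(n)
T_event <- as.numeric((-log(U) / (lambda * exp(X %*% beta)))^(1 / k))
C_cens  <- runif(n, 0, 3)                      # independent censoring times
time    <- pmin(T_event, C_cens)
status  <- as.numeric(T_event <= C_cens)       # 1 = death observed, 0 = censored
mean(status)                                   # observed event rate
```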
For the baseline log cumulative hazard, we plot the integrated square error (ISE) and integrated coverage rate in Figure 7, where and are the upper and lower bounds, respectively. The integration limits and are set as the first and last deaths observed in the data. For HMC-Weibull and Cox-PG we were able to calculate the joint bands (Meyer et al., 2015), and we used the provided bounds from INLA. We plot estimates of based on posterior means in Figure 3. Cox-PG has comparable coverage to HMC-Weibull, which is specified under oracle knowledge of the true hazard family. The piecewise slope of Cox-PG estimates the shape of the baseline log cumulative hazard function well. INLA struggles when using numerical integration to approximate the hazard function near the boundary (Figure 3). HMC-Weibull, with knowledge of the true baseline hazard family, outperforms Cox-PG in terms of ISE. However, knowledge of the true hazard family is not possible in real data analysis. For the Stratified case, INLA and Cox-PG outperform HMC-Weibull by specifying two separate baseline hazards, and the HMC-Weibull estimate bisects the two true hazard functions. Considering INLA uses partitions by default, Cox-PG outperforms INLA with fewer parameters in estimating . For the GAM case, results for estimating can be found in the supplement.
Squared errors, 0.95 credible interval lengths and 0.95 credible interval coverage rates for the coefficient estimate of are presented in Figures 4 & 5. All competing methods perform comparably in the Frailty case, because all methods can fit mixed models. For analysis of both and , INLA performance suffers in cases where the weights and GAMs are not used in estimation. However, in the Frailty and Stratified cases, estimation of remains accurate when there is a reasonable approximation of ; this robustness property is well documented in the literature (Tsiatis, 2006; Ibrahim et al., 2001). Cox-PG1, without bias removal, suffers slightly in performance due to its additional heterogeneity and bias. As sample size increases, increases, leading to from the left. This leads to slow mixing of the MCMC by causing . We can decrease by increasing the number of partitions at the cost of computation. However, Cox-PG3 can be unstable when there is not enough uncensored data to fit partitions, an identifiability problem, such as the Stratified case with only 50 subjects in the group. Due to its flexibility, fewer monotonic spline parameters, bias removal and its comparable performance to gold standard methods, we recommend Cox-PG2 as the default.
8 Leukemia example: stratified hazards, GAM, and frailties
We used the leukemia death/right-censored dataset from Henderson et al. (2002) with and baseline covariates, found in the R package INLA (Rue et al., 2017). We include age and sex as linear PH effects. There are 24 district random effects denoting different sites. In addition, we account for the nonlinear effect of the Townsend score: a quantitative measure with a range of where higher values indicate less affluent areas. We believe that low scores have a greater association with increased survival that cannot be explained using a linear effect. The nonlinear GAM component induces an additional 7 random effects to model. We stratified the data into two baseline hazard groups based on white blood cell count to study the relationship between low count (leukopenia) and survival rates. We divided the data into two groups based on a white blood cell count cutoff for leukopenia, with counts above the cutoff considered normal. Cox-PG enables us to study the nonlinear effect of Townsend score and stratified baseline hazards, two violations of the standard proportional hazards assumption. For comparison, we fit Cox-PG2 and HMC-Weibull, using 1000 burn-in and 200000 draws that were thinned to retain 1000 samples. Our model has random effects, fixed effects/intercepts and monotonic splines. We use our posterior_draw() function for Bayesian computation.
HMC-Weibull can parameterize the site random effect and GAM, but not the stratified baseline hazard. We observed that the fitted baseline log cumulative hazard from HMC-Weibull is consistent with the overall shape of the Cox-PG estimates (Figure 6a). However, we observed a violation of proportional hazards (linear effect) due to leukopenia in the Cox-PG estimates. If the effect were linear, there would be an equal-distance difference between the two baseline log cumulative hazards (Figure 6a). Furthermore, the last death in the normal group was observed at day 704, whereas the leukopenia group recorded 87 deaths after day 704. This imbalanced distribution of event times suggests that stratified baseline hazards are more appropriate. We observed a decreased risk of mortality associated with a low Townsend score, reflecting affluent areas (Figure 6b). The nonparametric effect, obtained through the affine combination of the DR basis and penalized coefficients, elucidates the nonlinear relationship between Townsend score and risk. The DR basis associated with the quadratic relationship yielded a coefficient with a significant Bayesian p-value of 0.014 (Figure 6c), indicating a nonlinear relationship. In both competing methods, the effect of age was significant, while the effect of sex was not (Table 1).
9 Discussion
The link between Cox models and logistic regression is well known, with relative risks being equal to odds ratios in rare event settings. Because the Cox model admits a local Poisson representation, we can reformulate Cox models as local negative binomial processes with a vague gamma mixture. These local negative binomial processes are in fact rare event logistic regressions with an appropriate offset term. This allows Bayesian computational and inferential tools of Gaussian inference through logistic regression to be deployed in survival settings. Now, multilevel Gaussian models can readily be adapted to Cox models. These techniques serve as a foundation for future work in Bayesian survival models. Cox-PG can be implemented in base R rather than working with a software framework such as Stan or INLA, making it more accessible to modelers. We provide our initial implementation of the Cox-PG algorithm in the supplement files. We leave other multilevel modeling extensions of Cox-PG and additional strategies for computational acceleration to future work.






Variable | Cox-PG2 | HMC-Weibull |
---|---|---
sex | 0.0714 (-0.0680, 0.2187) | 0.0573 (-0.0744, 0.1840) |
age | 0.0314 (0.0272, 0.0358) | 0.0316 (0.0275, 0.0356) |
References
- Aalen (1978) Aalen, O. (1978), ‘Nonparametric inference for a family of counting processes’, The Annals of Statistics pp. 701–726.
- Aalen et al. (2008) Aalen, O., Borgan, O. & Gjessing, H. (2008), Survival and event history analysis: a process point of view, Springer Science & Business Media.
- Alvares et al. (2024) Alvares, D., Van Niekerk, J., Krainski, E. T., Rue, H. & Rustand, D. (2024), ‘Bayesian survival analysis with inla’, Statistics in Medicine 43(20), 3975–4010.
- Bhadra et al. (2019) Bhadra, A., Datta, J., Polson, N. G. & Willard, B. (2019), ‘Lasso meets horseshoe’, Statistical Science 34(3), 405–427.
- Blanchet et al. (2022) Blanchet, J., Murthy, K. & Si, N. (2022), ‘Confidence regions in wasserstein distributionally robust estimation’, Biometrika 109(2), 295–315.
- Boyd & Vandenberghe (2004) Boyd, S. P. & Vandenberghe, L. (2004), Convex optimization, Cambridge university press.
- Bürkner (2017) Bürkner, P.-C. (2017), ‘brms: An r package for bayesian multilevel models using stan’, Journal of statistical software 80, 1–28.
- Carvalho et al. (2009) Carvalho, C. M., Polson, N. G. & Scott, J. G. (2009), Handling sparsity via the horseshoe, in ‘Artificial intelligence and statistics’, PMLR, pp. 73–80.
- Chen et al. (2006) Chen, M.-H., Ibrahim, J. G. & Shao, Q.-M. (2006), ‘Posterior propriety and computation for the cox regression model with applications to missing covariates’, Biometrika 93(4), 791–807.
- Choi & Hobert (2013) Choi, H. M. & Hobert, J. P. (2013), ‘The Polya-Gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic’, Electronic Journal of Statistics 7, 2054–2064. https://doi.org/10.1214/13-EJS837
- Cong et al. (2017) Cong, Y., Chen, B. & Zhou, M. (2017), ‘Fast simulation of hyperplane-truncated multivariate normal distributions’, Bayesian Analysis 12(4).
- Cox (1972) Cox, D. R. (1972), ‘Regression models and life-tables’, Journal of the Royal Statistical Society: Series B (Methodological) 34(2), 187–202.
- Cox (1975) Cox, D. R. (1975), ‘Partial likelihood’, Biometrika 62(2), 269–276.
- Cui et al. (2021) Cui, E., Crainiceanu, C. M. & Leroux, A. (2021), ‘Additive functional cox model’, Journal of Computational and Graphical Statistics 30(3), 780–793.
- Damien et al. (1999) Damien, P., Wakefield, J. & Walker, S. (1999), ‘Gibbs sampling for bayesian non-conjugate and hierarchical models by using auxiliary variables’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61(2), 331–344.
- Demmler & Reinsch (1975) Demmler, A. & Reinsch, C. (1975), ‘Oscillation matrices with spline smoothing’, Numerische Mathematik 24(5), 375–382.
- Duan et al. (2018) Duan, L. L., Johndrow, J. E. & Dunson, D. B. (2018), ‘Scaling up data augmentation mcmc via calibration’, Journal of Machine Learning Research 19(64), 1–34.
- Esscher (1932) Esscher, F. (1932), ‘On the probability function in the collective theory of risk’, Skand. Aktuarie Tidskr. 15, 175–195.
- Geweke (1989) Geweke, J. (1989), ‘Bayesian inference in econometric models using monte carlo integration’, Econometrica: Journal of the Econometric Society pp. 1317–1339.
- Henderson et al. (2002) Henderson, R., Shimakura, S. & Gorst, D. (2002), ‘Modeling spatial variation in leukemia survival data’, Journal of the American Statistical Association 97(460), 965–972.
- Hobert (2011) Hobert, J. P. (2011), ‘The data augmentation algorithm: Theory and methodology’, Handbook of Markov Chain Monte Carlo pp. 253–293.
- Hothorn et al. (2018) Hothorn, T., Möst, L. & Bühlmann, P. (2018), ‘Most likely transformations’, Scandinavian Journal of Statistics 45(1), 110–134.
- Ibrahim & Chen (2000) Ibrahim, J. G. & Chen, M.-H. (2000), ‘Power prior distributions for regression models’, Statistical Science pp. 46–60.
- Ibrahim et al. (2015) Ibrahim, J. G., Chen, M.-H., Gwon, Y. & Chen, F. (2015), ‘The power prior: theory and applications’, Statistics in medicine 34(28), 3724–3749.
- Ibrahim et al. (2001) Ibrahim, J. G., Chen, M.-H. & Sinha, D. (2001), Bayesian survival analysis, Vol. 2, Springer.
- Ildstad et al. (2001) Ildstad, S. T., Evans Jr, C. H. et al. (2001), ‘Small clinical trials: Issues and challenges’.
- Jones (2004) Jones, G. L. (2004), ‘On the markov chain central limit theorem’.
- Jones & Hobert (2001) Jones, G. L. & Hobert, J. P. (2001), ‘Honest exploration of intractable probability distributions via markov chain monte carlo’, Statistical Science pp. 312–334.
- Kalbfleisch (1978) Kalbfleisch, J. D. (1978), ‘Non-parametric bayesian analysis of survival time data’, Journal of the Royal Statistical Society: Series B (Methodological) 40(2), 214–221.
- Kalbfleisch & Prentice (2011) Kalbfleisch, J. D. & Prentice, R. L. (2011), The statistical analysis of failure time data, John Wiley & Sons.
- Kaplan & Meier (1958) Kaplan, E. L. & Meier, P. (1958), ‘Nonparametric estimation from incomplete observations’, Journal of the American statistical association 53(282), 457–481.
- Lee et al. (2018) Lee, W., Miranda, M., Rausch, P., Baladandayuthapani, V., Fazio, M., Downs, C. & Morris, J. S. (2018), ‘Bayesian semiparametric functional mixed models for serially correlated functional data, with application to glaucoma data’, Journal of the American Statistical Association .
- Li (2018) Li, B. (2018), Sufficient dimension reduction: Methods and applications with R, CRC Press.
- Li & Ghosh (2015) Li, Y. & Ghosh, S. K. (2015), ‘Efficient sampling methods for truncated multivariate normal and student-t distributions subject to linear inequality constraints’, Journal of Statistical Theory and Practice 9, 712–732.
- Loprinzi et al. (1994) Loprinzi, C. L., Laurie, J. A., Wieand, H. S., Krook, J. E., Novotny, P. J., Kugler, J. W., Bartel, J., Law, M., Bateman, M. & Klatt, N. E. (1994), ‘Prospective evaluation of prognostic variables from patient-completed questionnaires. north central cancer treatment group.’, Journal of Clinical Oncology 12(3), 601–607.
- Maatouk & Bay (2016) Maatouk, H. & Bay, X. (2016), A new rejection sampling method for truncated multivariate gaussian random variables restricted to convex sets, in ‘Monte Carlo and Quasi-Monte Carlo Methods: MCQMC, Leuven, Belgium, April 2014’, Springer, pp. 521–530.
- Makalic & Schmidt (2015) Makalic, E. & Schmidt, D. F. (2015), ‘A simple sampler for the horseshoe estimator’, IEEE Signal Processing Letters 23(1), 179–182.
- McGregor et al. (2020) McGregor, D., Palarea-Albaladejo, J., Dall, P., Hron, K. & Chastin, S. (2020), ‘Cox regression survival analysis with compositional covariates: application to modelling mortality risk from 24-h physical activity patterns’, Statistical methods in medical research 29(5), 1447–1465.
- McLain & Ghosh (2013) McLain, A. C. & Ghosh, S. K. (2013), ‘Efficient sieve maximum likelihood estimation of time-transformation models’, Journal of Statistical Theory and Practice 7, 285–303.
- Meyer et al. (2018) Meyer, M. C., Kim, S.-Y. & Wang, H. (2018), ‘Convergence rates for constrained regression splines’, Journal of Statistical Planning and Inference 193, 179–188.
- Meyer et al. (2015) Meyer, M. J., Coull, B. A., Versace, F., Cinciripini, P. & Morris, J. S. (2015), ‘Bayesian function-on-function regression for multilevel functional data’, Biometrics 71(3), 563–574.
- Meyn & Tweedie (2012) Meyn, S. P. & Tweedie, R. L. (2012), Markov chains and stochastic stability, Springer Science & Business Media.
- Mira & Tierney (2002) Mira, A. & Tierney, L. (2002), ‘Efficiency and convergence properties of slice samplers’, Scandinavian Journal of Statistics 29(1), 1–12.
- Neal (2003) Neal, R. M. (2003), ‘Slice sampling’, The annals of statistics 31(3), 705–767.
- Neelon (2019) Neelon, B. (2019), ‘Bayesian zero-inflated negative binomial regression based on pólya-gamma mixtures’, Bayesian analysis 14(3), 829.
- Nelson (1972) Nelson, W. (1972), ‘Theory and applications of hazard plotting for censored failure data’, Technometrics 14(4), 945–966.
- Nieto-Barajas & Walker (2002) Nieto-Barajas, L. E. & Walker, S. G. (2002), ‘Markov beta and gamma processes for modelling hazard rates’, Scandinavian Journal of Statistics 29(3), 413–424.
- O’Sullivan (1986) O’Sullivan, F. (1986), ‘A statistical perspective on ill-posed inverse problems’, Statistical science pp. 502–518.
- Pakman & Paninski (2014) Pakman, A. & Paninski, L. (2014), ‘Exact hamiltonian monte carlo for truncated multivariate gaussians’, Journal of Computational and Graphical Statistics 23(2), 518–542.
- Polson et al. (2013) Polson, N. G., Scott, J. G. & Windle, J. (2013), ‘Bayesian inference for logistic models using pólya–gamma latent variables’, Journal of the American statistical Association 108(504), 1339–1349.
- Ramsay (1988) Ramsay, J. O. (1988), ‘Monotone regression splines in action’, Statistical science pp. 425–441.
- Ramsay & Silverman (2005) Ramsay, J. O. & Silverman, B. W. (2005), Fitting differential equations to functional data: Principal differential analysis, Springer.
- Roberts & Rosenthal (2004) Roberts, G. O. & Rosenthal, J. S. (2004), ‘General state space markov chains and mcmc algorithms’.
- Rue et al. (2009) Rue, H., Martino, S. & Chopin, N. (2009), ‘Approximate bayesian inference for latent gaussian models by using integrated nested laplace approximations’, Journal of the Royal Statistical Society Series B: Statistical Methodology 71(2), 319–392.
- Rue et al. (2017) Rue, H., Riebler, A., Sørbye, S. H., Illian, J. B., Simpson, D. P. & Lindgren, F. K. (2017), ‘Bayesian computing with inla: a review’, Annual Review of Statistics and Its Application 4(1), 395–421.
- Ruppert et al. (2003) Ruppert, D., Wand, M. P. & Carroll, R. J. (2003), Semiparametric regression, number 12, Cambridge university press.
- Rustand et al. (2024) Rustand, D., Van Niekerk, J., Krainski, E. T., Rue, H. & Proust-Lima, C. (2024), ‘Fast and flexible inference for joint models of multivariate longitudinal and survival data using integrated nested laplace approximations’, Biostatistics 25(2), 429–448.
- Shu et al. (2021) Shu, D., Young, J. G., Toh, S. & Wang, R. (2021), ‘Variance estimation in inverse probability weighted cox models’, Biometrics 77(3), 1101–1117.
- Sinha et al. (2003) Sinha, D., Ibrahim, J. G. & Chen, M.-H. (2003), ‘A bayesian justification of cox’s partial likelihood’, Biometrika 90(3), 629–641.
- Sun et al. (2019) Sun, Q., Zhu, R., Wang, T. & Zeng, D. (2019), ‘Counting process-based dimension reduction methods for censored outcomes’, Biometrika 106(1), 181–196.
- Sy & Taylor (2000) Sy, J. P. & Taylor, J. M. (2000), ‘Estimation in a cox proportional hazards cure model’, Biometrics 56(1), 227–236.
- Therneau (2023) Therneau, T. M. (2023), A Package for Survival Analysis in R. R package version 3.5-7. https://CRAN.R-project.org/package=survival
- Therneau et al. (2000) Therneau, T. M., Grambsch, P. M., Therneau, T. M. & Grambsch, P. M. (2000), The Cox model, Springer.
- Tibshirani (1997) Tibshirani, R. (1997), ‘The lasso method for variable selection in the cox model’, Statistics in medicine 16(4), 385–395.
- Tierney (1994) Tierney, L. (1994), ‘Markov chains for exploring posterior distributions’, the Annals of Statistics pp. 1701–1728.
- Tipping (2001) Tipping, M. E. (2001), ‘Sparse bayesian learning and the relevance vector machine’, Journal of machine learning research 1(Jun), 211–244.
- Tsiatis (2006) Tsiatis, A. A. (2006), ‘Semiparametric theory and missing data’.
- Valeriano et al. (2023) Valeriano, K. A., Galarza, C. E. & Matos, L. A. (2023), ‘Moments and random number generation for the truncated elliptical family of distributions’, Statistics and Computing 33(1), 32.
- Van der Vaart (2000) Van der Vaart, A. W. (2000), Asymptotic statistics, Vol. 3, Cambridge university press.
- Wand & Ormerod (2008) Wand, M. P. & Ormerod, J. (2008), ‘On semiparametric regression with o’sullivan penalized splines’, Australian & New Zealand Journal of Statistics 50(2), 179–198.
- Wang (1985) Wang, J.-L. (1985), ‘Strong consistency of approximate maximum likelihood estimators with applications in nonparametrics’, The Annals of Statistics pp. 932–946.
- Wang et al. (2022) Wang, T., Ratcliffe, S. J. & Guo, W. (2022), ‘Time-to-event analysis with unknown time origins via longitudinal biomarker registration’, Journal of the American Statistical Association pp. 1–16.
- Wang & Roy (2018) Wang, X. & Roy, V. (2018), ‘Analysis of the pólya-gamma block gibbs sampler for bayesian logistic linear mixed models’, Statistics & Probability Letters 137, 251–256.
- Wienke (2010) Wienke, A. (2010), Frailty models in survival analysis, CRC press.
- Winkelmann (2008) Winkelmann, R. (2008), Econometric analysis of count data, Springer Science & Business Media.
- Wulfsohn & Tsiatis (1997) Wulfsohn, M. S. & Tsiatis, A. A. (1997), ‘A joint model for survival and longitudinal data measured with error’, Biometrics pp. 330–339.
- Zens et al. (2023) Zens, G., Frühwirth-Schnatter, S. & Wagner, H. (2023), ‘Ultimate pólya gamma samplers–efficient mcmc for possibly imbalanced binary and categorical data’, Journal of the American Statistical Association pp. 1–12.
- Zhou et al. (2012) Zhou, M., Li, L., Dunson, D. & Carin, L. (2012), Lognormal and gamma mixed negative binomial regression, in ‘Proceedings of the… International Conference on Machine Learning. International Conference on Machine Learning’, Vol. 2012, NIH Public Access, p. 1343.
Supplement
Simulation Studies: GAM results

Appendix: Proofs
In order to facilitate the proofs of Theorems 1 & 2, we first present the following definitions and lemma.
Definition 1.
Truncated gamma density:
where .
Definition 2.
Loewner order of matrices : if is positive semi-definite or for all real and the following relationships hold and .
Definition 3.
Hyperbolic cosine through Laplace transform:
Definition 4.
Hyperbolic cosine inequalities: for , and for .
Lemma 1.
From Loewner ordering, we have
where and .
Proof.
Proof of Lemma 1. Recall . We note that and . Given that
We apply Loewner ordering to the projection matrix
and the triangle and Cauchy–Schwarz inequalities next,
where and . ∎
Proof.
Proof of Theorem 1. First we use the truncated gamma density to replace the gamma conditional distribution
We note that and
with . In addition, . After applying Lemma 1, we have the inequality,
Note that we expand the quadratic form of . With some algebra, we combine , and to get . We have
with .
Replacing our gamma Markov chain transitions with truncated gamma , following Theorem 1 of Wang & Roy 2018, we now bound
From Theorem 1 of Wang & Roy 2018, under condition (C1), we have
The next step of the proof follows the spirit of Wang & Roy (2018), but it is less straightforward, so we write these steps in detail. Note that the conditional distribution is given as with and . After multiplying by and using Definitions 3 & 4, we carry out the integral
We have
Note that
with quadratic expansion where is the diagonal matrix with th diagonal element . We can simplify the expansion as .
where . Let , and we get
Note that by Loewner ordering (Definition 2). Let and
(see Definition 4 for relationship). We can write the bounds of as a Gaussian kernel,
where .
Note that and using (C2) with Theorem 7 from Mira & Tierney 2002, we get
where
where denotes minimum. We may also use the notation in our proof instead. For our Mtd, we have
with
Note that and .
∎
Proof.
Proof of Theorem 2. First, we define normalizing constant and
We can integrate and bound the slice variable at . We can rewrite with kernels and bound the indicator function ,
Next,
Note that . We set and apply the moment generating function of the normal distribution to get
All that is left is to integrate the gamma kernel and the Polya-gamma kernels. An upper bound is achieved and we have
∎