
Analysis of survival data with non-proportional hazards: A comparison of propensity score weighted methods

Elizabeth A. Handorf*1, Marc Smaldone2, Sujana Movva3, Nandita Mitra4
Abstract

One of the most common ways researchers compare survival outcomes across treatments when confounding is present is using Cox regression. This model is limited by its underlying assumption of proportional hazards; in some cases, substantial violations may occur. Here we present and compare approaches which attempt to address this issue, including Cox models with time-varying hazard ratios; parametric accelerated failure time models; Kaplan-Meier curves; and pseudo-observations. To adjust for differences between treatment groups, we use Inverse Probability of Treatment Weighting based on the propensity score. We examine clinically meaningful outcome measures that can be computed and directly compared across each method, namely, survival probability at time $T$, median survival, and restricted mean survival. We conduct simulation studies under a range of scenarios, and determine the biases, coverages, and standard errors of the Average Treatment Effects for each method. We then apply these approaches to two published observational studies of survival after cancer treatment. The first examines chemotherapy in sarcoma, where survival is very similar initially, but after two years the chemotherapy group shows a benefit. The other study is a comparison of surgical techniques for kidney cancer, where survival differences are attenuated over time.

[1]Biostatistics and Bioinformatics Facility, Fox Chase Cancer Center, PA USA

[2]Department of Surgical Oncology, Fox Chase Cancer Center, PA, USA

[3]Department of Medicine, Memorial Sloan Kettering Cancer Center, NY, USA

[4]Division of Biostatistics, University of Pennsylvania Perelman School of Medicine, PA, USA

*Correspondence to: Elizabeth Handorf. 333 Cottman Ave., Reimann 383, Philadelphia, PA 19111. [email protected]

1 Introduction

Many health care studies use observational databases to compare censored survival times between different treatment or exposure groups. A common statistical approach used to model such data is the Cox proportional hazards regression model. However, the standard Cox model makes a strong assumption that the hazards are proportional between treatment groups of interest over the course of follow-up; that is, that the ratio of the hazards is constant over time. [1] This assumption may not hold in real-world studies.

Although Cox regression is often used to analyze survival time data, particularly when adjusting for covariates, there are a variety of models for censored outcomes which do not rely on the assumption of proportional hazards. Non-parametric methods, including the ubiquitous Kaplan-Meier method, by definition do not make assumptions about the functional form of the survival curves, and therefore accommodate non-proportionality. [2] An alternative non-parametric method is the Nelson-Aalen estimator, [3] and a recent method proposes using pseudo-observations, which can also be used without parametric assumptions. [4, 5, 6] However, standard implementations of non-parametric approaches do not readily incorporate covariates. This motivates use of regression-based parametric and semi-parametric approaches which accommodate non-proportionality. The Cox model can be modified to include time dependent hazards, relaxing the assumption of proportionality. [3] Accelerated Failure Time (AFT) models are a different class of regression methods for censored data, which assume a linear model for log-time. They are often modeled fully parametrically, and for many choices of distribution for the error term, the models are non-proportional. Semi-parametric AFT methods are also available. [7] Proportional odds models can also be used in the setting of non-proportionality. [3]

Given the number of choices, a natural question is which method performs best in real-world conditions, particularly when analyzing observational data where confounding is an issue. In this paper, we focus on approaches which can use propensity score weighting to address confounding. The propensity score is commonly used in the analysis of clinical studies and has many advantages, as we discuss in detail in section 2. We then describe several methods for analysis of survival data where the proportional hazards assumption is violated. We focus on meaningful estimands and compare the performance of parametric, semi-parametric, and non-parametric approaches, some of which are commonly used while others are less familiar. We restrict attention to methods which 1) can be used with Inverse Probability of Treatment Weighting (IPTW), 2) identify the estimands of interest, and 3) are readily implemented using currently available software. We provide insight on the performance of these methods in scenarios often involved in clinical studies.

This work is motivated by two of our recently published studies in cancer in which the treatment effect exhibited non-proportional hazards. In our study of chemotherapy for soft-tissue sarcoma, there was a negligible survival benefit until about two years after the start of treatment. [8] In another study comparing two different surgical techniques in early-stage renal cancer, we found that one approach had a strong benefit immediately after surgery, but the benefit decreased over time. [9] The data for both of these studies came from an observational source, the National Cancer Database. [10] Therefore, both were likely subject to confounding by indication; that is, the choice of treatment is informed by factors that are also associated with survival.

The outline of our paper is as follows. In section 2, we provide a brief review of estimating the average treatment effect using propensity score weighting. In section 3, we present several methods to analyze survival outcomes under non-proportional hazards that utilize IPTW. In section 4, we conduct extensive simulation studies to compare the methods presented in section 3. In section 5, we use our motivating studies of soft-tissue sarcoma and renal cancer to demonstrate implementation and interpretation of the various methods described earlier. We conclude with a brief discussion and thoughts on next steps.

2 Propensity Score weighting

The propensity score, first proposed by Rosenbaum and Rubin, [11, 12] is defined as the probability of receiving treatment given a set of covariates,

e = P(Z = 1 \mid \mathbf{X})

where $e$ is the propensity score, $Z$ is a binary indicator for treatment, and $\mathbf{X}$ is a vector of (measured) covariates. The propensity score is useful for causal inference under the potential outcomes framework. Here, each subject has two potential outcomes under the control and treatment conditions, $(Y^{0}, Y^{1})$, one of which is unobserved. [13, 14] The estimand of interest is often the Average Treatment Effect (ATE), defined as

\mathrm{ATE} = \mathrm{E}(Y^{1} - Y^{0}),

which is usefully interpreted as the expected value of the outcome if every subject is given the treatment, compared with the expected value if every subject receives the control. Strongly ignorable treatment assignment occurs under the conditions

(Y^{0}, Y^{1}) \perp Z \mid \mathbf{X} \quad \mathrm{and} \quad 0 < e < 1,

that is, when the potential outcomes are independent of treatment assignment conditional on the covariates, and each subject has a nonzero probability of receiving both the treatment and the control (positivity). When strong ignorability holds, the ATE based on the propensity score is identifiable from the observed data. [11, 15, 16]

\mathrm{E}(Y^{1} - Y^{0}) = \mathrm{E}(Y \mid Z = 1, e) - \mathrm{E}(Y \mid Z = 0, e).

Propensity score methods have a number of benefits over traditional regression adjustment. First, they provide an easy way to check for sufficient covariate overlap and balance between the treated and control groups. [17] Second, one can use flexible and modern approaches, such as ensemble models, [18] to estimate the propensity scores, which make the estimates less vulnerable to bias from an incorrect functional form. Propensity scores also allow for the use of doubly robust methods. [19] Third, propensity score weighting approaches provide marginal, not conditional, estimates, which are more interpretable and useful to clinicians and policymakers. [20, 21, 22, 23]

There are several ways to use propensity scores in analyses of survival data. Here we focus on weighting, specifically inverse probability of treatment weighting (IPTW). The idea underlying IPTW is to create synthetic samples where covariates are unrelated to treatment assignment. The weighted samples can then be summarized and compared directly. One nice consequence of using IPTW is that the estimand is the ATE. [17] To implement an IPTW analysis, one must first obtain the estimated propensity score for each subject ($\hat{e}_{i}$). This can be accomplished via a number of approaches, including logistic regression, random forests, or ensemble methods. The IPTW for subject $i$ is then defined as

\hat{w}_{i} = \frac{Z_{i}}{\hat{e}_{i}} + \frac{1 - Z_{i}}{1 - \hat{e}_{i}}.

As $Z_{i}$ is binary, if subject $i$ is treated, their weight is one over their probability of being treated (based on their covariates). If the subject did not receive the treatment, their weight is one over their probability of receiving the control. Therefore, treated subjects who were less likely to be treated have larger weights, as do control subjects who were less likely to receive the control, thus creating a balanced pseudo-population. In subsequent analyses, these weights are used to estimate the expected effects under the treatment and control conditions.
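As a concrete illustration, a minimal R sketch of this computation follows; the data frame d and covariate names x1 and x2 are assumed for illustration.

```r
# Sketch: estimate propensity scores by logistic regression, then form IPTW.
# Assumes a data frame d with binary treatment Z and covariates x1, x2.
ps_model <- glm(Z ~ x1 + x2, family = binomial, data = d)
e_hat <- predict(ps_model, type = "response")  # estimated propensity scores
d$w <- ifelse(d$Z == 1, 1 / e_hat, 1 / (1 - e_hat))

# A simple overlap check: compare propensity score distributions by arm
summary(e_hat[d$Z == 1])
summary(e_hat[d$Z == 0])
```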

3 Survival methods under non-proportional hazards

3.1 Quantifying differences between groups with non-proportionality

In the setting of survival outcomes with non-proportionality, we must first decide what our estimand will be. It should be readily interpretable by the intended audience, meaningful given the observed data features, and estimable using available software. If we fit a proportional hazards model to non-proportional data, we would obtain an estimate of a single hazard ratio (HR), but this would be a misleading measure because it is essentially a summary of the different hazard ratios at each observed failure time. Alternative summary measures have been proposed, including the average hazard ratio, which is interpreted as the average of the hazard ratios weighted by the number of patients at risk. [24]

The difficulties inherent in reporting hazard ratios in the setting of non-proportionality can be avoided by focusing on measures that can be defined directly from the survival functions for the treated and control arms, $S_{1}(t)$ and $S_{0}(t)$, respectively.

The simplest approach compares differences in survival probabilities at a particular time $t = T$:

\Delta S_{T} = S_{1}(T) - S_{0}(T)

Quantiles of the survival functions can also be compared. For instance, if the survival curves are defined at $S(t) = 0.5$, median survival times can be compared:

\Delta T_{50} = (t : S_{1}(t) = 0.5) - (t : S_{0}(t) = 0.5)

These measures have the benefit of being readily understood by a clinical audience, but the drawback is that they do not comprehensively measure survival effects over the length of follow-up. An alternative measure is restricted mean survival (RMS), the mean survival up through some time $T$ where $S(t)$ is defined (i.e. mean survival up through the maximum observation time). [25] RMS can be found using the area under the survival curve, where the difference in RMS between the treated and control conditions is

\Delta\mathrm{RMS}_{T} = \int_{0}^{T} S_{1}(t)\,dt - \int_{0}^{T} S_{0}(t)\,dt.

Each of these measures is estimated directly from the survival curve, so they are always meaningful regardless of proportionality. However, not every method for the analysis of survival data explicitly estimates the survival curve, and not all methods can accommodate weighting. In the subsequent sections, we focus on methods which 1) can accommodate IPTW, 2) allow the estimation of the measures of interest ($\Delta S_{T}$, $\Delta T_{50}$, and $\Delta\mathrm{RMS}_{T}$), and 3) are readily implemented using published software.
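To make these estimands concrete, the following sketch reads all three off weighted Kaplan-Meier curves (described in sections 2 and 3.4); the variable names and the choices T = 5 and T = 10 are assumed for illustration.

```r
library(survival)
# Sketch: Delta S_T, Delta T_50, and Delta RMS_T from weighted survival curves.
# Assumes d contains follow-up time, event status, treatment Z, and IPTW w.
sf <- survfit(Surv(time, status) ~ Z, data = d, weights = w)

s5 <- summary(sf, times = 5)$surv              # S_0(5) and S_1(5)
dS5 <- s5[2] - s5[1]                           # difference at T = 5

tab <- summary(sf, rmean = 10)$table           # per-arm summary table
dT50 <- tab[2, "median"] - tab[1, "median"]    # difference in medians
rcol <- grep("rmean", colnames(tab))[1]        # column name varies by version
dRMS <- tab[2, rcol] - tab[1, rcol]            # difference in RMS up to T = 10
```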

3.2 Cox model

Although the Cox proportional hazards model does rely on the assumption of proportionality, it tends to be robust to small deviations from this assumption. However, if $n$ is large, as is common in registry studies, small deviations from proportionality may be statistically significant when tested using standard diagnostics (e.g. Cox-Snell residuals, Schoenfeld residuals). [26] Therefore, the Cox model may still be useful in some cases where deviations from proportionality are demonstrated. The Cox model is defined as follows:

h_{i}(t \mid Z_{i}) = h_{0}(t)\exp(\beta Z_{i}).

IPTW can be readily accommodated in the Cox model by weighting the contribution of each observation to the partial likelihood function:

L(\beta) = \prod_{i=1}^{N}\left(\frac{\exp(\beta Z_{i})}{\sum_{j \in R(t_{i})} \hat{w}_{j}\exp(\beta Z_{j})}\right)^{\hat{w}_{i}}

where $R(t_{i})$ is the risk set at time $t_{i}$. The parameter $\beta$ is the average of the log-hazard ratios at each failure time. Importantly, one can find $\hat{S}(t)$ based on the estimates $\hat{h}_{0}(t)$ and $\hat{\beta}$.

We can also extend the standard Cox model, relaxing the assumption of proportionality, by allowing the hazard ratio to vary as a function of time.

h_{i}(t \mid Z_{i}) = h_{0}(t)\exp(\beta Z_{i} + f(Z_{i}, t)).

In our simulations, we consider two particular cases: 1) allowing the hazard ratio to vary with log-time,

f(Z_{i}, t) = \kappa Z_{i}\log(t),

and 2) using a piecewise constant treatment effect

f(Z_{i}, t) = \begin{cases} 0 & 0 \leq t < C_{1} \\ \kappa_{1} Z_{i} & C_{1} \leq t < C_{2} \\ \kappa_{2} Z_{i} & C_{2} \leq t \end{cases}

These models can all be implemented with the widely-used survival package in R. [27]
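For example, weighted fits of the standard, log-time, and piecewise constant models might look as follows; this is a sketch with assumed variable names, and the cutpoints C1 = 2 and C2 = 5 are illustrative.

```r
library(survival)
# Standard IPTW Cox model
fit_cox <- coxph(Surv(time, status) ~ Z, data = d, weights = w)

# 1) Hazard ratio varying with log-time, via coxph's tt() facility
fit_lt <- coxph(Surv(time, status) ~ Z + tt(Z), data = d, weights = w,
                tt = function(x, t, ...) x * log(t))

# 2) Piecewise constant hazard ratio: split follow-up at 2 and 5 years,
#    then estimate a separate treatment effect within each time period
d2 <- survSplit(Surv(time, status) ~ ., data = d, cut = c(2, 5),
                episode = "tgroup")
fit_pwc <- coxph(Surv(tstart, time, status) ~ Z:strata(tgroup),
                 data = d2, weights = w)
```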

3.3 Parametric AFT models

Accelerated Failure Time (AFT) models assume a linear model for $\log(T)$. [3, 7, 28]

\log(T_{i}) = \mu + \beta Z_{i} + \sigma\epsilon_{i}

Here, we consider parametric models, where the error term $\epsilon$ is assumed to follow a given distribution. There are many choices for the distribution of $\epsilon$. A common choice is the Gumbel distribution, which yields a Weibull hazard model; however, the Weibull model is also a proportional hazards model. We can allow for non-proportionality by letting the hazard ratio change over time:

h_{i}(t \mid Z) = \exp[-(\beta Z_{i} + f(Z_{i}, t))]^{\gamma}\,\lambda\gamma t^{\gamma-1}, \qquad (1)

where $\lambda$ is the scale and $\gamma$ is the shape parameter. If we let

f(Z_{i}, t) = \kappa Z_{i}\log(t),

then (1) can be re-written as

\log(T_{i}) = \frac{1}{\gamma + \kappa Z_{i}}\left(-\log(\lambda) + \beta Z_{i} + \epsilon_{i}\right), \qquad \epsilon_{i} \sim \mathrm{Gumbel}.

Therefore, if we include an additional parameter in the Weibull model, allowing the shape to vary by treatment, the model specification allows the hazard ratio to vary by log-time.

Another flexible alternative is the generalized gamma model, which is also a three-parameter model. [29] The distribution of the errors has an additional parameter $Q$.

\log(T_{i}) = \mu + \beta Z_{i} + \sigma\epsilon_{i}, \qquad \epsilon_{i} \sim \left(\frac{|Q|}{\Gamma(1/Q^{2})}\right)\left(\frac{\exp(Qv)}{Q^{2}}\right)^{1/Q^{2}}\exp\left(\frac{-\exp(Qv)}{Q^{2}}\right)

This model encompasses other distributions as special cases; when $Q = 1$ we have a Weibull AFT model, and when $Q = 0$ it reduces to a log-normal model.

For any AFT model, the weights $\hat{w}_{i}$ can be applied during estimation of the parameters. Here, we focus on the three-parameter Weibull and generalized gamma AFT models, but a wide range of AFT models can be fit in R using the flexsurv package. [30]
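For example, both models can be fit with weights along the following lines; this is a sketch, and the summary types available depend on the flexsurv version.

```r
library(flexsurv)
# Weibull AFT with the shape (ancillary) parameter varying by treatment,
# allowing the hazard ratio to vary with log-time
fit_wbl <- flexsurvreg(Surv(time, status) ~ Z, anc = list(shape = ~ Z),
                       dist = "weibull", data = d, weights = w)

# Generalized gamma AFT
fit_gg <- flexsurvreg(Surv(time, status) ~ Z, dist = "gengamma",
                      data = d, weights = w)

# Model-based survival probabilities for each arm at selected times
summary(fit_wbl, newdata = data.frame(Z = c(0, 1)),
        type = "survival", t = c(2, 5, 10))
```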

3.4 Kaplan-Meier method

Another strategy is to avoid parametric assumptions entirely, and use Kaplan-Meier curves with IPTW to directly estimate the survival functions. [31, 32]

\hat{S}(t) = \prod_{t_{j} \leq t}\left(1 - \frac{\widehat{d_{j}^{w}}}{\widehat{n_{j}^{w}}}\right)

where $\widehat{d_{j}^{w}} = \sum_{i: T_{i} = t_{j}} \hat{w}_{i}\delta_{i}$ is the weighted number of events (with $\delta_{i}$ the event indicator) and $\widehat{n_{j}^{w}} = \sum_{i: T_{i} \geq t_{j}} \hat{w}_{i}$ is the weighted number of people at risk at time $t_{j}$. Weighted curves are easily estimated in R via the survival package.
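A direct transcription of the estimator, shown here as a sketch, makes the weighting explicit; in practice, survfit with a weights argument gives the same curve.

```r
# Sketch: weighted Kaplan-Meier computed directly from the formula above
weighted_km <- function(time, status, w) {
  tj <- sort(unique(time[status == 1]))         # distinct event times
  surv <- numeric(length(tj)); s <- 1
  for (k in seq_along(tj)) {
    d_w <- sum(w[time == tj[k] & status == 1])  # weighted events at t_j
    n_w <- sum(w[time >= tj[k]])                # weighted number at risk
    s <- s * (1 - d_w / n_w)
    surv[k] <- s
  }
  data.frame(time = tj, surv = surv)
}
# Equivalent: survfit(Surv(time, status) ~ 1, data = d, weights = w)
```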

3.5 Pseudo-observations

Another non-parametric method which can be used to estimate the survival curve is the recently developed pseudo-observations approach. [4, 5, 6] This framework uses a missing data approach to account for censoring. If no missing data or confounders are present, the ATE can be found using a simple average over the (possibly transformed) outcomes for each subject:

\theta = \mathrm{E}(f(Y)) = \frac{1}{n}\sum_{i} f(Y_{i}).

But if $f(Y_{i})$ is unknown for some subjects, we cannot calculate $\mathrm{E}(f(Y))$ directly. Instead, we can define the pseudo-observation for subject $i$ as

\hat{\theta}_{i} = n\hat{\theta} - (n - 1)\hat{\theta}^{-i}

where $\hat{\theta}$ is a consistent estimator of $\theta$, and $\hat{\theta}^{-i}$ is the estimator applied to all observations excluding subject $i$. The pseudo-observations are then used in place of all observations, not only those which are missing due to censoring.

In a survival context, our parameter of interest, $\theta$, is the probability of survival at time $t$, and $\theta_{i}$ is a binary indicator for the status (alive/dead) of subject $i$ at time $t$. We can use a non-parametric method, such as Kaplan-Meier, to estimate $S_{Z}(t)$, which is then used to find $\theta_{i}$ for all observations in arm $Z$.

\hat{\theta}_{Zi} = n_{Z}\hat{S}_{Z}(t) - (n_{Z} - 1)\hat{S}_{Z}^{-i}(t)

Heuristically, $n_{Z}\hat{S}_{Z}(t)$ is the expected number of patients alive at time $t$, and $(n_{Z} - 1)\hat{S}_{Z}^{-i}(t)$ is the expected number of patients other than patient $i$ alive at time $t$. $\hat{\theta}_{Zi}$ is used in place of the (possibly unobserved) survival status at time $t$. Therefore, the probability of being alive at time $t$ in arm $Z$ is

\hat{\mathrm{P}}(Y \mid Z, t) = \frac{1}{n_{Z}}\sum_{i}\hat{\theta}_{Zi}.

Incorporating propensity-score weights to account for confounding is straightforward.

\hat{\mathrm{P}}(Y \mid Z, t) = \frac{1}{n_{Z}}\sum_{i}\hat{w}_{i}\hat{\theta}_{Zi}

There are several benefits of using the pseudo-observations approach. It has been demonstrated to provide a valid estimate of the ATE in a causal framework, and requires weaker assumptions than regression-based models. It also supports estimation of many outcomes of interest. We can estimate $\hat{\mathrm{P}}(Y \mid Z, t)$ at each failure time, defining the survival curve, which in turn allows us to estimate the measures of interest described in section 3.1. Pseudo-observations can also be used to estimate RMS directly. This method can be implemented using the R package pseudo. [33]
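A sketch of this computation with the pseudo package follows; the arm-specific pseudo-observations are averaged with the IPTW as in the display above, and variable names are assumed.

```r
library(pseudo)
# Sketch: IPTW-averaged pseudo-observations for S(t) at selected times
t_eval <- c(2, 5, 10)
surv_by_arm <- sapply(split(d, d$Z), function(dz) {
  po <- pseudosurv(time = dz$time, event = dz$status, tmax = t_eval)
  # po$pseudo is an n_Z x length(t_eval) matrix of theta_hat_{Zi}
  colSums(dz$w * po$pseudo) / nrow(dz)  # (1/n_Z) * sum_i w_i * theta_hat_{Zi}
})
# surv_by_arm: weighted survival probabilities, one column per arm;
# pseudomean(dz$time, dz$status, tmax = 10) gives RMS pseudo-observations
```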

3.6 Variance estimation

To conduct inference for the outcomes of interest, we need to quantify uncertainty. Some methods have closed-form variance estimators for our estimands of interest, but others do not. This motivates the use of a non-parametric bootstrap to estimate variances and construct confidence intervals. [34] Bootstrap-based estimates are available for every combination of model and outcome we discuss above. Further, to correctly account for the variability of the estimated propensity scores $\hat{e}$, we can re-estimate the propensity score within each bootstrap iteration. [35]
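A minimal sketch of this bootstrap for one estimand (the survival difference at a fixed time T0) follows; the propensity model and variable names are assumed.

```r
library(survival)
# Sketch: non-parametric bootstrap, re-fitting the propensity model each draw
boot_ci <- function(d, B = 500, T0 = 5) {
  est <- numeric(B)
  for (b in seq_len(B)) {
    db <- d[sample(nrow(d), replace = TRUE), ]
    ps <- glm(Z ~ x1 + x2, family = binomial, data = db)  # re-estimate PS
    db$w <- ifelse(db$Z == 1, 1 / fitted(ps), 1 / (1 - fitted(ps)))
    sf <- survfit(Surv(time, status) ~ Z, data = db, weights = w)
    s <- summary(sf, times = T0)$surv
    est[b] <- s[2] - s[1]                       # Delta S at time T0
  }
  quantile(est, c(0.025, 0.975))                # percentile interval
}
```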

4 Simulation Studies

4.1 Simulation Methods

We tested the performance of the methods described above using simulation studies. We assessed the bias, coverage, and variance of each method's estimates of our outcomes of interest: median survival, restricted mean survival, and survival probability at time $T$. We based our covariate and outcome distributions and effect sizes on our NCDB cancer studies to mimic real-world scenarios.

First, we simulated a set of covariates based on the observed multivariate distribution of the NCDB covariates in the renal dataset. We used the R package genOrd [36] to draw variables representing gender, age, stage, histology, tumor grade, Charlson comorbidity score, race, ethnicity, insurance status, facility type, income, and education. As our goal here is to evaluate the survival analysis methods, and not the propensity score estimation methods, we held this set of covariates constant for each simulation; we employ simulated covariates instead of the actual covariates so that our data and simulation code can be shared.
(Available online at https://github.com/BethHandorf/NonPH_IPTW)

We used a simple logistic model with linear covariate effects to determine each subject's probability of being assigned to the treatment condition (versus control). The effect size (on the log scale) for each covariate ranged from 0.8 (effect of high vs low stage) to 0.03 (for each additional year of age). (See Appendix Table 1 for the full specification.) This corresponds roughly to the effect sizes estimated when modeling actual treatment allocation in the NCDB sample. This model defined the probability that each individual would be allocated to the treatment condition. We set the intercept such that, on average, the population's probability of receiving the treatment was 0.5.

Survival times were drawn based on a Weibull model with a time-varying treatment effect.

h(t) = \gamma\lambda t^{\gamma-1}\exp(\beta_{1}Z + f(Z, t) + \mathbf{X}\boldsymbol{\psi}) \qquad (2)

In our base case, the hazard ratio varied by log-time, $f(Z, t) = \beta_{2}Z\log(t)$; we set $\beta_{1} = -0.69$ and $\beta_{2} = 0.25$, with a sample size of 5000 total cases. For each individual, survival times under $Z = 1$ and $Z = 0$ were both drawn using the R package simsurv. [37] These true potential outcomes were used to calculate the true ATE for the estimands of interest. Observed treatment status was drawn from a binomial distribution based on each individual's probability of receiving the treatment, as defined by our propensity score model. Covariate effect sizes ranged from 0 to 0.8 (see Appendix Table 1).
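The base-case draws can be sketched with simsurv as follows; the baseline parameters shown are illustrative (not the values used in our study), X is the simulated covariate data frame (which must include an id column), and only one covariate effect is written out.

```r
library(simsurv)
# Sketch: potential outcomes under each arm for the base case,
# h(t) = gamma*lambda*t^(gamma-1) * exp(beta1*Z + beta2*Z*log(t) + X psi)
draw_arm <- function(z) {
  covs <- cbind(X, Z = z)                    # set Z = z for every subject
  simsurv(dist = "weibull",
          lambdas = 0.1, gammas = 1.2,       # assumed baseline parameters
          betas = c(Z = -0.69, age = 0.04),  # beta1 and one element of psi
          x = covs,
          tde = c(Z = 0.25),                 # beta2, time-dependent effect...
          tdefunction = "log",               # ...multiplied by log(t)
          maxt = 10)                         # follow-up capped at 10 years
}
t1 <- draw_arm(1)$eventtime                  # potential outcomes under Z = 1
t0 <- draw_arm(0)$eventtime                  # potential outcomes under Z = 0
```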

For each set of simulated outcomes, we first estimated the propensity scores using a logistic regression model, and then calculated respective IPTWs for each individual. Using these estimated weights, we fit each of the models described above in Section 3 using standard R packages, including flexsurv for parametric survival models and pseudo for pseudo-observations methods.

We repeated this simulation framework with various changes to Equation 2, for a total of five scenarios:

  1. Base case: $\beta_{1} = -0.69$; $f(Z, t) = 0.25Z\log(t)$

  2. Piecewise constant (PWC) hazard ratio: $\beta_{1} = 0$; $f(Z, t) = -0.25Z\,\mathrm{I}(t \geq 2)$, where $\mathrm{I}(\cdot)$ is the indicator function

  3. Modest non-proportionality: $\beta_{1} = -0.69$; $f(Z, t) = 0.125Z\log(t)$

  4. Modest treatment effect: $\beta_{1} = -0.41$; $f(Z, t) = 0.25Z\log(t)$

  5. Base case, with a smaller sample size (N=500 per arm)

Scenario 1 (base case) had a large treatment effect with substantial non-proportionality. Scenarios 2-4 were chosen to approximate the parameters and non-proportionality found in the clinical examples, and scenario 5 was used to assess how the methods perform with a more modest sample size.

4.2 Simulation Results

For each method, we assessed performance based on the mean bias, standard error, and coverage for the ATE for treatment versus control. For the base case, the bias was often largest for the standard Cox model, as expected given the substantial degree of non-proportionality in the simulated data. For example, considering median survival, bias for the Cox model was -0.347 years, while all other methods had biases no larger than 0.069 years in magnitude. The generalized gamma AFT model also had large bias relative to other methods; at 10 years, its bias was actually greater than that of the Cox model (2.5% vs 2.0%, respectively). Even though this three-parameter model is flexible and allows for non-proportionality, fitting this mis-specified parametric model still introduced substantial biases into the results, often comparable in size to those found using the standard Cox model. In the base case, the Cox model with a treatment effect varying by log-time was correctly specified, as was the fully parametric AFT model with variable location and scale parameters. These models had lower biases than the standard Cox model (e.g. for median survival, biases were -0.042 and -0.056, respectively). However, the non-parametric Kaplan-Meier and pseudo-observation methods generally had the lowest biases (e.g. biases for median survival were -0.021 and -0.020, respectively). Finally, we note that the RMS was less sensitive to the choice of method than the other outcomes of interest; the relative differences in the sizes of the biases were smallest for this survival outcome. (See Figure 1.)

As shown in Table 1, coverage was worst for the standard Cox model, followed by the generalized gamma AFT model. The time-varying Cox models did better, although coverage was surprisingly low for the log-time model at the 2-year outcome (0.82). Coverage for the non-parametric methods was close to the nominal level for all outcomes. Standard errors (SEs) of the differences in treatment effects were usually slightly larger for the non-parametric methods than for the parametric and semi-parametric methods. (See Appendix Figure 1.) Comparing the SEs for the Kaplan-Meier method versus the Weibull model, the increases ranged from 0.3% to 24.4%, and were lowest for RMS and highest for 2-year survival.

For piecewise constant hazards (scenario 2), biases were typically lowest for the non-parametric methods and when using a PWC time-varying Cox model. An exception to this was for RMS, where the naïve Cox model and Weibull AFT model had the lowest bias. This likely occurred because biases at different timepoints were in different directions (positive at 2 and 5 years, negative at 10 years), and may have cancelled each other out when calculating the area under the survival curve. (Note that Figure 1 shows the absolute value of the average biases.) So, although biases may have been lower in this simulation for these mis-specified models, such behavior should not be relied upon in other scenarios. Coverages (see Appendix Table 2) were close to the nominal level when biases were low. Standard errors were generally similar to the base case, except for median survival, where they were somewhat larger.

When the effect of non-proportionality was modest (scenario 3), biases tended to be smaller than those of the base case, especially for estimates of survival probabilities at a given timepoint. When the treatment effect was modest (scenario 4), results were generally similar to those of the base case. In the small sample size case (scenario 5), the bias was similar to that of the base case, with the exception of RMS, where across methods, biases were larger than those of the base case. As expected, SEs were higher when the sample size was smaller.

Figure 1: Mean bias (absolute value) in estimates from simulation studies

CTV=Cox Time-Varying; LT=Log-Time; PWC=Piece-Wise Constant; AFT=Accelerated Failure Time; GG=Generalized Gamma; WBL LS=Weibull Location-Scale; Pseudo=Pseudo-Observations; Wtd KM=Weighted Kaplan-Meier; NPH=Non-Proportional Hazards; TE=Treatment Effect; SS=Sample Size

Table 1: Coverage for base case
Method 2y 5y 10y Median RMS
Cox 0.00 0.07 0.61 0.81 0.71
CTV LT 0.82 0.91 0.93 0.94 0.93
CTV PWC 0.96 0.96 0.95 0.95 0.93
AFT GG 0.00 0.37 0.44 0.94 0.90
AFT WBL LS 0.89 0.93 0.93 0.95 0.93
Pseudo 0.97 0.97 0.96 0.95 0.94
Wtd KM 0.97 0.96 0.96 0.95 0.94

4.3 Computational resources required

The computational time and resources required to fit each model varied substantially. Depending on the size of the dataset, computational considerations may become important, especially given our use of the bootstrap to compute confidence intervals. The fastest method was the weighted KM. To fit a Cox model with a time-varying effect of log-time, we found it infeasibly slow to split the dataset at each observed failure time given our large sample size. Instead, we limited the splits to occur at one-month intervals; we tested this approach and found that the results were nearly identical to those obtained by splitting at each failure time. The pseudo-observations approach was also relatively slow, and highly memory intensive, even when we estimated $S(t)$ at one-month intervals. Table 2 shows the time required to fit each model once, and the total time required to calculate 500 bootstrap samples; a sketch of the monthly-split fit follows the table.

Table 2: Performance comparison: time required for N=5,000
Method Per bootstrap iteration (sec) Total time (min)
Cox 0.98 8.21
CTV LT* 7.60 63.44
CTV PWC 0.41 3.40
AFT GG 13.70 114.39
AFT Weibull LS 5.26 43.90
Pseudo** 2.17 18.14
Wtd KM 0.14 1.18
*28 times higher if split at each failure
**60 times higher if estimated at each failure
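The one-month splitting approximation referenced above can be sketched as follows; variable names are assumed, and log(time) is evaluated at the end of each short interval.

```r
library(survival)
# Sketch: split at one-month intervals rather than at every failure time
cuts <- seq(1 / 12, max(d$time), by = 1 / 12)
d_m <- survSplit(Surv(time, status) ~ ., data = d, cut = cuts)
fit_lt <- coxph(Surv(tstart, time, status) ~ Z + Z:log(time),
                data = d_m, weights = w)
```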

5 Comparative Effectiveness of Cancer Therapies

We applied the previously described methods to estimate the effect of treatments on survival in two clinical studies using the National Cancer Database (NCDB). The NCDB is a large registry which encompasses approximately 70% of newly diagnosed cancer cases in the United States. It contains variables describing patient and tumor characteristics, and the treatments they receive. [10] Here, we describe two studies we conducted using the NCDB that featured two different forms of non-proportional hazards. The first is a moderately sized study where two treatment groups had similar survival initially, but differences in long-term outcomes. The second example is a much larger study where benefits of one treatment were attenuated over time.

5.1 Sarcoma

Soft-tissue sarcoma is a rare cancer of the connective tissue (e.g. fat, muscle, blood vessels), usually arising in the extremities. Stage III disease (high grade, large, or deep tumor) is typically treated surgically or with radiation therapy. It remains uncertain whether chemotherapy provides an overall survival benefit for these patients, and its use is optional according to national guidelines. [38] We studied this question using the NCDB, creating a cohort of Stage III sarcoma patients (with various histologies) treated with definitive surgical resection of the primary tumor, with or without pre- or post-operative chemotherapy. We identified 5,337 cases with recorded overall survival data, of whom 28% were treated with chemotherapy. [8]

Due to the non-randomized nature of the data, there were substantial differences between treatment groups. For example, patients treated with chemotherapy were younger and had fewer recorded co-morbid conditions. Furthermore, the hazards for the treatment groups exhibited substantial non-proportionality, as is clearly evident upon examining the complementary log-log survival curves. (See Figure 2) Testing the Schoenfeld residuals provides further evidence of non-proportionality (P=0.047). We used IPTW to account for covariate imbalance between the treatment groups. The propensity score was estimated using all available covariates (including age, gender, insurance status, income, comorbidity score, tumor histology, grade, size, anatomic site, treating facility type, and travel distance.) The IPTW Kaplan-Meier curves are shown in Figure 2. In our original analysis, we also fit a time-varying Cox proportional hazards model, assuming a piecewise constant hazard ratio, which could change at two years. We found that there was little effect of chemotherapy during the first two years after diagnosis, but that patients treated with chemotherapy did better in the long term. These results motivated simulation scenario 2 (PWC).
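These diagnostics are straightforward to reproduce; a sketch with assumed variable names:

```r
library(survival)
# Sketch: proportional hazards diagnostics for the treatment comparison
fit <- coxph(Surv(time, status) ~ chemo, data = sarcoma, weights = w)
cox.zph(fit)                      # test based on scaled Schoenfeld residuals

km <- survfit(Surv(time, status) ~ chemo, data = sarcoma, weights = w)
plot(km, fun = "cloglog",         # complementary log-log curves:
     xlab = "Time (log scale)",   # non-parallel curves indicate
     ylab = "log(-log(S(t)))")    # non-proportional hazards
```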

Figure 2: Survival in stage III sarcoma by chemotherapy
Table 3: Chemotherapy in sarcoma: results from each model
RMS 2 years 5 years
Δ LCL UCL Δ LCL UCL Δ LCL UCL
Cox 0.452 0.146 0.757 0.036 0.013 0.060 0.052 0.018 0.088
TV Cox: log-T 0.457 0.132 0.752 0.018 -0.011 0.043 0.052 0.019 0.087
TV Cox: PWC 0.442 0.117 0.743 -0.003 -0.038 0.028 0.044 0.006 0.081
AFT Gen Gamma 0.409 0.114 0.680 0.039 0.011 0.064 0.047 0.014 0.079
AFT Wbl TV shape 0.468 0.160 0.773 0.026 0.002 0.048 0.054 0.020 0.090
Weighted K-M 0.439 0.109 0.746 -0.007 -0.043 0.026 0.041 0.003 0.078
Pseudo-obs* 0.439 0.110 0.745 -0.007 -0.043 0.026 0.041 0.003 0.078
*Centered weights
Δ = Difference between ATEs for chemo vs no chemo; boldface denotes significance
LCL = Lower confidence limit, UCL = Upper confidence limit

Here, we apply each of the methods discussed in the sections above to the sarcoma data. Results for RMS, 2-year survival, and 5-year survival are shown in Table 3. Median and 10-year survival were not estimable from the observed data, so we excluded these outcomes. We found that the different methods yielded different effect size estimates; however, these differences were often modest and in some cases would not change statistical inferences. For example, ΔRMS varied from a minimum of 0.409 (AFT with a generalized gamma distribution) to a maximum of 0.468, but all models showed an improvement in RMS for patients treated with chemotherapy. In other cases the inferences would differ based on the model chosen. At 2 years, the standard Cox model showed a difference of 3.6% between the treated and control arms, which was statistically significant. Both AFT models also found significantly higher survival in the chemotherapy group. However, the weighted Kaplan-Meier method showed a non-significant difference of -0.7%. Both time-varying Cox models and the pseudo-observations method also produced non-significant differences.

We note that in this data analysis example, the pseudo-observations method was sensitive to deviations of the mean of the estimated weights from 1. While the mean of the weights is 1 in expectation, in practice there may be small deviations from this value. The pseudo-observations method seemed to be more affected by these deviations than the other methods we evaluated. This sensitivity can be explained by how the weights are applied and how the survival function is estimated. Each patient's (binary) contribution to the survival function is weighted, and the weighted indicators are then averaged. Therefore, if the mean of the weights is not equal to 1, the survival function will not start at 1 at baseline. To address this issue in the estimation of the survival function when using pseudo-observations, we centered the weights so their mean was exactly 1. We note, however, that this fix is heuristic rather than theoretically derived. Further exploration of this issue is warranted.
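One way to implement this heuristic is a one-line rescaling within each arm, so that the weighted survival estimate starts at 1 at baseline; this is our ad hoc fix, not a general recommendation.

```r
# Heuristic: rescale the IPTW within each arm so their mean is exactly 1
d$w_centered <- ave(d$w, d$Z, FUN = function(x) x / mean(x))
```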

5.2 Renal Cancer

The second motivating clinical example for this work was a study of surgical options for early-stage renal cancer. These small tumors can be treated with Radical Nephrectomy (RN) or Partial Nephrectomy (PN). PN preserves more kidney tissue, but the surgery and recovery are more difficult, and it is more likely to be offered to healthier patients. [39, 9] Here, we focus on a subgroup from the original study: patients aged 51-60 with T1a tumors, which gives a sample size of 28,973 patients (61.1% treated with PN). In unadjusted analysis, patients given PN had improved survival over those given RN. However, these results are subject to confounding by indication. Examination of the complementary log-log plots shows a notable deviation from proportionality, which is confirmed by testing the Schoenfeld residuals (P < 0.0001). [26]

When we fit an IPTW Cox PH model to these data, allowing the hazards to vary as a function of log-time, such that $h(t) = \lambda_{0}(t)\exp(\beta_{1}Z + \beta_{2}Z\log(t))$, the maximum likelihood estimates are $\hat{\beta}_{1} = -0.468$ and $\hat{\beta}_{2} = 0.136$. Therefore, the non-proportionality observed here is close to our modest time-varying hazard simulation scenario (number 3), and the treatment effect is modest as well, similar to what we used in scenario 4.

Figure 3: Survival in early stage renal cancer by surgery type

Our results were largely similar, regardless of which model was used. (See Table 4.) This was particularly the case for the RMS and five-year survival. Two-year survival showed the largest range in results, with estimated differences ranging from 0.9% (generalized gamma AFT model) to 1.6% (Cox model with piecewise time-varying effects). The standard Cox model gave an estimated difference of 1%, compared to 1.5% from the Kaplan-Meier method. Even with these small effects, we found that none of the 95% confidence limits crossed zero, so all models would lead to the same inference: a small benefit for PN over RN. The congruous inferences were partially driven by the very large sample size, and the resulting narrow confidence intervals; however, given the clear evidence of non-proportionality, it is notable that the point estimates for the standard Cox model were close to those of the time-varying and non-parametric methods.

Table 4: Effect of PN vs RN in renal cancer: Results from each model
RMS 2 years 5 years
Δ LCL UCL Δ LCL UCL Δ LCL UCL
Cox 0.268 0.199 0.331 0.010 0.007 0.012 0.027 0.020 0.034
TV Cox: log-T 0.267 0.199 0.330 0.014 0.010 0.018 0.029 0.022 0.036
TV Cox: PWC 0.264 0.196 0.327 0.016 0.012 0.020 0.030 0.021 0.037
AFT Gen Gamma 0.263 0.194 0.326 0.009 0.007 0.012 0.026 0.020 0.033
AFT Wbl TV shape 0.264 0.196 0.329 0.019 0.014 0.023 0.031 0.024 0.038
Weighted K-M 0.268 0.200 0.333 0.015 0.011 0.020 0.029 0.020 0.036
Pseudo-obs* 0.269 0.200 0.334 0.015 0.011 0.020 0.029 0.020 0.036
*Centered weights
Δ = Difference between ATEs for PN vs RN; boldface denotes significance
LCL = Lower confidence limit, UCL = Upper confidence limit

6 Discussion

When analyzing survival outcomes, we often see non-proportional hazards. Based on our findings, it seems that non-proportionality is readily addressable in practice with weighted Kaplan-Meier curves, the simplest method we assessed. It performed quite well across a variety of scenarios and outcome measures of interest. The IPTW Kaplan-Meier method does not require specialized software or methods, and it also had the best computational performance. There is a penalty in terms of efficiency, but we found the increase in the size of the standard errors (compared to parametric methods) to be small in practice. We believe that the larger standard errors are a minor drawback when compared to the benefits of fewer assumptions and reduced bias.

In this work, we used simple IPTW weights. Alternative propensity score weights have also been proposed; [20] notably, variance stabilized weights have good properties, especially when one treatment is given to a small proportion of the patients. In our simulation studies, we used a simple functional form to model the covariates’ relationship with the probability of receiving treatment. Therefore, the simple logistic regression model used to estimate the propensity scores was correctly specified. In practice, there may be much more complex covariate effects, motivating more flexible procedures to estimate the propensity scores, such as ensemble machine learning methods. [18] Such methods have been shown to improve prediction, but are often more resource-intensive to implement.

Our simulation studies were directly informed by the two cancer studies discussed in Section 5. We sought to better understand how choice of method may affect results of these real-world analyses, and to learn broader lessons about how best to accommodate non-proportionality in large, observational studies. Although both clinical examples had large sample sizes, we found in simulations that the results held up even when sample sizes were small. Taken together, our simulation results and clinical examples show that one can easily protect against incorrect inferences using IPTW Kaplan-Meier curves to estimate treatment effects.

Acknowledgement

Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number P30CA006927 and U54CA221705. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The data used in the study are derived from a de-identified NCDB file. The American College of Surgeons and the Commission on Cancer have not verified and are not responsible for the analytic or statistical methodology employed, or the conclusions drawn from these data by the investigator.

References

  • [1] Klein JP and Moeschberger ML. Survival Analysis: Techniques for Censored and Truncated Data. 2nd ed. Springer, 2003.
  • [2] Kaplan EL and Meier P. Nonparametric Estimation from Incomplete Observations. J Am Stat Assoc 1958; 53(282): 457–481.
  • [3] Collett D. Modelling Survival Data in Medical Research. Chapman and Hall/CRC, 2015.
  • [4] Andersen PK, Hansen MG and Klein JP. Regression Analysis of Restricted Mean Survival Time Based on Pseudo-Observations. Lifetime Data Anal 2004; 10(4): 335–350.
  • [5] Andersen PK and Pohar Perme M. Pseudo-observations in survival analysis. Stat Methods Med Res 2010; 19(1): 71–99.
  • [6] Andersen PK, Syriopoulou E and Parner ET. Causal inference in survival analysis using pseudo-observations. Stat Med 2017; 36(17): 2669–2681.
  • [7] Jin Z, Lin DY and Ying Z. On Least-Squares Regression with Censored Data. Biometrika 2006; 93(1): 147–161.
  • [8] Movva S, von Mehren M, Ross EA et al. Patterns of Chemotherapy Administration in High-Risk Soft Tissue Sarcoma and Impact on Overall Survival. J Natl Compr Canc Netw 2015; 13(11): 1366–1374.
  • [9] Ristau BT, Handorf EA, Cahn DB et al. Partial nephrectomy is not associated with an overall survival advantage over radical nephrectomy in elderly patients with stage Ib-II renal masses: An analysis of the national cancer data base. Cancer 2018; 124(19): 3839–3848.
  • [10] Boffa DJ, Rosen JE, Mallin K et al. Using the National Cancer Database for Outcomes Research: A Review. JAMA Oncol 2017; 3(12): 1722–1728.
  • [11] Rosenbaum PR and Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983; 70: 412–8.
  • [12] Rosenbaum PR and Rubin DB. Bias in Observational Studies Using Subclassification on the Propensity Score. J Am Stat Assoc 1984; 79: 516–524.
  • [13] Rubin DB. Causal Inference Using Potential Outcomes. J Am Stat Assoc 2005; 100(469): 322–331.
  • [14] Holland PW. Statistics and Causal Inference. J Am Stat Assoc 1986; 81(396): 945–960.
  • [15] Rosenbaum PR. From Association to Causation in Observational Studies: The Role of Tests of Strongly Ignorable Treatment Assignment. J Am Stat Assoc 1984; 79(385): 41–48.
  • [16] Rosenbaum PR. Observational Studies. 2nd ed. Springer, 2002.
  • [17] Austin PC. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivar Behav Res 2011; 46(3): 399–424.
  • [18] van der Laan MJ, Polley EC and Hubbard AE. Super Learner. Stat Appl Genet Mol Biol 2007; 6(1).
  • [19] Funk MJ, Westreich D, Wiesen C et al. Doubly Robust Estimation of Causal Effects. Am J Epidemiol 2011; 173(7): 761–767.
  • [20] Austin PC and Stuart EA. Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Stat Med 2015; 34(28): 3661–3679.
  • [21] Lunceford JK and Davidian M. Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Stat Med 2004; 23(19): 2937–2960.
  • [22] Austin PC and Stuart EA. The performance of inverse probability of treatment weighting and full matching on the propensity score in the presence of model misspecification when estimating the effect of treatment on survival outcomes. Stat Methods Med Res 2017; 26(4): 1654–1670.
  • [23] Mao H, Li L, Yang W et al. On the propensity score weighting analysis with survival outcome: Estimands, estimation, and inference. Stat Med 2018; 37(26): 3745–3763.
  • [24] Schemper M, Wakounig S and Heinze G. The estimation of average hazard ratios by weighted Cox regression. Stat Med 2009; 28(19): 2473–2489.
  • [25] Conner SC, Sullivan LM, Benjamin EJ et al. Adjusted restricted mean survival times in observational studies. Stat Med 2019; 38(20): 3832–3860.
  • [26] Grambsch PM and Therneau TM. Proportional Hazards Tests and Diagnostics Based on Weighted Residuals. Biometrika 1994; 81(3): 515–526.
  • [27] Therneau TM. A Package for Survival Analysis in R, 2020.
  • [28] Wei LJ. The accelerated failure time model: A useful alternative to the Cox regression model in survival analysis. Stat Med 1992; 11(14-15): 1871–1879.
  • [29] Marshall AW and Olkin I. A new method for adding a parameter to a family of distributions with application to the exponential and Weibull families. Biometrika 1997; 84(3): 641–652.
  • [30] Jackson C. flexsurv: A Platform for Parametric Survival Modeling in R. J Stat Softw 2016; 70(8): 1–33.
  • [31] Cole SR and Hernán MA. Adjusted survival curves with inverse probability weights. Comput Methods Programs Biomed 2004; 75(1): 45–49.
  • [32] Xie J and Liu C. Adjusted Kaplan–Meier estimator and log-rank test with inverse probability of treatment weighting for survival data. Stat Med 2005; 24(20): 3089–3110.
  • [33] Pohar Perme M and Gerster M. pseudo: Computes Pseudo-Observations for Modeling, 2017.
  • [34] Efron B and Tibshirani RJ. An Introduction to the Bootstrap. Chapman and Hall/CRC, 1993.
  • [35] Austin PC. Variance estimation when using inverse probability of treatment weighting (IPTW) with survival analysis. Stat Med 2016; 35(30): 5642–5655.
  • [36] Barbiero A and Ferrari PA. GenOrd: Simulation of Discrete Random Variables with Given Correlation Matrix and Marginal Distributions, 2015.
  • [37] Brilleman S. simsurv: Simulate Survival Data, 2019.
  • [38] von Mehren M. NCCN Clinical Practice Guidelines in Oncology (NCCN Guidelines®): Soft Tissue Sarcoma, Version 2.2020, 2020.
  • [39] Motzer RJ. NCCN Clinical Practice Guidelines in Oncology (NCCN Guidelines®): Kidney Cancer, Version 1.2021, 2020.

Appendices

Table A1: Simulation model coefficients (β\beta)
Coefficients
Treatment model Survival model
Intercept 0.529 N/A
age 0.03 0.04
charlson 1 0.15 0.3
charlson 2+ 0.35 0.8
male 0.15 0.3
low.stage -0.8 -0.4
high.grade 0.4 0.2
histology 0.3 0.2
white -0.4 -0.1
hispanic -0.1 -0.2
facility 1 -0.15 0
facility 2 -0.2 0
facility 3 -0.25 0
income 1 -0.15 -0.1
income 2 -0.25 -0.1
income 3 -0.3 -0.2
education 1 -0.2 -0.3
education 2 -0.25 -0.2
education 3 -0.3 -0.15
insurance 1 -0.3 -0.1
insurance 2 -0.35 -0.2
Figure A1: Mean standard error of estimates from simulation studies

CTV=Cox Time-Varying; LT=Log-Time; PWC=Piece-Wise Constant; AFT=Accelerated Failure Time; GG=Generalized Gamma; WBL LS=Weibull Location-Scale; Pseudo=Pseudo-Observations; Wtd KM=Weighted Kaplan-Meier; NPH=Non-Proportional Hazards; TE=Treatment Effect; SS=Sample Size

Table A2: Coverage probabilities by scenario
Scenario Method 2y 5y 10y Median RMS
PWC Cox 0.00 0.51 0.30 0.96 0.95
PWC CTV LT 0.95 0.78 0.92 0.90 0.84
PWC CTV PWC 0.96 0.96 0.87 0.94 0.90
PWC AFT GG 0.00 0.75 0.10 0.90 0.91
PWC AFT WBL LS 0.03 0.91 0.81 0.92 0.94
PWC Pseudo 0.96 0.95 0.96 0.95 0.91
PWC Wtd KM 0.96 0.95 0.96 0.95 0.92
Modest NPH Cox 0.00 0.45 0.83 0.94 0.79
Modest NPH CTV LT 0.76 0.87 0.94 0.94 0.89
Modest NPH CTV PWC 0.96 0.97 0.96 0.95 0.92
Modest NPH AFT GG 0.13 0.72 0.80 0.95 0.89
Modest NPH AFT WBL LS 0.90 0.91 0.95 0.93 0.92
Modest NPH Pseudo 0.96 0.97 0.96 0.95 0.92
Modest NPH Wtd KM 0.97 0.98 0.97 0.95 0.92
Modest TE Cox 0.00 0.15 0.48 0.49 0.86
Modest TE CTV LT 0.78 0.93 0.92 0.94 0.94
Modest TE CTV PWC 0.94 0.95 0.95 0.96 0.94
Modest TE AFT GG 0.00 0.42 0.35 0.75 0.94
Modest TE AFT WBL LS 0.91 0.95 0.94 0.95 0.95
Modest TE Pseudo 0.96 0.95 0.94 0.95 0.95
Modest TE Wtd KM 0.96 0.95 0.95 0.96 0.95
Small SS Cox 0.02 0.64 0.88 0.93 0.86
Small SS CTV LT 0.93 0.93 0.95 0.97 0.93
Small SS CTV PWC 0.97 0.97 0.95 0.97 0.94
Small SS AFT GG 0.27 0.82 0.82 0.96 0.93
Small SS AFT WBL LS 0.91 0.93 0.95 0.96 0.95
Small SS Pseudo 0.96 0.98 0.96 0.96 0.96
Small SS Wtd KM 0.97 0.97 0.96 0.96 0.95