
“6 choose 4”: A framework to understand and facilitate discussion of strategies for overall survival safety monitoring

Godwin Yung
F. Hoffmann-La Roche AG, South San Francisco, CA, US
and
Kaspar Rufibach
F. Hoffmann-La Roche AG, Basel, Switzerland
and
Marcel Wolbers
F. Hoffmann-La Roche AG, Basel, Switzerland
and
Mark Yan
F. Hoffmann-La Roche AG, Mississauga, CAN
and
Jue Wang
F. Hoffmann-La Roche AG, South San Francisco, CA, US
Abstract

Advances in anticancer therapies have significantly contributed to declining death rates in certain disease and clinical settings. However, they have also made it difficult to power a clinical trial in these settings with overall survival (OS) as the primary efficacy endpoint. A number of statistical approaches have therefore been recently proposed for the pre-specified analysis of OS as a safety endpoint (Shan, 2023; Fleming et al., 2024; Rodriguez et al., 2024). In this paper, we provide a simple, unifying framework that includes the aforementioned approaches—and a couple of others—as special cases. By highlighting each approach’s focus, priority, tolerance for risk, and strengths or challenges for practical implementation, this framework can help to facilitate discussions between stakeholders on “fit-for-purpose OS data collection and assessment of harm” (American Association for Cancer Research, 2023). We apply this framework to a real clinical trial in large B-cell lymphoma to illustrate its application and value. Several recommendations and open questions are also raised.


Keywords: chronic disease; indolent cancers; intermediate outcome; oncology clinical trial; surrogate endpoint; survival detriment

1 Introduction

With the continued advancement of anticancer therapies, there are now many disease and clinical contexts in which patients’ lives are extended and multiple treatment options are available to patients following progression. While these successes should be celebrated, they have also made it challenging for clinical trials to have adequate statistical power to detect statistically significant improvements in overall survival (OS)—the gold standard endpoint in oncology (Merino et al., 2024). In July 2023, a workshop titled “Overall Survival in Oncology Clinical Trials” was held jointly by the U.S. Food and Drug Administration (FDA), American Association for Cancer Research (AACR), and American Statistical Association (ASA) to discuss challenges with timely assessment of OS for novel therapies (American Association for Cancer Research, 2023). The workshop cited lack of OS data, lack of plans for further OS data collection, and lack of pre-specified OS analyses as emerging challenges for regulatory authorities to interpret OS data in trials where OS is not a primary or key secondary endpoint. A call-to-action was made to “encourage thoughtful and comprehensive planning during trial design for fit-for-purpose OS data collection and assessment of harm given the disease and clinical context”.

Following that call-to-action, several statistical approaches have been proposed to date for the pre-specified analysis of OS as a safety endpoint. Fleming et al. (2024) proposed a monitoring guideline inspired by ones used by the FDA in cardiovascular safety trials. Rodriguez et al. (2024) suggested using confidence intervals to decide whether or not there is sufficient evidence to rule out harm. Shan (2023) adopted sample size calculation practices for non-inferiority trials to achieve desirable trial operating characteristics.

The existing approaches by Fleming et al. (2024), Rodriguez et al. (2024), and Shan (2023) may seem similar at first glance. For example, all three approaches consider similar sets of parameters, e.g., definition of harm measured in terms of the OS hazard ratio and type I error. However, upon closer look, it can be unclear even to the experienced statistician how these approaches relate to each other, how they differ, which approach one should use in practice, and why.

This paper proposes a framework for OS safety analysis that includes as special cases Fleming et al. (2024), Rodriguez et al. (2024), and Shan (2023). Importantly, this framework explains what flexibility and restrictions exist for any approach analyzing OS as a safety endpoint under the traditional framework of hypothesis testing, and clarifies for each approach their focus, priority, and tolerance for risk over the course of a trial. By increasing understanding of different approaches and the relationship between them, we believe this framework can help facilitate “thoughtful planning” by sponsors and improve communication between stakeholders in order to achieve alignment on clinical trial design and analysis.

The rest of the paper is organized as follows. In Section 2, we present our framework for OS safety monitoring and provide five examples of approaches that fall under this framework, including Fleming et al. (2024), Rodriguez et al. (2024), and Shan (2023). In Section 3, we illustrate and compare the five approaches by applying them to POLARIX, a real clinical trial studying patients with untreated large B-cell lymphoma. Section 4 concludes with a brief summary, recommendations for practice, and open questions for further deliberation or research.

2 Proposed framework

Consider a trial with $n$ patients, $n\pi$ randomized to the experimental arm and $n(1-\pi)$ randomized to the control arm. A prospectively designed hypothesis test for OS detriment—as measured by the OS hazard ratio (HR)—involves six parameters:

  1. $\theta_0$, the OS HR under the null hypothesis $H_0$ of OS detriment

  2. $\theta_1$, the OS HR under the alternative hypothesis $H_1$ of no OS detriment

  3. $d$, the number of deaths

  4. $\theta^*$, the threshold for decision making between $H_0$ vs. $H_1$

  5. $\alpha$, the one-sided type I error rate

  6. $\beta$, the type II error rate

For the purpose of safety monitoring in indolent cancer trials, we may limit ourselves to considering $\theta_0 > 1.0$ and moderate to neutral $\theta_1$ (e.g., between 0.7 and 1.0). Fleming et al. (2024) characterized $\theta_0$ as “the smallest unacceptable detrimental OS HR” and $\theta_1$ as “a plausible OS HR consistent with benefit that is reasonably expected from the experimental intervention”.

Two classical equations characterize the approximate relationship between the six parameters:

$$\alpha = \Pr(\hat{\theta} < \theta^* \mid \theta_0) = \Phi\left(\log(\theta^*/\theta_0)\sqrt{\pi(1-\pi)d}\right) \quad \text{(I)}$$
$$1-\beta = \Pr(\hat{\theta} < \theta^* \mid \theta_1) = \Phi\left(\log(\theta^*/\theta_1)\sqrt{\pi(1-\pi)d}\right) \quad \text{(II)}$$

where $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal distribution. These equations assume that the observed log hazard ratio $\log(\hat{\theta})$ follows a normal distribution with variance $1/(\pi(1-\pi)d)$ and mean $\log(\theta_0)$ or $\log(\theta_1)$, respectively (Schoenfeld, 1981).

Given six parameters and Equations (I)-(II) characterizing the relationship between them, our proposed framework is as follows: at each analysis time, users input or “choose” 4 of the 6 parameters (not all from the same equation) and solve for the remaining 2. This one-sentence framework communicates clearly what flexibility and restrictions exist in any OS monitoring strategy. It highlights a strategy’s focus and associated risks (e.g., choosing $\beta$ and solving for $\alpha$ at an early interim analysis implies a focus on having the trial continue in case the experimental therapy is effective, while understanding the risk of a false positive). And as we shall illustrate, it includes as special cases Fleming et al. (2024), Rodriguez et al. (2024), Shan (2023), and other approaches for OS safety monitoring.
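To make the mechanics concrete, the two solve directions used most often in what follows can be sketched in Python. This is a minimal sketch under the normal approximation of Schoenfeld (1981); the function names are ours, and the numeric inputs below are purely illustrative.

```python
from statistics import NormalDist
import math

N = NormalDist()  # standard normal; N.cdf is Phi, N.inv_cdf is Phi^{-1}

def solve_given_beta(theta0, theta1, beta, d, pi=0.5):
    """Choose (theta0, theta1, beta, d); solve Eq. (II) for theta*, then Eq. (I) for alpha."""
    se = 1.0 / math.sqrt(pi * (1.0 - pi) * d)  # SD of log(theta-hat), Schoenfeld approximation
    theta_star = theta1 * math.exp(N.inv_cdf(1.0 - beta) * se)
    alpha = N.cdf(math.log(theta_star / theta0) / se)
    return theta_star, alpha

def solve_given_alpha(theta0, theta1, alpha, d, pi=0.5):
    """Choose (theta0, theta1, alpha, d); solve Eq. (I) for theta*, then Eq. (II) for power."""
    se = 1.0 / math.sqrt(pi * (1.0 - pi) * d)
    theta_star = theta0 * math.exp(N.inv_cdf(alpha) * se)
    power = N.cdf(math.log(theta_star / theta1) / se)
    return theta_star, power

# Illustrative inputs only: theta0 = 1.30, theta1 = 0.80, 1:1 randomization
print(solve_given_beta(1.30, 0.80, 0.10, 89))     # theta* ~ 1.050, alpha ~ 0.157
print(solve_given_alpha(1.30, 0.80, 0.025, 178))  # theta* ~ 0.969, power ~ 0.900
```

Note that each function chooses one parameter from each of Equations (I) and (II) as an output, reflecting the restriction that the four inputs cannot all come from the same equation.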

Before proceeding, there are three important points that deserve clarification. First, $\alpha$ and $\beta$ here refer to error rates associated with a single test as opposed to a group-sequential test and/or multiple endpoints. More specifically, they refer to marginal type I and type II error rates for hypothesis testing of OS safety at interim and final analyses. They should not be confused with overall error rates across all analysis times. (Note that Fleming et al. refer to $\alpha$ and $\beta$ as false positive and false negative error rates. We prefer the terms marginal type I and type II error rates to more clearly distinguish them from overall error rates.) They should also not be confused with error rates for primary and key secondary efficacy analyses, which may or may not include OS. Whether OS should be formally tested as a safety endpoint, and how to do so in relation to other formally tested endpoints, is context specific and deserves careful consideration in practice. However, a thorough discussion of this topic is beyond the scope of this paper.

Second, the proposed framework should not be misunderstood as suggesting a single one-time calculation. Just because a user inputs certain parameters does not mean they should accept whatever the other parameters end up being. All six parameters are important and an iterative process is likely needed to arrive at a configuration that various stakeholders can agree on. The intention of our framework is not to encourage focus on some parameters while ignoring others, but rather to elucidate and encourage stakeholder discussion of the relative priorities and trade-offs between the parameters. Our framework may also help clarify how certain parameters are more informed/guided by data or additional (e.g., clinical or scientific) considerations compared to other parameters.

Third, the word “choose” should not be misunderstood as suggesting that sponsors are always at liberty to specify the value of a parameter. As with the previous note, certain parameter values may be driven by trial conduct, clinical considerations, or regulatory context. For example, in a trial of an indolent cancer where the timing of the first analysis is based on progression-free survival (PFS), the number of OS events at this time cannot be prespecified; it will simply be whatever is observed when the prespecified number of PFS events is reached. At the time of study design, we can input a predicted number of OS events to explore the operating characteristics of a safety monitoring strategy. However, when the actual trial takes place, the input will need to be updated based on the observed number of OS events. If the observed number of OS events differs substantially from the prediction, then the entire strategy and configuration may need to be revisited to ensure that stakeholders are realigned on the focus and risks.

Special case #1: Fleming et al. (2024)

Fleming et al.’s OS monitoring guideline can be understood under the proposed framework as choosing $(\theta_0, \theta_1, \beta, d)$ at interim analyses (IAs) and $(\theta_0, \theta_1, \alpha, d)$ at the final analysis (FA). As such, their approach places more emphasis on trial continuation while evidence is still accumulating, which could make sense in trials with limited OS events at early IAs. This is later counter-balanced by a focus on ruling out OS detriment with a reasonably low type I error rate $\alpha$ when more OS events have accumulated.

While representing Fleming et al.’s guideline in such a way makes it clearer and simpler to understand, it is worth reiterating that users should avoid strictly interpreting parameters as being either an input or an output. Beyond the mechanistic calculations implied by Equations (I)-(II), all parameters are important at each analysis time, whether it is $\alpha$ to protect patients from detriment or $\beta$ to ensure that trials of effective drugs are not prematurely terminated. Therefore, an iterative process is needed to find a “sweet spot” that balances the various parameter values, and we encourage users to think of Fleming et al. and all other approaches under this framework as having different starting lines, different implications with respect to implementation, and (perhaps slightly) different focuses and tolerances for risk.

Special case #2: Rodriguez et al. (2024)

Rodriguez et al. proposed to calculate the upper bound of the $100\times(1-2\alpha)\%$ confidence interval (CI) for the estimated OS HR, and to conclude low probability of substantial detriment if the upper bound is less than $\theta_0$, or high probability of substantial detriment otherwise. This approach is equivalent to choosing $(\theta_0, \theta_1, \alpha, d)$ at each analysis time. It intuitively uses CIs to ask, “Can OS detriment be ruled out?”, but risks stopping trials with effective experimental treatments early due to a decreased emphasis on $\beta$. (Note that “choosing” $\theta_1$ here does not directly follow from the use of CIs. Rather, it follows from Rodriguez et al.’s consideration of trials which evaluate for OS safety early and then for OS efficacy later when OS is sufficiently powered against a certain alternative $\theta_1$.)
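The CI rule can be evaluated directly from an observed HR estimate. The sketch below uses the same normal approximation as Equations (I)-(II); the function name and the numeric readouts are ours and purely hypothetical.

```python
from statistics import NormalDist
import math

def os_detriment_ruled_out(theta_hat, d, theta0, alpha=0.025, pi=0.5):
    """Rodriguez et al.-style check: is the upper bound of the
    100*(1-2*alpha)% CI for the OS HR below theta0?"""
    se = 1.0 / math.sqrt(pi * (1 - pi) * d)  # SE of log(theta-hat)
    upper = math.exp(math.log(theta_hat) + NormalDist().inv_cdf(1 - alpha) * se)
    return upper < theta0

# Hypothetical readouts: observed OS HR of 0.94, theta0 = 1.30
print(os_detriment_ruled_out(0.94, 131, 1.30))  # False: 95% CI upper bound ~1.32
print(os_detriment_ruled_out(0.94, 300, 1.30))  # True: more deaths tighten the CI
```

The two calls illustrate the approach’s dependence on event count: the same point estimate rules out $\theta_0 = 1.30$ only once enough deaths have accumulated.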

Special case #3: Shan (2023)

Shan (2023) proposed to choose $(\theta_0, \theta_1, \alpha, \beta)$ and to solve for $(d, \theta^*)$ in trials with a single readout. This approach is akin to traditional sample size calculation in non-inferiority trials. It can be generalized to trials with multiple readouts by choosing the same set of four parameters $(\theta_0, \theta_1, \alpha, \beta)$ at each analysis time. (Of course, users may choose a different set of parameter values at each time, e.g., $\alpha$ may decrease over time.) However, as mentioned before, the timing of the first IA (or first several IAs) in trials of indolent cancers is often driven by an intermediate or surrogate endpoint, not OS. In the likely situation that the observed number of deaths differs from the prespecified number $d$, it is unclear under this strategy what modifications would be made and how. For cancers with an extremely low mortality rate, $d$ deaths may not even be reachable within a reasonable timeframe.
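Solving Equations (I)-(II) for $(d, \theta^*)$ yields the familiar non-inferiority-style event size formula $d = (z_{1-\alpha} + z_{1-\beta})^2 / \{\pi(1-\pi)\log^2(\theta_0/\theta_1)\}$. A short sketch (function name ours, illustrative inputs):

```python
from statistics import NormalDist
import math

def shan_design(theta0, theta1, alpha, beta, pi=0.5):
    """Choose (theta0, theta1, alpha, beta); solve Eqs. (I)-(II) for (d, theta*)."""
    N = NormalDist()
    z = N.inv_cdf(1 - alpha) + N.inv_cdf(1 - beta)
    d = z**2 / (pi * (1 - pi) * math.log(theta0 / theta1)**2)  # required deaths (unrounded)
    theta_star = theta0 * math.exp(N.inv_cdf(alpha) / math.sqrt(pi * (1 - pi) * d))
    return d, theta_star

# Illustrative: theta0 = 1.30, theta1 = 0.80, one-sided alpha = 0.025, beta = 0.10
d, theta_star = shan_design(1.30, 0.80, 0.025, 0.10)
print(round(d), round(theta_star, 3))  # ~178 deaths, threshold ~0.969
```

In practice $d$ must be rounded to an integer, which perturbs the achieved $\alpha$ and $\beta$ slightly; this is a further reminder that the calculated design is a starting point rather than a fixed rule.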

Special case #4: Discrete thresholds

Another approach—one that we have seen study teams consider—is to choose $(\theta_0, \theta_1, d, \theta^*)$ at each analysis time, with $\theta^*$ being some discrete threshold such as 1.0, 1.1, or 1.2. There may be reasons for choosing such a $\theta^*$. Exceeding $\theta^* = 1.0$, in particular, implies that an OS detriment was observed in the trial overall. The OS HR point estimate is also a clinically important result that can have a strong influence on the public’s perception of a drug, providing another reason why sponsors may wish to apply a specific threshold.

However, with the choice of $(\theta_0, \theta_1, d, \theta^*)$ it is unclear what risks one is willing to tolerate and how uncertainty due to a small number of events is taken into consideration, if at all. Would $\theta^*$ need to be changed if a different number of deaths than $d$ is observed? If so, how? Moreover, $\theta^* = 1.0$ is not appropriate if $\theta_1$ is close to 1.0, since the marginal type II error rate will be approximately 50% regardless of how many OS events there are.

Special case #5: FDA guidance for evaluating cardiovascular risk in type 2 diabetes (2008)

Our final example is to choose $(\theta_1, \alpha, \beta, d)$ at IAs and $(\theta_0, \theta_1, \alpha, d)$ at the FA. This approach is based on monitoring guidelines that the FDA has used in trials of type 2 diabetes mellitus (T2D) to evaluate cardiovascular risk (FDA Center for Drug Evaluation and Research, 2008), guidelines which also inspired approach #1 by Fleming et al. (2024). The key difference between approaches #5 and #1 is that, while approach #1 addresses the question “With what certainty can we rule out a particular level of OS detriment?”, approach #5 places more emphasis on type I error control at IAs and addresses the question “What OS detriment can we rule out with a high level of certainty?”.

However, it is worth noting that both approaches lead to the same calculated thresholds $\theta^*$, because the thresholds are uniquely determined by the parameters $\theta_1$, $\beta$, and $d$. So in fact users do not need to choose between the two. Rather, they can use both approaches simultaneously to understand uncertainty and to evaluate benefit-risk. This can be especially useful when a clinically relevant $\theta_0$ has been defined a priori, but too few deaths have been observed to enable robust inference with respect to $\theta_0$.
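At an IA, approach #5 first obtains $\theta^*$ from Equation (II) and then inverts Equation (I) to find the $\theta_0$ that can be ruled out at level $\alpha$. A minimal sketch under the same normal approximation (function name ours, illustrative inputs):

```python
from statistics import NormalDist
import math

def ruled_out_theta0(theta1, alpha, beta, d, pi=0.5):
    """Choose (theta1, alpha, beta, d); solve Eq. (II) for theta*, then
    Eq. (I) for the smallest theta0 that can be ruled out at level alpha."""
    N = NormalDist()
    se = 1.0 / math.sqrt(pi * (1 - pi) * d)
    theta_star = theta1 * math.exp(N.inv_cdf(1 - beta) * se)
    theta0 = theta_star * math.exp(N.inv_cdf(1 - alpha) * se)
    return theta_star, theta0

# Illustrative IAs with theta1 = 0.80, alpha = 0.025, beta = 0.10
for d in (89, 131):
    ts, t0 = ruled_out_theta0(0.80, 0.025, 0.10, d)
    print(d, round(ts, 3), round(t0, 2))  # 89 -> 1.05, 1.59; 131 -> 1.001, 1.41
```

Because $\theta^*$ depends only on $(\theta_1, \beta, d)$, the thresholds agree with those produced by approach #1 with the same inputs; only the question answered differs.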

3 Case study

POLARIX was a Phase 3 randomized controlled trial that compared Pola-R-CHP and R-CHOP in patients with untreated large B-cell lymphoma (LBCL). At the time of the primary analysis for PFS and the first IA for OS, the trial provided statistically significant evidence of PFS benefit (PFS HR estimate 0.73; 95% CI 0.57 to 0.95), but immature and inconclusive OS data despite all patients having a minimum of 2 years of follow-up (OS HR estimate 0.94; 95% CI 0.67 to 1.33). Subsequently, an FDA ODAC meeting was held to discuss the benefit-risk of Pola-R-CHP, taking into consideration the uncertainty around OS due to the low number of deaths. For a detailed account of POLARIX and the ODAC’s eventual vote in favor of Pola-R-CHP, we refer the reader to the briefing document (FDA ODAC, 2023).

Fleming et al. (2024) illustrated their OS monitoring guideline by retrospectively applying it to POLARIX (Strategy 1, Table 1). Under our proposed framework, their guideline is equivalent to choosing $(d, \theta_0, \theta_1, \beta) = (89, 1.30, 0.80, 0.1)$ at IA1, $(d, \theta_0, \theta_1, \beta) = (131, 1.30, 0.80, 0.1)$ at IA2, and $(d, \theta_0, \theta_1, \alpha) = (178, 1.30, 0.80, 0.025)$ at the FA. For the rationale behind some of these parameter values, we refer the reader to Fleming et al. (2024). We now illustrate and compare the remaining four approaches described in Section 2 by choosing similar parameter values wherever the set of input parameters overlaps with Fleming et al.’s, and conventional/intuitive values wherever it does not.

Table 1: Various OS monitoring strategies, applied to the POLARIX trial and depicted under the proposed framework. For each strategy, the four input or “chosen” parameters at a given stage are indicated in bold. Overall operating characteristics if safety thresholds are strictly adhered to are also provided. Strategy 1 is identical to the case study presented by Fleming et al. (2024), minus one potential interim analysis at 60 deaths for simplicity.
OS monitoring strategy | Stage | # of deaths ($d$) | Effect under $H_0$ ($\theta_0$) | Effect under $H_1$ ($\theta_1$) | HR threshold for decision making ($\theta^*$) | Prob. of meeting threshold(s) under $H_0$ ($\alpha$) | Prob. of meeting threshold(s) under $H_1$ ($1-\beta$)
1. Fleming et al. (2024) | IA1 | 89 | 1.30 | 0.80 | 1.050 | 0.157 | 0.900
 | IA2 | 131 | 1.30 | 0.80 | 1.001 | 0.067 | 0.900
 | FA | 178 | 1.30 | 0.80 | 0.969 | 0.025 | 0.900
 | All three | | | | | 0.018^a | 0.829^b
2A. Rodriguez et al. (2024) | IA1 | 89 | 1.30 | 0.80 | 0.858 | 0.025 | 0.629
 | IA2 | 131 | 1.30 | 0.80 | 0.923 | 0.025 | 0.793
 | FA | 178 | 1.30 | 0.80 | 0.969 | 0.025 | 0.900
 | All three | | | | | 0.007^a | 0.600^b
2B. Rodriguez et al. (2024) | IA1 | 89 | 1.30 | 0.80 | 0.991 | 0.100 | 0.843
 | IA2 | 131 | 1.30 | 0.80 | 0.975 | 0.050 | 0.872
 | FA | 178 | 1.30 | 0.80 | 0.969 | 0.025 | 0.900
 | All three | | | | | 0.015^a | 0.783^b
3. Shan (2023) | IA1 | 111 | 1.30 | 0.80 | 1.020 | 0.100 | 0.900
 | IA2 | 145 | 1.30 | 0.80 | 0.990 | 0.050 | 0.900
 | FA | 178 | 1.30 | 0.80 | 0.969 | 0.025 | 0.900
 | All three | | | | | 0.018^a | 0.841^b
4. Discrete thresholds | IA1 | 89 | 1.30 | 0.80 | 1.100 | 0.215 | 0.933
 | IA2 | 131 | 1.30 | 0.80 | 1.000 | 0.067 | 0.899
 | FA | 178 | 1.30 | 0.80 | 1.000 | 0.040 | 0.932
 | All three | | | | | 0.027^a | 0.862^b
5. FDA guidance in T2D (2008) | IA1 | 89 | 1.59 | 0.80 | 1.050 | 0.025 | 0.900
 | IA2 | 131 | 1.41 | 0.80 | 1.001 | 0.025 | 0.900
 | FA | 178 | 1.30 | 0.80 | 0.969 | 0.025 | 0.900
 | All three | | | | | 0.018^c | 0.829^b
^a Probability of meeting all three thresholds (IA1, IA2, and FA) under $H_0$.
^b Probability of meeting all three thresholds (IA1, IA2, and FA) under $H_1$.
^c Probability of meeting all three thresholds (IA1, IA2, and FA) under HR = 1.30, the HR under $H_0$ at the FA.

Our first adaptation of Rodriguez et al. (2024) is based on the authors’ suggestion to use 95% confidence intervals for the OS HR to rule out OS detriment at all analysis times. This is equivalent to specifying $\alpha = 0.025$ throughout (Strategy 2A, Table 1). By emphasizing stringent control of type I error early, the probability of ruling out OS detriment when Pola-R-CHP is effective is substantially lower compared to Strategy 1 (0.629 with 89 deaths and 0.793 with 131 deaths, compared to 0.900 at both IAs).

Rodriguez et al. also suggested that one could relax the type I error rate from 0.025, although they did not provide guidance on how to actually choose $\alpha$. Here, we set $\alpha$ to 0.10 at IA1, 0.05 at IA2, and 0.025 at the FA—levels typical of drug development, which become more stringent as the event size increases (Strategy 2B, Table 1). Doing so results in slightly lower or similar probabilities of meeting the decision making threshold under $H_1$ compared to Strategy 1 (0.843, 0.872, and 0.900).

If there is a desire to control both type I and type II error rates, then one approach could be to use Shan’s method to calculate the required event size. For example, Strategy 3 in Table 1 shares the same type I error rates as Strategy 2B (0.10, 0.05, and 0.025) and the same type II error rate as Strategy 1 (0.10). However, the actual timing of the IAs in POLARIX was driven by PFS, not OS. Indeed, the trial conducted its analysis when 241 PFS events were observed. If instead the timing had been driven by OS, then the readout could have been substantially delayed, with an increased risk of overpowering PFS or sitting on PFS results for many months.

Strategy 4 considers the use of $\theta^* = 1.1$, 1.0, and 1.0 as the HR thresholds for decision making over time. We chose $\theta^* = 1.1$ for IA1 based on our observation that study teams tend to be conservative early in a trial when there are few data. Specific to this case study, Strategy 4 can be seen to have similar operating characteristics to Strategy 1. However, the slightly larger type I error rates of 0.215 and 0.040 at IA1 and the FA might be a safety concern. More generally, this strategy does not support a transparent discussion of the various risks, since it focuses on the point estimate—which is problematic from a statistical perspective when there are few deaths.

Strategy 5 calculates what $\theta_0$ can be ruled out at the IAs with $\alpha = 0.025$ and $\beta = 0.1$. We see in this case that $\theta_0 = 1.59$ and 1.41 given 89 and 131 deaths, respectively. As noted in Section 2, Strategies 1 and 5 result in the same calculated thresholds $\theta^*$. Therefore, a user can apply both strategies at an IA to understand with what level of certainty they can rule out $\theta_0 = 1.30$ ($\alpha = 0.157$ at IA1 and $\alpha = 0.067$ at IA2) and what OS detriment they can rule out with high certainty $\alpha = 0.025$ ($\theta_0 = 1.59$ at IA1 and $\theta_0 = 1.41$ at IA2). Using Strategy 5 alone may be problematic, as it does not allow a transparent discussion of the risks associated with a clinically relevant target, and may in fact claim a low risk for a clinically irrelevant question.

Besides the marginal risks involved at each stage of a trial, stakeholders should also consider the overall risk of the trial should the decision making thresholds $\theta^*$ be strictly followed. Table 1 reports for each strategy the probability of meeting all three thresholds (i.e., the probability of consistently being deemed safe) under $H_0$ and under $H_1$. It can be seen that all strategies have a high probability (>97%) of being flagged as unsafe at least once (i.e., of failing to meet a threshold at least once) during trial conduct if the drug is harmful. Meanwhile, four of the strategies have a >80% probability of consistently being deemed safe (i.e., of meeting all three thresholds) if the drug is effective, the exceptions being Strategies 2A (60.0%) and 2B (78.3%).
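These overall probabilities account for the correlation between analyses, which follows from the canonical independent-increments structure of sequential log-rank/score statistics. A Monte Carlo sketch for Strategy 1, under the same normal approximation used throughout (function name ours):

```python
import math
import random

def prob_meet_all(theta, deaths, thresholds, pi=0.5, n_sim=100_000, seed=1):
    """Monte Carlo estimate of P(theta-hat_k < theta*_k at every analysis)
    under true HR `theta`, using the canonical independent-increments model:
    score S_k ~ N(log(theta) * I_k, I_k) with information I_k = pi*(1-pi)*d_k."""
    rng = random.Random(seed)
    info = [pi * (1 - pi) * d for d in deaths]
    log_thr = [math.log(t) for t in thresholds]
    hits = 0
    for _ in range(n_sim):
        s, prev_i, ok = 0.0, 0.0, True
        for i_k, lt in zip(info, log_thr):
            inc = i_k - prev_i
            s += math.log(theta) * inc + rng.gauss(0.0, math.sqrt(inc))
            prev_i = i_k
            if s / i_k >= lt:  # log-HR estimate at analysis k misses the threshold
                ok = False
                break
        hits += ok
    return hits / n_sim

# Strategy 1: thresholds 1.050 / 1.001 / 0.969 at 89 / 131 / 178 deaths
print(prob_meet_all(1.30, [89, 131, 178], [1.050, 1.001, 0.969]))  # near 0.018
print(prob_meet_all(0.80, [89, 131, 178], [1.050, 1.001, 0.969]))  # near 0.829
```

The same function, with the thresholds and event counts swapped in, reproduces the “All three” rows for the other strategies in Table 1 up to simulation error.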

Rodriguez et al. (2024) considered controlling type I error in the strong sense. For Strategies 2A and 2B, the family-wise error rates (FWER; i.e., the probability of meeting at least one threshold under $H_0$) are 0.049 and 0.12, respectively (not shown in Table 1). Therefore, if there is a desire to control these rates at a specified level, say 0.025 or 0.10, then one can expect probabilities below 60.0% and 78.3% of meeting all three thresholds under $H_1$—levels of risk that sponsors might be unwilling to take. In the context of safety monitoring, where interest lies in flagging a treatment that causes OS detriment, we think the probability of consistently being deemed safe under $H_0$—or more precisely, its complement, the probability of being flagged as unsafe at least once if the drug is harmful—is a more meaningful quantity than the FWER.

4 Discussion

With advancements in anticancer therapy, it has become increasingly difficult for clinical trials in certain disease and clinical contexts to be powered for OS as an efficacy endpoint. Nevertheless, OS is an important endpoint for assessing safety, and rigorous benefit-risk assessment of novel therapies requires improved prospective planning, collection, and assessment of OS data. In this paper, we proposed a framework for the pre-specified analysis of OS as a safety endpoint (Section 2). Our framework asks users to choose, at each analysis time, 4 out of the 6 parameters appearing in the classical equations for statistical hypothesis testing. In doing so, users communicate clearly their focus, priority, and tolerance for risk over the course of a trial.

We described five special cases under the proposed framework: Fleming et al. (2024), Rodriguez et al. (2024), Shan (2023), an approach that uses only discrete thresholds for decision making, and an approach based on previous FDA monitoring guidelines for evaluating cardiovascular risk in T2D. We illustrated the use of the five approaches by applying them to the POLARIX trial (Section 3). While the main objective of our paper is to increase understanding of each approach, thereby facilitating planning by sponsors of future clinical trials and improving communication between stakeholders, we close with some final comments and recommendations regarding their use.

Of the five approaches mentioned in this paper, we think Fleming et al. (2024) is the most appropriate for practical use due to its balance in priority over time—which shifts from a focus on ensuring that potentially effective treatments will continue to be studied when there are few OS events, to a focus on ruling out OS detriment when more OS events have accumulated. This approach also has the advantage of making clear how decision making thresholds can be re-calculated at any time given the observed number of OS events.

Approach #5 based on FDA’s guidance in T2D can be used in tandem with Fleming et al. (2024) to better evaluate benefit-risk. Together, when deaths are still slowly accruing, the two approaches address the questions, “What OS detriment can we rule out with a high level of certainty?” (approach #5) and “With what certainty can we rule out a particular level of OS detriment?” (Fleming et al., 2024).

As for the remaining three approaches, users should use Rodriguez et al. (2024) with caution, as overemphasizing control of type I error early risks stopping the investigation of effective therapies prematurely. Users can choose to relax the marginal type I error rates, but more guidance is needed on how to do so. Shan (2023) cannot be used for earlier interim analyses whose timing is driven by an intermediate/surrogate endpoint, but it can be used for later interim analyses whose timings are driven by OS. Finally, use of discrete thresholds can be appropriate provided that the anticipated risks involved are understood and acceptable.

Nevertheless, for all three of these approaches, the actual type I and type II error rates from using the decision making thresholds will depend on the observed number of OS events. Therefore, sponsors should have a plan in place on how they will make adjustments in case the observed number of OS events at an analysis time deviates from the number of OS events anticipated at the planning stage. Alternatively, sponsors could re-frame their approach under Fleming et al. (2024) with similar operating characteristics. Measures of overall rates—such as the probability of a detrimental drug being flagged at least once for safety concern, or the probability that an effective drug is consistently deemed safe (Section 3)—can help to inform the choice of one approach over another, as well as to update the parameter values during trial conduct.

Regardless of the approach taken, the low mortality rate in indolent diseases makes rigorous OS monitoring challenging. Intercurrent events such as subsequent therapy may further increase uncertainty by confounding the treatment effect (Korn et al., 2011). Therefore, multiple iterations of calculations and discussions are likely needed in practice for stakeholders to achieve alignment. We agree with Fleming et al. (2024) that any set of proposed thresholds should be seen as non-binding “informative guidelines” as opposed to black-and-white “rules” for continuing or stopping a trial. We also find helpful the list of additional analyses suggested by Rodriguez et al. (2024) to increase understanding of immature OS data: analysis of biologically plausible subgroups, simulation and supplementary analyses to investigate the robustness of results under different assumptions, evaluation of other endpoints, results from other trials, use of real-world data, etc. Data monitoring committees may also find measures of stability (or graphical summaries such as the predicted interval plot) useful for understanding potential effect size estimates and the associated precision with trial continuation (Betensky, 2015; Li et al., 2009).

Several questions remain to be addressed for the practical implementation of OS safety monitoring. How should $\theta_0$ be defined, potentially taking into consideration the treatment effect that will be observed for other endpoints? (For example, a greater tolerance for OS risk might be more acceptable when a robust clinical benefit is ascertained on an intermediate/surrogate endpoint.) Likewise, how should $\theta_1$ be defined, given potential confounding by intercurrent events? How frequently should OS be evaluated for safety, i.e., how should we define the “analysis times” in the proposed framework? In what contexts should OS safety be formally tested, and how should it be done relative to other formally tested endpoints—which may or may not include OS for efficacy?

This paper proposes a supportive decision making framework based on a single criterion (i.e., $\theta^*$). However, it is worth noting that Rodriguez et al. (2024) and Shan (2023) considered introducing a ‘grey zone’ which, if the trial result were to fall into it, would prompt further OS follow-up. Doing so could have the effect of ruling out substantial OS detriment while limiting rejection of treatments with marginal OS benefit. While we did not consider such an extension here for simplicity of comparison and presentation, we find this idea of practical interest and worth pursuing in future discussions and research.

In conclusion, it is in everyone’s interest—patients, clinicians, regulators, and sponsors alike—for safe and effective cancer therapies to be developed in a timely manner. However, the low mortality rate in some disease settings can make it difficult to rigorously evaluate benefit-risk. There is a present need for stakeholders to come together, to communicate, and to align on the expectations and priorities of clinical trials with respect to the generation and interpretation of OS data. By simplifying the presentation of recent approaches for OS safety analysis and clarifying their focus, priority, and tolerance for risk over time, we believe the “6 choose 4” framework proposed in this paper can help to meet those needs.

Acknowledgements

We are grateful to Emmanuel Zuber, Lisa Hampson, Arunava Chakravartty, and other co-authors of Fleming et al. (2024) for their time and valuable feedback during the development of this manuscript.

Disclosure statement

The authors report that there are no competing interests to declare. All authors are employees of Roche and own stocks in this company.

References

  • American Association for Cancer Research (2023), ‘FDA-AACR-ASA Workshop: Overall Survival In Oncology Clinical Trials’. Available at: https://www.aacr.org/professionals/policy-and-advocacy/regulatory-science-and-policy/events/fda-aacr-asa-workshop-overall-survival-in-oncology-clinical-trials/.
  • Betensky, R. A. (2015), ‘Measures of follow-up in time-to-event studies: Why provide them and what should they be?’, Clinical Trials 12, 403–408.
  • FDA Center for Drug Evaluation and Research (2008), ‘Guidance for industry. Diabetes mellitus – evaluating cardiovascular risk in new antidiabetic therapies to treat type 2 diabetes’. Available at: https://downloads.regulations.gov/FDA-2008-D-0118-0029/content.pdf.
  • FDA ODAC (2023), ‘March 9, 2023. Meeting of the Oncologic Drugs Advisory Committee – Combined FDA and Genentech Briefing Document’. Available at: https://www.fda.gov/media/165961/download.
  • Fleming, T. R., Hampson, L. V., Bharani-Dharan, B., Bretz, F., Chakravartty, A., Coroller, T., Koukouli, E., Wittes, J., Yateman, N. and Zuber, E. (2024), ‘Monitoring overall survival in pivotal trials in indolent cancers’, Statistics in Biopharmaceutical Research. doi: 10.1080/19466315.2024.2365648.
  • Korn, E. L., Freidlin, B. and Abrams, J. S. (2011), ‘Overall survival as the outcome for randomized clinical trials with effective subsequent therapies’, Journal of Clinical Oncology 29, 2439–2442.
  • Li, L., Evans, S. R., Uno, H. and Wei, L. (2009), ‘Predicted interval plots (PIPS): A graphical tool for data monitoring of clinical trials’, Statistics in Biopharmaceutical Research 1, 348–355.
  • Merino, M., Kasamon, Y., Theoret, M., Pazdur, R., Kluetz, P. and Gormley, N. (2024), ‘Irreconcilable differences: The divorce between response rates, progression-free survival, and overall survival’, Journal of Clinical Oncology 41, 2706–2713.
  • Rodriguez, L. R., Gormley, N. J., Lu, R., Amatya, A. K., Demetri, G. D., Flaherty, K. T., Mesa, R. A., Pazdur, R., Sekeres, M. A., Shan, M., Snapinn, S., Theoret, M. R., Umoja, R., Vallejo, J., Warren, N. J. H., Xu, Q. and Anderson, K. C. (2024), ‘Improving collection and analysis of overall survival data’, Clinical Cancer Research. doi: 10.1158/1078-0432.CCR-24-0919.
  • Schoenfeld, D. (1981), ‘The asymptotic properties of nonparametric tests for comparing survival distributions’, Biometrika 68, 316–319.
  • Shan, M. (2023), ‘Assessment of OS data for safety as pre-specified endpoint for trials in indolent/early-stage cancers’. Presented at FDA-AACR-ASA Workshop. Available at: https://www.aacr.org/wp-content/uploads/2023/07/Session-2-Slides.pdf.