This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Regression Trees for Cumulative Incidence Functions

Youngjoo Cho, Annette M. Molinaro,
Chen Hu, and Robert L. Strawderman
[email protected]
Department of Biostatistics & Computational Biology
601 Elmwood Avenue, Box 630
University of Rochester
Rochester NY 14642
Abstract

The use of cumulative incidence functions for characterizing the risk of one type of event in the presence of others has become increasingly popular over the past decade. The problems of modeling, estimation and inference have been treated using parametric, nonparametric and semi-parametric methods. Efforts to develop suitable extensions of machine learning methods, such as regression trees and related ensemble methods, have begun only recently. In this paper, we develop a novel approach to building regression trees for estimating cumulative incidence curves in a competing risks setting. The proposed methods employ augmented estimators of the Brier score risk as the primary basis for building and pruning trees. The proposed methods are easily implemented using the R statistical software package. Simulation studies demonstrate the utility of our approach in the competing risks setting. Data from the Radiation Therapy Oncology Group (trial 9410) is used to illustrate these new methods.

Keywords: Brier score; CART; Cause-specific hazard; Fine and Gray model; Sub-distribution function

1 Introduction

A subject being followed over time may experience several types of events related, for example, to disease morbidity and mortality. For example, in a Phase III trial of concomitant versus sequential chemotherapy and thoracic radiotherapy for patients with inoperable non-small cell lung cancer (NSCLC) conducted by the Radiation Therapy Oncology Group (RTOG), patients were followed up to 5 years, the occurrence of either disease progression or death being of particular interest. Such “competing risks” data are commonly encountered in cancer and other biomedical follow-up studies, in addition to the potential complication of right-censoring on the event time(s) of interest.

Two quantities are often used when analyzing competing risks data: the cause-specific hazard function (CSH) and the cumulative incidence function (CIF). For a given event, the former describes the instantaneous risk of this event at time tt, given that no events have yet occurred; the latter describes the probability of occurrence, or absolute risk, of that event across time and can be derived directly from the subdistribution hazard function (fine1999proportional). The literature on competing risks is growing rapidly, particularly for hazard-based regression modeling; see dignam for a contemporary review, where parametric and semi-parametric approaches to modeling both the CSH and CIF are considered. The literature on tree-based methods for competing risks remains much less developed. leblanc1992relative extended the classification and regression tree (CART) procedure of breiman1984classification to the case of right-censored survival data by replacing the squared error loss function ordinarily used for a continuous outcome with an approximation to the likelihood function derived under a certain proportional hazards assumption. Motivated by methods for handling monotone missing data, molinaro2004tree later proposed using an inverse probability weighted (IPCW) squared error loss function to directly generalize the original CART procedure for right-censored survival data, obtaining a procedure that reduces to the standard form of CART for a continuous outcome when censoring is absent. steingrimsson2016doubly recently proposed a “doubly robust” extension motivated by semiparametric efficiency theory for missing data (robins1994estimation; tsiatis2007semiparametric), and demonstrated significantly improved performance in comparison to the IPCW-based tree procedure of molinaro2004tree. However, such methods cannot be used when there is more than one type of event (i.e., in addition to censoring). For two or more competing risks, callaghan2008 proposed two methods of building classification trees for the CIF: by maximizing between-node heterogeneity via the two-sample log-rank test of gray1988class; and, by maximizing within-node homogeneity using sums of event-specific martingale residuals. To our knowledge, no software package is available that implements these methods, and more importantly, there is also no software specifically targeted to the problem of fitting regression trees to competing risk outcomes. In contrast, ensemble methods for estimating the CIF as a function of covariates do exist; see, for example, ishwaran2014random and mogensen2013random. ishwaran2014random implement their methods in the randomForestSRC package (rfsrc), where the unpruned trees that make up the bootstrap ensemble are typically built using splitting criteria designed for group comparisons with a single competing risk outcome, such as the generalized logrank test.

This paper proposes a novel CART-based approach to building regression trees under competing risks. In Section 2, we introduce the relevant data structures and describe a direct approach to building regression trees for estimating the CIF for a specified cause when there is no loss to follow-up. Section 3 develops the necessary extensions of these loss functions for right-censored outcomes, along with extensions for estimating the CIF at several time points. These methods focus on directly estimating the CIF, avoiding estimation of the cause-specific or subdistribution hazard function. In Section 4, simulation studies are used to investigate performance of these new methods and Section 5 summarizes analysis results for the RTOG 9410 Phase III lung cancer trial mentioned at the beginning of this section. The paper concludes with comments on future work in Section 6.

2 CIF Regression Trees with Uncensored Data

This section introduces the relevant competing risks data structure and reviews a CART-based approach to building regression trees for estimating the CIF in the case where there is no other loss to follow-up (i.e., no random censoring). This necessary background will allow us to develop an analogous approach when there is loss to follow-up.

2.1 Full Data Structure

Let T(m)T^{(m)} be the time to failure for failure type m=1,,K,K2m=1,\ldots,K,K\geq 2 and let WW be a vector of pp covariates, where W𝒮pW\in{\cal S}\subset\mathbb{R}^{p}. Let T=min(T(1),,T(K))T=\min(T^{(1)},\ldots,T^{(K)}) be the minimum of all latent failure times; it is assumed (T(1),,T(K))(T^{(1)},\ldots,T^{(K)}) has a bounded multivariate density function and hence that TT is continuous. Then, in the absence of other loss to follow-up, F=(T,W,M)F=(T,W,M) is assumed to be the fully observed (or full) data for a subject, where MM is the event type corresponding to T.T. This assumption implies that (T(M),M,W)(T^{(M)},M,W) is observed and, in addition, that T(m)>T(M)T^{(m)}>T^{(M)} for every mM;m\neq M; however, T(m)T^{(m)} itself is not assumed to be observed for mMm\neq M. Let =(F1,,Fn)\mathcal{F}=(F_{1},\ldots,F_{n}) be the full data observed on nn independent subjects, where Fi=(Ti,Wi,Mi),i=1,,nF_{i}=(T_{i},W_{i},M_{i}),i=1,\ldots,n are assumed to be identically distributed (i.i.d.). The dependence structure of (T(1),,T(K))(T^{(1)},\ldots,T^{(K)}) is left unspecifed.

2.2 CIF Estimation Under a Brier Loss Function

Let Ψ0={ψ0m(t;w)=P(Tt,M=m|W=w),t>0,m=1,,K}.\Psi_{0}=\{\psi_{0m}(t;w)=P(T\leq t,M=m|W=w),t>0,m=1,\ldots,K\}. The set of CIFs Ψ0\Psi_{0} can be estimated from the data {\cal F} using any of several suitable parametric or semiparametric methods without further assumptions (e.g., independence of T(1),,T(K)T^{(1)},\ldots,T^{(K)}). This section introduces a loss-based method for estimating ψ0m(t;w)\psi_{0m}(t;w) for a fixed cause mm and time point t>0t>0 in the case where ψ0m(t;w)\psi_{0m}(t;w) is piecewise constant as a function of W.W. This is a key step in our proposed approach to building a regression tree to estimate ψ0m(t;w).\psi_{0m}(t;w).

Let 𝒩1,,𝒩L\mathcal{N}_{1},\ldots,\mathcal{N}_{L} form a partition of 𝒮.{\cal S}. Define β0lm(t)=P(Tt,M=m|W𝒩l)\beta_{0lm}(t)=P(T\leq t,M=m|W\in\mathcal{N}_{l}) and suppose ψ0m(t;w)=l=1Lβ0lm(t)I{W𝒩l};\psi_{0m}(t;w)=\sum_{l=1}^{L}\beta_{0lm}(t)I\{W\in\mathcal{N}_{l}\}; that is, subjects falling into partition 𝒩l\mathcal{N}_{l} all share the same CIF β0lm(t).\beta_{0lm}(t). Define Zm(t)=I(Tt,M=m)Z_{m}(t)=I(T\leq t,M=m) and let ψm(t;w)=l=1Lβlm(t)I{W𝒩l}\psi_{m}(t;w)=\sum_{l=1}^{L}\beta_{lm}(t)I\{W\in\mathcal{N}_{l}\} be a model for ψ0m(t;w).\psi_{0m}(t;w). Then, fixing both t>0t>0 and m,m, the Brier loss (cf., brier1950verification) Lm,tfull(F,ψm)={Zm(t)ψm(t;w)}2=l=1LI{W𝒩l}{Zm(t)βlm(t)}2L_{m,t}^{full}(F,\psi_{m})=\{Z_{m}(t)-\psi_{m}(t;w)\}^{2}=\sum_{l=1}^{L}I\{W\in\mathcal{N}_{l}\}\{Z_{m}(t)-\beta_{lm}(t)\}^{2} is an unbiased estimator of the risk (t,ψm)=E[l=1LI{W𝒩l}{Zm(t)βlm(t)}2],\Re(t,\psi_{m})=E[\sum_{l=1}^{L}I\{W\in\mathcal{N}_{l}\}\{Z_{m}(t)-\beta_{lm}(t)\}^{2}], or equivalently, (t,ψm)=l=1LP{W𝒩l}{β0lm(t)βlm(t)}2.\Re(t,\psi_{m})=\sum_{l=1}^{L}P\{W\in\mathcal{N}_{l}\}\{\beta_{0lm}(t)-\beta_{lm}(t)\}^{2}. Assuming that {\cal F} is observed,

Lm,temp(,ψm)=1ni=1nLm,tfull(Fi,ψm)=1ni=1nl=1LI{Wi𝒩l}{Zim(t)βlm(t)}2\displaystyle L_{m,t}^{emp}({\cal F},\psi_{m})=\frac{1}{n}\sum_{i=1}^{n}L_{m,t}^{full}(F_{i},\psi_{m})=\frac{1}{n}\sum_{i=1}^{n}\sum_{l=1}^{L}I\{W_{i}\in\mathcal{N}_{l}\}\{Z_{im}(t)-\beta_{lm}(t)\}^{2} (1)

is also an unbiased estimator of (t,ψm).\Re(t,\psi_{m}). When considered as a function of the set of scalar parameters βlm(t),l=1,,L\beta_{lm}(t),l=1,\ldots,L (i.e., for fixed mm and tt), the empirical Brier loss (1) is minimized when ψm(t;w)=ψ^m(t;w)=l=1LI{Wi𝒩l}β^lm(t),\psi_{m}(t;w)=\hat{\psi}_{m}(t;w)=\sum_{l=1}^{L}I\{W_{i}\in\mathcal{N}_{l}\}\hat{\beta}_{lm}(t), where

β^lm(t)=i=1nI{Wi𝒩l}Zim(t)i=1nI{Wi𝒩l}\displaystyle\hat{\beta}_{lm}(t)=\frac{\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}Z_{im}(t)}{\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}} (2)

is a nonparametric estimate for β0lm(t),m=1,,K.\beta_{0lm}(t),m=1,\ldots,K.

2.3 Estimating a CIF Regression Tree Using CART

The CART algorithm of breiman1984classification estimates a regression tree as follows:

  1. 1.

    Using recursive binary partitioning, grow a maximal tree by selecting a (covariate, cutpoint) combination at every stage that minimizes an appropriate loss function;

  2. 2.

    Using cross-validation, select the best tree from the sequence of candidate trees generated by Step 1 via cost complexity pruning (i.e., using penalized loss).

In its most commonly used form, CART estimates the conditional mean response as a piecewise constant function on the covariate space, making all decisions on the basis of minimizing squared error loss. The resulting tree-structured regression function estimates the predicted response in each terminal node using the sample mean of the observations falling into that node. This process generalizes in a straightforward way to more general loss functions.

In the absence of censoring and under a piecewise constant model

ψm(t;w)=l=1Lβlm(t)I{W𝒩l}\displaystyle\psi_{m}(t;w)=\sum_{l=1}^{L}\beta_{lm}(t)I\{W\in\mathcal{N}_{l}\} (3)

for ψ0m(t;w),\psi_{0m}(t;w), Section 2.2 shows that a nonparametric estimate for ψ0m(t;w)\psi_{0m}(t;w) for a given cause mm at a fixed t>0t>0 can be obtained by minimizing the loss (1), a problem equivalent to estimating the conditional mean response from the modified dataset red,t={(Zim(t),Wi),i=1,n}{\cal F}_{red,t}=\{(Z_{im}(t),W^{\prime}_{i})^{\prime},i=1,\ldots n\} by minimizing the squared error loss (1). Thus, any implementation of CART for squared error loss (e.g., rpart) applied to red,t{\cal F}_{red,t} will produce a regression tree estimate of ψ0m(t;w).\psi_{0m}(t;w). In particular, CART estimates LL and the associated terminal nodes {𝒩1,,𝒩L}\{\mathcal{N}_{1},\ldots,\mathcal{N}_{L}\} from the data red,t,{\cal F}_{red,t}, and within each terminal node, estimates ψ0m(t|w)\psi_{0m}(t|w) by (2). While statistically inefficient, this process can be repeated for each m=1,,Km=1,\ldots,K and any t>0t>0 to generate nonparametric estimates of the CIF at a given set of time points for every cause.

3 CIF Regression Trees with Censored Data

With “full data” the squared error loss plays a critical role in both steps of the CART algorithm described at the beginning of Section 2.3 when estimating a conditional mean (e.g., breiman1984classification, Chapter 3.3). In follow-up studies involving competing risks outcomes, the full data {\cal F} might not be observed due to loss to follow-up. In this case, estimating ψ0m(t;w)\psi_{0m}(t;w) for a specified mm under the squared error loss (1) is not possible and one cannot run the CART procedure of Section 2.3 as described.

An attractive way to overcome this difficulty is to use to construct an observed data loss function that has the same risk as the desired full data loss function (c.f., molinaro2004tree; lostritto2012partitioning; steingrimsson2016doubly), that is, the empirical Brier loss (1). This section extends the developments of Section 2 to the case of right-censored competing risks by deriving several new observed data loss functions that share the same risk as (1). By substituting any of these new loss functions in for (1) everywhere throughout the CART algorithm of Section 2.3, one obtains a new regression tree method for estimating ψ0m(t;w).\psi_{0m}(t;w). Similarly to steingrimsson2016doubly, direct implementation is possible using rpart, which provides users with the ability to customize loss functions and decision making procedures. An appealing feature of the algorithms induced by these new loss functions is that each also reduces to the algorithm described in Section 2.3 when censoring is absent.

3.1 Observed Data Structure

Let CC be a continuous right-censoring random variable that, given WW, is statistically independent of (T,M).(T,M). For a given subject, assume that instead of FF we only observe O={T~,Δ,MΔ,W},O=\{\tilde{T},\Delta,M\Delta,W\}, where T~=min(T,C)\tilde{T}=\min(T,C) and Δ=I(TC)\Delta=I(T\leq C) is the (any) event indicator. The observed data on nn i.i.d. subjects is 𝒪=(O1,,On).\mathcal{O}=(O_{1},\ldots,O_{n}). Similarly to the case where K=1,K=1, random censoring on TT permits estimation of the CIF from the data 𝒪{\cal O}.

3.2 Estimating (t,ψm)\Re(t,\psi_{m}) via IPCW Loss

Fix t>0t>0 and let tt.t^{*}\geq t. Define G0(s|W)=P(Cs|W)G_{0}(s|W)=P(C\geq s|W) for any s0s\geq 0 and suppose that G0(Ti|Wi)ϵG_{0}(T_{i}|W_{i})\geq\epsilon almost surely for some ϵ>0\epsilon>0. Let Δi(t)=I(Ti(t)C)\Delta_{i}(t^{*})=I(T_{i}(t^{*})\leq C) and Ti(t)=Tit;T_{i}(t^{*})=T_{i}\wedge t^{*}; in addition, define Z~im(t)=I(T~it,Mi=m),\tilde{Z}_{im}(t)=I(\tilde{T}_{i}\leq t,M_{i}=m), i=1,,n.i=1,\ldots,n. Easy calculations then show

E[Δi(t)G0(T~i(t)|Wi)(Z~im(t)ψm(t;Wi))2]=E[(Zim(t)ψm(t;Wi))2]=(t,ψm)E\left[\frac{\Delta_{i}(t^{*})}{G_{0}(\tilde{T}_{i}(t^{*})|W_{i})}(\tilde{Z}_{im}(t)-\psi_{m}(t;W_{i}))^{2}\right]=E\left[(Z_{im}(t)-\psi_{m}(t;W_{i}))^{2}\right]=\Re(t,\psi_{m})

for a fixed ψm(t;w).\psi_{m}(t;w). This risk equivalence motivates the construction of an IPCW-type observed data loss function that is an unbiased estimator for (t,ψm).\Re(t,\psi_{m}). Define for any suitable survivor function G(|)G(\cdot|\cdot)

Lm,tipcw(𝒪,ψm;t,G)=1ni=1nl=1LI{Wi𝒩l}[Δi(t){Z~im(t)βlm(t)}2G(T~i(t)|Wi)];\displaystyle L_{m,t}^{ipcw}({\cal O},\psi_{m};t^{*},G)=\frac{1}{n}\sum_{i=1}^{n}\sum_{l=1}^{L}I\{W_{i}\in\mathcal{N}_{l}\}\bigg{[}\frac{\Delta_{i}(t^{*})\{\tilde{Z}_{im}(t)-\beta_{lm}(t)\}^{2}}{G(\tilde{T}_{i}(t^{*})|W_{i})}\bigg{]}; (4)

then, Lm,tipcw(𝒪,ψm;t,G0)L_{m,t}^{ipcw}({\cal O},\psi_{m};t^{*},G_{0}) has the same risk as (1) and it follows that (4) is minimized by

β^lmipcw(t;t,G)=i=1nI{Wi𝒩l}Δi(t)Z~im(t)G(T~i(t)|Wi)i=1nI{Wi𝒩l}Δi(t)G(T~i(t)|Wi),l=1,,L.\displaystyle\hat{\beta}^{ipcw}_{lm}(t;t^{*},G)=\frac{\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}\frac{\Delta_{i}(t^{*})\tilde{Z}_{im}(t)}{G(\tilde{T}_{i}(t^{*})|W_{i})}}{\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}\frac{\Delta_{i}(t^{*})}{G(\tilde{T}_{i}(t^{*})|W_{i})}},l=1,\ldots,L. (5)

Observe that (4) and (5) respectively reduce to (1) and (2) when censoring is absent.

When K=1K=1 and t=,t^{*}=\infty, we have Δi()=Δi\Delta_{i}(\infty)=\Delta_{i} and Ti()=Ti;T_{i}(\infty)=T_{i}; the loss function (4) is then just a special case of that considered in molinaro2004tree (see also steingrimsson2016doubly). Similarly, for K=1K=1 and setting t=t,t^{*}=t, the loss function (4) is just that considered in lostritto2012partitioning. Hence, (4) extends several existing loss functions to the problem of estimating a CIF. In practice, an estimator G^(|)\hat{G}(\cdot|\cdot) for G0(|)G_{0}(\cdot|\cdot) is used in (4); popular approaches here include product-limit estimators derived from the Kaplan-Meier and Cox regression estimation procedures.

Given t>0,t>0, different choices of tt^{*}\geq in (4) generate different losses, hence different CART algorithms. However, there are just two important choices of ttt^{*}\geq t in (4): t=t^{*}=\infty (standard IPCW; IPCW1{}_{\!1}) and t=tt^{*}=t (modified IPCW; IPCW2{}_{\!2}). In selecting t=t,t^{*}=t, any observations that are censored after time tt can still contribute (all-cause failure) information to (4); as tt^{*}\rightarrow\infty, the influence of these observations eventually vanishes. Consequently, the IPCW2{}_{\!2} loss uses more of the observed data in calculating the loss function and associated minimizers compared to the IPCW1{}_{\!1} loss and one can therefore expect a regression tree built using (4) for t=tt^{*}=t to perform as well or better than one derived from (4) with t=t^{*}=\infty. The use of t=tt^{*}=t in place of t=t^{*}=\infty has an additional practical advantage: since Ti(t)TiT_{i}(t^{*})\leq T_{i} for every i,i, we have G^(T~i(t)|Wi)G^(T~i|Wi),\hat{G}(\tilde{T}_{i}(t^{*})|W_{i})\geq\hat{G}(\tilde{T}_{i}|W_{i}), making it easier to satisfy the empirical positivity condition required for implementation.

3.3 Improving the Efficiency of IPCW Loss Functions

3.3.1 Estimating (t,ψm)\Re(t,\psi_{m}) by augmenting the IPCW loss

As in steingrimsson2016doubly, one can employ semiparametric estimation theory for missing data to construct an improved estimator of the full data risk (t,ψm).\Re(t,\psi_{m}). In particular, one can augment the IPCW loss function (4) with additional information on censored subjects, thereby incorporating additional information into the tree building process.

Consider first the loss function (4) with t=;t^{*}=\infty; we denote this by Lm,tipcw(𝒪,ψm;G).L_{m,t}^{ipcw}({\cal O},\psi_{m};G). Define Ψ0={ψ0r(s;w),s0,w𝒮,r=1,,K}\Psi_{0}=\{\psi_{0r}(s;w),s\geq 0,w\in{\cal S},r=1,\ldots,K\} as the set of CIFs of interest and let Ψ\Psi denote a corresponding model that may or may not coincide with Ψ0\Psi_{0}. For any t,u0t,u\geq 0 and w𝒮,w\in{\cal S}, define Vlm(u;t,w,Ψ)=EΨ[(Zm(t)βlm(t))2|Tu,W=w];V_{lm}(u;t,w,\Psi)=E_{\Psi}[(Z_{m}(t)-\beta_{lm}(t))^{2}|T\geq u,W=w]; it is shown later how this expression depends on Ψ.\Psi. Then, fixing β1m(t),,βLm(t),\beta_{1m}(t),\ldots,\beta_{Lm}(t), the augmented estimator of (t,ψm)\Re(t,\psi_{m}) having the smallest variance that can be constructed from the unbiased estimator Lm,tipcw(𝒪,ψm;G0)L_{m,t}^{ipcw}({\cal O},\psi_{m};G_{0}) is given by Lm,tdr(𝒪,ψm;G0,Ψ0)=Lm,tipcw(𝒪,ψm;G0)+Lm,taug(𝒪,ψm;G0,Ψ0)L_{m,t}^{dr}({\cal O},\psi_{m};G_{0},\Psi_{0})=L_{m,t}^{ipcw}({\cal O},\psi_{m};G_{0})+L_{m,t}^{aug}({\cal O},\psi_{m};G_{0},\Psi_{0}) where, for suitable models Ψ\Psi and G(|),G(\cdot|\cdot),

Lm,taug(𝒪,ψm;G,Ψ)=1nl=1Li=1nI{Wi𝒩l}0T~iVlm(u;t,Wi,Ψ)G(u|Wi)𝑑MG(u|Wi)\displaystyle L_{m,t}^{aug}(\mathcal{O},\psi_{m};G,\Psi)=\frac{1}{n}\sum_{l=1}^{L}\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}\int_{0}^{\tilde{T}_{i}}\frac{V_{lm}(u;t,W_{i},\Psi)}{G(u|W_{i})}dM_{G}(u|W_{i}) (6)

and MG(t|w)=I(T~t,Δ=0)0tI(T~u)𝑑ΛG(u|w)M_{G}(t|w)=I(\tilde{T}\leq t,\Delta=0)-\int_{0}^{t}I(\tilde{T}\geq u)d\Lambda_{G}(u|w) with ΛG(t|w)\Lambda_{G}(t|w) denoting the cumulative hazard function corresponding to G(t|w)G(t|w) (cf.  tsiatis2007semiparametric, Sec. 9.3, 10.3). The “doubly robust” loss function Lm,tdr(𝒪,ψm;G,Ψ)L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi) reduces to a special case of the class of loss functions proposed in steingrimsson2016doubly when K=1.K=1.

Instead of augmenting Lm,tipcw(𝒪,ψm;G)L_{m,t}^{ipcw}({\cal O},\psi_{m};G) (i.e., IPCW1{}_{\!1}), one might instead augment (4) when t=tt^{*}=t (i.e., IPCW2{}_{\!2}). However, it turns out that the resulting loss function is identical to Lm,tdr(𝒪,ψm;G,Ψ);L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi); see the Appendix (Section A.1). Hence, no additional efficiency is gained from augmenting IPCW2{}_{\!2} and it suffices to consider Lm,tdr(𝒪,ψm;G,Ψ)L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi) only.

Returning to Lm,tdr(𝒪,ψm;G,Ψ):L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi): because Zm(t)Z_{m}(t) is binary, we have

Vlm(u;t,w,Ψ)=ym(u;t,w,Ψ)2ym(u;t,w,Ψ)βlm(t)+βlm2(t)V_{lm}(u;t,w,\Psi)=y_{m}(u;t,w,\Psi)-2y_{m}(u;t,w,\Psi)\beta_{lm}(t)+\beta^{2}_{lm}(t) (7)

for any suitable Ψ\Psi (e.g., Ψ0\Psi_{0}), where ym(u;t,w,Ψ)=EΨ{Zm(t)|Tu,W=w}y_{m}(u;t,w,\Psi)=E_{\Psi}\{Z_{m}(t)|T\geq u,W=w\} reduces to

ym(u;t,w,Ψ)={PΨ(uTt,M=m|W=w)PΨ(Tu|W=w)ifut0otherwise.\displaystyle y_{m}(u;t,w,\Psi)=\begin{cases}\dfrac{P_{\Psi}(u\leq T\leq t,M=m|W=w)}{P_{\Psi}(T\geq u|W=w)}&\text{if}\quad u\leq t\\ 0&\text{otherwise}\end{cases}. (8)

The notation EΨE_{\Psi} and PΨP_{\Psi} means that these quantities are calculated under the CIF model specification Ψ\Psi. Hence, under a model Ψ\Psi, the calculation of Lm,tdr(𝒪,ψm;G,Ψ)L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi) requires estimating both the CIF for cause mm and the all-cause probability PΨ(Tu|W=w).P_{\Psi}(T\geq u|W=w).

Considering Lm,tdr(𝒪,ψm;G,Ψ)L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi) as a function of the LL scalar parameters β1m(t),,βLm(t)\beta_{1m}(t),\ldots,\beta_{Lm}(t) only and differentiating with respect to each one, it can be shown that

β~lmdr(t;G,Ψ)=i=1nI{Wi𝒩l}[TS~1,im1(t)+TS~2,im1(t)]i=1nI{Wi𝒩l}[TS~1,im0+TS~2,im0],l=1,,L\displaystyle\tilde{\beta}_{lm}^{dr}(t;G,\Psi)=\dfrac{\displaystyle\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}[\widetilde{TS}_{1,im}^{1}(t)+\widetilde{TS}_{2,im}^{1}(t)]}{\displaystyle\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}[\widetilde{TS}_{1,im}^{0}+\widetilde{TS}_{2,im}^{0}]},~{}l=1,\ldots,L (9)

minimize Lm,tdr(𝒪,ψm;G,Ψ),L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi), where

TS~1,im0(t)=ΔiG(T~i|Wi)TS~2,im0(t)=0T~idMG(u|Wi)G(u|Wi)\displaystyle\widetilde{TS}_{1,im}^{0}(t)=\frac{\Delta_{i}}{G(\tilde{T}_{i}|W_{i})}\quad\quad\widetilde{TS}_{2,im}^{0}(t)=\int_{0}^{\tilde{T}_{i}}\frac{dM_{G}(u|W_{i})}{G(u|W_{i})}
TS~1,im1(t)=Z~im(t)ΔiG(T~i|Wi)TS~2,im1(t)=0T~iym(u;t,Wi,Ψ)G(u|Wi)𝑑MG(u|Wi).\displaystyle\widetilde{TS}_{1,im}^{1}(t)=\frac{\tilde{Z}_{im}(t)\Delta_{i}}{G(\tilde{T}_{i}|W_{i})}\quad\quad\widetilde{TS}_{2,im}^{1}(t)=\int_{0}^{\tilde{T}_{i}}\frac{y_{m}(u;t,W_{i},\Psi)}{G(u|W_{i})}dM_{G}(u|W_{i}).

The validity of this result relies on the assumption that G(T~i|Wi)ϵ>0G(\tilde{T}_{i}|W_{i})\geq\epsilon>0 for some ϵ\epsilon and each i=1,,n.i=1,\ldots,n. Under this same assumption, Lemma 1 of strawderman2000estimating implies

TS~1,im0+TS~2,im0=ΔiG(T~i|Wi)+1ΔiG(T~i|Wi)0T~idΛG(u|Wi)G(u|Wi)=1;\displaystyle\widetilde{TS}_{1,im}^{0}+\widetilde{TS}_{2,im}^{0}=\dfrac{\Delta_{i}}{G(\tilde{T}_{i}|W_{i})}+\dfrac{1-\Delta_{i}}{G(\tilde{T}_{i}|W_{i})}-\displaystyle\int_{0}^{\tilde{T}_{i}}\dfrac{d\Lambda_{G}(u|W_{i})}{G(u|W_{i})}=1;

letting Nl=i=1nI{Wi𝒩l},l=1,,L,N_{l}=\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\},~{}l=1,\ldots,L, it follows that (9) can be rewritten as

β^lmdr(t;G,Ψ)=1Nli=1nI{Wi𝒩l}[TS~1,im1(t)+TS~2,im1(t)],l=1,,L.\displaystyle\hat{\beta}_{lm}^{dr}(t;G,\Psi)=\frac{1}{N_{l}}\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}[\widetilde{TS}_{1,im}^{1}(t)+\widetilde{TS}_{2,im}^{1}(t)],~{}l=1,\ldots,L. (10)

As in Section 3.2, Lm,tdr(𝒪,ψm;G,Ψ)L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi) and (10) respectively reduce to (1) and (2) when censoring is absent.

The specification G~(t|w)=1\tilde{G}(t|w)=1 for all t0t\geq 0 and w𝒮w\in{\cal S} generates an interesting special case of Lm,tdr(𝒪,ψm;G~,Ψ)L_{m,t}^{dr}({\cal O},\psi_{m};\tilde{G},\Psi) despite the fact that G~(|)\tilde{G}(\cdot|\cdot) is necessarily incorrectly modeled in the presence of censoring. In particular, for any suitable choice of Ψ\Psi, (i) Lm,tdr(𝒪,ψm;G~,Ψ)=l=1LLml,tbj(𝒪,ψm;Ψ)L_{m,t}^{dr}({\cal O},\psi_{m};\tilde{G},\Psi)=\sum_{l=1}^{L}L_{ml,t}^{bj}({\cal O},\psi_{m};\Psi) where

Lml,tbj(𝒪,ψm;Ψ)=1ni=1nI{Wi𝒩l}[Δi{Z~im(t)βlm(t)}2+(1Δi)Vlm(T~i;t,Wi,Ψ)];L_{ml,t}^{bj}({\cal O},\psi_{m};\Psi)=\frac{1}{n}\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}[\Delta_{i}\{\tilde{Z}_{im}(t)-\beta_{lm}(t)\}^{2}+(1-\Delta_{i})V_{lm}(\tilde{T}_{i};t,W_{i},\Psi)];

and, (ii) for Ψ=Ψ0,\Psi=\Psi_{0}, Lm,tdr(𝒪,ψm;G~,Ψ0)L_{m,t}^{dr}({\cal O},\psi_{m};\tilde{G},\Psi_{0}) is an unbiased estimator of the risk (t,ψm)\Re(t,\psi_{m}). In fact, noting that (7) implies Vlm(T~i;t,Wi,Ψ)V_{lm}(\tilde{T}_{i};t,W_{i},\Psi) can be rewritten in terms of ym(T~i;t,w,Ψ)y_{m}(\tilde{T}_{i};t,w,\Psi) for every i,i, the minimizer of Lml,tbj(𝒪,ψm;Ψ)L_{ml,t}^{bj}({\cal O},\psi_{m};\Psi) is given by

β~lmbj(t;Ψ)=1Nli=1nI{Wi𝒩l}[ΔiZ~im(t)+(1Δi)ym(T~i;t,Wi,Ψ)].\displaystyle\tilde{\beta}_{lm}^{bj}(t;\Psi)=\frac{1}{N_{l}}\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}[\Delta_{i}\tilde{Z}_{im}(t)+(1-\Delta_{i})y_{m}(\tilde{T}_{i};t,W_{i},\Psi)].

That is, under the loss Lml,tbj(𝒪,ψm;Ψ),L_{ml,t}^{bj}({\cal O},\psi_{m};\Psi), the estimator for βlm(t)\beta_{lm}(t) is the Buckley-James (BJ) estimator of the mean response within node 𝒩l{\cal N}_{l} (buckley1979linear), an estimator that can also be derived directly from (10) by setting G=G~.G=\tilde{G}. For this reason, we refer to Lm,tdr(𝒪,ψm;G~,Ψ)L_{m,t}^{dr}({\cal O},\psi_{m};\tilde{G},\Psi) as the Buckley-James loss function.

3.3.2 Composite augmented loss functions with multiple time points

Under a tree model of the form (3), and similarly to methods used for survival trees (e.g., leblanc1992relative; molinaro2004tree; steingrimsson2016doubly), the quantity being estimated within each terminal node depends on the choice of tt but the underlying partition structure is time-invariant. That is, the effects of baseline covariates and their interactions on the CIF are not allowed to change over time. As a result, for a given mm, we can further reduce variability when estimating ψ0m(t|w)\psi_{0m}(t|w) by considering loss functions constructed from Lm,tdr(𝒪,ψm;G,Ψ)L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi) that incorporate information over several time points.

Recall that Lm,tdr(𝒪,ψm;G,Ψ)=Lm,tipcw(𝒪,ψm;G)+Lm,taug(𝒪,ψm;G,Ψ)L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi)=L_{m,t}^{ipcw}({\cal O},\psi_{m};G)+L_{m,t}^{aug}({\cal O},\psi_{m};G,\Psi) where Lm,tipcw(𝒪,ψm;G)L_{m,t}^{ipcw}({\cal O},\psi_{m};G) is given by (4) (i.e., for t=t^{*}=\infty) and Lm,taug(𝒪,ψm;G,Ψ)L_{m,t}^{aug}({\cal O},\psi_{m};G,\Psi) is given by (6). For a given set of time points 0<t1<t2<tJ<,0<t_{1}<t_{2}\ldots<t_{J}<\infty, a simple composite loss function for a given event type mm can be formed by calculating

Lm,𝐭mult,dr(𝒪,ψm;G,Ψ)=j=1JwjLm,tjdr(𝒪,ψm;G,Ψ),\displaystyle L_{m,\mathbf{t}}^{mult,dr}(\mathcal{O},\psi_{m};G,\Psi)=\sum_{j=1}^{J}w_{j}L_{m,t_{j}}^{dr}({\cal O},\psi_{m};G,\Psi), (11)

where w1,,wJw_{1},\ldots,w_{J} are positive weights that, without loss of generality, can be assumed to sum to one. Minimizing (11) with respect to βlm(tj),j=1,,J;l=1,,L,\beta_{lm}(t_{j}),j=1,\ldots,J;l=1,\ldots,L, gives

β~lmmult,dr(tj;G,Ψ)=1Nli=1nI{Wi𝒩l}[TS~1,im1(tj)+TS~2,im1(tj)].\displaystyle\tilde{\beta}_{lm}^{mult,dr}(t_{j};G,\Psi)=\frac{1}{N_{l}}\ \sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}[\widetilde{TS}_{1,im}^{1}(t_{j})+\widetilde{TS}_{2,im}^{1}(t_{j})]. (12)

The terminal node estimators (12) are exactly equivalent to (10) computed for t=tjt=t_{j} and do not depend on w1,,wJw_{1},\ldots,w_{J}. However, the corresponding assessment of total loss and thus decisions to split nodes at any given cutpoint for any given covariate are directly influenced by the combined specifications of (tj,wj),j=1,,J.(t_{j},w_{j}),j=1,\ldots,J. Consequently, performance gains in the resulting tree-based estimates for ψm0(tj;w),j=1,,J\psi_{m0}(t_{j};w),j=1,\ldots,J occur in the determination of when and where to split, rather than in the manner by which individual terminal nodes are estimated. In the absence of censoring, the indicated composite loss function also reduces to that which would be computed by extending the loss function introduced in Section 2.2 in the manner described above.

4 Simulation Studies

In this section, we will summarize the structure and results of several simulation studies designed to study the performance of regression trees built using the IPCW, Buckley James, and doubly robust Brier loss functions introduced in the previous section. We first describe how data are generated. Several performance evaluation measures are used to evaluate tree-building performance; these are described next, followed by a summary of the results.

4.1 Simulation Settings

Covariates W1,W2,W3,W4,W5,W6,W7,W8,W9,W10Unif(0,1)W_{1},W_{2},W_{3},W_{4},W_{5},W_{6},W_{7},W_{8},W_{9},W_{10}\sim Unif(0,1) are generated independently of each other; let 𝑾=(W1,W2,W3,W4,W5,W6,W7,W8,W9,W10).\boldsymbol{W}=(W_{1},W_{2},W_{3},W_{4},W_{5},W_{6},W_{7},W_{8},W_{9},W_{10})^{\prime}. The underlying competing risks model assumes K=2K=2 and takes the form

ψ01(t;𝑾)=1(1p(1et))exp(β1Z(𝑾))\displaystyle\psi_{01}(t;\boldsymbol{W})=1-(1-p(1-e^{-t}))^{\exp{(\beta_{1}Z(\boldsymbol{W}))}} (13)
ψ02(t;𝑾)=(1p)exp(β1Z(𝑾))×(1exp(texp(β2Z(𝑾))))\displaystyle\psi_{02}(t;\boldsymbol{W})=(1-p)^{\exp{(\beta_{1}Z(\boldsymbol{W}))}}\times(1-\exp{(-t\exp{(\beta_{2}Z(\boldsymbol{W}))})}) (14)

where Z(𝑾)=I(W10.5&W2>0.5)Z(\boldsymbol{W})=I(W_{1}\leq 0.5\ \&\ W_{2}>0.5) and β1\beta_{1} and β2\beta_{2} are regression coefficients (cf., fine1999proportional). Note that this formulation satisfies the additivity constraint ψ01(;𝑾)+ψ02(;𝑾)=1.\psi_{01}(\infty;\boldsymbol{W})+\psi_{02}(\infty;\boldsymbol{W})=1. Three settings are considered: (i) β1=3\beta_{1}=3 and β2=0.5\beta_{2}=-0.5 with p=0.3p=0.3 (High Signal); (ii) β1=2\beta_{1}=2 and β2=0.5\beta_{2}=-0.5 with p=0.3p=0.3 (Medium Signal); and, (iii) β10=1.5\beta_{10}=1.5 and β20=0.5\beta_{20}=-0.5 with p=0.3p=0.3 (Low Signal). The corresponding cumulative incidence functions for these settings are shown in Figure 1.

Figure 1: Three settings in simulation 1. Black line : ψ01(|Z(𝑾)=1)\psi_{01}(\cdot|Z(\boldsymbol{W})=1), Blue line : ψ01(|Z(𝑾)=0)\psi_{01}(\cdot|Z(\boldsymbol{W})=0), Red line : ψ02(|Z(𝑾)=1)\psi_{02}(\cdot|Z(\boldsymbol{W})=1), Green line : ψ02(|Z(𝑾)=0)\psi_{02}(\cdot|Z(\boldsymbol{W})=0).
Refer to caption

Similarly to steingrimsson2016doubly, we respectively generate training sets of size 250, 500 and 1000 (with noninformative censoring) for the purposes of estimation and an independent uncensored test set of size 2000 for evaluating performance. Censoring is exponentially distributed with a rate γ\gamma chosen to give approximately 50% censoring on TT. For each training set size, 500 simulations are run for each of the three settings.

4.2 Performance Measures

For assessing estimator performance, it is assumed there is an underlying true tree structure to be recovered. Let ψ^1(t|Wi)\hat{\psi}_{1}(t|W_{i}) be the tree-predicted cumulative incidence for the cause of interest for a subject having covariates WiW_{i}; let ψ01(t|Wi)\psi_{01}(t|W_{i}) be the corresponding true cumulative incidence function. Then, as a measure of predictive performance, we consider

1ntesti=1ntest{ψ^1(t|Wi)ψ01(t|Wi)}2\displaystyle\frac{1}{n_{test}}\sum_{i=1}^{n_{test}}\{\hat{\psi}_{1}(t|W_{i})-\psi_{01}(t|W_{i})\}^{2} (15)

where ntestn_{test} is the number of observations in the test dataset. We also use several other performance measures focused more on the identification of the underlying tree structure:

  • ||fitted size-true size|| : This quantity measures difference between the size of fitted tree and the size of the true tree, where size denotes the number of terminal nodes. It can be seen from Figure 1 that the true tree size is 3.

  • NSP: This quantity is defined as (nsim)1i=1nsimNSi(n_{sim})^{-1}\sum_{i=1}^{n_{sim}}NS_{i} where NSiNS_{i} represents number of times any of the covariates W3,,W10W_{3},\ldots,W_{10} (i.e., covariates that do not affect the true CIF) appear in simulation run ii.

  • PCSP : This quantity is defined as (nsim)1i=1nsimCTi(n_{sim})^{-1}\sum_{i=1}^{n_{sim}}CT_{i} where CTi=1CT_{i}=1 if the fitted tree in simulation ii has the same covariates and the same number of splits as the true tree that defines the CIF; otherwise CTi=0CT_{i}=0. While it can happen that a fitted tree may involve only the covariates W1W_{1} and W2W_{2}, it cannot be expected to have exactly the same splits as the true tree (especially for a continuous covariate); thus, in this measure, a ‘correct tree’ is one that looks like the true tree but may involve different split points.

4.3 Estimation of G0G_{0} and Ψ0\Psi_{0}

The IPCW and doubly robust Brier loss functions require estimation of G0(|).G_{0}(\cdot|\cdot). The Buckley-James and doubly robust Brier loss functions both rely on the function ym(;t,w,Ψ)y_{m}(\cdot;t,w,\Psi) in (8), the optimal choice requiring specification of both ψ0m(u|w)\psi_{0m}(u|w) and the event-free survival function ψ¯0(u|w)=PΨ0(T>u|w)\bar{\psi}_{0}(u|w)=P_{\Psi_{0}}(T>u|w) for u>0,w𝒮u>0,w\in{\cal S}.

Estimation of G0(|)G_{0}(\cdot|\cdot) is comparatively straightforward; for example, one may use regression procedures (e.g., Cox regression models, random survival forests) or product-limit estimators as appropriate. For all simulations considered here, the censoring distribution G0(|)G_{0}(\cdot|\cdot) is estimated by the Kaplan-Meier method. For building an IPCW-type tree with t=t^{*}=\infty in (4), the resulting censoring weight may not satisfy the required positivity assumption. Hence, similarly to steingrimsson2016doubly, we replace t=t^{*}=\infty by t=τt^{*}=\tau when calculating IPCW1,{}_{\!1}, where G^(T~i(τ))0.05,i=1,,n\hat{G}(\tilde{T}_{i}(\tau))\geq 0.05,i=1,\ldots,n (i.e., the marginal truncation rate of observed survival times is 5%). No such modification is required when calculating IPCW2,{}_{\!2}, that is, when computing (4) with t=t=tj,j=1,2,3.t^{*}=t=t_{j},j=1,2,3.

Calculation of the Buckley-James and doubly robust Brier loss functions in practice requires using an estimated model Ψ^\hat{\Psi} in place of Ψ0\Psi_{0} when computing (8). Parametric models have been proposed that ensure the compatibility between the models used for ψm(|)\psi_{m}(\cdot|\cdot) and ψ¯(|)\bar{\psi}(\cdot|\cdot) as well as the ability to calculate probabilities for every u>0;u>0; see, for example, jeong2006direct, jeong2007parametric, cheng2009modeling, and shi2013constrained. Compared to existing semiparametric methods (e.g., fine1999proportional; scheike2008predicting; eriksson2015proportional), the parametric likelihood methods of jeong2006direct and jeong2007parametric enable researchers to estimate two or more CIFs jointly. However, cheng2009modeling and shi2013constrained show that even these methods fail to enforce the additivity constraint m=1MPΨ(T,M=m;W=w)=1\sum_{m=1}^{M}P_{\Psi}(T\leq\infty,M=m;W=w)=1 and propose methods that resolve this inconsistency.

Since our focus is on a particular cause m,m, we will without loss of generality assume that m=1m=1 and that K=2K=2 (i.e., m=2m=2 captures all causes m1m\neq 1); in this case, specification of ψ1(|)\psi_{1}(\cdot|\cdot) and ψ¯(|)\bar{\psi}(\cdot|\cdot) is equivalent to specifying Ψ=(ψ1(|),ψ2(|)).\Psi=(\psi_{1}(\cdot|\cdot),\psi_{2}(\cdot|\cdot)). Our simulation study considers the following estimators Ψ^\hat{\Psi} when K=2:K=2:

  1. 1.

    RF(npar) : random survival forests as proposed in ishwaran2014random.

  2. 2.

    RF+lgtc: a hybrid method of random survival forests and parametric modeling;

  3. 3.

    FG(true): the simulation model of Section 4.1 is an example of the class of models considered in jeong2007parametric. For m=1,2,m=1,2, ψm(|)\psi_{m}(\cdot|\cdot) are estimated by maximum likelihood under a correctly specified parametric model.

  4. 4.

    lgtc+godds: for m=1,2,m=1,2, ψm(|)\psi_{m}(\cdot|\cdot) is estimated by maximum likelihood using the parametric cumulative incidence model proposed in shi2013constrained.

Further details on the RF(npar) and RF+lgtc estimators, including the parametric models used in #2 and #4 above, may be found in Section A.3 of the Appendix. We anticipate that FG(true) will lead to the best performance because the underlying models for the event of interest and for censoring are both specified correctly. As in steingrimsson2016doubly, the estimates G^(|)\hat{G}(\cdot|\cdot) and Ψ^\hat{\Psi} depend only on 𝒪{\cal O} and are computed in advance of the process of building the desired regression tree for estimating ψ01(|).\psi_{01}(\cdot|\cdot).

4.4 Simulation Results

Each combination of loss function and method for estimating G0G_{0} and/or Ψ0\Psi_{0} leads to a distinct algorithm. In this section, we show results for the IPCW-type estimators with t=tj,j=1,2,3t^{*}=t_{j},j=1,2,3 and t=τt^{*}=\tau; the Buckley James Brier loss, where RF(npar) (algorithm BJ-RF (npar)) and FG(true) (algorithm BJ-FG (true)) are respectively used to estimate Ψ0\Psi_{0}; and, the doubly robust Brier loss function, where RF(npar) (algorithm DR-RF (npar)) and FG(true) (algorithm DR-FG (true)) are respectively used to estimate Ψ0\Psi_{0}. Results using RF+lgtc and lgtc+godds for estimating Ψ0\Psi_{0} are summarized in the Appendix (Section A.5).

In fitting the trees, we consider three fixed time points t1t_{1}, t2t_{2} and t3t_{3}, respectively representing the 25th, 50th and 75th quantile of the true marginal distribution of TT. These times are used for building individual trees as well as for composite loss functions with the respective training set sizes n=250,500n=250,500 and 10001000. Below, we focus on the results with n=500n=500 using composite loss functions as described in Section 3.3.2. Table 1 summarizes the tree fitting performance measures for estimating ψ01(|)\psi_{01}(\cdot|\cdot) in (13) for n=500n=500 using a composite loss function having weights w1=w2=w3=1/3w_{1}=w_{2}=w_{3}=1/3. For the chosen metrics, and irrespective of signal strength, there is a clear benefit to using augmented loss functions over IPCW loss functions, the exception being BJ-RF (npar) (i.e., the Buckley-James loss with Ψ^\hat{\Psi} estimated by random survival forests). Overall, the BJ-FG (true) algorithm that uses the correct model for Ψ0\Psi_{0} performs best, followed by DR-FG (true), DR-RF (npar), IPCW2,{}_{\!2}, BJ-RF (npar) and then IPCW1{}_{\!1}. Prediction error performance is considered in Figure 2, which shows boxplots of the mean squared prediction error given in (15) under equally weighted composite loss. Here, the BJ-FG (true) algorithm again performs uniformly best, with the augmented methods DR-FG (true), DR-RF (npar), and BJ-RF (npar) all performing similarly, followed by IPCW2{}_{\!2} and then IPCW1.{}_{\!1}.

In general, the methods that incorporate information beyond that used by IPCW1 exhibit significant improvement in performance, with the methodology for estimating quantities needed to compute the augmented loss function playing a smaller role, particularly as sample size increases. Considering only methods where no knowledge of Ψ0\Psi_{0} is used, the doubly robust loss function also provides gains over both the Buckley-James and IPCW2{}_{\!2} loss functions; however, the gains achieved are substantially less compared to the gains over IPCW1{}_{\!1}. This phenomenon is expected and can be explained by noting that (i) the censoring distribution is being consistently estimated in all cases; (ii) IPCW2{}_{\!2} can be viewed as an augmented version of IPCW1;{}_{\!1}; and, (iii) IPCW2{}_{\!2} only requires estimating one infinite dimensional parameter.

Numerical and graphical results for other methods of nuisance parameter estimation and other settings (a single time point loss and composite loss with various training set sample sizes) are shown in the Appendix. These results demonstrate no significant deviations from the trends and observations summarized above. Some important conclusions that can be drawn from the combined set of results include the following: (i) the method for estimating Ψ\Psi in (8) has little overall impact on performance, with greater differences arising from the type of loss (IPCW versus DR versus BJ) and whether or not a composite loss function is used; (ii) getting the data generating model exactly right yields noticeable gains, and there is otherwise a degree of robustness across methods, particularly for doubly robust loss; and, (iii) the performance of IPCW2 is often reasonably close to that for the augmented losses, particularly with composite loss, and the comparative degree of simplicity involved in implementing this method has much to recommend it.

Table 1: Numerical summaries for trees built using composite loss for n=500n=500. IPCW1 and IPCW2 are standard and modified IPCW, respectively; BJ-RF(npar) and BJ-FG(true) use Buckley-James loss functions with the augmentation term estimated via nonparametric random forests and under the correct simulation model, respectively; and, DR-RF(npar) and DR-FG(true) use the doubly robust loss function with the augmentation term estimated via nonparametric random forests and under the correct simulation model, respectively.
High Sig Med Sig Low Sig
||fitted-3|| NSP PCSP ||fitted-3|| NSP PCSP ||fitted-3|| NSP PCSP
IPCW1 0.132 0.084 0.916 0.182 0.132 0.874 0.558 0.164 0.658
IPCW2 0.124 0.082 0.932 0.138 0.100 0.906 0.282 0.136 0.830
BJ-RF(npar) 0.142 0.090 0.904 0.148 0.088 0.908 0.226 0.128 0.878
BJ-FG(true) 0.092 0.068 0.940 0.102 0.064 0.934 0.118 0.082 0.918
DR-RF(npar) 0.066 0.052 0.958 0.092 0.056 0.940 0.216 0.118 0.856
DR-FG(true) 0.058 0.044 0.966 0.092 0.058 0.942 0.176 0.076 0.874
Figure 2: Prediction error (multiplied by 100) for event 1 using composite loss for n=500n=500. Rows represent signal strength; columns indicate time values considered. IPCW1 and IPCW2 are standard and modified IPCW, respectively; BJ-RF(npar) and BJ-FG(true) use Buckley-James loss functions with the augmentation term estimated via nonparametric random forests and under the correct simulation model, respectively; and, DR-RF(npar) and DR-FG(true) use the doubly robust loss function with the augmentation term estimated via nonparametric random forests and under the correct simulation model, respectively.
Refer to captionRefer to captionRefer to caption

5 Example: Lung Cancer Treatment Trial

We illustrate our methods using data from the RTOG 9410, randomized trial of patients with locally advanced inoperable non-small cell lung cancer. The motivation for this trial was to ascertain whether sequential or concurrent delivery of chemotherapy and thoracic radiotherapy (TRT) is a better treatment strategy. The original RTOG 9410 study randomized 610 patients to three treatment arms: sequential chemotherapy followed by radiotherapy (RX=1); once-daily chemotherapy concurrent with radiotherapy (RX=2); and, twice-daily chemotherapy concurrent with radiotherapy (RX=3). The primary endpoint of interest was overall survival and the main trial analysis results were published in curran2011sequential, demonstrating a survival benefit of concurrent delivery of chemotherapy and TRT compared with sequential delivery. Secondary analyses of the data using the time from randomization to the first occurrence of three possible outcomes are considered: in-field failure (cancer recurrence within the treatment field for TRT); out-field failure (cancer recurrence and distant metastasis outside of the treatment field for TRT); and, death without documented in-field or out-field failure (i.e., cancer progression). Among these event types, those that first experienced out-field failures are of particular interest since these patients typically have suboptimal prognosis and may be candidates for more intensified treatment regimens intended to prevent distant metastasis, including but not limited to consolidative chemotherapy, prophylactic cranial irradiation (for brain metastases), and so on. As such, patients that experienced both in-field failure and out-field failure were considered to be out-field failures for purposes of this analysis.

At the time the study database was last updated in 2009, there were 577 patients, with approximately 3% censoring on the aforementioned outcomes. Because this censoring rate is so low, we removed these censored observations and compare the results of analyses of the resulting uncensored dataset (554 patients) to analyses of data created using an artificially induced censoring mechanism. The purpose of doing this analysis is to evaluate how censoring affects the analysis and, in particular, illustrates how well the various procedures are able to recover the desired tree(s) that would be built had all outcomes been fully observed. We focus on building trees for each outcome using a composite loss function with 5 time points (5.2, 6.2, 8.5, 11.8, 19.0 months), selected as the 25th, 35th, 50th, 65th and 80th percentiles of the observed “all cause” event time (i.e., TT). Baseline covariates included in this analysis are Treatment (RX), Age, Stage (American Joint Committee on Cancer [AJCC] stage IIIB vs.  IIIA or II), Gender, Karnofsky score (70, 80, 90 or 100), Race (White vs. non-White), and Histology (Squamous vs. non-Squamous). Censoring is created according to a Uniform [0,50][0,50] distribution, generating approximately 29% censoring on TT. In addition to building a tree using the uncensored dataset (i.e., see Section 3.3.2), we consider the methods IPCW1{}_{\!1}, IPCW2{}_{\!2}, BJ-RF(npar), BJ-RF(npar), DR-RF+lgtc, DR-RF(npar) and DR-RF+lgtc; this will allow us to evaluate the effect of censoring as well as different models used in constructing the augmented loss. Similarly to steingrimsson2016doubly, we use repeated 10-fold cross-validation to improve the stability of the tree building process and select the final tree using that having the lowest mean cross-validated error. For the in-field failure outcome, all methods (including those applied to the uncensored dataset) lead to trees that consist only of a root node. Hence, we focus on outfield failure and death without progression below.

For out-field failure, the estimated tree obtained using the uncensored data creates 4 risk groupings (see Figure A.4 in the Appendix): age << 50.5; 50.5 \leq age << 70.5 with squamous histology; age \geq 50.5 with squamous histology; and, age \geq 70.5 with non-squamous histology. The plot of the CIF indicates that age \geq 70.5 with non-squamous histology has the lowest risk, and that this risk is very similar to those aged \geq 50.5 with squamous histology. With censored data, the IPCW1{}_{\!1}, IPCW2{}_{\!2}, BJ-RF(npar) and DR-RF(npar) methods all lead to the same tree structure; see Figures A.5A.8 in the Appendix. The BJ-RF+lgtc and DR-RF+lgtc also share the same tree structures (see Figures A.9 and A.10) and identify nearly the same risk groups as the other censored data methods, the difference being that the secondary split on age occurs at 70.5 rather than 71.5. Importantly, the trees for out-field failure respectively built using the uncensored and censored datasets are very similar. Figure 3 respectively summarize the estimated CIFs obtained using the uncensored, DR-RF(npar), IPCW2{}_{\!2} and BJ-RF(npar) loss functions. There are slight differences in the latter three estimators due to the way in which the CIFs (i.e., terminal nodes) are estimated. We see that the three highest risk groups are identified as being the same for all methods; the differences created by censoring essentially occur in the risk determination for the oldest squamous patients, where censoring tends to be the heaviest. In general, the results show that younger patients with non-squamous histology have the highest risk of out-field failure. This observation is important, as these patients may be considered as candidates for more intensified treatment regimens.

Figure 3: Risk stratification for out-field failure. Panels contain estimated CIFs obtained using the following composite loss functions: Uncensored [top left]; DR-RF(npar) [top right]; IPCW2 [bottom left]; BJ-RF(npar) [bottom right]. CIF estimates are calculated at the following percentiles of observed all-cause event time: 15, 25, 35, 50, 65, 75, 80. Estimates are otherwise interpolated.
Refer to caption

For death without progression, the fitted tree from uncensored data creates 5 risk groups: non-squamous histology with age << 72.5; male, with age <67.5<67.5 and squamous histology; male, with age \geq 67.5 and squamous histology; female with squamous histology; and, non-squamous histology with age \geq 72.5. This tree is given in Figure A.11 in the Appendix. The tree built using IPCW1{}_{\!1} results in a root node. The trees built with IPCW2{}_{\!2} and the two doubly robust methods are nearly identical to that built using the uncensored dataset, the difference being that the age cutpoint of 72.5 is replaced by 70.5; see Figures A.12 - A.14 of the Appendix. The tree built using BJ-RF(npar) is also very similar, but incorporates an extra cutpoint on Karnofsky score for younger males with squamous histology; this distinguishes those having a score of 100 from those having a lower score (see Figure A.15 of the Appendix). The tree built using the BJ-RF+lgtc method differs somewhat more substantially for younger males with squamous histology, splitting on both Karnofsky score (70 or 80 vs. 90 or 100) treatment (RX = 1 vs. not); see Figure A.16 of the Appendix. The comparison of these trees points to a benefit of using the doubly robust loss over the Buckley-James loss, as the censoring distribution is modeled correctly; the comparison of the augmented loss results to those obtained for IPCW1{}_{\!1} (where censoring is modeled correctly) further highlights the value of incorporating additional information into the model building process. Figure 4 is obtained analogously to Figure 3 and shows the estimated CIFs and risk stratification obtained using the uncensored and censored datasets. For death without progression, the risk evidently increases with increasing age and being male; the risk also increases with squamous histology, though in a way that appears to be age-dependent.

Figure 4: Risk stratification for death without progression. Panels contain estimated CIFs obtained using the following composite loss functions: Uncensored [top left]; DR-RF(npar) [top right]; IPCW2 [bottom left]; BJ-RF(npar) [bottom right]. CIF estimates are calculated at the following percentiles of observed all-cause event time: 15, 25, 35, 50, 65, 75, 80. Estimates are otherwise interpolated.
Refer to caption

6 Discussion

Trees remain of significant interest to practitioners, especially so clinicians. The proposed doubly robust CIF regression tree demonstrates improved performance compared to IPCW-based methods whether a single time point or composite loss function is used, with composite loss functions giving much better performance overall. To our knowledge, there are no publically available software packages that directly implement tree-based regression methods for competing risks; our proposed methods can be implemented using existing software, with example code made available as part of supplementary materials.

Several extensions are possible. For example, it is easy in principle to combine the proposed estimation procedure with ensemble methods, providing an interesting alternative to the random forests methods recently introduced by mogensen2013random and ishwaran2014random. With no random feature selection, this extension is easily accomplished through bagging (i.e., bootstrap-based aggregation). However, with random feature selection, new software is required, as existing software does not provide for the same possibility of extending rpart for use with other loss functions. A second interesting direction is to extend both regression tree and ensemble procedures to the problem of simultaneous estimation of multiple CIFs. Similarly to the case of a single CIF, we anticipate that such multivariate problems can be handled using existing software in the case of regression trees and that new software will again be needed for related ensemble extensions.

Acknowledgments

This work was supported by the National Institutes of Health (R01CA163687: AMM, RLS, YC; U10-CA180822: CH). We thank the NRG Oncology Statistics and Data Management Center for providing de-identified RTOG 9410 clinical trial data under a data use agreement.

Supplementary Material for
Regression Trees for Cumulative Incidence Functions
by Cho, Molinaro, Hu and Strawderman

Appendix A Appendix

References to figures, tables, theorems and equations preceded by “A.” are internal to this appendix; all other references refer to the main paper.

A.1 Equivalence of augmentation for IPCW1{}_{\!1} and IPCW2{}_{\!2}

The equivalence result to be proved below holds for any pair of suitable models of GG and Ψ\Psi, including G0G_{0} and Ψ0\Psi_{0}. Throughout, we suppose that suitable models Ψ\Psi and G(|)G(\cdot|\cdot) are given. The proof will be facilitated by first establishing that IPCW2{}_{\!2} can be rewritten as an IPCW estimator in standard form and augmented similarly to IPCW1.{}_{\!1}. We will then show that the difference between its augmented version and that for IPCW1{}_{\!1} are mathematically identical.

Define the modified data 𝒪(t)=(T~i(t),Δi(t),Wi),i=1,,n,{\cal O}(t)=(\tilde{T}_{i}(t),\Delta_{i}(t),W_{i}),i=1,\ldots,n, where Δi(t)=I(Ti(t)C)\Delta_{i}(t)=I(T_{i}(t)\leq C) and Ti(t)=TitT_{i}(t)=T_{i}\wedge t for i=1,,n.i=1,\ldots,n. We first observe that the IPCW2{}_{\!2} loss can be re-expressed as a standard IPCW estimator that can be computed from 𝒪(t).{\cal O}(t). In particular, the IPCW2{}_{\!2} loss (i.e., (4) with t=tt^{*}=t) is mathematically equivalent to

Lm,tipcw(𝒪(t),ψm;G)=1ni=1nl=1LI{Wi𝒩l}[Δi(t){I(T~i(t)<t,Mi=m)βlm(t)}2G(T~i(t)|Wi)].\displaystyle L_{m,t}^{ipcw}({\cal O}(t),\psi_{m};G)=\frac{1}{n}\sum_{i=1}^{n}\sum_{l=1}^{L}I\{W_{i}\in\mathcal{N}_{l}\}\bigg{[}\frac{\Delta_{i}(t)\{I(\tilde{T}_{i}(t)<t,M_{i}=m)-\beta_{lm}(t)\}^{2}}{G(\tilde{T}_{i}(t)|W_{i})}\bigg{]}. (A.1)

This equivalence follows because I(T~it,Mi=m)=I(T~i(t)<t,Mi=m),i=1,,nI(\tilde{T}_{i}\leq t,M_{i}=m)=I(\tilde{T}_{i}(t)<t,M_{i}=m),i=1,\ldots,n almost surely provided that T~i\tilde{T}_{i} is continuous. Strict inequality on the right-hand side is necessary here because the fact that T~i(t)t\tilde{T}_{i}(t)\leq t almost surely implies 0=I(T~it,Mi=m)I(T~i(t)t,Mi=m)=10=I(\tilde{T}_{i}\leq t,M_{i}=m)\neq I(\tilde{T}_{i}(t)\leq t,M_{i}=m)=1 whenever T~i>t\tilde{T}_{i}>t and Mi=m.M_{i}=m. The critical observation is that the loss estimator (A.1) is now an IPCW estimator in standard form constructed from 𝒪(t),{\cal O}(t), and thus can be augmented similarly to IPCW1{}_{\!1} (cf.  tsiatis2007semiparametric, Sec. 9.3, 10.3). In particular, the augmented loss is given by Lm,tdr(𝒪(t),ψm;G0,Ψ0)=Lm,tipcw(𝒪(t),ψm;G0)+Lm,taug(𝒪(t),ψm;G0,Ψ0),L_{m,t}^{dr}({\cal O}(t),\psi_{m};G_{0},\Psi_{0})=L_{m,t}^{ipcw}({\cal O}(t),\psi_{m};G_{0})+L_{m,t}^{aug}({\cal O}(t),\psi_{m};G_{0},\Psi_{0}), where

Lm,taug(𝒪(t),ψm;G,Ψ)=1nl=1Li=1nI{Wi𝒩l}0T~i(t)Vlm(u;t,Wi,Ψ)G(u|Wi)𝑑MG(u|Wi).L_{m,t}^{aug}(\mathcal{O}(t),\psi_{m};G,\Psi)=\frac{1}{n}\sum_{l=1}^{L}\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}\int_{0}^{\tilde{T}_{i}(t)}\frac{V_{lm}(u;t,W_{i},\Psi)}{G(u|W_{i})}dM_{G}(u|W_{i}).

Note that this last expression is just (6), but with an upper limit of integration of T~i(t)\tilde{T}_{i}(t) in place of T~i\tilde{T}_{i}.

We can now establish the equivalence of the augmented loss functions respectively derived from the IPCW1{}_{\!1} and IPCW2{}_{\!2} losses. Trivially, we may rewrite

Lm,tipcw(𝒪,ψm;G)=1ni=1nl=1LI{Wi𝒩l}[Δi{Zim(t)βlm(t)}2G(Ti|Wi)]\displaystyle L_{m,t}^{ipcw}({\cal O},\psi_{m};G)=\frac{1}{n}\sum_{i=1}^{n}\sum_{l=1}^{L}I\{W_{i}\in\mathcal{N}_{l}\}\bigg{[}\frac{\Delta_{i}\{Z_{im}(t)-\beta_{lm}(t)\}^{2}}{G(T_{i}|W_{i})}\bigg{]} (A.2)

and

Lm,tipcw(𝒪(t),ψm;G)=1ni=1nl=1LI{Wi𝒩l}[Δi(t){Zim(t)βlm(t)}2G(Ti(t)|Wi)].\displaystyle L_{m,t}^{ipcw}({\cal O}(t),\psi_{m};G)=\frac{1}{n}\sum_{i=1}^{n}\sum_{l=1}^{L}I\{W_{i}\in\mathcal{N}_{l}\}\bigg{[}\frac{\Delta_{i}(t)\{Z_{im}(t)-\beta_{lm}(t)\}^{2}}{G(T_{i}(t)|W_{i})}\bigg{]}. (A.3)

Consequently, Lm,tdr(𝒪,ψm;G,Ψ)Lm,tdr(𝒪(t),ψm;G,Ψ)=(A)(B)L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi)-L_{m,t}^{dr}({\cal O}(t),\psi_{m};G,\Psi)=(A)-(B) almost surely, where

(A)=1nl=1Li=1nI{Wi𝒩l}(ΔiG(Ti|Wi)Δi(t)G(Ti(t)|Wi)){Zim(t)βlm(t)}2\displaystyle(A)=\frac{1}{n}\sum_{l=1}^{L}\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}\bigg{(}\frac{\Delta_{i}}{G(T_{i}|W_{i})}-\frac{\Delta_{i}(t)}{G(T_{i}(t)|W_{i})}\bigg{)}\{Z_{im}(t)-\beta_{lm}(t)\}^{2}

and

(B)=1nl=1Li=1nI{Wi𝒩l}T~i(t)T~iVlm(u;t,Wi,Ψ)G(u|Wi)𝑑MG(u|Wi).\displaystyle(B)=\frac{1}{n}\sum_{l=1}^{L}\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}\int_{\tilde{T}_{i}(t)}^{\tilde{T}_{i}}\frac{V_{lm}(u;t,W_{i},\Psi)}{G(u|W_{i})}dM_{G}(u|W_{i}).

Consider the integral term appearing in (B) for any fixed value of ii. By the definition of T~i(t)\tilde{T}_{i}(t), the integral is clearly zero if T~i(t)=T~i\tilde{T}_{i}(t)=\tilde{T}_{i} (i.e., when T~i<t\tilde{T}_{i}<t); hence, we only need to consider the calculation of

tT~iVlm(u;t,Wi,Ψ)G(u|Wi)𝑑MG(u|Wi).\int_{t}^{\tilde{T}_{i}}\frac{V_{lm}(u;t,W_{i},\Psi)}{G(u|W_{i})}dM_{G}(u|W_{i}).

for T~it.\tilde{T}_{i}\geq t. Recall that Vlm(u;t,Wi,Ψ)=EΨ{Zim(t)βlm(t)|Tu,Wi}2=ym(u;t,w,Ψ)2ym(u;t,w,Ψ)βlm(t)+βlm2(t)V_{lm}(u;t,W_{i},\Psi)=E_{\Psi}\{Z_{im}(t)-\beta_{lm}(t)|T\geq u,W_{i}\}^{2}=y_{m}(u;t,w,\Psi)-2y_{m}(u;t,w,\Psi)\beta_{lm}(t)+\beta_{lm}^{2}(t), where ym(u;t,w,Ψ)y_{m}(u;t,w,\Psi) is given in (8). By definition, we have that ym(u;t,w,Ψ)=0y_{m}(u;t,w,\Psi)=0 for utu\geq t; hence, for T~it,\tilde{T}_{i}\geq t,

tT~iVlm(u;t,Wi,Ψ)G(u|Wi)𝑑MG(u|Wi)=βlm2(t)tT~idMG(u|Wi)G(u|Wi).\displaystyle\int_{t}^{\tilde{T}_{i}}\frac{V_{lm}(u;t,W_{i},\Psi)}{G(u|W_{i})}dM_{G}(u|W_{i})=\beta_{lm}^{2}(t)\int_{t}^{\tilde{T}_{i}}\frac{dM_{G}(u|W_{i})}{G(u|W_{i})}. (A.4)

Calculations analogous to those done in bai2013doubly show that

tT~idMG(u|Wi)G(u|Wi)=I(T~it)(1G(t|Wi)ΔG(T~i|Wi)).\displaystyle\int_{t}^{\tilde{T}_{i}}\frac{dM_{G}(u|W_{i})}{G(u|W_{i})}=I(\tilde{T}_{i}\geq t)\bigg{(}\frac{1}{G(t|W_{i})}-\frac{\Delta}{G(\tilde{T}_{i}|W_{i})}\bigg{)}. (A.5)

Substituting (A.5) into (A.4) gives

tT~iVlm(u;t,Wi,Ψ)G(u|Wi)𝑑MG(u|Wi)=βlm2(t)I(T~it)(1G(t|Wi)ΔiG(T~i|Wi)).\displaystyle\int_{t}^{\tilde{T}_{i}}\frac{V_{lm}(u;t,W_{i},\Psi)}{G(u|W_{i})}dM_{G}(u|W_{i})=\beta_{lm}^{2}(t)I(\tilde{T}_{i}\geq t)\bigg{(}\frac{1}{G(t|W_{i})}-\frac{\Delta_{i}}{G(\tilde{T}_{i}|W_{i})}\bigg{)}. (A.6)

Hence we can see that

Lm,tdr(𝒪,ψm;G,Ψ)Lm,tdr(𝒪(t),ψm;G,Ψ)=1ni=1nI(Wi𝒩l)Ωi,L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi)-L_{m,t}^{dr}({\cal O}(t),\psi_{m};G,\Psi)=\frac{1}{n}\sum_{i=1}^{n}I(W_{i}\in\mathcal{N}_{l})\,\Omega_{i},

almost surely, where

Ωi=(ΔiG(Ti|Wi)Δi(t)G(Ti(t)|Wi)){Zim(t)βlm(t)}2+I(T~it)(1G(t|Wi)ΔiG(T~i|Wi))βlm2(t).\Omega_{i}=\bigg{(}\frac{\Delta_{i}}{G(T_{i}|W_{i})}-\frac{\Delta_{i}(t)}{G(T_{i}(t)|W_{i})}\bigg{)}\{Z_{im}(t)-\beta_{lm}(t)\}^{2}+I(\tilde{T}_{i}\geq t)\bigg{(}\frac{1}{G(t|W_{i})}-\frac{\Delta_{i}}{G(\tilde{T}_{i}|W_{i})}\bigg{)}\beta_{lm}^{2}(t).

Recall that TiT_{i} and CiC_{i} are both continuous; hence, P(Ti=t)=P(Ci=t)=0P(T_{i}=t)=P(C_{i}=t)=0. We now consider the calculation of Ωi\Omega_{i} under the 6 possible ways in which Ti,T_{i}, CiC_{i} and tt can be ordered with positive probability:

  1. 1.

    Ti<Ci<t:T_{i}<C_{i}<t: In this case, Δi=Δi(t)=1,\Delta_{i}=\Delta_{i}(t)=1, Ti(t)=Ti,T_{i}(t)=T_{i}, and T~i<t.\tilde{T}_{i}<t. Hence,

    Ωi=(1G(Ti|Wi)1G(Ti|Wi)){Zim(t)βlm(t)}2+0(1G(t|Wi)1G(T~i|Wi))βlm2(t)=0.\Omega_{i}=\bigg{(}\frac{1}{G(T_{i}|W_{i})}-\frac{1}{G(T_{i}|W_{i})}\bigg{)}\{Z_{im}(t)-\beta_{lm}(t)\}^{2}+0\cdot\bigg{(}\frac{1}{G(t|W_{i})}-\frac{1}{G(\tilde{T}_{i}|W_{i})}\bigg{)}\beta_{lm}^{2}(t)=0.
  2. 2.

    Ti<t<Ci:T_{i}<t<C_{i}: As in Case 1, Δi=Δi(t)=1,\Delta_{i}=\Delta_{i}(t)=1, Ti(t)=Ti,T_{i}(t)=T_{i}, and T~i<t;\tilde{T}_{i}<t; hence, Ωi=0.\Omega_{i}=0.

  3. 3.

    t<Ti<Ci:t<T_{i}<C_{i}: In this case, Δi=1,\Delta_{i}=1, Ti(t)=t,T_{i}(t)=t, Δi(t)=I(Ti(t)Ci)=I(tCi)=1\Delta_{i}(t)=I(T_{i}(t)\leq C_{i})=I(t\leq C_{i})=1 and T~i>t\tilde{T}_{i}>t. Moreover, Zim(t)=I(Tit,Mi=1)=0Z_{im}(t)=I(T_{i}\leq t,M_{i}=1)=0 regardless of MiM_{i} because Ti>t.T_{i}>t. Hence,

    Ωi=(1G(Ti|Wi)1G(t|Wi)){0βlm(t)}2+βlm2(t)(1G(t|Wi)1G(Ti|Wi))\displaystyle\Omega_{i}=\bigg{(}\frac{1}{G(T_{i}|W_{i})}-\frac{1}{G(t|W_{i})}\bigg{)}\{0-\beta_{lm}(t)\}^{2}+\beta_{lm}^{2}(t)\bigg{(}\frac{1}{G(t|W_{i})}-\frac{1}{G(T_{i}|W_{i})}\bigg{)}
    =βlm2(t)G(Ti|Wi)βlm2(t)G(t|Wi)+βlm2(t)G(t|Wi)βlm2(t)G(Ti|Wi)=0.\displaystyle=\frac{\beta_{lm}^{2}(t)}{G(T_{i}|W_{i})}-\frac{\beta_{lm}^{2}(t)}{G(t|W_{i})}+\frac{\beta_{lm}^{2}(t)}{G(t|W_{i})}-\frac{\beta_{lm}^{2}(t)}{G(T_{i}|W_{i})}=0.
  4. 4.

    Ci<Ti<t:C_{i}<T_{i}<t: In this case, Δi=0,\Delta_{i}=0, T~i<t,\tilde{T}_{i}<t, Ti(t)=Ti,T_{i}(t)=T_{i}, and Δi(t)=I(Ti(t)Ci)=0.\Delta_{i}(t)=I(T_{i}(t)\leq C_{i})=0. Hence,

    Ωi=(0G(Ti|Wi)0G(Ti|Wi)){Zim(t)βlm(t)}2+0×(1G(t|Wi)0G(T~i|Wi))βlm2(t)=0.\Omega_{i}=\bigg{(}\frac{0}{G(T_{i}|W_{i})}-\frac{0}{G(T_{i}|W_{i})}\bigg{)}\{Z_{im}(t)-\beta_{lm}(t)\}^{2}+0\times\bigg{(}\frac{1}{G(t|W_{i})}-\frac{0}{G(\tilde{T}_{i}|W_{i})}\bigg{)}\beta_{lm}^{2}(t)=0.
  5. 5.

    Ci<t<Ti:C_{i}<t<T_{i}: In this case, Δi=0,\Delta_{i}=0, T~i<t,\tilde{T}_{i}<t, Δi(t)=I(Ti(t)Ci)=0\Delta_{i}(t)=I(T_{i}(t)\leq C_{i})=0 and Ti(t)=t.T_{i}(t)=t. Hence,

    Ωi=(0G(Ti|Wi)0G(t|Wi)){Zim(t)βlm(t)}2+0×(1G(t|Wi)0G(T~i|Wi))βlm2(t)=0.\Omega_{i}=\bigg{(}\frac{0}{G(T_{i}|W_{i})}-\frac{0}{G(t|W_{i})}\bigg{)}\{Z_{im}(t)-\beta_{lm}(t)\}^{2}+0\times\bigg{(}\frac{1}{G(t|W_{i})}-\frac{0}{G(\tilde{T}_{i}|W_{i})}\bigg{)}\beta_{lm}^{2}(t)=0.
  6. 6.

    t<Ci<Ti:t<C_{i}<T_{i}: In this case, Δi=0,\Delta_{i}=0, T~i>t,\tilde{T}_{i}>t, Δi(t)=1,\Delta_{i}(t)=1, Ti(t)=tT_{i}(t)=t and Zim(t)=0Z_{im}(t)=0 regardless of MiM_{i} because Ti>tT_{i}>t. Hence,

    Ωi=(0G(Ti|Wi)1G(t|Wi)){0βlm(t)}2+βlm2(t)(1G(t|Wi)0G(Ti|Wi))\displaystyle\Omega_{i}=\bigg{(}\frac{0}{G(T_{i}|W_{i})}-\frac{1}{G(t|W_{i})}\bigg{)}\{0-\beta_{lm}(t)\}^{2}+\beta_{lm}^{2}(t)\bigg{(}\frac{1}{G(t|W_{i})}-\frac{0}{G(T_{i}|W_{i})}\bigg{)}
    =βlm2(t)G(t|Wi)+βlm2(t)G(t|Wi)=0.\displaystyle=-\frac{\beta_{lm}^{2}(t)}{G(t|W_{i})}+\frac{\beta_{lm}^{2}(t)}{G(t|W_{i})}=0.

In summary, we have proved that Lm,tdr(𝒪,ψm;G,Ψ)Lm,tdr(𝒪(t),ψm;G,Ψ)=0L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi)-L_{m,t}^{dr}({\cal O}(t),\psi_{m};G,\Psi)=0 except possibly on a set which has probability measure 0.

A.2 Further details on CART algorithms using augmented loss

The extension of the CIF regression tree methodology described in Section 2.3 to the case where 𝒪{\cal O} is observed has two main steps:

  1. 1.

    Throughout, replace the L2L_{2} loss function used by CART with Lm,tdr(𝒪,ψm;G^,Ψ^);L_{m,t}^{dr}({\cal O},\psi_{m};\hat{G},\hat{\Psi});

  2. 2.

    Use the modified algorithm in Step #1 to grow a maximal tree, thereby generating a sequence of trees as candidates for the best tree;

  3. 3.

    Using cross-validation, select the best tree from this sequence via cost complexity pruning.

Regarding Step #1, the overall structure of the CART algorithm is independent of the choice of loss function; the specification of the loss function only plays a role in determining how the splitting, evaluation, and model selection decisions are made in Steps #2 and #3 (cf., breiman1984classification).

Step #2 serves to restrict the search space, that is, the set of possible trees that one must consider in determining which tree is optimal; see below. To describe how Step #3 is carried out, we first need to define the notion of a cross-validated risk estimator. Let 𝒪=q=1Q𝒪q{\cal O}=\cup_{q=1}^{Q}{\cal O}_{q} be a partition of 𝒪{\cal O} such that each 𝒪q{\cal O}_{q} contains all observed data on some subset of subjects and each subject in 𝒪{\cal O} appears in exactly one of the sets 𝒪1,,𝒪Q{\cal O}_{1},\ldots,{\cal O}_{Q}. Let pq=1ni=1nSi,q,p_{q}=\frac{1}{n}\sum_{i=1}^{n}S_{i,q}, where Si,q=1S_{i,q}=1 if observation ii is in dataset 𝒪q{\cal O}_{q} and zero otherwise. Recalling (7), define the modified notation

V^lm(u;t,w,ψm)=ym(u;t,w,Ψ^)2ym(u;t,w,Ψ^)ψm(t;w)+[ψm(t;w)]2;\hat{V}_{lm}(u;t,w,\psi_{m})=y_{m}(u;t,w,\hat{\Psi})-2y_{m}(u;t,w,\hat{\Psi})\psi_{m}(t;w)+[\psi_{m}(t;w)]^{2};

as given, V^lm(u;t,w,ψm)\hat{V}_{lm}(u;t,w,\psi_{m}) is now a function of ψm(t;w)\psi_{m}(t;w). Fixing qq and for each i𝒪q,i\in{\cal O}_{q}, let

φi,mdr(𝒪q,ψ^m(q)(t;Wi))=Δi{Z~im(t)ψ^m(q)(t;Wi)}2G^(T~i|Wi)+0T~iV^lm(u;t,Wi,ψ^m(q))G^(u|Wi)𝑑MG^(u|Wi)\varphi_{i,m}^{dr}(\mathcal{O}_{q},\hat{\psi}^{(q)}_{m}(t;W_{i}))=\frac{\Delta_{i}\{\tilde{Z}_{im}(t)-\hat{\psi}^{(q)}_{m}(t;W_{i})\}^{2}}{\hat{G}(\tilde{T}_{i}|W_{i})}+\displaystyle\int_{0}^{\tilde{T}_{i}}\frac{\hat{V}_{lm}(u;t,W_{i},\hat{\psi}^{(q)}_{m})}{\hat{G}(u|W_{i})}dM_{\hat{G}}(u|W_{i})

where ψ^m(q)(t;w)\hat{\psi}^{(q)}_{m}(t;w) is the prediction obtained from a tree that estimates ψ0m(t;w)\psi_{0m}(t;w) from the reduced dataset 𝒪𝒪q.{\cal O}\setminus{\cal O}_{q}. Then, similarly to steingrimsson2016doubly, a cross-validated doubly robust risk estimator can be defined as follows:

^(t;ψ^m(1),,ψ^m(Q))=1nQq=1Qi=1nI(Si,q=1)pqφi,mdr(𝒪q,ψ^m(q)(t;Wi)).\hat{\cal R}(t;\hat{\psi}^{(1)}_{m},\ldots,\hat{\psi}^{(Q)}_{m})=\frac{1}{nQ}\sum_{q=1}^{Q}\sum_{i=1}^{n}\frac{I(S_{i,q}=1)}{p_{q}}~{}\varphi_{i,m}^{dr}(\mathcal{O}_{q},\hat{\psi}^{(q)}_{m}(t;W_{i})).

Returning to Step #3: Suppose Steps #1 and 2 have been run. The corresponding unpruned maximal tree generates a sequence of (say) RR subtrees, each of which is a candidate for the final tree. Each of these trees can be identified from this maximal tree by a unique choice of the so-called cost complexity tuning penalty parameter; call these parameter values α1,,αR\alpha_{1},\ldots,\alpha_{R}. Now, for each fixed r=1,,R,r=1,\ldots,R, let ψ^m(1,r),,ψ^m(Q,r)\hat{\psi}^{(1,r)}_{m},\ldots,\hat{\psi}^{(Q,r)}_{m} denote the sequence of trees obtained by running the doubly robust tree building procedure on the datasets 𝒪𝒪q,q=1,,Q,{\cal O}\setminus{\cal O}_{q},q=1,\ldots,Q, where each such tree is determined by cost complexity pruning using the penalty parameter αr\alpha_{r}. Define ^(t;ψ^m(1,r),,ψ^m(Q,r))\hat{\cal R}(t;\hat{\psi}^{(1,r)}_{m},\ldots,\hat{\psi}^{(Q,r)}_{m}) to be the corresponding cross-validated risk estimates. The final tree is then obtained from the initial maximal tree using the penalty parameter αr^,\alpha_{\hat{r}}, where r^\hat{r} minimizes ^(t;ψ^m(1,r),,ψ^m(Q,r)),r=1,,R.\hat{\cal R}(t;\hat{\psi}^{(1,r)}_{m},\ldots,\hat{\psi}^{(Q,r)}_{m}),r=1,\ldots,R.

The indicated procedure is modified in an obvious way to accommodate IPCW-type and Buckley-James Brier loss functions. In the case of the corresponding composite loss function, a weighted sum of the desired loss estimates over t1,,tJt_{1},\ldots,t_{J} is instead used. Beyond these loss modifications, estimation of the maximal tree and selection of the corresponding best tree is carried out exactly as described above. In practice, implementation of the algorithm just described is possible using rpart’s capabilities for incorporating user-written splitting and evaluation functions (therneau2015introduction). Examples of such code are provided as part of the Supplementary Materials for this paper.

A.3 Estimating Ψ\Psi for computing augmented loss functions

Section 4.3 discusses methods for estimating both the censoring distribution GG and the parameter Ψ\Psi when computing observed data loss functions. This section expands on the four methods for estimating Ψ,\Psi, needed for computing (8), that are described in Section 4.3. We assume K=2.K=2.

  1. 1.

    RF(npar): for m=1,2,m=1,2, ψ0m(|)\psi_{0m}(\cdot|\cdot) is estimated using nonparametric random forests. More specifically, the rfsrc function in the randomForestSRC package is used (ishwaran2014random), where the predict command is used to estimate ψ01(u|w)\psi_{01}(u|w) and ψ¯0(u|w)\bar{\psi}_{0}(u|w) for K=2K=2 and u0u\geq 0. For m=1,2m=1,2 and B=500B=500 bootstrap samples, the components of Ψ^\hat{\Psi} are given by ψ^mrf(u|w)=B1b=1Bψ^m[b](u|w);\hat{\psi}_{m}^{\mbox{\tiny rf}}(u|w)=B^{-1}\sum_{b=1}^{B}\hat{\psi}_{m[b]}(u|w); here, for the bthb^{th} bootstrap sample, ψ^1[b](u|w)=P^[b](Tu,M=1|W=w)\hat{\psi}_{1[b]}(u|w)=\hat{P}_{[b]}(T\leq u,M=1|W=w) is the estimated CIF of interest and ψ^2[b](u|w)=P^[b](Tu,M1|W=w)\hat{\psi}_{2[b]}(u|w)=\hat{P}_{[b]}(T\leq u,M\neq 1|W=w) is the corresponding ”all other causes” estimate. The event-free survival function is estimated in the obvious manner.

  2. 2.

    RF+lgtc: for m=1,2,m=1,2, ψ0m(|)\psi_{0m}(\cdot|\cdot) is estimated using “random parametric forest”. That is, as above, the rfsrc function in the randomForestSRC package is first used (ishwaran2014random) to construct a forest (i.e., B=500B=500 trees) with K=2K=2. Using the tree obtained for the bthb^{th} bootstrapped sample, ψ^m[b](u|w)=P^[b](Tu,M=m|W=w),m=1,2\hat{\psi}_{m[b]}(u|w)=\hat{P}_{[b]}(T\leq u,M=m|W=w),m=1,2 are then estimated by fitting the three parameter logistic cumulative incidence model of cheng2009modeling separately to the data falling into each terminal node. Finally, similarly to ψ^mrf(u|w),\hat{\psi}_{m}^{\mbox{\tiny rf}}(u|w), the desired ensemble predictors are obtained by computing bootstrap averages.

  3. 3.

    FG(true): the simulation model of Section 4.1 is an example of the class of models considered in jeong2007parametric. For m=1,2,m=1,2, ψ0m(|)\psi_{0m}(\cdot|\cdot) are estimated by maximum likelihood under a correctly specified parametric model.

  4. 4.

    lgtc+godds: for m=1,2,m=1,2, ψ0m(|)\psi_{0m}(\cdot|\cdot) is estimated by maximum likelihood using the parametric cumulative incidence model proposed in shi2013constrained.

A.3.1 The CIF models of cheng2009modeling shi2013constrained

Here, we describe the parametric models of cheng2009modeling shi2013constrained that are used by RF+lgtc and lgtc+godds. Specifically, assuming K=2K=2 and hence for m=1,2,m=1,2, these models are given below:

  1. 1.

    Three parameter logistic model without covariates (RF+lgtc; cheng2009modeling):
    This model is motivated by nonlinear modeling in bioassay and dose-response curves (ritz2005bioassay). Specifically, the parametric model that is fit by maximum likelihood separately within each terminal node of a tree is given by

    ψm(t;ξm)=pmexp{bm(tcm)}pmexp(bmcm)1+exp{bm(tcm)}\displaystyle\psi_{m}(t;\xi_{m})=\frac{p_{m}\exp\{b_{m}(t-c_{m})\}-p_{m}\exp(-b_{m}c_{m})}{1+\exp\{b_{m}(t-c_{m})\}}

    where ξm=(bm,cm,pm)T\xi_{m}=(b_{m},c_{m},p_{m})^{T} consists of three parameters: pmp_{m} is upper asymptote of ψm(;)\psi_{m}(\cdot;\cdot) when tt\rightarrow\infty; and, bmb_{m} and cmc_{m} are the slope and center of the curve (cheng2009modeling). To satisfy the additivity constraint required of CIFs, we further assume p2=1p1p_{2}=1-p_{1}.

  2. 2.

    Three parameter logistic model with covariates (lgtc+godds; shi2013constrained):
    shi2013constrained extends the model of cheng2009modeling to incorporate covariates. Specifically, the parametric model used for ψ0m(t;w)\psi_{0m}(t;w) is

    ψm(t;𝐖,𝜸,ξm)=gm1[gm{ψm(t;ξm)}+𝜸T𝐖]\displaystyle\psi_{m}(t;\mathbf{W},\boldsymbol{\gamma},\xi_{m})=g_{m}^{-1}[g_{m}\{\psi_{m}(t;\xi_{m})\}+\boldsymbol{\gamma}^{T}\mathbf{W}]

    where gmg_{m} is a specified nondecreasing function and ψm(t;ξm)\psi_{m}(t;\xi_{m}) is the covariate-independent model of cheng2009modeling. One possible form of gmg_{m} is

    gm(u;αm)=log[{(1u)αm1}/αm],0<αm<\displaystyle g_{m}(u;\alpha_{m})=\log[\{(1-u)^{-\alpha_{m}}-1\}/\alpha_{m}],0<\alpha_{m}<\infty

    where αm\alpha_{m} is parameter for each event type. This model is an extension of the generalized odds-rate model (dabrowska1988estimation). When using a generalized odds model as a link function gm(;)g_{m}(\cdot;\cdot), the CIF for m=1m=1 is given by the formula ψ1(t;𝑾)=1H(t;𝑾)1/α1,\psi_{1}(t;\boldsymbol{W})=1-H(t;\boldsymbol{W})^{-1/\alpha_{1}}, where

    H(t;𝑾)={(1p1exp{b1(tc1)p1exp(b1c1)}1+exp{b1(tc1)})α11}exp(𝜷1T𝑾)+1;\displaystyle H(t;\boldsymbol{W})=\bigg{\{}\bigg{(}1-\frac{p_{1}\exp\{b_{1}(t-c_{1})-p_{1}\exp(-b_{1}c_{1})\}}{1+\exp\{b_{1}(t-c_{1})\}}\bigg{)}^{-\alpha_{1}}\!\!\!\!-1\bigg{\}}\exp(\boldsymbol{\beta}_{1}^{T}\boldsymbol{W})+1;

    the CIF for m=2m=2 is given by

    ψ2(t;𝑾)=p2(𝑾)[exp{b2(tc2)}exp{b2c2}]1+exp{b2(tc2)},\displaystyle\psi_{2}(t;\boldsymbol{W})=\frac{p_{2}(\boldsymbol{W})[\exp\{b_{2}(t-c_{2})\}-\exp\{-b_{2}c_{2}\}]}{1+\exp\{b_{2}(t-c_{2})\}},

    where p2(𝑾)=1ψ1(;𝑾)=[{(1p1)α11}exp(𝜷1T𝑾))+1]1/α1p_{2}(\boldsymbol{W})=1-\psi_{1}(\infty;\boldsymbol{W})=[\{(1-p_{1})^{-\alpha_{1}}-1\}\exp(\boldsymbol{\beta}_{1}^{T}\boldsymbol{W}))+1]^{-1/\alpha_{1}}. Observe that the additive constraint ψ1(;𝑾)+ψ2(;𝑾)=1\psi_{1}(\infty;\boldsymbol{W})+\psi_{2}(\infty;\boldsymbol{W})=1 is satisfied. In addition, note that a separate covariate effect for m=2m=2 is not modeled and hence there are no additional regression parameters to be estimated (i.e., even though CIF for m=2m=2 does depend on covariates through p2(𝑾)p_{2}(\boldsymbol{W})).

A.4 Miscellaneous Algorithm Specifications

Two important tuning parameters that govern the size of the maximal tree built in Step #2 of general CART algorithm are minbucket, the minimum possible number of observations in a terminal node, and minsplit, the minimum number of observations in a node to be considered for a possible split. Throughout, we set minbucket=10 and minsplit=30. For the doubly robust and Buckley-James methods, both uncensored and censored observations are counted when considering these limits; for IPCW methods, only uncensored observations are counted.

A.5 Additional simulation results

In this section, we summarize the full set of simulation results using both“single time point” and composite loss functions. We recall that the IPCW1 and IPCW2 refer to the usual (t=t^{*}=\infty) and modified (t=tt^{*}=t) IPCW methods, respectively. As described elsewhere, the Buckley-James (BJ) and doubly robust (DR) methods use several approaches to estimating (8), which is required for computing the augmented loss function. In particular, as described in Sections 4.3 (see also Section A.3.1), we use nonparametric random forests (RF (npar)), random parametric forests (RF+lgtc), the true data generating model (FG (true)) and the three-parameter logistic distribution in combination with a generalized odds model (Lgtc+godds). Results are summarized for each type of loss function. For example, results corresponding to“BJ-RF (npar)” means that we use the BJ loss function in combination with nonparametric random forests for calculating (8).

Table A.1A.4 summarize the numerical performance of the fitted trees and Figures A.1A.3 show boxplots of prediction error. The choice of time points and method of summarizing the results in the tables and figures is the same as in Section 4.4. It is easy to see that (i) performance generally improves with increasing sample size and tends to be worst at t1t_{1}; (ii) there is substantially more variation in results using a single time point loss in comparison to a composite loss; and, (iii) the use of the composite loss function results in better overall performance, inlcuding lower prediction error, in comparison to the use of any single time point loss function.

In the medium and low signal settings, greater differences between methods emerge, with a clear improvement at n=500n=500 compared to n=250.n=250. Regardless of sample size, IPCW1 has the least favorable performance overall in both the medium and low signal settings, and BJ-FG(true) has the most favorable performance overall; these differences are especially noticeable in the low signal setting.

It is also clear that IPCW2 generally exhibits less variability than IPCW1 at time points t1t_{1} and t2t_{2}, whereas for time t3t_{3}, the two measures perform similarly. This last result is expected since Δ(t)G^(T(t)|W)ΔG^(T|W)\frac{\Delta(t^{*})}{\hat{G}(T(t^{*})|W)}\rightarrow\frac{\Delta}{\hat{G}(T|W)} as tt^{*}\rightarrow\infty.

The results further show that the choice of method for estimating the conditional expectation has little overall impact on performance; the main differences stem from the type of loss (IPCW versus DR versus BJ) and whether or not a composite loss function is used. The results also clearly highlight the value of getting the data generating model exactly right, and otherwise demonstrate a desirable degree of robustness across methods, particularly for doubly robust loss. When only comparing IPCW1 and the doubly robust methods, the efficiency gain is evident, particularly at time point t3t_{3} using a single time point loss. The performance for IPCW2 is often reasonably close to that for the Buckley-James and doubly robust loss functions, especially for a composite loss function; the comparative degree of simplicity involved in implementing this method has much to recommend it.

Table A.1: Numerical summaries for trees built using a single time point loss function when n=250n=250. IPCW1 and IPCW2 are time-independent IPCW and time-dependent IPCW, respectively; BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true) are respectively built using Buckley-James loss functions with augmentation term estimated via nonparametric random forests, random parametric forests using ensembles of parametric models described in cheng2009modeling, the parametric model of shi2013constrained, and under the correct (i.e., consistently estimated) simulation model; and, DR-RF(npar), DR-RF+lgtc, DR-lgtc+godds and DR-FG(true) are built using the doubly robust loss function, respectively using the same methods for calculating the augmentation term as in BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true).
High Sig Med Sig Low Sig
||fitted-3|| NSP PCSP ||fitted-3|| NSP PCSP ||fitted-3|| NSP PCSP
IPCW1 t1t_{1} 0.432 0.196 0.724 1.668 0.118 0.112 1.890 0.072 0.020
t2t_{2} 0.204 0.138 0.890 0.838 0.166 0.494 1.644 0.124 0.118
t3t_{3} 0.294 0.188 0.816 0.780 0.250 0.556 1.332 0.198 0.246
IPCW2 t1t_{1} 0.350 0.230 0.800 1.428 0.208 0.220 1.834 0.114 0.040
t2t_{2} 0.130 0.072 0.922 0.546 0.168 0.674 1.500 0.150 0.182
t3t_{3} 0.354 0.242 0.800 0.726 0.216 0.576 1.318 0.134 0.238
BJ-RF(npar) t1t_{1} 0.392 0.280 0.814 1.298 0.286 0.284 1.840 0.160 0.042
t2t_{2} 0.236 0.162 0.876 0.424 0.238 0.770 1.354 0.258 0.252
t3t_{3} 0.320 0.258 0.858 0.336 0.218 0.818 0.980 0.304 0.438
BJ-RF+lgtc t1t_{1} 0.326 0.234 0.822 1.302 0.304 0.290 1.840 0.140 0.044
t2t_{2} 0.256 0.182 0.864 0.404 0.228 0.774 1.344 0.232 0.256
t3t_{3} 0.302 0.236 0.858 0.234 0.144 0.854 0.930 0.256 0.448
BJ-FG(true) t1t_{1} 0.336 0.260 0.826 0.950 0.284 0.482 1.694 0.152 0.106
t2t_{2} 0.190 0.140 0.902 0.204 0.134 0.882 0.682 0.246 0.630
t3t_{3} 0.142 0.116 0.932 0.150 0.098 0.908 0.202 0.138 0.886
BJ-lgtc+godds t1t_{1} 0.392 0.282 0.792 1.182 0.320 0.354 1.766 0.146 0.070
t2t_{2} 0.342 0.244 0.850 0.368 0.236 0.818 1.024 0.284 0.404
t3t_{3} 0.616 0.336 0.728 0.426 0.262 0.816 0.702 0.386 0.602
DR-RF(npar) t1t_{1} 0.342 0.248 0.814 1.378 0.246 0.246 1.838 0.138 0.044
t2t_{2} 0.170 0.128 0.898 0.508 0.242 0.696 1.506 0.222 0.182
t3t_{3} 0.288 0.188 0.860 0.480 0.128 0.688 1.382 0.122 0.236
DR-RF+lgtc t1t_{1} 0.338 0.236 0.816 1.392 0.288 0.250 1.826 0.132 0.044
t2t_{2} 0.166 0.122 0.916 0.524 0.248 0.692 1.518 0.210 0.168
t3t_{3} 0.270 0.190 0.864 0.544 0.138 0.660 1.426 0.144 0.222
DR-FG(true) t1t_{1} 0.366 0.266 0.818 1.362 0.268 0.274 1.830 0.134 0.046
t2t_{2} 0.168 0.126 0.910 0.482 0.208 0.734 1.416 0.150 0.212
t3t_{3} 0.244 0.164 0.866 0.588 0.234 0.686 1.318 0.148 0.284
DR-lgtc+godds t1t_{1} 0.366 0.274 0.802 1.366 0.250 0.264 1.842 0.112 0.036
t2t_{2} 0.126 0.100 0.920 0.510 0.236 0.722 1.518 0.222 0.180
t3t_{3} 0.236 0.160 0.854 0.578 0.174 0.648 1.404 0.116 0.236
Table A.2: Numerical summaries for trees built using a single time point loss function when n=500n=500. IPCW1 and IPCW2 are time-independent IPCW and time-dependent IPCW, respectively; BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true) are respectively built using Buckley-James loss functions with augmentation term estimated via nonparametric random forests, random parametric forests using ensembles of parametric models described in cheng2009modeling, the parametric model of shi2013constrained, and under the correct (i.e., consistently estimated) simulation model; and, DR-RF(npar), DR-RF+lgtc, DR-lgtc+godds and DR-FG(true) are built using the doubly robust loss function, respectively using the same methods for calculating the augmentation term as in BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true).
High Sig Med Sig Low Sig
||fitted-3|| NSP PCSP ||fitted-3|| NSP PCSP ||fitted-3|| NSP PCSP
IPCW1 t1t_{1} 0.154 0.110 0.892 1.058 0.184 0.422 1.842 0.078 0.064
t2t_{2} 0.124 0.066 0.932 0.222 0.138 0.850 1.020 0.122 0.422
t3t_{3} 0.134 0.092 0.916 0.178 0.106 0.892 0.672 0.134 0.594
IPCW2 t1t_{1} 0.198 0.130 0.886 0.480 0.220 0.726 1.588 0.172 0.166
t2t_{2} 0.118 0.076 0.928 0.120 0.090 0.904 0.666 0.170 0.614
t3t_{3} 0.116 0.086 0.916 0.162 0.094 0.898 0.632 0.176 0.626
BJ-RF+npar t1t_{1} 0.192 0.128 0.906 0.394 0.208 0.774 1.506 0.192 0.214
t2t_{2} 0.120 0.070 0.924 0.178 0.126 0.894 0.366 0.162 0.784
t3t_{3} 0.136 0.086 0.918 0.182 0.114 0.894 0.190 0.136 0.868
BJ-RF+lgtc t1t_{1} 0.192 0.136 0.904 0.428 0.240 0.762 1.554 0.198 0.196
t2t_{2} 0.086 0.052 0.934 0.170 0.120 0.898 0.362 0.162 0.792
t3t_{3} 0.142 0.090 0.918 0.174 0.108 0.900 0.200 0.140 0.870
BJ-FG(true) t1t_{1} 0.184 0.136 0.912 0.250 0.166 0.852 1.190 0.244 0.370
t2t_{2} 0.110 0.072 0.950 0.186 0.134 0.880 0.122 0.086 0.906
t3t_{3} 0.082 0.066 0.956 0.104 0.074 0.942 0.098 0.058 0.936
BJ-lgtc+godds t1t_{1} 0.194 0.140 0.904 0.326 0.222 0.804 1.314 0.214 0.300
t2t_{2} 0.196 0.126 0.902 0.226 0.146 0.862 0.270 0.152 0.828
t3t_{3} 0.526 0.190 0.704 0.350 0.192 0.818 0.218 0.146 0.852
DR-RF(npar) t1t_{1} 0.184 0.136 0.910 0.456 0.204 0.742 1.546 0.140 0.186
t2t_{2} 0.118 0.066 0.934 0.134 0.094 0.908 0.626 0.148 0.654
t3t_{3} 0.102 0.068 0.938 0.088 0.056 0.944 0.498 0.140 0.692
DR-RF+lgtc t1t_{1} 0.158 0.116 0.918 0.482 0.202 0.722 1.568 0.146 0.180
t2t_{2} 0.108 0.052 0.930 0.118 0.088 0.916 0.644 0.158 0.652
t3t_{3} 0.076 0.054 0.952 0.110 0.070 0.936 0.524 0.140 0.678
DR-FG(true) t1t_{1} 0.190 0.140 0.906 0.444 0.188 0.756 1.576 0.172 0.186
t2t_{2} 0.084 0.050 0.956 0.124 0.080 0.916 0.552 0.134 0.676
t3t_{3} 0.096 0.066 0.946 0.078 0.054 0.950 0.522 0.128 0.686
DR-lgtc+godds t1t_{1} 0.206 0.140 0.904 0.440 0.186 0.744 1.552 0.192 0.170
t2t_{2} 0.098 0.050 0.946 0.138 0.094 0.908 0.594 0.148 0.668
t3t_{3} 0.084 0.054 0.948 0.118 0.076 0.928 0.496 0.112 0.682
Table A.3: Numerical summaries for trees built using a single time point loss function when n=1000n=1000. IPCW1 and IPCW2 are time-independent IPCW and time-dependent IPCW, respectively; BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true) are respectively built using Buckley-James loss functions with augmentation term estimated via nonparametric random forests, random parametric forests using ensembles of parametric models described in cheng2009modeling, the parametric model of shi2013constrained, and under the correct (i.e., consistently estimated) simulation model; and, DR-RF(npar), DR-RF+lgtc, DR-lgtc+godds and DR-FG(true) are built using the doubly robust loss function, respectively using the same methods for calculating the augmentation term as in BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true).
High Sig Med Sig Low Sig
||fitted-3|| NSP PCSP ||fitted-3|| NSP PCSP ||fitted-3|| NSP PCSP
IPCW1 t1t_{1} 0.106 0.076 0.930 0.132 0.072 0.908 1.328 0.134 0.284
t2t_{2} 0.066 0.042 0.962 0.134 0.114 0.904 0.208 0.116 0.864
t3t_{3} 0.092 0.054 0.936 0.070 0.052 0.952 0.124 0.088 0.916
IPCW2 t1t_{1} 0.088 0.072 0.946 0.092 0.070 0.934 0.646 0.170 0.650
t2t_{2} 0.064 0.042 0.956 0.116 0.090 0.924 0.122 0.076 0.916
t3t_{3} 0.100 0.056 0.932 0.082 0.056 0.944 0.130 0.094 0.922
BJ-RF+npar t1t_{1} 0.106 0.082 0.924 0.114 0.084 0.932 0.488 0.156 0.722
t2t_{2} 0.088 0.056 0.950 0.168 0.118 0.890 0.108 0.070 0.924
t3t_{3} 0.114 0.060 0.932 0.180 0.110 0.894 0.142 0.106 0.906
BJ-RF+lgtc t1t_{1} 0.126 0.102 0.922 0.104 0.078 0.938 0.458 0.152 0.738
t2t_{2} 0.064 0.048 0.958 0.136 0.094 0.904 0.086 0.056 0.934
t3t_{3} 0.124 0.068 0.924 0.160 0.104 0.898 0.138 0.094 0.904
BJ-FG(true) t1t_{1} 0.108 0.088 0.932 0.088 0.056 0.946 0.218 0.100 0.866
t2t_{2} 0.084 0.070 0.956 0.058 0.042 0.954 0.076 0.050 0.944
t3t_{3} 0.036 0.024 0.970 0.104 0.074 0.942 0.060 0.040 0.958
BJ-lgtc+godds t1t_{1} 0.112 0.084 0.922 0.110 0.082 0.934 0.302 0.120 0.820
t2t_{2} 0.164 0.094 0.922 0.162 0.090 0.902 0.100 0.076 0.932
t3t_{3} 0.774 0.128 0.508 0.358 0.126 0.786 0.204 0.122 0.860
DR-RF(npar) t1t_{1} 0.088 0.070 0.942 0.102 0.072 0.934 0.556 0.168 0.692
t2t_{2} 0.068 0.048 0.958 0.098 0.068 0.930 0.104 0.072 0.926
t3t_{3} 0.048 0.038 0.978 0.048 0.030 0.964 0.098 0.076 0.936
DR-RF+lgtc t1t_{1} 0.092 0.070 0.938 0.094 0.066 0.938 0.588 0.168 0.682
t2t_{2} 0.048 0.034 0.962 0.100 0.068 0.932 0.090 0.064 0.934
t3t_{3} 0.040 0.032 0.982 0.034 0.024 0.974 0.098 0.074 0.938
DR-FG(true) t1t_{1} 0.080 0.066 0.940 0.068 0.048 0.954 0.532 0.164 0.706
t2t_{2} 0.082 0.062 0.960 0.092 0.066 0.938 0.072 0.058 0.950
t3t_{3} 0.050 0.042 0.974 0.056 0.036 0.968 0.088 0.074 0.948
DR-lgtc+godds t1t_{1} 0.068 0.054 0.948 0.094 0.064 0.938 0.564 0.162 0.690
t2t_{2} 0.054 0.036 0.962 0.114 0.078 0.934 0.074 0.054 0.938
t3t_{3} 0.048 0.022 0.970 0.040 0.024 0.972 0.062 0.052 0.948
Table A.4: Numerical summaries for trees built using multiple time points when n=250,500n=250,500 and 10001000. IPCW1 and IPCW2 are time-independent IPCW and time-dependent IPCW, respectively; BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true) are respectively built using Buckley-James loss functions with augmentation term estimated via nonparametric random forests, random parametric forests using ensembles of parametric models described in cheng2009modeling, the parametric model of shi2013constrained, and under the correct (i.e., consistently estimated) simulation model; and, DR-RF(npar), DR-RF+lgtc, DR-lgtc+godds and DR-FG(true) are built using the doubly robust loss function, respectively using the same methods for calculating the augmentation term as in BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true).
High Sig Med Sig Low Sig
||fitted-3|| NSP PCSP ||fitted-3|| NSP PCSP ||fitted-3|| NSP PCSP
n=250n=250 IPCW1 0.168 0.116 0.914 0.506 0.132 0.670 1.286 0.102 0.272
IPCW2 0.156 0.120 0.916 0.314 0.142 0.798 1.128 0.146 0.358
BJ-RF(npar) 0.234 0.186 0.878 0.304 0.158 0.810 1.128 0.246 0.364
BJ-RF+lgtc 0.250 0.196 0.882 0.290 0.158 0.824 1.110 0.244 0.360
BJ-FG(true) 0.156 0.116 0.910 0.194 0.142 0.892 0.348 0.182 0.790
BJ-lgtc+godds 0.358 0.246 0.844 0.364 0.266 0.814 0.788 0.328 0.532
DR-RF(npar) 0.150 0.120 0.924 0.272 0.138 0.828 1.156 0.146 0.334
DR-RF+lgtc 0.152 0.126 0.912 0.264 0.120 0.834 1.134 0.178 0.342
DR-FG(true) 0.132 0.090 0.932 0.194 0.092 0.880 1.050 0.170 0.408
DR-lgtc+godds 0.120 0.092 0.926 0.286 0.144 0.846 1.044 0.146 0.378
n=500n=500 IPCW1 0.132 0.084 0.916 0.182 0.132 0.874 0.558 0.164 0.658
IPCW2 0.124 0.082 0.932 0.138 0.100 0.906 0.282 0.136 0.830
BJ-RF(npar) 0.142 0.090 0.904 0.148 0.088 0.908 0.226 0.128 0.878
BJ-RF+lgtc 0.130 0.082 0.908 0.156 0.094 0.908 0.210 0.124 0.880
BJ-FG(true) 0.092 0.068 0.940 0.102 0.064 0.934 0.118 0.082 0.918
BJ-lgtc+godds 0.250 0.122 0.848 0.156 0.080 0.896 0.154 0.104 0.896
DR-RF(npar) 0.066 0.052 0.958 0.092 0.056 0.940 0.216 0.118 0.856
DR-RF+lgtc 0.040 0.030 0.966 0.092 0.052 0.938 0.230 0.106 0.834
DR-FG(true) 0.058 0.044 0.966 0.092 0.058 0.942 0.176 0.076 0.874
DR-lgtc+godds 0.040 0.030 0.970 0.082 0.056 0.950 0.214 0.102 0.852
n=1000n=1000 IPCW1 0.080 0.036 0.940 0.122 0.096 0.928 0.088 0.074 0.936
IPCW2 0.068 0.052 0.948 0.084 0.046 0.946 0.094 0.074 0.938
BJ-RF(npar) 0.110 0.062 0.918 0.112 0.074 0.918 0.096 0.068 0.934
BJ-RF+lgtc 0.080 0.052 0.938 0.126 0.084 0.912 0.094 0.064 0.936
BJ-FG(true) 0.072 0.046 0.954 0.088 0.060 0.940 0.086 0.070 0.954
BJ-lgtc+godds 0.270 0.078 0.818 0.140 0.076 0.904 0.100 0.066 0.928
DR-RF(npar) 0.050 0.028 0.976 0.068 0.046 0.952 0.098 0.068 0.928
DR-RF+lgtc 0.038 0.022 0.978 0.056 0.044 0.956 0.104 0.078 0.930
DR-FG(true) 0.030 0.022 0.978 0.056 0.034 0.966 0.058 0.042 0.958
DR-lgtc+godds 0.028 0.016 0.984 0.060 0.044 0.952 0.082 0.056 0.944
Figure A.1: Prediction error for event 1 when n=250n=250 (multiplied by 100). Losses using a single time point and multiple time points. IPCW1 and IPCW2 are time-independent IPCW and time-dependent IPCW, respectively; BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true) are respectively built using Buckley-James loss functions with augmentation term estimated via nonparametric random forests, random parametric forests using ensembles of parametric models described in cheng2009modeling, the parametric model of shi2013constrained, and under the correct (i.e., consistently estimated) simulation model; and, DR-RF(npar), DR-RF+lgtc, DR-lgtc+godds and DR-FG(true) are built using the doubly robust loss function, respectively using the same methods for calculating the augmentation term as in BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true).
Refer to captionRefer to captionRefer to caption
Figure A.2: Prediction error for event 1 when n=500n=500 (multiplied by 100). Losses using a single time point and multiple time points. IPCW1 and IPCW2 are time-independent IPCW and time-dependent IPCW, respectively; BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true) are respectively built using Buckley-James loss functions with augmentation term estimated via nonparametric random forests, random parametric forests using ensembles of parametric models described in cheng2009modeling, the parametric model of shi2013constrained, and under the correct (i.e., consistently estimated) simulation model; and, DR-RF(npar), DR-RF+lgtc, DR-lgtc+godds and DR-FG(true) are built using the doubly robust loss function, respectively using the same methods for calculating the augmentation term as in BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true).
Refer to captionRefer to captionRefer to caption
Figure A.3: Prediction error for event 1 when n=1000n=1000 (multiplied by 100). Losses using a single time point and multiple time points. IPCW1 and IPCW2 are time-independent IPCW and time-dependent IPCW, respectively; BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true) are respectively built using Buckley-James loss functions with augmentation term estimated via nonparametric random forests, random parametric forests using ensembles of parametric models described in cheng2009modeling, the parametric model of shi2013constrained, and under the correct (i.e., consistently estimated) simulation model; and, DR-RF(npar), DR-RF+lgtc, DR-lgtc+godds and DR-FG(true) are built using the doubly robust loss function, respectively using the same methods for calculating the augmentation term as in BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true).
Refer to captionRefer to captionRefer to caption

A.6 Tree results in data analysis

In this subsection, we show various trees built for the RTOG9410 study described in Section 5. Numbers in each terminal node represent terminal node estimators (i.e., CIF estimates) generated by the loss function used to build the corresponding tree.

Figure A.4: Tree built using uncensored data for out-field failure
\pgfmathresultptage << 50.5t1t_{1}: 0.29t2t_{2}: 0.34t3t_{3}: 0.47t4t_{4}: 0.53t5t_{5}: 0.6histology=squamous?t1t_{1}: 0.05t2t_{2}: 0.10t3t_{3}: 0.17t4t_{4}: 0.26t5t_{5}: 0.29age << 70.5t1t_{1}: 0.17t2t_{2}: 0.24t3t_{3}: 0.34t4t_{4}: 0.41t5t_{5}: 0.46t1t_{1}: 0.04t2t_{2}: 0.06t3t_{3}: 0.12t4t_{4}: 0.24t5t_{5}: 0.29YesNoYesNoYesNo
Figure A.5: Tree built using 29% artificially censored data for out-field failure for IPCW1
\pgfmathresultptage << 50.5t1t_{1}: 0.29 t2t_{2} : 0.36 t3t_{3} : 0.50 t4t_{4}:0.55 t5t_{5} : 0.61age << 71.5t1t_{1}: 0.04 t2t_{2} : 0.04 t3t_{3} : 0.10 t4t_{4} : 0.12 t5t_{5}: 0.14histology=squamous?t1t_{1}: 0.05 t2t_{2} : 0.11 t3t_{3}: 0.17 t4t_{4}: 0.26 t5t_{5} : 0.30t1t_{1}: 0.16 t2t_{2} : 0.22 t3t_{3} : 0.33 t4t_{4} : 0.40 t5t_{5} : 0.46YesNoNoYesYesNo
Figure A.6: Tree built using 29% artificially censored data for out-field failure for IPCW2
\pgfmathresultptage << 50.5t1t_{1}: 0.29 t2t_{2} : 0.35 t3t_{3} : 0.48 t4t_{4}:0.54 t5t_{5} : 0.61age << 71.5t1t_{1}: 0.04 t2t_{2} : 0.04 t3t_{3} : 0.10 t4t_{4} : 0.12 t5t_{5}: 0.15histology=squamous?t1t_{1}: 0.05 t2t_{2} : 0.11 t3t_{3}: 0.17 t4t_{4}: 0.25 t5t_{5} : 0.30t1t_{1}: 0.16 t2t_{2} : 0.23 t3t_{3} : 0.33 t4t_{4} : 0.40 t5t_{5} : 0.46YesNoNoYesYesNo
Figure A.7: Tree built using 29% artificially censored data for out-field failure for BJ-RF(npar)
\pgfmathresultptage << 50.5t1t_{1}: 0.28 t2t_{2} : 0.35 t3t_{3} : 0.47 t4t_{4} : 0.53 t5t_{5} : 0.58age << 71.5t1t_{1}: 0.04 t2t_{2} : 0.04 t3t_{3} : 0.11 t4t_{4} : 0.14 t5t_{5} : 0.17histology=squamous?t1t_{1}: 0.06 t2t_{2} : 0.11 t3t_{3} : 0.18 t4t_{4} : 0.26 t5t_{5} : 0.30t1t_{1}: 0.16 t2t_{2} : 0.23 t3t_{3} : 0.33 t4t_{4} : 0.40 t5t_{5} : 0.46YesNoNoYesYesNo
Figure A.8: Tree built using 29% artificially censored data for out-field failure for DR-RF(npar)
\pgfmathresultptage << 50.5t1t_{1}: 0.29 t2t_{2} : 0.36 t3t_{3} : 0.49 t4t_{4} : 0.54 t5t_{5} : 0.60age << 71.5t1t_{1}:0.04 t2t_{2} : 0.04 t3t_{3} : 0.10 t4t_{4} : 0.12 t5t_{5} : 0.15histology=squamous?t1t_{1}: 0.05 t2t_{2} : 0.11 t3t_{3} : 0.17 t4t_{4} : 0.25 t5t_{5} : 0.30t1t_{1}: 0.16 t2t_{2} : 0.23 t3t_{3} : 0.34 t4t_{4} : 0.41 t5t_{5} :0.47 YesNoNoYesYesNo
Figure A.9: Tree built using 29% artificially censored data for out-field failure for BJ-RF+lgtc
\pgfmathresultptage << 50.5t1t_{1} : 0.27 t2t_{2} : 0.33 t3t_{3} :0.45 t4t_{4} :0.50 t5t_{5} :0.54age << 70.5t1t_{1} : 0.03 t2t_{2} : 0.03 t3t_{3} : 0.10 t4t_{4} : 0.12 t5t_{5} : 0.15histology=squamous?t1t_{1} : 0.05t2t_{2} : 0.10 t3t_{3} : 0.15 t4t_{4} : 0.22 t5t_{5} : 0.26t1t_{1} : 0.16 t2t_{2} : 0.22 t3t_{3} : 0.31 t4t_{4} : 0.38 t5t_{5} : 0.42YesNoNoYesYesNo
Figure A.10: Tree built using 29% artificially censored data for out-field failure for BJ-RF+lgtc
\pgfmathresultptage << 50.5t1t_{1}: 0.30 t2t_{2}: 0.36 t3t_{3}: 0.50 t4t_{4}: 0.55 t5t_{5}: 0.62age << 70.5t1t_{1} : 0.03 t2t_{2} : 0.03 t3t_{3} : 0.11 t4t_{4} : 0.13 t5t_{5} : 0.17histology=squamous?t1t_{1}: 0.05 t2t_{2}: 0.11 t3t_{3}: 0.16 t4t_{4}: 0.25t5t_{5}: 0.29t1t_{1} : 0.17 t2t_{2}: 0.24 t3t_{3}: 0.35 t4t_{4} : 0.42 t5t_{5}: 0.48YesNoNoYesYesNo
Figure A.11: Tree built using uncensored data for death without progression
\pgfmathresultpthistology=squamous?age << 72.5t1t_{1}: 0.05 t2t_{2}: 0.05 t3t_{3}: 0.07 t4t_{4}: 0.07t5t_{5}: 0.09t1t_{1}: 0.21 t2t_{2}: 0.24 t3t_{3}:0.24 t4t_{4}: 0.27t5t_{5}: 0.36gender=male?t1t_{1}: 0.03 t2t_{2}: 0.05 t3t_{3}: 0.05t4t_{4}: 0.16t5t_{5}: 0.18age << 67.5t1t_{1}: 0.15 t2t_{2}: 0.17 t3t_{3}: 0.19t4t_{4}: 0.20t5t_{5}: 0.24t1t_{1}: 0.35 t2t_{2}: 0.40 t3t_{3}: 0.42t4t_{4}: 0.45t5t_{5}: 0.55NoYesNoYesNoYesYesNo
Figure A.12: Tree built using 29% artificially censored data for death without progression from IPCW2
\pgfmathresultpthistology=squamous?age << 70.5t1t_{1}: 0.04 t2t_{2}: 0.05t3t_{3}: 0.06t4t_{4}: 0.07 t5t_{5}: 0.08t1t_{1}: 0.20 t2t_{2}: 0.23t3t_{3}: 0.24t4t_{4}: 0.24t5t_{5}: 0.31gender=male?t1t_{1}: 0.03 t2t_{2}: 0.05 t3t_{3}: 0.05t4t_{4}: 0.16t5t_{5}: 0.16 age << 67.5t1t_{1}: 0.16 t2t_{2}: 0.17 t3t_{3}: 0.19 t4t_{4}: 0.20t5t_{5}: 0.23t1t_{1}: 0.36 t2t_{2}: 0.43 t3t_{3}: 0.46 t4t_{4}: 0.48 t5:t_{5}: 0.62NoYesNoYesNoYesYesNo
Figure A.13: Tree built using 29% artificially censored data for death without progression from DR-RF(npar)
\pgfmathresultpthistology=squamous?age << 72.5t1t_{1}: 0.05 t2t_{2} : 0.05 t3t_{3}: 0.07t4t_{4}: 0.07 t5t_{5}: 0.08t1t_{1}: 0.21 t2t_{2}: 0.25t3t_{3}: 0.25t4t_{4}: 0.25t5t_{5}: 0.34gender=male?t1t_{1}: 0.03 t2t_{2}: 0.05 t3t_{3}: 0.05 t4t_{4}: 0.15 t5t_{5}: 0.16 age << 67.5t1t_{1}: 0.15 t2t_{2}: 0.16 t3t_{3}: 0.19 t4t_{4}: 0.20 t5t_{5} : 0.23t1t_{1}: 0.35 t2t_{2}: 0.41 t3t_{3}: 0.44 t4:t_{4}: 0.44 t5t_{5}: 0.56NoYesNoYesNoYesYesNo
Figure A.14: Tree built using 29% artificially censored data for death without progression from DR-RF+lgtc
\pgfmathresultpthistology=squamous?age << 70.5t1t_{1}: 0.04 t2t_{2} : 0.05 t3t_{3}: 0.06t4t_{4}: 0.06t5t_{5}: 0.07 t1t_{1}: 0.19 t2t_{2}: 0.22 t3t_{3}: 0.22 t4t_{4}: 0.23 t5t_{5}: 0.30 gender=male?t1t_{1}: 0.03 t2t_{2}: 0.05 t3t_{3}: 0.05 t4t_{4}: 0.15 t5t_{5}: 0.15 age << 67.5t1t_{1}: 0.15 t2t_{2}: 0.16 t3t_{3}: 0.19 t4t_{4}: 0.20 t5t_{5}: 0.23t1t_{1}: 0.35 t2t_{2}: 0.41 t3t_{3}: 0.44 t4t_{4}: 0.44 t5t_{5}: 0.57NoYesNoYesNoYesYesNo
Figure A.15: Tree built using 29% artificially censored data for death without progression from BJ-RF(npar)
\pgfmathresultptHistology=squamous?age << 72.5t1t_{1}: 0.05 t2t_{2} : 0.05 t3t_{3}: 0.07 t4t_{4} : 0.08 t5t_{5}: 0.09t1t_{1}: 0.22 t2t_{2}: 0.25 t3t_{3}: 0.25 t4t_{4} : 0.25 t5t_{5}:0.33gender=male?t1t_{1}: 0.04 t2t_{2}: 0.05 t3t_{3}: 0.06 t4t_{4}: 0.15 t5t_{5}: 0.15age << 67.5t1t_{1}: 0.33 t2t_{2}: 0.38t3t_{3}: 0.41 t4t_{4}: 0.42t5t_{5}: 0.51kps << 95t1t_{1}: 0.18 t2t_{2}: 0.20t3t_{3}: 0.23t4t_{4}: 0.24t5t_{5}: 0.26t1t_{1}: 0.04 t2t_{2}: 0.05 t3t_{3}: 0.05 t4t_{4}: 0.06t5t_{5}: 0.10NoYesNoYesNoYesNoYesYesNo
Figure A.16: Tree built using 29% artificially censored data for death without progression from BJ-RF+lgtc
\pgfmathresultptage << 67.5histology=squamous?t1t_{1}: 0.05 t2t_{2}: 0.05 t3t_{3}: 0.07 t4t_{4}: 0.09 t5t_{5}: 0.12 t1t_{1}: 0.12 t2t_{2}: 0.14 t3t_{3}: 0.16 t4t_{4}: 0.21 t5t_{5}: 0.25 gender=male?t1t_{1}: 0.03 t2t_{2}: 0.03 t3t_{3}: 0.04 t4t_{4}: 0.06 t5t_{5}: 0.12kps << 85t1t_{1}: 0.46 t2t_{2}: 0.51 t3t_{3}: 0.52 t4t_{4}: 0.54 t5t_{5}: 0.56RX=1t1t_{1}: 0.12 t2t_{2}: 0.15 t3t_{3}: 0.16 t4t_{4}: 0.18 t5t_{5}: 0.26t1t_{1}: 0.32 t2t_{2}: 0.35 t3t_{3}: 0.43 t4t_{4}: 0.45 t5t_{5}: 0.56NoNoYesYesNoYesYesNoYesNo