Regression Trees for Cumulative Incidence Functions

Youngjoo Cho, Annette M. Molinaro,
Chen Hu, and Robert L. Strawderman^∗
^∗ [email protected]
Department of Biostatistics & Computational Biology
601 Elmwood Avenue, Box 630
University of Rochester
Rochester NY 14642

Abstract

The use of cumulative incidence functions for characterizing the risk of one type of event in the presence of others has become increasingly popular over the past decade. The problems of modeling, estimation and inference have been treated using parametric, nonparametric and semi-parametric methods. Efforts to develop suitable extensions of machine learning methods, such as regression trees and related ensemble methods, have begun only recently. In this paper, we develop a novel approach to building regression trees for estimating cumulative incidence curves in a competing risks setting. The proposed methods employ augmented estimators of the Brier score risk as the primary basis for building and pruning trees. The proposed methods are easily implemented using the R statistical software package. Simulation studies demonstrate the utility of our approach in the competing risks setting. Data from the Radiation Therapy Oncology Group (trial 9410) is used to illustrate these new methods.

Keywords: Brier score; CART; Cause-specific hazard; Fine and Gray model; Sub-distribution function

1 Introduction

A subject being followed over time may experience several types of events related, for example, to disease morbidity and mortality. For example, in a Phase III trial of concomitant versus sequential chemotherapy and thoracic radiotherapy for patients with inoperable non-small cell lung cancer (NSCLC) conducted by the Radiation Therapy Oncology Group (RTOG), patients were followed up to 5 years, the occurrence of either disease progression or death being of particular interest. Such “competing risks” data are commonly encountered in cancer and other biomedical follow-up studies, in addition to the potential complication of right-censoring on the event time(s) of interest.

Two quantities are often used when analyzing competing risks data: the cause-specific hazard function (CSH) and the cumulative incidence function (CIF). For a given event, the former describes the instantaneous risk of this event at time $t$ , given that no events have yet occurred; the latter describes the probability of occurrence, or absolute risk, of that event across time and can be derived directly from the subdistribution hazard function (fine1999proportional). The literature on competing risks is growing rapidly, particularly for hazard-based regression modeling; see dignam for a contemporary review, where parametric and semi-parametric approaches to modeling both the CSH and CIF are considered. The literature on tree-based methods for competing risks remains much less developed. leblanc1992relative extended the classification and regression tree (CART) procedure of breiman1984classification to the case of right-censored survival data by replacing the squared error loss function ordinarily used for a continuous outcome with an approximation to the likelihood function derived under a certain proportional hazards assumption. Motivated by methods for handling monotone missing data, molinaro2004tree later proposed using an inverse probability weighted (IPCW) squared error loss function to directly generalize the original CART procedure for right-censored survival data, obtaining a procedure that reduces to the standard form of CART for a continuous outcome when censoring is absent. steingrimsson2016doubly recently proposed a “doubly robust” extension motivated by semiparametric efficiency theory for missing data (robins1994estimation; tsiatis2007semiparametric), and demonstrated significantly improved performance in comparison to the IPCW-based tree procedure of molinaro2004tree. However, such methods cannot be used when there is more than one type of event (i.e., in addition to censoring). For two or more competing risks, callaghan2008 proposed two methods of building classification trees for the CIF: by maximizing between-node heterogeneity via the two-sample log-rank test of gray1988class; and, by maximizing within-node homogeneity using sums of event-specific martingale residuals. To our knowledge, no software package is available that implements these methods, and more importantly, there is also no software specifically targeted to the problem of fitting regression trees to competing risk outcomes. In contrast, ensemble methods for estimating the CIF as a function of covariates do exist; see, for example, ishwaran2014random and mogensen2013random. ishwaran2014random implement their methods in the randomForestSRC package (rfsrc), where the unpruned trees that make up the bootstrap ensemble are typically built using splitting criteria designed for group comparisons with a single competing risk outcome, such as the generalized logrank test.

This paper proposes a novel CART-based approach to building regression trees under competing risks. In Section 2, we introduce the relevant data structures and describe a direct approach to building regression trees for estimating the CIF for a specified cause when there is no loss to follow-up. Section 3 develops the necessary extensions of these loss functions for right-censored outcomes, along with extensions for estimating the CIF at several time points. These methods focus on directly estimating the CIF, avoiding estimation of the cause-specific or subdistribution hazard function. In Section 4, simulation studies are used to investigate performance of these new methods and Section 5 summarizes analysis results for the RTOG 9410 Phase III lung cancer trial mentioned at the beginning of this section. The paper concludes with comments on future work in Section 6.

2 CIF Regression Trees with Uncensored Data

This section introduces the relevant competing risks data structure and reviews a CART-based approach to building regression trees for estimating the CIF in the case where there is no other loss to follow-up (i.e., no random censoring). This necessary background will allow us to develop an analogous approach when there is loss to follow-up.

2.1 Full Data Structure

Let $T^{(m)}$ be the time to failure for failure type $m=1,\ldots,K,K\geq 2$ and let $W$ be a vector of $p$ covariates, where $W\in{\cal S}\subset\mathbb{R}^{p}$ . Let $T=\min(T^{(1)},\ldots,T^{(K)})$ be the minimum of all latent failure times; it is assumed $(T^{(1)},\ldots,T^{(K)})$ has a bounded multivariate density function and hence that $T$ is continuous. Then, in the absence of other loss to follow-up, $F=(T,W,M)$ is assumed to be the fully observed (or full) data for a subject, where $M$ is the event type corresponding to $T.$ This assumption implies that $(T^{(M)},M,W)$ is observed and, in addition, that $T^{(m)}>T^{(M)}$ for every $m\neq M;$ however, $T^{(m)}$ itself is not assumed to be observed for $m\neq M$ . Let $\mathcal{F}=(F_{1},\ldots,F_{n})$ be the full data observed on $n$ independent subjects, where $F_{i}=(T_{i},W_{i},M_{i}),i=1,\ldots,n$ are assumed to be identically distributed (i.i.d.). The dependence structure of $(T^{(1)},\ldots,T^{(K)})$ is left unspecifed.

2.2 CIF Estimation Under a Brier Loss Function

Let $\Psi_{0}=\{\psi_{0m}(t;w)=P(T\leq t,M=m|W=w),t>0,m=1,\ldots,K\}.$ The set of CIFs $\Psi_{0}$ can be estimated from the data ${\cal F}$ using any of several suitable parametric or semiparametric methods without further assumptions (e.g., independence of $T^{(1)},\ldots,T^{(K)}$ ). This section introduces a loss-based method for estimating $\psi_{0m}(t;w)$ for a fixed cause $m$ and time point $t>0$ in the case where $\psi_{0m}(t;w)$ is piecewise constant as a function of $W.$ This is a key step in our proposed approach to building a regression tree to estimate $\psi_{0m}(t;w).$

Let $\mathcal{N}_{1},\ldots,\mathcal{N}_{L}$ form a partition of ${\cal S}.$ Define $\beta_{0lm}(t)=P(T\leq t,M=m|W\in\mathcal{N}_{l})$ and suppose $\psi_{0m}(t;w)=\sum_{l=1}^{L}\beta_{0lm}(t)I\{W\in\mathcal{N}_{l}\};$ that is, subjects falling into partition $\mathcal{N}_{l}$ all share the same CIF $\beta_{0lm}(t).$ Define $Z_{m}(t)=I(T\leq t,M=m)$ and let $\psi_{m}(t;w)=\sum_{l=1}^{L}\beta_{lm}(t)I\{W\in\mathcal{N}_{l}\}$ be a model for $\psi_{0m}(t;w).$ Then, fixing both $t>0$ and $m,$ the Brier loss (cf., brier1950verification) $L_{m,t}^{full}(F,\psi_{m})=\{Z_{m}(t)-\psi_{m}(t;w)\}^{2}=\sum_{l=1}^{L}I\{W\in\mathcal{N}_{l}\}\{Z_{m}(t)-\beta_{lm}(t)\}^{2}$ is an unbiased estimator of the risk $\Re(t,\psi_{m})=E[\sum_{l=1}^{L}I\{W\in\mathcal{N}_{l}\}\{Z_{m}(t)-\beta_{lm}(t)\}^{2}],$ or equivalently, $\Re(t,\psi_{m})=\sum_{l=1}^{L}P\{W\in\mathcal{N}_{l}\}\{\beta_{0lm}(t)-\beta_{lm}(t)\}^{2}.$ Assuming that ${\cal F}$ is observed,

\displaystyle L_{m,t}^{emp}({\cal F},\psi_{m})=\frac{1}{n}\sum_{i=1}^{n}L_{m,t}^{full}(F_{i},\psi_{m})=\frac{1}{n}\sum_{i=1}^{n}\sum_{l=1}^{L}I\{W_{i}\in\mathcal{N}_{l}\}\{Z_{im}(t)-\beta_{lm}(t)\}^{2}

(1)

is also an unbiased estimator of $\Re(t,\psi_{m}).$ When considered as a function of the set of scalar parameters $\beta_{lm}(t),l=1,\ldots,L$ (i.e., for fixed $m$ and $t$ ), the empirical Brier loss (1) is minimized when $\psi_{m}(t;w)=\hat{\psi}_{m}(t;w)=\sum_{l=1}^{L}I\{W_{i}\in\mathcal{N}_{l}\}\hat{\beta}_{lm}(t),$ where

\displaystyle\hat{\beta}_{lm}(t)=\frac{\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}Z_{im}(t)}{\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}}

(2)

is a nonparametric estimate for $\beta_{0lm}(t),m=1,\ldots,K.$

2.3 Estimating a CIF Regression Tree Using CART

The CART algorithm of breiman1984classification estimates a regression tree as follows:

1.

Using recursive binary partitioning, grow a maximal tree by selecting a (covariate, cutpoint) combination at every stage that minimizes an appropriate loss function;
2.

Using cross-validation, select the best tree from the sequence of candidate trees generated by Step 1 via cost complexity pruning (i.e., using penalized loss).

In its most commonly used form, CART estimates the conditional mean response as a piecewise constant function on the covariate space, making all decisions on the basis of minimizing squared error loss. The resulting tree-structured regression function estimates the predicted response in each terminal node using the sample mean of the observations falling into that node. This process generalizes in a straightforward way to more general loss functions.

In the absence of censoring and under a piecewise constant model

\displaystyle\psi_{m}(t;w)=\sum_{l=1}^{L}\beta_{lm}(t)I\{W\in\mathcal{N}_{l}\}

(3)

for $\psi_{0m}(t;w),$ Section 2.2 shows that a nonparametric estimate for $\psi_{0m}(t;w)$ for a given cause $m$ at a fixed $t>0$ can be obtained by minimizing the loss (1), a problem equivalent to estimating the conditional mean response from the modified dataset ${\cal F}_{red,t}=\{(Z_{im}(t),W^{\prime}_{i})^{\prime},i=1,\ldots n\}$ by minimizing the squared error loss (1). Thus, any implementation of CART for squared error loss (e.g., rpart) applied to ${\cal F}_{red,t}$ will produce a regression tree estimate of $\psi_{0m}(t;w).$ In particular, CART estimates $L$ and the associated terminal nodes $\{\mathcal{N}_{1},\ldots,\mathcal{N}_{L}\}$ from the data ${\cal F}_{red,t},$ and within each terminal node, estimates $\psi_{0m}(t|w)$ by (2). While statistically inefficient, this process can be repeated for each $m=1,\ldots,K$ and any $t>0$ to generate nonparametric estimates of the CIF at a given set of time points for every cause.

3 CIF Regression Trees with Censored Data

With “full data” the squared error loss plays a critical role in both steps of the CART algorithm described at the beginning of Section 2.3 when estimating a conditional mean (e.g., breiman1984classification, Chapter 3.3). In follow-up studies involving competing risks outcomes, the full data ${\cal F}$ might not be observed due to loss to follow-up. In this case, estimating $\psi_{0m}(t;w)$ for a specified $m$ under the squared error loss (1) is not possible and one cannot run the CART procedure of Section 2.3 as described.

An attractive way to overcome this difficulty is to use to construct an observed data loss function that has the same risk as the desired full data loss function (c.f., molinaro2004tree; lostritto2012partitioning; steingrimsson2016doubly), that is, the empirical Brier loss (1). This section extends the developments of Section 2 to the case of right-censored competing risks by deriving several new observed data loss functions that share the same risk as (1). By substituting any of these new loss functions in for (1) everywhere throughout the CART algorithm of Section 2.3, one obtains a new regression tree method for estimating $\psi_{0m}(t;w).$ Similarly to steingrimsson2016doubly, direct implementation is possible using rpart, which provides users with the ability to customize loss functions and decision making procedures. An appealing feature of the algorithms induced by these new loss functions is that each also reduces to the algorithm described in Section 2.3 when censoring is absent.

3.1 Observed Data Structure

Let $C$ be a continuous right-censoring random variable that, given $W$ , is statistically independent of $(T,M).$ For a given subject, assume that instead of $F$ we only observe $O=\{\tilde{T},\Delta,M\Delta,W\},$ where $\tilde{T}=\min(T,C)$ and $\Delta=I(T\leq C)$ is the (any) event indicator. The observed data on $n$ i.i.d. subjects is $\mathcal{O}=(O_{1},\ldots,O_{n}).$ Similarly to the case where $K=1,$ random censoring on $T$ permits estimation of the CIF from the data ${\cal O}$ .

3.2 Estimating $\Re(t,\psi_{m})$ via IPCW Loss

Fix $t>0$ and let $t^{*}\geq t.$ Define $G_{0}(s|W)=P(C\geq s|W)$ for any $s\geq 0$ and suppose that $G_{0}(T_{i}|W_{i})\geq\epsilon$ almost surely for some $\epsilon>0$ . Let $\Delta_{i}(t^{*})=I(T_{i}(t^{*})\leq C)$ and $T_{i}(t^{*})=T_{i}\wedge t^{*};$ in addition, define $\tilde{Z}_{im}(t)=I(\tilde{T}_{i}\leq t,M_{i}=m),$ $i=1,\ldots,n.$ Easy calculations then show

E\left[\frac{\Delta_{i}(t^{*})}{G_{0}(\tilde{T}_{i}(t^{*})|W_{i})}(\tilde{Z}_{im}(t)-\psi_{m}(t;W_{i}))^{2}\right]=E\left[(Z_{im}(t)-\psi_{m}(t;W_{i}))^{2}\right]=\Re(t,\psi_{m})

for a fixed $\psi_{m}(t;w).$ This risk equivalence motivates the construction of an IPCW-type observed data loss function that is an unbiased estimator for $\Re(t,\psi_{m}).$ Define for any suitable survivor function $G(\cdot|\cdot)$

\displaystyle L_{m,t}^{ipcw}({\cal O},\psi_{m};t^{*},G)=\frac{1}{n}\sum_{i=1}^{n}\sum_{l=1}^{L}I\{W_{i}\in\mathcal{N}_{l}\}\bigg{[}\frac{\Delta_{i}(t^{*})\{\tilde{Z}_{im}(t)-\beta_{lm}(t)\}^{2}}{G(\tilde{T}_{i}(t^{*})|W_{i})}\bigg{]};

(4)

then, $L_{m,t}^{ipcw}({\cal O},\psi_{m};t^{*},G_{0})$ has the same risk as (1) and it follows that (4) is minimized by

\displaystyle\hat{\beta}^{ipcw}_{lm}(t;t^{*},G)=\frac{\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}\frac{\Delta_{i}(t^{*})\tilde{Z}_{im}(t)}{G(\tilde{T}_{i}(t^{*})|W_{i})}}{\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}\frac{\Delta_{i}(t^{*})}{G(\tilde{T}_{i}(t^{*})|W_{i})}},l=1,\ldots,L.

(5)

Observe that (4) and (5) respectively reduce to (1) and (2) when censoring is absent.

When $K=1$ and $t^{*}=\infty,$ we have $\Delta_{i}(\infty)=\Delta_{i}$ and $T_{i}(\infty)=T_{i};$ the loss function (4) is then just a special case of that considered in molinaro2004tree (see also steingrimsson2016doubly). Similarly, for $K=1$ and setting $t^{*}=t,$ the loss function (4) is just that considered in lostritto2012partitioning. Hence, (4) extends several existing loss functions to the problem of estimating a CIF. In practice, an estimator $\hat{G}(\cdot|\cdot)$ for $G_{0}(\cdot|\cdot)$ is used in (4); popular approaches here include product-limit estimators derived from the Kaplan-Meier and Cox regression estimation procedures.

Given $t>0,$ different choices of $t^{*}\geq$ in (4) generate different losses, hence different CART algorithms. However, there are just two important choices of $t^{*}\geq t$ in (4): $t^{*}=\infty$ (standard IPCW; IPCW ${}_{\!1}$ ) and $t^{*}=t$ (modified IPCW; IPCW ${}_{\!2}$ ). In selecting $t^{*}=t,$ any observations that are censored after time $t$ can still contribute (all-cause failure) information to (4); as $t^{*}\rightarrow\infty$ , the influence of these observations eventually vanishes. Consequently, the IPCW ${}_{\!2}$ loss uses more of the observed data in calculating the loss function and associated minimizers compared to the IPCW ${}_{\!1}$ loss and one can therefore expect a regression tree built using (4) for $t^{*}=t$ to perform as well or better than one derived from (4) with $t^{*}=\infty$ . The use of $t^{*}=t$ in place of $t^{*}=\infty$ has an additional practical advantage: since $T_{i}(t^{*})\leq T_{i}$ for every $i,$ we have $\hat{G}(\tilde{T}_{i}(t^{*})|W_{i})\geq\hat{G}(\tilde{T}_{i}|W_{i}),$ making it easier to satisfy the empirical positivity condition required for implementation.

3.3 Improving the Efficiency of IPCW Loss Functions

3.3.1 Estimating $\Re(t,\psi_{m})$ by augmenting the IPCW loss

As in steingrimsson2016doubly, one can employ semiparametric estimation theory for missing data to construct an improved estimator of the full data risk $\Re(t,\psi_{m}).$ In particular, one can augment the IPCW loss function (4) with additional information on censored subjects, thereby incorporating additional information into the tree building process.

Consider first the loss function (4) with $t^{*}=\infty;$ we denote this by $L_{m,t}^{ipcw}({\cal O},\psi_{m};G).$ Define $\Psi_{0}=\{\psi_{0r}(s;w),s\geq 0,w\in{\cal S},r=1,\ldots,K\}$ as the set of CIFs of interest and let $\Psi$ denote a corresponding model that may or may not coincide with $\Psi_{0}$ . For any $t,u\geq 0$ and $w\in{\cal S},$ define $V_{lm}(u;t,w,\Psi)=E_{\Psi}[(Z_{m}(t)-\beta_{lm}(t))^{2}|T\geq u,W=w];$ it is shown later how this expression depends on $\Psi.$ Then, fixing $\beta_{1m}(t),\ldots,\beta_{Lm}(t),$ the augmented estimator of $\Re(t,\psi_{m})$ having the smallest variance that can be constructed from the unbiased estimator $L_{m,t}^{ipcw}({\cal O},\psi_{m};G_{0})$ is given by $L_{m,t}^{dr}({\cal O},\psi_{m};G_{0},\Psi_{0})=L_{m,t}^{ipcw}({\cal O},\psi_{m};G_{0})+L_{m,t}^{aug}({\cal O},\psi_{m};G_{0},\Psi_{0})$ where, for suitable models $\Psi$ and $G(\cdot|\cdot),$

\displaystyle L_{m,t}^{aug}(\mathcal{O},\psi_{m};G,\Psi)=\frac{1}{n}\sum_{l=1}^{L}\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}\int_{0}^{\tilde{T}_{i}}\frac{V_{lm}(u;t,W_{i},\Psi)}{G(u|W_{i})}dM_{G}(u|W_{i})

(6)

and $M_{G}(t|w)=I(\tilde{T}\leq t,\Delta=0)-\int_{0}^{t}I(\tilde{T}\geq u)d\Lambda_{G}(u|w)$ with $\Lambda_{G}(t|w)$ denoting the cumulative hazard function corresponding to $G(t|w)$ (cf. tsiatis2007semiparametric, Sec. 9.3, 10.3). The “doubly robust” loss function $L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi)$ reduces to a special case of the class of loss functions proposed in steingrimsson2016doubly when $K=1.$

Instead of augmenting $L_{m,t}^{ipcw}({\cal O},\psi_{m};G)$ (i.e., IPCW ${}_{\!1}$ ), one might instead augment (4) when $t^{*}=t$ (i.e., IPCW ${}_{\!2}$ ). However, it turns out that the resulting loss function is identical to $L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi);$ see the Appendix (Section A.1). Hence, no additional efficiency is gained from augmenting IPCW ${}_{\!2}$ and it suffices to consider $L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi)$ only.

Returning to $L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi):$ because $Z_{m}(t)$ is binary, we have

V_{lm}(u;t,w,\Psi)=y_{m}(u;t,w,\Psi)-2y_{m}(u;t,w,\Psi)\beta_{lm}(t)+\beta^{2}_{lm}(t)

(7)

for any suitable $\Psi$ (e.g., $\Psi_{0}$ ), where $y_{m}(u;t,w,\Psi)=E_{\Psi}\{Z_{m}(t)|T\geq u,W=w\}$ reduces to

\displaystyle y_{m}(u;t,w,\Psi)=\begin{cases}\dfrac{P_{\Psi}(u\leq T\leq t,M=m|W=w)}{P_{\Psi}(T\geq u|W=w)}&\text{if}\quad u\leq t\\ 0&\text{otherwise}\end{cases}.

(8)

The notation $E_{\Psi}$ and $P_{\Psi}$ means that these quantities are calculated under the CIF model specification $\Psi$ . Hence, under a model $\Psi$ , the calculation of $L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi)$ requires estimating both the CIF for cause $m$ and the all-cause probability $P_{\Psi}(T\geq u|W=w).$

Considering $L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi)$ as a function of the $L$ scalar parameters $\beta_{1m}(t),\ldots,\beta_{Lm}(t)$ only and differentiating with respect to each one, it can be shown that

\displaystyle\tilde{\beta}_{lm}^{dr}(t;G,\Psi)=\dfrac{\displaystyle\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}[\widetilde{TS}_{1,im}^{1}(t)+\widetilde{TS}_{2,im}^{1}(t)]}{\displaystyle\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}[\widetilde{TS}_{1,im}^{0}+\widetilde{TS}_{2,im}^{0}]},~{}l=1,\ldots,L

(9)

minimize $L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi),$ where

	$\displaystyle\widetilde{TS}_{1,im}^{0}(t)=\frac{\Delta_{i}}{G(\tilde{T}_{i}\|W_{i})}\quad\quad\widetilde{TS}_{2,im}^{0}(t)=\int_{0}^{\tilde{T}_{i}}\frac{dM_{G}(u\|W_{i})}{G(u\|W_{i})}$
	$\displaystyle\widetilde{TS}_{1,im}^{1}(t)=\frac{\tilde{Z}_{im}(t)\Delta_{i}}{G(\tilde{T}_{i}\|W_{i})}\quad\quad\widetilde{TS}_{2,im}^{1}(t)=\int_{0}^{\tilde{T}_{i}}\frac{y_{m}(u;t,W_{i},\Psi)}{G(u\|W_{i})}dM_{G}(u\|W_{i}).$

The validity of this result relies on the assumption that $G(\tilde{T}_{i}|W_{i})\geq\epsilon>0$ for some $\epsilon$ and each $i=1,\ldots,n.$ Under this same assumption, Lemma 1 of strawderman2000estimating implies

\displaystyle\widetilde{TS}_{1,im}^{0}+\widetilde{TS}_{2,im}^{0}=\dfrac{\Delta_{i}}{G(\tilde{T}_{i}|W_{i})}+\dfrac{1-\Delta_{i}}{G(\tilde{T}_{i}|W_{i})}-\displaystyle\int_{0}^{\tilde{T}_{i}}\dfrac{d\Lambda_{G}(u|W_{i})}{G(u|W_{i})}=1;

letting $N_{l}=\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\},~{}l=1,\ldots,L,$ it follows that (9) can be rewritten as

\displaystyle\hat{\beta}_{lm}^{dr}(t;G,\Psi)=\frac{1}{N_{l}}\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}[\widetilde{TS}_{1,im}^{1}(t)+\widetilde{TS}_{2,im}^{1}(t)],~{}l=1,\ldots,L.

(10)

As in Section 3.2, $L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi)$ and (10) respectively reduce to (1) and (2) when censoring is absent.

The specification $\tilde{G}(t|w)=1$ for all $t\geq 0$ and $w\in{\cal S}$ generates an interesting special case of $L_{m,t}^{dr}({\cal O},\psi_{m};\tilde{G},\Psi)$ despite the fact that $\tilde{G}(\cdot|\cdot)$ is necessarily incorrectly modeled in the presence of censoring. In particular, for any suitable choice of $\Psi$ , (i) $L_{m,t}^{dr}({\cal O},\psi_{m};\tilde{G},\Psi)=\sum_{l=1}^{L}L_{ml,t}^{bj}({\cal O},\psi_{m};\Psi)$ where

L_{ml,t}^{bj}({\cal O},\psi_{m};\Psi)=\frac{1}{n}\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}[\Delta_{i}\{\tilde{Z}_{im}(t)-\beta_{lm}(t)\}^{2}+(1-\Delta_{i})V_{lm}(\tilde{T}_{i};t,W_{i},\Psi)];

and, (ii) for $\Psi=\Psi_{0},$ $L_{m,t}^{dr}({\cal O},\psi_{m};\tilde{G},\Psi_{0})$ is an unbiased estimator of the risk $\Re(t,\psi_{m})$ . In fact, noting that (7) implies $V_{lm}(\tilde{T}_{i};t,W_{i},\Psi)$ can be rewritten in terms of $y_{m}(\tilde{T}_{i};t,w,\Psi)$ for every $i,$ the minimizer of $L_{ml,t}^{bj}({\cal O},\psi_{m};\Psi)$ is given by

\displaystyle\tilde{\beta}_{lm}^{bj}(t;\Psi)=\frac{1}{N_{l}}\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}[\Delta_{i}\tilde{Z}_{im}(t)+(1-\Delta_{i})y_{m}(\tilde{T}_{i};t,W_{i},\Psi)].

That is, under the loss $L_{ml,t}^{bj}({\cal O},\psi_{m};\Psi),$ the estimator for $\beta_{lm}(t)$ is the Buckley-James (BJ) estimator of the mean response within node ${\cal N}_{l}$ (buckley1979linear), an estimator that can also be derived directly from (10) by setting $G=\tilde{G}.$ For this reason, we refer to $L_{m,t}^{dr}({\cal O},\psi_{m};\tilde{G},\Psi)$ as the Buckley-James loss function.

3.3.2 Composite augmented loss functions with multiple time points

Under a tree model of the form (3), and similarly to methods used for survival trees (e.g., leblanc1992relative; molinaro2004tree; steingrimsson2016doubly), the quantity being estimated within each terminal node depends on the choice of $t$ but the underlying partition structure is time-invariant. That is, the effects of baseline covariates and their interactions on the CIF are not allowed to change over time. As a result, for a given $m$ , we can further reduce variability when estimating $\psi_{0m}(t|w)$ by considering loss functions constructed from $L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi)$ that incorporate information over several time points.

Recall that $L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi)=L_{m,t}^{ipcw}({\cal O},\psi_{m};G)+L_{m,t}^{aug}({\cal O},\psi_{m};G,\Psi)$ where $L_{m,t}^{ipcw}({\cal O},\psi_{m};G)$ is given by (4) (i.e., for $t^{*}=\infty$ ) and $L_{m,t}^{aug}({\cal O},\psi_{m};G,\Psi)$ is given by (6). For a given set of time points $0<t_{1}<t_{2}\ldots<t_{J}<\infty,$ a simple composite loss function for a given event type $m$ can be formed by calculating

\displaystyle L_{m,\mathbf{t}}^{mult,dr}(\mathcal{O},\psi_{m};G,\Psi)=\sum_{j=1}^{J}w_{j}L_{m,t_{j}}^{dr}({\cal O},\psi_{m};G,\Psi),

(11)

where $w_{1},\ldots,w_{J}$ are positive weights that, without loss of generality, can be assumed to sum to one. Minimizing (11) with respect to $\beta_{lm}(t_{j}),j=1,\ldots,J;l=1,\ldots,L,$ gives

\displaystyle\tilde{\beta}_{lm}^{mult,dr}(t_{j};G,\Psi)=\frac{1}{N_{l}}\ \sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}[\widetilde{TS}_{1,im}^{1}(t_{j})+\widetilde{TS}_{2,im}^{1}(t_{j})].

(12)

The terminal node estimators (12) are exactly equivalent to (10) computed for $t=t_{j}$ and do not depend on $w_{1},\ldots,w_{J}$ . However, the corresponding assessment of total loss and thus decisions to split nodes at any given cutpoint for any given covariate are directly influenced by the combined specifications of $(t_{j},w_{j}),j=1,\ldots,J.$ Consequently, performance gains in the resulting tree-based estimates for $\psi_{m0}(t_{j};w),j=1,\ldots,J$ occur in the determination of when and where to split, rather than in the manner by which individual terminal nodes are estimated. In the absence of censoring, the indicated composite loss function also reduces to that which would be computed by extending the loss function introduced in Section 2.2 in the manner described above.

4 Simulation Studies

In this section, we will summarize the structure and results of several simulation studies designed to study the performance of regression trees built using the IPCW, Buckley James, and doubly robust Brier loss functions introduced in the previous section. We first describe how data are generated. Several performance evaluation measures are used to evaluate tree-building performance; these are described next, followed by a summary of the results.

4.1 Simulation Settings

Covariates $W_{1},W_{2},W_{3},W_{4},W_{5},W_{6},W_{7},W_{8},W_{9},W_{10}\sim Unif(0,1)$ are generated independently of each other; let $\boldsymbol{W}=(W_{1},W_{2},W_{3},W_{4},W_{5},W_{6},W_{7},W_{8},W_{9},W_{10})^{\prime}.$ The underlying competing risks model assumes $K=2$ and takes the form

	$\displaystyle\psi_{01}(t;\boldsymbol{W})=1-(1-p(1-e^{-t}))^{\exp{(\beta_{1}Z(\boldsymbol{W}))}}$		(13)
	$\displaystyle\psi_{02}(t;\boldsymbol{W})=(1-p)^{\exp{(\beta_{1}Z(\boldsymbol{W}))}}\times(1-\exp{(-t\exp{(\beta_{2}Z(\boldsymbol{W}))})})$		(14)

where $Z(\boldsymbol{W})=I(W_{1}\leq 0.5\ \&\ W_{2}>0.5)$ and $\beta_{1}$ and $\beta_{2}$ are regression coefficients (cf., fine1999proportional). Note that this formulation satisfies the additivity constraint $\psi_{01}(\infty;\boldsymbol{W})+\psi_{02}(\infty;\boldsymbol{W})=1.$ Three settings are considered: (i) $\beta_{1}=3$ and $\beta_{2}=-0.5$ with $p=0.3$ (High Signal); (ii) $\beta_{1}=2$ and $\beta_{2}=-0.5$ with $p=0.3$ (Medium Signal); and, (iii) $\beta_{10}=1.5$ and $\beta_{20}=-0.5$ with $p=0.3$ (Low Signal). The corresponding cumulative incidence functions for these settings are shown in Figure 1.

Refer to caption — Figure 1: Three settings in simulation 1. Black line : $\psi_{01}(\cdot|Z(\boldsymbol{W})=1)$ , Blue line : $\psi_{01}(\cdot|Z(\boldsymbol{W})=0)$ , Red line : $\psi_{02}(\cdot|Z(\boldsymbol{W})=1)$ , Green line : $\psi_{02}(\cdot|Z(\boldsymbol{W})=0)$ .

Similarly to steingrimsson2016doubly, we respectively generate training sets of size 250, 500 and 1000 (with noninformative censoring) for the purposes of estimation and an independent uncensored test set of size 2000 for evaluating performance. Censoring is exponentially distributed with a rate $\gamma$ chosen to give approximately 50% censoring on $T$ . For each training set size, 500 simulations are run for each of the three settings.

4.2 Performance Measures

For assessing estimator performance, it is assumed there is an underlying true tree structure to be recovered. Let $\hat{\psi}_{1}(t|W_{i})$ be the tree-predicted cumulative incidence for the cause of interest for a subject having covariates $W_{i}$ ; let $\psi_{01}(t|W_{i})$ be the corresponding true cumulative incidence function. Then, as a measure of predictive performance, we consider

\displaystyle\frac{1}{n_{test}}\sum_{i=1}^{n_{test}}\{\hat{\psi}_{1}(t|W_{i})-\psi_{01}(t|W_{i})\}^{2}

(15)

where $n_{test}$ is the number of observations in the test dataset. We also use several other performance measures focused more on the identification of the underlying tree structure:

•

$|$ fitted size $-$ true size $|$ : This quantity measures difference between the size of fitted tree and the size of the true tree, where size denotes the number of terminal nodes. It can be seen from Figure 1 that the true tree size is 3.
•

NSP: This quantity is defined as $(n_{sim})^{-1}\sum_{i=1}^{n_{sim}}NS_{i}$ where $NS_{i}$ represents number of times any of the covariates $W_{3},\ldots,W_{10}$ (i.e., covariates that do not affect the true CIF) appear in simulation run $i$ .
•

PCSP : This quantity is defined as $(n_{sim})^{-1}\sum_{i=1}^{n_{sim}}CT_{i}$ where $CT_{i}=1$ if the fitted tree in simulation $i$ has the same covariates and the same number of splits as the true tree that defines the CIF; otherwise $CT_{i}=0$ . While it can happen that a fitted tree may involve only the covariates $W_{1}$ and $W_{2}$ , it cannot be expected to have exactly the same splits as the true tree (especially for a continuous covariate); thus, in this measure, a ‘correct tree’ is one that looks like the true tree but may involve different split points.

4.3 Estimation of $G_{0}$ and $\Psi_{0}$

The IPCW and doubly robust Brier loss functions require estimation of $G_{0}(\cdot|\cdot).$ The Buckley-James and doubly robust Brier loss functions both rely on the function $y_{m}(\cdot;t,w,\Psi)$ in (8), the optimal choice requiring specification of both $\psi_{0m}(u|w)$ and the event-free survival function $\bar{\psi}_{0}(u|w)=P_{\Psi_{0}}(T>u|w)$ for $u>0,w\in{\cal S}$ .

Estimation of $G_{0}(\cdot|\cdot)$ is comparatively straightforward; for example, one may use regression procedures (e.g., Cox regression models, random survival forests) or product-limit estimators as appropriate. For all simulations considered here, the censoring distribution $G_{0}(\cdot|\cdot)$ is estimated by the Kaplan-Meier method. For building an IPCW-type tree with $t^{*}=\infty$ in (4), the resulting censoring weight may not satisfy the required positivity assumption. Hence, similarly to steingrimsson2016doubly, we replace $t^{*}=\infty$ by $t^{*}=\tau$ when calculating IPCW ${}_{\!1},$ where $\hat{G}(\tilde{T}_{i}(\tau))\geq 0.05,i=1,\ldots,n$ (i.e., the marginal truncation rate of observed survival times is 5%). No such modification is required when calculating IPCW ${}_{\!2},$ that is, when computing (4) with $t^{*}=t=t_{j},j=1,2,3.$

Calculation of the Buckley-James and doubly robust Brier loss functions in practice requires using an estimated model $\hat{\Psi}$ in place of $\Psi_{0}$ when computing (8). Parametric models have been proposed that ensure the compatibility between the models used for $\psi_{m}(\cdot|\cdot)$ and $\bar{\psi}(\cdot|\cdot)$ as well as the ability to calculate probabilities for every $u>0;$ see, for example, jeong2006direct, jeong2007parametric, cheng2009modeling, and shi2013constrained. Compared to existing semiparametric methods (e.g., fine1999proportional; scheike2008predicting; eriksson2015proportional), the parametric likelihood methods of jeong2006direct and jeong2007parametric enable researchers to estimate two or more CIFs jointly. However, cheng2009modeling and shi2013constrained show that even these methods fail to enforce the additivity constraint $\sum_{m=1}^{M}P_{\Psi}(T\leq\infty,M=m;W=w)=1$ and propose methods that resolve this inconsistency.

Since our focus is on a particular cause $m,$ we will without loss of generality assume that $m=1$ and that $K=2$ (i.e., $m=2$ captures all causes $m\neq 1$ ); in this case, specification of $\psi_{1}(\cdot|\cdot)$ and $\bar{\psi}(\cdot|\cdot)$ is equivalent to specifying $\Psi=(\psi_{1}(\cdot|\cdot),\psi_{2}(\cdot|\cdot)).$ Our simulation study considers the following estimators $\hat{\Psi}$ when $K=2:$

1.

RF(npar) : random survival forests as proposed in ishwaran2014random.
2.

RF+lgtc: a hybrid method of random survival forests and parametric modeling;
3.

FG(true): the simulation model of Section 4.1 is an example of the class of models considered in jeong2007parametric. For $m=1,2,$ $\psi_{m}(\cdot|\cdot)$ are estimated by maximum likelihood under a correctly specified parametric model.
4.

lgtc+godds: for $m=1,2,$ $\psi_{m}(\cdot|\cdot)$ is estimated by maximum likelihood using the parametric cumulative incidence model proposed in shi2013constrained.

Further details on the RF(npar) and RF+lgtc estimators, including the parametric models used in #2 and #4 above, may be found in Section A.3 of the Appendix. We anticipate that FG(true) will lead to the best performance because the underlying models for the event of interest and for censoring are both specified correctly. As in steingrimsson2016doubly, the estimates $\hat{G}(\cdot|\cdot)$ and $\hat{\Psi}$ depend only on ${\cal O}$ and are computed in advance of the process of building the desired regression tree for estimating $\psi_{01}(\cdot|\cdot).$

4.4 Simulation Results

Each combination of loss function and method for estimating $G_{0}$ and/or $\Psi_{0}$ leads to a distinct algorithm. In this section, we show results for the IPCW-type estimators with $t^{*}=t_{j},j=1,2,3$ and $t^{*}=\tau$ ; the Buckley James Brier loss, where RF(npar) (algorithm BJ-RF (npar)) and FG(true) (algorithm BJ-FG (true)) are respectively used to estimate $\Psi_{0}$ ; and, the doubly robust Brier loss function, where RF(npar) (algorithm DR-RF (npar)) and FG(true) (algorithm DR-FG (true)) are respectively used to estimate $\Psi_{0}$ . Results using RF+lgtc and lgtc+godds for estimating $\Psi_{0}$ are summarized in the Appendix (Section A.5).

In fitting the trees, we consider three fixed time points $t_{1}$ , $t_{2}$ and $t_{3}$ , respectively representing the 25th, 50th and 75th quantile of the true marginal distribution of $T$ . These times are used for building individual trees as well as for composite loss functions with the respective training set sizes $n=250,500$ and $1000$ . Below, we focus on the results with $n=500$ using composite loss functions as described in Section 3.3.2. Table 1 summarizes the tree fitting performance measures for estimating $\psi_{01}(\cdot|\cdot)$ in (13) for $n=500$ using a composite loss function having weights $w_{1}=w_{2}=w_{3}=1/3$ . For the chosen metrics, and irrespective of signal strength, there is a clear benefit to using augmented loss functions over IPCW loss functions, the exception being BJ-RF (npar) (i.e., the Buckley-James loss with $\hat{\Psi}$ estimated by random survival forests). Overall, the BJ-FG (true) algorithm that uses the correct model for $\Psi_{0}$ performs best, followed by DR-FG (true), DR-RF (npar), IPCW ${}_{\!2},$ BJ-RF (npar) and then IPCW ${}_{\!1}$ . Prediction error performance is considered in Figure 2, which shows boxplots of the mean squared prediction error given in (15) under equally weighted composite loss. Here, the BJ-FG (true) algorithm again performs uniformly best, with the augmented methods DR-FG (true), DR-RF (npar), and BJ-RF (npar) all performing similarly, followed by IPCW ${}_{\!2}$ and then IPCW ${}_{\!1}.$

In general, the methods that incorporate information beyond that used by IPCW₁ exhibit significant improvement in performance, with the methodology for estimating quantities needed to compute the augmented loss function playing a smaller role, particularly as sample size increases. Considering only methods where no knowledge of $\Psi_{0}$ is used, the doubly robust loss function also provides gains over both the Buckley-James and IPCW ${}_{\!2}$ loss functions; however, the gains achieved are substantially less compared to the gains over IPCW ${}_{\!1}$ . This phenomenon is expected and can be explained by noting that (i) the censoring distribution is being consistently estimated in all cases; (ii) IPCW ${}_{\!2}$ can be viewed as an augmented version of IPCW ${}_{\!1};$ and, (iii) IPCW ${}_{\!2}$ only requires estimating one infinite dimensional parameter.

Numerical and graphical results for other methods of nuisance parameter estimation and other settings (a single time point loss and composite loss with various training set sample sizes) are shown in the Appendix. These results demonstrate no significant deviations from the trends and observations summarized above. Some important conclusions that can be drawn from the combined set of results include the following: (i) the method for estimating $\Psi$ in (8) has little overall impact on performance, with greater differences arising from the type of loss (IPCW versus DR versus BJ) and whether or not a composite loss function is used; (ii) getting the data generating model exactly right yields noticeable gains, and there is otherwise a degree of robustness across methods, particularly for doubly robust loss; and, (iii) the performance of IPCW₂ is often reasonably close to that for the augmented losses, particularly with composite loss, and the comparative degree of simplicity involved in implementing this method has much to recommend it.

Table 1: Numerical summaries for trees built using composite loss for

n=500

. IPCW₁ and IPCW₂ are standard and modified IPCW, respectively; BJ-RF(npar) and BJ-FG(true) use Buckley-James loss functions with the augmentation term estimated via nonparametric random forests and under the correct simulation model, respectively; and, DR-RF(npar) and DR-FG(true) use the doubly robust loss function with the augmentation term estimated via nonparametric random forests and under the correct simulation model, respectively.

	High Sig			Med Sig			Low Sig
	$\|$ fitted-3 $\|$	NSP	PCSP	$\|$ fitted-3 $\|$	NSP	PCSP	$\|$ fitted-3 $\|$	NSP	PCSP
IPCW₁	0.132	0.084	0.916	0.182	0.132	0.874	0.558	0.164	0.658
IPCW₂	0.124	0.082	0.932	0.138	0.100	0.906	0.282	0.136	0.830
BJ-RF(npar)	0.142	0.090	0.904	0.148	0.088	0.908	0.226	0.128	0.878
BJ-FG(true)	0.092	0.068	0.940	0.102	0.064	0.934	0.118	0.082	0.918
DR-RF(npar)	0.066	0.052	0.958	0.092	0.056	0.940	0.216	0.118	0.856
DR-FG(true)	0.058	0.044	0.966	0.092	0.058	0.942	0.176	0.076	0.874

5 Example: Lung Cancer Treatment Trial

We illustrate our methods using data from the RTOG 9410, randomized trial of patients with locally advanced inoperable non-small cell lung cancer. The motivation for this trial was to ascertain whether sequential or concurrent delivery of chemotherapy and thoracic radiotherapy (TRT) is a better treatment strategy. The original RTOG 9410 study randomized 610 patients to three treatment arms: sequential chemotherapy followed by radiotherapy (RX=1); once-daily chemotherapy concurrent with radiotherapy (RX=2); and, twice-daily chemotherapy concurrent with radiotherapy (RX=3). The primary endpoint of interest was overall survival and the main trial analysis results were published in curran2011sequential, demonstrating a survival benefit of concurrent delivery of chemotherapy and TRT compared with sequential delivery. Secondary analyses of the data using the time from randomization to the first occurrence of three possible outcomes are considered: in-field failure (cancer recurrence within the treatment field for TRT); out-field failure (cancer recurrence and distant metastasis outside of the treatment field for TRT); and, death without documented in-field or out-field failure (i.e., cancer progression). Among these event types, those that first experienced out-field failures are of particular interest since these patients typically have suboptimal prognosis and may be candidates for more intensified treatment regimens intended to prevent distant metastasis, including but not limited to consolidative chemotherapy, prophylactic cranial irradiation (for brain metastases), and so on. As such, patients that experienced both in-field failure and out-field failure were considered to be out-field failures for purposes of this analysis.

At the time the study database was last updated in 2009, there were 577 patients, with approximately 3% censoring on the aforementioned outcomes. Because this censoring rate is so low, we removed these censored observations and compare the results of analyses of the resulting uncensored dataset (554 patients) to analyses of data created using an artificially induced censoring mechanism. The purpose of doing this analysis is to evaluate how censoring affects the analysis and, in particular, illustrates how well the various procedures are able to recover the desired tree(s) that would be built had all outcomes been fully observed. We focus on building trees for each outcome using a composite loss function with 5 time points (5.2, 6.2, 8.5, 11.8, 19.0 months), selected as the 25th, 35th, 50th, 65th and 80th percentiles of the observed “all cause” event time (i.e., $T$ ). Baseline covariates included in this analysis are Treatment (RX), Age, Stage (American Joint Committee on Cancer [AJCC] stage IIIB vs. IIIA or II), Gender, Karnofsky score (70, 80, 90 or 100), Race (White vs. non-White), and Histology (Squamous vs. non-Squamous). Censoring is created according to a Uniform $[0,50]$ distribution, generating approximately 29% censoring on $T$ . In addition to building a tree using the uncensored dataset (i.e., see Section 3.3.2), we consider the methods IPCW ${}_{\!1}$ , IPCW ${}_{\!2}$ , BJ-RF(npar), BJ-RF(npar), DR-RF+lgtc, DR-RF(npar) and DR-RF+lgtc; this will allow us to evaluate the effect of censoring as well as different models used in constructing the augmented loss. Similarly to steingrimsson2016doubly, we use repeated 10-fold cross-validation to improve the stability of the tree building process and select the final tree using that having the lowest mean cross-validated error. For the in-field failure outcome, all methods (including those applied to the uncensored dataset) lead to trees that consist only of a root node. Hence, we focus on outfield failure and death without progression below.

For out-field failure, the estimated tree obtained using the uncensored data creates 4 risk groupings (see Figure A.4 in the Appendix): age $<$ 50.5; 50.5 $\leq$ age $<$ 70.5 with squamous histology; age $\geq$ 50.5 with squamous histology; and, age $\geq$ 70.5 with non-squamous histology. The plot of the CIF indicates that age $\geq$ 70.5 with non-squamous histology has the lowest risk, and that this risk is very similar to those aged $\geq$ 50.5 with squamous histology. With censored data, the IPCW ${}_{\!1}$ , IPCW ${}_{\!2}$ , BJ-RF(npar) and DR-RF(npar) methods all lead to the same tree structure; see Figures A.5 – A.8 in the Appendix. The BJ-RF+lgtc and DR-RF+lgtc also share the same tree structures (see Figures A.9 and A.10) and identify nearly the same risk groups as the other censored data methods, the difference being that the secondary split on age occurs at 70.5 rather than 71.5. Importantly, the trees for out-field failure respectively built using the uncensored and censored datasets are very similar. Figure 3 respectively summarize the estimated CIFs obtained using the uncensored, DR-RF(npar), IPCW ${}_{\!2}$ and BJ-RF(npar) loss functions. There are slight differences in the latter three estimators due to the way in which the CIFs (i.e., terminal nodes) are estimated. We see that the three highest risk groups are identified as being the same for all methods; the differences created by censoring essentially occur in the risk determination for the oldest squamous patients, where censoring tends to be the heaviest. In general, the results show that younger patients with non-squamous histology have the highest risk of out-field failure. This observation is important, as these patients may be considered as candidates for more intensified treatment regimens.

For death without progression, the fitted tree from uncensored data creates 5 risk groups: non-squamous histology with age $<$ 72.5; male, with age $<67.5$ and squamous histology; male, with age $\geq$ 67.5 and squamous histology; female with squamous histology; and, non-squamous histology with age $\geq$ 72.5. This tree is given in Figure A.11 in the Appendix. The tree built using IPCW ${}_{\!1}$ results in a root node. The trees built with IPCW ${}_{\!2}$ and the two doubly robust methods are nearly identical to that built using the uncensored dataset, the difference being that the age cutpoint of 72.5 is replaced by 70.5; see Figures A.12 $-$ A.14 of the Appendix. The tree built using BJ-RF(npar) is also very similar, but incorporates an extra cutpoint on Karnofsky score for younger males with squamous histology; this distinguishes those having a score of 100 from those having a lower score (see Figure A.15 of the Appendix). The tree built using the BJ-RF+lgtc method differs somewhat more substantially for younger males with squamous histology, splitting on both Karnofsky score (70 or 80 vs. 90 or 100) treatment (RX = 1 vs. not); see Figure A.16 of the Appendix. The comparison of these trees points to a benefit of using the doubly robust loss over the Buckley-James loss, as the censoring distribution is modeled correctly; the comparison of the augmented loss results to those obtained for IPCW ${}_{\!1}$ (where censoring is modeled correctly) further highlights the value of incorporating additional information into the model building process. Figure 4 is obtained analogously to Figure 3 and shows the estimated CIFs and risk stratification obtained using the uncensored and censored datasets. For death without progression, the risk evidently increases with increasing age and being male; the risk also increases with squamous histology, though in a way that appears to be age-dependent.

6 Discussion

Trees remain of significant interest to practitioners, especially so clinicians. The proposed doubly robust CIF regression tree demonstrates improved performance compared to IPCW-based methods whether a single time point or composite loss function is used, with composite loss functions giving much better performance overall. To our knowledge, there are no publically available software packages that directly implement tree-based regression methods for competing risks; our proposed methods can be implemented using existing software, with example code made available as part of supplementary materials.

Several extensions are possible. For example, it is easy in principle to combine the proposed estimation procedure with ensemble methods, providing an interesting alternative to the random forests methods recently introduced by mogensen2013random and ishwaran2014random. With no random feature selection, this extension is easily accomplished through bagging (i.e., bootstrap-based aggregation). However, with random feature selection, new software is required, as existing software does not provide for the same possibility of extending rpart for use with other loss functions. A second interesting direction is to extend both regression tree and ensemble procedures to the problem of simultaneous estimation of multiple CIFs. Similarly to the case of a single CIF, we anticipate that such multivariate problems can be handled using existing software in the case of regression trees and that new software will again be needed for related ensemble extensions.

Acknowledgments

This work was supported by the National Institutes of Health (R01CA163687: AMM, RLS, YC; U10-CA180822: CH). We thank the NRG Oncology Statistics and Data Management Center for providing de-identified RTOG 9410 clinical trial data under a data use agreement.

Supplementary Material for
Regression Trees for Cumulative Incidence Functions
by Cho, Molinaro, Hu and Strawderman

Appendix A Appendix

References to figures, tables, theorems and equations preceded by “A.” are internal to this appendix; all other references refer to the main paper.

A.1 Equivalence of augmentation for IPCW ${}_{\!1}$ and IPCW ${}_{\!2}$

The equivalence result to be proved below holds for any pair of suitable models of $G$ and $\Psi$ , including $G_{0}$ and $\Psi_{0}$ . Throughout, we suppose that suitable models $\Psi$ and $G(\cdot|\cdot)$ are given. The proof will be facilitated by first establishing that IPCW ${}_{\!2}$ can be rewritten as an IPCW estimator in standard form and augmented similarly to IPCW ${}_{\!1}.$ We will then show that the difference between its augmented version and that for IPCW ${}_{\!1}$ are mathematically identical.

Define the modified data ${\cal O}(t)=(\tilde{T}_{i}(t),\Delta_{i}(t),W_{i}),i=1,\ldots,n,$ where $\Delta_{i}(t)=I(T_{i}(t)\leq C)$ and $T_{i}(t)=T_{i}\wedge t$ for $i=1,\ldots,n.$ We first observe that the IPCW ${}_{\!2}$ loss can be re-expressed as a standard IPCW estimator that can be computed from ${\cal O}(t).$ In particular, the IPCW ${}_{\!2}$ loss (i.e., (4) with $t^{*}=t$ ) is mathematically equivalent to

\displaystyle L_{m,t}^{ipcw}({\cal O}(t),\psi_{m};G)=\frac{1}{n}\sum_{i=1}^{n}\sum_{l=1}^{L}I\{W_{i}\in\mathcal{N}_{l}\}\bigg{[}\frac{\Delta_{i}(t)\{I(\tilde{T}_{i}(t)<t,M_{i}=m)-\beta_{lm}(t)\}^{2}}{G(\tilde{T}_{i}(t)|W_{i})}\bigg{]}.

(A.1)

This equivalence follows because $I(\tilde{T}_{i}\leq t,M_{i}=m)=I(\tilde{T}_{i}(t)<t,M_{i}=m),i=1,\ldots,n$ almost surely provided that $\tilde{T}_{i}$ is continuous. Strict inequality on the right-hand side is necessary here because the fact that $\tilde{T}_{i}(t)\leq t$ almost surely implies $0=I(\tilde{T}_{i}\leq t,M_{i}=m)\neq I(\tilde{T}_{i}(t)\leq t,M_{i}=m)=1$ whenever $\tilde{T}_{i}>t$ and $M_{i}=m.$ The critical observation is that the loss estimator (A.1) is now an IPCW estimator in standard form constructed from ${\cal O}(t),$ and thus can be augmented similarly to IPCW ${}_{\!1}$ (cf. tsiatis2007semiparametric, Sec. 9.3, 10.3). In particular, the augmented loss is given by $L_{m,t}^{dr}({\cal O}(t),\psi_{m};G_{0},\Psi_{0})=L_{m,t}^{ipcw}({\cal O}(t),\psi_{m};G_{0})+L_{m,t}^{aug}({\cal O}(t),\psi_{m};G_{0},\Psi_{0}),$ where

L_{m,t}^{aug}(\mathcal{O}(t),\psi_{m};G,\Psi)=\frac{1}{n}\sum_{l=1}^{L}\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}\int_{0}^{\tilde{T}_{i}(t)}\frac{V_{lm}(u;t,W_{i},\Psi)}{G(u|W_{i})}dM_{G}(u|W_{i}).

Note that this last expression is just (6), but with an upper limit of integration of $\tilde{T}_{i}(t)$ in place of $\tilde{T}_{i}$ .

We can now establish the equivalence of the augmented loss functions respectively derived from the IPCW ${}_{\!1}$ and IPCW ${}_{\!2}$ losses. Trivially, we may rewrite

\displaystyle L_{m,t}^{ipcw}({\cal O},\psi_{m};G)=\frac{1}{n}\sum_{i=1}^{n}\sum_{l=1}^{L}I\{W_{i}\in\mathcal{N}_{l}\}\bigg{[}\frac{\Delta_{i}\{Z_{im}(t)-\beta_{lm}(t)\}^{2}}{G(T_{i}|W_{i})}\bigg{]}

(A.2)

and

\displaystyle L_{m,t}^{ipcw}({\cal O}(t),\psi_{m};G)=\frac{1}{n}\sum_{i=1}^{n}\sum_{l=1}^{L}I\{W_{i}\in\mathcal{N}_{l}\}\bigg{[}\frac{\Delta_{i}(t)\{Z_{im}(t)-\beta_{lm}(t)\}^{2}}{G(T_{i}(t)|W_{i})}\bigg{]}.

(A.3)

Consequently, $L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi)-L_{m,t}^{dr}({\cal O}(t),\psi_{m};G,\Psi)=(A)-(B)$ almost surely, where

\displaystyle(A)=\frac{1}{n}\sum_{l=1}^{L}\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}\bigg{(}\frac{\Delta_{i}}{G(T_{i}|W_{i})}-\frac{\Delta_{i}(t)}{G(T_{i}(t)|W_{i})}\bigg{)}\{Z_{im}(t)-\beta_{lm}(t)\}^{2}

and

\displaystyle(B)=\frac{1}{n}\sum_{l=1}^{L}\sum_{i=1}^{n}I\{W_{i}\in\mathcal{N}_{l}\}\int_{\tilde{T}_{i}(t)}^{\tilde{T}_{i}}\frac{V_{lm}(u;t,W_{i},\Psi)}{G(u|W_{i})}dM_{G}(u|W_{i}).

Consider the integral term appearing in (B) for any fixed value of $i$ . By the definition of $\tilde{T}_{i}(t)$ , the integral is clearly zero if $\tilde{T}_{i}(t)=\tilde{T}_{i}$ (i.e., when $\tilde{T}_{i}<t$ ); hence, we only need to consider the calculation of

\int_{t}^{\tilde{T}_{i}}\frac{V_{lm}(u;t,W_{i},\Psi)}{G(u|W_{i})}dM_{G}(u|W_{i}).

for $\tilde{T}_{i}\geq t.$ Recall that $V_{lm}(u;t,W_{i},\Psi)=E_{\Psi}\{Z_{im}(t)-\beta_{lm}(t)|T\geq u,W_{i}\}^{2}=y_{m}(u;t,w,\Psi)-2y_{m}(u;t,w,\Psi)\beta_{lm}(t)+\beta_{lm}^{2}(t)$ , where $y_{m}(u;t,w,\Psi)$ is given in (8). By definition, we have that $y_{m}(u;t,w,\Psi)=0$ for $u\geq t$ ; hence, for $\tilde{T}_{i}\geq t,$

\displaystyle\int_{t}^{\tilde{T}_{i}}\frac{V_{lm}(u;t,W_{i},\Psi)}{G(u|W_{i})}dM_{G}(u|W_{i})=\beta_{lm}^{2}(t)\int_{t}^{\tilde{T}_{i}}\frac{dM_{G}(u|W_{i})}{G(u|W_{i})}.

(A.4)

Calculations analogous to those done in bai2013doubly show that

\displaystyle\int_{t}^{\tilde{T}_{i}}\frac{dM_{G}(u|W_{i})}{G(u|W_{i})}=I(\tilde{T}_{i}\geq t)\bigg{(}\frac{1}{G(t|W_{i})}-\frac{\Delta}{G(\tilde{T}_{i}|W_{i})}\bigg{)}.

(A.5)

Substituting (A.5) into (A.4) gives

\displaystyle\int_{t}^{\tilde{T}_{i}}\frac{V_{lm}(u;t,W_{i},\Psi)}{G(u|W_{i})}dM_{G}(u|W_{i})=\beta_{lm}^{2}(t)I(\tilde{T}_{i}\geq t)\bigg{(}\frac{1}{G(t|W_{i})}-\frac{\Delta_{i}}{G(\tilde{T}_{i}|W_{i})}\bigg{)}.

(A.6)

Hence we can see that

L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi)-L_{m,t}^{dr}({\cal O}(t),\psi_{m};G,\Psi)=\frac{1}{n}\sum_{i=1}^{n}I(W_{i}\in\mathcal{N}_{l})\,\Omega_{i},

almost surely, where

\Omega_{i}=\bigg{(}\frac{\Delta_{i}}{G(T_{i}|W_{i})}-\frac{\Delta_{i}(t)}{G(T_{i}(t)|W_{i})}\bigg{)}\{Z_{im}(t)-\beta_{lm}(t)\}^{2}+I(\tilde{T}_{i}\geq t)\bigg{(}\frac{1}{G(t|W_{i})}-\frac{\Delta_{i}}{G(\tilde{T}_{i}|W_{i})}\bigg{)}\beta_{lm}^{2}(t).

Recall that $T_{i}$ and $C_{i}$ are both continuous; hence, $P(T_{i}=t)=P(C_{i}=t)=0$ . We now consider the calculation of $\Omega_{i}$ under the 6 possible ways in which $T_{i},$ $C_{i}$ and $t$ can be ordered with positive probability:

$T_{i}<C_{i}<t:$ In this case, $\Delta_{i}=\Delta_{i}(t)=1,$ $T_{i}(t)=T_{i},$ and $\tilde{T}_{i}<t.$ Hence,

\Omega_{i}=\bigg{(}\frac{1}{G(T_{i}|W_{i})}-\frac{1}{G(T_{i}|W_{i})}\bigg{)}\{Z_{im}(t)-\beta_{lm}(t)\}^{2}+0\cdot\bigg{(}\frac{1}{G(t|W_{i})}-\frac{1}{G(\tilde{T}_{i}|W_{i})}\bigg{)}\beta_{lm}^{2}(t)=0.

2.

$T_{i}<t<C_{i}:$ As in Case 1, $\Delta_{i}=\Delta_{i}(t)=1,$ $T_{i}(t)=T_{i},$ and $\tilde{T}_{i}<t;$ hence, $\Omega_{i}=0.$

$t<T_{i}<C_{i}:$ In this case, $\Delta_{i}=1,$ $T_{i}(t)=t,$ $\Delta_{i}(t)=I(T_{i}(t)\leq C_{i})=I(t\leq C_{i})=1$ and $\tilde{T}_{i}>t$ . Moreover, $Z_{im}(t)=I(T_{i}\leq t,M_{i}=1)=0$ regardless of $M_{i}$ because $T_{i}>t.$ Hence,

	$\displaystyle\Omega_{i}=\bigg{(}\frac{1}{G(T_{i}\|W_{i})}-\frac{1}{G(t\|W_{i})}\bigg{)}\{0-\beta_{lm}(t)\}^{2}+\beta_{lm}^{2}(t)\bigg{(}\frac{1}{G(t\|W_{i})}-\frac{1}{G(T_{i}\|W_{i})}\bigg{)}$
	$\displaystyle=\frac{\beta_{lm}^{2}(t)}{G(T_{i}\|W_{i})}-\frac{\beta_{lm}^{2}(t)}{G(t\|W_{i})}+\frac{\beta_{lm}^{2}(t)}{G(t\|W_{i})}-\frac{\beta_{lm}^{2}(t)}{G(T_{i}\|W_{i})}=0.$

$C_{i}<T_{i}<t:$ In this case, $\Delta_{i}=0,$ $\tilde{T}_{i}<t,$ $T_{i}(t)=T_{i},$ and $\Delta_{i}(t)=I(T_{i}(t)\leq C_{i})=0.$ Hence,

\Omega_{i}=\bigg{(}\frac{0}{G(T_{i}|W_{i})}-\frac{0}{G(T_{i}|W_{i})}\bigg{)}\{Z_{im}(t)-\beta_{lm}(t)\}^{2}+0\times\bigg{(}\frac{1}{G(t|W_{i})}-\frac{0}{G(\tilde{T}_{i}|W_{i})}\bigg{)}\beta_{lm}^{2}(t)=0.

$C_{i}<t<T_{i}:$ In this case, $\Delta_{i}=0,$ $\tilde{T}_{i}<t,$ $\Delta_{i}(t)=I(T_{i}(t)\leq C_{i})=0$ and $T_{i}(t)=t.$ Hence,

\Omega_{i}=\bigg{(}\frac{0}{G(T_{i}|W_{i})}-\frac{0}{G(t|W_{i})}\bigg{)}\{Z_{im}(t)-\beta_{lm}(t)\}^{2}+0\times\bigg{(}\frac{1}{G(t|W_{i})}-\frac{0}{G(\tilde{T}_{i}|W_{i})}\bigg{)}\beta_{lm}^{2}(t)=0.

$t<C_{i}<T_{i}:$ In this case, $\Delta_{i}=0,$ $\tilde{T}_{i}>t,$ $\Delta_{i}(t)=1,$ $T_{i}(t)=t$ and $Z_{im}(t)=0$ regardless of $M_{i}$ because $T_{i}>t$ . Hence,

	$\displaystyle\Omega_{i}=\bigg{(}\frac{0}{G(T_{i}\|W_{i})}-\frac{1}{G(t\|W_{i})}\bigg{)}\{0-\beta_{lm}(t)\}^{2}+\beta_{lm}^{2}(t)\bigg{(}\frac{1}{G(t\|W_{i})}-\frac{0}{G(T_{i}\|W_{i})}\bigg{)}$
	$\displaystyle=-\frac{\beta_{lm}^{2}(t)}{G(t\|W_{i})}+\frac{\beta_{lm}^{2}(t)}{G(t\|W_{i})}=0.$

In summary, we have proved that $L_{m,t}^{dr}({\cal O},\psi_{m};G,\Psi)-L_{m,t}^{dr}({\cal O}(t),\psi_{m};G,\Psi)=0$ except possibly on a set which has probability measure 0.

A.2 Further details on CART algorithms using augmented loss

The extension of the CIF regression tree methodology described in Section 2.3 to the case where ${\cal O}$ is observed has two main steps:

1.

Throughout, replace the $L_{2}$ loss function used by CART with $L_{m,t}^{dr}({\cal O},\psi_{m};\hat{G},\hat{\Psi});$
2.

Use the modified algorithm in Step #1 to grow a maximal tree, thereby generating a sequence of trees as candidates for the best tree;
3.

Using cross-validation, select the best tree from this sequence via cost complexity pruning.

Regarding Step #1, the overall structure of the CART algorithm is independent of the choice of loss function; the specification of the loss function only plays a role in determining how the splitting, evaluation, and model selection decisions are made in Steps #2 and #3 (cf., breiman1984classification).

Step #2 serves to restrict the search space, that is, the set of possible trees that one must consider in determining which tree is optimal; see below. To describe how Step #3 is carried out, we first need to define the notion of a cross-validated risk estimator. Let ${\cal O}=\cup_{q=1}^{Q}{\cal O}_{q}$ be a partition of ${\cal O}$ such that each ${\cal O}_{q}$ contains all observed data on some subset of subjects and each subject in ${\cal O}$ appears in exactly one of the sets ${\cal O}_{1},\ldots,{\cal O}_{Q}$ . Let $p_{q}=\frac{1}{n}\sum_{i=1}^{n}S_{i,q},$ where $S_{i,q}=1$ if observation $i$ is in dataset ${\cal O}_{q}$ and zero otherwise. Recalling (7), define the modified notation

\hat{V}_{lm}(u;t,w,\psi_{m})=y_{m}(u;t,w,\hat{\Psi})-2y_{m}(u;t,w,\hat{\Psi})\psi_{m}(t;w)+[\psi_{m}(t;w)]^{2};

as given, $\hat{V}_{lm}(u;t,w,\psi_{m})$ is now a function of $\psi_{m}(t;w)$ . Fixing $q$ and for each $i\in{\cal O}_{q},$ let

\varphi_{i,m}^{dr}(\mathcal{O}_{q},\hat{\psi}^{(q)}_{m}(t;W_{i}))=\frac{\Delta_{i}\{\tilde{Z}_{im}(t)-\hat{\psi}^{(q)}_{m}(t;W_{i})\}^{2}}{\hat{G}(\tilde{T}_{i}|W_{i})}+\displaystyle\int_{0}^{\tilde{T}_{i}}\frac{\hat{V}_{lm}(u;t,W_{i},\hat{\psi}^{(q)}_{m})}{\hat{G}(u|W_{i})}dM_{\hat{G}}(u|W_{i})

where $\hat{\psi}^{(q)}_{m}(t;w)$ is the prediction obtained from a tree that estimates $\psi_{0m}(t;w)$ from the reduced dataset ${\cal O}\setminus{\cal O}_{q}.$ Then, similarly to steingrimsson2016doubly, a cross-validated doubly robust risk estimator can be defined as follows:

\hat{\cal R}(t;\hat{\psi}^{(1)}_{m},\ldots,\hat{\psi}^{(Q)}_{m})=\frac{1}{nQ}\sum_{q=1}^{Q}\sum_{i=1}^{n}\frac{I(S_{i,q}=1)}{p_{q}}~{}\varphi_{i,m}^{dr}(\mathcal{O}_{q},\hat{\psi}^{(q)}_{m}(t;W_{i})).

Returning to Step #3: Suppose Steps #1 and 2 have been run. The corresponding unpruned maximal tree generates a sequence of (say) $R$ subtrees, each of which is a candidate for the final tree. Each of these trees can be identified from this maximal tree by a unique choice of the so-called cost complexity tuning penalty parameter; call these parameter values $\alpha_{1},\ldots,\alpha_{R}$ . Now, for each fixed $r=1,\ldots,R,$ let $\hat{\psi}^{(1,r)}_{m},\ldots,\hat{\psi}^{(Q,r)}_{m}$ denote the sequence of trees obtained by running the doubly robust tree building procedure on the datasets ${\cal O}\setminus{\cal O}_{q},q=1,\ldots,Q,$ where each such tree is determined by cost complexity pruning using the penalty parameter $\alpha_{r}$ . Define $\hat{\cal R}(t;\hat{\psi}^{(1,r)}_{m},\ldots,\hat{\psi}^{(Q,r)}_{m})$ to be the corresponding cross-validated risk estimates. The final tree is then obtained from the initial maximal tree using the penalty parameter $\alpha_{\hat{r}},$ where $\hat{r}$ minimizes $\hat{\cal R}(t;\hat{\psi}^{(1,r)}_{m},\ldots,\hat{\psi}^{(Q,r)}_{m}),r=1,\ldots,R.$

The indicated procedure is modified in an obvious way to accommodate IPCW-type and Buckley-James Brier loss functions. In the case of the corresponding composite loss function, a weighted sum of the desired loss estimates over $t_{1},\ldots,t_{J}$ is instead used. Beyond these loss modifications, estimation of the maximal tree and selection of the corresponding best tree is carried out exactly as described above. In practice, implementation of the algorithm just described is possible using rpart’s capabilities for incorporating user-written splitting and evaluation functions (therneau2015introduction). Examples of such code are provided as part of the Supplementary Materials for this paper.

A.3 Estimating $\Psi$ for computing augmented loss functions

Section 4.3 discusses methods for estimating both the censoring distribution $G$ and the parameter $\Psi$ when computing observed data loss functions. This section expands on the four methods for estimating $\Psi,$ needed for computing (8), that are described in Section 4.3. We assume $K=2.$

1.

RF(npar): for $m=1,2,$ $\psi_{0m}(\cdot|\cdot)$ is estimated using nonparametric random forests. More specifically, the rfsrc function in the randomForestSRC package is used (ishwaran2014random), where the predict command is used to estimate $\psi_{01}(u|w)$ and $\bar{\psi}_{0}(u|w)$ for $K=2$ and $u\geq 0$ . For $m=1,2$ and $B=500$ bootstrap samples, the components of $\hat{\Psi}$ are given by $\hat{\psi}_{m}^{\mbox{\tiny rf}}(u|w)=B^{-1}\sum_{b=1}^{B}\hat{\psi}_{m[b]}(u|w);$ here, for the $b^{th}$ bootstrap sample, $\hat{\psi}_{1[b]}(u|w)=\hat{P}_{[b]}(T\leq u,M=1|W=w)$ is the estimated CIF of interest and $\hat{\psi}_{2[b]}(u|w)=\hat{P}_{[b]}(T\leq u,M\neq 1|W=w)$ is the corresponding ”all other causes” estimate. The event-free survival function is estimated in the obvious manner.
2.

RF+lgtc: for $m=1,2,$ $\psi_{0m}(\cdot|\cdot)$ is estimated using “random parametric forest”. That is, as above, the rfsrc function in the randomForestSRC package is first used (ishwaran2014random) to construct a forest (i.e., $B=500$ trees) with $K=2$ . Using the tree obtained for the $b^{th}$ bootstrapped sample, $\hat{\psi}_{m[b]}(u|w)=\hat{P}_{[b]}(T\leq u,M=m|W=w),m=1,2$ are then estimated by fitting the three parameter logistic cumulative incidence model of cheng2009modeling separately to the data falling into each terminal node. Finally, similarly to $\hat{\psi}_{m}^{\mbox{\tiny rf}}(u|w),$ the desired ensemble predictors are obtained by computing bootstrap averages.
3.

FG(true): the simulation model of Section 4.1 is an example of the class of models considered in jeong2007parametric. For $m=1,2,$ $\psi_{0m}(\cdot|\cdot)$ are estimated by maximum likelihood under a correctly specified parametric model.
4.

lgtc+godds: for $m=1,2,$ $\psi_{0m}(\cdot|\cdot)$ is estimated by maximum likelihood using the parametric cumulative incidence model proposed in shi2013constrained.

A.3.1 The CIF models of cheng2009modeling shi2013constrained

Here, we describe the parametric models of cheng2009modeling shi2013constrained that are used by RF+lgtc and lgtc+godds. Specifically, assuming $K=2$ and hence for $m=1,2,$ these models are given below:

Three parameter logistic model without covariates (RF+lgtc; cheng2009modeling):
This model is motivated by nonlinear modeling in bioassay and dose-response curves (ritz2005bioassay). Specifically, the parametric model that is fit by maximum likelihood separately within each terminal node of a tree is given by

\displaystyle\psi_{m}(t;\xi_{m})=\frac{p_{m}\exp\{b_{m}(t-c_{m})\}-p_{m}\exp(-b_{m}c_{m})}{1+\exp\{b_{m}(t-c_{m})\}}

where $\xi_{m}=(b_{m},c_{m},p_{m})^{T}$ consists of three parameters: $p_{m}$ is upper asymptote of $\psi_{m}(\cdot;\cdot)$ when $t\rightarrow\infty$ ; and, $b_{m}$ and $c_{m}$ are the slope and center of the curve (cheng2009modeling). To satisfy the additivity constraint required of CIFs, we further assume $p_{2}=1-p_{1}$ .

Three parameter logistic model with covariates (lgtc+godds; shi2013constrained):
shi2013constrained extends the model of cheng2009modeling to incorporate covariates. Specifically, the parametric model used for $\psi_{0m}(t;w)$ is

\displaystyle\psi_{m}(t;\mathbf{W},\boldsymbol{\gamma},\xi_{m})=g_{m}^{-1}[g_{m}\{\psi_{m}(t;\xi_{m})\}+\boldsymbol{\gamma}^{T}\mathbf{W}]

where $g_{m}$ is a specified nondecreasing function and $\psi_{m}(t;\xi_{m})$ is the covariate-independent model of cheng2009modeling. One possible form of $g_{m}$ is

\displaystyle g_{m}(u;\alpha_{m})=\log[\{(1-u)^{-\alpha_{m}}-1\}/\alpha_{m}],0<\alpha_{m}<\infty

where $\alpha_{m}$ is parameter for each event type. This model is an extension of the generalized odds-rate model (dabrowska1988estimation). When using a generalized odds model as a link function $g_{m}(\cdot;\cdot)$ , the CIF for $m=1$ is given by the formula $\psi_{1}(t;\boldsymbol{W})=1-H(t;\boldsymbol{W})^{-1/\alpha_{1}},$ where

\displaystyle H(t;\boldsymbol{W})=\bigg{\{}\bigg{(}1-\frac{p_{1}\exp\{b_{1}(t-c_{1})-p_{1}\exp(-b_{1}c_{1})\}}{1+\exp\{b_{1}(t-c_{1})\}}\bigg{)}^{-\alpha_{1}}\!\!\!\!-1\bigg{\}}\exp(\boldsymbol{\beta}_{1}^{T}\boldsymbol{W})+1;

the CIF for $m=2$ is given by

\displaystyle\psi_{2}(t;\boldsymbol{W})=\frac{p_{2}(\boldsymbol{W})[\exp\{b_{2}(t-c_{2})\}-\exp\{-b_{2}c_{2}\}]}{1+\exp\{b_{2}(t-c_{2})\}},

where $p_{2}(\boldsymbol{W})=1-\psi_{1}(\infty;\boldsymbol{W})=[\{(1-p_{1})^{-\alpha_{1}}-1\}\exp(\boldsymbol{\beta}_{1}^{T}\boldsymbol{W}))+1]^{-1/\alpha_{1}}$ . Observe that the additive constraint $\psi_{1}(\infty;\boldsymbol{W})+\psi_{2}(\infty;\boldsymbol{W})=1$ is satisfied. In addition, note that a separate covariate effect for $m=2$ is not modeled and hence there are no additional regression parameters to be estimated (i.e., even though CIF for $m=2$ does depend on covariates through $p_{2}(\boldsymbol{W})$ ).

A.4 Miscellaneous Algorithm Specifications

Two important tuning parameters that govern the size of the maximal tree built in Step #2 of general CART algorithm are minbucket, the minimum possible number of observations in a terminal node, and minsplit, the minimum number of observations in a node to be considered for a possible split. Throughout, we set minbucket=10 and minsplit=30. For the doubly robust and Buckley-James methods, both uncensored and censored observations are counted when considering these limits; for IPCW methods, only uncensored observations are counted.

A.5 Additional simulation results

In this section, we summarize the full set of simulation results using both“single time point” and composite loss functions. We recall that the IPCW₁ and IPCW₂ refer to the usual ( $t^{*}=\infty$ ) and modified ( $t^{*}=t$ ) IPCW methods, respectively. As described elsewhere, the Buckley-James (BJ) and doubly robust (DR) methods use several approaches to estimating (8), which is required for computing the augmented loss function. In particular, as described in Sections 4.3 (see also Section A.3.1), we use nonparametric random forests (RF (npar)), random parametric forests (RF+lgtc), the true data generating model (FG (true)) and the three-parameter logistic distribution in combination with a generalized odds model (Lgtc+godds). Results are summarized for each type of loss function. For example, results corresponding to“BJ-RF (npar)” means that we use the BJ loss function in combination with nonparametric random forests for calculating (8).

Table A.1 – A.4 summarize the numerical performance of the fitted trees and Figures A.1 – A.3 show boxplots of prediction error. The choice of time points and method of summarizing the results in the tables and figures is the same as in Section 4.4. It is easy to see that (i) performance generally improves with increasing sample size and tends to be worst at $t_{1}$ ; (ii) there is substantially more variation in results using a single time point loss in comparison to a composite loss; and, (iii) the use of the composite loss function results in better overall performance, inlcuding lower prediction error, in comparison to the use of any single time point loss function.

In the medium and low signal settings, greater differences between methods emerge, with a clear improvement at $n=500$ compared to $n=250.$ Regardless of sample size, IPCW₁ has the least favorable performance overall in both the medium and low signal settings, and BJ-FG(true) has the most favorable performance overall; these differences are especially noticeable in the low signal setting.

It is also clear that IPCW₂ generally exhibits less variability than IPCW₁ at time points $t_{1}$ and $t_{2}$ , whereas for time $t_{3}$ , the two measures perform similarly. This last result is expected since $\frac{\Delta(t^{*})}{\hat{G}(T(t^{*})|W)}\rightarrow\frac{\Delta}{\hat{G}(T|W)}$ as $t^{*}\rightarrow\infty$ .

The results further show that the choice of method for estimating the conditional expectation has little overall impact on performance; the main differences stem from the type of loss (IPCW versus DR versus BJ) and whether or not a composite loss function is used. The results also clearly highlight the value of getting the data generating model exactly right, and otherwise demonstrate a desirable degree of robustness across methods, particularly for doubly robust loss. When only comparing IPCW₁ and the doubly robust methods, the efficiency gain is evident, particularly at time point $t_{3}$ using a single time point loss. The performance for IPCW₂ is often reasonably close to that for the Buckley-James and doubly robust loss functions, especially for a composite loss function; the comparative degree of simplicity involved in implementing this method has much to recommend it.

Table A.1: Numerical summaries for trees built using a single time point loss function when

n=250

. IPCW₁ and IPCW₂ are time-independent IPCW and time-dependent IPCW, respectively; BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true) are respectively built using Buckley-James loss functions with augmentation term estimated via nonparametric random forests, random parametric forests using ensembles of parametric models described in cheng2009modeling, the parametric model of shi2013constrained, and under the correct (i.e., consistently estimated) simulation model; and, DR-RF(npar), DR-RF+lgtc, DR-lgtc+godds and DR-FG(true) are built using the doubly robust loss function, respectively using the same methods for calculating the augmentation term as in BJ-RF(npar), BJ-RF+lgtc, BJ-lgtc+godds and BJ-FG(true).

		High Sig			Med Sig			Low Sig
		$\|$ fitted-3 $\|$	NSP	PCSP	$\|$ fitted-3 $\|$	NSP	PCSP	$\|$ fitted-3 $\|$	NSP	PCSP
IPCW₁	$t_{1}$	0.432	0.196	0.724	1.668	0.118	0.112	1.890	0.072	0.020
	$t_{2}$	0.204	0.138	0.890	0.838	0.166	0.494	1.644	0.124	0.118
	$t_{3}$	0.294	0.188	0.816	0.780	0.250	0.556	1.332	0.198	0.246
IPCW₂	$t_{1}$	0.350	0.230	0.800	1.428	0.208	0.220	1.834	0.114	0.040
	$t_{2}$	0.130	0.072	0.922	0.546	0.168	0.674	1.500	0.150	0.182
	$t_{3}$	0.354	0.242	0.800	0.726	0.216	0.576	1.318	0.134	0.238
BJ-RF(npar)	$t_{1}$	0.392	0.280	0.814	1.298	0.286	0.284	1.840	0.160	0.042
	$t_{2}$	0.236	0.162	0.876	0.424	0.238	0.770	1.354	0.258	0.252
	$t_{3}$	0.320	0.258	0.858	0.336	0.218	0.818	0.980	0.304	0.438
BJ-RF+lgtc	$t_{1}$	0.326	0.234	0.822	1.302	0.304	0.290	1.840	0.140	0.044
	$t_{2}$	0.256	0.182	0.864	0.404	0.228	0.774	1.344	0.232	0.256
	$t_{3}$	0.302	0.236	0.858	0.234	0.144	0.854	0.930	0.256	0.448
BJ-FG(true)	$t_{1}$	0.336	0.260	0.826	0.950	0.284	0.482	1.694	0.152	0.106
	$t_{2}$	0.190	0.140	0.902	0.204	0.134	0.882	0.682	0.246	0.630
	$t_{3}$	0.142	0.116	0.932	0.150	0.098	0.908	0.202	0.138	0.886
BJ-lgtc+godds	$t_{1}$	0.392	0.282	0.792	1.182	0.320	0.354	1.766	0.146	0.070
	$t_{2}$	0.342	0.244	0.850	0.368	0.236	0.818	1.024	0.284	0.404
	$t_{3}$	0.616	0.336	0.728	0.426	0.262	0.816	0.702	0.386	0.602
DR-RF(npar)	$t_{1}$	0.342	0.248	0.814	1.378	0.246	0.246	1.838	0.138	0.044
	$t_{2}$	0.170	0.128	0.898	0.508	0.242	0.696	1.506	0.222	0.182
	$t_{3}$	0.288	0.188	0.860	0.480	0.128	0.688	1.382	0.122	0.236
DR-RF+lgtc	$t_{1}$	0.338	0.236	0.816	1.392	0.288	0.250	1.826	0.132	0.044
	$t_{2}$	0.166	0.122	0.916	0.524	0.248	0.692	1.518	0.210	0.168
	$t_{3}$	0.270	0.190	0.864	0.544	0.138	0.660	1.426	0.144	0.222
DR-FG(true)	$t_{1}$	0.366	0.266	0.818	1.362	0.268	0.274	1.830	0.134	0.046
	$t_{2}$	0.168	0.126	0.910	0.482	0.208	0.734	1.416	0.150	0.212
	$t_{3}$	0.244	0.164	0.866	0.588	0.234	0.686	1.318	0.148	0.284
DR-lgtc+godds	$t_{1}$	0.366	0.274	0.802	1.366	0.250	0.264	1.842	0.112	0.036
	$t_{2}$	0.126	0.100	0.920	0.510	0.236	0.722	1.518	0.222	0.180
	$t_{3}$	0.236	0.160	0.854	0.578	0.174	0.648	1.404	0.116	0.236

Table A.2: Numerical summaries for trees built using a single time point loss function when

n=500

		High Sig			Med Sig			Low Sig
		$\|$ fitted-3 $\|$	NSP	PCSP	$\|$ fitted-3 $\|$	NSP	PCSP	$\|$ fitted-3 $\|$	NSP	PCSP
IPCW₁	$t_{1}$	0.154	0.110	0.892	1.058	0.184	0.422	1.842	0.078	0.064
	$t_{2}$	0.124	0.066	0.932	0.222	0.138	0.850	1.020	0.122	0.422
	$t_{3}$	0.134	0.092	0.916	0.178	0.106	0.892	0.672	0.134	0.594
IPCW₂	$t_{1}$	0.198	0.130	0.886	0.480	0.220	0.726	1.588	0.172	0.166
	$t_{2}$	0.118	0.076	0.928	0.120	0.090	0.904	0.666	0.170	0.614
	$t_{3}$	0.116	0.086	0.916	0.162	0.094	0.898	0.632	0.176	0.626
BJ-RF+npar	$t_{1}$	0.192	0.128	0.906	0.394	0.208	0.774	1.506	0.192	0.214
	$t_{2}$	0.120	0.070	0.924	0.178	0.126	0.894	0.366	0.162	0.784
	$t_{3}$	0.136	0.086	0.918	0.182	0.114	0.894	0.190	0.136	0.868
BJ-RF+lgtc	$t_{1}$	0.192	0.136	0.904	0.428	0.240	0.762	1.554	0.198	0.196
	$t_{2}$	0.086	0.052	0.934	0.170	0.120	0.898	0.362	0.162	0.792
	$t_{3}$	0.142	0.090	0.918	0.174	0.108	0.900	0.200	0.140	0.870
BJ-FG(true)	$t_{1}$	0.184	0.136	0.912	0.250	0.166	0.852	1.190	0.244	0.370
	$t_{2}$	0.110	0.072	0.950	0.186	0.134	0.880	0.122	0.086	0.906
	$t_{3}$	0.082	0.066	0.956	0.104	0.074	0.942	0.098	0.058	0.936
BJ-lgtc+godds	$t_{1}$	0.194	0.140	0.904	0.326	0.222	0.804	1.314	0.214	0.300
	$t_{2}$	0.196	0.126	0.902	0.226	0.146	0.862	0.270	0.152	0.828
	$t_{3}$	0.526	0.190	0.704	0.350	0.192	0.818	0.218	0.146	0.852
DR-RF(npar)	$t_{1}$	0.184	0.136	0.910	0.456	0.204	0.742	1.546	0.140	0.186
	$t_{2}$	0.118	0.066	0.934	0.134	0.094	0.908	0.626	0.148	0.654
	$t_{3}$	0.102	0.068	0.938	0.088	0.056	0.944	0.498	0.140	0.692
DR-RF+lgtc	$t_{1}$	0.158	0.116	0.918	0.482	0.202	0.722	1.568	0.146	0.180
	$t_{2}$	0.108	0.052	0.930	0.118	0.088	0.916	0.644	0.158	0.652
	$t_{3}$	0.076	0.054	0.952	0.110	0.070	0.936	0.524	0.140	0.678
DR-FG(true)	$t_{1}$	0.190	0.140	0.906	0.444	0.188	0.756	1.576	0.172	0.186
	$t_{2}$	0.084	0.050	0.956	0.124	0.080	0.916	0.552	0.134	0.676
	$t_{3}$	0.096	0.066	0.946	0.078	0.054	0.950	0.522	0.128	0.686
DR-lgtc+godds	$t_{1}$	0.206	0.140	0.904	0.440	0.186	0.744	1.552	0.192	0.170
	$t_{2}$	0.098	0.050	0.946	0.138	0.094	0.908	0.594	0.148	0.668
	$t_{3}$	0.084	0.054	0.948	0.118	0.076	0.928	0.496	0.112	0.682

Table A.3: Numerical summaries for trees built using a single time point loss function when

n=1000

		High Sig			Med Sig			Low Sig
		$\|$ fitted-3 $\|$	NSP	PCSP	$\|$ fitted-3 $\|$	NSP	PCSP	$\|$ fitted-3 $\|$	NSP	PCSP
IPCW₁	$t_{1}$	0.106	0.076	0.930	0.132	0.072	0.908	1.328	0.134	0.284
	$t_{2}$	0.066	0.042	0.962	0.134	0.114	0.904	0.208	0.116	0.864
	$t_{3}$	0.092	0.054	0.936	0.070	0.052	0.952	0.124	0.088	0.916
IPCW₂	$t_{1}$	0.088	0.072	0.946	0.092	0.070	0.934	0.646	0.170	0.650
	$t_{2}$	0.064	0.042	0.956	0.116	0.090	0.924	0.122	0.076	0.916
	$t_{3}$	0.100	0.056	0.932	0.082	0.056	0.944	0.130	0.094	0.922
BJ-RF+npar	$t_{1}$	0.106	0.082	0.924	0.114	0.084	0.932	0.488	0.156	0.722
	$t_{2}$	0.088	0.056	0.950	0.168	0.118	0.890	0.108	0.070	0.924
	$t_{3}$	0.114	0.060	0.932	0.180	0.110	0.894	0.142	0.106	0.906
BJ-RF+lgtc	$t_{1}$	0.126	0.102	0.922	0.104	0.078	0.938	0.458	0.152	0.738
	$t_{2}$	0.064	0.048	0.958	0.136	0.094	0.904	0.086	0.056	0.934
	$t_{3}$	0.124	0.068	0.924	0.160	0.104	0.898	0.138	0.094	0.904
BJ-FG(true)	$t_{1}$	0.108	0.088	0.932	0.088	0.056	0.946	0.218	0.100	0.866
	$t_{2}$	0.084	0.070	0.956	0.058	0.042	0.954	0.076	0.050	0.944
	$t_{3}$	0.036	0.024	0.970	0.104	0.074	0.942	0.060	0.040	0.958
BJ-lgtc+godds	$t_{1}$	0.112	0.084	0.922	0.110	0.082	0.934	0.302	0.120	0.820
	$t_{2}$	0.164	0.094	0.922	0.162	0.090	0.902	0.100	0.076	0.932
	$t_{3}$	0.774	0.128	0.508	0.358	0.126	0.786	0.204	0.122	0.860
DR-RF(npar)	$t_{1}$	0.088	0.070	0.942	0.102	0.072	0.934	0.556	0.168	0.692
	$t_{2}$	0.068	0.048	0.958	0.098	0.068	0.930	0.104	0.072	0.926
	$t_{3}$	0.048	0.038	0.978	0.048	0.030	0.964	0.098	0.076	0.936
DR-RF+lgtc	$t_{1}$	0.092	0.070	0.938	0.094	0.066	0.938	0.588	0.168	0.682
	$t_{2}$	0.048	0.034	0.962	0.100	0.068	0.932	0.090	0.064	0.934
	$t_{3}$	0.040	0.032	0.982	0.034	0.024	0.974	0.098	0.074	0.938
DR-FG(true)	$t_{1}$	0.080	0.066	0.940	0.068	0.048	0.954	0.532	0.164	0.706
	$t_{2}$	0.082	0.062	0.960	0.092	0.066	0.938	0.072	0.058	0.950
	$t_{3}$	0.050	0.042	0.974	0.056	0.036	0.968	0.088	0.074	0.948
DR-lgtc+godds	$t_{1}$	0.068	0.054	0.948	0.094	0.064	0.938	0.564	0.162	0.690
	$t_{2}$	0.054	0.036	0.962	0.114	0.078	0.934	0.074	0.054	0.938
	$t_{3}$	0.048	0.022	0.970	0.040	0.024	0.972	0.062	0.052	0.948

Table A.4: Numerical summaries for trees built using multiple time points when

n=250,500

and

1000

		High Sig			Med Sig			Low Sig
		$\|$ fitted-3 $\|$	NSP	PCSP	$\|$ fitted-3 $\|$	NSP	PCSP	$\|$ fitted-3 $\|$	NSP	PCSP
$n=250$	IPCW₁	0.168	0.116	0.914	0.506	0.132	0.670	1.286	0.102	0.272
	IPCW₂	0.156	0.120	0.916	0.314	0.142	0.798	1.128	0.146	0.358
	BJ-RF(npar)	0.234	0.186	0.878	0.304	0.158	0.810	1.128	0.246	0.364
	BJ-RF+lgtc	0.250	0.196	0.882	0.290	0.158	0.824	1.110	0.244	0.360
	BJ-FG(true)	0.156	0.116	0.910	0.194	0.142	0.892	0.348	0.182	0.790
	BJ-lgtc+godds	0.358	0.246	0.844	0.364	0.266	0.814	0.788	0.328	0.532
	DR-RF(npar)	0.150	0.120	0.924	0.272	0.138	0.828	1.156	0.146	0.334
	DR-RF+lgtc	0.152	0.126	0.912	0.264	0.120	0.834	1.134	0.178	0.342
	DR-FG(true)	0.132	0.090	0.932	0.194	0.092	0.880	1.050	0.170	0.408
	DR-lgtc+godds	0.120	0.092	0.926	0.286	0.144	0.846	1.044	0.146	0.378
$n=500$	IPCW₁	0.132	0.084	0.916	0.182	0.132	0.874	0.558	0.164	0.658
	IPCW₂	0.124	0.082	0.932	0.138	0.100	0.906	0.282	0.136	0.830
	BJ-RF(npar)	0.142	0.090	0.904	0.148	0.088	0.908	0.226	0.128	0.878
	BJ-RF+lgtc	0.130	0.082	0.908	0.156	0.094	0.908	0.210	0.124	0.880
	BJ-FG(true)	0.092	0.068	0.940	0.102	0.064	0.934	0.118	0.082	0.918
	BJ-lgtc+godds	0.250	0.122	0.848	0.156	0.080	0.896	0.154	0.104	0.896
	DR-RF(npar)	0.066	0.052	0.958	0.092	0.056	0.940	0.216	0.118	0.856
	DR-RF+lgtc	0.040	0.030	0.966	0.092	0.052	0.938	0.230	0.106	0.834
	DR-FG(true)	0.058	0.044	0.966	0.092	0.058	0.942	0.176	0.076	0.874
	DR-lgtc+godds	0.040	0.030	0.970	0.082	0.056	0.950	0.214	0.102	0.852
$n=1000$	IPCW₁	0.080	0.036	0.940	0.122	0.096	0.928	0.088	0.074	0.936
	IPCW₂	0.068	0.052	0.948	0.084	0.046	0.946	0.094	0.074	0.938
	BJ-RF(npar)	0.110	0.062	0.918	0.112	0.074	0.918	0.096	0.068	0.934
	BJ-RF+lgtc	0.080	0.052	0.938	0.126	0.084	0.912	0.094	0.064	0.936
	BJ-FG(true)	0.072	0.046	0.954	0.088	0.060	0.940	0.086	0.070	0.954
	BJ-lgtc+godds	0.270	0.078	0.818	0.140	0.076	0.904	0.100	0.066	0.928
	DR-RF(npar)	0.050	0.028	0.976	0.068	0.046	0.952	0.098	0.068	0.928
	DR-RF+lgtc	0.038	0.022	0.978	0.056	0.044	0.956	0.104	0.078	0.930
	DR-FG(true)	0.030	0.022	0.978	0.056	0.034	0.966	0.058	0.042	0.958
	DR-lgtc+godds	0.028	0.016	0.984	0.060	0.044	0.952	0.082	0.056	0.944

A.6 Tree results in data analysis

In this subsection, we show various trees built for the RTOG9410 study described in Section 5. Numbers in each terminal node represent terminal node estimators (i.e., CIF estimates) generated by the loss function used to build the corresponding tree.

Figure A.4: Tree built using uncensored data for out-field failure

Figure A.5: Tree built using 29% artificially censored data for out-field failure for IPCW₁

Figure A.6: Tree built using 29% artificially censored data for out-field failure for IPCW₂

Figure A.7: Tree built using 29% artificially censored data for out-field failure for BJ-RF(npar)

Figure A.8: Tree built using 29% artificially censored data for out-field failure for DR-RF(npar)

Figure A.9: Tree built using 29% artificially censored data for out-field failure for BJ-RF+lgtc

Figure A.10: Tree built using 29% artificially censored data for out-field failure for BJ-RF+lgtc

Figure A.11: Tree built using uncensored data for death without progression

Figure A.12: Tree built using 29% artificially censored data for death without progression from IPCW₂

Figure A.13: Tree built using 29% artificially censored data for death without progression from DR-RF(npar)

Figure A.14: Tree built using 29% artificially censored data for death without progression from DR-RF+lgtc

Figure A.15: Tree built using 29% artificially censored data for death without progression from BJ-RF(npar)

Figure A.16: Tree built using 29% artificially censored data for death without progression from BJ-RF+lgtc