Treatment heterogeneity with right-censored outcomes using grf

Erik Sverdrup Stefan Wager

(Stanford GSB, February 2024)

This article walks through how to estimate conditional average treatment effects (CATEs) with right-censored time-to-event outcomes using the function causal_survival_forest (Cui et al., 2023) in the R package grf (Athey et al., 2019, Tibshirani et al., 2024). We consider a setting where we observe an unconfounded binary treatment $W_{i}\in\{0,1\}$ , and are interested in estimating its effect on a restricted mean survival time (RMST) conditional on covariates $X_{i}$ ,

\tau(x)=E[\textrm{min}(T_{i}(1),~{}h)-\textrm{min}(T_{i}(0),~{}h)~{}|~{}X_{i}=x],

where $T_{i}(1),T_{i}(0)$ are potential survival outcomes in the two treatment arms, and $h$ is some chosen maximum considered time. Since we are in a time-to-event setting, we don’t necessarily observe the survival time $T_{i}$ ; instead, we observe the recorded time $Y_{i}=\textrm{min}(T_{i},C_{i})$ where $C_{i}$ is the censoring time for the $i$ -th unit, along with the event indicator $D_{i}=1(T_{i}\leq C_{i})$ . We assume that censoring is ignorable; see Cui et al. (2023) for a precise statement.

The function causal_survival_forest estimates CATEs using a flexible forest-based approach that extends the random forest algorithm of Breiman (2001). The approach starts by estimating a number of auxiliary “nuisance” components, such as propensity scores and survival curves for the survival and censoring process, using regression forests and random survival forests respectively. The approach then trains a final forest that directly targets CATEs via a Neyman-orthogonal construction using these nuisance component estimates. For complete details on the algorithm, we refer to Cui et al. (2023).

Remark. causal_survival_forest also supports estimating the effect on survival probabilities conditional on covariates. To target the quantity

\tau(x)=P[T_{i}(1)>h~{}|~{}X_{i}=x]-P[T_{i}(0)>h~{}|~{}X_{i}=x],

one simply uses the argument target = "survival.probability" when calling into causal_survival_forest.

Data

We use data from the National Job Training Partnership Act (JPTA) study that assessed the impact of eligibility for an employment training program on employment outcomes using a randomized controlled trial. All participants were interviewed at enrollment before randomization; they were then re-interviewed sometime between one and three years post-enrollment. The study enrolled participants who were unemployed at the time, and one natural question to ask using data from this study is: How did being randomized into treatment affect the amount of time it took for study participants to find a job (Crommen et al., 2023, Frandsen, 2015)? This time-to-event outcome is subject to random censoring because not all participants had yet found a new job at the time of the second survey (which was administered with different delays post-enrollment for different participants); however, the censoring appears plausibly ignorable here and so causal survival forests can be used to correct for it.

The data set we’re employing is a processed copy of the JPTA data made available on GitHub by Crommen et al. (2023), which contains approximately 11 000 entries where subjects record the outcome (days of unemployment), an event indicator (where 1 denotes the subject found a job and 0 denotes censoring), a treatment indicator, as well as a handful of covariates such as age, high school education, race, children, marital status, and gender.

# Read in publicly available JTPA data from Crommen et al. (2023).

url = paste(

"https://raw.githubusercontent.com/GillesCrommen/DCC/",

"748bd7f98feccad09205ee3df76df5ba740cc3d7/",

"clean_dataset_JTPA.csv", sep = "")

data = read.csv(url)

# Outcome (days of unemployment observed)

Y = data$days

# Treatment indicator (W = 1 if the subject was eligible to enroll in JTPA)

W = data$treatment

# Non-censoring indicator (D = 1 means the subject had a job by the second survey)

D = data$delta

# Covariates

X = cbind(

age = data$age,

high.school.diploma = data$hsged,

race.white = data$white,

children = data$children,

married = data$married,

male = data$male

)

Our estimand is the RMST, and so we need to choose some suitable truncation time for the mean survival time to be identified. Based on the histogram of outcomes by event status, we set the truncation time to 2 years ( $h=720$ days), at which point most subsequent observations are censored.

hist(Y[D==1], main = "Histogram of Y", xlab = "")

hist(Y[D == 0], col = adjustcolor("red", 0.5), add = TRUE)

legend("topright", c("Event", "Censored"),

col = c("gray", adjustcolor("red", 0.5)),

lwd = 4)

abline(v = 720, lty = 2)

Fitting a causal survival forest and assessing heterogeneity

Since the trial was randomized, we are analyzing the data as an RCT by passing the sample average treatment fraction as the propensity score via the argument W.hat. The truncation parameter $h$ is supplied with the argument horizon. Below we fit a causal survival forest with default options and calculate a doubly robust estimate for the average treatment effect (ATE) $E[\textrm{min}(T_{i}(1),~{}h)-\textrm{min}(T_{i}(0),~{}h)]$ .

Note. To avoid potential confusion with a negative sign representing a beneficial treatment effect, we relabel the treatment indicator such that we are measuring the effect of not being offered the program, which will have a positive sign if the training program is beneficial.

library(grf)

# Relabel treatment to measure the effect of not being offered training

W.relabeled = 1 - W

cs.forest = causal_survival_forest(

X, Y, W.relabeled, D,

W.hat = mean(W.relabeled),

horizon = 720)

# Get a doubly robust estimate for the ATE

average_treatment_effect(cs.forest)

## estimate  std.err
##     27.2      5.4

The ATE is positive and significant, indicating that not being offered the program led to a longer spell of unemployment. That is, the program appears on average to be beneficial.

# Retrieve out-of-bag CATE estimates

tau.hat = predict(cs.forest)$predictions

summary(tau.hat)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     -80       0      24      27      51     163

If we look at the distribution of the CATE estimates we see that most point estimates for $\hat{\tau}(X_{i})$ are large and positive, suggesting the program offering reduces the unemployment duration, and that the magnitude of the effect varies across individuals.

The numbers above are noisy non-parametric point estimates, and by themselves don’t necessarily yield insight into robust signs of treatment effect heterogeneity. grf comes with various functionality for analyzing heterogeneity based on what questions we are interested in answering. We can conduct inference on a “pseudo”-parameter of the CATE with the function best_linear_projection which gives a doubly robust estimate of the linear model $\tau(X_{i})=\beta X_{i}+\varepsilon_{i}$ , where $X_{i}$ are some chosen covariates.

best_linear_projection(cs.forest, X)

##
## Best linear projection of the conditional average treatment effect.
## Confidence intervals are cluster- and heteroskedasticity-robust (HC3):
##
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)           44.359     19.380    2.29    0.022 *
## age                   -0.623      0.563   -1.11    0.268
## high.school.diploma    4.118     10.803    0.38    0.703
## race.white           -15.096     11.069   -1.36    0.173
## children               9.532     12.613    0.76    0.450
## married               29.184     14.620    2.00    0.046 *
## male                  -9.282     11.975   -0.78    0.438
## ---
## Signif. codes:  0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1

This analysis suggests that treatment heterogeneity is largely explained by marital status here (married participants benefit more); however, we caution that the statistical significance is not strong enough to withstand a multiple testing correction (e.g., a Bonferroni correction).

To get a single p-value with power to test whether there is any treatment heterogeneity, we can use the function rank_average_treatment_effect (RATE). This function tests whether the estimated CATE function $\hat{\tau}(\cdot)$ can find stable subgroups that benefit more from the treatment than others. This is operationalized using the Targeting Operator Characteristic (TOC) curve, which for each fraction $q$ , compares the average treatment effect of the top q-th fraction of units as prioritized by a rule $S(X_{i})$ to the overall average treatment effect.

\textrm{TOC}(q)=E[\textrm{min}(T_{i}(1),~{}h)-\textrm{min}(T_{i}(0),~{}h)\,|\,S(X_{i})\geq\textrm{q-th quantile}]-\textrm{ATE}

If we let $S(X_{i})$ equal the estimated CATEs, $S(X_{i})=\hat{\tau}(X_{i})$ , we get a TOC that quantifies the difference in ATE from the overall ATE among units predicted to have a large increase in unemployment duration.

# Form a doubly robust estimate of RATE using cs.forest

rate = rank_average_treatment_effect(cs.forest, tau.hat)

# Plot the TOC, along with 95 % confidence bars

plot(rate,

main = "TOC: By decreasing CATE estimates"

)

The plot indicates there are signs of heterogeneity, for example, the $q=20$ percent of units with the largest estimated effect (as predicted by $\hat{\tau}$ ) have an unemployment duration that is around on average 30 days longer than the overall ATE of 27 days.

We can translate this visual stratification exercise into a point estimate by forming an estimate of the area under this curve. A 95 % confidence interval for the area under the TOC does not cover 0, which suggests that causal forests are in fact detecting real treatment heterogeneity here:

paste("AUTOC:", round(rate$estimate, 2), "+/-", round(1.96 * rate$std.err, 2))

## [1] "AUTOC: 13.87 +/- 11.58"

Given an estimated function $\hat{\tau}(\cdot)$ that appears to detect meaningful heterogeneity, there are a variety of ways we can probe further into how this heterogeneity can be described using available covariates $X_{i}$ . One simple option is to examine the distributions of covariates across subregions defined in terms of quantiles of the estimated CATEs. Here we simply conclude with a brief example of showing the covariate means for the full sample, and the covariate mean for the 20% of units with the largest predicted decrease in unemployment duration, which appears to indicate there are a larger fraction of parents and married people in the group that is predicted to have a large effect.

q80 = quantile(tau.hat, 0.8)

rbind(

full.sample = apply(X, 2, mean),

top.20 = apply(X[tau.hat >= q80, ], 2, mean)

)

##             age high.school.diploma race.white children married male
## full.sample  29                0.48       0.56     0.49    0.23 0.43
## top.20       27                0.45       0.50     0.63    0.38 0.41

For more details on the RATE metric, as well as strategies to avoid over-fitting based on train/test splits, see Yadlowsky et al. (2021), as well as the online package documentation at https://grf-labs.github.io/grf/.

References

Susan Athey, Julie Tibshirani, and Stefan Wager. “Generalized random forests.” The Annals of Statistics, 47(2), 2019.

Leo Breiman. “Random forests.” Machine learning, 45, 2001.

Gilles Crommen, Jad Beyhum, and Ingrid van Keilegom. “An instrumental variable approach under dependent censoring.” TEST, 2023.

Yifan Cui, Michael R. Kosorok, Erik Sverdrup, Stefan Wager, and Ruoqing Zhu. “Estimating heterogeneous treatment effects with right-censored data via causal survival forests.” Journal of the Royal Statistical Society: Series B, 85(2), 2023.

Brigham R. Frandsen. “Treatment effects with censoring and endogeneity.” Journal of the American Statistical Association, 110(512), 2015.

Julie Tibshirani, Susan Athey, Rina Friedberg, Vitor Hadad, David Hirshberg, Luke Miner, Erik Sverdrup, Stefan Wager, and Marvin Wright. “grf: Generalized random forests.”, 2024. URL https://github.com/grf-labs/grf. R package version 2.3.2.

Steve Yadlowsky, Scott Fleming, Nigam Shah, Emma Brunskill, and Stefan Wager. “Evaluating treatment prioritization rules via rank-weighted average treatment effects.” arXiv preprint arXiv:2111.07966, 2021.