Estimating Treatment Effect Heterogeneity in Psychiatry:
A Review and Tutorial with Causal Forests
Abstract
Flexible machine learning tools are increasingly used to estimate heterogeneous treatment effects. This paper gives an accessible tutorial demonstrating the use of the causal forest algorithm, available in the R package grf. We start with a brief non-technical overview of treatment effect estimation methods, focusing on estimation in observational studies; the same techniques can also be applied in experimental studies. We then discuss the logic of estimating heterogeneous effects using the extension of the random forest algorithm implemented in grf. Finally, we illustrate causal forest by conducting a secondary analysis on the extent to which individual differences in resilience to high combat stress can be measured among US Army soldiers deploying to Afghanistan based on information about these soldiers available prior to deployment. We illustrate simple and interpretable exercises for model selection and evaluation, including targeting operator characteristics curves, Qini curves, area-under-the-curve summaries, and best linear projections. A replication script with simulated data is available at github.com/grf-labs/grf/tree/master/experiments/ijmpr.
Keywords: causal inference, treatment heterogeneity, machine learning
1 Introduction
An important question in psychiatry, as well as in other scientific disciplines that evaluate effects of interventions on the health and well-being of individuals, is the extent to which the treatment effects estimated in typical experimental evaluations differ across individuals (Angus and Chang, 2021). This is an important question because we have long known that the effects of interventions may vary across different segments of the population (Kravitz et al., 2004). Because of these differences, some interventions that are estimated not to be effective in the entire population based on the typical assumption of constant treatment effects are nonetheless effective in some important segments of the population, whereas some interventions found to be effective in the entire population are either not effective or in some cases even harmful in important segments of the population (Kent et al., 2016). In addition, when multiple interventions are available, comparative effectiveness can differ across individuals and population segments (Velentgas et al., 2013).
Research on heterogeneity of treatment effects (HTE), as this variation in intervention effects has come to be known, has proliferated in recent years in psychiatry (e.g., Feczko et al., 2019; Allsopp et al., 2019; Kaiser et al., 2022; Cohen and DeRubeis, 2018) and many other disciplines. The basic approach is to search for significant variation in treatment effects across subsamples. This is most easily done by estimating interactions between hypothesized effect modifiers assessed prior to randomization and the random treatment assignment. When the interaction is non-zero, HTE is said to exist with respect to the specified variable, meaning that the treatment effect differs depending on the value of the variable.
Numerous small clinical trials have reported HTE detections of this sort with respect to such fundamental baseline variables as patient age, sex, education, and symptom severity (e.g., Cuijpers et al., 2014; Driessen et al., 2010). But a question arises in trying to synthesize all this information when numerous significant interactions of this sort emerge, as meta-analyses show that they do (e.g., Maj et al., 2020): what is the best way to combine the information about all these specifiers into a single composite? A commonly used method is to apply the potential outcomes framework (Imbens and Rubin, 2015) in analyzing the results of a moderated multiple regression model that includes multiple interactions. Counterfactual logic is used here to generate two predicted outcome scores from this model for each participant in the study, the first assuming that the participant was assigned to the intervention group and the second assuming that the participant was assigned to the control group (DeRubeis et al., 2014). The two within-person predicted outcome scores are then compared at the individual level and used as an estimate of the treatment effect for an individual based on the multivariate specifier profile of that individual. The distribution of these difference scores is then examined to investigate the pattern of significant individual differences in the treatment effect.
A drawback of this approach is that in trying out many different regression models, applications of the procedure that do not account for multiple testing may lead to false positives (van Klaveren et al., 2019). This problem can be addressed, though, using data-driven machine learning methods that search for interactions with cross-validation to address the problem of over-fitting (VanderWeele et al., 2019; Wager and Athey, 2018). The use of data-driven methods in this way also addresses the problem that the moderated regression approach only allows for a limited range of interactions, typically two-way, whereas HTE can involve much more complex forms of interaction. Machine learning methods, in comparison, allow these more complex forms of interaction to be discovered using non-parametric methods. A large body of work has made tremendous progress in adapting a variety of machine learning algorithms to estimate HTE. Examples include Athey and Imbens (2016); Athey et al. (2019); Hahn et al. (2020); Kennedy (2023); Künzel et al. (2019); Luedtke and van der Laan (2016); Nie and Wager (2021); Tian et al. (2014).
2 Estimating treatment effects in non-experimental settings
To communicate statements about causal effects it is helpful to employ a universal language with well-defined constructs. We employ the potential outcomes framework (Imbens and Rubin, 2015), where we posit the existence of potential outcomes $Y_i(0)$ and $Y_i(1)$ that record the hypothetical clinical outcome for patient $i$ in either treatment state. We typically denote the control state by $W_i = 0$ and the treatment state by $W_i = 1$. Naturally, we only observe one potential outcome for any single patient: the outcome corresponding to the treatment arm she was assigned to. Table 1 shows an example of hypothetical potential outcomes along with realized outcomes.
Table 1: Hypothetical potential outcomes, treatment assignments, and realized outcomes.

| Patient | Treatment assignment $W_i$ | $Y_i(1)$ | $Y_i(0)$ | Realized outcome $Y_i$ |
|---|---|---|---|---|
| 1 | 1 (Treated) | 0 (Sick) | 0 (Sick) | 0 (Sick) |
| 2 | 0 (Control) | 1 (Healthy) | 0 (Sick) | 0 (Sick) |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| $n$ | 1 (Treated) | 1 (Healthy) | 0 (Sick) | 1 (Healthy) |
The assumption that our column of realized outcomes is consistent with the columns of potential outcomes, $Y_i = Y_i(W_i)$, is called the Stable Unit Treatment Value Assumption (SUTVA). In some complicated social network settings, this assumption may not be viable. Consider, for example, if patient A is given behavioral therapy and patient B is not. If patient A interacts with patient B in some social setting, such as a support group, then A's newly acquired behavioral knowledge may influence B. Here, however, we rely on SUTVA and thus rule out scenarios like these by assuming no interference or spillover between patients.
The type of causal effects we are interested in quantifying are average differences in potential outcomes. Given some hypothetical target population, the average treatment effect (ATE) is

$$\tau = \mathbb{E}\big[Y_i(1) - Y_i(0)\big],$$

where $\mathbb{E}[\cdot]$ denotes an average taken over the population. Following the setup in Table 1, this would quantify, for example, the increase in healthy prevalence when assigning every patient the intervention, as opposed to withholding the intervention from everyone.
Even though for any single patient we never observe the difference $Y_i(1) - Y_i(0)$, we still have enough information to identify the average difference. If we collect a representative sample from our population and then randomly assign an intervention to some units in the sample while withholding it from others, then the simple difference in means (i.e., the mean prevalence among treated patients minus the mean prevalence among control patients) provides an unbiased estimate of the average treatment effect. The fact that the treatment assignment is randomized means that our hypothetical potential outcomes and treatment status are decoupled, which justifies this simple calculation.
The availability of such a randomized treatment assignment is typically constrained to randomized controlled trials. In many important settings, due to ethical and resource considerations, we don’t have access to a randomized intervention. A dataset where the intervention assignment is not randomized is often referred to as an observational dataset. As an example, via electronic health records, we could assemble a dataset that records an intervention, such as psychiatric hospitalization, together with an outcome, such as subsequent suicide, in a sample of patients presenting at an emergency department with suicidality. The decision whether to hospitalize such patients is clearly not randomized but will rather depend on attending physician and hospital protocols, patient severity and perceived risk of suicidal behavior, available outpatient alternatives to hospitalization (e.g., day treatment programs), and availability of hospital beds. Some of these determinants of hospitalization are also likely to be determinants of subsequent suicidal behavior, which means that a simple comparison of the rates of suicidal behavior in, say, six months after emergency department presentation cannot be interpreted as providing evidence of the effects of hospitalization on these behaviors. At the same time, constructing a randomized controlled trial in a setting of this sort poses ethical challenges.
We can, however, under suitable circumstances, make causal inferences from such data if we are willing to make certain assumptions about access to information about the determinants of treatment assignment that are also determinants of subsequent suicidal behavior in the absence of treatment assignment. The key assumption in this regard is known as the unconfoundedness assumption (Rosenbaum and Rubin, 1983), which posits that if we account for a set of observed patient characteristics $X_i$ known prior to intervention, then we can think of the treatment assignment as if it were randomized,

$$\{Y_i(0), Y_i(1)\} \perp W_i \mid X_i.$$
An important object under this setup is the propensity score,

$$e(x) = \mathbb{P}\big[W_i = 1 \mid X_i = x\big],$$

which is the probability that a patient with confounder variables equal to $x$ is assigned the treatment. To estimate causal effects, a necessary condition is that the association between baseline predictors and intervention assignment is not so strong that some individuals have zero probability of being assigned to one of the interventions under study. This is the overlap assumption, which requires that the propensity score is strictly above 0 and below 1 for every unit $i$: $0 < e(X_i) < 1$. For example, if 100% of the suicidal patients presenting at an emergency department with severe thought disturbance were hospitalized, it would be impossible to use observational methods to evaluate the effect of hospitalization in preventing subsequent suicidal behaviors among such patients. It is only in the subset of patients whose probability of hospitalization is less than 100% and greater than 0% that data-driven estimates of treatment effects can be made. Assessing this requirement empirically translates into ensuring that estimated propensity scores are not close to either zero or one, recognizing that estimation of treatment effects is impossible in the subset of patients whose scores are close to zero or one.
While the overlap assumption is testable, the conditional independence assumption under unconfoundedness is not, as it involves statements of independence between treatment assignment and potential outcomes we never jointly observe. This is usually referred to as an identifying assumption and is fundamentally untestable. Instead, the evaluation of identifying assumptions requires domain knowledge and subject-matter expertise to assess how credible they are in a particular application. In this tutorial, we treat this first step as done or known, i.e., the underlying causal graph (Pearl, 1995) is agreed upon. In the context of the example of psychiatric hospitalization as a predictor of subsequent suicidal behavior among suicidal emergency department patients, an example where this assumption is invoked is Ross et al. (2023).
To estimate average treatment effects in observational settings under unconfoundedness, we must account for how the different types of patients respond to and are assigned the treatment before making comparisons between average outcome scores. We can do this by estimating the propensity score. It is also possible to perform the confounding adjustment by estimating conditional outcome functions, or via a doubly robust combination of estimates of both the propensity score and the conditional outcome function. This is the approach grf implements in the function average_treatment_effect; see Ding (2023), Hernan and Robins (2023), and Wager (2024) for recent textbook treatments of these methods.
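To make this concrete, below is a minimal sketch in R of the workflow just described, using simulated data (all data-generating choices and variable names here are illustrative placeholders, not the paper's data):

```r
# Minimal sketch: doubly robust ATE estimation with grf under unconfoundedness.
library(grf)

n <- 2000; p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, plogis(X[, 1]))  # assignment depends on X[, 1]: confounding
Y <- rbinom(n, 1, plogis(X[, 2] + pmax(X[, 1], 0) * W))  # binary outcome, 1 = healthy

cf <- causal_forest(X, Y, W)  # first stage estimates e(x) and m(x) internally
average_treatment_effect(cf)  # AIPW point estimate and standard error
```

Here a simple difference in means would be biased because units with large X[, 1] are both more likely to be treated and benefit more; the doubly robust estimate adjusts for this.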
2.1 From ATE to HTE
A limitation of average treatment effects is that they can mask important individual differences. Consider again the example of hospitalization for suicidal patients presenting to an emergency department. Numerous observational studies that attempted to adjust for baseline confounders (Steeg et al., 2018; Carroll et al., 2016; Large and Kapur, 2018; Jones et al., 2008) and two small experiments (Large and Kapur, 2018; Jones et al., 2008) were all unable to detect a non-zero average treatment effect of hospitalization in preventing subsequent suicidal behavior. A subsequent observational study (Ross et al., 2023) nonetheless documented the presence of HTE by finding one subset of patients with significantly positive average treatment effects (i.e., these patients were helped by hospitalization) and a second subset of patients with significantly negative average treatment effects (i.e., these patients were harmed by hospitalization). These subsets were found by focusing on something known as conditional average treatment effects (CATEs).
CATEs are averages of the individual-level treatment effects (ITEs) $Y_i(1) - Y_i(0)$, which represent individual-level responses to treatment. If known, ITEs would directly provide a measure of HTE. Unfortunately, the ITE is not possible to estimate, given that we only get to observe one realized potential outcome for every patient $i$. An object that, in granularity, lies somewhere between the ITE and the ATE is the CATE,

$$\tau(x) = \mathbb{E}\big[Y_i(1) - Y_i(0) \mid X_i = x\big].$$
The CATE is an average treatment effect over patients whose characteristics $X_i$ equal $x$, where $X_i$ is a collection of pre-treatment numeric predictor variables (e.g., age, sex, blood pressure, past trauma, etc.). A certain patient might have predictor variables that look like

$$x = \big(\text{age} = 52,\ \text{sex} = \text{female},\ \text{blood pressure} = 130,\ \ldots\big).$$
The CATE is simply an average treatment effect that is conditional on the patient having a multivariate profile defined by this vector of characteristics. (To keep the exposition simple and limit notational sprawl, $X_i$ can refer to both confounders and treatment effect modifiers, although the two sets of variables do not have to be the same.)
This more granular object has practical appeal since we can formulate it as a function, denoted by $\tau(\cdot)$, that takes as input some patient characteristics given by $x$, then maps them to a treatment effect, as given by the ATE for the subset of patients with this profile. An individualized treatment rule could be to treat those patients where the treatment effect, as measured by the CATE, is sufficiently large. If only the CATE $\tau(x)$ is known, then thresholding on it is optimal (e.g., Bhattacharya and Dupas, 2012).
3 The causal forest approach to estimating CATE
Causal forests (Athey et al., 2019) build on Breiman's random forest algorithm (Breiman, 2001), a popular and empirically successful algorithm for prediction that allows for a mix of real- and discrete-valued predictors. Random forests have a strong empirical track record of doing well out-of-the-box in many applications, often with minimal tweaking of tuning parameters. An appealing aspect of random forests is that they essentially act as an algorithmic way of dividing data into subgroups and then making predictions based on which subgroup a patient belongs to. A core component of the random forest algorithm is to scan over each patient characteristic, then decide on a cut point that serves as a "good" candidate for a subgroup partition. The algorithm then aggregates these partitions into neighborhoods where patients with similar values of the predictor variables $X_i$ have similar values of the outcome $Y_i$.
At first glance, the algorithmic blueprint of random forests appears like a promising candidate for our task of discovering subgroups with different treatment effects, as we are looking for ways to partition a sample of participants in an intervention experiment according to observable characteristics. The primary difference between causal forests and Breiman’s random forests is that the objective of the original random forests is to partition the sample to optimize discrimination of individual differences in scores on an outcome, whereas our objective is to predict differential treatment response.
[Figure 1: Stylized illustration of searching for a single split on a candidate predictor (age): (a) treatment effect estimates by age; (b) a candidate partition separating ages above and below 25.]
The challenge in targeting treatment heterogeneity that does not arise in the basic prediction setting is that scores on an outcome variable are observed for each participant, whereas individual-level treatment effects are not observed. As we saw previously, though, treatment effect estimation with non-experimental data can be carried out by adjusting for baseline characteristics. We do this using a two-step process. In the first step, causal forests account for uneven treatment assignments via the propensity score $e(x)$, as well as baseline effects by estimating the conditional mean (marginalizing over treatment)

$$m(x) = \mathbb{E}\big[Y_i \mid X_i = x\big].$$
In this first step, baseline effects are essentially stripped away via something known as an orthogonal construction (Chernozhukov et al., 2018; Robinson, 1988). This construction works on the outcomes as well as on the treatment indicators to generate residualized outcomes $Y_i - \hat{m}(X_i)$ and residualized treatment indicators $W_i - \hat{e}(X_i)$ that have been adjusted for confounding effects. In the second step, these residuals are used to form treatment effect estimates, and a random forest is used to find predictor variable partitions where these treatment effect estimates exhibit heterogeneity. Figure 1 presents a stylized illustration of the underlying logic for this phase of the algorithm. For a candidate predictor variable, such as age, we seek a single axis-aligned split separating units into two groups with different average treatment effect estimates. Here, units with ages above 25 appear to benefit more, and a partition at this point would immediately increase heterogeneity, as measured by the vertical distance between the group-wise estimates in Figure 1(b). This exercise is repeated recursively for the left and right samples, considering all possible predictor variables (e.g., blood pressure) to partition the covariates into regions where treatment effects are expected to differ.
In the case where the analysis is applied to an experiment, where the probability of assignment to each intervention arm is known by construction, these probabilities can simply be supplied in the first step. However, in a non-experimental situation in which we are attempting to approximate estimates of treatment effects by adjusting for nonrandom exposure with respect to observed baseline covariates, the probabilities of treatment assignment are estimated in the first step with separate random forests.
It is possible to modify this random forest framework to accommodate many other relevant statistical quantities and settings, hence the name "generalized" random forests, grf. For example, Athey et al. (2019) use this general framework to construct an instrumental forest that estimates HTE in settings with an instrumental variable, Cui et al. (2023) use it to construct a causal survival forest for estimating HTE with right-censored survival outcomes, and Friedberg et al. (2020) construct a local linear forest. While these random forest algorithms are designed to directly target various causally relevant estimands, we emphasize that one still needs to exercise caution when interpreting flexible non-parametric point estimates of these quantities. To give some intuition, machine learning models serve as great complements to traditional parametric statistical models because they allow us to be model-agnostic. This agnostic property, however, comes at the cost of higher estimation uncertainty. So, our estimates of individual-level treatment effects come with considerable uncertainty, and in assessing the quality of these estimates, it is practical to move to a less granular summary measure. In the applied Section 4, we cover different approaches for assessing whether the individual-level estimates capture meaningful heterogeneity via calibration exercises. On a general note, the perspective we are taking is that machine learning tools can serve as effective algorithmic devices that target CATEs, which an analyst can then validate on held-out data to transparently convey potentially interesting scientific findings regarding treatment heterogeneity.
Finally, it is worth pointing out that the underlying identifying causal assumptions used in causal forests are the same as in classical parametric models, such as in the moderated regression approach described earlier. A main difference, however, is that whereas classical parametric models (e.g., linear and logistic regressions) specify particular predictors and functional forms of their associations with the outcomes prior to estimation, causal machine learning approaches like causal forest use flexible data-driven forest-based constructions. (grf also allows for more advanced use cases via user-specified estimates of the first-stage inputs $\hat{e}(x)$ and $\hat{m}(x)$, which can be derived from separate machine learning models, such as gradient boosting or neural networks, or through ensemble methods that combine multiple models (van der Laan et al., 2007).)
Remark 1.
In this context, an experimental setting is a special case of an observational setting with a known propensity score. The implication is that the only algorithmic difference between analyzing experimental and observational data is that in the experimental setting, we supply the known propensity score instead of estimating it (in causal_forest, this can be done by supplying the argument W.hat). The propensity score is then used in the same way, including in downstream analyses, such as in calculating the doubly robust average treatment effects discussed in Section 2 (statistical guarantees on average treatment effect estimation are then even stronger; see Wager (2024, Corollary 3.3) for details).
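Continuing the synthetic sketch from Section 2, a randomized experiment with a known 50% assignment probability could be handled as follows (W.rct and Y.rct are hypothetical simulated stand-ins for trial data):

```r
# Sketch: randomized assignment with known probability 0.5; supply it via
# W.hat instead of letting the forest estimate the propensity score.
W.rct <- rbinom(n, 1, 0.5)
Y.rct <- rbinom(n, 1, plogis(X[, 2] + pmax(X[, 1], 0) * W.rct))
cf.rct <- causal_forest(X, Y.rct, W.rct, W.hat = rep(0.5, n))
average_treatment_effect(cf.rct)
```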
4 Application: Estimating Resilience
We illustrate the approach by focusing on a recent application of grf to estimate differential resilience to the effects of combat stress in leading to post-traumatic stress disorder (PTSD) among US Army soldiers (Kessler et al., 2024). PTSD is the signature mental disorder of war (Paulson and Krippner, 2007). The roughly 7% of the US population made up of military personnel and veterans are estimated to account for nearly 20% of all cases of PTSD in the US, with an annual societal cost estimated at more than $230B (DeAngelis, 2023). However, great variation is thought to exist in the effects of combat stress on PTSD among service members (Karstoft et al., 2015; Schultebraucks et al., 2021). We illustrate the value of causal forests by investigating these effects in a secondary analysis of a sample of US Army soldiers in three combat brigades who participated in a self-report survey shortly before deploying and then again after returning from a combat deployment in Afghanistan in 2012-2014, as part of the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS; Ursano et al., 2014). Exposure to combat stress was assessed in the follow-up surveys administered to this sample after they returned from deployment. As expected, high combat stress exposure was found to be associated significantly with elevated prevalence of current PTSD as assessed in the follow-up surveys (Kessler et al., 2024).
The effects of exposure to high combat stress can be approximated by defining a dichotomous measure of the extent of combat stress exposure in the sample who had combat arms occupations (e.g., infantry, artillery). The extent to which information about individual differences in both baseline survey reports (including reports about history of and recent symptoms of PTSD) and baseline administrative data predicted this variation in exposure to high combat stress can then be investigated. The net association of high combat stress with the subsequent occurrence of meeting diagnostic criteria for PTSD in the follow-up survey can then be examined to estimate ATE as well as to inspect the distribution of estimated CATE. More details on the dataset (Papini et al., 2023) and a more extensive substantive analysis of the data (Kessler et al., 2024) are presented elsewhere.
In order to use methods for CATE estimation to capture heterogeneity in resilience, we need to somewhat reconceptualize the notion of an "intervention": the intervention here, service member combat deployment, is not a therapeutic intervention as is often considered in medical settings, but it is nonetheless one that can be manipulated in meaningful ways. Only a fraction of all military personnel in a unit (often in the range 40-60%) are assigned to deploy, with the remainder held in reserve (referred to by the military as the "rear detachment"). Commanders take a wide range of factors into consideration in deciding which soldiers to deploy and which to hold in reserve. Estimated mental fitness based on supervisor assessment is one of these considerations. A CATE estimate based on a comprehensive assessment of administrative data and an optimized algorithm based on causal forests could be a useful addition to such assessments, either to help commanders decide which members of their units to deploy and which to keep in the rear detachment or, in cases where it is necessary to deploy a non-resilient service member based on other considerations, to help target special resilience-enhancing resources (e.g., special pre-deployment resilience training programs (Thompson and Dobbins, 2018)).
Considering deployment with few traumatic events ("low combat stress") as a treatment arm ($W_i = 1$) and deployment with "high combat stress" as a control arm ($W_i = 0$), we can construct a causally motivated empirical measure of resilience using the language of potential outcomes. Table 2 shows an example of potential outcome configurations, or principal strata, describing the outcomes a unit would experience in both treatment states. These principal strata are not observable (because, like the ITE, they depend on both potential outcomes); however, they are still useful for interpretation (Frangakis and Rubin, 2002). The type of soldier who is healthy regardless of combat stress exposure can be deemed to have high resilience, whereas the type of soldier who develops PTSD under high combat stress but is healthy under low combat stress can be deemed to have low resilience. Finally, the type of soldier who develops PTSD regardless of combat stress exposure can be termed high risk. We assume that there are no soldiers who would be healthy under high stress and sick under low stress; this assumption is mathematically related to the widely used "no defiers" assumption in clinical trials with non-compliance (Angrist et al., 1996).
Table 2: Principal strata defined by potential outcomes under low and high combat stress.

| | High combat stress: Healthy | High combat stress: PTSD |
|---|---|---|
| Low combat stress: Healthy | "High resilience" | "Low resilience" |
| Low combat stress: PTSD | (ruled out by assumption) | "High risk" |
We record realized healthy outcomes by 1 and PTSD by 0,

$$Y_i = \begin{cases} 1 & \text{healthy}, \\ 0 & \text{PTSD}, \end{cases}$$

and so our average treatment effect

$$\tau = \mathbb{E}\big[Y_i(1) - Y_i(0)\big]$$

measures the counterfactual fraction of healthy soldiers when no one experiences high combat stress minus the counterfactual fraction of healthy soldiers when all soldiers experience high combat stress, which we expect to be larger than zero. The CATE,

$$\tau(x) = \mathbb{E}\big[Y_i(1) - Y_i(0) \mid X_i = x\big],$$

then captures heterogeneity in resilience.
This type of taxonomy gives us two approaches to measuring resilience: a causal measure based on CATEs (low vs. high resilience) and a predictive measure based on risk (high vs. low risk). How these two approaches differ in practice is an empirical question we address in Section 4.6.
Illustration code.
Illustration code for the R language (R Core Team, 2024) to conduct the analysis in this section with synthetic example data, using the package grf (Tibshirani et al., 2024) along with maq (Sverdrup et al., 2024) and ggplot2 (Wickham, 2016), is available on GitHub at github.com/grf-labs/grf/tree/master/experiments/ijmpr.
4.1 An Exploratory Analysis
We consider 2,542 US soldiers with combat service, of whom 1,542 experienced low combat stress and 1,000 experienced high combat stress (for details on the sample construction, including details on treatment and outcome definitions, we refer to Kessler et al. (2024)). In the high combat stress group, 96 soldiers developed PTSD post-deployment, while in the low combat stress group, 36 soldiers developed PTSD. The simple difference in means (i.e., the proportion of healthy soldiers in the low-stress group minus the proportion of healthy soldiers in the high-stress group) is 7.4% (standard error of 0.9%) and yields a measure of the raw association between combat stress exposure and PTSD. This is, however, not necessarily a valid estimate of the ATE, as combat stress exposure during deployment might not be completely random.
In order to mitigate confounding, we control for a set of pre-treatment characteristics that includes variables describing individual disorders, suicidality measures, past trauma experiences, past stressors, medical treatments, personality scores, army-specific employment details, demographics, and details on traumatic brain injuries, for a total of 410 variables. If we are willing to maintain an unconfoundedness assumption, then we can use these variables to account for the different proportions of soldiers with, for example, past trauma experiences in the two combat groups. Using a causal forest, a doubly robust estimate that adjusts for these pre-treatment characteristics gives an average treatment effect of 6.3% (standard error of 1%). grf forms this estimate via forest-based Augmented Inverse-Propensity Weighting (AIPW; Robins et al., 1994) using the causal forest first-stage estimates of the propensity score $\hat{e}(x)$ and conditional mean $\hat{m}(x)$.
[Figure 2: Histogram of the forest-estimated propensity scores.]
A necessary condition for this to be a viable strategy for estimating causal effects is that the propensity score as a function of these variables is bounded away from zero and one, i.e., we have treatment overlap. Figure 2 shows a histogram of the forest-estimated propensity scores and indicates that the overlap assumption is plausible: most of the propensity scores fall within the range 0.5-0.7, that is, along our observable adjustment variables the different soldiers are more or less equally likely to be in either treatment group.
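In code, this diagnostic amounts to inspecting the fitted forest's propensity estimates (continuing the synthetic sketch from Section 2; with the real data this is where a plot like Figure 2 comes from):

```r
# Sketch: overlap diagnostic. Estimated propensities piling up near 0 or 1
# would flag units for whom treatment effects cannot be estimated.
hist(cf$W.hat, xlab = "Estimated propensity score", main = "Overlap check")
```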
4.2 Treatment Effect Heterogeneity
The average-case analysis above established that there are a substantial number of low-resilience soldiers in the sample. However, there might be between-individual variation in the effects of high combat stress, such that the prevalence of PTSD could be reduced by taking this individual variation into account. In order to take a data-driven approach to discovering potential subgroups of soldiers that respond differently to combat stress, we divide the dataset into two random groups. In the training set (60% of the data), we fit a CATE estimator, and on a held-out test set (40% of the data), we evaluate these CATE estimates (the exact train/test ratio is more or less a heuristic; we prefer to have slightly more data for training than evaluation in this small dataset).
On the training data, we fit a causal forest using default settings for hyperparameters and obtain an estimated CATE function $\hat{\tau}(\cdot)$. These default settings often work well, though with grf it is possible to select parameters with automated tuning that minimizes the R-loss criterion of Nie and Wager (2021). Figure 3(a) shows a histogram of the CATE estimates for units in the test sample. The figure suggests that there are soldiers who would benefit meaningfully more than average from not experiencing high combat stress. There also appear to be other soldiers who would be much less affected than average by experiencing high combat stress. However, this histogram is made up of individual point estimates that are inherently noisy. To assess the statistical significance of this variation in CATE estimates, we need to use our test set. We do this by using the model developed in the training sample to predict CATEs in the independent test set. We then group soldiers in the test set into buckets according to which quartile of the predicted CATEs they belong to. Finally, we estimate average treatment effects in each bucket via AIPW (using a separate test set causal forest). Figure 3(b) shows an estimate of the average treatment effect over these quartiles. The results suggest that treatment effect heterogeneity is statistically significant and substantively meaningful. In the quartile of the sample with the lowest predicted CATEs, the average treatment effect is close to zero, indicating that exposure to high combat stress would be expected to have no significant effect in increasing risk of PTSD. In the quartile with the highest predicted CATEs, in comparison, the average treatment effect is large and statistically significant, indicating that exposure to high combat stress would have a substantial effect in increasing risk of PTSD. The estimated average treatment effects in the two intermediate quartiles fall between the extremes found in the low and high quartiles.
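The following sketch outlines this train/test workflow on the synthetic data from Section 2 (the quartile labels and the separate evaluation forest mirror the logic behind Figure 3(b)):

```r
# Sketch: fit a CATE model on a training split, then evaluate quartile-level
# ATEs on held-out data using a separate causal forest.
train <- sample(1:n, 0.6 * n)
test <- setdiff(1:n, train)

cf.train <- causal_forest(X[train, ], Y[train], W[train])
tau.hat.test <- predict(cf.train, X[test, ])$predictions

cf.eval <- causal_forest(X[test, ], Y[test], W[test])  # evaluation forest
quartile <- cut(tau.hat.test,
                quantile(tau.hat.test, seq(0, 1, 0.25)),
                include.lowest = TRUE, labels = 1:4)
for (q in 1:4) {
  print(average_treatment_effect(cf.eval, subset = quartile == q))
}
```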
[Figure 3: (a) Histogram of estimated CATEs in the test set; (b) AIPW-estimated average treatment effects by quartile of predicted CATEs.]
A $p$-value for heterogeneity.
It is noteworthy that we chose to divide the test sample into quartiles because it seemed reasonable given the relatively small sample size, although other perfectly reasonable choices would have been tertiles, quintiles, or, in larger samples, perhaps deciles. The Targeting Operator Characteristic (TOC) curve (Yadlowsky et al., 2024) provides an agnostic version of the visualization in Figure 3(b) that does not require pre-specifying the number of groups. This is achieved by ranking soldiers according to predicted treatment effects and computing the average treatment effect for the top $q$-th fraction of units in this ranking. This can then be compared to the overall average treatment effect in the following way:

$$\mathrm{TOC}(q) = \mathbb{E}\big[Y_i(1) - Y_i(0) \mid \hat{\tau}(X_i) \geq F_{\hat{\tau}}^{-1}(1 - q)\big] - \mathbb{E}\big[Y_i(1) - Y_i(0)\big],$$

where $F_{\hat{\tau}}$ denotes the distribution function of the CATE predictions $\hat{\tau}(X_i)$. As we are subtracting the overall ATE, this curve will end at 0 for $q = 1$.
[Figure 4: Estimated TOC curve for the training set CATE predictions, with pointwise confidence intervals.]
Figure 4 shows the estimated TOC curve for the training set predicted CATEs and indicates that there are signs of treatment effect heterogeneity. For example, the top 20% of soldiers with the largest predicted treatment effects have an ATE that is around 11 percentage points higher than the overall ATE (with a standard error of 5%). That is, the group of soldiers belonging to the top quintile of predicted treatment effects had low resilience at an 11% higher rate than average. The confidence intervals are wider at lower quantiles than at higher quantiles since the number of observations is smaller. In this setting, the confidence intervals at the lowest quantiles of the TOC curve cover zero, highlighting that there is considerable estimation uncertainty for this smaller group. We can translate the visual stratification exercise underlying the TOC curve into a single point estimate by forming an estimate of the area under this curve.
An estimate of the area under the TOC ("AUTOC") is 0.063 with a bootstrapped standard error of 0.028. The AUTOC satisfies a central limit theorem (Yadlowsky et al., 2024), so we can motivate a test for heterogeneity with

$$H_0: \mathrm{AUTOC} = 0 \quad \text{vs.} \quad H_1: \mathrm{AUTOC} \neq 0,$$

using the $t$-value $0.063 / 0.028 \approx 2.25$. An asymptotically valid $p$-value for this two-sided test is $P(|Z| > 2.25) \approx 0.02$ (where $Z$ is a standard normal random variable). We reject at a significance level of $\alpha = 0.05$, which suggests that we have successfully identified non-zero treatment heterogeneity using our CATE estimate.
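In grf, this TOC/AUTOC exercise is available through the function rank_average_treatment_effect; a sketch on the synthetic train/test objects from above:

```r
# Sketch: TOC curve and AUTOC on the held-out evaluation forest, using the
# training-set CATE predictions as the prioritization rule.
rate <- rank_average_treatment_effect(cf.eval, tau.hat.test)
plot(rate)                              # TOC curve with confidence intervals
t.stat <- rate$estimate / rate$std.err  # test of H0: AUTOC = 0
2 * pnorm(-abs(t.stat))                 # two-sided p-value
```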
Beyond serving as a valuable test for heterogeneity, the AUTOC can also assist in model selection. For instance, when comparing candidate predictor models (e.g., two or more alternative CATE models), the model with the highest AUTOC is the one that most effectively stratifies the test set sample. As mentioned in Section 3, causal forests are a two-step algorithm where the first step accounts for confounding with respect to some specified baseline predictor variables via an estimated propensity score $\hat{e}(x)$ as well as baseline effects via conditional mean estimates $\hat{m}(x)$, and the second step fits a CATE function based on some set of baseline predictor variables. In this application, compared to our sample size, the set of predictor variables considered was large (410), so we began the analysis by using a simple pre-screening heuristic to reduce the number of variables and help our forest capture relevant heterogeneity (Athey and Wager, 2019): for the CATE model, we only used predictor variables that, in the conditional mean model, had variable importance in the top 25th percentile. The AUTOC for this restricted CATE model was very similar to that for the full predictor set, indicating that this subset of predictors captured parsimoniously the meaningful heterogeneity associated with the full predictor set. Using the AUTOC for model evaluation in this way is closely related to evaluating binary classifiers with the ROC curve in a predictive analysis. Section 4.6 gives one more example of using the AUTOC for evaluating treatment targeting strategies.
Remark 2.
grf uses a simple assessment of variable importance, first discussed by Breiman (2001), that counts how often a particular variable is used for recursive partitioning throughout the forest. For a broader discussion of alternative variable importance measures with forests, see Bénard and Josse (2023) and Hastie et al. (2009).
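A sketch of how this importance measure can be pulled from the training forest in the running synthetic example:

```r
# Sketch: split-frequency variable importance for a fitted causal forest.
vi <- variable_importance(cf.train)
head(order(vi, decreasing = TRUE))  # indices of the most split-on predictors
```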
4.3 Evaluating treatment rules with Qini curves
The exercise and plots in the preceding section help visualize, and summarize in a single point estimate, the evidence of heterogeneity captured by our estimated CATE function. There are numerous use cases for this function. As noted above, one important case would be to use this type of formulation to help decide which soldiers either to hold back from deployment or to provide additional resilience training prior to deployment. A treatment allocation policy of this sort could take many forms. It could, for example, target some fixed proportion of soldiers with the largest predicted effects and then look at what overall reduction in PTSD might be expected. The Qini curve is a simple visualization that plots the overall expected reduction in aggregate PTSD cases obtained by intervening in this way, over all possible thresholds for the number of soldiers targeted.
[Figure 5: Qini curve for CATE-based targeting, with a random-targeting baseline (dashed line).]
Figure 5 shows the Qini curve for this case study based on our estimated CATE function for our sample of 2,542 soldiers. The straight dashed line in Figure 5 is a non-targeting baseline policy that shows the reduction in total PTSD cases we could expect by intervening on various numbers of randomly chosen soldiers. The far right of this curve, where all 2,542 soldiers are held back, equals the total number of PTSD cases avoided if no one is deployed. The solid black curve shows results that can be achieved by using our CATE estimate to target the intervention. For example, if we use causal forests to choose the 500 soldiers predicted to benefit most, we estimate an overall reduction in PTSD of around 90 cases (standard error of 25). To achieve the same benefit by intervening randomly, we would have to hold back 1,300 soldiers.
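The underlying cumulative-gain logic can be sketched directly from doubly robust scores (the paper's curves are produced with the maq package, which also provides confidence intervals; this simplified version only illustrates the construction):

```r
# Sketch: Qini-style cumulative gain curve from AIPW scores, ranking units by
# predicted CATE and accumulating their estimated benefits.
dr.scores <- get_scores(cf.eval)   # doubly robust scores: noisy ITE estimates
ord <- order(tau.hat.test, decreasing = TRUE)
gain <- cumsum(dr.scores[ord])     # expected benefit of targeting the top-k units
plot(seq_along(gain), gain, type = "l",
     xlab = "Number of units targeted", ylab = "Expected cumulative gain")
abline(0, mean(dr.scores), lty = 2)  # random-targeting baseline
```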
4.4 Describing estimated heterogeneity via covariate profiles
A reasonable question to ask based on the material described thus far is how soldiers in groups that benefit more vs. less from treatment differ along observable pre-treatment characteristics. A deeper question underlying this initial question is whether meaningful psychiatric risk profiles emerge from such an investigation. Such questions can be addressed by zooming in on individuals in the lower and higher quartiles of Figure 3(b) and looking at the distribution of some candidate predictor variables. Given that, as noted above, the predictor set is relatively large, it is convenient to begin by inspecting the causal forest variable importance measure described above. (Had we not had access to this particular variable importance measure, perhaps because we used a method other than causal forest, another simple heuristic to narrow down which variables to look at could be, for example, to choose the ones where the standardized differences in means over the high/low subgroups are more than, say, one standard deviation.) When this was done, we found that a handful of baseline 30-day symptom measures had very high importance rankings. As an exploratory step, we look at the top six such variables.
[Figure 6: Histograms of the six highest-importance baseline 30-day symptom measures, for soldiers in the bottom and top quartiles of predicted CATEs.]
Figure 6 shows histograms of these variables for the groups of soldiers that belong to either the bottom or the top test set quartile of predicted CATEs. We can see clear evidence that the distributions differ. This handful of variables captures an assortment of health measures (anxiety, PTSD symptoms, neurosis, etc.) and indicates that soldiers who benefit from being held back from deployment score higher on these indices (i.e., they have more severe 30-day health measures than the soldiers in the bottom quartile). At a high level, these covariate differences confirm an intuition that soldiers at highest risk of PTSD already had poor mental health prior to deployment.
4.5 Inference on effect modifiers via best linear projections
The exploratory heterogeneity exercise in Section 4.4 can be very useful for getting insight into how our estimated CATE function ended up stratifying our data in terms of various covariate profiles. In order to relate candidate predictor variables to exact statistical quantities, such as average effects, we need other approaches. One appealing approach is via a linear summary measure of the CATE. The best linear projection is the optimal linear prediction of the CATE as a function of some chosen variable(s) $A_i$,

$$\tau(X_i) \sim \beta_0 + A_i^\top \beta. \tag{1}$$

Formally, we have $(\beta_0, \beta) = \operatorname{argmin}_{b_0,\, b} \, \mathbb{E}\big[(\tau(X_i) - b_0 - A_i^\top b)^2\big]$. This is not a standard linear regression problem as $\tau(X_i)$ is not observable. However, grf enables inference about $(\beta_0, \beta)$ via a built-in AIPW-style construction that has similar statistical properties to what one would expect of classical parametric summaries and can be used to justify exact confidence intervals (Semenova and Chernozhukov, 2021). In forming estimates like these, the typical considerations that arise in the familiar parametric setting apply concerning multiple testing and pre-specified hypotheses. Note that when we deploy a best linear projection we are not assuming that the data are linear; we are simply summarizing a non-linear object, the CATE, via a linear approximation.
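In grf this summary is available via best_linear_projection; a sketch on the synthetic evaluation data (the choice of the first two columns as candidate modifiers is arbitrary here):

```r
# Sketch: best linear projection of the CATE on chosen effect modifiers,
# with doubly robust coefficient estimates and standard errors.
A <- X[test, 1:2]
best_linear_projection(cf.eval, A)
```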
Table 3: Best linear projections of the CATE on the candidate effect modifiers, estimated separately (columns 1-4) and jointly (column 5).

| | (1) | (2) | (3) | (4) | (5) |
|---|---|---|---|---|---|
| (Intercept) | | | | | |
| z_anx_30d_sx_T0 | | | | | |
| z_neurophys_30d_T0 | | | | | |
| z_ptsd_30d_sx_T0 | | | | | |
| z_sum_30d_sx_T0 | | | | | |
Significance codes: *p < 0.05; **p < 0.01; ***p < 0.001.
We consider 4 potential effect modifiers $A_i$, the standardized baseline symptom measures listed in Table 3. Table 3 shows estimates of best linear projections, first via separate linear models (columns 1-4), then jointly (column 5). We can interpret the coefficients in this table just as we would with any other regression model. Recall that we defined our desirable outcome to be 1 if healthy, and that the causal effect we defined in Section 4 measures a counterfactual difference in PTSD prevalence. We can see that a one-unit increase in "z_anx_30d_sx_T0" (anxiety symptom score) is associated with a roughly 5 percentage point larger effect of high combat stress on PTSD, and similarly for the other variables. These health indicators are all likely to be correlated, as indicated by the last column of Table 3. Table 4 shows the empirical correlation matrix of these variables.
Table 4: Empirical correlation matrix of the candidate effect modifiers.

| | z_anx_30d_sx_T0 | z_neurophys_30d_T0 | z_ptsd_30d_sx_T0 | z_sum_30d_sx_T0 |
|---|---|---|---|---|
| z_anx_30d_sx_T0 | 1.00 | 0.56 | 0.48 | 0.85 |
| z_neurophys_30d_T0 | 0.56 | 1.00 | 0.42 | 0.78 |
| z_ptsd_30d_sx_T0 | 0.48 | 0.42 | 1.00 | 0.72 |
| z_sum_30d_sx_T0 | 0.85 | 0.78 | 0.72 | 1.00 |
4.6 Risk-based targeting and treatment rule model selection
Recall that in our conceptual resilience taxonomy in Table 2, we labeled soldiers who experience PTSD regardless of treatment state as "high risk". In our empirical setting, it seems plausible that resilience and risk-based classifications are overlapping categories. As we have seen in Figure 6, the type of people more affected by high combat stress appear to have high baseline prevalence of psychiatric problems. One reasonable risk model for our setting could be the following,

$$r(x) = \mathbb{P}\big[Y_i(0) = 0 \mid X_i = x\big],$$

which is the probability that a soldier with covariates $x$ develops PTSD when exposed to high combat stress. Using grf we can form a forest-based estimate $\hat{r}(x)$ of this model by fitting a regression forest on the subset of training set soldiers exposed to high combat stress.
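A sketch of this risk model on the synthetic data (recall that in the paper's coding $W_i = 0$ is the high combat stress arm and $Y_i = 0$ is PTSD; the subset and outcome recoding below mirror that convention):

```r
# Sketch: baseline risk model, fit only on control-arm (high stress) training units.
high.stress <- train[W[train] == 0]
rf.risk <- regression_forest(X[high.stress, ], 1 - Y[high.stress])
risk.hat.test <- predict(rf.risk, X[test, ])$predictions  # est. P(PTSD | x, high stress)
```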
Figure 7(a) shows a histogram of this estimated probability for soldiers in the held-out test sample (where we used all 410 pre-treatment characteristics as predictor variables). The question is: do these risk estimates perform well in targeting soldiers with different treatment benefits? Our intuition is that soldiers who have a large predicted risk of developing PTSD also have a large treatment benefit from intervention. Using the same quartile stratification exercise as in Figure 3(b), we could stratify the sample by risk quartiles and then estimate average treatment effects. However, while visually appealing, our end goal is to compare targeting strategies, and comparing quartile figures is not as convenient for this purpose. To this end, it turns out we can use the TOC construction to facilitate a side-by-side comparison.
Recall that the TOC curve we used to evaluate our estimated CATEs is defined by

$$\mathrm{TOC}(q) = \mathbb{E}\big[Y_i(1) - Y_i(0) \mid \hat{\tau}(X_i) \geq F_{\hat{\tau}}^{-1}(1 - q)\big] - \mathbb{E}\big[Y_i(1) - Y_i(0)\big],$$

where the key point to recognize is that it is the relative ranking of the estimated CATEs that makes up this curve. This means we can use this same construction to evaluate other intervention strategies as long as they imply a ranking/prioritization. Using our estimated risk model to prioritize soldiers with high probabilities of PTSD, we get the TOC

$$\mathrm{TOC}(q) = \mathbb{E}\big[Y_i(1) - Y_i(0) \mid \hat{r}(X_i) \geq F_{\hat{r}}^{-1}(1 - q)\big] - \mathbb{E}\big[Y_i(1) - Y_i(0)\big].$$
Figure 7(b) compares the TOC curve using risk predictions with the TOC curve using our estimated CATEs and reveals that they perform very similarly. The area under the curve using risk-based predictions is close to that of the CATE-based predictions: 0.065 vs. 0.063. We can construct a test of whether these two targeting strategies perform the same by estimating the difference in the areas under the curves and calculating a $p$-value (or, equivalently, an $\alpha$-level confidence interval). The resulting $p$-value is 0.8, suggesting that these two targeting strategies are statistically indistinguishable. The implication is that for our intended application, risk-based targeting may be good enough.
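grf can estimate this difference in AUTOCs directly by passing both prioritization rules to rank_average_treatment_effect (a sketch, continuing from the objects defined above; the paired comparison assumes a recent grf version that accepts two priority columns):

```r
# Sketch: paired comparison of two targeting rules on the same evaluation data.
rate.diff <- rank_average_treatment_effect(
  cf.eval,
  cbind(cate = tau.hat.test, risk = risk.hat.test))
rate.diff  # AUTOC for each rule plus an estimate of their difference
```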
[Figure 7: (a) Histogram of estimated risk $\hat{r}(x)$ in the test set; (b) TOC curves for risk-based vs. CATE-based targeting.]
Finally, as another example of a use case for the TOC curve, recall that Figure 6 reveals that a handful of 30-day health symptom scores appear to differ markedly for soldiers with different treatment benefits. These variables are all continuous-valued and, as noted in Table 4, highly correlated with each other. We can use the TOC to test whether a simple treatment prioritization using one of these variables does well in classifying soldiers on differential effects of high combat stress exposure on PTSD. If we pick the variable "standardized sum of 30-day health symptoms", we get a TOC that ranks on only that single predictor. Doing so, we get an AUTOC that is similar to both the risk-based and CATE-based ones, 0.068 (standard error 0.024), and that is not distinguishable from either if we compare the difference in AUTOCs.
Remark 3.
We estimated a risk model directly using a random forest with all available predictor variables. In some cases, it may be possible to get better performance by first pre-screening pre-treatment characteristics using general variable selection tools, such as the lasso. For a practical guide to developing risk models in clinical applications, see Steyerberg (2019).
5 Discussion
We have surveyed how causal forests can be used to generate both principled estimates of ATE in observational studies and to detect treatment heterogeneity in experimental or observational studies. For more details on grf, including additional functionality such as missing values support, tree-based policy learning, loss-to-follow-up corrections, and more, we refer to vignettes and journal references on the grf website at grf-labs.github.io/grf.
We have also outlined a general framework for analyzing and comparing output from causal machine learning algorithms on a level playing field with tools such as TOC and Qini curves. As noted in the introduction, there are a number of alternative machine learning approaches for CATE estimation that are popular in practice (e.g., Athey and Imbens, 2016; Hahn et al., 2020; Kennedy, 2023; Künzel et al., 2019; Luedtke and van der Laan, 2016; Nie and Wager, 2021; Tian et al., 2014). Customizable Python libraries that implement several of these methods include causalML (Chen et al., 2020) and econML (Battocchi et al., 2019), and the software toolkit for machine learning-based causal inference is still growing rapidly. In practice, it can sometimes be a good idea to try multiple machine learning-based approaches to CATE estimation; the TOC, Qini curves, and related tools can then be used to compare and benchmark their performance.
Acknowledgments
We are grateful to the STARRS-LS collaborators for the use of their survey data and to Ron Kessler for providing invaluable feedback and advice.
Author Contributions: Petukhova had full access to all study data and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Concept and design: Sverdrup, Wager.
Acquisition, analysis, or interpretation of data: All authors.
Drafting of the manuscript: Sverdrup, Petukhova.
Critical revision of the manuscript for important intellectual content: All authors.
Statistical analysis: Petukhova.
Obtained funding: NA.
Administrative, technical, or material support: Sverdrup.
Conflict of Interest Disclosures:
No disclosures were reported.
Ethical Considerations:
The study protocol was approved by the Research Ethics Committees of the Henry Jackson Foundation and Harvard Medical School (IRB15-0765) with a waiver of informed consent based on data being de-identified. Research has been performed in accordance with the Declaration of Helsinki.
Funding/Sponsor: Army STARRS was sponsored by the Department of the Army and funded under cooperative agreement number U01MH087981 (2009-2015) with the U.S. Department of Health and Human Services, National Institutes of Health, National Institute of Mental Health (NIH/NIMH). Petukhova has subsequently been supported by STARRS-LS, which is sponsored and funded by the Department of Defense (USUHS grant number HU0001-15-2-0004). This support funded her participation in this paper. The contents are solely the responsibility of the authors and do not necessarily represent the views of the Department of Health and Human Services, NIMH, or the Department of the Army, or the Department of Defense. The dataset was made available by the Army STARRS Team, which consists of Co-Principal Investigators: Robert J. Ursano, MD (Uniformed Services University of the Health Sciences) and Murray B. Stein, MD, MPH (University of California San Diego and VA San Diego Healthcare System). Site Principal Investigators: Steven Heeringa, PhD (University of Michigan), James Wagner, PhD (University of Michigan) and Ronald C. Kessler, PhD (Harvard Medical School); Army scientific consultant /liaison: Kenneth Cox, MD, MPH (Office of the Deputy Under Secretary of the Army). Other team members: Pablo A. Aliaga, MA (Uniformed Services University of the Health Sciences); COL David M. Benedek, MD (Uniformed Services University of the Health Sciences); Laura Campbell-Sills, PhD (University of California San Diego); Carol S. Fullerton, PhD (Uniformed Services University of the Health Sciences); Nancy Gebler, MA (University of Michigan); Robert K. Gifford, PhD (Uniformed Services University of the Health Sciences); Meredith House, BA (University of Michigan); Paul E. Hurwitz, MPH (Uniformed Services University of the Health Sciences); Sonia Jain, PhD (University of California San Diego); Tzu-Cheg Kao, PhD (Uniformed Services University of the Health Sciences); Lisa Lewandowski-Romps, PhD (University of Michigan); Holly Herberman Mash, PhD (Uniformed Services University of the Health Sciences); James A. Naifeh, PhD (Uniformed Services University of the Health Sciences); Tsz Hin Hinz Ng, MPH (Uniformed Services University of the Health Sciences); Matthew K. Nock, PhD (Harvard University); Nancy A. Sampson, BA (Harvard Medical School); COL Gary H. Wynn, MD (Uniformed Services University of the Health Sciences); and Alan M. Zaslavsky, PhD (Harvard Medical School).
Role of Funder/Sponsor: As a cooperative agreement, scientists employed by the National Institute of Mental Health and U.S. Army liaisons and consultants collaborated to develop the STARRS study protocol and data collection instruments, supervise data collection, interpret results, and prepare reports. Although a draft of the manuscript was submitted to the U.S. Army and National Institute of Mental Health for review and comment before submission for publication, this was done with the understanding that comments would be no more than advisory.
Data availability: Illustration code for the R language (R Core Team, 2024) to conduct the analysis with synthetic example data is available on GitHub at github.com/grf-labs/grf/tree/master/experiments/ijmpr.
References
- Allsopp et al., (2019) Allsopp, K., Read, J., Corcoran, R., and Kinderman, P. (2019). Heterogeneity in psychiatric diagnostic classification. Psychiatry Research, 279:15–22. https://doi.org/10.1016/j.psychres.2019.07.005.
- Angrist et al., (1996) Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455. https://doi.org/10.1080/01621459.1996.10476902.
- Angus and Chang, (2021) Angus, D. C. and Chang, C.-C. H. (2021). Heterogeneity of treatment effect: Estimating how the effects of interventions vary across individuals. JAMA, 326(22):2312–2313. https://doi.org/10.1001/jama.2021.20552.
- Athey and Imbens, (2016) Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360. https://doi.org/10.1073/pnas.1510489113.
- Athey et al., (2019) Athey, S., Tibshirani, J., and Wager, S. (2019). Generalized random forests. The Annals of Statistics, 47(2):1148–1178. https://doi.org/10.1214/18-AOS1709.
- Athey and Wager, (2019) Athey, S. and Wager, S. (2019). Estimating treatment effects with causal forests: An application. Observational Studies, 5(2):37–51. https://arxiv.org/abs/1902.07409.
- Battocchi et al., (2019) Battocchi, K., Dillon, E., Hei, M., Lewis, G., Oka, P., Oprescu, M., and Syrgkanis, V. (2019). EconML: A Python package for ML-based heterogeneous treatment effects estimation. https://github.com/microsoft/EconML.
- Bénard and Josse, (2023) Bénard, C. and Josse, J. (2023). Variable importance for causal forests: breaking down the heterogeneity of treatment effects. arXiv preprint arXiv:2308.03369. https://arxiv.org/abs/2308.03369.
- Bhattacharya and Dupas, (2012) Bhattacharya, D. and Dupas, P. (2012). Inferring welfare maximizing treatment assignment under budget constraints. Journal of Econometrics, 167(1):168–196. https://doi.org/10.1016/j.jeconom.2011.11.007.
- Breiman, (2001) Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32. https://doi.org/10.1023/A:1010933404324.
- Carroll et al., (2016) Carroll, R., Corcoran, P., Griffin, E., Perry, I., Arensman, E., Gunnell, D., and Metcalfe, C. (2016). Variation between hospitals in inpatient admission practices for self-harm patients and its impact on repeat presentation. Social Psychiatry and Psychiatric Epidemiology, 51:1485–1493. https://doi.org/10.1007/s00127-016-1247-y.
- Chen et al., (2020) Chen, H., Harinen, T., Lee, J.-Y., Yung, M., and Zhao, Z. (2020). CausalML: Python package for causal machine learning. arXiv preprint arXiv:2002.11631. https://doi.org/10.48550/arXiv.2002.11631.
- Chernozhukov et al., (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68. https://doi.org/10.1111/ectj.12097.
- Cohen and DeRubeis, (2018) Cohen, Z. D. and DeRubeis, R. J. (2018). Treatment selection in depression. Annual Review of Clinical Psychology, 14:209–236. https://doi.org/10.1146/annurev-clinpsy-050817-084746.
- Cui et al., (2023) Cui, Y., Kosorok, M. R., Sverdrup, E., Wager, S., and Zhu, R. (2023). Estimating heterogeneous treatment effects with right-censored data via causal survival forests. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85(2):179–211. https://doi.org/10.1093/jrsssb/qkac001.
- Cuijpers et al., (2014) Cuijpers, P., Weitz, E., Twisk, J., Kuehner, C., Cristea, I., David, D., DeRubeis, R. J., Dimidjian, S., Dunlop, B. W., Faramarzi, M., et al. (2014). Gender as predictor and moderator of outcome in cognitive behavior therapy and pharmacotherapy for adult depression: An “individual patient data” meta-analysis. Depression and Anxiety, 31(11):941–951. https://doi.org/10.1002/da.22328.
- DeAngelis, (2023) DeAngelis, T. (2023). By the numbers: Examining the staggering cost of PTSD. Monitor on Psychology, 54(1). https://www.apa.org/monitor/2023/01/staggering-ptsd-costs.
- DeRubeis et al., (2014) DeRubeis, R. J., Cohen, Z. D., Forand, N. R., Fournier, J. C., Gelfand, L. A., and Lorenzo-Luaces, L. (2014). The personalized advantage index: Translating research on prediction into individualized treatment recommendations. A demonstration. PLoS One, 9(1):e83875. https://doi.org/10.1371/journal.pone.0083875.
- Ding, (2023) Ding, P. (2023). A First Course in Causal Inference. CRC Press. https://doi.org/10.48550/arXiv.2305.18793.
- Driessen et al., (2010) Driessen, E., Cuijpers, P., Hollon, S. D., and Dekker, J. J. (2010). Does pretreatment severity moderate the efficacy of psychological treatment of adult outpatient depression? A meta-analysis. Journal of Consulting and Clinical Psychology, 78(5):668. https://doi.org/10.1037/a0020570.
- Feczko et al., (2019) Feczko, E., Miranda-Dominguez, O., Marr, M., Graham, A. M., Nigg, J. T., and Fair, D. A. (2019). The heterogeneity problem: Approaches to identify psychiatric subtypes. Trends in Cognitive Sciences, 23(7):584–601. https://doi.org/10.1016/j.tics.2019.03.009.
- Frangakis and Rubin, (2002) Frangakis, C. E. and Rubin, D. B. (2002). Principal stratification in causal inference. Biometrics, 58(1):21–29. https://doi.org/10.1111/j.0006-341x.2002.00021.x.
- Friedberg et al., (2020) Friedberg, R., Tibshirani, J., Athey, S., and Wager, S. (2020). Local linear forests. Journal of Computational and Graphical Statistics, 30(2):503–517. https://doi.org/10.1080/10618600.2020.1831930.
- Hahn et al., (2020) Hahn, P. R., Murray, J. S., and Carvalho, C. M. (2020). Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects (with discussion). Bayesian Analysis, 15(3):965–1056. https://doi.org/10.1214/19-BA1195.
- Hastie et al., (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction. Springer. https://doi.org/10.1007/978-0-387-84858-7.
- Hernán and Robins, (2023) Hernán, M. A. and Robins, J. M. (2023). Causal Inference: What If. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. CRC Press.
- Imbens and Rubin, (2015) Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press. https://doi.org/10.1017/CBO9781139025751.
- Jones et al., (2008) Jones, G., Gavrilovic, J. J., McCabe, R., Becktas, C., and Priebe, S. (2008). Treating suicidal patients in an acute psychiatric day hospital: A challenge to assumptions about risk and overnight care. Journal of Mental Health, 17(4):375–387. https://doi.org/10.1080/09638230701504981.
- Kaiser et al., (2022) Kaiser, T., Volkmann, C., Volkmann, A., Karyotaki, E., Cuijpers, P., and Brakemeier, E.-L. (2022). Heterogeneity of treatment effects in trials on psychotherapy of depression. Clinical Psychology: Science and Practice, 29(3):294. https://doi.org/10.1037/cps0000079.
- Karstoft et al., (2015) Karstoft, K.-I., Statnikov, A., Andersen, S. B., Madsen, T., and Galatzer-Levy, I. R. (2015). Early identification of posttraumatic stress following military deployment: Application of machine learning methods to a prospective study of Danish soldiers. Journal of Affective Disorders, 184:170–175. https://doi.org/10.1016/j.jad.2015.05.057.
- Kennedy, (2023) Kennedy, E. H. (2023). Towards optimal doubly robust estimation of heterogeneous causal effects. Electronic Journal of Statistics, 17(2):3008–3049. https://doi.org/10.1214/23-EJS2157.
- Kent et al., (2016) Kent, D. M., Nelson, J., Dahabreh, I. J., Rothwell, P. M., Altman, D. G., and Hayward, R. A. (2016). Risk and treatment effect heterogeneity: Re-analysis of individual participant data from 32 large clinical trials. International Journal of Epidemiology, 45(6):2075–2088. https://doi.org/10.1093/ije/dyw118.
- Kessler et al., (2024) Kessler, R. C., Bossarte, R. M., Hwang, I., Luedtke, A., Naifeh, J. A., Nock, M. K., Petukhova, M., Sadikova, E., Sampson, N. A., Sverdrup, E., Zaslavsky, A. M., Zubizarreta, J. R., Wager, S., Wagner, J., Stein, M. B., and Ursano, R. J. (2024). A prediction model for differential resilience to the effects of combat-related stressors in US Army soldiers. International Journal of Methods in Psychiatric Research, 33(4). https://doi.org/10.1002/mpr.70006.
- Kravitz et al., (2004) Kravitz, R. L., Duan, N., and Braslow, J. (2004). Evidence-based medicine, heterogeneity of treatment effects, and the trouble with averages. The Milbank Quarterly, 82(4):661–687. https://doi.org/10.1111/j.0887-378X.2004.00327.x.
- Künzel et al., (2019) Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165. https://doi.org/10.1073/pnas.1804597116.
- Large and Kapur, (2018) Large, M. M. and Kapur, N. (2018). Psychiatric hospitalisation and the risk of suicide. The British Journal of Psychiatry, 212(5):269–273. https://doi.org/10.1192/bjp.2018.22.
- Luedtke and van der Laan, (2016) Luedtke, A. R. and van der Laan, M. J. (2016). Super-learning of an optimal dynamic treatment rule. The International Journal of Biostatistics, 12(1):305–332. https://doi.org/10.1515/ijb-2015-0052.
- MacKinnon and White, (1985) MacKinnon, J. G. and White, H. (1985). Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics, 29(3):305–325. https://doi.org/10.1016/0304-4076(85)90158-7.
- Maj et al., (2020) Maj, M., Stein, D. J., Parker, G., Zimmerman, M., Fava, G. A., De Hert, M., Demyttenaere, K., McIntyre, R. S., Widiger, T., and Wittchen, H.-U. (2020). The clinical characterization of the adult patient with depression aimed at personalization of management. World Psychiatry, 19(3):269–293. https://doi.org/10.1002/wps.20771.
- Nie and Wager, (2021) Nie, X. and Wager, S. (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika, 108(2):299–319. https://doi.org/10.1093/biomet/asaa076.
- Papini et al., (2023) Papini, S., Norman, S. B., Campbell-Sills, L., Sun, X., He, F., Kessler, R. C., Ursano, R. J., Jain, S., and Stein, M. B. (2023). Development and validation of a machine learning prediction model of posttraumatic stress disorder after military deployment. JAMA Network Open, 6(6):e2321273. https://doi.org/10.1001/jamanetworkopen.2023.21273.
- Paulson and Krippner, (2007) Paulson, D. S. and Krippner, S. (2007). Haunted by combat: Understanding PTSD in war veterans including women, reservists, and those coming back from Iraq. Bloomsbury Publishing USA. https://books.google.com/books?id=gJ2Mq7nEZ8MC.
- Pearl, (1995) Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4):669–688. https://doi.org/10.2307/2337329.
- R Core Team, (2024) R Core Team (2024). R: A language and environment for statistical computing. https://www.R-project.org/.
- Robins et al., (1994) Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866. https://doi.org/10.2307/2290910.
- Robinson, (1988) Robinson, P. M. (1988). Root-n-consistent semiparametric regression. Econometrica: Journal of the Econometric Society, pages 931–954. https://doi.org/10.2307/1912705.
- Rosenbaum and Rubin, (1983) Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55. https://doi.org/10.1093/biomet/70.1.41.
- Ross et al., (2023) Ross, E. L., Bossarte, R. M., Dobscha, S. K., Gildea, S. M., Hwang, I., Kennedy, C. J., Liu, H., Luedtke, A., Marx, B. P., Nock, M. K., Petukhova, M. V., Sampson, N. A., Zainal, N. H., Sverdrup, E., Wager, S., and Kessler, R. C. (2023). Estimated average treatment effect of psychiatric hospitalization in patients with suicidal behaviors: A precision treatment analysis. JAMA Psychiatry, 81(2):134–143. https://doi.org/10.1001/jamapsychiatry.2023.3994.
- Schultebraucks et al., (2021) Schultebraucks, K., Qian, M., Abu-Amara, D., Dean, K., Laska, E., Siegel, C., Gautam, A., Guffanti, G., Hammamieh, R., Misganaw, B., et al. (2021). Pre-deployment risk factors for PTSD in active-duty personnel deployed to Afghanistan: A machine-learning approach for analyzing multivariate predictors. Molecular Psychiatry, 26(9):5011–5022. https://doi.org/10.1038/s41380-020-0789-2.
- Semenova and Chernozhukov, (2021) Semenova, V. and Chernozhukov, V. (2021). Debiased machine learning of conditional average treatment effects and other causal functions. The Econometrics Journal, 24(2):264–289. https://doi.org/10.1093/ectj/utaa027.
- Steeg et al., (2018) Steeg, S., Carr, M., Emsley, R., Hawton, K., Waters, K., Bickley, H., Ness, J., Geulayov, G., and Kapur, N. (2018). Suicide and all-cause mortality following routine hospital management of self-harm: Propensity score analysis using multicentre cohort data. PLoS One, 13(9):e0204670. https://doi.org/10.1371/journal.pone.0204670.
- Steyerberg, (2019) Steyerberg, E. W. (2019). Clinical prediction models: A practical approach to development, validation, and updating. Springer. https://doi.org/10.1007/978-3-030-16399-0.
- Sverdrup et al., (2024) Sverdrup, E., Wu, H., Athey, S., and Wager, S. (2024). Qini curves for multi-armed treatment rules. Journal of Computational and Graphical Statistics, forthcoming. (R package version 0.4.0). https://doi.org/10.1080/10618600.2024.2418820.
- Thompson and Dobbins, (2018) Thompson, S. R. and Dobbins, S. (2018). The applicability of resilience training to the mitigation of trauma-related mental illness in military personnel. Journal of the American Psychiatric Nurses Association, 24(1):23–34. https://doi.org/10.1177/1078390317739957.
- Tian et al., (2014) Tian, L., Alizadeh, A. A., Gentles, A. J., and Tibshirani, R. (2014). A simple method for estimating interactions between a treatment and a large number of covariates. Journal of the American Statistical Association, 109(508):1517–1532. https://doi.org/10.1080/01621459.2014.951443.
- Tibshirani et al., (2024) Tibshirani, J., Athey, S., Friedberg, R., Hadad, V., Hirshberg, D., Miner, L., Sverdrup, E., Wager, S., and Wright, M. (2024). grf: Generalized random forests. R package version 2.3.2. github.com/grf-labs/grf.
- Ursano et al., (2014) Ursano, R. J., Colpe, L. J., Heeringa, S. G., Kessler, R. C., Schoenbaum, M., Stein, M. B., and the Army STARRS Collaborators (2014). The Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS). Psychiatry: Interpersonal and Biological Processes, 77(2):107–119. https://doi.org/10.1521/psyc.2014.77.2.107.
- van der Laan et al., (2007) van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1). https://doi.org/10.2202/1544-6115.1309.
- van Klaveren et al., (2019) van Klaveren, D., Balan, T. A., Steyerberg, E. W., and Kent, D. M. (2019). Models with interactions overestimated heterogeneity of treatment effects and were prone to treatment mistargeting. Journal of Clinical Epidemiology, 114:72–83. https://doi.org/10.1016/j.jclinepi.2019.05.029.
- VanderWeele et al., (2019) VanderWeele, T. J., Luedtke, A. R., van der Laan, M. J., and Kessler, R. C. (2019). Selecting optimal subgroups for treatment using many covariates. Epidemiology, 30(3):334–341. https://doi.org/10.1097/EDE.0000000000000991.
- Velentgas et al., (2013) Velentgas, P., Dreyer, N. A., Nourjah, P., Smith, S. R., Torchia, M. M., et al. (2013). Developing a protocol for observational comparative effectiveness research: A user’s guide. Agency for Healthcare Research and Quality (US).
- Wager, (2024) Wager, S. (2024). Causal Inference: A Statistical Learning Approach. In preparation. https://web.stanford.edu/~swager/causal_inf_book.pdf.
- Wager and Athey, (2018) Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242. https://doi.org/10.1080/01621459.2017.1319839.
- Wickham, (2016) Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer.
- Yadlowsky et al., (2024) Yadlowsky, S., Fleming, S., Shah, N., Brunskill, E., and Wager, S. (2024). Evaluating treatment prioritization rules via rank-weighted average treatment effects. Journal of the American Statistical Association, forthcoming. https://doi.org/10.1080/01621459.2024.2393466.