\AtAppendix

Treatment Effect Heterogeneity and Importance Measures for Multivariate Continuous Treatments

\nameHeejun Shin \email[email protected]
\addrDepartment of Statistics, University of Florida, Gainesville, FL 32611, USA \AND\nameAntonio Linero \email[email protected]
\addrDepartment of Statistics and Data Science, University of Texas at Austin, Austin, TX 78712, USA \AND\nameMichelle Audirac \email[email protected]
\addrDepartment of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA \AND\nameKezia Irene \email[email protected]
\addrDepartment of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA \AND\nameDanielle Braun \email[email protected]
\addrDepartment of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA;
\addrDepartment of Data Science, Dana-Farber Cancer Institute, Boston, MA 02115, USA \AND\nameJoseph Antonelli \email[email protected]
\addrDepartment of Statistics, University of Florida, Gainesville, FL 32611, USA

Abstract

Estimating the joint effect of a multivariate, continuous exposure is crucial, particularly in environmental health where interest lies in simultaneously evaluating the impact of multiple environmental pollutants on health. We develop novel methodology that addresses two key issues for estimation of treatment effects of multivariate, continuous exposures. We use nonparametric Bayesian methodology that is flexible to ensure our approach can capture a wide range of data generating processes. Additionally, we allow the effect of the exposures to be heterogeneous with respect to covariates. Treatment effect heterogeneity has not been well explored in the causal inference literature for multivariate, continuous exposures, and therefore we introduce novel estimands that summarize the nature and extent of the heterogeneity, and propose estimation procedures for new estimands related to treatment effect heterogeneity. We provide theoretical support for the proposed models in the form of posterior contraction rates and show that it works well in simulated examples both with and without heterogeneity. We apply our approach to a study of the health effects of simultaneous exposure to the components of PM_2.5 and find that the negative health effects of exposure to these environmental pollutants is exacerbated by low socioeconomic status and age.

Keywords: Causal inference, Bayesian nonparametrics, Environmental mixtures, Treatment effect heterogeneity, Variable importance measures

1 Introduction

An important scientific question is understanding the effect of a multivariate treatment on an outcome, particularly in environmental health where individuals are exposed simultaneously to multiple pollutants and it is of interest to understand the joint impact of each of these pollutants on public health. Further, the effects of each of these pollutants might be heterogeneous with respect to characteristics of the individuals exposed, and it is important to account for and understand this treatment effect heterogeneity. Ignoring this heterogeneity can lead to biased estimates of the average effect of the treatments on the outcome, and can also mask substantial effects of the treatment on specific subgroups of the population. In this paper, we address this problem by developing methodology for estimating heterogeneous treatment effects of continuous, multivariate treatments.

In the past decade, there has been an increasing interest in analyzing the effects of multivariate exposures in environmental health where the exposures are referred to as environmental mixtures (Dominici et al., 2010; Agier et al., 2016; Stafoggia et al., 2017; Gibson et al., 2019; Shin et al., 2023). Analyzing environmental mixtures is a challenging problem for a number of reasons. For one, the scientific goal is not simply predicting the outcome, but rather understanding the effects of each of the individual exposures, and whether these exposures interact with each other. Therefore, off-the-shelf modern regression strategies such as tree-based models (Breiman, 2001; Chipman et al., 2010) and Gaussian processes (Banerjee et al., 2013) do not immediately apply. A number of studies in the environmental statistics literature have tailored these methods for multivariate exposures. A common theme among these methods is the use of nonparametric Bayesian models. Gaussian processes are adapted for this purpose in the popular Bayesian kernel machine regression (BKMR, Bobb et al. (2015)) or in related work that explicitly identifies interactions between exposures (Ferrari and Dunson, 2020). Related Bayesian approaches use basis function expansions to identify important exposures or interactions among exposures for environmental mixtures (Antonelli et al., 2020; Wei et al., 2020; Samanta and Antonelli, 2022). Other, related approaches have been developed all with the goal of estimating the health effects of environmental mixtures and identifying exposures that affect the outcome (Herring, 2010; Carrico et al., 2015; Narisetty et al., 2019; Boss et al., 2021; Ferrari and Dunson, 2021).

While the aforementioned approaches are useful for multivariate, continuous exposures, they all implicitly assume that the effect of the environmental exposures is the same across the entire population. This assumption does not hold in the context of ambient air pollution, where effects have been shown to vary by characteristics such as age, race, and location (Wang et al., 2020; Bargagli-Stoffi et al., 2020). Treatment effect heterogeneity has seen an explosion of interest in the causal inference literature, though it is typically restricted to a single treatment, either binary (Athey and Imbens, 2016; Wager and Athey, 2018; Hahn et al., 2020; Semenova and Chernozhukov, 2021; Fan et al., 2022; Shin and Antonelli, 2023) or with multiple levels (Chang and Roy, 2024). In the binary setting, heterogeneity is typically summarized by the conditional average treatment effect (CATE) function, $\operatorname{\mathbb{E}}\{Y(1)-Y(0)|\bm{X}=\bm{x}\}$ where $Y(w)$ denotes the potential outcome under a binary exposure level $w$ . While the CATE is a functional estimand, a number of summaries of this function have been proposed to simplify heterogeneity further. One univariate summary is the variance of the CATE (VTE, Levy et al. (2021)), which describes the overall degree of heterogeneity. Related quantities of interest to our study are treatment effect variable importance measures (TE-VIM), which have been recently proposed for univariate treatments in Hines et al. (2022) and Li et al. (2023) to learn which covariates are key factors driving treatment effect heterogeneity. TE-VIM relates regression-based variable importance measures (Zhang and Janson, 2020; Williamson et al., 2021) and the VTE of Levy et al. (2021) within a causal inference framework. Nearly all of this work on treatment effect heterogeneity focuses on binary treatments, and there has been little to no work on the heterogeneous effects of multivariate, continuous exposures.

There exists a substantial gap in the literature on how to both define estimands for continuous, multivariate exposures and for estimating these quantities using observed data and flexible modeling approaches that make as few modeling assumptions as possible. In this paper, we aim to fill this gap by first proposing new estimands in this setting that are interpretable and help describe the complex effect of multiple treatments, and how this effect varies by observed covariates. We extend treatment effect variable importance measures to this more complex setting and describe how to identify and estimate these estimands from observed data. These measures provide researchers with insight into causal effect modifiers for multivariate, continuous exposures and can provide policy-makers with detailed information about how treatments impact outcomes. Additionally, we develop novel estimation strategies using nonparametric Bayesian methodology. By decomposing outcome regression surfaces into distinct, identifiable functions, we are able to explicitly shrink treatment effects towards homogeneity, while still allowing for heterogeneity when it exists. We also make limited modeling assumptions as we use novel extensions of Bayesian additive regression trees (BART) such as the SoftBART prior distribution (Linero and Yang, 2018) and targeted smoothing BART (Starling et al., 2020; Li et al., 2022) for interactions between exposures and covariates to allow for heterogeneity without imposing strong parametric assumptions.

2 Estimands of Interest

Throughout, we assume that we observe $\bm{\mathcal{D}}=\{(Y_{i},\bm{X}_{i},\bm{W}_{i})\}_{i=1}^{n}$ for each individual where $Y_{i}\in\mathbb{R},\bm{X}_{i}\in\mathbb{R}^{p}$ , and $\bm{W}_{i}\in\mathbb{R}^{q}$ . Also, we adopt the potential outcomes framework: $Y_{i}(\bm{w})$ denotes the potential outcome that would be observed under exposure $\bm{w}$ . In order to denote potential outcomes in this manner, and to identify causal effects from the observed data, we make the following assumptions:

(i)

SUTVA (Stable Unit Treatment Value Assumption, Rubin (1980)): Treatments of one unit do not affect the potential outcomes of other units (no interference), and there are not different versions of treatments, such that $Y_{i}=Y_{i}(\bm{W}_{i})$ .
(ii)

Positivity: $0<f_{W|X}(\bm{w}|\bm{X}=\bm{x})$ for all $\bm{x}$ and $\bm{w}$ , where $f_{W|X}$ is the conditional density of the treatments given covariates.
(iii)

Unconfoundedness: $Y(\bm{w})\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}\bm{W}|\bm{X}$ for all $\bm{w}$ .

The positivity assumption guarantees that each unit has a non-zero probability to be exposed to each treatment level for all possible values of pre-treatment variables, at least in large samples. Unconfoundedness requires that the treatment $W$ is independent of the potential outcomes, and it implies that there exist no unmeasured variables that confound the treatment - outcome relationship. Because exposures and covariates are both multivariate, there are many causal estimands one could define. In the following sections, we first define estimands targeting both average treatment effects and treatment effect heterogeneity, as well as variable importance measures summarizing this heterogeneity.

2.1 Marginal and Heterogeneous Effects Estimands

One potential estimand to look at in this scenario is $\operatorname{\mathbb{E}}[Y(\bm{w})|\bm{X}=\bm{x}]$ , however, there are infinitely many values that $\bm{w}$ can take and therefore it is difficult to interpret this estimand. One option is to look at values of $\bm{w}$ where all but one of the exposures is fixed, and then visualize this estimand as a function of one exposure. A simpler setting is one in which there are two exposure levels $\bm{w}_{1}$ and $\bm{w}_{0}$ , under which we want to compare the outcomes. For example, one may be interested in examining what would happen if an intervention was applied that lowered the level of pollution by a specific amount. In this case, one can use the pollution levels without the intervention as $\bm{w}_{0}$ and those with the intervention as $\bm{w}_{1}$ . Heterogeneity is therefore described by

\tau_{\bm{w}_{1},\bm{w}_{0}}(\bm{x})\equiv\operatorname{\mathbb{E}}\{Y(\bm{w}_{1})-Y(\bm{w}_{0})|\bm{X}=\bm{x}\}.

If interest is in marginal treatment effects, this quantity can be averaged over the covariate distribution to obtain a marginal effect

\overline{\tau}_{\bm{w}_{1},\bm{w}_{0}}=\operatorname{\mathbb{E}}[\tau_{\bm{w}_{1},\bm{w}_{0}}(\bm{X})]

This problem effectively reduces to the binary treatment setting where the conditional average treatment effect (CATE) is originally defined as there are only two exposure levels of interest. Because of this, many existing approaches to summarizing effect heterogeneity from the existing literature can be used. The variance of this function, $\text{Var}[\tau_{\bm{w}_{1},\bm{w}_{0}}(\bm{X})]$ , or the treatment effect variable importance measures of Hines et al. (2022) can be incorporated analogously. For this reason, in the following section, we focus on the more complex setting where we have more than two exposure levels of interest, yet we want to study heterogeneity and examine the impact of each covariate in the degree of heterogeneity.

2.2 Multivariate Treatment Effect Variable Importance Measures

In many scenarios, we have more than two exposure levels of interest, yet we still want to define relevant quantities that provide information about treatment effect heterogeneity and which covariates are driving this heterogeneity. For this section, we assume that we have a reference level of the exposure denoted by $\bm{w}_{0}$ , which can be the average exposure level, or some other value chosen a priori. It is important to note that the choice of $\bm{w}_{0}$ should not greatly affect results. We can define the following quantity:

\tau_{\bm{w}_{0}}(\bm{x},\bm{w})\equiv\operatorname{\mathbb{E}}[Y(\bm{w})-Y(\bm{w}_{0})|\bm{X}=\bm{x}],

which describes how the potential outcome surface varies by both $\bm{x}$ and $\bm{w}$ . This function is difficult to interpret as it is a function of two multivariate arguments. We propose multivariate treatment effect variable importance measures (MTE-VIM) to summarize how much the heterogeneity of $\tau_{\bm{w}_{0}}(\bm{x},\bm{w})$ is driven by each particular covariate.

Before defining the variable importance metric, we must first define the overall amount of heterogeneity of the treatment effect, which we define as

\phi=\operatorname{\mathbb{E}}_{\bm{W}}[\operatorname{Var}_{\bm{X}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}].

This looks at the variability of the treatment effects as a function of $\bm{X}$ for a fixed exposure level, but this variability may differ for different values of exposures $\bm{W}$ , and therefore we average this variability across the range of exposures. Note that instead of the marginal variance of $\bm{X}$ , we could have alternatively used the conditional variance of $\bm{X}$ given $\bm{W}$ . This would also provide a useful measure of the overall amount of heterogeneity, but we will see in Section 3.3 that it would complicate our estimation of this quantity as it would require a model for the distribution of $\bm{X}$ given $\bm{W}$ , which is difficult to specify given the dimension of both the covariates and exposures. Next, we define

\phi_{j}=\operatorname{\mathbb{E}}_{\bm{W}}\left(\operatorname{Var}_{\bm{X}_{-j}}\left[\operatorname{\mathbb{E}}_{X_{j}|\bm{X}_{-j}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}\right]\right)

where $\bm{X}_{-j}$ denotes the $p-1$ remaining covariates without the $j$ -th covariate $X_{j}$ . This is similar to $\phi$ but we first take the conditional expectation of $\tau_{\bm{w}_{0}}(\bm{X},\bm{W})$ with respect to the $j$ -th covariate given the remaining covariates so that the term inside the variance is a function of only $\bm{X}_{-j}$ and $\bm{W}$ . We know by the law of total variance that $\phi-\phi_{j}\in[0,\phi]$ , and this quantity measures the amount heterogeneity of $\tau_{\bm{w}_{0}}(\bm{X},\bm{W})$ that cannot be explained without $X_{j}$ .

Assuming some degree of heterogeneity, i.e. $\phi>0$ , we propose the following estimand, which we call MTE-VIM, to describe the importance of each covariate to the overall heterogeneity of the causal effect:

\psi_{j}=1-\phi_{j}/\phi\in[0,1].

This can be interpreted as the proportion of the treatment effect heterogeneity of $\tau_{\bm{w}_{0}}(\bm{X},\bm{W})$ not explained by $\bm{X}_{-j}$ . We note that, however, one should carefully interpret $\psi_{j}$ especially when covariates are highly correlated. As an extreme example, if $X_{1}$ is solely driving the heterogeneity of $\tau_{\bm{w}_{0}}(\bm{X},\bm{W})$ and $X_{2}$ is strongly correlated with $X_{1}$ , then $\psi_{1}$ is close to zero because the remaining covariates include $X_{2}$ which would explain most of the heterogeneity instead of $X_{1}$ . From this perspective, one could also construct a grouped MTE-VIM, $\psi_{s}$ , where $s$ is a subset of $\{1,2,...,p\}$ . We refer readers to Hines et al. (2022) for further discussion.

3 Estimation Issues

Under the assumptions introduced in Section 2, for any fixed $\bm{w}$ we can identify $\operatorname{\mathbb{E}}\{Y(\bm{w})|\bm{X}=\bm{x}\}$ from the observed data by $\operatorname{\mathbb{E}}(Y|\bm{X}=\bm{x},\bm{W}=\bm{w})$ . Other identification strategies such as those involving propensity scores (Rosenbaum and Rubin, 1983) or combinations of outcome models and propensity scores (Bang and Robins, 2005) are common in causal inference. However, these do not apply here due to the difficulty of estimating the density of the multivariate treatment given the covariates, and because these identification strategies are not well studied for multivariate, continuous treatments. Due to this identification result, all estimands, including the MTE-VIM are identifiable given the conditional outcome regression and therefore we focus our estimation on this quantity.

We separate the conditional outcome regression surface into three parts: the main effect of covariates $\bm{X}$ , the main effect of exposures $\bm{W}$ , and the interactions of covariates and exposures on the outcome. Therefore, we write our model as follows:

\displaystyle\operatorname{\mathbb{E}}(Y|\bm{X},\bm{W})=c+f(\bm{X})+g(\bm{W})+h(\bm{X},\bm{W}).

(1)

Note that this model is overparameterized as it currently stands because the $f(\cdot)$ and $g(\cdot)$ functions can be absorbed into the $h(\cdot,\cdot)$ function. For this reason, these functions are not individually identifiable, and only their sum is identified. We write the model in this way so that we can explicitly shrink each of these components separately, and we discuss in this section how to structure the model so that each of these functions is individually identifiable and therefore amenable to shrinkage and regularization.

The first simplification we make is to set $h(\bm{X},\bm{W})=\sum_{j=1}^{p}h_{j}(X_{j},\bm{W})$ where $X_{j}$ is the $j$ -th component of $\bm{X}$ . This restricts interactions between covariates and exposures to be limited to interactions between a single covariate and the multivariate exposures. We feel this is a mild assumption that is reasonable in practice, and will improve estimation efficiency in our model. The first hurdle to identification of the individual functions is that $h_{j}(X_{j},\bm{W})$ may capture the main effect of either the covariate or exposures. For example, we could have $h_{j}(X_{j},\bm{W})=X_{j}+W_{1}+X_{j}W_{1}$ , which captures both main effects and interaction terms, when ideally it would only capture $X_{j}W_{1}$ . Effectively we want this function to only be nonzero if there is truly an interaction between the exposures and $X_{j}$ . For this reason, we restrict $h_{j}(X_{j},\bm{W})$ to be of the form $h^{cov}_{j}(X_{j})h^{exp}_{j}(\bm{W})$ where $h^{cov}_{j}$ and $h^{exp}_{j}$ are arbitrary functions restricted to not be constant functions. This allows us to prevent the interaction function $h_{j}(X_{j},\bm{W})$ from simply absorbing the main effects of $X_{j}$ and $\bm{W}$ . We describe our strategy for each of these separable functions in the following sections.

3.1 Identification through Shifting

The formulation for $h(\bm{X},\bm{W})$ in the previous section helped ensure that the interaction functions do not solely contain main effects of the exposures or covariates, however, the individual functions are still only identifiable up to constant shifts. For example, without further restriction, shifting $f(\bm{X})$ upward by $\delta$ , and $g(\bm{W})$ and $h(\bm{X},\bm{W})$ downward by $\delta/2$ leads to the same likelihood. This is problematic as we would like our interaction functions to be nonzero only when there is heterogeneity of the treatment effect. If these functions are not identifiable, it becomes more difficult to shrink them towards zero when there is no heterogeneity in the model.

One common way to address this is to put moment restrictions on the functions such that $\operatorname{\mathbb{E}}_{\bm{X}}\{f(\bm{X})\}=\operatorname{\mathbb{E}}_{\bm{W}}\{g(\bm{W})\}=\operatorname{\mathbb{E}}_{\bm{X},\bm{W}}\{h(\bm{X},\bm{W})\}=0$ with the additional restriction that $\operatorname{\mathbb{E}}_{\bm{X}}\{h(\bm{X},\bm{W})\}=\operatorname{\mathbb{E}}_{\bm{W}}\{h(\bm{X},\bm{W})\}=0$ . This approach is difficult to implement, however, as enforcing these conditions requires the conditional density of covariates given exposures, and vice versa, which is difficult to estimate given the dimension of the exposures and covariates. We use an alternative restriction that also leads to identifiability of individual functions, but is straightforward to implement. Specifically, we enforce that $f\{\operatorname{\mathbb{E}}(\bm{X})\}=g\{\operatorname{\mathbb{E}}(\bm{W})\}=0$ and $h\{\operatorname{\mathbb{E}}(\bm{X}),\bm{w}\}=h\{\bm{x},\operatorname{\mathbb{E}}(\bm{W})\}=0$ for all $\bm{x}$ and $\bm{w}$ . We show in Appendix B that this leads to identifiability of the individual functions. In practice, this restriction is achieved by centering our estimated functions at each iteration of an MCMC algorithm. If we let all functions with a 0-subscript denote unrestricted functions that do not enforce the aforementioned constraints, then these can be written as:

	$\displaystyle\operatorname{\mathbb{E}}(Y\|\bm{X},\bm{W})$	$\displaystyle=c_{0}+f_{0}(\bm{X})+g_{0}(\bm{W})+\sum_{j=1}^{p}h_{j0}(X_{j},\bm{W})$
		$\displaystyle=\left[c_{0}+f_{0}(\operatorname{\mathbb{E}}(\bm{X}))+g_{0}(\operatorname{\mathbb{E}}(\bm{W}))+\sum_{j=1}^{p}h_{j0}(\operatorname{\mathbb{E}}(X_{j}),\operatorname{\mathbb{E}}(\bm{W}))\right]$
		$\displaystyle\hskip 3.61371pt\hskip 3.61371pt+\left[f_{0}(\bm{X})-f_{0}(\operatorname{\mathbb{E}}(\bm{X}))+\sum_{j=1}^{p}\left\{h_{j0}(X_{j},\operatorname{\mathbb{E}}(\bm{W}))-h_{j0}(\operatorname{\mathbb{E}}(X_{j}),\operatorname{\mathbb{E}}(\bm{W}))\right\}\right]$
		$\displaystyle\hskip 3.61371pt\hskip 3.61371pt+\left[g_{0}(\bm{W})-g_{0}(\operatorname{\mathbb{E}}(\bm{W}))+\sum_{j=1}^{p}\left\{h_{j0}(\operatorname{\mathbb{E}}(X_{j}),\bm{W})-h_{j0}(\operatorname{\mathbb{E}}(X_{j}),\operatorname{\mathbb{E}}(\bm{W}))\right\}\right]$
		$\displaystyle\hskip 3.61371pt\hskip 3.61371pt+\left[\sum_{j=1}^{p}\left\{h_{j0}(X_{j},\bm{W})-h_{j0}(X_{j},\operatorname{\mathbb{E}}(\bm{W}))-h_{j0}(\operatorname{\mathbb{E}}(X_{j}),\bm{W})+h_{j0}(\operatorname{\mathbb{E}}(X_{j}),\operatorname{\mathbb{E}}(\bm{W}))\right\}\right]$
		$\displaystyle=c+f(\bm{X})+g(\bm{W})+\sum_{j=1}^{p}h_{j}(X_{j},\bm{W}).$

At each iteration of the MCMC, the functions are shifted to satisfy this condition, and therefore our individual functions are identifiable, and the interaction functions should only be nonzero when there is heterogeneity of the treatment effect. Because of this, we can apply shrinkage priors to the $h_{j}(X_{j},\bm{W})$ functions, which should improve estimation when heterogeneity is not present. Note that the alternative restriction that we impose is baseline-free in the sense that one can replace the chosen baseline, $\operatorname{\mathbb{E}}(\bm{X})$ and $\operatorname{\mathbb{E}}(\bm{W})$ , with any values of the covariates and exposures.

3.2 Soft Bart and Targeted Smoothing

In this section we detail the specific models we use for each of the $f(\bm{X})$ , $g(\bm{W})$ , and $h(\bm{X},\bm{W})$ functions. For the main effect functions we use the SoftBART prior of Linero and Yang (2018), while we use BART with targeted smoothing (tsBART, Starling et al. (2020); Li et al. (2022)) for estimation of the interaction functions. Before detailing each of these BART extensions, we first briefly introduce the original BART model. BART was introduced by Chipman et al. (2010) and is a fully Bayesian ensemble-of-trees model that has seen increasing usage in causal inference due to the success of the original BART and its variants (Dorie et al., 2019). Specifically, it assumes that

	$\displaystyle Z$	$\displaystyle=f(\bm{v})+\epsilon,\qquad\epsilon\overset{\text{i.i.d.}}{\sim}N(0,\sigma^{2})$
	$\displaystyle f(\bm{v})$	$\displaystyle=\sum_{m=1}^{M}\text{Tree}(\bm{v},\mathcal{T}_{m},\mathcal{M}_{m})$

where $\text{Tree}(\bm{v},\mathcal{T}_{j},\mathcal{M}_{j})$ represents a regression tree with tree structure $\mathcal{T}_{j}$ inducing a step function in $\bm{v}$ , and leaf parameters $\mathcal{M}_{m}=\{\mu_{m1},\cdots,\mu_{mb_{m}}\}$ for prediction. The standard prior distribution sets $\mu_{mk}\sim N(0,\sigma_{\mu}^{2}/M)$ so that the overall prior variance is $\operatorname{Var}(f(\bm{v}))=\sigma_{\mu}^{2}$ . Although we do not detail the hyper-parameters of BART in this paper, we note that the default setting of Chipman et al. (2010) encourages each tree to be shallow and shrinks leaf parameters toward zero so that $\text{Tree}(\cdot)$ can be considered as a weak learner. We refer interested readers to Linero (2017) and Hill et al. (2020) for recent reviews of BART.

Although the original BART approach is successful, due to the piece-wise constant nature of tree ensembles, it suffers when the underlying truth is smooth even if it is relatively simple. To overcome this shortcoming, Linero and Yang (2018) propose a smooth modification of BART (SoftBART or SBART) by allowing $\bm{v}$ to follow a probabilistic path, that is, randomizing the decision rule at each split. Because we expect the effects of environmental exposures or other continuous treatments on the outcome to be smooth, we use SoftBART for estimation of both $f(\bm{X})$ and $g(\bm{W})$ .

Another smooth variant of BART is tsBART (Starling et al., 2020; Li et al., 2022) which is originally motivated by time-to-event data and density estimation, though we use it for estimation of the interaction functions $h_{j}(X_{j},\bm{W})$ . Unlike SoftBART which smooths the regression function over all variables, tsBart smooths over a single targeted variable, say $u$ . After centering at $\gamma(u)$ , a baseline function of $u$ , the model can be written as $f(u,\bm{v})=\gamma(u)+\sum_{j=1}^{m}\text{tsTree}(u,\bm{v},\mathcal{T}_{m},\mathcal{M}_{m})$ where each terminal node is associated with $\mathcal{M}_{m}=\{\mu_{m1}(u),\cdots,\mu_{mb_{m}}(u)\}$ . Here, each element $\mu_{mb}(u)$ follows a Gaussian process in $u$ with mean zero and covariance function $\Sigma(u,u^{\prime})$ , so that $f(u,\bm{v})$ , the sum of $m$ Gaussian processes, is continuous in $u$ . Following Li et al. (2022), we approximate the model by $f(u,\bm{v})=\gamma(u)+\sum_{m=1}^{M}\mathcal{B}_{m}(u)\text{Tree}(\bm{v},\mathcal{T}_{m},\mathcal{M}_{m})$ where $\mathcal{B}_{m}(u)=\sqrt{2}\cos(\omega_{m}u+b_{m})$ is a random basis function where $\omega_{m}\overset{\text{i.i.d.}}{\sim}N(0,\rho^{-2})$ and $b_{m}\overset{\text{i.i.d.}}{\sim}U(0,2\pi)$ so that $\sum_{m=1}^{M}\mathcal{B}_{m}(u)\text{tree}(\bm{v},\mathcal{T}_{m},\mathcal{M}_{m})$ weakly converges to a Gaussian process $\text{GP}(0,\Sigma(\cdot,\cdot))$ as $M\rightarrow\infty$ where $\Sigma(u,u^{\prime})=\sigma_{\mu}^{2}\exp\{-(u-u^{\prime})/(2\rho^{2})\}$ .

By fitting tsBART for each interaction function $h_{j}(X_{j},\bm{W})$ in our model by setting $u=X_{j}$ and $\bm{v}=\bm{W}$ , we have that each fitted function is necessarily the sum of products of a continuous function of $X_{j}$ and a flexible tree of $\bm{W}$ so that it does not include a function solely of $X_{j}$ or $\bm{W}$ .

3.3 Estimation of MTE-VIM

We now detail the estimation strategy for the proposed variable importance measures. Recall that MTE-VIM for the $j$ -th covariate, $\psi_{j}$ , is composed of the total heterogeneity $\phi=\operatorname{\mathbb{E}}_{\bm{W}}[\operatorname{Var}_{\bm{X}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}]$ and the total heterogeneity accounted for by $\bm{X}_{-j}$ , given by $\phi_{j}=\operatorname{\mathbb{E}}_{\bm{W}}\left(\operatorname{Var}_{\bm{X}_{-j}}\left[\operatorname{\mathbb{E}}_{X_{j}|\bm{X}_{-j}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}\right]\right)$ . As before, under the assumptions of Section 2, we can write $\tau_{\bm{w}_{0}}(\bm{x},\bm{w})=\operatorname{\mathbb{E}}(Y|\bm{X}=\bm{x},\bm{W}=\bm{w})-\operatorname{\mathbb{E}}(Y|\bm{X}=\bm{x},\bm{W}=\bm{w}_{0})$ , and therefore the posterior distribution of $\tau_{\bm{w}_{0}}(\bm{x},\bm{w})$ can be obtained once we have the posterior distribution of the outcome regression model. Obtaining $\phi$ is relatively straightforward, as we use empirical distributions of both the exposures and covariates to approximate moments with respect to $\bm{W}$ or $\bm{X}$ . We can construct an $n\times n$ matrix for which the $(k,l)$ -th element is $\tau_{\bm{w}_{0}}(\bm{X}_{l},\bm{W}_{k})$ , where $\bm{X}_{l}$ and $\bm{W}_{k}$ represent the observed covariates of the $l$ -th observation and the exposures of the $k$ -th observation, respectively. Then, the sample variance of the $k$ -th row of the matrix would estimate $\operatorname{Var}_{\bm{X}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W}_{k})\}$ . Taking the sample mean of these $n$ column sample variances estimates $\operatorname{\mathbb{E}}_{\bm{W}}[\operatorname{Var}_{\bm{X}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}]$ . This estimation process can be illustrated as follows:

	$\displaystyle\begin{pmatrix}\tau_{\bm{w}_{0}}(\bm{X}_{1},\bm{W}_{1})&\tau_{\bm{w}_{0}}(\bm{X}_{2},\bm{W}_{1})&\dots&\tau_{\bm{w}_{0}}(\bm{X}_{n},\bm{W}_{1})\\ \tau_{\bm{w}_{0}}(\bm{X}_{1},\bm{W}_{2})&\ddots&&\tau_{\bm{w}_{0}}(\bm{X}_{n},\bm{W}_{2})\\ \vdots&&\ddots&\vdots\\ \tau_{\bm{w}_{0}}(\bm{X}_{1},\bm{W}_{n})&\tau_{\bm{w}_{0}}(\bm{X}_{2},\bm{W}_{n})&\dots&\tau_{\bm{w}_{0}}(\bm{X}_{n},\bm{W}_{n})\end{pmatrix}\longrightarrow$	$\displaystyle\begin{pmatrix}\text{Var}_{\bm{X}}[\tau_{\bm{w}_{0}}(\bm{X},\bm{W}_{1})]\\[5.0pt] \text{Var}_{\bm{X}}[\tau_{\bm{w}_{0}}(\bm{X},\bm{W}_{2})]\\ \vdots\\ \text{Var}_{\bm{X}}[\tau_{\bm{w}_{0}}(\bm{X},\bm{W}_{n})]\\ \end{pmatrix}$
		$\displaystyle\hskip 52.03227pt\big{\downarrow}$
		$\displaystyle\operatorname{\mathbb{E}}_{\bm{W}}[\operatorname{Var}_{\bm{X}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}]$

The same strategy can be used for $\phi_{j}$ replacing $\tau_{\bm{w}_{0}}(\bm{X}_{l},\bm{W}_{k})$ with $\operatorname{\mathbb{E}}_{X_{j}|\bm{X}_{-j}}\{\tau_{\bm{w}_{0}}(\bm{X}_{l},\bm{W}_{k})\}$ everywhere, so all that is left is to describe how to obtain $\operatorname{\mathbb{E}}_{X_{j}|\bm{X}_{-j}}\{\tau_{\bm{w}_{0}}(\bm{X}_{l},\bm{W}_{k})\}$ . For ease of exposition, we re-write $\tau_{\bm{w}_{0}}(\bm{X}_{l},\bm{W}_{k})$ as $\tau_{\bm{w}_{0}}(X_{lj},\bm{X}_{l(-j)},\bm{W}_{k})$ where $X_{lj}$ denotes the $j$ -th covariate of the $l$ -th individual and $\bm{X}_{l(-j)}$ denotes the remaining covariates of the $l$ -th individual. A nonparametric estimate of this mean can be defined as

\widehat{\operatorname{\mathbb{E}}}_{X_{j}|\bm{X}_{-j}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}=\frac{\sum_{i=1}^{n}\tau_{\bm{w}_{0}}(X_{ij},\bm{X}_{-j},\bm{W})K(\bm{X}_{i(-j)}-\bm{X}_{-j})}{\sum_{i=1}^{n}K(\bm{X}_{i(-j)}-\bm{X}_{-j})}

where $K(\cdot)$ is an appropriately defined kernel. This will work well if $\bm{X}$ is small or moderate-dimensional, but won’t work well as the number of covariates grows. In this more difficult setting, we could make parametric assumptions and estimate the mean of $X_{j}$ given $\bm{X}_{-j}$ using a regression model. For instance, if $X_{j}$ is continuous, we could assume that it follows a normal distribution and we can estimate the conditional mean and variance using linear regression. This would provide us an estimate of the conditional density $f_{X_{j}|\bm{X}_{-j}}$ from which we can calculate $\operatorname{\mathbb{E}}_{X_{j}|\bm{X}_{-j}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}$ .

Note that both of these require the construction of an $n\times n$ matrix at each MCMC sample, which is computationally intensive. This is particularly problematic for the analysis of Medicare data in Section 6, where our sample consists of all 38702 zip-codes in the United States. As an approximation to this calculation, one can use a blocking scheme where we construct several sub-matrices of the data when calculating the variable importance metrics. Specifically, we split the overall sample into $K$ groups of approximately equal size. Letting the sample size in group $k$ be $n_{k}$ , we can estimate the variable importance metric using the subset of the data in group $k$ , which only requires the calculation of an $n_{k}\times n_{k}$ matrix. We can do this for each group separately, and then average results across groups for a final estimate of $\psi_{j}$ for $j=1,\dots,p$ . We see in Appendix C that this blocking scheme performs equally well as using the full sample, but is significantly faster on large data sets.

4 Posterior contraction rates

Next, we study the rate at which the posterior distributions concentrate in the model described in Section 3. Let $\mathcal{S}(\alpha,p,d)$ denote a set of $\alpha-$ Hölder continuous functions on $[0,1]^{p}$ that are constant in all but $d$ of the coordinates. Following Linero and Yang (2018) and Li et al. (2022), we make the following assumptions about the data-generating process (Condition A).

(A1)

$Y_{i}\sim\operatorname{Normal}(c^{*}+f^{*}(\bm{X}_{i})+g^{*}(\bm{W}_{i})+h^{*}(\bm{X}_{i},\bm{W}_{i}),1)$
(A2)

The range of $\bm{X}_{i}$ and $\bm{W}_{i}$ are $[0,1]^{p}$ and $[0,1]^{q}$ , respectively.
(A3)

$f^{*}\in\mathcal{S}(\alpha_{x},p,d_{x})$ , $g^{*}\in\mathcal{S}(\alpha_{w},p,d_{w})$ , and $h^{*}\in\mathcal{S}(\alpha_{h},p+q,d_{h})$

We assume the error variance is $\sigma^{2}=1$ for simplicity, but it is straightforward to incorporate unknown $\sigma^{2}$ as well. Achieving (A2) is straightforward by applying quantile normalization, where $X_{ij}$ , the $i$ -th individual’s $j$ -th covariate, is replaced with its quantile value among all the $j$ -th covariate values, and the same for the exposures. Further, we must make assumptions about the SoftBART prior distribution $\Pi$ used for each regression function $f$ , $g$ , and $h$ . For brevity, we leave these specific details to Appendix A, and we refer to these assumptions on the prior distribution as Condition P. We have simplified the setup by not incorporating targeted smoothing into $h(\bm{x},\bm{w})$ , though we emphasize that Li et al. (2022) show that for certain choices of basis function, it is not difficult to prove analogous results for targeted smoothing models. We remove targeted smoothing from consideration so that we can use the same Condition P for all model components rather than requiring a separate set of conditions for $h(\cdot,\cdot)$ .

Let $\operatorname{\mathbb{E}}_{0}$ denote the expectation with respect to the true data-generating mechanism, $\Pi_{n}$ denote the posterior distribution, and $\|\mu\|_{n}^{2}=\dfrac{1}{n}\sum_{i=1}^{n}\mu(\bm{X}_{i},\bm{W}_{i})^{2}$ where $\mu(\bm{X}_{i},\bm{W}_{i})=c+f(\bm{X}_{i})+g(\bm{W}_{i})+h(\bm{X}_{i},\bm{W}_{i})$ . In Appendix A, we prove the following theorem.

Theorem 1

If Conditions A and P hold,

\operatorname{\mathbb{E}}_{0}\{\Pi_{n}(\|\mu-\mu^{*}\|_{n}>M\epsilon_{n})\}\rightarrow 0,\quad\text{ as }n\rightarrow\infty

for $\epsilon_{n}=\max(\epsilon_{nf},\epsilon_{ng},\epsilon_{nh})$ where $\epsilon_{nf}=n^{-\alpha_{x}/(2\alpha_{x}+d_{x})}\log(n)^{t_{f}},\epsilon_{ng}=n^{-\alpha_{w}/(2\alpha_{w}+d_{w})}\log(n)^{t_{g}}$ and $\epsilon_{nh}=n^{-\alpha_{h}/(2\alpha_{h}+d_{h})}\log(n)^{t_{h}}$ where $t_{f}=\alpha_{x}(d_{x}+1)/(2\alpha_{x}+d_{x}),t_{g}=\alpha_{w}(d_{w}+1)/(2\alpha_{w}+d_{w}),$ and $t_{h}=\alpha_{h}(d_{h}+1)/(2\alpha_{h}+d_{h}).$

The rates $\epsilon_{nf},\epsilon_{ng},\epsilon_{nh}$ represent the minimax estimation rates within the respective function classes for $(f,g,h)$ , up to a logarithmic term, and therefore the rate $\epsilon_{n}$ is a near-minimax optimal rate for estimation of $\mu$ . This result shows that our model is able to capture any true outcome regression function as long as it is sufficiently smooth, which is reasonable for the study of environmental pollutants. Further note that while we have derived posterior contraction rates for $\mu$ , these imply the same posterior contraction rates for causal effects, such as $\tau_{\bm{w}_{0}}(\bm{x},\bm{w})$ , and for variable importance metrics $\psi_{j}$ .

5 Simulation Studies

Here, we assess the performance of our proposed approach and compare it to existing approaches that do not allow for heterogeneity of the multivariate treatment effect. For the simulation studies we draw covariates and exposures from the following multivariate normal distributions:

	$\displaystyle\bm{X}_{i}\overset{\text{i.i.d.}}{\sim}N(\bm{0},\textbf{I}_{5}),\quad\bm{W}_{i}\sim N(\vec{\mu}_{i},\vec{\Sigma})\qquad\text{where}$
	$\displaystyle\vec{\mu}_{i}=\left(e^{X_{i1}}/(1+e^{X_{i1}})-0.5,0.1X_{i2}^{2}-0.1,0.3X_{i3},\sin(X_{i2}),0.05X_{i4}^{3}\right)^{T},$
	$\displaystyle\Sigma_{ij}=\begin{cases}1\qquad&i=j\\ 0.3\qquad&i\neq j\end{cases}$

and $\Sigma_{ij}$ denotes the $(i,j)$ element of $\vec{\Sigma}$ . The covariates are associated with the exposures and the exposures are correlated with each other, which are both common in observational studies in environmental health. Next, we generate the outcome by

	$\displaystyle Y_{i}=f^{}(\bm{X}_{i})+g^{}(\bm{W}_{i})+\sum_{j=1}^{5}h^{*}_{j}(X_{ij},\bm{W}_{i})+\epsilon_{i},\qquad\epsilon_{i}\overset{\text{i.i.d.}}{\sim}N(0,1)$
	$\displaystyle f^{*}(\bm{X}_{i})=X_{i1}+X_{i2}-0.5X_{i3}$
	$\displaystyle g^{*}(\bm{W}_{i})=\text{I}{\left\{W_{i1}>0\right\}}+W_{i1}e^{0.3W_{i3}}+\arctan(W_{i2})+\sin(W_{i2}W_{i3}\pi)+\min(\|W_{i3}\|,1).$

Further, we consider three scenarios of interactions between covariates and exposures:

	No interaction:	$\displaystyle h^{}_{1}(X_{i1},\bm{W})=h^{}_{2}(X_{i2},\bm{W})=0$
	Moderate:	$\displaystyle h^{}_{1}(X_{i1},\bm{W})=0.2\arctan(4X_{i1})g^{}(\bm{W}),\quad h^{}_{2}(X_{i2},\bm{W})=0.2\cos(X_{i2}\pi)g^{}(\bm{W})$
	Strong:	$\displaystyle h^{}_{1}(X_{i1},\bm{W})=0.4\arctan(4X_{i1})g^{}(\bm{W}),\quad h^{}_{2}(X_{i2},\bm{W})=0.4\cos(X_{i2}\pi)g^{}(\bm{W})$

while $h^{*}_{j}(X_{ij},\bm{W})=0$ for $j=3,4,5$ under all scenarios. The standard deviations of the heterogeneous treatment effect functions are approximately 0.2 and 0.4 times that of the sum of the two main effects, respectively, which is reasonable as we expect that interaction effects to be smaller in magnitude than main effects in general. We run the simulation for 300 different data replicates for each scenario and average results across all simulated data sets. We set the sample size to be $n=2000$ . We aim to estimate $\tau_{\bm{w}_{1},\bm{w}_{0}}(\bm{X})\equiv\operatorname{\mathbb{E}}\{Y(\bm{w}_{1})-Y(\bm{w}_{0})|\bm{X}\}$ at 100 randomly chosen locations from the distribution of $\bm{X}$ , as well as the variable importance metrics $\psi_{j}$ for all covariates. We choose two fixed exposures levels $\bm{w}_{0}=(-0.5,-0.5,-0.5,-0.5,-0.5)^{T}$ and $\bm{w}_{1}=(0.5,0.5,0.5,0.5,0.5)^{T}$ which roughly correspond to the first and the third quartiles of each exposure, respectively. We consider four different approaches in the simulation. The first is the Bayesian kernel machine regression approach (BKMR) commonly used in the environmental mixtures literature, which fits a model of the form $E(Y|\bm{X},\bm{W})=\bm{X}\bm{\beta}+m(\bm{W})$ . A Gaussian process prior is placed on $m(\cdot)$ , which incorporates spike-and-slab prior distributions to remove unnecessary exposures. The second is a standard BART prior that fits a model of the form $E(Y|\bm{X},\bm{W})=m(\bm{X},\bm{W})$ and places a BART prior on $m(\cdot)$ (BART). The third uses a smooth variant of BART for $m(\cdot)$ (SoftBART). Lastly, we use our proposed approach (SepBART), which separates the regression surface into separate functions and then uses either SoftBART or tsBART prior distributions for each function separately. Note that BKMR specifically assumes that the effect of covariates is linear, which is the case in our simulation studies, while it does not allow for treatment effect heterogeneity, which makes it misspecified in scenarios with interactions between exposures and covariates.

Figure 1 shows the simulation results for estimating $\tau_{\bm{w}_{1},\bm{w}_{0}}(\bm{X})$ . We find that our method is the only method that achieves the nominal 95% coverage for all interaction scenarios and produces the smallest root mean squared error (RMSE) when there is treatment effect heterogeneity, regardless of its strength. While it is expected that BKMR does not perform well in the presence of interactions between covariates and exposures as it assumes no treatment effect heterogeneity, it is surprising that SepBART outperforms BKMR even when there is no heterogeneity and the main effect of covariates is linear, which is the situation that BKMR is designed for. The no-interaction scenario is the least favorable setting for SepBART as it forces an interaction structure on the model. However, even for this setting, SepBART shows comparable estimation performance to the best performing model in terms of RMSE, which indicates that our model is able to shrink the $h(\cdot)$ functions to zero when they are not required. It is also notable that SepBART improves on the original BART and SoftBART that do not separate the regression function into distinct components, which suggests that we benefit from the proposed model formulation and separation of the effects into different functions by providing balance between flexibility and efficiency.

Refer to caption — Figure 1: Simulation results for estimating CATE when $\bm{w}_{0}$ and $\bm{w}_{1}$ are given. The dashed line in the second row denotes nominal 95% coverage.

We now assess the performance for estimating MTE-VIM introduced in Section 2. As BKMR does not allow for effect heterogeneity, we focus on our approach only here. We calculate $\hat{\psi}_{j}$ for $j=1,2,...,5$ for each data set by taking the posterior mean of estimates and plot the distribution of $\hat{\psi}_{j}$ in Figure 2. The true values of $\psi_{1}$ and $\psi_{2}$ are 0.72 and 0.28, respectively, and the others are zero. We see that for both interaction scenarios where the true total heterogeneity $\phi$ is 0.26 and 1.14, respectively (moderate and strong), the posterior mean of each estimate is concentrated near the true value. Along with estimating MTE-VIM, we can conduct a hypothesis test for the difference between two MTE-VIMs, i.e. testing $H_{0}:\psi_{j}=\psi_{k}$ . This can be done by constructing the posterior distribution of $\psi_{j}-\psi_{k}$ and seeing whether or not the interval contains zero. We calculate how often the null hypothesis is rejected with level $\alpha=0.05$ over 300 data set replicates, and plot the empirical rejection rate of each testing pair in Figure 3. Since the true $\psi_{1}$ is far away from zero, it shows high rejection rates for comparing $\psi_{1}$ and the others (the first column of Figure 3). For $\psi_{2}$ , which is non-zero, but not large as $\psi_{1}$ , it produces slightly weaker power when interactions are moderate, but detects almost all differences when interactions are strong. For the last three columns where the null is true, we see rejection rates are controlled under the desired level $\alpha=0.05$ regardless of the scenario, though inference is somewhat conservative.

6 Health effects of air pollution mixtures

We now use our proposed approach to gain important epidemiological insights about the effect of PM_2.5 components and ozone on mortality. We have demographic, socioeconomic, and mortality information on all Medicare beneficiaries in the United States above the age of 65 for the years 2000-2016. We observe a number of individual level covariates for these individuals, such as their age, race, sex, and whether they have dual eligibility to Medicaid, which is a proxy for low socioeconomic status. We also observe a number of area-level covariates unique to the zip-code of each individual from the United States Census Bureau and the Center for Disease Control’s Behavioral Risk Factor Surveillance System. These area-level covariates consist of average body mass index, smoking rates, median household income, median house value, education, population density, and percent owner occupied housing. We also adjust for both temperature and precipitation variables that are available from the National Climatic Data Center.

We obtain zip-code level environmental exposure data from two distinct sources. We obtain estimates of total PM_2.5, ammonium, nitrates, and sulfate levels on a ( $0.01^{\circ}\times 0.01^{\circ}$ ) monthly grid from the Atmospheric Composition Analysis Group (Van Donkelaar et al., 2019). We obtain estimates of elemental carbon, organic carbon, and ozone on a 1km by 1km daily grid from the Socioeconomic Data and Applications Center (Di et al., 2019, 2021; Requia et al., 2021). We do not have exact residential addresses of individuals in Medicare and only know their residential zip-code, and therefore all exposures are aggregated to the yearly level at each zip-code. All individual and geographic-level covariates are also aggregated up to the zip-code level by taking their averages or proportions within each zip code so that the outcome of interest is the annual mortality rate in each zip-code, which is defined as the number of deaths in that zip-code for a particular year divided by the number of person-years in that zip-code.

We run our proposed approach as described in Section 3 targeting both marginal and heterogeneous treatment effects to evaluate the extent to which ambient air pollution affects mortality, and whether this effect varies by characteristics of zip-codes. We run our model for each year separately using the prior year exposures as the pollutants of interest, and therefore will present results for all years between 2000 and 2016. We run our MCMC algorithm for 5000 iterations, discarding the first 1000 as a burn-in, and thinning every fourth sample.

6.1 Marginal effects of exposures

First, we investigate the average treatment effect of an increase in environmental mixtures, denoted as $\operatorname{\mathbb{E}}\{Y(\bm{w}_{1})-Y(\bm{w}_{0})\}$ , where $\bm{w}_{0}$ and $\bm{w}_{1}$ represent vectors of all exposures corresponding to the first and third quartiles for each exposure, respectively. Hence, a positive ATE would indicate a harmful effect on mortality due to an increase in environmental exposures. Figure 4 illustrates estimates of the ATE for each year. We consistently estimate a positive ATE, implying that increasing levels of all environmental exposures increases the mortality rate across the contiguous United States, which suggests a detrimental impact of environmental pollutants on human health.

6.2 Heterogeneity of the effect of PM_2.5 components

While the average treatment effect provides insights into the health effects of air pollution, of even more interest is whether this effect varies across the population, as it is crucially important to understand which communities are at most risk to the detrimental effects of air pollution. Toward this goal, we first investigate the proposed MTE-VIM for each covariate to understand which characteristics drive the heterogeneity of the causal effect. The estimated total heterogeneity, $\phi$ , averaged over all of the study years is 0.04, which implies that the standard deviation of $\tau_{\bm{w}_{0}}(\bm{W},\bm{X})$ with respect to $\bm{X}$ given $\bm{W}=\bm{w}$ is, on average, 0.2. Considering that we estimated an average ATE of 0.5 across study years when $\bm{W}=\bm{w}_{1}$ in Figure 4, this suggests there is a non-negligible amount of treatment effect heterogeneity. Figure 5 shows the posterior mean of the MTE-VIM corresponding to each covariate for each study year. Dual eligibility to Medicaid appears to be the strongest driving factor of treatment effect heterogeneity, followed by race and age. Figure 6 shows the value of these variable importance metrics when averaged over all study years, and we see that dual eligibility to Medicaid, race, and age achieved the largest variable importance measures, with overall means of 0.2, 0.18, and 0.04, respectively. This indicates that, on average, 20% of the treatment effect heterogeneity can only be explained by dual eligibility to Medicaid, even after controlling for other potential effect modifiers.

In addition to identifying the characteristics of zip-codes that modify the treatment effect the most, it is important to understand the nature and direction of this heterogeneity to better understand which groups are most susceptible to air pollution mixtures. To do this, we examine the conditional average treatment effect, using the same reference exposure levels as in the previous section, denoted by $\bm{w}_{1}$ and $\bm{w}_{0}$ . Specifically, we plot $\operatorname{\mathbb{E}}\{Y(\bm{w}_{1})-Y(\bm{w}_{0})|\bm{X}=\widetilde{\bm{x}}_{(j)}\}$ as a function of $x_{j}$ , where $\widetilde{\bm{x}}_{(j)}=(\makebox[0.0pt][l]{\hskip 1.11942pt\hskip 2.18436pt\makebox[0.0pt][c]{\rule[5.59721pt]{5.6604pt}{0.43057pt}}}{x}_{1},...,x_{j},...,\makebox[0.0pt][l]{\hskip 1.11942pt\hskip 2.18436pt\makebox[0.0pt][c]{\rule[5.59721pt]{5.6604pt}{0.43057pt}}}{x}_{p})$ . This sets all covariates to their sample mean and only varies covariate $j$ , so that we can examine whether the conditional average treatment effect increases or decreases with covariate $j$ . Within the framework of our model and our identification restrictions, $\operatorname{\mathbb{E}}\{Y(\bm{w}_{1})-Y(\bm{w}_{0})|\bm{X}=\widetilde{\bm{x}}_{(j)}\}$ simplifies to the difference in the interaction function, $h_{j}(\bm{w}_{1},x_{j})-h_{j}(\bm{w}_{0},x_{j})$ , which is zero when $x_{j}$ equals the mean covariate level, $\makebox[0.0pt][l]{\hskip 1.11942pt\hskip 2.18436pt\makebox[0.0pt][c]{\rule[5.59721pt]{5.6604pt}{0.43057pt}}}{x}_{j}$ . Therefore, when this difference is positive, it implies that individuals with the given covariate level are more susceptible to increases in the environmental mixture compared to those with an average covariate level.

Figure 7 shows the conditional average treatment effect curves described above for the covariates with large MTE-VIMs. We observe that dual eligibility shows the largest variation across the covariate values as expected by the largest MTE-VIM. We also see that the conditional average treatment effect increases as dual eligibility increases in most years and their credible intervals mostly overlap across the years, which indicates that the detrimental effects of air pollution are more pronounced in areas with low socioeconomic status. We also find that areas with older individuals are more adversely affected by increases in pollution except for the year 2011, though this effect is not as strong as that of dual eligibility to Medicaid. Although race, as measured by the percentage of the population that is white, achieved high variable importance measures, the direction of the heterogeneity is less clear as we see an increasing trend for certain years while we see a decreasing trend in others with wide credible intervals. Such mixed findings for race have been seen previously, as other studies have found the counter-intuitive finding that the effects of pollution are more severe in areas with higher percentages of white people (Bargagli-Stoffi et al., 2020). Race appears to play a modifying role in the effects of air pollution, but more research is required to understand the nature of this effect. Overall, our study shows that results from the previous literature (Simoni et al., 2015; Bargagli-Stoffi et al., 2020) that focused solely on univariate PM_2.5 exposure extend to the more complex setting of multivariate air pollution mixtures. We find a harmful effect of air pollution on mortality, and that this effect is heterogeneous with larger effects in areas with low socioeconomic status and older individuals.

7 Discussion

In this paper, we proposed a novel approach to analyze and summarize complex, heterogeneous effects of multivariate continuous exposures. We developed new estimands for this setting, including a treatment effect variable importance measure, which is tailored to multivariate exposure scenarios and provides interpretable quantities that simplify the heterogeneity and provide important epidemiological insights about the effects of air pollution mixtures. To facilitate identification of our estimands, and to allow different strengths of regularization for each component of the model, we proposed a separation of the outcome model regression function, and integrated a targeted smoothing method into our model to obtain smooth estimates of the degree of heterogeneity by each covariate. Our theoretical results, coupled with simulation studies, provide strong empirical and analytical support for the efficiency and validity of the proposed model. We used the model to gain new insights into the causal impact of multivariate air pollution mixtures on mortality rates in the Medicare population. Our model estimates a detrimental impact of air pollution, consistent with previous literature, and shows that this effect is more pronounced in zip-codes with lower socioeconomic status and older individuals.

While this paper provides strong evidence of heterogeneous treatment effects for environmental mixtures, and provides users with a novel methodology for the multivariate, continuous treatment setting, there are certain limitations with our approach. For one, our definition of the MTE-VIM requires the user to choose a reference exposure level $\bm{w}_{0}$ . While we expect most reasonable choices of $\bm{w}_{0}$ to lead to similar values of the MTE-VIM, results could be sensitive to this choice. We recommend users utilize a choice of $\bm{w}_{0}$ that does not violate the positivity assumption, which is a crucial consideration when analyzing multivariate exposures as noted in Antonelli and Zigler (2024). An additional limitation of our approach is that we made the assumption that the treatment effect heterogeneity was additive in the covariates, which allowed us to write $h(\bm{X},\bm{W})=\sum_{j=1}^{p}h_{j}(X_{j},\bm{W})$ . We find this to be a reasonable assumption in most cases, and one that facilitates estimation in a difficult setting, but more complex forms of heterogeneity may not be captured well by this additive structure.

Lastly, there are many future research directions to consider. One important direction would be to couple these estimation strategies with sensitivity analysis to unmeasured confounding, a common concern in observational studies such as this one. The recent literature has developed advancements in sensitivity analysis in multiple exposure settings for estimating average treatment effects (Zheng et al., 2021), and these ideas could be extended to heterogeneous treatment effects seen here. Another important direction would be to combine the approach developed in this manuscript with methodology for optimal policy estimation, in order to provide environmental regulators increased information about the best practices for reducing pollution in the future in order to obtain the largest public health benefit from a proposed reduction in air pollution levels.

Acknowledgments and Disclosure of Funding

Research described in this article was conducted under contract to the Health Effects Institute (HEI), an organization jointly funded by the United States Environmental Protection Agency (EPA) (Assistance Award No. CR-83590201) and certain motor vehicle and engine manufacturers. The contents of this article do not necessarily reflect the views of HEI, or its sponsors, nor do they necessarily reflect the views and policies of the EPA or motor vehicle and engine manufacturers. The computations in this paper were run on the FASRC FASSE cluster supported by the FAS Division of Science Research Computing Group at Harvard University. Danielle Braun and Kezia Irene were also funded by the following grants from the National Institute of Health; R01AG066793, RF1AG074372, R01AG066793, RF1AG071024, R01ES030616, R01ES034373, RF1AG080948, R01ES034021

Appendix A Proof of Theorem 1

Before showing the proof of the main result, we must first show the conditions on the SoftBART prior distribution that are required, which we refer to as Condition P. Specifically, we need that

(P1)

There exists positive constants $(C_{M1},C_{M2})$ such that the prior on the number of trees $T$ in the ensemble is $\Pi(T=t)=C_{M1}\exp\{-C_{M2}t\log t\}$ .
(P2)

A single bandwidth parameter $\tau_{t}\equiv\tau$ is used and its prior satisfies $\Pi(\tau\geq x)\leq C_{\tau 1}\exp(-x^{C_{\tau 2}})$ and $\Pi(\tau^{-1}\geq x)\leq C_{\tau 3}\exp(-x^{C_{\tau 4}})$ for some positive constants $C_{\tau 1},\ldots,C_{\tau 4}$ for all sufficiently large $x$ , with $C_{\tau 2},C_{\tau 4}<1$ . Moreover, the density of $\tau^{-1}$ satisfies $\pi_{\tau^{-1}}(x)\geq C_{\tau 5}e^{-C_{\tau 6}x}$ for large enough $x$ and some positive constants $C_{\tau 5}$ and $C_{\tau 6}$ .
(P3)

The prior on the splitting proportions is $s\sim\operatorname{Dirichlet}(a/r^{\xi},\ldots,a/r^{\xi})$ for some $\xi>1$ and $a<0$ , where $r$ is the dimension of the input space (i.e., $p$ for $f$ , $q$ for $g$ , and $p+q$ for $h$ ).
(P4)

The $\mu_{t\ell}$ ’s are i.i.d. from a density $\pi_{\mu}(\mu)$ such that $\pi_{\mu}(\mu)\geq C_{\mu 1}e^{-C_{\mu 2}|\mu|}$ for some coefficients $C_{\mu 1},C_{\mu 2}$ . Additionally, there exists constants $C_{\mu 3},C_{\mu 4}$ such that $\Pi(|\mu_{t\ell}|\geq x)\leq C_{\mu 3}\exp(-x^{C_{\mu 4}})$ for all $x$ .
(P5)

Let $D_{t}$ denote the depth of tree $\mathcal{T}_{t}$ . Then $\Pi(D_{t}=k)>0$ for all $k=0,1,\ldots,2d$ and $\Pi(D_{t}>d_{0})=0$ for $d_{0}\geq 2d$ where $d$ represents the maximum number of covariates on which the function depends.
(P6)

The gating function $\psi:\mathbb{R}\to[0,1]$ of the SoftBART prior is such that $\sup_{x}|\psi^{\prime}(x)|<\infty$ and the function $\rho(x)=\psi(x)\{1-\psi(x)\}$ is such that $\int\rho(x)\ dx>0$ , $\int|x|^{m}\,\rho(x)<\infty$ for all integers $m\geq 0$ , and $\rho(x)$ can be analytically extended to some strip $\{z:|\Im(z)|\leq U\}$ in the complex plane.

All conditions above can be achieved by using the default SoftBART prior described in Linero and Yang (2018). We consider the concentration of the posterior with respect to the norm $\|\cdot\|_{n}$ defined by $\|\mu\|_{n}^{2}=\frac{1}{n}\sum_{i=1}^{n}\mu(\bm{X}_{i},\bm{W}_{i})^{2}$ . We will first verify the following conditions (Condition C) for the prior and ground truth $\mu_{0}$ when Condition A and Condition P hold to establish the result.

(C1)

The prior satisfies $\Pi(\|\mu-\mu_{0}\|_{n}\leq\epsilon_{n})\geq e^{-Cn\epsilon_{n}^{2}}$ for sufficiently large $n$ and constant $C$ independent of $(n,p,q)$ .
(C2)

There exists a sieve $\mathcal{F}_{1}\subseteq\mathcal{F}_{2}\subseteq\cdots$ such that

$\log N(\mathcal{F}_{n},\bar{\epsilon}_{n},\|\cdot\|_{n})\leq C_{N}n\bar{\epsilon}_{n}^{2},$

for sufficiently large $n$ , where $C_{N}$ is a constant independent of $(n,p,q)$ and $\bar{\epsilon}_{n}$ is a constant multiple of $\epsilon_{n}$ . Here, $N(\mathcal{F},\delta,\|\cdot\|_{n})$ denotes the $\delta$ -covering number, which is the smallest number of $\delta$ -balls required to cover $\mathcal{F}$ with respect to $\|\cdot\|_{n}$ .
(C3)

For the same $\mathcal{F}_{n}$ from C2 and same constant $C$ from C1, we have

$\Pi(\mu\notin\mathcal{F}_{n})\leq e^{-(C+4)n\epsilon_{n}^{2}},$

for sufficiently large $n$ .

Note that it suffices to consider $\|\cdot\|_{\infty}$ in place of $\|\cdot\|_{n}$ for each of (C1)–(C3) because trivially $\|\mu\|_{n}\leq\|\mu\|_{\infty}$ . Following from the results of Ghosal and Van Der Vaart (2007) (see, in particular, Theorem 1, Theorem 4, Section 7.7, and the comments following Theorem 2), verifying Condition C is sufficient to prove Theorem 1.

Assuming Conditions A and P, we can state the following useful results, which follow from the proof of Theorem 2 of Linero and Yang (2018).

Proposition 2

Suppose that Condition A holds, and independent priors are specified for $(f,g,h)$ such that Condition P holds for each. Then for any $c>0$ there exist constants and $C_{f}$ independent of $(n,p,q)$ such that for sufficiently large $n$ we have

\Pi(\|f-f_{0}\|_{\infty}\leq c\epsilon_{nf})\geq e^{-C_{f}n\epsilon_{nf}^{2}}.

Similarly, there exist $C_{g}$ and $C_{h}$ such that $\Pi(\|g-g_{0}\|_{\infty}\leq c\epsilon_{ng})\geq e^{-C_{g}n\epsilon_{ng}^{2}}$ and $\Pi(\|h-h_{0}\|\leq c\epsilon_{nh})\geq e^{-C_{h}n\epsilon_{nh}^{2}}$ .

The following lemma now verifies Condition (C1).

Lemma 3

Suppose that Condition (A1) holds with $\sigma^{2}$ fixed and that the conditions of Proposition 2 also hold. Then there exists a constant $C$ such that for sufficiently large $n$

\Pi(\|\mu-\mu_{0}\|_{n}\leq\epsilon_{n})\geq e^{-Cn\epsilon_{n}^{2}}

where $\epsilon_{n}=\max(\epsilon_{nf},\epsilon_{ng},\epsilon_{nf})$ .

Proof First, we note that by the triangle inequality and prior independence that

\displaystyle\Pi(\|\mu-\mu_{0}\|_{\infty}\leq\epsilon_{n})

\displaystyle\geq\Pi(\|f-f_{0}\|_{\infty}\leq\epsilon_{n}/3)\Pi(\|g-g_{0}\|_{\infty}\leq\epsilon_{n}/3)\Pi(\|h-h_{0}\|_{\infty}\leq\epsilon_{n}/3).

Applying Proposition 2 with $c=1/3$ and the definition of $\epsilon_{n}$ this gives

\Pi(\|\mu-\mu_{0}\|_{\infty}\leq\epsilon_{n})\geq e^{-(C_{f}+C_{g}+C_{h})n\epsilon_{n}^{2}}=e^{-Cn\epsilon_{n}^{2}}

where $C=C_{f}+C_{g}+C_{h}$ .
Next, we prove the existence of an appropriate sieve to establish (C2)–(C3). We use the following result, which is proven in the supplementary material of Linero and Yang (2018), to do this.

Lemma 4

Suppose that Condition P holds. For universal constants $(D,E)$ (depending only on the prior) and for sufficiently large $(T,d,U,\sigma_{2})$ and sufficiently small $(\sigma_{1},\delta)$ , there exists a set $\mathcal{G}_{f}$ with the following properties:

1. Covering entropy control: $\log N(\mathcal{G}_{f},D\delta,\|\cdot\|_{\infty})\leq d\log p+3T2^{d_{0}}\log\left(\frac{d\sigma^{2}_{2}T2^{d_{0}}U}{\delta}\right)$

2. Complement probability bound: If $\Pi_{f}$ satisfies Condition P then

	$\displaystyle\Pi_{\mu}(\mathcal{G}_{f}^{c})$	$\displaystyle\leq C^{\prime}_{M1}\exp\{-C^{\prime}_{M2}T\log(T)\}+$
		$\displaystyle\qquad 2^{d_{0}}T\left[\exp\{-Ed\log p\}+C_{\mu 1}\exp\{-U^{C_{\mu 2}}\}\right]+$
		$\displaystyle\qquad 2C_{\tau_{1}}\,\exp\{-\sigma_{1}^{-C_{\tau 2}}\}+2C_{\tau_{3}}\exp\{-\sigma_{2}^{C_{\tau_{4}}}\},$

Similarly, sets $\mathcal{G}_{g}$ and $\mathcal{G}_{h}$ exist with analogous statements holding for $\Pi_{g}$ and $\Pi_{h}$ .

Using this lemma, we can construct an appropriate sieve as follows. First, define the Cartesian product $\mathcal{F}=\mathcal{G}_{f}\times\mathcal{G}_{g}\times\mathcal{G}_{h}$ for appropriately large/small values of $(T,d,U,\sigma_{2},\sigma_{1},\epsilon)$ . Then we immediately have

\log N(\mathcal{F},D\delta,\|\cdot\|_{\infty})\leq\log N(\mathcal{G}_{f},D\delta,\|\cdot\|_{\infty})+\log N(\mathcal{G}_{g},D\delta,\|\cdot\|_{\infty})+\log N(\mathcal{G}_{h},D\delta,\|\cdot\|_{\infty}).

Following Li et al. (2022), for a large constant $\kappa$ to be selected later, take

\sigma_{1}^{-C_{\tau 2}}=\sigma_{2}^{C_{\tau 4}}=U^{C_{\mu 2}}=\kappa n\epsilon_{n}^{2},

$d=\lfloor\kappa n\epsilon_{n}^{2}/\log(p+q+1)\rfloor$ and $T=\lfloor\kappa n\epsilon_{n}^{2}/\log n\rfloor$ . By substituting these constants into the covering entropy bound, we can obtain the inequality

\log N(\mathcal{F},D\epsilon_{n},\|\cdot\|_{\infty})\leq\kappa^{\prime}n\epsilon_{n}^{2}

for an appropriately chosen $\kappa^{\prime}>\kappa$ and large enough $n$ , which verifies (C2). Additionally, by plugging these choices into the complementary probability bound, we can also make

\Pi(\mathcal{F}^{c})\leq\Pi(f\in\mathcal{G}_{f}^{c})+\Pi(g\in\mathcal{G}_{g}^{c})+\Pi(h\in\mathcal{G}_{h}^{c})\leq\exp\{-(C+4)n\epsilon_{n}^{2}\}

for sufficiently large $n$ by taking $\kappa$ sufficiently large as desired. (C1)–(C3) are verified.

Appendix B Proof of identifiability

We prove that each function component in Equation 1 is identifiable by incorporating the shifting strategy introduced in Section 3.1.

Proposition 5

Suppose $\mathcal{S}_{\bm{x}_{0},\bm{w}_{0}}$ denotes a space of a tuple $\big{(}c,f(\cdot),g(\cdot),h(\cdot)\big{)}$ that satisfies $f(\bm{x}_{0})=g(\bm{w}_{0})=0$ and $h\{\bm{x}_{0},\bm{w}\}=h\{\bm{x},\bm{w}_{0}\}=0$ for all $\bm{x}\in\mathbb{R}^{p}$ and $\bm{w}\in\mathbb{R}^{q}$ . If

1.

$\big{\{}\big{(}c,f(\cdot),g(\cdot),h(\cdot)\big{)},\big{(}\widetilde{c},\widetilde{f}(\cdot),\widetilde{g}(\cdot),\widetilde{h}(\cdot)\big{)}\big{\}}\subset\mathcal{S}$ , and
2.

$c+f(\bm{x})+g(\bm{w})+h(\bm{x},\bm{w})=\widetilde{c}+\widetilde{f}(\bm{x})+\widetilde{g}(\bm{w})+\widetilde{h}(\bm{x},\bm{w})$ for all $\bm{x}\in\mathbb{R}^{p}$ and $\bm{w}\in\mathbb{R}^{q}$ ,

then $\big{(}c,f(\cdot),g(\cdot),h(\cdot,\cdot)\big{)}=\big{(}\widetilde{c},\widetilde{f}(\cdot),\widetilde{g}(\cdot),\widetilde{h}(\cdot,\cdot)\big{)}$ .

Proof Let $\bm{x}=\bm{x}_{0}$ and $\bm{w}=\bm{w}_{0}$ . Since both tuples belong to $\mathcal{S}_{\bm{x}_{0},\bm{w}_{0}}$ , the second condition becomes $c=\widetilde{c}$ . Next, we fix $\bm{x}=\bm{x}_{0}$ while allowing $\bm{w}$ to vary. Then, we have $c+g(\bm{w})=\widetilde{c}+\widetilde{g}(\bm{w})\Leftrightarrow c+g(\bm{w})=c+\widetilde{g}(\bm{w})\Leftrightarrow g(\bm{w})=\widetilde{g}(\bm{w})$ for all $\bm{w}\in\mathbb{R}^{q}$ . Analogously, setting $\bm{w}=\bm{w}_{0}$ yields $f(\bm{x})=\widetilde{f}(\bm{x})$ for all $\bm{x}\in\mathbb{R}^{p}$ . Combining the above results produces $c+f(\bm{x})+g(\bm{w})=\widetilde{c}+\widetilde{f}(\bm{x})+\widetilde{g}(\bm{w})$ for all $\bm{x}\in\mathbb{R}^{p}$ and $\bm{w}\in\mathbb{R}^{q}$ . Hence, it follows that $h(\bm{x},\bm{w})=\widetilde{h}(\bm{x},\bm{w})$ for all $\bm{x}\in\mathbb{R}^{p}$ and $\bm{w}\in\mathbb{R}^{q}$ . Setting $\bm{x}_{0}=\operatorname{\mathbb{E}}(\bm{X})$ and $\bm{w}_{0}=\operatorname{\mathbb{E}}(\bm{W})$ concludes the proof.

Appendix C Simulations for the blocking scheme

In this section, we present additional simulation results to support the use of the blocking scheme in calculating variable importance measures, as described in Section 3. We employ the same simulation scenario as in Section 3, where strong interactions exist between covariates and exposures. Recall that the true parameter of interest $(\psi_{1},\psi_{2},\psi_{3},\psi_{4},\psi_{5})=(0.722,0.278,0,0,0)$ . We compare three methods for estimating the parameter: 1) using a randomly chosen 100 observations from the original data set (Small); 2) using all observations (Full); 3) partitioning observations into ten blocks and averaging ten estimates (Block).

Table 1 presents the relative RMSE compared to the blocking scheme averaged over 300 data replicates. A relative RMSE greater than 1 indicates a larger RMSE than the blocking scheme. We also present the histograms of the point estimates of $\psi_{2}$ for each method in Figure 8. Using just one subset of the original observations results in a significantly larger RMSE as it has a larger variability in estimating the parameter, which is expected since it doesn’t incorporate all data points. The blocking scheme, which simply averages estimates from ten small subsets, shows a comparable estimation performance to the “Full” method which requires significantly more intensive computation.

Estimand	Small	Full	Block
$\psi_{1}$	1.29	0.99	1
$\psi_{2}$	1.66	0.99	1
$\psi_{3}$	1.23	1.04	1
$\psi_{4}$	1.59	1.17	1
$\psi_{5}$	1.66	0.99	1

Table 1: Relative RMSE of estimating MTE-VIM.

References

Agier et al. (2016) Lydiane Agier, Lützen Portengen, Marc Chadeau-Hyam, Xavier Basagaña, Lise Giorgis-Allemand, Valérie Siroux, Oliver Robinson, Jelle Vlaanderen, Juan R González, Mark J Nieuwenhuijsen, et al. A systematic comparison of linear regression–based statistical methods to assess exposome-health associations. Environmental health perspectives, 124(12):1848–1856, 2016.
Antonelli and Zigler (2024) Joseph Antonelli and Corwin Zigler. Causal analysis of air pollution mixtures: Estimands, positivity, and extrapolation. arXiv preprint arXiv:2401.17385, 2024.
Antonelli et al. (2020) Joseph Antonelli, Maitreyi Mazumdar, David Bellinger, David Christiani, Robert Wright, and Brent Coull. Estimating the health effects of environmental mixtures using bayesian semiparametric regression and sparsity inducing priors. The Annals of Applied Statistics, 14(1):257–275, 2020.
Athey and Imbens (2016) Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
Banerjee et al. (2013) Anjishnu Banerjee, David B Dunson, and Surya T Tokdar. Efficient gaussian process regression for large datasets. Biometrika, 100(1):75–89, 2013.
Bang and Robins (2005) Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
Bargagli-Stoffi et al. (2020) Falco J Bargagli-Stoffi, Riccardo Cadei, Kwonsang Lee, and Francesca Dominici. Causal rule ensemble: Interpretable discovery and inference of heterogeneous causal effects. arXiv preprint arXiv:2009.09036, 2020.
Bobb et al. (2015) Jennifer F Bobb, Linda Valeri, Birgit Claus Henn, David C Christiani, Robert O Wright, Maitreyi Mazumdar, John J Godleski, and Brent A Coull. Bayesian kernel machine regression for estimating the health effects of multi-pollutant mixtures. Biostatistics, 16(3):493–508, 2015.
Boss et al. (2021) Jonathan Boss, Alexander Rix, Yin-Hsiu Chen, Naveen N Narisetty, Zhenke Wu, Kelly K Ferguson, Thomas F McElrath, John D Meeker, and Bhramar Mukherjee. A hierarchical integrative group least absolute shrinkage and selection operator for analyzing environmental mixtures. Environmetrics, 32(8):e2698, 2021.
Breiman (2001) Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
Carrico et al. (2015) Caroline Carrico, Chris Gennings, David C Wheeler, and Pam Factor-Litvak. Characterization of weighted quantile sum regression for highly correlated data in a risk analysis setting. Journal of agricultural, biological, and environmental statistics, 20(1):100–120, 2015.
Chang and Roy (2024) Peter Chang and Arkaprava Roy. Individualized multi-treatment response curves estimation using rbf-net with shared neurons. arXiv preprint arXiv:2401.16571, 2024.
Chipman et al. (2010) Hugh A Chipman, Edward I George, and Robert E McCulloch. Bart: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010.
Di et al. (2019) Qian Di, Heresh Amini, Liuhua Shi, Itai Kloog, Rachel Silvern, James Kelly, M. Benjamin Sabath, Christine Choirat, Petros Koutrakis, Alexei Lyapustin, Yujie Wang, Loretta J. Mickley, and Joel Schwartz. An ensemble-based model of pm2.5 concentration across the contiguous united states with high spatiotemporal resolution. Environment International, 130:104909, 2019.
Di et al. (2021) Qian Di, Yaguang Wei, Alexandra Shtein, Carolynne Hultquist, Xiaoshi Xing, Heresh Amini, Liuhua Shi, I Kloog, R Silvern, J Kelly, et al. Daily and annual pm2. 5 concentrations for the contiguous united states, 1-km grids, v1 (2000-2016). NASA Socioeconomic Data and Applications Center (SEDAC), 2021. doi: https://doi.org/10.7927/0rvr-4538.
Dominici et al. (2010) Francesca Dominici, Roger D Peng, Christopher D Barr, and Michelle L Bell. Protecting human health from air pollution: shifting from a single-pollutant to a multi-pollutant approach. Epidemiology (Cambridge, Mass.), 21(2):187, 2010.
Dorie et al. (2019) Vincent Dorie, Jennifer Hill, Uri Shalit, Marc Scott, and Dan Cervone. Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1):43–68, 2019.
Fan et al. (2022) Qingliang Fan, Yu-Chin Hsu, Robert P Lieli, and Yichong Zhang. Estimation of conditional average treatment effects with high-dimensional data. Journal of Business & Economic Statistics, 40(1):313–327, 2022.
Ferrari and Dunson (2020) Federico Ferrari and David B Dunson. Identifying main effects and interactions among exposures using gaussian processes. The annals of applied statistics, 14(4):1743, 2020.
Ferrari and Dunson (2021) Federico Ferrari and David B Dunson. Bayesian factor analysis for inference on interactions. Journal of the American Statistical Association, 116(535):1521–1532, 2021.
Ghosal and Van Der Vaart (2007) Subhashis Ghosal and Aad Van Der Vaart. Convergence rates of posterior distributions for noniid observations. 2007.
Gibson et al. (2019) Elizabeth A Gibson, Jeff Goldsmith, and Marianthi-Anna Kioumourtzoglou. Complex mixtures, complex analyses: an emphasis on interpretable results. Current environmental health reports, 6(2):53–61, 2019.
Hahn et al. (2020) P Richard Hahn, Jared S Murray, and Carlos M Carvalho. Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects (with discussion). Bayesian Analysis, 15(3):965–1056, 2020.
Herring (2010) Amy H Herring. Nonparametric bayes shrinkage for assessing exposures to mixtures subject to limits of detection. Epidemiology (Cambridge, Mass.), 21(Suppl 4):S71, 2010.
Hill et al. (2020) Jennifer Hill, Antonio Linero, and Jared Murray. Bayesian additive regression trees: A review and look forward. Annual Review of Statistics and Its Application, 7(1), 2020.
Hines et al. (2022) Oliver Hines, Karla Diaz-Ordaz, and Stijn Vansteelandt. Variable importance measures for heterogeneous causal effects. arXiv preprint arXiv:2204.06030, 2022.
Levy et al. (2021) Jonathan Levy, Mark van der Laan, Alan Hubbard, and Romain Pirracchio. A fundamental measure of treatment effect heterogeneity. Journal of Causal Inference, 9(1):83–108, 2021.
Li et al. (2023) Haodong Li, Alan Hubbard, and Mark van der Laan. Targeted learning on variable importance measure for heterogeneous treatment effect. arXiv preprint arXiv:2309.13324, 2023.
Li et al. (2022) Yinpu Li, Antonio R Linero, and Jared Murray. Adaptive conditional distribution estimation with bayesian decision tree ensembles. Journal of the American Statistical Association, pages 1–14, 2022.
Linero (2017) Antonio R Linero. A review of tree-based bayesian methods. Communications for Statistical Applications and Methods, 24(6):543–559, 2017.
Linero and Yang (2018) Antonio R Linero and Yun Yang. Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5):1087–1110, 2018.
Narisetty et al. (2019) Naveen N Narisetty, Bhramar Mukherjee, Yin-Hsiu Chen, Richard Gonzalez, and John D Meeker. Selection of nonlinear interactions by a forward stepwise algorithm: Application to identifying environmental chemical mixtures affecting health outcomes. Statistics in medicine, 38(9):1582–1600, 2019.
Requia et al. (2021) WJ Requia, Y Wei, A Shtein, C Hultquist, X Xing, Q Di, et al. Daily 8-hour maximum and annual o3 concentrations for the contiguous united states, 1-km grids, v1 (2000-2016). NASA Socioeconomic Data and Applications Center (SEDAC), 2021.
Rosenbaum and Rubin (1983) Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
Rubin (1980) Donald B Rubin. Randomization analysis of experimental data: The fisher randomization test comment. Journal of the American statistical association, 75(371):591–593, 1980.
Samanta and Antonelli (2022) Srijata Samanta and Joseph Antonelli. Estimation and false discovery control for the analysis of environmental mixtures. Biostatistics, 23(4):1039–1055, 2022.
Semenova and Chernozhukov (2021) Vira Semenova and Victor Chernozhukov. Debiased machine learning of conditional average treatment effects and other causal functions. The Econometrics Journal, 24(2):264–289, 2021.
Shin and Antonelli (2023) Heejun Shin and Joseph Antonelli. Improved inference for doubly robust estimators of heterogeneous treatment effects. Biometrics, 00:1–13, 2023.
Shin et al. (2023) Heejun Shin, Danielle Braun, Kezia Irene, and Joseph Antonelli. A spatial interference approach to account for mobility in air pollution studies with multivariate continuous treatments. arXiv preprint arXiv:2305.14194, 2023.
Simoni et al. (2015) Marzia Simoni, Sandra Baldacci, Sara Maio, Sonia Cerrai, Giuseppe Sarno, and Giovanni Viegi. Adverse effects of outdoor pollution in the elderly. Journal of thoracic disease, 7(1):34, 2015.
Stafoggia et al. (2017) Massimo Stafoggia, Susanne Breitner, Regina Hampel, and Xavier Basagaña. Statistical approaches to address multi-pollutant mixtures and multiple exposures: the state of the science. Current environmental health reports, 4(4):481–490, 2017.
Starling et al. (2020) Jennifer E Starling, Jared S Murray, Carlos M Carvalho, Radek K Bukowski, and James G Scott. Bart with targeted smoothing: An analysis of patient-specific stillbirth risk. The Annals of Applied Statistics, 14(1):28–50, 2020.
Van Donkelaar et al. (2019) Aaron Van Donkelaar, Randall V Martin, Chi Li, and Richard T Burnett. Regional estimates of chemical composition of fine particulate matter using a combined geoscience-statistical method with information from satellites, models, and monitors. Environmental science & technology, 53(5):2595–2611, 2019.
Wager and Athey (2018) Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
Wang et al. (2020) Bingyu Wang, Ki-Do Eum, Fatemeh Kazemiparkouhi, Cheng Li, Justin Manjourides, Virgil Pavlu, and Helen Suh. The impact of long-term pm2. 5 exposure on specific causes of death: exposure-response curves and effect modification among 53 million us medicare beneficiaries. Environmental Health, 19(1):1–12, 2020.
Wei et al. (2020) Ran Wei, Brian J Reich, Jane A Hoppin, and Subhashis Ghosal. Sparse bayesian additive nonparametric regression with application to health effects of pesticides mixtures. Statistica Sinica, 30(1):55–79, 2020.
Williamson et al. (2021) Brian D Williamson, Peter B Gilbert, Marco Carone, and Noah Simon. Nonparametric variable importance assessment using machine learning techniques. Biometrics, 77(1):9–22, 2021.
Zhang and Janson (2020) Lu Zhang and Lucas Janson. Floodgate: inference for model-free variable importance. arXiv preprint arXiv:2007.01283, 2020.
Zheng et al. (2021) Jiajing Zheng, Alexander D’Amour, and Alexander Franks. Bayesian inference and partial identification in multi-treatment causal inference with unobserved confounding. arXiv preprint arXiv:2111.07973, 2021.