This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

\AtAppendix

Treatment Effect Heterogeneity and Importance Measures for Multivariate Continuous Treatments

\nameHeejun Shin \email[email protected]
\addrDepartment of Statistics, University of Florida, Gainesville, FL 32611, USA \AND\nameAntonio Linero \email[email protected]
\addrDepartment of Statistics and Data Science, University of Texas at Austin, Austin, TX 78712, USA \AND\nameMichelle Audirac \email[email protected]
\addrDepartment of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA \AND\nameKezia Irene \email[email protected]
\addrDepartment of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA \AND\nameDanielle Braun \email[email protected]
\addrDepartment of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA;
\addrDepartment of Data Science, Dana-Farber Cancer Institute, Boston, MA 02115, USA \AND\nameJoseph Antonelli \email[email protected]
\addrDepartment of Statistics, University of Florida, Gainesville, FL 32611, USA
Abstract

Estimating the joint effect of a multivariate, continuous exposure is crucial, particularly in environmental health where interest lies in simultaneously evaluating the impact of multiple environmental pollutants on health. We develop novel methodology that addresses two key issues for estimation of treatment effects of multivariate, continuous exposures. We use nonparametric Bayesian methodology that is flexible to ensure our approach can capture a wide range of data generating processes. Additionally, we allow the effect of the exposures to be heterogeneous with respect to covariates. Treatment effect heterogeneity has not been well explored in the causal inference literature for multivariate, continuous exposures, and therefore we introduce novel estimands that summarize the nature and extent of the heterogeneity, and propose estimation procedures for new estimands related to treatment effect heterogeneity. We provide theoretical support for the proposed models in the form of posterior contraction rates and show that it works well in simulated examples both with and without heterogeneity. We apply our approach to a study of the health effects of simultaneous exposure to the components of PM2.5 and find that the negative health effects of exposure to these environmental pollutants is exacerbated by low socioeconomic status and age.

Keywords: Causal inference, Bayesian nonparametrics, Environmental mixtures, Treatment effect heterogeneity, Variable importance measures

1 Introduction

An important scientific question is understanding the effect of a multivariate treatment on an outcome, particularly in environmental health where individuals are exposed simultaneously to multiple pollutants and it is of interest to understand the joint impact of each of these pollutants on public health. Further, the effects of each of these pollutants might be heterogeneous with respect to characteristics of the individuals exposed, and it is important to account for and understand this treatment effect heterogeneity. Ignoring this heterogeneity can lead to biased estimates of the average effect of the treatments on the outcome, and can also mask substantial effects of the treatment on specific subgroups of the population. In this paper, we address this problem by developing methodology for estimating heterogeneous treatment effects of continuous, multivariate treatments.

In the past decade, there has been an increasing interest in analyzing the effects of multivariate exposures in environmental health where the exposures are referred to as environmental mixtures (Dominici et al., 2010; Agier et al., 2016; Stafoggia et al., 2017; Gibson et al., 2019; Shin et al., 2023). Analyzing environmental mixtures is a challenging problem for a number of reasons. For one, the scientific goal is not simply predicting the outcome, but rather understanding the effects of each of the individual exposures, and whether these exposures interact with each other. Therefore, off-the-shelf modern regression strategies such as tree-based models (Breiman, 2001; Chipman et al., 2010) and Gaussian processes (Banerjee et al., 2013) do not immediately apply. A number of studies in the environmental statistics literature have tailored these methods for multivariate exposures. A common theme among these methods is the use of nonparametric Bayesian models. Gaussian processes are adapted for this purpose in the popular Bayesian kernel machine regression (BKMR, Bobb et al. (2015)) or in related work that explicitly identifies interactions between exposures (Ferrari and Dunson, 2020). Related Bayesian approaches use basis function expansions to identify important exposures or interactions among exposures for environmental mixtures (Antonelli et al., 2020; Wei et al., 2020; Samanta and Antonelli, 2022). Other, related approaches have been developed all with the goal of estimating the health effects of environmental mixtures and identifying exposures that affect the outcome (Herring, 2010; Carrico et al., 2015; Narisetty et al., 2019; Boss et al., 2021; Ferrari and Dunson, 2021).

While the aforementioned approaches are useful for multivariate, continuous exposures, they all implicitly assume that the effect of the environmental exposures is the same across the entire population. This assumption does not hold in the context of ambient air pollution, where effects have been shown to vary by characteristics such as age, race, and location (Wang et al., 2020; Bargagli-Stoffi et al., 2020). Treatment effect heterogeneity has seen an explosion of interest in the causal inference literature, though it is typically restricted to a single treatment, either binary (Athey and Imbens, 2016; Wager and Athey, 2018; Hahn et al., 2020; Semenova and Chernozhukov, 2021; Fan et al., 2022; Shin and Antonelli, 2023) or with multiple levels (Chang and Roy, 2024). In the binary setting, heterogeneity is typically summarized by the conditional average treatment effect (CATE) function, 𝔼{Y(1)Y(0)|𝑿=𝒙}\operatorname{\mathbb{E}}\{Y(1)-Y(0)|\bm{X}=\bm{x}\} where Y(w)Y(w) denotes the potential outcome under a binary exposure level ww. While the CATE is a functional estimand, a number of summaries of this function have been proposed to simplify heterogeneity further. One univariate summary is the variance of the CATE (VTE, Levy et al. (2021)), which describes the overall degree of heterogeneity. Related quantities of interest to our study are treatment effect variable importance measures (TE-VIM), which have been recently proposed for univariate treatments in Hines et al. (2022) and Li et al. (2023) to learn which covariates are key factors driving treatment effect heterogeneity. TE-VIM relates regression-based variable importance measures (Zhang and Janson, 2020; Williamson et al., 2021) and the VTE of Levy et al. (2021) within a causal inference framework. Nearly all of this work on treatment effect heterogeneity focuses on binary treatments, and there has been little to no work on the heterogeneous effects of multivariate, continuous exposures.

There exists a substantial gap in the literature on how to both define estimands for continuous, multivariate exposures and for estimating these quantities using observed data and flexible modeling approaches that make as few modeling assumptions as possible. In this paper, we aim to fill this gap by first proposing new estimands in this setting that are interpretable and help describe the complex effect of multiple treatments, and how this effect varies by observed covariates. We extend treatment effect variable importance measures to this more complex setting and describe how to identify and estimate these estimands from observed data. These measures provide researchers with insight into causal effect modifiers for multivariate, continuous exposures and can provide policy-makers with detailed information about how treatments impact outcomes. Additionally, we develop novel estimation strategies using nonparametric Bayesian methodology. By decomposing outcome regression surfaces into distinct, identifiable functions, we are able to explicitly shrink treatment effects towards homogeneity, while still allowing for heterogeneity when it exists. We also make limited modeling assumptions as we use novel extensions of Bayesian additive regression trees (BART) such as the SoftBART prior distribution (Linero and Yang, 2018) and targeted smoothing BART (Starling et al., 2020; Li et al., 2022) for interactions between exposures and covariates to allow for heterogeneity without imposing strong parametric assumptions.

2 Estimands of Interest

Throughout, we assume that we observe 𝓓={(Yi,𝑿i,𝑾i)}i=1n\bm{\mathcal{D}}=\{(Y_{i},\bm{X}_{i},\bm{W}_{i})\}_{i=1}^{n} for each individual where Yi,𝑿ipY_{i}\in\mathbb{R},\bm{X}_{i}\in\mathbb{R}^{p}, and 𝑾iq\bm{W}_{i}\in\mathbb{R}^{q}. Also, we adopt the potential outcomes framework: Yi(𝒘)Y_{i}(\bm{w}) denotes the potential outcome that would be observed under exposure 𝒘\bm{w}. In order to denote potential outcomes in this manner, and to identify causal effects from the observed data, we make the following assumptions:

  1. (i)

    SUTVA (Stable Unit Treatment Value Assumption, Rubin (1980)): Treatments of one unit do not affect the potential outcomes of other units (no interference), and there are not different versions of treatments, such that Yi=Yi(𝑾i)Y_{i}=Y_{i}(\bm{W}_{i}).

  2. (ii)

    Positivity: 0<fW|X(𝒘|𝑿=𝒙)0<f_{W|X}(\bm{w}|\bm{X}=\bm{x}) for all 𝒙\bm{x} and 𝒘\bm{w}, where fW|Xf_{W|X} is the conditional density of the treatments given covariates.

  3. (iii)

    Unconfoundedness: Y(𝒘)𝑾|𝑿Y(\bm{w})\mathchoice{\mathrel{\hbox to0.0pt{$\displaystyle\perp$\hss}\mkern 2.0mu{\displaystyle\perp}}}{\mathrel{\hbox to0.0pt{$\textstyle\perp$\hss}\mkern 2.0mu{\textstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptstyle\perp$\hss}\mkern 2.0mu{\scriptstyle\perp}}}{\mathrel{\hbox to0.0pt{$\scriptscriptstyle\perp$\hss}\mkern 2.0mu{\scriptscriptstyle\perp}}}\bm{W}|\bm{X} for all 𝒘\bm{w}.

The positivity assumption guarantees that each unit has a non-zero probability to be exposed to each treatment level for all possible values of pre-treatment variables, at least in large samples. Unconfoundedness requires that the treatment WW is independent of the potential outcomes, and it implies that there exist no unmeasured variables that confound the treatment - outcome relationship. Because exposures and covariates are both multivariate, there are many causal estimands one could define. In the following sections, we first define estimands targeting both average treatment effects and treatment effect heterogeneity, as well as variable importance measures summarizing this heterogeneity.

2.1 Marginal and Heterogeneous Effects Estimands

One potential estimand to look at in this scenario is 𝔼[Y(𝒘)|𝑿=𝒙]\operatorname{\mathbb{E}}[Y(\bm{w})|\bm{X}=\bm{x}], however, there are infinitely many values that 𝒘\bm{w} can take and therefore it is difficult to interpret this estimand. One option is to look at values of 𝒘\bm{w} where all but one of the exposures is fixed, and then visualize this estimand as a function of one exposure. A simpler setting is one in which there are two exposure levels 𝒘1\bm{w}_{1} and 𝒘0\bm{w}_{0}, under which we want to compare the outcomes. For example, one may be interested in examining what would happen if an intervention was applied that lowered the level of pollution by a specific amount. In this case, one can use the pollution levels without the intervention as 𝒘0\bm{w}_{0} and those with the intervention as 𝒘1\bm{w}_{1}. Heterogeneity is therefore described by

τ𝒘1,𝒘0(𝒙)𝔼{Y(𝒘1)Y(𝒘0)|𝑿=𝒙}.\tau_{\bm{w}_{1},\bm{w}_{0}}(\bm{x})\equiv\operatorname{\mathbb{E}}\{Y(\bm{w}_{1})-Y(\bm{w}_{0})|\bm{X}=\bm{x}\}.

If interest is in marginal treatment effects, this quantity can be averaged over the covariate distribution to obtain a marginal effect

τ¯𝒘1,𝒘0=𝔼[τ𝒘1,𝒘0(𝑿)]\overline{\tau}_{\bm{w}_{1},\bm{w}_{0}}=\operatorname{\mathbb{E}}[\tau_{\bm{w}_{1},\bm{w}_{0}}(\bm{X})]

This problem effectively reduces to the binary treatment setting where the conditional average treatment effect (CATE) is originally defined as there are only two exposure levels of interest. Because of this, many existing approaches to summarizing effect heterogeneity from the existing literature can be used. The variance of this function, Var[τ𝒘1,𝒘0(𝑿)]\text{Var}[\tau_{\bm{w}_{1},\bm{w}_{0}}(\bm{X})], or the treatment effect variable importance measures of Hines et al. (2022) can be incorporated analogously. For this reason, in the following section, we focus on the more complex setting where we have more than two exposure levels of interest, yet we want to study heterogeneity and examine the impact of each covariate in the degree of heterogeneity.

2.2 Multivariate Treatment Effect Variable Importance Measures

In many scenarios, we have more than two exposure levels of interest, yet we still want to define relevant quantities that provide information about treatment effect heterogeneity and which covariates are driving this heterogeneity. For this section, we assume that we have a reference level of the exposure denoted by 𝒘0\bm{w}_{0}, which can be the average exposure level, or some other value chosen a priori. It is important to note that the choice of 𝒘0\bm{w}_{0} should not greatly affect results. We can define the following quantity:

τ𝒘0(𝒙,𝒘)𝔼[Y(𝒘)Y(𝒘0)|𝑿=𝒙],\tau_{\bm{w}_{0}}(\bm{x},\bm{w})\equiv\operatorname{\mathbb{E}}[Y(\bm{w})-Y(\bm{w}_{0})|\bm{X}=\bm{x}],

which describes how the potential outcome surface varies by both 𝒙\bm{x} and 𝒘\bm{w}. This function is difficult to interpret as it is a function of two multivariate arguments. We propose multivariate treatment effect variable importance measures (MTE-VIM) to summarize how much the heterogeneity of τ𝒘0(𝒙,𝒘)\tau_{\bm{w}_{0}}(\bm{x},\bm{w}) is driven by each particular covariate.

Before defining the variable importance metric, we must first define the overall amount of heterogeneity of the treatment effect, which we define as

ϕ=𝔼𝑾[Var𝑿{τ𝒘0(𝑿,𝑾)}].\phi=\operatorname{\mathbb{E}}_{\bm{W}}[\operatorname{Var}_{\bm{X}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}].

This looks at the variability of the treatment effects as a function of 𝑿\bm{X} for a fixed exposure level, but this variability may differ for different values of exposures 𝑾\bm{W}, and therefore we average this variability across the range of exposures. Note that instead of the marginal variance of 𝑿\bm{X}, we could have alternatively used the conditional variance of 𝑿\bm{X} given 𝑾\bm{W}. This would also provide a useful measure of the overall amount of heterogeneity, but we will see in Section 3.3 that it would complicate our estimation of this quantity as it would require a model for the distribution of 𝑿\bm{X} given 𝑾\bm{W}, which is difficult to specify given the dimension of both the covariates and exposures. Next, we define

ϕj=𝔼𝑾(Var𝑿j[𝔼Xj|𝑿j{τ𝒘0(𝑿,𝑾)}])\phi_{j}=\operatorname{\mathbb{E}}_{\bm{W}}\left(\operatorname{Var}_{\bm{X}_{-j}}\left[\operatorname{\mathbb{E}}_{X_{j}|\bm{X}_{-j}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}\right]\right)

where 𝑿j\bm{X}_{-j} denotes the p1p-1 remaining covariates without the jj-th covariate XjX_{j}. This is similar to ϕ\phi but we first take the conditional expectation of τ𝒘0(𝑿,𝑾)\tau_{\bm{w}_{0}}(\bm{X},\bm{W}) with respect to the jj-th covariate given the remaining covariates so that the term inside the variance is a function of only 𝑿j\bm{X}_{-j} and 𝑾\bm{W}. We know by the law of total variance that ϕϕj[0,ϕ]\phi-\phi_{j}\in[0,\phi], and this quantity measures the amount heterogeneity of τ𝒘0(𝑿,𝑾)\tau_{\bm{w}_{0}}(\bm{X},\bm{W}) that cannot be explained without XjX_{j}.

Assuming some degree of heterogeneity, i.e. ϕ>0\phi>0, we propose the following estimand, which we call MTE-VIM, to describe the importance of each covariate to the overall heterogeneity of the causal effect:

ψj=1ϕj/ϕ[0,1].\psi_{j}=1-\phi_{j}/\phi\in[0,1].

This can be interpreted as the proportion of the treatment effect heterogeneity of τ𝒘0(𝑿,𝑾)\tau_{\bm{w}_{0}}(\bm{X},\bm{W}) not explained by 𝑿j\bm{X}_{-j}. We note that, however, one should carefully interpret ψj\psi_{j} especially when covariates are highly correlated. As an extreme example, if X1X_{1} is solely driving the heterogeneity of τ𝒘0(𝑿,𝑾)\tau_{\bm{w}_{0}}(\bm{X},\bm{W}) and X2X_{2} is strongly correlated with X1X_{1}, then ψ1\psi_{1} is close to zero because the remaining covariates include X2X_{2} which would explain most of the heterogeneity instead of X1X_{1}. From this perspective, one could also construct a grouped MTE-VIM, ψs\psi_{s}, where ss is a subset of {1,2,,p}\{1,2,...,p\}. We refer readers to Hines et al. (2022) for further discussion.

3 Estimation Issues

Under the assumptions introduced in Section 2, for any fixed 𝒘\bm{w} we can identify 𝔼{Y(𝒘)|𝑿=𝒙}\operatorname{\mathbb{E}}\{Y(\bm{w})|\bm{X}=\bm{x}\} from the observed data by 𝔼(Y|𝑿=𝒙,𝑾=𝒘)\operatorname{\mathbb{E}}(Y|\bm{X}=\bm{x},\bm{W}=\bm{w}). Other identification strategies such as those involving propensity scores (Rosenbaum and Rubin, 1983) or combinations of outcome models and propensity scores (Bang and Robins, 2005) are common in causal inference. However, these do not apply here due to the difficulty of estimating the density of the multivariate treatment given the covariates, and because these identification strategies are not well studied for multivariate, continuous treatments. Due to this identification result, all estimands, including the MTE-VIM are identifiable given the conditional outcome regression and therefore we focus our estimation on this quantity.

We separate the conditional outcome regression surface into three parts: the main effect of covariates 𝑿\bm{X}, the main effect of exposures 𝑾\bm{W}, and the interactions of covariates and exposures on the outcome. Therefore, we write our model as follows:

𝔼(Y|𝑿,𝑾)=c+f(𝑿)+g(𝑾)+h(𝑿,𝑾).\displaystyle\operatorname{\mathbb{E}}(Y|\bm{X},\bm{W})=c+f(\bm{X})+g(\bm{W})+h(\bm{X},\bm{W}). (1)

Note that this model is overparameterized as it currently stands because the f()f(\cdot) and g()g(\cdot) functions can be absorbed into the h(,)h(\cdot,\cdot) function. For this reason, these functions are not individually identifiable, and only their sum is identified. We write the model in this way so that we can explicitly shrink each of these components separately, and we discuss in this section how to structure the model so that each of these functions is individually identifiable and therefore amenable to shrinkage and regularization.

The first simplification we make is to set h(𝑿,𝑾)=j=1phj(Xj,𝑾)h(\bm{X},\bm{W})=\sum_{j=1}^{p}h_{j}(X_{j},\bm{W}) where XjX_{j} is the jj-th component of 𝑿\bm{X}. This restricts interactions between covariates and exposures to be limited to interactions between a single covariate and the multivariate exposures. We feel this is a mild assumption that is reasonable in practice, and will improve estimation efficiency in our model. The first hurdle to identification of the individual functions is that hj(Xj,𝑾)h_{j}(X_{j},\bm{W}) may capture the main effect of either the covariate or exposures. For example, we could have hj(Xj,𝑾)=Xj+W1+XjW1h_{j}(X_{j},\bm{W})=X_{j}+W_{1}+X_{j}W_{1}, which captures both main effects and interaction terms, when ideally it would only capture XjW1X_{j}W_{1}. Effectively we want this function to only be nonzero if there is truly an interaction between the exposures and XjX_{j}. For this reason, we restrict hj(Xj,𝑾)h_{j}(X_{j},\bm{W}) to be of the form hjcov(Xj)hjexp(𝑾)h^{cov}_{j}(X_{j})h^{exp}_{j}(\bm{W}) where hjcovh^{cov}_{j} and hjexph^{exp}_{j} are arbitrary functions restricted to not be constant functions. This allows us to prevent the interaction function hj(Xj,𝑾)h_{j}(X_{j},\bm{W}) from simply absorbing the main effects of XjX_{j} and 𝑾\bm{W}. We describe our strategy for each of these separable functions in the following sections.

3.1 Identification through Shifting

The formulation for h(𝑿,𝑾)h(\bm{X},\bm{W}) in the previous section helped ensure that the interaction functions do not solely contain main effects of the exposures or covariates, however, the individual functions are still only identifiable up to constant shifts. For example, without further restriction, shifting f(𝑿)f(\bm{X}) upward by δ\delta, and g(𝑾)g(\bm{W}) and h(𝑿,𝑾)h(\bm{X},\bm{W}) downward by δ/2\delta/2 leads to the same likelihood. This is problematic as we would like our interaction functions to be nonzero only when there is heterogeneity of the treatment effect. If these functions are not identifiable, it becomes more difficult to shrink them towards zero when there is no heterogeneity in the model.

One common way to address this is to put moment restrictions on the functions such that 𝔼𝑿{f(𝑿)}=𝔼𝑾{g(𝑾)}=𝔼𝑿,𝑾{h(𝑿,𝑾)}=0\operatorname{\mathbb{E}}_{\bm{X}}\{f(\bm{X})\}=\operatorname{\mathbb{E}}_{\bm{W}}\{g(\bm{W})\}=\operatorname{\mathbb{E}}_{\bm{X},\bm{W}}\{h(\bm{X},\bm{W})\}=0 with the additional restriction that 𝔼𝑿{h(𝑿,𝑾)}=𝔼𝑾{h(𝑿,𝑾)}=0\operatorname{\mathbb{E}}_{\bm{X}}\{h(\bm{X},\bm{W})\}=\operatorname{\mathbb{E}}_{\bm{W}}\{h(\bm{X},\bm{W})\}=0. This approach is difficult to implement, however, as enforcing these conditions requires the conditional density of covariates given exposures, and vice versa, which is difficult to estimate given the dimension of the exposures and covariates. We use an alternative restriction that also leads to identifiability of individual functions, but is straightforward to implement. Specifically, we enforce that f{𝔼(𝑿)}=g{𝔼(𝑾)}=0f\{\operatorname{\mathbb{E}}(\bm{X})\}=g\{\operatorname{\mathbb{E}}(\bm{W})\}=0 and h{𝔼(𝑿),𝒘}=h{𝒙,𝔼(𝑾)}=0h\{\operatorname{\mathbb{E}}(\bm{X}),\bm{w}\}=h\{\bm{x},\operatorname{\mathbb{E}}(\bm{W})\}=0 for all 𝒙\bm{x} and 𝒘\bm{w}. We show in Appendix B that this leads to identifiability of the individual functions. In practice, this restriction is achieved by centering our estimated functions at each iteration of an MCMC algorithm. If we let all functions with a 0-subscript denote unrestricted functions that do not enforce the aforementioned constraints, then these can be written as:

𝔼(Y|𝑿,𝑾)\displaystyle\operatorname{\mathbb{E}}(Y|\bm{X},\bm{W}) =c0+f0(𝑿)+g0(𝑾)+j=1phj0(Xj,𝑾)\displaystyle=c_{0}+f_{0}(\bm{X})+g_{0}(\bm{W})+\sum_{j=1}^{p}h_{j0}(X_{j},\bm{W})
=[c0+f0(𝔼(𝑿))+g0(𝔼(𝑾))+j=1phj0(𝔼(Xj),𝔼(𝑾))]\displaystyle=\left[c_{0}+f_{0}(\operatorname{\mathbb{E}}(\bm{X}))+g_{0}(\operatorname{\mathbb{E}}(\bm{W}))+\sum_{j=1}^{p}h_{j0}(\operatorname{\mathbb{E}}(X_{j}),\operatorname{\mathbb{E}}(\bm{W}))\right]
+[f0(𝑿)f0(𝔼(𝑿))+j=1p{hj0(Xj,𝔼(𝑾))hj0(𝔼(Xj),𝔼(𝑾))}]\displaystyle\hskip 3.61371pt\hskip 3.61371pt+\left[f_{0}(\bm{X})-f_{0}(\operatorname{\mathbb{E}}(\bm{X}))+\sum_{j=1}^{p}\left\{h_{j0}(X_{j},\operatorname{\mathbb{E}}(\bm{W}))-h_{j0}(\operatorname{\mathbb{E}}(X_{j}),\operatorname{\mathbb{E}}(\bm{W}))\right\}\right]
+[g0(𝑾)g0(𝔼(𝑾))+j=1p{hj0(𝔼(Xj),𝑾)hj0(𝔼(Xj),𝔼(𝑾))}]\displaystyle\hskip 3.61371pt\hskip 3.61371pt+\left[g_{0}(\bm{W})-g_{0}(\operatorname{\mathbb{E}}(\bm{W}))+\sum_{j=1}^{p}\left\{h_{j0}(\operatorname{\mathbb{E}}(X_{j}),\bm{W})-h_{j0}(\operatorname{\mathbb{E}}(X_{j}),\operatorname{\mathbb{E}}(\bm{W}))\right\}\right]
+[j=1p{hj0(Xj,𝑾)hj0(Xj,𝔼(𝑾))hj0(𝔼(Xj),𝑾)+hj0(𝔼(Xj),𝔼(𝑾))}]\displaystyle\hskip 3.61371pt\hskip 3.61371pt+\left[\sum_{j=1}^{p}\left\{h_{j0}(X_{j},\bm{W})-h_{j0}(X_{j},\operatorname{\mathbb{E}}(\bm{W}))-h_{j0}(\operatorname{\mathbb{E}}(X_{j}),\bm{W})+h_{j0}(\operatorname{\mathbb{E}}(X_{j}),\operatorname{\mathbb{E}}(\bm{W}))\right\}\right]
=c+f(𝑿)+g(𝑾)+j=1phj(Xj,𝑾).\displaystyle=c+f(\bm{X})+g(\bm{W})+\sum_{j=1}^{p}h_{j}(X_{j},\bm{W}).

At each iteration of the MCMC, the functions are shifted to satisfy this condition, and therefore our individual functions are identifiable, and the interaction functions should only be nonzero when there is heterogeneity of the treatment effect. Because of this, we can apply shrinkage priors to the hj(Xj,𝑾)h_{j}(X_{j},\bm{W}) functions, which should improve estimation when heterogeneity is not present. Note that the alternative restriction that we impose is baseline-free in the sense that one can replace the chosen baseline, 𝔼(𝑿)\operatorname{\mathbb{E}}(\bm{X}) and 𝔼(𝑾)\operatorname{\mathbb{E}}(\bm{W}), with any values of the covariates and exposures.

3.2 Soft Bart and Targeted Smoothing

In this section we detail the specific models we use for each of the f(𝑿)f(\bm{X}), g(𝑾)g(\bm{W}), and h(𝑿,𝑾)h(\bm{X},\bm{W}) functions. For the main effect functions we use the SoftBART prior of Linero and Yang (2018), while we use BART with targeted smoothing (tsBART, Starling et al. (2020); Li et al. (2022)) for estimation of the interaction functions. Before detailing each of these BART extensions, we first briefly introduce the original BART model. BART was introduced by Chipman et al. (2010) and is a fully Bayesian ensemble-of-trees model that has seen increasing usage in causal inference due to the success of the original BART and its variants (Dorie et al., 2019). Specifically, it assumes that

Z\displaystyle Z =f(𝒗)+ϵ,ϵi.i.d.N(0,σ2)\displaystyle=f(\bm{v})+\epsilon,\qquad\epsilon\overset{\text{i.i.d.}}{\sim}N(0,\sigma^{2})
f(𝒗)\displaystyle f(\bm{v}) =m=1MTree(𝒗,𝒯m,m)\displaystyle=\sum_{m=1}^{M}\text{Tree}(\bm{v},\mathcal{T}_{m},\mathcal{M}_{m})

where Tree(𝒗,𝒯j,j)\text{Tree}(\bm{v},\mathcal{T}_{j},\mathcal{M}_{j}) represents a regression tree with tree structure 𝒯j\mathcal{T}_{j} inducing a step function in 𝒗\bm{v}, and leaf parameters m={μm1,,μmbm}\mathcal{M}_{m}=\{\mu_{m1},\cdots,\mu_{mb_{m}}\} for prediction. The standard prior distribution sets μmkN(0,σμ2/M)\mu_{mk}\sim N(0,\sigma_{\mu}^{2}/M) so that the overall prior variance is Var(f(𝒗))=σμ2\operatorname{Var}(f(\bm{v}))=\sigma_{\mu}^{2}. Although we do not detail the hyper-parameters of BART in this paper, we note that the default setting of Chipman et al. (2010) encourages each tree to be shallow and shrinks leaf parameters toward zero so that Tree()\text{Tree}(\cdot) can be considered as a weak learner. We refer interested readers to Linero (2017) and Hill et al. (2020) for recent reviews of BART.

Although the original BART approach is successful, due to the piece-wise constant nature of tree ensembles, it suffers when the underlying truth is smooth even if it is relatively simple. To overcome this shortcoming, Linero and Yang (2018) propose a smooth modification of BART (SoftBART or SBART) by allowing 𝒗\bm{v} to follow a probabilistic path, that is, randomizing the decision rule at each split. Because we expect the effects of environmental exposures or other continuous treatments on the outcome to be smooth, we use SoftBART for estimation of both f(𝑿)f(\bm{X}) and g(𝑾)g(\bm{W}).

Another smooth variant of BART is tsBART (Starling et al., 2020; Li et al., 2022) which is originally motivated by time-to-event data and density estimation, though we use it for estimation of the interaction functions hj(Xj,𝑾)h_{j}(X_{j},\bm{W}). Unlike SoftBART which smooths the regression function over all variables, tsBart smooths over a single targeted variable, say uu. After centering at γ(u)\gamma(u), a baseline function of uu, the model can be written as f(u,𝒗)=γ(u)+j=1mtsTree(u,𝒗,𝒯m,m)f(u,\bm{v})=\gamma(u)+\sum_{j=1}^{m}\text{tsTree}(u,\bm{v},\mathcal{T}_{m},\mathcal{M}_{m}) where each terminal node is associated with m={μm1(u),,μmbm(u)}\mathcal{M}_{m}=\{\mu_{m1}(u),\cdots,\mu_{mb_{m}}(u)\}. Here, each element μmb(u)\mu_{mb}(u) follows a Gaussian process in uu with mean zero and covariance function Σ(u,u)\Sigma(u,u^{\prime}), so that f(u,𝒗)f(u,\bm{v}), the sum of mm Gaussian processes, is continuous in uu. Following Li et al. (2022), we approximate the model by f(u,𝒗)=γ(u)+m=1Mm(u)Tree(𝒗,𝒯m,m)f(u,\bm{v})=\gamma(u)+\sum_{m=1}^{M}\mathcal{B}_{m}(u)\text{Tree}(\bm{v},\mathcal{T}_{m},\mathcal{M}_{m}) where m(u)=2cos(ωmu+bm)\mathcal{B}_{m}(u)=\sqrt{2}\cos(\omega_{m}u+b_{m}) is a random basis function where ωmi.i.d.N(0,ρ2)\omega_{m}\overset{\text{i.i.d.}}{\sim}N(0,\rho^{-2}) and bmi.i.d.U(0,2π)b_{m}\overset{\text{i.i.d.}}{\sim}U(0,2\pi) so that m=1Mm(u)tree(𝒗,𝒯m,m)\sum_{m=1}^{M}\mathcal{B}_{m}(u)\text{tree}(\bm{v},\mathcal{T}_{m},\mathcal{M}_{m}) weakly converges to a Gaussian process GP(0,Σ(,))\text{GP}(0,\Sigma(\cdot,\cdot)) as MM\rightarrow\infty where Σ(u,u)=σμ2exp{(uu)/(2ρ2)}\Sigma(u,u^{\prime})=\sigma_{\mu}^{2}\exp\{-(u-u^{\prime})/(2\rho^{2})\}.

By fitting tsBART for each interaction function hj(Xj,𝑾)h_{j}(X_{j},\bm{W}) in our model by setting u=Xju=X_{j} and 𝒗=𝑾\bm{v}=\bm{W}, we have that each fitted function is necessarily the sum of products of a continuous function of XjX_{j} and a flexible tree of 𝑾\bm{W} so that it does not include a function solely of XjX_{j} or 𝑾\bm{W}.

3.3 Estimation of MTE-VIM

We now detail the estimation strategy for the proposed variable importance measures. Recall that MTE-VIM for the jj-th covariate, ψj\psi_{j}, is composed of the total heterogeneity ϕ=𝔼𝑾[Var𝑿{τ𝒘0(𝑿,𝑾)}]\phi=\operatorname{\mathbb{E}}_{\bm{W}}[\operatorname{Var}_{\bm{X}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}] and the total heterogeneity accounted for by 𝑿j\bm{X}_{-j}, given by ϕj=𝔼𝑾(Var𝑿j[𝔼Xj|𝑿j{τ𝒘0(𝑿,𝑾)}])\phi_{j}=\operatorname{\mathbb{E}}_{\bm{W}}\left(\operatorname{Var}_{\bm{X}_{-j}}\left[\operatorname{\mathbb{E}}_{X_{j}|\bm{X}_{-j}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}\right]\right). As before, under the assumptions of Section 2, we can write τ𝒘0(𝒙,𝒘)=𝔼(Y|𝑿=𝒙,𝑾=𝒘)𝔼(Y|𝑿=𝒙,𝑾=𝒘0)\tau_{\bm{w}_{0}}(\bm{x},\bm{w})=\operatorname{\mathbb{E}}(Y|\bm{X}=\bm{x},\bm{W}=\bm{w})-\operatorname{\mathbb{E}}(Y|\bm{X}=\bm{x},\bm{W}=\bm{w}_{0}), and therefore the posterior distribution of τ𝒘0(𝒙,𝒘)\tau_{\bm{w}_{0}}(\bm{x},\bm{w}) can be obtained once we have the posterior distribution of the outcome regression model. Obtaining ϕ\phi is relatively straightforward, as we use empirical distributions of both the exposures and covariates to approximate moments with respect to 𝑾\bm{W} or 𝑿\bm{X}. We can construct an n×nn\times n matrix for which the (k,l)(k,l)-th element is τ𝒘0(𝑿l,𝑾k)\tau_{\bm{w}_{0}}(\bm{X}_{l},\bm{W}_{k}), where 𝑿l\bm{X}_{l} and 𝑾k\bm{W}_{k} represent the observed covariates of the ll-th observation and the exposures of the kk-th observation, respectively. Then, the sample variance of the kk-th row of the matrix would estimate Var𝑿{τ𝒘0(𝑿,𝑾k)}\operatorname{Var}_{\bm{X}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W}_{k})\}. Taking the sample mean of these nn column sample variances estimates 𝔼𝑾[Var𝑿{τ𝒘0(𝑿,𝑾)}]\operatorname{\mathbb{E}}_{\bm{W}}[\operatorname{Var}_{\bm{X}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}]. This estimation process can be illustrated as follows:

(τ𝒘0(𝑿1,𝑾1)τ𝒘0(𝑿2,𝑾1)τ𝒘0(𝑿n,𝑾1)τ𝒘0(𝑿1,𝑾2)τ𝒘0(𝑿n,𝑾2)τ𝒘0(𝑿1,𝑾n)τ𝒘0(𝑿2,𝑾n)τ𝒘0(𝑿n,𝑾n))\displaystyle\begin{pmatrix}\tau_{\bm{w}_{0}}(\bm{X}_{1},\bm{W}_{1})&\tau_{\bm{w}_{0}}(\bm{X}_{2},\bm{W}_{1})&\dots&\tau_{\bm{w}_{0}}(\bm{X}_{n},\bm{W}_{1})\\ \tau_{\bm{w}_{0}}(\bm{X}_{1},\bm{W}_{2})&\ddots&&\tau_{\bm{w}_{0}}(\bm{X}_{n},\bm{W}_{2})\\ \vdots&&\ddots&\vdots\\ \tau_{\bm{w}_{0}}(\bm{X}_{1},\bm{W}_{n})&\tau_{\bm{w}_{0}}(\bm{X}_{2},\bm{W}_{n})&\dots&\tau_{\bm{w}_{0}}(\bm{X}_{n},\bm{W}_{n})\end{pmatrix}\longrightarrow (Var𝑿[τ𝒘0(𝑿,𝑾1)]Var𝑿[τ𝒘0(𝑿,𝑾2)]Var𝑿[τ𝒘0(𝑿,𝑾n)])\displaystyle\begin{pmatrix}\text{Var}_{\bm{X}}[\tau_{\bm{w}_{0}}(\bm{X},\bm{W}_{1})]\\[5.0pt] \text{Var}_{\bm{X}}[\tau_{\bm{w}_{0}}(\bm{X},\bm{W}_{2})]\\ \vdots\\ \text{Var}_{\bm{X}}[\tau_{\bm{w}_{0}}(\bm{X},\bm{W}_{n})]\\ \end{pmatrix}
\displaystyle\hskip 52.03227pt\big{\downarrow}
𝔼𝑾[Var𝑿{τ𝒘0(𝑿,𝑾)}]\displaystyle\operatorname{\mathbb{E}}_{\bm{W}}[\operatorname{Var}_{\bm{X}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}]

The same strategy can be used for ϕj\phi_{j} replacing τ𝒘0(𝑿l,𝑾k)\tau_{\bm{w}_{0}}(\bm{X}_{l},\bm{W}_{k}) with 𝔼Xj|𝑿j{τ𝒘0(𝑿l,𝑾k)}\operatorname{\mathbb{E}}_{X_{j}|\bm{X}_{-j}}\{\tau_{\bm{w}_{0}}(\bm{X}_{l},\bm{W}_{k})\} everywhere, so all that is left is to describe how to obtain 𝔼Xj|𝑿j{τ𝒘0(𝑿l,𝑾k)}\operatorname{\mathbb{E}}_{X_{j}|\bm{X}_{-j}}\{\tau_{\bm{w}_{0}}(\bm{X}_{l},\bm{W}_{k})\}. For ease of exposition, we re-write τ𝒘0(𝑿l,𝑾k)\tau_{\bm{w}_{0}}(\bm{X}_{l},\bm{W}_{k}) as τ𝒘0(Xlj,𝑿l(j),𝑾k)\tau_{\bm{w}_{0}}(X_{lj},\bm{X}_{l(-j)},\bm{W}_{k}) where XljX_{lj} denotes the jj-th covariate of the ll-th individual and 𝑿l(j)\bm{X}_{l(-j)} denotes the remaining covariates of the ll-th individual. A nonparametric estimate of this mean can be defined as

𝔼^Xj|𝑿j{τ𝒘0(𝑿,𝑾)}=i=1nτ𝒘0(Xij,𝑿j,𝑾)K(𝑿i(j)𝑿j)i=1nK(𝑿i(j)𝑿j)\widehat{\operatorname{\mathbb{E}}}_{X_{j}|\bm{X}_{-j}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}=\frac{\sum_{i=1}^{n}\tau_{\bm{w}_{0}}(X_{ij},\bm{X}_{-j},\bm{W})K(\bm{X}_{i(-j)}-\bm{X}_{-j})}{\sum_{i=1}^{n}K(\bm{X}_{i(-j)}-\bm{X}_{-j})}

where K()K(\cdot) is an appropriately defined kernel. This will work well if 𝑿\bm{X} is small or moderate-dimensional, but won’t work well as the number of covariates grows. In this more difficult setting, we could make parametric assumptions and estimate the mean of XjX_{j} given 𝑿j\bm{X}_{-j} using a regression model. For instance, if XjX_{j} is continuous, we could assume that it follows a normal distribution and we can estimate the conditional mean and variance using linear regression. This would provide us an estimate of the conditional density fXj|𝑿jf_{X_{j}|\bm{X}_{-j}} from which we can calculate 𝔼Xj|𝑿j{τ𝒘0(𝑿,𝑾)}\operatorname{\mathbb{E}}_{X_{j}|\bm{X}_{-j}}\{\tau_{\bm{w}_{0}}(\bm{X},\bm{W})\}.

Note that both of these require the construction of an n×nn\times n matrix at each MCMC sample, which is computationally intensive. This is particularly problematic for the analysis of Medicare data in Section 6, where our sample consists of all 38702 zip-codes in the United States. As an approximation to this calculation, one can use a blocking scheme where we construct several sub-matrices of the data when calculating the variable importance metrics. Specifically, we split the overall sample into KK groups of approximately equal size. Letting the sample size in group kk be nkn_{k}, we can estimate the variable importance metric using the subset of the data in group kk, which only requires the calculation of an nk×nkn_{k}\times n_{k} matrix. We can do this for each group separately, and then average results across groups for a final estimate of ψj\psi_{j} for j=1,,pj=1,\dots,p. We see in Appendix C that this blocking scheme performs equally well as using the full sample, but is significantly faster on large data sets.

4 Posterior contraction rates

Next, we study the rate at which the posterior distributions concentrate in the model described in Section 3. Let 𝒮(α,p,d)\mathcal{S}(\alpha,p,d) denote a set of α\alpha-Hölder continuous functions on [0,1]p[0,1]^{p} that are constant in all but dd of the coordinates. Following Linero and Yang (2018) and Li et al. (2022), we make the following assumptions about the data-generating process (Condition A).

  1. (A1)

    YiNormal(c+f(𝑿i)+g(𝑾i)+h(𝑿i,𝑾i),1)Y_{i}\sim\operatorname{Normal}(c^{*}+f^{*}(\bm{X}_{i})+g^{*}(\bm{W}_{i})+h^{*}(\bm{X}_{i},\bm{W}_{i}),1)

  2. (A2)

    The range of 𝑿i\bm{X}_{i} and 𝑾i\bm{W}_{i} are [0,1]p[0,1]^{p} and [0,1]q[0,1]^{q}, respectively.

  3. (A3)

    f𝒮(αx,p,dx)f^{*}\in\mathcal{S}(\alpha_{x},p,d_{x}), g𝒮(αw,p,dw)g^{*}\in\mathcal{S}(\alpha_{w},p,d_{w}), and h𝒮(αh,p+q,dh)h^{*}\in\mathcal{S}(\alpha_{h},p+q,d_{h})

We assume the error variance is σ2=1\sigma^{2}=1 for simplicity, but it is straightforward to incorporate unknown σ2\sigma^{2} as well. Achieving (A2) is straightforward by applying quantile normalization, where XijX_{ij}, the ii-th individual’s jj-th covariate, is replaced with its quantile value among all the jj-th covariate values, and the same for the exposures. Further, we must make assumptions about the SoftBART prior distribution Π\Pi used for each regression function ff, gg, and hh. For brevity, we leave these specific details to Appendix A, and we refer to these assumptions on the prior distribution as Condition P. We have simplified the setup by not incorporating targeted smoothing into h(𝒙,𝒘)h(\bm{x},\bm{w}), though we emphasize that Li et al. (2022) show that for certain choices of basis function, it is not difficult to prove analogous results for targeted smoothing models. We remove targeted smoothing from consideration so that we can use the same Condition P for all model components rather than requiring a separate set of conditions for h(,)h(\cdot,\cdot).

Let 𝔼0\operatorname{\mathbb{E}}_{0} denote the expectation with respect to the true data-generating mechanism, Πn\Pi_{n} denote the posterior distribution, and μn2=1ni=1nμ(𝑿i,𝑾i)2\|\mu\|_{n}^{2}=\dfrac{1}{n}\sum_{i=1}^{n}\mu(\bm{X}_{i},\bm{W}_{i})^{2} where μ(𝑿i,𝑾i)=c+f(𝑿i)+g(𝑾i)+h(𝑿i,𝑾i)\mu(\bm{X}_{i},\bm{W}_{i})=c+f(\bm{X}_{i})+g(\bm{W}_{i})+h(\bm{X}_{i},\bm{W}_{i}). In Appendix A, we prove the following theorem.

Theorem 1

If Conditions A and P hold,

𝔼0{Πn(μμn>Mϵn)}0, as n\operatorname{\mathbb{E}}_{0}\{\Pi_{n}(\|\mu-\mu^{*}\|_{n}>M\epsilon_{n})\}\rightarrow 0,\quad\text{ as }n\rightarrow\infty

for ϵn=max(ϵnf,ϵng,ϵnh)\epsilon_{n}=\max(\epsilon_{nf},\epsilon_{ng},\epsilon_{nh}) where ϵnf=nαx/(2αx+dx)log(n)tf,ϵng=nαw/(2αw+dw)log(n)tg\epsilon_{nf}=n^{-\alpha_{x}/(2\alpha_{x}+d_{x})}\log(n)^{t_{f}},\epsilon_{ng}=n^{-\alpha_{w}/(2\alpha_{w}+d_{w})}\log(n)^{t_{g}} and ϵnh=nαh/(2αh+dh)log(n)th\epsilon_{nh}=n^{-\alpha_{h}/(2\alpha_{h}+d_{h})}\log(n)^{t_{h}} where tf=αx(dx+1)/(2αx+dx),tg=αw(dw+1)/(2αw+dw),t_{f}=\alpha_{x}(d_{x}+1)/(2\alpha_{x}+d_{x}),t_{g}=\alpha_{w}(d_{w}+1)/(2\alpha_{w}+d_{w}), and th=αh(dh+1)/(2αh+dh).t_{h}=\alpha_{h}(d_{h}+1)/(2\alpha_{h}+d_{h}).

The rates ϵnf,ϵng,ϵnh\epsilon_{nf},\epsilon_{ng},\epsilon_{nh} represent the minimax estimation rates within the respective function classes for (f,g,h)(f,g,h), up to a logarithmic term, and therefore the rate ϵn\epsilon_{n} is a near-minimax optimal rate for estimation of μ\mu. This result shows that our model is able to capture any true outcome regression function as long as it is sufficiently smooth, which is reasonable for the study of environmental pollutants. Further note that while we have derived posterior contraction rates for μ\mu, these imply the same posterior contraction rates for causal effects, such as τ𝒘0(𝒙,𝒘)\tau_{\bm{w}_{0}}(\bm{x},\bm{w}), and for variable importance metrics ψj\psi_{j}.

5 Simulation Studies

Here, we assess the performance of our proposed approach and compare it to existing approaches that do not allow for heterogeneity of the multivariate treatment effect. For the simulation studies we draw covariates and exposures from the following multivariate normal distributions:

𝑿ii.i.d.N(𝟎,I5),𝑾iN(μi,Σ)where\displaystyle\bm{X}_{i}\overset{\text{i.i.d.}}{\sim}N(\bm{0},\textbf{I}_{5}),\quad\bm{W}_{i}\sim N(\vec{\mu}_{i},\vec{\Sigma})\qquad\text{where}
μi=(eXi1/(1+eXi1)0.5,0.1Xi220.1,0.3Xi3,sin(Xi2),0.05Xi43)T,\displaystyle\vec{\mu}_{i}=\left(e^{X_{i1}}/(1+e^{X_{i1}})-0.5,0.1X_{i2}^{2}-0.1,0.3X_{i3},\sin(X_{i2}),0.05X_{i4}^{3}\right)^{T},
Σij={1i=j0.3ij\displaystyle\Sigma_{ij}=\begin{cases}1\qquad&i=j\\ 0.3\qquad&i\neq j\end{cases}

and Σij\Sigma_{ij} denotes the (i,j)(i,j) element of Σ\vec{\Sigma}. The covariates are associated with the exposures and the exposures are correlated with each other, which are both common in observational studies in environmental health. Next, we generate the outcome by

Yi=f(𝑿i)+g(𝑾i)+j=15hj(Xij,𝑾i)+ϵi,ϵii.i.d.N(0,1)\displaystyle Y_{i}=f^{*}(\bm{X}_{i})+g^{*}(\bm{W}_{i})+\sum_{j=1}^{5}h^{*}_{j}(X_{ij},\bm{W}_{i})+\epsilon_{i},\qquad\epsilon_{i}\overset{\text{i.i.d.}}{\sim}N(0,1)
f(𝑿i)=Xi1+Xi20.5Xi3\displaystyle f^{*}(\bm{X}_{i})=X_{i1}+X_{i2}-0.5X_{i3}
g(𝑾i)=I{Wi1>0}+Wi1e0.3Wi3+arctan(Wi2)+sin(Wi2Wi3π)+min(|Wi3|,1).\displaystyle g^{*}(\bm{W}_{i})=\text{I}{\left\{W_{i1}>0\right\}}+W_{i1}e^{0.3W_{i3}}+\arctan(W_{i2})+\sin(W_{i2}W_{i3}\pi)+\min(|W_{i3}|,1).

Further, we consider three scenarios of interactions between covariates and exposures:

No interaction: h1(Xi1,𝑾)=h2(Xi2,𝑾)=0\displaystyle h^{*}_{1}(X_{i1},\bm{W})=h^{*}_{2}(X_{i2},\bm{W})=0
Moderate: h1(Xi1,𝑾)=0.2arctan(4Xi1)g(𝑾),h2(Xi2,𝑾)=0.2cos(Xi2π)g(𝑾)\displaystyle h^{*}_{1}(X_{i1},\bm{W})=0.2\arctan(4X_{i1})g^{*}(\bm{W}),\quad h^{*}_{2}(X_{i2},\bm{W})=0.2\cos(X_{i2}\pi)g^{*}(\bm{W})
Strong: h1(Xi1,𝑾)=0.4arctan(4Xi1)g(𝑾),h2(Xi2,𝑾)=0.4cos(Xi2π)g(𝑾)\displaystyle h^{*}_{1}(X_{i1},\bm{W})=0.4\arctan(4X_{i1})g^{*}(\bm{W}),\quad h^{*}_{2}(X_{i2},\bm{W})=0.4\cos(X_{i2}\pi)g^{*}(\bm{W})

while hj(Xij,𝑾)=0h^{*}_{j}(X_{ij},\bm{W})=0 for j=3,4,5j=3,4,5 under all scenarios. The standard deviations of the heterogeneous treatment effect functions are approximately 0.2 and 0.4 times that of the sum of the two main effects, respectively, which is reasonable as we expect that interaction effects to be smaller in magnitude than main effects in general. We run the simulation for 300 different data replicates for each scenario and average results across all simulated data sets. We set the sample size to be n=2000n=2000. We aim to estimate τ𝒘1,𝒘0(𝑿)𝔼{Y(𝒘1)Y(𝒘0)|𝑿}\tau_{\bm{w}_{1},\bm{w}_{0}}(\bm{X})\equiv\operatorname{\mathbb{E}}\{Y(\bm{w}_{1})-Y(\bm{w}_{0})|\bm{X}\} at 100 randomly chosen locations from the distribution of 𝑿\bm{X}, as well as the variable importance metrics ψj\psi_{j} for all covariates. We choose two fixed exposures levels 𝒘0=(0.5,0.5,0.5,0.5,0.5)T\bm{w}_{0}=(-0.5,-0.5,-0.5,-0.5,-0.5)^{T} and 𝒘1=(0.5,0.5,0.5,0.5,0.5)T\bm{w}_{1}=(0.5,0.5,0.5,0.5,0.5)^{T} which roughly correspond to the first and the third quartiles of each exposure, respectively. We consider four different approaches in the simulation. The first is the Bayesian kernel machine regression approach (BKMR) commonly used in the environmental mixtures literature, which fits a model of the form E(Y|𝑿,𝑾)=𝑿𝜷+m(𝑾)E(Y|\bm{X},\bm{W})=\bm{X}\bm{\beta}+m(\bm{W}). A Gaussian process prior is placed on m()m(\cdot), which incorporates spike-and-slab prior distributions to remove unnecessary exposures. The second is a standard BART prior that fits a model of the form E(Y|𝑿,𝑾)=m(𝑿,𝑾)E(Y|\bm{X},\bm{W})=m(\bm{X},\bm{W}) and places a BART prior on m()m(\cdot) (BART). The third uses a smooth variant of BART for m()m(\cdot) (SoftBART). Lastly, we use our proposed approach (SepBART), which separates the regression surface into separate functions and then uses either SoftBART or tsBART prior distributions for each function separately. Note that BKMR specifically assumes that the effect of covariates is linear, which is the case in our simulation studies, while it does not allow for treatment effect heterogeneity, which makes it misspecified in scenarios with interactions between exposures and covariates.

Figure 1 shows the simulation results for estimating τ𝒘1,𝒘0(𝑿)\tau_{\bm{w}_{1},\bm{w}_{0}}(\bm{X}). We find that our method is the only method that achieves the nominal 95% coverage for all interaction scenarios and produces the smallest root mean squared error (RMSE) when there is treatment effect heterogeneity, regardless of its strength. While it is expected that BKMR does not perform well in the presence of interactions between covariates and exposures as it assumes no treatment effect heterogeneity, it is surprising that SepBART outperforms BKMR even when there is no heterogeneity and the main effect of covariates is linear, which is the situation that BKMR is designed for. The no-interaction scenario is the least favorable setting for SepBART as it forces an interaction structure on the model. However, even for this setting, SepBART shows comparable estimation performance to the best performing model in terms of RMSE, which indicates that our model is able to shrink the h()h(\cdot) functions to zero when they are not required. It is also notable that SepBART improves on the original BART and SoftBART that do not separate the regression function into distinct components, which suggests that we benefit from the proposed model formulation and separation of the effects into different functions by providing balance between flexibility and efficiency.

Refer to caption
Figure 1: Simulation results for estimating CATE when 𝒘0\bm{w}_{0} and 𝒘1\bm{w}_{1} are given. The dashed line in the second row denotes nominal 95% coverage.

We now assess the performance for estimating MTE-VIM introduced in Section 2. As BKMR does not allow for effect heterogeneity, we focus on our approach only here. We calculate ψ^j\hat{\psi}_{j} for j=1,2,,5j=1,2,...,5 for each data set by taking the posterior mean of estimates and plot the distribution of ψ^j\hat{\psi}_{j} in Figure 2. The true values of ψ1\psi_{1} and ψ2\psi_{2} are 0.72 and 0.28, respectively, and the others are zero. We see that for both interaction scenarios where the true total heterogeneity ϕ\phi is 0.26 and 1.14, respectively (moderate and strong), the posterior mean of each estimate is concentrated near the true value. Along with estimating MTE-VIM, we can conduct a hypothesis test for the difference between two MTE-VIMs, i.e. testing H0:ψj=ψkH_{0}:\psi_{j}=\psi_{k}. This can be done by constructing the posterior distribution of ψjψk\psi_{j}-\psi_{k} and seeing whether or not the interval contains zero. We calculate how often the null hypothesis is rejected with level α=0.05\alpha=0.05 over 300 data set replicates, and plot the empirical rejection rate of each testing pair in Figure 3. Since the true ψ1\psi_{1} is far away from zero, it shows high rejection rates for comparing ψ1\psi_{1} and the others (the first column of Figure 3). For ψ2\psi_{2}, which is non-zero, but not large as ψ1\psi_{1}, it produces slightly weaker power when interactions are moderate, but detects almost all differences when interactions are strong. For the last three columns where the null is true, we see rejection rates are controlled under the desired level α=0.05\alpha=0.05 regardless of the scenario, though inference is somewhat conservative.

Refer to caption
Figure 2: Violin plots of 300 posterior means of the variable importance measures, ψj\psi_{j} for j=1,2,,5j=1,2,...,5 under each of the two scenarios with heterogeneity. The black dots represent the true variable importance measures: 0.72 for ψ1\psi_{1}, 0.28 for ψ2\psi_{2}, and 0 for the others.
Refer to caption
Figure 3: Rejection rates for each test where H0:ψi=ψjH_{0}:\psi_{i}=\psi_{j} with the level α=0.05\alpha=0.05.

6 Health effects of air pollution mixtures

We now use our proposed approach to gain important epidemiological insights about the effect of PM2.5 components and ozone on mortality. We have demographic, socioeconomic, and mortality information on all Medicare beneficiaries in the United States above the age of 65 for the years 2000-2016. We observe a number of individual level covariates for these individuals, such as their age, race, sex, and whether they have dual eligibility to Medicaid, which is a proxy for low socioeconomic status. We also observe a number of area-level covariates unique to the zip-code of each individual from the United States Census Bureau and the Center for Disease Control’s Behavioral Risk Factor Surveillance System. These area-level covariates consist of average body mass index, smoking rates, median household income, median house value, education, population density, and percent owner occupied housing. We also adjust for both temperature and precipitation variables that are available from the National Climatic Data Center.

We obtain zip-code level environmental exposure data from two distinct sources. We obtain estimates of total PM2.5, ammonium, nitrates, and sulfate levels on a (0.01×0.010.01^{\circ}\times 0.01^{\circ}) monthly grid from the Atmospheric Composition Analysis Group (Van Donkelaar et al., 2019). We obtain estimates of elemental carbon, organic carbon, and ozone on a 1km by 1km daily grid from the Socioeconomic Data and Applications Center (Di et al., 2019, 2021; Requia et al., 2021). We do not have exact residential addresses of individuals in Medicare and only know their residential zip-code, and therefore all exposures are aggregated to the yearly level at each zip-code. All individual and geographic-level covariates are also aggregated up to the zip-code level by taking their averages or proportions within each zip code so that the outcome of interest is the annual mortality rate in each zip-code, which is defined as the number of deaths in that zip-code for a particular year divided by the number of person-years in that zip-code.

We run our proposed approach as described in Section 3 targeting both marginal and heterogeneous treatment effects to evaluate the extent to which ambient air pollution affects mortality, and whether this effect varies by characteristics of zip-codes. We run our model for each year separately using the prior year exposures as the pollutants of interest, and therefore will present results for all years between 2000 and 2016. We run our MCMC algorithm for 5000 iterations, discarding the first 1000 as a burn-in, and thinning every fourth sample.

6.1 Marginal effects of exposures

Refer to caption
Figure 4: Posterior means and corresponding 95% credible intervals for the average treatment effects on mortality rates when every pollutant simultaneously increases from its yearly first quartile to the third quartile. The dashed line represents no average treatment effect.

First, we investigate the average treatment effect of an increase in environmental mixtures, denoted as 𝔼{Y(𝒘1)Y(𝒘0)}\operatorname{\mathbb{E}}\{Y(\bm{w}_{1})-Y(\bm{w}_{0})\}, where 𝒘0\bm{w}_{0} and 𝒘1\bm{w}_{1} represent vectors of all exposures corresponding to the first and third quartiles for each exposure, respectively. Hence, a positive ATE would indicate a harmful effect on mortality due to an increase in environmental exposures. Figure 4 illustrates estimates of the ATE for each year. We consistently estimate a positive ATE, implying that increasing levels of all environmental exposures increases the mortality rate across the contiguous United States, which suggests a detrimental impact of environmental pollutants on human health.

6.2 Heterogeneity of the effect of PM2.5 components

While the average treatment effect provides insights into the health effects of air pollution, of even more interest is whether this effect varies across the population, as it is crucially important to understand which communities are at most risk to the detrimental effects of air pollution. Toward this goal, we first investigate the proposed MTE-VIM for each covariate to understand which characteristics drive the heterogeneity of the causal effect. The estimated total heterogeneity, ϕ\phi, averaged over all of the study years is 0.04, which implies that the standard deviation of τ𝒘0(𝑾,𝑿)\tau_{\bm{w}_{0}}(\bm{W},\bm{X}) with respect to 𝑿\bm{X} given 𝑾=𝒘\bm{W}=\bm{w} is, on average, 0.2. Considering that we estimated an average ATE of 0.5 across study years when 𝑾=𝒘1\bm{W}=\bm{w}_{1} in Figure 4, this suggests there is a non-negligible amount of treatment effect heterogeneity. Figure 5 shows the posterior mean of the MTE-VIM corresponding to each covariate for each study year. Dual eligibility to Medicaid appears to be the strongest driving factor of treatment effect heterogeneity, followed by race and age. Figure 6 shows the value of these variable importance metrics when averaged over all study years, and we see that dual eligibility to Medicaid, race, and age achieved the largest variable importance measures, with overall means of 0.2, 0.18, and 0.04, respectively. This indicates that, on average, 20% of the treatment effect heterogeneity can only be explained by dual eligibility to Medicaid, even after controlling for other potential effect modifiers.

In addition to identifying the characteristics of zip-codes that modify the treatment effect the most, it is important to understand the nature and direction of this heterogeneity to better understand which groups are most susceptible to air pollution mixtures. To do this, we examine the conditional average treatment effect, using the same reference exposure levels as in the previous section, denoted by 𝒘1\bm{w}_{1} and 𝒘0\bm{w}_{0}. Specifically, we plot 𝔼{Y(𝒘1)Y(𝒘0)|𝑿=𝒙~(j)}\operatorname{\mathbb{E}}\{Y(\bm{w}_{1})-Y(\bm{w}_{0})|\bm{X}=\widetilde{\bm{x}}_{(j)}\} as a function of xjx_{j}, where 𝒙~(j)=(  x1,,xj,,  xp)\widetilde{\bm{x}}_{(j)}=(\makebox[0.0pt][l]{\hskip 1.11942pt\hskip 2.18436pt\makebox[0.0pt][c]{\rule[5.59721pt]{5.6604pt}{0.43057pt}}}{x}_{1},...,x_{j},...,\makebox[0.0pt][l]{\hskip 1.11942pt\hskip 2.18436pt\makebox[0.0pt][c]{\rule[5.59721pt]{5.6604pt}{0.43057pt}}}{x}_{p}). This sets all covariates to their sample mean and only varies covariate jj, so that we can examine whether the conditional average treatment effect increases or decreases with covariate jj. Within the framework of our model and our identification restrictions, 𝔼{Y(𝒘1)Y(𝒘0)|𝑿=𝒙~(j)}\operatorname{\mathbb{E}}\{Y(\bm{w}_{1})-Y(\bm{w}_{0})|\bm{X}=\widetilde{\bm{x}}_{(j)}\} simplifies to the difference in the interaction function, hj(𝒘1,xj)hj(𝒘0,xj)h_{j}(\bm{w}_{1},x_{j})-h_{j}(\bm{w}_{0},x_{j}), which is zero when xjx_{j} equals the mean covariate level,   xj\makebox[0.0pt][l]{\hskip 1.11942pt\hskip 2.18436pt\makebox[0.0pt][c]{\rule[5.59721pt]{5.6604pt}{0.43057pt}}}{x}_{j}. Therefore, when this difference is positive, it implies that individuals with the given covariate level are more susceptible to increases in the environmental mixture compared to those with an average covariate level.

Figure 7 shows the conditional average treatment effect curves described above for the covariates with large MTE-VIMs. We observe that dual eligibility shows the largest variation across the covariate values as expected by the largest MTE-VIM. We also see that the conditional average treatment effect increases as dual eligibility increases in most years and their credible intervals mostly overlap across the years, which indicates that the detrimental effects of air pollution are more pronounced in areas with low socioeconomic status. We also find that areas with older individuals are more adversely affected by increases in pollution except for the year 2011, though this effect is not as strong as that of dual eligibility to Medicaid. Although race, as measured by the percentage of the population that is white, achieved high variable importance measures, the direction of the heterogeneity is less clear as we see an increasing trend for certain years while we see a decreasing trend in others with wide credible intervals. Such mixed findings for race have been seen previously, as other studies have found the counter-intuitive finding that the effects of pollution are more severe in areas with higher percentages of white people (Bargagli-Stoffi et al., 2020). Race appears to play a modifying role in the effects of air pollution, but more research is required to understand the nature of this effect. Overall, our study shows that results from the previous literature (Simoni et al., 2015; Bargagli-Stoffi et al., 2020) that focused solely on univariate PM2.5 exposure extend to the more complex setting of multivariate air pollution mixtures. We find a harmful effect of air pollution on mortality, and that this effect is heterogeneous with larger effects in areas with low socioeconomic status and older individuals.

Refer to caption
Figure 5: Heatmap of posterior means of the MTE-VIM by year.
Refer to caption
Figure 6: Estimates of the MTE-VIM averaged over the years 2000-2016.
Refer to caption
Figure 7: Estimates of conditional average treatment effects on mortality rates of increasing all pollutants from the first quartiles to the third quartiles as a function of each covariate, 𝔼{Y(𝒘1)Y(𝒘0)|𝑿=𝒙~(j)}\operatorname{\mathbb{E}}\{Y(\bm{w}_{1})-Y(\bm{w}_{0})|\bm{X}=\widetilde{\bm{x}}_{(j)}\} where 𝒙~(j)=(  x1,,xj,,  xp)\widetilde{\bm{x}}_{(j)}=(\makebox[0.0pt][l]{\hskip 1.11942pt\hskip 2.18436pt\makebox[0.0pt][c]{\rule[5.59721pt]{5.6604pt}{0.43057pt}}}{x}_{1},...,x_{j},...,\makebox[0.0pt][l]{\hskip 1.11942pt\hskip 2.18436pt\makebox[0.0pt][c]{\rule[5.59721pt]{5.6604pt}{0.43057pt}}}{x}_{p}). Each curve represents the posterior mean for each year and the shaded area represents the corresponding 95% credible band.

7 Discussion

In this paper, we proposed a novel approach to analyze and summarize complex, heterogeneous effects of multivariate continuous exposures. We developed new estimands for this setting, including a treatment effect variable importance measure, which is tailored to multivariate exposure scenarios and provides interpretable quantities that simplify the heterogeneity and provide important epidemiological insights about the effects of air pollution mixtures. To facilitate identification of our estimands, and to allow different strengths of regularization for each component of the model, we proposed a separation of the outcome model regression function, and integrated a targeted smoothing method into our model to obtain smooth estimates of the degree of heterogeneity by each covariate. Our theoretical results, coupled with simulation studies, provide strong empirical and analytical support for the efficiency and validity of the proposed model. We used the model to gain new insights into the causal impact of multivariate air pollution mixtures on mortality rates in the Medicare population. Our model estimates a detrimental impact of air pollution, consistent with previous literature, and shows that this effect is more pronounced in zip-codes with lower socioeconomic status and older individuals.

While this paper provides strong evidence of heterogeneous treatment effects for environmental mixtures, and provides users with a novel methodology for the multivariate, continuous treatment setting, there are certain limitations with our approach. For one, our definition of the MTE-VIM requires the user to choose a reference exposure level 𝒘0\bm{w}_{0}. While we expect most reasonable choices of 𝒘0\bm{w}_{0} to lead to similar values of the MTE-VIM, results could be sensitive to this choice. We recommend users utilize a choice of 𝒘0\bm{w}_{0} that does not violate the positivity assumption, which is a crucial consideration when analyzing multivariate exposures as noted in Antonelli and Zigler (2024). An additional limitation of our approach is that we made the assumption that the treatment effect heterogeneity was additive in the covariates, which allowed us to write h(𝑿,𝑾)=j=1phj(Xj,𝑾)h(\bm{X},\bm{W})=\sum_{j=1}^{p}h_{j}(X_{j},\bm{W}). We find this to be a reasonable assumption in most cases, and one that facilitates estimation in a difficult setting, but more complex forms of heterogeneity may not be captured well by this additive structure.

Lastly, there are many future research directions to consider. One important direction would be to couple these estimation strategies with sensitivity analysis to unmeasured confounding, a common concern in observational studies such as this one. The recent literature has developed advancements in sensitivity analysis in multiple exposure settings for estimating average treatment effects (Zheng et al., 2021), and these ideas could be extended to heterogeneous treatment effects seen here. Another important direction would be to combine the approach developed in this manuscript with methodology for optimal policy estimation, in order to provide environmental regulators increased information about the best practices for reducing pollution in the future in order to obtain the largest public health benefit from a proposed reduction in air pollution levels.


Acknowledgments and Disclosure of Funding

Research described in this article was conducted under contract to the Health Effects Institute (HEI), an organization jointly funded by the United States Environmental Protection Agency (EPA) (Assistance Award No. CR-83590201) and certain motor vehicle and engine manufacturers. The contents of this article do not necessarily reflect the views of HEI, or its sponsors, nor do they necessarily reflect the views and policies of the EPA or motor vehicle and engine manufacturers. The computations in this paper were run on the FASRC FASSE cluster supported by the FAS Division of Science Research Computing Group at Harvard University. Danielle Braun and Kezia Irene were also funded by the following grants from the National Institute of Health; R01AG066793, RF1AG074372, R01AG066793, RF1AG071024, R01ES030616, R01ES034373, RF1AG080948, R01ES034021

Appendix A Proof of Theorem 1

Before showing the proof of the main result, we must first show the conditions on the SoftBART prior distribution that are required, which we refer to as Condition P. Specifically, we need that

  1. (P1)

    There exists positive constants (CM1,CM2)(C_{M1},C_{M2}) such that the prior on the number of trees TT in the ensemble is Π(T=t)=CM1exp{CM2tlogt}\Pi(T=t)=C_{M1}\exp\{-C_{M2}t\log t\}.

  2. (P2)

    A single bandwidth parameter τtτ\tau_{t}\equiv\tau is used and its prior satisfies Π(τx)Cτ1exp(xCτ2)\Pi(\tau\geq x)\leq C_{\tau 1}\exp(-x^{C_{\tau 2}}) and Π(τ1x)Cτ3exp(xCτ4)\Pi(\tau^{-1}\geq x)\leq C_{\tau 3}\exp(-x^{C_{\tau 4}}) for some positive constants Cτ1,,Cτ4C_{\tau 1},\ldots,C_{\tau 4} for all sufficiently large xx, with Cτ2,Cτ4<1C_{\tau 2},C_{\tau 4}<1. Moreover, the density of τ1\tau^{-1} satisfies πτ1(x)Cτ5eCτ6x\pi_{\tau^{-1}}(x)\geq C_{\tau 5}e^{-C_{\tau 6}x} for large enough xx and some positive constants Cτ5C_{\tau 5} and Cτ6C_{\tau 6}.

  3. (P3)

    The prior on the splitting proportions is sDirichlet(a/rξ,,a/rξ)s\sim\operatorname{Dirichlet}(a/r^{\xi},\ldots,a/r^{\xi}) for some ξ>1\xi>1 and a<0a<0, where rr is the dimension of the input space (i.e., pp for ff, qq for gg, and p+qp+q for hh).

  4. (P4)

    The μt\mu_{t\ell}’s are i.i.d. from a density πμ(μ)\pi_{\mu}(\mu) such that πμ(μ)Cμ1eCμ2|μ|\pi_{\mu}(\mu)\geq C_{\mu 1}e^{-C_{\mu 2}|\mu|} for some coefficients Cμ1,Cμ2C_{\mu 1},C_{\mu 2}. Additionally, there exists constants Cμ3,Cμ4C_{\mu 3},C_{\mu 4} such that Π(|μt|x)Cμ3exp(xCμ4)\Pi(|\mu_{t\ell}|\geq x)\leq C_{\mu 3}\exp(-x^{C_{\mu 4}}) for all xx.

  5. (P5)

    Let DtD_{t} denote the depth of tree 𝒯t\mathcal{T}_{t}. Then Π(Dt=k)>0\Pi(D_{t}=k)>0 for all k=0,1,,2dk=0,1,\ldots,2d and Π(Dt>d0)=0\Pi(D_{t}>d_{0})=0 for d02dd_{0}\geq 2d where dd represents the maximum number of covariates on which the function depends.

  6. (P6)

    The gating function ψ:[0,1]\psi:\mathbb{R}\to[0,1] of the SoftBART prior is such that supx|ψ(x)|<\sup_{x}|\psi^{\prime}(x)|<\infty and the function ρ(x)=ψ(x){1ψ(x)}\rho(x)=\psi(x)\{1-\psi(x)\} is such that ρ(x)𝑑x>0\int\rho(x)\ dx>0, |x|mρ(x)<\int|x|^{m}\,\rho(x)<\infty for all integers m0m\geq 0, and ρ(x)\rho(x) can be analytically extended to some strip {z:|(z)|U}\{z:|\Im(z)|\leq U\} in the complex plane.

All conditions above can be achieved by using the default SoftBART prior described in Linero and Yang (2018). We consider the concentration of the posterior with respect to the norm n\|\cdot\|_{n} defined by μn2=1ni=1nμ(𝑿i,𝑾i)2\|\mu\|_{n}^{2}=\frac{1}{n}\sum_{i=1}^{n}\mu(\bm{X}_{i},\bm{W}_{i})^{2}. We will first verify the following conditions (Condition C) for the prior and ground truth μ0\mu_{0} when Condition A and Condition P hold to establish the result.

  1. (C1)

    The prior satisfies Π(μμ0nϵn)eCnϵn2\Pi(\|\mu-\mu_{0}\|_{n}\leq\epsilon_{n})\geq e^{-Cn\epsilon_{n}^{2}} for sufficiently large nn and constant CC independent of (n,p,q)(n,p,q).

  2. (C2)

    There exists a sieve 12\mathcal{F}_{1}\subseteq\mathcal{F}_{2}\subseteq\cdots such that

    logN(n,ϵ¯n,n)CNnϵ¯n2,\log N(\mathcal{F}_{n},\bar{\epsilon}_{n},\|\cdot\|_{n})\leq C_{N}n\bar{\epsilon}_{n}^{2},

    for sufficiently large nn, where CNC_{N} is a constant independent of (n,p,q)(n,p,q) and ϵ¯n\bar{\epsilon}_{n} is a constant multiple of ϵn\epsilon_{n}. Here, N(,δ,n)N(\mathcal{F},\delta,\|\cdot\|_{n}) denotes the δ\delta-covering number, which is the smallest number of δ\delta-balls required to cover \mathcal{F} with respect to n\|\cdot\|_{n}.

  3. (C3)

    For the same n\mathcal{F}_{n} from C2 and same constant CC from C1, we have

    Π(μn)e(C+4)nϵn2,\Pi(\mu\notin\mathcal{F}_{n})\leq e^{-(C+4)n\epsilon_{n}^{2}},

    for sufficiently large nn.

Note that it suffices to consider \|\cdot\|_{\infty} in place of n\|\cdot\|_{n} for each of (C1)–(C3) because trivially μnμ\|\mu\|_{n}\leq\|\mu\|_{\infty}. Following from the results of Ghosal and Van Der Vaart (2007) (see, in particular, Theorem 1, Theorem 4, Section 7.7, and the comments following Theorem 2), verifying Condition C is sufficient to prove Theorem 1.

Assuming Conditions A and P, we can state the following useful results, which follow from the proof of Theorem 2 of Linero and Yang (2018).

Proposition 2

Suppose that Condition A holds, and independent priors are specified for (f,g,h)(f,g,h) such that Condition P holds for each. Then for any c>0c>0 there exist constants and CfC_{f} independent of (n,p,q)(n,p,q) such that for sufficiently large nn we have

Π(ff0cϵnf)eCfnϵnf2.\Pi(\|f-f_{0}\|_{\infty}\leq c\epsilon_{nf})\geq e^{-C_{f}n\epsilon_{nf}^{2}}.

Similarly, there exist CgC_{g} and ChC_{h} such that Π(gg0cϵng)eCgnϵng2\Pi(\|g-g_{0}\|_{\infty}\leq c\epsilon_{ng})\geq e^{-C_{g}n\epsilon_{ng}^{2}} and Π(hh0cϵnh)eChnϵnh2\Pi(\|h-h_{0}\|\leq c\epsilon_{nh})\geq e^{-C_{h}n\epsilon_{nh}^{2}}.

The following lemma now verifies Condition (C1).

Lemma 3

Suppose that Condition (A1) holds with σ2\sigma^{2} fixed and that the conditions of Proposition 2 also hold. Then there exists a constant CC such that for sufficiently large nn

Π(μμ0nϵn)eCnϵn2\Pi(\|\mu-\mu_{0}\|_{n}\leq\epsilon_{n})\geq e^{-Cn\epsilon_{n}^{2}}

where ϵn=max(ϵnf,ϵng,ϵnf)\epsilon_{n}=\max(\epsilon_{nf},\epsilon_{ng},\epsilon_{nf}).

Proof  First, we note that by the triangle inequality and prior independence that

Π(μμ0ϵn)\displaystyle\Pi(\|\mu-\mu_{0}\|_{\infty}\leq\epsilon_{n}) Π(ff0ϵn/3)Π(gg0ϵn/3)Π(hh0ϵn/3).\displaystyle\geq\Pi(\|f-f_{0}\|_{\infty}\leq\epsilon_{n}/3)\Pi(\|g-g_{0}\|_{\infty}\leq\epsilon_{n}/3)\Pi(\|h-h_{0}\|_{\infty}\leq\epsilon_{n}/3).

Applying Proposition 2 with c=1/3c=1/3 and the definition of ϵn\epsilon_{n} this gives

Π(μμ0ϵn)e(Cf+Cg+Ch)nϵn2=eCnϵn2\Pi(\|\mu-\mu_{0}\|_{\infty}\leq\epsilon_{n})\geq e^{-(C_{f}+C_{g}+C_{h})n\epsilon_{n}^{2}}=e^{-Cn\epsilon_{n}^{2}}

where C=Cf+Cg+ChC=C_{f}+C_{g}+C_{h}.  
Next, we prove the existence of an appropriate sieve to establish (C2)–(C3). We use the following result, which is proven in the supplementary material of Linero and Yang (2018), to do this.

Lemma 4

Suppose that Condition P holds. For universal constants (D,E)(D,E) (depending only on the prior) and for sufficiently large (T,d,U,σ2)(T,d,U,\sigma_{2}) and sufficiently small (σ1,δ)(\sigma_{1},\delta), there exists a set 𝒢f\mathcal{G}_{f} with the following properties:

1. Covering entropy control: logN(𝒢f,Dδ,)dlogp+3T2d0log(dσ22T2d0Uδ)\log N(\mathcal{G}_{f},D\delta,\|\cdot\|_{\infty})\leq d\log p+3T2^{d_{0}}\log\left(\frac{d\sigma^{2}_{2}T2^{d_{0}}U}{\delta}\right)

2. Complement probability bound: If Πf\Pi_{f} satisfies Condition P then

Πμ(𝒢fc)\displaystyle\Pi_{\mu}(\mathcal{G}_{f}^{c}) CM1exp{CM2Tlog(T)}+\displaystyle\leq C^{\prime}_{M1}\exp\{-C^{\prime}_{M2}T\log(T)\}+
2d0T[exp{Edlogp}+Cμ1exp{UCμ2}]+\displaystyle\qquad 2^{d_{0}}T\left[\exp\{-Ed\log p\}+C_{\mu 1}\exp\{-U^{C_{\mu 2}}\}\right]+
2Cτ1exp{σ1Cτ2}+2Cτ3exp{σ2Cτ4},\displaystyle\qquad 2C_{\tau_{1}}\,\exp\{-\sigma_{1}^{-C_{\tau 2}}\}+2C_{\tau_{3}}\exp\{-\sigma_{2}^{C_{\tau_{4}}}\},

Similarly, sets 𝒢g\mathcal{G}_{g} and 𝒢h\mathcal{G}_{h} exist with analogous statements holding for Πg\Pi_{g} and Πh\Pi_{h}.

Using this lemma, we can construct an appropriate sieve as follows. First, define the Cartesian product =𝒢f×𝒢g×𝒢h\mathcal{F}=\mathcal{G}_{f}\times\mathcal{G}_{g}\times\mathcal{G}_{h} for appropriately large/small values of (T,d,U,σ2,σ1,ϵ)(T,d,U,\sigma_{2},\sigma_{1},\epsilon). Then we immediately have

logN(,Dδ,)logN(𝒢f,Dδ,)+logN(𝒢g,Dδ,)+logN(𝒢h,Dδ,).\log N(\mathcal{F},D\delta,\|\cdot\|_{\infty})\leq\log N(\mathcal{G}_{f},D\delta,\|\cdot\|_{\infty})+\log N(\mathcal{G}_{g},D\delta,\|\cdot\|_{\infty})+\log N(\mathcal{G}_{h},D\delta,\|\cdot\|_{\infty}).

Following Li et al. (2022), for a large constant κ\kappa to be selected later, take

σ1Cτ2=σ2Cτ4=UCμ2=κnϵn2,\sigma_{1}^{-C_{\tau 2}}=\sigma_{2}^{C_{\tau 4}}=U^{C_{\mu 2}}=\kappa n\epsilon_{n}^{2},

d=κnϵn2/log(p+q+1)d=\lfloor\kappa n\epsilon_{n}^{2}/\log(p+q+1)\rfloor and T=κnϵn2/lognT=\lfloor\kappa n\epsilon_{n}^{2}/\log n\rfloor. By substituting these constants into the covering entropy bound, we can obtain the inequality

logN(,Dϵn,)κnϵn2\log N(\mathcal{F},D\epsilon_{n},\|\cdot\|_{\infty})\leq\kappa^{\prime}n\epsilon_{n}^{2}

for an appropriately chosen κ>κ\kappa^{\prime}>\kappa and large enough nn, which verifies (C2). Additionally, by plugging these choices into the complementary probability bound, we can also make

Π(c)Π(f𝒢fc)+Π(g𝒢gc)+Π(h𝒢hc)exp{(C+4)nϵn2}\Pi(\mathcal{F}^{c})\leq\Pi(f\in\mathcal{G}_{f}^{c})+\Pi(g\in\mathcal{G}_{g}^{c})+\Pi(h\in\mathcal{G}_{h}^{c})\leq\exp\{-(C+4)n\epsilon_{n}^{2}\}

for sufficiently large nn by taking κ\kappa sufficiently large as desired. (C1)–(C3) are verified.

Appendix B Proof of identifiability

We prove that each function component in Equation 1 is identifiable by incorporating the shifting strategy introduced in Section 3.1.

Proposition 5

Suppose 𝒮𝐱0,𝐰0\mathcal{S}_{\bm{x}_{0},\bm{w}_{0}} denotes a space of a tuple (c,f(),g(),h())\big{(}c,f(\cdot),g(\cdot),h(\cdot)\big{)} that satisfies f(𝐱0)=g(𝐰0)=0f(\bm{x}_{0})=g(\bm{w}_{0})=0 and h{𝐱0,𝐰}=h{𝐱,𝐰0}=0h\{\bm{x}_{0},\bm{w}\}=h\{\bm{x},\bm{w}_{0}\}=0 for all 𝐱p\bm{x}\in\mathbb{R}^{p} and 𝐰q\bm{w}\in\mathbb{R}^{q}. If

  1. 1.

    {(c,f(),g(),h()),(c~,f~(),g~(),h~())}𝒮\big{\{}\big{(}c,f(\cdot),g(\cdot),h(\cdot)\big{)},\big{(}\widetilde{c},\widetilde{f}(\cdot),\widetilde{g}(\cdot),\widetilde{h}(\cdot)\big{)}\big{\}}\subset\mathcal{S}, and

  2. 2.

    c+f(𝒙)+g(𝒘)+h(𝒙,𝒘)=c~+f~(𝒙)+g~(𝒘)+h~(𝒙,𝒘)c+f(\bm{x})+g(\bm{w})+h(\bm{x},\bm{w})=\widetilde{c}+\widetilde{f}(\bm{x})+\widetilde{g}(\bm{w})+\widetilde{h}(\bm{x},\bm{w}) for all 𝒙p\bm{x}\in\mathbb{R}^{p} and 𝒘q\bm{w}\in\mathbb{R}^{q},

then (c,f(),g(),h(,))=(c~,f~(),g~(),h~(,))\big{(}c,f(\cdot),g(\cdot),h(\cdot,\cdot)\big{)}=\big{(}\widetilde{c},\widetilde{f}(\cdot),\widetilde{g}(\cdot),\widetilde{h}(\cdot,\cdot)\big{)}.

Proof  Let 𝒙=𝒙0\bm{x}=\bm{x}_{0} and 𝒘=𝒘0\bm{w}=\bm{w}_{0}. Since both tuples belong to 𝒮𝒙0,𝒘0\mathcal{S}_{\bm{x}_{0},\bm{w}_{0}}, the second condition becomes c=c~c=\widetilde{c}. Next, we fix 𝒙=𝒙0\bm{x}=\bm{x}_{0} while allowing 𝒘\bm{w} to vary. Then, we have c+g(𝒘)=c~+g~(𝒘)c+g(𝒘)=c+g~(𝒘)g(𝒘)=g~(𝒘)c+g(\bm{w})=\widetilde{c}+\widetilde{g}(\bm{w})\Leftrightarrow c+g(\bm{w})=c+\widetilde{g}(\bm{w})\Leftrightarrow g(\bm{w})=\widetilde{g}(\bm{w}) for all 𝒘q\bm{w}\in\mathbb{R}^{q}. Analogously, setting 𝒘=𝒘0\bm{w}=\bm{w}_{0} yields f(𝒙)=f~(𝒙)f(\bm{x})=\widetilde{f}(\bm{x}) for all 𝒙p\bm{x}\in\mathbb{R}^{p}. Combining the above results produces c+f(𝒙)+g(𝒘)=c~+f~(𝒙)+g~(𝒘)c+f(\bm{x})+g(\bm{w})=\widetilde{c}+\widetilde{f}(\bm{x})+\widetilde{g}(\bm{w}) for all 𝒙p\bm{x}\in\mathbb{R}^{p} and 𝒘q\bm{w}\in\mathbb{R}^{q}. Hence, it follows that h(𝒙,𝒘)=h~(𝒙,𝒘)h(\bm{x},\bm{w})=\widetilde{h}(\bm{x},\bm{w}) for all 𝒙p\bm{x}\in\mathbb{R}^{p} and 𝒘q\bm{w}\in\mathbb{R}^{q}. Setting 𝒙0=𝔼(𝑿)\bm{x}_{0}=\operatorname{\mathbb{E}}(\bm{X}) and 𝒘0=𝔼(𝑾)\bm{w}_{0}=\operatorname{\mathbb{E}}(\bm{W}) concludes the proof.

 

Appendix C Simulations for the blocking scheme

In this section, we present additional simulation results to support the use of the blocking scheme in calculating variable importance measures, as described in Section 3. We employ the same simulation scenario as in Section 3, where strong interactions exist between covariates and exposures. Recall that the true parameter of interest (ψ1,ψ2,ψ3,ψ4,ψ5)=(0.722,0.278,0,0,0)(\psi_{1},\psi_{2},\psi_{3},\psi_{4},\psi_{5})=(0.722,0.278,0,0,0). We compare three methods for estimating the parameter: 1) using a randomly chosen 100 observations from the original data set (Small); 2) using all observations (Full); 3) partitioning observations into ten blocks and averaging ten estimates (Block).

Table 1 presents the relative RMSE compared to the blocking scheme averaged over 300 data replicates. A relative RMSE greater than 1 indicates a larger RMSE than the blocking scheme. We also present the histograms of the point estimates of ψ2\psi_{2} for each method in Figure 8. Using just one subset of the original observations results in a significantly larger RMSE as it has a larger variability in estimating the parameter, which is expected since it doesn’t incorporate all data points. The blocking scheme, which simply averages estimates from ten small subsets, shows a comparable estimation performance to the “Full” method which requires significantly more intensive computation.

Estimand Small Full Block
ψ1\psi_{1} 1.29 0.99 1
ψ2\psi_{2} 1.66 0.99 1
ψ3\psi_{3} 1.23 1.04 1
ψ4\psi_{4} 1.59 1.17 1
ψ5\psi_{5} 1.66 0.99 1
Table 1: Relative RMSE of estimating MTE-VIM.
Refer to caption
Figure 8: Histograms of 500 posterior means of the variable importance measures, ψ2\psi_{2}. The vertical line represents the true value of 0.278.

References

  • Agier et al. (2016) Lydiane Agier, Lützen Portengen, Marc Chadeau-Hyam, Xavier Basagaña, Lise Giorgis-Allemand, Valérie Siroux, Oliver Robinson, Jelle Vlaanderen, Juan R González, Mark J Nieuwenhuijsen, et al. A systematic comparison of linear regression–based statistical methods to assess exposome-health associations. Environmental health perspectives, 124(12):1848–1856, 2016.
  • Antonelli and Zigler (2024) Joseph Antonelli and Corwin Zigler. Causal analysis of air pollution mixtures: Estimands, positivity, and extrapolation. arXiv preprint arXiv:2401.17385, 2024.
  • Antonelli et al. (2020) Joseph Antonelli, Maitreyi Mazumdar, David Bellinger, David Christiani, Robert Wright, and Brent Coull. Estimating the health effects of environmental mixtures using bayesian semiparametric regression and sparsity inducing priors. The Annals of Applied Statistics, 14(1):257–275, 2020.
  • Athey and Imbens (2016) Susan Athey and Guido Imbens. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27):7353–7360, 2016.
  • Banerjee et al. (2013) Anjishnu Banerjee, David B Dunson, and Surya T Tokdar. Efficient gaussian process regression for large datasets. Biometrika, 100(1):75–89, 2013.
  • Bang and Robins (2005) Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
  • Bargagli-Stoffi et al. (2020) Falco J Bargagli-Stoffi, Riccardo Cadei, Kwonsang Lee, and Francesca Dominici. Causal rule ensemble: Interpretable discovery and inference of heterogeneous causal effects. arXiv preprint arXiv:2009.09036, 2020.
  • Bobb et al. (2015) Jennifer F Bobb, Linda Valeri, Birgit Claus Henn, David C Christiani, Robert O Wright, Maitreyi Mazumdar, John J Godleski, and Brent A Coull. Bayesian kernel machine regression for estimating the health effects of multi-pollutant mixtures. Biostatistics, 16(3):493–508, 2015.
  • Boss et al. (2021) Jonathan Boss, Alexander Rix, Yin-Hsiu Chen, Naveen N Narisetty, Zhenke Wu, Kelly K Ferguson, Thomas F McElrath, John D Meeker, and Bhramar Mukherjee. A hierarchical integrative group least absolute shrinkage and selection operator for analyzing environmental mixtures. Environmetrics, 32(8):e2698, 2021.
  • Breiman (2001) Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
  • Carrico et al. (2015) Caroline Carrico, Chris Gennings, David C Wheeler, and Pam Factor-Litvak. Characterization of weighted quantile sum regression for highly correlated data in a risk analysis setting. Journal of agricultural, biological, and environmental statistics, 20(1):100–120, 2015.
  • Chang and Roy (2024) Peter Chang and Arkaprava Roy. Individualized multi-treatment response curves estimation using rbf-net with shared neurons. arXiv preprint arXiv:2401.16571, 2024.
  • Chipman et al. (2010) Hugh A Chipman, Edward I George, and Robert E McCulloch. Bart: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298, 2010.
  • Di et al. (2019) Qian Di, Heresh Amini, Liuhua Shi, Itai Kloog, Rachel Silvern, James Kelly, M. Benjamin Sabath, Christine Choirat, Petros Koutrakis, Alexei Lyapustin, Yujie Wang, Loretta J. Mickley, and Joel Schwartz. An ensemble-based model of pm2.5 concentration across the contiguous united states with high spatiotemporal resolution. Environment International, 130:104909, 2019.
  • Di et al. (2021) Qian Di, Yaguang Wei, Alexandra Shtein, Carolynne Hultquist, Xiaoshi Xing, Heresh Amini, Liuhua Shi, I Kloog, R Silvern, J Kelly, et al. Daily and annual pm2. 5 concentrations for the contiguous united states, 1-km grids, v1 (2000-2016). NASA Socioeconomic Data and Applications Center (SEDAC), 2021. doi: https://doi.org/10.7927/0rvr-4538.
  • Dominici et al. (2010) Francesca Dominici, Roger D Peng, Christopher D Barr, and Michelle L Bell. Protecting human health from air pollution: shifting from a single-pollutant to a multi-pollutant approach. Epidemiology (Cambridge, Mass.), 21(2):187, 2010.
  • Dorie et al. (2019) Vincent Dorie, Jennifer Hill, Uri Shalit, Marc Scott, and Dan Cervone. Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1):43–68, 2019.
  • Fan et al. (2022) Qingliang Fan, Yu-Chin Hsu, Robert P Lieli, and Yichong Zhang. Estimation of conditional average treatment effects with high-dimensional data. Journal of Business & Economic Statistics, 40(1):313–327, 2022.
  • Ferrari and Dunson (2020) Federico Ferrari and David B Dunson. Identifying main effects and interactions among exposures using gaussian processes. The annals of applied statistics, 14(4):1743, 2020.
  • Ferrari and Dunson (2021) Federico Ferrari and David B Dunson. Bayesian factor analysis for inference on interactions. Journal of the American Statistical Association, 116(535):1521–1532, 2021.
  • Ghosal and Van Der Vaart (2007) Subhashis Ghosal and Aad Van Der Vaart. Convergence rates of posterior distributions for noniid observations. 2007.
  • Gibson et al. (2019) Elizabeth A Gibson, Jeff Goldsmith, and Marianthi-Anna Kioumourtzoglou. Complex mixtures, complex analyses: an emphasis on interpretable results. Current environmental health reports, 6(2):53–61, 2019.
  • Hahn et al. (2020) P Richard Hahn, Jared S Murray, and Carlos M Carvalho. Bayesian regression tree models for causal inference: Regularization, confounding, and heterogeneous effects (with discussion). Bayesian Analysis, 15(3):965–1056, 2020.
  • Herring (2010) Amy H Herring. Nonparametric bayes shrinkage for assessing exposures to mixtures subject to limits of detection. Epidemiology (Cambridge, Mass.), 21(Suppl 4):S71, 2010.
  • Hill et al. (2020) Jennifer Hill, Antonio Linero, and Jared Murray. Bayesian additive regression trees: A review and look forward. Annual Review of Statistics and Its Application, 7(1), 2020.
  • Hines et al. (2022) Oliver Hines, Karla Diaz-Ordaz, and Stijn Vansteelandt. Variable importance measures for heterogeneous causal effects. arXiv preprint arXiv:2204.06030, 2022.
  • Levy et al. (2021) Jonathan Levy, Mark van der Laan, Alan Hubbard, and Romain Pirracchio. A fundamental measure of treatment effect heterogeneity. Journal of Causal Inference, 9(1):83–108, 2021.
  • Li et al. (2023) Haodong Li, Alan Hubbard, and Mark van der Laan. Targeted learning on variable importance measure for heterogeneous treatment effect. arXiv preprint arXiv:2309.13324, 2023.
  • Li et al. (2022) Yinpu Li, Antonio R Linero, and Jared Murray. Adaptive conditional distribution estimation with bayesian decision tree ensembles. Journal of the American Statistical Association, pages 1–14, 2022.
  • Linero (2017) Antonio R Linero. A review of tree-based bayesian methods. Communications for Statistical Applications and Methods, 24(6):543–559, 2017.
  • Linero and Yang (2018) Antonio R Linero and Yun Yang. Bayesian regression tree ensembles that adapt to smoothness and sparsity. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(5):1087–1110, 2018.
  • Narisetty et al. (2019) Naveen N Narisetty, Bhramar Mukherjee, Yin-Hsiu Chen, Richard Gonzalez, and John D Meeker. Selection of nonlinear interactions by a forward stepwise algorithm: Application to identifying environmental chemical mixtures affecting health outcomes. Statistics in medicine, 38(9):1582–1600, 2019.
  • Requia et al. (2021) WJ Requia, Y Wei, A Shtein, C Hultquist, X Xing, Q Di, et al. Daily 8-hour maximum and annual o3 concentrations for the contiguous united states, 1-km grids, v1 (2000-2016). NASA Socioeconomic Data and Applications Center (SEDAC), 2021.
  • Rosenbaum and Rubin (1983) Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
  • Rubin (1980) Donald B Rubin. Randomization analysis of experimental data: The fisher randomization test comment. Journal of the American statistical association, 75(371):591–593, 1980.
  • Samanta and Antonelli (2022) Srijata Samanta and Joseph Antonelli. Estimation and false discovery control for the analysis of environmental mixtures. Biostatistics, 23(4):1039–1055, 2022.
  • Semenova and Chernozhukov (2021) Vira Semenova and Victor Chernozhukov. Debiased machine learning of conditional average treatment effects and other causal functions. The Econometrics Journal, 24(2):264–289, 2021.
  • Shin and Antonelli (2023) Heejun Shin and Joseph Antonelli. Improved inference for doubly robust estimators of heterogeneous treatment effects. Biometrics, 00:1–13, 2023.
  • Shin et al. (2023) Heejun Shin, Danielle Braun, Kezia Irene, and Joseph Antonelli. A spatial interference approach to account for mobility in air pollution studies with multivariate continuous treatments. arXiv preprint arXiv:2305.14194, 2023.
  • Simoni et al. (2015) Marzia Simoni, Sandra Baldacci, Sara Maio, Sonia Cerrai, Giuseppe Sarno, and Giovanni Viegi. Adverse effects of outdoor pollution in the elderly. Journal of thoracic disease, 7(1):34, 2015.
  • Stafoggia et al. (2017) Massimo Stafoggia, Susanne Breitner, Regina Hampel, and Xavier Basagaña. Statistical approaches to address multi-pollutant mixtures and multiple exposures: the state of the science. Current environmental health reports, 4(4):481–490, 2017.
  • Starling et al. (2020) Jennifer E Starling, Jared S Murray, Carlos M Carvalho, Radek K Bukowski, and James G Scott. Bart with targeted smoothing: An analysis of patient-specific stillbirth risk. The Annals of Applied Statistics, 14(1):28–50, 2020.
  • Van Donkelaar et al. (2019) Aaron Van Donkelaar, Randall V Martin, Chi Li, and Richard T Burnett. Regional estimates of chemical composition of fine particulate matter using a combined geoscience-statistical method with information from satellites, models, and monitors. Environmental science & technology, 53(5):2595–2611, 2019.
  • Wager and Athey (2018) Stefan Wager and Susan Athey. Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association, 113(523):1228–1242, 2018.
  • Wang et al. (2020) Bingyu Wang, Ki-Do Eum, Fatemeh Kazemiparkouhi, Cheng Li, Justin Manjourides, Virgil Pavlu, and Helen Suh. The impact of long-term pm2. 5 exposure on specific causes of death: exposure-response curves and effect modification among 53 million us medicare beneficiaries. Environmental Health, 19(1):1–12, 2020.
  • Wei et al. (2020) Ran Wei, Brian J Reich, Jane A Hoppin, and Subhashis Ghosal. Sparse bayesian additive nonparametric regression with application to health effects of pesticides mixtures. Statistica Sinica, 30(1):55–79, 2020.
  • Williamson et al. (2021) Brian D Williamson, Peter B Gilbert, Marco Carone, and Noah Simon. Nonparametric variable importance assessment using machine learning techniques. Biometrics, 77(1):9–22, 2021.
  • Zhang and Janson (2020) Lu Zhang and Lucas Janson. Floodgate: inference for model-free variable importance. arXiv preprint arXiv:2007.01283, 2020.
  • Zheng et al. (2021) Jiajing Zheng, Alexander D’Amour, and Alexander Franks. Bayesian inference and partial identification in multi-treatment causal inference with unobserved confounding. arXiv preprint arXiv:2111.07973, 2021.