
Local Gaussian process extrapolation for BART models with applications to causal inference

Meijia Wang  
Arizona State University  
Jingyu He  
City University of Hong Kong  
P. Richard Hahn  
Arizona State University
Abstract

Bayesian additive regression trees (BART) is a semi-parametric regression model offering state-of-the-art performance on out-of-sample prediction. Despite this success, standard implementations of BART typically provide inaccurate prediction and overly narrow prediction intervals at points outside the range of the training data. This paper proposes a novel extrapolation strategy that grafts Gaussian processes to the leaf nodes in BART for predicting points outside the range of the observed data. The new method is compared to standard BART implementations and recent frequentist resampling-based methods for predictive inference. We apply the new approach to a challenging problem from causal inference, wherein for some regions of predictor space, only treated or untreated units are observed (but not both). In simulation studies, the new approach boasts superior performance compared to popular alternatives, such as Jackknife+.


Keywords: Tree, Extrapolation, Gaussian process, Predictive interval, XBART, XBCF

1 Introduction

Tree-based supervised learning algorithms, such as the Classification and Regression Tree (CART) (Breiman et al., 1984), Random Forests (Breiman, 2001), and XGBoost (Chen and Guestrin, 2016) are popular in practice due to their ability to learn complex nonlinear functions efficiently. Bayesian Additive Regression Trees (BART, Chipman et al. (2010)) is the most popular model-based regression tree method; it has been demonstrated empirically to provide accurate out-of-sample prediction (without covariate shift), and its Bayesian uncertainty intervals often out-perform alternatives in terms of frequentist coverage (see Chipman et al. (2010); Kapelner and Bleich (2013)). XBART (He and Hahn, 2021) is a stochastic tree ensemble method that can be used to approximate BART models in a fraction of the run-time. Throughout the paper, we will refer to BART models but will use the XBART fitting algorithm.

While tree-based methods frequently provide accurate out-of-sample predictions, their ability to extrapolate is fundamentally limited by their intrinsic, piecewise constant structure. Trees partition the covariate space into disjoint rectangular regions (leaf nodes) based on the observed covariates, then estimate a simple model (typically a constant) based on training data falling in the corresponding region. Consequently, to predict testing data outside the observed range of covariates in training, a naive tree regression model would extrapolate the constant of whatever leaf node that point falls in. Figure 1 illustrates the potential problem of extrapolation for traditional tree models with constant leaf parameters in a one-dimensional covariate space. If the testing data lie outside the range of the training data, tree models must extrapolate using the constant leaf parameter of the corresponding leaf node, regardless of how far the testing observations are from the nearest training point.

Figure 1: An example of traditional tree extrapolation. The synthetic training data are black points. The fitted CART model is the step function shown as a solid line. The two vertical dashed lines mark testing points to predict, which lie outside the range of the training data. Traditional tree algorithms with constant leaf parameters extrapolate the prediction by the constant of the corresponding leaf node.

In addition to poor point predictions, a flawed extrapolation model impairs posterior prediction intervals. Ideally, a BART model would predict essentially according to the prior at points far from the observed data; standard implementations instead never sample cutpoints beyond the training range and consequently dramatically understate the uncertainty in the unknown function in extrapolation regions. As a result, the coverage of predictive intervals from standard tree-based methods can be poor. While regression tree methods can be used in conjunction with resampling approaches to construct improved prediction intervals (Efron and Stein, 1981; Papadopoulos, 2008; Vovk, 2015; Barber et al., 2021), such a strategy does not solve the intrinsically poor extrapolation of regression trees.

This paper presents a novel approach to help tree methods extrapolate to exterior points and construct prediction intervals. The new method (XBART-GP) fits a Gaussian process (GP) at each leaf node of a BART tree and predicts exterior test points in the leaf by the associated GP model. Our approach differs from a treed Gaussian process (Gramacy and Lee, 2008) in that the local GP extrapolation does not impact the initial BART model fit. Intuitively, the model of Gramacy and Lee (2008) is a “treed” GP, while our approach is a “GPed” tree. The local GP extrapolation defines a novel posterior predictive distribution in terms of BART posterior samples and a GP model; it is not a new likelihood for inferring the trees themselves. Instead, the BART-inferred trees are used as inputs to the novel posterior predictive specification (cf. section 3). Our proposed method shares a similar structure with the targeted smoothing BART (tsBART) model (Starling et al., 2018, 2021), which fits a Gaussian process on a single targeted covariate in each tree leaf to introduce smoothness. In contrast, our approach fits the Gaussian process on a selection of covariates to extrapolate prediction.

In simulation studies, we compare our approach with recently proposed alternatives such as Jackknife+ (Barber et al., 2021). Our results illustrate that XBART-GP achieves nearly the desired coverage on out-of-range data while not substantially inflating the interval length.

Additionally, our new extrapolation method is illustrated with an application to causal inference. When only treated (untreated) units are observable in a particular region of covariate space, inferring the untreated (treated) potential outcomes requires extrapolating the untreated (treated) response surface into that region. Such a scenario is said to violate the positivity condition necessary for identifying causal effects from observational data. Previous work on positivity violation includes D’Amour et al. (2021), who focus on high-dimensional covariate spaces; Nethery et al. (2019), who propose a spline model to extrapolate in non-overlap regions; and Zhu et al. (2023), who construct a Gaussian process for the entire covariate space. In this paper, we apply the proposed local GP extrapolation approach to the Bayesian causal forest model (Hahn et al., 2020; Krantsevich et al., 2022).

The rest of the paper is organized as follows. Section 2 reviews Gaussian processes, BART, Jackknife+, and Bayesian causal forests. Section 3 develops the new approach in detail and presents simulation studies demonstrating its advantages over the current state-of-the-art. Section 4 adapts the method to the causal inference context and presents simulation results showing that local GP extrapolation of BART models can handle substantial violations of the overlap condition.

2 Background

2.1 Notation

We clarify the notation used throughout the paper. Let x=(x(1),,x(p))\mathrm{x}=(x^{(1)},\dots,x^{(p)}) denote a pp-dimensional covariate vector, or vector of regressors. The capital letter 𝐗=(x1,,xn)\mathbf{X}=(\mathrm{x}^{\prime}_{1},\dots,\mathrm{x}^{\prime}_{n})^{\prime} denotes the n×pn\times p predictor matrix, and y=(y1,,yn)\mathrm{y}=(y_{1},\dots,y_{n}) is a vector of target values for supervised learning. First, we demonstrate our approach under the standard regression setting,

yi=f(xi)+ϵi,ϵiN(0,σ2),y_{i}=f(\mathrm{x}_{i})+\epsilon_{i},\quad\epsilon_{i}\sim N(0,\sigma^{2}), (1)

where we assume the residual term ϵ\epsilon is a Gaussian noise term with mean zero and variance σ2\sigma^{2}.

Let Cn,α(x)C_{n,\alpha}(\mathrm{x}) denote the prediction interval for a test point x\mathrm{x} with confidence level 1α1-\alpha given nn predicted values. The notations qn,α/2{yi}q^{-}_{n,\alpha/2}\{y_{i}\} and qn,α/2+{yi}q^{+}_{n,\alpha/2}\{y_{i}\} denote the α/2(n+1)\lfloor\alpha/2(n+1)\rfloor-th smallest value and the (1α/2)(n+1)\lceil(1-\alpha/2)(n+1)\rceil-th smallest value, respectively, among a set of values {y1,,yn}\{y_{1},\dots,y_{n}\}.
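As a concrete illustration of this convention, the short Python sketch below (the function names are ours, written only for exposition) returns the two order statistics used to form a 1-α interval.

    import numpy as np

    def q_lower(values, alpha):
        # floor((alpha/2)(n+1))-th smallest value, clipped to a valid index
        v = np.sort(np.asarray(values))
        n = len(v)
        k = int(np.floor(alpha / 2 * (n + 1)))
        return v[min(max(k, 1), n) - 1]

    def q_upper(values, alpha):
        # ceil((1-alpha/2)(n+1))-th smallest value, clipped to a valid index
        v = np.sort(np.asarray(values))
        n = len(v)
        k = int(np.ceil((1 - alpha / 2) * (n + 1)))
        return v[min(max(k, 1), n) - 1]

    # Example: a 90% interval (alpha = 0.1) from 99 predicted values
    preds = np.random.default_rng(0).normal(size=99)
    interval = (q_lower(preds, 0.1), q_upper(preds, 0.1))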

2.2 BART model and XBART algorithm

The BART model (Chipman et al., 2010) assumes that the unknown function f(x)f(\mathrm{x}) in the regression model (1) can be approximated by a sum of regression trees, i.e.

f(x)=l=1Lg(x,Tl,μl),f(\mathrm{x})=\sum_{l=1}^{L}g(\mathrm{x},\mathrm{T}_{l},\mathrm{\mu}_{l}), (2)

where Tl\mathrm{T}_{l} represents a tree structure, which is a set of split rules partitioning the covariate space, and μl\mathrm{\mu}_{l} is a vector of leaf parameters associated with the leaf nodes in tree Tl\mathrm{T}_{l}.

[Figure 2 graphic: a binary regression tree with internal splits x1 ≤ 0.8 and x2 ≤ 0.4 leading to leaf parameters μ1, μ2, μ3 (left), and the corresponding rectangular partition of the (x1, x2) covariate space (right).]
Figure 2: A regression tree partitions a two-dimensional covariate space (right) with two internal decision nodes (left). Each element of the partition corresponds to a terminal node in the tree with leaf parameter μl\mu_{l}.

Figure 2 illustrates a binary regression tree. The tree consists of a set of internal decision nodes that partition the covariate space into a set of terminal nodes (leaf nodes, say {𝒜1,,𝒜B}\{\mathcal{A}_{1},\cdots,\mathcal{A}_{B}\}). The tree function g(x,Tl,μl)g(\mathrm{x},\mathrm{T}_{l},\mathrm{\mu}_{l}) is essentially a step function due to the constant leaf parameters. It assigns to a point x\mathrm{x} the leaf parameter of the leaf node into which it falls in tree Tl\mathrm{T}_{l}.

Trees are known to be prone to overfitting due to their high flexibility. Thus, proper regularization is necessary to achieve good out-of-sample performance. BART assigns a regularization prior to the tree structure that strongly favors weak or small trees. The tree prior p(Tl)p(\mathrm{T}_{l}) specifies the probability for a node at depth dd to split into two child nodes to be

α(1+d)β,α(0,1),β[0,),\alpha(1+d)^{-\beta},\quad\alpha\in(0,1),\,\beta\in[0,\infty), (3)

which decreases exponentially as the tree grows deeper, implying strong regularization on the size of the tree. Chipman et al. (2010) suggest choosing α=0.95\alpha=0.95 and β=2\beta=2. The prior of each leaf parameter p(μlb)p(\mu_{lb}) is assumed to be independent normal with variance τ\tau, i.e. μlbN(0,τ)\mu_{lb}\sim N(0,\tau). The prior of the residual variance σ2\sigma^{2} is set to be inverse-Gamma(a,b)\text{inverse-Gamma}(a,b).
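To see how strongly this prior penalizes depth, the following short computation (purely illustrative) evaluates the split probability at the suggested values α = 0.95 and β = 2.

    # Split probability alpha * (1 + d)^(-beta) under the BART tree prior
    alpha, beta = 0.95, 2.0
    for d in range(5):
        print(f"depth {d}: split probability {alpha * (1 + d) ** (-beta):.3f}")
    # roughly 0.95, 0.24, 0.11, 0.06, 0.04: deep trees are strongly discouraged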

The ensemble of trees is fitted by Bayesian backfitting and MCMC sampling schemes. Let Tl\mathrm{T}_{-l} denote the set of all trees except Tl\mathrm{T}_{l}, and similarly define μl\mathrm{\mu}_{-l}. Note that the conditional posterior p(Tl,μlTl,μl,σ2,y)p(\mathrm{T}_{l},\mathrm{\mu}_{l}\mid\mathrm{T}_{-l},\mathrm{\mu}_{-l},\sigma^{2},y) depends on the other trees and parameters only through the residuals

rl:=yhlg(x,T^h,μ^h)=g(x,Tl,μl).\mathrm{r}_{l}:=y-\sum_{h\neq l}g(\mathrm{x},\widehat{\mathrm{T}}_{h},\widehat{\mathrm{\mu}}_{h})=g(\mathrm{x},\mathrm{T}_{l},\mathrm{\mu}_{l}). (4)

Furthermore, one can integrate out the conjugate prior of μl\mathrm{\mu}_{l} and derive the posterior of a tree Tl\mathrm{T}_{l} in closed form, i.e.,

p(Tlrl,σ2)p(Tl)p(rlμl,Tl,σ2)p(μlTl,σ2)𝑑μl.p(\mathrm{T}_{l}\mid\mathrm{r}_{l},\sigma^{2})\propto p(\mathrm{T}_{l})\int p(\mathrm{r}_{l}\mid\mathrm{\mu}_{l},\mathrm{T}_{l},\sigma^{2})p(\mathrm{\mu}_{l}\mid\mathrm{T}_{l},\sigma^{2})d\mathrm{\mu}_{l}. (5)

This allows the conditional posterior Tlrl,σ2\mathrm{T}_{l}\mid\mathrm{r}_{l},\sigma^{2} and μlTl,rl,σ2\mathrm{\mu}_{l}\mid\mathrm{T}_{l},\mathrm{r}_{l},\sigma^{2} to be drawn separately and sequentially for l=1,,Ll=1,\dots,L. After all trees are grown (we call it a sweep), the residual variance σ2\sigma^{2} is then sampled from the full conditional σ2T1,,TL,μ1,,μL,y\sigma^{2}\mid\mathrm{T}_{1},\cdots,\mathrm{T}_{L},\mathrm{\mu}_{1},\cdots,\mathrm{\mu}_{L},y.

The original BART model (Chipman et al., 2010) draws trees from the posterior using a random walk Metropolis-Hastings (MH-MCMC) algorithm. In each iteration, the algorithm randomly proposes a single grow or prune move for each tree and accepts or rejects it according to the MH ratio. Since each proposal makes only a small modification, the MH-MCMC algorithm explores the posterior model space inefficiently. The experiments in Chipman et al. (2010) typically use 200200 burn-in steps and 10001000 iterations with 200200 trees. Furthermore, the algorithm does not scale well with large numbers of observations or covariates.

XBART (He et al., 2019; He and Hahn, 2021), short for accelerated Bayesian additive regression trees, is a stochastic tree ensemble method inspired by BART. The essential idea of XBART is to avoid the expensive MH-MCMC computation and instead grow completely new trees recursively (as in CART). At each node, the model considers a similar set of cutpoints as BART and samples a split criterion with probability proportional to the marginal likelihood. It therefore explores the BART posterior much more efficiently and scales well. Our approach is designed for both the XBART and BART frameworks; we demonstrate its performance using XBART for computational efficiency.

To obtain the posterior prediction interval from XBART, we take the sampled trees as draws from a standard Bayesian Monte Carlo algorithm. Suppose the algorithm draws SS sweeps with S0S_{0} burn-in iterations and LL trees per forest. The predicted posterior mean and variance at iteration ss are defined as f^s(x)=l=1Lgl(x,T^l(s),μ^l(s))\widehat{f}_{s}(\mathrm{x})=\sum_{l=1}^{L}g_{l}(\mathrm{x},\widehat{\mathrm{T}}_{l}^{(s)},\widehat{\mu}_{l}^{(s)}) and σ^s2\widehat{\sigma}_{s}^{2} respectively. The final prediction of the target value y^s\widehat{y}_{s} is sampled from N(f^s(x),σ^s2)N(\widehat{f}_{s}(\mathrm{x}),\widehat{\sigma}_{s}^{2}). Furthermore, the prediction interval at target level 1α1-\alpha is given by quantiles of the predictions as follows

C^S,αXBART(x)=[q^S,α/2{y^s},q^S,α/2+{y^s}].\widehat{C}^{\text{XBART}}_{S,\alpha}(\mathrm{x})=\left[\widehat{q}^{-}_{S,\alpha/2}\{\widehat{y}_{s}\},\widehat{q}^{+}_{S,\alpha/2}\{\widehat{y}_{s}\}\right]. (6)

Note that this definition of the posterior prediction interval is based on the prediction of every single tree with a constant leaf parameter, i.e., it still suffers from the extrapolation problem when the testing data lie outside the range of the training data.
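In code, the interval in Equation (6) can be assembled from the retained posterior draws as in the sketch below, where mean_draws and sigma_draws stand in for the S retained posterior samples of the mean and residual standard deviation at a single test point; the names are illustrative rather than part of the XBART package.

    import numpy as np

    def xbart_interval(mean_draws, sigma_draws, alpha=0.1, rng=None):
        # One predictive draw per sweep: y_hat_s ~ N(f_hat_s(x), sigma_s^2)
        rng = rng or np.random.default_rng()
        y_draws = rng.normal(np.asarray(mean_draws), np.asarray(sigma_draws))
        # Empirical alpha/2 and 1 - alpha/2 quantiles of the predictive draws
        return np.quantile(y_draws, alpha / 2), np.quantile(y_draws, 1 - alpha / 2)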

2.3 Gaussian process regression

This section reviews Gaussian process regression, a critical ingredient of the proposed XBART-GP extrapolation. Gaussian process regression is a non-parametric kernel-based Bayesian probabilistic model. It can interpolate and extrapolate as the covariance defines the model’s behavior over its full domain. Essentially, the specified covariance function characterizes the linear dependence of the response at pairs of points as a function of some measure of distance between those points in covariate space. For a textbook treatment of the Gaussian process, see Rasmussen (2003).

More specifically, Gaussian process regression assumes {f(x),xp}\{f(\mathrm{x}),\mathrm{x}\in\mathbb{R}^{p}\} is a set of random variables that follows the Gaussian process with some mean function μ(x)\mu(\mathrm{x}) and covariance function k(x,x)k(\mathrm{x},\mathrm{x}^{*}). Let (𝐗,y)(\mathbf{X},\mathrm{y}) denote a set of training data, and 𝐗\mathbf{X}^{*} denote the input of testing data. Following the standard regression setting in Equation 1, the joint distribution of function values f(𝐗)f(\mathbf{X}^{*}) and y\mathrm{y} is given by

(f(𝐗)y)N(μ(𝐗𝐗),(Σ𝐗𝐗,Σ𝐗𝐗Σ𝐗𝐗,Σ𝐗𝐗+σ2𝐈))\begin{pmatrix}f(\mathbf{X}^{*})\\ \mathrm{y}\end{pmatrix}\sim N\left(\mu\begin{pmatrix}\mathbf{X}^{*}\\ \mathbf{X}\end{pmatrix},\begin{pmatrix}\Sigma_{\mathbf{X}^{*}\mathbf{X}^{*}},&\Sigma_{\mathbf{X}^{*}\mathbf{X}}\\ \Sigma_{\mathbf{X}\mathbf{X}^{*}},&\Sigma_{\mathbf{X}\mathbf{X}}+\sigma^{2}\mathbf{I}\end{pmatrix}\right) (7)

where μ()\mu(\cdot) is a regression function, and the covariance matrices are calculated by the pre-specified kernel function k(x,x)k(\mathrm{x},\mathrm{x}^{*}).

The prediction of the outcome f(𝐗)f(\mathbf{X}^{*}) can be drawn from the conditional distribution f(𝐗)N(μ𝐗𝐗,Σ𝐗𝐗)f(\mathbf{X}^{*})\sim N(\mu_{\mathbf{X}^{*}\mid\mathbf{X}},\Sigma_{\mathbf{X}^{*}\mid\mathbf{X}}) with conditional mean and covariance:

μ𝐗𝐗=μ(𝐗)+Σ𝐗𝐗[Σ𝐗𝐗+σ2𝐈]1(yμ(𝐗))\mu_{\mathbf{X}^{*}\mid\mathbf{X}}=\mu(\mathbf{X}^{*})+\Sigma_{\mathbf{X}^{*}\mathbf{X}}\left[\Sigma_{\mathbf{X}\mathbf{X}}+\sigma^{2}\mathbf{I}\right]^{-1}(\mathrm{y}-\mu(\mathbf{X})) (8)
Σ𝐗𝐗=Σ𝐗𝐗Σ𝐗𝐗[Σ𝐗𝐗+σ2𝐈]1Σ𝐗𝐗.\Sigma_{\mathbf{X}^{*}\mid\mathbf{X}}=\Sigma_{\mathbf{X}^{*}\mathbf{X}^{*}}-\Sigma_{\mathbf{X}^{*}\mathbf{X}}\left[\Sigma_{\mathbf{X}\mathbf{X}}+\sigma^{2}\mathbf{I}\right]^{-1}\Sigma_{\mathbf{X}\mathbf{X}^{*}}. (9)
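The conditional mean and covariance in Equations (8) and (9) reduce to a few lines of linear algebra. The sketch below assumes the kernel matrices have already been computed and uses linear solves rather than an explicit matrix inverse.

    import numpy as np

    def gp_conditional(K_ss, K_sx, K_xx, y, sigma2, mean_x=None, mean_s=None):
        # K_ss = k(X*, X*), K_sx = k(X*, X), K_xx = k(X, X); sigma2 is the noise variance
        n = K_xx.shape[0]
        mean_x = np.zeros(n) if mean_x is None else mean_x
        mean_s = np.zeros(K_ss.shape[0]) if mean_s is None else mean_s
        A = K_xx + sigma2 * np.eye(n)
        cond_mean = mean_s + K_sx @ np.linalg.solve(A, y - mean_x)   # Equation (8)
        cond_cov = K_ss - K_sx @ np.linalg.solve(A, K_sx.T)          # Equation (9)
        return cond_mean, cond_cov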

3 Local GP extrapolation for BART

Traditional BART prediction intervals in Equation (6) provide good coverage with tight intervals on many data generating processes (He and Hahn, 2021) with exchangeable train and test data. However, predictions are less reliable when the train and test sets differ substantially in terms of their support. The intuition behind our method is to use the tried-and-true BART predictions (and intervals) for prediction points within the range of the training data but to incorporate Gaussian process extrapolation for points well outside this support. Therefore, the first step in describing the new method is to formally define the concepts of exterior points in the context of regression trees. Specifically, we will define extrapolation points locally in terms of the regression trees: a test point x\mathrm{x}^{\prime} is an exterior point if and only if it is outside the range of the training data in the leaf node it falls in. The basic strategy of the new method is to use the leaf-specific training data to extrapolate such exterior points using a local Gaussian process. Combining these local GP predictions across trees (and across posterior samples) constitutes the main technical content of the new method, details of which are below.

Before getting into those details, a big-picture explanation may be helpful. Typically, a Bayesian posterior predictive distribution is

f(y~y1:n,𝐗1:n)=f(y~θ)π(θy1:n,𝐗1:n)𝑑θ,f(\tilde{y}\mid y_{1:n},\mathbf{X}_{1:n})=\int f(\tilde{y}\mid\theta)\pi(\theta\mid y_{1:n},\mathbf{X}_{1:n})d\theta,

which conveys the idea that future and past data are conditionally independent given the model parameters. Here, we explicitly deviate from this formulation, but only for those future points that are exterior (defined as a function of the model parameters). For such a point, our posterior predictive is defined as

f(y~y1:n,𝐗1:n)=f(y~θ,y1:n,𝐗1:n)π(θy1:n,𝐗1:n)𝑑θf(\tilde{y}\mid y_{1:n},\mathbf{X}_{1:n})=\int f(\tilde{y}\mid\theta,y_{1:n},\mathbf{X}_{1:n})\pi(\theta\mid y_{1:n},\mathbf{X}_{1:n})d\theta

where the predictive distribution explicitly involves both the training data and model parameters (trees and leaf means in the case of BART). In this sense, the proposed approach does “use the data twice” but not in the construction of the posterior. Rather, our procedure is a way of combining the orthodox BART posterior with an intentionally and explicitly distinct posterior predictive model for extrapolation points. Although this is an uncommon approach that violates so-called “diachronic coherence” (Skyrms, 2006), it is in no sense illicit in that it merely amounts to specifying a user-defined predictive distribution that takes BART posterior samples as inputs. From this perspective, describing the method amounts to providing a definition of the predictive kernel, f(y~θ,y1:n,𝐗1:n)f(\tilde{y}\mid\theta,y_{1:n},\mathbf{X}_{1:n}). Specifically, this will be a Gaussian process with a covariance kernel defined in terms of the trees of the BART model.

3.1 Notation

Consider a fitted BART forest {T^l1lL}\{\widehat{\mathrm{T}}_{l}\mid 1\leq l\leq L\} with LL trees. Let {𝒜l1,,𝒜lBl}\{\mathcal{A}_{l1},\cdots,\mathcal{A}_{lB_{l}}\} denote the covariate space partitioned by the ll-th tree T^l\widehat{\mathrm{T}}_{l} with BlB_{l} leaf nodes, and let σ^2\widehat{\sigma}^{2} denote the estimated residual variance. For the ll-th tree, suppose the test point x\mathrm{x} falls in a leaf node bb with covariate space 𝒜lb\mathcal{A}_{lb}. Let the training data that fall in this leaf node and their residuals at the current tree (defined by Equation (4)) be denoted 𝐗tr\mathbf{X}^{\text{tr}} and rtr\mathrm{r}^{\text{tr}}. Similarly, 𝐗te\mathbf{X}^{\text{te}} and rte\mathrm{r}^{\text{te}} denote the exterior testing data in the leaf node bb and their to-be-predicted residuals.

Note that a tree-partitioned leaf node 𝒜lb\mathcal{A}_{lb} can extend to infinity if it lies at the boundary. Specifically, the range of the subset of training data 𝐗tr\mathbf{X}^{\text{tr}} falling in the bb-th leaf node of the ll-th tree forms a pp-dimensional hypercube lb\mathcal{B}_{lb}, which is a subset of 𝒜lb\mathcal{A}_{lb}. If the testing point x\mathrm{x} falls within the range of the training data, i.e., xlb\mathrm{x}\in\mathcal{B}_{lb}, we consider it an interior point, and its prediction follows the standard XBART model. Otherwise, if x{𝒜lblb}\mathrm{x}\in\{\mathcal{A}_{lb}\setminus\mathcal{B}_{lb}\}, it is an exterior point, and its prediction requires extrapolation. Notably, this definition is local: testing data can be exterior for some of the trees and interior for others in the XBART forest.

3.2 Defining the GP model

To extrapolate the prediction of XBART, we assume that the exterior testing data 𝐗te\mathbf{X}^{\text{te}} and training data 𝐗tr\mathbf{X}^{\text{tr}} in the same leaf node form a Gaussian process. The joint distribution of the training and testing data in a leaf node is given by

(rtertr)N(μJ,(𝚺𝐗te𝐗te,𝚺𝐗te𝐗tr𝚺𝐗tr𝐗te,𝚺𝐗tr𝐗tr+1Lσ^2𝐈))\begin{pmatrix}\mathrm{r}^{\text{te}}\\ \mathrm{r}^{\text{tr}}\end{pmatrix}\sim N\left(\mu\mathrm{J},\begin{pmatrix}\mathbf{\Sigma}_{\mathbf{X}^{\text{te}}\mathbf{X}^{\text{te}}},&\mathbf{\Sigma}_{\mathbf{X}^{\text{te}}\mathbf{X}^{\text{tr}}}\\ \mathbf{\Sigma}_{\mathbf{X}^{\text{tr}}\mathbf{X}^{\text{te}}},&\mathbf{\Sigma}_{\mathbf{X}^{\text{tr}}\mathbf{X}^{\text{tr}}}+\frac{1}{L}\widehat{\sigma}^{2}\mathbf{I}\end{pmatrix}\right) (10)

where μ\mu is the leaf parameter, J\mathrm{J} is a column vector of ones, and 𝐈\mathbf{I} is an identity matrix. By definition, the variance of the response yy is assumed to be σ2\sigma^{2}. Here we assume that the residuals in each tree contribute equally to the total variance.

To reflect the relative distance among covariates, we define the covariance function of the Gaussian process as the squared exponential kernel

k(x,x)=τexp(θi=1p(xixi)22δi2)k(\mathrm{x},\mathrm{x}^{\prime})=\tau\exp\left(-\theta\sum_{i=1}^{p}\frac{(x_{i}-x^{\prime}_{i})^{2}}{2\delta_{i}^{2}}\right) (11)

where δi\delta_{i} is the range of the ii-th variable xix_{i} in a leaf node, θ\theta controls the smoothness, and τ\tau determines the scale of the function. Here x=(x1,,xp)\mathrm{x}=(x_{1},\cdots,x_{p}) and x=(x1,,xp)\mathrm{x}^{\prime}=(x_{1}^{\prime},\cdots,x_{p}^{\prime}) are the covariate vectors of the two data points. In our experiments, we use θ=0.1\theta=0.1 and set τ\tau to be the variance of the observed responses divided by the number of trees.

The conditional distribution of the residuals rte\mathrm{r}^{\text{te}} is given by

μ𝐗te𝐗tr\displaystyle\mathrm{\mu}_{\mathbf{X}^{\text{te}}\mid\mathbf{X}^{\text{tr}}} =μJ+𝚺𝐗te𝐗tr[𝚺𝐗tr𝐗tr+1Lσ^2𝐈]1(rtrμJ)\displaystyle=\mu\mathrm{J}+\mathbf{\Sigma}_{\mathbf{X}^{\text{te}}\mathbf{X}^{\text{tr}}}\left[\mathbf{\Sigma}_{\mathbf{X}^{\text{tr}}\mathbf{X}^{\text{tr}}}+\frac{1}{L}\widehat{\sigma}^{2}\mathbf{I}\right]^{-1}(\mathrm{r}^{\text{tr}}-\mu\mathrm{J}) (12)
𝚺𝐗te𝐗tr\displaystyle\mathbf{\Sigma}_{\mathbf{X}^{\text{te}}\mid\mathbf{X}^{\text{tr}}} =𝚺𝐗te𝐗te𝚺𝐗te𝐗tr[𝚺𝐗tr𝐗tr+1Lσ^2𝐈]1𝚺𝐗tr𝐗te.\displaystyle=\mathbf{\Sigma}_{\mathbf{X}^{\text{te}}\mathbf{X}^{\text{te}}}-\mathbf{\Sigma}_{\mathbf{X}^{\text{te}}\mathbf{X}^{\text{tr}}}\left[\mathbf{\Sigma}_{\mathbf{X}^{\text{tr}}\mathbf{X}^{\text{tr}}}+\frac{1}{L}\widehat{\sigma}^{2}\mathbf{I}\right]^{-1}\mathbf{\Sigma}_{\mathbf{X}^{\text{tr}}\mathbf{X}^{\text{te}}}.

The prediction for the residuals of the exterior testing data associated with this leaf node is drawn from the conditional normal distribution rteN(μ𝐗te𝐗tr,𝚺𝐗te𝐗tr)\mathrm{r}^{\text{te}}\sim N(\mathrm{\mu}_{\mathbf{X}^{\text{te}}\mid\mathbf{X}^{\text{tr}}},\mathbf{\Sigma}_{\mathbf{X}^{\text{te}}\mid\mathbf{X}^{\text{tr}}}) rather than the fixed leaf parameter of the tree, which is used for predicting interior points.
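Putting Equations (11) and (12) together, a single leaf's exterior test points could be extrapolated as in the sketch below; delta holds the per-variable ranges of the leaf's training data, theta and tau are the kernel hyperparameters described above, and all function names are ours.

    import numpy as np

    def sq_exp_kernel(X1, X2, delta, theta=0.1, tau=1.0):
        # Squared exponential kernel of Equation (11); guard against zero ranges
        delta = np.where(delta > 0, delta, 1.0)
        D = (((X1[:, None, :] - X2[None, :, :]) / delta) ** 2).sum(axis=-1)
        return tau * np.exp(-theta * D / 2.0)

    def leaf_gp_extrapolate(X_tr, r_tr, X_te, mu_leaf, sigma2, L, delta,
                            theta=0.1, tau=1.0, rng=None):
        # One draw of the exterior residuals in this leaf, following Equation (12)
        rng = rng or np.random.default_rng()
        K_tt = sq_exp_kernel(X_tr, X_tr, delta, theta, tau) + (sigma2 / L) * np.eye(len(X_tr))
        K_st = sq_exp_kernel(X_te, X_tr, delta, theta, tau)
        K_ss = sq_exp_kernel(X_te, X_te, delta, theta, tau)
        cond_mean = mu_leaf + K_st @ np.linalg.solve(K_tt, r_tr - mu_leaf)
        cond_cov = K_ss - K_st @ np.linalg.solve(K_tt, K_st.T)
        return rng.multivariate_normal(cond_mean, cond_cov)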

3.3 Algorithms

It can be computationally intensive to find the corresponding training data and calculate the covariance matrix for each test data observation. Following the fast algorithm GrowFromRoot introduced in He et al. (2019), we propose an efficient method that pinpoints the exterior testing data and its neighborhood training data in the same leaf node by recursively partitioning the training and testing data at the same time. The procedures are summarized in Algorithm 1 and 2.

Algorithm 1 Prediction by GP extrapolation
1:procedure predict(𝐗te,𝐗tr,ytr,{Tsl},{μsl},L,S\mathbf{X}^{\text{te}},\mathbf{X}^{\text{tr}},\mathrm{y}_{\text{tr}},\{\mathrm{T}_{sl}\},\{\mathrm{\mu}_{sl}\},L,S)
2:output SS Monte Carlo draws of predicted mean {y^ste}\{\widehat{\mathrm{y}}^{\text{te}}_{s}\} for input data 𝐗te\mathbf{X}^{\text{te}}
3:    Create empty vectors {rsltr}\{\mathrm{r}^{\text{tr}}_{sl}\} and {rslte}\{\mathrm{r}^{\text{te}}_{sl}\} to store the predicted residuals.
4:    ryy¯\mathrm{r}\leftarrow\mathrm{y}-\bar{\mathrm{y}} \triangleright Initialize full residuals for training set 𝐗\mathbf{X}.
5:    r0ltry¯/L\mathrm{r}^{\text{tr}}_{0l}\leftarrow\bar{\mathrm{y}}/L \triangleright Initialize predicted residuals for the first iteration.
6:    for ss in 1 to SS do
7:         for ll in 1 to LL do
8:             Create an empty vector v\mathrm{v} to store split variables.
9:             rrrs1,ltr\mathrm{r}\leftarrow\mathrm{r}-\mathrm{r}^{\text{tr}}_{s-1,l} \triangleright Update partial residuals.
10:             PFR(r,𝐗tr,rsltr,𝐗te,rslte,Tsl,μsl,v,𝚛𝚘𝚘𝚝_𝚗𝚘𝚍𝚎)\left(\mathrm{r},\mathbf{X}^{\text{tr}},\mathrm{r}^{\text{tr}}_{sl},\mathbf{X}^{\text{te}},\mathrm{r}^{\text{te}}_{sl},\mathrm{T}_{sl},\mathrm{\mu}_{sl},\mathrm{v},{\tt root\_node}\right) \triangleright Call Algorithm 2.
11:             rr+rsltr\mathrm{r}\leftarrow\mathrm{r}+\mathrm{r}^{\text{tr}}_{sl} \triangleright Update full residuals.
12:         end for
13:         y^stel=1Lrslte\widehat{\mathrm{y}}_{s}^{\text{te}}\leftarrow\sum_{l=1}^{L}\mathrm{r}_{sl}^{\text{te}} \triangleright Calculate predicted mean from the ss-th iteration.
14:    end for
15:    Return predicted means {y^ste}\{\widehat{\mathrm{y}}_{s}^{\text{te}}\}.
16:end procedure
Algorithm 2 PredictFromRoot
1:procedure PFR(r,𝐗tr,rtr,𝐗te,rte\mathrm{r},\mathbf{X}^{\text{tr}},\mathrm{r}^{\text{tr}},\mathbf{X}^{\text{te}},\mathrm{r}^{\text{te}}, TlT_{l}, μl\mathrm{\mu}_{l}, v\mathrm{v}, node)
2:outcome Predict residuals rte\mathrm{r}^{\text{te}} for testing covariates 𝐗te\mathbf{X}^{\text{te}} using Gaussian process.
3:    if node is leaf node then
4:         Define hypercube lb\mathcal{B}_{lb} based on the range of 𝐗tr\mathbf{X}^{\text{tr}}.
5:         Define active variables vb\mathrm{v}_{b} based on 𝐗te\mathbf{X}^{\text{te}} and lb\mathcal{B}_{lb}. \triangleright If there exists a testing point x𝐗te\mathrm{x}\in\mathbf{X}^{\text{te}} such that x\mathrm{x} is out of the range of lb\mathcal{B}_{lb} on that variable, then it is considered as an active variable.
6:         Sample a subset of training data 𝐗btr\mathbf{X}^{\text{tr}}_{b} and rbtr\mathrm{r}^{\text{tr}}_{b} of size 100100 with only variables in vb\mathrm{v}_{b}.
7:         Get exterior testing data 𝐗bte\mathbf{X}^{\text{te}}_{b} with only variables in vb.\mathrm{v}_{b}.
8:         Get covariance matrix Σ𝐗btr𝐗btr\Sigma_{\mathbf{X}^{\text{tr}}_{b}\mathbf{X}^{\text{tr}}_{b}} and its inverse Σ𝐗btr𝐗btr1\Sigma^{-1}_{\mathbf{X}^{\text{tr}}_{b}\mathbf{X}^{\text{tr}}_{b}} for the Gaussian process.
9:         Get conditional mean and variance μ𝐗bte𝐗btr\mu_{\mathbf{X}^{\text{te}}_{b}\mid\mathbf{X}^{\text{tr}}_{b}}, σ𝐗bte𝐗btr\sigma_{\mathbf{X}^{\text{te}}_{b}\mid\mathbf{X}^{\text{tr}}_{b}} from Equation (12).
10:         rtrμlb\mathrm{r}^{\text{tr}}\leftarrow\mu_{lb} \triangleright Assign leaf parameter μlb\mu_{lb} as predicted residuals.
11:         rbteμ𝐗bte𝐗btr+Σ𝐗bte𝐗btrz\mathrm{r}^{\text{te}}_{b}\leftarrow\mu_{\mathbf{X}^{\text{te}}_{b}\mid\mathbf{X}^{\text{tr}}_{b}}+\Sigma_{\mathbf{X}^{\text{te}}_{b}\mid\mathbf{X}^{\text{tr}}_{b}}\mathrm{z} \triangleright Sample predictions from Gaussian process; z\mathrm{z} is a vector of standard normal variables.
12:         for ii in 1 to NtN_{t} do \triangleright NtN_{t} is the number of the testing data.
13:             if 𝐗te[i]lb\mathbf{X}^{\text{te}}[i]\in\mathcal{B}_{lb} then
14:                 rte[i]μlb\mathrm{r}^{\text{te}}[i]\leftarrow\mu_{lb} \triangleright Assign leaf parameter as predicted residual.
15:             else
16:                 rte[i]rbte[i]\mathrm{r}^{\text{te}}[i]\leftarrow\mathrm{r}^{\text{te}}_{b}[i] \triangleright Assign prediction from the Gaussian process.
17:             end if
18:         end for
19:    else\triangleright Split until reaching leaf nodes.
20:         Get split variable vv and cutpoint cc. Add vv to splitting variables v\mathrm{v}.
21:         Shift r,𝐗tr,rtr,𝐗te,rte\mathrm{r},\mathbf{X}^{\text{tr}},\mathrm{r}^{\text{tr}},\mathbf{X}^{\text{te}},\mathrm{r}^{\text{te}} into left and right parts according to vv and cc.
22:         PFR(rleft,𝐗lefttr,rlefttr,𝐗leftte,rleftte,Tl,μl,v,𝚕𝚎𝚏𝚝_𝚗𝚘𝚍𝚎)\left(\mathrm{r}_{\mbox{left}},\mathbf{X}^{\text{tr}}_{\mbox{left}},\mathrm{r}^{\text{tr}}_{\mbox{left}},\mathbf{X}^{\text{te}}_{\mbox{left}},\mathrm{r}^{\text{te}}_{\mbox{left}},T_{l},\mu_{l},\mathrm{v},{\tt left\_node}\right)
23:         PFR(rright,𝐗righttr,rrighttr,𝐗rightte,rrightte,Tl,μl,v,𝚛𝚒𝚐𝚑𝚝_𝚗𝚘𝚍𝚎)\left(\mathrm{r}_{\mbox{right}},\mathbf{X}^{\text{tr}}_{\mbox{right}},\mathrm{r}^{\text{tr}}_{\mbox{right}},\mathbf{X}^{\text{te}}_{\mbox{right}},\mathrm{r}^{\text{te}}_{\mbox{right}},T_{l},\mu_{l},\mathrm{v},{\tt right\_node}\right)
24:    end if
25:end procedure

Certain algorithm design choices to boost performance and efficiency are worth emphasizing. First, we partition the covariates into split variables v\mathrm{v} and active variables vb\mathrm{v}_{b}. Split variables appear along the decision path from the root to a leaf node. Active variables vb\mathrm{v}_{b} are a subset of split variables that satisfies the following condition: If there exists a testing point x𝐗te\mathrm{x}\in\mathbf{X}^{\text{te}} such that x\mathrm{x} is outside the range of lb\mathcal{B}_{lb} on that variable in leaf node bb, then it is considered an active variable. Using only active variables in defining the GP covariance matrix ensures that the extrapolation depends only on those variables, which is intuitive and improves time efficiency. Second, when the training data sample size is large, we subsample training data when performing the GP extrapolation. Third, to avoid the identification of exterior points being misled by any outliers, we define the hypercube lb\mathcal{B}_{lb} in each leaf node by the 95%95\% quantile of the training data.
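As a small illustration of this bookkeeping, the sketch below (with purely illustrative names) builds the trimmed hypercube for one leaf and flags exterior test points and active variables.

    import numpy as np

    def leaf_hypercube(X_tr_leaf, q=0.95):
        # Per-variable bounds of the leaf's training data, trimmed to the central 95%
        lo = np.quantile(X_tr_leaf, (1 - q) / 2, axis=0)
        hi = np.quantile(X_tr_leaf, (1 + q) / 2, axis=0)
        return lo, hi

    def exterior_and_active(X_te_leaf, lo, hi, split_vars):
        # A test point is exterior if it leaves the hypercube on some split variable;
        # split variables on which any test point does so are the active variables
        outside = ((X_te_leaf < lo) | (X_te_leaf > hi))[:, split_vars]
        exterior_mask = outside.any(axis=1)
        active_vars = [v for j, v in enumerate(split_vars) if outside[:, j].any()]
        return exterior_mask, active_vars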

3.4 Simulation studies

This section illustrates the performance of XBART-GP and compares it to various prediction inference methods in regression modeling. We carefully construct data generating processes with covariate shifts and test the model’s predictive ability on interior and exterior data points separately.

Specifically, we compare our proposed method with the recent frequentist approach Jackknife+ (Barber et al., 2021) and its cross-validation variant, CV+. For any regression model, Jackknife+ uses a leave-one-out procedure to train the model repeatedly and obtain out-of-sample residuals to construct the predictive interval. Similarly, CV+ is a computationally efficient version of Jackknife+ that uses k-fold cross-validation to obtain the model’s out-of-sample uncertainty. We apply both methods to XBART and to Random Forest as baseline methods.

For both XBART and XBART-GP, we use the following hyperparameters num_trees =20=20, num_iterations =100=100, 𝙽𝚖𝚒𝚗=𝟸𝟶{\tt Nmin=20}, τ=Var(y)/\tau=\text{Var}(\mathrm{y})/num_trees (where Var(y)\text{Var}(\mathrm{y}) is the variance of the response variable in the training set). We choose θgp=0.1\theta_{gp}=0.1 and τgp=τ\tau_{gp}=\tau to be the smoothness and scale parameters of the kernel function for XBART-GP.

When applying Jackknife+ and CV+ to XBART, we pick the same number of trees and τ\tau value, except that the number of iterations is reduced to 100100 to reduce the running time of the two approaches. For simulations of Jackknife+ and CV+ on Random Forest, called Jackknife+ RF and CV+ RF in the rest of the paper, we use the default settings provided in the scikit-learn (Pedregosa et al., 2011) package in Python.

The real signal is drawn from four challenging functions, ff, listed in Table 1. The response variable yy is generated independently from y=f(x)+ϵy=f(\mathrm{x})+\epsilon with Gaussian noise ϵ\epsilon. In all cases, we generate 200200 data points with covariates xjiidN(0,1)x_{j}\overset{\text{iid}}{\sim}N(0,1) for j=1,,d=10j=1,\dots,d=10 as the training set and another 200200 data points with xjiidN(0,1.52)x_{j}\overset{\text{iid}}{\sim}N(0,1.5^{2}) as the testing set.

Name Function
Linear xtγ\mathrm{x}^{t}\mathrm{\gamma};  γj=2+4(j1)d1\gamma_{j}=-2+\frac{4(j-1)}{d-1}
Single index 10a+sin(5a)10\sqrt{a}+\sin{(5a)}; a=j=110(xjγj)2a=\sum_{j=1}^{10}(x_{j}-\gamma_{j})^{2};  γj=1.5+j13\gamma_{j}=-1.5+\frac{j-1}{3}.
Trig + poly 5sin(3x1)+2x22+3x3x45\sin(3x_{1})+2x_{2}^{2}+3x_{3}x_{4}
Max max(x1,x2,x3)\max(x_{1},x_{2},x_{3})
Table 1: Four true ff functions
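For concreteness, a minimal sketch of this covariate-shift design for the linear function in Table 1 is given below; the unit noise variance is an assumption made only for illustration, since the noise scale is not fixed by the table.

    import numpy as np

    def linear_f(X):
        # Linear function from Table 1: f(x) = x'gamma, gamma_j = -2 + 4(j-1)/(d-1)
        d = X.shape[1]
        gamma = -2 + 4 * np.arange(d) / (d - 1)
        return X @ gamma

    rng = np.random.default_rng(0)
    n, d = 200, 10
    X_train = rng.normal(0.0, 1.0, size=(n, d))        # training covariates ~ N(0, 1)
    X_test = rng.normal(0.0, 1.5, size=(n, d))         # shifted test covariates ~ N(0, 1.5^2)
    y_train = linear_f(X_train) + rng.normal(size=n)   # assumed unit-variance Gaussian noise

    # Exterior test points for evaluation: outside the componentwise training range
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    exterior = ((X_test < lo) | (X_test > hi)).any(axis=1)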

Since the definition of exterior point for XBART-GP depends on the tree’s structure and varies by leaf nodes, we simplify the concept in this section to evaluate performance on in- and out-of-range data. In our simulation studies, we consider any testing data outside the training data range as exterior points. Under the above settings, approximately half of the testing data is interior, and half is exterior. The performance is then reported on interior and exterior points separately. We repeat the experiments on each data generating process 1010 times and calculate the average empirical probability of coverage and the average interval length with coverage level 1α=0.91-\alpha=0.9.

Simulation studies and time efficiency comparison on XBART-GP and its baselines were conducted in Python 3.8.103.8.10 on a Linux machine with Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz processor and 64GB RAM; eight cores were used for parallelization whenever it was applicable.

(a) Linear  (b) Single Index
Figure 3: Visualization of simulation results in root mean square error (RMSE), interval coverage (Coverage), and interval length for interior and exterior points for the linear and single index functions. The horizontal dashed line indicates the 90%90\% target coverage level. The vertical lines on top of each bar show the variation across repeated experiments.

Figure 3 visualizes the simulation results on linear and single index functions. The full results of four functions are summarized in the Appendix Figure 6 and Table 3. This figure compares the root mean square error (RMSE) of point prediction, coverage rate, and interval length of various approaches. Note that the RMSE of Jackknife+ and CV+ are evaluated based on the average leave-one-out prediction and cross-validated prediction, respectively.

We notice that XBART-GP has the smallest RMSE of point prediction across all functions on both interior and exterior points. For interior data, our approach is the only one that delivers nominal coverage while the interval length is not too large. Furthermore, it also has substantially greater prediction coverage than the baseline methods on exterior points. It is worth noting that XBART-GP achieves better coverage without inflating the prediction interval too much. Indeed, the interval length of XBART-GP is just slightly larger than that of Jackknife+XBART or CV+XBART but is still smaller than that of Jackknife+RF or CV+RF.

When combined with XBART, Jackknife+ and CV+ produce RMSE, coverage, and interval lengths competitive with XBART itself on interior points. However, these prediction inference methods also fail to extrapolate, since in-sample residuals cannot inform uncertainty on out-of-range data.

3.4.1 Time efficiency comparison

We then evaluate the performance and time efficiency of XBART-GP and other baselines for increasing sample size.

The training and testing covariates are generated from the same distributions as in the previous section, with the testing covariates having a slightly larger variance. We generate the response variable YY with the linear function in Table 1 and a standard normal error term. We estimate the average coverage, interval width, and computational time of all 66 methods with sample sizes n=nt=50,100,150,200,300,500n=n_{t}=50,100,150,200,300,500 over 1010 independent trials for each sample size. The target coverage level is 1α=0.91-\alpha=0.9.

Figure 4 illustrates the performance comparison of XBART-GP and other baseline methods on exterior points with increasing sample size. The performance on interior points is similar and is presented in Appendix Figure 7. Note that the proportion of exterior points generated varies with the training size. Empirically, the percentage of exterior points decreases from 80%80\% to 40%40\% as the sample size increases from 5050 to 500500 on 1010-dimensional covariates. The x-axes in Figure 4 reflect only the sizes of the training sets. The running time is reported for evaluating all testing data, including both interior and exterior points.

Figure 4: Average RMSE, coverage, and interval length on exterior points, and running time (in seconds) at the 90%90\% level for all methods over 1010 independent trials with sample size increasing from 5050 to 500500. Time in seconds is presented on a logarithmic scale.

On exterior points, XBART-GP has the smallest RMSE of point prediction and a stable coverage close to 90%90\% as the sample size grows. The coverage of the other baseline methods drops significantly at larger sample sizes. The interval of XBART-GP is slightly wider than that of XBART but still outperforms the random forest approaches.

XBART and the CV+ approaches are the most efficient in terms of running time. The run-time of XBART-GP increases almost linearly with the training and testing sample sizes, but the extra cost is worthwhile when the distribution of the test set shifts away from that of the training set.

4 Application of local GP extrapolation on causal inference

4.1 Background

A motivating example of an applied extrapolation problem arises in treatment effect estimation, in particular when one of the assumptions required by the potential outcome framework (Imbens and Rubin, 2015) is violated.

Let YY denote the response variable and ZZ denote the binary treatment variable with Zi=1Z_{i}=1 representing the ii-th observation being treated or Zi=0Z_{i}=0 as being in the control group. The potential outcomes under treatment and control are denoted as Yi(1)Y_{i}(1) and Yi(0)Y_{i}(0) for the ii-th unit, respectively. The observable outcome can be expressed as Yi=ZiYi(1)+(1Zi)Yi(0)Y_{i}=Z_{i}Y_{i}(1)+(1-Z_{i})Y_{i}(0). The treatment effect on an individual xi\mathrm{x}_{i} is defined as τ(xi)=Yi(1)Yi(0)\tau(\mathrm{x}_{i})=Y_{i}(1)-Y_{i}(0).

Most causal inference methods using the potential outcome framework also rely on the following assumptions:

  1. Stable Unit Treatment Value Assumption (SUTVA);

  2. Ignorability assumption: Yi(1),Yi(0)Zi𝐗iY_{i}(1),Y_{i}(0)\perp\!\!\!\perp Z_{i}\mid\mathbf{X}_{i};

  3. Positivity assumption: 0<P(Zi=1xi)<10<P(Z_{i}=1\mid\mathrm{x}_{i})<1.

The positivity assumption is also known as the overlap assumption. In this study, we divide the covariate space into two parts: the overlap region, where the positivity assumption is satisfied, and the non-overlap region, where it is not. Following Zhu et al. (2021), violations of the positivity assumption can be categorized as structural or practical.

Structural violations occur when it is impossible for individuals with specific covariate values to receive treatment, so the treatment effect for these individuals or covariate values is irrelevant. One solution is to “trim” the population, as described in Ho et al. (2007), Petersen et al. (2012), and Crump et al. (2009), which allows for the estimation of the treatment effect only in the overlap region (where the positivity assumption holds).

Practical violation refers to situations where a subset of covariate values is only observed in the treatment or control group due to selection bias or a limited sample size. Nevertheless, the average treatment effect (ATE) over the entire population and the conditional average treatment effect (CATE) for specific covariate values are still of interest (Yang and Ding, 2018). In these cases, the strategy is to perform model-based extrapolation, extending an estimated response surface into regions where no data was observed.

Recently, a few extrapolation methods have been developed to estimate the ATE of the entire population under poor overlap. Li et al. (2018) proposed using overlap weights to balance the covariates for estimating the ATE. Nethery et al. (2019) proposed a way to define overlap and non-overlap regions based on the propensity score; they then use BART to estimate the treatment effect in the overlap region and a spline model to extrapolate the treatment effect in the non-overlap region. Zhu et al. (2023) addressed this problem by modeling the treatment effect with a Gaussian process model, which by its nature can extrapolate based on the covariate kernel.

4.2 Local GP extrapolation with Bayesian causal forest

In this section, we explore extrapolating CATE in the region of non-overlap by applying the local GP extrapolation method on the Accelerated Bayesian Causal Forest (XBCF) model (Krantsevich et al., 2022). First, we briefly review the structure of the XBCF model. Next, we formally define the overlap and non-overlap area in the ensemble trees framework. Then we describe the modifications we have made to integrate the local GP algorithm for extrapolating the treatment effect estimated by XBCF.

XBCF is a recent approach for estimating heterogeneous treatment effect based on Bayesian Causal Forest (Hahn et al., 2020). The model expresses the observable outcomes as

yi\displaystyle y_{i} =aμ(xi,π^i)+bziτ(xi)+ϵi,ϵiN(0,σzi2),\displaystyle=a\mu(\mathrm{x}_{i},\widehat{\pi}_{i})+b_{z_{i}}\tau(\mathrm{x}_{i})+\epsilon_{i},\quad\epsilon_{i}\sim N(0,\sigma^{2}_{z_{i}}), (13)
aN(0,1),b0,b1N(0,1/2)\displaystyle a\sim N(0,1),\quad b_{0},b_{1}\sim N(0,1/2)

where π^i\widehat{\pi}_{i} is the estimated propensity score. The function μ(xi,π^i)\mu(\mathrm{x}_{i},\widehat{\pi}_{i}) aims to capture the prognostic effect, and the function τ(xi)\tau(\mathrm{x}_{i}) captures the treatment effect. Each function is modeled by its own sum of XBART trees.

Under the derived split criterion of XBCF, the treatment trees in τ(xi)\tau(\mathrm{x}_{i}) stop splitting when the leaf nodes consist only of the treatment group or only of the control group. The prognostic forest μ(xi,π^i)\mu(\mathrm{x}_{i},\widehat{\pi}_{i}), however, is not affected by the positivity violation, as it does not depend on the treatment variable zz. In this case, both estimators will be biased when the positivity assumption is violated. Nevertheless, the tree structure effectively captures the heterogeneity in the overlap region, making it ideal for extrapolating causal effects in the non-overlap region using the Gaussian process model in the leaf nodes.

The model relies on the three basic assumptions of the potential outcome framework described in the previous section. However, when the positivity assumption is violated in practice, we cannot draw firm conclusions about the ATE or CATE in the non-overlap region. To mitigate this limitation, we assume that there is no abrupt change in the treatment effect function between the overlap and non-overlap regions, which allows us to make inferences about the CATE using the extrapolated predictive interval obtained from local Gaussian processes.

We apply the local GP technique to the XBCF model to extrapolate the treatment effect in non-overlap regions as follows. Given a regression tree, consider a leaf node bb with local covariate subspace 𝒜𝐗\mathcal{A}_{\mathbf{X}}. Let the matrix 𝐗tr\mathbf{X}^{\text{tr}} denote the training data that falls into the leaf node; it can be separated into a treatment group 𝐗Ttr\mathbf{X}^{\text{tr}}_{T} and a control group 𝐗Ctr\mathbf{X}^{\text{tr}}_{C} based on the corresponding treatment variable zz. The ranges of covariates in the treated and control groups form two hypercubes, denoted T\mathcal{B}_{T} and C\mathcal{B}_{C} respectively. The overlap region in the leaf node is the intersection of T\mathcal{B}_{T} and C\mathcal{B}_{C}, or O=TC\mathcal{B}_{O}=\mathcal{B}_{T}\cap\mathcal{B}_{C}. Consequently, the non-overlap area is NO=𝒜𝐗O\mathcal{B}_{NO}=\mathcal{A}_{\mathbf{X}}\setminus\mathcal{B}_{O}.
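In code, the leaf-level overlap region is simply a componentwise intersection of the treated and control ranges; a minimal sketch (with illustrative names, and omitting the 95% trimming described below) follows.

    import numpy as np

    def overlap_region(X_tr_leaf, z_leaf):
        # B_O = B_T intersect B_C: componentwise intersection of treated and control ranges
        X_t, X_c = X_tr_leaf[z_leaf == 1], X_tr_leaf[z_leaf == 0]
        lo = np.maximum(X_t.min(axis=0), X_c.min(axis=0))
        hi = np.minimum(X_t.max(axis=0), X_c.max(axis=0))
        return lo, hi

    def in_overlap(X, lo, hi):
        # Mask of points lying inside the overlap hypercube; training points inside it
        # condition the GP, and test points outside it are the ones to extrapolate
        return ((X >= lo) & (X <= hi)).all(axis=1)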

It is important to note that, in each treatment tree, the residuals in the non-overlap region are biased relative to the true treatment effect because of the treatment trees’ no-splitting behavior. As a result, if the Gaussian process were conditioned on these biased residuals, the testing data would be extrapolated in the wrong direction. Therefore, unlike XBART-GP, we assume that the testing data in the non-overlap area 𝐗NOte\mathbf{X}^{\text{te}}_{\text{NO}} share a joint distribution with only the overlap training data 𝐗Otr\mathbf{X}^{\text{tr}}_{\text{O}}, namely

(rNOterOtr)N(μJ,(𝚺𝐗NOte𝐗NOte,𝚺𝐗NOte𝐗Otr𝚺𝐗Otr𝐗NOte,𝚺𝐗Otr𝐗Otr+1L𝝈^z2)).\begin{pmatrix}\mathrm{r}^{\text{te}}_{\text{NO}}\\ \mathrm{r}^{\text{tr}}_{\text{O}}\end{pmatrix}\sim N\left(\mu\mathrm{J},\begin{pmatrix}\mathbf{\Sigma}_{\mathbf{X}^{\text{te}}_{\text{NO}}\mathbf{X}^{\text{te}}_{\text{NO}}},&\mathbf{\Sigma}_{\mathbf{X}^{\text{te}}_{\text{NO}}\mathbf{X}^{\text{tr}}_{\text{O}}}\\ \mathbf{\Sigma}_{\mathbf{X}^{\text{tr}}_{\text{O}}\mathbf{X}^{\text{te}}_{\text{NO}}},&\mathbf{\Sigma}_{\mathbf{X}^{\text{tr}}_{\text{O}}\mathbf{X}^{\text{tr}}_{\text{O}}}+\frac{1}{L}\widehat{\boldsymbol{\sigma}}^{2}_{\mathrm{z}}\end{pmatrix}\right). (14)

Here rOtr\mathrm{r}^{\text{tr}}_{\text{O}} is a vector of partial residuals for the current treatment tree, i.e., the difference between observed response yy and the predictions of all other prognostic and treatment trees. rNOte\mathrm{r}^{\text{te}}_{\text{NO}} is a vector of residuals we wish to predict for the non-overlap testing data. 𝝈^z2\widehat{\boldsymbol{\sigma}}^{2}_{\mathrm{z}} is a diagonal matrix with diagonal elements equal to estimated variance σ^02\widehat{\sigma}_{0}^{2} or σ^12\widehat{\sigma}_{1}^{2} depending on the treatment variable of the training data and LL is the total number of prognostic and treatment trees.

The prediction algorithm for the treatment trees in XBCF-GP follows the same procedures as in Algorithm 2 except for the difference we describe above. Specifically, we replace the out-of-range hypercube lb\mathcal{B}_{lb} in line 44 by the overlap area O\mathcal{B}_{O}. In line 66, the subset of training data 𝐗btr\mathbf{X}^{\text{tr}}_{b} is sampled from the training data 𝐗Otr\mathbf{X}^{\text{tr}}_{\text{O}} from the overlap area instead of the entire training set. Then in line 77, we choose to extrapolate the testing data 𝐗NOte\mathbf{X}^{\text{te}}_{\text{NO}} in the non-overlap area for XBCF-GP. Similar to the algorithm design choices made in XBART-GP, the overlap area is defined by the 95%95\% quantiles of the treated and control data for robustness concerns.

In addition to the modifications made to the XBART-GP algorithms, we also add a tweak to the original XBCF model to ensure that the extrapolation works smoothly. For successful extrapolation, the GP requires a minimum amount of overlap data in each leaf node to provide accurate estimates. While it is uncommon for the treatment trees in XBCF to split in the non-overlap area, it can still occur in rare cases and result in suboptimal extrapolation. To address this issue, we place a strong prior on the treatment trees to discourage splits that do not leave sufficient treatment and control data in their child nodes. The amount of overlap data required in a leaf node is controlled through the hyperparameter Nmin, which is critical as it balances the quality of the method’s estimates near the overlap boundary against its extrapolation performance in the non-overlap region. In our experiments using 500500 data points, we set Nmin to 2020.

The XBCF-GP model combines the benefits of the Bayesian Causal Forest and Gaussian Process. XBCF excels at accurately and efficiently estimating homogeneous and heterogeneous treatment effects on large datasets with many covariates, provided the positivity assumption holds. The GP component, applied on a per-leaf node basis, enables extrapolation of treatment effects in regions where the positivity assumption is violated by leveraging the most relevant covariates in the surrounding area.

Our method provides a more effective way to assess the uncertainty of the estimated treatment effect in cases where the positivity assumption is violated. By using the local Gaussian process, our uncertainty estimation takes into account the distance between the testing data and the overlap region. This means that if the testing data is significantly distant from the overlap region, XBCF-GP will reflect this uncertainty through wider prediction intervals.

4.3 Simulation study

This section compares prediction interval coverage and length under the setting of causal inference when the positivity assumption is violated. To demonstrate the performance of XBCF-GP, we compare it with a set of alternative extrapolation approaches proposed by Nethery et al. (2019). The baseline methods we use are summarized as follows:

  • SBART Also known as BART-Stratified. Separate BART models are fitted for the treated and control groups, and the treatment effect is estimated as the difference between the expected values of the treated and control models.

  • UBART Untrimmed BART, implemented by Nethery et al. (2019), fits a single BART model with the covariates, treatment variable, and estimated propensity score. The treatment effect is estimated by predicting the potential outcomes from the posterior predictive distributions.

  • BART+SPL Nethery et al. (2019) proposed this method to estimate the treatment effect in the overlap area with BART and in the non-overlap area with a spline model. The region of overlap is defined by the propensity score with recommended parameters a=0.1a=0.1 and b=7b=7.

(a) XBCF  (b) XBCF-GP  (c) SBART  (d) UBART  (e) BART+SPL
Figure 5: An example of treatment effect estimation with positivity violation on all baseline methods. The two vertical lines indicate the boundaries of the overlap area, with the control group heavily distributed on the left side and the treatment group distributed on the right side. The straight line represents the true treatment effect, while the curved lines/dots in the center denote the estimated treatment effect for each method. The lower and upper curved lines/dots show their respective 95%95\% predictive intervals.

Figure 5 illustrates the performance of XBCF-GP and other baseline methods on a one-dimensional toy dataset. The independent variable xx is uniformly drawn from the range [10,10][-10,10]. The prognostic function is sine, and the treatment effect is 0.25x0.25x. The probability of a sample being treated is π(x)=max{0,min{1,0.08x+0.5}}\pi(x)=\max\{0,\min\{1,0.08x+0.5\}\}. Let f(x)=sin(x)+(0.25x)zf(x)=\sin(x)+(0.25x)z denote the true function with treatment variable zBern(π(x))z\sim\text{Bern}(\pi(x)). The response values yy are generated as y=f(x)+ϵy=f(x)+\epsilon, where ϵN(0,0.2×sd(f))\epsilon\sim N(0,0.2\times\text{sd}(f)) with the standard deviation sd(f)\text{sd}(f) taken over the dataset. The fitted models in this example use the same parameter setup as in the simulation study, which is specified shortly.
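This one-dimensional example is straightforward to reproduce; a sketch of its data generating process follows, with the sample size and seed chosen arbitrarily and 0.2×sd(f) treated as the noise standard deviation.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500                                          # illustrative sample size
    x = rng.uniform(-10, 10, size=n)
    pi = np.clip(0.08 * x + 0.5, 0.0, 1.0)           # pi(x) = max{0, min{1, 0.08x + 0.5}}
    z = rng.binomial(1, pi)                          # treatment assignment
    f = np.sin(x) + 0.25 * x * z                     # f(x) = sin(x) + (0.25x)z
    y = f + rng.normal(0.0, 0.2 * f.std(), size=n)   # noise scale 0.2 * sd(f)
    tau_true = 0.25 * x                              # true treatment effect line in Figure 5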

The figure demonstrates the performance of the different models in estimating the CATE. The first three methods (XBCF, XBCF-GP, and SBART) are relatively robust to noise. However, the last two methods (UBART and BART+SPL), which estimate the CATE directly from the response variable, introduce significant noise in the estimates. All three BART-based methods are susceptible to the influence of the prognostic structure, especially in the non-overlap area. XBCF avoids this issue by setting up a separate BART forest to control for the prognostic effect, but it cannot extrapolate properly under positivity violation. XBCF-GP inherits the well-regularized structure of XBCF and extrapolates the treatment effect estimates smoothly.

For a more thorough comparison among these methods, we adopt the same data generating process used in Hahn et al. (2020) and Krantsevich et al. (2022) but modify the propensity function to create the positivity violation scenario. The covariate vector x\mathrm{x} consists of five variables. The first three (denoted as x1,x2,x3x_{1},x_{2},x_{3}) are generated from the standard normal distribution. x4x_{4} is a binary variable, and x5x_{5} is an unordered categorical variable with levels denoted as 1,2,31,2,3. The prognostic function can be either linear or nonlinear:

μ(x)={1+g(x4)+x1x3linear6+g(x4)+6|x31|nonlinear,\mu(\mathrm{x})=\begin{cases}1+g(x_{4})+x_{1}x_{3}\quad&\text{linear}\\ -6+g(x_{4})+6|x_{3}-1|\quad&\text{nonlinear,}\end{cases}

where g(1)=2g(1)=2, g(2)=1g(2)=-1 and g(3)=4g(3)=-4. The treatment effect can be either homogeneous or heterogeneous:

τ(x)={3homogeneous1+2x2x5heterogeneous.\tau(\mathrm{x})=\begin{cases}3\quad&\text{homogeneous}\\ 1+2x_{2}x_{5}\quad&\text{heterogeneous.}\end{cases}

In the simulation studies, we adjust the propensity function such that roughly 20%20\% of the data has a propensity score of 11, 20%20\% of the data has a propensity score of 0, and the rest of the data has a propensity score that is between 0 and 11. The propensity function π(x)\pi(\mathrm{x}) is given by

π(x)=max{0,min{1,h(x)}},h(x)=1.1Φ(3μ(x)/s0.5x1c)0.15+ui/10,\pi(\mathrm{x})=\max\{0,\min\{1,h(\mathrm{x})\}\},\quad h(\mathrm{x})=1.1\Phi(3\mu(\mathrm{x})/s-0.5x_{1}-c)-0.15+u_{i}/10, (15)

where ss is the standard deviation of μ(x)\mu(\mathrm{x}) taken over the observed samples and uiUniform(0,1)u_{i}\sim\text{Uniform}(0,1). We pick the value c=0c=0 for the linear prognostic function and c=3c=3 for the non-linear one. For any propensity score greater than 11 or less than 0, we set it to be 11 or 0, respectively. In addition, we add a Gaussian noise ϵN(0,0.5×sd(μ(x)+τ(x)z))\epsilon\sim N(0,0.5\times\text{sd}(\mu(\mathrm{x})+\tau(\mathrm{x})\mathrm{z})) with the standard deviation taken over the samples to generate the response values.
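The propensity construction in Equation (15) is the only nonstandard ingredient of this design; the sketch below implements it directly, with mu standing in for the prognostic values and placeholder inputs used purely for illustration.

    import numpy as np
    from scipy.stats import norm

    def propensity(mu, x1, c, rng):
        # Equation (15): h(x) = 1.1 * Phi(3 mu / s - 0.5 x1 - c) - 0.15 + u/10, clipped to [0, 1]
        s = mu.std()
        u = rng.uniform(size=len(mu))
        h = 1.1 * norm.cdf(3 * mu / s - 0.5 * x1 - c) - 0.15 + u / 10
        return np.clip(h, 0.0, 1.0)

    # Example usage with placeholder prognostic values (c = 0 for the linear case)
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=500)
    mu = rng.normal(size=500)
    z = rng.binomial(1, propensity(mu, x1, c=0.0, rng=rng))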

Method  ATE RMSE  ATE Coverage  ATE I.L.  CATE RMSE  CATE Coverage  CATE I.L.  Time
Linear Homogeneous
XBCF 0.369 0.600 0.927 0.474 0.944 1.670 0.998
XBCF-GP 0.347 0.600 0.903 0.467 0.953 1.736 2.026
SBART 1.148 0.000 0.945 1.511 0.726 3.216 7.155
UBART 0.290 1.000 1.180 1.299 0.995 5.684 6.338
BART+SPL 0.987 1.000 3.713 1.725 0.987 13.806 145.865
Linear Heterogeneous
XBCF 0.630 0.800 1.579 1.773 0.755 3.566 0.978
XBCF-GP 0.601 0.800 1.671 1.671 0.774 3.720 2.707
SBART 1.241 0.000 1.523 1.979 0.858 5.589 7.088
UBART 0.393 1.000 2.071 2.213 0.851 9.032 6.344
BART+SPL 0.958 1.000 6.485 2.748 0.96 24.523 158.565
Non-linear Homogeneous
XBCF 0.786 0.800 2.254 0.627 0.954 2.709 0.910
XBCF-GP 0.687 0.800 2.352 0.575 0.960 2.924 1.827
SBART 2.370 0.000 2.586 3.288 0.802 7.821 7.095
UBART 0.556 1.000 2.704 2.819 0.929 12.867 6.306
BART+SPL 1.439 1.000 8.342 3.476 0.989 27.899 159.738
Non-linear Heterogeneous
XBCF 1.219 0.600 2.456 2.424 0.759 5.147 0.940
XBCF-GP 1.200 0.600 2.628 2.354 0.792 5.576 2.270
SBART 2.931 0.000 2.708 4.051 0.780 8.783 7.134
UBART 0.801 0.800 2.718 3.297 0.840 14.116 6.326
BART+SPL 2.166 0.800 5.688 4.458 0.970 24.641 161.630
Table 2: Results of root mean square error (RMSE), interval coverage (Coverage), interval length (I.L.), and running time (in seconds) for ATE and CATE estimators with different combinations of treatment term and prognostic term types. The sample size is 500.

To gauge the performance of XBCF-GP across overlap and non-overlap regions, we evaluate the average treatment effect (ATE) and conditional average treatment effect (CATE) estimates on three basic metrics: root mean square error, interval coverage, and average interval length. In addition, we compare the time efficiency of the proposed method to that of the baseline methods. We average the results over 20 independent replications for each scenario, with $n=500$ samples per replication.
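As a point of reference, the per-replication metrics for the CATE can be computed as below. This is a minimal sketch with our own variable names: tau_true holds the true CATE values, tau_hat the point estimates, and lower/upper the bounds of the credible intervals at the nominal level used in the study.

    # Per-replication evaluation metrics for a CATE estimator
    rmse     <- sqrt(mean((tau_hat - tau_true)^2))            # root mean square error
    coverage <- mean(tau_true >= lower & tau_true <= upper)   # interval coverage
    int_len  <- mean(upper - lower)                           # average interval length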

For methods that rely on the propensity score, we use the propensity score estimated by an XBART multinomial model with 2 classes and 100 iterations rather than the true propensity scores. For XBCF-GP, we set the hyper-parameter $\theta=0.1$ and $\tau=\widehat{\sigma}^{2}/L_{\tau}$, where $\widehat{\sigma}^{2}$ is the estimated variance and $L_{\tau}=20$ is the default number of treatment trees in the XBCF model. We choose a scale parameter different from the $\tau_{gp}$ used in XBART-GP because the variance of the treatment effect tends to be smaller than the total variance of the observed data in most cases. For the other baseline methods, including XBCF, we use their default parameters in all scenarios.

The experiments were conducted in R 4.1.3 on the same machine described in the previous sections.

Table 2 summarizes the simulation results for the various methods. We focus our discussion here on CATE estimation. Both XBCF methods provide the best CATE estimates in all cases and deliver competitive CATE coverage with tighter intervals than the other methods. XBCF-GP can improve both the accuracy and the coverage of the ATE and CATE estimates relative to its base model. Furthermore, XBCF-GP, inheriting the efficiency of XBART, runs substantially faster than the other BART-based methods.

5 Discussion

The Gaussian process extrapolation technique developed here has several advantages over standard BART extrapolation, as implemented in the most widely available software. First, whereas standard BART extrapolates a constant function beyond the range of the training data, grafting a Gaussian process onto each leaf makes “local” Gaussian process extrapolation possible. This local GP approach automatically incorporates variable relevance, using only the variables that appear in the splitting rules of a given tree. Because extrapolation occurs at each tree, the procedure also tends to improve prediction more generally: GP prediction at locally exterior points imposes additional smoothness, even at points within the convex hull of the training data.

Second, BART’s prediction intervals are dominated by the residual variance parameter value, which implies that the interval width is nearly constant at any exterior point. By contrast, the local GP extrapolation method implies interval widths that expand quite rapidly the further away a point is from the training data, a desirable property that is directly inherited from the Gaussian process.

Third, the local GP approach requires no additional training time compared to a regular BART model fit. Instead, the extrapolation component is a direct modification of the posterior predictive distribution defined in terms of BART posterior samples and a specified covariance function. Broadly, our approach grew out of attempts to answer the question: how would one want to use a known BART model that fits well at interpolation points to define an extrapolation model? Our proposed solution is to graft a Gaussian process to each leaf node for the purpose of extrapolation only; posterior uncertainty in the BART model parameters is then propagated by averaging over posterior samples. The general philosophy behind this approach is that there is no need to change the likelihood model if BART adequately fits the training data, but conversely, there is no need to stick to the pure BART model for points far from the training data. Extrapolation always requires a leap of faith, and there is no guarantee that any extrapolation method will succeed, but we argue that there are conceptual and computational advantages to disentangling the extrapolation model from the interpolation model.
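To make this construction concrete, the sketch below illustrates the leaf-level Gaussian process conditional that underlies the modified posterior predictive. It is a conceptual sketch rather than the package implementation: it assumes a squared-exponential covariance with length-scale theta and unit prior scale, and that the training subset (X_l, r_l) of the leaf containing the test point has already been extracted for a given posterior draw. In the actual procedure, the covariance is restricted to the tree's splitting variables and scaled by the $\theta$ and $\tau$ parameters described earlier, and the leaf-level predictions are summed across trees and averaged over posterior draws.

    # Leaf-level GP conditional at a test point x_star (conceptual sketch)
    sqexp <- function(a, b, theta) exp(-sum((a - b)^2) / theta)   # squared-exponential kernel
    gp_leaf_predict <- function(X_l, r_l, x_star, theta, sigma2) {
      n  <- nrow(X_l)
      K  <- matrix(0, n, n)
      for (i in 1:n) for (j in 1:n) K[i, j] <- sqexp(X_l[i, ], X_l[j, ], theta)
      ks <- sapply(1:n, function(i) sqexp(X_l[i, ], x_star, theta))
      w  <- solve(K + sigma2 * diag(n), ks)        # (K + sigma^2 I)^{-1} k_*
      list(mean = sum(w * r_l),                    # GP conditional mean
           var  = 1 - sum(w * ks))                 # GP conditional variance (unit prior scale, excluding noise)
    }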

Finally, we demonstrate how to apply local GP extrapolation of BART models to the problem of imperfect overlap in treatment effect estimation, a common practical challenge in applied causal inference. By applying the local GP extrapolation technique to the Bayesian causal forest model, violations of the positivity assumption can be handled organically. Specifically, we are able to extrapolate the treatment effect surface directly rather than composing it as the difference between two separate extrapolations (one for each treatment arm). Our simulation studies demonstrate that XBCF-GP produces treatment effect estimates with good coverage across a variety of data-generating processes. These tools will help researchers draw more reliable inferences from applied data, which frequently exhibit imperfect overlap.

Acknowledgement

Jingyu He gratefully acknowledges that this project was fully supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No. CityU 21504921).

6 Supplementary Materials

R-package for XBART-GP:

The R package "XBART" is adapted from the original XBART model (https://github.com/JingyuHe/XBART) (He et al., 2019; He and Hahn, 2021) and contains code to perform local Gaussian process extrapolation on a trained XBART model, as described in this article. Please see the README file contained in the zip archive for installation instructions. (XBART.zip, zip archive)

Python-package for XBART-GP:

The Python package "XBART" is adapted from the original XBART model (https://github.com/JingyuHe/XBART) (He et al., 2019; He and Hahn, 2021) and contains Python code to perform local Gaussian process extrapolation on a trained XBART model. Please see the README file contained in the zip archive for installation instructions. (XBART-python.zip, zip archive)

R-package for XBCF-GP:

The R package "XBCF" is adapted from the original XBCF model (https://github.com/socket778/XBCF) (Krantsevich et al., 2022) and contains code to perform local Gaussian process extrapolation on a trained XBCF model, as described in this article. Please see the README file contained in the zip archive for installation instructions. (XBCF.zip, zip archive)

Python code for XBART-GP simulations:

The supplemental files contain code to perform simulation experiments described in Section 3.4. (XBART-GP simulation.zip, zip archive)

R code for XBCF-GP simulations:

The supplemental files contain code to perform simulation experiments described in Section 4.3. (XBCF-GP simulation.zip, zip archive)

References

  • Barber et al. (2021) Barber, R. F., E. J. Candes, A. Ramdas, and R. J. Tibshirani (2021). Predictive inference with the jackknife+. The Annals of Statistics 49(1), 486–507.
  • Breiman (2001) Breiman, L. (2001). Random forests. Machine learning 45(1), 5–32.
  • Breiman et al. (1984) Breiman, L., J. Friedman, R. Olshen, and C. J. Stone (1984). Classification and regression trees. Chapman and Hall/CRC.
  • Chen and Guestrin (2016) Chen, T. and C. Guestrin (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.  785–794. ACM.
  • Chipman et al. (2010) Chipman, H. A., E. I. George, R. E. McCulloch, et al. (2010). BART: Bayesian additive regression trees. The Annals of Applied Statistics 4(1), 266–298.
  • Crump et al. (2009) Crump, R. K., V. J. Hotz, G. W. Imbens, and O. A. Mitnik (2009, 01). Dealing with limited overlap in estimation of average treatment effects. Biometrika 96(1), 187–199.
  • D’Amour et al. (2021) D’Amour, A., P. Ding, A. Feller, L. Lei, and J. Sekhon (2021). Overlap in observational studies with high-dimensional covariates. Journal of Econometrics 221(2), 644–654.
  • Efron and Stein (1981) Efron, B. and C. Stein (1981). The jackknife estimate of variance. The Annals of Statistics, 586–596.
  • Gramacy and Lee (2008) Gramacy, R. B. and H. K. H. Lee (2008). Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association 103(483), 1119–1130.
  • Hahn et al. (2020) Hahn, P. R., J. S. Murray, C. M. Carvalho, et al. (2020). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Bayesian Analysis.
  • He and Hahn (2021) He, J. and P. R. Hahn (2021). Stochastic tree ensembles for regularized nonlinear regression. Journal of the American Statistical Association (just-accepted), 1–61.
  • He et al. (2019) He, J., S. Yalov, and P. R. Hahn (2019). XBART: Accelerated Bayesian additive regression trees. In The 22nd International Conference on Artificial Intelligence and Statistics, pp.  1130–1138.
  • Ho et al. (2007) Ho, D. E., K. Imai, G. King, and E. A. Stuart (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 15(3), 199–236.
  • Imbens and Rubin (2015) Imbens, G. W. and D. B. Rubin (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
  • Kapelner and Bleich (2013) Kapelner, A. and J. Bleich (2013). Prediction with missing data via Bayesian additive regression trees. Canadian Journal of Statistics 43.
  • Krantsevich et al. (2022) Krantsevich, N., J. He, and P. R. Hahn (2022). Stochastic tree ensembles for estimating heterogeneous effects.
  • Li et al. (2018) Li, F., K. L. Morgan, and A. M. Zaslavsky (2018). Balancing covariates via propensity score weighting. Journal of the American Statistical Association 113(521), 390–400.
  • Li et al. (2018) Li, F., L. E. Thomas, and F. Li (2018, 09). Addressing extreme propensity scores via the overlap weights. American Journal of Epidemiology 188(1), 250–257.
  • Nethery et al. (2019) Nethery, R. C., F. Mealli, and F. Dominici (2019). Estimating population average causal effects in the presence of non-overlap: The effect of natural gas compressor station exposure on cancer mortality. The Annals of Applied Statistics 13(2), 1242.
  • Papadopoulos (2008) Papadopoulos, H. (2008). Inductive conformal prediction: Theory and application to neural networks. INTECH Open Access Publisher Rijeka.
  • Pedregosa et al. (2011) Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
  • Petersen et al. (2012) Petersen, M. L., K. E. Porter, S. Gruber, Y. Wang, and M. J. van der Laan (2012). Diagnosing and responding to violations in the positivity assumption. Statistical Methods in Medical Research 21(1), 31–54. PMID: 21030422.
  • Rasmussen (2003) Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer school on machine learning, pp.  63–71. Springer.
  • Skyrms (2006) Skyrms, B. (2006). Diachronic coherence and radical probabilism. Philosophy of Science 73(5), 959–968.
  • Starling et al. (2018) Starling, J. E., J. S. Murray, C. M. Carvalho, R. K. Bukowski, and J. G. Scott (2018). BART with targeted smoothing: An analysis of patient-specific stillbirth risk.
  • Starling et al. (2021) Starling, J. E., J. S. Murray, P. A. Lohr, A. R. A. Aiken, C. M. Carvalho, and J. G. Scott (2021). Targeted Smooth Bayesian Causal Forests: An analysis of heterogeneous treatment effects for simultaneous vs. interval medical abortion regimens over gestation. The Annals of Applied Statistics 15(3), 1194 – 1219.
  • Vovk (2015) Vovk, V. (2015). Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence 74(1), 9–28.
  • Yang and Ding (2018) Yang, S. and P. Ding (2018, 03). Asymptotic inference of causal effects with observational studies trimmed by the estimated propensity scores. Biometrika 105(2), 487–493.
  • Zhu et al. (2023) Zhu, A. Y., N. Mitra, and J. Roy (2023). Addressing positivity violations in causal effect estimation using gaussian process priors. Statistics in Medicine 42(1), 33–51.
  • Zhu et al. (2021) Zhu, Y., R. A. Hubbard, J. Chubak, J. Roy, and N. Mitra (2021). Core concepts in pharmacoepidemiology: Violations of the positivity assumption in the causal analysis of observational data: Consequences and statistical approaches. Pharmacoepidemiology and Drug Safety 30(11), 1471–1485.

Appendix A Simulation results for XBART-GP and its baselines

A.1 Simulation results on different functions

(a) Trig + poly
(b) Max
Figure 6: Visualization of simulation results in root mean square error (RMSE), interval coverage (Coverage), and interval length for interior and exterior points for the trig + poly and max functions. The horizontal dashed line indicates the 90% target coverage level.
Method | RMSE (Interior, Exterior) | Coverage (Interior, Exterior) | Interval length (Interior, Exterior)
Linear
XBART-GP 1.756 2.506 0.881 0.816 5.709 6.717
XBART 1.986 3.532 0.843 0.584 5.568 5.678
Jackknife+ XBART 1.894 3.510 0.897 0.626 6.120 6.133
Jackknife+ RF 2.975 4.598 0.834 0.656 8.499 8.550
CV+ XBART 1.883 3.466 0.900 0.651 6.366 6.384
CV+ RF 3.006 4.635 0.851 0.652 8.642 8.702
Single index
XBART-GP 4.582 10.631 0.871 0.474 13.938 15.854
XBART 5.004 13.055 0.830 0.297 13.625 13.965
Jackknife+ XBART 4.802 13.027 0.873 0.305 14.715 14.778
Jackknife+ RF 8.077 17.068 0.765 0.263 19.592 19.689
CV+ XBART 4.754 13.097 0.897 0.327 15.725 15.780
CV+ RF 8.165 17.162 0.764 0.265 19.880 19.940
Trig + poly
XBART-GP 4.229 8.549 0.839 0.705 11.441 13.322
XBART 4.591 9.805 0.798 0.615 11.007 11.718
Jackknife+ XBART 4.577 9.845 0.789 0.567 10.054 10.115
Jackknife+ RF 5.719 11.110 0.785 0.584 13.650 13.764
CV+ XBART 4.595 9.904 0.791 0.579 10.334 10.470
CV+ RF 5.778 11.156 0.780 0.595 13.726 13.857
Max
XBART-GP 1.150 1.253 0.866 0.873 3.672 3.940
XBART 1.218 1.489 0.848 0.774 3.564 3.582
Jackknife+ XBART 1.213 1.482 0.862 0.798 3.677 3.678
Jackknife+ RF 1.123 1.268 0.898 0.861 3.689 3.704
CV+ XBART 1.215 1.487 0.872 0.800 3.765 3.764
CV+ RF 1.123 1.267 0.899 0.864 3.687 3.705
Table 3: Results of simulations comparing various approaches on prediction intervals. Columns are root mean square error (RMSE), interval coverage (Coverage), and interval length on interior and exterior points for 4 different data-generating processes.

Figure 6 and Table 3 compare the performance of XBART-GP to the other baseline methods. The results for these two functions, trig + poly and max, are consistent with those for the two functions presented in the main text. They show that XBART-GP provides the most accurate predictions on both in-range and out-of-range data, with close to the desired coverage and only slightly wider prediction intervals, across various data-generating processes.

A.2 Simulation results on increasing sample size

Figure 7: Results of simulations comparing the various approaches on prediction intervals: average RMSE, coverage, and interval length on interior points, and running time, at the 90% level for all methods over 10 independent trials with sample size increasing from 50 to 500.

For the interior points, the XBART-based methods produce smaller RMSE than the random forest methods, and XBART-GP has the smallest RMSE overall. XBART and its Gaussian process extrapolation variant have lower coverage than the Jackknife methods when the sample size is very small; however, XBART-GP outperforms the other methods on interior points as the sample size increases. XBART and XBART-GP produce the shortest intervals, and the intervals of Jackknife+ XBART and CV+ XBART converge to those of XBART as the sample size increases.