Hierarchical Gaussian Process Models for Regression Discontinuity/Kink under Sharp and Fuzzy Designs

Ximing Wu Department of Agricultural Economics, Texas A&M University, College Station, TX 77843; email: [email protected]

Abstract

We propose nonparametric Bayesian estimators for causal inference exploiting Regression Discontinuity/Kink (RD/RK) under sharp and fuzzy designs. Our estimators are based on Gaussian Process (GP) regression and classification. The GP methods are powerful probabilistic machine learning approaches that are advantageous in terms of derivative estimation and uncertainty quantification, facilitating RK estimation and inference of RD/RK models. These estimators are extended to hierarchical GP models with an intermediate Bayesian neural network layer and can be characterized as hybrid deep learning models. Monte Carlo simulations show that our estimators perform comparably to and sometimes better than competing estimators in terms of precision, coverage and interval length. The hierarchical GP models considerably improve upon one-layer GP models. We apply the proposed methods to estimate the incumbency advantage of US house elections. Our estimations suggest a significant incumbency advantage in terms of both vote share and probability of winning in the next elections. Lastly we present an extension to accommodate covariate adjustment.

Keywords: Regression Discontinuity/Kink; Sharp and Fuzzy Discontinuity Designs; Nonparametric Bayesian Estimation; US House Elections

1 Introduction

The regression discontinuity design (Thistlethwaite and Campbell, 1960) and its generalizations are non-experimental methods of causal inference in observational studies. In a typical regression discontinuity design, the assignment of a treatment is determined according to whether a running variable is greater or less than a known cutoff while the outcome of interest is supposed to be a smooth function of the running variable except at the cutoff. The abrupt change in the probability of receiving the treatment can be exploited to infer the local causal effect around the cutoff. For general overviews, see Imbens and Lemieux (2008), Lee and Lemieux (2010), Cattaneo and Escanciano (2017), Cattaneo et al. (2020, forthcoming), and Cattaneo and Titiunik (2022).

We develop in this study a family of estimators for Regression Discontinuity (RD) and Regression Kink (RK) under sharp and fuzzy designs. The proposed estimators are based on Gaussian Process (GP) regression/classification, a probabilistic modeling approach that has gained popularity in machine learning and artificial intelligence community (Rasmussen and Williams, 2006). This nonparametric Bayesian approach places a GP prior on the latent function that underlays the outcome of interest. Modeler’s knowledge regarding the underlying relationship is encoded through the mean and variance functions of the GP and inference is conducted according to Bayes’ rule.

Unlike methods that focus on the conditional mean of the outcome immediately below and above the cutoff (e.g., the local polynomial estimators), the proposed GP methods estimate the underlying functional relationship and rely on the resultant predictive distributions at the cutoff to infer various treatment effects of interest. One key advantage GP methods enjoy over other machine learning methods (such as neural network and random forest) is its automatic Uncertainty Quantification (UQ). Probabilistic quantities such as credibility intervals are readily obtained from the predictive distributions. Another appeal of GP methods is that its gradient remains a Gaussian process, facilitating derivative estimation. This is particularly valuable for the regression kink estimation considered in this study. Furthermore, thanks to the likelihood principle inherent in Bayesian inference, the proposed method infers both RD and RK effects from a common generative model and does not require separate models/configurations for the RD and RK estimation.

The fuzzy RD/RK design arises when treatment assignment differs from actual treatment take-up. The identification of the RD/RK effects under the fuzzy design requires accounting for the change in probability of treatment take-up at the cutoff. Estimating this probability is a classification problem in which the GP method excels. The GP classification is a generalization of GP regression; it models a continuous latent function underlaying the observed discrete outcome. We develop a GP RD estimator for binary outcomes and use it in conjunction with the GP regression to formulate estimators for fuzzy RD/RK effects.

The essence of GP regressions is the covariance function that governs the sample path of a Gaussian process. We use the squared exponential covariance function in our GP regressions. It can be shown to be equivalent to a basis function expansion with an infinite number of radial basis functions and therefore rather expressive. It is, however, stationary such that the covariance between two inputs depends only on their distance. Consequently it may be inadequate if the underlying relationship is nonstationary. To tackle this potential limitation, we further develop a multi-layer hierarchical GP estimator. The first layer utilizes a Bayesian neural network that nonlinearly transforms the inputs into a latent feature space. The second layer utilizes the GP to map the intermediate outputs to the observed responses. This estimator can be interpreted as a deep learning model with a multi-layer architecture. It combines the strength of neural network and GP regressions. Moreover using the GP estimator as the penultimate layer is amenable to automatic uncertainty quantification and derivative estimation, facilitating the estimation and inference of RD/RK effects.

We conduct Monte Carlo simulations to assess the performance of the proposed methods. The results suggest that our estimators provide comparable and sometimes noticeably better performance relative to competing estimators. The hierarchical GP estimators improve upon the one-layer GP estimators substantially, demonstrating the merit of deep learning approach. These overall patterns hold for the RD and RK estimations under both sharp and fuzzy designs, and across different data generating processes and sample sizes. For illustration, we apply the proposed methods to examine the electoral advantage to incumbency in the US House election (Lee, 2008). Lastly we present a further extension to accommodate covariate adjustment.

2 Preliminaries

2.1 Regression-discontinuity and regression-kink designs

In this section, we provide a brief overview of regression discontinuity/kink designs; the main purpose is to introduce the various estimands that are considered in this study. There is an extensive and still growing literature on these topics; readers are referred to Imbens and Lemieux (2008), Lee and Lemieux (2010), Cattaneo and Escanciano (2017), Cattaneo et al. (2020, forthcoming), and Cattaneo and Titiunik (2022) for general reviews.

Suppose that the assignment of a treatment is determined according to whether a continuous running variable $X$ is greater or less than $c$ , a known cutoff. It is assumed that units do not manipulate the running variable to position themselves above or below the cutoff. The most common scenario is the Sharp RD (SRD) design with a binary treatment, wherein all units above the cutoff receive the treatment and vice versa. The SRD treatment effect is defined as

\tau_{\textsf{SRD}}=\lim_{x\downarrow c}\mathbbm{E}[Y|X=x]-\lim_{x\uparrow c}\mathbbm{E}[Y|X=x]

(1)

Hahn et al. (2001) established conditions under which causal effect is identified within this framework.

More generally, suppose the outcome is given by $Y=f(B,X,U)$ , where $B=B(X)$ is a policy/treatment parameter that depends on the running variable smoothly except at the cutoff and $U$ is a stochastic error. The sharp regression kink (SRK) effect is defined as

\tau_{\textsf{SRK}}=\frac{\lim_{x_{0}\downarrow c}\left.\frac{d\mathbbm{E}[Y|X=x]}{dx}\right|_{x=x_{0}}-\lim_{x_{0}\uparrow c}\left.\frac{d\mathbbm{E}[Y|X=x]}{dx}\right|_{x=x_{0}}}{\lim_{x_{0}\downarrow c}\left.\frac{dB(x)}{dx}\right|_{x=x_{0}}-\lim_{x_{0}\uparrow c}\left.\frac{dB(x)}{dx}\right|_{x=x_{0}}}

(2)

Roughly speaking, the SRK parameter captures the local causal effect of a treatment on the derivative of $f$ at the cutoff; see Card et al. (2015) for a detailed analysis of this generalized RK design.

It is not uncommon in observational studies that actual treatment take-up differs from treatment assignment. In this case, the sharp RD/RK estimands introduced above reflect the effect of the treatment assignment rather than that of the treatment take-up. Instead, the fuzzy RD/RK estimands identify the actual treatment effect by exploiting the treatment assignment as an instrumental variable that affects the take-up probability at the cutoff. Define $D=\mathbbm{1}(\textnormal{treatment take-up})$ . The fuzzy RD (FRD) and fuzzy RK (FRK) effects are given by

	$\displaystyle\tau_{\textsf{FRD}}$	$\displaystyle=\frac{\lim_{x\downarrow c}\mathbbm{E}[Y\|X=x]-\lim_{x\uparrow c}\mathbbm{E}[Y\|X=x]}{\lim_{x\downarrow c}\mathbbm{E}[D\|X=x]-\lim_{x\uparrow c}\mathbbm{E}[D\|X=x]}$
	$\displaystyle\tau_{\textsf{FRK}}$	$\displaystyle=\frac{\lim_{x_{0}\downarrow c}\left.\frac{d\mathbbm{E}[Y\|X=x]}{dx}\right\|_{x=x_{0}}-\lim_{x_{0}\uparrow c}\left.\frac{d\mathbbm{E}[Y\|X=x]}{dx}\right\|_{x=x_{0}}}{\lim_{x_{0}\downarrow c}\left.\frac{d\mathbbm{E}[B\|X=x]}{dx}\right\|_{x=x_{0}}-\lim_{x_{0}\uparrow c}\left.\frac{d\mathbbm{E}[B\|X=x]}{dx}\right\|_{x=x_{0}}}$

Both parametric and nonparametric methods have been used in RD/RK estimations. Parametric estimations typically employ low order global polynomials; see e.g., Van der Klaauw (2002) and Arai and Ichimura (2016). Nonparametric methods offer a flexible and robust alternative. Local polynomials, especially local linear estimators, are most commonly used due to its excellent boundary bias properties; see among others, Hahn et al. (2001), Ludwig and Miller (2007), Imbens and Kalyanaraman (2012), and Calonico et al. (2014). Bayesian methods have also been used; see e.g. Chib and Jacobi (2016), Branson et al. (2019) and Chib et al. (2020).

2.2 Gaussian process regression and classification

In this sequel, we provide a brief introduction to the Gaussian Process methods. Interested readers are referred to Rasmussen and Williams (2006) for an illuminating treatment of this topic.

Consider a sample $\mathcal{D}$ consisting of responses $\bm{y}=[y_{1},\dots,y_{N}]^{T}$ and corresponding covariates/inputs $\bm{x}=[x_{1},\dots,x_{N}]^{T}$ . For simplicity, in this review we focus on the case of a single covariate; this is also in accordance with the typical RD/RK estimations wherein the running variable is the sole covariate. The response observations are supposed to be conditionally independent given associated latent values $\bm{f}=[f(x_{1}),\dots,f(x_{N})]^{T}$ . We place on the latent function $f(\cdot)$ a GP prior with mean function $m(\cdot)$ and covariance function $k(\cdot,\cdot)$ . The observation model and the prior usually depend on some hyperparameters $\theta_{y}$ and $\theta_{f}$ respectively. Thus a typical GP model consists of the following components:

•

Observation model on $\bm{y}$ : $\bm{y}|\bm{f},\theta_{y}\sim\prod_{i=1}^{N}p(y_{i}|f(x_{i}),\theta_{y})$
•

GP prior on latent function $f$ : $p(f(\bm{x})|\theta_{f})\sim\mathcal{GP}(m(\bm{x}),k(\bm{x},\bm{x}^{\prime}|\theta_{f}))$
•

hyperprior on $\theta=(\theta_{f},\theta_{y})$ : $\theta\sim p(\theta_{f})p(\theta_{y})$

The conditional posterior distribution of latent values $\bm{f}$ given $\theta$ takes the form

p(\bm{f}|\mathcal{D},\theta)=\frac{p(\bm{y}|\bm{f},\theta_{y})p(\bm{f}|\bm{x},\theta_{f})}{\int p(\bm{y}|\bm{f},\theta_{y})p(\bm{f}|\bm{x},\theta_{f})d\bm{f}}

Integrating over $\theta$ yields the marginal posterior $p(\bm{f}|\mathcal{D})=\int p(\bm{f}|\mathcal{D},\theta)p(\theta)d\theta$ . Given a test point $x_{*}$ , let $f_{*}=f(x_{*})$ be its associated latent value. Denote by $p(f_{*}|\mathcal{D},\theta,x_{*})$ the conditional predictive distribution of $f_{*}$ . The corresponding conditional predictive distribution for the response is given by $p(y_{*}|\mathcal{D},\theta,x_{*})=\int p(y_{*}|f_{*},\theta)p(f_{*}|\mathcal{D},\theta,x_{*})df_{*}$ . Its marginal distribution can be similarly obtained by integrating over $\theta$ .

A Gaussian process is a collection of random variables, any finite member of which are jointly Gaussian. It can be viewed as a generalization of the multivariate Gaussian distribution to (infinite-dimensional) functional space. Throughout this study, we follow the convention of using $\mathcal{GP}(\cdot,\cdot)$ and $\mathcal{N}(\cdot,\cdot)$ to denote a GP and a multivariate Gaussian distribution respectively. The GP provides a powerful probabilistic framework for statistical learning. Consider first a Gaussian observation model $y_{i}=f(x_{i})+u_{i}$ , where $u_{i}\sim\mathcal{N}(0,\sigma^{2})$ . The latent function is modeled as a GP with zero mean and a covariance function $k$ such that $f(x)\sim\mathcal{N}(0,k(x,x^{\prime}))$ . There exists a large collection of covariance functions that are suitable to model relationship of various sorts; see Chapter 4 of Rasmussen and Williams (2006) for details. One popular choice is the Squared Exponential (SE) covariance function:

k_{\mathsf{s}}(x,x^{\prime})=\alpha^{2}\exp{\left[-\frac{(x-x^{\prime})^{2}}{2l^{2}}\right]}

(3)

where $\alpha^{2}$ is a scale parameter and the length scale $l$ governs how fast the correlation between $x$ and $x^{\prime}$ decreases with their distance.

Define $\bm{K}_{\bm{x},\bm{x}^{\prime}}=k(\bm{x},\bm{x}^{\prime}|\theta_{f})$ and $\mathcal{I}_{N}$ the $N$ -dimensional identity matrix. Note $\bm{K}_{x_{*},\bm{x}}$ is a $1\times N$ vector and $K_{x_{*},x_{*}}$ is a scalar. We then have

\left[\begin{array}[]{c}\bm{y}\\ f_{*}\end{array}\right]\sim\mathcal{N}\left({0},\left[\begin{array}[]{c c}\bm{K}_{\bm{x},\bm{x}}+\sigma^{2}\mathcal{I}_{N}&\bm{K}_{\bm{x},x_{*}}\\ \bm{K}_{x_{*},\bm{x}}&K_{x_{*},x_{*}}\end{array}\right]\right)

The standard results on multivariate Gaussian distributions suggest that $f_{*}|\mathcal{D},x_{*},\theta\sim\mathcal{N}(m_{*},v_{*})$ , with the predictive mean and variance given by

	$\displaystyle m_{*}$	$\displaystyle=\mathbbm{E}[f_{}\|\mathcal{D},x_{},\theta]={\bm{K}}_{x_{*},\bm{x}}(\bm{K}_{\bm{x},\bm{x}}+\sigma^{2}\mathcal{I}_{N})^{-1}{\bm{y}}$		(4)
	$\displaystyle v_{*}$	$\displaystyle=\mathbbm{V}[f_{}\|\mathcal{D},x_{},\theta]=K_{x_{},x_{}}-{\bm{K}}_{x_{},\bm{x}}(\bm{K}_{\bm{x},\bm{x}}+\sigma^{2}\mathcal{I}_{N})^{-1}{\bm{K}}_{\bm{x},x_{}}$		(4)

It follows readily that $y_{*}|\mathcal{D},x_{*},\theta\sim\mathcal{N}(m_{*},v_{*}+\sigma^{2})$ .

Since differentiation is a linear operator, the derivatives of a GP remains a GP. Assuming a twice-differentiable kernel, we define

\dot{k}(x,x^{\prime})=\frac{\partial k(x,x^{\prime})}{\partial x},\,\ddot{k}(x,x^{\prime})=\frac{\partial^{2}k(x,x^{\prime})}{\partial x\partial x^{\prime}}

Let $\dot{f}_{*}=\partial f(x_{*})/\partial x_{*},\dot{\bm{K}}_{x_{*},\bm{x}}=\dot{k}(x_{*},\bm{x}|\theta_{f})$ and $\ddot{K}_{x_{*},x_{*}}=\ddot{k}(x_{*},x_{*}|\theta_{f})$ . We can show that $\dot{f}_{*}|\mathcal{D},x_{*},\theta\sim\mathcal{N}(\dot{m}_{*},\dot{v}_{*})$ with

	$\displaystyle\dot{m}_{*}$	$\displaystyle=\mathbbm{E}[\dot{f}_{}\|\mathcal{D},x_{},\theta]={\dot{\bm{K}}}_{x_{*},\bm{x}}(\bm{K}_{\bm{x},\bm{x}}+\sigma^{2}\mathcal{I}_{N})^{-1}{\bm{y}}$		(5)
	$\displaystyle\dot{v}_{*}$	$\displaystyle=\mathbbm{V}[\dot{f}_{}\|\mathcal{D},x_{},\theta]=\ddot{K}_{x_{},x_{}}-{\dot{\bm{K}}}_{x_{},\bm{x}}(\bm{K}_{\bm{x},\bm{x}}+\sigma^{2}\mathcal{I}_{N})^{-1}{\dot{\bm{K}}}_{\bm{x},x_{}}.$		(5)

Below we shall exploit the differential GP to develop GP estimators for RK effects.

For non-Gaussian observation models, the posterior distributions are generally not tractable. The generalized GP regressions, in spirit close to the generalized linear models, are commonly used to tackle non-Gaussianity. Nonetheless, GP’s marginalization and conditionalization properties can still be exploited for prediction. Denote by $\mathbbm{E}[\bm{f}|\mathcal{D},\theta]$ and $\mathbbm{V}[\bm{f}|\mathcal{D},\theta]$ the conditional mean and variance of the latent function associated with the conditional posterior $p(\bm{f}|\mathcal{D},\theta)$ . The predictive mean and variance of $f(x_{*})$ are given by

$\displaystyle\mathbbm{E}[f_{}\|\mathcal{D},x_{},\theta]$	$\displaystyle=\bm{K}_{x_{*},\bm{x}}\bm{K}_{\bm{x},\bm{x}}^{-1}\mathbbm{E}[\bm{f}\|\mathcal{D},\theta]$
$\displaystyle\mathbbm{V}[f_{}\|\mathcal{D},x_{},\theta]$	$\displaystyle=\left(K_{x_{},x_{}}-\bm{K}_{x_{},\bm{x}}\bm{K}_{\bm{x},\bm{x}}^{-1}\bm{K}_{\bm{x},x_{}}\right)+\bm{K}_{x_{},\bm{x}}\bm{K}_{\bm{x},\bm{x}}^{-1}\mathbbm{V}[\bm{f}\|\mathcal{D},\theta]\bm{K}_{\bm{x},\bm{x}}^{-1}\bm{K}_{\bm{x},x_{}}$	(6)
	$\displaystyle=K_{x_{},x_{}}-\bm{K}_{x_{},\bm{x}}(\bm{K}_{\bm{x},\bm{x}}^{-1}-\bm{K}_{\bm{x},\bm{x}}^{-1}\mathbbm{V}[\bm{f}\|\mathcal{D},\theta]\bm{K}_{\bm{x},\bm{x}}^{-1})\bm{K}_{\bm{x},x_{}}$

The first term on the right hand side of (6) corresponds to the posterior variance of $f_{*}$ conditional on a particular $\bm{f}$ ; the second term is due to the randomness of $\bm{f}$ , which has a posterior variance $\mathbbm{V}[\bm{f}|\mathcal{D},\theta]$ .

A variety of methods have been used for the inference of GP models. The marginal likelihood $p(\mathcal{D})=\int p(\mathcal{D}|\theta)p(\theta)d\theta$ , sometimes referred to as the evidence, is indicative of the overall probability of the model. Since the computation of the marginal likelihood can be difficult, the Maximum a Posteriori (MAP) approach seeks the parameters that maximize its integrand: $\hat{\theta}=\arg\max_{\theta}\log p(\mathcal{D}|\theta)+\log p(\theta)$ . The posterior distribution of the latent function $p(\bm{f}|\mathcal{D})$ is then approximated by $p(\bm{f}|\hat{\theta})$ . Since the posterior distribution of $\theta$ is condensed to some point estimate, the MAP approach tends to underestimate a model’s uncertainty. A second approach resorts to deterministic approximations such as the Laplace approximation, expectation propagation and variational methods. These methods approximate the posterior distribution with a simpler surrogate. Generally, the more flexible the approximation is, the better is the precision; however the computation cost usually increases with the degree of precision. A third possibility is to use Monte Carlo Markov Chain (MCMC) methods that construct a Markov chain whose stationary distribution coincides with the posterior distribution. One advantage of the MCMC methods is that it scales well in large parameter space as its rate of convergence is independent of the dimension of hyperparameters. This approach, however, can be computationally expensive. Rasmussen and Williams (2006) offer a detailed treatment of the first two approaches; Gelman et al. (2013) provide a general overview of MCMC methods and a fully Bayesian treatment of GP models. The posterior consistency of GP regression and classification has been treated by Ghosal and Roy (2006), Choi and Schervish (2007), and van der Vaart and van Zanten (2008, 2009).

3 GP models for sharp RD/RK designs

In this section, we introduce GP estimators for RD/RK treatment effect under sharp and fuzzy designs. As mentioned above, a GP is characterized by its mean and covariance functions. Usually the zero mean function suffices. However when extrapolating out of sample, predictions of GP regression gravitate towards the mean function. Incorporating an explicit mean function may mitigate the ‘regressing to zero’ tendency associated with a zero mean function in out of sample prediction. This is particularly desirable for RD estimation of local treatment effect at the cutoff, especially if the cutoff is not covered by the sample range.

A common choice of mean function is a polynomial $m(x)=\bm{h}^{T}(x)\beta$ , where $\bm{h}(x)=[1,x,\dots,x^{S}]^{T}$ and $\beta=[\beta_{0},\beta_{1},\dots,\beta_{S}]^{T}$ . Within the Bayesian paradigm, we can absorb this mean function into a general composite covariance. With a Gaussian prior $\beta\sim\mathcal{N}(0,\Sigma)$ , $m(x)\sim\mathcal{N}(0,\bm{h}^{T}(x)\Sigma\bm{h}(x))$ . Define

k_{\textsf{p}}(x,x^{\prime}|\Sigma)=\bm{h}^{T}(x)\Sigma\bm{h}(x^{\prime})

This function turns out to be the so-called polynomial covariance function. One important appeal of GP regression is its expressiveness. One can not only choose from a large collection of covariance functions with varying features, but also construct new ones from the product and/or sum of covariances. In this study, we consider an additively composite covariance that is the sum of a polynomial covariance and a squared exponential covariance:

k_{\textsf{p+s}}(x,x^{\prime}|\Sigma,l,\alpha^{2})=k_{\textsf{p}}(x,x^{\prime}|\Sigma)+k_{\textsf{s}}(x,x^{\prime}|l,\alpha^{2})

(7)

We can now proceed to RD estimations. We partition the sample into two subsets, denoting the subsample with $x<c$ by $\mathcal{D}_{0}=(\bm{y}_{0},\bm{x}_{0})$ and the rest by $\mathcal{D}_{1}=(\bm{y}_{1},\bm{x}_{1})$ , with respective subsample sizes $N_{0}$ and $N_{1}$ . We first consider the sharp RD design. For $j=0,1$ , we assume that

y_{i,j}=f_{j}(x_{i,j})+u_{i,j},i=1,\dots,N_{j}

where $u_{i,j}\sim\mathcal{N}(0,\sigma_{j}^{2})$ . The latent function $f_{j}$ has a GP prior with zero mean and composite covariance given by (7). We adopt the fully Bayesian approach for inference with independent prior distributions for $\theta_{j}=(l_{j},\alpha_{j}^{2},\Sigma_{j},\sigma_{j}^{2})$ .

Denote by $\tilde{\theta}_{n,j},n=1,\dots,\tilde{N}$ , an MCMC sample of size $\tilde{N}$ drawn from the posterior distribution $p(\theta_{j}|\mathcal{D}_{j})$ . Define $\bm{K}_{x,x^{\prime}}(\theta)=k_{\textsf{p+s}}(x,x^{\prime}|\theta)$ . The predictive mean, conditional on $\tilde{\theta}_{n,j}$ , at the cutoff is given by

\tilde{m}_{n,j}=\bm{K}_{c,\bm{x}_{j}}(\tilde{\theta}_{n,j})(\bm{K}_{\bm{x}_{j},\bm{x}_{j}}(\tilde{\theta}_{n,j})+\tilde{\sigma}_{n,j}^{2}\mathcal{I}_{N_{j}})^{-1}\bm{y}_{j}

(8)

with conditional variance

\tilde{v}_{n,j}=K_{c,c}(\tilde{\theta}_{n,j})-{\bm{K}}_{c,\bm{x}_{j}}(\tilde{\theta}_{n,j})(\bm{K}_{\bm{x}_{j},\bm{x}_{j}}(\tilde{\theta}_{n,j})+\tilde{\sigma}^{2}_{n,j}\mathcal{I}_{N_{j}})^{-1}{\bm{K}}_{\bm{x}_{j},c}(\tilde{\theta}_{n,j})

The predictive mean and variance are then calculated as

\hat{\mu}_{j}=\frac{1}{\tilde{N}}\sum_{n=1}^{\tilde{N}}\tilde{m}_{n,j},\quad\hat{v}_{j}=\frac{1}{\tilde{N}}\sum_{n=1}^{\tilde{N}}(\tilde{m}_{n,j}-\hat{\mu}_{j})^{2}+\frac{1}{\tilde{N}}\sum_{n=1}^{\tilde{N}}\tilde{v}_{n,j}

where the first term of $\hat{v}_{j}$ is the variance of conditional predictive mean and the second is the mean of conditional predictive variance. Finally the RD treatment effect is estimated by

\hat{\tau}_{\textsf{SRD}}=\hat{\mu}_{1}-\hat{\mu}_{0}

with posterior variance

\hat{v}_{\textsf{SRD}}=\hat{v}_{1}+\hat{v}_{0}

Next we consider the estimation of RK effect. Given a binary treatment, the denominator of the RK estimator (2) equals one. Define $\dot{K}_{x,x^{\prime}}(\theta)=\dot{k}_{\textsf{p+s}}(x,x^{\prime}|\theta)$ and $\ddot{K}_{x,x^{\prime}}(\theta)=\ddot{k}_{\textsf{p+s}}(x,x^{\prime}|\theta)$ . The predictive derivative, conditional on $\tilde{\theta}_{n,j}$ , at the cutoff is given by

\tilde{\dot{m}}_{n,j}=\dot{\bm{K}}_{c,\bm{x}_{j}}(\tilde{\theta}_{n,j})(\bm{K}_{\bm{x}_{j},\bm{x}_{j}}(\tilde{\theta}_{n,j})+\tilde{\sigma}_{n,j}^{2}\mathcal{I}_{N_{j}})^{-1}\bm{y}_{j}

with conditional variance

\tilde{\dot{v}}_{n,j}=\ddot{K}_{c,c}(\tilde{\theta}_{n,j})-\dot{\bm{K}}_{c,\bm{x}_{j}}(\tilde{\theta}_{n,j})(\bm{K}_{\bm{x}_{j},\bm{x}_{j}}(\tilde{\theta}_{n,j})+\tilde{\sigma}^{2}_{n,j}\mathcal{I}_{N_{j}})^{-1}\dot{\bm{K}}_{\bm{x}_{j},c}(\tilde{\theta}_{n,j})

Formulae for $\dot{\bm{K}}_{c,\bm{x}_{j}}$ and $\ddot{\bm{K}}_{c,c}$ are provided in Appendix A. The predictive mean and variance of the derivative at the cutoff are calculated as

\hat{\dot{\mu}}_{j}=\frac{1}{\tilde{N}}\sum_{n=1}^{\tilde{N}}\tilde{\dot{m}}_{n,j},\quad\hat{\dot{v}}_{j}=\frac{1}{\tilde{N}}\sum_{n=1}^{\tilde{N}}(\tilde{\dot{m}}_{n,j}-\hat{\dot{\mu}}_{j})^{2}+\frac{1}{\tilde{N}}\sum_{n=1}^{\tilde{N}}\tilde{\dot{v}}_{n,j}

The RK treatment effect is then estimated by

\hat{\tau}_{\textsf{SRK}}=\hat{\dot{\mu}}_{1}-\hat{\dot{\mu}}_{0}

with posterior variance

\hat{v}_{\textsf{SRK}}=\hat{\dot{v}}_{1}+\hat{\dot{v}}_{0}

4 GP models for fuzzy RD/RK designs

Under a fuzzy RD/RK design, the actual treatment take-up may differ from the treatment assignment. Hahn et al. (2001) show that the fuzzy RD design is closely connected to an instrumental variable problem with a discrete instrument, and that the treatment effect of interest is a version of the local average treatment effect or complier average treatment effect. As in the previous section, we start with the RD effect. The estimation under fuzzy RD design entails estimating the ratio of RD effects on the response and treatment take-up probability respectively. The former is given by the sharp RD estimator of the previous section, while the latter is based on the GP regression for binary responses, or GP classification.

Under the fuzzy RD design, the probability of treatment take-up $p(D=1)$ is assumed to be a smooth function of the running variable, except for a jump at the cutoff. Following the convention of the machine learning literature, we set $D=-1$ for no treatment take-up and $D=1$ otherwise. Similarly to the SRD estimator, we model $p(D=1)$ separately for the two subsamples $\mathcal{D}_{0}=(\bm{D}_{0},\bm{x}_{0})$ and $\mathcal{D}_{1}=(\bm{D}_{1},\bm{x}_{1})$ defined according to whether $x\geq c$ . For $j=0,1$ we assume that

p(D_{i,j}=1|x_{i,j})=\psi(\gamma_{j}+D_{i,j}f_{j}(x_{i,j}))

where $\psi$ is a sigmoid function such as the logistic or probit function, $\gamma_{j}$ is an offset coefficient, and $f_{j}$ is a latent function with a GP prior.

The conditional marginal likelihood is given by

	$\displaystyle\log p(\mathcal{D}_{j}\|\theta_{j})=$	$\displaystyle\sum_{i=1}^{N_{j}}\log\psi(\gamma_{j}+D_{i,j}f_{j}(x_{i,j}))-\frac{1}{2}f_{j}(\bm{x}_{j})^{T}\bm{K}_{\bm{x}_{j},\bm{x}_{j}}^{-1}f_{j}(\bm{x}_{j})$
		$\displaystyle-\frac{1}{2}\log\|\bm{K}_{\bm{x}_{j},\bm{x}_{j}}\|-\frac{N_{j}}{2}\log 2\pi$

The hyperparameter $\theta_{j}$ includes $\gamma_{j}$ for the observation model and those for the covariance of the latent function. For notational simplicity, the dependence of the covariance and latent function on the hyperparameters is suppressed.

Since the marginal likelihood $p(D_{j})=\int p(D_{j}|\theta_{j})p(\theta_{j})d\theta_{j}$ is not tractable, we use MCMC method for inference. Let $\tilde{\theta}_{n,j},n=1,\dots,\tilde{N}$ , be an MCMC sample of hyperparameters from the posterior distribution $p(\theta_{j}|\mathcal{D}_{j})$ . For each $\tilde{\theta}_{n,j}$ , we generate an $N_{j}$ -dimensional vector of latent values according to

\tilde{\bm{f}}_{n,j}\sim\mathcal{N}(0,\bm{K}_{\bm{x}_{j},\bm{x}_{j}}(\tilde{\theta}_{n,j}))

The predictive latent value at the cutoff is then given by

\tilde{q}_{n,j}=\bm{K}_{c,\bm{x}_{j}}(\tilde{\theta}_{n,j})\bm{K}^{-1}_{\bm{x}_{j},\bm{x}_{j}}(\tilde{\theta}_{n,j})\tilde{\bm{f}}_{n,j}

(9)

To ensure numerical stability, a small positive constant is typically added to the diagonal of $\bm{K}_{\bm{x}_{j},\bm{x}_{j}}$ before taking its inverse. The corresponding predictive probability at the cutoff is then estimated by $\hat{p}_{j}=\frac{1}{\tilde{N}}\sum_{n=1}^{\tilde{N}}\tilde{p}_{n,j}$ , where $\tilde{p}_{n,j}=\psi(\tilde{\gamma}_{n,j}+\tilde{q}_{n,j})$ . Its variance is estimated by $\hat{v}_{p,j}=\frac{1}{\tilde{N}_{j}}\sum_{n=1}^{\tilde{N}_{j}}(\tilde{p}_{n,j}-\hat{p}_{j})^{2}$ . The RD effect of treatment assignment on treatment take-up probability is calculated as

\hat{\tau}_{\mathsf{SRDP}}=\hat{p}_{1}-\hat{p}_{0}

with posterior variance

\hat{v}_{\mathsf{SRDP}}=\hat{v}_{p,0}+\hat{v}_{p,1}

One can proceed to estimate the fuzzy RD effect with the ratio $\hat{\tau}_{\mathsf{SRD}}/\hat{\tau}_{\mathsf{SRDP}}$ . This simple ratio estimator, however, is known to be biased and can be improved via a Jackknife refinement (Scott and Wu, 1981; Shao and Tu, 2012). Denote by $\hat{\tau}^{(-n)}_{\mathsf{SRD}}$ the sharp RD estimator on the response calculated with the $n$ -th MCMC sample left out, and $\hat{\tau}^{(-n)}_{\mathsf{SRDP}}$ its counterpart on the treatment take-up probability. The Jackknife FRD estimator is given by

\hat{\tau}_{\mathsf{FRD}}=\tilde{N}\frac{\hat{\tau}_{\mathsf{SRD}}}{\hat{\tau}_{\mathsf{SRDP}}}-\frac{\tilde{N}}{\tilde{N}-1}\sum_{n=1}^{\tilde{N}}\frac{\hat{\tau}^{(-n)}_{\mathsf{SRD}}}{\hat{\tau}^{(-n)}_{\mathsf{SRDP}}}

with variance

\hat{v}_{\mathsf{FRD}}=\frac{1}{\tilde{N}}\left(\frac{\hat{v}_{\mathsf{SRD}}}{\hat{\tau}^{2}_{\mathsf{SRDP}}}+\frac{\hat{\tau}^{2}_{\mathsf{SRD}}{\hat{v}_{\mathsf{SRDP}}}}{\hat{\tau}^{4}_{\mathsf{SRDP}}}\right)

The estimator for the fuzzy RK effect is constructed analogously as follows:

\hat{\tau}_{\mathsf{FRK}}=\tilde{N}\frac{\hat{\tau}_{\mathsf{SRK}}}{\hat{\tau}_{\mathsf{SRDP}}}-\frac{\tilde{N}}{\tilde{N}-1}\sum_{n=1}^{\tilde{N}}\frac{\hat{\tau}^{(-n)}_{\mathsf{SRK}}}{\hat{\tau}^{(-n)}_{\mathsf{SRDP}}}

For fuzzy RD/RK design with a binary treatment, the FRK estimator shares with FRD the same ‘denominator’ estimator $\hat{\tau}_{\mathsf{SRDP}}$ . Thus its variance is estimated by

\hat{v}_{\mathsf{FRK}}=\frac{1}{\tilde{N}}\left(\frac{\hat{v}_{\mathsf{SRK}}}{\hat{\tau}^{2}_{\mathsf{SRDP}}}+\frac{\hat{\tau}^{2}_{\mathsf{SRK}}{\hat{v}_{\mathsf{SRDP}}}}{\hat{\tau}^{4}_{\mathsf{SRDP}}}\right)

5 Hierarchical GP models

To a large degree, the expressiveness of a GP model is determined by its covariance. The estimators introduced in the preceding sections employ the squared exponential covariance function. In spite of its flexibility, it is stationary and can be inadequate if the underlying curve is non-stationary with rapid variations in its slope/curvature. There are two general ways to introduce non-stationarity to GP regressions (Rasmussen and Williams, 2006). One is to make the parameters of the covariance function input dependent. However it remains a challenge to maintain the positive-definiteness of the covariance with varying parameters. Another possibility is to introduce a non-linear transformation of the input $x$ , say $g(x)$ , and then construct a GP model in the $g(x)$ space.

We adopt the second approach and consider nonlinear transformations that are smooth and bounded, such as the logit, probit and tanh transformations. These sigmoid functions are widely used in neural network (NN) estimations, often referred to as activation functions. They have bounded derivatives and therefore provide desirable numerical stability, especially for RK estimations.

Given an activation function $g$ , we construct our first stage mapping $g(x|\theta_{g})=g(\lambda_{0}+\lambda_{1}x)$ , where $\theta_{g}=(\lambda_{0},\lambda_{1})$ . We then apply the various GP models proposed in the preceding sections to $g(x|\theta_{g})$ . For instance, the Gaussian observation model becomes $y_{i}=f(g(x_{i}|\theta_{g}))+u_{i}$ . A Bayesian approach is adopted for the first stage as well. Thus the hyperparameters for this hierarchical GP model consist of $\theta=(\theta_{g},\theta_{f},\theta_{y})$ with prior distribution $p(\theta_{g})p(\theta_{f})p(\theta_{y})$ . Figure 1 presents a graphical representation of the hierarchical GP model. This model can be interpreted as a four-layer deep learning model. The first and last layers are the inputs and outputs. The second layer is the NN transformation $g(x)$ of the input $x$ and the third layer constructs the GP latent variable based on $g(x)$ . Unlike the other layers, the third layer does not assume (conditional) independence and instead models all units jointly as a Gaussian process.

Refer to caption — Figure 1: Graphical representation of proposed hierarchical GP model (the shaded nodes are observed, the rest are hidden; the third layer, modeled as a GP, is fully connected)

Transformation of variables has had a long tradition in statistics. For instance, the power transformation and Box-Cox transformation have been customarily applied to non-Gaussian variables to reduce non-Gaussianity. In nonparametric estimations, the inputs are often transformed into features (for instance splines, wavelets in series estimation and perceptrons in one-layer neural network) and the subsequent curve fitting is conducted on the feature space. Two key innovations distinguish deep learning methods from these conventional methods (Murphy, 2012). First, in contrast to a ‘shallow’ model with a potentially large number of features, deep learning methods employ a multi-layer architecture, each layer nonlinearlly transforming its input into a slightly more abstract and composite representation. Second, many conventional methods that operate on the feature space can be characterized as un-supervised ‘feature engineering’ wherein the construction of features and the subsequent fitting are disconnected. In contrast, deep learning methods jointly train all layers.

The most popular deep learning models are multilayer neural networks (Goodfellow et al., 2016). The proposed hierarchical model can be characterized a hybrid estimator that features an NN layer followed by a GP layer. It is not uncommon that deep NN models feature thousands or even millions of tuning parameters and therefore can be ‘data hungry’. Given the sometimes small samples for RD estimations, we opt for the more economic hybrid model with two hidden layers. We prefer the GP approach to the NN as the penultimate layer for several reasons. First, it is probabilistic and offers automatic uncertainty quantification. Second, it is amenable to derivative estimations. Third, it is more expressive than an NN with a fixed number of nodes. As a matter of fact, the squared exponential covariance function can be obtained as the limit of basis function expansion with an infinite number of radial basis functions (see e.g., Chapter 4 of Rasmussen and Williams (2006)).

The marginal likelihood of the hierarchical GP model, conditional on $\theta$ , is given by

\log p(\mathcal{D}|\theta)=-\frac{1}{2}\bm{y}^{T}(\bm{K}_{g(\bm{x}),g(\bm{x})}+\sigma^{2}\mathcal{I}_{N})^{-1}\bm{y}-\frac{1}{2}\log|\bm{K}_{g(\bm{x}),g(\bm{x})}+\sigma^{2}\mathcal{I}_{N}|-\frac{N}{2}\log 2\pi

and the unconditional marginal likelihood $p(\mathcal{D})=\int p(\mathcal{D}|\theta)p(\theta_{g})p(\theta_{f})p(\theta_{y})d\theta_{g}d\theta_{f}d\theta_{y}$ . Let $\tilde{\theta}_{n,j}=(\tilde{\theta}_{g,n,j},\tilde{\theta}_{f,n,j},\tilde{\theta}_{y,n,j}),n=1,\dots,\tilde{N}$ , be an MCMC sample generated from the posterior distribution $p(\theta_{j}|\mathcal{D}_{j})$ . Define ${\bm{K}}_{g(x),g(x^{\prime})}(\theta)=k_{\mathsf{p+s}}(g(\bm{x}|\theta_{g}),g(\bm{x}^{\prime}|\theta_{g})|\theta_{f})$ . The predictive mean, conditional on $\tilde{\theta}_{n,j}$ , at the cutoff is given by

\tilde{m}_{n,j}(g)=\bm{K}_{g(c),g(\bm{x}_{j})}(\tilde{\theta}_{n,j})(\bm{K}_{g(\bm{x}_{j}),g(\bm{x}_{j})}(\tilde{\theta}_{n,j})+\tilde{\sigma}_{n,j}^{2}\mathcal{I}_{N_{j}})^{-1}\bm{y}_{j}

(10)

with conditional variance

	$\displaystyle\tilde{v}_{n,j}(g)=$	$\displaystyle K_{g(c),g(c)}(\tilde{\theta}_{n,j})$
		$\displaystyle-{\bm{K}}_{g(c),g(\bm{x}_{j})}(\tilde{\theta}_{n,j})(\bm{K}_{g(\bm{x}_{j}),g(\bm{x}_{j})}(\tilde{\theta}_{n,j})+\tilde{\sigma}^{2}_{n,j}\mathcal{I}_{N_{j}})^{-1}{\bm{K}}_{g(\bm{x}_{j}),g(c)}(\tilde{\theta}_{n,j})$

The predictive mean and variance at the cutoff are then calculated as

\hat{\mu}_{j}(g)=\frac{1}{\tilde{N}}\sum_{n=1}^{\tilde{N}}\tilde{m}_{n,j}(g),\quad\hat{v}_{j}(g)=\frac{1}{\tilde{N}}\sum_{n=1}^{\tilde{N}}(\tilde{m}_{n,j}(g)-\hat{\mu}_{j}(g))^{2}+\frac{1}{\tilde{N}}\sum_{n=1}^{\tilde{N}}\tilde{v}_{n,j}(g)

Finally the treatment effect is estimated by

\hat{\tau}_{\textsf{SRD}}(g)=\hat{\mu}_{1}(g)-\hat{\mu}_{0}(g)

with variance

\hat{v}_{\textsf{SRD}}(g)=\hat{v}_{1}(g)+\hat{v}_{0}(g)

Next consider the SRK effect. Let $\dot{g}(x|\theta)=\partial g(x|\theta_{g})/\partial x$ . Define $\dot{K}_{g(x),g(x^{\prime})}(\theta)=\dot{k}_{\textsf{p+s}}(g(x|\theta_{g}),g(x^{\prime}|\theta_{g})|\theta_{f})\dot{g}(x|\theta_{g})$ and $\ddot{K}_{g(x),g(x^{\prime})}(\theta)=\ddot{k}_{\textsf{p+s}}(g(x|\theta_{g}),g(x^{\prime}|\theta_{g})|\theta_{f})\dot{g}(x|\theta_{g})\dot{g}(x^{\prime}|\theta_{g})$ . The predictive derivative, conditional on $\tilde{\theta}_{n,j}$ , at the cutoff is given by

\tilde{\dot{m}}_{n,j}(g)=\dot{\bm{K}}_{g(c),g(\bm{x}_{j})}(\tilde{\theta}_{n,j})(\bm{K}_{g(\bm{x}_{j}),g(\bm{x}_{j})}(\tilde{\theta}_{n,j})+\tilde{\sigma}_{n,j}^{2}\mathcal{I}_{N_{j}})^{-1}\bm{y}_{j}

with conditional variance

	$\displaystyle\tilde{\dot{v}}_{n,j}(g)=$	$\displaystyle\ddot{K}_{g(c),g(c)}(\tilde{\theta}_{n,j})$
		$\displaystyle-\dot{\bm{K}}_{g(c),g(\bm{x}_{j})}(\tilde{\theta}_{n,j})(\bm{K}_{g(\bm{x}_{j}),g(\bm{x}_{j})}(\tilde{\theta}_{n,j})+\tilde{\sigma}^{2}_{n,j}\mathcal{I}_{N_{j}})^{-1}\dot{\bm{K}}_{g(\bm{x}_{j}),g(c)}(\tilde{\theta}_{n,j})$

The predictive mean derivative is then calculated as

\hat{\dot{\mu}}_{j}(g)=\frac{1}{\tilde{N}}\sum_{n=1}^{\tilde{N}}\tilde{\dot{m}}_{n,j}(g)

with predictive variance

\hat{\dot{v}}_{j}(g)=\frac{1}{\tilde{N}}\sum_{n=1}^{\tilde{N}}(\tilde{\dot{m}}_{n,j}(g)-\hat{\dot{\mu}}_{j}(g))^{2}+\frac{1}{\tilde{N}}\sum_{n=1}^{\tilde{N}}\tilde{\dot{v}}_{n,j}(g)

The SRK treatment effect is estimated by

\hat{\tau}_{\textsf{SRK}}(g)=\hat{\dot{\mu}}_{1}(g)-\hat{\dot{\mu}}_{0}(g)

with posterior variance

\hat{v}_{\textsf{SRK}}(g)=\hat{\dot{v}}_{1}(g)+\hat{\dot{v}}_{0}(g)

The hierarchical GP models under the fuzzy RD and RK designs are constructed analogously. For brevity, the details are omitted.

6 Numerical performance

6.1 Monte Carlo simulations

We use Monte Carlo simulations to assess the numerical performance of the proposed methods. We examine the following data generating process that has been studied in various previous studies: $y_{i}=f(x_{i})+u_{i}$ , where $x_{i}\sim 2\textnormal{Beta}(2,4)-1$ and $u_{i}\sim\mathcal{N}(0,0.1295^{2})$ . We set the cutoff $c=0$ and consider the following mean functions:

•

DGP1: $f(x)=0.42+0.1\mathbb{I}(x\geq 0)+0.84x-3x^{2}+7.99x^{3}-9.01x^{4}+3.56x^{5}$
•

DGP2: $f(x)=(3.71+2.30x+3.28x^{2}+1.45x^{3}+0.23x^{4}+0.03x^{5})\mathbb{I}(x<0)+(0.26+18.49x-54.18x^{2}+74.30x^{3}-45.02x^{4}+9.83x^{5})\mathbb{I}(x\geq 0)$
•

DGP3: $f(x)=x^{3}$

These designs feature different combinations of RD/RK effects. DGP1 has a moderate RD effect and no RK effect, DGP2 has sizable RD and RK effects, and DGP3 has neither RD nor RK effect. For the fuzzy RD/RK designs, we follow Arai and Ichimura (2016) and set the treatment take-up probability $p(D=1|x<0)=\textnormal{Probit}(x|-1.28,1)$ and $p(D=1|x\geq 0)=\textnormal{Probit}(x|1.28,1)$ . This leads to a jump in the treatment take-up probability of size 0.8 at the cutoff. We consider two sample sizes $N=300$ and 500, and repeat each experiment 300 times.

We employ a fully Bayesian approach for the inference of the proposed GP models. We denote the one- and two-layer GP models by GP1 and GP2 respectively. A half-normal $\mathcal{N}^{+}(0,5^{2})$ prior is placed on $l,\alpha,\sigma$ ; a normal $\mathcal{N}(0,5^{2})$ prior is placed on the bias parameter $\gamma$ of the GP classification model and the parameters for the first layer of the hierarchical GP model $\lambda_{0},\lambda_{1}$ . The prior for the coefficients of the quadratic mean basis functions is set to be $\mathcal{N}(0,100^{2})$ , yielding $\Sigma=100^{2}\mathcal{I}_{3}$ in the corresponding polynomial covariance. These weakly informative priors have been customarily used in previous studies. For our simulations, we use the Stan program that implements the Hamiltonian MCMC method (Neal, 2011). We run four parallel Markov Chains; the number of MCMC draws is set to be 1000 with an equal number of warm-ups. To speed up computation, we include in the estimation only observations with $x$ no farther than twice of the Silverman’s rule-of-thumb bandwidth from the cutoff. We use this commonly-used bandwidth mainly for convenience and twicing the bandwidth makes the thresholding rather generous.

For comparison, we also calculate the local linear (LL) estimators implemented in the R package ‘rdrobust’. We consider two versions of the LL estimators. LL1 uses the MSE-optimal bandwidth, and LL2 is a robust alternative with bias-adjustment and coverage-optimal bandwidth; see Calonico et al. (2014) for details. Note that different bandwidths are used for the LL estimation of RD and RK effects. In contrast, GP models obey the likelihood principle such that given the prior, all inferences and predictions depend only on the likelihood function. Therefore one common model is used for the RD and RK estimations.

We summarize in Table 1 the estimation results for the SRD and SRK effects, reporting the average absolute bias, root mean squared error (RMSE), average coverage probability of 95% confidence/credibility intervals and average interval length (IL). For each category, the estimator with the lowest bias, RMSE, IL or coverage probability closest to the nominal level is highlighted with the boldface font. In terms of both absolute bias and RMSE, the hierarchical GP estimator (GP2) dominates across the board, sometimes by considerable margins. GP2 noticeably improves upon the single-layer GP1. For instance for the RD effect, the average ratio of the RMSE between GP2 and GP1 across the three experiments is 0.69 for $N=300$ and 0.72 for $N=500$ . Their counterparts for the RK effect are 0.61 and 0.60 respectively. Mixed results are obtained for coverage and interval length. LL2 and GP1 generally perform better in terms of coverage probability. As is reported in previous studies, LL2 improves upon LL1 in coverage at the expense of slightly wider intervals. It is also worth noting that the GP estimators tend to have shorter intervals than the LL estimators.

To investigate the sensitivity of estimation results to the specification of tuning parameters, we experiment with some alternative estimation configurations. We report in Appendix B some additional estimation results under a range of prior distributions. These experiments suggest that our estimation results are rather stable with respect to these alternatives.

Table 1: SRD/SRK estimation results; top panel:

N=300

, bottom panel:

N=500

		SRD				SRK
		LL1	LL2	GP1	GP2	LL1	LL2	GP1	GP2
DGP1	$\|\textnormal{Bias}\|$	0.065	0.076	0.069	0.046	1.363	1.791	1.276	1.161
	RMSE	0.084	0.097	0.088	0.058	1.752	2.423	1.629	1.287
	Coverage	0.903	0.913	0.970	0.983	0.863	0.943	0.970	0.743
	IL	0.302	0.356	0.365	0.260	5.363	8.298	7.405	3.798
DGP2	$\|\textnormal{Bias}\|$	0.100	0.087	0.075	0.072	3.771	2.594	2.101	1.666
	RMSE	0.123	0.110	0.094	0.090	4.267	3.119	2.506	2.095
	Coverage	0.890	0.960	0.963	0.967	0.580	0.950	0.853	0.990
	IL	0.453	0.502	0.368	0.388	9.865	12.982	7.618	10.764
DGP3	$\|\textnormal{Bias}\|$	0.061	0.071	0.068	0.032	0.881	1.397	1.254	0.212
	RMSE	0.076	0.089	0.087	0.041	1.274	1.978	1.607	0.305
	Coverage	0.907	0.917	0.967	0.987	0.953	0.963	0.970	0.990
	IL	0.274	0.326	0.364	0.198	4.149	6.660	7.401	1.917
DGP1	$\|\textnormal{Bias}\|$	0.050	0.057	0.053	0.036	1.112	1.366	1.167	1.053
	RMSE	0.066	0.075	0.067	0.045	1.353	1.772	1.475	1.186
	Coverage	0.903	0.913	0.963	0.980	0.880	0.950	0.970	0.730
	IL	0.237	0.277	0.282	0.207	4.381	6.667	6.297	3.310
DGP2	$\|\textnormal{Bias}\|$	0.074	0.067	0.056	0.054	3.311	2.149	1.757	1.410
	RMSE	0.094	0.087	0.071	0.070	3.760	2.660	2.160	1.792
	Coverage	0.880	0.920	0.947	0.960	0.590	0.937	0.837	0.983
	IL	0.319	0.348	0.283	0.294	8.037	10.217	6.419	8.739
DGP3	$\|\textnormal{Bias}\|$	0.045	0.052	0.053	0.026	0.654	1.012	1.157	0.179
	RMSE	0.059	0.068	0.067	0.033	0.926	1.401	1.462	0.244
	Coverage	0.920	0.933	0.960	0.990	0.940	0.957	0.970	1.000
	IL	0.205	0.243	0.282	0.154	2.981	4.750	6.297	1.595

We next examine estimation results under fuzzy designs. The first stage of our FRD/FRK estimators uses the GP classification to estimate the jump in probability of treatment take-up at the cutoff. The GP estimators perform well for this task (see Appendix C for details). Table 2 reports the estimation results of the fuzzy RD/RK effects. The overall pattern is similar to those under the sharp design. GP2 again dominates in terms of estimation precision, registering the smallest absolute bias and RMSE in all but one instances. It also tends to have the shortest confidence interval, while LL2 and GP1 generally excel in coverage probability. Similarly to the cases under a sharp design, GP2 considerably improves upon GP1. The average ratios of the RMSE between GP2 and GP1 for the RD effects are 0.64 for both sample sizes; for the RK effects, they are 0.60 for $N=300$ and 0.58 for $N=500$ .

Table 2: FRD/FRK estimation results; top panel:

N=300

, bottom panel:

N=500

		FRD				FRK
		LL1	LL2	GP1	GP2	LL1	LL2	GP1	GP2
DGP1	$\|\textnormal{Bias}\|$	0.098	0.123	0.085	0.058	1.124	2.068	1.782	1.377
	RMSE	0.162	0.187	0.109	0.072	1.967	5.814	2.214	1.540
	Coverage	0.909	0.909	0.951	0.972	0.972	0.962	0.972	0.875
	IL	0.411	0.486	0.447	0.321	18.292	30.139	9.186	4.867
DGP2	$\|\textnormal{Bias}\|$	0.597	0.791	0.395	0.316	18.681	19.041	6.225	6.105
	RMSE	1.136	1.471	0.494	0.400	21.046	22.916	8.093	7.625
	Coverage	0.920	0.951	0.970	1.000	0.420	0.561	0.932	0.864
	IL	2.509	2.955	1.937	1.852	71.413	120.028	26.794	27.085
DGP3	$\|\textnormal{Bias}\|$	0.093	0.108	0.082	0.037	0.788	1.185	1.740	0.244
	RMSE	0.167	0.178	0.104	0.046	1.400	2.882	2.162	0.364
	Coverage	0.934	0.931	0.969	0.997	0.972	0.972	0.962	1.000
	IL	0.366	0.433	0.443	0.235	14.713	23.908	9.126	2.438
DGP1	$\|\textnormal{Bias}\|$	0.066	0.081	0.071	0.051	1.240	2.137	1.538	1.221
	RMSE	0.096	0.116	0.090	0.061	2.154	5.086	1.947	1.383
	Coverage	0.929	0.908	0.954	0.982	0.993	0.972	0.958	0.852
	IL	0.288	0.338	0.353	0.259	21.869	35.813	8.107	4.246
DGP2	$\|\textnormal{Bias}\|$	0.355	0.531	0.351	0.279	18.315	18.604	5.918	5.582
	RMSE	0.595	0.788	0.433	0.345	21.561	21.984	7.529	6.637
	Coverage	0.976	0.984	0.951	0.992	0.565	0.671	0.927	0.858
	IL	1.542	1.830	1.607	1.535	85.993	142.616	24.140	23.939
DGP3	$\|\textnormal{Bias}\|$	0.060	0.071	0.071	0.032	0.857	1.441	1.437	0.210
	RMSE	0.092	0.104	0.088	0.039	1.773	5.101	1.825	0.286
	Coverage	0.934	0.920	0.955	0.993	0.997	0.997	0.965	1.000
	IL	0.255	0.300	0.349	0.189	18.396	30.100	7.996	2.007

6.2 Empirical application

We apply the proposed methods to the US House election data studied by Lee (2008). This study investigates the electoral advantage to incumbency in elections to the United States House of Representatives. Many factors contribute to successful political campaigns and elections. Lee (2008) argued that districts where a party’s candidate just barely won an election and hence barely became the incumbent are likely to be comparable in all other ways to districts where the party’s candidate just barely lost the election. Therefore differences in the electoral success between these two groups in the next election can be exploited to identify the causal party incumbency advantage.

In this study, the treatment corresponds to winning the previous election, and the running variable corresponds to the margin of victory of a Democratic candidate (with the cutoff set at zero). The outcome of interest is the Democratic vote share in the following election. Elections with the previous winning margin between positive and negative 25% are used in the estimation. To avoid potential censoring issues, we also omit extreme observations with zero or one hundred percent Democratic votes. The number of retained observations is 2681.

We estimate the SRD/SRK effects of incumbency on vote share. In addition, we also estimate the treatment effect on the probability of getting more than 50% of votes, which guarantees a victory. The results are reported in Table 3. All four estimators point to a significant RD effect at the level of 5-7 percentage point increase in vote share. GP2 estimate is closer to the two LL estimates and improves upon GP1 in terms of precision. Although all estimators suggest a positive RK effect, none of them is statistically significant. LL1 and GP2 estimates are close in magnitude and precision. The estimation results on the probability of winning are reported in the third row. As explained in the previous section, GP1 and GP2 generate virtually identical results for classification tasks, thus we focus on GP1 in the estimation of winning probability. All estimators suggest a positive RD effect about 50% increase in the probability of winning the election. The GP estimate has a considerably smaller standard error than the two LL estimates.

Table 3: Estimated SRD/SRK effect on vote share (rows one and two) and SRD effect on winning probability (row three)

	LL1		LL2		GP1		GP2
	$\hat{\tau}$	se $(\hat{\tau})$	$\hat{\tau}$	se $(\hat{\tau})$	$\hat{\tau}$	se $(\hat{\tau})$	$\hat{\tau}$	se $(\hat{\tau})$
SRD	5.81	1.18	5.89	1.39	7.01	2.22	6.23	1.65
SRK	33.44	49.09	69.11	77.73	54.53	166.60	24.51	52.64
SRDP	0.53	0.16	0.49	0.19	0.48	0.04	—	—

The predictive means and their 95% credible intervals of the estimated latent functions for vote share (based on GP2) and winning probability (based on GP1) are plotted in Figure 2. Both indicate unmistakable effects of incumbency advantage. The credible intervals for the winning probability estimation is noticeably wider; this can be attributed to information loss caused by the reduction of a continuous response variable to a binary one.

7 Conclusion and further extension for covariate adjustment

We propose a family of Gaussian Process estimators for causal inference exploiting Regression Discontinuity/Kink (RD/RK) under sharp and fuzzy designs. The GP methods are powerful probabilistic modeling approaches that are advantageous in terms of derivative estimation and uncertainty quantification, facilitating RK estimation and inference of RD/RK models. These estimators are extended to hierarchical GP models with an intermediate Bayesian neural network layer. This estimator can be characterized as a hybrid deep learning model. Monte Carlo simulations show that our estimators perform comparably to and sometimes better than competing estimators in terms of precision, coverage and interval length. The hierarchical GP models improve up one-layer GP models substantially.

Lastly we briefly present a further extension to incorporate covariates. Covariate adjustment based on pre-intervention measures may allow for efficiency gains or the evaluation of treatment effect heterogeneity. Recent studies have explored the inclusion of covariates in RD estimations; see a recent review by Cattaneo et al. (2021) and references therein. It is relatively straightforward to include covariates in GP regressions. Calonico et al. (2019) suggest that in continuity-based RD estimation, covariates should be entered in an additive separable manner. Suppose in addition to the running variable $x$ , one is to incorporate a vector of covariates $z$ . This can be modeled by a Gaussian Process with an additive composite covariance $k_{1}(x,x^{\prime})+k_{2}(z,z^{\prime})$ , where $k_{1}$ and $k_{2}$ are generic covariance functions. The hyperparameters of these two covariances are jointly learned during the estimation. The influence of the covariates is then profiled out and inference on RD effect then proceeds in the same manner as the ‘unadjusted’ estimates dicussed in the previous sections. As an illustration, we apply this covariate-adjusted estimator to the same US House election data. To save space, we focus on the one-layer GP estimator of sharp RD effect on vote share. The estimated coefficient and standard deviation are respectively 6.49 and 2.19, which are close to the ‘unadjusted’ estimates of 7.01 and 2.22. A detailed investigation of covariate-adjusted estimation is beyond the scope of this study. For completeness, necessary details to implement the covariate-adjusted RD estimator are provided in Appendix D.

Appendix

A. Posterior variance of differentiate GP

Recall that we use a composite covariance $k_{\mathsf{p+s}}(x,x^{\prime})=k_{\mathsf{p}}(x,x^{\prime})+k_{\mathsf{s}}(x,x^{\prime})$ , where $k_{\mathsf{p}}(x,x^{\prime}|\Sigma)=\bm{h}^{T}(x)\Sigma\bm{h}(x^{\prime})$ and $k_{\mathsf{s}}(x,x^{\prime}|l,\alpha^{2})=\alpha^{2}\exp(-\frac{(x-x^{\prime})^{2}}{2l^{2}})$ . It follows that

\dot{k}_{\mathsf{p}}(x,x^{\prime}|\Sigma)=\frac{\partial k_{\mathsf{p}}(x,x^{\prime}|\Sigma)}{\partial x}=\frac{\partial\bm{h}^{T}(x)}{\partial x}\Sigma\bm{h}(x^{\prime})

and

\ddot{k}_{\mathsf{p}}(x,x^{\prime}|\Sigma)=\frac{\partial^{2}k_{\mathsf{p}}(x,x^{\prime}|\Sigma)}{\partial x\partial x^{\prime}}=\frac{\partial\bm{h}^{T}(x)}{\partial x}\Sigma\frac{\partial\bm{h}(x^{\prime})}{\partial x^{\prime}}

Next for the SE covariance, we have

\dot{k}_{\mathsf{s}}(x,x^{\prime}|l,\alpha^{2})=\frac{\partial k_{\mathsf{s}}(x,x^{\prime}|l,\alpha^{2})}{\partial x}=-\frac{x-x^{\prime}}{l^{2}}k_{\mathsf{s}}(x,x^{\prime}|l,\alpha^{2})

and

\ddot{k}_{\mathsf{s}}(x,x^{\prime}|l,\alpha^{2})=\frac{\partial^{2}k_{\mathsf{s}}(x,x^{\prime}|l,\alpha^{2})}{\partial x\partial x^{\prime}}=\frac{1}{l^{2}}k_{\mathsf{s}}(x,x^{\prime}|l,\alpha^{2})+\frac{(x-x^{\prime})^{2}}{l^{4}}k_{\mathsf{s}}(x,x^{\prime}|l,\alpha^{2})

It follows that $\ddot{k}_{\mathsf{s}}(x,x|l,\alpha^{2})=\frac{\alpha^{2}}{l^{2}}$ .

Gathering these results, we have

	$\displaystyle\dot{\bm{K}}_{c,\bm{x}}(\theta)$	$\displaystyle=\left.\frac{\partial\bm{h}^{T}(x)}{\partial x}\right\|_{x=c}\Sigma\bm{h}(\bm{x})-\frac{c-\bm{x}}{l^{2}}k_{\mathsf{s}}(c,\bm{x}\|l,\alpha^{2})$
	$\displaystyle\ddot{K}_{c,c}(\theta)$	$\displaystyle=\left.\frac{\partial\bm{h}^{T}(x)}{\partial x}\right\|_{x=c}\Sigma\left.\frac{\partial\bm{h}(x)}{\partial x}\right\|_{x=c}+\frac{\alpha^{2}}{l^{2}}$

B. Estimation results under alternative configurations

We have experimented with many aspects of the estimation configuration. The results are not sensitive to alternative specifications. It is infeasible to report all these results. Below we report experiments on alternative prior distributions for the key tuning parameters $l,\alpha,\sigma$ of the covariances, focusing on the sharp RD/RK estimation of single-layer GP regressions. Instead of the half normal $\mathcal{N}^{+}(0,5^{2})$ prior used in the simulations reported in the text, we consider two alternatives: $\mathcal{N}^{+}(0,3^{2})$ and $\mathcal{N}^{+}(0,10^{2})$ . The former is somewhat restrictive while the latter is highly non-informative. The results reported in Table A.1 clearly suggest that our estimation is not sensitive to the range of variation in prior distributions. For $N=500$ , the results obtained under different prior distributions are hardly distinguishable. This is consistent with the general expectation that in Bayesian analysis, the importance of the prior tends to diminish with the sample size.

Table A.1 SRD/SRK estimation results with prior distributions for $l,\alpha,\rho$ at $\mathcal{N}^{+}(0,a^{2})$
Top panel: $N=300$ ; Bottom panel: $N=500$ SRD SRK $a=3$ $a=5$ $a=10$ $a=3$ $a=5$ $a=10$ DGP1 $|\textnormal{Bias}|$ 0.069 0.069 0.069 1.281 1.276 1.272 RMSE 0.088 0.088 0.087 1.632 1.629 1.625 Coverage 0.970 0.970 0.967 0.973 0.970 0.970 IL 0.366 0.365 0.361 8.008 7.405 7.252 DGP2 $|\textnormal{Bias}|$ 0.075 0.069 0.075 2.088 2.101 2.117 RMSE 0.094 0.088 0.094 2.497 2.506 2.522 Coverage 0.957 0.963 0.950 0.873 0.853 0.853 IL 0.369 0.368 0.363 8.103 7.618 7.372 DGP3 $|\textnormal{Bias}|$ 0.068 0.068 0.068 1.257 1.254 1.251 RMSE 0.087 0.088 0.087 1.609 1.607 1.607 Coverage 0.967 0.967 0.967 0.977 0.970 0.967 IL 0.365 0.364 0.361 7.665 7.401 7.264 DGP1 $|\textnormal{Bias}|$ 0.053 0.053 0.053 1.170 1.167 1.163 RMSE 0.067 0.067 0.067 1.478 1.475 1.473 Coverage 0.963 0.963 0.957 0.973 0.970 0.967 IL 0.282 0.282 0.280 6.621 6.297 6.189 DGP2 $|\textnormal{Bias}|$ 0.056 0.056 0.053 1.753 1.757 1.769 RMSE 0.071 0.071 0.067 2.156 2.160 2.168 Coverage 0.947 0.947 0.957 0.843 0.837 0.833 IL 0.285 0.283 0.280 6.719 6.419 6.278 DGP3 $|\textnormal{Bias}|$ 0.053 0.053 0.053 1.161 1.157 1.153 RMSE 0.067 0.067 0.067 1.467 1.462 1.460 Coverage 0.960 0.960 0.957 0.977 0.970 0.967 IL 0.282 0.282 0.280 6.544 6.297 6.236

C. Estimation results on treatment take-up probability

Our numerical experiments suggest little difference in the performance between GP1 and GP2 in the estimation of treatment take-up probability. We therefore use GP1 in our simulations due to its lower computation cost. Since the same procedure of treatment take-up is used in all three DGP’s under the fuzzy design, the estimation results are averaged across these experiments. The results are reported in Table A.2. The GP estimator outperforms the LL estimators in terms of precision, coverage probability and interval length.

Table A.2 SRDP estimation results on treatment take-up probability
$N=300$ $N=500$ LL1 LL2 GP1 LL1 LL2 GP1 $|\textnormal{Bias}|$ 0.131 0.153 0.056 0.105 0.124 0.047 RMSE 0.173 0.201 0.070 0.136 0.161 0.059 Coverage 0.827 0.817 0.947 0.880 0.873 0.960 IL 0.567 0.676 0.298 0.457 0.543 0.259

D. Covariate-adjusted estimation

We outline in this section the procedure to implement covariate-adjusted GP regression under sharp RD design. To incorporate covariate $z$ , we generalize the ‘unadjusted’ model as follows:

y_{i}=f(x_{i})+f_{z}(z_{i})+u_{i},i=1,\dots,N

To ease notation, we drop the subscript $j=0,1$ for the control and treatment groups in this section, with the understanding that this estimator is to be applied to both groups. Following the suggestion of Calonico et al. (2019), we enter the covariate additively in a linear-in-parameter manner. This is achieved via a polynomial covariance function for covariate $z$ , yielding $f_{z}(z)=\bm{h}^{T}_{z}(z)\beta_{z}$ , where $\bm{h}_{z}$ is a collection of polynomial basis functions and $\beta_{z}$ is a vector of compatible dimension. Assuming a Gaussian prior $\beta_{z}\sim\mathcal{N}(0,\Sigma_{z})$ , we have $f_{z}(z)\sim\mathcal{N}(0,\bm{h}_{z}^{T}(z)\Sigma_{z}\bm{h}_{z}(z))$ . It follows that $y|x,z\sim\mathcal{N}(0,k(x,x^{\prime};\theta_{x})+\bm{h}_{z}^{T}(z)\Sigma_{z}\bm{h}_{z}(z^{\prime})+\sigma^{2})$ , where $k$ is a generic covariance function with parameters $\theta_{x}$ . Let $\bm{H}_{z}(\bm{z})$ be the collection of $\bm{h}_{z}$ and $\bm{K}_{\bm{x},\bm{x}}=k(\bm{x},\bm{x};\theta_{x})$ given a sample $\mathcal{D}=(\bm{y},\bm{x},\bm{z})$ . Define $\bm{K}_{f}=\bm{K}_{\bm{x},\bm{x}}+\sigma^{2}\mathcal{I}_{N}$ and $\bar{\beta}_{z}=\left\{\Sigma_{z}^{-1}+\bm{H}_{z}^{T}\bm{K}_{f}^{-1}\bm{H}_{z}\right\}^{-1}\left\{\bm{H}^{T}_{z}\bm{K}_{f}^{-1}\bm{y}\right\}$ . We can show that $f_{*}|\mathcal{D},x_{*},\theta_{x},\Sigma_{z}\sim\mathcal{N}(m_{*},v_{*})$ with

	$\displaystyle m_{*}$	$\displaystyle=\bm{K}_{x_{*},\bm{x}}\bm{K}_{f}^{-1}(\bm{y}-\bm{H}_{z}(\bm{z})\bar{\beta}_{z})$
	$\displaystyle v_{*}$	$\displaystyle=K_{x_{},x_{}}-\bm{K}_{x_{},\bm{x}}\bm{K}_{f}^{-1}\bm{K}_{\bm{x},x_{}}$

Comparison of these quantities with their ‘unadjusted’ counterparts (4) suggests that the influence of covariate $z$ has been profiled out in the predictive mean calculation. Although the formula for the predictive variance remains the same, note that $\theta_{x}$ , the parameters for the covariance function, are generally altered with the incorporation of covariate $z$ . Therefore, the predictive variance is also adjusted accordingly.

We apply this estimator to the control and treatment groups alike in our covariate-adjusted RD estimation. Tuning parameters for the running variable $x$ and covariate $z$ are learned jointly during the estimation process. Once the potential influence of the covariate is profiled out, the same produces for the ‘unadjusted’ estimators are used for the estimation and inference of covariate-adjusted RD effect.

References

Arai and Ichimura (2016) Arai, Y. and Ichimura, H. (2016), “Optimal bandwidth selection for the fuzzy regression discontinuity estimator,” Economics Letters, 141, 103–106.
Branson et al. (2019) Branson, Z., Rischard, M., Bornn, L., and Miratrix, L. W. (2019), “A nonparametric Bayesian methodology for regression discontinuity designs,” Journal of Statistical Planning and Inference, 202, 14–30.
Calonico et al. (2019) Calonico, S., Cattaneo, M. D., Farrell, M. H., and Titiunik, R. (2019), “Regression discontinuity designs using covariates,” Review of Economics and Statistics, 101, 442–451.
Calonico et al. (2014) Calonico, S., Cattaneo, M. D., and Titiunik, R. (2014), “Robust nonparametric confidence intervals for regression-discontinuity designs,” Econometrica, 82, 2295–2326.
Card et al. (2015) Card, D., Lee, D. S., Pei, Z., and Weber, A. (2015), “Inference on causal effects in a generalized regression kink design,” Econometrica, 83, 2453–2483.
Cattaneo and Escanciano (2017) Cattaneo, M. D. and Escanciano, J. C. (2017), Regression Discontinuity Designs: Theory and Applications (Advances in Econometrics, Volume 38), Emerald Group Publishing.
Cattaneo et al. (2020) Cattaneo, M. D., Idrobo, N., and Titiunik, R. (2020), “A practical introduction to regression discontinuity designs: foundations,” Cambridge Elements: Quantitative and Computational Methods for Social Science.
Cattaneo et al. (forthcoming) — (forthcoming), “A practical introduction to regression discontinuity designs: extensions,” Cambridge Elements: Quantitative and Computational Methods for Social Science.
Cattaneo et al. (2021) Cattaneo, M. D., Keele, L., and Titiunik, R. (2021), “Covariate Adjustment in Regression Discontinuity Designs,” arXiv preprint arXiv:2110.08410.
Cattaneo and Titiunik (2022) Cattaneo, M. D. and Titiunik, R. (2022), “Regression Discontinuity Designs,” Annual Review of Economics, 14.
Chib et al. (2020) Chib, S., Greenberg, E., and Simoni, A. (2020), “Nonparametric bayes analysis of the sharp and fuzzy regression discontinuity designs,” Draft Manuscript.
Chib and Jacobi (2016) Chib, S. and Jacobi, L. (2016), “Bayesian fuzzy regression discontinuity analysis and returns to compulsory schooling,” Journal of Applied Econometrics, 31, 1026–1047.
Choi and Schervish (2007) Choi, T. and Schervish, M. J. (2007), “On posterior consistency in nonparametric regression problems,” Journal of Multivariate Analysis, 98, 1969–1987.
Gelman et al. (2013) Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013), Bayesian data analysis, CRC press.
Ghosal and Roy (2006) Ghosal, S. and Roy, A. (2006), “Posterior consistency of Gaussian process prior for nonparametric binary regression,” The Annals of Statistics, 34, 2413–2429.
Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., and Courville, A. (2016), Deep Learning, MIT Press, http://www.deeplearningbook.org.
Hahn et al. (2001) Hahn, J., Todd, P., and Van der Klaauw, W. (2001), “Identification and estimation of treatment effects with a regression-discontinuity design,” Econometrica, 69, 201–209.
Imbens and Kalyanaraman (2012) Imbens, G. and Kalyanaraman, K. (2012), “Optimal bandwidth choice for the regression discontinuity estimator,” The Review of economic studies, 79, 933–959.
Imbens and Lemieux (2008) Imbens, G. W. and Lemieux, T. (2008), “Regression discontinuity designs: A guide to practice,” Journal of econometrics, 142, 615–635.
Lee (2008) Lee, D. S. (2008), “Randomized experiments from non-random selection in US House elections,” Journal of Econometrics, 142, 675–697.
Lee and Lemieux (2010) Lee, D. S. and Lemieux, T. (2010), “Regression discontinuity designs in economics,” Journal of economic literature, 48, 281–355.
Ludwig and Miller (2007) Ludwig, J. and Miller, D. L. (2007), “Does Head Start improve children’s life chances? Evidence from a regression discontinuity design,” The Quarterly journal of economics, 122, 159–208.
Murphy (2012) Murphy, K. P. (2012), Machine learning: a probabilistic perspective, MIT press.
Neal (2011) Neal, R. M. (2011), “MCMC using Hamiltonian dynamics,” Handbook of markov chain monte carlo, 2, 2.
Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. (2006), Gaussian processes for machine learning, MIT press Cambridge.
Scott and Wu (1981) Scott, A. and Wu, C.-F. (1981), “On the asymptotic distribution of ratio and regression estimators,” Journal of the American Statistical Association, 76, 98–102.
Shao and Tu (2012) Shao, J. and Tu, D. (2012), The jackknife and bootstrap, Springer Science & Business Media.
Thistlethwaite and Campbell (1960) Thistlethwaite, D. L. and Campbell, D. T. (1960), “Regression-discontinuity analysis: An alternative to the ex post facto experiment.” Journal of Educational psychology, 51, 309.
Van der Klaauw (2002) Van der Klaauw, W. (2002), “Estimating the effect of financial aid offers on college enrollment: A regression–discontinuity approach,” International Economic Review, 43, 1249–1287.
van der Vaart and van Zanten (2008) van der Vaart, A. W. and van Zanten, J. H. (2008), “Rates of contraction of posterior distributions based on Gaussian process priors,” The Annals of Statistics, 36, 1435–1463.
van der Vaart and van Zanten (2009) — (2009), “Adaptive Bayesian estimation using a Gaussian random field with inverse gamma bandwidth,” The Annals of Statistics, 2655–2675.

$\displaystyle\mathbbm{E}[f_{}\|\mathcal{D},x_{},\theta]$	$\displaystyle=\bm{K}_{x_{*},\bm{x}}\bm{K}_{\bm{x},\bm{x}}^{-1}\mathbbm{E}[\bm{f}\|\mathcal{D},\theta]$
$\displaystyle\mathbbm{V}[f_{}\|\mathcal{D},x_{},\theta]$	$\displaystyle=\left(K_{x_{},x_{}}-\bm{K}_{x_{},\bm{x}}\bm{K}_{\bm{x},\bm{x}}^{-1}\bm{K}_{\bm{x},x_{}}\right)+\bm{K}_{x_{},\bm{x}}\bm{K}_{\bm{x},\bm{x}}^{-1}\mathbbm{V}[\bm{f}\|\mathcal{D},\theta]\bm{K}_{\bm{x},\bm{x}}^{-1}\bm{K}_{\bm{x},x_{}}$	(6)
	$\displaystyle=K_{x_{},x_{}}-\bm{K}_{x_{},\bm{x}}(\bm{K}_{\bm{x},\bm{x}}^{-1}-\bm{K}_{\bm{x},\bm{x}}^{-1}\mathbbm{V}[\bm{f}\|\mathcal{D},\theta]\bm{K}_{\bm{x},\bm{x}}^{-1})\bm{K}_{\bm{x},x_{}}$