This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

\newfloatcommand

capbtabboxtable[][\FBwidth]

Caliper Synthetic Matching: Generalized Radius Matching with Local Synthetic Controls

Jonathan Che Department of Statistics, Harvard University    Xiang Meng 11footnotemark: 1    Luke Miratrix Harvard Graduate School of Education
Abstract

Matching promises transparent causal inferences for observational data, making it an intuitive approach for many applications. In practice, however, standard matching methods often perform poorly compared to modern approaches such as response-surface modeling and optimizing balancing weights. We propose Caliper Synthetic Matching (CSM) to address these challenges while preserving simple and transparent matches and match diagnostics. CSM extends Coarsened Exact Matching (CEM; Iacus et al.,, 2012) by incorporating general distance metrics, adaptive calipers, and locally constructed synthetic controls. We show that CSM can be viewed as a monotonic imbalance bounding (MIB; Iacus et al.,, 2011) matching method, so that it inherits the usual bounds on imbalance and bias enjoyed by MIB methods. We further provide a bound on a measure of joint covariate imbalance. Using a simulation study, we illustrate how CSM can even outperform modern matching methods in certain settings, and finally illustrate its use in an empirical example. Overall, we find CSM allows for many of the benefits of matching while avoiding some of the costs.

1 Introduction

The “Jovem de Futuro” (Young of the Future) program of Brazil aimed to improve education quality by providing strategies and, for those who made certain targets, monetary support to schools in Rio de Janeiro and Sao Paulo, Brazil. Education reform efforts like this are common, and when they occur various stakeholders are typically anxious to understand if such a large commitment of resources actually improved circumstances on the ground.

In the case of Jovem de Futuro, researchers ensured the ability to answer this question by taking a step beyond what is typically seen in practice: they recruited schools and randomized them into treatment and control, thus enabling direct and effective estimation of impacts by comparing two groups that were otherwise similar, save for treatment (Barros et al.,, 2012). But getting agencies to implement a randomized controlled trial is often a hard sell: RCTs require supports beyond those of installing a program in the first place. When, as is common in practice, there is no randomization to ensure one has a collection of comparable schools, researchers will instead turn to observational study methods to, in effect, construct a comparison group that can serve as a counterfactual for those schools who took treatment.

Matching provides a simple approach for such an endeavor. In a basic setting, matching methods pair each treated unit with a similar control unit, producing a matched control sample that mirrors the treated sample in terms of observable covariates. Under standard assumptions, the samples may then be analyzed as if treatment were randomly assigned. The simplicity of this approach has led matching to be a popular method for observational causal inference (Imbens,, 2004; Imbens and Rubin,, 2015).

For conceptual clarity, the gold standard of matching methods is exact matching. For each treated unit, exact matching finds a control unit with the same observed covariates. We then end up with a control group that looks, in terms of the covariates, exactly like the treatment group, except for receipt of treatment. In other words, exact matching perfectly balances the joint covariate distribution between the treated and matched control units, eliminating any potential bias due to associations between observable covariates and potential outcomes (Imai et al.,, 2008; Rosenbaum and Rubin, 1985a, ). Exact matching also produces transparent matched datasets, since the difference in outcomes between a treated unit and its exactly matched control is an unbiased (albeit noisy) estimate of the treatment effect for that unit. This leads to the familiar statistical idea of averaging noisy observations to estimate a target estimand, e.g., the average treatment effect among all (or some subset) of the treated units.

In practice, however, exact matching is usually impossible. Researchers have therefore developed a variety of methods for conducting principled causal inference without exact joint covariate balance. For example, many matching approaches aim to balance the marginal distributions. Researchers construct a matched dataset, check the marginal means of the matched sets, and then iteratively tweak and refine their model if the means are too different. Direct balancing approaches (e.g., Hainmueller,, 2012; Zubizarreta,, 2015; Ben-Michael et al., 2021a, ) improve on this ad-hoc procedure by directly targeting approximate balance on specified features of the marginal, or sometimes even joint, distribution. Alternatively, one can use dimension reduction (e.g., Aikens et al.,, 2020), to make it easier to match “locally.” All of these forms of matching can be viewed as a preprocessing step (Ho et al.,, 2007) done before applying an outcome model, e.g., weighted linear regression, to the matched dataset; the final model can further adjust for remaining covariate imbalances. Ideally the matching reduces the impact of model selection on the final estimates.

Matching and outcome modeling can also be done concurrently. Semiparametric modeling approaches, such as doubly robust (Robins et al.,, 1994; Rotnitzky et al.,, 1998), double machine learning (Chernozhukov et al.,, 2018), or targeted maximum likelihood estimation (Van Der Laan and Rubin,, 2006), use model-assisted averages to target the estimand of interest, leading to provably efficient and unbiased estimates. One can also simply directly model the outcome (e.g., Hill,, 2011), to similar effect.

The methods discussed above generally yield effective causal estimates by aiming for overarching objectives across all units, such as achieving marginal covariate balance or ensuring good model fit. Even if they attain marginal covariate balance, however, they can fail to achieve joint balance, and this can lead to a poor result, as highlighted in Iacus et al., (2011). Furthermore, such methods may not be particularly transparent, or can heavily depend on model specification, as noted in Iacus et al., (2012). Exact matching, by contrast, targets joint balance, reduces model dependency and, ideally, increases transparency. In this paper, we build on a body of literature that focuses on this original spirit of exact matching. We provide four opinionated maxims for methods in this literature:

  1. 1.

    Distances should be intuitive

  2. 2.

    Matches should be local

  3. 3.

    Match each unit as best you can in a way that you can monitor

  4. 4.

    Estimates should be transparent

Given our maxims, we then directly construct Caliper Synthetic Matching (CSM) to effectively and clearly address each of these ideas.

The CSM method is a member of the Monotonic Imbalance Bounding (MIB; Iacus et al.,, 2011) class. MIB is a category of methods ensuring joint balance. When introducing Monotonic Imbalance Bounding (MIB), the authors indicated that while distance matching with a single scalar caliper does not constitute MIB, employing a distinct caliper for each covariate does. This paper explores this idea further by formalizing the scaled distance approach and examining its properties, specifically through a scaled distance metric. The scaled distance allows for specific control over each covariate, mitigating issues such as sensitivity to outliers inherent in similar methods such as Mahalanobis distance.

We also connect CSM to Coarsened Exact Matching (CEM; Iacus et al.,, 2012). CEM, a method within the MIB class, coarsens continuous covariates 111E.g., dividing an age covariate into ranges such as 0-25, 25-50, 50-75, 75+ before exactly matching on these coarsened covariates. We show how CSM keeps the MIB properties of CEM, but gives stricter bounds on worst-case bias. Within this framework we also provide a novel theoretical result on the control of joint balance, along with an alternate derivation of bias control than in the original literature.

Another innovation of this paper is we, post matching, apply a variant of the synthetic control method (SCM; Abadie et al.,, 2010) to each treated unit in turn to refine the weights given to the control units matched to each treated unit. The SCM method’s goal is to create a single synthetic control unit for each treated unit that mirrors the treated unit’s characteristics and outcomes. The synthetic control step further minimizes covariate imbalance over the matching step with respect to a scaled distance metric. We show, via a Taylor expansion in a space defined by the used distance metric, how this leads to removing a linear bias term in the estimated impact in a manner akin to linear interpolation. Our use of SCM diverges from standard SCM by allowing for a scaled distance metric, not necessarily using historical outcomes, and employing a predetermined VV matrix based on covariate-wise calipers. We argue synthetic controls maintain the transparency of matching by enabling direct evaluation of counterfactual plausibility.

Our matching approach allows for a simple form of inference that accounts for the reuse of controls for different treated units. Uncertainty estimation in matching methods is difficult, as underscored by Abadie and Imbens (Abadie and Imbens,, 2006, 2008, 2011), who illustrated the limitations of the bootstrap method and developed an asymptotically valid approach for the MM-nearest neighbor estimator. We follow Ben-Michael et al., (2022) (also see Keele et al., (2023) and Lu et al., (2023)), using the resultant weights of the final matched control units along with a plug-in variance formula. In particular, we use the residuals within matched clusters to estimate residual uncertainty, which then provides a variance estimator that takes into account the overall effective sample size imposed by the final control weights.

The rest of the paper proceeds as follows. In Section 2, we provide background and motivate CSM with a toy example. In Section 3, we construct CSM in a stepwise manner to address the four maxims proposed above. We then discuss the properties of CSM in Section 4, and provide our theoretical results on bias control. In Section 5, we conduct a variety of simulation studies to illustrate how CSM performs compared to other observational causal inference methods. Finally, in Section 6 we work through the applied example of the Jovem de Futuro program, analyzing it as a within study comparison to demonstrate how CSM may be used in practice, and to assess its performance on real data.

2 Background

2.1 Setup

Suppose we have nn independent and identically distributed observations, with nTn_{T} treated units and nCn_{C} control units. For each unit ii, let Zi{0,1}Z_{i}\in\{0,1\} denote its binary treatment status, YiY_{i}\in\mathbb{R} denote its observed real-valued outcome, and 𝐗i{X1i,,Xpi}Tp\mathbf{X}_{i}\equiv\{X_{1i},\dots,X_{pi}\}^{T}\in\mathbb{R}^{p} denote its pp-dimensional real-valued covariate vector. We use the potential outcomes framework and denote the observed outcome for unit ii as Yi(1Zi)Yi(0)+ZiYi(1)Y_{i}\equiv(1-Z_{i})Y_{i}(0)+Z_{i}Y_{i}(1), for potential outcomes Yi(1)Y_{i}(1) and Yi(0)Y_{i}(0) under the stable unit treatment value assumption (Imbens and Rubin,, 2015). We make the standard conditional ignorability assumption:

(Y(1),Y(0))Z𝐗,\displaystyle(Y(1),Y(0))\perp\!\!\!\perp Z\mid\mathbf{X},

so that conditioning on the observed covariates is sufficient to identify the causal effect of ZZ. Under a population sampling framework, we write ϵiYi(Zi)fZi(𝐗)\epsilon_{i}\equiv Y_{i}(Z_{i})-f_{Z_{i}}(\mathbf{X}), where f0(𝐗)E[Y(0)|𝐗]f_{0}(\mathbf{X})\equiv E[Y(0)|\mathbf{X}] and f1(𝐗)E[Y(1)|𝐗]f_{1}(\mathbf{X})\equiv E[Y(1)|\mathbf{X}] are the true conditional expectation functions of the potential outcomes under control and treatment, respectively. We write the set of all treated units’ indices as 𝒯={i:Zi=1}\mathcal{T}=\{i:Z_{i}=1\}, the set of all control units’ indices as 𝒞={i:Zi=0}\mathcal{C}=\{i:Z_{i}=0\}, and t𝒯t\in\mathcal{T}, j𝒞j\in\mathcal{C} as individual treated and control units respectively. We denote the set of the indices of the control units matched to a treated unit t𝒯t\in\mathcal{T} as 𝒞t={j𝒞: unit j is matched to unit t}\mathcal{C}_{t}=\{j\in\mathcal{C}:\text{ unit }j\text{ is matched to unit }t\}. Finally, we denote the size of a set 𝒮\mathcal{S} as |𝒮||\mathcal{S}|.

In this paper, we will focus on estimating the sample average treatment effect on the treated (SATT):

τ=1nTt𝒯(Yt(1)Yt(0)),\displaystyle\tau=\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\left(Y_{t}(1)-Y_{t}(0)\right),

where nT=|𝒯|n_{T}=|\mathcal{T}| denotes the number of treated units.

Under a population sampling framework, the SATT approaches the overall population average treatment effect on the treated (PATT) as the number of treated units increases. While matching methods can easily be extended to estimate sample and population average treatment effects (SATEs and PATEs), we focus on the SATT to clarify key ideas and simplify exposition.

2.2 Motivation: the spirit of exact matching

To motivate the importance of locality and joint balance, we provide a pair of toy examples. Figure 1 plots the covariates X1X_{1} and X2X_{2} of control units cic_{i} and treated units tjt_{j} for two scenarios. The colors and contours visualize f0()f_{0}(\cdot), which takes on greater values within the innermost contour. To conduct causal inference, we use the observed outcomes of the control units cic_{i} to impute the unobserved counterfactual outcomes of the treated units tjt_{j}.

Refer to caption
(a) Toy Example 1
Refer to caption
(b) Toy Example 2
Figure 1: Toy examples demonstrating utility of locality and joint balance.

For Example 1 (Figure 1(a)), accurate causal inference is not possible due to a lack of overlap: we cannot accurately impute outcomes for t1t_{1} and t2t_{2} because we do not observe any nearby evaluations of f0()f_{0}(\cdot). Many causal inference methods, however, could fail to detect this problem. For example, a matching or balancing approach that works with marginal distributions could exactly balance both X1X_{1} and X2X_{2} by assigning weights of 1 to all four units. Similarly, an outcome model may fit a flat response surface to the observed outcomes of c1c_{1} and c2c_{2}. Because the usual marginal balance checks and outcome models would appear good, the analyst would be left unaware that these analyses significantly underestimate counterfactual outcomes, i.e., overestimate the SATT. Approximate exact matching approaches would also fail here, as there are no control units close to the treated units. The failure to find local matches, however, would alert the analyst to the presence of a problem. Therefore, it is important to have a metric on how good controls each treatment can obtain. In our method, we use nearest neighbor distance as this quality metric.

Example 2 (Figure 1(b)) highlights the role of transparent local matches. Because f0()f_{0}(\cdot) is smooth, using only the outcomes of local control units (i.e., controls c3c_{3} and c4c_{4} within the dotted circles) leads to accurate counterfactual estimates. Local matches also produce transparent estimates; unlike using a black-box outcome model, it is immediately clear how each control unit contributes to each treated unit’s counterfactual estimate. Finally, local matches encourage joint balance. In Example 2, if we assigned weights of 1 to c1c_{1} and c2c_{2}, and 0 to c3c_{3} and c4c_{4}, we would have perfect marginal balance, but poor local match quality and a poor counterfactual. It is worth sacrificing some marginal balance to upweight the closer units c3c_{3} and c4c_{4}, since doing so greatly improves joint balance and the resulting treatment-effect estimates.

Having illustrated the importance of using distance matching to achieve local matches, we now turn to the selection of the distance metric. The Mahalanobis distance has been commonly used in the literature, but it has drawbacks, including sensitivity to outliers and the curse of dimensionality. Outliers can distort the covariance matrix upon which Mahalanobis distance relies, leading to poor match quality. Moreover, with an increase in the number of covariates, the effectiveness of Mahalanobis matching may diminish due to the curse of dimensionality, where units in high-dimensional spaces appear far apart, complicating the search for suitable matches.

To partially address these issues, we propose a more flexible metric, the scaled distance, which permits analysts to specify the desired tightness of matching for each covariate in turn. Regions of low overlap can then be handled via an adaptive caliper method as detailed in Section 3.3. The challenge of dimensionality is also reduced, given prior knowledge on the relative importance of the covariates, as covariates of interest can be scaled to achieve higher level of control; see Section 3.1.

Lastly, while these toy examples demonstrate the benefits of adhering to the ”spirit of exact matching” by finding local matches to enhance causal estimates and identify potential biases, it is important to acknowledge that in some, possibly even many, situations, focusing on joint balance in this manner may be unnecessary or even detrimental. For example, if the conditional expectation function is additive in its covariates, i.e., f0(𝐗)=j=1pgj(Xj)f_{0}(\mathbf{X})=\sum_{j=1}^{p}g_{j}(X_{j}) for functions gj():g_{j}(\cdot):\mathbb{R}\to\mathbb{R}, marginal balance suffices (Zubizarreta,, 2015). In such settings, attempting to balance the full joint covariate distribution is excessively conservative, as it protects estimates from bias that does not exist, namely, bias due to imbalances in covariate interactions.

2.3 Related work

Matching methods have a long history in observational causal inference (Stuart,, 2010). Rather than attempt an exhaustive review, we briefly trace how these methods have operationalized the spirit of exact matching over time and provide further details in Section 3 as we develop the method introduced in this paper.

Early work in matching incorporated locality via nearest-neighbor (Rubin,, 1973), caliper (Cochran and Rubin,, 1973; Rosenbaum and Rubin, 1985b, ), and radius matching (Dehejia and Wahba,, 2002) approaches. These methods were typically combined with dimension reduction, e.g., via propensity scores (Rosenbaum and Rubin,, 1983), to circumvent the challenge of near-exact matching with multiple covariates, though some approaches, e.g., Mahalanobis distance matching (Rubin and Thomas,, 2000) and genetic matching (Diamond and Sekhon,, 2013), directly operated on a scaled version of the original covariate space. Also see Aikens et al., (2020), who use set-aside data to fit a prognostic score, and then match on that score. Regardless, to evaluate and improve the quality of a matched dataset, researchers would typically conduct iterative balance checks, revising their matching scheme if it led to poor marginal mean balance.

To circumvent the need for these iterative balance checks, Iacus et al., (2012) introduced Coarsened Exact Matching (CEM), which we discuss in further detail in Section 3.2. CEM fixes a user-specified level of joint balance, operationalized by a given degree of coarsening, and returns the resulting sample. By making local matches a primary rather than a secondary criterion, CEM enjoys the desirable transparency and joint balance properties of exact matching.

More recently, researchers have developed matching methods that flexibly learn notions of locality and similarity rather than using user-specified defaults. Genetic matching (Diamond and Sekhon,, 2013) learns to more precisely match covariates that appear to be important for overall covariate balance, while “almost-exact” matching methods for discrete (Dieng et al.,, 2019; Wang et al.,, 2021) and continuous (Morucci et al.,, 2020; Parikh et al.,, 2022) data aim to more precisely (or exactly) match covariates that are more predictive of the potential outcomes. Matching After Learning to Stretch (Parikh et al.,, 2022) minimizes predictive error over a hold-out training set to identify a similarity metric, and synthetic controls (Abadie et al.,, 2010) minimize the resulting predictive error with respect to a set of held-out pre-treatment outcome variables to obtain optimal variable importance weights in a time-series setting.

Work outside of matching has also noted the importance of locality in observational causal inference. For example, Abadie and L’hour, (2021) augments the popular synthetic controls methodology with a penalty for using control units far from the treated unit. In a similar setting, Ben-Michael et al., 2021b tunes the extent to which synthetic controls may extrapolate away from the control units. More generally, Kellogg et al., (2021) explicitly trades off the bias from extrapolating beyond local matches with the bias from linearly interpolating between distant units. While these approaches do not attempt to directly emulate exact matching, they highlight the use of local data for principled causal inference.

In another direction, one can directly optimize weights with respect to overall balance criterion, rather than engage in the iterative cycle of matching with a subsequent balance check on the result. For three examples of many, see stable balance weights (SBW; Zubizarreta,, 2015), entropy balance weights (Hainmueller,, 2012), or covariate balancing propentiy scores (Imai and Ratkovic,, 2014).

3 Caliper Synthetic Matching

Stuart, (2010) decomposes matching analyses into two phases. In the design phase, researchers select a distance measure, use it to run a matching method, and diagnose the quality of the resulting matches. In the subsequent analysis phase, researchers use the matched units to estimate treatment effects.

In this section, we propose a matching method that satisfies our set of aesthetic maxims delineated above. We construct our method in a modular fashion; at each stage, we increase the complexity of the method to improve its use of the data. While we believe that the final proposed method simultaneously maximizes transparency and performance, in practice researchers may make different choices at each stage to limit complexity as necessary.

3.1 Principle 1: Distances should be intuitive

In the absence of exact matches, matching algorithms find control units as “close” as possible to their treated counterparts to improve joint covariate balance and reduce potential bias (Rosenbaum and Rubin, 1985a, ). One popular distance measure for “closeness” is propensity score distance:

d(e)(𝐗i,𝐗j)=|e(𝐗i)e(𝐗j)|,\displaystyle d^{(e)}(\mathbf{X}_{i},\mathbf{X}_{j})=|e(\mathbf{X}_{i})-e(\mathbf{X}_{j})|, (1)

where the propensity score e()=P(Zi=1𝐗i)e(\cdot)=P(Z_{i}=1\mid\mathbf{X}_{i}) represents the probability that a unit is treated, given its covariates (Rosenbaum and Rubin,, 1983).

While matching units with similar propensity scores leads to principled causal estimates, it does not construct intuitive matched sets (King and Nielsen,, 2019). Propensity score distance is not formally a distance metric on p\mathbb{R}^{p},222For example, d(e)(𝐗i,𝐗j)=0d^{(e)}(\mathbf{X}_{i},\mathbf{X}_{j})=0 does not imply that 𝐗i=𝐗j\mathbf{X}_{i}=\mathbf{X}_{j}. For a simple introduction to distance metrics on p\mathbb{R}^{p}, see Appendix B.1. so it can violate our natural understanding of “closeness.” For example, two units that are “close” in terms of propensity score distance, which simply means they have similar chances of getting treatment, may have very different covariate values. If a researcher matches two such units, it can be unclear whether they should trust this match, fit a better propensity-score model, or find a closer propensity-score match.

Formal distance metrics provide a more natural approach for assessing similarity between units. Researchers are generally familiar with the covariates in their data, so directly attaching a distance metric to the space of covariates builds upon existing intuitions. Multidimensional distance metrics typically take the form of a scaled Euclidean (i.e., L2L_{2}) distance metric:

dV(2)(𝐗i,𝐗j)=(𝐗i𝐗j)T(VTV)(𝐗i𝐗j)\displaystyle d^{(2)}_{V}(\mathbf{X}_{i},\mathbf{X}_{j})=\sqrt{(\mathbf{X}_{i}-\mathbf{X}_{j})^{T}(V^{T}V)(\mathbf{X}_{i}-\mathbf{X}_{j})} (2)

or scaled LL_{\infty} distance metric:

dV()(𝐗i,𝐗j)=supk=1,,p|V(𝐗i𝐗j)|k,\displaystyle d^{(\infty)}_{V}(\mathbf{X}_{i},\mathbf{X}_{j})=\sup_{k=1,\dots,p}|V(\mathbf{X}_{i}-\mathbf{X}_{j})|_{k}, (3)

for a given p×pp\times p symmetric positive definite matrix VV. The matrix VV scales the raw differences in the covariates and their two-way interactions. For example, if V11V_{11} were large, then the resulting distance metric would magnify, i.e., upweight, differences in the first covariate. We consider both scaled L2L_{2} and scaled LL_{\infty} distance metrics in this paper.

A variety of scaled distance metrics have been proposed in the literature. One popular scaled L2L_{2} distance metric is Mahalanobis distance, which uses VTV=Σ1V^{T}V=\Sigma^{-1}, the inverse covariance matrix estimated from the control group (Rubin,, 1980). Other L2L_{2} approaches, such as genetic matching (Diamond and Sekhon,, 2013) restrict the scale matrix VV to be diagonal and directly optimize it.

In this paper, we set VV to a diagonal matrix determined by a covariate-wise caliper specified by a researcher, as formalized in Proposition B.3. A covariate-wise caliper, πk\pi_{k}, represents the maximum allowable difference in the kk-th covariate for matching purposes. When the LL_{\infty} metric dVd_{V}^{\infty} is set to cc, the kk-th covariate difference between the two units is at most cπkc\pi_{k}. The is equivalent to scaling each covariate kk by Vkk=1/πkV_{kk}=1/\pi_{k} and then matching using the LL_{\infty} distance with threshold cc. Generally, we advocate setting π\pi so that any pair of units with distance bounded by c=1c=1 is considered to be a “reasonably similar” comparison.

Using dVd_{V}^{\infty} with a covariate-wise caliper is strongly connected to Coarsened Exact Matching (CEM) (Iacus et al.,, 2012), as the distance metric provides researchers with the desired degree of covariate-wise control. In CEM, continuous variables are initially coarsened into discrete bins; for example, an age variable may be segmented into bins of 0-25, 25-50, 50-75, and 75-100 years. Observations are then precisely matched based on these coarsened covariates. By setting a covariate-wise caliper to 12.5, we allow a treated unit to be matched with a control unit if their age difference is no more than 12.5 years. This approach generalizes the concept of coarsening, as coarsening, given the above bins, would not match a 26-year-old with a 24-year-old, but our method could. Further details are discussed in Section 4.5.

3.2 Principle 2: Matches should be local

Section 2.2 argued the importance of local matching. Here we discuss ways of achieving local matching by mainly revisiting the literature.

Given a chosen measure of “closeness,” there are many ways to select the closest matched units. One popular approach is nearest-neighbors matching, where each treated unit is matched with the control unit(s) closest to it. This can be done either greedily for each treated unit (Rubin,, 1973) or optimally over all treated units (Rosenbaum,, 1989). Nearest-neighbors matching can be conducted after some preprocessing steps, such as learning an optimal distance metric (Diamond and Sekhon,, 2013; Parikh et al.,, 2022).

While nearest-neighbors approaches are intuitive, they can quietly fail to achieve covariate balance. While nearest-neighbors approaches guarantee that each treated unit is matched with its closest control, the closest control may still be quite far off. Large distances between treated units and their matched controls, i.e., low match quality, can lead to poor joint covariate balance.

To combat this problem, many methods apply calipers coupled with nearest-neighbor matching. Calipers are a distance cc beyond which matches are forbidden. Calipers modify the distance metric as:

d(𝐗i,𝐗j)={d(𝐗i,𝐗j)if d(𝐗i,𝐗j)cif d(𝐗i,𝐗j)>c\displaystyle d(\mathbf{X}_{i},\mathbf{X}_{j})=\begin{cases}d(\mathbf{X}_{i},\mathbf{X}_{j})&\text{if }d(\mathbf{X}_{i},\mathbf{X}_{j})\leq c\\ \infty&\text{if }d(\mathbf{X}_{i},\mathbf{X}_{j})>c\end{cases}

Using calipers, nearest-neighbor approaches can avoid problems associated with poor match quality, using the closest matches only if they are “close enough.” The simplest form of nearest neighbor matching does not, however, take full advantage of data rich areas of the sample: if there are many controls, using all of them rather than the strictly closest could improve precision.

Calipers are frequently applied to the propensity score, rather than to covariate distance. Unfortunately, units that are local with respect to assignment probability may not be actually local with respect to their covariates; see King and Nielsen, (2019).

The distance metric and the caliper are quite connected; in particular, if we scale our distance metric, we can simply scale the caliper by the same amount. This means c=1c=1 indicates a caliper of one unit on the distance metric, which under the infinity-norm means no covariate differs by more than one “width” as defined by π\pi. Thus, without loss of generality, we generally assume an initial value of c=1c=1 hereafter, unless otherwise noted.

In this paper, we directly use all control units within a given caliper of each treated unit, which maximizes the number of local matches used to produce causal estimates. We allow matching with replacement, meaning a control unit close to multiple treatment units will match to all of then. This matching approach is known as radius matching (Dehejia and Wahba,, 2002) with replacement, though previous proposals mostly use radius matching on propensity score distances. We relegate discussion of caliper selection to Appendix A.1.

3.3 Principle 3: Match each unit as best you can in a way that you can monitor

Matching with a fixed caliper could lead to a substantial fraction of the treated units being dropped, which can significantly change the target estimand. In particular, dropping difficult-to-match treated units, while potentially improving the quality of the resulting estimate, changes the estimand from the SATT to the feasible sample average treatment effect (FSATT):

τ=1||tYt(1)Yt(0),\tau_{\mathcal{F}}=\frac{1}{|\mathcal{F}|}\sum_{t\in\mathcal{F}}Y_{t}(1)-Y_{t}(0),

where \mathcal{F} denotes the set of indices of treated units with at least one control unit within cc dV()d_{V}^{(\infty)} units, i.e., ={t𝒯:j𝒞 with dV()(𝐗t,𝐗j)c}\mathcal{F}=\{t\in\mathcal{T}:\exists\ j\in\mathcal{C}\text{ with }d_{V}^{(\infty)}(\mathbf{X}_{t},\mathbf{X}_{j})\leq c\}. The FSATT can differ from the SATT if the excluded units have systematically different treatment impacts than the kept units.

Instead of shifting the estimand based on a selected caliper, we propose instead assigning each treated unit tt an adaptive caliper ctc_{t}, with

ct=max{c,αdt} with dtminj:Zj=0d(𝐗t,𝐗j),c_{t}=\max\{c,\alpha d_{t}\}\mbox{ with }d_{t}\equiv\min_{j:Z_{j}=0}d(\mathbf{X}_{t},\mathbf{X}_{j}), (4)

where cc is our global minimum caliper that we default to 1 given our distance metric, dtd_{t} is the distance between unit tt and its nearest control-unit neighbor (Dehejia and Wahba,, 2002), and α1\alpha\geq 1 is an inflation factor that allows for catching all control units that are similarly close to the closest unit.

The adaptive caliper guarantees all treated units are matched to at least one control. The floor cc value allows for taking advantage of data-rich environments; for treated units with many controls, we will not shrink to an overly small caliper but instead keep all those controls measured as “reasonably similar” as a comparison group.

In data-rich contexts, the adaptive caliper may also be selected so that the resulting matched sets work well with synthetic controls (introduced below), e.g., we can set ctc_{t} to be the smallest caliper such that treated unit tt has p+1p+1 within-caliper controls, where pp is the dimension of the covariate space.

In principle, with adaptive calipers our matched dataset allows for direct estimate of the SATT as all treated units are preserved. In practice, some treated units may have very poor matches, which would be indicated by large ctc_{t} values. Because we have a direct measure of match quality via ctc_{t}, we can monitor the impact of these poor matches on our overall estimand, and possibly deliberately trim based on our diagnostics.

An important step in any matching procedure is to assess the resulting matches. Classic approaches would be to conduct balance checks by comparing marginal covariate distributions between the treated and control groups (usually people will look at standardized mean differences). Such marginal balance checks may reveal significant departures from joint balance, but cannot confirm when joint balance is approximately achieved, as demonstrated in Toy Examples 1 and 2. Checking low-dimensional summaries of joint balance can also fail to assess overlap or identify subsets of the treated units for which it may be easier or more difficult to estimate treatment effects.

Using a distance-metric caliper, on the other hand, directly ensures good covariate balance (e.g., see Proposition 4.2), if the caliper is fixed. We can extend this idea to using the unit-specific caliper values ctc_{t} across all treated units to assess the estimate-estimand tradeoff, creating balance-sample-size frontier plots (King et al.,, 2017) to show how dropping poorly matched treated units (i.e., those with large ctc_{t}) affects both potential bias and the SATT estimate for the remaining sample. Also see Aikens et al., (2020); Aikens and Baiocchi, (2023) for other approaches to making diagnostic plots. We can also summarize characteristics of units with different ctc_{t} values to better understand regions with poorer overlap.

3.4 Principle 4: Estimates should be transparent

Given matched units from the design phase of a matching analysis, the final step is to produce an estimate. With high-quality matches, the SATT may be, in principle, estimated with a simple average:

τ^avg=1nTt𝒯(Yt1|𝒞t|j𝒞tYj).\hat{\tau}^{avg}=\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\left(Y_{t}-\frac{1}{|\mathcal{C}_{t}|}\sum_{j\in\mathcal{C}_{t}}Y_{j}\right).

This is quite clear, but when matching is imperfect, it can be much lower performing than other methods that do further adjustment.

In this paper, instead of using the average weight 1|𝒞t|\frac{1}{|\mathcal{C}_{t}|} for each matched control, we calculate weights with the synthetic control method (SCM; Abadie et al.,, 2010) within the matched sets. For each treated unit tt, we find convex weights, i.e., weights wjt0w_{jt}\geq 0 that sum to 1, for its matched control units that minimize covariate imbalance as measured by a scaled distance metric dV(,)=dV(2)(,)d_{V}(\cdot,\cdot)=d_{V}^{(2)}(\cdot,\cdot) or dV()(,)d_{V}^{(\infty)}(\cdot,\cdot):

𝐰t:=argmin{wjt:j𝒞t}\displaystyle\mathbf{w}_{t}:=\operatorname*{arg\,min}_{\{w_{jt}:j\in\mathcal{C}_{t}\}} dV(𝐗t,j𝒞twjt𝐗j)\displaystyle\hskip 5.69054ptd_{V}(\mathbf{X}_{t},\sum_{j\in\mathcal{C}_{t}}w_{jt}\mathbf{X}_{j})
s.t. j𝒞twjt=1\displaystyle\sum_{j\in\mathcal{C}_{t}}w_{jt}=1
0wjt1 for j𝒞t,\displaystyle 0\leq w_{jt}\leq 1\text{ for }j\in\mathcal{C}_{t},

where we have written the weight for control unit jj associated with treated unit tt as wjtw_{jt}. The “synthetic control” unit for treated unit tt gets covariates j𝒞twjt𝐗j\sum_{j\in\mathcal{C}_{t}}w_{jt}\mathbf{X}_{j} and outcome j𝒞twjtYj\sum_{j\in\mathcal{C}_{t}}w_{jt}Y_{j}, and the ATT estimate is taken as a simple difference-in-means between the outcomes of the treated units and their synthetic controls:

τ^tSC=1nTt𝒯(Ytj𝒞twjtYj).\displaystyle\hat{\tau}_{t}^{SC}=\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\big{(}Y_{t}-\sum_{j\in\mathcal{C}_{t}}w_{jt}Y_{j}\big{)}.

See Appendix B.6 for further detail.

Our approach deviates from the standard SCM setup in a few ways. First, while Abadie et al., (2010) introduces SCM as a quadratic programming problem using scaled L2L_{2} distance, we also allow SCM to be implemented as a linear programming problem using scaled LL_{\infty} distance. Second, while synthetic controls are typically used in time-series settings with past outcomes as additional covariates, we directly apply them in our setting without focus on past outcomes. Finally, the SCM introduced in Abadie et al., (2010) includes an outer optimization to learn an “optimal” scaling matrix VV, whereas we simply use the VV matrix implied by the given covariate-wise caliper. This follows other approaches to the SCM. See, e.g., Ben-Michael et al., 2021b .

Using synthetic controls in the analysis phase provides two primary benefits. First, synthetic controls are highly transparent. Because synthetic controls explicitly produce a counterfactual for each treated unit, researchers can directly check whether each counterfactual seems reasonable.

Second, as we prove, synthetic controls naturally reduce bias within calipers as they are mathematically equivalent to linear interpolation. While linear interpolation over long distances can lead to bias, interpolating over short distances, such as within a caliper, typically improves results due to the local linearity of smooth outcome functions.

3.5 The full method

To recap, we propose matching by first selecting an interpretable distance metric that can be used to assess similarity between units based on their covariate profiles. This metric should be interpretable and justifiable by those using it.

Once a metric is in place, identify sets of local controls for each treated unit, possibly adaptively adjusting the caliper (radius) defining locality unit by unit. Finally, build synthetic comparison units by using a variant of the synthetic control procedure for each unit to obtain further reductions in bias. The final estimate is simply the average of the pairwise differences between treated and synthetic control. Finally, run some diagnostics, examining measures such as how the estimated impact shifts with the trimming of hard-to-match units.

We denote the use of synthetic controls within adaptive LL_{\infty} calipers as Caliper Synthetic Matching (CSM). We generally use a scaled LL_{\infty} distance for its simplicity, its connections to exact matching, and its connections to CEM. Adapting the caliper enables clear diagnostic plots of the estimate-estimand tradeoff and the synthetic controls produce interpretable local bias corrections.

4 Theoretical Properties

4.1 Monotonic Imbalance Bounding

Iacus et al., (2011) introduces the Monotonic Imbalance Bounding (MIB) class of matching methods. MIB matching methods directly control covariate balance between the treated and matched control groups, independently for each covariate. As a result, MIB matching methods enjoy desirable properties such as bounded covariate imbalance and bounded estimation error, under reasonable assumptions.

Although matching using Mahalanobis distance alone is not MIB, matching using a distance-metric caliper is MIB as long as the caliper for each covariate may be tuned without affecting the caliper for any other covariate (Iacus et al.,, 2011). Any such covariate-wise caliper can be satisfied by a scalar caliper on an appropriately scaled distance metric. For example, if p=2p=2 and we want to ensure |Xt1Xj1|2|X_{t1}-X_{j1}|\leq 2 and |Xt2Xj2|5|X_{t2}-X_{j2}|\leq 5, we may define V=[120015]V=\begin{bmatrix}\frac{1}{2}&0\\ 0&\frac{1}{5}\end{bmatrix} so that restricting dV()(𝐗t,𝐗j)1d^{(\infty)}_{V}(\mathbf{X}_{t},\mathbf{X}_{j})\leq 1 satisfies the desired caliper. By the triangle inequality, the bound on the distance translates to bounds on the individual covariates. This idea is formalized by Proposition 4.1, which shows that matching using a distance-metric caliper is a member of the MIB class.

Denote the weighted mean for covariate kk for the treated and control units after matching as X¯Tk𝐰\bar{X}^{\mathbf{w}}_{Tk} and X¯Ck𝐰\bar{X}^{\mathbf{w}}_{Ck}, respectively, where the weight is obtained from matching. More detailed definitions can be found in Appendix B.3. We then have:

Proposition 4.1.

Given covariatewise caliper 𝛑+p\bm{\pi}\in\mathbb{R}^{p}_{+} and scaling matrix V=diag{1𝛑}V=diag\{\frac{1}{\bm{\pi}}\}, we have

  1. (a)

    dV(2)(𝐗t,𝐗j)1d^{(2)}_{V}(\mathbf{X}_{t},\mathbf{X}_{j})\leq 1 for all tT,j𝒞tt\in\mathcal{M}_{T},j\in\mathcal{C}_{t} |X¯Tk𝐰X¯Ck𝐰|πk for k=1,,p\implies|\bar{X}^{\mathbf{w}}_{Tk}-\bar{X}^{\mathbf{w}}_{Ck}|\leq\pi_{k}\text{ for }k=1,\dots,p

  2. (b)

    dV()(𝐗t,𝐗j)1d^{(\infty)}_{V}(\mathbf{X}_{t},\mathbf{X}_{j})\leq 1 for all tT,j𝒞tt\in\mathcal{M}_{T},j\in\mathcal{C}_{t} |X¯Tk𝐰X¯Ck𝐰|πk for k=1,,p\implies|\bar{X}^{\mathbf{w}}_{Tk}-\bar{X}^{\mathbf{w}}_{Ck}|\leq\pi_{k}\text{ for }k=1,\dots,p

Proof.

See Appendix B.3. ∎

Proposition 4.1 not only shows that changing πk\pi_{k} for one variable does not affect the imbalance on the other variables, but also shows that covariate balance is controlled by VV. Again, note that the choice to bound dV(2)(𝐗t,𝐗j)d^{(2)}_{V}(\mathbf{X}_{t},\mathbf{X}_{j}) and d()(𝐗t,𝐗j)d^{(\infty)}(\mathbf{X}_{t},\mathbf{X}_{j}) by 1 is arbitrary, e.g., bounding dV()(𝐗t,𝐗j)1d^{(\infty)}_{V}(\mathbf{X}_{t},\mathbf{X}_{j})\leq 1 is equivalent to bounding dV()(𝐗t,𝐗j)cd^{(\infty)}_{V}(\mathbf{X}_{t},\mathbf{X}_{j})\leq c for c>0c>0 and V=diag{cπ}V=diag\{\frac{c}{\pi}\}.

With adaptive calipers, CSM is MIB for the feasible treated units within \mathcal{F} but dV(𝐗t,𝐗j)1d_{V}(\mathbf{X}_{t},\mathbf{X}_{j})\leq 1 does not hold for other treated units, and thus the bound does not apply. That said, the adaptive calipers still provide transparency about the extent to which CSM leaves the MIB class in terms of number of units and the amount they deviate as recorded by their caliper sizes. In the subsequent subsections, we focus on cases when all treated units have controls within a set caliper and thus are feasible, i.e., T=\mathcal{M}_{T}=\mathcal{F}, to focus on several properties of the MIB matching methods.

4.2 Bounded joint covariate imbalance

Having shown the bounded marginal balance, we now show how distance-metric caliper matching methods control joint covariate imbalance. Write the empirical joint covariate distributions of the treated and control units as:

fT(𝐱)\displaystyle f_{T}(\mathbf{x}) 1nTt𝒯δ𝐗t(𝐱)\displaystyle\equiv\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\delta_{\mathbf{X}_{t}}(\mathbf{x})
fC(𝐱)\displaystyle f_{C}(\mathbf{x}) 1nTt𝒯j𝒞twjtδ𝐗j(𝐱),\displaystyle\equiv\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\sum_{j\in\mathcal{C}_{t}}w_{jt}\delta_{\mathbf{X}_{j}}(\mathbf{x}),

where δ𝐗()\delta_{\mathbf{X}}(\cdot) represents a Dirac delta function at 𝐗\mathbf{X}, i.e., δ𝐗(𝐲)=1\delta_{\mathbf{X}}(\mathbf{y})=1 if 𝐲=𝐗\mathbf{y}=\mathbf{X} and 0 otherwise. These empirical distributions are simply weighted sums of point masses located at the units’ values.

To demonstrate how distance-metric calipers control the difference between fTf_{T} and fCf_{C}, we use Wasserstein distance. Formally, the qq-Wasserstein distance between probability distributions PP and QQ with distance metric d(,)d(\cdot,\cdot) is:

𝒲q(P,Q)=infπΠ(P,Q)Eπ[d(𝐗,𝐘)q]1/q,\displaystyle\mathcal{W}_{q}(P,Q)=\inf_{\pi\in\Pi(P,Q)}E_{\pi}\big{[}d(\mathbf{X},\mathbf{Y})^{q}\big{]}^{1/q},

where the infimum is over all couplings (joint distributions) π\pi that have marginal distributions of PP and QQ (Pi(P,Q)Pi(P,Q)). More intuitively, Wasserstein distance is also known as “earth-mover’s distance”: viewing PP and QQ as piles of soil each with total mass 1, Wq(P,Q)W_{q}(P,Q) measures the minimum “cost” to move soil to make PP distribute as QQ (or vice-versa), where the “cost” of moving kk units of soil from 𝐱\mathbf{x} to 𝐲\mathbf{y} is kd(𝐱,𝐲)k\cdot d(\mathbf{x},\mathbf{y}). Wasserstein distance therefore uses a distance metric between points in p\mathbb{R}^{p} to measure distance between full probability distributions on p\mathbb{R}^{p}.

With this notation in hand, Proposition 4.2 shows that radius matching bounds the Wasserstein distance between fTf_{T} and fCf_{C}.

Proposition 4.2.

For c>0c>0, and a matching method, we have, for both =2\ell=2 and \infty:

For all t𝒯,j𝒞t,dV(l)(𝐗t,𝐗j)c𝒲q(l)(fT,fC)c\displaystyle\text{For all }t\in\mathcal{T},j\in\mathcal{C}_{t},d^{(l)}_{V}(\mathbf{X}_{t},\mathbf{X}_{j})\leq c\implies\mathcal{W}^{(l)}_{q}(f_{T},f_{C})\leq c
Proof.

See Appendix B.4. ∎

We make a few remarks about Proposition 4.2. First, while the notation is technical, the intuition is straightforward. Distance-metric calipers control how far each treated unit’s covariates can be from its matched controls’ covariates. Since fTf_{T} and fCf_{C} are weighted sums of the point masses associated with these covariates, the calipers must also control the distance between fTf_{T} and fCf_{C}. For an exact caliper c=0c=0, Proposition 4.2 simply states that exact matching guarantees that fTf_{T} coincides with fCf_{C}. In practice, however, reducing the caliper size to c=0c=0 drops all of the treated units, leaving the proposition to be vacuously true; cc must be chosen with the estimate-estimand tradeoff in mind. To the best of our knowledge, this is the first theoretical result for a matching method’s bound on joint balance. Iacus et al., (2011) did show that CEM improved joint balance, but only through simulation.

The Wasserstein distance depends on the chosen distance metric. In the proposition, 𝒲q(l)(fT,fC)\mathcal{W}^{(l)}_{q}(f_{T},f_{C}) is dependent on dVd_{V}. Thus, if our VV changes, the Wasserstein distance also changes. However, the core idea of the proof remains intact: the bounds on the covariates due to matching ensure a joint balance between the covariates of the matched controls and treatments, thereby providing local control and helping to avoid those adverse situations discussed in Section 2.2.

Control over joint covariate imbalance naturally implies control over marginal imbalances. For example, Proposition 4.3 shows how distance-metric calipers also bound the distance between the (pp-dimensional) empirical weighted covariate means of the treated and matched control units, respectively denoted 𝐗¯T\bar{\mathbf{X}}_{T} and 𝐗¯C\bar{\mathbf{X}}_{C}.

Proposition 4.3.

For c>0c>0 and a matching method, we have, for both =\ell= 2 and \infty:

For all tT,j𝒞t,dV(l)(𝐗t,𝐗j)cdV(𝐗¯T,𝐗¯C)c\displaystyle\text{For all }t\in\mathcal{M}_{T},j\in\mathcal{C}_{t},d^{(l)}_{V}(\mathbf{X}_{t},\mathbf{X}_{j})\leq c\implies d_{V}^{\ell}(\bar{\mathbf{X}}_{T},\bar{\mathbf{X}}_{C})\leq c
Proof.

See Appendix B.5. ∎

The above proposition is in effect a joint version of Proposition (b), which gives bounds on the covariates in turn.

Lastly, we note that other distance metrics between PP and QQ could be chosen. For instance, Iacus et al., (2011) uses the L1L_{1} distance, demonstrating that Coarsened Exact Matching reduces the joint imbalance between the control and treatment groups using an empirical example. We use L1L_{1} distance for its computational simplicity and use the Wasserstein distance because its properties facilitate easier proof construction, as it inherently utilizes a distance metric.

In summary, distance-metric calipers enable precise control of joint covariate imbalance. While these bounds are not necessarily small in practice, they guarantee that observed imbalance cannot be too great, even in the worst case. This leads to a variety of desirable properties which we illustrate in the following sections.

4.3 Bounded bias

Because MIB matching methods bound the distance between the covariates of matched units, they naturally bound the distance between smooth functions f:pf:\mathbb{R}^{p}\to\mathbb{R} of those covariates as well. Write the control potential outcome for unit ii as Yi(0)=f0(𝐗i)+ϵiY_{i}(0)=f_{0}(\mathbf{X}_{i})+\epsilon_{i}. Then assuming that f0()f_{0}(\cdot) is smooth (i.e., Lipschitz) immediately bounds the bias of any FSATT estimate produced by a method using a distance-metric calipers. (See Appendix B.2 for technical details about Lipschitz functions in p\mathbb{R}^{p}.)

Proposition 4.4.

Suppose f0:pf_{0}:\mathbb{R}^{p}\to\mathbb{R} is Lipschitz(λ)(\lambda) with respect to dV(,)d_{V}(\cdot,\cdot) = dV(2)(,)d^{(2)}_{V}(\cdot,\cdot) or dV()(,)d^{(\infty)}_{V}(\cdot,\cdot). Then for a matching procedure such that for all tt, dV(𝐗t,𝐗j)cd_{V}(\mathbf{X}_{t},\mathbf{X}_{j})\leq c for all j𝒞tj\in\mathcal{C}_{t}:

|E[ττ^]|λc.\big{|}E[\tau-\hat{\tau}]\big{|}\leq\lambda c.
Proof.
|E\displaystyle\big{|}E [ττ^]|\displaystyle[\tau-\hat{\tau}]\big{|}
=|1nTt𝒯(j𝒞twjt(f0(𝐗t)f0(𝐗j)))|\displaystyle=\bigg{|}\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\Big{(}\sum_{j\in\mathcal{C}_{t}}w_{jt}\big{(}f_{0}(\mathbf{X}_{t})-f_{0}(\mathbf{X}_{j})\big{)}\Big{)}\bigg{|}
|1nTt𝒯(j𝒞twjtλd(𝐗j,𝐗t))|\displaystyle\leq\bigg{|}\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\Big{(}\sum_{j\in\mathcal{C}_{t}}w_{jt}\lambda d(\mathbf{X}_{j},\mathbf{X}_{t})\Big{)}\bigg{|}
λc.\displaystyle\leq\lambda c.

Proposition 4.4 states that for distance-metric caliper matching methods, worst-case bias is proportional to the caliper size cc.333Proposition 4.4 applies to a slightly different set of methods than Proposition 1 from Iacus et al., (2011), which proves a similar bias bound for MIB matching methods (see Appendix B.3). As in Proposition 4.2, setting c=0c=0 shows that exact matching leads to unbiased estimates, but in practice we must use c>0c>0 to navigate the estimate-estimand tradeoff without dropping all of the treated units.

Of course, we rarely know the Lipschitz constant λ\lambda in practice, so Proposition 4.4 does not generate empirical bias bounds. Nonetheless, Proposition 4.4 illustrates how distance-metric calipers control bias. If each treated unit is close to all of its matched controls in covariate space, its expected counterfactual outcome must be close to their expected outcomes, for reasonable (i.e., smooth) outcome functions. As a result, any weighted average of the control units’ outcomes cannot differ too much in expectation from the treated unit’s true counterfactual outcome, regardless of the specific form of f0()f_{0}(\cdot).

In practice, we can ideally get estimates that beat this bound by assuming local linearity and adjusting further for residual imbalance. We discuss this in the next section.

4.4 Bias reduction from synthetic controls

While using distance-metric calipers bounds bias, using synthetic controls within these calipers actively reduces bias. Specifically, synthetic controls naturally remove linear bias by conducting local linear interpolation. This means that if f0()f_{0}(\cdot) is linear, synthetic controls would completely eliminate bias if the weights achieved perfect balance on XX. For nonlinear f0()f_{0}(\cdot), bias will be reduced to higher-order nonlinear trends, which are, in general, less influential within tight calipers for smooth functions. Proposition 4.5 more precisely shows how exact synthetic controls eliminate linear bias.

Proposition 4.5.

Let dV(,)d_{V}(\cdot,\cdot) = dV(2)(,)d^{(2)}_{V}(\cdot,\cdot) or dV()(,)d^{(\infty)}_{V}(\cdot,\cdot). Suppose f0:pf_{0}:\mathbb{R}^{p}\to\mathbb{R} is differentiable and Lipschitz(λ\lambda) with respect to dVd_{V}. Then for a matching procedure such that for all tt, dV(𝐗t,𝐗j)cd_{V}(\mathbf{X}_{t},\mathbf{X}_{j})\leq c for all j𝒞tj\in\mathcal{C}_{t}, if j𝒞twjt𝐗j=𝐗t\sum_{j\in\mathcal{C}_{t}}w_{jt}\mathbf{X}_{j}=\mathbf{X}_{t} for all t𝒯t\in\mathcal{T} (i.e., we have perfect balance locally for each treated unit):

|E[ττ^]|o(c).\big{|}E[\tau-\hat{\tau}]\big{|}\leq o(c).
Proof.

See Appendix B.8 for a proof based on a Taylor expansion with respect to the given distance metric. The resulting expansion differs from the usual multivariate Taylor expansion, since it uses directional derivatives to better utilize the distance-metric calipers. ∎

In Proposition 4.5, the o(c)o(c) term represents higher-order terms which go to zero more quickly than does the caliper cc as cc shrinks toward zero. Compare to the bias bound of λc\lambda c without the synthetic control step in Propostion 4.4: we have lost the λc\lambda c term, leaving something smaller (asymptotically). In practice, shrinking cc to zero drops all treated units, but the notation highlights how implementing synthetic controls within calipers takes advantage of local linearity. In particular, while linear interpolation across large distances can lead to significant interpolation bias (Kellogg et al.,, 2021), restricting the donor-pool units to be within a caliper distance from the treated unit reduces the impact of nonlinearities.

4.5 Comparing CSM to CEM

As discussed in Section 3.2, radius matching methods have many similarities with coarsened exact matching (CEM) (Iacus et al.,, 2012), which we celebrate by matching two of the three letters. In particular, radius matching and CEM both possesses the imbalance-bounding and bias-bounding guarantees discussed in the previous sections.

To clarify the benefits of radius matching, we consider an example with two covariates X1X_{1} and X2X_{2}, as in Figure 2. The coarsening of each variable is made equal sized. We visualizes the CEM grid defined by the coarsening of the two covariates, along with an equivalently sized LL_{\infty} caliper around treated unit t1t_{1}.

Refer to caption
Figure 2: Comparison of coarsened exact matching with radius matching using LL_{\infty} calipers. Dashed gridlines represent CEM calipers. Shaded box represents scaled LL_{\infty} caliper around point t1t_{1}.

Because t1t_{1} lies near a boundary defined by the covariate coarsening, CEM does not match t1t_{1} to c1c_{1}, while the LL_{\infty} caliper does. On the other hand, CEM matches t1t_{1} to c2c_{2} because they lie in the same caliper grid, even though the two units lie more than one LL_{\infty} unit apart from each other. Figure 2 shows how, given a fixed caliper size, CEM guarantees that treated units lie within two LL_{\infty} units of their matched controls whereas the LL_{\infty} caliper guarantees the distance is no more than one unit. By centering calipers on each treated unit, radius matching matches each treated unit to all nearby control units while guaranteeing imbalance and bias bounds that are twice as tight as those guaranteed by CEM. In other words, when each method is “equally sized” in terms of how large a treated unit’s cachement area for possible control is, radius matching has tighter control of bias.

As a result, we can present the following proposition, which highlights the improved guarantee of bias bounds. Specifically, it improves the upper bound of the bias for CSM compared to CEM.

Proposition 4.6.

Suppose f0:pf_{0}:\mathbb{R}^{p}\to\mathbb{R} is Lipschitz(λ)(\lambda) with respect to dV(,)d_{V}(\cdot,\cdot) = dV(2)(,)d^{(2)}_{V}(\cdot,\cdot) or dV()(,)d^{(\infty)}_{V}(\cdot,\cdot), and 𝛑+p\bm{\pi}\in\mathbb{R}^{p}_{+}. Then CSM with covariate-wise caliper π/2\pi/2 and CEM with covariate-wise coarsening π\pi have the same cachement size for each treated unit, but

|E[ττ^]|{λc for CEMλc/2 for CSM\big{|}E[\tau-\hat{\tau}]\big{|}\leq\begin{cases}\lambda c\text{ for CEM}\\ \lambda c/2\text{ for CSM}\end{cases}

Naturally, there are tradeoffs for these improved bias bounds. Computationally, radius matching requires computing distances between each treated unit and the control units, an operation of order ntncn_{t}n_{c}, unlike CEM which only requires a frequency tabulation of order nt+ncn_{t}+n_{c}. Non-uniform calipers are also slightly easier to implement via covariate coarsenings, though for many common covariates one can directly transform the covariate and use a uniform caliper to replicate a non-uniform caliper.444E.g., rather than non-uniformly coarsening income as { $0-20k, $20k-50k, $50k-100k, $100k+ }, it may be more reasonable to log-transform the income covariate and use a uniform caliper to avoid, e.g., concluding that an individual earning $20k is as far from an individual earning $20.1k as they are from an individual earning $50k. Overall, however, radius matching preserves the transparency and interpretability of CEM while significantly improving on its useful bias and imbalance bounding properties.

We also note that CSM offers the flexibility to select units that are challenging to match effectively, a feature that CEM lacks. In CEM, after specifying the desired coarsening size, an analyst can only determine if a treated unit has found a match by checking if there is a control within the same stratum. Conversely, with the adaptive caliper, we can evaluate the treated units based on the size of their adaptive calipers, which indicate the extent of their matching difficulty. By contrast, CEM allows for a clean inferential strategy by viewing the bins as strata in a post-stratified experiment; with CSM inference is less clear cut, as we will discuss below.

4.6 Bias-variance tradeoff

Reducing caliper sizes reduces the potential bias of the resulting SATT estimate, as shown by Proposition 4.4, but may increase the estimate’s variance due to reducing the number of controls and, without adaptive calipers, the number of treated units. To formalize this idea, we introduce the (conditional) mean-squared error for the SATT τ\tau:

CMSE=E[(τ^τ)2|𝐗,𝐙],\displaystyle CMSE=E[(\hat{\tau}-\tau)^{2}|\mathbf{X},\mathbf{Z}],

where the expectation is conditioned on the observed covariates and treatment assignments. Then, using a working model of the potential outcomes of Yi(z)=f(Xi)+zτi+eiY_{i}(z)=f(X_{i})+z\tau_{i}+e_{i}, where eie_{i} is a 0-centered noise term, we can use standard algebraic manipulation to show that:

τ^τ\displaystyle\hat{\tau}-\tau =Bias+Error\displaystyle=Bias+Error (5)

where

Bias=1nTt𝒯j𝒞twjt(f0(𝐗t)f0(𝐗j))Error=1nTt𝒯(ϵtj𝒞wjtϵj).\begin{split}Bias&=\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\sum_{j\in\mathcal{C}_{t}}w_{jt}\big{(}f_{0}(\mathbf{X}_{t})-f_{0}(\mathbf{X}_{j})\big{)}\\ Error&=\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}(\epsilon_{t}-\sum_{j\in\mathcal{C}}w_{jt}\epsilon_{j}).\end{split} (6)

We then have

CMSE\displaystyle CMSE =Bias2+Var,\displaystyle=Bias^{2}+Var,

where, using our working model,

Var\displaystyle Var =1nT2t𝒯σt2+1nT2j𝒞(t𝒯wjt)2σj2,\displaystyle=\frac{1}{n_{T}^{2}}\sum_{t\in\mathcal{T}}\sigma_{t}^{2}+\frac{1}{n_{T}^{2}}\sum_{j\in\mathcal{C}}(\sum_{t\in\mathcal{T}}w_{jt})^{2}\sigma_{j}^{2}, (7)

with σi\sigma_{i} representing the population sampling variance of unit ii conditional on its covariates 𝐗i\mathbf{X}_{i} (Kallus,, 2020).

If we assume homoskedasticity of the residuals within treatment arms (i.e., σi=σC\sigma_{i}=\sigma_{C} in the control group) and the conditions of Proposition 4.4 with caliper cc, we can bound CMSE as:

CMSE\displaystyle CMSE (λc)2+(σT2nT+j𝒞(t𝒯wjt)2σC2nT2)\displaystyle\leq(\lambda c)^{2}+\Big{(}\frac{\sigma^{2}_{T}}{n_{T}}+\frac{\sum_{j\in\mathcal{C}}(\sum_{t\in\mathcal{T}}w_{jt})^{2}\sigma^{2}_{C}}{n_{T}^{2}}\Big{)}
=(λc)2+σT2ESS(𝒯)+σC2ESS(𝒞),\displaystyle=(\lambda c)^{2}+\frac{\sigma_{T}^{2}}{ESS(\mathcal{T})}+\frac{\sigma^{2}_{C}}{ESS(\mathcal{C})}, (8)

where the effective sample size (ESS) of a set 𝒮\mathcal{S} of units with weights wiw_{i} is

ESS(𝒮)=(i𝒮wi)2i𝒮wi2,\displaystyle ESS(\mathcal{S})=\frac{(\sum_{i\in\mathcal{S}}w_{i})^{2}}{\sum_{i\in\mathcal{S}}w_{i}^{2}}, (9)

with wi=t𝒯witw_{i}=\sum_{t\in\mathcal{T}}w_{it} for i𝒞i\in\mathcal{C}, and wi=1/nTw_{i}=1/n_{T} for i𝒯i\in\mathcal{T}.

Equation 8 clarifies the relationship between caliper size and the bias-variance tradeoff. For a fixed estimand, where we do not drop or add treated units as caliper size changes,increasing cc naturally exposes the resulting estimate to more bias. Increasing cc also generally reduces variance by dispersing weight across more control units, increasing the effective sample size of the control group.

That said, in contexts where |𝒞||𝒯||\mathcal{C}|\gg|\mathcal{T}|, the overall variance of the average impact estimate could be dominated by the variance associated with the treated units (of order 1/ESS(𝒯)1/ESS(\mathcal{T}), which remains unchanged even if more control units are added (which shrinks 1/ESS(𝒞1/ESS(\mathcal{C}. To be more explicit, consider a single treated unit tt with matched controls j𝒞tj\in\mathcal{C}_{t} all given uniform weights. Here the variance associated with the treated unit is σ2/1\sigma^{2}/1 while the variance associated the controls is σ2/|𝒞t|\sigma^{2}/|\mathcal{C}_{t}| (assuming homoskedasticity). Even if 𝒞t\mathcal{C}_{t} were large, we still have the initial σ2\sigma^{2} term. Once 𝒞t\mathcal{C}_{t} reaches 4 or 5, increasing caliper size would have diminishing returns on variance reduction, suggesting that it may typically be better, in terms of CMSE, to use smaller calipers.

On the other hand, when matching with replacement, a single control jj may be assigned significant weight for multiple treated units tt. In these cases, t𝒯wjt\sum_{t\in\mathcal{T}}w_{jt} could be greater than 1 for some control units, driving ESS(𝒞)ESS(\mathcal{C}) down. In other words, if the same controls get reused heavily enough, the variance associated with the control units may not be dominated by the variance associated with the treated units, even if the initial control pool is large.

4.7 Variance Estimation

How to estimate standard errors for our matching method requires special attention. Abadie and Imbens, through a series of papers (2006, 2008, 2011), argued that the bootstrap method was inadequate and developed an asymptotically valid approach for the MM-nearest neighbor estimator as the numbers of treatment and control units increase infinitely, with MM remaining fixed. Otsu and Rai, (2017) introduced a weighted bootstrap methodology that relies on overlap of treated and control covariates.

We instead draw from the survey sampling literature (e.g., Potthoff et al.,, 1992) and plug in effective sample sizes and variance estimates into our Equation 8. This has been successfully applied in various different types of weighting estimators in causal inference (Ben-Michael et al.,, 2022; Lu et al.,, 2023; Keele et al.,, 2023).

Our approach differs from the Abadie and Imbens literature in several aspects: First, we utilize synthetic control weights instead of average MM-nearest-neighbors weights. Second, we condition on the treated units as defined by their covariates, assume a stochastic model for the units’ outcomes, and make a working assumption of homoskedasticity. Third, we operate in a scenario where the number of treatment units (nTn_{T}) is low, while the number of control units (nCn_{C}) is high. As Ferman, (2021) noted, the asymptotics in this setting differ from those Abadie and Imbens described, with nTn_{T} increasing to infinity.

Our variance estimator is the following plug-in estimator:

Var^(τ^)\displaystyle\widehat{Var}(\hat{\tau}) =S2(1nT+1ESS(𝒞))\displaystyle=S^{2}\left(\frac{1}{n_{T}}+\frac{1}{\text{ESS}(\mathcal{C})}\right) (10)

where S2S^{2} is an estimate of the homoskedastic residuals (assumed the same for treatment and control groups), and ESS(𝒞)\text{ESS}(\mathcal{C}) follows 9. In particular, S2S^{2} is a pooled variance estimator, pooling across clusters:

S2=1NCt|𝒞t|st2 with NC=t|𝒞t|.S^{2}=\frac{1}{N_{C}}\sum_{t}|\mathcal{C}_{t}|s^{2}_{t}\text{ with }N_{C}=\sum_{t}|\mathcal{C}_{t}|.

where each cluster’s residual variance is

st2=1|𝒞t|1j𝒞tetj2,s^{2}_{t}=\frac{1}{|\mathcal{C}_{t}|-1}\sum_{j\in\mathcal{C}_{t}}e_{tj}^{2},

with

etj=YtjY¯t, where Y¯t=1|𝒞t|j𝒞tYtj.e_{tj}=Y_{tj}-\bar{Y}_{t},\text{ where }\bar{Y}_{t}=\frac{1}{|\mathcal{C}_{t}|}\sum_{j\in\mathcal{C}_{t}}Y_{tj}.

We drop all clusters 𝒞t\mathcal{C}_{t} with |𝒞t|=1|\mathcal{C}_{t}|=1, including for the calculation of NCN_{C}. Although we could use the weights from the synthetic control to prioritize “good” controls, we opt for uniform weights over all matched units as the key element is control unit similarity, which is not necessarily optimized by the synthetic step.

Our estimates of sj2s_{j}^{2} are inflated by differences in f0(𝐗i)f_{0}(\mathbf{X}_{i}) driven by variation in the 𝐗i\mathbf{X}_{i} within each cluster. In other words, if the control units are widely dispersed and the expected outcome f0(Xi)f_{0}(X_{i}) changes rapidly as 𝐗i\mathbf{X}_{i} changes, then the variation in f0(Xi)f_{0}(X_{i}) will be absorbed by the sj2s_{j}^{2} terms, resulting in overly large standard errors. This phenomenon was observed in our simulations, as discussed in Section 5.4. This inflation is not necessarily bad, however, as we would expect it to grow roughly in proportion to the size of the bias term, which is also driven by these differences (see the f0(Xt)f0(Xi)f_{0}(X_{t})-f_{0}(X_{i}) terms in Equation 6). One can thus view these SEs as predicting overall error (see, e.g., the discussion of standard errors as predicting overall error in Sundberg, (2003) and discussion of expanding of standard errors to include bias in Weidmann and Miratrix, 2021b ).

Our inference goal is the Root Mean Squared Error (RMSE) of our estimate, conditioned on the treatment group’s covariates, represented by E[(τ^τ)2|𝐗𝒯]\sqrt{E[(\hat{\tau}-\tau)^{2}|\mathbf{X}_{\mathcal{T}}]}. Other options are possible, as is well illustrated by the literature. Kallus, (2020) focuses on the conditional standard error, essentially the variance term. Abadie and Imbens, along with Otsu and Rai, (2017), target the unconditional coverage. Ferman, (2021) aims at the unconditional type-I error rate. In particular, it is important to note, as Imbens and Rubin, (2015) mentions, that the conditional and unconditional standard errors differ.

5 Simulation Studies

To understand how CSM performs, we consider a range of simulation studies. First, we examine a simple simulation based on the toy examples in Section 2.2, where local matches tend to be important. We then consider a few canonical simulated datasets taken from the literature to assess general performance. Finally, we assess our inferential method.

5.1 Methods

We compare CSM to a variety of popular matching, balancing, outcome regression, propensity score, and doubly robust methods, as described in Table 1.555We initially also included CEM using synthetic controls within each cell and caliper matching using simple averages within each caliper, but, for clarity, we now exclude these results since their performances typically lie between the performances of CSM and CEM.

\rowcolor[HTML]C0C0C0 Method class Method name Description
Baseline diff Difference-in-means estimator
match-1NN One nearest neighbor matching
Matching match-CEM Coarsened exact matching
or-lm Linear model on all two-way interactions
Outcome model or-BART Bayesian additive regression tree (BART)
ps-lm
Logistic model for propensity score
on all two-way interactions
Propensity score ps-BART BART with binary outcome (probit link)
bal-SBW1
Stable balancing weights (SBW)
on marginal means
Balancing bal-SBW2
SBW on marginal and two-way
interaction means
dr-AIPW1
Augmented inverse propensity weighting
(AIPW) using SuperLearner on
linear models
dr-AIPW2
AIPW using SuperLearner on
machine-learning models
dr-TMLE1
Targeted maximum likelihood estimation
(TMLE) using SuperLearner on
linear models
Doubly robust dr-TMLE2
TMLE using SuperLearner on
machine-learning models
Table 1: Methods used in simulation studies. References for methods are: CEM (Iacus et al.,, 2012), BART (Chipman et al.,, 2010), SBW (Zubizarreta,, 2015), AIPW (Robins et al.,, 1994), SuperLearner (Van Der Laan Mark et al.,, 2007), and TMLE (Van Der Laan and Rubin,, 2006).

We use default settings for all of the algorithms to standardize comparisons. We implement BART, TMLE, and AIPW in R using defaults from the dbarts (Dorie,, 2023), tmle (Gruber and Van Der Laan,, 2012), and AIPW (Yongqi Zhong et al.,, 2021) packages, respectively. For SuperLearner, the linear models include a simple mean, linear regression, and generalized linear regression models; the machine-learning models include the linear models as well as generalized additive models, random forests, BART, and XGBoost (Chen and Guestrin,, 2016).

To standardize comparisons for the matching methods, for each dataset we use the distance metric implied by the covariatewise caliper generated by coarsening each numeric covariate into five equally spaced bins. I.e., we use dV()d_{V}^{(\infty)} and a diagonal VV with entries Vkk=5/(max(Xk)min(Xk))V_{kk}=5/(max(X_{k})-min(X_{k})). We do not tune the covariatewise caliper, assuming zero domain knowledge about variable importance. We conduct CEM using the same uniform coarsening of each covariate into five bins.

5.2 Toy example simulation

To build intuition for situations in which CSM may perform well, we first show results from the simulation based on the toy examples in Section 2.2. We simulate the covariates for 50 treated units each from multivariate normal distributions centered at (0.25,0.25)(0.25,0.25) and (0.75,0.75)(0.75,0.75), and 225 control units each from multivariate normal distributions centered at (0.25,0.75)(0.25,0.75) and (0.25,0.75)(0.25,0.75), all with covariance matrices [0.12000.12]\begin{bmatrix}0.1^{2}&0\\ 0&0.1^{2}\end{bmatrix}. We then add 100 control units distributed uniformly on the unit square to ensure we do have some overlap. We finally generate outcomes for each unit (x1,x2)(x_{1},x_{2}) as:

Y=f0(x1,x2)+Zτ(x1,x2)+ϵ,Y=f_{0}(x_{1},x_{2})+Z\cdot\tau(x_{1},x_{2})+\epsilon,

where f0f_{0} is the density function for a multivariate normal distribution centered at (0.5,0.5)(0.5,0.5) with covariance matrix [10.80.81]\begin{bmatrix}1&0.8\\ 0.8&1\end{bmatrix}, τ(x1,x2)=3x1+3x2\tau(x_{1},x_{2})=3x_{1}+3x_{2} is the true treatment effect, and ϵN(0,0.52)\epsilon\sim N(0,0.5^{2}) is homoskedastic noise. Figure 3 plots a sample simulated dataset.

Refer to caption
Figure 3: Sample simulated toy dataset.

Figure 4 shows the root mean-squared error and absolute bias of the point ATT estimates for the various methods.

Refer to caption
Figure 4: Method results on simulation based on toy example across 250 simulations.

We see that CSM performs well, even compared to complex machine-learning methods.

We emphasize that the goal of this simulation is not to demonstrate CSM’s performance, but rather to illustrate settings in which we may expect good performance. In the toy example, the interaction between the two covariates drives both the control potential outcome function f0(X1,X2)f_{0}(X_{1},X_{2}) and the implicit propensity score. Because the covariates are interacted with each other, joint covariate balance is very important. As a result, those methods that do not target joint balance or otherwise do not have access to the interaction term between X1X_{1} and X2X_{2} perform poorly due to high levels of bias.

5.3 Canonical simulations

The datasets from Kang and Schafer, (2007), Hainmueller, (2012), and the 2016 American Causal Inference Conference competition (Dorie et al.,, 2017) have canonically been used to compare methods for observational causal inference. Table 2 contains basic information about the settings we used for each dataset. Please see the original papers for further details.

\rowcolor[HTML]C0C0C0 Dataset pp ntn_{t} ncn_{c}
Heterogeneous
effects?
Kang & Schafer, (2007) 4 500\approx 500 500\approx 500 No
Hainmueller et al. (2012) 6 50 250 No
ACIC 2016 10 350\approx 350 650\approx 650 Yes
Table 2: Basic descriptions of simulated datasets. We use the Kang and Schafer, (2007) as-is from the original paper; the “high overlap” and “highly nonlinear” outcome model condition for the data from Hainmueller, (2012); and the “step-function” propensity and treatment model conditions for the ACIC 2016 data using 10 numeric covariates.

Figure 5 summarizes the results of our simulations using these datasets and illustrates the tradeoffs made by CSM.

Refer to caption
Figure 5: Method results on canonical simulated datasets across 250 simulations. Results for CSM highlighted in bold font. Four methods truncated at 1 for displaying the results.

We see that CSM tends to outperform the simple matching, balancing, and outcome or propensity-score modeling approaches, but that it is outperformed by the complex machine-learning approaches known to do well on these types of datasets (Dorie et al.,, 2017). CSM is designed to maximize performance while preserving the simple intuitions underlying exact matching. Under such constraints, it is remarkable that CSM performs nearly as well as black-box machine-learning approaches on the Kang and Schafer, (2007) and Hainmueller, (2012) datasets.

The trends in Figure 5 highlight important concepts regarding locality in observational causal inference. As the number of covariates increases, the matching methods tend to deteriorate in performance relative to the complex modeling methods. Caliper-based matching methods only use local information. This behavior protects them against incorrect extrapolation, but also prevents them from correctly extrapolating in cases where overlap is unnecessary.

In many simulated (and real) datasets, extrapolating marginal effects can improve counterfactual predictions. For example, suppose increasing X1X_{1} clearly increases outcomes among controls which have low values of X2X_{2}, and we would like to predict counterfactual outcomes for treated units with high values of X2X_{2}. A linear model fit to the controls may extrapolate and assume that increasing X1X_{1} always increases outcomes regardless of the value of X2X_{2}, while an exact matching estimator would not do so. Increasing the number of covariates exaggerates this behavior, since models extrapolate more marginal trends. Figure 5 shows us that this type of model-based marginal extrapolation improves results for these simulated datasets, since in the considered DGPs there are few high-dimensional interactions to falsify such extrapolation.

Matching aims to produce transparent, model-free causal inferences. As shown by Figure 5, this transparency generally comes with a cost in terms of performance. The simulations show that while avoiding model-based extrapolation controls bias in the worst case (e.g., Proposition 4.4), it can be too conservative in settings where such extrapolation is helpful. Whether these costs are outweighed by the ability to clearly explain how counterfactual predictions are made depends on the particular causal inference setting. Nonetheless, CSM performs competitively in these canonical simulations, making it an attractive option in settings where explainability is paramount.

5.4 Assess the Quality of the Variance Estimator

Refer to caption
Figure 6: Toy example in different degree of overlaps. From left to right: very low, medium, and very high

We next assess the variance estimator described in Section 4.7. We revisit the example from Section 5.2 but vary the degree of overlap. Specifically, we increase the overlap by increasing the proportion of control units distributed uniformly on the unit square and reducing the proportion of controls centered at (0.25, 0.75) and (0.75, 0.25). For the range of overlap, see the three illustrative datasets in Figure 6. These scenarios correspond to the same distribution of treated units, but different pools of potential controls to select matches from. For each degree of overlap, we generate 500 trials.

Our primary estimation target is the SATT.

To capture tuning cc for bias reduction, we use an adaptive strategy for the caliper size, shrinking the caliper to select no more than five units in regions with great density, and expanding it to include at least one unit in regions of low density. With larger calipers, the synthetic control optimization tends to select only a few units of the set when we have a low-dimensional covariate set, and the selected units tend to be more distant from the treated unit than others within the neighborhood defined by the caliper; for future work we could follow Ben-Michael et al., 2021b and regularize the weights to take more advantage of rich areas of control units.

Overlap N~C\tilde{N}_{C} avg SE^\widehat{SE} avg SESE Bias RMSE Coverage
Very Low 23.9 0.12 0.11 0.04 0.12 0.95
Low 39.2 0.10 0.10 0.02 0.10 0.94
Medium 51.5 0.09 0.08 0.01 0.08 0.96
High 61.3 0.08 0.08 0.01 0.08 0.95
Very High 70.2 0.08 0.08 0.01 0.08 0.94
Table 3: Estimated vs. actual standard error for varying degrees of overlap. Columns are average effective sample size of control group (vs. 100 treated units), average estimated standard error, average true standard error, Bias, RMSE, and coverage of the SATT. Monte Carlo SEs are around 0.025 for the true standard error and RMSE, 0.01 for coverage, and <0.001<0.001 for the other measures.

Table 3 presents the results. The column “avg SE^\widehat{SE}” shows the estimated standard error averaged over 500 iterations. The “avg SESE” column shows the true conditional standard error,666This value is calculated as the standard deviation of τ^biasτ\hat{\tau}-bias-\tau across the simulations in order to remove variation due to simulation iterations having different true SATT values. Overall, we observe that our method slightly overestimates the true standard error due to the inflation caused by absorbing variation in f0f_{0}, as discussed in Section 4.7.

As expected, both the true standard error and bias go down as overlap increases, due to the steadily increasing effective sample size and improved ability to find close matches. In particular, when there is low overlap, we re-use many control units, and only at high overlap is our control group getting close to the treatment group in size. Additionally, with high overlap, there are more control units near the concentration of treated units, allowing for closer matches and thus less bias.

Finally, the “Coverage” column shows the coverage of the 95% confidence interval. The CSM method effectively maintains a small degree of bias and gives reasonably estimated standard errors, resulting in CI coverage close to the nominal 95% across all the scenarios.

An alternate form of inference would some form of bootstrap. Unfortunately, we found that bootstrap methods, including the weighted bootstrap in Otsu and Rai, (2017) applied to our context, and the naïve casewise bootstrap, did not work effectively. We leave further development of bootstrap inference to future work.

6 Applied Example: Effect of Improving Education Quality in Schools in Brazil

We illustrate our approach via a within study comparison based on a randomized control trial, “Jovem de Futuro” (Young of the Future), of schools in Brazil, following a study by Barros et al., (2012). Data from this intervention, targeting education quality, was also analyzed as a within study comparison by Ferman, (2021).

In a within study comparison, we try to recover a known impact via our observational study method. Within study comparisons allow for assessment of how observational methods serve “in the wild.” A typical early example of this is the famous LaLonde paper (LaLonde,, 1986), but for a more extended discussion, see Cook et al., (2008). In particular, we take units enrolled in an experimental trial, but who did not receive treatment (the experimental control group) as our “treatment” and match them to a larger pool of units not in the experiment at all; see Weidmann and Miratrix, 2021a for further discussion of this form of within study comparison.

Jovem de Futuro combined two efforts: a) offering strategies and tools for school management boards to improve efficiency; b) offering grant money for improving education quality, conditional on school’s standardized test score passing a certain threshold. The intervention lasted three years, and there were three rounds delivered: 2010-2012 (the 2010 implementation), 2011-2013 (the 2011 implementation), and 2012-2014 (the 2012 implementation). We focus on the 2010 implementation.

The 2010 implementation had 15 treated schools and 15 control schools in Rio de Janeiro, and 39 treated schools and 39 control schools in Sao Paulo. We set Zi=1Z_{i}=1 for the control schools who applied for the intervention program but were assigned to the no-intervention group, and Zi=0Z_{i}=0 for the non-participating schools. Since there is no actual intervention for either the Zi=1Z_{i}=1 and Zi=0Z_{i}=0 schools, we expect the “treatment effect” to be zero, if our methods succeed in removing any confounding selection bias. The sample sizes for Rio de Janeiro and for Sao Paulo are nTRio=15,nCRio=966n_{T}^{Rio}=15,n_{C}^{Rio}=966 and nTSP=39,nCSP=3481n_{T}^{SP}=39,n_{C}^{SP}=3481.

For each school, we have four matching variables: X1,X2X_{1},X_{2}, and X3X_{3} are standardized test scores from 2007 to 2009, and X4X_{4} is an indicator of being in Sao Paolo. Table 4 shows the pre-matching covariate mean difference. Overall, the schools in the RCT scored lower than those not in the RCT.

Our outcome is the test scores in 2010(the outcome following the first year of intervention).

Variable Control (Z = 0) Treated (Z = 1) Difference (treated - control) Score 2007 (X1X_{1}) 4.7 -2.8 -7.5 Score 2008 (X2X_{2}) 0.83 -0.99 -1.8 Score 2009 (X3X_{3}) 2.3 -2.5 -4.8 In Sao Paulo (X4X_{4}) 78% 72% -6%

Table 4: Marginal means of the covariates in Barros et al., (2012) before matching. All variables have been multiplied by 100.

6.1 CSM Procedure

We start our CSM analysis by choosing LL^{\infty} as the distance metric. Next, we set the covariate-wise calipers. For the standardized test scores X1X_{1} to X3X_{3}, we set the covariate-wise calipers to 0.2, meaning we want matched schools to differ at most 0.2 standard deviations in test scores. For X4X_{4} (indicator of being in Sao Paolo), we set the covariate-wise caliper to 1/10001/1000, a small value to ensure the matching of X4X_{4} is exact.

The distance-metric caliper cc is usually set to a default of 1, as done in Proposition 4.1, to give direct interpretation to the covariate-wise calipers 𝝅\bm{\pi}. In practice, however, given 𝝅\bm{\pi} we can tune cc to guarantee each treated unit gets enough but not too many matched controls, thereby optimally trading off data use and potential bias (see Section 6.3). When we set c<1c<1, we obtain tighter matches than initially planned with 𝝅\bm{\pi}. When we set c>1c>1, our matches can be worse than initially planned.

To tune cc, we plot histograms of the distances of the top-1, top-2, and top-3 matches for each treated unit, and look for a natural break. We see that, for most treated units, even the third worst match has a distance of around 0.35 or less. We can thus focus on bias reduction, and set our distance-metric caliper cc to 0.350.35. See Appendix A.1 for further details.

Some of our treated units have no controls within this set minimum caliper. For these units, we use an adaptive caliper, matching the units to their nearest neighbor. We then apply the synthetic control method to obtain weights for each set of matched controls. Finally, we use these weighted controls to construct the point estimate and estimated standard error, using the inferential method outlined in Section 4.7.

6.2 Assessing the Covariate Balance

Our chosen caliper results in 49 (91%) treated units having at least one matched control within a radius of c=0.35c=0.35. These are our “feasible” treated units. Among the 49 feasible treated units, all matched controls are within 0.2 standardized scores for the years 2007, 2008, and 2009. Figure 7 illustrates how the differences in marginal means for three matching variables vary when adding infeasible treated units in order of increasing adaptive caliper size, until all 54 treated units are included.

By Proposition 4.3, the marginal mean difference of each covariate must be within cπkc\cdot\pi_{k}, where πk\pi_{k} is the covariate caliper of the kk-th covariate, and cc is the distance-metric caliper. For X1X_{1} to X3X_{3}, this value is 0.350.2=0.070.35\cdot 0.2=0.07. For X4X_{4}, this value is 0.3511000=3.5×1040.35\cdot\frac{1}{1000}=3.5\times 10^{-4}. The leftmost point in each plot in Figure 7 shows the difference in marginal means between the treated units and their synthetic controls for the 49 feasible treated units, with differences all well below the corresponding cπkc\cdot\pi_{k} values. This result is not surprising: cπkc\cdot\pi_{k} represents a worst-case guarantee, and imbalances within each matched control set can cancel out.

An important use of Figure 7 is to provide a diagnostic on covariate balance. The plot assesses the approximate joint balance through marginal covariate balance while including increasingly infeasible units. While joint and marginal balance is guaranteed for feasible units, it is not guaranteed when more infeasible units are added. In this dataset, the marginal balances remain good even when all infeasible treated units are included.

Refer to caption
Figure 7: Love plot on all four matching variables as increasingly infeasible units are included.

6.3 Assessing the Impact of Synthetic Controls

The synthetic control step after matching ideally improves the comparability of the controls matched to each treated unit, but can also reduce the effective sample size of the control group by downweighting some units. We next assess this bias-variance trade-off, comparing SCM with classic radial matching (where we take the simple average of all close controls as the counterfactual for each treated unit in turn) and one-nearest-neighbor (1-NN) matching. Both can be viewed as weights on the controls:

  1. 1.

    Simple average weight: let wjt=1|𝒞t|w_{jt}=\frac{1}{|\mathcal{C}_{t}|} for each control unit. This is the default weighting method used in the matching literature after matches are made.

  2. 2.

    One-nearest-neighbor777If unit tt has kk unites tied for nearest neighbor, we assign each a weight of wjt=1kw_{jt}=\frac{1}{k}.: let wjt={1if unit j is closest to unit t0otherwise.w_{jt}=\begin{cases}1&\text{if unit }j\text{ is closest to unit }t\\ 0&\text{otherwise.}\end{cases}

    This weight corresponds to 1-to-1 matching.

The results for SCM, average, and 1-NN weights on the feasible treated units are shown on Table 5. We analyze the bias-variance trade-off through two metrics: the mean covariate difference (a proxy for bias, Equation 6) and the effective sample size (ESS, a proxy for variance, Equation 7). The mean covariate difference, shown in the first column of Table 5, is the first-order term in Taylor’s expansion of bias. It is also referred to as the extrapolation bias in Kellogg et al., (2021), representing how far the weighted control is from the treated unit in dVd_{V}^{\infty} in the covariate space. SCM performs better than simple average and 1-NN since it is optimized to minimize this bias.

Under homoskedasticity, the variance of an estimator is inversely associated with ESS, meaning a larger ESS implies smaller variance. ESS will be larger when more controls are used and when weights are more evenly distributed among those selected. Average weights achieve the highest ESS by assigning non-zero weights to all 202 distinct controls within the caliper distance of at least one of the 49 feasible units. In contrast, 1-NN uses only 49 distinct controls, as each treated unit is matched to a unique nearest neighbor. SCM falls between these scenarios, assigning zero weights to some control unit otherwise within caliper distance, but using multiple controls where possible. Overall, SCM performs well in the bias-variance trade-off by achieving the lowest imbalance (bias) and moderate variance.

Mean Individual Median Individual Number of Method Imbalance Imbalance ESS Unique Controls Feasible Treated Group - - 49 - CSM 0.057 0.000 95 155 Average 0.118 0.091 202 580 1-NN 0.158 0.148 49 49

Table 5: Effect of synthetic controls on bias-variance trade-off metrics of the FSATT estimate, comparing to average and 1-NN weighting.

6.4 FSATT and SATT Estimates

We finally present the estimated treatment effect on the feasible units and also show the influence of infeasible treated units on the ATT estimate. Before interpreting the results, recall that, given our within study design, we expect to see no treatment impact. In other words, assuming our estimation is sound, we should expect an “impact estimate” of zero.

Figure 8 presents the main results of the analysis, providing a series of confidence intervals given a steady increase in the number of units kept in the analysis. Each confidence interval is constructed using the method in Section 4.7. The leftmost confidence interval, representing the feasible set, covers zero, indicating no significant effect as desired. As we include more infeasible units, the ATT varies slightly, but is overall stable, and all confidence intervals include 0. This stability aligns with Figure 7, which shows covariate balances only vary slightly as more infeasible units are included.

Refer to caption
Figure 8: Estimate-estimand tradeoff plot. Maximum adaptive caliper sizes range from 0.35 (for the FSATT) to 0.75 (for the full ATT).

The SATT estimates from CSM and the other two methods are presented in Table 6. Using the point estimates and standard errors, all methods conclude no treatment effect. The standard error follows a trend where 1-NN is the largest, Average is the smallest, and CSM is in the middle, consistent with the trend we observed in Section 6.3: CSM uses more data than 1-NN, but sacrifices some precision to ideally reduce bias.

Method Point Est. SE
CSM 0.035 0.039
Average 0.013 0.038
1-NN 0.008 0.053
Table 6: Point estimates and standard errors for all three methods.

7 Discussion

Matching methods enable researchers to simply and intuitively draw causal conclusions from observational data. In practice, however, standard matching approaches typically need to sacrifice either performance or transparency to achieve results comparable to modern optimization and machine-learning methods for causal inference. To address this challenge, this paper introduces Caliper Synthetic Matching. CSM builds on the spirit of exact matching by preserving four key principles: intuitive distances, local matches, ability to monitor matching success, and transparent estimates. We find that CSM can achieve comparable performance to modern methods on standard benchmark datasets, particularly when many covariates exhibit interaction effects.

CSM provides a framework for transparent causal inference, and each stage may be extended in various ways. Future work might reconsider the question of distance-metric and caliper selection, optimizing the distance-metric scaling matrix VV using held-out data, or incorporating estimated covariate densities to optimize caliper lengths (for work in this area, see Parikh et al., (2022)). Dimension reduction on the number of covariates, possibly by using prognostic scores (as done in, e.g., (Aikens et al.,, 2020)), could also help mitigate the curse of dimensionality. Interpretable estimation methods other than standard synthetic controls may also be used, such as the modified synthetic controls approaches proposed by Abadie and L’hour, (2021) and Ben-Michael et al., 2021b . Ultimately, CSM represents a promising addition to the toolkit for researchers conducting causal inference using matching methods.

References

  • Abadie et al., (2010) Abadie, A., Diamond, A., and Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of california’s tobacco control program. Journal of the American statistical Association, 105(490):493–505.
  • Abadie and Imbens, (2006) Abadie, A. and Imbens, G. W. (2006). Large sample properties of matching estimators for average treatment effects. Econometrica, 74(1):235–267.
  • Abadie and Imbens, (2008) Abadie, A. and Imbens, G. W. (2008). On the failure of the bootstrap for matching estimators. Econometrica, 76(6):1537–1557.
  • Abadie and Imbens, (2011) Abadie, A. and Imbens, G. W. (2011). Bias-corrected matching estimators for average treatment effects. Journal of Business & Economic Statistics, 29(1):1–11.
  • Abadie and L’hour, (2021) Abadie, A. and L’hour, J. (2021). A penalized synthetic control estimator for disaggregated data. Journal of the American Statistical Association, 116(536):1817–1834.
  • Aikens and Baiocchi, (2023) Aikens, R. C. and Baiocchi, M. (2023). Assignment-control plots: a visual companion for causal inference study design. The American Statistician, 77(1):72–84.
  • Aikens et al., (2020) Aikens, R. C., Greaves, D., and Baiocchi, M. (2020). A pilot design for observational studies: using abundant data thoughtfully. Statistics in Medicine, 39(30):4821–4840.
  • Barros et al., (2012) Barros, R., Carvalho, M. d., Franco, S., and Rosalém, A. (2012). Impacto do projeto jovem de futuro. Est. Aval. Educ, pages 214–226.
  • (9) Ben-Michael, E., Feller, A., Hirshberg, D. A., and Zubizarreta, J. R. (2021a). The balancing act in causal inference. arXiv preprint arXiv:2110.14831.
  • (10) Ben-Michael, E., Feller, A., and Rothstein, J. (2021b). The augmented synthetic control method. Journal of the American Statistical Association, 116(536):1789–1803.
  • Ben-Michael et al., (2022) Ben-Michael, E., Feller, A., and Rothstein, J. (2022). Synthetic controls with staggered adoption. Journal of the Royal Statistical Society Series B: Statistical Methodology, 84(2):351–381.
  • Boyd et al., (2004) Boyd, S., Boyd, S. P., and Vandenberghe, L. (2004). Convex optimization. Cambridge university press.
  • Chen and Guestrin, (2016) Chen, T. and Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA. ACM.
  • Chernozhukov et al., (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.
  • Chipman et al., (2010) Chipman, H. A., George, E. I., and McCulloch, R. E. (2010). Bart: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1):266–298.
  • Cochran and Rubin, (1973) Cochran, W. G. and Rubin, D. B. (1973). Controlling bias in observational studies: A review. Sankhyā: The Indian Journal of Statistics, Series A, pages 417–446.
  • Cook et al., (2008) Cook, T. D., Shadish, W. R., and Wong, V. C. (2008). Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management: The Journal of the Association for Public Policy Analysis and Management, 27(4):724–750.
  • Dehejia and Wahba, (2002) Dehejia, R. H. and Wahba, S. (2002). Propensity score-matching methods for nonexperimental causal studies. Review of Economics and statistics, 84(1):151–161.
  • Diamond and Sekhon, (2013) Diamond, A. and Sekhon, J. S. (2013). Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics, 95(3):932–945.
  • Dieng et al., (2019) Dieng, A., Liu, Y., Roy, S., Rudin, C., and Volfovsky, A. (2019). Interpretable almost-exact matching for causal inference. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2445–2453. PMLR.
  • Dorie, (2023) Dorie, V. (2023). dbarts: Discrete bayesian additive regression trees sampler. R package version 0.9-23.
  • Dorie et al., (2017) Dorie, V., Hill, J., Shalit, U., Scott, M., and Cervone, D. (2017). Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34.
  • Ferman, (2021) Ferman, B. (2021). Matching estimators with few treated and many control observations. Journal of Econometrics, 225(2):295–307.
  • Gruber and Van Der Laan, (2012) Gruber, S. and Van Der Laan, M. (2012). tmle: an r package for targeted maximum likelihood estimation. Journal of Statistical Software, 51:1–35.
  • Hainmueller, (2012) Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political analysis, 20(1):25–46.
  • Hill, (2011) Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1):217–240.
  • Ho et al., (2007) Ho, D. E., Imai, K., King, G., and Stuart, E. A. (2007). Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political analysis, 15(3):199–236.
  • Iacus et al., (2011) Iacus, S. M., King, G., and Porro, G. (2011). Multivariate matching methods that are monotonic imbalance bounding. Journal of the American Statistical Association, 106(493):345–361.
  • Iacus et al., (2012) Iacus, S. M., King, G., and Porro, G. (2012). Causal inference without balance checking: Coarsened exact matching. Political analysis, 20(1):1–24.
  • Imai et al., (2008) Imai, K., King, G., and Stuart, E. A. (2008). Misunderstandings between experimentalists and observationalists about causal inference. Journal of the royal statistical society: series A (statistics in society), 171(2):481–502.
  • Imai and Ratkovic, (2014) Imai, K. and Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society Series B: Statistical Methodology, 76(1):243–263.
  • Imbens, (2004) Imbens, G. W. (2004). Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and statistics, 86(1):4–29.
  • Imbens and Rubin, (2015) Imbens, G. W. and Rubin, D. B. (2015). Causal inference in statistics, social, and biomedical sciences. Cambridge University Press.
  • Kallus, (2020) Kallus, N. (2020). Generalized optimal matching methods for causal inference. J. Mach. Learn. Res., 21:62–1.
  • Kang and Schafer, (2007) Kang, J. D. Y. and Schafer, J. L. (2007). Demystifying Double Robustness: A Comparison of Alternative Strategies for Estimating a Population Mean from Incomplete Data. Statistical Science, 22(4):523 – 539.
  • Keele et al., (2023) Keele, L. J., Ben-Michael, E., Feller, A., Kelz, R., and Miratrix, L. (2023). Hospital quality risk standardization via approximate balancing weights. The Annals of Applied Statistics, 17(2):901–928.
  • Kellogg et al., (2021) Kellogg, M., Mogstad, M., Pouliot, G. A., and Torgovitsky, A. (2021). Combining matching and synthetic control to tradeoff biases from extrapolation and interpolation. Journal of the American Statistical Association, 116(536):1804–1816.
  • King et al., (2017) King, G., Lucas, C., and Nielsen, R. A. (2017). The balance-sample size frontier in matching methods for causal inference. American Journal of Political Science, 61(2):473–489.
  • King and Nielsen, (2019) King, G. and Nielsen, R. (2019). Why propensity scores should not be used for matching. Political Analysis, 27(4):435–454.
  • LaLonde, (1986) LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American economic review, pages 604–620.
  • Lu et al., (2023) Lu, B., Ben-Michael, E., Feller, A., and Miratrix, L. (2023). Is it who you are or where you are? accounting for compositional differences in cross-site treatment effect variation. Journal of Educational and Behavioral Statistics, 48(4):420–453.
  • Morucci et al., (2020) Morucci, M., Orlandi, V., Roy, S., Rudin, C., and Volfovsky, A. (2020). Adaptive hyper-box matching for interpretable individualized treatment effect estimation. In Conference on Uncertainty in Artificial Intelligence, pages 1089–1098. PMLR.
  • Otsu and Rai, (2017) Otsu, T. and Rai, Y. (2017). Bootstrap inference of matching estimators for average treatment effects. Journal of the American Statistical Association, 112(520):1720–1732.
  • Parikh et al., (2022) Parikh, H., Rudin, C., and Volfovsky, A. (2022). Malts: Matching after learning to stretch. Journal of Machine Learning Research.
  • Potthoff et al., (1992) Potthoff, R. F., Woodbury, M. A., and Manton, K. G. (1992). “equivalent sample size” and “equivalent degrees of freedom” refinements for inference using survey weights under superpopulation models. Journal of the American Statistical Association, 87(418):383–396.
  • Robins et al., (1994) Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association, 89(427):846–866.
  • Rosenbaum, (1989) Rosenbaum, P. R. (1989). Optimal matching for observational studies. Journal of the American Statistical Association, 84(408):1024–1032.
  • Rosenbaum and Rubin, (1983) Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
  • (49) Rosenbaum, P. R. and Rubin, D. B. (1985a). The bias due to incomplete matching. Biometrics, pages 103–116.
  • (50) Rosenbaum, P. R. and Rubin, D. B. (1985b). Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. The American Statistician, 39(1):33–38.
  • Rotnitzky et al., (1998) Rotnitzky, A., Robins, J. M., and Scharfstein, D. O. (1998). Semiparametric regression for repeated outcomes with nonignorable nonresponse. Journal of the american statistical association, 93(444):1321–1339.
  • Rubin, (1973) Rubin, D. B. (1973). Matching to remove bias in observational studies. Biometrics, pages 159–183.
  • Rubin, (1980) Rubin, D. B. (1980). Bias reduction using mahalanobis-metric matching. Biometrics, pages 293–298.
  • Rubin and Thomas, (2000) Rubin, D. B. and Thomas, N. (2000). Combining propensity score matching with additional adjustments for prognostic covariates. Journal of the American Statistical Association, 95(450):573–585.
  • Stuart, (2010) Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical science: a review journal of the Institute of Mathematical Statistics, 25(1):1.
  • Sundberg, (2003) Sundberg, R. (2003). Conditional statistical inference and quantification of relevance. Journal of the Royal Statistical Society Series B: Statistical Methodology, 65(1):299–315.
  • Van Der Laan and Rubin, (2006) Van Der Laan, M. J. and Rubin, D. (2006). Targeted maximum likelihood learning. The international journal of biostatistics, 2(1).
  • Van Der Laan Mark et al., (2007) Van Der Laan Mark, J. et al. (2007). Super learner. Statistical applications in genetics and molecular biology, 6(1):1–23.
  • Wang et al., (2021) Wang, T., Morucci, M., Awan, M. U., Liu, Y., Roy, S., Rudin, C., and Volfovsky, A. (2021). Flame: A fast large-scale almost matching exactly approach to causal inference. The Journal of Machine Learning Research, 22(1):1477–1517.
  • (60) Weidmann, B. and Miratrix, L. (2021a). Lurking inferential monsters? quantifying selection bias in evaluations of school programs. Journal of Policy Analysis and Management, 40(3):964–986.
  • (61) Weidmann, B. and Miratrix, L. (2021b). Missing, presumed different: Quantifying the risk of attrition bias in education evaluations. Journal of the Royal Statistical Society Series A: Statistics in Society, 184(2):732–760.
  • Yongqi Zhong et al., (2021) Yongqi Zhong, Edward H. Kennedy, Lisa M. Bodnar, and Ashley I. Naimi (2021). Aipw: An r package for augmented inverse probability weighted estimation of average causal effects. American Journal of Epidemiology. In Press.
  • Zubizarreta, (2015) Zubizarreta, J. R. (2015). Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, 110(511):910–922.

Appendix A Practical considerations

A.1 How to select a caliper

The choice of caliper size is a notorious practical problem for caliper-based matching methods. Ideally, the researcher would have a covariatewise caliper 𝝅\bm{\pi} in mind, so that the distance-metric caliper can be simply to equal 1, as in Proposition B.3. More generally, given a fixed distance metric, the choice of cc should be chosen based on an a priori desired level of bias control rather than a post hoc assessment based on the observed data. For example, a researcher who wants to ensure that all matches lie within 0.5 standard deviations of each other in each covariate, could restrict dV()(𝐗t,𝐗j)0.5d_{V}^{(\infty)}(\mathbf{X}_{t},\mathbf{X}_{j})\leq 0.5 with V=diag{1sd(Xk),k=1,,p}V=diag\{\frac{1}{sd(X_{k})},k=1,\dots,p\}.

In practice, however, researchers may want to select a distance metric caliper cc that “optimally” trades off data use and potential bias. To visualize this tradeoff, we suggest making a histogram of the distances between the treated and nearby control units in the data: Start by computing the nt×ncn_{t}\times n_{c} distance matrix used to identify the within-caliper control units for each treated unit. Then, plot the smallest kk distances d(𝐗t,𝐗j)d(\mathbf{X}_{t},\mathbf{X}_{j}) for each t𝒞tt\in\mathcal{C}_{t} on a histogram to see how “far” the control units tend to be from the treated units. By doing this, we avoid needing to estimate the pp-dimensional density of the treated and control units, which is generally challenging to do.

In Figure 9, we created three histograms: closest distance, second closest distance, and third closest distance for each t𝒞tt\in\mathcal{C}_{t} in the Brazilian school dataset discussed in Section 6, for units that are exactly matched on the categorical covariates. We see that c=0.35c=0.35, a value slightly above the 75th percentile of the third closest distance, is a good cut-off since most d(𝐗t,𝐗j)d(\mathbf{X}_{t},\mathbf{X}_{j}) lies on the left of 0.350.35. The 0.35 cut-off also captures most of the third closest and second closest distances, indicating we would get at least three units matched for most of our treated units. In general, one could look for peaks around particular values of d(𝐗t,𝐗j)d(\mathbf{X}_{t},\mathbf{X}_{j}), as expanding the fixed caliper to include those peaks can greatly increase the effective sample size of the data.

Refer to caption
Figure 9: Histogram of distances between treated units and control units.

Appendix B Derivations and Technical details

B.1 Distance metrics in multiple dimensions

Formally, a distance metric on p\mathbb{R}^{p} is a (non-negative) function d(,):p×pd(\cdot,\cdot):\mathbb{R}^{p}\times\mathbb{R}^{p}\to\mathbb{R} that satisfies the following for x,y,zpx,y,z\in\mathbb{R}^{p}:

  1. 1.

    d(x,y)=0x=yd(x,y)=0\iff x=y

  2. 2.

    d(x,y)=d(y,x)d(x,y)=d(y,x)

  3. 3.

    d(x,z)d(x,y)+d(y,z)d(x,z)\leq d(x,y)+d(y,z) [Triangle inequality]

These three properties satisfy our intuitive understanding of measuring distance between two points: we measure a distance of zero if and only if the two points concide with each other; the distance is the same regardless of which point we start measuring from; and the distance cannot be made shorter by passing through a third point.

Now consider a scaled L2L_{2} norm on p\mathbb{R}^{p}, which measures the length of a given vector upu\in\mathbb{R}^{p} as:

uV(2)=ut(VTV)u.||u||_{V}^{(2)}=\sqrt{u^{t}(V^{T}V)u}.

We say that the scaled distance metric in Equation 2 is induced by the scaled L2L_{2} norm defined above, since for x,ypx,y\in\mathbb{R}^{p} we can write:

dV(2)(x,y)=xyV(2).d_{V}^{(2)}(x,y)=||x-y||_{V}^{(2)}.

The same relationship holds for the scaled distance metric in Equation 3 and the scaled L2L_{2} norm uV()=max|Vu|.||u||_{V}^{(\infty)}=\max|Vu|.

Induced distance metrics in p\mathbb{R}^{p} possess desirable properties for x,yx,y\in\mathbb{R}:

  • Translation invariance: for cc\in\mathbb{R}, d(x,y)=d(x+c,y+c)d(x,y)=d(x+c,y+c)

  • Absolute homogeneity: for cc\in\mathbb{R}, d(cx,cy)=|c|d(x,y)d(cx,cy)=|c|d(x,y)

Again, these properties are intuitive: moving two points by the same amount does not change the distance between them, and scaling two points scales the distance between them.

In deriving our formal results below, we will need the following simply lemma about our distance metrics:

Lemma B.1.

For any translation-invariant, absolutely homogeneous distance metric d(,)d(\cdot,\cdot) on a metric space, d(𝐚+c𝐯,𝐚)=cd(\mathbf{a}+c\mathbf{v},\mathbf{a})=c for cc\in\mathbb{R}, 𝐚p\mathbf{a}\in\mathbb{R}^{p} and unit vector 𝐯p\mathbf{v}\in\mathbb{R}^{p}.

Proof.
d(𝐚+c𝐯,𝐚)\displaystyle d(\mathbf{a}+c\mathbf{v},\mathbf{a}) =d(c𝐯,0)\displaystyle=d(c\mathbf{v},0) [translation invariance]
=|c|d(𝐯,0)\displaystyle=|c|\cdot d(\mathbf{v},0) [absolute homogeneity]
=c𝐯\displaystyle=c\cdot||\mathbf{v}|| [def. metric-induced norm]
=c\displaystyle=c [unit vector v]\displaystyle\text{[unit vector }v\text{]}

B.2 Lipschitz functions in multiple dimensions

Recall that if a function f:f:\mathbb{R}\to\mathbb{R} is Lipschitz(λ\lambda), then for any x,ax,a\in\mathbb{R}:

|f(x)f(a)|λ|xa|.|f(x)-f(a)|\leq\lambda|x-a|.

This implies that the function’s derivatives are bounded by λ\lambda: we cannot grow faster than λ\lambda as we go from xx to aa.

In higher dimensions, |xa||x-a| is no longer scalar-valued; as a result, Lipschitz functions must be defined with respect to a distance metric. Formally speaking, we equip p\mathbb{R}^{p} with a distance metric d(,)d(\cdot,\cdot) of the form given by Equation 2 or 3. Then f:pf:\mathbb{R}^{p}\to\mathbb{R} is Lipschitz(λ\lambda) with respect to distance d(,)d(\cdot,\cdot) if for any 𝐱,𝐚n\mathbf{x},\mathbf{a}\in\mathbb{R}^{n}:

|f(𝐱)f(𝐚)|λd(𝐱,𝐚).|f(\mathbf{x})-f(\mathbf{a})|\leq\lambda d(\mathbf{x},\mathbf{a}).

Notably, a function can only be Lipschitz relative to a given distance metric, so, e.g., a function that is Lipschitz(λ)(\lambda) with respect to L2L_{2} distance may not be Lipschitz(λ)(\lambda) with respect to LL_{\infty} distance.

Multivariate Lipschitz functions have bounded derivatives like their unidimensional counterparts. In particular, Lemma B.2 shows that the directional derivatives of any Lipschitz function are bounded.

Lemma B.2 (Lipschitz bound on directional derivative).

Suppose f:pf:\mathbb{R}^{p}\to\mathbb{R} is Lipschitz(λ\lambda) with respect to d(,)d(\cdot,\cdot). For any 𝐱𝐚\mathbf{x}\neq\mathbf{a}, denote the unit vector 𝐯=𝐱𝐚d(𝐱,𝐚)\mathbf{v}=\frac{\mathbf{x}-\mathbf{a}}{d(\mathbf{x},\mathbf{a})}. Then for direction vv

𝐯f(𝐚)λ\nabla_{\mathbf{v}}f(\mathbf{a})\leq\lambda

if 𝐯f(𝐚)\nabla_{\mathbf{v}}f(\mathbf{a}) is defined.

Proof.
𝐯f(𝐚)\displaystyle\nabla_{\mathbf{v}}f(\mathbf{a}) =limh0f(𝐚+h𝐯)f(𝐚)h\displaystyle=\lim_{h\to 0}\frac{f(\mathbf{a}+h\mathbf{v})-f(\mathbf{a})}{h} [def. directional derivative]
=limh0f(𝐚+h𝐯)f(𝐚)d(𝐚+h𝐯,𝐚)\displaystyle=\lim_{h\to 0}\frac{f(\mathbf{a}+h\mathbf{v})-f(\mathbf{a})}{d(\mathbf{a}+h\mathbf{v},\mathbf{a})} Lemma B.1
λ.\displaystyle\leq\lambda. def. Lipschitz

B.3 Proof that CSM is in the MIB (Monotonic Imbalance Bounding) Class

Iacus et al., (2011) defines the monotonic imbalance bounding (MIB) class of matching methods as follows:

Definition B.1.

A matching method is MIB for a function f(χ)f(\chi) of the dataset χ\chi with respect to distance metric d(,)d(\cdot,\cdot) if there exists some monotonically increasing function γf,d()\gamma_{f,d}(\cdot) on any non-negative p-dimensional vectors 𝛑+p\bm{\pi}\in\mathbb{R}^{p}_{+} such that:

d(f(χmT(𝝅)),f(χmC(𝝅)))γf,d(𝝅)\displaystyle d(f(\chi_{m_{T}(\bm{\pi})}),f(\chi_{m_{C}(\bm{\pi})}))\leq\gamma_{f,d}(\bm{\pi}) (11)

Here, ff is function that summarizes a dataset. 𝝅p\bm{\pi}\in\mathbb{R}^{p} is the (pp-dimensional) vector of tuning parameters. In our paper, 𝝅\bm{\pi} is the vector of covariate-wise calipers. χmT(𝝅)\chi_{m_{T}(\bm{\pi})} and χmC(𝝅)\chi_{m_{C}(\bm{\pi})} are datasets of matched treated and matched control units, when we set our parameters to 𝝅\bm{\pi}. The function γf,d()\gamma_{f,d}(\cdot) is monotonically increasing if it is non-decreasing in every dimention of 𝝅\bm{\pi} and strictly increasing in at least one of them. This essentially says as we reduce any dimension of 𝝅\bm{\pi}, the bound on d(,)d(\cdot,\cdot) also decreases.

We show that CSM is MIB by choosing appropriate functions ff, dd, and γ\gamma to verify Equation 11. We select ff to be the weighted mean for the kk-th covariate, where the weight is obtained from matching. Specifically, we define X¯Tk𝐰\bar{X}^{\mathbf{w}}_{Tk} and X¯Ck𝐰\bar{X}^{\mathbf{w}}_{Ck} as the weighted means for the treated and control units after matching:

f(χmT(𝝅))\displaystyle f(\chi_{m_{T}(\bm{\pi})}) =X¯Tk𝐰=1nTt𝒯Xtk\displaystyle=\bar{X}^{\mathbf{w}}_{Tk}=\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}X_{tk}
f(χmC(𝝅))\displaystyle f(\chi_{m_{C}(\bm{\pi})}) =X¯Ck𝐰=1nTt𝒯j𝒞twjtXjk.\displaystyle=\bar{X}^{\mathbf{w}}_{Ck}=\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\sum_{j\in\mathcal{C}_{t}}w_{jt}X_{jk}.

We further choose d(x,y)=|xy|d(x,y)=|x-y| and γf,d(𝝅)=γk(πk)\gamma_{f,d}(\bm{\pi})=\gamma_{k}(\pi_{k}). Then Equation 11 becomes |X¯TkX¯Ck|πk|\bar{X}_{Tk}-\bar{X}_{Ck}|\leq\pi_{k}. This inequality is established by Proposition 4.1, hence CSM is MIB.

We now provide a proof of Proposition 4.1. First, we introduce a useful lemma:

Lemma B.3.

Given covariatewise caliper 𝛑+p\bm{\pi}\in\mathbb{R}^{p}_{+} on units tt and jj, for scaling matrix V=diag{1𝛑}V=diag\{\frac{1}{\bm{\pi}}\}:

  1. (a)

    dV(2)(𝐗t,𝐗j)1|XtkXjk|πk for k=1,,pd^{(2)}_{V}(\mathbf{X}_{t},\mathbf{X}_{j})\leq 1\implies|X_{tk}-X_{jk}|\leq\pi_{k}\text{ for }k=1,\dots,p

  2. (b)

    dV()(𝐗t,𝐗j)1|XtkXjk|πk for k=1,,pd^{(\infty)}_{V}(\mathbf{X}_{t},\mathbf{X}_{j})\leq 1\iff|X_{tk}-X_{jk}|\leq\pi_{k}\text{ for }k=1,\dots,p

Proof.
  1. (a)

    Without loss of generality, suppose for contradiction that |XtkXjk|>πk for k=1|X_{tk}-X_{jk}|>\pi_{k}\text{ for }k=1. Then:

    dV(2)(𝐗t,𝐗j)=k=1p(XtkXjk)2πk2(Xt1Xj1)2π12>1.\displaystyle d^{(2)}_{V}(\mathbf{X}_{t},\mathbf{X}_{j})=\sqrt{\sum_{k=1}^{p}\frac{(X_{tk}-X_{jk})^{2}}{\pi_{k}^{2}}}\geq\sqrt{\frac{(X_{t1}-X_{j1})^{2}}{\pi_{1}^{2}}}>1.
  2. (b)

    This follows from definitions:

    dV()(𝐗t,𝐗j)1\displaystyle d^{(\infty)}_{V}(\mathbf{X}_{t},\mathbf{X}_{j})\leq 1 supk=1,,p|XtkXjkπk|1\displaystyle\iff\sup_{k=1,\dots,p}|\frac{X_{tk}-X_{jk}}{\pi_{k}}|\leq 1
    |XtkXjk|πk for k=1,,p\displaystyle\iff|X_{tk}-X_{jk}|\leq\pi_{k}\text{ for }k=1,\dots,p

Proof of Proposition 4.1:

Applying Lemma B.3 above,

|X¯Tk𝐰X¯Ck𝐰|\displaystyle|\bar{X}^{\mathbf{w}}_{Tk}-\bar{X}^{\mathbf{w}}_{Ck}| =|1nTt𝒯(Xtkj𝒞twjtXjk)|\displaystyle=|\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}(X_{tk}-\sum_{j\in\mathcal{C}_{t}}w_{jt}X_{jk})|
=|1nTt𝒯j𝒞twjt(XtkXjk)|\displaystyle=|\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\sum_{j\in\mathcal{C}_{t}}w_{jt}(X_{tk}-X_{jk})| Sum of wjt over 𝒞t is 1\displaystyle\text{Sum of }w_{jt}\text{ over }\mathcal{C}_{t}\text{ is 1}
1nTt𝒯j𝒞twjt|XtkXjk|\displaystyle\leq\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\sum_{j\in\mathcal{C}_{t}}w_{jt}|X_{tk}-X_{jk}| Triangle inequality
1nTt𝒯j𝒞twjtπk\displaystyle\leq\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\sum_{j\in\mathcal{C}_{t}}w_{jt}\pi_{k} Lemma B.3
=πk(1nTt𝒯j𝒞twjt)\displaystyle=\pi_{k}\left(\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\sum_{j\in\mathcal{C}_{t}}w_{jt}\right)
=πk\displaystyle=\pi_{k}

The last line is because 1nTt𝒯j𝒞twjt=1nTt𝒯1=1\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\sum_{j\in\mathcal{C}_{t}}w_{jt}=\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}1=1

B.4 Proof of Proposition 4.2

We prove the bound on the Wasserstein distance by choosing a specific coupling:

Proof.

For :

𝒲q()(fT,fC)\displaystyle\mathcal{W}_{q}^{(\ell)}(f_{T},f_{C}) =inf𝐗fT𝐘fCE[dV()(𝐗,𝐘)q]1/q\displaystyle=\inf_{\begin{subarray}{c}\mathbf{X}\sim f_{T}\\ \mathbf{Y}\sim f_{C}\end{subarray}}E\big{[}d_{V}^{(\ell)}(\mathbf{X},\mathbf{Y})^{q}\big{]}^{1/q}

We choose a coupling, generated as:

  1. 1.

    Sample 𝐗fT1\mathbf{X}\sim f_{T1} as 𝐗Uniform({𝐗1,,𝐗nT})\mathbf{X}\sim\text{Uniform}(\{\mathbf{X}_{1},\dots,\mathbf{X}_{n_{T}}\}), so 𝐗=𝐗t\mathbf{X}=\mathbf{X}_{t} for some t1,,nTt\in 1,\dots,n_{T}.

  2. 2.

    Sample 𝐘fC1𝐗=𝐗t\mathbf{Y}\sim f_{C1}\mid\mathbf{X}=\mathbf{X}_{t} as 𝐘Weighted Uniform({𝐗j:j𝒞t})\mathbf{Y}\sim\text{Weighted Uniform}(\{\mathbf{X}_{j}:j\in\mathcal{C}_{t}\}) for the control units matched to treated unit tt, with their appropriate weights.

This coupling (fT1,fC1f_{T1},f_{C1}) clearly produces the correct marginals (fT,fCf_{T},f_{C}), so we can write:

𝒲q(fT,fC)q\displaystyle\mathcal{W}_{q}(f_{T},f_{C})^{q} EXfT1YfC1Xt[d(X,Y)q]\displaystyle\leq E_{\begin{subarray}{c}X\sim f_{T1}\\ Y\sim f_{C1}\mid X_{t}\end{subarray}}\big{[}d(X,Y)^{q}\big{]}
=EXfT1[EYfC1X[d(X,Y)qX]]\displaystyle=E_{X\sim f_{T1}}\Big{[}E_{Y\sim f_{C1}\mid X}\big{[}d(X,Y)^{q}\mid X\big{]}\Big{]}
=1nTt𝒯[j𝒞twjtd(Xt,Xj)q]\displaystyle=\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\Big{[}\sum_{j\in\mathcal{C}_{t}}w_{jt}d(X_{t},X_{j})^{q}\Big{]}
cq1nTt𝒯[j𝒞twjt]\displaystyle\leq c^{q}\cdot\frac{1}{n_{T}}\sum_{t\in\mathcal{T}}\Big{[}\sum_{j\in\mathcal{C}_{t}}w_{jt}\Big{]}
=cq.\displaystyle=c^{q}.

For q=q=\infty, take the limit as qq increases of the above result. ∎

B.5 Proof of Proposition 4.3

Proposition 4.3, showing the bound on the covariate mean differences, is trivially proved using the fact that dV(2)(,)d^{(2)}_{V}(\cdot,\cdot) and dV()(,)d^{(\infty)}_{V}(\cdot,\cdot) are induced norms (see Appendix B.1).

Proof.
dV(𝐗¯T,𝐗¯C)\displaystyle d_{V}(\bar{\mathbf{X}}_{T},\bar{\mathbf{X}}_{C}) =dV(1nTt𝐗t,1nTtj𝒞twjt𝐗j)\displaystyle=d_{V}(\frac{1}{n_{T}}\sum_{t}\mathbf{X}_{t},\frac{1}{n_{T}}\sum_{t}\sum_{j\in\mathcal{C}_{t}}w_{jt}\mathbf{X}_{j})
=dV(0,1nTtj𝒞twjt𝐗j1nTt𝐗t)\displaystyle=d_{V}(0,\frac{1}{n_{T}}\sum_{t}\sum_{j\in\mathcal{C}_{t}}w_{jt}\mathbf{X}_{j}-\frac{1}{n_{T}}\sum_{t}\mathbf{X}_{t}) [translation invariance]
=dV(0,1nTtj𝒞twjt(𝐗j𝐗t))\displaystyle=d_{V}(0,\frac{1}{n_{T}}\sum_{t}\sum_{j\in\mathcal{C}_{t}}w_{jt}(\mathbf{X}_{j}-\mathbf{X}_{t})) [jwjt=1]\displaystyle[\sum_{j}w_{jt}=1]
=1nTtj𝒞twjt(𝐗j𝐗t)V\displaystyle=||\frac{1}{n_{T}}\sum_{t}\sum_{j\in\mathcal{C}_{t}}w_{jt}(\mathbf{X}_{j}-\mathbf{X}_{t})||_{V} [||||v induces dV(,)]\displaystyle[||\cdot||_{v}\text{ induces }d_{V}(\cdot,\cdot)]
1nTtj𝒞twjt𝐗j𝐗tV\displaystyle\leq\frac{1}{n_{T}}\sum_{t}\sum_{j\in\mathcal{C}_{t}}w_{jt}||\mathbf{X}_{j}-\mathbf{X}_{t}||_{V} [triangle inequality]\displaystyle[\text{triangle inequality}]
=1nTtj𝒞twjtdV(𝐗j,𝐗t)\displaystyle=\frac{1}{n_{T}}\sum_{t}\sum_{j\in\mathcal{C}_{t}}w_{jt}d_{V}(\mathbf{X}_{j},\mathbf{X}_{t})
1nTtj𝒞twjtc\displaystyle\leq\frac{1}{n_{T}}\sum_{t}\sum_{j\in\mathcal{C}_{t}}w_{jt}c [dV(𝐗j,𝐗t)c]\displaystyle[d_{V}(\mathbf{X}_{j},\mathbf{X}_{t})\leq c]
=c\displaystyle=c

B.6 Derivations and technical details of the synthetic controls approach

Synthetic controls naturally control linear bias, due to the following observations.

Proposition B.4.

Synthetic control weights project the treated unit’s covariates onto the convex hull of the donor pool units’ covariates.

Proof.

Recall that the convex hull of X={𝐗1,,𝐗J}X=\{\mathbf{X}_{1},\dots,\mathbf{X}_{J}\} for 𝐗jp\mathbf{X}_{j}\in\mathbb{R}^{p} is:

conv(X)={j=1Jwjt𝐗jwjt=1,wjt0 for j=1,,J}conv(X)=\big{\{}\sum_{j=1}^{J}w_{jt}\mathbf{X}_{j}\mid\sum w_{jt}=1,w_{jt}\geq 0\text{ for }j=1,\dots,J\big{\}}

See, e.g., Boyd et al., (2004).

Recall also that the Euclidean projection of 𝐗t\mathbf{X}_{t} onto the set conv(X)conv(X) in the norm ||||||\cdot|| is defined as:

argmin𝐮\displaystyle\operatorname*{arg\,min}_{\mathbf{u}} 𝐗t𝐮\displaystyle\hskip 5.69054pt||\mathbf{X}_{t}-\mathbf{u}||
subject to 𝐮conv(X)\displaystyle\hskip 5.69054pt\mathbf{u}\in conv(X)

This optimization problem is equivalent to standard SCM. ∎

The covariates of the donor pool generally form a pp-dimensional convex hull with a pp dimensional covariate,888Assuming that there are at least p+1p+1 non-collinear donor-pool units. Otherwise, the dimension of the convex hull may be less than pp. i.e., a pp-dimensional polygon (i.e., polytope) that contains the points corresponding to each of the donor pool units’ covariates, as well as the lines between those points. Geometrically, synthetic controls simply find the point in this convex hull that is “closest” to the treated unit’s covariates (i.e., “project” the treated unit onto the convex hull), where “closeness” is defined using the given distance metric. Importantly, every point in a convex hull can be written as a convex weighted average of the points generating the hull, so the “closest” point corresponds to a set of non-negative donor-pool-unit weights that sums to one.

To impute the counterfactual outcome for the treated unit (i.e., the outcome for the constructed synthetic control unit), synthetic controls linearly interpolate the donor-pool units’ outcomes to the projected point. Slightly more formally:

Remark 1.

Suppose we use Ytj𝒞twjtYjY_{t}^{*}\equiv\sum_{j\in\mathcal{C}_{t}}w_{jt}Y_{j} as our estimate of the counterfactual outcome for treated unit tt, using control units jj with convex weights wjtw_{jt}. Then we may interpret YtY_{t}^{*} as a linear interpolation of the control units’ outcomes to j𝒞twjt𝐗j\sum_{j\in\mathcal{C}_{t}}w_{jt}\mathbf{X}_{j}.

Proof.

Write Yj=f(𝐗j)Y_{j}=f(\mathbf{X}_{j}), without noise. If we suppose that f()f(\cdot) is linear, by the definition of a linear map we have:

j𝒞twjtYj\displaystyle\sum_{j\in\mathcal{C}_{t}}w_{jt}Y_{j} =j𝒞twjtf(𝐗j)\displaystyle=\sum_{j\in\mathcal{C}_{t}}w_{jt}f(\mathbf{X}_{j})
=f(j𝒞twjt𝐗j).\displaystyle=f(\sum_{j\in\mathcal{C}_{t}}w_{jt}\mathbf{X}_{j}).

I.e., YjY_{j}^{*} may be interpreted as the evaluation of f()f(\cdot) at j𝒞twjt𝐗j\sum_{j\in\mathcal{C}_{t}}w_{jt}\mathbf{X}_{j}. Convex weights ensure that this is interpolation, not extrapolation. ∎

In summary, synthetic controls linearly interpolate the donor-pool units’ outcomes to the point on the convex hull closest to the treated unit. As a result, they are subject to linear interpolation bias (Kellogg et al.,, 2021) in this first stage, though the bias is controlled by the maximum distance across which they are allowed to interpolate with the caliper. Synthetic controls then flatly extrapolate this outcome to the treated unit, i.e., impute the treated unit’s outcome as the linearly interpolated value. This second step is subject to potential extrapolation bias from the point on the convex hull to the location of the treated unit (Kellogg et al.,, 2021) proportional to dV(𝐗t,j𝒞twjt𝐗j)d_{V}(\mathbf{X}_{t},\sum_{j\in\mathcal{C}_{t}}w_{jt}\mathbf{X}_{j}), assuming the potential outcome function is Lipschitz. Both interpolation and extrapolation bias are therefore controlled by the caliper size in CSM.

B.7 Optimization for finding SCMs

For a scaled L2L_{2} distance, the SCM optimization is typically directly solved as a quadratic programming problem. For a scaled LL_{\infty} distance, we have to transform the SCM optimization into the following linear programming problem, which we solve for each treated unit tt in turn:

minimize y\displaystyle\hskip 5.69054pty
subject to y[V(𝐗tj𝒞twjt𝐗j)]ky for k=1,,p\displaystyle\hskip 5.69054pt-y\leq\Big{[}V(\mathbf{X}_{t}-\sum_{j\in\mathcal{C}_{t}}w_{jt}\mathbf{X}_{j})\Big{]}_{k}\leq y\text{ for }k=1,\dots,p
j𝒞twjt=1\displaystyle\hskip 5.69054pt\sum_{j\in\mathcal{C}_{t}}w_{jt}=1
0wjt1 for j𝒞t\displaystyle\hskip 5.69054pt0\leq w_{jt}\leq 1\text{ for }j\in\mathcal{C}_{t}

In words, we are trying to find the smallest y0y\geq 0 that “pinches” the covariate balance for each covariate. The wjtw_{jt} are all free (under their constraints), so the goal of shrinking yy induces us to find wjtw_{jt} to make small yy achievable.

B.8 Proof of bias reduction from synthetic controls (Proposition 4.5)

Proving Proposition 4.5 requires some additional mathematical exposition. Specifically, we require f0()f_{0}(\cdot) to be differentiable in order to use a Taylor expansion. That said, to effectively use the Lipschitz property, standard multivariable Taylor expansion does not suffice. Instead, we require a Taylor expansion in a distance metric.

Lemma B.5 (Taylor expansion in a distance metric).

Suppose f:pf:\mathbb{R}^{p}\to\mathbb{R} is differentiable. Let 𝐗j,𝐗tp\mathbf{X}_{j},\mathbf{X}_{t}\in\mathbb{R}^{p}, and write unit vector 𝐯j𝐗j𝐗tdV(𝐗j,𝐗t)\mathbf{v}_{j}\equiv\frac{\mathbf{X}_{j}-\mathbf{X}_{t}}{d_{V}(\mathbf{X}_{j},\mathbf{X}_{t})} where dVd_{V} is dV(2)d_{V}^{(2)} or dV()d_{V}^{(\infty)}, given by Equations 2 or 3. Then:

f(𝐗j)=f(𝐗t)+dV(𝐗j,𝐗t)𝐯jf(𝐗t)+o(dV(𝐗j,𝐗t))f(\mathbf{X}_{j})=f(\mathbf{X}_{t})+d_{V}(\mathbf{X}_{j},\mathbf{X}_{t})\nabla_{\mathbf{v}_{j}}f(\mathbf{X}_{t})+o(d_{V}(\mathbf{X}_{j},\mathbf{X}_{t}))

where 𝐯jff𝐯j\nabla_{\mathbf{v}_{j}}f\equiv\nabla f\cdot\mathbf{v}_{j} is the the directional derivative of ff in direction 𝐯j\mathbf{v}_{j}

Proof.

By standard multivariate Taylor expansion, we know that:

f(𝐗j)\displaystyle f(\mathbf{X}_{j}) =f(𝐗t)+(𝐗j𝐗t)Tf(𝐗t)+o(𝐗j𝐗t)\displaystyle=f(\mathbf{X}_{t})+(\mathbf{X}_{j}-\mathbf{X}_{t})^{T}\nabla f(\mathbf{X}_{t})+o(||\mathbf{X}_{j}-\mathbf{X}_{t}||)

for the usual Euclidean norm ||||||\cdot||.

We then have, using our definition of our directional derivative

(𝐗j𝐗t)Tf(𝐗t)\displaystyle(\mathbf{X}_{j}-\mathbf{X}_{t})^{T}\nabla f(\mathbf{X}_{t}) =dV(𝐗j,𝐗t)1dV(𝐗j,𝐗t)(𝐗j𝐗t)Tf(𝐗t)\displaystyle=d_{V}(\mathbf{X}_{j},\mathbf{X}_{t})\frac{1}{d_{V}(\mathbf{X}_{j},\mathbf{X}_{t})}(\mathbf{X}_{j}-\mathbf{X}_{t})^{T}\nabla f(\mathbf{X}_{t})
=dV(𝐗j,𝐗t)𝐯jf(𝐗t)\displaystyle=d_{V}(\mathbf{X}_{j},\mathbf{X}_{t})\nabla_{\mathbf{v}_{j}}f(\mathbf{X}_{t})

which gives

f(𝐗j)\displaystyle f(\mathbf{X}_{j}) =f(𝐗t)+dV(𝐗j,𝐗t)𝐯jf(𝐗t)+o(𝐗j𝐗t)\displaystyle=f(\mathbf{X}_{t})+d_{V}(\mathbf{X}_{j},\mathbf{X}_{t})\nabla_{\mathbf{v}_{j}}f(\mathbf{X}_{t})+o(||\mathbf{X}_{j}-\mathbf{X}_{t}||)

Finally, for the little-oo remainder term, we show that if ga(x)=o(xa)g_{a}(x)=o(||x-a||), then ga(x)=o(dV(x,a))g_{a}(x)=o(d_{V}(x,a)). We demonstrate this by proving that limxaga(x)dV(x,a)=0\lim_{x\to a}\frac{g_{a}(x)}{d_{V}(x,a)}=0.

limxaga(x)dV(x,a)\displaystyle\lim_{x\to a}\frac{g_{a}(x)}{d_{V}(x,a)} =limxaga(x)xaxadV(x,a)\displaystyle=\lim_{x\to a}\frac{g_{a}(x)}{||x-a||}\cdot\frac{||x-a||}{d_{V}(x,a)}
=limxaga(x)xalimxaxadV(x,a),\displaystyle=\lim_{x\to a}\frac{g_{a}(x)}{||x-a||}\cdot\lim_{x\to a}\frac{||x-a||}{d_{V}(x,a)},

if both limits exist. We know limxaga(x)xa=0\lim_{x\to a}\frac{g_{a}(x)}{||x-a||}=0 since ga(x)=o(xa)g_{a}(x)=o(||x-a||), so it remains to show that limxaxadV(x,a)\lim_{x\to a}\frac{||x-a||}{d_{V}(x,a)} exists.

limxaxadV(x,a)\displaystyle\lim_{x\to a}\frac{||x-a||}{d_{V}(x,a)} =limt0tdV(a+tv,a)\displaystyle=\lim_{t\to 0}\frac{t}{d_{V}(a+tv,a)} [tv=xa,t=xa]\displaystyle[tv=x-a,t=||x-a||]
=limt01ddtdV(a+tv,a)\displaystyle=\lim_{t\to 0}\frac{1}{\frac{d}{dt}d_{V}(a+tv,a)} [L’Hospital]\displaystyle[\text{L'Hospital}]

The above limit exists because ddtdV(a+tv,a)=ddtdV(tv,0)=ddttvV=vV\frac{d}{dt}d_{V}(a+tv,a)=\frac{d}{dt}d_{V}(tv,0)=\frac{d}{dt}t||v||_{V}=||v||_{V} is non-zero when v0v\neq 0. Hence ga(x)=o(dV(x,a))g_{a}(x)=o(d_{V}(x,a)). ∎

Lemma B.5 is technical, but it captures a very simple intuition. Standard multivariate Taylor expansion takes the dot product of the pp-dimensional gradient of f()f(\cdot) with the pp-dimensional vector of differences between 𝐱\mathbf{x} and the point aa. Calipers, however, only control the scalar quantities d(𝐗t,𝐗j)d(\mathbf{X}_{t},\mathbf{X}_{j}). Lemma B.5 simply rewrites standard multivariate Taylor expansion to better utilize the fact that Lipschitz functions control scalar directional derivatives (as discussed in Appendix B.2).

With our lemma, we now provide the proof for Proposition 4.5, where we use the fact that, by Lemma B.2, Lipschitz functions have bounded directional derivatives.

Proof.

Recall that by Taylor expansion of f0()f_{0}(\cdot) around 𝐗t\mathbf{X}_{t}, we have:

jwjtf0(𝐗j)f0(𝐗t)\displaystyle\sum_{j}w_{jt}f_{0}(\mathbf{X}_{j})-f_{0}(\mathbf{X}_{t}) =jwjtdV(𝐗j,𝐗t)𝐯jf0(𝐗t)+jwjto(dV(𝐗j,𝐗t))\displaystyle=\sum_{j}w_{jt}d_{V}(\mathbf{X}_{j},\mathbf{X}_{t})\nabla_{\mathbf{v}_{j}}f_{0}(\mathbf{X}_{t})+\sum_{j}w_{jt}o(d_{V}(\mathbf{X}_{j},\mathbf{X}_{t}))
=jwjtdV(𝐗j,𝐗t)𝐯jf0(𝐗t)+o(dV)\displaystyle=\sum_{j}w_{jt}d_{V}(\mathbf{X}_{j},\mathbf{X}_{t})\nabla_{\mathbf{v}_{j}}f_{0}(\mathbf{X}_{t})+o(d_{V})

Consider the linear term:

jwjtdV(𝐗j,𝐗t)𝐯jf0(𝐗t)\displaystyle\sum_{j}w_{jt}d_{V}(\mathbf{X}_{j},\mathbf{X}_{t})\nabla_{\mathbf{v}_{j}}f_{0}(\mathbf{X}_{t}) =jwjtf0(𝐗t)T(𝐗j𝐗t)\displaystyle=\sum_{j}w_{jt}\nabla f_{0}(\mathbf{X}_{t})^{T}(\mathbf{X}_{j}-\mathbf{X}_{t}) [def. 𝐯j]\displaystyle[\text{def. }\nabla_{\mathbf{v}_{j}}]
=jwjtk=1pck(𝐗jk𝐗tk)\displaystyle=\sum_{j}w_{jt}\sum_{k=1}^{p}c_{k}(\mathbf{X}_{jk}-\mathbf{X}_{tk})
=k=1pck(jwjt𝐗jk𝐗tk)\displaystyle=\sum_{k=1}^{p}c_{k}(\sum_{j}w_{jt}\mathbf{X}_{jk}-\mathbf{X}_{tk})
=0\displaystyle=0 [exact SC]\displaystyle[\text{exact SC}]

where 𝐜=[c1cp]T\mathbf{c}=[c_{1}\dots c_{p}]^{T} is the fixed (unknown) gradient of f0()f_{0}(\cdot) at 𝐗t\mathbf{X}_{t}. Thus, if the synthetical control matches, we drop the linear term from our bias.

Using the above we have, given that, for all tt, dV(𝐗t,𝐗j)cd_{V}(\mathbf{X}_{t},\mathbf{X}_{j})\leq c for all j𝒞tj\in\mathcal{C}_{t}:

t1Ntjwjtf0(𝐗j)f0(𝐗t)=1Nttjwjto(dV(Xj,Xt))=1Nttjwjto(c)=o(c).\displaystyle\sum_{t}\frac{1}{N_{t}}\sum_{j}w_{jt}f_{0}(\mathbf{X}_{j})-f_{0}(\mathbf{X}_{t})=\frac{1}{N_{t}}\sum_{t}\sum_{j}w_{jt}o(d_{V}(X_{j},X_{t}))=\frac{1}{N_{t}}\sum_{t}\sum_{j}w_{jt}o(c)=o(c).