Covariate Adjustment in Experiments with Matched Pairs†
†Yichong Zhang acknowledges the financial support from the NSFC under grant No. 72133002 and a Lee Kong Chian fellowship. Any and all errors are our own.
Abstract
This paper studies inference on the average treatment effect (ATE) in experiments in which treatment status is determined according to “matched pairs” and it is additionally desired to adjust for observed, baseline covariates to gain further precision. By a “matched pairs” design, we mean that units are sampled i.i.d. from the population of interest, paired according to observed, baseline covariates and finally, within each pair, one unit is selected at random for treatment. Importantly, we presume that not all observed, baseline covariates are used in determining treatment assignment. We study a broad class of estimators based on a “doubly robust” moment condition that permits us to study estimators with both finite-dimensional and high-dimensional forms of covariate adjustment. We find that estimators with finite-dimensional, linear adjustments need not lead to improvements in precision relative to the unadjusted difference-in-means estimator. This phenomenon persists even if the adjustments are interacted with treatment; in fact, doing so leads to no changes in precision. However, gains in precision can be ensured by including fixed effects for each of the pairs. Indeed, we show that this adjustment leads to the minimum asymptotic variance of the corresponding ATE estimator among all finite-dimensional, linear adjustments. We additionally study an estimator with a regularized adjustment, which can accommodate high-dimensional covariates. We show that this estimator leads to improvements in precision relative to the unadjusted difference-in-means estimator and also provide conditions under which it leads to the “optimal” nonparametric, covariate adjustment. A simulation study confirms the practical relevance of our theoretical analysis, and the methods are employed to reanalyze data from an experiment using a “matched pairs” design to study the effect of macroinsurance on microenterprise.
KEYWORDS: Experiment, matched pairs, covariate adjustment, randomized controlled trial, treatment assignment, LASSO
JEL classification codes: C12, C14
1 Introduction
This paper studies inference on the average treatment effect in experiments in which treatment status is determined according to “matched pairs.” By a “matched pairs” design, we mean that units are sampled i.i.d. from the population of interest, paired according to observed, baseline covariates and finally, within each pair, one unit is selected at random for treatment. This method is used routinely in all parts of the sciences. Indeed, commands to facilitate its implementation are included in popular software packages, such as sampsi in Stata. References to a variety of specific examples can be found, for instance, in the following surveys of various field experiments: Donner and Klar (2000), Glennerster and Takavarasha (2013), and Rosenberger and Lachin (2015). See also Bruhn and McKenzie (2009), who, based on a survey of selected development economists, report that 56% of researchers have used such a design at some point. Bai et al. (2022) develop methods for inference on the average treatment effect in such experiments based on the difference-in-means estimator. In this paper, we pursue the goal of improving upon the precision of this estimator by exploiting observed, baseline covariates that are not used in determining treatment status.
To this end, we study a broad class of estimators for the average treatment effect based on a “doubly robust” moment condition. The estimators in this framework are distinguished via different “working models” for the conditional expectations of potential outcomes under treatment and control given the observed, baseline covariates. Importantly, because of the double-robustness, these “working models” need not be correctly specified in order for the resulting estimator to be consistent. In this way, the framework permits us to study both finite-dimensional and high-dimensional forms of covariate adjustment without imposing unreasonable restrictions on the conditional expectations themselves. Under high-level conditions on the “working models” and their corresponding estimators and a requirement that pairs are formed so that units within pairs are suitably “close” in terms of the baseline covariates, we derive the limiting distribution of the covariate-adjusted estimator of the average treatment effect. We further construct an estimator for the variance of the limiting distribution and provide conditions under which it is consistent for this quantity.
Using our general framework, we first consider finite-dimensional, linear adjustments. For this class of estimators, our main findings are summarized as follows. First, we find that estimators with such adjustments are not guaranteed to be weakly more efficient than the unadjusted difference-in-means estimator. This finding echoes similar ones by Yang and Tsiatis (2001) and Tsiatis et al. (2008) in settings in which treatment is determined by i.i.d. coin flips, and by Freedman (2008) in a finite-population setting in which treatment is determined according to complete randomization. See Negi and Wooldridge (2021) for a succinct treatment of that literature. Moreover, we find that this phenomenon persists even if the adjustments are interacted with treatment; in fact, doing so leads to no changes in precision. In this sense, our results diverge from those in settings with complete randomization and treated fraction one half, where both uninteracted and interacted linear adjustments guarantee gains in precision. Last, we show that estimators with either uninteracted or interacted linear adjustments that include pair fixed effects are guaranteed to be weakly more efficient than the unadjusted difference-in-means estimator.
We then use our framework to consider high-dimensional adjustments based on penalization. Specifically, we first obtain an intermediate estimator by using the LASSO to estimate the “working model” for the relevant conditional expectations. When the treatment is determined according to “matched pairs,” however, this estimator need not be more precise than the unadjusted difference-in-means estimator. Therefore, following Cohen and Fogarty (2023), we consider, in an additional step, an estimator based on the finite-dimensional, linear adjustment described above that uses the predicted values for the “working model” as the covariates and includes fixed effects for each of the pairs. We show that the resulting estimator improves upon both the intermediate estimator and the unadjusted difference-in-means estimator in terms of precision. Moreover, we provide conditions under which the refitted adjustments attain the relevant efficiency bound derived by Armstrong (2022).
Concurrent with our paper, Cytrynbaum (2023) considers covariate adjustment in experiments in which units are grouped into tuples with possibly more than two units, rather than pairs. Both our paper and Cytrynbaum (2023) find that finite-dimensional, linear regression adjustments with pair fixed effects are guaranteed to improve precision relative to the unadjusted difference-in-means estimator, and show that such adjustments are indeed optimal among all linear adjustments. However, Cytrynbaum (2023) does not pursue more general forms of covariate adjustment, such as the regularized adjustments described above, which permit us to study nonparametric adjustments as well as high-dimensional adjustments using covariates whose dimension diverges rapidly with the sample size.
The remainder of our paper is organized as follows. In Section 2, we describe our setup and notation. In particular, there we describe the precise sense in which we require that units in each pair are “close” in terms of their baseline covariates. In Section 3, we introduce our general class of estimators based on a “doubly robust” moment condition. Under certain high-level conditions on the “working models” and their corresponding estimators, we derive the limiting behavior of the covariate-adjusted estimator. In Section 4, we use our general framework to study a variety of estimators with finite-dimensional, linear covariate adjustment. In Section 5, we use our general framework to study covariate adjustment based on the regularized regression. In Section 6, we examine the finite-sample behavior of tests based on these different estimators via a small simulation study. We find that covariate adjustment can lead to considerable gains in precision. Finally, in Section 7, we apply our methods to reanalyze data from an experiment using a “matched pairs” design to study the effect of macroinsurance on microenterprise. Proofs of all results and some details for simulations are given in the Online Supplement.
2 Setup and Notation
Let $Y_i$ denote the (observed) outcome of interest for the $i$th unit, $D_i$ be an indicator for whether the $i$th unit is treated, and $X_i$ and $W_i$ denote observed, baseline covariates for the $i$th unit; $X_i$ and $W_i$ will be distinguished below through the feature that only the former will be used in determining treatment assignment. Further denote by $Y_i(1)$ the potential outcome of the $i$th unit if treated and by $Y_i(0)$ the potential outcome of the $i$th unit if not treated. The (observed) outcome and potential outcomes are related to treatment status by the relationship
$$Y_i = Y_i(1) D_i + Y_i(0)(1 - D_i). \qquad (1)$$
For a random variable indexed by $i$, $A_i$, it will be useful to denote by $A^{(n)}$ the random vector $(A_1, \ldots, A_{2n})$. Denote by $P_n$ the distribution of the observed data $Z^{(n)}$, where $Z_i = (Y_i, D_i, X_i, W_i)$, and by $Q_n$ the distribution of $U^{(n)}$, where $U_i = (Y_i(1), Y_i(0), X_i, W_i)$. Note that $P_n$ is determined by (1), $Q_n$, and the mechanism for determining treatment assignment. We assume throughout that $U^{(n)}$ consists of $2n$ i.i.d. observations, i.e., $Q_n = Q^{2n}$, where $Q$ is the marginal distribution of $U_i$. We therefore state our assumptions below in terms of assumptions on $Q$ and the mechanism for determining treatment assignment. Indeed, we will not make reference to $P_n$ in the sequel, and all operations are understood to be under $Q$ and the mechanism for determining the treatment assignment. Our object of interest is the average effect of the treatment on the outcome of interest, which may be expressed in terms of this notation as
$$\Delta(Q) = E[Y_i(1) - Y_i(0)]. \qquad (2)$$
We now describe our assumptions on $Q$. We restrict $Q$ to satisfy the following mild requirement:

Assumption 2.1.
The distribution $Q$ is such that
(a) $0 < E[\operatorname{Var}[Y_i(d) \mid X_i]]$ for $d \in \{0, 1\}$.
(b) $E[Y_i^2(d)] < \infty$ for $d \in \{0, 1\}$.
(c) $E[Y_i(d) \mid X_i = x]$ and $E[Y_i^2(d) \mid X_i = x]$ are Lipschitz for $d \in \{0, 1\}$.
Next, we describe our assumptions on the mechanism determining treatment assignment. In order to describe these assumptions more formally, we require some further notation to define the relevant pairs of units. The pairs may be represented by the sets
$$\{\pi(2j - 1), \pi(2j)\} \quad \text{for } j = 1, \ldots, n,$$
where $\pi = \pi_n(X^{(n)})$ is a permutation of $2n$ elements. Because of its possible dependence on $X^{(n)}$, $\pi$ encompasses a broad variety of different ways of pairing the $2n$ units according to the observed, baseline covariates $X^{(n)}$. Given such a $\pi$, we assume that treatment status is assigned as described in the following assumption:
Assumption 2.2.
Treatment status is assigned so that $(Y^{(n)}(1), Y^{(n)}(0), W^{(n)}) \perp\!\!\!\perp D^{(n)} \mid X^{(n)}$ and, conditional on $X^{(n)}$, $(D_{\pi(2j-1)}, D_{\pi(2j)})$, $j = 1, \ldots, n$, are i.i.d. and each uniformly distributed over the values in $\{(0, 1), (1, 0)\}$.
Following Bai et al. (2022), our analysis will additionally require some discipline on the way in which pairs are formed. Let $\|\cdot\|_2$ denote the Euclidean norm. We will require that units in each pair are “close” in the sense described by the following assumption:

Assumption 2.3.
The pairs used in determining treatment status satisfy
$$\frac{1}{n} \sum_{1 \le j \le n} \|X_{\pi(2j-1)} - X_{\pi(2j)}\|_2^r \xrightarrow{P} 0$$
for $r \in \{1, 2\}$.
It will at times be convenient to require further that units in consecutive pairs are also “close” in terms of their baseline covariates. One may view this requirement, which is formalized in the following assumption, as “pairing the pairs” so that they are “close” in terms of their baseline covariates.

Assumption 2.4.
The pairs used in determining treatment status satisfy
$$\frac{1}{n} \sum_{1 \le j \le \lfloor n/2 \rfloor} \|X_{\pi(4j-k)} - X_{\pi(4j-l)}\|_2^2 \xrightarrow{P} 0$$
for any $k \in \{2, 3\}$ and $l \in \{0, 1\}$.
Bai et al. (2022) provide results to facilitate constructing pairs satisfying Assumptions 2.3–2.4 under weak assumptions on . In particular, given pairs satisfying Assumption 2.3, it is frequently possible to “re-order” them so that Assumption 2.4 is satisfied. See Theorem 4.3 in Bai et al. (2022) for further details. As in Bai et al. (2022), we highlight the fact that Assumption 2.4 will only be used to enable consistent estimation of relevant variances.
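To make the pairing requirements concrete, the following is a minimal sketch of a “matched pairs” assignment when $X_i$ is scalar: sort units on the covariate, pair adjacent units, and flip a coin within each pair. Sorting also keeps consecutive pairs close, in the spirit of Assumption 2.4. This is an illustration, not the matching algorithm of Bai et al. (2022), and all names are ours.

```python
import numpy as np

def matched_pairs_assignment(x, seed=0):
    """Pair 2n units on a scalar covariate x and randomize within pairs.

    Returns (pairs, d): pairs is an (n, 2) array of unit indices with
    adjacent (hence close) covariate values, ordered so that consecutive
    pairs are also close; d is the 0/1 treatment vector.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    assert len(x) % 2 == 0, "need an even number of units"
    pairs = np.argsort(x).reshape(-1, 2)   # adjacent units form a pair
    d = np.zeros(len(x), dtype=int)
    for i, j in pairs:
        d[rng.choice([i, j])] = 1          # one treated unit per pair
    return pairs, d
```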
Remark 2.1.
Under this setup, Bai et al. (2022) consider the unadjusted difference-in-means estimator
$$\hat{\Delta}_n^{\mathrm{unadj}} = \frac{1}{n} \sum_{1 \le i \le 2n} D_i Y_i - \frac{1}{n} \sum_{1 \le i \le 2n} (1 - D_i) Y_i \qquad (3)$$
and show that it is consistent and asymptotically normal with limiting variance
$$\sigma^2_{\mathrm{unadj}} = E[\operatorname{Var}[Y_i(1) \mid X_i]] + E[\operatorname{Var}[Y_i(0) \mid X_i]] + \frac{1}{2} E\big[(E[Y_i(1) - Y_i(0) \mid X_i] - \Delta(Q))^2\big].$$
We note that $\hat{\Delta}_n^{\mathrm{unadj}}$ is the unadjusted estimator because it does not use information in $W_i$ in either the design or analysis stage. If both $X_i$ and $W_i$ are used to form pairs in the “matched pairs” design, then the difference-in-means estimator, which we refer to as $\hat{\Delta}_n^{\mathrm{ideal}}$, has limiting variance
$$\sigma^2_{\mathrm{ideal}} = E[\operatorname{Var}[Y_i(1) \mid X_i, W_i]] + E[\operatorname{Var}[Y_i(0) \mid X_i, W_i]] + \frac{1}{2} E\big[(E[Y_i(1) - Y_i(0) \mid X_i, W_i] - \Delta(Q))^2\big].$$
In this case, $\hat{\Delta}_n^{\mathrm{ideal}}$ achieves the efficiency bound derived by Armstrong (2022), and we can see that $\sigma^2_{\mathrm{ideal}} \le \sigma^2_{\mathrm{unadj}}$.
For related results for parameters other than the average treatment effect, see Bai et al. (2023a). We note, however, that it is not always practical to form pairs using both $X_i$ and $W_i$, for two reasons. First, the covariate $W_i$ may only be collected along with the outcome variable and therefore may not be available at the design stage. Second, the quality of pairing decreases with the dimension of the matching variables. Indeed, it is common in practice to match on some but not all baseline covariates. Such considerations motivate our analysis below.
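Under the two expressions displayed above, the comparison between $\sigma^2_{\mathrm{unadj}}$ and $\sigma^2_{\mathrm{ideal}}$ can be made explicit via the law of total variance; the following display is our own rearrangement of those formulas:
$$
\sigma^2_{\mathrm{unadj}} - \sigma^2_{\mathrm{ideal}}
= \sum_{d \in \{0,1\}} E\big[\operatorname{Var}[E[Y_i(d) \mid X_i, W_i] \mid X_i]\big]
- \frac{1}{2} E\big[\operatorname{Var}[E[Y_i(1) - Y_i(0) \mid X_i, W_i] \mid X_i]\big]
= \frac{1}{2} E\big[\operatorname{Var}[E[Y_i(1) + Y_i(0) \mid X_i, W_i] \mid X_i]\big] \;\ge\; 0,
$$
where the first equality uses $\operatorname{Var}[Y_i(d) \mid X_i] = E[\operatorname{Var}[Y_i(d) \mid X_i, W_i] \mid X_i] + \operatorname{Var}[E[Y_i(d) \mid X_i, W_i] \mid X_i]$ and the second uses $\operatorname{Var}[A - B] + \operatorname{Var}[A + B] = 2\operatorname{Var}[A] + 2\operatorname{Var}[B]$.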
3 Main Results
To accommodate various forms of covariate-adjusted estimators of $\Delta(Q)$ in a single framework, it is useful to note that it follows from Assumption 2.2 that for any $d \in \{0, 1\}$ and any function $m_{d,n}$ such that $E|m_{d,n}(X_i, W_i)| < \infty$,
$$E\left[\frac{I\{D_i = d\}\,(Y_i - m_{d,n}(X_i, W_i))}{1/2} + m_{d,n}(X_i, W_i)\right] = E[Y_i(d)]. \qquad (4)$$
We note that (4) is just the augmented inverse propensity score weighted moment for $E[Y_i(d)]$ in which the propensity score is $1/2$ and the conditional mean model is $m_{d,n}$. Such a moment is also “doubly robust.” As the propensity score for the “matched pairs” design is exactly one half, we do not require the conditional mean model to be correctly specified, i.e., we permit $m_{d,n}(X_i, W_i) \ne E[Y_i(d) \mid X_i, W_i]$. See, for instance, Robins et al. (1995). Intuitively, $m_{d,n}$ is the “working model” which researchers use to estimate $E[Y_i(d) \mid X_i, W_i]$, and it can be arbitrarily misspecified because of (4). Although $m_{d,n}$ will be identical across $n$ for the examples in Section 4, the notation permits $m_{d,n}$ to depend on the sample size $n$ in anticipation of the high-dimensional results in Section 5. Based on the moment condition in (4), our proposed estimator of $\Delta(Q)$ is given by
$$\hat{\Delta}_n = \hat{\mu}_n(1) - \hat{\mu}_n(0), \qquad (5)$$
where, for $d \in \{0, 1\}$,
$$\hat{\mu}_n(d) = \frac{1}{2n} \sum_{1 \le i \le 2n} \left(\frac{I\{D_i = d\}\,(Y_i - \hat{m}_{d,n}(X_i, W_i))}{1/2} + \hat{m}_{d,n}(X_i, W_i)\right) \qquad (6)$$
and $\hat{m}_{d,n}$ is a suitable estimator of the “working model” $m_{d,n}$ in (4).
By some simple algebra, we have (we thank the referee for this excellent point)
$$\hat{\Delta}_n = \frac{1}{n} \sum_{1 \le i \le 2n} (2D_i - 1)\,\tilde{Y}_i, \qquad (7)$$
where
$$\tilde{Y}_i = Y_i - \frac{\hat{m}_{1,n}(X_i, W_i) + \hat{m}_{0,n}(X_i, W_i)}{2}. \qquad (8)$$
This means our regression-adjusted estimator can be viewed as a difference-in-means estimator, but with the “adjusted” outcome $\tilde{Y}_i$.
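As an illustration of (5)–(8), the point estimate is a few lines of code once fitted working-model values are in hand; m1_hat and m0_hat below stand for any estimators $\hat{m}_{1,n}$ and $\hat{m}_{0,n}$ evaluated at $(X_i, W_i)$. This is a sketch with our own names, not a full implementation.

```python
import numpy as np

def adjusted_ate(y, d, m1_hat, m0_hat):
    """Doubly robust ATE estimate of (5)-(6) with propensity score 1/2."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d)
    mu1 = np.mean(2 * d * (y - m1_hat) + m1_hat)        # eq. (6), d = 1
    mu0 = np.mean(2 * (1 - d) * (y - m0_hat) + m0_hat)  # eq. (6), d = 0
    return mu1 - mu0                                    # eq. (5)

def adjusted_ate_via_tilde(y, d, m1_hat, m0_hat):
    """The same estimate written as a difference in means of the
    adjusted outcome, as in (7)-(8)."""
    y = np.asarray(y, dtype=float)
    ty = y - (m1_hat + m0_hat) / 2                      # eq. (8)
    return 2 * np.mean((2 * np.asarray(d) - 1) * ty)    # eq. (7)
```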
We require some new discipline on the behavior of $\hat{m}_{d,n}$ and $m_{d,n}$ for $d \in \{0, 1\}$ and $n \ge 1$:
Assumption 3.1.
The functions $m_{d,n}$ for $d \in \{0, 1\}$ and $n \ge 1$ satisfy
(a) For $d \in \{0, 1\}$, …
(b) For $d \in \{0, 1\}$, …
(c) …, and … are Lipschitz uniformly over $n$.

Assumption 3.1(a) is an assumption to rule out degenerate situations. Assumption 3.1(b) is a mild uniform integrability assumption on the “working models.” If $m_{d,n} = m_d$ does not depend on $n$ for $d \in \{0, 1\}$, then it is satisfied as long as $E[m_d^2(X_i, W_i)] < \infty$. Assumption 3.1(c) ensures that units that are “close” in terms of the observed covariates are also “close” in terms of potential outcomes, uniformly across $n$.
Theorem 3.1 below establishes the limit in distribution of $\hat{\Delta}_n$. We note that the theorem depends on high-level conditions on $\hat{m}_{d,n}$ and $m_{d,n}$. In the sequel, these conditions will be verified in several examples.
Theorem 3.1.
In order to facilitate the use of Theorem 3.1 for inference about $\Delta(Q)$, we next provide a consistent estimator of the limiting variance $\sigma^2$. Define
…,
where $\tilde{Y}_i$ is defined in (8). The variance estimator $\hat{\sigma}_n^2$ is given by
(11)
The variance estimator in (11), in particular its “pairs of pairs” component, is analogous to the “pairs of pairs” variance estimator in Bai et al. (2022). Such a variance estimator has also been used by Abadie and Imbens (2008) in a related setting. Note that it can be shown, similarly to Remark 3.9 of Bai et al. (2022), that the estimator in (11) is nonnegative.
Theorem 3.2 below establishes the consistency of this estimator and its implications for inference about $\Delta(Q)$. In the statement of the theorem, we make use of the following notation: for any scalars $a$ and $b$, $a/b$ is understood to be zero whenever $b = 0$.
Theorem 3.2.
Remark 3.1.
Based on (7), it is natural to estimate $\sigma^2$ using the usual estimator of the limiting variance of the difference-in-means estimator applied to the adjusted outcomes $\tilde{Y}_i$. However, it can be shown that this estimator converges in probability to a limit that is weakly larger than $\sigma^2$, and the inequality is strict except in degenerate cases that hold with probability one. In this sense, the usual estimator of the limiting variance of the difference-in-means estimator is conservative.
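To fix ideas, the following is a rough sketch of a “pairs of pairs” calculation of the kind referenced in (11), applied to the adjusted outcomes: within-pair differences capture the total variation, and cross-products of differences across consecutive (close) pairs estimate the squared conditional mean difference that makes the naïve variance conservative. This is our own simplified rendering of the construction in Bai et al. (2022), not the exact formula in (11); it assumes an even number of pairs ordered as in Assumption 2.4, and all names are ours.

```python
import numpy as np

def pairs_of_pairs_variance(ty, d, pairs):
    """Sketch of a pairs-of-pairs variance estimate for the sqrt(n)-scaled
    adjusted difference-in-means estimator.

    ty: adjusted outcomes (8); d: 0/1 treatment; pairs: (n, 2) index
    array with consecutive rows close in covariates (Assumption 2.4).
    """
    # signed within-pair differences, treated minus control
    diffs = np.array([ty[i] - ty[j] if d[i] == 1 else ty[j] - ty[i]
                      for i, j in pairs])
    delta_hat = diffs.mean()                  # ATE estimate as in (7)
    tau2 = np.mean(diffs ** 2)                # total within-pair variation
    m = len(diffs) // 2
    # cross-products over consecutive pairs estimate the squared
    # conditional mean difference, removing the conservative term
    lam = np.mean(diffs[0:2 * m:2] * diffs[1:2 * m:2])
    return tau2 - 0.5 * (lam + delta_hat ** 2)
```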
Remark 3.2.
An important and immediate implication of Theorem 3.1 is that $\sigma^2$ is minimized when
$$m_{1,n}(X_i, W_i) + m_{0,n}(X_i, W_i) = E[Y_i(1) + Y_i(0) \mid X_i, W_i]$$
with probability one. In other words, the “working model” for $E[Y_i(1) + Y_i(0) \mid X_i, W_i]$ given by $m_{1,n} + m_{0,n}$ need only be correct “on average” over the variables that are not used in determining the pairs. For such a choice of $m_{1,n}$ and $m_{0,n}$, $\sigma^2$ in Theorem 3.1 becomes simply $\sigma^2_{\mathrm{ideal}}$,
which agrees with the variance obtained in Bai et al. (2022) when both $X_i$ and $W_i$ are used in determining the pairs. Such a variance also achieves the efficiency bound derived by Armstrong (2022).
Remark 3.3.
Following Bai et al. (2023b), it is straightforward to extend the analysis in this paper to the case with multiple treatment arms in which treatment status is determined using a “matched tuples” design, but we do not pursue this further in this paper.
Remark 3.4.
Following Bai et al. (2022), we conjecture that it is possible to establish the validity of a randomization test based on the test statistic studentized by a randomized version of (11). We emphasize that the validity of the randomization test depends crucially on the choice of studentization in the test statistic. See, for instance, Remark 3.16 in Bai et al. (2022). Such tests have been studied in finite-population settings with covariate adjustments by Zhao and Ding (2021). We leave a detailed analysis of randomization tests for future work.
4 Linear Adjustments
In this section, we consider linearly covariate-adjusted estimators of $\Delta(Q)$ based on a set of regressors generated by $X_i$ and $W_i$. To this end, define $\psi_i = \psi(X_i, W_i)$, where $\psi$ is a function of the observed, baseline covariates taking values in $\mathbf{R}^k$. We impose the following assumptions on the function $\psi$:
Assumption 4.1.
The function $\psi$ is such that
(a) no component of $\psi$ is constant and $E[\operatorname{Var}[\psi(X_i, W_i) \mid X_i]]$ is non-singular.
(b) $E[\|\psi(X_i, W_i)\|_2^2] < \infty$.
(c) $E[\psi(X_i, W_i) \mid X_i = x]$, $E[\|\psi(X_i, W_i)\|_2^2 \mid X_i = x]$, and $E[\psi(X_i, W_i) Y_i(d) \mid X_i = x]$ for $d \in \{0, 1\}$ are Lipschitz.
Assumption 4.1 is analogous to Assumption 2.1. Note, in particular, that Assumption 4.1(a) rules out situations where $\psi(X_i, W_i)$ is a function of $X_i$ only. See Remark 4.3 for a discussion of the behavior of the covariate-adjusted estimators in such situations.
4.1 Linear Adjustments without Pair Fixed Effects
Consider the following linear regression model:
$$Y_i = \alpha + \Delta D_i + \psi_i' \beta + \epsilon_i. \qquad (13)$$
Let $\hat{\alpha}_n^{\mathrm{naive}}$, $\hat{\Delta}_n^{\mathrm{naive}}$, and $\hat{\beta}_n^{\mathrm{naive}}$ denote the OLS estimators of $\alpha$, $\Delta$, and $\beta$ in (13). We call these estimators naïve because the corresponding regression adjustment is subject to Freedman’s critique and can lead to an adjusted estimator that is less efficient than the simple difference-in-means estimator $\hat{\Delta}_n^{\mathrm{unadj}}$.
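In code, the naïve adjustment is a single least-squares fit; a minimal sketch (our own names):

```python
import numpy as np

def naive_adjusted_ate(y, d, psi):
    """OLS of y on (1, d, psi) as in (13); returns the coefficient on d."""
    X = np.column_stack([np.ones(len(y)), d, psi])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]  # coefficient on treatment status
```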
Theorem 4.1 establishes (9) and (12) for a suitable choice of $m_{d,n}$ for $d \in \{0, 1\}$ and, as a result, the limiting distribution of $\hat{\Delta}_n^{\mathrm{naive}}$ and the validity of the variance estimator.
Theorem 4.1.
Remark 4.1.
Freedman (2008) studies regression adjustment based on (13) when treatment is assigned by complete randomization instead of a “matched pairs” design. In such settings, Lin (2013) proposes adjustment based on the following linear regression model:
$$Y_i = \alpha + \Delta D_i + (\psi_i - \bar{\psi}_n)' \beta + D_i (\psi_i - \bar{\psi}_n)' \gamma + \epsilon_i, \qquad (14)$$
where
$$\bar{\psi}_n = \frac{1}{2n} \sum_{1 \le i \le 2n} \psi_i.$$
Let $\hat{\Delta}_n^{\mathrm{int}}$ denote the OLS estimator for $\Delta$ in (14). It is straightforward to show that $\hat{\Delta}_n^{\mathrm{int}}$ satisfies (5)–(6) with a suitable choice of $\hat{m}_{d,n}$. It can be shown using similar arguments to those used to establish Theorem 4.1 that (9) and Assumption 3.1 are satisfied with linear working models $m_{d,n}$ for $d \in \{0, 1\}$. It thus follows by inspecting the expression for $\sigma^2$ in Theorem 3.1 that the limiting variance of $\hat{\Delta}_n^{\mathrm{int}}$ is the same as that of $\hat{\Delta}_n^{\mathrm{naive}}$ based on (13).
Remark 4.2.
Note that $\hat{\Delta}_n^{\mathrm{naive}}$ is the ordinary least squares estimator for the coefficient on $D_i$ in the linear regression (13).
Furthermore, Theorem 4.1 implies that its limiting variance is the $\sigma^2$ given in Theorem 3.1 with the corresponding linear working models. The usual heteroskedasticity-robust estimator of the limiting variance of $\hat{\Delta}_n^{\mathrm{naive}}$ is, however, simply the estimator described in Remark 3.1, and it thus follows that it is conservative in the sense described therein. It is, of course, possible to estimate the limiting variance consistently using $\hat{\sigma}_n^2$ proposed in Theorem 3.2, but that limiting variance is not guaranteed to be smaller than the limiting variance of the unadjusted estimator, i.e., $\sigma^2_{\mathrm{unadj}}$, so the linear adjustment without pair fixed effects can harm the precision of the estimator. Evidence of this phenomenon is provided in our simulations in Section 6.
4.2 Linear Adjustments with Pair Fixed Effects
Remark 4.1 implies that in “matched pairs” designs, including interaction terms in the linear regression does not lead to an estimator with lower limiting variance than the one based on the linear regression without interaction terms. It is therefore interesting to study whether there exists a linearly covariate-adjusted estimator with lower limiting variance than the ones based on (13) and (14) as well as the difference-in-means estimator. To that end, consider instead the following linear regression model with one dummy for each of the $n$ pairs:
$$Y_i = \Delta D_i + \psi_i' \beta + \sum_{1 \le j \le n} \lambda_j\, I\{i \in \{\pi(2j-1), \pi(2j)\}\} + \epsilon_i. \qquad (15)$$
Let $\hat{\Delta}_n^{\mathrm{pfe}}$ and $\hat{\beta}_n^{\mathrm{pfe}}$ denote the OLS estimators of $\Delta$ and $\beta$ in (15), where “pfe” stands for pair fixed effects. It follows from the Frisch–Waugh–Lovell theorem that $\hat{\Delta}_n^{\mathrm{pfe}}$ may equivalently be computed by demeaning $Y_i$, $D_i$, and $\psi_i$ within each pair and running OLS on the demeaned variables.
Therefore, $\hat{\Delta}_n^{\mathrm{pfe}}$ satisfies (5)–(6) with a suitable choice of $\hat{m}_{d,n}$.
Theorem 4.2 establishes (9) and (12) for a suitable choice of $m_{d,n}$ and, as a result, the limiting distribution of $\hat{\Delta}_n^{\mathrm{pfe}}$ and the validity of the variance estimator.
Theorem 4.2.
Remark 4.3.
When $\psi$ is restricted to be a function of $X_i$ only, $\hat{\Delta}_n^{\mathrm{pfe}}$ coincides to first order with the unadjusted difference-in-means estimator $\hat{\Delta}_n^{\mathrm{unadj}}$ defined in (3). To see this, suppose further that $\psi$ is Lipschitz and that the $\psi_i$ are bounded. The proof of Theorem 4.2 reveals that $\hat{\Delta}_n^{\mathrm{pfe}}$ and $\hat{\beta}_n^{\mathrm{pfe}}$ coincide with the OLS estimators of the intercept and slope parameters in a linear regression of within-pair differences of the outcome on a constant and within-pair differences of $\psi_i$. Using this observation, it follows by arguing as in Section S.1.1 of Bai et al. (2022) that the two estimators differ only by terms that vanish at the relevant rate.
See also Remark 3.8 of Bai et al. (2022).
Remark 4.4.
Note that the expression for $\sigma^2$ in Theorem 3.1 depends on the working models only through the sum $m_{1,n} + m_{0,n}$. With this in mind, consider the class of all linearly covariate-adjusted estimators based on $\psi_i$, i.e., those with working models of the form $m_{d,n}(X_i, W_i) = \psi_i' \beta_d$. For this specification of $m_{d,n}$,
…
It follows that among all such linear adjustments, $\sigma^2$ in (10) is minimized at the probability limit of the coefficients from the regression with pair fixed effects.
This observation implies that the linear adjustment with pair fixed effects, i.e., $\hat{\Delta}_n^{\mathrm{pfe}}$, yields the optimal linear adjustment in the sense of minimizing $\sigma^2$. Its limiting variance is, in particular, weakly smaller than the limiting variance of the unadjusted difference-in-means estimator defined in (3). On the other hand, the covariate-adjusted estimators based on (13) or (14), i.e., $\hat{\Delta}_n^{\mathrm{naive}}$ and $\hat{\Delta}_n^{\mathrm{int}}$, are in general not optimal among all linearly covariate-adjusted estimators based on $\psi_i$. In fact, the limiting variances of these two estimators may even be larger than that of the unadjusted difference-in-means estimator.
Remark 4.5.
The “matched pairs” design is essentially a nonparametric way to adjust for $X_i$. Projecting on the pair dummies in (15) is equivalent to pairwise demeaning, which effectively removes the variation due to $X_i$ from both the outcome and the regressors. This is key to the optimality of $\hat{\Delta}_n^{\mathrm{pfe}}$ over all linearly adjusted estimators. Following the same logic, we expect that by replacing the pair dummies with sieve bases of $X_i$ in (15), the linear regression can still effectively remove the variation due to $X_i$, so that the new adjusted estimator is asymptotically equivalent to $\hat{\Delta}_n^{\mathrm{pfe}}$, and thus linearly optimal.
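Echoing the pairwise-demeaning logic of this remark, the pair-fixed-effects adjustment can be computed without forming $n$ dummy columns: by the Frisch–Waugh–Lovell theorem, demeaning $Y_i$, $D_i$, and $\psi_i$ within each pair and running OLS on the demeaned variables reproduces the coefficient on $D_i$ from (15). A minimal sketch (our own names):

```python
import numpy as np

def pfe_adjusted_ate(y, d, psi, pair_id):
    """OLS of y on (d, psi) with pair fixed effects, as in (15)."""
    y = np.array(y, dtype=float)       # copies, so inputs stay untouched
    d = np.array(d, dtype=float)
    psi = np.array(psi, dtype=float)
    pair_id = np.asarray(pair_id)
    for p in np.unique(pair_id):
        idx = pair_id == p
        y[idx] -= y[idx].mean()        # within-pair demeaning is
        d[idx] -= d[idx].mean()        # equivalent to pair dummies
        psi[idx] -= psi[idx].mean(axis=0)
    X = np.column_stack([d, psi])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0]  # coefficient on (demeaned) treatment status
```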
Remark 4.6.
Remark 4.2 also applies here with $\hat{\Delta}_n^{\mathrm{naive}}$ replaced by $\hat{\Delta}_n^{\mathrm{pfe}}$. Even though $\hat{\Delta}_n^{\mathrm{pfe}}$ can be computed via OLS estimation of (15), we emphasize that the usual heteroskedasticity-robust standard errors that naïvely treat the data (including treatment status) as if they were i.i.d. need not be consistent for the limiting variance derived in our analysis.
Remark 4.7.
One can also consider the estimator based on the following linear regression model, which adds interaction terms to (15):
$$Y_i = \Delta D_i + \psi_i' \beta + D_i (\psi_i - \bar{\psi}_n)' \gamma + \sum_{1 \le j \le n} \lambda_j\, I\{i \in \{\pi(2j-1), \pi(2j)\}\} + \epsilon_i. \qquad (16)$$
Let $\hat{\Delta}_n^{\mathrm{pfe,int}}$ denote the OLS estimator for $\Delta$ in (16). It is straightforward to show that $\hat{\Delta}_n^{\mathrm{pfe,int}}$ satisfies (5)–(6) with a suitable choice of $\hat{m}_{d,n}$. Following similar arguments to those used in the proof of Theorem 4.1, we can establish that (9) and Assumption 3.1 are satisfied with linear working models whose sum has the same limit as in the uninteracted case. Because the limiting variance depends on the working models only through their sum, it follows from Remark 4.4 that the limiting variance of $\hat{\Delta}_n^{\mathrm{pfe,int}}$ is identical to the limiting variance of $\hat{\Delta}_n^{\mathrm{pfe}}$.
Remark 4.8.
Wu and Gagnon-Bartsch (2021) consider covariate adjustment for paired experiments under the design-based framework, in which the covariates are treated as deterministic, and thus the cross-sectional dependence between units in the same pair due to the closeness of their covariates plays no role in their analysis. We differ from them by considering the sampling-based framework, in which the covariates are treated as random and the pairs are formed by matching, and thus have an impact on statistical inference. Under their framework, Wu and Gagnon-Bartsch (2021) point out that covariate adjustments may have a positive or negative effect on estimation accuracy, depending on how they are estimated. This is consistent with our findings in this section. Specifically, we show that when the regression adjustments are estimated by a linear regression with pair fixed effects, the resulting ATE estimator is guaranteed to weakly improve upon the difference-in-means estimator in terms of efficiency. However, this improvement is not guaranteed if the adjustments are estimated without pair fixed effects.
Remark 4.9.
If we choose $\psi$ as a set of sieve basis functions with increasing dimension, then under suitable regularity conditions, the linear adjustments both with and without pair fixed effects achieve the same limiting variance as $\hat{\Delta}_n^{\mathrm{ideal}}$, and thus the efficiency bound. In fact, if $\psi$ contains sieve bases, then the linear adjustment without pair fixed effects can approximate the true specification, in the sense that the fitted working models converge to the true conditional means, which implies the excess term in $\sigma^2$ in Theorem 3.1 equals zero. Similarly, the linear adjustment with pair fixed effects can approximate the true specification in the corresponding sense, which again implies this term in Theorem 3.1 equals zero. Therefore, in both cases, the adjusted estimator achieves the minimum variance. In the next section, we consider $\ell_1$-regularized adjustments, which may be viewed as providing a way to choose the relevant sieve bases in a data-driven manner.
5 Regularized Adjustments
In this section, we study covariate adjustments based on $\ell_1$-regularized linear regression. Such settings can arise if the covariates are high-dimensional or if the dimension of $W_i$ is fixed but the regressors include many sieve basis functions of $X_i$ and $W_i$. To accommodate situations in which the dimension of the regressors increases with $n$, we add a subscript $n$ and denote them by $\psi_{n,i} = \psi_n(X_i, W_i)$. Let $p_n$ denote the dimension of $\psi_{n,i}$, which will be permitted below to be possibly much larger than $n$.
In what follows, we consider a two-step method in the spirit of Cohen and Fogarty (2023). In the first step, an intermediate estimator, $\hat{\Delta}_n^{\mathrm{r}}$, is obtained using (5) with a “working model” obtained through $\ell_1$-regularized linear regression adjustments for $d \in \{0, 1\}$. As explained further below in Theorem 5.1, when the working model is approximately correctly specified, such an estimator is optimal in the sense that it minimizes the limiting variance $\sigma^2$ in Theorem 3.1. When this is not the case, however, for reasons analogous to those put forward in Remark 4.2, it need not have a limiting variance weakly smaller than that of the unadjusted difference-in-means estimator. In a second step, we therefore consider an estimator obtained by refitting a version of (15) in which the covariates are replaced by the regularized estimates of the working models for $d \in \{0, 1\}$. The resulting estimator has a limiting variance weakly smaller than that of the intermediate estimator, and thus remains optimal under approximately correct specification in the same sense. Moreover, it has a limiting variance weakly smaller than that of the unadjusted difference-in-means estimator. Wager et al. (2016) also consider high-dimensional regression adjustments in randomized experiments using the LASSO. We differ from their work by considering the “matched pairs” design and, more importantly, by discussing when and how regularized adjustments can improve estimation efficiency upon the difference-in-means estimator.
Before proceeding, we introduce some additional notation that will be required in our formal description of the methods. We denote by $\psi_{n,ij}$ the $j$th component of $\psi_{n,i}$. For a vector $a \in \mathbf{R}^{p_n}$ and $0 \le q \le \infty$, recall that
$$\|a\|_q = \Big(\sum_{1 \le j \le p_n} |a_j|^q\Big)^{1/q},$$
where it is understood that $\|a\|_0$ is the number of nonzero components of $a$ and $\|a\|_\infty = \max_{1 \le j \le p_n} |a_j|$. Using this notation, we further define the regularized working models. For $d \in \{0, 1\}$, define
$$\hat{\theta}_{d,n} \in \operatorname*{arg\,min}_{\theta}\; \frac{1}{n} \sum_{1 \le i \le 2n:\, D_i = d} (Y_i - \psi_{n,i}' \theta)^2 + \lambda_n \|\hat{\Omega}_{d,n} \theta\|_1, \qquad (17)$$
where $\lambda_n$ is a penalty parameter that will be disciplined by the assumptions below, $\hat{\Omega}_{d,n}$ is a diagonal matrix, and its $j$th diagonal entry is the penalty loading for the $j$th regressor. Let $\hat{\Delta}_n^{\mathrm{r}}$ denote the estimator in (5) with $\hat{m}_{d,n}(X_i, W_i) = \psi_{n,i}' \hat{\theta}_{d,n}$ for $d \in \{0, 1\}$.
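As a sketch of this first step, the working models can be fit with any off-the-shelf LASSO solver; here we use scikit-learn's Lasso as a stand-in for (17), omitting the penalty loadings and the iteration of Algorithm 5.1 below. All names are ours.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_working_models(y, d, psi_n, lam):
    """Fit the working models by l1-regularized regression, separately
    on the treated and control subsamples, and return fitted values
    for all 2n units (the inputs to (5)-(6))."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d)
    psi_n = np.asarray(psi_n, dtype=float)
    m_hat = {}
    for dd in (0, 1):
        mask = d == dd
        fit = Lasso(alpha=lam).fit(psi_n[mask], y[mask])
        m_hat[dd] = fit.predict(psi_n)   # evaluate at every unit
    return m_hat[1], m_hat[0]
```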
We now proceed with the statement of our assumptions. The first assumption collects a variety of moment conditions that will be used in our formal analysis:
Assumption 5.1.
(a) There exist nonrandom quantities $\theta_{d,n}$, $d \in \{0, 1\}$, such that, with the approximation error defined as
…,
we have
… (18)
where … and ….
(b) For some … and constant …,
… with probability one.
(c) For some … and …, we require that
… (19)
(d) For some …, …, …, the following statements hold with probability one:
…
Remark 5.1.
It is instructive to note that (18) in Assumption 5.1(a) is the subgradient condition for an $\ell_1$-penalized regression of the outcome on $\psi_{n,i}$ when the penalty is of the stated order. Specifically, when the dimension of $\psi_{n,i}$ is small relative to $n$, this condition holds automatically for $\theta_{d,n}$ equal to the coefficients of a linear projection of the outcome onto $\psi_{n,i}$. When the dimension is large but the working model is approximately correctly specified, in the sense that the approximation error is sufficiently small, (18) also holds. However, approximately correct specification is not necessary for (18): for example, Assumption 5.1(a) can hold in designs where $\psi_{n,i}$ is a vector of independent standard normal random variables even though the linear regression adjustment is not approximately correctly specified. We can additionally impose a sparsity restriction on $\theta_{d,n}$ so that it further satisfies Assumption 5.3(b) below.
Remark 5.2.
Assumptions 5.1(b) and 5.1(d) are standard in the high-dimensional estimation literature; see, for instance, Belloni et al. (2017). The last four inequalities in Assumption 5.1(d), in particular, permit us to apply the high-dimensional central limit theorem in Chernozhukov et al. (2017, Theorem 2.1).
Remark 5.3.
Our analysis will, as before, also require some discipline on the way in which pairs are formed. For this purpose, Assumption 2.3 will suffice, but we will need an additional Lipschitz-like condition:
Assumption 5.2.
For some constant $L < \infty$ and any two points $(x, w)$ and $(x', w')$ in the support of $(X_i, W_i)$, we have
…
We next specify our restrictions on the penalty parameter .
Assumption 5.3.
(a) For some …,
…
(b) … and …, where
… (20)
We note that Assumption 5.3(b) permits $p_n$ to be much greater than $n$. It also requires sparsity, in the sense that the number of relevant regressors grows sufficiently slowly.
Finally, as is common in the analysis of $\ell_1$-penalized regression, we require a “restricted eigenvalue” condition. This assumption permits us to apply Bickel et al. (2009, Lemma 4.1) and establish error bounds for $\hat{\theta}_{d,n}$, $d \in \{0, 1\}$.
Assumption 5.4.
For some and , the following statements hold with probability approaching one:
where .
Using these assumptions, the following theorem characterizes the behavior of $\hat{\Delta}_n^{\mathrm{r}}$:
Theorem 5.1.
Suppose $Q$ satisfies Assumption 2.1 and the treatment assignment mechanism satisfies Assumptions 2.2–2.3. Further suppose Assumptions 5.1–5.4 hold. Then, (9), (12), and Assumption 3.1 are satisfied with
$$m_{d,n}(X_i, W_i) = \psi_{n,i}' \theta_{d,n}$$
for $d \in \{0, 1\}$. Denote the variance in Theorem 3.1 associated with this choice by $\sigma^2_{\mathrm{r}}$. If the regularized adjustment is approximately correctly specified, i.e., the approximation errors in Assumption 5.1(a) vanish asymptotically, then $\hat{\Delta}_n^{\mathrm{r}}$ achieves the minimum variance, i.e., $\sigma^2_{\mathrm{r}} = \sigma^2_{\mathrm{ideal}}$.
Remark 5.4.
We recommend employing the iterative estimation procedure outlined in Belloni et al. (2017) to estimate the penalty loadings, in which the $k$th step’s penalty loadings are estimated based on the $(k-1)$th step’s LASSO estimates. Formally, this iterative procedure is described by the following algorithm:
Algorithm 5.1.
Step 0: Initialize the penalty loadings, using the outcomes in place of regression residuals.
Step $k$ ($k = 1, \ldots, K$): Compute the residuals from the $(k-1)$th step’s fit, update the penalty loadings accordingly, and compute the LASSO estimates following (17) with the updated loadings.
Step $K+1$: Set the final estimates equal to those of the last iteration.
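In code, one pass of this iteration might look as follows. This is our own sketch of a Belloni et al. (2017)-style loading update (loadings proportional to the root mean square of regressor-residual products, implemented by rescaling columns), not a literal transcription of Algorithm 5.1; all names are ours.

```python
import numpy as np
from sklearn.linear_model import Lasso

def iterated_lasso(y, psi, lam, n_iter=3):
    """l1-regularized fit with iteratively re-estimated penalty loadings."""
    y = np.asarray(y, dtype=float)
    psi = np.asarray(psi, dtype=float)
    eps = y - y.mean()                       # step-0 "residuals"
    for _ in range(n_iter):
        # loading_j ~ root mean square of psi_ij * eps_i
        loadings = np.sqrt(np.mean((psi * eps[:, None]) ** 2, axis=0))
        loadings = np.maximum(loadings, 1e-8)
        # dividing column j by loading_j penalizes coefficient j with
        # weight loading_j, i.e., a diagonal penalty-loading matrix
        fit = Lasso(alpha=lam).fit(psi / loadings, y)
        eps = y - fit.predict(psi / loadings)
    return fit.coef_ / loadings, fit.intercept_   # coefficients on psi
```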
Remark 5.5.
When the $\ell_1$-regularized adjustment is approximately correctly specified, Theorem 5.1 shows that $\hat{\Delta}_n^{\mathrm{r}}$ achieves the minimum variance derived in Remark 3.2, and thus is guaranteed to be weakly more efficient than the difference-in-means estimator. When $W_i$ is of fixed dimension and $\psi_{n,i}$ consists of sieve basis functions of $(X_i, W_i)$, approximately correct specification usually holds. Specifically, under regularity conditions such as smoothness of the conditional means, we can approximate $E[Y_i(d) \mid X_i, W_i]$ by a sparse linear combination of the sieve bases. This means our regularized regression adjustment can select relevant sieve bases in nonparametric regression adjustments in a data-driven manner and automatically minimize the limiting variance of the corresponding ATE estimator.
Remark 5.6.
When the dimension of $\psi_{n,i}$ is ultra-high (i.e., $p_n \gg n$) and the regularized adjustment is not approximately correctly specified, $\hat{\Delta}_n^{\mathrm{r}}$ is subject to Freedman (2008)’s critique: theoretically, it can be less efficient than $\hat{\Delta}_n^{\mathrm{unadj}}$. To overcome this problem, we consider an additional step in which we treat the regularized adjustments $(\hat{m}_{1,n}(X_i, W_i), \hat{m}_{0,n}(X_i, W_i))$ as a two-dimensional covariate and refit a linear regression with pair fixed effects. Such a procedure has also been studied by Cohen and Fogarty (2023) in the setting with low-dimensional covariates and complete randomization. In fact, this strategy can improve upon general initial regression adjustments as long as (9), (12), and Assumption 3.1 are satisfied.
Theorem 5.2 below shows that the “refit” estimator $\hat{\Delta}_n^{\mathrm{refit}}$ of the ATE is weakly more efficient than both $\hat{\Delta}_n^{\mathrm{r}}$ and $\hat{\Delta}_n^{\mathrm{unadj}}$. To state the results, define $\hat{\Delta}_n^{\mathrm{refit}}$ as the estimator in (15) with $\psi_i$ replaced by $(\hat{m}_{1,n}(X_i, W_i), \hat{m}_{0,n}(X_i, W_i))'$. Note that $\hat{\Delta}_n^{\mathrm{refit}}$ remains numerically the same if we include an intercept in the definition of the refitting covariates. Following Remark 4.3, the estimator can equivalently be computed from the regression of within-pair differences on a constant and within-pair differences of the fitted working models; replacing the covariates by their demeaned versions will not change the regression estimators.
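Putting the two steps together, a sketch of the refit estimator using the hypothetical helpers sketched earlier (lasso_working_models and pfe_adjusted_ate):

```python
import numpy as np

def refit_adjusted_ate(y, d, psi_n, pair_id, lam):
    """Two-step estimator: l1-regularized working models, then a linear
    regression with pair fixed effects on the two fitted values."""
    m1_hat, m0_hat = lasso_working_models(y, d, psi_n, lam)
    fitted = np.column_stack([m1_hat, m0_hat])  # two-dimensional covariate
    return pfe_adjusted_ate(y, d, fitted, pair_id)
```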
The following assumption will be employed to control in our subsequent analysis:
Assumption 5.5.
For some and ,
The following theorem characterizes the behavior of $\hat{\Delta}_n^{\mathrm{refit}}$:
Theorem 5.2.
Remark 5.7.
It is possible to further relax the full-rank condition in Assumption 5.5 by running a ridge regression or by truncating the minimum eigenvalue of the Gram matrix in the refitting step.
6 Simulations
In this section, we conduct Monte Carlo experiments to assess the finite-sample performance of the inference methods proposed in the paper. In all cases, we follow Bai et al. (2022) and consider tests of the hypothesis
$$H_0 : \Delta(Q) = \Delta_0$$
at nominal level $\alpha = 0.05$, with $\Delta_0$ specified below.
6.1 Data Generating Processes
We generate potential outcomes for $d \in \{0, 1\}$ and $1 \le i \le 2n$ by the equation
… (21)
where the functions and error distributions are specified in each model as follows. In each of the specifications, the errors are i.i.d. across $i$. The number of pairs $n$ is equal to 100 and 200. The number of replications is 10,000.
Model 1: …, where $\Phi$ is the standard normal distribution function. We set … and ….
Model 2: …, where … is the same as in Model 1.
Model 3: The same as in Model 2, except that … with ….
Model 4: …, where … is the same as in Model 1.
Model 5: The same as in Model 4, except that ….
Model 6: The same as in Model 5, except that ….
Model 7: … and …, where the covariance matrix consists of 1 on the diagonal and … on all other elements.
Model 8: The same as in Model 7, except that ….
Model 9: The same as in Model 8, except that ….
Model 10: … and …, where the covariance matrix consists of 1 on the diagonal and … on all other elements.
Model 11: The same as in Model 10, except that ….
Model 12: … and …, where the covariance matrix is the Toeplitz matrix ….
Model 13: The same as in Model 12, except that … and ….
Model 14: The same as in Model 13, except that ….
Model 15: The same as in Model 14, except that ….
It is worth noting that Models 1, 2, 3, 4, 7, 10, 12, and 13 imply homogeneous treatment effects. Among them, the conditional mean of the outcome is linear in the covariates in Models 1, 2, and 12. Models 5, 8, 11, and 14 have heterogeneous but homoscedastic treatment effects. In Models 6, 9, and 15, however, the implied treatment effects are both heterogeneous and heteroscedastic. Models 12–15 contain high-dimensional covariates.
We follow Bai et al. (2022) in matching pairs. Specifically, if the matching covariate is scalar, we match pairs by sorting it. If it is multivariate, we match pairs by the permutation calculated using the R package nbpMatching. For more details, see Bai et al. (2022, Section 4). After matching the pairs, we flip coins to randomly select one unit within each pair for treatment and assign the other to control.
6.2 Estimation and Inference
We set $\Delta_0$ equal to the true ATE to illustrate size and to a different value to illustrate power. Rejection probabilities in percentage points are presented. To further illustrate the efficiency gains obtained by regression adjustments, in Figure 1 we plot the average standard error reduction in percentage relative to the standard error of the estimator without adjustments for the various estimation methods.
Specifically, we consider the following adjusted estimators:
(i) unadj: the estimator with no adjustments. In this case, our standard error is identical to the adjusted standard error proposed by Bai et al. (2022).
(ii) naïve: the linear adjustment with regressors but without pair dummies, as in (13).
(iii) naïve2: the linear adjustment with regressors interacted with treatment but without pair dummies, as in (14).
(iv) pfe: the linear adjustment with regressors and pair dummies, as in (15).
(v) refit: the refitted $\ell_1$-regularized adjustment, i.e., a linear regression with pair dummies on the regularized fitted values.
See Section C in the Online Supplement for the regressors used in the regularized adjustments.
For Models 1–11, we examine the performance of estimators (i)–(v). For Models 12–15, we assess the performance of estimators (i) and (v) in high-dimensional settings. Note that the adjustments are misspecified for almost all the models. The only exception is Model 1, for which the linear adjustment is correctly specified because the conditional mean of the outcome is just a linear function of the regressors.
6.3 Simulation Results
Tables 1 and 3 report rejection probabilities at the 0.05 level and power of the different methods for Models 1–11 when $n$ is 100 and 200, respectively. Several patterns emerge. First, for all the estimators, the rejection rates under $H_0$ are close to the nominal level even when $n = 100$ and the adjustments are misspecified. This result is expected because all the estimators take into account the dependence structure arising in the “matched pairs” design, consistent with the findings in Bai et al. (2022).
Second, in terms of power, “pfe” is higher than “unadj”, “naïve”, and “naïve2” for all eleven models, as predicted by our theory. This finding confirms that “pfe” is the optimal linear adjustment and does not degrade the precision of the ATE estimator. In contrast, we observe that “naïve” and “naïve2” in Model 3 are even less powerful than the unadjusted estimator “unadj”. Figure 1 further confirms that these two methods inflate the estimation standard error. This result echoes Freedman’s critique (Freedman, 2008) that careless regression adjustments may degrade estimation precision. Our “pfe” addresses this issue because it is proven to be weakly more efficient than the unadjusted estimator.
Third, the improvement in power for “pfe” is mainly due to the reduction in estimation standard errors, which can exceed 50%, as shown in Figure 1 for Models 4–9. This means that the length of the confidence interval for the “pfe” estimator is about half of that for the “unadj” estimator. Note the standard error of the “unadj” estimator is the one proposed by Bai et al. (2022), which has already been adjusted to account for the cross-sectional dependence created by pair matching. The extra reduction of up to 50% is therefore produced purely by the regression adjustment. For Models 10–11, the reduction in standard errors achieved by “pfe” is more than 40% as well. For Model 1, the linear regression is correctly specified, so all three linear adjustments achieve the global minimum asymptotic variance and maximum power. For Model 2, the linear adjustment satisfies the conditions in Theorem 3.1 under which “pfe”, as the best linear adjustment, is also the best adjustment globally, achieving the global minimum asymptotic variance and maximum power. In contrast, “naïve” and “naïve2” are not the best linear adjustments and are therefore less powerful than “pfe” because of the omitted pair dummies.
Finally, the “refit” method has the best power for most models, as it automatically achieves the global minimum asymptotic variance when the dimension of the covariates is fixed.
Tables 2 and 4 report the size and power for the “refit” adjustments when the covariates are high-dimensional. We see that the size under the null is close to the nominal 5%, while the power for the adjusted estimator is higher than for the unadjusted one. Figure 1 further illustrates that the reduction in the standard error is more than 30% for all the high-dimensional models.
Table 1: Rejection probabilities (%) for Models 1–11, $n = 100$. The left panel reports size under $H_0$; the right panel reports power under $H_1$.
Model | unadj | naïve | naïve2 | pfe | refit | unadj | naïve | naïve2 | pfe | refit
---|---|---|---|---|---|---|---|---|---|---
1 | 5.47 | 5.57 | 5.63 | 5.76 | 5.84 | 22.48 | 43.89 | 43.95 | 43.91 | 43.92 |
2 | 4.96 | 5.26 | 5.30 | 5.47 | 5.32 | 23.32 | 28.02 | 27.96 | 37.21 | 33.12 |
3 | 4.99 | 5.28 | 5.24 | 5.48 | 5.27 | 32.19 | 27.88 | 27.96 | 37.34 | 36.29 |
4 | 5.31 | 5.28 | 5.28 | 5.48 | 5.79 | 11.78 | 27.88 | 28.03 | 37.34 | 43.28 |
5 | 5.43 | 5.09 | 5.08 | 5.49 | 5.78 | 11.87 | 27.72 | 27.88 | 36.69 | 43.08 |
6 | 5.28 | 5.43 | 5.41 | 5.58 | 5.79 | 11.78 | 26.67 | 26.72 | 34.71 | 40.29 |
7 | 5.64 | 5.63 | 5.62 | 5.98 | 6.04 | 9.24 | 34.55 | 34.65 | 37.96 | 42.08 |
8 | 5.63 | 5.54 | 5.51 | 6.03 | 6.17 | 9.28 | 34.11 | 34.42 | 37.22 | 41.29 |
9 | 5.74 | 5.69 | 5.76 | 6.19 | 5.89 | 8.99 | 32.39 | 32.30 | 35.42 | 38.75 |
10 | 5.24 | 5.78 | 5.73 | 6.05 | 6.04 | 14.27 | 30.80 | 30.75 | 32.02 | 32.51 |
11 | 5.19 | 5.78 | 5.72 | 6.07 | 5.95 | 14.36 | 30.60 | 30.49 | 32.21 | 32.81 |
Table 2: Rejection probabilities (%) for the high-dimensional Models 12–15, $n = 100$. The left panel reports size under $H_0$; the right panel reports power under $H_1$.
Model | unadj | refit | unadj | refit
---|---|---|---|---
12 | 5.35 | 6.12 | 22.01 | 42.56 |
13 | 5.31 | 6.11 | 21.47 | 42.47 |
14 | 5.24 | 6.07 | 21.39 | 41.14 |
15 | 5.31 | 6.23 | 20.73 | 38.67 |
Table 3: Rejection probabilities (%) for Models 1–11, $n = 200$. The left panel reports size under $H_0$; the right panel reports power under $H_1$.
Model | unadj | naïve | naïve2 | pfe | refit | unadj | naïve | naïve2 | pfe | refit
---|---|---|---|---|---|---|---|---|---|---
1 | 5.08 | 5.04 | 5.10 | 5.21 | 5.31 | 38.94 | 70.35 | 70.36 | 70.32 | 70.30 |
2 | 5.69 | 5.28 | 5.28 | 5.24 | 5.40 | 40.31 | 49.25 | 49.32 | 65.36 | 57.87 |
3 | 5.44 | 5.29 | 5.30 | 5.35 | 5.41 | 56.89 | 49.43 | 49.51 | 64.96 | 62.42 |
4 | 5.45 | 5.29 | 5.29 | 5.35 | 5.20 | 18.55 | 49.43 | 49.67 | 64.96 | 69.96 |
5 | 5.45 | 5.24 | 5.18 | 5.19 | 5.29 | 18.41 | 48.65 | 48.80 | 64.11 | 69.09 |
6 | 5.62 | 5.32 | 5.31 | 5.35 | 5.43 | 18.19 | 46.71 | 46.67 | 61.09 | 65.98 |
7 | 5.24 | 5.51 | 5.46 | 5.34 | 5.49 | 11.86 | 60.73 | 60.63 | 65.14 | 69.24 |
8 | 5.23 | 5.49 | 5.47 | 5.35 | 5.65 | 11.84 | 60.00 | 60.10 | 64.93 | 68.02 |
9 | 5.30 | 5.58 | 5.57 | 5.66 | 5.81 | 11.90 | 57.25 | 57.28 | 61.61 | 64.88 |
10 | 5.34 | 5.19 | 5.15 | 5.25 | 5.31 | 23.95 | 55.49 | 55.44 | 56.64 | 56.43 |
11 | 5.41 | 5.36 | 5.32 | 5.34 | 5.41 | 23.88 | 55.01 | 55.05 | 56.31 | 56.18 |
Table 4: Rejection probabilities (%) for the high-dimensional Models 12–15, $n = 200$. The left panel reports size under $H_0$; the right panel reports power under $H_1$.
Model | unadj | refit | unadj | refit
---|---|---|---|---
12 | 4.97 | 5.22 | 38.91 | 68.10 |
13 | 4.95 | 5.19 | 38.04 | 68.06 |
14 | 5.01 | 5.24 | 37.65 | 66.69 |
15 | 5.15 | 5.40 | 36.61 | 63.79 |
[Figure 1 here]
Notes: The figure plots the average standard error reduction in percentage achieved by regression adjustments relative to “unadj” under $H_0$ for Models 1–15.
7 Empirical Illustration
In this section, we revisit the randomized experiment with a matched-pairs design conducted by Groh and McKenzie (2016), who examined the impact of macroinsurance on microenterprises. We apply the covariate adjustment methods developed in this paper to their data and reinvestigate the average effect of macroinsurance on three outcome variables: the microenterprises’ monthly profits, revenues, and investment.
The subjects in the experiment are microenterprise owners who were clients of the largest microfinance institution in Egypt. In the randomization, after an exact match on gender and the institution’s branch code, those clients were grouped into pairs by applying an optimal greedy algorithm to 13 additional matching variables. Within each pair, a macroinsurance product was then offered to one randomly assigned client, and the other acted as a control. Based on the pair identities and all the matching variables, we re-order the pairs in our sample according to the procedure described in Section 5.1 of Jiang et al. (2022). The resulting sample contains 2824 microenterprise owners, that is, 1412 pairs (see Groh and McKenzie (2016) and Jiang et al. (2022) for more details).
Table 5 reports the ATEs with the standard errors (in parentheses) estimated by different methods. Among them, “GM” corresponds to the method used in Groh and McKenzie (2016), who estimated the effect by a regression with regressors including some baseline variables, a dummy for missing observations, and dummies for the pairs: specifically, for profits and revenues, the regressors are the baseline value of the outcome of interest, a dummy for missing observations, and pair dummies; for investment, the regressors only include pair dummies. The standard errors for the “GM” ATE estimates are calculated by the usual heteroskedasticity-consistent estimator, and the “GM” results in Table 5 were obtained by applying the Stata code provided by Groh and McKenzie (2016). The description of the other methods is similar to that in Section 6.2, with the following details: (i) the matching covariates include gender and the 13 additional matching variables for all adjustments, three of which are continuous and the rest dummies; (ii) to maintain comparability, we keep the extra covariates consistent across all adjustments except “refit” for each outcome variable—for profits and revenue, they include the baseline value of the outcome of interest, a dummy for whether the firm is above the 95th percentile of the control firms’ distribution of the outcome variable, and a dummy for missing observations, while for investment they include all the covariates used for the first two outcome variables; (iii) for “refit”, we intentionally expand the dimension of the covariates by additionally including interactions of the continuous baseline variables with three continuous and the first three discrete matching variables; (iv) all continuous variables are standardized when the regression-adjusted estimators are employed. The results in this table prompt the following observations.
First, in line with our theoretical and simulation findings, we observe that the standard errors associated with the covariate-adjusted ATEs, particularly those for the “naïve2” and “pfe” estimates, are generally lower than those for the ATE estimate without any adjustment. This pattern is consistent across nearly all the outcome variables. To illustrate, for the revenue outcome, the standard errors for the “pfe” estimates are 10.2% smaller than those for the unadjusted ATE estimate.
Second, the standard errors of the “refit” estimates are consistently smaller than those of the unadjusted ATE estimate across all the outcome variables. For example, when profits are the outcome variable, the “refit” estimates exhibit standard errors 7.5% smaller than those of the unadjusted ATE estimate. Moreover, compared with those of the “pfe” estimates, the standard errors of “refit” are slightly smaller.
Table 5
Y | n | unadj | GM | naïve | naïve2 | pfe | refit
---|---|---|---|---|---|---|---
Profits | 1322 | -85.65 | -50.88 | -41.69 | -50.97 | -51.60 | -55.13
| | (49.43) | (46.46) | (47.22) | (45.49) | (46.94) | (45.71)
Revenue | 1318 | -838.60 | -660.16 | -611.75 | -610.80 | -635.80 | -600.97
| | (319.02) | (284.02) | (286.93) | (282.93) | (286.50) | (284.60)
Investment | 1410 | -66.60 | -66.60 | -49.37 | -50.72 | -67.31 | -58.77
| | (118.93) | (118.66) | (119.23) | (118.97) | (118.88) | (118.84)
Notes: The table reports the ATE estimates of the effect of macroinsurance for microenterprises. Standard errors are in parentheses.
8 Conclusion
This paper considers covariate adjustment for the estimation of the average treatment effect in “matched pairs” designs when covariates other than the matching variables are available. When the dimension of these covariates is low, we suggest estimating the average treatment effect by a linear regression of the outcome on treatment status and covariates, controlling for pair fixed effects. We show that this estimator is no worse than the simple difference-in-means estimator in terms of efficiency. When the dimension of these covariates is high, we suggest a two-step estimation procedure: in the first step, we run $\ell_1$-regularized regressions of the outcome on the covariates for the treated and control groups separately and obtain the fitted values for both potential outcomes, and in the second step, we estimate the average treatment effect by refitting a linear regression of the outcome on treatment status and the regularized adjustments from the first step, controlling for pair fixed effects. We show that the final estimator is no worse than the simple difference-in-means estimator in terms of efficiency. When the conditional mean models are approximately correctly specified, this estimator further achieves the minimum variance, as if all relevant covariates had been used to form pairs in the design stage of the experiment. We take the choice of variables used in forming pairs as given and focus on how to obtain more efficient estimators of the average treatment effect in the analysis stage. Our paper is therefore silent on the important question of how to choose the relevant matching variables in the design stage. This topic is left for future research.
Appendix A Proofs of Main Results
In the appendix, for two sequences $a_n$ and $b_n$, we use $a_n \lesssim b_n$ to denote that there exists a constant $C < \infty$ such that $a_n \le C b_n$.
A.1 Proof of Theorem 3.1
Step 1: Decomposition by recursive conditioning
To begin, note
(22) |
where the third equality follows from (9). Similarly,
(23) |
It follows from (22)–(23) that
(24) |
where
Next, consider
For simplicity, define for . It follows from Assumption 2.2 that . On the other hand,
where the inequality follows from and the convergence follows from Assumptions 2.3 and 3.1(c). By Markov’s inequality and the fact that , for any ,
Since probabilities are bounded, we have . This fact, together with (24), implies
where
Note that conditional on and , and are independent while and are constants.
Step 2: Conditional central limit theorems
We first analyze the limiting behavior of . Define
Note by Assumption 2.2 that . We proceed to verify the Lindeberg condition for conditional on and , i.e., we show that for every ,
(25) |
To that end, first note Lemma B.2 implies
(26) |
(26) and Assumption 3.1(a) imply that for all ,
(27) |
Furthermore, for some ,
(28) |
Next, note for any and , the left-hand side of (25) can be written as
(29) |
where the first inequality follows by inspection, the second follows from (27)–(28), and the last follows from Assumption 2.2. We then argue
(30) |
To this end, we once again verify the Lindeberg condition in Lemma 11.4.2 of Lehmann and Romano (2005). Note
Therefore, in light of Lemma B.1, we only need to verify
(31) |
which follows immediately from Lemma B.3.
Another application of (31) implies (25). Lindeberg’s central limit theorem and (26) then imply that
Similar arguments lead to
Step 3: Combining conditional and unconditional components
Meanwhile, it follows from the same arguments as those in (S.22)–(S.25) of Bai et al. (2022) that
To establish (10), define , where
Note
Further note are all constants conditional on and . Suppose by contradiction that does not converge in distribution to . Then, there exists and a subsequence such that
(32) |
Because the sequence and are bounded by Assumptions 3.1(b), there is a further subsequence, which with some abuse of notation we still denote by , along which and for some . Then, all converge to constants. Therefore, it follows from Lemma S.1.2 of Bai et al. (2022) that
a contradiction to (32). Therefore, the desired convergence in Theorem 3.1 follows.
Step 4: Rearranging the variance formula
To conclude the proof with the variance formula as stated in the theorem, note
(33) |
where the first equality follows from the law of total variance, the second one follows by direct calculation, and the last one follows by expanding the variance of the sum. Similarly,
(34) |
It follows that
where the first equality follows by definition, the second one follows from (33)–(34), the third one again follows by definition, and the last one follows because by the law of iterated expectations,
The conclusion then follows.
A.2 Proof of Theorem 3.2
Theorem 3.1 implies . Next, we show
(35) |
To that end, define
Note
Therefore, to establish (35), we first show
(36) |
and
(37) |
(37) immediately follows from repeated applications of the inequality and (12). To verify (36), note
It follows from similar arguments to those in the proof of Lemma B.2 below that
Similarly, it follows from the proof of the same lemma that
To establish (36), note
where the last equality follows from the definition of and . It then follows from the Cauchy-Schwarz inequality that
which, together with (36)–(37) as well as Assumptions 2.1(b) and 3.1(b), imply (35).
Next, we show
(38) |
Note
(39) | ||||
In what follows, we show
(40) | ||||
(41) | ||||
(42) | ||||
(43) | ||||
(44) |
To establish (40)–(41), note they follow directly from (36) and Assumptions 2.1(b) and 3.1(b). Next, note (42) follows from repeated applications of the inequality and (12). (43) can be established by similar arguments. (44) follows from similar arguments to those in the proof of Lemma S.1.7 of Bai et al. (2022), with the uniform integrability arguments replaced by arguments similar to those in the proof of Lemma B.2, together with Assumptions 2.1–2.4 and 3.1. (39)–(44) imply (38) immediately.
Finally, note we have shown
Assumption 3.1(a) implies is bounded away from zero, so
The conclusion of the theorem then follows.
A.3 Proof of Theorem 4.1
We will apply the Frisch-Waugh-Lovell theorem to obtain an expression for . Consider the linear regression of on and . Define
for and
The th residual based on the OLS estimation of this linear regression model is given by
is then given by the OLS estimator of the coefficient in the linear regression of on . Note
It follows from Assumption 4.1(b) and the weak law of large number that
On the other hand, it follows from Assumptions 2.2–2.3 and 4.1(b)–(c) as well as similar arguments to those in the proof of Lemma S.1.5 of Bai et al. (2022) that
for . Therefore,
Next,
It follows from similar arguments as above as well as Assumptions 2.1(b), 2.2–2.3, and 4.1(b)–(c) that
The convergence of therefore follows from the continuous mapping theorem and Assumption 4.1(a).
To see (12) is satisfied, note
(12) then follows from the fact that , Assumption 4.1(b), and the weak law of large numbers. To establish (9), first note
In what follows, we establish
(45) |
from which (9) follows immediately because . Note by Assumption 2.2 that . Also note
where
We will argue are all . Since this could be carried out separately for each entry of and , we assume without loss of generality that . First, it follows from Assumptions 2.2–2.3 and 4.1(c) as well as similar arguments to those in the proof of Lemma S.1.4 of Bai et al. (2022) that
It then follows from similar arguments using the Lindeberg central limit theorem as in the proof of Lemma S.1.4 of Bai et al. (2022) that . Similar arguments establish . Finally, we show . Note that and by Assumptions 2.2–2.3 and 4.1(c),
Therefore, for any fixed , Markov’s inequality implies
Since probabilities are bounded and therefore uniformly integrable, we have that
Therefore, (45) follows. Finally, it is straightforward to see Assumption 3.1 is implied by Assumption 4.1.
A.4 Proof of Theorem 4.2
By the Frisch-Waugh-Lovell theorem, is equal to the OLS estimator in the linear regression of on and . To apply the Frisch-Waugh-Lovell theorem again, we study the linear regression of on . The OLS estimator of the regression coefficient in such a regression equals
The residual is therefore . equals the OLS estimator of the coefficient in the linear regression of on those residuals. Define
Apparently . A moment’s thought reveals that further equals the coefficient estimate using least squares in the linear regression of on for . It follows from Assumptions 2.1(b)–(c), 2.2–2.3, and 4.1(b)–(c) as well as similar arguments to those in the proof of Lemma S.1.5 of Bai et al. (2022) that
(46) | ||||
Next, note that
(47) |
For convenience, we introduce the following notation:
The first term in (47) converges in probability to by the weak law of large numbers. For the second term, we have that
where the convergence in probability holds because of Assumptions 2.2–2.3 and 4.1(c). It follows from Assumptions 2.2–2.3 and 4.1(b)–(c) as well as similar arguments to those in the proof of Lemma S.1.6 of Bai et al. (2022) that
Therefore,
We now turn to
Note that
It follows from Assumptions 2.1(b)–(c), 2.2–2.3, 4.1(b)–(c) as well as similar arguments to those in the proof of Lemma S.1.6 of Bai et al. (2022) that
The convergence in probability of now follows from Assumption 4.1(a) and the continuous mapping theorem. (9)–(12) can be established using similar arguments to those in the proof of Theorem 4.1. Finally, it is straightforward to see Assumption 3.1 is implied by Assumption 4.1.
A.5 Proof of Theorem 5.1
We divide the proof into three steps. In the first step, we show
(48)
In the second step, we show (9), (12), and Assumption 3.1 hold. In the third step, we show the asymptotic variance achieves the minimum under the approximately correct specification condition in Theorem 5.1.
Step 1: Proof of (48)
Note that
Rearranging the terms, we then have
(49)
On the event , we have
where and the last inequality follows from (18) and the fact that
Next, define
and let be the support of . Then, we have
and thus,
Further define and (for example, if , we have ), and recall . Then, together with (A.5), we have
(50)
Define
For sufficiently large , we have . It follows from Bickel et al. (2009) and Assumption 5.4 that
Therefore, we have
Step 2: Proof of (9), (12), and Assumption 3.1
Next, we show (9) for . First note
Next, note that it follows from Assumption 2.2 that conditional on and ,
is a sequence of independent Rademacher random variables. Therefore, Hoeffding’s inequality implies
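In generic notation, the form of Hoeffding's inequality applied here, conditionally, is the standard tail bound for a Rademacher-weighted sum: if $\varepsilon_1, \dots, \varepsilon_n$ are independent Rademacher random variables and $a_1, \dots, a_n$ are constants (here, measurable functions of the conditioning variables), then for every $t > 0$,
\[
P\Big( \Big| \sum_{i=1}^{n} a_i \varepsilon_i \Big| > t \Big) \;\le\; 2 \exp\Big( - \frac{t^2}{2 \sum_{i=1}^{n} a_i^2} \Big).
\]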
Define
We then have
(51)
Next, we determine the order of . Note
where is an i.i.d. sequence of Rademacher random variables,
and is the th element of . Note the second inequality follows from Lemma 2.3.1 of van der Vaart and Wellner (1996), the third inequality follows from Theorem 4.12 of Ledoux and Talagrand (1991) and the definition of , and the last follows from Assumption 5.1. Note also has an envelope and
because of Assumption 5.1. Because the cardinality of is , for any we have that
where is the covering number for class under the metric using balls of radius . Therefore, Corollary 5.1 of Chernozhukov et al. (2014) implies
Therefore, . Together with (51), they imply
In light of (48) and Assumption 5.3, we have
Next, note that Assumptions 3.1(a) and 3.1(b) follow from Assumption 5.1, and Assumption 3.1(c) follows from Assumptions 5.1 and 5.2.
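For orientation, the bound imported from Bickel et al. (2009) in Step 1 is, in generic notation and under their restricted eigenvalue condition with constant $\kappa$, of the following type: if the penalty level $\lambda$ dominates the maximal score and the target coefficient is (approximately) $s$-sparse, then
\[
\frac{1}{n} \sum_{i=1}^{n} \big( X_i'(\hat{\theta} - \theta) \big)^2 \;\lesssim\; \frac{s \lambda^2}{\kappa^2}
\qquad \text{and} \qquad
\| \hat{\theta} - \theta \|_1 \;\lesssim\; \frac{s \lambda}{\kappa^2},
\]
so that the usual choice $\lambda \asymp \sqrt{\log p / n}$ delivers the prediction rate $s \log p / n$.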
Step 3: Asymptotic variance
Suppose the true specification is approximately sparse as specified in Theorem 5.1. Let , , and . Then, we have
This concludes the proof.
A.6 Proof of Theorem 5.2
We divide the proof into three steps. In the first step, we show
(52)
In the second step, we show (9), (12), and Assumption 3.1 hold. In the third step, we show that and .
Step 1: Proof of (52)
Let
By the proof of Theorem 4.2, equals the coefficient estimate using least squares in the linear regression of on . Then, for any such that , we have
where the second inequality is by the fact that
and the last equality is by the proof of Theorem 5.1. This implies
where the last equality holds due to the same argument as used in the proof of Theorem 4.2. Similarly, we can show that
which leads to (52).
Step 2: Proof of (9), (12), and Assumption 3.1
We first show (9). We have
where the last equality holds by (52) and the facts that
as shown in Theorem 5.1 and
Next, we show (12). We note that
Step 3: Asymptotic variance
Appendix B Auxiliary Lemmas
Lemma B.1.
Suppose is a sequence of random variables satisfying
(53)
Suppose is another random variable defined on the same probability space with . Then,
(54)
Proof.
Fix . We will show there exists so that
(55)
First note the event is measurable with respect to the $\sigma$-algebra generated by , and therefore
(56)
Next, by Theorem 10.3.5 of Dudley (1989), (53) implies that there exists a such that for any sequence of events such that , we have
(57)
In light of the previous result, note
By Theorem 10.3.5 of Dudley (1989) again, (53) implies , so by choosing large enough, we can make sure
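The content of Theorem 10.3.5 of Dudley (1989) used in this proof is the equivalence between convergence in $L^1$ and uniform integrability combined with convergence in probability. In particular, if $W_n \to W$ in $L^1$, then $\{W_n\}$ is uniformly integrable, and for any events $A_n$ with $P(A_n) \to 0$ and any $M > 0$,
\[
E\big[ |W_n| \mathbf{1}\{A_n\} \big] \;\le\; M \, P(A_n) + \sup_{m} E\big[ |W_m| \mathbf{1}\{ |W_m| > M \} \big],
\]
which can be made arbitrarily small by sending $n \to \infty$ and then $M \to \infty$.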
Proof of Lemma B.2.
To begin, note it follows from Assumption 2.2 and that
(58)
Next,
(59)
In what follows, we will show
To that end, first note from Assumptions 2.3 and 3.1(c) that
Next, note
where the first inequality follows from the triangle inequality, the second follows from the Cauchy-Schwarz inequality, and the last follows from Assumptions 2.1(c) and 3.1(c). To see that the convergence holds, first note that because
the weak law of large numbers implies
On the other hand,
Assumption 3.1(b) and Lemma B.1 imply
Therefore, Lemma 11.4.2 of Lehmann and Romano (2005) implies
Finally, note is bounded for by Assumption 3.1(b), so
The desired convergence therefore follows.
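For completeness, the result from Lemma 11.4.2 of Lehmann and Romano (2005) invoked here is, in generic form, a uniform integrability criterion for convergence of moments:
\[
Y_n \xrightarrow{d} Y
\quad \text{and} \quad
\lim_{M \to \infty} \sup_{n} E\big[ |Y_n| \mathbf{1}\{ |Y_n| > M \} \big] = 0
\quad \Longrightarrow \quad
E[Y_n] \to E[Y].
\]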
Next, we argue
(61)
To establish (61), we verify the uniform integrability condition in Lemma 11.4.2 of Lehmann and Romano (2005). To that end, we will repeatedly use the inequality
(62)
(63)
Note
where in the second inequality we use the fact that the variance of a random variable is bounded by its second moment. Note Assumption 3.1 implies is bounded for , and therefore
On the other hand
(64)
It follows from Assumptions 2.1(b) and 3.1(b) together with Lemma B.1 that
For the last term in (64), note
Meanwhile,
It then follows from the previous two inequalities, Assumption 3.1(b), and Lemma B.1 that
Similar arguments establish
Therefore, (61) follows. The conclusion then follows from (60)–(61) and Assumption 3.1(a).
Proof of Lemma B.3.
Note
where the first inequality follows from and the second inequality follows from (62). Next, note
where the first inequality follows from (62), the second one follows from the conditional Jensen’s inequality and (63), and the third one follows again from the conditional Jensen’s inequality. It then follows from Lemma B.1 together with Assumptions 2.1(b) and 3.1(b) that
Similar arguments lead to
The conclusion then follows.
Lemma B.4.
Proof.
For the first result, we note that
The first two terms on the RHS of the above display are . The last term on the RHS is also by Chebyshev’s inequality. This implies the desired result.
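The Chebyshev step is the usual variance bound for a sample average: if $Z_1, \dots, Z_n$ are i.i.d. with finite variance and $\bar{Z}_n = n^{-1} \sum_{i=1}^{n} Z_i$, then for each fixed $\epsilon > 0$,
\[
P\big( | \bar{Z}_n - E[Z_1] | > \epsilon \big) \;\le\; \frac{\operatorname{Var}(Z_1)}{n \epsilon^2} \;\to\; 0.
\]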
For the second result, define
and
We aim to show that and . The desired result then follows by letting , which implies
First, we show . Let
for some sufficiently large constant and be a sequence of i.i.d. Rademacher random variables independent of everything else. Then, for any fixed , we have
where the first inequality is by van der Vaart and Wellner (1996, Lemma 2.3.7), the second inequality is by Hoeffding’s inequality conditional on and the fact that, on ,
where is a fixed constant, and the last equality is by the fact that . Furthermore, we note that
Therefore, we have
for any fixed , which is the desired result.
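To our reading, Lemma 2.3.7 of van der Vaart and Wellner (1996) is the symmetrization inequality for probabilities, which, modulo constants, takes the form
\[
\beta_n(x) \, P\Big( \Big\| \sum_{i=1}^{n} Z_i \Big\|_{\mathcal{F}} > x \Big)
\;\le\;
2 \, P\Big( 4 \Big\| \sum_{i=1}^{n} \varepsilon_i Z_i \Big\|_{\mathcal{F}} > x \Big)
\]
for independent mean-zero processes $Z_1, \dots, Z_n$, independent Rademacher variables $\varepsilon_1, \dots, \varepsilon_n$, and $\beta_n(x) = \inf_{f \in \mathcal{F}} P\big( | \sum_{i=1}^{n} Z_i(f) | \le x/2 \big)$, which is bounded away from zero here by a Chebyshev-type bound on the second moments.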
Next, we show . Define . Then, we have
where, conditional on , is a sequence of i.i.d. Rademacher random variables, the second-to-last inequality is by Hoeffding’s inequality, and the last inequality follows because, on ,
By letting for some sufficiently large and noting that , we have
and thus, .
Next, we show . We note that, for , conditional on , are independent. In what follows, we couple
with a centered Gaussian random vector as in Theorem 2.1 in Chernozhukov et al. (2017). Let be a Gaussian random vector with for and that additionally satisfies the conditions of that theorem. Specifically, is a centered Gaussian random vector in such that on ,
and
Further define as the quantile of . Then, we have
where the first inequality is by the last display in the proof of Lemma E.2 in Chetverikov and Sørensen (2022) and the second inequality is by the fact that for . Therefore, we have
where the second inequality is by Theorem 2.1 in Chernozhukov et al. (2017).
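Schematically, Theorem 2.1 of Chernozhukov et al. (2017) is a high-dimensional central limit theorem over max-statistics: for independent centered vectors $X_1, \dots, X_n \in \mathbb{R}^p$ satisfying their moment conditions with envelope constant $B_n$, there exists a centered Gaussian vector $G \in \mathbb{R}^p$ with matching covariance such that
\[
\sup_{t \in \mathbb{R}} \Big| P\Big( \max_{1 \le j \le p} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_{ij} \le t \Big) - P\Big( \max_{1 \le j \le p} G_j \le t \Big) \Big|
\;\lesssim\;
\Big( \frac{B_n^2 \log^7 (pn)}{n} \Big)^{1/6},
\]
a bound of the flavor used in the two displays above.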
Finally, we turn to with . We have
(65)
Note is a sequence of independent centered random variables and
Following Theorem 2.1 in Chernozhukov et al. (2017), Lemma E.2 in Chetverikov and Sørensen (2022), and similar arguments to the ones above, we have
(66)
For the second term on the RHS of (65), we define . We have
where, conditional on , is a sequence of i.i.d. Rademacher random variables and the last inequality is by Hoeffding’s inequality. In addition, on , we have
Therefore, we have
(67)
Combining (65), (66), (67), and the fact that , we have . The same result holds for .
Appendix C Details for Simulations
The regressors in the LASSO-based adjustment are as follows.
- (i) For Models 1-6, we use where and are the sample medians of and , respectively.
- (ii) For Models 7-9, we use where and , for , are the sample medians of and , respectively.
- (iii) For Models 10-11, we use where , for , and , for , are the sample medians of and , respectively.
- (iv) Models 12-15 already contain high-dimensional covariates; we simply use and as the LASSO regressors. (A schematic construction of these regressor sets is sketched below.)
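Purely as an illustration, the role the sample medians play in constructing the LASSO regressors can be sketched in the following Python snippet. The function name, the variable names, and the specific feature map (levels, squares, and interactions with above-median indicators) are our own hypothetical shorthand for the descriptions in (i)-(iii), not code used in the paper.

import numpy as np

def lasso_regressors(x, extra=None):
    """Hypothetical sketch: build a LASSO regressor matrix from raw
    covariates by adding squares and interactions with indicators for
    exceeding the coordinatewise sample medians, as in (i)-(iii)."""
    x = np.atleast_2d(np.asarray(x, dtype=float))  # n x d covariate matrix
    med = np.median(x, axis=0)                     # coordinatewise sample medians
    above = (x > med).astype(float)                # indicators 1{x_ij > median_j}
    feats = [x, x ** 2, x * above]                 # levels, squares, median interactions
    if extra is not None:                          # extra raw regressors, as in (iv)
        feats.append(np.atleast_2d(np.asarray(extra, dtype=float)))
    return np.hstack(feats)

# Hypothetical usage with simulated covariates:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Z = lasso_regressors(X)
print(Z.shape)  # (200, 6): 2 levels + 2 squares + 2 median interactions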
References
- Abadie and Imbens (2008) Abadie, A. and Imbens, G. W. (2008). Estimation of the Conditional Variance in Paired Experiments. Annales d’Économie et de Statistique, 175–187.
- Armstrong (2022) Armstrong, T. B. (2022). Asymptotic Efficiency Bounds for a Class of Experimental Designs. arXiv:2205.02726, URL http://arxiv.org/abs/2205.02726.
- Bai et al. (2023a) Bai, Y., Liu, J., Shaikh, A. M. and Tabord-Meehan, M. (2023a). On the Efficiency of Finely Stratified Experiments. arXiv:2307.15181, URL http://arxiv.org/abs/2307.15181.
- Bai et al. (2023b) Bai, Y., Liu, J. and Tabord-Meehan, M. (2023b). Inference for Matched Tuples and Fully Blocked Factorial Designs. arXiv:2206.04157, URL http://arxiv.org/abs/2206.04157.
- Bai et al. (2022) Bai, Y., Romano, J. P. and Shaikh, A. M. (2022). Inference in Experiments With Matched Pairs. Journal of the American Statistical Association, 117 1726–1737. URL https://doi.org/10.1080/01621459.2021.1883437.
- Belloni et al. (2017) Belloni, A., Chernozhukov, V., Fernández-Val, I. and Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica, 85 233–298.
- Bickel et al. (2009) Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37 1705–1732. URL https://projecteuclid.org/journals/annals-of-statistics/volume-37/issue-4/Simultaneous-analysis-of-Lasso-and-Dantzig-selector/10.1214/08-AOS620.full.
- Bruhn and McKenzie (2009) Bruhn, M. and McKenzie, D. (2009). In pursuit of balance: Randomization in practice in development field experiments. American Economic Journal: Applied Economics, 1 200–232.
- Chernozhukov et al. (2014) Chernozhukov, V., Chetverikov, D. and Kato, K. (2014). Gaussian approximation of suprema of empirical processes. The Annals of Statistics, 42 1564–1597.
- Chernozhukov et al. (2017) Chernozhukov, V., Chetverikov, D. and Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. The Annals of Probability, 45 2309–2352. URL https://projecteuclid.org/journals/annals-of-probability/volume-45/issue-4/Central-limit-theorems-and-bootstrap-in-high-dimensions/10.1214/16-AOP1113.full.
- Chetverikov and Sørensen (2022) Chetverikov, D. and Sørensen, J. R.-V. (2022). Analytic and Bootstrap-after-Cross-Validation Methods for Selecting Penalty Parameters of High-Dimensional M-Estimators. arXiv:2104.04716, URL http://arxiv.org/abs/2104.04716.
- Cohen and Fogarty (2023) Cohen, P. L. and Fogarty, C. B. (2023). No-harm calibration for generalized Oaxaca-Blinder estimators. Biometrika. Forthcoming.
- Cytrynbaum (2023) Cytrynbaum, M. (2023). Covariate adjustment in stratified experiments.
- Donner and Klar (2000) Donner, A. and Klar, N. (2000). Design and analysis of cluster randomization trials in health research, vol. 27. Arnold London.
- Dudley (1989) Dudley, R. M. (1989). Real Analysis and Probability. Wadsworth & Brooks/Cole.
- Freedman (2008) Freedman, D. A. (2008). On regression adjustments to experimental data. Advances in Applied Mathematics, 40 180–193. URL http://www.sciencedirect.com/science/article/pii/S019688580700005X.
- Glennerster and Takavarasha (2013) Glennerster, R. and Takavarasha, K. (2013). Running randomized evaluations: A practical guide. Princeton University Press.
- Groh and McKenzie (2016) Groh, M. and McKenzie, D. (2016). Macroinsurance for microenterprises: A randomized experiment in post-revolution egypt. Journal of Development Economics, 118 13–25.
- Jiang et al. (2022) Jiang, L., Liu, X., Phillips, P. C. and Zhang, Y. (2022). Bootstrap inference for quantile treatment effects in randomized experiments with matched pairs. Review of Economics and Statistics. Forthcoming.
- Ledoux and Talagrand (1991) Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Classics in Mathematics, Springer-Verlag, Berlin Heidelberg. URL https://www.springer.com/gp/book/9783642202117.
- Lehmann and Romano (2005) Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses. 3rd ed. Springer, New York.
- Lin (2013) Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. Annals of Applied Statistics, 7 295–318. URL https://projecteuclid.org/euclid.aoas/1365527200.
- Negi and Wooldridge (2021) Negi, A. and Wooldridge, J. M. (2021). Revisiting regression adjustment in experiments with heterogeneous treatment effects. Econometric Reviews, 40 504–534. URL https://doi.org/10.1080/07474938.2020.1824732.
- Robins et al. (1995) Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1995). Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data. Journal of the American Statistical Association, 90 106–121. URL https://www.jstor.org/stable/2291134.
- Rosenberger and Lachin (2015) Rosenberger, W. F. and Lachin, J. M. (2015). Randomization in clinical trials: Theory and Practice. John Wiley & Sons.
- Tsiatis et al. (2008) Tsiatis, A. A., Davidian, M., Zhang, M. and Lu, X. (2008). Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Statistics in Medicine, 27 4658–4677.
- van der Vaart and Wellner (1996) van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes with Applications to Statistics. Springer-Verlag, New York.
- Wager et al. (2016) Wager, S., Du, W., Taylor, J. and Tibshirani, R. J. (2016). High-dimensional regression adjustments in randomized experiments. Proceedings of the National Academy of Sciences, 113 12673–12678.
- Wu and Gagnon-Bartsch (2021) Wu, E. and Gagnon-Bartsch, J. A. (2021). Design-based covariate adjustments in paired experiments. Journal of Educational and Behavioral Statistics, 46 109–132.
- Yang and Tsiatis (2001) Yang, L. and Tsiatis, A. A. (2001). Efficiency Study of Estimators for a Treatment Effect in a Pretest–Posttest Trial. The American Statistician, 55 314–321. URL https://doi.org/10.1198/000313001753272466.
- Zhao and Ding (2021) Zhao, A. and Ding, P. (2021). Covariate-adjusted Fisher randomization tests for the average treatment effect. Journal of Econometrics, 225 278–294. URL https://www.sciencedirect.com/science/article/pii/S0304407621001457.