

Covariate Adjustment in Experiments with Matched Pairs

Yichong Zhang acknowledges the financial support from the NSFC under the grant No. 72133002 and a Lee Kong Chian fellowship. Any and all errors are our own.

Yuehao Bai
Department of Economics
University of Southern California
[email protected]
   Liang Jiang
Fanhai International School of Finance
Fudan University
[email protected]
   Joseph P. Romano
Departments of Economics & Statistics
Stanford University
[email protected]
   Azeem M. Shaikh
Department of Economics
University of Chicago
[email protected]
   Yichong Zhang
School of Economics
Singapore Management University
[email protected]
Abstract

This paper studies inference on the average treatment effect (ATE) in experiments in which treatment status is determined according to “matched pairs” and it is additionally desired to adjust for observed, baseline covariates to gain further precision. By a “matched pairs” design, we mean that units are sampled i.i.d. from the population of interest, paired according to observed, baseline covariates and finally, within each pair, one unit is selected at random for treatment. Importantly, we presume that not all observed, baseline covariates are used in determining treatment assignment. We study a broad class of estimators based on a “doubly robust” moment condition that permits us to study estimators with both finite-dimensional and high-dimensional forms of covariate adjustment. We find that estimators with finite-dimensional, linear adjustments need not lead to improvements in precision relative to the unadjusted difference-in-means estimator. This phenomenon persists even if the adjustments are interacted with treatment; in fact, doing so leads to no changes in precision. However, gains in precision can be ensured by including fixed effects for each of the pairs. Indeed, we show that this adjustment leads to the minimum asymptotic variance of the corresponding ATE estimator among all finite-dimensional, linear adjustments. We additionally study an estimator with a regularized adjustment, which can accommodate high-dimensional covariates. We show that this estimator leads to improvements in precision relative to the unadjusted difference-in-means estimator and also provide conditions under which it leads to the “optimal” nonparametric, covariate adjustment. A simulation study confirms the practical relevance of our theoretical analysis, and the methods are employed to reanalyze data from an experiment using a “matched pairs” design to study the effect of macroinsurance on microenterprise.

KEYWORDS: Experiment, matched pairs, covariate adjustment, randomized controlled trial, treatment assignment, LASSO

JEL classification codes: C12, C14

1 Introduction

This paper studies inference on the average treatment effect in experiments in which treatment status is determined according to “matched pairs.” By a “matched pairs” design, we mean that units are sampled i.i.d. from the population of interest, paired according to observed, baseline covariates and finally, within each pair, one unit is selected at random for treatment. This method is used routinely in all parts of the sciences. Indeed, commands to facilitate its implementation are included in popular software packages, such as sampsi in Stata. References to a variety of specific examples can be found, for instance, in the following surveys of various field experiments: Donner and Klar (2000), Glennerster and Takavarasha (2013), and Rosenberger and Lachin (2015). See also Bruhn and McKenzie (2009), who, based on a survey of selected development economists, report that 56% of researchers have used such a design at some point. Bai et al. (2022) develop methods for inference on the average treatment effect in such experiments based on the difference-in-means estimator. In this paper, we pursue the goal of improving upon the precision of this estimator by exploiting observed, baseline covariates that are not used in determining treatment status.

To this end, we study a broad class of estimators for the average treatment effect based on a “doubly robust” moment condition. The estimators in this framework are distinguished via different “working models” for the conditional expectations of potential outcomes under treatment and control given the observed, baseline covariates. Importantly, because of the double-robustness, these “working models” need not be correctly specified in order for the resulting estimator to be consistent. In this way, the framework permits us to study both finite-dimensional and high-dimensional forms of covariate adjustment without imposing unreasonable restrictions on the conditional expectations themselves. Under high-level conditions on the “working models” and their corresponding estimators and a requirement that pairs are formed so that units within pairs are suitably “close” in terms of the baseline covariates, we derive the limiting distribution of the covariate-adjusted estimator of the average treatment effect. We further construct an estimator for the variance of the limiting distribution and provide conditions under which it is consistent for this quantity.

Using our general framework, we first consider finite-dimensional, linear adjustments. For this class of estimators, our main findings are summarized as follows. First, we find that estimators with such adjustments are not guaranteed to be weakly more efficient than the unadjusted difference-in-means estimator. This finding echoes similar findings by Yang and Tsiatis (2001) and Tsiatis et al. (2008) in settings in which treatment is determined by i.i.d. coin flips, and Freedman (2008) in a finite population setting in which treatment is determined according to complete randomization. See Negi and Wooldridge (2021) for a succinct treatment of that literature. Moreover, we find that this phenomenon persists even if the adjustments are interacted with treatment. In fact, doing so leads to no changes in precision. In this sense, our results diverge from those in settings with complete randomization and treated fraction one half, where adjustments based on the uninteracted and interacted linear adjustments both guarantee gains in precision. Last, we show that estimators with both uninteracted and interacted linear adjustments with pair fixed effects are guaranteed to be weakly more efficient than the unadjusted difference-in-means estimator.

We then use our framework to consider high-dimensional adjustments based on 1\ell_{1} penalization. Specifically, we first obtain an intermediate estimator by using the LASSO to estimate the “working model” for the relevant conditional expectations. When the treatment is determined according to “matched pairs,” however, this estimator need not be more precise than the unadjusted difference-in-means estimator. Therefore, following Cohen and Fogarty (2023), we consider, in an additional step, an estimator based on the finite-dimensional, linear adjustment described above that uses the predicted values for the “working model” as the covariates and includes fixed effects for each of the pairs. We show that the resulting estimator improves upon both the intermediate estimator and the unadjusted difference-in-means estimator in terms of precision. Moreover, we provide conditions under which the refitted adjustments attain the relevant efficiency bound derived by Armstrong (2022).

Concurrent with our paper, Cytrynbaum (2023) considers covariate adjustment in experiments in which units are grouped into tuples with possibly more than two units, rather than pairs. Both our paper and Cytrynbaum (2023) find that finite-dimensional, linear regression adjustments with pair fixed effects are guaranteed to improve precision relative to the unadjusted difference-in-means estimator, and show that such adjustments are indeed optimal among all linear adjustments. However, Cytrynbaum (2023) does not pursue more general forms of covariate adjustments, including the regularized adjustments described above. Our results on such adjustments permit us to study nonparametric adjustments as well as high-dimensional adjustments using covariates whose dimension diverges rapidly with the sample size.

The remainder of our paper is organized as follows. In Section 2, we describe our setup and notation. In particular, there we describe the precise sense in which we require that units in each pair are “close” in terms of their baseline covariates. In Section 3, we introduce our general class of estimators based on a “doubly robust” moment condition. Under certain high-level conditions on the “working models” and their corresponding estimators, we derive the limiting behavior of the covariate-adjusted estimator. In Section 4, we use our general framework to study a variety of estimators with finite-dimensional, linear covariate adjustment. In Section 5, we use our general framework to study covariate adjustment based on the regularized regression. In Section 6, we examine the finite-sample behavior of tests based on these different estimators via a small simulation study. We find that covariate adjustment can lead to considerable gains in precision. Finally, in Section 7, we apply our methods to reanalyze data from an experiment using a “matched pairs” design to study the effect of macroinsurance on microenterprise. Proofs of all results and some details for simulations are given in the Online Supplement.

2 Setup and Notation

Let Yi𝐑Y_{i}\in\mathbf{R} denote the (observed) outcome of interest for the iith unit, Di{0,1}D_{i}\in\{0,1\} be an indicator for whether the iith unit is treated, and Xi𝐑kxX_{i}\in\mathbf{R}^{k_{x}} and Wi𝐑kwW_{i}\in\mathbf{R}^{k_{w}} denote observed, baseline covariates for the iith unit; XiX_{i} and WiW_{i} will be distinguished below through the feature that only the former will be used in determining treatment assignment. Further denote by Yi(1)Y_{i}(1) the potential outcome of the iith unit if treated and by Yi(0)Y_{i}(0) the potential outcome of the iith unit if not treated. The (observed) outcome and potential outcomes are related to treatment status by the relationship

Yi=Yi(1)Di+Yi(0)(1Di).Y_{i}=Y_{i}(1)D_{i}+Y_{i}(0)(1-D_{i})~{}. (1)

For a random variable indexed by ii, AiA_{i}, it will be useful to denote by A(n)A^{(n)} the random vector (A1,,A2n)(A_{1},\ldots,A_{2n}). Denote by PnP_{n} the distribution of the observed data Z(n)Z^{(n)}, where Zi=(Yi,Di,Xi,Wi)Z_{i}=(Y_{i},D_{i},X_{i},W_{i}), and by QnQ_{n} the distribution of U(n)U^{(n)}, where Ui=(Yi(1),Yi(0),Xi,Wi)U_{i}=(Y_{i}(1),Y_{i}(0),X_{i},W_{i}). Note that PnP_{n} is determined by (1), QnQ_{n}, and the mechanism for determining treatment assignment. We assume throughout that U(n)U^{(n)} consists of 2n2n i.i.d. observations, i.e., Qn=Q2nQ_{n}=Q^{2n}, where QQ is the marginal distribution of UiU_{i}. We therefore state our assumptions below in terms of assumptions on QQ and the mechanism for determining treatment assignment. Indeed, we will not make reference to PnP_{n} in the sequel, and all operations are understood to be under QQ and the mechanism for determining the treatment assignment. Our object of interest is the average effect of the treatment on the outcome of interest, which may be expressed in terms of this notation as

Δ(Q)=E[Yi(1)Yi(0)].\Delta(Q)=E[Y_{i}(1)-Y_{i}(0)]~{}. (2)

We now describe our assumptions on QQ. We restrict QQ to satisfy the following mild requirement:

Assumption 2.1.

The distribution QQ is such that

  1. (a)

    0<E[Var[Yi(d)|Xi]]0<E[\mathrm{Var}[Y_{i}(d)|X_{i}]] for d{0,1}d\in\{0,1\}.

  2. (b)

    E[Yi2(d)]<E[Y_{i}^{2}(d)]<\infty for d{0,1}d\in\{0,1\}.

  3. (c)

    E[Yi(d)|Xi=x]E[Y_{i}(d)|X_{i}=x] and E[Yi2(d)|Xi=x]E[Y_{i}^{2}(d)|X_{i}=x] are Lipschitz for d{0,1}d\in\{0,1\}.

Next, we describe our assumptions on the mechanism determining treatment assignment. In order to describe these assumptions more formally, we require some further notation to define the relevant pairs of units. The nn pairs may be represented by the sets

{π(2j1),π(2j)} for j=1,,n,\{\pi(2j-1),\pi(2j)\}\text{ for }j=1,\ldots,n~{},

where π=πn(X(n))\pi=\pi_{n}(X^{(n)}) is a permutation of 2n2n elements. Because of its possible dependence on X(n)X^{(n)}, π\pi encompasses a broad variety of different ways of pairing the 2n2n units according to the observed, baseline covariates X(n)X^{(n)}. Given such a π\pi, we assume that treatment status is assigned as described in the following assumption:

Assumption 2.2.

Treatment status is assigned so that (Y(n)(1),Y(n)(0),W(n))D(n)|X(n)(Y^{(n)}(1),Y^{(n)}(0),W^{(n)})\perp\!\!\!\perp D^{(n)}|X^{(n)} and, conditional on X(n)X^{(n)}, (Dπ(2j1),Dπ(2j)),j=1,,n(D_{\pi(2j-1)},D_{\pi(2j)}),j=1,\ldots,n are i.i.d. and each uniformly distributed over the values in {(0,1),(1,0)}\{(0,1),(1,0)\}.

Following Bai et al. (2022), our analysis will additionally require some discipline on the way in which pairs are formed. Let 2\|\cdot\|_{2} denote the Euclidean norm. We will require that units in each pair are “close” in the sense described by the following assumption:

Assumption 2.3.

The pairs used in determining treatment status satisfy

1n1jnXπ(2j)Xπ(2j1)2rP0\frac{1}{n}\sum_{1\leq j\leq n}\|X_{\pi(2j)}-X_{\pi(2j-1)}\|_{2}^{r}\stackrel{{\scriptstyle P}}{{\rightarrow}}0

for r{1,2}r\in\{1,2\}.

It will at times be convenient to require further that units in consecutive pairs are also “close” in terms of their baseline covariates. One may view this requirement, which is formalized in the following assumption, as “pairing the pairs” so that they are “close” in terms of their baseline covariates.

Assumption 2.4.

The pairs used in determining treatment status satisfy

1n1jn2Xπ(4jk)Xπ(4j)22P0\frac{1}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}\|X_{\pi(4j-k)}-X_{\pi(4j-\ell)}\|_{2}^{2}\stackrel{{\scriptstyle P}}{{\rightarrow}}0

for any k{2,3}k\in\{2,3\} and {0,1}\ell\in\{0,1\}.

Bai et al. (2022) provide results to facilitate constructing pairs satisfying Assumptions 2.32.4 under weak assumptions on QQ. In particular, given pairs satisfying Assumption 2.3, it is frequently possible to “re-order” them so that Assumption 2.4 is satisfied. See Theorem 4.3 in Bai et al. (2022) for further details. As in Bai et al. (2022), we highlight the fact that Assumption 2.4 will only be used to enable consistent estimation of relevant variances.
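To fix ideas, the following is a minimal sketch, in Python, of how such a design might be simulated when XiX_{i} is scalar: units are sorted by XiX_{i}, adjacent units are paired, and treatment is randomized within each pair. The data-generating process and all variable names here are hypothetical and purely illustrative; sorting on a scalar covariate is one simple way of producing pairs consistent with Assumptions 2.3–2.4.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500                       # number of pairs; 2n units in total

# Hypothetical DGP: X is used to form pairs, W is observed but not used in the design.
X = rng.uniform(size=2 * n)
W = rng.normal(size=2 * n)
Y0 = X + 0.5 * W + rng.normal(size=2 * n)          # potential outcome under control
Y1 = 1.0 + X + 0.5 * W + rng.normal(size=2 * n)    # potential outcome under treatment

# Form pairs by sorting on X: adjacent units are paired (Assumption 2.3), and
# consecutive pairs are then also close in X (Assumption 2.4).
pi = np.argsort(X)                                 # pairs are {pi[2j], pi[2j+1]}

# Within each pair, one unit is treated at random (Assumption 2.2).
D = np.zeros(2 * n, dtype=int)
for j in range(n):
    treated = rng.integers(2)                      # uniform over {(0,1),(1,0)}
    D[pi[2 * j + treated]] = 1

Y = Y1 * D + Y0 * (1 - D)                          # observed outcome, as in (1)
```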

Remark 2.1.

Under this setup, Bai et al. (2022) consider the unadjusted difference-in-means estimator

Δ^nunadj=1n1i2nDiYi1n1i2n(1Di)Yi\displaystyle\hat{\Delta}_{n}^{\rm unadj}=\frac{1}{n}\sum_{1\leq i\leq 2n}D_{i}Y_{i}-\frac{1}{n}\sum_{1\leq i\leq 2n}(1-D_{i})Y_{i} (3)

and show that it is consistent and asymptotically normal with limiting variance

σunadj2(Q)=12Var[E[Yi(1)Yi(0)|Xi]]+E[Var[Yi(1)|Xi]]+E[Var[Yi(0)|Xi]].\displaystyle\sigma_{\mathrm{unadj}}^{2}(Q)=\frac{1}{2}\operatorname*{Var}[E[Y_{i}(1)-Y_{i}(0)|X_{i}]]+E[\operatorname*{Var}[Y_{i}(1)|X_{i}]]+E[\operatorname*{Var}[Y_{i}(0)|X_{i}]]~{}.

We note that Δ^nunadj\hat{\Delta}_{n}^{\rm unadj} is the unadjusted estimator because it does not use information in WiW_{i} in either the design or analysis stage. If both XiX_{i} and WiW_{i} are used to form pairs in the “matched pairs” design, then the difference-in-means estimator, which we refer to as Δ^nideal\hat{\Delta}_{n}^{\rm ideal}, has limiting variance

σideal2(Q)=12Var[E[Yi(1)Yi(0)|Xi,Wi]]+E[Var[Yi(1)|Xi,Wi]]+E[Var[Yi(0)|Xi,Wi]].\displaystyle\sigma^{2}_{\mathrm{ideal}}(Q)=\frac{1}{2}\operatorname*{Var}[E[Y_{i}(1)-Y_{i}(0)|X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(1)|X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(0)|X_{i},W_{i}]]~{}.

In this case, Δ^nideal\hat{\Delta}_{n}^{\rm ideal} achieves the efficiency bound derived by Armstrong (2022), and we can see that

σunadj2(Q)σideal2(Q)\displaystyle\sigma_{\mathrm{unadj}}^{2}(Q)-\sigma^{2}_{\mathrm{ideal}}(Q) =12E[Var[E[Yi(1)+Yi(0)|Xi,Wi]|Xi]]0.\displaystyle=\frac{1}{2}E[\operatorname*{Var}[E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]|X_{i}]]\geq 0~{}.

For related results for parameters other than the average treatment effect, see Bai et al. (2023a). We note, however, that it is not always practical to form pairs using both XiX_{i} and WiW_{i} for two reasons. First, the covariate WiW_{i} may only be collected along with the outcome variable and therefore may not be available at the design stage. Second, the quality of pairing decreases with the dimension of matching variables. Indeed, it is common in practice to match on some but not all baseline covariates. Such considerations motivate our analysis below.   
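On the hypothetical simulated data from the sketch above, the unadjusted difference-in-means estimator in (3) is a single line; it uses neither WiW_{i} nor the baseline covariates beyond their role in the design itself.

```python
# Difference-in-means estimator (3); by construction each arm contains exactly n units.
delta_unadj = Y[D == 1].mean() - Y[D == 0].mean()
```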

3 Main Results

To accommodate various forms of covariate-adjusted estimators of Δ(Q)\Delta(Q) in a single framework, it is useful to note that it follows from Assumption 2.2 that for any d{0,1}d\in\{0,1\} and any function md,n:𝐑kx×𝐑kw𝐑m_{d,n}:\mathbf{R}^{k_{x}}\times\mathbf{R}^{k_{w}}\rightarrow\mathbf{R} such that E[|md,n(Xi,Wi)|]<E[|m_{d,n}(X_{i},W_{i})|]<\infty,

E[2I{Di=d}(Yimd,n(Xi,Wi))+md,n(Xi,Wi)]=E[Yi(d)].E\left[2I\{D_{i}=d\}(Y_{i}-m_{d,n}(X_{i},W_{i}))+m_{d,n}(X_{i},W_{i})\right]=E[Y_{i}(d)]~{}. (4)

We note that (4) is just the augmented inverse propensity score weighted moment for E[Yi(d)]E[Y_{i}(d)] in which the propensity score is 1/21/2 and the conditional mean model is md,n(Xi,Wi)m_{d,n}(X_{i},W_{i}). Such a moment is also “doubly robust.” As the propensity score for the “matched pairs” design is exactly one half, we do not require the conditional mean model to be correctly specified, i.e., md,n(Xi,Wi)=E[Yi(d)|Xi,Wi]m_{d,n}(X_{i},W_{i})=E[Y_{i}(d)|X_{i},W_{i}]. See, for instance, Robins et al. (1995). Intuitively, md,nm_{d,n} is the “working model” which researchers use to estimate E[Yi(d)|Xi,Wi]E[Y_{i}(d)|X_{i},W_{i}], and can be arbitrarily misspecified because of (4). Although md,nm_{d,n} will be identical across n1n\geq 1 for the examples in Section 4, the notation permits md,nm_{d,n} to depend on the sample size nn in anticipation of the high-dimensional results in Section 5. Based on the moment condition in (4), our proposed estimator of Δ(Q)\Delta(Q) is given by

Δ^n=μ^n(1)μ^n(0),\hat{\Delta}_{n}=\hat{\mu}_{n}(1)-\hat{\mu}_{n}(0)~{}, (5)

where, for d{0,1}d\in\{0,1\},

μ^n(d)=12n1i2n(2I{Di=d}(Yim^d,n(Xi,Wi))+m^d,n(Xi,Wi))\hat{\mu}_{n}(d)=\frac{1}{2n}\sum_{1\leq i\leq 2n}(2I\{D_{i}=d\}(Y_{i}-\hat{m}_{d,n}(X_{i},W_{i}))+\hat{m}_{d,n}(X_{i},W_{i})) (6)

and m^d,n\hat{m}_{d,n} is a suitable estimator of the “working model” md,nm_{d,n} in (4).

By some simple algebra (we thank the referee for this excellent point), we have

Δ^n=1n1i2nDiY~i1n1i2n(1Di)Y~i,\displaystyle\hat{\Delta}_{n}=\frac{1}{n}\sum_{1\leq i\leq 2n}D_{i}\tilde{Y}_{i}-\frac{1}{n}\sum_{1\leq i\leq 2n}(1-D_{i})\tilde{Y}_{i}~{}, (7)

where

Y~i\displaystyle\tilde{Y}_{i} =Yi12(m^1,n(Xi,Wi)+m^0,n(Xi,Wi)).\displaystyle=Y_{i}-\frac{1}{2}(\hat{m}_{1,n}(X_{i},W_{i})+\hat{m}_{0,n}(X_{i},W_{i}))~{}. (8)

This means that our regression-adjusted estimator can be viewed as a difference-in-means estimator applied to the “adjusted” outcome Y~i\tilde{Y}_{i}.
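As a concrete illustration, the sketch below computes Δ^n\hat{\Delta}_{n} in (5)–(6) from generic fitted working-model values and, equivalently, via the adjusted-outcome representation (7)–(8). The function names are ours, and the inputs m1_hat and m0_hat stand for whatever fitted values m^d,n(Xi,Wi)\hat{m}_{d,n}(X_{i},W_{i}) the researcher has formed; with both set to zero, either function reduces to the unadjusted difference-in-means estimator.

```python
import numpy as np

def adjusted_ate(Y, D, m1_hat, m0_hat):
    """Covariate-adjusted estimator (5)-(6); m1_hat and m0_hat hold the fitted
    working-model values, one entry per unit."""
    mu1 = np.mean(2 * (D == 1) * (Y - m1_hat) + m1_hat)
    mu0 = np.mean(2 * (D == 0) * (Y - m0_hat) + m0_hat)
    return mu1 - mu0

def adjusted_ate_dim(Y, D, m1_hat, m0_hat):
    """Equivalent difference-in-means form (7)-(8) based on the adjusted outcomes."""
    Y_tilde = Y - 0.5 * (m1_hat + m0_hat)
    return Y_tilde[D == 1].mean() - Y_tilde[D == 0].mean()
```

Since each arm contains exactly n units, the two functions return identical values.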

We require some new discipline on the behavior of md,nm_{d,n} for d{0,1}d\in\{0,1\} and n1n\geq 1:

Assumption 3.1.

The functions md,nm_{d,n} for d{0,1}d\in\{0,1\} and n1n\geq 1 satisfy

  1. (a)

    For d{0,1}d\in\{0,1\},

    lim infnE[Var[Yi(d)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]]>0.\liminf_{n\to\infty}E\left[\operatorname*{Var}\left[Y_{i}(d)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\Bigg{|}X_{i}\right]\right]>0~{}.
  2. (b)

    For d{0,1}d\in\{0,1\},

    limλlim supnE[md,n2(Xi,Wi)I{|md,n(Xi,Wi)|>λ}]=0.\lim_{\lambda\to\infty}\limsup_{n\to\infty}E[m_{d,n}^{2}(X_{i},W_{i})I\{|m_{d,n}(X_{i},W_{i})|>\lambda\}]=0~{}.
  3. (c)

    E[md,n(Xi,Wi)|Xi=x]E[m_{d,n}(X_{i},W_{i})|X_{i}=x], E[md,n2(Xi,Wi)|Xi=x]E[m_{d,n}^{2}(X_{i},W_{i})|X_{i}=x], E[md,n(Xi,Wi)Yi(d)|Xi=x]E[m_{d,n}(X_{i},W_{i})Y_{i}(d)|X_{i}=x] for d{0,1}d\in\{0,1\}, and E[m1,n(Xi,Wi)m0,n(Xi,Wi)|Xi=x]E[m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|X_{i}=x] are Lipschitz uniformly over n1n\geq 1.

Assumption 3.1(a) is an assumption to rule out degenerate situations. Assumption 3.1(b) is a mild uniform integrability assumption on the “working models.” If md,n()md()m_{d,n}(\cdot)\equiv m_{d}(\cdot) for d{0,1}d\in\{0,1\}, then it is satisfied as long as E[md2(Xi,Wi)]<E[m_{d}^{2}(X_{i},W_{i})]<\infty. Assumption 3.1(c) ensures that units that are “close” in terms of the observed covariates are also “close” in terms of potential outcomes, uniformly across n1n\geq 1.

Theorem 3.1 below establishes the limit in distribution of Δ^n\hat{\Delta}_{n}. We note that the theorem depends on high-level conditions on md,n()m_{d,n}(\cdot) and m^d,n()\hat{m}_{d,n}(\cdot). In the sequel, these conditions will be verified in several examples.

Theorem 3.1.

Suppose QQ satisfies Assumption 2.1, the treatment assignment mechanism satisfies Assumptions 2.22.3, and md,n()m_{d,n}(\cdot) for d{0,1}d\in\{0,1\} and n1n\geq 1 satisfy Assumption 3.1. Further suppose m^d,n()\hat{m}_{d,n}(\cdot) satisfies

12n1i2n(2Di1)(m^d,n(Xi,Wi)md,n(Xi,Wi))P0.\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)(\hat{m}_{d,n}(X_{i},W_{i})-m_{d,n}(X_{i},W_{i}))\stackrel{{\scriptstyle P}}{{\rightarrow}}0~{}. (9)

Then, Δ^n\hat{\Delta}_{n} defined in (5) satisfies

n(Δ^nΔ(Q))σn(Q)dN(0,1),\frac{\sqrt{n}(\hat{\Delta}_{n}-\Delta(Q))}{\sigma_{n}(Q)}\stackrel{{\scriptstyle d}}{{\rightarrow}}N(0,1)~{}, (10)

where σn2(Q)=σ1,n2(Q)+σ2,n2(Q)+σ3,n2(Q)\sigma_{n}^{2}(Q)=\sigma_{1,n}^{2}(Q)+\sigma_{2,n}^{2}(Q)+\sigma_{3,n}^{2}(Q) with

σ1,n2(Q)\displaystyle\sigma_{1,n}^{2}(Q) =12E[Var[E[Yi(1)+Yi(0)|Xi,Wi](m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]]\displaystyle=\frac{1}{2}E[\operatorname*{Var}[E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]-(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))|X_{i}]]
σ2,n2(Q)\displaystyle\sigma_{2,n}^{2}(Q) =12Var[E[Yi(1)Yi(0)|Xi,Wi]]\displaystyle=\frac{1}{2}\operatorname*{Var}[E[Y_{i}(1)-Y_{i}(0)|X_{i},W_{i}]]
σ3,n2(Q)\displaystyle\sigma_{3,n}^{2}(Q) =E[Var[Yi(1)|Xi,Wi]]+E[Var[Yi(0)|Xi,Wi]].\displaystyle=E[\operatorname*{Var}[Y_{i}(1)|X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(0)|X_{i},W_{i}]]~{}.

In order to facilitate the use of Theorem 3.1 for inference about Δ(Q)\Delta(Q), we next provide a consistent estimator of σn(Q)\sigma_{n}(Q). Define

τ^n2\displaystyle\hat{\tau}_{n}^{2} =1n1jn(Y~π(2j1)Y~π(2j))2\displaystyle=\frac{1}{n}\sum_{1\leq j\leq n}(\tilde{Y}_{\pi(2j-1)}-\tilde{Y}_{\pi(2j)})^{2}
λ^n\displaystyle\hat{\lambda}_{n} =2n1jn2(Y~π(4j3)Y~π(4j2))(Y~π(4j1)Y~π(4j))(Dπ(4j3)Dπ(4j2))(Dπ(4j1)Dπ(4j)),\displaystyle=\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\tilde{Y}_{\pi(4j-3)}-\tilde{Y}_{\pi(4j-2)})(\tilde{Y}_{\pi(4j-1)}-\tilde{Y}_{\pi(4j)})(D_{\pi(4j-3)}-D_{\pi(4j-2)})(D_{\pi(4j-1)}-D_{\pi(4j)})~{},

where Y~i\tilde{Y}_{i} is defined in (8). The variance estimator is given by

σ^n2=τ^n212(λ^n+Δ^n2).\hat{\sigma}_{n}^{2}=\hat{\tau}_{n}^{2}-\frac{1}{2}(\hat{\lambda}_{n}+\hat{\Delta}_{n}^{2})~{}. (11)

The variance estimator in (11), in particular its component λ^n\hat{\lambda}_{n}, is analogous to the “pairs of pairs” variance estimator in Bai et al. (2022). Such a variance estimator has also been used by Abadie and Imbens (2008) in a related setting. Note that it can be shown, arguing as in Remark 3.9 of Bai et al. (2022), that σ^n2\hat{\sigma}_{n}^{2} in (11) is nonnegative.
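The following sketch, continuing the hypothetical simulation and the permutation pi from Section 2, computes the adjusted outcomes, τ^n2\hat{\tau}_{n}^{2}, λ^n\hat{\lambda}_{n}, the variance estimator (11), and a confidence interval based on the normal approximation (see Theorem 3.2 below). It assumes the pairs are already ordered so that consecutive pairs are “close” (Assumption 2.4), as they are when pairing by sorting on a scalar XiX_{i}.

```python
import numpy as np
from scipy.stats import norm

def matched_pairs_variance(Y, D, m1_hat, m0_hat, pi):
    """Adjusted ATE estimate and variance estimator (11); pi defines the pairs
    {pi[2j], pi[2j+1]}, assumed ordered so that consecutive pairs are close."""
    n = len(Y) // 2
    Y_t = Y - 0.5 * (m1_hat + m0_hat)                # adjusted outcomes (8)
    delta_hat = Y_t[D == 1].mean() - Y_t[D == 0].mean()

    dY = Y_t[pi[0::2]] - Y_t[pi[1::2]]               # within-pair differences
    dD = D[pi[0::2]] - D[pi[1::2]]                   # equals +1 or -1
    tau2 = np.mean(dY ** 2)                          # tau_hat_n^2

    m = n // 2                                       # lambda_hat_n: products across consecutive pairs
    lam = (2.0 / n) * np.sum(dY[0:2 * m:2] * dY[1:2 * m:2] * dD[0:2 * m:2] * dD[1:2 * m:2])

    sigma2 = tau2 - 0.5 * (lam + delta_hat ** 2)     # equation (11)
    return delta_hat, sigma2

# Usage on the hypothetical data; zero adjustments reproduce the unadjusted estimator.
m1_hat = m0_hat = np.zeros_like(Y)
delta_hat, sigma2 = matched_pairs_variance(Y, D, m1_hat, m0_hat, pi)
n_pairs = len(Y) // 2
se = np.sqrt(sigma2 / n_pairs)
ci = (delta_hat - norm.ppf(0.975) * se, delta_hat + norm.ppf(0.975) * se)   # alpha = 0.05
```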

Theorem 3.2 below establishes the consistency of this estimator and its implications for inference about Δ(Q)\Delta(Q). In the statement of the theorem, we make use of the following notation: for any scalars aa and bb, [a±b][a\pm b] is understood to be [ab,a+b][a-b,a+b].

Theorem 3.2.

Suppose QQ satisfies Assumption 2.1, the treatment assignment mechanism satisfies Assumptions 2.22.4, and md,n()m_{d,n}(\cdot) for d{0,1}d\in\{0,1\} and n1n\geq 1 satisfy Assumption 3.1. Further suppose m^d,n()\hat{m}_{d,n}(\cdot) satisfies (9) and

12n1i2n(m^d,n(Xi,Wi)md,n(Xi,Wi))2P0.\frac{1}{2n}\sum_{1\leq i\leq 2n}(\hat{m}_{d,n}(X_{i},W_{i})-m_{d,n}(X_{i},W_{i}))^{2}\stackrel{{\scriptstyle P}}{{\rightarrow}}0~{}. (12)

Then,

σ^nσn(Q)P1.\frac{\hat{\sigma}_{n}}{\sigma_{n}(Q)}\stackrel{{\scriptstyle P}}{{\rightarrow}}1~{}.

Hence, (10) holds with σ^n\hat{\sigma}_{n} in place of σn(Q)\sigma_{n}(Q). In particular, for any α(0,1)\alpha\in(0,1),

P{Δ(Q)[Δ^n±σ^nΦ1(1α2)]}1α,P\left\{\Delta(Q)\in\left[\hat{\Delta}_{n}\pm\hat{\sigma}_{n}\Phi^{-1}\left(1-\frac{\alpha}{2}\right)\right]\right\}\rightarrow 1-\alpha~{},

where Φ\Phi is the standard normal c.d.f.

Remark 3.1.

Based on (7), it is natural to estimate σn2(Q)\sigma_{n}^{2}(Q) using the usual estimator of the limiting variance of the difference-in-means estimator, i.e.,

σ^diff,n2\displaystyle\hat{\sigma}_{\mathrm{diff},n}^{2} =1n1i2nDi(Y~i(1n1i2nDiY~i))2+1n1i2n(1Di)(Y~i(1n1i2n(1Di)Y~i))2.\displaystyle=\frac{1}{n}\sum_{1\leq i\leq 2n}D_{i}\left(\tilde{Y}_{i}-\left(\frac{1}{n}\sum_{1\leq i\leq 2n}D_{i}\tilde{Y}_{i}\right)\right)^{2}+\frac{1}{n}\sum_{1\leq i\leq 2n}(1-D_{i})\left(\tilde{Y}_{i}-\left(\frac{1}{n}\sum_{1\leq i\leq 2n}(1-D_{i})\tilde{Y}_{i}\right)\right)^{2}.

However, it can be shown that σ^diff,n2=σdiff,n2(Q)+oP(1)\hat{\sigma}_{\mathrm{diff},n}^{2}=\sigma_{\mathrm{diff},n}^{2}(Q)+o_{P}(1), where

σdiff,n2(Q)=Var[Yi(1)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))]+Var[Yi(0)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))].\sigma_{\mathrm{diff},n}^{2}(Q)=\operatorname*{Var}\left[Y_{i}(1)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\right]+\operatorname*{Var}\left[Y_{i}(0)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\right]~{}.

Furthermore,

σdiff,n2(Q)σn2(Q)\displaystyle\sigma_{\mathrm{diff},n}^{2}(Q)-\sigma_{n}^{2}(Q) =12Var[E[Yi(1)+Yi(0)(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]]0,\displaystyle=\frac{1}{2}\operatorname*{Var}\left[E[Y_{i}(1)+Y_{i}(0)-(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))|X_{i}]\right]\geq 0~{},

where the inequality is strict unless

E[Yi(1)+Yi(0)(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]=E[Yi(1)+Yi(0)(m1,n(Xi,Wi)+m0,n(Xi,Wi))]\displaystyle E[Y_{i}(1)+Y_{i}(0)-(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))|X_{i}]=E[Y_{i}(1)+Y_{i}(0)-(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))]

with probability one. In this sense, the usual estimator of the limiting variance of the difference-in-means estimator is conservative.   
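For comparison, and continuing the same hypothetical variables, the conservative estimator σ^diff,n2\hat{\sigma}_{\mathrm{diff},n}^{2} of this remark can be computed as follows; numpy's .var() uses the 1/n normalization, matching the display above.

```python
# Usual difference-in-means variance estimator applied to the adjusted outcomes;
# it is conservative for sigma_n^2(Q) in the sense described in this remark.
Y_tilde = Y - 0.5 * (m1_hat + m0_hat)
sigma2_diff = Y_tilde[D == 1].var() + Y_tilde[D == 0].var()
```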

Remark 3.2.

An important and immediate implication of Theorem 3.1 is that σn2(Q)\sigma_{n}^{2}(Q) is minimized when

E[Yi(0)+Yi(1)|Xi,Wi]E[Yi(0)+Yi(1)|Xi]=\displaystyle E[Y_{i}(0)+Y_{i}(1)|X_{i},W_{i}]-E[Y_{i}(0)+Y_{i}(1)|X_{i}]=
m0,n(Xi,Wi)+m1,n(Xi,Wi)E[m0,n(Xi,Wi)+m1,n(Xi,Wi)|Xi]\displaystyle\hskip 56.9055ptm_{0,n}(X_{i},W_{i})+m_{1,n}(X_{i},W_{i})-E[m_{0,n}(X_{i},W_{i})+m_{1,n}(X_{i},W_{i})|X_{i}]

with probability one. In other words, the “working model” for E[Yi(0)+Yi(1)|Xi,Wi]E[Y_{i}(0)+Y_{i}(1)|X_{i},W_{i}], given by m0,n(Xi,Wi)+m1,n(Xi,Wi)m_{0,n}(X_{i},W_{i})+m_{1,n}(X_{i},W_{i}), need only be correct “on average” over the variables that are not used in determining the pairs. For such a choice of m0,n(Xi,Wi)m_{0,n}(X_{i},W_{i}) and m1,n(Xi,Wi)m_{1,n}(X_{i},W_{i}), σn2(Q)\sigma_{n}^{2}(Q) in Theorem 3.1 becomes simply

12Var[E[Yi(1)Yi(0)|Xi,Wi]]+E[Var[Yi(1)|Xi,Wi]]+E[Var[Yi(0)|Xi,Wi]],\frac{1}{2}\operatorname*{Var}[E[Y_{i}(1)-Y_{i}(0)\Big{|}X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(1)|X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(0)|X_{i},W_{i}]]~{},

which agrees with the variance obtained in Bai et al. (2022) when both XiX_{i} and WiW_{i} are used in determining the pairs. Such a variance also achieves the efficiency bound derived by Armstrong (2022).   

Remark 3.3.

Following Bai et al. (2023b), it is straightforward to extend the analysis in this paper to the case with multiple treatment arms and where treatment status is determined using a “matched tuples” design, but we do not pursue this further in this paper.   

Remark 3.4.

Following Bai et al. (2022), we conjecture that it is possible to establish the validity of a randomization test based on the test statistic studentized by a randomized version of (11). We emphasize that the validity of the randomization test depends crucially on the choice of studentization in the test statistic. See, for instance, Remark 3.16 in Bai et al. (2022). Such tests have been studied in finite-population settings with covariate adjustments by Zhao and Ding (2021). We leave a detailed analysis of randomization tests for future work.

4 Linear Adjustments

In this section, we consider linearly covariate-adjusted estimators of Δ(Q)\Delta(Q) based on a set of regressors generated by Xi𝐑kxX_{i}\in\mathbf{R}^{k_{x}} and Wi𝐑kwW_{i}\in\mathbf{R}^{k_{w}}. To this end, define ψi=ψ(Xi,Wi)\psi_{i}=\psi(X_{i},W_{i}), where ψ:𝐑kx×𝐑kw𝐑p\psi:\mathbf{R}^{k_{x}}\times\mathbf{R}^{k_{w}}\to\mathbf{R}^{p}. We impose the following assumptions on the function ψ\psi:

Assumption 4.1.

The function ψ\psi is such that

  1. (a)

    no component of ψ\psi is constant and E[Var[ψi|Xi]]E[\operatorname*{Var}[\psi_{i}|X_{i}]] is non-singular.

  2. (b)

    Var[ψi]<\operatorname*{Var}[\psi_{i}]<\infty.

  3. (c)

    E[ψi|Xi=x]E[\psi_{i}|X_{i}=x], E[ψiψi|Xi=x]E[\psi_{i}\psi_{i}^{\prime}|X_{i}=x], and E[ψiYi(d)|Xi=x]E[\psi_{i}Y_{i}(d)|X_{i}=x] for d{0,1}d\in\{0,1\} are Lipschitz.

Assumption 4.1 is analogous to Assumption 2.1. Note, in particular, that Assumption 4.1(a) rules out situations where ψi\psi_{i} is a function of XiX_{i} only. See Remark 4.3 for a discussion of the behavior of the covariate-adjusted estimators in such situations.

4.1 Linear Adjustments without Pair Fixed Effects

Consider the following linear regression model:

Yi=α+ΔDi+ψiβ+ϵi.Y_{i}=\alpha+\Delta D_{i}+\psi_{i}^{\prime}\beta+\epsilon_{i}~{}. (13)

Let α^nnaive\hat{\alpha}_{n}^{\rm naive}, Δ^nnaive\hat{\Delta}_{n}^{\rm naive}, and β^nnaive\hat{\beta}_{n}^{\rm naive} denote the OLS estimators of α\alpha, Δ\Delta, and β\beta in (13). We call these estimators naïve because the corresponding regression adjustment is subject to Freedman’s critique and can lead to an adjusted estimator that is less efficient than the simple difference-in-means estimator Δ^nunadj\hat{\Delta}_{n}^{\rm unadj}.
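A minimal sketch of this naïve adjustment on the hypothetical simulated data: regress YiY_{i} on a constant, DiD_{i}, and ψi\psi_{i} by OLS and read off the coefficient on DiD_{i}. The choice ψi=(Xi,Wi)\psi_{i}=(X_{i},W_{i})^{\prime} is an illustrative assumption.

```python
import numpy as np

psi = np.column_stack([X, W])                    # hypothetical choice of psi_i = psi(X_i, W_i)
Z = np.column_stack([np.ones(2 * n), D, psi])    # constant, D_i, psi_i as in (13)
coef = np.linalg.lstsq(Z, Y, rcond=None)[0]
delta_naive = coef[1]                            # OLS coefficient on D_i
```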

It follows from direct calculation that

Δ^nnaive=1n1i2n(Yiψiβ^nnaive)(2Di1).\hat{\Delta}_{n}^{\rm naive}=\frac{1}{n}\sum_{1\leq i\leq 2n}(Y_{i}-\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm naive})(2D_{i}-1)~{}.

Therefore, Δ^nnaive\hat{\Delta}_{n}^{\rm naive} satisfies (5)–(6) with

m^d,n(Xi,Wi)=ψiβ^nnaive.\hat{m}_{d,n}(X_{i},W_{i})=\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm naive}~{}.

Theorem 4.1 establishes (9) and (12) for a suitable choice of md,n(Xi,Wi)m_{d,n}(X_{i},W_{i}) for d{0,1}d\in\{0,1\} and, as a result, the limiting distribution of Δ^nnaive\hat{\Delta}_{n}^{\rm naive} and the validity of the variance estimator.

Theorem 4.1.

Suppose QQ satisfies Assumption 2.1 and the treatment assignment mechanism satisfies Assumptions 2.22.3. Further suppose ψ\psi satisfies Assumption 4.1. Then, as nn\to\infty,

β^nnaivePβnaive=Var[ψi]1Cov[ψi,Yi(1)+Yi(0)].\hat{\beta}_{n}^{\rm naive}\stackrel{{\scriptstyle P}}{{\to}}\beta^{\rm naive}=\operatorname*{Var}[\psi_{i}]^{-1}\operatorname*{Cov}[\psi_{i},Y_{i}(1)+Y_{i}(0)]~{}.

Moreover, (9), (12), and Assumption 3.1 are satisfied with

md,n(Xi,Wi)=ψiβnaivem_{d,n}(X_{i},W_{i})=\psi_{i}^{\prime}\beta^{\rm naive}

for d{0,1}d\in\{0,1\} and n1n\geq 1.

Remark 4.1.

Freedman (2008) studies regression adjustment based on (13) when treatment is assigned by complete randomization instead of a “matched pairs” design. In such settings, Lin (2013) proposes adjustment based on the following linear regression model:

Yi=α+ΔDi+(ψiψ¯n)γ+Di(ψiψ¯n)η+ϵi,Y_{i}=\alpha+\Delta D_{i}+(\psi_{i}-\bar{\psi}_{n})^{\prime}\gamma+D_{i}(\psi_{i}-\bar{\psi}_{n})^{\prime}\eta+\epsilon_{i}~{}, (14)

where

ψ¯n=12n1i2nψi.\bar{\psi}_{n}=\frac{1}{2n}\sum_{1\leq i\leq 2n}\psi_{i}~{}.

Let α^nint,Δ^nint,γ^nint,η^nint\hat{\alpha}_{n}^{\rm int},\hat{\Delta}_{n}^{\rm int},\hat{\gamma}_{n}^{\rm int},\hat{\eta}_{n}^{\rm int} denote the OLS estimators for α,Δ,γ,η\alpha,\Delta,\gamma,\eta in (14). It is straightforward to show Δ^nint\hat{\Delta}_{n}^{\rm int} satisfies (5)–(6) with

m^1,n(Xi,Wi)\displaystyle\hat{m}_{1,n}(X_{i},W_{i}) =(ψiμ^ψ,n(1))(γ^nint+η^nint)\displaystyle=(\psi_{i}-\hat{\mu}_{\psi,n}(1))^{\prime}(\hat{\gamma}_{n}^{\rm int}+\hat{\eta}_{n}^{\rm int})
m^0,n(Xi,Wi)\displaystyle\hat{m}_{0,n}(X_{i},W_{i}) =(ψiμ^ψ,n(0))γ^nint,\displaystyle=(\psi_{i}-\hat{\mu}_{\psi,n}(0))^{\prime}\hat{\gamma}_{n}^{\rm int}~{},

where

μ^ψ,n(d)=1n1i2nI{Di=d}ψi.\hat{\mu}_{\psi,n}(d)=\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\psi_{i}~{}.

It can be shown using similar arguments to those used to establish Theorem 4.1 that (9) and Assumption 3.1 are satisfied with

md,n(Xi,Wi)=(ψiE[ψi])Var[ψi]1Cov[ψi,Yi(d)]m_{d,n}(X_{i},W_{i})=(\psi_{i}-E[\psi_{i}])^{\prime}\operatorname*{Var}[\psi_{i}]^{-1}\operatorname*{Cov}[\psi_{i},Y_{i}(d)]

for d{0,1}d\in\{0,1\} and n1n\geq 1. It thus follows by inspecting the expression for σn2(Q)\sigma^{2}_{n}(Q) in Theorem 3.1 that the limiting variance of Δ^nint\hat{\Delta}_{n}^{\rm int} is the same as that of Δ^nnaive\hat{\Delta}_{n}^{\rm naive} based on (13).   
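For completeness, a short sketch of the interacted specification (14) on the same hypothetical data; by the discussion above, its limiting variance coincides with that of the uninteracted adjustment.

```python
psi_c = psi - psi.mean(axis=0)                              # psi_i - psi_bar_n
Z_int = np.column_stack([np.ones(2 * n), D, psi_c, D[:, None] * psi_c])
delta_int = np.linalg.lstsq(Z_int, Y, rcond=None)[0][1]     # coefficient on D_i in (14)
```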

Remark 4.2.

Note that Δ^nnaive\hat{\Delta}_{n}^{\rm naive} is the ordinary least squares estimator for Δ\Delta in the linear regression

Yiψiβ^nnaive=α+ΔDi+ϵi.\displaystyle Y_{i}-\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm naive}=\alpha+\Delta D_{i}+\epsilon_{i}~{}.

Furthermore, Theorem 4.1 implies that its limiting variance is σnaive2(Q)\sigma_{\mathrm{naive}}^{2}(Q), given by σn2(Q)\sigma_{n}^{2}(Q) in Theorem 3.1 with md(Xi,Wi)=ψiβnaivem_{d}(X_{i},W_{i})=\psi_{i}^{\prime}\beta^{\rm naive}. The usual heteroskedasticity-robust estimator of the limiting variance of Δ^nnaive\hat{\Delta}_{n}^{\rm naive} is, however, simply σ^diff,n2\hat{\sigma}_{\mathrm{diff},n}^{2} defined in Remark 3.1 with m^d,n(Xi,Wi)=ψiβ^nnaive\hat{m}_{d,n}(X_{i},W_{i})=\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm naive}. It thus follows that σ^diff,n2\hat{\sigma}_{\mathrm{diff},n}^{2} is conservative for σnaive2(Q)\sigma_{\mathrm{naive}}^{2}(Q) in the sense described therein. It is, of course, possible to estimate σnaive2(Q)\sigma_{\mathrm{naive}}^{2}(Q) consistently using σ^n2\hat{\sigma}_{n}^{2} proposed in Theorem 3.2 with m^d,n(Xi,Wi)=ψiβ^nnaive\hat{m}_{d,n}(X_{i},W_{i})=\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm naive}, but σnaive2(Q)\sigma_{\mathrm{naive}}^{2}(Q) is not guaranteed to be smaller than the limiting variance of the unadjusted estimator, i.e., σunadj2(Q)\sigma_{\mathrm{unadj}}^{2}(Q), so the linear adjustment without pair fixed effects can harm the precision of the estimator. Evidence of this phenomenon is provided in our simulations in Section 6.

4.2 Linear Adjustments with Pair Fixed Effects

Remark 4.1 implies that in “matched pairs” designs, including interaction terms in the linear regression does not lead to an estimator with lower limiting variance than the one based on the linear regression without interaction terms. It is therefore interesting to study whether there exists a linearly covariate-adjusted estimator with lower limiting variance than the ones based on (13) and (14) as well as the difference-in-means estimator. To that end, consider instead the following linear regression model:

Yi=ΔDi+ψiβ+1jnθjI{i{π(2j1),π(2j)}}+ϵi.Y_{i}=\Delta D_{i}+\psi_{i}^{\prime}\beta+\sum_{1\leq j\leq n}\theta_{j}I\{i\in\{\pi(2j-1),\pi(2j)\}\}+\epsilon_{i}~{}. (15)

Let Δ^npfe\hat{\Delta}_{n}^{\rm pfe}, β^npfe\hat{\beta}_{n}^{\rm pfe}, and θ^j,n\hat{\theta}_{j,n}, 1jn1\leq j\leq n denote the OLS estimators of Δ\Delta, β\beta, and θj\theta_{j}, 1jn1\leq j\leq n in (15), where “pfe” stands for pair fixed effect. It follows from the Frisch-Waugh-Lovell theorem that

Δ^npfe=1n1i2n(Yiψiβ^npfe)(2Di1).\hat{\Delta}_{n}^{\rm pfe}=\frac{1}{n}\sum_{1\leq i\leq 2n}(Y_{i}-\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm pfe})(2D_{i}-1)~{}.

Therefore, Δ^npfe\hat{\Delta}_{n}^{\rm pfe} satisfies (5)–(6) with

m^d,n(Xi,Wi)=ψiβ^npfe.\hat{m}_{d,n}(X_{i},W_{i})=\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm pfe}~{}.
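In practice, the pair fixed effects in (15) can be partialled out by demeaning all variables within pairs, which is exactly the Frisch-Waugh-Lovell step. The sketch below, continuing the hypothetical data and the permutation pi from earlier sections, computes Δ^npfe\hat{\Delta}_{n}^{\rm pfe} and β^npfe\hat{\beta}_{n}^{\rm pfe} in this way; the helper name pair_demean is ours.

```python
import numpy as np

def pair_demean(v, pi):
    """Subtract within-pair means, i.e., partial out the pair fixed effects in (15)."""
    out = np.empty(len(v), dtype=float)
    means = 0.5 * (v[pi[0::2]] + v[pi[1::2]])
    out[pi[0::2]] = v[pi[0::2]] - means
    out[pi[1::2]] = v[pi[1::2]] - means
    return out

Y_dm = pair_demean(Y, pi)
D_dm = pair_demean(D, pi)
psi_dm = np.column_stack([pair_demean(psi[:, k], pi) for k in range(psi.shape[1])])

# Regression of demeaned Y on demeaned D and psi (no constant: absorbed by the pair effects)
coef_pfe = np.linalg.lstsq(np.column_stack([D_dm, psi_dm]), Y_dm, rcond=None)[0]
delta_pfe, beta_pfe = coef_pfe[0], coef_pfe[1:]
```

With the fitted values ψiβ^npfe\psi_{i}^{\prime}\hat{\beta}_{n}^{\rm pfe} used as m^d,n(Xi,Wi)\hat{m}_{d,n}(X_{i},W_{i}), the variance estimator sketched in Section 3 applies directly.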

Theorem 4.2 establishes (9) and (12) for a suitable choice of md,n(Xi,Wi),d{0,1}m_{d,n}(X_{i},W_{i}),d\in\{0,1\} and, as a result, the limiting distribution of Δ^npfe\hat{\Delta}_{n}^{\rm pfe} and the validity of the variance estimator.

Theorem 4.2.

Suppose QQ satisfies Assumption 2.1 and the treatment assignment mechanism satisfies Assumptions 2.22.3. Then, as nn\to\infty,

β^npfePβpfe=(2E[Var[ψi|Xi]])1E[Cov[ψi,Yi(1)+Yi(0)|Xi]].\hat{\beta}_{n}^{\rm pfe}\stackrel{{\scriptstyle P}}{{\to}}\beta^{\rm pfe}=(2E[\operatorname*{Var}[\psi_{i}|X_{i}]])^{-1}E[\operatorname*{Cov}[\psi_{i},Y_{i}(1)+Y_{i}(0)|X_{i}]]~{}.

Moreover, (9), (12), and Assumption 3.1 are satisfied with

md,n(Xi,Wi)=ψiβpfem_{d,n}(X_{i},W_{i})=\psi_{i}^{\prime}\beta^{\rm pfe}

for d{0,1}d\in\{0,1\} and n1n\geq 1.

Remark 4.3.

When ψ\psi is restricted to be a function of XiX_{i} only, Δ^npfe\hat{\Delta}_{n}^{\rm pfe} coincides to first order with the unadjusted difference-in-means estimator Δ^nunadj\hat{\Delta}_{n}^{\rm unadj} defined in (3). To see this, suppose further that ψ\psi is Lipschitz and that Var[Yi(d)|Xi=x],d{0,1}\text{Var}[Y_{i}(d)|X_{i}=x],d\in\{0,1\} are bounded. The proof of Theorem 4.2 reveals that Δ^npfe\hat{\Delta}_{n}^{\rm pfe} and β^npfe\hat{\beta}_{n}^{\rm pfe} coincide with the OLS estimators of the intercept and slope parameters in a linear regression of (Yπ(2j)Yπ(2j1))(Dπ(2j)Dπ(2j1))(Y_{\pi(2j)}-Y_{\pi(2j-1)})(D_{\pi(2j)}-D_{\pi(2j-1)}) on a constant and (ψπ(2j)ψπ(2j1))(Dπ(2j)Dπ(2j1))(\psi_{\pi(2j)}-\psi_{\pi(2j-1)})(D_{\pi(2j)}-D_{\pi(2j-1)}). Using this observation, it follows by arguing as in Section S.1.1 of Bai et al. (2022) that

n(Δ^npfeΔ(Q))=n(Δ^nunadjΔ(Q))+oP(1).\sqrt{n}(\hat{\Delta}_{n}^{\rm pfe}-\Delta(Q))=\sqrt{n}(\hat{\Delta}_{n}^{\rm unadj}-\Delta(Q))+o_{P}(1)~{}.

See also Remark 3.8 of Bai et al. (2022).   

Remark 4.4.

Note that the expression for σn2(Q)\sigma_{n}^{2}(Q) in Theorem 3.1 depends on md,n(Xi,Wi),d{0,1}m_{d,n}(X_{i},W_{i}),d\in\{0,1\} only through σ1,n2(Q)\sigma_{1,n}^{2}(Q). With this in mind, consider the class of all linearly covariate-adjusted estimators based on ψi\psi_{i}, i.e., md,n(Xi,Wi)=ψiβ(d)m_{d,n}(X_{i},W_{i})=\psi_{i}^{\prime}\beta(d). For this specification of md,n(Xi,Wi),d{0,1}m_{d,n}(X_{i},W_{i}),d\in\{0,1\},

σ1,n2(Q)=E[(E[Yi(1)+Yi(0)|Xi,Wi]E[Yi(1)+Yi(0)|Xi](ψiE[ψi|Xi])(β(1)+β(0)))2].\sigma_{1,n}^{2}(Q)=E[(E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]-E[Y_{i}(1)+Y_{i}(0)|X_{i}]-(\psi_{i}-E[\psi_{i}|X_{i}])^{\prime}(\beta(1)+\beta(0)))^{2}]~{}.

It follows that among all such linear adjustments, σn2(Q)\sigma_{n}^{2}(Q) in (10) is minimized when

β(1)+β(0)=2βpfe.\beta(1)+\beta(0)=2\beta^{\rm pfe}~{}.

This observation implies that the linear adjustment with pair fixed effects, i.e., Δ^npfe\hat{\Delta}_{n}^{\rm pfe}, yields the optimal linear adjustment in the sense of minimizing σn2(Q)\sigma_{n}^{2}(Q). Its limiting variance is, in particular, weakly smaller than the limiting variance of the unadjusted difference-in-means estimator defined in (3). On the other hand, the covariate-adjusted estimators based on (13) or (14), i.e., Δ^nnaive\hat{\Delta}_{n}^{\rm naive} and Δ^nint\hat{\Delta}_{n}^{\rm int}, are in general not optimal among all linearly covariate-adjusted estimators based on ψi\psi_{i}. In fact, the limiting variances of these two estimators may even be larger than that of the unadjusted difference-in-means estimator.   

Remark 4.5.

A “matched pairs” design is essentially a nonparametric way to adjust for XiX_{i}. Projecting ψi\psi_{i} on the pair dummies in (15) is equivalent to pairwise demeaning, which effectively removes E(ψi|Xi)E(\psi_{i}|X_{i}) from ψi\psi_{i}. This is key to the optimality of Δ^npfe\hat{\Delta}_{n}^{\rm pfe} among all linearly adjusted estimators. Following the same logic, we expect that replacing the pair dummies with sieve bases of XiX_{i} in (15) would still effectively remove E(ψi|Xi)E(\psi_{i}|X_{i}) from ψi\psi_{i}, so that the resulting adjusted estimator is asymptotically equivalent to Δ^npfe\hat{\Delta}_{n}^{\rm pfe} and thus linearly optimal.

Remark 4.6.

Remark 4.2 also applies here with βnaive\beta^{\rm naive} replaced by βpfe\beta^{\rm pfe}. Even though Δ^npfe\hat{\Delta}_{n}^{\rm pfe} can be computed via OLS estimation of (15), we emphasize that the usual heteroskedasticity-robust standard errors, which naïvely treat the data (including treatment status) as if they were i.i.d., need not be consistent for the limiting variance derived in our analysis.

Remark 4.7.

One can also consider the estimator based on the following linear regression model:

Yi=ΔDi+(ψiψ¯n)γ+Di(ψiμ^ψ,n(1))η+1jnθjI{i{π(2j1),π(2j)}}+ϵi.Y_{i}=\Delta D_{i}+(\psi_{i}-\bar{\psi}_{n})^{\prime}\gamma+D_{i}(\psi_{i}-\hat{\mu}_{\psi,n}(1))^{\prime}\eta+\sum_{1\leq j\leq n}\theta_{j}I\{i\in\{\pi(2j-1),\pi(2j)\}\}+\epsilon_{i}~{}. (16)

Let Δ^nintpfe,γ^nintpfe,η^nintpfe\hat{\Delta}_{n}^{\rm int-pfe},\hat{\gamma}_{n}^{\rm int-pfe},\hat{\eta}_{n}^{\rm int-pfe} denote the OLS estimators for Δ,γ,η\Delta,\gamma,\eta in (16). It is straightforward to show Δ^nintpfe\hat{\Delta}_{n}^{\rm int-pfe} satisfies (5)–(6) with

m^1,n(Xi,Wi)\displaystyle\hat{m}_{1,n}(X_{i},W_{i}) =(ψiμ^ψ,n(1))η^nintpfe\displaystyle=(\psi_{i}-\hat{\mu}_{\psi,n}(1))^{\prime}\hat{\eta}_{n}^{\rm int-pfe}
m^0,n(Xi,Wi)\displaystyle\hat{m}_{0,n}(X_{i},W_{i}) =(ψiμ^ψ,n(0))(η^nintpfeγ^nintpfe).\displaystyle=(\psi_{i}-\hat{\mu}_{\psi,n}(0))^{\prime}(\hat{\eta}_{n}^{\rm int-pfe}-\hat{\gamma}_{n}^{\rm int-pfe})~{}.

Following similar arguments to those used in the proof of Theorem 4.1, we can establish that (9) and Assumption 3.1 are satisfied with

m1,n(Xi,Wi)\displaystyle m_{1,n}(X_{i},W_{i}) =(ψiE[ψi])ηintpfe\displaystyle=(\psi_{i}-E[\psi_{i}])^{\prime}\eta^{\rm int-pfe}
m0,n(Xi,Wi)\displaystyle m_{0,n}(X_{i},W_{i}) =(ψiE[ψi])(ηintpfeγintpfe),\displaystyle=(\psi_{i}-E[\psi_{i}])^{\prime}(\eta^{\rm int-pfe}-\gamma^{\rm int-pfe})~{},

where

γintpfe\displaystyle\gamma^{\rm int-pfe} =(E[Var[ψi|Xi]])1E[Cov[ψi,Yi(1)Yi(0)|Xi]],\displaystyle=(E[\operatorname*{Var}[\psi_{i}|X_{i}]])^{-1}E[\mathrm{Cov}[\psi_{i},Y_{i}(1)-Y_{i}(0)|X_{i}]]~{},
ηintpfe\displaystyle\eta^{\rm int-pfe} =(E[Var[ψi|Xi]])1E[Cov[ψi,Yi(1)|Xi]].\displaystyle=(E[\operatorname*{Var}[\psi_{i}|X_{i}]])^{-1}E[\mathrm{Cov}[\psi_{i},Y_{i}(1)|X_{i}]]~{}.

Because 2ηintpfeγintpfe=2βpfe2\eta^{\rm int-pfe}-\gamma^{\rm int-pfe}=2\beta^{\rm pfe}, it follows from Remark 4.4 that the limiting variance of Δ^nintpfe\hat{\Delta}_{n}^{\rm int-pfe} is identical to the limiting variance of Δ^npfe\hat{\Delta}_{n}^{\rm pfe}.   

Remark 4.8.

Wu and Gagnon-Bartsch (2021) consider covariate adjustment for paired experiments under the design-based framework, in which the covariates are treated as deterministic, so the cross-sectional dependence between units in the same pair arising from the closeness of their covariates plays no role in their analysis. We differ from them by considering the sampling-based framework, in which the covariates are treated as random and the pairs are formed by matching and therefore affect statistical inference. Under their framework, Wu and Gagnon-Bartsch (2021) point out that covariate adjustments may have a positive or negative effect on estimation accuracy depending on how they are estimated. This is consistent with our findings in this section. Specifically, we show that when the regression adjustments are estimated by a linear regression with pair fixed effects, the resulting ATE estimator is guaranteed to weakly improve upon the difference-in-means estimator in terms of efficiency. However, this improvement is not guaranteed if the adjustments are estimated without pair fixed effects.

Remark 4.9.

If we choose ψi\psi_{i} as a set of sieve basis functions with increasing dimension, then under suitable regularity conditions, the linear adjustments both with and without pair fixed effects achieve the same limiting variance as Δ^nideal\hat{\Delta}^{\rm ideal}_{n}, and thus, the efficiency bound. In fact, if ψi\psi_{i} contains sieve bases, then the linear adjustment without pair fixed effects can approximate the true specification E[Yi(1)+Yi(0)|Xi,Wi]E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}] in the sense that E[Yi(1)+Yi(0)|Xi,Wi]=ψiβnaive+RiE[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]=\psi_{i}^{\prime}\beta^{\rm naive}+R_{i} and E[Ri2]=o(1)E[R_{i}^{2}]=o(1). This property implies σ1,n2(Q)\sigma_{1,n}^{2}(Q) in Theorem 3.1 equals zero. Similarly, the linear adjustment with pair fixed effects can approximate the true specification E[Yi(1)+Yi(0)|Xi,Wi]E[Yi(1)+Yi(0)|Xi]E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]-E[Y_{i}(1)+Y_{i}(0)|X_{i}] in the sense that E[Yi(1)+Yi(0)|Xi,Wi]E[Yi(1)+Yi(0)|Xi]=ψ~iβnaive+R~iE[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]-E[Y_{i}(1)+Y_{i}(0)|X_{i}]=\tilde{\psi}_{i}^{\prime}\beta^{\rm naive}+\tilde{R}_{i} and E[R~i2]=o(1)E[\tilde{R}_{i}^{2}]=o(1). This property again implies σ1,n2(Q)\sigma_{1,n}^{2}(Q) in Theorem 3.1 equals zero. Therefore, in both cases, the adjusted estimator achieves the minimum variance. In the next section, we consider 1\ell_{1}-regularized adjustments which may be viewed as providing a way to choose the relevant sieve bases in a data-driven manner.   

5 Regularized Adjustments

In this section, we study covariate adjustments based on the 1\ell_{1}-regularized linear regression. Such settings can arise if the covariates WiW_{i} are high-dimensional or if the dimension of WiW_{i} is fixed but the regressors include many sieve basis functions of XiX_{i} and WiW_{i}. To accommodate situations where the dimension of WiW_{i} increases with nn, we add a subscript and denote it by Wn,iW_{n,i} instead. Let kw,nk_{w,n} denote the dimension of Wn,iW_{n,i}. For n1n\geq 1, let ψn,i=ψn(Xi,Wn,i)\psi_{n,i}=\psi_{n}(X_{i},W_{n,i}), where ψn:𝐑kx×𝐑kw,n𝐑pn\psi_{n}:\mathbf{R}^{k_{x}}\times\mathbf{R}^{k_{w,n}}\to\mathbf{R}^{p_{n}} and pnp_{n} will be permitted below to be possibly much larger than nn.

In what follows, we consider a two-step method in the spirit of Cohen and Fogarty (2023). In the first step, an intermediate estimator, Δ^nr\hat{\Delta}_{n}^{\rm r}, is obtained using (5) with “working models” md,n(Xi,Wi)m_{d,n}(X_{i},W_{i}) for d{0,1}d\in\{0,1\} obtained through 1\ell_{1}-regularized linear regression. As explained further below in Theorem 5.1, when md,n(Xi,Wi)m_{d,n}(X_{i},W_{i}) is approximately correctly specified, such an estimator is optimal in the sense that it minimizes the limiting variance in Theorem 3.1. When this is not the case, however, for reasons analogous to those put forward in Remark 4.2, it need not have a limiting variance weakly smaller than that of the unadjusted difference-in-means estimator. In a second step, we therefore consider an estimator obtained by refitting a version of (15) in which the covariates ψi\psi_{i} are replaced by the regularized estimates of md,n(Xi,Wi)m_{d,n}(X_{i},W_{i}) for d{0,1}d\in\{0,1\}. The resulting estimator, Δ^nrefit\hat{\Delta}_{n}^{\rm refit}, has a limiting variance weakly smaller than that of the intermediate estimator and thus remains optimal under approximately correct specification in the same sense. Moreover, its limiting variance is weakly smaller than that of the unadjusted difference-in-means estimator. Wager et al. (2016) also consider high-dimensional regression adjustments in randomized experiments using the LASSO. We differ from their work by considering the “matched pairs” design and, more importantly, by discussing when and how regularized adjustments can improve estimation efficiency upon the difference-in-means estimator.
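As a rough sketch of this two-step procedure on the hypothetical simulated data, under simplifying assumptions: a cross-validated LASSO from scikit-learn stands in for the weighted-penalty problem (17) introduced below, the basis psi_n is an arbitrary illustrative choice, and pair_demean is reused from the sketch in Section 4.2. This is not the exact procedure whose theory is developed in this section; it only illustrates the two steps.

```python
import numpy as np
from sklearn.linear_model import LassoCV

psi_n = np.column_stack([X, W, X * W, X ** 2, W ** 2])   # hypothetical basis psi_{n,i}

# Step 1: arm-specific l1-regularized "working models"; a cross-validated LASSO is a
# simplified stand-in for the weighted-penalty problem (17).
m_hat = {}
for d in (0, 1):
    fit = LassoCV(cv=5).fit(psi_n[D == d], Y[D == d])
    m_hat[d] = fit.predict(psi_n)                        # fitted values for all 2n units

# Intermediate estimator via the adjusted-outcome form (7)-(8)
Y_tilde = Y - 0.5 * (m_hat[1] + m_hat[0])
delta_r = Y_tilde[D == 1].mean() - Y_tilde[D == 0].mean()

# Step 2: refit a linear adjustment with pair fixed effects, using the two sets of
# predicted values as the covariates (as in (15) with those predictions as psi_i).
preds = np.column_stack([m_hat[1], m_hat[0]])
preds_dm = np.column_stack([pair_demean(preds[:, k], pi) for k in range(2)])
Z_refit = np.column_stack([pair_demean(D, pi), preds_dm])
delta_refit = np.linalg.lstsq(Z_refit, pair_demean(Y, pi), rcond=None)[0][0]
```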

Before proceeding, we introduce some additional notation that will be required in our formal description of the methods. We denote by ψn,i,l\psi_{n,i,l} the llth components of ψn,i\psi_{n,i}. For a vector a𝐑ka\in\mathbf{R}^{k} and 0p0\leq p\leq\infty, recall that

ap=(1lk|al|p)1/p,\|a\|_{p}=\Big{(}\sum_{1\leq l\leq k}|a_{l}|^{p}\Big{)}^{1/p}~{},

where it is understood that a0=1lkI{al0}\|a\|_{0}=\sum_{1\leq l\leq k}I\{a_{l}\neq 0\} and a=sup1lk|al|\|a\|_{\infty}=\sup_{1\leq l\leq k}|a_{l}|. Using this notation, we further define

Ξn=sup(x,w)supp(Xi)×supp(Wn,i)ψn(x,w).\Xi_{n}=\sup_{(x,w)\in\mathrm{supp}(X_{i})\times\mathrm{supp}(W_{n,i})}\|\psi_{n}(x,w)\|_{\infty}~{}.

For d{0,1}d\in\{0,1\}, define

(α^d,nr,β^d,nr)argmina𝐑,b𝐑pn1n1i2n:Di=d(Yiaψn,ib)2+λd,nrΩ^n(d)b1,(\hat{\alpha}_{d,n}^{\rm r},\hat{\beta}_{d,n}^{\rm r})\in\operatorname*{argmin}_{a\in\mathbf{R},b\in\mathbf{R}^{p_{n}}}\frac{1}{n}\sum_{1\leq i\leq 2n:D_{i}=d}(Y_{i}-a-\psi_{n,i}^{\prime}b)^{2}+\lambda_{d,n}^{\rm r}\|\hat{\Omega}_{n}(d)b\|_{1}~{}, (17)

where λd,nr\lambda_{d,n}^{\rm r} is a penalty parameter that will be disciplined by the assumptions below, Ω^n(d)=diag(ω^1(d),,ω^pn(d))\hat{\Omega}_{n}(d)=\operatorname*{diag}(\hat{\omega}_{1}(d),\cdots,\break\hat{\omega}_{p_{n}}(d)) is a diagonal matrix, and ω^n,l(d)\hat{\omega}_{n,l}(d) is the penalty loading for the llth regressor. Let Δ^nr\hat{\Delta}_{n}^{\rm r} denote the estimator in (5) with m^d,n(Xi,Wn,i)=ψn,iβ^d,nr\hat{m}_{d,n}(X_{i},W_{n,i})=\psi_{n,i}^{\prime}\hat{\beta}_{d,n}^{\rm r} for d{0,1}d\in\{0,1\}.

We now proceed with the statement of our assumptions. The first assumption collects a variety of moment conditions that will be used in our formal analysis:

Assumption 5.1.
  1. (a)

    There exist nonrandom quantities (αd,nr,βd,nr)(\alpha_{d,n}^{\rm r},\beta_{d,n}^{\rm r}) such that with ϵn,ir(d)\epsilon_{n,i}^{\rm r}(d) defined as

    ϵn,ir(d)=Yi(d)αd,nrψn,iβd,nr,\displaystyle\epsilon_{n,i}^{\rm r}(d)=Y_{i}(d)-\alpha_{d,n}^{\rm r}-\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r}~{},

    we have

    Ωn(d)1E[ψn,iϵn,ir(d)]+|E[ϵn,ir(d)]|=o(λd,nr),\|\Omega_{n}(d)^{-1}E[\psi_{n,i}\epsilon_{n,i}^{\rm r}(d)]\|_{\infty}+|E[\epsilon_{n,i}^{\rm r}(d)]|=o\left(\lambda_{d,n}^{\rm r}\right)~{}, (18)

    where Ωn(d)=diag(ωn,1(d),,ωn,pn(d))\Omega_{n}(d)=\operatorname*{diag}(\omega_{n,1}(d),\cdots,\omega_{n,p_{n}}(d)) and ωn,l2(d)=Var[ψn,i,lϵn,ir(d)]\omega_{n,l}^{2}(d)=\operatorname*{Var}[\psi_{n,i,l}\epsilon_{n,i}^{\rm r}(d)].

  2. (b)

    For some q>2q>2 and constant C1C_{1},

    supn1max1lpnE[|ψn,i,lq||Xi]\displaystyle\sup_{n\geq 1}\max_{1\leq l\leq p_{n}}E[|\psi_{n,i,l}^{q}||X_{i}] C1,\displaystyle\leq C_{1}~{},
    supn1|ψn,iβd,nr|\displaystyle\sup_{n\geq 1}|\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r}| C1,\displaystyle\leq C_{1}~{},
    supn1|E[Yi(d)|Xi,Wn,i]|\displaystyle\sup_{n\geq 1}|E[Y_{i}(d)|X_{i},W_{n,i}]| C1,\displaystyle\leq C_{1}~{},

    with probability one.

  3. (c)

    For some c¯\underaccent{\bar}{c} and c¯\bar{c}, we require that

    0<c¯lim infnmin1lpnω^n,l(d)/ωn,l(d)lim supnmax1lpnω^n,l(d)/ωn,l(d)c¯<.0<\underline{c}\leq\liminf_{n\rightarrow\infty}\min_{1\leq l\leq p_{n}}\hat{\omega}_{n,l}(d)/\omega_{n,l}(d)\leq\limsup_{n\rightarrow\infty}\max_{1\leq l\leq p_{n}}\hat{\omega}_{n,l}(d)/\omega_{n,l}(d)\leq\bar{c}<\infty. (19)
  4. (d)

    For some c0c_{0}, σ¯\underaccent{\bar}{\sigma}, σ¯\bar{\sigma}, the following statements hold with probability one:

    0<σ¯2lim infnmind{0,1},1lpnωn,l2(d)lim supnmaxd{0,1},1lpnωn,l2(d)σ¯2<,0<\underaccent{\bar}{\sigma}^{2}\leq\liminf_{n\rightarrow\infty}~{}\min_{d\in\{0,1\},1\leq l\leq p_{n}}\omega_{n,l}^{2}(d)\leq\limsup_{n\rightarrow\infty}~{}\max_{d\in\{0,1\},1\leq l\leq p_{n}}\omega_{n,l}^{2}(d)\leq\bar{\sigma}^{2}<\infty~{},
    supn1maxd{0,1}E[(ψn,iβd,nr)2]c0\displaystyle\sup_{n\geq 1}\max_{d\in\{0,1\}}E[(\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r})^{2}]\leq c_{0} <,\displaystyle<\infty~{},
    maxd{0,1},1lpn12n1i2nE[ϵn,i4(d)|Xi]c0\displaystyle\max_{d\in\{0,1\},1\leq l\leq p_{n}}\frac{1}{2n}\sum_{1\leq i\leq 2n}E[\epsilon_{n,i}^{4}(d)|X_{i}]\leq c_{0} <,\displaystyle<\infty~{},
    supn1maxd{0,1}E[ϵn,i4(d)]c0\displaystyle\sup_{n\geq 1}~{}\max_{d\in\{0,1\}}E[\epsilon_{n,i}^{4}(d)]\leq c_{0} <,\displaystyle<\infty~{},
    mind{0,1}Var[Yi(d)ψn,i(β1,nr+β0,nr)/2]σ¯2\displaystyle\min_{d\in\{0,1\}}\operatorname*{Var}[Y_{i}(d)-\psi_{n,i}^{\prime}(\beta_{1,n}^{\rm r}+\beta_{0,n}^{\rm r})/2]\geq\underaccent{\bar}{\sigma}^{2} >0,\displaystyle>0~{},
    min1lpn1n1i2nI{Di=d}Var[ψn,i,lϵn,i(d)|Xi]σ¯2\displaystyle\min_{1\leq l\leq p_{n}}\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\operatorname*{Var}[\psi_{n,i,l}\epsilon_{n,i}(d)|X_{i}]\geq\underaccent{\bar}{\sigma}^{2} >0,\displaystyle>0~{},
    min1lpnVar[E[ψn,i,lϵn,i(d)|Xi]]σ¯2\displaystyle\min_{1\leq l\leq p_{n}}\operatorname*{Var}[E[\psi_{n,i,l}\epsilon_{n,i}(d)|X_{i}]]\geq\underaccent{\bar}{\sigma}^{2} >0.\displaystyle>0~{}.
Remark 5.1.

It is instructive to note that (18) in Assumption 5.1(a) is the subgradient condition for an 1\ell_{1}-penalized regression of the outcome Yi(d)Y_{i}(d) on ψn,i\psi_{n,i} when the penalty is of order o(λd,nr)o(\lambda_{d,n}^{\rm r}). Specifically, if pnnp_{n}\ll n, then this condition holds automatically with βd,nr\beta_{d,n}^{\rm r} equal to the coefficients of a linear projection of Yi(d)Y_{i}(d) onto (1,ψn,i)(1,\psi_{n,i}^{\prime}). When pnnp_{n}\gg n, but E[Yi(d)|Xi,Wi]E[Y_{i}(d)|X_{i},W_{i}] is approximately correctly specified in the sense that the approximation error Rn,i(d)=E[Yi(d)|Xi,Wi]αd,nrψn,iβd,nrR_{n,i}(d)=E[Y_{i}(d)|X_{i},W_{i}]-\alpha_{d,n}^{\rm r}-\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r} is sufficiently small, then (18) also holds. However, approximately correct specification is not necessary for (18). For example, suppose Wn,i=(Wn,i,1,,Wn,i,pn)W_{n,i}=(W_{n,i,1},\cdots,W_{n,i,p_{n}}) is a pnp_{n}-dimensional vector of independent standard normal random variables, Wn,iW_{n,i} is independent of XiX_{i}, ψn,i=(Xi,Wn,i)\psi_{n,i}=(X_{i}^{\prime},W_{n,i}^{\prime})^{\prime}, and

Yi(d)=αd,nr+ψn,iβd,nr+l=1pnWn,i,l21pn+un,i(d),\displaystyle Y_{i}(d)=\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r}+\sum_{l=1}^{p_{n}}\frac{W_{n,i,l}^{2}-1}{\sqrt{p_{n}}}+u_{n,i}(d)~{},

where E(un,i(d)|Xi,Wn,i)=0E(u_{n,i}(d)|X_{i},W_{n,i})=0. Then, Assumption 5.1(a) holds with ϵn,ir(d)=l=1pnWn,i,l21pn+un,i(d)\epsilon_{n,i}^{\rm r}(d)=\sum_{l=1}^{p_{n}}\frac{W_{n,i,l}^{2}-1}{\sqrt{p_{n}}}+u_{n,i}(d). We can impose a sparse restriction on βd,nr\beta_{d,n}^{\rm r} so that it further satisfies Assumption 5.3(b) below. On the other hand, the linear regression adjustment is not approximately correctly specified because Rn,i(d)=E(Yi(d)|Xi,Wn,i)(αd,nr+ψn,iβd,nr)=l=1pnWn,i,l21pnR_{n,i}(d)=E(Y_{i}(d)|X_{i},W_{n,i})-(\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r})=\sum_{l=1}^{p_{n}}\frac{W_{n,i,l}^{2}-1}{\sqrt{p_{n}}}, and we have ERn,i2(d)=20ER_{n,i}^{2}(d)=2\nrightarrow 0.   

Remark 5.2.

Assumption 5.1(b) and 5.1(d) are standard in the high-dimensional estimation literature; see, for instance, Belloni et al. (2017). The last four inequalities in Assumption 5.1(d), in particular, permit us to apply the high-dimensional central limit theorem in Chernozhukov et al. (2017, Theorem 2.1).   

Remark 5.3.

The penalty loadings in Assumption 5.1(c) can be computed by an iterative procedure proposed by Belloni et al. (2017). We provide more detail in Algorithm 5.1 below. We can then verify (19) under “matched pairs” designs following arguments similar to those in Belloni et al. (2017).   

Our analysis will, as before, also require some discipline on the way in which pairs are formed. For this purpose, Assumption 2.3 will suffice, but we will need an additional Lipschitz-like condition:

Assumption 5.2.

For some L>0L>0 and any x1x_{1} and x2x_{2} in the support of XiX_{i}, we have

|(Ψ(x1)Ψ(x2))βd,nr|Lx1x22.|(\Psi(x_{1})-\Psi(x_{2}))^{\prime}\beta_{d,n}^{\rm r}|\leq L||x_{1}-x_{2}||_{2}~{}.

We next specify our restrictions on the penalty parameter λd,nr\lambda_{d,n}^{\rm r}.

Assumption 5.3.
  1. (a)

    For some n\ell\ell_{n}\rightarrow\infty,

    λd,nr=nnΦ1(10.12log(n)pn).\lambda_{d,n}^{\rm r}=\frac{\ell\ell_{n}}{\sqrt{n}}\Phi^{-1}\left(1-\frac{0.1}{2\log(n)p_{n}}\right)~{}.
  2. (b)

    Ξn2(logpn)7/n0\Xi_{n}^{2}(\log p_{n})^{7}/n\to 0 and (nsnlogpn)/n0(\ell\ell_{n}s_{n}\log p_{n})/\sqrt{n}\to 0, where

    sn=maxd{0,1}βd,nr0.s_{n}=\max_{d\in\{0,1\}}\|\beta_{d,n}^{\rm r}\|_{0}. (20)

We note that Assumption 5.3(b) permits pnp_{n} to be much greater than nn. It also requires sparsity in the sense that sn=o(n)s_{n}=o(\sqrt{n}).

Finally, as is common in the analysis of 1\ell_{1}-penalized regression, we require a “restricted eigenvalue” condition. This assumption permits us to apply Bickel et al. (2009, Lemma 4.1) and establish the error bounds for |α^d,nrαd,nr|+β^d,nrβd,nr1|\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}|+||\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}||_{1} and 1n1i2nI{Di=d}(α^d,nrαd,nr+ψn,i(β^d,nrβd,nr))2\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\left(\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\right)^{2}.

Assumption 5.4.

For some \kappa_{1}>0, \kappa_{2}<\infty, and \ell_{n}\to\infty, the following statements hold with probability approaching one:

infd{0,1},v𝐑pn+1:v0(sn+1)n(v22)1v(1n1i2nI{Di=d}ψ˘n,iψ˘n,i)v\displaystyle\inf_{d\in\{0,1\},v\in\mathbf{R}^{p_{n}+1}:\|v\|_{0}\leq(s_{n}+1)\ell_{n}}(\|v\|_{2}^{2})^{-1}v^{\prime}\Bigg{(}\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\breve{\psi}_{n,i}\breve{\psi}_{n,i}^{\prime}\Bigg{)}v κ1\displaystyle\geq\kappa_{1}
supd{0,1},v𝐑pn+1:v0(sn+1)n(v22)1v(1n1i2nI{Di=d}ψ˘n,iψ˘n,i)v\displaystyle\sup_{d\in\{0,1\},v\in\mathbf{R}^{p_{n}+1}:\|v\|_{0}\leq(s_{n}+1)\ell_{n}}(\|v\|_{2}^{2})^{-1}v^{\prime}\Bigg{(}\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\breve{\psi}_{n,i}\breve{\psi}_{n,i}^{\prime}\Bigg{)}v κ2\displaystyle\leq\kappa_{2}
infd{0,1},v𝐑pn+1:v0(sn+1)n(v22)1v(1n1i2nI{Di=d}E[ψ˘n,iψ˘n,i|Xi])v\displaystyle\inf_{d\in\{0,1\},v\in\mathbf{R}^{p_{n}+1}:\|v\|_{0}\leq(s_{n}+1)\ell_{n}}(\|v\|_{2}^{2})^{-1}v^{\prime}\Bigg{(}\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}E[\breve{\psi}_{n,i}\breve{\psi}_{n,i}^{\prime}|X_{i}]\Bigg{)}v κ1\displaystyle\geq\kappa_{1}
supd{0,1},v𝐑pn+1:v0(sn+1)n(v22)1v(1n1i2nI{Di=d}E[ψ˘n,iψ˘n,i|Xi])v\displaystyle\sup_{d\in\{0,1\},v\in\mathbf{R}^{p_{n}+1}:\|v\|_{0}\leq(s_{n}+1)\ell_{n}}(\|v\|_{2}^{2})^{-1}v^{\prime}\Bigg{(}\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}E[\breve{\psi}_{n,i}\breve{\psi}_{n,i}^{\prime}|X_{i}]\Bigg{)}v κ2,\displaystyle\leq\kappa_{2}~{},

where ψ˘n,i=(1,ψn,i)\breve{\psi}_{n,i}=(1,\psi_{n,i}^{\prime})^{\prime}.
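Assumption 5.4 restricts eigenvalues over sparse vectors and therefore cannot be verified exactly in practice. As a rough diagnostic, one can inspect the extreme eigenvalues of randomly drawn principal submatrices of the within-arm Gram matrix; the sketch below does this under our own choice of support size k (standing in for (s_{n}+1)\ell_{n}, which is unknown), and it is a heuristic check rather than a formal test.

```python
# Heuristic diagnostic for Assumption 5.4: extreme eigenvalues of randomly chosen
# k x k principal submatrices of the Gram matrix (1/n) sum_{D_i = d} psi_i psi_i'.
import numpy as np

def sparse_eig_range(psi_breve, D, d, k, n_draws=200, seed=0):
    rng = np.random.default_rng(seed)
    Z = psi_breve[D == d]                       # rows (1, psi_{n,i}') with D_i = d
    G = Z.T @ Z / Z.shape[0]
    lo, hi = np.inf, -np.inf
    for _ in range(n_draws):
        S = rng.choice(G.shape[0], size=k, replace=False)
        w = np.linalg.eigvalsh(G[np.ix_(S, S)])
        lo, hi = min(lo, w[0]), max(hi, w[-1])
    return lo, hi                               # rough analogues of kappa_1, kappa_2
```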

Using these assumptions, the following theorem characterizes the behavior of Δ^nr\hat{\Delta}_{n}^{\rm r}:

Theorem 5.1.

Suppose QQ satisfies Assumption 2.1 and the treatment assignment mechanism satisfies Assumptions 2.22.3. Further suppose Assumptions 5.15.4 hold. Then, (9), (12), and Assumption 3.1 are satisfied with m^d,n(Xi,Wn,i)=α^d,nr+ψn,iβ^d,nr\hat{m}_{d,n}(X_{i},W_{n,i})=\hat{\alpha}_{d,n}^{\rm r}+\psi_{n,i}^{\prime}\hat{\beta}_{d,n}^{\rm r} and

md,n(Xi,Wn,i)=αd,nr+ψn,iβd,nrm_{d,n}(X_{i},W_{n,i})=\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r}

for d\in\{0,1\} and n\geq 1. Denote the asymptotic variance of \hat{\Delta}_{n}^{\rm r} by \sigma_{n}^{\rm r,2}. If the regularized adjustment is approximately correctly specified, i.e., E[Y_{i}(d)|X_{i},W_{n,i}]=\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}\beta^{\rm r}_{d,n}+R_{n,i}(d) with \max_{d\in\{0,1\}}E[R_{n,i}^{2}(d)]=o(1), then \sigma_{n}^{\rm r,2} achieves the minimum variance, i.e.,

limnσnr,2=σ22(Q)+σ32(Q).\displaystyle\lim_{n\to\infty}\sigma_{n}^{\rm r,2}=\sigma_{2}^{2}(Q)+\sigma_{3}^{2}(Q)~{}.
Remark 5.4.

We recommend employing the iterative estimation procedure outlined by Belloni et al. (2017) to compute \hat{\beta}_{d,n}^{\rm r}, in which the m-th step's penalty loadings are estimated based on the (m-1)-th step's LASSO estimates. Formally, this iterative procedure is described by the following algorithm:

Algorithm 5.1.
  1. Step 0: Set ϵ^n,ir,(0)(d)=Yi\hat{\epsilon}_{n,i}^{{\rm r},(0)}(d)=Y_{i} if Di=dD_{i}=d.

  2. \vdots

  3. Step m: Compute \hat{\omega}_{n,l}^{(m)}(d)=\sqrt{\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\psi_{n,i,l}^{2}(\hat{\epsilon}_{n,i}^{{\rm r},(m-1)}(d))^{2}}, compute (\hat{\alpha}_{d,n}^{{\rm r},(m)},\hat{\beta}_{d,n}^{{\rm r},(m)}) following (17) with the \hat{\omega}_{n,l}^{(m)}(d) as the penalty loadings, and set \hat{\epsilon}_{n,i}^{{\rm r},(m)}(d)=Y_{i}-\hat{\alpha}_{d,n}^{{\rm r},(m)}-\psi_{n,i}^{\prime}\hat{\beta}_{d,n}^{{\rm r},(m)} if D_{i}=d.

  4. \vdots

  5. Step MM: \ldots

  6. Step M+1M+1: Set β^d,nr=β^d,nr,(M)\hat{\beta}_{d,n}^{\rm r}=\hat{\beta}_{d,n}^{{\rm r},(M)}.

As suggested by Belloni et al. (2017), we set MM to be 15. We note that R package hdm has a built-in option for this iterative procedure. For this choice of penalty loadings, arguments similar to those in Belloni et al. (2017) can be used to verify (19) under “matched pairs” designs.   
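The sketch below mirrors Algorithm 5.1 for one treatment arm in Python, assuming that (17) is a weighted \ell_{1}-penalized least-squares problem in which the loadings multiply the penalty; the mapping between \lambda_{d,n}^{\rm r} and the `alpha` argument below depends on the normalization in (17), which we do not reproduce, so the code is only illustrative (in practice the built-in iterative option of the R package hdm can be used instead).

```python
# Sketch of Algorithm 5.1 for arm d: iterate penalty loadings and a weighted LASSO.
# The weighted penalty sum_l omega_l |beta_l| is implemented by rescaling columns.
import numpy as np
from sklearn.linear_model import Lasso

def iterative_lasso(Y, psi, D, d, lam, M=15):
    y, X = Y[D == d], psi[D == d]
    eps = y.copy()                                                  # Step 0
    for _ in range(M):
        omega = np.sqrt(np.mean(X**2 * eps[:, None]**2, axis=0))    # penalty loadings
        omega = np.maximum(omega, 1e-12)
        fit = Lasso(alpha=lam, fit_intercept=True, max_iter=10_000).fit(X / omega, y)
        beta = fit.coef_ / omega                                    # undo the rescaling
        eps = y - fit.intercept_ - X @ beta                         # Step m residuals
    return fit.intercept_, beta
```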

Remark 5.5.

When the 1\ell_{1}-regularized adjustment is approximately correctly specified, Theorem 5.1 shows Δ^nr\hat{\Delta}_{n}^{\rm r} achieves the minimum variance derived in Remark 3.2, and thus, is guaranteed to be weakly more efficient than the difference-in-means estimator (Δ^nunadj\hat{\Delta}_{n}^{\rm unadj}). When Wn,iW_{n,i} is fixed dimensional and ψn,i\psi_{n,i} consists of sieve basis functions of (Xi,Wn,i)(X_{i},W_{n,i}), the approximately correct specification usually holds. Specifically, under regularity conditions such as the smoothness of E(Yi(d)|Xi,Wn,i)E(Y_{i}(d)|X_{i},W_{n,i}), we can approximate E(Yi(d)|Xi,Wn,i)E(Y_{i}(d)|X_{i},W_{n,i}) by αd,nr+ψn,iβd,nr\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}\beta^{\rm r}_{d,n} and βd,nr\beta^{\rm r}_{d,n} is automatically sparse in the sense that βd,nr0n||\beta^{\rm r}_{d,n}||_{0}\ll n. This means our regularized regression adjustment can select relevant sieve bases in nonparametric regression adjustments in a data-driven manner and automatically minimize the limiting variance of the corresponding ATE estimator.   

Remark 5.6.

When the dimension of \psi_{n,i} is ultra-high (i.e., p_{n}\gg n) and the regularized adjustment is not approximately correctly specified, \hat{\Delta}_{n}^{\rm r} is subject to Freedman's (2008) critique: in theory, it can be less efficient than \hat{\Delta}_{n}^{\rm unadj}. To overcome this problem, we consider an additional step in which we treat the regularized adjustments (\psi_{n,i}^{\prime}\hat{\beta}_{1,n}^{\rm r},\psi_{n,i}^{\prime}\hat{\beta}_{0,n}^{\rm r}) as a two-dimensional covariate and refit a linear regression with pair fixed effects. Such a procedure has also been studied by Cohen and Fogarty (2023) in the setting with low-dimensional covariates and complete randomization. In fact, this strategy can improve upon general initial regression adjustments as long as (9), (12), and Assumption 3.1 are satisfied.

Theorem 5.2 below shows the “refit” estimator for the ATE is weakly more efficient than both \hat{\Delta}_{n}^{\rm unadj} and \hat{\Delta}_{n}^{\rm r}. To state the results, define \Gamma_{n,i}=(\psi_{n,i}^{\prime}\beta_{1,n}^{\rm r},\psi_{n,i}^{\prime}\beta_{0,n}^{\rm r})^{\prime}, \hat{\Gamma}_{n,i}=(\psi_{n,i}^{\prime}\hat{\beta}_{1,n}^{\rm r},\psi_{n,i}^{\prime}\hat{\beta}_{0,n}^{\rm r})^{\prime}, and \hat{\Delta}_{n}^{\rm refit} as the estimator in (15) with \psi_{i} replaced by \hat{\Gamma}_{n,i}. Note that \hat{\Delta}_{n}^{\rm refit} remains numerically the same if we include the intercept \hat{\alpha}_{d,n}^{\rm r} in the definition of \hat{\Gamma}_{n,i}. Following Remark 4.3, \hat{\Delta}_{n}^{\rm refit} is the intercept in the linear regression of (D_{\pi(2j-1)}-D_{\pi(2j)})(Y_{\pi(2j-1)}-Y_{\pi(2j)}) on a constant and (D_{\pi(2j-1)}-D_{\pi(2j)})(\hat{\Gamma}_{n,\pi(2j-1)}-\hat{\Gamma}_{n,\pi(2j)}). Replacing \hat{\Gamma}_{n,i} by \hat{\Gamma}_{n,i}+(\hat{\alpha}_{1,n}^{\rm r},\hat{\alpha}_{0,n}^{\rm r})^{\prime} does not change the regression estimators.
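A minimal sketch of this refitting step is given below (in Python), assuming that units 2j-1 and 2j form pair j and that beta1 and beta0 are the first-step regularized coefficients; it uses the pairwise-difference representation quoted above, to which the pair-fixed-effects regression is numerically equivalent.

```python
# Sketch: the refit estimator via the pairwise-difference regression above.
# Units 2j-1 and 2j are assumed to form pair j; the intercept is Delta_hat^refit.
import numpy as np

def refit_ate(Y, D, psi, beta1, beta0):
    gamma = np.column_stack([psi @ beta1, psi @ beta0])     # Gamma_hat_{n,i}
    dD = D[0::2] - D[1::2]                                  # D_{pi(2j-1)} - D_{pi(2j)}
    dY = (Y[0::2] - Y[1::2]) * dD
    dG = (gamma[0::2] - gamma[1::2]) * dD[:, None]
    Z = np.column_stack([np.ones(len(dY)), dG])             # constant + adjustments
    coef, *_ = np.linalg.lstsq(Z, dY, rcond=None)
    return coef[0]
```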

The following assumption will be employed to control Γn,i\Gamma_{n,i} in our subsequent analysis:

Assumption 5.5.

For some \kappa_{1}>0 and \kappa_{2}<\infty,

infn1infv𝐑2v22vE[Var[Γn,i|Xi]]vκ1\displaystyle\inf_{n\geq 1}\inf_{v\in\mathbf{R}^{2}}||v||_{2}^{-2}v^{\prime}E[\operatorname*{Var}[\Gamma_{n,i}|X_{i}]]v\geq\kappa_{1}
supn1supv𝐑2v22vE[Var[Γn,i|Xi]]vκ2.\displaystyle\sup_{n\geq 1}\sup_{v\in\mathbf{R}^{2}}||v||_{2}^{-2}v^{\prime}E[\operatorname*{Var}[\Gamma_{n,i}|X_{i}]]v\leq\kappa_{2}~{}.

The following theorem characterizes the behavior of Δ^nrefit\hat{\Delta}_{n}^{\rm refit}:

Theorem 5.2.

Suppose QQ satisfies Assumption 2.1 and the treatment assignment mechanism satisfies Assumptions 2.22.3. Further suppose Assumptions 5.15.5 hold. Then, (9), (12), and Assumption 3.1 are satisfied with m^d,n(Xi,Wn,i)=Γ^n,iβ^nrefit\hat{m}_{d,n}(X_{i},W_{n,i})=\hat{\Gamma}_{n,i}^{\prime}\hat{\beta}_{n}^{\rm refit} and

md,n(Xi,Wn,i)=Γn,iβnrefitm_{d,n}(X_{i},W_{n,i})=\Gamma_{n,i}^{\prime}\beta_{n}^{\rm refit}

for d{0,1}d\in\{0,1\} and n1n\geq 1, where βnrefit=(2E[Var[Γn,i|Xi]])1E[Cov[Γn,i,Yi(1)+Yi(0)|Xi]]\beta_{n}^{\rm refit}=(2E[\operatorname*{Var}[\Gamma_{n,i}|X_{i}]])^{-1}E[\operatorname*{Cov}[\Gamma_{n,i},Y_{i}(1)+Y_{i}(0)|X_{i}]]. In addition, denote the asymptotic variance of Δ^nrefit\hat{\Delta}_{n}^{\rm refit} as σnrefit,2\sigma_{n}^{\rm refit,2}. Then, σnunadj,2σnrefit,2\sigma_{n}^{\rm unadj,2}\geq\sigma_{n}^{\rm refit,2} and σnr,2σnrefit,2\sigma_{n}^{\rm r,2}\geq\sigma_{n}^{\rm refit,2}.

Remark 5.7.

It is possible to further relax the full rank condition in Assumption 5.5 by running a ridge regression or by truncating the minimum eigenvalue of the Gram matrix in the refitting step.
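One concrete way to implement the truncation mentioned here is sketched below (the truncation level tau is our own illustrative choice):

```python
# Sketch: floor the eigenvalues of the 2x2 Gram matrix used in the refitting step
# before inverting, as one way to relax the full-rank requirement in Assumption 5.5.
import numpy as np

def truncated_inverse(G, tau=1e-4):
    w, V = np.linalg.eigh(G)          # G symmetric positive semi-definite
    w = np.maximum(w, tau)
    return V @ np.diag(1.0 / w) @ V.T
```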

6 Simulations

In this section, we conduct Monte Carlo experiments to assess the finite-sample performance of the inference methods proposed in the paper. In all cases, we follow Bai et al. (2022) to consider tests of the hypothesis that

\displaystyle H_{0}:\Delta(Q)=\Delta_{0}\text{ versus }H_{1}:\Delta(Q)\neq\Delta_{0}

with Δ0=0\Delta_{0}=0 at nominal level α=0.05\alpha=0.05.

6.1 Data Generating Processes

We generate potential outcomes for d{0,1}d\in\{0,1\} and 1i2n1\leq i\leq 2n by the equation

Yi(d)=μd+md(Xi,Wi)+σd(Xi,Wi)ϵd,i,d=0,1,Y_{i}(d)=\mu_{d}+m_{d}(X_{i},W_{i})+\sigma_{d}(X_{i},W_{i})\epsilon_{d,i},~{}d=0,1, (21)

where \mu_{d}, m_{d}(X_{i},W_{i}), \sigma_{d}(X_{i},W_{i}), and \epsilon_{d,i} are specified in each model as follows. In each of the specifications, (X_{i},W_{i},\epsilon_{0,i},\epsilon_{1,i}) are i.i.d. across i. The number of pairs n is either 100 or 200. The number of replications is 10,000.

Model 1

(Xi,Wi)=(Φ(Vi1),Φ(Vi2))\left(X_{i},W_{i}\right)^{\top}=\left(\Phi\left(V_{i1}\right),\Phi\left(V_{i2}\right)\right)^{\top}, where Φ()\Phi(\cdot) is the standard normal distribution function and

ViN((00),(1ρρ1)),V_{i}\sim N\left(\left(\begin{array}[]{l}0\\ 0\end{array}\right),\left(\begin{array}[]{ll}1&\rho\\ \rho&1\end{array}\right)\right),

m0(Xi,Wi)=γ(Wi12)m_{0}\left(X_{i},W_{i}\right)=\gamma\left(W_{i}-\frac{1}{2}\right); m1(Xi,Wi)=m0(Xi,Wi)m_{1}\left(X_{i},W_{i}\right)=m_{0}\left(X_{i},W_{i}\right); ϵd,iN(0,1)\epsilon_{d,i}\sim N(0,1) for d=0,1d=0,1; σ0(Xi,Wi)=σ1(Xi,Wi)=1\sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=1. We set γ=4\gamma=4 and ρ=0.2\rho=0.2.
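For concreteness, one draw of the covariates and potential outcomes of Model 1 according to (21) can be generated as follows (a Python sketch; function and variable names are ours):

```python
# Sketch: simulate Model 1 potential outcomes following (21).
import numpy as np
from scipy.stats import norm

def simulate_model1(n_pairs, gamma=4.0, rho=0.2, mu0=0.0, mu1=0.0, seed=0):
    rng = np.random.default_rng(seed)
    N = 2 * n_pairs
    V = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=N)
    X, W = norm.cdf(V[:, 0]), norm.cdf(V[:, 1])
    m = gamma * (W - 0.5)                       # m_0 = m_1 in Model 1
    Y0 = mu0 + m + rng.standard_normal(N)       # sigma_0 = sigma_1 = 1
    Y1 = mu1 + m + rng.standard_normal(N)
    return X, W, Y0, Y1
```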

Model 2

\left(X_{i},W_{i}\right)^{\top}=\left(\Phi\left(V_{i1}\right),V_{i1}V_{i2}\right)^{\top}, where V_{i} is the same as in Model 1. m_{0}\left(X_{i},W_{i}\right)=m_{1}\left(X_{i},W_{i}\right)=\gamma_{1}\left(W_{i}-\rho\right)+\gamma_{2}\left(\Phi^{-1}\left(X_{i}\right)^{2}-1\right). \epsilon_{d,i}\sim N(0,1) for d=0,1; \sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=1. \left(\gamma_{1},\gamma_{2}\right)^{\top}=\left(1,2\right)^{\top} and \rho=0.2.

Model 3

The same as in Model 2, except that m0(Xi,Wi)=m1(Xi,Wi)=γ1(Wiρ)+γ2(Φ(Wi)12)+γ3(Φ1(Xi)21)m_{0}\left(X_{i},W_{i}\right)=m_{1}\left(X_{i},W_{i}\right)=\gamma_{1}\left(W_{i}-\rho\right)+\gamma_{2}\left(\Phi\left(W_{i}\right)-\frac{1}{2}\right)+\gamma_{3}\left(\Phi^{-1}\left(X_{i}\right)^{2}-1\right) with (γ1,γ2,γ3)=(14,1,2)\left(\gamma_{1},\gamma_{2},\gamma_{3}\right)^{\top}=\left(\frac{1}{4},1,2\right)^{\top}.

Model 4

\left(X_{i},W_{i}\right)^{\top}=\left(V_{i1},V_{i1}V_{i2}\right)^{\top}, where V_{i} is the same as in Model 1. m_{0}\left(X_{i},W_{i}\right)=m_{1}\left(X_{i},W_{i}\right)=\gamma_{1}\left(W_{i}-\rho\right)+\gamma_{2}\left(\Phi\left(W_{i}\right)-\frac{1}{2}\right)+\gamma_{3}\left(X_{i}^{2}-1\right). \epsilon_{d,i}\sim N(0,1) for d=0,1; \sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=1. \left(\gamma_{1},\gamma_{2},\gamma_{3}\right)^{\top}=\left(2,1,2\right)^{\top} and \rho=0.2.

Model 5

The same as in Model 4, except that m1(Xi,Wi)=m0(Xi,Wi)+(Φ(Xi)12)m_{1}\left(X_{i},W_{i}\right)=m_{0}\left(X_{i},W_{i}\right)+\left(\Phi\left(X_{i}\right)-\frac{1}{2}\right).

Model 6

The same as in Model 5, except that σ0(Xi,Wi)=σ1(Xi,Wi)=(Φ(Xi)+0.5)\sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=\left(\Phi\left(X_{i}\right)+0.5\right).

Model 7

Xi=(Vi1,Vi2)X_{i}=\left(V_{i1},V_{i2}\right)^{\top} and Wi=(Vi1Vi3,Vi2Vi4)W_{i}=\left(V_{i1}V_{i3},V_{i2}V_{i4}\right)^{\top}, where ViN(0,Σ)V_{i}\sim N(0,\Sigma) with dim(Vi)=4\text{dim}(V_{i})=4 and Σ\Sigma consisting of 1 on the diagonal and ρ\rho on all other elements. m0(Xi,Wi)=m1(Xi,Wi)=γ1(Wiρ)+γ2(Φ(Wi)12)+γ3(Xi121)m_{0}\left(X_{i},W_{i}\right)=m_{1}\left(X_{i},W_{i}\right)=\gamma_{1}^{\prime}\left(W_{i}-\rho\right)+\gamma_{2}^{\prime}\left(\Phi\left(W_{i}\right)-\frac{1}{2}\right)+\gamma_{3}\left(X_{i1}^{2}-1\right) with γ1=(2,2),γ2=(1,1),γ3=1\gamma_{1}=\left(2,2\right)^{\top},\gamma_{2}=\left(1,1\right)^{\top},\gamma_{3}=1. ϵd,iN(0,1)\epsilon_{d,i}\sim N(0,1) for d=0,1d=0,1; σ0(Xi,Wi)=σ1(Xi,Wi)=1\sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=1. ρ=0.2\rho=0.2.

Model 8

The same as in Model 7, except that m1(Xi,Wi)=m0(Xi,Wi)+(Φ(Xi1)12)m_{1}\left(X_{i},W_{i}\right)=m_{0}\left(X_{i},W_{i}\right)+\left(\Phi\left(X_{i1}\right)-\frac{1}{2}\right).

Model 9

The same as in Model 8, except that \sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=\left(\Phi\left(X_{i1}\right)+0.5\right).

Model 10

Xi=(Φ(Vi1),,Φ(Vi4))X_{i}=\left(\Phi\left(V_{i1}\right),\cdots,\Phi\left(V_{i4}\right)\right)^{\top} and Wi=(Vi1Vi5,Vi2Vi6)W_{i}=\left(V_{i1}V_{i5},V_{i2}V_{i6}\right)^{\top}, where ViN(0,Σ)V_{i}\sim N(0,\Sigma) with dim(Vi)=6\text{dim}(V_{i})=6 and Σ\Sigma consisting of 1 on the diagonal and ρ\rho on all other elements. m0(Xi,Wi)=m1(Xi,Wi)=γ1(Wiρ)+γ2(Φ(Wi)12)+γ3((Φ1(Xi1)2,Φ1(Xi2)2)1)m_{0}\left(X_{i},W_{i}\right)=m_{1}\left(X_{i},W_{i}\right)=\gamma_{1}^{\prime}\left(W_{i}-\rho\right)+\gamma_{2}^{\prime}\left(\Phi\left(W_{i}\right)-\frac{1}{2}\right)+\gamma_{3}^{\prime}\left(\left(\Phi^{-1}\left(X_{i1}\right)^{2},\Phi^{-1}\left(X_{i2}\right)^{2}\right)^{\top}-1\right) with γ1=(1,1),γ2=(12,12),γ3=(12,12)\gamma_{1}=\left(1,1\right)^{\top},\gamma_{2}=\left(\frac{1}{2},\frac{1}{2}\right)^{\top},\gamma_{3}=\left(\frac{1}{2},\frac{1}{2}\right)^{\top}. ϵd,iN(0,1)\epsilon_{d,i}\sim N(0,1) for d=0,1d=0,1; σ0(Xi,Wi)=σ1(Xi,Wi)=1\sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=1.

Model 11

The same as in Model 10, except that m1(Xi,Wi)=m0(Xi,Wi)+14j=14(Xij12)m_{1}\left(X_{i},W_{i}\right)=m_{0}\left(X_{i},W_{i}\right)+\frac{1}{4}\sum_{j=1}^{4}\left(X_{ij}-\frac{1}{2}\right).

Model 12

Xi=(Φ(Vi1),,Φ(Vi4))X_{i}=\left(\Phi\left(V_{i1}\right),\cdots,\Phi\left(V_{i4}\right)\right)^{\top} and Wi=(Vi1Vi41,,Vi40Vi80)W_{i}=\left(V_{i1}V_{i41},\cdots,V_{i40}V_{i80}\right)^{\top}, where ViN(0,Σ)V_{i}\sim N(0,\Sigma) with dim(Vi)=80\text{dim}(V_{i})=80. Σ\Sigma is the Toeplitz matrix

Σ=(10.50.520.5790.510.50.5780.520.510.5770.5790.5780.5771).\displaystyle\Sigma=\begin{pmatrix}1&0.5&0.5^{2}&\cdots&0.5^{79}\\ 0.5&1&0.5&\cdots&0.5^{78}\\ 0.5^{2}&0.5&1&\cdots&0.5^{77}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0.5^{79}&0.5^{78}&0.5^{77}&\cdots&1\end{pmatrix}.

m0(Xi,Wi)=m1(Xi,Wi)=γ1Wi+γ2(Φ1(Xi)21)m_{0}\left(X_{i},W_{i}\right)=m_{1}\left(X_{i},W_{i}\right)=\gamma_{1}^{\prime}W_{i}+\gamma_{2}^{\prime}\left(\Phi^{-1}\left(X_{i}\right)^{2}-1\right), γ1=(112,122,,1402)\gamma_{1}=\left(\frac{1}{1^{2}},\frac{1}{2^{2}},\cdots,\frac{1}{40^{2}}\right)^{\top} with dim(γ1)=40\text{dim}(\gamma_{1})=40, and γ2=(18,18,18,18)\gamma_{2}=\left(\frac{1}{8},\frac{1}{8},\frac{1}{8},\frac{1}{8}\right)^{\top} with dim(γ2)=4\text{dim}(\gamma_{2})=4. ϵd,iN(0,1)\epsilon_{d,i}\sim N(0,1) for d=0,1d=0,1; σ0(Xi,Wi)=σ1(Xi,Wi)=1\sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=1.
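The Toeplitz covariance matrix and the high-dimensional regressors of Model 12 can be constructed as in the following sketch (Python; names are ours):

```python
# Sketch: covariates of Model 12 with the Toeplitz covariance Sigma_{jk} = 0.5^{|j-k|}.
import numpy as np
from scipy.linalg import toeplitz
from scipy.stats import norm

def simulate_model12_covariates(n_units, seed=0):
    rng = np.random.default_rng(seed)
    Sigma = toeplitz(0.5 ** np.arange(80))                 # 80 x 80
    V = rng.multivariate_normal(np.zeros(80), Sigma, size=n_units)
    X = norm.cdf(V[:, :4])                                 # X_i = (Phi(V_i1),...,Phi(V_i4))
    W = V[:, :40] * V[:, 40:80]                            # W_i = (V_i1 V_i41,...,V_i40 V_i80)
    return X, W
```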

Model 13

The same as in Model 12, except that m0(Xi,Wi)=m1(Xi,Wi)=γ1Wi+γ2(Φ(Wi)12)+γ3(Φ1(Xi)21)m_{0}\left(X_{i},W_{i}\right)=m_{1}\left(X_{i},W_{i}\right)=\gamma_{1}^{\prime}W_{i}+\gamma_{2}^{\prime}\left(\Phi\left(W_{i}\right)-\frac{1}{2}\right)+\gamma_{3}^{\prime}\left(\Phi^{-1}\left(X_{i}\right)^{2}-1\right), γ1=(112,,1402)\gamma_{1}=\left(\frac{1}{1^{2}},\cdots,\frac{1}{40^{2}}\right)^{\top}, γ2=18(112,,1402)\gamma_{2}=\frac{1}{8}\left(\frac{1}{1^{2}},\cdots,\frac{1}{40^{2}}\right)^{\top}, and γ3=(18,18,18,18)\gamma_{3}=\left(\frac{1}{8},\frac{1}{8},\frac{1}{8},\frac{1}{8}\right)^{\top} with dim(γ1)=dim(γ2)=40\text{dim}(\gamma_{1})=\text{dim}(\gamma_{2})=40 and dim(γ3)=4\text{dim}(\gamma_{3})=4.

Model 14

The same as in Model 13, except that m1(Xi,Wi)=m0(Xi,Wi)+j=141j2(Xij12)m_{1}\left(X_{i},W_{i}\right)=m_{0}\left(X_{i},W_{i}\right)+\sum_{j=1}^{4}\frac{1}{j^{2}}\left(X_{ij}-\frac{1}{2}\right).

Model 15

The same as in Model 14, except that σ0(Xi,Wi)=σ1(Xi,Wi)=(Xi1+0.5)\sigma_{0}\left(X_{i},W_{i}\right)=\sigma_{1}\left(X_{i},W_{i}\right)=\left(X_{i1}+0.5\right).

It is worth noting that Models 1, 2, 3, 4, 7, 10, 12, and 13 imply homogeneous treatment effects because m_{1}\left(X_{i},W_{i}\right)=m_{0}\left(X_{i},W_{i}\right). Among them, E[Y_{i}(d)|X_{i},W_{i}]-E[Y_{i}(d)|X_{i}] is linear in W_{i} in Models 1, 2, and 12. Models 5, 8, 11, and 14 have heterogeneous but homoscedastic treatment effects. In Models 6, 9, and 15, however, the implied treatment effects are both heterogeneous and heteroscedastic. Models 12–15 contain high-dimensional covariates.

We follow Bai et al. (2022) to match pairs. Specifically, if \text{dim}\left(X_{i}\right)=1, we match pairs by sorting X_{i}, i=1,\ldots,2n. If \text{dim}\left(X_{i}\right)>1, we match pairs by the permutation \pi calculated using the R package nbpMatching. For more details, see Bai et al. (2022, Section 4). After matching the pairs, we flip coins to randomly select one unit within each pair for treatment, with the other serving as control.
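For the scalar case, the pairing and within-pair randomization can be sketched as follows (Python; the multivariate case relies on the optimal non-bipartite matching of the R package nbpMatching and is not reproduced here):

```python
# Sketch: match pairs by sorting a scalar X and flip a coin within each pair.
import numpy as np

def matched_pairs_assignment(X, seed=0):
    rng = np.random.default_rng(seed)
    order = np.argsort(X)                      # adjacent units in the sort form pairs
    D = np.zeros(len(X), dtype=int)
    for j in range(len(X) // 2):
        pair = order[2 * j: 2 * j + 2]
        D[rng.permutation(pair)[0]] = 1        # one unit per pair is treated at random
    return order, D                            # order plays the role of pi
```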

6.2 Estimation and Inference

We set μ0=0\mu_{0}=0 and μ1=Δ\mu_{1}=\Delta, where Δ=0\Delta=0 and 1/41/4 are used to illustrate the size and power, respectively. Rejection probabilities in percentage points are presented. To further illustrate the efficiency gains obtained by regression adjustments, in Figure 1, we plot the average standard error reduction in percentage relative to the standard error of the estimator without adjustments for various estimation methods.

Specifically, we consider the following adjusted estimators.

  1. (i)

    unadj: the estimator with no adjustments. In this case, our standard error is identical to the adjusted standard error proposed by Bai et al. (2022).

  2. (ii)

    naïve: the linear adjustments with regressors WiW_{i} but without pair dummies.

  3. (iii)

    naïve2: the linear adjustments with regressors X_{i} and W_{i} but without pair dummies.

  4. (iv)

    pfe: the linear adjustments with regressors W_{i} and pair dummies (see the sketch below).

  5. (v)

    refit: refit the 1\ell_{1}-regularized adjustments by linear regression with pair dummies.

See Section C in the Online Supplement for the regressors used in the regularized adjustments.
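As a point of reference for (iv), the sketch below computes the “pfe” point estimate through the pairwise-difference regression suggested by Remark 4.3 (with W_{i} in place of the regularized adjustments); pairs are assumed to be stored consecutively, and the associated standard errors are not reproduced here.

```python
# Sketch: the "pfe" point estimate via within-pair differencing; W is a (2n, k) array.
import numpy as np

def pfe_estimate(Y, D, W):
    W = np.asarray(W).reshape(len(Y), -1)
    dD = D[0::2] - D[1::2]
    dY = (Y[0::2] - Y[1::2]) * dD
    dW = (W[0::2] - W[1::2]) * dD[:, None]
    Z = np.column_stack([np.ones(len(dY)), dW])
    coef, *_ = np.linalg.lstsq(Z, dY, rcond=None)
    return coef[0]                             # intercept = ATE estimate
```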

For Models 1–11, we examine the performance of estimators (i)–(v). For Models 12–15, we assess the performance of estimators (i) and (v) in high-dimensional settings. Note that the adjustments are misspecified for almost all the models. The only exception is Model 1, for which the linear adjustment in W_{i} is correctly specified because m_{d}(X_{i},W_{i}) is just a linear function of W_{i}.

6.3 Simulation Results

Tables 1 and 3 report the rejection probabilities (size and power) of the different methods at the 0.05 level for Models 1–11 when n is 100 and 200, respectively. Several patterns emerge. First, for all the estimators, the rejection rates under H_{0} are close to the nominal level even when n=100 and with misspecified adjustments. This result is expected because all the estimators take into account the dependence structure arising in the “matched pairs” design, consistent with the findings in Bai et al. (2022).

Second, in terms of power, “pfe” is more powerful than “unadj”, “naïve”, and “naïve2” for all eleven models, as predicted by our theory. This finding confirms that “pfe” is the optimal linear adjustment and does not degrade the precision of the ATE estimator. In contrast, we observe that “naïve” and “naïve2” in Model 3 are even less powerful than the unadjusted estimator “unadj”. Figure 1 further confirms that these two methods inflate the estimation standard error. This result echoes Freedman's critique (Freedman, 2008) that careless regression adjustments may degrade the estimation precision. Our “pfe” estimator addresses this issue because it is proven to be weakly more efficient than the unadjusted estimator.

Third, the improvement in power for “pfe” is mainly due to the reduction of estimation standard errors, which can exceed 50% as shown in Figure 1 for Models 4–9. This means that the length of the confidence interval for the “pfe” estimator is less than half of that for the “unadj” estimator. Note the standard error of the “unadj” estimator is the one proposed by Bai et al. (2022), which has already been adjusted to account for the cross-sectional dependence created in pair matching. The additional reduction of more than 50% is therefore produced purely by the regression adjustment. For Models 10–11, the reduction of standard errors achieved by “pfe” is more than 40% as well. For Model 1, the linear regression is correctly specified, so all three adjusted methods achieve the global minimum asymptotic variance and maximum power. For Model 2, m_{d}(X_{i},W_{i})-E[m_{d}(X_{i},W_{i})|X_{i}]=\gamma_{1}(W_{i}-E[W_{i}|X_{i}]), so the linear adjustment \gamma_{1}W_{i} satisfies the conditions in Theorem 3.1. Therefore, “pfe”, as the best linear adjustment, is also the best adjustment globally, achieving the global minimum asymptotic variance and maximum power. In contrast, “naïve” and “naïve2” are not the best linear adjustments and are therefore less powerful than “pfe” because of the omitted pair dummies.

Finally, the “refit” method has the best power for most models, as it automatically achieves the global minimum asymptotic variance when the dimension of W_{i} is fixed.

Tables 2 and 4 report the size and power of the “refit” adjustments when both W_{i} and X_{i} are high-dimensional. We see that the size under the null is close to the nominal 5%, while the power of the adjusted estimator is higher than that of the unadjusted one. Figure 1 further illustrates that the reduction of the standard error is more than 30% for all high-dimensional models.

Table 1: Rejection probabilities for Models 1-11 when n=100n=100
H0H_{0}: Δ=0\Delta=0 H1H_{1}: Δ=1/4\Delta=1/4
Model unadj naïve naïve2 pfe refit unadj naïve naïve2 pfe refit
1 5.47 5.57 5.63 5.76 5.84 22.48 43.89 43.95 43.91 43.92
2 4.96 5.26 5.30 5.47 5.32 23.32 28.02 27.96 37.21 33.12
3 4.99 5.28 5.24 5.48 5.27 32.19 27.88 27.96 37.34 36.29
4 5.31 5.28 5.28 5.48 5.79 11.78 27.88 28.03 37.34 43.28
5 5.43 5.09 5.08 5.49 5.78 11.87 27.72 27.88 36.69 43.08
6 5.28 5.43 5.41 5.58 5.79 11.78 26.67 26.72 34.71 40.29
7 5.64 5.63 5.62 5.98 6.04 9.24 34.55 34.65 37.96 42.08
8 5.63 5.54 5.51 6.03 6.17 9.28 34.11 34.42 37.22 41.29
9 5.74 5.69 5.76 6.19 5.89 8.99 32.39 32.30 35.42 38.75
10 5.24 5.78 5.73 6.05 6.04 14.27 30.80 30.75 32.02 32.51
11 5.19 5.78 5.72 6.07 5.95 14.36 30.60 30.49 32.21 32.81
Table 2: Rejection probabilities for Models 12-15 when n=100n=100
H0H_{0}: Δ=0\Delta=0 H1H_{1}: Δ=1/4\Delta=1/4
Model unadj refit unadj refit
12 5.35 6.12 22.01 42.56
13 5.31 6.11 21.47 42.47
14 5.24 6.07 21.39 41.14
15 5.31 6.23 20.73 38.67
Table 3: Rejection probabilities for Models 1-11 when n=200n=200
H0H_{0}: Δ=0\Delta=0 H1H_{1}: Δ=1/4\Delta=1/4
Model unadj naïve naïve2 pfe refit unadj naïve naïve2 pfe refit
1 5.08 5.04 5.10 5.21 5.31 38.94 70.35 70.36 70.32 70.30
2 5.69 5.28 5.28 5.24 5.40 40.31 49.25 49.32 65.36 57.87
3 5.44 5.29 5.30 5.35 5.41 56.89 49.43 49.51 64.96 62.42
4 5.45 5.29 5.29 5.35 5.20 18.55 49.43 49.67 64.96 69.96
5 5.45 5.24 5.18 5.19 5.29 18.41 48.65 48.80 64.11 69.09
6 5.62 5.32 5.31 5.35 5.43 18.19 46.71 46.67 61.09 65.98
7 5.24 5.51 5.46 5.34 5.49 11.86 60.73 60.63 65.14 69.24
8 5.23 5.49 5.47 5.35 5.65 11.84 60.00 60.10 64.93 68.02
9 5.30 5.58 5.57 5.66 5.81 11.90 57.25 57.28 61.61 64.88
10 5.34 5.19 5.15 5.25 5.31 23.95 55.49 55.44 56.64 56.43
11 5.41 5.36 5.32 5.34 5.41 23.88 55.01 55.05 56.31 56.18
Table 4: Rejection probabilities for Models 12-15 when n=200n=200
H0H_{0}: Δ=0\Delta=0 H1H_{1}: Δ=1/4\Delta=1/4
Model unadj refit unadj refit
12 4.97 5.22 38.91 68.10
13 4.95 5.19 38.04 68.06
14 5.01 5.24 37.65 66.69
15 5.15 5.40 36.61 63.79
Figure 1: Average Standard Error Reduction in Percentage under H1H_{1} when n=200n=200

Notes: The figure plots average standard error reduction in percentage achieved by regression adjustments relative to “unadj” under H1H_{1} for Models 1-15 when n=200n=200.

7 Empirical Illustration

In this section, we revisit the randomized experiment with a matched pairs design conducted in Groh and McKenzie (2016). In the paper, they examined the impact of macroinsurance on microenterprises. Here, we apply the covariate adjustment methods developed in this paper to their data and reinvestigate the average effect of macroinsurance on three outcome variables: the microenterprises’ monthly profits, revenues, and investment.

The subjects in the experiment are microenterprise owners, who were the clients of the largest microfinance institution in Egypt. In the randomization, after an exact match on gender and the institution's branch code, those clients were grouped into pairs by applying an optimal greedy algorithm to 13 additional matching variables. Within each pair, a macroinsurance product was then offered to one randomly assigned client, and the other acted as a control. Based on the pair identities and all the matching variables, we re-order the pairs in our sample according to the procedure described in Section 5.1 of Jiang et al. (2022). The resulting sample contains 2824 microenterprise owners, that is, 1412 pairs of them. See Groh and McKenzie (2016) and Jiang et al. (2022) for more details.

Table 5 reports the ATEs with the standard errors (in parentheses) estimated by different methods. Among them, “GM” corresponds to the method used in Groh and McKenzie (2016), who estimated the effect by a regression with regressors including some baseline variables, a dummy for missing observations, and dummies for the pairs. Specifically, for profits and revenues, the regressors are the baseline value of the outcome of interest, a dummy for missing observations, and pair dummies; for investment, the regressors only include pair dummies. The standard errors for the “GM” ATE estimate are calculated by the usual heteroskedasticity-consistent estimator. The “GM” results in Table 5 were obtained by applying the Stata code provided by Groh and McKenzie (2016). The description of the other methods is similar to that in Section 6.2. Specifically: (i) X_{i} includes gender and 13 additional matching variables for all adjustments. Three of the matching variables are continuous, and the others are dummies. (ii) To maintain comparability, we keep X_{i} and W_{i} consistent across all adjustments except for “refit” for each outcome variable. For profits and revenue, W_{i} includes the baseline value of the outcome of interest, a dummy for whether the firm is above the 95th percentile of the control firms' distribution of the outcome variable, and a dummy for missing observations. For investment, W_{i} includes all the covariates used for the first two outcome variables. (iii) For “refit”, we intentionally expand the dimension of W_{i}. In addition to the baseline values used in the other adjustments and the dummy variables for missing observations, the W_{i} used in “refit” also includes the interactions of the original continuous W_{i} variables with the three continuous variables and the first three discrete variables in X_{i}. (iv) All the continuous variables in X_{i} and W_{i} are standardized initially when the regression-adjusted estimators are employed. The results in this table prompt the following observations.

First, aligning with our theoretical and simulation findings, we observe that the standard errors associated with the covariate-adjusted ATEs, particularly those for the “naïve2” and “pfe” estimates, are generally lower than those for the ATE estimate without any adjustment. This pattern is consistent across nearly all the outcome variables. To illustrate, when examining the revenue outcome, the standard errors for the “pfe” estimates are 10.2% smaller than those for the unadjusted ATE estimate.

Second, the standard errors of the “refit” estimates are consistently smaller than those of the unadjusted ATE estimate across all the outcome variables. For example, when profits are the outcome variable, the “refit” estimates exhibit standard errors 7.5% smaller than those of the unadjusted ATE estimate. Moreover, compared with those of the “pfe” estimates, the standard errors of “refit” are slightly smaller.

Table 5: Impacts of Macroinsurance for Microenterprises
Y n unadj GM naïve naïve2 pfe refit
Profits 1322 -85.65 -50.88 -41.69 -50.97 -51.60 -55.13
(49.43) (46.46) (47.22) (45.49) (46.94) (45.71)
Revenue 1318 -838.60 -660.16 -611.75 -610.80 -635.80 -600.97
(319.02) (284.02) (286.93) (282.93) (286.50) (284.60)
Investment 1410 -66.60 -66.60 -49.37 -50.72 -67.31 -58.77
(118.93) (118.66) (119.23) (118.97) (118.88) (118.84)

Notes: The table reports the ATE estimates of the effect of macroinsurance for microenterprises. Standard errors are in parentheses.

8 Conclusion

This paper considers covariate adjustment for the estimation of the average treatment effect in “matched pairs” designs when covariates other than the matching variables are available. When the dimension of these covariates is low, we suggest estimating the average treatment effect by a linear regression of the outcome on treatment status and covariates, controlling for pair fixed effects. We show that this estimator is no worse than the simple difference-in-means estimator in terms of efficiency. When the dimension of these covariates is high, we suggest a two-step estimation procedure: in the first step, we run \ell_{1}-regularized regressions of the outcome on covariates for the treated and control groups separately and obtain the fitted values for both potential outcomes, and in the second step, we estimate the average treatment effect by refitting a linear regression of the outcome on treatment status and the regularized adjustments from the first step, controlling for pair fixed effects. We show that the final estimator is no worse than the simple difference-in-means estimator in terms of efficiency. When the conditional mean models are approximately correctly specified, this estimator further achieves the minimum variance as if all relevant covariates had been used to form pairs in the design stage of the experiment. We take the choice of variables used in forming pairs as given and focus on how to obtain more efficient estimators of the average treatment effect in the analysis stage. Our paper is therefore silent on the important question of how to choose the relevant matching variables in the design stage. This topic is left for future research.

Appendix A Proofs of Main Results

In the appendix, we use a_{n}\lesssim b_{n} to denote that there exists a constant c>0 such that a_{n}\leq cb_{n}.

A.1 Proof of Theorem 3.1

Step 1: Decomposition by recursive conditioning

To begin, note

μ^n(1)\displaystyle\hat{\mu}_{n}(1) =12n1i2n(2Di(Yi(1)m^1,n(Xi,Wi))+m^1,n(Xi,Wi))\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}(Y_{i}(1)-\hat{m}_{1,n}(X_{i},W_{i}))+\hat{m}_{1,n}(X_{i},W_{i}))
=12n1i2n(2DiYi(1)(2Di1)m^1,n(Xi,Wi))\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}Y_{i}(1)-(2D_{i}-1)\hat{m}_{1,n}(X_{i},W_{i}))
=12n1i2n(2DiYi(1)(2Di1)m1,n(Xi,Wi))+oP(n1/2)\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}Y_{i}(1)-(2D_{i}-1)m_{1,n}(X_{i},W_{i}))+o_{P}(n^{-1/2})
=12n1i2n(2DiYi(1)Dim1,n(Xi,Wi)(1Di)m1,n(Xi,Wi))+oP(n1/2),\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}Y_{i}(1)-D_{i}m_{1,n}(X_{i},W_{i})-(1-D_{i})m_{1,n}(X_{i},W_{i}))+o_{P}(n^{-1/2})~{}, (22)

where the third equality follows from (9). Similarly,

μ^n(0)=12n1i2n(2(1Di)Yi(0)Dim0,n(Xi,Wi)(1Di)m0,n(Xi,Wi))+oP(n1/2).\hat{\mu}_{n}(0)=\frac{1}{2n}\sum_{1\leq i\leq 2n}(2(1-D_{i})Y_{i}(0)-D_{i}m_{0,n}(X_{i},W_{i})-(1-D_{i})m_{0,n}(X_{i},W_{i}))+o_{P}(n^{-1/2})~{}. (23)

It follows from (22)–(23) that

Δ^n=1n1i2nDiϕ1,n,i1n1i2n(1Di)ϕ0,n,i+oP(n1/2),\hat{\Delta}_{n}=\frac{1}{n}\sum_{1\leq i\leq 2n}D_{i}\phi_{1,n,i}-\frac{1}{n}\sum_{1\leq i\leq 2n}(1-D_{i})\phi_{0,n,i}+o_{P}(n^{-1/2})~{}, (24)

where

ϕ1,n,i\displaystyle\phi_{1,n,i} =Yi(1)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))\displaystyle=Y_{i}(1)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))
ϕ0,n,i\displaystyle\phi_{0,n,i} =Yi(0)12(m1,n(Xi,Wi)+m0,n(Xi,Wi)).\displaystyle=Y_{i}(0)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))~{}.

Next, consider

𝕃n=12n1i2n(2Di1)E[m1,n(Xi,Wi)+m0,n(Xi,Wi)|Xi].\displaystyle\mathbb{L}_{n}=\frac{1}{2\sqrt{n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)E[m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i})|X_{i}]~{}.

For simplicity, define Md,n(Xi)=E[md,n(Xi,Wi)|Xi]M_{d,n}(X_{i})=E[m_{d,n}(X_{i},W_{i})|X_{i}] for d{0,1}d\in\{0,1\}. It follows from Assumption 2.2 that E[𝕃n|X(n)]=0E[\mathbb{L}_{n}|X^{(n)}]=0. On the other hand,

Var[𝕃n|X(n)]\displaystyle\operatorname*{Var}[\mathbb{L}_{n}|X^{(n)}] =14n1jn(M1,n(Xπ(2j1))+M0,n(Xπ(2j1))(M1,n(Xπ(2j))+M0,n(Xπ(2j))))2\displaystyle=\frac{1}{4n}\sum_{1\leq j\leq n}\left(M_{1,n}(X_{\pi(2j-1)})+M_{0,n}(X_{\pi(2j-1)})-(M_{1,n}(X_{\pi(2j)})+M_{0,n}(X_{\pi(2j)}))\right)^{2}
1n1jn|M1,n(Xπ(2j1))M1,n(Xπ(2j))|2+1n1jn|M0,n(Xπ(2j1))M0,n(Xπ(2j))|2\displaystyle\lesssim\frac{1}{n}\sum_{1\leq j\leq n}|M_{1,n}(X_{\pi(2j-1)})-M_{1,n}(X_{\pi(2j)})|^{2}+\frac{1}{n}\sum_{1\leq j\leq n}|M_{0,n}(X_{\pi(2j-1)})-M_{0,n}(X_{\pi(2j)})|^{2}
P0,\displaystyle\stackrel{{\scriptstyle P}}{{\to}}0~{},

where the inequality follows from (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}) and the convergence follows from Assumptions 2.3 and 3.1(c). By Markov’s inequality and the fact that E[𝕃n|X(n)]=0E[\mathbb{L}_{n}|X^{(n)}]=0, for any ϵ>0\epsilon>0,

P{|𝕃n|>ϵ|X(n)}Var[𝕃n|X(n)]ϵ2P0.P\{|\mathbb{L}_{n}|>\epsilon|X^{(n)}\}\leq\frac{\operatorname*{Var}[\mathbb{L}_{n}|X^{(n)}]}{\epsilon^{2}}\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Since conditional probabilities are bounded by one, the dominated convergence theorem implies P\{|\mathbb{L}_{n}|>\epsilon\}\to 0 for every \epsilon>0, and hence \mathbb{L}_{n}=o_{P}(1). This fact, together with (24), implies

\sqrt{n}(\hat{\Delta}_{n}-\Delta(Q))=A_{n}-B_{n}+C_{n}-D_{n}+o_{P}(1)~{},

where

An\displaystyle A_{n} =1n1i2n(Diϕ1,n,iE[Diϕ1,n,i|X(n),D(n)])\displaystyle=\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}\left(D_{i}\phi_{1,n,i}-E[D_{i}\phi_{1,n,i}|X^{(n)},D^{(n)}]\right)
Bn\displaystyle B_{n} =1n1i2n((1Di)ϕ0,n,iE[(1Di)ϕ0,n,i|X(n),D(n)])\displaystyle=\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}\left((1-D_{i})\phi_{0,n,i}-E[(1-D_{i})\phi_{0,n,i}|X^{(n)},D^{(n)}]\right)
Cn\displaystyle C_{n} =1n1i2nDi(E[Yi(1)|Xi]E[Yi(1)])\displaystyle=\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}D_{i}(E[Y_{i}(1)|X_{i}]-E[Y_{i}(1)])
Dn\displaystyle D_{n} =1n1i2n(1Di)(E[Yi(0)|Xi]E[Yi(0)]).\displaystyle=\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}(1-D_{i})(E[Y_{i}(0)|X_{i}]-E[Y_{i}(0)])~{}.

Note that conditional on X(n)X^{(n)} and D(n)D^{(n)}, AnA_{n} and BnB_{n} are independent while CnC_{n} and DnD_{n} are constants.

Step 2: Conditional central limit theorems

We first analyze the limiting behavior of AnA_{n}. Define

sn2=1i2nDiVar[ϕ1,n,i|Xi].s_{n}^{2}=\sum_{1\leq i\leq 2n}D_{i}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]~{}.

Note by Assumption 2.2 that s_{n}^{2}=n\operatorname*{Var}[A_{n}|X^{(n)},D^{(n)}]. We proceed to verify the Lindeberg condition for A_{n} conditional on X^{(n)} and D^{(n)}, i.e., we show that for every \epsilon>0,

1sn21i2nE[|Di(ϕ1,n,iE[ϕ1,n,i|Xi])|2I{|Di(ϕ1,n,iE[ϕ1,n,i|Xi])|>ϵsn}|X(n),D(n)]P0.\frac{1}{s_{n}^{2}}\sum_{1\leq i\leq 2n}E[|D_{i}(\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}])|^{2}I\{|D_{i}(\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}])|>\epsilon s_{n}\}|X^{(n)},D^{(n)}]\stackrel{{\scriptstyle P}}{{\to}}0~{}. (25)

To that end, first note Lemma B.2 implies

sn2nE[Var[ϕ1,n,i|Xi]]P1.\frac{s_{n}^{2}}{nE[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]}\stackrel{{\scriptstyle P}}{{\to}}1~{}. (26)

(26) and Assumption 3.1(a) imply that for all λ>0\lambda>0,

P{ϵsn>λ}P1.P\{\epsilon s_{n}>\lambda\}\stackrel{{\scriptstyle P}}{{\to}}1~{}. (27)

Furthermore, for some c>0c>0,

P{sn2n>c}1.P\left\{\frac{s_{n}^{2}}{n}>c\right\}\to 1~{}. (28)

Next, note for any λ>0\lambda>0 and δ1>0\delta_{1}>0, the left-hand side of (25) can be written as

1sn2/n1n1i2n:Di=1E[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|>ϵsn}|X(n),D(n)]\displaystyle\frac{1}{s_{n}^{2}/n}\frac{1}{n}\sum_{1\leq i\leq 2n:D_{i}=1}E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|>\epsilon s_{n}\}|X^{(n)},D^{(n)}]
1sn2/n1n1i2nE[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|>ϵsn}|X(n),D(n)]\displaystyle\leq\frac{1}{s_{n}^{2}/n}\frac{1}{n}\sum_{1\leq i\leq 2n}E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|>\epsilon s_{n}\}|X^{(n)},D^{(n)}]
1c1n1i2nE[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|>λ}|X(n),D(n)]+oP(1)\displaystyle\leq\frac{1}{c}\frac{1}{n}\sum_{1\leq i\leq 2n}E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|>\lambda\}|X^{(n)},D^{(n)}]+o_{P}(1)
2c12n1i2nE[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|>λ}|Xi]+oP(1),\displaystyle\leq\frac{2}{c}\frac{1}{2n}\sum_{1\leq i\leq 2n}E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|>\lambda\}|X_{i}]+o_{P}(1)~{}, (29)

where the first inequality follows by inspection, the second follows from (27)–(28), and the last follows from Assumption 2.2. We then argue

12n1i2nE[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|>λ}|Xi]=E[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|>λ}]+oP(1).\frac{1}{2n}\sum_{1\leq i\leq 2n}E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|>\lambda\}|X_{i}]\\ =E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|>\lambda\}]+o_{P}(1)~{}. (30)

To this end, we once again verify the Lindeberg condition in Lemma 11.4.2 of Lehmann and Romano (2005). Note

|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|>λ}|ϕ1,n,iE[ϕ1,n,i|Xi]|2.|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|>\lambda\}\leq|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}~{}.

Therefore, in light of Lemma B.1, we only need to verify

limγlim supnE[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|2>γ}]=0,\lim_{\gamma\to\infty}\limsup_{n\to\infty}E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}>\gamma\}]=0~{}, (31)

which follows immediately from Lemma B.3.

Another application of (31) implies (25). Lindeberg’s central limit theorem and (26) then imply that

supt𝐑|P{An/E[Var[ϕ1,n,i|Xi]]t|X(n),D(n)}Φ(t)|P0.\sup_{t\in\mathbf{R}}|P\{A_{n}/\sqrt{E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]}\leq t|X^{(n)},D^{(n)}\}-\Phi(t)|\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Similar arguments lead to

supt𝐑|P{Bn/E[Var[ϕ0,n,i|Xi]]t|X(n),D(n)}Φ(t)|P0.\sup_{t\in\mathbf{R}}|P\{B_{n}/\sqrt{E[\operatorname*{Var}[\phi_{0,n,i}|X_{i}]]}\leq t|X^{(n)},D^{(n)}\}-\Phi(t)|\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Step 3: Combining conditional and unconditional components

Meanwhile, it follows from the same arguments as those in (S.22)–(S.25) of Bai et al. (2022) that

CnDndN(0,12E[(E[Yi(1)|Xi]E[Yi(1)](E[Yi(0)|Xi]E[Yi(0)]))2]).C_{n}-D_{n}\stackrel{{\scriptstyle d}}{{\to}}N\left(0,\frac{1}{2}E\left[(E[Y_{i}(1)|X_{i}]-E[Y_{i}(1)]-(E[Y_{i}(0)|X_{i}]-E[Y_{i}(0)]))^{2}\right]\right)~{}.

To establish (10), define νn2=ν1,n2+ν0,n2+ν22\nu_{n}^{2}=\nu_{1,n}^{2}+\nu_{0,n}^{2}+\nu_{2}^{2}, where

ν1,n2\displaystyle\nu_{1,n}^{2} =E[Var[ϕ1,n,i|Xi]]\displaystyle=E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]
ν0,n2\displaystyle\nu_{0,n}^{2} =E[Var[ϕ0,n,i|Xi]]\displaystyle=E[\operatorname*{Var}[\phi_{0,n,i}|X_{i}]]
\displaystyle\nu_{2}^{2} \displaystyle=\frac{1}{2}E\left[(E[Y_{i}(1)|X_{i}]-E[Y_{i}(1)]-(E[Y_{i}(0)|X_{i}]-E[Y_{i}(0)]))^{2}\right]

Note

n(Δ^nΔ(Q))νn=Anν1,nν1,nνnBnν0,nν0,nνn+CnDnν2ν2νn.\frac{\sqrt{n}(\hat{\Delta}_{n}-\Delta(Q))}{\nu_{n}}=\frac{A_{n}}{\nu_{1,n}}\frac{\nu_{1,n}}{\nu_{n}}-\frac{B_{n}}{\nu_{0,n}}\frac{\nu_{0,n}}{\nu_{n}}+\frac{C_{n}-D_{n}}{\nu_{2}}\frac{\nu_{2}}{\nu_{n}}~{}.

Further note νn,ν1,n,ν0,n,ν2\nu_{n},\nu_{1,n},\nu_{0,n},\nu_{2} are all constants conditional on X(n)X^{(n)} and D(n)D^{(n)}. Suppose by contradiction that n(Δ^nΔ(Q))νn\frac{\sqrt{n}(\hat{\Delta}_{n}-\Delta(Q))}{\nu_{n}} does not converge in distribution to N(0,1)N(0,1). Then, there exists ϵ>0\epsilon>0 and a subsequence {nk}\{n_{k}\} such that

supt𝐑|P{nk(Δ^nkΔ(Q))/νnkt}Φ(t)|ϵ.\sup_{t\in\mathbf{R}}|P\{\sqrt{n}_{k}(\hat{\Delta}_{n_{k}}-\Delta(Q))/\nu_{n_{k}}\leq t\}-\Phi(t)|\to\epsilon~{}. (32)

Because the sequences \nu_{1,n_{k}} and \nu_{0,n_{k}} are bounded by Assumption 3.1(b), there is a further subsequence, which with some abuse of notation we still denote by \{n_{k}\}, along which \nu_{1,n_{k}}\to\nu_{1}^{\ast} and \nu_{0,n_{k}}\to\nu_{0}^{\ast} for some \nu_{1}^{\ast},\nu_{0}^{\ast}\geq 0. Then, \nu_{1,n_{k}}/\nu_{n_{k}},\nu_{0,n_{k}}/\nu_{n_{k}},\nu_{2}/\nu_{n_{k}} all converge to constants. Therefore, it follows from Lemma S.1.2 of Bai et al. (2022) that

nk(Δ^nkΔ(Q))/νnkdN(0,1),\sqrt{n}_{k}(\hat{\Delta}_{n_{k}}-\Delta(Q))/\nu_{n_{k}}\stackrel{{\scriptstyle d}}{{\to}}N(0,1)~{},

a contradiction to (32). Therefore, the desired convergence in Theorem 3.1 follows.

Step 4: Rearranging the variance formula

To conclude the proof with the variance formula as stated in the theorem, note

Var[Yi(0)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]\displaystyle\operatorname*{Var}\Big{[}Y_{i}(0)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\Big{|}X_{i}\Big{]}
=Var[E[Yi(0)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi,Wi]|Xi]\displaystyle=\operatorname*{Var}\Big{[}E\Big{[}Y_{i}(0)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\Big{|}X_{i},W_{i}\Big{]}\Big{|}X_{i}\Big{]}
+E[Var[Yi(0)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi,Wi]|Xi]\displaystyle\hskip 30.00005pt+E\Big{[}\operatorname*{Var}\Big{[}Y_{i}(0)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\Big{|}X_{i},W_{i}\Big{]}\Big{|}X_{i}\Big{]}
=Var[E[Yi(1)+Yi(0)2|Xi,Wi]12(m1,n(Xi,Wi)+m0,n(Xi,Wi))E[Yi(1)Yi(0)2|Xi,Wi]|Xi]\displaystyle=\operatorname*{Var}\Big{[}E\Big{[}\frac{Y_{i}(1)+Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))-E\Big{[}\frac{Y_{i}(1)-Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}\Big{|}X_{i}\Big{]}
+E[Var[Yi(0)|Xi,Wi]|Xi]\displaystyle\hskip 30.00005pt+E[\operatorname*{Var}[Y_{i}(0)|X_{i},W_{i}]|X_{i}]
=Var[E[Yi(1)+Yi(0)2|Xi,Wi]12(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]\displaystyle=\operatorname*{Var}\Big{[}E\Big{[}\frac{Y_{i}(1)+Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\Big{|}X_{i}\Big{]}
+Var[E[Yi(1)Yi(0)2|Xi,Wi]|Xi]\displaystyle\hskip 30.00005pt+\operatorname*{Var}\Big{[}E\Big{[}\frac{Y_{i}(1)-Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}\Big{|}X_{i}\Big{]}
2Cov[E[Yi(1)+Yi(0)2|Xi,Wi]12(m1,n(Xi,Wi)+m0,n(Xi,Wi)),E[Yi(1)Yi(0)2|Xi,Wi]|Xi]\displaystyle\hskip 30.00005pt-2\mathrm{Cov}\Big{[}E\Big{[}\frac{Y_{i}(1)+Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i})),E\Big{[}\frac{Y_{i}(1)-Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}\Big{|}X_{i}\Big{]}
+E[Var[Yi(0)|Xi,Wi]|Xi],\displaystyle\hskip 30.00005pt+E[\operatorname*{Var}[Y_{i}(0)|X_{i},W_{i}]|X_{i}]~{}, (33)

where the first equality follows from the law of total variance, the second one follows by direct calculation, and the last one follows by expanding the variance of the sum. Similarly,

Var[Yi(1)12(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]\displaystyle\operatorname*{Var}\Big{[}Y_{i}(1)-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\Big{|}X_{i}\Big{]}
=Var[E[Yi(1)+Yi(0)2|Xi,Wi]12(m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]\displaystyle=\operatorname*{Var}\Big{[}E\Big{[}\frac{Y_{i}(1)+Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))\Big{|}X_{i}\Big{]}
+Var[E[Yi(1)Yi(0)2|Xi,Wi]|Xi]\displaystyle\hskip 30.00005pt+\operatorname*{Var}\Big{[}E\Big{[}\frac{Y_{i}(1)-Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}\Big{|}X_{i}\Big{]}
+2Cov[E[Yi(1)+Yi(0)2|Xi,Wi]12(m1,n(Xi,Wi)+m0,n(Xi,Wi)),E[Yi(1)Yi(0)2|Xi,Wi]|Xi]\displaystyle\hskip 30.00005pt+2\mathrm{Cov}\Big{[}E\Big{[}\frac{Y_{i}(1)+Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i})),E\Big{[}\frac{Y_{i}(1)-Y_{i}(0)}{2}\Big{|}X_{i},W_{i}\Big{]}\Big{|}X_{i}\Big{]}
+E[Var[Yi(1)|Xi,Wi]|Xi].\displaystyle\hskip 30.00005pt+E[\operatorname*{Var}[Y_{i}(1)|X_{i},W_{i}]|X_{i}]~{}. (34)

It follows that

σn2(Q)\displaystyle\sigma_{n}^{2}(Q) =12E[Var[E[Yi(1)+Yi(0)|Xi,Wi](m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]]\displaystyle=\frac{1}{2}E[\operatorname*{Var}[E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]-(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))|X_{i}]]
+12E[Var[E[Yi(1)Yi(0)|Xi,Wi]|Xi]]+12Var[E[Yi(1)Yi(0)|Xi]]\displaystyle\hskip 30.00005pt+\frac{1}{2}E[\operatorname*{Var}[E[Y_{i}(1)-Y_{i}(0)|X_{i},W_{i}]|X_{i}]]+\frac{1}{2}\operatorname*{Var}[E[Y_{i}(1)-Y_{i}(0)|X_{i}]]
\displaystyle\hskip 30.00005pt+E[\operatorname*{Var}[Y_{i}(0)|X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(1)|X_{i},W_{i}]]
=12E[Var[E[Yi(1)+Yi(0)|Xi,Wi](m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]]\displaystyle=\frac{1}{2}E[\operatorname*{Var}[E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]-(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))|X_{i}]]
+12E[(E[Yi(1)Yi(0)|Xi,Wi]E[Yi(1)Yi(0)|Xi])2]\displaystyle\hskip 30.00005pt+\frac{1}{2}E[(E[Y_{i}(1)-Y_{i}(0)|X_{i},W_{i}]-E[Y_{i}(1)-Y_{i}(0)|X_{i}])^{2}]
+12E[(E[Yi(1)Yi(0)|Xi]E[Yi(1)Yi(0)])2]\displaystyle\hskip 30.00005pt+\frac{1}{2}E[(E[Y_{i}(1)-Y_{i}(0)|X_{i}]-E[Y_{i}(1)-Y_{i}(0)])^{2}]
+E[(Yi(0)E[Yi(0)|Xi,Wi])2]+E[(Yi(1)E[Yi(1)|Xi,Wi])2]\displaystyle\hskip 30.00005pt+E[(Y_{i}(0)-E[Y_{i}(0)|X_{i},W_{i}])^{2}]+E[(Y_{i}(1)-E[Y_{i}(1)|X_{i},W_{i}])^{2}]
=12E[Var[E[Yi(1)+Yi(0)|Xi,Wi](m1,n(Xi,Wi)+m0,n(Xi,Wi))|Xi]]\displaystyle=\frac{1}{2}E[\operatorname*{Var}[E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{i}]-(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))|X_{i}]]
+12Var[E[Yi(1)Yi(0)|Xi,Wi]]+E[Var[Yi(0)|Xi,Wi]]+E[Var[Yi(1)|Xi,Wi]],\displaystyle\hskip 30.00005pt+\frac{1}{2}\operatorname*{Var}[E[Y_{i}(1)-Y_{i}(0)|X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(0)|X_{i},W_{i}]]+E[\operatorname*{Var}[Y_{i}(1)|X_{i},W_{i}]]~{},

where the first equality follows by definition, the second one follows from (33)–(34), the third one again follows by definition, and the last one follows because by the law of iterated expectations,

E[(E[Yi(1)Yi(0)|Xi,Wi]E[Yi(1)Yi(0)|Xi])(E[Yi(1)Yi(0)|Xi]E[Yi(1)Yi(0)])]=0.E[(E[Y_{i}(1)-Y_{i}(0)|X_{i},W_{i}]-E[Y_{i}(1)-Y_{i}(0)|X_{i}])(E[Y_{i}(1)-Y_{i}(0)|X_{i}]-E[Y_{i}(1)-Y_{i}(0)])]=0~{}.

The conclusion then follows.  

A.2 Proof of Theorem 3.2

Theorem 3.1 implies Δ^nPΔ(Q)\hat{\Delta}_{n}\stackrel{{\scriptstyle P}}{{\to}}\Delta(Q). Next, we show

\hat{\tau}_{n}^{2}-\left(E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]+E[\operatorname*{Var}[\phi_{0,n,i}|X_{i}]]+E[(E[Y_{i}(1)|X_{i}]-E[Y_{i}(0)|X_{i}])^{2}]\right)\stackrel{{\scriptstyle P}}{{\to}}0~{}. (35)

To that end, define

Y̊i=Yi12(m1,n(Xi,Wi)+m0,n(Xi,Wi)).\mathring{Y}_{i}=Y_{i}-\frac{1}{2}(m_{1,n}(X_{i},W_{i})+m_{0,n}(X_{i},W_{i}))~{}.

Note

τ^n2\displaystyle\hat{\tau}_{n}^{2} =1n1jn(Y̊π(2j1)Y̊π(2j)+(Y~π(2j1)Y~π(2j)(Y̊π(2j1)Y̊π(2j))))2\displaystyle=\frac{1}{n}\sum_{1\leq j\leq n}\left(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)}+(\tilde{Y}_{\pi(2j-1)}-\tilde{Y}_{\pi(2j)}-(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)}))\right)^{2}
=1n1jn(Y̊π(2j1)Y̊π(2j))2+1n1jn(Y~π(2j1)Y~π(2j)(Y̊π(2j1)Y̊π(2j)))2\displaystyle=\frac{1}{n}\sum_{1\leq j\leq n}(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)})^{2}+\frac{1}{n}\sum_{1\leq j\leq n}(\tilde{Y}_{\pi(2j-1)}-\tilde{Y}_{\pi(2j)}-(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)}))^{2}
+2n1jn(Y~π(2j1)Y~π(2j)(Y̊π(2j1)Y̊π(2j)))(Y̊π(2j1)Y̊π(2j)).\displaystyle\hskip 30.00005pt+\frac{2}{n}\sum_{1\leq j\leq n}(\tilde{Y}_{\pi(2j-1)}-\tilde{Y}_{\pi(2j)}-(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)}))(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)})~{}.

Therefore, to establish (35), we first show

\frac{1}{n}\sum_{1\leq j\leq n}(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)})^{2}-\left(E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]+E[\operatorname*{Var}[\phi_{0,n,i}|X_{i}]]+E[(E[Y_{i}(1)|X_{i}]-E[Y_{i}(0)|X_{i}])^{2}]\right)\stackrel{{\scriptstyle P}}{{\to}}0 (36)

and

1n1jn(Y~π(2j1)Y~π(2j)(Y̊π(2j1)Y̊π(2j)))2P0.\frac{1}{n}\sum_{1\leq j\leq n}(\tilde{Y}_{\pi(2j-1)}-\tilde{Y}_{\pi(2j)}-(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)}))^{2}\stackrel{{\scriptstyle P}}{{\to}}0~{}. (37)

(37) immediately follows from repeated applications of the inequality (ab)22(a2+b2)(a-b)^{2}\leq 2(a^{2}+b^{2}) and (12). To verify (36), note

1n1jn(Y̊π(2j1)Y̊π(2j))2=1n1i2nY̊i22n1jnY̊π(2j1)Y̊π(2j).\frac{1}{n}\sum_{1\leq j\leq n}(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)})^{2}=\frac{1}{n}\sum_{1\leq i\leq 2n}\mathring{Y}_{i}^{2}-\frac{2}{n}\sum_{1\leq j\leq n}\mathring{Y}_{\pi(2j-1)}\mathring{Y}_{\pi(2j)}~{}.

It follows from similar arguments to those in the proof of Lemma B.2 below that

\frac{1}{n}\sum_{1\leq i\leq 2n}\mathring{Y}_{i}^{2}-\left(E[\phi_{1,n,i}^{2}]+E[\phi_{0,n,i}^{2}]\right)\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Similarly, it follows from the proof of the same lemma that

2n1jnY̊π(2j1)Y̊π(2j)2E[E[ϕ1,n,i|Xi]E[ϕ0,n,i|Xi]]P0.\frac{2}{n}\sum_{1\leq j\leq n}\mathring{Y}_{\pi(2j-1)}\mathring{Y}_{\pi(2j)}-2E[E[\phi_{1,n,i}|X_{i}]E[\phi_{0,n,i}|X_{i}]]\stackrel{{\scriptstyle P}}{{\to}}0~{}.

To establish (36), note

E[ϕ1,n,i2]+E[ϕ0,n,i2]2E[E[ϕ1,n,i|Xi]E[ϕ0,n,i|Xi]]\displaystyle E[\phi_{1,n,i}^{2}]+E[\phi_{0,n,i}^{2}]-2E[E[\phi_{1,n,i}|X_{i}]E[\phi_{0,n,i}|X_{i}]]
=E[Var[ϕ1,n,i|Xi]]+E[Var[ϕ0,n,i|Xi]]+E[E[ϕ1,n,i|Xi]2]+E[E[ϕ0,n,i|Xi]2]2E[E[ϕ1,n,i|Xi]E[ϕ0,n,i|Xi]]\displaystyle=E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]+E[\operatorname*{Var}[\phi_{0,n,i}|X_{i}]]+E[E[\phi_{1,n,i}|X_{i}]^{2}]+E[E[\phi_{0,n,i}|X_{i}]^{2}]-2E[E[\phi_{1,n,i}|X_{i}]E[\phi_{0,n,i}|X_{i}]]
=E[Var[ϕ1,n,i|Xi]]+E[Var[ϕ0,n,i|Xi]]+E[(E[ϕ1,n,i|Xi]E[ϕ0,n,i|Xi])2]\displaystyle=E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]+E[\operatorname*{Var}[\phi_{0,n,i}|X_{i}]]+E[(E[\phi_{1,n,i}|X_{i}]-E[\phi_{0,n,i}|X_{i}])^{2}]
=E[Var[ϕ1,n,i|Xi]]+E[Var[ϕ0,n,i|Xi]]+E[(E[Yi(1)|Xi]E[Yi(0)|Xi])2],\displaystyle=E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]+E[\operatorname*{Var}[\phi_{0,n,i}|X_{i}]]+E[(E[Y_{i}(1)|X_{i}]-E[Y_{i}(0)|X_{i}])^{2}]~{},

where the last equality follows from the definition of ϕ1,n,i\phi_{1,n,i} and ϕ0,n,i\phi_{0,n,i}. It then follows from the Cauchy-Schwarz inequality that

\left|\frac{1}{n}\sum_{1\leq j\leq n}(\tilde{Y}_{\pi(2j-1)}-\tilde{Y}_{\pi(2j)}-(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)}))(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)})\right|\\ \leq\left(\frac{1}{n}\sum_{1\leq j\leq n}(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)})^{2}\right)^{1/2}\left(\frac{1}{n}\sum_{1\leq j\leq n}(\tilde{Y}_{\pi(2j-1)}-\tilde{Y}_{\pi(2j)}-(\mathring{Y}_{\pi(2j-1)}-\mathring{Y}_{\pi(2j)}))^{2}\right)^{1/2}\stackrel{{\scriptstyle P}}{{\to}}0~{},

which, together with (36)–(37) as well as Assumptions 2.1(b) and 3.1(b), imply (35).

Next, we show

λ^nPE[(E[Yi(1)|Xi]E[Yi(0)|Xi])2].\hat{\lambda}_{n}\stackrel{{\scriptstyle P}}{{\to}}E[(E[Y_{i}(1)|X_{i}]-E[Y_{i}(0)|X_{i}])^{2}]~{}. (38)

Note

λ^n2n1jn2(Y̊π(4j3)Y̊π(4j2))(Y̊π(4j1)Y̊π(4j))(Dπ(4j3)Dπ(4j2))(Dπ(4j1)Dπ(4j))\displaystyle\hat{\lambda}_{n}-\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\mathring{Y}_{\pi(4j-3)}-\mathring{Y}_{\pi(4j-2)})(\mathring{Y}_{\pi(4j-1)}-\mathring{Y}_{\pi(4j)})(D_{\pi(4j-3)}-D_{\pi(4j-2)})(D_{\pi(4j-1)}-D_{\pi(4j)}) (39)
=2n1jn2(Y~π(4j3)Y̊π(4j3)(Y~π(4j2)Y̊π(4j2)))(Y̊π(4j1)Y̊π(4j))\displaystyle=\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\tilde{Y}_{\pi(4j-3)}-\mathring{Y}_{\pi(4j-3)}-(\tilde{Y}_{\pi(4j-2)}-\mathring{Y}_{\pi(4j-2)}))(\mathring{Y}_{\pi(4j-1)}-\mathring{Y}_{\pi(4j)})
×(Dπ(4j3)Dπ(4j2))(Dπ(4j1)Dπ(4j))\displaystyle\hskip 50.00008pt\times(D_{\pi(4j-3)}-D_{\pi(4j-2)})(D_{\pi(4j-1)}-D_{\pi(4j)})
\displaystyle\hskip 30.00005pt+\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\mathring{Y}_{\pi(4j-3)}-\mathring{Y}_{\pi(4j-2)})((\tilde{Y}_{\pi(4j-1)}-\mathring{Y}_{\pi(4j-1)})-(\tilde{Y}_{\pi(4j)}-\mathring{Y}_{\pi(4j)}))
×(Dπ(4j3)Dπ(4j2))(Dπ(4j1)Dπ(4j))\displaystyle\hskip 50.00008pt\times(D_{\pi(4j-3)}-D_{\pi(4j-2)})(D_{\pi(4j-1)}-D_{\pi(4j)})
+2n1jn2(Y~π(4j3)Y̊π(4j3)(Y~π(4j2)Y̊π(4j2)))\displaystyle\hskip 30.00005pt+\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\tilde{Y}_{\pi(4j-3)}-\mathring{Y}_{\pi(4j-3)}-(\tilde{Y}_{\pi(4j-2)}-\mathring{Y}_{\pi(4j-2)}))
\displaystyle\hskip 50.00008pt\times((\tilde{Y}_{\pi(4j-1)}-\mathring{Y}_{\pi(4j-1)})-(\tilde{Y}_{\pi(4j)}-\mathring{Y}_{\pi(4j)}))(D_{\pi(4j-3)}-D_{\pi(4j-2)})(D_{\pi(4j-1)}-D_{\pi(4j)})~{}.

In what follows, we show

2n1jn2(Y̊π(4j3)Y̊π(4j2))2=OP(1)\displaystyle\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\mathring{Y}_{\pi(4j-3)}-\mathring{Y}_{\pi(4j-2)})^{2}=O_{P}(1) (40)
2n1jn2(Y̊π(4j1)Y̊π(4j))2=OP(1)\displaystyle\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\mathring{Y}_{\pi(4j-1)}-\mathring{Y}_{\pi(4j)})^{2}=O_{P}(1) (41)
2n1jn2(Y~π(4j3)Y̊π(4j3)(Y~π(4j2)Y̊π(4j2)))2=oP(1)\displaystyle\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\tilde{Y}_{\pi(4j-3)}-\mathring{Y}_{\pi(4j-3)}-(\tilde{Y}_{\pi(4j-2)}-\mathring{Y}_{\pi(4j-2)}))^{2}=o_{P}(1) (42)
\displaystyle\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}((\tilde{Y}_{\pi(4j-1)}-\mathring{Y}_{\pi(4j-1)})-(\tilde{Y}_{\pi(4j)}-\mathring{Y}_{\pi(4j)}))^{2}=o_{P}(1) (43)
2n1jn2(Y̊π(4j3)Y̊π(4j2))(Y̊π(4j1)Y̊π(4j))(Dπ(4j3)Dπ(4j2))(Dπ(4j1)Dπ(4j))\displaystyle\frac{2}{n}\sum_{1\leq j\leq\lfloor\frac{n}{2}\rfloor}(\mathring{Y}_{\pi(4j-3)}-\mathring{Y}_{\pi(4j-2)})(\mathring{Y}_{\pi(4j-1)}-\mathring{Y}_{\pi(4j)})(D_{\pi(4j-3)}-D_{\pi(4j-2)})(D_{\pi(4j-1)}-D_{\pi(4j)})
PE[(E[Yi(1)|Xi]E[Yi(0)|Xi])2].\displaystyle\hskip 50.00008pt\stackrel{{\scriptstyle P}}{{\to}}E[(E[Y_{i}(1)|X_{i}]-E[Y_{i}(0)|X_{i}])^{2}]~{}. (44)

To establish (40)–(41), note they follow directly from (36) and Assumptions 2.1(b) and 3.1(b). Next, note (42) follows from repeated applications of the inequality (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}) and (12). (43) can be established by similar arguments. (44) follows from similar arguments to those in the proof of Lemma S.1.7 of Bai et al. (2022), with the uniform integrability arguments replaced by arguments similar to those in the proof of Lemma B.2, together with Assumptions 2.12.4 and 3.1. (39)–(44) imply (38) immediately.

Finally, note we have shown

σ^n2σn2P0.\hat{\sigma}_{n}^{2}-\sigma_{n}^{2}\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Assumption 3.1(a) implies σn2\sigma_{n}^{2} is bounded away from zero, so

σ^nσnP1.\frac{\hat{\sigma}_{n}}{\sigma_{n}}\stackrel{{\scriptstyle P}}{{\to}}1~{}.

The conclusion of the theorem then follows.  

A.3 Proof of Theorem 4.1

We will apply the Frisch-Waugh-Lovell theorem to obtain an expression for β^nnaive\hat{\beta}_{n}^{\rm naive}. Consider the linear regression of ψi\psi_{i} on 11 and DiD_{i}. Define

μ^ψ,n(d)=1n1i2nψiI{Di=d}\hat{\mu}_{\psi,n}(d)=\frac{1}{n}\sum_{1\leq i\leq 2n}\psi_{i}I\{D_{i}=d\}

for d{0,1}d\in\{0,1\} and

Δ^ψ,n=μ^ψ,n(1)μ^ψ,n(0).\hat{\Delta}_{\psi,n}=\hat{\mu}_{\psi,n}(1)-\hat{\mu}_{\psi,n}(0)~{}.

The iith residual based on the OLS estimation of this linear regression model is given by

ψ~i=ψiμ^ψ,n(0)Δ^ψ,nDi.\tilde{\psi}_{i}=\psi_{i}-\hat{\mu}_{\psi,n}(0)-\hat{\Delta}_{\psi,n}D_{i}~{}.

β^nnaive\hat{\beta}_{n}^{\rm naive} is then given by the OLS estimator of the coefficient in the linear regression of YiY_{i} on ψ~i\tilde{\psi}_{i}. Note

12n1i2nψ~iψ~i\displaystyle\frac{1}{2n}\sum_{1\leq i\leq 2n}\tilde{\psi}_{i}\tilde{\psi}_{i}^{\prime} =12n1i2n(ψiμ^ψ,n(1))(ψiμ^ψ,n(1))Di+12n1i2n(ψiμ^ψ,n(0))(ψiμ^ψ,n(0))(1Di)\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}(\psi_{i}-\hat{\mu}_{\psi,n}(1))(\psi_{i}-\hat{\mu}_{\psi,n}(1))^{\prime}D_{i}+\frac{1}{2n}\sum_{1\leq i\leq 2n}(\psi_{i}-\hat{\mu}_{\psi,n}(0))(\psi_{i}-\hat{\mu}_{\psi,n}(0))^{\prime}(1-D_{i})
=12n1i2nψiψi12μ^ψ,n(1)μ^ψ,n(1)12μ^ψ,n(0)μ^ψ,n(0).\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}\psi_{i}\psi_{i}^{\prime}-\frac{1}{2}\hat{\mu}_{\psi,n}(1)\hat{\mu}_{\psi,n}(1)^{\prime}-\frac{1}{2}\hat{\mu}_{\psi,n}(0)\hat{\mu}_{\psi,n}(0)^{\prime}~{}.

It follows from Assumption 4.1(b) and the weak law of large numbers that

12n1i2nψiψiPE[ψiψi].\frac{1}{2n}\sum_{1\leq i\leq 2n}\psi_{i}\psi_{i}^{\prime}\stackrel{{\scriptstyle P}}{{\to}}E[\psi_{i}\psi_{i}^{\prime}]~{}.

On the other hand, it follows from Assumptions 2.2–2.3 and 4.1(b)–(c) as well as similar arguments to those in the proof of Lemma S.1.5 of Bai et al. (2022) that

μ^ψ,n(d)PE[ψi]\hat{\mu}_{\psi,n}(d)\stackrel{{\scriptstyle P}}{{\to}}E[\psi_{i}]

for d{0,1}d\in\{0,1\}. Therefore,

12n1i2nψ~iψ~iPVar[ψi].\frac{1}{2n}\sum_{1\leq i\leq 2n}\tilde{\psi}_{i}\tilde{\psi}_{i}^{\prime}\stackrel{{\scriptstyle P}}{{\to}}\operatorname*{Var}[\psi_{i}]~{}.

Next,

12n1i2nψ~iYi\displaystyle\frac{1}{2n}\sum_{1\leq i\leq 2n}\tilde{\psi}_{i}Y_{i} =12n1i2n(ψiμ^ψ,n(1))Yi(1)Di+12n1i2n(ψiμ^ψ,n(0))Yi(0)(1Di)\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}(\psi_{i}-\hat{\mu}_{\psi,n}(1))Y_{i}(1)D_{i}+\frac{1}{2n}\sum_{1\leq i\leq 2n}(\psi_{i}-\hat{\mu}_{\psi,n}(0))Y_{i}(0)(1-D_{i})

It follows from arguments similar to those above, together with Assumptions 2.1(b), 2.2–2.3, and 4.1(b)–(c), that

12n1i2nψ~iYiPCov[ψi,Yi(1)+Yi(0)].\frac{1}{2n}\sum_{1\leq i\leq 2n}\tilde{\psi}_{i}Y_{i}\stackrel{{\scriptstyle P}}{{\to}}\operatorname*{Cov}[\psi_{i},Y_{i}(1)+Y_{i}(0)]~{}.

The convergence of β^nnaive\hat{\beta}_{n}^{\rm naive} therefore follows from the continuous mapping theorem and Assumption 4.1(a).
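
As a purely illustrative numerical check of the Frisch-Waugh-Lovell argument above (not part of the formal proof), the following Python/NumPy sketch, with hypothetical variable names and a simulated data-generating process, computes \hat{\beta}_{n}^{\rm naive} both by residualizing \psi_{i} on 1 and D_{i} and by the full regression of Y_{i} on 1, D_{i}, and \psi_{i}, and verifies that the two coincide.

import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 3                            # n pairs (2n units); k = dim(psi_i); hypothetical sizes
psi = rng.normal(size=(2 * n, k))        # observed covariates psi_i
D = np.zeros(2 * n, dtype=int)
D[::2] = rng.integers(0, 2, size=n)      # within each pair, one unit is treated at random
D[1::2] = 1 - D[::2]
Y = D * (1.0 + psi @ np.ones(k)) + (1 - D) * (psi @ np.full(k, 0.5)) + rng.normal(size=2 * n)

# Frisch-Waugh-Lovell route: residualize psi on (1, D), then regress Y on the residuals.
mu1, mu0 = psi[D == 1].mean(axis=0), psi[D == 0].mean(axis=0)
psi_tilde = psi - mu0 - np.outer(D, mu1 - mu0)           # psi_i - mu_hat(0) - Delta_hat * D_i
beta_fwl = np.linalg.lstsq(psi_tilde, Y, rcond=None)[0]

# Direct route: coefficients on psi in the OLS regression of Y on (1, D, psi).
X_full = np.column_stack([np.ones(2 * n), D, psi])
beta_full = np.linalg.lstsq(X_full, Y, rcond=None)[0][2:]
assert np.allclose(beta_fwl, beta_full)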

To see (12) is satisfied, note

12n1i2n(m^d,n(Xi,Wi)md,n(Xi,Wi))2=(β^nnaiveβnaive)(12n1i2nψiψi)(β^nnaiveβnaive).\frac{1}{2n}\sum_{1\leq i\leq 2n}(\hat{m}_{d,n}(X_{i},W_{i})-m_{d,n}(X_{i},W_{i}))^{2}=(\hat{\beta}_{n}^{\rm naive}-\beta^{\rm naive})^{\prime}\left(\frac{1}{2n}\sum_{1\leq i\leq 2n}\psi_{i}\psi_{i}^{\prime}\right)(\hat{\beta}_{n}^{\rm naive}-\beta^{\rm naive})~{}.

(12) then follows from the fact that \hat{\beta}_{n}^{\rm naive}\stackrel{P}{\to}\beta^{\rm naive}, Assumption 4.1(b), and the weak law of large numbers. To establish (9), first note

12n1i2n(2Di1)(m^d,n(Xi,Wi)md,n(Xi,Wi))=12nΔ^ψ,n(β^nnaiveβnaive).\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)(\hat{m}_{d,n}(X_{i},W_{i})-m_{d,n}(X_{i},W_{i}))=\frac{1}{\sqrt{2}}\sqrt{n}\hat{\Delta}_{\psi,n}^{\prime}(\hat{\beta}_{n}^{\rm naive}-\beta^{\rm naive})~{}.

In what follows, we establish

nΔ^ψ,n=OP(1),\sqrt{n}\hat{\Delta}_{\psi,n}=O_{P}(1)~{}, (45)

from which (9) follows immediately because β^nnaiveβnaive=oP(1)\hat{\beta}_{n}^{\rm naive}-\beta^{\rm naive}=o_{P}(1). Note by Assumption 2.2 that E[nΔ^ψ,n|X(n)]=0E[\sqrt{n}\hat{\Delta}_{\psi,n}|X^{(n)}]=0. Also note

nΔ^ψ,n=FnGn+Hn,\sqrt{n}\hat{\Delta}_{\psi,n}=F_{n}-G_{n}+H_{n}~{},

where

Fn\displaystyle F_{n} =1n1i2n(ψiE[ψi|Xi])Di,\displaystyle=\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}(\psi_{i}-E[\psi_{i}|X_{i}])D_{i}~{},
Gn\displaystyle G_{n} =1n1i2n(ψiE[ψi|Xi])(1Di),and\displaystyle=\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}(\psi_{i}-E[\psi_{i}|X_{i}])(1-D_{i})~{},\quad\text{and}
Hn\displaystyle H_{n} =1n1jn(E[ψπ(2j1)|Xπ(2j1)]E[ψπ(2j)|Xπ(2j)])(Dπ(2j1)Dπ(2j)).\displaystyle=\frac{1}{\sqrt{n}}\sum_{1\leq j\leq n}(E[\psi_{\pi(2j-1)}|X_{\pi(2j-1)}]-E[\psi_{\pi(2j)}|X_{\pi(2j)}])(D_{\pi(2j-1)}-D_{\pi(2j)})~{}.
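
For completeness, the decomposition can be verified directly: writing \hat{\Delta}_{\psi,n}=\frac{1}{n}\sum_{1\leq i\leq 2n}\psi_{i}(2D_{i}-1) and inserting \psi_{i}=(\psi_{i}-E[\psi_{i}|X_{i}])+E[\psi_{i}|X_{i}] gives

\sqrt{n}\hat{\Delta}_{\psi,n}=\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}(\psi_{i}-E[\psi_{i}|X_{i}])(2D_{i}-1)+\frac{1}{\sqrt{n}}\sum_{1\leq i\leq 2n}E[\psi_{i}|X_{i}](2D_{i}-1)~{},

where the first sum equals F_{n}-G_{n} because 2D_{i}-1=D_{i}-(1-D_{i}), and the second sum equals H_{n} because, within pair j, 2D_{\pi(2j-1)}-1=-(2D_{\pi(2j)}-1)=D_{\pi(2j-1)}-D_{\pi(2j)}.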

We will argue that F_{n}, G_{n}, and H_{n} are all O_{P}(1). Since this can be carried out separately for each entry of F_{n}, G_{n}, and H_{n}, we assume without loss of generality that k_{\psi}=1. First, it follows from Assumptions 2.2–2.3 and 4.1(c) as well as similar arguments to those in the proof of Lemma S.1.4 of Bai et al. (2022) that

Var[Fn|X(n),D(n)]=1n1i2nVar[ψi|Xi]DiPE[Var[ψi|Xi]]>0.\mathrm{Var}[F_{n}|X^{(n)},D^{(n)}]=\frac{1}{n}\sum_{1\leq i\leq 2n}\mathrm{Var}[\psi_{i}|X_{i}]D_{i}\stackrel{{\scriptstyle P}}{{\to}}E[\mathrm{Var}[\psi_{i}|X_{i}]]>0~{}.

It then follows from arguments based on the Lindeberg central limit theorem, similar to those in the proof of Lemma S.1.4 of Bai et al. (2022), that F_{n}=O_{P}(1). Similar arguments establish G_{n}=O_{P}(1). Finally, we show H_{n}=O_{P}(1). Note that E[H_{n}|X^{(n)}]=0 and by Assumptions 2.2–2.3 and 4.1(c),

Var[Hn|X(n)]=1n1jn(E[ψπ(2j1)|Xπ(2j1)]E[ψπ(2j)|Xπ(2j)])2P0.\mathrm{Var}[H_{n}|X^{(n)}]=\frac{1}{n}\sum_{1\leq j\leq n}(E[\psi_{\pi(2j-1)}|X_{\pi(2j-1)}]-E[\psi_{\pi(2j)}|X_{\pi(2j)}])^{2}\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Therefore, for any fixed ϵ>0\epsilon>0, Markov’s inequality implies

P{|HnE[Hn|X(n)]|>ϵ|X(n)}Var[Hn|X(n)]ϵ2P0.P\{|H_{n}-E[H_{n}|X^{(n)}]|>\epsilon|X^{(n)}\}\leq\frac{\mathrm{Var}[H_{n}|X^{(n)}]}{\epsilon^{2}}\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Since probabilities are bounded and therefore uniformly integrable, we have that

P{|HnE[Hn|X(n)]|>ϵ}0.P\{|H_{n}-E[H_{n}|X^{(n)}]|>\epsilon\}\to 0~{}.

Therefore, (45) follows. Finally, it is straightforward to see Assumption 3.1 is implied by Assumption 4.1.  

A.4 Proof of Theorem 4.2

By the Frisch-Waugh-Lovell theorem, β^npfe\hat{\beta}_{n}^{\rm pfe} is equal to the OLS estimator in the linear regression of {(Yπ(2j1)Yπ(2j),Yπ(2j)Yπ(2j1)):1jn}\{(Y_{\pi(2j-1)}-Y_{\pi(2j)},Y_{\pi(2j)}-Y_{\pi(2j-1)}):1\leq j\leq n\} on {(2Dπ(2j1)1,2Dπ(2j)1):1jn}\{(2D_{\pi(2j-1)}-1,2D_{\pi(2j)}-1):1\leq j\leq n\} and {(ψπ(2j1)ψπ(2j),ψπ(2j)ψπ(2j1)):1jn}\{(\psi_{\pi(2j-1)}-\psi_{\pi(2j)},\psi_{\pi(2j)}-\psi_{\pi(2j-1)}):1\leq j\leq n\}. To apply the Frisch-Waugh-Lovell theorem again, we study the linear regression of {(ψπ(2j1)ψπ(2j),ψπ(2j)ψπ(2j1)):1jn}\{(\psi_{\pi(2j-1)}-\psi_{\pi(2j)},\psi_{\pi(2j)}-\psi_{\pi(2j-1)}):1\leq j\leq n\} on {(2Dπ(2j1)1,2Dπ(2j)1):1jn}\{(2D_{\pi(2j-1)}-1,2D_{\pi(2j)}-1):1\leq j\leq n\}. The OLS estimator of the regression coefficient in such a regression equals

Δ^ψ,n=1n1jn(Dπ(2j1)Dπ(2j))(ψπ(2j1)ψπ(2j)).\displaystyle\hat{\Delta}_{\psi,n}=\frac{1}{n}\sum_{1\leq j\leq n}(D_{\pi(2j-1)}-D_{\pi(2j)})(\psi_{\pi(2j-1)}-\psi_{\pi(2j)})~{}.

The residual is therefore {(ψπ(2j1)ψπ(2j)(2Dπ(2j1)1)Δ^ψ,n,ψπ(2j)ψπ(2j1)(2Dπ(2j)1)Δ^ψ,n):1jn}\{(\psi_{\pi(2j-1)}-\psi_{\pi(2j)}-(2D_{\pi(2j-1)}-1)\hat{\Delta}_{\psi,n},\psi_{\pi(2j)}-\psi_{\pi(2j-1)}-(2D_{\pi(2j)}-1)\hat{\Delta}_{\psi,n}):1\leq j\leq n\}. β^npfe\hat{\beta}_{n}^{\rm pfe} equals the OLS estimator of the coefficient in the linear regression of {(Yπ(2j1)Yπ(2j),Yπ(2j)Yπ(2j1)):1jn}\{(Y_{\pi(2j-1)}-Y_{\pi(2j)},Y_{\pi(2j)}-Y_{\pi(2j-1)}):1\leq j\leq n\} on those residuals. Define

δY,j\displaystyle\delta_{Y,j} =(Dπ(2j1)Dπ(2j))(Yπ(2j1)Yπ(2j))and\displaystyle=(D_{\pi(2j-1)}-D_{\pi(2j)})(Y_{\pi(2j-1)}-Y_{\pi(2j)})\quad\text{and}
δψ,j\displaystyle\delta_{\psi,j} =(Dπ(2j1)Dπ(2j))(ψπ(2j1)ψπ(2j))\displaystyle=(D_{\pi(2j-1)}-D_{\pi(2j)})(\psi_{\pi(2j-1)}-\psi_{\pi(2j)})

By construction, \hat{\Delta}_{\psi,n}=\frac{1}{n}\sum_{1\leq j\leq n}\delta_{\psi,j}. A moment's thought reveals that \hat{\beta}_{n}^{\rm pfe} further equals the least squares coefficient in the linear regression of \delta_{Y,j} on \delta_{\psi,j}-\hat{\Delta}_{\psi,n} for 1\leq j\leq n; a numerical illustration of this reduction is given after (46) below. It follows from Assumptions 2.1(b)–(c), 2.2–2.3, and 4.1(b)–(c) as well as similar arguments to those in the proof of Lemma S.1.5 of Bai et al. (2022) that

Δ^ψ,n\displaystyle\hat{\Delta}_{\psi,n} P0and\displaystyle\stackrel{{\scriptstyle P}}{{\to}}0\quad\text{and} (46)
1n1jnδY,j\displaystyle\frac{1}{n}\sum_{1\leq j\leq n}\delta_{Y,j} PΔ(Q).\displaystyle\stackrel{{\scriptstyle P}}{{\to}}\Delta(Q)~{}.
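
As a purely illustrative check of the reduction described above (with hypothetical variable names and simulated data), the following Python/NumPy sketch computes the coefficients on \psi_{i} from one implementation of the pair-fixed-effects regression of Y_{i} on D_{i}, \psi_{i}, and pair indicators, and verifies that they coincide with the least squares coefficients in the regression of \delta_{Y,j} on \delta_{\psi,j}-\hat{\Delta}_{\psi,n}.

import numpy as np

rng = np.random.default_rng(1)
n, k = 400, 2                                          # n pairs; hypothetical sizes
psi = rng.normal(size=(2 * n, k))
D = np.zeros(2 * n)
D[::2] = rng.integers(0, 2, size=n)
D[1::2] = 1 - D[::2]
Y = 0.7 * D + psi @ np.array([1.0, -0.5]) + rng.normal(size=2 * n)

# (a) coefficients on psi in the OLS regression of Y on D, psi, and a dummy for each pair
pair_id = np.repeat(np.arange(n), 2)
pair_dummies = np.eye(n)[pair_id]                      # (2n, n) matrix of pair fixed effects
X = np.column_stack([D, psi, pair_dummies])
beta_pfe = np.linalg.lstsq(X, Y, rcond=None)[0][1:1 + k]

# (b) within-pair-difference regression of delta_Y on delta_psi - Delta_hat
sign = D[::2] - D[1::2]                                # D_{pi(2j-1)} - D_{pi(2j)}, in {-1, 1}
delta_Y = sign * (Y[::2] - Y[1::2])
delta_psi = sign[:, None] * (psi[::2] - psi[1::2])
x = delta_psi - delta_psi.mean(axis=0)                 # subtract Delta_hat_{psi, n}
beta_pairdiff = np.linalg.lstsq(x, delta_Y, rcond=None)[0]
assert np.allclose(beta_pfe, beta_pairdiff)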

Next, note that

1n1jnδψ,jδψ,j\displaystyle\frac{1}{n}\sum_{1\leq j\leq n}\delta_{\psi,j}\delta_{\psi,j}^{\prime}
=1n1jn(ψπ(2j1)ψπ(2j))(ψπ(2j1)ψπ(2j))\displaystyle=\frac{1}{n}\sum_{1\leq j\leq n}(\psi_{\pi(2j-1)}-\psi_{\pi(2j)})(\psi_{\pi(2j-1)}-\psi_{\pi(2j)})^{\prime}
=1n1i2nψiψi1n1jn(ψπ(2j1)ψπ(2j)+ψπ(2j)ψπ(2j1)).\displaystyle=\frac{1}{n}\sum_{1\leq i\leq 2n}\psi_{i}\psi_{i}^{\prime}-\frac{1}{n}\sum_{1\leq j\leq n}(\psi_{\pi(2j-1)}\psi_{\pi(2j)}^{\prime}+\psi_{\pi(2j)}\psi_{\pi(2j-1)}^{\prime})~{}. (47)

For convenience, we introduce the following notation:

μd(Xi)\displaystyle\mu_{d}(X_{i}) =E[Yi(d)|Xi]\displaystyle=E[Y_{i}(d)|X_{i}]
Ψ(Xi)\displaystyle\Psi(X_{i}) =E[ψi|Xi]\displaystyle=E[\psi_{i}|X_{i}]
ξd(Xi)\displaystyle\xi_{d}(X_{i}) =E[ψiYi(d)|Xi].\displaystyle=E[\psi_{i}Y_{i}(d)|X_{i}]~{}.

The first term in (47) converges in probability to 2E[ψiψi]2E[\psi_{i}\psi_{i}^{\prime}] by the weak law of large numbers. For the second term, we have that

E[1n1jn(ψπ(2j1)ψπ(2j)+ψπ(2j)ψπ(2j1))|X(n)]\displaystyle E\Big{[}\frac{1}{n}\sum_{1\leq j\leq n}(\psi_{\pi(2j-1)}\psi_{\pi(2j)}^{\prime}+\psi_{\pi(2j)}\psi_{\pi(2j-1)}^{\prime})\Big{|}X^{(n)}\Big{]}
=1n1i2nΨ(Xi)Ψ(Xi)1n1jn(Ψ(Xπ(2j1))Ψ(Xπ(2j)))(Ψ(Xπ(2j1))Ψ(Xπ(2j)))\displaystyle=\frac{1}{n}\sum_{1\leq i\leq 2n}\Psi(X_{i})\Psi(X_{i})^{\prime}-\frac{1}{n}\sum_{1\leq j\leq n}(\Psi(X_{\pi(2j-1)})-\Psi(X_{\pi(2j)}))(\Psi(X_{\pi(2j-1)})-\Psi(X_{\pi(2j)}))^{\prime}
P2E[Ψ(Xi)Ψ(Xi)],\displaystyle\stackrel{{\scriptstyle P}}{{\to}}2E[\Psi(X_{i})\Psi(X_{i})^{\prime}]~{},

where the convergence in probability holds because of Assumptions 2.2–2.3 and 4.1(c). It follows from Assumptions 2.2–2.3 and 4.1(b)–(c) as well as similar arguments to those in the proof of Lemma S.1.6 of Bai et al. (2022) that

|1n1jn(ψπ(2j1)ψπ(2j)+ψπ(2j)ψπ(2j1))E[1n1jn(ψπ(2j1)ψπ(2j)+ψπ(2j)ψπ(2j1))|X(n)]|P0.\Big{|}\frac{1}{n}\sum_{1\leq j\leq n}(\psi_{\pi(2j-1)}\psi_{\pi(2j)}^{\prime}+\psi_{\pi(2j)}\psi_{\pi(2j-1)}^{\prime})-E\Big{[}\frac{1}{n}\sum_{1\leq j\leq n}(\psi_{\pi(2j-1)}\psi_{\pi(2j)}^{\prime}+\psi_{\pi(2j)}\psi_{\pi(2j-1)}^{\prime})\Big{|}X^{(n)}\Big{]}\Big{|}\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Therefore,

1n1jnδψ,jδψ,jP2E[Var[ψi|Xi]].\frac{1}{n}\sum_{1\leq j\leq n}\delta_{\psi,j}\delta_{\psi,j}^{\prime}\stackrel{{\scriptstyle P}}{{\to}}2E[\operatorname*{Var}[\psi_{i}|X_{i}]]~{}.

We now turn to

1n1jnδψ,jδY,j=1n1jn(ψπ(2j1)ψπ(2j))(Yπ(2j1)Yπ(2j)).\frac{1}{n}\sum_{1\leq j\leq n}\delta_{\psi,j}\delta_{Y,j}=\frac{1}{n}\sum_{1\leq j\leq n}(\psi_{\pi(2j-1)}-\psi_{\pi(2j)})(Y_{\pi(2j-1)}-Y_{\pi(2j)})~{}.

Note that

E[ψπ(2j1)Yπ(2j1)|X(n)]\displaystyle E[\psi_{\pi(2j-1)}Y_{\pi(2j-1)}|X^{(n)}] =12ξ1(Xπ(2j1))+12ξ0(Xπ(2j1))\displaystyle=\frac{1}{2}\xi_{1}(X_{\pi(2j-1)})+\frac{1}{2}\xi_{0}(X_{\pi(2j-1)})
E[ψπ(2j1)Yπ(2j)|X(n)]\displaystyle E[\psi_{\pi(2j-1)}Y_{\pi(2j)}|X^{(n)}] =12Ψ(Xπ(2j1))(μ1(Xπ(2j))+μ0(Xπ(2j))).\displaystyle=\frac{1}{2}\Psi(X_{\pi(2j-1)})(\mu_{1}(X_{\pi(2j)})+\mu_{0}(X_{\pi(2j)}))~{}.

It follows from Assumptions 2.1(b)–(c), 2.2–2.3, and 4.1(b)–(c) as well as similar arguments to those in the proof of Lemma S.1.6 of Bai et al. (2022) that

1n1jnδψ,jδY,jPE[ψi(Yi(1)+Yi(0))]E[Ψ(Xi)(μ1(Xi)+μ0(Xi))].\frac{1}{n}\sum_{1\leq j\leq n}\delta_{\psi,j}\delta_{Y,j}\stackrel{{\scriptstyle P}}{{\to}}E[\psi_{i}(Y_{i}(1)+Y_{i}(0))]-E[\Psi(X_{i})(\mu_{1}(X_{i})+\mu_{0}(X_{i}))]~{}.

The convergence in probability of β^npfe\hat{\beta}_{n}^{\rm pfe} now follows from Assumption 4.1(a) and the continuous mapping theorem. (9)–(12) can be established using similar arguments to those in the proof of Theorem 4.1. Finally, it is straightforward to see Assumption 3.1 is implied by Assumption 4.1.  

A.5 Proof of Theorem 5.1

We divide the proof into three steps. In the first step, we show

|α^d,nrαd,nr|+β^d,nrβd,nr1=OP(snλnr).|\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}|+\|\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}\|_{1}=O_{P}\left(s_{n}\lambda_{n}^{\rm r}\right)~{}. (48)

In the second step, we show (9), (12), and Assumption 3.1 hold. In the third step, we show the asymptotic variance achieves the minimum under the approximately correct specification condition in Theorem 5.1.

Step 1: Proof of (48)

Note that

1n1i2nI{Di=d}(Yi(d)α^d,nrψn,iβ^d,nr)2+λd,nrΩ^n(d)β^d,nr1\displaystyle\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(Y_{i}(d)-\hat{\alpha}_{d,n}^{\rm r}-\psi_{n,i}^{\prime}\hat{\beta}_{d,n}^{\rm r})^{2}+\lambda_{d,n}^{\rm r}\|\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r}\|_{1}
1n1i2nI{Di=d}(Yi(d)αd,nrψn,iβd,nr)2+λd,nrΩ^n(d)βd,nr1.\displaystyle\leq\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(Y_{i}(d)-\alpha_{d,n}^{\rm r}-\psi_{n,i}^{\prime}\beta_{d,n}^{\rm r})^{2}+\lambda_{d,n}^{\rm r}\|\hat{\Omega}_{n}(d)\beta_{d,n}^{\rm r}\|_{1}~{}.

Rearranging the terms, we then have

1n1i2nI{Di=d}(α^d,nrαd,nr+ψn,i(β^d,nrβd,nr))2+λd,nrΩ^n(d)β^d,nr1\displaystyle\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}))^{2}+\lambda_{d,n}^{\rm r}\|\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r}\|_{1}
(2n1i2nI{Di=d}ϵn,i(d)ψn,i)(β^d,nrβd,nr)+(2n1i2nI{Di=d}ϵn,i(d))(α^d,nrαd,nr)\displaystyle\leq\left(\frac{2}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\epsilon_{n,i}(d)\psi_{n,i}^{\prime}\right)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})+\left(\frac{2}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\epsilon_{n,i}(d)\right)(\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r})
+λd,nrΩ^n(d)βd,nr1\displaystyle+\lambda_{d,n}^{\rm r}\|\hat{\Omega}_{n}(d)\beta_{d,n}^{\rm r}\|_{1} (49)

Next, define

𝕌n(d)=Ωn1(d)1n1i2nI{Di=d}(ψn,iϵn,i(d)E[ψn,iϵn,i(d)])\mathbb{U}_{n}(d)=\Omega_{n}^{-1}(d)\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\psi_{n,i}\epsilon_{n,i}(d)-E[\psi_{n,i}\epsilon_{n,i}(d)])

and

n(d)={𝕌n(d)6σ¯σ¯log(2npn)n,|1ni[2n]I{Di=d}ϵn,i(d)E[ϵn,i(d)]|log(2npn)n}.\mathcal{E}_{n}(d)=\left\{\|\mathbb{U}_{n}(d)\|_{\infty}\leq\frac{6\bar{\sigma}}{\underaccent{\bar}{\sigma}}\sqrt{\frac{\log(2np_{n})}{n}},~{}\left|\frac{1}{n}\sum_{i\in[2n]}I\{D_{i}=d\}\epsilon_{n,i}(d)-E[\epsilon_{n,i}(d)]\right|\leq\sqrt{\frac{\log(2np_{n})}{n}}\right\}~{}.

Lemma B.4 implies P{n(d)}1P\{\mathcal{E}_{n}(d)\}\to 1 for d{0,1}d\in\{0,1\}.

On the event n(d)\mathcal{E}_{n}(d), we have

|(2n1i2nI{Di=d}ϵn,i(d)ψn,i)(β^d,nrβd,nr)|\displaystyle\left|\left(\frac{2}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\epsilon_{n,i}(d)\psi_{n,i}^{\prime}\right)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\right|
Ωn1(d)2n1i2nI{Di=d}ϵn,i(d)ψn,iΩn(d)(β^d,nrβd,nr)1\displaystyle\leq\left\|\Omega_{n}^{-1}(d)\frac{2}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\epsilon_{n,i}(d)\psi_{n,i}\right\|_{\infty}\|\Omega_{n}(d)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\|_{1}
2𝕌n(d)Ωn(d)(β^d,nrβd,nr)1+Ωn1(d)2n1i2nI{Di=d}E[ϵn,i(d)ψn,i]Ωn(d)(β^d,nrβd,nr)1\displaystyle\leq 2\|\mathbb{U}_{n}(d)\|_{\infty}\|\Omega_{n}(d)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\|_{1}+\left\|\Omega_{n}^{-1}(d)\frac{2}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}E[\epsilon_{n,i}(d)\psi_{n,i}]\right\|_{\infty}\|\Omega_{n}(d)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\|_{1}
2𝕌n(d)Ωn(d)(β^d,nrβd,nr)1+Ωn1(d)2E[ϵn,i(d)ψn,i]Ωn(d)(β^d,nrβd,nr)1\displaystyle\leq 2\|\mathbb{U}_{n}(d)\|_{\infty}\|\Omega_{n}(d)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\|_{1}+\left\|\Omega_{n}^{-1}(d)2E[\epsilon_{n,i}(d)\psi_{n,i}]\right\|_{\infty}\|\Omega_{n}(d)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\|_{1}
(12σ¯σ¯n+dn)λd,nrΩn(d)(β^d,nrβd,nr)1,\displaystyle\leq\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}\right)\lambda_{d,n}^{\rm r}\|\Omega_{n}(d)(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\|_{1}~{},

where dn=o(1)d_{n}=o(1) and the last inequality follows from (18) and the fact that

λd,nrnlog(2npn)n.\lambda_{d,n}^{\rm r}\geq\ell\ell_{n}\sqrt{\frac{\log(2np_{n})}{n}}~{}.

Next, define

δ^d,n=β^d,nrβd,nr\hat{\delta}_{d,n}=\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}

and let Sd,nS_{d,n} be the support of βd,nr\beta_{d,n}^{\rm r}. Then, we have

Ω^n(d)β^d,nr1\displaystyle\|\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r}\|_{1} =(Ω^n(d)β^d,nr)Sd,n1+(Ω^n(d)β^d,nr)Sd,nc1=(Ω^n(d)β^d,nr)Sd,n1+(Ω^n(d)δ^d,n)Sd,nc1,\displaystyle=\|(\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r})_{S_{d,n}}\|_{1}+\|(\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r})_{S_{d,n}^{c}}\|_{1}=\|(\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r})_{S_{d,n}}\|_{1}+\|(\hat{\Omega}_{n}(d)\hat{\delta}_{d,n})_{S_{d,n}^{c}}\|_{1}~{},
Ω^n(d)βd,nr1\displaystyle\|\hat{\Omega}_{n}(d)\beta_{d,n}^{\rm r}\|_{1} =(Ω^n(d)βd,nr)Sd,n1(Ω^n(d)β^d,nr)Sd,n1+(Ω^n(d)δ^d,nr)Sd,n1,\displaystyle=\|(\hat{\Omega}_{n}(d)\beta_{d,n}^{\rm r})_{S_{d,n}}\|_{1}\leq\|(\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r})_{S_{d,n}}\|_{1}+\|(\hat{\Omega}_{n}(d)\hat{\delta}_{d,n}^{\rm r})_{S_{d,n}}\|_{1}~{},

and thus,

(12σ¯σ¯n+dn)Ωn(d)δ^d,n1+Ω^n(d)βd,nr1Ω^n(d)β^d,nr1\displaystyle\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}\right)\|\Omega_{n}(d)\hat{\delta}_{d,n}\|_{1}+\|\hat{\Omega}_{n}(d)\beta_{d,n}^{\rm r}\|_{1}-\|\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r}\|_{1}
=(12σ¯σ¯n+dn)(Ωn(d)δ^d,n)Sd,n1+(12σ¯σ¯n+dn)(Ωn(d)δ^d,n)Sd,nc1+Ω^n(d)βd,nr1Ω^n(d)β^d,nr1\displaystyle=\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}\right)\|(\Omega_{n}(d)\hat{\delta}_{d,n})_{S_{d,n}}\|_{1}+\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}\right)\|(\Omega_{n}(d)\hat{\delta}_{d,n})_{S_{d,n}^{c}}\|_{1}+\|\hat{\Omega}_{n}(d)\beta_{d,n}^{\rm r}\|_{1}-\|\hat{\Omega}_{n}(d)\hat{\beta}_{d,n}^{\rm r}\|_{1}
(12σ¯σ¯n+dn)(Ωn(d)δ^d,n)Sd,n1+(12σ¯σ¯n+dn)(Ωn(d)δ^d,n)Sd,nc1+(Ω^n(d)δ^d,nr)Sd,n1(Ω^n(d)δ^d,nr)Sd,nc1.\displaystyle\leq\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}\right)\|(\Omega_{n}(d)\hat{\delta}_{d,n})_{S_{d,n}}\|_{1}+\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}\right)\|(\Omega_{n}(d)\hat{\delta}_{d,n})_{S_{d,n}^{c}}\|_{1}+\|(\hat{\Omega}_{n}(d)\hat{\delta}_{d,n}^{\rm r})_{S_{d,n}}\|_{1}-\|(\hat{\Omega}_{n}(d)\hat{\delta}_{d,n}^{\rm r})_{S_{d,n}^{c}}\|_{1}~{}.

Further define \breve{\delta}_{d,n}=(\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r},\hat{\delta}_{d,n}^{\prime})^{\prime} and \breve{S}_{d,n}=\{1,S_{d,n}+1\} (for example, if S_{d,n}=\{2,4,9\}, then \breve{S}_{d,n}=\{1,3,5,10\}), and recall \breve{\psi}_{n,i}=(1,\psi_{n,i}^{\prime})^{\prime}. Then, together with (A.5), we have

0\displaystyle 0 1n1i2nI{Di=d}(ψ˘n,iδ˘d,n)2\displaystyle\leq\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\breve{\psi}_{n,i}^{\prime}\breve{\delta}_{d,n})^{2}
λd,nr[(12σ¯σ¯n+dn+c¯)(Ωn(d)δ^d,n)Sd,n1(c¯12σ¯σ¯ndn)(Ωn(d)δ^d,n)Sd,nc1]\displaystyle\leq\lambda_{d,n}^{\rm r}\left[\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)\|(\Omega_{n}(d)\hat{\delta}_{d,n})_{S_{d,n}}\|_{1}-\left(\underline{c}-\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}-d_{n}\right)\|(\Omega_{n}(d)\hat{\delta}_{d,n})_{S_{d,n}^{c}}\|_{1}\right]
+λd,nr(1/n+dn)|α^d,nrαd,nr|\displaystyle+\lambda_{d,n}^{\rm r}(1/\ell_{n}+d_{n})|\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}|
λd,nr[(12σ¯σ¯n+dn+c¯)σ¯(δ^d,n)Sd,n1(c¯12σ¯σ¯ndn)σ¯(δ^d,n)Sd,nc1]\displaystyle\leq\lambda_{d,n}^{\rm r}\left[\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)\bar{\sigma}\|(\hat{\delta}_{d,n})_{S_{d,n}}\|_{1}-\left(\underline{c}-\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}-d_{n}\right)\underaccent{\bar}{\sigma}\|(\hat{\delta}_{d,n})_{S_{d,n}^{c}}\|_{1}\right]
+λd,nr(1/n+dn)|α^d,nrαd,nr|\displaystyle+\lambda_{d,n}^{\rm r}(1/\ell_{n}+d_{n})|\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}|
λd,nr[(12σ¯σ¯n+dn+c¯)σ¯(δ˘d,n)S˘d,n1(c¯12σ¯σ¯ndn)σ¯(δ˘d,n)S˘d,nc1].\displaystyle\leq\lambda_{d,n}^{\rm r}\left[\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)\bar{\sigma}\|(\breve{\delta}_{d,n})_{\breve{S}_{d,n}}\|_{1}-\left(\underline{c}-\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}-d_{n}\right)\underaccent{\bar}{\sigma}\|(\breve{\delta}_{d,n})_{\breve{S}_{d,n}^{c}}\|_{1}\right]~{}. (50)

Define

𝒞n={u𝐑pn+1:uS˘d,nc12σ¯c¯σ¯c¯uS˘d,n1}.\mathcal{C}_{n}=\left\{u\in\mathbf{R}^{p_{n}+1}:\|u_{\breve{S}_{d,n}^{c}}\|_{1}\leq\frac{2\bar{\sigma}\bar{c}}{\underaccent{\bar}{\sigma}\underline{c}}\|u_{\breve{S}_{d,n}}\|_{1}\right\}~{}.

For sufficiently large n, on the event \mathcal{E}_{n}(d) we have \breve{\delta}_{d,n}\in\mathcal{C}_{n}. It follows from Bickel et al. (2009) and Assumption 5.4 that

infu𝒞n(uS˘d,n1)2(sn+1)u(1n1i2nI{Di=d}ψ˘n,iψ˘n,i)u0.25κ12.\inf_{u\in\mathcal{C}_{n}}(\|u_{\breve{S}_{d,n}}\|_{1})^{-2}(s_{n}+1)u^{\prime}\left(\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\breve{\psi}_{n,i}\breve{\psi}_{n,i}^{\prime}\right)u\geq 0.25\kappa_{1}^{2}~{}.

Therefore, we have

0.25κ12(δ˘d,n)S˘d,n12\displaystyle 0.25\kappa_{1}^{2}\|(\breve{\delta}_{d,n})_{\breve{S}_{d,n}}\|_{1}^{2} 1n1i2nI{Di=d}(ψ˘n,iδ˘d,n)2\displaystyle\leq\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\breve{\psi}_{n,i}^{\prime}\breve{\delta}_{d,n})^{2}
(12σ¯σ¯n+dn+c¯)λd,nr(sn+1)(δ˘d,n)Sd,n1,\displaystyle\leq\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)\lambda_{d,n}^{\rm r}(s_{n}+1)\|(\breve{\delta}_{d,n})_{S_{d,n}}\|_{1}~{},

which implies

(δ˘d,n)Sd,n14(12σ¯σ¯n+dn+c¯)(sn+1)λd,nr/κ12.\|(\breve{\delta}_{d,n})_{S_{d,n}}\|_{1}\leq 4\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)(s_{n}+1)\lambda_{d,n}^{\rm r}/\kappa_{1}^{2}~{}.

We then have

|α^d,nrαd,nr|+β^d,nrβd,nr1\displaystyle|\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}|+\|\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}\|_{1} (δ˘d,n)S˘d,n1+(δ˘d,n)S˘d,nc1\displaystyle\leq\|(\breve{\delta}_{d,n})_{\breve{S}_{d,n}}\|_{1}+\|(\breve{\delta}_{d,n})_{\breve{S}_{d,n}^{c}}\|_{1}
(1+2σ¯c¯σ¯c¯)(δ˘d,n)S˘d,n1\displaystyle\leq\left(1+\frac{2\bar{\sigma}\bar{c}}{\underaccent{\bar}{\sigma}\underline{c}}\right)\|(\breve{\delta}_{d,n})_{\breve{S}_{d,n}}\|_{1}
4(1+2σ¯c¯σ¯c¯)(12σ¯σ¯n+dn+c¯)(sn+1)λd,nr/κ12.\displaystyle\leq 4\left(1+\frac{2\bar{\sigma}\bar{c}}{\underaccent{\bar}{\sigma}\underline{c}}\right)\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)(s_{n}+1)\lambda_{d,n}^{\rm r}/\kappa_{1}^{2}~{}.

Then, (48) holds because P\{\mathcal{E}_{n}(d)\}\to 1.
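
To illustrate how an estimator of this penalized form can be computed in practice, the following sketch (Python with NumPy and scikit-learn) is purely schematic: the quantities omega and lam below are simple stand-ins for \hat{\Omega}_{n}(d) and \lambda_{d,n}^{\rm r}, and scikit-learn's Lasso normalizes its objective differently from the display above, so the mapping of penalty levels is only indicative. The key point is that, with diagonal penalty loadings, penalizing \|\hat{\Omega}_{n}(d)\beta\|_{1} is equivalent to a standard Lasso after rescaling the columns of \psi_{n,i}.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 200, 50                                  # n pairs, p = dim(psi_{n,i}); hypothetical sizes
psi = rng.normal(size=(2 * n, p))
D = np.zeros(2 * n)
D[::2] = rng.integers(0, 2, size=n)
D[1::2] = 1 - D[::2]
Y = 2.0 + psi[:, 0] - 0.5 * psi[:, 1] + rng.normal(size=2 * n)

treated = D == 1                                # units with D_i = d, here d = 1
omega = psi[treated].std(axis=0)                # stand-in diagonal penalty loadings (illustrative)
lam = 0.1 * np.sqrt(np.log(2 * n * p) / n)      # penalty level of the order used in the text

# Penalizing ||omega * b||_1 is equivalent to a standard Lasso on the rescaled design psi / omega.
psi_scaled = psi[treated] / omega
# sklearn's Lasso minimizes (1/(2m))||y - Xw||^2 + alpha||w||_1 with m = #observations, so alpha
# matches the display above only up to this normalization; this is illustrative only.
fit = Lasso(alpha=lam, fit_intercept=True).fit(psi_scaled, Y[treated])
beta_hat = fit.coef_ / omega                    # undo the rescaling to recover the coefficient on psi
alpha_hat = fit.intercept_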

Step 2: Verifying (9), (12), and Assumption 3.1

By (A.5), on n(d)\mathcal{E}_{n}(d) we have

1n1i2nI{Di=d}(α^d,nrαd,nr+ψn,iδ^d,n)2\displaystyle\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}\hat{\delta}_{d,n})^{2} λd,nr(12σ¯σ¯n+dn+c¯)σ¯(δ˘d,n)S˘d,n1\displaystyle\leq\lambda_{d,n}^{\rm r}\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)\bar{\sigma}\|(\breve{\delta}_{d,n})_{\breve{S}_{d,n}}\|_{1}
4(12σ¯σ¯n+dn+c¯)2σ¯(sn+1)λd,nr,2/κ12.\displaystyle\leq 4\left(\frac{12\bar{\sigma}}{\underaccent{\bar}{\sigma}\ell\ell_{n}}+d_{n}+\bar{c}\right)^{2}\bar{\sigma}(s_{n}+1)\lambda_{d,n}^{\rm r,2}/\kappa_{1}^{2}~{}.

Because P{n(d)}1P\{\mathcal{E}_{n}(d)\}\to 1, we have

1n1i2nI{Di=d}(α^d,nrαd,nr+ψn,i(β^d,nβd,n))2=OP(sn(λnr)2)=oP(1),\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}(\hat{\beta}_{d,n}-\beta_{d,n}))^{2}=O_{P}\left(s_{n}(\lambda_{n}^{\rm r})^{2}\right)=o_{P}(1)~{},

which implies (12) holds.

Next, we show (9) for β^d,nr\hat{\beta}_{d,n}^{\rm r}. First note

|12n1i2n(2Di1)(m^d,n(Xi,Wn,i)md,n(Xi,Wn,i))|\displaystyle\left|\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)(\hat{m}_{d,n}(X_{i},W_{n,i})-m_{d,n}(X_{i},W_{n,i}))\right|
\displaystyle=\left|\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r})\right|
\displaystyle\leq\left\|\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}\right\|_{\infty}\|\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}\|_{1}~{}.

Next, note that it follows from Assumption 2.2 that conditional on X(n)X^{(n)} and Wn(n)W_{n}^{(n)},

{Dπ(2j1)Dπ(2j):1jn}\{D_{\pi(2j-1)}-D_{\pi(2j)}:1\leq j\leq n\}

is a sequence of independent Rademacher random variables. Therefore, Hoeffding’s inequality implies

P{12n1i2n(2Di1)ψn,i>t|X(n),Wn(n)}\displaystyle P\left\{\left\|\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}\right\|_{\infty}>t\Bigg{|}X^{(n)},W_{n}^{(n)}\right\}
\displaystyle\leq\sum_{1\leq l\leq p_{n}}P\left\{\left|\frac{1}{\sqrt{2n}}\sum_{1\leq j\leq n}(\psi_{n,\pi(2j-1),l}-\psi_{n,\pi(2j),l})(D_{\pi(2j-1)}-D_{\pi(2j)})\right|>t\Bigg|X^{(n)},W_{n}^{(n)}\right\}
\displaystyle\leq\sum_{1\leq l\leq p_{n}}2\exp\left(-\frac{t^{2}}{\frac{1}{n}\sum_{1\leq j\leq n}(\psi_{n,\pi(2j-1),l}-\psi_{n,\pi(2j),l})^{2}}\right)~{}.

Define

νn2=max1lpn1n1i2nψn,i,l2.\nu_{n}^{2}=\max_{1\leq l\leq p_{n}}\frac{1}{n}\sum_{1\leq i\leq 2n}\psi_{n,i,l}^{2}~{}.

We then have

P{12n1i2n(2Di1)ψn,i>νn2log(pnn)|X(n),Wn(n)}(pnn)1.P\left\{\left\|\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}\right\|_{\infty}>\nu_{n}\sqrt{2\log(p_{n}\vee n)}\Bigg{|}X^{(n)},W_{n}^{(n)}\right\}\leq(p_{n}\vee n)^{-1}~{}. (51)
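
A small simulation (Python/NumPy, with hypothetical sizes) illustrating the conditional Hoeffding bound behind (51): holding the within-pair differences of \psi_{n,i} fixed and redrawing only the Rademacher signs D_{\pi(2j-1)}-D_{\pi(2j)}, the sup-norm of the normalized sum is of the order of a sub-Gaussian benchmark of size \sqrt{2\log p_{n}} times the largest per-coordinate scale.

import numpy as np

rng = np.random.default_rng(3)
n, p = 2000, 500                               # n pairs, p_n regressors; hypothetical sizes
psi_diff = rng.normal(size=(n, p))             # within-pair differences of psi_{n,i}, held fixed
reps = 200
sup_norms = np.empty(reps)
for r in range(reps):
    signs = rng.choice([-1.0, 1.0], size=n)    # D_{pi(2j-1)} - D_{pi(2j)}: i.i.d. Rademacher
    sup_norms[r] = np.abs(psi_diff.T @ signs / np.sqrt(2 * n)).max()

# Each coordinate of the normalized sum is sub-Gaussian with variance proxy (1/(2n)) sum_j diff^2,
# so the maximum over p coordinates is of order sigma * sqrt(2 log p).
sigma = np.sqrt(((psi_diff ** 2).sum(axis=0) / (2 * n)).max())
print("average sup-norm over draws:", sup_norms.mean())
print("sub-Gaussian benchmark:", sigma * np.sqrt(2 * np.log(p)))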

Next, we determine the order of νn2\nu_{n}^{2}. Note

E[νn2]\displaystyle E[\nu_{n}^{2}] max1lpn2E[ψn,i,l2]+2E[12n1i2n(ψn,i,l2E[ψn,i,l2])]\displaystyle\leq\max_{1\leq l\leq p_{n}}2E[\psi_{n,i,l}^{2}]+2E\left[\frac{1}{2n}\sum_{1\leq i\leq 2n}(\psi_{n,i,l}^{2}-E[\psi_{n,i,l}^{2}])\right]
1+E[max1lpn|12n1i2neiψn,i,l2|]\displaystyle\lesssim 1+E\left[\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}e_{i}\psi_{n,i,l}^{2}\right|\right]
1+ΞnE[max1lpn|12n1i2neiψn,i,l|]\displaystyle\lesssim 1+\Xi_{n}E\left[\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}e_{i}\psi_{n,i,l}\right|\right]
1+ΞnE[supfn|12n1i2nf(ei,ψn,i,l)|]\displaystyle\lesssim 1+\Xi_{n}E\left[\sup_{f\in\mathcal{F}_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}f(e_{i},\psi_{n,i,l})\right|\right]

where \{e_{i}:1\leq i\leq 2n\} is an i.i.d. sequence of Rademacher random variables,

n={f:𝐑×𝐑pn𝐑,f(e,ψ)=eψl,1lpn},\mathcal{F}_{n}=\{f:\mathbf{R}\times\mathbf{R}^{p_{n}}\mapsto\mathbf{R},f(e,\psi)=e\psi_{l},1\leq l\leq p_{n}\}~{},

and ψl\psi_{l} is the llth element of ψ\psi. Note the second inequality follows from Lemma 2.3.1 of van der Vaart and Wellner (1996), the third inequality follows from Theorem 4.12 of Ledoux and Talagrand (1991) and the definition of Ξn\Xi_{n}, and the last follows from Assumption 5.1. Note also n\mathcal{F}_{n} has an envelope F=ΞnF=\Xi_{n} and

supn1supfnE[f2]<\sup_{n\geq 1}\sup_{f\in\mathcal{F}_{n}}E[f^{2}]<\infty

because of Assumption 5.1. Because the cardinality of n\mathcal{F}_{n} is pnp_{n}, for any ϵ<1\epsilon<1 we have that

supQ:Q is a discrete distribution with finite support𝒩(ϵFQ,2,,L2(Q))pnϵ,\sup_{Q:Q\text{ is a discrete distribution with finite support}}\mathcal{N}(\epsilon\|F\|_{Q,2},\mathcal{F},L_{2}(Q))\leq\frac{p_{n}}{\epsilon}~{},

where 𝒩(ϵ,,L2(Q))\mathcal{N}(\epsilon,\mathcal{F},L_{2}(Q)) is the covering number for class \mathcal{F} under the metric L2(Q)L_{2}(Q) using balls of radius ϵ\epsilon. Therefore, Corollary 5.1 of Chernozhukov et al. (2014) implies

E[supfn|12n1i2neiψn,i,l|]logpnn+Ξnlogpnn=o(Ξn1).E\left[\sup_{f\in\mathcal{F}_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}e_{i}\psi_{n,i,l}\right|\right]\lesssim\sqrt{\frac{\log p_{n}}{n}}+\frac{\Xi_{n}\log p_{n}}{n}=o(\Xi_{n}^{-1})~{}.

Therefore, \nu_{n}=O_{P}(1). Combined with (51), this implies

12n1i2n(2Di1)ψn,i=OP(log(pnn)).\left\|\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}\right\|_{\infty}=O_{P}\left(\sqrt{\log(p_{n}\vee n)}\right)~{}.

In light of (48) and Assumption 5.3, we have

12n1i2n(2Di1)ψn,iβ^d,nrβd,nr1=OP(snnlog(pnn)n)=oP(1).\left\|\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}\right\|_{\infty}\|\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}\|_{1}=O_{P}\left(\frac{s_{n}\ell\ell_{n}\log(p_{n}\vee n)}{\sqrt{n}}\right)=o_{P}(1)~{}.

Next, note that Assumptions 3.1(a) and 3.1(b) follow from Assumption 5.1, and Assumption 3.1(c) follows from Assumptions 5.1 and 5.2.

Step 3: Asymptotic variance

Suppose the true specification is approximately sparse as specified in Theorem 5.1. Let Y~i(d)=Yi(d)μd(Xi)\tilde{Y}_{i}(d)=Y_{i}(d)-\mu_{d}(X_{i}), ψ~n,i=ψn,iE[ψn,i|Xi]\tilde{\psi}_{n,i}=\psi_{n,i}-E[\psi_{n,i}|X_{i}], and R~n,i(d)=Rn,i(d)E[Rn,i(d)|Xi]\tilde{R}_{n,i}(d)=R_{n,i}(d)-E[R_{n,i}(d)|X_{i}]. Then, we have

E[(E[Y~i(1)+Y~i(0)|Wn,i,Xi]ψ~n,i(β1,nr+β0,nr))2]=E[(R~n,i(1)+R~n,i(0))2]=o(1).\displaystyle E\left[(E[\tilde{Y}_{i}(1)+\tilde{Y}_{i}(0)|W_{n,i},X_{i}]-\tilde{\psi}_{n,i}^{\prime}(\beta_{1,n}^{\rm r}+\beta_{0,n}^{\rm r}))^{2}\right]=E[(\tilde{R}_{n,i}(1)+\tilde{R}_{n,i}(0))^{2}]=o(1)~{}.

This concludes the proof.  

A.6 Proof of Theorem 5.2

We divide the proof into three steps. In the first step, we show

β^nrefitβnrefit=oP(1).\displaystyle\hat{\beta}_{n}^{\rm refit}-\beta_{n}^{\rm refit}=o_{P}(1)~{}. (52)

In the second step, we show (9), (12), and Assumption 3.1 hold. In the third step, we show that σnna,2σnrefit,2\sigma_{n}^{\rm na,2}\geq\sigma_{n}^{\rm refit,2} and σnr,2σnrefit,2\sigma_{n}^{\rm r,2}\geq\sigma_{n}^{\rm refit,2}.

Step 1: Proof of (52)

Let

Δ^Γ,n\displaystyle\hat{\Delta}_{\Gamma,n} =1n1jn(Dπ(2j1)Dπ(2j))(Γn,π(2j1)Γn,π(2j)),\displaystyle=\frac{1}{n}\sum_{1\leq j\leq n}(D_{\pi(2j-1)}-D_{\pi(2j)})(\Gamma_{n,\pi(2j-1)}-\Gamma_{n,\pi(2j)})~{},
Δ^Γ^,n\displaystyle\hat{\Delta}_{\hat{\Gamma},n} =1n1jn(Dπ(2j1)Dπ(2j))(Γ^n,π(2j1)Γ^n,π(2j)),\displaystyle=\frac{1}{n}\sum_{1\leq j\leq n}(D_{\pi(2j-1)}-D_{\pi(2j)})(\hat{\Gamma}_{n,\pi(2j-1)}-\hat{\Gamma}_{n,\pi(2j)})~{},
δΓ,j\displaystyle\delta_{\Gamma,j} =(Dπ(2j1)Dπ(2j))(Γn,π(2j1)Γn,π(2j)),\displaystyle=(D_{\pi(2j-1)}-D_{\pi(2j)})(\Gamma_{n,\pi(2j-1)}-\Gamma_{n,\pi(2j)})~{},
δΓ^,j\displaystyle\delta_{\hat{\Gamma},j} =(Dπ(2j1)Dπ(2j))(Γ^n,π(2j1)Γ^n,π(2j)).\displaystyle=(D_{\pi(2j-1)}-D_{\pi(2j)})(\hat{\Gamma}_{n,\pi(2j-1)}-\hat{\Gamma}_{n,\pi(2j)})~{}.

Then, by the proof of Theorem 4.2, \hat{\beta}_{n}^{\rm refit} equals the least squares coefficient in the linear regression of \delta_{Y,j} on \delta_{\hat{\Gamma},j}-\hat{\Delta}_{\hat{\Gamma},n}; a schematic implementation of this refitting regression is given at the end of this step. Now, for any u\in\mathbf{R}^{2} such that \|u\|_{2}=1, we have

|(1n1jn((δΓ^,jΔ^Γ^,n)u)2)1/2(1n1jn((δΓ,jΔ^Γ,n)u)2)1/2|\displaystyle\left|\left(\frac{1}{n}\sum_{1\leq j\leq n}((\delta_{\hat{\Gamma},j}-\hat{\Delta}_{\hat{\Gamma},n})^{\prime}u)^{2}\right)^{1/2}-\left(\frac{1}{n}\sum_{1\leq j\leq n}((\delta_{\Gamma,j}-\hat{\Delta}_{\Gamma,n})^{\prime}u)^{2}\right)^{1/2}\right|
(1n1jn((δΓ^,jδΓ,j)u)2((Δ^Γ^,nΔ^Γ,n)u)2)1/2\displaystyle\leq\left(\frac{1}{n}\sum_{1\leq j\leq n}((\delta_{\hat{\Gamma},j}-\delta_{\Gamma,j})^{\prime}u)^{2}-((\hat{\Delta}_{\hat{\Gamma},n}-\hat{\Delta}_{\Gamma,n})^{\prime}u)^{2}\right)^{1/2}
(2n1i2nΓ^n,i+(α^1,nr,α^0,nr)Γn,i(α1,nr,α0,nr)22)1/2\displaystyle\leq\left(\frac{2}{n}\sum_{1\leq i\leq 2n}\left\|\hat{\Gamma}_{n,i}+(\hat{\alpha}_{1,n}^{\rm r},\hat{\alpha}_{0,n}^{\rm r})^{\prime}-\Gamma_{n,i}-(\alpha_{1,n}^{\rm r},\alpha_{0,n}^{\rm r})^{\prime}\right\|_{2}^{2}\right)^{1/2}
d{0,1}12n1i2n(α^d,nrαd,nr+ψn,i(β^d,nrβd,nr))2=oP(1),\displaystyle\lesssim\sum_{d\in\{0,1\}}\frac{1}{2n}\sum_{1\leq i\leq 2n}(\hat{\alpha}_{d,n}^{\rm r}-\alpha_{d,n}^{\rm r}+\psi_{n,i}^{\prime}(\hat{\beta}_{d,n}^{\rm r}-\beta_{d,n}^{\rm r}))^{2}=o_{P}(1)~{},

where the second inequality is by the fact that

δΓ,j=(Dπ(2j1)Dπ(2j))(Γn,π(2j1)+(α1,nr,α0,nr)Γn,π(2j)(α1,nr,α0,nr)),\displaystyle\delta_{\Gamma,j}=(D_{\pi(2j-1)}-D_{\pi(2j)})(\Gamma_{n,\pi(2j-1)}+(\alpha_{1,n}^{\rm r},\alpha_{0,n}^{\rm r})^{\prime}-\Gamma_{n,\pi(2j)}-(\alpha_{1,n}^{\rm r},\alpha_{0,n}^{\rm r})^{\prime})~{},
δΓ^,j=(Dπ(2j1)Dπ(2j))(Γ^n,π(2j1)+(α^1,nr,α^0,nr)Γ^n,π(2j)(α^1,nr,α^0,nr)),\displaystyle\delta_{\hat{\Gamma},j}=(D_{\pi(2j-1)}-D_{\pi(2j)})(\hat{\Gamma}_{n,\pi(2j-1)}+(\hat{\alpha}_{1,n}^{\rm r},\hat{\alpha}_{0,n}^{\rm r})^{\prime}-\hat{\Gamma}_{n,\pi(2j)}-(\hat{\alpha}_{1,n}^{\rm r},\hat{\alpha}_{0,n}^{\rm r})^{\prime})~{},

and the last equality is by the proof of Theorem 5.1. This implies

1n1jn(δΓ^,jΔ^Γ^,n)(δΓ^,jΔ^Γ^,n)2E[Var[Γn,i|Xi]]\displaystyle\frac{1}{n}\sum_{1\leq j\leq n}(\delta_{\hat{\Gamma},j}-\hat{\Delta}_{\hat{\Gamma},n})(\delta_{\hat{\Gamma},j}-\hat{\Delta}_{\hat{\Gamma},n})^{\prime}-2E[\operatorname*{Var}[\Gamma_{n,i}|X_{i}]]
=1n1jn(δΓ^,jΔ^Γ^,n)(δΓ^,jΔ^Γ^,n)1n1jn(δΓ,jΔ^Γ,n)(δΓ,jΔ^Γ,n)\displaystyle=\frac{1}{n}\sum_{1\leq j\leq n}(\delta_{\hat{\Gamma},j}-\hat{\Delta}_{\hat{\Gamma},n})(\delta_{\hat{\Gamma},j}-\hat{\Delta}_{\hat{\Gamma},n})^{\prime}-\frac{1}{n}\sum_{1\leq j\leq n}(\delta_{\Gamma,j}-\hat{\Delta}_{\Gamma,n})(\delta_{\Gamma,j}-\hat{\Delta}_{\Gamma,n})^{\prime}
+1n1jn(δΓ,jΔ^Γ,n)(δΓ,jΔ^Γ,n)2E[Var[Γn,i|Xi]]=oP(1),\displaystyle+\frac{1}{n}\sum_{1\leq j\leq n}(\delta_{\Gamma,j}-\hat{\Delta}_{\Gamma,n})(\delta_{\Gamma,j}-\hat{\Delta}_{\Gamma,n})^{\prime}-2E[\operatorname*{Var}[\Gamma_{n,i}|X_{i}]]=o_{P}(1)~{},

where the last equality holds due to the same argument as used in the proof of Theorem 4.2. Similarly, we can show that

1n1jnδY,j(δΓ^,jΔ^Γ^,n)E[Cov[Γn,i,Yi(1)+Yi(0)|Xi]]=oP(1),\displaystyle\frac{1}{n}\sum_{1\leq j\leq n}\delta_{Y,j}(\delta_{\hat{\Gamma},j}-\hat{\Delta}_{\hat{\Gamma},n})-E[\operatorname*{Cov}[\Gamma_{n,i},Y_{i}(1)+Y_{i}(0)|X_{i}]]=o_{P}(1)~{},

which leads to (52).
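
The refitting step itself is only a two-dimensional least squares problem on within-pair differences of the first-stage fitted values. The following is a schematic Python/NumPy sketch (with hypothetical names; Gamma_hat stands for first-stage fitted values such as (\psi_{n,i}^{\prime}\hat{\beta}_{1,n}^{\rm r},\psi_{n,i}^{\prime}\hat{\beta}_{0,n}^{\rm r}) computed, for instance, as in the Lasso sketch above).

import numpy as np

def refit_coefficient(Y, D, Gamma_hat):
    # OLS of delta_Y on delta_Gamma_hat - mean(delta_Gamma_hat), as described at the start of Step 1.
    # Y and D are length-2n arrays ordered so that units (2j-1, 2j) form pair j;
    # Gamma_hat is a (2n, 2) array of first-stage fitted values for d = 1 and d = 0.
    sign = D[::2] - D[1::2]                            # D_{pi(2j-1)} - D_{pi(2j)}
    delta_Y = sign * (Y[::2] - Y[1::2])
    delta_G = sign[:, None] * (Gamma_hat[::2] - Gamma_hat[1::2])
    x = delta_G - delta_G.mean(axis=0)                 # subtract Delta_hat_{Gamma_hat, n}
    return np.linalg.lstsq(x, delta_Y, rcond=None)[0]  # hat{beta}_n^{refit}, a 2-vector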

Step 2: Verifying (9), (12), and Assumption 3.1

We first show (9). We have

12n1i2n(2Di1)(m^d,n(Xi,Wn,i)md,n(Xi,Wn,i))\displaystyle\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)(\hat{m}_{d,n}(X_{i},W_{n,i})-m_{d,n}(X_{i},W_{n,i}))
=12n1i2n(2Di1)(Γ^n,iΓn,i)β^nrefit+12n1i2n(2Di1)Γn,i(β^nrefitβnrefit)\displaystyle=\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)(\hat{\Gamma}_{n,i}-\Gamma_{n,i})^{\prime}\hat{\beta}_{n}^{\rm refit}+\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\Gamma_{n,i}^{\prime}(\hat{\beta}_{n}^{\rm refit}-\beta_{n}^{\rm refit})
=(12n1i2n(2Di1)ψn,i(β^1,nrβ1,nr),12n1i2n(2Di1)ψn,i(β^0,nrβ0,nr))β^nrefit\displaystyle=\begin{pmatrix}\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}(\hat{\beta}_{1,n}^{\rm r}-\beta_{1,n}^{\rm r}),\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}(\hat{\beta}_{0,n}^{\rm r}-\beta_{0,n}^{\rm r})\end{pmatrix}\hat{\beta}_{n}^{\rm refit}
\displaystyle+\begin{pmatrix}\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}\beta_{1,n}^{\rm r},\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}\beta_{0,n}^{\rm r}\end{pmatrix}(\hat{\beta}_{n}^{\rm refit}-\beta_{n}^{\rm refit})
=oP(1),\displaystyle=o_{P}(1)~{},

where the last equality holds by (52) and the facts that

(12n1i2n(2Di1)ψn,i(β^1,nrβ1,nr),12n1i2n(2Di1)ψn,i(β^0,nrβ0,nr))=oP(1)\displaystyle\begin{pmatrix}\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}(\hat{\beta}_{1,n}^{\rm r}-\beta_{1,n}^{\rm r}),\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}(\hat{\beta}_{0,n}^{\rm r}-\beta_{0,n}^{\rm r})\end{pmatrix}=o_{P}(1)

as shown in Theorem 5.1 and

(12n1i2n(2Di1)ψn,iβ1,nr,12n1i2n(2Di1)ψn,iβ0,nr)=OP(1).\displaystyle\begin{pmatrix}\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}\beta_{1,n}^{\rm r},\frac{1}{\sqrt{2n}}\sum_{1\leq i\leq 2n}(2D_{i}-1)\psi_{n,i}^{\prime}\beta_{0,n}^{\rm r}\end{pmatrix}=O_{P}(1)~{}.

Next, we show (12). We note that

12n1i2n(m^d,n(Xi,Wn,i)md,n(Xi,Wn,i))2\displaystyle\frac{1}{2n}\sum_{1\leq i\leq 2n}(\hat{m}_{d,n}(X_{i},W_{n,i})-m_{d,n}(X_{i},W_{n,i}))^{2}
12n1i2n((Γ^n,iΓn,i)β^nrefit)2+12n1i2n(Γn,i(β^nrefitβnrefit))2\displaystyle\lesssim\frac{1}{2n}\sum_{1\leq i\leq 2n}((\hat{\Gamma}_{n,i}-\Gamma_{n,i})^{\prime}\hat{\beta}_{n}^{\rm refit})^{2}+\frac{1}{2n}\sum_{1\leq i\leq 2n}(\Gamma_{n,i}^{\prime}(\hat{\beta}_{n}^{\rm refit}-\beta_{n}^{\rm refit}))^{2}
12n1i2n((ψn,i(β1,nrβ^1,nr))2+(ψn,i(β0,nrβ^0,nr))2)β^nrefit22\displaystyle\lesssim\frac{1}{2n}\sum_{1\leq i\leq 2n}((\psi_{n,i}^{\prime}(\beta_{1,n}^{\rm r}-\hat{\beta}_{1,n}^{\rm r}))^{2}+(\psi_{n,i}^{\prime}(\beta_{0,n}^{\rm r}-\hat{\beta}_{0,n}^{\rm r}))^{2})||\hat{\beta}_{n}^{\rm refit}||^{2}_{2}
+12n1i2n[(ψn,iβ1,nr)2+(ψn,iβ0,nr)2]β^nrefitβnrefit22\displaystyle+\frac{1}{2n}\sum_{1\leq i\leq 2n}[(\psi_{n,i}^{\prime}\beta_{1,n}^{\rm r})^{2}+(\psi_{n,i}^{\prime}\beta_{0,n}^{\rm r})^{2}]||\hat{\beta}_{n}^{\rm refit}-\beta_{n}^{\rm refit}||_{2}^{2}
d=0,112n1i2n[(αd,nrα^d,nr+ψn,i(βd,nrβ^d,nr))2+(αd,nrα^d,nr)2]β^nrefit22+oP(1)\displaystyle\lesssim\sum_{d=0,1}\frac{1}{2n}\sum_{1\leq i\leq 2n}\left[(\alpha_{d,n}^{\rm r}-\hat{\alpha}_{d,n}^{\rm r}+\psi_{n,i}^{\prime}(\beta_{d,n}^{\rm r}-\hat{\beta}_{d,n}^{\rm r}))^{2}+(\alpha_{d,n}^{\rm r}-\hat{\alpha}_{d,n}^{\rm r})^{2}\right]||\hat{\beta}_{n}^{\rm refit}||^{2}_{2}+o_{P}(1)
=oP(1).\displaystyle=o_{P}(1)~{}.

Last, Assumption 3.1 can be verified in the same manner as in the proof of Theorem 5.1.

Step 3: Asymptotic variance

Recall \sigma_{2}^{2}(Q) and \sigma_{3}^{2}(Q) defined in Theorem 3.1. As we have already verified (9) for \hat{m}_{d,n}(X_{i},W_{n,i})=\hat{\Gamma}_{n,i}^{\prime}\hat{\beta}_{n}^{\rm refit} and m_{d,n}(X_{i},W_{n,i})=\Gamma_{n,i}^{\prime}\beta_{n}^{\rm refit}, we have, for b\in\{\rm{unadj,r,refit}\}, that

σnb,2σ22(Q)σ32(Q)=12E[Var[E[Yi(1)+Yi(0)|Xi,Wn,i]Γn,iγb|Xi]]\displaystyle\sigma_{n}^{\rm b,2}-\sigma_{2}^{2}(Q)-\sigma_{3}^{2}(Q)=\frac{1}{2}E\left[\operatorname*{Var}[E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{n,i}]-\Gamma_{n,i}^{\prime}\gamma^{b}|X_{i}]\right]

with

γunadj=(0,0),γr=(1,1),andγrefit=βnrefit.\displaystyle\gamma^{\rm unadj}=(0,0)^{\prime}~{},\quad\gamma^{\rm r}=(1,1)^{\prime}~{},\quad\text{and}\quad\gamma^{\rm refit}=\beta_{n}^{\rm refit}~{}.

In addition, we note that

12E[Var[E[Yi(1)+Yi(0)|Xi,Wn,i]Γn,iγ|Xi]]\displaystyle\frac{1}{2}E\left[\operatorname*{Var}[E[Y_{i}(1)+Y_{i}(0)|X_{i},W_{n,i}]-\Gamma_{n,i}^{\prime}\gamma|X_{i}]\right]

is minimized at γ=βnrefit\gamma=\beta_{n}^{\rm refit}, which leads to the desired result.  

Appendix B Auxiliary Lemmas

Lemma B.1.

Suppose ϕn,n1\phi_{n},n\geq 1 is a sequence of random variables satisfying

limλlim supnE[|ϕn|I{|ϕn|>λ}]=0.\lim_{\lambda\to\infty}\limsup_{n\to\infty}E[|\phi_{n}|I\{|\phi_{n}|>\lambda\}]=0~{}. (53)

Suppose XX is another random variable defined on the same probability space with ϕn,n1\phi_{n},n\geq 1. Then,

limγlim supnE[E[|ϕn||X]I{E[|ϕn||X]>γ}]=0.\lim_{\gamma\to\infty}\limsup_{n\to\infty}E[E[|\phi_{n}||X]I\{E[|\phi_{n}||X]>\gamma\}]=0~{}. (54)
Proof.

Fix ϵ>0\epsilon>0. We will show there exists γ>0\gamma>0 so that

lim supnE[E[|ϕn||X]I{E[|ϕn||X]>γ}]<ϵ.\limsup_{n\to\infty}E[E[|\phi_{n}||X]I\{E[|\phi_{n}||X]>\gamma\}]<\epsilon~{}. (55)

First note the event {E[|ϕn||X]>γ}\{E[|\phi_{n}||X]>\gamma\} is measurable with respect to the σ\sigma-algebra generated by XX, and therefore

E[E[|ϕn||X]I{E[|ϕn||X]>γ}]=E[|ϕn|I{E[|ϕn||X]>γ}].E[E[|\phi_{n}||X]I\{E[|\phi_{n}||X]>\gamma\}]=E[|\phi_{n}|I\{E[|\phi_{n}||X]>\gamma\}]~{}. (56)

Next, by Theorem 10.3.5 of Dudley (1989), (53) implies that there exists a δ>0\delta>0 such that for any sequence of events AnA_{n} such that lim supnP{An}<δ\limsup_{n\to\infty}P\{A_{n}\}<\delta, we have

lim supnE[|ϕn|I{An}]<ϵ.\limsup_{n\to\infty}E[|\phi_{n}|I\{A_{n}\}]<\epsilon~{}. (57)

In light of the previous result, note

P{E[|ϕn||X]>γ}\displaystyle P\{E[|\phi_{n}||X]>\gamma\} E[E[|ϕn||X]]γ=E[|ϕn|]γ\displaystyle\leq\frac{E[E[|\phi_{n}||X]]}{\gamma}=\frac{E[|\phi_{n}|]}{\gamma}

By Theorem 10.3.5 of Dudley (1989) again, (53) implies lim supnE[|ϕn|]<\limsup_{n\to\infty}E[|\phi_{n}|]<\infty, so by choosing γ\gamma large enough, we can make sure

\limsup_{n\to\infty}P\{E[|\phi_{n}||X]>\gamma\}<\delta~{}.

(55) then follows from (56)–(57).  

Lemma B.2.

Suppose Assumptions 2.1–2.3 and 3.1 hold. Then,

sn2nE[Var[ϕ1,n,i|Xi]]P1.\frac{s_{n}^{2}}{nE[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]}\stackrel{{\scriptstyle P}}{{\to}}1~{}.
Proof.

To begin, note it follows from Assumption 2.2 and Qn=Q2nQ_{n}=Q^{2n} that

1n1i2nDiVar[ϕ1,n,i|Xi]=12n1i2nVar[ϕ1,n,i|Xi]+12n1i2n:Di=1Var[ϕ1,n,i|Xi]12n1i2n:Di=0Var[ϕ1,n,i|Xi].\frac{1}{n}\sum_{1\leq i\leq 2n}D_{i}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]=\frac{1}{2n}\sum_{1\leq i\leq 2n}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]\\ +\frac{1}{2n}\sum_{1\leq i\leq 2n:D_{i}=1}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]-\frac{1}{2n}\sum_{1\leq i\leq 2n:D_{i}=0}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]~{}. (58)

Next,

|12n1i2n:Di=1Var[ϕ1,n,iXi]12n1i2n:Di=0Var[ϕ1,n,iXi]|12n1jn|Var[ϕ1,n,π(2j1)Xπ(2j1)]Var[ϕ1,n,π(2j)Xπ(2j)]|.\left|\frac{1}{2n}\sum_{1\leq i\leq 2n:D_{i}=1}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]-\frac{1}{2n}\sum_{1\leq i\leq 2n:D_{i}=0}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]\right|\\ \leq\frac{1}{2n}\sum_{1\leq j\leq n}|\operatorname*{Var}[\phi_{1,n,\pi(2j-1)}|X_{\pi(2j-1)}]-\operatorname*{Var}[\phi_{1,n,\pi(2j)}|X_{\pi(2j)}]|~{}. (59)

In what follows, we will show

1n1jn|Cov[Yπ(2j1)(1),m1,n(Xπ(2j1),Wπ(2j1))|Xπ(2j1)]]Cov[Yπ(2j)(1),m1,n(Xπ(2j),Wπ(2j))|Xπ(2j)]]|P0.\frac{1}{n}\sum_{1\leq j\leq n}|\operatorname*{Cov}[Y_{\pi(2j-1)}(1),m_{1,n}(X_{\pi(2j-1)},W_{\pi(2j-1)})|X_{\pi(2j-1)}]]\\ -\operatorname*{Cov}[Y_{\pi(2j)}(1),m_{1,n}(X_{\pi(2j)},W_{\pi(2j)})|X_{\pi(2j)}]]|\stackrel{{\scriptstyle P}}{{\to}}0~{}.

To that end, first note from Assumptions 2.3 and 3.1(c) that

1n1jn|E[Yπ(2j1)(1)m1,n(Xπ(2j1),Wπ(2j1))|Xπ(2j1)]E[Yπ(2j)(1)m1,n(Xπ(2j),Wπ(2j))|Xπ(2j)]|\displaystyle\frac{1}{n}\sum_{1\leq j\leq n}|E[Y_{\pi(2j-1)}(1)m_{1,n}(X_{\pi(2j-1)},W_{\pi(2j-1)})|X_{\pi(2j-1)}]-E[Y_{\pi(2j)}(1)m_{1,n}(X_{\pi(2j)},W_{\pi(2j)})|X_{\pi(2j)}]|
1n1jn|Xπ(2j1)Xπ(2j)|P0.\displaystyle\lesssim\frac{1}{n}\sum_{1\leq j\leq n}|X_{\pi(2j-1)}-X_{\pi(2j)}|\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Next, note

1n1jn|E[Yπ(2j1)(1)|Xπ(2j1)]E[m1,n(Xπ(2j1),Wπ(2j1))|Xπ(2j1)]\displaystyle\frac{1}{n}\sum_{1\leq j\leq n}|E[Y_{\pi(2j-1)}(1)|X_{\pi(2j-1)}]E[m_{1,n}(X_{\pi(2j-1)},W_{\pi(2j-1)})|X_{\pi(2j-1)}]
E[Yπ(2j)(1)|Xπ(2j)]E[m1,n(Xπ(2j),Wπ(2j))|Xπ(2j)]|\displaystyle\hskip 30.00005pt-E[Y_{\pi(2j)}(1)|X_{\pi(2j)}]E[m_{1,n}(X_{\pi(2j)},W_{\pi(2j)})|X_{\pi(2j)}]|
1n1jn|E[Yπ(2j1)(1)|Xπ(2j1)]||E[m1,n(Xπ(2j1),Wπ(2j1))|Xπ(2j1)]E[m1,n(Xπ(2j),Wπ(2j))|Xπ(2j)]|\displaystyle\leq\frac{1}{n}\sum_{1\leq j\leq n}|E[Y_{\pi(2j-1)}(1)|X_{\pi(2j-1)}]||E[m_{1,n}(X_{\pi(2j-1)},W_{\pi(2j-1)})|X_{\pi(2j-1)}]-E[m_{1,n}(X_{\pi(2j)},W_{\pi(2j)})|X_{\pi(2j)}]|
+1n1jn|E[Yπ(2j1)(1)|Xπ(2j1)]E[Yπ(2j)(1)|Xπ(2j)]||E[m1,n(Xπ(2j),Wπ(2j))|Xπ(2j)]|\displaystyle\hskip 30.00005pt+\frac{1}{n}\sum_{1\leq j\leq n}|E[Y_{\pi(2j-1)}(1)|X_{\pi(2j-1)}]-E[Y_{\pi(2j)}(1)|X_{\pi(2j)}]||E[m_{1,n}(X_{\pi(2j)},W_{\pi(2j)})|X_{\pi(2j)}]|
(1n1jn|E[Yπ(2j1)(1)|Xπ(2j1)]|2)1/2\displaystyle\leq\left(\frac{1}{n}\sum_{1\leq j\leq n}|E[Y_{\pi(2j-1)}(1)|X_{\pi(2j-1)}]|^{2}\right)^{1/2}
×(1n1jn|E[m1,n(Xπ(2j1),Wπ(2j1))|Xπ(2j1)]E[m1,n(Xπ(2j),Wπ(2j))|Xπ(2j)]|2)1/2\displaystyle\hskip 50.00008pt\times\left(\frac{1}{n}\sum_{1\leq j\leq n}|E[m_{1,n}(X_{\pi(2j-1)},W_{\pi(2j-1)})|X_{\pi(2j-1)}]-E[m_{1,n}(X_{\pi(2j)},W_{\pi(2j)})|X_{\pi(2j)}]|^{2}\right)^{1/2}
+(1n1jn|E[m1,n(Xπ(2j),Wπ(2j))|Xπ(2j)]|2)1/2\displaystyle\hskip 30.00005pt+\left(\frac{1}{n}\sum_{1\leq j\leq n}|E[m_{1,n}(X_{\pi(2j)},W_{\pi(2j)})|X_{\pi(2j)}]|^{2}\right)^{1/2}
×(1n1jn|E[Yπ(2j1)(1)|Xπ(2j1)]E[Yπ(2j)(1)|Xπ(2j)]|2)1/2\displaystyle\hskip 50.00008pt\times\left(\frac{1}{n}\sum_{1\leq j\leq n}|E[Y_{\pi(2j-1)}(1)|X_{\pi(2j-1)}]-E[Y_{\pi(2j)}(1)|X_{\pi(2j)}]|^{2}\right)^{1/2}
(1n1i2n|E[Yi(1)|Xi]|2)1/2(1n1jn|Xπ(2j1)Xπ(2j)|2)1/2\displaystyle\lesssim\left(\frac{1}{n}\sum_{1\leq i\leq 2n}|E[Y_{i}(1)|X_{i}]|^{2}\right)^{1/2}\left(\frac{1}{n}\sum_{1\leq j\leq n}|X_{\pi(2j-1)}-X_{\pi(2j)}|^{2}\right)^{1/2}
+(1n1i2n|E[m1,n(Xi,Wi)|Xi]|2)1/2(1n1jn|Xπ(2j1)Xπ(2j)|2)1/2P0,\displaystyle\hskip 30.00005pt+\left(\frac{1}{n}\sum_{1\leq i\leq 2n}|E[m_{1,n}(X_{i},W_{i})|X_{i}]|^{2}\right)^{1/2}\left(\frac{1}{n}\sum_{1\leq j\leq n}|X_{\pi(2j-1)}-X_{\pi(2j)}|^{2}\right)^{1/2}\stackrel{{\scriptstyle P}}{{\to}}0~{},

where the first inequality follows from the triangle inequality, the second follows from the Cauchy-Schwarz inequality, and the last follows from Assumptions 2.1(c) and 3.1(c). To see that the convergence holds, first note that, because

E[|E[Yi(1)|Xi]|2]E[E[Yi2(1)|Xi]]=E[Yi2(1)]<,E[|E[Y_{i}(1)|X_{i}]|^{2}]\leq E[E[Y_{i}^{2}(1)|X_{i}]]=E[Y_{i}^{2}(1)]<\infty~{},

the weak law of large numbers implies

1n1i2n|E[Yi(1)|Xi]|2P2E[|E[Yi(1)|Xi]|2]<.\frac{1}{n}\sum_{1\leq i\leq 2n}|E[Y_{i}(1)|X_{i}]|^{2}\stackrel{{\scriptstyle P}}{{\to}}2E[|E[Y_{i}(1)|X_{i}]|^{2}]<\infty~{}.

On the other hand,

12n1i2n|E[m1,n(Xi,Wi)|Xi]|212n1i2nE[m1,n2(Xi,Wi)|Xi].\frac{1}{2n}\sum_{1\leq i\leq 2n}|E[m_{1,n}(X_{i},W_{i})|X_{i}]|^{2}\leq\frac{1}{2n}\sum_{1\leq i\leq 2n}E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]~{}.

Assumption 3.1(b) and Lemma B.1 imply

limλlim supnE[E[m1,n2(Xi,Wi)|Xi]I{E[m1,n2(Xi,Wi)|Xi]>λ}]=0.\lim_{\lambda\to\infty}\limsup_{n\to\infty}E[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]I\{E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]>\lambda\}]=0~{}.

Therefore, Lemma 11.4.2 of Lehmann and Romano (2005) implies

12n1i2nE[m1,n2(Xi,Wi)|Xi]E[E[m1,n2(Xi,Wi)|Xi]]P0.\frac{1}{2n}\sum_{1\leq i\leq 2n}E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]-E[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]]\stackrel{{\scriptstyle P}}{{\to}}0~{}.

Finally, note E[E[m1,n2(Xi,Wi)|Xi]]=E[m1,n2(Xi,Wi)]E[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]]=E[m_{1,n}^{2}(X_{i},W_{i})] is bounded for n1n\geq 1 by Assumption 3.1(b), so

1n1i2n|E[m1,n(Xi,Wi)|Xi]|2=OP(1).\frac{1}{n}\sum_{1\leq i\leq 2n}|E[m_{1,n}(X_{i},W_{i})|X_{i}]|^{2}=O_{P}(1)~{}.

The desired convergence therefore follows.

Similar arguments applied termwise imply the right-hand side of (59) is oP(1)o_{P}(1). (58)–(59) then imply

\frac{s_{n}^{2}}{n}-\frac{1}{2n}\sum_{1\leq i\leq 2n}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]\stackrel{P}{\to}0~{}. (60)

Next, we argue

\frac{1}{2n}\sum_{1\leq i\leq 2n}\operatorname*{Var}[\phi_{1,n,i}|X_{i}]-E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]\stackrel{P}{\to}0~{}. (61)

To establish (61), we verify the uniform integrability condition in Lemma 11.4.2 of Lehmann and Romano (2005). To that end, we will repeatedly use the inequality

|1jkaj|I{|1jkaj|>λ}\displaystyle\left|\sum_{1\leq j\leq k}a_{j}\right|I\left\{\left|\sum_{1\leq j\leq k}a_{j}\right|>\lambda\right\} 1jkk|aj|I{|aj|>λk}\displaystyle\leq\sum_{1\leq j\leq k}k|a_{j}|I\left\{|a_{j}|>\frac{\lambda}{k}\right\} (62)
|ab|I{|ab|>λ}\displaystyle|ab|I\{|ab|>\lambda\} |a|2I{|a|>λ}+|b|2I{|b|>λ}.\displaystyle\leq|a|^{2}I\{|a|>\sqrt{\lambda}\}+|b|^{2}I\{|b|>\sqrt{\lambda}\}~{}. (63)
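
For completeness, both inequalities follow from elementary arguments. For (62), on the event \{|\sum_{1\leq j\leq k}a_{j}|>\lambda\} we must have \max_{1\leq j\leq k}|a_{j}|>\lambda/k, so that

\left|\sum_{1\leq j\leq k}a_{j}\right|I\left\{\left|\sum_{1\leq j\leq k}a_{j}\right|>\lambda\right\}\leq k\max_{1\leq j\leq k}|a_{j}|I\left\{\max_{1\leq j\leq k}|a_{j}|>\frac{\lambda}{k}\right\}\leq\sum_{1\leq j\leq k}k|a_{j}|I\left\{|a_{j}|>\frac{\lambda}{k}\right\}~{}.

For (63), if |ab|>\lambda then \max(|a|,|b|)>\sqrt{\lambda}, and |ab|\leq\max(|a|^{2},|b|^{2}), so the left-hand side of (63) is bounded by |a|^{2}I\{|a|>\sqrt{\lambda}\}+|b|^{2}I\{|b|>\sqrt{\lambda}\}.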

Note

E[|Var[ϕ1,n,iXi]E[Var[ϕ1,n,iXi]]|I{|Var[ϕ1,n,iXi]E[Var[ϕ1,n,iXi]]|>λ}]\displaystyle E[|\operatorname*{Var}[\phi_{1,n,i}|X_{i}]-E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]|I\{|\operatorname*{Var}[\phi_{1,n,i}|X_{i}]-E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]|>\lambda\}]
E[|Var[ϕ1,n,iXi]|I{|Var[ϕ1,n,iXi]|>λ2}]+E[Var[ϕ1,n,i|Xi]]I{E[Var[ϕ1,n,i|Xi]]>λ2}\displaystyle\lesssim E\left[|\operatorname*{Var}[\phi_{1,n,i}|X_{i}]|I\left\{|\operatorname*{Var}[\phi_{1,n,i}|X_{i}]|>\frac{\lambda}{2}\right\}\right]+E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]I\left\{E[\operatorname*{Var}[\phi_{1,n,i}|X_{i}]]>\frac{\lambda}{2}\right\}
E[E[ϕ1,n,i2|Xi]I{E[ϕ1,n,i2|Xi]>λ2}]+E[ϕ1,n,i2]I{E[ϕ1,n,i2]>λ2},\displaystyle\leq E\left[E[\phi_{1,n,i}^{2}|X_{i}]I\left\{E[\phi_{1,n,i}^{2}|X_{i}]>\frac{\lambda}{2}\right\}\right]+E[\phi_{1,n,i}^{2}]I\left\{E[\phi_{1,n,i}^{2}]>\frac{\lambda}{2}\right\}~{},

where in the second inequality we use the fact that the variance of a random variable is bounded by its second moment. Note Assumption 3.1 implies E[ϕ1,n,i2]E[\phi_{1,n,i}^{2}] is bounded for n1n\geq 1, and therefore

limλlim supnE[ϕ1,n,i2]I{E[ϕ1,n,i2]>λ2}=0.\lim_{\lambda\to\infty}\limsup_{n\to\infty}E[\phi_{1,n,i}^{2}]I\left\{E[\phi_{1,n,i}^{2}]>\frac{\lambda}{2}\right\}=0~{}.

On the other hand

E[E[ϕ1,n,i2|Xi]I{E[ϕ1,n,i2|Xi]>λ2}]\displaystyle E\left[E[\phi_{1,n,i}^{2}|X_{i}]I\left\{E[\phi_{1,n,i}^{2}|X_{i}]>\frac{\lambda}{2}\right\}\right] (64)
E[E[Yi2(1)|Xi]I{E[Yi2(1)|Xi]>λ12}]+E[E[m1,n2(Xi,Wi)|Xi]I{E[m1,n2(Xi,Wi)|Xi]>λ3}]\displaystyle\lesssim E\left[E[Y_{i}^{2}(1)|X_{i}]I\left\{E[Y_{i}^{2}(1)|X_{i}]>\frac{\lambda}{12}\right\}\right]+E\left[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\lambda}{3}\right\}\right]
+E[E[m0,n2(Xi,Wi)|Xi]I{E[m0,n2(Xi,Wi)|Xi]>λ3}]\displaystyle\hskip 30.00005pt+E\left[E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\lambda}{3}\right\}\right]
+E[|E[Yi(1)m1,n(Xi,Wi)|Xi]|I{|E[Yi(1)m1,n(Xi,Wi)|Xi]|>λ12}]\displaystyle\hskip 30.00005pt+E\left[|E[Y_{i}(1)m_{1,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[Y_{i}(1)m_{1,n}(X_{i},W_{i})|X_{i}]|>\frac{\lambda}{12}\right\}\right]
+E[|E[Yi(1)m0,n(Xi,Wi)|Xi]|I{|E[Yi(1)m0,n(Xi,Wi)|Xi]|>λ12}]\displaystyle\hskip 30.00005pt+E\left[|E[Y_{i}(1)m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[Y_{i}(1)m_{0,n}(X_{i},W_{i})|X_{i}]|>\frac{\lambda}{12}\right\}\right]
+E[|E[m1,n(Xi,Wi)m0,n(Xi,Wi)|Xi]|I{|E[m1,n(Xi,Wi)m0,n(Xi,Wi)|Xi]|>λ6}].\displaystyle\hskip 30.00005pt+E\left[|E[m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|X_{i}]|>\frac{\lambda}{6}\right\}\right]~{}.

It follows from Assumptions 2.1(b) and 3.1(b) together with Lemma B.1 that

limλlim supnE[E[Yi2(1)|Xi]I{E[Yi2(1)|Xi]>λ12}]\displaystyle\lim_{\lambda\to\infty}\limsup_{n\to\infty}E\left[E[Y_{i}^{2}(1)|X_{i}]I\left\{E[Y_{i}^{2}(1)|X_{i}]>\frac{\lambda}{12}\right\}\right] =0\displaystyle=0
limλlim supnE[E[m1,n2(Xi,Wi)|Xi]I{E[m1,n2(Xi,Wi)|Xi]>λ3}]\displaystyle\lim_{\lambda\to\infty}\limsup_{n\to\infty}E\left[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\lambda}{3}\right\}\right] =0\displaystyle=0
limλlim supnE[E[m0,n2(Xi,Wi)|Xi]I{E[m0,n2(Xi,Wi)|Xi]>λ3}]\displaystyle\lim_{\lambda\to\infty}\limsup_{n\to\infty}E\left[E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\lambda}{3}\right\}\right] =0.\displaystyle=0~{}.

For the last term in (64), note

E[|E[m1,n(Xi,Wi)m0,n(Xi,Wi)|Xi]|I{|E[m1,n(Xi,Wi)m0,n(Xi,Wi)|Xi]|>λ6}]\displaystyle E\left[|E[m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|X_{i}]|>\frac{\lambda}{6}\right\}\right]
E[E[|m1,n(Xi,Wi)m0,n(Xi,Wi)||Xi]I{E[|m1,n(Xi,Wi)m0,n(Xi,Wi)||Xi]>λ6}].\displaystyle\leq E\left[E[|m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})||X_{i}]I\left\{E[|m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})||X_{i}]>\frac{\lambda}{6}\right\}\right]~{}.

Meanwhile,

E[E[|m1,n(Xi,Wi)m0,n(Xi,Wi)|I{|m1,n(Xi,Wi)m0,n(Xi,Wi)|>λ}]\displaystyle E\left[E[|m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|I\left\{|m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|>\lambda\right\}\right]
E[m1,n2(Xi,Wi)I{|m1,n(Xi,Wi)|>λ}]+E[m0,n2(Xi,Wi)I{|m0,n(Xi,Wi)|>λ}].\displaystyle\leq E[m_{1,n}^{2}(X_{i},W_{i})I\{|m_{1,n}(X_{i},W_{i})|>\sqrt{\lambda}\}]+E[m_{0,n}^{2}(X_{i},W_{i})I\{|m_{0,n}(X_{i},W_{i})|>\sqrt{\lambda}\}]~{}.

It then follows from the previous two inequalities, Assumption 3.1(b), and Lemma B.1 that

limλlim supnE[|E[m1,n(Xi,Wi)m0,n(Xi,Wi)|Xi]|I{|E[m1,n(Xi,Wi)m0,n(Xi,Wi)|Xi]|>λ6}]=0.\lim_{\lambda\to\infty}\limsup_{n\to\infty}E\left[|E[m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{1,n}(X_{i},W_{i})m_{0,n}(X_{i},W_{i})|X_{i}]|>\frac{\lambda}{6}\right\}\right]=0~{}.

Similar arguments establish

limλlim supnE[|E[Yi(1)m1,n(Xi,Wi)|Xi]|I{|E[Yi(1)m1,n(Xi,Wi)|Xi]|>λ12}]\displaystyle\lim_{\lambda\to\infty}\limsup_{n\to\infty}E\left[|E[Y_{i}(1)m_{1,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[Y_{i}(1)m_{1,n}(X_{i},W_{i})|X_{i}]|>\frac{\lambda}{12}\right\}\right] =0\displaystyle=0
limλlim supnE[|E[Yi(1)m0,n(Xi,Wi)|Xi]|I{|E[Yi(1)m0,n(Xi,Wi)|Xi]|>λ12}]\displaystyle\lim_{\lambda\to\infty}\limsup_{n\to\infty}E\left[|E[Y_{i}(1)m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[Y_{i}(1)m_{0,n}(X_{i},W_{i})|X_{i}]|>\frac{\lambda}{12}\right\}\right] =0.\displaystyle=0~{}.

Therefore, (61) follows. The conclusion then follows from (60)–(61) and Assumption 3.1(a).  

Lemma B.3.

Suppose Assumptions 2.1–2.3 and 3.1 hold. Then,

limγlim supnE[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|2>γ}]=0.\lim_{\gamma\to\infty}\limsup_{n\to\infty}E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}>\gamma\}]=0~{}.
Proof.

Note

E[|ϕ1,n,iE[ϕ1,n,i|Xi]|2I{|ϕ1,n,iE[ϕ1,n,i|Xi]|2>γ}]\displaystyle E[|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}I\{|\phi_{1,n,i}-E[\phi_{1,n,i}|X_{i}]|^{2}>\gamma\}]
E[(ϕ1,n,i2+E[ϕ1,n,i|Xi]2)I{ϕ1,n,i2+E[ϕ1,n,i|Xi]2>γ2}]\displaystyle\lesssim E\left[(\phi_{1,n,i}^{2}+E[\phi_{1,n,i}|X_{i}]^{2})I\left\{\phi_{1,n,i}^{2}+E[\phi_{1,n,i}|X_{i}]^{2}>\frac{\gamma}{2}\right\}\right]
E[ϕ1,n,i2I{ϕ1,n,i2>γ4}]+E[E[ϕ1,n,i|Xi]2I{E[ϕ1,n,i|Xi]2>γ4}].\displaystyle\lesssim E\left[\phi_{1,n,i}^{2}I\left\{\phi_{1,n,i}^{2}>\frac{\gamma}{4}\right\}\right]+E\left[E[\phi_{1,n,i}|X_{i}]^{2}I\left\{E[\phi_{1,n,i}|X_{i}]^{2}>\frac{\gamma}{4}\right\}\right]~{}.

where the first inequality follows from (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}) and the second inequality follows from (62). Next, note

E[E[ϕ1,n,i|Xi]2I{E[ϕ1,n,i|Xi]2>γ4}]\displaystyle E\left[E[\phi_{1,n,i}|X_{i}]^{2}I\left\{E[\phi_{1,n,i}|X_{i}]^{2}>\frac{\gamma}{4}\right\}\right]
E[E[Yi(1)|Xi]2I{E[Yi(1)|Xi]2>γ24}]+E[E[m1,n(Xi,Wi)|Xi]2I{E[m1,n(Xi,Wi)|Xi]2>γ6}]\displaystyle\lesssim E\left[E[Y_{i}(1)|X_{i}]^{2}I\left\{E[Y_{i}(1)|X_{i}]^{2}>\frac{\gamma}{24}\right\}\right]+E\left[E[m_{1,n}(X_{i},W_{i})|X_{i}]^{2}I\left\{E[m_{1,n}(X_{i},W_{i})|X_{i}]^{2}>\frac{\gamma}{6}\right\}\right]
+E[E[m0,n(Xi,Wi)|Xi]2I{E[m0,n(Xi,Wi)|Xi]2>γ6}]\displaystyle\hskip 30.00005pt+E\left[E[m_{0,n}(X_{i},W_{i})|X_{i}]^{2}I\left\{E[m_{0,n}(X_{i},W_{i})|X_{i}]^{2}>\frac{\gamma}{6}\right\}\right]
+E[|E[Yi(1)|Xi]E[m1,n(Xi,Wi)|Xi]|I{|E[Yi(1)|Xi]E[m1,n(Xi,Wi)|Xi]|>γ24}]\displaystyle\hskip 30.00005pt+E\left[|E[Y_{i}(1)|X_{i}]E[m_{1,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[Y_{i}(1)|X_{i}]E[m_{1,n}(X_{i},W_{i})|X_{i}]|>\frac{\gamma}{24}\right\}\right]
+E[|E[Yi(1)|Xi]E[m0,n(Xi,Wi)|Xi]|I{|E[Yi(1)|Xi]E[m0,n(Xi,Wi)|Xi]|>γ24}]\displaystyle\hskip 30.00005pt+E\left[|E[Y_{i}(1)|X_{i}]E[m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[Y_{i}(1)|X_{i}]E[m_{0,n}(X_{i},W_{i})|X_{i}]|>\frac{\gamma}{24}\right\}\right]
+E[|E[m1,n(Xi,Wi)|Xi]E[m0,n(Xi,Wi)|Xi]|I{|E[m1,n(Xi,Wi)|Xi]E[m0,n(Xi,Wi)|Xi]|>γ12}]\displaystyle\hskip 30.00005pt+E\left[|E[m_{1,n}(X_{i},W_{i})|X_{i}]E[m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{1,n}(X_{i},W_{i})|X_{i}]E[m_{0,n}(X_{i},W_{i})|X_{i}]|>\frac{\gamma}{12}\right\}\right]
E[E[Yi2(1)|Xi]I{E[Yi2(1)|Xi]>γ24}]+E[E[m1,n2(Xi,Wi)|Xi]I{E[m1,n2(Xi,Wi)|Xi]>γ6}]\displaystyle\lesssim E\left[E[Y_{i}^{2}(1)|X_{i}]I\left\{E[Y_{i}^{2}(1)|X_{i}]>\frac{\gamma}{24}\right\}\right]+E\left[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\gamma}{6}\right\}\right]
+E[E[m0,n2(Xi,Wi)|Xi]I{E[m0,n2(Xi,Wi)|Xi]>γ6}]\displaystyle\hskip 30.00005pt+E\left[E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\gamma}{6}\right\}\right]
+E[|E[Yi(1)|Xi]|I{|E[Yi(1)|Xi]|>γ24}]\displaystyle\hskip 30.00005pt+E\left[|E[Y_{i}(1)|X_{i}]|I\left\{|E[Y_{i}(1)|X_{i}]|>\sqrt{\frac{\gamma}{24}}\right\}\right]
+E[|E[m1,n(Xi,Wi)|Xi]|I{|E[m1,n(Xi,Wi)|Xi]|>γ24}]\displaystyle\hskip 30.00005pt+E\left[|E[m_{1,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{1,n}(X_{i},W_{i})|X_{i}]|>\sqrt{\frac{\gamma}{24}}\right\}\right]
+E[|E[m0,n(Xi,Wi)|Xi]|I{|E[m0,n(Xi,Wi)|Xi]|>γ24}]\displaystyle\hskip 30.00005pt+E\left[|E[m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{0,n}(X_{i},W_{i})|X_{i}]|>\sqrt{\frac{\gamma}{24}}\right\}\right]
+E[|E[m1,n(Xi,Wi)|Xi]|I{|E[m1,n(Xi,Wi)|Xi]|>γ12}]\displaystyle\hskip 30.00005pt+E\left[|E[m_{1,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{1,n}(X_{i},W_{i})|X_{i}]|>\sqrt{\frac{\gamma}{12}}\right\}\right]
+E[|E[m0,n(Xi,Wi)|Xi]|I{|E[m0,n(Xi,Wi)|Xi]|>γ12}]\displaystyle\hskip 30.00005pt+E\left[|E[m_{0,n}(X_{i},W_{i})|X_{i}]|I\left\{|E[m_{0,n}(X_{i},W_{i})|X_{i}]|>\sqrt{\frac{\gamma}{12}}\right\}\right]
E[E[Yi2(1)|Xi]I{E[Yi2(1)|Xi]>γ24}]+E[E[m1,n2(Xi,Wi)|Xi]I{E[m1,n2(Xi,Wi)|Xi]>γ6}]\displaystyle\leq E\left[E[Y_{i}^{2}(1)|X_{i}]I\left\{E[Y_{i}^{2}(1)|X_{i}]>\frac{\gamma}{24}\right\}\right]+E\left[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\gamma}{6}\right\}\right]
+E[E[m0,n2(Xi,Wi)|Xi]I{E[m0,n2(Xi,Wi)|Xi]>γ6}]\displaystyle\hskip 30.00005pt+E\left[E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\gamma}{6}\right\}\right]
+E[E[Yi2(1)|Xi]I{E[Yi2(1)|Xi]>γ24}]\displaystyle\hskip 30.00005pt+E\left[E[Y_{i}^{2}(1)|X_{i}]I\left\{E[Y_{i}^{2}(1)|X_{i}]>\frac{\gamma}{24}\right\}\right]
+E[E[m1,n2(Xi,Wi)|Xi]I{E[m1,n2(Xi,Wi)|Xi]>γ24}]\displaystyle\hskip 30.00005pt+E\left[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]>\frac{\gamma}{24}\right\}\right]
+E[E[m0,n2(Xi,Wi)|Xi]I{E[m0,n2(Xi,Wi)|Xi]>γ24}]\displaystyle\hskip 30.00005pt+E\left[E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]>\sqrt{\frac{\gamma}{24}}\right\}\right]
+E[E[m1,n2(Xi,Wi)|Xi]I{E[m1,n2(Xi,Wi)|Xi]>γ12}]\displaystyle\hskip 30.00005pt+E\left[E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{1,n}^{2}(X_{i},W_{i})|X_{i}]>\sqrt{\frac{\gamma}{12}}\right\}\right]
+E[E[m0,n2(Xi,Wi)|Xi]I{E[m0,n2(Xi,Wi)|Xi]>γ12}],\displaystyle\hskip 30.00005pt+E\left[E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]I\left\{E[m_{0,n}^{2}(X_{i},W_{i})|X_{i}]>\sqrt{\frac{\gamma}{12}}\right\}\right]~{},

where the first inequality follows from (62), the second one follows from the conditional Jensen’s inequality and (63), and the third one follows again from the conditional Jensen’s inequality. It then follows from Lemma B.1 together with Assumptions 2.1(b) and 3.1(b) that

limγlim supnE[E[ϕ1,n,i|Xi]2I{E[ϕ1,n,i|Xi]2>γ4}]=0.\lim_{\gamma\to\infty}\limsup_{n\to\infty}E\left[E[\phi_{1,n,i}|X_{i}]^{2}I\left\{E[\phi_{1,n,i}|X_{i}]^{2}>\frac{\gamma}{4}\right\}\right]=0~{}.

Similar arguments lead to

limγlim supnE[ϕ1,n,i2I{ϕ1,n,i2>γ4}]=0.\lim_{\gamma\to\infty}\limsup_{n\to\infty}E\left[\phi_{1,n,i}^{2}I\left\{\phi_{1,n,i}^{2}>\frac{\gamma}{4}\right\}\right]=0~{}.

The conclusion then follows.  

Lemma B.4.

Suppose Assumptions in Theorem 5.1 hold. Then,

P{|1ni[2n]I{Di=d}ϵn,i(d)E[ϵn,i(d)]|log(2npn)n}1P\left\{\left|\frac{1}{n}\sum_{i\in[2n]}I\{D_{i}=d\}\epsilon_{n,i}(d)-E[\epsilon_{n,i}(d)]\right|\leq\sqrt{\frac{\log(2np_{n})}{n}}\right\}\rightarrow 1

and

P{Ωn1(d)1n1i2nI{Di=d}(ψn,iϵn,i(d)E[ψn,iϵn,i(d)])6σ¯σ¯log(2npn)n}1.P\left\{\left\|\Omega_{n}^{-1}(d)\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\psi_{n,i}\epsilon_{n,i}(d)-E[\psi_{n,i}\epsilon_{n,i}(d)])\right\|_{\infty}\leq\frac{6\bar{\sigma}}{\underaccent{\bar}{\sigma}}\sqrt{\frac{\log(2np_{n})}{n}}\right\}\to 1~{}.
Proof.

For the first result, we note that

|1ni[2n]I{Di=d}ϵn,i(d)E[ϵn,i(d)]|\displaystyle\left|\frac{1}{n}\sum_{i\in[2n]}I\{D_{i}=d\}\epsilon_{n,i}(d)-E[\epsilon_{n,i}(d)]\right| |1ni[2n]I{Di=d}(ϵn,i(d)E[ϵn,i(d)|Xi])|\displaystyle\leq\left|\frac{1}{n}\sum_{i\in[2n]}I\{D_{i}=d\}(\epsilon_{n,i}(d)-E[\epsilon_{n,i}(d)|X_{i}])\right|
+|1ni[2n](I{Di=d}1/2)(E[ϵn,i(d)|Xi]E[ϵn,i(d)])|\displaystyle+\left|\frac{1}{n}\sum_{i\in[2n]}(I\{D_{i}=d\}-1/2)(E[\epsilon_{n,i}(d)|X_{i}]-E[\epsilon_{n,i}(d)])\right|
+|12ni[2n](E[ϵn,i(d)|Xi]E[ϵn,i(d)])|.\displaystyle+\left|\frac{1}{2n}\sum_{i\in[2n]}(E[\epsilon_{n,i}(d)|X_{i}]-E[\epsilon_{n,i}(d)])\right|.

The first term on the RHS of the above display is O_{P}(1/\sqrt{n}) by Chebyshev's inequality conditional on (X^{(n)},D^{(n)}), because its summands are then independent with mean zero; the second term is O_{P}(1/\sqrt{n}) by Chebyshev's inequality conditional on X^{(n)}, using the pair structure of the design. The last term on the RHS is also O_{P}(1/\sqrt{n}) by Chebyshev's inequality. Because \sqrt{\log(2np_{n})/n} is of larger order than 1/\sqrt{n}, this implies the desired result.
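As a minimal sketch of the bound for the first term (assuming, for the purpose of this sketch only, that \operatorname*{Var}[\epsilon_{n,i}(d)|X_{i}]\leq c for a fixed constant c, and using that D^{(n)} is independent of the \epsilon_{n,i}(d) conditional on X^{(n)}),

\displaystyle\operatorname*{Var}\left[\frac{1}{n}\sum_{i\in[2n]}I\{D_{i}=d\}(\epsilon_{n,i}(d)-E[\epsilon_{n,i}(d)|X_{i}])\bigg{|}X^{(n)},D^{(n)}\right]=\frac{1}{n^{2}}\sum_{i\in[2n]}I\{D_{i}=d\}\operatorname*{Var}[\epsilon_{n,i}(d)|X_{i}]\leq\frac{c}{n}~{},

so that Chebyshev's inequality conditional on (X^{(n)},D^{(n)}) delivers the stated O_{P}(1/\sqrt{n}) rate.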

For the second result, define

\displaystyle\mathcal{E}_{n,0}(d)=\left\{\max_{d\in\{0,1\}}\frac{1}{2n}\sum_{1\leq i\leq 2n}E[\epsilon_{n,i}^{4}(d)|X_{i}]\leq c_{0}<\infty~{},\ \min_{1\leq l\leq p_{n}}\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}\operatorname*{Var}[\psi_{n,i,l}\epsilon_{n,i}(d)|X_{i}]\geq\underaccent{\bar}{\sigma}^{2}>0\right\}~{},
n,1(d)={1n1i2nI{Di=d}(ψn,iϵn,i(d)E[ψn,iϵn,i(d)|Xi])2.04σ¯log(2npn)/n},\displaystyle\mathcal{E}_{n,1}(d)=\left\{\left\|\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\psi_{n,i}\epsilon_{n,i}(d)-E[\psi_{n,i}\epsilon_{n,i}(d)|X_{i}])\right\|_{\infty}\leq 2.04\overline{\sigma}\sqrt{\log(2np_{n})/n}\right\}~{},
n,2(d)={1n1i2nI{Di=d}(E[ψn,iϵn,i(d)|Xi]E[ψn,iϵn,i(d)])3.96σ¯log(2npn)/n},\displaystyle\mathcal{E}_{n,2}(d)=\left\{\left\|\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(E[\psi_{n,i}\epsilon_{n,i}(d)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(d)])\right\|_{\infty}\leq 3.96\overline{\sigma}\sqrt{\log(2np_{n})/n}\right\}~{},
n,3(d)={max1lpn|12n1i2n(E[ψn,i,l2ϵn,i2(d)|Xi]E[ψn,i,l2ϵn,i2(d)])|1/20.01σ¯},\displaystyle\mathcal{E}_{n,3}(d)=\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}(E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]-E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)])\right|^{1/2}\leq 0.01\overline{\sigma}\right\}~{},

and

n,4(d)={max1lpn|12n1i2n(2I{Di=d}1)(E[ϵn,i2(d)ψn,i,l2|Xi]E[ϵn,i2(d)ψn,i,l2])|1/20.01σ¯}.\displaystyle\mathcal{E}_{n,4}(d)=\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}(2I\{D_{i}=d\}-1)(E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]-E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}])\right|^{1/2}\leq 0.01\overline{\sigma}\right\}~{}.

We aim to show that P\{\mathcal{E}_{n,1}(d)\}\to 1 and P\{\mathcal{E}_{n,2}(d)\}\to 1. Granting these two results for the moment and letting C=6\bar{\sigma}/\underaccent{\bar}{\sigma}, we have

P{n(d)}\displaystyle P\{\mathcal{E}_{n}(d)\} =1P{nc(d)}\displaystyle=1-P\{\mathcal{E}_{n}^{c}(d)\}
1P{1n1i2nI{Di=d}(ψn,iϵn,i(d)E[ψn,iϵn,i(d)])Cσ¯log(2npn)n}\displaystyle\geq 1-P\left\{\left\|\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\psi_{n,i}\epsilon_{n,i}(d)-E[\psi_{n,i}\epsilon_{n,i}(d)])\right\|_{\infty}\geq C\underaccent{\bar}{\sigma}\sqrt{\frac{\log(2np_{n})}{n}}\right\}
=1P{1n1i2nI{Di=d}(ψn,iϵn,i(d)E[ψn,iϵn,i(d)])6σ¯log(2npn)n}\displaystyle=1-P\left\{\left\|\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\psi_{n,i}\epsilon_{n,i}(d)-E[\psi_{n,i}\epsilon_{n,i}(d)])\right\|_{\infty}\geq 6\bar{\sigma}\sqrt{\frac{\log(2np_{n})}{n}}\right\}
1P{n,1c(d)}P{n,2c(d)}1.\displaystyle\geq 1-P\{\mathcal{E}_{n,1}^{c}(d)\}-P\{\mathcal{E}_{n,2}^{c}(d)\}\to 1~{}.

First, we show P{n,3(d)}1P\{\mathcal{E}_{n,3}(d)\}\to 1. Let

tn=Clog(npn)Ξn2n0t_{n}=C\sqrt{\frac{\log(np_{n})\Xi_{n}^{2}}{n}}\to 0

for some sufficiently large constant C>0, and let \{e_{i}\}_{1\leq i\leq 2n} be a sequence of i.i.d. Rademacher random variables independent of everything else. Then, for any fixed t>0, we have

(14max1lpnVar[E[ψn,i,l2ϵn,i2(d)|Xi]]2nt2)P{max1lpn|12n1i2n[E[ψn,i,l2ϵn,i2(d)|Xi]E[ψn,i,l2ϵn,i2(d)]]|t}\displaystyle\left(1-\frac{4\max_{1\leq l\leq p_{n}}\operatorname*{Var}[E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]]}{2nt^{2}}\right)P\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}\left[E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]-E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)]\right]\right|\geq t\right\}
2P{max1lpn|12n1i2n4eiE[ψn,i,l2ϵn,i2(d)|Xi]|t}\displaystyle\leq 2P\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}4e_{i}E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]\right|\geq t\right\}
=o(1)+2E[P{max1lpn|12n1i2n4eiE[ψn,i,l2ϵn,i2(d)|Xi]|t|X(n)}I{n,0(d)}]\displaystyle=o(1)+2E\left[P\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}4e_{i}E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]\right|\geq t\bigg{|}X^{(n)}\right\}I\{\mathcal{E}_{n,0}(d)\}\right]
o(1)+pnexp(nt2Ξn2c)=o(1),\displaystyle\lesssim o(1)+p_{n}\exp\left(-\frac{nt^{2}}{\Xi_{n}^{2}c}\right)=o(1),

where the first inequality is by van der Vaart and Wellner (1996, Lemma 2.3.7), and the second inequality is by Hoeffding's inequality conditional on X^{(n)} and the fact that, on \mathcal{E}_{n,0}(d),

12n1i2n(E[ψn,i,l2ϵn,i2(d)|Xi])2\displaystyle\frac{1}{2n}\sum_{1\leq i\leq 2n}(E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}])^{2} 12n1i2nE[ψn,i,l4|Xi]E[ϵn,i4(d)|Xi]\displaystyle\leq\frac{1}{2n}\sum_{1\leq i\leq 2n}E[\psi_{n,i,l}^{4}|X_{i}]E[\epsilon_{n,i}^{4}(d)|X_{i}]
Ξn22n1i2nE[ψn,i,l2|Xi]E[ϵn,i4(d)|Xi]Ξn2Cc0,\displaystyle\leq\frac{\Xi_{n}^{2}}{2n}\sum_{1\leq i\leq 2n}E[\psi_{n,i,l}^{2}|X_{i}]E[\epsilon_{n,i}^{4}(d)|X_{i}]\leq\Xi_{n}^{2}Cc_{0},

where CC is a fixed constant, and the last equality is by the fact that log(pn)Ξn2=o(n)\log(p_{n})\Xi_{n}^{2}=o(n). Furthermore, we note that

4max1lpnVar[E[ψn,i,l2ϵn,i2(d)|Xi]]2n\displaystyle\frac{4\max_{1\leq l\leq p_{n}}\operatorname*{Var}[E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]]}{2n}
max1lpnE[E[ψn,i,l4|Xi]E[ϵn,i4(d)|Xi]]n\displaystyle\lesssim\frac{\max_{1\leq l\leq p_{n}}E\left[E[\psi_{n,i,l}^{4}|X_{i}]E[\epsilon_{n,i}^{4}(d)|X_{i}]\right]}{n}
Ξn2max1lpnE[E[ψn,i,l2|Xi]E[ϵn,i4(d)|Xi]]n\displaystyle\lesssim\frac{\Xi_{n}^{2}\max_{1\leq l\leq p_{n}}E\left[E[\psi_{n,i,l}^{2}|X_{i}]E[\epsilon_{n,i}^{4}(d)|X_{i}]\right]}{n}
Ξn2E[ϵn,i4(d)]n\displaystyle\lesssim\frac{\Xi_{n}^{2}E[\epsilon_{n,i}^{4}(d)]}{n}
=o(1).\displaystyle=o(1).

Therefore, we have

P{max1lpn|12n1i2n[E[ψn,i,l2ϵn,i2(d)|Xi]E[ψn,i,l2ϵn,i2(d)]]|t}=o(1)\displaystyle P\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}\left[E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]-E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)]\right]\right|\geq t\right\}=o(1)

for any fixed t>0t>0, which is the desired result.
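In particular, taking the fixed value t=(0.01\overline{\sigma})^{2}, which is the threshold appearing in the definition of \mathcal{E}_{n,3}(d), the display above gives

\displaystyle P\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}\left[E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}]-E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)]\right]\right|>(0.01\overline{\sigma})^{2}\right\}=o(1)~{},

which is exactly P\{\mathcal{E}_{n,3}(d)\}\to 1.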

Next, we show P{n,4(d)}1P\{\mathcal{E}_{n,4}(d)\}\to 1. Define an,i,l=E[ϵn,i2(d)ψn,i,l2|Xi]E[ϵn,i2(d)ψn,i,l2]a_{n,i,l}=E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]-E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}]. Then, we have

P{max1lpn|12n1i2n(2I{Di=d}1)(E[ϵn,i2(d)ψn,i,l2|Xi]E[ϵn,i2(d)ψn,i,l2])|>t|X(n)}I{n,0(d)}\displaystyle P\left\{\max_{1\leq l\leq p_{n}}\left|\frac{1}{2n}\sum_{1\leq i\leq 2n}(2I\{D_{i}=d\}-1)(E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]-E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}])\right|>t\bigg{|}X^{(n)}\right\}I\{\mathcal{E}_{n,0}(d)\}
1lpnP{|12n1jn(I{Dπ(2j1)=d}I{Dπ(2j)=d})(an,π(2j1),lan,π(2j),l)|>t|X(n)}I{n,0(d)}\displaystyle\leq\sum_{1\leq l\leq p_{n}}P\left\{\left|\frac{1}{2n}\sum_{1\leq j\leq n}(I\{D_{\pi(2j-1)}=d\}-I\{D_{\pi(2j)}=d\})(a_{n,\pi(2j-1),l}-a_{n,\pi(2j),l})\right|>t\bigg{|}X^{(n)}\right\}I\{\mathcal{E}_{n,0}(d)\}
1lpnexp(2nt21n1jn(an,π(2j1),lan,π(2j),l)2)I{n,0(d)}\displaystyle\leq\sum_{1\leq l\leq p_{n}}\exp\left(-\frac{2nt^{2}}{\frac{1}{n}\sum_{1\leq j\leq n}(a_{n,\pi(2j-1),l}-a_{n,\pi(2j),l})^{2}}\right)I\{\mathcal{E}_{n,0}(d)\}
exp(log(pn)2nt2Ξn2c2),\displaystyle\leq\exp\left(\log(p_{n})-\frac{2nt^{2}}{\Xi_{n}^{2}c^{2}}\right)~{},

where, conditional on X^{(n)}, \{I\{D_{\pi(2j-1)}=d\}-I\{D_{\pi(2j)}=d\}\}_{1\leq j\leq n} is a sequence of i.i.d. Rademacher random variables, the second-to-last inequality is by Hoeffding's inequality, and the last inequality follows from the fact that, on \mathcal{E}_{n,0}(d),

(1n1jn(an,π(2j1),lan,π(2j),l)2)1/2\displaystyle\left(\frac{1}{n}\sum_{1\leq j\leq n}(a_{n,\pi(2j-1),l}-a_{n,\pi(2j),l})^{2}\right)^{1/2}
(1n1jn(E[ψn,π(2j1),l2ϵn,i2(d)|Xπ(2j1)])2)1/2+(1n1jn(E[ψn,π(2j),l2ϵn,i2(d)|Xπ(2j)])2)1/2\displaystyle\leq\left(\frac{1}{n}\sum_{1\leq j\leq n}(E[\psi_{n,\pi(2j-1),l}^{2}\epsilon_{n,i}^{2}(d)|X_{\pi(2j-1)}])^{2}\right)^{1/2}+\left(\frac{1}{n}\sum_{1\leq j\leq n}(E[\psi_{n,\pi(2j),l}^{2}\epsilon_{n,i}^{2}(d)|X_{\pi(2j)}])^{2}\right)^{1/2}
(2n1i2n(E[ψn,i,l2ϵn,i2(d)|Xi])2)1/2\displaystyle\leq\left(\frac{2}{n}\sum_{1\leq i\leq 2n}(E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(d)|X_{i}])^{2}\right)^{1/2}
Ξnc.\displaystyle\leq\Xi_{n}c~{}.

By letting t=Clog(pn)Ξn2nt=C\sqrt{\frac{\log(p_{n})\Xi_{n}^{2}}{n}} for some sufficiently large CC and noting that P{n,0(d)}1P\{\mathcal{E}_{n,0}(d)\}\rightarrow 1, we have

max1lpn|1i2n(2I{Di=d}1)(E[ϵn,i2(d)ψn,i,l2|Xi]E[ϵn,i2(d)ψn,i,l2])2n|=Op(log(pn)Ξn2n),\displaystyle\max_{1\leq l\leq p_{n}}\left|\sum_{1\leq i\leq 2n}\frac{(2I\{D_{i}=d\}-1)(E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]-E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}])}{2n}\right|=O_{p}\left(\sqrt{\frac{\log(p_{n})\Xi_{n}^{2}}{n}}\right)~{},

and thus, P{n,4(d)}1P\{\mathcal{E}_{n,4}(d)\}\to 1.
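As a small check of this step (assuming, as we may for this sketch, that C is chosen with 2C^{2}>c^{2} and that p_{n}\to\infty), substituting t=C\sqrt{\log(p_{n})\Xi_{n}^{2}/n} into the exponential bound above gives

\displaystyle\exp\left(\log(p_{n})-\frac{2nt^{2}}{\Xi_{n}^{2}c^{2}}\right)=\exp\left(\log(p_{n})\left(1-\frac{2C^{2}}{c^{2}}\right)\right)=p_{n}^{1-2C^{2}/c^{2}}\to 0~{},

and since \log(p_{n})\Xi_{n}^{2}=o(n), the O_{p}(\sqrt{\log(p_{n})\Xi_{n}^{2}/n}) bound in the preceding display is eventually below the fixed threshold (0.01\overline{\sigma})^{2} appearing in the definition of \mathcal{E}_{n,4}(d), with probability approaching one.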

Next, we show P{n,1(d)}1P\{\mathcal{E}_{n,1}(d)\}\to 1. We note that, for d{0,1}d\in\{0,1\}, conditional on (D(n),X(n))(D^{(n)},X^{(n)}), {ψn,iϵn,i(d)}1i2n\{\psi_{n,i}\epsilon_{n,i}(d)\}_{1\leq i\leq 2n} are independent. In what follows, we couple

𝕌n=1n1i2nI{Di=d}(ψn,iϵn,i(d)E[ψn,iϵn,i(d)|Xi])\mathbb{U}_{n}=\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(\psi_{n,i}\epsilon_{n,i}(d)-E[\psi_{n,i}\epsilon_{n,i}(d)|X_{i}])

with a centered Gaussian random vector as in Theorem 2.1 in Chernozhukov et al. (2017). Specifically, let Z=(Z_{1},\ldots,Z_{p_{n}}) be a Gaussian random vector in R^{p_{n}} with E[Z_{l}]=0 for 1\leq l\leq p_{n} and \operatorname*{Var}[Z]=\operatorname*{Var}[\mathbb{U}_{n}|X^{(n)},D^{(n)}], which satisfies the conditions of that theorem. In particular, on \mathcal{E}_{n,0}(d)\cap\mathcal{E}_{n,3}(d)\cap\mathcal{E}_{n,4}(d),

E[ZZ]\displaystyle E[ZZ^{\prime}] =1n21i2nI{Di=d}E[ϵn,i2(d)ψn,iψn,i|Xi]\displaystyle=\frac{1}{n^{2}}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}E[\epsilon_{n,i}^{2}(d)\psi_{n,i}\psi_{n,i}^{\prime}|X_{i}]
1n(1n1i2nI{Di=d}E[ϵn,i(d)ψn,i|Xi])(1n1i2nI{Di=d}E[ϵn,i(d)ψn,i|Xi])\displaystyle-\frac{1}{n}\left(\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}E[\epsilon_{n,i}(d)\psi_{n,i}|X_{i}]\right)\left(\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}E[\epsilon_{n,i}(d)\psi_{n,i}|X_{i}]\right)^{\prime}

and

max1lpnE[Zl2]\displaystyle\max_{1\leq l\leq p_{n}}E[Z_{l}^{2}] max1lpn1i2nI{Di=d}E[ϵn,i2(d)ψn,i,l2|Xi]n2\displaystyle\leq\frac{\max_{1\leq l\leq p_{n}}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]}{n^{2}}
σ¯2n+max1lpn1i2nI{Di=d}(E[ϵn,i2(d)ψn,i,l2|Xi]E[ϵn,i2(d)ψn,i,l2])n2\displaystyle\leq\frac{\overline{\sigma}^{2}}{n}+\frac{\max_{1\leq l\leq p_{n}}\sum_{1\leq i\leq 2n}I\{D_{i}=d\}(E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]-E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}])}{n^{2}}
σ¯2n+max1lpn1i2n(2I{Di=d}1)(E[ϵn,i2(d)ψn,i,l2|Xi]E[ϵn,i2(d)ψn,i,l2])2n2\displaystyle\leq\frac{\overline{\sigma}^{2}}{n}+\frac{\max_{1\leq l\leq p_{n}}\sum_{1\leq i\leq 2n}(2I\{D_{i}=d\}-1)(E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]-E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}])}{2n^{2}}
+max1lpn1i2n(E[ϵn,i2(d)ψn,i,l2|Xi]E[ϵn,i2(d)ψn,i,l2])2n2\displaystyle+\frac{\max_{1\leq l\leq p_{n}}\sum_{1\leq i\leq 2n}(E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}|X_{i}]-E[\epsilon_{n,i}^{2}(d)\psi_{n,i,l}^{2}])}{2n^{2}}
1.02σ¯2n.\displaystyle\leq\frac{1.02\overline{\sigma}^{2}}{n}~{}.

Further define q(1α)q(1-\alpha) as the (1α)(1-\alpha) quantile of ||Z||||Z||_{\infty}. Then, we have

q(11/n)1.02σ¯(2log(2pn)+2log(n))n2.04σ¯log(2npn)/n,\displaystyle q(1-1/n)\leq\frac{1.02\overline{\sigma}(\sqrt{2\log(2p_{n})}+\sqrt{2\log(n)})}{\sqrt{n}}\leq 2.04\overline{\sigma}\sqrt{\log(2np_{n})/n},

where the first inequality is by the last display in the proof of Lemma E.2 in Chetverikov and Sørensen (2022) and the second inequality is by the fact that a+b2(a+b)\sqrt{a}+\sqrt{b}\leq\sqrt{2(a+b)} for a,b>0a,b>0. Therefore, we have

\displaystyle P\{\mathcal{E}_{n,1}^{c}(d)\}\leq P\{\mathcal{E}_{n,1}^{c}(d),\mathcal{E}_{n,0}(d),\mathcal{E}_{n,3}(d),\mathcal{E}_{n,4}(d)\}+o(1)
\displaystyle=E\left[P\{\mathcal{E}_{n,1}^{c}(d)|D^{(n)},X^{(n)}\}I\{\mathcal{E}_{n,0}(d),\mathcal{E}_{n,3}(d),\mathcal{E}_{n,4}(d)\}\right]+o(1)
\displaystyle\leq E\left[P\{||Z||_{\infty}\geq 2.04\overline{\sigma}\sqrt{\log(2np_{n})/n}|D^{(n)},X^{(n)}\}I\{\mathcal{E}_{n,0}(d),\mathcal{E}_{n,3}(d),\mathcal{E}_{n,4}(d)\}\right]+o(1)
\displaystyle\leq E\left[P\{||Z||_{\infty}\geq q(1-1/n)|D^{(n)},X^{(n)}\}\right]+o(1)=o(1)~{},

where the second inequality is by Theorem 2.1 in Chernozhukov et al. (2017).
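For completeness, the elementary arithmetic behind the second inequality in the display bounding q(1-1/n) is the following; it uses only \sqrt{a}+\sqrt{b}\leq\sqrt{2(a+b)} and \log(2p_{n})+\log(n)=\log(2np_{n}):

\displaystyle\sqrt{2\log(2p_{n})}+\sqrt{2\log(n)}\leq\sqrt{2\left(2\log(2p_{n})+2\log(n)\right)}=2\sqrt{\log(2p_{n})+\log(n)}=2\sqrt{\log(2np_{n})}~{},

so that 1.02\overline{\sigma}(\sqrt{2\log(2p_{n})}+\sqrt{2\log(n)})/\sqrt{n}\leq 2.04\overline{\sigma}\sqrt{\log(2np_{n})/n}.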

Finally, we turn to n,2(d)\mathcal{E}_{n,2}(d) with d=1d=1. We have

1n1i2nI{Di=1}(E[ψn,iϵn,i(1)|Xi]E[ψn,iϵn,i(1)])\displaystyle\frac{1}{n}\sum_{1\leq i\leq 2n}I\{D_{i}=1\}(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)])
=12n1i2n(E[ψn,iϵn,i(1)|Xi]E[ψn,iϵn,i(1)])+12n1i2n(2Di1)(E[ψn,iϵn,i(1)|Xi]E[ψn,iϵn,i(1)]).\displaystyle=\frac{1}{2n}\sum_{1\leq i\leq 2n}(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)])+\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}-1)(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)]). (65)

Note that \{E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)]\}_{1\leq i\leq 2n} is a sequence of independent, centered random vectors and

max1lpnE[(E[ψn,i,lϵn,i(1)|Xi]E[ψn,i,lϵn,i(1)])2]σ¯2.\displaystyle\max_{1\leq l\leq p_{n}}E[(E[\psi_{n,i,l}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i,l}\epsilon_{n,i}(1)])^{2}]\leq\overline{\sigma}^{2}.

Following Theorem 2.1 in Chernozhukov et al. (2017), Lemma E.2 in Chetverikov and Sørensen (2022), and similar arguments to the ones above, we have

\displaystyle P\left\{\left\|\frac{1}{2n}\sum_{1\leq i\leq 2n}(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)])\right\|_{\infty}\leq\overline{\sigma}\sqrt{2\log(2np_{n})/n}\right\}\to 1~{}. (66)

For the second term on the RHS of (65), we define gn,i,l=E[ψn,i,lϵn,i(1)|Xi]E[ψn,i,lϵn,i(1)]g_{n,i,l}=E[\psi_{n,i,l}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i,l}\epsilon_{n,i}(1)]. We have

\displaystyle P\left\{\left\|\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}-1)(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)])\right\|_{\infty}>t\bigg{|}X^{(n)}\right\}
1lpnP{|12n1jn(Dπ(2j1)Dπ(2j))(gn,π(2j1),lgn,π(2j),l)|>t|X(n)}\displaystyle\leq\sum_{1\leq l\leq p_{n}}P\left\{\left|\frac{1}{2n}\sum_{1\leq j\leq n}(D_{\pi(2j-1)}-D_{\pi(2j)})(g_{n,\pi(2j-1),l}-g_{n,\pi(2j),l})\right|>t\bigg{|}X^{(n)}\right\}
1lpnexp(2nt21n1jn(gn,π(2j1),lgn,π(2j),l)2),\displaystyle\leq\sum_{1\leq l\leq p_{n}}\exp\left(-\frac{2nt^{2}}{\frac{1}{n}\sum_{1\leq j\leq n}(g_{n,\pi(2j-1),l}-g_{n,\pi(2j),l})^{2}}\right)~{},

where, conditional on X(n)X^{(n)}, {(Dπ(2j1)Dπ(2j))}1jn\{(D_{\pi(2j-1)}-D_{\pi(2j)})\}_{1\leq j\leq n} is a sequence of i.i.d. Rademacher random variables and the last inequality is by Hoeffding’s inequality. In addition, on n,3(1)\mathcal{E}_{n,3}(1), we have

(1n1jn(gn,π(2j1),lgn,π(2j),l)2)1/2\displaystyle\left(\frac{1}{n}\sum_{1\leq j\leq n}(g_{n,\pi(2j-1),l}-g_{n,\pi(2j),l})^{2}\right)^{1/2}
(1n1jn(E[ψn,π(2j1),lϵn,i(1)|Xπ(2j1)])2)1/2+(1n1jn(E[ψn,π(2j),lϵn,i(1)|Xπ(2j)])2)1/2\displaystyle\leq\left(\frac{1}{n}\sum_{1\leq j\leq n}(E[\psi_{n,\pi(2j-1),l}\epsilon_{n,i}(1)|X_{\pi(2j-1)}])^{2}\right)^{1/2}+\left(\frac{1}{n}\sum_{1\leq j\leq n}(E[\psi_{n,\pi(2j),l}\epsilon_{n,i}(1)|X_{\pi(2j)}])^{2}\right)^{1/2}
(2n1i2n(E[ψn,i,lϵn,i(1)|Xi])2)1/2\displaystyle\leq\left(\frac{2}{n}\sum_{1\leq i\leq 2n}(E[\psi_{n,i,l}\epsilon_{n,i}(1)|X_{i}])^{2}\right)^{1/2}
(2n1i2nE[ψn,i,l2ϵn,i2(1)|Xi])1/2\displaystyle\leq\left(\frac{2}{n}\sum_{1\leq i\leq 2n}E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(1)|X_{i}]\right)^{1/2}
(2n1i2n[E[ψn,i,l2ϵn,i2(1)|Xi]E[ψn,i,l2ϵn,i2(1)]])1/2+2σ¯\displaystyle\leq\left(\frac{2}{n}\sum_{1\leq i\leq 2n}\left[E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(1)|X_{i}]-E[\psi_{n,i,l}^{2}\epsilon_{n,i}^{2}(1)]\right]\right)^{1/2}+2\overline{\sigma}
2.02σ¯.\displaystyle\leq 2.02\overline{\sigma}.

Therefore, we have

\displaystyle P\left\{\left\|\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}-1)(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)])\right\|_{\infty}>2.02\sqrt{\frac{\log(np_{n})\overline{\sigma}^{2}}{n}}\right\}
\displaystyle\leq P\left\{\left\|\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}-1)(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)])\right\|_{\infty}>2.02\sqrt{\frac{\log(np_{n})\overline{\sigma}^{2}}{n}},\mathcal{E}_{n,3}(1)\right\}+o(1)
\displaystyle\leq E\left[P\left\{\left\|\frac{1}{2n}\sum_{1\leq i\leq 2n}(2D_{i}-1)(E[\psi_{n,i}\epsilon_{n,i}(1)|X_{i}]-E[\psi_{n,i}\epsilon_{n,i}(1)])\right\|_{\infty}>2.02\sqrt{\frac{\log(np_{n})\overline{\sigma}^{2}}{n}}\bigg{|}X^{(n)}\right\}I\{\mathcal{E}_{n,3}(1)\}\right]+o(1)
\displaystyle=o(1)~{}. (67)

Combining (65), (66), (67), and the fact that \sqrt{2}+2.02\leq 1.42+2.02=3.44\leq 3.96, we have P\{\mathcal{E}_{n,2}(1)\}\to 1. The same argument applies to \mathcal{E}_{n,2}(0).

Appendix C Details for Simulations

The regressors in the LASSO-based adjustment are as follows; a short code sketch illustrating the construction for Models 1-6 is given after the list.

  1. (i)

    For Models 1-6, we use {1,Xi,Wi,Xi2,Wi2,XiWi,(XiX~)I{Xi>X~},(WiW~)I{Wi>W~},(XiX~)2I{Xi>X~},(WiW~)2I{Wi>W~}}\{1,X_{i},W_{i},X_{i}^{2},W_{i}^{2},X_{i}W_{i},(X_{i}-\tilde{X})I\{X_{i}>\tilde{X}\},(W_{i}-\tilde{W})I\{W_{i}>\tilde{W}\},(X_{i}-\tilde{X})^{2}I\{X_{i}>\tilde{X}\},(W_{i}-\tilde{W})^{2}I\{W_{i}>\tilde{W}\}\} where X~\tilde{X} and W~\tilde{W} are the sample medians of {Xi}i[2n]\{X_{i}\}_{i\in[2n]} and {Wi}i[2n]\{W_{i}\}_{i\in[2n]}, respectively.

  2. (ii)

    For Models 7-9, we use \{1,X_{i},W_{i},X_{i}^{2},W_{i}^{2},X_{i1}W_{i1},X_{i2}W_{i1},X_{i1}W_{i2},X_{i2}W_{i2},(X_{ij}-\tilde{X_{j}})I\{X_{ij}>\tilde{X_{j}}\},(X_{ij}-\tilde{X_{j}})^{2}I\{X_{ij}>\tilde{X_{j}}\},(W_{ij}-\tilde{W_{j}})I\{W_{ij}>\tilde{W_{j}}\},(W_{ij}-\tilde{W_{j}})^{2}I\{W_{ij}>\tilde{W_{j}}\}\}, where \tilde{X_{j}} and \tilde{W_{j}}, for j=1,2, are the sample medians of \{X_{ij}\}_{i\in[2n]} and \{W_{ij}\}_{i\in[2n]}, respectively.

  3. (iii)

    For Models 10-11, we use {1,Xi,Wi,Xi2,Wi2,Xi1Wi1,Xi2Wi2,Xi3Wi1,Xi4Wi2,(XijXj~)I{Xij>Xj~},(XijXj~)2I{Xij>Xj~},(WijWj~)I{Wij>Wj~},(WijWj~)2I{Wij>Wj~}}\{1,X_{i},W_{i},X_{i}^{2},W_{i}^{2},X_{i1}W_{i1},X_{i2}W_{i2},X_{i3}W_{i1},X_{i4}W_{i2},(X_{ij}-\tilde{X_{j}})I\{X_{ij}>\tilde{X_{j}}\},(X_{ij}-\tilde{X_{j}})^{2}I\{X_{ij}>\tilde{X_{j}}\},(W_{ij}-\tilde{W_{j}})I\{W_{ij}>\tilde{W_{j}}\},(W_{ij}-\tilde{W_{j}})^{2}I\{W_{ij}>\tilde{W_{j}}\}\} where Xj~\tilde{X_{j}},for j=1,2,3,4j=1,2,3,4, and Wj~\tilde{W_{j}}, for j=1,2j=1,2, are the sample medians of {Xij}i[2n]\{X_{ij}\}_{i\in[2n]} and {Wij}i[2n]\{W_{ij}\}_{i\in[2n]}, respectively.

  4. (iv)

    Models 12-15 already contain high-dimensional covariates. We just use XiX_{i} and WiW_{i} as the LASSO regressors.
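To make the construction above concrete, the following is a minimal Python sketch (using numpy) of how the Model 1-6 regressor set could be assembled from the covariate vectors \{X_{i}\}_{i\in[2n]} and \{W_{i}\}_{i\in[2n]}; the function name build_lasso_regressors is purely illustrative and is not taken from any replication code.

import numpy as np

def build_lasso_regressors(x, w):
    # Regressors for Models 1-6:
    # {1, X, W, X^2, W^2, XW, (X - X~)I{X > X~}, (W - W~)I{W > W~},
    #  (X - X~)^2 I{X > X~}, (W - W~)^2 I{W > W~}},
    # where X~ and W~ are the sample medians of X and W.
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    x_med, w_med = np.median(x), np.median(w)
    hinge_x = (x - x_med) * (x > x_med)   # (X - X~) I{X > X~}
    hinge_w = (w - w_med) * (w > w_med)   # (W - W~) I{W > W~}
    return np.column_stack([
        np.ones_like(x), x, w, x**2, w**2, x * w,
        hinge_x, hinge_w, hinge_x**2, hinge_w**2,
    ])

The resulting 2n-by-10 matrix would then play the role of the regressors \psi_{n,i} in the regularized adjustment; Models 7-11 extend the same idea coordinate by coordinate using the medians \tilde{X_{j}} and \tilde{W_{j}}.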

References

  • Abadie and Imbens (2008) Abadie, A. and Imbens, G. W. (2008). Estimation of the Conditional Variance in Paired Experiments. Annales d’Économie et de Statistique 175–187.
  • Armstrong (2022) Armstrong, T. B. (2022). Asymptotic Efficiency Bounds for a Class of Experimental Designs. ArXiv:2205.02726 [stat], URL http://arxiv.org/abs/2205.02726.
  • Bai et al. (2023a) Bai, Y., Liu, J., Shaikh, A. M. and Tabord-Meehan, M. (2023a). On the Efficiency of Finely Stratified Experiments. ArXiv:2307.15181 [econ, math, stat], URL http://arxiv.org/abs/2307.15181.
  • Bai et al. (2023b) Bai, Y., Liu, J. and Tabord-Meehan, M. (2023b). Inference for Matched Tuples and Fully Blocked Factorial Designs. ArXiv:2206.04157 [econ, math, stat], URL http://arxiv.org/abs/2206.04157.
  • Bai et al. (2022) Bai, Y., Romano, J. P. and Shaikh, A. M. (2022). Inference in Experiments With Matched Pairs. Journal of the American Statistical Association, 117 1726–1737. URL https://doi.org/10.1080/01621459.2021.1883437.
  • Belloni et al. (2017) Belloni, A., Chernozhukov, V., Fernández-Val, I. and Hansen, C. (2017). Program evaluation and causal inference with high-dimensional data. Econometrica, 85 233–298.
  • Bickel et al. (2009) Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37 1705–1732. URL https://projecteuclid.org/journals/annals-of-statistics/volume-37/issue-4/Simultaneous-analysis-of-Lasso-and-Dantzig-selector/10.1214/08-AOS620.full.
  • Bruhn and McKenzie (2009) Bruhn, M. and McKenzie, D. (2009). In pursuit of balance: Randomization in practice in development field experiments. American Economic Journal: Applied Economics, 1 200–232.
  • Chernozhukov et al. (2014) Chernozhukov, V., Chetverikov, D. and Kato, K. (2014). Gaussian approximation of suprema of empirical processes. The Annals of Statistics, 42 1564–1597.
  • Chernozhukov et al. (2017) Chernozhukov, V., Chetverikov, D. and Kato, K. (2017). Central limit theorems and bootstrap in high dimensions. The Annals of Probability, 45 2309–2352. URL https://projecteuclid.org/journals/annals-of-probability/volume-45/issue-4/Central-limit-theorems-and-bootstrap-in-high-dimensions/10.1214/16-AOP1113.full.
  • Chetverikov and Sørensen (2022) Chetverikov, D. and Sørensen, J. R.-V. (2022). Analytic and Bootstrap-after-Cross-Validation Methods for Selecting Penalty Parameters of High-Dimensional M-Estimators. ArXiv:2104.04716 [econ, math, stat], URL http://arxiv.org/abs/2104.04716.
  • Cohen and Fogarty (2023) Cohen, P. L. and Fogarty, C. B. (2023). No-harm calibration for generalized oaxaca-blinder estimators. Biometrika. Forthcoming.
  • Cytrynbaum (2023) Cytrynbaum, M. (2023). Covariate adjustment in stratified experiments.
  • Donner and Klar (2000) Donner, A. and Klar, N. (2000). Design and analysis of cluster randomization trials in health research, vol. 27. Arnold London.
  • Dudley (1989) Dudley, R. M. (1989). Real Analysis and Probability. Wadsworth and Brook/Cole.
  • Freedman (2008) Freedman, D. A. (2008). On regression adjustments to experimental data. Advances in Applied Mathematics, 40 180–193. URL http://www.sciencedirect.com/science/article/pii/S019688580700005X.
  • Glennerster and Takavarasha (2013) Glennerster, R. and Takavarasha, K. (2013). Running randomized evaluations: A practical guide. Princeton University Press.
  • Groh and McKenzie (2016) Groh, M. and McKenzie, D. (2016). Macroinsurance for microenterprises: A randomized experiment in post-revolution egypt. Journal of Development Economics, 118 13–25.
  • Jiang et al. (2022) Jiang, L., Liu, X., Phillips, P. C. and Zhang, Y. (2022). Bootstrap inference for quantile treatment effects in randomized experiments with matched pairs. Review of Economics and Statistics. Forthcoming.
  • Ledoux and Talagrand (1991) Ledoux, M. and Talagrand, M. (1991). Probability in Banach Spaces: Isoperimetry and Processes. Classics in Mathematics, Springer-Verlag, Berlin Heidelberg. URL https://www.springer.com/gp/book/9783642202117.
  • Lehmann and Romano (2005) Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses. 3rd ed. Springer, New York.
  • Lin (2013) Lin, W. (2013). Agnostic notes on regression adjustments to experimental data: Reexamining Freedman’s critique. Annals of Applied Statistics, 7 295–318. URL https://projecteuclid.org/euclid.aoas/1365527200.
  • Negi and Wooldridge (2021) Negi, A. and Wooldridge, J. M. (2021). Revisiting regression adjustment in experiments with heterogeneous treatment effects. Econometric Reviews, 40 504–534. URL https://doi.org/10.1080/07474938.2020.1824732.
  • Robins et al. (1995) Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1995). Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data. Journal of the American Statistical Association, 90 106–121. URL https://www.jstor.org/stable/2291134.
  • Rosenberger and Lachin (2015) Rosenberger, W. F. and Lachin, J. M. (2015). Randomization in clinical trials: Theory and Practice. John Wiley & Sons.
  • Tsiatis et al. (2008) Tsiatis, A. A., Davidian, M., Zhang, M. and Lu, X. (2008). Covariate adjustment for two-sample treatment comparisons in randomized clinical trials: a principled yet flexible approach. Statistics in Medicine, 27 4658–4677.
  • van der Vaart and Wellner (1996) van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes with Applications to Statistics. Springer-Verlag, New York.
  • Wager et al. (2016) Wager, S., Du, W., Taylor, J. and Tibshirani, R. J. (2016). High-dimensional regression adjustments in randomized experiments. Proceedings of the National Academy of Sciences, 113 12673–12678.
  • Wu and Gagnon-Bartsch (2021) Wu, E. and Gagnon-Bartsch, J. A. (2021). Design-based covariate adjustments in paired experiments. Journal of Educational and Behavioral Statistics, 46 109–132.
  • Yang and Tsiatis (2001) Yang, L. and Tsiatis, A. A. (2001). Efficiency Study of Estimators for a Treatment Effect in a Pretest–Posttest Trial. The American Statistician, 55 314–321. URL https://doi.org/10.1198/000313001753272466.
  • Zhao and Ding (2021) Zhao, A. and Ding, P. (2021). Covariate-adjusted Fisher randomization tests for the average treatment effect. Journal of Econometrics, 225 278–294. URL https://www.sciencedirect.com/science/article/pii/S0304407621001457.